Add anomaly detection tests
After you install the dbt package, you can add Elementary data anomaly detection tests.
Data anomaly detection dbt tests
The Elementary dbt package includes anomaly detection tests, implemented as dbt tests. These tests can detect anomalies in volume, freshness, null rates, and specific dimensions, among others. The tests are configured and executed like any other tests in your project.
Table (model / source) tests
- Volume anomalies (`elementary.volume_anomalies`): Monitors the row count of your table over time per time bucket (if configured without a `timestamp_column`, counts the table's total rows).
- Freshness anomalies (`elementary.freshness_anomalies`): Monitors the freshness of your table over time, as the expected time between data updates. Requires a `timestamp_column` configuration.
- Event freshness anomalies (`elementary.event_freshness_anomalies`): Monitors the freshness of event data over time, as the expected time it takes each event to load, that is, the time between when the event actually occurs (the event timestamp) and when it is loaded to the database (the update timestamp). Configuring `event_timestamp_column` is required, and `update_timestamp_column` is optional.
- Dimension anomalies (`elementary.dimension_anomalies`): Monitors the frequency of values in the configured dimension over time, and alerts on unexpected changes in the distribution. It is best to configure it on low-cardinality fields. The test counts rows grouped by the given dimensions (columns/expressions).
- All columns anomalies (`elementary.all_columns_anomalies`): Executes column-level monitors and anomaly detection on all the columns of the table. Specific monitors are detailed here. You can use the `column_anomalies` param to override the default monitors, and `exclude_prefix` / `exclude_regexp` to exclude columns from the test.
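A `schema.yml` sketch of the table-level tests above; the model name, column names, and parameter values are illustrative assumptions, not part of the package:

```yaml
models:
  - name: orders  # hypothetical model name
    tests:
      # Row count per time bucket
      - elementary.volume_anomalies:
          timestamp_column: updated_at
      # Expected time between data updates
      - elementary.freshness_anomalies:
          timestamp_column: updated_at
      # Distribution of values in a low-cardinality dimension
      - elementary.dimension_anomalies:
          dimensions:
            - order_status
      # Column-level monitors on all columns, skipping underscore-prefixed ones
      - elementary.all_columns_anomalies:
          timestamp_column: updated_at
          exclude_prefix: "_"
```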
Column tests
- Column anomalies (`elementary.column_anomalies`): Executes column-level monitors and anomaly detection on the column. Specific monitors are detailed here and can be configured using the `column_anomalies` configuration.
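As a sketch, assuming a hypothetical model and column, a column-level test could be configured like this (the listed monitors are examples, not the full set):

```yaml
models:
  - name: orders  # hypothetical model name
    columns:
      - name: customer_email
        tests:
          - elementary.column_anomalies:
              column_anomalies:
                - null_count
                - missing_count
```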
Adding tests examples
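For example, an event freshness test on a source; the source and column names are illustrative assumptions:

```yaml
sources:
  - name: my_events  # hypothetical source name
    tables:
      - name: raw_clicks
        tests:
          - elementary.event_freshness_anomalies:
              event_timestamp_column: occurred_at  # when the event happened
              update_timestamp_column: loaded_at   # when it landed in the warehouse
```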
Configure your elementary anomaly detection tests
If your data set has a timestamp column that represents the creation time of a field, it is highly recommended to configure it as the `timestamp_column`.
To support different types of data sets, the tests have configuration parameters that can be used to customize their behavior. Read more about data anomaly detection tests configuration here.
We recommend adding a tag to the tests so you can execute them in a dedicated run using the selection parameter `--select tag:elementary`.
If you wish to only be warned on anomalies, configure the `severity` of the tests to `warn`.
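Putting the recommendations above together, a test tagged for dedicated runs and set to warn-only severity could look like this (the model name and column are assumptions):

```yaml
models:
  - name: orders  # hypothetical model name
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          config:
            tags: ["elementary"]
            severity: warn  # report anomalies as warnings instead of failures
```

You can then execute only these tests with `dbt test --select tag:elementary`.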
What happens on each test?
Upon running a test, your data is split into time buckets based on the `time_bucket` field and is limited by the `training_period` var. The test then compares a certain metric (e.g. row count) of the buckets within the `detection_period` to the same metric of all the previous time buckets within the `training_period`.
If there were any anomalies in the detection period, the test will fail.
On each test, the Elementary package executes the relevant monitors and searches for anomalies by comparing current metrics to historical ones.
To learn more, refer to core concepts.
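As a sketch, the three parameters mentioned above can be set per test; the values below are illustrative (a 1-day bucket, a 14-day training period, and a 2-day detection period), and the model name is an assumption:

```yaml
models:
  - name: orders  # hypothetical model name
    tests:
      - elementary.volume_anomalies:
          timestamp_column: updated_at
          time_bucket:
            period: day
            count: 1
          training_period:
            period: day
            count: 14
          detection_period:
            period: day
            count: 2
```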
What does it mean when a test fails?
When a test fails, it means that an anomaly was detected in this metric and dataset. To learn more, refer to core concepts and anomaly detection.