ELEMENTARY CLOUD

One of the challenges data teams face is tracking, understanding, and collaborating on the status of data issues. Tests fail daily, pipelines run frequently, and alerts are sent to different channels. Teams need a centralized place to track:

  • What data issues are open? Which issues were already resolved?
  • Who is on it, and what’s the latest status?
  • Are multiple failures part of the same issue?
  • What actions and events happened since the incident started?
  • Has a similar issue happened before? Who resolved it, and how?

In Elementary, these questions are answered with Incidents.

A comprehensive view of all incidents can be found in the Incidents page.

How incidents work

Every failure or warning in Elementary automatically opens a new incident or is added as an event to an ongoing incident. Based on grouping rules, different failures are grouped into the same incident.

An incident has a status, assignee and severity. These can be set in the Incidents page, or from an alert in integrations that support alert actions.

How incidents are resolved

Each incident starts at the first failure and ends when its status is changed, manually or automatically, to Resolved. An incident is automatically resolved when the failing tests, monitors, and/or models succeed again.
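The lifecycle above can be sketched as a small state model. This is an illustrative sketch only, not Elementary's implementation: the `Incident` class, its fields, and the `record_result` and `resolve` methods are hypothetical names. It also assumes that "successful again" means every test, monitor, or model that failed within the incident has since passed.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, Set


@dataclass
class Incident:
    """Hypothetical incident model; not Elementary's actual schema."""
    started_at: datetime
    status: str = "open"
    resolved_at: Optional[datetime] = None
    # Names of the tests / monitors / models currently failing in this incident.
    failing: Set[str] = field(default_factory=set)

    def record_result(self, name: str, passed: bool, at: datetime) -> None:
        """Apply one execution result; auto-resolve once nothing is failing."""
        if self.status == "resolved":
            return  # a resolved incident takes no further events
        if passed:
            self.failing.discard(name)
        else:
            self.failing.add(name)
        if not self.failing:
            self.resolve(at)

    def resolve(self, at: datetime) -> None:
        """Manual and automatic resolution both end the incident."""
        self.status = "resolved"
        self.resolved_at = at
```

In this sketch, calling `resolve` directly stands in for a user setting the status to Resolved, while `record_result` covers the automatic path.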

Incident grouping rules

Different failures and warnings are grouped into the same incident by the following rules:

  1. Additional failures of the same test / monitor on a table that has an active incident.
  2. (Coming soon) Freshness and volume issues that are downstream of an open incident on a model failure.
  3. (Coming soon) Failures of the same test / monitor that are on downstream tables of an active incident.
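The first rule can be illustrated with a short sketch. It assumes that an incident is identified by the pair of test (or monitor) name and table, which is one reading of the rule rather than a documented detail. `TrackedIncident` and `route_failure` are hypothetical names, not part of Elementary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrackedIncident:
    """Hypothetical incident record keyed by test name and table."""
    test_name: str
    table: str
    events: List[str] = field(default_factory=list)
    active: bool = True


def route_failure(incidents: List[TrackedIncident], test_name: str,
                  table: str, event: str) -> TrackedIncident:
    """Add the event to a matching active incident, or open a new incident."""
    for incident in incidents:
        same_key = incident.test_name == test_name and incident.table == table
        if incident.active and same_key:
            incident.events.append(event)  # rule 1: repeated failure, same incident
            return incident
    # No active incident matches, so this failure opens a new one.
    opened = TrackedIncident(test_name=test_name, table=table, events=[event])
    incidents.append(opened)
    return opened
```

Under these assumptions, two consecutive failures of the same test on the same table produce one incident with two events, while a failure of that test on a different table opens a separate incident.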

Incident deep dive

Clicking on an incident will open the test overview side panel, showing the following information:

  1. Test owner, tags and subscribers (if the incident is a model failure, the model owner, tags and subscribers will be shown).
  2. The execution history of the test / model, including the following information on each execution:
    • Execution time
    • Result (pass / fail / warning, etc.)
    • If the test failed:
      • A sample of the failed rows
      • The Slack channel where the alert was sent
    • For anomaly tests: the result chart
    • Compiled query
  3. Configuration of the test / model: the YAML or SQL code of the test / model. For cloud tests, the configuration is also editable.

You can also see the list of upstream and downstream assets. For a column test, these are the upstream and downstream columns; for a table test, they are the upstream and downstream tables.