Unstructured Data Validations
Beta Feature: Unstructured data validation tests are currently in beta. The functionality and interface may change in future releases.
Version Requirement: This feature requires Elementary dbt package version 0.18.0 or above.
Validating Unstructured Data with Elementary
What is Unstructured Data Validation?
Elementary’s elementary.unstructured_data_validation test lets you validate unstructured data using AI and large language models (LLMs). Instead of writing complex code, you describe what you expect from your data in plain English, and Elementary checks whether your data meets those expectations.
For example, you can verify that customer feedback comments are in English, that product descriptions contain required information, or that support tickets follow a specific format or express a particular sentiment.
How It Works
Elementary leverages the AI and LLM capabilities built directly into your data warehouse. When you run a validation test:
- Your unstructured data stays within your data warehouse
- The warehouse’s built-in AI and LLM functions analyze the data
- Elementary reports whether each text value meets your expectations
Required Setup for Each Data Warehouse
Before you can use Elementary’s unstructured data validations, you need to set up AI and LLM capabilities in your data warehouse:
Snowflake
- Prerequisite: Enable Snowflake Cortex AI LLM functions
- Recommended Model: claude-3-5-sonnet
- View Snowflake’s Guide
Databricks
- Prerequisite: Ensure Databricks AI Functions are available
- Recommended Model: databricks-meta-llama-3-3-70b-instruct
- View Databricks’ Setup Guide
BigQuery
- Prerequisite: Configure BigQuery to use Vertex AI models
- Recommended Model: gemini-1.5-pro
- View BigQuery’s Setup Guide
Redshift
- Support coming soon
Data Lakes
- Currently supported through Snowflake, Databricks, or BigQuery external object tables
- View Data Lakes Information
Using the Validation Test
The test requires two main parameters:
- expectation_prompt: Describe what you expect from the text in plain English
- llm_model_name: Specify which AI model to use (see recommendations above for each warehouse)
This test works with any column containing unstructured text data such as descriptions, comments, or other free-form text fields. It can also be applied to structured columns that can be converted to strings, enabling natural language data validations.
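As a rough illustration of how the two parameters fit together, the sketch below attaches the test to a column in a dbt schema file. The model name customer_feedback and the column name comment are hypothetical. The prompt follows the English-language example mentioned earlier, and the model name is the Snowflake recommendation listed above, so substitute the one that matches your warehouse.

```yaml
version: 2

models:
  - name: customer_feedback        # hypothetical model name
    columns:
      - name: comment              # hypothetical free-text column
        tests:
          - elementary.unstructured_data_validation:
              # Plain-English description of what every value should satisfy
              expectation_prompt: "The customer feedback comment is written in English"
              # Snowflake recommendation from this page; change per warehouse
              llm_model_name: "claude-3-5-sonnet"
```

Because this feature is in beta, the exact configuration shape may differ between package versions.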
Usage Examples
Here are some powerful ways you can apply unstructured data validations:
Validating Structure
Test fails if: A doctor’s note does not specify a time period or lacks recommendations for the patient.
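A structure check like this one might be configured roughly as follows. This is a sketch only: the model name medical_prescriptions and the column name doctor_notes are assumptions made for illustration, and the prompt wording is one possible phrasing of the expectation described above.

```yaml
version: 2

models:
  - name: medical_prescriptions    # hypothetical model name
    columns:
      - name: doctor_notes         # hypothetical column holding the raw note text
        tests:
          - elementary.unstructured_data_validation:
              # The test fails for notes that do not meet this expectation
              expectation_prompt: "The doctor's note specifies a time period and includes recommendations for the patient"
              # Databricks recommendation from this page; change per warehouse
              llm_model_name: "databricks-meta-llama-3-3-70b-instruct"
```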
Validating Sentiment
Test fails if: Any feedback in negative_feedbacks is not actually negative.
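A sentiment check could be sketched along these lines, assuming negative_feedbacks is a text column on a hypothetical customer_feedback model. The prompt wording is illustrative, and the model name is the BigQuery recommendation listed earlier on this page.

```yaml
version: 2

models:
  - name: customer_feedback        # hypothetical model name
    columns:
      - name: negative_feedbacks   # assumed here to be a column of feedback text
        tests:
          - elementary.unstructured_data_validation:
              # The test fails for any row whose sentiment is not negative
              expectation_prompt: "The customer feedback expresses a negative sentiment"
              # BigQuery recommendation from this page; change per warehouse
              llm_model_name: "gemini-1.5-pro"
```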
Validating Similarities (Coming Soon)
Test fails if: A PDF summary does not accurately represent the original PDF’s content. The validation uses the PDF name as the key to match each summary in the pdf_summary table to the pdf_content in the pdf_source_table.
Test fails if: The job title does not align with the job description.
Accepted Categories (Coming Soon)
Test fails if: A support ticket does not fall within the predefined categories.
Accepted Entities (Coming Soon)
Test fails if:
- The required entity (e.g., organization) is missing.
- Extracted entities do not match the expected values.
Compare Numeric Values (Coming Soon)
Test fails if:
- Required entities are missing
- The numerical entities do not match the structured CRM data