DEV Community: Blaine Elliott

What Tools Should I Use for Data Observability in 2026?

Blaine Elliott — Mon, 04 May 2026 14:27:29 +0000

The best data observability tool depends on your warehouse, team size, and budget. If you want a short answer: full-platform tools like AnomalyArmor, Monte Carlo, and Metaplane offer the fastest time to value. Open-source tools like Great Expectations and Soda give you maximum control at the cost of setup time. Point solutions like Datafold and Elementary excel at specific workflows like CI testing and dbt monitoring.

This guide breaks down what data observability actually means, how to evaluate tools, and how the top 10 options compare on features, pricing, and trade-offs.

What is data observability?

Data observability is the practice of continuously monitoring your data pipelines to detect problems before they reach dashboards, reports, and ML models. It borrows the concept from software observability (metrics, logs, traces) and applies it to data infrastructure.

The goal is simple: know when your data is broken before someone on the business team sends you a Slack message asking why the numbers look wrong.

Data observability tools monitor five core pillars and alert you when something deviates from expected behavior. Unlike data quality testing, which requires you to write explicit rules, observability tools learn what "normal" looks like from historical patterns and flag anomalies automatically.

What are the 5 pillars of data observability?

The five pillars of data observability are freshness, volume, schema, distribution, and lineage. Each pillar monitors a different failure mode in your data pipeline.

1. Freshness

Freshness tracks whether tables are updating on their expected schedule. A table that normally refreshes every hour but hasn't been updated in six hours has a freshness problem. This is the most common data issue and the easiest to detect automatically, because it only requires checking the most recent timestamp in each table. See our data freshness monitoring guide for the full detection pattern.

2. Volume

Volume monitors whether the number of rows in a table matches expected patterns. If your orders table normally receives 10,000 rows per day and suddenly receives 200, something is wrong upstream. Volume anomalies also catch accidental bulk deletes, duplicate loads, and partial pipeline failures.

3. Schema

Schema monitoring detects when columns are added, removed, renamed, or change data types. Schema changes are the single most common cause of pipeline failures. A backend engineer renames a column, and twelve downstream models break silently. Good schema monitoring catches these changes within minutes, not days. See Schema Drift: The Silent Pipeline Killer for why this matters.

4. Distribution

Distribution tracks whether the statistical properties of your data have shifted. This includes null rates, distinct value counts, min/max ranges, and value distributions. If a column that's normally 2% null suddenly jumps to 40% null, that's a distribution anomaly. Distribution monitoring catches data quality problems that freshness, volume, and schema checks would miss entirely. The full algorithm space is covered in Data Anomaly Detection: The Complete Guide.

5. Lineage

Lineage maps the upstream and downstream dependencies between tables, models, and dashboards. When a problem is detected, lineage tells you what broke and everything downstream that's affected. Without lineage, you spend hours tracing impact manually. With it, you know the blast radius instantly.

What categories of data observability tools exist?

Data observability tools fall into four broad categories. Understanding which category fits your team saves you from evaluating tools that were never designed for your use case.

Full-platform tools

Full-platform tools provide automated monitoring across all five pillars with minimal configuration. You connect your warehouse, the tool profiles your tables, learns baselines, and starts alerting. Examples: AnomalyArmor, Monte Carlo, Metaplane, Bigeye.

Best for: Teams that want fast time to value and don't want to maintain monitoring infrastructure.

Point-solution tools

Point solutions focus on one or two areas and do them exceptionally well. Datafold specializes in data diffing and CI/CD testing. Elementary focuses on dbt-native monitoring. These tools often complement a full-platform tool rather than replacing one.

Best for: Teams with specific workflow needs (dbt-heavy shops, CI/CD-driven data teams).

Open-source frameworks

Open-source tools like Great Expectations and Soda Core give you a testing framework where you define expectations as code. They're free to run but require significant setup, maintenance, and rule-writing. You get maximum flexibility at the cost of engineering time.

Best for: Teams with strong engineering culture, limited budget, and willingness to invest in building their own monitoring layer.

DIY approaches

Some teams build monitoring with custom SQL queries, Airflow checks, and dbt tests. This works for small-scale pipelines but becomes unmanageable beyond 50-100 tables. You'll spend more time maintaining the monitoring system than monitoring the data.

Best for: Teams with fewer than 20 tables or teams evaluating whether they need data observability at all.

How should I evaluate data observability tools?

Before comparing specific tools, establish your evaluation criteria. The features matrix on every vendor's website looks identical. What actually differentiates tools is the stuff that's harder to measure.

Time to value

How long from connecting your database to receiving your first useful alert? Some tools require days of configuration. Others show you insights within hours. This is the single most important criterion and the one most teams overlook during evaluation.

Alert quality

A tool that sends 50 alerts per day is worse than no tool at all. Alert fatigue kills adoption faster than any missing feature. Evaluate how the tool handles noise reduction, prioritization, and suppression of known issues.

Warehouse coverage

Most teams run more than one database. Confirm that the tool supports your specific warehouse and version, and that all features work across all your databases. "Supports Snowflake" might mean full functionality or it might mean a basic connection with half the features missing.

Pricing transparency

Data observability pricing ranges from free (open-source) to six figures annually (enterprise platforms). Get a complete quote for your actual table count. Watch for hidden costs: per-user fees, per-alert charges, premium features behind upsells.

Integration depth

Where do alerts go? Does the tool integrate with Slack, PagerDuty, your orchestrator? Can it enrich dbt models with metadata? Does it expose an API or MCP server for AI agent workflows? The best tool in the world is useless if it doesn't fit your team's workflow.

How do the top data observability tools compare?

Here's a comparison of the 10 most relevant data observability tools in 2026, covering full-platform solutions, point solutions, and open-source options.

Tool	Type	Pricing	Warehouse Support	Key Strength
AnomalyArmor	Full platform	$5/table	Snowflake, Databricks, PostgreSQL, MySQL, Redshift	Fast setup, AI-powered Q&A, lowest per-table cost
Monte Carlo	Full platform	Enterprise only (custom quotes)	Snowflake, Databricks, BigQuery, Redshift, others	Market leader, deepest lineage, largest customer base
Metaplane	Full platform	~$10/table	Snowflake, BigQuery, Redshift, Databricks, PostgreSQL	Strong UI, column-level lineage, good Slack integration
Bigeye	Full platform	Custom pricing	Snowflake, Databricks, BigQuery, Redshift, others	Granular metric monitoring, flexible rule engine
Soda	Open-source + cloud	Free (Core) / custom (Cloud)	Most major warehouses	Checks-as-code, SodaCL language, CI/CD friendly
Datafold	Point solution	Custom pricing	Snowflake, BigQuery, Databricks, Redshift, PostgreSQL	Data diffing, CI/CD integration, PR-level impact analysis
Great Expectations	Open-source	Free (OSS) / custom (Cloud)	Any SQL database via SQLAlchemy	Mature framework, huge community, maximum flexibility
Elementary	Open-source	Free (OSS) / custom (Cloud)	dbt-supported warehouses	dbt-native, runs inside your dbt project, no separate infra
Atlan	Data catalog + observability	Custom pricing	Most major warehouses	Combines catalog, governance, and observability in one platform
DataHub (Acryl)	Data catalog + observability	Free (OSS) / custom (Acryl Cloud)	Most major warehouses	Open-source catalog with observability features, strong metadata

What are the full-platform data observability tools?

AnomalyArmor

AnomalyArmor is a full-platform data observability tool built for fast time to value. Connect your warehouse and monitoring begins automatically. No manual rule configuration required for baseline monitoring.

Strengths: Pricing at $5/table is roughly half the industry standard. AI-powered intelligence lets you ask natural language questions about your data ("when did this table last update?", "what changed in the schema?"). Schema drift detection identifies breaking vs non-breaking changes. Supports Snowflake, Databricks, PostgreSQL, MySQL, and Redshift. MCP server integration allows AI agents to query data health programmatically.

Limitations: Smaller customer base compared to Monte Carlo. Fewer third-party integrations than more established platforms. BigQuery support not yet available.

Pricing: $5/table per month. Free trial with 5 tables for 15 days. Annual discount of 15%.

Monte Carlo

Monte Carlo is the market leader in data observability and the company that popularized the term. They have the largest customer base, the deepest integration ecosystem, and the most mature lineage capabilities.

Strengths: End-to-end lineage spanning warehouses, BI tools, and ETL pipelines. Large ecosystem of integrations. Field-level lineage and impact analysis. Strong incident management workflows. Well-established customer success organization.

Limitations: Enterprise-only pricing means you won't get a quote without a sales call, and costs tend to be significantly higher than alternatives. The platform's breadth can mean a steeper learning curve. Recent organizational changes (the company reduced headcount by roughly 30% in early 2026) may affect long-term support capacity.

Pricing: Custom enterprise pricing only. No self-serve option. Typical contracts start in the mid-five-figure range annually.

Metaplane

Metaplane offers a clean, well-designed observability platform with strong column-level lineage and a polished Slack integration. It sits in the middle of the market between Monte Carlo's enterprise positioning and smaller tools.

Strengths: Intuitive UI that data teams actually enjoy using. Column-level lineage. Strong anomaly detection with customizable sensitivity. Good documentation and onboarding experience.

Limitations: At approximately $10/table, pricing is double some alternatives. Fewer warehouse integrations than Monte Carlo. Less AI-native than newer entrants.

Pricing: Approximately $10/table per month. Self-serve signup available.

Bigeye

Bigeye provides granular metric-level monitoring with a flexible rule engine. It's designed for teams that want fine-grained control over exactly what gets monitored and how.

Strengths: Highly configurable monitoring rules. Strong support for custom metrics. Good API for programmatic monitor management. Detailed metric history and trending.

Limitations: The flexibility comes with a steeper learning curve. Time to value can be longer than more opinionated tools. Pricing is not publicly available.

Pricing: Custom pricing. Contact sales for quotes.

What are the best open-source data observability tools?

Soda

Soda offers both an open-source framework (Soda Core) and a commercial cloud platform (Soda Cloud). The open-source component uses SodaCL, a domain-specific language for defining data checks as code.

Strengths: SodaCL is well-designed and readable. Strong CI/CD integration for catching data issues in pull requests. Active open-source community. Cloud platform adds anomaly detection, alerting, and collaboration features on top of the OSS core.

Limitations: Requires writing checks manually. No automated baseline learning in the open-source version. Cloud pricing is not publicly listed.

Pricing: Soda Core is free. Soda Cloud has custom pricing.

Great Expectations

Great Expectations is the most mature open-source data quality framework. It provides a library of "expectations" (test assertions) that you define in code and run against your data.

Strengths: Massive library of built-in expectations. Large community with thousands of contributors. Works with any database that SQLAlchemy supports. Excellent documentation. The GX Cloud offering adds a UI and collaboration features.

Limitations: Significant setup and maintenance overhead. You must write and maintain every expectation. No automated anomaly detection. Not a monitoring system on its own: you need to schedule and orchestrate runs yourself. The learning curve is real, especially for non-engineers.

Pricing: Open-source is free. GX Cloud has custom pricing.

Elementary

Elementary runs inside your dbt project as a dbt package. It adds anomaly detection, schema change tracking, and data quality tests that execute during your normal dbt runs.

Strengths: Zero additional infrastructure. If you already run dbt, Elementary adds observability with a package install. Native dbt integration means monitors stay in sync with your models. Free open-source tier covers most use cases.

Limitations: Only works if you use dbt. Monitoring only runs when dbt runs, so you won't catch issues between dbt executions. Less suitable for real-time or near-real-time monitoring.

Pricing: Open-source is free. Elementary Cloud has custom pricing.

What about data catalog tools with observability features?

Atlan

Atlan is primarily a data catalog and governance platform that has added observability capabilities. It combines metadata management, data discovery, lineage, and monitoring in a single platform.

Strengths: Single platform for catalog, governance, and observability. Strong metadata management and data discovery. Column-level lineage. Active community and modern UI.

Limitations: Observability is a secondary feature, not the core product. Monitoring depth may not match purpose-built observability tools. Enterprise pricing puts it out of reach for smaller teams.

Pricing: Custom enterprise pricing.

DataHub / Acryl

DataHub is an open-source metadata platform originally created at LinkedIn. Acryl Data is the commercial company offering a managed version (Acryl Cloud) with additional features including data observability.

Strengths: Open-source core with a massive community. Strong metadata model that integrates with most data tools. Acryl Cloud adds managed observability on top. Good for teams already invested in DataHub for cataloging.

Limitations: The open-source version requires significant operational effort to run. Observability features are newer and less mature than purpose-built tools. Steep learning curve for self-hosted deployments.

Pricing: DataHub OSS is free. Acryl Cloud has custom pricing.

Should I choose a full-platform tool or build with open-source?

This is the most common decision point, and the answer depends on your team's engineering capacity and your table count.

Choose a full-platform tool if:

You have 50+ tables to monitor
You want results in hours, not weeks
Your team's time is better spent on data engineering than building monitoring infrastructure
You need automated baseline detection, not just rule-based checks

Choose open-source if:

You have strong engineering capacity and willingness to maintain monitoring code
Budget is the primary constraint
You need deep customization that commercial tools don't support
You're already heavily invested in dbt and want monitoring in that workflow

Combine both if:

You want automated baselines from a platform tool plus custom business logic from dbt tests or Great Expectations
You need CI/CD-level testing (Datafold, Soda) alongside production monitoring (AnomalyArmor, Monte Carlo)

Most mature data teams end up running a combination: a platform tool for automated monitoring and an open-source framework for business-specific validations.

How much do data observability tools cost?

Pricing in data observability is notoriously opaque. Here's what we know as of 2026:

Tool	Pricing Model	Public Pricing	Estimated Annual Cost (200 tables)
AnomalyArmor	Per table	$5/table/month	~$10,200/year (with annual discount)
Monte Carlo	Custom	Not published	$50,000-$150,000+/year (estimated)
Metaplane	Per table	~$10/table/month	~$24,000/year
Bigeye	Custom	Not published	Contact sales
Soda Core	Free (OSS)	$0	$0 + engineering time
Soda Cloud	Custom	Not published	Contact sales
Great Expectations	Free (OSS)	$0	$0 + engineering time
Elementary	Free (OSS)	$0	$0 + engineering time
Datafold	Custom	Not published	Contact sales
Atlan	Custom	Not published	$50,000+/year (estimated)
Acryl Cloud	Custom	Not published	Contact sales

The hidden cost with open-source tools is engineering time. Setting up, maintaining, and extending Great Expectations or Soda Core across 200 tables is a meaningful ongoing commitment. Budget 2-4 hours per week for maintenance, more during initial setup. Whether that's cheaper than a commercial tool depends on what your engineers' time is worth.

Data Observability Tools FAQ

What is the difference between data observability and data quality?

Data observability monitors pipeline health: freshness, volume, schema changes, and distribution anomalies. Data quality validates the data itself across the six standard dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness. Observability watches the plumbing. Quality checks the water. Most teams need both. See our deeper breakdown of data observability vs data quality.

Do I need a data observability tool if I already use dbt tests?

dbt tests are excellent for rule-based validation (not null, unique, accepted values, relationships). They run at build time and catch known failure modes. Data observability adds automated anomaly detection, freshness monitoring, schema change tracking, and alerting between dbt runs. They complement each other. dbt tests catch what you anticipate. Observability catches what you don't.

How long does it take to set up a data observability tool?

Full-platform tools (AnomalyArmor, Monte Carlo, Metaplane) typically connect in under an hour and begin generating baselines within 24-48 hours. Open-source tools (Great Expectations, Soda) can take days to weeks depending on your table count and the complexity of your checks. The gap in time to value is the main trade-off between commercial and open-source.

Can data observability tools monitor real-time streaming data?

Most tools focus on batch/warehouse monitoring. Monte Carlo and Bigeye have added some streaming support. For true real-time monitoring of Kafka topics or streaming pipelines, you'll likely need purpose-built streaming observability or custom solutions. This is a gap in the market as of 2026.

What warehouse integrations should I look for?

At minimum, your tool should support your primary warehouse with full feature parity. The major warehouses are Snowflake, Databricks, BigQuery, Redshift, and PostgreSQL. If you run multiple warehouses, confirm that the tool provides consistent functionality across all of them, not just a basic connection for secondary warehouses.

How do data observability tools handle alert fatigue?

Good tools use ML-based anomaly detection with configurable sensitivity, deduplication of related alerts, grouping by root cause, and prioritization based on table importance. Some tools let you tag tables by criticality so that alerts on business-critical tables get elevated while development tables stay quiet. Ask vendors specifically how they handle noise reduction.

Is open-source data observability production-ready?

Great Expectations and Soda Core are battle-tested in production at large companies. Elementary is production-ready for dbt shops. The trade-off is operational: you're responsible for hosting, scheduling, scaling, and maintaining the infrastructure. If your team has the capacity, open-source works well. If not, the maintenance burden accumulates.

What role does AI play in data observability?

AI is used in three ways: automated anomaly detection (learning baselines without manual rule-writing), natural language querying (asking questions about your data in plain English), and intelligent alerting (reducing noise by correlating related issues). Some tools also expose AI agent interfaces (MCP servers) so that coding assistants and automation pipelines can query data health programmatically.

How do I calculate ROI for a data observability tool?

Measure data downtime before and after adoption. Data downtime is the total time your data is missing, inaccurate, or unusable. Track time-to-detection (how fast you find issues) and time-to-resolution (how fast you fix them). Multiply hours saved by engineering hourly cost. Most teams see ROI within 2-3 months because the tool catches issues that previously took hours or days of manual investigation.

Should I consolidate on one tool or use multiple?

Start with one full-platform tool for automated monitoring, then add specialized tools as needed. A common stack is a platform tool (AnomalyArmor, Monte Carlo, or Metaplane) for automated baseline monitoring plus dbt tests or Great Expectations for business-specific validation. Avoid running two full-platform tools, as the overlap creates confusion about which alerts to trust.

Choosing a data observability tool comes down to time to value, alert quality, and cost. See how AnomalyArmor monitors freshness, schema changes, and data anomalies across your pipeline.

How Do I Monitor Schema Changes in a Data Warehouse?

Blaine Elliott — Mon, 27 Apr 2026 15:20:34 +0000

You monitor schema changes in a data warehouse by periodically querying metadata catalogs (like INFORMATION_SCHEMA), subscribing to event-driven notifications, or comparing structural hashes of your tables over time. Each method trades off between detection latency, implementation complexity, and warehouse compatibility.

Schema changes are the silent killers of data pipelines. A column rename, a type change from INTEGER to VARCHAR, or a dropped table can cascade through downstream models, dashboards, and ML features without any error until someone notices the numbers look wrong. Monitoring schema changes means catching these mutations before they reach your consumers.

This guide covers what schema changes are, why they break things, how to detect them across Snowflake, Databricks, and PostgreSQL, and which tools can automate the process.

What counts as a schema change?

A schema change is any modification to the structure of a table, view, or other database object. Common schema changes include:

Change Type	Example	Risk Level
Column added	New `discount_type` column appears	Low
Column removed	`customer_email` column dropped	Critical
Column renamed	`user_id` becomes `usr_id`	Critical
Type changed	`price` moves from `DECIMAL(10,2)` to `VARCHAR`	High
Nullability changed	`order_date` becomes nullable	Medium
Default changed	`status` default changes from `active` to `pending`	Medium
Table dropped	`dim_customers` is deleted	Critical
Table added	New `stg_payments_v2` table appears	Low
Constraint changed	Primary key removed from `transaction_id`	High

Not all schema changes are dangerous. Adding a new column is usually safe. Removing or renaming a column is almost always breaking. The goal of monitoring is to detect the dangerous changes before they propagate.

Why do schema changes break data pipelines?

Schema changes break pipelines because most data transformations assume a fixed structure. A dbt model that references SELECT customer_email FROM raw.customers will fail the moment that column is renamed to email_address. But the failure mode depends on the warehouse and the tool:

Hard failures happen when a query references a column that no longer exists. The pipeline errors out, someone gets paged, and the fix is obvious (if annoying). These are actually the best case.

Silent failures happen when a type change causes implicit casting, a new column shifts positional references, or a nullable column starts producing NULLs where downstream logic assumes NOT NULL. The pipeline succeeds, the data looks plausible, and no one notices for days or weeks.

Silent failures are why schema monitoring matters. You need to detect the change, not just the downstream symptom. These silent pipeline breaks are the biggest source of data downtime in most production teams.

How do you detect schema changes with INFORMATION_SCHEMA?

The most portable detection method is polling INFORMATION_SCHEMA.COLUMNS. Every major data warehouse exposes this metadata catalog. The strategy is simple: snapshot the schema periodically, compare snapshots, and alert on differences.

Snowflake

-- Snapshot current schema metadata
CREATE OR REPLACE TABLE schema_snapshots.columns_snapshot AS
SELECT
    table_catalog,
    table_schema,
    table_name,
    column_name,
    ordinal_position,
    data_type,
    is_nullable,
    column_default,
    CURRENT_TIMESTAMP() AS snapshot_ts
FROM information_schema.columns
WHERE table_schema NOT IN ('INFORMATION_SCHEMA');

-- Compare current schema against previous snapshot
WITH current_cols AS (
    SELECT table_schema, table_name, column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE table_schema NOT IN ('INFORMATION_SCHEMA')
),
previous_cols AS (
    SELECT table_schema, table_name, column_name, data_type, is_nullable
    FROM schema_snapshots.columns_snapshot
    WHERE snapshot_ts = (SELECT MAX(snapshot_ts) FROM schema_snapshots.columns_snapshot)
)
-- Columns added (in current but not previous)
SELECT 'ADDED' AS change_type, c.table_schema, c.table_name, c.column_name,
       NULL AS old_data_type, c.data_type AS new_data_type
FROM current_cols c
LEFT JOIN previous_cols p USING (table_schema, table_name, column_name)
WHERE p.column_name IS NULL

UNION ALL

-- Columns removed (in previous but not current)
SELECT 'REMOVED', p.table_schema, p.table_name, p.column_name,
       p.data_type, NULL
FROM previous_cols p
LEFT JOIN current_cols c USING (table_schema, table_name, column_name)
WHERE c.column_name IS NULL

UNION ALL

-- Type changes
SELECT 'TYPE_CHANGED', c.table_schema, c.table_name, c.column_name,
       p.data_type, c.data_type
FROM current_cols c
JOIN previous_cols p USING (table_schema, table_name, column_name)
WHERE c.data_type != p.data_type;

Databricks

Databricks uses Unity Catalog, which exposes schema metadata through information_schema at the catalog level.

-- Detect schema changes in Databricks Unity Catalog
WITH current_cols AS (
    SELECT table_schema, table_name, column_name, data_type, is_nullable
    FROM system.information_schema.columns
    WHERE table_catalog = 'your_catalog'
),
previous_cols AS (
    SELECT table_schema, table_name, column_name, data_type, is_nullable
    FROM schema_audit.columns_snapshot
)
SELECT
    CASE
        WHEN p.column_name IS NULL THEN 'ADDED'
        WHEN c.column_name IS NULL THEN 'REMOVED'
        WHEN c.data_type != p.data_type THEN 'TYPE_CHANGED'
        WHEN c.is_nullable != p.is_nullable THEN 'NULLABILITY_CHANGED'
    END AS change_type,
    COALESCE(c.table_schema, p.table_schema) AS table_schema,
    COALESCE(c.table_name, p.table_name) AS table_name,
    COALESCE(c.column_name, p.column_name) AS column_name,
    p.data_type AS old_type,
    c.data_type AS new_type
FROM current_cols c
FULL OUTER JOIN previous_cols p
    USING (table_schema, table_name, column_name)
WHERE p.column_name IS NULL
   OR c.column_name IS NULL
   OR c.data_type != p.data_type
   OR c.is_nullable != p.is_nullable;

PostgreSQL

-- PostgreSQL schema diff using pg_catalog
WITH current_cols AS (
    SELECT
        n.nspname AS table_schema,
        c.relname AS table_name,
        a.attname AS column_name,
        pg_catalog.format_type(a.atttypid, a.atttypmod) AS data_type,
        NOT a.attnotnull AS is_nullable
    FROM pg_catalog.pg_attribute a
    JOIN pg_catalog.pg_class c ON a.attrelid = c.oid
    JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
    WHERE a.attnum > 0
      AND NOT a.attisdropped
      AND n.nspname NOT IN ('pg_catalog', 'information_schema')
),
previous_cols AS (
    SELECT table_schema, table_name, column_name, data_type, is_nullable
    FROM schema_audit.columns_snapshot
)
SELECT
    CASE
        WHEN p.column_name IS NULL THEN 'ADDED'
        WHEN c.column_name IS NULL THEN 'REMOVED'
        WHEN c.data_type != p.data_type THEN 'TYPE_CHANGED'
    END AS change_type,
    COALESCE(c.table_schema, p.table_schema) AS table_schema,
    COALESCE(c.table_name, p.table_name) AS table_name,
    COALESCE(c.column_name, p.column_name) AS column_name,
    p.data_type AS old_type,
    c.data_type AS new_type
FROM current_cols c
FULL OUTER JOIN previous_cols p
    USING (table_schema, table_name, column_name)
WHERE p.column_name IS NULL
   OR c.column_name IS NULL
   OR c.data_type != p.data_type;

How does event-driven schema change detection work?

Instead of polling on a schedule, some warehouses support event-driven notifications when DDL statements execute. This eliminates the detection delay between polls.

Snowflake provides QUERY_HISTORY and ACCESS_HISTORY views that log DDL operations. You can query for recent ALTER TABLE, DROP COLUMN, and CREATE TABLE statements:

-- Find recent DDL operations in Snowflake
SELECT
    query_text,
    user_name,
    start_time,
    database_name,
    schema_name
FROM snowflake.account_usage.query_history
WHERE query_type IN ('ALTER_TABLE', 'DROP', 'CREATE')
  AND start_time > DATEADD('hour', -24, CURRENT_TIMESTAMP())
ORDER BY start_time DESC;

Databricks logs DDL events through Unity Catalog's audit logs, which can be streamed to a monitoring system.

PostgreSQL supports EVENT TRIGGER functions that fire on DDL commands:

-- PostgreSQL: event trigger for schema changes
CREATE OR REPLACE FUNCTION log_ddl_change()
RETURNS event_trigger AS $$
BEGIN
    INSERT INTO schema_audit.ddl_log (event, command_tag, object_type, object_name, logged_at)
    SELECT
        tg_event,
        tg_tag,
        objtype,
        objid::regclass::text,
        NOW()
    FROM pg_event_trigger_ddl_commands();
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER track_ddl ON ddl_command_end
    EXECUTE FUNCTION log_ddl_change();

Event triggers give you real-time detection and attribution (who changed what), but they require write access to create triggers and only catch changes made through SQL. Changes made through external tools or direct catalog manipulation may be missed.

How does hash-based schema comparison work?

Hash comparison is a lightweight approach that reduces schema state to a single value. You compute a hash of the column names, types, and order for each table, store it, and compare on the next run.

-- Snowflake: hash-based schema fingerprint
SELECT
    table_schema,
    table_name,
    MD5(LISTAGG(column_name || ':' || data_type || ':' || is_nullable, ',')
        WITHIN GROUP (ORDER BY ordinal_position)) AS schema_hash
FROM information_schema.columns
WHERE table_schema NOT IN ('INFORMATION_SCHEMA')
GROUP BY table_schema, table_name;

When the hash changes, you know the schema changed. You then run a detailed diff to find exactly what changed. This two-phase approach minimizes query cost: you only run the expensive diff query when the cheap hash check flags a change.

How do these detection methods compare?

Method	Detection Latency	Setup Complexity	Warehouse Support	Change Attribution	Cost
INFORMATION_SCHEMA polling	Minutes to hours (depends on poll interval)	Low	All major warehouses	No (what changed, not who)	Low (metadata queries are cheap)
Event triggers / DDL audit logs	Seconds to minutes	Medium	PostgreSQL (native), Snowflake (query history), Databricks (audit logs)	Yes (user, timestamp, exact DDL)	Low
Hash comparison	Minutes to hours (depends on poll interval)	Low	All major warehouses	No (detects change, not details)	Very low (single hash per table)
Data observability platform	Minutes (automated polling)	Low (SaaS)	All major warehouses	Yes (full context and history)	Medium (subscription cost)

INFORMATION_SCHEMA polling is the most practical starting point. It works everywhere, requires no special permissions beyond read access to metadata views, and gives you full detail on what changed. The main drawback is latency: you only detect changes on your polling schedule.

Event triggers provide the fastest detection and full attribution, but they are database-specific and require elevated permissions. They work well in PostgreSQL. In Snowflake and Databricks, you approximate this by querying audit logs.

Hash comparison is useful as an optimization layer on top of polling. It reduces the volume of detailed diff queries when you are monitoring hundreds or thousands of tables.

Data observability platforms combine all three approaches and add alerting, historical tracking, lineage, and impact analysis. They are the right choice when your warehouse has enough tables that manual monitoring becomes a full-time job.

What is the difference between automated and manual schema monitoring?

Manual schema monitoring means someone runs a diff query, reviews the output, and decides whether the change is expected. This works when you have a small number of tables and a disciplined team that runs the check before every deployment.

Automated schema monitoring means a system polls your warehouse on a schedule, compares against the last known state, and sends alerts when changes are detected. Automated monitoring is necessary when:

You have more than 50 tables
Multiple teams or external vendors modify schemas
Upstream sources change without notice (third-party SaaS data, partner feeds)
You need an audit trail of every schema change over time

The transition from manual to automated usually happens after the first silent schema change that breaks a dashboard for a week before anyone notices.

Which tools can automate schema change monitoring?

Several categories of tools address schema monitoring, from open-source libraries to full observability platforms.

Data observability platforms like AnomalyArmor, Monte Carlo, and Sifflet monitor schema changes as part of a broader data quality suite. They poll your warehouse metadata automatically, track changes over time, and alert on unexpected modifications. AnomalyArmor detects column additions, removals, type changes, and nullability shifts across Snowflake, Databricks, and PostgreSQL. Monte Carlo provides similar capabilities as part of its data observability platform, though it recently reduced its engineering team significantly. Sifflet offers schema drift detection alongside data quality rules.

Data testing frameworks like Great Expectations and dbt tests let you write explicit schema assertions. For example, a Great Expectations expect_table_columns_to_match_ordered_list check will fail if columns change. Datafold provides schema-aware diff tooling for pull request review. These tools catch schema changes at test time rather than through continuous monitoring, which means changes are detected during CI/CD runs rather than in real-time.

Custom scripts using the SQL patterns shown above work well for small environments. A Python script that runs the INFORMATION_SCHEMA diff query on a cron schedule and posts results to Slack is a common starting point. The problem is maintenance: custom scripts need error handling, retry logic, credential management, state storage, and someone to maintain them when they break.

How do you respond to a schema change once it is detected?

Detection is only half the problem. When a schema change is detected, the response workflow matters as much as the alert:

Classify the change: Is it additive (new column) or breaking (removed column, type change)? Additive changes usually need no immediate action. Breaking changes need investigation.
Trace the impact: Which downstream models, dashboards, and consumers depend on the changed table? Lineage metadata answers this question. Without lineage, you are searching through dbt DAGs and dashboard definitions manually.
Determine intent: Was this change planned (a migration, a new feature) or accidental (someone ran ALTER TABLE in production)? DDL audit logs with user attribution answer this question.
Remediate or adapt: For planned changes, update downstream models to reference the new schema. For accidental changes, revert the DDL if possible or fix the upstream system.
Update monitoring: If the change is intentional, update your schema baseline so future checks don't flag it as anomalous.

The best schema monitoring tools automate steps 1 and 2 (classification and impact analysis) and provide context for step 3 (audit trail). Steps 4 and 5 still require human judgment.

Schema Change Monitoring FAQ

What is schema drift?

Schema drift is the gradual, often unplanned divergence of a table's structure from its expected definition. It happens when upstream systems evolve independently, when different teams make ad-hoc changes, or when third-party data sources update their export formats. Schema drift is cumulative: each individual change may be small, but over months the actual schema can diverge significantly from what downstream consumers expect. See Schema Drift: The Silent Pipeline Killer for a deeper look at why drift is so damaging and Using AI to Set Up Schema Drift Detection for an end-to-end walkthrough.

How often should I poll for schema changes?

For most teams, polling every 1 to 6 hours is sufficient. Critical production tables that feed real-time dashboards may warrant hourly checks. Staging and development tables can be checked daily. The right frequency depends on how quickly your upstream sources change and how much latency you can tolerate before detecting a break.

Can schema changes happen without anyone running DDL?

Yes. Schema-on-read systems like Databricks Delta Lake can infer schema from data files. If a new Parquet file arrives with a different column set and schema evolution is enabled, the table schema changes automatically. Similarly, some ETL tools auto-detect source schema changes and propagate them to the warehouse without explicit DDL.

What is the difference between schema drift and schema evolution?

Schema evolution is an intentional, managed process where a table's structure changes according to a plan (e.g., adding a column for a new feature, migrating a type for better precision). Schema drift is unintentional or uncoordinated change. The technical mechanism is the same. The difference is whether someone planned and communicated the change.

How do I monitor schema changes in dbt?

dbt provides schema tests through its schema.yml configuration. You can assert expected columns, data types, and constraints. The dbt-expectations package adds expect_table_columns_to_match_ordered_list and similar checks. These tests run during dbt test rather than continuously, so they catch schema changes at build time but not between builds. For continuous monitoring, pair dbt with a data observability tool.

Do column additions break pipelines?

Usually not. Most SQL queries use SELECT column_name syntax rather than SELECT *, so a new column is invisible to existing queries. The exceptions are: pipelines that use SELECT *, positional CSV exports, and systems that validate the full schema against an expected list. If your downstream consumers are strict about schema, even an additive change can cause failures.

How do I track who made a schema change?

Use DDL audit logs. In Snowflake, query snowflake.account_usage.query_history for DDL operations to see the user, timestamp, and exact SQL. In PostgreSQL, use event triggers to log DDL commands with session user information. In Databricks, Unity Catalog audit logs capture DDL events with user attribution. Without audit logs, you can only see that a change happened, not who made it.

What schema changes are most dangerous?

Column removals and type changes are the most dangerous because they cause silent data corruption. A removed column produces NULL values in SELECT * queries or hard failures in named-column queries. A type change from INTEGER to VARCHAR can cause implicit casting that silently changes aggregate results. Table renames are equally dangerous because every downstream reference breaks simultaneously.

Should I version my warehouse schema?

Yes, if your warehouse supports it. Delta Lake and Apache Iceberg provide time-travel and schema versioning natively. You can query the table as it existed at a previous point in time and compare schemas across versions. For warehouses without native versioning, maintain your own schema snapshot table (as shown in the SQL examples above) and treat it as a version history.

How is schema monitoring different from data quality monitoring?

Schema monitoring checks the structure of your data: column names, types, constraints, and table existence. Data quality monitoring checks the content: null rates, value distributions, freshness, and anomalies. Schema monitoring catches the container changing. Data quality monitoring catches the contents going wrong. Both are necessary, and both feed into data anomaly detection. A schema change often causes data quality failures downstream, so catching the schema change first gives you a head start on remediation.

Schema changes are inevitable. Catching them before they break your pipelines is not. See how AnomalyArmor monitors schema drift automatically across Snowflake, Databricks, and PostgreSQL.

What Is Data Downtime and How Do You Measure It?

Blaine Elliott — Mon, 20 Apr 2026 16:04:34 +0000

Data downtime is the total period during which data is missing, erroneous, or otherwise unfit for use. It is the data equivalent of application downtime: the window between when something breaks and when it is fully resolved. During data downtime, dashboards show wrong numbers, ML models ingest bad features, and business users make decisions based on information they cannot trust.

The standard formula is:

Data Downtime = (Time to Detection + Time to Resolution) x Number of Incidents

A team that takes 8 hours to notice a broken pipeline (TTD) and 4 hours to fix it (TTR) accumulates 12 hours of downtime per incident. If that happens 10 times a month, the team has 120 hours of data downtime per month, roughly 16% of total available hours.

This guide breaks down how to measure TTD and TTR, how to estimate the dollar cost of downtime, how to reduce both metrics, and how data downtime relates to broader data observability practices.

Why does data downtime matter?

Data downtime is expensive in ways that don't show up on infrastructure bills. The costs are indirect but real:

Bad business decisions: A marketing team running a campaign based on stale conversion data will misallocate spend. A finance team reporting revenue from a pipeline that silently dropped 20% of transactions will publish incorrect numbers.
Lost engineering time: Data engineers spend 30-40% of their time firefighting data quality issues according to multiple industry surveys, including reports from Monte Carlo and Wakefield Research. Every hour of downtime generates follow-up work: root cause analysis, stakeholder communication, manual data patches.
Eroded trust: When dashboards are wrong often enough, business users stop trusting the data platform entirely. They build shadow spreadsheets, export CSVs, and do manual reconciliation. Once trust is gone, it takes months to rebuild even after the technical problems are fixed.
Compliance risk: For regulated industries, data downtime in reporting pipelines can mean missed filing deadlines, incorrect disclosures, or audit findings.

The DAMA International Data Management Body of Knowledge (DMBOK) frames data quality as a continuous process, not a one-time check. Data downtime is the metric that quantifies how well that continuous process is working.

How much does data downtime cost?

Estimating the dollar cost of data downtime helps justify investment in monitoring. The calculation depends on two factors: engineering time spent on incidents and the business impact of decisions made on bad data.

Engineering cost per incident

Engineering cost = (TTD + TTR) x Number of engineers involved x Hourly loaded cost

A fully loaded data engineer in the US costs $80-150/hour (salary + benefits + overhead). If a typical incident involves 2 engineers spending a combined 6 hours (2 hours detecting, 4 hours fixing), each incident costs $480-900 in engineering time alone.

Business impact cost

Business impact is harder to quantify but often dwarfs engineering cost. Examples:

Scenario	Estimated cost per hour of bad data
Marketing campaign running on stale conversion data	$500-5,000 in misallocated ad spend
Revenue dashboard showing incorrect totals during board prep	10-40 hours of manual reconciliation by finance
ML recommendation model trained on corrupted features	Degraded conversion rate until retraining completes
Compliance report filed with missing transactions	Potential regulatory penalty + audit remediation

Total cost formula

Monthly cost of data downtime =
  (Avg incidents/month x Avg engineers/incident x Avg hours/incident x Hourly rate)
  + Estimated business impact per incident x Avg incidents/month

For a mid-size data team with 10 incidents per month, 2 engineers per incident at $100/hour, and 6 hours per incident:

Engineering cost: 10 x 2 x 6 x $100 = $12,000/month
Business impact: varies, but even a conservative $1,000/incident adds $10,000/month

A team spending $22,000/month on data downtime can justify significant investment in monitoring tooling. For context, AnomalyArmor prices at $5/table/month for automated monitoring across schema drift, freshness, and anomaly detection.

How do you calculate data downtime?

Data downtime has two components that you measure separately and then combine:

Time to Detection (TTD)

TTD is the elapsed time between when a data issue occurs and when someone (or something) detects it. If a pipeline breaks at 2:00 AM and a data engineer notices at 10:00 AM, TTD is 8 hours.

Most teams discover their TTD is shockingly high. Without automated monitoring, the typical detection method is a Slack message from a business user: "Hey, the dashboard looks wrong." By that point, the issue has often been present for hours or days.

-- Measure TTD: compare when the issue started vs. when it was detected
-- Requires an incident log table with timestamps
SELECT
  incident_id,
  issue_started_at,
  issue_detected_at,
  TIMESTAMP_DIFF(issue_detected_at, issue_started_at, MINUTE) AS ttd_minutes
FROM data_incidents
WHERE resolved_at IS NOT NULL
ORDER BY ttd_minutes DESC;

-- Average TTD over the last 30 days
SELECT
  ROUND(AVG(TIMESTAMP_DIFF(issue_detected_at, issue_started_at, MINUTE)), 1) AS avg_ttd_minutes,
  MAX(TIMESTAMP_DIFF(issue_detected_at, issue_started_at, MINUTE)) AS max_ttd_minutes,
  COUNT(*) AS total_incidents
FROM data_incidents
WHERE issue_started_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

Time to Resolution (TTR)

TTR is the elapsed time between detection and full resolution. "Full resolution" means the data is correct and downstream consumers have been updated, not just that the pipeline is running again. A pipeline restart that reprocesses data but leaves a 3-hour gap in the destination table is not a full resolution.

-- Measure TTR per incident
SELECT
  incident_id,
  issue_detected_at,
  resolved_at,
  TIMESTAMP_DIFF(resolved_at, issue_detected_at, MINUTE) AS ttr_minutes,
  root_cause_category
FROM data_incidents
WHERE resolved_at IS NOT NULL
ORDER BY ttr_minutes DESC;

-- TTR breakdown by root cause
SELECT
  root_cause_category,
  COUNT(*) AS incidents,
  ROUND(AVG(TIMESTAMP_DIFF(resolved_at, issue_detected_at, MINUTE)), 1) AS avg_ttr_minutes,
  ROUND(AVG(TIMESTAMP_DIFF(issue_detected_at, issue_started_at, MINUTE)), 1) AS avg_ttd_minutes
FROM data_incidents
WHERE resolved_at IS NOT NULL
  AND issue_started_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY root_cause_category
ORDER BY incidents DESC;

Combining TTD and TTR

Total downtime per incident is simply TTD + TTR. To get your monthly data downtime:

-- Monthly data downtime in hours
SELECT
  DATE_TRUNC(issue_started_at, MONTH) AS month,
  COUNT(*) AS incidents,
  ROUND(SUM(
    TIMESTAMP_DIFF(resolved_at, issue_started_at, MINUTE)
  ) / 60.0, 1) AS total_downtime_hours,
  ROUND(AVG(
    TIMESTAMP_DIFF(issue_detected_at, issue_started_at, MINUTE)
  ), 1) AS avg_ttd_minutes,
  ROUND(AVG(
    TIMESTAMP_DIFF(resolved_at, issue_detected_at, MINUTE)
  ), 1) AS avg_ttr_minutes
FROM data_incidents
WHERE resolved_at IS NOT NULL
GROUP BY DATE_TRUNC(issue_started_at, MONTH)
ORDER BY month DESC;

A healthy target for mature data teams is less than 4 hours of total downtime per month across all pipelines. Most teams starting out measure in the range of 40-100+ hours per month.

Anatomy of a data downtime incident

To make data downtime concrete, here is a realistic example of how a single schema change cascades into 14 hours of downtime.

Timeline:

Time	Event
11:00 PM (Tue)	Partner API deploys v3, adding a required `currency_code` field to the payments endpoint and changing `amount` from integer cents to decimal dollars. No changelog published.
11:15 PM	Airflow ingestion DAG runs on schedule, pulls the new payload, and loads it into the `raw_payments` staging table. The DAG succeeds with no errors because the new field is simply an extra column.
11:30 PM	dbt runs the nightly transform. The `stg_payments` model casts `amount` as `INTEGER`, silently truncating `49.99` to `49`. Downstream `fct_revenue` now understates revenue by ~49%. The dbt run completes successfully.
6:00 AM (Wed)	The finance team opens the daily revenue dashboard for the morning standup. Numbers look "a little off" but within the range of normal daily fluctuation. No one raises a flag.
1:00 PM	A product manager notices that yesterday's conversion value in the marketing attribution report is half of what the ad platform shows. She Slack-messages the data team: "Is the revenue number right?"
1:15 PM	On-call data engineer begins investigating. Checks the Airflow logs (no errors). Checks dbt logs (no errors). Manually queries `raw_payments` and notices the `amount` field now has decimal values. Finds the new `currency_code` column.
2:00 PM	Engineer identifies the root cause: upstream schema change. Writes a fix for the `stg_payments` model to handle the new decimal format and adds the `currency_code` field.
3:00 PM	Fix is deployed, dbt full-refresh runs, downstream tables rebuilt. Finance confirms the numbers are correct. Incident closed.

Downtime calculation:

TTD: 11:00 PM to 1:00 PM next day = 14 hours
TTR: 1:00 PM to 3:00 PM = 2 hours
Total downtime: 16 hours
Revenue dashboard was wrong for 14 hours before anyone noticed

With automated schema change monitoring, the new currency_code column and the type change on amount would have triggered an alert at 11:15 PM, cutting TTD from 14 hours to 15 minutes.

What causes data downtime?

Data downtime has five primary root causes. Understanding the distribution helps you prioritize where to invest in prevention.

Root cause	Typical % of incidents	Example
Schema changes	25-35%	An upstream API adds a new required field, breaking the ingestion job
Data freshness failures	20-30%	A scheduled pipeline silently fails and no new data arrives
Data volume anomalies	15-20%	A source table that normally has 1M rows/day suddenly has 100 rows
Data distribution anomalies	10-15%	A column that's normally 2% null jumps to 40% null
Code/logic changes	10-15%	A dbt model refactor introduces a join that drops 30% of rows

Schema changes and freshness failures together account for roughly half of all data downtime. This is why most data observability tools prioritize automated schema change detection and freshness monitoring as their first capabilities.

How do you reduce TTD?

Reducing TTD is the highest-leverage improvement most data teams can make. Moving from "business user reports a problem" to "automated alert fires within minutes" typically cuts TTD from hours or days down to single-digit minutes.

1. Automated freshness monitoring

Check every table's last-updated timestamp against its expected schedule. If orders is normally updated by 6:00 AM and it's 6:30 AM with no new rows, fire an alert immediately.

-- Freshness check: flag tables that haven't been updated on schedule
SELECT
  table_name,
  expected_update_interval_hours,
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), last_updated_at, HOUR) AS hours_since_update
FROM table_metadata
WHERE TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), last_updated_at, HOUR) > expected_update_interval_hours;

2. Schema change detection

Compare the current schema of every table against its last known schema. Any added, removed, or type-changed column triggers an alert before downstream models run.

3. Volume anomaly detection

Track row counts over time and flag statistically significant deviations. A table that normally receives 500K-600K rows per day but suddenly receives 50K is almost always broken.

4. Distribution monitoring

Track key column statistics (null rate, distinct count, min/max, mean) and flag when they drift outside historical norms. This catches subtle data quality issues that volume checks miss.

5. Circuit breakers in pipelines

Add pre-load validation steps that halt a pipeline if the source data fails basic sanity checks. This prevents bad data from propagating downstream and turns a multi-table incident into a single-table incident.

How do you reduce TTR?

TTR reduction requires operational investment in tooling, documentation, and incident response processes.

1. Automated root cause analysis

When an alert fires, the monitoring system should immediately surface: which table is affected, what changed, when it changed, and which upstream source is responsible. Without this context, engineers waste 30-60 minutes just figuring out where to look.

2. Lineage-aware alerting

If a source table breaks, don't fire separate alerts for every downstream table that inherits the problem. Use data lineage to identify the root table and alert on that, with a note about the blast radius of affected downstream assets.

3. Runbooks per failure type

Document the fix for each common failure mode. Schema change on the payments API? Here's the runbook. Freshness failure on the Snowflake warehouse? Here's the runbook. When an incident fires at 3:00 AM, the on-call engineer should not be debugging from scratch.

4. Automated remediation

For known failure patterns, automate the fix entirely. If a pipeline fails because of a transient API timeout, retry automatically. If a source table arrives late but eventually shows up, backfill automatically once the data lands. Reserve human intervention for novel failures.

5. Data SLAs with upstream teams

Formalize agreements with upstream data producers about schema change notification windows, expected freshness, and volume ranges. When upstream teams know that unannounced schema changes cause downstream incidents, they're more likely to communicate proactively.

Data downtime vs application downtime

Data downtime and application downtime are related concepts that require different detection strategies. Application monitoring tools (Datadog, PagerDuty, New Relic) do not catch data downtime because data issues are often invisible at the infrastructure layer.

Dimension	Application downtime	Data downtime
Definition	Service is unavailable or unresponsive	Data is missing, stale, or incorrect
Detection	Health checks, HTTP status codes, latency metrics	Schema checks, freshness SLAs, volume/distribution anomalies
Visibility	Obvious (users see errors, pages don't load)	Silent (dashboards render but show wrong numbers)
Typical TTD	Seconds to minutes (automated monitoring is standard)	Hours to days (many teams still rely on manual detection)
Blast radius	Users of the affected service	Every downstream consumer of the affected data
Tooling	Datadog, PagerDuty, New Relic, Prometheus	AnomalyArmor, Monte Carlo, Metaplane, Great Expectations
Cultural maturity	Well-established SRE practices	Emerging "data SRE" or "data reliability engineering"

The key difference is visibility. When an application goes down, users immediately see error pages and the team gets paged. When data goes bad, the pipeline still runs, the dashboard still renders, and nobody knows the numbers are wrong until a human spots the discrepancy. This is why automated data anomaly detection is critical.

How is data downtime different from data observability?

Data downtime is a metric. Data observability is a practice.

Data downtime measures the outcome: how much time your data spent in an unusable state. Data observability is the set of tools, processes, and practices that let you detect, diagnose, and resolve data issues, thereby reducing downtime.

The relationship is similar to application reliability engineering. Application uptime is the metric. Site reliability engineering (SRE) is the practice. You measure uptime to evaluate how well your SRE practices are working, and you invest in SRE to improve uptime.

A data team with good observability will have low downtime. But observability alone is not enough. You also need incident response processes, data SLAs, and a culture of treating data issues with the same urgency as application outages.

The DAMA DMBOK describes this as "data quality management," which includes establishing quality standards, measuring against them, and continuously improving. Data observability is the modern, tooling-driven implementation of that principle applied to production data pipelines.

What is a good data downtime benchmark?

Benchmarks vary by industry and data maturity, but general guidelines based on industry reports and practitioner surveys:

Maturity level	Monthly downtime	TTD	TTR
No monitoring	100+ hours	Days	Hours to days
Basic (manual checks, dbt tests)	40-80 hours	Hours	Hours
Intermediate (automated alerts)	10-30 hours	Minutes to 1 hour	1-4 hours
Advanced (full observability)	< 4 hours	< 5 minutes	< 1 hour

The biggest jump happens between "no monitoring" and "intermediate." Adding automated freshness and schema monitoring alone can cut TTD by 90% or more. The jump from intermediate to advanced requires investment in lineage, automated root cause analysis, and incident response processes.

Data uptime SLA targets

Teams that formalize data reliability use SLA-style targets, similar to how application teams use "nines" of uptime:

Data uptime target	Allowed downtime per month	Typical team profile
99.9% (three nines)	~43 minutes	Tier-1 financial/compliance pipelines
99.5%	~3.6 hours	Mature data teams with full observability
99%	~7.3 hours	Teams with automated monitoring, some manual steps
95%	~36 hours	Teams with basic monitoring and ad-hoc incident response
< 90%	73+ hours	No systematic monitoring

For most teams, 99.5% data uptime (under 4 hours/month) is a reasonable first target. Achieving it requires automated TTD (monitoring catches issues in minutes, not hours) and documented TTR processes (runbooks, automated remediation for common failures).

How do you track data downtime over time?

Tracking downtime requires an incident log. Every detected data issue should be recorded with timestamps for when it started, when it was detected, and when it was resolved.

Most teams track this in one of three ways:

Dedicated incident table: A table in your warehouse with one row per incident, populated automatically by your monitoring tool or manually during incident response.
Incident management tool: PagerDuty, Opsgenie, or a similar tool that already tracks TTD and TTR for application incidents. Add data incidents to the same workflow.
Observability platform metrics: Tools like AnomalyArmor, Monte Carlo, and Metaplane track incidents and resolution times natively, providing dashboards for downtime trends without manual logging.

The key is consistency. If you only log some incidents, your downtime metric will be artificially low and you will not see the improvement trend when you invest in better monitoring.

Data downtime incident response plan

Teams that resolve data incidents quickly share a common trait: a documented response plan that engineers follow before they start debugging. Here is a minimal template:

Step 1: Assess blast radius. Which tables, dashboards, and teams are affected? Use data lineage if available. Notify impacted stakeholders immediately, even before the root cause is known.

Step 2: Stop the bleeding. If bad data is actively flowing downstream, pause the pipeline or add a circuit breaker. It is better to have stale data than actively wrong data.

Step 3: Identify root cause. Check: Did the schema change? Is the source table fresh? Is the row count normal? Are column distributions within range? Start with the most common causes (schema, freshness) before investigating rare ones.

Step 4: Fix and validate. Apply the fix, backfill affected data, and verify correctness with stakeholders. A pipeline that runs green is not enough. Confirm that the output numbers match expectations.

Step 5: Update the incident log. Record TTD, TTR, root cause, and the fix applied. This data feeds your downtime tracking and helps identify recurring patterns.

Step 6: Prevent recurrence. Add monitoring that would have caught this issue earlier. Update runbooks. If the root cause was an unannounced upstream change, follow up with the upstream team about notification SLAs.

Data Downtime FAQ

What is data downtime?

Data downtime is the total period during which data is missing, erroneous, or otherwise unfit for use by downstream consumers. It is measured as the sum of Time to Detection (TTD) and Time to Resolution (TTR) across all incidents in a given period. The formula is: Data Downtime = (TTD + TTR) x Number of Incidents.

What is the difference between TTD and TTR?

Time to Detection (TTD) is the elapsed time between when a data issue occurs and when it is noticed. Time to Resolution (TTR) is the elapsed time between detection and full resolution, meaning the data is correct and downstream systems have been updated. TTD measures how fast you find problems. TTR measures how fast you fix them.

How much does data downtime cost?

The cost depends on engineering time and business impact. A typical incident involving 2 engineers for 6 hours at $100/hour loaded cost is $1,200 in engineering time alone. Business impact (bad decisions, manual reconciliation, compliance risk) often adds $1,000-5,000 per incident. A team with 10 incidents per month can easily spend $20,000+/month on data downtime.

How much data downtime is normal?

Teams without automated monitoring typically experience 100+ hours of data downtime per month. Teams with basic monitoring (freshness checks, dbt tests) average 40-80 hours. Teams with full data observability platforms target less than 4 hours per month. Your starting point depends on the number of pipelines, upstream sources, and the rate of change in your data environment.

What is the biggest cause of data downtime?

Schema changes and freshness failures together account for roughly 50-60% of data downtime incidents. Schema changes are particularly damaging because they often cascade through multiple downstream models before detection. Freshness failures are common because scheduled pipelines fail silently unless explicitly monitored.

How do you reduce data downtime without buying a tool?

Start with three free practices. First, add freshness checks to your orchestrator (Airflow, dbt, Dagster) that verify table update timestamps after each run. Second, add row count assertions that compare today's load volume against a trailing average. Third, create a shared incident log (even a spreadsheet) to track TTD and TTR so you have a baseline to measure improvement against.

What tools help reduce data downtime?

Data observability platforms including AnomalyArmor, Monte Carlo, Metaplane, and Bigeye provide automated monitoring for freshness, schema changes, volume anomalies, and distribution drift. Open-source tools like Great Expectations and Soda Core handle rule-based validation checks. AnomalyArmor offers automated anomaly detection at $5/table, roughly half the cost of comparable commercial tools.

Is data downtime the same as pipeline failure?

No. Pipeline failure is one cause of data downtime, but not the only one. A pipeline can succeed (run to completion, no errors) and still produce bad data. For example, a pipeline that ingests data from an API where the API silently changed its schema will run successfully but load incorrect data. Data downtime captures all cases where data is unusable, regardless of whether the pipeline itself reported a failure.

What is the difference between data downtime and application downtime?

Application downtime means a service is unavailable (users see errors or pages don't load). Data downtime means data is present but wrong (dashboards render but show incorrect numbers). Application downtime is immediately visible. Data downtime is silent until someone checks. Application monitoring tools like Datadog do not detect data downtime because the infrastructure appears healthy even when the data is not.

What is a data downtime SLA?

A data downtime SLA is a formal commitment to maintain a target level of data uptime, measured as a percentage of total hours in a period. For example, a 99.5% monthly data uptime SLA allows roughly 3.6 hours of downtime per month. Teams define SLAs per pipeline tier: critical pipelines (revenue, compliance) get stricter targets than exploratory or internal-only pipelines.

How does data downtime relate to data quality dimensions?

Data downtime is the time-based consequence of failures across any of the six data quality dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness. A completeness failure (missing rows) causes downtime from the moment rows stop arriving until backfill completes. A timeliness failure (stale data) causes downtime from the missed SLA until the refresh runs. Downtime is the unifying metric that converts dimension-level failures into business impact.

Should data downtime be tracked like application uptime?

Yes. Leading data teams apply the same SLA/SLO framework used for application reliability to data pipelines. Define a target (e.g., 99.5% data uptime per month, which allows roughly 3.6 hours of downtime), measure against it, and treat breaches with the same urgency as application outages. This approach, sometimes called "data SRE," is gaining adoption at companies that treat data as a production service rather than a back-office function.

Can you have zero data downtime?

In theory, yes. In practice, no. Data pipelines depend on external sources, third-party APIs, upstream teams, and infrastructure that will eventually fail. The goal is not zero downtime but rapid detection and resolution. A team with 15 incidents per month but a 5-minute TTD and 20-minute TTR will have less total downtime than a team with 2 incidents per month but an 8-hour TTD and 6-hour TTR.

How do you create a data incident response plan?

Start with six steps: assess blast radius, stop bad data from flowing, identify root cause, fix and validate, update the incident log, and prevent recurrence. Document common root causes (schema changes, freshness failures, volume drops) with specific runbooks for each. The goal is that any on-call engineer can resolve common incidents without escalation.

Data downtime shrinks when detection is automated. See how AnomalyArmor monitors freshness, schema changes, and anomalies across your data pipelines to cut TTD to minutes.

State of Data Engineering 2026: Why Data Teams Spend 60% of Their Time Firefighting

Blaine Elliott — Sun, 12 Apr 2026 17:43:27 +0000

It's 9am. You planned to build a new pipeline today. Instead you're debugging why the revenue dashboard shows zeros, tracing a stale table through three upstream dependencies, and explaining to a VP that yesterday's numbers were wrong. By noon you've fixed the fire but built nothing.

This is normal for most data teams. And the 2026 State of Data Engineering Survey (1,101 respondents) now has the numbers to prove it. The interactive explorer lets you query the raw data yourself.

Key findings from the 2026 survey

Before the deeper cut, here's what the survey found across 1,101 data professionals:

82% use AI tools daily (code generation dominates at 82%, documentation at 56%)
42% expect their teams to grow in 2026
43.8% run on cloud data warehouses, 26.8% on lakehouses
90% report data modeling pain points
52.2% say organizational challenges are their biggest bottleneck (vs 25.4% technical debt)

The AI and team growth numbers got the headlines. The time allocation data tells a more important story.

How data engineers actually spend their time in 2026

Two stats from the survey:

34% of time goes to data quality and reliability
26% goes to firefighting

That's 60% of a data engineer's week reacting to problems. Not building pipelines. Not designing models. Reacting.

When asked about their biggest bottleneck, only 10.1% cited data quality. Legacy systems (25.4%), lack of leadership direction (21.3%), and poor requirements (18.8%) all ranked higher.

Data engineers spend most of their time on reactive data quality work but don't identify it as their biggest problem. They've normalized it. Firefighting isn't a crisis. It's the job.

Ad-hoc data modeling doubles firefighting time

The survey's most actionable finding: ad-hoc data modeling (17.4% of respondents) correlates with 38% of time spent firefighting. Teams using canonical or semantic models spend 19%. Half the fires, same job.

But 59.3% of respondents cited "pressure to move fast" as their top modeling pain point, followed by "lack of clear ownership" at 50.7%.

The cycle: pressure to move fast leads to ad-hoc decisions, which create data quality issues, which create fires, which consume the time needed to do things properly. The pressure increases because you're behind.

How to reduce data engineering firefighting

Three things the survey data supports:

1. Assign data quality ownership. 50.7% cited lack of ownership as a top pain point. When quality is everyone's responsibility, it's nobody's responsibility.

2. Invest in data modeling. Teams with canonical models spend half as much time firefighting. The "move fast" pressure is self-defeating when it creates the fires that slow you down.

3. Automate the detection layer. This is the highest-leverage fix for teams that can't reorganize overnight. You can't prevent every schema change, stale table, or anomaly. But you can find out about them in minutes instead of hours.

The difference between a 30-minute fire and a half-day fire is almost always detection speed. A schema change that breaks a pipeline at 2am is a 5-minute fix if you get an alert at 2:05am. It's a 4-hour investigation if the CFO finds it at 9am. (For a deeper look at how this works in practice, see how data freshness monitoring catches stale tables and setting up data quality monitoring for Snowflake and Databricks.)

Automated schema change detection, freshness monitoring, and anomaly alerts compress the gap between "something broke" and "we know about it." That's the gap where firefighting time lives. AnomalyArmor is built specifically for this: monitoring across Snowflake, Databricks, BigQuery, Redshift, and PostgreSQL with alerts in minutes. Email support@anomalyarmor.ai for a trial code.

How to Set Up Data Quality Monitoring in Minutes, Not Hours

Blaine Elliott — Sun, 12 Apr 2026 17:37:55 +0000

You sign up for a data quality tool. You land on an empty dashboard. There's a button that says "Add Connection." You click it, paste your credentials, wait for discovery to finish, and then... nothing obvious to do next.

You poke around. Maybe you find a freshness tab. Maybe you set up an alert. Maybe you close the tab and never come back.

This is how most data observability tools lose customers. Not because the product is bad, but because nobody showed you what to do with it.

We measured the gap. Without guidance, the median time to configure a first freshness monitor in AnomalyArmor was over 40 minutes. With our new guided onboarding, it's under 8. That's the difference between a tool that gets adopted and a tool that gets abandoned during the trial.

TL;DR: AnomalyArmor now has guided onboarding that gets you to your first live data monitor in under 8 minutes. A pre-loaded demo database lets you learn without connecting production. No guesswork, no empty dashboards, no "figure it out yourself."

Why data quality tools have an onboarding problem

Data tools have a unique setup challenge. Unlike a project management app where you create a board and start dragging cards, data observability requires multiple sequential steps before you see any value:

Connect a database
Run schema discovery
Understand what was found
Configure monitoring
Set up alerts
Wait for something to happen

Most users drop off somewhere between steps 2 and 4. They connected their database. Discovery ran. Now there are 200 tables on the screen and no clear next step.

According to Appcues research, 40-60% of users who sign up for a SaaS product will use it once and never come back. For data tools, that number is likely higher because the setup complexity is steeper. Every minute between "signed up" and "seeing value" increases the probability that someone closes the tab and moves on to the next tool in their evaluation.

We decided to fix this.

How AnomalyArmor's guided onboarding works

Instead of dropping you into an empty dashboard, AnomalyArmor starts a guided walkthrough the moment you sign up. It's built around a chapter system where each chapter teaches one capability by having you actually use it.

This is not a product tour. Product tours are overlays that point at every button on the screen and say "this is the sidebar" while you click "Next" fourteen times. Nobody learns anything from those.

GIF: Record the Intro or Connect chapter. Show the spotlight overlay dimming the rest of the screen while highlighting a specific UI element (like the navigation sidebar or the "Add Connection" button). The tooltip popover should be visible with a title, description, and "Next" or action button. Capture 2-3 steps advancing to show the flow of moving through a chapter.

Each chapter uses a spotlight overlay to highlight specific UI elements, explain what they do, and guide you through real actions. Steps don't advance until you've completed the required action, so you're building hands-on familiarity, not just reading tooltips.

A demo database you can explore on day one

The first thing we did was remove the cold start problem entirely.

When you sign up, you get a pre-configured demo database called BalloonBazaar. It has 4 schemas, 24 tables, and 147 columns of realistic e-commerce data. It comes pre-loaded with actual issues: stale tables, schema changes, anomalous patterns, the kinds of problems you'd find in a real data pipeline.

SCREENSHOT: The asset list page with the BalloonBazaar demo database expanded. Should show the schema tree (bronze, silver, gold, raw) with tables nested underneath. Ideally capture a state where at least one table shows a freshness violation badge or a schema change indicator, so the reader can see that the demo data comes with real issues out of the box.
You don't need to connect your own database to start learning. You can explore schema drift on the demo data, set up freshness monitors, configure alerts, and see what AnomalyArmor catches. All without risking your production credentials during a tire-kicking session.

The demo data is flagged internally so it doesn't count against your usage. It's there for learning, not billing.

Want to try it right now? Sign up and the demo database is waiting. No sales call.

The core onboarding path: first monitor in minutes, full coverage when you're ready

The core path has five chapters. The first four get you to a live freshness monitor in under 8 minutes. The fifth adds alerting so issues reach you where you work. Here's the breakdown:

Chapter	What you do	What you'll have when it's done
Intro	Quick orientation: navigation, alerts overview, getting help	Familiarity with the AnomalyArmor interface
Connect	Walk through the database connection form	Understanding of how to add your own databases later
Discover	Run schema discovery, explore tables and columns	Visibility into every table, column, and type in your database
Freshness	Configure a freshness monitor, set intervals and thresholds	Live freshness monitoring that tells you when tables go stale
Alerts	Set up email, Slack, or webhook notifications	Alert delivery so issues reach you where you already work

Once you've got monitoring and alerts running, nine optional chapters let you go deeper: alert routing rules, data quality metrics, correctness checks, lineage tracking, AI-powered intelligence, data tagging, team administration, and CLI/agent workflows. Tackle them at your own pace, in any order.

SCREENSHOT: The chapter selection / learning page showing all 14 chapters. The core path chapters (Intro, Connect, Discover, Freshness, Alerts) should show as completed or in-progress with checkmarks or progress bars. The optional chapters (Alert Rules, Metrics, Correctness, Lineage, Intelligence, Tags, Admin, MCP) should show as available but not started, so the reader can see the breadth of coverage and the progress tracking.

Three step types that teach, not just tour

Each step in a chapter is one of three types, and the distinction matters:

Observation steps highlight something on the screen and explain what it does. You read, you understand, you move on. These are for context, like understanding what the freshness chart axes represent.

Action steps require you to actually do something: click a button, fill in a form, make a selection. The step doesn't advance until you've taken the action. This is where the learning happens, because you're building muscle memory, not just reading instructions.

Wait steps pause while something async completes. When you trigger schema discovery, the step waits for discovery to finish before advancing. No "click here after it's done" guesswork. The system knows when the job is done and moves you forward automatically.

GIF: Record the Freshness chapter. Start from the step where the spotlight highlights the freshness configuration panel on a demo table (e.g. bronze_orders). Show the user setting a check interval, defining a staleness threshold, and clicking save/enable. Then show the freshness check kicking off and the step auto-advancing once the check completes. This is the "aha moment" where the user sees live monitoring working for the first time.

The system tracks your progress per chapter. You can pause mid-chapter, close the browser, come back next week, and pick up where you left off. You can also replay any chapter if you want a refresher.

Why onboarding quality decides which data tool your team adopts

Data observability is not a solo activity. You set it up, your team uses it. If the person who signed up can't get to value quickly, the tool never reaches the rest of the team.

The evaluation pattern is predictable: one engineer evaluates three tools over a week, picks the one they figured out fastest, and rolls it out. The product with the best onboarding wins the evaluation, even if a competitor has more features on paper.

Pendo's 2024 State of Software report found that feature adoption, not feature count, is the strongest predictor of retention. Users who activate three or more features in their first session are 3x more likely to convert. That's exactly what guided onboarding is designed to do: get you to schema discovery, freshness monitoring, and alerting in a single sitting.

Our target: within minutes of signing up, you should have freshness monitoring running on real tables with alerts going to your Slack channel. Everything in the onboarding flow is designed to get you there.

GIF: Record the Alerts chapter. Show the spotlight guiding the user to add a new alert destination (Slack is the most visual). Walk through selecting Slack, connecting the channel, and sending a test alert. End with the test notification appearing in the Slack channel preview or the success confirmation in the UI. This shows the full loop: monitoring detects an issue, alert reaches you where you work.

How we keep improving it

We track onboarding analytics internally: chapter completion rates, drop-off points, time to complete each chapter, and completion trends over time. This isn't vanity metrics. When we see a chapter with a high drop-off rate, we know the steps are confusing and we rewrite them.

Every chapter is scored against a quality rubric with six dimensions: clarity, value demonstration, action quality, pacing, error recovery, and completion momentum. If a chapter scores below our threshold, it gets reworked before it ships.

We treat onboarding like a product feature, not an afterthought. For most users evaluating data quality tools, onboarding IS the product. If they don't get through it, nothing else matters.

Get started with data quality monitoring in minutes

AnomalyArmor's guided onboarding starts automatically when you sign up. The demo database is pre-loaded. You'll have your first live monitor running in under 8 minutes, with alert delivery configured shortly after.

No credit card. No sales call. No staring at an empty dashboard wondering what to click.

Start the guided onboarding now

Key takeaways:

Most data observability tools lose users between "connected" and "configured" because setup is complex and unguided
AnomalyArmor's guided onboarding uses interactive chapters with spotlight overlays, not passive product tours
A pre-loaded demo database (BalloonBazaar) eliminates the cold start problem, so you can learn without connecting production
First live freshness monitor in under 8 minutes (down from 40+ without guidance)
Full core path covers connection, discovery, monitoring, and alerting
Nine optional chapters cover the full product surface: alert rules, metrics, correctness, lineage, AI intelligence, tagging, admin, and CLI workflows

Have questions about setting up data quality monitoring? Email blaine@anomalyarmor.ai. I'll walk you through it.

AI Data Quality Monitoring: Why Most Tools Stop at Tactical AI

Blaine Elliott — Sun, 12 Apr 2026 17:37:53 +0000

Your data observability tool just sent you 47 alerts. Three dashboards are showing anomalies. A stakeholder is asking why the numbers in their report changed. You open your "AI-powered" monitoring tool, and it waits for you to ask the right question.

This is tactical AI. And it's where most data quality tools stop.

The real opportunity is strategic AI: monitoring that thinks proactively about your data problems, surfaces patterns you didn't know to look for, and tells you what to fix before anyone notices something is broken.

Understanding the difference explains why some AI data quality features feel genuinely useful while others feel like marketing checkboxes.

What is Tactical AI in Data Quality Monitoring?

Tactical AI handles reactive observations and analysis. You ask a question, it retrieves information and presents it clearly.

Examples of tactical AI in data observability:

"What columns does the orders table have?"
"When was user_events last updated?"
"What freshness violations do I have right now?"
"What's the blast radius if dim_customers goes down?"

This is AI as an intelligent interface to your data catalog. It saves you from clicking through dashboards, writing queries, or holding complex lineage relationships in your head. Good tactical AI can even correlate information across domains, connecting a schema change to a downstream freshness issue.

But tactical AI is fundamentally reactive. You ask, it answers. You have to know what questions to ask. You have to initiate every interaction. You have to do all the thinking about what might be wrong.

When you have 47 alerts and an angry stakeholder, tactical AI makes you play detective. It hands you a magnifying glass and wishes you luck.

What is Strategic AI in Data Quality Monitoring?

Strategic AI does something fundamentally different. It doesn't wait for questions. It thinks about your data problems autonomously.

Here's a concrete example:

The scenario: Your revenue_daily table failed a freshness check this morning. Three dashboards are showing stale data. The CFO is asking questions.

Tactical AI response: You ask "why is revenue_daily stale?" It tells you the upstream orders table hasn't updated. You ask "why hasn't orders updated?" It tells you there was a schema change yesterday. You ask "what changed?" It shows you a column rename. Fifteen minutes of detective work to find a two-minute fix.

Strategic AI response: You open your monitoring tool and it tells you: "The freshness failure in revenue_daily was caused by yesterday's schema change in orders, when order_status was renamed to status. This broke the ETL job at line 47 of transform_orders.sql. Similar pattern to the incident on January 3rd, which was resolved by updating the column reference. Here's the specific change needed."

Same incident. One approach makes you investigate. The other hands you the answer.

Strategic AI for data observability reasons about:

Root causes, not symptoms. Instead of telling you what's broken, it hypothesizes why things keep breaking. It identifies systemic data quality issues across your entire data estate.

Behavioral patterns over time. Which tables are high-risk based on historical incident rates? Which pipelines are fragile? Which data producers cause the most downstream issues? Strategic AI tracks these patterns and surfaces them unprompted.

Options and tradeoffs. When something needs fixing, strategic AI doesn't just flag the problem. It proposes solutions, explains the tradeoffs, and helps you decide.

Proactive alerts before incidents. Strategic AI notices that a table's null rate is trending upward over three days, or that a schema change is about to break two downstream consumers, and warns you before the incident happens.

Learning from your resolutions. When you fix an alert, strategic AI remembers how. When similar patterns emerge, it suggests the same resolution. When you consistently ignore certain alert types, it asks if those rules should be adjusted.

The difference is autonomy. Tactical AI is a tool you use. Strategic AI is a collaborator that thinks alongside you.

Why Most AI Data Observability Tools Are Stuck on Tactical

Almost every "AI-powered" data quality tool today is purely tactical. They've added chat interfaces to their metadata catalogs. Some can answer sophisticated questions. A few can correlate across domains.

But none of them think proactively:

They don't tell you "here are the three issues you should worry about today, and here's why"
They don't notice that your data quality is degrading in a specific pattern
They don't learn from how you resolve incidents and apply those patterns to new situations
They don't warn you about problems before they become incidents

Tactical AI is useful. It's where everyone has to start. It's where AnomalyArmor is starting. But it's also becoming table stakes. Every tool will have a chat interface within a year. The real differentiation in AI data quality monitoring comes from AI that understands your data deeply enough to be proactive. We're building a path to reach that objective.

The cost of staying tactical: A 2024 study found data teams spend 40% of their time on data quality issues. Most of that time is investigation, not resolution. Strategic AI compresses investigation from hours to seconds.

Building Proactive AI Data Quality Monitoring

You can't skip tactical AI to get to strategic. The foundation matters.

Strategic AI requires rich context: schema metadata, lineage graphs, historical incidents, resolution patterns, freshness trends, validity rules, team ownership. If the tactical layer can't access and correlate this information, the strategic layer has nothing to reason about.

The path to proactive data monitoring:

Phase 1: Comprehensive context. The AI needs access to everything: schema changes, freshness status, alert history, lineage relationships, data quality metrics, user actions. Most tools only expose a fraction of this to their AI layer.

Phase 2: Cross-domain correlation. The AI connects information across domains. A schema change in orders caused a freshness failure in revenue_daily which triggered anomalies in the CFO dashboard. This requires deep understanding, not keyword matching.

Phase 3: Pattern recognition over time. The AI needs memory. What happened last month? What patterns recur? Which resolutions worked? This is where tactical becomes strategic.

Phase 4: Autonomous reasoning. The AI synthesizes patterns into recommendations without being asked. It surfaces what matters before you know to look for it.

What Strategic AI Data Quality Looks Like in Practice

Proactive AI data monitoring looks different from today's chat interfaces.

Morning briefings. You open your data observability tool at 9am and it tells you:

"Three things need attention today:

user_events has had increasing null rates in session_id for 5 days. Downstream tables session_metrics and user_journeys are starting to show anomalies. Likely cause: the mobile app update on Monday.

The ETL job for inventory_snapshot failed twice this week with the same timeout pattern I saw last month. That was resolved by increasing the batch size. Here's the config change.

Team Platform pushed a schema change to api_logs that will break the error_rates dashboard when it propagates tonight. They should coordinate with the analytics team first."

No questions asked. No investigation required. Just: here's what matters, here's why, here's what to do.

Automated incident analysis. When something breaks, the AI doesn't just show you what's broken. It investigates automatically:

"This freshness failure in revenue_daily correlates with yesterday's schema change in orders by user jsmith. The column order_status was renamed to status. This matches the pattern from the January 3rd incident, which was resolved by updating line 47 of transform_orders.sql. Suggested fix: change order_status to status in the SELECT clause."

Proactive risk identification. After observing your data estate for months, the AI notices:

"Your three highest-risk tables are orders, user_events, and payments. Combined, they've caused 73% of downstream incidents this quarter. None have SLAs defined. Adding freshness SLAs would reduce incident impact by an estimated 60%. Here's a suggested configuration."

Resolution learning. The AI tracks how you fix things:

"You've resolved 12 freshness alerts for daily_aggregates in the past month by re-running the Airflow DAG. Should I suggest automatic retry as the first resolution step for this table?"

This is AI as a thinking partner for data engineering teams, not just a query interface.

The Future of AI in Data Observability

Data engineering teams are drowning in signals. Every monitoring tool produces alerts. Every dashboard shows metrics. The job isn't collecting more data quality information. The job is knowing what matters and what to do about it.

Tactical AI helps you find information faster. Strategic AI helps you understand what the information means and what actions to take.

The data observability platforms that win will be the ones that make the leap from reactive to proactive. From answering questions to anticipating them. From flagging problems to solving them.

Where AnomalyArmor Fits

We're building toward strategic AI for data quality monitoring. Today, we have a strong tactical foundation. Tomorrow, we're aiming for something more ambitious.

What's live today:

AI Q&A across your schema, lineage, freshness, and alerts
Cross-domain correlation that connects schema changes to downstream impact
Natural language investigation: "What changed in orders this week?" "Why are there nulls in customer_id?"
Git blast radius that links data issues to the commits and authors responsible

What we're building toward:

Proactive daily briefings that surface issues before you look for them
Pattern recognition across your incident history
Autonomous recommendations based on how you've resolved similar issues
Predictive alerts that warn you before the incident happens

We're not just adding chat to a dashboard. We're building the foundation for AI that thinks about your data quality so you can focus on building.

Try AnomalyArmor and see the difference between AI that waits for questions and AI that has answers ready.

Questions about our AI approach? Email blaine@anomalyarmor.ai. I'll show you exactly where we are on the tactical-to-strategic journey.

Why We Open-Sourced Our Database Query Layer

Blaine Elliott — Sun, 12 Apr 2026 17:32:21 +0000

When you connect a data quality tool to your database, you're trusting that tool with access to your data. Most tools ask you to just trust them. We decided to show our work.

Every query AnomalyArmor runs against your database goes through our Query Security Gateway. The gateway is open source. You can read every line of code. You can verify exactly what we're allowed to do.

GitHub: https://github.com/anomalyarmor/anomalyarmor-query-gateway
PyPI: https://pypi.org/project/anomalyarmor-query-gateway/

The trust problem

Data quality tools need database access to do their job. Schema discovery requires reading metadata. Freshness monitoring requires checking timestamps. Anomaly detection requires looking at distributions.

But customers have legitimate concerns. What queries are you actually running? Could you read our customer data? How do we know you're not doing more than you say?

"Trust us" isn't a good enough answer. Especially when the data is sensitive.

Three access levels

We built the gateway around three access levels. You choose how much access to grant based on your security requirements.

Schema Only: The most restrictive. We can query metadata tables (information_schema, pg_catalog, system tables) but nothing else. You get schema discovery and basic tagging. No access to actual table data.

Aggregates: We can run aggregate functions: COUNT, SUM, AVG, MIN, MAX. No raw values. This enables freshness monitoring (checking MAX(updated_at)), row counts, null rates, and statistical distributions. We never see individual records.

Full: Unrestricted read access. This enables improved tagging and intelligence features that sample values to detect patterns. For example, detecting that a column named "data" actually contains Social Security numbers.

Most customers use Aggregates. You get the monitoring features without exposing raw data.

How it works

The gateway sits between AnomalyArmor and your database. Every query passes through it. The gateway parses the SQL, validates it against your access level, and blocks anything that doesn't comply.

Your Query → Gateway → Parser → Validator → Database
                          ↓
                    Audit Logger

If you've set Aggregates access and something tries to run SELECT email FROM users, the gateway blocks it. Doesn't matter if it's a bug in our code or a misconfigured feature. The query never reaches your database.

Every query attempt is logged. You can audit what we ran and what we tried to run.

Why open source

We published the gateway code for a few reasons.

First, transparency. You shouldn't have to take our word for how the access levels work. Read the code. The validator logic is right there. If we say "aggregates mode only allows aggregate functions," you can verify that claim yourself.

Second, security review. Open source means security researchers can audit it. If there's a bypass or a flaw in our logic, someone can find it and report it. Closed source security is security through obscurity.

Third, trust through verification. When your security team asks "how does this tool handle database access," you can point them to a GitHub repo instead of a marketing page.

Defense in depth

We don't just rely on the gateway. There are two layers of enforcement.

The first layer checks features. Before any SQL is constructed, we check if your access level permits that feature. Trying to run freshness monitoring with Schema Only access? Blocked at the feature layer. You never even see a query.

The second layer is the gateway. It parses and validates the actual SQL. This catches anything that somehow bypasses the feature layer. If a bug in our code constructs a query it shouldn't, the gateway stops it.

Both layers have to allow the operation. If either blocks, nothing runs.

What this means for you

When you connect AnomalyArmor to your database, you choose your access level. The default is Full, for maximum monitoring capability. But you can restrict it at any time.

Some customers use Schema Only on production databases and Full on staging. Some use Aggregates everywhere. You can set a company-wide default and override it per data source.

You can change levels whenever you want. Downgrading disables features that require higher access. Upgrading enables them. No migration, no reconfiguration.

The features at each level

Schema Only gets you:

Schema discovery (tables, columns, types)
Basic tagging (inferred from column names and types)
Basic intelligence (metadata-based insights)

Aggregates adds:

Row counts
Freshness monitoring
Null and completeness checks
Cardinality (distinct counts)
Numeric statistics (min, max, average)

Full adds:

Improved tagging (samples values to detect patterns)
Improved intelligence (value-based insights)

Most data quality monitoring works fine with Aggregates. Full is for when you want the AI to analyze actual values to find things like PII in unexpected columns.

Check it yourself

The gateway code is at https://github.com/anomalyarmor/anomalyarmor-query-gateway. It's Apache 2.0 licensed. Read it, fork it, run the tests.

If you find a security issue, email security@anomalyarmor.ai. We take reports seriously.

This is how we think data tools should work. Not "trust us," but "verify us."

Ready to try data observability with transparent security? Sign up for AnomalyArmor and choose your access level when you connect your database.

Data Quality Tools in 2026: What to Actually Look For

Blaine Elliott — Sun, 12 Apr 2026 17:32:19 +0000

Every data quality vendor has a features page with the same checkboxes. Schema monitoring. Freshness tracking. Anomaly detection. Column profiling. The features are table stakes. What separates the good tools from the mediocre ones is everything else.

Time to value

How long from signup to seeing your first useful alert? This is the single most important question and almost nobody talks about it.

Some tools require a week of configuration before they're useful. You need to define every monitor. Set every threshold. Map every relationship. By the time you're done, you've spent more time setting up the tool than you would have spent just writing SQL checks yourself.

Good tools should give you value in hours, not weeks. Connect your database. Let the tool figure out what normal looks like. Get your first alert when something breaks. You can fine-tune later.

When evaluating, ask: "If I connect my database right now, what will I learn in the next 24 hours?" If the answer is "nothing until you configure monitors," keep looking.

Noise level

A tool that alerts on everything is worse than a tool that alerts on nothing. Alert fatigue is real. If your data quality tool sends fifty alerts a day and forty-eight of them don't matter, you'll start ignoring all of them.

Good tools give you control over what matters. Tags and data classification let you prioritize critical tables and ignore the noise. AI-powered intelligence helps you understand context and triage issues quickly. And integrations with your existing workflow, whether that's Slack, your orchestrator, or AI agents via MCP, mean alerts reach you where you actually work.

Ask vendors: "How do I control which alerts I see and where they go?" If the answer is complicated, expect frustration.

Database coverage

You probably have more than one database. Maybe Postgres for your application, Snowflake for analytics, and some vendor data landing in BigQuery. Your data quality tool needs to work across all of them.

Watch out for tools that technically support your databases but treat some as second-class citizens. "We support MySQL" might mean "we can connect to MySQL but half our features don't work." Ask for specifics. Which features work on which databases?

Pricing model

Most data quality tools price per table. This makes sense: more tables means more monitoring. But the per-table rate varies wildly, from $5 to $20 per table.

Do the math for your actual usage. If you have 200 tables, the difference between $5 and $15 per table is $24,000 a year. That's a real budget item, not a rounding error.

Also watch for hidden costs. Some tools charge extra for features that should be standard. Some charge for users. Some charge for alerts. Get a complete quote, not just the headline price.

Integration with your workflow

Where do your alerts go? If your team lives in Slack, the tool better have good Slack integration. Not just "can send to Slack" but "sends useful, actionable messages that you can respond to."

Same for your orchestration tools. If you're running dbt, can the tool integrate with your dbt tests? Can it trigger alerts based on dbt run failures? Can it show lineage from your dbt models?

The best tool in the world is useless if it doesn't fit into how your team actually works.

AI and agent integration

Data quality tools are starting to add AI features, but most stop at chat interfaces for querying metadata. That's useful, but it's just the beginning.

The real question is whether the tool fits into how AI agents work. Does it expose an MCP server so your AI coding assistant can check data quality before making changes? Can an agent query freshness status or schema changes programmatically? Can it trigger monitors or pull context into your existing AI workflows?

This matters because data engineering workflows are increasingly agent-assisted. If your data quality tool can't participate in those workflows, you're stuck copying and pasting between systems. Look for tools that treat AI integration as a first-class feature, not an afterthought.

What I'd actually evaluate

If I were evaluating data quality tools today, here's my process:

Day 1: Sign up. Connect one database with maybe 50 tables. How long until you have working monitors? If you're still configuring after an hour, that's a red flag. Good tools make setup simple enough that you can be monitoring real tables in minutes, not days.

Day 2-3: Look at the alerts. Are they useful? Are they noise? Intentionally break something in a test environment and see how long it takes to get an alert.

Week 1: Try the integrations you actually need. Set up Slack alerts. Connect to your orchestrator. See if it feels native or bolted-on.

Week 2: Do the pricing math. How much will this cost at your current scale? What about double that scale? Are there features you need that cost extra?

Questions to ask every vendor

Before you buy, get answers to these:

How long does initial setup take for a database with 100 tables?
What's your actual per-table price at my expected scale?
Which features work on which databases?
How does alerting integrate with Slack/Teams/PagerDuty?
Do you support dbt integration? What does it include?
Do you have an MCP server or API for AI agent integration?
What happens if I exceed my plan limits?

The bottom line

Every tool will tell you they have the features you need. What matters is whether those features actually work in practice, whether the tool fits your workflow, and whether the price makes sense for your scale.

Don't buy based on a demo. Run a real trial with real data. See how it performs in your actual environment. That's the only way to know if a tool is good or just good at demos.

AnomalyArmor is built for fast time-to-value. Connect your database and get automated data quality scoring, null rate monitoring, anomaly detection, and schema drift alerts in minutes. Pricing starts at $5/table, roughly half what competitors charge. Sign up.

Schema Drift: The Silent Pipeline Killer

Blaine Elliott — Sun, 12 Apr 2026 17:26:46 +0000

Schema drift is when your database schema changes in ways your downstream systems don't expect. It sounds boring. It will ruin your week.

Unlike a crashed server or a failed deployment, schema drift doesn't announce itself. There's no error page. No alert. Your pipelines keep running. Your dashboards keep updating. The numbers just quietly become wrong.

How it happens

Schema drift happens because databases are shared infrastructure. Your data warehouse isn't just used by your team. Backend engineers add columns. Product teams rename fields. Someone decides user_id should be customer_id for consistency. An intern drops a table they thought was unused.

None of these changes are malicious. Most of them are reasonable in isolation. The problem is that nobody told the data team. And why would they? To the person making the change, it's just a database column. They don't know it feeds into seventeen downstream tables and a board reporting dashboard.

The five types of schema drift

Not all schema changes are equally dangerous. Here's what to watch for:

Column renames are the worst. They look like dropped columns to your queries, but the data is still there under a different name. If you're selecting amount and someone renamed it to total_amount, you get nulls. Not an error. Nulls.

Column drops are at least obvious. Your query fails. You get an error. You can trace the problem immediately.

Type changes are subtle. A varchar becomes a text. An int becomes a bigint. Sometimes it doesn't matter. Sometimes your aggregations start returning slightly different results and nobody notices for weeks.

Column additions are usually safe, but they can break SELECT * queries in unexpected ways. More columns means more memory, slower queries, and occasionally hitting column limits in downstream systems.

Table drops or renames are the nuclear option. Everything downstream breaks loudly. At least you'll notice.

A real example

Last year, a SaaS company I worked with had their entire customer churn model break. The ML team spent three days debugging before they found the issue: a column called last_activity_date had been renamed to last_active_at in the production database.

The rename happened as part of a Rails convention cleanup. Totally reasonable. The backend team did it in a migration with proper deprecation warnings in the API. What they didn't know was that the data warehouse was syncing that table directly, and the churn model was using last_activity_date to calculate days since last login.

When the column disappeared, the pipeline kept running. The null values got coerced to some default date. Suddenly every customer looked like they'd been inactive for decades. The churn model started predicting 100% churn for everyone.

Three days of debugging. One column rename.

Why traditional monitoring misses it

Most monitoring focuses on "is the system up" and "are the jobs running." Those are good things to monitor. They won't catch schema drift.

Your dbt job ran successfully. Great. It just produced wrong data because the source schema changed. Your Airflow DAG is green. Wonderful. It's now loading nulls into a column that shouldn't have nulls.

You need monitoring that understands what the schema looked like yesterday and what it looks like today. You need something that can tell you "column user_status changed from varchar(50) to varchar(20)" before your pipeline truncates half your status values.

Detecting schema drift

The simplest approach is to snapshot your schema periodically and diff it. Every hour, run a query against information_schema, store the results, compare to the previous snapshot. Any differences trigger an alert.

This works. It's also tedious to build and maintain. You need to handle every database type differently. You need to store the snapshots somewhere. You need alerting infrastructure. You need to filter out the noise (not every schema change is a problem).

This is exactly the kind of problem that makes sense to outsource to a dedicated tool. Let someone else deal with the cross-database compatibility. Let someone else figure out which changes are breaking versus benign. You have actual work to do.

What good detection looks like

When a schema change happens, you should know immediately. Not tomorrow. Not when the weekly report looks wrong. Immediately.

The alert should tell you exactly what changed: which table, which column, what the old definition was, what the new definition is. It should tell you when the change happened. And ideally, it should tell you what downstream systems might be affected.

That last part is hard. It requires lineage tracking, knowing which tables feed into which other tables and reports. But even without lineage, just knowing about the change within minutes instead of days is a massive improvement.

Prevention vs detection

In a perfect world, schema changes would go through a review process. Backend teams would notify data teams before making changes. There would be a deprecation period. Downstream systems would be updated first.

In the real world, changes happen fast. Startups move quickly. People forget. Communication breaks down. You can't rely on perfect process to prevent schema drift.

Detection is your safety net. Good process is great. Detection catches everything that process misses.

Key takeaways:

Schema drift happens when database schemas change without downstream systems knowing
Column renames are the most dangerous because they don't cause obvious errors
Traditional job monitoring won't catch schema drift
You need schema-aware monitoring that diffs your database structure over time
Detection is your safety net when process fails

AnomalyArmor detects schema drift automatically, plus monitors data quality metrics like null rates, row counts, and distribution shifts. Connect your database and get alerts within minutes. Sign up.

Why I Built AnomalyArmor

Blaine Elliott — Sun, 12 Apr 2026 17:26:44 +0000

I've done data engineering over the years at CJ, Savings.com, MySpace, Chegg, LinkedIn, Microsoft, One Medical, and AbnormalAI. The thing that's always stuck with me is how the job gets harder in a way that sneaks up on you.

When you build a pipeline, you're not just creating one thing to maintain. You're creating a machine that generates new things to maintain. Every run, every interval, every partition of data that pipeline produces becomes another touch point you're responsible for. One pipeline running hourly for a year is 8,760 data points you now own. Scale that across dozens of pipelines feeding into each other, and you've got an exponential maintenance problem.

This is the part nobody warns you about when you start in data engineering. The pipelines themselves aren't that hard. It's everything they produce that buries you.

The problem without a solution

I spent years looking for elegant tooling to handle this. Something that could watch all those touch points without requiring me to manually define what "good" looks like for each one. The solutions I found were either too simple (just run some SQL tests), too complex (six-week implementations that needed a dedicated admin) or too expensive (out of reach for our budget or company size).

What I wanted was analysis at scale. Limited human interaction to set up, comprehensive coverage across all my data, and smart enough to distill thousands of potential issues into a small set of things I actually needed to look at. Signal, not noise.

The hackathon that started it

A few years back I built a hackathon project around this idea. The core concept was automated statistical profiling: connect to a database, analyze the distributions, detect when something changed meaningfully, and surface only the stuff worth investigating. And do all this at scale with a little I/O as possible to achieve the desired outcome: does my data have any land mines in it?

It worked better than I expected. Not because the statistics were novel, but because it removed the manual effort. I didn't have to write a test for every column. I didn't have to define thresholds for every metric. The system figured out what normal looked like and told me when things deviated.

That project sat in a repo for a while. But the idea kept nagging at me.

Building for myself

AnomalyArmor came from recognizing voids in the industry that nobody was filling. The expensive enterprise tools were overkill for most teams. The lightweight open source options required too much manual configuration. There was a middle ground that didn't exist: something that worked out of the box, scaled with your data, and didn't cost a fortune.

I also just wanted better tooling for myself. Every data engineering job I've had, I've ended up building some version of this internally. Schema change detection scripts. Freshness monitoring cron jobs. Anomaly alerts cobbled together from Airflow sensors. AnomalyArmor is what all of that should have been from the start.

What it does

The pitch is simple: connect your database, get alerts when something's wrong.

Schema drift detection tells you when columns change before your pipelines break. Freshness monitoring tells you when tables stop updating before anyone asks why the dashboard is stale. Data quality metrics catch null spikes, distribution shifts, and anomalies before they corrupt your analytics. Lineage extends these offerings to give you a blast radius of what should be monitored, then does that monitoring for you.

Why $5 per table

I priced it at roughly half what competitors charge because I know what data team budgets look like. At 100 tables, you're paying $475 a month. That's affordable for a real team, not just enterprises with unlimited spend.

If AnomalyArmor saves you one fire drill per month, one late-night debugging session, one embarrassing "why are these numbers wrong" conversation, it's paid for itself.

Try it yourself

If you're tired of the exponential maintenance problem and want tooling that actually helps, sign up and connect your first database in under 5 minutes.

No sales pitch. Just see if it solves a problem you have.

— Blaine

The 6 Dimensions of Data Quality: Definitions, Examples, and How to Monitor Each

Blaine Elliott — Sun, 12 Apr 2026 17:15:39 +0000

The six dimensions of data quality are accuracy, completeness, consistency, timeliness, validity, and uniqueness. Each dimension measures a different aspect of whether data is fit for its intended use. Together they define whether a dataset can be trusted for analytics, machine learning, or customer-facing applications.

This guide defines each dimension with practical examples, SQL detection patterns, and monitoring strategies for production data pipelines.

What are the dimensions of data quality?

Data quality dimensions are measurable attributes that describe different ways data can be wrong. The widely accepted framework includes six core dimensions:

#	Dimension	Question it answers
1	Accuracy	Does the data reflect real-world truth?
2	Completeness	Is any expected data missing?
3	Consistency	Does the same fact match across systems?
4	Timeliness	Is the data current enough to be useful?
5	Validity	Does the data conform to expected formats and rules?
6	Uniqueness	Are there duplicate records where there shouldn't be?

These six dimensions come from the DAMA International Data Management Body of Knowledge (DMBOK) and are used by organizations including the UK Government Data Quality Hub, Monte Carlo, Collibra, and Informatica. Different sources sometimes add dimensions like integrity or conformity, but the core six cover the vast majority of data quality failures.

Why do data quality dimensions matter?

Without a framework, data teams describe quality problems anecdotally: "the data looks off," "something's wrong with customer IDs," "the numbers don't match the dashboard." These complaints are hard to prioritize and harder to fix systematically.

The six dimensions convert vague complaints into measurable categories. A data team that says "we have a completeness problem on 3% of rows and a timeliness problem on 2 tables" can write monitoring rules, assign owners, and track improvement over time. A team that just says "data quality is bad" cannot.

1. Accuracy

Definition: Accuracy measures how closely data reflects the real-world entity or event it describes.

A customer's street address stored as "123 Mai Street" when it should be "123 Main Street" is inaccurate. A transaction recorded as $100 when the actual amount was $1000 is inaccurate. A birth date of 1900-01-01 for a 30-year-old customer is inaccurate.

Accuracy is the hardest dimension to verify automatically because it requires comparing data to an authoritative external truth. Most teams verify accuracy through:

Cross-reference with source systems: Compare warehouse data against the upstream OLTP database
Sampling and manual review: Audit a random subset against original documents
Reference data checks: Compare against a trusted master data source (e.g., a zip code database)
Statistical sanity checks: Flag values that are impossibly high or low

-- Detect impossibly old ages (accuracy check)
SELECT customer_id, birth_date, DATE_DIFF(CURRENT_DATE(), birth_date, YEAR) AS age
FROM customers
WHERE DATE_DIFF(CURRENT_DATE(), birth_date, YEAR) > 120
   OR DATE_DIFF(CURRENT_DATE(), birth_date, YEAR) < 0;

2. Completeness

Definition: Completeness measures whether all expected data is present. It covers both row-level completeness (no missing rows) and column-level completeness (no missing values in required fields).

A daily sales table that should contain one row per store per day but is missing rows for three stores has a row-level completeness problem. A customers table with email IS NULL for 15% of records has a column-level completeness problem.

Completeness checks are straightforward to automate:

-- Column-level completeness: null rate for required fields
SELECT
  COUNT(*) AS total_rows,
  COUNT(email) AS rows_with_email,
  COUNT(*) - COUNT(email) AS null_emails,
  ROUND(100.0 * (COUNT(*) - COUNT(email)) / COUNT(*), 2) AS null_rate_pct
FROM customers;

-- Row-level completeness: missing expected records
SELECT store_id, sale_date
FROM expected_stores_and_dates
LEFT JOIN daily_sales USING (store_id, sale_date)
WHERE daily_sales.store_id IS NULL;

The hard part isn't writing the query. It's deciding what "expected" means. You need a ground truth for what should exist, which usually comes from a reference table, a calendar, or a contract with the upstream source.

3. Consistency

Definition: Consistency measures whether the same fact matches across different systems, tables, or timestamps.

If the customer table shows 10,000 active users and the billing table shows 9,850 active users, there's a consistency problem. If a transaction amount appears as $100 in one system and $100.00 in another, that's usually formatting, not a consistency failure. But if the same transaction appears as $100 in one system and $1000 in another, that's a critical consistency failure.

Consistency checks compare aggregate or row-level values across data sources:

-- Cross-system consistency: customer count reconciliation
WITH crm_count AS (
  SELECT COUNT(*) AS n FROM crm_customers WHERE status = 'active'
),
warehouse_count AS (
  SELECT COUNT(*) AS n FROM dim_customers WHERE is_active = TRUE
)
SELECT
  crm_count.n AS crm_active_customers,
  warehouse_count.n AS warehouse_active_customers,
  ABS(crm_count.n - warehouse_count.n) AS delta
FROM crm_count, warehouse_count;

Consistency problems often stem from timing: one system was updated, the other hasn't synced yet. The monitoring question is whether the gap is within an acceptable SLA or has exceeded it.

4. Timeliness

Definition: Timeliness measures whether data is fresh enough to be useful. A timely dataset is updated on its expected schedule and is current relative to the real-world events it describes.

A dashboard showing "sales last hour" that's actually showing data from 6 hours ago has a timeliness problem. A machine learning model trained on data that's 3 months stale may produce incorrect predictions. A fraud detection system running on yesterday's transactions is useless.

Timeliness is measured in two ways:

Freshness lag: How long since the last update? (CURRENT_TIMESTAMP - MAX(inserted_at))
Schedule adherence: Did the expected update happen on time?

-- Freshness: hours since last row was added
SELECT
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(inserted_at), HOUR) AS hours_since_last_insert,
  MAX(inserted_at) AS most_recent_row
FROM orders
HAVING hours_since_last_insert > 2;  -- alert if stale beyond SLA

Timeliness is the easiest dimension to monitor at scale because it only requires a single max-timestamp query per table. This is why freshness monitoring is typically the first data quality check teams implement.

5. Validity

Definition: Validity measures whether data conforms to defined formats, types, ranges, and business rules.

An email field containing "not-an-email" is invalid. A phone number field with "call my cell" is invalid. A country field with "Martian Empire" is invalid. A percentage field with 150 is invalid. A timestamp in the year 9999 is invalid.

Validity is the most rule-heavy dimension. It requires explicit definitions of what "valid" means for each field:

-- Validity: email format check
SELECT customer_id, email
FROM customers
WHERE email IS NOT NULL
  AND NOT REGEXP_CONTAINS(email, r'^[^@\s]+@[^@\s]+\.[^@\s]+$');

-- Validity: range check
SELECT order_id, discount_pct
FROM orders
WHERE discount_pct < 0 OR discount_pct > 100;

-- Validity: enum check
SELECT order_id, status
FROM orders
WHERE status NOT IN ('pending', 'paid', 'shipped', 'delivered', 'refunded');

Modern data quality tools automate validity checks by profiling historical data to learn expected formats, then flagging new records that deviate.

6. Uniqueness

Definition: Uniqueness measures whether records that should be unique are unique. It covers both primary key uniqueness and business-level deduplication.

A customers table should have exactly one row per customer. A transactions table should have exactly one row per transaction. When the same customer appears twice with slightly different spellings, or the same transaction appears twice because of a retry bug, you have a uniqueness failure.

Uniqueness checks are simple to write:

-- Primary key uniqueness
SELECT customer_id, COUNT(*) AS occurrences
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- Business-level uniqueness (same email, different IDs = probable duplicate)
SELECT LOWER(TRIM(email)) AS normalized_email, COUNT(*) AS dup_count,
       ARRAY_AGG(customer_id) AS customer_ids
FROM customers
WHERE email IS NOT NULL
GROUP BY LOWER(TRIM(email))
HAVING COUNT(*) > 1;

The hard part is defining the business rule for uniqueness. Primary keys are enforced by the database. Business-level deduplication (same person, different spellings) requires fuzzy matching, normalization, or entity resolution algorithms.

How do these dimensions relate to each other?

The six dimensions overlap and interact. A single data quality failure often affects multiple dimensions at once:

Duplicate records violate uniqueness, but also affect accuracy (counts are wrong) and sometimes completeness (aggregates miss data)
Schema drift violates validity (new values don't match expected format), often triggers completeness failures (previously required columns become null), and degrades accuracy (wrong values flow through)
Pipeline delays violate timeliness, but also create consistency problems between source and destination systems

Good monitoring tracks all six dimensions because a problem in one often predicts problems in others. A sudden spike in uniqueness failures for customer IDs is often an upstream completeness problem (nulls being converted to a default value).

How do you measure data quality across all six dimensions?

The standard approach is to calculate a quality score per table per dimension, then aggregate:

Per-dimension score: For each table and each dimension, compute pass/fail against defined rules
Rollup to table score: Average the six dimension scores (or weight by business importance)
Rollup to dataset score: Average across all tables in a dataset
Track over time: Plot the score daily to catch degradation trends

For production data pipelines, modern data observability tools automate this by:

Profiling historical data to learn baselines (typical null rates, value distributions, update frequencies)
Detecting anomalies in new data against those baselines
Tagging each anomaly by the dimension it violates
Rolling up to dashboards that show quality over time per table and per dimension

The key insight is that you cannot manually write rules for every edge case across 500 tables. You need statistical baselines that learn from the data itself, with explicit rules for the invariants that matter most to the business.

Data Quality Dimensions FAQ

What are the 6 dimensions of data quality?

The six dimensions of data quality are accuracy, completeness, consistency, timeliness, validity, and uniqueness. Accuracy measures truth against reality, completeness measures missing data, consistency measures cross-system agreement, timeliness measures freshness, validity measures conformance to rules, and uniqueness measures duplicate records.

Are there more than 6 dimensions of data quality?

Yes. Some frameworks add dimensions like integrity (referential relationships), conformity (adherence to standards), reasonableness (within expected bounds), or auditability (traceable to source). The DAMA DMBOK lists six core dimensions that cover the most common failure modes, which is why the "six dimensions" framework is the most widely cited.

Which data quality dimension is most important?

It depends on the use case. For financial reporting, accuracy and consistency matter most. For real-time dashboards, timeliness is critical. For machine learning features, completeness and validity drive model performance. Most production data teams treat timeliness and completeness as the top two because their failures are easiest to detect and most visible to downstream users.

How do you measure data quality dimensions?

Each dimension is measured by running rule-based or statistical checks and counting pass/fail rates. Accuracy is typically measured by sampling and cross-reference. Completeness is measured as null rate or row-count against expectation. Consistency is measured by reconciling aggregates across systems. Timeliness is measured as lag from expected update. Validity is measured by format and range checks. Uniqueness is measured by primary key and business-level dedup queries.

What is the difference between data quality and data integrity?

Data quality is the broader concept covering accuracy, completeness, consistency, timeliness, validity, and uniqueness. Data integrity is a narrower concept focused on referential relationships and constraint enforcement (foreign keys resolve, required fields aren't null, allowed values are enforced). Integrity is sometimes listed as a seventh dimension of quality, but most frameworks treat it as a subset of validity and completeness.

Can you have high data quality in one dimension and low in another?

Yes, and this is common. A table can have perfect uniqueness (no duplicates) but terrible timeliness (updated weekly when it should be hourly). A dataset can be perfectly complete (no missing rows) but inaccurate (values are wrong). Monitoring each dimension separately reveals these patterns. A single "data quality score" that averages all six hides the specific failure modes you need to fix.

How is data quality different from data observability?

Data quality is the outcome: whether data is fit for use. Data observability is the practice: continuously monitoring data pipelines to detect quality issues in production. You can have high data quality without observability (if nothing ever breaks), but in practice you need observability to maintain quality over time as systems evolve and upstream sources change.

What tools automate data quality dimension monitoring?

Modern data observability platforms including AnomalyArmor, Monte Carlo, Metaplane, Bigeye, and Datafold automate monitoring across all six dimensions by profiling historical baselines and flagging anomalies. Open-source tools like Great Expectations, Soda Core, and dbt tests cover rule-based validity and completeness checks but require manual rule writing. Most production teams combine both: automated baseline monitoring for the long tail plus explicit rules for business-critical invariants.

How much historical data do you need to monitor data quality dimensions?

Statistical baselines typically require 7-14 days of historical data for basic anomaly detection. Weekly seasonality needs at least 4 weeks. Yearly seasonality requires 12-18 months. For rule-based checks (validity, uniqueness, primary key enforcement), no history is needed, you can run them on any new data as it arrives.

Can you fix low data quality after the fact?

Sometimes yes, often no. Validity and uniqueness problems can often be fixed retroactively by cleaning and deduplication. Completeness problems can sometimes be fixed by re-running upstream loads. Accuracy problems usually can't be fixed without access to the original source, which may have been lost. Timeliness problems can't be fixed at all: once data is late, it's late. Prevention through monitoring is always cheaper than retroactive cleanup.

Data quality dimensions are only useful if you can measure them in production. See how AnomalyArmor automatically monitors accuracy, completeness, consistency, timeliness, validity, and uniqueness across your data pipelines.

Data Anomaly Detection: The Complete Guide for Data Engineers

Blaine Elliott — Sat, 11 Apr 2026 22:48:30 +0000

Data anomaly detection is the process of identifying data points, patterns, or values that deviate from expected behavior. It catches schema changes, stale tables, row count spikes, and statistical outliers before they break dashboards or corrupt downstream analytics. Modern data anomaly detection combines statistical methods like z-scores and Welford's algorithm with machine learning models that learn seasonal patterns from historical data.

This guide explains the four types of data anomalies, the algorithms used to detect each one, and how to implement detection in Snowflake, Databricks, and PostgreSQL.

What is data anomaly detection?

Data anomaly detection is the automated identification of unexpected values, patterns, or changes in a dataset. In data engineering, it monitors production tables for problems like:

A column gets renamed, dropped, or changes type (schema drift)
A daily-updated table hasn't received new rows in 36 hours (freshness failure)
Row counts drop by 80% overnight (volume anomaly)
Null rate in a critical column spikes from 2% to 40% (quality anomaly)
A customer ID in a fact table references a non-existent record (referential anomaly)

The goal is to catch these problems before they reach dashboards, ML models, or customer-facing applications.

The four types of data anomalies

1. Schema anomalies

Schema anomalies occur when the structure of a table changes unexpectedly. Common examples:

Column added: A new column appears upstream, which can break SELECT * queries
Column dropped: A column disappears, breaking any query that references it
Column renamed: The column exists under a different name, causing silent NULL returns
Type changed: A VARCHAR becomes an INTEGER, causing cast failures

Schema anomalies are the most common cause of silent data failures because queries often continue to run without error, returning wrong results.

2. Freshness anomalies

Freshness anomalies happen when a table stops updating on its expected schedule. A table that normally updates every hour but hasn't received new rows in 6 hours has a freshness anomaly. These are caused by:

Upstream pipeline failures
Source system outages
Broken scheduled jobs
Permission changes

Freshness is typically measured as "time since last insert" or "max(timestamp_column)".

3. Volume anomalies

Volume anomalies are unexpected changes in row counts. A daily sales table that normally receives 10,000-12,000 rows suddenly receiving 500 rows (or 100,000) is a volume anomaly. Causes include:

Upstream filter changes
Duplicate data ingestion
Failed partial loads
Fraud or bot activity

4. Value anomalies

Value anomalies are statistical outliers in column values. Examples:

A revenue column where 5% of rows are negative when they should always be positive
A foreign key column where null rates spike from 2% to 40%
A timestamp column with future dates

Value anomalies are detected using statistical methods applied to specific columns.

How data anomaly detection works

Anomaly detection uses three main approaches: static thresholds, statistical methods, and machine learning.

Static thresholds

The simplest approach. You define the expected range manually:

SELECT 'anomaly' AS status
FROM orders
WHERE COUNT(*) < 1000 OR COUNT(*) > 50000;

Static thresholds work for stable metrics but fail for anything with seasonality (weekend traffic drops, end-of-month spikes).

Statistical methods

Statistical anomaly detection uses historical data to compute expected ranges automatically. The most common approach is the z-score:

z = (current_value - historical_mean) / historical_stddev

If the absolute z-score exceeds a threshold (typically 2 or 3), the value is flagged as anomalous. A z-score of 2 catches values more than 2 standard deviations from the mean, which is roughly the top or bottom 2.5% of a normal distribution.

Welford's algorithm is the most efficient way to compute running mean and standard deviation for anomaly detection. It maintains three numbers (count, mean, and sum of squared deviations) and updates them incrementally with each new data point, requiring constant memory:

def update_stats(count, mean, m2, value):
    count += 1
    delta = value - mean
    mean += delta / count
    delta2 = value - mean
    m2 += delta * delta2
    return count, mean, m2

def get_variance(count, m2):
    return m2 / (count - 1) if count > 1 else 0

This is the foundation of most production anomaly detection systems because it scales to high-volume event streams without storing historical data.

Machine learning methods

For data with complex seasonality (weekly patterns, business hours, holiday effects), machine learning models outperform simple statistics. The most common approach is Prophet (Facebook's time-series forecasting library), which decomposes a series into trend, weekly seasonality, and yearly seasonality, then flags values outside the prediction interval.

Prophet requires at least 14 data points to detect weekly patterns and 365 points to detect yearly patterns. For tables with less history, fall back to z-scores.

How to detect data anomalies in Snowflake

Snowflake provides metadata views that make anomaly detection straightforward.

Schema anomalies: Track column changes via INFORMATION_SCHEMA.COLUMNS:

SELECT table_name, column_name, data_type, last_altered
FROM information_schema.columns
WHERE table_schema = 'PRODUCTION'
  AND last_altered > DATEADD(hour, -24, CURRENT_TIMESTAMP());

Freshness anomalies: Check ACCOUNT_USAGE.TABLES for last DML operation:

SELECT table_name,
       last_altered,
       DATEDIFF(hour, last_altered, CURRENT_TIMESTAMP()) AS hours_stale
FROM snowflake.account_usage.tables
WHERE table_schema = 'PRODUCTION'
  AND DATEDIFF(hour, last_altered, CURRENT_TIMESTAMP()) > 24;

Volume anomalies: Compare today's row count against a rolling 30-day average:

WITH daily_counts AS (
  SELECT DATE(created_at) AS day, COUNT(*) AS row_count
  FROM orders
  WHERE created_at >= DATEADD(day, -30, CURRENT_DATE())
  GROUP BY DATE(created_at)
),
stats AS (
  SELECT AVG(row_count) AS mean, STDDEV(row_count) AS stddev
  FROM daily_counts
  WHERE day < CURRENT_DATE()
)
SELECT row_count,
       (row_count - mean) / stddev AS z_score
FROM daily_counts, stats
WHERE day = CURRENT_DATE()
  AND ABS((row_count - mean) / stddev) > 2;

How to detect data anomalies in Databricks

Databricks offers Delta Live Tables expectations for inline anomaly detection:

import dlt
from pyspark.sql.functions import col

@dlt.table
@dlt.expect_or_drop("valid_order_total", "order_total > 0")
@dlt.expect_or_fail("recent_data", "created_at > current_date() - interval 2 days")
def clean_orders():
    return spark.read.table("raw_orders")

For volume and statistical anomalies, use Unity Catalog's lineage tracking combined with scheduled queries:

SELECT table_name,
       COUNT(*) AS row_count,
       MAX(ingestion_time) AS last_update
FROM production.orders
GROUP BY table_name;

How to detect data anomalies in PostgreSQL

PostgreSQL doesn't have built-in anomaly detection, but you can implement it with pg_stat_user_tables and custom queries:

SELECT relname AS table_name,
       n_live_tup AS row_count,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE schemaname = 'public'
  AND last_autoanalyze < NOW() - INTERVAL '24 hours';

For value anomalies, use window functions to compute rolling statistics:

WITH rolling_stats AS (
  SELECT order_id,
         amount,
         AVG(amount) OVER (ORDER BY created_at ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING) AS rolling_mean,
         STDDEV(amount) OVER (ORDER BY created_at ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING) AS rolling_stddev
  FROM orders
)
SELECT order_id, amount, rolling_mean, rolling_stddev,
       (amount - rolling_mean) / NULLIF(rolling_stddev, 0) AS z_score
FROM rolling_stats
WHERE ABS((amount - rolling_mean) / NULLIF(rolling_stddev, 0)) > 3;

Build vs buy: data anomaly detection tools

Building anomaly detection in-house gives you control but requires engineering time to maintain. Most data teams outgrow custom solutions because:

Alert fatigue: Static thresholds fire too often and get ignored
Seasonality blindness: Simple statistics miss weekly and yearly patterns
Cross-platform monitoring: Different code for Snowflake, Databricks, and Postgres
Incident triage: No unified view of which alerts matter most

AnomalyArmor is a data observability platform that uses AI to configure anomaly detection automatically. You connect your data warehouse, describe what you want to monitor in plain English, and the AI agent sets up schema drift alerts, freshness schedules, and statistical anomaly detection across all your tables. It works on Snowflake, Databricks, PostgreSQL, and BigQuery.

Data anomaly detection FAQ

What is the difference between anomaly detection and data validation?

Data validation checks if data matches explicit rules (e.g., "order_id is not null"). Anomaly detection uses statistical methods to identify values that deviate from historical patterns. Validation catches known problems. Anomaly detection catches unknown ones.

What is the best algorithm for data anomaly detection?

For most production use cases, z-scores computed with Welford's algorithm work well. For data with strong weekly or yearly seasonality, Prophet or similar time-series models are better. For high-dimensional data, isolation forests outperform statistical methods.

How do I detect schema drift automatically?

Query your database's INFORMATION_SCHEMA or metadata views on a schedule, store the previous state, and diff the current state against the stored version. When columns change, type definitions change, or tables are added or removed, fire an alert. AnomalyArmor does this automatically for Snowflake, Databricks, and PostgreSQL.

What is a z-score and how is it used in anomaly detection?

A z-score measures how many standard deviations a value is from the historical mean. A z-score of 2 means the value is 2 standard deviations above the mean, which occurs in roughly 2.5% of a normal distribution. Most anomaly detection systems use z-scores between 2 and 3 as thresholds.

How much historical data do I need for anomaly detection?

Statistical methods like z-scores need at least 7-10 data points to produce meaningful baselines. Machine learning methods like Prophet need at least 14 points for weekly seasonality and 365 points for yearly seasonality. During the learning phase, most systems don't fire alerts.

What is the difference between data observability and anomaly detection?

Anomaly detection is one component of data observability. Data observability also includes lineage tracking, impact analysis, schema change detection, and root cause analysis. Anomaly detection tells you something is wrong. Observability tells you what, where, and why.

Can AI improve data anomaly detection?

Yes. AI improves anomaly detection in three ways. First, AI agents can configure monitoring rules from natural language instead of YAML or GUI forms. Second, LLMs can analyze alert patterns to reduce false positives. Third, AI can correlate anomalies across tables to identify root causes faster than manual investigation.

How do I avoid alert fatigue in anomaly detection?

Use adaptive thresholds that learn from historical patterns instead of static rules. Set sensitivity per table based on how critical it is. Group related alerts so a single upstream failure generates one notification instead of ten. Suppress alerts during known maintenance windows.

What data platforms support anomaly detection natively?

Snowflake has data metric functions and ACCOUNT_USAGE views. Databricks has Delta Live Tables expectations and Unity Catalog lineage. BigQuery has table metadata and scheduled queries. PostgreSQL has pg_stat_user_tables. None of these are full anomaly detection systems, but they provide the raw metrics needed to build one.

How real-time should anomaly detection be?

It depends on the use case. Schema drift and freshness checks should run every 5-15 minutes. Row count and statistical anomalies should run hourly for most tables and daily for slower-changing ones. Real-time streaming anomaly detection (sub-second) is rarely needed for data warehouses but is critical for fraud detection and security monitoring.

Summary

Data anomaly detection catches schema changes, freshness failures, volume spikes, and statistical outliers before they break downstream analytics. The four main types of anomalies require different detection approaches: schema changes need metadata diffs, freshness needs time-since-update checks, volume needs historical baselines, and value anomalies need statistical methods like z-scores or machine learning models like Prophet.

Modern data observability platforms combine all four detection methods with AI-powered configuration to make anomaly detection practical at scale. Whether you build in-house or buy a tool, the fundamental algorithms are the same: maintain historical baselines, compute expected ranges, and flag deviations beyond your sensitivity threshold.

Want to see data anomaly detection in action? Watch a 30-second demo of AI configuring schema drift monitoring in real time.