Blaine Elliott

Posted on • Originally published at blog.anomalyarmor.ai

How Do I Monitor Schema Changes in a Data Warehouse?

You monitor schema changes in a data warehouse by periodically querying metadata catalogs (like INFORMATION_SCHEMA), subscribing to event-driven notifications, or comparing structural hashes of your tables over time. Each method strikes a different balance between detection latency, implementation complexity, and warehouse compatibility.

Schema changes are the silent killers of data pipelines. A column rename, a type change from INTEGER to VARCHAR, or a dropped table can cascade through downstream models, dashboards, and ML features without any error until someone notices the numbers look wrong. Monitoring schema changes means catching these mutations before they reach your consumers.

This guide covers what schema changes are, why they break things, how to detect them across Snowflake, Databricks, and PostgreSQL, and which tools can automate the process.

What counts as a schema change?

A schema change is any modification to the structure of a table, view, or other database object. Common schema changes include:

| Change Type | Example | Risk Level |
| --- | --- | --- |
| Column added | New discount_type column appears | Low |
| Column removed | customer_email column dropped | Critical |
| Column renamed | user_id becomes usr_id | Critical |
| Type changed | price moves from DECIMAL(10,2) to VARCHAR | High |
| Nullability changed | order_date becomes nullable | Medium |
| Default changed | status default changes from active to pending | Medium |
| Table dropped | dim_customers is deleted | Critical |
| Table added | New stg_payments_v2 table appears | Low |
| Constraint changed | Primary key removed from transaction_id | High |

Not all schema changes are dangerous. Adding a new column is usually safe. Removing or renaming a column is almost always breaking. The goal of monitoring is to detect the dangerous changes before they propagate.

Why do schema changes break data pipelines?

Schema changes break pipelines because most data transformations assume a fixed structure. A dbt model that runs SELECT customer_email FROM raw.customers will fail the moment that column is renamed to email_address. But the failure mode depends on the warehouse and the tool:

Hard failures happen when a query references a column that no longer exists. The pipeline errors out, someone gets paged, and the fix is obvious (if annoying). These are actually the best case.

Silent failures happen when a type change causes implicit casting, a new column shifts positional references, or a nullable column starts producing NULLs where downstream logic assumes NOT NULL. The pipeline succeeds, the data looks plausible, and no one notices for days or weeks.

Silent failures are why schema monitoring matters. You need to detect the change, not just the downstream symptom. These silent pipeline breaks are the biggest source of data downtime in most production teams.
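To make the silent case concrete, here is an illustrative Snowflake-flavored sketch (the raw.orders table and price column are hypothetical):

```sql
-- Suppose price was INTEGER and an upstream system changed it to VARCHAR.
-- A defensively written aggregate keeps running, but TRY_CAST converts
-- any non-numeric string to NULL, and SUM skips NULLs entirely.
SELECT SUM(TRY_CAST(price AS NUMBER)) AS total_revenue
FROM raw.orders;
-- No error is raised; the dashboard just shows a smaller number.
```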

How do you detect schema changes with INFORMATION_SCHEMA?

The most portable detection method is polling INFORMATION_SCHEMA.COLUMNS. Every major data warehouse exposes this metadata catalog. The strategy is simple: snapshot the schema periodically, compare snapshots, and alert on differences.

Snowflake

```sql
-- Snapshot current schema metadata.
-- Run the CREATE once; on later runs, INSERT INTO the same table
-- so that snapshot history is preserved for comparison.
CREATE TABLE IF NOT EXISTS schema_snapshots.columns_snapshot AS
SELECT
    table_catalog,
    table_schema,
    table_name,
    column_name,
    ordinal_position,
    data_type,
    is_nullable,
    column_default,
    CURRENT_TIMESTAMP() AS snapshot_ts
FROM information_schema.columns
WHERE table_schema NOT IN ('INFORMATION_SCHEMA');

-- Compare the live schema against the most recent snapshot
-- (run this comparison before inserting a new snapshot, or there is nothing to diff)
WITH current_cols AS (
    SELECT table_schema, table_name, column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE table_schema NOT IN ('INFORMATION_SCHEMA')
),
previous_cols AS (
    SELECT table_schema, table_name, column_name, data_type, is_nullable
    FROM schema_snapshots.columns_snapshot
    WHERE snapshot_ts = (SELECT MAX(snapshot_ts) FROM schema_snapshots.columns_snapshot)
)
-- Columns added (in current but not previous)
SELECT 'ADDED' AS change_type, c.table_schema, c.table_name, c.column_name,
       NULL AS old_data_type, c.data_type AS new_data_type
FROM current_cols c
LEFT JOIN previous_cols p USING (table_schema, table_name, column_name)
WHERE p.column_name IS NULL

UNION ALL

-- Columns removed (in previous but not current)
SELECT 'REMOVED', p.table_schema, p.table_name, p.column_name,
       p.data_type, NULL
FROM previous_cols p
LEFT JOIN current_cols c USING (table_schema, table_name, column_name)
WHERE c.column_name IS NULL

UNION ALL

-- Type changes
SELECT 'TYPE_CHANGED', c.table_schema, c.table_name, c.column_name,
       p.data_type, c.data_type
FROM current_cols c
JOIN previous_cols p USING (table_schema, table_name, column_name)
WHERE c.data_type != p.data_type;
```

Databricks

Databricks uses Unity Catalog, which exposes schema metadata through information_schema at the catalog level.

```sql
-- Detect schema changes in Databricks Unity Catalog
WITH current_cols AS (
    SELECT table_schema, table_name, column_name, data_type, is_nullable
    FROM system.information_schema.columns
    WHERE table_catalog = 'your_catalog'
),
previous_cols AS (
    SELECT table_schema, table_name, column_name, data_type, is_nullable
    FROM schema_audit.columns_snapshot
)
SELECT
    CASE
        WHEN p.column_name IS NULL THEN 'ADDED'
        WHEN c.column_name IS NULL THEN 'REMOVED'
        WHEN c.data_type != p.data_type THEN 'TYPE_CHANGED'
        WHEN c.is_nullable != p.is_nullable THEN 'NULLABILITY_CHANGED'
    END AS change_type,
    COALESCE(c.table_schema, p.table_schema) AS table_schema,
    COALESCE(c.table_name, p.table_name) AS table_name,
    COALESCE(c.column_name, p.column_name) AS column_name,
    p.data_type AS old_type,
    c.data_type AS new_type
FROM current_cols c
FULL OUTER JOIN previous_cols p
    USING (table_schema, table_name, column_name)
WHERE p.column_name IS NULL
   OR c.column_name IS NULL
   OR c.data_type != p.data_type
   OR c.is_nullable != p.is_nullable;
```

PostgreSQL

```sql
-- PostgreSQL schema diff using pg_catalog
WITH current_cols AS (
    SELECT
        n.nspname AS table_schema,
        c.relname AS table_name,
        a.attname AS column_name,
        pg_catalog.format_type(a.atttypid, a.atttypmod) AS data_type,
        NOT a.attnotnull AS is_nullable
    FROM pg_catalog.pg_attribute a
    JOIN pg_catalog.pg_class c ON a.attrelid = c.oid
    JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
    WHERE a.attnum > 0
      AND NOT a.attisdropped
      AND c.relkind IN ('r', 'p', 'v', 'm')  -- plain/partitioned tables, views, matviews
      AND n.nspname NOT IN ('pg_catalog', 'information_schema')
),
previous_cols AS (
    SELECT table_schema, table_name, column_name, data_type, is_nullable
    FROM schema_audit.columns_snapshot
)
SELECT
    CASE
        WHEN p.column_name IS NULL THEN 'ADDED'
        WHEN c.column_name IS NULL THEN 'REMOVED'
        WHEN c.data_type != p.data_type THEN 'TYPE_CHANGED'
    END AS change_type,
    COALESCE(c.table_schema, p.table_schema) AS table_schema,
    COALESCE(c.table_name, p.table_name) AS table_name,
    COALESCE(c.column_name, p.column_name) AS column_name,
    p.data_type AS old_type,
    c.data_type AS new_type
FROM current_cols c
FULL OUTER JOIN previous_cols p
    USING (table_schema, table_name, column_name)
WHERE p.column_name IS NULL
   OR c.column_name IS NULL
   OR c.data_type != p.data_type;
```

How does event-driven schema change detection work?

Instead of polling on a schedule, some warehouses support event-driven notifications when DDL statements execute. This eliminates the detection delay between polls.

Snowflake provides QUERY_HISTORY and ACCESS_HISTORY views that log DDL operations. You can query for recent ALTER TABLE, DROP COLUMN, and CREATE TABLE statements (note that ACCOUNT_USAGE views lag real time, typically by up to 45 minutes, so this is near-real-time rather than instant):

```sql
-- Find recent DDL operations in Snowflake
SELECT
    query_text,
    user_name,
    start_time,
    database_name,
    schema_name
FROM snowflake.account_usage.query_history
-- query_type values are granular (e.g. ALTER_TABLE_ADD_COLUMN), so match by prefix
WHERE (query_type ILIKE 'ALTER%' OR query_type ILIKE 'CREATE%' OR query_type ILIKE 'DROP%')
  AND start_time > DATEADD('hour', -24, CURRENT_TIMESTAMP())
ORDER BY start_time DESC;
```

Databricks logs DDL events through Unity Catalog's audit logs, which can be streamed to a monitoring system.
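What polling those audit logs looks like depends on your setup; the following is an illustrative sketch against the system.access.audit system table (the action names are examples, not an exhaustive or verified list):

```sql
-- Illustrative: recent Unity Catalog events that touch table structure.
-- Check your own audit log output for the exact action_name values emitted.
SELECT
    event_time,
    user_identity.email AS changed_by,
    action_name,
    request_params
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND action_name IN ('createTable', 'deleteTable', 'updateTables')  -- illustrative values
  AND event_time > current_timestamp() - INTERVAL 24 HOURS
ORDER BY event_time DESC;
```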

PostgreSQL supports EVENT TRIGGER functions that fire on DDL commands:

```sql
-- PostgreSQL: event trigger for schema changes
CREATE OR REPLACE FUNCTION log_ddl_change()
RETURNS event_trigger AS $$
BEGIN
    INSERT INTO schema_audit.ddl_log (event, command_tag, object_type, object_name, logged_at)
    SELECT
        TG_EVENT,
        command_tag,
        object_type,
        object_identity,  -- schema-qualified name, works for any object type
        NOW()
    FROM pg_event_trigger_ddl_commands();
    -- add session_user to the INSERT (and the log table) to record who ran the DDL
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER track_ddl ON ddl_command_end
    EXECUTE FUNCTION log_ddl_change();
```

Event triggers give you real-time detection and attribution (who changed what), but they require write access to create triggers and only catch changes made through SQL. Changes made through external tools or direct catalog manipulation may be missed.

How does hash-based schema comparison work?

Hash comparison is a lightweight approach that reduces schema state to a single value. You compute a hash of the column names, types, and order for each table, store it, and compare on the next run.

```sql
-- Snowflake: hash-based schema fingerprint
SELECT
    table_schema,
    table_name,
    MD5(LISTAGG(column_name || ':' || data_type || ':' || is_nullable, ',')
        WITHIN GROUP (ORDER BY ordinal_position)) AS schema_hash
FROM information_schema.columns
WHERE table_schema NOT IN ('INFORMATION_SCHEMA')
GROUP BY table_schema, table_name;
```

When the hash changes, you know the schema changed. You then run a detailed diff to find exactly what changed. This two-phase approach minimizes query cost: you only run the expensive diff query when the cheap hash check flags a change.
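As a sketch of the first, cheap phase, assuming the fingerprints above are stored in a hypothetical schema_audit.table_hashes baseline table:

```sql
-- Phase 1: find tables whose fingerprint changed since the baseline.
-- Only these tables need the detailed column-level diff shown earlier.
WITH current_hashes AS (
    SELECT
        table_schema,
        table_name,
        MD5(LISTAGG(column_name || ':' || data_type || ':' || is_nullable, ',')
            WITHIN GROUP (ORDER BY ordinal_position)) AS schema_hash
    FROM information_schema.columns
    WHERE table_schema NOT IN ('INFORMATION_SCHEMA')
    GROUP BY table_schema, table_name
)
SELECT c.table_schema, c.table_name
FROM current_hashes c
JOIN schema_audit.table_hashes b USING (table_schema, table_name)
WHERE c.schema_hash != b.schema_hash;
-- (swap in a FULL OUTER JOIN to also catch added or dropped tables)
```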

How do these detection methods compare?

| Method | Detection Latency | Setup Complexity | Warehouse Support | Change Attribution | Cost |
| --- | --- | --- | --- | --- | --- |
| INFORMATION_SCHEMA polling | Minutes to hours (depends on poll interval) | Low | All major warehouses | No (what changed, not who) | Low (metadata queries are cheap) |
| Event triggers / DDL audit logs | Seconds to minutes | Medium | PostgreSQL (native), Snowflake (query history), Databricks (audit logs) | Yes (user, timestamp, exact DDL) | Low |
| Hash comparison | Minutes to hours (depends on poll interval) | Low | All major warehouses | No (detects change, not details) | Very low (single hash per table) |
| Data observability platform | Minutes (automated polling) | Low (SaaS) | All major warehouses | Yes (full context and history) | Medium (subscription cost) |

INFORMATION_SCHEMA polling is the most practical starting point. It works everywhere, requires no special permissions beyond read access to metadata views, and gives you full detail on what changed. The main drawback is latency: you only detect changes on your polling schedule.

Event triggers provide the fastest detection and full attribution, but they are database-specific and require elevated permissions. They work well in PostgreSQL. In Snowflake and Databricks, you approximate this by querying audit logs.

Hash comparison is useful as an optimization layer on top of polling. It reduces the volume of detailed diff queries when you are monitoring hundreds or thousands of tables.

Data observability platforms combine all three approaches and add alerting, historical tracking, lineage, and impact analysis. They are the right choice when your warehouse has enough tables that manual monitoring becomes a full-time job.

What is the difference between automated and manual schema monitoring?

Manual schema monitoring means someone runs a diff query, reviews the output, and decides whether the change is expected. This works when you have a small number of tables and a disciplined team that runs the check before every deployment.

Automated schema monitoring means a system polls your warehouse on a schedule, compares against the last known state, and sends alerts when changes are detected. Automated monitoring is necessary when:

  • You have more than 50 tables
  • Multiple teams or external vendors modify schemas
  • Upstream sources change without notice (third-party SaaS data, partner feeds)
  • You need an audit trail of every schema change over time

The transition from manual to automated usually happens after the first silent schema change that breaks a dashboard for a week before anyone notices.
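If your warehouse supports native scheduling, the snapshot step itself is easy to automate. A minimal sketch using a Snowflake task (the warehouse name monitoring_wh is hypothetical), which appends an hourly snapshot for the diff query to compare against:

```sql
-- Append a schema snapshot every hour; the diff query compares
-- the live catalog against the latest rows in this table
CREATE OR REPLACE TASK schema_snapshots.take_snapshot
    WAREHOUSE = monitoring_wh
    SCHEDULE = '60 MINUTE'
AS
    INSERT INTO schema_snapshots.columns_snapshot
    SELECT
        table_catalog, table_schema, table_name, column_name,
        ordinal_position, data_type, is_nullable, column_default,
        CURRENT_TIMESTAMP()
    FROM information_schema.columns
    WHERE table_schema NOT IN ('INFORMATION_SCHEMA');

-- Tasks are created suspended; resume to start the schedule
ALTER TASK schema_snapshots.take_snapshot RESUME;
```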

Which tools can automate schema change monitoring?

Several categories of tools address schema monitoring, from open-source libraries to full observability platforms.

Data observability platforms like AnomalyArmor, Monte Carlo, and Sifflet monitor schema changes as part of a broader data quality suite. They poll your warehouse metadata automatically, track changes over time, and alert on unexpected modifications. AnomalyArmor detects column additions, removals, type changes, and nullability shifts across Snowflake, Databricks, and PostgreSQL. Monte Carlo provides similar capabilities as part of its data observability platform, though it recently reduced its engineering team significantly. Sifflet offers schema drift detection alongside data quality rules.

Data testing frameworks like Great Expectations and dbt tests let you write explicit schema assertions. For example, a Great Expectations expect_table_columns_to_match_ordered_list check will fail if columns change. Datafold provides schema-aware diff tooling for pull request review. These tools catch schema changes at test time rather than continuously: a change is detected on the next CI/CD run, not the moment it happens.

Custom scripts using the SQL patterns shown above work well for small environments. A Python script that runs the INFORMATION_SCHEMA diff query on a cron schedule and posts results to Slack is a common starting point. The problem is maintenance: custom scripts need error handling, retry logic, credential management, state storage, and someone to maintain them when they break.

How do you respond to a schema change once it is detected?

Detection is only half the problem. When a schema change is detected, the response workflow matters as much as the alert:

  1. Classify the change: Is it additive (new column) or breaking (removed column, type change)? Additive changes usually need no immediate action. Breaking changes need investigation.

  2. Trace the impact: Which downstream models, dashboards, and consumers depend on the changed table? Lineage metadata answers this question. Without lineage, you are searching through dbt DAGs and dashboard definitions manually (a query-based workaround is sketched below).

  3. Determine intent: Was this change planned (a migration, a new feature) or accidental (someone ran ALTER TABLE in production)? DDL audit logs with user attribution answer this question.

  4. Remediate or adapt: For planned changes, update downstream models to reference the new schema. For accidental changes, revert the DDL if possible or fix the upstream system.

  5. Update monitoring: If the change is intentional, update your schema baseline so future checks don't flag it as anomalous.

The best schema monitoring tools automate steps 1 and 2 (classification and impact analysis) and provide context for step 3 (audit trail). Steps 4 and 5 still require human judgment.
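For step 2, if you do not have a lineage tool, Snowflake's ACCESS_HISTORY can serve as a rough stand-in for impact analysis by showing who has queried the changed table recently. A minimal sketch, assuming a hypothetical fully qualified table name:

```sql
-- Recent readers of the changed table: an approximation of downstream impact
SELECT DISTINCT
    ah.user_name,
    ah.query_start_time
FROM snowflake.account_usage.access_history ah,
     LATERAL FLATTEN(input => ah.base_objects_accessed) obj
WHERE obj.value:"objectName"::STRING = 'PROD.ANALYTICS.DIM_CUSTOMERS'  -- hypothetical table
  AND ah.query_start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY ah.query_start_time DESC;
```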

Schema Change Monitoring FAQ

What is schema drift?

Schema drift is the gradual, often unplanned divergence of a table's structure from its expected definition. It happens when upstream systems evolve independently, when different teams make ad-hoc changes, or when third-party data sources update their export formats. Schema drift is cumulative: each individual change may be small, but over months the actual schema can diverge significantly from what downstream consumers expect. See Schema Drift: The Silent Pipeline Killer for a deeper look at why drift is so damaging and Using AI to Set Up Schema Drift Detection for an end-to-end walkthrough.

How often should I poll for schema changes?

For most teams, polling every 1 to 6 hours is sufficient. Critical production tables that feed real-time dashboards may warrant hourly checks. Staging and development tables can be checked daily. The right frequency depends on how quickly your upstream sources change and how much latency you can tolerate before detecting a break.

Can schema changes happen without anyone running DDL?

Yes. Delta Lake on Databricks enforces schema on write, but with schema evolution enabled it merges new columns automatically: if an incoming batch carries a different column set, the table schema changes as part of the write with no explicit DDL. Similarly, some ETL tools auto-detect source schema changes and propagate them to the warehouse without explicit DDL.

What is the difference between schema drift and schema evolution?

Schema evolution is an intentional, managed process where a table's structure changes according to a plan (e.g., adding a column for a new feature, migrating a type for better precision). Schema drift is unintentional or uncoordinated change. The technical mechanism is the same. The difference is whether someone planned and communicated the change.

How do I monitor schema changes in dbt?

dbt provides schema tests through its schema.yml configuration. You can assert expected columns, data types, and constraints. The dbt-expectations package adds expect_table_columns_to_match_ordered_list and similar checks. These tests run during dbt test rather than continuously, so they catch schema changes at build time but not between builds. For continuous monitoring, pair dbt with a data observability tool.

Do column additions break pipelines?

Usually not. Most SQL queries use SELECT column_name syntax rather than SELECT *, so a new column is invisible to existing queries. The exceptions are: pipelines that use SELECT *, positional CSV exports, and systems that validate the full schema against an expected list. If your downstream consumers are strict about schema, even an additive change can cause failures.

How do I track who made a schema change?

Use DDL audit logs. In Snowflake, query snowflake.account_usage.query_history for DDL operations to see the user, timestamp, and exact SQL. In PostgreSQL, use event triggers to log DDL commands with session user information. In Databricks, Unity Catalog audit logs capture DDL events with user attribution. Without audit logs, you can only see that a change happened, not who made it.

What schema changes are most dangerous?

Column removals and type changes are the most dangerous because they can cause silent data corruption. A removed column simply disappears from SELECT * output (downstream loads may quietly backfill it with NULLs) and causes hard failures in named-column queries. A type change from INTEGER to VARCHAR can cause implicit casting that silently changes aggregate results. Table renames are equally dangerous because every downstream reference breaks simultaneously.

Should I version my warehouse schema?

Yes, if your warehouse supports it. Delta Lake and Apache Iceberg provide time-travel and schema versioning natively. You can query the table as it existed at a previous point in time and compare schemas across versions. For warehouses without native versioning, maintain your own schema snapshot table (as shown in the SQL examples above) and treat it as a version history.
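A sketch of the native approach in Snowflake, using Time Travel (the table name is hypothetical):

```sql
-- LIMIT 0 returns no rows but exposes the column set as of one hour ago,
-- which you can compare against the current definition
SELECT * FROM dim_customers AT(OFFSET => -3600) LIMIT 0;

-- Delta Lake and Iceberg offer the equivalent via versioned reads, e.g.:
-- SELECT * FROM dim_customers VERSION AS OF 42 LIMIT 0;
```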

How is schema monitoring different from data quality monitoring?

Schema monitoring checks the structure of your data: column names, types, constraints, and table existence. Data quality monitoring checks the content: null rates, value distributions, freshness, and anomalies. Schema monitoring catches the container changing. Data quality monitoring catches the contents going wrong. Both are necessary, and both feed into data anomaly detection. A schema change often causes data quality failures downstream, so catching the schema change first gives you a head start on remediation.


Schema changes are inevitable. Catching them before they break your pipelines is not. See how AnomalyArmor monitors schema drift automatically across Snowflake, Databricks, and PostgreSQL.
