Why Null Rate Spikes Are the First Sign of Schema Drift in the Warehouse

#api #data #monitoring

If you only monitor one statistic on your warehouse-side tables for drift detection, monitor the null rate on every column. The reason is simple: most forms of upstream schema drift produce a null spike in the destination before they produce anything else. Catching the spike catches the drift before the dashboards catch it weeks later.

This is short on why null rate is the load-bearing metric, what to alert on, and the common false positives.

Renamed fields show up as null spikes

The most common upstream change is a renamed field. The SaaS calls customer_id account_id in a minor version. The integration code reads customer_id, gets null, writes null. Tomorrow's destination table has a column that was 98 percent populated yesterday and is 100 percent null today.

The pipeline's runtime checks all pass. The records processed. The destination write succeeded. The data is wrong.

Null rate monitoring is the cheapest detection you can run because the signal is unambiguous. A column that was almost always populated and is now almost never populated is broken. Alert on the change and you get to the drift before the downstream report does.

Optional fields with low historical null rates are the load-bearing case

The trap is monitoring all columns equally. A column with a 50 percent historical null rate has noisy day-to-day null rates. A change from 50 to 60 percent is probably noise. A change from 50 to 95 percent might be drift but might just be a slow day.

The columns where null rate monitoring works best are the ones with low historical null rates - under 5 percent populated reliably. Those are the columns where a spike from 2 percent null to 80 percent null is unmistakable. Set the alert thresholds tighter on those columns and looser on the legitimately noisy ones.

For most warehouses, this means partitioning your monitoring config by column class: "always populated" (alert at two-sigma deviation), "usually populated" (alert at three-sigma), "noisy" (do not alert, or aggregate to weekly trends). The classification is a 30-minute exercise per table and saves alert fatigue forever after.

Photo by Lightsaber Collection on Unsplash

Statistical monitoring tools that ship null-rate alerting out of the box

Most modern data quality tools include null rate as a first-class check. Great Expectations has expect_column_values_to_not_be_null and friends with built-in distributional thresholds. dbt's data tests ship not_null as a core test and dbt_expectations adds the distributional layer.

For lighter-weight setups, a daily query that calculates per-column null rates and writes them to a metrics table, plus a simple alert that fires when the latest day diverges from a 30-day rolling baseline, covers most of the value with no framework overhead. The metric is cheap to compute and the alert logic is simple.

What tool you pick matters less than the discipline of actually running the monitoring on the high-value tables. Most teams have data quality tools installed but only run them on a handful of marquee tables. Drift hides in the unmonitored ones.

The false positives you have to tune around

Null rate monitoring produces false positives, and the false positives are predictable.

Legitimate source changes. The source deprecates a field. Null rate goes to 100 percent. The drift is real but the remediation is "update the consumer to not read this field." Different from the silent-corruption case but still flags through the same alert path. Document it in triage.

Time zone or scheduling artifacts. A column gets populated by an asynchronous backfill that runs at midnight UTC. The null rate looks high until the backfill completes. Tag these columns in the monitoring config so the alert window starts after the expected backfill.

Small samples. A low-volume table with 50 records a day will have noisy null rates regardless. Aggregate to weekly thresholds for low-volume tables. Or set absolute thresholds (alert if more than 90 percent null in a week) instead of statistical ones.

The work of tuning these out is a one-time effort that pays back over the lifetime of the table. Skipping the tuning is the most common reason teams abandon null rate alerting after a month of false positives.

What null rate monitoring will not catch

This is where you have to be honest with yourself about the limits.

Null rate monitoring catches the renamed-field case. It catches the field-deprecation case. It catches the case where a transformation step silently drops values.

It does not catch the case where the values are present but wrong. A column whose values used to mean "quantity in units" and now means "quantity in packages" has the same null rate. The data is corrupted in a way the statistic cannot see.

For that class of drift you need record-level diffs between source and warehouse, or you need the integration approach 137Foundry uses - a typed contract layer at the intake boundary that fails when the values' meanings change. Both are harder than null rate monitoring. Both are necessary on the integrations where downstream consumers cannot tolerate any drift.

The Wikipedia entry on schema evolution and the Schema.org type system both have useful background reading on the broader problem space. For the practical guide to combining null rate monitoring with the other detection layers, the long-form piece is at https://137foundry.com/articles/how-to-detect-schema-drift-data-integration-before-silent-drops.

What this gets you on a small budget

If you have a Saturday afternoon and a warehouse with no data quality monitoring at all, the highest-leverage thing you can do is build a per-column null rate monitor on your three or four most important tables, with alerts on the columns that historically run under 5 percent null.

That is one query per table, plus one alert rule per column, plus a small metrics table to store the time series. Maybe two hours of work. It will not catch every drift event. It will catch the renamed-field case, which is the most common and the most expensive when it goes silent.

After that initial layer is running, add Great Expectations or dbt tests for the broader distributional checks (cardinality, range, accepted values). Layer in record-level diffs on the integrations carrying revenue, billing, or regulatory data. The order matters - start with the cheapest signal that catches the most common failure, then expand to the more expensive signals that catch the rarer ones.

Null rate is the highest signal-to-effort detection layer for warehouse-side drift. Start there. Add the rest as the team's drift-detection maturity grows.

How the baseline window affects what you catch

The choice of baseline window changes what null rate monitoring catches. Three windows in common use:

Short rolling window (7 to 14 days). Sensitive to recent shifts but noisy. Good for high-volume tables where the baseline updates quickly enough to reflect legitimate evolution. Bad for low-volume tables where day-to-day variance overwhelms the signal.

Long rolling window (30 to 90 days). Less sensitive, less noisy. Better baseline for slowly-evolving data. Will miss slow drift that creeps into the baseline as it progresses.

Fixed window with periodic resets. A baseline that gets explicitly reset every quarter after the team confirms current data is correct. Catches slow drift the rolling window absorbs but requires the manual reset discipline.

For most production tables, the 30-day rolling baseline is the right default. Run it for a quarter, see what alerts fire, tune from there. The teams that try to design the perfect baseline up front usually end up with monitoring that alerts on everything or nothing.

What the alert payload should include

The alert that fires when null rate spikes is only as useful as its payload. The minimum useful payload includes the table name, the column name, the current null rate, the baseline null rate, the number of records in the current window, and a one-line triage instruction.

Without the record count, engineers cannot distinguish "real drift on a low-volume day" from "noise on a high-volume day." Without the triage instruction, the alert sits in a channel until someone has bandwidth to investigate.