137Foundry

Posted on Jul 2

How to Write Your First Data Quality Assertion Set From an Incident Log

The single most common failure mode of a data quality gate is that the assertions were written from imagination rather than incidents. An engineer sits down, thinks about what could go wrong, writes a set of "reasonable-sounding" checks, and ships. Half the assertions never fire because the imagined failure mode does not actually occur in this pipeline. The other half miss the real failure modes because they were not on anyone's radar. Within a quarter, the team has two problems at once: real failures slip through, and the gate carries checks that guard against nothing.

The fix is to write the assertion set from the last six months of production incidents rather than from a whiteboard. This is a walkthrough of how to do that in about a day.

Step 1: Pull the incident log

Every team has one, even if it does not call itself an "incident log." The relevant records are Slack threads about broken dashboards, pager alerts about pipeline failures, retrospectives from actual outages, and support escalations tagged with "data is wrong."

Filter to the last six months. Older incidents may not reflect current pipeline topology or source system behavior. Six months is a common window because it is long enough to capture multiple examples of most failure modes and short enough to be relevant.

For each incident, extract three things: what record or batch triggered the problem, what specifically was wrong with the data, and what would have had to be true about the record for the pipeline to have stopped it at ingest.

Step 2: Group by failure mode

You will find that 30 to 50 incidents cluster into 5 to 8 failure modes. Common ones:

Field with wrong type (string where int was expected)
Enum value not in the allowed set
Referential violation (foreign key that does not exist in the parent table)
Range violation (negative amount, date in the future, ID out of expected range)
Duplicate primary key within a batch
Batch-level anomaly (total count off by an order of magnitude)
Format violation (email without an @, phone number with wrong digit count)

Each cluster becomes a category of assertion. The exact number of clusters varies by pipeline, but the pattern of "handful of categories, most incidents fit in the top three" is nearly universal. The Wikipedia article on data validation organizes these categories in more detail if you want a reference model.
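Once each incident carries a failure-mode label, the clustering itself is a counting exercise. The sketch below assumes you have already labeled the incidents by hand; the labels and their counts are invented for illustration.

```python
from collections import Counter

# Hypothetical failure-mode labels, one per incident.
incident_modes = [
    "enum", "enum", "range", "type",
    "enum", "referential", "range", "duplicate_key",
]

counts = Counter(incident_modes)
total = sum(counts.values())

# How much of the incident history do the three largest clusters cover?
top_three = counts.most_common(3)
top_three_share = sum(n for _, n in top_three) / total
```

If the top three clusters cover most of the log, that is where the first assertions should go.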

Step 3: Write one assertion per incident cluster

Not one assertion per incident. One assertion per cluster. If you have 12 incidents where an unknown currency code caused a downstream aggregation to drop rows, you write one assertion that checks currency codes against the allowed set. The next 12 incidents of the same type will be blocked by the same assertion.

Writing one assertion per incident produces the paranoid-assertion-set problem: an assertion set that grows linearly with your incident history and eventually becomes unmaintainable. The right granularity is per cluster.

Write the assertion in whatever DSL or tool your pipeline uses. Common patterns:

# Enum check (value must be in the allowed set)
assert record["currency"] in ALLOWED_CURRENCIES

# Range check
assert 0 <= record["amount_cents"] <= 10**10

# Referential check
assert record["customer_id"] in known_customer_ids

# Batch-level check
assert 0.5 * baseline_count <= len(batch) <= 2 * baseline_count

Tools like Great Expectations and dbt tests provide libraries of pre-built assertion patterns; the code above is illustrative of the shape rather than the exact syntax.
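If you are not using one of those tools yet, the same four patterns can be wrapped in plain functions that report failures instead of raising. This is a sketch under assumptions: the allowed currencies, customer IDs, and baseline count are placeholder values, and the cluster names in the returned strings are hypothetical.

```python
# Placeholder reference data; in a real pipeline these would be loaded
# from configuration or from the parent tables.
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}
known_customer_ids = {101, 102, 103}
baseline_count = 1000


def check_record(record):
    """Return the names of the record-level assertions this record fails."""
    failures = []
    if record.get("currency") not in ALLOWED_CURRENCIES:
        failures.append("enum:currency")
    amount = record.get("amount_cents")
    if not (isinstance(amount, int) and 0 <= amount <= 10**10):
        failures.append("range:amount_cents")
    if record.get("customer_id") not in known_customer_ids:
        failures.append("referential:customer_id")
    return failures


def check_batch(batch):
    """Batch-level check: size within 0.5x to 2x of the baseline."""
    return 0.5 * baseline_count <= len(batch) <= 2 * baseline_count


good = {"currency": "USD", "amount_cents": 1999, "customer_id": 101}
bad = {"currency": "XXX", "amount_cents": -5, "customer_id": 999}
```

Returning a list of failed assertion names, rather than raising on the first failure, makes it easier to count how often each cluster fires later on.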

Step 4: Prioritize by severity, not count

Not all clusters deserve the same treatment. A cluster of 30 incidents that each caused a 0.1 percent aggregation error is less urgent than a single incident that caused a customer to be double-billed. Sort clusters by cost-per-incident, not by count.
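In code, the ordering is a sort on a different key. The cluster names and cost figures below are invented for illustration; use whatever unit your team already uses to estimate impact.

```python
# Hypothetical clusters with a rough cost estimate per incident.
clusters = [
    {"name": "aggregation_drift", "incidents": 30, "cost_per_incident": 50},
    {"name": "double_billing", "incidents": 1, "cost_per_incident": 20000},
    {"name": "bad_phone_format", "incidents": 12, "cost_per_incident": 5},
]

# Sort by cost per incident, highest first, ignoring how often it happened.
by_severity = sorted(clusters, key=lambda c: c["cost_per_incident"], reverse=True)
priority_order = [c["name"] for c in by_severity]
```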

For each cluster, decide whether the assertion is a "block" (record does not reach production) or a "log-and-alert" (record reaches production but generates a warning). Block is the default. Log-and-alert is reserved for cases where the downstream cost of dropping the record is worse than the cost of letting a bad one through. Those cases exist but are rare.
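A minimal sketch of that decision, assuming each assertion is a predicate that returns True for a good record. The `Action` enum, the `apply_assertion` function, and the logger name are hypothetical names chosen for this example.

```python
import logging
from enum import Enum

logger = logging.getLogger("quality_gate")


class Action(Enum):
    BLOCK = "block"
    LOG_AND_ALERT = "log_and_alert"


def apply_assertion(record, name, predicate, action=Action.BLOCK):
    """Return True if the record may proceed to production."""
    if predicate(record):
        return True
    if action is Action.LOG_AND_ALERT:
        # The record goes through, but someone gets told about it.
        logger.warning("assertion %s failed; record passed through", name)
        return True
    return False


def non_negative(record):
    return record["amount_cents"] >= 0
```

Making block the default argument mirrors the rule above: an assertion only becomes log-and-alert when someone opts in explicitly.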

Step 5: Deploy in shadow mode first

Never deploy a fresh assertion set to active blocking on day one. Run it in log-only mode against the current pipeline for two weeks. Watch what it catches. Every alert should be either a true positive (the assertion caught a real problem, celebrate) or a tuneable false positive (the assertion needs a small tweak).

Shadow mode is where the team learns what each assertion actually does under production traffic. Almost every assertion needs a small tuning pass here: an edge case the team did not anticipate, a boundary condition, a special-case record that is legitimately weird. Better to discover those in shadow mode than to spend a weekend explaining why the pipeline blocked a valid customer's data.
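A shadow-mode runner can be very small. The sketch below assumes assertions are stored as a dict of name to predicate; the function name and the shape of the findings are hypothetical. The one property that matters is that the batch comes back untouched.

```python
def run_shadow(batch, assertions):
    """Evaluate every assertion on every record without dropping anything.

    Returns the original batch plus a list of findings to review.
    """
    findings = []
    for index, record in enumerate(batch):
        for name, predicate in assertions.items():
            try:
                ok = predicate(record)
            except Exception:
                # A predicate that crashes (for example on a missing field)
                # is recorded as a finding rather than stopping the run.
                ok = False
            if not ok:
                findings.append({"index": index, "assertion": name})
    return batch, findings


assertions = {"range:amount_cents": lambda r: 0 <= r["amount_cents"] <= 10**10}
batch = [{"amount_cents": 500}, {"amount_cents": -1}, {}]
passed_through, findings = run_shadow(batch, assertions)
```

Each finding is then triaged by hand as a true positive or as a reason to tune the assertion.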

Step 6: Turn on blocking one cluster at a time

Do not flip the whole assertion set from shadow to blocking simultaneously. Do it one cluster at a time, with a week between each. If a cluster's blocking mode causes an operational issue (queue depth grows too fast, quarantine team gets overwhelmed), roll that cluster back to log-only and diagnose. If everything is stable, move on to the next.

Full rollout should take four to six weeks for a five-cluster assertion set. That is slower than teams want to go and about the right speed for a stable outcome. The production data quality gate work at 137Foundry covers the rollout logistics in more detail.
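The staggered rollout can be expressed as a schedule plus a per-cluster mode lookup. This is a sketch with hypothetical names and a made-up start date, assuming a one-week gap between clusters and a set of clusters that have been rolled back to log-only.

```python
from datetime import date, timedelta


def blocking_schedule(cluster_names, start, gap_days=7):
    """Map each cluster to the date its blocking mode turns on."""
    return {
        name: start + timedelta(days=gap_days * i)
        for i, name in enumerate(cluster_names)
    }


def mode_for(cluster, today, schedule, rolled_back=frozenset()):
    """Return 'block' or 'log_only' for a cluster on a given day."""
    if cluster in rolled_back or today < schedule[cluster]:
        return "log_only"
    return "block"


schedule = blocking_schedule(
    ["enum", "range", "referential", "batch", "format"],
    start=date(2024, 7, 15),
)
```

With five clusters the last one turns on four weeks after the first; add the two weeks of shadow mode and you land at the upper end of the four-to-six-week estimate.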

Step 7: Set up the quarterly review

Every quarter, review the assertion set with two questions. First: which assertions fired in the past quarter, and were the catches useful? Assertions that have never fired, this quarter included, are candidates for removal. Second: which incidents in the past quarter escaped the gate, and what new assertion would have caught them?

The review is short (an hour, maybe two) and it is the mechanism that keeps the assertion set relevant over time. Without it, the set drifts toward staleness. With it, the set stays calibrated to actual production failure modes.
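Both review questions can be answered from data you are probably already collecting. The sketch below assumes you keep a fire count per assertion and a list of incidents that escaped the gate; the function name and the input shapes are hypothetical.

```python
def quarterly_review(assertion_names, fire_counts, escaped_incidents):
    """Summarize removal candidates and gaps for the review meeting."""
    never_fired = sorted(
        name for name in assertion_names if fire_counts.get(name, 0) == 0
    )
    return {
        "removal_candidates": never_fired,
        "needs_new_assertion": list(escaped_incidents),
    }


report = quarterly_review(
    assertion_names=["enum:currency", "range:amount_cents", "format:email"],
    fire_counts={"enum:currency": 14, "range:amount_cents": 3},
    escaped_incidents=["duplicate order_id within a batch"],
)
```

The output is only an agenda. Whether a silent assertion is removed, and what the new assertion for an escaped incident looks like, is still decided in the meeting.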

What good looks like after two quarters

A well-tuned assertion set two quarters after rollout has these properties. Between 6 and 15 assertions, not 50. Each one traces to a real incident in the historical record. The false positive rate is under 2 percent, low enough that the on-call engineer takes each alert seriously. The team has removed at least one assertion during a quarterly review, which means the pruning discipline is real rather than aspirational.
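Those four properties are easy to check mechanically. This is a sketch under assumptions: it reads "false positive rate" as the share of alerts triaged as false positives, and the `incident_ref` field and all sample values are hypothetical.

```python
def health_report(assertions, alerts_total, alerts_false_positive, removed_count):
    """Check an assertion set against the four properties described above."""
    fp_rate = alerts_false_positive / alerts_total if alerts_total else 0.0
    return {
        "size_ok": 6 <= len(assertions) <= 15,
        "all_traceable": all(a.get("incident_ref") for a in assertions),
        "fp_rate_ok": fp_rate < 0.02,
        "pruning_ok": removed_count >= 1,
    }


# Eight hypothetical assertions, each linked to an incident reference.
assertions = [
    {"name": f"assertion_{i}", "incident_ref": f"incident-{i}"} for i in range(8)
]
report = health_report(
    assertions, alerts_total=200, alerts_false_positive=3, removed_count=1
)
```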

The 137Foundry data integration services page covers the broader context of what surrounds the assertion set (ownership, SLAs, quarantine workflows). The assertion set alone is not enough. The set plus the human process is what makes the gate an actual control instead of a code artifact that runs.

What to do if you have no incident log

Some teams do not have a good incident log to draw from. Either the pipeline is new and there is not enough history, or the team never documented incidents in a searchable form. In that case, you have two options.

Option A: Run the pipeline in shadow mode against a broad but generic assertion set (schema checks against your known contract, plausibility ranges on the numeric fields, foreign-key checks on the join keys) and use the first month of production alerts as your synthetic incident log. Prune down from there.
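A generic starting set for Option A might look like the sketch below. The contract, ranges, and parent keys are placeholder values, and the field names are hypothetical; the function only reports findings and never blocks, which is what shadow mode needs.

```python
# Placeholder contract, plausibility ranges, and parent-table keys.
CONTRACT = {"order_id": int, "amount_cents": int, "customer_id": int}
RANGES = {"amount_cents": (0, 10**10)}
FOREIGN_KEYS = {"customer_id": {101, 102}}


def generic_findings(record):
    """Return the generic checks a record fails, for use as a synthetic log."""
    findings = []
    for field, expected_type in CONTRACT.items():
        if not isinstance(record.get(field), expected_type):
            findings.append(f"schema:{field}")
    for field, (low, high) in RANGES.items():
        value = record.get(field)
        if isinstance(value, (int, float)) and not low <= value <= high:
            findings.append(f"range:{field}")
    for field, parent_keys in FOREIGN_KEYS.items():
        if record.get(field) not in parent_keys:
            findings.append(f"fk:{field}")
    return findings
```

After a month, the finding names that show up most often are your clusters, and the ones that never appear are the first to prune.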

Option B: Interview downstream consumers. Ask each one "what would have to be wrong with a record for it to break your downstream job?" The answers are the assertions. This is slower but produces a set that is directly grounded in what downstream cares about.

Both options work. Option A is faster to get to a running gate. Option B produces a more defensible set at first review. If you have the time, do a bit of both.

The core insight, whichever path you take, is that assertions written from imagination fail predictably and assertions written from evidence succeed predictably. The evidence can come from a real incident log, from downstream consumers, or from the pipeline's own behavior in shadow mode. Whichever source you use, ground the set in something concrete rather than in what an engineer thinks would be a good idea to check.