Picking a data quality framework in 2026 is no longer a "should we test our pipelines?" debate — every senior data engineer knows the answer. The real debate is which framework: Python-native Great Expectations with its expectations + suites + checkpoints, warehouse-native dbt tests living next to models, or YAML-first Soda Core with SodaCL checks and the optional Soda Cloud dashboard. Each one solves the same problem — catch the broken row before the dashboard ships it — but each carries a different team shape, a different runtime contract, and a different blast radius when a test fails.
This guide is the dialect matrix you wished existed the first time a stakeholder asked "why is the metric off by 12%?" and the answer was a NULL row that slipped through ingestion. It walks the data quality testing vocabulary across the three frameworks (an Expectation vs a generic test vs a check is the same idea wearing three name tags), the expectations suite structure and how it maps to dbt's YAML, the generic tests library and how to extend it, and the coverage strategy that prevents alert fatigue. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, and a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the data validation practice library →, rehearse on NULL handling problems →, and stack the schema muscles with validation drills →.
On this page
- Why pick a data quality framework at all
- The DQ framework landscape in 2026
- Expectations vs tests vs checks — the same idea, three vocabularies
- Where they plug into the pipeline
- Coverage strategy — what to test where, without drowning
- Cheat sheet — DQ framework recipes
- Frequently asked questions
- Practice on PipeCode
1. Why pick a data quality framework at all
Silent data corruption is the most expensive class of pipeline bug — and BI dashboards are the worst place to catch it
The one-sentence invariant: the cost of catching a bad row at the source is one developer-hour; the cost of catching it in the executive dashboard is one quarter of trust. A data quality framework moves the assertion from "the analyst noticed the number looked weird" to "the pipeline failed at 03:14 and paged the on-call before the report ran" — and that shift is what separates a team that ships metrics from a team that explains metrics.
The four categories of data quality checks.
-
Row-level. "Every row of
ordershas a non-NULLorder_id." Cheap, deterministic, the 80% of coverage. -
Column-level. "The
amountcolumn is between 0 and 1,000,000 in 99.9% of rows; thestatuscolumn belongs to the enum{placed, paid, shipped, cancelled}." Catches schema drift and bad enums. -
Freshness. "The
eventstable has a row in the last 60 minutes." Catches stalled ingestion before the report consumer notices. -
Schema / contract. "The
orderstable has exactly these columns with these types in this order." Catches upstream renames, drops, and silent type changes — the hardest class to debug after the fact.
Tests live next to code, not in BI dashboards.
The anti-pattern is a Looker / Power BI / Tableau dashboard with a "data quality" tab that runs the same SQL on the same warehouse and turns red when something breaks. The dashboard is downstream — by the time it turns red, the bad data has already been read by other consumers, the metric snapshot has been computed, and the broken row has propagated. Push the assertion to the place where the data is produced (ingestion, transform, contract boundary) and the bad row never enters the warehouse in the first place.
Build vs buy vs adopt-an-OSS-framework.
- Build — write your own checks as SQL UNION ALL queries that return failing rows. Cheap to start; expensive to maintain; no shared vocabulary across teams.
- Buy — Monte Carlo, Bigeye, Anomalo. Strong on anomaly detection and lineage; sticker shock; lock-in.
- Adopt OSS — Great Expectations, dbt tests, Soda Core. Zero vendor cost; ecosystem of community-contributed assertions; lives next to your code in version control. This guide is about the OSS lane.
Why "we already have dbt tests" is rarely enough alone.
dbt tests are excellent for the warehouse-native transform layer — every model already has YAML, so co-locating tests next to columns is a low-friction win. But dbt only knows what runs in dbt build. It cannot test the ingest layer (the Airflow task that lands the file), it cannot run pre-warehouse assertions, and its anomaly / distribution muscles are weaker than what GE or Soda ship out of the box. Most mature teams end up combining: dbt tests at transform, GE or Soda at the edges.
Worked example — three places NULL can break a dashboard and the four assertion families that catch each one
Detailed explanation. New analysts often assume a single dbt test on the mart table is enough. The reality is that the same NULL row can be introduced at ingest (the source API silently changed and started returning null), at transform (a LEFT JOIN with missing right-side rows), or at contract drift (a column renamed upstream). Each of those failures needs a different assertion family — and a single test at the end of the pipeline catches them too late to attribute the source.
Question. Given a daily_revenue mart table that reads from staging_orders and staging_payments, list the four data-quality test families and the failure mode each one catches first. Show which framework excels at each layer.
Input.
| Layer | Table | Common failure mode |
|---|---|---|
| Ingest | raw_orders |
Upstream API returns NULL customer_id
|
| Transform | staging_orders |
LEFT JOIN with missing right side leaks NULL |
| Mart | daily_revenue |
SUM over NULL group yields 0 instead of NULL |
| Contract | daily_revenue |
Downstream BI expects revenue column; it was renamed to gross_revenue
|
Code.
# dbt tests (mart layer) — runs inside `dbt build`
models:
- name: daily_revenue
columns:
- name: revenue
tests:
- not_null
- dbt_utils.expression_is_true:
expression: ">= 0"
- name: day
tests:
- unique
- not_null
# Great Expectations (ingest layer) — runs inside an Airflow PythonOperator
suite = context.add_expectation_suite("raw_orders_ingest")
batch.expect_column_values_to_not_be_null("customer_id")
batch.expect_column_values_to_be_in_set(
"status", ["placed", "paid", "shipped", "cancelled"]
)
batch.expect_table_row_count_to_be_between(min_value=1, max_value=10_000_000)
# Soda Core (contract / freshness) — runs as a CLI step
checks for daily_revenue:
- row_count > 0
- schema:
fail:
when required column missing: [day, revenue]
- freshness(day) < 1d
Step-by-step explanation.
- The ingest test (GE) catches the upstream API change before the row ever reaches the warehouse — the Airflow task fails at 03:14 instead of the daily report being wrong at 09:00.
- The transform test (dbt) catches a NULL leak inside
staging_orders—dbt buildhalts the DAG and the mart never gets the bad day. - The mart test (dbt + Soda) catches the SUM-of-NULL pathology —
revenue >= 0would have been satisfied silently if SUM(NULL) is NULL; thenot_nulltest on the day makes the failure loud. - The contract test (Soda schema check) catches the column rename — the
daily_revenuetable either has therevenuecolumn or the run fails. - Each layer needs the assertion to live at that layer. A single mart-level test sees only the end result and cannot say which upstream layer broke.
Output.
| Layer | Best-fit framework | Why it wins here |
|---|---|---|
| Ingest | Great Expectations | Python-native, pluggable into Airflow / Dagster operators |
| Transform | dbt tests | YAML next to model, runs as part of dbt build
|
| Mart | dbt + Soda | dbt for column-level, Soda for freshness + row-count |
| Contract | Soda schema / dbt contracts | Declarative schema assertion, no glue code |
Rule of thumb. Map every dataset to the four assertion families before writing a single test. If a category is empty, that is the silent class of bug that will bite next quarter. Pick the framework per layer based on where it plugs in cleanly, not by which one your favourite blogger uses.
Worked example — the cost of not having a DQ framework
Detailed explanation. Teams often justify "we don't need a DQ framework — our analysts catch issues" until the first big silent outage. The math is unforgiving: a single bad metric that runs for three days on a board-level dashboard costs more in lost trust than three years of GE / Soda licences ever would (and both are free).
Question. Quantify the engineering cost of catching a bad row at four points: at ingest, in the warehouse, in the BI tool, and at the executive review. Show the time-to-detect and time-to-fix multiplier.
Input.
| Catch point | Avg time-to-detect | Avg time-to-fix | Trust cost |
|---|---|---|---|
| Ingest test fires | 5 min | 30 min | none |
| Warehouse transform fails | 30 min | 60 min | small |
| BI tool shows wrong number | 4 hours | 4 hours | medium |
| Executive review flags it | 3 days | 1 day | high |
Code.
# A trivial Great Expectations check on the ingest side
import great_expectations as gx
context = gx.get_context()
asset = context.sources.add_pandas("ingest").add_dataframe_asset("raw_orders")
batch_request = asset.build_batch_request(dataframe=df)
suite = context.add_expectation_suite("raw_orders_smoke")
validator = context.get_validator(batch_request=batch_request, expectation_suite=suite)
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000_000)
validator.expect_table_row_count_to_be_between(min_value=100)
results = validator.validate()
if not results.success:
raise RuntimeError("ingest DQ failed — quarantining batch")
Step-by-step explanation.
- Catching the bad row at ingest fires within 5 minutes — the Airflow task fails, the on-call sees a single page, the fix is "rerun after upstream is patched."
- Catching it at the transform layer (dbt test on staging) fires within 30 minutes — slightly more cost because some downstream models may have started building.
- Catching it at the BI layer is hours later because dashboards refresh on a schedule, and humans only look at certain dashboards on certain days.
- The worst case — exec review — is days later, and the cost includes a board-level apology, an "explainability" retro, and weeks of rebuilt trust in the affected metric.
- The order of magnitude difference between "ingest fail" and "exec review fail" is roughly 1,000x in eng-hours and "infinite" in trust.
Output.
| Catch point | Cost multiplier |
|---|---|
| Ingest | 1x |
| Warehouse | 4x |
| BI | 50x |
| Executive | 1,000x+ |
Rule of thumb. Every dataset deserves a minimum of not_null + unique + freshness at the ingest or staging boundary. Adding those three lines of YAML / Python costs minutes; missing them costs quarters of trust the first time a row slips through.
SQL interview question on detecting a silent NULL leak
A senior interviewer often opens with: "You shipped a daily_revenue table to the exec dashboard last week. This morning the CFO says the number is off by 12%. Walk me through the audit: where would you look first, and what assertions would have caught it earlier?"
Solution Using a layered audit + retroactive DQ assertion plan
-- 1) Audit the mart for NULL leaks per column
SELECT
COUNT(*) AS total_rows,
COUNT(*) - COUNT(customer_id) AS null_customer,
COUNT(*) - COUNT(revenue) AS null_revenue,
SUM(CASE WHEN revenue < 0 THEN 1 ELSE 0 END) AS negative_revenue
FROM daily_revenue
WHERE day = CURRENT_DATE - 1;
-- 2) Audit the staging table for upstream drift
SELECT
COUNT(*) - COUNT(amount) AS null_amount_in_staging,
COUNT(*) FILTER (WHERE status NOT IN
('placed','paid','shipped','cancelled')) AS bad_enum_status
FROM staging_orders
WHERE created_at >= CURRENT_DATE - 1;
-- 3) Compare against the contract: row_count delta day-over-day
WITH day_counts AS (
SELECT day, COUNT(*) AS row_count
FROM daily_revenue
GROUP BY day
)
SELECT day, row_count,
row_count - LAG(row_count) OVER (ORDER BY day) AS delta
FROM day_counts
ORDER BY day DESC
LIMIT 7;
Step-by-step trace.
| Step | Finding | Implied source layer |
|---|---|---|
| Mart audit | 12% rows have NULL customer_id
|
Either ingest or transform |
| Staging audit |
null_amount_in_staging = 0, bad_enum_status = 4,800
|
Ingest layer — upstream API drift |
| Row-count delta |
delta = -12,000 (down 12% vs prior week) |
Confirms the silent drop |
The audit narrows the source layer to ingest in three queries. The fix is twofold: a one-time backfill for the affected days, and a permanent GE expectation on the ingest table to prevent recurrence.
Output:
| Layer | Retroactive assertion |
|---|---|
| Ingest (GE) | expect_column_values_to_be_in_set("status", ["placed","paid","shipped","cancelled"]) |
| Staging (dbt) |
accepted_values test on status with the same enum |
| Mart (Soda) |
row_count anomaly check with a 5% deviation threshold |
Why this works — concept by concept:
- Layered audit, not single query — running the same NULL count at each layer reveals where the NULLs were introduced. A single mart-level query only confirms "something broke."
-
Enum drift detection —
bad_enum_status = 4,800proves the upstream API started returning an undocumented status. This is the kind of failure schema-only contracts miss. - Row-count delta — comparing today's count to the trailing window of comparable days exposes silent drops that pass column-level checks. A row that is "missing" never appears in NULL counts.
- Retroactive assertions become the regression suite — every audit finding becomes a permanent test. The first time the bug is caught is at exec review; the second time, the pipeline catches it before any consumer sees the bad day.
-
Cost — three short SQL queries (O(rows) each). The retroactive tests add zero marginal cost to
dbt buildand seconds to GE / Soda runs.
SQL
Topic — data validation
Data validation problems (SQL)
2. The DQ framework landscape in 2026
Five open-source frameworks dominate — pick by team shape and stack, not by feature count
The mental model in one line: Great Expectations is Python-native, dbt tests are warehouse-native, Soda Core is YAML-first and multi-source — every other contender is either a specialist (pandera, Deequ) or a thin wrapper around one of those three. Once you can pattern-match a team's stack to one of those archetypes, the framework choice is mostly a fait accompli.
Great Expectations (GE) in one paragraph.
- Vocabulary. An Expectation is a single assertion. A Suite bundles expectations for a dataset. A Checkpoint binds a Suite to a runtime execution and ships the results to Data Docs (the auto-generated HTML report).
- Strengths. Python-native, integrates with Pandas, Spark, SQLAlchemy, and every warehouse. Largest community-contributed expectation library. Data Docs is the gold standard for human-readable test reports.
- Weaknesses. Heaviest of the three on setup ceremony; "Data Context" objects can feel ceremonious; recent v3 migration shed weight but added migration burden for v2 users.
- Team shape it assumes. A Python-heavy team running Airflow / Dagster / Prefect, comfortable with config-as-code, and willing to invest a sprint in onboarding.
dbt tests in one paragraph.
-
Vocabulary. A generic test is a reusable macro that returns failing rows (built-ins:
not_null,unique,accepted_values,relationships). A singular test is a single SQL file that returns failing rows. A custom generic test is a macro you write that any model can reuse. -
Strengths. Lives in YAML next to the model definition — zero context switch. Runs as part of
dbt build, which means tests gate the next model. dbt_utils and dbt_expectations packages multiply the built-in library tenfold. - Weaknesses. Warehouse-only — cannot test before the data lands. Anomaly / distribution detection is weak without dbt_expectations. Severity and warn-vs-fail semantics are simpler than GE / Soda.
- Team shape it assumes. A warehouse-first team already running dbt for transforms — adding tests is a no-op decision because the YAML is already there.
Soda Core in one paragraph.
-
Vocabulary. A check is a single SodaCL clause (
row_count > 0,missing_count(col) = 0). A scan is one run over a dataset. User-defined metrics extend SodaCL with custom SQL. - Strengths. YAML-first, dialect-agnostic (SodaCL compiles to whatever SQL the source dialect speaks). CLI-friendly — drops into Airflow / GitHub Actions / cron without ceremony. Soda Cloud (optional, paid) adds a dashboard and Slack alerts on top of the OSS engine.
- Weaknesses. Smaller community than GE; some advanced expectations (distribution comparisons, partial-suite reuse) require Soda Cloud or DIY SQL.
- Team shape it assumes. A multi-source team that wants a single declarative file per dataset across Postgres, Snowflake, BigQuery, Spark, with optional managed dashboard.
The lightweight neighbours.
- pandera — DataFrame validation library for pandas / Polars / PySpark. Strong typing, schema inference, in-memory only. Use when the assertion lives in a Python script, not in the warehouse.
- Deequ — Spark/Scala library from AWS, ports of which exist in PyDeequ. Strong on big-data anomaly detection. Best when the dataset is already a Spark DataFrame.
- Elementary — dbt-native observability layer on top of dbt tests. Adds anomaly detection, lineage, and a dashboard. Treat as a dbt augmentation, not a standalone framework.
- Pydantic — application-layer schema validation. Not a DQ framework per se; pairs well at the API / ingest boundary.
The terminology cheat sheet.
| Concept | GE | dbt | Soda |
|---|---|---|---|
| One assertion | Expectation | Generic test (per column) | Check |
| Bundle of assertions | Suite | YAML test block on a model | Checks-file (per dataset) |
| One run | Checkpoint |
dbt test / dbt build
|
Scan |
| Custom assertion | Custom expectation | Custom generic test (macro) | User-defined metric |
| Severity | Severity (action level) | Severity (warn / error) | Warn / Fail clauses |
| Result store | Data Docs + Validation Result store |
run_results.json + manifest |
Soda Cloud / file / table |
Team-shape decision matrix.
- Python team, Airflow / Dagster, multi-source. GE wins. Pluggable into the orchestrator, expectation library beats everything.
- dbt-shop, warehouse-native, transform-heavy. dbt tests + dbt_expectations. Don't bring in a second framework until you hit anomaly or freshness needs that dbt can't cover.
- Polyglot team, YAML-first, multi-source, want a managed dashboard. Soda Core (free) + Soda Cloud (paid for the dashboard). The CLI scales from one dataset to one hundred without re-architecture.
- Data-science / DataFrame-heavy, Pandas / Polars. pandera. Treat schema as code at the boundary of every script.
- Big-data Spark shop. Deequ. Native to the engine you're already running.
Worked example — the same assertion in three frameworks
Detailed explanation. The fastest way to internalise the three vocabularies is to write the same assertion in all three. The assertion: "orders.amount is not NULL and is greater than 0." This is the most common production check, and seeing it spelled out three ways highlights where each framework's syntax becomes ergonomic.
Question. Translate "orders.amount IS NOT NULL AND orders.amount > 0" into a Great Expectations Suite, a dbt YAML test block, and a Soda Core checks file. Which is shortest? Which is most readable?
Input.
| order_id | amount |
|---|---|
| 1 | 50 |
| 2 | 0 |
| 3 | NULL |
| 4 | 25 |
Code.
# Great Expectations — Python suite
suite = context.add_expectation_suite("orders_amount")
validator = context.get_validator(
batch_request=orders_batch_request,
expectation_suite=suite,
)
validator.expect_column_values_to_not_be_null("amount")
validator.expect_column_values_to_be_between(
"amount", min_value=0, strict_min=True
)
validator.save_expectation_suite()
# dbt — schema.yml next to the model
models:
- name: orders
columns:
- name: amount
tests:
- not_null
- dbt_utils.expression_is_true:
expression: "> 0"
# Soda Core — checks file
checks for orders:
- missing_count(amount) = 0
- min(amount) > 0
Step-by-step explanation.
- GE's expression is the most verbose because it carries the Suite + Validator + Batch ceremony. The trade-off: every assertion ships with rich metadata and a Data Docs page automatically.
- The dbt version is the shortest if you accept the dbt_utils dependency for
expression_is_true. The built-innot_nulltest handles half the assertion; the other half needs an expression macro. - Soda Core sits between the two on verbosity.
missing_count = 0is the SodaCL idiom for "not null";min > 0is more declarative than a per-row predicate but means the same. - All three produce the same failing row in the input table:
order_id = 2(amount = 0) andorder_id = 3(amount NULL). - Readability is taste — but YAML wins for "non-engineer can read the assertion" by a wide margin.
Output.
| Framework | Lines of code | Failing rows detected |
|---|---|---|
| Great Expectations | 9 | 2 (order_id 2, 3) |
| dbt tests | 6 | 2 (order_id 2, 3) |
| Soda Core | 3 | 2 (order_id 2, 3) |
Rule of thumb. When the assertion is purely column-level (not_null, range, enum), all three frameworks are roughly equivalent. The choice comes down to where the assertion runs and who maintains it.
Worked example — distribution / anomaly checks where the frameworks diverge
Detailed explanation. "Yesterday's row count was 10,000; today's is 8,000 — is that a 20% drop or normal weekend dip?" This is a distribution check, not a per-row assertion. Each framework's answer is materially different.
Question. Write a "row count must be within 10% of the trailing 7-day average" assertion in GE, dbt (with dbt_expectations), and Soda. Show which is most ergonomic.
Input.
| day | row_count |
|---|---|
| 2026-06-08 | 10,000 |
| 2026-06-09 | 10,200 |
| 2026-06-10 | 9,800 |
| 2026-06-11 | 10,500 |
| 2026-06-12 | 9,900 |
| 2026-06-13 | 10,100 |
| 2026-06-14 | 10,000 |
| 2026-06-15 | 8,000 |
Code.
# Great Expectations — using a custom expectation
validator.expect_table_row_count_to_be_between(
min_value=9000, max_value=11000,
meta={"reason": "trailing 7-day average +/- 10%"}
)
# In practice teams wrap this in a custom expectation that
# reads the trailing window from a metadata table.
# dbt — using dbt_expectations
models:
- name: daily_revenue
tests:
- dbt_expectations.expect_table_row_count_to_be_between:
min_value: 9000
max_value: 11000
# Soda Core — using SodaCL anomaly clause
checks for daily_revenue:
- anomaly score for row_count < 0.7
- change for row_count > -10% and change for row_count < 10%
Step-by-step explanation.
- GE makes you compute the trailing average yourself (or in a custom expectation) — the built-in
expect_table_row_count_to_be_betweentakes static bounds. - dbt + dbt_expectations is the same —
expect_table_row_count_to_be_betweenis the closest built-in. Anomaly detection requires either Elementary or a singular test that joins against the trailing window. - Soda Core ships native
anomaly scoreandchangeclauses that compute the trailing window for you. This is where Soda materially outperforms the others on declarative anomaly checks. - For the input on
2026-06-15, the row count is 8,000, which is 20% below the 7-day average of 10,071. All three frameworks fire — but Soda fired without the engineer having to write the trailing-window logic.
Output.
| Framework | Lines of declarative config for trailing-window anomaly | Failing day detected |
|---|---|---|
| GE | needs custom expectation | 2026-06-15 |
| dbt + dbt_expectations | static bounds only without singular test | 2026-06-15 |
| Soda Core | 2 (built-in anomaly score + change) |
2026-06-15 |
Rule of thumb. For static / deterministic checks, all three are similar. For trailing-window anomaly checks, Soda is the most ergonomic out of the box; GE catches up with custom expectations; dbt needs Elementary or a singular test.
SQL interview question on choosing a DQ framework
A senior interviewer might frame this as: "Your team runs Airflow + dbt + Snowflake. You already use dbt tests for the warehouse models. Where would you add Great Expectations or Soda Core, and what is the failure mode of each layer if you skip it?"
Solution Using a layered framework decision tree
# Layer 1 — Ingest (Airflow PythonOperator + Great Expectations)
- task_id: ingest_orders_dq
python_callable: run_ge_checkpoint
op_kwargs:
checkpoint_name: raw_orders_smoke
retries: 0 # fail fast on bad ingest
# Layer 2 — Transform (dbt build runs dbt tests)
- task_id: dbt_build
bash_command: dbt build --select staging.+ marts.+
# Layer 3 — Mart freshness + row-count anomaly (Soda Core)
- task_id: soda_scan_mart
bash_command: soda scan -d snowflake checks/marts.yml
Step-by-step trace.
| Layer | Framework | Failure mode if skipped |
|---|---|---|
| Ingest | Great Expectations | Bad rows enter the warehouse; downstream metrics silently wrong |
| Transform | dbt tests | NULL leaks across JOIN; marts compute on incomplete data |
| Mart | Soda Core | Row-count anomalies / freshness lag go undetected until BI tab refreshes |
Each framework owns the layer where it plugs in cleanest. Skip a layer and the failure mode pictured above is what bites you next quarter.
Output:
| Layer | Test count | Run-time budget | Blast radius if it fails |
|---|---|---|---|
| Ingest GE | 12 expectations | ~30s | Single batch quarantined |
| Transform dbt | 50 tests | ~2 min | Single model halts; downstream gated |
| Mart Soda | 8 checks | ~20s | Single dataset flagged; downstream still runs but alerts fire |
Why this works — concept by concept:
- Layer-per-framework — each framework wins at the layer it was designed for. Forcing GE into the warehouse (or dbt into ingest) costs more than the integration debt of running all three.
- Fail-fast at ingest — GE Checkpoint fails the Airflow task; downstream dbt run never starts; nothing bad lands in the warehouse.
-
Gating transform with dbt build —
dbt buildruns tests between models, so a failed staging test halts the downstream marts in a single command. -
Soda for mart freshness + anomaly — declarative
freshness < 1dandanomaly scoreclauses outperform what dbt tests express natively, without writing a custom test. - Alert routing per layer — ingest fails go to the data-platform on-call; transform fails go to the analytics-engineer on-call; mart anomalies go to the analytics-engineer and the metric owner.
- Cost — three framework runs add ~3 minutes to the DAG. The savings: catching the same bad row in BI takes hours, in exec review takes days. Net positive every quarter.
SQL
Topic — validation
Validation problems (SQL)
3. Expectations vs tests vs checks — the same idea, three vocabularies
Once you align the three vocabularies on the same row of a matrix, every DQ migration is a syntax exercise — not a semantic one
The mental model in one line: an Expectation, a generic test, and a SodaCL check are isomorphic — same input, same output, same failure semantics, only the names differ. The hard part of DQ work was never the vocabulary; it is the coverage strategy and the alert routing. Once you internalise the synonym table, the migration cost between frameworks falls to a few hours per dataset.
The three vocabularies aligned.
| Concept | Great Expectations | dbt | Soda |
|---|---|---|---|
| One assertion | Expectation | generic test invocation | Check |
| Bundle for a dataset | Expectation Suite | YAML tests: block under a model |
Checks file (checks for <dataset>:) |
| One execution | Checkpoint |
dbt test / dbt build
|
Scan |
| Reusable assertion logic | Custom Expectation class | Custom generic test macro | User-defined metric (SQL) |
| One-off assertion | Custom Expectation | Singular test (single SQL file) | Inline SQL inside a check |
| Severity |
action_list per result |
severity: warn or error
|
warn: / fail: clauses |
| Result artefact | Data Docs + ValidationResults |
run_results.json + manifest |
Scan results (CLI / Cloud / warehouse table) |
Common assertions that exist verbatim in all three.
-
not_null. GE:
expect_column_values_to_not_be_null. dbt:not_null. Soda:missing_count(col) = 0. -
unique. GE:
expect_column_values_to_be_unique. dbt:unique. Soda:duplicate_count(col) = 0. -
accepted_values. GE:
expect_column_values_to_be_in_set. dbt:accepted_values. Soda:invalid_count(col) = 0withvalid values:clause. -
foreign_key / relationships. GE:
expect_column_values_to_be_in_setwith a query. dbt:relationships. Soda:failed rows querywith a join. -
row_count. GE:
expect_table_row_count_to_be_between. dbt:dbt_utils.expression_is_trueor singular test. Soda:row_count > 0. -
freshness. GE:
expect_column_max_to_be_betweenon a timestamp. dbt:freshnessblock in source config. Soda:freshness(updated_at) < 1d.
Custom logic — three styles.
-
GE custom expectation. Subclass
ColumnExpectationorBatchExpectationin Python. Ships with the package; reusable across suites. -
dbt custom generic test. A Jinja macro that returns failing rows. Lives in
tests/generic/and is invoked from YAML by name. -
SodaCL user-defined metric. Inline SQL inside a
metric_name:block; reusable inside the same checks file or imported.
Severity, thresholds, warn-vs-fail.
-
GE. Each expectation result is
success: bool. Severity is configured per Checkpoint viaaction_list, which can route a "warning" to one Slack channel and a "failure" to another. -
dbt. Built-in
severity: warn/severity: error. Per-test threshold viaerror_if/warn_if(e.g.,error_if: ">= 10"means "fail if 10+ failing rows"). -
Soda. Inline
warn:andfail:clauses per check. Multi-level threshold by writing two checks with different bounds.
Common interview probes on vocabulary.
- "What is the difference between an Expectation and a Suite in GE?" — an Expectation is one assertion; a Suite bundles them for a dataset and is reusable across batches.
- "What does
dbt builddo thatdbt testdoes not?" —dbt buildinterleaves model materialisation and test execution per node, so a failed test halts downstream models.dbt testruns tests independently after all models. - "Is Soda Core the same as Soda Cloud?" — Soda Core is the OSS engine (CLI + Python library). Soda Cloud is the paid dashboard / Slack-alerts layer that consumes Soda Core results. You can run Soda Core forever without Soda Cloud.
- "What is a singular test in dbt vs a generic test?" — singular is one SQL file in
tests/; generic is a Jinja macro intests/generic/invoked by name from YAML. Singulars are bespoke; generics are reusable.
Worked example — orders.amount > 0 AND amount IS NOT NULL in all three frameworks
Detailed explanation. The "amount must be positive and non-NULL" check is the canonical first assertion teams write. Spelling it in all three frameworks side by side surfaces the syntax differences and makes the migration delta concrete.
Question. Write the assertion in (a) a GE Expectation Suite, (b) a dbt YAML schema block, and (c) a Soda Core checks file. For each, show how to add a severity (warn-vs-fail) so the on-call only pages on >10 failing rows.
Input.
| order_id | amount |
|---|---|
| 1 | 50 |
| 2 | -10 |
| 3 | NULL |
| 4 | 25 |
| 5 | 0 |
Code.
# GE — Expectation Suite with action-list severity
suite = context.add_expectation_suite("orders_amount")
validator = context.get_validator(
batch_request=orders_batch_request, expectation_suite=suite
)
validator.expect_column_values_to_not_be_null("amount")
validator.expect_column_values_to_be_between(
"amount", min_value=0, strict_min=True,
mostly=0.99, # 99% must pass — under that, fail
)
validator.save_expectation_suite()
# Checkpoint routes: <0.95 -> page; <0.99 -> Slack warn
# dbt — schema.yml with severity and threshold
models:
- name: orders
columns:
- name: amount
tests:
- not_null:
severity: error
error_if: ">= 10"
warn_if: ">= 1"
- dbt_utils.expression_is_true:
expression: "> 0"
severity: warn
# Soda — checks file with warn / fail thresholds
checks for orders:
- missing_count(amount) = 0:
warn: when between 1 and 9
fail: when > 9
- min(amount) > 0:
warn: when min(amount) <= 0
Step-by-step explanation.
- Each framework expresses "0% missing, all positive" with two assertions. The first asserts non-NULL; the second asserts positivity.
- GE's
mostly=0.99lets the check pass when 99%+ of rows satisfy — the closest GE primitive to "warn at 1 failing row, fail at 10." - dbt's
error_if/warn_ifis the cleanest per-test threshold expression — semantically "fail if 10 or more failing rows, warn if 1 or more." - Soda's
warn: when between 1 and 9/fail: when > 9is the most declarative — the threshold reads like English. - All three correctly classify the input: rows 2, 3, 5 are failing (negative, NULL, zero). With 3 failing rows out of 5, the dbt test would warn (≥1) but not error (<10); Soda would do the same; GE with
mostly=0.99would fail because 60% pass < 99%.
Output.
| Framework | Failing rows | Severity classification |
|---|---|---|
| GE (mostly 0.99) | 3 of 5 | Fail (success rate 40% < 99%) |
| dbt (warn ≥1, error ≥10) | 3 | Warn |
| Soda (warn 1-9, fail >9) | 3 | Warn |
Rule of thumb. When migrating between frameworks, the vocabulary is a clean translation but the severity semantics require careful matching. GE thinks in success-rate percentages; dbt and Soda think in failing-row counts. Match the threshold by computing the equivalent.
Worked example — a custom assertion (regex-validate email)
Detailed explanation. A frequent custom-logic check: "every value in users.email matches a basic email regex." None of the three frameworks ship this as a built-in (regex is fast-moving and dialect-specific), so each one's custom-assertion mechanism is exercised.
Question. Write a regex-validation assertion for users.email in (a) a GE custom expectation, (b) a dbt custom generic test, and (c) a SodaCL user-defined metric.
Input.
| user_id | |
|---|---|
| 1 | alice@x.com |
| 2 | not-an-email |
| 3 | bob@y.org |
| 4 | NULL |
Code.
# GE — using built-in expect_column_values_to_match_regex
validator.expect_column_values_to_match_regex(
"email",
r"^[^@\s]+@[^@\s]+\.[^@\s]+$",
mostly=1.0,
)
-- dbt — tests/generic/test_matches_regex.sql
{% test matches_regex(model, column_name, pattern) %}
SELECT *
FROM {{ model }}
WHERE {{ column_name }} IS NOT NULL
AND NOT regexp_like({{ column_name }}, '{{ pattern }}')
{% endtest %}
-- Invoked in schema.yml
models:
- name: users
columns:
- name: email
tests:
- matches_regex:
pattern: "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$"
# Soda — user-defined metric with valid regex clause
checks for users:
- invalid_count(email) = 0:
valid regex: '^[^@\s]+@[^@\s]+\.[^@\s]+$'
Step-by-step explanation.
- GE has
expect_column_values_to_match_regexbuilt in — the simplest of the three. Pass the regex andmostlythreshold; done. - dbt does not ship regex tests, so a custom generic test wraps
regexp_likein a Jinja macro. The macro returns failing rows (the dbt contract for any test); dbt counts them and applies severity. - Soda Core ships
valid regexas a clause oninvalid_count— almost as concise as GE. The check fails when any non-NULL email fails the regex. - All three skip NULLs by convention — regex validation is a content check, not a presence check. Pair with
missing_count = 0if NULLs should also fail. - For the input, all three flag
user_id=2(not-an-email). NULLuser_id=4is skipped.
Output.
| Framework | Failing rows | Notes |
|---|---|---|
| GE | 1 | Built-in expectation, smallest LOC |
| dbt | 1 | Custom generic test, reusable across columns |
| Soda | 1 |
valid regex clause built-in |
Rule of thumb. For common custom logic (regex, range with units, foreign-key with filter), check if the framework already ships it before writing a custom assertion. For bespoke custom logic (cross-table invariants, time-bounded distributions), the custom mechanism is unavoidable — write it once, reuse forever.
SQL interview question on vocabulary alignment
A senior interviewer might ask: "Your team is consolidating a Soda Core checks library into a dbt project for a warehouse-only audit. Walk me through how you'd migrate row_count > 0, missing_count(col) = 0, and freshness(col) < 1d from SodaCL to dbt YAML — and what semantic gaps to flag for the team."
Solution Using a SodaCL-to-dbt migration table + a dbt_expectations bridge
# Original SodaCL checks
checks for daily_revenue:
- row_count > 0
- missing_count(revenue) = 0
- freshness(day) < 1d
- duplicate_count(day) = 0
# Migrated dbt schema.yml + dbt_utils + dbt_expectations
sources:
- name: marts
tables:
- name: daily_revenue
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
loaded_at_field: day
models:
- name: daily_revenue
tests:
- dbt_expectations.expect_table_row_count_to_be_between:
min_value: 1
columns:
- name: revenue
tests:
- not_null
- name: day
tests:
- unique
Step-by-step trace.
| SodaCL clause | dbt equivalent | Semantic gap |
|---|---|---|
row_count > 0 |
dbt_expectations.expect_table_row_count_to_be_between (min_value=1) |
None |
missing_count(revenue) = 0 |
not_null test on column |
None |
freshness(day) < 1d |
sources.freshness block, error_after: 24h
|
dbt source freshness runs separately via dbt source freshness, not inline in dbt build
|
duplicate_count(day) = 0 |
unique test on column |
None |
The migration is direct for three of four clauses. The freshness check moves from inline (Soda) to a separate dbt source freshness command (dbt) — that is the semantic gap to flag in the migration plan.
Output:
| Aspect | SodaCL (before) | dbt (after) |
|---|---|---|
| Lines of YAML | 4 | 14 |
| Single-command run | soda scan ... |
needs both dbt build and dbt source freshness
|
| Severity granularity | per-clause warn/fail
|
per-test severity: warn/error
|
| Anomaly checks (built-in) |
anomaly score clause |
needs Elementary or singular test |
Why this works — concept by concept:
-
Vocabulary isomorphism —
missing_count = 0maps directly tonot_null;duplicate_count = 0maps tounique; the semantics are identical. -
Freshness is split in dbt — source freshness lives in the source config (
dbt source freshness), not indbt build. The migration must add a second command to the orchestrator. - dbt_expectations bridges the row-count gap — the built-in dbt test library covers the column-level surface; dbt_expectations adds the table-level checks Soda has natively.
-
Severity translation is direct — Soda's
warn:/fail:maps to dbt'sseverity: warn/severity: errorwithwarn_if/error_ifthresholds. -
Anomaly is the residue — SodaCL's
anomaly scoreandchangeclauses do not migrate cleanly to dbt. Layer Elementary on top, or accept that anomaly detection moves out of the DQ framework and into a separate observability tool. - Cost — same per-row cost (every framework compiles to similar SQL). The migration cost is a one-time YAML rewrite plus one extra command in the DAG for freshness.
SQL
Topic — data validation
Data validation problems (SQL)
4. Where they plug into the pipeline
Placement (pre-ingest vs in-transform vs post-load) decides what you can quarantine — and that decision is bigger than the framework choice
The mental model in one line: GE plugs in via Python operators (Airflow / Dagster / Prefect), dbt tests run as part of dbt build in the warehouse, and Soda scans run as a CLI step anywhere in the DAG. Each placement carries a different blast radius when a check fails — fail-fast at ingest vs warn-only at the mart — and that contract is what the on-call cares about most.
Three placement archetypes.
- Pre-ingest. Assertion runs before the data lands in the warehouse — usually inside an Airflow / Dagster task that pulls a file or API response. Fail-fast: the bad batch is quarantined and never enters the table.
-
In-transform. Assertion runs between transform steps —
dbt buildinterleaves models and tests, so a failed test halts the downstream models in the same DAG. - Post-load. Assertion runs after the data is in the table, often on a schedule independent of the producing pipeline. Cannot fail-fast — the bad data is already there — but is the only option for cross-pipeline checks like freshness and anomaly.
Great Expectations — Checkpoints in Airflow / Dagster / Prefect.
# Airflow — GreatExpectationsOperator
from airflow_provider_great_expectations.operators.great_expectations import (
GreatExpectationsOperator,
)
ingest_dq = GreatExpectationsOperator(
task_id="ingest_orders_dq",
data_context_root_dir="/opt/ge",
checkpoint_name="raw_orders_smoke",
fail_task_on_validation_failure=True,
)
land_to_warehouse = SnowflakeOperator(
task_id="land_to_warehouse", sql="COPY INTO ...",
)
ingest_dq >> land_to_warehouse # land only if DQ passes
The Checkpoint binds a Suite to a batch, runs it, ships results to Data Docs, and (optionally) raises an exception when validation fails — which the operator translates into a failed task and a paged on-call.
dbt tests — dbt build interleaving.
# Single command runs every model + test in DAG order
dbt build --select staging.+ marts.+
# Equivalent to (conceptually):
# for each model in topological order:
# run model
# run tests configured on this model
# if any test fails: halt downstream
The interleaving is dbt build's key feature. dbt run (models only) plus a later dbt test (tests only) does not halt downstream models on a test failure — the difference matters and shows up on senior interviews.
Soda Core — CLI step or Soda Cloud agent.
# CLI step inside Airflow BashOperator
soda scan -d snowflake \
-c configuration.yml \
checks/marts.yml
# Or, the Soda Cloud agent runs on a managed schedule
# without needing the DAG to call it explicitly.
The CLI version drops into any orchestrator that can run a shell command. The Soda Cloud agent is the managed alternative — useful when you want a centralised dashboard and Slack routing without writing Airflow tasks per dataset.
Failure semantics — three modes.
-
Fail the run. The task / model / scan fails; the DAG halts; on-call is paged. Use at ingest (GE) and inside
dbt build. -
Warn-only. Log the failure to Data Docs /
run_results.json/ Soda Cloud; do not halt the DAG. Use for known-noisy checks (anomaly thresholds, soft enums) until the noise is calibrated. - Quarantine-and-continue. Bad rows are written to a side table; good rows continue to the main table; the DAG does not fail. Use when partial data is better than no data (e.g., reporting on a partial batch is acceptable; missing the batch entirely is not).
Result stores.
- GE → Data Docs. Auto-generated HTML report site. Hosts on S3 / GCS / a local FS; rebuilt after every Checkpoint.
-
dbt →
run_results.json+ manifest. Machine-readable artefacts; consumed by Elementary, dbt Cloud, or custom dashboards. - Soda → file / warehouse table / Soda Cloud. Most flexible: CLI emits JSON / writes to a warehouse table you specify; Soda Cloud (paid) ingests into a managed dashboard.
Alerting payloads.
- GE. Action list per Checkpoint; built-in Slack action, EmailAction, OpsgenieAction. Payload includes a link to the Data Docs page.
-
dbt.
run_results.jsonis the source; downstream tools (Elementary, dbt Cloud, custom) translate to Slack / PagerDuty. - Soda. Native Slack / Webhook / Soda Cloud integration. Payload includes the failing check, failing row count, and a link to the Cloud dashboard.
Worked example — GE Checkpoint inside an Airflow DAG that fails fast on ingest
Detailed explanation. The textbook ingest pattern: pull → DQ → land. A GreatExpectationsOperator fires between the pull and the land; on failure, the batch is quarantined and the load is skipped. This is the only pattern that guarantees the bad row never reaches the warehouse.
Question. Write an Airflow DAG with three tasks: pull a daily orders dump, run a GE Checkpoint, and land it into Snowflake. The DQ task must halt the DAG on failure.
Input.
| Task | Purpose |
|---|---|
| pull_orders | Download the API dump to S3 |
| dq_orders | Run raw_orders_smoke Checkpoint |
| land_to_warehouse | COPY INTO Snowflake |
Code.
from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.operators.s3 import S3CreateObjectOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from airflow_provider_great_expectations.operators.great_expectations import (
GreatExpectationsOperator,
)
with DAG(
"orders_ingest",
start_date=datetime(2026, 6, 15),
schedule="@daily",
catchup=False,
) as dag:
pull_orders = S3CreateObjectOperator(
task_id="pull_orders",
s3_bucket="ingest",
s3_key="orders/{{ ds }}.parquet",
data="<<api response>>",
)
dq_orders = GreatExpectationsOperator(
task_id="dq_orders",
data_context_root_dir="/opt/ge",
checkpoint_name="raw_orders_smoke",
fail_task_on_validation_failure=True,
)
land_to_warehouse = SnowflakeOperator(
task_id="land_to_warehouse",
sql="""
COPY INTO raw.orders
FROM @ingest_stage/orders/
PATTERN = '.*{{ ds }}\\.parquet';
""",
)
pull_orders >> dq_orders >> land_to_warehouse
Step-by-step explanation.
- The DAG runs daily.
pull_orderswrites the API response to S3. -
dq_ordersreads the same Parquet file and runs theraw_orders_smokeCheckpoint. The Checkpoint is configured to assertnot_null(customer_id),accepted_values(status), androw_count > 100. - If the Checkpoint succeeds,
land_to_warehouseruns and the bad row never had the chance to reach the warehouse. - If the Checkpoint fails,
fail_task_on_validation_failure=Trueraises an exception in the operator; Airflow marks the task as failed;land_to_warehouseis skipped; the on-call is paged via the standard Airflow on-failure callback. - The batch sits in S3 untouched — replaying after a fix is a simple manual re-trigger of
land_to_warehouseafter the upstream issue is resolved.
Output.
| Day | dq_orders status | land_to_warehouse status |
|---|---|---|
| 2026-06-14 | passed | succeeded |
| 2026-06-15 | failed (bad enum) | skipped |
| 2026-06-16 | passed (after upstream fix) | succeeded |
Rule of thumb. The "pull → DQ → land" pattern is the default for ingest DAGs. Add it once per source; every future bad batch is caught before it touches the warehouse.
Worked example — dbt build halts downstream models on a failed test
Detailed explanation. The reason senior DEs prefer dbt build over dbt run && dbt test is that build interleaves models and tests in DAG order — a failed test on the staging table halts the downstream marts in the same command. Pre-build, the same pipeline would build the mart on broken staging data and only fail at the final dbt test.
Question. Given stg_orders → int_orders_enriched → mart_daily_revenue, write a dbt build command and explain what happens when stg_orders.not_null(customer_id) fails.
Input.
| Model | Dependency |
|---|---|
| stg_orders | raw.orders |
| int_orders_enriched | stg_orders + dim_customers |
| mart_daily_revenue | int_orders_enriched |
Code.
# dbt build runs every node in DAG order, with tests interleaved
dbt build --select +mart_daily_revenue
# schema.yml beside stg_orders
models:
- name: stg_orders
columns:
- name: customer_id
tests:
- not_null:
severity: error
Step-by-step explanation.
-
dbt buildmaterialisesstg_ordersfirst. - Immediately after materialisation, it runs the tests configured on
stg_orders— includingnot_null(customer_id). - If
not_null(customer_id)fails (severity: error), dbt marks the test as failed and halts downstream —int_orders_enrichedandmart_daily_revenueare skipped. - The downstream models retain their previous values (the prior day's data), which is usually safer than overwriting with a half-broken build.
- The on-call receives a failure notification from dbt Cloud / Airflow / Elementary with the failing test name and the failing row count.
Output.
| Node | dbt build status |
|---|---|
| stg_orders | materialised + test failed |
| int_orders_enriched | skipped (upstream test failed) |
| mart_daily_revenue | skipped (upstream test failed) |
Rule of thumb. Always run dbt build, never dbt run followed by dbt test. The interleaved semantics is the entire reason build exists — and it is the only way to halt downstream materialisation on a failed test in a single command.
SQL interview question on alert routing
A senior interviewer might ask: "You have 200 DQ tests across GE, dbt, and Soda. The team is drowning in Slack alerts. Design the routing policy so the on-call only pages on actionable failures."
Solution Using a tiered severity + routing policy
# Tier 1 — critical (page on-call)
# Run: at ingest, in dbt build, on every Soda scan
# Severity: error / fail
# Route: PagerDuty + #data-oncall
# Tier 2 — important (Slack-only)
# Run: in dbt build, in Soda scan
# Severity: warn (single failing row tolerated)
# Route: #data-quality channel
# Tier 3 — informational (dashboard-only)
# Run: in Soda anomaly scan, in Elementary anomaly job
# Severity: anomaly_score < 0.5
# Route: Soda Cloud / Elementary dashboard, no Slack
# Example dbt config
models:
- name: stg_orders
columns:
- name: customer_id
tests:
- not_null: # tier 1
severity: error
- dbt_expectations.expect_column_value_lengths_to_be_between: # tier 2
min_value: 1
max_value: 64
severity: warn
Step-by-step trace.
| Tier | Trigger | Route | Expected daily volume |
|---|---|---|---|
| 1 — critical |
not_null, unique, schema |
PagerDuty + Slack #data-oncall | 0–1 per day |
| 2 — important | enum, range, FK | Slack #data-quality | 1–5 per day |
| 3 — informational | anomaly, freshness drift | Dashboard | dozens per day, no human triage |
The policy is enforced at the framework level (severity: clauses) plus the routing config (PagerDuty integration / Slack webhooks).
Output:
| Channel | Daily volume | On-call paged? |
|---|---|---|
| PagerDuty | 0–1 | yes (critical only) |
| #data-oncall | 0–1 | yes (mirrors PagerDuty) |
| #data-quality | 1–5 | no — async triage |
| Soda Cloud dashboard | dozens | no — review weekly |
Why this works — concept by concept:
- Tiered severity — every check has a tier baked in. Tier 1 is the contract; tier 2 is the smell; tier 3 is the signal. Each tier routes to a different channel with different expected response times.
- Page only on contract violations — the on-call sleeps unless a column the business depends on is broken. Everything else waits for business hours.
- Dashboard for anomalies — anomaly scores are noisy by nature; they belong on a dashboard reviewed weekly, not in a Slack channel that paged everyone last Tuesday.
-
Severity at the framework level —
severity: errorin dbt,fail:clauses in Soda,action_listin GE — the framework decides whether to escalate, not a custom Python receiver. - Audit the routing quarterly — every quarter, count alerts per channel. If #data-quality fired 200 times last quarter, the threshold is wrong or the tests are stale.
- Cost — zero additional pipeline cost; pure config / routing change. The cost was already paid the day the tests were written.
SQL
Topic — data validation
Data validation problems (SQL)
5. Coverage strategy — what to test where, without drowning
80% of outages are caught by not_null + unique + freshness on contract columns — the rest is incremental insurance
The mental model in one line: tier datasets by blast radius (gold = consumed by exec dashboards, silver = consumed by analysts, bronze = raw ingestion), and increase test density per tier. A flat "test everything equally" strategy either bankrupts the warehouse budget or drowns the team in noise — neither failure mode survives a quarterly review.
The 80/20 rule of DQ.
-
not_nullon every contract column. The column the business named in the data contract gets anot_nulltest, full stop. -
uniqueon every primary key. Including surrogate keys, natural keys, and grain-defining composite keys. -
freshnesson every table consumed by anything that runs on a schedule. A dashboard that refreshes daily needs a freshness check at ingest and at the mart. -
accepted_valueson every enum. Status, type, category, country code — any column with a fixed domain. -
relationships(FK) on every join key. Cheap to write; catches half of "the join lost 12% of the rows" tickets.
Tiered datasets by blast radius.
| Tier | Examples | Test density | Run frequency |
|---|---|---|---|
| Gold | exec dashboards, regulatory reports, revenue metrics | 100% contract coverage + anomaly + distribution | every run |
| Silver | analyst-facing marts, business-team reports | not_null + unique + accepted_values + freshness | every run |
| Bronze | raw ingest, change-data-capture tables | row_count + freshness | every run |
The pyramid logic: gold tier is narrow (fewer tables) but dense (many tests); bronze is wide (many tables) but sparse (few tests). The aggregate cost stays manageable while the contract layer gets the strictest scrutiny.
Schema contracts — three flavours.
- dbt contracts. Built-in to dbt 1.5+. Declare the contract in YAML; dbt enforces column names + types at build time. Native to the dbt build flow.
-
GE schema expectations.
expect_table_columns_to_match_ordered_list+expect_column_values_to_be_of_type. Asserts at runtime; integrates with Data Docs. -
SodaCL schema checks.
schema:clause withfail: when wrong column type/fail: when forbidden column present. Most ergonomic in YAML.
Anomaly checks layered on top of deterministic rules.
The deterministic layer (not_null, unique, FK) catches new breakage. The anomaly layer (row-count drift, distribution shift, freshness lag) catches gradual breakage. Together they cover both the cliff-edge failures and the slow-burn failures. Layer order matters: deterministic first, anomaly second — never the other way around.
Cost of running checks.
- Warehouse credits. Every dbt test compiles to a SELECT; every Soda check is a SELECT; GE checks on Spark / SQL also compile to SELECT. Tests on huge tables can dominate the build budget. Mitigate with sampled tests on bronze and full tests on gold.
-
Run-time budget. A 200-test dbt run that doubles the build time blows the SLA. Profile tests with
dbt --debugand prune the slowest. - False-positive fatigue. A test that fires once a week without being actionable trains the team to ignore it — and then a real failure goes unseen. Audit fired-but-ignored tests quarterly.
When to combine frameworks.
- dbt tests for the transform layer. Lowest friction; YAML is already there.
- GE Checkpoints at ingest + critical anomaly. Python-native, custom expectations for cross-table invariants.
- Soda scans for cross-source freshness + multi-warehouse audits. SodaCL is the only declarative answer for "test the same metric across Snowflake and Postgres."
Migration patterns.
- Start on dbt tests. Lowest cost of entry; covers warehouse layer end-to-end.
- Layer Soda for freshness + anomaly. Once dbt covers the contract, Soda adds the cross-cutting checks dbt does not express natively.
- Add GE at ingest. When the team has a bad-batch story to tell (e.g., "API silently changed an enum"), GE pays for itself in one quarter.
- Optional: Elementary on top of dbt. Adds anomaly detection and a dashboard without leaving the dbt project.
Worked example — coverage matrix for a 12-table mart
Detailed explanation. A team has 12 mart tables: 2 gold, 4 silver, 6 bronze. They want a coverage matrix that fits inside a 5-minute test budget. Building the matrix surfaces the trade-offs concretely.
Question. Design a coverage matrix for the 12 tables. Specify which tests run on which tier and estimate the test count.
Input.
| Table | Tier | Row count |
|---|---|---|
| mart_executive_revenue | gold | 10M |
| mart_executive_users | gold | 50M |
| mart_finance_invoices | silver | 5M |
| mart_finance_payments | silver | 8M |
| mart_marketing_attribution | silver | 100M |
| mart_marketing_funnel | silver | 200M |
| stg_orders | bronze | 500M |
| stg_payments | bronze | 1B |
| stg_users | bronze | 50M |
| raw_orders | bronze | 500M |
| raw_payments | bronze | 1B |
| raw_users | bronze | 50M |
Code.
# Gold tier — dense
models:
- name: mart_executive_revenue
tests:
- dbt_expectations.expect_table_row_count_to_be_between:
min_value: 1
columns:
- name: day { tests: [unique, not_null] }
- name: revenue { tests: [not_null] }
- name: region { tests: [not_null, accepted_values: { values: ['EU','US','APAC'] }] }
# Silver tier — moderate
- name: mart_finance_invoices
columns:
- name: invoice_id { tests: [unique, not_null] }
- name: amount { tests: [not_null] }
# Bronze tier — minimal (run as Soda freshness only)
# Soda — bronze tier freshness
checks for stg_orders:
- freshness(created_at) < 6h
checks for raw_orders:
- freshness(loaded_at) < 6h
Step-by-step explanation.
- Gold tables get full coverage: row_count, not_null on every contract column, unique on PKs, accepted_values on enums. Estimated 8 tests per gold table → 16 tests.
- Silver tables get the essentials: not_null on contract columns, unique on PKs. Estimated 4 tests per silver table → 16 tests.
- Bronze tables get a single freshness check each (via Soda, not dbt). Estimated 1 test per bronze table → 6 tests.
- Total: 38 tests across the 12 tables. Fits comfortably in 5 minutes of warehouse time on a Snowflake medium warehouse.
- If the test budget grows, add anomaly checks to the gold tier first (highest blast radius). Add range / FK tests to silver next. Bronze tier stays minimal.
Output.
| Tier | Tables | Tests per table | Total tests | Approx run time |
|---|---|---|---|---|
| Gold | 2 | 8 | 16 | ~2 min |
| Silver | 4 | 4 | 16 | ~2 min |
| Bronze | 6 | 1 | 6 | ~30s |
| Total | 12 | — | 38 | ~4.5 min |
Rule of thumb. Estimate test budget before writing tests. If the matrix exceeds the SLA, tier down (gold loses anomaly, silver loses FK). If the matrix fits with room, tier up (silver gains anomaly, bronze gains row_count).
Worked example — quarantine pattern with Soda + dbt
Detailed explanation. A team needs partial-data continuity: if 1% of rows fail the DQ check, the report should still ship on the 99% of good rows, with the bad rows quarantined to a side table for investigation. This is the "quarantine-and-continue" pattern, and it requires the assertion to live inside the transform, not as a separate test step.
Question. Write a dbt model that materialises only the rows that pass amount > 0 AND status IN (...), and writes the failing rows to a _quarantine side table. Show the Soda check that monitors the quarantine table size.
Input.
| order_id | amount | status |
|---|---|---|
| 1 | 50 | placed |
| 2 | -10 | placed |
| 3 | 25 | unknown |
| 4 | 0 | paid |
| 5 | 75 | paid |
Code.
-- models/marts/orders_clean.sql
{{ config(materialized='incremental') }}
WITH passed AS (
SELECT *
FROM {{ ref('stg_orders') }}
WHERE amount > 0
AND status IN ('placed','paid','shipped','cancelled')
)
SELECT * FROM passed
-- models/marts/orders_quarantine.sql
{{ config(materialized='incremental') }}
SELECT *,
CURRENT_TIMESTAMP AS quarantined_at,
CASE
WHEN amount IS NULL OR amount <= 0 THEN 'bad_amount'
WHEN status NOT IN ('placed','paid','shipped','cancelled') THEN 'bad_status'
END AS quarantine_reason
FROM {{ ref('stg_orders') }}
WHERE amount IS NULL
OR amount <= 0
OR status NOT IN ('placed','paid','shipped','cancelled')
# Soda — monitor quarantine table size
checks for orders_quarantine:
- row_count < 1000:
warn: when between 100 and 999
fail: when >= 1000
Step-by-step explanation.
-
orders_cleanfilters the passing rows. The downstream marts read from this clean table — the bad rows never propagate. -
orders_quarantinewrites the failing rows to a side table with aquarantine_reasoncolumn. The investigation team can query this table directly to triage upstream bugs. - The Soda check on
orders_quarantinemonitors the size of the quarantine. Below 100: silent. 100–999: warn. ≥1000: fail (something systemic is happening upstream). - The DAG does not fail unless the quarantine size crosses the fail threshold — partial data continues to flow, the reporting layer is happy, the on-call only wakes when the failure rate is genuinely abnormal.
- The trade-off: downstream metrics computed on
orders_cleanexclude the bad rows. Be explicit about this in the metric's definition ("excluding orders with invalid amount or status").
Output.
| Table | Rows | Downstream consumers |
|---|---|---|
| orders_clean | 3 (ids 1, 5, plus reclassifications) | marts, BI |
| orders_quarantine | 3 (ids 2, 3, 4 with reasons) | DQ on-call investigation |
Rule of thumb. Use quarantine-and-continue when partial data is materially better than no data — most reporting use cases qualify. Use fail-the-run only when partial data is worse than no data, e.g., financial reconciliation where a missing transaction is worse than a delayed report.
SQL interview question on combining frameworks
A senior interviewer might frame this as: "You have a Postgres source, a Snowflake warehouse, and a dbt project. You need DQ coverage from raw ingestion to the executive dashboard. How would you split the work across GE, dbt tests, and Soda Core — and why?"
Solution Using a three-framework layered architecture
# Layer 1 — Pre-ingest (Great Expectations)
# Where: Airflow Python task before Snowflake COPY
# What: not_null on PKs, accepted_values on enums, row_count
# Why: fail-fast quarantine; bad batch never reaches Snowflake
# Layer 2 — In-transform (dbt tests + dbt_expectations)
# Where: dbt build, schema.yml beside each model
# What: not_null, unique, accepted_values, relationships, FK
# Why: gates downstream model materialisation; YAML lives next to the model
# Layer 3 — Post-load freshness + anomaly (Soda Core)
# Where: Soda scan after dbt build completes
# What: freshness, row_count anomaly, distribution drift, cross-source
# Why: declarative anomaly clauses; cross-source checks dbt cannot express
Step-by-step trace.
| Layer | Framework | Test count (illustrative) | When it runs | Blast radius if it fires |
|---|---|---|---|---|
| 1 | GE | 12 | Airflow ingest task | Quarantine batch, skip Snowflake load |
| 2 | dbt | 60 | dbt build |
Halt downstream models in DAG |
| 3 | Soda | 20 | Post-build scan | Alert mart owner; no downstream halt |
Output:
| Concern | Owned by |
|---|---|
| "Did the right data land?" | GE (ingest) |
| "Did the transform preserve the contract?" | dbt (in-build) |
| "Is the mart fresh and within bounds?" | Soda (post-load) |
| "Is the dashboard right?" | All three combined |
Why this works — concept by concept:
- Layered framework choice — each framework owns the layer where it plugs in cleanest. No single framework spans all three layers without significant friction.
- Fail-fast at ingest, gate at transform, alert at mart — the failure mode scales down by layer. Ingest = block; transform = halt downstream; mart = alert and inspect.
- YAML proliferation is okay — both dbt and Soda are YAML-first; engineers learn one syntax (declarative DQ) once and apply it twice. The duplication cost is lower than the integration cost of forcing one framework to do all three jobs.
- **Soda Cloud is optional__ — the Soda layer works on Soda Core alone (free OSS). Adding Soda Cloud later gives a managed dashboard without rewriting any check.
- Severity per layer — ingest = fail, transform = fail (severity:error) or warn (severity:warn), mart = warn / dashboard. Routing matches the response time the team can sustain.
- Cost — three frameworks add maybe 5 minutes to the daily DAG. Compared to the cost of a silent week-long outage, it is the cheapest insurance the team will ever buy.
SQL
Topic — validation
Validation problems (SQL)
SQL
Topic — null handling
NULL handling problems (SQL)
Cheat sheet — DQ framework recipes
-
"Test inside the warehouse, beside the model." dbt generic + singular tests. YAML lives next to
schema.yml;dbt buildinterleaves models and tests; failed test halts downstream. -
"Python-heavy stack with Airflow / Dagster / Prefect." Great Expectations Checkpoints inside an operator. Pull → DQ → land is the textbook pattern;
fail_task_on_validation_failure=Truequarantines bad batches. - "YAML-first, multi-source, want a cloud dashboard." Soda Core (free) + optional Soda Cloud (paid). SodaCL compiles to source-dialect SQL; CLI drops into any DAG.
- "DataFrame validation in code." pandera for pandas / Polars / PySpark schemas; Deequ / PyDeequ for Spark. Apply at the boundary of every script.
-
"Schema contract at the edge." dbt contracts (in-warehouse, declared in YAML) or GE schema expectations (
expect_table_columns_to_match_ordered_list) or SodaCLschema:clause. -
"Anomaly + freshness without writing rules." Layer Soda Core (native
anomaly score+changeclauses) or Elementary (on top of dbt) or a vendor (Monte Carlo, Bigeye) on top of the deterministic test layer. -
"Quarantine bad rows, continue with good ones." Split the model into
_clean(WHERE pass) and_quarantine(WHERE fail) materialisations inside dbt. Monitor quarantine size with a Soda check. -
"Warn on 1 failing row, page on 10+." dbt:
severity: error+error_if: ">= 10"+warn_if: ">= 1". Soda:warn: when between 1 and 9/fail: when > 9. GE:mostlythreshold + Checkpoint action_list routing. - "Cross-source DQ (compare Postgres to Snowflake)." SodaCL — only one of the three frameworks that expresses cross-source declaratively. GE / dbt require glue code.
-
"Backfill DQ over the last 30 days after fixing the upstream bug." GE: re-run the Checkpoint with a date-range BatchRequest. dbt:
dbt test --select stg_orders --vars '{"start_date":"2026-05-15"}'. Soda: parametrise the checks file with a date variable. - "DQ in a notebook for ad hoc inspection." GE — the Python API loads a Suite, runs against a DataFrame, returns a result object you can render inline. The only one of the three that is notebook-ergonomic.
Frequently asked questions
Should I use Great Expectations or dbt tests if I already use dbt?
Use dbt tests for the warehouse-native transform layer — the YAML is already there, dbt build interleaves models and tests, and failed tests halt downstream in a single command. Reach for Great Expectations when you need to assert before the data lands in the warehouse (ingest checks inside Airflow / Dagster), when you need rich custom expectations that go beyond what dbt_expectations covers, or when Data Docs is the human-readable report your stakeholders want. Most mature teams run both: dbt tests own the transform; GE owns ingest + critical anomaly. They do not compete — they cover different layers.
What does Soda Core add that dbt tests don't?
Three things. First, native cross-source checks — SodaCL can run the same check against Postgres and Snowflake without rewriting the SQL, which dbt cannot do declaratively. Second, native anomaly detection — anomaly score for row_count and change for row_count > -10% are built-in clauses, whereas dbt needs Elementary or a singular test. Third, ergonomic freshness — freshness(updated_at) < 1d is a one-line check, whereas dbt source freshness lives in a separate config and runs via a separate command. Adopt Soda Core when your stack is multi-source or when anomaly / freshness coverage is the gap.
Are Great Expectations Checkpoints overkill for small teams?
For a team of one to three engineers running a single dbt project, yes — GE adds setup ceremony (Data Context, Suite, Checkpoint, action_list) that pays off only when you have multiple datasets, multiple environments, or stakeholders who consume Data Docs. Start with dbt tests; add Soda Core when you need freshness or anomaly; bring in GE when you have a bad-batch story (e.g., the API changed an enum and broke the warehouse) and you need fail-fast ingest assertions. The migration cost from "no GE" to "GE at ingest" is a sprint — small teams should defer that sprint until the value is proven.
Can I run Soda Core without Soda Cloud?
Yes — Soda Core is the OSS engine and runs entirely from the CLI or as a Python library. You write a checks.yml, run soda scan -d <warehouse> checks.yml, and the engine prints results to stdout / writes JSON to a file / inserts into a warehouse result table. Soda Cloud (paid) is the managed dashboard + Slack-alert layer that consumes Soda Core scan results, but it is strictly optional. Many teams run Soda Core forever and route results to their own dashboard (Grafana, Metabase, a custom dbt-rendered HTML page) — no licence fee, no vendor dependency.
How do I avoid alert fatigue from data quality tests?
Tier every check by severity before writing it. Tier 1 (contract: not_null on PKs, unique on grain keys, accepted_values on enums) pages the on-call via PagerDuty. Tier 2 (smell: range checks, FK relationships, length bounds) goes to a Slack channel with no paging. Tier 3 (anomaly: row-count drift, freshness lag) goes to a dashboard reviewed weekly. Audit the alert volumes quarterly — if Tier 2 fired 200 times last quarter and nothing was actionable, either the threshold is wrong or the tests are stale. The most expensive class of alert is the one that fires repeatedly without ever being acted on — it trains the team to ignore the next real failure.
Where do dbt contracts fit relative to all three frameworks?
dbt contracts (introduced in dbt 1.5) declare the expected column names + types of a model in YAML, and dbt enforces them at build time — a column rename or type change breaks dbt build before the model ships. Contracts complement but do not replace the assertion frameworks. Use contracts for structural guarantees (the schema is what it claims to be); use dbt tests for content guarantees (the values are not NULL, are unique, are in the enum). GE and Soda offer their own schema-assertion mechanisms (expect_table_columns_to_match_ordered_list, SodaCL schema: clause) which serve the same role for non-dbt or pre-warehouse layers. Adopt contracts inside dbt; mirror the same schema assertion at the ingest layer with GE or Soda for end-to-end structural coverage.
Practice on PipeCode
- Drill the data validation practice library → for the not_null / unique / accepted_values / FK family of probes.
- Rehearse on validation problems → for schema and range assertions across SQL dialects.
- Stack the NULL handling drills → for the IS NULL / NOT IN / DISTINCT FROM family that every DQ engineer needs cold.
- Layer the aggregation library → for the COUNT(*) vs COUNT(col) family that audits use.
- Stack the joins practice library → for the LEFT JOIN + COALESCE patterns the quarantine-and-continue model needs.
- For the broader surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the warehouse axis with the SQL for data engineering interviews course →.
- For long-form pipeline craft, work through ETL system design for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every DQ recipe above ships with hands-on practice rooms where you write the GE Suite, the dbt YAML, and the SodaCL check against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your `not_null + unique + freshness` coverage actually behaves the same on Snowflake as on Postgres.





Top comments (0)