dbt for data engineering is the canonical transformation layer of the modern data stack in 2026: it sits between your warehouse (Snowflake, BigQuery, Redshift, Databricks, Postgres) and your BI tools and replaces brittle stored procedures with version-controlled SQL models, declarative dbt tests, reusable dbt macros, and CI/CD-driven dbt production patterns. Seven things make a production dbt project hang together — dbt project structure, profiles.yml, dbt models with ref() / source() / materializations, the three dbt tests families, dbt macros and Jinja, the dbt packages ecosystem (dbt_utils, dbt_expectations, dbt_audit_helper, Elementary), and Slim CI with orchestration — and every senior dbt interview questions loop circles every one of them.
This deep guide walks all seven pillars in order, with real dbt YAML, SQL, and Jinja in every section. You'll see the canonical dbt_project.yml layout that ships in 90% of real projects, profiles.yml for dev / prod / ci targets across adapters, dbt ref vs source and the four materializations (view, table, incremental, ephemeral) as a layered DAG, dbt generic tests vs singular tests vs dbt model contracts, Jinja macros that compile per-call, the four community dbt packages every team installs, and dbt Slim CI with --defer state:modified+, Airflow DbtRunOperator, dbt Cloud vs Core, and Elementary freshness alerts. Every numbered H2 ends with a Question → Input → Code → Step-by-step → Output → Why this works worked example you can drop into a project.
When you want hands-on reps immediately after reading, browse the SQL practice lane →, drill ETL pipeline drills →, sharpen CTE patterns →, rehearse aggregation drills →, reinforce dimensional-modeling problems →, or widen coverage on the full data-modeling library →.
On this page
- Why dbt won the transformation layer of the modern data stack
- Project structure + profiles — dbt_project.yml · profiles.yml · adapters
- Models — refs, sources, materializations, layered DAG
- Tests — generic schema tests, singular tests, model contracts
- Macros + Jinja — write once, compile per-call
- Packages ecosystem — dbt_utils · dbt_expectations · dbt_audit_helper · Elementary
- Production patterns + CI/CD — Slim CI · orchestration · observability
- Choosing the right dbt primitive (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why dbt won the transformation layer of the modern data stack
dbt for data engineering — the warehouse-first, SQL-first, Git-first thesis
The one-sentence invariant: dbt for data engineering is "Git + SQL + Jinja + tests, compiled against your warehouse" — every transformation is a versioned .sql file, every dependency is a ref(), every column is testable, every business rule is one reusable macro, and every deploy runs in CI before it touches production. Once you internalise that, every other dbt design decision becomes a follow-up.
The three architectural commitments that won dbt the transformation layer.
-
Warehouse-first — dbt compiles to native warehouse SQL (
CREATE TABLE,CREATE VIEW,MERGE) and pushes the compute to Snowflake / BigQuery / Redshift / Databricks / Postgres. No data leaves the warehouse. - SQL-first — the surface language is SQL, the language your analysts and data engineers already share. Jinja adds templating without forcing engineers to learn a new DSL.
-
Git-first — every model is a
.sqlfile, every test is a YAML entry, every change is a pull request. The whole transformation layer is reviewable, blameable, and revertable.
Why the modern data stack converged on dbt.
-
pandasand Spark moved compute out of the warehouse — dbt moved it back in. Modern warehouses are cheap and elastic; the round-trip cost of moving data out is the bottleneck. - Stored procedures and ETL GUIs lost the diff war — they don't show up cleanly in PR reviews, can't be unit-tested, and don't version cleanly. dbt models are just text files, so Git handles all three.
-
ref()killed hard-coded table names — every model declares its upstreams; dbt computes the DAG and runs nodes in the right order without you maintaining a runbook. -
Tests as a first-class citizen —
unique,not_null,accepted_values,relationshipsship out of the box; bad data fails the build before it lands in BI. - Jinja templating — variables, conditionals, loops, macros — without leaving SQL.
-
Adapter ecosystem — one project runs on every major warehouse via a swappable adapter (
dbt-snowflake,dbt-bigquery,dbt-databricks,dbt-redshift,dbt-postgres).
What interviewers listen for in 2026 dbt loops.
- Do you reach for
ref()andsource()instead of hard-codeddb.schema.tablenames? — basic-but-tested fluency. - Do you name the four layers (sources → staging → intermediate → marts) when asked about project structure? — junior baseline.
- Do you contrast
view,table,incremental,ephemeraland pick the right one per layer? — mid-level signal. - Do you mention
model contracts, Slim CI (--defer + state:modified+),dbt build(run + test in one command), andElementaryfor freshness alerts? — senior signal. - Do you explain
dbt Cloud vs Coreas "Core is the engine; Cloud is the convenience layer (IDE + scheduler + Semantic Layer)"? — interview-canonical answer.
The five sub-themes the deeper loops add.
- Model contracts — enforce column types and constraints at build time; fail the run before SQL hits the warehouse.
-
Incremental models —
unique_key+mergestrategy for billion-row tables that you can't fully rebuild every run. -
Slim CI — only build models that changed (
--defer + state:modified+); 10× faster PR feedback. - Semantic Layer — metric definitions BI tools query so every team agrees on what "active user" means.
-
Observability — Elementary or re_data on top of
dbt artifactsfor freshness, anomaly detection, lineage.
Worked example — a 10-line dbt build cycle that touches every pillar
Detailed explanation. Every interviewer's favorite question shape: "walk me through what happens when you run dbt build in CI". The answer touches project structure, profiles, ref-resolution, materializations, tests, macros, and CI in one breath.
Question. A PR changes models/staging/stg_orders.sql and adds a new test on models/marts/fct_orders.sql. Sketch the dbt build lifecycle in CI.
Input.
| Step | Artifact involved |
|---|---|
| 1 |
dbt_project.yml and profiles.yml (target = ci) |
| 2 | Local manifest.json from dbt parse
|
| 3 | Production manifest.json from S3 (last successful run) |
| 4 |
state:modified+ selector |
| 5 | Compiled SQL written to target/compiled/
|
| 6 | Test results written to target/run_results.json
|
Code.
# 1. Resolve adapter + credentials
dbt deps # install packages.yml
dbt parse # produce target/manifest.json
# 2. Slim CI — only build what changed (plus downstream)
dbt build \
--select state:modified+ \
--defer \
--state ./prod_manifest \
--target ci
# 3. Tests run inline with each model (that's what `build` adds over `run`)
# 4. Upload the new manifest to S3 for the next PR's --defer baseline
aws s3 cp target/manifest.json s3://my-bucket/dbt/manifest.json
Step-by-step explanation.
-
dbt depsinstalls everything inpackages.yml(dbt_utils,dbt_expectations, etc.) intodbt_packages/. -
dbt parsereads every.sqland.ymland produces a freshtarget/manifest.jsonrepresenting the DAG. -
--state ./prod_manifestpoints at a previous manifest cached from production;state:modified+selects modified models plus everything downstream of them. -
--defertells dbt to resolve any unselectedref()against the prod manifest's relations, so you don't have to rebuild the whole upstream chain in CI. -
dbt buildruns the selected nodes; for each model it executes the compiled SQL, then runs every test attached to that model inline (the build verb does both, in dependency order). - After CI passes, upload the new
manifest.jsonso the next PR's--deferbaseline is up to date.
Output (CI log excerpt).
Running with dbt=1.8.3
Found 24 models, 87 tests, 12 sources, 5 macros, 3 packages
Concurrency: 4 threads (target='ci')
1 of 6 START sql view model dbt_ci.stg_orders ............. [RUN]
1 of 6 OK created view model dbt_ci.stg_orders ............ [CREATE VIEW in 0.34s]
2 of 6 START test unique_stg_orders_order_id .............. [RUN]
2 of 6 PASS unique_stg_orders_order_id .................... [PASS in 0.12s]
...
6 of 6 PASS dbt_expectations_expect_column_values_to_be_unique [PASS in 0.21s]
Completed successfully — 6 succeeded, 0 failed, 0 errors, 0 skipped
Why this works — concept by concept:
- Slim CI scopes the build to changed nodes plus their downstream, so PR runs cost minutes not hours.
-
--deferstitches unselected refs to production relations, eliminating the need to rebuild parents in every CI run. -
dbt buildruns models and their attached tests in one DAG walk, so a failing test halts downstream nodes immediately. - manifest.json is the artifact that makes Slim CI possible — caching it from prod to S3 is the one non-obvious operational step every senior dbt team standardises.
-
Cost —
state:modified+reduces typical PR build time from O(all models) to O(changed subgraph), often a 10-50× win.
SQL
Topic — etl
ETL pipeline drills
SQL
Topic — data-transformation
Data-transformation practice
2. Project structure + profiles — dbt_project.yml · profiles.yml · adapters
dbt project structure — the canonical layout every senior project ships
dbt project structure is convention, not rule — but the staging → intermediate → marts layout is the 2026 default and the first thing every reviewer looks for. The reason: predictable folder names make a 50-person engineering org navigable; new joiners know where to find a stg_orders.sql without asking.
The canonical project skeleton.
analytics/
├── dbt_project.yml # the central config (project name, paths, model defaults)
├── packages.yml # community packages (dbt_utils, dbt_expectations, ...)
├── profiles.yml # connection credentials per target (often kept in ~/.dbt/)
├── models/
│ ├── staging/ # 1:1 with sources; light renaming + casting only
│ │ ├── _stg_sources.yml # source() declarations + freshness
│ │ ├── stg_orders.sql
│ │ └── stg_customers.sql
│ ├── intermediate/ # reusable joins + business logic (int_*)
│ │ └── int_orders_enriched.sql
│ └── marts/ # business-facing fact + dim tables
│ ├── _marts.yml # tests + descriptions for marts
│ ├── fct_orders.sql
│ └── dim_customers.sql
├── tests/ # singular SQL tests (one file = one query)
│ └── assert_no_negative_revenue.sql
├── macros/ # Jinja reusables
│ └── cents_to_dollars.sql
├── seeds/ # tiny CSV reference data committed to git
│ └── country_iso.csv
├── snapshots/ # SCD2-style history capture
│ └── snap_customers.sql
└── analyses/ # exploratory queries (compiled, not built)
-
models/staging/— 1:1 with raw sources. Onestg_orders.sqlper source table. Light renaming, casting, andsafe_castonly; no joins, no business logic. The contract: anything downstream consumes staging, never raw. -
models/intermediate/— joins, fan-outs, reusable building blocks. Often namedint_orders_enrichedorint_customer_features. Materialized asephemeralortabledepending on reuse. -
models/marts/— the final fact (fct_*) and dimension (dim_*) tables BI tools and stakeholders query. These are the contract surface to the business. -
tests/— singular SQL tests. One file = oneSELECTthat returns failing rows; zero rows = pass. -
macros/— Jinja templates you can call from any model. Examples:cents_to_dollars,pivot_status_counts,date_spine. -
seeds/— CSV files committed to git that get loaded into the warehouse withdbt seed. Use for tiny reference tables (country codes, ISO currency mappings). -
snapshots/— SCD2-style history. dbt watches a query and writes a row every time a column changes.
dbt_project.yml — the central manifest of your project
The dbt_project.yml file defines project name, version, paths, and the default materialization per folder. Setting materialization at the folder level is the senior-vs-junior signal — junior engineers configure it per-model; senior engineers set sensible defaults at the directory and override only the exceptions.
# dbt_project.yml
name: 'analytics'
version: '1.0.0'
config-version: 2
profile: 'analytics' # matches the profile in ~/.dbt/profiles.yml
# Path configuration
model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
# Folder-level defaults — the senior pattern
models:
analytics:
staging:
+materialized: view # cheap; refresh on demand
+schema: staging
intermediate:
+materialized: ephemeral # inlined; never materialised
+schema: intermediate
marts:
+materialized: table # exposed to BI
+schema: marts
+on_schema_change: append_new_columns
vars:
start_date: '2024-01-01'
payment_methods: ['credit_card', 'ach', 'paypal']
-
Folder-level defaults — every model under
staging/is a view; every model undermarts/is a table; you override per-model only when needed (e.g. a single huge fact table flipped toincremental). -
+schema:— dbt suffixes the target schema. With target schemaanalytics, astagingmodel lands inanalytics_staging. -
vars:— project-wide variables accessible in models via{{ var('start_date') }}. Use for environment-specific knobs like backfill windows. -
+on_schema_change:— for incremental models, controls what happens when the source schema gains a column (append_new_columnsis the safe default).
profiles.yml — connection credentials and target environments
profiles.yml lives at ~/.dbt/profiles.yml (or in the project root for CI) and never enters Git — it holds credentials. The file defines named targets for dev / prod / ci, each pointing at a different warehouse, schema, and credential set.
# ~/.dbt/profiles.yml
analytics: # matches `profile:` in dbt_project.yml
target: dev # default target if --target not passed
outputs:
dev:
type: snowflake
account: my_account.us-east-1
user: "{{ env_var('DBT_DEV_USER') }}"
password: "{{ env_var('DBT_DEV_PASSWORD') }}"
role: ANALYTICS_DEV
database: ANALYTICS_DEV
schema: dbt_alice # per-developer schema — prevents stomping
warehouse: COMPUTE_WH
threads: 4
prod:
type: snowflake
account: my_account.us-east-1
user: "{{ env_var('DBT_PROD_USER') }}"
password: "{{ env_var('DBT_PROD_PASSWORD') }}"
role: ANALYTICS_PROD
database: ANALYTICS
schema: analytics
warehouse: COMPUTE_WH
threads: 8
ci:
type: snowflake
account: my_account.us-east-1
user: "{{ env_var('DBT_CI_USER') }}"
password: "{{ env_var('DBT_CI_PASSWORD') }}"
role: ANALYTICS_CI
database: ANALYTICS_CI
schema: "dbt_ci_pr_{{ env_var('PR_NUMBER', 'local') }}"
warehouse: COMPUTE_WH_XS
threads: 8
-
schema: dbt_alicein dev — every developer gets their own schema; dbt creates objects underanalytics_dev.dbt_alice_staging,analytics_dev.dbt_alice_marts, etc. No two developers stomp on each other. -
schema: "dbt_ci_pr_{{ env_var('PR_NUMBER') }}"in CI — each PR gets a throwaway schema; dropped on merge. This is what makes Slim CI safe. -
env_var('DBT_...')— credentials come from the environment, never the YAML. -
threads:— dbt's concurrency knob. Dev = 4, prod = 8, CI = 8 are typical. Each thread runs one model. -
role:(Snowflake) /location:(BigQuery) /catalog:(Databricks) — adapter-specific extras.
Adapter ecosystem — one project, every warehouse
dbt is adapter-driven: install a package, change the type: in profiles.yml, and the same models run against a different warehouse. The five most common adapters:
| Adapter | Install | type: |
Typical use |
|---|---|---|---|
dbt-snowflake |
pip install dbt-snowflake |
snowflake |
The most common production stack |
dbt-bigquery |
pip install dbt-bigquery |
bigquery |
Google-shop default; great for ad-hoc analysts |
dbt-databricks |
pip install dbt-databricks |
databricks |
Lakehouse / Delta-based projects |
dbt-redshift |
pip install dbt-redshift |
redshift |
Legacy AWS data-warehouse teams |
dbt-postgres |
pip install dbt-postgres |
postgres |
Local dev + small / self-hosted teams |
Worked example — bootstrap a new dbt project from scratch
Detailed explanation. Every dbt team's first hour: dbt init, swap in real credentials, point at a sandbox schema, and verify the example model compiles. This is the muscle memory every interview opener tests.
Question. Bootstrap a new dbt project called analytics against Snowflake and run the default example model.
Input. A Snowflake account, a sandbox warehouse, a sandbox database, and a personal schema.
Code.
# 1. Install dbt-core + the Snowflake adapter
pip install dbt-snowflake==1.8.*
# 2. Scaffold a new project
dbt init analytics
# (prompts for adapter, account, user, password, role, database, schema, warehouse)
cd analytics
# 3. Verify the connection
dbt debug
# 4. Install community packages (packages.yml created later)
dbt deps
# 5. Compile every model (no warehouse writes)
dbt compile
# 6. Build the example models + run tests
dbt build
Step-by-step explanation.
-
dbt initscaffolds the project skeleton (dbt_project.yml,models/example/) and writes a freshprofiles.ymlunder~/.dbt/. -
dbt debugverifies every part of the connection: adapter present, credentials valid, the chosen role can read / write the target schema. Run this any time something feels off. -
dbt compilereads every model and writes the rendered SQL totarget/compiled/. Nothing hits the warehouse; this is a fast syntax + ref-resolution check. -
dbt buildruns every model and every test in dependency order. For Snowflake it executesCREATE TABLE / VIEWs into your sandbox schema.
Output.
$ dbt debug
All checks passed!
$ dbt build
Found 2 models, 4 tests, 0 sources, 0 exposures, 0 metrics, 348 macros
Concurrency: 4 threads (target='dev')
1 of 6 START sql view model dbt_alice.my_first_dbt_model ... [RUN]
1 of 6 OK created view model dbt_alice.my_first_dbt_model .. [CREATE VIEW in 0.41s]
...
Completed successfully — 6 succeeded, 0 failed
Why this works — concept by concept:
-
dbt initships a working starter project so you can prove the connection in under five minutes. -
dbt debugis the single best diagnostic command — it walks every layer (adapter, network, auth, role permissions) and reports the first failure with the offending stanza. -
dbt compilevsdbt build— compile renders SQL to disk; build executes it and runs tests. Use compile to iterate fast, build to ship. -
Per-developer schema — the
schema: dbt_alicedefault keeps every engineer's sandbox isolated; no overlap between teammates. -
Cost —
dbt debugis free;dbt compileis free (no warehouse compute); onlydbt buildanddbt runcost warehouse credits.
SQL
Topic — etl
Pipeline structure drills
SQL
Topic — data-transformation
Warehouse transformation practice
3. Models — refs, sources, materializations, layered DAG
dbt models — every .sql file is a versioned SELECT
dbt models are the unit of work — every .sql file under models/ is a single SELECT statement that dbt wraps in a CREATE TABLE or CREATE VIEW against your warehouse. You never write the DDL yourself; dbt generates it based on the model's materialization.
The model contract — one SELECT, zero side effects.
- A model is one
SELECTat the top level; noCREATE, noINSERT, noMERGE. - The compiler wraps it with the appropriate DDL based on materialization.
- The model's name is the file name (
stg_orders.sql→stg_ordersrelation). - The model's upstreams are inferred from every
{{ ref('...') }}and{{ source('...', '...') }}call inside it.
dbt ref vs source — the two ways a model declares its inputs
{{ ref('upstream_model') }} points at another dbt model. {{ source('schema', 'table') }} points at a raw table you don't own (a Fivetran-loaded raw schema, a Postgres replica, a Kafka sink). The two together form a complete dependency graph dbt walks at run time.
-- models/staging/stg_orders.sql
{{ config(materialized='view') }}
with src as (
select * from {{ source('raw_jaffle_shop', 'orders') }}
)
select
id as order_id,
user_id as customer_id,
order_date,
status,
cast(amount_cents as numeric) / 100 as amount_usd
from src
-- models/marts/fct_orders.sql
{{ config(materialized='table') }}
with orders as (
select * from {{ ref('stg_orders') }}
),
customers as (
select * from {{ ref('stg_customers') }}
)
select
o.order_id,
o.customer_id,
c.region,
o.order_date,
o.amount_usd
from orders o
left join customers c using (customer_id)
where o.status = 'completed'
# models/staging/_stg_sources.yml
version: 2
sources:
- name: raw_jaffle_shop
database: RAW
schema: jaffle_shop
tables:
- name: orders
description: "Raw orders from the production OLTP DB, replicated by Fivetran."
freshness:
warn_after: { count: 12, period: hour }
error_after: { count: 24, period: hour }
loaded_at_field: _fivetran_synced
- name: customers
-
source()lets you swap the underlying raw table (e.g. move from one ingest tool to another) by editing one YAML; every staging model picks up the change. -
freshnessthresholds powerdbt source freshness, which is your first line of defense against silent upstream breakage. -
ref()computes the DAG. dbt re-orders execution automatically — you never writeCREATE TABLE x DEPENDS ON y. -
The non-negotiable rule — never
select * from analytics.staging.stg_ordersdirectly. Alwaysref(). Hard-coded names break Slim CI,--defer, and cross-environment portability.
dbt materializations — view, table, incremental, ephemeral
dbt materializations are the four shapes a model can take in your warehouse. Pick the right one per layer; the wrong choice is the most common source of slow or expensive dbt projects.
| Materialization | What dbt does | When to use | Cost shape |
|---|---|---|---|
view |
CREATE OR REPLACE VIEW — no data stored |
Staging models, ad-hoc transforms over small data | Cheap to refresh, slow to query |
table |
CREATE OR REPLACE TABLE AS SELECT — full rebuild every run |
Marts, anything BI tools hit, anything joined to repeatedly | Fast to query, full-rebuild cost per run |
incremental |
First run = table; subsequent runs = MERGE of new rows |
Billion-row events / fact tables you can't fully rebuild | Cheap incremental cost, complexity overhead |
ephemeral |
Inlined as a CTE in the downstream model — never materialised | Small reusable joins, no direct querying | Zero storage; not queryable directly |
Incremental models — the production fact-table default.
-- models/marts/fct_events.sql
{{ config(
materialized='incremental',
unique_key='event_id',
on_schema_change='append_new_columns',
incremental_strategy='merge'
) }}
select
event_id,
user_id,
event_type,
occurred_at,
payload
from {{ source('raw_events', 'events') }}
{% if is_incremental() %}
-- only scan new rows
where occurred_at > (select coalesce(max(occurred_at), '1900-01-01')
from {{ this }})
{% endif %}
-
is_incremental()macro —trueonly when the target table already exists; lets the same file run as a full rebuild on first run and incrementally afterwards. -
{{ this }}— refers to the current model's target relation (e.g.analytics.marts.fct_events). -
unique_key— column dbt uses to determine "is this row new or an update?". -
incremental_strategy='merge'— the default on Snowflake / BigQuery / Databricks; on Postgres / Redshift the default isdelete+insert. -
on_schema_change='append_new_columns'— when the source schema gains a column, dbt adds it to the target without failing the run. Safer than the defaultignore.
The layered DAG — sources → staging → intermediate → marts
The four-layer pattern is the project shape every senior dbt team converges on:
-
Sources (raw) — owned by Fivetran / Airbyte / your replication tool; declared via
source(). -
Staging (
stg_*) — 1:1 with sources; rename columns, cast types, addsafe_cast, drop PII. Materialised asview. -
Intermediate (
int_*) — reusable joins and business logic. Materialised asephemeral(small reuse) ortable(heavy reuse). -
Marts (
fct_*,dim_*) — the contract to BI / business. Materialised astableorincremental.
The contract: downstream layers may only ref() upstream layers. Marts may not ref() other marts (instead, factor the join into an intermediate). Staging may not ref() other staging (instead, hold the join until intermediate). Enforce this with a dbt_project.yml config or a CI lint.
Worked example — a layered DAG with three layers + an incremental fact
Detailed explanation. Wire up a tiny but real DAG: two raw sources, two staging models, one intermediate model, one incremental fact, one dim table. This is the shape every junior interview asks you to sketch.
Question. Build a fct_orders incremental fact that joins to dim_customers, sourced from raw jaffle_shop.orders and jaffle_shop.customers.
Input.
| Layer | File | Materialization | Upstream |
|---|---|---|---|
| source | raw_jaffle_shop.orders |
(raw) | Fivetran |
| source | raw_jaffle_shop.customers |
(raw) | Fivetran |
| staging | stg_orders.sql |
view | source orders |
| staging | stg_customers.sql |
view | source customers |
| intermediate | int_orders_enriched.sql |
ephemeral | stg_orders + stg_customers |
| mart | dim_customers.sql |
table | stg_customers |
| mart | fct_orders.sql |
incremental | int_orders_enriched |
Code.
-- models/intermediate/int_orders_enriched.sql
{{ config(materialized='ephemeral') }}
select
o.order_id,
o.customer_id,
c.region,
o.order_date,
o.amount_usd,
o.status
from {{ ref('stg_orders') }} o
left join {{ ref('stg_customers') }} c using (customer_id)
-- models/marts/fct_orders.sql
{{ config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge'
) }}
select *
from {{ ref('int_orders_enriched') }}
where status = 'completed'
{% if is_incremental() %}
and order_date > (select max(order_date) from {{ this }})
{% endif %}
# Build only the orders subgraph
dbt build --select +fct_orders
Step-by-step explanation.
-
dbt parsewalks every.sqland discovers the upstream chain viaref()andsource(). -
--select +fct_ordersselectsfct_ordersand all upstream nodes (the+prefix). dbt schedulesstg_orders,stg_customers,int_orders_enriched,dim_customers,fct_ordersin dependency order. -
int_orders_enrichedis ephemeral — dbt never creates it as a table; instead it inlines the SQL as a CTE insidefct_ordersat compile time. -
fct_ordersis incremental; first run =CREATE TABLE, subsequent runs =MERGE INTO fct_orders USING (the SELECT) ON order_id = order_id. - Tests attached to any of these models run inline (because we used
dbt build).
Output (compiled fct_orders, second run).
MERGE INTO analytics.marts.fct_orders AS target
USING (
with int_orders_enriched as (
select o.order_id, o.customer_id, c.region, o.order_date,
o.amount_usd, o.status
from analytics.staging.stg_orders o
left join analytics.staging.stg_customers c using (customer_id)
)
select *
from int_orders_enriched
where status = 'completed'
and order_date > (select max(order_date) from analytics.marts.fct_orders)
) AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
Why this works — concept by concept:
- Layered DAG isolates concerns: staging never knows about business rules; marts never know about raw column quirks.
-
ephemeralkeepsint_orders_enrichedout of the warehouse — useful since it's only joined to once. -
incremental+unique_keyturns full rebuilds intoMERGEs, making billion-row fact tables tractable. -
+fct_ordersselector scopes the run to the chain that matters; great for local iteration. - Cost — first run is O(all orders); subsequent runs are O(new orders only). For a busy fact table the savings compound daily.
SQL
Topic — dimensional-modeling
Dimensional-modeling drills
SQL
Topic — ctes
CTE pattern practice
4. Tests — generic schema tests, singular tests, model contracts
dbt tests — three families, one promise: bad data fails the build
dbt tests are the second-most-important pillar after models — they're the contract that turns SQL into a tested codebase. dbt ships three families: generic schema tests (declarative, one-liner per column), singular tests (bespoke SQL that returns failing rows), and model contracts (warehouse-enforced column types and constraints). Use all three.
dbt generic tests — declarative, in YAML, on every column that matters
Generic tests are the cheapest unit of correctness in dbt. You declare them in YAML next to the model; dbt runs them as SELECT COUNT(*) FROM (...) WHERE expected_invariant_violated. Zero rows back = pass.
# models/marts/_marts.yml
version: 2
models:
- name: fct_orders
description: "Order-grain facts for revenue reporting."
columns:
- name: order_id
description: "Primary key — one row per order."
tests:
- unique
- not_null
- name: customer_id
description: "FK to dim_customers."
tests:
- not_null
- relationships:
to: ref('dim_customers')
field: customer_id
- name: status
tests:
- accepted_values:
values: ['completed', 'pending', 'cancelled']
severity: error
- name: amount_usd
tests:
- dbt_utils.expression_is_true:
expression: ">= 0"
config:
severity: warn
- name: dim_customers
columns:
- name: customer_id
tests:
- unique
- not_null
-
unique—SELECT col, COUNT(*) FROM model GROUP BY col HAVING COUNT(*) > 1. Fails if any duplicate. -
not_null—WHERE col IS NULL. -
accepted_values—WHERE col NOT IN (allowed_list). -
relationships—WHERE fk NOT IN (SELECT pk FROM target_model); the equivalent of a foreign-key check. -
dbt_utils.expression_is_true— boolean predicate; failing rows are those where the expression is false. -
severity: warnvsseverity: error— warn logs the failure but exits 0; error fails the build. Use warn for data-quality smells you want to triage; use error for invariants that must hold.
dbt singular tests — bespoke SQL, one file, one query
Singular tests cover anything generic tests can't — multi-table joins, business-rule invariants, sanity checks across the warehouse. Each is a .sql file under tests/; the file is a single SELECT that returns failing rows.
-- tests/assert_no_negative_revenue.sql
-- This test passes when zero rows are returned.
select
region,
sum(amount_usd) as total_revenue
from {{ ref('fct_orders') }}
group by region
having sum(amount_usd) < 0
-- tests/assert_orders_have_a_customer.sql
-- Catches orphan orders missing a matching dim_customers row.
select o.order_id
from {{ ref('fct_orders') }} o
left join {{ ref('dim_customers') }} c using (customer_id)
where c.customer_id is null
-
The contract — zero rows = pass; any rows = fail. The failing rows are written to
target/run_results.jsonand (with--store-failures) to a debug table you can inspect. -
Naming convention —
assert_*.sqlso test files sort together and the intent is obvious. - Cross-model invariants — singular tests are the only way to test "this column in model A matches the sum of this column in model B".
-
Don't reach for singular tests when a generic exists —
accepted_valuesis a one-liner; rewriting it as a singular SQL is noise.
dbt model contracts — warehouse-enforced schemas
Model contracts (added in dbt 1.5) enforce the column list, data types, and constraints at build time — before the SQL even runs. They're how you turn a model into a versioned API for downstream consumers (other teams, BI tools, the Semantic Layer).
# models/marts/_marts.yml
version: 2
models:
- name: fct_orders
config:
contract:
enforced: true
columns:
- name: order_id
data_type: integer
constraints:
- type: primary_key
- type: not_null
- name: customer_id
data_type: integer
constraints:
- type: foreign_key
expression: "{{ ref('dim_customers') }} (customer_id)"
- type: not_null
- name: order_date
data_type: date
constraints:
- type: not_null
- name: status
data_type: varchar(20)
constraints:
- type: not_null
- name: amount_usd
data_type: numeric(10,2)
constraints:
- type: check
expression: "amount_usd >= 0"
-
contract.enforced: true— at compile time dbt validates that the SELECT's column list, names, and data types match the YAML exactly. A column rename without updating the contract = build fail. -
constraints:— pushed to the warehouse where supported. Snowflake and Databricks supportnot_nullandcheck; BigQuery is partial; Postgres / Redshift are full. - The interview signal — model contracts are the closest dbt has to "typed APIs"; senior teams use them on every mart that BI tools or sibling teams depend on.
Worked example — three test families layered on a single model
Detailed explanation. A real fct_orders lands with all three test families: generic tests on every column, a singular test across the orders + customers relationship, and a contract enforcing the public column shape.
Question. Show the full test surface for fct_orders and run dbt test --select fct_orders.
Input.
| Test family | Where it lives | What it catches |
|---|---|---|
| generic |
_marts.yml columns block |
column-level invariants (unique, not_null, FK) |
| singular | tests/assert_no_orphans.sql |
cross-model relationship sanity |
| contract |
_marts.yml model config.contract
|
schema drift between SELECT and declared types |
Code.
# Run all tests attached to fct_orders
dbt test --select fct_orders
# Run tests with failure-row storage so you can inspect bad rows
dbt test --select fct_orders --store-failures
# Run only error-severity tests (skip warns)
dbt test --select fct_orders --exclude-resource-type test --severity error
Step-by-step explanation.
-
dbt test --select fct_ordersselects every test whosemodelorrefmatchesfct_orders. - Generic tests compile to a
SELECTthat returns failing rows; dbt counts those rows and reports pass / fail. - Singular tests are already
SELECTstatements; same shape. - Model contracts run before the model SQL; if the SELECT's columns / types don't match the YAML, the build aborts.
- With
--store-failures, failing rows land in adbt_test_failuresschema you can query for triage.
Output.
Running with dbt=1.8.3
Found 1 model, 7 tests, 1 contract
1 of 7 START test unique_fct_orders_order_id .............. [RUN]
1 of 7 PASS unique_fct_orders_order_id .................... [PASS in 0.18s]
2 of 7 START test not_null_fct_orders_order_id ............ [RUN]
2 of 7 PASS not_null_fct_orders_order_id .................. [PASS in 0.09s]
3 of 7 START test relationships_fct_orders_customer_id .... [RUN]
3 of 7 PASS relationships_fct_orders_customer_id .......... [PASS in 0.22s]
4 of 7 START test accepted_values_fct_orders_status ....... [RUN]
4 of 7 PASS accepted_values_fct_orders_status ............. [PASS in 0.11s]
5 of 7 START test dbt_utils_expression_is_true_amount ..... [RUN]
5 of 7 WARN dbt_utils_expression_is_true_amount ........... [WARN — 3 rows]
6 of 7 START test assert_no_orphans ....................... [RUN]
6 of 7 PASS assert_no_orphans ............................. [PASS in 0.31s]
7 of 7 START contract fct_orders .......................... [RUN]
7 of 7 PASS contract fct_orders ........................... [PASS]
Completed — 6 passed, 0 failed, 1 warning, 0 errors
Why this works — concept by concept:
- Generic tests are the cheapest invariants — one YAML line catches duplicate PKs, NULL FKs, bad enum values.
- Singular tests cover cross-model sanity checks that generic tests can't express.
- Model contracts lift schema checking from runtime to compile time — the most expensive failures (schema drift) become free to catch.
-
severity: warnlets you stage a new test in production without breaking the build; flip to error once the false positives are cleared. -
Cost — every test is one extra
SELECT; cheap. The cost of not testing a column is one bad BI dashboard and a Monday-morning fire drill.
SQL
Topic — validation
Data-validation drills
SQL
Topic — data-validation
Schema-test practice
5. Macros + Jinja — write once, compile per-call
dbt macros — Jinja templates that inline SQL across many models
dbt macros are the third pillar after models and tests. A macro is a Jinja function that returns SQL; you define it once under macros/ and call it from any model. At compile time, dbt inlines the macro's output exactly where you called it — no runtime overhead, no extra warehouse round-trips.
The macro lifecycle in three steps.
-
Define — write a
.sqlfile undermacros/containing a{% macro name(args) %} ... {% endmacro %}block. -
Call — invoke it from a model with
{{ name(args) }}. -
Compile — dbt expands the call into raw SQL written to
target/compiled/.... The warehouse only ever sees the expanded form.
Defining a macro — small, pure, reusable
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, decimals=2) %}
cast({{ column_name }} as numeric) / 100
{% endmacro %}
-- macros/pivot_status_counts.sql
{% macro pivot_status_counts(status_column, statuses) %}
{% for s in statuses %}
sum(case when {{ status_column }} = '{{ s }}' then 1 else 0 end) as {{ s }}_count
{% if not loop.last %},{% endif %}
{% endfor %}
{% endmacro %}
-- macros/get_payment_methods.sql — used by models to access vars
{% macro get_payment_methods() %}
{{ return(var('payment_methods', ['credit_card', 'ach'])) }}
{% endmacro %}
-
Arguments with defaults —
decimals=2makes the second arg optional. -
{% for %}loops — Jinja control flow;loop.lastis true on the final iteration. -
{{ return(...) }}— for macros that produce a Python-side value (not SQL); useful for variable factories. - Keep them small + pure — a 5-line macro is a delight; a 50-line macro with conditional dispatch becomes the next engineer's nightmare. Prefer composition.
Calling a macro — three syntaxes for three contexts
-- models/marts/fct_revenue.sql
{{ config(materialized='table') }}
select
region,
{{ cents_to_dollars('amount_cents') }} as amount_usd,
{{ pivot_status_counts('status', ['paid', 'pending', 'cancelled']) }}
from {{ ref('stg_orders') }}
group by region
-- Compiled output written to target/compiled/analytics/models/marts/fct_revenue.sql
select
region,
cast(amount_cents as numeric) / 100 as amount_usd,
sum(case when status = 'paid' then 1 else 0 end) as paid_count,
sum(case when status = 'pending' then 1 else 0 end) as pending_count,
sum(case when status = 'cancelled' then 1 else 0 end) as cancelled_count
from analytics.staging.stg_orders
group by region
-
{{ macro_name(args) }}— expression form; returns a string that gets inlined. -
{% do macro_name(args) %}— statement form; for side-effectful macros that don't return SQL (e.g. logging). -
{% set var = macro_name(args) %}— assign the macro's return into a Jinja variable for later reuse in the same compile pass.
Jinja control flow inside a model
Jinja makes SQL templating practical. Use it to loop over columns, conditionally include CTEs, switch behaviour per adapter, or build pivot tables dynamically.
-- models/marts/fct_revenue_by_method.sql
{% set payment_methods = ['credit_card', 'ach', 'paypal'] %}
select
order_date,
{% for m in payment_methods %}
sum(case when payment_method = '{{ m }}' then amount_usd else 0 end) as revenue_{{ m }}
{% if not loop.last %},{% endif %}
{% endfor %}
from {{ ref('stg_payments') }}
group by order_date
-- Adapter-conditional logic
select *,
{% if target.type == 'snowflake' %}
current_timestamp::timestamp_ntz as loaded_at
{% elif target.type == 'bigquery' %}
current_timestamp() as loaded_at
{% else %}
now() as loaded_at
{% endif %}
from {{ ref('stg_orders') }}
-
{% set %}— declare a Jinja variable scoped to the model. -
{% for %}/{% endfor %}— loop; great for building pivot SUMs without hand-writing N rows. -
{% if %}/{% elif %}/{% else %}— conditional SQL; the canonical adapter-switching pattern. -
target.type— at compile time you know which warehouse you're compiling for; use it sparingly to bridge dialect gaps.
dbt_utils — the community macro standard library
dbt_utils ships dozens of macros every project ends up using. The four most-used:
-
generate_surrogate_key(['col_a', 'col_b'])— hash-based composite key generation; the workhorse of dim-table modeling. -
dbt_utils.star(from=ref('stg_orders'), except=['raw_payload'])— expand*minus a few columns. Essential when staging models drop PII. -
dbt_utils.pivot('status', ['paid', 'pending', 'cancelled'])— pivot a column into N counts. Replaces the loop above with a one-liner. -
dbt_utils.date_spine(datepart='day', start_date='2024-01-01', end_date='2026-01-01')— generate a contiguous calendar table on the fly; great for cohort and gap-filling work.
# packages.yml — install dbt_utils so its macros become callable
packages:
- package: dbt-labs/dbt_utils
version: 1.2.0
- package: calogica/dbt_expectations
version: 0.10.4
- package: dbt-labs/audit_helper
version: 0.12.0
- package: calogica/dbt_date
version: 0.10.1
dbt deps # installs every package into dbt_packages/
Worked example — replace 50 lines of hand-written SQL with one macro call
Detailed explanation. Every dbt codebase ages into duplicated logic — same case when status in ('paid', 'completed', 'fulfilled') then 1 repeated across 20 models. The refactor: factor it into one macro, then call it everywhere.
Question. Replace duplicated revenue-pivot SQL across fct_revenue_daily and fct_revenue_weekly with a shared pivot_status_counts macro.
Input. Two models that each hand-write five sum(case when status = '...' then 1 else 0 end) columns.
Code (before — duplicated SQL across two models).
-- models/marts/fct_revenue_daily.sql (BEFORE)
select
order_date,
sum(case when status = 'paid' then 1 else 0 end) as paid_count,
sum(case when status = 'pending' then 1 else 0 end) as pending_count,
sum(case when status = 'cancelled' then 1 else 0 end) as cancelled_count,
sum(case when status = 'refunded' then 1 else 0 end) as refunded_count,
sum(case when status = 'shipped' then 1 else 0 end) as shipped_count
from {{ ref('stg_orders') }}
group by order_date
Code (after — one macro, two callers).
-- macros/pivot_status_counts.sql
{% macro pivot_status_counts(status_column, statuses) %}
{% for s in statuses %}
sum(case when {{ status_column }} = '{{ s }}' then 1 else 0 end) as {{ s }}_count
{% if not loop.last %},{% endif %}
{% endfor %}
{% endmacro %}
-- models/marts/fct_revenue_daily.sql (AFTER)
{% set statuses = ['paid', 'pending', 'cancelled', 'refunded', 'shipped'] %}
select
order_date,
{{ pivot_status_counts('status', statuses) }}
from {{ ref('stg_orders') }}
group by order_date
-- models/marts/fct_revenue_weekly.sql (AFTER)
{% set statuses = ['paid', 'pending', 'cancelled', 'refunded', 'shipped'] %}
select
date_trunc('week', order_date) as week,
{{ pivot_status_counts('status', statuses) }}
from {{ ref('stg_orders') }}
group by 1
Step-by-step explanation.
- The macro takes a column name and a list of values; Jinja's
{% for %}loop unrolls onesum(case when ...)per status. -
{% if not loop.last %},{% endif %}adds a trailing comma between expressions but not after the last one — the trick to clean compiled SQL. - Each caller
{% set statuses = [...] %}keeps the list local so two models can diverge if needed. - dbt compiles the macro to identical SQL in both callers — zero warehouse difference, full source dedup.
Output (compiled fct_revenue_daily).
select
order_date,
sum(case when status = 'paid' then 1 else 0 end) as paid_count,
sum(case when status = 'pending' then 1 else 0 end) as pending_count,
sum(case when status = 'cancelled' then 1 else 0 end) as cancelled_count,
sum(case when status = 'refunded' then 1 else 0 end) as refunded_count,
sum(case when status = 'shipped' then 1 else 0 end) as shipped_count
from analytics.staging.stg_orders
group by order_date
Why this works — concept by concept:
- Macro factoring removes a class of bugs — adding a new status now updates the list in one place, not N.
-
Jinja
{% for %}is the right tool when you'd otherwise hand-write N parallel columns; doubly so when N changes over time. - Compile-time inlining means the warehouse never sees Jinja; performance is identical to hand-written SQL.
-
loop.lastis the Jinja idiom for "skip the trailing separator"; commit this one to muscle memory. - Cost — Jinja compilation runs at parse time; the warehouse sees only inlined SQL. The savings are in maintenance hours, not runtime.
SQL
Topic — conditional-aggregation
Conditional-aggregation drills
SQL
Topic — aggregation
Aggregation pattern practice
6. Packages ecosystem — dbt_utils · dbt_expectations · dbt_audit_helper · Elementary
dbt packages — install once, get hundreds of macros for free
dbt packages are git-cloneable bundles of macros, tests, and models the community maintains. The four packages every senior team installs on day one: dbt_utils (the standard library), dbt_expectations (Great-Expectations-style tests), audit_helper (regression tooling for migrations), and elementary (observability + freshness alerts).
dbt_utils — the standard library beyond macros
dbt_utils is more than just macros — it also ships generic tests you can attach in YAML alongside the built-in unique / not_null ones.
# Generic tests from dbt_utils
models:
- name: fct_orders
columns:
- name: order_id
tests:
- dbt_utils.unique_combination_of_columns:
combination_of_columns: [order_id, line_item_id]
- name: amount_usd
tests:
- dbt_utils.expression_is_true:
expression: ">= 0"
severity: warn
- name: order_date
tests:
- dbt_utils.recency:
datepart: day
field: order_date
interval: 1
-
unique_combination_of_columns— composite-key uniqueness; the right tool when the PK is two columns together. -
expression_is_true— any boolean SQL expression as a test. -
recency— fails if the most-recent row is older thaninterval; canonical freshness sanity. -
equal_rowcount— compares row counts between two relations; the workhorse of staging-to-marts sanity.
dbt_expectations — Great-Expectations-style declarative data quality
dbt_expectations ports the Great Expectations API to dbt: 60+ generic tests covering distributional, statistical, and pattern-based invariants.
# Distributional + format tests from dbt_expectations
models:
- name: fct_orders
columns:
- name: order_id
tests:
- dbt_expectations.expect_column_values_to_be_unique
- dbt_expectations.expect_column_values_to_match_regex:
regex: '^ORD-[0-9]{8}$'
- name: amount_usd
tests:
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
max_value: 100000
row_condition: "status = 'completed'"
- dbt_expectations.expect_column_mean_to_be_between:
min_value: 10
max_value: 500
- name: status
tests:
- dbt_expectations.expect_column_values_to_be_in_set:
value_set: ['completed', 'pending', 'cancelled', 'refunded']
-
expect_column_values_to_be_between— range check; great for sanity caps on revenue / quantity. -
expect_column_mean_to_be_between— distributional drift detector; catches the day a join goes wrong and revenue jumps 10×. -
expect_column_values_to_match_regex— pattern enforcement; great for IDs and email columns. -
row_condition:— scope the test to a subset of rows.
audit_helper — diff two relations during migrations
audit_helper is the package every team installs when they migrate a critical model — say, refactoring fct_orders to incremental, or porting Looker SQL into dbt. It ships macros that diff two relations and tell you exactly what changed.
-- analyses/compare_fct_orders.sql
-- Run with: dbt compile -s compare_fct_orders, then paste into your warehouse.
{% set old_query %}
select * from analytics.legacy.fct_orders_v1
{% endset %}
{% set new_query %}
select * from {{ ref('fct_orders') }}
{% endset %}
{{ audit_helper.compare_queries(
a_query=old_query,
b_query=new_query,
primary_key='order_id'
) }}
-
compare_queries— full row-level diff; tells you "8,231 matches, 12 missing in new, 0 missing in old, 45 differences in non-PK columns". -
compare_column_values— per-column value distribution comparison; the right tool when you suspect a single column changed. -
compare_relation_columns— schema diff; columns added / removed / type-changed. -
The migration ritual — every refactor of a critical model should ship with an
audit_helperanalysis in the PR description; reviewers see the diff and approve.
elementary — observability over dbt artifacts
elementary is the open-source observability layer that reads target/manifest.json and target/run_results.json after every run and turns them into freshness alerts, anomaly detection, and a Slack channel that pages on-call when something breaks.
# packages.yml — add elementary
packages:
- package: elementary-data/elementary
version: 0.15.0
# models/_elementary.yml — turn on monitoring
models:
- name: fct_orders
config:
elementary:
timestamp_column: order_date
tests:
- elementary.volume_anomalies:
time_bucket: { period: day, count: 1 }
- elementary.freshness_anomalies
- elementary.dimension_anomalies:
dimensions:
- region
-
volume_anomalies— row-count anomaly detection; flags the day order volume drops 80% (a likely upstream outage). -
freshness_anomalies— flags the day a model'sloaded_atstops advancing. -
dimension_anomalies— flags the day a dimension's value distribution shifts significantly. -
Slack / PagerDuty integration — Elementary ships a CLI you run after
dbt buildthat posts alerts to your incident channel.
A summary table — which package to reach for
| Package | What it ships | When you need it |
|---|---|---|
dbt_utils |
Surrogate keys, pivots, date spines, composite tests | Every dbt project — install on day one |
dbt_expectations |
60+ distributional / pattern / range tests | When unique / not_null aren't enough |
audit_helper |
Diff two relations during migrations | Refactors, OLAP-engine swaps, vendor cutovers |
elementary |
Freshness, anomaly, lineage observability | When dbt is in production with on-call rotations |
dbt_date |
Calendar / fiscal / business-day helpers | Finance / accounting / cohort work |
dbt_artifacts |
Persist run metadata into a warehouse table | Custom dashboards over dbt runs |
re_data |
Alternative observability stack | Teams that prefer it over Elementary |
Worked example — adopt dbt_utils + dbt_expectations on a single model
Detailed explanation. Install the two most-used packages and add three tests to fct_orders you couldn't have written without them.
Question. Wire dbt_utils.surrogate_key, dbt_expectations.expect_column_values_to_be_between, and dbt_expectations.expect_column_mean_to_be_between into the fct_orders model.
Input. A fresh dbt project with packages.yml already installed.
Code.
-- models/marts/fct_orders.sql
{{ config(materialized='table') }}
select
{{ dbt_utils.generate_surrogate_key(['order_id', 'line_item_id']) }} as order_line_sk,
order_id,
line_item_id,
customer_id,
order_date,
amount_usd,
status
from {{ ref('stg_orders_lines') }}
# models/marts/_marts.yml
version: 2
models:
- name: fct_orders
columns:
- name: order_line_sk
tests:
- unique
- not_null
- name: amount_usd
tests:
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
max_value: 100000
severity: error
- dbt_expectations.expect_column_mean_to_be_between:
min_value: 10
max_value: 500
severity: warn
- name: status
tests:
- dbt_expectations.expect_column_values_to_be_in_set:
value_set: ['completed', 'pending', 'cancelled']
dbt deps # installs packages
dbt build --select fct_orders # runs model + every test
Step-by-step explanation.
-
dbt depsclonesdbt_utilsanddbt_expectationsintodbt_packages/. -
generate_surrogate_key(['a', 'b'])returns amd5(a || '-' || b)expression specific to the active adapter. -
expect_column_values_to_be_between(min=0, max=100000)runsSELECT * FROM fct_orders WHERE amount_usd < 0 OR amount_usd > 100000— failing rows. -
expect_column_mean_to_be_between(min=10, max=500)runs an aggregate test — fails if the table's averageamount_usdis outside the range. -
dbt buildships them all in one DAG walk; severity flags decide which fail the run vs warn.
Output.
1 of 5 PASS unique_fct_orders_order_line_sk
2 of 5 PASS not_null_fct_orders_order_line_sk
3 of 5 PASS dbt_expectations_expect_column_values_to_be_between_amount_usd
4 of 5 WARN dbt_expectations_expect_column_mean_to_be_between_amount_usd [WARN — mean 8.4 below 10]
5 of 5 PASS dbt_expectations_expect_column_values_to_be_in_set_status
Completed — 4 passed, 0 failed, 1 warning, 0 errors
Why this works — concept by concept:
-
Composite keys —
dbt_utils.generate_surrogate_keyis the canonical way to hash multiple columns into one PK; saves you N lines ofmd5(concat(...))per model. -
Range tests —
expect_column_values_to_be_betweencatches the bug where a join multiplies rows and revenue jumps 10×. -
Distributional tests —
expect_column_mean_to_be_betweenis the kind of invariant you can't express withunique/not_null; the mean drifting is the first signal of a quiet upstream bug. - Severity tuning — error for hard invariants (range), warn for soft signals (drift); turns dbt into a tunable alarm system.
-
Cost — every test is one
SELECT; the marginal cost is small. The cost of not catching a 10× revenue inflation is real money.
SQL
Topic — data-validation
Data-quality test drills
SQL
Topic — aggregation
Aggregation + surrogate-key practice
7. Production patterns + CI/CD — Slim CI · orchestration · observability
dbt production patterns — what it takes to run dbt on call
dbt production patterns is the last pillar — every other pillar matters only if the project actually ships to production cleanly. Senior loops zero in on four moves: Slim CI on PRs, scheduled dbt build in dbt Cloud or Airflow, observability via Elementary, and the dbt Cloud vs Core decision that drives org-level choices.
dbt Slim CI — only rebuild what changed
dbt Slim CI is the highest-leverage CI optimisation in the ecosystem. Without it, every PR rebuilds your whole DAG; with it, PRs build only the changed subgraph and stitch upstream refs to production relations via --defer.
# .github/workflows/dbt-ci.yml
name: dbt CI
on:
pull_request:
paths: ['models/**', 'tests/**', 'macros/**', 'dbt_project.yml', 'packages.yml']
jobs:
dbt-ci:
runs-on: ubuntu-latest
env:
DBT_CI_USER: ${{ secrets.DBT_CI_USER }}
DBT_CI_PASSWORD: ${{ secrets.DBT_CI_PASSWORD }}
PR_NUMBER: ${{ github.event.pull_request.number }}
steps:
- uses: actions/checkout@v4
- name: Install dbt + adapter
run: pip install dbt-snowflake==1.8.*
- name: Install packages
run: dbt deps
- name: Download prod manifest for --defer baseline
run: aws s3 cp s3://my-bucket/dbt/prod_manifest.json ./prod_manifest/manifest.json
- name: Slim CI build
run: |
dbt build \
--select state:modified+ \
--defer \
--state ./prod_manifest \
--target ci \
--fail-fast
- name: Drop the CI schema on PR close
if: github.event.action == 'closed'
run: dbt run-operation drop_schema --args "{schema: dbt_ci_pr_${PR_NUMBER}}"
-
state:modified+— modified models plus everything downstream; the canonical Slim CI selector. -
--defer + --state— unselected refs resolve to the production manifest's relations, so you don't have to rebuild upstream chains. -
--fail-fast— abort on first failure; saves CI minutes when something is obviously broken. -
PR-scoped schemas — each PR builds into
dbt_ci_pr_123; the schema is dropped on PR close so CI databases don't grow unbounded.
dbt scheduling — dbt Cloud, Airflow, or GitHub Actions
Once dbt is in production, something has to run it on a schedule. Three common patterns:
# dbt Cloud — the managed path
# Configure in the UI: a job that runs `dbt build` daily at 06:00 UTC,
# attached to the prod environment, with email + Slack alerts on failure.
# Airflow — for teams with existing DAGs
from airflow import DAG
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator
# Or for dbt Core via the standard BashOperator + cosmos:
from cosmos import DbtTaskGroup, ProjectConfig, ProfileConfig
with DAG('analytics_dbt', schedule='0 6 * * *', catchup=False) as dag:
dbt_run = DbtTaskGroup(
group_id='dbt_build',
project_config=ProjectConfig('/opt/airflow/analytics'),
profile_config=ProfileConfig(
profile_name='analytics',
target_name='prod',
profiles_yml_filepath='/opt/airflow/profiles.yml',
),
operator_args={'select': 'tag:daily'},
)
# GitHub Actions cron — minimal infra for small teams
name: dbt nightly
on:
schedule:
- cron: '0 6 * * *'
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: pip install dbt-snowflake
- run: dbt deps
- run: dbt build --target prod --select tag:daily
- dbt Cloud — the easiest path; pay for managed orchestration, Slack alerts, hosted docs, the IDE, and the Semantic Layer. Pricing scales per developer seat.
-
Airflow +
cosmos— the standard for teams with existing Airflow infrastructure; lets you mix dbt with non-dbt tasks (Spark jobs, ML training, custom Python). - GitHub Actions cron — the cheapest option for small teams; works fine until you need cross-job dependencies or proper SLA monitoring.
dbt Cloud vs Core — the interview-canonical comparison
Every dbt interview has at least one "when would you pick Cloud vs Core?" question. The honest answer:
| Dimension | dbt Core | dbt Cloud |
|---|---|---|
| License | Apache 2.0, free | Subscription per seat |
| CLI | Yes — dbt build, dbt run, etc. |
Yes — under the hood it's Core |
| IDE | No — bring your own (VS Code + dbt Power User is standard) | Yes — web IDE with autocomplete + lineage |
| Scheduler | No — bring your own (Airflow, GitHub Actions, cron) | Yes — managed cron with retries + alerts |
| CI | No — wire it up in GitHub Actions | Yes — managed Slim CI on every PR |
| Hosted docs | No — self-host the static site | Yes — managed docs with auth |
| Semantic Layer | No (Core 1.7+ has the spec; Cloud serves it) | Yes — metric API for BI tools |
| Best for | Engineering-heavy teams with Airflow already | Analyst-heavy teams; smaller orgs without DevOps |
- Core is the engine; Cloud is the convenience layer. Every dbt project compiles via Core; Cloud wraps it in orchestration + UI.
- Senior teams often mix — develop locally on Core, run CI via GitHub Actions + Core, but use Cloud for the scheduler + Semantic Layer.
- The Semantic Layer is the Cloud lock-in — if your BI tool queries the SL, you're paying for Cloud.
Observability — Elementary, freshness alerts, on-call runbooks
Once you have nightly schedules, something will fail at 03:00 — and you need to know before stakeholders open dashboards at 09:00.
-
dbt source freshness— runsMAX(_fivetran_synced)against every source and warns / errors when it falls behind the threshold you set in YAML. -
Elementary alerts — Slack channel that posts
[ERROR] fct_orders: 12 rows failed unique_order_idwith a link to the failing-rows table. -
dbt docs serve— hosted lineage; let on-call see the upstream chain when a model fails downstream. -
Run-result archive — store
target/run_results.jsonin S3 after every run; the cheapest observability backbone you can have.
Worked example — full Slim-CI + nightly schedule + alerting in 30 lines of YAML
Detailed explanation. Stitch every piece together: a PR workflow that builds only changed models, a nightly workflow that runs dbt build and pushes the manifest, and an Elementary alert hook that posts to Slack when something fails.
Question. Wire up the full production pipeline for a small dbt project on GitHub Actions + Snowflake + Elementary.
Input. GitHub repo with the project, Snowflake CI / prod credentials in Actions secrets, an Elementary CLI installed in the prod environment, a Slack webhook URL.
Code.
# .github/workflows/dbt-pr.yml — Slim CI on every PR
name: dbt PR
on: { pull_request: { paths: ['models/**','tests/**','macros/**'] } }
jobs:
ci:
runs-on: ubuntu-latest
env: { DBT_CI_USER: ${{ secrets.DBT_CI_USER }}, DBT_CI_PASSWORD: ${{ secrets.DBT_CI_PASSWORD }} }
steps:
- uses: actions/checkout@v4
- run: pip install dbt-snowflake==1.8.* elementary-data
- run: dbt deps
- run: aws s3 cp s3://my-bucket/dbt/manifest.json ./prod/manifest.json
- run: dbt build --select state:modified+ --defer --state ./prod --target ci --fail-fast
# .github/workflows/dbt-nightly.yml — production build + observability
name: dbt nightly
on: { schedule: [{ cron: '0 6 * * *' }] }
jobs:
build:
runs-on: ubuntu-latest
env:
DBT_PROD_USER: ${{ secrets.DBT_PROD_USER }}
DBT_PROD_PASSWORD: ${{ secrets.DBT_PROD_PASSWORD }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
steps:
- uses: actions/checkout@v4
- run: pip install dbt-snowflake==1.8.* elementary-data
- run: dbt deps
- run: dbt source freshness --target prod
- run: dbt build --target prod --fail-fast
- run: edr monitor --slack-webhook "$SLACK_WEBHOOK"
- run: aws s3 cp target/manifest.json s3://my-bucket/dbt/manifest.json
Step-by-step explanation.
- PR workflow runs Slim CI —
state:modified++--deferkeeps the build fast and cheap. - Nightly workflow runs
dbt source freshnessfirst — fails loudly if upstream ingest is stale. - Nightly workflow runs
dbt build --target prod— every model + every test in dependency order. -
edr monitoris the Elementary CLI; it readstarget/run_results.jsonand posts a Slack message with failing tests, slow models, and anomalies. - Manifest upload is the last step — it makes tomorrow's PR Slim CI work against today's state.
Output (Slack message after a failing run).
[dbt nightly · failed]
Project: analytics · Target: prod · Duration: 14m 22s
✗ fct_orders FAIL (3 rows violated unique_order_id)
✗ assert_no_negative_revenue FAIL (1 row returned: region=EU, total=-120.00)
⚠ dbt_expectations_mean_amount_usd WARN (mean 8.42 below threshold 10.0)
Run results: https://my-bucket.s3.amazonaws.com/dbt/run_results/2026-05-26.json
Failing rows: https://snowflake.com/.../dbt_test_failures.fct_orders_unique
On-call: @analytics-oncall
Why this works — concept by concept:
- Slim CI keeps PR feedback under five minutes even on 200-model projects.
-
dbt source freshnesscatches upstream outages at the boundary; everything downstream fails fast. -
dbt build --fail-fasthalts on first failure so downstream nodes don't compound the blast radius. -
Elementary
edr monitorturns dbt artifacts into actionable Slack alerts without any custom code. - Manifest archive is the one operational detail that ties everything together — without it, Slim CI has no baseline.
- Cost — Slim CI cuts PR build time 10-50×; freshness + observability cut MTTR for incidents from hours to minutes. The CI minutes you save pay for the Snowflake credits you spend.
SQL
Topic — etl
Production ETL drills
SQL
Topic — data-transformation
CI/CD transformation patterns
Choosing the right dbt primitive (cheat sheet)
A one-screen cheat sheet for using dbt for data engineering — pick the primitive that matches your task.
| You want to … | Primitive | Notes |
|---|---|---|
| Define a transformation | models/.../my_model.sql |
One SELECT; dbt wraps it as CREATE TABLE / VIEW |
| Point at a raw table | {{ source('schema', 'table') }} |
Declare it in _sources.yml; gets freshness for free |
| Point at another dbt model | {{ ref('upstream_model') }} |
Never hard-code; this is what powers the DAG |
| Refresh on demand | +materialized: view |
Cheap to refresh, slow to query |
| Cache for BI | +materialized: table |
Full rebuild per run; fast queries |
| Bill-of-materials huge fact |
+materialized: incremental + unique_key
|
MERGE after first run; cheapest at scale |
| Reusable mid-DAG logic | +materialized: ephemeral |
Inlined as CTE; no storage cost |
| Enforce a column invariant | tests: [unique, not_null, accepted_values] |
Generic schema tests; one YAML line each |
| FK-style relationship | tests: [relationships: { to: ref('dim'), field: id }] |
Catches orphans |
| Bespoke multi-table rule | tests/assert_*.sql |
Singular test; zero rows = pass |
| Versioned column types | config: { contract: { enforced: true } } |
Schema drift fails the build |
| Reuse SQL logic |
macros/my_macro.sql + {{ my_macro(args) }}
|
Jinja template inlined per call |
| Hash a composite key | {{ dbt_utils.generate_surrogate_key(['a','b']) }} |
The canonical surrogate-key macro |
| Pivot dynamically |
{{ dbt_utils.pivot('status', vals) }} or {% for %}
|
One line replaces N SUM(CASE) columns |
| Range / regex test | dbt_expectations.expect_column_values_to_* |
60+ generic tests over dbt_utils baseline |
| Migration regression test | audit_helper.compare_queries(...) |
Diff old vs new relation; output as table |
| Production observability |
elementary package + edr monitor CLI |
Slack alerts on anomalies + freshness |
| Fast PR feedback | dbt build --select state:modified+ --defer --state ./prod |
Slim CI; 10× faster than full build |
Pull from prod for --defer
|
Cache target/manifest.json to S3 each run |
The one operational detail that makes Slim CI work |
| Scheduled run | dbt Cloud job, Airflow cosmos, or GitHub Actions cron |
Pick by team size + existing infra |
| Catch stale source |
dbt source freshness + loaded_at_field
|
First line of defense against silent breakage |
Frequently asked questions
What is dbt and why has it won the transformation layer of the modern data stack?
dbt (data build tool) is a SQL-first transformation framework that compiles .sql files into native warehouse DDL — CREATE TABLE, CREATE VIEW, MERGE — and runs them in dependency order against Snowflake, BigQuery, Databricks, Redshift, or Postgres. It won the transformation layer for four reasons. Warehouse-first compute: dbt pushes every transformation back into the warehouse, eliminating the round-trip cost of moving data out into a separate engine. Git-first workflow: every model is a text file, so PRs, code review, and revert-on-disaster are native. Tests as first-class citizens: unique, not_null, accepted_values, relationships ship out of the box, so bad data fails the build before it lands in BI. ref() and the DAG: dbt computes upstream / downstream dependencies automatically; you never write a runbook. Add a Jinja templating layer, an adapter ecosystem covering every major warehouse, and a thriving package ecosystem (dbt_utils, dbt_expectations, elementary), and you have the de-facto standard transformation layer for the modern data stack in 2026.
What is the difference between ref() and source() in dbt?
{{ ref('upstream_model') }} points at another dbt model in the same project — a .sql file under models/. dbt uses every ref() call to compute the DAG and run nodes in the correct dependency order. {{ source('source_name', 'table_name') }} points at a raw table you don't own — a Fivetran-loaded raw schema, a Postgres replica, a Kafka sink. Sources are declared in a _sources.yml file with their database, schema, and optional freshness thresholds. The rule: every model's inputs are either ref() (project-internal) or source() (project-external); never hard-code a database.schema.table literal, because that breaks Slim CI, --defer, and cross-environment portability. The two together give dbt the complete dependency graph it needs to schedule runs, validate ordering, and run dbt source freshness against your raw ingest layer.
When should I use view, table, incremental, or ephemeral materialization in dbt?
view — CREATE OR REPLACE VIEW; no data stored; cheap to refresh, slow to query. The default for staging models that are 1:1 with sources and rarely queried directly. table — CREATE OR REPLACE TABLE AS SELECT; full rebuild every run; fast to query. The default for marts that BI tools and stakeholders hit constantly; the rebuild cost is fine for tables up to millions of rows. incremental — first run = table; subsequent runs = MERGE (Snowflake / BigQuery / Databricks) or delete+insert (Postgres / Redshift). Use for billion-row fact tables you can't fully rebuild every run; pair with unique_key and an is_incremental() predicate that scopes new rows by timestamp. ephemeral — inlined as a CTE in the downstream model; never materialised in the warehouse. Use for small intermediate models that are joined once and never queried directly. The senior pattern: set folder-level defaults in dbt_project.yml (staging → view, intermediate → ephemeral, marts → table) and override per-model only when the data shape demands it.
What are the three families of dbt tests and when should I use each?
Generic schema tests — declared in YAML, one line per column. The four built-ins (unique, not_null, accepted_values, relationships) plus the 60+ from dbt_utils and dbt_expectations cover most column-level invariants. Use them aggressively; every column that matters should have at least one. Singular tests — bespoke SELECT files under tests/ that return failing rows. Use when the invariant spans multiple tables or expresses a business rule that doesn't fit a per-column shape — e.g. "no region has negative revenue", "every order has a matching customer". The contract is uniform: zero rows = pass, any rows = fail. Model contracts — added in dbt 1.5; declared in YAML under config.contract.enforced: true. They enforce the SELECT's column list, data types, and constraints at compile time, before any SQL runs against the warehouse. Use them on every mart that's a public API to other teams or BI tools; schema drift becomes a build failure instead of a 09:00 dashboard fire. The senior approach is all three layered together — generic for column invariants, singular for cross-model rules, contracts for the public-API surface.
What is dbt Slim CI and why does every senior dbt team use it?
dbt Slim CI is the workflow that only rebuilds the dbt models that changed in a pull request, plus everything downstream of them, while resolving unchanged upstream ref() calls against the production relations. The two flags that make it work: --select state:modified+ (modified models plus everything downstream) and --defer --state ./prod_manifest (resolve unselected refs to the cached production manifest's relations). Without Slim CI, every PR rebuilds the entire DAG — for a 200-model project that's hours of warehouse credits per PR. With it, PRs build only the changed subgraph in minutes, give developers fast feedback, and cost a fraction of full rebuilds. The one operational detail that makes Slim CI possible: archive target/manifest.json to S3 (or any blob store) after every successful production run; download it as the --state baseline in CI. Senior teams pair Slim CI with per-PR schemas (schema: dbt_ci_pr_{{ env_var('PR_NUMBER') }}) so each PR's artifacts are isolated and dropped on merge.
Should I use dbt Cloud or dbt Core, and how do senior teams decide?
dbt Core is the open-source CLI — dbt build, dbt run, dbt test. It runs anywhere: your laptop, GitHub Actions, Airflow, Kubernetes. You own orchestration, CI, hosted docs, and the IDE. dbt Cloud is the hosted layer — a web IDE with autocomplete and lineage, a managed scheduler with retries and alerts, managed Slim CI on every PR, hosted docs with auth, and the dbt Semantic Layer that BI tools can query. The honest decision tree: small / analyst-heavy teams without DevOps capacity should default to dbt Cloud — the time saved on orchestration and CI infrastructure pays for the per-seat license. Engineering-heavy teams with existing Airflow infrastructure often run dbt Core via cosmos (Airflow-dbt integration) and skip Cloud entirely; their scheduler, CI, and observability already exist. Mid-size teams mix the two — develop on Core locally, run CI via GitHub Actions + Core, but use Cloud for the scheduler and Semantic Layer. The interview-canonical framing: "Core is the engine; Cloud is the convenience layer; the right choice depends on whether your team already owns orchestration."
Practice on PipeCode
PipeCode ships 450+ data-engineering interview problems — including SQL practice keyed to the same shapes dbt models live in: aggregations, conditional aggregation, CTEs, joins, dimensional modeling, ETL pipelines, and data-quality validation. Whether you're drilling dbt for data engineering end-to-end or sharpening the underlying SQL fluency that makes great dbt models, the practice library mirrors the exact patterns this guide teaches.
Kick off via Explore practice →; drill the SQL practice lane →; fan out into ETL pipeline drills →; sharpen CTE patterns →; rehearse aggregation drills →; reinforce dimensional-modeling problems →; widen coverage on the full data-modeling library →.





Top comments (0)