DEV Community

Cover image for dbt for Data Engineering: Models, Tests, Macros & Production Patterns
Gowtham Potureddi
Gowtham Potureddi

Posted on

dbt for Data Engineering: Models, Tests, Macros & Production Patterns

dbt for data engineering is the canonical transformation layer of the modern data stack in 2026: it sits between your warehouse (Snowflake, BigQuery, Redshift, Databricks, Postgres) and your BI tools and replaces brittle stored procedures with version-controlled SQL models, declarative dbt tests, reusable dbt macros, and CI/CD-driven dbt production patterns. Seven things make a production dbt project hang together — dbt project structure, profiles.yml, dbt models with ref() / source() / materializations, the three dbt tests families, dbt macros and Jinja, the dbt packages ecosystem (dbt_utils, dbt_expectations, dbt_audit_helper, Elementary), and Slim CI with orchestration — and every senior dbt interview questions loop circles every one of them.

This deep guide walks all seven pillars in order, with real dbt YAML, SQL, and Jinja in every section. You'll see the canonical dbt_project.yml layout that ships in 90% of real projects, profiles.yml for dev / prod / ci targets across adapters, dbt ref vs source and the four materializations (view, table, incremental, ephemeral) as a layered DAG, dbt generic tests vs singular tests vs dbt model contracts, Jinja macros that compile per-call, the four community dbt packages every team installs, and dbt Slim CI with --defer state:modified+, Airflow DbtRunOperator, dbt Cloud vs Core, and Elementary freshness alerts. Every numbered H2 ends with a Question → Input → Code → Step-by-step → Output → Why this works worked example you can drop into a project.

PipeCode blog header for a complete dbt for data engineering guide — bold white headline 'dbt · Complete Guide' with subtitle 'Project · Models · Tests · Macros · Packages · Production' and a stylised seven-rung dbt project tree (project structure → models → tests → macros → packages → CI/CD → docs) on a dark gradient with orange, green, purple, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, browse the SQL practice lane →, drill ETL pipeline drills →, sharpen CTE patterns →, rehearse aggregation drills →, reinforce dimensional-modeling problems →, or widen coverage on the full data-modeling library →.


On this page


1. Why dbt won the transformation layer of the modern data stack

dbt for data engineering — the warehouse-first, SQL-first, Git-first thesis

The one-sentence invariant: dbt for data engineering is "Git + SQL + Jinja + tests, compiled against your warehouse" — every transformation is a versioned .sql file, every dependency is a ref(), every column is testable, every business rule is one reusable macro, and every deploy runs in CI before it touches production. Once you internalise that, every other dbt design decision becomes a follow-up.

The three architectural commitments that won dbt the transformation layer.

  • Warehouse-first — dbt compiles to native warehouse SQL (CREATE TABLE, CREATE VIEW, MERGE) and pushes the compute to Snowflake / BigQuery / Redshift / Databricks / Postgres. No data leaves the warehouse.
  • SQL-first — the surface language is SQL, the language your analysts and data engineers already share. Jinja adds templating without forcing engineers to learn a new DSL.
  • Git-first — every model is a .sql file, every test is a YAML entry, every change is a pull request. The whole transformation layer is reviewable, blameable, and revertable.

Why the modern data stack converged on dbt.

  • pandas and Spark moved compute out of the warehouse — dbt moved it back in. Modern warehouses are cheap and elastic; the round-trip cost of moving data out is the bottleneck.
  • Stored procedures and ETL GUIs lost the diff war — they don't show up cleanly in PR reviews, can't be unit-tested, and don't version cleanly. dbt models are just text files, so Git handles all three.
  • ref() killed hard-coded table names — every model declares its upstreams; dbt computes the DAG and runs nodes in the right order without you maintaining a runbook.
  • Tests as a first-class citizenunique, not_null, accepted_values, relationships ship out of the box; bad data fails the build before it lands in BI.
  • Jinja templating — variables, conditionals, loops, macros — without leaving SQL.
  • Adapter ecosystem — one project runs on every major warehouse via a swappable adapter (dbt-snowflake, dbt-bigquery, dbt-databricks, dbt-redshift, dbt-postgres).

What interviewers listen for in 2026 dbt loops.

  • Do you reach for ref() and source() instead of hard-coded db.schema.table names? — basic-but-tested fluency.
  • Do you name the four layers (sources → staging → intermediate → marts) when asked about project structure? — junior baseline.
  • Do you contrast view, table, incremental, ephemeral and pick the right one per layer? — mid-level signal.
  • Do you mention model contracts, Slim CI (--defer + state:modified+), dbt build (run + test in one command), and Elementary for freshness alerts? — senior signal.
  • Do you explain dbt Cloud vs Core as "Core is the engine; Cloud is the convenience layer (IDE + scheduler + Semantic Layer)"? — interview-canonical answer.

The five sub-themes the deeper loops add.

  • Model contracts — enforce column types and constraints at build time; fail the run before SQL hits the warehouse.
  • Incremental modelsunique_key + merge strategy for billion-row tables that you can't fully rebuild every run.
  • Slim CI — only build models that changed (--defer + state:modified+); 10× faster PR feedback.
  • Semantic Layer — metric definitions BI tools query so every team agrees on what "active user" means.
  • Observability — Elementary or re_data on top of dbt artifacts for freshness, anomaly detection, lineage.

Worked example — a 10-line dbt build cycle that touches every pillar

Detailed explanation. Every interviewer's favorite question shape: "walk me through what happens when you run dbt build in CI". The answer touches project structure, profiles, ref-resolution, materializations, tests, macros, and CI in one breath.

Question. A PR changes models/staging/stg_orders.sql and adds a new test on models/marts/fct_orders.sql. Sketch the dbt build lifecycle in CI.

Input.

Step Artifact involved
1 dbt_project.yml and profiles.yml (target = ci)
2 Local manifest.json from dbt parse
3 Production manifest.json from S3 (last successful run)
4 state:modified+ selector
5 Compiled SQL written to target/compiled/
6 Test results written to target/run_results.json

Code.

# 1. Resolve adapter + credentials
dbt deps                              # install packages.yml
dbt parse                             # produce target/manifest.json

# 2. Slim CI — only build what changed (plus downstream)
dbt build \
  --select state:modified+ \
  --defer \
  --state ./prod_manifest \
  --target ci

# 3. Tests run inline with each model (that's what `build` adds over `run`)
# 4. Upload the new manifest to S3 for the next PR's --defer baseline
aws s3 cp target/manifest.json s3://my-bucket/dbt/manifest.json
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. dbt deps installs everything in packages.yml (dbt_utils, dbt_expectations, etc.) into dbt_packages/.
  2. dbt parse reads every .sql and .yml and produces a fresh target/manifest.json representing the DAG.
  3. --state ./prod_manifest points at a previous manifest cached from production; state:modified+ selects modified models plus everything downstream of them.
  4. --defer tells dbt to resolve any unselected ref() against the prod manifest's relations, so you don't have to rebuild the whole upstream chain in CI.
  5. dbt build runs the selected nodes; for each model it executes the compiled SQL, then runs every test attached to that model inline (the build verb does both, in dependency order).
  6. After CI passes, upload the new manifest.json so the next PR's --defer baseline is up to date.

Output (CI log excerpt).

Running with dbt=1.8.3
Found 24 models, 87 tests, 12 sources, 5 macros, 3 packages
Concurrency: 4 threads (target='ci')

1 of 6 START sql view model dbt_ci.stg_orders ............. [RUN]
1 of 6 OK created view model dbt_ci.stg_orders ............ [CREATE VIEW in 0.34s]
2 of 6 START test unique_stg_orders_order_id .............. [RUN]
2 of 6 PASS unique_stg_orders_order_id .................... [PASS in 0.12s]
...
6 of 6 PASS dbt_expectations_expect_column_values_to_be_unique [PASS in 0.21s]

Completed successfully — 6 succeeded, 0 failed, 0 errors, 0 skipped
Enter fullscreen mode Exit fullscreen mode

Why this works — concept by concept:

  • Slim CI scopes the build to changed nodes plus their downstream, so PR runs cost minutes not hours.
  • --defer stitches unselected refs to production relations, eliminating the need to rebuild parents in every CI run.
  • dbt build runs models and their attached tests in one DAG walk, so a failing test halts downstream nodes immediately.
  • manifest.json is the artifact that makes Slim CI possible — caching it from prod to S3 is the one non-obvious operational step every senior dbt team standardises.
  • Coststate:modified+ reduces typical PR build time from O(all models) to O(changed subgraph), often a 10-50× win.

SQL
Topic — etl
ETL pipeline drills

Practice →

SQL
Topic — data-transformation
Data-transformation practice

Practice →


2. Project structure + profiles — dbt_project.yml · profiles.yml · adapters

Visual diagram of a dbt project layout — a folder-tree icon showing the canonical structure (models/staging, models/intermediate, models/marts, tests/, macros/, seeds/, snapshots/, packages.yml, dbt_project.yml); a profiles.yml card on the right showing target profiles for dev / prod / ci; a thin warehouse-target row at the bottom listing Snowflake / BigQuery / Databricks / Postgres; on a light PipeCode card.

dbt project structure — the canonical layout every senior project ships

dbt project structure is convention, not rule — but the staging → intermediate → marts layout is the 2026 default and the first thing every reviewer looks for. The reason: predictable folder names make a 50-person engineering org navigable; new joiners know where to find a stg_orders.sql without asking.

The canonical project skeleton.

analytics/
├── dbt_project.yml          # the central config (project name, paths, model defaults)
├── packages.yml             # community packages (dbt_utils, dbt_expectations, ...)
├── profiles.yml             # connection credentials per target (often kept in ~/.dbt/)
├── models/
│   ├── staging/             # 1:1 with sources; light renaming + casting only
│   │   ├── _stg_sources.yml # source() declarations + freshness
│   │   ├── stg_orders.sql
│   │   └── stg_customers.sql
│   ├── intermediate/        # reusable joins + business logic (int_*)
│   │   └── int_orders_enriched.sql
│   └── marts/               # business-facing fact + dim tables
│       ├── _marts.yml       # tests + descriptions for marts
│       ├── fct_orders.sql
│       └── dim_customers.sql
├── tests/                   # singular SQL tests (one file = one query)
│   └── assert_no_negative_revenue.sql
├── macros/                  # Jinja reusables
│   └── cents_to_dollars.sql
├── seeds/                   # tiny CSV reference data committed to git
│   └── country_iso.csv
├── snapshots/               # SCD2-style history capture
│   └── snap_customers.sql
└── analyses/                # exploratory queries (compiled, not built)
Enter fullscreen mode Exit fullscreen mode
  • models/staging/ — 1:1 with raw sources. One stg_orders.sql per source table. Light renaming, casting, and safe_cast only; no joins, no business logic. The contract: anything downstream consumes staging, never raw.
  • models/intermediate/ — joins, fan-outs, reusable building blocks. Often named int_orders_enriched or int_customer_features. Materialized as ephemeral or table depending on reuse.
  • models/marts/ — the final fact (fct_*) and dimension (dim_*) tables BI tools and stakeholders query. These are the contract surface to the business.
  • tests/ — singular SQL tests. One file = one SELECT that returns failing rows; zero rows = pass.
  • macros/ — Jinja templates you can call from any model. Examples: cents_to_dollars, pivot_status_counts, date_spine.
  • seeds/ — CSV files committed to git that get loaded into the warehouse with dbt seed. Use for tiny reference tables (country codes, ISO currency mappings).
  • snapshots/ — SCD2-style history. dbt watches a query and writes a row every time a column changes.

dbt_project.yml — the central manifest of your project

The dbt_project.yml file defines project name, version, paths, and the default materialization per folder. Setting materialization at the folder level is the senior-vs-junior signal — junior engineers configure it per-model; senior engineers set sensible defaults at the directory and override only the exceptions.

# dbt_project.yml
name: 'analytics'
version: '1.0.0'
config-version: 2

profile: 'analytics'           # matches the profile in ~/.dbt/profiles.yml

# Path configuration
model-paths:    ["models"]
seed-paths:     ["seeds"]
test-paths:     ["tests"]
macro-paths:    ["macros"]
snapshot-paths: ["snapshots"]

# Folder-level defaults — the senior pattern
models:
  analytics:
    staging:
      +materialized: view              # cheap; refresh on demand
      +schema: staging
    intermediate:
      +materialized: ephemeral         # inlined; never materialised
      +schema: intermediate
    marts:
      +materialized: table             # exposed to BI
      +schema: marts
      +on_schema_change: append_new_columns

vars:
  start_date: '2024-01-01'
  payment_methods: ['credit_card', 'ach', 'paypal']
Enter fullscreen mode Exit fullscreen mode
  • Folder-level defaults — every model under staging/ is a view; every model under marts/ is a table; you override per-model only when needed (e.g. a single huge fact table flipped to incremental).
  • +schema: — dbt suffixes the target schema. With target schema analytics, a staging model lands in analytics_staging.
  • vars: — project-wide variables accessible in models via {{ var('start_date') }}. Use for environment-specific knobs like backfill windows.
  • +on_schema_change: — for incremental models, controls what happens when the source schema gains a column (append_new_columns is the safe default).

profiles.yml — connection credentials and target environments

profiles.yml lives at ~/.dbt/profiles.yml (or in the project root for CI) and never enters Git — it holds credentials. The file defines named targets for dev / prod / ci, each pointing at a different warehouse, schema, and credential set.

# ~/.dbt/profiles.yml
analytics:                       # matches `profile:` in dbt_project.yml
  target: dev                    # default target if --target not passed
  outputs:
    dev:
      type: snowflake
      account: my_account.us-east-1
      user: "{{ env_var('DBT_DEV_USER') }}"
      password: "{{ env_var('DBT_DEV_PASSWORD') }}"
      role: ANALYTICS_DEV
      database: ANALYTICS_DEV
      schema: dbt_alice          # per-developer schema — prevents stomping
      warehouse: COMPUTE_WH
      threads: 4

    prod:
      type: snowflake
      account: my_account.us-east-1
      user: "{{ env_var('DBT_PROD_USER') }}"
      password: "{{ env_var('DBT_PROD_PASSWORD') }}"
      role: ANALYTICS_PROD
      database: ANALYTICS
      schema: analytics
      warehouse: COMPUTE_WH
      threads: 8

    ci:
      type: snowflake
      account: my_account.us-east-1
      user: "{{ env_var('DBT_CI_USER') }}"
      password: "{{ env_var('DBT_CI_PASSWORD') }}"
      role: ANALYTICS_CI
      database: ANALYTICS_CI
      schema: "dbt_ci_pr_{{ env_var('PR_NUMBER', 'local') }}"
      warehouse: COMPUTE_WH_XS
      threads: 8
Enter fullscreen mode Exit fullscreen mode
  • schema: dbt_alice in dev — every developer gets their own schema; dbt creates objects under analytics_dev.dbt_alice_staging, analytics_dev.dbt_alice_marts, etc. No two developers stomp on each other.
  • schema: "dbt_ci_pr_{{ env_var('PR_NUMBER') }}" in CI — each PR gets a throwaway schema; dropped on merge. This is what makes Slim CI safe.
  • env_var('DBT_...') — credentials come from the environment, never the YAML.
  • threads: — dbt's concurrency knob. Dev = 4, prod = 8, CI = 8 are typical. Each thread runs one model.
  • role: (Snowflake) / location: (BigQuery) / catalog: (Databricks) — adapter-specific extras.

Adapter ecosystem — one project, every warehouse

dbt is adapter-driven: install a package, change the type: in profiles.yml, and the same models run against a different warehouse. The five most common adapters:

Adapter Install type: Typical use
dbt-snowflake pip install dbt-snowflake snowflake The most common production stack
dbt-bigquery pip install dbt-bigquery bigquery Google-shop default; great for ad-hoc analysts
dbt-databricks pip install dbt-databricks databricks Lakehouse / Delta-based projects
dbt-redshift pip install dbt-redshift redshift Legacy AWS data-warehouse teams
dbt-postgres pip install dbt-postgres postgres Local dev + small / self-hosted teams

Worked example — bootstrap a new dbt project from scratch

Detailed explanation. Every dbt team's first hour: dbt init, swap in real credentials, point at a sandbox schema, and verify the example model compiles. This is the muscle memory every interview opener tests.

Question. Bootstrap a new dbt project called analytics against Snowflake and run the default example model.

Input. A Snowflake account, a sandbox warehouse, a sandbox database, and a personal schema.

Code.

# 1. Install dbt-core + the Snowflake adapter
pip install dbt-snowflake==1.8.*

# 2. Scaffold a new project
dbt init analytics
# (prompts for adapter, account, user, password, role, database, schema, warehouse)

cd analytics

# 3. Verify the connection
dbt debug

# 4. Install community packages (packages.yml created later)
dbt deps

# 5. Compile every model (no warehouse writes)
dbt compile

# 6. Build the example models + run tests
dbt build
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. dbt init scaffolds the project skeleton (dbt_project.yml, models/example/) and writes a fresh profiles.yml under ~/.dbt/.
  2. dbt debug verifies every part of the connection: adapter present, credentials valid, the chosen role can read / write the target schema. Run this any time something feels off.
  3. dbt compile reads every model and writes the rendered SQL to target/compiled/. Nothing hits the warehouse; this is a fast syntax + ref-resolution check.
  4. dbt build runs every model and every test in dependency order. For Snowflake it executes CREATE TABLE / VIEWs into your sandbox schema.

Output.

$ dbt debug
All checks passed!

$ dbt build
Found 2 models, 4 tests, 0 sources, 0 exposures, 0 metrics, 348 macros
Concurrency: 4 threads (target='dev')

1 of 6 START sql view model dbt_alice.my_first_dbt_model ... [RUN]
1 of 6 OK created view model dbt_alice.my_first_dbt_model .. [CREATE VIEW in 0.41s]
...
Completed successfully — 6 succeeded, 0 failed
Enter fullscreen mode Exit fullscreen mode

Why this works — concept by concept:

  • dbt init ships a working starter project so you can prove the connection in under five minutes.
  • dbt debug is the single best diagnostic command — it walks every layer (adapter, network, auth, role permissions) and reports the first failure with the offending stanza.
  • dbt compile vs dbt build — compile renders SQL to disk; build executes it and runs tests. Use compile to iterate fast, build to ship.
  • Per-developer schema — the schema: dbt_alice default keeps every engineer's sandbox isolated; no overlap between teammates.
  • Costdbt debug is free; dbt compile is free (no warehouse compute); only dbt build and dbt run cost warehouse credits.

SQL
Topic — etl
Pipeline structure drills

Practice →

SQL
Topic — data-transformation
Warehouse transformation practice

Practice →


3. Models — refs, sources, materializations, layered DAG

Visual diagram of dbt model DAG — a source layer with 3 raw source cards on the left, a staging layer with 3 stg_* model cards, an intermediate layer with 2 int_* models, and a marts layer with 2 final fct_/dim_ models on the right. Each model card shows its materialization badge (view / table / incremental / ephemeral). Thin glowing ref() arrows connect dependent models; on a light PipeCode card.

dbt models — every .sql file is a versioned SELECT

dbt models are the unit of work — every .sql file under models/ is a single SELECT statement that dbt wraps in a CREATE TABLE or CREATE VIEW against your warehouse. You never write the DDL yourself; dbt generates it based on the model's materialization.

The model contract — one SELECT, zero side effects.

  • A model is one SELECT at the top level; no CREATE, no INSERT, no MERGE.
  • The compiler wraps it with the appropriate DDL based on materialization.
  • The model's name is the file name (stg_orders.sqlstg_orders relation).
  • The model's upstreams are inferred from every {{ ref('...') }} and {{ source('...', '...') }} call inside it.

dbt ref vs source — the two ways a model declares its inputs

{{ ref('upstream_model') }} points at another dbt model. {{ source('schema', 'table') }} points at a raw table you don't own (a Fivetran-loaded raw schema, a Postgres replica, a Kafka sink). The two together form a complete dependency graph dbt walks at run time.

-- models/staging/stg_orders.sql
{{ config(materialized='view') }}

with src as (
    select * from {{ source('raw_jaffle_shop', 'orders') }}
)

select
    id            as order_id,
    user_id       as customer_id,
    order_date,
    status,
    cast(amount_cents as numeric) / 100 as amount_usd
from src
Enter fullscreen mode Exit fullscreen mode
-- models/marts/fct_orders.sql
{{ config(materialized='table') }}

with orders as (
    select * from {{ ref('stg_orders') }}
),
customers as (
    select * from {{ ref('stg_customers') }}
)

select
    o.order_id,
    o.customer_id,
    c.region,
    o.order_date,
    o.amount_usd
from orders o
left join customers c using (customer_id)
where o.status = 'completed'
Enter fullscreen mode Exit fullscreen mode
# models/staging/_stg_sources.yml
version: 2

sources:
  - name: raw_jaffle_shop
    database: RAW
    schema: jaffle_shop
    tables:
      - name: orders
        description: "Raw orders from the production OLTP DB, replicated by Fivetran."
        freshness:
          warn_after: { count: 12, period: hour }
          error_after: { count: 24, period: hour }
        loaded_at_field: _fivetran_synced
      - name: customers
Enter fullscreen mode Exit fullscreen mode
  • source() lets you swap the underlying raw table (e.g. move from one ingest tool to another) by editing one YAML; every staging model picks up the change.
  • freshness thresholds power dbt source freshness, which is your first line of defense against silent upstream breakage.
  • ref() computes the DAG. dbt re-orders execution automatically — you never write CREATE TABLE x DEPENDS ON y.
  • The non-negotiable rule — never select * from analytics.staging.stg_orders directly. Always ref(). Hard-coded names break Slim CI, --defer, and cross-environment portability.

dbt materializations — view, table, incremental, ephemeral

dbt materializations are the four shapes a model can take in your warehouse. Pick the right one per layer; the wrong choice is the most common source of slow or expensive dbt projects.

Materialization What dbt does When to use Cost shape
view CREATE OR REPLACE VIEW — no data stored Staging models, ad-hoc transforms over small data Cheap to refresh, slow to query
table CREATE OR REPLACE TABLE AS SELECT — full rebuild every run Marts, anything BI tools hit, anything joined to repeatedly Fast to query, full-rebuild cost per run
incremental First run = table; subsequent runs = MERGE of new rows Billion-row events / fact tables you can't fully rebuild Cheap incremental cost, complexity overhead
ephemeral Inlined as a CTE in the downstream model — never materialised Small reusable joins, no direct querying Zero storage; not queryable directly

Incremental models — the production fact-table default.

-- models/marts/fct_events.sql
{{ config(
    materialized='incremental',
    unique_key='event_id',
    on_schema_change='append_new_columns',
    incremental_strategy='merge'
) }}

select
    event_id,
    user_id,
    event_type,
    occurred_at,
    payload
from {{ source('raw_events', 'events') }}

{% if is_incremental() %}
  -- only scan new rows
  where occurred_at > (select coalesce(max(occurred_at), '1900-01-01')
                       from {{ this }})
{% endif %}
Enter fullscreen mode Exit fullscreen mode
  • is_incremental() macro — true only when the target table already exists; lets the same file run as a full rebuild on first run and incrementally afterwards.
  • {{ this }} — refers to the current model's target relation (e.g. analytics.marts.fct_events).
  • unique_key — column dbt uses to determine "is this row new or an update?".
  • incremental_strategy='merge' — the default on Snowflake / BigQuery / Databricks; on Postgres / Redshift the default is delete+insert.
  • on_schema_change='append_new_columns' — when the source schema gains a column, dbt adds it to the target without failing the run. Safer than the default ignore.

The layered DAG — sources → staging → intermediate → marts

The four-layer pattern is the project shape every senior dbt team converges on:

  • Sources (raw) — owned by Fivetran / Airbyte / your replication tool; declared via source().
  • Staging (stg_*) — 1:1 with sources; rename columns, cast types, add safe_cast, drop PII. Materialised as view.
  • Intermediate (int_*) — reusable joins and business logic. Materialised as ephemeral (small reuse) or table (heavy reuse).
  • Marts (fct_*, dim_*) — the contract to BI / business. Materialised as table or incremental.

The contract: downstream layers may only ref() upstream layers. Marts may not ref() other marts (instead, factor the join into an intermediate). Staging may not ref() other staging (instead, hold the join until intermediate). Enforce this with a dbt_project.yml config or a CI lint.

Worked example — a layered DAG with three layers + an incremental fact

Detailed explanation. Wire up a tiny but real DAG: two raw sources, two staging models, one intermediate model, one incremental fact, one dim table. This is the shape every junior interview asks you to sketch.

Question. Build a fct_orders incremental fact that joins to dim_customers, sourced from raw jaffle_shop.orders and jaffle_shop.customers.

Input.

Layer File Materialization Upstream
source raw_jaffle_shop.orders (raw) Fivetran
source raw_jaffle_shop.customers (raw) Fivetran
staging stg_orders.sql view source orders
staging stg_customers.sql view source customers
intermediate int_orders_enriched.sql ephemeral stg_orders + stg_customers
mart dim_customers.sql table stg_customers
mart fct_orders.sql incremental int_orders_enriched

Code.

-- models/intermediate/int_orders_enriched.sql
{{ config(materialized='ephemeral') }}

select
    o.order_id,
    o.customer_id,
    c.region,
    o.order_date,
    o.amount_usd,
    o.status
from {{ ref('stg_orders') }}      o
left join {{ ref('stg_customers') }} c using (customer_id)
Enter fullscreen mode Exit fullscreen mode
-- models/marts/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge'
) }}

select *
from {{ ref('int_orders_enriched') }}
where status = 'completed'

{% if is_incremental() %}
  and order_date > (select max(order_date) from {{ this }})
{% endif %}
Enter fullscreen mode Exit fullscreen mode
# Build only the orders subgraph
dbt build --select +fct_orders
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. dbt parse walks every .sql and discovers the upstream chain via ref() and source().
  2. --select +fct_orders selects fct_orders and all upstream nodes (the + prefix). dbt schedules stg_orders, stg_customers, int_orders_enriched, dim_customers, fct_orders in dependency order.
  3. int_orders_enriched is ephemeral — dbt never creates it as a table; instead it inlines the SQL as a CTE inside fct_orders at compile time.
  4. fct_orders is incremental; first run = CREATE TABLE, subsequent runs = MERGE INTO fct_orders USING (the SELECT) ON order_id = order_id.
  5. Tests attached to any of these models run inline (because we used dbt build).

Output (compiled fct_orders, second run).

MERGE INTO analytics.marts.fct_orders AS target
USING (
    with int_orders_enriched as (
        select o.order_id, o.customer_id, c.region, o.order_date,
               o.amount_usd, o.status
        from analytics.staging.stg_orders o
        left join analytics.staging.stg_customers c using (customer_id)
    )
    select *
    from int_orders_enriched
    where status = 'completed'
      and order_date > (select max(order_date) from analytics.marts.fct_orders)
) AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
Enter fullscreen mode Exit fullscreen mode

Why this works — concept by concept:

  • Layered DAG isolates concerns: staging never knows about business rules; marts never know about raw column quirks.
  • ephemeral keeps int_orders_enriched out of the warehouse — useful since it's only joined to once.
  • incremental + unique_key turns full rebuilds into MERGEs, making billion-row fact tables tractable.
  • +fct_orders selector scopes the run to the chain that matters; great for local iteration.
  • Cost — first run is O(all orders); subsequent runs are O(new orders only). For a busy fact table the savings compound daily.

SQL
Topic — dimensional-modeling
Dimensional-modeling drills

Practice →

SQL
Topic — ctes
CTE pattern practice

Practice →


4. Tests — generic schema tests, singular tests, model contracts

Visual diagram of dbt tests — three labelled test families as side-by-side panels: Panel 1 (Generic schema tests: unique, not_null, accepted_values, relationships); Panel 2 (Singular tests: a custom SQL query that returns failing rows); Panel 3 (Model contracts: an enforced schema with column data types and constraints). Each panel shows a tiny YAML/SQL block and a pass/fail outcome chip; on a light PipeCode card.

dbt tests — three families, one promise: bad data fails the build

dbt tests are the second-most-important pillar after models — they're the contract that turns SQL into a tested codebase. dbt ships three families: generic schema tests (declarative, one-liner per column), singular tests (bespoke SQL that returns failing rows), and model contracts (warehouse-enforced column types and constraints). Use all three.

dbt generic tests — declarative, in YAML, on every column that matters

Generic tests are the cheapest unit of correctness in dbt. You declare them in YAML next to the model; dbt runs them as SELECT COUNT(*) FROM (...) WHERE expected_invariant_violated. Zero rows back = pass.

# models/marts/_marts.yml
version: 2

models:
  - name: fct_orders
    description: "Order-grain facts for revenue reporting."
    columns:
      - name: order_id
        description: "Primary key  one row per order."
        tests:
          - unique
          - not_null
      - name: customer_id
        description: "FK to dim_customers."
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['completed', 'pending', 'cancelled']
              severity: error
      - name: amount_usd
        tests:
          - dbt_utils.expression_is_true:
              expression: ">= 0"
              config:
                severity: warn

  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
Enter fullscreen mode Exit fullscreen mode
  • uniqueSELECT col, COUNT(*) FROM model GROUP BY col HAVING COUNT(*) > 1. Fails if any duplicate.
  • not_nullWHERE col IS NULL.
  • accepted_valuesWHERE col NOT IN (allowed_list).
  • relationshipsWHERE fk NOT IN (SELECT pk FROM target_model); the equivalent of a foreign-key check.
  • dbt_utils.expression_is_true — boolean predicate; failing rows are those where the expression is false.
  • severity: warn vs severity: error — warn logs the failure but exits 0; error fails the build. Use warn for data-quality smells you want to triage; use error for invariants that must hold.

dbt singular tests — bespoke SQL, one file, one query

Singular tests cover anything generic tests can't — multi-table joins, business-rule invariants, sanity checks across the warehouse. Each is a .sql file under tests/; the file is a single SELECT that returns failing rows.

-- tests/assert_no_negative_revenue.sql
-- This test passes when zero rows are returned.

select
    region,
    sum(amount_usd) as total_revenue
from {{ ref('fct_orders') }}
group by region
having sum(amount_usd) < 0
Enter fullscreen mode Exit fullscreen mode
-- tests/assert_orders_have_a_customer.sql
-- Catches orphan orders missing a matching dim_customers row.

select o.order_id
from {{ ref('fct_orders') }}  o
left join {{ ref('dim_customers') }} c using (customer_id)
where c.customer_id is null
Enter fullscreen mode Exit fullscreen mode
  • The contract — zero rows = pass; any rows = fail. The failing rows are written to target/run_results.json and (with --store-failures) to a debug table you can inspect.
  • Naming conventionassert_*.sql so test files sort together and the intent is obvious.
  • Cross-model invariants — singular tests are the only way to test "this column in model A matches the sum of this column in model B".
  • Don't reach for singular tests when a generic existsaccepted_values is a one-liner; rewriting it as a singular SQL is noise.

dbt model contracts — warehouse-enforced schemas

Model contracts (added in dbt 1.5) enforce the column list, data types, and constraints at build time — before the SQL even runs. They're how you turn a model into a versioned API for downstream consumers (other teams, BI tools, the Semantic Layer).

# models/marts/_marts.yml
version: 2

models:
  - name: fct_orders
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: integer
        constraints:
          - type: primary_key
          - type: not_null
      - name: customer_id
        data_type: integer
        constraints:
          - type: foreign_key
            expression: "{{ ref('dim_customers') }} (customer_id)"
          - type: not_null
      - name: order_date
        data_type: date
        constraints:
          - type: not_null
      - name: status
        data_type: varchar(20)
        constraints:
          - type: not_null
      - name: amount_usd
        data_type: numeric(10,2)
        constraints:
          - type: check
            expression: "amount_usd >= 0"
Enter fullscreen mode Exit fullscreen mode
  • contract.enforced: true — at compile time dbt validates that the SELECT's column list, names, and data types match the YAML exactly. A column rename without updating the contract = build fail.
  • constraints: — pushed to the warehouse where supported. Snowflake and Databricks support not_null and check; BigQuery is partial; Postgres / Redshift are full.
  • The interview signal — model contracts are the closest dbt has to "typed APIs"; senior teams use them on every mart that BI tools or sibling teams depend on.

Worked example — three test families layered on a single model

Detailed explanation. A real fct_orders lands with all three test families: generic tests on every column, a singular test across the orders + customers relationship, and a contract enforcing the public column shape.

Question. Show the full test surface for fct_orders and run dbt test --select fct_orders.

Input.

Test family Where it lives What it catches
generic _marts.yml columns block column-level invariants (unique, not_null, FK)
singular tests/assert_no_orphans.sql cross-model relationship sanity
contract _marts.yml model config.contract schema drift between SELECT and declared types

Code.

# Run all tests attached to fct_orders
dbt test --select fct_orders

# Run tests with failure-row storage so you can inspect bad rows
dbt test --select fct_orders --store-failures

# Run only error-severity tests (skip warns)
dbt test --select fct_orders --exclude-resource-type test --severity error
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. dbt test --select fct_orders selects every test whose model or ref matches fct_orders.
  2. Generic tests compile to a SELECT that returns failing rows; dbt counts those rows and reports pass / fail.
  3. Singular tests are already SELECT statements; same shape.
  4. Model contracts run before the model SQL; if the SELECT's columns / types don't match the YAML, the build aborts.
  5. With --store-failures, failing rows land in a dbt_test_failures schema you can query for triage.

Output.

Running with dbt=1.8.3
Found 1 model, 7 tests, 1 contract

1 of 7 START test unique_fct_orders_order_id .............. [RUN]
1 of 7 PASS unique_fct_orders_order_id .................... [PASS in 0.18s]
2 of 7 START test not_null_fct_orders_order_id ............ [RUN]
2 of 7 PASS not_null_fct_orders_order_id .................. [PASS in 0.09s]
3 of 7 START test relationships_fct_orders_customer_id .... [RUN]
3 of 7 PASS relationships_fct_orders_customer_id .......... [PASS in 0.22s]
4 of 7 START test accepted_values_fct_orders_status ....... [RUN]
4 of 7 PASS accepted_values_fct_orders_status ............. [PASS in 0.11s]
5 of 7 START test dbt_utils_expression_is_true_amount ..... [RUN]
5 of 7 WARN dbt_utils_expression_is_true_amount ........... [WARN — 3 rows]
6 of 7 START test assert_no_orphans ....................... [RUN]
6 of 7 PASS assert_no_orphans ............................. [PASS in 0.31s]
7 of 7 START contract fct_orders .......................... [RUN]
7 of 7 PASS contract fct_orders ........................... [PASS]

Completed — 6 passed, 0 failed, 1 warning, 0 errors
Enter fullscreen mode Exit fullscreen mode

Why this works — concept by concept:

  • Generic tests are the cheapest invariants — one YAML line catches duplicate PKs, NULL FKs, bad enum values.
  • Singular tests cover cross-model sanity checks that generic tests can't express.
  • Model contracts lift schema checking from runtime to compile time — the most expensive failures (schema drift) become free to catch.
  • severity: warn lets you stage a new test in production without breaking the build; flip to error once the false positives are cleared.
  • Cost — every test is one extra SELECT; cheap. The cost of not testing a column is one bad BI dashboard and a Monday-morning fire drill.

SQL
Topic — validation
Data-validation drills

Practice →

SQL
Topic — data-validation
Schema-test practice

Practice →


5. Macros + Jinja — write once, compile per-call

Visual diagram of dbt macros + Jinja — left panel shows a small macro definition card (macros/cents_to_dollars.sql) with a tiny Jinja -…- icon; center shows a model card calling the macro with arguments; right shows a compiled SQL card on the warehouse with the macro inlined; below all three a small package-import strip showing common community packages (dbt_utils, dbt_expectations, dbt_audit_helper); on a light PipeCode card.

dbt macros — Jinja templates that inline SQL across many models

dbt macros are the third pillar after models and tests. A macro is a Jinja function that returns SQL; you define it once under macros/ and call it from any model. At compile time, dbt inlines the macro's output exactly where you called it — no runtime overhead, no extra warehouse round-trips.

The macro lifecycle in three steps.

  1. Define — write a .sql file under macros/ containing a {% macro name(args) %} ... {% endmacro %} block.
  2. Call — invoke it from a model with {{ name(args) }}.
  3. Compile — dbt expands the call into raw SQL written to target/compiled/.... The warehouse only ever sees the expanded form.

Defining a macro — small, pure, reusable

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, decimals=2) %}
    cast({{ column_name }} as numeric) / 100
{% endmacro %}
Enter fullscreen mode Exit fullscreen mode
-- macros/pivot_status_counts.sql
{% macro pivot_status_counts(status_column, statuses) %}
    {% for s in statuses %}
        sum(case when {{ status_column }} = '{{ s }}' then 1 else 0 end) as {{ s }}_count
        {% if not loop.last %},{% endif %}
    {% endfor %}
{% endmacro %}
Enter fullscreen mode Exit fullscreen mode
-- macros/get_payment_methods.sql — used by models to access vars
{% macro get_payment_methods() %}
    {{ return(var('payment_methods', ['credit_card', 'ach'])) }}
{% endmacro %}
Enter fullscreen mode Exit fullscreen mode
  • Arguments with defaultsdecimals=2 makes the second arg optional.
  • {% for %} loops — Jinja control flow; loop.last is true on the final iteration.
  • {{ return(...) }} — for macros that produce a Python-side value (not SQL); useful for variable factories.
  • Keep them small + pure — a 5-line macro is a delight; a 50-line macro with conditional dispatch becomes the next engineer's nightmare. Prefer composition.

Calling a macro — three syntaxes for three contexts

-- models/marts/fct_revenue.sql
{{ config(materialized='table') }}

select
    region,
    {{ cents_to_dollars('amount_cents') }}      as amount_usd,
    {{ pivot_status_counts('status', ['paid', 'pending', 'cancelled']) }}
from {{ ref('stg_orders') }}
group by region
Enter fullscreen mode Exit fullscreen mode
-- Compiled output written to target/compiled/analytics/models/marts/fct_revenue.sql
select
    region,
    cast(amount_cents as numeric) / 100      as amount_usd,
    sum(case when status = 'paid' then 1 else 0 end) as paid_count,
    sum(case when status = 'pending' then 1 else 0 end) as pending_count,
    sum(case when status = 'cancelled' then 1 else 0 end) as cancelled_count
from analytics.staging.stg_orders
group by region
Enter fullscreen mode Exit fullscreen mode
  • {{ macro_name(args) }} — expression form; returns a string that gets inlined.
  • {% do macro_name(args) %} — statement form; for side-effectful macros that don't return SQL (e.g. logging).
  • {% set var = macro_name(args) %} — assign the macro's return into a Jinja variable for later reuse in the same compile pass.

Jinja control flow inside a model

Jinja makes SQL templating practical. Use it to loop over columns, conditionally include CTEs, switch behaviour per adapter, or build pivot tables dynamically.

-- models/marts/fct_revenue_by_method.sql
{% set payment_methods = ['credit_card', 'ach', 'paypal'] %}

select
    order_date,
    {% for m in payment_methods %}
    sum(case when payment_method = '{{ m }}' then amount_usd else 0 end) as revenue_{{ m }}
    {% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_payments') }}
group by order_date
Enter fullscreen mode Exit fullscreen mode
-- Adapter-conditional logic
select *,
    {% if target.type == 'snowflake' %}
        current_timestamp::timestamp_ntz as loaded_at
    {% elif target.type == 'bigquery' %}
        current_timestamp() as loaded_at
    {% else %}
        now() as loaded_at
    {% endif %}
from {{ ref('stg_orders') }}
Enter fullscreen mode Exit fullscreen mode
  • {% set %} — declare a Jinja variable scoped to the model.
  • {% for %} / {% endfor %} — loop; great for building pivot SUMs without hand-writing N rows.
  • {% if %} / {% elif %} / {% else %} — conditional SQL; the canonical adapter-switching pattern.
  • target.type — at compile time you know which warehouse you're compiling for; use it sparingly to bridge dialect gaps.

dbt_utils — the community macro standard library

dbt_utils ships dozens of macros every project ends up using. The four most-used:

  • generate_surrogate_key(['col_a', 'col_b']) — hash-based composite key generation; the workhorse of dim-table modeling.
  • dbt_utils.star(from=ref('stg_orders'), except=['raw_payload']) — expand * minus a few columns. Essential when staging models drop PII.
  • dbt_utils.pivot('status', ['paid', 'pending', 'cancelled']) — pivot a column into N counts. Replaces the loop above with a one-liner.
  • dbt_utils.date_spine(datepart='day', start_date='2024-01-01', end_date='2026-01-01') — generate a contiguous calendar table on the fly; great for cohort and gap-filling work.
# packages.yml — install dbt_utils so its macros become callable
packages:
  - package: dbt-labs/dbt_utils
    version: 1.2.0
  - package: calogica/dbt_expectations
    version: 0.10.4
  - package: dbt-labs/audit_helper
    version: 0.12.0
  - package: calogica/dbt_date
    version: 0.10.1
Enter fullscreen mode Exit fullscreen mode
dbt deps   # installs every package into dbt_packages/
Enter fullscreen mode Exit fullscreen mode

Worked example — replace 50 lines of hand-written SQL with one macro call

Detailed explanation. Every dbt codebase ages into duplicated logic — same case when status in ('paid', 'completed', 'fulfilled') then 1 repeated across 20 models. The refactor: factor it into one macro, then call it everywhere.

Question. Replace duplicated revenue-pivot SQL across fct_revenue_daily and fct_revenue_weekly with a shared pivot_status_counts macro.

Input. Two models that each hand-write five sum(case when status = '...' then 1 else 0 end) columns.

Code (before — duplicated SQL across two models).

-- models/marts/fct_revenue_daily.sql  (BEFORE)
select
    order_date,
    sum(case when status = 'paid'       then 1 else 0 end) as paid_count,
    sum(case when status = 'pending'    then 1 else 0 end) as pending_count,
    sum(case when status = 'cancelled'  then 1 else 0 end) as cancelled_count,
    sum(case when status = 'refunded'   then 1 else 0 end) as refunded_count,
    sum(case when status = 'shipped'    then 1 else 0 end) as shipped_count
from {{ ref('stg_orders') }}
group by order_date
Enter fullscreen mode Exit fullscreen mode

Code (after — one macro, two callers).

-- macros/pivot_status_counts.sql
{% macro pivot_status_counts(status_column, statuses) %}
    {% for s in statuses %}
    sum(case when {{ status_column }} = '{{ s }}' then 1 else 0 end) as {{ s }}_count
    {% if not loop.last %},{% endif %}
    {% endfor %}
{% endmacro %}
Enter fullscreen mode Exit fullscreen mode
-- models/marts/fct_revenue_daily.sql  (AFTER)
{% set statuses = ['paid', 'pending', 'cancelled', 'refunded', 'shipped'] %}

select
    order_date,
    {{ pivot_status_counts('status', statuses) }}
from {{ ref('stg_orders') }}
group by order_date
Enter fullscreen mode Exit fullscreen mode
-- models/marts/fct_revenue_weekly.sql  (AFTER)
{% set statuses = ['paid', 'pending', 'cancelled', 'refunded', 'shipped'] %}

select
    date_trunc('week', order_date) as week,
    {{ pivot_status_counts('status', statuses) }}
from {{ ref('stg_orders') }}
group by 1
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The macro takes a column name and a list of values; Jinja's {% for %} loop unrolls one sum(case when ...) per status.
  2. {% if not loop.last %},{% endif %} adds a trailing comma between expressions but not after the last one — the trick to clean compiled SQL.
  3. Each caller {% set statuses = [...] %} keeps the list local so two models can diverge if needed.
  4. dbt compiles the macro to identical SQL in both callers — zero warehouse difference, full source dedup.

Output (compiled fct_revenue_daily).

select
    order_date,
    sum(case when status = 'paid' then 1 else 0 end) as paid_count,
    sum(case when status = 'pending' then 1 else 0 end) as pending_count,
    sum(case when status = 'cancelled' then 1 else 0 end) as cancelled_count,
    sum(case when status = 'refunded' then 1 else 0 end) as refunded_count,
    sum(case when status = 'shipped' then 1 else 0 end) as shipped_count
from analytics.staging.stg_orders
group by order_date
Enter fullscreen mode Exit fullscreen mode

Why this works — concept by concept:

  • Macro factoring removes a class of bugs — adding a new status now updates the list in one place, not N.
  • Jinja {% for %} is the right tool when you'd otherwise hand-write N parallel columns; doubly so when N changes over time.
  • Compile-time inlining means the warehouse never sees Jinja; performance is identical to hand-written SQL.
  • loop.last is the Jinja idiom for "skip the trailing separator"; commit this one to muscle memory.
  • Cost — Jinja compilation runs at parse time; the warehouse sees only inlined SQL. The savings are in maintenance hours, not runtime.

SQL
Topic — conditional-aggregation
Conditional-aggregation drills

Practice →

SQL
Topic — aggregation
Aggregation pattern practice

Practice →


6. Packages ecosystem — dbt_utils · dbt_expectations · dbt_audit_helper · Elementary

dbt packages — install once, get hundreds of macros for free

dbt packages are git-cloneable bundles of macros, tests, and models the community maintains. The four packages every senior team installs on day one: dbt_utils (the standard library), dbt_expectations (Great-Expectations-style tests), audit_helper (regression tooling for migrations), and elementary (observability + freshness alerts).

dbt_utils — the standard library beyond macros

dbt_utils is more than just macros — it also ships generic tests you can attach in YAML alongside the built-in unique / not_null ones.

# Generic tests from dbt_utils
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - dbt_utils.unique_combination_of_columns:
              combination_of_columns: [order_id, line_item_id]
      - name: amount_usd
        tests:
          - dbt_utils.expression_is_true:
              expression: ">= 0"
              severity: warn
      - name: order_date
        tests:
          - dbt_utils.recency:
              datepart: day
              field: order_date
              interval: 1
Enter fullscreen mode Exit fullscreen mode
  • unique_combination_of_columns — composite-key uniqueness; the right tool when the PK is two columns together.
  • expression_is_true — any boolean SQL expression as a test.
  • recency — fails if the most-recent row is older than interval; canonical freshness sanity.
  • equal_rowcount — compares row counts between two relations; the workhorse of staging-to-marts sanity.

dbt_expectations — Great-Expectations-style declarative data quality

dbt_expectations ports the Great Expectations API to dbt: 60+ generic tests covering distributional, statistical, and pattern-based invariants.

# Distributional + format tests from dbt_expectations
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - dbt_expectations.expect_column_values_to_be_unique
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: '^ORD-[0-9]{8}$'
      - name: amount_usd
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100000
              row_condition: "status = 'completed'"
          - dbt_expectations.expect_column_mean_to_be_between:
              min_value: 10
              max_value: 500
      - name: status
        tests:
          - dbt_expectations.expect_column_values_to_be_in_set:
              value_set: ['completed', 'pending', 'cancelled', 'refunded']
Enter fullscreen mode Exit fullscreen mode
  • expect_column_values_to_be_between — range check; great for sanity caps on revenue / quantity.
  • expect_column_mean_to_be_between — distributional drift detector; catches the day a join goes wrong and revenue jumps 10×.
  • expect_column_values_to_match_regex — pattern enforcement; great for IDs and email columns.
  • row_condition: — scope the test to a subset of rows.

audit_helper — diff two relations during migrations

audit_helper is the package every team installs when they migrate a critical model — say, refactoring fct_orders to incremental, or porting Looker SQL into dbt. It ships macros that diff two relations and tell you exactly what changed.

-- analyses/compare_fct_orders.sql
-- Run with: dbt compile -s compare_fct_orders, then paste into your warehouse.

{% set old_query %}
    select * from analytics.legacy.fct_orders_v1
{% endset %}

{% set new_query %}
    select * from {{ ref('fct_orders') }}
{% endset %}

{{ audit_helper.compare_queries(
    a_query=old_query,
    b_query=new_query,
    primary_key='order_id'
) }}
Enter fullscreen mode Exit fullscreen mode
  • compare_queries — full row-level diff; tells you "8,231 matches, 12 missing in new, 0 missing in old, 45 differences in non-PK columns".
  • compare_column_values — per-column value distribution comparison; the right tool when you suspect a single column changed.
  • compare_relation_columns — schema diff; columns added / removed / type-changed.
  • The migration ritual — every refactor of a critical model should ship with an audit_helper analysis in the PR description; reviewers see the diff and approve.

elementary — observability over dbt artifacts

elementary is the open-source observability layer that reads target/manifest.json and target/run_results.json after every run and turns them into freshness alerts, anomaly detection, and a Slack channel that pages on-call when something breaks.

# packages.yml — add elementary
packages:
  - package: elementary-data/elementary
    version: 0.15.0
Enter fullscreen mode Exit fullscreen mode
# models/_elementary.yml — turn on monitoring
models:
  - name: fct_orders
    config:
      elementary:
        timestamp_column: order_date
    tests:
      - elementary.volume_anomalies:
          time_bucket: { period: day, count: 1 }
      - elementary.freshness_anomalies
      - elementary.dimension_anomalies:
          dimensions:
            - region
Enter fullscreen mode Exit fullscreen mode
  • volume_anomalies — row-count anomaly detection; flags the day order volume drops 80% (a likely upstream outage).
  • freshness_anomalies — flags the day a model's loaded_at stops advancing.
  • dimension_anomalies — flags the day a dimension's value distribution shifts significantly.
  • Slack / PagerDuty integration — Elementary ships a CLI you run after dbt build that posts alerts to your incident channel.

A summary table — which package to reach for

Package What it ships When you need it
dbt_utils Surrogate keys, pivots, date spines, composite tests Every dbt project — install on day one
dbt_expectations 60+ distributional / pattern / range tests When unique / not_null aren't enough
audit_helper Diff two relations during migrations Refactors, OLAP-engine swaps, vendor cutovers
elementary Freshness, anomaly, lineage observability When dbt is in production with on-call rotations
dbt_date Calendar / fiscal / business-day helpers Finance / accounting / cohort work
dbt_artifacts Persist run metadata into a warehouse table Custom dashboards over dbt runs
re_data Alternative observability stack Teams that prefer it over Elementary

Worked example — adopt dbt_utils + dbt_expectations on a single model

Detailed explanation. Install the two most-used packages and add three tests to fct_orders you couldn't have written without them.

Question. Wire dbt_utils.surrogate_key, dbt_expectations.expect_column_values_to_be_between, and dbt_expectations.expect_column_mean_to_be_between into the fct_orders model.

Input. A fresh dbt project with packages.yml already installed.

Code.

-- models/marts/fct_orders.sql
{{ config(materialized='table') }}

select
    {{ dbt_utils.generate_surrogate_key(['order_id', 'line_item_id']) }} as order_line_sk,
    order_id,
    line_item_id,
    customer_id,
    order_date,
    amount_usd,
    status
from {{ ref('stg_orders_lines') }}
Enter fullscreen mode Exit fullscreen mode
# models/marts/_marts.yml
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_line_sk
        tests:
          - unique
          - not_null
      - name: amount_usd
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100000
              severity: error
          - dbt_expectations.expect_column_mean_to_be_between:
              min_value: 10
              max_value: 500
              severity: warn
      - name: status
        tests:
          - dbt_expectations.expect_column_values_to_be_in_set:
              value_set: ['completed', 'pending', 'cancelled']
Enter fullscreen mode Exit fullscreen mode
dbt deps                          # installs packages
dbt build --select fct_orders     # runs model + every test
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. dbt deps clones dbt_utils and dbt_expectations into dbt_packages/.
  2. generate_surrogate_key(['a', 'b']) returns a md5(a || '-' || b) expression specific to the active adapter.
  3. expect_column_values_to_be_between(min=0, max=100000) runs SELECT * FROM fct_orders WHERE amount_usd < 0 OR amount_usd > 100000 — failing rows.
  4. expect_column_mean_to_be_between(min=10, max=500) runs an aggregate test — fails if the table's average amount_usd is outside the range.
  5. dbt build ships them all in one DAG walk; severity flags decide which fail the run vs warn.

Output.

1 of 5 PASS unique_fct_orders_order_line_sk
2 of 5 PASS not_null_fct_orders_order_line_sk
3 of 5 PASS dbt_expectations_expect_column_values_to_be_between_amount_usd
4 of 5 WARN dbt_expectations_expect_column_mean_to_be_between_amount_usd  [WARN — mean 8.4 below 10]
5 of 5 PASS dbt_expectations_expect_column_values_to_be_in_set_status

Completed — 4 passed, 0 failed, 1 warning, 0 errors
Enter fullscreen mode Exit fullscreen mode

Why this works — concept by concept:

  • Composite keysdbt_utils.generate_surrogate_key is the canonical way to hash multiple columns into one PK; saves you N lines of md5(concat(...)) per model.
  • Range testsexpect_column_values_to_be_between catches the bug where a join multiplies rows and revenue jumps 10×.
  • Distributional testsexpect_column_mean_to_be_between is the kind of invariant you can't express with unique / not_null; the mean drifting is the first signal of a quiet upstream bug.
  • Severity tuning — error for hard invariants (range), warn for soft signals (drift); turns dbt into a tunable alarm system.
  • Cost — every test is one SELECT; the marginal cost is small. The cost of not catching a 10× revenue inflation is real money.

SQL
Topic — data-validation
Data-quality test drills

Practice →

SQL
Topic — aggregation
Aggregation + surrogate-key practice

Practice →


7. Production patterns + CI/CD — Slim CI · orchestration · observability

dbt production patterns — what it takes to run dbt on call

dbt production patterns is the last pillar — every other pillar matters only if the project actually ships to production cleanly. Senior loops zero in on four moves: Slim CI on PRs, scheduled dbt build in dbt Cloud or Airflow, observability via Elementary, and the dbt Cloud vs Core decision that drives org-level choices.

dbt Slim CI — only rebuild what changed

dbt Slim CI is the highest-leverage CI optimisation in the ecosystem. Without it, every PR rebuilds your whole DAG; with it, PRs build only the changed subgraph and stitch upstream refs to production relations via --defer.

# .github/workflows/dbt-ci.yml
name: dbt CI

on:
  pull_request:
    paths: ['models/**', 'tests/**', 'macros/**', 'dbt_project.yml', 'packages.yml']

jobs:
  dbt-ci:
    runs-on: ubuntu-latest
    env:
      DBT_CI_USER:     ${{ secrets.DBT_CI_USER }}
      DBT_CI_PASSWORD: ${{ secrets.DBT_CI_PASSWORD }}
      PR_NUMBER:       ${{ github.event.pull_request.number }}

    steps:
      - uses: actions/checkout@v4

      - name: Install dbt + adapter
        run: pip install dbt-snowflake==1.8.*

      - name: Install packages
        run: dbt deps

      - name: Download prod manifest for --defer baseline
        run: aws s3 cp s3://my-bucket/dbt/prod_manifest.json ./prod_manifest/manifest.json

      - name: Slim CI build
        run: |
          dbt build \
            --select state:modified+ \
            --defer \
            --state ./prod_manifest \
            --target ci \
            --fail-fast

      - name: Drop the CI schema on PR close
        if: github.event.action == 'closed'
        run: dbt run-operation drop_schema --args "{schema: dbt_ci_pr_${PR_NUMBER}}"
Enter fullscreen mode Exit fullscreen mode
  • state:modified+ — modified models plus everything downstream; the canonical Slim CI selector.
  • --defer + --state — unselected refs resolve to the production manifest's relations, so you don't have to rebuild upstream chains.
  • --fail-fast — abort on first failure; saves CI minutes when something is obviously broken.
  • PR-scoped schemas — each PR builds into dbt_ci_pr_123; the schema is dropped on PR close so CI databases don't grow unbounded.

dbt scheduling — dbt Cloud, Airflow, or GitHub Actions

Once dbt is in production, something has to run it on a schedule. Three common patterns:

# dbt Cloud — the managed path
# Configure in the UI: a job that runs `dbt build` daily at 06:00 UTC,
# attached to the prod environment, with email + Slack alerts on failure.
Enter fullscreen mode Exit fullscreen mode
# Airflow — for teams with existing DAGs
from airflow import DAG
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator
# Or for dbt Core via the standard BashOperator + cosmos:
from cosmos import DbtTaskGroup, ProjectConfig, ProfileConfig

with DAG('analytics_dbt', schedule='0 6 * * *', catchup=False) as dag:
    dbt_run = DbtTaskGroup(
        group_id='dbt_build',
        project_config=ProjectConfig('/opt/airflow/analytics'),
        profile_config=ProfileConfig(
            profile_name='analytics',
            target_name='prod',
            profiles_yml_filepath='/opt/airflow/profiles.yml',
        ),
        operator_args={'select': 'tag:daily'},
    )
Enter fullscreen mode Exit fullscreen mode
# GitHub Actions cron — minimal infra for small teams
name: dbt nightly

on:
  schedule:
    - cron: '0 6 * * *'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-snowflake
      - run: dbt deps
      - run: dbt build --target prod --select tag:daily
Enter fullscreen mode Exit fullscreen mode
  • dbt Cloud — the easiest path; pay for managed orchestration, Slack alerts, hosted docs, the IDE, and the Semantic Layer. Pricing scales per developer seat.
  • Airflow + cosmos — the standard for teams with existing Airflow infrastructure; lets you mix dbt with non-dbt tasks (Spark jobs, ML training, custom Python).
  • GitHub Actions cron — the cheapest option for small teams; works fine until you need cross-job dependencies or proper SLA monitoring.

dbt Cloud vs Core — the interview-canonical comparison

Every dbt interview has at least one "when would you pick Cloud vs Core?" question. The honest answer:

Dimension dbt Core dbt Cloud
License Apache 2.0, free Subscription per seat
CLI Yes — dbt build, dbt run, etc. Yes — under the hood it's Core
IDE No — bring your own (VS Code + dbt Power User is standard) Yes — web IDE with autocomplete + lineage
Scheduler No — bring your own (Airflow, GitHub Actions, cron) Yes — managed cron with retries + alerts
CI No — wire it up in GitHub Actions Yes — managed Slim CI on every PR
Hosted docs No — self-host the static site Yes — managed docs with auth
Semantic Layer No (Core 1.7+ has the spec; Cloud serves it) Yes — metric API for BI tools
Best for Engineering-heavy teams with Airflow already Analyst-heavy teams; smaller orgs without DevOps
  • Core is the engine; Cloud is the convenience layer. Every dbt project compiles via Core; Cloud wraps it in orchestration + UI.
  • Senior teams often mix — develop locally on Core, run CI via GitHub Actions + Core, but use Cloud for the scheduler + Semantic Layer.
  • The Semantic Layer is the Cloud lock-in — if your BI tool queries the SL, you're paying for Cloud.

Observability — Elementary, freshness alerts, on-call runbooks

Once you have nightly schedules, something will fail at 03:00 — and you need to know before stakeholders open dashboards at 09:00.

  • dbt source freshness — runs MAX(_fivetran_synced) against every source and warns / errors when it falls behind the threshold you set in YAML.
  • Elementary alerts — Slack channel that posts [ERROR] fct_orders: 12 rows failed unique_order_id with a link to the failing-rows table.
  • dbt docs serve — hosted lineage; let on-call see the upstream chain when a model fails downstream.
  • Run-result archive — store target/run_results.json in S3 after every run; the cheapest observability backbone you can have.

Worked example — full Slim-CI + nightly schedule + alerting in 30 lines of YAML

Detailed explanation. Stitch every piece together: a PR workflow that builds only changed models, a nightly workflow that runs dbt build and pushes the manifest, and an Elementary alert hook that posts to Slack when something fails.

Question. Wire up the full production pipeline for a small dbt project on GitHub Actions + Snowflake + Elementary.

Input. GitHub repo with the project, Snowflake CI / prod credentials in Actions secrets, an Elementary CLI installed in the prod environment, a Slack webhook URL.

Code.

# .github/workflows/dbt-pr.yml — Slim CI on every PR
name: dbt PR
on: { pull_request: { paths: ['models/**','tests/**','macros/**'] } }
jobs:
  ci:
    runs-on: ubuntu-latest
    env: { DBT_CI_USER: ${{ secrets.DBT_CI_USER }}, DBT_CI_PASSWORD: ${{ secrets.DBT_CI_PASSWORD }} }
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-snowflake==1.8.* elementary-data
      - run: dbt deps
      - run: aws s3 cp s3://my-bucket/dbt/manifest.json ./prod/manifest.json
      - run: dbt build --select state:modified+ --defer --state ./prod --target ci --fail-fast
Enter fullscreen mode Exit fullscreen mode
# .github/workflows/dbt-nightly.yml — production build + observability
name: dbt nightly
on: { schedule: [{ cron: '0 6 * * *' }] }
jobs:
  build:
    runs-on: ubuntu-latest
    env:
      DBT_PROD_USER:     ${{ secrets.DBT_PROD_USER }}
      DBT_PROD_PASSWORD: ${{ secrets.DBT_PROD_PASSWORD }}
      SLACK_WEBHOOK:     ${{ secrets.SLACK_WEBHOOK }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-snowflake==1.8.* elementary-data
      - run: dbt deps
      - run: dbt source freshness --target prod
      - run: dbt build --target prod --fail-fast
      - run: edr monitor --slack-webhook "$SLACK_WEBHOOK"
      - run: aws s3 cp target/manifest.json s3://my-bucket/dbt/manifest.json
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. PR workflow runs Slim CIstate:modified+ + --defer keeps the build fast and cheap.
  2. Nightly workflow runs dbt source freshness first — fails loudly if upstream ingest is stale.
  3. Nightly workflow runs dbt build --target prod — every model + every test in dependency order.
  4. edr monitor is the Elementary CLI; it reads target/run_results.json and posts a Slack message with failing tests, slow models, and anomalies.
  5. Manifest upload is the last step — it makes tomorrow's PR Slim CI work against today's state.

Output (Slack message after a failing run).

[dbt nightly · failed]
Project: analytics  ·  Target: prod  ·  Duration: 14m 22s

  ✗ fct_orders                       FAIL (3 rows violated unique_order_id)
  ✗ assert_no_negative_revenue       FAIL (1 row returned: region=EU, total=-120.00)
  ⚠ dbt_expectations_mean_amount_usd WARN (mean 8.42 below threshold 10.0)

Run results: https://my-bucket.s3.amazonaws.com/dbt/run_results/2026-05-26.json
Failing rows: https://snowflake.com/.../dbt_test_failures.fct_orders_unique
On-call: @analytics-oncall
Enter fullscreen mode Exit fullscreen mode

Why this works — concept by concept:

  • Slim CI keeps PR feedback under five minutes even on 200-model projects.
  • dbt source freshness catches upstream outages at the boundary; everything downstream fails fast.
  • dbt build --fail-fast halts on first failure so downstream nodes don't compound the blast radius.
  • Elementary edr monitor turns dbt artifacts into actionable Slack alerts without any custom code.
  • Manifest archive is the one operational detail that ties everything together — without it, Slim CI has no baseline.
  • Cost — Slim CI cuts PR build time 10-50×; freshness + observability cut MTTR for incidents from hours to minutes. The CI minutes you save pay for the Snowflake credits you spend.

SQL
Topic — etl
Production ETL drills

Practice →

SQL
Topic — data-transformation
CI/CD transformation patterns

Practice →


Choosing the right dbt primitive (cheat sheet)

A one-screen cheat sheet for using dbt for data engineering — pick the primitive that matches your task.

You want to … Primitive Notes
Define a transformation models/.../my_model.sql One SELECT; dbt wraps it as CREATE TABLE / VIEW
Point at a raw table {{ source('schema', 'table') }} Declare it in _sources.yml; gets freshness for free
Point at another dbt model {{ ref('upstream_model') }} Never hard-code; this is what powers the DAG
Refresh on demand +materialized: view Cheap to refresh, slow to query
Cache for BI +materialized: table Full rebuild per run; fast queries
Bill-of-materials huge fact +materialized: incremental + unique_key MERGE after first run; cheapest at scale
Reusable mid-DAG logic +materialized: ephemeral Inlined as CTE; no storage cost
Enforce a column invariant tests: [unique, not_null, accepted_values] Generic schema tests; one YAML line each
FK-style relationship tests: [relationships: { to: ref('dim'), field: id }] Catches orphans
Bespoke multi-table rule tests/assert_*.sql Singular test; zero rows = pass
Versioned column types config: { contract: { enforced: true } } Schema drift fails the build
Reuse SQL logic macros/my_macro.sql + {{ my_macro(args) }} Jinja template inlined per call
Hash a composite key {{ dbt_utils.generate_surrogate_key(['a','b']) }} The canonical surrogate-key macro
Pivot dynamically {{ dbt_utils.pivot('status', vals) }} or {% for %} One line replaces N SUM(CASE) columns
Range / regex test dbt_expectations.expect_column_values_to_* 60+ generic tests over dbt_utils baseline
Migration regression test audit_helper.compare_queries(...) Diff old vs new relation; output as table
Production observability elementary package + edr monitor CLI Slack alerts on anomalies + freshness
Fast PR feedback dbt build --select state:modified+ --defer --state ./prod Slim CI; 10× faster than full build
Pull from prod for --defer Cache target/manifest.json to S3 each run The one operational detail that makes Slim CI work
Scheduled run dbt Cloud job, Airflow cosmos, or GitHub Actions cron Pick by team size + existing infra
Catch stale source dbt source freshness + loaded_at_field First line of defense against silent breakage

Frequently asked questions

What is dbt and why has it won the transformation layer of the modern data stack?

dbt (data build tool) is a SQL-first transformation framework that compiles .sql files into native warehouse DDL — CREATE TABLE, CREATE VIEW, MERGE — and runs them in dependency order against Snowflake, BigQuery, Databricks, Redshift, or Postgres. It won the transformation layer for four reasons. Warehouse-first compute: dbt pushes every transformation back into the warehouse, eliminating the round-trip cost of moving data out into a separate engine. Git-first workflow: every model is a text file, so PRs, code review, and revert-on-disaster are native. Tests as first-class citizens: unique, not_null, accepted_values, relationships ship out of the box, so bad data fails the build before it lands in BI. ref() and the DAG: dbt computes upstream / downstream dependencies automatically; you never write a runbook. Add a Jinja templating layer, an adapter ecosystem covering every major warehouse, and a thriving package ecosystem (dbt_utils, dbt_expectations, elementary), and you have the de-facto standard transformation layer for the modern data stack in 2026.

What is the difference between ref() and source() in dbt?

{{ ref('upstream_model') }} points at another dbt model in the same project — a .sql file under models/. dbt uses every ref() call to compute the DAG and run nodes in the correct dependency order. {{ source('source_name', 'table_name') }} points at a raw table you don't own — a Fivetran-loaded raw schema, a Postgres replica, a Kafka sink. Sources are declared in a _sources.yml file with their database, schema, and optional freshness thresholds. The rule: every model's inputs are either ref() (project-internal) or source() (project-external); never hard-code a database.schema.table literal, because that breaks Slim CI, --defer, and cross-environment portability. The two together give dbt the complete dependency graph it needs to schedule runs, validate ordering, and run dbt source freshness against your raw ingest layer.

When should I use view, table, incremental, or ephemeral materialization in dbt?

viewCREATE OR REPLACE VIEW; no data stored; cheap to refresh, slow to query. The default for staging models that are 1:1 with sources and rarely queried directly. tableCREATE OR REPLACE TABLE AS SELECT; full rebuild every run; fast to query. The default for marts that BI tools and stakeholders hit constantly; the rebuild cost is fine for tables up to millions of rows. incremental — first run = table; subsequent runs = MERGE (Snowflake / BigQuery / Databricks) or delete+insert (Postgres / Redshift). Use for billion-row fact tables you can't fully rebuild every run; pair with unique_key and an is_incremental() predicate that scopes new rows by timestamp. ephemeral — inlined as a CTE in the downstream model; never materialised in the warehouse. Use for small intermediate models that are joined once and never queried directly. The senior pattern: set folder-level defaults in dbt_project.yml (staging → view, intermediate → ephemeral, marts → table) and override per-model only when the data shape demands it.

What are the three families of dbt tests and when should I use each?

Generic schema tests — declared in YAML, one line per column. The four built-ins (unique, not_null, accepted_values, relationships) plus the 60+ from dbt_utils and dbt_expectations cover most column-level invariants. Use them aggressively; every column that matters should have at least one. Singular tests — bespoke SELECT files under tests/ that return failing rows. Use when the invariant spans multiple tables or expresses a business rule that doesn't fit a per-column shape — e.g. "no region has negative revenue", "every order has a matching customer". The contract is uniform: zero rows = pass, any rows = fail. Model contracts — added in dbt 1.5; declared in YAML under config.contract.enforced: true. They enforce the SELECT's column list, data types, and constraints at compile time, before any SQL runs against the warehouse. Use them on every mart that's a public API to other teams or BI tools; schema drift becomes a build failure instead of a 09:00 dashboard fire. The senior approach is all three layered together — generic for column invariants, singular for cross-model rules, contracts for the public-API surface.

What is dbt Slim CI and why does every senior dbt team use it?

dbt Slim CI is the workflow that only rebuilds the dbt models that changed in a pull request, plus everything downstream of them, while resolving unchanged upstream ref() calls against the production relations. The two flags that make it work: --select state:modified+ (modified models plus everything downstream) and --defer --state ./prod_manifest (resolve unselected refs to the cached production manifest's relations). Without Slim CI, every PR rebuilds the entire DAG — for a 200-model project that's hours of warehouse credits per PR. With it, PRs build only the changed subgraph in minutes, give developers fast feedback, and cost a fraction of full rebuilds. The one operational detail that makes Slim CI possible: archive target/manifest.json to S3 (or any blob store) after every successful production run; download it as the --state baseline in CI. Senior teams pair Slim CI with per-PR schemas (schema: dbt_ci_pr_{{ env_var('PR_NUMBER') }}) so each PR's artifacts are isolated and dropped on merge.

Should I use dbt Cloud or dbt Core, and how do senior teams decide?

dbt Core is the open-source CLI — dbt build, dbt run, dbt test. It runs anywhere: your laptop, GitHub Actions, Airflow, Kubernetes. You own orchestration, CI, hosted docs, and the IDE. dbt Cloud is the hosted layer — a web IDE with autocomplete and lineage, a managed scheduler with retries and alerts, managed Slim CI on every PR, hosted docs with auth, and the dbt Semantic Layer that BI tools can query. The honest decision tree: small / analyst-heavy teams without DevOps capacity should default to dbt Cloud — the time saved on orchestration and CI infrastructure pays for the per-seat license. Engineering-heavy teams with existing Airflow infrastructure often run dbt Core via cosmos (Airflow-dbt integration) and skip Cloud entirely; their scheduler, CI, and observability already exist. Mid-size teams mix the two — develop on Core locally, run CI via GitHub Actions + Core, but use Cloud for the scheduler and Semantic Layer. The interview-canonical framing: "Core is the engine; Cloud is the convenience layer; the right choice depends on whether your team already owns orchestration."


Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including SQL practice keyed to the same shapes dbt models live in: aggregations, conditional aggregation, CTEs, joins, dimensional modeling, ETL pipelines, and data-quality validation. Whether you're drilling dbt for data engineering end-to-end or sharpening the underlying SQL fluency that makes great dbt models, the practice library mirrors the exact patterns this guide teaches.

Kick off via Explore practice →; drill the SQL practice lane →; fan out into ETL pipeline drills →; sharpen CTE patterns →; rehearse aggregation drills →; reinforce dimensional-modeling problems →; widen coverage on the full data-modeling library →.

Top comments (0)