Unit testing arrived in dbt 1.8. Finally, right? Except nobody does it.
About this article
I wrote this. The ideas are mine; the execution is collaborative.
Write tests with mock data, verify your business logic, practice TDD like any proper software engineer. That's what dbt 1.8 promised. Except nobody does it.
And I get it. I tried. You sit down with good intentions, open a new YAML file, and then reality hits. You need to figure out which models your test depends on. Query the warehouse to get realistic sample data. Format everything as YAML dictionaries with the right structure. Then do it again for every edge case you want to cover. Then maintain all of it as your models evolve.
That's not TDD. That's YAML accounting.
So the feature sits there, unused. Which creates two problems that feed into each other.
First, TDD stays theoretical. The boilerplate overhead kills any chance of test-first development. You write the model, then maybe add tests later if you have time. You don't.
Second, and this is the one that hit me harder, your AI collaboration suffers. Without tests, there's no verification mechanism. You ask Copilot to implement something, it generates code, and now you're stuck manually checking if the logic is right. Query the warehouse, eyeball the results, spot the bug, explain it, wait for the fix, check again. Your flow state is gone. You're babysitting syntax instead of thinking about the problem.
But here's what I discovered building dbt-core-mcp: when AI can handle the tedious parts, unit tests stop being a burden and start being a dialogue. The same tooling that lets AI scaffold YAML fixtures also lets AI iterate with test guardrails. Write test, run test, fix code, run test. That loop can happen without you leaving navigation mode.
The refactoring problem
Software developers figured this out decades ago. Martin Fowler documented it extensively: you can't maintain quality code without refactoring. And you can't refactor with confidence unless you have tests.
You write code. It works. Six months later, requirements change or you understand the domain better or the model becomes too complex. You need to restructure it. Simplify the logic. Optimize the queries. Make it maintainable again.
Without tests, refactoring is terrifying. Change the SQL, run the full pipeline, manually verify the output matches what it used to produce. Hope you didn't break something subtle. That fear keeps you from refactoring, so the code rots. Technical debt compounds.
With tests, refactoring is mechanical. Change the implementation, run the tests, they pass, you're done. The tests document what the model should do. As long as the behavior stays consistent, the implementation can evolve.
And here's what happens once you get used to it: you start finding the rhythm. Maybe you write tests after the fact at first, adding coverage to legacy models as you touch them. But eventually, you notice it's easier to write the test first. Define the edge case, write the test, then implement the logic that makes it pass. That's TDD. Test-driven development.
Analytics engineering could have had this all along. The capability was there in dbt 1.8. But the YAML accounting killed adoption before it started. Writing fixtures by hand, sampling data, formatting dictionaries, it was too much friction.
AI removes that barrier. The tedious parts get automated. Suddenly TDD isn't theoretical anymore. It's practical. And that changes everything.
When bugs happen
Software developers learned another pattern: when something breaks, you don't just fix it. You write a failing test first.
The workflow: a bug gets reported. Customers with exactly one order are showing null for first_order_date. Before touching the code, you write a test that reproduces the problem. One customer, one order, assert the date should match. Run it. It fails. Good - now you've proven you understand the bug.
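A sketch of what that failing test might look like - the staging model names and the order_date column are assumptions here, mirroring the example model later in this article:

unit_tests:
  - name: test_single_order_customer_first_order_date
    # Regression test: column names below are assumptions based on the bug description
    description: "A customer with exactly one order gets a real first_order_date, not null"
    model: customers
    given:
      - input: ref('stg_customers')
        rows:
          - {customer_id: 42, first_name: 'Sam', last_name: 'Solo'}
      - input: ref('stg_orders')
        rows:
          - {order_id: 1001, customer_id: 42, order_date: '2024-03-15'}
    expect:
      rows:
        - {customer_id: 42, first_order_date: '2024-03-15'}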
Then you fix the code. Maybe you forgot to handle the single-record case in your aggregation. Add the logic, run the test, it passes, ship it.
But here's the real value: that test stays. Forever. It's not just a bug fix anymore, it's documentation. Six months from now, when someone refactors that model, the test will catch it if they reintroduce the same bug. The test is evidence that this edge case matters, that it broke before, and a record of exactly what the correct behavior should be.
Without tests, bug fixes are "I think I fixed it, seems to work now, hope it doesn't come back." With tests, bug fixes are "Here's the test that proves it was broken, here's the test passing that proves it's fixed, and here's the permanent guard against regression."
And yes, AI can help here too. "Write a test that reproduces the bug where customers with one order get null dates." Copilot scaffolds the failing test, you verify it actually fails for the right reason, then ask AI to fix it. Or fix it yourself. Either way, the test documents the fix.
What this actually looks like
Without tests: I ask Copilot to add customer order counts. It generates code. Now what? I need to verify it works. Maybe I run the model and query the output in Databricks. Maybe I just spot-check a few rows. Maybe I trust it and ship it. Whatever I do, it's manual and ad-hoc. No systematic verification. And when I notice it returns null for customers with no orders, we're back to the same cycle: point it out, wait for fix, check again. Context switching. Flow broken. Hope it's right this time.
And let's be honest: "without tests" is today's default.
With tests: I say "Add customer order count, zero for customers with no orders, not null." Copilot inspects the model dependencies through dbt-core-mcp, samples some data, writes a unit test for the edge case, implements the model, runs the test, watches it fail, fixes the coalesce, runs it again, watches it pass, and tells me "Test passing, ready for review."
I review the test assertion and the implementation together. One cycle. I never left the conversation.
The trick isn't that AI writes better code. It's that AI can now verify its own work systematically before reporting back. Tests become the feedback mechanism that keeps the loop tight.
The anatomy of a dbt unit test
A dbt unit test looks like this:
unit_tests:
  - name: test_customer_with_no_orders
    description: "Verify customer with no orders gets 0 count, not null"
    model: customers
    given:
      - input: ref('stg_customers')
        rows:
          - {customer_id: 99, first_name: 'New', last_name: 'Customer'}
      - input: ref('stg_orders')
        rows: []
    expect:
      rows:
        - {customer_id: 99, number_of_orders: 0}
You tell it what model you're testing, what input data to use (the given section), and what output to expect. The test runs against mock data, not your warehouse. Fast. Isolated. Repeatable.
The pain is in given. Every input needs realistic fixtures. Every column that matters needs a value. Every edge case needs its own setup. That's where the YAML accounting happens, and that's exactly what AI is good at generating.
How dbt-core-mcp makes this work
The MCP server gives AI the tools it needs to scaffold tests intelligently. You don't see these tool calls in the chat - this is what happens under the surface while Copilot is working. When I ask Copilot to write a test, it can:
Choose to inspect the model to understand dependencies:
get_resource_info('customers')
→ Shows: depends on ref('stg_customers'), ref('stg_orders')
Choose to query sample data to get realistic fixtures:
query_database("SELECT * FROM stg_customers LIMIT 3")
→ Returns actual column structure and example values
Choose to run specific tests for fast iteration:
run_tests(select="test_name:test_customer_with_no_orders")
→ Immediate feedback on just that test
It chooses which tools to use based on what it needs. Building a test from scratch? Inspect the model and query sample data. Adding a test to an existing file? It'll likely read your existing tests first and follow the same style and patterns your team already uses. The fixture format you prefer, the naming conventions, the level of detail - AI adapts.
AI uses these to build tests that actually make sense for your data. Not generic placeholder values, but fixtures that reflect your schema. And it can iterate on them without waiting for full pipeline runs.
Where to put the tests
dbt's official recommendation is to keep unit tests alongside your models. Same directory, same context. It's a reasonable approach - everything related to a model lives together.
I prefer something different:
dbt_project/
├── models/
│ └── marts/
│ └── customers.sql
│
└── unit_tests/
└── marts/
└── customers_unit_tests.yml
Separate directory that mirrors the model structure. Clean separation between code and tests. Easy to find. Easy to exclude from production builds if needed. Easy to navigate.
This is how the rest of the software development world does it. Python projects have src/ and tests/. Java has src/main and src/test. C# has separate test projects. Separating tests from implementation code is established practice everywhere outside data engineering.
If you go this route, you'll need to tell dbt where to find your tests. Add this to your dbt_project.yml:
model-paths: ["models", "unit_tests"]
Is this controversial in the dbt community? Maybe. But let's be honest - how controversial can it be when nobody's writing unit tests anyway?
Choose what works for your team. The structure matters less than actually having the tests.
The patterns that matter
Once you start writing tests (or having AI write them), some patterns emerge that make the difference between maintainable tests and a YAML nightmare.
Keep fixtures minimal
It's tempting to dump all the columns into your test fixtures. Don't. Only include what the test actually needs.
# This is noise
- {customer_id: 1, first_name: 'Alice', last_name: 'Smith',
   email: 'alice@example.com', phone: '555-1234',
   address: '123 Main St', city: 'Portland', ...}

# This is a test
- {customer_id: 1, first_name: 'Alice', last_name: 'Smith'}
Minimal fixtures are faster to read, make it obvious what's being tested, and don't break when you add columns to your staging models. AI tends to over-include columns when it first scaffolds, so this is worth reviewing.
One behavior per test
Each test should prove one thing. If your test name needs "and" in it, split it.
unit_tests:
  - name: test_customer_with_no_orders
    # Proves: null handling works
  - name: test_customer_with_single_order
    # Proves: min = max when one record
  - name: test_customer_order_aggregation
    # Proves: count, min, max all work together
Three small tests beat one comprehensive test. They run faster, fail clearer, and document behavior better.
If multiple tests need the same base data, use YAML anchors to share fixtures (covered in the YAML anchors section below).
Happy path, then edge cases
When writing tests, start with the normal case. How should the model work when everything is straightforward? Customer has orders, all fields present, typical data.
- name: test_customer_order_aggregation
  # The normal case: customer with multiple orders
That establishes the baseline. Then add the edge cases - the scenarios where things can break:
- name: test_customer_with_no_orders
  # Edge case: empty join result
- name: test_customer_with_single_order
  # Edge case: min = max
This is the natural TDD rhythm. Happy path first proves the core logic works. Edge cases prove it handles the boundaries correctly. Both matter, but the happy path gives you the foundation.
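Fleshed out with given and expect, that baseline test might look something like this (a sketch - the fixture values are illustrative):

unit_tests:
  - name: test_customer_order_aggregation
    # Happy path: proves the core aggregation before any edge cases
    description: "A customer with several orders gets the right count"
    model: customers
    given:
      - input: ref('stg_customers')
        rows:
          - {customer_id: 1, first_name: 'Alice', last_name: 'Smith'}
      - input: ref('stg_orders')
        rows:
          - {order_id: 10, customer_id: 1}
          - {order_id: 11, customer_id: 1}
          - {order_id: 12, customer_id: 1}
    expect:
      rows:
        - {customer_id: 1, number_of_orders: 3}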
The essential edge cases for most models:
- Empty inputs (what if there are no orders?)
- Single item (what if exactly one record?)
- Null handling (what if optional fields are missing?)
- Boundary conditions (first order = last order?)
Use dict format, not CSV
dbt supports CSV format for fixtures. Ignore it.
# This will hurt you later
format: csv
rows: |
  customer_id,first_name,last_name
  1,Alice,Smith
  2,Bob,Jones

# This is what you want
rows:
  - {customer_id: 1, first_name: 'Alice', last_name: 'Smith'}
  - {customer_id: 2, first_name: 'Bob', last_name: 'Jones'}
Dict format doesn't care about column order. It survives adding columns. It produces readable git diffs. It's the default for a reason.
YAML anchors for shared fixtures
Once you have several tests for a model, you'll notice duplicate fixtures. Three tests that all need the same base customer data. YAML anchors can help, but use them sparingly.
# Define once at the top
_base_customers: &customer_input
  input: ref('stg_customers')
  rows:
    - {customer_id: 1, first_name: 'Alice', last_name: 'Smith'}
    - {customer_id: 2, first_name: 'Bob', last_name: 'Jones'}

unit_tests:
  - name: test_basic_aggregation
    model: customers
    given:
      - *customer_input  # Reuse the whole thing
      - input: ref('stg_orders')
        rows: [...]
Rule of thumb: only use anchors when three or more tests share identical fixtures. Before that, the duplication is clearer than the abstraction.
Here's the thing about DRY (Don't Repeat Yourself) in tests: it's not always best practice. Shared fixtures can create coupling in your test code. Change one fixture to handle a new test case, suddenly three other tests break. You're refactoring tests to fix tests, not production code.
Sometimes duplication in tests is better. Each test is self-contained. You can read it without jumping between anchor definitions. You can change it without worrying about breaking other tests. More code can mean less coupling. Use anchors when the duplication is truly painful, not just because DRY says so.
A complete cycle
I'm building a customers model that aggregates order data. I want order counts, first order date, most recent order date. And I want customers with no orders to show zero, not null.
I tell Copilot: "Create a customers model that counts orders per customer. Zero for customers with no orders, not null."
Copilot gets to work. It inspects the likely dependencies (stg_customers, stg_orders), queries sample data to understand the schema, and scaffolds the first test:
unit_tests:
  - name: test_customer_with_no_orders
    description: "Verify customer with no orders gets 0 count, not null"
    model: customers
    given:
      - input: ref('stg_customers')
        rows:
          - {customer_id: 99, first_name: 'New', last_name: 'Customer'}
      - input: ref('stg_orders')
        rows: []
    expect:
      rows:
        - {customer_id: 99, first_name: 'New', last_name: 'Customer', number_of_orders: 0}
Then implements the model:
with customers as (
    select * from {{ ref('stg_customers') }}
),

orders as (
    select * from {{ ref('stg_orders') }}
),

customer_orders as (
    select
        customer_id,
        count(order_id) as number_of_orders
    from orders
    group by customer_id
),

final as (
    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders
    from customers
    left join customer_orders
        on customers.customer_id = customer_orders.customer_id
)

select * from final
Runs the test. Passes. Reports back.
I look at the test assertion. Does it match what I asked for? Zero instead of null? Yes. I look at the implementation. Left join, coalesce, makes sense. I approve.
Now I can say "Add first and most recent order dates" and Copilot will add another test, extend the model, verify it passes, and report back. Same cycle, building on verified work.
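That follow-up test might look roughly like this - a sketch, with first_order_date and most_recent_order_date as assumed column names for the new fields:

unit_tests:
  - name: test_customer_order_dates
    # Column names for the dates are assumptions based on the request above
    description: "First and most recent order dates for a customer with multiple orders"
    model: customers
    given:
      - input: ref('stg_customers')
        rows:
          - {customer_id: 1, first_name: 'Alice', last_name: 'Smith'}
      - input: ref('stg_orders')
        rows:
          - {order_id: 10, customer_id: 1, order_date: '2024-01-05'}
          - {order_id: 11, customer_id: 1, order_date: '2024-06-20'}
    expect:
      rows:
        - {customer_id: 1, number_of_orders: 2,
           first_order_date: '2024-01-05', most_recent_order_date: '2024-06-20'}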
I never had to open the warehouse. Never had to manually check output. Never had to write YAML fixture syntax. I stayed in the conversation, thinking about what the model should do, not how to verify it.
Test what needs to be tested
Software developers learned this decades ago: you don't unit test getters and setters. You test the logic that can break. The transformations. The edge cases. The parts where bugs hide.
The same applies to dbt models.
Test the happy path first - the normal case where everything works as expected. Customer has orders, all fields present, typical data. This is your baseline. It proves the core logic works. Always test this. And here's the kicker: when things break during refactoring or changes, it's usually the happy path test that catches it first, not the edge cases.
Test aggregations - count, min/max, group by. This is where null handling breaks. Where empty groups return unexpected results. Where your left join suddenly drops customers because you forgot the coalesce. These transformations have logic, and logic needs verification.
Test business logic - calculations, case statements, conditional logic. If you're implementing "customer lifetime value" or "revenue recognition rules" or any domain logic that came from a business requirement, test it. These are the models that change when requirements change. Tests document what the business actually wanted.
Test edge cases - nulls, empty sets, boundary conditions. The customer with no orders. The single transaction that's both first and last. The optional field that's missing. Production data will hit all of these eventually. Better to define the behavior now than debug it later.
Test critical models - finance, customer-facing, regulatory. If it goes in a report that executives read or customers see or auditors review, test it. The cost of being wrong is too high.
Test what you'll refactor - anything you know you'll change later. Tests are your safety net. You can restructure the SQL, optimize the joins, rework the CTEs, and know immediately if you broke the behavior.
Don't test pass-throughs - simple select-star models, basic renaming, casting. These are mechanical transformations with no logic. If they break, the downstream tests will catch it.
My rule: if the model has group by, case when, or coalesce, it probably deserves a test. That's where the logic lives.
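For example, suppose the model also bucketed customers with a case when - a hypothetical customer_segment column that marks anyone with zero orders as 'new' and everyone else as 'returning'. That's exactly the kind of logic worth pinning down:

unit_tests:
  - name: test_customer_segment_boundaries
    # Hypothetical case-when logic: zero orders means 'new', anything else 'returning'
    description: "Segment boundaries for the assumed customer_segment column"
    model: customers
    given:
      - input: ref('stg_customers')
        rows:
          - {customer_id: 1, first_name: 'Alice', last_name: 'Smith'}
          - {customer_id: 2, first_name: 'Bob', last_name: 'Jones'}
      - input: ref('stg_orders')
        rows:
          - {order_id: 10, customer_id: 2}
    expect:
      rows:
        - {customer_id: 1, customer_segment: 'new'}
        - {customer_id: 2, customer_segment: 'returning'}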
The meta shift
Here's what surprised me most: the same patterns that make AI better at writing tests also make me better at reviewing them. When Copilot scaffolds a test, I'm not fighting YAML syntax. I'm not sampling data. I'm looking at the assertion and asking: does this capture what I meant? Is this the edge case that matters? Is the expected output correct?
That's a different kind of work. Navigation and judgment instead of execution. I'm thinking about behavior, not formatting. And because the tests exist, I can refactor safely. Change the implementation, run the tests, know immediately if I broke something. That confidence compounds. You build faster when you're not afraid of breaking things.
Getting started
You need dbt 1.8 or higher for unit testing support. For dbt-core-mcp, you'll need dbt 1.9 or higher. Setup instructions are at github.com/NiclasOlofsson/dbt-core-mcp.
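Once a test exists, you can also run it yourself from the command line; dbt's node selection has a test_type method for exactly this. A sketch, assuming the model is named customers:

# Run every unit test in the project
dbt test --select "test_type:unit"

# Run just the unit tests attached to the customers model
dbt test --select "customers,test_type:unit"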
Start small. Pick your most complex model, the one with the group by and the edge cases you're always nervous about. Ask AI to write one unit test for one edge case. Review it. Run it. See how it feels. Then notice how your next conversation changes. You're not debugging syntax anymore. You're having a dialogue about what the model should do. AI verifies its own work. You stay in flow.
That's the shift. Unit testing was always possible in dbt. AI makes it practical. And practical changes everything.
Read more: Copy-paste is not a workflow: building dbt-core-mcp
Get dbt-core-mcp: github.com/NiclasOlofsson/dbt-core-mcp