De' Clerke

Posted on Jun 2

# dbt in Production: 5 Gotchas That Broke My Pipelines

#dbt #dataengineering #elt #etl

I've run dbt across more than eight data pipeline projects. Stock prices, real estate listings, call center data, loan risk models, government budget reconciliation. Different databases, different schemas, different team setups. And across all of them, the same category of failure kept appearing: dbt behaves in ways that are technically correct but completely non-obvious, and the failures are usually silent.

Before we get to the gotchas, here's the context on what dbt is and why it's worth understanding deeply.

What dbt Is (and Why Everyone Is Adopting It)

dbt (data build tool) is a transformation framework that sits inside the T of your ELT stack. You load raw data into your warehouse first, then dbt transforms it there using SQL. The transformation logic lives in .sql files, version-controlled like application code, organized into a DAG that dbt builds automatically from {{ ref() }} dependencies.

The reason the data engineering community has moved toward it fast:

SQL analysts can write production transformation logic. You don't need a Python engineer to build a mart model. If you can write a SELECT, you can write a dbt model. dbt handles materialization, dependency resolution, and schema management.

Testing is built in. Four generic tests cover the most common data quality assertions: unique, not_null, accepted_values, relationships. You wire them up in YAML and dbt runs them as part of every build. The first time I had 54 tests passing green across a single pipeline, the confidence it gave for deploying changes was qualitatively different from anything I'd had before.

Lineage is automatic. Because all models reference each other via {{ ref('model_name') }}, dbt builds a DAG of your entire transformation layer. dbt docs generate produces a browsable data catalog with column-level lineage. Stakeholders can see exactly where a number comes from without asking a data engineer.

Incremental models make large tables practical. Instead of reprocessing millions of rows every run, incremental models append or merge only new data. For a stock price table with years of history, this is the difference between a 10-second run and a 10-minute one.

The stack it fits: dbt works with PostgreSQL, DuckDB, BigQuery, Snowflake, Redshift, and most other modern warehouses through adapters. The same model code, mostly unchanged, runs on all of them.

The Gotcha Article Starts Here

dbt ran cleanly. Green checkmarks across the board. My Airflow DAG showed SUCCESS. My ingestion table was gone.

That's what happened when I named a dbt model the same as a table my Python ingestion script was managing. dbt saw the table existed, assumed it owned it, dropped it, and recreated it as an empty view. No warning. No error. Just a clean run and missing data.

Here are five things that broke my dbt setups in production, and how to fix each.

1. Table Name Collision: dbt Will Drop Your Ingestion Table

When a dbt model shares a name with an existing table in the same schema, dbt treats that table as its own on the next dbt run. It drops it and recreates it based on the model SQL.

This bit me on a pipeline where my ingestion script was writing to public.stock_prices and I had a dbt model also named stock_prices. The first dbt run wiped the base table and replaced it with a view of itself — which was now empty, because the source data it was reading from had just been deleted.

-- What my Python script created:
-- Table: public.stock_prices  (3,200 rows of raw ingestion data)

-- What my dbt model did on first run:
-- DROP TABLE public.stock_prices
-- CREATE VIEW public.stock_prices AS SELECT ...
-- (now selecting from a table that no longer exists)

The fix has two parts. First, always prefix dbt staging models with stg_:

-- models/staging/stg_stock_prices.sql
{{ config(materialized='view') }}

SELECT
    id,
    UPPER(TRIM(symbol))         AS symbol,
    close_price::NUMERIC(12, 4) AS close_price,
    volume::BIGINT              AS volume,
    timestamp::TIMESTAMPTZ      AS ts
FROM {{ source('raw', 'stock_prices') }}
WHERE close_price > 0

Second, if you need to use a specific schema for dbt models and keep them away from Python-managed tables, set it in dbt_project.yml:

models:
  my_project:
    staging:
      +materialized: view
      +schema: staging     # creates schema: <target>_staging, separate from raw tables

The rule I follow now: every table my ingestion pipeline owns goes in raw schema, every dbt model goes in staging or marts. They never share a namespace.

2. profiles.yml Not Found Inside Docker

The default dbt profiles location is ~/.dbt/profiles.yml on your host machine. Inside a Docker container, that path doesn't exist unless you explicitly create it. When dbt runs inside Airflow via BashOperator, it looks for ~/.dbt/profiles.yml in the container's home directory, fails silently with a connection error, and the task fails with a message like Could not find profile named 'my_project'.

The fix is to always pass --profiles-dir explicitly and mount your profiles file into the container:

# In your Airflow DAG
from airflow.operators.bash import BashOperator

dbt_run = BashOperator(
    task_id='dbt_run',
    bash_command=(
        'cd /opt/airflow/dbt/my_project && '
        'dbt run --profiles-dir /opt/airflow/dbt --target prod'
    ),
    env={
        'DBT_POSTGRES_USER':     '{{ var("DBT_POSTGRES_USER") }}',
        'DBT_POSTGRES_PASSWORD': '{{ var("DBT_POSTGRES_PASSWORD") }}',
        'DBT_POSTGRES_DB':       '{{ var("DBT_POSTGRES_DB") }}',
    },
)

# docker-compose.yml — mount the dbt directory into the container
volumes:
  - ./dbt:/opt/airflow/dbt

The profiles.yml inside that mounted directory uses environment variables, not hardcoded credentials:

# dbt/profiles.yml (mounted at /opt/airflow/dbt/profiles.yml)
my_project:
  target: prod
  outputs:
    prod:
      type: postgres
      host: postgres
      port: 5432
      user: "{{ env_var('DBT_POSTGRES_USER') }}"
      password: "{{ env_var('DBT_POSTGRES_PASSWORD') }}"
      dbname: "{{ env_var('DBT_POSTGRES_DB') }}"
      schema: analytics
      threads: 4

One more thing: when dbt runs inside the same Docker network as PostgreSQL, the host is the service name (postgres), not localhost. Using localhost inside a container connects to the container itself, not the database service. This is a different issue from the profiles path, but it often shows up at the same time.

3. Incremental Models Break After Schema Changes

Incremental models are one of the best features in dbt for large tables. They append or merge only new rows on each run, which is fast. The problem is what happens when you change the SELECT columns in that model.

The first time you add a new column to an incremental model's SQL, dbt tries to insert rows with the new column into a table that doesn't have that column yet. The default behavior (on_schema_change='ignore') silently drops the new column from the output. The behavior when set to 'fail' errors the run. Neither is what you want in production.

-- Before: incremental model without a column
{{
  config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge'
  )
}}

SELECT event_id, user_id, occurred_at
FROM {{ source('raw', 'events') }}

{% if is_incremental() %}
  WHERE occurred_at > (SELECT MAX(occurred_at) FROM {{ this }})
{% endif %}

-- After: added session_id — this will break without on_schema_change
SELECT event_id, user_id, session_id, occurred_at
FROM {{ source('raw', 'events') }}

Set on_schema_change='sync_all_columns' so dbt automatically adds missing columns to the existing table when it detects a mismatch:

{{
  config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge',
    on_schema_change='sync_all_columns'
  )
}}

For larger schema changes — removing columns, renaming columns, changing types — sync_all_columns isn't enough. You need a full rebuild:

dbt run --full-refresh --select my_incremental_model

This drops and recreates the table from scratch. On a large table it's slow, but it's the only way to correctly handle breaking schema changes.

4. Schema Prefix Confusion: Where Your Tables Actually Land

When you set +schema: staging in dbt_project.yml, the models don't land in a schema named staging. They land in <target_schema>_staging.

If your target schema in profiles.yml is public, your staging models end up in public_staging. Your mart models end up in public_marts. This is the default behavior and it's documented, but it caused me real confusion the first time I ran dbt run and went looking for my models in the wrong place.

# dbt_project.yml
models:
  my_project:
    staging:
      +schema: staging    # Creates: public_staging (not staging)
    marts:
      +schema: marts      # Creates: public_marts (not marts)

To override this and use the schema name exactly as written, add a generate_schema_name macro to your project:

-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}

With this macro in place, +schema: staging creates a schema named staging, and +schema: marts creates marts. The default behavior is gone.

I add this macro to every project now. The dbt default made sense when you had multiple developers sharing one database, but for isolated project databases it just adds confusion.

5. Ephemeral Models Can't Be Tested Directly

Ephemeral models are useful for intermediate logic you want to reuse as a CTE without materializing a table. The problem is that because no table is created, you can't run tests on ephemeral model output directly. dbt test --select my_ephemeral_model runs nothing. --store-failures has nothing to write to.

-- models/intermediate/int_daily_volume.sql
{{ config(materialized='ephemeral') }}

SELECT
    symbol,
    DATE(timestamp) AS trade_date,
    SUM(volume)     AS total_volume
FROM {{ ref('stg_stock_prices') }}
GROUP BY symbol, DATE(timestamp)

If you try to test int_daily_volume.column_name: [not_null, unique] in your schema YAML, dbt runs the test by inlining the CTE into the test query. This works for simple cases, but --store-failures won't save the failing rows anywhere because there's no backing table. Debugging test failures on ephemeral models is harder than it needs to be.

The fix depends on what you need. For development and debugging, temporarily promote the model to a view:

{{ config(materialized='view') }}   -- change to this while debugging

Run your tests with --store-failures, inspect the failures:

dbt test --store-failures --select int_daily_volume

Then query the failures table dbt created:

SELECT *
FROM dbt_test__audit.not_null_int_daily_volume_total_volume
LIMIT 20;

Once the tests pass, revert to ephemeral. The general rule: if a model has column-level tests you need to be able to debug, make it a view. Ephemeral is for pure logic reuse, not for models you need visibility into.

The Pattern Behind All Five

Most dbt failures are silent. The run completes green, the tests pass, and something downstream is quietly wrong. The table name collision didn't error. The schema prefix didn't error. The ephemeral test silently ran nothing.

The habits that catch these before they reach production:

# Before every run, compile first — Jinja errors and ref() issues surface here
dbt compile --select my_model

# Run tests with store-failures so you can inspect what failed
dbt test --store-failures

# Check the full DAG before building something new
dbt ls --select +my_model+

# End-to-end clean build in CI — catches full-refresh regressions
dbt build --full-refresh

dbt's design is solid. The failures come from assumptions it makes that aren't obvious until you've hit them. Now you don't have to.

Follow me on dev.to for more data engineering content, or browse the project code at github.com/declerke.

DEV Community