DEV Community: Baris Terzioglu

Why LLM agents break when you give them tools (and what to do about it)

Baris Terzioglu — Sun, 15 Mar 2026 21:03:54 +0000

Your agent demo works perfectly. The model picks the right function, passes clean arguments, gets a response, and synthesizes a nice answer. Then you deploy it with 50 real API endpoints and everything falls apart.

This is the gap that nobody warns you about in tool-use tutorials. The research on LLM tool use is actually quite mature at this point, with clear findings about what works and what doesn't. But most of those findings haven't made it into the "how to build an AI agent" blog posts that dominate search results.

I spent the last few weeks going through the academic literature on tool use in LLM agents. Here's what I found, what it means if you're building agents today, and the failure modes that will bite you in production.

The two schools of tool use

There are fundamentally two approaches to giving LLMs access to tools, and understanding the difference matters.

The first is prompting-based tool use. You describe your tools in the system prompt or via a function-calling API, and the model decides at inference time which tools to use. This is what OpenAI's function calling, Anthropic's tool use, and most agent frameworks do. The model was never specifically trained on your tools. It's generalizing from its understanding of APIs and function signatures.

The second is training-based tool use. You fine-tune the model to use specific tools. Toolformer (Schick et al., 2023) is the seminal paper here. They trained a model to decide which APIs to call, when to call them, what arguments to pass, and how to incorporate results into its predictions. The clever bit: they did it in a self-supervised way, needing only a handful of demonstrations per API. The model learned to insert API calls into its own text generation when doing so improved next-token prediction.

The Toolformer approach got 3100+ citations for good reason. It showed that a 6.7B parameter model could match or beat much larger models on tasks involving calculation, search, and translation, simply by learning when to reach for a tool instead of hallucinating an answer.

But here's what's interesting: the industry largely went with the prompting-based approach anyway. Why? Because fine-tuning per-tool doesn't scale when you need to support arbitrary APIs. And modern function-calling implementations are good enough for most use cases.

ReAct and why interleaving reasoning with action matters

The ReAct paper (Yao et al., 2022) is probably the most influential work on agent tool use, with over 6300 citations. The core insight is deceptively simple: let the model think out loud between tool calls.

Before ReAct, you had two separate paradigms. Chain-of-thought prompting let models reason step-by-step but couldn't interact with the outside world. Action-generation approaches could call tools but were essentially operating blind between calls, with no visible reasoning about what to do next.

ReAct interleaves the two. The model generates a thought ("I need to search for X because..."), then an action (calling the search tool), then an observation (processing the result), then another thought ("This tells me Y, so next I should..."), and so on.

On HotpotQA, this approach beat chain-of-thought prompting by reducing hallucinations. The model could actually check facts instead of just reasoning about them. On interactive benchmarks like ALFWorld and WebShop, it outperformed imitation and reinforcement learning baselines by an absolute 34% and 10% in success rate, respectively, using only one or two in-context examples.

The practical lesson: if you're building an agent that uses tools, don't just have it call functions. Make it explain its reasoning between calls. The interpretability is nice for debugging, but the real payoff is better tool selection and argument quality.

Here's what a ReAct-style loop looks like in practice:

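messages = [{"role": "system", "content": SYSTEM_PROMPT}]
done = False

while not done:
    # llm.chat is a stand-in for your provider's chat call
    response = llm.chat(messages, tools=available_tools)
    # Keep the model's own turn (its reasoning and any tool call) in the history
    messages.append({"role": "assistant", "content": response.content})

    if response.has_tool_call:
        # The model chose a tool - execute it and feed the observation back
        result = execute_tool(response.tool_call)
        messages.append({"role": "tool", "content": result})
    else:
        # No tool call means the model has produced its final answer
        done = True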

The important difference from naive tool use is that the system prompt encourages the model to think before acting:

Before calling any tool, explain:
1. What you're trying to find out
2. Why this specific tool is the right choice
3. What you'll do with the result

After receiving a tool result, analyze it before deciding your next step.

This isn't magic. It's just giving the model space to plan. But in our experience building agents for data pipelines at Bruin, that planning step is the difference between an agent that picks the right tool 60% of the time and one that picks it 85% of the time.

Where tool use actually fails

The research paints a clear picture of where things go wrong. Let me walk through the biggest failure modes.

Nested and sequential calls are hard

The NESTFUL benchmark (Basu et al., 2024) tested LLMs on nested sequences of API calls, where the output of one call feeds into the input of the next. Think: "get the user's order history, find the most recent order, look up the shipping status for that order."

GPT-4o, the best-performing model they tested, achieved a full sequence match accuracy of just 28%. It could get individual calls right, but chaining them correctly was a different story. The win-rate (partial credit for getting some calls right) was 60%, which tells you the model understands the individual tools fine but struggles with the composition.

This matches what anyone building real agents has seen. An agent can call a single tool reliably. But give it a task that requires calling tool A, parsing the result, using part of that result as input to tool B, and then combining both results? The error rate compounds at each step.
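To get a feel for how quickly errors compound, here is a back-of-the-envelope sketch. It assumes every call succeeds independently with the same probability, which is a simplification (real failures are correlated), and the 90% figure is illustrative, not a number from NESTFUL.

```python
def chain_success_rate(per_call_success: float, num_calls: int) -> float:
    """Probability that every call in a chain succeeds, assuming independence."""
    return per_call_success ** num_calls


for n in (1, 3, 5, 10):
    print(f"{n:>2} calls at 90% each -> {chain_success_rate(0.9, n):.0%} end to end")
```

Under those assumptions, a tool that works nine times out of ten gets you through a five-call chain only about 59% of the time.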

Static tool retrieval breaks down

When you have more than a handful of tools, you need some way to decide which tools are relevant for a given query. Most systems use embedding-based retrieval: embed the user query, embed the tool descriptions, find the closest matches, and include only those in the prompt.

Patel et al. (2025) showed this static approach has a fundamental problem. The right tool for step 2 of a task depends on what happened in step 1. Their Dynamic Tool Dependency Retrieval (DTDR) method conditions on both the initial query and the evolving execution context, which improved function calling success rates compared to static retrievers.

In simpler terms: when an agent is three steps into a task, the tools it needs aren't necessarily the ones that were closest to the original question. Tool selection needs to be dynamic.
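Here is a toy illustration of that difference. It is not the DTDR method, just a word-overlap ranker that can optionally see the latest observation; the tool names, descriptions, and the Jaccard scoring are all made up for the example.

```python
import re


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_tools(tool_descriptions, query, context=()):
    """Rank tool names by Jaccard word overlap with the query plus recent context."""
    wanted = tokens(" ".join([query, *context]))

    def score(name):
        desc = tokens(tool_descriptions[name])
        return len(desc & wanted) / len(desc | wanted)

    return sorted(tool_descriptions, key=score, reverse=True)


# Hypothetical tools for illustration only.
TOOLS = {
    "get_order_history": "list past orders for a customer",
    "get_shipping_status": "look up shipping status by tracking number",
    "send_email": "send an email message to a customer",
}

query = "what did this customer order recently"
observation = "order 1042 has tracking number ZX9, shipping pending"

static_top = rank_tools(TOOLS, query)[0]
dynamic_top = rank_tools(TOOLS, query, context=[observation])[0]
print(static_top, "->", dynamic_top)
```

Ranking against the query alone keeps pointing at get_order_history. Once the first observation is part of the ranking input, get_shipping_status moves to the top, which is the tool step 2 actually needs.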

Poor documentation means poor tool use

OpaqueToolsBench (Hallinan et al., 2026) studied what happens when tools have incomplete or misleading documentation. This is the real world. Your internal APIs rarely have perfect docs. There are edge cases, implicit constraints, and undocumented behaviors everywhere.

Their finding: LLMs struggle with tools that lack clear best practices or documented failure modes. Their proposed solution, ToolObserver, iteratively refines tool documentation by observing execution feedback from actual tool-calling trajectories. The agent literally learns from its mistakes to build better tool descriptions.

This is a big deal. It means writing good tool descriptions isn't optional, it's load-bearing infrastructure. The descriptions need to cover failure modes and edge cases, not only the happy path.

The model doesn't have a world model

Guo et al. (2025) identified a subtle but important problem: LLMs using tools in stateful environments can't predict what will happen when they take an action. Their method, DyMo (dynamics modelling), augments LLMs with a state prediction capability alongside function calling during post-training.

This matters because in many real applications, tools change state. If your agent is managing a database, it needs to understand that running a DELETE query will change what subsequent SELECT queries return. Without that world model, agents can and do take destructive actions because they don't predict the consequences.

What actually helps in practice

Based on the research and my own experience, here are the things that make the biggest difference.

Write tool descriptions like your agent's life depends on it

Because it does. Include:

What the tool does (one sentence)
What the parameters mean (with types and constraints)
What the tool returns (with examples)
When NOT to use the tool
Common error conditions

{
  "name": "query_database",
  "description": "Executes a read-only SQL query against the analytics database. Returns up to 1000 rows. Use this for data retrieval, not for mutations. If you need to modify data, use the write_database tool instead.",
  "parameters": {
    "sql": {
      "type": "string",
      "description": "A SELECT query. Must not contain INSERT, UPDATE, DELETE, or DROP statements. The query will be rejected if it does."
    },
    "timeout_seconds": {
      "type": "integer",
      "description": "Max execution time. Default 30. Queries over large tables may need 60+. If a query times out, try adding a LIMIT clause or narrowing the WHERE condition."
    }
  },
  "errors": {
    "TIMEOUT": "Query exceeded timeout_seconds. Simplify the query or increase timeout.",
    "SYNTAX_ERROR": "SQL syntax error. Check for missing quotes or wrong table names.",
    "NO_RESULTS": "Query ran but returned 0 rows. This isn't an error, the data just doesn't exist."
  }
}

That "errors" field isn't part of any standard function-calling spec. Fold the same information into the description text instead: it helps the model recover from failures instead of retrying the same broken call.
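One way to keep the error notes in a structured field while you author tools, and still get them in front of the model, is to merge them into the description just before sending. A minimal sketch, assuming tools are plain dicts shaped like the example above; the helper name is hypothetical.

```python
import copy


def fold_errors_into_description(tool: dict) -> dict:
    """Return a copy of `tool` with its non-standard "errors" map appended to the description."""
    folded = copy.deepcopy(tool)
    errors = folded.pop("errors", {})
    if errors:
        lines = [f"- {code}: {advice}" for code, advice in errors.items()]
        folded["description"] += "\nPossible errors:\n" + "\n".join(lines)
    return folded


tool = {
    "name": "query_database",
    "description": "Executes a read-only SQL query against the analytics database.",
    "parameters": {"sql": {"type": "string", "description": "A SELECT query."}},
    "errors": {
        "TIMEOUT": "Query exceeded timeout_seconds. Simplify the query or increase timeout.",
        "NO_RESULTS": "Query ran but returned 0 rows. This isn't an error.",
    },
}

print(fold_errors_into_description(tool)["description"])
```

The original dict is left untouched, so the same definition can feed both your docs and the model.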

Keep the tool count low

The Berkeley Function Calling Leaderboard (Patil et al., 2025) has evaluated dozens of models across thousands of function-calling scenarios. One consistent finding: accuracy drops as the number of available tools increases. This isn't surprising. More tools means more options to confuse, more documentation to parse, and more chances to pick the wrong one.

If you have 50 tools, the agent doesn't need all 50 in every prompt. Use retrieval to narrow it down to the 5-10 most relevant. And if you find yourself with hundreds of tools, that's a sign you need to rethink your API design, not a problem to solve with better prompting.

Let the agent fail and recover

ToolPRM (Lin et al., 2025) introduced an inference scaling framework that scores internal steps of function calls. One of their findings is a principle they call "explore more but retain less," because structured function calling has an "unrecoverability" characteristic. Once the model starts generating a malformed function call, it's very hard to course-correct mid-generation.

The practical implication: build your agent loops to expect failures. Parse tool call results. If the call failed, give the model the error message and let it try again. This sounds obvious, but a surprising number of agent implementations either crash on tool errors or silently swallow them.

max_retries = 3
for attempt in range(max_retries):
    tool_result = execute_tool(tool_call)

    if tool_result.success:
        break

    # Feed the error back to the model
    messages.append({
        "role": "tool",
        "content": f"Error: {tool_result.error}. "
                   f"Attempt {attempt + 1}/{max_retries}. "
                   f"Please fix the arguments and try again."
    })

    if attempt + 1 == max_retries:
        break  # Out of retries: don't ask for another call we won't run

    response = llm.chat(messages, tools=available_tools)
    if not response.has_tool_call:
        break  # The model gave up on the tool and answered in text
    tool_call = response.tool_call

Think in chains, not single calls

The Chain-of-Abstraction work (Gao et al., 2024) showed that training models to plan tool call chains with abstract placeholders before executing them improved accuracy by about 6% across math and QA tasks, while being 1.4x faster than sequential tool calling.

You probably can't train your own model this way. But you can prompt for it. Before a complex task, ask the model to outline the sequence of tool calls it plans to make, with placeholders for intermediate results. Then execute them. This planning step catches many composition errors before they happen.

User: What was our revenue last quarter compared to the same quarter last year?

Agent thinking:
1. Call query_database to get last quarter's dates -> Q_CURRENT
2. Call query_database to get revenue for Q_CURRENT -> REV_CURRENT
3. Call query_database to get same quarter last year -> Q_PREVIOUS
4. Call query_database to get revenue for Q_PREVIOUS -> REV_PREVIOUS
5. Calculate and compare REV_CURRENT vs REV_PREVIOUS

Actually, I can combine steps 1-2 and 3-4 into two queries...

That "actually, I can combine" moment is exactly what you want. The model is optimizing its own plan before executing it.
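If you ask the model to emit that plan in a structured form, executing it is mostly bookkeeping. This is a sketch of the prompt-then-execute idea, not the Chain-of-Abstraction training setup; the plan format, the "$NAME" placeholder convention, and the stub tools are assumptions made for the example.

```python
def run_plan(plan, tools):
    """Execute plan steps in order, substituting "$NAME" args with earlier results."""
    results = {}
    for step in plan:
        args = {
            name: results[value[1:]] if isinstance(value, str) and value.startswith("$") else value
            for name, value in step["args"].items()
        }
        results[step["save_as"]] = tools[step["tool"]](**args)
    return results


# Stub tools standing in for real API calls (hypothetical names and data).
tools = {
    "latest_order_id": lambda customer: {"alice": 1042}[customer],
    "shipping_status": lambda order_id: {1042: "in transit"}[order_id],
}

plan = [
    {"tool": "latest_order_id", "args": {"customer": "alice"}, "save_as": "ORDER_ID"},
    {"tool": "shipping_status", "args": {"order_id": "$ORDER_ID"}, "save_as": "STATUS"},
]

results = run_plan(plan, tools)
print(results["STATUS"])
```

A placeholder that was never produced raises a KeyError before the dependent tool is called, so a broken plan fails loudly instead of passing garbage downstream.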

The gap between benchmarks and production

One thing that stood out going through this research: the benchmarks are getting more realistic, but they still don't capture the full messiness of production environments.

NESTFUL tests nested API calls but with well-defined, stable APIs. In real systems, APIs return unexpected formats, timeout randomly, and have rate limits. The Berkeley Function Calling Leaderboard tests a wide range of scenarios but in isolation. Real agents deal with conversation context that accumulates over dozens of turns and can push against context window limits.

The OpaqueToolsBench work gets closest to reality by studying poorly-documented tools, and it's no coincidence that it's one of the most recent papers in this space. The field is slowly moving toward evaluating what actually matters: tool use under messy, real-world conditions.

What I'd tell someone building their first agent

Start with three to five tools, not thirty. Get ReAct-style reasoning working with those few tools until it's reliable. Write obsessively detailed tool descriptions, including failure modes and constraints. Build retry logic from day one. And test with sequences of tool calls, not just individual ones, because that's where everything breaks.

The research is clear that tool use in LLMs is a solved problem for simple cases and a very much unsolved problem for complex ones. The gap between "call one function with clean arguments" and "orchestrate a sequence of dependent API calls with error handling" is enormous. If you're building something real, plan for that gap.

The papers I've referenced here are a good starting point if you want to go deeper. The ReAct paper for foundations, NESTFUL for understanding failure modes in composition, and the Berkeley Function Calling Leaderboard for keeping up with which models are actually good at this. The field moves fast, but these core challenges haven't changed much in three years. We're just getting more honest about how hard they are.

How Bruin turns a SELECT query into 9 different materialization strategies across 14 databases

Baris Terzioglu — Sun, 15 Mar 2026 20:56:04 +0000

You write SELECT * FROM orders WHERE dt > '2024-01-01'. But that query alone doesn't create a table, update a partition, or merge with existing records. Something has to wrap your SQL in the right DDL/DML for your specific database, strategy, and context.

In Bruin, that something is the materialization system. It takes your query, looks at your config, and generates the exact SQL needed to materialize the result. Pure string manipulation with a clear purpose.

I want to walk through how it actually works, because the architecture is simpler than you'd expect for something that handles 9 strategies across 14 databases.

The problem

Say you have this SQL asset:

/* @bruin
name: dashboard.user_metrics
type: bq.sql

materialization:
    type: table
    strategy: delete+insert
    incremental_key: updated_at
@bruin */

SELECT user_id, event_count, updated_at
FROM raw.user_events
WHERE updated_at > '2024-01-01'

Your query is a SELECT. But what you actually want to happen is:

1. Run that SELECT into a temp table
2. Find all distinct values of updated_at in the temp table
3. Delete rows from the target table where updated_at matches those values
4. Insert the temp table rows into the target table

And the exact SQL to do that differs between BigQuery, Snowflake, Postgres, DuckDB, and every other database Bruin supports. BigQuery runs transactions through its multi-statement scripting. Snowflake uses different temp table syntax. Postgres has its own quirks.

The naive approach would be to just build one giant function with a bunch of if-else branches. Check the database, check the strategy, generate SQL. It would work, but it would be a mess of 5,000+ lines of spaghetti. And every time you add a new database or strategy, you'd be touching the same fragile function.

The dispatch table

Here's the core of the materialization system. It's in pkg/pipeline/materializer.go:

type (
    MaterializerFunc        func(task *Asset, query string) (string, error)
    AssetMaterializationMap map[MaterializationType]map[MaterializationStrategy]MaterializerFunc
)

type Materializer struct {
    MaterializationMap AssetMaterializationMap
    FullRefresh        bool
}

MaterializerFunc is the signature every materializer must match: take an asset and a query string, return a new query string. That's it. The AssetMaterializationMap is a nested map: outer key is type (table or view), inner key is strategy (merge, append, delete+insert, etc.), value is the function that generates SQL.

The Render method does the dispatch:

func (m *Materializer) Render(asset *Asset, query string) (string, error) {
    mat := asset.Materialization
    if mat.Type == MaterializationTypeNone {
        return removeComments(query), nil
    }

    strategy := mat.Strategy
    if m.FullRefresh && mat.Type == MaterializationTypeTable {
        if mat.Strategy != MaterializationStrategyDDL &&
           (asset.RefreshRestricted == nil || !*asset.RefreshRestricted) {
            strategy = MaterializationStrategyCreateReplace
        }
    }

    if matFunc, ok := m.MaterializationMap[mat.Type][strategy]; ok {
        materializedQuery, err := matFunc(asset, query)
        if err != nil {
            return "", err
        }
        return removeComments(materializedQuery), nil
    }

    return "", fmt.Errorf("unsupported materialization type - strategy combination: (%s - %s)",
        mat.Type, mat.Strategy)
}

Two things worth noting:

First, the full refresh override. When you run bruin run --full-refresh, every table strategy gets replaced with create+replace, which drops and recreates the table from scratch. But there are two exceptions: DDL strategy (you can't drop/recreate a DDL-only table, that would lose data) and assets marked refresh_restricted: true (for tables you never want accidentally dropped).

Second, the comment stripping. Bruin assets embed YAML config in SQL comments (/* @bruin ... @bruin */). The materializer strips these before sending the query to the database, using a regex: commentRegex = regexp.MustCompile(`/\* *@bruin[\s\w\S]*@bruin *\*/`).
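To make both points concrete, here is a small Python sketch of the same dispatch logic. It is a loose mirror of Render, not Bruin's code: assets are plain dicts, the two builders are simplified, and only the regex is carried over as-is.

```python
import re

COMMENT_RE = re.compile(r"/\* *@bruin[\s\w\S]*@bruin *\*/")


def create_replace(asset, query):
    return f"CREATE OR REPLACE TABLE {asset['name']} AS\n{query}"


def append(asset, query):
    return f"INSERT INTO {asset['name']} {query}"


# Two-level map: materialization type -> strategy -> SQL builder.
MAT_MAP = {"table": {"create+replace": create_replace, "append": append}}


def render(asset, query, full_refresh=False):
    mat_type, strategy = asset["type"], asset["strategy"]
    if mat_type == "none":
        return COMMENT_RE.sub("", query)
    # Full refresh swaps in create+replace, except for DDL and restricted assets.
    if (full_refresh and mat_type == "table" and strategy != "ddl"
            and not asset.get("refresh_restricted")):
        strategy = "create+replace"
    try:
        builder = MAT_MAP[mat_type][strategy]
    except KeyError:
        raise ValueError(f"unsupported combination: ({mat_type} - {strategy})") from None
    return COMMENT_RE.sub("", builder(asset, query))


asset = {"name": "dashboard.user_metrics", "type": "table", "strategy": "append"}
query = "/* @bruin\nname: dashboard.user_metrics\n@bruin */\nSELECT 1 AS n"

print(render(asset, query))
print(render(asset, query, full_refresh=True))
```

The first call produces an INSERT, the second a CREATE OR REPLACE TABLE, and in both cases the @bruin block is gone from the output.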

Each database brings its own map

The core Materializer struct is database-agnostic. Each database package provides its own dispatch map. Here's DuckDB's (from pkg/duckdb/materialization.go):

var matMap = pipeline.AssetMaterializationMap{
    pipeline.MaterializationTypeView: {
        pipeline.MaterializationStrategyNone:          viewMaterializer,
        pipeline.MaterializationStrategyAppend:        errorMaterializer,
        pipeline.MaterializationStrategyCreateReplace: errorMaterializer,
        pipeline.MaterializationStrategyDeleteInsert:  errorMaterializer,
    },
    pipeline.MaterializationTypeTable: {
        pipeline.MaterializationStrategyNone:           buildCreateReplaceQuery,
        pipeline.MaterializationStrategyAppend:         buildAppendQuery,
        pipeline.MaterializationStrategyCreateReplace:  buildCreateReplaceQuery,
        pipeline.MaterializationStrategyDeleteInsert:   buildIncrementalQuery,
        pipeline.MaterializationStrategyTruncateInsert: ansisql.BuildTruncateInsertQuery,
        pipeline.MaterializationStrategyMerge:          buildMergeQuery,
        pipeline.MaterializationStrategyTimeInterval:   buildTimeIntervalQuery,
        pipeline.MaterializationStrategyDDL:            buildDDLQuery,
        pipeline.MaterializationStrategySCD2ByTime:     buildSCD2ByTimeQuery,
        pipeline.MaterializationStrategySCD2ByColumn:   buildSCD2ByColumnQuery,
    },
}

func NewMaterializer(fullRefresh bool) *pipeline.Materializer {
    return &pipeline.Materializer{
        MaterializationMap: matMap,
        FullRefresh:        fullRefresh,
    }
}

Look at the view section. You can't append to a view, or delete+insert into one. Those combinations map to errorMaterializer, which just returns an error saying "not supported." The dispatch map itself encodes which combinations are valid.
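Because invalid combinations are explicit entries rather than missing keys, the map can be queried for what is supported. A rough Python sketch of that idea; the builders, the view SQL, and the helper name are placeholders, not Bruin's implementation.

```python
def error_materializer(asset, query):
    raise ValueError("materialization strategy is not supported for this asset type")


def view_materializer(asset, query):
    # Assumed view SQL, for illustration only.
    return f"CREATE OR REPLACE VIEW {asset['name']} AS\n{query}"


def build_append_query(asset, query):
    return f"INSERT INTO {asset['name']} {query}"


MAT_MAP = {
    "view": {"none": view_materializer, "append": error_materializer},
    "table": {"append": build_append_query},
}


def supported_combinations(mat_map):
    """List the (type, strategy) pairs that map to a real builder."""
    return sorted(
        (mat_type, strategy)
        for mat_type, strategies in mat_map.items()
        for strategy, builder in strategies.items()
        if builder is not error_materializer
    )


print(supported_combinations(MAT_MAP))
```

Asking for view + append still dispatches normally; it just lands on the function whose only job is to explain why that doesn't work.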

Also notice ansisql.BuildTruncateInsertQuery for truncate+insert. When the SQL is standard enough, databases share an implementation from the pkg/ansisql/ package. That function is just:

func BuildTruncateInsertQuery(task *pipeline.Asset, query string) (string, error) {
    queries := []string{
        "BEGIN TRANSACTION",
        "TRUNCATE TABLE " + task.Name,
        fmt.Sprintf("INSERT INTO %s %s", task.Name, strings.TrimSuffix(query, ";")),
        "COMMIT",
    }
    return strings.Join(queries, ";\n") + ";", nil
}

Snowflake, DuckDB, and others all reuse this. BigQuery doesn't: it has its own version without the BEGIN/COMMIT wrapper.

Every database creates its Materializer via a NewMaterializer(fullRefresh bool) factory. The calling code doesn't know or care which database-specific functions are in the map. It just calls Render().

From simple to complex: the actual SQL generation

The simple strategies are, well, simple. append is literally:

func buildAppendQuery(asset *pipeline.Asset, query string) (string, error) {
    return fmt.Sprintf("INSERT INTO %s %s", asset.Name, query), nil
}

create+replace for BigQuery adds partitioning and clustering:

func buildCreateReplaceQuery(asset *pipeline.Asset, query string) (string, error) {
    mat := asset.Materialization
    partitionClause := ""
    if mat.PartitionBy != "" {
        partitionClause = "PARTITION BY " + mat.PartitionBy
    }
    clusterByClause := ""
    if len(mat.ClusterBy) > 0 {
        clusterByClause = "CLUSTER BY " + strings.Join(mat.ClusterBy, ", ")
    }
    return fmt.Sprintf("CREATE OR REPLACE TABLE %s %s %s AS\n%s",
        asset.Name, partitionClause, clusterByClause, query), nil
}

delete+insert is where the complexity starts. BigQuery's version (from pkg/bigquery/materialization.go) has an optimization most people wouldn't think of:

func buildIncrementalQuery(asset *pipeline.Asset, query string) (string, error) {
    mat := asset.Materialization

    foundCol := asset.GetColumnWithName(mat.IncrementalKey)
    if foundCol == nil || foundCol.Type == "" || foundCol.Type == "UNKNOWN" {
        return buildIncrementalQueryWithoutTempVariable(asset, query)
    }

    randPrefix := helpers.PrefixGenerator()
    tempTableName := "__bruin_tmp_" + randPrefix
    declaredVarName := "distinct_keys_" + randPrefix

    queries := []string{
        fmt.Sprintf("DECLARE %s array<%s>", declaredVarName, foundCol.Type),
        "BEGIN TRANSACTION",
        fmt.Sprintf("CREATE TEMP TABLE %s AS %s", tempTableName, strings.TrimSuffix(query, ";")),
        fmt.Sprintf("SET %s = (SELECT array_agg(distinct %s) FROM %s)",
            declaredVarName, mat.IncrementalKey, tempTableName),
        fmt.Sprintf("DELETE FROM %s WHERE %s in unnest(%s)",
            asset.Name, mat.IncrementalKey, declaredVarName),
        fmt.Sprintf("INSERT INTO %s SELECT * FROM %s", asset.Name, tempTableName),
        "COMMIT TRANSACTION",
    }

    return strings.Join(queries, ";\n") + ";", nil
}

When the column type is known, BigQuery can use a typed DECLARE variable with array_agg and unnest to collect the distinct keys. This is faster than a subquery for large datasets because BigQuery can optimize the array operation. When the type is unknown, it falls back to a simpler approach with an inline SELECT DISTINCT subquery.

Each temp table gets a random prefix via helpers.PrefixGenerator() to avoid collisions when running concurrent pipelines. The naming convention is __bruin_tmp_<random> or __bruin_merge_tmp_<random> depending on the strategy.
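To see what actually reaches BigQuery, here is a loose Python port of the typed-key path, applied to the user_metrics asset from the start of the post. The TIMESTAMP type for updated_at and the fixed prefix are assumptions for the example; the real code reads the type from the asset's column definitions and generates the prefix randomly.

```python
def build_incremental_query(name, incremental_key, key_type, query, prefix):
    """Loose port of the BigQuery delete+insert builder (typed-key path only)."""
    temp_table = "__bruin_tmp_" + prefix
    var_name = "distinct_keys_" + prefix
    select = query.rstrip().rstrip(";")
    statements = [
        f"DECLARE {var_name} array<{key_type}>",
        "BEGIN TRANSACTION",
        f"CREATE TEMP TABLE {temp_table} AS {select}",
        f"SET {var_name} = (SELECT array_agg(distinct {incremental_key}) FROM {temp_table})",
        f"DELETE FROM {name} WHERE {incremental_key} in unnest({var_name})",
        f"INSERT INTO {name} SELECT * FROM {temp_table}",
        "COMMIT TRANSACTION",
    ]
    return ";\n".join(statements) + ";"


sql = build_incremental_query(
    name="dashboard.user_metrics",
    incremental_key="updated_at",
    key_type="TIMESTAMP",  # assumed column type
    query="SELECT user_id, event_count, updated_at FROM raw.user_events WHERE updated_at > '2024-01-01';",
    prefix="abc123",  # fixed here so the output is reproducible
)
print(sql)
```

Seven statements come out of a one-line SELECT, and the four steps from "The problem" section map onto the CREATE TEMP TABLE, SET, DELETE, and INSERT lines.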

Compare this with Snowflake's version (from pkg/snowflake/materialization.go), which is more straightforward:

func buildIncrementalQuery(task *pipeline.Asset, query string) (string, error) {
    mat := task.Materialization
    tempTableName := "__bruin_tmp_" + helpers.PrefixGenerator()

    queries := []string{
        "BEGIN TRANSACTION",
        fmt.Sprintf("CREATE TEMP TABLE %s AS %s", tempTableName, strings.TrimSuffix(query, ";")),
        fmt.Sprintf("DELETE FROM %s WHERE %s in (SELECT DISTINCT %s FROM %s)",
            task.Name, mat.IncrementalKey, mat.IncrementalKey, tempTableName),
        fmt.Sprintf("INSERT INTO %s SELECT * FROM %s", task.Name, tempTableName),
        "DROP TABLE IF EXISTS " + tempTableName,
        "COMMIT",
    }

    return strings.Join(queries, ";\n") + ";", nil
}

No DECLARE, no array_agg. Just a subquery in the DELETE. Also notice Snowflake explicitly drops the temp table before commit, while BigQuery doesn't need to (BigQuery temp tables are session-scoped and auto-cleaned).

Merge: the most complex strategy

Merge is where things get genuinely complicated, because the semantics differ significantly between databases. Here's BigQuery's merge (from pkg/bigquery/materialization.go):

func mergeMaterializer(asset *pipeline.Asset, query string) (string, error) {
    primaryKeys := asset.ColumnNamesWithPrimaryKey()
    mergeColumns := ansisql.GetColumnsWithMergeLogic(asset)
    columnNames := asset.ColumnNames()

    on := make([]string, 0, len(primaryKeys))
    for _, key := range primaryKeys {
        on = append(on, fmt.Sprintf(
            "(source.%s = target.%s OR (source.%s IS NULL and target.%s IS NULL))",
            key, key, key, key))
    }

    // ... build WHEN MATCHED and WHEN NOT MATCHED clauses

    mergeLines := []string{
        fmt.Sprintf("MERGE %s target", asset.Name),
        fmt.Sprintf("USING (%s) source", strings.TrimSuffix(query, ";")),
        fmt.Sprintf("ON (%s)", onQuery),
        whenMatchedThenQuery,
        fmt.Sprintf("WHEN NOT MATCHED THEN INSERT(%s) VALUES(%s)",
            allColumnValues, allColumnValues),
    }

    return strings.Join(mergeLines, "\n") + ";", nil
}

See that NULL handling in the ON condition? (source.id = target.id OR (source.id IS NULL and target.id IS NULL)). It's there because NULL = NULL evaluates to NULL, not TRUE, so rows with a NULL key would otherwise never match and would be inserted again on every run. Snowflake's version uses plain target.id = source.id.
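The three-valued logic behind that ON clause is easy to check locally. The sketch below uses SQLite from Python's standard library purely as a convenient stand-in (it is not BigQuery), comparing a NULL key on each side with and without the extra IS NULL branch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
plain, null_safe = conn.execute(
    """
    SELECT
      s.id = t.id,
      (s.id = t.id OR (s.id IS NULL AND t.id IS NULL))
    FROM (SELECT NULL AS id) AS s, (SELECT NULL AS id) AS t
    """
).fetchone()
conn.close()

print(plain)      # None: NULL = NULL is NULL, so a plain join condition never matches
print(null_safe)  # 1: the null-safe condition treats two NULL keys as equal
```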

DuckDB can't use MERGE at all (it didn't support it when this was written), so its merge implementation uses a different approach entirely: temp table, UPDATE with a JOIN, then INSERT with NOT EXISTS:

func buildMergeQuery(asset *pipeline.Asset, query string) (string, error) {
    // ... setup ...
    queries := []string{
        "BEGIN TRANSACTION",
        fmt.Sprintf("CREATE TEMP TABLE %s AS %s", tempTableName, query),
        fmt.Sprintf("UPDATE %s AS target SET %s FROM %s AS source WHERE %s",
            asset.Name, updateClause, tempTableName, onCondition),
        fmt.Sprintf("INSERT INTO %s (%s) SELECT %s FROM %s AS source "+
            "WHERE NOT EXISTS (SELECT 1 FROM %s AS target WHERE %s)",
            asset.Name, allColumnNames, allColumnNames,
            tempTableName, asset.Name, onCondition),
        "DROP TABLE " + tempTableName,
        "COMMIT",
    }

    return strings.Join(queries, ";\n") + ";", nil
}

Same behavior, totally different SQL. The dispatch map makes this invisible to the user.

There's also a nice feature in how merge columns work. The ansisql.GetColumnsWithMergeLogic function (from pkg/ansisql/materialization.go) filters columns:

func GetColumnsWithMergeLogic(asset *pipeline.Asset) []pipeline.Column {
    var columns []pipeline.Column
    for _, col := range asset.Columns {
        if col.PrimaryKey {
            continue
        }
        if col.MergeSQL != "" || col.UpdateOnMerge {
            columns = append(columns, col)
        }
    }
    return columns
}

Primary keys are never updated (they're the join condition). Other columns are only updated if they're marked update_on_merge: true or have a custom merge_sql expression. This means users can do things like merge_sql: GREATEST(target.Score, source.Score) to keep whichever score is higher, per-column.
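Here is how that filter might feed the UPDATE SET clause, sketched in Python. The filtering mirrors GetColumnsWithMergeLogic; how the SET clause is rendered is my assumption (custom merge_sql if present, otherwise the source value), since that part of the code isn't shown here.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    primary_key: bool = False
    update_on_merge: bool = False
    merge_sql: str = ""


def columns_with_merge_logic(columns):
    """Non-key columns that opted into updates, mirroring GetColumnsWithMergeLogic."""
    return [c for c in columns if not c.primary_key and (c.merge_sql or c.update_on_merge)]


def update_set_clause(columns):
    """Assumed rendering: custom merge_sql wins, otherwise copy the source value."""
    return ", ".join(
        f"target.{c.name} = {c.merge_sql or 'source.' + c.name}"
        for c in columns_with_merge_logic(columns)
    )


columns = [
    Column("id", primary_key=True, update_on_merge=True),  # skipped: primary key
    Column("name", update_on_merge=True),
    Column("score", merge_sql="GREATEST(target.score, source.score)"),
    Column("created_at"),  # skipped: never opted in
]
print(update_set_clause(columns))
```

Note that id is skipped even though it is flagged update_on_merge: the primary-key check runs first.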

SCD2: generating slowly changing dimensions from a SELECT

The most complex materializer is SCD2 (Slowly Changing Dimension Type 2). It takes your SELECT query and generates SQL that maintains a full history table with _valid_from, _valid_until, and _is_current columns.

There are two variants: scd2_by_column (detects changes by comparing column values) and scd2_by_time (detects changes using a timestamp column). The two differ more than you'd expect in the SQL they generate.

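For BigQuery, scd2_by_time generates a MERGE statement. Its source unions every incoming row (marked _is_current = TRUE) with a second copy (marked _is_current = FALSE) for rows whose key already has an older current version in the target. Because the ON clause requires source._is_current, the TRUE copy matches the existing current row and expires it, while the FALSE copy never matches and falls through to the INSERT branch as the new current version: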

MERGE INTO `product_catalog` AS target
USING (
  WITH s1 AS (
    SELECT product_id, name, price, updated_at FROM raw_products
  )
  SELECT s1.*, TRUE AS _is_current
  FROM   s1
  UNION ALL
  SELECT s1.*, FALSE AS _is_current
  FROM s1
  JOIN   `product_catalog` AS t1 USING (product_id)
  WHERE  t1._valid_from < CAST(s1.updated_at AS TIMESTAMP) AND t1._is_current
) AS source
ON  target.product_id = source.product_id AND target._is_current AND source._is_current

WHEN MATCHED AND (
  target._valid_from < CAST(source.updated_at AS TIMESTAMP)
) THEN
  UPDATE SET
    target._valid_until = CAST(source.updated_at AS TIMESTAMP),
    target._is_current  = FALSE

WHEN NOT MATCHED BY SOURCE AND target._is_current = TRUE THEN
  UPDATE SET
    target._valid_until = CURRENT_TIMESTAMP(),
    target._is_current  = FALSE

WHEN NOT MATCHED BY TARGET THEN
  INSERT (product_id, name, price, updated_at, _valid_from, _valid_until, _is_current)
  VALUES (source.product_id, source.name, source.price, source.updated_at,
          CAST(source.updated_at AS TIMESTAMP), TIMESTAMP('9999-12-31'), TRUE);

That WHEN NOT MATCHED BY SOURCE clause is what handles deletions: records that exist in the target but not in the source get expired.
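The MERGE is dense, so here is the same bookkeeping as a plain Python simulation over lists of dicts. It is a reading aid under simplifying assumptions (one incoming row per key, timestamps already comparable), not what Bruin executes.

```python
from datetime import datetime

FAR_FUTURE = datetime(9999, 12, 31)


def scd2_by_time(target, source, key, ts_col, now):
    """Apply one load to `target` (list of dict rows) in place, one branch per MERGE clause."""
    current = {row[key]: row for row in target if row["_is_current"]}
    incoming = {row[key]: row for row in source}
    inserts = []

    for k, new in incoming.items():
        old = current.get(k)
        if old is None:
            inserts.append(new)  # NOT MATCHED BY TARGET: no current version of this key
        elif old["_valid_from"] < new[ts_col]:
            # MATCHED and newer: expire the old version, queue the new one
            old["_valid_until"], old["_is_current"] = new[ts_col], False
            inserts.append(new)

    for k, old in current.items():
        if k not in incoming:
            # NOT MATCHED BY SOURCE: key disappeared, expire at load time
            old["_valid_until"], old["_is_current"] = now, False

    for new in inserts:
        target.append({**new, "_valid_from": new[ts_col],
                       "_valid_until": FAR_FUTURE, "_is_current": True})
    return target


jan, feb, mar = datetime(2024, 1, 1), datetime(2024, 2, 1), datetime(2024, 3, 1)
target = [
    {"product_id": 1, "price": 10, "updated_at": jan,
     "_valid_from": jan, "_valid_until": FAR_FUTURE, "_is_current": True},
    {"product_id": 2, "price": 20, "updated_at": jan,
     "_valid_from": jan, "_valid_until": FAR_FUTURE, "_is_current": True},
]
source = [
    {"product_id": 1, "price": 12, "updated_at": feb},  # changed
    {"product_id": 3, "price": 30, "updated_at": feb},  # new
]

scd2_by_time(target, source, "product_id", "updated_at", now=mar)
for row in target:
    print(row["product_id"], row["price"], row["_is_current"])
```

After the load, product 1 has an expired 10 version and a current 12 version, product 2 (missing from the source) is expired as of the load time, and product 3 appears as a new current row.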

Snowflake's SCD2 implementation can't use the same approach because of dialect differences. It uses a multi-step transaction instead:

BEGIN TRANSACTION;
SET current_scd2_ts = CURRENT_TIMESTAMP();

-- Step 1: Expire records no longer in source
UPDATE product_catalog AS target
SET _valid_until = $current_scd2_ts, _is_current = FALSE
WHERE target._is_current = TRUE
  AND NOT EXISTS (
    SELECT 1 FROM (SELECT ...) AS source
    WHERE target.product_id = source.product_id
  );

-- Step 2: Handle new and changed records via MERGE
MERGE INTO product_catalog AS target
USING ( ... ) AS source
ON target.product_id = source.product_id AND target._is_current AND source._is_current
WHEN MATCHED AND (...) THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (...) VALUES (...);

COMMIT;

Notice how Snowflake captures the timestamp into a session variable (SET current_scd2_ts = CURRENT_TIMESTAMP()) and reuses it throughout. This is important for consistency: if the transaction takes a few seconds, you don't want different rows getting different expiry timestamps.

Both implementations also handle full refresh mode. When you run --full-refresh, SCD2 tables get bootstrapped with all records marked as current and auto-partitioned by _valid_from:

stmt := fmt.Sprintf(
    `CREATE OR REPLACE TABLE %s
%s
%s AS
SELECT
  CAST(%s AS TIMESTAMP) AS _valid_from,
  src.*,
  TIMESTAMP('9999-12-31') AS _valid_until,
  TRUE AS _is_current
FROM (
%s
) AS src;`,
    tbl, partitionClause, clusterClause,
    asset.Materialization.IncrementalKey,
    strings.TrimSpace(query),
)

The reserved column names (_valid_from, _valid_until, _is_current) are validated at query generation time. If your asset has a column named _is_current, the materializer returns an error before any SQL hits the database.
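
A check of this kind is small. Here is a minimal Python sketch of the idea; validate_scd2_columns is a hypothetical name and the error text is made up, so don't read it as the materializer's actual function or message.

```python
# The three column names the SCD2 strategies add on their own.
RESERVED_SCD2_COLUMNS = {"_valid_from", "_valid_until", "_is_current"}

def validate_scd2_columns(column_names):
    """Raise ValueError if a user-defined column collides with a reserved name."""
    clashes = sorted(RESERVED_SCD2_COLUMNS.intersection(column_names))
    if clashes:
        raise ValueError(
            "column names reserved for SCD2 tracking: " + ", ".join(clashes)
        )

validate_scd2_columns(["product_id", "name", "price"])  # passes silently
```

Running the check while the SQL is being generated means a bad asset definition fails before any statement is sent to the warehouse.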

The hook wrapper

There's one more layer I want to mention. Assets can define pre/post hooks, SQL statements that run before and after the materialized query:

hooks:
  pre:
    - query: "CREATE SCHEMA IF NOT EXISTS staging"
  post:
    - query: "GRANT SELECT ON dashboard.user_metrics TO analytics_role"

Rather than embedding hook logic into every materializer function, there's a HookWrapperMaterializer decorator:

type HookWrapperMaterializer struct {
    Mat interface {
        Render(asset *Asset, query string) (string, error)
    }
}

func (m HookWrapperMaterializer) Render(asset *Asset, query string) (string, error) {
    materialized, err := m.Mat.Render(asset, query)
    if err != nil {
        return "", err
    }
    return WrapHooks(materialized, asset.Hooks), nil
}

It wraps any materializer, runs the base Render, then prepends/appends the hook queries. No materializer function needs to know about hooks.
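
The same decorator idea, sketched in Python for readers who don't write Go. with_hooks and wrap_hooks are hypothetical names, assets are plain dictionaries, and the way statements are joined is my assumption rather than what WrapHooks really emits.

```python
def wrap_hooks(sql, pre, post):
    """Join pre-hook queries, the materialized SQL, and post-hook queries."""
    statements = [*pre, sql.rstrip().rstrip(";"), *post]
    return ";\n".join(statements) + ";"

def with_hooks(render):
    """Decorate any render(asset, query) function so its output gets hooks."""
    def wrapped(asset, query):
        materialized = render(asset, query)  # errors propagate unchanged
        hooks = asset.get("hooks", {})
        return wrap_hooks(materialized, hooks.get("pre", []), hooks.get("post", []))
    return wrapped

@with_hooks
def create_replace(asset, query):
    # Knows nothing about hooks; it only builds the core statement.
    return f"CREATE OR REPLACE TABLE {asset['name']} AS\n{query}"

asset = {
    "name": "dashboard.user_metrics",
    "hooks": {
        "pre": ["CREATE SCHEMA IF NOT EXISTS staging"],
        "post": ["GRANT SELECT ON dashboard.user_metrics TO analytics_role"],
    },
}
sql = create_replace(asset, "SELECT 1 AS id")
```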

What makes this work

The whole system spans more than 5,000 lines of SQL generation code across 14 database packages, but the core pattern is just a two-level map lookup.

Adding a new database means providing a new map. A new strategy means writing a function with the right signature. If a combination doesn't make sense (like append for a view), you plug in an error function. The map itself is the documentation of what's supported.

There are tradeoffs. The MaterializerFunc signature is func(task *Asset, query string) (string, error). It passes the entire Asset struct, so every function has access to everything: columns, primary keys, materialization config, connection details. That's flexible but also means there's no compile-time guarantee that a function only reads what it needs. It's a pragmatic choice for a system where each function needs slightly different fields.

The generated SQL isn't parameterized (it uses fmt.Sprintf with string interpolation). In a different context, this would be a SQL injection concern, but here the inputs come from YAML config files that the user wrote, not from untrusted external input.

If you're building something that needs to generate SQL across multiple databases, this pattern works well. Don't try to abstract away the differences. Accept that BigQuery's MERGE and DuckDB's UPDATE-then-INSERT are fundamentally different operations, give each database its own implementation, and use a dispatch table to route to the right one. The database-specific code ends up being surprisingly readable because each function only worries about one database and one strategy.

The materialization source code is at:

Core: pkg/pipeline/materializer.go
Strategies enum: pkg/pipeline/pipeline.go (lines 350-390)
Shared ANSI SQL: pkg/ansisql/materialization.go
Per-database implementations: pkg/bigquery/materialization.go, pkg/snowflake/materialization.go, pkg/duckdb/materialization.go, and 11 others

Materialization strategies: how Bruin and dbt turn SELECT queries into tables

Baris Terzioglu — Sun, 15 Mar 2026 20:55:37 +0000

Every SQL-based data pipeline tool faces the same problem: you wrote a SELECT query, and now you need it to exist as a table (or view) in your warehouse. The logic that bridges that gap is called materialization.

Both Bruin and dbt solve this. They just solve it differently, and the differences say a lot about each tool's design philosophy.

What is materialization, exactly?

If you write a query like this:

SELECT
    user_id,
    COUNT(*) AS order_count,
    SUM(amount) AS total_spent
FROM orders
GROUP BY user_id

Materialization is the logic that wraps this query to produce a real database object. At its simplest, that means generating CREATE TABLE my_table AS (SELECT ...). But real workloads get complicated fast. What if the table already exists? What if you only want to update yesterday's data? What if you need to track historical changes to records over time?

The materialization strategy answers these questions.

dbt's approach: four types, strategy as a sub-option

dbt organizes materialization around four core types: view, table, incremental, and ephemeral (recent versions also offer a materialized_view type on adapters that support it). You set this in the model's config block:

{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='user_id'
  )
}}

SELECT * FROM {{ ref('raw_orders') }}
{% if is_incremental() %}
    WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

The materialized property is the top-level decision. Within incremental, you then pick a strategy: append, delete+insert, merge, or the newer microbatch.

This design made a lot of sense when dbt launched. Most teams needed three things: views for lightweight transformations, tables for heavy ones, and some way to avoid full refreshes on large tables. The incremental type covers that last case, and the strategy sub-option lets you pick how it works.

dbt also has ephemeral, which never creates a database object at all. Instead, the query gets injected as a CTE into whatever model references it. It is a compile-time optimization, not a runtime materialization.
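
To make the CTE injection concrete, here is a rough Python sketch. It is not dbt's compiler: inline_ephemeral is a hypothetical helper, the CTE name only imitates the prefix that shows up in compiled dbt SQL (treat it as an assumption), and a model that already starts with its own WITH clause is not handled.

```python
def inline_ephemeral(model_sql, ref_name, ephemeral_sql):
    """Replace a ref() to an ephemeral model with a CTE prepended to the query."""
    cte_name = f"__dbt__cte__{ref_name}"  # assumed naming, for illustration
    body = model_sql.replace("{{ ref('" + ref_name + "') }}", cte_name)
    return f"WITH {cte_name} AS (\n{ephemeral_sql}\n)\n{body}"

compiled = inline_ephemeral(
    "SELECT * FROM {{ ref('stg_orders') }}",
    "stg_orders",
    "SELECT 1 AS id",
)
```

No object named stg_orders is ever created in the warehouse; its query only exists inside the models that reference it.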

For SCD Type 2 (slowly changing dimensions), dbt uses a separate concept entirely: snapshots. Snapshots are their own resource type with their own config, directory structure, and CLI command (dbt snapshot). They support two strategies: timestamp (track changes via an updated_at column) and check (track changes by comparing column values).

From the dbt architecture docs (docs/arch/4.5_Node_Materialization.md):

When a model is configured with materialized='table', the materialization macro generates SQL that follows an atomic swap pattern to avoid downtime.

The implementation looks like:

-- 1. Create a temporary table with the model's SQL
CREATE TABLE "schema"."my_model__dbt_tmp" AS (
  SELECT * FROM ...
);

-- 2. Drop the existing table
DROP TABLE IF EXISTS "schema"."my_model";

-- 3. Rename temp to final
ALTER TABLE "schema"."my_model__dbt_tmp"
RENAME TO "my_model";

Under the hood, materializations are Jinja macros. The actual macro files live in the dbt-adapters repo, not in dbt-core itself. dbt-core's ModelRunner.execute() (in core/dbt/task/run.py) looks up the right macro through a dispatch chain:

Adapter-specific macro in your project
Adapter-specific macro from a package
Adapter-specific macro from the adapter itself
Default macro in your project
Default macro from dbt-core

This layered dispatch is powerful. You can override any materialization at any level. But it also means the actual SQL generation lives across multiple repos and macro files, which can make debugging harder.

dbt's incremental config

The incremental model configuration (from core/dbt/artifacts/resources/v1/config.py) gives you these options:

materialized: str = "view"
incremental_strategy: Optional[str] = None
unique_key: Union[str, List[str], None] = None
on_schema_change: Optional[str] = "ignore"
batch_size: Any = None       # for microbatch
lookback: Any = 1            # for microbatch
begin: Any = None            # for microbatch
event_time: Any = None       # for microbatch

The on_schema_change option is worth noting. When your source schema changes between runs, dbt can ignore it, append_new_columns, sync_all_columns, or fail. This is a real-world problem that bites teams when upstream tables change, and dbt handles it well.
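
The four options are easier to compare side by side. The sketch below is a simplified model of my own, not dbt code: it only tracks column names (type changes are ignored), resolve_schema_change is a hypothetical function, and ignore is reduced to "leave the target columns alone".

```python
def resolve_schema_change(existing, incoming, on_schema_change="ignore"):
    """Return the target table's column list after applying the policy."""
    added = [c for c in incoming if c not in existing]
    removed = [c for c in existing if c not in incoming]
    if not added and not removed:
        return list(existing)
    if on_schema_change == "ignore":
        return list(existing)                 # target schema is left untouched
    if on_schema_change == "append_new_columns":
        return list(existing) + added         # never drops old columns
    if on_schema_change == "sync_all_columns":
        return [c for c in existing if c in incoming] + added
    if on_schema_change == "fail":
        raise RuntimeError(f"schema changed: added={added}, removed={removed}")
    raise ValueError(f"unknown on_schema_change value: {on_schema_change}")

old_cols, new_cols = ["id", "name", "legacy"], ["id", "name", "email"]
```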

One thing I find interesting about dbt's is_incremental() pattern: the model itself has to contain the conditional logic. You write the full-refresh query and the incremental query in the same file, gated by {% if is_incremental() %}. This means the model author has to think about both paths, which is both a strength (explicit) and a weakness (verbose, easy to get wrong).

Bruin's approach: type + strategy as a flat matrix

Bruin takes a different approach. Instead of overloading incremental as a materialization type with sub-strategies, Bruin separates the concept into two orthogonal dimensions:

type: either table or view
strategy: one of nine options

The config lives in the asset's YAML header:

/* @bruin
name: dashboard.orders
type: bq.sql

materialization:
    type: table
    strategy: delete+insert
    incremental_key: order_date

@bruin */

SELECT
    order_date,
    user_id,
    amount
FROM raw.orders
WHERE order_date >= CURRENT_DATE - 7

Notice what's missing: there's no {% if is_incremental() %} conditional in the SQL. The query is just a query. Bruin handles the wrapping logic entirely based on the strategy you picked.

Here are all nine strategies, defined in pkg/pipeline/pipeline.go:

const (
    MaterializationStrategyCreateReplace  = "create+replace"
    MaterializationStrategyDeleteInsert   = "delete+insert"
    MaterializationStrategyTruncateInsert = "truncate+insert"
    MaterializationStrategyAppend         = "append"
    MaterializationStrategyMerge          = "merge"
    MaterializationStrategyTimeInterval   = "time_interval"
    MaterializationStrategyDDL            = "ddl"
    MaterializationStrategySCD2ByTime     = "scd2_by_time"
    MaterializationStrategySCD2ByColumn   = "scd2_by_column"
)

That's nine strategies vs dbt's four incremental strategies plus separate snapshot types. Bruin puts everything into one flat list because the team's view is that these are all just different ways to materialize a SELECT query. SCD2 isn't a separate concept from incremental loading; it's just another strategy.

The dispatch pattern: Go functions vs Jinja macros

The way each tool dispatches to the right SQL generation code is revealing.

dbt uses a multi-layer Jinja macro dispatch with adapter inheritance. When you run a model with materialized='incremental' and incremental_strategy='merge' on Snowflake, dbt searches for a macro named materialization_incremental_snowflake, then falls back to materialization_incremental_default. The adapter can also dispatch the strategy itself, searching for snowflake__incremental_merge then default__incremental_merge.
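
Stripped of everything else, that fallback is "try the adapter-specific name, then the default one". A toy Python version follows; resolve_macro is hypothetical, and it deliberately leaves out project and package precedence and adapter inheritance, which the real lookup also considers.

```python
def resolve_macro(base_name, adapter, available):
    """Return the first candidate found in `available`, adapter-specific first."""
    for candidate in (f"{base_name}_{adapter}", f"{base_name}_default"):
        if candidate in available:
            return candidate
    raise LookupError(f"no macro found for {base_name} on {adapter}")

macros = {
    "materialization_incremental_default",
    "materialization_incremental_snowflake",
    "materialization_table_default",
}
```

With this set, the incremental materialization resolves to the Snowflake-specific macro, while table falls back to the default because no Snowflake override is registered.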

Bruin uses a simpler pattern. Each supported database has a Go file with a function map (from pkg/bigquery/materialization.go):

var matMap = pipeline.AssetMaterializationMap{
    pipeline.MaterializationTypeView: {
        pipeline.MaterializationStrategyNone: viewMaterializer,
        // ...
    },
    pipeline.MaterializationTypeTable: {
        pipeline.MaterializationStrategyNone:           buildCreateReplaceQuery,
        pipeline.MaterializationStrategyAppend:         buildAppendQuery,
        pipeline.MaterializationStrategyCreateReplace:  buildCreateReplaceQuery,
        pipeline.MaterializationStrategyDeleteInsert:   buildIncrementalQuery,
        pipeline.MaterializationStrategyMerge:          mergeMaterializer,
        pipeline.MaterializationStrategyTimeInterval:   buildTimeIntervalQuery,
        pipeline.MaterializationStrategyDDL:            buildDDLQuery,
        pipeline.MaterializationStrategySCD2ByColumn:   buildSCD2ByColumnQuery,
        pipeline.MaterializationStrategySCD2ByTime:     buildSCD2QueryByTime,
    },
}

The dispatcher in pkg/pipeline/materializer.go is about 25 lines of Go:

func (m *Materializer) Render(asset *Asset, query string) (string, error) {
    mat := asset.Materialization
    if mat.Type == MaterializationTypeNone {
        return removeComments(query), nil
    }

    strategy := mat.Strategy
    if m.FullRefresh && mat.Type == MaterializationTypeTable {
        if mat.Strategy != MaterializationStrategyDDL &&
           (asset.RefreshRestricted == nil || !*asset.RefreshRestricted) {
            strategy = MaterializationStrategyCreateReplace
        }
    }

    if matFunc, ok := m.MaterializationMap[mat.Type][strategy]; ok {
        materializedQuery, err := matFunc(asset, query)
        if err != nil {
            return "", err
        }
        return removeComments(materializedQuery), nil
    }

    return "", fmt.Errorf("unsupported materialization type - strategy combination: (`%s` - `%s`)",
        mat.Type, mat.Strategy)
}

The map lookup m.MaterializationMap[mat.Type][strategy] is the entire dispatch. No inheritance chain, no macro resolution order, no adapter fallbacks. Each database has its own map, and the right function gets called directly.
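
If Go isn't your language, the same dispatch fits in a few lines of Python. This is a sketch of the pattern and not a port: assets are plain dictionaries, the three builder functions are hypothetical stand-ins, and comment stripping is left out.

```python
def create_replace(asset, query):
    return f"CREATE OR REPLACE TABLE {asset['name']} AS\n{query}"

def append(asset, query):
    return f"INSERT INTO {asset['name']}\n{query}"

def view(asset, query):
    return f"CREATE OR REPLACE VIEW {asset['name']} AS\n{query}"

# type -> strategy -> function; a missing entry means "unsupported"
MAT_MAP = {
    "view": {"none": view},
    "table": {"none": create_replace, "create+replace": create_replace, "append": append},
}

def render(mat_map, asset, query, full_refresh=False):
    mat_type = asset.get("type")
    if mat_type is None:
        return query  # no materialization configured: run the query as-is
    strategy = asset.get("strategy", "none")
    # Full refresh overrides the strategy, except for ddl and restricted assets.
    if (full_refresh and mat_type == "table" and strategy != "ddl"
            and not asset.get("refresh_restricted", False)):
        strategy = "create+replace"
    try:
        mat_func = mat_map[mat_type][strategy]
    except KeyError:
        raise ValueError(
            f"unsupported materialization type - strategy combination: "
            f"(`{mat_type}` - `{strategy}`)"
        ) from None
    return mat_func(asset, query)

orders = {"name": "dashboard.orders", "type": "table", "strategy": "append"}
```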

This means you can't override Bruin's materializations from your project the way you can with dbt macros. But it also means you can read exactly what SQL will be generated by looking at one Go file. When Bruin's delete+insert on BigQuery does something unexpected, you look at pkg/bigquery/materialization.go, find buildIncrementalQuery, and read the Go code. With dbt, the same investigation might take you through three repos (dbt-core, dbt-adapters, dbt-bigquery) and multiple Jinja files.

Comparing specific strategies

Let me walk through how each tool handles some common patterns.

Full refresh: create+replace vs table

dbt's table materialization uses the atomic swap pattern: create temp, drop old, rename. This is safe and avoids partial reads.

Bruin's create+replace generates a simpler statement:

CREATE OR REPLACE TABLE dashboard.orders
PARTITION BY order_date
CLUSTER BY user_id
AS
SELECT ...

The PARTITION BY and CLUSTER BY come straight from the YAML config. Most modern warehouses support CREATE OR REPLACE TABLE, making the temp-and-rename pattern unnecessary. Bruin leans on the warehouse's own atomicity guarantees here.
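
Generating that statement is mostly string assembly. A minimal Python sketch, assuming the partition and cluster settings arrive as a string and a list; build_create_replace is a hypothetical name, not the Go function from the repo.

```python
def build_create_replace(name, query, partition_by=None, cluster_by=None):
    """Assemble a CREATE OR REPLACE TABLE ... AS statement."""
    parts = [f"CREATE OR REPLACE TABLE {name}"]
    if partition_by:
        parts.append(f"PARTITION BY {partition_by}")
    if cluster_by:
        parts.append("CLUSTER BY " + ", ".join(cluster_by))
    parts.append("AS")
    parts.append(query.strip().rstrip(";"))  # avoid a doubled semicolon
    return "\n".join(parts) + ";"

sql = build_create_replace(
    "dashboard.orders",
    "SELECT order_date, user_id, amount FROM raw.orders",
    partition_by="order_date",
    cluster_by=["user_id"],
)
```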

Incremental: delete+insert

dbt's delete+insert strategy requires you to write the filter logic yourself using {% if is_incremental() %}:

{{ config(materialized='incremental', incremental_strategy='delete+insert') }}

SELECT * FROM {{ ref('seed') }}
{% if is_incremental() %}
    WHERE a > (SELECT MAX(a) FROM {{ this }})
{% endif %}

Bruin generates the delete+insert logic from the incremental_key config. The BigQuery implementation (pkg/bigquery/materialization.go) does this:

func buildIncrementalQuery(asset *pipeline.Asset, query string) (string, error) {
    mat := asset.Materialization
    // ...
    queries := []string{
        fmt.Sprintf("DECLARE %s array<%s>", declaredVarName, foundCol.Type),
        "BEGIN TRANSACTION",
        fmt.Sprintf("CREATE TEMP TABLE %s AS %s", tempTableName, query),
        fmt.Sprintf("SET %s = (SELECT array_agg(distinct %s) FROM %s)",
            declaredVarName, mat.IncrementalKey, tempTableName),
        fmt.Sprintf("DELETE FROM %s WHERE %s in unnest(%s)",
            asset.Name, mat.IncrementalKey, declaredVarName),
        fmt.Sprintf("INSERT INTO %s SELECT * FROM %s", asset.Name, tempTableName),
        "COMMIT TRANSACTION",
    }
    return strings.Join(queries, ";\n") + ";", nil
}

The generated SQL runs the query into a temp table, extracts the distinct incremental keys, deletes matching rows from the target, and inserts the new rows, all inside a transaction. You don't write any of this logic in your SQL file.
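
What those statements do to the data is easy to simulate. The sketch below replays the delete-then-insert steps on Python lists; delete_insert is a hypothetical helper and nothing here talks to a warehouse.

```python
def delete_insert(target, new_rows, incremental_key):
    """Drop target rows whose key appears in the new batch, then add the batch."""
    keys = {row[incremental_key] for row in new_rows}   # array_agg(distinct ...)
    kept = [row for row in target if row[incremental_key] not in keys]  # DELETE
    return kept + list(new_rows)                        # INSERT

target = [
    {"order_date": "2024-03-01", "user_id": 1, "amount": 10},
    {"order_date": "2024-03-02", "user_id": 1, "amount": 20},
    {"order_date": "2024-03-02", "user_id": 2, "amount": 5},
]
batch = [{"order_date": "2024-03-02", "user_id": 1, "amount": 25}]
result = delete_insert(target, batch, "order_date")
```

Note what happened to user 2's row for 2024-03-02: it is gone, because the key is the date and not the row. With this strategy the query has to return the complete set of rows for every key value it touches.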

This is a genuine philosophical difference. dbt says: you, the model author, know best how to filter for new data. Bruin says: tell us the incremental key and we'll handle the rest. Both are defensible positions, but Bruin's approach means fewer bugs from incorrectly written incremental filters, especially for junior engineers or teams moving fast.

Merge (upsert)

Both tools support MERGE statements, but the configuration differs.

dbt uses unique_key in the config:

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='user_id'
) }}

Bruin uses column-level configuration with finer control:

/* @bruin
name: dashboard.products
type: bq.sql

materialization:
    type: table
    strategy: merge

columns:
  - name: product_id
    type: INTEGER
    primary_key: true
  - name: product_name
    type: VARCHAR
    update_on_merge: true
  - name: price
    type: INTEGER
    update_on_merge: true
  - name: high_score
    type: INTEGER
    merge_sql: GREATEST(target.high_score, source.high_score)

@bruin */

SELECT ...

That merge_sql field is worth a closer look. It lets you define custom merge logic per column. In the example above, high_score keeps whichever value is higher, whether it's the existing row or the incoming row. The generated MERGE statement (from mergeMaterializer in pkg/bigquery/materialization.go) produces:

MERGE dashboard.products target
USING (...) source
ON (source.product_id = target.product_id
    OR (source.product_id IS NULL AND target.product_id IS NULL))
WHEN MATCHED THEN UPDATE SET
    target.product_name = source.product_name,
    target.price = source.price,
    target.high_score = GREATEST(target.high_score, source.high_score)
WHEN NOT MATCHED THEN INSERT(product_id, product_name, price, high_score)
    VALUES(product_id, product_name, price, high_score);

In dbt, achieving the same per-column merge logic would require writing a custom materialization macro or using a post_hook.
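
The mapping from Bruin's column config to the UPDATE SET clause shown above is mechanical. Here is a Python sketch of that mapping; build_merge_set_clause is a hypothetical name, and letting merge_sql win when a column sets both fields is my assumption.

```python
def build_merge_set_clause(columns):
    """Turn column definitions into the assignments of a MERGE ... UPDATE SET."""
    assignments = []
    for col in columns:
        if col.get("merge_sql"):
            # Custom expression supplied by the asset author.
            assignments.append(f"target.{col['name']} = {col['merge_sql']}")
        elif col.get("update_on_merge"):
            # Plain overwrite with the incoming value.
            assignments.append(f"target.{col['name']} = source.{col['name']}")
    return ",\n    ".join(assignments)

columns = [
    {"name": "product_id", "primary_key": True},
    {"name": "product_name", "update_on_merge": True},
    {"name": "price", "update_on_merge": True},
    {"name": "high_score",
     "merge_sql": "GREATEST(target.high_score, source.high_score)"},
]
set_clause = build_merge_set_clause(columns)
```

Columns with neither flag, such as the primary key, never appear in the SET clause, so a matched row keeps its existing value for them.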

SCD Type 2

This is where the design philosophies diverge most.

dbt treats SCD2 as a separate resource type: snapshots. You put them in a snapshots/ directory, configure them differently from models, and run them with dbt snapshot. They track changes with dbt_valid_from and dbt_valid_to columns and support two strategies: timestamp and check.

Bruin treats SCD2 as two more materialization strategies: scd2_by_time and scd2_by_column. They're configured the same way as any other asset, in the same directory, with the same YAML structure:

/* @bruin
name: warehouse.product_history
type: bq.sql

materialization:
  type: table
  strategy: scd2_by_column

columns:
  - name: product_id
    type: INTEGER
    primary_key: true
  - name: product_name
    type: VARCHAR
  - name: price
    type: FLOAT

@bruin */

SELECT 1 AS product_id, 'Laptop' AS product_name, 999.99 AS price
UNION ALL
SELECT 2 AS product_id, 'Mouse' AS product_name, 29.99 AS price

Bruin automatically adds _valid_from, _valid_until, and _is_current columns. The scd2_by_column strategy compares all non-primary-key columns to detect changes. The scd2_by_time strategy uses a timestamp column (the incremental_key) to determine when records changed.

The generated SQL for scd2_by_column (from buildSCD2ByColumnQuery in pkg/bigquery/materialization.go) is a single MERGE statement that handles three cases in one pass:

WHEN MATCHED and columns changed: mark the old row as historical
WHEN NOT MATCHED BY SOURCE: mark removed records as historical
WHEN NOT MATCHED BY TARGET: insert new records
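
Those three cases can be replayed on in-memory rows to see what one run does to the history. This Python sketch is an illustration under my own assumptions: scd2_by_column here is a hypothetical function, every column not starting with an underscore counts as tracked, and the new version of a changed record is inserted as a fresh current row, which SCD2 needs even though the bullets leave it implicit.

```python
from datetime import datetime

FAR_FUTURE = datetime(9999, 12, 31)

def scd2_by_column(target, source, key, now):
    """Apply one SCD2 pass and return the new table contents."""
    incoming = {row[key]: row for row in source}
    result, unchanged = [], set()
    for row in target:
        row = dict(row)
        if row["_is_current"]:
            new = incoming.get(row[key])
            tracked = {k: v for k, v in row.items() if not k.startswith("_")}
            if new is None or new != tracked:
                # removed from the source, or a tracked column changed
                row["_valid_until"], row["_is_current"] = now, False
            else:
                unchanged.add(row[key])
        result.append(row)
    for k, new in incoming.items():
        if k not in unchanged:  # brand-new key, or new version of a changed one
            result.append({**new, "_valid_from": now,
                           "_valid_until": FAR_FUTURE, "_is_current": True})
    return result

t0, t1 = datetime(2024, 1, 1), datetime(2024, 2, 1)
table = scd2_by_column([], [{"product_id": 1, "price": 10},
                            {"product_id": 2, "price": 20}], "product_id", t0)
table = scd2_by_column(table, [{"product_id": 1, "price": 12},
                               {"product_id": 3, "price": 30}], "product_id", t1)
```

After the second run, product 1 has an expired row at price 10 and a current row at price 12, product 2 is expired because it disappeared from the source, and product 3 is new.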

The full-refresh path for SCD2 is also handled. When you run bruin run --full-refresh, the SCD2 strategies produce a CREATE OR REPLACE TABLE with the SCD2 tracking columns pre-populated, partitioned by _valid_from by default.

The practical difference: in dbt, adding SCD2 tracking to an existing model means moving it to a new directory, changing the resource type, and potentially refactoring your DAG. In Bruin, you change one line of YAML from strategy: create+replace to strategy: scd2_by_column and add column definitions.

Time interval and microbatch

Both tools have a concept of time-windowed processing, but they frame it differently.

dbt's microbatch strategy (introduced more recently) processes data in configurable time batches:

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    unique_key='id',
    event_time='event_time',
    batch_size='day',
    begin=modules.datetime.datetime(2020, 1, 1, 0, 0, 0)
) }}

The MicrobatchBuilder class (in core/dbt/materializations/incremental/microbatch.py) splits the time range into batches and processes each one. It supports hour, day, month, and year granularity, with a lookback parameter for reprocessing recent batches.
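
The core of the batching idea is splitting a time range into aligned windows. The sketch below shows it for day-sized batches in Python. It is not MicrobatchBuilder's API: daily_batches is a hypothetical function, and lookback and the other granularities are left out.

```python
from datetime import datetime, timedelta

def daily_batches(begin, end):
    """Yield (start, end) pairs covering [begin, end) in midnight-aligned days."""
    start = begin.replace(hour=0, minute=0, second=0, microsecond=0)
    while start < end:
        yield start, start + timedelta(days=1)
        start += timedelta(days=1)

batches = list(daily_batches(datetime(2024, 3, 1, 6, 30),
                             datetime(2024, 3, 3, 12, 0)))
```

Each pair is then used to filter the model's event_time column, so one run becomes several smaller, independently retryable queries.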

Bruin's time_interval strategy works with explicit start and end dates passed via CLI:

bruin run --start-date "2024-03-01" --end-date "2024-03-31" assets/orders.sql

The strategy deletes the time window and reinserts:

queries := []string{
    "BEGIN TRANSACTION",
    fmt.Sprintf("DELETE FROM %s WHERE %s BETWEEN '%s' AND '%s'",
        asset.Name, asset.Materialization.IncrementalKey, startVar, endVar),
    fmt.Sprintf("INSERT INTO %s %s", asset.Name, query),
    "COMMIT TRANSACTION",
}

dbt's microbatch is more automatic (it figures out the batches for you). Bruin's time_interval gives you more explicit control over the window. If you need to backfill March 2024, you say exactly that. With dbt, you'd configure the begin date and let the batching logic work it out.

The full-refresh escape hatch

Both tools let you bypass incremental logic and do a full rebuild.

dbt uses dbt run --full-refresh, which tells all incremental models to rebuild from scratch.

Bruin uses bruin run --full-refresh, and the behavior is encoded in the materializer. When FullRefresh is true, the strategy gets overridden to create+replace for all table materializations, with two exceptions:

DDL strategy is never overridden (it only creates the table structure, so dropping it would lose data)
Assets with refresh_restricted: true keep their normal strategy

That refresh_restricted flag is a nice touch. If you have a table that takes hours to rebuild, or one with external dependencies that would break if dropped, you can protect it from accidental full refreshes while still running --full-refresh on the rest of your pipeline.

What this tells us about each tool

dbt's materialization system reflects its origins as a "transforms" tool. It started with the simplest question (view or table?) and added incrementality as a sub-concern. SCD2 got bolted on as a separate concept (snapshots) because it's a fundamentally different workflow from transformation. The Jinja macro system makes everything customizable, at the cost of complexity and indirection.

Bruin started with a wider view of what a data pipeline asset needs. Materialization is a flat list of strategies because the team treated all the ways you might want to persist a SELECT query as peers. SCD2, merge with custom column logic, time-interval backfilling: they're all just strategies on the same axis. The code generation happens in Go functions that are explicit and traceable, but not customizable from outside the tool.

If you're building a team that will need custom materializations or uses heavily customized dbt packages, dbt's macro system gives you that flexibility. If you want a tool where "change the strategy" means changing one word in a YAML config and the generated SQL is predictable and inspectable, Bruin's approach is compelling. The fact that SCD2 is just another strategy, not a separate resource type, reduces the conceptual overhead for teams that need historical tracking.

I'd recommend looking at both tools' actual materialization code if you're evaluating them. The Go functions in Bruin's pkg/bigquery/materialization.go (or the equivalent for your warehouse) show you exactly what SQL you'll get. For dbt, start with core/dbt/task/run.py and follow the macro dispatch chain into dbt-adapters. Both are well-engineered, and reading the code tells you more than any comparison post ever could.

Data quality testing: how Bruin and dbt take different paths to the same goal

Baris Terzioglu — Sun, 15 Mar 2026 18:01:08 +0000

If you've built data pipelines for any length of time, you know the drill: the pipeline runs fine, the table gets created, and three days later someone discovers that half the rows have null IDs. The transformation was correct; the data just wasn't what you assumed.

Both Bruin and dbt have built-in systems for catching these problems. They solve the same problem, but in genuinely different ways. dbt treats tests as separate nodes in the DAG. Bruin embeds quality checks directly into asset definitions. Both approaches work, and the trade-offs between them are worth understanding regardless of which tool you use.

What we mean by "data quality testing"

Before comparing the tools, let me define the scope. I'm talking about checks that answer: "Does the data this pipeline just produced actually look right?" Common examples:

Is this column unique? Are there nulls where there shouldn't be?
Are all values in a column within an expected set?
Does a custom business rule hold? (e.g., total debits = total credits)

Both tools handle these. They just wire it up differently.

dbt: tests as first-class DAG nodes

dbt pioneered the idea of bringing software engineering testing practices to data. In dbt, a test is a SQL query that selects the rows violating an expectation. If it returns zero rows, the test passes. If it returns rows, those are the failures.

There are two flavors of tests in dbt:

Generic tests are declared in schema.yml and apply to specific columns or models. The four built-in ones (unique, not_null, accepted_values, and relationships) cover the basics:

# schema.yml
models:
  - name: orders
    columns:
      - name: order_id
        data_tests:
          - unique
          - not_null
      - name: status
        data_tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']

Each of these gets compiled into a separate DAG node. Under the hood, unique becomes something like:

select order_id
from orders
group by order_id
having count(*) > 1

You can see this in the codebase: GenericTestNode in core/dbt/contracts/graph/nodes.py (line 1083) inherits from CompiledNode, meaning each test is a full compiled SQL statement with its own node ID, configs, and execution context.
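
The "zero rows means pass" contract is easy to try locally. The sketch below runs the unique query from above against an in-memory SQLite table using Python's standard library; run_row_returning_test is a hypothetical helper, not part of dbt.

```python
import sqlite3

def run_row_returning_test(conn, test_sql):
    """Return the failing rows; an empty list means the test passes."""
    return conn.execute(test_sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "placed"), (2, "shipped"), (2, "returned")],  # order_id 2 is duplicated
)

UNIQUE_TEST = """
select order_id
from orders
group by order_id
having count(*) > 1
"""
failures = run_row_returning_test(conn, UNIQUE_TEST)
```

Because the failures are rows and not just a count, you can see which order_id broke the expectation.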

Singular tests are standalone SQL files you drop in the tests/ directory. They're just queries that should return zero rows:

-- tests/assert_total_payments_positive.sql
select order_id, total_amount
from {{ ref('orders') }}
where total_amount < 0

The test configuration system has real depth to it. Looking at TestConfig in core/dbt/artifacts/resources/v1/config.py:

@dataclass
class TestConfig(NodeAndTestConfig):
    severity: Severity = Severity("ERROR")
    store_failures: Optional[bool] = None
    store_failures_as: Optional[str] = None
    where: Optional[str] = None
    limit: Optional[int] = None
    fail_calc: str = "count(*)"
    warn_if: str = "!= 0"
    error_if: str = "!= 0"

That severity field with warn/error options is genuinely useful. You can say "I want to know about this problem, but don't block the pipeline":

data_tests:
  - not_null:
      severity: warn
      warn_if: "> 10"  # only warn if more than 10 nulls
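
How severity, warn_if, and error_if combine is worth spelling out. The Python sketch below follows the documented order of evaluation (with severity error, error_if is checked first and warn_if second; with severity warn, error_if is skipped). evaluate_status is a hypothetical function, and it assumes conditions are written with a space between operator and number.

```python
import operator

_OPS = {
    "!=": operator.ne, ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt, "=": operator.eq,
}

def _holds(failures, condition):
    """Evaluate a condition such as '> 10' or '!= 0' against a failure count."""
    op_text, value = condition.split()
    return _OPS[op_text](failures, int(value))

def evaluate_status(failures, severity="error", warn_if="!= 0", error_if="!= 0"):
    """Return 'fail', 'warn', or 'pass' for a given number of failing rows."""
    if severity == "error" and _holds(failures, error_if):
        return "fail"
    if _holds(failures, warn_if):
        return "warn"
    return "pass"
```

With the config above (severity warn, warn_if "> 10"), three nulls pass quietly and eleven produce a warning, but nothing ever fails the run.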

And store_failures: true materializes the failing rows into a table so you can actually go look at what failed. When you're debugging data quality issues at 2am, being able to SELECT * FROM dbt_test__audit.not_null_orders_order_id is a lifesaver.

The thing to understand about dbt's approach: tests are nodes in the DAG, run via dbt test (or dbt build, which runs models and tests together). The TestTask class in core/dbt/task/test.py inherits from RunTask, so tests go through the same execution machinery as models. The TestRunner.execute_data_test method renders the test SQL through the materialization macro system, executes it, and expects back exactly one row with three columns (failures, should_warn, should_error).

Bruin: checks embedded in the asset

Bruin takes a different approach. Quality checks aren't separate nodes; they're declared inside the asset definition itself, right next to the SQL that produces the data:

/* @bruin

name: analytics.orders
type: sf.sql
materialization:
  type: table

columns:
  - name: order_id
    type: integer
    checks:
      - name: unique
      - name: not_null
  - name: amount
    type: float
    checks:
      - name: positive
      - name: min
        value: 0.01

custom_checks:
  - name: row count above threshold
    query: SELECT count(*) > 100 FROM analytics.orders
    value: 1

@bruin */

SELECT
  order_id,
  amount
FROM raw.orders
WHERE status != 'cancelled'

Everything about the data quality expectations lives in the same file as the transformation. The column definitions, their types, and the checks they should pass are all declared together.

Bruin ships with nine built-in check types: unique, not_null, positive, negative, non_negative, accepted_values, pattern, min, and max. That's five more than dbt's four built-in generic tests. The positive, negative, non_negative, min, and max checks don't have out-of-the-box equivalents in dbt; you'd write custom generic tests or singular test files.

Looking at the implementation, the check SQL generation is straightforward. From pkg/ansisql/checks.go:

// NotNullCheck generates: SELECT count(*) FROM {table} WHERE {column} IS NULL
func (c *NotNullCheck) Check(ctx context.Context, ti *scheduler.ColumnCheckInstance) error {
    qq := fmt.Sprintf("SELECT count(*) FROM %s WHERE %s IS NULL",
        ti.GetAsset().Name, ti.Column.Name)

    return (&CountableQueryCheck{
        conn:                c.conn,
        expectedQueryResult: 0,
        queryInstance:       &query.Query{Query: qq},
        checkName:           "not_null",
        customError: func(count int64) error {
            return errors.Errorf("column '%s' has %d null values", ti.Column.Name, count)
        },
    }).Check(ctx, ti)
}

Each check type generates a SELECT count(*) query and expects a specific result (usually zero). The CountableQueryCheck pattern runs the query against the actual database connection, parses the integer result, and compares it. Simple.
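
The count-and-compare pattern fits in a dozen lines. This Python sketch uses the standard library's sqlite3 in place of a warehouse connection; countable_check and not_null_check are hypothetical names that only mirror the shape of the Go code.

```python
import sqlite3

def countable_check(conn, sql, expected, describe):
    """Run a query that returns one integer and compare it to the expected value."""
    (count,) = conn.execute(sql).fetchone()
    if count != expected:
        raise ValueError(describe(count))

def not_null_check(conn, table, column):
    sql = f"SELECT count(*) FROM {table} WHERE {column} IS NULL"
    countable_check(conn, sql, 0,
                    lambda n: f"column '{column}' has {n} null values")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 9.5), (None, 3.0), (None, 1.0)])
```

Calling not_null_check(conn, "orders", "order_id") raises with the number of nulls found, while the same check on amount returns without error.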

For custom checks, Bruin supports inline SQL queries directly in the asset YAML. The CustomCheck type in the same file renders the query through Jinja (so you can use template variables), then runs it:

func (c *CustomCheck) Check(ctx context.Context, ti *scheduler.CustomCheckInstance) error {
    qq := ti.Check.Query
    // Jinja rendering happens here...
    expected := ti.Check.Value
    if ti.Check.Count != nil {
        expected = *ti.Check.Count
        qq = fmt.Sprintf("SELECT count(*) FROM (%s) AS t", qq)
    }
    return NewCountableQueryCheck(c.conn, expected, &query.Query{Query: qq}, ti.Check.Name, ...)
}

How they wire into the execution graph

This is where the design philosophy really diverges.

In dbt, tests are independent nodes. When you run dbt build, the DAG might look like:

stg_orders (model) → orders (model) → test: unique_orders_order_id
                                     → test: not_null_orders_order_id
                                     → customers (model, depends on orders)

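Tests and downstream models both run after the model they test. Whether a failing test blocks anything depends on how you invoke dbt. If you run dbt run and then dbt test as separate steps, every model is built before any test executes, so a failing test on orders cannot stop customers from being built. dbt build interleaves models and tests in DAG order and skips resources downstream of a test that fails with error severity, and --fail-fast goes further by stopping the whole run at the first failure.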

Bruin does something different. Quality checks are scheduled as ColumnCheckInstance and CustomCheckInstance objects, sub-tasks of the asset they belong to. In pkg/scheduler/scheduler.go (line 668), the scheduler explicitly wires them:

// add the upstream-downstream relationships for the main task to its quality checks
s.taskNameMap[assetName].AddUpstreamByType(TaskInstanceTypeColumnCheck, ti)
s.taskNameMap[assetName].AddUpstreamByType(TaskInstanceTypeCustomCheck, ti)

So the execution graph looks like:

raw_orders (asset) → [quality checks: unique, not_null, positive] → downstream_asset

Quality checks run after their asset completes, and they block downstream assets by default. The blocking field on each check controls this:

func (t *ColumnCheckInstance) Blocking() bool {
    return t.Check.Blocking.Bool()  // defaults to true
}

When constructInstanceRelationships builds the DAG, it considers blocking status. A downstream asset won't start until all blocking checks on its upstream assets have passed. Non-blocking checks still run, still report failures, but they don't hold up the pipeline.
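
To make the gating rule concrete, here is a simplified sketch of the decision a scheduler has to make before starting a downstream asset. This is not Bruin's scheduler code: the check struct, isBlocking, and canStartDownstream are hypothetical names, and the pointer-based default is just one way to model "blocking unless explicitly turned off".

```go
package main

import "fmt"

// check is a simplified stand-in for a quality check instance.
// Blocking is a pointer so that "not set" can be told apart from "false".
type check struct {
	Name     string
	Blocking *bool
	Passed   bool
}

// isBlocking treats an unset field as blocking, matching the
// default-to-true behavior described above.
func (c check) isBlocking() bool {
	return c.Blocking == nil || *c.Blocking
}

// canStartDownstream reports whether a downstream asset may start and
// returns the names of the blocking checks that failed. Failed
// non-blocking checks are ignored here; they would only be reported.
func canStartDownstream(upstreamChecks []check) (bool, []string) {
	var failed []string
	for _, c := range upstreamChecks {
		if c.isBlocking() && !c.Passed {
			failed = append(failed, c.Name)
		}
	}
	return len(failed) == 0, failed
}

func main() {
	nonBlocking := false
	checks := []check{
		{Name: "order_id:not_null", Passed: true},
		{Name: "amount:positive", Blocking: &nonBlocking, Passed: false},
	}
	ok, failed := canStartDownstream(checks)
	fmt.Println("can start:", ok, "failed blocking checks:", failed)
}
```

In this example the only failing check is marked non-blocking, so the downstream asset is allowed to start. Drop the Blocking override and the same failure would hold it back.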

You can also run checks independently without re-running the asset:

bruin run --only checks assets/orders.sql

This is useful when you want to re-check data quality without re-materializing.

Where each approach shines

dbt's "tests as separate nodes" design has some real advantages:

Reusability through macros. dbt's generic test system lets you write a test macro once and use it across your whole project. Packages like dbt-utils and dbt-expectations add dozens of test types. The macro system is genuinely powerful for this.
store_failures gives you debugging data. When a test fails, you can query the actual failing rows. Bruin tells you how many rows failed; dbt can show you which rows.
Granular severity. The warn_if/error_if system with thresholds is more nuanced than a binary blocking/not-blocking toggle. "Warn if more than 5% of rows fail" is a useful middle ground.
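
To show what a threshold buys you over a yes/no toggle, here is a small sketch of three-level severity. It is not how dbt implements warn_if/error_if (dbt evaluates those conditions itself); the classify function and its "greater than N failing rows" thresholds are simplifying assumptions.

```go
package main

import "fmt"

type severity string

const (
	pass severity = "pass"
	warn severity = "warn"
	fail severity = "error"
)

// classify maps a failing-row count to a severity using two thresholds.
// A binary blocking flag can only express the "pass" and "error" outcomes.
func classify(failures, warnAbove, errorAbove int) severity {
	switch {
	case failures > errorAbove:
		return fail
	case failures > warnAbove:
		return warn
	default:
		return pass
	}
}

func main() {
	// Hypothetical thresholds: warn above 10 failing rows, error above 100.
	for _, n := range []int{0, 25, 500} {
		fmt.Println(n, "failing rows:", classify(n, 10, 100))
	}
}
```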

Bruin's "checks in the asset" design has different strengths:

Co-location. When I open a SQL file, I see the transformation AND the quality expectations in one place. I don't need to cross-reference between a SQL file and a schema.yml to understand what an asset does and what constraints it should satisfy. For onboarding engineers to a project, this is a real benefit.
Checks as pipeline gates. Blocking checks are wired into the DAG by default. If the orders table has null IDs, downstream assets won't even try to run. You don't need to think about test ordering or --fail-fast flags; it's the default behavior.
More built-in check types. Nine built-in checks vs four means less custom work for common validations. positive, min, and max come up all the time in financial and analytics data, and having them built in saves you from writing (and maintaining) custom test macros.
Custom checks without extra files. Need a business-specific check? Add a custom_checks entry with a SQL query. In dbt, you'd create a separate SQL file in tests/ or write a generic test macro in macros/. Bruin keeps it in the asset.

The deeper difference

These two approaches reflect a broader design question: should data quality be something you add next to your transformations, or something you define inside them?

dbt inherits from the software testing tradition: tests live in separate files, run as a separate step. There's a clean separation of concerns. The transformation does one thing, the test does another. This is familiar if you come from application development where src/ and tests/ are separate directories.

Bruin treats an asset as a complete unit: here's the data I produce, here are the columns it should have, and here are the constraints those columns must satisfy. It's closer to how a database schema with CHECK constraints works: the expectations are part of the definition, not a separate layer.

I find Bruin's approach particularly practical for teams where the person writing the transformation is also the person responsible for its quality. You define what the data should look like in the same breath as defining how to produce it. There's no friction of switching files or remembering to update a separate YAML when you change a column name.

That said, dbt's ecosystem of test packages is something Bruin hasn't matched yet. If you need 50 different test types, dbt-expectations has them ready to go.

For me, the blocking-by-default behavior is the strongest argument for Bruin's design. Quality gates should be the default, not something you have to opt into. When a data quality check fails, the pipeline should stop, and you shouldn't have to remember to configure that.