Materialization strategies: how Bruin and dbt turn SELECT queries into tables
Every SQL-based data pipeline tool faces the same problem: you wrote a SELECT query, and now you need it to exist as a table (or view) in your warehouse. The logic that bridges that gap is called materialization.
Both Bruin and dbt solve this. They just solve it differently, and the differences say a lot about each tool's design philosophy.
What is materialization, exactly?
If you write a query like this:
SELECT
user_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent
FROM orders
GROUP BY user_id
Materialization is the logic that wraps this query to produce a real database object. At its simplest, that means generating CREATE TABLE my_table AS (SELECT ...). But real workloads get complicated fast. What if the table already exists? What if you only want to update yesterday's data? What if you need to track historical changes to records over time?
The materialization strategy answers these questions.
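At its core, the simplest materialization is just string wrapping. A minimal sketch (hypothetical illustration, not code from either tool) of turning a SELECT into a CREATE TABLE AS statement:

```go
package main

import "fmt"

// materializeTable wraps a raw SELECT query in a CREATE TABLE AS
// statement. Real tools layer existence checks, incremental logic,
// and dialect differences on top of this core idea.
func materializeTable(table, query string) string {
	return fmt.Sprintf("CREATE TABLE %s AS (\n%s\n)", table, query)
}

func main() {
	q := "SELECT user_id, COUNT(*) AS order_count FROM orders GROUP BY user_id"
	fmt.Println(materializeTable("analytics.user_orders", q))
}
```

Everything else in this post is, in a sense, variations on this function for the harder cases.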
dbt's approach: four types, strategy as a sub-option
dbt organizes materialization around four types: view, table, incremental, and ephemeral. You set this in the model's config block:
{{
config(
materialized='incremental',
incremental_strategy='merge',
unique_key='user_id'
)
}}
SELECT * FROM {{ ref('raw_orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
The materialized property is the top-level decision. Within incremental, you then pick a strategy: append, delete+insert, merge, or the newer microbatch.
This design made a lot of sense when dbt launched. Most teams needed three things: views for lightweight transformations, tables for heavy ones, and some way to avoid full refreshes on large tables. The incremental type covers that last case, and the strategy sub-option lets you pick how it works.
dbt also has ephemeral, which never creates a database object at all. Instead, the query gets injected as a CTE into whatever model references it. It is a compile-time optimization, not a runtime materialization.
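The CTE injection can be sketched as a simple string transformation (the CTE name below is illustrative; dbt's compiled output uses its own naming convention):

```go
package main

import "fmt"

// inlineEphemeral shows, schematically, what dbt does with an
// ephemeral model: instead of creating a table, the model's query
// is injected as a CTE into the downstream model's compiled SQL.
func inlineEphemeral(cteName, ephemeralSQL, downstreamSQL string) string {
	return fmt.Sprintf("WITH %s AS (\n%s\n)\n%s", cteName, ephemeralSQL, downstreamSQL)
}

func main() {
	compiled := inlineEphemeral(
		"__cte_stg_orders",
		"SELECT * FROM raw.orders WHERE NOT deleted",
		"SELECT user_id, COUNT(*) FROM __cte_stg_orders GROUP BY user_id",
	)
	fmt.Println(compiled)
}
```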
For SCD Type 2 (slowly changing dimensions), dbt uses a separate concept entirely: snapshots. Snapshots are their own resource type with their own config, directory structure, and CLI command (dbt snapshot). They support two strategies: timestamp (track changes via an updated_at column) and check (track changes by comparing column values).
From the dbt architecture docs (docs/arch/4.5_Node_Materialization.md):
When a model is configured with materialized='table', the materialization macro generates SQL that follows an atomic swap pattern to avoid downtime.
The implementation looks like:
-- 1. Create a temporary table with the model's SQL
CREATE TABLE "schema"."my_model__dbt_tmp" AS (
SELECT * FROM ...
);
-- 2. Drop the existing table
DROP TABLE IF EXISTS "schema"."my_model";
-- 3. Rename temp to final
ALTER TABLE "schema"."my_model__dbt_tmp"
RENAME TO "my_model";
Under the hood, materializations are Jinja macros. The actual macro files live in the dbt-adapters repo, not in dbt-core itself. dbt-core's ModelRunner.execute() (in core/dbt/task/run.py) looks up the right macro through a dispatch chain:
- Adapter-specific macro in your project
- Adapter-specific macro from a package
- Adapter-specific macro from the adapter itself
- Default macro in your project
- Default macro from dbt-core
This layered dispatch is powerful. You can override any materialization at any level. But it also means the actual SQL generation lives across multiple repos and macro files, which can make debugging harder.
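The fallback chain has the same shape as a priority-ordered lookup. A sketch in Go (the candidate naming scheme here is illustrative; dbt resolves Jinja macros, not map keys):

```go
package main

import "fmt"

// resolveMacro models dbt's layered dispatch: try each candidate
// name in priority order and take the first match.
func resolveMacro(registry map[string]string, adapter, name string) (string, bool) {
	candidates := []string{
		adapter + "__" + name + "__project", // adapter-specific, your project
		adapter + "__" + name + "__package", // adapter-specific, a package
		adapter + "__" + name,               // adapter-specific, the adapter
		"default__" + name + "__project",    // default, your project
		"default__" + name,                  // default, dbt-core
	}
	for _, c := range candidates {
		if body, ok := registry[c]; ok {
			return body, true
		}
	}
	return "", false
}

func main() {
	registry := map[string]string{
		"default__materialization_incremental":   "-- dbt-core default",
		"snowflake__materialization_incremental": "-- snowflake override",
	}
	body, _ := resolveMacro(registry, "snowflake", "materialization_incremental")
	fmt.Println(body) // the snowflake-specific macro shadows the default
}
```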
dbt's incremental config
The incremental model configuration (from core/dbt/artifacts/resources/v1/config.py) gives you these options:
materialized: str = "view"
incremental_strategy: Optional[str] = None
unique_key: Union[str, List[str], None] = None
on_schema_change: Optional[str] = "ignore"
batch_size: Any = None # for microbatch
lookback: Any = 1 # for microbatch
begin: Any = None # for microbatch
event_time: Any = None # for microbatch
The on_schema_change option is worth noting. When your source schema changes between runs, dbt can ignore it, append_new_columns, sync_all_columns, or fail. This is a real-world problem that bites teams when upstream tables change, and dbt handles it well.
One thing I find interesting about dbt's is_incremental() pattern: the model itself has to contain the conditional logic. You write the full-refresh query and the incremental query in the same file, gated by {% if is_incremental() %}. This means the model author has to think about both paths, which is both a strength (explicit) and a weakness (verbose, easy to get wrong).
Bruin's approach: type + strategy as a flat matrix
Bruin takes a different approach. Instead of overloading incremental as a materialization type with sub-strategies, Bruin separates the concept into two orthogonal dimensions:
- type: either table or view
- strategy: one of nine options
The config lives in the asset's YAML header:
/* @bruin
name: dashboard.orders
type: bq.sql
materialization:
type: table
strategy: delete+insert
incremental_key: order_date
@bruin */
SELECT
order_date,
user_id,
amount
FROM raw.orders
WHERE order_date >= CURRENT_DATE - 7
Notice what's missing: there's no {% if is_incremental() %} conditional in the SQL. The query is just a query. Bruin handles the wrapping logic entirely based on the strategy you picked.
Here are all nine strategies, defined in pkg/pipeline/pipeline.go:
const (
MaterializationStrategyCreateReplace = "create+replace"
MaterializationStrategyDeleteInsert = "delete+insert"
MaterializationStrategyTruncateInsert = "truncate+insert"
MaterializationStrategyAppend = "append"
MaterializationStrategyMerge = "merge"
MaterializationStrategyTimeInterval = "time_interval"
MaterializationStrategyDDL = "ddl"
MaterializationStrategySCD2ByTime = "scd2_by_time"
MaterializationStrategySCD2ByColumn = "scd2_by_column"
)
That's nine strategies vs dbt's four incremental strategies plus separate snapshot types. Bruin puts everything into one flat list because the team's view is that these are all just different ways to materialize a SELECT query. SCD2 isn't a separate concept from incremental loading; it's just another strategy.
The dispatch pattern: Go functions vs Jinja macros
The way each tool dispatches to the right SQL generation code is revealing.
dbt uses a multi-layer Jinja macro dispatch with adapter inheritance. When you run a model with materialized='incremental' and incremental_strategy='merge' on Snowflake, dbt searches for a macro named materialization_incremental_snowflake, then falls back to materialization_incremental_default. The adapter can also dispatch the strategy itself, searching for snowflake__incremental_merge then default__incremental_merge.
Bruin uses a simpler pattern. Each supported database has a Go file with a function map (from pkg/bigquery/materialization.go):
var matMap = pipeline.AssetMaterializationMap{
pipeline.MaterializationTypeView: {
pipeline.MaterializationStrategyNone: viewMaterializer,
// ...
},
pipeline.MaterializationTypeTable: {
pipeline.MaterializationStrategyNone: buildCreateReplaceQuery,
pipeline.MaterializationStrategyAppend: buildAppendQuery,
pipeline.MaterializationStrategyCreateReplace: buildCreateReplaceQuery,
pipeline.MaterializationStrategyDeleteInsert: buildIncrementalQuery,
pipeline.MaterializationStrategyMerge: mergeMaterializer,
pipeline.MaterializationStrategyTimeInterval: buildTimeIntervalQuery,
pipeline.MaterializationStrategyDDL: buildDDLQuery,
pipeline.MaterializationStrategySCD2ByColumn: buildSCD2ByColumnQuery,
pipeline.MaterializationStrategySCD2ByTime: buildSCD2QueryByTime,
},
}
The dispatcher in pkg/pipeline/materializer.go is about 25 lines of Go:
func (m *Materializer) Render(asset *Asset, query string) (string, error) {
mat := asset.Materialization
if mat.Type == MaterializationTypeNone {
return removeComments(query), nil
}
strategy := mat.Strategy
if m.FullRefresh && mat.Type == MaterializationTypeTable {
if mat.Strategy != MaterializationStrategyDDL &&
(asset.RefreshRestricted == nil || !*asset.RefreshRestricted) {
strategy = MaterializationStrategyCreateReplace
}
}
if matFunc, ok := m.MaterializationMap[mat.Type][strategy]; ok {
materializedQuery, err := matFunc(asset, query)
if err != nil {
return "", err
}
return removeComments(materializedQuery), nil
}
return "", fmt.Errorf("unsupported materialization type - strategy combination: (`%s` - `%s`)",
mat.Type, mat.Strategy)
}
The map lookup m.MaterializationMap[mat.Type][strategy] is the entire dispatch. No inheritance chain, no macro resolution order, no adapter fallbacks. Each database has its own map, and the right function gets called directly.
This means you can't override Bruin's materializations from your project the way you can with dbt macros. But it also means you can read exactly what SQL will be generated by looking at one Go file. When Bruin's delete+insert on BigQuery does something unexpected, you look at pkg/bigquery/materialization.go, find buildIncrementalQuery, and read the Go code. With dbt, the same investigation might take you through three repos (dbt-core, dbt-adapters, dbt-bigquery) and multiple Jinja files.
Comparing specific strategies
Let me walk through how each tool handles some common patterns.
Full refresh: create+replace vs table
dbt's table materialization uses the atomic swap pattern: create temp, drop old, rename. This is safe and avoids partial reads.
Bruin's create+replace generates a simpler statement:
CREATE OR REPLACE TABLE dashboard.orders
PARTITION BY order_date
CLUSTER BY user_id
AS
SELECT ...
The PARTITION BY and CLUSTER BY come straight from the YAML config. Most modern warehouses support CREATE OR REPLACE TABLE, making the temp-and-rename pattern unnecessary. Bruin leans on the warehouse's own atomicity guarantees here.
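Assembling that statement from config is straightforward. A sketch in the spirit of Bruin's BigQuery implementation (the struct and field names here are illustrative, not Bruin's actual types):

```go
package main

import (
	"fmt"
	"strings"
)

// Materialization holds the subset of config relevant to a
// create+replace statement.
type Materialization struct {
	PartitionBy string
	ClusterBy   []string
}

// buildCreateReplace emits CREATE OR REPLACE TABLE with optional
// PARTITION BY and CLUSTER BY clauses taken from the config.
func buildCreateReplace(name string, mat Materialization, query string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "CREATE OR REPLACE TABLE %s", name)
	if mat.PartitionBy != "" {
		fmt.Fprintf(&b, "\nPARTITION BY %s", mat.PartitionBy)
	}
	if len(mat.ClusterBy) > 0 {
		fmt.Fprintf(&b, "\nCLUSTER BY %s", strings.Join(mat.ClusterBy, ", "))
	}
	fmt.Fprintf(&b, "\nAS\n%s", query)
	return b.String()
}

func main() {
	sql := buildCreateReplace("dashboard.orders",
		Materialization{PartitionBy: "order_date", ClusterBy: []string{"user_id"}},
		"SELECT order_date, user_id, amount FROM raw.orders")
	fmt.Println(sql)
}
```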
Incremental: delete+insert
dbt's delete+insert strategy requires you to write the filter logic yourself using {% if is_incremental() %}:
{{ config(materialized='incremental', incremental_strategy='delete+insert') }}
SELECT * FROM {{ ref('seed') }}
{% if is_incremental() %}
WHERE a > (SELECT MAX(a) FROM {{ this }})
{% endif %}
Bruin generates the delete+insert logic from the incremental_key config. The BigQuery implementation (pkg/bigquery/materialization.go) does this:
func buildIncrementalQuery(asset *pipeline.Asset, query string) (string, error) {
mat := asset.Materialization
// ...
queries := []string{
fmt.Sprintf("DECLARE %s array<%s>", declaredVarName, foundCol.Type),
"BEGIN TRANSACTION",
fmt.Sprintf("CREATE TEMP TABLE %s AS %s", tempTableName, query),
fmt.Sprintf("SET %s = (SELECT array_agg(distinct %s) FROM %s)",
declaredVarName, mat.IncrementalKey, tempTableName),
fmt.Sprintf("DELETE FROM %s WHERE %s in unnest(%s)",
asset.Name, mat.IncrementalKey, declaredVarName),
fmt.Sprintf("INSERT INTO %s SELECT * FROM %s", asset.Name, tempTableName),
"COMMIT TRANSACTION",
}
return strings.Join(queries, ";\n") + ";", nil
}
The generated SQL: run the query into a temp table, extract the distinct incremental keys, delete matching rows from the target, insert the new rows, all inside a transaction. You don't write any of this logic in your SQL file.
This is a genuine philosophical difference. dbt says: you, the model author, know best how to filter for new data. Bruin says: tell us the incremental key and we'll handle the rest. Both are defensible positions, but Bruin's approach means fewer bugs from incorrectly written incremental filters, especially for junior engineers or teams moving fast.
Merge (upsert)
Both tools support MERGE statements, but the configuration differs.
dbt uses unique_key in the config:
{{ config(
materialized='incremental',
incremental_strategy='merge',
unique_key='user_id'
) }}
Bruin uses column-level configuration with finer control:
/* @bruin
name: dashboard.products
type: bq.sql
materialization:
type: table
strategy: merge
columns:
- name: product_id
type: INTEGER
primary_key: true
- name: product_name
type: VARCHAR
update_on_merge: true
- name: price
type: INTEGER
update_on_merge: true
- name: high_score
type: INTEGER
merge_sql: GREATEST(target.high_score, source.high_score)
@bruin */
SELECT ...
That merge_sql field is worth a closer look. It lets you define custom merge logic per column. In the example above, high_score keeps whichever value is higher, whether it's the existing row or the incoming row. The generated MERGE statement (from mergeMaterializer in pkg/bigquery/materialization.go) produces:
MERGE dashboard.products target
USING (...) source
ON (source.product_id = target.product_id
OR (source.product_id IS NULL AND target.product_id IS NULL))
WHEN MATCHED THEN UPDATE SET
target.product_name = source.product_name,
target.price = source.price,
target.high_score = GREATEST(target.high_score, source.high_score)
WHEN NOT MATCHED THEN INSERT(product_id, product_name, price, high_score)
VALUES(product_id, product_name, price, high_score);
In dbt, achieving the same per-column merge logic would require writing a custom materialization macro or using a post_hook.
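The per-column update logic can be sketched as a small clause builder (a hypothetical simplification of what mergeMaterializer does; the Column struct here is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// Column models the per-column merge config: a plain overwrite when
// UpdateOnMerge is set, or a custom expression when MergeSQL is given.
type Column struct {
	Name          string
	UpdateOnMerge bool
	MergeSQL      string // custom merge expression; empty means plain overwrite
}

// buildSetClause produces the UPDATE SET body of a MERGE statement.
func buildSetClause(cols []Column) string {
	var parts []string
	for _, c := range cols {
		switch {
		case c.MergeSQL != "":
			parts = append(parts, fmt.Sprintf("target.%s = %s", c.Name, c.MergeSQL))
		case c.UpdateOnMerge:
			parts = append(parts, fmt.Sprintf("target.%s = source.%s", c.Name, c.Name))
		}
	}
	return strings.Join(parts, ",\n  ")
}

func main() {
	cols := []Column{
		{Name: "product_name", UpdateOnMerge: true},
		{Name: "price", UpdateOnMerge: true},
		{Name: "high_score", MergeSQL: "GREATEST(target.high_score, source.high_score)"},
	}
	fmt.Println("WHEN MATCHED THEN UPDATE SET\n  " + buildSetClause(cols))
}
```

Columns that set neither flag are simply left out of the SET clause, which is how a column gets insert-only behavior.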
SCD Type 2
This is where the design philosophies diverge most.
dbt treats SCD2 as a separate resource type: snapshots. You put them in a snapshots/ directory, configure them differently from models, and run them with dbt snapshot. They track changes with dbt_valid_from and dbt_valid_to columns and support two strategies: timestamp and check.
Bruin treats SCD2 as two more materialization strategies: scd2_by_time and scd2_by_column. They're configured the same way as any other asset, in the same directory, with the same YAML structure:
/* @bruin
name: warehouse.product_history
type: bq.sql
materialization:
type: table
strategy: scd2_by_column
columns:
- name: product_id
type: INTEGER
primary_key: true
- name: product_name
type: VARCHAR
- name: price
type: FLOAT
@bruin */
SELECT 1 AS product_id, 'Laptop' AS product_name, 999.99 AS price
UNION ALL
SELECT 2 AS product_id, 'Mouse' AS product_name, 29.99 AS price
Bruin automatically adds _valid_from, _valid_until, and _is_current columns. The scd2_by_column strategy compares all non-primary-key columns to detect changes. The scd2_by_time strategy uses a timestamp column (the incremental_key) to determine when records changed.
The generated SQL for scd2_by_column (from buildSCD2ByColumnQuery in pkg/bigquery/materialization.go) is a single MERGE statement that handles three cases in one pass:
- WHEN MATCHED and columns changed: mark the old row as historical
- WHEN NOT MATCHED BY SOURCE: mark removed records as historical
- WHEN NOT MATCHED BY TARGET: insert new records
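A skeleton of that three-branch MERGE, rendered by a Go helper (heavily simplified and hypothetical: real column handling, change detection, and timestamps are more involved than this):

```go
package main

import "fmt"

// scd2MergeSkeleton emits the branch structure of an SCD2-by-column
// MERGE: close changed rows, close rows removed from the source, and
// insert rows that are new to the target.
func scd2MergeSkeleton(target, source string) string {
	return fmt.Sprintf(`MERGE %s AS target
USING (%s) AS source
ON target.product_id = source.product_id AND target._is_current
WHEN MATCHED AND (target.product_name != source.product_name
               OR target.price != source.price)
  THEN UPDATE SET _valid_until = CURRENT_TIMESTAMP(), _is_current = FALSE
WHEN NOT MATCHED BY SOURCE AND target._is_current
  THEN UPDATE SET _valid_until = CURRENT_TIMESTAMP(), _is_current = FALSE
WHEN NOT MATCHED BY TARGET
  THEN INSERT (product_id, product_name, price, _valid_from, _is_current)
       VALUES (source.product_id, source.product_name, source.price,
               CURRENT_TIMESTAMP(), TRUE)`, target, source)
}

func main() {
	fmt.Println(scd2MergeSkeleton("warehouse.product_history", "SELECT ..."))
}
```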
The full-refresh path for SCD2 is also handled. When you run bruin run --full-refresh, the SCD2 strategies produce a CREATE OR REPLACE TABLE with the SCD2 tracking columns pre-populated, partitioned by _valid_from by default.
The practical difference: in dbt, adding SCD2 tracking to an existing model means moving it to a new directory, changing the resource type, and potentially refactoring your DAG. In Bruin, you change one line of YAML from strategy: create+replace to strategy: scd2_by_column and add column definitions.
Time interval and microbatch
Both tools have a concept of time-windowed processing, but they frame it differently.
dbt's microbatch strategy (introduced more recently) processes data in configurable time batches:
{{ config(
materialized='incremental',
incremental_strategy='microbatch',
unique_key='id',
event_time='event_time',
batch_size='day',
begin=modules.datetime.datetime(2020, 1, 1, 0, 0, 0)
) }}
The MicrobatchBuilder class (in core/dbt/materializations/incremental/microbatch.py) splits the time range into batches and processes each one. It supports hour, day, month, and year granularity, with a lookback parameter for reprocessing recent batches.
Bruin's time_interval strategy works with explicit start and end dates passed via CLI:
bruin run --start-date "2024-03-01" --end-date "2024-03-31" assets/orders.sql
The strategy deletes the time window and reinserts:
queries := []string{
"BEGIN TRANSACTION",
fmt.Sprintf("DELETE FROM %s WHERE %s BETWEEN '%s' AND '%s'",
asset.Name, asset.Materialization.IncrementalKey, startVar, endVar),
fmt.Sprintf("INSERT INTO %s %s", asset.Name, query),
"COMMIT TRANSACTION",
}
dbt's microbatch is more automatic (it figures out the batches for you). Bruin's time_interval gives you more explicit control over the window. If you need to backfill March 2024, you say exactly that. With dbt, you'd configure the begin date and let the batching logic work it out.
The full-refresh escape hatch
Both tools let you bypass incremental logic and do a full rebuild.
dbt uses dbt run --full-refresh, which tells all incremental models to rebuild from scratch.
Bruin uses bruin run --full-refresh, and the behavior is encoded in the materializer. When FullRefresh is true, the strategy gets overridden to create+replace for all table materializations, with two exceptions:
- DDL strategy is never overridden (it only creates the table structure, so dropping it would lose data)
- Assets with refresh_restricted: true keep their normal strategy
That refresh_restricted flag is a nice touch. If you have a table that takes hours to rebuild, or one with external dependencies that would break if dropped, you can protect it from accidental full refreshes while still running --full-refresh on the rest of your pipeline.
What this tells us about each tool
dbt's materialization system reflects its origins as a "transforms" tool. It started with the simplest question (view or table?) and added incrementality as a sub-concern. SCD2 got bolted on as a separate concept (snapshots) because it's a fundamentally different workflow from transformation. The Jinja macro system makes everything customizable, at the cost of complexity and indirection.
Bruin started with a wider view of what a data pipeline asset needs. Materialization is a flat list of strategies because the team treated all the ways you might want to persist a SELECT query as peers. SCD2, merge with custom column logic, time-interval backfilling: they're all just strategies on the same axis. The code generation happens in Go functions that are explicit and traceable, but not customizable from outside the tool.
If you're building a team that will need custom materializations or uses heavily customized dbt packages, dbt's macro system gives you that flexibility. If you want a tool where "change the strategy" means changing one word in a YAML config and the generated SQL is predictable and inspectable, Bruin's approach is compelling. The fact that SCD2 is just another strategy, not a separate resource type, reduces the conceptual overhead for teams that need historical tracking.
I'd recommend looking at both tools' actual materialization code if you're evaluating them. The Go functions in Bruin's pkg/bigquery/materialization.go (or the equivalent for your warehouse) show you exactly what SQL you'll get. For dbt, start with core/dbt/task/run.py and follow the macro dispatch chain into dbt-adapters. Both are well-engineered, and reading the code tells you more than any comparison post ever could.