DEV Community

linou518


dbt + BigQuery Production Pipeline Patterns: Incremental Strategies, Partition Tuning, and Data Contracts

Nigerian fintech company Kuda ran into a problem your team may recognize.

After their data grew, full-refresh model run times ballooned from minutes to hours. Worse, they were scanning a 2TB table every hour just to retrieve a few MB of new data. Since BigQuery charges by bytes scanned, the monthly bill shot up, dashboards lagged, and the real-time requirements of their financial business became impossible to meet.

This wasn't an architecture failure — it was the wrong data pipeline pattern for the job.

The core of this article: building production-grade incremental data pipelines with dbt on BigQuery, and using Data Contracts to prevent schema changes from cascading into production failures.


Why Full Refresh Doesn't Scale

Full Refresh logic is simple: every run, recompute the entire table from scratch.

When data is small, this works. But data has a growth curve:

  • 1M rows today: 30 seconds
  • 10M rows next year: 5 minutes
  • 100M rows the year after: 1 hour, cost ×100

BigQuery's on-demand pricing charges by bytes scanned. Full refresh on large tables becomes a cost black hole.

The solution: dbt Incremental Models — only process new or changed records, dramatically cutting processing volume and cost.


Three Incremental Strategies: Scenario Drives the Choice

dbt supports three Incremental strategies on BigQuery. There's no "best" one — only the one that fits your data characteristics.

Strategy 1: Append

For insert-only data that never changes: log streams, click events, payment records (when historical records are immutable).

{{ config(materialized = 'incremental') }}
SELECT * FROM {{ source('core', 'transactions') }}
{% if is_incremental() %}
WHERE transaction_date > (SELECT MAX(transaction_date) FROM {{ this }})
{% endif %}

Note: BigQuery doesn't support a native append strategy, so you implement it via timestamp filtering. If upstream can resend data, you must add deduplication logic or you'll end up with duplicate rows.
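
A minimal sketch of that deduplication, assuming the source carries a unique `transaction_id` and an ingestion timestamp `_loaded_at` (both hypothetical column names):

```sql
{{ config(materialized = 'incremental') }}

SELECT *
FROM {{ source('core', 'transactions') }}
{% if is_incremental() %}
WHERE transaction_date > (SELECT MAX(transaction_date) FROM {{ this }})
{% endif %}
-- if upstream resends a batch, keep only the latest copy of each transaction
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY transaction_id
  ORDER BY _loaded_at DESC
) = 1
```

Note that this only dedupes rows arriving in the same run; duplicates that span runs are better handled by the merge strategy below.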

Strategy 2: Insert Overwrite ★ BigQuery's Sweet Spot

For date-partitioned aggregation tables where data changes but is limited to specific date ranges.

{{ config(
  materialized = 'incremental',
  incremental_strategy = 'insert_overwrite',
  partition_by = { 'field': 'transaction_date', 'data_type': 'date' }
) }}

This replaces entire partitions rather than merging row by row. Late-arriving data is captured by recomputing today's partition. It avoids complex row-level MERGE operations.

For BigQuery environments, this strategy is the best price-performance starting point.
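
Putting it together, a sketch of a daily aggregation model with a fixed lookback window (the column names and the 3-day window are assumptions for illustration, not from the case study):

```sql
{{ config(
  materialized = 'incremental',
  incremental_strategy = 'insert_overwrite',
  partition_by = { 'field': 'transaction_date', 'data_type': 'date' }
) }}

SELECT
  transaction_date,
  merchant_id,
  COUNT(*)    AS txn_count,
  SUM(amount) AS total_amount
FROM {{ source('core', 'transactions') }}
{% if is_incremental() %}
-- recompute the last 3 days so late-arriving rows land in the right partition
WHERE transaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}
GROUP BY transaction_date, merchant_id
```

With dbt's dynamic insert_overwrite behavior, only the partitions present in this SELECT's output are replaced, so each run rewrites just those three days.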

Strategy 3: Merge (Upsert)

For dimension tables with primary keys that evolve over time: customer profiles, account balances, user status.

{{ config(
  materialized = 'incremental',
  incremental_strategy = 'merge',
  unique_key = 'customer_id'
) }}

dbt compiles this to a native BigQuery MERGE statement. unique_key selection is critical — choosing the wrong key causes duplicate data. Confirm which field is truly unique with the business side before designing.
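
A fuller sketch of a merge model, assuming the source exposes an `updated_at` change timestamp (hypothetical column names):

```sql
{{ config(
  materialized = 'incremental',
  incremental_strategy = 'merge',
  unique_key = 'customer_id'
) }}

SELECT
  customer_id,
  customer_status,
  updated_at
FROM {{ source('core', 'customers') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
-- one row per key per run: BigQuery MERGE errors out if several
-- source rows match the same target row
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id ORDER BY updated_at DESC
) = 1
```

The QUALIFY clause is the safety net for the unique_key warning above: it guarantees each key appears once per batch before the MERGE runs.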


BigQuery-Specific Tuning: Partition + Clustering

Incremental strategy solves "how much data to process." Partition and Clustering solve "how to scan efficiently."

Partition Configuration

{{ config(
  materialized = 'table',
  partition_by = {
    "field": "created_at",
    "data_type": "timestamp",
    "granularity": "day"
  },
  require_partition_filter = true,
  partition_expiration_days = 90
) }}

What these settings actually do:

  • require_partition_filter = true: Forces all queries to include a partition filter, preventing engineers from accidentally writing full-table scans. For event tables over 100GB, this is a mandatory guardrail.
  • partition_expiration_days = 90: Automatically deletes partitions older than 90 days, no manual cleanup needed, direct storage cost reduction.
  • Partition filters must use literal values, not subqueries — otherwise BigQuery can't trigger partition pruning and effectively scans the full table.
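
To make the last point concrete, a hedged example against a hypothetical `analytics.events` table partitioned by day on `created_at`:

```sql
-- Prunes: the partition is known at planning time, so one day is scanned
SELECT COUNT(*) FROM analytics.events
WHERE DATE(created_at) = '2024-06-01';

-- Doesn't prune: the value is only known at run time, so BigQuery
-- falls back to scanning every partition
SELECT COUNT(*) FROM analytics.events
WHERE DATE(created_at) = (SELECT MAX(report_date) FROM analytics.calendar);
```

If the date genuinely has to come from another table, compute it in a prior step and interpolate it as a literal.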

Clustering

On top of partitioning, sort by high-frequency filter columns (user_id, event_type, status) to accelerate WHERE conditions and aggregations.

Best combination: Date partition + user_id clustering = scanning an extremely small data subset when querying "a specific user's activity on a specific day."
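
In dbt-bigquery, that combination is one `cluster_by` line on top of the partition config (field names here are illustrative):

```sql
{{ config(
  materialized = 'incremental',
  incremental_strategy = 'insert_overwrite',
  partition_by = { 'field': 'event_date', 'data_type': 'date' },
  cluster_by = ['user_id', 'event_type']
) }}
```

Column order matters: BigQuery sorts by the first clustering column, then the second, so put the most frequently filtered column first.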


Data Contract: Catch Schema Breaks in CI/CD

Every data team has experienced this scenario:

The product team changed user_status from STRING to INTEGER. No notification to the data engineering team. The next morning, dbt models crash, Looker dashboards go all red, ML pipelines break. Three hours to find the root cause.

The purpose of Data Contracts is to stop this before it happens.

models:
  - name: fct_user_activity
    contract:
      enforced: true
    columns:
      - name: user_status
        data_type: string
        constraints:
          - type: not_null
      - name: user_id
        data_type: integer
        constraints:
          - type: not_null
          - type: unique

With contract.enforced: true, dbt validates at compile time whether the model's output schema matches the YAML definition. If it doesn't, CI/CD fails and breaking changes can't reach production.

This is a formal agreement between data producers and consumers, turning schema changes from "surprises" into "controlled conversations."


Pipeline Layering: The Medallion Pattern

Raw Layer → Staging Layer → Intermediate Layer → Mart Layer
(raw data)   (light cleaning)  (business logic)   (analytics-ready)

Recommended directory structure:

models/
  staging/       # one subdirectory per source system
  intermediate/  # cross-source joins, business middle layer
  marts/         # by business domain (finance/product/operations)

Materialization recommendations per layer:

  • Staging: view or ephemeral (no storage, computed at query time)
  • Intermediate: table (frequently reused intermediate results)
  • Mart: incremental (the final large tables facing analytics queries)
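
These per-layer defaults can be set once in `dbt_project.yml` instead of repeating them in every model (the project name below is a placeholder):

```yaml
models:
  my_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: table
    marts:
      +materialized: incremental
```

Incremental marts still need their strategy, unique_key, and partition settings in each model's own config block; the folder default only sets the materialization.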

Implementation Priority Guide

  • On-demand BigQuery tables larger than 100GB → introduce insert_overwrite + date partitioning first
  • Large event tables → add require_partition_filter = true
  • Cross-team collaboration or multiple upstream sources → introduce Data Contracts + CI validation
  • Historical retention exceeding business needs → set partition_expiration_days
  • Dimension tables that need updates → use the merge strategy with a unique_key

Conclusion

Scaling data pipelines isn't about bigger machines — it's about smarter data processing strategies.

dbt's Incremental Models solve "only process what's necessary." BigQuery's Partition + Clustering solve "scan efficiently." Data Contracts solve "cross-team schema changes cascading into production failures."

Combine these three, and you have the modern data platform on BigQuery working the way it should.


Sources: dbt Developer Blog (Kuda fintech case study); dbt BigQuery Configs official documentation; Data Contracts with dbt+BigQuery practice
