Nigerian fintech company Kuda ran into a problem your team may recognize.
As their data grew, full-refresh model runs ballooned from minutes to hours. Worse: they were scanning a 2TB table every hour just to retrieve a few MB of new data. BigQuery charges by bytes scanned, so monthly bills shot up. Dashboards lagged, and the real-time requirements of their financial business became impossible to meet.
This wasn't an architecture failure — it was the wrong data pipeline pattern for the job.
The core of this article: building production-grade incremental data pipelines with dbt on BigQuery, and using Data Contracts to prevent schema changes from cascading into production failures.
Why Full Refresh Doesn't Scale
Full Refresh logic is simple: every run, recompute the entire table from scratch.
When data is small, this works. But data has a growth curve:
- 1M rows today: 30 seconds
- 10M rows next year: 5 minutes
- 100M rows the year after: 1 hour, cost ×100
BigQuery's on-demand pricing charges by bytes scanned. Full refresh on large tables becomes a cost black hole.
The solution: dbt Incremental Models — only process new or changed records, dramatically cutting processing volume and cost.
Three Incremental Strategies: Scenario Drives the Choice
dbt supports three Incremental strategies on BigQuery. There's no "best" one — only the one that fits your data characteristics.
Strategy 1: Append
For insert-only data that never changes: log streams, click events, payment records (when historical records are immutable).
```sql
{{ config(materialized = 'incremental') }}

SELECT * FROM {{ source('core', 'transactions') }}
{% if is_incremental() %}
WHERE transaction_date > (SELECT MAX(transaction_date) FROM {{ this }})
{% endif %}
```
Note: BigQuery doesn't support a native append strategy, so you implement it via timestamp filtering. If upstream can resend data, you must add deduplication logic or you'll end up with duplicate rows.
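One way to add that deduplication is a window function over the incoming batch. A minimal sketch, assuming hypothetical `transaction_id` and `loaded_at` columns on the source (adjust to whatever uniquely identifies a record in your data):

```sql
{{ config(materialized = 'incremental') }}

SELECT * FROM {{ source('core', 'transactions') }}
{% if is_incremental() %}
WHERE transaction_date > (SELECT MAX(transaction_date) FROM {{ this }})
{% endif %}
-- Keep only the latest copy of each transaction_id within the new batch;
-- BigQuery's QUALIFY clause filters on the window function directly.
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY transaction_id
    ORDER BY loaded_at DESC
) = 1
```

Note this only deduplicates within a single batch; if upstream can resend rows that were already loaded in a previous run, you'd need the merge strategy instead.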
Strategy 2: Insert Overwrite ★ BigQuery's Sweet Spot
For date-partitioned aggregation tables where data changes but is limited to specific date ranges.
```sql
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = { 'field': 'transaction_date', 'data_type': 'date' }
) }}
```
This replaces entire partitions rather than merging row by row. Late-arriving data is captured by recomputing today's partition. It avoids complex row-level MERGE operations.
This is the highest cost-performance starting point for BigQuery environments.
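A fuller sketch of what such a model might look like, with an illustrative daily aggregation (the column names and the 3-day lookback window are assumptions; tune the window to how late your data typically arrives):

```sql
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = { 'field': 'transaction_date', 'data_type': 'date' }
) }}

SELECT
    transaction_date,
    customer_id,
    SUM(amount) AS total_amount
FROM {{ source('core', 'transactions') }}
{% if is_incremental() %}
-- Recompute only the last few partitions; late-arriving rows inside this
-- window are picked up because whole partitions are replaced, not merged.
WHERE transaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}
GROUP BY 1, 2
```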
Strategy 3: Merge (Upsert)
For dimension tables with primary keys that evolve over time: customer profiles, account balances, user status.
```sql
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = 'customer_id'
) }}
```
dbt compiles this to a native BigQuery MERGE statement. unique_key selection is critical — choosing the wrong key causes duplicate data. Confirm which field is truly unique with the business side before designing.
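As a sketch, a full merge model might look like this, assuming the source carries an `updated_at` change timestamp (an illustrative column, not from the case study):

```sql
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = 'customer_id'
) }}

SELECT
    customer_id,
    status,
    updated_at
FROM {{ source('core', 'customers') }}
{% if is_incremental() %}
-- Pull only rows changed since the last run; the compiled MERGE
-- matches on customer_id and updates existing rows or inserts new ones.
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
```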
BigQuery-Specific Tuning: Partition + Clustering
Incremental strategy solves "how much data to process." Partition and Clustering solve "how to scan efficiently."
Partition Configuration
```sql
{{ config(
    materialized = 'table',
    partition_by = {
        "field": "created_at",
        "data_type": "timestamp",
        "granularity": "day"
    },
    require_partition_filter = true,
    partition_expiration_days = 90
) }}
```
What these settings actually do:
- require_partition_filter = true: forces every query to include a partition filter, preventing engineers from accidentally writing full-table scans. For event tables over 100GB, this is a mandatory guardrail.
- partition_expiration_days = 90: automatically deletes partitions older than 90 days; no manual cleanup needed, and a direct storage cost reduction.
- Partition filters must use literal values, not subqueries; otherwise BigQuery can't apply partition pruning and effectively scans the full table.
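To make the pruning point concrete, here's an illustrative pair of filters on a date-partitioned table:

```sql
-- Defeats pruning: BigQuery can't resolve the subquery at planning time,
-- so it can't know which partitions to skip.
WHERE transaction_date > (SELECT MAX(transaction_date) FROM {{ this }})

-- Enables pruning: a constant expression resolvable at planning time.
WHERE transaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
```

This is one reason the insert_overwrite examples above use a fixed lookback window rather than a MAX() subquery.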
Clustering
On top of partitioning, sort by high-frequency filter columns (user_id, event_type, status) to accelerate WHERE conditions and aggregations.
Best combination: Date partition + user_id clustering = scanning an extremely small data subset when querying "a specific user's activity on a specific day."
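In dbt, clustering is the `cluster_by` config, stacked on top of the partition config. A sketch with illustrative column names:

```sql
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = { 'field': 'event_date', 'data_type': 'date' },
    -- Cluster order matters: put the most selective, most frequently
    -- filtered column first.
    cluster_by = ['user_id', 'event_type']
) }}
```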
Data Contract: Catch Schema Breaks in CI/CD
Every data team has experienced this scenario:
The product team changed user_status from STRING to INTEGER. No notification to the data engineering team. The next morning, dbt models crash, Looker dashboards go all red, ML pipelines break. Three hours to find the root cause.
The purpose of Data Contracts is to stop this before it happens.
```yaml
models:
  - name: fct_user_activity
    config:
      contract:
        enforced: true
    columns:
      - name: user_status
        data_type: string
        constraints:
          - type: not_null
      - name: user_id
        data_type: integer
        constraints:
          - type: not_null
          - type: unique
```
With contract.enforced: true, dbt validates at compile time whether the model's output schema matches the YAML definition. If it doesn't, CI/CD fails and breaking changes can't reach production.
This is a formal agreement between data producers and consumers, turning schema changes from "surprises" into "controlled conversations."
Pipeline Layering: The Medallion Pattern
Raw Layer → Staging Layer → Intermediate Layer → Mart Layer
(raw data) (light cleaning) (business logic) (analytics-ready)
Recommended directory structure:
models/
staging/ # one subdirectory per source system
intermediate/ # cross-source joins, business middle layer
marts/ # by business domain (finance/product/operations)
Materialization recommendations per layer:
- Staging: view or ephemeral (no storage, computed at query time)
- Intermediate: table (frequently reused intermediate results)
- Mart: incremental (the final large tables facing analytics queries)
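These layer-wide defaults can be set once in dbt_project.yml instead of per model. A sketch, assuming a hypothetical project named `analytics` and the directory layout above:

```yaml
# dbt_project.yml
models:
  analytics:
    staging:
      +materialized: view
    intermediate:
      +materialized: table
    marts:
      +materialized: incremental
```

Individual models can still override these defaults in their own config() block where needed.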
Implementation Priority Guide
| Scenario | Recommended Action |
|---|---|
| BigQuery has on-demand tables >100GB | Introduce insert_overwrite + date partition first |
| Large event tables | Add require_partition_filter = true |
| Cross-team collaboration or multiple upstream sources | Introduce Data Contract + CI validation |
| Historical data retention exceeds business needs | Set partition_expiration_days |
| Dimension tables need update capability | Use merge strategy + unique_key |
Conclusion
Scaling data pipelines isn't about bigger machines — it's about smarter data processing strategies.
dbt's Incremental Models solve "only process what's necessary." BigQuery's Partition + Clustering solve "scan efficiently." Data Contracts solve "cross-team schema changes cascading into production failures."
Combine these three, and you have the modern data platform on BigQuery working the way it should.
Sources: dbt Developer Blog (Kuda fintech case study); dbt BigQuery Configs official documentation; Data Contracts with dbt+BigQuery practice