Remember that feeling when you discover a shortcut that saves you hours every week? That's what incremental processing did for our data pipeline.
Let me tell you about the moment I realized we had a problem.
The Wake-Up Call
It was a Tuesday morning. I kicked off our daily pipeline run and went to grab coffee. When I came back 10 minutes later, it was still running. Processing 50,000 account records shouldn't take this long, I thought.
I checked the logs. The pipeline was reprocessing every single record from scratch. All 50,000 of them. Even though only 47 accounts had actually changed since yesterday.
We were doing the equivalent of repainting your entire house every time you scuff one wall.
The Naive Approach (What We Were Doing)
Here's what our original pipeline looked like:
-- Every day, process EVERYTHING
CREATE OR REPLACE TABLE account_summary AS
SELECT
    account_id,
    balance,
    calculate_interest(balance, has_loan) as interest,
    balance + interest as new_balance
FROM accounts
Day 1: Process 50,000 accounts → 10 minutes
Day 2: Process 50,000 accounts (47 changed) → 10 minutes
Day 3: Process 50,000 accounts (23 changed) → 10 minutes
You see the problem. We're wasting 99.9% of our compute on unchanged data.
The Incremental Mindset
The solution seems obvious in hindsight: only process what changed. But how do you know what changed?
This is where that loaded_at timestamp from Part 1 becomes crucial. Remember when we added it to the source layer? This is why.
Here's the mental model: Every record has a timestamp showing when it was last modified. Your pipeline remembers when it last ran. On the next run, you only process records modified after that timestamp.
Think of it like checking your email. You don't re-read every email you've ever received. You just check for new ones since you last looked.
The Implementation
Let's look at how this actually works in code. I'll show you the pattern we use in our marts layer:
{{
    config(
        materialized='incremental',
        unique_key='account_id'
    )
}}

SELECT
    account_id,
    customer_id,
    balance as original_balance,
    calculate_interest(balance, has_loan) as interest,
    balance + interest as new_balance,
    CURRENT_TIMESTAMP() as calculated_at
FROM {{ ref('intermediate_accounts') }}

{% if is_incremental() %}
-- Only process records that changed since last run
WHERE valid_from_at > (
    SELECT MAX(calculated_at)
    FROM {{ this }}
)
{% endif %}
Let me break down what's happening here:
First run (table doesn't exist yet):
- is_incremental() returns false
- Process all 50,000 records
- Takes 10 minutes
- Each record gets a calculated_at timestamp

Second run (table exists):

- is_incremental() returns true
- Find the latest calculated_at timestamp (let's say it's yesterday at 2 AM)
- Only process records where valid_from_at > yesterday at 2 AM
- That's just 47 records
- Takes 6 seconds
The magic: The unique_key='account_id' tells dbt to merge the results into the existing table. If account A001 appears in the new data, its existing row gets updated. If account A999 is new, a new row gets inserted.
The Merge Strategy
Here's what actually happens in the database during an incremental run:
-- Simplified version of what the database does
MERGE INTO account_summary AS target
USING (
    -- Your incremental query results
    SELECT * FROM new_records
) AS source
ON target.account_id = source.account_id
WHEN MATCHED THEN
    UPDATE SET
        original_balance = source.original_balance,
        interest = source.interest,
        new_balance = source.new_balance,
        calculated_at = source.calculated_at
WHEN NOT MATCHED THEN
    INSERT (account_id, customer_id, original_balance, interest, new_balance, calculated_at)
    VALUES (source.account_id, source.customer_id, source.original_balance,
            source.interest, source.new_balance, source.calculated_at)
Changed records get updated. New records get inserted. Unchanged records? Untouched. Exactly what we want.
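dbt picks a default strategy for your adapter, but you can make the merge behavior explicit in the model config if you prefer. A minimal sketch (the exact strategies available depend on your warehouse adapter):

{#- make the merge strategy explicit instead of relying on the adapter default -#}
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='account_id'
    )
}}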
The Edge Cases (Where Things Get Tricky)
Of course, it's never quite that simple. Here are the gotchas we ran into:
Problem 1: Late-Arriving Data
Sometimes data shows up late. An account update from Monday arrives on Wednesday. Your incremental logic already processed Tuesday's data, so it misses the Monday update.
Solution: Add a lookback window.
{% if is_incremental() %}
WHERE valid_from_at > (
    SELECT MAX(calculated_at) - INTERVAL '3 days'  -- Look back 3 days
    FROM {{ this }}
)
{% endif %}
Now you reprocess the last 3 days of data on every run. It's a bit redundant, but it catches late arrivals. We found 3 days was the sweet spot—long enough to catch stragglers, short enough to stay fast.
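A small refinement, sketched here with a hypothetical lookback_days project variable that isn't in our original model: pull the window size into a var so you can widen it from the command line without editing the SQL.

{% if is_incremental() %}
WHERE valid_from_at > (
    -- lookback_days is a hypothetical dbt var; falls back to 3 if not set
    SELECT MAX(calculated_at) - INTERVAL '{{ var("lookback_days", 3) }} days'
    FROM {{ this }}
)
{% endif %}

Then something like dbt run --select account_summary --vars 'lookback_days: 7' widens the window for a one-off backfill.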
Problem 2: The Empty Table Trap
On the very first run the table doesn't exist, so is_incremental() returns false and you're safe. The trap is when the table exists but has no rows yet (a failed backfill, a freshly cloned environment): MAX(calculated_at) returns NULL, the comparison valid_from_at > NULL is never true, and the incremental run silently processes nothing.
Solution: Use COALESCE with a fallback date.
{% if is_incremental() %}
WHERE valid_from_at > COALESCE(
    (SELECT MAX(calculated_at) FROM {{ this }}),
    '1900-01-01'::timestamp  -- Fallback: process everything
)
{% endif %}
If the table is empty, fall back to a date in the distant past, which effectively processes all records. Simple and bulletproof.
Problem 3: Schema Changes
You add a new column to your calculation. Now what? The incremental logic will only update new records. Old records won't have the new column.
Solution: Full refresh when needed.
# Normal incremental run
dbt run --select account_summary
# Force full refresh (reprocess everything)
dbt run --select account_summary --full-refresh
We run full refreshes:
- After schema changes
- After logic changes that affect all records
- Once a month as a sanity check
The rest of the time? Incremental all the way.
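One quick way to tell whether a full refresh is overdue after a logic change: look for rows whose calculated_at predates the deploy. A sketch (the cutoff date below is hypothetical; use whatever day you shipped the change):

-- Rows never reprocessed since the logic change (hypothetical cutoff date)
SELECT COUNT(*) AS stale_rows
FROM account_summary
WHERE calculated_at < '2024-06-01'::timestamp

If stale_rows isn't zero, it's time for the --full-refresh.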
The Performance Numbers
Let me show you the actual impact on our pipeline:
Before incremental processing:
Daily run: 10 minutes 23 seconds
Weekly compute cost: $47
Records processed per day: 50,000
Records actually changed: ~50 (0.1%)
After incremental processing:
Daily run: 6 seconds
Weekly compute cost: $0.80
Records processed per day: ~50
Records actually changed: ~50 (100%)
That's a 100x speedup and a 98% cost reduction. Same results, fraction of the time and money.
When NOT to Use Incremental Processing
Incremental isn't always the answer. Here's when we stick with full refreshes:
Small datasets: If you're processing 1,000 records and it takes 5 seconds, don't bother with incremental. The complexity isn't worth it.
Frequent schema changes: If your logic changes weekly, you'll be running full refreshes anyway. Incremental adds complexity without benefit.
Complex dependencies: If your calculation depends on the entire dataset (like percentiles or rankings), incremental gets tricky. Sometimes it's easier to just reprocess everything.
Aggregations across all records: If you're calculating "total balance across all accounts," you need all records, not just changed ones.
We use incremental for:
- Row-level calculations (interest rates, classifications)
- Joins that don't require full table scans
- Transformations where each record is independent
We use full refresh for:
- Aggregations (sums, averages across all data)
- Rankings and percentiles
- Anything that needs the complete dataset
The Debugging Challenge
Here's something nobody tells you: incremental processing makes debugging harder.
With full refresh, every run is identical. With incremental, each run processes different data. A bug might only appear when certain records are processed together.
We learned to:
1. Keep detailed logs:
context.log.info(f"Processing {len(new_records)} changed records")
context.log.info(f"Date range: {min_date} to {max_date}")
context.log.info(f"Last run timestamp: {last_run}")
2. Make full refresh easy:
# One command to reprocess everything
dbt run --select account_summary --full-refresh
3. Test incremental logic:
# Unit test: Does incremental produce same results as full refresh?
def test_incremental_matches_full():
    # run_full_refresh() and run_incremental() are thin project-specific
    # wrappers that execute the dbt build and return the resulting rows
    full_results = run_full_refresh()
    incremental_results = run_incremental()
    assert full_results == incremental_results
The Compound Effect
Here's what really sold me on incremental processing: it compounds.
When your pipeline runs in 6 seconds instead of 10 minutes, you can run it more often. We went from daily runs to hourly runs. Suddenly we had near-real-time analytics.
When your compute costs drop 98%, you can afford to add more transformations. We added three new marts that would have been too expensive before.
When your pipeline is fast, people trust it more. They know they can get fresh data quickly, so they actually use it.
It's not just about speed. It's about what speed enables.
The Practical Checklist
If you want to implement incremental processing in your pipeline, here's what you need:
✅ Timestamp column: Every record needs a "last modified" timestamp
✅ Unique key: A column (or combination) that uniquely identifies each record
✅ Merge support: Your database needs to support MERGE or UPSERT operations
✅ Lookback window: Handle late-arriving data gracefully
✅ Full refresh option: For when you need to reprocess everything
✅ Monitoring: Track how many records are processed each run (see the query sketch below)
If you have these pieces, you're ready to go incremental.
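For that last monitoring item, a minimal sketch, assuming (as in the model above) that calculated_at is set once per run, so truncating it groups rows by run:

-- Rows written per run, most recent first
SELECT
    DATE_TRUNC('hour', calculated_at) AS run_hour,
    COUNT(*) AS records_written
FROM account_summary
GROUP BY 1
ORDER BY run_hour DESC
LIMIT 30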
What's Next
In Part 3, we'll talk about orchestration and data quality. Because a fast pipeline that produces wrong results is worse than a slow pipeline that produces right results.
We'll cover:
- How to orchestrate these layers so they run in the right order
- Automated testing to catch issues before production
- Monitoring and alerting when things go wrong
- The retry strategies that saved us during that Databricks outage
But for now, take a look at your pipelines. Are you reprocessing everything every time? Could you process just what changed? The performance gains might surprise you.
This is Part 2 of a 3-part series on modern data pipeline architecture.
Part 1: Modern Data Pipelines - Why Five Layers Changed Everything
Part 3: Orchestration & Data Quality (coming soon)
Want to see the full code? Check out the GitHub repository with complete source code, documentation, and production metrics.
Tech Stack: Dagster • DBT • DuckDB • Databricks • Python • Docker
Have you implemented incremental processing?
What challenges did you face? What patterns worked for you? Drop a comment below—I'd love to hear your experiences! 👇