Remember that feeling when you discover a shortcut that saves you hours every week? That's what incremental processing did for our data pipeline.
Let me tell you about the moment I realized we had a problem.
The Wake-Up Call
It was a Tuesday morning. I kicked off our daily pipeline run and went to grab coffee. When I came back 10 minutes later, it was still running. Processing 50,000 account records shouldn't take this long, I thought.
I checked the logs. The pipeline was reprocessing every single record from scratch. All 50,000 of them. Even though only 47 accounts had actually changed since yesterday.
We were doing the equivalent of repainting your entire house every time you scuff one wall.
The Naive Approach (What We Were Doing)
Here's what our original pipeline looked like:
-- Every day, process EVERYTHING
CREATE OR REPLACE TABLE account_summary AS
SELECT
    account_id,
    balance,
    calculate_interest(balance, has_loan) as interest,
    balance + interest as new_balance
FROM accounts
Day 1: Process 50,000 accounts → 10 minutes
Day 2: Process 50,000 accounts (47 changed) → 10 minutes
Day 3: Process 50,000 accounts (23 changed) → 10 minutes
You see the problem. We're wasting 99.9% of our compute on unchanged data.
The Incremental Mindset
The solution seems obvious in hindsight: only process what changed. But how do you know what changed?
This is where that loaded_at timestamp from Part 1 becomes crucial. Remember when we added it to the source layer? This is why.
Here's the mental model: Every record has a timestamp showing when it was last modified. Your pipeline remembers when it last ran. On the next run, you only process records modified after that timestamp.
Think of it like checking your email. You don't re-read every email you've ever received. You just check for new ones since you last looked.
The Implementation
Let's look at how this actually works in code. I'll show you the pattern we use in our marts layer:
{{
    config(
        materialized='incremental',
        unique_key='account_id'
    )
}}

SELECT
    account_id,
    customer_id,
    balance as original_balance,
    calculate_interest(balance, has_loan) as interest,
    balance + interest as new_balance,
    CURRENT_TIMESTAMP() as calculated_at
FROM {{ ref('intermediate_accounts') }}

{% if is_incremental() %}
-- Only process records that changed since last run
WHERE valid_from_at > (
    SELECT MAX(calculated_at)
    FROM {{ this }}
)
{% endif %}
Let me break down what's happening here:
First run (table doesn't exist yet):
- is_incremental() returns false
- Process all 50,000 records
- Takes 10 minutes
- Each record gets a calculated_at timestamp

Second run (table exists):

- is_incremental() returns true
- Find the latest calculated_at timestamp (let's say it's yesterday at 2 AM)
- Only process records where valid_from_at > yesterday at 2 AM
- That's just 47 records
- Takes 6 seconds
The magic: The unique_key='account_id' tells dbt to merge the results into the existing table. If account A001 appears in the new data, its existing row gets updated. If account A999 is new, a new row gets inserted.
The Merge Strategy
Here's what actually happens in the database during an incremental run:
-- Simplified version of what the database does
MERGE INTO account_summary AS target
USING (
    -- Your incremental query results
    SELECT * FROM new_records
) AS source
ON target.account_id = source.account_id
WHEN MATCHED THEN
    UPDATE SET
        original_balance = source.original_balance,
        interest = source.interest,
        new_balance = source.new_balance,
        calculated_at = source.calculated_at
WHEN NOT MATCHED THEN
    INSERT (account_id, customer_id, original_balance, interest, new_balance, calculated_at)
    VALUES (source.account_id, source.customer_id, source.original_balance,
            source.interest, source.new_balance, source.calculated_at)
Changed records get updated. New records get inserted. Unchanged records? Untouched. Exactly what we want.
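dbt picks a default strategy for your adapter, but you can make the merge behavior explicit in the model config if you prefer. A minimal sketch (the exact strategies available depend on your warehouse adapter):

{#- make the merge strategy explicit instead of relying on the adapter default -#}
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='account_id'
    )
}}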
The Edge Cases (Where Things Get Tricky)
Of course, it's never quite that simple. Here are the gotchas we ran into:
Problem 1: Late-Arriving Data
Sometimes data shows up late. An account update from Monday arrives on Wednesday. Your incremental logic already processed Tuesday's data, so it misses the Monday update.
Solution: Add a lookback window.
{% if is_incremental() %}
WHERE valid_from_at > (
    SELECT MAX(calculated_at) - INTERVAL '3 days'  -- Look back 3 days
    FROM {{ this }}
)
{% endif %}
Now you reprocess the last 3 days of data on every run. It's a bit redundant, but it catches late arrivals. We found 3 days was the sweet spot—long enough to catch stragglers, short enough to stay fast.
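A small refinement, sketched here with a hypothetical lookback_days project variable that isn't in our original model: pull the window size into a var so you can widen it from the command line without editing the SQL.

{% if is_incremental() %}
WHERE valid_from_at > (
    -- lookback_days is a hypothetical dbt var; falls back to 3 if not set
    SELECT MAX(calculated_at) - INTERVAL '{{ var("lookback_days", 3) }} days'
    FROM {{ this }}
)
{% endif %}

Then something like dbt run --select account_summary --vars 'lookback_days: 7' widens the window for a one-off backfill.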
Problem 2: The Empty Table Trap
On the very first run the table doesn't exist, so is_incremental() returns false and you're safe. The trap is when the table exists but has no rows yet (a failed backfill, a freshly cloned environment): MAX(calculated_at) returns NULL, the comparison valid_from_at > NULL is never true, and the incremental run silently processes nothing.
Solution: Use COALESCE with a fallback date.
{% if is_incremental() %}
WHERE valid_from_at > COALESCE(
    (SELECT MAX(calculated_at) FROM {{ this }}),
    '1900-01-01'::timestamp  -- Fallback: process everything
)
{% endif %}
If the table is empty, fall back to a date in the distant past, which effectively processes all records. Simple and bulletproof.
Problem 3: Schema Changes
You add a new column to your calculation. Now what? The incremental logic will only update new records. Old records won't have the new column.
Solution: Full refresh when needed.
# Normal incremental run
dbt run --select account_summary
# Force full refresh (reprocess everything)
dbt run --select account_summary --full-refresh
We run full refreshes:
- After schema changes
- After logic changes that affect all records
- Once a month as a sanity check
The rest of the time? Incremental all the way.
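One quick way to tell whether a full refresh is overdue after a logic change: look for rows whose calculated_at predates the deploy. A sketch (the cutoff date below is hypothetical; use whatever day you shipped the change):

-- Rows never reprocessed since the logic change (hypothetical cutoff date)
SELECT COUNT(*) AS stale_rows
FROM account_summary
WHERE calculated_at < '2024-06-01'::timestamp

If stale_rows isn't zero, it's time for the --full-refresh.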
The Performance Numbers
Let me show you the actual impact on our pipeline:
Before incremental processing:
Daily run: 10 minutes 23 seconds
Weekly compute cost: $47
Records processed per day: 50,000
Records actually changed: ~50 (0.1%)
After incremental processing:
Daily run: 6 seconds
Weekly compute cost: $0.80
Records processed per day: ~50
Records actually changed: ~50 (100%)
That's a 100x speedup and a 98% cost reduction. Same results, fraction of the time and money.
When NOT to Use Incremental Processing
Incremental isn't always the answer. Here's when we stick with full refreshes:
Small datasets: If you're processing 1,000 records and it takes 5 seconds, don't bother with incremental. The complexity isn't worth it.
Frequent schema changes: If your logic changes weekly, you'll be running full refreshes anyway. Incremental adds complexity without benefit.
Complex dependencies: If your calculation depends on the entire dataset (like percentiles or rankings), incremental gets tricky. Sometimes it's easier to just reprocess everything.
Aggregations across all records: If you're calculating "total balance across all accounts," you need all records, not just changed ones.
We use incremental for:
- Row-level calculations (interest rates, classifications)
- Joins that don't require full table scans
- Transformations where each record is independent
We use full refresh for:
- Aggregations (sums, averages across all data)
- Rankings and percentiles
- Anything that needs the complete dataset
The Debugging Challenge
Here's something nobody tells you: incremental processing makes debugging harder.
With full refresh, every run is identical. With incremental, each run processes different data. A bug might only appear when certain records are processed together.
We learned to:
1. Keep detailed logs:
context.log.info(f"Processing {len(new_records)} changed records")
context.log.info(f"Date range: {min_date} to {max_date}")
context.log.info(f"Last run timestamp: {last_run}")
2. Make full refresh easy:
# One command to reprocess everything
dbt run --select account_summary --full-refresh
3. Test incremental logic:
# Unit test: Does incremental produce same results as full refresh?
def test_incremental_matches_full():
    # run_full_refresh() and run_incremental() are thin project-specific
    # wrappers that execute the dbt build and return the resulting rows
    full_results = run_full_refresh()
    incremental_results = run_incremental()
    assert full_results == incremental_results
The Compound Effect
Here's what really sold me on incremental processing: it compounds.
When your pipeline runs in 6 seconds instead of 10 minutes, you can run it more often. We went from daily runs to hourly runs. Suddenly we had near-real-time analytics.
When your compute costs drop 98%, you can afford to add more transformations. We added three new marts that would have been too expensive before.
When your pipeline is fast, people trust it more. They know they can get fresh data quickly, so they actually use it.
It's not just about speed. It's about what speed enables.
The Practical Checklist
If you want to implement incremental processing in your pipeline, here's what you need:
✅ Timestamp column: Every record needs a "last modified" timestamp
✅ Unique key: A column (or combination) that uniquely identifies each record
✅ Merge support: Your database needs to support MERGE or UPSERT operations
✅ Lookback window: Handle late-arriving data gracefully
✅ Full refresh option: For when you need to reprocess everything
✅ Monitoring: Track how many records are processed each run (see the query sketch below)
If you have these pieces, you're ready to go incremental.
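For that last monitoring item, a minimal sketch, assuming (as in the model above) that calculated_at is set once per run, so truncating it groups rows by run:

-- Rows written per run, most recent first
SELECT
    DATE_TRUNC('hour', calculated_at) AS run_hour,
    COUNT(*) AS records_written
FROM account_summary
GROUP BY 1
ORDER BY run_hour DESC
LIMIT 30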
What's Next
In Part 3, we'll talk about orchestration and data quality. Because a fast pipeline that produces wrong results is worse than a slow pipeline that produces right results.
We'll cover:
- How to orchestrate these layers so they run in the right order
- Automated testing to catch issues before production
- Monitoring and alerting when things go wrong
- The retry strategies that saved us during that Databricks outage
But for now, take a look at your pipelines. Are you reprocessing everything every time? Could you process just what changed? The performance gains might surprise you.
This is Part 2 of a 3-part series on modern data pipeline architecture.
Part 1: Modern Data Pipelines - Why Five Layers Changed Everything
Part 3: Orchestration & Data Quality (coming soon)
Want to see the full code? Check out the GitHub repository with complete source code, documentation, and production metrics.
Tech Stack: Dagster • DBT • DuckDB • Databricks • Python • Docker
Have you implemented incremental processing?
What challenges did you face? What patterns worked for you? Drop a comment below—I'd love to hear your experiences! 👇