Karthikeyan Rajasekaran

The Day Our Pipeline Went From 10 Minutes to 6 Seconds (Part 2 of 3)

Remember that feeling when you discover a shortcut that saves you hours every week? That's what incremental processing did for our data pipeline.

Let me tell you about the moment I realized we had a problem.

The Wake-Up Call

It was a Tuesday morning. I kicked off our daily pipeline run and went to grab coffee. When I came back 10 minutes later, it was still running. Processing 50,000 account records shouldn't take this long, I thought.

I checked the logs. The pipeline was reprocessing every single record from scratch. All 50,000 of them. Even though only 47 accounts had actually changed since yesterday.

We were doing the equivalent of repainting your entire house every time you scuff one wall.

The Naive Approach (What We Were Doing)

Here's what our original pipeline looked like:

-- Every day, process EVERYTHING
CREATE OR REPLACE TABLE account_summary AS
SELECT 
    account_id,
    balance,
    calculate_interest(balance, has_loan) as interest,
    balance + interest as new_balance
FROM accounts

Day 1: Process 50,000 accounts → 10 minutes
Day 2: Process 50,000 accounts (47 changed) → 10 minutes
Day 3: Process 50,000 accounts (23 changed) → 10 minutes

You see the problem. We're wasting 99.9% of our compute on unchanged data.

The Incremental Mindset

The solution seems obvious in hindsight: only process what changed. But how do you know what changed?

This is where that loaded_at timestamp from Part 1 becomes crucial. Remember when we added it to the source layer? This is why.

Here's the mental model: Every record has a timestamp showing when it was last modified. Your pipeline remembers when it last ran. On the next run, you only process records modified after that timestamp.

Think of it like checking your email. You don't re-read every email you've ever received. You just check for new ones since you last looked.

The Implementation

Let's look at how this actually works in code. I'll show you the pattern we use in our marts layer:

{{
    config(
        materialized='incremental',
        unique_key='account_id'
    )
}}

SELECT 
    account_id,
    customer_id,
    balance as original_balance,
    calculate_interest(balance, has_loan) as interest,
    balance + interest as new_balance,
    CURRENT_TIMESTAMP() as calculated_at
FROM {{ ref('intermediate_accounts') }}

{% if is_incremental() %}
-- Only process records that changed since last run
WHERE valid_from_at > (
    SELECT MAX(calculated_at) 
    FROM {{ this }}
)
{% endif %}

Let me break down what's happening here:

First run (table doesn't exist yet):

  • is_incremental() returns false
  • Process all 50,000 records
  • Takes 10 minutes
  • Each record gets a calculated_at timestamp

Second run (table exists):

  • is_incremental() returns true
  • Find the latest calculated_at timestamp (let's say it's yesterday at 2 AM)
  • Only process records where valid_from_at > yesterday at 2 AM
  • That's just 47 records
  • Takes 6 seconds

The magic: The unique_key='account_id' tells the database to merge results. If account A001 appears in the new data, it updates the existing row. If account A999 is new, it inserts a new row.

The Merge Strategy

Here's what actually happens in the database during an incremental run:

-- Simplified version of what the database does
MERGE INTO account_summary target
USING (
    -- Your incremental query results
    SELECT * FROM new_records
) source
ON target.account_id = source.account_id
WHEN MATCHED THEN 
    UPDATE SET 
        balance = source.balance,
        interest = source.interest,
        calculated_at = source.calculated_at
WHEN NOT MATCHED THEN
    INSERT (account_id, balance, interest, calculated_at)
    VALUES (source.account_id, source.balance, source.interest, source.calculated_at)

Changed records get updated. New records get inserted. Unchanged records? Untouched. Exactly what we want.
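
dbt picks a default incremental strategy for each adapter, but on warehouses with native MERGE support (Databricks, for example) you can pin it explicitly. A minimal sketch of that config, reusing the model from above:

{{
    config(
        materialized='incremental',
        unique_key='account_id',
        incremental_strategy='merge'
    )
}}

Other adapters expose strategies like delete+insert or append instead; the end state is the same, only the SQL underneath differs.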

The Edge Cases (Where Things Get Tricky)

Of course, it's never quite that simple. Here are the gotchas we ran into:

Problem 1: Late-Arriving Data

Sometimes data shows up late. An account update from Monday arrives on Wednesday. Your incremental logic already processed Tuesday's data, so it misses the Monday update.

Solution: Add a lookback window.

{% if is_incremental() %}
WHERE valid_from_at > (
    SELECT MAX(calculated_at) - INTERVAL '3 days'  -- Look back 3 days
    FROM {{ this }}
)
{% endif %}

Now you reprocess the last 3 days of data on every run. It's a bit redundant, but it catches late arrivals. We found 3 days was the sweet spot—long enough to catch stragglers, short enough to stay fast.
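
If different environments want different windows, the hard-coded 3 can become a dbt variable. A minimal sketch, using a project variable we're calling lookback_days here (the name is illustrative, not something from the repo):

{% if is_incremental() %}
-- Same lookback filter, but the window is configurable at run time
WHERE valid_from_at > (
    SELECT MAX(calculated_at) - INTERVAL '{{ var("lookback_days", 3) }} days'
    FROM {{ this }}
)
{% endif %}

A one-off wider catch-up run is then just dbt run --select account_summary --vars '{"lookback_days": 14}'.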

Problem 2: The Empty Table Trap

What happens when the target table exists but has no rows yet, say because a failed backfill left it empty? MAX(calculated_at) returns NULL, the comparison evaluates to NULL, and your WHERE clause quietly filters out every record. (On a true first run the table doesn't exist at all, so is_incremental() is false and the filter never applies.)

Solution: Use COALESCE with a fallback date.

{% if is_incremental() %}
WHERE valid_from_at > COALESCE(
    (SELECT MAX(calculated_at) FROM {{ this }}),
    '1900-01-01'::timestamp  -- Fallback: process everything
)
{% endif %}

If the table is empty, fall back to a date in the distant past, which effectively processes all records. Simple and bulletproof.

Problem 3: Schema Changes

You add a new column to your calculation. Now what? The incremental logic only touches records that changed, so existing rows never get a value for the new column.

Solution: Full refresh when needed.

# Normal incremental run
dbt run --select account_summary

# Force full refresh (reprocess everything)
dbt run --select account_summary --full-refresh

We run full refreshes:

  • After schema changes
  • After logic changes that affect all records
  • Once a month as a sanity check

The rest of the time? Incremental all the way.
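
One more dbt knob worth knowing here: incremental models accept an on_schema_change config, which can at least add new columns to the target automatically instead of silently ignoring them. A sketch, extending the config from earlier:

{{
    config(
        materialized='incremental',
        unique_key='account_id',
        on_schema_change='append_new_columns'
    )
}}

-- New columns get created on the target table, but rows written before the
-- change stay NULL for them until the next --full-refresh backfills them.

It doesn't replace the full refresh (old rows still need that backfill), it just keeps incremental runs from drifting out of sync with the schema in the meantime.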

The Performance Numbers

Let me show you the actual impact on our pipeline:

Before incremental processing:

Daily run: 10 minutes 23 seconds
Weekly compute cost: $47
Records processed per day: 50,000
Records actually changed: ~50 (0.1%)

After incremental processing:

Daily run: 6 seconds
Weekly compute cost: $0.80
Records processed per day: ~50
Records actually changed: ~50 (100%)

That's a 100x speedup and a 98% cost reduction. Same results, fraction of the time and money.

When NOT to Use Incremental Processing

Incremental isn't always the answer. Here's when we stick with full refreshes:

Small datasets: If you're processing 1,000 records and it takes 5 seconds, don't bother with incremental. The complexity isn't worth it.

Frequent schema changes: If your logic changes weekly, you'll be running full refreshes anyway. Incremental adds complexity without benefit.

Complex dependencies: If your calculation depends on the entire dataset (like percentiles or rankings), incremental gets tricky. Sometimes it's easier to just reprocess everything.

Aggregations across all records: If you're calculating "total balance across all accounts," you need all records, not just changed ones.

We use incremental for:

  • Row-level calculations (interest rates, classifications)
  • Joins that don't require full table scans
  • Transformations where each record is independent

We use full refresh for:

  • Aggregations (sums, averages across all data)
  • Rankings and percentiles
  • Anything that needs the complete dataset
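
For that last bucket, the model simply stays a plain table build. A minimal sketch of what such an aggregate mart can look like, borrowing column names from the account_summary model above:

{{ config(materialized='table') }}

-- Needs every account, so it gets rebuilt from scratch on each run
SELECT
    COUNT(*)            AS total_accounts,
    SUM(new_balance)    AS total_balance,
    AVG(interest)       AS avg_interest,
    MAX(calculated_at)  AS as_of
FROM {{ ref('account_summary') }}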

The Debugging Challenge

Here's something nobody tells you: incremental processing makes debugging harder.

With full refresh, every run is identical. With incremental, each run processes different data. A bug might only appear when certain records are processed together.

We learned to:

1. Keep detailed logs:

context.log.info(f"Processing {len(new_records)} changed records")
context.log.info(f"Date range: {min_date} to {max_date}")
context.log.info(f"Last run timestamp: {last_run}")

2. Make full refresh easy:

# One command to reprocess everything
dbt run --select account_summary --full-refresh

3. Test incremental logic:

# Unit test: Does incremental produce same results as full refresh?
def test_incremental_matches_full():
    full_results = run_full_refresh()
    incremental_results = run_incremental()
    assert full_results == incremental_results

The Compound Effect

Here's what really sold me on incremental processing: it compounds.

When your pipeline runs in 6 seconds instead of 10 minutes, you can run it more often. We went from daily runs to hourly runs. Suddenly we had near-real-time analytics.

When your compute costs drop 98%, you can afford to add more transformations. We added three new marts that would have been too expensive before.

When your pipeline is fast, people trust it more. They know they can get fresh data quickly, so they actually use it.

It's not just about speed. It's about what speed enables.

The Practical Checklist

If you want to implement incremental processing in your pipeline, here's what you need:

Timestamp column: Every record needs a "last modified" timestamp

Unique key: A column (or combination) that uniquely identifies each record

Merge support: Your database needs to support MERGE or UPSERT operations

Lookback window: Handle late-arriving data gracefully

Full refresh option: For when you need to reprocess everything

Monitoring: Track how many records are processed each run

If you have these pieces, you're ready to go incremental.
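
For the monitoring item in that checklist, the calculated_at column the model already writes gives you a cheap audit trail. A minimal sketch of the kind of query that does it:

-- Rows written per day, straight off the calculated_at audit column
SELECT
    DATE_TRUNC('day', calculated_at) AS run_day,
    COUNT(*)                         AS records_processed
FROM account_summary
GROUP BY 1
ORDER BY run_day DESC

If records_processed suddenly jumps back to the full 50,000 on an ordinary day, the incremental filter has probably stopped doing its job.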

What's Next

In Part 3, we'll talk about orchestration and data quality. Because a fast pipeline that produces wrong results is worse than a slow pipeline that produces right results.

We'll cover:

  • How to orchestrate these layers so they run in the right order
  • Automated testing to catch issues before production
  • Monitoring and alerting when things go wrong
  • The retry strategies that saved us during that Databricks outage

But for now, take a look at your pipelines. Are you reprocessing everything every time? Could you process just what changed? The performance gains might surprise you.


This is Part 2 of a 3-part series on modern data pipeline architecture.

Part 1: Modern Data Pipelines - Why Five Layers Changed Everything

Part 3: Orchestration & Data Quality (coming soon)

Want to see the full code? Check out the GitHub repository with complete source code, documentation, and production metrics.

Tech Stack: Dagster • dbt • DuckDB • Databricks • Python • Docker


Have you implemented incremental processing?

What challenges did you face? What patterns worked for you? Drop a comment below—I'd love to hear your experiences! 👇

