<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Karthikeyan Rajasekaran</title>
    <description>The latest articles on DEV Community by Karthikeyan Rajasekaran (@karthikeyan_rajasekaran_c).</description>
    <link>https://dev.to/karthikeyan_rajasekaran_c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3640082%2Fe182c6b4-b042-429d-b268-e8e9600a22e6.png</url>
      <title>DEV Community: Karthikeyan Rajasekaran</title>
      <link>https://dev.to/karthikeyan_rajasekaran_c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/karthikeyan_rajasekaran_c"/>
    <language>en</language>
    <item>
      <title>Building Bulletproof Data Pipelines: Orchestration, Testing, and Monitoring (Part 3 of 3)</title>
      <dc:creator>Karthikeyan Rajasekaran</dc:creator>
      <pubDate>Fri, 02 Jan 2026 05:54:14 +0000</pubDate>
      <link>https://dev.to/karthikeyan_rajasekaran_c/building-bulletproof-data-pipelines-orchestration-testing-and-monitoring-part-3-of-3-20n8</link>
      <guid>https://dev.to/karthikeyan_rajasekaran_c/building-bulletproof-data-pipelines-orchestration-testing-and-monitoring-part-3-of-3-20n8</guid>
      <description>&lt;p&gt;It was 3:17 AM when my phone buzzed in a previous role. I grabbed it, squinting at the screen: "Pipeline failed: account_summary."&lt;/p&gt;

&lt;p&gt;Still half-asleep, I opened my laptop and pulled up the logs. The error message stared back at me: "Relation 'intermediate_accounts' does not exist."&lt;/p&gt;

&lt;p&gt;Wait, what? That table should exist. The intermediate layer runs before the marts layer. Why is it missing?&lt;/p&gt;

&lt;p&gt;Then I saw it. The intermediate job had failed silently 20 minutes earlier. The marts job ran anyway, looking for a table that wasn't there. The orchestration had failed.&lt;/p&gt;

&lt;p&gt;This is the moment I realized: you can have perfect transformations and blazing-fast incremental processing (see &lt;a href="https://dev.to/karthikeyan_rajasekaran_c/the-day-our-pipeline-went-from-10-minutes-to-6-seconds-part-2-of-3-3jeh"&gt;Part 2&lt;/a&gt;), but if your orchestration is broken, your pipeline is a ticking time bomb.&lt;/p&gt;

&lt;p&gt;That experience inspired me to build an example pipeline demonstrating the right patterns for orchestration, testing, and monitoring. Let me show you what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Orchestration Problem
&lt;/h2&gt;

&lt;p&gt;Here's what I got wrong for way too long in production: I focused on the transformations and forgot about the orchestration.&lt;/p&gt;

&lt;p&gt;I had perfect SQL, clean architecture, and blazing-fast incremental processing. But the jobs were held together with cron jobs and shell scripts. And that's how I ended up debugging at 3 AM.&lt;/p&gt;

&lt;p&gt;Let me show you the patterns I learned and implemented in my example pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Naive Approach: Cron Jobs
&lt;/h2&gt;

&lt;p&gt;The production pipeline started with cron jobs. Simple, right?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# crontab&lt;/span&gt;
0 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /scripts/run_source.sh
5 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /scripts/run_staging.sh
10 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /scripts/run_snapshots.sh
15 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /scripts/run_intermediate.sh
20 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /scripts/run_marts.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each job runs 5 minutes after the previous one. Plenty of time, right?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What could go wrong?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1&lt;/strong&gt;: The source job takes 7 minutes instead of 4. The staging job starts before source finishes. Chaos.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2&lt;/strong&gt;: The intermediate job fails. The marts job runs anyway, using stale data. Nobody notices for three days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3&lt;/strong&gt;: You need to rerun just the marts layer. You have to manually figure out which script to run and in what order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 4&lt;/strong&gt;: Someone asks "when did this job last run successfully?" You grep through logs for 20 minutes.&lt;/p&gt;

&lt;p&gt;Cron jobs work for simple tasks. For data pipelines? They're a disaster waiting to happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter Dagster: Asset-Centric Orchestration
&lt;/h2&gt;

&lt;p&gt;Switching to Dagster changed everything for production pipelines I've worked on. Not because Dagster is magic, but because it forced me to think about data, not tasks.&lt;/p&gt;

&lt;p&gt;Here's the mental shift: Instead of "run this script, then that script," you think "this data depends on that data."&lt;/p&gt;

&lt;p&gt;Let me show you what this looks like in my example pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingestion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;customers_raw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Ingest customer data from CSV&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/customers.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

&lt;span class="nd"&gt;@asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;customers_raw&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# Wait for customers_raw
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dbt_transformations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dbt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run all DBT models&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;dbt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cli&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;build&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dbt_transformations&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# Wait for transformations
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;account_summary_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Export results to CSV&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Read from database and export
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what's different? We're not saying "run at 2:05 AM." We're saying "this asset depends on that asset."&lt;/p&gt;

&lt;p&gt;Dagster figures out the order. If &lt;code&gt;customers_raw&lt;/code&gt; fails, &lt;code&gt;dbt_transformations&lt;/code&gt; doesn't run. If &lt;code&gt;dbt_transformations&lt;/code&gt; fails, &lt;code&gt;account_summary_csv&lt;/code&gt; doesn't run. Failures stop at their source instead of cascading downstream.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Asset Lineage View
&lt;/h2&gt;

&lt;p&gt;Here's where Dagster really shines. You get a visual graph of your entire pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;customers_raw ──┐
                ├──&amp;gt; dbt_transformations ──&amp;gt; account_summary_csv
accounts_raw ───┘                       └──&amp;gt; account_summary_parquet
                                        └──&amp;gt; data_quality_report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't just pretty. It's functional. Click on any asset and you see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When it last ran&lt;/li&gt;
&lt;li&gt;How long it took&lt;/li&gt;
&lt;li&gt;What data it produced&lt;/li&gt;
&lt;li&gt;What depends on it&lt;/li&gt;
&lt;li&gt;The full logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That 3 AM debugging session? Would have taken 2 minutes instead of 20 with this visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retry Logic: Because Things Fail
&lt;/h2&gt;

&lt;p&gt;Networks time out. Databases get overloaded. Cloud services have hiccups. I learned this during a Databricks outage in a previous production system.&lt;/p&gt;

&lt;p&gt;The pipeline failed. Then it retried. And succeeded. I didn't even know there was an outage until I checked the logs later.&lt;/p&gt;

&lt;p&gt;Here's the retry strategy that saved us, which I've implemented in my example pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;retry_policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RetryPolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Start with 1 second
&lt;/span&gt;        &lt;span class="n"&gt;backoff&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Backoff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EXPONENTIAL&lt;/span&gt;  &lt;span class="c1"&gt;# Double each time
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;account_summary_to_databricks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;databricks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Load data to Databricks with retry logic&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Attempt to load data
&lt;/span&gt;            &lt;span class="n"&gt;databricks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;account_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ConnectionTimeout&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;wait_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Exponential backoff
&lt;/span&gt;                &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Attempt &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; failed, retrying in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;wait_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;First attempt fails&lt;/strong&gt;: Wait 1 second, retry&lt;br&gt;
&lt;strong&gt;Second attempt fails&lt;/strong&gt;: Wait 2 seconds, retry&lt;br&gt;
&lt;strong&gt;Third attempt fails&lt;/strong&gt;: Wait 4 seconds, retry&lt;br&gt;
&lt;strong&gt;Fourth attempt fails&lt;/strong&gt;: Give up, alert humans&lt;/p&gt;

&lt;p&gt;This pattern saved us during that outage. The first few attempts failed, but by the time the third retry happened, Databricks was back up. The pipeline succeeded without waking me up.&lt;/p&gt;

&lt;p&gt;I've built this same retry logic into my example pipeline to demonstrate the pattern.&lt;/p&gt;
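&lt;p&gt;The schedule above boils down to one expression: the wait before retry n is the base delay doubled n times. A minimal sketch (the function name is illustrative, not from the example pipeline):&lt;/p&gt;

```python
def backoff_delay(attempt, base_delay=1):
    """Seconds to wait after the given failed attempt (0-indexed)."""
    return base_delay * (2 ** attempt)

# Waits between the four attempts described above: 1s, 2s, 4s
schedule = [backoff_delay(a) for a in range(3)]
```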
&lt;h2&gt;
  
  
  Data Quality: Trust But Verify
&lt;/h2&gt;

&lt;p&gt;Fast pipelines are useless if they produce wrong results. I learned this the embarrassing way in production.&lt;/p&gt;

&lt;p&gt;A business user asked why the interest calculations looked off. I checked the code. Looked fine. I checked the data. Looked fine. Then I dug deeper.&lt;/p&gt;

&lt;p&gt;Turns out, we had a bug in the staging layer. Some boolean values weren't being standardized correctly. The pipeline ran successfully every day, producing wrong results every day. For three weeks.&lt;/p&gt;

&lt;p&gt;That's when I became obsessed with testing. My example pipeline includes 99 automated tests to demonstrate comprehensive data quality patterns.&lt;/p&gt;
&lt;h3&gt;
  
  
  Layer 1: Schema Tests
&lt;/h3&gt;

&lt;p&gt;First line of defense: make sure the data structure is correct.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stg_customer&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customer_id&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;unique&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;has_loan_flag&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;accepted_values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;true&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;false&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These tests run after every transformation. If &lt;code&gt;customer_id&lt;/code&gt; has duplicates, the pipeline fails. If &lt;code&gt;has_loan_flag&lt;/code&gt; has a value other than &lt;code&gt;true&lt;/code&gt;/&lt;code&gt;false&lt;/code&gt;, the pipeline fails.&lt;/p&gt;

&lt;p&gt;Fail fast, fail loud.&lt;/p&gt;
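&lt;p&gt;If you wanted to reproduce these checks outside DBT, they are only a few lines of Python. A minimal sketch (function and fixture names are illustrative, not from the example pipeline):&lt;/p&gt;

```python
def check_schema(rows):
    """Return the failed checks for the stg_customer contract (rows: list of dicts)."""
    failures = []
    ids = [r["customer_id"] for r in rows]
    if any(i is None for i in ids):
        failures.append("not_null_customer_id")
    if len(ids) != len(set(ids)):
        failures.append("unique_customer_id")
    if any(r["has_loan_flag"] not in (True, False) for r in rows):
        failures.append("accepted_values_has_loan_flag")
    return failures

good = [{"customer_id": 1, "has_loan_flag": True},
        {"customer_id": 2, "has_loan_flag": False}]
bad = [{"customer_id": 1, "has_loan_flag": True},
       {"customer_id": 1, "has_loan_flag": "maybe"}]
```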

&lt;h3&gt;
  
  
  Layer 2: Relationship Tests
&lt;/h3&gt;

&lt;p&gt;Make sure data relationships are valid.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;int_account_with_customer&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customer_id&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;relationships&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ref('stg_customer')&lt;/span&gt;
              &lt;span class="na"&gt;field&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customer_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures every account has a valid customer. No orphaned records, no broken foreign keys.&lt;/p&gt;
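&lt;p&gt;Under the hood, a relationship test is just a set difference: which foreign keys in the child table have no matching parent? A minimal sketch (names are illustrative):&lt;/p&gt;

```python
def orphaned_keys(child_keys, parent_keys):
    """Foreign-key values in the child table with no matching parent row."""
    return sorted(set(child_keys) - set(parent_keys))

customer_ids = [101, 102, 103]
account_customer_ids = [101, 101, 104]  # 104 points at a customer that doesn't exist
```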

&lt;h3&gt;
  
  
  Layer 3: Business Logic Tests
&lt;/h3&gt;

&lt;p&gt;Make sure calculations are correct.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;account_summary&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;interest_rate_pct&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;positive_value&lt;/span&gt;  &lt;span class="c1"&gt;# Custom test&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;accepted_range&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;min_value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.01&lt;/span&gt;
              &lt;span class="na"&gt;max_value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.025&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;new_balance_amount&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;positive_value&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interest rates are stored as fractions, so valid values fall between 0.01 and 0.025 (1% to 2.5%). If we see 0.25 or 0.0001, something's wrong.&lt;/p&gt;
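&lt;p&gt;The range test reduces to a simple filter that surfaces any offending values. A minimal sketch of the idea (the function is illustrative, not DBT's implementation):&lt;/p&gt;

```python
def out_of_range(values, min_value, max_value):
    """Return the values falling outside the closed interval [min_value, max_value]."""
    return [v for v in values if v > max_value or min_value > v]

rates = [0.015, 0.02, 0.25, 0.0001]  # two valid rates, two suspicious ones
```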

&lt;h3&gt;
  
  
  Layer 4: Freshness Tests
&lt;/h3&gt;

&lt;p&gt;Make sure data is recent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;raw&lt;/span&gt;
    &lt;span class="na"&gt;tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customers_raw&lt;/span&gt;
        &lt;span class="na"&gt;freshness&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;warn_after&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;24&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;hour&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
          &lt;span class="na"&gt;error_after&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;48&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;hour&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If customer data hasn't been updated in 24 hours, warn us. If it's been 48 hours, fail the pipeline.&lt;/p&gt;

&lt;p&gt;This catches issues where ingestion jobs silently fail. The pipeline keeps running with stale data until the freshness test catches it.&lt;/p&gt;
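&lt;p&gt;The warn/error thresholds translate to a simple age check. A minimal sketch (the function name and thresholds mirror the YAML above but are illustrative):&lt;/p&gt;

```python
from datetime import datetime, timedelta

def freshness_status(last_loaded, now, warn_after_hours=24, error_after_hours=48):
    """Classify a source table by the age of its newest load."""
    age = now - last_loaded
    if age > timedelta(hours=error_after_hours):
        return "error"
    if age > timedelta(hours=warn_after_hours):
        return "warn"
    return "pass"

now = datetime(2024, 12, 1, 2, 30)
```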

&lt;h2&gt;
  
  
  The Quality Report
&lt;/h2&gt;

&lt;p&gt;After every run, we generate a quality report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-12-01T02:30:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_tests"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"failed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"pass_rate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;95.0&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"failures"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"unique_customer_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stg_customer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Found 3 duplicate customer IDs: [101, 205, 309]"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"positive_value_balance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stg_account"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Found 1 negative balance: account A042 has balance -150.00"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This report goes to Slack in production systems. If tests fail, teams investigate before the data reaches production. I've implemented this same pattern in my example pipeline.&lt;/p&gt;
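&lt;p&gt;The &lt;code&gt;summary&lt;/code&gt; block is straightforward to compute from per-test results. A minimal sketch (the result shape is assumed from the report above):&lt;/p&gt;

```python
def build_summary(results):
    """Aggregate per-test results into the report's 'summary' block."""
    passed = sum(1 for r in results if r["status"] == "pass")
    failed = len(results) - passed
    return {
        "total_tests": len(results),
        "passed": passed,
        "failed": failed,
        "pass_rate": round(100 * passed / len(results), 1),
    }

results = [{"status": "pass"}] * 38 + [{"status": "fail"}] * 2
```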

&lt;h2&gt;
  
  
  The Quarantine Pattern
&lt;/h2&gt;

&lt;p&gt;Sometimes you can't fix bad data immediately in production. Maybe it's a weekend, or the source system is down, or you need business input on how to handle it.&lt;/p&gt;

&lt;p&gt;I use a quarantine pattern in my example pipeline to demonstrate this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- stg_customer.sql&lt;/span&gt;
&lt;span class="c1"&gt;-- Good records go to stg_customer&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;src_customer&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;

&lt;span class="c1"&gt;-- Bad records go to quarantine&lt;/span&gt;
&lt;span class="c1"&gt;-- quarantine_stg_customer.sql&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt; 
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'missing_customer_id'&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'missing_customer_name'&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;quarantine_reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;quarantined_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;src_customer&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
   &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad records don't break the pipeline. They go to a quarantine table where we can review them later. The pipeline continues with good data.&lt;/p&gt;

&lt;p&gt;This saved us in production when a source system started sending records with null IDs. Instead of failing the entire pipeline, we quarantined those records and processed everything else. We fixed the source system later and reprocessed the quarantined records.&lt;/p&gt;

&lt;p&gt;I've implemented this pattern in my example pipeline to show how it works.&lt;/p&gt;
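&lt;p&gt;The same routing logic is easy to sketch outside SQL, for example as in-memory validation before a load. A minimal sketch (names are illustrative, not from the example pipeline):&lt;/p&gt;

```python
def split_quarantine(rows):
    """Route good rows through; tag bad rows with a quarantine_reason instead of failing."""
    good, quarantined = [], []
    for row in rows:
        if row.get("customer_id") is None:
            quarantined.append({**row, "quarantine_reason": "missing_customer_id"})
        elif row.get("customer_name") is None:
            quarantined.append({**row, "quarantine_reason": "missing_customer_name"})
        else:
            good.append(row)
    return good, quarantined

rows = [
    {"customer_id": 1, "customer_name": "Ada"},
    {"customer_id": None, "customer_name": "Bob"},
    {"customer_id": 3, "customer_name": None},
]
```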

&lt;h2&gt;
  
  
  Monitoring: Know What's Happening
&lt;/h2&gt;

&lt;p&gt;I learned to track three key metrics in production systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Run duration&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_output_metadata&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duration_seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;records_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a job that usually takes 6 seconds suddenly takes 60 seconds, something's wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Record counts&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_output_metadata&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_df&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filtered_records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we usually process 50 records and suddenly process 5,000, something's wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Data quality&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_output_metadata&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;null_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duplicate_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If null counts spike, something's wrong.&lt;/p&gt;
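&lt;p&gt;All three checks boil down to comparing a run's metrics against a historical baseline. Here's a simplified, hypothetical sketch of that comparison (the 3x threshold is illustrative, not something from a specific alerting tool):&lt;/p&gt;

```python
def check_metrics(current, baseline, max_ratio=3.0):
    """Flag any metric that deviates sharply from its historical baseline.

    `current` and `baseline` map metric names (duration_seconds,
    records_processed, null_count, ...) to numbers. A metric more than
    `max_ratio` times its baseline, or less than baseline/max_ratio, is flagged.
    """
    alerts = []
    for name, value in current.items():
        base = baseline.get(name)
        if not base:
            continue  # no history (or a zero baseline): nothing to compare against
        if value > base * max_ratio or value < base / max_ratio:
            alerts.append(f"{name}: {value} vs baseline {base}")
    return alerts

baseline = {"duration_seconds": 6, "records_processed": 50, "null_count": 0}
current = {"duration_seconds": 60, "records_processed": 48, "null_count": 0}
print(check_metrics(current, baseline))  # flags only duration_seconds
```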

&lt;p&gt;These metrics go to a dashboard in production. I check it every morning. If something looks off, I investigate.&lt;/p&gt;

&lt;p&gt;I've implemented the same monitoring patterns in my example pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3 AM Incident: Resolved
&lt;/h2&gt;

&lt;p&gt;Remember that 3 AM failure from the beginning? Here's how proper orchestration would have prevented it (and now does in my example pipeline):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intermediate job failed silently&lt;/li&gt;
&lt;li&gt;Marts job ran anyway&lt;/li&gt;
&lt;li&gt;Used stale data&lt;/li&gt;
&lt;li&gt;Nobody noticed for hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After (with proper orchestration)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intermediate job fails&lt;/li&gt;
&lt;li&gt;Dagster stops the pipeline&lt;/li&gt;
&lt;li&gt;Marts job doesn't run&lt;/li&gt;
&lt;li&gt;Slack alert: "Pipeline stopped at intermediate layer"&lt;/li&gt;
&lt;li&gt;Error is visible immediately&lt;/li&gt;
&lt;li&gt;Fix takes 5 minutes instead of 3 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix was simple: the job ran out of memory, so I increased the allocation. But I only caught it quickly because of proper orchestration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Checklist
&lt;/h2&gt;

&lt;p&gt;If you want reliable pipelines, you need:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Dependency management&lt;/strong&gt;: Jobs run in the right order&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Failure isolation&lt;/strong&gt;: One failure doesn't cascade&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Retry logic&lt;/strong&gt;: Transient failures resolve automatically&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Data quality tests&lt;/strong&gt;: Catch issues before production&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Quarantine pattern&lt;/strong&gt;: Bad data doesn't break the pipeline&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Monitoring&lt;/strong&gt;: Know what's happening in real-time&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Alerting&lt;/strong&gt;: Get notified when things go wrong&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Observability&lt;/strong&gt;: Debug issues quickly  &lt;/p&gt;

&lt;p&gt;Without these, you're flying blind.&lt;/p&gt;
&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Orchestration is not optional&lt;/strong&gt;: Cron jobs work for simple tasks. For data pipelines, use a proper orchestrator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Test everything&lt;/strong&gt;: Schema, relationships, business logic, freshness. If you don't test it, it will break.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Fail fast&lt;/strong&gt;: Better to catch issues early than to produce wrong results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Make debugging easy&lt;/strong&gt;: When things break (and they will), you need to diagnose quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Automate recovery&lt;/strong&gt;: Retry transient failures. Quarantine bad data. Don't wake humans for things machines can handle.&lt;/p&gt;
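&lt;p&gt;The retry half of point 5 can be sketched as a plain-Python wrapper with exponential backoff (a generic illustration; in Dagster you'd typically reach for its built-in retry policies instead):&lt;/p&gt;

```python
import time

def with_retries(fn, max_retries=3, base_delay=1.0):
    """Retry transient failures with exponential backoff before paging a human."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: now it's worth an alert
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# A flaky task that succeeds on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # prints "ok" after two retries
```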
&lt;h2&gt;
  
  
  The Final Architecture
&lt;/h2&gt;

&lt;p&gt;Here's what a complete, production-ready pipeline looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CSV Files
    ↓
[Ingestion Assets]
    ↓ (Dagster orchestration)
[DBT Transformations]
    ├─ Source Layer (raw data)
    ├─ Staging Layer (cleaned data)
    ├─ Snapshots (SCD2 history)
    ├─ Intermediate Layer (joins)
    └─ Marts Layer (analytics)
    ↓
[Data Quality Tests] (40+ tests)
    ↓
[Output Assets]
    ├─ CSV exports
    ├─ Parquet files
    └─ Databricks tables
    ↓
[Quality Report]
    └─ Slack notification
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every step is orchestrated. Every layer is tested. Every failure is caught. Every metric is tracked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;Here's what changed after we implemented proper orchestration and data quality:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline failures: 2-3 per week&lt;/li&gt;
&lt;li&gt;Mean time to detection: 4 hours&lt;/li&gt;
&lt;li&gt;Mean time to resolution: 2 hours&lt;/li&gt;
&lt;li&gt;Data quality issues in production: Weekly&lt;/li&gt;
&lt;li&gt;On-call stress level: High&lt;/li&gt;
&lt;li&gt;3 AM wake-ups: Too many&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline failures: 1-2 per month&lt;/li&gt;
&lt;li&gt;Mean time to detection: 2 minutes&lt;/li&gt;
&lt;li&gt;Mean time to resolution: 15 minutes&lt;/li&gt;
&lt;li&gt;Data quality issues in production: None in 6 months&lt;/li&gt;
&lt;li&gt;On-call stress level: Low&lt;/li&gt;
&lt;li&gt;3 AM wake-ups: Zero&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pipeline isn't perfect. Things still break. But when they do, I know immediately, and I can fix them quickly. Usually before anyone else even notices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Building a data pipeline is easy. Building a reliable data pipeline is hard.&lt;/p&gt;

&lt;p&gt;The transformations are the easy part. The orchestration, testing, monitoring, and error handling—that's where the real work is.&lt;/p&gt;

&lt;p&gt;But it's worth it. Because a pipeline that runs reliably at 3 AM, catches issues before production, and recovers from failures automatically? That's the difference between a script and a production system.&lt;/p&gt;

&lt;p&gt;And that's what lets me sleep through the night instead of waking up to Slack alerts.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 3 of a 3-part series on modern data pipeline architecture. The examples come from an open-source banking pipeline I built based on my production experience.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://dev.to/karthikeyan_rajasekaran_c/modern-data-pipelines-why-five-layers-changed-everything-part-1-of-3-1e92"&gt;Part 1: Modern Data Pipelines - Why Five Layers Changed Everything&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://dev.to/karthikeyan_rajasekaran_c/the-day-our-pipeline-went-from-10-minutes-to-6-seconds-part-2-of-3-3jeh"&gt;Part 2: The Day Our Pipeline Went From 10 Minutes to 6 Seconds&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to see the full code?&lt;/strong&gt; Check out the &lt;a href="https://github.com/ai-tech-karthik/banking-data-pipeline" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; with complete source code, documentation, and production metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack&lt;/strong&gt;: Dagster • DBT • DuckDB • Databricks • Python • Docker&lt;/p&gt;




&lt;h2&gt;
  
  
  What's your worst pipeline incident?
&lt;/h2&gt;

&lt;p&gt;How did you fix it? What lessons did you learn? Drop a comment below—I'd love to hear your war stories! 👇&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>dagster</category>
      <category>dataquality</category>
      <category>testing</category>
    </item>
    <item>
      <title>The Day Our Pipeline Went From 10 Minutes to 6 Seconds (Part 2 of 3)</title>
      <dc:creator>Karthikeyan Rajasekaran</dc:creator>
      <pubDate>Fri, 26 Dec 2025 22:05:11 +0000</pubDate>
      <link>https://dev.to/karthikeyan_rajasekaran_c/the-day-our-pipeline-went-from-10-minutes-to-6-seconds-part-2-of-3-3jeh</link>
      <guid>https://dev.to/karthikeyan_rajasekaran_c/the-day-our-pipeline-went-from-10-minutes-to-6-seconds-part-2-of-3-3jeh</guid>
      <description>&lt;p&gt;Remember that feeling when you discover a shortcut that saves you hours every week? That's what incremental processing did for production pipelines I've worked on.&lt;/p&gt;

&lt;p&gt;Let me tell you about the moment I realized we had a problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Wake-Up Call
&lt;/h2&gt;

&lt;p&gt;It was a Tuesday morning in production. I kicked off the daily pipeline run and went to grab coffee. When I came back 10 minutes later, it was still running. Processing 50,000 account records shouldn't take this long, I thought.&lt;/p&gt;

&lt;p&gt;I checked the logs. The pipeline was reprocessing every single record from scratch. All 50,000 of them. Even though only 47 accounts had actually changed since yesterday.&lt;/p&gt;

&lt;p&gt;We were doing the equivalent of repainting your entire house every time you scuff one wall.&lt;/p&gt;

&lt;p&gt;This experience inspired me to build an example pipeline demonstrating the right way to handle incremental processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Naive Approach (What I Was Seeing)
&lt;/h2&gt;

&lt;p&gt;Here's what the original production pipeline looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Every day, process EVERYTHING&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;account_summary&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;calculate_interest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;has_loan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;interest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;interest&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;new_balance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Day 1: Process 50,000 accounts → 10 minutes&lt;br&gt;
Day 2: Process 50,000 accounts (47 changed) → 10 minutes&lt;br&gt;
Day 3: Process 50,000 accounts (23 changed) → 10 minutes&lt;/p&gt;

&lt;p&gt;You see the problem. We're wasting 99.9% of our compute on unchanged data.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Incremental Mindset
&lt;/h2&gt;

&lt;p&gt;The solution seems obvious in hindsight: only process what changed. But how do you know what changed?&lt;/p&gt;

&lt;p&gt;This is where that &lt;code&gt;loaded_at&lt;/code&gt; timestamp from &lt;a href="https://dev.to/karthikeyan_rajasekaran_c/modern-data-pipelines-why-five-layers-changed-everything-part-1-of-3-1e92"&gt;Part 1&lt;/a&gt; becomes crucial. Remember when we added it to the source layer? This is why.&lt;/p&gt;

&lt;p&gt;Here's the mental model: Every record has a timestamp showing when it was last modified. Your pipeline remembers when it last ran. On the next run, you only process records modified after that timestamp.&lt;/p&gt;

&lt;p&gt;Think of it like checking your email. You don't re-read every email you've ever received. You just check for new ones since you last looked.&lt;/p&gt;
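&lt;p&gt;That mental model fits in a few lines of Python. This hypothetical sketch is the high-watermark filter in miniature; the dbt version below does the same thing in SQL:&lt;/p&gt;

```python
from datetime import datetime

def changed_since(records, high_watermark):
    """Return only the rows modified after the last successful run."""
    return [r for r in records if r["loaded_at"] > high_watermark]

records = [
    {"account_id": "A001", "loaded_at": datetime(2026, 1, 1)},
    {"account_id": "A002", "loaded_at": datetime(2026, 1, 3)},
]
# The pipeline remembers when it last ran; only A002 is newer than that.
last_run = datetime(2026, 1, 2)
todo = changed_since(records, last_run)
print([r["account_id"] for r in todo])  # prints "['A002']"
```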
&lt;h2&gt;
  
  
  The Implementation
&lt;/h2&gt;

&lt;p&gt;Let's look at how this actually works in code. I'll show you the pattern I use in my example pipeline's marts layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'incremental'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;unique_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'account_id'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;original_balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;calculate_interest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;has_loan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;interest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;interest&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;new_balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;calculated_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'intermediate_accounts'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_incremental&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;-- Only process records that changed since last run&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;valid_from_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calculated_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;this&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let me break down what's happening here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First run&lt;/strong&gt; (table doesn't exist yet):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;is_incremental()&lt;/code&gt; returns false&lt;/li&gt;
&lt;li&gt;Process all 50,000 records&lt;/li&gt;
&lt;li&gt;Takes 10 minutes&lt;/li&gt;
&lt;li&gt;Each record gets a &lt;code&gt;calculated_at&lt;/code&gt; timestamp&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Second run&lt;/strong&gt; (table exists):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;is_incremental()&lt;/code&gt; returns true&lt;/li&gt;
&lt;li&gt;Find the latest &lt;code&gt;calculated_at&lt;/code&gt; timestamp (let's say it's yesterday at 2 AM)&lt;/li&gt;
&lt;li&gt;Only process records where &lt;code&gt;valid_from_at&lt;/code&gt; &amp;gt; yesterday at 2 AM&lt;/li&gt;
&lt;li&gt;That's just 47 records&lt;/li&gt;
&lt;li&gt;Takes 6 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The magic&lt;/strong&gt;: The &lt;code&gt;unique_key='account_id'&lt;/code&gt; tells dbt to merge results rather than append them. If account A001 appears in the new data, it updates the existing row. If account A999 is new, it inserts a new row.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Merge Strategy
&lt;/h2&gt;

&lt;p&gt;Here's what actually happens in the database during an incremental run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Simplified version of what the database does&lt;/span&gt;
&lt;span class="n"&gt;MERGE&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;account_summary&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;-- Your incremental query results&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;new_records&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; 
    &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; 
        &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;interest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;calculated_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;calculated_at&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
    &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Changed records get updated. New records get inserted. Unchanged records? Untouched. Exactly what we want.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Edge Cases (Where Things Get Tricky)
&lt;/h2&gt;

&lt;p&gt;Of course, it's never quite that simple. Here are the gotchas we ran into:&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: Late-Arriving Data
&lt;/h3&gt;

&lt;p&gt;Sometimes data shows up late in production systems. An account update from Monday arrives on Wednesday. Your incremental logic already processed Tuesday's data, so it misses the Monday update.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Add a lookback window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_incremental&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;valid_from_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calculated_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'3 days'&lt;/span&gt;  &lt;span class="c1"&gt;-- Look back 3 days&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;this&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you reprocess the last 3 days of data on every run. It's a bit redundant, but it catches late arrivals. I found 3 days was the sweet spot in production—long enough to catch stragglers, short enough to stay fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 2: The Empty Table Trap
&lt;/h3&gt;

&lt;p&gt;What happens when the table exists but is empty (say, after a failed run)? On the true first run you're safe, because &lt;code&gt;is_incremental()&lt;/code&gt; returns false. But against an empty table, &lt;code&gt;MAX(calculated_at)&lt;/code&gt; returns NULL, the comparison evaluates to NULL, and your WHERE clause silently filters out every record.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Use COALESCE with a fallback date.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_incremental&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;valid_from_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calculated_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;this&lt;/span&gt; &lt;span class="p"&gt;}}),&lt;/span&gt;
    &lt;span class="s1"&gt;'1900-01-01'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;  &lt;span class="c1"&gt;-- Fallback: process everything&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the table is empty, fall back to a date in the distant past, which effectively processes all records. Simple and bulletproof.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 3: Schema Changes
&lt;/h3&gt;

&lt;p&gt;You add a new column to your calculation. Now what? The incremental logic only touches new and changed records, so existing rows end up with NULLs in the new column.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Full refresh when needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Normal incremental run&lt;/span&gt;
dbt run &lt;span class="nt"&gt;--select&lt;/span&gt; account_summary

&lt;span class="c"&gt;# Force full refresh (reprocess everything)&lt;/span&gt;
dbt run &lt;span class="nt"&gt;--select&lt;/span&gt; account_summary &lt;span class="nt"&gt;--full-refresh&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We run full refreshes in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After schema changes&lt;/li&gt;
&lt;li&gt;After logic changes that affect all records&lt;/li&gt;
&lt;li&gt;Once a month as a sanity check&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rest of the time? Incremental all the way. I've implemented this same pattern in my example pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Performance Numbers
&lt;/h2&gt;

&lt;p&gt;Let me show you the actual impact I've seen in production systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before incremental processing&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Daily run: 10 minutes 23 seconds
Weekly compute cost: $47
Records processed per day: 50,000
Records actually changed: ~50 (0.1%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After incremental processing&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Daily run: 6 seconds
Weekly compute cost: $0.80
Records processed per day: ~50
Records actually changed: ~50 (100%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a &lt;strong&gt;100x speedup&lt;/strong&gt; and a &lt;strong&gt;98% cost reduction&lt;/strong&gt;. Same results, fraction of the time and money.&lt;/p&gt;
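&lt;p&gt;Those headline figures follow directly from the raw numbers above; a quick sanity check:&lt;/p&gt;

```python
# Speedup and savings derived from the before/after numbers above.
before_seconds = 10 * 60 + 23          # 10 minutes 23 seconds
after_seconds = 6
speedup = before_seconds / after_seconds
cost_reduction = 1 - 0.80 / 47         # weekly cost: $47 -> $0.80
print(f"{speedup:.0f}x faster, {cost_reduction:.0%} cheaper")
```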

&lt;p&gt;I've replicated this pattern in my example banking pipeline to demonstrate how it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  When NOT to Use Incremental Processing
&lt;/h2&gt;

&lt;p&gt;Incremental isn't always the answer. Here's when we stick with full refreshes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small datasets&lt;/strong&gt;: If you're processing 1,000 records and it takes 5 seconds, don't bother with incremental. The complexity isn't worth it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frequent schema changes&lt;/strong&gt;: If your logic changes weekly, you'll be running full refreshes anyway. Incremental adds complexity without benefit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complex dependencies&lt;/strong&gt;: If your calculation depends on the entire dataset (like percentiles or rankings), incremental gets tricky. Sometimes it's easier to just reprocess everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aggregations across all records&lt;/strong&gt;: If you're calculating "total balance across all accounts," you need all records, not just changed ones.&lt;/p&gt;
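&lt;p&gt;A tiny example of why that is: a row-level calculation only needs the changed rows, but a median (or any percentile) needs every row, so you have to merge first and then aggregate. The account IDs and 5% rate here are purely illustrative:&lt;/p&gt;

```python
import statistics

# Yesterday's full table and today's changed rows (one balance updated).
full_table = {"A001": 100, "A002": 200, "A003": 300, "A004": 400, "A005": 500}
changed = {"A002": 250}

# Row-level calculations work fine on just the changed rows...
interest = {acct: bal * 0.05 for acct, bal in changed.items()}

# ...but a percentile needs every row: merge first, then aggregate.
full_table.update(changed)
print(statistics.median(full_table.values()))  # prints "300"
```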

&lt;p&gt;We use incremental for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row-level calculations (interest rates, classifications)&lt;/li&gt;
&lt;li&gt;Joins that don't require full table scans&lt;/li&gt;
&lt;li&gt;Transformations where each record is independent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We use full refresh for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregations (sums, averages across all data)&lt;/li&gt;
&lt;li&gt;Rankings and percentiles&lt;/li&gt;
&lt;li&gt;Anything that needs the complete dataset&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've implemented both patterns in my example pipeline to show when to use each.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Debugging Challenge
&lt;/h2&gt;

&lt;p&gt;Here's something nobody tells you: incremental processing makes debugging harder.&lt;/p&gt;

&lt;p&gt;With full refresh, every run is identical. With incremental, each run processes different data. A bug might only appear when certain records are processed together.&lt;/p&gt;

&lt;p&gt;I learned to:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Keep detailed logs&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; changed records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date range: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;min_date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Last run timestamp: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;last_run&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Make full refresh easy&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# One command to reprocess everything&lt;/span&gt;
dbt run &lt;span class="nt"&gt;--select&lt;/span&gt; account_summary &lt;span class="nt"&gt;--full-refresh&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Test incremental logic&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Unit test: Does incremental produce same results as full refresh?
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_incremental_matches_full&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;full_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_full_refresh&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;incremental_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_incremental&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;full_results&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;incremental_results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
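&lt;p&gt;To make that parity test concrete, here's a self-contained sketch using an in-memory aggregation. The record shapes and helper names are illustrative, not the actual pipeline code:&lt;/p&gt;

```python
from collections import defaultdict

# Toy records: (account_id, amount, updated_at). In a real pipeline these
# come from the warehouse; here they are in-memory for the sketch.
RECORDS = [
    ("A", 100, "2024-01-01"),
    ("B", 50, "2024-01-01"),
    ("A", 25, "2024-01-02"),  # arrives after the first run
]

def summarize(records):
    """Full-refresh path: aggregate everything from scratch."""
    totals = defaultdict(int)
    for account_id, amount, _ in records:
        totals[account_id] += amount
    return dict(totals)

def summarize_incremental(previous, new_records):
    """Incremental path: merge only new records into the prior state."""
    totals = dict(previous)
    for account_id, amount, _ in new_records:
        totals[account_id] = totals.get(account_id, 0) + amount
    return totals

def test_incremental_matches_full():
    # Run 1 sees the first two records; run 2 merges the late arrival.
    first_run = summarize(RECORDS[:2])
    incremental = summarize_incremental(first_run, RECORDS[2:])
    assert incremental == summarize(RECORDS)

test_incremental_matches_full()
```

&lt;p&gt;The point of the test is the invariant, not the toy data: however you slice the input across runs, the incremental path must converge to the same state as a full refresh.&lt;/p&gt;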



&lt;h2&gt;
  
  
  The Compound Effect
&lt;/h2&gt;

&lt;p&gt;Here's what really sold me on incremental processing in production: it compounds.&lt;/p&gt;

&lt;p&gt;When your pipeline runs in 6 seconds instead of 10 minutes, you can run it more often. I've seen teams go from daily runs to hourly runs. Suddenly they had near-real-time analytics.&lt;/p&gt;

&lt;p&gt;When your compute costs drop 98%, you can afford to add more transformations. I've seen teams add three new marts that would have been too expensive before.&lt;/p&gt;

&lt;p&gt;When your pipeline is fast, people trust it more. They know they can get fresh data quickly, so they actually use it.&lt;/p&gt;

&lt;p&gt;It's not just about speed. It's about what speed enables.&lt;/p&gt;

&lt;p&gt;I built my example pipeline to demonstrate these patterns with real code you can run and learn from.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Practical Checklist
&lt;/h2&gt;

&lt;p&gt;If you want to implement incremental processing in your pipeline, here's what you need:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Timestamp column&lt;/strong&gt;: Every record needs a "last modified" timestamp&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Unique key&lt;/strong&gt;: A column (or combination) that uniquely identifies each record&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Merge support&lt;/strong&gt;: Your database needs to support MERGE or UPSERT operations&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Lookback window&lt;/strong&gt;: Handle late-arriving data gracefully&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Full refresh option&lt;/strong&gt;: For when you need to reprocess everything&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Monitoring&lt;/strong&gt;: Track how many records are processed each run  &lt;/p&gt;

&lt;p&gt;If you have these pieces, you're ready to go incremental.&lt;/p&gt;
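&lt;p&gt;Here's a minimal pure-Python sketch of how those checklist pieces fit together: a unique key, a last-modified timestamp, a lookback window, and an upsert. The data shapes are hypothetical; in the real pipeline dbt's incremental materialization typically handles the MERGE for you:&lt;/p&gt;

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 3  # reprocess a few extra days to catch late-arriving data

def incremental_merge(target, source_rows, last_run_date):
    """Upsert rows changed since (last_run_date - lookback) into target.

    target      : dict keyed by the unique key (account_id)
    source_rows : list of dicts with 'account_id' and 'updated_at'
    """
    cutoff = last_run_date - timedelta(days=LOOKBACK_DAYS)
    changed = [r for r in source_rows if r["updated_at"] >= cutoff]
    for row in changed:
        target[row["account_id"]] = row  # MERGE: insert or overwrite by key
    return target, len(changed)

target = {"acct-1": {"account_id": "acct-1", "balance": 100,
                     "updated_at": date(2024, 1, 1)}}
source = [
    {"account_id": "acct-1", "balance": 150, "updated_at": date(2024, 1, 10)},
    {"account_id": "acct-2", "balance": 75,  "updated_at": date(2024, 1, 9)},
    {"account_id": "acct-3", "balance": 20,  "updated_at": date(2023, 12, 1)},
]
target, n = incremental_merge(target, source, last_run_date=date(2024, 1, 11))
print(f"processed {n} changed records")  # the stale acct-3 row is skipped
```

&lt;p&gt;That &lt;code&gt;len(changed)&lt;/code&gt; count is also the monitoring item from the checklist: log it every run, and a sudden zero (or a sudden spike) tells you something upstream changed.&lt;/p&gt;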

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;In Part 3, we'll talk about orchestration and data quality. Because a fast pipeline that produces wrong results is worse than a slow pipeline that produces right results.&lt;/p&gt;

&lt;p&gt;We'll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to orchestrate these layers so they run in the right order&lt;/li&gt;
&lt;li&gt;Automated testing to catch issues before production&lt;/li&gt;
&lt;li&gt;Monitoring and alerting when things go wrong&lt;/li&gt;
&lt;li&gt;The retry strategies that saved us during that Databricks outage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But for now, take a look at your pipelines. Are you reprocessing everything every time? Could you process just what changed? The performance gains might surprise you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 2 of a 3-part series on modern data pipeline architecture. The examples come from an open-source banking pipeline I built based on my production experience.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://dev.to/karthikeyan_rajasekaran_c/modern-data-pipelines-why-five-layers-changed-everything-part-1-of-3-1e92"&gt;Part 1: Modern Data Pipelines - Why Five Layers Changed Everything&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://dev.to/karthikeyan_rajasekaran_c/building-bulletproof-data-pipelines-orchestration-testing-and-monitoring-part-3-of-3-20n8"&gt;Part 3: Building Bulletproof Data Pipelines - Orchestration, Testing, and Monitoring&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to see the full code?&lt;/strong&gt; Check out the &lt;a href="https://github.com/ai-tech-karthik/banking-data-pipeline" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; with complete source code, documentation, and production metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack&lt;/strong&gt;: Dagster • DBT • DuckDB • Databricks • Python • Docker&lt;/p&gt;




&lt;h2&gt;
  
  
  Have you implemented incremental processing?
&lt;/h2&gt;

&lt;p&gt;What challenges did you face? What patterns worked for you? Drop a comment below—I'd love to hear your experiences! 👇&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>dbt</category>
      <category>dagster</category>
      <category>data</category>
    </item>
    <item>
      <title>Modern Data Pipelines: Why Five Layers Changed Everything (Part 1 of 3)</title>
      <dc:creator>Karthikeyan Rajasekaran</dc:creator>
      <pubDate>Wed, 24 Dec 2025 03:12:53 +0000</pubDate>
      <link>https://dev.to/karthikeyan_rajasekaran_c/modern-data-pipelines-why-five-layers-changed-everything-part-1-of-3-1e92</link>
      <guid>https://dev.to/karthikeyan_rajasekaran_c/modern-data-pipelines-why-five-layers-changed-everything-part-1-of-3-1e92</guid>
      <description>&lt;p&gt;I'll be honest—when I first heard about "layered data architectures," I rolled my eyes. Another buzzword, I thought. Just write some SQL, move the data, and call it a day. &lt;/p&gt;

&lt;p&gt;Then I spent three weeks debugging a production pipeline where raw data, cleaned data, and analytics were all intermingled in one giant, spaghetti-like mess. That's when it clicked.&lt;/p&gt;

&lt;p&gt;To demonstrate these lessons, I built an example banking data pipeline that captures the patterns and architecture I learned from real production systems. The code is open source, and the principles apply to any domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Here's what actually happens in most data projects:&lt;/p&gt;

&lt;p&gt;You start simple. Maybe you're pulling data from an API or reading CSV files. You write a script that cleans the data and calculates some metrics. It works! You ship it. Everyone's happy.&lt;/p&gt;

&lt;p&gt;Six months later, someone asks: "Can we see what this metric looked like last quarter?" &lt;/p&gt;

&lt;p&gt;You check the database. The old data is gone—overwritten by yesterday's run.&lt;/p&gt;

&lt;p&gt;"Can we add a new calculation without breaking the existing reports?"&lt;/p&gt;

&lt;p&gt;You look at the code. Everything is tangled together. Changing one thing breaks three others.&lt;/p&gt;

&lt;p&gt;"Why did this number change between Tuesday and Wednesday?"&lt;/p&gt;

&lt;p&gt;You have no idea. There's no audit trail.&lt;/p&gt;

&lt;p&gt;Sound familiar? This is why we need layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five-Layer Philosophy
&lt;/h2&gt;

&lt;p&gt;Think about how a restaurant kitchen works. You don't see the head chef doing everything. There's a system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Receiving dock&lt;/strong&gt;: Ingredients arrive exactly as delivered (even if the tomatoes are bruised)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prep station&lt;/strong&gt;: Wash, peel, chop—make ingredients ready to use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold storage&lt;/strong&gt;: Keep prepared ingredients fresh and organized&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cooking line&lt;/strong&gt;: Combine ingredients following recipes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plating station&lt;/strong&gt;: Final presentation for customers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each station has one job. If something goes wrong, you know exactly where to look. A data pipeline works the same way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Source (The Receiving Dock)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does&lt;/strong&gt;: Store data exactly as received. No cleaning, no transformations, no "fixing" things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters&lt;/strong&gt;: This is your insurance policy. When something goes wrong downstream (and it will), you can always come back to the original data.&lt;/p&gt;

&lt;p&gt;I learned this the hard way in production. We once had a pipeline that "cleaned" data on ingestion—converting empty strings to nulls, trimming whitespace, fixing typos. Seemed smart at the time. Then a business user asked why certain records were missing. We had no way to prove whether the data arrived that way or if our cleaning broke something.&lt;/p&gt;

&lt;p&gt;Now? I always save everything exactly as received. Here's the pattern from my example pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Source layer: Just add a timestamp&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- Everything, unchanged&lt;/span&gt;
    &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;loaded_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;raw_input&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;loaded_at&lt;/code&gt; timestamp becomes crucial later. It tells us when data arrived, which helps track down issues and enables change detection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Staging (The Prep Station)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does&lt;/strong&gt;: Clean and standardize data without changing its meaning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters&lt;/strong&gt;: Real-world data is messy. You'll see "Yes", "YES", "true", "1", "Y" all meaning the same thing. Staging normalizes this chaos.&lt;/p&gt;

&lt;p&gt;Here's an example from my demo pipeline that mirrors what I've seen in production. Customer data arrives with loan status in various formats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Before staging (the mess)&lt;/span&gt;
&lt;span class="n"&gt;CustomerID&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;HasLoan&lt;/span&gt;
&lt;span class="c1"&gt;-----------|--------&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Yes&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;YES&lt;/span&gt;  
&lt;span class="mi"&gt;3&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;no&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;

&lt;span class="c1"&gt;-- After staging (clean and consistent)&lt;/span&gt;
&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;has_loan_flag&lt;/span&gt;
&lt;span class="c1"&gt;------------|---------------&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;           &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;           &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;           &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;           &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;           &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;           &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The staging layer handles this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;TRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt; 
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;has_loan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'yes'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'y'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;has_loan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'no'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'false'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'n'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
        &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;has_loan_flag&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;source_customer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice we're not calculating anything or joining tables. We're just cleaning. One job, done well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Snapshots (The Time Machine)
&lt;/h3&gt;

&lt;p&gt;This is where things get interesting. Most pipelines overwrite data every day. Yesterday's data? Gone. Last month's data? Gone. You're flying blind.&lt;/p&gt;

&lt;p&gt;Snapshots solve this by keeping every version of every record. It's called Slowly Changing Dimension Type 2 (SCD2), but I prefer to think of it as version control for data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real scenario from production&lt;/strong&gt;: A customer's loan status changes on February 15th. Without snapshots, you only know their current status. With snapshots, you know their status on any date in history.&lt;/p&gt;

&lt;p&gt;Here's what the snapshot table looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;customer_id | has_loan | valid_from  | valid_to    | Status
------------|----------|-------------|-------------|--------
123         | false    | 2024-01-01  | 2024-02-15  | Old
123         | true     | 2024-02-15  | NULL        | Current
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The magic is in those &lt;code&gt;valid_from&lt;/code&gt; and &lt;code&gt;valid_to&lt;/code&gt; timestamps. Want to know the status on January 20th? Query where that date falls between &lt;code&gt;valid_from&lt;/code&gt; and &lt;code&gt;valid_to&lt;/code&gt;. Want current status? Query where &lt;code&gt;valid_to&lt;/code&gt; is NULL.&lt;/p&gt;
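&lt;p&gt;Here's that same as-of lookup sketched in plain Python, using the two snapshot rows from the table above. It's illustrative only; in the warehouse this is a single WHERE clause:&lt;/p&gt;

```python
from datetime import date

# The snapshot rows from the table above: one row per version of the record.
snapshots = [
    {"customer_id": 123, "has_loan": False,
     "valid_from": date(2024, 1, 1), "valid_to": date(2024, 2, 15)},
    {"customer_id": 123, "has_loan": True,
     "valid_from": date(2024, 2, 15), "valid_to": None},  # current version
]

def as_of(rows, customer_id, when):
    """Return the version of the record that was valid on `when`."""
    for row in rows:
        still_current = row["valid_to"] is None
        if (row["customer_id"] == customer_id
                and row["valid_from"] <= when
                and (still_current or when < row["valid_to"])):
            return row
    return None  # no version existed yet on that date

print(as_of(snapshots, 123, date(2024, 1, 20))["has_loan"])  # False
print(as_of(snapshots, 123, date(2024, 3, 1))["has_loan"])   # True
```

&lt;p&gt;Note the half-open interval: a row is valid from &lt;code&gt;valid_from&lt;/code&gt; up to but not including &lt;code&gt;valid_to&lt;/code&gt;, so exactly one version matches any given date.&lt;/p&gt;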

&lt;p&gt;This saved us during a production audit. Regulators asked about account balances from six months ago. Without snapshots, we would have been scrambling. With snapshots? One SQL query, done in 30 seconds.&lt;/p&gt;

&lt;p&gt;I've implemented this same pattern in my example pipeline to show how it works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4: Intermediate (The Cooking Line)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does&lt;/strong&gt;: Join data from different sources and apply business rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters&lt;/strong&gt;: This is where you start building the actual insights. But you're not calculating final metrics yet—you're preparing the ingredients.&lt;/p&gt;

&lt;p&gt;In my example pipeline, I join customer data with account data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;accounts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;accounts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;has_loan_flag&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;account_snapshots&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customer_snapshots&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; 
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_to&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;  &lt;span class="c1"&gt;-- Current records only&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_to&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why not do this in the marts layer? Because other teams might need this joined data for different calculations. Build it once, use it everywhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 5: Marts (The Plating Station)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does&lt;/strong&gt;: Final calculations and aggregations. This is what business users actually see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters&lt;/strong&gt;: This is your product. Everything before this was preparation.&lt;/p&gt;

&lt;p&gt;Here's where I calculate interest rates based on business rules in the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;original_balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;-- Business logic: Interest rate based on balance tiers&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;20000&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;015&lt;/span&gt;
        &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;base_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;-- Bonus rate for customers with loans&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;has_loan_flag&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;005&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bonus_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;-- Final calculation&lt;/span&gt;
    &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_rate&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bonus_rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;annual_interest&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;intermediate_accounts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The business logic is crystal clear. No digging through nested queries or trying to figure out where a number came from. It's right there.&lt;/p&gt;
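&lt;p&gt;A nice side effect of keeping the logic this explicit: it's easy to mirror in a plain function for unit testing. Here's an illustrative Python version of the same tiers (a test helper, not part of the pipeline itself):&lt;/p&gt;

```python
def annual_interest(balance, has_loan_flag):
    """Mirror of the marts-layer tiering, handy for unit-testing the rules."""
    if balance < 10_000:
        base_rate = 0.01
    elif balance < 20_000:
        base_rate = 0.015
    else:
        base_rate = 0.02
    bonus_rate = 0.005 if has_loan_flag else 0.0
    return balance * (base_rate + bonus_rate)

print(annual_interest(5_000, False))  # 50.0  (1% tier, no bonus)
print(annual_interest(15_000, True))  # 300.0 (1.5% tier plus 0.5% bonus)
```

&lt;p&gt;Checking the SQL and the Python helper against the same fixture rows is a cheap way to catch a tier boundary that drifts in one place but not the other.&lt;/p&gt;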

&lt;h2&gt;
  
  
  Why This Actually Works
&lt;/h2&gt;

&lt;p&gt;I've seen teams try to skip layers. "We don't need staging, we'll just clean in the source layer." Or "Why separate intermediate and marts? Let's just do it all in one query."&lt;/p&gt;

&lt;p&gt;Here's what happens:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without layers&lt;/strong&gt;: A bug in the cleaning logic corrupts your analytics. You can't tell if the issue is in the data, the cleaning, the joins, or the calculations. You're debugging everything at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With layers&lt;/strong&gt;: A bug in the cleaning logic? Check staging. Bad join? Check intermediate. Wrong calculation? Check marts. You know exactly where to look.&lt;/p&gt;

&lt;p&gt;It's like having a stack trace for your data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Performance Question
&lt;/h2&gt;

&lt;p&gt;"But doesn't this mean more tables and slower queries?"&lt;/p&gt;

&lt;p&gt;Actually, no. Here's why:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Staging is views, not tables&lt;/strong&gt;: No storage overhead. They're computed on the fly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Snapshots enable incremental processing&lt;/strong&gt;: Instead of reprocessing everything daily, you only process what changed. In production, I've seen this reduce runs from 10 minutes to 6 seconds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Intermediate tables are reusable&lt;/strong&gt;: Build the join once, use it in multiple marts. Faster than joining raw data every time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Marts are optimized for queries&lt;/strong&gt;: They're pre-aggregated and indexed exactly how business users need them.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The performance actually improves because each layer is optimized for its specific job.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Wish I Knew Earlier
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with layers from day one&lt;/strong&gt;: Don't wait until the pipeline is a mess. It's easier to build it right than to refactor later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layers aren't bureaucracy&lt;/strong&gt;: They're clarity. Each layer answers one question: What is this data? (source), Is it clean? (staging), What changed? (snapshots), How does it relate? (intermediate), What does it mean? (marts).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The time machine is worth it&lt;/strong&gt;: Snapshots take more storage, yes. But the first time someone asks "what was this value last month?" you'll be glad you have them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One job per layer&lt;/strong&gt;: The moment you start mixing concerns (cleaning in source, calculating in intermediate), you're back to spaghetti.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coming Up Next
&lt;/h2&gt;

&lt;p&gt;In Part 2, we'll dive into incremental processing—how to process only what changed instead of reprocessing everything. This is where the real performance gains happen.&lt;/p&gt;

&lt;p&gt;In Part 3, we'll cover orchestration and data quality—how to make sure this whole system runs reliably and catches issues before they reach production.&lt;/p&gt;

&lt;p&gt;But for now, think about your current pipelines. Are they layered? Can you trace a number from the final report back through each transformation to the raw data? If not, it might be time to add some layers.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 1 of a 3-part series on modern data pipeline architecture. The examples come from an open-source banking pipeline I built based on my production experience. The patterns apply to any domain—e-commerce, healthcare, logistics, you name it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to see the full code?&lt;/strong&gt; Check out the &lt;a href="https://github.com/ai-tech-karthik/banking-data-pipeline" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; with complete source code, documentation, and production metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack&lt;/strong&gt;: Dagster • DBT • DuckDB • Databricks • Python • Docker&lt;/p&gt;




&lt;h2&gt;
  
  
  What's your experience with data pipeline architecture?
&lt;/h2&gt;

&lt;p&gt;Have you built layered pipelines? What challenges did you face? Drop a comment below—I'd love to hear your stories! 👇&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>dbt</category>
      <category>dagster</category>
      <category>python</category>
    </item>
  </channel>
</rss>
