I'll be honest—when I first heard about "layered data architectures," I rolled my eyes. Another buzzword, I thought. Just write some SQL, move the data, and call it a day.
Then I spent three weeks debugging a pipeline where raw data, cleaned data, and analytics were all mixed together in one giant spaghetti mess. That's when it clicked.
The Problem Nobody Talks About
Here's what actually happens in most data projects:
You start simple. Maybe you're pulling data from an API or reading CSV files. You write a script that cleans the data and calculates some metrics. It works! You ship it. Everyone's happy.
Six months later, someone asks: "Can we see what this metric looked like last quarter?"
You check the database. The old data is gone—overwritten by yesterday's run.
"Can we add a new calculation without breaking the existing reports?"
You look at the code. Everything is tangled together. Changing one thing breaks three others.
"Why did this number change between Tuesday and Wednesday?"
You have no idea. There's no audit trail.
Sound familiar? This is why we need layers.
The Five-Layer Philosophy
Think about how a restaurant kitchen works. You don't see the head chef doing everything. There's a system:
- Receiving dock: Ingredients arrive exactly as delivered (even if the tomatoes are bruised)
- Prep station: Wash, peel, chop—make ingredients ready to use
- Cold storage: Keep prepared ingredients fresh and organized
- Cooking line: Combine ingredients following recipes
- Plating station: Final presentation for customers
Each station has one job. If something goes wrong, you know exactly where to look. A data pipeline works the same way.
Layer 1: Source (The Receiving Dock)
What it does: Store data exactly as received. No cleaning, no transformations, no "fixing" things.
Why it matters: This is your insurance policy. When something goes wrong downstream (and it will), you can always come back to the original data.
I learned this the hard way. We once had a pipeline that "cleaned" data on ingestion—converting empty strings to nulls, trimming whitespace, fixing typos. Seemed smart at the time. Then a business user asked why certain records were missing. We had no way to prove whether the data arrived that way or if our cleaning broke something.
Now? We save everything exactly as received:
-- Source layer: Just add a timestamp
SELECT
    *,  -- Everything, unchanged
    CURRENT_TIMESTAMP() as loaded_at
FROM raw_input
That loaded_at timestamp becomes crucial later. It tells us when data arrived, which helps track down issues and enables change detection.
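For instance, when a report looks off, a quick query against loaded_at tells us whether fresh data ever arrived. Here's a rough sketch (using the source_customer table that shows up in the staging example below):
-- Sketch: when did data arrive, and how much of it?
SELECT
    CAST(loaded_at AS DATE) as load_date,
    COUNT(*) as rows_loaded
FROM source_customer
GROUP BY CAST(loaded_at AS DATE)
ORDER BY load_date DESC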
Layer 2: Staging (The Prep Station)
What it does: Clean and standardize data without changing its meaning.
Why it matters: Real-world data is messy. You'll see "Yes", "YES", "true", "1", "Y" all meaning the same thing. Staging normalizes this chaos.
Here's a real example from our project. Customer data arrived with loan status in various formats:
-- Before staging (the mess)
CustomerID | HasLoan
-----------|--------
1          | Yes
2          | YES
3          | true
4          | 1
5          | no
6          | FALSE
-- After staging (clean and consistent)
customer_id | has_loan_flag
------------|--------------
1           | true
2           | true
3           | true
4           | true
5           | false
6           | false
The staging layer handles this:
SELECT
    customer_id,
    LOWER(TRIM(customer_name)) as customer_name,
    CASE
        WHEN LOWER(has_loan) IN ('yes', 'true', '1', 'y') THEN true
        WHEN LOWER(has_loan) IN ('no', 'false', '0', 'n') THEN false
        ELSE null
    END as has_loan_flag
FROM source_customer
Notice we're not calculating anything or joining tables. We're just cleaning. One job, done well.
Layer 3: Snapshots (The Time Machine)
This is where things get interesting. Most pipelines overwrite data every day. Yesterday's data? Gone. Last month's data? Gone. You're flying blind.
Snapshots solve this by keeping every version of every record. It's called Slowly Changing Dimension Type 2 (SCD2), but I prefer to think of it as version control for data.
Real scenario: A customer's loan status changes on February 15th. Without snapshots, you only know their current status. With snapshots, you know their status on any date in history.
Here's what the snapshot table looks like:
customer_id | has_loan | valid_from | valid_to   | Status
------------|----------|------------|------------|--------
123         | false    | 2024-01-01 | 2024-02-15 | Old
123         | true     | 2024-02-15 | NULL       | Current
The magic is in those valid_from and valid_to timestamps. Want to know the status on January 20th? Query where that date falls between valid_from and valid_to. Want current status? Query where valid_to is NULL.
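As a sketch, that point-in-time lookup is one short query (using the customer_snapshots table from the join example below and the columns shown above):
-- Sketch: what was customer 123's loan status on January 20th?
SELECT customer_id, has_loan
FROM customer_snapshots
WHERE customer_id = 123
  AND valid_from <= DATE '2024-01-20'
  AND (valid_to > DATE '2024-01-20' OR valid_to IS NULL)
-- Current status instead? Drop the dates and filter WHERE valid_to IS NULL.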
This saved us during an audit. Regulators asked about account balances from six months ago. Without snapshots, we would have been scrambling. With snapshots? One SQL query, done in 30 seconds.
Layer 4: Intermediate (The Cooking Line)
What it does: Join data from different sources and apply business rules.
Why it matters: This is where you start building the actual insights. But you're not calculating final metrics yet—you're preparing the ingredients.
In our pipeline, we join customer data with account data:
SELECT
    accounts.account_id,
    accounts.balance,
    customers.customer_name,
    customers.has_loan_flag
FROM account_snapshots accounts
JOIN customer_snapshots customers
    ON accounts.customer_id = customers.customer_id
WHERE accounts.valid_to IS NULL  -- Current records only
  AND customers.valid_to IS NULL
Why not do this in the marts layer? Because other teams might need this joined data for different calculations. Build it once, use it everywhere.
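Here's a sketch of what that reuse looks like: once the join is materialized as intermediate_accounts (the table the marts layer reads from in the next section), another team can build on it without redoing the snapshot logic. The loan-penetration summary below is a made-up example, not part of our pipeline:
-- Sketch: a second mart reusing the same intermediate table
SELECT
    has_loan_flag,
    COUNT(*) as customers,
    SUM(balance) as total_balance
FROM intermediate_accounts
GROUP BY has_loan_flag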
Layer 5: Marts (The Plating Station)
What it does: Final calculations and aggregations. This is what business users actually see.
Why it matters: This is your product. Everything before this was preparation.
Here's where we calculate interest rates based on business rules:
SELECT
    account_id,
    balance as original_balance,
    -- Business logic: Interest rate based on balance tiers
    CASE
        WHEN balance < 10000 THEN 0.01
        WHEN balance < 20000 THEN 0.015
        ELSE 0.02
    END as base_rate,
    -- Bonus rate for customers with loans
    CASE WHEN has_loan_flag THEN 0.005 ELSE 0 END as bonus_rate,
    -- Final calculation. Reusing base_rate and bonus_rate here relies on
    -- lateral column aliases (supported by DuckDB and Databricks SQL);
    -- on engines without them, repeat the CASE expressions or use a CTE.
    balance * (base_rate + bonus_rate) as annual_interest
FROM intermediate_accounts
The business logic is crystal clear. No digging through nested queries or trying to figure out where a number came from. It's right there.
Why This Actually Works
I've seen teams try to skip layers. "We don't need staging, we'll just clean in the source layer." Or "Why separate intermediate and marts? Let's just do it all in one query."
Here's what happens:
Without layers: A bug in the cleaning logic corrupts your analytics. You can't tell if the issue is in the data, the cleaning, the joins, or the calculations. You're debugging everything at once.
With layers: A bug in the cleaning logic? Check staging. Bad join? Check intermediate. Wrong calculation? Check marts. You know exactly where to look.
It's like having a stack trace for your data.
The Performance Question
"But doesn't this mean more tables and slower queries?"
Actually, no. Here's why:
Staging is views, not tables: No storage overhead. They're computed on the fly (see the sketch at the end of this section).
Snapshots enable incremental processing: Instead of reprocessing everything daily, you only process what changed. We went from 10-minute runs to 6-second runs.
Intermediate tables are reusable: Build the join once, use it in multiple marts. Faster than joining raw data every time.
Marts are optimized for queries: They're pre-aggregated and indexed exactly how business users need them.
The performance actually improves because each layer is optimized for its specific job.
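To make the "views, not tables" point concrete, here's the staging model from earlier wrapped as a view. The name staging_customer is my own placeholder; use whatever fits your naming convention:
-- Sketch: staging defined as a view (nothing stored, computed at query time)
CREATE OR REPLACE VIEW staging_customer AS
SELECT
    customer_id,
    LOWER(TRIM(customer_name)) as customer_name,
    CASE
        WHEN LOWER(has_loan) IN ('yes', 'true', '1', 'y') THEN true
        WHEN LOWER(has_loan) IN ('no', 'false', '0', 'n') THEN false
        ELSE null
    END as has_loan_flag
FROM source_customer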
What I Wish I Knew Earlier
Start with layers from day one: Don't wait until the pipeline is a mess. It's easier to build it right than to refactor later.
Layers aren't bureaucracy: They're clarity. Each layer answers one question: What is this data? (source), Is it clean? (staging), What changed? (snapshots), How does it relate? (intermediate), What does it mean? (marts).
The time machine is worth it: Snapshots take more storage, yes. But the first time someone asks "what was this value last month?" you'll be glad you have them.
One job per layer: The moment you start mixing concerns (cleaning in source, calculating in intermediate), you're back to spaghetti.
Coming Up Next
In Part 2, we'll dive into incremental processing—how to process only what changed instead of reprocessing everything. This is where the real performance gains happen.
In Part 3, we'll cover orchestration and data quality—how to make sure this whole system runs reliably and catches issues before they reach production.
But for now, think about your current pipelines. Are they layered? Can you trace a number from the final report back through each transformation to the raw data? If not, it might be time to add some layers.
This is Part 1 of a 3-part series on modern data pipeline architecture. The examples come from a real production pipeline processing financial data, but the patterns apply to any domain—e-commerce, healthcare, logistics, you name it.
Want to see the full code? Check out the GitHub repository with complete source code, documentation, and production metrics.
Tech Stack: Dagster • DBT • DuckDB • Databricks • Python • Docker
What's your experience with data pipeline architecture?
Have you built layered pipelines? What challenges did you face? Drop a comment below—I'd love to hear your stories! 👇