I'll be honest—when I first heard about "layered data architectures," I rolled my eyes. Another buzzword, I thought. Just write some SQL, move the data, and call it a day.
Then I spent three weeks debugging a pipeline where raw data, cleaned data, and analytics were all mixed together in one giant spaghetti mess. That's when it clicked.
The Problem Nobody Talks About
Here's what actually happens in most data projects:
You start simple. Maybe you're pulling data from an API or reading CSV files. You write a script that cleans the data and calculates some metrics. It works! You ship it. Everyone's happy.
Six months later, someone asks: "Can we see what this metric looked like last quarter?"
You check the database. The old data is gone—overwritten by yesterday's run.
"Can we add a new calculation without breaking the existing reports?"
You look at the code. Everything is tangled together. Changing one thing breaks three others.
"Why did this number change between Tuesday and Wednesday?"
You have no idea. There's no audit trail.
Sound familiar? This is why we need layers.
The Five-Layer Philosophy
Think about how a restaurant kitchen works. You don't see the head chef doing everything. There's a system:
- Receiving dock: Ingredients arrive exactly as delivered (even if the tomatoes are bruised)
- Prep station: Wash, peel, chop—make ingredients ready to use
- Cold storage: Keep prepared ingredients fresh and organized
- Cooking line: Combine ingredients following recipes
- Plating station: Final presentation for customers
Each station has one job. If something goes wrong, you know exactly where to look. A data pipeline works the same way.
Layer 1: Source (The Receiving Dock)
What it does: Store data exactly as received. No cleaning, no transformations, no "fixing" things.
Why it matters: This is your insurance policy. When something goes wrong downstream (and it will), you can always come back to the original data.
I learned this the hard way. We once had a pipeline that "cleaned" data on ingestion—converting empty strings to nulls, trimming whitespace, fixing typos. Seemed smart at the time. Then a business user asked why certain records were missing. We had no way to prove whether the data arrived that way or if our cleaning broke something.
Now? We save everything exactly as received:
-- Source layer: Just add a timestamp
SELECT
    *,  -- Everything, unchanged
    CURRENT_TIMESTAMP() as loaded_at
FROM raw_input
That loaded_at timestamp becomes crucial later. It tells us when data arrived, which helps track down issues and enables change detection.
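For instance, when a report looks off, a quick query against loaded_at tells us whether fresh data ever arrived. Here's a rough sketch (using the source_customer table that shows up in the staging example below):
-- Sketch: when did data arrive, and how much of it?
SELECT
    CAST(loaded_at AS DATE) as load_date,
    COUNT(*) as rows_loaded
FROM source_customer
GROUP BY CAST(loaded_at AS DATE)
ORDER BY load_date DESC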
Layer 2: Staging (The Prep Station)
What it does: Clean and standardize data without changing its meaning.
Why it matters: Real-world data is messy. You'll see "Yes", "YES", "true", "1", "Y" all meaning the same thing. Staging normalizes this chaos.
Here's a real example from our project. Customer data arrived with loan status in various formats:
-- Before staging (the mess)
CustomerID | HasLoan
-----------|--------
1          | Yes
2          | YES
3          | true
4          | 1
5          | no
6          | FALSE
-- After staging (clean and consistent)
customer_id | has_loan_flag
------------|--------------
1           | true
2           | true
3           | true
4           | true
5           | false
6           | false
The staging layer handles this:
SELECT
    customer_id,
    LOWER(TRIM(customer_name)) as customer_name,
    CASE
        WHEN LOWER(has_loan) IN ('yes', 'true', '1', 'y') THEN true
        WHEN LOWER(has_loan) IN ('no', 'false', '0', 'n') THEN false
        ELSE null
    END as has_loan_flag
FROM source_customer
Notice we're not calculating anything or joining tables. We're just cleaning. One job, done well.
Layer 3: Snapshots (The Time Machine)
This is where things get interesting. Most pipelines overwrite data every day. Yesterday's data? Gone. Last month's data? Gone. You're flying blind.
Snapshots solve this by keeping every version of every record. It's called Slowly Changing Dimension Type 2 (SCD2), but I prefer to think of it as version control for data.
Real scenario: A customer's loan status changes on February 15th. Without snapshots, you only know their current status. With snapshots, you know their status on any date in history.
Here's what the snapshot table looks like:
customer_id | has_loan | valid_from | valid_to   | Status
------------|----------|------------|------------|--------
123         | false    | 2024-01-01 | 2024-02-15 | Old
123         | true     | 2024-02-15 | NULL       | Current
The magic is in those valid_from and valid_to timestamps. Want to know the status on January 20th? Query where that date falls between valid_from and valid_to. Want current status? Query where valid_to is NULL.
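As a sketch, that point-in-time lookup is one short query (using the customer_snapshots table from the join example below and the columns shown above):
-- Sketch: what was customer 123's loan status on January 20th?
SELECT customer_id, has_loan
FROM customer_snapshots
WHERE customer_id = 123
  AND valid_from <= DATE '2024-01-20'
  AND (valid_to > DATE '2024-01-20' OR valid_to IS NULL)
-- Current status instead? Drop the dates and filter WHERE valid_to IS NULL.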
This saved us during an audit. Regulators asked about account balances from six months ago. Without snapshots, we would have been scrambling. With snapshots? One SQL query, done in 30 seconds.
Layer 4: Intermediate (The Cooking Line)
What it does: Join data from different sources and apply business rules.
Why it matters: This is where you start building the actual insights. But you're not calculating final metrics yet—you're preparing the ingredients.
In our pipeline, we join customer data with account data:
SELECT
    accounts.account_id,
    accounts.balance,
    customers.customer_name,
    customers.has_loan_flag
FROM account_snapshots accounts
JOIN customer_snapshots customers
    ON accounts.customer_id = customers.customer_id
WHERE accounts.valid_to IS NULL  -- Current records only
  AND customers.valid_to IS NULL
Why not do this in the marts layer? Because other teams might need this joined data for different calculations. Build it once, use it everywhere.
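Here's a sketch of what that reuse looks like: once the join is materialized as intermediate_accounts (the table the marts layer reads from in the next section), another team can build on it without redoing the snapshot logic. The loan-penetration summary below is a made-up example, not part of our pipeline:
-- Sketch: a second mart reusing the same intermediate table
SELECT
    has_loan_flag,
    COUNT(*) as customers,
    SUM(balance) as total_balance
FROM intermediate_accounts
GROUP BY has_loan_flag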
Layer 5: Marts (The Plating Station)
What it does: Final calculations and aggregations. This is what business users actually see.
Why it matters: This is your product. Everything before this was preparation.
Here's where we calculate interest rates based on business rules:
SELECT
    account_id,
    balance as original_balance,
    -- Business logic: Interest rate based on balance tiers
    CASE
        WHEN balance < 10000 THEN 0.01
        WHEN balance < 20000 THEN 0.015
        ELSE 0.02
    END as base_rate,
    -- Bonus rate for customers with loans
    CASE WHEN has_loan_flag THEN 0.005 ELSE 0 END as bonus_rate,
    -- Final calculation. Reusing base_rate and bonus_rate here relies on
    -- lateral column aliases (supported by DuckDB and Databricks SQL);
    -- on engines without them, repeat the CASE expressions or use a CTE.
    balance * (base_rate + bonus_rate) as annual_interest
FROM intermediate_accounts
The business logic is crystal clear. No digging through nested queries or trying to figure out where a number came from. It's right there.
Why This Actually Works
I've seen teams try to skip layers. "We don't need staging, we'll just clean in the source layer." Or "Why separate intermediate and marts? Let's just do it all in one query."
Here's what happens:
Without layers: A bug in the cleaning logic corrupts your analytics. You can't tell if the issue is in the data, the cleaning, the joins, or the calculations. You're debugging everything at once.
With layers: A bug in the cleaning logic? Check staging. Bad join? Check intermediate. Wrong calculation? Check marts. You know exactly where to look.
It's like having a stack trace for your data.
The Performance Question
"But doesn't this mean more tables and slower queries?"
Actually, no. Here's why:
Staging is views, not tables: No storage overhead. They're computed on the fly (see the sketch at the end of this section).
Snapshots enable incremental processing: Instead of reprocessing everything daily, you only process what changed. We went from 10-minute runs to 6-second runs.
Intermediate tables are reusable: Build the join once, use it in multiple marts. Faster than joining raw data every time.
Marts are optimized for queries: They're pre-aggregated and indexed exactly how business users need them.
The performance actually improves because each layer is optimized for its specific job.
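To make the "views, not tables" point concrete, here's the staging model from earlier wrapped as a view. The name staging_customer is my own placeholder; use whatever fits your naming convention:
-- Sketch: staging defined as a view (nothing stored, computed at query time)
CREATE OR REPLACE VIEW staging_customer AS
SELECT
    customer_id,
    LOWER(TRIM(customer_name)) as customer_name,
    CASE
        WHEN LOWER(has_loan) IN ('yes', 'true', '1', 'y') THEN true
        WHEN LOWER(has_loan) IN ('no', 'false', '0', 'n') THEN false
        ELSE null
    END as has_loan_flag
FROM source_customer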
What I Wish I Knew Earlier
Start with layers from day one: Don't wait until the pipeline is a mess. It's easier to build it right than to refactor later.
Layers aren't bureaucracy: They're clarity. Each layer answers one question: What is this data? (source), Is it clean? (staging), What changed? (snapshots), How does it relate? (intermediate), What does it mean? (marts).
The time machine is worth it: Snapshots take more storage, yes. But the first time someone asks "what was this value last month?" you'll be glad you have them.
One job per layer: The moment you start mixing concerns (cleaning in source, calculating in intermediate), you're back to spaghetti.
Coming Up Next
In Part 2, we'll dive into incremental processing—how to process only what changed instead of reprocessing everything. This is where the real performance gains happen.
In Part 3, we'll cover orchestration and data quality—how to make sure this whole system runs reliably and catches issues before they reach production.
But for now, think about your current pipelines. Are they layered? Can you trace a number from the final report back through each transformation to the raw data? If not, it might be time to add some layers.
This is Part 1 of a 3-part series on modern data pipeline architecture. The examples come from a real production pipeline processing financial data, but the patterns apply to any domain—e-commerce, healthcare, logistics, you name it.
Want to see the full code? Check out the GitHub repository with complete source code, documentation, and production metrics.
Tech Stack: Dagster • DBT • DuckDB • Databricks • Python • Docker
What's your experience with data pipeline architecture?
Have you built layered pipelines? What challenges did you face? Drop a comment below—I'd love to hear your stories! 👇