Originally published on Medium: https://medium.com/@kalluripradeep99/from-raw-to-refined-data-pipeline-architecture-at-scale-52cd4b02ef10
How I built production data pipelines that process massive volumes daily — and what I learned along the way
Every day, modern data platforms handle hundreds of gigabytes of data — transactions, customer activity, event streams, operational reports. All of this needs to flow from messy source systems into clean, reliable tables that teams can use for dashboards, reports, and ML models.
Here’s what surprised me after years of building these systems: moving data isn’t the hard part. Making it reliable at scale is.
I’ve debugged pipelines that silently corrupted data for weeks. I’ve seen duplicate records inflate ML model accuracy by double digits. I’ve watched pipelines grind to a halt because someone forgot to partition a table properly.
These experiences taught me something valuable: you need a solid architecture before you write a single line of code.
In this article, I’ll walk you through the three-zone framework I use for production data pipelines. We’ll cover which tools make sense at each stage, how to keep data quality high, and the mistakes I’ve made so you don’t have to.
If you’re building a data platform from scratch or trying to scale an existing one, this should help.
The Three-Zone Architecture
I like keeping things simple. Split your data pipeline into three zones, each doing one thing well. This makes everything easier to build, fix, and explain to your team.
Zone 1: Raw/Landing Zone
This is where data first shows up. The most important rule: don’t touch it. Store everything exactly as it comes in.
What it does: Keeps data in its original form
Tools I use: Object storage (S3/ADLS) for batch files, Kafka for streaming
Why it matters: You can always go back and reprocess if something breaks
Example: Transaction data comes in as JSON files. I store them in organized paths like s3://raw-zone/transactions/2024/11/20/. For real-time data like payment events, they go into Kafka topics unchanged.
Why bother with this separation? Because you’ll have bugs. Business rules will change. Data quality checks will evolve. When that happens, you just reprocess from raw. It’s your safety net.
I once discovered a data transformation bug that had been running for weeks. Because we had the raw zone, we reprocessed everything in a few hours. Without it? We would have had serious data integrity issues.
Zone 2: Curated/Staging Zone
This is where the actual work happens. Clean up the mess, standardize formats, catch bad data before it reaches production.
What it does: Turns raw data into something usable
Tools I use: PySpark for heavy lifting, cloud compute platforms for processing
What I do here: Remove duplicates, fix data types, validate everything, standardize formats
Real talk: data is always messier than you expect. You’ll get duplicate records. Date formats all over the place — some systems use MM/DD/YYYY, others use DD-MM-YYYY. Codes that don’t match standards. Nulls everywhere.
This zone fixes all of that. Convert dates to ISO format. Deduplicate records using window functions. Flag invalid data and send it to error tables so someone can investigate later.
One time, we received data where the same record appeared multiple times with different values due to system retries. Our deduplication logic caught it and kept only the latest record based on timestamp. Simple, but it prevented incorrect reporting downstream.
Zone 3: Refined/Consumption Zone
This is what people actually use. Clean, fast, optimized, ready to go.
What it does: Serves data to analysts, dashboards, ML models
Tools I use: Cloud data warehouses (Snowflake/Redshift/BigQuery), dbt for transformations
What’s here: Star schemas, pre-aggregated tables, feature stores for ML
Instead of making analysts query millions of raw records, give them pre-aggregated summary tables. Instead of making ML engineers join dozens of tables every time they need features, give them pre-computed feature tables.
Performance matters here. Use proper partitioning. Pre-compute common aggregations. Model your data in ways people understand — star schemas, not normalized tables with excessive joins.
Why Split It Up?
Easier to debug: When something breaks, you know exactly where to look. Data quality issue? Check curated. Performance problem? Look at refined.
Safer to experiment: Want to try a new transformation logic? Test it in curated without touching raw data. Want to change your warehouse schema? Refined zone only.
Right tool for the job: Object storage for raw, distributed compute for processing, columnar database for analytics. Each zone uses the best tool for its purpose.
Better quality: Catch problems early in curated before they reach business users and damage trust in your data platform.
The boundaries are clear: raw-to-curated handles technical stuff (formats, types, duplicates). Curated-to-refined handles business logic (aggregations, joins, metrics). Everyone knows what goes where.
Data Ingestion Layer
Getting data into your platform is step one. You’ve got two main approaches: batch and streaming. Most real-world systems need both.
Batch Ingestion
This is your scheduled, bulk data loads. Works great for data that doesn’t need to be real-time — think daily summaries, overnight files, periodic reports.
I use cloud object storage as the landing zone. Source systems drop files there — usually CSV, JSON, or Parquet. Then I’ve got scheduled jobs (orchestrated by Airflow) that pick them up and process them.
The trick is organizing your storage paths properly. Use a structure like:
s3://raw-zone/source_system/table_name/YYYY/MM/DD/filename.parquet
This makes it easy to process specific date ranges and troubleshoot when things go wrong. And trust me, things will go wrong.
Pro tip: Use Parquet format when you can. Columnar storage can reduce storage costs significantly compared to CSV, plus query performance improves substantially.
Stream Ingestion
For real-time data, I use Kafka. Payment events, user activity, system logs — anything that needs to be processed within seconds or minutes.
Kafka is great because it keeps messages for a retention period (say, 7 days). If your downstream system goes down for maintenance, you can catch up without losing data. It’s like a replay buffer for your data streams.
Here’s a pattern that works well: Kafka producers write events to topics. Consumer applications read from topics and write to object storage in micro-batches (every 5 minutes). This gives you both real-time processing AND a permanent archive in your raw zone.
In one system, we processed tens of thousands of events per second through Kafka, with consumer lag under a minute. The key was proper partitioning and scaling consumer groups horizontally.
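Here’s a minimal sketch of that micro-batch archive using Spark Structured Streaming. The broker address, topic, and paths are placeholders, and plain Kafka consumer applications work just as well for this pattern:

# Sketch: stream a Kafka topic into the raw zone as Parquet in 5-minute
# micro-batches. Requires the spark-sql-kafka connector; names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka_to_raw_archive").getOrCreate()

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "payment-events")
    .option("startingOffsets", "latest")
    .load())

query = (events
    .selectExpr("CAST(key AS STRING) AS key",
                "CAST(value AS STRING) AS value",
                "timestamp")
    .writeStream
    .format("parquet")
    .option("path", "s3://raw-zone/payment_events/")
    .option("checkpointLocation", "s3://raw-zone/_checkpoints/payment_events/")
    .trigger(processingTime="5 minutes")
    .start())

The checkpoint location is what lets the job resume from where it left off after a restart, which pairs nicely with Kafka’s retention window.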
Handling Late Data
Real-world data doesn’t arrive on time. An event might get recorded but the network hiccups. The data shows up hours late. Or a mobile app was offline and syncs data the next day.
My rule: always use event time (when it actually happened) not processing time (when you received it). Store both timestamps. This way you can handle late arrivals properly in downstream processing without corrupting your analytics.
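Here’s a minimal sketch of that rule in PySpark, assuming a DataFrame of raw events; raw_events and the column names are illustrative, not from a specific pipeline:

# Keep both timestamps: the event's own time from the payload and the time
# we ingested it (raw_events and column names are illustrative)
from pyspark.sql.functions import col, current_timestamp, to_timestamp

events_stamped = raw_events \
    .withColumn("event_time", to_timestamp(col("event_time_string"))) \
    .withColumn("ingest_time", current_timestamp())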
Processing & Transformation
This is where PySpark does the heavy lifting. Reading from raw, applying transformations, writing to curated. Let me show you the patterns that actually work in production.
Reading from Raw Zone
Start by reading your data. I usually work with Parquet files in object storage because they’re fast and efficient.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, row_number, upper
from pyspark.sql.window import Window

spark = SparkSession.builder \
    .appName("curated_processing") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Read from the raw zone (Parquet files already carry their schema)
df = spark.read.parquet("s3://raw-zone/transactions/2024/11/20/")
Data Validation
First thing: validate your data. Don’t process garbage.
# Remove records with null IDs
df_valid = df.filter(col("transaction_id").isNotNull())

# Check for valid amounts (positive values only)
df_valid = df_valid.filter(col("amount") > 0)

# Validate date ranges
df_valid = df_valid.filter(
    (col("transaction_date") >= "2024-01-01") &
    (col("transaction_date") <= "2024-12-31")
)
Simple checks like this save you headaches later. Invalid data goes to an error table so someone can investigate — don’t just drop it silently.
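Here’s a minimal sketch of that error-table routing, building on the validation above. The error path and the rejection_reason column are illustrative:

# Route rejected rows to an error location instead of dropping them silently.
# exceptAll keeps duplicate rows; the path and rejection_reason are illustrative.
from pyspark.sql.functions import lit

df_invalid = df.exceptAll(df_valid) \
    .withColumn("rejection_reason", lit("failed_validation"))

df_invalid.write \
    .mode("append") \
    .parquet("s3://curated-zone/errors/transactions/")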
Deduplication
Duplicates are everywhere. Source systems send the same record twice. Networks retry failed requests. It happens constantly.
Here’s how I handle it — keep the most recent record based on a timestamp:
# Define window to find duplicates
window = Window.partitionBy("transaction_id") \
    .orderBy(col("timestamp").desc())

# Keep only the latest record for each transaction_id
df_dedup = df_valid.withColumn("row_num", row_number().over(window)) \
    .filter(col("row_num") == 1) \
    .drop("row_num")
This pattern works for any duplicate scenario. Just change the partitionBy and orderBy columns based on your needs. I've used this same logic for customer records, sensor data, and API responses.
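For example, the same shape deduplicates a customer table by swapping the key and ordering columns; customers_raw, customer_id, and updated_at are illustrative names, not columns from the pipeline above:

# Same pattern, different keys: keep the latest update per customer_id
# (customers_raw, customer_id, and updated_at are illustrative names)
cust_window = Window.partitionBy("customer_id") \
    .orderBy(col("updated_at").desc())

customers_dedup = customers_raw.withColumn("row_num", row_number().over(cust_window)) \
    .filter(col("row_num") == 1) \
    .drop("row_num")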
Type Casting and Standardization
Data comes in as strings more often than you’d think. Convert to proper types for downstream processing.
# Convert string dates to actual dates
df_typed = df_dedup.withColumn(
    "transaction_date",
    to_date(col("date_string"), "yyyy-MM-dd")
)

# Ensure numeric types with proper precision
df_typed = df_typed.withColumn(
    "amount",
    col("amount").cast("decimal(10,2)")
)

# Standardize codes to uppercase
df_typed = df_typed.withColumn(
    "currency",
    upper(col("currency"))
)
Writing to Curated Zone
Once data is clean, write it back to storage in the curated zone. Use partitioning for better performance downstream.
# Write partitioned by date for efficient queries
df_typed.write \
    .mode("overwrite") \
    .partitionBy("transaction_date") \
    .parquet("s3://curated-zone/transactions/")
Partitioning means queries only read relevant data. If someone wants yesterday’s data, Spark only scans yesterday’s partition. Fast and cheap.
In one pipeline, proper partitioning reduced query times from 45 minutes to just a few minutes. Same data, same query, just better organization.
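As a quick illustration, a filter on the partition column prunes the scan down to a single day’s directory (path as above):

# Partition pruning: only the transaction_date=2024-11-20 directory is scanned
one_day = spark.read.parquet("s3://curated-zone/transactions/") \
    .filter(col("transaction_date") == "2024-11-20")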
Why PySpark?
You might ask — why not just use Pandas? Simple: scale. Pandas runs on one machine’s memory. PySpark distributes across a cluster. When you’re processing large volumes, you need that distributed power.
Plus, PySpark’s lazy evaluation is smart. It optimizes your entire transformation pipeline before executing. Less data shuffling, fewer passes over data, faster results.
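If you want to see that optimization happen, ask Spark to print the plan it has built for the whole chain before anything executes:

# Show the parsed, optimized, and physical plans Spark will use for the
# full transformation chain, before any data is read
df_typed.explain(True)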
Orchestration with Airflow
You can’t run data jobs manually every day. You need orchestration. Airflow handles scheduling, dependencies, retries, and monitoring — all the operational complexity.
DAG Design
Here’s a DAG structure for our three-zone pipeline:
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['data-alerts@company.com'],
}

dag = DAG(
    'transaction_pipeline',
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False,
)

# Task 1: Wait for source data to arrive
wait_for_data = S3KeySensor(
    task_id='wait_for_source_data',
    bucket_name='raw-zone',
    bucket_key='transactions/{{ ds }}/',
    timeout=3600,
    poke_interval=60,
    dag=dag,
)

# Task 2: Ingest from source to raw
ingest_task = PythonOperator(
    task_id='ingest_to_raw',
    python_callable=ingest_data,
    dag=dag,
)

# Task 3: Process raw to curated
process_task = PythonOperator(
    task_id='process_to_curated',
    python_callable=process_data,
    dag=dag,
)

# Task 4: Transform curated to refined
transform_task = PythonOperator(
    task_id='transform_to_refined',
    python_callable=transform_data,
    dag=dag,
)

# Task 5: Data quality checks
quality_task = PythonOperator(
    task_id='quality_checks',
    python_callable=run_quality_checks,
    dag=dag,
)

# Set dependencies - clear pipeline flow
wait_for_data >> ingest_task >> process_task >> transform_task >> quality_task
Key Principles
Idempotency: Run the same task twice, get the same result. Use mode("overwrite") with date partitions: if today's job fails and reruns, it rewrites today's data without affecting other days. This is crucial for reliable operations (a minimal sketch follows these principles).
Clear dependencies: The >> operator makes dependencies obvious. Process can't start until ingest finishes. Quality checks run last. Anyone looking at the DAG understands the flow immediately.
Retry logic: Network hiccups happen. Source systems go down. Airflow retries failed tasks automatically. Set sensible retry counts (2–3) and delays (5–10 minutes).
Monitoring: Airflow’s UI shows you everything. Which tasks failed? How long did they take? When did they last run? All visible at a glance. I check the dashboard regularly — green boxes mean happy pipelines, red boxes mean I’ve got work to do.
Alerting: Set up email or Slack alerts for failures. Don’t wait until someone complains about missing data. Know about problems before your users do.
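One Spark setting makes that idempotent overwrite concrete. Here’s a minimal sketch, assuming Spark 2.3 or later and the curated write from earlier; your setup may handle this differently:

# Sketch: dynamic partition overwrite replaces only the partitions present in
# this run's DataFrame, so rerunning today's job rewrites today's data
# without touching other days
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df_typed.write \
    .mode("overwrite") \
    .partitionBy("transaction_date") \
    .parquet("s3://curated-zone/transactions/")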
Data Quality & Validation
Bad data is worse than no data. It leads to wrong decisions, broken dashboards, and lost trust in your platform. I learned this the hard way.
Why Quality Matters
I once saw an ML model built on data with duplicate IDs. The model looked great in testing, with high accuracy, then performed much worse in production. Why? The duplicated records leaked into both the training and evaluation sets, which artificially inflated the metrics. It was only caught after deployment. Not fun.
Now I validate everything.
Using Great Expectations
Great Expectations is my go-to tool for data quality. You define rules (called expectations), and it validates your data against them automatically.
import great_expectations as ge

# Load your data as a Great Expectations DataFrame
df = ge.read_csv("s3://curated-zone/transactions.csv")

# Set expectations - these become tests
df.expect_column_values_to_not_be_null("transaction_id")
df.expect_column_values_to_be_unique("transaction_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=1000000)
df.expect_column_values_to_be_in_set("currency", ["USD", "EUR", "GBP"])
df.expect_column_values_to_match_regex("email", r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")

# Validate and get results
results = df.validate()
if not results.success:
    # Alert someone, block the pipeline, write to an error log
    raise ValueError(f"Data quality check failed! {results}")
Simple but effective. If data fails validation, the pipeline stops. No bad data reaches production and corrupts your analytics.
dbt Tests
For the refined zone, I use dbt tests. They’re built directly into your transformation code, which makes them easy to maintain.
-- models/daily_summary.sql
{{ config(materialized='table') }}

SELECT
    date,
    customer_id,
    SUM(amount) AS total_amount,
    COUNT(*) AS transaction_count
FROM {{ ref('transactions') }}
GROUP BY date, customer_id
# tests/daily_summary.yml
version: 2

models:
  - name: daily_summary
    columns:
      - name: customer_id
        tests:
          - not_null
      - name: total_amount
        tests:
          - not_null
      - name: date
        tests:
          - not_null
    tests:
      # One row per customer per day (customer_id alone is not unique here)
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - date
            - customer_id
      # Alert if data stops arriving for more than a day
      - dbt_utils.recency:
          datepart: day
          field: date
          interval: 1

dbt runs these tests every time you run dbt test (or dbt build) after the models are built. If a test fails, you know immediately. The recency test is particularly useful: it alerts you if data stops arriving.
Continuous Monitoring
Quality isn’t one-and-done. Monitor continuously:
Track row counts over time: sudden drops mean a problem (see the sketch after this list)
Watch null rates: If nulls suddenly spike, investigate
Monitor data freshness: Is data arriving on time?
Set up anomaly detection: Catch unusual patterns early
Validate referential integrity: Ensure foreign keys match
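As a minimal sketch of the first check, here’s a rough row-count guard against the curated transactions table. The 50% threshold and the 7-day baseline are illustrative, not fixed rules:

# Compare today's row count to the trailing 7-day average and fail loudly on
# a big drop (threshold and lookback are illustrative)
counts = (spark.read.parquet("s3://curated-zone/transactions/")
    .groupBy("transaction_date")
    .count()
    .orderBy(col("transaction_date").desc())
    .limit(8)
    .collect())

today_count = counts[0]["count"]
baseline = sum(r["count"] for r in counts[1:]) / max(len(counts) - 1, 1)

if today_count < 0.5 * baseline:
    raise ValueError(
        f"Row count dropped: {today_count} today vs ~{baseline:.0f} baseline"
    )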
I’ve got Slack alerts configured for quality failures. If something breaks overnight, I know about it quickly. Better to know immediately than discover it during a morning meeting.
Real-World Lessons
Let me share some lessons from systems I’ve built:
Performance wins:
Proper partitioning can reduce query times by 10x or more
Parquet format typically reduces storage costs by 60–70% vs CSV
Pre-aggregated tables eliminate the need for complex real-time queries
Reliability improvements:
The three-zone architecture makes debugging much faster
Comprehensive quality checks catch issues before users see them
Idempotent pipelines allow safe retries without data corruption
Cost optimizations:
Smart partitioning reduces cloud compute costs significantly
Columnar formats (Parquet) save on both storage and processing
Proper cluster sizing prevents over-provisioning
These results didn’t come from fancy tools or bleeding-edge technology. They came from solid architecture, good practices, and attention to quality.
Common Mistakes (and How to Avoid Them)
Mistake #1: Skipping the raw zone
“We’ll just clean data as it arrives.” Then you have a bug and no way to reprocess. Always keep raw data.
Mistake #2: No data quality checks
“We’ll add those later.” Later never comes. Build quality checks from day one.
Mistake #3: Over-engineering early
You don’t need Kafka and real-time processing for daily batch reports. Start simple, scale when needed.
Mistake #4: Ignoring monitoring
If you don’t know your pipeline failed, you can’t fix it. Set up alerts and dashboards.
Mistake #5: Poor partitioning
This kills performance and inflates costs. Partition by date or another field that matches your query patterns, and keep the cardinality moderate so you don't end up with millions of tiny partitions.
Mistake #6: Treating all data the same
Not everything needs real-time processing. Batch is cheaper and simpler for most use cases.
Getting Started
If you’re building a data platform from scratch:
Start with the three-zone architecture — Even if you’re just moving files around, establish the pattern early
Implement one pipeline end-to-end — Don’t build all the infrastructure first. Get one working pipeline, learn from it
Add quality checks incrementally — Start with basic null checks, expand from there
Monitor everything — Build dashboards and alerts from day one
Document your patterns — Future you (and your team) will thank you
The tools I mentioned — S3/ADLS, Kafka, PySpark, Airflow, Snowflake/Redshift/BigQuery, dbt, Great Expectations — work well together. But they’re not the only options. Use what fits your needs, budget, and team expertise.
Most importantly: make your pipelines reliable. Teams will depend on them. Analysts will base decisions on the data. Executives will present it to stakeholders. Make sure it’s trustworthy.
Wrapping Up
Building data pipelines at scale isn’t rocket science, but it requires thought and discipline. The three-zone architecture gives you a solid foundation. Raw for safety, curated for processing, refined for consumption.
Start simple. One pipeline, end-to-end. Get it working. Add quality checks. Then scale based on actual needs, not hypothetical ones.
After years and dozens of pipelines, I keep coming back to these patterns because they work. They’re not the newest or the flashiest, but they’re reliable. And in data engineering, reliability beats novelty every time.
The patterns, code examples, and lessons in this article are all based on real production experience. They’re battle-tested and proven to work at scale. Whether you’re building your first data platform or optimizing an existing one, I hope these insights help you avoid common pitfalls and build something reliable.
Want to discuss data architecture? Connect with me on LinkedIn or check out my portfolio. I’m always happy to talk about data engineering, pipeline design, or building scalable systems.
And if you’re contributing to dbt, Airflow, Great Expectations, or other open-source data tools, I’d love to hear about your experiences!
Thanks for reading! If you found this helpful, consider following for more articles on data engineering, cloud architecture, and building scalable systems.