<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pradeep Kalluri</title>
    <description>The latest articles on DEV Community by Pradeep Kalluri (@pradeep_kaalluri).</description>
    <link>https://dev.to/pradeep_kaalluri</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3608785%2F5c9181fc-36e7-4fc5-9655-c4bfe020df17.jpg</url>
      <title>DEV Community: Pradeep Kalluri</title>
      <link>https://dev.to/pradeep_kaalluri</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pradeep_kaalluri"/>
    <language>en</language>
    <item>
      <title>Rewriting My Apache Airflow PR: When Your First Solution Isn't the Right One</title>
      <dc:creator>Pradeep Kalluri</dc:creator>
      <pubDate>Mon, 05 Jan 2026 09:46:50 +0000</pubDate>
      <link>https://dev.to/pradeep_kaalluri/rewriting-my-apache-airflow-pr-when-your-first-solution-isnt-the-right-one-1mp7</link>
      <guid>https://dev.to/pradeep_kaalluri/rewriting-my-apache-airflow-pr-when-your-first-solution-isnt-the-right-one-1mp7</guid>
      <description>&lt;p&gt;Lessons learned from contributing to Apache Airflow after getting my first PR merged - complete rewrite, 7 CI failures, and persistence&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtk18bw5jydyrw0pmn5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtk18bw5jydyrw0pmn5w.png" alt=" " width="800" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After getting my first Apache Airflow PR merged (#58587), I felt pretty confident about the contribution process. So when I found another bug, I jumped right in with what seemed like a perfect solution.&lt;/p&gt;

&lt;p&gt;Two weeks and a complete rewrite later, my second PR (#59938) is now &lt;strong&gt;merged into Apache Airflow&lt;/strong&gt;. Here's the real story—the good, the messy, and what changed.&lt;/p&gt;




&lt;h2&gt;🐛 Finding the Bug&lt;/h2&gt;

&lt;p&gt;It started with a production issue: our Airflow scheduler kept crashing with a cryptic error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;InvalidStatsNameException: The stat name (pool.running_slots.data engineering pool 😎) 
has to be composed of ASCII alphabets, numbers, or the underscore, dot, or dash characters.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Someone had created a pool with spaces and an emoji. Airflow accepted it, but when trying to report metrics, everything broke.&lt;/p&gt;
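&lt;p&gt;To make the failure concrete, here's a tiny sketch of the kind of check behind that exception. The allowed-character pattern comes straight from the error message above; this is illustrative, not Airflow's actual code:&lt;/p&gt;

```python
import re

# Pattern inferred from the error message: ASCII letters, digits,
# underscore, dot, and dash only.
ALLOWED_STAT_NAME = re.compile(r"^[a-zA-Z0-9_.-]+$")


def is_valid_stat_name(name: str) -> bool:
    """Return True if the name is safe to emit as a stats metric name."""
    return ALLOWED_STAT_NAME.match(name) is not None


print(is_valid_stat_name("pool.running_slots.default_pool"))  # True
print(is_valid_stat_name("data engineering pool"))            # False
```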

&lt;p&gt;Having just gotten my first PR merged, I thought: "I know how this works now. I can fix this quickly."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spoiler: It wasn't quick.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;💻 My "Perfect" Solution: Validation&lt;/h2&gt;

&lt;p&gt;My approach seemed obvious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add validation when creating pools&lt;/li&gt;
&lt;li&gt;Only allow ASCII letters, numbers, underscores, dots, and dashes&lt;/li&gt;
&lt;li&gt;Reject invalid pool names with a clear error
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_pool_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;^[a-zA-Z0-9_.-]+$&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pool name &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is invalid. Pool names must only contain &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ASCII alphabets, numbers, underscores, dots, and dashes.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wrote tests, updated the news fragment, and submitted the PR with confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem solved!&lt;/strong&gt; Or so I thought.&lt;/p&gt;




&lt;h2&gt;💥 The Feedback That Humbled Me&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/potiuk" rel="noopener noreferrer"&gt;@potiuk&lt;/a&gt; (Apache Airflow PMC member) reviewed my PR:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I do not think it's a good idea to raise issue at pool creation time. This will mean that when you create an invalid pool, things will start crashing soon after. That's quite wrong behaviour."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He suggested a completely different approach: &lt;strong&gt;normalize the pool names for stats reporting instead of preventing them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My heart sank. I'd spent hours on this validation approach, written tests, updated docs. But he was right:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problems With My Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Users with existing "invalid" pools would be stuck&lt;/li&gt;
&lt;li&gt;❌ Migration would be complex and painful&lt;/li&gt;
&lt;li&gt;❌ It would break backward compatibility&lt;/li&gt;
&lt;li&gt;❌ Pools would be created but then crash the scheduler&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Better Approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Keep existing pools working&lt;/li&gt;
&lt;li&gt;✅ Normalize names only for stats reporting&lt;/li&gt;
&lt;li&gt;✅ Warn users, but don't break their systems&lt;/li&gt;
&lt;li&gt;✅ Graceful degradation instead of failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson 1: "Working" code isn't the same as "right" code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My first PR was accepted with minor tweaks. This time, I needed to completely rethink the solution.&lt;/p&gt;




&lt;h2&gt;🔄 The Rewrite: Normalization&lt;/h2&gt;

&lt;p&gt;I threw away my validation code and started fresh:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize_pool_name_for_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Normalize pool name for stats reporting by replacing invalid characters.

    Stats names must only contain ASCII alphabets, numbers, underscores, 
    dots, and dashes. Invalid characters are replaced with underscores.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Check if normalization is needed
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;^[a-zA-Z0-9_.-]+$&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;

    &lt;span class="c1"&gt;# Replace invalid characters with underscores
&lt;/span&gt;    &lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[^a-zA-Z0-9_.-]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Log warning
&lt;/span&gt;    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pool name &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; contains invalid characters for stats reporting. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reporting stats with normalized name &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Consider renaming the pool to avoid this warning.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;normalized&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of preventing "bad" pool names, we:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Accept any pool name (backward compatible)&lt;/li&gt;
&lt;li&gt;Normalize it when reporting metrics (fixes the crash)&lt;/li&gt;
&lt;li&gt;Log a warning (educates users)&lt;/li&gt;
&lt;li&gt;Suggest renaming (guides to best practices)&lt;/li&gt;
&lt;/ol&gt;
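&lt;p&gt;Stripped of logging, the core of the normalization is just a few lines. This is a condensed, runnable version of the function above, not the exact merged code:&lt;/p&gt;

```python
import re

# Stats names may only contain ASCII letters, digits, underscore, dot, dash.
VALID = re.compile(r"^[a-zA-Z0-9_.-]+$")


def normalize_pool_name_for_stats(name: str) -> str:
    """Replace characters that are invalid in stats names with underscores."""
    if VALID.match(name):
        return name  # already safe, report as-is
    # Graceful degradation: keep the pool working, fix only the metric name
    return re.sub(r"[^a-zA-Z0-9_.-]", "_", name)


print(normalize_pool_name_for_stats("default_pool"))           # default_pool
print(normalize_pool_name_for_stats("data engineering pool"))  # data_engineering_pool
```

&lt;p&gt;The pool itself keeps its original name; only the metric key is rewritten.&lt;/p&gt;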

&lt;p&gt;&lt;strong&gt;This was objectively better.&lt;/strong&gt; And I would never have thought of it without the feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 2: Maintainers see the bigger picture. Listen to them.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;😤 The Static Check Marathon&lt;/h2&gt;

&lt;p&gt;I pushed my rewritten code. CI failed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ Missing blank lines
❌ Import order wrong  
❌ LoggingMixin usage incorrect
❌ Missing 're' module import
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I fixed them. Pushed again. &lt;strong&gt;CI failed again&lt;/strong&gt; with different formatting issues.&lt;/p&gt;

&lt;p&gt;This happened &lt;strong&gt;SEVEN TIMES.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each time:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CI would auto-format and show what it wanted&lt;/li&gt;
&lt;li&gt;I'd try to apply the fixes manually (Windows, no local pre-commit)&lt;/li&gt;
&lt;li&gt;I'd push&lt;/li&gt;
&lt;li&gt;New formatting errors would appear&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By attempt #5, I was frustrated. By attempt #7, I was questioning my career choices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 3: Set up your local environment properly BEFORE you start coding.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;✨ The Breakthrough&lt;/h2&gt;

&lt;p&gt;On attempt #8, I finally:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Used PowerShell to apply exact formatting from CI diffs&lt;/li&gt;
&lt;li&gt;Added proper blank lines (2 after logger, 2 after functions)&lt;/li&gt;
&lt;li&gt;Fixed import order alphabetically&lt;/li&gt;
&lt;li&gt;Replaced LoggingMixin with module-level logger
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize_pool_name_for_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Function code...
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;normalized&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Pool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Class code...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;All checks passed! 🎉&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/potiuk" rel="noopener noreferrer"&gt;@potiuk&lt;/a&gt; approved with "Two nits" (which I quickly fixed). Minutes later, the PR was &lt;strong&gt;merged into Apache Airflow's main branch&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 4: Persistence beats perfection. Keep going.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;📊 First PR vs Second PR: The Differences&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;My First PR (#58587):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⏱️ Time: 3 days&lt;/li&gt;
&lt;li&gt;🔄 Major rewrites: 0&lt;/li&gt;
&lt;li&gt;❌ CI failures: 2&lt;/li&gt;
&lt;li&gt;📝 Commits: 4&lt;/li&gt;
&lt;li&gt;🎓 Learned: The contribution process&lt;/li&gt;
&lt;li&gt;✅ Status: Merged&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;My Second PR (#59938):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⏱️ Time: 2 weeks&lt;/li&gt;
&lt;li&gt;🔄 Major rewrites: 1 (complete approach change)&lt;/li&gt;
&lt;li&gt;❌ CI failures: 7&lt;/li&gt;
&lt;li&gt;📝 Commits: 16&lt;/li&gt;
&lt;li&gt;🎓 Learned: How to handle feedback, rewrites, and persistence&lt;/li&gt;
&lt;li&gt;✅ Status: &lt;strong&gt;Merged into Apache Airflow&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The second one taught me WAY more.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;📚 What I Learned (That My First PR Didn't Teach Me)&lt;/h2&gt;

&lt;h3&gt;Technical Lessons:&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Think About Backward Compatibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My first solution would have broken existing users. Always ask: "What happens to people already using this?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Graceful Degradation &amp;gt; Hard Failures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Warn users and normalize data instead of crashing. Systems should be resilient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Pre-commit Hooks Are Non-Negotiable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't use CI as your formatter. Set up &lt;code&gt;pre-commit&lt;/code&gt; locally FIRST.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Read Diffs Carefully&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CI was telling me exactly what it wanted. I just needed to pay attention.&lt;/p&gt;

&lt;h3&gt;Soft Skills Lessons:&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Be Ready to Throw Away Your Work&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent hours on validation code. All of it went in the trash. That's okay. It's part of learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Feedback Isn't Personal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Potiuk wasn't criticizing me. He was helping me build something better. There's a huge difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Persistence Matters More Than Talent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;7 CI failures felt embarrassing. But I kept going, and eventually it worked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Document Your Thinking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In my PR description, I explained WHY I chose normalization after feedback. This helped reviewers understand my thought process.&lt;/p&gt;




&lt;h2&gt;🎯 Advice for Your Second (or Third, or Tenth) PR&lt;/h2&gt;

&lt;h3&gt;When You Get Critical Feedback:&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Don't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Defend your solution immediately&lt;/li&gt;
&lt;li&gt;❌ Make minimal changes hoping they'll accept it&lt;/li&gt;
&lt;li&gt;❌ Take it personally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Read the feedback carefully (twice!)&lt;/li&gt;
&lt;li&gt;✅ Ask questions if you don't understand&lt;/li&gt;
&lt;li&gt;✅ Be willing to start over if needed&lt;/li&gt;
&lt;li&gt;✅ Thank reviewers for their time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;When CI Keeps Failing:&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Don't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Push 10 commits trying to guess the fix&lt;/li&gt;
&lt;li&gt;❌ Ignore error messages&lt;/li&gt;
&lt;li&gt;❌ Give up after 3 failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Set up local pre-commit hooks&lt;/li&gt;
&lt;li&gt;✅ Read the CI diff output carefully&lt;/li&gt;
&lt;li&gt;✅ Apply fixes locally and test before pushing&lt;/li&gt;
&lt;li&gt;✅ Ask for help if you're stuck after 3-4 attempts&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;When You Need to Rewrite:&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Don't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Try to salvage the old approach&lt;/li&gt;
&lt;li&gt;❌ Rush the rewrite&lt;/li&gt;
&lt;li&gt;❌ Skip tests because you're frustrated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Start with a clean slate&lt;/li&gt;
&lt;li&gt;✅ Think through the new approach carefully&lt;/li&gt;
&lt;li&gt;✅ Write better tests based on what you learned&lt;/li&gt;
&lt;li&gt;✅ Update documentation to match&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;🚀 Why You Should Keep Contributing&lt;/h2&gt;

&lt;p&gt;My first PR was smooth sailing. My second was rough waters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's exactly how learning works.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each contribution teaches you something new:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First PR:&lt;/strong&gt; The basics (fork, commit, PR, review)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Second PR:&lt;/strong&gt; Handling feedback and major rewrites
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Third PR:&lt;/strong&gt; You'll discover this next!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For your career:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-world experience with design decisions&lt;/li&gt;
&lt;li&gt;Proof you can handle feedback and pivot&lt;/li&gt;
&lt;li&gt;Stories to tell in interviews ("I once had to completely rewrite my approach...")&lt;/li&gt;
&lt;li&gt;Connections with senior engineers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For your skills:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding trade-offs (validation vs normalization)&lt;/li&gt;
&lt;li&gt;Production-quality code standards&lt;/li&gt;
&lt;li&gt;Communication under pressure&lt;/li&gt;
&lt;li&gt;Resilience and persistence&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;💡 Your Turn&lt;/h2&gt;

&lt;p&gt;If you've contributed once and it went well, &lt;strong&gt;do it again.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The second one will probably be harder. You might get asked to rewrite. CI might fail repeatedly. Reviewers might question your approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's when the real learning happens.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My second PR took 4x longer than my first. It also taught me 10x more.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What will your next contribution teach you?&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;📎 Resources&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;My Merged PRs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First PR (Merged): &lt;a href="https://github.com/apache/airflow/pull/58587" rel="noopener noreferrer"&gt;apache/airflow#58587&lt;/a&gt; ✅&lt;/li&gt;
&lt;li&gt;Second PR (Merged): &lt;a href="https://github.com/apache/airflow/pull/59938" rel="noopener noreferrer"&gt;apache/airflow#59938&lt;/a&gt; ✅
&lt;/li&gt;
&lt;li&gt;Issue: &lt;a href="https://github.com/apache/airflow/issues/59935" rel="noopener noreferrer"&gt;apache/airflow#59935&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Helpful Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst" rel="noopener noreferrer"&gt;Airflow Contributing Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pre-commit.com/" rel="noopener noreferrer"&gt;How to set up pre-commit&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Originally published on Medium:&lt;/strong&gt; &lt;a href="https://medium.com/@kalluripradeep99/rewriting-my-apache-airflow-pr-when-your-first-solution-isnt-the-right-one-8c4243ca9daf" rel="noopener noreferrer"&gt;https://medium.com/@kalluripradeep99/rewriting-my-apache-airflow-pr-when-your-first-solution-isnt-the-right-one-8c4243ca9daf&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The Time Our Pipeline Processed the Same Day’s Data 47 Times</title>
      <dc:creator>Pradeep Kalluri</dc:creator>
      <pubDate>Wed, 17 Dec 2025 15:29:11 +0000</pubDate>
      <link>https://dev.to/pradeep_kaalluri/the-time-our-pipeline-processed-the-same-days-data-47-times-5eoh</link>
      <guid>https://dev.to/pradeep_kaalluri/the-time-our-pipeline-processed-the-same-days-data-47-times-5eoh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9et8nxsikho8hiuu17d.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9et8nxsikho8hiuu17d.webp" alt=" " width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I noticed something odd in our Airflow logs on Monday morning. Our daily data pipeline had run multiple times over the weekend instead of once per day.&lt;/p&gt;

&lt;p&gt;Not just a few extra runs. Forty-seven executions. All processing the same day's data: December 3rd.&lt;/p&gt;

&lt;p&gt;Each run showed as successful. No errors. No alerts. Just the same date being processed over and over.&lt;/p&gt;

&lt;p&gt;Here's what happened and what I learned about retry logic that I wish I'd known sooner.&lt;/p&gt;




&lt;h2&gt;How I Found It&lt;/h2&gt;

&lt;p&gt;Monday morning, I was reviewing our weekend pipeline runs as part of my routine checks. Our Airflow dashboard showed an unusual pattern - our main transformation DAG had executed far more times than it should have.&lt;/p&gt;

&lt;p&gt;Looking closer, I saw the DAG had run 47 times between Saturday morning and Monday. But we only schedule it once per day at 2 AM.&lt;/p&gt;

&lt;p&gt;What caught my attention: every single run was processing December 3rd's data. Not December 4th, 5th, or 6th. Just December 3rd, repeatedly.&lt;/p&gt;

&lt;p&gt;All runs showed as successful. Green status. No failed tasks. The logs showed normal processing - read data, transform it, write to warehouse, mark complete.&lt;/p&gt;




&lt;h2&gt;The Investigation&lt;/h2&gt;

&lt;p&gt;I checked the obvious things first:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Was someone manually triggering reruns?&lt;/strong&gt; No. The audit logs showed all runs were automatic, triggered by the scheduler.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Had the source data changed?&lt;/strong&gt; No. The S3 timestamps showed December 3rd's data hadn't been modified since it was originally created.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Was there a scheduler configuration issue?&lt;/strong&gt; The schedule looked correct: daily at 2 AM.&lt;/p&gt;

&lt;p&gt;Then I noticed something in the run history. The pattern started on Saturday. Our pipeline ran at 2 AM (normal), then again at 4 AM, 6 AM, 8 AM... every two hours through the weekend.&lt;/p&gt;

&lt;p&gt;That's when I realized: these weren't scheduled runs. These were retries.&lt;/p&gt;




&lt;h2&gt;The Background&lt;/h2&gt;

&lt;p&gt;The previous Friday, we'd deployed a new analytics feature - calculating average transaction values by customer segment. Marketing wanted to track premium customer behavior separately from regular customers.&lt;/p&gt;

&lt;p&gt;The code had been tested thoroughly. We ran it against sample data from the past week. All tests passed. We deployed Friday afternoon.&lt;/p&gt;

&lt;p&gt;What we didn't test: weekend data patterns.&lt;/p&gt;




&lt;h2&gt;The Root Cause&lt;/h2&gt;

&lt;p&gt;Our pipeline used Airflow's execution date to determine which data partition to process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;execution_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;execution_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;data_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;execution_date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;s3_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket/data/date=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data_date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline had multiple steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read data from S3&lt;/li&gt;
&lt;li&gt;Transform and validate records&lt;/li&gt;
&lt;li&gt;Calculate daily metrics&lt;/li&gt;
&lt;li&gt;Write to warehouse&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step 3 is where things broke on weekends.&lt;/p&gt;

&lt;p&gt;Our new metric calculated "average transaction value per customer segment":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Calculate average for our premium customer segment
&lt;/span&gt;&lt;span class="n"&gt;target_customers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_segment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;premium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;total_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_customers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;customer_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_customers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;nunique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;avg_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total_value&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;customer_count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worked fine on the weekdays we tested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;December 3rd (Wednesday): 8,500 premium customers. Calculated successfully.&lt;/li&gt;
&lt;li&gt;December 4th (Thursday): 7,200 premium customers. Calculated successfully.&lt;/li&gt;
&lt;li&gt;December 5th (Friday): 6,800 premium customers. Calculated successfully.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;December 6th (Saturday): 0 premium customers.&lt;/p&gt;

&lt;p&gt;Our premium segment was entirely B2B customers - business accounts, enterprise clients. They don't transact on weekends. The businesses are closed.&lt;/p&gt;

&lt;p&gt;We had plenty of regular consumer transactions on Saturday (48,000 total), but zero from the premium segment we were calculating metrics for.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;customer_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_customers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;nunique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Returns 0
&lt;/span&gt;&lt;span class="n"&gt;avg_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total_value&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# Division by zero error
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The calculation failed. Task failed. Airflow scheduled a retry.&lt;/p&gt;

&lt;p&gt;Here's where the bug was. We had retry logic that tried to be helpful:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;try_number&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# If this is a retry, process the last successful date
&lt;/span&gt;    &lt;span class="c1"&gt;# to avoid reprocessing potentially corrupted data
&lt;/span&gt;    &lt;span class="n"&gt;last_successful&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_last_successful_date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;data_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;last_successful&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;execution_date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The logic made sense when we wrote it: if a task fails partway through processing, don't try to reprocess potentially corrupted data. Instead, go back to the last known good date.&lt;/p&gt;

&lt;p&gt;But in this case:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;December 6th processing failed (division by zero - no premium customers)&lt;/li&gt;
&lt;li&gt;Retry triggered, using execution_date = December 6th&lt;/li&gt;
&lt;li&gt;Retry logic checked: last successful date = December 3rd&lt;/li&gt;
&lt;li&gt;Processed December 3rd data (which had premium customer transactions)&lt;/li&gt;
&lt;li&gt;Calculation succeeded!&lt;/li&gt;
&lt;li&gt;Airflow marked December 6th as complete&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Then the same thing happened with December 7th (Sunday). And continued through the weekend until I stopped it Monday morning.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Impact
&lt;/h2&gt;

&lt;p&gt;The immediate problem was duplicate data. We'd loaded December 3rd's transactions into our warehouse 47 times.&lt;/p&gt;

&lt;p&gt;Our deduplication logic caught most of it - we used transaction IDs as primary keys, so the database just overwrote the same records.&lt;/p&gt;

&lt;p&gt;But not all our downstream reports deduplicated. Some aggregation tables counted each load as new data. For a few hours Monday morning, our dashboards showed December 3rd with 47x normal transaction volume.&lt;/p&gt;

&lt;p&gt;The bigger problem: we had no data for December 6th or 7th. The pipeline thought it had processed those dates successfully (because it processed December 3rd instead), so it moved on to December 8th.&lt;/p&gt;

&lt;p&gt;We skipped two days of weekend data without realizing it until a business user asked why our weekend sales reports were blank.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;I fixed two things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, the immediate bug&lt;/strong&gt; - handle zero-count scenarios in calculations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;target_customers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_segment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;premium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;customer_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_customers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;nunique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;customer_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;avg_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_customers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;customer_count&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# No customers in this segment - set to NULL rather than failing
&lt;/span&gt;    &lt;span class="n"&gt;avg_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Second, the retry logic&lt;/strong&gt; - removed it entirely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Always process the execution date, regardless of retry count
&lt;/span&gt;&lt;span class="n"&gt;data_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;execution_date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: retries should reprocess the SAME data, not fall back to old data. If there's a real data problem, retrying won't help. If it's a transient issue, retrying the same operation will work.&lt;/p&gt;

&lt;p&gt;For the weekend scenario specifically, I also updated our metrics logic to handle the expected pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Weekend data note: Premium segment (B2B) has zero weekend activity
# This is expected behavior - record NULL for weekend metrics
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
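&lt;p&gt;A minimal sketch of what that note looks like as an explicit check - the helper names are illustrative, and the weekend rule assumes premium is the all-B2B segment:&lt;/p&gt;

```python
from datetime import date

# Premium segment is B2B: zero weekend activity is expected, not an error.
def is_expected_empty(segment: str, data_date: date) -> bool:
    """Return True when an empty segment is normal for this date."""
    return segment == "premium" and data_date.weekday() >= 5  # Sat=5, Sun=6

def premium_avg_value(total_value: float, customer_count: int,
                      data_date: date):
    """Average value per premium customer; NULL on expected-empty days."""
    if customer_count > 0:
        return total_value / customer_count
    if is_expected_empty("premium", data_date):
        return None  # record NULL rather than failing the task
    raise ValueError(f"No premium customers on a weekday: {data_date}")
```

&lt;p&gt;With this in place, a Saturday with zero premium customers records NULL and succeeds, while a Wednesday with zero still fails loudly - which is what you want.&lt;/p&gt;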






&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Test with realistic data patterns.&lt;/strong&gt; We tested with weekday data because that's what was convenient. We should have tested with weekend data, holiday data, month-end data - all the edge cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retry logic needs careful thought.&lt;/strong&gt; Our retry logic assumed "last successful date" was a safe fallback. It wasn't. Retries should reprocess the same data, not different data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Division by zero is common in analytics.&lt;/strong&gt; Anytime you're calculating averages or ratios, handle the zero-count case explicitly. Don't just let it fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor successful runs, not just failures.&lt;/strong&gt; All our alerts focused on failures. These runs succeeded, so we had no alerts. The only way I caught it was manually reviewing logs.&lt;/p&gt;
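&lt;p&gt;One way to close that gap: after a run reports success, verify that the date you were supposed to process actually produced data, instead of trusting task state. A minimal sketch, with the audit source simplified to a set of loaded dates (names are illustrative):&lt;/p&gt;

```python
# Data-quality check for SUCCESSFUL runs: confirm the expected date
# actually landed in the warehouse, rather than trusting green checkmarks.
def check_processed_date(loaded_dates: set, expected_date: str) -> list:
    """Return a list of alert messages; empty means the run looks healthy."""
    alerts = []
    if expected_date not in loaded_dates:
        alerts.append(
            f"Run marked success but no rows loaded for {expected_date}"
        )
    return alerts
```

&lt;p&gt;The retry bug in this post would have tripped this immediately: the task "succeeded" for December 6th while only December 3rd data was loaded.&lt;/p&gt;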

&lt;p&gt;&lt;strong&gt;Execution date and data date are not the same thing.&lt;/strong&gt; Airflow's execution date is the logical date a run is for, not necessarily when the job actually runs. The date of the data you end up processing can differ from both, especially with retries. Keep them separate and explicit in your code.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Aftermath
&lt;/h2&gt;

&lt;p&gt;After the fix, the pipeline handled weekend data normally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Saturday: Processed December 13th. Premium metrics = NULL (expected). Success.&lt;/li&gt;
&lt;li&gt;Sunday: Processed December 14th. Premium metrics = NULL (expected). Success.&lt;/li&gt;
&lt;li&gt;No retries. No duplicate processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I backfilled the missing December 6th and 7th data manually and added a test case for weekend scenarios to our test suite.&lt;/p&gt;

&lt;p&gt;Total time debugging: about 3 hours. Time spent fixing missing weekend data: another 2 hours.&lt;/p&gt;

&lt;p&gt;Lesson learned: always test edge cases, especially predictable ones like weekends.&lt;/p&gt;




&lt;p&gt;Have you deployed code on a Friday that broke over the weekend? Or had retry logic that made things worse instead of better?&lt;/p&gt;

&lt;p&gt;I'd be interested to hear how others handle data quality validation for metrics with variable data patterns.&lt;/p&gt;

&lt;p&gt;Connect with me on &lt;a href="https://linkedin.com/in/pradeepkalluri" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or check out my &lt;a href="https://kalluripradeep.github.io" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Thanks for reading! Follow for more practical data engineering stories and lessons from production systems.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>airflow</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Why 71,000 Data Engineers Read My Article: What I Learned About Technical Writing</title>
      <dc:creator>Pradeep Kalluri</dc:creator>
      <pubDate>Mon, 08 Dec 2025 21:38:19 +0000</pubDate>
      <link>https://dev.to/pradeep_kaalluri/why-71000-data-engineers-read-my-article-what-i-learned-about-technical-writing-1g6a</link>
      <guid>https://dev.to/pradeep_kaalluri/why-71000-data-engineers-read-my-article-what-i-learned-about-technical-writing-1g6a</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqed1b69t6w80upxk52vz.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqed1b69t6w80upxk52vz.webp" alt=" " width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My article on data quality hit 71,000 views in a week. I didn't expect that.&lt;/p&gt;

&lt;p&gt;I've been writing technical content for years. Most articles get a few hundred views, maybe a thousand if I'm lucky. This one was different. It sparked 100+ upvotes on Reddit, generated 40+ discussions, and reached engineers at companies from early-stage startups to Fortune 500 banks.&lt;/p&gt;

&lt;p&gt;What made this one work when others didn't? I spent the past week analyzing the engagement, reading every comment, and trying to understand what resonated. Here's what I learned about technical writing that actually gets read.&lt;/p&gt;

&lt;h2&gt;
  
  
  I Wrote About Pain, Not Solutions
&lt;/h2&gt;

&lt;p&gt;Most technical articles start with a solution. "Here's how to implement data quality checks in Spark." "5 ways to optimize your pipeline." "A framework for..."&lt;/p&gt;

&lt;p&gt;My article started with a problem: "Our data pipeline was dropping 10% of transactions and nobody noticed."&lt;/p&gt;

&lt;p&gt;That opening line got more engagement than anything else I've written. Why? Because every data engineer has been there. We've all had that Monday morning panic when someone asks why the numbers look wrong.&lt;/p&gt;

&lt;p&gt;Pain is universal. Solutions are specific.&lt;/p&gt;

&lt;p&gt;When you start with pain, readers think "that's me." When you start with a solution, they think "does this apply to my situation?"&lt;/p&gt;

&lt;p&gt;The most-upvoted comment on my Reddit post: "I felt this in my soul. Currently debugging a similar issue." Not "great solution" or "I'll try this." Just recognition of shared pain.&lt;/p&gt;

&lt;h2&gt;
  
  
  I Showed My Mistakes, Not My Expertise
&lt;/h2&gt;

&lt;p&gt;I could have written "5 Best Practices for Data Quality" and listed industry-standard recommendations. Schema validation. Freshness checks. Data contracts. All correct. All boring.&lt;/p&gt;

&lt;p&gt;Instead, I wrote about the time I deployed validation logic on a Friday afternoon and spent the next week recovering 10% of our transactions from raw storage.&lt;/p&gt;

&lt;p&gt;The difference? Vulnerability.&lt;/p&gt;

&lt;p&gt;When you show expertise, readers feel inadequate. When you show mistakes, they feel understood. The best technical writing doesn't make you look smart - it makes readers feel less alone in their struggles.&lt;/p&gt;

&lt;p&gt;One engineer commented: "Thank you for being honest about this. Most articles make it seem like everyone has perfect pipelines except me."&lt;/p&gt;

&lt;p&gt;That's the response that tells you your writing worked. Not "great tutorial" but "I thought I was the only one."&lt;/p&gt;

&lt;h2&gt;
  
  
  I Used Specific Numbers, Not Vague Examples
&lt;/h2&gt;

&lt;p&gt;Generic: "Data quality issues can cause significant problems."&lt;/p&gt;

&lt;p&gt;Specific: "We dropped 10% of transactions. Finance called. Then my manager called."&lt;/p&gt;

&lt;p&gt;The specificity makes it real. Anyone can write about "quality issues causing problems." But 10%? That's a number someone will remember. That's a number that shows this actually happened.&lt;/p&gt;

&lt;p&gt;Throughout the article, I used real numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10% of data dropped&lt;/li&gt;
&lt;li&gt;Three weekends of being paged at 6 AM&lt;/li&gt;
&lt;li&gt;$100 becoming 10,000 (exactly 100x)&lt;/li&gt;
&lt;li&gt;Revenue reports 3-5% off&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each number grounds the story in reality. Readers commented on specific numbers more than anything else. "The $100 to 10,000 thing happened to us with currency conversion!"&lt;/p&gt;

&lt;p&gt;Vague writing is forgettable. Specific writing is memorable.&lt;/p&gt;

&lt;h2&gt;
  
  
  I Wrote for Skimmers, Not Readers
&lt;/h2&gt;

&lt;p&gt;Most people don't read articles. They skim them.&lt;/p&gt;

&lt;p&gt;I structured every section the same way:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clear heading describing the mistake&lt;/li&gt;
&lt;li&gt;Opening hook - what went wrong&lt;/li&gt;
&lt;li&gt;The debugging journey&lt;/li&gt;
&lt;li&gt;The fix&lt;/li&gt;
&lt;li&gt;The lesson learned&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Readers could skim the headings, pick the mistakes that sounded familiar, and dive into just those sections. Multiple people commented "I skipped to #3 because we just had a currency issue."&lt;/p&gt;

&lt;p&gt;That's not a failure of writing. That's success. They found value without reading every word.&lt;/p&gt;

&lt;p&gt;Good technical writing respects that people are busy. Structure helps them find what they need quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  I Ended with Questions, Not Conclusions
&lt;/h2&gt;

&lt;p&gt;Most articles end with "In conclusion..." followed by a summary of what you just read.&lt;/p&gt;

&lt;p&gt;I ended with: "What's your worst pipeline debugging story?"&lt;/p&gt;

&lt;p&gt;That question generated half the comments. People shared their own disasters. Currency issues. Schema changes. Data quietly disappearing. Each comment added value for future readers.&lt;/p&gt;

&lt;p&gt;The best technical articles start conversations, not just share information. Questions invite engagement. Conclusions end it.&lt;/p&gt;

&lt;h2&gt;
  
  
  I Published on a Platform Where My Audience Lives
&lt;/h2&gt;

&lt;p&gt;I didn't just post on Medium and hope people found it. I posted directly to r/dataengineering on Reddit, where thousands of data engineers actively discuss their work.&lt;/p&gt;

&lt;p&gt;Platform matters. A brilliant article posted to the wrong platform gets zero views. A decent article posted where your audience already hangs out gets thousands.&lt;/p&gt;

&lt;p&gt;Where does your audience spend time? That's where you should publish first.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Didn't Matter
&lt;/h2&gt;

&lt;p&gt;Before the article went viral, I worried about things that turned out not to matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perfect writing&lt;/strong&gt;: My article has typos. A few sentences are awkwardly phrased. Nobody cared. The content mattered more than the polish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Length&lt;/strong&gt;: The article is long - over 2,000 words. Conventional wisdom says shorter is better. But if the content is valuable, people will read. Several comments said "this was long but worth every minute."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SEO optimization&lt;/strong&gt;: I didn't optimize for search engines. I wrote for humans. The article ranks nowhere in Google but hit 71,000 views through Reddit and LinkedIn shares.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fancy formatting&lt;/strong&gt;: No graphics, no diagrams, no custom styling. Just text, code blocks, and clear headings. Content beats design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Professional polish&lt;/strong&gt;: I wrote how I talk. Contractions, sentence fragments, informal language. It feels more authentic than formal technical writing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern Across Popular Technical Content
&lt;/h2&gt;

&lt;p&gt;After analyzing my article's success, I looked at other high-engagement technical posts. A pattern emerged:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They all start with a specific problem the author actually experienced.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not "here's a common issue" but "here's what happened to me last Tuesday."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They all show vulnerability.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not "here's what I know" but "here's what I learned after screwing up."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They all use concrete details.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Numbers, names, specific error messages. Real things that happened to a real person.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They all invite discussion.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They end with questions or acknowledgment that others might have different experiences.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Your Technical Writing
&lt;/h2&gt;

&lt;p&gt;If you're writing technical content, here's what worked for me:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with a real problem you solved.&lt;/strong&gt; Not a hypothetical scenario. Something that actually cost you time, caused stress, made you look bad.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be specific.&lt;/strong&gt; Use real numbers. Quote actual error messages. Describe the exact commands you ran.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Show the messy middle.&lt;/strong&gt; Don't just show the solution. Show the three wrong paths you took first. The assumptions you made that were wrong. The obvious thing you missed for way too long.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structure for skimmers.&lt;/strong&gt; Clear headings. Predictable section structure. Make it easy to find the parts people care about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write for people, not search engines.&lt;/strong&gt; Informal language. Short paragraphs. Conversational tone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;End with a question.&lt;/strong&gt; Invite people to share their experiences. The comments add as much value as your article.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post where your audience lives.&lt;/strong&gt; Don't just publish and pray. Go to where engineers already discuss problems like yours.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Unexpected Part
&lt;/h2&gt;

&lt;p&gt;The article's success wasn't about the writing quality. It was about recognizing shared experience.&lt;/p&gt;

&lt;p&gt;Every engineer who upvoted or commented had debugged similar issues. Weekend-only failures. Silent data loss. Schema changes nobody announced. They'd all been there.&lt;/p&gt;

&lt;p&gt;The article worked because it said "you're not alone" better than it said "here's the answer."&lt;/p&gt;

&lt;p&gt;That's probably the most important lesson about technical writing: your readers don't need you to be the smartest person in the room. They need you to be the honest one.&lt;/p&gt;

&lt;p&gt;Write about what went wrong. Show your debugging process, including the dead ends. Use real numbers from real incidents. Be specific enough that people know this actually happened to you.&lt;/p&gt;

&lt;p&gt;That vulnerability creates trust. And trust is what makes people read your writing, share it, and come back for more.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm Writing Next
&lt;/h2&gt;

&lt;p&gt;This experience changed how I think about technical writing. I'm planning more articles about real production incidents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The time our pipeline processed the same day's data 47 times&lt;/li&gt;
&lt;li&gt;When I accidentally made our entire data warehouse read-only&lt;/li&gt;
&lt;li&gt;The monitoring alert that cried wolf so often we turned it off (then regretted it)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one will follow the same pattern: real problem, specific details, messy debugging process, lessons learned.&lt;/p&gt;

&lt;p&gt;Because apparently that's what 71,000 people want to read.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What's your experience with technical writing?&lt;/strong&gt; Have you written content that unexpectedly resonated? Or struggled to get engagement despite great content? I'd love to hear what you've learned.&lt;/p&gt;

&lt;p&gt;Connect with me on &lt;a href="https://linkedin.com/in/pradeepkalluri" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or check out my &lt;a href="https://kalluripradeep.github.io" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt;. Always happy to discuss technical writing, data engineering, or production war stories.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Thanks for reading! If this resonated with you, follow for more articles on data engineering, building in production, and lessons learned the hard way.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>writing</category>
      <category>programming</category>
      <category>career</category>
    </item>
    <item>
      <title>5 Data Pipeline Mistakes That Cost Me Weeks of Debugging</title>
      <dc:creator>Pradeep Kalluri</dc:creator>
      <pubDate>Mon, 01 Dec 2025 19:14:02 +0000</pubDate>
      <link>https://dev.to/pradeep_kaalluri/5-data-pipeline-mistakes-that-cost-me-weeks-of-debugging-2611</link>
      <guid>https://dev.to/pradeep_kaalluri/5-data-pipeline-mistakes-that-cost-me-weeks-of-debugging-2611</guid>
      <description>&lt;p&gt;After three years building data pipelines in production, I’ve made plenty of mistakes. Some were quick fixes. Others cost me days of debugging and awkward conversations with management.&lt;/p&gt;

&lt;p&gt;Here are five mistakes that taught me the most — not because they were dramatic or interesting, but because they were subtle enough to slip through testing and painful enough that I’ll never make them again.&lt;/p&gt;

&lt;p&gt;If you’re building data pipelines, hopefully my mistakes save you some time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake #1: Silently Dropping 10% of Data
&lt;/h2&gt;

&lt;p&gt;I added validation logic to filter out “invalid” records from our pipeline. Seemed smart — catch bad data before it reaches the warehouse. I tested it on a sample dataset, everything looked fine, deployed it on a Friday afternoon.&lt;/p&gt;

&lt;p&gt;Monday morning, a business analyst asked why revenue looked 10% lower than expected. My first thought was “probably just a slow weekend.” Then finance called. Then my manager called.&lt;/p&gt;

&lt;p&gt;Turns out, a source system had added a new status code without telling anyone. My validation logic saw this unknown code, flagged those records as invalid, and silently dropped them. 10% of our transactions, gone. The pipeline ran successfully — all green checkmarks, no errors — because technically it worked exactly as I’d coded it.&lt;/p&gt;

&lt;p&gt;The worst part? It took almost a full day to figure out what was wrong. I was looking for pipeline failures, schema mismatches, network issues. Everything checked out. The data was being dropped intentionally by my own code, so there were no error logs.&lt;/p&gt;

&lt;p&gt;How did I fix it? First, I stopped the pipeline immediately. Then I recovered the data from our raw zone and reprocessed it. Took about four hours to backfill everything.&lt;/p&gt;

&lt;p&gt;But the real fix was changing how I think about validation. Now I don’t drop data silently — I send it to an error table with alerts. If something unexpected appears, I know about it. And I never, ever deploy significant changes on Friday afternoons anymore.&lt;/p&gt;
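&lt;p&gt;A minimal sketch of that route-don't-drop pattern, in plain Python with illustrative status codes:&lt;/p&gt;

```python
# Status codes this pipeline understands (illustrative values).
KNOWN_STATUS_CODES = {"completed", "pending", "refunded"}

def split_valid_invalid(records: list) -> tuple:
    """Route records with unknown status codes to an error bucket
    instead of silently dropping them."""
    valid = [r for r in records if r["status"] in KNOWN_STATUS_CODES]
    invalid = [r for r in records if r["status"] not in KNOWN_STATUS_CODES]
    if invalid:
        # In production: write `invalid` to an error table and fire an alert.
        print(f"ALERT: {len(invalid)} records with unknown status codes")
    return valid, invalid
```

&lt;p&gt;The key difference from the original bug: the unexpected records are kept and loudly flagged, not filtered away behind a green checkmark.&lt;/p&gt;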

&lt;p&gt;The lesson? Pipelines that succeed aren’t always correct. Green checkmarks just mean your code ran, not that it did the right thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake #2: The Weekend Bug That Haunted Me
&lt;/h2&gt;

&lt;p&gt;Our pipeline ran perfectly Monday through Friday. Every single weekend it failed. Every Saturday and Sunday morning, I got paged.&lt;/p&gt;

&lt;p&gt;For the first few weeks, I thought it was a coincidence. Maybe the source system had issues on weekends. Maybe network problems. I spent hours checking infrastructure, reviewing logs, testing connections. Everything looked fine during the week.&lt;/p&gt;

&lt;p&gt;Then I realized — the pipeline wasn’t actually failing. It was being killed by our monitoring system because it thought something was wrong.&lt;/p&gt;

&lt;p&gt;The problem? I had a row count check with a hard-coded threshold. The pipeline expected at least 100,000 records per day. Monday through Friday, we got 120,000–150,000 transactions. Easy pass.&lt;/p&gt;

&lt;p&gt;Weekends? Only 20,000–30,000 transactions. Our customers didn’t work weekends. Lower volume was completely normal. But my check didn’t know that. It saw “only 20,000 rows” and decided the pipeline had failed.&lt;/p&gt;

&lt;p&gt;The fix was embarrassingly simple — change from a fixed threshold to a percentage-based check comparing against the same day of the week historically. Weekends are compared to previous weekends, not to weekdays.&lt;/p&gt;

&lt;p&gt;Took me three weekends of being woken up at 6 AM to figure this out.&lt;/p&gt;

&lt;p&gt;The lesson? Context matters in data validation. What’s normal on Tuesday isn’t normal on Sunday. Your checks need to understand the patterns in your data, not just absolute numbers.&lt;/p&gt;
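&lt;p&gt;A sketch of that same-weekday check (the 50% tolerance is illustrative - tune it to your volume):&lt;/p&gt;

```python
from statistics import mean

def row_count_ok(today_count: int, history: dict,
                 weekday: int, tolerance: float = 0.5) -> bool:
    """Compare today's row count against the historical average for the
    SAME weekday, instead of against a fixed global threshold.
    `history` maps weekday number (Mon=0..Sun=6) to past daily counts."""
    same_day_counts = history.get(weekday, [])
    if not same_day_counts:
        return True  # no history yet; don't page anyone
    baseline = mean(same_day_counts)
    return abs(today_count - baseline) <= tolerance * baseline
```

&lt;p&gt;A Saturday with 25,000 rows now passes, because it's compared against previous Saturdays (20,000-30,000), not against weekdays.&lt;/p&gt;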

&lt;h2&gt;
  
  
  Mistake #3: When $100 Became 10,000
&lt;/h2&gt;

&lt;p&gt;Transaction amounts suddenly exploded in our reports overnight. Every single number was exactly 100x what it should have been.&lt;/p&gt;

&lt;p&gt;This one took me almost a full day to debug because the numbers weren’t obviously wrong. A $100 transaction became 10,000. In isolation, that’s a valid transaction amount. Nothing technically broken — no nulls, no errors, schema matched perfectly.&lt;/p&gt;

&lt;p&gt;The breakthrough came when I compared distributions. Average transaction amount had been around $150 for months. Suddenly it was $15,000. That’s when I knew something systemic had changed.&lt;/p&gt;

&lt;p&gt;I traced it back to the source system. They’d changed from sending amounts in dollars ($100.00) to cents (10000). Their reasoning? “Cents are more precise and avoid floating-point issues.” Fair enough. But they didn’t tell anyone.&lt;/p&gt;

&lt;p&gt;My pipeline happily processed the new format. Why wouldn’t it? Numbers are numbers. The schema was still “decimal field for amount.” Technically valid.&lt;/p&gt;

&lt;p&gt;The fix was adding a validation check — if the average transaction amount changes by more than 50% day-over-day, alert someone. Also, I started tracking the ratio of amounts to compare against historical patterns.&lt;/p&gt;

&lt;p&gt;But more importantly, I learned to monitor distributions, not just point values. A value can be individually valid but collectively wrong. If every transaction suddenly costs 100x more, something changed in how the data is formatted, even if the schema stayed the same.&lt;/p&gt;
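&lt;p&gt;The day-over-day check is only a few lines. A sketch, using the 50% threshold from the fix:&lt;/p&gt;

```python
def avg_amount_shifted(today_avg: float, yesterday_avg: float,
                       threshold: float = 0.5) -> bool:
    """Return True when the average transaction amount moved more than
    `threshold` (50%) day-over-day - e.g. a silent dollars-to-cents switch."""
    if yesterday_avg == 0:
        return today_avg != 0
    return abs(today_avg - yesterday_avg) / yesterday_avg > threshold
```

&lt;p&gt;A dollars-to-cents change turns a $150 average into 15,000 - a shift this catches on day one instead of after finance calls.&lt;/p&gt;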

&lt;p&gt;The lesson? Data can be technically correct but business incorrect. Schema validation catches structure problems. Distribution monitoring catches semantic problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake #4: The Schema Change Nobody Told Me About
&lt;/h2&gt;

&lt;p&gt;A source system added new columns to their schema without telling anyone. I didn’t update my transformation logic to include these columns when checking for duplicates.&lt;/p&gt;

&lt;p&gt;The result? Records that should have been deduplicated weren’t. We started seeing the same transactions appear multiple times in our reports. Not every record — just enough to make the numbers look slightly off.&lt;/p&gt;

&lt;p&gt;The confusing part was that my deduplication logic was working correctly for the old schema. I was using &lt;code&gt;transaction_id&lt;/code&gt; and &lt;code&gt;timestamp&lt;/code&gt; to identify duplicates. But the source system had added a &lt;code&gt;version&lt;/code&gt; column that changed for retries. Same &lt;code&gt;transaction_id&lt;/code&gt;, same &lt;code&gt;timestamp&lt;/code&gt;, different &lt;code&gt;version&lt;/code&gt;. My code saw them as the same record. The database saw them as different.&lt;/p&gt;

&lt;p&gt;It took me two days to figure out because the duplicates weren’t obvious. Revenue reports were 3–5% higher than expected. Not enough to scream “something’s broken” but enough that finance noticed during reconciliation.&lt;/p&gt;

&lt;p&gt;The fix was simple once I found it — include all relevant columns in the deduplication logic. The lesson? Always check what changed in the source schema, even if nobody tells you it changed.&lt;/p&gt;

&lt;p&gt;Now I log schema changes automatically. If a new column appears, I get an alert. Saves me from assuming the schema is the same as last week.&lt;/p&gt;
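&lt;p&gt;The schema logging can be as simple as diffing column sets between runs. A sketch - how you persist the previous snapshot is up to you:&lt;/p&gt;

```python
def diff_schema(previous_cols: set, current_cols: set) -> dict:
    """Detect columns added or removed since the last run, so a silent
    source-side schema change produces an alert instead of quiet drift."""
    return {
        "added": sorted(current_cols - previous_cols),
        "removed": sorted(previous_cols - current_cols),
    }
```

&lt;p&gt;The &lt;code&gt;version&lt;/code&gt; column from this incident would have shown up under &lt;code&gt;added&lt;/code&gt; on the first run after the source changed.&lt;/p&gt;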

&lt;h2&gt;
  
  
  Mistake #5: The Missing Columns in COALESCE
&lt;/h2&gt;

&lt;p&gt;I was merging data from multiple sources using COALESCE to pick the first non-null value across columns. Simple enough — if Source A has the data, use it. If not, fall back to Source B, then Source C.&lt;/p&gt;

&lt;p&gt;Except I didn’t include all the columns in my logic. I focused on the main fields — customer ID, transaction amount, date. But I missed some metadata columns like &lt;code&gt;source_system_id&lt;/code&gt; and &lt;code&gt;updated_timestamp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This created duplicates because records that should have been identified as the same transaction weren’t. They had the same main fields but different metadata, so my join logic treated them as separate records.&lt;/p&gt;

&lt;p&gt;Debugging this was frustrating because the duplicates followed no obvious pattern. Some customers had them, others didn’t. Some days had duplicates, other days were clean. It looked random.&lt;/p&gt;

&lt;p&gt;The breakthrough came when I added granularity to my debugging — instead of just checking if duplicates existed, I checked exactly which columns were causing them. I wrote a query that compared all fields between duplicate records and showed me which ones differed.&lt;/p&gt;

&lt;p&gt;That’s when I saw it — the metadata columns I’d ignored. Once I added them to my COALESCE logic with proper priority ordering, the duplicates disappeared.&lt;/p&gt;

&lt;p&gt;The lesson? When handling multiple data sources, think about ALL columns that define uniqueness, not just the obvious business keys. And when debugging duplicates, check field-by-field to see exactly what’s different.&lt;/p&gt;
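&lt;p&gt;The comparison itself doesn't need anything fancy. Here's a minimal sketch of the field-by-field diff, with records shown as plain dicts and illustrative column names:&lt;/p&gt;

```python
# Minimal sketch of the field-by-field comparison that exposes which
# columns differ between two "duplicate" records (names illustrative).

def diff_records(left, right):
    """Return the columns whose values differ between two records."""
    columns = set(left) | set(right)
    return sorted(c for c in columns if left.get(c) != right.get(c))

a = {"customer_id": 7, "amount": 99.0, "source_system_id": "A"}
b = {"customer_id": 7, "amount": 99.0, "source_system_id": "B"}

print(diff_records(a, b))  # only the metadata column differs
```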

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;Looking back at these five mistakes, there’s a pattern:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test the right things.&lt;/strong&gt; Schema validation is easy. Testing business logic is hard. Most of my bugs came from assumptions about the data, not the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor what matters.&lt;/strong&gt; Green checkmarks mean your pipeline ran. They don’t mean your data is correct. Track distributions, row counts, and patterns — not just success/failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context is everything.&lt;/strong&gt; A valid value on Tuesday might be invalid on Sunday. A normal schema last week might have changed this week. Your validation logic needs to understand context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Never drop data silently.&lt;/strong&gt; If something looks wrong, flag it loudly. Send it to an error table. Alert someone. Don’t just filter it out and hope it was actually bad data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep raw data.&lt;/strong&gt; Every single one of these mistakes was fixable because we kept the original data. When your transformation logic is wrong, you can reprocess. When you’ve dropped the raw data, you’re done.&lt;/p&gt;

&lt;p&gt;The best part about making mistakes? You only make each one once — if you learn from it. These five cost me weeks of debugging time. But now I have checks in place to catch them before they reach production.&lt;/p&gt;

&lt;p&gt;What’s your worst pipeline debugging story? I’d love to hear what others have learned the hard way.&lt;/p&gt;


&lt;p&gt;Want to discuss data pipelines or debugging strategies? Connect with me on &lt;a href="https://linkedin.com/in/pradeepkalluri" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or check out my &lt;a href="https://kalluripradeep.github.io" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt;. Always happy to talk about building reliable data systems.&lt;/p&gt;


&lt;p&gt;&lt;em&gt;Thanks for reading! If this was helpful, follow for more articles on data engineering, production lessons, and building reliable systems.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>python</category>
      <category>datascience</category>
      <category>devbugsmash</category>
    </item>
    <item>
      <title>Data Quality at Scale: Why Your Pipeline Needs More Than Green Checkmarks</title>
      <dc:creator>Pradeep Kalluri</dc:creator>
      <pubDate>Mon, 24 Nov 2025 10:02:15 +0000</pubDate>
      <link>https://dev.to/pradeep_kaalluri/data-quality-at-scale-why-your-pipeline-needs-more-than-green-checkmarks-2dch</link>
      <guid>https://dev.to/pradeep_kaalluri/data-quality-at-scale-why-your-pipeline-needs-more-than-green-checkmarks-2dch</guid>
      <description>&lt;p&gt;Originally published on Medium: &lt;a href="https://medium.com/@kalluripradeep99/data-quality-at-scale-why-your-pipeline-needs-more-than-green-checkmarks-f3af3dbff8a4" rel="noopener noreferrer"&gt;https://medium.com/@kalluripradeep99/data-quality-at-scale-why-your-pipeline-needs-more-than-green-checkmarks-f3af3dbff8a4&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I once watched a company make a major strategic decision based on a dashboard that had been showing incorrect data for three weeks. The scary part? Nobody knew. The data pipeline ran successfully every day. All green checkmarks in Airflow. Zero alerts. Everything looked fine.&lt;/p&gt;

&lt;p&gt;Except the data was wrong.&lt;/p&gt;

&lt;p&gt;After years of building data platforms, I've learned something important: moving data is the easy part. Making sure it's correct is what keeps you up at night.&lt;/p&gt;

&lt;p&gt;In this article, I'll share what I've learned about data quality at scale. Not the theory you read in textbooks, but the practical stuff that actually matters when your CEO is looking at a dashboard you built.&lt;/p&gt;

&lt;h2&gt;The $2 Million Dashboard&lt;/h2&gt;

&lt;p&gt;Let me tell you about that incident I mentioned. A source system quietly changed how they tracked customer IDs. They sent us an email about it (that got lost in someone's inbox). Our pipeline kept running perfectly. Schema matched. No null values. Everything technically valid.&lt;/p&gt;

&lt;p&gt;But we were now double-counting about 15% of customers.&lt;/p&gt;

&lt;p&gt;For three weeks, our growth metrics looked amazing. Leadership loved it. They approved a massive marketing spend based on those numbers. Then someone in finance noticed the discrepancy during a reconciliation. We had to go back and explain that our "amazing growth" was actually a data bug.&lt;/p&gt;

&lt;p&gt;That was expensive. Not just the money, but the trust. It took months to rebuild confidence in our data platform.&lt;/p&gt;

&lt;h2&gt;Why Traditional Testing Isn't Enough&lt;/h2&gt;

&lt;p&gt;If you're coming from software engineering, you might think, "Just write unit tests!" I thought that too. Didn't work.&lt;/p&gt;

&lt;p&gt;Here's the thing: with code, you control the inputs. You write tests for expected scenarios. Code is deterministic.&lt;/p&gt;

&lt;p&gt;Data is different. You don't control the source systems. They change without telling you. Business rules evolve. Schema drift happens. And here's the worst part: data can be technically valid but business invalid.&lt;/p&gt;

&lt;p&gt;Some real examples I've seen:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The string that wasn't a string:&lt;/strong&gt; Transaction amounts came through as "1,234.56" instead of 1234.56. Schema said "string field," so it passed validation. Try summing those in a SQL query. You get $0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The date that wasn't wrong:&lt;/strong&gt; A source system started sending dates in DD/MM/YYYY format instead of MM/DD/YYYY. Every date from the 1st to the 12th of the month worked fine. Then on the 13th, everything broke. Took us two weeks to figure out why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The midnight ghost records:&lt;/strong&gt; Mobile app transactions synced when users had WiFi. Some took 48 hours to arrive. Our daily reports were always incomplete, but we had no way to know which days were "final."&lt;/p&gt;

&lt;p&gt;I learned the hard way that you need to test six things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Schema (structure is right)&lt;/li&gt;
&lt;li&gt;Values (numbers make sense)&lt;/li&gt;
&lt;li&gt;Volume (right amount of data)&lt;/li&gt;
&lt;li&gt;Freshness (data is recent enough)&lt;/li&gt;
&lt;li&gt;Distribution (patterns look normal)&lt;/li&gt;
&lt;li&gt;Relationships (foreign keys work)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most teams only test the first one.&lt;/p&gt;

&lt;h2&gt;What Data Quality Actually Means&lt;/h2&gt;

&lt;p&gt;I organize quality checks into four categories. Each one catches different types of problems.&lt;/p&gt;

&lt;h3&gt;Completeness: Is Everything There?&lt;/h3&gt;

&lt;p&gt;This seems obvious, but it's where most issues start. You expect 100,000 rows. You get 60,000. Is that a problem or just a slow day?&lt;/p&gt;

&lt;p&gt;I check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row counts against historical averages (alert if &amp;gt;20% different)&lt;/li&gt;
&lt;li&gt;Null rates in critical fields (&lt;code&gt;customer_id&lt;/code&gt; should never be null)&lt;/li&gt;
&lt;li&gt;All expected dates/partitions are present&lt;/li&gt;
&lt;li&gt;Foreign keys exist (every transaction has a valid customer)&lt;/li&gt;
&lt;/ul&gt;
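&lt;p&gt;The first of those checks, row counts against a historical average, fits in a few lines. A sketch (the 20% tolerance and the counts are illustrative):&lt;/p&gt;

```python
# Relative row-count check sketch: alert when today's volume deviates
# from the trailing average by more than a tolerance (values illustrative).

def row_count_alert(today_count, history, tolerance=0.20):
    """Return True when today's count deviates more than `tolerance`
    from the trailing average. `history` is a list of recent daily counts."""
    if not history:
        return False  # nothing to compare against yet
    avg = sum(history) / len(history)
    return abs(today_count - avg) / avg > tolerance

last_week = [100_000, 98_000, 102_000, 101_000, 99_000, 97_000, 103_000]
print(row_count_alert(60_000, last_week))  # big drop, should alert
```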

&lt;p&gt;One time we lost an entire day of data because the source system had a disk space issue. They dumped empty files to our S3 bucket. Our pipeline happily processed zero rows. Everything succeeded. We only found out when someone asked why yesterday's revenue was $0.&lt;/p&gt;

&lt;p&gt;Now I check row counts. If today is 50% lower than the average of the last 7 days, I get paged.&lt;/p&gt;

&lt;h3&gt;Accuracy: Is the Data Correct?&lt;/h3&gt;

&lt;p&gt;This is harder because "correct" depends on business context. A $1 million transaction might be valid for some businesses, fraud for others.&lt;/p&gt;

&lt;p&gt;I focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Range checks (transaction amounts between $0 and $100,000)&lt;/li&gt;
&lt;li&gt;Format validation (emails look like emails, dates are dates)&lt;/li&gt;
&lt;li&gt;Business rules (refund amount can't exceed original purchase)&lt;/li&gt;
&lt;li&gt;Reconciliation with source systems (row counts and totals match)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trick is working with business users to define what "correct" means. Don't guess. Ask.&lt;/p&gt;

&lt;h3&gt;Consistency: Does It All Make Sense Together?&lt;/h3&gt;

&lt;p&gt;Data doesn't exist in isolation. Tables relate to each other. Metrics calculated different ways should match.&lt;/p&gt;

&lt;p&gt;I check for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Orphaned records (transactions without a customer)&lt;/li&gt;
&lt;li&gt;Duplicate primary keys (should be impossible but happens)&lt;/li&gt;
&lt;li&gt;Cross-table consistency (revenue calculated two ways gives the same answer)&lt;/li&gt;
&lt;li&gt;Time-series anomalies (revenue doesn't drop 90% overnight unless something big happened)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We once had a bug where a retry mechanism created duplicate records. Started at 0.1%. Grew to 15% over three months. Aggregations were inflated. We were reporting 15% higher revenue than we actually had. Found it during a financial audit. Not fun.&lt;/p&gt;

&lt;h3&gt;Freshness: Is It Recent Enough?&lt;/h3&gt;

&lt;p&gt;Stale data is useless data. But "fresh enough" depends on the use case. Real-time fraud detection needs data from the last minute. Monthly reports can tolerate day-old data.&lt;/p&gt;

&lt;p&gt;I monitor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum timestamp in each table&lt;/li&gt;
&lt;li&gt;Time since last successful pipeline run&lt;/li&gt;
&lt;li&gt;SLA breaches (data should be &amp;lt;2 hours old for dashboards)&lt;/li&gt;
&lt;/ul&gt;
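&lt;p&gt;A freshness check is just as small. A sketch against the 2-hour dashboard SLA above (the timestamps are illustrative):&lt;/p&gt;

```python
# Freshness check sketch: compare the newest timestamp in a table
# against an SLA (2 hours here; timestamps are illustrative).
from datetime import datetime, timedelta, timezone

def freshness_breached(max_timestamp, sla=timedelta(hours=2), now=None):
    """True when the newest row is older than the SLA allows."""
    now = now or datetime.now(timezone.utc)
    return now - max_timestamp > sla

now = datetime(2024, 11, 20, 12, 0, tzinfo=timezone.utc)
stale = datetime(2024, 11, 20, 8, 0, tzinfo=timezone.utc)
print(freshness_breached(stale, now=now))  # 4 hours old, breaches the 2-hour SLA
```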

&lt;p&gt;Set clear SLAs. Measure against them. Alert when you miss them.&lt;/p&gt;

&lt;h2&gt;How I Actually Implement This&lt;/h2&gt;

&lt;p&gt;Theory is nice. Let me show you what I actually do.&lt;/p&gt;

&lt;h3&gt;Great Expectations for the Processing Layer&lt;/h3&gt;

&lt;p&gt;This tool changed how I think about data quality. Instead of writing custom validation code, you define expectations. Then run them automatically.&lt;/p&gt;

&lt;p&gt;Here's a real example from a transaction pipeline:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import great_expectations as ge

# Load your data
df = ge.read_csv("s3://curated-zone/transactions.csv")

# Critical checks - pipeline stops if these fail
df.expect_column_values_to_not_be_null("transaction_id")
df.expect_column_values_to_be_unique("transaction_id")
df.expect_column_values_to_not_be_null("customer_id")

# Range checks
df.expect_column_values_to_be_between("amount", min_value=0, max_value=1000000)

# Valid values only
df.expect_column_values_to_be_in_set("currency", ["USD", "EUR", "GBP"])
df.expect_column_values_to_be_in_set("status", ["pending", "completed", "failed"])

# Format validation
df.expect_column_values_to_match_regex(
    "email",
    r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
)

# Business rule: refunds can't exceed original amount
df.expect_column_pair_values_A_to_be_greater_than_B(
    column_A="original_amount",
    column_B="refund_amount"
)

# Row count check (based on historical average)
df.expect_table_row_count_to_be_between(
    min_value=50000,
    max_value=200000
)

# Run all checks
results = df.validate()

if not results.success:
    failed = [exp for exp in results.results if not exp.success]
    print(f"Quality check failed! {len(failed)} issues found")
    # Send alert, stop pipeline, whatever makes sense
    raise ValueError("Data quality check failed")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I run this in my Airflow pipeline right after reading from the raw zone. If validation fails, the pipeline stops. Bad data never reaches production.&lt;/p&gt;

&lt;p&gt;The key is starting simple. Five checks on day one. Add more as you learn what can go wrong. I now have 50+ checks on critical tables. Built up over time based on actual incidents.&lt;/p&gt;

&lt;h3&gt;dbt Tests for the Analytics Layer&lt;/h3&gt;

&lt;p&gt;While Great Expectations handles the processing layer, I use dbt for the warehouse. Tests live right next to the models. Easy for analysts to write and maintain.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# models/schema.yml
version: 2

models:
  - name: fct_daily_revenue
    description: "Daily revenue by product"
    columns:
      - name: date
        tests:
          - not_null
          - unique
      - name: product_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_products')
              field: product_id
      - name: revenue
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 10000000
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And custom tests for business logic:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- tests/revenue_reconciliation.sql
-- Revenue in warehouse should match source system

with warehouse as (
    select sum(revenue) as total
    from {{ ref('fct_daily_revenue') }}
    where date = current_date - 1
),

source as (
    select total_revenue as total
    from {{ ref('source_summary') }}
    where date = current_date - 1
)

select *
from warehouse w
cross join source s
where abs(w.total - s.total) / s.total &amp;gt; 0.01  -- Fail if &amp;gt;1% difference
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run these after every model build. If tests fail, you know immediately.&lt;/p&gt;

&lt;h3&gt;Monitoring and Alerts&lt;/h3&gt;

&lt;p&gt;Quality checks are useless if nobody looks at them. You need alerts that actually get attention.&lt;/p&gt;

&lt;p&gt;I use three severity levels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical (page someone):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline completely failed&lt;/li&gt;
&lt;li&gt;Zero rows loaded&lt;/li&gt;
&lt;li&gt;SLA breach by &amp;gt;4 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;High (Slack with @channel):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quality checks failed&lt;/li&gt;
&lt;li&gt;Volume drop &amp;gt;50%&lt;/li&gt;
&lt;li&gt;Freshness breach by &amp;gt;2 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Medium (Slack notification):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Warning-level checks failed&lt;/li&gt;
&lt;li&gt;Volume drop 20-50%&lt;/li&gt;
&lt;li&gt;Minor anomalies&lt;/li&gt;
&lt;/ul&gt;
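&lt;p&gt;The routing itself can be a small lookup from severity to channel. A sketch (the channel names are illustrative, not a real integration):&lt;/p&gt;

```python
# Severity-routing sketch: each failed check carries a severity;
# route it to the matching channel (channel names are illustrative).

ROUTES = {
    "critical": "pagerduty",
    "high": "slack-data-alerts (@channel)",
    "medium": "slack-data-alerts",
}

def route_alert(check_name, severity):
    """Format an alert line and pick the destination channel."""
    channel = ROUTES.get(severity, "slack-data-alerts")
    return f"[{severity}] {check_name} -&gt; {channel}"

print(route_alert("zero_rows_loaded", "critical"))
```

&lt;p&gt;In production the return value would feed an actual pager or webhook call; the point is keeping the severity-to-channel mapping in one place.&lt;/p&gt;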

&lt;p&gt;Don't alert on everything. Alert fatigue is real. I learned this by setting alerts too aggressively and then ignoring them. Start conservative. Tune based on false positives.&lt;/p&gt;

&lt;h2&gt;Building a Quality Culture&lt;/h2&gt;

&lt;p&gt;Here's what I've learned about getting teams to care about quality:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Show the impact.&lt;/strong&gt; Don't say "we need more tests." Say "last month's incorrect dashboard cost us a $2M budgeting mistake. These tests prevent that."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Make it visible.&lt;/strong&gt; We have a dashboard showing data quality scores for every table. Updates daily. Everyone can see it. When scores drop, people notice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Make it easy.&lt;/strong&gt; Pre-built test templates. Clear documentation. If adding quality checks is hard, people won't do it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Celebrate wins.&lt;/strong&gt; "Zero quality incidents this month!" matters. Recognize teams that maintain high quality scores.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Share incidents.&lt;/strong&gt; When things break (and they will), do a blameless post-mortem. What happened? What did we learn? How do we prevent it? Share these widely. Learn from mistakes.&lt;/p&gt;

&lt;h2&gt;The Quality Checklist&lt;/h2&gt;

&lt;p&gt;Before any new pipeline goes to production, I make sure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema validation exists&lt;/li&gt;
&lt;li&gt;Critical fields have null checks&lt;/li&gt;
&lt;li&gt;Value ranges are validated&lt;/li&gt;
&lt;li&gt;Row count checks are in place&lt;/li&gt;
&lt;li&gt;Freshness monitoring is configured&lt;/li&gt;
&lt;li&gt;Alerts are set up and tested&lt;/li&gt;
&lt;li&gt;The team knows how to respond to alerts&lt;/li&gt;
&lt;li&gt;A runbook exists for common failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Takes 30 minutes to set up. Saves hours when things break.&lt;/p&gt;

&lt;h2&gt;Real Incidents I've Seen&lt;/h2&gt;

&lt;p&gt;Let me share three more incidents and what I learned from each.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The silent schema change:&lt;/strong&gt; A source system added a new status code without telling us. Our pipeline treated it as invalid and dropped those records. 10% of data quietly disappeared. We found out when a business user asked why certain transactions weren't showing up.&lt;/p&gt;

&lt;p&gt;Lesson: Monitor unexpected values. If a new status code appears, alert on it. Don't silently drop data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The weekend bug:&lt;/strong&gt; Our pipeline ran fine Monday through Friday. Every weekend it failed. Why? Because weekend volume was 80% lower. Our row count check had a fixed threshold, not a relative one. Every Sunday morning, someone got paged.&lt;/p&gt;

&lt;p&gt;Lesson: Make thresholds context-aware. Weekend expectations ≠ weekday expectations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The currency confusion:&lt;/strong&gt; Transaction amounts suddenly spiked. Took us a day to figure out why. A source system changed from sending amounts in dollars to cents. $100.00 became 10000. Technically valid (still a number), but wrong.&lt;/p&gt;

&lt;p&gt;Lesson: Compare against historical distributions. If the average transaction amount suddenly changes by 100x, something's wrong.&lt;/p&gt;

&lt;h2&gt;What Actually Matters&lt;/h2&gt;

&lt;p&gt;After years of doing this, here's what I've learned:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start simple.&lt;/strong&gt; Five good checks beat 50 mediocre ones. Focus on what actually breaks in your pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor trends, not just values.&lt;/strong&gt; A gradual increase in null rates is harder to catch than a sudden spike. Watch the trends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test what you can't see.&lt;/strong&gt; Schema and row counts are easy. Business logic is hard. Both matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Make quality everyone's job.&lt;/strong&gt; Data engineers build the checks. Analysts write tests for their models. Business users define what "correct" means. Shared responsibility works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn from failures.&lt;/strong&gt; Every incident is a chance to add a test that prevents it from happening again. Build your quality suite from real problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert strategically.&lt;/strong&gt; Too many alerts and people ignore them. Too few and you miss real issues. Tune constantly.&lt;/p&gt;

&lt;p&gt;The goal isn't perfection. It's trust. When someone looks at a dashboard, they should trust the numbers. When leadership makes a decision based on data, it should be the right decision.&lt;/p&gt;

&lt;p&gt;That's what data quality is really about.&lt;/p&gt;

&lt;h2&gt;Getting Started&lt;/h2&gt;

&lt;p&gt;If you're building a data platform or trying to improve an existing one:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick your most critical table&lt;/li&gt;
&lt;li&gt;Add five basic checks (not null, unique, value ranges, row count, freshness)&lt;/li&gt;
&lt;li&gt;Set up alerts&lt;/li&gt;
&lt;li&gt;Wait for something to break&lt;/li&gt;
&lt;li&gt;Add a check that would have caught it&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Don't try to build perfect quality checks on day one. Build them incrementally based on what actually goes wrong.&lt;/p&gt;

&lt;p&gt;And when something does break (it will), treat it as a learning opportunity. What check would have caught this? Add it. Move on.&lt;/p&gt;

&lt;p&gt;Data quality isn't a project with an end date. It's ongoing vigilance. The teams that do it well make it part of their culture, not just a checklist.&lt;/p&gt;

&lt;p&gt;Want to discuss data quality or pipeline architecture? Connect with me on LinkedIn or check out my portfolio. Always happy to talk about building reliable data systems.&lt;/p&gt;

&lt;p&gt;And if you're working on data quality tools or have war stories to share, I'd love to hear them!&lt;/p&gt;

&lt;p&gt;Thanks for reading! If this was helpful, follow for more articles on data engineering, building reliable systems, and lessons learned from production incidents.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>dataquality</category>
      <category>bigdata</category>
      <category>python</category>
    </item>
    <item>
      <title>From Raw to Refined: Data Pipeline Architecture at Scale</title>
      <dc:creator>Pradeep Kalluri</dc:creator>
      <pubDate>Sat, 22 Nov 2025 20:44:40 +0000</pubDate>
      <link>https://dev.to/pradeep_kaalluri/from-raw-to-refined-data-pipeline-architecture-at-scale-119l</link>
      <guid>https://dev.to/pradeep_kaalluri/from-raw-to-refined-data-pipeline-architecture-at-scale-119l</guid>
      <description>&lt;p&gt;Originally published on Medium: &lt;a href="https://medium.com/@kalluripradeep99/from-raw-to-refined-data-pipeline-architecture-at-scale-52cd4b02ef10" rel="noopener noreferrer"&gt;https://medium.com/@kalluripradeep99/from-raw-to-refined-data-pipeline-architecture-at-scale-52cd4b02ef10&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;How I built production data pipelines that process massive volumes daily — and what I learned along the way&lt;/p&gt;

&lt;p&gt;Every day, modern data platforms handle hundreds of gigabytes of data — transactions, customer activity, event streams, operational reports. All of this needs to flow from messy source systems into clean, reliable tables that teams can use for dashboards, reports, and ML models.&lt;/p&gt;

&lt;p&gt;Here’s what surprised me after years of building these systems: moving data isn’t the hard part. Making it reliable at scale is.&lt;/p&gt;

&lt;p&gt;I’ve debugged pipelines that silently corrupted data for weeks. I’ve seen duplicate records inflate ML model accuracy by double digits. I’ve watched pipelines grind to a halt because someone forgot to partition a table properly.&lt;/p&gt;

&lt;p&gt;These experiences taught me something valuable: you need a solid architecture before you write a single line of code.&lt;/p&gt;

&lt;p&gt;In this article, I’ll walk you through the three-zone framework I use for production data pipelines. We’ll cover which tools make sense at each stage, how to keep data quality high, and the mistakes I’ve made so you don’t have to.&lt;/p&gt;

&lt;p&gt;If you’re building a data platform from scratch or trying to scale an existing one, this should help.&lt;/p&gt;

&lt;h2&gt;The Three-Zone Architecture&lt;/h2&gt;

&lt;p&gt;I like keeping things simple. Split your data pipeline into three zones, each doing one thing well. This makes everything easier to build, fix, and explain to your team.&lt;/p&gt;

&lt;h3&gt;Zone 1: Raw/Landing Zone&lt;/h3&gt;

&lt;p&gt;This is where data first shows up. The most important rule: don’t touch it. Store everything exactly as it comes in.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Keeps data in its original form&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tools I use:&lt;/strong&gt; Object storage (S3/ADLS) for batch files, Kafka for streaming&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; You can always go back and reprocess if something breaks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: Transaction data comes in as JSON files. I store them in organized paths like &lt;code&gt;s3://raw-zone/transactions/2024/11/20/&lt;/code&gt;. For real-time data like payment events, they go into Kafka topics unchanged.&lt;/p&gt;

&lt;p&gt;Why bother with this separation? Because you’ll have bugs. Business rules will change. Data quality checks will evolve. When that happens, you just reprocess from raw. It’s your safety net.&lt;/p&gt;

&lt;p&gt;I once discovered a data transformation bug that had been running for weeks. Because we had the raw zone, we reprocessed everything in a few hours. Without it? We would have had serious data integrity issues.&lt;/p&gt;

&lt;h3&gt;Zone 2: Curated/Staging Zone&lt;/h3&gt;

&lt;p&gt;This is where the actual work happens. Clean up the mess, standardize formats, catch bad data before it reaches production.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Turns raw data into something usable&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tools I use:&lt;/strong&gt; PySpark for heavy lifting, cloud compute platforms for processing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What I do here:&lt;/strong&gt; Remove duplicates, fix data types, validate everything, standardize formats&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real talk: data is always messier than you expect. You’ll get duplicate records. Date formats all over the place — some systems use MM/DD/YYYY, others use DD-MM-YYYY. Codes that don’t match standards. Nulls everywhere.&lt;/p&gt;

&lt;p&gt;This zone fixes all of that. Convert dates to ISO format. Deduplicate records using window functions. Flag invalid data and send it to error tables so someone can investigate later.&lt;/p&gt;

&lt;p&gt;One time, we received data where the same record appeared multiple times with different values due to system retries. Our deduplication logic caught it and kept only the latest record based on timestamp. Simple, but it prevented incorrect reporting downstream.&lt;/p&gt;

&lt;h3&gt;Zone 3: Refined/Consumption Zone&lt;/h3&gt;

&lt;p&gt;This is what people actually use. Clean, fast, optimized, ready to go.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Serves data to analysts, dashboards, ML models&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tools I use:&lt;/strong&gt; Cloud data warehouses (Snowflake/Redshift/BigQuery), dbt for transformations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What’s here:&lt;/strong&gt; Star schemas, pre-aggregated tables, feature stores for ML&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of making analysts query millions of raw records, give them pre-aggregated summary tables. Instead of making ML engineers join dozens of tables every time they need features, give them pre-computed feature tables.&lt;/p&gt;

&lt;p&gt;Performance matters here. Use proper partitioning. Pre-compute common aggregations. Model your data in ways people understand — star schemas, not normalized tables with excessive joins.&lt;/p&gt;

&lt;h2&gt;Why Split It Up?&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Easier to debug:&lt;/strong&gt; When something breaks, you know exactly where to look. Data quality issue? Check curated. Performance problem? Look at refined.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Safer to experiment:&lt;/strong&gt; Want to try a new transformation logic? Test it in curated without touching raw data. Want to change your warehouse schema? Refined zone only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Right tool for the job:&lt;/strong&gt; Object storage for raw, distributed compute for processing, columnar database for analytics. Each zone uses the best tool for its purpose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Better quality:&lt;/strong&gt; Catch problems early in curated before they reach business users and damage trust in your data platform.&lt;/p&gt;

&lt;p&gt;The boundaries are clear: raw-to-curated handles technical stuff (formats, types, duplicates). Curated-to-refined handles business logic (aggregations, joins, metrics). Everyone knows what goes where.&lt;/p&gt;

&lt;h2&gt;Data Ingestion Layer&lt;/h2&gt;

&lt;p&gt;Getting data into your platform is step one. You’ve got two main approaches: batch and streaming. Most real-world systems need both.&lt;/p&gt;

&lt;h3&gt;Batch Ingestion&lt;/h3&gt;

&lt;p&gt;This is your scheduled, bulk data loads. Works great for data that doesn’t need to be real-time — think daily summaries, overnight files, periodic reports.&lt;/p&gt;

&lt;p&gt;I use cloud object storage as the landing zone. Source systems drop files there — usually CSV, JSON, or Parquet. Then I’ve got scheduled jobs (orchestrated by Airflow) that pick them up and process them.&lt;/p&gt;

&lt;p&gt;The trick is organizing your storage paths properly. Use a structure like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;s3://raw-zone/source_system/table_name/YYYY/MM/DD/filename.parquet&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This makes it easy to process specific date ranges and troubleshoot when things go wrong. And trust me, things will go wrong.&lt;/p&gt;

&lt;p&gt;Pro tip: Use Parquet format when you can. Columnar storage can reduce storage costs significantly compared to CSV, plus query performance improves substantially.&lt;/p&gt;

&lt;h3&gt;Stream Ingestion&lt;/h3&gt;

&lt;p&gt;For real-time data, I use Kafka. Payment events, user activity, system logs — anything that needs to be processed within seconds or minutes.&lt;/p&gt;

&lt;p&gt;Kafka is great because it keeps messages for a retention period (say, 7 days). If your downstream system goes down for maintenance, you can catch up without losing data. It’s like a replay buffer for your data streams.&lt;/p&gt;

&lt;p&gt;Here’s a pattern that works well: Kafka producers write events to topics. Consumer applications read from topics and write to object storage in micro-batches (every 5 minutes). This gives you both real-time processing AND a permanent archive in your raw zone.&lt;/p&gt;

&lt;p&gt;In one system, we processed tens of thousands of events per second through Kafka, with consumer lag under a minute. The key was proper partitioning and scaling consumer groups horizontally.&lt;/p&gt;

&lt;h3&gt;Handling Late Data&lt;/h3&gt;

&lt;p&gt;Real-world data doesn’t arrive on time. An event might get recorded but the network hiccups. The data shows up hours late. Or a mobile app was offline and syncs data the next day.&lt;/p&gt;

&lt;p&gt;My rule: always use event time (when it actually happened) not processing time (when you received it). Store both timestamps. This way you can handle late arrivals properly in downstream processing without corrupting your analytics.&lt;/p&gt;
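&lt;p&gt;A tiny sketch of that rule: keep both timestamps on each record and derive the partition from event time, so late arrivals land in the day they actually happened (field names are illustrative):&lt;/p&gt;

```python
# Event-time vs processing-time sketch: stamp both, partition by
# event time (field names are illustrative).
from datetime import datetime, timezone

def enrich(event):
    """Attach processing time and an event-time partition key."""
    return {
        **event,
        "processing_time": datetime.now(timezone.utc).isoformat(),
        "partition_date": event["event_time"][:10],  # YYYY-MM-DD
    }

late = {"event_time": "2024-11-19T23:58:00+00:00", "amount": 42.0}
print(enrich(late)["partition_date"])  # lands in the 19th, not today
```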

&lt;h2&gt;Processing &amp;amp; Transformation&lt;/h2&gt;

&lt;p&gt;This is where PySpark does the heavy lifting. Reading from raw, applying transformations, writing to curated. Let me show you the patterns that actually work in production.&lt;/p&gt;

&lt;h3&gt;Reading from the Raw Zone&lt;/h3&gt;

&lt;p&gt;Start by reading your data. I usually work with Parquet files in object storage because they’re fast and efficient.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, row_number, upper
from pyspark.sql.window import Window

spark = SparkSession.builder \
    .appName("curated_processing") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Read from raw zone with schema inference
df = spark.read.parquet("s3://raw-zone/transactions/2024/11/20/")
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Data Validation&lt;/h3&gt;

&lt;p&gt;First thing: validate your data. Don’t process garbage.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Remove records with null IDs
df_valid = df.filter(col("transaction_id").isNotNull())

# Check for valid amounts (positive values only)
df_valid = df_valid.filter(col("amount") &amp;gt; 0)

# Validate date ranges
df_valid = df_valid.filter(
    (col("transaction_date") &amp;gt;= "2024-01-01") &amp;amp;
    (col("transaction_date") &amp;lt;= "2024-12-31")
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Simple checks like this save you headaches later. Invalid data goes to an error table so someone can investigate — don’t just drop it silently.&lt;/p&gt;

&lt;h3&gt;Deduplication&lt;/h3&gt;

&lt;p&gt;Duplicates are everywhere. Source systems send the same record twice. Networks retry failed requests. It happens constantly.&lt;/p&gt;

&lt;p&gt;Here’s how I handle it — keep the most recent record based on a timestamp:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Define window to find duplicates
window = Window.partitionBy("transaction_id") \
               .orderBy(col("timestamp").desc())

# Keep only the latest record for each transaction_id
df_dedup = df_valid.withColumn("row_num", row_number().over(window)) \
                   .filter(col("row_num") == 1) \
                   .drop("row_num")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This pattern works for any duplicate scenario. Just change the &lt;code&gt;partitionBy&lt;/code&gt; and &lt;code&gt;orderBy&lt;/code&gt; columns based on your needs. I've used this same logic for customer records, sensor data, and API responses.&lt;/p&gt;

&lt;h3&gt;Type Casting and Standardization&lt;/h3&gt;

&lt;p&gt;Data comes in as strings more often than you’d think. Convert to proper types for downstream processing.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Convert string dates to actual dates
df_typed = df_dedup.withColumn(
    "transaction_date",
    to_date(col("date_string"), "yyyy-MM-dd")
)

# Ensure numeric types with proper precision
df_typed = df_typed.withColumn(
    "amount",
    col("amount").cast("decimal(10,2)")
)

# Standardize codes to uppercase
df_typed = df_typed.withColumn(
    "currency",
    upper(col("currency"))
)
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Writing to the Curated Zone&lt;/h3&gt;

&lt;p&gt;Once data is clean, write it back to storage in the curated zone. Use partitioning for better performance downstream.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Write partitioned by date for efficient queries
df_typed.write \
    .mode("overwrite") \
    .partitionBy("transaction_date") \
    .parquet("s3://curated-zone/transactions/")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Partitioning means queries only read relevant data. If someone wants yesterday’s data, Spark only scans yesterday’s partition. Fast and cheap.&lt;/p&gt;
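&lt;p&gt;Partition pruning is easy to see in miniature. Assuming a Hive-style directory layout (the paths below are made up for illustration), a date filter maps directly to a path prefix, so a query only touches matching files:&lt;/p&gt;

```python
# Hypothetical file listing under a Hive-style partitioned layout
files = [
    "transactions/transaction_date=2024-06-01/part-0.parquet",
    "transactions/transaction_date=2024-06-01/part-1.parquet",
    "transactions/transaction_date=2024-06-02/part-0.parquet",
]

def prune(file_list, date):
    """Keep only files under the requested date's partition directory."""
    prefix = f"transactions/transaction_date={date}/"
    return [f for f in file_list if f.startswith(prefix)]

yesterday = prune(files, "2024-06-01")  # two of three files scanned
```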

&lt;p&gt;In one pipeline, proper partitioning reduced query times from 45 minutes to just a few minutes. Same data, same query, just better organization.&lt;/p&gt;

&lt;p&gt;Why PySpark?&lt;br&gt;
You might ask — why not just use Pandas? Simple: scale. Pandas runs on one machine’s memory. PySpark distributes across a cluster. When you’re processing large volumes, you need that distributed power.&lt;/p&gt;

&lt;p&gt;Plus, PySpark’s lazy evaluation is smart. It optimizes your entire transformation pipeline before executing. Less data shuffling, fewer passes over data, faster results.&lt;/p&gt;
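&lt;p&gt;Lazy evaluation can be sketched with plain Python generators: like Spark transformations, nothing executes until a terminal action forces it, and rows the pipeline never needs are never read. This is only an analogy, not Spark’s actual optimizer:&lt;/p&gt;

```python
seen = []  # tracks which source rows were actually read

def read(rows):
    for r in rows:
        seen.append(r)
        yield r

def filter_valid(rows):
    # Lazy filter: evaluated only as downstream steps pull rows
    return (r for r in rows if r["amount"] > 0)

def cap(rows, limit):
    # Lazy limit: stops pulling from upstream once satisfied
    for i, r in enumerate(rows):
        if i >= limit:
            return
        yield r

data = [{"amount": -5}, {"amount": 10}, {"amount": 20},
        {"amount": 30}, {"amount": 40}]
pipeline = cap(filter_valid(read(data)), 2)  # nothing has run yet
result = list(pipeline)                      # execution happens here
```

&lt;p&gt;After the terminal list() call, the last source row was never read at all, because the limit was already satisfied.&lt;/p&gt;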

&lt;p&gt;Orchestration with Airflow&lt;br&gt;
You can’t run data jobs manually every day. You need orchestration. Airflow handles scheduling, dependencies, retries, and monitoring — all the operational complexity.&lt;/p&gt;

&lt;p&gt;DAG Design&lt;br&gt;
Here’s a DAG structure for our three-zone pipeline:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['data-alerts@company.com'],
}

dag = DAG(
    'transaction_pipeline',
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;# Task 1: Wait for source data to arrive
wait_for_data = S3KeySensor(
    task_id='wait_for_source_data',
    bucket_name='raw-zone',
    bucket_key='transactions/{{ ds }}/',
    timeout=3600,
    poke_interval=60,
    dag=dag,
)

# Task 2: Ingest from source to raw
ingest_task = PythonOperator(
    task_id='ingest_to_raw',
    python_callable=ingest_data,
    dag=dag,
)

# Task 3: Process raw to curated
process_task = PythonOperator(
    task_id='process_to_curated',
    python_callable=process_data,
    dag=dag,
)

# Task 4: Transform curated to refined
transform_task = PythonOperator(
    task_id='transform_to_refined',
    python_callable=transform_data,
    dag=dag,
)

# Task 5: Data quality checks
quality_task = PythonOperator(
    task_id='quality_checks',
    python_callable=run_quality_checks,
    dag=dag,
)

# Set dependencies - clear pipeline flow
wait_for_data &amp;gt;&amp;gt; ingest_task &amp;gt;&amp;gt; process_task &amp;gt;&amp;gt; transform_task &amp;gt;&amp;gt; quality_task
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Key Principles&lt;br&gt;
Idempotency: Run the same task twice, get the same result. Use mode("overwrite") with date partitions. If today's job fails and reruns, it overwrites today's data without affecting other days. This is crucial for reliable operations.&lt;/p&gt;
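&lt;p&gt;The partition-scoped overwrite that makes reruns safe can be sketched with ordinary files (directory names are assumptions, mirroring a Hive-style layout, and stand in for Spark's partitioned write):&lt;/p&gt;

```python
import shutil
import tempfile
from pathlib import Path

def write_partition(base, ds, rows):
    """Idempotent write: rerunning the job for one date replaces only
    that date's partition, leaving every other day untouched."""
    part = Path(base) / f"transaction_date={ds}"
    if part.exists():
        shutil.rmtree(part)  # scoped overwrite: this partition only
    part.mkdir(parents=True)
    (part / "data.csv").write_text("\n".join(rows))

base = tempfile.mkdtemp()
write_partition(base, "2024-06-01", ["a,1", "b,2"])
write_partition(base, "2024-06-02", ["c,3"])
write_partition(base, "2024-06-01", ["a,1", "b,2"])  # safe rerun
```

&lt;p&gt;Run the June 1st task once or ten times: the June 2nd partition never changes, which is exactly what lets Airflow retry failed days without corrupting the rest.&lt;/p&gt;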

&lt;p&gt;Clear dependencies: The &amp;gt;&amp;gt; operator makes dependencies obvious. Process can't start until ingest finishes. Quality checks run last. Anyone looking at the DAG understands the flow immediately.&lt;/p&gt;

&lt;p&gt;Retry logic: Network hiccups happen. Source systems go down. Airflow retries failed tasks automatically. Set sensible retry counts (2–3) and delays (5–10 minutes).&lt;/p&gt;

&lt;p&gt;Monitoring: Airflow’s UI shows you everything. Which tasks failed? How long did they take? When did they last run? All visible at a glance. I check the dashboard regularly — green boxes mean happy pipelines, red boxes mean I’ve got work to do.&lt;/p&gt;

&lt;p&gt;Alerting: Set up email or Slack alerts for failures. Don’t wait until someone complains about missing data. Know about problems before your users do.&lt;/p&gt;

&lt;p&gt;Data Quality &amp;amp; Validation&lt;br&gt;
Bad data is worse than no data. It leads to wrong decisions, broken dashboards, and lost trust in your platform. I learned this the hard way.&lt;/p&gt;

&lt;p&gt;Why Quality Matters&lt;br&gt;
I once saw an ML model built on data with duplicate IDs. It performed great in testing and much worse in production, because the duplicates artificially inflated accuracy metrics during training. The problem was caught only after deployment. Not fun.&lt;/p&gt;

&lt;p&gt;Now I validate everything.&lt;/p&gt;

&lt;p&gt;Using Great Expectations&lt;br&gt;
Great Expectations is my go-to tool for data quality. You define rules (called expectations), and it validates your data against them automatically.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import great_expectations as ge

# Load your data as a Great Expectations DataFrame
df = ge.read_csv("s3://curated-zone/transactions.csv")

# Set expectations - these become tests
df.expect_column_values_to_not_be_null("transaction_id")
df.expect_column_values_to_be_unique("transaction_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=1000000)
df.expect_column_values_to_be_in_set("currency", ["USD", "EUR", "GBP"])
df.expect_column_values_to_match_regex("email", r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")

# Validate and get results
results = df.validate()
if not results.success:
    # Alert someone, block pipeline, write to error log
    raise ValueError(f"Data quality check failed! {results}")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Simple but effective. If data fails validation, the pipeline stops. No bad data reaches production and corrupts your analytics.&lt;/p&gt;

&lt;p&gt;dbt Tests&lt;br&gt;
For the refined zone, I use dbt tests. They’re built directly into your transformation code, which makes them easy to maintain.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- models/daily_summary.sql
{{ config(materialized='table') }}

SELECT
    date,
    customer_id,
    SUM(amount) as total_amount,
    COUNT(*) as transaction_count
FROM {{ ref('transactions') }}
GROUP BY date, customer_id
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;# tests/daily_summary.yml
version: 2
models:
  - name: daily_summary
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
      - name: total_amount
        tests:
          - not_null
      - name: date
        tests:
          - not_null
    tests:
      - dbt_utils.recency:
          datepart: day
          field: date
          interval: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;dbt runs these tests automatically after building models. If a test fails, you know immediately. The recency test is particularly useful — it alerts you if data stops arriving.&lt;/p&gt;

&lt;p&gt;Continuous Monitoring&lt;br&gt;
Quality isn’t one-and-done. Monitor continuously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track row counts over time: sudden drops signal a problem&lt;/li&gt;
&lt;li&gt;Watch null rates: if nulls suddenly spike, investigate&lt;/li&gt;
&lt;li&gt;Monitor data freshness: is data arriving on time?&lt;/li&gt;
&lt;li&gt;Set up anomaly detection: catch unusual patterns early&lt;/li&gt;
&lt;li&gt;Validate referential integrity: ensure foreign keys match&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’ve got Slack alerts configured for quality failures. If something breaks overnight, I know about it quickly. Better to know immediately than discover it during a morning meeting.&lt;/p&gt;
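&lt;p&gt;The first two signals in that list are trivial to compute. A toy profiler (a hypothetical helper, not a library API) might look like:&lt;/p&gt;

```python
def profile(rows, columns):
    """Row count and per-column null rate: the two simplest
    data-quality signals to trend over time."""
    n = len(rows)
    rates = {
        c: (sum(1 for r in rows if r.get(c) is None) / n if n else 0.0)
        for c in columns
    }
    return {"row_count": n, "null_rates": rates}

stats = profile(
    [{"id": 1, "email": None}, {"id": 2, "email": "a@b.com"}],
    ["id", "email"],
)
```

&lt;p&gt;Emit these numbers to your metrics system each run; alert when today’s values deviate sharply from the recent baseline.&lt;/p&gt;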

&lt;p&gt;Real-World Lessons&lt;br&gt;
Let me share some lessons from systems I’ve built:&lt;/p&gt;

&lt;p&gt;Performance wins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proper partitioning can reduce query times by 10x or more&lt;/li&gt;
&lt;li&gt;Parquet format typically reduces storage costs by 60–70% vs CSV&lt;/li&gt;
&lt;li&gt;Pre-aggregated tables eliminate the need for complex real-time queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reliability improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The three-zone architecture makes debugging much faster&lt;/li&gt;
&lt;li&gt;Comprehensive quality checks catch issues before users see them&lt;/li&gt;
&lt;li&gt;Idempotent pipelines allow safe retries without data corruption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cost optimizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smart partitioning reduces cloud compute costs significantly&lt;/li&gt;
&lt;li&gt;Columnar formats (Parquet) save on both storage and processing&lt;/li&gt;
&lt;li&gt;Proper cluster sizing prevents over-provisioning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These results didn’t come from fancy tools or bleeding-edge technology. They came from solid architecture, good practices, and attention to quality.&lt;/p&gt;

&lt;p&gt;Common Mistakes (and How to Avoid Them)&lt;br&gt;
Mistake #1: Skipping the raw zone&lt;br&gt;
“We’ll just clean data as it arrives.” Then you have a bug and no way to reprocess. Always keep raw data.&lt;/p&gt;

&lt;p&gt;Mistake #2: No data quality checks&lt;br&gt;
“We’ll add those later.” Later never comes. Build quality checks from day one.&lt;/p&gt;

&lt;p&gt;Mistake #3: Over-engineering early&lt;br&gt;
You don’t need Kafka and real-time processing for daily batch reports. Start simple, scale when needed.&lt;/p&gt;

&lt;p&gt;Mistake #4: Ignoring monitoring&lt;br&gt;
If you don’t know your pipeline failed, you can’t fix it. Set up alerts and dashboards.&lt;/p&gt;

&lt;p&gt;Mistake #5: Poor partitioning&lt;br&gt;
This kills performance and inflates costs. Partition by date or another field that matches your query patterns, and avoid very high-cardinality keys, which produce huge numbers of tiny files.&lt;/p&gt;

&lt;p&gt;Mistake #6: Treating all data the same&lt;br&gt;
Not everything needs real-time processing. Batch is cheaper and simpler for most use cases.&lt;/p&gt;

&lt;p&gt;Getting Started&lt;br&gt;
If you’re building a data platform from scratch:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with the three-zone architecture — even if you’re just moving files around, establish the pattern early&lt;/li&gt;
&lt;li&gt;Implement one pipeline end-to-end — don’t build all the infrastructure first. Get one working pipeline and learn from it&lt;/li&gt;
&lt;li&gt;Add quality checks incrementally — start with basic null checks and expand from there&lt;/li&gt;
&lt;li&gt;Monitor everything — build dashboards and alerts from day one&lt;/li&gt;
&lt;li&gt;Document your patterns — future you (and your team) will thank you&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The tools I mentioned — S3/ADLS, Kafka, PySpark, Airflow, Snowflake/Redshift/BigQuery, dbt, Great Expectations — work well together. But they’re not the only options. Use what fits your needs, budget, and team expertise.&lt;/p&gt;

&lt;p&gt;Most importantly: make your pipelines reliable. Teams will depend on them. Analysts will base decisions on the data. Executives will present it to stakeholders. Make sure it’s trustworthy.&lt;/p&gt;

&lt;p&gt;Wrapping Up&lt;br&gt;
Building data pipelines at scale isn’t rocket science, but it requires thought and discipline. The three-zone architecture gives you a solid foundation. Raw for safety, curated for processing, refined for consumption.&lt;/p&gt;

&lt;p&gt;Start simple. One pipeline, end-to-end. Get it working. Add quality checks. Then scale based on actual needs, not hypothetical ones.&lt;/p&gt;

&lt;p&gt;After years and dozens of pipelines, I keep coming back to these patterns because they work. They’re not the newest or the flashiest, but they’re reliable. And in data engineering, reliability beats novelty every time.&lt;/p&gt;

&lt;p&gt;The patterns, code examples, and lessons in this article are all based on real production experience. They’re battle-tested and proven to work at scale. Whether you’re building your first data platform or optimizing an existing one, I hope these insights help you avoid common pitfalls and build something reliable.&lt;/p&gt;

&lt;p&gt;Want to discuss data architecture? Connect with me on LinkedIn or check out my portfolio. I’m always happy to talk about data engineering, pipeline design, or building scalable systems.&lt;/p&gt;

&lt;p&gt;And if you’re contributing to dbt, Airflow, Great Expectations, or other open-source data tools, I’d love to hear about your experiences!&lt;/p&gt;

&lt;p&gt;Thanks for reading! If you found this helpful, consider following for more articles on data engineering, cloud architecture, and building scalable systems.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>bigdata</category>
      <category>python</category>
      <category>dataquality</category>
    </item>
  </channel>
</rss>
