
Pradeep Kalluri

Posted on • Originally published at Medium

The Time Our Pipeline Processed the Same Day’s Data 47 Times

I noticed something odd in our Airflow logs on Monday morning. Our daily data pipeline had run multiple times over the weekend instead of once per day.

Not just a few extra runs. Forty-seven executions. All processing the same day's data: December 3rd.

Each run showed as successful. No errors. No alerts. Just the same date being processed over and over.

Here's what happened and what I learned about retry logic that I wish I'd known sooner.


How I Found It

Monday morning, I was reviewing our weekend pipeline runs as part of my routine checks. Our Airflow dashboard showed an unusual pattern - our main transformation DAG had executed far more times than it should have.

Looking closer, I saw the DAG had run 47 times between Saturday morning and Monday. But we only schedule it once per day at 2 AM.

What caught my attention: every single run was processing December 3rd's data. Not December 4th, 5th, or 6th. Just December 3rd, repeatedly.

All runs showed as successful. Green status. No failed tasks. The logs showed normal processing - read data, transform it, write to warehouse, mark complete.


The Investigation

I checked the obvious things first:

Was someone manually triggering reruns? No. The audit logs showed all runs were automatic, triggered by the scheduler.

Had the source data changed? No. The S3 timestamps showed December 3rd's data hadn't been modified since it was originally created.

Was there a scheduler configuration issue? The schedule looked correct: daily at 2 AM.

Then I noticed something in the run history. The pattern started on Saturday. Our pipeline ran at 2 AM (normal), then again at 4 AM, 6 AM, 8 AM... every two hours through the weekend.

That's when I realized: these weren't scheduled runs. These were retries.


The Background

The previous Friday, we'd deployed a new analytics feature - calculating average transaction values by customer segment. Marketing wanted to track premium customer behavior separately from regular customers.

The code had been tested thoroughly. We ran it against sample data from the past week. All tests passed. We deployed Friday afternoon.

What we didn't test: weekend data patterns.


The Root Cause

Our pipeline used Airflow's execution date to determine which data partition to process:

# 'context' is the Airflow task context; execution_date is the run's logical date
execution_date = context['execution_date']
data_date = execution_date.strftime('%Y-%m-%d')
s3_path = f"s3://bucket/data/date={data_date}/"

The pipeline had multiple steps:

  1. Read data from S3
  2. Transform and validate records
  3. Calculate daily metrics
  4. Write to warehouse

Step 3 is where things broke on weekends.
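
For context, here's a minimal sketch of the DAG's overall shape. The task names, schedule, and retry settings are my reconstruction from the behavior described above (daily at 2 AM, retries every two hours), not the real pipeline code:

# Sketch only: names and retry/schedule values are assumptions based on the
# observed behavior, not our actual configuration.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def read_from_s3(**context):
    ...  # step 1: read the day's partition from S3


def transform_and_validate(**context):
    ...  # step 2: transform and validate records


def calculate_daily_metrics(**context):
    ...  # step 3: per-segment metrics (where the weekend failure happened)


def write_to_warehouse(**context):
    ...  # step 4: load results into the warehouse


with DAG(
    dag_id="daily_transactions",            # hypothetical name
    start_date=datetime(2025, 1, 1),        # hypothetical start date
    schedule_interval="0 2 * * *",          # once per day at 2 AM
    catchup=False,
    default_args={
        "retries": 24,                      # assumed: clearly high enough to keep firing all weekend
        "retry_delay": timedelta(hours=2),  # assumed: matches the 2 AM, 4 AM, 6 AM... pattern
    },
) as dag:
    read = PythonOperator(task_id="read_from_s3", python_callable=read_from_s3)
    transform = PythonOperator(task_id="transform", python_callable=transform_and_validate)
    metrics = PythonOperator(task_id="daily_metrics", python_callable=calculate_daily_metrics)
    write = PythonOperator(task_id="write_warehouse", python_callable=write_to_warehouse)

    read >> transform >> metrics >> write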

Our new metric calculated "average transaction value per customer segment":

# Calculate average for our premium customer segment
target_customers = df[df['customer_segment'] == 'premium']
total_value = target_customers['amount'].sum()
customer_count = target_customers['customer_id'].nunique()
avg_value = total_value / customer_count

This worked fine on the weekdays we tested:

  • December 3rd (Wednesday): 8,500 premium customers. Calculated successfully.
  • December 4th (Thursday): 7,200 premium customers. Calculated successfully.
  • December 5th (Friday): 6,800 premium customers. Calculated successfully.

December 6th (Saturday): 0 premium customers.

Our premium segment was entirely B2B customers - business accounts, enterprise clients. They don't transact on weekends. The businesses are closed.

We had plenty of regular consumer transactions on Saturday (48,000 total), but zero from the premium segment we were calculating metrics for.

customer_count = target_customers['customer_id'].nunique()  # Returns 0
avg_value = total_value / 0  # Division by zero error

The calculation failed. Task failed. Airflow scheduled a retry.

Here's where the bug was. We had retry logic that tried to be helpful:

# task_instance comes from the Airflow task context, same as execution_date
if task_instance.try_number > 1:
    # If this is a retry, process the last successful date
    # to avoid reprocessing potentially corrupted data
    last_successful = get_last_successful_date()  # our own helper, not an Airflow API
    data_date = last_successful
else:
    data_date = execution_date.strftime('%Y-%m-%d')

The logic made sense when we wrote it: if a task fails partway through processing, don't try to reprocess potentially corrupted data. Instead, go back to the last known good date.

But in this case:

  1. December 6th processing failed (division by zero - no premium customers)
  2. Retry triggered, using execution_date = December 6th
  3. Retry logic checked: last successful date = December 3rd
  4. Processed December 3rd data (which had premium customer transactions)
  5. Calculation succeeded!
  6. Airflow marked December 6th as complete

Then the same thing happened with December 7th (Sunday), and it kept going until I stopped it Monday morning.


The Impact

The immediate problem was duplicate data. We'd loaded December 3rd's transactions into our warehouse 47 times.

Our deduplication logic caught most of it - we used transaction IDs as primary keys, so the database just overwrote the same records.

But not all our downstream reports deduplicated. Some aggregation tables counted each load as new data. For a few hours Monday morning, our dashboards showed December 3rd with 47x normal transaction volume.
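
One way to keep repeated loads harmless, whether or not a downstream table dedupes, is to replace the whole date partition in a single transaction. Here's a rough sketch of that pattern, assuming a psycopg2-style DB-API connection and hypothetical table and column names:

# Idempotent daily load sketch: delete the date partition, then insert it, in one
# transaction, so reruns and retries can't double-count anything downstream.
# Table, column, and connection details are hypothetical, not our real warehouse.
def load_partition(conn, df, data_date):
    rows = [
        (row.transaction_id, data_date, row.customer_segment, row.amount)
        for row in df.itertuples(index=False)
    ]
    with conn:  # psycopg2-style: commit on success, roll back on error
        cur = conn.cursor()
        cur.execute(
            "DELETE FROM transactions WHERE transaction_date = %s", (data_date,)
        )
        cur.executemany(
            "INSERT INTO transactions "
            "(transaction_id, transaction_date, customer_segment, amount) "
            "VALUES (%s, %s, %s, %s)",
            rows,
        )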

The bigger problem: we had no data for December 6th or 7th. The pipeline thought it had processed those dates successfully (because it processed December 3rd instead), so it moved on to December 8th.

We skipped two days of weekend data without realizing it until a business user asked why our weekend sales reports were blank.


The Fix

I fixed two things:

First, the immediate bug - handle zero-count scenarios in calculations:

target_customers = df[df['customer_segment'] == 'premium']
customer_count = target_customers['customer_id'].nunique()

if customer_count > 0:
    avg_value = target_customers['amount'].sum() / customer_count
else:
    # No customers in this segment - set to NULL rather than failing
    avg_value = None

Second, the retry logic - removed it entirely:

# Always process the execution date, regardless of retry count
data_date = execution_date.strftime('%Y-%m-%d')

The key insight: retries should reprocess the SAME data, not fall back to old data. If there's a real data problem, retrying won't help. If it's a transient issue, retrying the same operation will work.

For the weekend scenario specifically, I also updated our metrics logic to handle the expected pattern:

# Weekend data note: Premium segment (B2B) has zero weekend activity
# This is expected behavior - record NULL for weekend metrics
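
In practice that note became a small check: a weekend zero gets recorded as NULL, while a weekday zero still fails loudly. Here's a sketch of the idea (the function and segment names are illustrative, not our exact code):

from datetime import date

# Segments where an empty day is expected on weekends (our premium segment is all B2B)
WEEKEND_QUIET_SEGMENTS = {"premium"}

def average_transaction_value(df, segment, data_date: date):
    subset = df[df["customer_segment"] == segment]
    customer_count = subset["customer_id"].nunique()

    if customer_count == 0:
        if data_date.weekday() >= 5 and segment in WEEKEND_QUIET_SEGMENTS:
            return None  # expected weekend gap: record NULL, don't fail the task
        raise ValueError(f"No {segment} customers on {data_date} - not expected, investigate")

    return subset["amount"].sum() / customer_count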

What I Learned

Test with realistic data patterns. We tested with weekday data because that's what was convenient. We should have tested with weekend data, holiday data, month-end data - all the edge cases.

Retry logic needs careful thought. Our retry logic assumed "last successful date" was a safe fallback. It wasn't. Retries should reprocess the same data, not different data.

Division by zero is common in analytics. Anytime you're calculating averages or ratios, handle the zero-count case explicitly. Don't just let it fail.

Monitor successful runs, not just failures. All our alerts focused on failures. These runs succeeded, so we had no alerts. The only way I caught it was manually reviewing logs.

Execution date and data date are different things. Airflow's execution date is the run's logical date, not necessarily the date of the data you end up processing - especially once retries get involved. Keep them separate in your code.
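
Even a tiny consistency check at the end of each run would have turned Saturday's silent wrong-date processing into a loud failure. Something along these lines (the row count would come from a warehouse query; this isn't our actual fix, just the shape of the check):

def check_run_consistency(execution_date, data_date, warehouse_row_count):
    # A "successful" run that processed the wrong date, or wrote nothing, should still alert
    expected = execution_date.strftime("%Y-%m-%d")
    if data_date != expected:
        raise ValueError(f"Run for {expected} actually processed {data_date}")
    if warehouse_row_count == 0:
        raise ValueError(f"Run for {expected} wrote zero rows to the warehouse")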


The Aftermath

After the fix, the pipeline handled weekend data normally:

  • Saturday: Processed December 13th. Premium metrics = NULL (expected). Success.
  • Sunday: Processed December 14th. Premium metrics = NULL (expected). Success.
  • No retries. No duplicate processing.

I backfilled the missing December 6th and 7th data manually and added a test case for weekend scenarios to our test suite.
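
The weekend test case boils down to feeding the metric a day with plenty of consumer transactions but an empty premium segment. Roughly, with the fixed calculation inlined as a small helper for the example:

import pandas as pd

def premium_average(df):
    # The fixed calculation: None instead of dividing by zero
    premium = df[df["customer_segment"] == "premium"]
    count = premium["customer_id"].nunique()
    return premium["amount"].sum() / count if count > 0 else None

def test_weekend_day_with_no_premium_customers():
    # Saturday-shaped data: consumer transactions only, zero premium (B2B) customers
    df = pd.DataFrame({
        "customer_id": [101, 102, 103],
        "customer_segment": ["regular", "regular", "regular"],
        "amount": [20.0, 35.0, 15.0],
    })
    assert premium_average(df) is None

def test_weekday_with_premium_customers():
    df = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "customer_segment": ["premium", "premium", "premium"],
        "amount": [100.0, 200.0, 300.0],
    })
    assert premium_average(df) == 300.0  # 600 total across 2 unique customers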

Total time debugging: about 3 hours. Time spent fixing missing weekend data: another 2 hours.

Lesson learned: always test edge cases, especially predictable ones like weekends.


Have you deployed code on a Friday that broke over the weekend? Or had retry logic that made things worse instead of better?

I'd be interested to hear how others handle data quality validation for metrics with variable data patterns.

Connect with me on LinkedIn or check out my portfolio.


Thanks for reading! Follow for more practical data engineering stories and lessons from production systems.
