Andrew Tan

Posted on May 19 • Originally published at layline.io

What I Learned From Reading 50 Data Pipeline Postmortems

#dataengineering #kafka #softwareengineering #data

Across 50 public postmortems from Uber, Netflix, Stripe, and others, four failure patterns emerge again and again. Most of them are preventable at the design stage.

The postmortem paradox

Most major tech companies publish postmortems now. Stripe has a status page full of them. Netflix writes detailed engineering analyses. Uber, LinkedIn, GitHub, and Cloudflare have all opened the curtain on what went wrong and why.

Here's the paradox: the same failures keep happening. Not the same companies, not the same systems, but the same patterns. A team at DoorDash loses payment data the same way a team at Netflix lost viewing metrics three years earlier. An Uber pipeline breaks from schema drift in 2024 the same way a LinkedIn pipeline broke in 2021.

I spent the last few weeks reading through 50 public postmortems and incident reports from companies that have collectively processed trillions of events. The goal wasn't to catalog every possible failure mode. It was to find the clusters — the root causes that show up often enough that they can't be dismissed as one-off bad luck.

Four patterns dominate. And here's what surprised me: most of them are preventable at the design stage, not the operations stage.

How the 50 were selected

Before diving into the patterns, a quick note on methodology. I focused on public postmortems from companies running large-scale data infrastructure: Uber, Netflix, Stripe, LinkedIn, GitHub, Cloudflare, DoorDash, Airbnb, Spotify, and AWS. I skipped security breaches and pure infrastructure outages (like DNS failures) unless they directly affected data pipelines.

The selection wasn't random. I prioritized postmortems that included:

Root cause analysis with technical depth
Timeline of failure and recovery
Explicit mention of data quality or pipeline impact
Lessons learned or process changes

Some companies publish frequently (Cloudflare, GitHub). Others rarely (Netflix). The 50 represent a cross-section of batch ETL, streaming, and hybrid architectures.

Pattern 1: Schema drift (38% of incidents)

The most common root cause was deceptively simple: the upstream system changed its data format, and the pipeline didn't know.

In one well-documented incident, a data team discovered that a downstream warehouse had been loading incomplete data for eleven days. The source API had added a new field. The pipeline's JSON parser treated it as an unexpected key and silently dropped the entire record batch. No alerts fired because the pipeline didn't crash. It just produced fewer rows than expected, and the shortfall stayed within normal variance until it didn't.

This isn't an edge case. It's the default behavior of many data integration tools.

The postmortems reveal three variants of this pattern:

Additive drift

A new field, column, or event type appears. The pipeline either ignores it or fails, depending on how strict its schema validation is. Most postmortems noted that their pipelines were configured to be "permissive" because strict validation had caused false alarms in the past.

Type drift

An existing field changes its type. A string becomes a number. A timestamp loses its timezone. These are the hardest to catch because the data still looks valid. One postmortem described a revenue metric that silently doubled because a currency code field changed from ISO format to a numeric enum, and the pipeline interpreted the enum value as a multiplier.

Semantic drift

The format stays the same, but the meaning changes. A "user_id" field starts containing device IDs instead of account IDs. A "status" field gains a new state that the downstream logic treats as an error. The data passes all validation checks and is still wrong.

What's striking is how rarely these incidents were caught by schema registries or data contracts. In most cases, the teams had a registry. It just wasn't enforced at the pipeline boundary. The schema was documented somewhere, but the pipeline wasn't required to validate against it.
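To make "enforced at the pipeline boundary" concrete, here is a minimal sketch of a strict check that rejects a record on additive drift or type drift instead of quietly adapting. The schema, the field names, and the `SchemaMismatch` exception are all hypothetical; a real pipeline would more likely load the expected schema from its registry.

```python
from typing import Any

# Hypothetical expected schema: field name -> required Python type.
EXPECTED_SCHEMA: dict[str, type] = {
    "user_id": str,
    "amount_cents": int,
    "currency": str,
}


class SchemaMismatch(Exception):
    """Raised when a record deviates from the expected schema."""


def validate_record(
    record: dict[str, Any], schema: dict[str, type] = EXPECTED_SCHEMA
) -> None:
    """Raise SchemaMismatch on missing, unexpected, or wrongly typed fields."""
    missing = schema.keys() - record.keys()
    unexpected = record.keys() - schema.keys()  # additive drift
    wrong_type = {  # type drift; exact type match, so bool does not pass as int
        name: type(record[name]).__name__
        for name, expected in schema.items()
        if name in record and type(record[name]) is not expected
    }
    if missing or unexpected or wrong_type:
        raise SchemaMismatch(
            f"missing={sorted(missing)} "
            f"unexpected={sorted(unexpected)} "
            f"wrong_type={wrong_type}"
        )
```

A check like this catches additive and type drift only. Semantic drift passes it by definition, which is why it needs separate data quality checks.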

Pattern 2: Backpressure and load spikes (24% of incidents)

The second cluster involves pipelines that work perfectly at normal load and collapse under unexpected volume. The trigger varies — a marketing campaign, a viral event, a quarterly reporting cycle, a misconfigured upstream job that suddenly emits 10x its usual rate.

The failure mode is almost always the same: the pipeline has no controlled way to shed or buffer load, so it loses data.

One postmortem from a streaming platform described a Kafka consumer that fell behind by six hours during a product launch. The consumer group auto-scaled, but the new instances hit a database connection pool limit that had never been tested at that scale. The pipeline didn't crash. It just stopped processing new events while old ones aged out of retention. By the time the team noticed, the data was gone.

Another described a batch ETL job that ran fine for two years until Black Friday, when the source system emitted files 40x larger than usual. The job ran for 18 hours, exhausted temporary storage, and failed without cleaning up its partial outputs. The next scheduled run started on top of the corrupted data.

The common thread: these pipelines were designed for steady-state operation, not for boundary conditions. They had monitoring for whether they were running, but not for how close to their limits they were operating.

Several postmortems noted that load testing had been deprioritized because "we'll just auto-scale." Auto-scaling works for compute. It doesn't work for connection pools, memory limits, disk I/O, or downstream API rate limits — the bottlenecks that actually break pipelines.
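Monitoring "how close to the limit" can start very small. The sketch below is an illustration under assumed names and numbers (the `Limit` class, the resource names, and the 80% threshold are not from any postmortem): it reports every resource whose utilization has crossed a threshold, including the non-compute limits that auto-scaling doesn't cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """A resource with a hard ceiling, e.g. a connection pool or a disk."""

    name: str
    current: float
    maximum: float

    def __post_init__(self) -> None:
        if self.maximum <= 0:
            raise ValueError(f"{self.name}: maximum must be positive")

    @property
    def utilization(self) -> float:
        return self.current / self.maximum


def limits_near_saturation(limits: list[Limit], threshold: float = 0.8) -> list[str]:
    """Return the names of resources at or above the utilization threshold."""
    return [limit.name for limit in limits if limit.utilization >= threshold]


# Hypothetical snapshot of one pipeline's limits.
snapshot = [
    Limit("db_connection_pool", current=45, maximum=50),
    Limit("temp_disk_gb", current=120, maximum=500),
    Limit("consumer_lag_hours_vs_retention", current=6, maximum=24),
]
print(limits_near_saturation(snapshot))  # prints ['db_connection_pool']
```

Treating consumer lag as a fraction of the retention window puts "we are about to lose data" on the same scale as every other limit.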

Pattern 3: Silent data loss (19% of incidents)

This is the pattern that keeps engineers up at night. The pipeline reports success. The dashboards show green. The SLA is met. But the data is incomplete, duplicated, or corrupted — and nobody knows until a business user asks why the numbers look wrong.

Silent loss shows up in several forms across the postmortems:

The filter that was too aggressive

A data quality rule dropped records that matched a malformed pattern. The rule was intended to catch corrupted upstream data, but it also caught legitimate records with unusual but valid values. Over three weeks, 12% of legitimate transactions were filtered out.

The exactly-once that wasn't

A pipeline claimed exactly-once semantics but used a non-idempotent sink. When a transient network error triggered a retry, some records were written twice. The deduplication logic existed in theory but not in the actual code path.

The retention gap

A streaming pipeline wrote to a message queue with a 24-hour retention window. When downstream processing fell behind due to a separate incident, the unprocessed data expired before recovery. The pipeline logs showed successful writes. The data just wasn't there when someone tried to read it.

What makes silent loss so dangerous is that it's invisible to traditional monitoring. Pipeline health metrics — runtime, throughput, error rate — don't catch it. You need data quality metrics: row counts, cardinality checks, referential integrity, distribution tests. Most of the postmortems admitted these checks were added after the incident, not before.
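The simplest of these checks is a reconciliation: every row read should be either written or deliberately filtered, and the filtered share should stay small. The following sketch assumes the pipeline already exposes those three counters; the function name and the 1% default are illustrative choices, not values from the postmortems.

```python
def reconcile(
    rows_read: int,
    rows_written: int,
    rows_filtered: int,
    max_filtered_ratio: float = 0.01,
) -> list[str]:
    """Return human-readable problems; an empty list means the counts add up."""
    problems: list[str] = []

    unaccounted = rows_read - rows_written - rows_filtered
    if unaccounted > 0:
        problems.append(f"{unaccounted} rows read but neither written nor filtered")
    elif unaccounted < 0:
        problems.append(
            f"{-unaccounted} more rows written than read (possible duplicates)"
        )

    if rows_read > 0 and rows_filtered / rows_read > max_filtered_ratio:
        problems.append(
            f"filter dropped {rows_filtered / rows_read:.1%} of input "
            f"(limit {max_filtered_ratio:.1%})"
        )
    return problems
```

With these defaults, a run that reads 1,000 rows, writes 880, and filters 120 balances exactly but still gets flagged, because a filter removing 12% of input is far above the allowed ratio.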

Pattern 4: Cascade failures from shared state (14% of incidents)

The smallest cluster but often the most catastrophic. These are incidents where a failure in one pipeline corrupts or disables others through shared infrastructure.

One memorable postmortem described a "poison pill" event — a single malformed record that caused a parser to enter an infinite loop. The consumer thread hung, the partition rebalanced, and the new consumer thread also hung. Within minutes, an entire consumer group was offline. Because the pipeline shared a Kafka cluster with other services, the broker's log compaction was affected, and unrelated pipelines began seeing increased latency.

Another described a metadata store used by multiple batch jobs. A schema migration for one job locked the metadata table for 90 seconds. Every other job that touched the same table failed or timed out. What should have been a single-team issue became a company-wide incident.

The lesson from these postmortems isn't just "isolate your failures." It's that shared state is often invisible. Teams don't realize they're sharing infrastructure until it fails. The Kafka cluster, the metadata table, the shared NFS mount — these aren't considered part of the pipeline's design, but they are part of its failure domain.
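One common containment for the poison-pill case is to quarantine a record that fails, keep consuming, and stop loudly if too many records are quarantined. The sketch below shows that idea with hypothetical names (`consume`, `TooManyPoisonRecords`, a plain list standing in for a dead-letter queue). It only covers records that raise an error. A parser that hangs, as in the incident above, would also need a per-record timeout, which is not shown here.

```python
import json
from typing import Any, Callable, Iterable


class TooManyPoisonRecords(RuntimeError):
    """Raised when quarantined records exceed the allowed budget."""


def consume(
    raw_records: Iterable[str],
    handle: Callable[[Any], None],
    dead_letters: list[tuple[str, str]],
    max_dead_letters: int = 10,
) -> int:
    """Process records one by one; quarantine failures instead of stalling.

    Returns the number of records handled successfully.
    """
    processed = 0
    for raw in raw_records:
        try:
            handle(json.loads(raw))
        except Exception as exc:  # deliberately broad: any failure quarantines
            dead_letters.append((raw, repr(exc)))
            if len(dead_letters) > max_dead_letters:
                raise TooManyPoisonRecords(
                    f"{len(dead_letters)} records quarantined"
                ) from exc
            continue
        processed += 1
    return processed
```

The budget matters as much as the quarantine: without it, a dead-letter queue becomes another place for data to disappear quietly.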

What the remaining 5% looked like

The rest of the postmortems were genuinely one-off: a cosmic ray flipping a bit, a vendor API changing behavior without notice, a certificate expiring on a holiday weekend. These are the failures you can't design away. The 95% above, you can.

The design checklist

After reading these 50 postmortems, I kept seeing the same gap. The failures didn't happen because teams lacked talent, tooling, or awareness. They happened because specific design questions weren't asked early enough.

Here are six questions that, if answered honestly during design review, would have prevented the majority of incidents I analyzed:

1. What happens when the schema changes without warning?

Not "do we have a schema registry?" — that's a tooling question. The design question is: does the pipeline fail when the schema deviates from expectations, or does it silently adapt? Adaptive behavior feels safer until it produces wrong data. Default to failure. Make schema mismatches loud.

2. What's the maximum load this pipeline has been tested at, and what breaks first when we exceed it?

Most teams test for correctness. Far fewer test for limits. Know your first bottleneck — memory, connections, disk, downstream rate limits — and have a graceful degradation plan for when you hit it.

3. How would we know if we were silently losing 10% of our data?

This is the most important question. If your only validation is "the job finished," you're flying blind. You need independent data quality checks that compare output volume, distribution, and key metrics against historical baselines.
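As a rough illustration of a baseline comparison, the sketch below flags a day whose output volume sits too many standard deviations from recent history. The function name, the 3-sigma default, and the sample numbers are assumptions made for the example.

```python
from statistics import mean, stdev


def volume_anomaly(history: list[int], today: int, max_sigma: float = 3.0) -> bool:
    """Return True if today's row count deviates too far from the baseline."""
    if len(history) < 2:
        raise ValueError("need at least two historical data points")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is suspicious.
        return today != baseline
    return abs(today - baseline) / spread > max_sigma


# Hypothetical daily row counts for the last five runs.
recent = [1000, 1020, 980, 1010, 990]
print(volume_anomaly(recent, today=900))   # prints True
print(volume_anomaly(recent, today=1005))  # prints False
```

A check like this won't see a loss that hides inside normal variance, which is exactly how the eleven-day incident above went unnoticed. Comparing cumulative totals over a longer window, or checking distributions of key fields, helps close that gap.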

4. Are our retries safe?

Any retry logic is a potential duplication mechanism unless the sink is strictly idempotent. Review every API call, every database write, every file append. If you can't guarantee idempotency, choose at-most-once delivery deliberately and accept the occasional loss over the risk of duplication.
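One way to make a database write safe to retry is to key it on a stable event ID and let the database reject repeats. This is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database; the `events` table and its columns are hypothetical, and `ON CONFLICT ... DO NOTHING` assumes SQLite 3.24 or newer.

```python
import sqlite3


def open_sink() -> sqlite3.Connection:
    """Create an in-memory sink; the PRIMARY KEY is what makes writes idempotent."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "event_id TEXT PRIMARY KEY, "
        "amount_cents INTEGER NOT NULL)"
    )
    return conn


def write_event(conn: sqlite3.Connection, event_id: str, amount_cents: int) -> bool:
    """Insert the event once; return False if this event_id was already stored."""
    cur = conn.execute(
        "INSERT INTO events (event_id, amount_cents) VALUES (?, ?) "
        "ON CONFLICT(event_id) DO NOTHING",
        (event_id, amount_cents),
    )
    conn.commit()
    return cur.rowcount == 1


conn = open_sink()
write_event(conn, "evt-1", 500)
write_event(conn, "evt-1", 500)  # simulated retry after a network error
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # prints 1
```

The conflict clause is scoped to `event_id` on purpose. A blanket "ignore all errors" insert would also swallow other constraint violations, trading duplication for silent loss.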

5. What other systems fail if this one does?

Map your failure domain. If your pipeline hangs, does it block a shared queue? Does it exhaust a connection pool? Does it fill a disk that other jobs need? Design for blast radius containment, not just recovery.
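Mapping a failure domain can begin as a plain inventory. The sketch below assumes you can list, per pipeline, the infrastructure it touches (all names here are made up) and inverts that list to show which resources are shared and by whom.

```python
from collections import defaultdict


def shared_failure_domains(dependencies: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each resource used by more than one pipeline to those pipelines."""
    users: dict[str, set[str]] = defaultdict(set)
    for pipeline, resources in dependencies.items():
        for resource in resources:
            users[resource].add(pipeline)
    return {
        resource: sorted(pipelines)
        for resource, pipelines in users.items()
        if len(pipelines) > 1
    }


# Hypothetical inventory: pipeline -> infrastructure it depends on.
inventory = {
    "payments_etl": {"kafka_main", "metadata_db"},
    "viewing_metrics": {"kafka_main"},
    "reports": {"metadata_db", "nfs_scratch"},
}
for resource, pipelines in sorted(shared_failure_domains(inventory).items()):
    print(resource, pipelines)
```

Anything that shows up in the result is part of more than one pipeline's blast radius, whether or not it appears in anyone's design document.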

6. Can someone who's never seen this pipeline debug it at 3 AM?

The postmortems with the fastest recovery times all had one thing in common: observability that didn't require institutional knowledge. Logs that explain decisions, not just state changes. Metrics that show data health, not just system health. Alerts that point to root cause, not just symptoms.

The uncomfortable truth

Reading 50 postmortems doesn't make you immune to failure. But it does make the patterns obvious. And the patterns are, for the most part, boring. Schema drift. Load limits. Missing validation. Shared state. These aren't exotic distributed systems problems. They're design hygiene.

The teams that published these postmortems are among the best in the world at building data infrastructure. If they're still hitting these patterns, everyone else is too. The difference is whether you catch them in design review or at 3 AM.
