The 5 Distributed System Failures That Show Up in 80% of Postmortems


On October 20, 2025, a race condition in DynamoDB's DNS automation produced an empty record for dynamodb.us-east-1.amazonaws.com, and US-EAST-1 went into a 15-hour cascade that took down EC2 launches, Lambda invocations, Fargate tasks, and a long list of services that do not look anything like a database (AWS DynamoDB outage analysis on InfoQ). Downdetector logged 6.5M reports across more than 1,000 services.

If you read enough public postmortems (Cloudflare, AWS, Google Cloud, Stripe, GitHub), you stop being surprised. The same five failure modes show up over and over. Different companies, different years, different stacks, the same shape of incident.

This is the catalog. Five failures, each tied to a real public RCA, each with a defensive pattern that costs you a few lines of code and saves you a Friday.

1. Thundering herd on retry

The pattern. A backend hiccups. Every client retries. The retries arrive in lockstep, multiplying the original load. The backend, which was about to recover, is buried again. Repeat until somebody manually drains traffic.

Cloudflare has shipped this exact failure publicly twice. In their November 2023 control plane outage, a datacenter failover triggered a rush of API calls into Europe that overwhelmed the recovering servers (Cloudflare control plane post-mortem). On September 12, 2025, a bug in the dashboard's logic amplified the same thundering herd pattern when the dashboard came back online (Cloudflare September 2025 deep dive). The same shape recurs whenever a backend restart sends every client retrying in lockstep: the recovering tier sees a wave instead of a ramp, and falls over again.

The defensive pattern. Exponential backoff with jitter, applied at every retry boundary. Synchronized retries are what turn a hiccup into an outage. Adding a random component to the backoff de-synchronizes them, spreading the recovery load across the backoff window instead of stacking it on the next tick.

import random
import time
from functools import wraps

class TransientError(Exception):
    pass

class PermanentError(Exception):
    pass


def retry_with_jitter(
    max_attempts: int = 5,
    base: float = 0.2,
    cap: float = 30.0,
):
    """Decorator: full-jitter exponential backoff.

    Sleeps for random(0, min(cap, base * 2**attempt)).
    Permanent errors break out immediately.
    """
    def deco(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except PermanentError:
                    raise
                except TransientError:
                    if attempt == max_attempts - 1:
                        raise
                    delay = random.uniform(
                        0, min(cap, base * (2 ** attempt))
                    )
                    time.sleep(delay)
        return wrapper
    return deco

The decorator splits the error space into two: transient errors sleep and retry, permanent errors propagate immediately. Wrapping a real call site stays compact:

@retry_with_jitter(max_attempts=6, base=0.25, cap=20.0)
def fetch_user(user_id: str) -> dict:
    # "http" stands in for any requests-style client
    # (e.g. a requests.Session).
    resp = http.get(f"/users/{user_id}", timeout=2)
    if resp.status_code >= 500:
        raise TransientError(resp.status_code)
    if resp.status_code == 404:
        raise PermanentError("not found")
    return resp.json()

Three details matter. Full jitter (a random delay between zero and the current exponential ceiling) outperforms equal jitter for de-synchronization; this is the AWS recommendation, reproduced across dozens of real systems. A cap on the delay stops a long retry chain from sleeping for hours. Permanent errors break immediately so you do not waste retry budget on a request that will never succeed.

2. Cascading timeouts

The pattern. Service A calls Service B with a 30-second timeout. Service B, internally, calls Service C with a 30-second timeout. Service C is slow but eventually returns. Service B has already timed out, freed its connection, and dropped the response. Service A retries Service B. Service B starts a new call to Service C. The chain stacks.

The Stripe 2022 latency event is the textbook case, as recapped in third-party write-ups: within 45 minutes, p95 latency across the Payments API was reported above 4 seconds. The trigger was a metadata write path saturating a database cluster; the amplification was clients retrying every slow response, which added load to the same already-overloaded pool, which slowed the next set of responses, which produced more retries (third-party recap on Medium). The event reportedly lasted three hours.

The defensive pattern. Timeout budgeting. Each layer in the call chain gets less time than the layer above it, with explicit deadline propagation. If A budgets 30s, B should know it has 27s left, and C should know it has 24s. When a deadline is exceeded inside C, C cancels its own work. No orphaned database queries to keep tying up the pool.
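
A minimal sketch of the budgeting idea, with illustrative names (Deadline, reserve_s) rather than any particular framework's API; gRPC deadlines and Go's context.Context give you the same mechanism natively:

import time

class DeadlineExceeded(Exception):
    pass

class Deadline:
    """Absolute deadline carried down the call chain."""

    def __init__(self, budget_s: float):
        self.expires_at = time.monotonic() + budget_s

    def remaining(self, reserve_s: float = 0.0) -> float:
        """Budget left for the next hop, minus headroom this layer
        keeps for its own work. Raises rather than handing out a
        dead budget."""
        left = self.expires_at - time.monotonic() - reserve_s
        if left <= 0:
            raise DeadlineExceeded("no budget left for downstream call")
        return left

# Service A budgets 30s for the whole request...
deadline = Deadline(budget_s=30.0)

# ...B keeps ~3s of headroom and hands the rest (about 27s) to C...
timeout_for_c = deadline.remaining(reserve_s=3.0)

# ...and C does the same before touching the database, so a blown
# deadline cancels work instead of orphaning a query:
# db.execute(query, timeout=deadline.remaining(reserve_s=3.0))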

The other half is circuit breakers at every service-to-service call. Once you see a sustained error rate or latency spike from a downstream, you stop sending requests to it for a window. The downstream gets a chance to recover; the upstream stops adding to the queue. Third-party analysis of a February 2025 Stripe API timeout event suggests circuit breakers were slow to trip (third-party recap at prodrescueai.com).
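
A sketch of a consecutive-failure breaker, single-threaded and illustrative; production code would reach for a library such as pybreaker rather than hand-rolling this:

import time

class CircuitOpen(Exception):
    pass

class CircuitBreaker:
    """Trips after `threshold` consecutive failures, fails fast for
    `cooldown_s`, then lets a single probe through (half-open)."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.threshold:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise CircuitOpen()  # shed load without touching downstream
            # cooldown elapsed: fall through and allow one probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # (re)start the cooldown
            raise
        self.failures = 0  # any success closes the circuit
        return result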

3. Queue overflow / poison message

The pattern. A worker pool consumes from a queue. One message contains data the worker cannot handle: a malformed payload, a missing dependency, a serialization bug. The worker crashes or panics. The message goes back to the queue. Another worker picks it up. Same crash. The poison message blocks the queue while every other message backs up behind it.

This pattern is canonical enough that the danluu/post-mortems index has multiple entries of this shape: a bad event in an async workers queue triggers unhandled panics that repeatedly crash the worker. Variants appear in public postmortems across queue-based systems, where a single malformed message can stall throughput for hours before the on-call notices the redelivery loop.

The defensive pattern. Dead-letter queues with redelivery counts. A message that has failed N times (typically 3–5) does not go back on the main queue. It goes to a separate DLQ topic, alongside the exception that killed it. The main queue keeps moving. A human or an automated triage consumer drains the DLQ on its own time.
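
A sketch of the worker side, with queue, dlq, msg.delivery_count, and process as hypothetical stand-ins for your broker's interface (SQS exposes this as ApproximateReceiveCount plus a redrive policy; RabbitMQ via the x-death header):

MAX_DELIVERIES = 4  # failure budget before a message is parked

def worker_loop(queue, dlq, process):
    """Consume forever; a poison message leaves the main queue after
    MAX_DELIVERIES failed attempts instead of blocking it."""
    for msg in queue:
        try:
            process(msg)
            queue.ack(msg)
        except Exception as exc:
            if msg.delivery_count >= MAX_DELIVERIES:
                # Park the payload with the exception that killed it,
                # then ack so the main queue keeps moving.
                dlq.publish({"body": msg.body, "error": repr(exc)})
                queue.ack(msg)
            else:
                queue.nack(msg)  # broker redelivers; delivery_count rises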

The piece teams skip: monitoring on DLQ depth and DLQ age. A DLQ with 200k messages that nobody is looking at is just a slower form of data loss. A 5-minute SLO on DLQ inspection (page if the oldest unacknowledged DLQ message is older than X) turns the queue from a graveyard into a triage signal.
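
A hypothetical scheduled check for that SLO; with a managed broker you would alert on a native metric instead, such as SQS's ApproximateAgeOfOldestMessage:

import time

DLQ_MAX_AGE_S = 300  # the "X" in the SLO above: five minutes

def dlq_age_check(dlq, page):
    """Run on a schedule; page when the DLQ stops being triaged."""
    oldest = dlq.oldest_message()  # hypothetical broker call
    if oldest is None:
        return
    age = time.time() - oldest.enqueued_at
    if age > DLQ_MAX_AGE_S:
        page(f"DLQ SLO breach: oldest message {age:.0f}s old, "
             f"depth {dlq.depth()}")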

4. Hot-key contention

The pattern. A distributed store partitions data by key. One key (typically a celebrity user, a special tenant, a global counter) gets a disproportionate share of traffic. The shard holding that key saturates. Latency on that shard spikes, queue depth grows, and clients reading other keys on the same shard get caught in the splash.

The folklore version is Twitter's "Bieber problem" — one user with so many followers that the fan-out machinery had to special-case the account. The same pattern shows up in any chat or social platform where one channel or one tenant draws orders of magnitude more traffic than the median. Redis replication lag during failover, at scale, is overwhelmingly a hot-key story: one writer producing more writes per second than the replica can apply, the replication offset diverging until a failover loses minutes of writes (Redis replication lag handling, vendor write-up).

The defensive pattern. Identify, then split. Identify hot keys with sampling; every store that matters supports a sampled keyspace analyzer. Split them by adding a salt to the key (user:bieber becomes user:bieber:0, user:bieber:1, ... user:bieber:N) for writes, and aggregate on read. For read-heavy hot keys, push the value into a local cache with a short TTL so the read load decouples from the shard.
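
A sketch of write-side salting and read-side aggregation for a counter-shaped key, with store as a hypothetical Redis-like client; note this only works cleanly for commutative aggregates like counts and sums:

import random

N_SALTS = 8  # sub-keys a known-hot key gets split across

def salted_write_key(base: str) -> str:
    """Writes land on a random salt, spreading one hot key across
    N_SALTS shard slots."""
    return f"{base}:{random.randrange(N_SALTS)}"

def read_aggregate(store, base: str) -> int:
    """Reads reassemble the value by summing every salt."""
    return sum(int(store.get(f"{base}:{i}") or 0) for i in range(N_SALTS))

# store.incr(salted_write_key("user:bieber:followers"))
# total = read_aggregate(store, "user:bieber:followers")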

The structural fix is harder: design schemas that do not encourage hot keys. Compound keys that include a time bucket prevent a single-counter pattern from concentrating on one row. Pre-aggregation at write time turns a hot read into a cheap one. Both require thinking about the workload during design, not after the first incident.
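
For example, a minute-bucketed compound key (illustrative naming, assuming the same hypothetical store) spreads a global counter across one row per minute:

from datetime import datetime, timezone

def bucketed_key(metric: str, now: datetime | None = None) -> str:
    """Compound key with a minute bucket: the write load for a global
    counter spreads across one row per minute instead of piling onto
    a single row forever."""
    now = now or datetime.now(timezone.utc)
    return f"{metric}:{now:%Y%m%d%H%M}"

# store.incr(bucketed_key("pageviews"))  # e.g. "pageviews:202510201432"
# A rollup job pre-aggregates closed buckets so reads stay cheap.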

5. Replication lag during failover

The pattern. A primary database fails. Failover promotes a replica that was N seconds behind. Writes that were acknowledged to clients in the last N seconds are gone. Worse, after promotion the system reads from the new primary, but caches, queues, and other replicas still hold artifacts of the old primary's final writes, writes that no longer exist anywhere.

The October 2025 AWS DynamoDB outage has this shape at planet scale. The race condition in DynamoDB's internal DNS automation produced an empty endpoint record, and the recovery path needed the same DNS layer that the failure had corrupted to make progress, which is what created the circular failure loop. Incomplete replication of control-plane metadata across regions, combined with a lack of automated failover triggers, then widened the global impact (InfoQ DynamoDB postmortem). The outage lasted 15 hours not because the bug was hard to find, but because the recovery path itself depended on the broken component.

The defensive pattern is structural and unglamorous. Synchronous replication on the writes you cannot afford to lose, with the latency penalty accepted as a cost of doing business: billing rows, payment transactions, identity records. Asynchronous replication for everything else, with explicit acknowledgement that data within the lag window is at risk. Bounded replication lag as an SLO: alert when replicas drift past N seconds, and refuse to promote a replica that is too far behind unless an operator forces it.
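
A sketch of that promotion guard, with replicas as hypothetical objects exposing a name and a lag reading; this is roughly what Patroni's maximum_lag_on_failover setting enforces for Postgres:

MAX_PROMOTABLE_LAG_S = 5.0  # lag SLO: alert above it, never auto-promote past it

def choose_new_primary(replicas, force: bool = False):
    """Pick the least-lagged replica; refuse automatic promotion when
    even the best candidate would lose more than the SLO allows."""
    best = min(replicas, key=lambda r: r.lag_seconds)
    if best.lag_seconds > MAX_PROMOTABLE_LAG_S and not force:
        raise RuntimeError(
            f"{best.name} is {best.lag_seconds:.1f}s behind; promoting it "
            "drops acknowledged writes. An operator must force this."
        )
    return best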

The other piece: failover paths cannot depend on the system that is failing over. AWS's DynamoDB recovery depended on the same DNS automation layer the incident had corrupted. A clean failover path uses out-of-band coordination (separate DNS, separate identity, separate observability) for the control plane that runs the failover.

What ties them together

Every one of these failures is about feedback loops you did not design. A retry storm: client and server. A cascading timeout: service tiers. A poison message: worker and queue. A hot key: traffic distribution and shard placement. A replication-lag failover: the failure and the recovery system.

The defensive patterns share a shape too: break the loop with randomness, with bounds, with budgets, or with isolation. Jitter de-synchronizes the retries. Timeouts break the unbounded wait. DLQs cut the redelivery cycle. Hot-key splitting spreads the concentration. Out-of-band failover breaks the dependency cycle.

You will not catch all five with a code review. You will catch some of them by reading other people's RCAs as if they were a checklist for your own design.

If this was useful

Most of what makes these defenses work is in the building blocks: how you partition, how you replicate, how you queue. The System Design Pocket Guide: Fundamentals covers those at the depth where decisions like "should this retry be idempotent" stop being a guess. And once a system is running, you live or die by what you can see. The LLM Observability Pocket Guide is about exactly that: picking the tracing and eval tooling that turns "something is slow" into a real signal.
