TechLogStack

Posted on Jun 3 • Originally published at techlogstack.com on May 31

A Race Condition in DynamoDB's DNS Took Down Snapchat, Fortnite, Ring, and Half the Internet for 15 Hours

#distributedsystems #reliability #devops #webdev

October 19–20, 2025 — 15-hour outage in US-EAST-1
Root cause: race condition between two DNS Enactor processes; cleanup job deleted active DNS records
~3 hours for DynamoDB to recover; 12+ additional hours for EC2 cascade to clear
140+ AWS services affected: EC2, IAM, Lambda, STS, S3, and every control-plane dependency
Snapchat (375M daily users), Fortnite, Roblox, Ring, Venmo, Coinbase, UK HMRC all affected
17M+ outage reports across 3,000+ organisations (Ookla data); 20–30% of internet-facing services disrupted at peak
Recovery anti-pattern: engineers had to manually disable automatic failover — the automation was making things worse

It was 11:48 PM PDT on October 19, 2025. Two automation processes inside AWS's DynamoDB DNS management system were doing the same job simultaneously — one fast, one painfully slow. The slow one was just finishing up when the fast one, having already completed, triggered a cleanup job that deleted the slow one's work. In that moment, every DNS record for DynamoDB in the world's busiest cloud region vanished. Snapchat went dark for 375 million daily users. Fortnite lobbies dissolved mid-match. Ring cameras stopped recording. The UK's HMRC tax authority went offline. For 15 hours, the internet's largest database service had no address.

The Story

When this issue occurred at 11:48 PM PDT, all systems needing to connect to the DynamoDB service in the N. Virginia (us-east-1) Region via the public endpoint immediately began experiencing DNS failures and failed to connect to DynamoDB. This included customer traffic as well as traffic from internal AWS services that rely on DynamoDB.

— Amazon Web Services, Official Post-Incident Summary, October 2025

DynamoDB is not just a database. Inside AWS's infrastructure, it is the connective tissue — the system that EC2, IAM, Lambda, STS, Redshift, and dozens of other control-plane services rely on to store metadata, track state, and coordinate operations. When DynamoDB becomes unreachable, it doesn't just take databases offline. It takes down the systems that manage everything else. This is why a DNS failure that lasted roughly three hours for DynamoDB itself cascaded into a 15-hour platform-wide crisis. The control plane broke. And when the control plane breaks, recovery is not a matter of fixing the root cause — it is a matter of stabilising everything that lost its footing when the ground disappeared.

The Two-Component DNS Architecture: Planner and Enactor

At AWS's scale, DynamoDB maintains hundreds of thousands of DNS records to route traffic across load balancers. AWS built a two-component system to manage this. The DNS Planner monitors load balancer health and periodically creates DNS plans: specifications of which load balancers should receive traffic and with what weight distribution. The DNS Enactors are the workers (multiple independent processes running across three Availability Zones) that pick up the plans and apply them to Amazon Route 53. Multiple Enactors running in parallel provide redundancy. In theory.

Problem

Enactor A Slows Down — And Its Stale Check Becomes a Time Bomb

DNS Enactor A began applying an older DNS plan but encountered unusual delays — blocked trying to update records, moving painfully slowly through the list of endpoints. Crucially, Enactor A performed a staleness check early in its process: "Is my plan newer than what's currently active?" At the time of that check, it was. But by the time Enactor A actually finished applying the plan, newer plans had been created and applied. The staleness check was now stale itself.

Cause

The Race Condition Fires — Enactor B Wins, Then Cleans Up

While Enactor A was slowly working through its updates, Enactor B picked up one of the newer plans and rapidly applied it across all endpoints. When Enactor B completed, it triggered the cleanup process: identify plans that are significantly older than the one just applied, and delete them. At that exact moment — T+45 seconds after the race began — Enactor A finally finished applying its old plan, overwriting Enactor B's newer records. The cleanup job identified Enactor A's newly-applied old plan as many generations old, and deleted it. All DynamoDB DNS records for the US-EAST-1 regional endpoint were gone.
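
The sequence above can be replayed deterministically in a few lines. This is a toy model, not AWS's implementation: `DnsState`, `Enactor`, `cleanup`, the generation numbers, and the `keep_last` threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DnsState:
    """Toy stand-in for the plan store and the live DNS record set."""
    active_generation: int = 0
    plans: dict = field(default_factory=dict)    # generation -> list of IPs
    records: list = field(default_factory=list)  # what DNS currently serves


class Enactor:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.plan_generation = None

    def pick_up(self, generation):
        # Staleness is checked once, at pickup time. This is the flaw.
        if generation <= self.state.active_generation:
            raise ValueError("plan is stale")
        self.plan_generation = generation

    def apply(self):
        # No re-check here: a delayed Enactor applies whatever it picked up.
        gen = self.plan_generation
        self.state.records = list(self.state.plans[gen])
        self.state.active_generation = gen


def cleanup(state, newest_applied, keep_last=2):
    """Delete plans much older than the newest applied one."""
    for gen in list(state.plans):
        if gen < newest_applied - keep_last:
            if gen == state.active_generation:
                state.records = []  # the live records go with the plan
            del state.plans[gen]


def replay_race():
    state = DnsState(plans={g: [f"10.0.0.{g}"] for g in range(1, 11)})
    a, b = Enactor("A", state), Enactor("B", state)
    a.pick_up(1)   # A passes its staleness check, then stalls
    b.pick_up(10)
    b.apply()      # B applies the newest plan quickly
    a.apply()      # A finishes late and overwrites B's records
    cleanup(state, newest_applied=10)  # B's cleanup removes the "old" plan
    return state
```

Calling `replay_race()` leaves `state.records` empty: the cleanup treats generation 1 as long obsolete even though it is the plan currently serving traffic.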

Solution

11:48 PM PDT: Total DNS Blackout → Manual Recovery

At 11:48 PM PDT, every system trying to connect to DynamoDB in US-EAST-1 received DNS failures. Engineers identified the DNS issue by 12:38 AM PDT, began temporary mitigations by 1:15 AM PDT, and DynamoDB itself recovered by approximately 2:25 AM PDT, roughly three hours after the incident began. But the cascade had already overwhelmed EC2's Droplet Workflow Manager (DWFM) with a backlog of expired instance leases it couldn't process.

Result

15 Hours of Cascading Failure

The DWFM entered congestive collapse, requiring 12+ more hours for network state to fully stabilise. Engineers had to manually disable the automatic failover system entirely to stop it from flip-flopping between states and allow the platform to stabilise. Full recovery across all services wasn't complete until late afternoon on October 20 — roughly 15 hours after the cascade began.

The Fix

AWS's Post-Incident Fixes: Preventing the Race, Containing the Cascade

AWS's five-layer post-incident fix plan (from the official post-incident summary, October 23, 2025):

Failure Layer	What Went Wrong	AWS's Fix
DNS Enactor race condition	Enactor A's stale staleness check allowed it to overwrite Enactor B's newer plan	Stronger staleness validation at time of application — must reflect current world state, not time of plan pickup
Cleanup automation	Cleanup job deleted Enactor A's just-applied old plan, wiping all DNS records	Safeguards ensuring no automated process can delete an active DNS plan regardless of generation number
NLB failover velocity	Network Load Balancers moved large capacity during AZ failover, amplifying the cascade	Velocity control mechanism limiting how much capacity a single NLB can remove during health check failures
EC2 recovery workflow	DWFM entered congestive collapse when DynamoDB recovered — failure mode not tested at scale	Additional test suite to exercise the DWFM recovery workflow at scale before production discovery
Automatic failover during recovery	Failover automation flip-flopped during recovery, requiring manual disabling before stabilisation	Review of failover automation behaviour during degraded DNS states — distinguish 'service down' from 'DNS inconsistent during recovery'
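
The velocity control in the NLB row can be pictured as a removal budget: health checks may take targets out of service, but only up to a capped fraction per time window. The sketch below is an assumption about the general shape of such a control, not AWS's mechanism; `RemovalBudget`, the 20% cap, and the 60-second window are hypothetical.

```python
class RemovalBudget:
    """Caps how many targets health checks may remove within a time window."""

    def __init__(self, total_targets, max_fraction=0.2, window_seconds=60.0):
        self.max_removals = int(total_targets * max_fraction)
        self.window_seconds = window_seconds
        self._removal_times = []

    def try_remove(self, now):
        """Return True if one more target may be removed at time `now` (seconds)."""
        # Forget removals that have aged out of the window.
        self._removal_times = [
            t for t in self._removal_times if now - t < self.window_seconds
        ]
        if len(self._removal_times) >= self.max_removals:
            return False  # budget spent: keep the target in service and alert
        self._removal_times.append(now)
        return True
```

With 10 targets and the defaults, two removals are allowed inside any 60-second window and the third is refused, which bounds how quickly a burst of failing health checks can drain capacity.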

~3 hrs — time from incident start to DynamoDB DNS restoration
12+ hrs — additional hours EC2's Droplet Workflow Manager required to clear congestive collapse
140+ — AWS services eventually affected; DynamoDB powers the control planes of EC2, IAM, Lambda, STS
$581M — estimated insurance losses (CyberCube) representing disruption to thousands of globally dependent businesses

The Anti-Pattern: When Automation Prevents Recovery

The most counterintuitive part of the recovery was that engineers had to disable automatic failover to stabilise the system. The automatic failover mechanisms were detecting DNS inconsistency as failures and triggering failovers, which created new inconsistencies, which triggered more failovers. The automation designed to speed recovery was making recovery impossible. Engineers had to manually turn it off, let the system reach a stable state, and re-enable it with correct DNS records in place. Sometimes, the recovery automation has to stop before recovery can start. Build your recovery playbooks to include the question: "Is any automated system currently making this worse?"
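
One way to build that question into the automation itself is a flap guard with a manual pause switch: if failovers fire too often in a short window, the controller stops acting and waits for a human. This is a minimal sketch under stated assumptions; `FailoverController` and its thresholds are hypothetical and do not describe AWS's failover system.

```python
class FailoverController:
    """Failover automation that pauses itself when it starts flapping."""

    def __init__(self, max_failovers=3, window_seconds=300.0):
        self.max_failovers = max_failovers
        self.window_seconds = window_seconds
        self.paused = False
        self._history = []  # timestamps of recent failovers

    def pause(self):
        """Manual kill switch for operators."""
        self.paused = True

    def resume(self):
        """Re-enable automation once the system is known to be stable."""
        self.paused = False
        self._history.clear()

    def on_unhealthy(self, now):
        """Return True if a failover should proceed at time `now` (seconds)."""
        if self.paused:
            return False
        self._history = [
            t for t in self._history if now - t < self.window_seconds
        ]
        if len(self._history) >= self.max_failovers:
            self.paused = True  # flapping detected: stop and page a human
            return False
        self._history.append(now)
        return True
```

The design choice is that the controller never resumes on its own. Only an explicit `resume()` call restarts it, which mirrors the manual disable and re-enable sequence described above.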

The congestive collapse pattern that extended the outage by 12 hours is worth naming clearly. When DynamoDB recovered, EC2's DWFM was facing an enormous queue of backlogged lease management tasks — all trying to execute simultaneously. The more it tried to process, the more it overwhelmed the now-recovered DynamoDB, which slowed processing, which lengthened the queue, which increased the pressure. The system was stuck in a self-sustaining degraded state. This is the same metastable failure pattern documented in the Slack 2-22-22 incident — and the solution is the same: reduce incoming load or add capacity, rather than waiting for self-recovery.

The EC2 Droplet Workflow Manager congestive collapse

EC2's Droplet Workflow Manager (DWFM) is the system responsible for managing EC2 instance lifecycle events, including lease renewals. When DynamoDB became unavailable, DWFM couldn't process instance state updates and began accumulating a backlog of expired leases. By the time DynamoDB recovered, DWFM was facing an enormous simultaneous queue. The system entered congestive collapse: the more it tried to process, the more it overwhelmed the now-recovered DynamoDB, which slowed processing, which lengthened the queue. Full recovery from this collapse took 12+ additional hours after DynamoDB was fixed. AWS's fix: build the test suite that exercises this recovery workflow at production scale.
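
The feedback loop described above can be illustrated with a toy queue model. The overload behaviour here is an assumption made for the sketch, not a measurement of DWFM or DynamoDB: when offered load exceeds capacity, useful throughput is modelled as `capacity**2 / offered`, and failed work stays in the backlog. `ticks_to_drain` and `max_in_flight` are hypothetical names.

```python
def ticks_to_drain(backlog, capacity, max_in_flight=None, limit_ticks=100_000):
    """Count ticks needed to drain `backlog` against a backend of `capacity`.

    Toy overload model (an assumption): above capacity, useful throughput
    falls to capacity**2 // offered because work is wasted on timeouts and
    retries. Items that are not completed stay in the backlog.
    """
    ticks = 0
    while backlog > 0 and ticks < limit_ticks:
        # Without admission control, the whole backlog is offered at once.
        offered = backlog if max_in_flight is None else min(backlog, max_in_flight)
        if offered <= capacity:
            completed = offered
        else:
            completed = max(1, capacity * capacity // offered)
        backlog -= completed
        ticks += 1
    return ticks


unthrottled = ticks_to_drain(10_000, capacity=100)
throttled = ticks_to_drain(10_000, capacity=100, max_in_flight=100)
```

In this model the throttled drain finishes in 100 ticks, while the unthrottled one needs several thousand. Capping in-flight work at what the dependency can absorb is what lets the queue shrink.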

The hidden cross-region dependency problem

The October 2025 outage adds to a body of evidence about a specific architectural anti-pattern: regions that are called independent but aren't. AWS regions were designed with the premise that a failure in US-EAST-1 should not affect services running in EU-WEST-1. But control-plane dependencies — authentication services, metadata stores, quota management systems — create invisible cross-region ties. Ring cameras deployed globally still authenticated against US-EAST-1 IAM. UK government services deployed in EU regions still made US-EAST-1 API calls. True regional independence requires not just deploying application code in multiple regions, but ensuring that every control-plane dependency is also independently redundant per region. For most organisations, this is not the architecture they have — it is the architecture they think they have.
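
A first pass at finding these ties is to list the AWS hostnames an application actually calls and flag any that point outside its home region. The sketch below is a rough heuristic under stated assumptions: `audit_endpoints` is a hypothetical helper, it only recognises hostnames of the form `service.region.amazonaws.com` or `service.amazonaws.com`, and it silently skips anything else.

```python
import re

# Matches "service.region.amazonaws.com" and region-less "service.amazonaws.com".
_ENDPOINT = re.compile(
    r"^(?P<service>[a-z0-9-]+)\."
    r"(?:(?P<region>[a-z]{2}(?:-[a-z]+)+-\d+)\.)?"
    r"amazonaws\.com$"
)


def audit_endpoints(endpoints, home_region):
    """Return (hostname, reason) pairs for endpoints outside `home_region`."""
    findings = []
    for host in endpoints:
        match = _ENDPOINT.match(host)
        if match is None:
            continue  # not a hostname shape this sketch understands
        region = match.group("region")
        if region is None:
            findings.append((host, "region-less endpoint: check where it is served"))
        elif region != home_region:
            findings.append((host, f"depends on {region}"))
    return findings


findings = audit_endpoints(
    [
        "dynamodb.eu-west-1.amazonaws.com",
        "sts.amazonaws.com",
        "dynamodb.us-east-1.amazonaws.com",
    ],
    home_region="eu-west-1",
)
```

Here `findings` flags `sts.amazonaws.com` and `dynamodb.us-east-1.amazonaws.com` but not the EU endpoint. A real audit would also need to cover dependencies that third-party SDKs and vendors introduce, which a hostname list alone will not reveal.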

Architecture

The October 2025 DynamoDB outage is a case study in control-plane failure — a class of failure categorically more damaging than a data-plane failure because it removes the ability to manage and coordinate infrastructure rather than just disrupting one service.

Major services affected:

Category	Affected Services
Social & Entertainment	Snapchat (375M daily users), Discord, Reddit, Roblox, Fortnite, Disney+, Hulu, Twitch
Finance & Payments	Coinbase, Venmo, Lloyds, Halifax
Smart Home & IoT	Amazon Ring, Amazon Alexa, Eight Sleep
Communications	Signal, enterprise platforms
Government	UK HMRC tax authority
Travel	United Airlines, Delta apps
AWS Services (internal)	EC2, IAM, STS, Lambda, S3, SQS, Redshift (140+ total)

The DNS Race Condition: Step-by-Step

Interactive diagram available on TechLogStack.

The Cascade: How DynamoDB's DNS Failure Propagated

Interactive diagram available on TechLogStack.

Why US-EAST-1 Became a Single Point of Failure for the Internet

AWS designed its regions to be independently operable: a failure in US-EAST-1 should not affect EU-WEST-1. This design intention is correct, but the reality that emerged over nearly 20 years is different. US-EAST-1 is where AWS first launched most services, accumulating the most mature feature sets. It became the default: the region developers reach for first, and the one where years of "just deploy to us-east-1" decisions have concentrated critical infrastructure. Even services claiming multi-region redundancy often still rely on US-EAST-1 for authentication flows, control-plane coordination, or foundational database calls. The technical independence of regions is real. The operational independence, as experienced during the October 2025 outage, is not.

Lessons

Staleness checks must be evaluated at time of use, not time of pickup. Enactor A's staleness check was valid when it ran. By the time Enactor A acted on the result, the check was stale. In any concurrent system where state can change between the check and the action, the check must be re-evaluated immediately before the action, ideally atomically with it. This is TOCTOU (Time-of-Check to Time-of-Use), a race condition in which the condition being checked changes between the check and the action. It is one of the oldest race condition patterns in computer science, and here it appeared in production at AWS scale.
No automated process should be able to delete an active record. The cleanup job had no protection for the case where an older plan was actively in use as the live DNS record. The invariant that must be protected: the record currently resolving live traffic cannot be deleted by any automated process, regardless of its generation number. This invariant is simpler than the cleanup logic that violated it.
Congestive collapse is a failure mode that only appears at scale — and the recovery path for it must be tested before it's needed. EC2's DWFM had never been tested through the scenario of processing a massive backlog of expired leases simultaneously after a DynamoDB recovery. The scenario seemed unlikely enough to skip in testing. Building the test suite that exercises recovery workflows at production scale is the investment that pays off only in disasters — but those are exactly the moments when it matters most.
Control-plane dependencies must be evaluated independently for each region. These are the hidden dependencies that applications have on cloud provider management systems (authentication services, metadata stores, quota management), and they can create cross-region failure modes even when application code is deployed in multiple regions. Ring cameras deployed globally still authenticated against US-EAST-1 IAM. True regional independence requires independently redundant control planes, not just independently deployed application code.
Sometimes, the recovery automation has to stop before recovery can start. Build recovery playbooks to include the question: "Is any automated system currently making this worse?" Automation that detects 'DNS is inconsistent during manual recovery' the same way as 'service is down' will trigger failovers that create new inconsistencies. Automation must be able to distinguish between these states — and humans must be empowered to pause it when it cannot.
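
The first two lessons can be sketched together: do the staleness check and the write in one critical section, and make cleanup refuse to touch the active plan. This is an illustration under stated assumptions, not AWS's fix. `PlanStore` is a hypothetical in-process class, and a single `threading.Lock` stands in for whatever conditional write or transaction a real distributed store would provide.

```python
import threading


class PlanStore:
    """Toy plan store; names and structure are illustrative assumptions."""

    def __init__(self, plans):
        self._lock = threading.Lock()
        self.plans = dict(plans)   # generation -> records
        self.active_generation = 0
        self.records = []

    def apply_if_newer(self, generation):
        """Check and apply in one critical section so the check cannot go stale."""
        with self._lock:
            if generation <= self.active_generation:
                return False       # a newer plan is already live; drop this one
            if generation not in self.plans:
                return False       # plan was already cleaned up
            self.records = list(self.plans[generation])
            self.active_generation = generation
            return True

    def cleanup(self, newest_seen, keep_last=2):
        """Delete old plans, but never the one currently serving traffic."""
        with self._lock:
            cutoff = newest_seen - keep_last
            for gen in list(self.plans):
                if gen < cutoff and gen != self.active_generation:
                    del self.plans[gen]
```

With this shape, a delayed Enactor calling `apply_if_newer(1)` after generation 10 is live simply gets `False`, and even if an old plan did become active, `cleanup` would leave it in place rather than empty the record set.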

Engineering Glossary

Congestive collapse — a failure mode where a system attempting to recover from backlog overwhelms its dependencies, slowing processing and lengthening the queue, creating a self-sustaining degraded state. EC2's DWFM entered congestive collapse when DynamoDB recovered and the accumulated lease backlog overwhelmed the now-restored database.

Control-plane failure — a class of failure where the management and coordination layer of a system fails, rather than the data-serving layer. Uniquely damaging because it removes the ability to manage everything else: EC2 can't track instances, IAM can't validate credentials, Lambda can't execute. Control-plane failures cascade differently from data-plane failures.

DNS Enactor — one of the worker processes in AWS's DynamoDB DNS management system that picks up DNS plans and applies them to Amazon Route 53. Multiple Enactors run in parallel across Availability Zones for redundancy. The race condition that caused the October 2025 outage occurred between two Enactors picking up different-generation plans simultaneously.

DNS Planner — the planning component in AWS's DynamoDB DNS management system that monitors load balancer health and creates DNS plans specifying which load balancers should receive traffic. Plans are then consumed by DNS Enactors.

Droplet Workflow Manager (DWFM) — EC2's system responsible for managing EC2 instance lifecycle events, including lease renewals. When DynamoDB became unavailable, DWFM accumulated a backlog of expired lease management tasks. When DynamoDB recovered, the simultaneous burst of backlog processing triggered congestive collapse.

TOCTOU (Time-of-Check to Time-of-Use) — a race condition where the condition being checked changes between when it is checked and when it is acted upon, causing the action to operate on incorrect assumptions. Enactor A checked its plan's staleness, found it valid, then applied the plan — but by the time it applied, the world had moved on and the check was stale.

Thundering herd / herd effect — a distributed systems failure mode where many clients simultaneously attempt to reconnect to a shared resource, overwhelming it. Appears in the October 2025 outage as the DWFM congestive collapse. The standard solution is randomised exponential backoff.
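
Randomised exponential backoff is short enough to show in full. The sketch below uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. `backoff_delays` and its default values are hypothetical, chosen only for illustration.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomised delay (in seconds) per retry attempt.

    The ceiling doubles each attempt up to `cap`; the actual delay is drawn
    uniformly from [0, ceiling] so that clients do not retry in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

A caller would sleep for each delay in turn between retries. Passing a seeded `random.Random` as `rng` makes the sequence reproducible in tests.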

This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack →

(Interactive diagrams, source links, and the full reader experience)

TechLogStack — built at scale, broken in public, rebuilt by engineers.
