- October 19–20, 2025 — 15-hour outage in US-EAST-1
- Root cause: race condition between two DNS Enactor processes; cleanup job deleted active DNS records
- ~3 hours for DynamoDB to recover; 12+ additional hours for EC2 cascade to clear
- 140+ AWS services affected: EC2, IAM, Lambda, STS, S3, and every control-plane dependency
- Snapchat (375M daily users), Fortnite, Roblox, Ring, Venmo, Coinbase, UK HMRC all affected
- 17M+ outage reports across 3,000+ organisations (Ookla data); 20–30% of internet-facing services disrupted at peak
- Recovery anti-pattern: engineers had to manually disable automatic failover — the automation was making things worse
It was 11:48 PM PDT on October 19, 2025. Two automation processes inside AWS's DynamoDB DNS management system were doing the same job at the same time, one fast and one painfully slow. The slow one was just finishing when the fast one, having already completed, triggered a cleanup job that deleted the slow one's work. In that moment, every DNS record for DynamoDB's regional endpoint in the world's busiest cloud region vanished. Snapchat went dark for 375 million daily users. Fortnite lobbies dissolved mid-match. Ring cameras stopped recording. The UK's HMRC tax authority went offline. For roughly three hours, the database service that much of AWS itself depends on had no address, and the outage it set off lasted 15 hours.
The Story
When this issue occurred at 11:48 PM PDT, all systems needing to connect to the DynamoDB service in the N. Virginia (us-east-1) Region via the public endpoint immediately began experiencing DNS failures and failed to connect to DynamoDB. This included customer traffic as well as traffic from internal AWS services that rely on DynamoDB.
— Amazon Web Services, Official Post-Incident Summary, October 2025
DynamoDB is not just a database. Inside AWS's infrastructure, it is the connective tissue — the system that EC2, IAM, Lambda, STS, Redshift, and dozens of other control-plane services rely on to store metadata, track state, and coordinate operations. When DynamoDB becomes unreachable, it doesn't just take databases offline. It takes down the systems that manage everything else. This is why a DNS failure that lasted roughly three hours for DynamoDB itself cascaded into a 15-hour platform-wide crisis. The control plane broke. And when the control plane breaks, recovery is not a matter of fixing the root cause — it is a matter of stabilising everything that lost its footing when the ground disappeared.
Problem
Enactor A Slows Down — And Its Stale Check Becomes a Time Bomb
DNS Enactor A began applying an older DNS plan but encountered unusual delays — blocked trying to update records, moving painfully slowly through the list of endpoints. Crucially, Enactor A performed a staleness check early in its process: "Is my plan newer than what's currently active?" At the time of that check, it was. But by the time Enactor A actually finished applying the plan, newer plans had been created and applied. The staleness check was now stale itself.
Cause
The Race Condition Fires — Enactor B Wins, Then Cleans Up
While Enactor A was slowly working through its updates, Enactor B picked up one of the newer plans and rapidly applied it across all endpoints. When Enactor B completed, it triggered the cleanup process: identify plans that are significantly older than the one just applied, and delete them. At that exact moment — T+45 seconds after the race began — Enactor A finally finished applying its old plan, overwriting Enactor B's newer records. The cleanup job identified Enactor A's newly-applied old plan as many generations old, and deleted it. All DynamoDB DNS records for the US-EAST-1 regional endpoint were gone.
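The sequence is easier to follow when replayed step by step. The sketch below is a toy model, not AWS's implementation: `DnsState`, `passes_staleness_check`, `apply_plan`, `cleanup`, the generation numbers and the `keep_last` threshold are all assumptions made for illustration. It replays the ordering described above and shows how the endpoint ends up pointing at a plan that no longer exists.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DnsState:
    """Toy model of one regional endpoint (hypothetical, not AWS's data model)."""
    plans: Dict[int, List[str]] = field(default_factory=dict)  # generation -> IPs
    active_generation: Optional[int] = None

    def resolve(self) -> List[str]:
        # The live records are those of the active plan; a deleted plan yields nothing.
        return self.plans.get(self.active_generation, [])


def passes_staleness_check(state: DnsState, generation: int) -> bool:
    """True if the candidate plan is newer than the one currently active."""
    return state.active_generation is None or generation > state.active_generation


def apply_plan(state: DnsState, generation: int) -> None:
    """Point the endpoint at the given plan, with no further checks."""
    state.active_generation = generation


def cleanup(state: DnsState, newest_applied: int, keep_last: int = 2) -> None:
    """Delete plans much older than the newest applied one, active or not."""
    for generation in list(state.plans):
        if generation <= newest_applied - keep_last:
            del state.plans[generation]


def replay_incident() -> DnsState:
    state = DnsState(plans={g: [f"10.0.0.{g}"] for g in range(1, 6)},
                     active_generation=1)
    a_may_apply = passes_staleness_check(state, 2)  # Enactor A checks early: 2 > 1
    apply_plan(state, 5)                            # Enactor B applies the newest plan
    if a_may_apply:                                 # A acts on its outdated check result
        apply_plan(state, 2)                        # and overwrites B's newer records
    cleanup(state, newest_applied=5)                # B's cleanup removes plans 1 to 3
    return state


state = replay_incident()
print(state.active_generation, state.resolve())
```

In this model the active generation is 2 but plan 2 has been deleted, so `resolve()` returns an empty list: the endpoint has no records even though newer plans still exist.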
Solution
11:48 PM PDT: Total DNS Blackout → Manual Recovery
At 11:48 PM PDT, every system trying to connect to DynamoDB in US-EAST-1 received DNS failures. Engineers identified the DNS issue by 12:38 AM PDT, began temporary mitigations by 1:15 AM PDT, and DynamoDB itself recovered by approximately 2:25 AM PDT, roughly three hours after the incident began. But the cascade had already overwhelmed EC2's Droplet Workflow Manager (DWFM) with a backlog of expired instance leases it couldn't process.
Result
15 Hours of Cascading Failure
The DWFM entered congestive collapse, requiring 12+ more hours for network state to fully stabilise. Engineers had to manually disable the automatic failover system entirely to stop it from flip-flopping between states and allow the platform to stabilise. Full recovery across all services wasn't complete until late afternoon on October 20 — roughly 15 hours after the cascade began.
The Fix
AWS's Post-Incident Fixes: Preventing the Race, Containing the Cascade
AWS's five-layer post-incident fix plan (from the official post-incident summary, October 23, 2025):
| Failure Layer | What Went Wrong | AWS's Fix |
|---|---|---|
| DNS Enactor race condition | Enactor A's stale staleness check allowed it to overwrite Enactor B's newer plan | Stronger staleness validation at time of application — must reflect current world state, not time of plan pickup |
| Cleanup automation | Cleanup job deleted Enactor A's just-applied old plan, wiping all DNS records | Safeguards ensuring no automated process can delete an active DNS plan regardless of generation number |
| NLB failover velocity | Network Load Balancers moved large capacity during AZ failover, amplifying the cascade | Velocity control mechanism limiting how much capacity a single NLB can remove during health check failures |
| EC2 recovery workflow | DWFM entered congestive collapse when DynamoDB recovered; this failure mode had not been tested at scale | Additional test suite that exercises the DWFM recovery workflow at scale, so the failure mode surfaces in testing rather than in production |
| Automatic failover during recovery | Failover automation flip-flopped during recovery, requiring manual disabling before stabilisation | Review of failover automation behaviour during degraded DNS states — distinguish 'service down' from 'DNS inconsistent during recovery' |
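The velocity control row can be illustrated with a small capacity budget. This is a sketch under assumptions, not the NLB mechanism itself: the function name `removable_now` and the 20 percent default are invented for the example. The idea is only that health-check failures may remove capacity up to a floor, and no further.

```python
def removable_now(total: int, in_service: int, failing: int,
                  max_removed_percent: int = 20) -> int:
    """Return how many failing nodes may be taken out of service right now.

    total               -- nodes registered behind the load balancer
    in_service          -- nodes currently receiving traffic
    failing             -- in-service nodes currently failing health checks
    max_removed_percent -- hypothetical cap on capacity that may be removed
    """
    # Never let in-service capacity drop below this floor because of health checks.
    floor = total - (total * max_removed_percent) // 100
    budget = max(0, in_service - floor)
    return min(failing, budget)


print(removable_now(total=100, in_service=100, failing=60))
print(removable_now(total=100, in_service=80, failing=40))
```

With the cap at 20 percent, the first call allows 20 removals even though 60 nodes are failing, and the second allows none because the floor has already been reached. A mass health-check failure then degrades capacity gradually instead of removing most of it at once.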
- ~3 hrs — time from incident start to DynamoDB DNS restoration
- 12+ hrs — additional hours EC2's Droplet Workflow Manager required to clear congestive collapse
- 140+ — AWS services eventually affected; DynamoDB powers the control planes of EC2, IAM, Lambda, STS
- $581M — estimated insurance losses (CyberCube) representing disruption to thousands of globally dependent businesses
The congestive collapse pattern that extended the outage by 12 hours is worth naming clearly. When DynamoDB recovered, EC2's DWFM was facing an enormous queue of backlogged lease management tasks — all trying to execute simultaneously. The more it tried to process, the more it overwhelmed the now-recovered DynamoDB, which slowed processing, which lengthened the queue, which increased the pressure. The system was stuck in a self-sustaining degraded state. This is the same metastable failure pattern documented in the Slack 2-22-22 incident — and the solution is the same: reduce incoming load or add capacity, rather than waiting for self-recovery.
The EC2 Droplet Workflow Manager congestive collapse
EC2's Droplet Workflow Manager (DWFM) is the system responsible for managing EC2 instance lifecycle events, including lease renewals. When DynamoDB became unavailable, DWFM couldn't process instance state updates and began accumulating a backlog of expired leases. By the time DynamoDB recovered, DWFM was facing an enormous simultaneous queue. The system entered congestive collapse: the more it tried to process, the more it overwhelmed the now-recovered DynamoDB, which slowed processing, which lengthened the queue. Network state recovery from this collapse took more than five additional hours after DynamoDB was fixed. AWS's fix: build the test suite that exercises this recovery workflow at production scale.
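A toy simulation makes the feedback loop concrete. Everything here is an assumption chosen for illustration: the `goodput` curve (useful work falls as offered load exceeds capacity), the tick-based model, and the numbers. It is not a model of DWFM or DynamoDB, only of the general pattern in which an unthrottled backlog prevents its own drainage.

```python
from typing import Optional


def goodput(offered: int, capacity: int) -> int:
    """Operations completed per tick; drops once the dependency is overloaded."""
    if offered <= capacity:
        return offered
    # Past capacity, work is wasted on timeouts and retries (assumed curve).
    return capacity * capacity // offered


def ticks_to_drain(backlog: int, capacity: int, arrivals: int,
                   throttle: bool, max_ticks: int = 1000) -> Optional[int]:
    """Return the tick at which the backlog reaches zero, or None if it never does."""
    for tick in range(1, max_ticks + 1):
        backlog += arrivals                      # new lease work keeps arriving
        offered = min(backlog, capacity) if throttle else backlog
        backlog -= goodput(offered, capacity)
        if backlog == 0:
            return tick
    return None


print(ticks_to_drain(10_000, capacity=100, arrivals=50, throttle=True))
print(ticks_to_drain(10_000, capacity=100, arrivals=50, throttle=False))
```

With these made-up numbers the throttled run drains in 200 ticks, while the unthrottled run never drains within the limit because goodput stays below the arrival rate. That is the reason the remedy is to reduce offered load rather than wait.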
The hidden cross-region dependency problem
The October 2025 outage adds to a body of evidence about a specific architectural anti-pattern: regions that are called independent but aren't. AWS regions were designed with the premise that a failure in US-EAST-1 should not affect services running in EU-WEST-1. But control-plane dependencies — authentication services, metadata stores, quota management systems — create invisible cross-region ties. Ring cameras deployed globally still authenticated against US-EAST-1 IAM. UK government services deployed in EU regions still made US-EAST-1 API calls. True regional independence requires not just deploying application code in multiple regions, but ensuring that every control-plane dependency is also independently redundant per region. For most organisations, this is not the architecture they have — it is the architecture they think they have.
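One modest starting point for finding such ties is to audit the endpoints an application is configured to call. The sketch below is only that: `cross_region_dependencies` is a hypothetical helper, the `auth` hostname is invented, and a string match on hostnames cannot see dependencies hidden inside a provider's own services. It flags any configured endpoint that is not pinned to the home region, including global endpoints such as `sts.amazonaws.com`.

```python
from typing import Dict


def cross_region_dependencies(home_region: str,
                              endpoints: Dict[str, str]) -> Dict[str, str]:
    """Return the dependencies whose hostname is not pinned to the home region."""
    return {
        name: host
        for name, host in endpoints.items()
        if f".{home_region}." not in host
    }


# Illustrative configuration for a workload deployed in eu-west-1.
example_endpoints = {
    "dynamodb": "dynamodb.eu-west-1.amazonaws.com",
    "sts": "sts.amazonaws.com",                 # global endpoint, no region in the name
    "auth": "auth.us-east-1.example.internal",  # hypothetical internal service
}

for name, host in cross_region_dependencies("eu-west-1", example_endpoints).items():
    print(f"{name}: {host} is not pinned to eu-west-1")
```

Anything the audit flags is a question to answer rather than a verdict: does this call still succeed if the other region is unreachable?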
Architecture
The October 2025 DynamoDB outage is a case study in control-plane failure — a class of failure categorically more damaging than a data-plane failure because it removes the ability to manage and coordinate infrastructure rather than just disrupting one service.
Major services affected:
| Category | Affected Services |
|---|---|
| Social & Entertainment | Snapchat (375M daily users), Discord, Reddit, Roblox, Fortnite, Disney+, Hulu, Twitch |
| Finance & Payments | Coinbase, Venmo, Lloyds, Halifax |
| Smart Home & IoT | Amazon Ring, Amazon Alexa, Eight Sleep |
| Communications | Signal, enterprise platforms |
| Government | UK HMRC tax authority |
| Travel | United Airlines, Delta apps |
| AWS Services (internal) | EC2, IAM, STS, Lambda, S3, SQS, Redshift (140+ total) |
The DNS Race Condition: Step-by-Step
Interactive diagram available on TechLogStack.
The Cascade: How DynamoDB's DNS Failure Propagated
Interactive diagram available on TechLogStack.
Lessons
Staleness checks must be evaluated at time of use, not time of pickup. Enactor A's staleness check was valid when it ran. By the time Enactor A acted on the result, the check was stale. In any concurrent system where state can change between the check and the action, the check must be re-evaluated immediately before the action, ideally as part of the same atomic step. This is TOCTOU (Time-of-Check to Time-of-Use), a race condition in which the condition being checked changes between the check and the action that relies on it. It is one of the oldest race condition patterns in computer science, and here it appeared in production at AWS scale.
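A minimal sketch of checking at time of use, assuming a single process and an in-memory store: `PlanStore` and `apply_if_newer` are hypothetical names, and a real distributed system would need a conditional write or transaction in the backing store rather than a thread lock. The point is that the comparison and the write happen inside one critical section, so nothing can change between them.

```python
import threading


class PlanStore:
    """Hypothetical store holding the generation number of the active DNS plan."""

    def __init__(self, active_generation: int = 0) -> None:
        self._lock = threading.Lock()
        self.active_generation = active_generation

    def apply_if_newer(self, generation: int) -> bool:
        """Check and apply atomically, so the check cannot go stale before the write."""
        with self._lock:
            if generation <= self.active_generation:
                return False  # a newer plan landed first; drop this one
            self.active_generation = generation
            return True


store = PlanStore(active_generation=1)
print(store.apply_if_newer(5))  # fast enactor applies the newest plan
print(store.apply_if_newer(2))  # slow enactor arrives late and is rejected
```

The slow enactor's plan is rejected at the moment of application, however long ago it was picked up.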
No automated process should be able to delete an active record. The cleanup job had no protection for the case where an older plan was actively in use as the live DNS record. The invariant that must be protected: the record currently resolving live traffic cannot be deleted by any automated process, regardless of its generation number. This invariant is simpler than the cleanup logic that violated it.
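Expressed as code, the invariant is a single set subtraction. This is a sketch with assumed names (`plans_safe_to_delete`, `keep_last`) and an assumed age rule, not AWS's cleanup logic: whatever rule selects old plans, the active generation is removed from the result before anything is deleted.

```python
from typing import Set


def plans_safe_to_delete(plans: Set[int], active_generation: int,
                         newest_applied: int, keep_last: int = 2) -> Set[int]:
    """Select old plans for deletion, never including the one serving live traffic."""
    old = {g for g in plans if g <= newest_applied - keep_last}
    # The guard: the active plan is exempt regardless of its generation number.
    return old - {active_generation}


print(sorted(plans_safe_to_delete({1, 2, 3, 4, 5}, active_generation=2, newest_applied=5)))
```

Here plan 2 is old by the age rule but is still active, so only plans 1 and 3 are returned.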
Congestive collapse is a failure mode that only appears at scale — and the recovery path for it must be tested before it's needed. EC2's DWFM had never been tested through the scenario of processing a massive backlog of expired leases simultaneously after a DynamoDB recovery. The scenario seemed unlikely enough to skip in testing. Building the test suite that exercises recovery workflows at production scale is the investment that pays off only in disasters — but those are exactly the moments when it matters most.
Control-plane dependencies must be evaluated independently for each region. These are the hidden dependencies that applications have on cloud provider management systems (authentication services, metadata stores, quota management), and they can create cross-region failure modes even when application code is deployed in multiple regions. Ring cameras deployed globally still authenticated against US-EAST-1 IAM. True regional independence requires independently redundant control planes, not just independently deployed application code.
Sometimes, the recovery automation has to stop before recovery can start. Build recovery playbooks to include the question: "Is any automated system currently making this worse?" Automation that treats 'DNS is inconsistent during manual recovery' the same way as 'service is down' will trigger failovers that create new inconsistencies. Automation must be able to distinguish between these states, and humans must be empowered to pause it when it cannot.
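As a sketch of what that distinction might look like in a failover controller: the `Action` values, the `decide` function and its three inputs are assumptions for illustration, not a description of AWS's automation. The controller fails over only when the health signal is bad, DNS is known to be consistent, and no operator has paused it.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    FAIL_OVER = "fail_over"
    HOLD_AND_PAGE = "hold_and_page"


def decide(health_check_failing: bool, dns_consistent: bool,
           paused_by_operator: bool) -> Action:
    """Choose what the automation does for one evaluation cycle."""
    if not health_check_failing:
        return Action.NONE
    if paused_by_operator or not dns_consistent:
        # The failure may be an artefact of recovery; escalate to a human instead.
        return Action.HOLD_AND_PAGE
    return Action.FAIL_OVER


print(decide(health_check_failing=True, dns_consistent=False, paused_by_operator=False))
```

The explicit `paused_by_operator` input is the design choice that matters: it gives responders a supported way to stop the automation instead of fighting it.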
Engineering Glossary
Congestive collapse — a failure mode where a system attempting to recover from backlog overwhelms its dependencies, slowing processing and lengthening the queue, creating a self-sustaining degraded state. EC2's DWFM entered congestive collapse when DynamoDB recovered and the accumulated lease backlog overwhelmed the now-restored database.
Control-plane failure — a class of failure where the management and coordination layer of a system fails, rather than the data-serving layer. Uniquely damaging because it removes the ability to manage everything else: EC2 can't track instances, IAM can't validate credentials, Lambda can't execute. Control-plane failures cascade differently from data-plane failures.
DNS Enactor — one of the worker processes in AWS's DynamoDB DNS management system that picks up DNS plans and applies them to Route53. Multiple Enactors run in parallel across Availability Zones for redundancy. The race condition that caused the October 2025 outage occurred between two Enactors picking up different-generation plans simultaneously.
DNS Planner — the planning component in AWS's DynamoDB DNS management system that monitors load balancer health and creates DNS plans specifying which load balancers should receive traffic. Plans are then consumed by DNS Enactors.
Droplet Workflow Manager (DWFM) — EC2's system responsible for managing EC2 instance lifecycle events, including lease renewals. When DynamoDB became unavailable, DWFM accumulated a backlog of expired lease management tasks. When DynamoDB recovered, the simultaneous burst of backlog processing triggered congestive collapse.
TOCTOU (Time-of-Check to Time-of-Use) — a race condition where the condition being checked changes between when it is checked and when it is acted upon, causing the action to operate on incorrect assumptions. Enactor A checked its plan's staleness, found it valid, then applied the plan — but by the time it applied, the world had moved on and the check was stale.
Thundering herd / herd effect — a distributed systems failure mode where many clients simultaneously attempt to reconnect to a shared resource, overwhelming it. Appears in the October 2025 outage as the DWFM congestive collapse. The standard solution is randomised exponential backoff.
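For reference, randomised exponential backoff can be sketched in a few lines. This uses the common "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function name and the default `base` and `cap` values are assumptions for illustration.

```python
import random
from typing import List, Optional


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one randomised delay (in seconds) per retry attempt."""
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)   # exponential growth, capped
        delays.append(rng.uniform(0.0, ceiling))  # jitter spreads clients apart
    return delays


for attempt, delay in enumerate(backoff_delays(5)):
    print(f"retry {attempt}: wait {delay:.3f}s")
```

Because every client draws its own random delay, retries that would otherwise arrive in synchronised waves are spread across the interval.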
This case is a plain-English retelling of publicly available engineering material.
TechLogStack — built at scale, broken in public, rebuilt by engineers.