TL;DR: The 15-Hour Outage
On October 20, 2025, AWS’s US-EAST-1 (Northern Virginia) region experienced a 15-hour outage triggered by a rare race condition in DynamoDB’s DNS automation system. This caused DynamoDB (a NoSQL database used across AWS control planes) to become unreachable.
Because internal AWS services such as EC2, IAM, STS, Lambda, and Redshift depend on DynamoDB, over 140 AWS services were eventually affected.
Independent measurements showed that 20 to 30 percent of internet-facing services experienced disruptions, roughly a quarter to a third of the internet.
AWS Infrastructure Context
AWS organizes compute into:
- Regions (geographical clusters)
- Availability Zones (AZs) (isolated data centers within a region)
- Control planes (authentication, orchestration, routing)
- Data planes (actual compute, storage, execution)
This outage was a regional control-plane failure, which is worse than a simple service crash because many systems depended on DynamoDB for metadata and operations.
After reading this article, you will understand:
- How the DynamoDB DNS race condition happened
- Why a bug fixed in about 2.5 hours still produced a 15-hour outage
- How metastable failure overwhelmed EC2
- How the failure cascaded across the internet
- How to architect systems to avoid such collapses
Part 1: The Root Cause (The “How” and “Why”)
DynamoDB DNS Automation Internals
DynamoDB uses a two-part subsystem to maintain consistent DNS entries:
DNS Planner
Generates routing configuration sets called plans that describe:
- Backend server lists
- Health and routing weights
- Failover settings
- DNS TTL values
DNS Enactors
Distributed workers that read these plans and apply them to Route 53.
They operate independently across Availability Zones for fault tolerance.
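AWS has not published the internal implementation, but the division of labor is easy to model. Below is a minimal Python sketch of a planner/enactor pair under that assumption; `Plan`, `Enactor`, and the in-memory stand-ins for the plan store and Route 53 are all illustrative names, not real AWS APIs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    """Hypothetical versioned routing plan, mirroring the article's description."""
    version: int          # monotonically increasing plan number
    endpoints: dict       # hostname -> list of backend IPs
    ttl_seconds: int = 60

class Enactor:
    """Illustrative worker: picks up the newest available plan and applies it to DNS."""
    def __init__(self, plan_store: dict, dns_records: dict):
        self.plan_store = plan_store   # version -> Plan, written by the planner
        self.dns = dns_records         # hostname -> IPs, stands in for Route 53

    def run_once(self) -> None:
        if not self.plan_store:
            return
        newest = self.plan_store[max(self.plan_store)]
        for hostname, ips in newest.endpoints.items():
            self.dns[hostname] = ips   # push the record set out

# Several enactors applying the same newest plan is harmless; the danger is a
# *stale* plan applied late, which the next sketch walks through.
store = {1: Plan(1, {"dynamodb.us-east-1.amazonaws.com": ["10.0.0.1"]})}
dns: dict = {}
Enactor(store, dns).run_once()
print(dns)   # {'dynamodb.us-east-1.amazonaws.com': ['10.0.0.1']}
```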
What Went Wrong
On October 20:
- One Enactor stalled while processing Plan-100.
- Other Enactors applied Plan-101 and Plan-102 successfully.
- A cleanup job deleted old plans, including Plan-100.
- Hours later, the slow Enactor resumed and applied Plan-100.
- Because the plan no longer existed, it submitted an empty DNS update.
The endpoint `dynamodb.us-east-1.amazonaws.com` now pointed to no IP addresses.
DynamoDB continued running internally, but DNS made it unreachable.
This was the spark that triggered the larger cascade.
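The sequence above is the entire race. Here is a self-contained Python replay of it, using plain dicts as stand-ins for the plan store and Route 53, plus the kind of staleness guard that would have rejected the late update; every name and IP is invented for illustration.

```python
# Illustrative replay of the race the article describes. All values are made up.
plans = {                      # plan_id -> backend IPs for the endpoint
    100: ["10.0.0.1", "10.0.0.2"],
    101: ["10.0.0.3", "10.0.0.4"],
    102: ["10.0.0.5", "10.0.0.6"],
}
dns = {}                       # hostname -> (applied_plan_id, IPs)
HOST = "dynamodb.us-east-1.amazonaws.com"

def apply_plan_unsafe(plan_id: int) -> None:
    # The flawed behavior: apply whatever the plan says, even if it no longer exists.
    dns[HOST] = (plan_id, plans.get(plan_id, []))   # missing plan -> empty record set

def apply_plan_guarded(plan_id: int) -> None:
    # A guard of the kind that breaks the race: never apply a plan that has been
    # cleaned up or that is older than what is already live.
    applied_id, _ = dns.get(HOST, (-1, []))
    if plan_id not in plans or plan_id <= applied_id:
        return                                      # drop the stale update
    dns[HOST] = (plan_id, plans[plan_id])

# Healthy enactors move DNS forward to Plan-102.
apply_plan_unsafe(101)
apply_plan_unsafe(102)

# Cleanup deletes superseded plans, including Plan-100.
del plans[100]

# The stalled enactor finally wakes up and replays Plan-100.
apply_plan_unsafe(100)
print(dns[HOST])    # (100, []) -> the endpoint now resolves to nothing

# With the guard in place, the stale replay is rejected and Plan-102 stays live.
dns[HOST] = (102, plans[102])
apply_plan_guarded(100)
print(dns[HOST])    # (102, ['10.0.0.5', '10.0.0.6'])
```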
DNS Race Condition Diagram
Explanation: Shows how a delayed Enactor reapplied outdated state after deletion, erasing DynamoDB’s DNS entry.
Part 2: The Cascade (How a Bug Fixed in 2.5 Hours Became a 15-Hour Outage)
AWS fixed DNS in ~2.5 hours, but the region did not recover because it entered a metastable failure state.
A metastable system is “alive but stuck”, as the toy model after this list illustrates, because:
- backlog > processing capacity
- retry storms amplify load
- recovery cannot progress
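A toy queue model makes those three bullet points concrete; the capacity, arrival rate, timeout, and starting backlog below are all invented numbers.

```python
# Toy model of the stuck state: the service runs at full capacity, yet the
# queue keeps growing because timed-out callers retry.

CAPACITY = 1_000      # requests the dependency can serve per second
NEW_WORK = 800        # genuinely new requests per second (fine in steady state)
TIMEOUT = 2           # seconds a caller waits before retrying

queue = 50_000        # backlog left over from the outage window

for second in range(10):
    queue += NEW_WORK                     # fresh work keeps arriving
    served = min(queue, CAPACITY)
    queue -= served
    wait = queue / CAPACITY               # rough wait time for the newest request
    if wait > TIMEOUT:
        # Callers of the work just served had already timed out and resubmitted
        # it, so that capacity was wasted: the duplicates re-enter the queue.
        queue += served
    print(f"t={second}s  wait~{wait:6.1f}s  queue={queue:7.0f}")

# Start the same loop with queue = 0 and the wait never exceeds TIMEOUT, so the
# queue stays empty. Both the healthy state and the stuck state sustain
# themselves, which is exactly what "metastable" means.
```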
Step-by-Step Breakdown
1. EC2’s Droplet Workflow Manager Failed
The Droplet Workflow Manager (DWFM) stores host leases and lifecycle metadata in DynamoDB; a minimal sketch of this lease pattern follows the list below.
When DynamoDB became unreachable:
- Lease renewals failed
- Autoscaling operations stalled
- Millions of internal control-plane writes backed up
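AWS has not published DWFM's code, but the lease (heartbeat) pattern it relies on is generic. The sketch below shows, under that assumption, why a perfectly healthy host "dies" administratively when it cannot write its heartbeat; `LeaseTable` stands in for the DynamoDB table that holds host leases.

```python
import time

LEASE_TTL = 60.0          # seconds a host lease stays valid without renewal

class LeaseTable:
    """Stand-in for the DynamoDB table holding host leases (illustrative)."""
    def __init__(self):
        self.leases = {}          # host_id -> expiry timestamp
        self.reachable = True     # flip to False to simulate the DNS outage

    def renew(self, host_id: str) -> None:
        if not self.reachable:
            raise ConnectionError("dynamodb.us-east-1.amazonaws.com: no address")
        self.leases[host_id] = time.time() + LEASE_TTL

    def is_alive(self, host_id: str) -> bool:
        return self.leases.get(host_id, 0.0) > time.time()

table = LeaseTable()
table.renew("host-1")
table.reachable = False           # DynamoDB becomes unreachable via DNS
try:
    table.renew("host-1")
except ConnectionError:
    pass  # renewal fails; once LEASE_TTL passes, is_alive("host-1") turns False
          # even though the host never crashed, so launches and autoscaling
          # actions touching that host stall.
```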
2. Synchronized Retry Storm
Once DNS was restored:
- EC2 hosts
- AWS internal services
- Customer workloads
all retried at the same time.
This thundering herd instantly saturated DynamoDB and EC2.
3. Congestive Collapse
Symptoms:
- 100 percent CPU
- Zero progress
- Endless retries
- Growing queues
- No spare capacity left to drain the backlog
4. Manual Recovery
AWS engineers had to:
- Implement global throttling
- Purge corrupted internal queues
- Restart EC2 control-plane nodes
- Gradually rebuild DynamoDB state
- Slowly warm caches
Most of the 15 hours were spent on recovery, not on fixing the root cause.
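AWS's exact throttling mechanism is not public; a common way to implement that kind of global admission control is a token bucket whose admitted rate is raised step by step as queues drain. A minimal sketch with illustrative rates:

```python
import threading
import time

class TokenBucket:
    """Simple global throttle: admit at most `rate` requests per second so
    recovery traffic drains the backlog instead of re-saturating the service."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate              # tokens added per second
        self.capacity = burst         # maximum burst size
        self.tokens = burst
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False              # caller should shed or delay this request

# During recovery the admitted rate can be raised gradually as queues drain.
throttle = TokenBucket(rate=100.0, burst=20.0)
if throttle.try_acquire():
    pass   # forward the request to the still-fragile dependency
else:
    pass   # reject early (fail fast) rather than queueing yet more work
```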
Metastable Failure Loop Diagram
Explanation: Shows how retries overloaded the control plane, preventing state from stabilizing even after DynamoDB’s DNS was fixed.
Part 3: The Blast Radius (Who Was Affected)
Internal AWS Failures
- DynamoDB: DNS unreachable
- EC2: Lifecycle and autoscaling halted
- IAM / STS: Auth failures cascaded to all clients
- Lambda: Triggers, scaling, and invocations failed
- Redshift: Control-plane operations stalled
- NLB: Health checks degraded
- AWS Support Console: Partially offline
External Impact (2,000+ Companies)
More than 8 million user-facing errors occurred.
| Category | Examples | Impact |
|---|---|---|
| Social / Messaging | Snapchat, Signal, Discord | Login failures, message delays |
| Gaming / Media | Roblox, Fortnite, Disney+ | Playback and matchmaking failures |
| Productivity | Canva, Duolingo, Atlassian | API failures, degraded workflows |
| Finance | Venmo, Coinbase, Banks | Payments stuck, verification delays |
| IoT | Alexa, Ring | Device control and telemetry failures |
US-EAST-1’s failure rippled across global internet infrastructure.
Cascade Dependency Tree Diagram
Explanation: Visualizes how DynamoDB sits at the foundation of multiple AWS control planes. Once its DNS failed, the outage propagated upward through EC2, IAM, Lambda, and into customer workloads.
Part 4: How to Architect for Resilience Next Time
These lessons apply to any large distributed system.
1. Reduce Regional Blast Radius
Use:
- Multi-region architectures
- DynamoDB Global Tables
- Route 53 failover
- AWS Global Accelerator
Critical workloads must not rely solely on US-EAST-1.
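As one concrete pattern, reads can fall back to another region when the primary endpoint is unreachable, assuming the table is already replicated via DynamoDB Global Tables. A minimal boto3 sketch follows; the `orders` table, its `order_id` key, and the region list are assumptions, and write failover needs more care because Global Tables resolve conflicting writes last-writer-wins.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

# Hypothetical setup: "orders" is a Global Table replicated to both regions.
REGIONS = ["us-east-1", "us-west-2"]

def get_order(order_id: str) -> dict | None:
    """Try the primary region first, then fall back to the replica."""
    for region in REGIONS:
        client = boto3.client(
            "dynamodb",
            region_name=region,
            config=Config(connect_timeout=2, read_timeout=2,
                          retries={"max_attempts": 1}),
        )
        try:
            resp = client.get_item(
                TableName="orders",
                Key={"order_id": {"S": order_id}},
            )
            return resp.get("Item")
        except (BotoCoreError, ClientError):
            continue          # region unreachable (e.g. DNS gone): try the next one
    return None               # every region failed; degrade gracefully
```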
2. Prevent Thundering Herds
Implement disciplined retry strategies:
- Exponential backoff
- Full jitter
- Retry budgets
- Max retry caps
Retries should help recovery, not destroy it.
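Combined, those rules look roughly like the sketch below, which follows the full-jitter scheme described in the AWS Builders Library article listed in the references; the attempt cap and the retry budget numbers are arbitrary.

```python
import random
import time

MAX_ATTEMPTS = 4          # hard cap on attempts per request
BASE_DELAY = 0.2          # seconds
MAX_DELAY = 5.0           # ceiling for any single sleep
RETRY_BUDGET = 100        # retries this process may spend (would be replenished
                          # per time window in real code)

_budget = RETRY_BUDGET

def call_with_retries(operation):
    """Retry `operation` with capped exponential backoff and full jitter,
    but only while the process-wide retry budget lasts."""
    global _budget
    for attempt in range(MAX_ATTEMPTS):
        try:
            return operation()
        except Exception:
            # In real code, retry only errors known to be transient
            # (throttling, timeouts), not every exception.
            if attempt == MAX_ATTEMPTS - 1 or _budget <= 0:
                raise             # out of attempts or budget: fail fast
            _budget -= 1
            # Full jitter: sleep a random amount in [0, min(cap, base * 2**attempt)].
            time.sleep(random.uniform(0.0, min(MAX_DELAY, BASE_DELAY * 2 ** attempt)))
```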
3. Use Circuit Breakers
Circuit breakers:
- Detect repeated failures
- Stop calling the dependency
- Fail fast
- Reopen slowly
This prevents your service from participating in a cascading overload.
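A minimal circuit breaker fits in a few dozen lines; the thresholds below are illustrative, and production code would usually reach for a maintained library instead.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: fail fast after repeated errors, then probe."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout     # seconds to stay open before probing
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")  # don't pile on
            self.opened_at = None      # half-open: let one probe request through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0              # success closes the breaker again
        return result

# Usage: breaker = CircuitBreaker(); breaker.call(lambda: client.get_item(...))
```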
4. Test Disaster Recovery with Chaos Engineering
Simulate:
- Regional DynamoDB outages
- IAM / STS failures
- EC2 API throttling
- Partial DNS failures
- Cross-region failover
A DR plan is only real once it has been tested.
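Fault-injection tooling such as AWS Fault Injection Simulator (linked in the references) can inject real regional faults; even before that, the cheapest test is one that forces the dependency failure in code and asserts the degraded path still works. A pytest-style sketch in which `OrderService`, `DynamoDown`, and the fallback cache are all hypothetical:

```python
class DynamoDown(Exception):
    """Stands in for the connection errors seen when the endpoint has no DNS records."""

class OrderService:
    """Hypothetical service: reads orders from DynamoDB, falls back to a local cache."""
    def __init__(self, dynamo_get, cache):
        self.dynamo_get = dynamo_get
        self.cache = cache

    def get_order(self, order_id):
        try:
            return self.dynamo_get(order_id)
        except DynamoDown:
            return self.cache.get(order_id)   # degraded, but not a hard failure

def test_survives_regional_dynamodb_outage():
    # Chaos scenario: every DynamoDB call fails, exactly as on October 20.
    def broken_dynamo(_order_id):
        raise DynamoDown("dynamodb.us-east-1.amazonaws.com: name not found")

    service = OrderService(broken_dynamo, cache={"o-1": {"status": "shipped"}})
    assert service.get_order("o-1") == {"status": "shipped"}
```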
Closing Thoughts
The October 2025 AWS outage was a reminder that:
- A small bug can ripple across global infrastructure
- DNS misconfigurations can disable entire services
- Control-plane failures are more destructive than data-plane failures
- Regional dependence is a systemic risk
Cloud resilience is not automatic.
It must be intentionally engineered.
Your architecture must assume US-EAST-1 can fail.
Because one day, it will.
References and Further Reading
AWS Official
- AWS Global Infrastructure https://aws.amazon.com/about-aws/global-infrastructure/
- DynamoDB Global Tables https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GlobalTables.html
- AWS Fault Injection Simulator https://aws.amazon.com/fis/
- AWS GameDay https://aws.amazon.com/gameday/
- AWS Builders Library: Exponential Backoff and Jitter https://aws.amazon.com/builders-library/timeouts-retries-backoff/
AWS Postmortem
- Why DynamoDB Failed in October 2025 (AWS Builder’s Library) https://builder.aws.com/content/34TzjGmCIBLhnT1b5tn6bgttlI1/por-que-fallo-dynamodb-en-octubre-de-2025
Independent Analysis
- Wired: What the AWS Outage Reveals About the Internet https://www.wired.com/story/what-that-huge-aws-outage-reveals-about-the-internet/
- Cloudflare Radar: Outage Impact https://radar.cloudflare.com/
- ThousandEyes AWS Outage Breakdown https://www.thousandeyes.com/blog
- Reuters Report on AWS Outage https://www.reuters.com/
- The Guardian Coverage https://www.theguardian.com/
- Thundergolfer Deep Analysis https://thundergolfer.com/