Bonkur Harshith Reddy

Anatomy of a Cloud Collapse: A Technical Deep-Dive on the AWS Outage of October 2025

TL;DR: The 15-Hour Outage

On October 20, 2025, AWS’s US-EAST-1 (Northern Virginia) region experienced a 15-hour outage triggered by a rare race condition in DynamoDB’s DNS automation system. This caused DynamoDB (a NoSQL database used across AWS control planes) to become unreachable.

Because DynamoDB powers internal services like EC2, IAM, STS, Lambda, and Redshift, over 140 AWS services were eventually affected.

Independent measurements indicated that roughly 20 to 30 percent of internet-facing services experienced disruptions, meaning nearly one-third of the internet was affected in some way.


AWS Infrastructure Context

AWS organizes compute into:

  • Regions (geographical clusters)
  • Availability Zones (AZs) (isolated data centers within a region)
  • Control planes (authentication, orchestration, routing)
  • Data planes (actual compute, storage, execution)

This outage was a regional control-plane failure, which is worse than a simple service crash because many systems depended on DynamoDB for metadata and operations.


After reading this article, you will understand:

  • How the DynamoDB DNS race condition happened
  • Why a 2.5-hour bug turned into a 15-hour outage
  • How metastable failure overwhelmed EC2
  • How the failure cascaded across the internet
  • How to architect systems to avoid such collapses

Part 1: The Root Cause (The “How” and “Why”)

DynamoDB DNS Automation Internals

DynamoDB uses a two-part subsystem to maintain consistent DNS entries:

DNS Planner

Generates routing configuration sets called plans that describe:

  • Backend server lists
  • Health and routing weights
  • Failover settings
  • DNS TTL values

DNS Enactors

Distributed workers that read these plans and apply them to Route 53.

They operate independently across Availability Zones for fault tolerance.


What Went Wrong

On October 20:

  1. One Enactor stalled while processing Plan-100.
  2. Other Enactors applied Plan-101 and Plan-102 successfully.
  3. A cleanup job deleted old plans, including Plan-100.
  4. Hours later, the slow Enactor resumed and applied Plan-100.
  5. Because the plan no longer existed, it submitted an empty DNS update.

The endpoint:

dynamodb.us-east-1.amazonaws.com

now pointed to no IP addresses.

DynamoDB continued running internally, but DNS made it unreachable.
This was the spark that triggered the larger cascade.
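To make the race concrete, here is a minimal sketch of this class of bug (hypothetical names and values; the real Planner/Enactor code is internal to AWS): an enactor that applies whatever plan it holds, without checking that the plan still exists and is newer than what is already live, can overwrite fresh DNS records with an empty set.

```python
# Hypothetical sketch of the stale-plan race; not AWS's actual Planner/Enactor code.

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

dns_records = {ENDPOINT: ["10.0.0.1", "10.0.0.2"]}  # current DNS state
plan_store = {}                                     # plan version -> backend IPs


def enactor_apply(version: int) -> None:
    """Apply a plan to DNS.

    The bug: the enactor never checks that the plan still exists or is newer
    than the state already applied, so a deleted plan resolves to an empty
    IP list and wipes the record.
    """
    dns_records[ENDPOINT] = plan_store.get(version, [])


# The Planner produces plans 100-102; one enactor stalls while holding plan 100.
for version in (100, 101, 102):
    plan_store[version] = [f"10.0.{version % 100}.1", f"10.0.{version % 100}.2"]

enactor_apply(101)
enactor_apply(102)          # healthy: the newest plan is live

# Cleanup removes superseded plans, including the one the stalled enactor holds.
del plan_store[100], plan_store[101]

# Hours later the stalled enactor resumes and applies the now-deleted plan.
enactor_apply(100)
print(dns_records)          # {'dynamodb.us-east-1.amazonaws.com': []}
```

A version or existence check before the final write would have turned this into a harmless no-op instead of an empty DNS record.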


DNS Race Condition Diagram

Explanation: Shows how a delayed Enactor reapplied outdated state after deletion, erasing DynamoDB’s DNS entry.


Part 2: The Cascade (How a 2.5-Hour Bug Became a 15-Hour Outage)

AWS fixed DNS in ~2.5 hours, but the region did not recover because it entered a metastable failure state.

A metastable system is “alive but stuck” because:

  • backlog > processing capacity
  • retry storms amplify load
  • recovery cannot progress
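To see why "alive but stuck" is a genuinely stable state, consider a toy fixed-point model (illustrative numbers, not AWS's real figures): organic demand fits within capacity, but naive retries multiply the offered load, so once the system tips into overload it stays overloaded.

```python
# Illustrative fixed-point model of metastable overload (made-up numbers).

CAPACITY = 100_000          # requests/sec the service can complete
ORGANIC_LOAD = 80_000       # requests/sec clients genuinely need (< capacity!)
RETRIES_PER_FAILURE = 3     # naive clients retry each failed call 3 times


def steady_state_failure_rate(initial_failure_rate: float) -> float:
    """Iterate offered load -> failure rate until it settles at a fixed point."""
    f = initial_failure_rate
    for _ in range(100):
        offered = ORGANIC_LOAD * (1 + RETRIES_PER_FAILURE * f)
        f = max(0.0, 1 - CAPACITY / offered)
    return f


print(steady_state_failure_rate(0.0))   # 0.0  -> the healthy state is stable
print(steady_state_failure_rate(1.0))   # ~0.5 -> after a total outage, retries
                                        #        keep the system half-broken even
                                        #        though demand is below capacity
```

Both the healthy state and the overloaded state are stable, which is exactly what "metastable" means: removing the original trigger is not enough to recover.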

Step-by-Step Breakdown

1. EC2’s Droplet Workflow Manager (DWFM) Failed

DWFM stores host leases and lifecycle metadata in DynamoDB.

When DynamoDB became unreachable:

  • Lease renewals failed
  • Autoscaling operations stalled
  • Millions of internal control-plane writes backed up

2. Synchronized Retry Storm

Once DNS was restored:

  • EC2 hosts
  • AWS internal services
  • Customer workloads

all retried at the same time.

This thundering herd instantly saturated DynamoDB and EC2.

3. Congestive Collapse

Symptoms:

  • 100 percent CPU
  • Zero progress
  • Endless retries
  • Growing queues
  • No way to drain backlog sequentially

4. Manual Recovery

AWS engineers had to:

  • Implement global throttling
  • Purge corrupted internal queues
  • Restart EC2 control-plane nodes
  • Gradually rebuild DynamoDB state
  • Slowly warm caches

Most of the 15-hour outage was spent on this recovery work, not on fixing the root cause.
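Global throttling works because it caps admitted work below capacity so the backlog can drain instead of growing. A minimal token-bucket admission sketch (hypothetical code, not AWS's internal tooling):

```python
# Minimal token-bucket admission control: admit work at a fixed rate so the
# backlog drains instead of growing (a sketch, not AWS's internal throttling).
import time


class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens according to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False        # shed load instead of queueing it


bucket = TokenBucket(rate_per_sec=500, burst=50)   # admit at most ~500 req/s


def handle(request: str) -> str:
    if not bucket.allow():
        return "503 Slow Down"       # fail fast so the backlog can drain
    return f"200 OK: processed {request}"
```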


Metastable Failure Loop Diagram

Explanation: Shows how retries overloaded the control plane, preventing state from stabilizing even after DynamoDB’s DNS was fixed.


Part 3: The Blast Radius (Who Was Affected)

Internal AWS Failures

  • DynamoDB: DNS unreachable
  • EC2: Lifecycle and autoscaling halted
  • IAM / STS: Auth failures cascaded to all clients
  • Lambda: Triggers, scaling, and invocations failed
  • Redshift: Control-plane operations stalled
  • NLB: Health checks degraded
  • AWS Support Console: Partially offline

External Impact (2,000+ Companies)

More than 8 million user-facing errors occurred.

| Category | Examples | Impact |
| --- | --- | --- |
| Social / Messaging | Snapchat, Signal, Discord | Login failures, message delays |
| Gaming / Media | Roblox, Fortnite, Disney+ | Playback and matchmaking failures |
| Productivity | Canva, Duolingo, Atlassian | API failures, degraded workflows |
| Finance | Venmo, Coinbase, banks | Payments stuck, verification delays |
| IoT | Alexa, Ring | Device control and telemetry failures |

US-EAST-1’s failure rippled across global internet infrastructure.


Cascade Dependency Tree Diagram

Explanation: Visualizes how DynamoDB sits at the foundation of multiple AWS control planes. Once its DNS failed, the outage propagated upward through EC2, IAM, Lambda, and into customer workloads.


Part 4: How to Architect for Resilience Next Time

These lessons apply to any large distributed system.


1. Reduce Regional Blast Radius

Use:

  • Multi-region architectures
  • DynamoDB Global Tables
  • Route 53 failover
  • AWS Global Accelerator

Critical workloads must not rely solely on US-EAST-1.
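As one concrete pattern, Route 53 failover routing can shift traffic to a standby region when a health check on the primary fails. A hedged boto3 sketch (the hosted zone ID, IPs, domain, and health-check ID below are placeholders):

```python
# Sketch: Route 53 failover routing between two regions (placeholder IDs/IPs).
import boto3

route53 = boto3.client("route53")

ZONE_ID = "Z0000000EXAMPLE"                                     # hypothetical
PRIMARY_HEALTH_CHECK = "11111111-2222-3333-4444-555555555555"   # hypothetical


def upsert_failover_record(set_id: str, failover_role: str, ip: str,
                           health_check_id: str | None = None) -> None:
    record = {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": failover_role,          # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    route53.change_resource_record_sets(
        HostedZoneId=ZONE_ID,
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )


# Traffic goes to us-east-1 while its health check passes; Route 53 fails over
# to the us-west-2 record automatically when it does not.
upsert_failover_record("use1", "PRIMARY", "203.0.113.10", PRIMARY_HEALTH_CHECK)
upsert_failover_record("usw2", "SECONDARY", "203.0.113.20")
```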


2. Prevent Thundering Herds

Implement disciplined retry strategies:

  • Exponential backoff
  • Full jitter
  • Retry budgets
  • Max retry caps

Retries should help recovery, not destroy it.
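A minimal sketch of exponential backoff with full jitter and a hard retry cap (`call_dependency` is a placeholder for whatever SDK call you are protecting):

```python
# Exponential backoff with full jitter and a hard retry cap (sketch).
import random
import time


def call_with_backoff(call_dependency, max_attempts: int = 5,
                      base_delay: float = 0.2, max_delay: float = 20.0):
    for attempt in range(max_attempts):
        try:
            return call_dependency()
        except Exception:
            if attempt == max_attempts - 1:
                raise                       # retry budget exhausted: give up
            # Full jitter: sleep a random time in [0, min(cap, base * 2^attempt)].
            # Randomizing the delay desynchronizes clients and prevents the
            # thundering herd that amplified the October 2025 outage.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

In practice you would catch only retryable errors (throttling, timeouts) and track a per-service retry budget so a prolonged outage cannot multiply load indefinitely.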


3. Use Circuit Breakers

Circuit breakers:

  • Detect repeated failures
  • Stop calling the failing dependency (the circuit "opens")
  • Fail fast instead of queueing more work
  • Let a few probe requests through (half-open) before fully closing again

This prevents your service from participating in a cascading overload.
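A compact circuit-breaker sketch (thresholds and timeouts are illustrative):

```python
# Minimal circuit breaker: CLOSED -> OPEN on repeated failures, then a
# HALF_OPEN probe before closing again (thresholds are illustrative).
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = 0.0
        self.state = "CLOSED"

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")   # no call made
            self.state = "HALF_OPEN"            # allow a single probe request
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "CLOSED"                   # dependency healthy again
        return result
```

Wrap each external dependency in its own breaker so a failing DynamoDB call cannot exhaust the threads or connections needed by healthy code paths.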


4. Test Disaster Recovery with Chaos Engineering

Simulate:

  • Regional DynamoDB outages
  • IAM / STS failures
  • EC2 API throttling
  • Partial DNS failures
  • Cross-region failover

A DR plan is only real once tested.
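One lightweight way to rehearse a regional DynamoDB outage in a test environment is to point the client at an unreachable endpoint and assert that your service degrades gracefully instead of hanging. A sketch, assuming a hypothetical `fetch_profile` function with a cached fallback:

```python
# Sketch of a chaos-style test: simulate "DynamoDB unreachable" by pointing
# boto3 at a black-hole endpoint, then assert the service degrades gracefully.
# `fetch_profile` and `myapp.profiles` are hypothetical application code.
import boto3
from botocore.config import Config


def make_broken_dynamodb_client():
    # 198.51.100.1 is a TEST-NET address that will never answer; short timeouts
    # keep the simulated outage from turning into a hung test suite.
    return boto3.client(
        "dynamodb",
        region_name="us-east-1",
        endpoint_url="https://198.51.100.1",
        config=Config(connect_timeout=1, read_timeout=1,
                      retries={"max_attempts": 1}),
    )


def test_profile_service_survives_dynamodb_outage():
    from myapp.profiles import fetch_profile          # hypothetical module
    broken_client = make_broken_dynamodb_client()
    # The service should fall back to its cache, not raise or hang.
    profile = fetch_profile("user-123", dynamodb=broken_client)
    assert profile.source == "cache"
```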


Closing Thoughts

The October 2025 AWS outage was a reminder that:

  • A small bug can ripple across global infrastructure
  • DNS misconfigurations can disable entire services
  • Control-plane failures are more destructive than data-plane failures
  • Regional dependence is a systemic risk

Cloud resilience is not automatic.
It must be intentionally engineered.

Your architecture must assume US-EAST-1 can fail.
Because one day, it will.


References and Further Reading

AWS Official

AWS Postmortem

Independent Analysis

