Bonkur Harshith Reddy

Anatomy of a Cloud Collapse: A Technical Deep-Dive on the AWS Outage of October 2025

TL;DR: The 15-Hour Outage

On October 20, 2025, AWS’s US-EAST-1 (Northern Virginia) region experienced a 15-hour outage triggered by a rare race condition in DynamoDB’s DNS automation system. This caused DynamoDB (a NoSQL database used across AWS control planes) to become unreachable.

Because DynamoDB powers internal services like EC2, IAM, STS, Lambda, and Redshift, over 140 AWS services were eventually affected.

Independent measurements indicated that roughly 20 to 30 percent of internet-facing services experienced disruptions, meaning nearly one-third of the internet was affected in some way.


AWS Infrastructure Context

AWS organizes compute into:

  • Regions (geographical clusters)
  • Availability Zones (AZs) (isolated data centers within a region)
  • Control planes (authentication, orchestration, routing)
  • Data planes (actual compute, storage, execution)

This outage was a regional control-plane failure, which is worse than a simple service crash because many systems depended on DynamoDB for metadata and operations.


After reading this article, you will understand:

  • How the DynamoDB DNS race condition happened
  • Why a 2.5-hour bug turned into a 15-hour outage
  • How metastable failure overwhelmed EC2
  • How the failure cascaded across the internet
  • How to architect systems to avoid such collapses

Part 1: The Root Cause (The “How” and “Why”)

DynamoDB DNS Automation Internals

DynamoDB uses a two-part subsystem to maintain consistent DNS entries:

DNS Planner

Generates routing configuration sets called plans that describe:

  • Backend server lists
  • Health and routing weights
  • Failover settings
  • DNS TTL values

DNS Enactors

Distributed workers that read these plans and apply them to Route 53.

They operate independently across Availability Zones for fault tolerance.


What Went Wrong

On October 20:

  1. One Enactor stalled while processing Plan-100.
  2. Other Enactors applied Plan-101 and Plan-102 successfully.
  3. A cleanup job deleted old plans, including Plan-100.
  4. Hours later, the slow Enactor resumed and applied Plan-100.
  5. Because the plan no longer existed, it submitted an empty DNS update.

The endpoint:

dynamodb.us-east-1.amazonaws.com

now pointed to no IP addresses.

DynamoDB continued running internally, but DNS made it unreachable.
This was the spark that triggered the larger cascade.
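To make the race concrete, here is a minimal sketch of this class of bug (hypothetical names and values; the real Planner/Enactor code is internal to AWS): an enactor that applies whatever plan it holds, without checking that the plan still exists and is newer than what is already live, can overwrite fresh DNS records with an empty set.

```python
# Hypothetical sketch of the stale-plan race; not AWS's actual Planner/Enactor code.

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

dns_records = {ENDPOINT: ["10.0.0.1", "10.0.0.2"]}  # current DNS state
plan_store = {}                                     # plan version -> backend IPs


def enactor_apply(version: int) -> None:
    """Apply a plan to DNS.

    The bug: the enactor never checks that the plan still exists or is newer
    than the state already applied, so a deleted plan resolves to an empty
    IP list and wipes the record.
    """
    dns_records[ENDPOINT] = plan_store.get(version, [])


# The Planner produces plans 100-102; one enactor stalls while holding plan 100.
for version in (100, 101, 102):
    plan_store[version] = [f"10.0.{version % 100}.1", f"10.0.{version % 100}.2"]

enactor_apply(101)
enactor_apply(102)          # healthy: the newest plan is live

# Cleanup removes superseded plans, including the one the stalled enactor holds.
del plan_store[100], plan_store[101]

# Hours later the stalled enactor resumes and applies the now-deleted plan.
enactor_apply(100)
print(dns_records)          # {'dynamodb.us-east-1.amazonaws.com': []}
```

A version or existence check before the final write would have turned this into a harmless no-op instead of an empty DNS record.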


DNS Race Condition Diagram

Explanation: Shows how a delayed Enactor reapplied outdated state after deletion, erasing DynamoDB’s DNS entry.


Part 2: The Cascade (How a 2.5-Hour Bug Became a 15-Hour Outage)

AWS fixed DNS in ~2.5 hours, but the region did not recover because it entered a metastable failure state.

A metastable system is “alive but stuck” because:

  • backlog > processing capacity
  • retry storms amplify load
  • recovery cannot progress
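To see why "alive but stuck" is a genuinely stable state, consider a toy fixed-point model (illustrative numbers, not AWS's real figures): organic demand fits within capacity, but naive retries multiply the offered load, so once the system tips into overload it stays overloaded.

```python
# Illustrative fixed-point model of metastable overload (made-up numbers).

CAPACITY = 100_000          # requests/sec the service can complete
ORGANIC_LOAD = 80_000       # requests/sec clients genuinely need (< capacity!)
RETRIES_PER_FAILURE = 3     # naive clients retry each failed call 3 times


def steady_state_failure_rate(initial_failure_rate: float) -> float:
    """Iterate offered load -> failure rate until it settles at a fixed point."""
    f = initial_failure_rate
    for _ in range(100):
        offered = ORGANIC_LOAD * (1 + RETRIES_PER_FAILURE * f)
        f = max(0.0, 1 - CAPACITY / offered)
    return f


print(steady_state_failure_rate(0.0))   # 0.0  -> the healthy state is stable
print(steady_state_failure_rate(1.0))   # ~0.5 -> after a total outage, retries
                                        #        keep the system half-broken even
                                        #        though demand is below capacity
```

Both the healthy state and the overloaded state are stable, which is exactly what "metastable" means: removing the original trigger is not enough to recover.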

Step-by-Step Breakdown

1. EC2’s Droplet Workflow Manager (DWFM) Failed

DWFM stores host leases and lifecycle metadata in DynamoDB.

When DynamoDB became unreachable:

  • Lease renewals failed
  • Autoscaling operations stalled
  • Millions of internal control-plane writes backed up

2. Synchronized Retry Storm

Once DNS was restored:

  • EC2 hosts
  • AWS internal services
  • Customer workloads

all retried at the same time.

This thundering herd instantly saturated DynamoDB and EC2.

3. Congestive Collapse

Symptoms:

  • 100 percent CPU
  • Zero progress
  • Endless retries
  • Growing queues
  • No way to drain backlog sequentially

4. Manual Recovery

AWS engineers had to:

  • Implement global throttling
  • Purge corrupted internal queues
  • Restart EC2 control-plane nodes
  • Gradually rebuild DynamoDB state
  • Slowly warm caches

Most of the 15-hour outage was spent on this recovery work, not on fixing the root cause.
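Global throttling works because it caps admitted work below capacity so the backlog can drain instead of growing. A minimal token-bucket admission sketch (hypothetical code, not AWS's internal tooling):

```python
# Minimal token-bucket admission control: admit work at a fixed rate so the
# backlog drains instead of growing (a sketch, not AWS's internal throttling).
import time


class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens according to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False        # shed load instead of queueing it


bucket = TokenBucket(rate_per_sec=500, burst=50)   # admit at most ~500 req/s


def handle(request: str) -> str:
    if not bucket.allow():
        return "503 Slow Down"       # fail fast so the backlog can drain
    return f"200 OK: processed {request}"
```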


Metastable Failure Loop Diagram

Explanation: Shows how retries overloaded the control plane, preventing state from stabilizing even after DynamoDB’s DNS was fixed.


Part 3: The Blast Radius (Who Was Affected)

Internal AWS Failures

  • DynamoDB: DNS unreachable
  • EC2: Lifecycle and autoscaling halted
  • IAM / STS: Auth failures cascaded to all clients
  • Lambda: Triggers, scaling, and invocations failed
  • Redshift: Control-plane operations stalled
  • NLB: Health checks degraded
  • AWS Support Console: Partially offline

External Impact (2,000+ Companies)

More than 8 million user-facing errors occurred.

| Category | Examples | Impact |
| --- | --- | --- |
| Social / Messaging | Snapchat, Signal, Discord | Login failures, message delays |
| Gaming / Media | Roblox, Fortnite, Disney+ | Playback and matchmaking failures |
| Productivity | Canva, Duolingo, Atlassian | API failures, degraded workflows |
| Finance | Venmo, Coinbase, banks | Payments stuck, verification delays |
| IoT | Alexa, Ring | Device control and telemetry failures |

US-EAST-1’s failure rippled across global internet infrastructure.


Cascade Dependency Tree Diagram

Explanation: Visualizes how DynamoDB sits at the foundation of multiple AWS control planes. Once its DNS failed, the outage propagated upward through EC2, IAM, Lambda, and into customer workloads.


Part 4: How to Architect for Resilience Next Time

These lessons apply to any large distributed system.


1. Reduce Regional Blast Radius

Use:

  • Multi-region architectures
  • DynamoDB Global Tables
  • Route 53 failover
  • AWS Global Accelerator

Critical workloads must not rely solely on US-EAST-1.
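As one concrete pattern, Route 53 failover routing can shift traffic to a standby region when a health check on the primary fails. A hedged boto3 sketch (the hosted zone ID, IPs, domain, and health-check ID below are placeholders):

```python
# Sketch: Route 53 failover routing between two regions (placeholder IDs/IPs).
import boto3

route53 = boto3.client("route53")

ZONE_ID = "Z0000000EXAMPLE"                                     # hypothetical
PRIMARY_HEALTH_CHECK = "11111111-2222-3333-4444-555555555555"   # hypothetical


def upsert_failover_record(set_id: str, failover_role: str, ip: str,
                           health_check_id: str | None = None) -> None:
    record = {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": failover_role,          # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    route53.change_resource_record_sets(
        HostedZoneId=ZONE_ID,
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )


# Traffic goes to us-east-1 while its health check passes; Route 53 fails over
# to the us-west-2 record automatically when it does not.
upsert_failover_record("use1", "PRIMARY", "203.0.113.10", PRIMARY_HEALTH_CHECK)
upsert_failover_record("usw2", "SECONDARY", "203.0.113.20")
```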


2. Prevent Thundering Herds

Implement disciplined retry strategies:

  • Exponential backoff
  • Full jitter
  • Retry budgets
  • Max retry caps

Retries should help recovery, not destroy it.
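A minimal sketch of exponential backoff with full jitter and a hard retry cap (`call_dependency` is a placeholder for whatever SDK call you are protecting):

```python
# Exponential backoff with full jitter and a hard retry cap (sketch).
import random
import time


def call_with_backoff(call_dependency, max_attempts: int = 5,
                      base_delay: float = 0.2, max_delay: float = 20.0):
    for attempt in range(max_attempts):
        try:
            return call_dependency()
        except Exception:
            if attempt == max_attempts - 1:
                raise                       # retry budget exhausted: give up
            # Full jitter: sleep a random time in [0, min(cap, base * 2^attempt)].
            # Randomizing the delay desynchronizes clients and prevents the
            # thundering herd that amplified the October 2025 outage.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

In practice you would catch only retryable errors (throttling, timeouts) and track a per-service retry budget so a prolonged outage cannot multiply load indefinitely.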


3. Use Circuit Breakers

Circuit breakers:

  • Detect repeated failures
  • Stop calling the failing dependency (the circuit "opens")
  • Fail fast instead of queueing more work
  • Let a few probe requests through (half-open) before fully closing again

This prevents your service from participating in a cascading overload.
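A compact circuit-breaker sketch (thresholds and timeouts are illustrative):

```python
# Minimal circuit breaker: CLOSED -> OPEN on repeated failures, then a
# HALF_OPEN probe before closing again (thresholds are illustrative).
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = 0.0
        self.state = "CLOSED"

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")   # no call made
            self.state = "HALF_OPEN"            # allow a single probe request
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "CLOSED"                   # dependency healthy again
        return result
```

Wrap each external dependency in its own breaker so a failing DynamoDB call cannot exhaust the threads or connections needed by healthy code paths.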


4. Test Disaster Recovery with Chaos Engineering

Simulate:

  • Regional DynamoDB outages
  • IAM / STS failures
  • EC2 API throttling
  • Partial DNS failures
  • Cross-region failover

A DR plan is only real once tested.
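One lightweight way to rehearse a regional DynamoDB outage in a test environment is to point the client at an unreachable endpoint and assert that your service degrades gracefully instead of hanging. A sketch, assuming a hypothetical `fetch_profile` function with a cached fallback:

```python
# Sketch of a chaos-style test: simulate "DynamoDB unreachable" by pointing
# boto3 at a black-hole endpoint, then assert the service degrades gracefully.
# `fetch_profile` and `myapp.profiles` are hypothetical application code.
import boto3
from botocore.config import Config


def make_broken_dynamodb_client():
    # 198.51.100.1 is a TEST-NET address that will never answer; short timeouts
    # keep the simulated outage from turning into a hung test suite.
    return boto3.client(
        "dynamodb",
        region_name="us-east-1",
        endpoint_url="https://198.51.100.1",
        config=Config(connect_timeout=1, read_timeout=1,
                      retries={"max_attempts": 1}),
    )


def test_profile_service_survives_dynamodb_outage():
    from myapp.profiles import fetch_profile          # hypothetical module
    broken_client = make_broken_dynamodb_client()
    # The service should fall back to its cache, not raise or hang.
    profile = fetch_profile("user-123", dynamodb=broken_client)
    assert profile.source == "cache"
```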


Closing Thoughts

The October 2025 AWS outage was a reminder that:

  • A small bug can ripple across global infrastructure
  • DNS misconfigurations can disable entire services
  • Control-plane failures are more destructive than data-plane failures
  • Regional dependence is a systemic risk

Cloud resilience is not automatic.
It must be intentionally engineered.

Your architecture must assume US-EAST-1 can fail.
Because one day, it will.


References and Further Reading

AWS Official

AWS Postmortem

Independent Analysis

