"Anything that can go wrong will go wrong." — Murphy's Law
DynamoDB depends heavily on DNS. Instead of one static IP, AWS maintains hundreds of thousands of DNS records for:
- scaling,
- routing traffic across load balancers,
- handling IPv6/FIPS variants,
- removing failed capacity.
To manage this, DynamoDB uses two internal components:
- DNS Planner: continuously generates new “plans” (Plan #1200, #1300, #1400, #1500, and so on) describing which load balancers should serve the endpoint.
- DNS Enactors (3 copies, one per AZ): apply these plans to Route 53 using atomic transactions.
This design normally ensures high availability.
But it also created the perfect conditions for a rare race condition.
The Rare Race Condition (The Real Root Cause)
Here’s the exact sequence of events:
- Enactor A picked old Plan #1200 but got stuck retrying.
- Planner produced newer plans: #1300 → #1400 → #1500.
- Enactor B, running normally, applied new Plan #1500.
- Enactor A finally woke up and applied old Plan #1200, overwriting #1500.
- Enactor B’s cleanup then deleted all old plans — including Plan #1200, which was now the live plan.
- With the live plan gone, no plan existed for the endpoint at all.
With no plan applied:
- Route 53 had no IP addresses for dynamodb.us-east-1.amazonaws.com.
- The DynamoDB endpoint effectively disappeared.
- All AWS services depending on DynamoDB immediately failed.
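Here is a minimal, simplified Python model of the failure mode. The plan numbers, the in-memory “DNS table”, and the function names are illustrative sketches, not AWS's actual implementation:

```python
# Minimal, simplified model of the stale-plan race (illustrative, not AWS's real code).

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

# Plans known to the system: plan number -> the load balancer IPs it describes.
plan_store = {1200: ["10.0.0.1"], 1500: ["10.0.0.7", "10.0.0.8"]}

# The "DNS table": which plan is currently applied to the endpoint.
dns_table = {}

def apply_plan(enactor: str, plan_no: int) -> None:
    """Apply a plan WITHOUT re-checking whether a newer plan is already live."""
    dns_table[ENDPOINT] = plan_no
    print(f"{enactor} applied plan #{plan_no}")

def cleanup_old_plans(latest_applied: int) -> None:
    """Delete every plan older than the newest plan this enactor knows it applied."""
    for plan_no in list(plan_store):
        if plan_no < latest_applied:
            del plan_store[plan_no]
            print(f"cleanup removed plan #{plan_no}")

apply_plan("Enactor B", 1500)            # the newest plan goes live
apply_plan("Enactor A", 1200)            # the stalled enactor wakes up and overwrites it
cleanup_old_plans(latest_applied=1500)   # deletes #1200 -- which is now the live plan

active = dns_table[ENDPOINT]
print("IPs served for the endpoint:", plan_store.get(active, []))   # [] -> effectively no endpoint
```

A guard that re-checks plan freshness inside the same atomic update, or that refuses to delete a plan still referenced by the live record, would block both halves of this failure.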
This was the epicenter of the blast that followed, on the day of Diwali!
Chain Reaction/Cascading Failures
Let's discuss these failures one by one:
EC2 failures
EC2 failed because its control plane lost access to DynamoDB, which it relies on for critical internal state. This caused EC2 to temporarily “lose” capacity and become unable to launch new instances.
Below is the exact breakdown of how this happened.
1. DropletWorkflow Manager (DWFM)
DWFM manages all physical servers (called droplets) that run EC2 instances. It maintains:
- host state
- instance-to-host mapping
- lease/heartbeat for each physical server
- lifecycle operations (shutdown, reboot, etc.)
2. Network Manager
Responsible for:
- updating VPC routing
- propagating network configuration to new instances
- networking for ENIs, subnets, routes, and load balancers
Both systems store their operational metadata in DynamoDB.
When DynamoDB became unreachable, both broke immediately.
What happened when DynamoDB went down?
DWFM could not refresh leases for any droplet
Every few minutes, each physical EC2 host requires a renewed lease.
Since DWFM couldn’t read/write DynamoDB:
- leases began expiring across the entire region
- expired lease = host can’t be used for new instance launches
- EC2 effectively “lost” available capacity even though hardware was healthy
This is why EC2 API calls returned:
- “insufficient capacity”
- “request limit exceeded”
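To see why healthy hardware vanished from the pool, here is a rough sketch of the lease pattern. The class and field names are hypothetical, not DWFM's real interface:

```python
import time

LEASE_TTL = 0.2  # seconds; real systems use minutes -- shortened so this sketch runs instantly

class MetadataStoreUnavailable(Exception):
    """The backing store (DynamoDB, in DWFM's case) could not be reached."""

class DropletLease:
    """Hypothetical model of a per-host lease; not DWFM's real interface."""

    def __init__(self, host_id: str):
        self.host_id = host_id
        self.expires_at = time.time() + LEASE_TTL

    def renew(self, store_available: bool) -> None:
        if not store_available:
            # Renewal requires a read/write against the metadata store.
            raise MetadataStoreUnavailable(self.host_id)
        self.expires_at = time.time() + LEASE_TTL

    def usable_for_launches(self) -> bool:
        # An expired lease means the placement layer will not put new instances here,
        # even though the physical host is perfectly healthy.
        return time.time() < self.expires_at

leases = [DropletLease(f"host-{i}") for i in range(3)]

time.sleep(LEASE_TTL * 2)                    # DynamoDB is down for longer than the TTL...
for lease in leases:
    try:
        lease.renew(store_available=False)   # ...so every renewal attempt fails
    except MetadataStoreUnavailable:
        pass

print([lease.usable_for_launches() for lease in leases])   # [False, False, False]
```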
After DynamoDB recovered, DWFM tried to re-establish leases for thousands of hosts at once.
But:
- there were too many expired leases
- every attempt added more work
- retries caused queue buildup
- DWFM couldn’t finish lease recovery fast enough
This led to congestive collapse, where DWFM was stuck processing old work and couldn’t make forward progress.
So even though DynamoDB was fixed, EC2 still couldn’t launch new instances.
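Some back-of-the-envelope arithmetic shows why waiting was never going to work. All numbers below are made up; only the shape of the problem matters:

```python
# Back-of-the-envelope arithmetic for congestive collapse (made-up numbers).
backlog  = 100_000    # leases to re-establish the moment DynamoDB comes back
capacity = 500        # lease re-establishments the workflow system can complete per second
timeout  = 60         # seconds before a caller gives up and retries

drain_time       = backlog / capacity      # 200 s to clear the queue at best
servable_in_time = capacity * timeout      # only 30,000 requests get served before timing out

print(f"time to drain backlog: {drain_time:.0f}s (caller timeout: {timeout}s)")
print(f"requests that will time out and be retried: {backlog - servable_in_time:,}")

# 70,000 of the 100,000 requests time out before they are ever served, and each retry
# re-enters the queue. The system now burns capacity on work whose caller has already
# given up, the queue grows instead of shrinking, and forward progress stalls until
# inflow is throttled or the queues are cleared.
```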
Manual intervention was required
Since there was no pre-existing playbook for this scenario, AWS engineers had to:
- throttle EC2 API request rates
- manually restart DWFM hosts
- clear the internal queues
- slowly rebuild leases for the entire region
Why Did NLB Fail During the Outage?
Network Load Balancer (NLB) did not fail because its systems were broken.
It failed because EC2’s network propagation was delayed, causing NLB’s health checks to misinterpret healthy nodes as unhealthy.
This triggered a cascading failure inside the entire NLB fleet.
1. NLB depends on Network Manager for routing information
Whenever a new EC2 instance is launched:
- Network Manager must push ENI attachments
- update routes
- propagate VPC networking state
- notify load balancers that the instance is ready
But Network Manager was already delayed because DWFM entered congestive collapse after the DynamoDB outage.
This meant:
- New EC2 instances came up without network connectivity
- The instance existed, but had:
  - no routing
  - no connectivity
  - no health check path
From NLB’s point of view → the instance looked dead.
2. NLB’s health check subsystem began failing
NLB performs constant health checks on all backend targets.
Because network state propagation was delayed:
- new instances failed health checks
- NLB nodes themselves sometimes couldn't communicate internally
- health check results began oscillating (passing → failing → passing)
This caused mass thrashing in NLB’s internal control plane.
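Part of the trap is that a health check observes reachability, not the reason for unreachability. A minimal sketch (the timeout and the example target are illustrative):

```python
import socket

def tcp_health_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the target succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # A dead instance, an unattached ENI, and a route that simply hasn't
        # propagated yet all look identical from here: the connection fails.
        return False

# A freshly launched but not-yet-wired instance fails exactly like a broken one:
# tcp_health_check("10.0.1.5", 443)   # hypothetical target -> False either way
```

Requiring several consecutive failures before declaring a target unhealthy reduces flapping, but it still cannot tell a broken instance from one whose network state is merely late.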
3. Automatic AZ failover made things dramatically worse
When enough health checks fail in an AZ, NLB’s automation triggers:
Automatic DNS failover to another Availability Zone
But because failures were due to delayed network propagation, not actual instance faults:
- nodes were removed from DNS
- then added back
- then removed again
- over and over
This resulted in:
- capacity disappearing temporarily
- routing instability
- increased connection errors
- fluctuating backend availability
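Put together, the automation behaved roughly like the loop below: each evaluation cycle reacts to the latest health result with no dampening, so oscillating results turn directly into DNS churn. This is a simplified sketch, not NLB's real control plane:

```python
# Simplified sketch of undampened failover automation reacting to flapping health checks.
health_results = [False, True, False, True, False, True]   # delayed propagation -> oscillation

in_dns = True
for cycle, healthy in enumerate(health_results, start=1):
    if healthy and not in_dns:
        in_dns = True
        print(f"cycle {cycle}: target added back to DNS")
    elif not healthy and in_dns:
        in_dns = False
        print(f"cycle {cycle}: target removed from DNS (capacity withdrawn)")

# Every flip of the health result becomes a DNS change. A dampened version would require
# N consecutive failures before removing a target and would cap how much capacity can be
# withdrawn per unit time (see the velocity guard sketch in the lessons below).
```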
4. Engineers disabled automatic failover
To stop the thrashing, AWS engineers:
- disabled NLB automatic health-check failover
- brought all remaining nodes back into service
- waited for EC2 + Network Manager to recover
Once EC2 network propagation returned to normal, NLB health checks stabilized.
Lessons for Engineers
The DynamoDB outage revealed several important lessons about designing and operating distributed systems.
1. Hidden single points of failure exist even inside “distributed” systems
DynamoDB was multi-AZ and globally resilient, yet a single DNS race condition took it down.
Distributed systems can still hide centralized control-plane dependencies.
2. Protect the control plane more than the data plane
EC2’s servers were healthy, but its control plane broke (DWFM, Network Manager).
When the control plane fails, the entire service becomes unusable, even if machines are fine.
3. Recovery paths must be tested at scale
DWFM collapsed while trying to rebuild thousands of expired leases.
This scenario had never been tested.
Recovery code must be tested under:
- backlogs
- retry storms
- mass-expiry
- cold-start recovery
4. Automated failover must be carefully rate-limited
NLB misinterpreted delayed network propagation as failures and triggered AZ failover loops, removing capacity repeatedly.
Failover automation should:
- limit velocity
- understand root cause
- avoid over-correcting
Automation can multiply failures.
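One concrete form of "limit velocity" is a guard that stops withdrawing capacity once too much has already been pulled in a short window and escalates to a human instead. A rough sketch, with arbitrary thresholds:

```python
import time

class FailoverVelocityGuard:
    """Sketch of the 'limit velocity' rule: never withdraw more than a fixed fraction
    of capacity inside a rolling window. Thresholds and names are illustrative."""

    def __init__(self, total_targets: int, max_removal_fraction: float = 0.2,
                 window_seconds: float = 300.0):
        self.max_removals = int(total_targets * max_removal_fraction)
        self.window_seconds = window_seconds
        self.removals: list[float] = []           # timestamps of recent removals

    def allow_removal(self) -> bool:
        now = time.time()
        # Keep only removals that are still inside the rolling window.
        self.removals = [t for t in self.removals if now - t < self.window_seconds]
        if len(self.removals) >= self.max_removals:
            return False                          # too much capacity withdrawn; escalate to a human
        self.removals.append(now)
        return True

guard = FailoverVelocityGuard(total_targets=100)
decisions = [guard.allow_removal() for _ in range(30)]
print(decisions.count(True), "removals allowed,", decisions.count(False), "blocked")  # 20 allowed, 10 blocked
```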
5. Retry storms can cause more damage than the original failure
DWFM entered congestive collapse because retries piled up.
Unbounded retries = self-inflicted outage extension.
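A minimal sketch of the alternative: bounded retries with capped exponential backoff and jitter, so a struggling dependency sees less load rather than more (the parameter values are arbitrary):

```python
import random
import time

def call_with_bounded_retries(operation, max_attempts: int = 5,
                              base_delay: float = 0.1, max_delay: float = 5.0):
    """Retry with capped exponential backoff and full jitter, then give up.

    Bounding attempts keeps a dependency outage from turning every caller into
    part of a retry storm; jitter keeps callers from retrying in lockstep.
    (All parameters here are illustrative defaults.)
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                                   # surface the failure instead of retrying forever
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))        # full jitter de-synchronizes callers

# Usage: call_with_bounded_retries(lambda: client.put_item(...))  # hypothetical caller
```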
6. Know your dependency graph
Lambda, EC2, STS/IAM, Redshift, Connect — all failed because they depend on DynamoDB.
If you don’t know your upstream dependencies, you can’t predict your outage scenarios.
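Even a rough, machine-readable dependency map lets you compute the blast radius of each upstream before an outage does it for you. A toy sketch; the graph edges and the "YourApp" entry are illustrative:

```python
from collections import defaultdict, deque

# Toy dependency graph: service -> the services it depends on. The edges are
# illustrative, loosely based on the dependencies discussed in this post;
# "YourApp" is a stand-in for whatever you run on top of these services.
depends_on = {
    "Lambda":   ["DynamoDB"],
    "EC2":      ["DynamoDB"],
    "STS/IAM":  ["DynamoDB"],
    "Redshift": ["DynamoDB"],
    "Connect":  ["DynamoDB"],
    "NLB":      ["EC2"],
    "YourApp":  ["Lambda", "NLB"],
}

def blast_radius(failed_service: str) -> set[str]:
    """Return every service transitively impacted if `failed_service` goes down."""
    dependents = defaultdict(set)                 # invert the graph: service -> its dependents
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(svc)

    impacted, queue = set(), deque([failed_service])
    while queue:
        for svc in dependents[queue.popleft()]:
            if svc not in impacted:
                impacted.add(svc)
                queue.append(svc)
    return impacted

print(blast_radius("DynamoDB"))
# e.g. {'Lambda', 'EC2', 'STS/IAM', 'Redshift', 'Connect', 'NLB', 'YourApp'}
```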
Final Takeaway
Most outages at scale come not from hardware failure, but from small bugs in the control plane.
Building resilient systems requires:
- safe automation
- controlled failover
- tested recovery logic
- deep awareness of cross-service dependencies

