<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anunay Kumar</title>
    <description>The latest articles on DEV Community by Anunay Kumar (@anunay2).</description>
    <link>https://dev.to/anunay2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3593285%2F087399d5-1458-4350-9d33-7d6abf6d4f86.jpeg</url>
      <title>DEV Community: Anunay Kumar</title>
      <link>https://dev.to/anunay2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anunay2"/>
    <language>en</language>
    <item>
      <title>Understanding AWS DynamoDB Outage on Deepawali!</title>
      <dc:creator>Anunay Kumar</dc:creator>
      <pubDate>Sun, 02 Nov 2025 16:56:43 +0000</pubDate>
      <link>https://dev.to/anunay2/understanding-aws-dynamodb-outage-on-deepawali-21c0</link>
      <guid>https://dev.to/anunay2/understanding-aws-dynamodb-outage-on-deepawali-21c0</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Anything that can go wrong will go wrong. (Murphy's law)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;DynamoDB depends heavily on DNS. Instead of one static IP, AWS maintains &lt;strong&gt;hundreds of thousands of DNS records&lt;/strong&gt; for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scaling,&lt;/li&gt;
&lt;li&gt;routing traffic across load balancers,&lt;/li&gt;
&lt;li&gt;handling IPv6/FIPS variants,&lt;/li&gt;
&lt;li&gt;removing failed capacity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To manage this, DynamoDB uses two internal components:&lt;/p&gt;

&lt;h3&gt;
  
  
  DNS Planner
&lt;/h3&gt;

&lt;p&gt;Continuously generates new “plans” (Plan #1200, #1300, #1400, #1500, and so on) describing which load balancers should serve the endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  DNS Enactors (3 copies, one per AZ)
&lt;/h3&gt;

&lt;p&gt;Apply these plans to &lt;a href="https://en.wikipedia.org/wiki/Amazon_Route_53" rel="noopener noreferrer"&gt;Route53&lt;/a&gt; using atomic transactions.&lt;/p&gt;

&lt;p&gt;This design normally ensures high availability.&lt;br&gt;&lt;br&gt;
But it also created the perfect conditions for a rare race condition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyl8ix3pyecjdg5pooou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyl8ix3pyecjdg5pooou.png" alt=" " width="800" height="559"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Rare Race Condition (The Real Root Cause)
&lt;/h3&gt;

&lt;p&gt;Here’s the exact sequence of events:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enactor A picked &lt;strong&gt;old Plan #1200&lt;/strong&gt; but got stuck retrying.&lt;/li&gt;
&lt;li&gt;Planner produced newer plans: #1300 → #1400 → #1500.&lt;/li&gt;
&lt;li&gt;Enactor B, running normally, applied &lt;strong&gt;new Plan #1500&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Enactor A finally woke up and &lt;strong&gt;applied old Plan #1200&lt;/strong&gt;, overwriting #1500.&lt;/li&gt;
&lt;li&gt;Enactor B’s cleanup then deleted all old plans — including Plan #1200, which was now the active one.&lt;/li&gt;
&lt;li&gt;Now &lt;strong&gt;no plan existed&lt;/strong&gt; in the system.&lt;/li&gt;
&lt;/ul&gt;
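&lt;p&gt;The sequence above boils down to a lost-update race: a stale writer wins because nothing compares plan versions at apply time. A minimal sketch in Python (hypothetical names; the real Planner/Enactor protocol is internal to AWS) showing the overwrite, and how a monotonic version check would turn the stale apply into a harmless no-op:&lt;/p&gt;

```python
class DnsState:
    """Toy stand-in for the Route53 record set of one endpoint."""
    def __init__(self):
        self.applied_plan = None  # plan number currently in effect

    def apply_unchecked(self, plan):
        # What effectively happened: a delayed Enactor writes whatever
        # plan it is holding, even if a newer one is already applied.
        self.applied_plan = plan

    def apply_checked(self, plan):
        # Guarded version: refuse to go backwards. The stale write
        # from the slow Enactor is rejected.
        if self.applied_plan is None or plan > self.applied_plan:
            self.applied_plan = plan
            return True
        return False

# Unchecked: Enactor B applies #1500, then the stalled Enactor A
# wakes up and overwrites it with #1200.
state = DnsState()
state.apply_unchecked(1500)   # Enactor B
state.apply_unchecked(1200)   # Enactor A, stale
print(state.applied_plan)     # 1200 -- the newer plan is lost

# Checked: the stale apply is a no-op.
state = DnsState()
state.apply_checked(1500)
state.apply_checked(1200)
print(state.applied_plan)     # 1500
```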

&lt;p&gt;With no plan:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Route53 had &lt;em&gt;no IP addresses&lt;/em&gt; for &lt;code&gt;dynamodb.us-east-1.amazonaws.com&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The DynamoDB endpoint effectively disappeared.&lt;/li&gt;
&lt;li&gt;All AWS services depending on DynamoDB immediately failed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was the epicenter of the blast that followed on the day of Diwali!&lt;/p&gt;

&lt;h3&gt;
  
  
  Chain Reaction/Cascading Failures
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgaucwarnpwhxwiu5wj75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgaucwarnpwhxwiu5wj75.png" alt=" " width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's discuss these failures one by one:&lt;/p&gt;

&lt;h3&gt;
  
  
  EC2 failures
&lt;/h3&gt;

&lt;p&gt;EC2 failed because its &lt;strong&gt;control plane lost access to DynamoDB&lt;/strong&gt;, which it relies on for critical internal state. This caused EC2 to temporarily “lose” capacity and become unable to launch new instances.&lt;/p&gt;

&lt;p&gt;Below is the exact breakdown of how this happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. DropletWorkflow Manager (DWFM)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DWFM manages all physical servers (called &lt;em&gt;droplets&lt;/em&gt;) that run EC2 instances. It maintains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;host state&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;instance-to-host mapping&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;lease/heartbeat for each physical server&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;lifecycle operations (shutdown, reboot, etc.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Network Manager&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;updating VPC routing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;propagating network configuration to new instances&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;networking for ENIs, subnets, routes, and load balancers&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Both systems store their operational metadata in DynamoDB.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When DynamoDB became unreachable, both broke immediately.&lt;/p&gt;

&lt;h4&gt;
  
  
  What happened when DynamoDB went down?
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;DWFM could not refresh leases for any droplet&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every few minutes, each physical EC2 host requires a renewed lease.&lt;br&gt;&lt;br&gt;
Since DWFM couldn’t read/write DynamoDB:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;leases began expiring across the entire region&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;expired lease = host can’t be used for new instance launches&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;EC2 effectively “lost” available capacity even though hardware was healthy&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why EC2 API calls returned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;“insufficient capacity”&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;“request limit exceeded”&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
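&lt;p&gt;The lease mechanism can be sketched as a simple TTL that is refreshed only when the state store is reachable (hypothetical names and intervals; DWFM’s real protocol is not public). The point: the hardware stays healthy, but once the lease lapses the host is no longer eligible for new launches:&lt;/p&gt;

```python
LEASE_TTL = 3.0  # seconds; illustrative only, real DWFM intervals are internal

class DropletLease:
    def __init__(self, host_id, now):
        self.host_id = host_id
        self.expires_at = now + LEASE_TTL

    def renew(self, now, dynamodb_up):
        # Renewal requires a DynamoDB write; during the outage the
        # write fails and the lease keeps ticking toward expiry.
        if dynamodb_up:
            self.expires_at = now + LEASE_TTL
            return True
        return False

    def is_launchable(self, now):
        return self.expires_at > now

lease = DropletLease("droplet-1", now=0.0)
lease.renew(now=2.0, dynamodb_up=False)  # outage: renewal fails
print(lease.is_launchable(now=2.5))      # True  (lease not yet expired)
print(lease.is_launchable(now=4.0))      # False (hardware fine, lease gone)
```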

&lt;h6&gt;
  
  
  After DynamoDB recovered, DWFM entered congestive collapse
&lt;/h6&gt;

&lt;p&gt;DWFM tried to re-establish leases for thousands of hosts at once. But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;there were too many expired leases&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;every attempt added more work&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;retries caused queue buildup&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DWFM couldn’t finish lease recovery fast enough&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This led to &lt;strong&gt;congestive collapse&lt;/strong&gt;, where DWFM was stuck processing old work and couldn’t make forward progress.&lt;/p&gt;

&lt;p&gt;So even though DynamoDB was fixed, &lt;strong&gt;EC2 still couldn’t launch new instances&lt;/strong&gt;.&lt;/p&gt;
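&lt;p&gt;Congestive collapse is easy to reproduce with a toy queue model (numbers invented for illustration): as long as timed-out work is re-submitted faster than it can be serviced, the backlog grows even though the underlying dependency is back up:&lt;/p&gt;

```python
def drain(backlog, service_rate, retry_rate, ticks):
    """Toy queue model of lease recovery: each tick DWFM completes
    service_rate items; items still waiting afterwards time out and
    re-enter the queue at retry_rate."""
    for _ in range(ticks):
        backlog = backlog - min(backlog, service_rate)
        if backlog > 0:
            backlog += retry_rate  # retries pile more work onto the queue
    return backlog

# Retries arrive faster than DWFM can service them: the backlog grows.
print(drain(10_000, service_rate=100, retry_rate=150, ticks=100))  # 15000

# With throttled retries the same backlog drains to zero.
print(drain(10_000, service_rate=100, retry_rate=20, ticks=200))   # 0
```

This is the same reason AWS engineers had to throttle request rates before recovery could make forward progress.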




&lt;h5&gt;
  
  
  Manual intervention was required
&lt;/h5&gt;

&lt;p&gt;Since there was no pre-existing playbook for this scenario, AWS engineers had to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;throttle EC2 API request rates&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;manually restart DWFM hosts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;clear the internal queues&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;slowly rebuild leases for the entire region&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
&lt;strong&gt;Why NLB Failed During the Outage?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Network Load Balancer (NLB) did &lt;strong&gt;not&lt;/strong&gt; fail because its systems were broken.&lt;br&gt;&lt;br&gt;
It failed because &lt;strong&gt;EC2’s network propagation was delayed&lt;/strong&gt;, causing NLB’s health checks to misinterpret healthy nodes as &lt;em&gt;unhealthy&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This triggered a cascading failure inside the entire NLB fleet.&lt;/p&gt;




&lt;h4&gt;
  
  
  1. NLB depends on Network Manager for routing information
&lt;/h4&gt;

&lt;p&gt;Whenever a new EC2 instance is launched, Network Manager must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;push &lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html" rel="noopener noreferrer"&gt;ENI&lt;/a&gt; attachments&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;update &lt;strong&gt;routes&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;propagate &lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-vpc.html" rel="noopener noreferrer"&gt;VPC&lt;/a&gt; networking state&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;notify load balancers that the instance is ready&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But Network Manager was already delayed because DWFM entered congestive collapse after the DynamoDB outage.&lt;/p&gt;

&lt;p&gt;This meant:&lt;/p&gt;

&lt;h5&gt;
  
  
  New EC2 instances came up &lt;strong&gt;without network connectivity&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;The instance existed, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;no routing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;no connectivity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;no health check path&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From NLB’s point of view → &lt;strong&gt;the instance looked dead&lt;/strong&gt;.&lt;/p&gt;




&lt;h4&gt;
  
  
  2. NLB’s health check subsystem began failing
&lt;/h4&gt;

&lt;p&gt;NLB performs &lt;strong&gt;constant health checks&lt;/strong&gt; on all backend targets.&lt;/p&gt;

&lt;p&gt;Because network state propagation was delayed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;new instances failed health checks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NLB nodes themselves sometimes couldn't communicate internally&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;health check results began oscillating (passing → failing → passing)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This caused &lt;strong&gt;mass thrashing&lt;/strong&gt; in NLB’s internal control plane.&lt;/p&gt;
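&lt;p&gt;One standard defence against this kind of thrashing is flap damping: require several consecutive results before flipping a target’s state, the same idea as the healthy/unhealthy threshold counts on typical load-balancer health checks. A sketch of the idea (illustrative, not NLB internals):&lt;/p&gt;

```python
class HealthTracker:
    """Flap damping: only flip state after flip_after consecutive
    results that disagree with the current state."""
    def __init__(self, flip_after=3):
        self.flip_after = flip_after
        self.state = "healthy"
        self.streak = 0  # consecutive results contradicting self.state

    def observe(self, check_passed):
        target = "healthy" if check_passed else "unhealthy"
        if target == self.state:
            self.streak = 0  # agreement resets the counter
        else:
            self.streak += 1
            if self.streak >= self.flip_after:
                self.state = target
                self.streak = 0
        return self.state

t = HealthTracker(flip_after=3)
# Oscillating results (fail, pass, fail, pass, ...) never flip the state:
for ok in [False, True, False, True, False, True]:
    t.observe(ok)
print(t.state)  # healthy

# Only a sustained run of failures marks the target unhealthy:
for ok in [False, False, False]:
    t.observe(ok)
print(t.state)  # unhealthy
```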




&lt;h4&gt;
  
  
  3. Automatic AZ failover made things dramatically worse
&lt;/h4&gt;

&lt;p&gt;When enough health checks fail in an AZ, NLB’s automation triggers:&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;strong&gt;Automatic DNS failover to another Availability Zone&lt;/strong&gt;
&lt;/h6&gt;

&lt;p&gt;But because failures were due to &lt;em&gt;delayed network propagation&lt;/em&gt;, not actual instance faults:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;nodes were removed from DNS&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;then added back&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;then removed again&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;over and over&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This resulted in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;capacity disappearing temporarily&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;routing instability&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;increased connection errors&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;fluctuating backend availability&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Engineers disabled automatic failover
&lt;/h4&gt;

&lt;p&gt;To stop the thrashing, AWS engineers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;disabled NLB automatic health-check failover &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;brought all remaining nodes back into service&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;waited for EC2 + Network Manager to recover&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once EC2 network propagation returned to normal, NLB health checks stabilized.&lt;/p&gt;

&lt;h4&gt;
  
  
  Lessons for Engineers
&lt;/h4&gt;

&lt;p&gt;The DynamoDB outage revealed several important lessons about designing and operating distributed systems.&lt;/p&gt;




&lt;h4&gt;
  
  
  1. Hidden single points of failure exist even inside “distributed” systems
&lt;/h4&gt;

&lt;p&gt;DynamoDB was multi-AZ and globally resilient, yet a &lt;strong&gt;single DNS race condition&lt;/strong&gt; took it down.&lt;br&gt;&lt;br&gt;
Distributed systems can still hide &lt;strong&gt;centralized control-plane dependencies&lt;/strong&gt;.&lt;/p&gt;




&lt;h4&gt;
  
  
  2. Protect the control plane more than the data plane
&lt;/h4&gt;

&lt;p&gt;EC2’s &lt;em&gt;servers&lt;/em&gt; were healthy, but its &lt;strong&gt;control plane&lt;/strong&gt; broke (DWFM, Network Manager).&lt;br&gt;&lt;br&gt;
When the control plane fails, &lt;strong&gt;the entire service becomes unusable&lt;/strong&gt;, even if machines are fine.&lt;/p&gt;




&lt;h4&gt;
  
  
  3. Recovery paths must be tested at scale
&lt;/h4&gt;

&lt;p&gt;DWFM collapsed while trying to rebuild thousands of expired leases.&lt;br&gt;&lt;br&gt;
This scenario had &lt;strong&gt;never been tested&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Recovery code must be tested under:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;backlogs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;retry storms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mass-expiry&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;cold-start recovery&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  4. Automated failover must be carefully rate-limited
&lt;/h4&gt;

&lt;p&gt;NLB misinterpreted delayed network propagation as failures and triggered &lt;strong&gt;AZ failover loops&lt;/strong&gt;, removing capacity repeatedly.&lt;/p&gt;

&lt;p&gt;Failover automation should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;limit velocity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;understand root cause&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;avoid over-correcting&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Automation can multiply failures.&lt;/p&gt;
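&lt;p&gt;“Limit velocity” can be as simple as a removal budget: automation may take capacity out of service only up to a fixed fraction of the fleet, after which it must stop and escalate. A sketch of the idea (hypothetical numbers, not AWS’s actual mechanism):&lt;/p&gt;

```python
class FailoverGovernor:
    """Velocity limiter for automated remediation: refuse to remove
    more than max_fraction of the fleet from service."""
    def __init__(self, total_nodes, max_fraction=0.2):
        self.max_removable = int(total_nodes * max_fraction)
        self.removed = 0

    def request_removal(self, n):
        # Deny removals that would exceed the budget; beyond this
        # point a human (or a root-cause-aware system) must approve.
        if self.removed + n > self.max_removable:
            return False
        self.removed += n
        return True

gov = FailoverGovernor(total_nodes=100, max_fraction=0.2)
print(gov.request_removal(15))  # True  -- within budget
print(gov.request_removal(10))  # False -- would exceed 20% of the fleet
```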




&lt;h4&gt;
  
  
  5. Retry storms can cause more damage than the original failure
&lt;/h4&gt;

&lt;p&gt;DWFM entered &lt;strong&gt;congestive collapse&lt;/strong&gt; because retries piled up.&lt;br&gt;&lt;br&gt;
Unbounded retries = &lt;strong&gt;self-inflicted outage extension&lt;/strong&gt;.&lt;/p&gt;
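&lt;p&gt;The standard antidote is capped exponential backoff with jitter, the pattern AWS’s own architecture guidance recommends: retries spread out in time instead of arriving in synchronized waves. A sketch:&lt;/p&gt;

```python
import random

def backoff_delays(attempts, base=0.1, cap=20.0):
    """Capped exponential backoff with 'full jitter': each retry
    waits a random time between 0 and min(cap, base * 2**attempt),
    so clients desynchronize and load ramps down instead of up."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays

# Later retries are allowed to wait longer, but the cap bounds the
# worst case and the jitter spreads clients apart.
for i, delay in enumerate(backoff_delays(8)):
    print(f"retry {i}: ceiling {min(20.0, 0.1 * 2 ** i):.1f}s, chose {delay:.2f}s")
```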




&lt;h4&gt;
  
  
  6. Know your dependency graph
&lt;/h4&gt;

&lt;p&gt;Lambda, EC2, STS/IAM, Redshift, Connect — all failed because they depend on DynamoDB.&lt;/p&gt;

&lt;p&gt;If you don’t know your upstream dependencies, you can’t predict your outage scenarios.&lt;/p&gt;




&lt;h3&gt;
  
  
  Final Takeaway
&lt;/h3&gt;

&lt;p&gt;Most outages at scale come not from hardware failure, but from &lt;strong&gt;small bugs in the control plane&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Building resilient systems requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;safe automation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;controlled failover&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;tested recovery logic&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;and deep awareness of cross-service dependencies&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Reference
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;Outage summary doc by AWS&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>networking</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
