Muhammad Zeeshan

AWS US East-1 Outage: A DNS Glitch That Crippled the Cloud on October 20, 2025

In the pre-dawn hours of October 20, 2025, AWS's US East-1 region—the backbone of much of the internet—experienced a multi-hour outage that disrupted 78 services and rippled across global apps. Starting around 3:11 a.m. ET, elevated error rates in DynamoDB's API endpoint snowballed into widespread failures, affecting everything from EC2 instance launches to IAM updates. This Northern Virginia hub, AWS's largest and default region, powers giants like Snapchat, Ring, Fortnite, and Alexa, turning a regional hiccup into a worldwide headache that lasted over seven hours.
The root cause? A DNS resolution failure triggered by a botched update to DynamoDB's API endpoint in one of US East-1's primary data centers. The update corrupted DNS records and blocked endpoint access; client retries then flooded downstream systems, jamming SQS queues, stalling Lambda functions, and triggering EC2 capacity errors. An internal network glitch compounded the issue, hitting services like Amazon Connect and AWS Batch. AWS's Health Dashboard timeline revealed frantic mitigations: DNS patches by 3:35 a.m. PDT, rate limiting on EC2 launches, and backlog drains, with full recovery dragging into the afternoon ET.
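That retry flood is instructive on its own: default SDK retry behavior can amplify an endpoint failure. Below is a minimal client-side sketch, assuming boto3; the attempt cap, adaptive retry mode, and timeout values are illustrative choices, not AWS-recommended settings.

```python
import socket

import boto3
from botocore.config import Config

REGIONAL_ENDPOINT = "dynamodb.us-east-1.amazonaws.com"  # the endpoint whose DNS records failed

def endpoint_resolves(hostname: str) -> bool:
    """Cheap pre-flight check: can we even resolve the regional endpoint?"""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False

# Cap retries and enable adaptive (client-side rate-limited) retry mode so our
# traffic does not pile onto an already-struggling endpoint (illustrative values).
bounded_retries = Config(
    retries={"max_attempts": 3, "mode": "adaptive"},
    connect_timeout=3,
    read_timeout=5,
)

dynamodb = boto3.client("dynamodb", region_name="us-east-1", config=bounded_retries)

if not endpoint_resolves(REGIONAL_ENDPOINT):
    # Fail fast (or fail over to another region) instead of hammering a dead name.
    raise RuntimeError("DynamoDB endpoint not resolving; trigger regional failover")
```

The point of the pre-flight check is to surface the failure mode AWS hit: when the name itself stops resolving, unbounded retries only add load without any chance of success.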
Impacts were swift and severe—Downdetector reports spiked to 10,000 per hour as users faced stalled streams on Twitch, silent Alexas, and empty Fortnite lobbies. Enterprise sectors like finance and e-commerce saw delayed trades and abandoned carts, with estimated losses in the hundreds of millions. AWS responded with near-real-time updates every 30-45 minutes, a transparency win over past outages, and urged multi-AZ spreads for resilience. Post-incident credits are likely, but critics highlight US East-1's over-reliance as a predictable vulnerability, exacerbated by Amazon's 2025 talent exodus straining ops.
Key lessons: Diversify across regions ruthlessly—use Global Accelerator for traffic routing and replicate data to EU West-1 or Asia Pacific. Treat DNS as critical infrastructure with redundant resolvers and circuit breakers to halt cascades. Implement exponential backoff retries and idempotent designs to manage surges, while quarterly chaos engineering drills build muscle memory. Ultimately, this outage underscores cloud fragility: uptime demands proactive redundancy, not just reactive fixes, as our AI-fueled digital world grows ever more interdependent.
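To make the backoff, idempotency, and circuit-breaker lessons concrete, here is a minimal sketch in Python. The breaker thresholds, the "orders" table, and the "order_id" key are hypothetical; the conditional PutItem is one common way to make a retried write idempotent.

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

class CircuitBreaker:
    """Trip after consecutive failures; refuse calls until a cooldown passes."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let calls through again to probe for recovery.
        return time.monotonic() - self.opened_at >= self.cooldown_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


dynamodb = boto3.client("dynamodb", region_name="us-east-1")
breaker = CircuitBreaker()

def idempotent_put(order_id: str, payload: str, max_attempts: int = 5):
    """Write once per order_id, retrying with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise RuntimeError("Circuit open: skipping call, serve a degraded response")
        try:
            dynamodb.put_item(
                TableName="orders",  # hypothetical table
                Item={"order_id": {"S": order_id}, "payload": {"S": payload}},
                # Idempotency: the same order_id can only be written once,
                # so a retried request cannot create a duplicate record.
                ConditionExpression="attribute_not_exists(order_id)",
            )
            breaker.record_success()
            return
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code == "ConditionalCheckFailedException":
                breaker.record_success()  # already written: a late retry, which is fine
                return
            breaker.record_failure()
            # Exponential backoff with full jitter to avoid synchronized retry storms.
            time.sleep(random.uniform(0, min(8, 0.2 * 2 ** attempt)))
    raise RuntimeError("DynamoDB write failed after retries")
```

The jittered backoff keeps thousands of clients from retrying in lockstep, and the breaker stops a dependency outage from tying up threads; neither replaces multi-region replication, but both limit how far a regional failure cascades into your own stack.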
