How a Single DNS Failure Caused the Massive AWS Global Outage

#aws #dns #cloud #outage

Monday, October 20, 2025: you’re comfortably browsing your favorite article on Medium while sipping your morning coffee. As you scroll, the page refuses to load. You hit refresh. Nothing. You switch apps. Still nothing. Then your go-to messaging platform won’t let you sign in, your bank app times out, and even your game’s matchmaking screen flickers and fails. Your friends start tweeting about issues with Fortnite and Snapchat. Meanwhile, your smart-home gear stops responding. Something big is happening.

Behind the scenes, in the heart of the cloud, a seemingly small crack opened in the internet’s foundations. In the Amazon Web Services (AWS) US-EAST-1 region, the DNS (Domain Name System) name for the endpoint of a key service, Amazon DynamoDB, failed to resolve. Because so many websites and apps lean on that service, the failure rippled out globally. Snapchat, Fortnite, Duolingo, Ring, and banks across the UK all reported disruptions.

The scariest part? You weren’t just a spectator — as you sat there unable to load Medium, you were part of a vast global consequence of a single DNS failure in one region.

Good news — you still have options

While you may not control AWS’s infrastructure, any team building online services can adopt hardened practices to survive such dependency failures:

Deploy multi-region and multi-provider redundancy: Don’t host everything in a single region or provider. If one region (or cloud) fails, traffic can fall back to another.
Implement fallback endpoints and DNS caching: Ensure critical services have alternate endpoints or cached resolutions so that momentary DNS lookup failures don’t completely block functionality.
Monitor DNS lookups and endpoint resolution health: Track metrics like DNS timeouts and retry rates; set alerts when your “phone book” (DNS) is failing.
Adopt client-side retry/backoff and graceful degradation: Assume external dependencies may fail — apps should handle failures gracefully (e.g., “Offline mode”, reduced functionality) rather than simply hanging.
Decouple critical dependencies: If a database/API is common to many subsystems, consider splitting or replicating it across independent infrastructure so one hiccup doesn’t cascade into full failure.
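
To make the retry/backoff and graceful-degradation advice above concrete, here is a minimal Python sketch. The helper name `call_with_backoff`, its parameters, and the default delays are all assumptions chosen for illustration, not taken from any AWS SDK. It retries a failing call with capped exponential backoff plus random jitter, and if every attempt fails it returns a degraded result instead of hanging.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `operation` up to `max_attempts` times, then degrade to `fallback`.

    Hypothetical helper: the names and default values are illustrative only.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError:
            # OSError covers DNS failures (socket.gaierror), refused or reset
            # connections (ConnectionError), and TimeoutError.
            if attempt == max_attempts - 1:
                break
            # Full jitter: wait a random time up to the capped exponential
            # delay so that many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
    # Graceful degradation: serve cached or reduced data instead of failing.
    return fallback()
```

A caller might pass a function that queries the live database as `operation` and one that returns a locally cached copy as `fallback`, so users see slightly stale data rather than an error page. The jitter matters during a large outage: without it, thousands of clients retry at the same instants and can slow the recovery of the dependency they are waiting for.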

For example, AWS’s own status updates pointed to a DNS resolution issue with the DynamoDB endpoint in US-EAST-1. By caching resolved endpoint addresses ahead of time or configuring alternate DNS paths, many downstream services might have reduced the severity of the impact.
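
One way to picture the caching idea is a "last known good" resolver, sketched below in Python. The class name `LastKnownGoodResolver`, the `max_stale_seconds` limit, and the `failures` counter are assumptions made for this example; only `socket.getaddrinfo` is a real standard-library call. The sketch remembers the most recent successful lookup for each hostname and serves it when a fresh lookup fails, while counting failures so they can feed an alert.

```python
import socket
import time
from typing import Callable, Dict, List, Tuple


def system_resolve(hostname: str) -> List[str]:
    """Resolve `hostname` with the OS resolver and return its IP addresses."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


class LastKnownGoodResolver:
    """Hypothetical helper: serve the last successful answer when DNS fails."""

    def __init__(
        self,
        resolve: Callable[[str], List[str]] = system_resolve,
        max_stale_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, List[str]]] = {}
        self.failures = 0  # export this as a metric and alert on it

    def lookup(self, hostname: str) -> List[str]:
        try:
            addresses = self._resolve(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            self.failures += 1
            cached = self._cache.get(hostname)
            if cached is not None and self._clock() - cached[0] <= self._max_stale:
                return cached[1]  # stale but usable answer
            raise  # nothing cached, or the cached answer is too old
        self._cache[hostname] = (self._clock(), addresses)
        return addresses
```

This is a sketch under stated assumptions, not a drop-in fix. Stale addresses can point at hosts that have since been retired, so the staleness limit should be short, and connections still need to present the original hostname for TLS verification. Whether cached addresses would have helped in this particular incident depends on details the status updates quoted here do not cover.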

Otherwise, let’s wait for the Fireship video 😂

DEV Community
