The AWS Outage on Oct 20, 2025: What Broke, Why It Felt Global, and How Amazon Stabilized It

In the early hours of October 20, 2025 (PDT), AWS began seeing elevated error rates in US-EAST-1 (N. Virginia). Amazon later confirmed two key threads:

  1. a DNS resolution problem for DynamoDB API endpoints in us-east-1 (fully mitigated at 2:24 AM PDT), and
  2. a degraded internal subsystem that monitors the health of Network Load Balancers (NLBs) inside the EC2 internal network, which cascaded into service connectivity and launch issues. (About Amazon)

The incident: a clear timeline

  • 12:11 AM PDT (Oct 20) — AWS reports increased error rates for multiple services; investigation points to DNS resolution of DynamoDB API endpoints in us-east-1. (About Amazon)
  • 2:24 AM PDT — Amazon says the underlying DNS issue is fully mitigated, but residual impact continues (notably EC2 instance launches). (About Amazon)
  • Morning–afternoon (US) — AWS communicates a “root cause” in an internal subsystem monitoring NLB health, affecting connectivity; AWS also throttles new EC2 launches to speed recovery and clear backlogs. (Reuters)
  • Later in the day — AWS states services are back to normal, while some backlogs continue processing. (Reuters)

Root causes (the short, accurate version)

1) DNS resolution to DynamoDB endpoints (us-east-1) failed

For a window of time, clients couldn’t resolve regional DynamoDB API hostnames in N. Virginia, generating timeouts and errors for apps and AWS services depending on those endpoints. This DNS-specific fault was mitigated at 2:24 AM PDT. (About Amazon)
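If your code sat behind that window, the cheapest defense is making the failure fast and visible. Below is a minimal boto3 sketch, with illustrative timeouts and placeholder table/key names that are not from AWS's report: it pins DynamoDB to its regional endpoint, keeps timeouts tight, and bounds retries so a resolution hiccup degrades gracefully instead of hanging request threads.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

ddb_config = Config(
    region_name="us-east-1",
    connect_timeout=2,       # seconds; surface connectivity problems quickly
    read_timeout=5,
    retries={
        "max_attempts": 4,   # total attempts, including the first try
        "mode": "adaptive",  # exponential backoff plus client-side rate limiting
    },
)

# This client resolves dynamodb.us-east-1.amazonaws.com, the regional endpoint.
dynamodb = boto3.client("dynamodb", config=ddb_config)

def get_item_or_none(table: str, key: dict):
    """Return the item, or None so callers can fall back to a degraded mode."""
    try:
        return dynamodb.get_item(TableName=table, Key=key).get("Item")
    except (BotoCoreError, ClientError):
        # During an event like Oct 20, serve cached data or a read path in
        # another Region here rather than blocking the request thread.
        return None
```

The exact numbers matter less than the shape: fail fast, retry with backoff, and give callers a degraded answer instead of an exception.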

2) An internal NLB-health monitoring subsystem degraded

As recovery progressed, AWS identified the underlying issue: an internal subsystem that monitors the health of Network Load Balancers on the EC2 internal network. Problems there degraded connectivity and knocked other subsystems (e.g., parts of Lambda), prompting AWS to throttle EC2 instance launches as a mitigation while it restored health. (Reuters)
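For customers, that throttling surfaced as API errors on new launches rather than silent failure. Here is a hedged sketch of how a provisioning job can ride it out; the instance parameters and the set of retryable error codes are assumptions for illustration, not taken from the incident report.

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

# Error codes treated as retryable throttling (assumption for illustration).
THROTTLE_CODES = {"RequestLimitExceeded", "Throttling"}

def launch_with_backoff(max_attempts: int = 6):
    """Try to launch one instance, backing off while the API is throttled."""
    for attempt in range(1, max_attempts + 1):
        try:
            return ec2.run_instances(
                ImageId="ami-0123456789abcdef0",  # placeholder AMI id
                InstanceType="t3.micro",
                MinCount=1,
                MaxCount=1,
            )
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code not in THROTTLE_CODES or attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount up to an exponentially growing cap.
            time.sleep(random.uniform(0, min(30, 2 ** attempt)))
```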

Why did this feel global if it started in one Region?

Many “global” features and customer architectures implicitly lean on us-east-1 control planes or make cross-region calls. When DNS (a cross-cutting dependency) and regional APIs wobble, effects can surface far from N. Virginia. AWS’ own Fault Isolation Boundaries guidance explains how such control-plane dependencies work and why regional designs matter. (AWS Documentation)
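One concrete example of such an anchor, drawn from documented SDK behavior rather than AWS's incident post: legacy SDK defaults send STS calls to the global sts.amazonaws.com endpoint, which is served from us-east-1. A small sketch of pinning clients, STS included, to the Region a workload actually runs in (the Region names below are placeholders):

```python
import os

import boto3

# Older SDK defaults send STS calls to the global endpoint (served from
# us-east-1); this env var opts legacy configurations into regional endpoints.
os.environ.setdefault("AWS_STS_REGIONAL_ENDPOINTS", "regional")

# The Region this workload actually runs in (placeholder default).
REGION = os.environ.get("AWS_REGION", "eu-west-1")

sts = boto3.client("sts", region_name=REGION)        # regional STS endpoint
dynamodb = boto3.client("dynamodb", region_name=REGION)

print(sts.get_caller_identity()["Arn"])
```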


Impact snapshot

Major consumer and enterprise apps—including social, commerce, gaming, media, and even some government services—reported outages or degraded performance while AWS recovered the Region and dependent subsystems. Coverage consistently tied the blast radius to us-east-1 and DynamoDB/DNS symptoms, with Lambda errors and EC2 launch failures during the day. (The Guardian)


How AWS fixed it (mitigations & recovery steps)

  • Mitigated DNS faults for DynamoDB API in us-east-1 (completed at 2:24 AM PDT). (About Amazon)
  • Identified the NLB health-monitoring subsystem as the root cause and applied mitigations to restore internal connectivity. (Reuters)
  • Throttled new EC2 instance launches to reduce load and accelerate stabilization; continued clearing service backlogs as systems healed. (The Guardian)
  • Communicated recovery status over the day as connectivity and API success rates improved across services (Lambda, SQS, Amazon Connect, etc.). (The Guardian)

What builders should learn (practical takeaways)

  • Design for regional independence. Keep user journeys working without calling a control plane in us-east-1; make stacks statically stable during incidents. (AWS Documentation)
  • Prefer regional endpoints & VPC endpoints. Use regional service URLs (e.g., dynamodb.<region>.amazonaws.com) and Private DNS with VPC endpoints to lower exposure to external DNS issues. (AWS Documentation)
  • Assume DNS is fallible—add resilience. Multiple resolvers, reasonable TTLs, negative-cache hygiene, jittered retries, and last-good-answer caches (see the sketch after this list) can turn DNS from a SPOF into a survivable layer. (AWS docs detail health-check behavior and routing on NLBs.) (AWS Documentation)
  • Avoid hidden single-Region anchors. AWS’ Fault Isolation Boundaries papers enumerate which “global” features depend on us-east-1; don’t put these in your recovery path. (AWS Documentation)
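As referenced in the DNS bullet above, here is a minimal sketch of the last-good-answer idea, assuming a plain wrapper around Python's resolver rather than any particular library: retry with jitter, and fall back to the most recent successful lookup when resolution fails outright.

```python
import random
import socket
import time

# hostname -> most recent successful getaddrinfo() result
_last_good: dict[str, list] = {}

def resolve(host: str, port: int = 443, attempts: int = 3) -> list:
    """Resolve with jittered retries; fall back to the last good answer."""
    for attempt in range(1, attempts + 1):
        try:
            result = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
            _last_good[host] = result          # refresh the cache on success
            return result
        except socket.gaierror:
            if attempt < attempts:
                time.sleep(random.uniform(0, 0.5 * attempt))  # jittered backoff
    if host in _last_good:
        return _last_good[host]                # possibly stale, but keeps traffic flowing
    raise socket.gaierror(f"cannot resolve {host} and no cached answer available")

# Example: resolve the regional DynamoDB endpoint explicitly.
addrs = resolve("dynamodb.us-east-1.amazonaws.com")
```

A stale address is not always safe to reuse, so treat this as a stopgap for short DNS blips, not a substitute for health checks.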

Closing

This incident had two intertwined threads, DNS resolution for DynamoDB endpoints and a degraded internal NLB-monitoring subsystem, that together produced global-feeling impact from a regional fault. Amazon mitigated DNS early, then stabilized internal networking and capacity by throttling EC2 launches and clearing backlogs as services recovered. (About Amazon)

If you share your Regions and key AWS services in the comments, I can tailor a resilience checklist (DNS patterns, endpoint strategy, failover drills) mapped to your stack.
