What Really Happened in the AWS US-East-1 Outage and Why It Was So Bad: An Initial Writeup Based on AWS Communications
While many tech professionals have detailed AWS’s recent US-East-1 outage, my view is shaped by extensive experience managing DNS outages in on-premises environments. This writeup is an initial analysis based on AWS’s official statements and public information.
Why the AWS Outage Became a Doomsday Event Unlike Typical On-Prem DNS Failures
DNS outages are a fundamental failure point in any distributed system. No provider, including AWS, can fully eliminate DNS risk. Yet in on-prem environments, DNS disruptions—even with tight application dependencies—usually recover fast and stay localized, enabling quick service restoration.
AWS operates at hyperscale: millions of interdependent APIs, services, and control planes, deeply coupled and globally dispersed. DNS in AWS underpins service discovery, authentication, authorization, and control-plane orchestration. The US-East-1 DNS failure that hit the DynamoDB endpoints triggered cascading failures across IAM, Lambda, EC2, CloudWatch, and more, and retry storms plus state-resynchronization backlogs stretched the recovery timeline, turning what would normally be a brief DNS hiccup into a prolonged global incident.
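As a concrete illustration of not amplifying a retry storm, here is a minimal Python sketch using boto3's built-in adaptive retry mode with short timeouts. The region and client choice are illustrative assumptions, not a description of how AWS's internal clients behave.

```python
# Minimal sketch: bound retries and timeouts so a client does not contribute
# to a retry storm when an endpoint is degraded. Region is illustrative.
import boto3
from botocore.config import Config

client_config = Config(
    retries={"max_attempts": 3, "mode": "adaptive"},  # client-side rate limiting
    connect_timeout=2,
    read_timeout=5,
)

# Any AWS client can take this config; DynamoDB is used here because its
# endpoint was the one affected in this outage.
dynamodb = boto3.client("dynamodb", region_name="us-east-1", config=client_config)
```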
Rough Dependency Mapping of Key Affected AWS Services and Their DNS Endpoint Dependencies
This dependency mapping and analysis are personal assessments based on publicly available AWS documentation, outage reports, and professional experience. Due to AWS’s proprietary and complex architecture, some inferred details may not exactly represent internal implementations. This post aims to provide an informed approximation grounded in official public information and practical knowledge, not an authoritative AWS internal architecture description.
- DynamoDB (dynamodb.us-east-1.amazonaws.com)
- Services that depend on DynamoDB:
- IAM: Uses DynamoDB to store and retrieve authentication tokens, session state, and authorization policies. This enables IAM to validate credentials and enforce access control.
- Lambda: Uses DynamoDB for state persistence and event metadata storage. Lambda functions may read/write data to DynamoDB tables as part of normal workflows.
- CloudWatch: Stores custom metrics and alarms related to resource usage and function executions in DynamoDB.
- Why the dependency matters: DynamoDB acts as a fast, globally distributed NoSQL store holding critical authorization, session, and configuration data. If its endpoint cannot be resolved or reached, IAM cannot authenticate or authorize, leading to login and API failures.
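A quick way to see this failure mode from the outside is to check whether the regional endpoint resolves at all. The sketch below uses only the Python standard library; the endpoint name is the public one, and the fail-fast reaction is an assumption about how a caller should behave.

```python
# Minimal sketch: check whether the regional DynamoDB endpoint resolves.
# This is the kind of dependency that silently breaks everything above it.
import socket

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def resolves(hostname: str) -> bool:
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        # Resolution failure: callers should fall back or fail fast,
        # not retry in a tight loop.
        return False

if __name__ == "__main__":
    print(f"{ENDPOINT} resolves: {resolves(ENDPOINT)}")
```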
- IAM (Identity and Access Management)
- Depends on:
- DynamoDB: for policy storage, session tokens, and metadata.
- KMS (Key Management Service): for cryptographic key operations to securely sign and validate tokens.
- Lambda: for custom authorization flows and policy evaluations that can trigger functions dynamically.
- Services that depend on IAM:
- All AWS Services: Every service requiring access control checks (EC2, Lambda, S3, etc.) queries IAM for validated credentials and permissions.
- AWS Console & Support: User portal and case-raising systems rely on IAM for authentication and enrollment.
- Why the dependency matters: IAM is the cornerstone of secure identity and access control across AWS. Any interruption cascades into login failures and administrative lockouts.
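As a hedged illustration, a lightweight way to probe whether the identity path works from a given region is an STS get_caller_identity call. The region list is arbitrary; this is a diagnostic sketch, not a description of AWS's internal checks.

```python
# Minimal sketch: probe whether credential validation works from a given
# region by calling STS. Region names are illustrative.
import boto3
from botocore.exceptions import BotoCoreError, ClientError

def auth_path_healthy(region: str) -> bool:
    """Return True if STS can validate the caller's credentials in `region`."""
    sts = boto3.client("sts", region_name=region)
    try:
        sts.get_caller_identity()
        return True
    except (BotoCoreError, ClientError):
        return False

for region in ("us-east-1", "us-west-2"):
    print(region, "auth OK" if auth_path_healthy(region) else "auth FAILED")
```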
- Lambda
- Depends on:
- S3: for fetching function code and layers during cold starts.
- IAM: for getting execution roles and permission tokens.
- Event Sources: S3, EventBridge, DynamoDB Streams, and similar sources that trigger executions.
- Services that depend on Lambda:
- Application Workflows and System Integrations: Lambda enables event-driven architectures, allowing asynchronous processing in many AWS services.
- Why the dependency matters: Lambda’s dynamic, scalable compute depends on timely availability of code from S3, secure token access via IAM, and event triggers—all reliant on DNS-based resolution and availability.
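To make that coupling concrete, below is a hedged sketch of a Lambda handler that degrades gracefully when its DynamoDB dependency cannot be reached. The table name, key schema, and fallback defaults are hypothetical.

```python
# Minimal sketch of a Lambda handler that degrades gracefully when its
# DynamoDB dependency is unreachable. Table name, key, and fallback values
# are hypothetical.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, EndpointConnectionError

TABLE = "app-config"  # hypothetical table
_dynamodb = boto3.client(
    "dynamodb",
    config=Config(retries={"max_attempts": 2, "mode": "standard"},
                  connect_timeout=1, read_timeout=2),
)

FALLBACK_CONFIG = {"feature_flags": "defaults"}  # last-known-good defaults

def handler(event, context):
    try:
        item = _dynamodb.get_item(
            TableName=TABLE, Key={"pk": {"S": "config"}}
        ).get("Item", {})
        config = {"feature_flags": item.get("flags", {}).get("S", "defaults")}
    except (EndpointConnectionError, ClientError):
        # Serve defaults rather than failing the whole invocation.
        config = FALLBACK_CONFIG
    return {"statusCode": 200, "config": config}
```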
- EC2 and VPC
- Depends on:
- IAM: for instance credentials and access tokens.
- Metadata Service: to fetch configuration and instance metadata at runtime.
- AMI Catalogs (via S3/EC2 API Endpoints): for retrieving machine images to launch new instances.
- Services that depend on EC2:
- Customer Applications and Services: rely on EC2 instances for compute, networking, and storage access.
- Why the dependency matters: EC2 provisioning and ongoing instance operations rely on credential validation and configuration data resolvable only through DNS-based AWS endpoints. Failures in these dependencies delay provisioning and impact workloads.
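For the metadata dependency specifically, the sketch below shows the standard IMDSv2 token handshake an instance uses to fetch its own metadata. Note that IMDS is reached via a link-local address rather than a DNS name, so it is one of the few dependencies here that does not hinge on name resolution. The snippet only works from inside an EC2 instance.

```python
# Minimal sketch: fetch instance metadata via IMDSv2 (token-based).
# Only works when run on an EC2 instance; paths shown are standard IMDS paths.
import urllib.request

IMDS = "http://169.254.169.254"

def imds_token(ttl_seconds: int = 300) -> str:
    req = urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()

def instance_id() -> str:
    req = urllib.request.Request(
        f"{IMDS}/latest/meta-data/instance-id",
        headers={"X-aws-ec2-metadata-token": imds_token()},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()

if __name__ == "__main__":
    print("instance-id:", instance_id())
```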
- CloudWatch
- Depends on:
- IAM: for authenticating metric and log uploads.
- DynamoDB or other data stores: for storing monitoring data and alarm state.
- Services that depend on CloudWatch:
- All AWS Users and Services: rely on CloudWatch for operational visibility and automated response triggers.
- Why the dependency matters: Loss of monitoring visibility impairs incident response and the auto-remediation capabilities that matter most during outages.
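One practical takeaway is to treat metric publishing as best-effort so a monitoring outage cannot block the application path. In the sketch below, the namespace and metric name are made up.

```python
# Minimal sketch: publish a metric without letting a CloudWatch outage block
# the application path. Namespace and metric name are made up.
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

_cloudwatch = boto3.client(
    "cloudwatch",
    config=Config(retries={"max_attempts": 1}, connect_timeout=1, read_timeout=2),
)

def emit_metric(name: str, value: float) -> None:
    """Best-effort metric publish; never raises into the caller."""
    try:
        _cloudwatch.put_metric_data(
            Namespace="MyApp/Resilience",
            MetricData=[{"MetricName": name, "Value": value, "Unit": "Count"}],
        )
    except (BotoCoreError, ClientError):
        # Swallow monitoring failures; log locally if needed.
        pass

emit_metric("FailoverAttempt", 1)
```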
- Route 53
- Depends on:
- Internal Control Plane Services: to verify DNS zones, health checks, and routing policies.
- Services that depend on Route 53:
- All AWS Services and Customer Applications: depend on Route 53 for DNS resolution, failover routing, and global traffic management.
- Why the dependency matters: DNS is foundational for AWS internal and external communications. Route 53’s partial degradation affected failover and traffic routing during the outage.
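For completeness, this is roughly what DNS-level failover looks like when Route 53 is healthy: a primary/secondary record pair tied to a health check. The hosted zone ID, health check ID, and hostnames below are placeholders.

```python
# Minimal sketch: DNS failover records so Route 53 health checks shift traffic
# to a standby region automatically. Hosted zone ID, health check ID, and
# hostnames are placeholders.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "ZEXAMPLE123"          # placeholder
PRIMARY_HEALTH_CHECK_ID = "hc-primary"  # placeholder

def record(identifier, failover_role, target, health_check_id=None):
    rrset = {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": identifier,
        "Failover": failover_role,   # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        rrset["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": rrset}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        record("primary", "PRIMARY", "app.us-east-1.example.com",
               PRIMARY_HEALTH_CHECK_ID),
        record("standby", "SECONDARY", "app.us-west-2.example.com"),
    ]},
)
```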
What Customers Did During the Outage — Help or Hurt?
Many customers sought to fail over to standby regions. However:
- Operators' ability to log in to management consoles and promote Disaster Recovery (DR) regions was impaired, because IAM's global authentication backbone remained dependent on US-East-1 endpoints.
- Hybrid on-prem + AWS DR setups faced manual complexity: on-prem services had to be reconfigured to point at DR sites.
- Traffic redirection often requires updating Route 53 DNS records for warm/standby sites. While Route 53 health checks ordinarily enable hot-hot (active-active) failover by routing traffic away from degraded sites, Route 53 itself experienced partial degradation, limiting automated failover efficacy.
- Many customers reported backlogs and slow performance in US-East-1, driving them to failover attempts that risked data conflicts due to asynchronous replication, especially for DynamoDB global tables and IAM policies (a quick replica-status check is sketched after this list).
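Before promoting a DR region, it is worth confirming the replication state of the tables you depend on. The sketch below lists replica statuses for a DynamoDB global table; the table and region names are illustrative.

```python
# Minimal sketch: before promoting a DR region, check the replica status of a
# DynamoDB global table. Table and region names are illustrative.
import boto3

TABLE = "orders"  # hypothetical global table

def replica_statuses(table_name: str, region: str = "us-west-2"):
    """Return {region: status} for every replica of a global table."""
    dynamodb = boto3.client("dynamodb", region_name=region)
    desc = dynamodb.describe_table(TableName=table_name)["Table"]
    return {
        replica["RegionName"]: replica.get("ReplicaStatus", "UNKNOWN")
        for replica in desc.get("Replicas", [])
    }

if __name__ == "__main__":
    for region, status in replica_statuses(TABLE).items():
        print(f"{region}: {status}")
```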
Did Login Failures Occur Across Regions, and What Did That Mean for Disaster Recovery?
Yes. Because IAM and DynamoDB global tables anchor on US-East-1, login and authentication failures were seen in failover regions. Effective disaster recovery requires not only traffic failover but also resilient global state replication and authentication services. Without this, DR activation is hampered by login and token validation failures.
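One commonly recommended mitigation is to mint temporary credentials against a regional STS endpoint rather than the global one, so credential vending in a DR region does not hinge on US-East-1. The region and role ARN in this sketch are hypothetical.

```python
# Minimal sketch: use a regional STS endpoint so temporary-credential requests
# do not depend on the global endpoint. Region and role ARN are hypothetical.
import boto3

REGION = "us-west-2"

sts = boto3.client(
    "sts",
    region_name=REGION,
    endpoint_url=f"https://sts.{REGION}.amazonaws.com",
)

# Temporary credentials minted against the regional endpoint.
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/dr-operator",  # hypothetical role
    RoleSessionName="dr-failover",
)["Credentials"]

print("Session expires at:", creds["Expiration"])
```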
Official AWS Root Cause Summary (Public)
Amazon confirmed the core issue was a DNS resolution failure for DynamoDB API endpoints in the US-East-1 region starting late October 19, 2025. Though DNS issues were mitigated early October 20, retry storms and internal networking load balancer faults prolonged service impact for hours, affecting thousands of customers and multiple AWS services, including Amazon’s own platforms.
Final Thoughts: DNS Is an Unavoidable, Fundamental Risk, Not an AWS Fault
DNS underpins all distributed services globally and cannot be engineered to be infallible. This outage highlights the need for system architects to anticipate DNS failures and to build architectures with decoupled control planes, multi-region resilience, caching, and failover strategies that favor graceful degradation over catastrophic failure.
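As one small example of caching for graceful degradation, here is a sketch of a resolver wrapper that reuses the last successful lookup when live resolution fails. Whether serving a stale address is safe depends on the service behind it, so treat this as illustrative only.

```python
# Minimal sketch: a resolver wrapper that caches the last successful lookup and
# falls back to it when live resolution fails. Illustrative only; whether a
# stale address is safe to reuse depends on the service behind it.
import socket
import time

_cache = {}  # hostname -> (addresses, timestamp)

def resolve_with_fallback(hostname: str, port: int = 443, max_stale_s: int = 3600):
    try:
        infos = socket.getaddrinfo(hostname, port)
        addresses = sorted({info[4][0] for info in infos})
        _cache[hostname] = (addresses, time.time())
        return addresses
    except socket.gaierror:
        cached = _cache.get(hostname)
        if cached and time.time() - cached[1] <= max_stale_s:
            # Degrade gracefully: reuse the last-known-good answer.
            return cached[0]
        raise  # no usable fallback; surface the failure

if __name__ == "__main__":
    print(resolve_with_fallback("dynamodb.us-east-1.amazonaws.com"))
```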