Most teams design for instance and zone failures but treat region-level outages as someone else's problem. Region-level failures are rare — but they are not theoretical. AWS us-east-1 has had multiple significant incidents. Azure AD suffered a global authentication outage in 2023. Google Cloud's europe-west9 went offline due to a data center fire.
When a region fails, the blast radius is not one service. It is every workload, every database, every queue, and every control plane operation scoped to that region.
Multi-AZ Does Not Protect Against Regional Failures
Multi-AZ protects against data center failures. It does not protect against:
- Regional control plane failures — the API that manages your resources is regional. If it degrades, you cannot scale or deploy.
- Regional service outages — SQS, Lambda, DynamoDB, Cosmos DB are all regional.
- Shared fate dependencies — IAM, Secrets Manager, Key Vault are regional. If your app cannot retrieve secrets, it doesn't matter that compute is healthy across three AZs.
The December 2021 AWS us-east-1 incident demonstrated this. Services in unaffected AZs experienced degradation because their dependencies were not AZ-independent.
Multi-Region Architecture Patterns
Pilot Light — secondary region has minimum infrastructure (DB replicas, networking). Compute provisioned on failover. RTO: 15-60 min. Cost: ~10-15% of primary.
Warm Standby — secondary runs a scaled-down but fully functional copy. On failover, scale up and promote DB. RTO: 5-15 min. Cost: ~25-40% of primary.
Active-Active — both regions serve traffic simultaneously. No failover needed. Requires multi-region writes (DynamoDB Global Tables, Cosmos DB) and conflict resolution. RTO: near-zero. Cost: ~80-100%+ of primary.
Data Replication: The Hardest Problem
- Synchronous — zero data loss, but adds 50-150ms to every write. Impractical for most workloads.
- Asynchronous — no write latency impact, but creates a replication lag window where data can be lost if primary fails.
For active-active with async replication, you need a conflict resolution strategy. Last-writer-wins works for profiles and preferences. It silently drops writes for counters and balances — use application-level merge or CRDTs there.
Practical rule: if you cannot define a conflict resolution strategy for a data entity, route its writes to a single primary region.
Failover Automation
Manual failover is not failover. Under the stress of a region-level incident, manual steps fail or take far longer than practiced.
- DNS-based failover (Route 53, Traffic Manager) with health checks on actual regional functionality — not just process liveness
- Database promotion automated via API (Aurora Global: under 1 minute, RDS cross-region: 5-10 minutes)
- Test quarterly — not a tabletop exercise, an actual failover. Measure real RTO. Fix the gaps.
Common Mistakes
- Untested failover — an assumption, not a plan
- Hidden regional dependencies — auth provider or secrets manager pinned to one region
- Same deployment pipeline for both regions — a bad deploy takes down both simultaneously
- No capacity planning — secondary region hits service quotas during scale-up
Read the Full Article
This is a summary of my deep dive into multi-region resilience. The full article covers all patterns with AWS and Azure architecture sketches, cost analysis, and a decision framework:
👉 Designing Cloud-Native Systems That Survive Region-Level Failures — Full Article
The full article includes:
- Multi-AZ vs multi-region — what each actually protects (and what it doesn't)
- Three patterns with RTO/RPO/cost profiles (Pilot Light, Warm Standby, Active-Active)
- AWS and Azure architecture sketches for active-active
- Data replication deep dive (sync vs async, managed DB options, conflict resolution)
- Failover automation with Route 53 config and health check design
- Cost vs resilience decision framework with workload tiering
- Six common mistakes that break multi-region architectures
Top comments (0)