Designing Cloud-Native Systems That Survive Region-Level Failures

#cloud #architecture #aws #distributedsystems

Most teams design for instance and zone failures but treat region-level outages as someone else's problem. Region-level failures are rare — but they are not theoretical. AWS us-east-1 has had multiple significant incidents. Azure AD suffered a global authentication outage in 2023. Google Cloud's europe-west9 went offline due to a data center fire.

When a region fails, the blast radius is not one service. It is every workload, every database, every queue, and every control plane operation scoped to that region.

Multi-AZ Does Not Protect Against Regional Failures

Multi-AZ protects against data center failures. It does not protect against:

Regional control plane failures — the API that manages your resources is regional. If it degrades, you cannot scale or deploy.
Regional service outages — SQS, Lambda, DynamoDB, Cosmos DB are all regional.
Shared fate dependencies — IAM, Secrets Manager, Key Vault are regional. If your app cannot retrieve secrets, it doesn't matter that compute is healthy across three AZs.

The December 2021 AWS us-east-1 incident demonstrated this. Services in unaffected AZs experienced degradation because their dependencies were not AZ-independent.

Multi-Region Architecture Patterns

Pilot Light — secondary region has minimum infrastructure (DB replicas, networking). Compute provisioned on failover. RTO: 15-60 min. Cost: ~10-15% of primary.

Warm Standby — secondary runs a scaled-down but fully functional copy. On failover, scale up and promote DB. RTO: 5-15 min. Cost: ~25-40% of primary.

Active-Active — both regions serve traffic simultaneously. No failover needed. Requires multi-region writes (DynamoDB Global Tables, Cosmos DB) and conflict resolution. RTO: near-zero. Cost: ~80-100%+ of primary.

Data Replication: The Hardest Problem

Synchronous — zero data loss, but adds 50-150ms to every write. Impractical for most workloads.
Asynchronous — no write latency impact, but creates a replication lag window where data can be lost if primary fails.

For active-active with async replication, you need a conflict resolution strategy. Last-writer-wins works for profiles and preferences. It silently drops writes for counters and balances — use application-level merge or CRDTs there.

Practical rule: if you cannot define a conflict resolution strategy for a data entity, route its writes to a single primary region.

Failover Automation

Manual failover is not failover. Under the stress of a region-level incident, manual steps fail or take far longer than practiced.

DNS-based failover (Route 53, Traffic Manager) with health checks on actual regional functionality — not just process liveness
Database promotion automated via API (Aurora Global: under 1 minute, RDS cross-region: 5-10 minutes)
Test quarterly — not a tabletop exercise, an actual failover. Measure real RTO. Fix the gaps.

Common Mistakes

Untested failover — an assumption, not a plan
Hidden regional dependencies — auth provider or secrets manager pinned to one region
Same deployment pipeline for both regions — a bad deploy takes down both simultaneously
No capacity planning — secondary region hits service quotas during scale-up

Read the Full Article

This is a summary of my deep dive into multi-region resilience. The full article covers all patterns with AWS and Azure architecture sketches, cost analysis, and a decision framework:

👉 Designing Cloud-Native Systems That Survive Region-Level Failures — Full Article

The full article includes:

Multi-AZ vs multi-region — what each actually protects (and what it doesn't)
Three patterns with RTO/RPO/cost profiles (Pilot Light, Warm Standby, Active-Active)
AWS and Azure architecture sketches for active-active
Data replication deep dive (sync vs async, managed DB options, conflict resolution)
Failover automation with Route 53 config and health check design
Cost vs resilience decision framework with workload tiering
Six common mistakes that break multi-region architectures