DEV Community

Alok Ranjan Daftuar

Posted on • Originally published at aloknecessary.github.io

Designing Cloud-Native Systems That Survive Region-Level Failures

Most teams design for instance and zone failures but treat region-level outages as someone else's problem. Region-level failures are rare, but they are not theoretical. AWS us-east-1 has had multiple significant incidents. Azure AD suffered a global authentication outage in March 2021. Google Cloud's europe-west9 went offline due to a data center fire.

When a region fails, the blast radius is not one service. It is every workload, every database, every queue, and every control plane operation scoped to that region.


Multi-AZ Does Not Protect Against Regional Failures

Multi-AZ protects against data center failures. It does not protect against:

  • Regional control plane failures — the API that manages your resources is regional. If it degrades, you cannot scale or deploy.
  • Regional service outages — SQS, Lambda, DynamoDB, Cosmos DB are all regional.
  • Shared fate dependencies — IAM, Secrets Manager, Key Vault are regional. If your app cannot retrieve secrets, it doesn't matter that compute is healthy across three AZs.

The December 2021 AWS us-east-1 incident demonstrated this. Services in unaffected AZs experienced degradation because their dependencies were not AZ-independent.
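One way to reduce this kind of shared-fate exposure is to read a secret from a replica in another region when the primary region's endpoint fails. The following is a minimal sketch, assuming the secret has already been replicated to every listed region. `fetch_secret`, `RegionUnavailable`, and the region order are hypothetical stand-ins for a real provider SDK call, not an actual API:

```python
from typing import Callable, Sequence


class RegionUnavailable(Exception):
    """Hypothetical error raised when a regional endpoint cannot be reached."""


def get_secret_with_fallback(
    name: str,
    regions: Sequence[str],
    fetch_secret: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each region in order and return (region, secret_value).

    fetch_secret(region, name) is a placeholder for a real SDK call. It is
    assumed to raise RegionUnavailable when that region cannot serve the request.
    """
    errors = []
    for region in regions:
        try:
            return region, fetch_secret(region, name)
        except RegionUnavailable as exc:
            errors.append(f"{region}: {exc}")
    raise RegionUnavailable("all regions failed: " + "; ".join(errors))


# Simulated provider: the primary region is down, the replica still answers.
def fake_fetch(region: str, name: str) -> str:
    if region == "us-east-1":
        raise RegionUnavailable("endpoint timed out")
    return f"value-of-{name}"


region, value = get_secret_with_fallback(
    "db-password", ["us-east-1", "us-west-2"], fake_fetch
)
print(region, value)
```

The fallback only helps if the secret was replicated before the outage, so replication has to be part of normal operations rather than something attempted during the incident.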


Multi-Region Architecture Patterns

Pilot Light — secondary region has minimum infrastructure (DB replicas, networking). Compute provisioned on failover. RTO: 15-60 min. Cost: ~10-15% of primary.

Warm Standby — secondary runs a scaled-down but fully functional copy. On failover, scale up and promote DB. RTO: 5-15 min. Cost: ~25-40% of primary.

Active-Active — both regions serve traffic simultaneously. No failover needed. Requires multi-region writes (DynamoDB Global Tables, Cosmos DB) and conflict resolution. RTO: near-zero. Cost: ~80-100%+ of primary.
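The trade-off between the three patterns can be expressed as a small lookup: pick the cheapest pattern whose worst-case RTO still meets the target. This is an illustrative sketch only. The figures are the range endpoints quoted above, "near-zero" is modeled as 0 minutes, and the `Pattern` class and `cheapest_pattern` function are names invented for this example:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    name: str
    worst_case_rto_min: float  # upper end of the RTO range quoted above
    min_cost_pct: int          # lower end of the cost range, as % of primary


# Figures copied from the pattern descriptions; active-active RTO is
# modeled as 0 minutes to represent "near-zero".
PATTERNS = [
    Pattern("Pilot Light", 60, 10),
    Pattern("Warm Standby", 15, 25),
    Pattern("Active-Active", 0, 80),
]


def cheapest_pattern(rto_target_min: float) -> Pattern:
    """Return the lowest-cost pattern whose worst-case RTO meets the target."""
    candidates = [p for p in PATTERNS if p.worst_case_rto_min <= rto_target_min]
    if not candidates:
        raise ValueError("no pattern meets this RTO target")
    return min(candidates, key=lambda p: p.min_cost_pct)


for target in (60, 20, 5):
    print(target, "min ->", cheapest_pattern(target).name)
```

A real decision would also weigh RPO and operational complexity, which this sketch leaves out.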


Data Replication: The Hardest Problem

  • Synchronous — zero data loss, but adds 50-150ms to every write. Impractical for most workloads.
  • Asynchronous — no write latency impact, but creates a replication lag window where data can be lost if primary fails.
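A rough way to put numbers on this trade-off: synchronous replication adds the cross-region round trip to each write, while asynchronous replication leaves roughly lag multiplied by write rate in acknowledged writes exposed if the primary fails. The sketch below is back-of-the-envelope arithmetic, and the input values are made-up illustrations rather than measurements:

```python
def sync_write_latency_ms(local_write_ms: float, cross_region_rtt_ms: float) -> float:
    """Synchronous replication waits for the remote acknowledgement."""
    return local_write_ms + cross_region_rtt_ms


def writes_at_risk(replication_lag_s: float, writes_per_s: float) -> float:
    """Rough estimate of acknowledged writes not yet replicated."""
    return replication_lag_s * writes_per_s


# Illustrative inputs (assumptions, not benchmarks).
latency = sync_write_latency_ms(local_write_ms=5, cross_region_rtt_ms=80)
exposed = writes_at_risk(replication_lag_s=2, writes_per_s=500)
print(f"sync write latency: {latency} ms")
print(f"async writes at risk: {exposed:.0f}")
```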

For active-active with async replication, you need a conflict resolution strategy. Last-writer-wins works for profiles and preferences. It silently drops writes for counters and balances — use application-level merge or CRDTs there.
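The difference is easy to see with a counter incremented concurrently in two regions. Below is a minimal sketch comparing last-writer-wins with a grow-only counter (G-Counter), one of the simplest CRDTs. The region names, timestamps, and starting value are invented for the example:

```python
def lww_merge(a: tuple[float, int], b: tuple[float, int]) -> tuple[float, int]:
    """Last-writer-wins: keep the (timestamp, value) pair with the later timestamp."""
    return a if a[0] >= b[0] else b


class GCounter:
    """Grow-only counter CRDT: one slot per region, merged by element-wise max."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}

    def increment(self, region: str, amount: int = 1) -> None:
        self.counts[region] = self.counts.get(region, 0) + amount

    def merge(self, other: "GCounter") -> None:
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


# Both regions start from a counter value of 10 and each records one increment.
east_lww = (100.0, 11)  # (timestamp, value) written in region "east"
west_lww = (100.5, 11)  # concurrent write in region "west"
lww_result = lww_merge(east_lww, west_lww)[1]

east, west = GCounter(), GCounter()
for replica in (east, west):
    replica.counts = {"east": 10}  # shared starting state, separate dict per replica
east.increment("east")
west.increment("west")
east.merge(west)
crdt_result = east.value()

print("LWW:", lww_result, "G-Counter:", crdt_result)
```

With last-writer-wins the merged value is 11, so one increment is lost, while the G-Counter converges to 12. A G-Counter only supports increments; balances that also decrease need a structure such as a PN-Counter or an application-level merge.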

Practical rule: if you cannot define a conflict resolution strategy for a data entity, route its writes to a single primary region.
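In code, that rule can be a simple routing decision made before each write. This is a sketch under assumed names: the entity types, the strategy labels, and `PRIMARY_REGION` are all hypothetical:

```python
# Entities with an agreed conflict resolution strategy (hypothetical examples).
CONFLICT_STRATEGIES = {
    "user_profile": "last_writer_wins",
    "page_view_counter": "crdt_counter",
}

PRIMARY_REGION = "us-east-1"  # assumed single-writer region for everything else


def write_region(entity: str, local_region: str) -> str:
    """Write locally only if the entity has a defined conflict strategy."""
    if entity in CONFLICT_STRATEGIES:
        return local_region
    return PRIMARY_REGION


print(write_region("user_profile", "eu-west-1"))
print(write_region("account_balance", "eu-west-1"))
```

Defaulting unknown entities to the primary region means a new data type is single-writer until someone explicitly decides how its conflicts resolve.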


Failover Automation

Manual failover is not failover. Under the stress of a region-level incident, manual steps fail or take far longer than practiced.

  • DNS-based failover (Route 53, Traffic Manager) with health checks on actual regional functionality — not just process liveness
  • Database promotion automated via API (Aurora Global: under 1 minute, RDS cross-region: 5-10 minutes)
  • Test quarterly — not a tabletop exercise, an actual failover. Measure real RTO. Fix the gaps.
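A health check that reflects regional functionality exercises the dependencies a request actually needs, instead of only confirming that the process is running. Here is a minimal sketch of the aggregation logic. The probe functions are hypothetical placeholders, and real probes would need their own timeouts:

```python
from typing import Callable, Mapping


def regional_health(
    checks: Mapping[str, Callable[[], bool]],
) -> tuple[bool, dict[str, bool]]:
    """Run every dependency probe; the region is healthy only if all pass.

    A probe that raises is treated as a failure, so an erroring dependency
    marks the region unhealthy even though this process is still alive.
    """
    results: dict[str, bool] = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    return all(results.values()), results


# Hypothetical probes: the database answers, the queue API does not.
def db_round_trip() -> bool:
    return True


def queue_round_trip() -> bool:
    raise TimeoutError("queue API did not respond")


healthy, detail = regional_health(
    {"database": db_round_trip, "queue": queue_round_trip}
)
print(healthy, detail)
```

The endpoint polled by the DNS health check would return a failing status whenever `healthy` is false, which is what allows a regional service outage to trigger failover while the compute itself still looks fine.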

Common Mistakes

  • Untested failover — an assumption, not a plan
  • Hidden regional dependencies — auth provider or secrets manager pinned to one region
  • Same deployment pipeline for both regions — a bad deploy takes down both simultaneously
  • No capacity planning — secondary region hits service quotas during scale-up
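The capacity mistake can be caught ahead of time by comparing the secondary region's quotas against what it would need at full failover load. A small sketch of that comparison follows; the resource names and numbers are invented, and in practice the quota values would come from your provider's quota API:

```python
def quota_gaps(required: dict[str, int], quotas: dict[str, int]) -> dict[str, int]:
    """Return the shortfall per resource where failover demand exceeds the quota."""
    return {
        resource: need - quotas.get(resource, 0)
        for resource, need in required.items()
        if need > quotas.get(resource, 0)
    }


# Hypothetical figures for a secondary region at full failover load.
required_at_failover = {"vcpus": 400, "load_balancers": 10}
secondary_quotas = {"vcpus": 256, "load_balancers": 20}

gaps = quota_gaps(required_at_failover, secondary_quotas)
print(gaps)
```

Running a check like this on a schedule surfaces the gap while there is still time to request a quota increase.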

Read the Full Article

This is a summary of my deep dive into multi-region resilience. The full article covers all patterns with AWS and Azure architecture sketches, cost analysis, and a decision framework:

👉 Designing Cloud-Native Systems That Survive Region-Level Failures — Full Article

The full article includes:

  • Multi-AZ vs multi-region — what each actually protects (and what it doesn't)
  • Three patterns with RTO/RPO/cost profiles (Pilot Light, Warm Standby, Active-Active)
  • AWS and Azure architecture sketches for active-active
  • Data replication deep dive (sync vs async, managed DB options, conflict resolution)
  • Failover automation with Route 53 config and health check design
  • Cost vs resilience decision framework with workload tiering
  • Six common mistakes that break multi-region architectures
