DEV Community

Samson Tanimawo

Multi-Region Failover: Lessons from Running It Hot

Why "Hot" Matters

Three multi-region strategies:

Cold: Backup region is off. You start it when primary fails. RTO: hours.

Warm: Backup region runs on minimum capacity. Scale up on failover. RTO: 15-30 minutes.

Hot: Both regions serve live traffic simultaneously. RTO: seconds.

If you need an RTO under 15 minutes, you need hot. Everything else is marketing copy.

The Illusion of Warm Failover

Warm sounds great on paper. In practice, on the day you need it:

  • The warm region has never seen real load
  • DNS changes take 5-15 minutes to propagate through resolver caches
  • Autoscaling lags because it's starting cold
  • Your team has never run on the warm region
  • Half your connection strings are hardcoded to the primary

Warm failover works in tabletop exercises. It does not work during real incidents under pressure. We learned this the hard way.

Running It Hot: The Architecture

         ┌─────────────┐
         │  DNS / CDN  │
         └──────┬──────┘
                │
     ┌──────────┴──────────┐
     ▼                     ▼
┌──────────┐         ┌──────────┐
│ Region A │         │ Region B │
│  50% TX  │         │  50% TX  │
└────┬─────┘         └─────┬────┘
     │                     │
     └───────┬─────┬───────┘
             ▼     ▼
         ┌─────────────┐
         │  Shared DB  │
         │  (writer +  │
         │  replicas)  │
         └─────────────┘

Both regions always serve traffic. Split is usually 50/50 but can shift.
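
To make the shifting split concrete, here is a minimal sketch of weighted region selection. In a real setup the DNS or CDN layer does this weighting; the `pick_region` helper, the region names, and the request-ID hashing are all assumptions made for illustration.

```python
import hashlib
from typing import Dict


def pick_region(request_id: str, weights: Dict[str, int]) -> str:
    """Map a request to a region in proportion to the configured weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one region must have a positive weight")
    # Hash the request ID into a stable bucket in [0, total).
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    # Walk regions in sorted order so the mapping is deterministic.
    for region in sorted(weights):
        if bucket < weights[region]:
            return region
        bucket -= weights[region]
    raise AssertionError("unreachable: bucket is always below total")


# Hypothetical weight tables: normal operation and a full shift to Region B.
NORMAL = {"region-a": 50, "region-b": 50}
ALL_TO_B = {"region-a": 0, "region-b": 100}
```

Setting a region's weight to 0 removes it from rotation without touching the other region's configuration, which is the same lever a failover or a drill pulls.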

The Hard Parts

1. Database replication

This is where multi-region gets hard. Three options:

  • Single writer, multi-region readers: simplest, but writes pay cross-region latency
  • Multi-master: both regions accept writes, so it is truly hot, but it is complex because it requires conflict resolution
  • Region-sharded: users pinned to a region for writes, simplest if your data model allows it

We use region-sharded for user-scoped data and single-writer for global config.
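
A rough sketch of what region-sharded write routing can look like, assuming users are pinned by a stable hash of their ID. The region names and the `home_region` and `route_write` helpers are hypothetical, not the setup described above.

```python
import hashlib

REGIONS = ["region-a", "region-b"]  # hypothetical region names


def home_region(user_id: str) -> str:
    """Pin each user to one region for writes using a stable hash."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return REGIONS[int.from_bytes(digest[:8], "big") % len(REGIONS)]


def route_write(user_id: str, local_region: str) -> str:
    """Return 'local' if this region owns the user, else where to forward."""
    owner = home_region(user_id)
    return "local" if owner == local_region else f"forward:{owner}"
```

Hash-modulo pinning remaps users whenever the region list changes, so a lookup table stored with the user record may be a better fit if regions are added later.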

2. Session stickiness

If a user's session lives in Region A and their next request lands in Region B, Region B has no record of it and the user is effectively logged out.

Solutions:

  • JWT tokens with no server state
  • Session data in a globally replicated store (DynamoDB Global Tables, CockroachDB)
  • Cookie routing that pins a user to a region

3. Cache coherence

Region A's cache doesn't know when Region B updates the database. Options:

  • Short TTLs (1-5 minutes) and accept the inconsistency
  • Pub/sub cache invalidation across regions (complex)
  • Read-through caches only, never write-through
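
A minimal sketch of the short-TTL, read-through option. The `loader` callable stands in for a database read and is hypothetical; the clock is injectable only so the expiry logic can be exercised without waiting.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ReadThroughTTLCache:
    """Per-region cache: serve from memory until the TTL expires, then reload."""

    def __init__(
        self,
        loader: Callable[[str], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader  # hypothetical function that reads the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Still fresh locally, but possibly stale relative to the other region.
            return entry[1]
        # Expired or missing: read through to the database and re-cache.
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

The TTL is the upper bound on how long one region can serve a value the other region has already changed, which is the inconsistency you are agreeing to accept.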

The Failover Mechanics

When Region A dies:

  1. Health checks detect failure: Route 53/ALB removes Region A from DNS
  2. Traffic shifts to Region B: already warm, already running
  3. Autoscaling kicks in: Region B doubles capacity
  4. User sessions degrade gracefully: re-authentication, cache warmup
  5. Monitoring reports the shift: team gets paged, not customers

Target: customer-facing latency spike of under 30 seconds, full recovery under 5 minutes.
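
The first two steps can be sketched as a small health tracker that recomputes traffic weights. The failure threshold of 3 and the fail-open behaviour are assumptions for illustration; managed health checks expose their own settings for this.

```python
from dataclasses import dataclass
from typing import Dict, List

FAILURE_THRESHOLD = 3  # assumption: consecutive failed checks before removal


@dataclass
class RegionHealth:
    name: str
    consecutive_failures: int = 0

    def record(self, ok: bool) -> None:
        """Reset the counter on success, increment it on failure."""
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < FAILURE_THRESHOLD


def traffic_weights(regions: List[RegionHealth]) -> Dict[str, int]:
    """Split traffic evenly across healthy regions; unhealthy ones get 0."""
    healthy = [r.name for r in regions if r.healthy]
    if not healthy:
        # Fail open: if every check fails, the checker itself may be broken.
        healthy = [r.name for r in regions]
    share = 100 // len(healthy)
    return {r.name: (share if r.name in healthy else 0) for r in regions}
```

Requiring several consecutive failures trades a few seconds of detection time for fewer false failovers caused by a single dropped probe.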

Testing It Monthly

If you don't test failover monthly, you don't have failover. You have hope.

We do this:

  • First Tuesday of every month, 10 AM
  • Route 100% of traffic to Region B for 30 minutes
  • Watch dashboards, fix anything that degrades
  • Route back to 50/50
  • Document any issues, fix them
  • Repeat next month with the other region
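
If you want the drill calendar generated rather than remembered, a sketch like the following computes the first Tuesday of a month and alternates the target region. Which region goes first is an arbitrary assumption, and both function names are hypothetical.

```python
import datetime


def first_tuesday(year: int, month: int) -> datetime.date:
    """Return the date of the first Tuesday in the given month."""
    first = datetime.date(year, month, 1)
    # date.weekday(): Monday is 0, so Tuesday is 1.
    offset = (1 - first.weekday()) % 7
    return first + datetime.timedelta(days=offset)


def drill_target(year: int, month: int) -> str:
    """Alternate which region receives 100% of traffic each month."""
    month_index = year * 12 + (month - 1)
    return "region-b" if month_index % 2 == 0 else "region-a"


def drill_plan(year: int, month: int) -> str:
    day = first_tuesday(year, month)
    return f"{day.isoformat()} 10:00: route 100% to {drill_target(year, month)} for 30 minutes"
```

Using a running month index instead of the month number keeps the alternation going across the December to January boundary.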

Cost Reality Check

Running hot doubles your compute cost. For most companies, that's $10K-$100K/month.

The question is: what's your revenue per hour of downtime?

Company A: $100K/hr revenue → 99.99% target → hot is worth it
Company B: $1K/hr revenue → 99.9% target → warm is fine

Do the math. Don't copy FAANG patterns if your revenue doesn't justify them.
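
Here is one way to do that math, as a back-of-the-envelope sketch. It assumes lost revenue is linear in downtime, that hot moves you from 99.9% to 99.99%, and that the extra compute bill is the low-end $10K/month figure; all three are simplifying assumptions.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def allowed_downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def yearly_savings(revenue_per_hour: float, from_pct: float, to_pct: float) -> float:
    """Revenue protected per year by raising the availability target."""
    saved_hours = allowed_downtime_hours(from_pct) - allowed_downtime_hours(to_pct)
    return revenue_per_hour * saved_hours


def hot_is_worth_it(revenue_per_hour: float, extra_cost_per_month: float) -> bool:
    """Compare protected revenue against the extra yearly cost of running hot."""
    return yearly_savings(revenue_per_hour, 99.9, 99.99) > extra_cost_per_month * 12


company_a = hot_is_worth_it(revenue_per_hour=100_000, extra_cost_per_month=10_000)
company_b = hot_is_worth_it(revenue_per_hour=1_000, extra_cost_per_month=10_000)
```

A 99.9% target allows about 8.76 hours of downtime a year and 99.99% allows about 0.88, a difference of roughly 7.9 hours. At $100K/hr that is about $788K a year against $120K of extra compute; at $1K/hr it is under $8K, which is why warm is fine for Company B.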

The Operational Complexity Tax

Running hot costs more than money. It costs:

  • More runbooks (one per region)
  • More monitoring (cross-region latency, replication lag)
  • Harder debugging ("which region was this request in?")
  • More compliance surface (data residency rules in each region)
  • More deployment pipelines (usually)

Budget 20% more engineering time for multi-region from day one.

Common Mistakes

  1. Single point of failure in DNS config: your DNS provider becomes the new SPOF
  2. Testing only with healthy traffic: test with 2x normal load during drills
  3. Forgetting about databases: DB failover is the hardest part
  4. Using regions as backup, not active: never tested until crisis
  5. Not planning for split-brain: what if both regions think they're primary?
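
For the split-brain question, one common guard is a fencing token: each promotion to primary is given a higher epoch number, and the storage layer rejects writes that carry an older one. The sketch below is a general technique and not something this article prescribes; the `FencedStore` class is hypothetical and assumes some coordination service hands out the epochs.

```python
from typing import Dict


class FencedStore:
    """Toy storage layer that rejects writes from a stale primary."""

    def __init__(self) -> None:
        self.highest_epoch = 0
        self.data: Dict[str, str] = {}

    def write(self, epoch: int, key: str, value: str) -> bool:
        """Accept the write only if the writer's epoch is current."""
        if epoch < self.highest_epoch:
            # An older primary is still running and does not know it was replaced.
            return False
        self.highest_epoch = epoch
        self.data[key] = value
        return True
```

With this check in the write path, two regions can both believe they are primary, but only the most recently promoted one can actually change data.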

The Minimum Viable Hot Setup

  1. Two regions, stateless app tier, 50/50 traffic
  2. Database: multi-AZ primary, cross-region async replica
  3. CDN/DNS: health-check-based routing
  4. Session: JWT-based (stateless)
  5. Monthly failover drills
  6. Runbooks tested in the last 90 days

Start there. Layer in complexity as you need it.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
