Samson Tanimawo

Posted on Apr 30

Multi-Region Failover: Lessons from Running It Hot

#multiregion #failover #sre #aws

Why "Hot" Matters

Three multi-region strategies:

Cold: Backup region is off. You start it when primary fails. RTO: hours.

Warm: Backup region runs on minimum capacity. Scale up on failover. RTO: 15-30 minutes.

Hot: Both regions serve live traffic simultaneously. RTO: seconds.

If you need an RTO under 15 minutes, you need hot. Everything else is marketing copy.

The Illusion of Warm Failover

Warm sounds great on paper. In practice, on the day you need it:

The warm region has never seen real load
DNS changes take 5-15 minutes to propagate through resolver caches
Autoscaling lags because it's starting cold
Your team has never run on the warm region
Half your connection strings are hardcoded to the primary

Warm failover works in tabletop exercises. It does not work during real incidents under pressure. We learned this the hard way.

Running It Hot: The Architecture

         ┌─────────────┐
         │  DNS / CDN  │
         └──────┬──────┘
                │
     ┌──────────┴──────────┐
     ▼                     ▼
┌──────────┐          ┌──────────┐
│ Region A │          │ Region B │
│  50% TX  │          │  50% TX  │
└────┬─────┘          └─────┬────┘
     │                      │
     └──────┬──────┬────────┘
            ▼      ▼
        ┌─────────────┐
        │  Shared DB  │
        │  (writer +  │
        │  replicas)  │
        └─────────────┘

Both regions always serve traffic. Split is usually 50/50 but can shift.

The Hard Parts

1. Database replication

This is where multi-region gets hard. Three options:

Single writer, multi-region readers: simplest, but writes pay cross-region latency
Multi-master: truly hot for writes, but complex because it requires conflict resolution
Region-sharded: users pinned to a region for writes, simplest if your data model allows it

We use region-sharded for user-scoped data and single-writer for global config.
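
A minimal sketch of that split, assuming each user is assigned a home region once (for example at signup). The region names, endpoint hostnames, and the `User` type are hypothetical and only illustrate the routing decision; they are not taken from any real setup.

```python
from dataclasses import dataclass

# Hypothetical region names and writer endpoints, for illustration only.
WRITER_ENDPOINTS = {
    "region-a": "db-writer.region-a.internal",
    "region-b": "db-writer.region-b.internal",
}
GLOBAL_CONFIG_WRITER = "region-a"  # assumption: global config has one writer


@dataclass(frozen=True)
class User:
    user_id: str
    home_region: str  # assumed to be assigned once and rarely changed


def writer_for_user(user: User) -> str:
    """User-scoped writes always go to the user's home region."""
    return WRITER_ENDPOINTS[user.home_region]


def writer_for_global_config() -> str:
    """Global config has a single writer, wherever the request lands."""
    return WRITER_ENDPOINTS[GLOBAL_CONFIG_WRITER]


def is_cross_region_write(user: User, serving_region: str) -> bool:
    """True when the request landed outside the user's home region,
    so the write pays cross-region latency."""
    return user.home_region != serving_region
```

Tracking `is_cross_region_write` as a metric is one way to see how often routing sends users away from their home region.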

2. Session stickiness

If a user's session is in Region A, and their next request goes to Region B, things break.

Solutions:

JWTs with no server-side state
Session data in a globally replicated store (DynamoDB Global Tables, CockroachDB)
Cookie routing that pins a user to a region
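
To show why the stateless option survives a region switch, here is a rough sketch of an HS256-signed token built with only the Python standard library. It assumes both regions share the same signing secret; the function names are made up for this example. In production you would more likely use a maintained JWT library and also validate the header's `alg` field.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(secret: bytes, user_id: str, ttl_seconds: int,
                now: Optional[float] = None) -> str:
    """Sign a token in whichever region handled the login."""
    now = time.time() if now is None else now
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": user_id, "exp": int(now) + ttl_seconds}
    payload = _b64url(json.dumps(claims).encode())
    mac = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256)
    return f"{header}.{payload}.{_b64url(mac.digest())}"


def verify_token(secret: bytes, token: str,
                 now: Optional[float] = None) -> Optional[dict]:
    """Verify in any region: needs only the shared secret, no session store."""
    now = time.time() if now is None else now
    try:
        header, payload, signature = token.split(".")
        expected = hmac.new(secret, f"{header}.{payload}".encode("ascii"),
                            hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(signature)):
            return None
        claims = json.loads(_b64url_decode(payload))
    except ValueError:  # malformed token, bad base64, or bad JSON
        return None
    return claims if claims.get("exp", 0) > now else None
```

A token issued in Region A verifies in Region B with no lookup, which is the property that makes the session tier a non-issue during failover.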

3. Cache coherence

Region A's cache doesn't know when Region B updates the database. Options:

Short TTLs (1-5 minutes) and accept the inconsistency
Pub/sub cache invalidation across regions (complex)
Read-through caches only, never write-through
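
The short-TTL, read-through combination can be sketched roughly as below. The `TTLReadThroughCache` class, its `loader` callback, and the 60-second default are assumptions for illustration; a real deployment would typically sit on Redis or Memcached rather than an in-process dict.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLReadThroughCache:
    """Per-region cache: reads fall through to the database on a miss or
    after expiry. Writes never touch the cache, so staleness is bounded
    by the TTL instead of by cross-region invalidation."""

    def __init__(self, loader: Callable[[Hashable], Any],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh enough, possibly stale vs. the DB
        value = self._loader(key)  # read through to the source of truth
        self._entries[key] = (now + self._ttl, value)
        return value
```

The trade-off is explicit: after the other region updates a row, this region can serve the old value for up to `ttl_seconds`.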

The Failover Mechanics

When Region A dies:

Health checks detect failure: Route 53/ALB removes Region A from DNS
Traffic shifts to Region B: already warm, already running
Autoscaling kicks in: Region B doubles capacity
User sessions degrade gracefully: re-authentication, cache warmup
Monitoring reports the shift: team gets paged, not customers

Target: customer-facing latency spike of under 30 seconds, full recovery under 5 minutes.
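
The first two steps can be modeled in a few lines. This is a toy simulation, not how Route 53 or an ALB is implemented: `HealthTracker`, `route_weights`, and the threshold of three consecutive failures are all assumptions chosen for illustration.

```python
from typing import Dict, Iterable


class HealthTracker:
    """Marks a region unhealthy after N consecutive failed checks."""

    def __init__(self, regions: Iterable[str], failure_threshold: int = 3):
        self._threshold = failure_threshold
        self._failures: Dict[str, int] = {region: 0 for region in regions}

    def record(self, region: str, ok: bool) -> None:
        # A single success resets the streak; a failure extends it.
        self._failures[region] = 0 if ok else self._failures[region] + 1

    def snapshot(self) -> Dict[str, bool]:
        return {r: n < self._threshold for r, n in self._failures.items()}


def route_weights(healthy: Dict[str, bool]) -> Dict[str, float]:
    """Split traffic evenly across healthy regions; unhealthy ones get 0."""
    up = [region for region, ok in healthy.items() if ok]
    if not up:
        raise RuntimeError("no healthy region left to route to")
    share = 1.0 / len(up)
    return {region: (share if ok else 0.0) for region, ok in healthy.items()}
```

With two regions, three failed checks on Region A move the weights from 50/50 to 0/100, which is why Region B must have headroom (or fast autoscaling) for double its usual load.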

Testing It Monthly

If you don't test failover monthly, you don't have failover. You have hope.

We do this:

First Tuesday of every month, 10 AM
Route 100% of traffic to Region B for 30 minutes
Watch dashboards, fix anything that degrades
Route back to 50/50
Document any issues, fix them
Repeat next month with the other region
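
If you want the schedule in code (for a calendar bot or a scheduled pipeline), the date logic looks roughly like this. The function names and region labels are hypothetical, and the time is a naive local 10:00 since the time zone is not specified here.

```python
import datetime as dt
from typing import Dict, Sequence


def first_tuesday(year: int, month: int) -> dt.date:
    """Date of the first Tuesday in the given month."""
    first = dt.date(year, month, 1)
    offset = (1 - first.weekday()) % 7  # weekday(): Monday=0, Tuesday=1
    return first + dt.timedelta(days=offset)


def drill_plan(year: int, month: int,
               regions: Sequence[str] = ("region-b", "region-a")) -> Dict:
    """30-minute drill window at 10:00, alternating the target region
    each month (the region that receives 100% of traffic)."""
    start = dt.datetime.combine(first_tuesday(year, month), dt.time(10, 0))
    target = regions[(year * 12 + month) % 2]
    return {
        "start": start,
        "end": start + dt.timedelta(minutes=30),
        "target_region": target,
    }
```

Using `year * 12 + month` as the alternation key keeps the regions swapping correctly across the December to January boundary.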

Cost Reality Check

Running hot doubles your compute cost. For most companies, that's $10K-$100K/month.

The question is: what's your revenue per hour of downtime?

Company A: $100K/hr revenue → 99.99% target → hot is worth it
Company B: $1K/hr revenue → 99.9% target → warm is fine

Do the math. Don't copy FAANG patterns if your revenue doesn't justify them.
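
Here is one way to do that math. It is a simplification that assumes lost revenue scales linearly with downtime and that you hit your availability target exactly; the function names are invented for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def allowed_downtime_hours(availability_pct: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def annual_downtime_cost(revenue_per_hour: float, availability_pct: float) -> float:
    """Revenue lost per year if you use the whole downtime budget."""
    return revenue_per_hour * allowed_downtime_hours(availability_pct)


def yearly_saving_from_upgrade(revenue_per_hour: float,
                               from_pct: float, to_pct: float) -> float:
    """Revenue protected per year by moving to a stricter target."""
    return (annual_downtime_cost(revenue_per_hour, from_pct)
            - annual_downtime_cost(revenue_per_hour, to_pct))
```

Under these assumptions, 99.9% allows about 8.76 hours of downtime a year and 99.99% about 53 minutes. For Company A, going from 99.9% to 99.99% protects roughly $788K a year; for Company B, roughly $7.9K. Compare those figures with an extra $120K-$1.2M a year in compute and the conclusion above follows.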

The Operational Complexity Tax

Running hot costs more than money. It costs:

More runbooks (one per region)
More monitoring (cross-region latency, replication lag)
Harder debugging ("which region was this request in?")
More compliance surface (data residency rules for each region)
More deployment pipelines (usually)

Budget 20% more engineering time for multi-region from day one.

Common Mistakes

Single point of failure in DNS config: your DNS provider becomes the new SPOF
Testing only with healthy traffic: test with 2x normal load during drills
Forgetting about databases: DB failover is the hardest part
Using regions as backup, not active: never tested until crisis
Not planning for split-brain: what if both regions think they're primary?
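
One common guard against split-brain is a fencing token: every promotion to primary bumps an epoch number, and the storage layer rejects writes that carry an older epoch. The sketch below is a generic illustration of that technique, not a description of any particular database; `FencedStore` is a hypothetical name, and it assumes the store itself is a single authority.

```python
from typing import Any, Dict


class FencedStore:
    """Accepts a write only if its epoch is at least the highest seen,
    so a region that wrongly still believes it is primary is rejected."""

    def __init__(self) -> None:
        self._highest_epoch = 0
        self.data: Dict[str, Any] = {}

    def write(self, epoch: int, key: str, value: Any) -> bool:
        if epoch < self._highest_epoch:
            return False  # stale primary: refuse instead of corrupting data
        self._highest_epoch = epoch
        self.data[key] = value
        return True
```

If Region A held epoch 1 and Region B was promoted with epoch 2, a late write from Region A is refused rather than silently overwriting Region B's data.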

The Minimum Viable Hot Setup

Two regions, stateless app tier, 50/50 traffic
Database: multi-AZ primary, cross-region async replica
CDN/DNS: health-check-based routing
Session: JWT-based (stateless)
Monthly failover drills
Runbooks tested in last 90 days

Start there. Layer in complexity as you need it.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
