Why "Hot" Matters
Three multi-region strategies:
Cold: Backup region is off. You start it when primary fails. RTO: hours.
Warm: Backup region runs at minimum capacity. Scale up on failover. RTO: 15-30 minutes.
Hot: Both regions serve live traffic simultaneously. RTO: seconds.
If you need an RTO under 15 minutes, you need hot. Everything else is marketing copy.
The Illusion of Warm Failover
Warm sounds great on paper. In practice, on the day you need it:
- The warm region has never seen real load
- DNS cache propagation takes 5-15 minutes
- Autoscaling lags because it's starting cold
- Your team has never run on the warm region
- Half your connection strings are hardcoded to the primary
Warm failover works in tabletop exercises. It does not work during real incidents under pressure. We learned this the hard way.
Running It Hot: The Architecture
         ┌─────────────┐
         │  DNS / CDN  │
         └──────┬──────┘
                │
     ┌──────────┴──────────┐
     ▼                     ▼
┌──────────┐          ┌──────────┐
│ Region A │          │ Region B │
│  50% TX  │          │  50% TX  │
└────┬─────┘          └────┬─────┘
     │                     │
     └───────┬─────┬───────┘
             ▼     ▼
         ┌─────────────┐
         │  Shared DB  │
         │  (writer +  │
         │  replicas)  │
         └─────────────┘
Both regions always serve traffic. Split is usually 50/50 but can shift.
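A minimal sketch of a shiftable split, assuming the weights live in a plain dictionary and each request carries a stable key such as a user ID. The region names and the `pick_region` function are hypothetical. In practice the split is usually enforced at the DNS or CDN layer rather than in application code.

```python
import hashlib
from typing import Dict


def pick_region(request_key: str, weights: Dict[str, int]) -> str:
    """Deterministically map a request key to a region according to integer weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one region must have a positive weight")

    # Hash the key into a stable bucket in [0, total).
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total

    # Walk the regions in a fixed order and find the one that owns the bucket.
    for region in sorted(weights):
        if bucket < weights[region]:
            return region
        bucket -= weights[region]
    raise AssertionError("unreachable: bucket is always below the total weight")


if __name__ == "__main__":
    # Hypothetical region names; 50/50 by default, shifted to 100/0 during a drill.
    print(pick_region("user-42", {"region-a": 50, "region-b": 50}))
    print(pick_region("user-42", {"region-a": 0, "region-b": 100}))
```

Because the mapping is a hash of the key, the same user lands in the same region until the weights change, which keeps caches and sessions warmer than random assignment would.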
The Hard Parts
1. Database replication
This is where multi-region gets hard. Three options:
- Single writer, multi-region readers: simplest, but writes pay cross-region latency
- Multi-master: the only option that is truly hot for writes, but complex, because it requires conflict resolution
- Region-sharded: users pinned to a region for writes, simplest if your data model allows it
We use region-sharded for user-scoped data and single-writer for global config.
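The mixed approach can be sketched as a small routing function. This is an illustration under assumptions, not the setup described above: the region names, the `GLOBAL_WRITER` constant, and the two data kinds are all hypothetical, and hash-based pinning is only one way to assign a home region.

```python
import hashlib
from typing import Optional

REGIONS = ["region-a", "region-b"]  # hypothetical region names
GLOBAL_WRITER = "region-a"          # hypothetical single writer for global config


def home_region(user_id: str) -> str:
    """Pin a user to one region for writes using a stable hash of the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return REGIONS[int.from_bytes(digest[:8], "big") % len(REGIONS)]


def write_target(kind: str, user_id: Optional[str] = None) -> str:
    """Return the region that should accept a write of the given kind."""
    if kind == "global_config":
        # Single writer: every region sends config writes to the same place.
        return GLOBAL_WRITER
    if kind == "user_data":
        if user_id is None:
            raise ValueError("user-scoped writes need a user_id")
        # Region-sharded: the user's home region owns the write.
        return home_region(user_id)
    raise ValueError(f"unknown data kind: {kind}")
```

Hash-modulo pinning reshuffles users whenever the region list changes, so a lookup table that records each user's home region may be a better fit if regions are added or retired.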
2. Session stickiness
If a user's session is in Region A, and their next request goes to Region B, things break.
Solutions:
- JWT tokens with no server state
- Session data in a globally replicated store (DynamoDB Global Tables, CockroachDB)
- Cookie routing that pins a user to a region
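To show why the stateless-token option sidesteps the problem, here is a minimal sketch of an HS256-signed token built only on the Python standard library. It assumes both regions share the same signing secret. The function names are hypothetical, and a real deployment should use a vetted JWT library that also validates the header.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: bytes) -> str:
    mac = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256)
    return _b64url(mac.digest())


def issue_token(claims: dict, secret: bytes) -> str:
    """Build header.payload.signature, signed with HMAC-SHA256."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode("utf-8"))
    payload = _b64url(json.dumps(claims).encode("utf-8"))
    return f"{header}.{payload}.{_sign(f'{header}.{payload}', secret)}"


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Check signature and expiry; needs no session lookup, so any region can do it."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header, payload, signature = parts
    expected = _sign(f"{header}.{payload}", secret)
    if not hmac.compare_digest(expected.encode("ascii"), signature.encode("utf-8")):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload))
    current = time.time() if now is None else now
    if claims.get("exp", float("inf")) <= current:
        raise ValueError("token expired")
    return claims
```

A token issued by Region A verifies in Region B with nothing but the shared secret, so the request can land anywhere. The trade-off is that revoking a token before it expires needs extra machinery.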
3. Cache coherence
Region A's cache doesn't know when Region B updates the database. Options:
- Short TTLs (1-5 minutes) and accept the inconsistency
- Pub/sub cache invalidation across regions (complex)
- Read-through caches only, never write-through
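The short-TTL, read-through combination is the easiest of these to reason about. A minimal sketch, assuming a hypothetical `loader` callable that reads from the database and an injectable clock so the expiry can be tested:

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLReadThroughCache:
    """Read-through cache: misses and expired entries are reloaded from the source."""

    def __init__(
        self,
        loader: Callable[[str], Any],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh, possibly stale relative to the other region
        value = self._loader(key)  # read through to the database
        self._entries[key] = (now + self._ttl, value)
        return value
```

With this shape, a write made in the other region becomes visible no later than one TTL after it reaches the local database, which is the inconsistency window you are agreeing to accept.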
The Failover Mechanics
When Region A dies:
- Health checks detect failure: Route 53/ALB removes Region A from DNS
- Traffic shifts to Region B: already warm, already running
- Autoscaling kicks in: Region B doubles capacity
- User sessions degrade gracefully: re-authentication, cache warmup
- Monitoring reports the shift: team gets paged, not customers
Target: customer-facing latency spike of under 30 seconds, full recovery under 5 minutes.
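The first two steps, detection and traffic shift, can be sketched as a tracker of consecutive failed health checks that feeds relative routing weights. This is an illustration only: the threshold of three failures, the class and function names, and the fail-open behaviour are assumptions, and a managed DNS service would normally do this for you.

```python
from typing import Dict, Iterable


class RegionHealth:
    """Tracks consecutive failed health checks per region."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self._consecutive_failures: Dict[str, int] = {}

    def record(self, region: str, ok: bool) -> None:
        if ok:
            self._consecutive_failures[region] = 0  # any success resets the streak
        else:
            self._consecutive_failures[region] = (
                self._consecutive_failures.get(region, 0) + 1
            )

    def is_healthy(self, region: str) -> bool:
        return self._consecutive_failures.get(region, 0) < self.failure_threshold


def routing_weights(regions: Iterable[str], health: RegionHealth) -> Dict[str, int]:
    """Give healthy regions weight 1 and unhealthy regions weight 0."""
    region_list = list(regions)
    healthy = [r for r in region_list if health.is_healthy(r)]
    if not healthy:
        # Fail open: if every check fails, keep routing everywhere rather than nowhere.
        healthy = region_list
    return {r: (1 if r in healthy else 0) for r in region_list}
```

Requiring several consecutive failures trades a few seconds of detection time for fewer false failovers, which matters when the target is a spike of under 30 seconds.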
Testing It Monthly
If you don't test failover monthly, you don't have failover. You have hope.
We do this:
- First Tuesday of every month, 10 AM
- Route 100% of traffic to Region B for 30 minutes
- Watch dashboards, fix anything that degrades
- Route back to 50/50
- Document any issues, fix them
- Repeat next month with the other region
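The schedule is simple enough to compute rather than remember. A small sketch, assuming hypothetical region names and an arbitrary rule for which region takes the traffic in a given month (only the alternation itself comes from the steps above):

```python
import datetime as dt
from typing import Sequence, Tuple


def first_tuesday(year: int, month: int) -> dt.date:
    """Return the date of the first Tuesday in the given month."""
    first = dt.date(year, month, 1)
    offset = (1 - first.weekday()) % 7  # weekday(): Monday is 0, Tuesday is 1
    return first + dt.timedelta(days=offset)


def drill_plan(
    year: int,
    month: int,
    regions: Sequence[str] = ("region-a", "region-b"),
) -> Tuple[dt.datetime, str]:
    """Return the drill start (10:00 local time) and the region that takes all traffic."""
    start = dt.datetime.combine(first_tuesday(year, month), dt.time(10, 0))
    # Alternate the target region month by month, including across year boundaries.
    target = regions[(year * 12 + month) % len(regions)]
    return start, target


if __name__ == "__main__":
    for month in range(1, 4):
        start, target = drill_plan(2025, month)
        print(start.isoformat(), "route 100% of traffic to", target)
```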
Cost Reality Check
Running hot doubles your compute cost. For most companies, that's $10K-$100K/month.
The question is: what's your revenue per hour of downtime?
Company A: $100K/hr revenue → 99.99% target → hot is worth it
Company B: $1K/hr revenue → 99.9% target → warm is fine
Do the math. Don't copy FAANG patterns if your revenue doesn't justify them.
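Here is one way to do that math. The sketch assumes that warm gets you 99.9% and hot gets you 99.99%, that lost revenue scales linearly with downtime, and that hot adds $10K/month (the low end of the range above). All three are simplifying assumptions, and the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def annual_savings(revenue_per_hour: float, from_pct: float, to_pct: float) -> float:
    """Revenue protected per year by moving from one availability target to another."""
    saved_minutes = allowed_downtime_minutes(from_pct) - allowed_downtime_minutes(to_pct)
    return revenue_per_hour * saved_minutes / 60


def hot_pays_off(revenue_per_hour: float, extra_cost_per_month: float) -> bool:
    """Compare protected revenue (99.9% -> 99.99%) with the extra yearly cost of hot."""
    return annual_savings(revenue_per_hour, 99.9, 99.99) > extra_cost_per_month * 12


if __name__ == "__main__":
    print(round(allowed_downtime_minutes(99.9), 1))   # about 525.6 minutes per year
    print(round(allowed_downtime_minutes(99.99), 1))  # about 52.6 minutes per year
    print(hot_pays_off(100_000, 10_000))  # Company A
    print(hot_pays_off(1_000, 10_000))    # Company B
```

Under these assumptions Company A protects roughly $788K of revenue a year against $120K of extra spend, while Company B protects about $7.9K against the same $120K. The answer is sensitive to the inputs, so plug in your own numbers before deciding.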
The Operational Complexity Tax
Running hot costs more than money. It costs:
- More runbooks (one per region)
- More monitoring (cross-region latency, replication lag)
- Harder debugging ("which region was this request in?")
- More compliance surface (data residency rules in each region)
- More deployment pipelines (usually)
Budget 20% more engineering time for multi-region from day one.
Common Mistakes
- Single point of failure in DNS config: your DNS provider becomes the new SPOF
- Testing only with healthy traffic: test with 2x normal load during drills
- Forgetting about databases: DB failover is the hardest part
- Using regions as backup, not active: never tested until crisis
- Not planning for split-brain: what if both regions think they're primary?
The Minimum Viable Hot Setup
- Two regions, stateless app tier, 50/50 traffic
- Database: multi-AZ primary, cross-region async replica
- CDN/DNS: health-check-based routing
- Session: JWT-based (stateless)
- Monthly failover drills
- Runbooks tested in last 90 days
Start there. Layer in complexity as you need it.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com