Cloud Disaster Recovery Patterns
Production-tested disaster recovery architectures covering the full spectrum from cost-effective pilot light to enterprise-grade multi-region active-active. Each pattern includes Terraform/CloudFormation implementations, RTO/RPO calculation worksheets, runbook templates, and automated failover scripts. Stop treating DR as an afterthought — this kit gives you deployable infrastructure and tested procedures so your recovery plan actually works when the worst happens.
Key Features
- Four DR Tiers — Backup & restore, pilot light, warm standby, and multi-region active-active with clear cost/RTO trade-offs
- RTO/RPO Calculator — Spreadsheet-style tool mapping business requirements to appropriate DR tier and architecture
- Automated Failover Scripts — Route 53 health checks, Aurora global database failover, and S3 cross-region replication
- Runbook Templates — Step-by-step incident response procedures with decision trees and escalation paths
- Infrastructure as Code — Terraform modules for deploying each DR pattern across AWS regions
- Testing Framework — Chaos engineering scripts for simulating regional failures and validating recovery procedures
- Cost Comparison Matrix — Monthly cost estimates for each DR tier at small, medium, and large scale
- Compliance Mappings — DR controls mapped to SOC2, ISO 27001, and HIPAA requirements
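The tier-selection logic behind the RTO/RPO calculator can be sketched in a few lines. This is an illustrative model only, not the kit's actual spreadsheet: the per-tier RTO/RPO bounds below are taken from the comparison diagram in this README, and `recommend_dr_tier` is a hypothetical helper name.

```python
# Hypothetical sketch of the RTO/RPO-to-tier mapping. Tier bounds are
# illustrative, taken from the tier comparison diagram below.

def recommend_dr_tier(rto_minutes: float, rpo_minutes: float) -> str:
    """Return the cheapest DR tier whose worst-case RTO/RPO meets both targets."""
    # (tier name, worst-case RTO in minutes, worst-case RPO in minutes),
    # ordered cheapest first so the first match wins.
    tiers = [
        ("backup-restore", 24 * 60, 24 * 60),  # Tier 1: restore from cold backups
        ("pilot-light", 4 * 60, 15),           # Tier 2: core DB replicated, AMIs ready
        ("warm-standby", 15, 1),               # Tier 3: scaled-down running copy
        ("active-active", 0.5, 0.1),           # Tier 4: both regions serving traffic
    ]
    for name, max_rto, max_rpo in tiers:
        if rto_minutes >= max_rto and rpo_minutes >= max_rpo:
            return name
    return "active-active"  # Targets tighter than every tier: only Tier 4 qualifies

print(recommend_dr_tier(240, 15))  # pilot-light
```

Note the comparison direction: a tier qualifies only if its *worst-case* recovery bounds fit inside your targets, which is why tight targets push you up the cost ladder.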
Quick Start
# Deploy pilot light DR in secondary region
cd src/terraform/pilot-light
cp terraform.tfvars.example terraform.tfvars
# Set primary_region, dr_region, and vpc_cidrs
terraform init
terraform plan -out=plan.out
terraform apply plan.out
# Test failover (dry run)
python3 src/scripts/failover_test.py \
  --mode dry-run \
  --pattern pilot-light \
  --dr-region us-west-2
Architecture
┌─────────────── DR Tier Comparison ──────────────────────┐
│ │
│ Tier 1: Backup & Restore RTO: 24h RPO: 24h │
│ ┌──────────┐ S3 CRR ┌──────────┐ │
│ │ Primary │─────────────►│ Backups │ │
│ │ Region │ │ (cold) │ │
│ └──────────┘ └──────────┘ │
│ │
│ Tier 2: Pilot Light RTO: 1-4h RPO: minutes │
│ ┌──────────┐ Repl. ┌──────────┐ │
│ │ Primary │────────────►│ Core DB │ │
│ │ (active) │ │ + AMIs │ │
│ └──────────┘ └──────────┘ │
│ │
│ Tier 3: Warm Standby RTO: mins RPO: seconds │
│ ┌──────────┐ Repl. ┌──────────┐ │
│ │ Primary │────────────►│ Scaled- │ │
│ │ (full) │ │ down copy│ │
│ └──────────┘ └──────────┘ │
│ │
│ Tier 4: Active-Active RTO: ~0 RPO: ~0 │
│ ┌──────────┐◄───────────►┌──────────┐ │
│ │ Region A │ Global │ Region B │ │
│ │ (active) │ Database │ (active) │ │
│ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────┘
Usage Examples
Pilot Light — Aurora Global Database
# src/terraform/pilot-light/aurora-global.tf
resource "aws_rds_global_cluster" "main" {
  global_cluster_identifier = "${var.project}-global-db"
  engine                    = "aurora-postgresql"
  engine_version            = "15.4"
  storage_encrypted         = true
}

resource "aws_rds_cluster" "primary" {
  provider                  = aws.primary
  cluster_identifier        = "${var.project}-primary"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  engine                    = "aurora-postgresql"
  engine_version            = "15.4"
  master_username           = var.db_username
  master_password           = var.db_password
  backup_retention_period   = 35
  preferred_backup_window   = "03:00-04:00"
}

resource "aws_rds_cluster" "secondary" {
  provider                  = aws.dr
  cluster_identifier        = "${var.project}-dr"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  engine                    = "aurora-postgresql"
  engine_version            = "15.4"
  # Secondary cluster — no master credentials needed
  # Promotes to read-write during failover
}
Route 53 Health Check with Failover
# src/terraform/shared/dns-failover.tf
resource "aws_route53_health_check" "primary" {
  fqdn              = "app-primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10

  tags = { Name = "primary-health-check" }
}

resource "aws_route53_record" "failover_primary" {
  zone_id         = var.hosted_zone_id
  name            = "app.example.com"
  type            = "A"
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = var.primary_alb_dns
    zone_id                = var.primary_alb_zone_id
    evaluate_target_health = true
  }
}
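With `request_interval = 10` and `failure_threshold = 3`, an outage is declared roughly 30 seconds after the endpoint starts failing. The threshold behavior can be modeled in a few lines; this is an illustrative simulation of the consecutive-failure rule, not AWS's implementation, and `HealthTracker` is a hypothetical name:

```python
# Illustrative model of a failure-threshold health check: the endpoint is
# considered unhealthy only after N consecutive failed probes, which keeps
# a single dropped probe from triggering DNS failover.

class HealthTracker:
    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result; return True while still considered healthy."""
        if probe_ok:
            self.consecutive_failures = 0  # Any success resets the streak
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures < self.failure_threshold

tracker = HealthTracker(failure_threshold=3)
results = [tracker.record(ok) for ok in (False, False, True, False, False, False)]
print(results)  # [True, True, True, True, True, False]
```

The one successful probe in the middle resets the failure count, so only the final run of three consecutive failures flips the endpoint to unhealthy.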
Failover Decision Script
# src/scripts/failover_decision.py
"""Automated failover decision engine with safety checks."""
from dataclasses import dataclass
from enum import Enum
class FailoverAction(Enum):
    PROCEED = "proceed"
    WAIT = "wait"
    ESCALATE = "escalate"

@dataclass
class HealthStatus:
    region: str
    api_healthy: bool
    db_replication_lag_seconds: float
    last_check_utc: str

def evaluate_failover(
    primary: HealthStatus,
    secondary: HealthStatus,
    max_replication_lag: float = 30.0,
) -> FailoverAction:
    """Determine if failover should proceed based on health data."""
    if primary.api_healthy:
        return FailoverAction.WAIT  # Primary still up — false alarm?
    if not secondary.api_healthy:
        return FailoverAction.ESCALATE  # Both regions down
    if secondary.db_replication_lag_seconds > max_replication_lag:
        return FailoverAction.ESCALATE  # Data loss risk too high
    return FailoverAction.PROCEED
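A quick check of the decision table in action. The types and function below are condensed copies of the script's definitions so this snippet runs standalone; the region names and lag values are illustrative:

```python
from dataclasses import dataclass
from enum import Enum

# Condensed from src/scripts/failover_decision.py so this demo is self-contained.
class FailoverAction(Enum):
    PROCEED = "proceed"
    WAIT = "wait"
    ESCALATE = "escalate"

@dataclass
class HealthStatus:
    region: str
    api_healthy: bool
    db_replication_lag_seconds: float
    last_check_utc: str

def evaluate_failover(primary, secondary, max_replication_lag=30.0):
    if primary.api_healthy:
        return FailoverAction.WAIT
    if not secondary.api_healthy:
        return FailoverAction.ESCALATE
    if secondary.db_replication_lag_seconds > max_replication_lag:
        return FailoverAction.ESCALATE
    return FailoverAction.PROCEED

# Primary down, secondary healthy, lag well under the 30s ceiling: safe to fail over.
primary = HealthStatus("us-east-1", False, 0.0, "2024-01-01T00:00:00Z")
secondary = HealthStatus("us-west-2", True, 4.2, "2024-01-01T00:00:00Z")
print(evaluate_failover(primary, secondary))  # FailoverAction.PROCEED
```

The ordering of the checks matters: a healthy primary short-circuits everything else, so a flapping health probe produces WAIT rather than an unnecessary failover.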
Configuration
# configs/dr-config.yaml
project: acme-app
dr_tier: pilot-light # backup-restore | pilot-light | warm-standby | active-active
primary:
  region: us-east-1
  vpc_cidr: 10.0.0.0/16
dr:
  region: us-west-2
  vpc_cidr: 10.1.0.0/16
rto_target_minutes: 60  # Recovery Time Objective
rpo_target_minutes: 5   # Recovery Point Objective
health_check:
  endpoint: /health
  interval_seconds: 10
  failure_threshold: 3
notifications:
  email: oncall-team@example.com
  pagerduty_key: YOUR_PAGERDUTY_KEY_HERE
Best Practices
- Test failover quarterly — A DR plan that's never been tested is not a plan; it's a hope
- Automate DNS failover — Manual DNS changes during an outage add critical minutes to your RTO
- Monitor replication lag continuously — Your actual RPO is only as good as your replication lag at the moment of failure
- Keep DR infrastructure patched — Stale AMIs and outdated configs in the DR region will cause failover failures
- Document the failback procedure — Getting back to primary after a DR event is often harder than the initial failover
- Budget for DR — Active-active doubles your cost; pilot light adds ~15%; plan accordingly
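The replication-lag point can be made concrete: your worst-case RPO over a window is roughly the highest lag you actually observed, so alert on a high percentile rather than the mean. A minimal sketch, assuming lag samples are pulled from a metric source such as Aurora's `AuroraGlobalDBReplicationLag` CloudWatch metric (the sample data and `effective_rpo_seconds` helper are illustrative):

```python
# Estimate effective RPO from replication-lag samples (seconds). Illustrative
# sketch: real samples would come from your replication-lag metric.

def effective_rpo_seconds(lag_samples: list[float], percentile: float = 99.0) -> float:
    """Return the lag at the given percentile (nearest-rank method)."""
    if not lag_samples:
        raise ValueError("no lag samples")
    ordered = sorted(lag_samples)
    rank = max(1, round(percentile / 100 * len(ordered)))
    return ordered[rank - 1]

# A mostly-quiet hour with one replication spike: the mean hides the risk.
samples = [0.4] * 95 + [0.6, 0.8, 12.0, 45.0, 0.5]
print(round(sum(samples) / len(samples), 2))  # mean: 0.97 — looks fine
print(effective_rpo_seconds(samples, 99.0))   # p99: 12.0 — the real exposure
```

Against the config's `rpo_target_minutes: 5`, even the spike above is fine; the point is that averaging would never surface it.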
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| Aurora global failover takes 30+ minutes | Secondary cluster has insufficient capacity | Pre-provision at least one reader instance in the DR region |
| Route 53 failover not triggering | Health check evaluating wrong endpoint | Verify health check FQDN and path match your actual health endpoint |
| S3 cross-region replication delayed | Large objects or high throughput exceeding CRR capacity | Enable S3 Replication Time Control (RTC) for guaranteed 15-min SLA |
| DR region missing latest AMIs | AMI copy not automated | Add AMI copy step to CI/CD pipeline; use `aws ec2 copy-image` |
This is 1 of 11 resources in the Cloud Architecture Pro toolkit. Get the complete Cloud DR Patterns kit with all files, templates, and documentation for $39.
Or grab the entire Cloud Architecture Pro bundle (11 products) for $149 — save 30%.