Thesius Code

Posted on Mar 23 • Originally published at datanest-stores.pages.dev

Cloud DR Patterns: Cloud Disaster Recovery Patterns

#cloud #aws #terraform #architecture

Cloud Disaster Recovery Patterns

Production-tested disaster recovery architectures covering the full spectrum, from low-cost backup & restore through pilot light and warm standby to enterprise-grade multi-region active-active. Each pattern includes Terraform/CloudFormation implementations, RTO/RPO calculation worksheets, runbook templates, and automated failover scripts. Stop treating DR as an afterthought: this kit gives you deployable infrastructure and tested procedures so your recovery plan actually works when the worst happens.

Key Features

Four DR Tiers — Backup & restore, pilot light, warm standby, and multi-region active-active with clear cost/RTO trade-offs
RTO/RPO Calculator — Spreadsheet-style tool mapping business requirements to appropriate DR tier and architecture
Automated Failover Scripts — Route 53 health checks, Aurora global database failover, and S3 cross-region replication
Runbook Templates — Step-by-step incident response procedures with decision trees and escalation paths
Infrastructure as Code — Terraform modules for deploying each DR pattern across AWS regions
Testing Framework — Chaos engineering scripts for simulating regional failures and validating recovery procedures
Cost Comparison Matrix — Monthly cost estimates for each DR tier at small, medium, and large scale
Compliance Mappings — DR controls mapped to SOC2, ISO 27001, and HIPAA requirements

Quick Start

# Deploy pilot light DR in secondary region
cd src/terraform/pilot-light
cp terraform.tfvars.example terraform.tfvars
# Set primary_region, dr_region, and vpc_cidrs

terraform init
terraform plan -out=plan.out
terraform apply plan.out

# Test failover (dry run)
python3 src/scripts/failover_test.py \
  --mode dry-run \
  --pattern pilot-light \
  --dr-region us-west-2

Architecture

┌─────────────── DR Tier Comparison ──────────────────────┐
│                                                         │
│  Tier 1: Backup & Restore    RTO: 24h    RPO: 24h       │
│  ┌──────────┐    S3 CRR    ┌──────────┐                 │
│  │ Primary  │─────────────►│ Backups  │                 │
│  │ Region   │              │ (cold)   │                 │
│  └──────────┘              └──────────┘                 │
│                                                         │
│  Tier 2: Pilot Light         RTO: 1-4h   RPO: minutes   │
│  ┌──────────┐    Repl.     ┌──────────┐                 │
│  │ Primary  │─────────────►│ Core DB  │                 │
│  │ (active) │              │ + AMIs   │                 │
│  └──────────┘              └──────────┘                 │
│                                                         │
│  Tier 3: Warm Standby        RTO: mins   RPO: seconds   │
│  ┌──────────┐    Repl.     ┌──────────┐                 │
│  │ Primary  │─────────────►│ Scaled-  │                 │
│  │ (full)   │              │ down copy│                 │
│  └──────────┘              └──────────┘                 │
│                                                         │
│  Tier 4: Active-Active       RTO: ~0     RPO: ~0        │
│  ┌──────────┐◄────────────►┌──────────┐                 │
│  │ Region A │    Global    │ Region B │                 │
│  │ (active) │   Database   │ (active) │                 │
│  └──────────┘              └──────────┘                 │
└─────────────────────────────────────────────────────────┘
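
One way to read the comparison above is as a lookup: pick the cheapest tier whose RTO and RPO both fit the business targets. The sketch below is a minimal Python version of that idea, not the kit's calculator. `Tier`, `TIERS`, and `recommend_tier` are hypothetical names. Only the 24h and 4h figures come from the diagram; the other bounds are illustrative assumptions, because the diagram says only "minutes", "seconds", and "~0". Replace them with numbers you have measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    rto_minutes: float  # worst-case recovery time assumed for this tier
    rpo_minutes: float  # worst-case data-loss window assumed for this tier


# Ordered cheapest first. The 24h and 4h values come from the diagram;
# every other number is an illustrative assumption, not a measurement.
TIERS = [
    Tier("backup-restore", rto_minutes=24 * 60, rpo_minutes=24 * 60),
    Tier("pilot-light", rto_minutes=4 * 60, rpo_minutes=15),
    Tier("warm-standby", rto_minutes=30, rpo_minutes=1),
    Tier("active-active", rto_minutes=1, rpo_minutes=0.1),
]


def recommend_tier(rto_target_minutes: float, rpo_target_minutes: float) -> str:
    """Return the cheapest tier whose assumed RTO and RPO both meet the targets."""
    for tier in TIERS:
        if (tier.rto_minutes <= rto_target_minutes
                and tier.rpo_minutes <= rpo_target_minutes):
            return tier.name
    raise ValueError("targets are tighter than any tier modeled here")


if __name__ == "__main__":
    # An 8-hour RTO and 30-minute RPO rule out backup & restore only.
    print(recommend_tier(rto_target_minutes=8 * 60, rpo_target_minutes=30))
```

Because the list is ordered by cost, the first match is also the cheapest one, which keeps the function to a single loop.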

Usage Examples

Pilot Light — Aurora Global Database

# src/terraform/pilot-light/aurora-global.tf
resource "aws_rds_global_cluster" "main" {
  global_cluster_identifier = "${var.project}-global-db"
  engine                    = "aurora-postgresql"
  engine_version            = "15.4"
  storage_encrypted         = true
}

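resource "aws_rds_cluster" "primary" {
  provider                  = aws.primary
  cluster_identifier        = "${var.project}-primary"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  engine                    = "aurora-postgresql"
  engine_version            = "15.4"
  storage_encrypted         = true # keep in sync with the global cluster
  master_username           = var.db_username
  master_password           = var.db_password
  backup_retention_period   = 35
  preferred_backup_window   = "03:00-04:00"
}

resource "aws_rds_cluster" "secondary" {
  provider                  = aws.dr
  cluster_identifier        = "${var.project}-dr"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  engine                    = "aurora-postgresql"
  engine_version            = "15.4"
  storage_encrypted         = true
  # An encrypted cross-region secondary needs a KMS key from the DR region.
  # var.dr_kms_key_arn is an assumed variable, not part of the original snippet.
  kms_key_id                = var.dr_kms_key_arn

  # Secondary cluster: no master credentials needed.
  # It is promoted to read-write during failover.

  # Create the secondary only after the primary cluster exists.
  depends_on = [aws_rds_cluster.primary]
}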
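
The Terraform above only builds the global database; the promotion itself happens through the RDS API at failover time. The following is a minimal sketch of that step, assuming `rds_client` is a boto3 RDS client (for example `boto3.client("rds")`) and that the target is identified by its cluster ARN. `promote_dr_cluster` is a hypothetical helper name, not a script from the kit.

```python
from typing import Any


def promote_dr_cluster(
    rds_client: Any,
    global_cluster_id: str,
    target_cluster_arn: str,
    allow_data_loss: bool = False,
) -> dict:
    """Ask RDS to make the DR cluster the writer of the global database.

    rds_client is assumed to be a boto3 RDS client; it is injected so the
    function can be exercised with a stub instead of a real AWS account.
    """
    if allow_data_loss:
        # Unplanned failover: the primary is presumed unreachable, so any
        # writes that had not replicated yet are accepted as lost.
        return rds_client.failover_global_cluster(
            GlobalClusterIdentifier=global_cluster_id,
            TargetDbClusterIdentifier=target_cluster_arn,
            AllowDataLoss=True,
        )
    # Planned switchover (for example a DR drill): RDS waits for the
    # secondary to catch up before changing roles.
    return rds_client.switchover_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        TargetDbClusterIdentifier=target_cluster_arn,
    )
```

Defaulting `allow_data_loss` to `False` means a drill cannot accidentally take the lossy path; the caller has to opt in explicitly during a real outage.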

Route 53 Health Check with Failover

# src/terraform/shared/dns-failover.tf
resource "aws_route53_health_check" "primary" {
  fqdn              = "app-primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10

  tags = { Name = "primary-health-check" }
}

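# PRIMARY half of the failover pair. Route 53 also needs a record with the same
# name and type = "SECONDARY" pointing at the DR endpoint to have a failover target.
resource "aws_route53_record" "failover_primary" {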
  zone_id = var.hosted_zone_id
  name    = "app.example.com"
  type    = "A"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = var.primary_alb_dns
    zone_id                = var.primary_alb_zone_id
    evaluate_target_health = true
  }

  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id
}
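
The health check settings above translate directly into how long an outage goes unnoticed. As a rough worked example, detection takes about `request_interval` times `failure_threshold` seconds, and clients may keep resolving the old address for up to one DNS TTL after that. The function names and the 60-second TTL are assumptions for illustration; real timings also depend on checker behavior and client-side caching.

```python
def detection_seconds(request_interval: int, failure_threshold: int) -> int:
    """Approximate time for the health check to be marked unhealthy."""
    return request_interval * failure_threshold


def cutover_seconds(request_interval: int, failure_threshold: int, dns_ttl: int) -> int:
    """Detection time plus the longest a resolver may serve the old answer."""
    return detection_seconds(request_interval, failure_threshold) + dns_ttl


if __name__ == "__main__":
    # Values from the health check above: 10-second interval, threshold of 3.
    print(detection_seconds(10, 3))  # 30
    # dns_ttl=60 is an assumed TTL, not a value taken from the Terraform above.
    print(cutover_seconds(10, 3, dns_ttl=60))  # 90
```

Under these assumptions, DNS alone accounts for roughly a minute and a half of the RTO before any database promotion or instance scale-up begins.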

Failover Decision Script

# src/scripts/failover_decision.py
"""Automated failover decision engine with safety checks."""
from dataclasses import dataclass
from enum import Enum

class FailoverAction(Enum):
    PROCEED = "proceed"
    WAIT = "wait"
    ESCALATE = "escalate"

@dataclass
class HealthStatus:
    region: str
    api_healthy: bool
    db_replication_lag_seconds: float
    last_check_utc: str

def evaluate_failover(
    primary: HealthStatus,
    secondary: HealthStatus,
    max_replication_lag: float = 30.0,
) -> FailoverAction:
    """Determine if failover should proceed based on health data."""
    if primary.api_healthy:
        return FailoverAction.WAIT  # Primary still up — false alarm?

    if not secondary.api_healthy:
        return FailoverAction.ESCALATE  # Both regions down

    if secondary.db_replication_lag_seconds > max_replication_lag:
        return FailoverAction.ESCALATE  # Data loss risk too high

    return FailoverAction.PROCEED
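
Note that `evaluate_failover` never reads `last_check_utc`, so a health record that stopped updating an hour ago is trusted as much as a fresh one. A minimal sketch of a freshness guard follows, assuming the field holds an ISO 8601 timestamp such as `2024-03-23T12:00:30+00:00`. `is_stale` and the 60-second default are hypothetical choices, not part of the original script.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    last_check_utc: str,
    max_age_seconds: float = 60.0,
    now: Optional[datetime] = None,
) -> bool:
    """Return True when a health record is too old to base a failover on."""
    checked_at = datetime.fromisoformat(last_check_utc)
    if checked_at.tzinfo is None:
        # The field name says UTC, so treat naive timestamps as UTC.
        checked_at = checked_at.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    return (current - checked_at) > timedelta(seconds=max_age_seconds)


if __name__ == "__main__":
    fixed_now = datetime(2024, 3, 23, 12, 1, 0, tzinfo=timezone.utc)
    print(is_stale("2024-03-23T12:00:30+00:00", now=fixed_now))  # False
    print(is_stale("2024-03-23T11:58:00+00:00", now=fixed_now))  # True
```

A caller could return `FailoverAction.ESCALATE` whenever either region's status is stale, since acting on outdated health data is riskier than paging a human.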

Configuration

# configs/dr-config.yaml
project: acme-app
dr_tier: pilot-light          # backup-restore | pilot-light | warm-standby | active-active

primary:
  region: us-east-1
  vpc_cidr: 10.0.0.0/16

dr:
  region: us-west-2
  vpc_cidr: 10.1.0.0/16

rto_target_minutes: 60        # Recovery Time Objective
rpo_target_minutes: 5         # Recovery Point Objective

health_check:
  endpoint: /health
  interval_seconds: 10
  failure_threshold: 3

notifications:
  email: oncall-team@example.com
  pagerduty_key: YOUR_PAGERDUTY_KEY_HERE
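
A config like this is worth validating before Terraform ever runs. The sketch below checks a few things that can be verified offline, assuming the YAML has already been loaded into a plain dict. `validate_dr_config` is a hypothetical helper, and treating overlapping VPC CIDRs as an error is an assumption that matters mainly if the two regions are ever peered.

```python
import ipaddress
from typing import List

ALLOWED_TIERS = {"backup-restore", "pilot-light", "warm-standby", "active-active"}


def validate_dr_config(cfg: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    if cfg.get("dr_tier") not in ALLOWED_TIERS:
        errors.append(f"dr_tier must be one of {sorted(ALLOWED_TIERS)}")

    primary, dr = cfg["primary"], cfg["dr"]
    if primary["region"] == dr["region"]:
        errors.append("primary and dr regions must differ")

    primary_net = ipaddress.ip_network(primary["vpc_cidr"])
    dr_net = ipaddress.ip_network(dr["vpc_cidr"])
    if primary_net.overlaps(dr_net):
        errors.append("primary and dr vpc_cidr ranges overlap")

    for key in ("rto_target_minutes", "rpo_target_minutes"):
        if cfg.get(key, 0) <= 0:
            errors.append(f"{key} must be a positive number")
    return errors


# Mirrors the relevant keys of configs/dr-config.yaml shown above.
EXAMPLE = {
    "dr_tier": "pilot-light",
    "primary": {"region": "us-east-1", "vpc_cidr": "10.0.0.0/16"},
    "dr": {"region": "us-west-2", "vpc_cidr": "10.1.0.0/16"},
    "rto_target_minutes": 60,
    "rpo_target_minutes": 5,
}

if __name__ == "__main__":
    print(validate_dr_config(EXAMPLE))  # []
```

Returning every problem at once, rather than raising on the first, lets a CI job report the full list in a single run.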

Best Practices

Test failover quarterly — A DR plan that's never been tested is not a plan; it's a hope
Automate DNS failover — Manual DNS changes during an outage add critical minutes to your RTO
Monitor replication lag continuously — Your actual RPO is only as good as your replication lag at the moment of failure
Keep DR infrastructure patched — Stale AMIs and outdated configs in the DR region will cause failover failures
Document the failback procedure — Getting back to primary after a DR event is often harder than the initial failover
Budget for DR — Active-active doubles your cost; pilot light adds ~15%; plan accordingly
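
The point about replication lag can be made concrete: the worst lag you have observed is the RPO you would actually have achieved if the failure had struck at that moment. The sketch below compares lag samples against a target such as `rpo_target_minutes`. The function names and sample values are hypothetical, and the samples are assumed to be in seconds.

```python
from typing import List, Sequence


def worst_observed_rpo_minutes(lag_samples_seconds: Sequence[float]) -> float:
    """Largest replication lag seen, expressed in minutes."""
    return max(lag_samples_seconds) / 60.0


def rpo_breaches(
    lag_samples_seconds: Sequence[float],
    rpo_target_minutes: float,
) -> List[float]:
    """Return every sample (in seconds) that exceeded the RPO target."""
    limit_seconds = rpo_target_minutes * 60.0
    return [lag for lag in lag_samples_seconds if lag > limit_seconds]


if __name__ == "__main__":
    # Made-up samples: mostly about a second, with one spike near 7 minutes.
    samples = [0.8, 2.5, 410.0, 1.1]
    print(rpo_breaches(samples, rpo_target_minutes=5))  # [410.0]
```

Averages hide exactly the spikes that matter here, which is why the sketch reports the maximum and the individual breaches instead of a mean.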

Troubleshooting

Issue	Cause	Fix
Aurora global failover takes 30+ minutes	Secondary cluster has no DB instances (headless) or too little capacity	Pre-provision at least one reader instance in the DR region
Route 53 failover not triggering	Health check evaluating wrong endpoint	Verify health check FQDN and path match your actual health endpoint
S3 cross-region replication delayed	Large objects or high throughput; standard CRR has no replication-time SLA	Enable S3 Replication Time Control (RTC), which is designed to replicate 99.99% of new objects within 15 minutes and is backed by an SLA
DR region missing latest AMIs	AMI copy not automated	Add AMI copy step to CI/CD pipeline; use `aws ec2 copy-image`
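
For the AMI fix in the last row, the CI/CD step can be as small as assembling one `aws ec2 copy-image` call per build. The sketch below only builds the argument list, so it can be reviewed or logged before anything is executed. `build_copy_image_command` is a hypothetical helper and the AMI ID is a placeholder.

```python
import shlex
from typing import List


def build_copy_image_command(
    source_image_id: str,
    source_region: str,
    dr_region: str,
    name: str,
) -> List[str]:
    """Build the AWS CLI call that copies an AMI into the DR region."""
    return [
        "aws", "ec2", "copy-image",
        "--source-image-id", source_image_id,
        "--source-region", source_region,
        "--region", dr_region,  # the copy is created in this region
        "--name", name,
    ]


if __name__ == "__main__":
    cmd = build_copy_image_command(
        source_image_id="ami-0123456789abcdef0",  # placeholder ID
        source_region="us-east-1",
        dr_region="us-west-2",
        name="acme-app-dr-copy",
    )
    print(shlex.join(cmd))
    # To run it from a pipeline step: subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems if an image name ever contains spaces.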

This is 1 of 11 resources in the Cloud Architecture Pro toolkit. Get the complete Cloud DR Patterns kit with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire Cloud Architecture Pro bundle (11 products) for $149 — save 30%.

Get the Complete Bundle →
