DEV Community

Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Cloud DR Patterns: Cloud Disaster Recovery Patterns

Production-tested disaster recovery architectures covering the full spectrum from cost-effective pilot light to enterprise-grade multi-region active-active. Each pattern includes Terraform/CloudFormation implementations, RTO/RPO calculation worksheets, runbook templates, and automated failover scripts. Stop treating DR as an afterthought — this kit gives you deployable infrastructure and tested procedures so your recovery plan actually works when the worst happens.

Key Features

  • Four DR Tiers — Backup & restore, pilot light, warm standby, and multi-region active-active with clear cost/RTO trade-offs
  • RTO/RPO Calculator — Spreadsheet-style tool mapping business requirements to appropriate DR tier and architecture
  • Automated Failover Scripts — Route 53 health checks, Aurora global database failover, and S3 cross-region replication
  • Runbook Templates — Step-by-step incident response procedures with decision trees and escalation paths
  • Infrastructure as Code — Terraform modules for deploying each DR pattern across AWS regions
  • Testing Framework — Chaos engineering scripts for simulating regional failures and validating recovery procedures
  • Cost Comparison Matrix — Monthly cost estimates for each DR tier at small, medium, and large scale
  • Compliance Mappings — DR controls mapped to SOC2, ISO 27001, and HIPAA requirements

Quick Start

# Deploy pilot light DR in secondary region
cd src/terraform/pilot-light
cp terraform.tfvars.example terraform.tfvars
# Set primary_region, dr_region, and vpc_cidrs

terraform init
terraform plan -out=plan.out
terraform apply plan.out

# Test failover (dry run)
python3 src/scripts/failover_test.py \
  --mode dry-run \
  --pattern pilot-light \
  --dr-region us-west-2

Architecture

┌─────────────── DR Tier Comparison ──────────────────────┐
│                                                         │
│  Tier 1: Backup & Restore    RTO: 24h     RPO: 24h      │
│  ┌──────────┐    S3 CRR    ┌──────────┐                 │
│  │ Primary  │─────────────►│ Backups  │                 │
│  │ Region   │              │ (cold)   │                 │
│  └──────────┘              └──────────┘                 │
│                                                         │
│  Tier 2: Pilot Light         RTO: 1-4h    RPO: minutes  │
│  ┌──────────┐    Repl.     ┌──────────┐                 │
│  │ Primary  │─────────────►│ Core DB  │                 │
│  │ (active) │              │ + AMIs   │                 │
│  └──────────┘              └──────────┘                 │
│                                                         │
│  Tier 3: Warm Standby        RTO: mins    RPO: seconds  │
│  ┌──────────┐    Repl.     ┌──────────┐                 │
│  │ Primary  │─────────────►│ Scaled-  │                 │
│  │ (full)   │              │ down copy│                 │
│  └──────────┘              └──────────┘                 │
│                                                         │
│  Tier 4: Active-Active       RTO: ~0      RPO: ~0       │
│  ┌──────────┐◄────────────►┌──────────┐                 │
│  │ Region A │    Global    │ Region B │                 │
│  │ (active) │   Database   │ (active) │                 │
│  └──────────┘              └──────────┘                 │
└─────────────────────────────────────────────────────────┘
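The tier comparison can be read as a lookup: pick the cheapest tier whose recovery characteristics still meet the business targets. Below is a minimal Python sketch of that idea. The diagram only gives rough ranges ("minutes", "seconds", "~0"), so the concrete numbers in `TIERS`, the `DrTier` class, and the `cheapest_tier` function are illustrative assumptions, not values taken from the kit's calculator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrTier:
    name: str
    rto_minutes: float  # worst-case recovery time assumed for this tier
    rpo_minutes: float  # worst-case data-loss window assumed for this tier


# Ordered cheapest first. The numbers are illustrative readings of the
# diagram ("1-4h", "minutes", "seconds", "~0"), not measured values.
TIERS = [
    DrTier("backup-restore", rto_minutes=24 * 60, rpo_minutes=24 * 60),
    DrTier("pilot-light", rto_minutes=4 * 60, rpo_minutes=15),
    DrTier("warm-standby", rto_minutes=15, rpo_minutes=1),
    DrTier("active-active", rto_minutes=1, rpo_minutes=0.1),
]


def cheapest_tier(rto_target_minutes: float, rpo_target_minutes: float) -> str:
    """Return the first (cheapest) tier whose assumed RTO and RPO meet both targets."""
    for tier in TIERS:
        meets_rto = tier.rto_minutes <= rto_target_minutes
        meets_rpo = tier.rpo_minutes <= rpo_target_minutes
        if meets_rto and meets_rpo:
            return tier.name
    # Targets are tighter than even active-active is assumed to deliver.
    raise ValueError("no tier satisfies the requested RTO/RPO targets")


print(cheapest_tier(rto_target_minutes=8 * 60, rpo_target_minutes=30))  # pilot-light
print(cheapest_tier(rto_target_minutes=30, rpo_target_minutes=5))       # warm-standby
```

Using the worst case of each range keeps the choice conservative; replace the assumed figures with numbers from your own failover tests once you have them.
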

Usage Examples

Pilot Light — Aurora Global Database

# src/terraform/pilot-light/aurora-global.tf
resource "aws_rds_global_cluster" "main" {
  global_cluster_identifier = "${var.project}-global-db"
  engine                    = "aurora-postgresql"
  engine_version            = "15.4"
  storage_encrypted         = true
}

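resource "aws_rds_cluster" "primary" {
  provider                  = aws.primary
  cluster_identifier        = "${var.project}-primary"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  engine                    = "aurora-postgresql"
  engine_version            = "15.4"
  master_username           = var.db_username
  master_password           = var.db_password
  storage_encrypted         = true # must match the global cluster's encryption setting
  backup_retention_period   = 35
  preferred_backup_window   = "03:00-04:00"
}

resource "aws_rds_cluster" "secondary" {
  provider                  = aws.dr
  cluster_identifier        = "${var.project}-dr"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  engine                    = "aurora-postgresql"
  engine_version            = "15.4"
  storage_encrypted         = true
  # Encrypted cross-region secondaries need a KMS key from the DR region.
  # var.dr_kms_key_arn is a hypothetical variable name; supply your own key ARN.
  kms_key_id                = var.dr_kms_key_arn
  # Secondary cluster: no master credentials needed
  # Promotes to read-write during failover

  # Create the secondary only after the primary has joined the global cluster.
  # If the primary's cluster instances are defined elsewhere, depend on those instead.
  depends_on = [aws_rds_cluster.primary]
}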
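Once the global database exists, promotion of the DR cluster can be scripted. The sketch below wraps the boto3 RDS call `failover_global_cluster`; it is an assumption about how a failover script might look, not the kit's actual script. The function name `promote_secondary`, the `FakeRdsClient` stand-in, and the example identifiers are hypothetical, and the `AllowDataLoss` parameter needs a reasonably recent boto3 release.

```python
from typing import Any, Dict, List


def promote_secondary(
    rds_client: Any,
    global_cluster_id: str,
    target_cluster_arn: str,
    allow_data_loss: bool = False,
) -> Dict[str, Any]:
    """Ask RDS to make the given secondary cluster the new writer.

    allow_data_loss=False requests a planned switchover (primary still healthy).
    allow_data_loss=True requests a failover that accepts losing unreplicated writes.
    """
    params: Dict[str, Any] = {
        "GlobalClusterIdentifier": global_cluster_id,
        "TargetDbClusterIdentifier": target_cluster_arn,
    }
    if allow_data_loss:
        params["AllowDataLoss"] = True
    return rds_client.failover_global_cluster(**params)


class FakeRdsClient:
    """Stand-in for boto3.client("rds") so the sketch runs without AWS access."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def failover_global_cluster(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(kwargs)
        return {"GlobalCluster": {"GlobalClusterIdentifier": kwargs["GlobalClusterIdentifier"]}}


client = FakeRdsClient()
promote_secondary(
    client,
    global_cluster_id="acme-app-global-db",
    target_cluster_arn="arn:aws:rds:us-west-2:123456789012:cluster:acme-app-dr",
    allow_data_loss=True,  # unplanned regional outage
)
print(client.calls[0])
```

In real use, pass `boto3.client("rds", region_name=...)` instead of the fake client, and keep `allow_data_loss=False` for rehearsals so a drill cannot discard writes.
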

Route 53 Health Check with Failover

# src/terraform/shared/dns-failover.tf
resource "aws_route53_health_check" "primary" {
  fqdn              = "app-primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10

  tags = { Name = "primary-health-check" }
}

resource "aws_route53_record" "failover_primary" {
  zone_id = var.hosted_zone_id
  name    = "app.example.com"
  type    = "A"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = var.primary_alb_dns
    zone_id                = var.primary_alb_zone_id
    evaluate_target_health = true
  }

  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id
}
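The health check settings put a floor under the achievable RTO: Route 53 needs roughly `request_interval × failure_threshold` seconds to declare the primary unhealthy, and clients then keep using cached DNS answers until the TTL expires. A back-of-the-envelope sketch follows; the function name is hypothetical and the 60-second TTL is an assumption you should replace with your record's effective TTL.

```python
def estimated_dns_failover_seconds(
    request_interval: int,
    failure_threshold: int,
    dns_ttl_seconds: int,
) -> int:
    """Rough estimate of the time until clients resolve to the DR endpoint."""
    # Time for enough consecutive failed checks to mark the endpoint unhealthy.
    detection_seconds = request_interval * failure_threshold
    # Resolvers may keep serving the old answer for up to one TTL after that.
    return detection_seconds + dns_ttl_seconds


# Values from the health check above, plus an assumed 60-second TTL.
print(estimated_dns_failover_seconds(request_interval=10, failure_threshold=3, dns_ttl_seconds=60))  # 90
```

This covers DNS only. Database promotion and scaling up compute in the DR region come on top of it.
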

Failover Decision Script

# src/scripts/failover_decision.py
"""Automated failover decision engine with safety checks."""
from dataclasses import dataclass
from enum import Enum

class FailoverAction(Enum):
    PROCEED = "proceed"
    WAIT = "wait"
    ESCALATE = "escalate"

@dataclass
class HealthStatus:
    region: str
    api_healthy: bool
    db_replication_lag_seconds: float
    last_check_utc: str

def evaluate_failover(
    primary: HealthStatus,
    secondary: HealthStatus,
    max_replication_lag: float = 30.0,
) -> FailoverAction:
    """Determine if failover should proceed based on health data."""
    if primary.api_healthy:
        return FailoverAction.WAIT  # Primary still up — false alarm?

    if not secondary.api_healthy:
        return FailoverAction.ESCALATE  # Both regions down

    if secondary.db_replication_lag_seconds > max_replication_lag:
        return FailoverAction.ESCALATE  # Data loss risk too high

    return FailoverAction.PROCEED
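`HealthStatus` carries a `last_check_utc` field that `evaluate_failover` does not read. One possible extension, sketched here as a suggestion and not as part of the original script, is to refuse to act on stale health data. The sketch assumes the timestamp is an ISO 8601 string; the `is_stale` helper and its 60-second default are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    last_check_utc: str,
    max_age_seconds: float = 60.0,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if an ISO 8601 timestamp is older than max_age_seconds."""
    # Older Python versions reject a trailing "Z", so normalise it to an offset.
    checked_at = datetime.fromisoformat(last_check_utc.replace("Z", "+00:00"))
    if checked_at.tzinfo is None:
        # Assumption: naive timestamps are already UTC, as the field name suggests.
        checked_at = checked_at.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return (current - checked_at) > timedelta(seconds=max_age_seconds)


reference = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(is_stale("2024-01-01T11:59:30Z", now=reference))       # False: 30 seconds old
print(is_stale("2024-01-01T11:55:00+00:00", now=reference))  # True: 5 minutes old
```

A caller could return `FailoverAction.ESCALATE` whenever either region's status is stale, so that a stalled monitoring pipeline never triggers an automatic failover on its own.
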

Configuration

# configs/dr-config.yaml
project: acme-app
dr_tier: pilot-light          # backup-restore | pilot-light | warm-standby | active-active

primary:
  region: us-east-1
  vpc_cidr: 10.0.0.0/16

dr:
  region: us-west-2
  vpc_cidr: 10.1.0.0/16

rto_target_minutes: 60        # Recovery Time Objective
rpo_target_minutes: 5         # Recovery Point Objective

health_check:
  endpoint: /health
  interval_seconds: 10
  failure_threshold: 3

notifications:
  email: oncall-team@example.com
  pagerduty_key: YOUR_PAGERDUTY_KEY_HERE
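A configuration like this is worth checking before any Terraform runs. The following sketch validates an already-parsed config (for example, the result of loading the YAML with a parser of your choice). The `validate_dr_config` function and the specific checks are assumptions about what is useful to verify, not rules enforced by the kit.

```python
import ipaddress
from typing import Any, Dict, List

VALID_TIERS = {"backup-restore", "pilot-light", "warm-standby", "active-active"}


def validate_dr_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    if cfg["dr_tier"] not in VALID_TIERS:
        problems.append(f"unknown dr_tier: {cfg['dr_tier']}")
    if cfg["primary"]["region"] == cfg["dr"]["region"]:
        problems.append("primary and DR regions must differ")
    # Overlapping ranges would prevent routing between the two VPCs later on.
    primary_net = ipaddress.ip_network(cfg["primary"]["vpc_cidr"])
    dr_net = ipaddress.ip_network(cfg["dr"]["vpc_cidr"])
    if primary_net.overlaps(dr_net):
        problems.append("primary and DR VPC CIDRs overlap")
    # Failure detection alone must fit comfortably inside the RTO target.
    hc = cfg["health_check"]
    detection_seconds = hc["interval_seconds"] * hc["failure_threshold"]
    if detection_seconds >= cfg["rto_target_minutes"] * 60:
        problems.append("health check detection time exceeds the RTO target")
    return problems


# Mirrors the sample configuration above (notification settings omitted).
SAMPLE = {
    "project": "acme-app",
    "dr_tier": "pilot-light",
    "primary": {"region": "us-east-1", "vpc_cidr": "10.0.0.0/16"},
    "dr": {"region": "us-west-2", "vpc_cidr": "10.1.0.0/16"},
    "rto_target_minutes": 60,
    "rpo_target_minutes": 5,
    "health_check": {"endpoint": "/health", "interval_seconds": 10, "failure_threshold": 3},
}

print(validate_dr_config(SAMPLE))  # []
```
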

Best Practices

  • Test failover quarterly — A DR plan that's never been tested is not a plan; it's a hope
  • Automate DNS failover — Manual DNS changes during an outage add critical minutes to your RTO
  • Monitor replication lag continuously — Your actual RPO is only as good as your replication lag at the moment of failure
  • Keep DR infrastructure patched — Stale AMIs and outdated configs in the DR region will cause failover failures
  • Document the failback procedure — Getting back to primary after a DR event is often harder than the initial failover
  • Budget for DR — Active-active doubles your cost; pilot light adds ~15%; plan accordingly
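The point about replication lag can be made concrete: if the worst lag observed over a monitoring window exceeds the RPO target, a failure at that moment would have breached it. Here is a minimal sketch, assuming lag samples in seconds have already been collected from your monitoring system; the `rpo_at_risk` function is hypothetical.

```python
from typing import Sequence


def rpo_at_risk(lag_samples_seconds: Sequence[float], rpo_target_minutes: float) -> bool:
    """Return True if the worst observed replication lag exceeds the RPO target."""
    if not lag_samples_seconds:
        # No data means the RPO cannot be demonstrated, so treat it as at risk.
        return True
    return max(lag_samples_seconds) > rpo_target_minutes * 60


print(rpo_at_risk([0.8, 1.2, 45.0], rpo_target_minutes=5))   # False: worst lag is 45 s
print(rpo_at_risk([0.8, 1.2, 420.0], rpo_target_minutes=5))  # True: 420 s exceeds 300 s
```

Checking the maximum instead of the average is deliberate, because an outage does not wait for a quiet moment.
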

Troubleshooting

| Issue | Cause | Fix |
|-------|-------|-----|
| Aurora global failover takes 30+ minutes | Secondary cluster has insufficient capacity | Pre-provision at least one reader instance in the DR region |
| Route 53 failover not triggering | Health check evaluating wrong endpoint | Verify that the health check FQDN and path match your actual health endpoint |
| S3 cross-region replication delayed | Large objects or high throughput exceeding CRR capacity | Enable S3 Replication Time Control (RTC), which is designed to replicate 99.99% of objects within 15 minutes and is backed by an SLA |
| DR region missing latest AMIs | AMI copy not automated | Add an AMI copy step to the CI/CD pipeline; use aws ec2 copy-image |
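For the AMI row, the copy step can also be scripted in Python for a pipeline job. The sketch below uses the EC2 `copy_image` call, which runs against a client bound to the destination region. The `copy_ami_to_dr` function, the `FakeEc2Client` stand-in, and the example IDs are hypothetical.

```python
from typing import Any, Dict, List


def copy_ami_to_dr(
    dr_ec2_client: Any,
    source_image_id: str,
    source_region: str,
    name: str,
) -> str:
    """Copy an AMI into the region the client is bound to and return the new image ID."""
    response = dr_ec2_client.copy_image(
        Name=name,
        SourceImageId=source_image_id,
        SourceRegion=source_region,
    )
    return response["ImageId"]


class FakeEc2Client:
    """Stand-in for boto3.client("ec2", region_name=dr_region) so the sketch runs offline."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def copy_image(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(kwargs)
        return {"ImageId": "ami-0fedcba9876543210"}  # made-up ID for illustration


client = FakeEc2Client()
new_id = copy_ami_to_dr(
    client,
    source_image_id="ami-0123456789abcdef0",
    source_region="us-east-1",
    name="acme-app-2024-01-01",
)
print(new_id)
```

In a pipeline, create the client with `boto3.client("ec2", region_name="us-west-2")` and run the copy right after the image build, so the DR region never lags more than one release behind.
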

This is 1 of 11 resources in the Cloud Architecture Pro toolkit. Get the complete Cloud DR Patterns kit with all files, templates, and documentation for $39.

Or grab the entire Cloud Architecture Pro bundle (11 products) for $149 — save 30%.
