From Single Region to Production-Grade Global Infrastructure
Day 27 of the 30-Day Terraform Challenge — and today I built something that can survive an entire AWS region going offline.
Yesterday I built a scalable web app in one region. Today I built infrastructure that spans two regions, with automatic failover, cross-region database replication, and zero single points of failure.
This is what production-grade looks like.
The Architecture
┌─────────────────────────────────────────────────────────────┐
│ Route53 Failover DNS │
│ app.example.com │
└─────────────────────┬───────────────┬───────────────────────┘
│ │
┌─────────────────────▼───────────────▼───────────────────────┐
│ │
│ PRIMARY REGION (us-east-1) SECONDARY REGION (us-west-2) │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ ALB │ │ ALB │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ┌──────▼──────┐ ┌──────▼──────┐ │
│ │ ASG │ │ ASG │ │
│ │ (2-4 EC2) │ │ (2-4 EC2) │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ┌──────▼──────┐ ┌──────▼──────┐ │
│ │ RDS Multi-AZ│──────────────►│ RDS Replica │ │
│ │ (Primary) │ Replication │ (Read-only)│ │
│ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
What's happening:
- Route53 health checks monitor both regions
- If primary fails, DNS automatically routes to secondary
- RDS cross-region replica keeps data in sync
- Each region has its own VPC, ALB, and Auto Scaling Group
The Project Structure
day27-multi-region-ha/
├── modules/
│ ├── vpc/ # VPC, subnets, NAT gateways
│ ├── alb/ # Load balancer, target group
│ ├── asg/ # Auto Scaling, CloudWatch alarms
│ ├── rds/ # RDS instance with Multi-AZ and replicas
│ └── route53/ # DNS failover routing
├── envs/
│ └── prod/
│ ├── main.tf
│ ├── variables.tf
│ └── terraform.tfvars
├── backend.tf
└── provider.tf
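Multi-region Terraform needs one AWS provider per region, selected with aliases. A minimal sketch of what provider.tf could contain (the alias names `primary` and `secondary` are my own choice, not from the original):

```hcl
# provider.tf — one aliased AWS provider per region
provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}

provider "aws" {
  alias  = "secondary"
  region = "us-west-2"
}
```

Module calls then pick a region by passing `providers = { aws = aws.primary }` (or `aws.secondary`) rather than a plain region variable.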
Five modules, each with a single responsibility. The VPC module doesn't know about the ALB. The ALB module doesn't know about the ASG. The calling configuration wires them together.
The VPC Module (Network Foundation)
# modules/vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_subnet" "public" {
  count                   = length(var.public_subnet_cidrs)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = var.public_subnet_cidrs[count.index]
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true
}

resource "aws_subnet" "private" {
  count             = length(var.private_subnet_cidrs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]
}

# Elastic IPs for the NAT gateways (referenced below)
resource "aws_eip" "nat" {
  count  = length(var.public_subnet_cidrs)
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  count         = length(var.public_subnet_cidrs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
}
Why two subnet types:
- Public subnets → ALB (needs internet access)
- Private subnets → EC2 instances (no direct internet access)
- NAT Gateways → allow instances to download packages while remaining private
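The private subnets reach the NAT gateways through route tables, which the excerpt omits. A sketch of that wiring, assuming the variable and resource names from the module above:

```hcl
# modules/vpc/main.tf — route private subnets through the NAT gateways
# (public subnets would route 0.0.0.0/0 to an internet gateway, omitted here)
resource "aws_route_table" "private" {
  count  = length(var.private_subnet_cidrs)
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }
}

resource "aws_route_table_association" "private" {
  count          = length(var.private_subnet_cidrs)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}
```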
The ALB Module (Traffic Distribution)
# modules/alb/main.tf
resource "aws_lb" "web" {
  name               = "${var.name}-alb-${var.region}"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.subnet_ids
}

resource "aws_lb_target_group" "web" {
  name     = "${var.name}-tg-${var.region}"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    interval            = 30
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}
The health check endpoint (/health) is critical — Route53 uses it to determine if the region is healthy.
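One piece the excerpt leaves out: without a listener, the ALB accepts no traffic at all. A minimal HTTP listener forwarding to the target group could look like this (resource names assumed to match the module above):

```hcl
# modules/alb/main.tf — forward port 80 to the target group
resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.web.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.web.arn
  }
}
```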
The ASG Module (Auto Scaling)
# modules/asg/main.tf
resource "aws_autoscaling_group" "web" {
  min_size            = var.min_size
  max_size            = var.max_size
  desired_capacity    = var.desired_capacity
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = var.target_group_arns
  health_check_type   = "ELB"
}

resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "web-cpu-high-${var.environment}-${var.region}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average"
  threshold           = 70
  alarm_actions       = [aws_autoscaling_policy.scale_out.arn]

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }
}
The connection: target_group_arns links the ASG to the ALB. Without this, instances launch but never receive traffic.
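The ASG also needs something to launch. A hedged sketch of the launch template it might reference (`var.instance_type` and `var.user_data` are assumptions, not from the original):

```hcl
# modules/asg/main.tf — launch template the ASG launches instances from
resource "aws_launch_template" "web" {
  name_prefix   = "${var.environment}-web-"
  image_id      = var.launch_template_ami
  instance_type = var.instance_type

  # user_data should start the app and serve the /health endpoint
  # the ALB and Route53 health checks depend on
  user_data = base64encode(var.user_data)
}
```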
The RDS Module (Database Tier)
# modules/rds/main.tf
resource "aws_db_instance" "main" {
  identifier        = var.identifier
  engine            = "mysql"
  instance_class    = "db.t3.micro"
  multi_az          = var.multi_az
  storage_encrypted = true

  # Set only on the replica; for a cross-region replica this
  # must be the source instance's ARN
  replicate_source_db = var.replicate_source_db
}
Multi-AZ (primary region): Synchronous replication to a standby in another AZ. Failover within minutes.
Cross-region replica (secondary region): Asynchronous replication. Used for disaster recovery, not failover.
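The same RDS module serves both roles by branching on a flag. One hedged way to wire that up inside the module (variable names assumed; defaults chosen for illustration):

```hcl
# modules/rds/variables.tf — sketch of how one module serves both roles
variable "is_replica" {
  type    = bool
  default = false
}

variable "replicate_source_db" {
  type    = string
  default = null # null means "this instance is a primary"
}

# In main.tf, primary-only arguments can then be gated on the flag:
#   username = var.is_replica ? null : var.db_username
#   password = var.is_replica ? null : var.db_password
```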
The Route53 Module (DNS Failover)
# modules/route53/main.tf
resource "aws_route53_health_check" "primary" {
  fqdn              = var.primary_alb_dns_name
  port              = 80
  type              = "HTTP"
  resource_path     = "/health"
  failure_threshold = 3
}

resource "aws_route53_record" "primary" {
  zone_id         = var.hosted_zone_id
  name            = var.domain_name
  type            = "A"
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = var.primary_alb_dns_name
    zone_id                = var.primary_alb_zone_id
    evaluate_target_health = true
  }
}
How failover works:
- Route53 health checks ping the ALB's /health endpoint every 30 seconds
- After 3 failures (90 seconds), the health check marks the region as unhealthy
- Route53 stops sending traffic to primary, starts sending to secondary
- DNS TTL (60 seconds) + health check interval = ~2-3 minute failover
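The module excerpt shows only the PRIMARY record; failover also needs a matching SECONDARY record pointing at the other region's ALB. A sketch, with `secondary_alb_dns_name` and `secondary_alb_zone_id` as assumed variable names:

```hcl
# modules/route53/main.tf — the matching SECONDARY failover record
resource "aws_route53_record" "secondary" {
  zone_id        = var.hosted_zone_id
  name           = var.domain_name
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = var.secondary_alb_dns_name
    zone_id                = var.secondary_alb_zone_id
    evaluate_target_health = true
  }
}
```

Route53 serves the SECONDARY record only while the PRIMARY's health check is failing.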
The Calling Configuration (Wiring Everything Together)
# envs/prod/main.tf
module "vpc_primary" {
  source = "../../modules/vpc"
  providers = {
    aws = aws.primary # aliased provider from provider.tf selects the region
  }
  region = "us-east-1"
  # ... VPC config
}

module "alb_primary" {
  source     = "../../modules/alb"
  vpc_id     = module.vpc_primary.vpc_id
  subnet_ids = module.vpc_primary.public_subnet_ids
}

module "asg_primary" {
  source              = "../../modules/asg"
  target_group_arns   = [module.alb_primary.target_group_arn]
  launch_template_ami = var.primary_ami_id
}

module "rds_primary" {
  source   = "../../modules/rds"
  multi_az = true
}

module "rds_replica" {
  source = "../../modules/rds"
  providers = {
    aws = aws.secondary # the replica lives in the other region
  }
  is_replica          = true
  replicate_source_db = module.rds_primary.db_instance_arn
}

module "route53" {
  source                 = "../../modules/route53"
  primary_alb_dns_name   = module.alb_primary.alb_dns_name
  secondary_alb_dns_name = module.alb_secondary.alb_dns_name
}
The data flow:
- VPC module outputs subnet IDs
- ALB module uses those to place the load balancer
- ASG module uses ALB's target group ARN to register instances
- RDS replica uses primary's ARN to set up replication
- Route53 uses both ALB DNS names for failover
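This wiring only works because each module exposes outputs. A sketch of the outputs the configuration above consumes — the names match the references in main.tf, but the files themselves are my reconstruction:

```hcl
# modules/vpc/outputs.tf
output "vpc_id" {
  value = aws_vpc.main.id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

# modules/alb/outputs.tf
output "target_group_arn" {
  value = aws_lb_target_group.web.arn
}

output "alb_dns_name" {
  value = aws_lb.web.dns_name
}
```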
The Deployment
$ terraform apply -auto-approve
Apply complete! Resources: 19 added, 1 changed, 0 destroyed.
Outputs:
alb_url = "http://alb-us-east-1-234339925.us-east-1.elb.amazonaws.com"
The Result
What works:
- ALB distributes traffic to healthy instances
- ASG maintains 2-4 instances based on CPU
- CloudWatch alarms trigger scaling at 70% CPU
- RDS Multi-AZ protects against AZ failure
- Cross-region replica keeps secondary region in sync
What happens during a region outage:
- Health checks fail (90 seconds)
- Route53 stops sending traffic to primary
- Traffic shifts to secondary region
- Users continue accessing the application with minimal interruption
What I Learned
Multi-AZ ≠ cross-region. Multi-AZ protects against AZ failure within a region. Cross-region replicas protect against full regional outages. You need both for true high availability.
Health checks are critical. Without them, Route53 has no way to know a region is down. Every ALB needs a /health endpoint.
Modules must be focused. The VPC module shouldn't know about the ALB. The ALB module shouldn't know about the ASG. Each module does one thing well.
The calling configuration is the "glue." All the wiring happens in envs/prod/main.tf. The modules stay generic and reusable.
The Bottom Line
| Component | Protects Against | Failover Time |
|---|---|---|
| Multi-AZ RDS | AZ failure | Minutes |
| Cross-region replica | Regional outage | Manual promotion |
| Auto Scaling Group | Instance failure | Minutes |
| Route53 failover | Regional outage | 2-3 minutes |
This is what production-grade infrastructure looks like. No single points of failure. Automatic failover. Cross-region replication.
One terraform apply. Two regions. Automatic failover in minutes.