Mukami
Building a 3-Tier Multi-Region High Availability Architecture with Terraform

From Single Region to Production-Grade Global Infrastructure


Day 27 of the 30-Day Terraform Challenge — and today I built something that can survive an entire AWS region going offline.

Yesterday I built a scalable web app in one region. Today I built infrastructure that spans two regions, with automatic failover, cross-region database replication, and zero single points of failure.

This is what production-grade looks like.


The Architecture

                    ┌─────────────────────────────────────────────────────────────┐
                    │                    Route53 Failover DNS                      │
                    │                   app.example.com                            │
                    └─────────────────────┬───────────────┬───────────────────────┘
                                          │               │
                    ┌─────────────────────▼───────────────▼───────────────────────┐
                    │                                                             │
                    │  PRIMARY REGION (us-east-1)    SECONDARY REGION (us-west-2) │
                    │                                                             │
                    │  ┌─────────────┐               ┌─────────────┐              │
                    │  │    ALB      │               │    ALB      │              │
                    │  └──────┬──────┘               └──────┬──────┘              │
                    │         │                             │                     │
                    │  ┌──────▼──────┐               ┌──────▼──────┐              │
                    │  │    ASG      │               │    ASG      │              │
                    │  │  (2-4 EC2)  │               │  (2-4 EC2)  │              │
                    │  └──────┬──────┘               └──────┬──────┘              │
                    │         │                             │                     │
                    │  ┌──────▼──────┐               ┌──────▼──────┐              │
                    │  │ RDS Multi-AZ│──────────────►│ RDS Replica │              │
                    │  │  (Primary)  │   Replication  │  (Read-only)│              │
                    │  └─────────────┘               └─────────────┘              │
                    └─────────────────────────────────────────────────────────────┘

What's happening:

  • Route53 health checks monitor both regions
  • If primary fails, DNS automatically routes to secondary
  • RDS cross-region replica keeps data in sync
  • Each region has its own VPC, ALB, and Auto Scaling Group

The Project Structure

day27-multi-region-ha/
├── modules/
│   ├── vpc/           # VPC, subnets, NAT gateways
│   ├── alb/           # Load balancer, target group
│   ├── asg/           # Auto Scaling, CloudWatch alarms
│   ├── rds/           # RDS instance with Multi-AZ and replicas
│   └── route53/       # DNS failover routing
├── envs/
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
├── backend.tf
└── provider.tf

Five modules, each with a single responsibility. The VPC module doesn't know about the ALB. The ALB module doesn't know about the ASG. The calling configuration wires them together.


The VPC Module (Network Foundation)

# modules/vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_subnet" "public" {
  count                   = length(var.public_subnet_cidrs)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = var.public_subnet_cidrs[count.index]
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true
}

resource "aws_subnet" "private" {
  count             = length(var.private_subnet_cidrs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]
}

resource "aws_eip" "nat" {
  count  = length(var.public_subnet_cidrs)
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  count         = length(var.public_subnet_cidrs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
}

Why two subnet types:

  • Public subnets → ALB (needs internet access)
  • Private subnets → EC2 instances (no direct internet access)
  • NAT Gateways → allow instances to download packages while remaining private
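
The route tables that make this public/private split work aren't shown in the module excerpt; a minimal sketch, assuming an aws_internet_gateway.main resource also exists in the module:

```hcl
# Public subnets send 0.0.0.0/0 through the Internet Gateway
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

# Private subnets send outbound traffic through the NAT gateway in their AZ
resource "aws_route_table" "private" {
  count  = length(var.private_subnet_cidrs)
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }
}

resource "aws_route_table_association" "private" {
  count          = length(var.private_subnet_cidrs)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}
```

Per-AZ private route tables mean a NAT gateway failure in one AZ doesn't take out egress for the others.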

The ALB Module (Traffic Distribution)

# modules/alb/main.tf
resource "aws_lb" "web" {
  name               = "${var.name}-alb-${var.region}"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.subnet_ids
}

resource "aws_lb_target_group" "web" {
  name     = "${var.name}-tg-${var.region}"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    interval            = 30
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}

The health check endpoint (/health) is critical — Route53 uses it to determine if the region is healthy.
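
For the ALB to actually forward requests to that target group, the module also needs a listener, which the excerpt above omits; a minimal sketch:

```hcl
# Forward all HTTP traffic on port 80 to the web target group
resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.web.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.web.arn
  }
}
```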


The ASG Module (Auto Scaling)

# modules/asg/main.tf
resource "aws_autoscaling_group" "web" {
  min_size            = var.min_size
  max_size            = var.max_size
  desired_capacity    = var.desired_capacity
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = var.target_group_arns
  health_check_type   = "ELB"

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }
}

resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "web-cpu-high-${var.environment}-${var.region}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 70
  alarm_actions       = [aws_autoscaling_policy.scale_out.arn]
}

The connection: target_group_arns links the ASG to the ALB. Without this, instances launch but never receive traffic.
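
The launch template the ASG references isn't shown in the module; a sketch of one that serves the /health endpoint the ALB probes (the AMI variable, security group name, and nginx-based user data are assumptions, not the article's actual setup):

```hcl
resource "aws_launch_template" "web" {
  name_prefix   = "${var.environment}-web-"
  image_id      = var.ami_id
  instance_type = "t3.micro"

  vpc_security_group_ids = [aws_security_group.web.id]

  # Boot script: install a web server and expose the /health path
  user_data = base64encode(<<-EOF
    #!/bin/bash
    dnf install -y nginx
    echo "ok" > /usr/share/nginx/html/health
    systemctl enable --now nginx
  EOF
  )
}
```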


The RDS Module (Database Tier)

# modules/rds/main.tf
resource "aws_db_instance" "main" {
  identifier        = var.identifier
  engine            = "mysql"
  instance_class    = "db.t3.micro"
  multi_az          = var.multi_az
  storage_encrypted = true

  # Set only for the cross-region replica; null for the primary
  replicate_source_db = var.replicate_source_db
}

Multi-AZ (primary region): Synchronous replication to a standby in another AZ of the same region. Failover is automatic, typically within one to two minutes.

Cross-region replica (secondary region): Asynchronous replication to another region. Used for disaster recovery, not automatic failover — the replica must be manually promoted before it can accept writes.
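
Because Terraform's AWS provider is region-scoped, the replica has to be created through a provider alias pointing at the secondary region. A sketch of how the providers might be declared and passed in (the alias name is an assumption):

```hcl
# provider.tf
provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "secondary"
  region = "us-west-2"
}

# envs/prod/main.tf — run the replica module against the secondary region
module "rds_replica" {
  source = "../../modules/rds"
  providers = {
    aws = aws.secondary
  }
  replicate_source_db = module.rds_primary.db_instance_arn
}
```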


The Route53 Module (DNS Failover)

# modules/route53/main.tf
resource "aws_route53_health_check" "primary" {
  fqdn              = var.primary_alb_dns_name
  port              = 80
  type              = "HTTP"
  resource_path     = "/health"
  failure_threshold = 3
}

resource "aws_route53_record" "primary" {
  zone_id         = var.hosted_zone_id
  name            = var.domain_name
  type            = "A"
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = var.primary_alb_dns_name
    zone_id                = var.primary_alb_zone_id
    evaluate_target_health = true
  }
}

How failover works:

  1. Route53 health checks ping the ALB's /health endpoint every 30 seconds
  2. After 3 failures (90 seconds), health check marks region as unhealthy
  3. Route53 stops sending traffic to primary, starts sending to secondary
  4. DNS TTL (60 seconds) + health check interval = ~2-3 minute failover
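
The PRIMARY record above is only half the pair; its SECONDARY counterpart (not shown in the module excerpt) is what traffic fails over to. A minimal sketch:

```hcl
resource "aws_route53_record" "secondary" {
  zone_id        = var.hosted_zone_id
  name           = var.domain_name
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = var.secondary_alb_dns_name
    zone_id                = var.secondary_alb_zone_id
    evaluate_target_health = true
  }
}
```

Route53 serves this record only while the primary's health check is failing.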

The Calling Configuration (Wiring Everything Together)

# envs/prod/main.tf
module "vpc_primary" {
  source = "../../modules/vpc"
  region = "us-east-1"
  # ... VPC config
}

module "alb_primary" {
  source = "../../modules/alb"
  vpc_id = module.vpc_primary.vpc_id
  subnet_ids = module.vpc_primary.public_subnet_ids
}

module "asg_primary" {
  source = "../../modules/asg"
  target_group_arns = [module.alb_primary.target_group_arn]
  launch_template_ami = var.primary_ami_id
}

module "rds_primary" {
  source = "../../modules/rds"
  multi_az = true
}

module "rds_replica" {
  source = "../../modules/rds"
  is_replica = true
  replicate_source_db = module.rds_primary.db_instance_arn
}

module "route53" {
  source = "../../modules/route53"
  primary_alb_dns_name = module.alb_primary.alb_dns_name
  secondary_alb_dns_name = module.alb_secondary.alb_dns_name
}

The data flow:

  1. VPC module outputs subnet IDs
  2. ALB module uses those to place the load balancer
  3. ASG module uses ALB's target group ARN to register instances
  4. RDS replica uses primary's ARN to set up replication
  5. Route53 uses both ALB DNS names for failover
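
This wiring only works because each module exposes outputs for the next one to consume; a sketch of what the VPC and ALB modules would need to export (names assumed to match the module calls above):

```hcl
# modules/vpc/outputs.tf
output "vpc_id" {
  value = aws_vpc.main.id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

# modules/alb/outputs.tf
output "target_group_arn" {
  value = aws_lb_target_group.web.arn
}

output "alb_dns_name" {
  value = aws_lb.web.dns_name
}
```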

The Deployment

$ terraform apply -auto-approve

Apply complete! Resources: 19 added, 1 changed, 0 destroyed.

Outputs:
alb_url = "http://alb-us-east-1-234339925.eu-north-1.elb.amazonaws.com"

The Result

What works:

  • ALB distributes traffic to healthy instances
  • ASG maintains 2-4 instances based on CPU
  • CloudWatch alarms trigger scaling at 70% CPU
  • RDS Multi-AZ protects against AZ failure
  • Cross-region replica keeps secondary region in sync

What happens during a region outage:

  1. Health checks fail (90 seconds)
  2. Route53 stops sending traffic to primary
  3. Traffic shifts to secondary region
  4. Users continue accessing the application with minimal interruption

What I Learned

Multi-AZ ≠ cross-region. Multi-AZ protects against AZ failure within a region. Cross-region replicas protect against full regional outages. You need both for true high availability.

Health checks are critical. Without them, Route53 has no way to know a region is down. Every ALB needs a /health endpoint.

Modules must be focused. The VPC module shouldn't know about the ALB. The ALB module shouldn't know about the ASG. Each module does one thing well.

The calling configuration is the "glue." All the wiring happens in envs/prod/main.tf. The modules stay generic and reusable.


The Bottom Line

Component              Protects Against    Failover Time
─────────────────────  ──────────────────  ────────────────
Multi-AZ RDS           AZ failure          Minutes
Cross-region replica   Regional outage     Manual promotion
Auto Scaling Group     Instance failure    Minutes
Route53 failover       Regional outage     2-3 minutes

This is what production-grade infrastructure looks like. No single points of failure. Automatic failover. Cross-region replication.

One terraform apply. Two regions. Near-zero downtime.
