Mukami
Building a 3-Tier Multi-Region High Availability Architecture with Terraform

From Single Region to Production-Grade Global Infrastructure


Day 27 of the 30-Day Terraform Challenge — and today I built something that can survive an entire AWS region going offline.

Yesterday I built a scalable web app in one region. Today I built infrastructure that spans two regions, with automatic failover, cross-region database replication, and zero single points of failure.

This is what production-grade looks like.


The Architecture

                    ┌─────────────────────────────────────────────────────────────┐
                    │                    Route53 Failover DNS                      │
                    │                   app.example.com                            │
                    └─────────────────────┬───────────────┬───────────────────────┘
                                          │               │
                    ┌─────────────────────▼───────────────▼───────────────────────┐
                    │                                                             │
                    │  PRIMARY REGION (us-east-1)    SECONDARY REGION (us-west-2) │
                    │                                                             │
                    │  ┌─────────────┐               ┌─────────────┐              │
                    │  │    ALB      │               │    ALB      │              │
                    │  └──────┬──────┘               └──────┬──────┘              │
                    │         │                             │                     │
                    │  ┌──────▼──────┐               ┌──────▼──────┐              │
                    │  │    ASG      │               │    ASG      │              │
                    │  │  (2-4 EC2)  │               │  (2-4 EC2)  │              │
                    │  └──────┬──────┘               └──────┬──────┘              │
                    │         │                             │                     │
                    │  ┌──────▼──────┐               ┌──────▼──────┐              │
                    │  │ RDS Multi-AZ│──────────────►│ RDS Replica │              │
                    │  │  (Primary)  │   Replication  │  (Read-only)│              │
                    │  └─────────────┘               └─────────────┘              │
                    └─────────────────────────────────────────────────────────────┘

What's happening:

  • Route53 health checks monitor both regions
  • If primary fails, DNS automatically routes to secondary
  • RDS cross-region replica keeps data in sync
  • Each region has its own VPC, ALB, and Auto Scaling Group

The Project Structure

day27-multi-region-ha/
├── modules/
│   ├── vpc/           # VPC, subnets, NAT gateways
│   ├── alb/           # Load balancer, target group
│   ├── asg/           # Auto Scaling, CloudWatch alarms
│   ├── rds/           # RDS instance with Multi-AZ and replicas
│   └── route53/       # DNS failover routing
├── envs/
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
├── backend.tf
└── provider.tf

Five modules, each with a single responsibility. The VPC module doesn't know about the ALB. The ALB module doesn't know about the ASG. The calling configuration wires them together.


The VPC Module (Network Foundation)

# modules/vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_subnet" "public" {
  count                   = length(var.public_subnet_cidrs)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = var.public_subnet_cidrs[count.index]
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true
}

resource "aws_subnet" "private" {
  count             = length(var.private_subnet_cidrs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]
}

resource "aws_eip" "nat" {
  count  = length(var.public_subnet_cidrs)
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  count         = length(var.public_subnet_cidrs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
}

Why two subnet types:

  • Public subnets → ALB (needs internet access)
  • Private subnets → EC2 instances (no direct internet access)
  • NAT Gateways → allow instances to download packages while remaining private
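
The route tables that make this public/private split work aren't shown in the module excerpt; a minimal sketch, assuming an aws_internet_gateway.main resource also exists in the module:

```hcl
# Public subnets send 0.0.0.0/0 through the Internet Gateway
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

# Private subnets send outbound traffic through the NAT gateway in their AZ
resource "aws_route_table" "private" {
  count  = length(var.private_subnet_cidrs)
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }
}

resource "aws_route_table_association" "private" {
  count          = length(var.private_subnet_cidrs)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}
```

Per-AZ private route tables mean a NAT gateway failure in one AZ doesn't take out egress for the others.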

The ALB Module (Traffic Distribution)

# modules/alb/main.tf
resource "aws_lb" "web" {
  name               = "${var.name}-alb-${var.region}"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.subnet_ids
}

resource "aws_lb_target_group" "web" {
  name     = "${var.name}-tg-${var.region}"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    interval            = 30
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}

The health check endpoint (/health) is critical — Route53 uses it to determine if the region is healthy.
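
For the ALB to actually forward requests to that target group, the module also needs a listener, which the excerpt above omits; a minimal sketch:

```hcl
# Forward all HTTP traffic on port 80 to the web target group
resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.web.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.web.arn
  }
}
```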


The ASG Module (Auto Scaling)

# modules/asg/main.tf
resource "aws_autoscaling_group" "web" {
  min_size            = var.min_size
  max_size            = var.max_size
  desired_capacity    = var.desired_capacity
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = var.target_group_arns
  health_check_type   = "ELB"

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }
}

resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "web-cpu-high-${var.environment}-${var.region}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 70
  alarm_actions       = [aws_autoscaling_policy.scale_out.arn]
}

The connection: target_group_arns links the ASG to the ALB. Without this, instances launch but never receive traffic.
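
The launch template the ASG references isn't shown in the module; a sketch of one that serves the /health endpoint the ALB probes (the AMI variable, security group name, and nginx-based user data are assumptions, not the article's actual setup):

```hcl
resource "aws_launch_template" "web" {
  name_prefix   = "${var.environment}-web-"
  image_id      = var.ami_id
  instance_type = "t3.micro"

  vpc_security_group_ids = [aws_security_group.web.id]

  # Boot script: install a web server and expose the /health path
  user_data = base64encode(<<-EOF
    #!/bin/bash
    dnf install -y nginx
    echo "ok" > /usr/share/nginx/html/health
    systemctl enable --now nginx
  EOF
  )
}
```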


The RDS Module (Database Tier)

# modules/rds/main.tf
resource "aws_db_instance" "main" {
  identifier        = var.identifier
  engine            = "mysql"
  instance_class    = "db.t3.micro"
  multi_az          = var.multi_az
  storage_encrypted = true

  # Set only for the cross-region replica; null for the primary
  replicate_source_db = var.replicate_source_db
}

Multi-AZ (primary region): Synchronous replication to a standby in another AZ of the same region. Failover is automatic, typically within one to two minutes.

Cross-region replica (secondary region): Asynchronous replication to another region. Used for disaster recovery, not automatic failover — the replica must be manually promoted before it can accept writes.
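
Because Terraform's AWS provider is region-scoped, the replica has to be created through a provider alias pointing at the secondary region. A sketch of how the providers might be declared and passed in (the alias name is an assumption):

```hcl
# provider.tf
provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "secondary"
  region = "us-west-2"
}

# envs/prod/main.tf — run the replica module against the secondary region
module "rds_replica" {
  source = "../../modules/rds"
  providers = {
    aws = aws.secondary
  }
  replicate_source_db = module.rds_primary.db_instance_arn
}
```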


The Route53 Module (DNS Failover)

# modules/route53/main.tf
resource "aws_route53_health_check" "primary" {
  fqdn              = var.primary_alb_dns_name
  port              = 80
  type              = "HTTP"
  resource_path     = "/health"
  failure_threshold = 3
}

resource "aws_route53_record" "primary" {
  zone_id         = var.hosted_zone_id
  name            = var.domain_name
  type            = "A"
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = var.primary_alb_dns_name
    zone_id                = var.primary_alb_zone_id
    evaluate_target_health = true
  }
}

How failover works:

  1. Route53 health checks ping the ALB's /health endpoint every 30 seconds
  2. After 3 failures (90 seconds), health check marks region as unhealthy
  3. Route53 stops sending traffic to primary, starts sending to secondary
  4. DNS TTL (60 seconds) + health check interval = ~2-3 minute failover
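
The PRIMARY record above is only half the pair; its SECONDARY counterpart (not shown in the module excerpt) is what traffic fails over to. A minimal sketch:

```hcl
resource "aws_route53_record" "secondary" {
  zone_id        = var.hosted_zone_id
  name           = var.domain_name
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = var.secondary_alb_dns_name
    zone_id                = var.secondary_alb_zone_id
    evaluate_target_health = true
  }
}
```

Route53 serves this record only while the primary's health check is failing.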

The Calling Configuration (Wiring Everything Together)

# envs/prod/main.tf
module "vpc_primary" {
  source = "../../modules/vpc"
  region = "us-east-1"
  # ... VPC config
}

module "alb_primary" {
  source = "../../modules/alb"
  vpc_id = module.vpc_primary.vpc_id
  subnet_ids = module.vpc_primary.public_subnet_ids
}

module "asg_primary" {
  source = "../../modules/asg"
  target_group_arns = [module.alb_primary.target_group_arn]
  launch_template_ami = var.primary_ami_id
}

module "rds_primary" {
  source = "../../modules/rds"
  multi_az = true
}

module "rds_replica" {
  source = "../../modules/rds"
  is_replica = true
  replicate_source_db = module.rds_primary.db_instance_arn
}

module "route53" {
  source = "../../modules/route53"
  primary_alb_dns_name = module.alb_primary.alb_dns_name
  secondary_alb_dns_name = module.alb_secondary.alb_dns_name
}

The data flow:

  1. VPC module outputs subnet IDs
  2. ALB module uses those to place the load balancer
  3. ASG module uses ALB's target group ARN to register instances
  4. RDS replica uses primary's ARN to set up replication
  5. Route53 uses both ALB DNS names for failover
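
This wiring only works because each module exposes outputs for the next one to consume; a sketch of what the VPC and ALB modules would need to export (names assumed to match the module calls above):

```hcl
# modules/vpc/outputs.tf
output "vpc_id" {
  value = aws_vpc.main.id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

# modules/alb/outputs.tf
output "target_group_arn" {
  value = aws_lb_target_group.web.arn
}

output "alb_dns_name" {
  value = aws_lb.web.dns_name
}
```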

The Deployment

$ terraform apply -auto-approve

Apply complete! Resources: 19 added, 1 changed, 0 destroyed.

Outputs:
alb_url = "http://alb-us-east-1-234339925.eu-north-1.elb.amazonaws.com"

The Result

What works:

  • ALB distributes traffic to healthy instances
  • ASG maintains 2-4 instances based on CPU
  • CloudWatch alarms trigger scaling at 70% CPU
  • RDS Multi-AZ protects against AZ failure
  • Cross-region replica keeps secondary region in sync

What happens during a region outage:

  1. Health checks fail (90 seconds)
  2. Route53 stops sending traffic to primary
  3. Traffic shifts to secondary region
  4. Users continue accessing the application with minimal interruption

What I Learned

Multi-AZ ≠ cross-region. Multi-AZ protects against AZ failure within a region. Cross-region replicas protect against full regional outages. You need both for true high availability.

Health checks are critical. Without them, Route53 has no way to know a region is down. Every ALB needs a /health endpoint.

Modules must be focused. The VPC module shouldn't know about the ALB. The ALB module shouldn't know about the ASG. Each module does one thing well.

The calling configuration is the "glue." All the wiring happens in envs/prod/main.tf. The modules stay generic and reusable.


The Bottom Line

Component              Protects Against    Failover Time
─────────────────────  ──────────────────  ────────────────
Multi-AZ RDS           AZ failure          Minutes
Cross-region replica   Regional outage     Manual promotion
Auto Scaling Group     Instance failure    Minutes
Route53 failover       Regional outage     2-3 minutes

This is what production-grade infrastructure looks like. No single points of failure. Automatic failover. Cross-region replication.

One terraform apply. Two regions. Near-zero downtime.
