How I Built a Full High Availability AWS Infrastructure with Terraform Modules

Introduction

Most AWS tutorials teach you how to launch a single EC2 instance in a public subnet and call it a day. That's fine for learning the basics, but it's nowhere near what production infrastructure looks like.
In this article I'll walk you through how I designed and deployed a full multi-tier, multi-AZ High Availability infrastructure on AWS, written entirely in Terraform and structured as reusable modules. By the end you'll understand not just what I built, but why each decision was made.
This is part of my AWS SAA-C03 certification preparation as an AWS Community Builder 2026 (Serverless track).

What Does "High Availability" Actually Mean?

High Availability means your system keeps running even when something fails. In AWS, the primary failure unit is an Availability Zone (AZ): one or more physically separate data centres within a region.

True HA means no single AZ failure can bring your application down. That requires every tier (network, compute, and database) to span multiple AZs.

Here's what most people get wrong: they put their EC2 instances in two AZs but share a single NAT Gateway in one AZ. When that AZ goes down, all outbound traffic from private subnets dies, even for the instances in the healthy AZ. True HA requires one NAT Gateway per AZ.

Architecture Drawing

[Architecture diagram, drawn in Draw.io]

Architecture Overview

The infrastructure spans 3 Availability Zones in eu-west-1 (Ireland) across 4 tiers:

Internet
    |
Route 53 (app.skylumanex.click)
    |
Application Load Balancer (3 public subnets)
    |
Auto Scaling Group (3 private app subnets)
    |
RDS MySQL Multi-AZ (3 private DB subnets)

Every tier lives in its own subnet type, in its own security group, with tightly scoped ingress rules.

Terraform Module Structure

modules/
├── network/      # VPC, subnets, IGW, NAT, route tables
├── compute/      # Security groups, ALB, ASG, CloudWatch
├── database/     # RDS Multi-AZ, DB subnet group
└── dns/          # Route 53 alias + health check
environments/
└── dev/          # Root module wiring everything together

Each module has exactly 3 files: main.tf, variables.tf, and outputs.tf. Modules communicate through outputs and inputs: the networking module outputs the VPC ID and subnet IDs, the compute module takes those as inputs, and the database module takes the app security group ID from compute to scope its DB ingress rules.
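To make the output-to-input contract concrete, here is a minimal sketch of what the two sides might look like. The output and variable names match the ones used in the root module later in this article; the resource names (aws_vpc.this, aws_subnet.public, aws_subnet.private_app) are assumptions for illustration, since the module internals aren't shown here.

```hcl
# modules/network/outputs.tf (sketch; resource names are assumed)
output "vpc_id" {
  description = "ID of the VPC created by this module"
  value       = aws_vpc.this.id
}

output "public_subnet_ids" {
  description = "One public subnet ID per AZ"
  value       = aws_subnet.public[*].id
}

output "private_app_subnet_ids" {
  description = "One private app subnet ID per AZ"
  value       = aws_subnet.private_app[*].id
}

# modules/compute/variables.tf (sketch): the matching inputs
variable "vpc_id" {
  type = string
}

variable "public_subnet_ids" {
  type = list(string)
}

variable "private_app_subnet_ids" {
  type = list(string)
}
```

Typing the inputs as list(string) means a wiring mistake in the root module, such as passing a single ID where a list is expected, fails at plan time rather than at apply time.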

Module 1 — Networking

The networking module creates the entire network foundation:

  • 1 VPC (10.0.0.0/16)
  • 3 public subnets (one per AZ) for the ALB and NAT Gateways
  • 3 private app subnets (one per AZ) for EC2 instances
  • 3 private DB subnets (one per AZ) for RDS
  • 1 Internet Gateway
  • NAT Gateway(s), configurable via the single_nat_gateway toggle
  • Route tables: 1 public, 1 private per AZ

The NAT Gateway decision:

variable "single_nat_gateway" {
  type    = bool
  default = true  # cost-optimized for dev
}

locals {
  nat_count = var.single_nat_gateway ? 1 : length(local.azs)
}

Set single_nat_gateway = false in production and you get one NAT Gateway per AZ: true HA outbound routing at ~$97/month. Keep it true for dev at ~$33/month. Both figures cover the hourly charge only; data processing is billed on top.
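The toggle only helps if the private route tables follow it. The following is a sketch of how nat_count could drive the gateways and the per-AZ default routes. It assumes resources named aws_vpc.this and aws_subnet.public exist in the module, and it uses the domain argument on aws_eip, which requires AWS provider v5 or later. It is not necessarily how the actual module is written.

```hcl
# Sketch: resource names and layout are assumptions, not the module's real code.
resource "aws_eip" "nat" {
  count  = local.nat_count
  domain = "vpc" # AWS provider v5+; older versions used vpc = true
}

resource "aws_nat_gateway" "this" {
  count         = local.nat_count
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id # NAT lives in a public subnet
}

# One private route table per AZ, regardless of how many NAT Gateways exist.
resource "aws_route_table" "private" {
  count  = length(local.azs)
  vpc_id = aws_vpc.this.id
}

resource "aws_route" "private_default" {
  count                  = length(local.azs)
  route_table_id         = aws_route_table.private[count.index].id
  destination_cidr_block = "0.0.0.0/0"

  # Single-NAT mode: every AZ routes through gateway 0.
  # Per-AZ mode: each AZ routes through the gateway in its own AZ.
  nat_gateway_id = aws_nat_gateway.this[var.single_nat_gateway ? 0 : count.index].id
}
```

Keeping one route table per AZ in both modes means switching the toggle only changes route targets, not the number of route tables.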

Dynamic AZ lookup, no hardcoded AZ names:

data "aws_availability_zones" "available" {
  state = "available"
}

locals {
  azs = slice(data.aws_availability_zones.available.names, 0, 3)
}

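This makes the module region-agnostic, as long as the target region has at least three AZs (slice fails if it asks for more names than the list contains). Deploy to us-east-1 and it automatically picks the first three available AZs there.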
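With local.azs in place, subnets can be derived from the list instead of being written out three times. Below is a sketch for the private app tier; the resource names and the CIDR offset of 11 are assumptions, picked only because they would yield 10.0.11.0/24, 10.0.12.0/24, and 10.0.13.0/24 inside a 10.0.0.0/16 VPC.

```hcl
# Sketch: one private app subnet per AZ, derived from local.azs.
resource "aws_subnet" "private_app" {
  count             = length(local.azs)
  vpc_id            = aws_vpc.this.id # assumed resource name
  availability_zone = local.azs[count.index]

  # cidrsubnet("10.0.0.0/16", 8, 11) => "10.0.11.0/24", then .12, .13
  cidr_block = cidrsubnet(var.vpc_cidr_block, 8, 11 + count.index)

  tags = {
    Name = "private-app-${local.azs[count.index]}"
  }
}
```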

Module 2 — Compute

The compute module creates the application layer.

Two security groups with layered access:

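# ALB SG: the internet can reach the ALB on port 80
resource "aws_security_group" "alb" {
  vpc_id = var.vpc_id # without this, the group is created in the default VPC
  # name, egress, and tags omitted for brevity

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# App SG: only the ALB can reach the EC2 instances
resource "aws_security_group" "app" {
  vpc_id = var.vpc_id
  # name, egress, and tags omitted for brevity

  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }
}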

EC2 instances are in private subnets and only accept traffic that came through the ALB. They are never directly reachable from the internet.

Launch template with security best practices:

metadata_options {
  http_endpoint               = "enabled"
  http_tokens                 = "required"  # enforces IMDSv2
  http_put_response_hop_limit = 1
}

monitoring {
  enabled = true  # detailed CloudWatch metrics
}
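The Auto Scaling Group that consumes this launch template isn't shown above. A minimal sketch of how it could be wired follows, assuming a launch template named aws_launch_template.app and a target group named aws_lb_target_group.app; the sizes and grace period are illustrative values, not the ones from my configuration.

```hcl
# Sketch: resource names and capacity values are assumptions.
resource "aws_autoscaling_group" "app" {
  name_prefix      = "app-"
  min_size         = 3
  max_size         = 6
  desired_capacity = 3

  # Spread instances across the private app subnets, one per AZ.
  vpc_zone_identifier = var.private_app_subnet_ids

  # Register instances with the ALB and replace any that fail its health checks.
  target_group_arns         = [aws_lb_target_group.app.arn]
  health_check_type         = "ELB"
  health_check_grace_period = 120

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}
```

Setting health_check_type to "ELB" matters for availability: with the default EC2 check, an instance whose web server has crashed still counts as healthy as long as the VM itself is running.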

Auto Scaling with CloudWatch alarms:

resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  comparison_operator = "GreaterThanThreshold"
  threshold           = var.cpu_high_threshold  # default 60%
  alarm_actions       = [aws_autoscaling_policy.scale_out.arn]
}

resource "aws_cloudwatch_metric_alarm" "cpu_low" {
  comparison_operator = "LessThanThreshold"
  threshold           = var.cpu_low_threshold   # default 20%
  alarm_actions       = [aws_autoscaling_policy.scale_in.arn]
}
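The alarm snippets above are trimmed to the interesting arguments. For reference, a fuller scale-out pair might look like the sketch below. The alarm name, period, evaluation periods, cooldown, and the aws_autoscaling_group.app reference are assumptions for illustration.

```hcl
# Sketch: simple scaling policy plus the alarm that triggers it.
resource "aws_autoscaling_policy" "scale_out" {
  name                   = "scale-out"
  autoscaling_group_name = aws_autoscaling_group.app.name # assumed ASG name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 1   # add one instance per trigger
  cooldown               = 300 # seconds before another scaling action
}

resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "app-cpu-high"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 60 # 1-minute datapoints need detailed monitoring
  evaluation_periods  = 2
  comparison_operator = "GreaterThanThreshold"
  threshold           = var.cpu_high_threshold
  alarm_actions       = [aws_autoscaling_policy.scale_out.arn]

  # Aggregate CPU across all instances in the group.
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.app.name
  }
}
```

The scale-in side mirrors this with a negative scaling_adjustment and the LessThanThreshold alarm.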

Module 3 — Database

RDS MySQL with Multi-AZ, the core HA database setting:

resource "aws_db_instance" "this" {
  engine                  = var.db_engine
  instance_class          = var.db_instance_class
  multi_az                = true  # synchronous standby + auto failover
  storage_encrypted       = true  # encrypted at rest
  storage_type            = "gp3" # faster and cheaper than gp2
  deletion_protection     = true  # safety guardrail
  backup_retention_period = 7     # 7 days of automated backups
  # allocated_storage, credentials, subnet group, and security groups omitted
}

The DB security group only accepts traffic from the app security group — not from any IP address, not from the internet:

ingress {
  from_port       = 3306
  to_port         = 3306
  protocol        = "tcp"
  security_groups = [var.app_security_group_id]
}
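For Multi-AZ to place the standby in a different AZ, the instance also needs a DB subnet group spanning the private DB subnets. A sketch of that piece and of the security group wrapping the ingress rule above; the resource names and name prefixes are assumptions.

```hcl
# Sketch: subnet group spanning all private DB subnets (one per AZ).
resource "aws_db_subnet_group" "this" {
  name_prefix = "app-db-"
  subnet_ids  = var.private_db_subnet_ids
}

# Sketch: the security group that holds the MySQL ingress rule.
resource "aws_security_group" "db" {
  name_prefix = "db-"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 3306
    to_port         = 3306
    protocol        = "tcp"
    security_groups = [var.app_security_group_id] # only the app tier
  }
}
```

The aws_db_instance would then reference these through db_subnet_group_name = aws_db_subnet_group.this.name and vpc_security_group_ids = [aws_security_group.db.id].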

Module 4 — DNS

Route 53 alias record pointing app.skylumanex.click to the ALB, with a health check:

resource "aws_route53_record" "app" {
  zone_id = var.hosted_zone_id
  name    = var.domain_name
  type    = "A"

  alias {
    name                   = var.alb_dns_name
    zone_id                = var.alb_zone_id
    evaluate_target_health = true
  }
}

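evaluate_target_health = true tells Route 53 to take the ALB's health into account when answering queries for this record. With a single alias record there is nothing else to fail over to, so the setting becomes a real layer of resilience once a second record exists (for example, a failover or weighted record pointing at another ALB).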
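The health check itself isn't shown above. A sketch of what a standalone Route 53 health check against the public endpoint could look like follows; the path, interval, and threshold are assumed values.

```hcl
# Sketch: HTTP health check against the public hostname (values are assumptions).
resource "aws_route53_health_check" "app" {
  fqdn              = var.domain_name
  type              = "HTTP"
  port              = 80
  resource_path     = "/"
  request_interval  = 30 # seconds between checks
  failure_threshold = 3  # consecutive failures before "unhealthy"

  tags = {
    Name = "app-endpoint"
  }
}
```

On its own, a check like this gives external visibility (it can back a CloudWatch alarm); it only influences DNS answers when attached to records that use a routing policy such as failover.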

Root Module — Wiring Everything Together

The environments/dev/main.tf calls all 4 modules and passes outputs between them:

module "networking" {
  source             = "../../modules/network"
  project_name       = var.project_name
  vpc_cidr_block     = var.vpc_cidr_block
  single_nat_gateway = var.single_nat_gateway
  # ...
}

module "compute" {
  source                 = "../../modules/compute"
  vpc_id                 = module.networking.vpc_id
  public_subnet_ids      = module.networking.public_subnet_ids
  private_app_subnet_ids = module.networking.private_app_subnet_ids
  # ...
}

module "database" {
  source                = "../../modules/database"
  vpc_id                = module.networking.vpc_id
  private_db_subnet_ids = module.networking.private_db_subnet_ids
  app_security_group_id = module.compute.app_security_group_id
  # ...
}

module "dns" {
  source         = "../../modules/dns"
  domain_name    = var.domain_name # used by the record above; root variable name assumed
  alb_dns_name   = module.compute.alb_dns_name
  alb_zone_id    = module.compute.alb_zone_id
  hosted_zone_id = var.hosted_zone_id
}

Notice how module.networking.vpc_id flows into both compute and database. module.compute.app_security_group_id flows into database. Each module is independent but they communicate cleanly through their interfaces.

Deploying It

cd environments/dev
export TF_VAR_db_password="YourStrongPassword"
terraform init
terraform plan
terraform apply

Terraform provisions all 40 resources in the correct dependency order automatically.

The Proof

$ curl http://app.skylumanex.click
Hello from ip-10-0-11-230.eu-west-1.compute.internal

$ curl http://app.skylumanex.click
Hello from ip-10-0-12-181.eu-west-1.compute.internal

Traffic is distributed across instances in private subnets in eu-west-1a and eu-west-1b through the ALB, with the hostname resolved via Route 53. The instances are never directly reachable from the internet.

Key Lessons

  1. Modules enforce separation of concerns: the networking module doesn't know about EC2, and the compute module doesn't know about RDS. Each module has one job.

  2. Outputs are the module API: what a module exposes in outputs.tf is its contract with the outside world. Design them carefully.

  3. The NAT Gateway is the hidden single point of failure, and most HA tutorials miss this. One shared NAT Gateway means one AZ failure kills all outbound private traffic.

  4. deletion_protection = true on RDS is a guardrail, not an obstacle — it saved me from accidentally destroying a database during testing. Disable it explicitly before destroy, never by default.

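  5. Never put db_password in terraform.tfvars; pass it through the TF_VAR_db_password environment variable instead. That keeps it out of files you might commit. Terraform still records resource arguments such as an RDS password in its state, so the state itself needs to be stored securely.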
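On the Terraform side, lesson 5 pairs well with marking the variable as sensitive so that plan and apply output redact it. A minimal sketch of such a declaration is below; the description and the validation rule are my additions here, not something shown earlier in the article.

```hcl
# Sketch: populated from the TF_VAR_db_password environment variable.
variable "db_password" {
  description = "Master password for the RDS instance"
  type        = string
  sensitive   = true # redacted in plan/apply output

  validation {
    # RDS for MySQL rejects master passwords shorter than 8 characters.
    condition     = length(var.db_password) >= 8
    error_message = "db_password must be at least 8 characters long."
  }
}
```

Note that sensitive = true only affects what Terraform prints; it does not change what is written to the state.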

What's Next

  • Add a bastion host or SSM Session Manager for secure instance access

  • Enable VPC Flow Logs for network traffic visibility

  • Add WAF in front of the ALB

  • Build a staging/ environment by copying environments/dev/; the modules don't change

Resources

  • Terraform AWS Provider docs

  • AWS Well-Architected Framework — Reliability Pillar

  • GitHub: aws-full-ha-infra

Part of my AWS SAA-C03 prep as an AWS Community Builder 2026 (Serverless track). Follow along as I build toward certification.
