Introduction
Most AWS tutorials teach you how to launch a single EC2 instance in a public subnet and call it a day. That's fine for learning the basics, but it's nowhere near what production infrastructure looks like.
In this article I'll walk you through how I designed and deployed a full multi-tier, multi-AZ High Availability infrastructure on AWS, written entirely in Terraform and structured as reusable modules. By the end you'll understand not just what I built, but why I made each decision.
This is part of my AWS SAA-C03 certification preparation as an AWS Community Builder 2026 (Serverless track).
What Does "High Availability" Actually Mean?
High Availability means your system keeps running even when something fails. In AWS, the primary failure unit is an Availability Zone (AZ): one or more physically separate data centres within a region.
True HA means no single AZ failure can bring your application down. That requires every tier (network, compute, and database) to span multiple AZs.
Here's what most people get wrong: they put their EC2 instances in two AZs but share a single NAT Gateway in one AZ. When that AZ goes down, all outbound traffic from private subnets dies, even for the instances in the healthy AZ. True HA requires one NAT Gateway per AZ.
Architecture Drawing
Architecture Overview
The infrastructure spans 3 Availability Zones in eu-west-1 (Ireland) across 4 tiers:
Internet
|
Route 53 (app.skylumanex.click)
|
Application Load Balancer (3 public subnets)
|
Auto Scaling Group (3 private app subnets)
|
RDS MySQL Multi-AZ (3 private DB subnets)
Every tier lives in its own subnet type, in its own security group, with tightly scoped ingress rules.
Terraform Module Structure
modules/
├── network/ # VPC, subnets, IGW, NAT, route tables
├── compute/ # Security groups, ALB, ASG, CloudWatch
├── database/ # RDS Multi-AZ, DB subnet group
└── dns/ # Route 53 alias + health check
environments/
└── dev/ # Root module wiring everything together
Each module has exactly 3 files: main.tf, variables.tf, and outputs.tf. Modules communicate through outputs and inputs: the networking module outputs the VPC ID and subnet IDs, the compute module takes those as inputs, and the database module takes the app security group ID from compute to scope its DB ingress rules.
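To make the "outputs and inputs" idea concrete, here is a minimal sketch of what the networking module's outputs.tf could look like. The output names match the ones the root module consumes later in this article; the resource names (aws_vpc.this, aws_subnet.public, and so on) are assumptions for illustration, not necessarily what the repository uses.

```hcl
# modules/network/outputs.tf (sketch)
# Resource names such as aws_vpc.this are assumed for illustration.

output "vpc_id" {
  description = "ID of the VPC, consumed by the compute and database modules"
  value       = aws_vpc.this.id
}

output "public_subnet_ids" {
  description = "Public subnet IDs, one per AZ (ALB and NAT Gateways)"
  value       = aws_subnet.public[*].id
}

output "private_app_subnet_ids" {
  description = "Private app subnet IDs, one per AZ (EC2 instances)"
  value       = aws_subnet.private_app[*].id
}

output "private_db_subnet_ids" {
  description = "Private DB subnet IDs, one per AZ (RDS)"
  value       = aws_subnet.private_db[*].id
}
```

Anything not listed here stays private to the module, which is what keeps the other modules from depending on its internals.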
Module 1 — Networking
The networking module creates the entire network foundation:
- 1 VPC (10.0.0.0/16)
- 3 public subnets (one per AZ) for the ALB and NAT Gateways
- 3 private app subnets (one per AZ) for EC2 instances
- 3 private DB subnets (one per AZ) for RDS
- 1 Internet Gateway
- NAT Gateway(s), configurable via the single_nat_gateway toggle
- Route tables: 1 public, 1 private per AZ
The NAT Gateway decision:
variable "single_nat_gateway" {
  type    = bool
  default = true # cost-optimized for dev
}
locals {
  nat_count = var.single_nat_gateway ? 1 : length(local.azs)
}
Set single_nat_gateway = false in production and you get one NAT Gateway per AZ: true HA outbound routing at ~$97/month. Keep it true for dev at ~$33/month.
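As a rough sketch of how nat_count can drive the actual resources, the snippet below creates that many NAT Gateways and points each AZ's private route table at the right one. The resource names and the references to aws_vpc.this and aws_subnet.public are assumptions for illustration, and `domain = "vpc"` assumes AWS provider v5 or later.

```hcl
# Sketch only: resource names are assumed, not taken from the repository.

resource "aws_eip" "nat" {
  count  = local.nat_count
  domain = "vpc" # assumes AWS provider v5+
}

resource "aws_nat_gateway" "this" {
  count         = local.nat_count
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
}

# One private route table per AZ. With a single NAT Gateway every table
# points at index 0; otherwise each AZ routes through the gateway in its own AZ.
resource "aws_route_table" "private" {
  count  = length(local.azs)
  vpc_id = aws_vpc.this.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.this[var.single_nat_gateway ? 0 : count.index].id
  }
}
```

The conditional index is the important part: it is what keeps a healthy AZ's outbound traffic from depending on a gateway in a different AZ when the toggle is off.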
Dynamic AZ lookup, no hardcoded AZ names:
data "aws_availability_zones" "available" {
  state = "available"
}
locals {
  azs = slice(data.aws_availability_zones.available.names, 0, 3)
}
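This makes the module region-agnostic. Deploy to us-east-1 and it automatically picks the first three available AZs there. The one constraint is that the region must offer at least three AZs, otherwise the slice call fails.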
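The same local.azs list can also drive subnet creation, so nothing about the layout is hardcoded per AZ. The sketch below is one way to do it for the private app subnets; the resource names and the netnum offset of 11 are assumptions chosen for illustration.

```hcl
# Sketch: resource names and the "+ 11" offset are assumptions.
resource "aws_subnet" "private_app" {
  count             = length(local.azs)
  vpc_id            = aws_vpc.this.id
  availability_zone = local.azs[count.index]

  # With vpc_cidr_block = "10.0.0.0/16", cidrsubnet(..., 8, 11) is
  # "10.0.11.0/24", then 10.0.12.0/24 and 10.0.13.0/24 for the next AZs.
  cidr_block = cidrsubnet(var.vpc_cidr_block, 8, count.index + 11)
}
```

Deriving the CIDRs with cidrsubnet means changing the VPC range in one variable reshapes every subnet consistently.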
Module 2 — Compute
The compute module creates the application layer:
Two security groups with layered access:
# ALB SG: the internet can reach the ALB on port 80
resource "aws_security_group" "alb" {
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
# App SG: only the ALB can reach the EC2 instances
resource "aws_security_group" "app" {
  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }
}
EC2 instances are in private subnets and only accept traffic that came through the ALB. They are never directly reachable from the internet.
Launch template with security best practices:
metadata_options {
  http_endpoint               = "enabled"
  http_tokens                 = "required" # enforces IMDSv2
  http_put_response_hop_limit = 1
}
monitoring {
  enabled = true # detailed CloudWatch metrics
}
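For context, the launch template is what the Auto Scaling Group launches instances from. A minimal sketch of that group, spread across the three private app subnets, might look like the following. The resource names (aws_launch_template.app, aws_lb_target_group.app), the sizes, and the grace period are assumptions, not values taken from the repository.

```hcl
# Sketch: names, sizes, and grace period are illustrative assumptions.
resource "aws_autoscaling_group" "app" {
  name_prefix         = "app-"
  min_size            = 2
  max_size            = 6
  desired_capacity    = 3
  vpc_zone_identifier = var.private_app_subnet_ids # one subnet per AZ

  # Replace instances the ALB reports as unhealthy, not only those
  # that fail EC2 status checks.
  health_check_type         = "ELB"
  health_check_grace_period = 120
  target_group_arns         = [aws_lb_target_group.app.arn]

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}
```

Passing all three subnet IDs to vpc_zone_identifier is what lets the group rebalance instances into the surviving AZs when one fails.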
Auto Scaling with CloudWatch alarms:
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  comparison_operator = "GreaterThanThreshold"
  threshold           = var.cpu_high_threshold # default 60%
  alarm_actions       = [aws_autoscaling_policy.scale_out.arn]
}
resource "aws_cloudwatch_metric_alarm" "cpu_low" {
  comparison_operator = "LessThanThreshold"
  threshold           = var.cpu_low_threshold # default 20%
  alarm_actions       = [aws_autoscaling_policy.scale_in.arn]
}
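The excerpts above leave out the scaling policy and the metric settings each alarm needs. A fuller sketch of the scale-out half, assuming a simple scaling policy and an Auto Scaling Group named aws_autoscaling_group.app (both assumptions, as are the period, cooldown, and alarm name), could look like this:

```hcl
# Sketch: policy type, timings, and names are illustrative assumptions.
resource "aws_autoscaling_policy" "scale_out" {
  name                   = "scale-out"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "SimpleScaling"
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 1 # add one instance per alarm breach
  cooldown               = 300
}

resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "app-cpu-high"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 60 # 1-minute datapoints rely on detailed monitoring
  evaluation_periods  = 2
  comparison_operator = "GreaterThanThreshold"
  threshold           = var.cpu_high_threshold

  # Aggregate CPU across every instance in the group.
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.app.name
  }

  alarm_actions = [aws_autoscaling_policy.scale_out.arn]
}
```

The scale-in side mirrors this with a negative scaling_adjustment and the LessThanThreshold comparison.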
Module 3 — Database
RDS MySQL with Multi-AZ, the core HA database setting:
resource "aws_db_instance" "this" {
  engine                  = var.db_engine
  instance_class          = var.db_instance_class
  multi_az                = true  # synchronous standby + auto failover
  storage_encrypted       = true  # encrypted at rest
  storage_type            = "gp3" # faster and cheaper than gp2
  deletion_protection     = true  # safety guardrail
  backup_retention_period = 7     # 7 days of automated backups
}
The DB security group only accepts traffic from the app security group — not from any IP address, not from the internet:
ingress {
  from_port       = 3306
  to_port         = 3306
  protocol        = "tcp"
  security_groups = [var.app_security_group_id]
}
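Multi-AZ only works if RDS is allowed to place the standby in a different AZ, which is what the DB subnet group is for. Here is a minimal sketch of the subnet group and the security group that carries the ingress rule above; the resource names and name prefixes are assumptions for illustration.

```hcl
# Sketch: resource names and prefixes are assumed.
resource "aws_db_subnet_group" "this" {
  name_prefix = "app-db-"
  subnet_ids  = var.private_db_subnet_ids # one private DB subnet per AZ
}

resource "aws_security_group" "db" {
  name_prefix = "app-db-"
  vpc_id      = var.vpc_id

  # MySQL traffic is accepted only from members of the app security group.
  ingress {
    from_port       = 3306
    to_port         = 3306
    protocol        = "tcp"
    security_groups = [var.app_security_group_id]
  }
}

# The DB instance would then reference both, for example:
#   db_subnet_group_name   = aws_db_subnet_group.this.name
#   vpc_security_group_ids = [aws_security_group.db.id]
```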
Module 4 — DNS
Route 53 alias record pointing app.skylumanex.click to the ALB, with a health check:
resource "aws_route53_record" "app" {
  zone_id = var.hosted_zone_id
  name    = var.domain_name
  type    = "A"
  alias {
    name                   = var.alb_dns_name
    zone_id                = var.alb_zone_id
    evaluate_target_health = true
  }
}
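evaluate_target_health = true tells Route 53 to treat the alias record as unhealthy when the ALB has no healthy targets. On a single record like this one Route 53 has nothing else to return, so the setting pays off once a second record (for example, a failover or weighted record) gives it somewhere else to send traffic. It is another layer of resilience to have in place.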
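The health check itself is a separate resource from the alias record. A minimal sketch, assuming an HTTP check against the root path of the public domain (the path, interval, and threshold are assumptions, not values from the repository):

```hcl
# Sketch: path, interval, and threshold are illustrative assumptions.
resource "aws_route53_health_check" "app" {
  fqdn              = var.domain_name
  type              = "HTTP"
  port              = 80
  resource_path     = "/"
  request_interval  = 30 # seconds between checks from each checker
  failure_threshold = 3  # consecutive failures before marked unhealthy
}
```

A check like this probes the endpoint from outside AWS's network path to the instances, so it can also feed a CloudWatch alarm on end-to-end availability.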
Root Module — Wiring Everything Together
The environments/dev/main.tf calls all 4 modules and passes outputs between them:
module "networking" {
  source             = "../../modules/network"
  project_name       = var.project_name
  vpc_cidr_block     = var.vpc_cidr_block
  single_nat_gateway = var.single_nat_gateway
  # ...
}
module "compute" {
  source                 = "../../modules/compute"
  vpc_id                 = module.networking.vpc_id
  public_subnet_ids      = module.networking.public_subnet_ids
  private_app_subnet_ids = module.networking.private_app_subnet_ids
  # ...
}
module "database" {
  source                = "../../modules/database"
  vpc_id                = module.networking.vpc_id
  private_db_subnet_ids = module.networking.private_db_subnet_ids
  app_security_group_id = module.compute.app_security_group_id
  # ...
}
module "dns" {
  source         = "../../modules/dns"
  alb_dns_name   = module.compute.alb_dns_name
  alb_zone_id    = module.compute.alb_zone_id
  hosted_zone_id = var.hosted_zone_id
}
Notice how module.networking.vpc_id flows into both compute and database. module.compute.app_security_group_id flows into database. Each module is independent but they communicate cleanly through their interfaces.
Deploying It
cd environments/dev
export TF_VAR_db_password="YourStrongPassword"
terraform init
terraform plan
terraform apply
Terraform provisions all 40 resources in the correct dependency order automatically.
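For the TF_VAR_db_password export to work, the root module needs a matching variable declaration. A minimal sketch, assuming the variable is declared in environments/dev/variables.tf and passed on to the database module under the same name (both assumptions):

```hcl
# environments/dev/variables.tf (sketch)
variable "db_password" {
  description = "Master password for the RDS instance"
  type        = string
  sensitive   = true # redacts the value in plan and apply output
}

# Passed on inside the module "database" block, for example:
#   db_password = var.db_password
```

Note that sensitive = true only hides the value in CLI output. Terraform still stores it in the state file, so the state backend should be encrypted and access-controlled.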
The Proof
$ curl http://app.skylumanex.click
Hello from ip-10-0-11-230.eu-west-1.compute.internal
$ curl http://app.skylumanex.click
Hello from ip-10-0-12-181.eu-west-1.compute.internal
Traffic is distributed across private subnets in eu-west-1a and eu-west-1b through the ALB, with the name resolved via Route 53. The instances are never directly reachable from the internet.
Screenshots
Key Lessons
- Modules enforce separation of concerns: the networking module doesn't know about EC2, the compute module doesn't know about RDS. Each module has one job.
- Outputs are the module API: what a module exposes in outputs.tf is its contract with the outside world. Design them carefully.
- The NAT Gateway is the hidden single point of failure, and most HA tutorials miss this. One shared NAT Gateway means one AZ failure kills all outbound private traffic.
- deletion_protection = true on RDS is a guardrail, not an obstacle: it saved me from accidentally destroying a database during testing. Disable it explicitly before destroy, never by default.
- Never put db_password in terraform.tfvars; use the TF_VAR_db_password environment variable instead. That keeps the password out of the files you commit, though Terraform still records it in the state file, so the state needs protecting too.
What's Next
- Add a bastion host or SSM Session Manager for secure instance access
- Enable VPC Flow Logs for network traffic visibility
- Add WAF in front of the ALB
- Build a staging/ environment by copying environments/dev/; the modules don't change
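As a sketch of that last item, a staging environment could differ from dev only in its variable values. The variable names below are the ones the root module already passes to the networking module; the file path and the values themselves are hypothetical.

```hcl
# environments/staging/terraform.tfvars (hypothetical values)
project_name       = "ha-infra-staging"
vpc_cidr_block     = "10.1.0.0/16" # separate range from dev's 10.0.0.0/16
single_nat_gateway = false         # one NAT Gateway per AZ, as in production
```

Giving each environment its own state and its own CIDR range keeps the environments isolated while the module code stays identical.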
Resources
Terraform AWS Provider docs
AWS Well-Architected Framework — Reliability Pillar
GitHub: aws-full-ha-infra
Part of my AWS SAA-C03 prep as an AWS Community Builder 2026 (Serverless track). Follow along as I build toward certification.




