DEV Community

Udoh Deborah
Managing High Traffic Applications with AWS Elastic Load Balancer and Terraform

Introduction

On Day 5 of the 30-Day Terraform Challenge, I tackled two of the most important concepts in production infrastructure: scaling with an AWS Application Load Balancer (ALB) and understanding Terraform state. By the end of the day, I had a fully load-balanced cluster running across multiple availability zones — and a much deeper understanding of what Terraform is actually doing behind the scenes.

What I Built

A production-ready, horizontally scaled infrastructure consisting of:

  • An Application Load Balancer accepting public HTTP traffic on port 80
  • An Auto Scaling Group running a minimum of 2 EC2 instances across multiple AZs
  • A Target Group with HTTP health checks ensuring only healthy instances receive traffic
  • Security Groups that restrict direct instance access — only the ALB can talk to the instances
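The configuration below references several input variables (var.server_port, var.instance_type, var.min_size, var.max_size). A minimal variables.tf supporting it might look like this — the defaults here are my assumptions, not values from the original walkthrough:

```hcl
variable "server_port" {
  description = "Port the web server listens on inside each instance"
  type        = number
  default     = 8080
}

variable "instance_type" {
  description = "EC2 instance type for the web tier"
  type        = string
  default     = "t3.micro"
}

variable "min_size" {
  description = "Minimum (and initial desired) number of instances in the ASG"
  type        = number
  default     = 2
}

variable "max_size" {
  description = "Maximum number of instances the ASG may scale out to"
  type        = number
  default     = 4
}
```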

ALB + ASG Setup Walkthrough

The Architecture

Internet
    │
    ▼
[ ALB - port 80 ]
    │
    ▼
[ Target Group ]
    │         │
    ▼         ▼
[EC2 - AZ-a] [EC2 - AZ-b]
  (port 8080) (port 8080)

The ALB sits in front of the ASG. All public traffic hits the ALB on port 80, which forwards it to healthy instances in the target group on port 8080. Instances are not directly accessible from the internet — their security group only allows traffic from the ALB security group.

Security Groups — The Right Way

A common mistake is opening instance security groups to 0.0.0.0/0. The correct pattern is to reference the ALB security group directly:

# ALB accepts traffic from the internet
resource "aws_security_group" "alb" {
  name   = "terraform-day5-alb-sg"
  vpc_id = data.aws_vpc.default.id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Instances only accept traffic FROM the ALB security group
resource "aws_security_group" "instance" {
  name   = "terraform-day5-instance-sg"
  vpc_id = data.aws_vpc.default.id

  ingress {
    from_port       = var.server_port
    to_port         = var.server_port
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Launch Template and User Data

Each instance runs a simple Python HTTP server on port 8080, started via a User Data script:

resource "aws_launch_template" "web" {
  name_prefix   = "terraform-day5-"
  image_id      = data.aws_ami.amazon_linux.id
  instance_type = var.instance_type

  vpc_security_group_ids = [aws_security_group.instance.id]

  user_data = base64encode(<<-EOF
    #!/bin/bash
    mkdir -p /var/www
    cat > /var/www/index.html <<HTML
    <html>
      <body>
        <h1>Hello from $(hostname)</h1>
        <p>Instance ID: $(curl -s http://169.254.169.254/latest/meta-data/instance-id)</p>
        <p>AZ: $(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)</p>
      </body>
    </html>
    HTML
    cd /var/www
    nohup python3 -m http.server 8080 &
  EOF
  )
}
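The launch template also references two data sources (data.aws_ami.amazon_linux and data.aws_vpc.default) defined elsewhere in the configuration. A sketch of what they might look like — the AMI name filter is an assumption based on the standard Amazon Linux 2023 naming convention:

```hcl
# The account's default VPC in the current region
data "aws_vpc" "default" {
  default = true
}

# Most recent Amazon-owned Amazon Linux 2023 AMI
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}
```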

Auto Scaling Group

resource "aws_autoscaling_group" "web" {
  name                      = "terraform-day5-asg"
  min_size                  = var.min_size
  max_size                  = var.max_size
  desired_capacity          = var.min_size
  vpc_zone_identifier       = data.aws_subnets.default.ids
  target_group_arns         = [aws_lb_target_group.web.arn]
  health_check_type         = "ELB"
  health_check_grace_period = 60
  wait_for_capacity_timeout = "10m"

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }
}
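As written, the ASG has fixed capacity bounds but no scaling policy, so it will not add instances under load by itself. One common way to make it react to traffic is a target-tracking policy on average CPU — a sketch, where the 50% target is an arbitrary assumption you would tune for your workload:

```hcl
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "terraform-day5-cpu-target"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    # Scale out when average CPU across the group exceeds 50%
    target_value = 50.0
  }
}
```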

Application Load Balancer and Target Group

resource "aws_lb" "web" {
  name               = "terraform-day5-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = data.aws_subnets.default.ids
}

resource "aws_lb_target_group" "web" {
  name     = "terraform-day5-tg"
  port     = var.server_port
  protocol = "HTTP"
  vpc_id   = data.aws_vpc.default.id

  health_check {
    enabled             = true
    path                = "/"
    port                = "traffic-port"
    protocol            = "HTTP"
    healthy_threshold   = 2
    unhealthy_threshold = 2
    interval            = 30
    timeout             = 5
    matcher             = "200"
  }
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.web.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.web.arn
  }
}
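The walkthrough above doesn't show any outputs; to surface the ALB's DNS name after apply instead of digging it out of the console, a simple output block works:

```hcl
output "alb_dns_name" {
  description = "Public DNS name of the load balancer"
  value       = aws_lb.web.dns_name
}
```

Running terraform output alb_dns_name then prints the address to hit in a browser or with curl.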

Proof It Worked

After terraform apply completed, hitting the ALB DNS name in the browser returned:

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h0hhsc601uyw5pl8nn3e.png)

Refreshing the page cycled through different instance IDs and AZs — confirming the load balancer was distributing traffic across the cluster.

Terraform State Deep Dive

What is Terraform State?

When you run terraform apply, Terraform creates a file called terraform.tfstate. This JSON file is Terraform's source of truth — it maps every resource in your configuration to the real resource that exists in AWS.

Without state, Terraform would have no way to know:

  • Which resources it already created
  • What the current configuration of those resources is
  • What needs to change when you update your code

What the State File Contains

Opening terraform.tfstate after the Day 5 deployment revealed detailed information about every resource:

  • Resource type and name — e.g. aws_lb.web
  • Provider metadata — which provider manages the resource
  • All attributes — ARNs, IDs, DNS names, tags, ports, every setting
  • Dependencies — which resources depend on which

It is essentially a complete snapshot of your infrastructure at the time of the last apply.
Why State Must Never Be Committed to Git

The state file contains sensitive data — resource IDs, ARNs, and potentially secrets if you have outputs exposing them. Beyond security, committing state to Git causes serious problems in team environments:

  • Two engineers apply at the same time — state gets corrupted
  • Someone applies from an old branch — state goes out of sync
  • Merge conflicts in state files are nearly impossible to resolve safely

The solution is remote state — storing state in S3, Terraform Cloud, or another backend that supports locking.

State Locking

State locking prevents two operations from running simultaneously against the same state. Without it, two engineers running terraform apply at the same time can corrupt the state file permanently. When using S3 as a remote backend, DynamoDB is used for locking:

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "day5/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
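The lock table itself must exist before the backend can use it. If you manage it with Terraform (in a separate configuration, since a backend cannot depend on resources in its own state), a minimal sketch — the table name must match the backend's dynamodb_table value, and a string attribute named LockID is the schema the S3 backend expects:

```hcl
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```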

State Experiments

Experiment 1 — Manual State Tampering

I manually edited terraform.tfstate and changed the Day tag value from "5" to "99", then ran terraform plan.

Terraform immediately detected the discrepancy. It compared the state file (which said Day=99) against the configuration code (which said Day=5) and proposed to update the tag back to 5.
Key insight: Terraform always reconciles three things: your code, the state file, and real infrastructure. When state and code disagree, Terraform treats the code as the desired state and proposes changes to match it.

After running terraform apply, the tag was corrected and state was back in sync.

Experiment 2 — Infrastructure Drift via AWS Console

I manually changed the Day tag on a running EC2 instance directly in the AWS Console from 5 to MANUAL, without touching any Terraform code.

Running terraform plan detected the drift immediately. Even though the code and state file both said Day=5, Terraform queried AWS directly and saw the real value was MANUAL. It proposed to revert the tag back to 5.


Key insight: Terraform does not rely solely on the state file — it also refreshes real infrastructure on every plan. This is how it detects drift caused by manual changes outside of Terraform.
Running terraform apply corrected the drift automatically.

Errors I Hit and How I Fixed Them

Error 1 — 502 Bad Gateway

Cause: The EC2 instances had no web server running, so the ALB had no healthy targets to route traffic to.
Fix: Added a User Data script to the Launch Template that starts a Python HTTP server on port 8080 on every instance boot.

Error 2 — Instance type not supported in us-east-1e

Your requested instance type (t3.micro) is not supported in your
requested Availability Zone (us-east-1e)

Cause: The aws_subnets data source was fetching all subnets, including one in us-east-1e, which does not support t3.micro.
Fix: Added an AZ filter to the subnets data source to explicitly exclude us-east-1e:

data "aws_subnets" "default" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }

  filter {
    name   = "availabilityZone"
    values = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d", "us-east-1f"]
  }
}

Key Takeaways

  • The ALB + ASG pattern is the foundation of every scalable AWS architecture
  • Security groups should reference each other, not open 0.0.0.0/0 to everything
  • Terraform state is the source of truth — understand it before you trust it
  • Never commit terraform.tfstate to Git — use remote state with locking
  • Drift happens — terraform plan is your best tool for detecting and fixing it
