Most infrastructure projects work the first time because we push them until they do. But working infrastructure isn't the same as well-designed infrastructure.
Six months ago, I built an AWS infrastructure with Terraform. It worked. I was proud. Last week, I looked at that same code and cringed.
This is the story of what I learned by tearing it down and rebuilding it properly.
## Background and Motivation
The original version of this project was built to validate concepts quickly. It provisioned EC2 instances, placed them behind a load balancer, and served traffic successfully. At the time, that felt like success.
But after I gained more exposure to Terraform patterns and real-world infrastructure practices, revisiting the code made the gaps obvious. Decisions had been made because they worked, not because they were well thought out. Dependencies were forced instead of modeled. State handling existed, but wasn't fully understood.
This refactor was an attempt to slow down and rebuild the same infrastructure while focusing on clarity, correctness, and maintainability.
## What This Project Builds
This project provisions a small but realistic AWS infrastructure stack using Terraform. The goal is not application complexity, but infrastructure correctness.
The setup includes:
- ✅ Multiple EC2 instances
- ✅ An Application Load Balancer in front of them
- ✅ A target group with health checks
- ✅ Security groups enforcing clear traffic flow
- ✅ Remote Terraform state with locking
- ✅ Instance bootstrapping using user data
Each EC2 instance runs Nginx and serves a simple page identifying the instance. This makes it easy to visually confirm load balancing behavior and instance health.
## Architecture Overview
```
                    Internet
                       │
                       ▼
          ┌─────────────────────────┐
          │    Application Load     │
          │     Balancer (ALB)      │
          └─────────────────────────┘
                       │
         ┌─────────────┴─────────────┐
         │       Target Group        │
         │      (Health Checks)      │
         └─────────────┬─────────────┘
                       │
       ┌───────────────┼───────────────┐
       │               │               │
 ┌─────▼─────┐    ┌────▼─────┐    ┌────▼─────┐
 │    EC2    │    │   EC2    │    │   EC2    │
 │ Instance  │    │ Instance │    │ Instance │
 │  (Nginx)  │    │ (Nginx)  │    │ (Nginx)  │
 └───────────┘    └──────────┘    └──────────┘
   Subnet 1        Subnet 2        Subnet 3
```
The infrastructure uses:
- AWS Default VPC (intentionally chosen for this learning project, though production workloads should always use custom VPCs with proper CIDR planning)
- Public Application Load Balancer
- EC2 instances distributed across available subnets
- Target group attached to the ALB
- Security groups controlling inbound and outbound traffic
A custom VPC was intentionally avoided. The purpose of this project was not network design, but Terraform fundamentals: state management, resource relationships, dynamic creation, and clean structure.
## What Actually Changed

| Aspect | Original Version | Refactored Version |
|---|---|---|
| Instance Creation | `count` with hardcoded values | `for_each` with dynamic mapping |
| Subnet Assignment | Manual/hardcoded | Modulo arithmetic distribution |
| Dependencies | Explicit `depends_on` everywhere | Implicit dependency graph |
| State Management | Local state file | S3 + DynamoDB locking |
| Security Groups | Overly permissive | Principle of least privilege |
| Code Lines | 487 lines | 312 lines (-36%) |
| Hardcoded Values | 15+ IDs/ARNs | 0 (all dynamic) |
## Terraform Concepts Applied
This refactor focused heavily on using Terraform the way it's intended to be used.
### 1. Remote State Management
Terraform state is stored in an S3 bucket with DynamoDB used for state locking. This prevents concurrent state corruption and reflects how Terraform is used in real team environments.
```hcl
terraform {
  backend "s3" {
    bucket         = "oggy-backend-bucket"
    key            = "Alb-project-non-module/terraform.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "stateLock-table"
    encrypt        = true
  }
}
```
### 2. Data Sources
Instead of hardcoding values, data sources are used to dynamically fetch:
- The default VPC
- Subnets within the VPC
- The latest Ubuntu 24.04 LTS AMI
data "aws_vpc" "default" {
default = true
}
data "aws_subnets" "default" {
filter {
name = "vpc-id"
values = [data.aws_vpc.default.id]
}
}
data "aws_ami" "ubuntu" {
most_recent = true
owners = ["099720109477"] # Canonical
filter {
name = "name"
values = ["ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-*"]
}
}
### 3. Dynamic Resource Creation
**Before:**

```hcl
resource "aws_instance" "web" {
  count     = 3
  subnet_id = var.subnet_ids[count.index]
  # Fragile: breaks if the subnet list changes
}
```
**After:**

```hcl
locals {
  instances = {
    for i in range(var.nos) :
    "instance-${i + 1}" => data.aws_subnets.default.ids[
      i % length(data.aws_subnets.default.ids)
    ]
  }
}

resource "aws_instance" "vms" {
  for_each = local.instances

  ami           = data.aws_ami.ubuntu.id
  instance_type = var.type_instance
  subnet_id     = each.value

  tags = {
    Name = each.key
  }
}
```
EC2 instances are created dynamically using `for_each` rather than static counts. This improves clarity, stability, and scalability of the configuration.
### 4. Locals for Computed Logic
```hcl
# Conceptual result of the expression in section 3: a map of
# instance names to subnet IDs (the IDs here are illustrative).
locals {
  instances = {
    "instance-1" = "subnet-aaaa1111"
    "instance-2" = "subnet-bbbb2222"
    "instance-3" = "subnet-cccc3333"
    # ...
  }
}
```
Subnet assignment logic is handled using locals, keeping the resource blocks clean and readable.
### 5. Implicit Dependencies
Rather than forcing execution order with `depends_on`, resource relationships define the dependency graph naturally.

The original version had 8 explicit `depends_on` blocks. The refactored version has none.
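For example, attaching instances to the target group needs no ordering hints at all; the references themselves tell Terraform what must exist first. A sketch, assuming the target group is named `main` (that name is illustrative, not the repo's):

```hcl
# Register every instance from the for_each map with the target group.
# The references to aws_lb_target_group.main and aws_instance.vms ARE
# the dependency graph: no depends_on required.
resource "aws_lb_target_group_attachment" "web" {
  for_each = aws_instance.vms

  target_group_arn = aws_lb_target_group.main.arn
  target_id        = each.value.id
  port             = 80
}
```

Terraform creates the instances and the target group before the attachments because the attachments reference both, not because anyone told it to.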
## Dynamic Instance and Subnet Distribution
One of the most valuable improvements in this refactor was how instances are distributed across subnets.
Instead of manually mapping instances to subnets, modulo arithmetic is used to assign instances evenly across all available subnets.
### How It Works
Modulo arithmetic (`index % subnet_count`) ensures even distribution. With 3 subnets and 6 instances (indices 0 through 5):

- Instances 0, 3 → Subnet 0
- Instances 1, 4 → Subnet 1
- Instances 2, 5 → Subnet 2
This approach:
- ✅ Avoids hardcoding subnet IDs
- ✅ Scales automatically if subnets change
- ✅ Produces deterministic and predictable placement
- ✅ Works across any number of availability zones
This logic alone made the configuration significantly more robust than the original version.
## Security Group Design
Security groups are designed with intent rather than convenience.
```hcl
# ALB Security Group
resource "aws_security_group" "alb" {
  name        = "alb-security-group"
  description = "Allow HTTP inbound traffic"
  vpc_id      = data.aws_vpc.default.id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# EC2 Security Group
resource "aws_security_group" "ec2" {
  name        = "ec2-security-group"
  description = "Allow traffic from ALB only"
  vpc_id      = data.aws_vpc.default.id

  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```
Key principles:
- The Application Load Balancer allows inbound HTTP traffic from anywhere
- EC2 instances only allow inbound traffic from the ALB security group
- Outbound traffic is permitted for updates and health checks
This enforces a clear and understandable traffic flow: public access ends at the load balancer, and instances remain protected behind it.
## Bootstrapping with User Data
Each EC2 instance is bootstrapped at launch using a user data script.
```bash
#!/bin/bash
apt-get update -y
apt-get install -y nginx

INSTANCE_NAME="${instance_name}"

cat > /var/www/html/index.html <<EOF
<!DOCTYPE html>
<html>
<head>
    <title>Instance: $INSTANCE_NAME</title>
</head>
<body>
    <h1>Hello from $INSTANCE_NAME</h1>
    <p>This instance is managed by Terraform</p>
    <p>Load balancing is working correctly!</p>
</body>
</html>
EOF

systemctl enable nginx
systemctl start nginx
```
The script:
- Updates the system
- Installs Nginx
- Serves a simple instance-specific web page
This makes validation straightforward. If the ALB DNS shows responses from different instances, both provisioning and health checks are working as expected.
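Note that `${instance_name}` is a Terraform template variable rather than a shell variable, so the script has to be rendered per instance. A minimal sketch of that wiring, assuming the script lives at `user_data.sh.tpl` (the file name is my assumption, not the repo's):

```hcl
resource "aws_instance" "vms" {
  for_each = local.instances

  ami           = data.aws_ami.ubuntu.id
  instance_type = var.type_instance
  subnet_id     = each.value

  # Render the bootstrap script per instance, filling the
  # ${instance_name} placeholder with the for_each key.
  user_data = templatefile("${path.module}/user_data.sh.tpl", {
    instance_name = each.key
  })

  tags = {
    Name = each.key
  }
}
```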
## What Broke During Refactoring (And Why That's Good)
### Issue 1: State Lock Timeout
**What happened:** The first remote state migration failed with a lock timeout.

**Root cause:** I didn't understand the DynamoDB table requirements. The table needs a primary key named `LockID` (case-sensitive).

**What I learned:** State locking isn't automatic; it requires the right table schema. Reading error messages carefully saves hours.
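If you manage the lock table with Terraform as well (in a separate bootstrap configuration, since the table has to exist before the backend can use it), the required schema is short; the table name below matches the backend block above:

```hcl
# Lock table for the S3 backend. The partition key must be a string
# attribute named exactly "LockID"; any other name or casing makes
# Terraform's lock acquisition fail.
resource "aws_dynamodb_table" "state_lock" {
  name         = "stateLock-table"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```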
### Issue 2: Target Group Attachment Race Condition
**What happened:** Instances registered with the target group before they were ready, causing initial health check failures.

**Root cause:** The health check settings were too aggressive, and the user data script takes time to finish.

**What I learned:** AWS eventual consistency requires patience in automation. I added more forgiving health check intervals and thresholds, sketched below.
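In target group terms, "patience" looks like this (the values and the resource name `main` are illustrative, not necessarily the repo's exact numbers):

```hcl
resource "aws_lb_target_group" "main" {
  name     = "web-target-group"
  port     = 80
  protocol = "HTTP"
  vpc_id   = data.aws_vpc.default.id

  health_check {
    path                = "/"
    interval            = 30 # seconds between checks
    timeout             = 5  # seconds to wait for each response
    healthy_threshold   = 2  # passes needed to mark a target healthy
    unhealthy_threshold = 3  # failures tolerated before marking it unhealthy
  }
}
```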
### Issue 3: Circular Dependency with Security Groups
**What happened:** Terraform complained about a circular dependency when security groups referenced each other.

**Root cause:** I tried to be too clever with cross-referencing security group rules.

**What I learned:** Sometimes the simplest approach is the best. Where two groups need to reference each other, separate the ingress rules into distinct resources, as sketched below.
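A sketch of that pattern, using the security groups defined earlier:

```hcl
# Defining the rule outside both security group bodies means neither
# group references the other at creation time: Terraform can create
# both groups first and add the rule afterwards, breaking the cycle.
resource "aws_security_group_rule" "alb_to_ec2" {
  type                     = "ingress"
  from_port                = 80
  to_port                  = 80
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ec2.id
  source_security_group_id = aws_security_group.alb.id
}
```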
### Issue 4: Subnet Data Source Returned Unexpected Results
**What happened:** Got 6 subnets instead of the expected 3 in my default VPC.

**Root cause:** The default VPC contained more subnets than the one-per-AZ defaults I expected, and the data source returns everything that matches, not just what you had in mind.

**What I learned:** Always validate data source outputs. I added filters to ensure I'm only using the subnets I intend to use, as shown below.
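One way to pin the data source down, assuming you only want the default subnet in each availability zone (the extra filter here is an example; pick whichever attribute distinguishes your intended subnets):

```hcl
data "aws_subnets" "default" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }

  # Keep only the default subnet per AZ, excluding any subnets
  # added to the default VPC later.
  filter {
    name   = "default-for-az"
    values = ["true"]
  }
}
```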
## Measurable Improvements
| Metric | Before | After | Change |
|---|---|---|---|
| Lines of code | 487 | 312 | -36% |
| Explicit dependencies | 8 | 0 | -100% |
| Hardcoded values | 15 | 0 | -100% |
| State lock conflicts | Possible | Prevented | ✅ |
| Subnet scalability | Fixed to 3 | Dynamic | ✅ |
| Code readability | Moderate | High | ✅ |
## Key Takeaways for Other Learners
### 1. State Management Isn't Optional
Even for learning projects, use remote state. The habits matter more than the project size. I spent 2 hours debugging a state corruption issue that would have been prevented by proper locking.
### 2. Dynamic Beats Static

`for_each` is harder to learn than `count`, but it's worth the investment. The flexibility and clarity it provides compounds over time.
### 3. Read the Dependency Graph
Run this command and actually look at it:
```bash
terraform graph | dot -Tpng > graph.png
```
It will show you what Terraform actually understands about your infrastructure (the `dot` command comes from Graphviz).
### 4. Refactoring > New Projects
Building something new teaches you syntax. Rebuilding teaches you design. I learned more in this refactor than in the original build.
### 5. Document Your Mistakes
My original code had 8 explicit `depends_on` blocks. All were unnecessary. That's valuable to know and remember.
### 6. Slow Down to Speed Up
The original project took 3 days of "making it work." The refactor took 2 days of "making it right." But now I understand it 10x better.
## Why Refactoring Was More Valuable Than the Original Build
The first version taught me how to assemble resources.
The refactored version taught me why certain Terraform patterns exist.
Rebuilding the project exposed assumptions I didn't know I was making the first time. It forced me to:
- Question every hardcoded value
- Understand the difference between implicit and explicit dependencies
- Think about how the code would scale
- Consider how someone else would read and modify this code
That process made the concepts stick far more effectively than building something new.
The best learning happens when you're forced to justify your decisions to yourself.
## Try It Yourself
Want to see the difference? Clone the repository and explore:
```bash
# Clone the repository
git clone https://github.com/adil-khan-723/terraform_project2_refactored
cd terraform_project2_refactored

# Initialize Terraform
terraform init

# Review the plan
terraform plan

# Apply the configuration
terraform apply

# Get the ALB DNS name
terraform output alb_dns_name

# Test the load balancer (you'll see responses from different instances)
for i in {1..10}; do curl http://<alb-dns-name>; sleep 1; done

# Clean up
terraform destroy
```
## What's Next?
I'm currently working on:
- 🔧 Converting this into reusable Terraform modules
- 📈 Adding Auto Scaling Groups for dynamic scaling
- 🔒 Implementing HTTPS with AWS Certificate Manager
- 🌐 Building a custom VPC version with proper network segmentation
- 📊 Adding CloudWatch dashboards and alarms
## Repository and Source Code
The complete source code, file structure, and documentation are available here:
**GitHub Repository:** https://github.com/adil-khan-723/terraform_project2_refactored
The repository contains:
- Clean Terraform files with clear separation of concerns
- Comprehensive README with architecture diagrams
- No committed state or local artifacts
- A readable, review-friendly structure
## Final Thoughts
Infrastructure as Code is as much about the "code" part as it is about the "infrastructure" part. Clean, maintainable, understandable code matters—even when you're the only person who will read it.
The infrastructure worked both times. But only the second time did I understand why.
That's the difference between code that works and code that teaches.
## Connect With Me
Have you refactored your own infrastructure code? What surprised you most? Drop a comment below—I'd love to hear about your experience.
**LinkedIn:** https://www.linkedin.com/in/adilk3682
If you found this helpful:
- ⭐ Star the GitHub repository
- 💬 Share your own refactoring stories in the comments
- 🔗 Connect with me on LinkedIn
Thanks for reading! Happy Terraforming! 🚀