Most infrastructure projects work the first time because we push them until they do. But working infrastructure isn't the same as well-designed infrastructure.
Six months ago, I built an AWS infrastructure with Terraform. It worked. I was proud. Last week, I looked at that same code and cringed.
This is the story of what I learned by tearing it down and rebuilding it properly.
## Background and Motivation
The original version of this project was built to validate concepts quickly. It provisioned EC2 instances, placed them behind a load balancer, and served traffic successfully. At the time, that felt like success.
But after I gained more exposure to Terraform patterns and real-world infrastructure practices, revisiting the code made the gaps obvious. Decisions had been made because they worked, not because they were well thought out. Dependencies were forced instead of modeled. State handling existed, but wasn't fully understood.
This refactor was an attempt to slow down and rebuild the same infrastructure while focusing on clarity, correctness, and maintainability.
## What This Project Builds
This project provisions a small but realistic AWS infrastructure stack using Terraform. The goal is not application complexity, but infrastructure correctness.
The setup includes:
- ✅ Multiple EC2 instances
- ✅ An Application Load Balancer in front of them
- ✅ A target group with health checks
- ✅ Security groups enforcing clear traffic flow
- ✅ Remote Terraform state with locking
- ✅ Instance bootstrapping using user data
Each EC2 instance runs Nginx and serves a simple page identifying the instance. This makes it easy to visually confirm load balancing behavior and instance health.
## Architecture Overview
```
                    Internet
                       │
                       ▼
          ┌─────────────────────────┐
          │    Application Load     │
          │     Balancer (ALB)      │
          └─────────────────────────┘
                       │
         ┌─────────────┴─────────────┐
         │       Target Group        │
         │      (Health Checks)      │
         └─────────────┬─────────────┘
                       │
       ┌───────────────┼───────────────┐
       │               │               │
 ┌─────▼─────┐    ┌────▼─────┐    ┌────▼─────┐
 │    EC2    │    │   EC2    │    │   EC2    │
 │ Instance  │    │ Instance │    │ Instance │
 │  (Nginx)  │    │ (Nginx)  │    │ (Nginx)  │
 └───────────┘    └──────────┘    └──────────┘
   Subnet 1        Subnet 2        Subnet 3
```
The infrastructure uses:
- AWS Default VPC (intentionally chosen for this learning project, though production workloads should always use custom VPCs with proper CIDR planning)
- Public Application Load Balancer
- EC2 instances distributed across available subnets
- Target group attached to the ALB
- Security groups controlling inbound and outbound traffic
A custom VPC was intentionally avoided. The purpose of this project was not network design, but Terraform fundamentals: state management, resource relationships, dynamic creation, and clean structure.
## What Actually Changed

| Aspect | Original Version | Refactored Version |
|---|---|---|
| Instance Creation | `count` with hardcoded values | `for_each` with dynamic mapping |
| Subnet Assignment | Manual/hardcoded | Modulo arithmetic distribution |
| Dependencies | Explicit `depends_on` everywhere | Implicit dependency graph |
| State Management | Local state file | S3 + DynamoDB locking |
| Security Groups | Overly permissive | Principle of least privilege |
| Code Lines | 487 lines | 312 lines (-36%) |
| Hardcoded Values | 15+ IDs/ARNs | 0 (all dynamic) |
## Terraform Concepts Applied
This refactor focused heavily on using Terraform the way it's intended to be used.
### 1. Remote State Management
Terraform state is stored in an S3 bucket with DynamoDB used for state locking. This prevents concurrent state corruption and reflects how Terraform is used in real team environments.
```hcl
terraform {
  backend "s3" {
    bucket         = "oggy-backend-bucket"
    key            = "Alb-project-non-module/terraform.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "stateLock-table"
    encrypt        = true
  }
}
```
### 2. Data Sources
Instead of hardcoding values, data sources are used to dynamically fetch:
- The default VPC
- Subnets within the VPC
- The latest Ubuntu 24.04 LTS AMI
data "aws_vpc" "default" {
default = true
}
data "aws_subnets" "default" {
filter {
name = "vpc-id"
values = [data.aws_vpc.default.id]
}
}
data "aws_ami" "ubuntu" {
most_recent = true
owners = ["099720109477"] # Canonical
filter {
name = "name"
values = ["ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-*"]
}
}
### 3. Dynamic Resource Creation
**Before:**

```hcl
resource "aws_instance" "web" {
  count     = 3
  subnet_id = var.subnet_ids[count.index]
  # Fragile: breaks if the subnet list changes
}
```
**After:**

```hcl
locals {
  instances = {
    for i in range(var.nos) :
    "instance-${i + 1}" => data.aws_subnets.default.ids[
      i % length(data.aws_subnets.default.ids)
    ]
  }
}

resource "aws_instance" "vms" {
  for_each = local.instances

  ami           = data.aws_ami.ubuntu.id
  instance_type = var.type_instance
  subnet_id     = each.value

  tags = {
    Name = each.key
  }
}
```
EC2 instances are created dynamically using `for_each` rather than static counts. This improves clarity, stability, and scalability of the configuration.
### 4. Locals for Computed Logic
```hcl
# Conceptual result of the expression in section 3: a map of
# instance names to subnet IDs (the IDs here are illustrative).
locals {
  instances = {
    "instance-1" = "subnet-aaaa1111"
    "instance-2" = "subnet-bbbb2222"
    "instance-3" = "subnet-cccc3333"
    # ...
  }
}
```
Subnet assignment logic is handled using locals, keeping the resource blocks clean and readable.
### 5. Implicit Dependencies
Rather than forcing execution order with `depends_on`, resource relationships define the dependency graph naturally.

The original version had 8 explicit `depends_on` blocks. The refactored version has none.
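For example, attaching instances to the target group needs no ordering hints at all; the references themselves tell Terraform what must exist first. A sketch, assuming the target group is named `main` (that name is illustrative, not the repo's):

```hcl
# Register every instance from the for_each map with the target group.
# The references to aws_lb_target_group.main and aws_instance.vms ARE
# the dependency graph: no depends_on required.
resource "aws_lb_target_group_attachment" "web" {
  for_each = aws_instance.vms

  target_group_arn = aws_lb_target_group.main.arn
  target_id        = each.value.id
  port             = 80
}
```

Terraform creates the instances and the target group before the attachments because the attachments reference both, not because anyone told it to.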
## Dynamic Instance and Subnet Distribution
One of the most valuable improvements in this refactor was how instances are distributed across subnets.
Instead of manually mapping instances to subnets, modulo arithmetic is used to assign instances evenly across all available subnets.
### How It Works
Modulo arithmetic (`index % subnet_count`) ensures even distribution. With 3 subnets and 6 instances (indices 0 through 5):

- Instances 0, 3 → Subnet 0
- Instances 1, 4 → Subnet 1
- Instances 2, 5 → Subnet 2
This approach:
- ✅ Avoids hardcoding subnet IDs
- ✅ Scales automatically if subnets change
- ✅ Produces deterministic and predictable placement
- ✅ Works across any number of availability zones
This logic alone made the configuration significantly more robust than the original version.
## Security Group Design
Security groups are designed with intent rather than convenience.
```hcl
# ALB Security Group
resource "aws_security_group" "alb" {
  name        = "alb-security-group"
  description = "Allow HTTP inbound traffic"
  vpc_id      = data.aws_vpc.default.id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# EC2 Security Group
resource "aws_security_group" "ec2" {
  name        = "ec2-security-group"
  description = "Allow traffic from ALB only"
  vpc_id      = data.aws_vpc.default.id

  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```
Key principles:
- The Application Load Balancer allows inbound HTTP traffic from anywhere
- EC2 instances only allow inbound traffic from the ALB security group
- Outbound traffic is permitted for updates and health checks
This enforces a clear and understandable traffic flow: public access ends at the load balancer, and instances remain protected behind it.
## Bootstrapping with User Data
Each EC2 instance is bootstrapped at launch using a user data script.
```bash
#!/bin/bash
apt-get update -y
apt-get install -y nginx

INSTANCE_NAME="${instance_name}"

cat > /var/www/html/index.html <<EOF
<!DOCTYPE html>
<html>
<head>
    <title>Instance: $INSTANCE_NAME</title>
</head>
<body>
    <h1>Hello from $INSTANCE_NAME</h1>
    <p>This instance is managed by Terraform</p>
    <p>Load balancing is working correctly!</p>
</body>
</html>
EOF

systemctl enable nginx
systemctl start nginx
```
The script:
- Updates the system
- Installs Nginx
- Serves a simple instance-specific web page
This makes validation straightforward. If the ALB DNS shows responses from different instances, both provisioning and health checks are working as expected.
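Note that `${instance_name}` is a Terraform template variable rather than a shell variable, so the script has to be rendered per instance. A minimal sketch of that wiring, assuming the script lives at `user_data.sh.tpl` (the file name is my assumption, not the repo's):

```hcl
resource "aws_instance" "vms" {
  for_each = local.instances

  ami           = data.aws_ami.ubuntu.id
  instance_type = var.type_instance
  subnet_id     = each.value

  # Render the bootstrap script per instance, filling the
  # ${instance_name} placeholder with the for_each key.
  user_data = templatefile("${path.module}/user_data.sh.tpl", {
    instance_name = each.key
  })

  tags = {
    Name = each.key
  }
}
```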
## What Broke During Refactoring (And Why That's Good)
### Issue 1: State Lock Timeout
**What happened:** The first remote state migration failed with a lock timeout.

**Root cause:** I didn't understand the DynamoDB table requirements. The table needs a primary key named `LockID` (case-sensitive).

**What I learned:** State locking isn't automatic; it requires the right table schema. Reading error messages carefully saves hours.
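If you manage the lock table with Terraform as well (in a separate bootstrap configuration, since the table has to exist before the backend can use it), the required schema is short; the table name below matches the backend block above:

```hcl
# Lock table for the S3 backend. The partition key must be a string
# attribute named exactly "LockID"; any other name or casing makes
# Terraform's lock acquisition fail.
resource "aws_dynamodb_table" "state_lock" {
  name         = "stateLock-table"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```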
### Issue 2: Target Group Attachment Race Condition
**What happened:** Instances registered with the target group before they were ready, causing initial health check failures.

**Root cause:** The health check settings were too aggressive, and the user data script takes time to finish.

**What I learned:** AWS eventual consistency requires patience in automation. I added more forgiving health check intervals and thresholds, sketched below.
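In target group terms, "patience" looks like this (the values and the resource name `main` are illustrative, not necessarily the repo's exact numbers):

```hcl
resource "aws_lb_target_group" "main" {
  name     = "web-target-group"
  port     = 80
  protocol = "HTTP"
  vpc_id   = data.aws_vpc.default.id

  health_check {
    path                = "/"
    interval            = 30 # seconds between checks
    timeout             = 5  # seconds to wait for each response
    healthy_threshold   = 2  # passes needed to mark a target healthy
    unhealthy_threshold = 3  # failures tolerated before marking it unhealthy
  }
}
```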
### Issue 3: Circular Dependency with Security Groups
**What happened:** Terraform complained about a circular dependency when security groups referenced each other.

**Root cause:** I tried to be too clever with cross-referencing security group rules.

**What I learned:** Sometimes the simplest approach is the best. Where two groups need to reference each other, separate the ingress rules into distinct resources, as sketched below.
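A sketch of that pattern, using the security groups defined earlier:

```hcl
# Defining the rule outside both security group bodies means neither
# group references the other at creation time: Terraform can create
# both groups first and add the rule afterwards, breaking the cycle.
resource "aws_security_group_rule" "alb_to_ec2" {
  type                     = "ingress"
  from_port                = 80
  to_port                  = 80
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ec2.id
  source_security_group_id = aws_security_group.alb.id
}
```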
### Issue 4: Subnet Data Source Returned Unexpected Results
**What happened:** Got 6 subnets instead of the expected 3 in my default VPC.

**Root cause:** The default VPC contained more subnets than the one-per-AZ defaults I expected, and the data source returns everything that matches, not just what you had in mind.

**What I learned:** Always validate data source outputs. I added filters to ensure I'm only using the subnets I intend to use, as shown below.
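One way to pin the data source down, assuming you only want the default subnet in each availability zone (the extra filter here is an example; pick whichever attribute distinguishes your intended subnets):

```hcl
data "aws_subnets" "default" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }

  # Keep only the default subnet per AZ, excluding any subnets
  # added to the default VPC later.
  filter {
    name   = "default-for-az"
    values = ["true"]
  }
}
```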
## Measurable Improvements
| Metric | Before | After | Change |
|---|---|---|---|
| Lines of code | 487 | 312 | -36% |
| Explicit dependencies | 8 | 0 | -100% |
| Hardcoded values | 15 | 0 | -100% |
| State lock conflicts | Possible | Prevented | ✅ |
| Subnet scalability | Fixed to 3 | Dynamic | ✅ |
| Code readability | Moderate | High | ✅ |
## Key Takeaways for Other Learners
### 1. State Management Isn't Optional
Even for learning projects, use remote state. The habits matter more than the project size. I spent 2 hours debugging a state corruption issue that would have been prevented by proper locking.
### 2. Dynamic Beats Static

`for_each` is harder to learn than `count`, but it's worth the investment. The flexibility and clarity it provides compounds over time.
### 3. Read the Dependency Graph
Run this command and actually look at it:
```bash
terraform graph | dot -Tpng > graph.png
```
It will show you what Terraform actually understands about your infrastructure (the `dot` command comes from Graphviz).
### 4. Refactoring > New Projects
Building something new teaches you syntax. Rebuilding teaches you design. I learned more in this refactor than in the original build.
### 5. Document Your Mistakes
My original code had 8 explicit `depends_on` blocks. All were unnecessary. That's valuable to know and remember.
### 6. Slow Down to Speed Up
The original project took 3 days of "making it work." The refactor took 2 days of "making it right." But now I understand it 10x better.
## Why Refactoring Was More Valuable Than the Original Build
The first version taught me how to assemble resources.
The refactored version taught me why certain Terraform patterns exist.
Rebuilding the project exposed assumptions I didn't know I was making the first time. It forced me to:
- Question every hardcoded value
- Understand the difference between implicit and explicit dependencies
- Think about how the code would scale
- Consider how someone else would read and modify this code
That process made the concepts stick far more effectively than building something new.
The best learning happens when you're forced to justify your decisions to yourself.
## Try It Yourself
Want to see the difference? Clone the repository and explore:
```bash
# Clone the repository
git clone https://github.com/adil-khan-723/terraform_project2_refactored
cd terraform_project2_refactored

# Initialize Terraform
terraform init

# Review the plan
terraform plan

# Apply the configuration
terraform apply

# Get the ALB DNS name
terraform output alb_dns_name

# Test the load balancer (you'll see responses from different instances)
for i in {1..10}; do curl http://<alb-dns-name>; sleep 1; done

# Clean up
terraform destroy
```
## What's Next?
I'm currently working on:
- 🔧 Converting this into reusable Terraform modules
- 📈 Adding Auto Scaling Groups for dynamic scaling
- 🔒 Implementing HTTPS with AWS Certificate Manager
- 🌐 Building a custom VPC version with proper network segmentation
- 📊 Adding CloudWatch dashboards and alarms
## Repository and Source Code
The complete source code, file structure, and documentation are available here:
**GitHub Repository:** https://github.com/adil-khan-723/terraform_project2_refactored
The repository contains:
- Clean Terraform files with clear separation of concerns
- Comprehensive README with architecture diagrams
- No committed state or local artifacts
- A readable, review-friendly structure
## Final Thoughts
Infrastructure as Code is as much about the "code" part as it is about the "infrastructure" part. Clean, maintainable, understandable code matters—even when you're the only person who will read it.
The infrastructure worked both times. But only the second time did I understand why.
That's the difference between code that works and code that teaches.
## Connect With Me
Have you refactored your own infrastructure code? What surprised you most? Drop a comment below—I'd love to hear about your experience.
**LinkedIn:** https://www.linkedin.com/in/adilk3682
If you found this helpful:
- ⭐ Star the GitHub repository
- 💬 Share your own refactoring stories in the comments
- 🔗 Connect with me on LinkedIn
Thanks for reading! Happy Terraforming! 🚀