Day 16 of my Terraform journey was less about creating new resources and more about improving how infrastructure is designed.
The big lesson was this:
Terraform code that works is not automatically production-grade.
Production-grade infrastructure should be:
- modular
- reliable
- secure
- observable
- maintainable
- testable
For Day 16, I audited my earlier infrastructure against that checklist, refactored it into smaller modules, added better validation and observability, and tested it both manually and with Terratest.
GitHub reference:
GitHub Link
The Production-Grade Checklist
Here is the practical version of what I used.
1. Structure
In practice, this means:
- no giant main.tf doing everything
- small modules with one responsibility
- clear variables and outputs
- repeated logic centralized with locals
2. Reliability
In practice, this means:
- safer replacements with create_before_destroy
- proper health checks
- names that won't collide
- designs that support rolling changes
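To make "proper health checks" concrete, here is a minimal sketch of a target group with an explicit health check. The path, thresholds, and the `var.vpc_id` input are assumptions for illustration, not values taken from the Day 16 repository.

```hcl
resource "aws_lb_target_group" "app" {
  name     = "${var.cluster_name}-tg"
  port     = var.server_port
  protocol = "HTTP"
  vpc_id   = var.vpc_id # hypothetical input

  # The ALB only routes traffic to instances that pass this check.
  health_check {
    path                = "/"   # assumed: the app answers on the root path
    protocol            = "HTTP"
    matcher             = "200" # expected HTTP status code
    interval            = 15    # seconds between checks
    timeout             = 3     # seconds before a check counts as failed
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}
```

With a check like this in place, a rolling change can wait for new instances to report healthy before old ones are removed.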
3. Security
In practice, this means:
- no secrets in Terraform code
- tighter security group rules
- safer state handling
- least-privilege thinking
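As a rough illustration of the first and third points, the sketch below marks an input as sensitive and stores state in an encrypted S3 backend. The variable name, bucket name, state key, and region are all hypothetical placeholders.

```hcl
# Hypothetical secret input: the value is supplied at runtime
# (for example through a TF_VAR_db_password environment variable),
# never committed to the repository.
variable "db_password" {
  description = "Database password, injected at runtime"
  type        = string
  sensitive   = true # redacts the value in plan and apply output
}

terraform {
  backend "s3" {
    bucket  = "example-terraform-state"                  # hypothetical bucket
    key     = "day_16/hello-world-app/terraform.tfstate" # hypothetical key
    region  = "us-east-1"                                # assumed region
    encrypt = true                                       # encrypt state at rest
  }
}
```

Note that `sensitive = true` only hides the value from CLI output. The value is still written to the state file, which is why the state itself needs to be encrypted and access-controlled.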
4. Observability
In practice, this means:
- consistent tagging
- alerts for important metrics
- basic operational visibility
5. Maintainability
In practice, this means:
- README files for modules
- pinned provider versions
- reusable module boundaries
- runnable examples
- tests
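Pinning versions usually comes down to a small block like the one below. The exact constraints are assumptions; the right values depend on which Terraform and provider releases the modules were actually tested with.

```hcl
terraform {
  # Assumed minimum CLI version; adjust to what the module was tested with.
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # allow 5.x updates, block the next major version
    }
  }
}
```

The `~>` operator accepts newer minor releases while refusing a major upgrade, so a provider release with breaking changes cannot slip in unnoticed.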
That checklist turned Day 16 into an architecture refactor instead of just another provisioning exercise.
What I Refactored
I split the infrastructure into smaller modules:
- modules/networking/alb
- modules/cluster/asg-rolling-deploy
- modules/services/hello-world-app
- examples/hello-world-app
That made the design much easier to understand and test.
Refactor 1: Centralized Tagging
Before
resource "aws_instance" "web" {
  tags = {
    Name = "web-instance"
  }
}
After
locals {
  common_tags = merge(
    {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = var.project_name
      Owner       = var.team_name
    },
    var.custom_tags
  )
}

resource "aws_lb_target_group" "app" {
  tags = merge(local.common_tags, {
    Name = "${var.cluster_name}-tg"
  })
}
Why this matters:
- less repeated code
- consistent tagging
- easier filtering, billing, and operations
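The `merge` call relies on a `custom_tags` input. A plausible declaration, assuming it is an optional map of strings, could look like this:

```hcl
variable "custom_tags" {
  description = "Extra tags to merge into the common tag set"
  type        = map(string)
  default     = {} # optional: callers may pass nothing
}
```

One design detail worth knowing: when keys overlap, `merge` keeps the value from the later argument. Because `var.custom_tags` comes last, a caller can override baseline keys such as `Owner`. Swapping the argument order would make the baseline tags win instead.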
Refactor 2: Variable Validation
Before
variable "environment" {
  type = string
}
After
variable "environment" {
  description = "Deployment environment"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}
And for instance type:
variable "instance_type" {
  description = "EC2 instance type for the app cluster"
  type        = string

  validation {
    # Accept only types that start with "t2." or "t3."
    condition     = can(regex("^t[23]\\.", var.instance_type))
    error_message = "Instance type must be a t2 or t3 family type."
  }
}
Why this matters:
- bad values fail early
- module expectations are clearer
- future users make fewer mistakes
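The same pattern extends to numeric inputs. The sketch below assumes the cluster module exposes a `min_size` variable and that a minimum of one instance is the intended rule; both are illustrative assumptions.

```hcl
variable "min_size" {
  description = "Minimum number of instances in the cluster"
  type        = number

  validation {
    # Require a whole number of at least 1.
    condition     = var.min_size >= 1 && floor(var.min_size) == var.min_size
    error_message = "The min_size value must be a whole number of at least 1."
  }
}
```

A value such as `0` or `1.5` is then rejected during `terraform plan`, before any AWS API call is made.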
Refactor 3: Safer Replacements
Before
resource "aws_launch_template" "this" {
  # ...
}
After
resource "aws_launch_template" "this" {
  # ...

  lifecycle {
    create_before_destroy = true
  }
}
Why this matters:
- safer rolling replacements
- less disruption during changes
- better production behavior
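On the Auto Scaling group side, a rolling design might be sketched as follows. This is an illustration rather than the repository's actual module: `var.subnet_ids`, `var.target_group_arns`, the grace period, and the 50 percent threshold are assumptions.

```hcl
resource "aws_autoscaling_group" "this" {
  # name_prefix lets Terraform generate a unique name, so a replacement
  # group can exist next to the old one without a name collision.
  name_prefix         = "${var.cluster_name}-"
  min_size            = var.min_size
  max_size            = var.max_size
  vpc_zone_identifier = var.subnet_ids        # hypothetical input
  target_group_arns   = var.target_group_arns # hypothetical input

  # Judge instance health by the load balancer check, not only EC2 status.
  health_check_type         = "ELB"
  health_check_grace_period = 300 # assumed warm-up time in seconds

  launch_template {
    id      = aws_launch_template.this.id
    version = aws_launch_template.this.latest_version
  }

  # Replace instances gradually when the launch template changes.
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 50
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}
```

This also shows why unique names matter: with `create_before_destroy`, the new resource briefly coexists with the old one, so a fixed name would conflict.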
Refactor 4: Tighter Security Rules
Before
cidr_blocks = ["0.0.0.0/0"]
After
resource "aws_vpc_security_group_ingress_rule" "app_from_alb" {
  security_group_id            = aws_security_group.instance.id
  referenced_security_group_id = var.alb_security_group_id
  from_port                    = var.server_port
  to_port                      = var.server_port
  ip_protocol                  = "tcp"
}
Why this matters:
- instances are not exposed directly to the internet
- only the ALB can reach the app port
- networking intent is much safer and clearer
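The open CIDR does not disappear entirely; it moves to the one place that is meant to be public. A sketch of the matching ALB rule, assuming an `aws_security_group.alb` resource and an HTTP listener on port 80:

```hcl
resource "aws_vpc_security_group_ingress_rule" "alb_http_from_internet" {
  security_group_id = aws_security_group.alb.id # hypothetical ALB security group
  cidr_ipv4         = "0.0.0.0/0"               # public entry point by design
  from_port         = 80                        # assumed listener port
  to_port           = 80
  ip_protocol       = "tcp"
}
```

Internet traffic then terminates at the load balancer, and the instances accept connections only from the ALB's security group.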
Refactor 5: Basic Observability
I added an SNS topic and CloudWatch CPU alarm:
resource "aws_sns_topic" "alerts" {
  name = "${var.cluster_name}-alerts"
}

resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.cluster_name}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average"
  threshold           = 80
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.this.name
  }
}
Why this matters:
- infrastructure should not just run
- it should also tell you when it is unhealthy
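An SNS topic with no subscribers notifies nobody. One way to close that gap, assuming a hypothetical `alert_email` variable, is an email subscription:

```hcl
variable "alert_email" {
  description = "Address that receives alarm notifications" # hypothetical input
  type        = string
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}
```

Email subscriptions stay pending until the recipient confirms them from the message AWS sends, so Terraform alone cannot complete this step.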
How I Tested It
I tested Day 16 in two ways.
Manual test
I ran the example root module in:
day_16/examples/hello-world-app
Then I:
- applied the stack
- got the ALB DNS output
- opened it in the browser
- confirmed it returned:
Hello from Day 16
Then I destroyed everything.
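For context, an example root module of this kind is usually a thin wrapper around the service module. The sketch below is a guess at its shape: the relative source path, the input names, and the module's `alb_dns_name` output are assumptions based on the directory layout and the values used elsewhere in this post.

```hcl
module "hello_world_app" {
  # Assumed relative path from day_16/examples/hello-world-app
  source = "../../modules/services/hello-world-app"

  cluster_name  = var.cluster_name
  environment   = var.environment
  instance_type = var.instance_type
  min_size      = var.min_size
  max_size      = var.max_size
  server_text   = var.server_text
}

# Expose the ALB address so it can be opened in a browser or read by a test.
output "alb_dns_name" {
  description = "Public DNS name of the load balancer"
  value       = module.hello_world_app.alb_dns_name # assumed module output
}
```

After `terraform apply`, running `terraform output alb_dns_name` prints the address to open.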
Automated test with Terratest
I also added a Go test:
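package test

import (
	"strings"
	"testing"
	"time"

	http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
	"github.com/gruntwork-io/terratest/modules/terraform"
)

// TestHelloWorldApp deploys the example, checks the ALB response, then destroys it.
func TestHelloWorldApp(t *testing.T) {
	t.Parallel()

	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		TerraformDir: "../examples/hello-world-app",
		Vars: map[string]interface{}{
			"cluster_name":     "test-cluster",
			"instance_type":    "t3.micro",
			"min_size":         1,
			"max_size":         1,
			"desired_capacity": 1,
			"environment":      "dev",
			"server_text":      "Hello from Day 16",
		},
	})

	// Always clean up, even if an assertion below fails.
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	albDnsName := terraform.Output(t, terraformOptions, "alb_dns_name")
	url := "http://" + albDnsName

	// Retry up to 60 times, 10 seconds apart, while instances register with the ALB.
	http_helper.HttpGetWithRetryWithCustomValidation(t, url, nil, 60, 10*time.Second, func(statusCode int, body string) bool {
		return statusCode == 200 && strings.Contains(body, "Hello from Day 16")
	})
}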
Why Automated Testing Matters
Manual testing is useful, but automated infrastructure testing gives you things manual testing cannot:
- repeatability
- faster regression checks
- confidence after refactors
- executable proof that the infrastructure behaves as expected
- easier team validation in CI/CD later
Manual testing told me:
- "it works right now"
Terratest moves closer to:
- "it keeps working when I change the code"
That is a big difference.
My Main Takeaway
Day 16 changed how I think about Terraform quality.
The goal is not just:
- "Can I provision this?"
The better question is:
- "Can another engineer understand, trust, reuse, test, and operate this safely?"
That is what production-grade infrastructure really means.
Full Code
GitHub reference:
GitHub Link
Follow My Journey
This is Day 16 of my 30-Day Terraform Challenge.
See you on Day 17.