Mary Mutua

Posted on Apr 7

Creating Production-Grade Infrastructure with Terraform

#terraform #aws #infrastructureascode #devops

Day 16 of my Terraform journey was less about creating new resources and more about improving how infrastructure is designed.

The big lesson was this:

Terraform code that works is not automatically production-grade.

Production-grade infrastructure should be:

modular
reliable
secure
observable
maintainable
testable

For Day 16, I audited my earlier infrastructure against that checklist, refactored it into smaller modules, added better validation and observability, and tested it both manually and with Terratest.

GitHub reference:
👉 Github Link

The Production-Grade Checklist

Here is the practical version of what I used.

1. Structure

In practice, this means:

no giant main.tf doing everything
small modules with one responsibility
clear variables and outputs
repeated logic centralized with locals

2. Reliability

In practice, this means:

safer replacements with create_before_destroy
proper health checks
names that won’t collide
designs that support rolling changes
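
To make those four items concrete, here is a minimal sketch of a launch template and Auto Scaling Group that combine them. The resource names, the input variables, the fixed sizes, and the 50% healthy threshold are assumptions for illustration, not the exact code from my module.

```hcl
# Hypothetical inputs; the names are illustrative.
variable "cluster_name" { type = string }
variable "ami_id" { type = string }
variable "instance_type" { type = string }
variable "subnet_ids" { type = list(string) }
variable "target_group_arns" { type = list(string) }

resource "aws_launch_template" "example" {
  # name_prefix lets AWS generate a unique suffix, so a replacement
  # template can exist next to the old one without a name collision.
  name_prefix   = "${var.cluster_name}-"
  image_id      = var.ami_id
  instance_type = var.instance_type

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "example" {
  name_prefix         = "${var.cluster_name}-"
  min_size            = 2
  max_size            = 4
  vpc_zone_identifier = var.subnet_ids
  target_group_arns   = var.target_group_arns

  # Use the load balancer health check, not only the EC2 status check.
  health_check_type         = "ELB"
  health_check_grace_period = 300

  launch_template {
    id      = aws_launch_template.example.id
    version = aws_launch_template.example.latest_version
  }

  # Replace instances gradually when the launch template changes.
  instance_refresh {
    strategy = "Rolling"

    preferences {
      min_healthy_percentage = 50
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}
```

The design choice here is that `name_prefix` and `create_before_destroy` work as a pair: creating the new resource first only succeeds if its name does not clash with the old one.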

3. Security

In practice, this means:

no secrets in Terraform code
tighter security group rules
safer state handling
least-privilege thinking
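
As a rough sketch of the first and third items, the snippet below keeps state in an encrypted S3 backend and declares a secret as a sensitive variable with no default. The bucket name, state key, region, and the `db_password` variable are all hypothetical; my stack does not necessarily use any of them.

```hcl
terraform {
  backend "s3" {
    bucket  = "example-terraform-state"                  # hypothetical bucket name
    key     = "day_16/hello-world-app/terraform.tfstate" # hypothetical state key
    region  = "us-east-1"                                # assumed region
    encrypt = true                                       # encrypt the state object at rest
  }
}

variable "db_password" {
  description = "Hypothetical secret. No default, so the value never lives in code."
  type        = string
  sensitive   = true # redacts the value in plan and apply output
}
```

A value like this can be supplied through the `TF_VAR_db_password` environment variable. Note that `sensitive = true` only hides the value in CLI output. It is still written to the state file, which is why safe state handling belongs on the same checklist.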

4. Observability

In practice, this means:

consistent tagging
alerts for important metrics
basic operational visibility

5. Maintainability

In practice, this means:

README files for modules
pinned provider versions
reusable module boundaries
runnable examples
tests
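
Pinning versions, for example, usually takes a small block like the one below. The exact version constraints are assumptions; pick the ones your modules were actually tested with.

```hcl
terraform {
  # Assumed minimum CLI version.
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # allow 5.x releases, block the next major version
    }
  }
}
```

Committing the generated `.terraform.lock.hcl` file alongside this block records the exact provider versions that were selected, so teammates and CI resolve the same ones.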

That checklist turned Day 16 into an architecture refactor instead of just another provisioning exercise.

What I Refactored

I split the infrastructure into smaller modules:

modules/networking/alb
modules/cluster/asg-rolling-deploy
modules/services/hello-world-app
examples/hello-world-app

That made the design much easier to understand and test.
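
To show how the pieces fit together, here is a sketch of what the root module in examples/hello-world-app could look like. The variable names and the `alb_dns_name` output match what my Terratest test passes in and reads back, but the module's exact input list, the file layout, and the region are assumptions.

```hcl
# examples/hello-world-app/main.tf (assumed file layout)

variable "cluster_name" { type = string }
variable "instance_type" { type = string }
variable "environment" { type = string }
variable "server_text" { type = string }
variable "min_size" { type = number }
variable "max_size" { type = number }
variable "desired_capacity" { type = number }

provider "aws" {
  region = "us-east-1" # assumed region
}

module "hello_world_app" {
  source = "../../modules/services/hello-world-app"

  # Assumed to mirror the module's own input variables.
  cluster_name     = var.cluster_name
  instance_type    = var.instance_type
  environment      = var.environment
  server_text      = var.server_text
  min_size         = var.min_size
  max_size         = var.max_size
  desired_capacity = var.desired_capacity
}

output "alb_dns_name" {
  description = "DNS name of the load balancer, used by manual and automated tests"
  value       = module.hello_world_app.alb_dns_name
}
```

The root module stays thin on purpose: it only wires inputs to the service module and exposes one output, which keeps it usable as both documentation and a test fixture.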

Refactor 1: Centralized Tagging

Before

resource "aws_instance" "web" {
  # ... other arguments omitted

  # One hardcoded tag, repeated by hand on every resource
  tags = {
    Name = "web-instance"
  }
}

After

locals {
  common_tags = merge(
    {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = var.project_name
      Owner       = var.team_name
    },
    var.custom_tags
  )
}

resource "aws_lb_target_group" "app" {
  # ... other arguments omitted

  # Shared tags plus a resource-specific Name
  tags = merge(local.common_tags, {
    Name = "${var.cluster_name}-tg"
  })
}

Why this matters:

less repeated code
consistent tagging
easier filtering, billing, and operations
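
One detail worth knowing is how `merge()` resolves duplicate keys: later arguments win. The sketch below declares a `custom_tags` input (the type and default are my assumption) and demonstrates the precedence with literal maps so it can be checked in isolation.

```hcl
variable "custom_tags" {
  description = "Extra tags merged into every resource"
  type        = map(string)
  default     = {}
}

locals {
  base_tags = {
    Environment = "dev"
    ManagedBy   = "terraform"
  }

  # Later maps override earlier ones on duplicate keys.
  example_tags = merge(local.base_tags, {
    Environment = "staging"
    Team        = "platform"
  })
}

output "example_tags" {
  value = local.example_tags
  # Environment = "staging", ManagedBy = "terraform", Team = "platform"
}
```

Because `var.custom_tags` is the last argument in `common_tags`, a caller can override `Environment` or `Owner`. If that is not wanted, swapping the argument order makes the standard tags win instead.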

Refactor 2: Variable Validation

Before

variable "environment" {
  type = string
}

After

variable "environment" {
  description = "Deployment environment"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}

And for instance type:

variable "instance_type" {
  description = "EC2 instance type for the app cluster"
  type        = string

  validation {
    condition     = can(regex("^t[23]\\.", var.instance_type))
    error_message = "Instance type must be a t2 or t3 family type."
  }
}

Why this matters:

bad values fail early
module expectations are clearer
future users make fewer mistakes
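
The same pattern extends to other inputs, and a variable can carry more than one `validation` block. The rules below (unprivileged ports only, a 20-character name limit, lowercase names) are hypothetical examples rather than the checks in my modules.

```hcl
variable "server_port" {
  description = "Port the app listens on"
  type        = number
  default     = 8080

  validation {
    condition     = var.server_port >= 1024 && var.server_port <= 65535
    error_message = "Server port must be between 1024 and 65535."
  }
}

variable "cluster_name" {
  description = "Name prefix for all cluster resources"
  type        = string

  # Each validation block is evaluated separately and reports its own message.
  validation {
    condition     = length(var.cluster_name) <= 20
    error_message = "Cluster name must be 20 characters or fewer."
  }

  validation {
    condition     = can(regex("^[a-z0-9-]+$", var.cluster_name))
    error_message = "Cluster name may contain only lowercase letters, digits, and hyphens."
  }
}
```

With these in place, a value such as `server_port = 80` is rejected during `terraform plan`, before any AWS API call is made.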

Refactor 3: Safer Replacements

Before

resource "aws_launch_template" "this" {
  # ...
}

After

resource "aws_launch_template" "this" {
  # ...

  lifecycle {
    create_before_destroy = true
  }
}

Why this matters:

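when a change forces replacement, Terraform creates the new resource before destroying the old one
resources that reference the launch template are never left pointing at something that no longer exists
changes roll out with less disruption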

Refactor 4: Tighter Security Rules

Before

cidr_blocks = ["0.0.0.0/0"]

After

resource "aws_vpc_security_group_ingress_rule" "app_from_alb" {
  security_group_id            = aws_security_group.instance.id
  referenced_security_group_id = var.alb_security_group_id
  from_port                    = var.server_port
  to_port                      = var.server_port
  ip_protocol                  = "tcp"
}

Why this matters:

instances are not exposed directly to the internet
only the ALB can reach the app port
networking intent is much safer and clearer
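
For completeness, here is a sketch of the load balancer side of that relationship. It defines both security groups in one place for readability, whereas my modules pass the ALB group in as `var.alb_security_group_id`. The group names and the default port are assumptions.

```hcl
variable "vpc_id" {
  type = string
}

variable "server_port" {
  type    = number
  default = 8080 # assumed app port
}

resource "aws_security_group" "alb" {
  name_prefix = "example-alb-"
  vpc_id      = var.vpc_id
}

resource "aws_security_group" "instance" {
  name_prefix = "example-instance-"
  vpc_id      = var.vpc_id
}

# The ALB is the only component open to the internet.
resource "aws_vpc_security_group_ingress_rule" "alb_http" {
  security_group_id = aws_security_group.alb.id
  cidr_ipv4         = "0.0.0.0/0"
  from_port         = 80
  to_port           = 80
  ip_protocol       = "tcp"
}

# The ALB may only send traffic to the instances on the app port.
resource "aws_vpc_security_group_egress_rule" "alb_to_app" {
  security_group_id            = aws_security_group.alb.id
  referenced_security_group_id = aws_security_group.instance.id
  from_port                    = var.server_port
  to_port                      = var.server_port
  ip_protocol                  = "tcp"
}
```

Security groups created by Terraform start with no egress rules, so outbound traffic has to be allowed explicitly like this. Instances that need to download packages at boot would need their own egress rule as well.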

Refactor 5: Basic Observability

I added an SNS topic and a CloudWatch CPU alarm:

resource "aws_sns_topic" "alerts" {
  name = "${var.cluster_name}-alerts"
}

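# Fires when average CPU across the Auto Scaling Group stays above 80%
# for 2 consecutive 120-second periods (about 4 minutes).
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.cluster_name}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average"
  threshold           = 80

  # Publish to the SNS topic when the alarm enters the ALARM state
  alarm_actions = [aws_sns_topic.alerts.arn]

  # Aggregate the metric over every instance in the group
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.this.name
  }
}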

Why this matters:

infrastructure should not just run
it should also tell you when it is unhealthy
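
An SNS topic with no subscribers notifies nobody, so a natural next step is a subscription, plus an alarm on target health. This is a sketch that assumes it lives in the same module as the `aws_sns_topic.alerts` resource shown earlier. The `alert_email` variable and the two ARN-suffix variables are hypothetical inputs.

```hcl
variable "cluster_name" { type = string }
variable "alert_email" { type = string }             # hypothetical recipient
variable "alb_arn_suffix" { type = string }          # hypothetical input
variable "target_group_arn_suffix" { type = string } # hypothetical input

# Deliver alarm notifications to a person.
resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn # topic from the snippet above
  protocol  = "email"
  endpoint  = var.alert_email
}

# Alert when the load balancer reports any unhealthy target.
resource "aws_cloudwatch_metric_alarm" "unhealthy_hosts" {
  alarm_name          = "${var.cluster_name}-unhealthy-hosts"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "UnHealthyHostCount"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Maximum"
  threshold           = 0
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    LoadBalancer = var.alb_arn_suffix
    TargetGroup  = var.target_group_arn_suffix
  }
}
```

Email subscriptions stay pending until the recipient confirms them from the message AWS sends, so the first alarm will not arrive until that step is done.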

How I Tested It

I tested Day 16 in two ways.

Manual test

I ran the example root module in:

day_16/examples/hello-world-app

Then I:

applied the stack
got the ALB DNS output
opened it in the browser
confirmed it returned:

Hello from Day 16

Then I destroyed everything.

Automated test with Terratest

I also added a Go test:

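// Package name is an assumption; use whatever your test directory uses.
package test

import (
    "strings"
    "testing"
    "time"

    http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
    "github.com/gruntwork-io/terratest/modules/terraform"
)

// TestHelloWorldApp deploys the example stack, checks the ALB response,
// and destroys everything afterwards.
func TestHelloWorldApp(t *testing.T) {
    t.Parallel()

    // Retry on known transient Terraform errors.
    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../examples/hello-world-app",
        Vars: map[string]interface{}{
            "cluster_name":     "test-cluster",
            "instance_type":    "t3.micro",
            "min_size":         1,
            "max_size":         1,
            "desired_capacity": 1,
            "environment":      "dev",
            "server_text":      "Hello from Day 16",
        },
    })

    // Deferred so the stack is destroyed even if a later check fails.
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    albDNSName := terraform.Output(t, terraformOptions, "alb_dns_name")
    url := "http://" + albDNSName

    // Up to 60 attempts, 10 seconds apart, while instances boot and pass health checks.
    http_helper.HttpGetWithRetryWithCustomValidation(t, url, nil, 60, 10*time.Second, func(statusCode int, body string) bool {
        return statusCode == 200 && strings.Contains(body, "Hello from Day 16")
    })
}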

Why Automated Testing Matters

Manual testing is useful, but automated infrastructure testing gives you things manual testing cannot:

repeatability
faster regression checks
confidence after refactors
executable proof that the infrastructure behaves as expected
easier team validation in CI/CD later

Manual testing told me:

“it works right now”

Terratest moves closer to:

“it keeps working when I change the code”

That is a big difference.

My Main Takeaway

Day 16 changed how I think about Terraform quality.

The goal is not just:

“Can I provision this?”

The better question is:

“Can another engineer understand, trust, reuse, test, and operate this safely?”

That is what production-grade infrastructure really means.

Full Code

GitHub reference:
👉 Github Link

Follow My Journey

This is Day 16 of my 30-Day Terraform Challenge.
See you on Day 17 🚀
