Mary Mutua

Creating Production-Grade Infrastructure with Terraform

Day 16 of my Terraform journey was less about creating new resources and more about improving how infrastructure is designed.

The big lesson was this:

Terraform code that works is not automatically production-grade.

Production-grade infrastructure should be:

  • modular
  • reliable
  • secure
  • observable
  • maintainable
  • testable

For Day 16, I audited my earlier infrastructure against that checklist, refactored it into smaller modules, added better validation and observability, and tested it both manually and with Terratest.

GitHub reference:
👉 GitHub Link

The Production-Grade Checklist

Here is the practical version of what I used.

1. Structure

In practice, this means:

  • no giant main.tf doing everything
  • small modules with one responsibility
  • clear variables and outputs
  • repeated logic centralized with locals

2. Reliability

In practice, this means:

  • safer replacements with create_before_destroy
  • proper health checks
  • names that won't collide
  • designs that support rolling changes
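
As a rough illustration of the health-check and naming points, a target group could look like the sketch below. The resource name `example`, the `/` health path, the thresholds, and `var.vpc_id` are assumptions for illustration and are not taken from the Day 16 code.

```hcl
# Hypothetical sketch: names, path, and thresholds are illustrative assumptions.
resource "aws_lb_target_group" "example" {
  # name_prefix lets Terraform generate a unique name, so a replacement
  # can exist alongside the old target group without a name collision.
  name_prefix = "app-"
  port        = var.server_port
  protocol    = "HTTP"
  vpc_id      = var.vpc_id # assumed input variable

  health_check {
    path                = "/"
    protocol            = "HTTP"
    matcher             = "200"
    interval            = 15
    timeout             = 3
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }

  lifecycle {
    create_before_destroy = true
  }
}
```

Combining `name_prefix` with `create_before_destroy` is what makes the replacement safe: a fixed `name` would clash while both copies briefly exist.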

3. Security

In practice, this means:

  • no secrets in Terraform code
  • tighter security group rules
  • safer state handling
  • least-privilege thinking
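
A minimal sketch of the first and third points, assuming an S3 backend and a hypothetical `db_password` input. The variable, bucket name, key, and region are placeholders, not values from this project.

```hcl
# Hypothetical sketch: the variable, bucket, and key names are placeholders.
variable "db_password" {
  description = "Database password, supplied at runtime (for example via TF_VAR_db_password)"
  type        = string
  sensitive   = true # redacts the value in plan and apply output
}

terraform {
  backend "s3" {
    bucket  = "example-terraform-state" # placeholder bucket name
    key     = "day_16/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true # encrypt the state object at rest
  }
}
```

Marking a variable `sensitive` only hides it from CLI output. The value is still written to the state file, which is why the state itself needs to be stored and encrypted carefully.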

4. Observability

In practice, this means:

  • consistent tagging
  • alerts for important metrics
  • basic operational visibility

5. Maintainability

In practice, this means:

  • README files for modules
  • pinned provider versions
  • reusable module boundaries
  • runnable examples
  • tests
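
Pinning versions might look like the following sketch. The exact version constraints are assumptions; the right values are whichever versions the module has actually been tested with.

```hcl
terraform {
  required_version = ">= 1.5.0" # assumed minimum; use the version you test with

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumed constraint: any 5.x release, but not 6.0
    }
  }
}
```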

That checklist turned Day 16 into an architecture refactor instead of just another provisioning exercise.

What I Refactored

I split the infrastructure into smaller modules:

  • modules/networking/alb
  • modules/cluster/asg-rolling-deploy
  • modules/services/hello-world-app
  • examples/hello-world-app

That made the design much easier to understand and test.
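
To show how the pieces fit together, here is a sketch of what the example root module could contain. The input names mirror the variables passed by the Terratest run later in this post; that the service module accepts exactly these inputs and exposes an `alb_dns_name` output is an assumption.

```hcl
# examples/hello-world-app/main.tf (hypothetical sketch)
module "hello_world_app" {
  source = "../../modules/services/hello-world-app"

  cluster_name     = var.cluster_name
  environment      = var.environment
  instance_type    = var.instance_type
  min_size         = var.min_size
  max_size         = var.max_size
  desired_capacity = var.desired_capacity
  server_text      = var.server_text
}

output "alb_dns_name" {
  description = "DNS name of the load balancer in front of the app"
  value       = module.hello_world_app.alb_dns_name # assumes the module exposes this output
}
```

The root module stays thin: it only wires inputs to the service module and re-exports the one output a caller needs.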

Refactor 1: Centralized Tagging

Before

resource "aws_instance" "web" {
  # ... other arguments omitted

  # Tags are hard-coded on each resource
  tags = {
    Name = "web-instance"
  }
}

After

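locals {
  # merge() gives precedence to later arguments, so var.custom_tags
  # can add to or override the defaults below.
  common_tags = merge(
    {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = var.project_name
      Owner       = var.team_name
    },
    var.custom_tags
  )
}

resource "aws_lb_target_group" "app" {
  # ... other arguments omitted

  # Shared tags plus a resource-specific Name
  tags = merge(local.common_tags, {
    Name = "${var.cluster_name}-tg"
  })
}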

Why this matters:

  • less repeated code
  • consistent tagging
  • easier filtering, billing, and operations
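
For completeness, the inputs that `local.common_tags` reads could be declared along these lines. The descriptions, types, and the empty default are assumptions; only the variable names come from the snippet above.

```hcl
# Hypothetical declarations for the inputs that local.common_tags reads.
variable "project_name" {
  description = "Project name used in the Project tag"
  type        = string
}

variable "team_name" {
  description = "Owning team used in the Owner tag"
  type        = string
}

variable "custom_tags" {
  description = "Extra tags merged on top of the common tags"
  type        = map(string)
  default     = {}
}
```

Defaulting `custom_tags` to an empty map keeps the input optional, so callers only pass it when they need extra tags.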

Refactor 2: Variable Validation

Before

variable "environment" {
  type = string
}

After

variable "environment" {
  description = "Deployment environment"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}

And for instance type:

variable "instance_type" {
  description = "EC2 instance type for the app cluster"
  type        = string

  validation {
    condition     = can(regex("^t[23]\\.", var.instance_type))
    error_message = "Instance type must be a t2 or t3 family type."
  }
}

Why this matters:

  • bad values fail early
  • module expectations are clearer
  • future users make fewer mistakes
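
The same pattern works for numeric inputs. As a sketch, the `server_port` variable used elsewhere in this post could be validated like this; the default of 8080 and the allowed range (unprivileged ports only) are assumptions.

```hcl
variable "server_port" {
  description = "Port the app listens on"
  type        = number
  default     = 8080 # assumed default

  validation {
    # Assumed rule: only allow unprivileged ports
    condition     = var.server_port >= 1024 && var.server_port <= 65535
    error_message = "Server port must be between 1024 and 65535."
  }
}
```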

Refactor 3: Safer Replacements

Before

resource "aws_launch_template" "this" {
  # ...
}

After

resource "aws_launch_template" "this" {
  # ...

  lifecycle {
    create_before_destroy = true
  }
}

Why this matters:

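  • when a change forces replacement, the new launch template is created before the old one is destroyed
  • resources that reference it, such as the Auto Scaling group, are never left pointing at a deleted template
  • less disruption during changes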
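
On the Auto Scaling group side, a rolling design could be sketched as below. This assumes the `aws_launch_template.this` resource from the snippet above; the resource name `example`, `var.subnet_ids`, `var.target_group_arns`, and the 50% threshold are illustrative assumptions.

```hcl
# Hypothetical sketch: resource names and input variables are assumptions.
resource "aws_autoscaling_group" "example" {
  name_prefix         = "${var.cluster_name}-"
  min_size            = var.min_size
  max_size            = var.max_size
  desired_capacity    = var.desired_capacity
  vpc_zone_identifier = var.subnet_ids        # assumed input
  target_group_arns   = var.target_group_arns # assumed input

  # Use the load balancer's health checks to decide whether an instance is healthy.
  health_check_type         = "ELB"
  health_check_grace_period = 300

  launch_template {
    id      = aws_launch_template.this.id
    version = aws_launch_template.this.latest_version
  }

  # Replace instances gradually when the launch template changes.
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 50
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}
```

Referencing `latest_version` means a new launch template version changes the group's configuration, which is what starts the instance refresh.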

Refactor 4: Tighter Security Rules

Before

# Instance ingress open to any IPv4 address
cidr_blocks = ["0.0.0.0/0"]

After

resource "aws_vpc_security_group_ingress_rule" "app_from_alb" {
  security_group_id            = aws_security_group.instance.id
  referenced_security_group_id = var.alb_security_group_id
  from_port                    = var.server_port
  to_port                      = var.server_port
  ip_protocol                  = "tcp"
}

Why this matters:

  • instances are not exposed directly to the internet
  • only traffic from the ALB's security group can reach the app port
  • the networking intent is safer and easier to read
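
The public exposure then lives only on the load balancer. A sketch of the matching ALB rules, assuming a security group named `aws_security_group.alb` and an HTTP listener on port 80 (both assumptions; in a split-module layout the instance security group ID would more likely arrive as an input):

```hcl
# Hypothetical sketch: aws_security_group.alb is an assumed resource name.
resource "aws_vpc_security_group_ingress_rule" "alb_http_in" {
  security_group_id = aws_security_group.alb.id
  cidr_ipv4         = "0.0.0.0/0" # the ALB is the only public entry point
  from_port         = 80
  to_port           = 80
  ip_protocol       = "tcp"
}

# Let the ALB forward traffic to the instances on the app port only.
resource "aws_vpc_security_group_egress_rule" "alb_to_app" {
  security_group_id            = aws_security_group.alb.id
  referenced_security_group_id = aws_security_group.instance.id
  from_port                    = var.server_port
  to_port                      = var.server_port
  ip_protocol                  = "tcp"
}
```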

Refactor 5: Basic Observability

I added an SNS topic and CloudWatch CPU alarm:

resource "aws_sns_topic" "alerts" {
  name = "${var.cluster_name}-alerts"
}

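# Alarm when average CPU across the Auto Scaling group stays above 80%
# for two consecutive 2-minute periods, then notify the SNS topic.
# Note: EC2 basic monitoring publishes datapoints every 5 minutes, so a
# 120-second period works best with detailed monitoring enabled.
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.cluster_name}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average"
  threshold           = 80
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.this.name
  }
}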

Why this matters:

  • infrastructure should not just run; it should also tell you when it is unhealthy
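
An SNS topic only helps if something is subscribed to it. The sketch below adds an email subscription and a second alarm on unhealthy targets. `var.alert_email`, `var.target_group_arn_suffix`, and `var.alb_arn_suffix` are hypothetical inputs, and the thresholds are assumptions.

```hcl
# Hypothetical sketch: var.alert_email and the *_arn_suffix inputs are assumptions.
resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email # the recipient must confirm the subscription
}

# Alarm when the load balancer reports any unhealthy target
# for two consecutive 1-minute periods.
resource "aws_cloudwatch_metric_alarm" "unhealthy_hosts" {
  alarm_name          = "${var.cluster_name}-unhealthy-hosts"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "UnHealthyHostCount"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Maximum"
  threshold           = 0
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    TargetGroup  = var.target_group_arn_suffix
    LoadBalancer = var.alb_arn_suffix
  }
}
```

CPU tells you the instances are busy; unhealthy host count tells you users may already be affected, so the two alarms complement each other.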

How I Tested It

I tested Day 16 in two ways.

Manual test

I ran the example root module in:

  • day_16/examples/hello-world-app

Then I:

  • applied the stack
  • got the ALB DNS output
  • opened it in the browser
  • confirmed it returned:

Hello from Day 16

Then I destroyed everything.

Automated test with Terratest

I also added a Go test:

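package test

import (
    "strings"
    "testing"
    "time"

    http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
    "github.com/gruntwork-io/terratest/modules/terraform"
)

func TestHelloWorldApp(t *testing.T) {
    t.Parallel()

    // Point Terratest at the example root module and pass its input variables.
    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../examples/hello-world-app",
        Vars: map[string]interface{}{
            "cluster_name":     "test-cluster",
            "instance_type":    "t3.micro",
            "min_size":         1,
            "max_size":         1,
            "desired_capacity": 1,
            "environment":      "dev",
            "server_text":      "Hello from Day 16",
        },
    })

    // Always tear the stack down, even if an assertion fails.
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    albDnsName := terraform.Output(t, terraformOptions, "alb_dns_name")
    url := "http://" + albDnsName

    // Retry up to 60 times, 10 seconds apart, until the app returns the expected page.
    http_helper.HttpGetWithRetryWithCustomValidation(t, url, nil, 60, 10*time.Second, func(statusCode int, body string) bool {
        return statusCode == 200 && strings.Contains(body, "Hello from Day 16")
    })
}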

Why Automated Testing Matters

Manual testing is useful, but automated infrastructure testing gives you things manual testing cannot:

  • repeatability
  • faster regression checks
  • confidence after refactors
  • executable proof that the infrastructure behaves as expected
  • easier team validation in CI/CD later

Manual testing told me:

  • “it works right now”

Terratest moves closer to:

  • “it keeps working when I change the code”

That is a big difference.

My Main Takeaway

Day 16 changed how I think about Terraform quality.

The goal is not just:

  • “Can I provision this?”

The better question is:

  • “Can another engineer understand, trust, reuse, test, and operate this safely?”

That is what production-grade infrastructure really means.

Full Code

GitHub reference:
👉 GitHub Link

Follow My Journey

This is Day 16 of my 30-Day Terraform Challenge.
See you on Day 17 🚀
