Mukami

Posted on Mar 30

Creating Production-Grade Infrastructure with Terraform

#tutorial #terraform #30daychallenge #aws

The Gap Between "It Works" and "It's Production-Ready"

It's Day 16 of the 30-Day Terraform Challenge, and today I learned that my "working" infrastructure was nowhere near production-ready.

I had a webserver cluster. It deployed. It served traffic. I was proud of it.

Then I ran it against a production-grade checklist. The result was humbling.

The Production-Grade Checklist

I audited my infrastructure against 5 categories:

Category	Score	What I Was Missing
Structure	80%	Some hardcoded values
Reliability	60%	No `prevent_destroy`, no wait timeouts
Security	70%	SSH open to world, no validation
Observability	30%	No consistent tags, no alarms
Maintainability	90%	Missing input validation

The gap was significant. Here's how I closed it.

Refactor 1: Consistent Tagging

Before: Tags scattered across resources, inconsistent.

tags = {
  Name        = "${var.cluster_name}-instance"
  Environment = var.environment
}

After: Centralized tags with locals and merge().

locals {
  common_tags = {
    Environment = var.environment
    ManagedBy   = "Terraform"
    Project     = var.project_name
    Team        = var.team_name
    Day         = "16"
  }
}

tags = merge(local.common_tags, {
  Name = "${var.cluster_name}-alb"
})

Now every resource has the same base tags. Cost allocation, ownership tracking, and operations all benefit.
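One wrinkle for a webserver cluster: aws_autoscaling_group does not take a plain tags map. It uses repeated tag blocks, each with a propagate_at_launch flag, so the same locals need a dynamic block to reach the instances. A minimal sketch, assuming the ASG is named "web" as elsewhere in this post:

```hcl
resource "aws_autoscaling_group" "web" {
  # ... config ...

  # Emit one tag block per entry in the merged map so that
  # launched instances inherit the same base tags.
  dynamic "tag" {
    for_each = merge(local.common_tags, {
      Name = "${var.cluster_name}-instance"
    })

    content {
      key                 = tag.key
      value               = tag.value
      propagate_at_launch = true
    }
  }
}
```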

Refactor 2: Lifecycle Protection

Before: No protection against accidental deletion.

After: Added prevent_destroy to critical resources.

resource "aws_lb" "web" {
  # ... config ...

  lifecycle {
    prevent_destroy = true  # Can't accidentally delete ALB
  }
}

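Without this, one mistaken terraform destroy wipes production. With it, Terraform fails any plan that would destroy the ALB, before it changes anything. The guard lives in the configuration, so it only holds while the resource block itself is still there.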
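prevent_destroy only guards against Terraform itself. The aws_lb resource also exposes enable_deletion_protection, which asks AWS to refuse deletion at the API level, including deletions from the console or CLI. A sketch of both guards together (combining them is my suggestion, not something the checklist requires):

```hcl
resource "aws_lb" "web" {
  # ... config ...

  # AWS-side guard: the API rejects deletion until this is set back to false.
  enable_deletion_protection = true

  lifecycle {
    # Terraform-side guard: any plan that would destroy this resource fails.
    prevent_destroy = true
  }
}
```

Tearing the ALB down on purpose then takes two deliberate steps: turn off deletion protection, then remove prevent_destroy.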

Refactor 3: CloudWatch Alarms

Before: No monitoring. If CPU spiked, I wouldn't know.

After: Alarms that notify via SNS.

resource "aws_sns_topic" "alerts" {
  name = "${var.cluster_name}-alerts"
}

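resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.cluster_name}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2   # 2 periods x 120 seconds = 4 minutes
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  statistic           = "Average" # required: how datapoints in each period are aggregated
  period              = 120       # seconds
  threshold           = 80
  alarm_description   = "Average CPU exceeds 80% for 4 minutes"

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

Now when average CPU across the group stays above 80% for two consecutive 2-minute periods, the alarm publishes to the SNS topic. Once the topic has a subscriber, I get the alert and can scale before users notice.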
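An SNS topic delivers nothing until something subscribes to it. A minimal sketch of an email subscription, assuming a hypothetical alert_email variable that is not part of the original module:

```hcl
variable "alert_email" {
  description = "Hypothetical: address that receives alarm notifications"
  type        = string
}

resource "aws_sns_topic_subscription" "alerts_email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}
```

Email subscriptions stay pending until the recipient clicks the confirmation link AWS sends, so the first apply is not the end of the setup.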

Refactor 4: Input Validation

Before: Any value was accepted, so a typo like environment = "prod" would go straight through.

After: Validation blocks catch mistakes early.

variable "environment" {
  description = "Deployment environment"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}

variable "instance_type" {
  description = "EC2 instance type for the cluster"
  type        = string

  validation {
    # Accept only types that start with "t2." or "t3."
    condition     = can(regex("^t[23]\\.", var.instance_type))
    error_message = "Instance type must be a t2 or t3 family type."
  }
}

Try terraform plan -var="environment=prod" (missing "uction"):

Error: Invalid value for variable
Environment must be dev, staging, or production.

Caught at plan time, not after deployment.

Refactor 5: ASG Wait Timeout

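Before: Neither health_check_grace_period nor wait_for_capacity_timeout was set, so the cluster silently relied on provider defaults I had never looked at.

After: Both timeouts are explicit.

resource "aws_autoscaling_group" "web" {
  # ... config ...

  health_check_grace_period = 300   # seconds before health checks count against a new instance
  wait_for_capacity_timeout = "10m" # how long Terraform waits for capacity before failing the apply
}

Both values match the AWS provider's defaults (300 seconds and "10m"), so this change makes the behavior visible and tunable rather than different. Terraform waits up to 10 minutes for the group to reach its required number of healthy instances, and fails the apply if it doesn't. For that wait to cover load balancer health checks, min_elb_capacity or wait_for_elb_capacity must be set, and old instances are only kept around during a replacement if the group uses create_before_destroy. Zero-downtime deploys depend on all of those pieces together.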
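The timeout by itself does not gate a deployment on load balancer health. A sketch of the additional settings the AWS provider documents for that, assuming a hypothetical min_size variable and a load balancer already attached in the elided config:

```hcl
resource "aws_autoscaling_group" "web" {
  # ... config ...

  health_check_type         = "ELB" # use load balancer health, not just EC2 status checks
  health_check_grace_period = 300
  wait_for_capacity_timeout = "10m"

  # On creation, wait until this many instances are healthy
  # behind the load balancer before the apply succeeds.
  min_elb_capacity = var.min_size # hypothetical variable

  lifecycle {
    # When the group must be replaced, build the new one first
    # and destroy the old one only after the new one is ready.
    create_before_destroy = true
  }
}
```

create_before_destroy only comes into play when Terraform replaces the group, for example because its name changed. In-place updates do not trigger the capacity wait on the load balancer.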

The Before and After

Aspect	Before	After
Tags	Inconsistent	Centralized, all resources tagged
Deletion Protection	None	`prevent_destroy` on ALB
Monitoring	None	CloudWatch alarms + SNS
Validation	None	All variables validated
ASG Wait	Implicit defaults	Explicit 10-minute timeout
SSH Access	0.0.0.0/0	Restricted (configurable)
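
The SSH row has no refactor shown above, so here is one way it could look. This is a sketch under assumptions: the ssh_allowed_cidrs variable and the aws_security_group.instance reference are hypothetical names, not taken from the original module.

```hcl
variable "ssh_allowed_cidrs" {
  description = "Hypothetical: CIDR blocks allowed to reach port 22"
  type        = list(string)
  default     = [] # no SSH access unless a caller opts in

  validation {
    condition = alltrue([
      for cidr in var.ssh_allowed_cidrs :
      can(cidrhost(cidr, 0)) && cidr != "0.0.0.0/0"
    ])
    error_message = "Each entry must be a valid CIDR block, and 0.0.0.0/0 is not allowed."
  }
}

resource "aws_security_group_rule" "ssh" {
  # Create the rule only when at least one CIDR was supplied.
  count = length(var.ssh_allowed_cidrs) > 0 ? 1 : 0

  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = var.ssh_allowed_cidrs
  security_group_id = aws_security_group.instance.id # hypothetical resource name
}
```

With an empty default, SSH stays closed unless someone deliberately passes a range, and the validation block rejects the open-to-world case at plan time.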

What I Learned

Production-grade isn't about features. It's about resilience.

My code "worked." But it wouldn't survive:

A bad terraform destroy command
A CPU spike at 3 AM
A teammate typing "prod" instead of "production"
An instance taking 2 minutes to boot

Each refactor addresses a failure mode I hadn't considered.

The Checklist Matters

The production-grade checklist isn't just a list. It's a map of failure modes.

Tagging → Who owns this? Who pays for it?
prevent_destroy → What happens if I fat-finger this?
Alarms → How will I know something is wrong?
Validation → What if someone passes wrong values?
Timeouts → What if things take longer than expected?

Every checkbox answers a "what if" question.

The Bottom Line

Today I transformed "working" infrastructure into "production-ready" infrastructure.

The difference isn't features. It's resilience, observability, and safety.

If you're deploying code that matters, run it through a production checklist.

P.S. The moment I added prevent_destroy to my ALB, I felt safer. The moment I added validation, I felt smarter. The moment I added alarms, I felt like a real engineer. Small changes, big impact.
