Mukami

Creating Production-Grade Infrastructure with Terraform

The Gap Between "It Works" and "It's Production-Ready"


Day 16 of the 30-Day Terraform Challenge — and today I learned that my "working" infrastructure was nowhere near production-ready.

I had a webserver cluster. It deployed. It served traffic. I was proud of it.

Then I ran it against a production-grade checklist. The result was humbling.


The Production-Grade Checklist

I audited my infrastructure against 5 categories:

| Category        | Score | What I Was Missing                   |
|-----------------|-------|--------------------------------------|
| Structure       | 80%   | Some hardcoded values                |
| Reliability     | 60%   | No prevent_destroy, no wait timeouts |
| Security        | 70%   | SSH open to world, no validation     |
| Observability   | 30%   | No consistent tags, no alarms        |
| Maintainability | 90%   | Missing input validation             |

The gap was significant. Here's how I closed it.


Refactor 1: Consistent Tagging

Before: Tags scattered across resources, inconsistent.

tags = {
  Name        = "${var.cluster_name}-instance"
  Environment = var.environment
}

After: Centralized tags with locals and merge().

locals {
  common_tags = {
    Environment = var.environment
    ManagedBy   = "Terraform"
    Project     = var.project_name
    Team        = var.team_name
    Day         = "16"
  }
}

tags = merge(local.common_tags, {
  Name = "${var.cluster_name}-alb"
})

Now every resource has the same base tags. Cost allocation, ownership tracking, and operations all benefit.
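An alternative worth knowing about (not part of my original refactor): the AWS provider can stamp base tags onto every taggable resource automatically via a default_tags block, so individual resources only declare what's unique to them. A sketch, assuming a var.aws_region variable:

```hcl
# Provider-level tagging: every taggable AWS resource inherits these tags,
# and per-resource tags are merged on top. Variable names are illustrative.
provider "aws" {
  region = var.aws_region

  default_tags {
    tags = local.common_tags
  }
}
```

With this in place, the merge() call is only needed for resource-specific tags like Name.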


Refactor 2: Lifecycle Protection

Before: No protection against accidental deletion.

After: Added prevent_destroy to critical resources.

resource "aws_lb" "web" {
  # ... config ...

  lifecycle {
    prevent_destroy = true  # Can't accidentally delete ALB
  }
}

Without this, one wrong terraform destroy wipes production. With it, Terraform errors before doing damage.
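If you do run terraform destroy against it, the plan fails with an error along these lines (abridged; exact wording varies by Terraform version):

```
Error: Instance cannot be destroyed

Resource aws_lb.web has lifecycle.prevent_destroy set, but the plan calls
for this resource to be destroyed. To avoid this error and continue with
the plan, either disable lifecycle.prevent_destroy or reduce the scope of
the plan using the -target option.
```

One caveat: prevent_destroy must be a literal true or false. It can't reference a variable, so you can't toggle it per environment.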


Refactor 3: CloudWatch Alarms

Before: No monitoring. If CPU spiked, I wouldn't know.

After: Alarms that notify via SNS.

resource "aws_sns_topic" "alerts" {
  name = "${var.cluster_name}-alerts"
}

resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.cluster_name}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  threshold           = 80
  alarm_description   = "CPU exceeds 80% for 4 minutes"

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

Now when CPU hits 80% for 4 minutes, I get an alert. I can scale before users notice.
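One thing the snippet leaves implicit: an SNS topic with no subscribers notifies no one. A minimal sketch of wiring up an email subscriber, assuming an alert_email variable I didn't define above:

```hcl
# Hypothetical: subscribe an email address so alarm notifications actually
# reach someone. The recipient must confirm the subscription (AWS sends a
# confirmation email) before notifications are delivered.
resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email # assumed variable, e.g. an on-call address
}
```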


Refactor 4: Input Validation

Before: Any value was accepted. A typo like environment = "prod" would deploy without complaint.

After: Validation blocks catch mistakes early.

variable "environment" {
  type = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}

variable "instance_type" {
  type = string

  validation {
    condition     = can(regex("^t[23]\\.", var.instance_type))
    error_message = "Instance type must be a t2 or t3 family type."
  }
}

Try terraform plan -var="environment=prod" (missing "uction"):

Error: Invalid value for variable
Environment must be dev, staging, or production.

Caught at plan time, not after deployment.


Refactor 5: ASG Wait Timeout

Before: No health_check_grace_period and no explicit wait_for_capacity_timeout — instances could be flagged unhealthy (and replaced) before they even finished booting.

After: Added patience.

resource "aws_autoscaling_group" "web" {
  # ... launch configuration, min/max size, target groups ...

  health_check_grace_period = 300
  wait_for_capacity_timeout = "10m"
}

Now Terraform waits up to 10 minutes for instances to pass health checks before destroying old ones. Critical for zero-downtime.
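Worth noting: "before destroying old ones" depends on replacements being created first. The classic zero-downtime pattern (from Terraform: Up & Running) adds create_before_destroy to the same ASG, sketched here assuming a launch configuration named aws_launch_configuration.web and a var.min_size variable:

```hcl
resource "aws_autoscaling_group" "web" {
  # ... health_check_grace_period and wait_for_capacity_timeout as above ...

  # Tying the ASG's name to the launch configuration means replacing the
  # launch config replaces the ASG — and create_before_destroy ensures the
  # new ASG is up and healthy before the old one is destroyed.
  name                 = aws_launch_configuration.web.name
  launch_configuration = aws_launch_configuration.web.name
  min_elb_capacity     = var.min_size # assumed: healthy-instance floor to wait for

  lifecycle {
    create_before_destroy = true
  }
}
```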


The Before and After

| Aspect              | Before       | After                             |
|---------------------|--------------|-----------------------------------|
| Tags                | Inconsistent | Centralized, all resources tagged |
| Deletion protection | None         | prevent_destroy on ALB            |
| Monitoring          | None         | CloudWatch alarms + SNS           |
| Validation          | None         | All variables validated           |
| ASG wait            | None         | 10-minute timeout                 |
| SSH access          | 0.0.0.0/0    | Restricted (configurable)         |
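That last row is the one refactor I didn't show above. A sketch of what "restricted (configurable)" can look like, assuming a security group named aws_security_group.instance and a new variable:

```hcl
# Assumed variable: CIDR blocks allowed to SSH in. Defaults to none,
# so SSH is closed unless explicitly opened.
variable "allowed_ssh_cidrs" {
  type    = list(string)
  default = []
}

# Only create the ingress rule if at least one CIDR is allowed.
resource "aws_security_group_rule" "ssh" {
  count             = length(var.allowed_ssh_cidrs) > 0 ? 1 : 0
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = var.allowed_ssh_cidrs
  security_group_id = aws_security_group.instance.id # assumed SG name
}
```

The default is the safe state; opening SSH requires a deliberate, reviewable change to a variable.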

What I Learned

Production-grade isn't about features. It's about resilience.

My code "worked." But it wouldn't survive:

  • A bad terraform destroy command
  • A CPU spike at 3 AM
  • A teammate typing "prod" instead of "production"
  • An instance taking 2 minutes to boot

Each refactor addresses a failure mode I hadn't considered.


The Checklist Matters

The production-grade checklist isn't just a list. It's a map of failure modes.

  • Tagging → Who owns this? Who pays for it?
  • prevent_destroy → What happens if I fat-finger this?
  • Alarms → How will I know something is wrong?
  • Validation → What if someone passes wrong values?
  • Timeouts → What if things take longer than expected?

Every checkbox answers a "what if" question.


The Bottom Line

Today I transformed "working" infrastructure into "production-ready" infrastructure.

The difference isn't features. It's resilience, observability, and safety.

If you're deploying code that matters, run it through a production checklist.

P.S. The moment I added prevent_destroy to my ALB, I felt safer. The moment I added validation, I felt smarter. The moment I added alarms, I felt like a real engineer. Small changes, big impact.
