The Gap Between "It Works" and "It's Production-Ready"
Day 16 of the 30-Day Terraform Challenge — and today I learned that my "working" infrastructure was nowhere near production-ready.
I had a webserver cluster. It deployed. It served traffic. I was proud of it.
Then I ran it against a production-grade checklist. The result was humbling.
The Production-Grade Checklist
I audited my infrastructure against 5 categories:
| Category | Score | What I Was Missing |
|---|---|---|
| Structure | 80% | Some hardcoded values |
| Reliability | 60% | No prevent_destroy, no wait timeouts |
| Security | 70% | SSH open to world, no validation |
| Observability | 30% | No consistent tags, no alarms |
| Maintainability | 90% | Missing input validation |
The gap was significant. Here's how I closed it.
Refactor 1: Consistent Tagging
Before: Tags scattered across resources, inconsistent.
tags = {
  Name        = "${var.cluster_name}-instance"
  Environment = var.environment
}
After: Centralized tags with locals and merge().
locals {
  common_tags = {
    Environment = var.environment
    ManagedBy   = "Terraform"
    Project     = var.project_name
    Team        = var.team_name
    Day         = "16"
  }
}
tags = merge(local.common_tags, {
  Name = "${var.cluster_name}-alb"
})
Now every resource has the same base tags. Cost allocation, ownership tracking, and operations all benefit.
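One complement to merge() worth knowing about: the AWS provider's default_tags block applies a tag map to every taggable resource the provider manages, so a resource that forgets the merge() call still gets the base tags. Here is a minimal sketch, assuming the same common_tags local as above. The region and the variable defaults are placeholders, not values from my setup. As far as I know from the provider docs, aws_autoscaling_group is an exception and still needs its own tag blocks.

```hcl
# Sketch only: the region and the variable defaults are placeholders.
variable "environment" {
  type    = string
  default = "dev"
}

variable "project_name" {
  type    = string
  default = "webserver-cluster"
}

variable "team_name" {
  type    = string
  default = "platform"
}

locals {
  common_tags = {
    Environment = var.environment
    ManagedBy   = "Terraform"
    Project     = var.project_name
    Team        = var.team_name
  }
}

provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource created through this provider.
  default_tags {
    tags = local.common_tags
  }
}
```

Resource-level tags override the provider defaults on conflicting keys, so with this in place each resource only needs to declare what is specific to it, such as Name.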
Refactor 2: Lifecycle Protection
Before: No protection against accidental deletion.
After: Added prevent_destroy to critical resources.
resource "aws_lb" "web" {
  # ... config ...
  lifecycle {
    prevent_destroy = true # Can't accidentally delete ALB
  }
}
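Without this, one wrong terraform destroy wipes production. With it, any plan that would destroy the ALB fails with an error before anything is changed. One caveat: the guard lives in the resource block itself, so deleting the block from the configuration removes the protection along with it.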
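prevent_destroy only protects against Terraform plans. For deletions that happen outside Terraform (console, CLI), the aws_lb resource also has an enable_deletion_protection argument that makes AWS itself refuse the delete. Here is a sketch of using both together. It is not my full ALB config, and var.subnet_ids is a hypothetical input I made up for the example.

```hcl
# Sketch only: var.subnet_ids is a hypothetical input.
variable "cluster_name" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

resource "aws_lb" "web" {
  name               = "${var.cluster_name}-alb"
  load_balancer_type = "application"
  subnets            = var.subnet_ids

  # AWS-side guard: the API rejects delete calls while this is true.
  enable_deletion_protection = true

  # Terraform-side guard: any plan that would destroy this resource fails.
  lifecycle {
    prevent_destroy = true
  }
}
```

The trade-off is that tearing down a test environment now takes two deliberate steps: turn off both guards, apply, then destroy.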
Refactor 3: CloudWatch Alarms
Before: No monitoring. If CPU spiked, I wouldn't know.
After: Alarms that notify via SNS.
resource "aws_sns_topic" "alerts" {
  name = "${var.cluster_name}-alerts"
}
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.cluster_name}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average" # a metric alarm needs statistic (or extended_statistic)
  threshold           = 80
  alarm_description   = "CPU exceeds 80% for 4 minutes"
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }
  alarm_actions = [aws_sns_topic.alerts.arn]
}
Now when CPU hits 80% for 4 minutes, I get an alert. I can scale before users notice.
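One thing that is easy to miss: an SNS topic with no subscribers delivers the alarm to nobody. Here is a minimal sketch of an email subscription on the alerts topic. The variable name alert_email is hypothetical, and email is only one of the protocols SNS supports.

```hcl
# Sketch only: var.alert_email is a hypothetical variable name.
variable "alert_email" {
  description = "Address that receives alarm notifications"
  type        = string
}

resource "aws_sns_topic_subscription" "alerts_email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}
```

Email subscriptions stay pending until the recipient clicks the confirmation link AWS sends, so nothing arrives until that happens.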
Refactor 4: Input Validation
Before: Any value was accepted. environment = "prod" would work.
After: Validation blocks catch mistakes early.
variable "environment" {
  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}
variable "instance_type" {
  validation {
    condition     = can(regex("^t[23]\\.", var.instance_type))
    error_message = "Instance type must be a t2 or t3 family type."
  }
}
Try terraform plan -var="environment=prod" (missing "uction"):
Error: Invalid value for variable
Environment must be dev, staging, or production.
Caught at plan time, not after deployment.
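The same pattern works for values that would otherwise fail only when AWS rejects them. ALB names are limited to 32 characters of letters, digits, and hyphens, and cannot start or end with a hyphen. Here is a sketch of a cluster_name check, assuming the ALB is named "${var.cluster_name}-alb" (my assumption for this example), which leaves 28 characters for the prefix.

```hcl
# Sketch only: the 28-character limit assumes a "-alb" suffix on the ALB name.
variable "cluster_name" {
  description = "Prefix used for resource names"
  type        = string

  validation {
    condition = (
      length(var.cluster_name) <= 28 &&
      can(regex("^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$", var.cluster_name))
    )
    error_message = "Cluster name must be 1-28 letters, digits, or hyphens, and must not start or end with a hyphen."
  }
}
```

The point is the same as with environment: the mistake surfaces at plan time with a readable message instead of as an API error halfway through an apply.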
Refactor 5: ASG Wait Timeout
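Before: wait_for_capacity_timeout was never set explicitly, so how long Terraform waited for healthy instances was left to the provider default (10m in the AWS provider) instead of being a visible decision in the code.
After: Made the patience explicit.
resource "aws_autoscaling_group" "web" {
  # ... config ...
  health_check_grace_period = 300   # seconds before health checks count against a new instance
  wait_for_capacity_timeout = "10m" # fail the apply if capacity isn't healthy by then
}
Now Terraform waits up to 10 minutes for the instances to come up healthy, and fails the apply if they don't. Paired with create_before_destroy, the old instances are only destroyed after the new ones are ready, which is critical for zero downtime.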
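For context, here is a fuller sketch of how the wait settings sit next to the arguments they interact with. This is not my exact ASG: the launch template, target group, var.subnet_ids, and the size numbers are hypothetical stand-ins.

```hcl
# Sketch only: aws_launch_template.web, aws_lb_target_group.web,
# var.subnet_ids, and the sizes are hypothetical.
resource "aws_autoscaling_group" "web" {
  name_prefix         = "${var.cluster_name}-"
  min_size            = 2
  max_size            = 4
  vpc_zone_identifier = var.subnet_ids
  target_group_arns   = [aws_lb_target_group.web.arn]

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  # Use the load balancer's health checks, not just EC2 status checks.
  health_check_type         = "ELB"
  health_check_grace_period = 300

  # On creation, wait until this many instances are healthy in the load balancer.
  min_elb_capacity          = 2
  wait_for_capacity_timeout = "10m"

  # Build the replacement group before the old one is destroyed.
  lifecycle {
    create_before_destroy = true
  }
}
```

name_prefix matters here: with create_before_destroy the old and new groups exist at the same time, so they cannot share a fixed name.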
The Before and After
| Aspect | Before | After |
|---|---|---|
| Tags | Inconsistent | Centralized, all resources tagged |
| Deletion Protection | None | prevent_destroy on ALB |
| Monitoring | None | CloudWatch alarms + SNS |
| Validation | None | All variables validated |
| ASG Wait | Implicit provider default | Explicit 10-minute timeout |
| SSH Access | 0.0.0.0/0 | Restricted (configurable) |
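The SSH row is the one change without its own section, so here is roughly what "restricted (configurable)" can look like. This is a sketch under my own naming: var.ssh_allowed_cidrs and var.vpc_id are hypothetical, and the default of no SSH access at all is a choice I made for the example.

```hcl
# Sketch only: var.ssh_allowed_cidrs and var.vpc_id are hypothetical names.
variable "vpc_id" {
  type = string
}

variable "ssh_allowed_cidrs" {
  description = "CIDR blocks allowed to reach port 22"
  type        = list(string)
  default     = [] # no SSH access unless explicitly granted

  validation {
    condition = alltrue([
      for cidr in var.ssh_allowed_cidrs :
      can(cidrhost(cidr, 0)) && cidr != "0.0.0.0/0"
    ])
    error_message = "Each entry must be a valid CIDR block, and 0.0.0.0/0 is not allowed."
  }
}

resource "aws_security_group" "ssh" {
  name_prefix = "ssh-"
  vpc_id      = var.vpc_id

  dynamic "ingress" {
    # Emit the rule only when at least one CIDR is configured.
    for_each = length(var.ssh_allowed_cidrs) > 0 ? [1] : []
    content {
      description = "SSH from approved ranges"
      from_port   = 22
      to_port     = 22
      protocol    = "tcp"
      cidr_blocks = var.ssh_allowed_cidrs
    }
  }
}
```

The validation block does double duty: it catches malformed CIDRs and makes "open to the world" a plan-time error instead of a code review comment.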
What I Learned
Production-grade isn't about features. It's about resilience.
My code "worked." But it wouldn't survive:
- A bad terraform destroy command
- A CPU spike at 3 AM
- A teammate typing "prod" instead of "production"
- An instance taking 2 minutes to boot
Each refactor addresses a failure mode I hadn't considered.
The Checklist Matters
The production-grade checklist isn't just a list. It's a map of failure modes.
- Tagging → Who owns this? Who pays for it?
- prevent_destroy → What happens if I fat-finger this?
- Alarms → How will I know something is wrong?
- Validation → What if someone passes wrong values?
- Timeouts → What if things take longer than expected?
Every checkbox answers a "what if" question.
The Bottom Line
Today I transformed "working" infrastructure into "production-ready" infrastructure.
The difference isn't features. It's resilience, observability, and safety.
If you're deploying code that matters, run it through a production checklist.
P.S. The moment I added prevent_destroy to my ALB, I felt safer. The moment I added validation, I felt smarter. The moment I added alarms, I felt like a real engineer. Small changes, big impact.