The real reason your staging environment is always broken
"Don't trust staging" is engineering folklore. Teams treat it as inevitable. It's not — it's a symptom.
What actually causes staging drift
1. Manual setup at different times: Prod was provisioned in Q1, staging in Q3, by different engineers with different habits. Identical at the start. Diverging ever since.
2. Hotfix culture: Prod incident at 2 AM. IAM permission patched in the console. Terraform state doesn't know. Staging doesn't have the patch.
3. Cost pressure on non-prod: "We don't need the full setup in staging." Different ALB, smaller RDS, different security groups. "Close enough" becomes "completely different."
4. Nobody owns staging: Prod has on-call. Staging has... whoever notices it's broken.
The fix: same module, different vars
```hcl
# prod
module "payment_service" {
  source         = "../../modules/service"
  environment    = "prod"
  instance_count = 3
  db_class       = "db.r6g.large"
}

# staging — SAME MODULE
module "payment_service_staging" {
  source         = "../../modules/service"
  environment    = "staging"
  instance_count = 1
  db_class       = "db.t3.medium"
}
```
Same module → same IAM → same security groups → same monitoring → same deploy process.
The difference between prod and staging is scale, not structure.
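To make that concrete, here is a minimal sketch of what the shared module's interface could look like. The variable names match the calls above; the resource body is hypothetical and heavily trimmed:

```hcl
# modules/service/variables.tf — the only knobs environments are allowed to turn
variable "environment" {
  type = string
}

variable "instance_count" {
  type    = number
  default = 1
}

variable "db_class" {
  type    = string
  default = "db.t3.medium"
}

# modules/service/main.tf — structure defined once, shared by every environment
resource "aws_db_instance" "this" {
  identifier     = "payment-${var.environment}"
  engine         = "postgres"
  instance_class = var.db_class
  # security groups, IAM, monitoring, etc. live here too,
  # so staging can never silently get a different shape
}
```

Because scale lives in variables and structure lives in the module, "make staging cheaper" is a one-line change instead of a parallel hand-built stack.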
The hotfix rule
Every manual change to prod needs a Terraform change in the same sprint. Non-negotiable.
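One way to honor that rule (assuming Terraform 1.5 or later, which supports `import` blocks) is to adopt the console-made resource back into state in the follow-up PR. The resource address, role, and policy names below are illustrative:

```hcl
# Hypothetical follow-up to a 2 AM console hotfix: bring the hand-created
# IAM policy under Terraform management instead of leaving it as drift.
import {
  to = aws_iam_role_policy.payment_hotfix
  id = "payment-service-role:payment-hotfix-policy"
}

resource "aws_iam_role_policy" "payment_hotfix" {
  name = "payment-hotfix-policy"
  role = "payment-service-role"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["sqs:SendMessage"]
      Resource = "*"
    }]
  })
}
```

Once the import is applied, the same module change rolls out to staging on the next plan, and the 2 AM fix stops being prod-only folklore.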
Named ownership
Staging needs a named owner. Not "the team." A specific person.
Step2Dev provisions staging and prod from the same template — parity by default.
What's your most memorable "staging was broken and we didn't know" story?