# When Terraform Gets Slow
Our Terraform state file grew past 500 resources. Plans took 8 minutes, applies risked timing out, and state-lock conflicts were a daily occurrence. Something had to change.

Here's how we tamed it.
## Problem 1: Monolithic State
Everything lived in one state file: VPCs, databases, Kubernetes clusters, DNS, and IAM, all in one giant blob.
```
Before: 1 state file, 500+ resources

terraform plan:  8 minutes
terraform apply: timeout risk
blast radius:    everything
```
### Solution: State Decomposition
```
infrastructure/
├── network/      # VPCs, subnets, security groups
├── data/         # RDS, ElastiCache, S3
├── compute/      # EKS, ASGs, launch templates
├── dns/          # Route53 zones and records
├── iam/          # Roles, policies, users
└── monitoring/   # CloudWatch, SNS topics
```
Each directory = separate state file. Use data sources to reference across boundaries:
```hcl
# compute/main.tf
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "terraform-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_eks_cluster" "main" {
  # name, role_arn, and other required arguments omitted for brevity
  vpc_config {
    subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
  }
}
```
Result: 6 state files, 60-100 resources each. Plan time: 45 seconds.
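The cross-stack reference above only works if the network stack actually exports the value. A minimal sketch of the matching output (the resource name `aws_subnet.private` is an assumption about how the subnets are declared):

```hcl
# network/outputs.tf (illustrative)
output "private_subnet_ids" {
  # Assumes the private subnets are declared as aws_subnet.private with count or for_each
  value = aws_subnet.private[*].id
}
```

Anything a downstream stack needs must be promoted to an output like this; everything else stays private to the stack.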
## Problem 2: Environment Drift
Dev, staging, and prod drifted constantly because each environment was a copy-pasted variant of the others.
### Solution: Modules + Terragrunt
```
modules/
├── eks-cluster/
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
└── rds-instance/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf
```

```
environments/
├── dev/
│   └── terragrunt.hcl
├── staging/
│   └── terragrunt.hcl
└── prod/
    └── terragrunt.hcl
```
```hcl
# environments/prod/terragrunt.hcl
terraform {
  source = "../../modules/eks-cluster"
}

inputs = {
  cluster_name  = "prod-main"
  node_count    = 10
  instance_type = "m5.2xlarge"
  multi_az      = true
}
```
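For those inputs to land, the module has to declare matching variables. A hedged sketch of what `modules/eks-cluster/variables.tf` might look like (the names match the inputs above; the defaults are assumptions):

```hcl
# modules/eks-cluster/variables.tf (illustrative)
variable "cluster_name" {
  type = string
}

variable "node_count" {
  type    = number
  default = 3 # assumed default; prod overrides via terragrunt inputs
}

variable "instance_type" {
  type    = string
  default = "m5.large" # assumed default
}

variable "multi_az" {
  type    = bool
  default = false
}
```

Each environment's `terragrunt.hcl` is then just the delta from these defaults, which is what kills the copy-paste drift.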
## Problem 3: Dangerous Applies
Anyone could run `terraform apply` against production from their laptop.
### Solution: CI/CD Only
```yaml
# .github/workflows/terraform.yml
name: Terraform

on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan -out=plan.tfplan
      - run: terraform show -json plan.tfplan > plan.json
      # Persist the plan so the apply job runs exactly what was reviewed
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: plan.tfplan
      # Post a plan summary as a PR comment
      - uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        with:
          script: |
            const fs = require('fs');
            const plan = JSON.parse(fs.readFileSync('plan.json', 'utf8'));
            const count = (action) =>
              (plan.resource_changes || []).filter(c => c.change.actions.includes(action)).length;
            const adds = count('create');
            const changes = count('update');
            const deletes = count('delete');
            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: `## Terraform Plan\n+${adds} ~${changes} -${deletes}\n\n${deletes > 0 ? '⚠ RESOURCES WILL BE DESTROYED' : ''}`
            });

  apply:
    needs: plan
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production  # Requires manual approval
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - uses: actions/download-artifact@v4
        with:
          name: tfplan
      - run: terraform apply plan.tfplan
```
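The change-counting the PR comment does can also be run locally against `terraform show -json` output before pushing. A minimal sketch in Python (the summarization logic mirrors the workflow's script; the helper name and the synthetic plan are ours, not part of Terraform):

```python
import json


def summarize_plan(plan: dict) -> dict:
    """Count create/update/delete actions in a `terraform show -json` plan."""
    counts = {"create": 0, "update": 0, "delete": 0}
    for rc in plan.get("resource_changes", []):
        for action in rc["change"]["actions"]:
            if action in counts:
                counts[action] += 1
    return counts


# Tiny synthetic plan for illustration; a real one comes from:
#   terraform plan -out=plan.tfplan && terraform show -json plan.tfplan > plan.json
example_plan = {
    "resource_changes": [
        {"change": {"actions": ["create"]}},
        {"change": {"actions": ["delete", "create"]}},  # a replace is delete + create
        {"change": {"actions": ["update"]}},
    ]
}

print(summarize_plan(example_plan))  # {'create': 2, 'update': 1, 'delete': 1}
```

Note that a replacement shows up in both the create and delete counts, which is exactly why the destroy warning in the PR comment matters.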
## Problem 4: State Locks
Multiple engineers running `plan` simultaneously caused state-lock conflicts.
### Solution: Remote State with DynamoDB Locking
```hcl
terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```
Plus: only CI/CD runs `apply`. Humans run `plan` locally with `-lock=false` for quick read-only checks, so they never contend for the lock.
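The lock table itself is a one-time bootstrap. A sketch of how it can be provisioned — the S3 backend requires the table's hash key to be a string attribute named `LockID`; the billing mode here is an assumption:

```hcl
# Bootstrap resource, typically created once outside the main stacks
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST" # assumption; on-demand avoids capacity planning
  hash_key     = "LockID"          # attribute name the S3 backend expects

  attribute {
    name = "LockID"
    type = "S"
  }
}
```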
## Results
| Metric | Before | After |
|---|---|---|
| Plan time | 8 min | 45 sec |
| Apply failures | 3/week | 0.5/week |
| State conflicts | Daily | Never |
| Env drift incidents | Monthly | None in 6 months |
| Time to provision new env | 2 days | 30 minutes |
If you want AI-powered infrastructure management that catches drift before it causes outages, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com