# When Terraform Gets Slow
Our Terraform state file grew past 500 resources. Plans took 8 minutes, applies risked timing out, and state-lock conflicts were a daily occurrence. Something had to change.

Here's how we tamed it.
## Problem 1: Monolithic State
Everything lived in one state file: VPCs, databases, Kubernetes clusters, DNS, and IAM, all in one giant blob.
```
Before: 1 state file, 500+ resources

terraform plan:  8 minutes
terraform apply: timeout risk
blast radius:    everything
```
### Solution: State Decomposition
```
infrastructure/
├── network/      # VPCs, subnets, security groups
├── data/         # RDS, ElastiCache, S3
├── compute/      # EKS, ASGs, launch templates
├── dns/          # Route53 zones and records
├── iam/          # Roles, policies, users
└── monitoring/   # CloudWatch, SNS topics
```
Each directory = separate state file. Use data sources to reference across boundaries:
```hcl
# compute/main.tf
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "terraform-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_eks_cluster" "main" {
  # name, role_arn, and other required arguments omitted for brevity
  vpc_config {
    subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
  }
}
```
Result: 6 state files, 60-100 resources each. Plan time: 45 seconds.
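The cross-stack reference above only works if the network stack actually exports the value. A minimal sketch of the matching output (the resource name `aws_subnet.private` is an assumption about how the subnets are declared):

```hcl
# network/outputs.tf (illustrative)
output "private_subnet_ids" {
  # Assumes the private subnets are declared as aws_subnet.private with count or for_each
  value = aws_subnet.private[*].id
}
```

Anything a downstream stack needs must be promoted to an output like this; everything else stays private to the stack.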
## Problem 2: Environment Drift
Dev, staging, and prod drifted constantly because each environment was a copy-pasted variant of the others.
### Solution: Modules + Terragrunt
```
modules/
├── eks-cluster/
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
└── rds-instance/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf
```

```
environments/
├── dev/
│   └── terragrunt.hcl
├── staging/
│   └── terragrunt.hcl
└── prod/
    └── terragrunt.hcl
```
```hcl
# environments/prod/terragrunt.hcl
terraform {
  source = "../../modules/eks-cluster"
}

inputs = {
  cluster_name  = "prod-main"
  node_count    = 10
  instance_type = "m5.2xlarge"
  multi_az      = true
}
```
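For those inputs to land, the module has to declare matching variables. A hedged sketch of what `modules/eks-cluster/variables.tf` might look like (the names match the inputs above; the defaults are assumptions):

```hcl
# modules/eks-cluster/variables.tf (illustrative)
variable "cluster_name" {
  type = string
}

variable "node_count" {
  type    = number
  default = 3 # assumed default; prod overrides via terragrunt inputs
}

variable "instance_type" {
  type    = string
  default = "m5.large" # assumed default
}

variable "multi_az" {
  type    = bool
  default = false
}
```

Each environment's `terragrunt.hcl` is then just the delta from these defaults, which is what kills the copy-paste drift.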
## Problem 3: Dangerous Applies
Anyone could run `terraform apply` against production from their laptop.
### Solution: CI/CD Only
```yaml
# .github/workflows/terraform.yml
name: Terraform

on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan -out=plan.tfplan
      - run: terraform show -json plan.tfplan > plan.json
      # Persist the plan so the apply job runs exactly what was reviewed
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: plan.tfplan
      # Post a plan summary as a PR comment
      - uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        with:
          script: |
            const fs = require('fs');
            const plan = JSON.parse(fs.readFileSync('plan.json', 'utf8'));
            const count = (action) =>
              (plan.resource_changes || []).filter(c => c.change.actions.includes(action)).length;
            const adds = count('create');
            const changes = count('update');
            const deletes = count('delete');
            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: `## Terraform Plan\n+${adds} ~${changes} -${deletes}\n\n${deletes > 0 ? '⚠ RESOURCES WILL BE DESTROYED' : ''}`
            });

  apply:
    needs: plan
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production  # Requires manual approval
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - uses: actions/download-artifact@v4
        with:
          name: tfplan
      - run: terraform apply plan.tfplan
```
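The change-counting the PR comment does can also be run locally against `terraform show -json` output before pushing. A minimal sketch in Python (the summarization logic mirrors the workflow's script; the helper name and the synthetic plan are ours, not part of Terraform):

```python
import json


def summarize_plan(plan: dict) -> dict:
    """Count create/update/delete actions in a `terraform show -json` plan."""
    counts = {"create": 0, "update": 0, "delete": 0}
    for rc in plan.get("resource_changes", []):
        for action in rc["change"]["actions"]:
            if action in counts:
                counts[action] += 1
    return counts


# Tiny synthetic plan for illustration; a real one comes from:
#   terraform plan -out=plan.tfplan && terraform show -json plan.tfplan > plan.json
example_plan = {
    "resource_changes": [
        {"change": {"actions": ["create"]}},
        {"change": {"actions": ["delete", "create"]}},  # a replace is delete + create
        {"change": {"actions": ["update"]}},
    ]
}

print(summarize_plan(example_plan))  # {'create': 2, 'update': 1, 'delete': 1}
```

Note that a replacement shows up in both the create and delete counts, which is exactly why the destroy warning in the PR comment matters.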
## Problem 4: State Locks
Multiple engineers running `plan` simultaneously caused state-lock conflicts.
### Solution: Remote State with DynamoDB Locking
```hcl
terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```
Plus: only CI/CD runs `apply`. Humans run `plan` locally with `-lock=false` for quick read-only checks, so they never contend for the lock.
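The lock table itself is a one-time bootstrap. A sketch of how it can be provisioned — the S3 backend requires the table's hash key to be a string attribute named `LockID`; the billing mode here is an assumption:

```hcl
# Bootstrap resource, typically created once outside the main stacks
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST" # assumption; on-demand avoids capacity planning
  hash_key     = "LockID"          # attribute name the S3 backend expects

  attribute {
    name = "LockID"
    type = "S"
  }
}
```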
## Results
| Metric | Before | After |
|---|---|---|
| Plan time | 8 min | 45 sec |
| Apply failures | 3/week | 0.5/week |
| State conflicts | Daily | Never |
| Env drift incidents | Monthly | None in 6 months |
| Time to provision new env | 2 days | 30 minutes |
If you want AI-powered infrastructure management that catches drift before it causes outages, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com