Great Stack to Doesn't Work Bonus: 10 Terraform 'I Wish I Knew This Earlier' Moments

#terraform #devops #infrastructure #discuss

Hard-won lessons from hundreds of terraform apply runs.

1. State locking saves careers.

Two engineers run terraform apply simultaneously. Both read the same state. Both make changes. One overwrites the other. Resources are orphaned. State is corrupted.

Use a remote backend with locking. For AWS:

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

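S3 stores the state. DynamoDB provides the lock. Without both, you're one concurrent apply away from a bad day. (Recent Terraform releases can also lock natively in S3 through the backend's use_lockfile argument, with dynamodb_table deprecated in its favor, so check which one applies to your version.)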
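The backend block above assumes the bucket and the lock table already exist. A minimal sketch of creating them, assuming it lives in a separate bootstrap root module (the resource labels `state` and `locks` are hypothetical; the bucket and table names are taken from the backend block):

```hcl
# Hypothetical bootstrap config, applied once before any stack uses the backend.
resource "aws_s3_bucket" "state" {
  bucket = "my-terraform-state"
}

# Versioning keeps earlier copies of the state file recoverable.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the S3 backend expects a string key with exactly this name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Keeping this in its own root module avoids the chicken-and-egg problem of a stack storing state in a bucket it creates itself.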

2. Workspaces are not environments.

Terraform workspaces share one configuration across several state files. That sounds like environments (dev, staging, prod), but it's a trap. Environments usually need different configuration: different instance sizes, different replica counts, different feature flags. Workspaces give you different state, not different config, so every difference tends to end up as a conditional on terraform.workspace.

Use separate directories or separate Terraform root modules per environment:

environments/
  dev/
    main.tf
    terraform.tfvars
  staging/
    main.tf
    terraform.tfvars
  prod/
    main.tf
    terraform.tfvars

Or use tools like Terragrunt that handle environment separation cleanly.
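To keep the per-environment directories from becoming copies of each other, each main.tf can be a thin wrapper around a shared module. A sketch, assuming a hypothetical `modules/app` module with these three input variables:

```hcl
# environments/prod/main.tf (hypothetical contents)
module "app" {
  source = "../../modules/app" # assumed path to a shared module

  # Only the values that differ per environment live here.
  instance_type = "m5.large"
  replica_count = 3
  enable_beta   = false
}
```

The dev and staging directories would call the same module with smaller values, and each directory keeps its own backend and state.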

3. Module versioning prevents surprises.

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.5.1"  # Pin it
}

Without a version pin, a fresh terraform init (or terraform init -upgrade) pulls the latest version. The latest version might have breaking changes. Now your terraform plan shows 47 resources being destroyed and recreated, and you don't know why.

Pin module versions. Update deliberately, with a plan review.
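Two variations on the pin, shown as alternatives rather than something to combine in one configuration. The second block's label `vpc_from_git` is hypothetical, and the tag name is assumed to match the registry version:

```hcl
# Registry module: "~> 5.5.1" accepts 5.5.1 and later patch releases, but not 5.6.0.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.5.1"
}

# Git-sourced module: the version argument is not supported for Git sources,
# so the pin goes into the ref query parameter instead.
module "vpc_from_git" {
  source = "git::https://github.com/terraform-aws-modules/terraform-aws-vpc.git?ref=v5.5.1"
}
```

An exact pin, as in the original example, is the stricter choice. The pessimistic constraint trades a little safety for automatic patch updates.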

4. Drift detection is your responsibility.

Someone clicks around in the AWS console and creates a security group rule manually. Terraform doesn't know about it. Your state file says there are 3 rules. AWS has 4. This is drift.

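Run terraform plan regularly (daily in CI) even when you're not deploying. With -detailed-exitcode, plan exits with code 2 when it finds changes, which makes a scheduled job easy to alert on. If the plan shows changes you didn't make, someone is making manual changes. Find them. Fix the process. Keep in mind that plan only sees resources that are already in state: something created entirely by hand stays invisible until you import it.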

5. terraform import brings existing resources under management.

You have resources created manually or by another tool. You want Terraform to manage them without recreating them.

terraform import aws_instance.web i-1234567890abcdef0

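This adds the resource to state, and nothing more. The command expects a matching resource "aws_instance" "web" block to already exist in your configuration, and it doesn't fill in the arguments for you. If the config doesn't match the imported resource, the next plan will show changes.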
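Terraform 1.5 and later also accept a declarative import block, which makes the import show up in plan and go through code review like any other change. A sketch using the instance ID from the command above (the `ami` and `instance_type` values are placeholders, not taken from a real instance):

```hcl
import {
  to = aws_instance.web
  id = "i-1234567890abcdef0"
}

resource "aws_instance" "web" {
  # These arguments must describe the real instance,
  # otherwise the plan will propose changes to it.
  ami           = "ami-0123456789abcdef0" # placeholder
  instance_type = "t3.micro"              # placeholder
}
```

If you leave the resource block out, terraform plan -generate-config-out=generated.tf can draft one for you. Treat the generated file as a starting point to review, not finished configuration.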

6. moved blocks handle refactoring without destroying resources.

Renaming a resource or moving it into a module used to mean "destroy and recreate" unless you did manual terraform state mv surgery. Since Terraform 1.1:

moved {
  from = aws_instance.old_name
  to   = aws_instance.new_name
}

Terraform updates the state without touching the actual resource. Essential for codebase cleanups.
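The same mechanism covers moving a resource into a module. A sketch, assuming a hypothetical local module `./modules/web` that declares `aws_instance.this`:

```hcl
module "web" {
  source = "./modules/web" # hypothetical module containing aws_instance.this
}

# The instance previously declared at the root as aws_instance.web
# is now tracked at its new address inside the module.
moved {
  from = aws_instance.web
  to   = module.web.aws_instance.this
}
```

The next plan should report the resource as moved rather than as one destroy plus one create. If it shows a replacement, the module's arguments probably differ from the old root-level block.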

7. lifecycle { ignore_changes } prevents fights with auto-scaling.

Auto-scaling groups change the desired capacity. Terraform wants to reset it to what's in the config. Every apply is a fight.

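resource "aws_autoscaling_group" "web" {
  min_size         = 2  # illustrative bounds
  max_size         = 10
  desired_capacity = 3  # initial value only; scaling policies own it afterwards

  # launch_template, vpc_zone_identifier, etc. omitted for brevity

  lifecycle {
    ignore_changes = [desired_capacity]
  }
}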

Use this for any attribute that's legitimately managed outside Terraform: auto-scaled counts, tags added by external systems, annotations set by operators.
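For tags, it is usually better to ignore only the keys the external system writes rather than the whole tags map. A sketch, assuming a hypothetical patching tool that sets a `LastPatched` tag (the `ami` value is a placeholder):

```hcl
resource "aws_instance" "worker" {
  ami           = "ami-0123456789abcdef0" # placeholder
  instance_type = "t3.micro"

  tags = {
    Name = "worker"
  }

  lifecycle {
    # Only this one key is ignored; edits to any other tag still appear in plan.
    ignore_changes = [tags["LastPatched"]]
  }
}
```

Ignoring the narrowest attribute possible keeps drift detection useful for everything else on the resource.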

8. Data sources query, resources create.

# DATA SOURCE: reads existing VPC (doesn't create or manage it)
data "aws_vpc" "existing" {
  tags = { Name = "production" }
}

# RESOURCE: creates and manages a new subnet
resource "aws_subnet" "new" {
  vpc_id     = data.aws_vpc.existing.id
  cidr_block = "10.0.1.0/24"
}

Data sources are read-only references to things that already exist. If you confuse data and resource, you'll either fail to create something or accidentally try to manage something you shouldn't.

9. Remote backend migration requires a two-step process.

Moving from local state to remote (or between remote backends):

# Step 1: Add the new backend configuration to your .tf files
# Step 2: Run init with migration flag
terraform init -migrate-state

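Terraform copies the existing state to the new backend. Don't reach for -reconfigure instead: it tells Terraform to ignore the old state, so you start from whatever the new backend holds (usually nothing) and the next plan tries to create everything from scratch. In recent versions, a plain terraform init against a changed backend either prompts you or stops and asks you to pick one of the two flags.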

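Always back up your state file before migration. Use a name Terraform won't overwrite, since it manages terraform.tfstate.backup itself:

cp terraform.tfstate terraform.tfstate.pre-migration

If the state already lives in a remote backend, terraform state pull > pre-migration.tfstate gives you the same safety net.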

10. terraform plan doesn't catch everything.

Plan shows what Terraform intends to do. It doesn't validate that the changes will succeed. IAM permissions might block the apply. A resource might have a dependency that plan doesn't check. A provider might reject the configuration at apply time.

Plan is necessary but not sufficient. Always run plan before apply, ideally saving it with -out and applying that file so what you reviewed is what runs. But don't trust plan as proof that apply will succeed. Have a rollback strategy for every apply.
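Some apply-time rejections can be pulled forward to plan time by encoding known provider rules as variable validation. A sketch, assuming a hypothetical `bucket_name` variable and a simplified version of the S3 naming rules:

```hcl
variable "bucket_name" {
  type = string

  validation {
    # 3 to 63 characters; lowercase letters, digits, dots, and hyphens;
    # must start and end with a letter or digit (simplified rule set).
    condition     = can(regex("^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$", var.bucket_name))
    error_message = "bucket_name must be 3-63 characters of lowercase letters, digits, dots, or hyphens."
  }
}
```

This only covers rules you already know about. Missing IAM permissions and service-side limits still surface at apply time, so the rollback plan stays necessary.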

Over to You

What's your biggest Terraform 'I wish I knew this earlier' moment? Any state file corruption stories?

If you enjoyed this, I write about production engineering, AI systems, and the messy reality of building software at scale.

*This is part of the **Great Stack to Doesn't Work** series, a survival guide for when everything goes wrong in production. Follow the series to catch every episode.*