
PROJECT: Terraform Challenges in Production

0) Create folders

terraform-prod-challenges/
├── 00-baseline/
├── 01-drift-and-refresh/
├── 02-import-existing/
├── 03-rename-without-destroy-state-mv/
├── 04-address-change-count-to-foreach/
├── 05-prevent-destroy-safety/
├── 06-backend-change-reconfigure/
├── 07-state-locking-simulation/
├── 08-partial-apply-and-recovery/
├── 09-outputs-remote-state-contract/
│   ├── platform/
│   └── app/
└── 10-provider-version-lock-file/


00-baseline (baseline resource)

✅ Code

00-baseline/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "~> 2.5"
    }
  }
}

resource "local_file" "app_config" {
  filename = "${path.module}/app.conf"
  content  = "version=1\nowner=platform\n"
}

00-baseline/outputs.tf

output "config_path" {
  value = local_file.app_config.filename
}

Detailed explanation

What problem is this?

No problem yet. This is your “Hello World”.

Where does it happen?

Everywhere: dev/stage/prod. This is how Terraform starts.

When does it matter?

  • at the beginning of a project
  • in onboarding
  • before release automation
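
A minimal sketch of how a student runs this baseline from the repo root (terraform output reads the value defined in outputs.tf above):

cd 00-baseline
terraform init                  # downloads the local provider, writes .terraform.lock.hcl
terraform plan                  # shows the file that will be created
terraform apply                 # creates app.conf and records it in state
terraform output config_path    # prints the path exported by outputs.tf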

Interview answer

“I start with a simple baseline to show Terraform init/plan/apply and state. It helps new engineers understand how Terraform manages infrastructure.”


01-drift-and-refresh (drift in prod)

✅ Code

01-drift-and-refresh/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "~> 2.5"
    }
  }
}

resource "local_file" "drift_demo" {
  filename = "${path.module}/drift.txt"
  content  = "managed_by=terraform\nvalue=100\n"
}

01-drift-and-refresh/README.txt

Steps:

1) terraform init
2) terraform apply

3) Manually edit drift.txt and change value=100 to value=999

4) terraform plan

Terraform will detect drift and plan to restore desired state.

Detailed explanation

What is the issue?

Drift means a resource was changed outside Terraform.

Where does it happen?

Mostly in production, because people:

  • change AWS Console settings
  • hotfix security group rules
  • patch something quickly during outage

When does it happen?

  • during incident response (outage)
  • after release (someone “fixes” prod)
  • anytime manual change happens

Why is it dangerous?

Next terraform apply can:

  • revert the manual fix (break prod again)
  • create confusion: “Why did it change back?”

How does Terraform solve it?

terraform plan compares:

  • code (desired)
  • the real file (actual)

and shows the difference.
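
Two related commands are useful here; a small sketch of how they differ:

terraform plan -refresh-only    # only shows drift, proposes no changes to the real resource
terraform apply -refresh-only   # accepts the manual change into state instead of reverting it
terraform plan                  # normal plan: proposes changes to bring reality back to the code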

Interview answer

“Drift happens when someone changes infra manually. I detect it using terraform plan, then decide: either accept the change (update code) or revert back to code.”


02-import-existing (resource exists already)

✅ Code

02-import-existing/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "~> 2.5"
    }
  }
}

resource "local_file" "import_me" {
  filename = "${path.module}/existing.txt"
  content  = "this_is_now_managed\n"
}

02-import-existing/README.txt

Steps:

1) Create file first (outside Terraform):
   echo "i already existed" > existing.txt

2) terraform init

3) Import it into state:
   terraform import local_file.import_me existing.txt

4) terraform plan

Key point:
Import updates Terraform STATE. After import, ensure Terraform code matches reality.

Detailed explanation

What is the issue?

In real life, Terraform is introduced after infrastructure already exists.

Where does it happen?

  • old AWS accounts
  • inherited projects
  • manual infra created before IaC

When does it happen?

  • onboarding Terraform into an existing company environment
  • migrating from CloudFormation or manual setups to Terraform
  • before a big release, when the company wants standardization

Why is it dangerous?

If you don’t import:

  • Terraform will try to create duplicate resources
  • naming collisions
  • conflicts, downtime risk

How does Terraform solve it?

terraform import connects an existing object to Terraform state.
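
Since Terraform 1.5 the import can also be declared in code instead of run as a one-off CLI command; a sketch for this lab (the id format depends on the resource type — here it is the file path, matching the README above):

import {
  to = local_file.import_me
  id = "existing.txt"
}

Running terraform plan (optionally with -generate-config-out=generated.tf) and then terraform apply performs the import as part of the normal workflow.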

Interview answer

“Many resources exist before Terraform. I use terraform import to bring them under Terraform management, then I update code to match real configuration.”


03-rename-without-destroy-state-mv (refactor safely)

✅ Code

03-rename-without-destroy-state-mv/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "~> 2.5"
    }
  }
}

resource "local_file" "old_name" {
  filename = "${path.module}/name.txt"
  content  = "hello\n"
}

03-rename-without-destroy-state-mv/README.txt

Steps:

1) terraform init
2) terraform apply

3) Rename resource in code:
   local_file.old_name -> local_file.new_name

4) terraform plan  (it will show destroy/create)

Correct fix:
terraform state mv local_file.old_name local_file.new_name

Then:
terraform plan
terraform apply

Detailed explanation

What is the issue?

A developer refactors code and renames a resource.
Terraform thinks:

  • old resource removed => destroy
  • new resource added => create

Where does it happen?

  • dev branch refactoring
  • module cleanup
  • team-wide naming standards

When does it happen?

  • during release preparation
  • during large refactor PR
  • migration to modules

Why is it dangerous?

It can delete prod resources by mistake.

How does Terraform solve it?

terraform state mv updates state address without changing real resource.
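
Since Terraform 1.1 the same migration can also be written into the configuration as a moved block, which is reviewable in the PR and applied automatically by everyone who runs the code; a sketch for this lab:

moved {
  from = local_file.old_name
  to   = local_file.new_name
}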

Interview answer

“When we refactor Terraform labels, we migrate state addresses using terraform state mv to avoid destroying production resources.”


04-address-change-count-to-foreach (classic breaking change)

✅ Code (Start with COUNT)

04-address-change-count-to-foreach/main.tf (COUNT version first)

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "~> 2.5"
    }
  }
}

variable "names" {
  type    = list(string)
  default = ["orders", "payments"]
}

resource "local_file" "svc" {
  count    = length(var.names)
  filename = "${path.module}/${var.names[count.index]}.txt"
  content  = "service=${var.names[count.index]}\n"
}

04-address-change-count-to-foreach/README.txt

Steps:

1) terraform init
2) terraform apply

3) Edit main.tf to FOREACH version:

resource "local_file" "svc" {
  for_each = toset(var.names)
  filename = "${path.module}/${each.value}.txt"
  content  = "service=${each.value}\n"
}

4) terraform plan (will show recreate)

Fix with state mv:
terraform state mv 'local_file.svc[0]' 'local_file.svc["orders"]'
terraform state mv 'local_file.svc[1]' 'local_file.svc["payments"]'

Then:
terraform plan
terraform apply

Detailed explanation (students)

What is the issue?

Terraform identifies resources by address.
With count: svc[0], svc[1]
With for_each: svc["orders"], svc["payments"]

So Terraform thinks they are new resources.

Where does it happen?

  • when code matures
  • when teams want stable keys
  • in modules during refactor

When does it happen?

  • before release (refactor PR)
  • during standardization
  • while adding new services

Why is it dangerous?

Terraform may destroy and recreate resources:

  • downtime (ALB, ECS, RDS settings)
  • lost data (if storage resource recreated)
  • production outage risk

How does Terraform solve it?

State migration: map old addresses to new addresses.
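
The same mapping can be expressed declaratively with moved blocks (Terraform 1.1+); a sketch matching this lab's two services:

moved {
  from = local_file.svc[0]
  to   = local_file.svc["orders"]
}

moved {
  from = local_file.svc[1]
  to   = local_file.svc["payments"]
}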

Interview answer

“Count-to-for_each migration changes resource addresses. To avoid recreation, I migrate state with terraform state mv so production resources remain untouched.”


05-prevent-destroy-safety (guardrail)

✅ Code

05-prevent-destroy-safety/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "~> 2.5"
    }
  }
}

resource "local_file" "critical" {
  filename = "${path.module}/critical.txt"
  content  = "do_not_delete\n"

  lifecycle {
    prevent_destroy = true
  }
}

05-prevent-destroy-safety/README.txt

Steps:
1) terraform init
2) terraform apply
3) terraform destroy

It will fail because prevent_destroy blocks deletion.

Detailed explanation

What is the issue?

Accidental deletion caused by a Terraform code change or a wrong apply.

Where does it happen?

Production, especially for:

  • S3 state bucket
  • DynamoDB lock table
  • RDS
  • IAM critical roles

When does it happen?

  • release time
  • cleanup tasks
  • refactor mistakes
  • wrong workspace selected

How does Terraform solve it?

prevent_destroy blocks deletion even if plan says destroy.
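
The same guardrail applied to a realistic critical resource; a sketch with a hypothetical bucket name:

resource "aws_s3_bucket" "tf_state" {
  bucket = "example-prod-terraform-state"   # hypothetical name

  lifecycle {
    prevent_destroy = true   # any plan that would destroy this resource fails with an error
  }
}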

Interview answer

“For critical resources, we set lifecycle prevent_destroy to avoid accidental deletion during CI/CD or refactoring.”


06-backend-change-reconfigure (safe + always runnable)

✅ Code (NO provider needed)

06-backend-change-reconfigure/main.tf

terraform {
  required_version = ">= 1.5.0"
}

resource "terraform_data" "backend_note" {
  input = "backend demo"
}

06-backend-change-reconfigure/README.txt

Production behavior:

If backend config changes:
terraform init -reconfigure

If moving state to a new backend:
terraform init -migrate-state

Detailed explanation (students)

What is the issue?

The Terraform state location changed:

  • local -> S3
  • S3 bucket name changed
  • key path changed
  • region changed

Where does it happen?

  • production pipelines
  • team collaboration setups

When does it happen?

  • moving from local dev to shared backend
  • company security policy update
  • migration to new AWS account

Why is it dangerous?

Wrong backend = wrong state:

  • Terraform may create duplicates
  • or destroy wrong resources

How does Terraform solve it?

  • terraform init -reconfigure tells Terraform to accept the new backend settings
  • terraform init -migrate-state moves existing state to the new backend safely
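
For context, a typical shared backend block looks like this (all names are hypothetical; changing any of these values is what triggers the need for -reconfigure or -migrate-state):

terraform {
  backend "s3" {
    bucket         = "example-prod-tf-state"
    key            = "platform/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-tf-locks"
    encrypt        = true
  }
}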

Interview answer

“Backend changes require terraform init -reconfigure. If we move state, we use -migrate-state to prevent losing track of production resources.”


07-state-locking-simulation

✅ Code

07-state-locking-simulation/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    time = {
      source  = "hashicorp/time"
      version = "~> 0.11"
    }
  }
}

resource "time_sleep" "simulate_long_apply" {
  create_duration = "60s"
}

07-state-locking-simulation/README.txt

Terminal 1:
terraform init
terraform apply

While it's running, Terminal 2:
terraform apply

The local backend still uses a simple file lock, so the second apply typically fails with a state lock error.
In production, an S3 backend with DynamoDB locking prevents concurrent applies across machines and CI.

Detailed explanation

What is the issue?

Two engineers (or CI plus a human) run terraform apply at the same time.

Where does it happen?

  • production pipelines
  • shared infrastructure repos
  • busy teams

When does it happen?

  • during release
  • hotfix while pipeline is running
  • multiple PR merges at once

Why is it dangerous?

Concurrent apply can corrupt state or cause inconsistent infra.

Real production solution

S3 backend + DynamoDB locking:

  • DynamoDB ensures only one apply at a time
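
If a run crashes and leaves a lock behind, Terraform has an escape hatch; use it only when you are certain no apply is still running (the lock ID appears in the error message):

terraform force-unlock <LOCK_ID>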

Interview answer

“In production we use DynamoDB state locking to prevent concurrent terraform apply and avoid state corruption.”


08-partial-apply-and-recovery (apply fails mid-way)

✅ Code

08-partial-apply-and-recovery/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    null = {
      source  = "hashicorp/null"
      version = "~> 3.2"
    }
  }
}

resource "null_resource" "step1" {}

resource "null_resource" "step2_fail" {
  provisioner "local-exec" {
    command = "echo 'simulating failure' && exit 1"
  }
}

08-partial-apply-and-recovery/README.txt

Steps:
1) terraform init
2) terraform apply (will fail)

Inspect state:
terraform state list

Fix demo: change exit 1 to exit 0 then apply again.

Detailed explanation (students)

What is the issue?

Terraform started creating resources, then it failed.
So some resources exist, others don’t.

Where does it happen?

Mostly in production, because APIs fail:

  • IAM permissions missing
  • AWS rate limits
  • timeouts
  • dependency errors
  • networking issues

When does it happen?

  • during release
  • during delivery pipeline apply
  • during disaster recovery changes

Why is it dangerous?

The infra is now in a “half created” state.

How does Terraform solve it?

Terraform state records what succeeded.
After fixing the problem, run apply again.
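
A short recovery sketch for this lab (resource names match the code above):

terraform state list                        # shows what Terraform recorded from the failed run
terraform state show null_resource.step1    # inspect a single entry
terraform apply                             # after changing exit 1 to exit 0, the run converges

The resource whose provisioner failed may appear as tainted and be replaced on the next apply.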

Interview answer

“Partial applies happen. We check terraform state list, fix root cause, then rerun apply to converge infrastructure.”


09-outputs-remote-state-contract (platform vs app teams)

✅ Code

Platform

09-outputs-remote-state-contract/platform/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "~> 2.5"
    }
  }
}

resource "local_file" "platform" {
  filename = "${path.module}/platform.txt"
  content  = "subnet_id=subnet-123\n"
}

09-outputs-remote-state-contract/platform/outputs.tf

output "subnet_id" {
  value = "subnet-123"
}

App (NO null provider; uses terraform_data)

09-outputs-remote-state-contract/app/main.tf

terraform {
  required_version = ">= 1.5.0"
}

data "terraform_remote_state" "platform" {
  backend = "local"
  config = {
    path = "../platform/terraform.tfstate"
  }
}

resource "terraform_data" "use_contract" {
  input = {
    subnet = data.terraform_remote_state.platform.outputs.subnet_id
  }
}

09-outputs-remote-state-contract/app/README.txt

Steps:

1) Run platform:
cd platform
terraform init
terraform apply

2) Run app:
cd ../app
terraform init
terraform apply

Break contract:
Rename output subnet_id -> public_subnet_id in platform outputs.tf
Apply platform again

App will fail until it is updated.

Lesson:
Outputs are API contracts between teams.

Detailed explanation (students)

What is the issue?

The app team depends on platform output names.
If the platform renames an output, the app breaks.

Where does it happen?

Real org structure:

  • networking team outputs subnet IDs
  • security team outputs IAM role ARN
  • app team reads them via remote state

When does it happen?

  • release time when platform changes
  • platform upgrade
  • refactoring outputs

Why is it dangerous?

The app deployment pipeline fails and the release is delayed.

How does Terraform solve it?

Not automatic. This is process + design:

  • treat outputs as versioned API
  • communicate changes
  • create backward compatible outputs
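
One concrete pattern is to ship the new output name alongside the old one for a deprecation window; a sketch for platform/outputs.tf:

output "public_subnet_id" {
  value = "subnet-123"
}

# Keep the old name temporarily so existing app stacks do not break
output "subnet_id" {
  value = "subnet-123"
}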

Interview answer

“Remote state outputs act like API contracts. If platform changes output names, application stacks break, so we version and communicate output changes.”


10-provider-version-lock-file (consistency across engineers and CI)

✅ Code

10-provider-version-lock-file/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    random = {
      source  = "hashicorp/random"
      version = "~> 3.6"
    }
  }
}

resource "random_id" "demo" {
  byte_length = 4
}

output "id" {
  value = random_id.demo.hex
}

10-provider-version-lock-file/README.txt

Steps:
terraform init
ls -la .terraform.lock.hcl

The lock file pins provider versions.
Commit it to Git so all engineers and CI use same provider builds.

Detailed explanation

What is the issue?

Different machines install different provider versions.
CI runs with one version, laptop runs with another.

Where does it happen?

  • CI/CD pipelines
  • multi-engineer teams
  • long-lived repos

When does it happen?

  • release time (CI suddenly breaks)
  • a provider update was released yesterday
  • a new engineer runs init

How does Terraform solve it?

.terraform.lock.hcl records exact provider version + hashes.
Commit it to Git.
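
If laptops and CI run on different operating systems or CPU architectures, you can record hashes for all of them so init verifies the same builds everywhere; a sketch (the platform list is an example):

terraform providers lock \
  -platform=linux_amd64 \
  -platform=darwin_arm64 \
  -platform=darwin_amd64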

Interview answer

“We commit .terraform.lock.hcl so CI and all engineers use the same provider versions, avoiding surprises during release.”


How students should run everything

Each lab is independent:

cd 01-drift-and-refresh
terraform init
terraform apply

For lab 09, order matters:

cd 09-outputs-remote-state-contract/platform
terraform init
terraform apply

cd ../app
terraform init
terraform apply

One interview story that covers everything (simple words)

“In production, Terraform problems usually come from drift, state, or team changes. Drift happens when someone changes infra manually. Imports happen because infra existed before Terraform. Refactors like renames and count-to-for_each migrations require state moves to prevent destroys. We protect critical resources with prevent_destroy. We use an S3 backend with DynamoDB locks to avoid concurrent applies. Apply can fail mid-way, so we fix the issue and rerun apply to converge. Outputs are contracts between teams, so we version them. And we commit lock files so CI and developers use the same provider versions.”
