
PROJECT: Terraform Challenges in Production

0) Create folders

terraform-prod-challenges/
├── 00-baseline/
├── 01-drift-and-refresh/
├── 02-import-existing/
├── 03-rename-without-destroy-state-mv/
├── 04-address-change-count-to-foreach/
├── 05-prevent-destroy-safety/
├── 06-backend-change-reconfigure/
├── 07-state-locking-simulation/
├── 08-partial-apply-and-recovery/
├── 09-outputs-remote-state-contract/
│   ├── platform/
│   └── app/
└── 10-provider-version-lock-file/


00-baseline (baseline resource)

✅ Code

00-baseline/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "~> 2.5"
    }
  }
}

resource "local_file" "app_config" {
  filename = "${path.module}/app.conf"
  content  = "version=1\nowner=platform\n"
}

00-baseline/outputs.tf

output "config_path" {
  value = local_file.app_config.filename
}

Detailed explanation

What problem is this?

No problem yet. This is your “Hello World”.

Where does it happen?

Everywhere: dev/stage/prod. This is how Terraform starts.

When does it matter?

  • at the beginning of a project
  • in onboarding
  • before release automation
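
A minimal sketch of how a student runs this baseline from the repo root (terraform output reads the value defined in outputs.tf above):

cd 00-baseline
terraform init                  # downloads the local provider, writes .terraform.lock.hcl
terraform plan                  # shows the file that will be created
terraform apply                 # creates app.conf and records it in state
terraform output config_path    # prints the path exported by outputs.tf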

Interview answer

“I start with a simple baseline to show Terraform init/plan/apply and state. It helps new engineers understand how Terraform manages infrastructure.”


01-drift-and-refresh (drift in prod)

✅ Code

01-drift-and-refresh/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "~> 2.5"
    }
  }
}

resource "local_file" "drift_demo" {
  filename = "${path.module}/drift.txt"
  content  = "managed_by=terraform\nvalue=100\n"
}

01-drift-and-refresh/README.txt

Steps:

1) terraform init
2) terraform apply

3) Manually edit drift.txt and change value=100 to value=999

4) terraform plan

Terraform will detect drift and plan to restore desired state.

Detailed explanation

What is the issue?

Drift means a resource was changed outside Terraform.

Where does it happen?

Mostly in production, because people:

  • change AWS Console settings
  • hotfix security group rules
  • patch something quickly during outage

When does it happen?

  • during incident response (outage)
  • after release (someone “fixes” prod)
  • anytime manual change happens

Why is it dangerous?

Next terraform apply can:

  • revert the manual fix (break prod again)
  • create confusion: “Why did it change back?”

How does Terraform solve it?

terraform plan compares:

  • code (desired)
  • the real file (actual)

and shows the difference.
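
Two related commands are useful here; a small sketch of how they differ:

terraform plan -refresh-only    # only shows drift, proposes no changes to the real resource
terraform apply -refresh-only   # accepts the manual change into state instead of reverting it
terraform plan                  # normal plan: proposes changes to bring reality back to the code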

Interview answer

“Drift happens when someone changes infra manually. I detect it using terraform plan, then decide: either accept the change (update code) or revert back to code.”


02-import-existing (resource exists already)

✅ Code

02-import-existing/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "~> 2.5"
    }
  }
}

resource "local_file" "import_me" {
  filename = "${path.module}/existing.txt"
  content  = "this_is_now_managed\n"
}

02-import-existing/README.txt

Steps:

1) Create file first (outside Terraform):
   echo "i already existed" > existing.txt

2) terraform init

3) Import it into state:
   terraform import local_file.import_me existing.txt

4) terraform plan

Key point:
Import updates Terraform STATE. After import, ensure Terraform code matches reality.

Detailed explanation

What is the issue?

In real life, Terraform is introduced after infrastructure already exists.

Where does it happen?

  • old AWS accounts
  • inherited projects
  • manual infra created before IaC

When does it happen?

  • onboarding Terraform into an existing company environment
  • migrating from CloudFormation or manual setups to Terraform
  • before a big release, when the company wants standardization

Why is it dangerous?

If you don’t import:

  • Terraform will try to create duplicate resources
  • naming collisions
  • conflicts, downtime risk

How does Terraform solve it?

terraform import connects an existing object to Terraform state.
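
Since Terraform 1.5 the import can also be declared in code instead of run as a one-off CLI command; a sketch for this lab (the id format depends on the resource type — here it is the file path, matching the README above):

import {
  to = local_file.import_me
  id = "existing.txt"
}

Running terraform plan (optionally with -generate-config-out=generated.tf) and then terraform apply performs the import as part of the normal workflow.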

Interview answer

“Many resources exist before Terraform. I use terraform import to bring them under Terraform management, then I update code to match real configuration.”


03-rename-without-destroy-state-mv (refactor safely)

✅ Code

03-rename-without-destroy-state-mv/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "~> 2.5"
    }
  }
}

resource "local_file" "old_name" {
  filename = "${path.module}/name.txt"
  content  = "hello\n"
}

03-rename-without-destroy-state-mv/README.txt

Steps:

1) terraform init
2) terraform apply

3) Rename resource in code:
   local_file.old_name -> local_file.new_name

4) terraform plan  (it will show destroy/create)

Correct fix:
terraform state mv local_file.old_name local_file.new_name

Then:
terraform plan
terraform apply

Detailed explanation

What is the issue?

A developer refactors code and renames a resource.
Terraform thinks:

  • old resource removed => destroy
  • new resource added => create

Where does it happen?

  • dev branch refactoring
  • module cleanup
  • team-wide naming standards

When does it happen?

  • during release preparation
  • during large refactor PR
  • migration to modules

Why is it dangerous?

It can delete prod resources by mistake.

How does Terraform solve it?

terraform state mv updates state address without changing real resource.
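
Since Terraform 1.1 the same migration can also be written into the configuration as a moved block, which is reviewable in the PR and applied automatically by everyone who runs the code; a sketch for this lab:

moved {
  from = local_file.old_name
  to   = local_file.new_name
}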

Interview answer

“When we refactor Terraform labels, we migrate state addresses using terraform state mv to avoid destroying production resources.”


04-address-change-count-to-foreach (classic breaking change)

✅ Code (Start with COUNT)

04-address-change-count-to-foreach/main.tf (COUNT version first)

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "~> 2.5"
    }
  }
}

variable "names" {
  type    = list(string)
  default = ["orders", "payments"]
}

resource "local_file" "svc" {
  count    = length(var.names)
  filename = "${path.module}/${var.names[count.index]}.txt"
  content  = "service=${var.names[count.index]}\n"
}

04-address-change-count-to-foreach/README.txt

Steps:

1) terraform init
2) terraform apply

3) Edit main.tf to FOREACH version:

resource "local_file" "svc" {
  for_each = toset(var.names)
  filename = "${path.module}/${each.value}.txt"
  content  = "service=${each.value}\n"
}

4) terraform plan (will show recreate)

Fix with state mv:
terraform state mv 'local_file.svc[0]' 'local_file.svc["orders"]'
terraform state mv 'local_file.svc[1]' 'local_file.svc["payments"]'

Then:
terraform plan
terraform apply

Detailed explanation (students)

What is the issue?

Terraform identifies resources by address.
With count: svc[0], svc[1]
With for_each: svc["orders"], svc["payments"]

So Terraform thinks they are new resources.

Where does it happen?

  • when code matures
  • when teams want stable keys
  • in modules during refactor

When does it happen?

  • before release (refactor PR)
  • during standardization
  • while adding new services

Why is it dangerous?

Terraform may destroy and recreate resources:

  • downtime (ALB, ECS, RDS settings)
  • lost data (if storage resource recreated)
  • production outage risk

How does Terraform solve it?

State migration: map old addresses to new addresses.
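
The same mapping can be expressed declaratively with moved blocks (Terraform 1.1+); a sketch matching this lab's two services:

moved {
  from = local_file.svc[0]
  to   = local_file.svc["orders"]
}

moved {
  from = local_file.svc[1]
  to   = local_file.svc["payments"]
}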

Interview answer

“Count-to-for_each migration changes resource addresses. To avoid recreation, I migrate state with terraform state mv so production resources remain untouched.”


05-prevent-destroy-safety (guardrail)

✅ Code

05-prevent-destroy-safety/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "~> 2.5"
    }
  }
}

resource "local_file" "critical" {
  filename = "${path.module}/critical.txt"
  content  = "do_not_delete\n"

  lifecycle {
    prevent_destroy = true
  }
}

05-prevent-destroy-safety/README.txt

Steps:
1) terraform init
2) terraform apply
3) terraform destroy

It will fail because prevent_destroy blocks deletion.

Detailed explanation

What is the issue?

Accidental deletion caused by a Terraform code change or a wrong apply.

Where does it happen?

Production, especially for:

  • S3 state bucket
  • DynamoDB lock table
  • RDS
  • IAM critical roles

When does it happen?

  • release time
  • cleanup tasks
  • refactor mistakes
  • wrong workspace selected

How does Terraform solve it?

prevent_destroy blocks deletion even if plan says destroy.
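
The same guardrail applied to a realistic critical resource; a sketch with a hypothetical bucket name:

resource "aws_s3_bucket" "tf_state" {
  bucket = "example-prod-terraform-state"   # hypothetical name

  lifecycle {
    prevent_destroy = true   # any plan that would destroy this resource fails with an error
  }
}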

Interview answer

“For critical resources, we set lifecycle prevent_destroy to avoid accidental deletion during CI/CD or refactoring.”


06-backend-change-reconfigure (safe + always runnable)

✅ Code (NO provider needed)

06-backend-change-reconfigure/main.tf

terraform {
  required_version = ">= 1.5.0"
}

resource "terraform_data" "backend_note" {
  input = "backend demo"
}

06-backend-change-reconfigure/README.txt

Production behavior:

If backend config changes:
terraform init -reconfigure

If moving state to a new backend:
terraform init -migrate-state

Detailed explanation (students)

What is the issue?

The Terraform state location changed:

  • local -> S3
  • S3 bucket name changed
  • key path changed
  • region changed

Where does it happen?

  • production pipelines
  • team collaboration setups

When does it happen?

  • moving from local dev to shared backend
  • company security policy update
  • migration to new AWS account

Why is it dangerous?

Wrong backend = wrong state:

  • Terraform may create duplicates
  • or destroy wrong resources

How does Terraform solve it?

  • terraform init -reconfigure tells Terraform to accept the new backend settings
  • terraform init -migrate-state moves existing state to the new backend safely
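
For context, a typical shared backend block looks like this (all names are hypothetical; changing any of these values is what triggers the need for -reconfigure or -migrate-state):

terraform {
  backend "s3" {
    bucket         = "example-prod-tf-state"
    key            = "platform/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-tf-locks"
    encrypt        = true
  }
}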

Interview answer

“Backend changes require terraform init -reconfigure. If we move state, we use -migrate-state to prevent losing track of production resources.”


07-state-locking-simulation

✅ Code

07-state-locking-simulation/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    time = {
      source  = "hashicorp/time"
      version = "~> 0.11"
    }
  }
}

resource "time_sleep" "simulate_long_apply" {
  create_duration = "60s"
}

07-state-locking-simulation/README.txt

Terminal 1:
terraform init
terraform apply

While it's running, Terminal 2:
terraform apply

The local backend still uses a simple file lock, so the second apply typically fails with a state lock error.
In production, an S3 backend with DynamoDB locking prevents concurrent applies across machines and CI.

Detailed explanation

What is the issue?

Two engineers (or CI plus a human) run terraform apply at the same time.

Where does it happen?

  • production pipelines
  • shared infrastructure repos
  • busy teams

When does it happen?

  • during release
  • hotfix while pipeline is running
  • multiple PR merges at once

Why is it dangerous?

Concurrent apply can corrupt state or cause inconsistent infra.

Real production solution

S3 backend + DynamoDB locking:

  • DynamoDB ensures only one apply at a time
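
If a run crashes and leaves a lock behind, Terraform has an escape hatch; use it only when you are certain no apply is still running (the lock ID appears in the error message):

terraform force-unlock <LOCK_ID>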

Interview answer

“In production we use DynamoDB state locking to prevent concurrent terraform apply and avoid state corruption.”


08-partial-apply-and-recovery (apply fails mid-way)

✅ Code

08-partial-apply-and-recovery/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    null = {
      source  = "hashicorp/null"
      version = "~> 3.2"
    }
  }
}

resource "null_resource" "step1" {}

resource "null_resource" "step2_fail" {
  provisioner "local-exec" {
    command = "echo 'simulating failure' && exit 1"
  }
}

08-partial-apply-and-recovery/README.txt

Steps:
1) terraform init
2) terraform apply (will fail)

Inspect state:
terraform state list

Fix demo: change exit 1 to exit 0 then apply again.

Detailed explanation (students)

What is the issue?

Terraform started creating resources, then it failed.
So some resources exist, others don’t.

Where does it happen?

Mostly in production, because APIs fail:

  • IAM permissions missing
  • AWS rate limits
  • timeouts
  • dependency errors
  • networking issues

When does it happen?

  • during release
  • during delivery pipeline apply
  • during disaster recovery changes

Why is it dangerous?

The infra is now in a “half created” state.

How does Terraform solve it?

Terraform state records what succeeded.
After fixing the problem, run apply again.
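
A short recovery sketch for this lab (resource names match the code above):

terraform state list                        # shows what Terraform recorded from the failed run
terraform state show null_resource.step1    # inspect a single entry
terraform apply                             # after changing exit 1 to exit 0, the run converges

The resource whose provisioner failed may appear as tainted and be replaced on the next apply.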

Interview answer

“Partial applies happen. We check terraform state list, fix root cause, then rerun apply to converge infrastructure.”


09-outputs-remote-state-contract (platform vs app teams)

✅ Code

Platform

09-outputs-remote-state-contract/platform/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "~> 2.5"
    }
  }
}

resource "local_file" "platform" {
  filename = "${path.module}/platform.txt"
  content  = "subnet_id=subnet-123\n"
}

09-outputs-remote-state-contract/platform/outputs.tf

output "subnet_id" {
  value = "subnet-123"
}

App (NO null provider; uses terraform_data)

09-outputs-remote-state-contract/app/main.tf

terraform {
  required_version = ">= 1.5.0"
}

data "terraform_remote_state" "platform" {
  backend = "local"
  config = {
    path = "../platform/terraform.tfstate"
  }
}

resource "terraform_data" "use_contract" {
  input = {
    subnet = data.terraform_remote_state.platform.outputs.subnet_id
  }
}

09-outputs-remote-state-contract/app/README.txt

Steps:

1) Run platform:
cd platform
terraform init
terraform apply

2) Run app:
cd ../app
terraform init
terraform apply

Break contract:
Rename output subnet_id -> public_subnet_id in platform outputs.tf
Apply platform again

App will fail until it is updated.

Lesson:
Outputs are API contracts between teams.

Detailed explanation (students)

What is the issue?

The app team depends on platform output names.
If the platform renames an output, the app breaks.

Where does it happen?

Real org structure:

  • networking team outputs subnet IDs
  • security team outputs IAM role ARN
  • app team reads them via remote state

When does it happen?

  • release time when platform changes
  • platform upgrade
  • refactoring outputs

Why is it dangerous?

The app deployment pipeline fails and the release is delayed.

How does Terraform solve it?

Not automatic. This is process + design:

  • treat outputs as versioned API
  • communicate changes
  • create backward compatible outputs
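
One concrete pattern is to ship the new output name alongside the old one for a deprecation window; a sketch for platform/outputs.tf:

output "public_subnet_id" {
  value = "subnet-123"
}

# Keep the old name temporarily so existing app stacks do not break
output "subnet_id" {
  value = "subnet-123"
}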

Interview answer

“Remote state outputs act like API contracts. If platform changes output names, application stacks break, so we version and communicate output changes.”


10-provider-version-lock-file (consistency across engineers and CI)

✅ Code

10-provider-version-lock-file/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    random = {
      source  = "hashicorp/random"
      version = "~> 3.6"
    }
  }
}

resource "random_id" "demo" {
  byte_length = 4
}

output "id" {
  value = random_id.demo.hex
}

10-provider-version-lock-file/README.txt

Steps:
terraform init
ls -la .terraform.lock.hcl

The lock file pins provider versions.
Commit it to Git so all engineers and CI use same provider builds.

Detailed explanation

What is the issue?

Different machines install different provider versions.
CI runs with one version, laptop runs with another.

Where does it happen?

  • CI/CD pipelines
  • multi-engineer teams
  • long-lived repos

When does it happen?

  • release time (CI suddenly breaks)
  • a provider update was released yesterday
  • a new engineer runs init

How does Terraform solve it?

.terraform.lock.hcl records exact provider version + hashes.
Commit it to Git.
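
If laptops and CI run on different operating systems or CPU architectures, you can record hashes for all of them so init verifies the same builds everywhere; a sketch (the platform list is an example):

terraform providers lock \
  -platform=linux_amd64 \
  -platform=darwin_arm64 \
  -platform=darwin_amd64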

Interview answer

“We commit .terraform.lock.hcl so CI and all engineers use the same provider versions, avoiding surprises during release.”


How students should run everything

Each lab is independent:

cd 01-drift-and-refresh
terraform init
terraform apply

For lab 09, order matters:

cd 09-outputs-remote-state-contract/platform
terraform init
terraform apply

cd ../app
terraform init
terraform apply

One interview story that covers everything (simple words)

“In production, Terraform problems usually come from drift, state, or team changes. Drift happens when someone changes infra manually. Imports happen because infra existed before Terraform. Refactors like renames and count-to-for_each migrations require state moves to prevent destroys. We protect critical resources with prevent_destroy. We use an S3 backend with DynamoDB locks to avoid concurrent applies. Apply can fail mid-way, so we fix the issue and rerun apply to converge. Outputs are contracts between teams, so we version them. And we commit lock files so CI and developers use the same provider versions.”
