DEV Community

S, Sanjay

Terraform State Files: The Diary Your Infrastructure Never Wanted You to Read

🎬 The Horror Begins

Error: Error acquiring the state lock

  Lock Info:
    ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
    Path:      terraform.tfstate
    Operation: OperationTypeApply
    Who:       dave@DESKTOP-OOPS
    Version:   1.9.0
    Created:   2026-03-17 14:32:07.123456 +0000 UTC

Dave. It's always Dave. Dave started a terraform apply, got scared halfway through, closed his laptop, and went to lunch. Now the state is locked, Dave is unreachable, and you have a production deployment waiting.

Welcome to Terraform at Scale, where state files are sacred, locking mechanisms are your best friend, and terraform destroy is a four-letter word.


๐Ÿ—๏ธ How Terraform Actually Works (The 30-Second Version)

Terraform is deceptively simple. You write what you want (HCL), and Terraform figures out how to get there:

         You write .tf files
                  │
                  ▼
    ┌─── terraform init ───────────────────┐
    │  • Downloads providers (azurerm)     │
    │  • Initializes backend (where state  │
    │    is stored)                        │
    │  • Downloads modules                 │
    └─────────────┬────────────────────────┘
                  │
                  ▼
    ┌─── terraform plan ───────────────────┐
    │  • Reads current state file          │
    │  • Calls Azure APIs: "What exists?"  │
    │  • Compares desired vs actual        │
    │  • Generates execution plan          │
    │  • "Plan: 3 to add, 1 to change,     │
    │    0 to destroy"                     │
    └─────────────┬────────────────────────┘
                  │
                  ▼
    ┌─── terraform apply ──────────────────┐
    │  • Executes the plan                 │
    │  • Calls Azure APIs to create/       │
    │    update/delete resources           │
    │  • Updates state file                │
    │  • 🙏 Hopes nothing crashes mid-way  │
    └──────────────────────────────────────┘

The secret sauce? The Dependency Graph (DAG). Terraform builds a graph of all your resources and their dependencies, then walks it in the right order:

Resource Group
    │
    ├──▶ VNet ──▶ Subnet ──▶ AKS Cluster
    │                    └──▶ Private Endpoint
    └──▶ Key Vault

Terraform knows to create the Resource Group first, then VNet and Key Vault in parallel (they don't depend on each other), then Subnet, then AKS and Private Endpoint.
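
Here is a minimal sketch of where those graph edges come from. Terraform infers each dependency from attribute references, so nothing below declares an order explicitly. The resource names, address ranges, and Key Vault name are illustrative assumptions, not taken from a real deployment.

```hcl
# Illustrative sketch: names and CIDR ranges are assumptions.
provider "azurerm" {
  features {}
  # Recent azurerm releases also expect a subscription ID (argument or env var).
}

data "azurerm_client_config" "current" {}

resource "azurerm_resource_group" "main" {
  name     = "rg-prod"
  location = "eastus"
}

# References the resource group, so it is created after the group.
resource "azurerm_virtual_network" "main" {
  name                = "vnet-prod"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  address_space       = ["10.0.0.0/16"]
}

# Depends only on the resource group, so it can be created alongside the VNet.
resource "azurerm_key_vault" "main" {
  name                = "kv-prod-example"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  tenant_id           = data.azurerm_client_config.current.tenant_id
  sku_name            = "standard"
}

# References the VNet, so it waits for both the group and the VNet.
resource "azurerm_subnet" "aks" {
  name                 = "snet-aks"
  resource_group_name  = azurerm_resource_group.main.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.1.0/24"]
}
```

Running terraform graph prints the resulting graph in DOT format if you want to inspect the edges yourself.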

💡 The -parallelism flag: By default, Terraform runs up to 10 resource operations concurrently. For huge stacks, terraform apply -parallelism=5 reduces API throttling. If your provider's API can take the load, terraform apply -parallelism=30 can shorten the run.


๐Ÿ“ State Files: The Crown Jewels

The state file is Terraform's memory. It maps your .tf resources to actual cloud resources. Without it, Terraform has amnesia.

// What's in a state file (simplified):
{
  "resources": [
    {
      "type": "azurerm_resource_group",
      "name": "main",
      "instances": [{
        "attributes": {
          "id": "/subscriptions/xxx/resourceGroups/rg-prod",
          "name": "rg-prod",
          "location": "eastus"
        }
      }]
    }
  ]
}

🚨 Real-World Disaster #1: The Deleted State File

The Message in #devops-emergency:

@channel I accidentally deleted the terraform.tfstate file from
the storage account. Is everything in production gone?

Good News: Deleting the state file does NOT delete your infrastructure. Your Azure resources are fine.

Bad News: Terraform now has no idea what it manages. Running terraform plan will show it wants to CREATE everything from scratch, and the apply would then fail because the resources already exist.

The Fix:

Option A: Restore from backup (works if blob soft-delete is enabled on the storage account)

# Check soft-deleted blobs
az storage blob list --account-name tfstate --container-name state \
  --include d --query "[?deleted]" -o table

# Restore it
az storage blob undelete --account-name tfstate \
  --container-name state --name prod/terraform.tfstate

Option B: If no backup, re-import everything (painful but possible)

# Import each resource manually
terraform import azurerm_resource_group.main \
  /subscriptions/xxx/resourceGroups/rg-prod

terraform import azurerm_kubernetes_cluster.main \
  /subscriptions/xxx/resourceGroups/rg-prod/providers/Microsoft.ContainerService/managedClusters/aks-prod

# Repeat for every. single. resource. ☕☕☕

Option C (Terraform 1.5+): Use import blocks

import {
  to = azurerm_resource_group.main
  id = "/subscriptions/xxx/resourceGroups/rg-prod"
}

import {
  to = azurerm_kubernetes_cluster.main
  id = "/subscriptions/xxx/.../managedClusters/aks-prod"
}
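
When many resources of one type need importing, Terraform 1.7 and later also accept for_each inside an import block. This is a sketch under assumptions: it presumes a resource azurerm_subnet.main declared with for_each over the subnet names, living in a VNet called vnet-prod (a hypothetical name). The subscription ID stays a placeholder, as above.

```hcl
locals {
  # Map of subnet name => Azure resource ID.
  # "vnet-prod" is a hypothetical VNet name; "xxx" is a placeholder subscription ID.
  subnet_ids = {
    for name in ["web", "app", "data"] :
    name => "/subscriptions/xxx/resourceGroups/rg-prod/providers/Microsoft.Network/virtualNetworks/vnet-prod/subnets/${name}"
  }
}

# One import per map entry; each.key selects the matching for_each instance.
import {
  for_each = local.subnet_ids
  to       = azurerm_subnet.main[each.key]
  id       = each.value
}
```

Once the apply has recorded the imports in state, the import blocks can be deleted from the configuration.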

Rule #1 of State: Remote Backend. Always.

# backend.tf: NON-NEGOTIABLE for any real project
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "stterraformstateprod"
    container_name       = "tfstate"
    key                  = "prod/networking.tfstate"

    # These save your life:
    use_azuread_auth = true     # No access keys!
    snapshot         = true     # Auto-snapshot before write
  }
}

Storage Account Protection Checklist

  • ✅ Soft-delete enabled (30-day retention)
  • ✅ Versioning enabled (every state write is a new version)
  • ✅ Lock on the resource group (CanNotDelete)
  • ✅ No public access (Private Endpoint or Azure AD auth only)
  • ✅ Geo-redundant storage (GRS or RA-GRS)
  • ✅ Azure AD authentication (not storage keys)
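
Most of that checklist can itself be expressed in Terraform. The sketch below assumes the state storage is bootstrapped from its own small configuration (it cannot store its state in the account it is creating) and reuses the names from the backend example. The location and lock name are assumptions.

```hcl
# Sketch of a hardened state storage account. Assumes a configured azurerm provider.
resource "azurerm_resource_group" "tfstate" {
  name     = "rg-terraform-state"
  location = "eastus" # assumption
}

resource "azurerm_storage_account" "tfstate" {
  name                            = "stterraformstateprod"
  resource_group_name             = azurerm_resource_group.tfstate.name
  location                        = azurerm_resource_group.tfstate.location
  account_tier                    = "Standard"
  account_replication_type        = "GRS" # geo-redundant storage
  allow_nested_items_to_be_public = false # no anonymous blob access

  blob_properties {
    versioning_enabled = true # every state write becomes a new blob version

    delete_retention_policy {
      days = 30 # blob soft-delete window
    }

    container_delete_retention_policy {
      days = 30 # container soft-delete window
    }
  }
}

# CanNotDelete lock on the whole resource group.
resource "azurerm_management_lock" "tfstate" {
  name       = "tfstate-cannot-delete" # hypothetical lock name
  scope      = azurerm_resource_group.tfstate.id
  lock_level = "CanNotDelete"
  notes      = "Protects Terraform state storage from accidental deletion."
}
```

Private endpoints and the Azure AD role assignments for the people and pipelines that read state are left out here to keep the sketch short.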

🔒 State Locking: Preventing the "Dave Problem"

When someone runs terraform apply, the state file gets locked so nobody else can modify it at the same time. This prevents two people from making conflicting changes.

🚨 Real-World Disaster #2: The Stuck Lock

The Error:

Error: Error acquiring the state lock
Lock Info:
  Who:       ci-pipeline@runner-xyz
  Created:   2026-03-15 09:14:22 UTC

The CI pipeline crashed mid-apply (runner ran out of disk). The lock was never released.

The Fix:

# First: VERIFY the lock holder is actually dead
# (Don't force-unlock if someone is genuinely running apply!)

# Check if the pipeline is still running...
# If confirmed dead, pass the ID shown in the error's "Lock Info" block:
terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef1234567890

Prevention:

  • CI/CD pipelines should have a timeout on terraform apply steps
  • Use terraform wrapper scripts that catch kill signals and clean up
  • Monitor for stale locks (alert if lock age > 30 minutes)

๐Ÿ“ Module Architecture: Building Lego Blocks

Bad Terraform looks like one giant main.tf with 2,000 lines. Good Terraform looks like well-organized Lego blocks that snap together.

The Module Hierarchy

Modules
├── Foundation Modules (building blocks)
│   ├── terraform-azurerm-vnet          # Creates a VNet + subnets
│   ├── terraform-azurerm-aks           # Creates an AKS cluster
│   ├── terraform-azurerm-keyvault      # Creates a Key Vault
│   └── terraform-azurerm-sql           # Creates Azure SQL
│
├── Composition Modules (patterns)
│   ├── terraform-azurerm-landing-zone  # Combines: VNet + NSGs + DNS
│   ├── terraform-azurerm-app-stack     # Combines: AKS + ACR + KeyVault
│   └── terraform-azurerm-data-stack    # Combines: SQL + Redis + Storage
│
└── Root Modules (deployments)
    ├── prod/networking/    # Uses landing-zone module
    ├── prod/applications/  # Uses app-stack module
    └── dev/                # Uses same modules, different vars

Module Do's and Don'ts

✅ DO:
  • Version your modules (git tags: v1.0.0, v1.1.0)
  • Pin module versions in consumers
  • Include validation on variables
  • Output everything consumers might need
  • Include a README with examples

❌ DON'T:
  • Put provider config in modules (let the root decide)
  • Hardcode values (that's what variables are for)
  • Create God Modules that do everything
  • Use count when for_each works (index drift = pain)
  • Skip validation rules on variables
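
Two of those DOs fit in a few lines of HCL. This is a sketch, not a definitive layout: the variable lives inside the module, the module call lives in the consumer, and the Git URL is a hypothetical placeholder.

```hcl
# Inside the module (e.g. variables.tf): validate inputs so bad values fail at plan time.
variable "environment" {
  type        = string
  description = "Deployment environment."

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}

# In the consumer (root module): pin the module to a tagged release.
# The repository URL is a hypothetical placeholder.
module "aks" {
  source = "git::https://example.com/org/terraform-modules.git//modules/aks?ref=v2.1.0"

  environment = "prod"
}
```

With the ref pinned, a breaking change in the module only reaches a consumer when someone deliberately bumps the tag.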

🚨 Real-World Disaster #3: The count Index Shift

The Setup:

# BAD: Using count with a list
variable "subnets" {
  default = ["web", "app", "data"]
}

resource "azurerm_subnet" "main" {
  count = length(var.subnets)
  name  = var.subnets[count.index]
  # ...
}

What Happened: Someone removed "app" from the list → ["web", "data"]. Terraform's plan:

# Destroy: azurerm_subnet.main[1] ("app")    ← Correct
# Destroy: azurerm_subnet.main[2] ("data")   ← WAIT WHAT
# Create:  azurerm_subnet.main[1] ("data")   ← WHY

# It's destroying and recreating "data" because its INDEX changed
# from 2 to 1! Everything in that subnet (VMs, AKS) will be destroyed!

The Fix: Use for_each instead:

# GOOD: Using for_each with stable keys
resource "azurerm_subnet" "main" {
  for_each = toset(var.subnets)
  name     = each.value
  # ...
}

# Now removing "app" only destroys "app". "web" and "data" are untouched.
# Resources are keyed by NAME, not index position.

💡 Rule: count is only for count = var.enable_feature ? 1 : 0 (conditional creation). For everything else, use for_each.
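
For completeness, here is roughly what the one legitimate count pattern looks like, including how to reference a resource that may not exist. The resource and output names are illustrative assumptions.

```hcl
variable "enable_feature" {
  type    = bool
  default = false
}

# Created only when the toggle is on: zero or one instance.
resource "azurerm_resource_group" "feature" {
  count    = var.enable_feature ? 1 : 0
  name     = "rg-feature" # hypothetical name
  location = "eastus"
}

output "feature_resource_group_id" {
  # one() returns the single element, or null when the list is empty,
  # so this output stays valid whether the toggle is on or off.
  value = one(azurerm_resource_group.feature[*].id)
}
```

Because there is never more than one instance, there is no index to shift.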


🧪 Testing Terraform (Yes, You Should Test Your IaC)

"I'll just run terraform plan and check it manually" is the IaC equivalent of "I'll just test in production."

Testing Pyramid for Terraform

                    ┌─────────────┐
                    │  E2E Tests  │  ← Deploy real infra, validate,
                    │  (Terratest)│    destroy. Slow but complete.
                    └──────┬──────┘
                           │
                  ┌────────▼────────┐
                  │ Integration     │  ← terraform plan + validate
                  │ (Plan Analysis) │    Check plan output for issues
                  └────────┬────────┘
                           │
             ┌─────────────▼──────────────┐
             │ Static Analysis            │  ← No terraform needed!
             │ (tflint, checkov, trivy)   │    Fast, catches 80% of issues
             └─────────────┬──────────────┘
                           │
        ┌──────────────────▼────────────────────┐
        │ Unit Tests (terraform validate, fmt)  │  ← Sub-second
        │ Pre-commit hooks                      │
        └───────────────────────────────────────┘
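
For the plan-analysis layer, Terraform 1.6 and later ship a native terraform test command that can run assertions against a plan. Below is a small sketch, assuming the configuration under test declares var.subnets and the azurerm_subnet.main resource used in the count example above. The file name is hypothetical, and a plan-mode run still needs working provider credentials.

```hcl
# tests/subnets.tftest.hcl (hypothetical file name)

# Input values for this test file.
variables {
  subnets = ["web", "data"]
}

run "creates_one_subnet_per_name" {
  # Plan only: nothing is deployed, the assertion inspects the planned instances.
  command = plan

  assert {
    condition     = length(azurerm_subnet.main) == 2
    error_message = "Expected exactly one subnet per entry in var.subnets."
  }
}
```

Running terraform test from the module directory picks up .tftest.hcl files there and in a tests/ subdirectory.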

Quick Static Analysis Setup

# Install tflint
brew install tflint  # or scoop install tflint on Windows

# .tflint.hcl
plugin "azurerm" {
  enabled = true
  version = "0.27.0"
  source  = "github.com/terraform-linters/tflint-ruleset-azurerm"
}

rule "terraform_naming_convention" {
  enabled = true
  format  = "snake_case"
}

# Run it
tflint --init
tflint --recursive

# Illustrative findings (the exact rules and messages depend on the
# ruleset version and any custom rules you add):
# ⚠ azurerm_storage_account: "account_replication_type" should be "GRS"
#   for production workloads
# ⚠ azurerm_kubernetes_cluster: "sku_tier" should be "Standard"
#   (not "Free") for production

Checkov for Security Scanning

checkov -d . --framework terraform

# Illustrative output (check IDs and titles vary by Checkov version):
# Passed: 142
# Failed: 7
# Skipped: 3
#
# Check: CKV_AZURE_35: "Ensure storage account has access logging"
# FAILED for resource: azurerm_storage_account.main
#
# Check: CKV_AZURE_1: "Ensure Azure SQL is using managed identity"
# FAILED for resource: azurerm_mssql_server.main

🔄 Multi-Environment Patterns

The Big Question: Workspaces vs. Directories vs. Terragrunt?

| Approach | How it Works | When to Use | Gotcha |
|---|---|---|---|
| Workspaces | Same code, terraform workspace select prod | Simple apps, identical envs | Shared state backend, single plan file; risky |
| Directory per env | envs/dev/, envs/prod/ with different .tfvars | Most teams | Code duplication if not using modules well |
| Terragrunt | DRY configs, dependency management, auto-backend | Large orgs, many envs | Learning curve, another tool to maintain |

The Pattern That Works for Most Teams

infrastructure/
├── modules/                    # Shared modules
│   ├── networking/
│   ├── aks-cluster/
│   └── database/
│
├── environments/
│   ├── dev/
│   │   ├── main.tf             # Calls modules with dev settings
│   │   ├── variables.tf
│   │   ├── dev.tfvars          # env-specific values
│   │   └── backend.tf          # Points to dev state file
│   │
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── staging.tfvars
│   │   └── backend.tf
│   │
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       ├── prod.tfvars
│       └── backend.tf          # Points to SEPARATE prod state file
│
└── global/                     # Shared resources (DNS zones, etc.)
    ├── main.tf
    └── backend.tf
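
To make the layout concrete, here is a sketch of what the prod directory might contain. The backend values reuse the earlier backend example. The module inputs (environment, address_space, subnet_id) and the aks_subnet_id output are hypothetical names that your own modules would define.

```hcl
# environments/prod/backend.tf
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "stterraformstateprod"
    container_name       = "tfstate"
    key                  = "prod/terraform.tfstate" # each environment gets its own key
    use_azuread_auth     = true
  }
}

# environments/prod/main.tf
# var.environment and var.address_space are declared in variables.tf
# and set in prod.tfvars.
module "networking" {
  source = "../../modules/networking"

  environment   = var.environment
  address_space = var.address_space
}

module "aks_cluster" {
  source = "../../modules/aks-cluster"

  environment = var.environment
  subnet_id   = module.networking.aks_subnet_id # hypothetical module output
}
```

From the repository root, terraform -chdir=environments/prod plan -var-file=prod.tfvars then targets prod and nothing else.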

🚨 Real-World Disaster #4: The Workspace Mixup

What Happened: Engineer ran terraform apply thinking they were in the dev workspace. They were in prod. 12 resources destroyed and recreated. 35 minutes of downtime.

# THE MOMENT OF HORROR:
$ terraform workspace show
prod

$ terraform apply -auto-approve
# 💀💀💀

The Fix:

  1. Never use -auto-approve in production
  2. Add a workspace check to your terraform wrapper:
#!/bin/bash
# safe-terraform.sh: ask for explicit confirmation before touching prod
CURRENT_WORKSPACE=$(terraform workspace show)
if [[ "$CURRENT_WORKSPACE" == "prod" ]]; then
  echo "⚠️  WARNING: You are targeting PROD!"
  echo "Type 'yes-i-mean-prod' to continue:"
  read -r confirmation
  if [[ "$confirmation" != "yes-i-mean-prod" ]]; then
    echo "Aborting. Good choice."
    exit 1
  fi
fi
terraform "$@"

  3. Better yet: Use separate directories per environment instead of workspaces. Physical separation > logical separation.
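
For teams that stay on workspaces, a guard can also live in the configuration itself, so it applies no matter which script or pipeline invokes Terraform. This is a sketch assuming Terraform 1.4 or later (for terraform_data) and a hypothetical var.environment set from the environment's .tfvars file:

```hcl
variable "environment" {
  type        = string
  description = "Environment this run is meant to target, e.g. set in prod.tfvars."
}

# Carries no infrastructure; exists only to host the precondition.
resource "terraform_data" "workspace_guard" {
  lifecycle {
    precondition {
      condition     = terraform.workspace == var.environment
      error_message = "Selected workspace does not match var.environment. Refusing to continue."
    }
  }
}
```

Both values are known at plan time, so a mismatch (say, dev.tfvars while the prod workspace is selected) fails the plan before anything is changed.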

🚀 The moved Block: Refactoring Without Tears

One of Terraform's best features (added in 1.1) that too few people know about:

# You moved a resource from the root module:
# resource "azurerm_kubernetes_cluster" "main" { ... }
#
# Into a child module:
# module "aks" {
#   source = "./modules/aks"
# }
#
# Without `moved`, Terraform would DESTROY the old cluster
# and CREATE a new one. With `moved`:

moved {
  from = azurerm_kubernetes_cluster.main
  to   = module.aks.azurerm_kubernetes_cluster.main
}

# Now Terraform knows it's the SAME resource, just moved.
# No destruction. No downtime. Just a state update.

This is a career-saver when refactoring large codebases.
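
The same mechanism covers the count to for_each migration from Disaster #3. Here is a sketch assuming the original list was ["web", "app", "data"] and the resource has already been switched to for_each = toset(var.subnets):

```hcl
# Map each old index address to its new key address.
moved {
  from = azurerm_subnet.main[0]
  to   = azurerm_subnet.main["web"]
}

moved {
  from = azurerm_subnet.main[1]
  to   = azurerm_subnet.main["app"]
}

moved {
  from = azurerm_subnet.main[2]
  to   = azurerm_subnet.main["data"]
}
```

The next terraform plan should then report the three subnets as moved, with nothing to add or destroy.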


🧠 Principal-Level Terraform Wisdom

The Golden Rules

1. State isolation per blast radius
   └─ prod networking ≠ prod application ≠ dev anything

2. Module versioning is non-negotiable
   └─ source = "git::https://...//modules/aks?ref=v2.1.0"

3. Plan in CI, Apply in CD
   └─ PR → terraform plan (comment on PR) → merge → terraform apply

4. Never terraform apply from a laptop in production
   └─ Pipeline or nothing

5. Import before you destroy
   └─ Existing resources? terraform import, don't recreate

6. State locking + remote backend or don't bother
   └─ Local state in a team = guaranteed disaster
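
Rule 1 raises an obvious question: if networking and applications live in separate state files, how does the application stack find the subnet? One common answer is the terraform_remote_state data source. The sketch below reuses the backend values from the earlier example, and aks_subnet_id is a hypothetical output that the networking stack would need to declare.

```hcl
# In the prod application stack: read outputs from the networking stack's state.
data "terraform_remote_state" "networking" {
  backend = "azurerm"

  config = {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "stterraformstateprod"
    container_name       = "tfstate"
    key                  = "prod/networking.tfstate"
    use_azuread_auth     = true
  }
}

locals {
  # "aks_subnet_id" is a hypothetical output exposed by the networking stack.
  aks_subnet_id = data.terraform_remote_state.networking.outputs.aks_subnet_id
}
```

The application stack can only read those outputs. A bad apply in one stack cannot rewrite the other stack's state, which is the whole point of splitting by blast radius.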

🎯 Key Takeaways

  1. State files are sacred: remote backend, versioned, soft-deleted, geo-replicated
  2. for_each > count: always, unless it's a simple on/off toggle
  3. Module versioning prevents breaking changes from cascading
  4. Test your IaC: tflint + checkov catches most issues before plan
  5. Separate environments by directory, not just workspaces
  6. moved blocks let you refactor without destroying resources
  7. Never -auto-approve in production. Ever. EVER.

🔥 Homework

  1. Check if your Terraform state backend has soft-delete enabled: az storage account blob-service-properties show --account-name <name> --resource-group <rg> --query 'deleteRetentionPolicy'
  2. Run checkov -d . on your Terraform code and fix the failed checks, highest-risk first
  3. Find any count usage that should be for_each and refactor it (use moved blocks!)

Next up in the series: "Your CI/CD Pipeline is a Dumpster Fire - Here's the Extinguisher", where we optimize 45-minute builds to 5 minutes, standardize pipelines across teams, and decode DORA metrics.


💬 What's your worst terraform destroy story? Did you survive? Drop it below. Therapy is free. 🛋️
