The Horror Begins
Error: Error acquiring the state lock
Lock Info:
ID: a1b2c3d4-e5f6-7890-abcd-ef1234567890
Path: terraform.tfstate
Operation: OperationTypeApply
Who: dave@DESKTOP-OOPS
Version: 1.9.0
Created: 2026-03-17 14:32:07.123456 +0000 UTC
Dave. It's always Dave. Dave started a terraform apply, got scared halfway through, closed his laptop, and went to lunch. Now the state is locked, Dave is unreachable, and you have a production deployment waiting.
Welcome to Terraform at Scale, where state files are sacred, locking mechanisms are your best friend, and terraform destroy is a four-letter word.
How Terraform Actually Works (The 30-Second Version)
Terraform is deceptively simple. You write what you want (HCL), and Terraform figures out how to get there:
You write .tf files
              │
              ▼
┌─── terraform init ──────────────────┐
│ • Downloads providers (azurerm)     │
│ • Initializes backend (where state  │
│   is stored)                        │
│ • Downloads modules                 │
└─────────────┬───────────────────────┘
              │
              ▼
┌─── terraform plan ──────────────────┐
│ • Reads current state file          │
│ • Calls Azure APIs: "What exists?"  │
│ • Compares desired vs actual        │
│ • Generates execution plan          │
│ • "Plan: 3 to add, 1 to change,     │
│   0 to destroy"                     │
└─────────────┬───────────────────────┘
              │
              ▼
┌─── terraform apply ─────────────────┐
│ • Executes the plan                 │
│ • Calls Azure APIs to create/       │
│   update/delete resources           │
│ • Updates state file                │
│ • Hopes nothing crashes mid-way     │
└─────────────────────────────────────┘
The secret sauce? The Dependency Graph (DAG). Terraform builds a graph of all your resources and their dependencies, then walks it in the right order:
Resource Group
│
├──▶ VNet ──▶ Subnet ──▶ AKS Cluster
│                   └──▶ Private Endpoint
└──▶ Key Vault
Terraform knows to create the Resource Group first, then VNet and Key Vault in parallel (they don't depend on each other), then Subnet, then AKS and Private Endpoint.
Tip: the -parallelism flag. By default, Terraform runs up to 10 resource operations in parallel. For huge stacks, terraform apply -parallelism=5 reduces API throttling. For speed, terraform apply -parallelism=30 helps if your provider can handle it.
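Here is a minimal sketch of where those graph edges come from. Terraform infers them from references between resources, so you never write the ordering yourself. The resource names, location, and address ranges are illustrative assumptions, and the provider block is left out.

```hcl
# Illustrative only: names, location, and CIDR ranges are assumptions.
resource "azurerm_resource_group" "main" {
  name     = "rg-prod"
  location = "eastus"
}

# Referencing azurerm_resource_group.main creates the edge RG -> VNet.
resource "azurerm_virtual_network" "main" {
  name                = "vnet-prod"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  address_space       = ["10.0.0.0/16"]
}

# Referencing the VNet creates the edge VNet -> Subnet.
resource "azurerm_subnet" "aks" {
  name                 = "snet-aks"
  resource_group_name  = azurerm_resource_group.main.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.1.0/24"]
}
```

Running terraform graph against a configuration like this prints the resulting graph in DOT format, which is handy when an ordering surprises you.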
State Files: The Crown Jewels
The state file is Terraform's memory. It maps your .tf resources to actual cloud resources. Without it, Terraform has amnesia.
// What's in a state file (simplified):
{
  "resources": [
    {
      "type": "azurerm_resource_group",
      "name": "main",
      "instances": [{
        "attributes": {
          "id": "/subscriptions/xxx/resourceGroups/rg-prod",
          "name": "rg-prod",
          "location": "eastus"
        }
      }]
    }
  ]
}
Real-World Disaster #1: The Deleted State File
The Message in #devops-emergency:
@channel I accidentally deleted the terraform.tfstate file from
the storage account. Is everything in production gone?
Good News: Deleting the state file does NOT delete your infrastructure. Your Azure resources are fine.
Bad News: Terraform now has no idea what it manages. Running terraform plan will show that it wants to CREATE everything from scratch (and the apply would fail because the resources already exist).
The Fix:
Option A: Restore from backup (Azure Storage has soft-delete)
# Check soft-deleted blobs
az storage blob list --account-name tfstate --container-name state \
  --include d --query "[?deleted]" -o table
# Restore it
az storage blob undelete --account-name tfstate \
  --container-name state --name prod/terraform.tfstate
Option B: If no backup, re-import everything (painful but possible)
# Import each resource manually
terraform import azurerm_resource_group.main \
  /subscriptions/xxx/resourceGroups/rg-prod
terraform import azurerm_kubernetes_cluster.main \
  /subscriptions/xxx/resourceGroups/rg-prod/providers/Microsoft.ContainerService/managedClusters/aks-prod
# Repeat for every. single. resource.
Option C (Terraform 1.5+): Use import blocks
import {
  to = azurerm_resource_group.main
  id = "/subscriptions/xxx/resourceGroups/rg-prod"
}
import {
  to = azurerm_kubernetes_cluster.main
  id = "/subscriptions/xxx/.../managedClusters/aks-prod"
}
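When many resources of the same type need re-importing, one import block can cover all of them. The sketch below assumes Terraform 1.7 or later (which allows for_each inside import blocks) and a resource "azurerm_subnet" "main" that uses for_each over the same keys. The VNet name vnet-prod and the xxx subscription ID are placeholders, not values from a real environment.

```hcl
# Assumes Terraform 1.7+ and that azurerm_subnet.main uses for_each
# with the keys "web" and "data". IDs below are placeholders.
locals {
  subnet_ids = {
    web  = "/subscriptions/xxx/resourceGroups/rg-prod/providers/Microsoft.Network/virtualNetworks/vnet-prod/subnets/web"
    data = "/subscriptions/xxx/resourceGroups/rg-prod/providers/Microsoft.Network/virtualNetworks/vnet-prod/subnets/data"
  }
}

# One block imports every entry in the map into the matching instance.
import {
  for_each = local.subnet_ids
  to       = azurerm_subnet.main[each.key]
  id       = each.value
}
```

Review the plan before applying: it should list imports only, with no creates or destroys.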
Rule #1 of State: Remote Backend. Always.
# backend.tf: NON-NEGOTIABLE for any real project
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "stterraformstateprod"
    container_name       = "tfstate"
    key                  = "prod/networking.tfstate"
    # These save your life:
    use_azuread_auth = true # No access keys!
    snapshot         = true # Auto-snapshot before write
  }
}
Storage Account Protection Checklist
- Soft-delete enabled (30-day retention)
- Versioning enabled (every state write is a new version)
- Lock on the resource group (CanNotDelete)
- No public access (Private Endpoint or Azure AD auth only)
- Geo-redundant storage (GRS or RA-GRS)
- Azure AD authentication (not storage keys)
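Most of that checklist can live in code too. This is a sketch under assumptions, not a drop-in module: the names reuse the backend example, the location is made up, and you should confirm the argument names against the azurerm provider version you pin.

```hcl
# Sketch of the checklist as configuration. Location is an assumption.
resource "azurerm_resource_group" "tfstate" {
  name     = "rg-terraform-state"
  location = "eastus"
}

resource "azurerm_storage_account" "tfstate" {
  name                          = "stterraformstateprod"
  resource_group_name           = azurerm_resource_group.tfstate.name
  location                      = azurerm_resource_group.tfstate.location
  account_tier                  = "Standard"
  account_replication_type      = "GRS" # geo-redundant storage
  shared_access_key_enabled     = false # Azure AD auth, no storage keys
  public_network_access_enabled = false # reach it through a Private Endpoint

  blob_properties {
    versioning_enabled = true # every state write becomes a new version

    delete_retention_policy {
      days = 30 # soft-delete window
    }
  }
}

# CanNotDelete lock on the resource group that holds the state.
resource "azurerm_management_lock" "tfstate" {
  name       = "tfstate-cannot-delete"
  scope      = azurerm_resource_group.tfstate.id
  lock_level = "CanNotDelete"
}
```

One caveat: with shared keys disabled, the azurerm provider itself has to talk to storage through Azure AD, which typically means setting storage_use_azuread = true in the provider block.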
State Locking: Preventing the "Dave Problem"
When someone runs terraform apply, Terraform locks the state so nobody else can modify it at the same time. This prevents two people from making conflicting changes.
Real-World Disaster #2: The Stuck Lock
The Error:
Error: Error acquiring the state lock
Lock Info:
Who: ci-pipeline@runner-xyz
Created: 2026-03-15 09:14:22 UTC
The CI pipeline crashed mid-apply (runner ran out of disk). The lock was never released.
The Fix:
# First: VERIFY the lock holder is actually dead
# (Don't force-unlock if someone is genuinely running apply!)
# Check if the pipeline is still running...
# If confirmed dead, pass the ID from the "Lock Info" block of the error:
terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef1234567890
Prevention:
- CI/CD pipelines should have a timeout on terraform apply steps
- Use terraform wrapper scripts that catch kill signals and clean up
- Monitor for stale locks (alert if lock age > 30 minutes)
Module Architecture: Building Lego Blocks
Bad Terraform looks like one giant main.tf with 2,000 lines. Good Terraform looks like well-organized Lego blocks that snap together.
The Module Hierarchy
Modules
├── Foundation Modules (building blocks)
│   ├── terraform-azurerm-vnet          → Creates a VNet + subnets
│   ├── terraform-azurerm-aks           → Creates an AKS cluster
│   ├── terraform-azurerm-keyvault      → Creates a Key Vault
│   └── terraform-azurerm-sql           → Creates Azure SQL
│
├── Composition Modules (patterns)
│   ├── terraform-azurerm-landing-zone  → Combines: VNet + NSGs + DNS
│   ├── terraform-azurerm-app-stack     → Combines: AKS + ACR + KeyVault
│   └── terraform-azurerm-data-stack    → Combines: SQL + Redis + Storage
│
└── Root Modules (deployments)
    ├── prod/networking/                → Uses landing-zone module
    ├── prod/applications/              → Uses app-stack module
    └── dev/                            → Uses same modules, different vars
Module Do's and Don'ts
DO:
• Version your modules (git tags: v1.0.0, v1.1.0)
• Pin module versions in consumers
• Include validation on variables
• Output everything consumers might need
• Include a README with examples
DON'T:
• Put provider config in modules (let the root decide)
• Hardcode values (that's what variables are for)
• Create God Modules that do everything
• Use count when for_each works (index drift = pain)
• Skip validation rules on variables
Real-World Disaster #3: The count Index Shift
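Two of those DOs, variable validation and version pinning, look roughly like this in practice. Treat it as a sketch: the environment variable and its allowed values are assumptions, and the repository URL is a hypothetical placeholder.

```hcl
# modules/aks/variables.tf (sketch): reject bad input at plan time.
variable "environment" {
  type        = string
  description = "Deployment environment."

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}

# Consumer (root module): pin the module to a git tag.
# The repository URL is a hypothetical placeholder.
module "aks" {
  source      = "git::https://example.com/platform/terraform-modules.git//modules/aks?ref=v2.1.0"
  environment = "prod"
}
```

With the ref pinned, a breaking change in the module reaches a consumer only when someone bumps the tag on purpose.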
The Setup:
# BAD: Using count with a list
variable "subnets" {
  default = ["web", "app", "data"]
}
resource "azurerm_subnet" "main" {
  count = length(var.subnets)
  name  = var.subnets[count.index]
  # ...
}
What Happened: Someone removed "app" from the list, leaving ["web", "data"]. Terraform's plan:
# Destroy: azurerm_subnet.main[1] ("app")   <- Correct
# Destroy: azurerm_subnet.main[2] ("data")  <- WAIT WHAT
# Create:  azurerm_subnet.main[1] ("data")  <- WHY
# It's destroying and recreating "data" because its INDEX changed
# from 2 to 1! Everything in that subnet (VMs, AKS) is caught in the replacement!
The Fix: Use for_each instead:
# GOOD: Using for_each with stable keys
resource "azurerm_subnet" "main" {
  for_each = toset(var.subnets)
  name     = each.value
  # ...
}
# Now removing "app" only destroys "app". "web" and "data" are untouched.
# Resources are keyed by NAME, not index position.
Rule: count is only for count = var.enable_feature ? 1 : 0 (conditional creation). For everything else, use for_each.
Testing Terraform (Yes, You Should Test Your IaC)
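If the subnets already exist under count, switching to for_each changes every resource address, and Terraform would plan to replace them all. A sketch of the migration, assuming the original three-element list is still in place: map each old index to its new key with moved blocks.

```hcl
# Map each old count index to its new for_each key so Terraform
# updates addresses in state instead of destroying and recreating.
moved {
  from = azurerm_subnet.main[0]
  to   = azurerm_subnet.main["web"]
}

moved {
  from = azurerm_subnet.main[1]
  to   = azurerm_subnet.main["app"]
}

moved {
  from = azurerm_subnet.main[2]
  to   = azurerm_subnet.main["data"]
}
```

Apply this change on its own first. The plan should report the moves with nothing to add or destroy, and only after that is it safe to remove "app" from the list.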
"I'll just run terraform plan and check it manually" is the IaC equivalent of "I'll just test in production."
Testing Pyramid for Terraform
             ┌─────────────┐
             │  E2E Tests  │  ← Deploy real infra, validate,
             │ (Terratest) │    destroy. Slow but complete.
             └──────┬──────┘
                    │
           ┌────────┴────────┐
           │   Integration   │  ← terraform plan + validate
           │ (Plan Analysis) │    Check plan output for issues
           └────────┬────────┘
                    │
      ┌─────────────┴─────────────┐
      │      Static Analysis      │  ← No terraform needed!
      │ (tflint, checkov, trivy)  │    Fast, catches 80% of issues
      └─────────────┬─────────────┘
                    │
┌───────────────────┴───────────────────┐
│ Unit Tests (terraform validate, fmt)  │  ← Sub-second
│ Pre-commit hooks                      │
└───────────────────────────────────────┘
Quick Static Analysis Setup
# Install tflint
brew install tflint # or scoop install tflint on Windows
# .tflint.hcl
plugin "azurerm" {
  enabled = true
  version = "0.27.0"
  source  = "github.com/terraform-linters/tflint-ruleset-azurerm"
}
rule "terraform_naming_convention" {
  enabled = true
  format  = "snake_case"
}
# Run it
tflint --init
tflint --recursive
# The kind of thing you want flagged (illustrative; what you actually see
# depends on the rules and policies you enable):
# Warning: azurerm_storage_account: "account_replication_type" should be "GRS"
#   for production workloads
# Warning: azurerm_kubernetes_cluster: "sku_tier" should be "Standard"
#   (not "Free") for production
Checkov for Security Scanning
checkov -d . --framework terraform
# Example output (illustrative):
# Passed checks: 142, Failed checks: 7, Skipped checks: 3
#
# Check: CKV_AZURE_35: "Ensure default network access rule for Storage Accounts is set to deny"
#   FAILED for resource: azurerm_storage_account.main
#
# Check: CKV_AZURE_23: "Ensure that 'Auditing' is set to 'On' for SQL servers"
#   FAILED for resource: azurerm_mssql_server.main
Multi-Environment Patterns
The Big Question: Workspaces vs. Directories vs. Terragrunt?
| Approach | How it Works | When to Use | Gotcha |
|---|---|---|---|
| Workspaces | Same code, terraform workspace select prod | Simple apps, identical envs | Shared state backend, single plan file: risky |
| Directory per env | envs/dev/, envs/prod/ with different .tfvars | Most teams | Code duplication if not using modules well |
| Terragrunt | DRY configs, dependency management, auto-backend | Large orgs, many envs | Learning curve, another tool to maintain |
The Pattern That Works for Most Teams
infrastructure/
├── modules/                  # Shared modules
│   ├── networking/
│   ├── aks-cluster/
│   └── database/
│
├── environments/
│   ├── dev/
│   │   ├── main.tf           # Calls modules with dev settings
│   │   ├── variables.tf
│   │   ├── dev.tfvars        # env-specific values
│   │   └── backend.tf        # Points to dev state file
│   │
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── staging.tfvars
│   │   └── backend.tf
│   │
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       ├── prod.tfvars
│       └── backend.tf        # Points to SEPARATE prod state file
│
└── global/                   # Shared resources (DNS zones, etc.)
    ├── main.tf
    └── backend.tf
Real-World Disaster #4: The Workspace Mixup
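To make the layout concrete, here is roughly what environments/prod/main.tf could look like. The module input and output names (environment, address_space, subnet_id, aks_subnet_id) are assumptions; the real ones depend on what each module declares.

```hcl
# environments/prod/main.tf (sketch). Input and output names are
# assumptions, not an interface defined by this article.
variable "environment" {
  type = string
}

variable "address_space" {
  type = list(string)
}

module "networking" {
  source        = "../../modules/networking"
  environment   = var.environment
  address_space = var.address_space
}

module "aks_cluster" {
  source      = "../../modules/aks-cluster"
  environment = var.environment
  subnet_id   = module.networking.aks_subnet_id # hypothetical module output
}
```

From the repository root, terraform -chdir=environments/prod plan -var-file=prod.tfvars targets prod explicitly. The environment is part of the path you type, which is a good chunk of why this pattern is harder to fat-finger than workspaces.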
What Happened: Engineer ran terraform apply thinking they were in the dev workspace. They were in prod. 12 resources destroyed and recreated. 35 minutes of downtime.
# THE MOMENT OF HORROR:
$ terraform workspace show
prod
$ terraform apply -auto-approve
The Fix:
- Never use -auto-approve in production
- Add a workspace check to your terraform wrapper:
#!/bin/bash
# safe-terraform.sh: ask for explicit confirmation before touching prod
CURRENT_WORKSPACE=$(terraform workspace show)
if [[ "$CURRENT_WORKSPACE" == "prod" ]]; then
  echo "WARNING: You are targeting PROD!"
  echo "Type 'yes-i-mean-prod' to continue:"
  read -r confirmation
  if [[ "$confirmation" != "yes-i-mean-prod" ]]; then
    echo "Aborting. Good choice."
    exit 1
  fi
fi
terraform "$@"
- Better yet: Use separate directories per environment instead of workspaces. Physical separation > logical separation.
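If you do stay on workspaces, the same guard can live inside the configuration so it also fires in CI, not just on laptops that remember to use the wrapper. A sketch, assuming Terraform 1.4+ (for terraform_data) and that each environment's .tfvars file sets a variable called expected_workspace. Both names are illustrative.

```hcl
# Assumes Terraform 1.4+ and that dev.tfvars / prod.tfvars each set
# expected_workspace. Variable and resource names are illustrative.
variable "expected_workspace" {
  type        = string
  description = "Workspace this variable file is meant for."
}

resource "terraform_data" "workspace_guard" {
  lifecycle {
    # Fails the plan when the selected workspace and the var file disagree.
    precondition {
      condition     = terraform.workspace == var.expected_workspace
      error_message = "Selected workspace does not match expected_workspace."
    }
  }
}
```

Running terraform plan -var-file=dev.tfvars while the prod workspace is selected then fails with an error instead of producing a plan against the wrong state.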
The moved Block: Refactoring Without Tears
One of Terraform's best features (added in 1.1) that too few people know about:
# You moved a resource from the root module:
#   resource "azurerm_kubernetes_cluster" "main" { ... }
#
# Into a module:
#   module "aks" {
#     source = "./modules/aks"
#   }
#
# Without `moved`, Terraform would DESTROY the old cluster
# and CREATE a new one. With `moved`:
moved {
  from = azurerm_kubernetes_cluster.main
  to   = module.aks.azurerm_kubernetes_cluster.main
}
# Now Terraform knows it's the SAME resource, just moved.
# No destruction. No downtime. Just a state update.
This is a career-saver when refactoring large codebases.
Principal-Level Terraform Wisdom
The Golden Rules
1. State isolation per blast radius
   └─ prod networking ≠ prod application ≠ dev anything
2. Module versioning is non-negotiable
   └─ source = "git::https://...//modules/aks?ref=v2.1.0"
3. Plan in CI, Apply in CD
   └─ PR → terraform plan (comment on PR) → merge → terraform apply
4. Never terraform apply from a laptop in production
   └─ Pipeline or nothing
5. Import before you destroy
   └─ Existing resources? terraform import, don't recreate
6. State locking + remote backend or don't bother
   └─ Local state in a team = guaranteed disaster
Key Takeaways
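Rule 1 raises an obvious question: if networking and applications live in separate states, how does the application stack find the subnet? One common answer is the terraform_remote_state data source. The sketch below reuses the backend settings from the earlier backend.tf example and assumes the networking root module declares an output named aks_subnet_id, which is a hypothetical name.

```hcl
# Read outputs from the networking state without sharing a state file.
# Assumes the networking root module exports an "aks_subnet_id" output.
data "terraform_remote_state" "networking" {
  backend = "azurerm"

  config = {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "stterraformstateprod"
    container_name       = "tfstate"
    key                  = "prod/networking.tfstate"
    use_azuread_auth     = true
  }
}

locals {
  # Only root-level outputs of the other state are visible here.
  aks_subnet_id = data.terraform_remote_state.networking.outputs.aks_subnet_id
}
```

The application stack gets read-only access to what networking chose to export, and a bad apply in one state cannot touch resources tracked in the other.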
- State files are sacred: remote backend, versioned, soft-deleted, geo-replicated
- for_each > count: always, unless it's a simple on/off toggle
- Module versioning prevents breaking changes from cascading
- Test your IaC: tflint + checkov catches most issues before plan
- Separate environments by directory, not just workspaces
- moved blocks let you refactor without destroying resources
- Never -auto-approve in production. Ever. EVER.
Homework
- Check if your Terraform state backend has soft-delete enabled: az storage account blob-service-properties show --account-name <name> --resource-group <rg> --query deleteRetentionPolicy
- Run checkov -d . on your Terraform code and fix the Critical findings
- Find any count usage that should be for_each and refactor it (use moved blocks!)
Next up in the series: *Your CI/CD Pipeline is a Dumpster Fire - Here's the Extinguisher*, where we optimize 45-minute builds to 5 minutes, standardize pipelines across teams, and decode DORA metrics.
What's your worst terraform destroy story? Did you survive? Drop it below. Therapy is free.