🎬 The Horror Begins
Error: Error acquiring the state lock
Lock Info:
ID: a1b2c3d4-e5f6-7890-abcd-ef1234567890
Path: terraform.tfstate
Operation: OperationTypeApply
Who: dave@DESKTOP-OOPS
Version: 1.9.0
Created: 2026-03-17 14:32:07.123456 +0000 UTC
Dave. It's always Dave. Dave started a terraform apply, got scared halfway through, closed his laptop, and went to lunch. Now the state is locked, Dave is unreachable, and you have a production deployment waiting.
Welcome to Terraform at Scale — where state files are sacred, locking mechanisms are your best friend, and terraform destroy is a four-letter word.
🏗️ How Terraform Actually Works (The 30-Second Version)
Terraform is deceptively simple. You write what you want (HCL), and Terraform figures out how to get there:
You write .tf files
│
▼
┌─── terraform init ─────────────────┐
│ • Downloads providers (azurerm) │
│ • Initializes backend (where state │
│ is stored) │
│ • Downloads modules │
└─────────────┬──────────────────────┘
│
▼
┌─── terraform plan ─────────────────┐
│ • Reads current state file │
│ • Calls Azure APIs: "What exists?" │
│ • Compares desired vs actual │
│ • Generates execution plan │
│ • "Plan: 3 to add, 1 to change, │
│ 0 to destroy" │
└─────────────┬──────────────────────┘
│
▼
┌─── terraform apply ────────────────┐
│ • Executes the plan │
│ • Calls Azure APIs to create/ │
│ update/delete resources │
│ • Updates state file │
│ • 🙏 Hopes nothing crashes mid-way │
└────────────────────────────────────┘
The secret sauce? The Dependency Graph (DAG). Terraform builds a graph of all your resources and their dependencies, then walks it in the right order:
Resource Group
│
├──▶ VNet ──▶ Subnet ──▶ AKS Cluster
│ └──▶ Private Endpoint
└──▶ Key Vault
Terraform knows to create the Resource Group first, then VNet and Key Vault in parallel (they don't depend on each other), then Subnet, then AKS and Private Endpoint.
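The graph comes from references between resources, not from any explicit ordering you write. A minimal sketch of how those edges appear (resource names and address ranges are illustrative):

```hcl
resource "azurerm_resource_group" "main" {
  name     = "rg-prod"
  location = "eastus"
}

resource "azurerm_virtual_network" "main" {
  name          = "vnet-prod"
  address_space = ["10.0.0.0/16"]
  # Referencing the RG's attributes creates the graph edge RG → VNet
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
}

resource "azurerm_subnet" "web" {
  name                = "snet-web"
  address_prefixes    = ["10.0.1.0/24"]
  resource_group_name = azurerm_resource_group.main.name
  # Edge VNet → Subnet: this reference is why the subnet waits for the VNet
  virtual_network_name = azurerm_virtual_network.main.name
}
```

Key Vault in the diagram references only the resource group, so Terraform is free to create it in parallel with the VNet. `depends_on` exists for the rare case where a dependency is real but isn't expressed through any attribute reference.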
💡 The `-parallelism` flag: By default, Terraform processes 10 resources in parallel. For huge stacks, `terraform apply -parallelism=5` reduces API throttling. For speed, `terraform apply -parallelism=30` speeds things up if your provider can handle it.
📁 State Files: The Crown Jewels
The state file is Terraform's memory. It maps your .tf resources to actual cloud resources. Without it, Terraform has amnesia.
// What's in a state file (simplified):
{
"resources": [
{
"type": "azurerm_resource_group",
"name": "main",
"instances": [{
"attributes": {
"id": "/subscriptions/xxx/resourceGroups/rg-prod",
"name": "rg-prod",
"location": "eastus"
}
}]
}
]
}
🚨 Real-World Disaster #1: The Deleted State File
The Message in #devops-emergency:
@channel I accidentally deleted the terraform.tfstate file from
the storage account. Is everything in production gone?
Good News: Deleting the state file does NOT delete your infrastructure. Your Azure resources are fine.
Bad News: Terraform now has no idea what it manages. Running terraform plan will show it wants to CREATE everything from scratch (which would fail because resources already exist).
The Fix:
Option A: Restore from backup (Azure Storage has soft-delete)
# Check soft-deleted blobs
az storage blob list --account-name tfstate --container-name state \
--include d --query "[?deleted]" -o table
# Restore it
az storage blob undelete --account-name tfstate \
--container-name state --name prod/terraform.tfstate
Option B: If no backup, re-import everything (painful but possible)
# Import each resource manually
terraform import azurerm_resource_group.main \
/subscriptions/xxx/resourceGroups/rg-prod
terraform import azurerm_kubernetes_cluster.main \
/subscriptions/xxx/resourceGroups/rg-prod/providers/Microsoft.ContainerService/managedClusters/aks-prod
# Repeat for every. single. resource. ☕☕☕
Option C (Terraform 1.5+): Use import blocks
import {
to = azurerm_resource_group.main
id = "/subscriptions/xxx/resourceGroups/rg-prod"
}
import {
to = azurerm_kubernetes_cluster.main
id = "/subscriptions/xxx/.../managedClusters/aks-prod"
}
Rule #1 of State: Remote Backend. Always.
# backend.tf — NON-NEGOTIABLE for any real project
terraform {
backend "azurerm" {
resource_group_name = "rg-terraform-state"
storage_account_name = "stterraformstateprod"
container_name = "tfstate"
key = "prod/networking.tfstate"
# These save your life:
use_azuread_auth = true # No access keys!
snapshot = true # Auto-snapshot before write
}
}
Storage Account Protection Checklist
- ✅ Soft-delete enabled (30-day retention)
- ✅ Versioning enabled (every state write is a new version)
- ✅ Lock on the resource group (CanNotDelete)
- ✅ No public access (Private Endpoint or Azure AD auth only)
- ✅ Geo-redundant storage (GRS or RA-GRS)
- ✅ Azure AD authentication (not storage keys)
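That checklist can itself be codified. Bootstrap it once, outside the state it protects (a sketch with illustrative names; the checklist's resource-group lock works equally well, here the lock is scoped to the account):

```hcl
resource "azurerm_storage_account" "tfstate" {
  name                     = "stterraformstateprod" # illustrative
  resource_group_name      = "rg-terraform-state"
  location                 = "eastus"
  account_tier             = "Standard"
  account_replication_type = "RAGRS" # geo-redundant, readable secondary

  blob_properties {
    versioning_enabled = true # every state write = new version
    delete_retention_policy {
      days = 30 # soft-delete window
    }
  }
}

# CanNotDelete lock so nobody nukes the state account by accident
resource "azurerm_management_lock" "tfstate" {
  name       = "tfstate-cannotdelete"
  scope      = azurerm_storage_account.tfstate.id
  lock_level = "CanNotDelete"
}
```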
🔒 State Locking: Preventing the "Dave Problem"
When someone runs terraform apply, the state file gets locked so nobody else can modify it simultaneously. This prevents two people making conflicting changes.
🚨 Real-World Disaster #2: The Stuck Lock
The Error:
Error: Error acquiring the state lock
Lock Info:
Who: ci-pipeline@runner-xyz
Created: 2026-03-15 09:14:22 UTC
The CI pipeline crashed mid-apply (runner ran out of disk). The lock was never released.
The Fix:
# First: VERIFY the lock holder is actually dead
# (Don't force-unlock if someone is genuinely running apply!)
# Check if the pipeline is still running...
# If confirmed dead:
terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef1234567890
Prevention:
- CI/CD pipelines should have a `timeout` on terraform apply steps
- Use terraform wrapper scripts that catch kill signals and clean up
- Monitor for stale locks (alert if lock age > 30 minutes)
📐 Module Architecture: Building Lego Blocks
Bad Terraform looks like one giant main.tf with 2,000 lines. Good Terraform looks like well-organized Lego blocks that snap together.
The Module Hierarchy
Modules
├── Foundation Modules (building blocks)
│ ├── terraform-azurerm-vnet — Creates a VNet + subnets
│ ├── terraform-azurerm-aks — Creates an AKS cluster
│ ├── terraform-azurerm-keyvault — Creates a Key Vault
│ └── terraform-azurerm-sql — Creates Azure SQL
│
├── Composition Modules (patterns)
│ ├── terraform-azurerm-landing-zone — Combines: VNet + NSGs + DNS
│ ├── terraform-azurerm-app-stack — Combines: AKS + ACR + KeyVault
│ └── terraform-azurerm-data-stack — Combines: SQL + Redis + Storage
│
└── Root Modules (deployments)
├── prod/networking/ — Uses landing-zone module
├── prod/applications/ — Uses app-stack module
└── dev/ — Uses same modules, different vars
Module Do's and Don'ts
✅ DO:
• Version your modules (git tags: v1.0.0, v1.1.0)
• Pin module versions in consumers
• Include validation on variables
• Output everything consumers might need
• Include a README with examples
❌ DON'T:
• Put provider config in modules (let the root decide)
• Hardcode values (that's what variables are for)
• Create God Modules that do everything
• Use count when for_each works (index drift = pain)
• Skip validation rules on variables
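"Include validation on variables" in practice looks like this (variable names are illustrative):

```hcl
variable "environment" {
  type        = string
  description = "Deployment environment"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}

variable "address_space" {
  type = string

  validation {
    # can() turns a parse failure into false instead of a hard error
    condition     = can(cidrhost(var.address_space, 0))
    error_message = "address_space must be a valid CIDR block, e.g. 10.0.0.0/16."
  }
}
```

Bad values now fail at `terraform plan` with your error message, instead of surfacing as a cryptic provider error mid-apply.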
🚨 Real-World Disaster #3: The count Index Shift
The Setup:
# BAD: Using count with a list
variable "subnets" {
default = ["web", "app", "data"]
}
resource "azurerm_subnet" "main" {
count = length(var.subnets)
name = var.subnets[count.index]
# ...
}
What Happened: Someone removed "app" from the list → ["web", "data"]. Terraform's plan:
# Destroy: azurerm_subnet.main[1] ("app") ← Correct
# Destroy: azurerm_subnet.main[2] ("data") ← WAIT WHAT
# Create: azurerm_subnet.main[1] ("data") ← WHY
# It's destroying and recreating "data" because its INDEX changed
# from 2 to 1! Everything in that subnet (VMs, AKS) will be destroyed!
The Fix: Use for_each instead:
# GOOD: Using for_each with stable keys
resource "azurerm_subnet" "main" {
for_each = toset(var.subnets)
name = each.value
# ...
}
# Now removing "app" only destroys "app". "web" and "data" are untouched.
# Resources are keyed by NAME, not index position.
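If you're migrating an already-deployed count-based resource to for_each, pair the code change with `moved` blocks so Terraform knows the old indexed addresses and the new keyed addresses are the same objects (addresses match the subnet example above):

```hcl
# State moves for the count → for_each migration
moved {
  from = azurerm_subnet.main[0]
  to   = azurerm_subnet.main["web"]
}

moved {
  from = azurerm_subnet.main[1]
  to   = azurerm_subnet.main["app"]
}

moved {
  from = azurerm_subnet.main[2]
  to   = azurerm_subnet.main["data"]
}
```

`terraform plan` should then report the moves and zero destroys. (`terraform state mv` does the same thing imperatively, but `moved` blocks are reviewable in the PR.)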
💡 Rule: `count` is only for `count = var.enable_feature ? 1 : 0` (conditional creation). For everything else, use `for_each`.
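For completeness, the one legitimate count pattern (feature flag and lock name are illustrative):

```hcl
variable "enable_lock" {
  type    = bool
  default = true
}

resource "azurerm_management_lock" "rg" {
  count      = var.enable_lock ? 1 : 0 # 0 or 1 instances: count's only legitimate job
  name       = "rg-cannotdelete"
  scope      = azurerm_resource_group.main.id
  lock_level = "CanNotDelete"
}

# Downstream references must index the (possibly empty) list:
output "lock_id" {
  value = var.enable_lock ? azurerm_management_lock.rg[0].id : null
}
```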
🧪 Testing Terraform (Yes, You Should Test Your IaC)
"I'll just run terraform plan and check it manually" is the IaC equivalent of "I'll just test in production."
Testing Pyramid for Terraform
┌─────────────┐
│ E2E Tests │ ← Deploy real infra, validate,
│ (Terratest)│ destroy. Slow but complete.
└──────┬──────┘
│
┌────────▼────────┐
│ Integration │ ← terraform plan + validate
│ (Plan Analysis) │ Check plan output for issues
└────────┬────────┘
│
┌─────────────▼──────────────┐
│ Static Analysis │ ← No terraform needed!
│ (tflint, checkov, trivy) │ Fast, catches 80% of issues
└─────────────┬──────────────┘
│
┌──────────────────▼────────────────────┐
│ Unit Tests (terraform validate, fmt) │ ← Sub-second
│ Pre-commit hooks │
└───────────────────────────────────────┘
Quick Static Analysis Setup
# Install tflint
brew install tflint # or scoop install tflint on Windows
# .tflint.hcl
plugin "azurerm" {
enabled = true
version = "0.27.0"
source = "github.com/terraform-linters/tflint-ruleset-azurerm"
}
rule "terraform_naming_convention" {
enabled = true
format = "snake_case"
}
# Run it
tflint --init
tflint --recursive
# Common catches:
# ⚠ azurerm_storage_account: "account_replication_type" should be "GRS"
# for production workloads
# ⚠ azurerm_kubernetes_cluster: "sku_tier" should be "Standard"
# (not "Free") for production
Checkov for Security Scanning
checkov -d . --framework terraform
# Output:
# Passed: 142
# Failed: 7
# Skipped: 3
#
# Check: CKV_AZURE_35: "Ensure storage account has access logging"
# FAILED for resource: azurerm_storage_account.main
#
# Check: CKV_AZURE_1: "Ensure Azure SQL is using managed identity"
# FAILED for resource: azurerm_mssql_server.main
🔄 Multi-Environment Patterns
The Big Question: Workspaces vs. Directories vs. Terragrunt?
| Approach | How it Works | When to Use | Gotcha |
|---|---|---|---|
| Workspaces | Same code, `terraform workspace select prod` | Simple apps, identical envs | Shared state backend, single plan file — risky |
| Directory per env | `envs/dev/`, `envs/prod/` with different `.tfvars` | Most teams | Code duplication if not using modules well |
| Terragrunt | DRY configs, dependency management, auto-backend | Large orgs, many envs | Learning curve, another tool to maintain |
The Pattern That Works for Most Teams
infrastructure/
├── modules/ # Shared modules
│ ├── networking/
│ ├── aks-cluster/
│ └── database/
│
├── environments/
│ ├── dev/
│ │ ├── main.tf # Calls modules with dev settings
│ │ ├── variables.tf
│ │ ├── dev.tfvars # env-specific values
│ │ └── backend.tf # Points to dev state file
│ │
│ ├── staging/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── staging.tfvars
│ │ └── backend.tf
│ │
│ └── prod/
│ ├── main.tf
│ ├── variables.tf
│ ├── prod.tfvars
│ └── backend.tf # Points to SEPARATE prod state file
│
└── global/ # Shared resources (DNS zones, etc.)
├── main.tf
└── backend.tf
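Each environment directory is then a thin root module; only the values differ. A sketch (module paths, variables, and outputs are illustrative):

```hcl
# environments/prod/main.tf — thin wrapper, all real logic lives in modules/
module "networking" {
  source        = "../../modules/networking"
  environment   = "prod"
  address_space = "10.100.0.0/16"
}

module "aks" {
  source      = "../../modules/aks-cluster"
  environment = "prod"
  subnet_id   = module.networking.aks_subnet_id # composition via module outputs
  node_count  = 5 # dev/main.tf would say 1 here
  sku_tier    = "Standard"
}
```

`terraform apply -var-file=prod.tfvars` run in that directory can only ever touch the prod state file named in its backend.tf.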
🚨 Real-World Disaster #4: The Workspace Mixup
What Happened: Engineer ran terraform apply thinking they were in the dev workspace. They were in prod. 12 resources destroyed and recreated. 35 minutes of downtime.
# THE MOMENT OF HORROR:
$ terraform workspace show
prod
$ terraform apply -auto-approve
# 💀💀💀
The Fix:
- Never use `-auto-approve` in production
- Add a workspace check to your terraform wrapper:
#!/bin/bash
# safe-terraform.sh
CURRENT_WORKSPACE=$(terraform workspace show)
if [[ "$CURRENT_WORKSPACE" == "prod" ]]; then
echo "⚠️ WARNING: You are targeting PROD!"
echo "Type 'yes-i-mean-prod' to continue:"
read confirmation
if [[ "$confirmation" != "yes-i-mean-prod" ]]; then
echo "Aborting. Good choice."
exit 1
fi
fi
terraform "$@"
- Better yet: Use separate directories per environment instead of workspaces. Physical separation > logical separation.
🚀 The moved Block: Refactoring Without Tears
One of Terraform's best features (added in 1.1) that too few people know about:
# You renamed a resource from this:
# resource "azurerm_kubernetes_cluster" "main" { ... }
#
# To this:
# module "aks" {
# source = "./modules/aks"
# }
#
# Without `moved`, Terraform would DESTROY the old cluster
# and CREATE a new one. With `moved`:
moved {
from = azurerm_kubernetes_cluster.main
to = module.aks.azurerm_kubernetes_cluster.main
}
# Now Terraform knows it's the SAME resource, just moved.
# No destruction. No downtime. Just a state update.
This is a career-saver when refactoring large codebases.
🧠 Principal-Level Terraform Wisdom
The Golden Rules
1. State isolation per blast radius
└─ prod networking ≠ prod application ≠ dev anything
2. Module versioning is non-negotiable
└─ source = "git::https://...//modules/aks?ref=v2.1.0"
3. Plan in CI, Apply in CD
└─ PR → terraform plan (comment on PR) → merge → terraform apply
4. Never terraform apply from a laptop in production
└─ Pipeline or nothing
5. Import before you destroy
└─ Existing resources? terraform import, don't recreate
6. State locking + remote backend or don't bother
└─ Local state in a team = guaranteed disaster
🎯 Key Takeaways
- State files are sacred — remote backend, versioned, soft-deleted, geo-replicated
- `for_each` > `count` — always, unless it's a simple on/off toggle
- Module versioning prevents breaking changes from cascading
- Test your IaC — tflint + checkov catch most issues before `plan`
- Separate environments by directory, not just workspaces
- `moved` blocks let you refactor without destroying resources
- Never `-auto-approve` in production. Ever. EVER.
🔥 Homework
- Check whether your Terraform state storage account has blob soft-delete enabled: `az storage account blob-service-properties show -n <name> -g <rg> --query 'deleteRetentionPolicy'`
- Run `checkov -d .` on your Terraform code — fix the Critical findings
- Find any `count` usage that should be `for_each` and refactor it (use `moved` blocks!)
Next up in the series: *Your CI/CD Pipeline is a Dumpster Fire — Here's the Extinguisher*, where we optimize 45-minute builds to 5 minutes, standardize pipelines across teams, and decode DORA metrics.
💬 What's your worst `terraform destroy` story? Did you survive? Drop it below. Therapy is free. 🛋️