What happens when theory meets production reality? This week pushed me from basic Infrastructure as Code to enterprise-grade Terraform patterns — managing remote state across teams, isolating dev/prod environments with workspaces, and deploying a complete three-tier application stack with private databases, all while battling real-world AWS constraints and mysterious 403 errors at 2 AM.
🎯 The Production Reality Check: Why This Week Changed Everything 💡
Let me be blunt: Week 7 Part 1 was the warm-up.
Part 2 was the real game — where you discover that terraform apply in production isn't just a command, it's a responsibility that can break systems for entire teams if you mess up state management.
🤔 The Questions That Kept Me Up at Night
Q: "What happens when two engineers run terraform apply simultaneously?"
A: Without state locking, you get corrupted infrastructure. Period. This is why DynamoDB state locks exist—and why understanding them isn't optional.
Q: "How do you deploy identical infrastructure for dev, staging, and prod without copy-pasting code?"
A: Terraform workspaces. But here's the catch — misuse them and you'll accidentally destroy production thinking you're in dev.
Q: "Why can't I just terraform apply against my RDS database?"
A: Because AWS requires subnets in at least 2 Availability Zones for Multi-AZ RDS deployments. One subnet = instant failure. This error cost me 3 hours.
🏗️ Remote Backend & State Locking - The Foundation 🔒
What I Built
- S3 Backend: Centralized Terraform state storage with versioning enabled
- DynamoDB Locking: Prevented concurrent modifications (LockID as partition key)
- Multi-Workspace Setup: Dev and prod environments with isolated state files
The Deep Dive: How State Locking Actually Works
When you run terraform apply, here's what happens behind the scenes:
- Lock Acquisition: Terraform creates an entry in DynamoDB with a unique LockID
- State Read: Downloads current state from S3 bucket
- Plan Execution: Computes infrastructure changes
- State Write: Updates S3 with new state
- Lock Release: Deletes DynamoDB entry
💡 The Critical Part: If another user tries terraform apply during this window, they hit a lock error and must wait. Without this? Race conditions destroy your infrastructure.
Command Evidence:
# Create S3 bucket for remote state
aws s3 mb s3://epicbook-terraform-state-$(date +%s)-suvrajeet --region ap-south-1
aws s3api put-bucket-versioning --bucket <bucket-name> --versioning-configuration Status=Enabled
# Create DynamoDB table for locking
aws dynamodb create-table \
--table-name epicbook-terraform-locks \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5 \
--region ap-south-1
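The same backend plumbing can also be bootstrapped with Terraform itself, from a small separate configuration (it has to be separate—a backend can't store its own state before it exists). A minimal sketch, with resource names assumed rather than taken from the repo:
resource "aws_s3_bucket" "tf_state" {
  bucket = "epicbook-terraform-state-<timestamp>-suvrajeet"
}

resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "tf_locks" {
  name         = "epicbook-terraform-locks"
  billing_mode = "PAY_PER_REQUEST" # on-demand; the CLI example above uses provisioned 5/5 instead
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}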
Backend Configuration:
terraform {
  backend "s3" {
    bucket         = "epicbook-terraform-state-<timestamp>-suvrajeet"
    key            = "epicbook/terraform.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "epicbook-terraform-locks"
    encrypt        = true
  }
}
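Since variables aren't allowed inside backend blocks, a common workaround is partial configuration: leave the block empty and feed the values in at init time. A minimal sketch (the backend.hcl file name is my choice, not the repo's):
# main.tf — backend left empty on purpose
terraform {
  backend "s3" {}
}

# backend.hcl — supplied with: terraform init -backend-config=backend.hcl
bucket         = "epicbook-terraform-state-<timestamp>-suvrajeet"
key            = "epicbook/terraform.tfstate"
region         = "ap-south-1"
dynamodb_table = "epicbook-terraform-locks"
encrypt        = true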
🚨 Challenge #1: The Locking Proof Test
The Setup: Open two terminals, both targeting the same workspace.
Terminal A: Run terraform apply, pause at approval prompt
Terminal B: Try terraform plan
💡 Expected Behavior: Terminal B blocks with "Acquiring state lock. This may take a few moments..."
💊 What I Learned: This isn't just theoretical—in real teams, this prevents disasters when multiple engineers push changes via CI/CD pipelines simultaneously.
🔄 Terraform Workspaces - One Codebase, Multiple Realities 🌐
The Power of Workspaces
Imagine deploying a React app to both dev and prod URLs using identical Terraform code—just switching workspace contexts. That's what workspaces enable.
How It Works Under the Hood
# Create workspaces
terraform workspace new dev
terraform workspace new prod
# Deploy to dev
terraform workspace select dev
terraform apply -var-file=dev.tfvars
# Deploy to prod (completely isolated)
terraform workspace select prod
terraform apply -var-file=prod.tfvars
State File Magic:
- Dev state (local backend): terraform.tfstate.d/dev/terraform.tfstate
- Prod state (local backend): terraform.tfstate.d/prod/terraform.tfstate
- With the S3 backend, each workspace's state instead lands under its own env:/dev/ or env:/prod/ prefix in the bucket
💡 Zero cross-contamination. Destroy dev? Prod is untouched.
Dynamic Configuration with Locals
locals {
  env = terraform.workspace

  env_configs = {
    dev = {
      name_suffix = "dev"
      region      = "ap-south-1"
      tags        = { Environment = "dev" }
    }
    prod = {
      name_suffix = "prod"
      region      = "ap-south-1"
      tags        = { Environment = "prod" }
    }
  }

  config = local.env_configs[local.env]
}

resource "aws_s3_bucket" "app" {
  bucket = "react-app-${local.config.name_suffix}-${random_string.suffix.result}"
  tags   = local.config.tags
}
💡 Why This Matters: One main.tf, infinite environments. Change a variable map, not your entire codebase.
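One extra guard worth adding (my addition, not part of the original setup—requires Terraform 1.2+): a lifecycle precondition that catches the classic mistake of selecting one workspace but passing the other environment's tfvars file:
variable "environment" {
  type = string # set in dev.tfvars / prod.tfvars
}

resource "aws_s3_bucket" "app" {
  bucket = "react-app-${local.config.name_suffix}-${random_string.suffix.result}"
  tags   = local.config.tags

  lifecycle {
    precondition {
      # fails the plan if `terraform workspace select prod` is combined with -var-file=dev.tfvars
      condition     = var.environment == terraform.workspace
      error_message = "Selected workspace does not match var.environment from the tfvars file."
    }
  }
}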
🚨 Challenge #2: The S3 Public Policy 403 Nightmare
Error Received:
Error: putting S3 Bucket Policy: AccessDenied: User is not authorized to perform: s3:PutBucketPolicy
because public policies are blocked by the BlockPublicPolicy block public access setting.
Root Cause: AWS account-level Block Public Access settings were enabled—overriding my Terraform block_public_policy = false.
The Fix:
# Disable account-level block
aws s3control put-public-access-block \
--account-id <account-id> \
--public-access-block-configuration \
BlockPublicAcls=false,IgnorePublicAcls=false,BlockPublicPolicy=false,RestrictPublicBuckets=false
💊 Lesson Learned: Always check higher-level AWS policies (account, SCP) before blaming Terraform code.
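Once the account-level block is lifted, the bucket-level settings are worth keeping in Terraform so the intent stays visible in code. A sketch assuming the aws_s3_bucket.app resource from earlier (the policy content is illustrative):
resource "aws_s3_bucket_public_access_block" "app" {
  bucket                  = aws_s3_bucket.app.id
  block_public_acls       = false
  block_public_policy     = false
  ignore_public_acls      = false
  restrict_public_buckets = false
}

resource "aws_s3_bucket_policy" "app" {
  bucket = aws_s3_bucket.app.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "PublicReadGetObject"
      Effect    = "Allow"
      Principal = "*"
      Action    = "s3:GetObject"
      Resource  = "${aws_s3_bucket.app.arn}/*"
    }]
  })

  # make sure the bucket-level block is relaxed before the policy is written
  depends_on = [aws_s3_bucket_public_access_block.app]
}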
🏢 Production-Grade Full-Stack Deployment - The Final Boss 🎮
This demonstration combined everything: remote state, workspaces, AND deploying a complete three-tier application (VPC → RDS MySQL → EC2 with Nginx/Node.js) across dev and prod.
The Architecture
Network Module:
- VPC: 10.0.0.0/16
- Public Subnet: 10.0.1.0/24 (EC2 instances)
- Private Subnets: 10.0.2.0/24 & 10.0.3.0/24 (RDS Multi-AZ)
- Security Groups: SSH (22) + HTTP (80) for EC2, MySQL (3306) from EC2 SG only for RDS
Database Module:
- RDS MySQL db.t3.micro (free tier)
- Private access only (no public IP)
- Multi-AZ deployment for high availability
- Automated database initialization via EC2 user data
Compute Module:
- EC2 t2.micro, Ubuntu 22.04
- Automated provisioning: Node.js, Nginx, MySQL client, PM2
- Git clone → npm install → build → deploy → Nginx reverse proxy
- Environment variables injected for DB connection
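Before the challenges, here's roughly how the three modules chain together at the root—the module paths, input names, and outputs below are assumptions for illustration, not the exact repo layout:
module "network" {
  source      = "./modules/network"
  name_prefix = "${terraform.workspace}-epicbook"
  region      = "ap-south-1"
  tags        = { Environment = terraform.workspace }
}

module "database" {
  source      = "./modules/database"
  name_prefix = "${terraform.workspace}-epicbook"
  subnet_ids  = module.network.private_subnet_ids # both private subnets for the Multi-AZ subnet group
  ec2_sg_id   = module.network.ec2_security_group_id
  tags        = { Environment = terraform.workspace }
}

module "compute" {
  source      = "./modules/compute"
  name_prefix = "${terraform.workspace}-epicbook"
  subnet_id   = module.network.public_subnet_id
  db_endpoint = module.database.db_endpoint # "hostname:3306" — port stripped before use
  tags        = { Environment = terraform.workspace }
}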
🚨 Challenge #3: The Multi-AZ Subnet Requirement Hell
Error That Broke Me:
Error: creating RDS DB Instance: InvalidParameterCombination:
Cannot create a Multi-AZ DB instance with only 1 subnet.
You must specify at least 2 subnets in different Availability Zones.
What Went Wrong: My original network module only created one private subnet. RDS Multi-AZ requires subnets in at least two AZs for failover.
The Solution:
# modules/network/main.tf
resource "aws_subnet" "private_a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.2.0/24"
  availability_zone = "${var.region}a"
  tags              = merge(var.tags, { Name = "${var.name_prefix}-private-subnet-a" })
}

resource "aws_subnet" "private_b" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.3.0/24"
  availability_zone = "${var.region}b"
  tags              = merge(var.tags, { Name = "${var.name_prefix}-private-subnet-b" })
}

# Output both for RDS subnet group
output "private_subnet_ids" {
  value = [aws_subnet.private_a.id, aws_subnet.private_b.id]
}
Module Usage:
# modules/database/main.tf
resource "aws_db_subnet_group" "epicbook" {
  name       = "${var.name_prefix}-db-subnet-group"
  subnet_ids = var.subnet_ids # Now passes TWO subnets
  tags       = var.tags
}
💊 Lesson Learned: AWS enforces multi-AZ requirements at the API level. Terraform can't override cloud provider constraints—you must architect correctly from the start.
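A more general version of the same fix—deriving the AZs instead of hardcoding the "a"/"b" suffixes—looks something like this (a sketch, variable names assumed):
variable "private_subnet_cidrs" {
  type    = list(string)
  default = ["10.0.2.0/24", "10.0.3.0/24"]
}

data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "private" {
  count      = length(var.private_subnet_cidrs)
  vpc_id     = aws_vpc.main.id
  cidr_block = var.private_subnet_cidrs[count.index]
  # spread the subnets across the first N available AZs in the region
  availability_zone = data.aws_availability_zones.available.names[count.index]
  tags              = merge(var.tags, { Name = "${var.name_prefix}-private-subnet-${count.index + 1}" })
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}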
🚨 Challenge #4: RDS Endpoint Port Parsing Bug
The Issue: RDS returns endpoints as hostname:3306, but MySQL connection strings expect just the hostname.
Error in Logs:
mysql: [ERROR] Failed to connect to 'dev-epicbook-db.abc.ap-south-1.rds.amazonaws.com:3306:3306'
The Problem in Code:
# WRONG - passes endpoint with :3306 already appended
userdata = templatefile("${path.module}/userdata.sh", {
db_endpoint = module.database.db_endpoint # Includes :3306
})
The Fix:
# CORRECT - strip port using split()
userdata = base64encode(templatefile("${path.module}/userdata.sh", {
db_endpoint = split(":", module.database.db_endpoint)[0] # Hostname only
}))
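Worth knowing: if the module output comes straight from an aws_db_instance, the provider also exposes an address attribute that is already the bare hostname, which makes the split() unnecessary. A sketch with assumed resource and output names:
# modules/database/outputs.tf
output "db_endpoint" {
  value = aws_db_instance.epicbook.endpoint # "hostname:3306"
}

output "db_host" {
  value = aws_db_instance.epicbook.address # hostname only, no port
}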
User Data Script:
#!/bin/bash
DB_ENDPOINT="${db_endpoint}"
mysql -h $DB_ENDPOINT -u epicadmin -p${db_password} -e "CREATE DATABASE IF NOT EXISTS ${db_name}"
mysql -h $DB_ENDPOINT -u epicadmin -p${db_password} ${db_name} < db/schema.sql
Validation:
# SSH into EC2
ssh -i ~/.ssh/suvrajeet.key.pem ubuntu@$(terraform output -raw ec2_public_ip)
# Test DB connection
mysql -h $(terraform output -raw db_endpoint | cut -d: -f1) \
-u epicadmin -p<password> -e "SHOW DATABASES;"
📚 Concepts Deep Dive: The "Why" Behind Everything 🧠
What is the Terraform State File?
The .tfstate file is a JSON snapshot of your infrastructure's current state. It maps your Terraform code to real-world resource IDs (EC2 instance IDs, S3 bucket ARNs, etc.).
Why It Matters:
- Without it: Terraform can't track what exists, leading to duplicate resource creation
- Local state: File stored on your machine—team collaboration impossible
- Remote state: Centralized in S3—enables team workflows and CI/CD
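Closely related: once state lives in S3, other configurations can read its outputs with the terraform_remote_state data source. A sketch (the key and output name here are hypothetical):
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "epicbook-terraform-state-<timestamp>-suvrajeet"
    key    = "network/terraform.tfstate"
    region = "ap-south-1"
  }
}

locals {
  # assumes the other configuration declares an output named "vpc_id"
  shared_vpc_id = data.terraform_remote_state.network.outputs.vpc_id
}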
What is the Lock File (.terraform.lock.hcl)?
This file locks provider versions to ensure consistency across team members.
Example:
provider "registry.terraform.io/hashicorp/aws" {
version = "5.0.1"
hashes = [
"h1:abc123...",
]
}
Purpose: Prevents "works on my machine" issues caused by provider version mismatches.
Backend Initialization: What Happens During terraform init?
- Downloads provider plugins (AWS, Azure, etc.)
- Configures remote backend (S3 + DynamoDB) if any
- Creates the .terraform directory with cached plugins
- Generates .terraform.lock.hcl with pinned provider versions
Why is the AWS provider ~700 MB? It bundles schemas and API bindings for hundreds of AWS services and well over a thousand resource types.
Export Commands Explained
Q: What does export BUCKET_NAME=... do?
A: Creates a shell environment variable for reuse across commands. Prevents typos and enables scripting.
export BUCKET_NAME=epicbook-tf-state-$(date +%s)
aws s3 mb s3://$BUCKET_NAME # Reuses variable
Q: What does aws sts get-caller-identity do?
A: Verifies your AWS CLI authentication by returning your IAM user/role ARN.
aws sts get-caller-identity
# Output: { "UserId": "...", "Account": "970107226849", "Arn": "arn:aws:iam::..." }
🎓 Interview Questions: Ace Your Terraform Technical Screen 💼
Basic Level
Q1: What is Terraform state and why does it need locking?
A: State tracks infrastructure reality. Locking prevents concurrent modifications that corrupt state, causing resource conflicts or deletions.
Q2: Explain the difference between terraform plan and terraform apply.
A: plan previews changes (read-only), apply executes them (write operation). Always run plan first to catch errors.
Q3: How do workspaces differ from separate directories?
A: Workspaces share code but isolate state files—ideal for similar environments (dev/prod). Separate directories isolate everything—better for completely different stacks.
Intermediate Level
Q4: How does DynamoDB enable Terraform state locking?
A: Terraform writes a lock item with LockID as the partition key. Concurrent operations fail until the lock is released. DynamoDB provides atomic conditional writes.
Q5: What happens if you delete a workspace's state file?
A: Terraform loses track of resources—they still exist in AWS but Terraform can't manage them. Fix via terraform import to rebuild state.
Q6: Why use split(":", db_endpoint)[0] for RDS endpoints?
A: RDS returns hostname:3306, but connection strings expect just the hostname. Split extracts the first element (hostname only).
Advanced/Tricky Level
Q7: You run terraform apply in prod workspace but state is in dev. What breaks?
A: Terraform uses the wrong state file—it sees empty state and tries to create duplicate resources in prod, causing name conflicts or overwriting existing infrastructure.
Q8: How do you migrate local state to S3 without losing resources?
A: Add backend config, run terraform init -migrate-state, confirm migration, verify state in S3. Terraform moves state seamlessly.
Q9: Why does Multi-AZ RDS require 2+ subnets in different AZs?
A: AWS places primary instance in one AZ, standby replica in another for high availability. Single subnet = single point of failure.
Q10: How would you prevent accidental terraform destroy in production?
A: Use lifecycle prevent_destroy = true, require approval in CI/CD pipelines, restrict IAM permissions, add confirmation prompts in wrapper scripts.
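Tying Q10 back to this week's setup, a minimal prevent_destroy sketch (bucket name assumed):
resource "aws_s3_bucket" "tf_state" {
  bucket = "epicbook-terraform-state-prod"

  lifecycle {
    prevent_destroy = true # any plan that would destroy this bucket errors out
  }
}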
🛠️ Troubleshooting Playbook: Solutions to Every Error I Hit 🔧
Error 1: Backend Bucket Not Found
Symptom: Error: Failed to get existing workspaces: bucket does not exist
Fix: Hardcode actual bucket name in backend.tf—variables aren't supported in backend blocks.
Error 2: IAM Permission Denied
Symptom: AccessDenied: User is not authorized to perform: s3:PutBucketPolicy
Fix: Check account-level Block Public Access, verify IAM user has s3:* permissions.
Error 3: RDS Creation Hangs
Symptom: Terraform apply stuck at "Creating RDS instance..." for 15+ minutes
Fix: RDS—especially Multi-AZ—can take 10-20 minutes to provision, so this is usually normal. If it still hangs, check the instance's events (aws rds describe-events or the RDS console) for actual errors.
Error 4: Nginx 502 Bad Gateway
Symptom: Frontend loads but API calls fail
Fix: Backend not running. SSH to EC2, check pm2 status, restart with pm2 restart epicbook-backend.
Error 5: Cannot Connect to Database
Symptom: ERROR 2003: Can't connect to MySQL server on 'hostname'
Fix: Verify security group allows 3306 from EC2 SG, check RDS is in "available" state, test from EC2 instance directly.
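For Error 5, this is the security-group relationship that has to be in place—the source is the EC2 security group, not a CIDR. A sketch with assumed names:
resource "aws_security_group" "ec2" {
  name   = "${var.name_prefix}-ec2-sg"
  vpc_id = aws_vpc.main.id
  # SSH (22) and HTTP (80) ingress rules omitted for brevity
}

resource "aws_security_group" "rds" {
  name   = "${var.name_prefix}-rds-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    description     = "MySQL from the app tier only"
    from_port       = 3306
    to_port         = 3306
    protocol        = "tcp"
    security_groups = [aws_security_group.ec2.id]
  }
}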
🏆 Key Takeaways: What Production-Grade Terraform Really Means 💎
- State Management is Non-Negotiable: Remote state + locking = table stakes for teams
- Workspaces ≠ Magic: Great for similar envs, dangerous if misused—always verify with terraform workspace show
- Cloud Constraints are Real: Terraform can't bypass AWS requirements (Multi-AZ needs 2 subnets)
- Security Layers Stack: IAM + SCPs + Block Public Access—check every level before blaming code
- Automation Needs Validation: User data scripts must handle idempotency, errors, and async operations
- Modular Design Pays Off: Network → Database → Compute dependency chain prevents deployment chaos
- Always Test Locking: The first time two engineers collide on state, you'll thank yourself for DynamoDB
🚀 What's Next? Advanced Terraform Patterns 🔮
- Terragrunt: DRY configurations across multiple modules
- Terraform Cloud: Hosted state management with RBAC
- Policy as Code: Sentinel/OPA for automated compliance checks
- GitOps Workflows: Atlantis for PR-based Terraform automation
📜 This week taught me: Infrastructure as Code isn't just about automation—it's about building systems that teams can trust, modify, and scale without breaking production at 2 AM.
This is Week 7 Part-2 of 12 of the free DevOps cohort organized by Pravin Mishra sir 🙏 in continuation of 🏗️ Mastering Infrastructure as Code: From Manual Chaos to Multi-Cloud Orchestration [Week-7—P1] ⚡
Following my journey from Terraform basics to production-grade patterns—remote state, workspaces, and full-stack deployment mastery. Each week reveals the gap between tutorials and reality. What's your most painful Terraform lesson? Share in the comments! 🔥
🏷️ Tags:
#Terraform #DevOps #AWS #InfrastructureAsCode #RemoteState #Workspaces #RDS #MultiCloud #Production #CloudEngineering #IaC #StateLocking #DynamoDB #S3 #Learning
Read more in this series: DevOps Journey
🐙 Github Links
🔗 Team-Ready State—Remote Backends & Locking (Azure + AWS)
🔗 Deploy a React App with Terraform Workspaces (dev & prod)
🔗 EpicBook on Azure/AWS with Production-Grade Terraform