🔐 Terraform Production Battle-Tested: Remote State, Workspaces & Full-Stack AWS Deployment [Week-7—P2] 🚀

What happens when theory meets production reality? This week pushed me from basic Infrastructure as Code to enterprise-grade Terraform patterns — managing remote state across teams, isolating dev/prod environments with workspaces, and deploying a complete three-tier application stack with private databases, all while battling real-world AWS constraints and mysterious 403 errors at 2 AM.



🎯 The Production Reality Check: Why This Week Changed Everything 💡

Let me be blunt: Week 7 Part 1 was the warm-up.
Part 2 was the real game, the one where you discover that terraform apply in production isn't just a command; it's a responsibility that can break systems for entire teams if you mess up state management.

🤔 The Questions That Kept Me Up at Night

Q: "What happens when two engineers run terraform apply simultaneously?"

A: Without state locking, you get corrupted infrastructure. Period. This is why DynamoDB state locks exist—and why understanding them isn't optional.

Q: "How do you deploy identical infrastructure for dev, staging, and prod without copy-pasting code?"

A: Terraform workspaces. But here's the catch — misuse them and you'll accidentally destroy production thinking you're in dev.

Q: "Why can't I just terraform apply against my RDS database?"

A: Because AWS requires subnets in at least 2 Availability Zones for Multi-AZ RDS deployments. One subnet = instant failure. This error cost me 3 hours.



🏗️ Remote Backend & State Locking - The Foundation 🔒

What I Built

  • S3 Backend: Centralized Terraform state storage with versioning enabled
  • DynamoDB Locking: Prevented concurrent modifications (LockID as partition key)
  • Multi-Workspace Setup: Dev and prod environments with isolated state files

The Deep Dive: How State Locking Actually Works

When you run terraform apply, here's what happens behind the scenes:

  1. Lock Acquisition: Terraform creates an entry in DynamoDB with a unique LockID
  2. State Read: Downloads current state from S3 bucket
  3. Plan Execution: Computes infrastructure changes
  4. State Write: Updates S3 with new state
  5. Lock Release: Deletes DynamoDB entry

💡 The Critical Part: If another user tries terraform apply during this window, they hit a lock error and must wait. Without this? Race conditions destroy your infrastructure.
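
Seeing It Live: You can actually watch the lock exist. While one apply is paused at the approval prompt, read the table directly (table and region names from the setup below):

# Inspect the active lock item while an apply holds it
aws dynamodb scan --table-name epicbook-terraform-locks --region ap-south-1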

Command Evidence:

# Create S3 bucket for remote state
aws s3 mb s3://epicbook-terraform-state-$(date +%s)-suvrajeet --region ap-south-1
aws s3api put-bucket-versioning --bucket <bucket-name> --versioning-configuration Status=Enabled

# Create DynamoDB table for locking
aws dynamodb create-table \
  --table-name epicbook-terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5 \
  --region ap-south-1

Backend Configuration:

terraform {
  backend "s3" {
    bucket         = "epicbook-terraform-state-<timestamp>-suvrajeet"
    key            = "epicbook/terraform.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "epicbook-terraform-locks"
    encrypt        = true
  }
}


🚨 Challenge #1: The Locking Proof Test

The Setup: Open two terminals, both targeting the same workspace.

Terminal A: Run terraform apply, pause at approval prompt

Terminal B: Try terraform plan

💡 Expected Behavior: Terminal B blocks with "Acquiring state lock. This may take a few moments..."

💊 What I Learned: This isn't just theoretical—in real teams, this prevents disasters when multiple engineers push changes via CI/CD pipelines simultaneously.
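
One escape hatch worth knowing: if a run crashes mid-apply (Ctrl+C at the wrong moment, a killed CI job) and leaves a stale lock behind, Terraform can release it manually. The lock ID comes from the error message itself:

# Use only when you're certain no other apply is actually running
terraform force-unlock <LOCK_ID>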


🔄 Terraform Workspaces - One Codebase, Multiple Realities 🌐

The Power of Workspaces

Imagine deploying a React app to both dev and prod URLs using identical Terraform code—just switching workspace contexts. That's what workspaces enable.

How It Works Under the Hood

# Create workspaces
terraform workspace new dev
terraform workspace new prod

# Deploy to dev
terraform workspace select dev
terraform apply -var-file=dev.tfvars

# Deploy to prod (completely isolated)
terraform workspace select prod
terraform apply -var-file=prod.tfvars

State File Magic:

  • Dev state (local backend): terraform.tfstate.d/dev/terraform.tfstate
  • Prod state (local backend): terraform.tfstate.d/prod/terraform.tfstate
  • With the S3 backend used here, each workspace's state lives under an env:/<workspace>/ prefix in the bucket

💡 Zero cross-contamination. Destroy dev? Prod is untouched.
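
A habit worth building before every apply: confirm which reality you're in. These are standard Terraform commands; the asterisk marks the active workspace:

terraform workspace list
#   default
# * dev
#   prod

terraform workspace show
# dev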


Dynamic Configuration with Locals

locals {
  env = terraform.workspace
  env_configs = {
    dev = {
      name_suffix = "dev"
      region      = "ap-south-1"
      tags        = { Environment = "dev" }
    }
    prod = {
      name_suffix = "prod"
      region      = "ap-south-1"
      tags        = { Environment = "prod" }
    }
  }
  config = local.env_configs[local.env]
}

resource "aws_s3_bucket" "app" {
  bucket = "react-app-${local.config.name_suffix}-${random_string.suffix.result}"
  tags   = local.config.tags
}

💡 Why This Matters: One main.tf, infinite environments. Change a variable map, not your entire codebase.
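
The same trick scales past tags. Here's a sketch of per-environment sizing driven by terraform.workspace (the AMI data source and instance sizes are illustrative assumptions, not from this repo):

# Sketch: one resource block, different scale per workspace
resource "aws_instance" "app" {
  count         = terraform.workspace == "prod" ? 2 : 1
  ami           = data.aws_ami.ubuntu.id  # assumes an aws_ami data source defined elsewhere
  instance_type = terraform.workspace == "prod" ? "t3.small" : "t2.micro"
  tags          = local.config.tags
}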

🚨 Challenge #2: The S3 Public Policy 403 Nightmare

Error Received:

Error: putting S3 Bucket Policy: AccessDenied: User is not authorized to perform: s3:PutBucketPolicy 
because public policies are blocked by the BlockPublicPolicy block public access setting.

Root Cause: AWS account-level Block Public Access settings were enabled—overriding my Terraform block_public_policy = false.

The Fix:

# Disable account-level block
aws s3control put-public-access-block \
  --account-id <account-id> \
  --public-access-block-configuration \
    BlockPublicAcls=false,IgnorePublicAcls=false,BlockPublicPolicy=false,RestrictPublicBuckets=false

💊 Lesson Learned: Always check higher-level AWS policies (account, SCP) before blaming Terraform code.
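
Before flipping anything off, read the current account-level settings first. This is the standard S3 Control API (substitute your own account ID):

# Show the account-wide Block Public Access configuration
aws s3control get-public-access-block --account-id <account-id>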



🏢 Production-Grade Full-Stack Deployment - The Final Boss 🎮

This demonstration combined everything: remote state, workspaces, AND deploying a complete three-tier application (VPC → RDS MySQL → EC2 with Nginx/Node.js) across dev and prod.

The Architecture

Network Module:

  • VPC: 10.0.0.0/16
  • Public Subnet: 10.0.1.0/24 (EC2 instances)
  • Private Subnets: 10.0.2.0/24 & 10.0.3.0/24 (RDS Multi-AZ)
  • Security Groups: SSH (22) + HTTP (80) for EC2, MySQL (3306) from EC2 SG only for RDS

Database Module:

  • RDS MySQL db.t3.micro (free tier)
  • Private access only (no public IP)
  • Multi-AZ deployment for high availability
  • Automated database initialization via EC2 user data

Compute Module:

  • EC2 t2.micro Ubuntu 22.04
  • Automated provisioning: Node.js, Nginx, MySQL client, PM2
  • Git clone → npm install → build → deploy → Nginx reverse proxy
  • Environment variables injected for DB connection
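
At the root, the modules chain together through outputs feeding inputs, which is how Terraform infers the deploy order. A sketch of the wiring (module paths and variable names are assumed from this post, not copied from the repo):

# Root main.tf (sketch): network → database → compute dependency chain
module "network" {
  source      = "./modules/network"
  name_prefix = "${local.env}-epicbook"
  tags        = local.config.tags
}

module "database" {
  source     = "./modules/database"
  subnet_ids = module.network.private_subnet_ids  # both private subnets, one per AZ
  tags       = local.config.tags
}

module "compute" {
  source      = "./modules/compute"
  db_endpoint = module.database.db_endpoint
  tags        = local.config.tags
  # ...plus subnet/security-group inputs omitted for brevity
}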


🚨 Challenge #3: The Multi-AZ Subnet Requirement Hell

Error That Broke Me:

Error: creating RDS DB Instance: InvalidParameterCombination: 
Cannot create a Multi-AZ DB instance with only 1 subnet. 
You must specify at least 2 subnets in different Availability Zones.

What Went Wrong: My original network module only created one private subnet. RDS Multi-AZ requires subnets in at least two AZs for failover.

The Solution:

# modules/network/main.tf
resource "aws_subnet" "private_a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.2.0/24"
  availability_zone = "${var.region}a"
  tags = merge(var.tags, { Name = "${var.name_prefix}-private-subnet-a" })
}

resource "aws_subnet" "private_b" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.3.0/24"
  availability_zone = "${var.region}b"
  tags = merge(var.tags, { Name = "${var.name_prefix}-private-subnet-b" })
}

# Output both for RDS subnet group
output "private_subnet_ids" {
  value = [aws_subnet.private_a.id, aws_subnet.private_b.id]
}

Module Usage:

# modules/database/main.tf
resource "aws_db_subnet_group" "epicbook" {
  name       = "${var.name_prefix}-db-subnet-group"
  subnet_ids = var.subnet_ids  # Now passes TWO subnets
  tags       = var.tags
}

💊 Lesson Learned: AWS enforces multi-AZ requirements at the API level. Terraform can't override cloud provider constraints—you must architect correctly from the start.
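
A more portable variant discovers AZs at plan time instead of hardcoding the a/b suffixes. A sketch, assuming var.vpc_cidr = "10.0.0.0/16":

# Pick two available AZs dynamically
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index + 2)  # 10.0.2.0/24, 10.0.3.0/24
  availability_zone = data.aws_availability_zones.available.names[count.index]
  tags              = merge(var.tags, { Name = "${var.name_prefix}-private-subnet-${count.index}" })
}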


🚨 Challenge #4: RDS Endpoint Port Parsing Bug

The Issue: RDS returns endpoints as hostname:3306, but MySQL connection strings expect just the hostname.

Error in Logs:

mysql: [ERROR] Failed to connect to 'dev-epicbook-db.abc.ap-south-1.rds.amazonaws.com:3306:3306'

The Problem in Code:

# WRONG - passes endpoint with :3306 already appended
userdata = templatefile("${path.module}/userdata.sh", {
  db_endpoint = module.database.db_endpoint  # Includes :3306
})

The Fix:

# CORRECT - strip port using split()
userdata = base64encode(templatefile("${path.module}/userdata.sh", {
  db_endpoint = split(":", module.database.db_endpoint)[0]  # Hostname only
}))

User Data Script:

#!/bin/bash
# ${db_endpoint}, ${db_password}, ${db_name} are rendered by templatefile();
# $DB_ENDPOINT (no braces) is an ordinary shell variable
DB_ENDPOINT="${db_endpoint}"
mysql -h $DB_ENDPOINT -u epicadmin -p${db_password} -e "CREATE DATABASE IF NOT EXISTS ${db_name}"
mysql -h $DB_ENDPOINT -u epicadmin -p${db_password} ${db_name} < db/schema.sql

Validation:

# SSH into EC2
ssh -i ~/.ssh/suvrajeet.key.pem ubuntu@$(terraform output -raw ec2_public_ip)

# Test DB connection
mysql -h $(terraform output -raw db_endpoint | cut -d: -f1) \
  -u epicadmin -p<password> -e "SHOW DATABASES;"



📚 Concepts Deep Dive: The "Why" Behind Everything 🧠

What is the Terraform State File?

The .tfstate file is a JSON snapshot of your infrastructure's current state. It maps your Terraform code to real-world resource IDs (EC2 instance IDs, S3 bucket ARNs, etc.).

Why It Matters:

  • Without it: Terraform can't track what exists, leading to duplicate resource creation
  • Local state: File stored on your machine—team collaboration impossible
  • Remote state: Centralized in S3—enables team workflows and CI/CD
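
You rarely need to open that JSON by hand. Terraform ships read-only views into state that are safe to run anytime:

# List every resource Terraform is tracking
terraform state list

# Inspect one mapping from code to real-world IDs
terraform state show aws_s3_bucket.app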

What is the Lock File (.terraform.lock.hcl)?

This file locks provider versions to ensure consistency across team members.

Example:

provider "registry.terraform.io/hashicorp/aws" {
  version = "5.0.1"
  hashes = [
    "h1:abc123...",
  ]
}

Purpose: Prevents "works on my machine" issues caused by provider version mismatches.
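
One gotcha: hashes are recorded per platform, so a lock file generated on macOS can fail verification in Linux CI. You can pre-record hashes for every platform your team uses:

# Record provider hashes for multiple platforms in .terraform.lock.hcl
terraform providers lock \
  -platform=linux_amd64 \
  -platform=darwin_arm64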

Backend Initialization: What Happens During terraform init?

  1. Downloads provider plugins (AWS, Azure, etc.)
  2. Configures remote backend (S3 + DynamoDB) if any
  3. Creates .terraform directory with cached plugins
  4. Generates .terraform.lock.hcl with provider versions

Why is the AWS provider ~700 MB? It bundles API definitions and resource schemas for 400+ AWS services.

Export Commands Explained

Q: What does export BUCKET_NAME=... do?

A: Creates a shell environment variable for reuse across commands. Prevents typos and enables scripting.

export BUCKET_NAME=epicbook-tf-state-$(date +%s)
aws s3 mb s3://$BUCKET_NAME  # Reuses variable

Q: What does aws sts get-caller-identity do?

A: Verifies your AWS CLI authentication by returning your IAM user/role ARN.

aws sts get-caller-identity
# Output: { "UserId": "...", "Account": "970107226849", "Arn": "arn:aws:iam::..." }



🎓 Interview Questions: Ace Your Terraform Technical Screen 💼

Basic Level

Q1: What is Terraform state and why does it need locking?

A: State tracks infrastructure reality. Locking prevents concurrent modifications that corrupt state, causing resource conflicts or deletions.

Q2: Explain the difference between terraform plan and terraform apply.

A: plan previews changes (read-only), apply executes them (write operation). Always run plan first to catch errors.

Q3: How do workspaces differ from separate directories?

A: Workspaces share code but isolate state files—ideal for similar environments (dev/prod). Separate directories isolate everything—better for completely different stacks.

Intermediate Level

Q4: How does DynamoDB enable Terraform state locking?

A: Terraform writes a lock item with LockID as the partition key. Concurrent operations fail until the lock is released. DynamoDB provides atomic conditional writes.
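
Under the hood it's essentially a conditional write. This hand-rolled equivalent (a sketch of the mechanism, not what Terraform literally executes) shows why two writers can't both win:

# Succeeds only if no item with this LockID exists; a second writer
# gets ConditionalCheckFailedException instead of silently overwriting
aws dynamodb put-item \
  --table-name epicbook-terraform-locks \
  --item '{"LockID": {"S": "my-bucket/epicbook/terraform.tfstate"}}' \
  --condition-expression "attribute_not_exists(LockID)"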

Q5: What happens if you delete a workspace's state file?

A: Terraform loses track of resources—they still exist in AWS but Terraform can't manage them. Fix via terraform import to rebuild state.

Q6: Why use split(":", db_endpoint)[0] for RDS endpoints?

A: RDS returns hostname:3306, but connection strings expect just the hostname. Split extracts the first element (hostname only).

Advanced/Tricky Level

Q7: You run terraform apply in prod workspace but state is in dev. What breaks?

A: Terraform uses the wrong state file—it sees empty state and tries to create duplicate resources in prod, causing name conflicts or overwriting existing infrastructure.

Q8: How do you migrate local state to S3 without losing resources?

A: Add backend config, run terraform init -migrate-state, confirm migration, verify state in S3. Terraform moves state seamlessly.
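
In practice that's two steps: add the backend "s3" block shown earlier, then re-initialize:

# Terraform detects the new backend and offers to copy local state into S3
terraform init -migrate-state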

Q9: Why does Multi-AZ RDS require 2+ subnets in different AZs?

A: AWS places primary instance in one AZ, standby replica in another for high availability. Single subnet = single point of failure.

Q10: How would you prevent accidental terraform destroy in production?

A: Use lifecycle prevent_destroy = true, require approval in CI/CD pipelines, restrict IAM permissions, add confirmation prompts in wrapper scripts.
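
The lifecycle guard looks like this (attached to the RDS instance as an example; the resource name is assumed):

resource "aws_db_instance" "epicbook" {
  # ...existing arguments...

  lifecycle {
    prevent_destroy = true  # terraform destroy now errors out instead of deleting this
  }
}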



🛠️ Troubleshooting Playbook: Solutions to Every Error I Hit 🔧

Error 1: Backend Bucket Not Found

Symptom: Error: Failed to get existing workspaces: bucket does not exist

Fix: Hardcode the actual bucket name in backend.tf; variables aren't supported in backend blocks.

Error 2: IAM Permission Denied

Symptom: AccessDenied: User is not authorized to perform: s3:PutBucketPolicy

Fix: Check account-level Block Public Access, verify IAM user has s3:* permissions.

Error 3: RDS Creation Hangs

Symptom: Terraform apply stuck at "Creating RDS instance..." for 15+ minutes

Fix: RDS provisioning is slow by design; Multi-AZ instances routinely take 10-20 minutes, so a long apply is usually normal. Check the Events tab in the RDS console for actual errors.

Error 4: Nginx 502 Bad Gateway

Symptom: Frontend loads but API calls fail

Fix: Backend not running. SSH to EC2, check pm2 status, restart with pm2 restart epicbook-backend.

Error 5: Cannot Connect to Database

Symptom: ERROR 2003: Can't connect to MySQL server on 'hostname'

Fix: Verify security group allows 3306 from EC2 SG, check RDS is in "available" state, test from EC2 instance directly.
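
A quick triage sequence from the EC2 box (the endpoint and instance identifier follow this post's naming; adjust to yours):

# 1. Is the network path open? Tests security group rules without MySQL auth
nc -zv dev-epicbook-db.abc.ap-south-1.rds.amazonaws.com 3306

# 2. Is the instance actually available yet?
aws rds describe-db-instances \
  --db-instance-identifier dev-epicbook-db \
  --query 'DBInstances[0].DBInstanceStatus'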


🏆 Key Takeaways: What Production-Grade Terraform Really Means 💎

  1. State Management is Non-Negotiable: Remote state + locking = table stakes for teams
  2. Workspaces ≠ Magic: Great for similar envs, dangerous if misused—always verify terraform workspace show
  3. Cloud Constraints are Real: Terraform can't bypass AWS requirements (Multi-AZ needs 2 subnets)
  4. Security Layers Stack: IAM + SCPs + Block Public Access—check every level before blaming code
  5. Automation Needs Validation: User data scripts must handle idempotency, errors, and async operations
  6. Modular Design Pays Off: Network → Database → Compute dependency chain prevents deployment chaos
  7. Always Test Locking: The first time two engineers collide on state, you'll thank yourself for DynamoDB

🚀 What's Next? Advanced Terraform Patterns 🔮

  • Terragrunt: DRY configurations across multiple modules
  • Terraform Cloud: Hosted state management with RBAC
  • Policy as Code: Sentinel/OPA for automated compliance checks
  • GitOps Workflows: Atlantis for PR-based Terraform automation

📜 This week taught me: Infrastructure as Code isn't just about automation—it's about building systems that teams can trust, modify, and scale without breaking production at 2 AM.


This is Week 7, Part 2 of 12, of the free DevOps cohort organized by Pravin Mishra sir 🙏, continuing from 🏗️ Mastering Infrastructure as Code: From Manual Chaos to Multi-Cloud Orchestration [Week-7—P1] ⚡

Following my journey from Terraform basics to production-grade patterns—remote state, workspaces, and full-stack deployment mastery. Each week reveals the gap between tutorials and reality. What's your most painful Terraform lesson? Share in the comments! 🔥


🏷️ Tags:

#Terraform #DevOps #AWS #InfrastructureAsCode #RemoteState #Workspaces #RDS #MultiCloud #Production #CloudEngineering #IaC #StateLocking #DynamoDB #S3 #Learning

Read more in this series: DevOps Journey


🐙 Github Links

🔗 Team-Ready State—Remote Backends & Locking (Azure + AWS)
🔗 Deploy a React App with Terraform Workspaces (dev & prod)
🔗 EpicBook on Azure/AWS with Production-Grade Terraform

