Lab Goal
Build a “production-ish” AWS stack with Terraform, then simulate an accidental terraform apply that deletes/changes networking and breaks traffic. You will:
- Detect outage fast
- Identify what changed and why
- Restore service safely
- Fix / repair Terraform state (imports, state surgery if needed)
- Add guardrails so it can’t happen again
- Explain state drift with real examples
Architecture
- VPC: public + private subnets across 2 AZs, NAT, IGW
- EKS: private nodes, cluster endpoint public/private (your choice)
- ALB: created by AWS Load Balancer Controller via Kubernetes Ingress
- RDS: MySQL in private subnets (not publicly accessible)
- Terraform Remote State: S3 backend + DynamoDB lock
- Optional: CI/CD gate (GitHub Actions or Jenkins) that prevents apply on main without approvals
Important note: In real production, ALB should be created by Kubernetes ingress controller (not Terraform) OR managed by Terraform consistently. This lab teaches both and shows what happens when you mix ownership.
Prerequisites
Local tools
- AWS CLI configured (`aws sts get-caller-identity` must work)
- Terraform v1.5+
- kubectl
- helm
- eksctl (optional)
AWS prerequisites
- One AWS account (or use separate dev/prod accounts if you want extra realism)
- Route53 domain optional (not required)
Step 0 — Create Remote State (S3 + DynamoDB)
Create a bootstrap Terraform project (or do it manually once).
0.1 Create bootstrap/ project
bootstrap/main.tf
terraform {
  required_version = ">= 1.5.0"
}

provider "aws" {
  region = var.region
}

variable "region" { type = string }
variable "state_bucket_name" { type = string }
variable "lock_table_name" { type = string }

resource "aws_s3_bucket" "tf_state" {
  bucket = var.state_bucket_name
}

resource "aws_s3_bucket_versioning" "versioning" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "sse" {
  bucket = aws_s3_bucket.tf_state.id
  rule {
    apply_server_side_encryption_by_default { sse_algorithm = "AES256" }
  }
}

resource "aws_dynamodb_table" "tf_lock" {
  name         = var.lock_table_name
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
Run:
cd bootstrap
terraform init
terraform apply
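The three variables have no defaults, so `apply` will prompt for them interactively. To avoid that, you can supply a tfvars file; the values below are illustrative assumptions (the bucket name must be globally unique):

```hcl
# bootstrap/terraform.tfvars -- example values only; choose your own
region            = "us-east-2"
state_bucket_name = "mycompany-prod-lab-tf-state"  # assumption: any globally unique name
lock_table_name   = "prod-lab-tf-lock"
```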
Step 1 — Create “Production” Terraform Repo Structure
Create repo like this:
infra/
envs/
prod/
main.tf
backend.tf
providers.tf
variables.tf
outputs.tf
terraform.tfvars
modules/
network/
eks/
rds/
iam/
1.1 Remote backend config (prod)
infra/envs/prod/backend.tf
terraform {
  backend "s3" {
    bucket         = "YOUR_STATE_BUCKET"
    key            = "prod/terraform.tfstate"
    region         = "us-east-2"
    dynamodb_table = "YOUR_LOCK_TABLE"
    encrypt        = true
  }
}
1.2 Providers + tags
providers.tf
provider "aws" {
  region = var.region

  default_tags {
    tags = {
      Project = "prod-lab"
      Env     = "prod"
      Owner   = "jumptotech"
    }
  }
}
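`var.region` must be declared in `variables.tf`. A minimal sketch (the default is an assumption; set whatever region you use):

```hcl
# infra/envs/prod/variables.tf -- minimal sketch
variable "region" {
  type    = string
  default = "us-east-2"
}
```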
Step 2 — Build VPC Module (Production-ish)
Module requirements
- 2 public subnets
- 2 private subnets
- NAT gateway
- Route tables
(You can use terraform-aws-modules/vpc/aws to save time. In production, teams do.)
Example modules/network/main.tf using the VPC module:
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name            = var.name
  cidr            = var.cidr
  azs             = var.azs
  public_subnets  = var.public_subnets
  private_subnets = var.private_subnets

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true
  enable_dns_support   = true
}
Outputs:
- vpc_id
- public_subnet_ids
- private_subnet_ids
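Wiring those outputs up is mechanical; a sketch assuming the `module "vpc"` block above (the community VPC module exposes `vpc_id`, `public_subnets`, and `private_subnets`):

```hcl
# modules/network/outputs.tf -- sketch
output "vpc_id" {
  value = module.vpc.vpc_id
}

output "public_subnet_ids" {
  value = module.vpc.public_subnets
}

output "private_subnet_ids" {
  value = module.vpc.private_subnets
}
```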
Step 3 — Build EKS Module (Production-ish)
Use terraform-aws-modules/eks/aws module.
Key points for a senior-level lab:
- Node groups in private subnets
- Cluster logs enabled
- IAM roles clean
- Add-ons (coredns, vpc-cni, kube-proxy)
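The points above map onto the community EKS module roughly like this (a sketch, not a definitive config; variable names, versions, and sizes are assumptions):

```hcl
# modules/eks/main.tf -- sketch; adjust versions and sizes to your account
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = var.cluster_name   # e.g. "prod-lab-eks"
  cluster_version = "1.29"

  vpc_id     = var.vpc_id
  subnet_ids = var.private_subnet_ids  # node groups in private subnets

  # control-plane logs
  cluster_enabled_log_types = ["api", "audit", "authenticator"]

  # managed add-ons
  cluster_addons = {
    coredns    = {}
    kube-proxy = {}
    vpc-cni    = {}
  }

  eks_managed_node_groups = {
    default = {
      min_size       = 2
      max_size       = 4
      desired_size   = 2
      instance_types = ["t3.medium"]
    }
  }
}
```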
After apply, set kubeconfig:
aws eks update-kubeconfig --region us-east-2 --name prod-lab-eks
kubectl get nodes
Step 4 — Install AWS Load Balancer Controller (ALB via Ingress)
4.1 Create IAM policy + IRSA
- Create an OIDC provider for the cluster (the EKS module can do this)
- Create service account with role
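One way to do both in Terraform is the community IRSA submodule; a sketch, assuming the EKS module above exposes `oidc_provider_arn`:

```hcl
# IRSA role for the ALB controller -- sketch using the community IAM module
module "lb_controller_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name                              = "aws-load-balancer-controller"
  attach_load_balancer_controller_policy = true

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:aws-load-balancer-controller"]
    }
  }
}
```

You still need a `kube-system/aws-load-balancer-controller` ServiceAccount annotated with `eks.amazonaws.com/role-arn` pointing at this role, since the Helm install below uses `serviceAccount.create=false`.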
Then:
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm upgrade --install aws-load-balancer-controller eks/aws-load-balancer-controller \
-n kube-system \
--set clusterName=prod-lab-eks \
--set serviceAccount.create=false \
--set serviceAccount.name=aws-load-balancer-controller
Verify:
kubectl -n kube-system get deploy aws-load-balancer-controller
kubectl -n kube-system logs deploy/aws-load-balancer-controller --tail=50
Step 5 — Deploy App + Ingress to Create ALB
5.1 Sample app
Deploy nginx or your Node app. Example nginx:
kubectl create ns prod
kubectl -n prod create deploy web --image=nginx
kubectl -n prod expose deploy web --port 80
5.2 Ingress (creates ALB)
ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  namespace: prod
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web
            port:
              number: 80
(`spec.ingressClassName: alb` replaces the deprecated `kubernetes.io/ingress.class` annotation.)
Apply:
kubectl apply -f ingress.yaml
kubectl -n prod get ingress web
You should see an ALB DNS name.
Step 6 — Add RDS (Private)
Create RDS in private subnets, security group allows MySQL from EKS nodes (or app SG).
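A sketch of that module (identifiers, sizes, and the node security group variable are assumptions; in real prod, pull the password from a secrets manager):

```hcl
# modules/rds/main.tf -- sketch; lab-sized values
resource "aws_db_subnet_group" "this" {
  name       = "prod-lab-db"
  subnet_ids = var.private_subnet_ids
}

resource "aws_security_group" "db" {
  name   = "prod-lab-db"
  vpc_id = var.vpc_id

  ingress {
    from_port       = 3306
    to_port         = 3306
    protocol        = "tcp"
    security_groups = [var.eks_node_security_group_id]  # MySQL only from nodes
  }
}

resource "aws_db_instance" "mysql" {
  identifier             = "prod-lab-mysql"
  engine                 = "mysql"
  engine_version         = "8.0"
  instance_class         = "db.t3.micro"
  allocated_storage      = 20
  username               = "admin"
  password               = var.db_password          # use a secret store in real prod
  db_subnet_group_name   = aws_db_subnet_group.this.name
  vpc_security_group_ids = [aws_security_group.db.id]
  publicly_accessible    = false
  skip_final_snapshot    = true                     # lab only
}
```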
Test connectivity from a temporary pod:
kubectl -n prod run mysql-client --rm -it --image=mysql:8 -- bash
# then, inside the pod:
mysql -h <rds-endpoint> -u admin -p
INCIDENT LAB: “Terraform Apply Broke Prod”
Step 7 — Pre-incident checks (baseline)
Save baseline outputs:
- ALB DNS
- EKS nodes
- Ingress status
- RDS endpoint
Commands:
kubectl -n prod get ingress web -o wide
kubectl -n prod get pods,svc
Step 8 — Simulate the Disaster (Accidental Apply)
Scenario A (very realistic): Network change causes ALB to be destroyed/recreated or unreachable
Example: junior changes VPC public subnet tags or routes.
In VPC module (or your tagging), remove the required tags for ALB controller:
Public subnets must have:
- kubernetes.io/role/elb = 1
Private subnets must have:
- kubernetes.io/role/internal-elb = 1
Both need:
- kubernetes.io/cluster/<cluster-name> = shared|owned
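With the community VPC module, those tags are set via `public_subnet_tags` / `private_subnet_tags` (real module inputs); the sketch below assumes the cluster is named `prod-lab-eks`:

```hcl
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  # ...existing arguments from Step 2...

  public_subnet_tags = {
    "kubernetes.io/role/elb"             = "1"
    "kubernetes.io/cluster/prod-lab-eks" = "shared"
  }

  private_subnet_tags = {
    "kubernetes.io/role/internal-elb"    = "1"
    "kubernetes.io/cluster/prod-lab-eks" = "shared"
  }
}
```

The "accidental apply" in this scenario is simply deleting or renaming these two maps.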
Simulate mistake: remove or rename those tags in Terraform and apply:
terraform plan
terraform apply
Expected result
- ALB controller can’t find correct subnets or ALB targets go unhealthy
- Traffic becomes slow/503 (or ALB disappears and new one tries to create)
OUTAGE RESPONSE: Senior Troubleshooting Flow
Step 9 — Detect and Confirm
- Users see 503 / timeouts
- Ingress may show no hostname, or targets unhealthy
Check:
kubectl -n prod get ingress web -o wide
kubectl -n kube-system logs deploy/aws-load-balancer-controller --tail=200
Also check AWS:
- EC2 → Load Balancers
- Target Groups health
Step 10 — Identify What Happened (Terraform Evidence)
10.1 Use terraform plan history (best practice)
If you apply via CI, you should have the plan artifact from the bad run. In this lab, simulate the investigation by:
- Check git diff:
git diff
- Check Terraform state changes:
terraform show
- Look at AWS CloudTrail (who did it):
Search for event names such as DeleteLoadBalancer, ModifySubnetAttribute, DeleteRouteTable, ReplaceRoute, etc.
Step 11 — Restore Production Safely
This is where senior engineers separate themselves:
Option 1: Fastest restore (rollback code)
- Revert the bad commit:
git revert <commit>
- Run:
terraform plan
terraform apply
This should restore subnet tags/routes so ALB controller can recreate ALB or recover.
Option 2: If ALB is gone but cluster is OK
- Fix subnet tags first (Terraform)
- Then force ingress reconcile:
kubectl -n prod annotate ingress web reconcile-trigger="$(date +%s)" --overwrite
(any metadata change triggers the controller to re-evaluate the ingress; the annotation key here is arbitrary. Delete/recreate the ingress only as a last resort)
Option 3: If Terraform state is damaged or drifted
- If resource exists in AWS but not in state → import
- If resource deleted in AWS but still in state → remove from state and re-apply
Examples:
terraform state list
terraform import module.network.aws_subnet.public[0] subnet-xxxx
terraform state rm module.something.aws_lb.this
terraform apply
STATE MANAGEMENT + DRIFT (must know)
Step 12 — Demonstrate “State Drift”
Drift = AWS reality ≠ Terraform state/config.
12.1 Create drift on purpose
Manually change something in AWS Console:
- edit a security group rule
- change a subnet tag
- delete a target group
Then run:
terraform plan
You’ll see Terraform wants to “correct” it back. That’s drift.
Explain to students:
- Drift comes from manual console edits, other tools (CloudFormation), controllers (like ALB controller), or AWS auto-changes.
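When the out-of-band change is the one you actually want to keep, modern Terraform can accept drift into state instead of reverting it (the `-refresh-only` mode exists since Terraform v0.15.4):

```
# preview what drifted, without proposing any fixes
terraform plan -refresh-only

# accept the real-world values into state (changes state only, not AWS)
terraform apply -refresh-only
```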
PREVENTION: Senior Guardrails
Step 13 — Put in “Production” Controls
13.1 No direct apply from laptops
- Only CI/CD can apply to prod
- Engineers create PRs; apply requires approvals
13.2 Add “plan” + manual approval gate
If using Jenkins:
- Stages: terraform fmt, validate, plan
- Require a manual input step before apply
- Apply only from the main branch
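If using GitHub Actions instead, the same gate can be sketched as below (workflow and directory names are assumptions; a real pipeline would also pass the saved plan artifact from `plan` to `apply`, omitted here for brevity; the `prod` environment must have required reviewers configured):

```yaml
# .github/workflows/terraform.yml -- sketch
name: terraform
on:
  pull_request:          # plan on PRs
  push:
    branches: [main]     # apply only from main
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform -chdir=infra/envs/prod init
      - run: terraform -chdir=infra/envs/prod plan
  apply:
    if: github.ref == 'refs/heads/main'
    needs: plan
    environment: prod    # manual approval gate lives here
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform -chdir=infra/envs/prod init
      - run: terraform -chdir=infra/envs/prod apply -auto-approve
```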
13.3 Policy-as-code
Add one:
- OPA / Conftest (deny dangerous changes)
- Checkov / tfsec (security)
Example policy idea:
- Deny deleting public subnets
- Deny changing route tables in prod without approval label
- Deny SSH open to 0.0.0.0/0
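The first idea as a Conftest/OPA sketch, run against `terraform show -json tfplan` (`resource_changes` and `change.actions` are standard fields of Terraform's plan JSON; the package/file names are assumptions):

```rego
# policy/deny_subnet_delete.rego -- Conftest sketch
package main

deny[msg] {
  rc := input.resource_changes[_]
  rc.type == "aws_subnet"
  rc.change.actions[_] == "delete"
  msg := sprintf("refusing to delete subnet %s in prod", [rc.address])
}
```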
13.4 Terraform protections
- prevent_destroy = true on critical resources
- Split state:
  - network state separate from app state
  - smaller blast radius
- Use -target only in emergencies (and document it)
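The lifecycle guard looks like this on any critical resource (a fragment; the resource arguments are whatever you already have):

```hcl
resource "aws_subnet" "public" {
  # ...existing arguments...

  lifecycle {
    prevent_destroy = true  # plan/apply errors out instead of destroying
  }
}
```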
13.5 Separate AWS accounts
- dev / stage / prod accounts
- SCPs on prod to prevent deletion of key resources
Capstone: Incident Review Deliverables (Senior-level)
Students must produce:
- Incident timeline (when, what, who, impact)
- Root cause (bad tag/route change)
- Detection gaps (why monitoring didn’t alert earlier)
- Fix (rollback + restore + state repair)
- Prevention (gates, policies, controls)