Most Terraform tutorials get you to a working EC2 instance in about ten minutes. You run terraform apply, watch the output scroll, and feel good about yourself. Then you try to do the same thing with a teammate, in a real environment, with actual security requirements — and things fall apart in ways the tutorial never warned you about.
The gap between "Terraform works on my laptop" and "Terraform works reliably in production" is where most teams get burned. This article is about closing that gap. It assumes you've written Terraform before — you know what a resource block is, you've run plan and apply, you understand variables and outputs. What you probably haven't done is designed a production AWS setup from scratch, thought hard about state management across teams, built a VPC you'd actually want to run workloads in, or wired Terraform into CI/CD without leaving access keys lying around.
That's what we're covering.
The thesis: production Terraform is mostly discipline, not complexity
Here's the take that shapes everything in this article: the hard parts of production Terraform are not technically complex. Remote state isn't hard. Least-privilege IAM isn't hard. Multi-AZ VPC design isn't hard. What's hard is doing all of it consistently, from the start, before the shortcuts accumulate.
Every production Terraform mess I've seen shares a common origin story: a team that moved fast in the early days and deferred the "we'll clean this up later" work until later became never. Local state files living on one engineer's laptop. Hardcoded AWS credentials in .tf files. A single flat VPC with everything in public subnets because it was faster. These are not ignorance failures — they're discipline failures. The practices that separate a toy infrastructure from a production one are all available on day one. You just have to actually use them.
So: start with the right structure, even for small projects. It's cheaper than retrofitting.
Project structure: directories over workspaces
Before writing a single resource, get your directory structure right. This is one of those decisions that's easy to change at the beginning and brutal to change six months later with real state files involved.
The choice that trips people up most is workspaces vs. separate directories for environment isolation. Terraform workspaces feel elegant — one codebase, multiple environments, terraform workspace select prod. In practice, they're often the wrong call for production setups.
The problem is blast radius. With workspaces, a terraform apply run while the wrong workspace is selected can touch production when you meant to touch staging. The environments share one backend configuration and one codebase, so a bug in that code is a bug in every environment, with no per-environment copy to hold a change back. And workspace switching is manual: easy to forget, easy to miss in a rush.
Separate directories per environment cost you some code repetition but buy you hard isolation. When your environments/prod directory has its own backend block pointing at its own S3 key, a staging configuration can't end up applied to production just because someone forgot to switch context. The redundancy is the point.
A clean production layout looks like this:
infra/
├── modules/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── rds/
│   └── alb/
├── environments/
│   ├── staging/
│   │   ├── main.tf
│   │   ├── backend.tf
│   │   └── terraform.tfvars
│   └── prod/
│       ├── main.tf
│       ├── backend.tf
│       └── terraform.tfvars
└── bootstrap/
    └── state-backend/
        └── main.tf
The bootstrap/ directory handles the chicken-and-egg problem of creating your S3 state bucket using Terraform itself — more on that shortly. The modules/ directory holds reusable infrastructure components. The environments/ directories are thin: they instantiate modules with environment-specific variable values, and that's mostly it.
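To make "thin" concrete, here is a minimal sketch of what environments/prod/main.tf might contain under this layout. The module path matches the tree above, but the input names (name, cidr) and the variable names are assumptions for illustration, since the interface of the local vpc module isn't defined in this article.

```hcl
# environments/prod/main.tf (illustrative sketch)

provider "aws" {
  region = var.aws_region
}

# Hypothetical inputs: the real modules/vpc interface may differ.
module "vpc" {
  source = "../../modules/vpc"

  name = "prod-vpc"
  cidr = var.vpc_cidr
}

# Values for these come from terraform.tfvars in the same directory.
variable "aws_region" {
  description = "Region for this environment."
  type        = string
}

variable "vpc_cidr" {
  description = "CIDR block for this environment's VPC."
  type        = string
}
```

The staging directory would hold the same file with different values in terraform.tfvars, which is the repetition the layout deliberately accepts.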
Remote state: the non-negotiable foundation
If there is one thing to get right before anything else, it's state management. Local state is fine for personal experiments. For anything involving a team, a CI system, or infrastructure you care about, local state is how you corrupt your deployment history or lose track of what's actually running.
AWS S3 with state locking is the standard approach for teams on AWS. From Terraform v1.10 onward, you no longer need a separate DynamoDB table to handle locking: S3 native locking handles it with use_lockfile = true (introduced as experimental in 1.10 and stabilized in 1.11), which is one less resource to manage and one less failure point. If you're on an older version, the DynamoDB approach still works fine; just add the dynamodb_table argument to your backend block.
A production backend configuration:
# environments/prod/backend.tf
terraform {
  backend "s3" {
    bucket       = "my-org-terraform-state"
    key          = "environments/prod/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true
  }
}
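For teams still on a Terraform version older than 1.10, the equivalent backend block with DynamoDB locking might look like the sketch below. The table name is hypothetical; the table itself needs a string partition key named LockID for the S3 backend to use it.

```hcl
# environments/prod/backend.tf (variant for Terraform older than 1.10)
terraform {
  backend "s3" {
    bucket         = "my-org-terraform-state"
    key            = "environments/prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks" # hypothetical table name
  }
}
```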
A few things worth calling out explicitly:
encrypt = true is not optional. Terraform state files contain secrets. Database passwords, private keys, sensitive outputs — they all end up in the state file in plaintext. Encryption at rest on the S3 bucket is baseline hygiene.
Version the bucket. Enable S3 versioning on your state bucket so you can roll back to a previous state if something goes wrong. The day you need this, you'll be grateful you did it. S3 versioning is cheap; a corrupted state file is not.
Separate state per environment. Don't share a state file across staging and production. If something goes wrong with the state file (it happens), the blast radius should be limited to one environment.
Bootstrap carefully. You need an S3 bucket before you can use it as a Terraform backend — which means you can't use Terraform to create the bucket and immediately use it in the same run. The cleanest solution is a bootstrap/ directory that creates the state bucket using a local backend, then you switch to the remote backend afterward. Some teams create this bucket manually once and never touch it again. Either works.
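The points above can be sketched as a single bootstrap configuration. This is a minimal example under a few assumptions: the bucket name and region are taken from the backend block shown earlier, the KMS-based encryption choice and the public access block are additions that aren't discussed in this article, and prevent_destroy is included only as a guard against accidental deletion.

```hcl
# bootstrap/state-backend/main.tf (illustrative sketch)
# No backend block here: this configuration keeps its own state locally.

provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "state" {
  bucket = "my-org-terraform-state"

  lifecycle {
    prevent_destroy = true # refuse plans that would delete the state bucket
  }
}

# Versioning lets you recover an earlier state file.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Default encryption at rest for every object written to the bucket.
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# Assumption: state should never be publicly readable.
resource "aws_s3_bucket_public_access_block" "state" {
  bucket = aws_s3_bucket.state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Run terraform init and terraform apply once inside bootstrap/state-backend; after that, terraform init in each environment directory picks up the S3 backend. If you later want the bootstrap state itself stored remotely, add a backend block to this directory and run terraform init -migrate-state.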
VPC design: the foundation everything else sits on
A production VPC is not complicated. But there are several patterns that look reasonable and create real problems, so it's worth being explicit.
The pattern you want for most production workloads:
- Three availability zones, spread across AZ-a, AZ-b, and AZ-c.
- Public subnets in each AZ for load balancers and NAT gateways. Nothing else.
- Private subnets in each AZ for application servers and services.
- Database subnets (also private) for RDS and ElastiCache, kept separate from app subnets.
- A NAT Gateway in each public subnet so private resources can reach the internet for package updates and API calls — without being reachable from the internet themselves.
The classic mistake is putting application servers in public subnets because "we'll add security groups to restrict access." Security groups do work, but defense in depth exists for a reason. A misconfigured security group rule in a private subnet means traffic doesn't flow. A misconfigured rule in a public subnet means your EC2 instance is directly internet-reachable. Public subnets are for things that genuinely need a public IP: NAT gateways, application load balancers, bastion hosts. Everything else belongs in private subnets.
The other mistake is single-AZ. It feels like it saves money (one NAT gateway instead of three), and it does, until an AZ has an incident and your entire application goes offline. AZ-level incidents are infrequent, but they do happen, and you don't get to schedule them. Multi-AZ is not paranoia; it's operational baseline.
In Terraform, using the community AWS VPC module is a reasonable shortcut here. It's well-tested, handles the subnet math, creates the right route tables, and has sensible defaults. This is one place where reaching for a registry module saves real time without meaningful downsides.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "prod-vpc"
  cidr = "10.0.0.0/16"
  azs  = ["us-east-1a", "us-east-1b", "us-east-1c"]

  private_subnets  = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets   = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  database_subnets = ["10.0.201.0/24", "10.0.202.0/24", "10.0.203.0/24"]

  enable_nat_gateway     = true
  single_nat_gateway     = false # one per AZ for HA
  one_nat_gateway_per_az = true

  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = local.common_tags
}
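One thing you'll notice: single_nat_gateway = false and one_nat_gateway_per_az = true. Three NAT gateways instead of one adds roughly $65/month in hourly charges (two extra gateways at about $0.045/hour each in us-east-1 at the time of writing; all three together come to roughly $100/month), before per-GB data processing fees. For production workloads where an AZ outage would take you offline, it's worth it. For staging, use single_nat_gateway = true and save the money.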
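For comparison, a staging variant might look like the sketch below. The CIDR range and tag values are assumptions for illustration, and the locals block is one possible definition of the local.common_tags reference, which this article uses but never defines.

```hcl
# environments/staging/main.tf (illustrative sketch)

locals {
  # Hypothetical tag set; adjust to your organization's tagging scheme.
  common_tags = {
    Environment = "staging"
    ManagedBy   = "terraform"
  }
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "staging-vpc"
  cidr = "10.1.0.0/16" # assumed range, chosen not to overlap with prod
  azs  = ["us-east-1a", "us-east-1b", "us-east-1c"]

  private_subnets  = ["10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24"]
  public_subnets   = ["10.1.101.0/24", "10.1.102.0/24", "10.1.103.0/24"]
  database_subnets = ["10.1.201.0/24", "10.1.202.0/24", "10.1.203.0/24"]

  enable_nat_gateway     = true
  single_nat_gateway     = true # one shared gateway: cheaper, not AZ-resilient
  one_nat_gateway_per_az = false

  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = local.common_tags
}
```

Keeping the subnet layout identical across environments means the only deliberate difference is the NAT topology, so staging still exercises the same routing shape as production.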
IAM: narrow by default, open intentionally
IAM is where good intentions go to die under deadline pressure. Someone needs access, something is breaking, there's a deploy in thirty minutes — and the fastest fix is a * on the action or the resource. It works. And then it stays that way for two years.
The discipline that matters here: start narrow and open only what's needed, in that order. The reverse — start wide and restrict later — almost never happens in practice.
For Terraform itself in CI/CD, the role you grant should cover only the resources Terraform manages in that configuration. A Terraform configuration that manages VPCs, EC2 instances, and RDS should have IAM permissions scoped to ec2:* and rds:* on the relevant resources (VPC actions live under the ec2: prefix; there is no separate vpc: namespace), not AdministratorAccess. Yes, figuring out the exact permission set takes time upfront. It saves you from the scenario where a compromised CI credential can delete everything in your AWS account.
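As a rough sketch of what that scoping could look like, the policy below grants the two service namespaces plus access to the prod state object and its lock file. It assumes the bucket and key from the backend block earlier, a single-region deployment, and the github_actions_terraform role defined in the CI/CD section of this article. The policy name and statement IDs are made up for illustration, and a real configuration will probably need a few more actions, which plan and apply will surface as AccessDenied errors.

```hcl
data "aws_iam_policy_document" "terraform_ci" {
  # Infrastructure this configuration manages. VPC actions live under ec2:.
  statement {
    sid       = "ManageNetworkComputeAndDatabases"
    effect    = "Allow"
    actions   = ["ec2:*", "rds:*"]
    resources = ["*"]

    # Assumption: everything is deployed in a single region.
    condition {
      test     = "StringEquals"
      variable = "aws:RequestedRegion"
      values   = ["us-east-1"]
    }
  }

  # The S3 backend lists the bucket to locate the state object.
  statement {
    sid       = "ListStateBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::my-org-terraform-state"]
  }

  # Read and write the prod state, and create and remove its lock file.
  statement {
    sid     = "ReadWriteProdState"
    effect  = "Allow"
    actions = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = [
      "arn:aws:s3:::my-org-terraform-state/environments/prod/terraform.tfstate",
      "arn:aws:s3:::my-org-terraform-state/environments/prod/terraform.tfstate.tflock",
    ]
  }
}

resource "aws_iam_role_policy" "terraform_ci" {
  name   = "terraform-ci-scoped" # hypothetical name
  role   = aws_iam_role.github_actions_terraform.id
  policy = data.aws_iam_policy_document.terraform_ci.json
}
```

This still uses service-level wildcards, so treat it as a starting point: it is much narrower than AdministratorAccess, and it can be tightened further once you know which actions the configuration really calls.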
For application workloads, use IAM roles attached to EC2 instances or ECS tasks — never long-lived access keys in environment variables. The pattern:
resource "aws_iam_role" "app_role" {
  name = "prod-app-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "app_s3_access" {
  name = "app-s3-read"
  role = aws_iam_role.app_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = ["s3:GetObject", "s3:ListBucket"]
      Resource = [
        aws_s3_bucket.assets.arn,
        "${aws_s3_bucket.assets.arn}/*"
      ]
    }]
  })
}
Notice the Resource field points at a specific bucket ARN, not "*". This is the whole game: specific actions, specific resources. If the application only needs to read from S3, it doesn't need s3:PutObject. If it only needs one bucket, it doesn't need access to all buckets.
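One piece the role definition leaves implicit: EC2 doesn't attach a role directly, it attaches an instance profile that wraps the role. A minimal sketch of that wiring follows, assuming the app_role resource above and the community VPC module from the previous section. The profile name, instance type, and AMI variable are hypothetical.

```hcl
# Wrap the role in an instance profile so EC2 can use it.
resource "aws_iam_instance_profile" "app" {
  name = "prod-app-profile" # hypothetical name
  role = aws_iam_role.app_role.name
}

variable "app_ami_id" {
  description = "AMI for the application servers (hypothetical input)."
  type        = string
}

resource "aws_instance" "app" {
  ami           = var.app_ami_id
  instance_type = "t3.micro" # placeholder size

  # Private subnet from the VPC module; no public IP is assigned.
  subnet_id = module.vpc.private_subnets[0]

  # Temporary credentials are served to the instance via this profile,
  # so no access keys need to exist on the machine.
  iam_instance_profile = aws_iam_instance_profile.app.name
}
```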
CI/CD: OIDC instead of access keys
Here's a pattern that's become standard in 2025 and is still underused: using GitHub Actions OIDC authentication with AWS instead of storing long-lived access keys in GitHub Secrets.
The old way: create an IAM user, generate an access key, paste AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY into GitHub Secrets, rotate them manually every so often (read: never). The problem is that these are long-lived credentials. A leaked GitHub secret gives an attacker standing AWS access until you notice and rotate.
The new way: configure AWS to trust GitHub's OIDC token issuer, create an IAM role with a trust policy scoped to your specific repository and branch, and have GitHub Actions exchange a short-lived JWT for temporary AWS credentials on each run. The credentials expire on their own after a short session (one hour by default with the aws-actions/configure-aws-credentials action), whether or not anyone remembers to rotate anything. There are no static credentials stored anywhere, so there is nothing long-lived to leak.
The Terraform to set this up:
# Create the OIDC provider (once per AWS account)
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
}

# Create the role GitHub Actions will assume
resource "aws_iam_role" "github_actions_terraform" {
  name = "github-actions-terraform"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = aws_iam_openid_connect_provider.github.arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        # Exact match on both claims. Use StringLike for sub only if you
        # need a wildcard, e.g. to allow several branches.
        StringEquals = {
          "token.actions.githubusercontent.com:sub" = "repo:your-org/your-repo:ref:refs/heads/main"
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })
}
The Condition block is the critical part. The sub claim is scoped to your specific repo and branch — a compromised token from a different repository can't assume this role. This is least-privilege applied to your CI pipeline itself.
The GitHub Actions workflow side is straightforward:
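permissions:
  id-token: write # required for OIDC
  contents: read

steps:
  - uses: actions/checkout@v4
  - uses: hashicorp/setup-terraform@v3
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/github-actions-terraform
      aws-region: us-east-1
  # init must run before plan so the backend and providers are available
  - name: Terraform Init
    run: terraform init
  - name: Terraform Plan
    run: terraform plan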
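For the workflow pattern itself: plan on pull requests, post the output as a PR comment so reviewers can see exactly what will change, and apply only on merge to main. Never run apply directly from a feature branch. Note that workflows triggered by pull_request present a different sub claim (repo:your-org/your-repo:pull_request) than the one the trust policy above accepts, so PR plans need their own role, ideally a read-only one. The PR plan comment is the cheapest possible safety check, and teams consistently skip it until they've had one bad apply.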
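Because the role defined earlier only trusts tokens issued for refs/heads/main, planning on pull requests calls for a second role. A sketch of one follows, reusing the OIDC provider from above. The role name and the choice of the AWS-managed ReadOnlyAccess policy are assumptions, and the sub value follows GitHub's documented format for pull_request events.

```hcl
# Plan-only role for workflows triggered by pull requests.
resource "aws_iam_role" "github_actions_plan" {
  name = "github-actions-terraform-plan" # hypothetical name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          # Tokens minted for pull_request events carry this subject.
          "token.actions.githubusercontent.com:sub" = "repo:your-org/your-repo:pull_request"
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })
}

# Assumption: broad read access is acceptable for plan. Tighten as needed.
resource "aws_iam_role_policy_attachment" "plan_read_only" {
  role       = aws_iam_role.github_actions_plan.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}
```

One caveat: terraform plan also takes the state lock, so with use_lockfile = true this role additionally needs write and delete access to the .tflock object next to the state file, or the PR job has to run plan with -lock=false.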
Provider and module versioning: pin everything
Last thing, and it sounds boring, but it prevents a specific class of infrastructure incidents that nobody wants to debug: pin your provider versions.
Terraform providers release updates that occasionally include breaking changes or behavior differences. If your provider version isn't pinned and you run terraform init six months from now, you might get a different AWS provider version than the one your code was written against. Subtle differences in resource creation order, default values changing, new required arguments — these show up as unexpected diffs in terraform plan at the least convenient times.
terraform {
  required_version = "~> 1.11"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.90"
    }
  }
}
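The ~> constraint lets only the rightmost version component you wrote increase. So ~> 5.90 accepts 5.90, 5.91, and any later 5.x release, but not 6.0; to allow patch-level upgrades only (5.90.x), write ~> 5.90.0. Commit the .terraform.lock.hcl file as well, since it records the exact provider versions terraform init selected. Update deliberately, not accidentally.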
Same goes for modules. If you're using the AWS VPC module from the registry, pin version = "~> 5.0" rather than leaving it unpinned. Modules aren't recorded in the dependency lock file, so an unpinned module reference will silently resolve to the newest published version on any fresh terraform init, such as a clean CI checkout.
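If patch-only upgrades are what you want, the constraints need a third version component. A sketch of the tighter form, using the same version numbers as the block above:

```hcl
terraform {
  # Accepts 1.11.x only. "~> 1.11" would also accept 1.12, 1.13, and so on.
  required_version = "~> 1.11.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Accepts 5.90.x only. "~> 5.90" would also accept 5.91 and later 5.x.
      version = "~> 5.90.0"
    }
  }
}
```

Which form to use is a trade-off: the two-component form picks up new provider features without a code change, while the three-component form turns every minor upgrade into an explicit, reviewable commit.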
What to build first
If you're starting from zero: build the bootstrap state backend first, then the VPC, then your compute and data layers on top of it. The order matters because the VPC is a dependency for almost everything else, and the state backend is a dependency for running Terraform safely at all.
And one last thing: read your plan output before applying it. Actually read it. The number of production incidents that trace back to someone typing yes without reading the plan is higher than it should be. Terraform tells you exactly what it's going to do. The discipline of treating terraform plan output as a thing worth reading — every time, not just when something feels risky — is what separates infrastructure engineers who've had incidents from those who've had fewer.
The infrastructure patterns in this article are not exotic. They're what most experienced AWS infrastructure engineers reach for by default. The gap between knowing them and shipping them consistently is, as usual, mostly habit.