Every Terraform project starts clean. Six months later, you're staring at a 2,000-line main.tf that nobody dares refactor because the last person who tried took down staging for a day. Sound familiar?
The difference between Terraform that scales and Terraform that crumbles isn't the cloud provider or the tooling — it's the patterns you adopt on day one. This article covers the production patterns I've refined across years of managing infrastructure on AWS and Azure, from directory layout to CI/CD pipelines.
Directory Structure That Scales
The structure below prevents the monolith problem by separating reusable modules from environment-specific configuration:
```
infrastructure/
├── modules/                  # Reusable modules
│   ├── networking/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── compute/
│   ├── database/
│   └── monitoring/
├── environments/             # Environment-specific configs
│   ├── dev/
│   │   ├── main.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   ├── staging/
│   └── prod/
├── global/                   # Shared resources (IAM, DNS)
│   ├── iam/
│   └── dns/
└── scripts/
    ├── plan.sh
    ├── apply.sh
    └── destroy-guard.sh
```
Key Principles
- Modules are reusable building blocks. They accept inputs, produce outputs, and contain zero environment-specific values.
- Environments compose modules with specific configurations. Each environment owns its own state file.
- Global holds resources shared across environments (IAM roles, DNS zones).
- Each environment can be planned and applied on its own, so a change in dev never risks a cross-environment blast radius.
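The tree above references a `scripts/destroy-guard.sh` that the article never shows. One possible sketch, assuming the guard simply refuses protected environment names (the prod/staging list is an assumption):

```shell
#!/usr/bin/env bash
# scripts/destroy-guard.sh (sketch): block `terraform destroy` against
# protected environments; call it from a destroy wrapper before terraform runs.

guard() {
  # $1 is an environment directory such as environments/prod
  case "$(basename "$1")" in
    prod|staging)
      echo "destroy-guard: refusing to destroy '$(basename "$1")'" >&2
      return 1
      ;;
  esac
}

guard "environments/dev" && echo "dev: destroy may proceed"
guard "environments/prod" || echo "prod: destroy blocked"
```

A destroy wrapper would run `guard "$ENV_DIR" && terraform -chdir="$ENV_DIR" destroy`, so the guard failing short-circuits the destroy entirely.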
State Management
Remote state with locking is non-negotiable for teams. Here's the setup for AWS.
AWS S3 Backend
```hcl
# environments/prod/backend.tf
terraform {
  backend "s3" {
    bucket         = "acmecorp-terraform-state"
    key            = "prod/infrastructure.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"

    # Cross-account state access; assume_role replaced the top-level
    # role_arn argument, which is deprecated since Terraform 1.6
    assume_role {
      role_arn = "arn:aws:iam::123456789012:role/TerraformStateAccess"
    }
  }
}
```
Bootstrap the State Backend
Run this once, manually, before anything else:
```hcl
# bootstrap/main.tf — Run this ONCE manually
provider "aws" {
  region = "eu-west-1"
}

resource "aws_s3_bucket" "terraform_state" {
  bucket = "acmecorp-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```
Writing Reusable Modules
A good module is self-contained, well-documented, and flexible without being over-engineered. Here's a networking module that demonstrates input validation, sensible defaults, and clean outputs.
Variables with Validation
```hcl
# modules/networking/variables.tf
variable "project_name" {
  description = "Project name used for resource naming"
  type        = string
}

variable "environment" {
  description = "Environment name (dev, staging, prod)"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
  default     = "10.0.0.0/16"

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "Must be a valid CIDR block."
  }
}

variable "availability_zones" {
  description = "List of AZs to use"
  type        = list(string)
  default     = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
}

variable "enable_nat_gateway" {
  description = "Enable NAT Gateway for private subnets"
  type        = bool
  default     = true
}

variable "single_nat_gateway" {
  description = "Use single NAT (cost saving for non-prod)"
  type        = bool
  default     = false
}

variable "tags" {
  description = "Additional tags for all resources"
  type        = map(string)
  default     = {}
}
```
Module Implementation
```hcl
# modules/networking/main.tf
locals {
  name_prefix = "${var.project_name}-${var.environment}"
  az_count    = length(var.availability_zones)

  common_tags = merge(var.tags, {
    Project     = var.project_name
    Environment = var.environment
    ManagedBy   = "terraform"
  })
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-vpc"
  })
}

resource "aws_subnet" "public" {
  count = local.az_count

  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(var.vpc_cidr, 8, count.index)
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-public-${var.availability_zones[count.index]}"
    Tier = "public"
  })
}

resource "aws_subnet" "private" {
  count = local.az_count

  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index + local.az_count)
  availability_zone = var.availability_zones[count.index]

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-private-${var.availability_zones[count.index]}"
    Tier = "private"
  })
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-igw"
  })
}

resource "aws_eip" "nat" {
  count  = var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : local.az_count) : 0
  domain = "vpc"

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-nat-eip-${count.index}"
  })
}

resource "aws_nat_gateway" "main" {
  count = var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : local.az_count) : 0

  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-nat-${count.index}"
  })
}
```
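One gap worth noting: the module as shown creates subnets, an internet gateway, and NAT gateways, but no route tables, so traffic would not actually route anywhere. A sketch of the missing wiring, following the module's own naming conventions (resource names here are assumptions):

```hcl
# modules/networking/main.tf (continued) — route tables, sketch only
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-public-rt"
  })
}

resource "aws_route_table_association" "public" {
  count          = local.az_count
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table" "private" {
  count  = var.enable_nat_gateway ? local.az_count : 0
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    # With a single NAT, every private route table points at gateway 0
    nat_gateway_id = aws_nat_gateway.main[var.single_nat_gateway ? 0 : count.index].id
  }

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-private-rt-${count.index}"
  })
}

resource "aws_route_table_association" "private" {
  count          = var.enable_nat_gateway ? local.az_count : 0
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}
```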
Module Outputs
```hcl
# modules/networking/outputs.tf
output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "public_subnet_ids" {
  description = "IDs of public subnets"
  value       = aws_subnet.public[*].id
}

output "private_subnet_ids" {
  description = "IDs of private subnets"
  value       = aws_subnet.private[*].id
}

output "nat_gateway_ips" {
  description = "Public IPs of NAT Gateways"
  value       = aws_eip.nat[*].public_ip
}
```
Consuming the Module
```hcl
# environments/prod/main.tf
terraform {
  required_version = ">= 1.7.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "eu-west-1"

  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Environment = "prod"
    }
  }
}

module "networking" {
  source = "../../modules/networking"

  project_name       = "myapp"
  environment        = "prod"
  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]

  enable_nat_gateway = true
  single_nat_gateway = false # HA NAT for prod

  tags = {
    CostCenter = "platform-team"
  }
}

module "database" {
  source = "../../modules/database"

  project_name      = "myapp"
  environment       = "prod"
  vpc_id            = module.networking.vpc_id
  subnet_ids        = module.networking.private_subnet_ids
  instance_class    = "db.r6g.xlarge"
  allocated_storage = 100
}
```
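For contrast, a dev environment consumes the same module with the cost-saving knobs flipped. This sketch uses illustrative values; only the module interface comes from the article:

```hcl
# environments/dev/main.tf — same module, cheaper settings (sketch)
module "networking" {
  source = "../../modules/networking"

  project_name       = "myapp"
  environment        = "dev"
  vpc_cidr           = "10.10.0.0/16"
  availability_zones = ["eu-west-1a", "eu-west-1b"]

  enable_nat_gateway = true
  single_nat_gateway = true # one shared NAT is fine outside prod
}
```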
Secrets Management
Never put secrets in .tfvars files or version control. Use a secrets manager and reference them at plan time:
```hcl
# Read secrets from AWS Secrets Manager
data "aws_secretsmanager_secret_version" "db_credentials" {
  secret_id = "prod/database/credentials"
}

locals {
  db_creds = jsondecode(
    data.aws_secretsmanager_secret_version.db_credentials.secret_string
  )
}

resource "aws_db_instance" "main" {
  # ... other config ...
  username = local.db_creds["username"]
  password = local.db_creds["password"]

  lifecycle {
    ignore_changes = [password] # Managed externally after creation
  }
}
```
Create the secret outside of Terraform — it should exist before terraform plan ever runs:
```shell
aws secretsmanager create-secret \
  --name "prod/database/credentials" \
  --secret-string '{"username":"admin","password":"CHANGE_ME_IMMEDIATELY"}'
```
CI/CD Pipeline for Terraform
Plan runs automatically on every PR; apply happens on merge to main, gated behind a protected `production` environment so a human approves the release. This GitHub Actions workflow detects which environments changed and only plans/applies those:
```yaml
# .github/workflows/terraform.yml
name: Terraform CI/CD

on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']

env:
  TF_VERSION: "1.7.0"
  AWS_REGION: "eu-west-1"

permissions:
  id-token: write
  contents: read
  pull-requests: write

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      environments: ${{ steps.changes.outputs.environments }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history so git diff can reach the comparison commit
      - id: changes
        run: |
          # On PRs, diff against the base branch; on push, against the previous SHA
          base="${{ github.event.pull_request.base.sha || github.event.before }}"
          envs=$(git diff --name-only "$base" ${{ github.sha }} \
            | { grep "infrastructure/environments/" || true; } \
            | cut -d'/' -f3 \
            | sort -u \
            | jq -R -s -c 'split("\n") | map(select(. != ""))')
          echo "environments=$envs" >> "$GITHUB_OUTPUT"

  plan:
    needs: detect-changes
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    strategy:
      matrix:
        environment: ${{ fromJson(needs.detect-changes.outputs.environments) }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/TerraformPlan
          aws-region: ${{ env.AWS_REGION }}
      - name: Terraform Init & Plan
        working-directory: infrastructure/environments/${{ matrix.environment }}
        run: |
          terraform init -input=false
          terraform plan -input=false -no-color -out=tfplan

  apply:
    needs: detect-changes
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: production
    strategy:
      matrix:
        environment: ${{ fromJson(needs.detect-changes.outputs.environments) }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/TerraformApply
          aws-region: ${{ env.AWS_REGION }}
      - name: Terraform Init & Apply
        working-directory: infrastructure/environments/${{ matrix.environment }}
        run: |
          terraform init -input=false
          terraform apply -input=false -auto-approve
```
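The detect-changes filter is easy to exercise locally: given a list of changed file paths, it should emit a JSON array of environment names. A sketch (requires `jq`; the file paths are illustrative):

```shell
# Simulate the changed-file filter from the detect-changes job.
changed_files='infrastructure/environments/prod/main.tf
infrastructure/environments/dev/backend.tf
infrastructure/modules/networking/main.tf'

# Same pipeline as the workflow: keep environment paths, take the third
# path segment (the environment name), de-duplicate, emit JSON.
envs=$(printf '%s\n' "$changed_files" \
  | grep "infrastructure/environments/" \
  | cut -d'/' -f3 \
  | sort -u \
  | jq -R -s -c 'split("\n") | map(select(. != ""))')

echo "$envs" # ["dev","prod"]
```

Note that the module change is ignored here; a stricter setup would map module changes onto every environment that consumes them.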
Anti-Patterns to Avoid
1. Hardcoded AMI IDs
```hcl
# BAD — what is this AMI? Will it exist next year?
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"
}

# GOOD — always resolves to the latest matching AMI
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type
}
```
2. Monolithic State Files
```hcl
# BAD: Everything in one state file.
# If networking breaks, you can't update compute independently.

# GOOD: Split by lifecycle and blast radius.
# infrastructure/environments/prod/networking/
# infrastructure/environments/prod/compute/
# infrastructure/environments/prod/database/
```
3. Missing Lifecycle Rules
```hcl
# Protect critical resources from accidental destruction
resource "aws_db_instance" "main" {
  # ... config ...

  lifecycle {
    prevent_destroy = true # Terraform will refuse to destroy this
    ignore_changes = [
      password, # Managed externally
    ]
  }
}
```
4. No Input Validation
```hcl
# Always validate inputs at the module boundary
variable "instance_type" {
  type = string

  validation {
    condition     = can(regex("^(t3|m6i|c6i)\\.", var.instance_type))
    error_message = "Instance type must be t3, m6i, or c6i family."
  }
}
```
Cost Tagging Strategy
Every resource should carry cost-allocation tags. Enforce this at the module level so teams can't skip it:
```hcl
locals {
  required_tags = {
    Project     = var.project_name
    Environment = var.environment
    ManagedBy   = "terraform"
    Team        = var.team_name
    CostCenter  = var.cost_center
  }
}

resource "aws_instance" "example" {
  # ... config ...
  tags = merge(local.required_tags, var.extra_tags)
}
```
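Validation can back this up, so an empty value fails at plan time instead of slipping into billing reports. A sketch using the variable names assumed by the locals above:

```hcl
# Reject blank cost-allocation values at the module boundary (sketch)
variable "cost_center" {
  description = "Cost-allocation code, e.g. platform-team"
  type        = string

  validation {
    condition     = length(trimspace(var.cost_center)) > 0
    error_message = "cost_center must not be empty."
  }
}
```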
Summary
Production Terraform is about discipline, not cleverness:
| Pattern | Why It Matters |
|---|---|
| Module-per-concern | Reusable, testable, composable |
| Environment-per-state | Blast radius isolation |
| Remote state + locking | Team safety |
| CI/CD with plan-on-PR | Review infra changes like code |
| Input validation | Fail fast with clear errors |
| Secrets in vault | Security baseline |
| Cost tags everywhere | No mystery AWS bills |
These patterns prevent the "Terraform spaghetti" that plagues most organizations. Adopt them early, and your infrastructure will thank you at scale.
If you found these patterns useful, check out the DataStack Pro collection for production-ready infrastructure templates, pipeline frameworks, and DevOps toolkits you can deploy today.