DatanestDigital

Posted on Mar 20 • Edited on Mar 23

Terraform Best Practices: Patterns That Survive Production

#terraform #aws #devops #infrastructure

Every Terraform project starts clean. Six months later, you're staring at a 2,000-line main.tf that nobody dares refactor because the last person who tried took down staging for a day. Sound familiar?

The difference between Terraform that scales and Terraform that crumbles isn't the cloud provider or the tooling — it's the patterns you adopt on day one. This article covers the production patterns I've refined across years of managing infrastructure on AWS and Azure, from directory layout to CI/CD pipelines.

Directory Structure That Scales

The structure below prevents the monolith problem by separating reusable modules from environment-specific configuration:

infrastructure/
├── modules/                    # Reusable modules
│   ├── networking/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── compute/
│   ├── database/
│   └── monitoring/
├── environments/               # Environment-specific configs
│   ├── dev/
│   │   ├── main.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   ├── staging/
│   └── prod/
├── global/                     # Shared resources (IAM, DNS)
│   ├── iam/
│   └── dns/
└── scripts/
    ├── plan.sh
    ├── apply.sh
    └── destroy-guard.sh

Key Principles

Modules are reusable building blocks. They accept inputs, produce outputs, and contain zero environment-specific values.
Environments compose modules with specific configurations. Each environment owns its own state file.
Global holds resources shared across environments (IAM roles, DNS zones).
Each environment can be planned and applied on its own, which keeps the blast radius of a bad change inside that environment.

State Management

Remote state with locking is non-negotiable for teams. Here's the setup for AWS.

AWS S3 Backend

# environments/prod/backend.tf
terraform {
  backend "s3" {
    bucket         = "acmecorp-terraform-state"
    key            = "prod/infrastructure.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"

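    # Cross-account state access
    # (the top-level role_arn argument is deprecated in the S3 backend since Terraform 1.6)
    assume_role = {
      role_arn = "arn:aws:iam::123456789012:role/TerraformStateAccess"
    }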
  }
}

Bootstrap the State Backend

Run this once, manually, before anything else:

# bootstrap/main.tf — Run this ONCE manually
provider "aws" {
  region = "eu-west-1"
}

resource "aws_s3_bucket" "terraform_state" {
  bucket = "acmecorp-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
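
The backend block earlier assumes a TerraformStateAccess role. As a rough sketch of what that role needs, the policy below grants only the S3 and DynamoDB actions the S3 backend uses to read, write, and lock state. It assumes it lives next to the bootstrap resources above; the file name, the policy name, and the data source label are illustrative and not part of the original setup.

```hcl
# bootstrap/state_access.tf (hypothetical file name)
data "aws_iam_policy_document" "state_access" {
  # List the bucket so Terraform can locate the state object
  statement {
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.terraform_state.arn]
  }

  # Read and write state objects
  statement {
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${aws_s3_bucket.terraform_state.arn}/*"]
  }

  # Acquire and release state locks
  statement {
    actions = [
      "dynamodb:DescribeTable",
      "dynamodb:GetItem",
      "dynamodb:PutItem",
      "dynamodb:DeleteItem",
    ]
    resources = [aws_dynamodb_table.terraform_locks.arn]
  }
}

resource "aws_iam_policy" "state_access" {
  name   = "TerraformStateAccess" # assumed name, mirrors the role
  policy = data.aws_iam_policy_document.state_access.json
}
```

Attaching this policy to the role, and deciding who may assume it, is left out here because it depends on your account layout.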

Writing Reusable Modules

A good module is self-contained, well-documented, and flexible without being over-engineered. Here's a networking module that demonstrates input validation, sensible defaults, and clean outputs.

Variables with Validation

# modules/networking/variables.tf
variable "project_name" {
  description = "Project name used for resource naming"
  type        = string
}

variable "environment" {
  description = "Environment name (dev, staging, prod)"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
  default     = "10.0.0.0/16"

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "Must be a valid CIDR block."
  }
}

variable "availability_zones" {
  description = "List of AZs to use"
  type        = list(string)
  default     = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
}

variable "enable_nat_gateway" {
  description = "Enable NAT Gateway for private subnets"
  type        = bool
  default     = true
}

variable "single_nat_gateway" {
  description = "Use single NAT (cost saving for non-prod)"
  type        = bool
  default     = false
}

variable "tags" {
  description = "Additional tags for all resources"
  type        = map(string)
  default     = {}
}
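
One gap worth considering: the module implementation in this article carves subnets by adding 8 bits to the VPC prefix, and AWS only accepts VPC and subnet sizes between /16 and /28. A /24 VPC would pass the validation above and then fail at apply time. The sketch below is a possible drop-in replacement for the vpc_cidr variable that adds a second validation block; the /16 to /20 bounds follow from that 8-bit split and are my assumption about how the module is used, not something the original states.

```hcl
variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
  default     = "10.0.0.0/16"

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "Must be a valid CIDR block."
  }

  # With 8 extra subnet bits, a /20 VPC yields /28 subnets, the AWS minimum.
  # try() turns a malformed value into a clean validation failure.
  validation {
    condition = try(
      tonumber(split("/", var.vpc_cidr)[1]) >= 16 &&
      tonumber(split("/", var.vpc_cidr)[1]) <= 20,
      false
    )
    error_message = "VPC prefix length must be between /16 and /20."
  }
}
```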

Module Implementation

# modules/networking/main.tf
locals {
  name_prefix = "${var.project_name}-${var.environment}"
  az_count    = length(var.availability_zones)

  common_tags = merge(var.tags, {
    Project     = var.project_name
    Environment = var.environment
    ManagedBy   = "terraform"
  })
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-vpc"
  })
}

resource "aws_subnet" "public" {
  count = local.az_count

  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(var.vpc_cidr, 8, count.index)
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-public-${var.availability_zones[count.index]}"
    Tier = "public"
  })
}

resource "aws_subnet" "private" {
  count = local.az_count

  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index + local.az_count)
  availability_zone = var.availability_zones[count.index]

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-private-${var.availability_zones[count.index]}"
    Tier = "private"
  })
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-igw"
  })
}

resource "aws_eip" "nat" {
  count  = var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : local.az_count) : 0
  domain = "vpc"

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-nat-eip-${count.index}"
  })
}

resource "aws_nat_gateway" "main" {
  count = var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : local.az_count) : 0

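  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id

  # The NAT gateway needs the VPC's internet gateway to exist first
  depends_on = [aws_internet_gateway.main]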

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-nat-${count.index}"
  })
}
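
The listing above creates gateways but no routes, so nothing would actually send traffic through them. A minimal sketch of the missing route tables might look like the following. It assumes one private route table per AZ and reuses the module's existing locals and variables; the resource labels are illustrative.

```hcl
# modules/networking/routes.tf (hypothetical file name)
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  # Public subnets reach the internet through the internet gateway
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-public-rt"
  })
}

resource "aws_route_table_association" "public" {
  count = local.az_count

  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table" "private" {
  count  = local.az_count
  vpc_id = aws_vpc.main.id

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-private-rt-${count.index}"
  })
}

resource "aws_route" "private_nat" {
  count = var.enable_nat_gateway ? local.az_count : 0

  route_table_id         = aws_route_table.private[count.index].id
  destination_cidr_block = "0.0.0.0/0"

  # With a single NAT gateway, every AZ routes through index 0
  nat_gateway_id = aws_nat_gateway.main[var.single_nat_gateway ? 0 : count.index].id
}

resource "aws_route_table_association" "private" {
  count = local.az_count

  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}
```

Keeping one private route table per AZ even in single-NAT mode means switching single_nat_gateway later only changes route targets, not the table layout.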

Module Outputs

# modules/networking/outputs.tf
output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "public_subnet_ids" {
  description = "IDs of public subnets"
  value       = aws_subnet.public[*].id
}

output "private_subnet_ids" {
  description = "IDs of private subnets"
  value       = aws_subnet.private[*].id
}

output "nat_gateway_ips" {
  description = "Public IPs of NAT Gateways"
  value       = aws_eip.nat[*].public_ip
}

Consuming the Module

# environments/prod/main.tf
terraform {
  required_version = ">= 1.7.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "eu-west-1"

  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Environment = "prod"
    }
  }
}

module "networking" {
  source = "../../modules/networking"

  project_name       = "myapp"
  environment        = "prod"
  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  enable_nat_gateway = true
  single_nat_gateway = false  # HA NAT for prod

  tags = {
    CostCenter = "platform-team"
  }
}

module "database" {
  source = "../../modules/database"

  project_name      = "myapp"
  environment       = "prod"
  vpc_id            = module.networking.vpc_id
  subnet_ids        = module.networking.private_subnet_ids
  instance_class    = "db.r6g.xlarge"
  allocated_storage = 100
}
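
For contrast, a non-production environment can call the same module with cheaper settings. The sketch below shows what environments/dev/main.tf might contain (the terraform and provider blocks are omitted for brevity); the dev CIDR range and the two-AZ choice are assumptions made for illustration.

```hcl
# environments/dev/main.tf
module "networking" {
  source = "../../modules/networking"

  project_name       = "myapp"
  environment        = "dev"
  vpc_cidr           = "10.1.0.0/16" # assumed range, chosen not to overlap prod
  availability_zones = ["eu-west-1a", "eu-west-1b"]
  enable_nat_gateway = true
  single_nat_gateway = true # one shared NAT gateway to save cost

  tags = {
    CostCenter = "platform-team"
  }
}
```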

Secrets Management

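Never put secrets in .tfvars files or version control. Use a secrets manager and read the values at plan time. Keep in mind that anything Terraform reads this way is still written to the state file in plain text, so the encrypted, access-controlled state backend is part of your secret handling: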

# Read secrets from AWS Secrets Manager
data "aws_secretsmanager_secret_version" "db_credentials" {
  secret_id = "prod/database/credentials"
}

locals {
  db_creds = jsondecode(
    data.aws_secretsmanager_secret_version.db_credentials.secret_string
  )
}

resource "aws_db_instance" "main" {
  # ... other config ...
  username = local.db_creds["username"]
  password = local.db_creds["password"]

  lifecycle {
    ignore_changes = [password]  # Managed externally after creation
  }
}

Create the secret outside of Terraform — it should exist before terraform plan ever runs:

aws secretsmanager create-secret \
  --name "prod/database/credentials" \
  --secret-string '{"username":"admin","password":"CHANGE_ME_IMMEDIATELY"}'
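
If you would rather keep the password out of Terraform state altogether, one alternative (not the approach shown above) is to let RDS generate and store the master password in Secrets Manager itself. The sketch below uses the AWS provider's manage_master_user_password argument; the identifier, username, and resource label are placeholders.

```hcl
resource "aws_db_instance" "managed_secret" {
  identifier        = "myapp-prod" # assumed identifier
  engine            = "postgres"
  instance_class    = "db.r6g.xlarge"
  allocated_storage = 100
  username          = "dbadmin" # assumed username

  # RDS creates the master password and stores it in Secrets Manager;
  # no password argument is set, so none lands in state.
  manage_master_user_password = true
}

# Applications look up the password through this ARN at runtime
output "db_master_secret_arn" {
  description = "ARN of the RDS-managed master user secret"
  value       = aws_db_instance.managed_secret.master_user_secret[0].secret_arn
}
```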

CI/CD Pipeline for Terraform

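Plan automatically on every pull request; apply on merge to main, gated by a manual approval. The approval comes from the production GitHub environment referenced in the apply job, which needs required reviewers configured in the repository settings (without them, -auto-approve applies immediately). This GitHub Actions workflow detects which environments changed and only plans or applies those: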

# .github/workflows/terraform.yml
name: Terraform CI/CD

on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']

env:
  TF_VERSION: "1.7.0"
  AWS_REGION: "eu-west-1"

permissions:
  id-token: write
  contents: read
  pull-requests: write

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      environments: ${{ steps.changes.outputs.environments }}
    steps:
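      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history, so both commits exist for git diff
      - id: changes
        env:
          # github.event.before is not reliably set on pull_request events,
          # so PRs diff against the base branch commit instead
          BASE_SHA: ${{ github.event.pull_request.base.sha || github.event.before }}
          HEAD_SHA: ${{ github.sha }}
        run: |
          envs=$(git diff --name-only "$BASE_SHA" "$HEAD_SHA" \
            | grep "^infrastructure/environments/" \
            | cut -d'/' -f3 \
            | sort -u \
            | jq -R -s -c 'split("\n") | map(select(. != ""))')
          echo "environments=$envs" >> "$GITHUB_OUTPUT"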

  plan:
    needs: detect-changes
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request' && needs.detect-changes.outputs.environments != '[]'
    strategy:
      matrix:
        environment: ${{ fromJson(needs.detect-changes.outputs.environments) }}
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/TerraformPlan
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform Init & Plan
        working-directory: infrastructure/environments/${{ matrix.environment }}
        run: |
          terraform init -input=false
          terraform plan -input=false -no-color -out=tfplan

  apply:
    needs: detect-changes
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push' && needs.detect-changes.outputs.environments != '[]'
    environment: production
    strategy:
      matrix:
        environment: ${{ fromJson(needs.detect-changes.outputs.environments) }}
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/TerraformApply
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform Init & Apply
        working-directory: infrastructure/environments/${{ matrix.environment }}
        run: |
          terraform init -input=false
          terraform apply -input=false -auto-approve
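
The workflow assumes the TerraformPlan and TerraformApply roles trust GitHub's OIDC tokens, which is what the id-token: write permission is for. A minimal sketch of that trust relationship for the plan role might look like this. It assumes the GitHub OIDC provider already exists in the account and is passed in by ARN; the variable name and the acmecorp/infrastructure repository are hypothetical.

```hcl
variable "github_oidc_provider_arn" {
  description = "ARN of the existing IAM OIDC provider for token.actions.githubusercontent.com"
  type        = string
}

data "aws_iam_policy_document" "github_plan_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.github_oidc_provider_arn]
    }

    # Only tokens issued for AWS STS
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only workflows from this repository (hypothetical name)
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:acmecorp/infrastructure:*"]
    }
  }
}

resource "aws_iam_role" "terraform_plan" {
  name               = "TerraformPlan"
  assume_role_policy = data.aws_iam_policy_document.github_plan_trust.json
}
```

For the apply role, a tighter sub condition such as one limited to the main branch is worth considering, so that pull request workflows cannot assume it.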

Anti-Patterns to Avoid

1. Hardcoded AMI IDs

# BAD — what is this AMI? Will it exist next year?
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"
}

# GOOD — always resolves to the latest matching AMI
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"]  # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type
}
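
One caveat with most_recent = true: changing an instance's ami forces replacement, so a newly published image will show up as a destroy-and-recreate on the next plan. If that is not what you want for long-lived instances, a possible variant is to resolve the AMI at creation time and ignore later drift. This is a sketch of that trade-off, not a recommendation for every workload.

```hcl
resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type

  lifecycle {
    # Keep running instances on the AMI they launched with;
    # new images are picked up only when the instance is recreated.
    ignore_changes = [ami]
  }
}
```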

2. Monolithic State Files

# BAD: Everything in one state file.
# If networking breaks, you can't update compute independently.

# GOOD: Split by lifecycle and blast radius.
# infrastructure/environments/prod/networking/
# infrastructure/environments/prod/compute/
# infrastructure/environments/prod/database/
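
Once state is split, stacks need a way to share values. One common option is the terraform_remote_state data source, sketched below for a compute stack reading networking outputs. The state key, the compute module's input names, and the assumption that the networking stack re-exports vpc_id and private_subnet_ids as root outputs are all illustrative.

```hcl
# infrastructure/environments/prod/compute/main.tf
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "acmecorp-terraform-state"
    key    = "prod/networking.tfstate" # assumed key for the networking stack
    region = "eu-west-1"
  }
}

module "compute" {
  source = "../../../modules/compute"

  # Only root-level outputs of the networking stack are visible here
  vpc_id     = data.terraform_remote_state.networking.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```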

3. Missing Lifecycle Rules

# Protect critical resources from accidental destruction
resource "aws_db_instance" "main" {
  # ... config ...

  lifecycle {
    prevent_destroy = true  # Any plan that would destroy this resource fails

    # Read-only attributes such as latest_restorable_time need no entry here
    ignore_changes = [
      password,  # Managed externally
    ]
  }
}

4. No Input Validation

# Always validate inputs at the module boundary
variable "instance_type" {
  type = string

  validation {
    condition     = can(regex("^(t3|m6i|c6i)\\.", var.instance_type))
    error_message = "Instance type must be t3, m6i, or c6i family."
  }
}

Cost Tagging Strategy

Every resource should carry cost-allocation tags. Enforce this at the module level so teams can't skip it:

locals {
  required_tags = {
    Project     = var.project_name
    Environment = var.environment
    ManagedBy   = "terraform"
    Team        = var.team_name
    CostCenter  = var.cost_center
  }
}

resource "aws_instance" "example" {
  # ... config ...
  tags = merge(var.extra_tags, local.required_tags)  # later arguments win, so required tags cannot be overridden
}
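
The snippet above references team_name, cost_center, and extra_tags without declaring them. A minimal sketch of those declarations, with validation so that empty values are rejected at plan time, could look like this; the descriptions and error messages are my own wording.

```hcl
variable "team_name" {
  description = "Owning team, used for the Team tag"
  type        = string

  validation {
    condition     = length(trimspace(var.team_name)) > 0
    error_message = "team_name must not be empty."
  }
}

variable "cost_center" {
  description = "Cost center, used for the CostCenter tag"
  type        = string

  validation {
    condition     = length(trimspace(var.cost_center)) > 0
    error_message = "cost_center must not be empty."
  }
}

variable "extra_tags" {
  description = "Optional tags added on top of the required set"
  type        = map(string)
  default     = {}
}
```

Because neither required variable has a default, a caller that omits them gets an error instead of silently untagged resources.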

Summary

Production Terraform is about discipline, not cleverness:

Pattern	Why It Matters
Module-per-concern	Reusable, testable, composable
Environment-per-state	Blast radius isolation
Remote state + locking	Team safety
CI/CD with plan-on-PR	Review infra changes like code
Input validation	Fail fast with clear errors
Secrets in a secrets manager	Security baseline
Cost tags everywhere	No mystery AWS bills

These patterns prevent the "Terraform spaghetti" that plagues most organizations. Adopt them early, and your infrastructure will thank you at scale.

