# Terraform Best Practices Guide
A field-tested collection of patterns for writing maintainable, secure, and team-friendly Terraform configurations. These practices come from managing production AWS infrastructure across dozens of projects.
## 1. Project Structure
The single most impactful decision is how you organize files. A flat directory with everything in main.tf works for tutorials but breaks down fast in real projects.
Recommended layout:

```
project/
├── backend.tf          # Provider and backend config
├── variables.tf        # All input variables
├── outputs.tf          # All outputs
├── terraform.tfvars    # Variable values (git-ignored)
├── modules/
│   ├── vpc/
│   ├── ecs/
│   └── rds/
└── environments/
    ├── dev/main.tf     # Composes modules for dev
    └── prod/main.tf    # Composes modules for prod
```
Why separate environments into directories instead of workspaces? Workspaces share the same backend config and state bucket key prefix. If you need different provider configurations, different module versions, or different backend settings per environment, directory-based separation is cleaner. Workspaces work well for identical environments that differ only in variable values.
## 2. Module Design
Good modules are the building blocks of maintainable infrastructure. Follow these principles:
### Keep modules focused
Each module should manage one logical resource group. A VPC module creates a VPC, subnets, route tables, and gateways. It should not also create EC2 instances or RDS databases.
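As a sketch, a caller composing focused modules might look like this (the module names and variables are illustrative, not part of any specific kit):

```hcl
# Hypothetical call to a focused VPC module: networking only.
module "vpc" {
  source             = "./modules/vpc"
  cidr_block         = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

# Compute and databases get their own module calls (e.g. modules/ecs,
# modules/rds), wired together via module outputs rather than bundled
# into one oversized module.
```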
### Expose configuration, hide implementation
```hcl
# Good: the caller decides the behavior
variable "multi_az" {
  type    = bool
  default = true
}

# Bad: the caller decides an implementation detail
variable "availability_zone_count" {
  type    = number
  default = 3
}
```
### Always set sensible defaults
Every variable should have a default that works for the most common case. This lets new team members use the module immediately without reading every variable description.
```hcl
variable "instance_class" {
  type        = string
  default     = "db.t3.medium"
  description = "RDS instance class. Use db.r6g.* for production workloads."
}
```
### Use validation blocks for input constraints
Catch configuration mistakes at plan time instead of discovering them during apply:
```hcl
variable "environment" {
  type = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be one of: dev, staging, prod."
  }
}

variable "vpc_cidr" {
  type = string

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "Must be a valid CIDR block."
  }
}
```
### Output everything the caller might need
If a module creates a resource, output its ID, ARN, and any connection strings. It costs nothing and saves future refactoring.
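For example, an RDS module might expose outputs like these (the resource and output names here are illustrative):

```hcl
# Hypothetical outputs for an RDS module.
output "db_instance_id" {
  value = aws_db_instance.main.id
}

output "db_instance_arn" {
  value = aws_db_instance.main.arn
}

output "db_endpoint" {
  value       = aws_db_instance.main.endpoint
  description = "Connection endpoint in host:port form"
}
```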
## 3. State Management
Terraform state is the source of truth for what infrastructure exists. Mismanaging state is the #1 cause of Terraform disasters.
### Always use remote state
Local state files get lost, cannot be shared, and offer no locking. Use S3 + DynamoDB (AWS), GCS (GCP), or Azure Blob Storage as your backend.
```hcl
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "project/env/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```
### Enable state locking
Without locking, two engineers running terraform apply simultaneously will corrupt state. DynamoDB-based locking is the standard for AWS.
### Separate state per environment
Never share state between dev and prod. A mistake in dev should not be able to affect prod resources. Use different S3 keys:
- `dev/terraform.tfstate`
- `staging/terraform.tfstate`
- `prod/terraform.tfstate`
### Encrypt state at rest
State files contain sensitive data: database passwords, private keys, API tokens. Always enable server-side encryption on your state bucket and consider using a KMS key for additional control.
### Never edit state manually
If you need to move or remove resources from state, use terraform state mv and terraform state rm. Hand-editing terraform.tfstate will break checksums and can corrupt your infrastructure mapping.
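For example (the resource addresses here are hypothetical):

```bash
# Rename a resource in state after a refactor, without destroying it
terraform state mv aws_instance.web aws_instance.app

# Stop tracking a resource in state without deleting the real infrastructure
terraform state rm aws_s3_bucket.legacy
```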
## 4. Security Patterns
### Never hardcode secrets
Use AWS Secrets Manager or SSM Parameter Store, and reference them in Terraform:
```hcl
resource "aws_db_instance" "main" {
  # Let AWS manage the password in Secrets Manager
  manage_master_user_password = true
}
```
For secrets needed during apply (API keys, tokens), use environment variables:
```bash
export TF_VAR_datadog_api_key="abc123"
terraform apply
```
### Use IAM roles, not access keys
For CI/CD pipelines, use OIDC federation (GitHub Actions, GitLab CI) or IAM roles (EC2, ECS) instead of long-lived access keys. This starter kit includes an OIDC provider configuration for GitHub Actions.
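A minimal sketch of a GitHub Actions OIDC provider in Terraform might look like this (the thumbprint value is a placeholder you must obtain from the provider's TLS certificate, and a trust policy on the assumed role is still required):

```hcl
# Sketch only: lets GitHub Actions assume AWS roles via OIDC instead of access keys.
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  # Placeholder thumbprint; fetch the current one for the GitHub OIDC endpoint
  thumbprint_list = ["0000000000000000000000000000000000000000"]
}
```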
### Apply least-privilege IAM policies
Scope IAM policies to specific resources using ARN patterns:
```hcl
# Good: scoped to specific bucket
Resource = "arn:aws:s3:::myapp-prod-*/*"

# Bad: wildcard access
Resource = "*"
```
### Block public access by default
S3 buckets, RDS instances, and Elasticsearch domains should never be publicly accessible. Use security groups and bucket policies to control access.
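For S3, a bucket-level public access block covers all four public-access vectors; a sketch (the bucket reference is illustrative):

```hcl
# Deny all forms of public access on a bucket: a safe default
resource "aws_s3_bucket_public_access_block" "main" {
  bucket = aws_s3_bucket.main.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```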
## 5. Tagging Strategy
Consistent tagging is critical for cost allocation, access control, and incident response.
### Use `default_tags` in the provider
Apply organization-wide tags automatically:
```hcl
provider "aws" {
  default_tags {
    tags = {
      Project     = var.project_name
      Environment = var.environment
      ManagedBy   = "terraform"
      Team        = var.team
    }
  }
}
```
### Required tags for every resource
| Tag | Purpose | Example |
|---|---|---|
| `Project` | Group resources by project | `myapp` |
| `Environment` | Identify the environment | `prod` |
| `ManagedBy` | Distinguish IaC from manual changes | `terraform` |
| `Team` | Cost allocation and ownership | `platform` |
## 6. CI/CD Integration
### Plan on PR, apply on merge
Never auto-apply on pull request. The workflow should be:
1. PR is opened: `terraform plan` runs and posts the plan as a PR comment.
2. The team reviews the plan diff alongside the code changes.
3. The PR is merged to main: `terraform apply -auto-approve` runs.
### Use `-no-color` for CI logs
Terraform color codes don't render well in CI logs. Always pass -no-color in CI pipelines.
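A typical CI invocation might look like this (the plan-file name is arbitrary):

```bash
# No color codes, no interactive prompts, and a saved plan
# that the apply step can consume exactly as reviewed.
terraform plan -no-color -input=false -out=tfplan
terraform apply -no-color -input=false tfplan
```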
### Pin provider and Terraform versions
Inconsistent versions across team members and CI cause drift:
```hcl
terraform {
  required_version = "~> 1.7.0" # Allow 1.7.x patches only

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40"
    }
  }
}
```
## 7. Cost Optimization
### Use `count` and `for_each` to conditionally create resources
Don't pay for NAT Gateways in dev if you don't need them:
```hcl
resource "aws_nat_gateway" "main" {
  # Prod: one NAT per AZ for HA. Dev: one NAT to save ~$32/month per gateway.
  count = var.environment == "prod" ? length(var.availability_zones) : 1
}
```
### Right-size from the start
Use db.t3.medium in dev and db.r6g.large in prod. Make instance classes a variable with per-environment defaults.
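One way to express per-environment defaults, assuming an `environment` variable like the one validated earlier, is a simple lookup map (the names are illustrative):

```hcl
# Per-environment instance classes in one place
variable "instance_class_by_env" {
  type = map(string)
  default = {
    dev     = "db.t3.medium"
    staging = "db.t3.medium"
    prod    = "db.r6g.large"
  }
}

locals {
  # Resolved once, then referenced wherever an instance class is needed
  instance_class = var.instance_class_by_env[var.environment]
}
```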
### Set lifecycle rules on S3 buckets
Transition objects to cheaper storage tiers automatically:
```hcl
# Inside the rule block of an aws_s3_bucket_lifecycle_configuration resource:
transition {
  days          = 90
  storage_class = "STANDARD_IA" # ~40% cheaper for infrequent access
}

transition {
  days          = 365
  storage_class = "GLACIER" # ~80% cheaper for archival
}
```
## 8. Testing and Validation
### Run `terraform validate` and `terraform fmt` in CI
These catch syntax errors and enforce consistent formatting with zero configuration.
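A minimal CI step for this could be:

```bash
# Fail the build on formatting drift or invalid configuration
terraform fmt -check -recursive
terraform init -backend=false # providers are needed for validate, state is not
terraform validate
```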
### Use `terraform plan` as a test
A clean plan against an existing environment confirms your changes are additive and non-destructive. Watch for destroy actions (resources marked `-` or `-/+`) in the plan output; they usually indicate a mistake.
### Consider Terratest for critical modules
For modules that manage production databases or networking, write Go tests with Terratest that create real infrastructure, validate it, and tear it down.
## 9. Common Pitfalls
### Forgetting `lifecycle.create_before_destroy`
Security groups, parameter groups, and IAM policies often need replacement. Without this lifecycle rule, Terraform deletes the old resource before creating the new one, causing downtime.
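A sketch of the fix for a security group (note that `name_prefix` lets the old and new groups coexist briefly, which a fixed `name` would prevent):

```hcl
resource "aws_security_group" "app" {
  # Prefix instead of a fixed name, so the replacement can be
  # created while the old group still exists
  name_prefix = "app-"

  lifecycle {
    create_before_destroy = true
  }
}
```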
### Ignoring `prevent_destroy` for stateful resources
Databases and S3 buckets with important data should use prevent_destroy:
```hcl
lifecycle {
  prevent_destroy = true
}
```
### Not using `depends_on` when needed
Most dependencies are inferred from resource references. But some (like IAM policy propagation) need explicit depends_on to avoid race conditions during apply.
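A common example is waiting for an IAM policy attachment before creating a resource that assumes the role (the resource names here are illustrative):

```hcl
resource "aws_lambda_function" "app" {
  # ... function configuration omitted

  # Terraform cannot infer this: wait for the attachment so the role's
  # permissions have propagated before the function first assumes it.
  depends_on = [aws_iam_role_policy_attachment.lambda_exec]
}
```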
### Over-using `terraform import`
If you find yourself importing many resources, consider whether Terraform is the right tool for that resource. Some resources (DNS records managed by external teams, legacy VPCs) are better left outside Terraform.
Part of the Terraform Starter Kit by Datanest Digital (datanest.dev)