Part 3: Infrastructure as Code — Terraform Modules + Terragrunt
Part of the series: Building a Production-Grade DevSecOps Pipeline on AWS
Introduction
Plain Terraform works fine for a single environment. But this pipeline has 6 clusters across 3 environments and 2 regions — 18+ Terragrunt child directories. Without a DRY strategy, you end up copy-pasting the same provider, backend, and module blocks everywhere, and a single account ID change means updating 18 files.
Terragrunt solves this with two mechanisms:
- `include` — child configs inherit the root config's provider generation and remote state
- `dependency` — explicit ordering ensures the VPC exists before EKS, KMS before EKS, and so on
The result: each child terragrunt.hcl is typically 10–30 lines of pure inputs, with all boilerplate generated automatically.
Repository Layout
myapp-infra/
├── _modules/ # Reusable Terraform modules (no Terragrunt here)
│ ├── vpc/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ ├── eks/
│ ├── kms/
│ ├── iam/
│ ├── ecr/
│ ├── waf/
│ ├── guardduty/
│ ├── eso-irsa/
│ ├── fluent-bit-irsa/
│ ├── karpenter/
│ └── velero/
│
└── live/ # Terragrunt wrappers — one dir per resource per env/region
├── terragrunt.hcl # ROOT config (provider + backend generation)
├── dev/
│ ├── us-east-1/
│ │ ├── vpc/
│ │ │ └── terragrunt.hcl
│ │ ├── kms/
│ │ │ └── terragrunt.hcl
│ │ ├── eks/
│ │ │ └── terragrunt.hcl
│ │ └── iam/
│ │ └── terragrunt.hcl
│ └── us-west-2/
│ └── ... (mirror)
├── staging/
│ └── ...
└── production/
└── ...
Key principle: modules in _modules/ are pure Terraform — no Terragrunt, no state config, no provider config. They are just reusable building blocks. The live/ tree contains nothing but thin Terragrunt wrappers that call those modules with environment-specific values.
Dependency Ordering
┌─────────────────────────────────────────────────────────────┐
│ APPLY ORDER (Terragrunt resolves this from dependency graph)│
│ │
│ 1. kms (no dependencies) │
│ 2. vpc (no dependencies) │
│ 3. eks (depends on: vpc, kms) │
│ 4. iam (depends on: eks — needs OIDC provider URL)│
│ 5. eso-irsa (depends on: eks, iam) │
│ 6. fluent-bit-irsa (depends on: eks) │
│ 7. karpenter (depends on: eks, iam) │
│ 8. velero (depends on: eks) │
│ 9. waf (no dependencies) │
│ 10. guardduty (no dependencies) │
└─────────────────────────────────────────────────────────────┘
Run everything in order automatically:
cd live/production/us-east-1
terragrunt run-all apply
Terragrunt reads all dependency blocks, builds a DAG, and applies in the correct order.
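As a sketch of what those dependency blocks look like (the paths, mock values, and the `key_arn` output name here are illustrative assumptions, not taken from the repo), an EKS child config might wire in the VPC and KMS outputs like this:

```hcl
# live/production/us-east-1/eks/terragrunt.hcl (illustrative sketch)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../../../_modules/eks"
}

dependency "vpc" {
  config_path = "../vpc"

  # Mock outputs let `terragrunt run-all plan` succeed before the VPC exists
  mock_outputs = {
    vpc_id             = "vpc-00000000"
    private_subnet_ids = ["subnet-00000000"]
  }
}

dependency "kms" {
  config_path = "../kms"
  mock_outputs = {
    key_arn = "arn:aws:kms:us-east-1:111111111111:key/mock"
  }
}

inputs = {
  cluster_name       = "myapp-production-use1"
  vpc_id             = dependency.vpc.outputs.vpc_id
  private_subnet_ids = dependency.vpc.outputs.private_subnet_ids
  kms_key_arn        = dependency.kms.outputs.key_arn
  public_api         = false
}
```

Terragrunt follows each `config_path` to build the DAG, which is how `run-all apply` knows eks must come after vpc and kms.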
Root Terragrunt Config
# live/terragrunt.hcl
locals {
  path_parts = split("/", path_relative_to_include())
  env        = local.path_parts[0] # dev | staging | production
  region     = local.path_parts[1] # us-east-1 | us-west-2

  account_ids = {
    dev        = "557702566877"
    staging    = "YOUR_STAGING_ACCOUNT_ID"
    production = "591120834781"
  }
  account_id = local.account_ids[local.env]

  # Region short alias for naming (avoids long names hitting IAM limits)
  region_alias = local.region == "us-east-1" ? "use1" : "usw2"
  cluster_name = "myapp-${local.env}-${local.region_alias}"
}

# Auto-generate provider.tf in every child directory
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<-EOF
    provider "aws" {
      region = "${local.region}"
      assume_role {
        role_arn = "arn:aws:iam::${local.account_id}:role/OrganizationAccountAccessRole"
      }
      default_tags {
        tags = {
          Environment = "${local.env}"
          Region      = "${local.region}"
          ManagedBy   = "Terraform"
          Project     = "myapp"
          Cluster     = "${local.cluster_name}"
        }
      }
    }
  EOF
}

# Auto-generate backend.tf — per-module state file in S3
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "myapp-terraform-state-${local.account_id}"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1" # State always in us-east-1 regardless of resource region
    encrypt        = true
    dynamodb_table = "myapp-terraform-locks"
    role_arn       = "arn:aws:iam::${local.account_id}:role/OrganizationAccountAccessRole"
  }
}
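To make the path-driven generation concrete: for a child such as `live/production/us-east-1/vpc`, `path_relative_to_include()` returns `production/us-east-1/vpc`, so the generated file should render roughly as follows (derived by substituting the root config's locals into the template above):

```hcl
# provider.tf — as generated by Terragrunt in live/production/us-east-1/vpc
provider "aws" {
  region = "us-east-1"
  assume_role {
    role_arn = "arn:aws:iam::591120834781:role/OrganizationAccountAccessRole"
  }
  default_tags {
    tags = {
      Environment = "production"
      Region      = "us-east-1"
      ManagedBy   = "Terraform"
      Project     = "myapp"
      Cluster     = "myapp-production-use1"
    }
  }
}
```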
VPC Module
The VPC is the network foundation everything else sits in. Each environment gets its own VPC per region — 6 VPCs total.
CIDR allocation:
dev us-east-1: 10.0.0.0/16
dev us-west-2: 10.1.0.0/16
staging us-east-1: 10.10.0.0/16
staging us-west-2: 10.11.0.0/16
prod us-east-1: 10.20.0.0/16
prod us-west-2: 10.21.0.0/16
# _modules/vpc/main.tf
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = var.vpc_name
  cidr = var.vpc_cidr

  azs = [
    "${var.region}a",
    "${var.region}b",
    "${var.region}c",
  ]

  # Public subnets — for NAT Gateways, Internet-facing ALBs
  public_subnets = [
    cidrsubnet(var.vpc_cidr, 8, 0), # x.x.0.0/24
    cidrsubnet(var.vpc_cidr, 8, 1), # x.x.1.0/24
    cidrsubnet(var.vpc_cidr, 8, 2), # x.x.2.0/24
  ]

  # Private subnets — EKS nodes, RDS, ElastiCache
  # newbits = 5 carves the /16 into /21 blocks (2048 IPs each)
  private_subnets = [
    cidrsubnet(var.vpc_cidr, 5, 1), # x.x.8.0/21
    cidrsubnet(var.vpc_cidr, 5, 2), # x.x.16.0/21
    cidrsubnet(var.vpc_cidr, 5, 3), # x.x.24.0/21
  ]

  enable_nat_gateway = true
  single_nat_gateway = var.single_nat_gateway # true for dev (cost), false for prod (HA)
  enable_vpn_gateway = false

  enable_dns_hostnames = true
  enable_dns_support   = true

  # Required tags for AWS Load Balancer Controller to discover subnets
  public_subnet_tags = {
    "kubernetes.io/role/elb"                    = "1"
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }
  private_subnet_tags = {
    "kubernetes.io/role/internal-elb"           = "1"
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
    "karpenter.sh/discovery"                    = var.cluster_name # Karpenter node discovery
  }

  # VPC Flow Logs for network traffic auditing
  enable_flow_log                      = true
  create_flow_log_cloudwatch_log_group = true
  create_flow_log_cloudwatch_iam_role  = true
  flow_log_max_aggregation_interval    = 60
}
# _modules/vpc/outputs.tf
output "vpc_id" { value = module.vpc.vpc_id }
output "private_subnet_ids" { value = module.vpc.private_subnets }
output "public_subnet_ids" { value = module.vpc.public_subnets }
output "vpc_cidr_block" { value = module.vpc.vpc_cidr_block }
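The repository tree lists a `variables.tf` for this module that the post doesn't show. A minimal version, inferred from the inputs used in `main.tf` and the Terragrunt child config (a sketch, not the repo's actual file), would be:

```hcl
# _modules/vpc/variables.tf (inferred sketch)
variable "vpc_name" {
  type        = string
  description = "Name for the VPC, e.g. myapp-production-use1"
}

variable "vpc_cidr" {
  type        = string
  description = "VPC CIDR block, e.g. 10.20.0.0/16"
}

variable "region" {
  type        = string
  description = "AWS region, used to construct AZ names"
}

variable "cluster_name" {
  type        = string
  description = "EKS cluster name, used in subnet discovery tags"
}

variable "single_nat_gateway" {
  type        = bool
  default     = false
  description = "One shared NAT gateway (cheap, dev) vs one per AZ (HA, prod)"
}
```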
Terragrunt child config:
# live/production/us-east-1/vpc/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../../../_modules/vpc"
}

inputs = {
  vpc_name           = "myapp-production-use1"
  vpc_cidr           = "10.20.0.0/16"
  region             = "us-east-1"
  cluster_name       = "myapp-production-use1"
  single_nat_gateway = false # HA: one NAT GW per AZ
}
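For contrast, the dev counterpart (an illustrative sketch built from the CIDR table and the naming convention above, not a file shown in the post) differs only in its inputs:

```hcl
# live/dev/us-east-1/vpc/terragrunt.hcl (illustrative)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../../../_modules/vpc"
}

inputs = {
  vpc_name           = "myapp-dev-use1"
  vpc_cidr           = "10.0.0.0/16"
  region             = "us-east-1"
  cluster_name       = "myapp-dev-use1"
  single_nat_gateway = true # one shared NAT GW: cheaper, acceptable for dev
}
```

The environment delta lives entirely in these inputs; everything else is inherited from the root config.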
KMS Module
# _modules/kms/main.tf

# Handle the AWSServiceRoleForAutoScaling chicken-and-egg problem.
# In a fresh account this SLR doesn't exist yet, so we optionally create it.
resource "aws_iam_service_linked_role" "autoscaling" {
  count            = var.create_autoscaling_slr ? 1 : 0
  aws_service_name = "autoscaling.amazonaws.com"
}

# Wait 10s for IAM to propagate before referencing it in the KMS key policy
resource "null_resource" "wait_for_slr" {
  count      = var.create_autoscaling_slr ? 1 : 0
  depends_on = [aws_iam_service_linked_role.autoscaling]

  provisioner "local-exec" {
    command = "sleep 10"
  }
}

resource "aws_kms_key" "main" {
  depends_on              = [null_resource.wait_for_slr]
  description             = "${var.env}-${var.region}-main"
  deletion_window_in_days = 30
  enable_key_rotation     = true

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid       = "RootFullAccess"
        Effect    = "Allow"
        Principal = { AWS = "arn:aws:iam::${var.account_id}:root" }
        Action    = "kms:*"
        Resource  = "*"
      },
      {
        Sid    = "AutoScalingSLR"
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::${var.account_id}:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"
        }
        Action = [
          "kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*",
          "kms:GenerateDataKey*", "kms:DescribeKey", "kms:CreateGrant"
        ]
        Resource = "*"
      }
    ]
  })
}

resource "aws_kms_alias" "main" {
  name          = "alias/${var.env}-${var.region_alias}-main"
  target_key_id = aws_kms_key.main.key_id
}
Critical lesson:
`AWSServiceRoleForAutoScaling` is an account-scoped IAM entity, not region-scoped. If you're deploying to two regions in the same account, only the first region should set `create_autoscaling_slr = true`. The second region's KMS config uses `create_autoscaling_slr = false` because the SLR already exists from the first apply.
# live/production/us-east-1/kms/terragrunt.hcl
include "root" { path = find_in_parent_folders() }

terraform { source = "../../../../_modules/kms" }

inputs = {
  env                    = "production"
  region                 = "us-east-1"
  region_alias           = "use1"
  account_id             = "591120834781"
  create_autoscaling_slr = false # Already created by the us-west-2 first apply
}
EKS Module (overview — full detail in Part 4)
# _modules/eks/main.tf (abbreviated)
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = var.cluster_name
  cluster_version = "1.29"

  vpc_id                   = var.vpc_id
  subnet_ids               = var.private_subnet_ids
  control_plane_subnet_ids = var.private_subnet_ids

  # Private endpoint always on; only dev clusters also expose a public endpoint
  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = var.public_api

  # Must be explicit — without this the creator role can't kubectl
  enable_cluster_creator_admin_permissions = true

  cluster_encryption_config = {
    provider_key_arn = var.kms_key_arn
    resources        = ["secrets"]
  }

  eks_managed_node_groups = {
    main = {
      instance_types = ["t3.medium"]
      min_size       = 2
      max_size       = 10
      desired_size   = 2

      # Workaround: name_prefix is limited to 38 chars.
      # Long cluster names (staging, production) overflow this limit.
      # Using an explicit name bypasses the prefix (IAM name limit is 64 chars).
      iam_role_name            = "${var.cluster_name}-node-group"
      iam_role_use_name_prefix = false
    }
  }
}
VPC Peering (for ArgoCD hub-spoke)
ArgoCD on myapp-production-use1 needs to reach the private API endpoints of the 5 spoke clusters. VPC peering provides private connectivity without internet traversal.
prod-use1 (10.20.0.0/16) ←──── VPC Peering ────► prod-usw2 (10.21.0.0/16)
prod-use1 (10.20.0.0/16) ←──── VPC Peering ────► staging-use1 (10.10.0.0/16)
prod-use1 (10.20.0.0/16) ←──── VPC Peering ────► staging-usw2 (10.11.0.0/16)
Dev clusters use public endpoints — no VPC peering needed.
# live/production/us-east-1/vpc-peering/terragrunt.hcl
# NOTE: vpc-peering configs CANNOT use include "root" with remote_state.
# The parent's generate label would conflict ("generate label already
# defined"), so remote_state and the providers are defined inline here.
locals {
  env    = "production"
  region = "us-east-1"
}

remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "myapp-terraform-state-591120834781"
    key            = "production/us-east-1/vpc-peering/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "myapp-terraform-locks"
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<-EOF
    provider "aws" {
      region = "us-east-1"
      assume_role {
        role_arn = "arn:aws:iam::591120834781:role/OrganizationAccountAccessRole"
      }
    }

    # Peer VPC is in the staging account — needs its own provider alias
    provider "aws" {
      alias  = "staging"
      region = "us-east-1"
      assume_role {
        role_arn = "arn:aws:iam::STAGING_ACCOUNT_ID:role/OrganizationAccountAccessRole"
      }
    }
  EOF
}

# The inputs below reference dependency outputs, so without include "root"
# the dependency blocks must also be declared here (paths follow the live/ layout)
dependency "prod_use1_vpc" {
  config_path = "../vpc"
}

dependency "staging_use1_vpc" {
  config_path = "../../../staging/us-east-1/vpc"
}

inputs = {
  requester_vpc_id = dependency.prod_use1_vpc.outputs.vpc_id
  accepter_vpc_id  = dependency.staging_use1_vpc.outputs.vpc_id
  # ... route table IDs, CIDR blocks, etc.
}
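The peering module itself isn't shown in the post. A minimal cross-account sketch (module path, resource names, and variables here are assumptions) would create the connection on the requester side and accept it with the aliased provider:

```hcl
# Hypothetical _modules/vpc-peering/main.tf — minimal cross-account sketch
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Declares that callers must pass an aliased provider for the accepter
      configuration_aliases = [aws.staging]
    }
  }
}

variable "requester_vpc_id" { type = string }
variable "accepter_vpc_id" { type = string }
variable "accepter_account_id" { type = string }

# Requester side (production account, default provider)
resource "aws_vpc_peering_connection" "this" {
  vpc_id        = var.requester_vpc_id
  peer_vpc_id   = var.accepter_vpc_id
  peer_owner_id = var.accepter_account_id
  auto_accept   = false # cross-account: must be accepted explicitly
}

# Accepter side (staging account, via the aliased provider)
resource "aws_vpc_peering_connection_accepter" "this" {
  provider                  = aws.staging
  vpc_peering_connection_id = aws_vpc_peering_connection.this.id
  auto_accept               = true
}
```

Routes would then be added to both VPCs' route tables pointing each peer CIDR at the peering connection ID.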
Running the Stack
# First apply: production us-west-2 (this region creates the AutoScaling SLR)
cd live/production/us-west-2
terragrunt run-all apply --terragrunt-non-interactive
# Second: production us-east-1 (SLR already exists, create_autoscaling_slr=false)
cd live/production/us-east-1
terragrunt run-all apply --terragrunt-non-interactive
# Check what changed before applying
cd live/staging/us-east-1
terragrunt run-all plan
# Destroy a specific module (e.g., for re-creating)
cd live/dev/us-east-1/eks
terragrunt destroy
State Management Best Practices
Each module has its own state file: {env}/{region}/{module}/terraform.tfstate.
Why not one big state file?
- A corrupt or locked state file affects only one module, not the entire environment
- `terraform plan` on EKS doesn't load/lock VPC state — faster, safer
- Different engineers can work on different modules concurrently
State file key examples:
production/us-east-1/vpc/terraform.tfstate
production/us-east-1/eks/terraform.tfstate
production/us-east-1/iam/terraform.tfstate
production/us-west-2/vpc/terraform.tfstate
Common Pitfalls
| Problem | Symptom | Fix |
|---|---|---|
| AutoScaling SLR doesn't exist | `MalformedPolicyDocumentException` on KMS create | Set `create_autoscaling_slr = true` in first region; `false` in subsequent regions |
| IAM `name_prefix` > 38 chars | `ValidationError: name_prefix` on node group create | Use `iam_role_name` + `iam_role_use_name_prefix = false` |
| VPC peering uses `include "root"` | `generate` label already defined error | Define the `remote_state` block explicitly in vpc-peering configs |
| AWS SG description has Unicode | Invalid description on security group | Use plain ASCII only in SG descriptions — no arrows (→) or greater-than (>) |
Summary
By the end of Part 3 you have:
- ✅ DRY Terragrunt root config (provider + backend auto-generated from path)
- ✅ VPC module with public/private subnets, NAT gateways, flow logs
- ✅ KMS module handling the AutoScaling SLR chicken-and-egg problem
- ✅ Dependency graph ensuring correct apply order
- ✅ Per-module S3 state isolation
- ✅ VPC peering between production hub and all spoke VPCs
Next: Part 4 — EKS Multi-Cluster: Six Clusters Across Two Regions
Follow the series — next part publishes next Wednesday.
Live system: https://www.matthewoladipupo.dev/health
Runbook: Operations Guide
Source code: myapp-infra | myapp-gitops | myapp