Jacob

Scaling Terraform Across Many Teams: A Native Framework for Platform Engineering

TL;DR:

A pure Terraform framework that lets 50+ teams self-service infrastructure by writing simple .tfvars files while the platform team manages opinionated "building blocks." Smart lookups (s3:bucket_name) enable cross-resource references. When patterns improve, automated scripts generate PRs for all teams—they review terraform plan and inherit improvements without code changes. 85%+ boilerplate reduction, zero preprocessing, fully compatible with Terraform Cloud.

This blog post documents how a platform engineering team built a Terraform framework that scales to 50+ application teams with mixed skill levels—enabling fast, self-service infrastructure deployment while maintaining governance and security standards.

┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   50+ Teams     │      │    Platform     │      │    Patterns     │
│ Write Simple    │─────>│    Manages      │─────>│    Improve      │
│     tfvars      │      │ Building Blocks │      │   Over Time     │
└─────────────────┘      └─────────────────┘      └─────────────────┘
                                │                           │
                                │                           ▼
                                │                  ┌─────────────────┐
                                │                  │   Automated     │
                                │                  │   PRs Generated │
                                │                  └─────────────────┘
                                │                           │
                                │                           ▼
                                │                  ┌─────────────────┐
                                │                  │  Teams Review   │
                                │                  │ terraform plan  │
                                │                  └─────────────────┘
                                │                           │
                                │                           ▼
                                │                  ┌─────────────────┐
                                └──────────────────│  Approve & Apply│
                                         (updates) │  Stay Current   │
                                                   └─────────────────┘

The Challenge: Platform teams face an impossible trade-off: either let teams write their own Terraform (resulting in inconsistent, outdated implementations) or manually review and update every workload (an approach that doesn't scale beyond ~10 teams).

The Solution: A native Terraform framework that separates configuration (what teams deploy) from implementation (how it's deployed securely). Application teams write simple .tfvars files, while the platform team manages opinionated "building blocks" that evolve over time. When patterns improve (adding VPC, encryption, monitoring), automated scripts generate PRs for all teams; each team reviews terraform plan and approves, inheriting improvements without code changes.

Key Innovation: Native Terraform "smart lookups" (s3:bucket_name, lambda:function_name) allow cross-resource references while maintaining the separation. No preprocessing, no code generation—pure Terraform compatible with standard tooling and Terraform Cloud.

Target Audiences

  • Platform Engineers: Detailed implementation of the lookup mechanism and building block architecture
  • DevOps/SRE Teams: Comparison with Terragrunt/Terraspace and practical benefits
  • Cloud Architects: Strategic value and governance capabilities
  • Technical Leaders: Development velocity improvements and complexity reduction

1. Introduction: Helping Teams Build Faster at Scale

Opening Hook:

"How do you help 50 teams build and deploy infrastructure faster—when they have different levels of AWS and Terraform expertise, need similar-but-not-identical workloads, and your platform team can't manually review and update every project?"

The Human Challenge: Speed vs. Standards

Picture this familiar scenario:

Your Organization:

  • 50+ application teams building data pipelines, microservices, analytics platforms
  • Mixed skill levels:
    • 20% have AWS experts who know IAM policies inside-out
    • 50% are competent with Terraform but learning AWS services
    • 30% are new to both, just want to deploy their application
  • Platform/DevOps team of 5-10 people responsible for:
    • Cloud governance and security
    • Cost optimization
    • Compliance and best practices
    • Supporting all those teams

What Application Teams Want:

  • Deploy fast: Days, not weeks of waiting
  • Self-service: Don't wait for platform team approval on every change
  • Focus on their app: Not become AWS/Terraform experts
  • Consistency: "Just tell me what works and let me copy it"

What Platform Team Needs:

  • Enforce standards: Security, tagging, encryption, monitoring
  • Scale support: Can't grow team 1:1 with application teams
  • Continuous improvement: Patterns evolve as we learn
  • Prevent drift: All workloads stay current with best practices

The Core Problem: Similar Workloads, Different Implementations

When teams write their own Terraform, you get variations of the same infrastructure:

Option 1: Raw Terraform Resources (Maximum Flexibility, Minimum Maintainability)

# Team A writes Lambda in January 2024
resource "aws_lambda_function" "processor_v1" {
  function_name = "processor"
  runtime       = "python3.11"
  # ... 50 lines of configuration
  # Missing: VPC config, proper IAM policies, CloudWatch retention
}
# Team B writes Lambda in March 2024 (learned from Team A's mistakes)
resource "aws_lambda_function" "processor_v2" {
  function_name = "processor"
  runtime       = "python3.12"
  # ... 80 lines of configuration
  # Now includes: VPC, better IAM, but still missing X-Ray tracing
}
# Team C writes Lambda in June 2024 (organization learned best practices)
resource "aws_lambda_function" "processor_v3" {
  function_name = "processor"
  runtime       = "python3.13"
  # ... 120 lines of configuration
  # All best practices: VPC, IAM, X-Ray, proper logging, tags
}

The Problems:

  • Inconsistent implementations: 50 workloads = 50 slightly different Lambda configurations
  • Knowledge doesn't propagate: Teams A and B don't benefit from improvements learned by Team C
  • Backporting is impossible: How do you update 50 workloads when security requires KMS encryption?
  • Copy-paste culture: Teams copy from each other, propagating old patterns and bugs
  • Expertise silos: Only AWS experts can write correct infrastructure

Option 2: Standard Terraform Modules (Better Reuse, Still Hard to Evolve)

# Using terraform-aws-modules/lambda/aws
module "lambda" {
  source  = "terraform-aws-modules/lambda/aws"
  version = "4.0.0"

  function_name = "processor"
  # ... still 40+ lines of configuration
  # Better: module handles some best practices
  # Problem: upgrading 50 workloads from v4.0.0 → v5.0.0 is manual work
}

The Problems:

  • Version sprawl: Workloads stuck on different module versions (v3.2, v4.0, v4.5, v5.0)
  • Breaking changes: Module updates require testing every workload
  • Configuration drift: Each team configures modules differently
  • Limited abstraction: Still requires deep AWS knowledge to use correctly
  • Manual upgrades: Someone has to update 50 PRs when a new version releases

The Real Challenge: N×N Complexity

As you improve your infrastructure patterns over time:

  • You learn Lambda should use VPC → Need to update 50 workloads
  • Security requires KMS encryption → Need to update 50 workloads
  • Compliance requires specific tags → Need to update 50 workloads
  • New AWS best practice emerges → Need to update 50 workloads

The math is brutal:

  • 50 workloads × 10 resource types × 5 improvements per year = 2,500 manual updates
  • Each update risks breaking something
  • Each workload drifts further from best practices
  • Teams become afraid to improve shared patterns

Our Solution: True Separation of Code and Configuration

The Insight: What if we could update how infrastructure is created without touching what infrastructure exists?

# Team writes configuration ONCE (2024)
lambda_functions = {
  processor = {
    name = "processor"
    runtime = "python3.13"
    permissions = {
      s3_read = ["raw_data"]
    }
  }
}

Behind the scenes (managed by platform team):

  • January 2024: Lambda building block v1.0 (basic implementation)
  • March 2024: Lambda building block v1.5 (adds VPC, better IAM)
  • June 2024: Lambda building block v2.0 (adds X-Ray, proper logging)
  • September 2024: Lambda building block v2.5 (adds permission boundaries)

The team's configuration never changes. The platform team updates the building block implementation, and all 50 workloads pick up the improvements on their next terraform apply.
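
To make this concrete, below is a hypothetical excerpt of a platform-owned file (module source and attribute names are illustrative, in the style of the building blocks shown later in this post). Only the platform-owned file changes between releases; the team's lambda_functions input above is consumed unchanged:

# managed_by_dp_project_lambda.tf (platform-owned; illustrative sketch)
module "lambda" {
  source   = "app.terraform.io/org/buildingblock-lambda/aws"
  version  = "2.5.0" # the January 2024 release pinned "1.0.0"; only this pin changes
  for_each = var.lambda_functions

  name    = each.value.name
  runtime = each.value.runtime

  # v2.5 additions (VPC config, X-Ray, permission boundaries) are wired up
  # here inside the building block, not in the team's tfvars
}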

This Framework Achieves:

  • Separation of Concerns: Configuration (what) lives in tfvars, implementation (how) lives in building blocks
  • Continuous Improvement: Platform team evolves patterns without breaking workloads
  • Zero Backporting: Workloads automatically inherit improvements
  • Maintained References: Terraform's powerful dependency graph still works (via smart lookups)
  • Escape Hatch: Teams can still use raw Terraform resources when needed for edge cases

The Innovation:
A pure Terraform framework that:

  • Uses colon-separated syntax (s3:bucket_name) for resource references
  • Resolves lookups dynamically using native Terraform expressions
  • Abstracts AWS complexity through opinionated building blocks
  • Works seamlessly with Terraform Cloud and standard workflows
  • Updates centrally but applies individually

Coverage:

  • Handles 90-95% of common workload patterns through building blocks
  • Allows raw Terraform resources alongside building blocks for edge cases
  • Manages N×N complexity (lookups between all resource types)

The Result:

  • Platform team maintains the framework (1 codebase)
  • 50 teams write simple configurations (50 tfvars files)
  • Everyone benefits from continuous improvement
  • No preprocessing, no code generation, pure Terraform

Lifecycle Management: Keeping Up With Scale

The Separation Strategy:

The framework separates two concerns that evolve at different speeds:

  1. Configuration (Team-Owned): What workload resources exist

    • Lives in team repositories as .tfvars files
    • Teams control: which Lambda, what S3 buckets, environment variables
    • Changes infrequently (when application requirements change)
  2. Implementation (Platform-Owned): How resources are created

    • Lives in blueprint repository as managed_by_dp_*.tf files
    • Platform controls: security policies, naming, encryption, monitoring
    • Changes frequently (as patterns improve)

The Update Process:

When the platform team improves patterns (add VPC support, update KMS policies, new monitoring):

# Platform team's workflow
cd blueprint-repository
# Update building block versions, add new features
git commit -m "feat: add X-Ray tracing to Lambda building block"

# Generate PRs for all 50 team repositories
./tools/repo_updater.py --update-all-teams

# Result: 50 automated PRs created
# Each PR updates only managed_by_dp_*.tf files
# Teams' tfvars files are NEVER touched

Team's Approval Workflow:

# Team receives automated PR: "Update platform code to v2.5"
# PR shows ONLY changes to managed_by_dp_*.tf files
# Team's _project.auto.tfvars is unchanged

# Team reviews terraform plan in PR comments
terraform plan
# Shows: "Lambda function will be updated in-place"
#        "  + vpc_config { ... }"  (new VPC configuration added)

# Team approves and merges
# Terraform Cloud runs terraform apply
# Workload gets new feature automatically

The Math Works:

  • Without this approach: 50 teams × 10 resource types × 5 improvements/year = 2,500 manual updates
  • With this approach: 1 platform team × 1 script × 50 automated PRs = 50 team approvals (30 minutes each)

Platform team scales from:

  • 10 person-weeks of manual updates (touching every team's code)
  • To: 2 person-days (writing script, reviewing automation)

Teams benefit:

  • Receive improvements without doing any work
  • Review and approve changes (maintain control)
  • terraform plan shows exactly what changes
  • Rollback is just reverting the PR

Key Principles:

  1. Teams own configuration: Platform can't break their workload definitions
  2. Platform owns implementation: Teams benefit from continuous improvement
  3. Automation bridges scale: Scripts generate PRs, teams approve
  4. Terraform validates: Standard plan shows changes before apply
  5. Gradual rollout: Platform can update 5 teams first, validate, then roll to 45 more

This lifecycle separation is what makes the framework sustainable at scale—platform team doesn't become a bottleneck, teams maintain velocity, everyone stays current with best practices.

TL;DR - Section 1: Platform teams face N×N complexity when updating 50+ workloads with infrastructure improvements. This framework separates configuration (team-owned tfvars) from implementation (platform-owned building blocks). Automated PR generation scales updates: platform improves once, all teams inherit via terraform plan review and approval. Reduces 2,500 manual updates/year to 50 automated PRs.


2. Architecture Overview

┌────────────────────────────────────────────────────────────────────┐
│                Layer 1: tf-common (Shared Foundation)              │
├────────────────────────────────────────────────────────────────────┤
│  • Provider Config          • Naming Conventions                   │
│  • VPC/Subnet Data Sources  • Platform Info Provider               │
└──────────────────┬─────────────────────────────────────────────────┘
                   │
                   ▼
┌────────────────────────────────────────────────────────────────────┐
│              Layer 2: tf-default (Account-Level)                   │
├────────────────────────────────────────────────────────────────────┤
│  • KMS Infrastructure Key   • S3 Code/Logging Buckets              │
│  • IAM Admin Roles          • CloudTrail Data                      │
└──────────────────┬────────────────┬────────────────────────────────┘
                   │                │
                   │ (Shared KMS)   │ (Code Storage)
                   ▼                ▼
┌────────────────────────────────────────────────────────────────────┐
│            Layer 3: tf-project (Application-Level)                 │
├────────────────────────────────────────────────────────────────────┤
│  • KMS Data Key             • S3 Data Buckets                      │
│  • Lambda/Glue/Fargate      • RDS/Redshift/DynamoDB                │
└────────────────────────────────────────────────────────────────────┘

The Three-Layer System

Layer 1: tf-common (Shared Foundation)

  • Provider configuration
  • Naming conventions and context management
  • Shared data sources (VPC, subnets, IAM roles)
  • Platform Information Provider (PIP) integration
  • Used by ALL workloads (updated centrally)

Layer 2: tf-default (Account-Level Resources)

  • S3 code/logging buckets
  • KMS infrastructure keys
  • Lake Formation settings
  • IAM admin roles
  • CloudTrail data logging
  • Deployed ONCE per AWS account

Layer 3: tf-project (Application Resources)

  • S3 data buckets
  • Lambda functions, Glue jobs
  • RDS, Redshift, DynamoDB databases
  • Fargate containers
  • Application-specific KMS keys
  • Deployed MULTIPLE times per account (one per workload)

Composition via Symlinks:

examples/my-workload/
├── _data.tf                    # User-owned: environment config
├── _project.auto.tfvars        # User-owned: workload definition
├── managed_by_dp_common_*.tf -> ../../tf-common/terraform/
├── managed_by_dp_default_*.tf -> ../../tf-default/terraform/
└── managed_by_dp_project_*.tf -> ../../tf-project/terraform/

This creates a complete, runnable Terraform project where terraform plan/apply work directly.


3. The Smart Lookup Innovation

The Core Concept

Traditional Terraform:

lambda_functions = {
  processor = {
    environment = {
      BUCKET = "arn:aws:s3:::company-prod-data-raw-bucket-a1b2c3"
    }

    policy_json = jsonencode({
      Statement = [{
        Effect = "Allow"
        Action = ["s3:GetObject", "s3:PutObject"]
        Resource = "arn:aws:s3:::company-prod-data-raw-bucket-a1b2c3/*"
      }]
    })
  }
}

With Smart Lookups:

s3_buckets = {
  raw_data = { name = "raw" }
}

lambda_functions = {
  processor = {
    environment = {
      BUCKET = "s3:raw_data"  # Resolves to bucket name
    }

    permissions = {
      s3_read = ["raw_data"]   # Resolves to full ARN + generates IAM policy
    }
  }
}

How It Works: Pure Terraform Magic

Location: tf-project/terraform/managed_by_dp_project_lookup.tf

Step 1: Build Lookup Maps

The system creates hierarchical lookup maps after resources are created:

lookup_arn_base = merge(var.lookup_arns, {
  "s3_read"  = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].arn }
  "s3_write" = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].arn }
  "gluejob"  = { for item in keys(var.glue_jobs) : item => module.glue_jobs[item].arn }
  "secret_read" = { for item in keys(var.secrets) : item => module.secrets[item].arn }
  "dynamodb_read" = { for item in keys(var.dynamodb_databases) : item => module.dynamodb[item].arn }
})

lookup_id_base = merge(var.lookup_ids, {
  "s3" = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].id }
  "secret" = { for item in keys(var.secrets) : item => module.secrets[item].id }
  "dynamodb" = { for item in keys(var.dynamodb_databases) : item => module.dynamodb[item].name }
})

Step 2: Resolve References Dynamically

In building block modules (e.g., managed_by_dp_project_lambda.tf):

module "lambda" {
  for_each = var.lambda_functions

  # Environment variables with smart lookup
  environments = {
    for type, item in try(each.value.environment, {}) : type =>
      try(
        local.lookup_id_lambda[split(":", item)[0]][split(":", item)[1]],
        item  # Fallback to literal value if not a lookup
      )
  }

  # Permissions with smart lookup
  permissions = {
    for type, items in try(each.value.permissions, {}) : type => [
      for item in items :
      (
        length(split(":", item)) == 2  # Check if it's "type:name" format
        ? try(
            local.lookup_perm_lambda[split(":", item)[0]][split(":", item)[1]],
            item
          )
        : try(
            local.lookup_perm_lambda[type][item],  # Infer type from permission category
            item
          )
      )
    ]
  }
}

The Magic:

  • split(":", "s3:mybucket") → ["s3", "mybucket"]
  • local.lookup_id_lambda["s3"]["mybucket"] → actual bucket name
  • local.lookup_perm_lambda["s3_read"]["mybucket"] → actual bucket ARN
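
A self-contained sketch of this resolution logic with hypothetical values; drop it into an empty directory and evaluate local.environment_resolved with terraform console:

locals {
  # stand-in for the generated lookup table
  lookup_id_lambda = {
    s3     = { raw_data = "companyp-analytics-etl-raw" }
    secret = { db_creds = "companyp-analytics-etl-db_creds" }
  }

  environment_in = {
    BUCKET    = "s3:raw_data" # smart lookup: resolves to the bucket name
    LOG_LEVEL = "INFO"        # not in "type:name" form: passes through as-is
  }

  environment_resolved = {
    for key, value in local.environment_in : key =>
    try(
      local.lookup_id_lambda[split(":", value)[0]][split(":", value)[1]],
      value # fallback to the literal value
    )
  }
}

# environment_resolved => { BUCKET = "companyp-analytics-etl-raw", LOG_LEVEL = "INFO" }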

Step 3: Building Blocks Generate IAM Policies

Building block modules (from Terraform Cloud private registry) automatically generate IAM policies:

module "lambda" {
  source  = "app.terraform.io/org/buildingblock-lambda/aws"
  version = "3.2.0"

  permissions = {
    s3_read = ["arn:aws:s3:::bucket1", "arn:aws:s3:::bucket2"]
  }

  create_policy = true  # Automatically generates IAM role + policy
}

Inside the building block, it generates:

data "aws_iam_policy_document" "lambda" {
  statement {
    sid    = "S3Read"
    effect = "Allow"
    actions = ["s3:GetObject*", "s3:GetBucket*", "s3:List*"]
    resources = flatten([
      var.permissions.s3_read,
      [for arn in var.permissions.s3_read : "${arn}/*"]
    ])
  }
}

Supported Lookup Types

For Environment Variables (IDs/Names):

  • s3:bucket_name → S3 bucket name
  • secret:secret_name → Secrets Manager secret ID
  • dynamodb:table_name → DynamoDB table name
  • athena:workgroup_name → Athena workgroup name
  • prefix:suffix → Injects naming prefix + suffix

For Permissions (ARNs):

  • s3_read:bucket / s3_write:bucket → S3 bucket ARN
  • gluejob:job_name → Glue job ARN
  • gluedb:database_name → Glue database name
  • secret_read:secret_name → Secrets Manager ARN
  • dynamodb_read:table / dynamodb_write:table → DynamoDB ARN
  • sqs_read:queue / sqs_send:queue → SQS queue ARN
  • sns_pub:topic → SNS topic ARN

Cross-Account References:

  • acct_prod_glue_tables → All Glue tables in production account
  • acct_dev_kms_all_keys → All KMS keys in dev account

Team tfvars          Lookup Tables       Building Block         AWS Resources
     │                     │                     │                     │
     │ environment =       │                     │                     │
     │ {BUCKET="s3:raw"}   │                     │                     │
     ├────────────────────>│                     │                     │
     │                     │ split(":", "s3:raw")│                     │
     │                     │ → ["s3", "raw"]     │                     │
     │                     │                     │                     │
     │                     │ lookup_id_lambda    │                     │
     │                     │ ["s3"]["raw"] →     │                     │
     │                     │ "company...-raw"    │                     │
     │                     ├────────────────────>│                     │
     │                     │  resolved name      │                     │
     │                     │                     │ Create Lambda with  │
     │                     │                     │ env BUCKET=         │
     │                     │                     │ "company...-raw"    │
     │                     │                     ├────────────────────>│
     │                     │                     │                     │
     │ permissions =       │                     │                     │
     │ {s3_read=["raw"]}   │                     │                     │
     ├────────────────────>│                     │                     │
     │                     │ lookup_perm_lambda  │                     │
     │                     │ ["s3_read"]["raw"]  │                     │
     │                     │ → arn:aws:s3:::...  │                     │
     │                     ├────────────────────>│                     │
     │                     │  resolved ARN       │                     │
     │                     │                     │ Generate IAM policy │
     │                     │                     │ with S3 read actions│
     │                     │                     │                     │
     │                     │                     │ Attach policy to    │
     │                     │                     │ Lambda role         │
     │                     │                     ├────────────────────>│

TL;DR - Section 3: Smart lookups use colon syntax (s3:bucket_name) resolved via native Terraform split() and lookup maps. No preprocessing—pure Terraform expressions. Lookup tables are built after resources are created, then referenced by building blocks to resolve environment variables (IDs) and permissions (ARNs). Building blocks auto-generate IAM policies from the resolved ARNs.


4. Building Block Abstraction

The Philosophy

Building blocks are opinionated Terraform modules that:

  1. Enforce organizational standards (naming, tagging, encryption)
  2. Abstract AWS complexity (IAM policies, VPC configuration)
  3. Provide guardrails (prevent common misconfigurations)
  4. Enable least-privilege by default (automatic policy generation)

Example: S3 Building Block

User Configuration (tfvars):

s3_buckets = {
  raw_data = {
    name = "raw"
    backup = true
    enable_intelligent_tiering = true
  }
  processed = {
    name = "processed"
    lifecycle_rules = [{
      id = "archive_old_data"
      transition_days = 90
      storage_class = "GLACIER"
    }]
  }
}

What the Building Block Does:

module "s3_buckets" {
  source  = "app.terraform.io/org/buildingblock-s3/aws"
  version = "2.1.3"

  for_each = var.s3_buckets

  # Standardized naming: <prefix>-<workload>-<application>-<name>
  prefix  = local.prefix  # e.g., "companyp" (company + production)
  context = local.context # {Env: "prd", Workload: "analytics", Application: "etl"}
  name    = try(each.value.name, each.key)

  # Automatic encryption with workload KMS key
  kms_key_arn = local.kms_data_key_arn

  # Standardized tags (injected automatically)
  # Tags include: Env, Workload, Application, Team, CostCenter, Backup

  # Security defaults
  block_public_access = true
  versioning_enabled = true

  # User-specified configuration
  backup = each.value.backup
  lifecycle_rules = try(each.value.lifecycle_rules, [])
  enable_intelligent_tiering = try(each.value.enable_intelligent_tiering, false)
}

Generated Resources:

  • S3 bucket with predictable name: companyp-analytics-etl-raw
  • KMS encryption enabled automatically
  • Bucket policy restricting to VPC endpoints
  • CloudWatch alarms for bucket size
  • Backup plan (if backup = true)
  • All organizational tags applied

Example: Lambda Building Block

User Configuration:

lambda_functions = {
  data_processor = {
    name            = "processor"
    handler         = "index.handler"
    runtime         = "python3.13"
    memory          = 1024
    timeout         = 300
    s3_sourcefile   = "s3_file:lambda_processor.zip"

    environment = {
      INPUT_BUCKET  = "s3:raw_data"
      OUTPUT_BUCKET = "s3:processed"
      SECRET_ID     = "secret:db_creds"
    }

    permissions = {
      s3_read  = ["raw_data"]
      s3_write = ["processed"]
      secret_read = ["db_creds"]
    }
  }
}

What the Building Block Does:

  • Creates Lambda function with standardized name
  • Generates IAM role automatically
  • Generates IAM policy from permissions map
  • Applies permission boundary (security compliance)
  • Injects VPC configuration (subnet IDs, security groups)
  • Resolves environment variables via lookup tables
  • Adds CloudWatch log group with retention policy
  • Applies X-Ray tracing
  • Adds all organizational tags

Generated IAM Policy (automatically):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3Read",
      "Effect": "Allow",
      "Action": ["s3:GetObject*", "s3:GetBucket*", "s3:List*"],
      "Resource": [
        "arn:aws:s3:::companyprd-analytics-etl-raw",
        "arn:aws:s3:::companyprd-analytics-etl-raw/*"
      ]
    },
    {
      "Sid": "S3Write",
      "Effect": "Allow",
      "Action": ["s3:PutObject*", "s3:DeleteObject*"],
      "Resource": [
        "arn:aws:s3:::companyprd-analytics-etl-processed",
        "arn:aws:s3:::companyprd-analytics-etl-processed/*"
      ]
    },
    {
      "Sid": "SecretRead",
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": "arn:aws:secretsmanager:eu-central-1:123456789012:secret:companyprd-analytics-etl-db_creds-a1b2c3"
    },
    {
      "Sid": "KMSDecrypt",
      "Effect": "Allow",
      "Action": ["kms:Decrypt"],
      "Resource": "arn:aws:kms:eu-central-1:123456789012:key/abcd1234-..."
    }
  ]
}

5. Dual KMS Key Architecture with Tag-Based Permissions

One of the most elegant security features of this framework is its dual KMS key architecture that balances security isolation with operational flexibility.

The Two-Key System

KMS Infrastructure Key (kms-infra)

  • Scope: One per AWS account (shared across all workloads in that account)
  • Location: Created in tf-default (account-level)
  • Purpose: Encrypts infrastructure resources (CloudWatch Logs, Secrets Manager, SNS, CloudTrail)
  • Naming: ${prefix}-${workload}-kms-infra
  • Example: companyp-analytics-kms-infra

KMS Data Key (kms-data)

  • Scope: One per workload (isolated per application)
  • Location: Created in tf-project (application-level)
  • Purpose: Encrypts data resources (S3 buckets, RDS, DynamoDB, Redshift)
  • Naming: ${prefix}-${workload}-${application}-kms-data
  • Example: companyp-analytics-etl-kms-data

Why Two Keys?

Security Isolation:

  • Data keys are isolated per workload
  • Compromising one workload's data key doesn't expose other workloads' data
  • Infrastructure key is shared for operational resources that need account-wide access

Operational Flexibility:

  • Infrastructure key allows CloudWatch, monitoring, and logging to work across workloads
  • AWS services (Secrets Manager, CloudTrail) can use a single key for account-level operations
  • Data keys remain tightly scoped to application resources

Cost Optimization:

  • Infrastructure resources share one key (CloudWatch logs from many workloads)
  • Only data resources (S3, databases) need separate keys per workload

Tag-Based Permissions: The Magic Sauce

Instead of explicitly listing every IAM role in the KMS key policy (which creates circular dependencies), the infrastructure key uses tag-based permissions:

Implementation in managed_by_dp_common_kms_infra.tf:

module "kms_infrastructure" {
  source = "terraform-aws-modules/kms/aws"

  create = local.default_deploy  # Only in default/account deployment

  aliases = ["${local.prefix}-${local.context.Workload}-kms-infra"]

  key_statements = [
    {
      sid = "tag-workload"
      principals = [{
        type        = "AWS"
        identifiers = ["arn:aws:iam::${account_id}:root"]
      }]

      actions = [
        "kms:Encrypt*",
        "kms:Decrypt*",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
      ]

      resources = ["*"]

      # The key condition: any role with matching Workload tag can use this key
      conditions = [{
        test     = "StringEquals"
        variable = "aws:PrincipalTag/Workload"
        values   = [local.context.Workload]
      }]
    }
  ]
}

How It Works:

  1. Every IAM role created by building blocks gets tagged automatically:
   # Lambda IAM role
   tags = {
     Workload    = "analytics"
     Application = "etl"
     Env         = "prd"
   }
  2. KMS key policy allows any role with matching Workload tag:

    • If role has tag Workload = "analytics"
    • And KMS key is for workload analytics
    • Then role can use the key automatically
  3. No circular dependencies:

    • KMS key doesn't need to know about Lambda roles
    • Lambda roles don't need to be in KMS key policy
    • Tag matching happens at runtime by AWS IAM
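
One way to guarantee that every role carries these tags is provider-level default_tags; a minimal sketch (the framework may instead pass tags explicitly through its context map):

provider "aws" {
  region = "eu-central-1"

  # Every taggable resource, including IAM roles created by building
  # blocks, inherits these tags and thereby satisfies the key conditions.
  default_tags {
    tags = {
      Workload    = "analytics"
      Application = "etl"
      Env         = "prd"
    }
  }
}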

Data Key: Explicit Role Lists

The data key uses a different approach with explicit role lists (avoiding circular dependencies through selective inclusion):

Implementation in managed_by_dp_project_kms_data.tf:

module "kms_data" {
  source = "app.terraform.io/org/buildingblock-kms-data/aws"

  key_administrators = local.kms_admins
  key_users = compact(concat(local.kms_data_key_users, var.kms_data["extra_roles"]))

  # Tag-based access for roles with matching tags
  key_user_tag_map = {
    "Workload"    = local.context.Workload
    "Application" = local.context.Application
    "Env"         = local.context.Env
  }
}
Enter fullscreen mode Exit fullscreen mode

In managed_by_dp_project_locals.tf:

kms_data_key_users = compact(concat(
  # Admin roles (explicitly listed)
  ["arn:aws:iam::${account_id}:role/${var.role_prefix}-${local.prefix}-DpAdminRole"],
  [local.operatorrole_arn],
  local.transfer_roles,
  local.workflow_roles,

  # Lambda, Glue, Fargate roles are NOT listed here (would cause cycles)
  # Instead, they're granted access via tag-based permissions
  # See comments in code explaining the circular dependency:
  # [for job in var.glue_jobs : "arn:aws:iam::..."],  # CYCLO ERROR!
  # [for function in var.lambda_functions : "arn:aws:iam::..."],  # CYCLO ERROR!
))
Enter fullscreen mode Exit fullscreen mode

The data key also supports tag-based access through key_user_tag_map, allowing Lambda/Glue/Fargate roles to access it via their tags without being explicitly listed in the policy.
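
A minimal sketch of how a building block could translate key_user_tag_map into key-policy conditions (assuming the variable shown above; account ID is a placeholder):

variable "key_user_tag_map" {
  type = map(string)
  default = {
    Workload    = "analytics"
    Application = "etl"
    Env         = "prd"
  }
}

data "aws_iam_policy_document" "tag_based_key_access" {
  statement {
    sid       = "TagBasedAccess"
    effect    = "Allow"
    actions   = ["kms:Encrypt*", "kms:Decrypt*", "kms:GenerateDataKey*"]
    resources = ["*"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::123456789012:root"] # placeholder account
    }

    # One StringEquals condition per tag: conditions AND together, so a
    # caller must match ALL tags in the map to use the key.
    dynamic "condition" {
      for_each = var.key_user_tag_map
      content {
        test     = "StringEquals"
        variable = "aws:PrincipalTag/${condition.key}"
        values   = [condition.value]
      }
    }
  }
}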

Practical Example

Scenario: Lambda function needs to:

  • Read encrypted S3 data (data key)
  • Write to CloudWatch Logs (infra key)
  • Access Secrets Manager secret (infra key)

What Happens:

  1. Lambda IAM role is created with tags:
   resource "aws_iam_role" "lambda" {
     name = "app-companyp-analytics-etl-lambda-processor"

     tags = {
       Workload    = "analytics"
       Application = "etl"
       Env         = "prd"
     }
   }
  2. Lambda can use infrastructure key because:

    • Role has tag Workload = "analytics"
    • KMS infra key checks: aws:PrincipalTag/Workload == "analytics"
    • Access granted for CloudWatch Logs, Secrets Manager
  3. Lambda can use data key because:

    • Role has tags Workload = "analytics" AND Application = "etl" AND Env = "prd"
    • KMS data key checks all three tags match ✓
    • Access granted for S3 data encryption/decryption
  4. Lambda CANNOT use another workload's data key:

    • Role has Application = "etl"
    • Other workload's data key requires Application = "reporting"
    • Tag mismatch ✗
    • Access denied

Benefits of This Architecture

1. Automatic Compliance:

  • Every resource is encrypted (mandatory KMS keys injected by building blocks)
  • No way to accidentally create unencrypted resources

2. Zero-Touch Security:

  • Developers never manage KMS permissions manually
  • Building blocks inject the correct KMS key ARN automatically
  • Tag propagation handles access control

3. Workload Isolation:

  • Data from different applications is cryptographically separated
  • Even with compromised IAM credentials, cross-workload data access is prevented

4. Solves Circular Dependencies:

  • KMS keys don't reference IAM roles directly
  • IAM roles don't need to be created before KMS keys
  • Tag-based conditions evaluated at runtime

5. Audit Trail:

  • CloudTrail logs show which role (with which tags) accessed which KMS key
  • Security teams can verify tag-based access patterns
  • Compliance reports show encryption coverage

Service-Specific Access

The infrastructure key also includes service-specific statements for AWS services:

CloudWatch Logs:

{
  sid = "logs"
  principals = [{ type = "Service", identifiers = ["logs.amazonaws.com"] }]
  actions = ["kms:Encrypt*", "kms:Decrypt*", "kms:GenerateDataKey*"]
  conditions = [{
    test     = "ArnEquals"
    variable = "kms:EncryptionContext:aws:logs:arn"
    values   = ["arn:aws:logs:${region}:${account}:log-group:*"]
  }]
}

Secrets Manager:

{
  sid = "auto-secretsmanager"
  principals = [{ type = "Service", identifiers = ["secretsmanager.amazonaws.com"] }]
  actions = ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey"]
  conditions = [
    { test = "StringEquals", variable = "kms:ViaService",
      values = ["secretsmanager.${region}.amazonaws.com"] },
    { test = "StringEquals", variable = "kms:CallerAccount", values = ["${account}"] }
  ]
}

CloudTrail, SNS, EventBridge:
Similar service-specific statements allow these AWS services to use the infrastructure key for their operations.

Lookup References

Both keys are available via smart lookups:

# In Lambda/Glue/Fargate tfvars - use data key for data encryption
permissions = {
  kms = ["kms_data"]  # Resolves to workload's data key ARN
}

# Infrastructure key is injected automatically by building blocks
# (for CloudWatch Logs, environment variable encryption, etc.)

Summary

The dual KMS key architecture demonstrates how thoughtful design can achieve:

  • Security: Strong encryption and workload isolation
  • Developer Experience: Zero manual KMS management
  • Operational Simplicity: Tag-based permissions eliminate complexity
  • Compliance: Automatic encryption enforcement across all resources

This pattern is a cornerstone of the framework's security model and showcases how infrastructure abstractions can enhance rather than compromise security posture.

┌──────────────────────────────────────────────────────────────────┐
│          KMS Infrastructure Key (Account-Level)                  │
├──────────────────────────────────────────────────────────────────┤
│  • One Key Per Account                                           │
│  • Encrypts: CloudWatch Logs, Secrets Manager, SNS, CloudTrail   │
│  • Tag-Based Access: Workload Tag                                │
└────────────────────────────────┬─────────────────────────────────┘
                                 │
                                 │ (Tag Match: Workload)
                                 │
                      ┌──────────┴──────────┐
                      │                     │
                      │    Lambda Role      │
                      │  Tagged with:       │
                      │  • Workload=analytics│
                      │  • Application=etl  │
                      │  • Env=prd          │
                      │                     │
                      └──────────┬──────────┘
                                 │
                                 │ (Tag Match: All 3 Tags)
                                 │
┌────────────────────────────────▼─────────────────────────────────┐
│            KMS Data Key (Workload-Level)                         │
├──────────────────────────────────────────────────────────────────┤
│  • One Key Per Workload                                          │
│  • Encrypts: S3, RDS, DynamoDB, Redshift                         │
│  • Tag-Based Access: Workload + Application + Env                │
└──────────────────────────────────────────────────────────────────┘

TL;DR - Section 5: Dual KMS architecture uses one shared infrastructure key per account (CloudWatch, Secrets Manager) and one data key per workload (S3, databases). Tag-based permissions solve circular dependencies: IAM roles tagged with Workload/Application/Env automatically gain KMS access without being explicitly listed in policies. Infrastructure key checks one tag, data key checks three tags for stronger isolation.


6. Naming Conventions and Context Propagation

The Context System

Input: Tags Module

Every workload defines a tags module:

module "tags" {
  source      = "app.terraform.io/org/tags/aws"
  version     = "~> 1.0.0"
  environment = "prd"
  workload    = "analytics"
  application = "etl"
  team        = "data-engineering@company.com"
  costcenter  = "12345"
  backup      = "Daily"
}

Output: Context Map

context = merge(module.tags.tags, var.context)
# Result: {
#   Env: "prd",
#   Workload: "analytics",
#   Application: "etl",
#   Team: "data-engineering@company.com",
#   CostCenter: "12345",
#   Backup: "Daily"
# }

Prefix Generation

prefix = "company${substr(local.context.Env, 0, 1)}"
# prd → companyp
# sbx → companys
# dev → companyd

Resource Naming Pattern

${prefix}-${workload}-${application}-${resource_name}

Examples:

  • S3 bucket: companyp-analytics-etl-raw
  • Lambda: companyp-analytics-etl-processor
  • Glue job: companyp-analytics-etl-transform
  • IAM role: companyp-analytics-etl-lambda-processor-role

Benefits:

  • Predictable: Resources can be referenced before creation
  • Discoverable: Name reveals environment, workload, and purpose
  • Compliant: Meets organizational naming standards
  • Unique: Prevents naming collisions across teams
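
A minimal locals sketch with hypothetical values, showing how a full resource name is predictable from context alone (this is also what enables the predictive-naming strategy in Section 7):

locals {
  context = { Env = "prd", Workload = "analytics", Application = "etl" }
  prefix  = "company${substr(local.context.Env, 0, 1)}" # => "companyp"

  # ${prefix}-${workload}-${application}-${resource_name}
  bucket_name = join("-", [
    local.prefix,
    local.context.Workload,
    local.context.Application,
    "raw",
  ])
  # => "companyp-analytics-etl-raw"
}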

7. Circular Dependency Resolution Strategies

The Challenge

Terraform's dependency graph must be acyclic, but real-world infrastructure often has circular references:

  • Lambda needs IAM role ARN
  • IAM role policy needs Lambda ARN for trust policy
  • KMS key policy needs Lambda role ARN
  • Lambda needs KMS key ARN for environment variables

Strategy 1: Predictive Naming

Example: Redshift Lookup

# Can't use module.redshift[item].name because it creates a cycle
# CYCLO ERROR! comment in code
"redshift_data" = {
  for item in keys(var.redshift_databases) :
    item => join("-", [
      local.prefix,
      local.context.Workload,
      local.context.Application,
      item
    ])
}

Instead of referencing the module output (which creates a dependency), predict the name using the same naming convention.

Strategy 2: Two-Phase Deployment

From DEPLOY.md:

"First Terraform apply will fail on a few dependencies. Re-run to finalize."

Some circular dependencies are resolved by applying twice:

  1. First apply creates base resources
  2. Some resources fail due to missing dependencies
  3. Second apply completes configuration

Strategy 3: Selective KMS Key Users

kms_data_key_users = compact(concat(
  ["arn:aws:iam::${account_id}:role/${var.role_prefix}-${local.prefix}-DpAdminRole"],
  [local.operatorrole_arn],
  local.transfer_roles,
  local.workflow_roles,
  # These would create cycles - commented out:
  # [for job in var.glue_jobs : "arn:aws:iam::..."],
  # [for function in var.lambda_functions : "arn:aws:iam::..."],
))

KMS key policies include predictable roles (admin, operator) but NOT Lambda/Glue roles to avoid cycles.

Strategy 4: Data Source Lookups (Cross-Workload)

When project workloads need resources from the default workload:

locals {
  # True when this directory also contains the account-level (default) files
  default_deploy = fileexists("${path.module}/managed_by_dp_default_s3_code.tf")
}

data "aws_kms_key" "kms_infrastructure" {
  count  = local.default_deploy ? 0 : 1
  key_id = "alias/${local.prefix}-${local.context.Workload}-kms-infra"
}

locals {
  kms_infrastructure_key_arn = coalesce(
    module.kms_infrastructure.key_arn,              # set when this is the default deploy
    one(data.aws_kms_key.kms_infrastructure[*].arn) # set when this is a project deploy
  )
}

Project workloads use data sources to look up the infrastructure key by its predictable alias.


8. Real-World Example: Data Pipeline Workload

Scenario

Build a data pipeline that:

  1. Ingests raw CSV files from external S3 bucket
  2. Processes files with Lambda function
  3. Transforms data with Glue ETL job
  4. Stores in Redshift for analytics
  5. Shares Glue catalog with data governance account

Configuration (tfvars)

# Define S3 buckets
s3_buckets = {
  raw = {
    name = "raw"
    backup = true
    lifecycle_rules = [{
      id = "archive_old"
      transition_days = 90
      storage_class = "GLACIER"
    }]
  }
  processed = {
    name = "processed"
    enable_intelligent_tiering = true
  }
}

# Upload Lambda code
s3_source_files = {
  processor_code = {
    source = "lambda_processor.zip"
    target = "lambda_functions/processor/code.zip"
  }
  glue_script = {
    source = "transform.py"
    target = "glue_jobs/transform/script.py"
  }
}

# Define secrets
secrets = {
  redshift_creds = {
    name = "redshift-credentials"
    secret_string = {
      username = "admin"
      password = "changeme"  # Should use AWS Secrets Manager UI to set
    }
  }
}

# Define Glue database
glue_database = {
  analytics = {
    name = "analytics"
    bucket = "s3:processed"
    enable_lakeformation = true
    share_cross_account_ro = ["datagovernance"]
  }
}

# Define Lambda processor
lambda_functions = {
  csv_processor = {
    name = "csv-processor"
    description = "Processes incoming CSV files"
    handler = "index.handler"
    runtime = "python3.13"
    memory = 2048
    timeout = 900
    s3_sourcefile = "s3_file:processor_code"

    environment = {
      RAW_BUCKET = "s3:raw"
      PROCESSED_BUCKET = "s3:processed"
      GLUE_DATABASE = "gluedb:analytics"
    }

    permissions = {
      s3_read = ["raw"]
      s3_write = ["processed"]
      glue_update = ["analytics"]
    }

    # S3 trigger
    event_source_mapping = [{
      event_source_arn = "s3:raw"
      events = ["s3:ObjectCreated:*"]
      filter_prefix = "incoming/"
      filter_suffix = ".csv"
    }]
  }
}

# Define Glue ETL job
glue_jobs = {
  transform = {
    name = "data-transform"
    glue_version = "4.0"
    worker_type = "G.1X"
    number_of_workers = 5
    script_location = "s3_file:glue_script"

    arguments = {
      "--DATABASE" = "gluedb:analytics"
      "--INPUT_BUCKET" = "s3:processed"
      "--REDSHIFT_SECRET" = "secret:redshift_creds"
    }

    permissions = {
      s3_read = ["processed"]
      glue_update = ["analytics"]
      secret_read = ["redshift_creds"]
      redshift = ["analytics_cluster"]
    }

    # Scheduled trigger
    trigger_type = "SCHEDULED"
    schedule = "cron(0 2 * * ? *)"  # Daily at 2 AM
  }
}

# Define Redshift cluster
redshift_databases = {
  analytics_cluster = {
    name = "analytics"
    node_type = "dc2.large"
    number_of_nodes = 2
    master_username = "admin"
    secret_name = "secret:redshift_creds"

    permissions = {
      glue_read = ["analytics"]
      s3_read = ["processed"]
    }
  }
}

What Gets Created (40+ AWS Resources)

Infrastructure:

  • KMS data key for encryption
  • VPC security groups for Lambda/Glue
  • IAM roles (5): Lambda role, Glue role, Redshift role, Lake Formation role, Admin role
  • IAM policies (5): Auto-generated least-privilege policies
  • Permission boundaries (2): For Lambda and Glue roles

Storage:

  • S3 bucket: companyp-analytics-pipeline-raw
  • S3 bucket: companyp-analytics-pipeline-processed
  • S3 bucket policies (2)
  • S3 lifecycle rules
  • S3 intelligent tiering configuration

Compute:

  • Lambda function: companyp-analytics-pipeline-csv-processor
  • Lambda log group with 30-day retention
  • S3 event notification trigger
  • Glue job: companyp-analytics-pipeline-data-transform
  • Glue security configuration
  • Glue CloudWatch log group

Data Catalog:

  • Glue database: companyp-analytics-pipeline-analytics
  • Lake Formation permissions
  • Lake Formation resource link (cross-account share)
  • RAM resource share (for cross-account access)

Database:

  • Redshift cluster: companyp-analytics-pipeline-analytics
  • Redshift subnet group
  • Redshift parameter group
  • Redshift security group
  • Secrets Manager secret: companyp-analytics-pipeline-redshift-credentials
  • Secret rotation configuration

Monitoring:

  • CloudWatch alarms (6): Lambda errors, Glue job failures, S3 metrics
  • CloudWatch log groups (3)
  • EventBridge rule for Glue job schedule

All with:

  • Consistent naming
  • Full encryption (KMS)
  • Least-privilege IAM policies
  • Organizational tags
  • VPC isolation
  • CloudWatch logging

Total Configuration: ~150 lines of tfvars
Generated Terraform Code: ~2000+ lines (via building blocks)
Boilerplate Reduction: ~93%

                    ┌──────────────────────────────────┐
                    │         S3 Buckets               │
                    │  ┌────────┐      ┌────────┐      │
                    │  │  raw   │      │processed│     │
                    │  └───┬────┘      └────▲───┘      │
                    └──────┼────────────────┼──────────┘
                           │                │
               S3 Event    │                │
               Trigger     │                │ Writes
                           │                │
                    ┌──────▼────────────────┴──────────┐
                    │         Lambda                   │
                    │  ┌─────────────────────┐         │
                    │  │   csv-processor     │         │
                    │  └──────────┬──────────┘         │
                    └─────────────┼────────────────────┘
                                  │
                                  │ Updates
                                  │
        ┌─────────────────────────▼───────────────────────┐
        │              Glue                                │
        │  ┌──────────────────┐    ┌──────────────────┐   │
        │  │ Database:        │◄───│  ETL Job:        │   │
        │  │ analytics        │    │  transform       │   │
        │  └────────▲─────────┘    └────┬─────────────┘   │
        └───────────┼──────────────────┼─────────────────┘
                    │                  │
                    │ Queries          │ Loads
                    │                  │
        ┌───────────┴──────────────────▼─────────────────┐
        │           Redshift                              │
        │  ┌─────────────────────┐                        │
        │  │  Cluster: analytics │                        │
        │  └──────────┬──────────┘                        │
        └─────────────┼────────────────────────────────────┘
                      │
                      │ Reads
                      │
        ┌─────────────▼────────────────────────────────────┐
        │         Secrets Manager                          │
        │  ┌─────────────────────────┐                     │
        │  │  redshift-credentials   │                     │
        │  └─────────────────────────┘                     │
        └──────────────────────────────────────────────────┘

TL;DR - Section 8: Real-world data pipeline example shows how 150 lines of tfvars configuration generates 40+ AWS resources (S3, Lambda, Glue, Redshift, KMS, IAM, CloudWatch). Smart lookups connect resources (s3:raw, secret:db_creds), building blocks auto-generate IAM policies, context system applies consistent naming/tagging, and KMS keys encrypt everything automatically. Achieves 93% boilerplate reduction vs traditional Terraform.


9. Cross-Account Architecture

Use Case: Multi-Account Data Mesh

Scenario: Analytics workload in Production account needs to:

  • Read S3 data from Development account
  • Query Glue tables from Staging account
  • Use KMS keys from Shared Services account

Configuration

Step 1: Define Cross-Account Aliases

cross_accounts = {
  dev     = "123456789012"
  staging = "234567890123"
  shared  = "345678901234"
}

Step 2: Define External S3 Buckets

lookup_ids = {
  xa_s3_bucket = {
    dev_raw = "dev-shared-raw-data"
    staging_processed = "staging-shared-processed"
  }
}

Step 3: Use Cross-Account Lookups

lambda_functions = {
  cross_account_reader = {
    name = "reader"

    permissions = {
      # Read from external S3 buckets
      s3_read = ["dev_raw", "staging_processed"]

      # Query Glue tables in staging account
      glue_read = ["acct_staging_glue_tables"]

      # Use KMS keys in shared account
      kms = ["acct_shared_kms_all_keys"]
    }
  }
}

Generated IAM Policy

{
  "Statement": [
    {
      "Sid": "S3ReadCrossAccount",
      "Effect": "Allow",
      "Action": ["s3:GetObject*", "s3:GetBucket*", "s3:List*"],
      "Resource": [
        "arn:aws:s3:::dev-shared-raw-data",
        "arn:aws:s3:::dev-shared-raw-data/*",
        "arn:aws:s3:::staging-shared-processed",
        "arn:aws:s3:::staging-shared-processed/*"
      ]
    },
    {
      "Sid": "GlueReadCrossAccount",
      "Effect": "Allow",
      "Action": ["glue:GetTable", "glue:GetTables", "glue:GetDatabase"],
      "Resource": "arn:aws:glue:*:234567890123:table/*"
    },
    {
      "Sid": "KMSCrossAccount",
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:DescribeKey"],
      "Resource": "arn:aws:kms:eu-central-1:345678901234:key/*"
    }
  ]
}

Benefits:

  • Developers don't need to know account IDs
  • Cross-account permissions follow same pattern as same-account
  • Centralized account alias management
  • Type-safe (Terraform validates references at plan time)
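
A hypothetical sketch of how the account alias map could be expanded into the acct_* lookup entries (the exact mechanism in the framework may differ; key names and regions here are illustrative):

locals {
  cross_accounts = {
    dev     = "123456789012"
    staging = "234567890123"
    shared  = "345678901234"
  }

  # e.g. "acct_staging_glue_tables" => "arn:aws:glue:*:234567890123:table/*"
  xa_glue_tables = {
    for alias, account_id in local.cross_accounts :
    "acct_${alias}_glue_tables" => "arn:aws:glue:*:${account_id}:table/*"
  }

  # e.g. "acct_shared_kms_all_keys" => "arn:aws:kms:eu-central-1:345678901234:key/*"
  xa_kms_keys = {
    for alias, account_id in local.cross_accounts :
    "acct_${alias}_kms_all_keys" => "arn:aws:kms:eu-central-1:${account_id}:key/*"
  }
}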

10. Deployment Workflow

Repository Structure

Blueprint Repository (Central):

terraform-platform-blueprint/
├── tf-common/           # Shared foundation
├── tf-default/          # Account-level resources
├── tf-project/          # Application resources
├── examples/
│   ├── full_test/       # Complete example
│   └── simple_example/  # Minimal example
└── tools/
    └── repo_updater.py  # Syncs blueprint to user repos

User Repository (Team-Owned):

team-analytics/
├── terraform/
│   ├── dev/
│   │   ├── tags.tf                      # Team owns
│   │   ├── _default.auto.tfvars         # Team owns
│   │   ├── _project.auto.tfvars         # Team owns
│   │   ├── managed_by_dp_common_*.tf    # Synced from blueprint
│   │   ├── managed_by_dp_default_*.tf   # Synced from blueprint
│   │   └── managed_by_dp_project_*.tf   # Synced from blueprint
│   ├── staging/
│   └── production/
└── .github/
    └── workflows/
        └── terraform.yml

Workflow Steps

Step 1: Team Creates Configuration

Teams edit only their own files:

  • tags.tf - Defines environment, workload, application
  • _default.auto.tfvars - Account-level config (if first workload)
  • _project.auto.tfvars - Application resources

Step 2: Platform Team Updates Blueprint

When blueprint code needs updating:

# In blueprint repo
cd tools
python repo_updater.py --target ../../../team-analytics/terraform/dev

This syncs all managed_by_dp_*.tf files from blueprint to team repo.

Step 3: Team Commits and Pushes

git add .
git commit -m "feat: add data processing pipeline"
git push origin feature/data-pipeline

Step 4: Terraform Cloud Runs

GitHub Action triggers Terraform Cloud:

  1. Workspace detects VCS change
  2. Runs terraform plan
  3. Shows plan in pull request comment
  4. Team reviews and approves
  5. Merges PR
  6. Terraform Cloud runs terraform apply

Step 5: Resources Created

All AWS resources created with:

  • Standardized naming
  • Automatic IAM policies
  • Full encryption
  • Organizational tags
  • CloudWatch monitoring

No Preprocessing Required

This workflow uses standard Terraform:

  • No build step before terraform plan
  • No code generation at runtime
  • No wrapper scripts
  • Native .tfvars files
  • Standard state management
  • Compatible with Terraform Cloud, Enterprise, or OSS

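Because it is plain Terraform, the VCS-driven workflow itself can be captured in code. A hypothetical sketch using the tfe provider (organization, repository, and token ID are placeholders):

resource "tfe_workspace" "team_analytics_dev" {
  name              = "team-analytics-dev"
  organization      = "org"            # placeholder
  working_directory = "terraform/dev"

  vcs_repo {
    identifier     = "company/team-analytics" # placeholder repo
    branch         = "main"
    oauth_token_id = "ot-XXXXXXXX"            # placeholder token
  }
}
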
Platform   Blueprint   repo_updater.py   Team Repos   Terraform    Application
Team         Repo                           (50+)       Cloud         Team
  │            │              │                │           │            │
  │ Update     │              │                │           │            │
  │ building   │              │                │           │            │
  │ blocks     │              │                │           │            │
  ├───────────>│              │                │           │            │
  │            │              │                │           │            │
  │ git commit │              │                │           │            │
  │ & push     │              │                │           │            │
  ├───────────>│              │                │           │            │
  │            │              │                │           │            │
  │ Run        │              │                │           │            │
  │ --update-  │              │                │           │            │
  │ all-teams  │              │                │           │            │
  ├────────────┼─────────────>│                │           │            │
  │            │              │ Generate 50 PRs│           │            │
  │            │              │ (update        │           │            │
  │            │              │ managed_by_dp) │           │            │
  │            │              ├───────────────>│           │            │
  │            │              │                │ PR triggers│           │
  │            │              │                │ terraform  │           │
  │            │              │                │ plan       │           │
  │            │              │                ├──────────>│            │
  │            │              │                │           │            │
  │            │              │                │ Post plan │            │
  │            │              │                │ as PR     │            │
  │            │              │                │ comment   │            │
  │            │              │                │<──────────┤            │
  │            │              │                │           │            │
  │            │              │                │           │ Review plan│
  │            │              │                │<──────────────────────┤
  │            │              │                │           │            │
  │            │              │                │ Approve & │            │
  │            │              │                │ merge PR  │            │
  │            │              │                │<──────────────────────┤
  │            │              │                │           │            │
  │            │              │                │ Merge     │            │
  │            │              │                │ triggers  │            │
  │            │              │                │ terraform │            │
  │            │              │                │ apply     │            │
  │            │              │                ├──────────>│            │
  │            │              │                │           │            │
  │            │              │                │ Deploy    │            │
  │            │              │                │ updated   │            │
  │            │              │                │ resources │            │
  │            │              │                │           │            │

11. Comparison with Other Approaches

vs. Standard Terraform

Aspect           | Standard Terraform              | This Framework
-----------------|---------------------------------|-------------------------------------------
ARN Management   | Manual ARN strings              | Smart lookups (s3:bucket)
IAM Policies     | Write JSON/HCL policy documents | Auto-generated from permissions map
Naming           | Manually ensure consistency     | Automatic standardized naming
Standards        | Manually enforce                | Building blocks enforce automatically
Cross-references | Direct resource dependencies    | Lookup tables (reduced coupling)
Boilerplate      | High (1000+ lines typical)      | Low (~150 lines typical, ~85% reduction)
Learning Curve   | Steep (requires AWS expertise)  | Moderate (configuration-focused)

vs. Terragrunt

Aspect           | Terragrunt                   | This Framework
-----------------|------------------------------|-------------------------
State Management | Separate tool                | Native Terraform
Preprocessing    | Required (terragrunt run)    | None (native Terraform)
Compatibility    | Wrapper tool required        | Standard terraform CLI
DRY Approach     | File includes & remote state | Lookup tables & modules
Complexity       | Additional tool layer        | Pure Terraform
IDE Support      | Limited (custom syntax)      | Full (standard HCL)

vs. Terraspace

Aspect         | Terraspace                  | This Framework
---------------|-----------------------------|----------------------------
Language       | Ruby DSL + ERB templates    | Pure HCL
Preprocessing  | Required (terraspace build) | None
Runtime        | Ruby interpreter needed     | Native Terraform only
Configuration  | ERB templating              | Native tfvars
Tooling        | Additional CLI wrapper      | Standard Terraform CLI
Learning Curve | Learn Ruby + Terraspace     | Learn framework conventions

vs. Terraform CDK

Aspect           | Terraform CDK                | This Framework
-----------------|------------------------------|------------------------
Language         | TypeScript/Python/Java/C#/Go | Pure HCL
Compilation      | Required (cdktf synth)       | None
Runtime          | Node.js/Python runtime       | Native Terraform only
Configuration    | Imperative code              | Declarative tfvars
State Inspection | Via generated JSON           | Native Terraform state
IDE Support      | Language-specific            | Terraform-specific

Key Advantages of This Approach

  1. No External Dependencies: Pure Terraform, no additional tools
  2. Native Workflows: Works with Terraform Cloud, Enterprise, OSS
  3. Type Safety: Terraform validates references at plan time
  4. Version Control: Standard .tfvars files, readable diffs
  5. IDE Support: Full support from Terraform plugins
  6. Learning Curve: Lower (no new language/tool to learn)
  7. Portability: Standard Terraform state, no lock-in
  8. Debugging: Standard Terraform error messages and plan output
                    ┌─────────────────────────┐
                    │  Terraform Approaches   │
                    └────────────┬────────────┘
                                 │
         ┌───────────┬───────────┼───────────┬───────────┐
         │           │           │           │           │
         ▼           ▼           ▼           ▼           ▼
┌────────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐
│  Standard  │ │Terragrunt│ │Terraspace│ │Terraform │ │     This     │
│  Terraform │ │          │ │          │ │   CDK    │ │  Framework   │
└─────┬──────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └──────┬───────┘
      │             │            │            │              │
      │Manual ARNs  │Wrapper     │Ruby DSL    │TypeScript/   │Pure HCL
      │High         │tool        │ERB         │Python        │Smart
      │boilerplate  │Preprocessing│templates  │Compilation   │lookups
      │             │            │            │              │
      ▼             ▼            ▼            ▼              ▼
┌─────────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐
│   1000+     │ │terragrunt│ │terraspace│ │  cdktf   │ │     150      │
│   lines/    │ │   run    │ │  build   │ │  synth   │ │   lines/     │
│  workload   │ │ required │ │ required │ │ required │ │  workload    │
│             │ │          │ │          │ │          │ │      ✓       │
└─────────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────────┘

TL;DR - Section 11: This framework beats alternatives by using pure Terraform with zero preprocessing. Standard Terraform requires manual ARN management (1000+ lines). Terragrunt/Terraspace/CDK add preprocessing layers (wrapper tools, Ruby runtime, Node.js compilation). This approach achieves 85% boilerplate reduction through smart lookups and building blocks while maintaining full Terraform Cloud compatibility and native workflows.


12. Lessons Learned and Best Practices

What Worked Well

1. Colon Syntax is Intuitive

Developers adopted s3:bucket_name syntax immediately. It reads like natural configuration.

2. Building Blocks Enforce Standards

Opinionated modules ensure consistency without policing. Teams can't accidentally create non-compliant resources.

3. Separation of Concerns

Platform team manages managed_by_dp_*.tf files, teams manage *.tfvars files. Clear ownership boundaries.

4. Lookup Tables Reduce Coupling

Resources don't reference each other directly, which reduces cascading changes when refactoring.

5. Predictive Naming Solves Most Circular Dependencies

Most cross-resource references can use naming conventions instead of module outputs.

Challenges and Solutions

Challenge 1: Circular Dependencies

Some resource relationships create cycles that Terraform can't resolve.

Solutions:

  • Use predictive naming instead of module outputs (see the sketch after this list)
  • Two-phase deployment (apply twice)
  • Selective resource inclusion in policies
  • Data sources for cross-workload lookups
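
Predictive naming works because building blocks derive every resource name from the shared prefix, so an ARN can be computed before the resource exists. A minimal sketch, assuming a "<prefix>-<name>" bucket naming convention:

locals {
  # The bucket keyed "processed" will be created as "${local.prefix}-processed",
  # so its ARN is predictable without reading module.s3_buckets["processed"].arn.
  # This breaks cycles where an IAM policy must reference a bucket that itself
  # depends on the role being created.
  predicted_processed_arn = "arn:aws:s3:::${local.prefix}-processed"
}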

Challenge 2: Lookup Complexity

Lookup tables can become large and hard to maintain.

Solutions:

  • Organized into logical groups (lookup_perm_lambda, lookup_id_base)
  • Inline comments documenting purpose
  • Automated generation via for expressions
  • Cross-account lookups separated into _xa maps

Challenge 3: Building Block Versioning

Updating building block versions across many teams is coordination-heavy.

Solutions:

  • Semantic versioning with ~> constraints
  • Deprecation warnings for old versions
  • Automated testing of building block changes
  • Communication channel for breaking changes

Challenge 4: Developer Onboarding

New developers need to learn lookup syntax and conventions.

Solutions:

  • Comprehensive examples in blueprint repo
  • Detailed README with common patterns
  • IntelliSense/autocomplete via Terraform language server
  • Helper scripts to validate tfvars before commit

Best Practices

1. Use Descriptive Resource Keys

# Good
s3_buckets = {
  raw_customer_data = { ... }
  processed_analytics = { ... }
}

# Bad
s3_buckets = {
  bucket1 = { ... }
  bucket2 = { ... }
}

2. Group Related Resources

# Process: S3 → Lambda → Glue → Redshift
s3_buckets = { raw = {...}, processed = {...} }
lambda_functions = { processor = {...} }
glue_jobs = { transform = {...} }
redshift_databases = { analytics = {...} }

3. Use Comments to Document Intent

# Data pipeline for customer analytics
# Flow: External API → raw bucket → Lambda → processed bucket → Glue → Redshift
lambda_functions = {
  api_ingestion = { ... }
}

4. Leverage Type Inference

# Instead of:
permissions = {
  s3_read = ["s3_read:raw"]
}

# Prefer (type inferred from key):
permissions = {
  s3_read = ["raw"]
}

5. Test in Lower Environments First

dev → staging → production

Use identical tfvars across environments, only changing tags.tf (environment name).

6. Version Pin Building Blocks

# Use pessimistic constraint
source  = "app.terraform.io/org/buildingblock-lambda/aws"
version = "~> 3.2.0"  # Allows 3.2.x, not 3.3.0

7. Document Cross-Account Access

# Cross-account: Read from Data Lake account
cross_accounts = {
  datalake = "123456789012"  # Managed by Data Lake team
}

13. Impact and Metrics

Development Velocity Improvements

Before This Framework:

  • ~1000 lines of Terraform per workload
  • 2-3 weeks to onboard new team
  • 5+ days to add new resource type
  • Frequent IAM permission errors
  • Inconsistent naming across teams
  • Manual policy review process

After This Framework:

  • ~150 lines of tfvars per workload (85% reduction)
  • 2-3 days to onboard new team
  • 1 day to add new resource type
  • Rare IAM errors (auto-generated policies)
  • Consistent naming (automatic)
  • Automated policy compliance

Code Quality Improvements

Reduction in Boilerplate:

Traditional approach (S3 + Lambda with IAM):

# ~250 lines for: S3 bucket, IAM role, IAM policy document,
# Lambda function, CloudWatch log group, etc.

This framework (same resources):

# ~30 lines of tfvars
s3_buckets = { data = { name = "data" } }
lambda_functions = {
  processor = {
    name = "processor"
    permissions = { s3_read = ["data"] }
  }
}

Boilerplate Reduction: ~88%

Governance and Compliance

Automatic Enforcement:

  • 100% of resources use standardized naming
  • 100% of resources encrypted with KMS
  • 100% of resources tagged per policy
  • 100% of IAM policies include permission boundaries
  • 100% of Lambda functions in VPC
  • 0 manual policy reviews required
        Before Framework                      After Framework
┌────────────────────────────┐      ┌────────────────────────────┐
│                            │      │                            │
│  • 1000+ lines Terraform   │─────>│  • 150 lines tfvars        │
│                            │      │    (85% reduction)         │
│                            │      │                            │
└────────────────────────────┘      └────────────────────────────┘

┌────────────────────────────┐      ┌────────────────────────────┐
│                            │      │                            │
│  • 2-3 weeks onboarding    │─────>│  • 2-3 days onboarding     │
│                            │      │    (5x faster)             │
│                            │      │                            │
└────────────────────────────┘      └────────────────────────────┘

┌────────────────────────────┐      ┌────────────────────────────┐
│                            │      │                            │
│  • Manual IAM policies     │─────>│  • Auto-generated IAM      │
│                            │      │    (Rare errors)           │
│                            │      │                            │
└────────────────────────────┘      └────────────────────────────┘

┌────────────────────────────┐      ┌────────────────────────────┐
│                            │      │                            │
│  • Inconsistent naming     │─────>│  • 100% consistent         │
│                            │      │    (Automatic compliance)  │
│                            │      │                            │
└────────────────────────────┘      └────────────────────────────┘

TL;DR - Section 13: Framework delivers measurable improvements: 85% boilerplate reduction (1000→150 lines), 5x faster team onboarding (weeks→days), rare IAM errors (auto-generated policies), and 100% compliance (automatic naming, tagging, encryption, permission boundaries). Every resource is encrypted with KMS, tagged per policy, and uses least-privilege IAM—all enforced by building blocks with zero manual reviews.


14. Future Enhancements

Planned Features

1. Multi-Region Support

Enable workloads spanning multiple AWS regions:

regions = ["eu-central-1", "us-east-1"]

s3_buckets = {
  replicated_data = {
    name = "data"
    replication_regions = ["us-east-1"]
  }
}

2. Enhanced Lookup Syntax

Support nested lookups:

environment = {
  BUCKET_PATH = "s3:mybucket:/path/prefix"
  TABLE_COLUMN = "dynamodb:mytable:attribute:id"
}

3. Building Block Customization

Allow team-specific overrides while maintaining compliance:

s3_buckets = {
  special = {
    name = "special"
    override_defaults = {
      versioning_enabled = false  # Team takes responsibility
    }
  }
}

4. Cost Estimation

Integrate with AWS Pricing API to estimate costs before apply:

# In plan output:
# Estimated monthly cost: $1,234.56
#   - Lambda: $123.45
#   - S3: $456.78
#   - Redshift: $654.33

5. Dependency Visualization

Generate visual dependency graphs from lookup tables:

S3:raw → Lambda:processor → S3:processed → Glue:transform → Redshift:analytics

Potential Improvements

1. Resolve Two-Phase Deployment

Investigate Terraform's -target flag or module dependencies to eliminate the "apply twice" requirement.
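
One candidate worth evaluating is an explicit module dependency (supported since Terraform 0.13), which sequences building blocks within a single apply. A hypothetical sketch:

# Hypothetical: make the Lambda building block wait for all buckets, which
# may remove the need for a second apply in some dependency cycles.
module "lambda" {
  source     = "./modules/buildingblock-lambda" # placeholder path
  depends_on = [module.s3_buckets]
}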

2. Building Block Catalog

Create searchable catalog of building blocks with examples:

  • Searchable by AWS service
  • Filterable by capability (encryption, backups, monitoring)
  • Includes terraform-docs generated documentation

3. Policy Simulation

Pre-validate IAM policies using AWS IAM Policy Simulator before apply:

terraform plan | policy-simulator --validate

4. Drift Detection

Automated drift detection for resources created outside Terraform:

terraform-drift-detector --alert slack://channel

15. Conclusion

Summary

We've built a Native Terraform IaC Framework that achieves the developer experience of high-level abstractions while maintaining 100% compatibility with standard Terraform workflows. The key innovations are:

  1. Smart Lookup Syntax: Colon-separated references (s3:bucket, lambda:function) resolved via native Terraform expressions
  2. Building Block Abstraction: Opinionated modules that enforce standards and generate IAM policies automatically
  3. Zero Preprocessing: Pure Terraform - works with Terraform Cloud, CLI, and all standard tooling
  4. Clear Separation: Platform team manages code, application teams manage configuration
  5. Context Propagation: Naming and tagging enforced automatically via context system

Why This Matters

For Platform Engineers:

  • Enforce organizational standards without restricting teams
  • Reduce support burden (teams self-service)
  • Centralized updates via building blocks
  • Scalable to hundreds of workloads

For Application Teams:

  • Write configuration, not code
  • No AWS expertise required
  • Fast onboarding (days, not weeks)
  • Focus on business logic, not infrastructure

For Organizations:

  • Consistent security posture
  • Automated compliance
  • Cost visibility via standardized tagging
  • Reduced risk (guardrails prevent misconfigurations)

Key Takeaways

  1. Native Terraform is Powerful: With creative use of locals and lookups, you can build sophisticated abstractions without preprocessing

  2. Configuration Over Code: Separating what (tfvars) from how (modules) reduces complexity

  3. Building Blocks Scale: Opinionated modules enable governance at scale

  4. Developer Experience Matters: Investment in ergonomics pays dividends in velocity and adoption

  5. Standards Enable Freedom: Guardrails paradoxically enable teams to move faster

                 ┌─────────────────────────────────┐
                 │  Native Terraform Framework     │
                 └──────────────┬──────────────────┘
                                │
        ┌───────────────────────┼───────────────────────┐
        │                       │                       │
        ▼                       ▼                       ▼
┌───────────────┐   ┌──────────────────┐   ┌──────────────────┐
│ Smart Lookups │   │ Building Blocks  │   │ Separation of    │
│               │   │                  │   │ Code & Config    │
└───────┬───────┘   └────────┬─────────┘   └────────┬─────────┘
        │                    │                       │
        │            ┌───────▼────────┐              │
        │            │Context         │              │
        │            │Propagation     │              │
        │            └───────┬────────┘              │
        │                    │                       │
        └────────────────────┼───────────────────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
┌────────────────┐  ┌─────────────────┐  ┌──────────────────┐
│ 85% Boilerplate│  │      Zero       │  │    Automated     │
│   Reduction    │  │  Preprocessing  │  │ Updates at Scale │
└────────┬───────┘  └────────┬────────┘  └─────────┬────────┘
         │                   │                      │
         │           ┌───────▼────────┐             │
         │           │      100%      │             │
         │           │   Compliance   │             │
         │           └───────┬────────┘             │
         │                   │                      │
         └───────────────────┼──────────────────────┘
                             │
                             ▼
                  ┌──────────────────────┐
                  │   50+ Teams Can      │
                  │   Self-Service       │
                  │   Infrastructure     │
                  └──────────────────────┘

TL;DR - Conclusion: This native Terraform framework proves that developer-friendly IaC doesn't require preprocessing or external tools. By combining smart lookups (s3:bucket), opinionated building blocks, configuration/code separation, and context propagation, we achieve 85% boilerplate reduction while maintaining full Terraform Cloud compatibility. Platform teams scale updates via automated PRs, application teams self-service via simple tfvars, and organizations get automatic compliance. Native Terraform can be elegant, scalable, and secure.


16. Getting Started Guide

For teams interested in adopting this approach:

Step 1: Assess Your Needs

Good fit if:

  • Multiple teams deploying similar infrastructure
  • Need to enforce organizational standards
  • Want to reduce AWS expertise requirement
  • High volume of infrastructure deployments

Not a good fit if:

  • Small team (1-2 people) with custom requirements
  • Infrastructure is highly heterogeneous
  • Team prefers low abstraction level

Step 2: Start Small

Begin with a pilot:

  1. Choose one AWS service (e.g., S3)
  2. Build an opinionated building block module (a minimal sketch follows this list)
  3. Create lookup mechanism for that service
  4. Test with one team
  5. Iterate based on feedback
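
A pilot building block can be very small. The sketch below shows a minimal, illustrative S3 module enforcing naming, tagging, and encryption; the variable names and conventions are assumptions, not this framework's actual code:

variable "prefix"  { type = string }
variable "name"    { type = string }
variable "context" { type = map(string) }

resource "aws_s3_bucket" "this" {
  bucket = "${var.prefix}-${var.name}" # standardized naming enforced here
  tags   = var.context                 # organizational tags applied automatically
}

resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms" # encryption on by default
    }
  }
}

output "arn" { value = aws_s3_bucket.this.arn }
output "id"  { value = aws_s3_bucket.this.id }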

Step 3: Build Your Building Blocks

For each AWS service:

  1. Define organizational standards (naming, tagging, encryption)
  2. Create Terraform module enforcing standards
  3. Add permission generation logic (see the sketch after this list)
  4. Version and publish to private registry
  5. Write documentation and examples
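
Permission generation can start as a simple mapping from permission category to IAM actions. A hedged sketch (the action sets below are assumptions; real building blocks would be more granular):

variable "permissions" {
  type = map(list(string)) # e.g. { s3_read = [<resolved bucket ARNs>] }
}

locals {
  # Assumed mapping from permission category to IAM actions.
  action_sets = {
    s3_read  = ["s3:GetObject", "s3:ListBucket"]
    s3_write = ["s3:PutObject", "s3:DeleteObject"]
  }
}

data "aws_iam_policy_document" "this" {
  # One statement per permission category in the team's map.
  dynamic "statement" {
    for_each = var.permissions
    content {
      effect    = "Allow"
      actions   = local.action_sets[statement.key]
      resources = statement.value
    }
  }
}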

Step 4: Create Lookup System

  1. Define lookup syntax (e.g., type:name)
  2. Create lookup locals maps
  3. Add resolution logic to building blocks (sketched after this list)
  4. Test cross-resource references
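
At its core, the lookup system is a nested map plus split()/try() resolution. A minimal sketch for a single service (Appendix A shows the full version):

locals {
  # One lookup map per reference type (illustrative).
  lookup = {
    s3 = { for k in keys(var.s3_buckets) : k => module.s3_buckets[k].arn }
  }

  # Resolve "s3:raw_data" to an ARN; fall back to the literal string so
  # plain values pass through unchanged.
  example_ref      = "s3:raw_data"
  example_resolved = try(
    local.lookup[split(":", local.example_ref)[0]][split(":", local.example_ref)[1]],
    local.example_ref
  )
}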

Step 5: Document and Socialize

  1. Write comprehensive README
  2. Create example projects
  3. Run training sessions
  4. Set up support channel
  5. Gather feedback and iterate

Step 6: Scale

  1. Add more building blocks incrementally
  2. Onboard teams progressively
  3. Monitor usage and pain points
  4. Continuously improve based on feedback

Appendix: Code Samples

A. Lookup Table Implementation

File: tf-project/terraform/managed_by_dp_project_lookup.tf

locals {
  # Build base lookup maps for ARNs (used in IAM policies)
  lookup_arn_base = merge(var.lookup_arns, {
    "kms" = {
      "kms_data"  = local.kms_data_key_arn
      "kms_infra" = local.kms_infrastructure_key_arn
    }
    "s3_read"  = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].arn }
    "s3_write" = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].arn }
    "gluejob"  = { for item in keys(var.glue_jobs) : item => module.glue_jobs[item].arn }
    "gluedb"   = { for item in keys(var.glue_database) : item => module.glue_databases[item].name }
    "secret_read" = { for item in keys(var.secrets) : item => module.secrets[item].arn }
    "dynamodb_read" = { for item in keys(var.dynamodb_databases) : item => module.dynamodb[item].arn }
  })

  # Build base lookup maps for IDs (used in environment variables)
  lookup_id_base = merge(var.lookup_ids, {
    "s3" = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].id }
    "secret" = { for item in keys(var.secrets) : item => module.secrets[item].id }
    "dynamodb" = { for item in keys(var.dynamodb_databases) : item => module.dynamodb[item].name }
    "athena" = { for item in keys(var.athena_workgroups) : item => module.athena[item].name }
  })

  # Specialized lookup for Lambda permissions
  lookup_perm_lambda = merge(
    local.lookup_arn_base,
    local.lookup_perm_lambda_xa,  # Cross-account additions
    {
      "sqs_read" = { for item in keys(var.sqs_queues) : item => module.sqs[item].queue_arn }
      "sqs_send" = { for item in keys(var.sqs_queues) : item => module.sqs[item].queue_arn }
      "sns_pub"  = { for item in keys(var.sns_topics) : item => module.sns[item].topic_arn }
    }
  )

  # Specialized lookup for Lambda environment variables
  lookup_id_lambda = merge(
    local.lookup_id_base,
    {
      "sqs" = { for item in keys(var.sqs_queues) : item => module.sqs[item].queue_url }
      "sns" = { for item in keys(var.sns_topics) : item => module.sns[item].topic_arn }
    }
  )
}

B. Lambda Building Block Usage

File: tf-project/terraform/managed_by_dp_project_lambda.tf

module "lambda" {
  source  = "app.terraform.io/org/buildingblock-lambda/aws"
  version = "3.2.0"

  for_each = var.lambda_functions

  # Standard fields
  prefix  = local.prefix
  context = local.context
  name    = try(each.value.name, each.key)

  # Environment variables with smart lookup
  environments = {
    for type, item in try(each.value.environment, {}) : type =>
      try(
        # Try to resolve as "type:name" lookup
        local.lookup_id_lambda[split(":", item)[0]][split(":", item)[1]],
        item  # Fallback to literal value
      )
  }

  # Permissions with smart lookup and automatic policy generation
  permissions = {
    for type, items in try(each.value.permissions, {}) : type => [
      for item in items :
      (
        # Check if it's namespaced format "type:name"
        length(split(":", item)) == 2
        ? try(
            local.lookup_perm_lambda[split(":", item)[0]][split(":", item)[1]],
            item
          )
        : try(
            # Infer type from permission category key
            local.lookup_perm_lambda[type][item],
            item
          )
      )
    ]
  }

  # Create IAM role and policy automatically
  create_policy = true

  # Injected infrastructure details
  kms_key_arn = local.kms_data_key_arn
  subnet_ids  = local.subnet_ids
  vpc_id      = local.vpc_id

  # User-provided configuration
  handler     = each.value.handler
  runtime     = each.value.runtime
  memory      = try(each.value.memory, 512)
  timeout     = try(each.value.timeout, 300)
  description = try(each.value.description, "")

  # Resolve S3 source file location
  s3_bucket = local.code_bucket
  s3_key = (
    split(":", each.value.s3_sourcefile)[0] == "s3_file"
    ? try(
        local.s3_target_path[split(":", each.value.s3_sourcefile)[1]],
        each.value.s3_sourcefile
      )
    : each.value.s3_sourcefile
  )
}

C. Example Workload Configuration

File: examples/full_test/_project.auto.tfvars

# S3 Buckets
s3_buckets = {
  raw_data = {
    name   = "raw"
    backup = true
    lifecycle_rules = [{
      id              = "archive_old_data"
      transition_days = 90
      storage_class   = "GLACIER"
    }]
  }
  processed_data = {
    name                            = "processed"
    enable_intelligent_tiering      = true
    enable_eventbridge_notification = true
  }
}

# Upload code artifacts
s3_source_files = {
  processor_code = {
    source = "lambda_processor.zip"
    target = "lambda_functions/processor/code.zip"
  }
  transform_script = {
    source = "glue_transform.py"
    target = "glue_jobs/transform/script.py"
  }
}

# Secrets
secrets = {
  database_creds = {
    name = "db-credentials"
    secret_string = {
      username = "admin"
      password = ""  # Set via AWS Console
    }
  }
}

# Glue Database
glue_database = {
  analytics = {
    name                   = "analytics"
    bucket                 = "s3:processed_data"
    enable_lakeformation   = true
    share_cross_account_ro = ["datagovernance"]
  }
}

# Lambda Function
lambda_functions = {
  data_processor = {
    name        = "processor"
    description = "Processes incoming data files"
    handler     = "index.handler"
    runtime     = "python3.13"
    memory      = 2048
    timeout     = 900
    in_vpc      = true

    s3_sourcefile = "s3_file:processor_code"

    environment = {
      RAW_BUCKET       = "s3:raw_data"
      PROCESSED_BUCKET = "s3:processed_data"
      GLUE_DATABASE    = "gluedb:analytics"
      DB_SECRET        = "secret:database_creds"
      LOG_LEVEL        = "INFO"
    }

    permissions = {
      s3_read     = ["raw_data"]
      s3_write    = ["processed_data"]
      glue_update = ["analytics"]
      secret_read = ["database_creds"]
    }

    event_source_mapping = [{
      event_source_arn = "s3:raw_data"
      events           = ["s3:ObjectCreated:*"]
      filter_prefix    = "incoming/"
      filter_suffix    = ".csv"
    }]
  }
}

# Glue ETL Job
glue_jobs = {
  data_transform = {
    name              = "transform"
    description       = "Transforms processed data"
    glue_version      = "4.0"
    worker_type       = "G.1X"
    number_of_workers = 5
    max_retries       = 2
    timeout           = 120

    script_location = "s3_file:transform_script"

    arguments = {
      "--job-language"                     = "python"
      "--enable-metrics"                   = "true"
      "--enable-continuous-cloudwatch-log" = "true"
      "--DATABASE"                         = "gluedb:analytics"
      "--INPUT_BUCKET"                     = "s3:processed_data"
      "--DB_SECRET"                        = "secret:database_creds"
    }

    permissions = {
      s3_read     = ["processed_data"]
      glue_update = ["analytics"]
      secret_read = ["database_creds"]
    }

    trigger_type = "SCHEDULED"
    schedule     = "cron(0 2 * * ? *)"  # Daily at 2 AM UTC
  }
}

Final Thoughts

This framework demonstrates that native Terraform can be elegant and developer-friendly without sacrificing power or flexibility. By leveraging Terraform's built-in features creatively—for expressions, try() functions, split() operations, and locals—we've built a system that:

  • Feels like configuration (simple tfvars files)
  • Works like Terraform (native tooling, no preprocessing)
  • Scales like a platform (hundreds of workloads, multiple teams)
  • Governs like policy (automatic enforcement, no manual reviews)

The journey from verbose, error-prone Terraform code to concise, validated configuration files represents a significant step forward in Infrastructure as Code maturity. Most importantly, it's achieved through native Terraform capabilities, ensuring long-term compatibility and eliminating external dependencies.

As organizations scale their cloud infrastructure, frameworks like this become essential for maintaining velocity, consistency, and security. The patterns demonstrated here can be adapted to any cloud provider, resource type, or organizational requirement; the principles of smart lookups, building block abstraction, and configuration separation are universally applicable.

The future of Infrastructure as Code is declarative, native, and developer-friendly. This framework is a blueprint for getting there.


Acknowledgments

This framework was built by collaborative iteration between platform engineers and application teams, learning from real-world challenges and continuously refining the developer experience. Special recognition to the teams who adopted early versions, provided feedback, and helped shape the patterns documented here.
