TL;DR:
A pure Terraform framework that lets 50+ teams self-service infrastructure by writing simple .tfvars files while the platform team manages opinionated "building blocks." Smart lookups (s3:bucket_name) enable cross-resource references. When patterns improve, automated scripts generate PRs for all teams; they review terraform plan and inherit improvements without code changes. 85%+ boilerplate reduction, zero preprocessing, fully compatible with Terraform Cloud.
This blog post documents how a platform engineering team built a Terraform framework that scales to 50+ application teams with mixed skill levels—enabling fast, self-service infrastructure deployment while maintaining governance and security standards.
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ 50+ Teams │ │ Platform │ │ Patterns │
│ Write Simple │─────>│ Manages │─────>│ Improve │
│ tfvars │ │ Building Blocks │ │ Over Time │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │
│ ▼
│ ┌─────────────────┐
│ │ Automated │
│ │ PRs Generated │
│ └─────────────────┘
│ │
│ ▼
│ ┌─────────────────┐
│ │ Teams Review │
│ │ terraform plan │
│ └─────────────────┘
│ │
│ ▼
│ ┌─────────────────┐
└──────────────────│ Approve & Apply│
(updates) │ Stay Current │
└─────────────────┘
The Challenge: Platform teams face an impossible trade-off: let teams write their own Terraform (resulting in inconsistent, outdated implementations) or manually review and update every workload (doesn't scale beyond ~10 teams).
The Solution: A native Terraform framework that separates configuration (what teams deploy) from implementation (how it's deployed securely). Application teams write simple .tfvars files, platform team manages opinionated "building blocks" that evolve over time. When patterns improve (adding VPC, encryption, monitoring), automated scripts generate PRs for all teams—they review terraform plan and approve, inheriting improvements without code changes.
Key Innovation: Native Terraform "smart lookups" (s3:bucket_name, lambda:function_name) allow cross-resource references while maintaining the separation. No preprocessing, no code generation—pure Terraform compatible with standard tooling and Terraform Cloud.
Target Audiences
- Platform Engineers: Detailed implementation of the lookup mechanism and building block architecture
- DevOps/SRE Teams: Comparison with Terragrunt/Terraspace and practical benefits
- Cloud Architects: Strategic value and governance capabilities
- Technical Leaders: Development velocity improvements and complexity reduction
1. Introduction: Helping Teams Build Faster at Scale
Opening Hook:
"How do you help 50 teams build and deploy infrastructure faster—when they have different levels of AWS and Terraform expertise, need similar-but-not-identical workloads, and your platform team can't manually review and update every project?"
The Human Challenge: Speed vs. Standards
Picture this familiar scenario:
Your Organization:
- 50+ application teams building data pipelines, microservices, analytics platforms
- Mixed skill levels:
  - 20% have AWS experts who know IAM policies inside-out
  - 50% are competent with Terraform but learning AWS services
  - 30% are new to both, just want to deploy their application
- Platform/DevOps team of 5-10 people responsible for:
  - Cloud governance and security
  - Cost optimization
  - Compliance and best practices
  - Supporting all those teams
What Application Teams Want:
- Deploy fast: Days, not weeks of waiting
- Self-service: Don't wait for platform team approval on every change
- Focus on their app: Not become AWS/Terraform experts
- Consistency: "Just tell me what works and let me copy it"
What Platform Team Needs:
- Enforce standards: Security, tagging, encryption, monitoring
- Scale support: Can't grow team 1:1 with application teams
- Continuous improvement: Patterns evolve as we learn
- Prevent drift: All workloads stay current with best practices
The Core Problem: Similar Workloads, Different Implementations
When teams write their own Terraform, you get variations of the same infrastructure:
Option 1: Raw Terraform Resources (Maximum Flexibility, Minimum Maintainability)
# Team A writes Lambda in January 2024
resource "aws_lambda_function" "processor_v1" {
function_name = "processor"
runtime = "python3.11"
# ... 50 lines of configuration
# Missing: VPC config, proper IAM policies, CloudWatch retention
}
# Team B writes Lambda in March 2024 (learned from Team A's mistakes)
resource "aws_lambda_function" "processor_v2" {
function_name = "processor"
runtime = "python3.12"
# ... 80 lines of configuration
# Now includes: VPC, better IAM, but still missing X-Ray tracing
}
# Team C writes Lambda in June 2024 (organization learned best practices)
resource "aws_lambda_function" "processor_v3" {
function_name = "processor"
runtime = "python3.13"
# ... 120 lines of configuration
# All best practices: VPC, IAM, X-Ray, proper logging, tags
}
The Problems:
- Inconsistent implementations: 50 workloads = 50 slightly different Lambda configurations
- Knowledge doesn't propagate: Teams A and B don't benefit from improvements learned by Team C
- Backporting is impossible: How do you update 50 workloads when security requires KMS encryption?
- Copy-paste culture: Teams copy from each other, propagating old patterns and bugs
- Expertise silos: Only AWS experts can write correct infrastructure
Option 2: Standard Terraform Modules (Better Reuse, Still Hard to Evolve)
# Using terraform-aws-modules/lambda/aws
module "lambda" {
source = "terraform-aws-modules/lambda/aws"
version = "4.0.0"
function_name = "processor"
# ... still 40+ lines of configuration
# Better: module handles some best practices
# Problem: upgrading 50 workloads from v4.0.0 → v5.0.0 is manual work
}
The Problems:
- Version sprawl: Workloads stuck on different module versions (v3.2, v4.0, v4.5, v5.0)
- Breaking changes: Module updates require testing every workload
- Configuration drift: Each team configures modules differently
- Limited abstraction: Still requires deep AWS knowledge to use correctly
- Manual upgrades: Someone has to update 50 PRs when a new version releases
The Real Challenge: N×N Complexity
As you improve your infrastructure patterns over time:
- You learn Lambda should use VPC → Need to update 50 workloads
- Security requires KMS encryption → Need to update 50 workloads
- Compliance requires specific tags → Need to update 50 workloads
- New AWS best practice emerges → Need to update 50 workloads
The math is brutal:
- 50 workloads × 10 resource types × 5 improvements per year = 2,500 manual updates
- Each update risks breaking something
- Each workload drifts further from best practices
- Teams become afraid to improve shared patterns
Our Solution: True Separation of Code and Configuration
The Insight: What if we could update how infrastructure is created without touching what infrastructure exists?
# Team writes configuration ONCE (2024)
lambda_functions = {
processor = {
name = "processor"
runtime = "python3.13"
permissions = {
s3_read = ["raw_data"]
}
}
}
Behind the scenes (managed by platform team):
- January 2024: Lambda building block v1.0 (basic implementation)
- March 2024: Lambda building block v1.5 (adds VPC, better IAM)
- June 2024: Lambda building block v2.0 (adds X-Ray, proper logging)
- September 2024: Lambda building block v2.5 (adds permission boundaries)
The team's configuration never changes. The platform team updates the building block implementation, and all 50 workloads automatically get improvements on next terraform apply.
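To make that concrete, here is a simplified sketch of how only the platform-managed module call evolves between versions. The registry path, version numbers, and the extra arguments are illustrative, not the framework's actual building block interface:
# managed_by_dp_project_lambda.tf -- platform-managed; shown here at building block v2.0.
# At v1.0 this call passed only name/runtime; the team's tfvars above never changed.
module "lambda" {
  source   = "app.terraform.io/org/buildingblock-lambda/aws"
  version  = "2.0.0" # was "1.0.0" in January 2024
  for_each = var.lambda_functions

  name    = each.value.name
  runtime = each.value.runtime

  enable_xray_tracing = true # opinionated default added by the platform team in v2.0
  log_retention_days  = 30   # opinionated default added by the platform team in v2.0
}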
This Framework Achieves:
- Separation of Concerns: Configuration (what) lives in tfvars, implementation (how) lives in building blocks
- Continuous Improvement: Platform team evolves patterns without breaking workloads
- Zero Backporting: Workloads automatically inherit improvements
- Maintained References: Terraform's powerful dependency graph still works (via smart lookups)
- Escape Hatch: Teams can still use raw Terraform resources when needed for edge cases
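As a sketch of that escape hatch (the file name and resource are hypothetical), a team can keep a plain Terraform resource next to the platform-managed files; it shares the same state and, by hand, the same conventions:
# _extra.tf -- hypothetical team-owned file beside the managed_by_dp_*.tf files
resource "aws_sqs_queue" "edge_case" {
  name = "companyp-analytics-etl-edge-case" # naming convention applied manually

  tags = {
    Workload    = "analytics"
    Application = "etl"
    Env         = "prd"
  }
}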
The Innovation:
A pure Terraform framework that:
- Uses colon-separated syntax (s3:bucket_name) for resource references
- Resolves lookups dynamically using native Terraform expressions
- Abstracts AWS complexity through opinionated building blocks
- Works seamlessly with Terraform Cloud and standard workflows
- Updates centrally but applies individually
Coverage:
- Handles 90-95% of common workload patterns through building blocks
- Allows raw Terraform resources alongside building blocks for edge cases
- Manages N×N complexity (lookups between all resource types)
The Result:
- Platform team maintains the framework (1 codebase)
- 50 teams write simple configurations (50 tfvars files)
- Everyone benefits from continuous improvement
- No preprocessing, no code generation, pure Terraform
Lifecycle Management: Keeping Up With Scale
The Separation Strategy:
The framework separates two concerns that evolve at different speeds:
- Configuration (Team-Owned): What workload resources exist
  - Lives in team repositories as .tfvars files
  - Teams control: which Lambda, what S3 buckets, environment variables
  - Changes infrequently (when application requirements change)
- Implementation (Platform-Owned): How resources are created
  - Lives in blueprint repository as managed_by_dp_*.tf files
  - Platform controls: security policies, naming, encryption, monitoring
  - Changes frequently (as patterns improve)
The Update Process:
When the platform team improves patterns (add VPC support, update KMS policies, new monitoring):
# Platform team's workflow
cd blueprint-repository
# Update building block versions, add new features
git commit -m "feat: add X-Ray tracing to Lambda building block"
# Generate PRs for all 50 team repositories
./tools/repo_updater.py --update-all-teams
# Result: 50 automated PRs created
# Each PR updates only managed_by_dp_*.tf files
# Teams' tfvars files are NEVER touched
Team's Approval Workflow:
# Team receives automated PR: "Update platform code to v2.5"
# PR shows ONLY changes to managed_by_dp_*.tf files
# Team's _project.auto.tfvars is unchanged
# Team reviews terraform plan in PR comments
terraform plan
# Shows: "Lambda function will be updated in-place"
# " + vpc_config { ... }" (new VPC configuration added)
# Team approves and merges
# Terraform Cloud runs terraform apply
# Workload gets new feature automatically
The Math Works:
- Without this approach: 50 teams × 10 resource types × 5 improvements/year = 2,500 manual updates
- With this approach: 1 platform team × 1 script × 50 automated PRs = 50 team approvals (30 minutes each)
Platform team scales from:
- 10 person-weeks of manual updates (touching every team's code)
- To: 2 person-days (writing script, reviewing automation)
Teams benefit:
- Receive improvements without doing any work
- Review and approve changes (maintain control)
- terraform plan shows exactly what changes
- Rollback is just reverting the PR
Key Principles:
- Teams own configuration: Platform can't break their workload definitions
- Platform owns implementation: Teams benefit from continuous improvement
- Automation bridges scale: Scripts generate PRs, teams approve
- Terraform validates: Standard plan shows changes before apply
- Gradual rollout: Platform can update 5 teams first, validate, then roll to 45 more
This lifecycle separation is what makes the framework sustainable at scale—platform team doesn't become a bottleneck, teams maintain velocity, everyone stays current with best practices.
TL;DR - Section 1: Platform teams face N×N complexity when updating 50+ workloads with infrastructure improvements. This framework separates configuration (team-owned tfvars) from implementation (platform-owned building blocks). Automated PR generation scales updates: the platform improves once, and all teams inherit via terraform plan review and approval. Reduces 2,500 manual updates/year to 50 automated PRs.
2. Architecture Overview
┌────────────────────────────────────────────────────────────────────┐
│ Layer 1: tf-common (Shared Foundation) │
├────────────────────────────────────────────────────────────────────┤
│ • Provider Config • Naming Conventions │
│ • VPC/Subnet Data Sources • Platform Info Provider │
└──────────────────┬─────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────┐
│ Layer 2: tf-default (Account-Level) │
├────────────────────────────────────────────────────────────────────┤
│ • KMS Infrastructure Key • S3 Code/Logging Buckets │
│ • IAM Admin Roles • CloudTrail Data │
└──────────────────┬────────────────┬────────────────────────────────┘
│ │
│ (Shared KMS) │ (Code Storage)
▼ ▼
┌────────────────────────────────────────────────────────────────────┐
│ Layer 3: tf-project (Application-Level) │
├────────────────────────────────────────────────────────────────────┤
│ • KMS Data Key • S3 Data Buckets │
│ • Lambda/Glue/Fargate • RDS/Redshift/DynamoDB │
└────────────────────────────────────────────────────────────────────┘
The Three-Layer System
Layer 1: tf-common (Shared Foundation)
- Provider configuration
- Naming conventions and context management
- Shared data sources (VPC, subnets, IAM roles)
- Platform Information Provider (PIP) integration
- Used by ALL workloads (updated centrally)
Layer 2: tf-default (Account-Level Resources)
- S3 code/logging buckets
- KMS infrastructure keys
- Lake Formation settings
- IAM admin roles
- CloudTrail data logging
- Deployed ONCE per AWS account
Layer 3: tf-project (Application Resources)
- S3 data buckets
- Lambda functions, Glue jobs
- RDS, Redshift, DynamoDB databases
- Fargate containers
- Application-specific KMS keys
- Deployed MULTIPLE times per account (one per workload)
Composition via Symlinks:
examples/my-workload/
├── _data.tf # User-owned: environment config
├── _project.auto.tfvars # User-owned: workload definition
├── managed_by_dp_common_*.tf -> ../../tf-common/terraform/
├── managed_by_dp_default_*.tf -> ../../tf-default/terraform/
└── managed_by_dp_project_*.tf -> ../../tf-project/terraform/
This creates a complete, runnable Terraform project where terraform plan/apply work directly.
3. The Smart Lookup Innovation
The Core Concept
Traditional Terraform:
lambda_functions = {
processor = {
environment = {
BUCKET = "arn:aws:s3:::company-prod-data-raw-bucket-a1b2c3"
}
policy_json = jsonencode({
Statement = [{
Effect = "Allow"
Action = ["s3:GetObject", "s3:PutObject"]
Resource = "arn:aws:s3:::company-prod-data-raw-bucket-a1b2c3/*"
}]
})
}
}
With Smart Lookups:
s3_buckets = {
raw_data = { name = "raw" }
}
lambda_functions = {
processor = {
environment = {
BUCKET = "s3:raw_data" # Resolves to bucket name
}
permissions = {
s3_read = ["raw_data"] # Resolves to full ARN + generates IAM policy
}
}
}
How It Works: Pure Terraform Magic
Location: tf-project/terraform/managed_by_dp_project_lookup.tf
Step 1: Build Lookup Maps
The system creates hierarchical lookup maps after resources are created:
lookup_arn_base = merge(var.lookup_arns, {
"s3_read" = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].arn }
"s3_write" = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].arn }
"gluejob" = { for item in keys(var.glue_jobs) : item => module.glue_jobs[item].arn }
"secret_read" = { for item in keys(var.secrets) : item => module.secrets[item].arn }
"dynamodb_read" = { for item in keys(var.dynamodb_databases) : item => module.dynamodb[item].arn }
})
lookup_id_base = merge(var.lookup_ids, {
"s3" = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].id }
"secret" = { for item in keys(var.secrets) : item => module.secrets[item].id }
"dynamodb" = { for item in keys(var.dynamodb_databases) : item => module.dynamodb[item].name }
})
Step 2: Resolve References Dynamically
In building block modules (e.g., managed_by_dp_project_lambda.tf):
module "lambda" {
for_each = var.lambda_functions
# Environment variables with smart lookup
environments = {
for type, item in try(each.value.environment, {}) : type =>
try(
local.lookup_id_lambda[split(":", item)[0]][split(":", item)[1]],
item # Fallback to literal value if not a lookup
)
}
# Permissions with smart lookup
permissions = {
for type, items in try(each.value.permissions, {}) : type => [
for item in items :
(
length(split(":", item)) == 2 # Check if it's "type:name" format
? try(
local.lookup_perm_lambda[split(":", item)[0]][split(":", item)[1]],
item
)
: try(
local.lookup_perm_lambda[type][item], # Infer type from permission category
item
)
)
]
}
}
The Magic:
- split(":", "s3:mybucket") → ["s3", "mybucket"]
- local.lookup_id_lambda["s3"]["mybucket"] → actual bucket name
- local.lookup_perm_lambda["s3_read"]["mybucket"] → actual bucket ARN
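The same expression can be exercised in isolation. Here is a minimal, self-contained sketch (the map contents, file name, and references are illustrative) that you can drop into an empty directory and inspect with terraform plan or terraform console:
# lookup_sketch.tf (hypothetical file name)
locals {
  # Illustrative lookup table; the framework builds these from module outputs.
  lookup_id = {
    s3 = { raw = "companyp-analytics-etl-raw" }
  }

  # One value is a lookup reference, the other a literal that should pass through untouched.
  refs = {
    BUCKET = "s3:raw"
    STAGE  = "prod"
  }

  resolved = {
    for key, value in local.refs : key =>
    try(local.lookup_id[split(":", value)[0]][split(":", value)[1]], value)
  }
}

output "resolved" {
  # { BUCKET = "companyp-analytics-etl-raw", STAGE = "prod" }
  value = local.resolved
}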
Step 3: Building Blocks Generate IAM Policies
Building block modules (from Terraform Cloud private registry) automatically generate IAM policies:
module "lambda" {
source = "app.terraform.io/org/buildingblock-lambda/aws"
version = "3.2.0"
permissions = {
s3_read = ["arn:aws:s3:::bucket1", "arn:aws:s3:::bucket2"]
}
create_policy = true # Automatically generates IAM role + policy
}
Inside the building block, it generates:
data "aws_iam_policy_document" "lambda" {
statement {
sid = "S3Read"
effect = "Allow"
actions = ["s3:GetObject*", "s3:GetBucket*", "s3:List*"]
resources = flatten([
var.permissions.s3_read,
[for arn in var.permissions.s3_read : "${arn}/*"]
])
}
}
Supported Lookup Types
For Environment Variables (IDs/Names):
- s3:bucket_name → S3 bucket name
- secret:secret_name → Secrets Manager secret ID
- dynamodb:table_name → DynamoDB table name
- athena:workgroup_name → Athena workgroup name
- prefix:suffix → Injects naming prefix + suffix
For Permissions (ARNs):
- s3_read:bucket / s3_write:bucket → S3 bucket ARN
- gluejob:job_name → Glue job ARN
- gluedb:database_name → Glue database name
- secret_read:secret_name → Secrets Manager ARN
- dynamodb_read:table / dynamodb_write:table → DynamoDB ARN
- sqs_read:queue / sqs_send:queue → SQS queue ARN
- sns_pub:topic → SNS topic ARN
Cross-Account References:
- acct_prod_glue_tables → All Glue tables in production account
- acct_dev_kms_all_keys → All KMS keys in dev account
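For example, a single function definition can mix several of these lookup types; the resource keys below are illustrative:
lambda_functions = {
  reporter = {
    name    = "reporter"
    runtime = "python3.13"
    environment = {
      TABLE     = "dynamodb:events"  # resolves to the table name
      SECRET_ID = "secret:api_key"   # resolves to the secret ID
      S3_PREFIX = "prefix:reports"   # injects the naming prefix + suffix
    }
    permissions = {
      s3_read       = ["raw_data"]   # type inferred from the key, resolves to bucket ARNs
      dynamodb_read = ["events"]
      sqs_send      = ["notifications"]
      kms           = ["kms_data"]   # the workload's data key ARN
    }
  }
}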
Team tfvars Lookup Tables Building Block AWS Resources
│ │ │ │
│ environment = │ │ │
│ {BUCKET="s3:raw"} │ │ │
├────────────────────>│ │ │
│ │ split(":", "s3:raw")│ │
│ │ → ["s3", "raw"] │ │
│ │ │ │
│ │ lookup_id_lambda │ │
│ │ ["s3"]["raw"] → │ │
│ │ "company...-raw" │ │
│ ├────────────────────>│ │
│ │ resolved name │ │
│ │ │ Create Lambda with │
│ │ │ env BUCKET= │
│ │ │ "company...-raw" │
│ │ ├────────────────────>│
│ │ │ │
│ permissions = │ │ │
│ {s3_read=["raw"]} │ │ │
├────────────────────>│ │ │
│ │ lookup_perm_lambda │ │
│ │ ["s3_read"]["raw"] │ │
│ │ → arn:aws:s3:::... │ │
│ ├────────────────────>│ │
│ │ resolved ARN │ │
│ │ │ Generate IAM policy │
│ │ │ with S3 read actions│
│ │ │ │
│ │ │ Attach policy to │
│ │ │ Lambda role │
│ │ ├────────────────────>│
TL;DR - Section 3: Smart lookups use colon syntax (s3:bucket_name) resolved via native Terraform split() and lookup maps. No preprocessing: pure Terraform expressions. Lookup tables are built after resources are created, then referenced by building blocks to resolve environment variables (IDs) and permissions (ARNs). Building blocks auto-generate IAM policies from the resolved ARNs.
4. Building Block Abstraction
The Philosophy
Building blocks are opinionated Terraform modules that:
- Enforce organizational standards (naming, tagging, encryption)
- Abstract AWS complexity (IAM policies, VPC configuration)
- Provide guardrails (prevent common misconfigurations)
- Enable least-privilege by default (automatic policy generation)
Example: S3 Building Block
User Configuration (tfvars):
s3_buckets = {
raw_data = {
name = "raw"
backup = true
enable_intelligent_tiering = true
}
processed = {
name = "processed"
lifecycle_rules = [{
id = "archive_old_data"
transition_days = 90
storage_class = "GLACIER"
}]
}
}
What the Building Block Does:
module "s3_buckets" {
source = "app.terraform.io/org/buildingblock-s3/aws"
version = "2.1.3"
for_each = var.s3_buckets
# Standardized naming: <prefix>-<workload>-<application>-<name>
prefix = local.prefix # e.g., "companyp" (company + production)
context = local.context # {Env: "prd", Workload: "analytics", Application: "etl"}
name = try(each.value.name, each.key)
# Automatic encryption with workload KMS key
kms_key_arn = local.kms_data_key_arn
# Standardized tags (injected automatically)
# Tags include: Env, Workload, Application, Team, CostCenter, Backup
# Security defaults
block_public_access = true
versioning_enabled = true
# User-specified configuration
backup = each.value.backup
lifecycle_rules = try(each.value.lifecycle_rules, [])
enable_intelligent_tiering = try(each.value.enable_intelligent_tiering, false)
}
Generated Resources:
- S3 bucket with predictable name: companyp-analytics-etl-raw
- KMS encryption enabled automatically
- Bucket policy restricting to VPC endpoints
- CloudWatch alarms for bucket size
- Backup plan (if backup = true)
- All organizational tags applied
Example: Lambda Building Block
User Configuration:
lambda_functions = {
data_processor = {
name = "processor"
handler = "index.handler"
runtime = "python3.13"
memory = 1024
timeout = 300
s3_sourcefile = "s3_file:lambda_processor.zip"
environment = {
INPUT_BUCKET = "s3:raw_data"
OUTPUT_BUCKET = "s3:processed"
SECRET_ID = "secret:db_creds"
}
permissions = {
s3_read = ["raw_data"]
s3_write = ["processed"]
secret_read = ["db_creds"]
}
}
}
What the Building Block Does:
- Creates Lambda function with standardized name
- Generates IAM role automatically
- Generates IAM policy from
permissionsmap - Applies permission boundary (security compliance)
- Injects VPC configuration (subnet IDs, security groups)
- Resolves environment variables via lookup tables
- Adds CloudWatch log group with retention policy
- Applies X-Ray tracing
- Adds all organizational tags
Generated IAM Policy (automatically):
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "S3Read",
"Effect": "Allow",
"Action": ["s3:GetObject*", "s3:GetBucket*", "s3:List*"],
"Resource": [
"arn:aws:s3:::companyprd-analytics-etl-raw",
"arn:aws:s3:::companyprd-analytics-etl-raw/*"
]
},
{
"Sid": "S3Write",
"Effect": "Allow",
"Action": ["s3:PutObject*", "s3:DeleteObject*"],
"Resource": [
"arn:aws:s3:::companyprd-analytics-etl-processed",
"arn:aws:s3:::companyprd-analytics-etl-processed/*"
]
},
{
"Sid": "SecretRead",
"Effect": "Allow",
"Action": ["secretsmanager:GetSecretValue"],
"Resource": "arn:aws:secretsmanager:eu-central-1:123456789012:secret:companyprd-analytics-etl-db_creds-a1b2c3"
},
{
"Sid": "KMSDecrypt",
"Effect": "Allow",
"Action": ["kms:Decrypt"],
"Resource": "arn:aws:kms:eu-central-1:123456789012:key/abcd1234-..."
}
]
}
5. Dual KMS Key Architecture with Tag-Based Permissions
One of the most elegant security features of this framework is its dual KMS key architecture that balances security isolation with operational flexibility.
The Two-Key System
KMS Infrastructure Key (kms-infra)
- Scope: One per AWS account (shared across all workloads in that account)
- Location: Created in tf-default (account-level)
- Purpose: Encrypts infrastructure resources (CloudWatch Logs, Secrets Manager, SNS, CloudTrail)
- Naming: ${prefix}-${workload}-kms-infra
- Example: companyp-analytics-kms-infra
KMS Data Key (kms-data)
- Scope: One per workload (isolated per application)
- Location: Created in tf-project (application-level)
- Purpose: Encrypts data resources (S3 buckets, RDS, DynamoDB, Redshift)
- Naming: ${prefix}-${workload}-${application}-kms-data
- Example: companyp-analytics-etl-kms-data
Why Two Keys?
Security Isolation:
- Data keys are isolated per workload
- Compromising one workload's data key doesn't expose other workloads' data
- Infrastructure key is shared for operational resources that need account-wide access
Operational Flexibility:
- Infrastructure key allows CloudWatch, monitoring, and logging to work across workloads
- AWS services (Secrets Manager, CloudTrail) can use a single key for account-level operations
- Data keys remain tightly scoped to application resources
Cost Optimization:
- Infrastructure resources share one key (CloudWatch logs from many workloads)
- Only data resources (S3, databases) need separate keys per workload
Tag-Based Permissions: The Magic Sauce
Instead of explicitly listing every IAM role in the KMS key policy (which creates circular dependencies), the infrastructure key uses tag-based permissions:
Implementation in managed_by_dp_common_kms_infra.tf:
module "kms_infrastructure" {
source = "terraform-aws-modules/kms/aws"
create = local.default_deploy # Only in default/account deployment
aliases = ["${local.prefix}-${local.context.Workload}-kms-infra"]
key_statements = [
{
sid = "tag-workload"
principals = [{
type = "AWS"
identifiers = ["arn:aws:iam::${account_id}:root"]
}]
actions = [
"kms:Encrypt*",
"kms:Decrypt*",
"kms:ReEncrypt*",
"kms:GenerateDataKey*",
]
resources = ["*"]
# The key condition: any role with matching Workload tag can use this key
conditions = [{
test = "StringEquals"
variable = "aws:PrincipalTag/Workload"
values = [local.context.Workload]
}]
}
]
}
How It Works:
- Every IAM role created by building blocks gets tagged automatically:
  # Lambda IAM role
  tags = {
    Workload    = "analytics"
    Application = "etl"
    Env         = "prd"
  }
- KMS key policy allows any role with a matching Workload tag:
  - If the role has tag Workload = "analytics"
  - And the KMS key is for workload analytics
  - Then the role can use the key automatically
- No circular dependencies:
  - The KMS key doesn't need to know about Lambda roles
  - Lambda roles don't need to be in the KMS key policy
  - Tag matching happens at runtime by AWS IAM
Data Key: Explicit Role Lists
The data key uses a different approach with explicit role lists (avoiding circular dependencies through selective inclusion):
Implementation in managed_by_dp_project_kms_data.tf:
module "kms_data" {
source = "app.terraform.io/org/buildingblock-kms-data/aws"
key_administrators = local.kms_admins
key_users = compact(concat(local.kms_data_key_users, var.kms_data["extra_roles"]))
# Tag-based access for roles with matching tags
key_user_tag_map = {
"Workload" = local.context.Workload
"Application" = local.context.Application
"Env" = local.context.Env
}
}
In managed_by_dp_project_locals.tf:
kms_data_key_users = compact(concat(
# Admin roles (explicitly listed)
["arn:aws:iam::${account_id}:role/${var.role_prefix}-${local.prefix}-DpAdminRole"],
[local.operatorrole_arn],
local.transfer_roles,
local.workflow_roles,
# Lambda, Glue, Fargate roles are NOT listed here (would cause cycles)
# Instead, they're granted access via tag-based permissions
# See comments in code explaining the circular dependency:
# [for job in var.glue_jobs : "arn:aws:iam::..."], # CYCLO ERROR!
# [for function in var.lambda_functions : "arn:aws:iam::..."], # CYCLO ERROR!
))
The data key also supports tag-based access through key_user_tag_map, allowing Lambda/Glue/Fargate roles to access it via their tags without being explicitly listed in the policy.
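Inside the data key building block, that key_user_tag_map plausibly turns into a key policy statement along these lines. This is a sketch of the idea in the same key_statements shape used earlier, not the module's actual internals; local.account_id, var.key_user_tag_map, and the module name are assumptions:
module "kms_data_key_policy_sketch" {
  source = "terraform-aws-modules/kms/aws"

  aliases = ["${local.prefix}-${local.context.Workload}-${local.context.Application}-kms-data"]

  key_statements = [
    {
      sid = "tag-based-users"
      principals = [{
        type        = "AWS"
        identifiers = ["arn:aws:iam::${local.account_id}:root"]
      }]
      actions   = ["kms:Encrypt*", "kms:Decrypt*", "kms:ReEncrypt*", "kms:GenerateDataKey*"]
      resources = ["*"]
      # One condition per tag: Workload, Application, and Env must all match.
      conditions = [
        for tag, value in var.key_user_tag_map : {
          test     = "StringEquals"
          variable = "aws:PrincipalTag/${tag}"
          values   = [value]
        }
      ]
    }
  ]
}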
Practical Example
Scenario: Lambda function needs to:
- Read encrypted S3 data (data key)
- Write to CloudWatch Logs (infra key)
- Access Secrets Manager secret (infra key)
What Happens:
- Lambda IAM role is created with tags:
resource "aws_iam_role" "lambda" {
  name = "app-companyp-analytics-etl-lambda-processor"
  tags = {
    Workload    = "analytics"
    Application = "etl"
    Env         = "prd"
  }
}
- Lambda can use the infrastructure key because:
  - The role has tag Workload = "analytics"
  - The KMS infra key checks aws:PrincipalTag/Workload == "analytics" ✓
  - Access is granted for CloudWatch Logs and Secrets Manager
- Lambda can use the data key because:
  - The role has tags Workload = "analytics" AND Application = "etl" AND Env = "prd"
  - The KMS data key checks that all three tags match ✓
  - Access is granted for S3 data encryption/decryption
- Lambda CANNOT use another workload's data key:
  - The role has Application = "etl"
  - The other workload's data key requires Application = "reporting"
  - Tag mismatch ✗
  - Access denied
Benefits of This Architecture
1. Automatic Compliance:
- Every resource is encrypted (mandatory KMS keys injected by building blocks)
- No way to accidentally create unencrypted resources
2. Zero-Touch Security:
- Developers never manage KMS permissions manually
- Building blocks inject the correct KMS key ARN automatically
- Tag propagation handles access control
3. Workload Isolation:
- Data from different applications is cryptographically separated
- Even with compromised IAM credentials, cross-workload data access is prevented
4. Solves Circular Dependencies:
- KMS keys don't reference IAM roles directly
- IAM roles don't need to be created before KMS keys
- Tag-based conditions evaluated at runtime
5. Audit Trail:
- CloudTrail logs show which role (with which tags) accessed which KMS key
- Security teams can verify tag-based access patterns
- Compliance reports show encryption coverage
Service-Specific Access
The infrastructure key also includes service-specific statements for AWS services:
CloudWatch Logs:
{
sid = "logs"
principals = [{ type = "Service", identifiers = ["logs.amazonaws.com"] }]
actions = ["kms:Encrypt*", "kms:Decrypt*", "kms:GenerateDataKey*"]
conditions = [{
test = "ArnEquals"
variable = "kms:EncryptionContext:aws:logs:arn"
values = ["arn:aws:logs:${region}:${account}:log-group:*"]
}]
}
Secrets Manager:
{
sid = "auto-secretsmanager"
principals = [{ type = "Service", identifiers = ["secretsmanager.amazonaws.com"] }]
actions = ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey"]
conditions = [
{ test = "StringEquals", variable = "kms:ViaService",
values = ["secretsmanager.${region}.amazonaws.com"] },
{ test = "StringEquals", variable = "kms:CallerAccount", values = ["${account}"] }
]
}
CloudTrail, SNS, EventBridge:
Similar service-specific statements allow these AWS services to use the infrastructure key for their operations.
Lookup References
Both keys are available via smart lookups:
# In Lambda/Glue/Fargate tfvars - use data key for data encryption
permissions = {
kms = ["kms_data"] # Resolves to workload's data key ARN
}
# Infrastructure key is injected automatically by building blocks
# (for CloudWatch Logs, environment variable encryption, etc.)
Summary
The dual KMS key architecture demonstrates how thoughtful design can achieve:
- Security: Strong encryption and workload isolation
- Developer Experience: Zero manual KMS management
- Operational Simplicity: Tag-based permissions eliminate complexity
- Compliance: Automatic encryption enforcement across all resources
This pattern is a cornerstone of the framework's security model and showcases how infrastructure abstractions can enhance rather than compromise security posture.
┌──────────────────────────────────────────────────────────────────┐
│ KMS Infrastructure Key (Account-Level) │
├──────────────────────────────────────────────────────────────────┤
│ • One Key Per Account │
│ • Encrypts: CloudWatch Logs, Secrets Manager, SNS, CloudTrail │
│ • Tag-Based Access: Workload Tag │
└────────────────────────────────┬─────────────────────────────────┘
│
│ (Tag Match: Workload)
│
┌──────────┴──────────┐
│ │
│ Lambda Role │
│ Tagged with: │
│ • Workload=analytics│
│ • Application=etl │
│ • Env=prd │
│ │
└──────────┬──────────┘
│
│ (Tag Match: All 3 Tags)
│
┌────────────────────────────────▼─────────────────────────────────┐
│ KMS Data Key (Workload-Level) │
├──────────────────────────────────────────────────────────────────┤
│ • One Key Per Workload │
│ • Encrypts: S3, RDS, DynamoDB, Redshift │
│ • Tag-Based Access: Workload + Application + Env │
└──────────────────────────────────────────────────────────────────┘
TL;DR - Section 5: Dual KMS architecture uses one shared infrastructure key per account (CloudWatch, Secrets Manager) and one data key per workload (S3, databases). Tag-based permissions solve circular dependencies: IAM roles tagged with Workload/Application/Env automatically gain KMS access without being explicitly listed in policies. The infrastructure key checks one tag; the data key checks three tags for stronger isolation.
6. Naming Conventions and Context Propagation
The Context System
Input: Tags Module
Every workload defines a tags module:
module "tags" {
source = "app.terraform.io/org/tags/aws"
version = "~> 1.0.0"
environment = "prd"
workload = "analytics"
application = "etl"
team = "data-engineering@company.com"
costcenter = "12345"
backup = "Daily"
}
Output: Context Map
context = merge(module.tags.tags, var.context)
# Result: {
# Env: "prd",
# Workload: "analytics",
# Application: "etl",
# Team: "data-engineering@company.com",
# CostCenter: "12345",
# Backup: "Daily"
# }
Prefix Generation
prefix = "company${substr(local.context.Env, 0, 1)}"
# prd → companyp
# sbx → companys
# dev → companyd
Resource Naming Pattern
${prefix}-${workload}-${application}-${resource_name}
Examples:
- S3 bucket: companyp-analytics-etl-raw
- Lambda: companyp-analytics-etl-processor
- Glue job: companyp-analytics-etl-transform
- IAM role: companyp-analytics-etl-lambda-processor-role
Benefits:
- Predictable: Resources can be referenced before creation
- Discoverable: Name reveals environment, workload, and purpose
- Compliant: Meets organizational naming standards
- Unique: Prevents naming collisions across teams
7. Circular Dependency Resolution Strategies
The Challenge
Terraform dependency graph requires acyclic relationships, but real-world infrastructure often has circular references:
- Lambda needs IAM role ARN
- IAM role policy needs Lambda ARN for trust policy
- KMS key policy needs Lambda role ARN
- Lambda needs KMS key ARN for environment variables
Strategy 1: Predictive Naming
Example: Redshift Lookup
# Can't use module.redshift[item].name because it creates a cycle
# CYCLO ERROR! comment in code
"redshift_data" = {
for item in keys(var.redshift_databases) :
item => join("-", [
local.prefix,
local.context.Workload,
local.context.Application,
item
])
}
Instead of referencing the module output (which creates a dependency), predict the name using the same naming convention.
Strategy 2: Two-Phase Deployment
From DEPLOY.md:
"First Terraform apply will fail on a few dependencies. Re-run to finalize."
Some circular dependencies are resolved by applying twice:
- First apply creates base resources
- Some resources fail due to missing dependencies
- Second apply completes configuration
Strategy 3: Selective KMS Key Users
kms_data_key_users = compact(concat(
["arn:aws:iam::${account_id}:role/${var.role_prefix}-${local.prefix}-DpAdminRole"],
[local.operatorrole_arn],
local.transfer_roles,
local.workflow_roles,
# These would create cycles - commented out:
# [for job in var.glue_jobs : "arn:aws:iam::..."],
# [for function in var.lambda_functions : "arn:aws:iam::..."],
))
KMS key policies include predictable roles (admin, operator) but NOT Lambda/Glue roles to avoid cycles.
Strategy 4: Data Source Lookups (Cross-Workload)
When project workloads need resources from the default workload:
locals {
  default_deploy = fileexists("${path.module}/managed_by_dp_default_s3_code.tf")
}

data "aws_kms_key" "kms_infrastructure" {
  count  = local.default_deploy ? 0 : 1
  key_id = "alias/${local.prefix}-${local.context.Workload}-kms-infra"
}

locals {
  kms_infrastructure_key_arn = coalesce(
    module.kms_infrastructure.key_arn,              # if default deploy
    one(data.aws_kms_key.kms_infrastructure[*].arn) # if project deploy
  )
}
Project workloads use data sources to look up infrastructure key by predictable alias.
8. Real-World Example: Data Pipeline Workload
Scenario
Build a data pipeline that:
- Ingests raw CSV files from external S3 bucket
- Processes files with Lambda function
- Transforms data with Glue ETL job
- Stores in Redshift for analytics
- Shares Glue catalog with data governance account
Configuration (tfvars)
# Define S3 buckets
s3_buckets = {
raw = {
name = "raw"
backup = true
lifecycle_rules = [{
id = "archive_old"
transition_days = 90
storage_class = "GLACIER"
}]
}
processed = {
name = "processed"
enable_intelligent_tiering = true
}
}
# Upload Lambda code
s3_source_files = {
processor_code = {
source = "lambda_processor.zip"
target = "lambda_functions/processor/code.zip"
}
glue_script = {
source = "transform.py"
target = "glue_jobs/transform/script.py"
}
}
# Define secrets
secrets = {
redshift_creds = {
name = "redshift-credentials"
secret_string = {
username = "admin"
password = "changeme" # Should use AWS Secrets Manager UI to set
}
}
}
# Define Glue database
glue_database = {
analytics = {
name = "analytics"
bucket = "s3:processed"
enable_lakeformation = true
share_cross_account_ro = ["datagovernance"]
}
}
# Define Lambda processor
lambda_functions = {
csv_processor = {
name = "csv-processor"
description = "Processes incoming CSV files"
handler = "index.handler"
runtime = "python3.13"
memory = 2048
timeout = 900
s3_sourcefile = "s3_file:processor_code"
environment = {
RAW_BUCKET = "s3:raw"
PROCESSED_BUCKET = "s3:processed"
GLUE_DATABASE = "gluedb:analytics"
}
permissions = {
s3_read = ["raw"]
s3_write = ["processed"]
glue_update = ["analytics"]
}
# S3 trigger
event_source_mapping = [{
event_source_arn = "s3:raw"
events = ["s3:ObjectCreated:*"]
filter_prefix = "incoming/"
filter_suffix = ".csv"
}]
}
}
# Define Glue ETL job
glue_jobs = {
transform = {
name = "data-transform"
glue_version = "4.0"
worker_type = "G.1X"
number_of_workers = 5
script_location = "s3_file:glue_script"
arguments = {
"--DATABASE" = "gluedb:analytics"
"--INPUT_BUCKET" = "s3:processed"
"--REDSHIFT_SECRET" = "secret:redshift_creds"
}
permissions = {
s3_read = ["processed"]
glue_update = ["analytics"]
secret_read = ["redshift_creds"]
redshift = ["analytics_cluster"]
}
# Scheduled trigger
trigger_type = "SCHEDULED"
schedule = "cron(0 2 * * ? *)" # Daily at 2 AM
}
}
# Define Redshift cluster
redshift_databases = {
analytics_cluster = {
name = "analytics"
node_type = "dc2.large"
number_of_nodes = 2
master_username = "admin"
secret_name = "secret:redshift_creds"
permissions = {
glue_read = ["analytics"]
s3_read = ["processed"]
}
}
}
What Gets Created (40+ AWS Resources)
Infrastructure:
- KMS data key for encryption
- VPC security groups for Lambda/Glue
- IAM roles (5): Lambda role, Glue role, Redshift role, Lake Formation role, Admin role
- IAM policies (5): Auto-generated least-privilege policies
- Permission boundaries (2): For Lambda and Glue roles
Storage:
- S3 bucket: companyp-analytics-pipeline-raw
- S3 bucket: companyp-analytics-pipeline-processed
- S3 bucket policies (2)
- S3 lifecycle rules
- S3 intelligent tiering configuration
Compute:
- Lambda function: companyp-analytics-pipeline-csv-processor
- Lambda log group with 30-day retention
- S3 event notification trigger
- Glue job: companyp-analytics-pipeline-data-transform
- Glue security configuration
- Glue CloudWatch log group
Data Catalog:
- Glue database: companyp-analytics-pipeline-analytics
- Lake Formation permissions
- Lake Formation resource link (cross-account share)
- RAM resource share (for cross-account access)
Database:
- Redshift cluster: companyp-analytics-pipeline-analytics
- Redshift subnet group
- Redshift parameter group
- Redshift security group
- Secrets Manager secret: companyp-analytics-pipeline-redshift-credentials
- Secret rotation configuration
Monitoring:
- CloudWatch alarms (6): Lambda errors, Glue job failures, S3 metrics
- CloudWatch log groups (3)
- EventBridge rule for Glue job schedule
All with:
- Consistent naming
- Full encryption (KMS)
- Least-privilege IAM policies
- Organizational tags
- VPC isolation
- CloudWatch logging
Total Configuration: ~150 lines of tfvars
Generated Terraform Code: ~2000+ lines (via building blocks)
Boilerplate Reduction: ~93%
┌──────────────────────────────────┐
│ S3 Buckets │
│ ┌────────┐ ┌────────┐ │
│ │ raw │ │processed│ │
│ └───┬────┘ └────▲───┘ │
└──────┼────────────────┼──────────┘
│ │
S3 Event │ │
Trigger │ │ Writes
│ │
┌──────▼────────────────┴──────────┐
│ Lambda │
│ ┌─────────────────────┐ │
│ │ csv-processor │ │
│ └──────────┬──────────┘ │
└─────────────┼────────────────────┘
│
│ Updates
│
┌─────────────────────────▼───────────────────────┐
│ Glue │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Database: │◄───│ ETL Job: │ │
│ │ analytics │ │ transform │ │
│ └────────▲─────────┘ └────┬─────────────┘ │
└───────────┼──────────────────┼─────────────────┘
│ │
│ Queries │ Loads
│ │
┌───────────┴──────────────────▼─────────────────┐
│ Redshift │
│ ┌─────────────────────┐ │
│ │ Cluster: analytics │ │
│ └──────────┬──────────┘ │
└─────────────┼────────────────────────────────────┘
│
│ Reads
│
┌─────────────▼────────────────────────────────────┐
│ Secrets Manager │
│ ┌─────────────────────────┐ │
│ │ redshift-credentials │ │
│ └─────────────────────────┘ │
└──────────────────────────────────────────────────┘
TL;DR - Section 8: A real-world data pipeline example shows how 150 lines of tfvars configuration generates 40+ AWS resources (S3, Lambda, Glue, Redshift, KMS, IAM, CloudWatch). Smart lookups connect resources (s3:raw, secret:db_creds), building blocks auto-generate IAM policies, the context system applies consistent naming/tagging, and KMS keys encrypt everything automatically. Achieves 93% boilerplate reduction vs. traditional Terraform.
9. Cross-Account Architecture
Use Case: Multi-Account Data Mesh
Scenario: Analytics workload in Production account needs to:
- Read S3 data from Development account
- Query Glue tables from Staging account
- Use KMS keys from Shared Services account
Configuration
Step 1: Define Cross-Account Aliases
cross_accounts = {
dev = "123456789012"
staging = "234567890123"
shared = "345678901234"
}
Step 2: Define External S3 Buckets
lookup_ids = {
xa_s3_bucket = {
dev_raw = "dev-shared-raw-data"
staging_processed = "staging-shared-processed"
}
}
Step 3: Use Cross-Account Lookups
lambda_functions = {
cross_account_reader = {
name = "reader"
permissions = {
# Read from external S3 buckets
s3_read = ["dev_raw", "staging_processed"]
# Query Glue tables in staging account
glue_read = ["acct_staging_glue_tables"]
# Use KMS keys in shared account
kms = ["acct_shared_kms_all_keys"]
}
}
}
Generated IAM Policy
{
"Statement": [
{
"Sid": "S3ReadCrossAccount",
"Effect": "Allow",
"Action": ["s3:GetObject*", "s3:GetBucket*", "s3:List*"],
"Resource": [
"arn:aws:s3:::dev-shared-raw-data",
"arn:aws:s3:::dev-shared-raw-data/*",
"arn:aws:s3:::staging-shared-processed",
"arn:aws:s3:::staging-shared-processed/*"
]
},
{
"Sid": "GlueReadCrossAccount",
"Effect": "Allow",
"Action": ["glue:GetTable", "glue:GetTables", "glue:GetDatabase"],
"Resource": "arn:aws:glue:*:234567890123:table/*"
},
{
"Sid": "KMSCrossAccount",
"Effect": "Allow",
"Action": ["kms:Decrypt", "kms:DescribeKey"],
"Resource": "arn:aws:kms:eu-central-1:345678901234:key/*"
}
]
}
Benefits:
- Developers don't need to know account IDs
- Cross-account permissions follow same pattern as same-account
- Centralized account alias management
- Type-safe (Terraform validates references at plan time)
10. Deployment Workflow
Repository Structure
Blueprint Repository (Central):
terraform-platform-blueprint/
├── tf-common/ # Shared foundation
├── tf-default/ # Account-level resources
├── tf-project/ # Application resources
├── examples/
│ ├── full_test/ # Complete example
│ └── simple_example/ # Minimal example
└── tools/
└── repo_updater.py # Syncs blueprint to user repos
User Repository (Team-Owned):
team-analytics/
├── terraform/
│ ├── dev/
│ │ ├── tags.tf # Team owns
│ │ ├── _default.auto.tfvars # Team owns
│ │ ├── _project.auto.tfvars # Team owns
│ │ ├── managed_by_dp_common_*.tf # Synced from blueprint
│ │ ├── managed_by_dp_default_*.tf # Synced from blueprint
│ │ └── managed_by_dp_project_*.tf # Synced from blueprint
│ ├── staging/
│ └── production/
└── .github/
└── workflows/
└── terraform.yml
Workflow Steps
Step 1: Team Creates Configuration
Teams edit only their own files:
- tags.tf - Defines environment, workload, application
- _default.auto.tfvars - Account-level config (if first workload)
- _project.auto.tfvars - Application resources
Step 2: Platform Team Updates Blueprint
When blueprint code needs updating:
# In blueprint repo
cd tools
python repo_updater.py --target ../../../team-analytics/terraform/dev
This syncs all managed_by_dp_*.tf files from blueprint to team repo.
Step 3: Team Commits and Pushes
git add .
git commit -m "feat: add data processing pipeline"
git push origin feature/data-pipeline
Step 4: Terraform Cloud Runs
GitHub Action triggers Terraform Cloud:
- Workspace detects the VCS change
- Runs terraform plan
- Shows the plan in a pull request comment
- Team reviews and approves
- Merges the PR
- Terraform Cloud runs terraform apply
Step 5: Resources Created
All AWS resources created with:
- Standardized naming
- Automatic IAM policies
- Full encryption
- Organizational tags
- CloudWatch monitoring
No Preprocessing Required
This workflow uses standard Terraform:
- No build step before terraform plan
- No code generation at runtime
- No wrapper scripts
- Native .tfvars files
- Standard state management
- Compatible with Terraform Cloud, Enterprise, or OSS
Platform Blueprint repo_updater.py Team Repos Terraform Application
Team Repo (50+) Cloud Team
│ │ │ │ │ │
│ Update │ │ │ │ │
│ building │ │ │ │ │
│ blocks │ │ │ │ │
├───────────>│ │ │ │ │
│ │ │ │ │ │
│ git commit │ │ │ │ │
│ & push │ │ │ │ │
├───────────>│ │ │ │ │
│ │ │ │ │ │
│ Run │ │ │ │ │
│ --update- │ │ │ │ │
│ all-teams │ │ │ │ │
├────────────┼─────────────>│ │ │ │
│ │ │ Generate 50 PRs│ │ │
│ │ │ (update │ │ │
│ │ │ managed_by_dp) │ │ │
│ │ ├───────────────>│ │ │
│ │ │ │ PR triggers│ │
│ │ │ │ terraform │ │
│ │ │ │ plan │ │
│ │ │ ├──────────>│ │
│ │ │ │ │ │
│ │ │ │ Post plan │ │
│ │ │ │ as PR │ │
│ │ │ │ comment │ │
│ │ │ │<──────────┤ │
│ │ │ │ │ │
│ │ │ │ │ Review plan│
│ │ │ │<──────────────────────┤
│ │ │ │ │ │
│ │ │ │ Approve & │ │
│ │ │ │ merge PR │ │
│ │ │ │<──────────────────────┤
│ │ │ │ │ │
│ │ │ │ Merge │ │
│ │ │ │ triggers │ │
│ │ │ │ terraform │ │
│ │ │ │ apply │ │
│ │ │ ├──────────>│ │
│ │ │ │ │ │
│ │ │ │ Deploy │ │
│ │ │ │ updated │ │
│ │ │ │ resources │ │
│ │ │ │ │ │
11. Comparison with Other Approaches
vs. Standard Terraform
| Aspect | Standard Terraform | This Framework |
|---|---|---|
| ARN Management | Manual ARN strings | Smart lookups (s3:bucket) |
| IAM Policies | Write JSON/HCL policy documents | Auto-generated from permissions map |
| Naming | Manually ensure consistency | Automatic standardized naming |
| Standards | Manually enforce | Building blocks enforce automatically |
| Cross-references | Direct resource dependencies | Lookup tables (reduces coupling) |
| Boilerplate | High (1000+ lines typical) | Low (150 lines typical) - ~85% reduction |
| Learning Curve | Steep (requires AWS expertise) | Moderate (config-focused) |
vs. Terragrunt
| Aspect | Terragrunt | This Framework |
|---|---|---|
| Preprocessing | Required (terragrunt run) | None (native Terraform) |
| State Management | Separate tool | Native Terraform |
| Compatibility | Wrapper tool required | Standard terraform CLI |
| DRY Approach | File includes & remote state | Lookup tables & modules |
| Complexity | Additional tool layer | Pure Terraform |
| IDE Support | Limited (custom syntax) | Full (standard HCL) |
vs. Terraspace
| Aspect | Terraspace | This Framework |
|---|---|---|
| Language | Ruby DSL + ERB templates | Pure HCL |
| Preprocessing | Required (terraspace build) | None |
| Runtime | Ruby interpreter needed | Native Terraform only |
| Configuration | ERB templating | Native tfvars |
| Tooling | Additional CLI wrapper | Standard Terraform CLI |
| Learning Curve | Learn Ruby + Terraspace | Learn framework conventions |
vs. Terraform CDK
| Aspect | Terraform CDK | This Framework |
|---|---|---|
| Language | TypeScript/Python/Java/C#/Go | Pure HCL |
| Compilation | Required (cdktf synth) | None |
| Runtime | Node.js/Python runtime | Native Terraform only |
| Configuration | Imperative code | Declarative tfvars |
| State Inspection | Via generated JSON | Native Terraform state |
| IDE Support | Language-specific | Terraform-specific |
Key Advantages of This Approach
- No External Dependencies: Pure Terraform, no additional tools
- Native Workflows: Works with Terraform Cloud, Enterprise, OSS
- Type Safety: Terraform validates references at plan time
- Version Control: Standard .tfvars files, readable diffs
- IDE Support: Full support from Terraform plugins
- Learning Curve: Lower (no new language/tool to learn)
- Portability: Standard Terraform state, no lock-in
- Debugging: Standard Terraform error messages and plan output
┌─────────────────────────┐
│ Terraform Approaches │
└────────────┬────────────┘
│
┌───────────┬───────────┼───────────┬───────────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌────────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐
│ Standard │ │Terragrunt│ │Terraspace│ │Terraform │ │ This │
│ Terraform │ │ │ │ │ │ CDK │ │ Framework │
└─────┬──────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └──────┬───────┘
│ │ │ │ │
│Manual ARNs │Wrapper │Ruby DSL │TypeScript/ │Pure HCL
│High │tool │ERB │Python │Smart
│boilerplate │Preprocessing│templates │Compilation │lookups
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐
│ 1000+ │ │terragrunt│ │terraspace│ │ cdktf │ │ 150 │
│ lines/ │ │ run │ │ build │ │ synth │ │ lines/ │
│ workload │ │ required │ │ required │ │ required │ │ workload │
│ │ │ │ │ │ │ │ │ ✓ │
└─────────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────────┘
TL;DR - Section 11: This framework beats alternatives by using pure Terraform with zero preprocessing. Standard Terraform requires manual ARN management (1000+ lines). Terragrunt/Terraspace/CDK add preprocessing layers (wrapper tools, Ruby runtime, Node.js compilation). This approach achieves 85% boilerplate reduction through smart lookups and building blocks while maintaining full Terraform Cloud compatibility and native workflows.
12. Lessons Learned and Best Practices
What Worked Well
1. Colon Syntax is Intuitive
Developers adopted s3:bucket_name syntax immediately. It reads like natural configuration.
2. Building Blocks Enforce Standards
Opinionated modules ensure consistency without policing. Teams can't accidentally create non-compliant resources.
3. Separation of Concerns
Platform team manages managed_by_dp_*.tf files, teams manage *.tfvars files. Clear ownership boundaries.
4. Lookup Tables Reduce Coupling
Resources don't directly reference each other, reducing cascade changes when refactoring.
5. Predictive Naming Solves Most Circular Dependencies
Most cross-resource references can use naming conventions instead of module outputs.
Challenges and Solutions
Challenge 1: Circular Dependencies
Some resource relationships create cycles that Terraform can't resolve.
Solutions:
- Use predictive naming instead of module outputs
- Two-phase deployment (apply twice)
- Selective resource inclusion in policies
- Data sources for cross-workload lookups
Challenge 2: Lookup Complexity
Lookup tables can become large and hard to maintain.
Solutions:
- Organized into logical groups (lookup_perm_lambda, lookup_id_base)
- Inline comments documenting purpose
- Automated generation via for expressions
- Cross-account lookups separated into _xa maps
Challenge 3: Building Block Versioning
Updating building block versions across many teams is coordination-heavy.
Solutions:
- Semantic versioning with ~> constraints
- Deprecation warnings for old versions
- Automated testing of building block changes
- Communication channel for breaking changes
Challenge 4: Developer Onboarding
New developers need to learn lookup syntax and conventions.
Solutions:
- Comprehensive examples in blueprint repo
- Detailed README with common patterns
- IntelliSense/autocomplete via Terraform language server
- Helper scripts to validate tfvars before commit
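Validation can also live in the framework itself. As a sketch (not the framework's actual code), a variable validation block can reject unknown lookup prefixes before plan time; the prefix list is taken from the supported lookup types above:
variable "lambda_functions" {
  type    = any
  default = {}

  validation {
    # Every environment value that contains ":" must start with a known lookup prefix.
    condition = alltrue(flatten([
      for fn in values(var.lambda_functions) : [
        for v in values(try(fn.environment, {})) :
        length(split(":", v)) == 1 || contains(
          ["s3", "secret", "dynamodb", "athena", "prefix", "gluedb", "s3_file"],
          split(":", v)[0]
        )
      ]
    ]))
    error_message = "Unknown lookup prefix in a lambda_functions environment value."
  }
}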
Best Practices
1. Use Descriptive Resource Keys
# Good
s3_buckets = {
raw_customer_data = { ... }
processed_analytics = { ... }
}
# Bad
s3_buckets = {
bucket1 = { ... }
bucket2 = { ... }
}
2. Group Related Resources
# Process: S3 → Lambda → Glue → Redshift
s3_buckets = { raw = {...}, processed = {...} }
lambda_functions = { processor = {...} }
glue_jobs = { transform = {...} }
redshift_databases = { analytics = {...} }
3. Use Comments to Document Intent
# Data pipeline for customer analytics
# Flow: External API → raw bucket → Lambda → processed bucket → Glue → Redshift
lambda_functions = {
api_ingestion = { ... }
}
4. Leverage Type Inference
# Instead of:
permissions = {
s3_read = ["s3_read:raw"]
}
# Prefer (type inferred from key):
permissions = {
s3_read = ["raw"]
}
5. Test in Lower Environments First
dev → staging → production
Use identical tfvars across environments, only changing tags.tf (environment name).
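A sketch of that per-environment difference, reusing the tags module shown in Section 6 (values are illustrative):
# terraform/dev/tags.tf -- identical to production except for the environment value
module "tags" {
  source      = "app.terraform.io/org/tags/aws"
  version     = "~> 1.0.0"
  environment = "dev" # "prd" in terraform/production/tags.tf
  workload    = "analytics"
  application = "etl"
  team        = "data-engineering@company.com"
  costcenter  = "12345"
  backup      = "Daily"
}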
6. Version Pin Building Blocks
# Use pessimistic constraint
source = "app.terraform.io/org/buildingblock-lambda/aws"
version = "~> 3.2.0" # Allows 3.2.x, not 3.3.0
7. Document Cross-Account Access
# Cross-account: Read from Data Lake account
cross_accounts = {
datalake = "123456789012" # Managed by Data Lake team
}
13. Impact and Metrics
Development Velocity Improvements
Before This Framework:
- ~1000 lines of Terraform per workload
- 2-3 weeks to onboard new team
- 5+ days to add new resource type
- Frequent IAM permission errors
- Inconsistent naming across teams
- Manual policy review process
After This Framework:
- ~150 lines of tfvars per workload (85% reduction)
- 2-3 days to onboard new team
- 1 day to add new resource type
- Rare IAM errors (auto-generated policies)
- Consistent naming (automatic)
- Automated policy compliance
Code Quality Improvements
Reduction in Boilerplate:
Traditional approach (S3 + Lambda with IAM):
# ~250 lines for: S3 bucket, IAM role, IAM policy document,
# Lambda function, CloudWatch log group, etc.
This framework (same resources):
# ~30 lines of tfvars
s3_buckets = { data = { name = "data" } }
lambda_functions = {
processor = {
name = "processor"
permissions = { s3_read = ["data"] }
}
}
Boilerplate Reduction: ~88%
Governance and Compliance
Automatic Enforcement:
- 100% of resources use standardized naming
- 100% of resources encrypted with KMS
- 100% of resources tagged per policy
- 100% of IAM policies include permission boundaries
- 100% of Lambda functions in VPC
- 0 manual policy reviews required
Before Framework After Framework
┌────────────────────────────┐ ┌────────────────────────────┐
│ │ │ │
│ • 1000+ lines Terraform │─────>│ • 150 lines tfvars │
│ │ │ (85% reduction) │
│ │ │ │
└────────────────────────────┘ └────────────────────────────┘
┌────────────────────────────┐ ┌────────────────────────────┐
│ │ │ │
│ • 2-3 weeks onboarding │─────>│ • 2-3 days onboarding │
│ │ │ (5x faster) │
│ │ │ │
└────────────────────────────┘ └────────────────────────────┘
┌────────────────────────────┐ ┌────────────────────────────┐
│ │ │ │
│ • Manual IAM policies │─────>│ • Auto-generated IAM │
│ │ │ (Rare errors) │
│ │ │ │
└────────────────────────────┘ └────────────────────────────┘
┌────────────────────────────┐ ┌────────────────────────────┐
│ │ │ │
│ • Inconsistent naming │─────>│ • 100% consistent │
│ │ │ (Automatic compliance) │
│ │ │ │
└────────────────────────────┘ └────────────────────────────┘
TL;DR - Section 13: Framework delivers measurable improvements: 85% boilerplate reduction (1000→150 lines), 5x faster team onboarding (weeks→days), rare IAM errors (auto-generated policies), and 100% compliance (automatic naming, tagging, encryption, permission boundaries). Every resource is encrypted with KMS, tagged per policy, and uses least-privilege IAM—all enforced by building blocks with zero manual reviews.
14. Future Enhancements
Planned Features
1. Multi-Region Support
Enable workloads spanning multiple AWS regions:
regions = ["eu-central-1", "us-east-1"]
s3_buckets = {
replicated_data = {
name = "data"
replication_regions = ["us-east-1"]
}
}
2. Enhanced Lookup Syntax
Support nested lookups:
environment = {
BUCKET_PATH = "s3:mybucket:/path/prefix"
TABLE_COLUMN = "dynamodb:mytable:attribute:id"
}
3. Building Block Customization
Allow team-specific overrides while maintaining compliance:
s3_buckets = {
special = {
name = "special"
override_defaults = {
versioning_enabled = false # Team takes responsibility
}
}
}
4. Cost Estimation
Integrate with AWS Pricing API to estimate costs before apply:
# In plan output:
# Estimated monthly cost: $1,234.56
# - Lambda: $123.45
# - S3: $456.78
# - Redshift: $654.33
5. Dependency Visualization
Generate visual dependency graphs from lookup tables:
S3:raw → Lambda:processor → S3:processed → Glue:transform → Redshift:analytics
Potential Improvements
1. Resolve Two-Phase Deployment
Investigate Terraform's -target flag or module dependencies to eliminate the "apply twice" requirement.
2. Building Block Catalog
Create searchable catalog of building blocks with examples:
- Searchable by AWS service
- Filterable by capability (encryption, backups, monitoring)
- Includes terraform-docs generated documentation
3. Policy Simulation
Pre-validate IAM policies using AWS IAM Policy Simulator before apply:
terraform plan | policy-simulator --validate
4. Drift Detection
Automated drift detection for resources created outside Terraform:
terraform-drift-detector --alert slack://channel
15. Conclusion
Summary
We've built a Native Terraform IaC Framework that achieves the developer experience of high-level abstractions while maintaining 100% compatibility with standard Terraform workflows. The key innovations are:
- Smart Lookup Syntax: Colon-separated references (s3:bucket, lambda:function) resolved via native Terraform expressions
- Building Block Abstraction: Opinionated modules that enforce standards and generate IAM policies automatically
- Zero Preprocessing: Pure Terraform - works with Terraform Cloud, CLI, and all standard tooling
- Clear Separation: Platform team manages code, application teams manage configuration
- Context Propagation: Naming and tagging enforced automatically via context system
Why This Matters
For Platform Engineers:
- Enforce organizational standards without restricting teams
- Reduce support burden (teams self-service)
- Centralized updates via building blocks
- Scalable to hundreds of workloads
For Application Teams:
- Write configuration, not code
- No AWS expertise required
- Fast onboarding (days, not weeks)
- Focus on business logic, not infrastructure
For Organizations:
- Consistent security posture
- Automated compliance
- Cost visibility via standardized tagging
- Reduced risk (guardrails prevent misconfigurations)
Key Takeaways
- Native Terraform is Powerful: With creative use of locals and lookups, you can build sophisticated abstractions without preprocessing
- Configuration Over Code: Separating what (tfvars) from how (modules) reduces complexity
- Building Blocks Scale: Opinionated modules enable governance at scale
- Developer Experience Matters: Investment in ergonomics pays dividends in velocity and adoption
- Standards Enable Freedom: Guardrails paradoxically enable teams to move faster
┌─────────────────────────────────┐
│ Native Terraform Framework │
└──────────────┬──────────────────┘
│
┌───────────────────────┼───────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Smart Lookups │ │ Building Blocks │ │ Separation of │
│ │ │ │ │ Code & Config │
└───────┬───────┘ └────────┬─────────┘ └────────┬─────────┘
│ │ │
│ ┌───────▼────────┐ │
│ │Context │ │
│ │Propagation │ │
│ └───────┬────────┘ │
│ │ │
└────────────────────┼───────────────────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌────────────────┐ ┌─────────────────┐ ┌──────────────────┐
│ 85% Boilerplate│ │ Zero │ │ Automated │
│ Reduction │ │ Preprocessing │ │ Updates at Scale │
└────────┬───────┘ └────────┬────────┘ └─────────┬────────┘
│ │ │
│ ┌───────▼────────┐ │
│ │ 100% │ │
│ │ Compliance │ │
│ └───────┬────────┘ │
│ │ │
└───────────────────┼──────────────────────┘
│
▼
┌──────────────────────┐
│ 50+ Teams Can │
│ Self-Service │
│ Infrastructure │
└──────────────────────┘
TL;DR - Conclusion: This native Terraform framework proves that developer-friendly IaC doesn't require preprocessing or external tools. By combining smart lookups (s3:bucket), opinionated building blocks, configuration/code separation, and context propagation, we achieve 85% boilerplate reduction while maintaining full Terraform Cloud compatibility. Platform teams scale updates via automated PRs, application teams self-service via simple tfvars, and organizations get automatic compliance. Native Terraform can be elegant, scalable, and secure.
16. Getting Started Guide
For teams interested in adopting this approach:
Step 1: Assess Your Needs
Good fit if:
- Multiple teams deploying similar infrastructure
- Need to enforce organizational standards
- Want to reduce AWS expertise requirement
- High volume of infrastructure deployments
Not a good fit if:
- Small team (1-2 people) with custom requirements
- Infrastructure is highly heterogeneous
- Team prefers low abstraction level
Step 2: Start Small
Begin with a pilot:
- Choose one AWS service (e.g., S3)
- Build an opinionated building block module (see the sketch after this list)
- Create lookup mechanism for that service
- Test with one team
- Iterate based on feedback
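The pilot building block can be very small. Here is a minimal sketch of an opinionated S3 module; the variable names, naming scheme, and defaults are illustrative, not the framework's actual interface:
# Minimal pilot S3 building block (illustrative)
variable "prefix" { type = string } # organization/environment naming prefix
variable "name" { type = string }
variable "kms_key_arn" { type = string }
variable "tags" {
  type    = map(string)
  default = {}
}

resource "aws_s3_bucket" "this" {
  bucket = "${var.prefix}-${var.name}" # enforced naming convention
  tags   = var.tags                    # enforced tagging
}

resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.kms_key_arn # encryption is not optional
    }
  }
}

resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

output "id" { value = aws_s3_bucket.this.id }
output "arn" { value = aws_s3_bucket.this.arn }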
Step 3: Build Your Building Blocks
For each AWS service:
- Define organizational standards (naming, tagging, encryption)
- Create Terraform module enforcing standards
- Add permission generation logic (sketched after this list)
- Version and publish to private registry
- Write documentation and examples
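The permission generation logic can start as a simple map from permission categories to IAM actions. This is a sketch under assumptions: the action sets, the variable shapes, and the aws_iam_role.this reference are illustrative, not the framework's actual building block code:
# Hypothetical permission generation inside a building block
variable "create_policy" {
  type    = bool
  default = true
}

variable "permissions" {
  type    = map(list(string)) # e.g. { s3_read = ["arn:aws:s3:::my-bucket"] }, ARNs already resolved by lookups
  default = {}
}

locals {
  permission_actions = {
    s3_read     = ["s3:GetObject", "s3:ListBucket"]
    s3_write    = ["s3:PutObject", "s3:DeleteObject"]
    secret_read = ["secretsmanager:GetSecretValue"]
  }
}

data "aws_iam_policy_document" "generated" {
  dynamic "statement" {
    for_each = { for type, arns in var.permissions : type => arns if length(arns) > 0 }
    content {
      actions = local.permission_actions[statement.key]
      resources = (
        contains(["s3_read", "s3_write"], statement.key)
        ? flatten([for arn in statement.value : [arn, "${arn}/*"]]) # bucket- and object-level access
        : statement.value
      )
    }
  }
}

resource "aws_iam_role_policy" "generated" {
  count  = var.create_policy ? 1 : 0
  role   = aws_iam_role.this.id # the role is assumed to be created elsewhere in the building block
  policy = data.aws_iam_policy_document.generated.json
}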
Step 4: Create Lookup System
- Define lookup syntax (e.g., type:name)
- Create lookup locals maps (see the minimal sketch after this list)
- Add resolution logic to building blocks
- Test cross-resource references
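For the S3-only pilot, the lookup system can stay tiny. A minimal sketch follows (the example reference is illustrative; the full production version appears in Appendix A):
# Pilot-scale lookup map and resolution (illustrative)
locals {
  lookup_id = {
    "s3" = { for k in keys(var.s3_buckets) : k => module.s3_buckets[k].id }
  }
}

# Resolving a "type:name" reference, falling back to the literal value
locals {
  example_ref = "s3:raw_data"
  example_resolved = try(
    local.lookup_id[split(":", local.example_ref)[0]][split(":", local.example_ref)[1]],
    local.example_ref
  )
}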
Step 5: Document and Socialize
- Write comprehensive README
- Create example projects
- Run training sessions
- Set up support channel
- Gather feedback and iterate
Step 6: Scale
- Add more building blocks incrementally
- Onboard teams progressively
- Monitor usage and pain points
- Continuously improve based on feedback
Appendix: Code Samples
A. Lookup Table Implementation
File: tf-project/terraform/managed_by_dp_project_lookup.tf
locals {
# Build base lookup maps for ARNs (used in IAM policies)
lookup_arn_base = merge(var.lookup_arns, {
"kms" = {
"kms_data" = local.kms_data_key_arn
"kms_infra" = local.kms_infrastructure_key_arn
}
"s3_read" = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].arn }
"s3_write" = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].arn }
"gluejob" = { for item in keys(var.glue_jobs) : item => module.glue_jobs[item].arn }
"gluedb" = { for item in keys(var.glue_database) : item => module.glue_databases[item].name }
"secret_read" = { for item in keys(var.secrets) : item => module.secrets[item].arn }
"dynamodb_read" = { for item in keys(var.dynamodb_databases) : item => module.dynamodb[item].arn }
})
# Build base lookup maps for IDs (used in environment variables)
lookup_id_base = merge(var.lookup_ids, {
"s3" = { for item in keys(var.s3_buckets) : item => module.s3_buckets[item].id }
"secret" = { for item in keys(var.secrets) : item => module.secrets[item].id }
"dynamodb" = { for item in keys(var.dynamodb_databases) : item => module.dynamodb[item].name }
"athena" = { for item in keys(var.athena_workgroups) : item => module.athena[item].name }
})
# Specialized lookup for Lambda permissions
lookup_perm_lambda = merge(
local.lookup_arn_base,
local.lookup_perm_lambda_xa, # Cross-account additions
{
"sqs_read" = { for item in keys(var.sqs_queues) : item => module.sqs[item].queue_arn }
"sqs_send" = { for item in keys(var.sqs_queues) : item => module.sqs[item].queue_arn }
"sns_pub" = { for item in keys(var.sns_topics) : item => module.sns[item].topic_arn }
}
)
# Specialized lookup for Lambda environment variables
lookup_id_lambda = merge(
local.lookup_id_base,
{
"sqs" = { for item in keys(var.sqs_queues) : item => module.sqs[item].queue_url }
"sns" = { for item in keys(var.sns_topics) : item => module.sns[item].topic_arn }
}
)
}
B. Lambda Building Block Usage
File: tf-project/terraform/managed_by_dp_project_lambda.tf
module "lambda" {
source = "app.terraform.io/org/buildingblock-lambda/aws"
version = "3.2.0"
for_each = var.lambda_functions
# Standard fields
prefix = local.prefix
context = local.context
name = try(each.value.name, each.key)
# Environment variables with smart lookup
environments = {
for type, item in try(each.value.environment, {}) : type =>
try(
# Try to resolve as "type:name" lookup
local.lookup_id_lambda[split(":", item)[0]][split(":", item)[1]],
item # Fallback to literal value
)
}
# Permissions with smart lookup and automatic policy generation
permissions = {
for type, items in try(each.value.permissions, {}) : type => [
for item in items :
(
# Check if it's namespaced format "type:name"
length(split(":", item)) == 2
? try(
local.lookup_perm_lambda[split(":", item)[0]][split(":", item)[1]],
item
)
: try(
# Infer type from permission category key
local.lookup_perm_lambda[type][item],
item
)
)
]
}
# Create IAM role and policy automatically
create_policy = true
# Injected infrastructure details
kms_key_arn = local.kms_data_key_arn
subnet_ids = local.subnet_ids
vpc_id = local.vpc_id
# User-provided configuration
handler = each.value.handler
runtime = each.value.runtime
memory = try(each.value.memory, 512)
timeout = try(each.value.timeout, 300)
description = try(each.value.description, "")
# Resolve S3 source file location
s3_bucket = local.code_bucket
  s3_key = (
    split(":", each.value.s3_sourcefile)[0] == "s3_file"
    ? try(
        local.s3_target_path[split(":", each.value.s3_sourcefile)[1]],
        each.value.s3_sourcefile
      )
    : each.value.s3_sourcefile
  )
}
C. Example Workload Configuration
File: examples/full_test/_project.auto.tfvars
# S3 Buckets
s3_buckets = {
raw_data = {
name = "raw"
backup = true
lifecycle_rules = [{
id = "archive_old_data"
transition_days = 90
storage_class = "GLACIER"
}]
}
processed_data = {
name = "processed"
enable_intelligent_tiering = true
enable_eventbridge_notification = true
}
}
# Upload code artifacts
s3_source_files = {
processor_code = {
source = "lambda_processor.zip"
target = "lambda_functions/processor/code.zip"
}
transform_script = {
source = "glue_transform.py"
target = "glue_jobs/transform/script.py"
}
}
# Secrets
secrets = {
database_creds = {
name = "db-credentials"
secret_string = {
username = "admin"
password = "" # Set via AWS Console
}
}
}
# Glue Database
glue_database = {
analytics = {
name = "analytics"
bucket = "s3:processed_data"
enable_lakeformation = true
share_cross_account_ro = ["datagovernance"]
}
}
# Lambda Function
lambda_functions = {
data_processor = {
name = "processor"
description = "Processes incoming data files"
handler = "index.handler"
runtime = "python3.13"
memory = 2048
timeout = 900
in_vpc = true
s3_sourcefile = "s3_file:processor_code"
environment = {
RAW_BUCKET = "s3:raw_data"
PROCESSED_BUCKET = "s3:processed_data"
GLUE_DATABASE = "gluedb:analytics"
DB_SECRET = "secret:database_creds"
LOG_LEVEL = "INFO"
}
permissions = {
s3_read = ["raw_data"]
s3_write = ["processed_data"]
glue_update = ["analytics"]
secret_read = ["database_creds"]
}
event_source_mapping = [{
event_source_arn = "s3:raw_data"
events = ["s3:ObjectCreated:*"]
filter_prefix = "incoming/"
filter_suffix = ".csv"
}]
}
}
# Glue ETL Job
glue_jobs = {
data_transform = {
name = "transform"
description = "Transforms processed data"
glue_version = "4.0"
worker_type = "G.1X"
number_of_workers = 5
max_retries = 2
timeout = 120
script_location = "s3_file:transform_script"
arguments = {
"--job-language" = "python"
"--enable-metrics" = "true"
"--enable-continuous-cloudwatch-log" = "true"
"--DATABASE" = "gluedb:analytics"
"--INPUT_BUCKET" = "s3:processed_data"
"--DB_SECRET" = "secret:database_creds"
}
permissions = {
s3_read = ["processed_data"]
glue_update = ["analytics"]
secret_read = ["database_creds"]
}
trigger_type = "SCHEDULED"
schedule = "cron(0 2 * * ? *)" # Daily at 2 AM UTC
}
}
Final Thoughts
This framework demonstrates that native Terraform can be elegant and developer-friendly without sacrificing power or flexibility. By creatively combining Terraform's built-in features (for expressions, the try() and split() functions, and locals), we've built a system that:
- Feels like configuration (simple tfvars files)
- Works like Terraform (native tooling, no preprocessing)
- Scales like a platform (hundreds of workloads, multiple teams)
- Governs like policy (automatic enforcement, no manual reviews)
The journey from verbose, error-prone Terraform code to concise, validated configuration files represents a significant step forward in Infrastructure as Code maturity. Most importantly, it's achieved through native Terraform capabilities, ensuring long-term compatibility and eliminating external dependencies.
As organizations scale their cloud infrastructure, frameworks like this become essential for maintaining velocity, consistency, and security. The patterns demonstrated here can be adapted to any cloud provider, resource types, or organizational requirements—the principles of smart lookups, building block abstraction, and configuration separation are universally applicable.
The future of Infrastructure as Code is declarative, native, and developer-friendly. This framework is a blueprint for getting there.
Acknowledgments
This framework was built by collaborative iteration between platform engineers and application teams, learning from real-world challenges and continuously refining the developer experience. Special recognition to the teams who adopted early versions, provided feedback, and helped shape the patterns documented here.