Karl Schriek

Posted on Jun 30 • Edited on Jul 6 • Originally published at snapcd.io

Splitting a Terraform Monolith into Smaller States

#terraform #cloud #cicd #infrastructureascode

If your Terraform plans are slow, your blast radius is too wide, or multiple teams are stepping on each other's changes, it's time to split your monolith. See The Problem with Large Terraform States for how to diagnose whether you've reached that point.

This guide walks through the process of breaking a monolithic Terraform root into smaller, independent roots — each with its own state — and how to wire the dependencies between them.

The steps

Splitting a monolith is a seven-step process:

Parse the root into a resource-level reference graph — understand what references what.
Place every resource into a target module based on lifecycle and ownership.
Compute boundaries — references crossing module boundaries become variable/output pairs; depends_on-only references become ordering edges without spurious value wiring.
Check for cycles — if module A needs an output of module B and B needs an output of A, no valid apply order exists. Catch this before writing any files.
Emit per-module roots — rewrite cross-module references to var.<input>, generate variables.tf and outputs.tf, propagate providers and locals.
Carve state — terraform state mv over local copies. The monolith state after all moves becomes the remainder module's state. Never touch the live backend during migration.
Prove the split — walk modules in topological order, thread each producer's extracted outputs into its consumers' inputs, and plan each against its carved state. Zero creates and zero destroys = the split is operationally inert.

The following sections walk through each step in detail.

Automating the steps with Demonolith

Demonolith is a Go CLI that automates all seven steps. You annotate resources with decorator comments indicating which module each belongs to:

# @demono:move networking
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# @demono:move networking
resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

# @demono:move compute
resource "aws_eks_cluster" "main" {
  vpc_config {
    subnet_ids = [aws_subnet.private.id]
  }
}

Resources without a decorator fall to a configurable remainder module (default: monolith). Data sources can be decorated with multiple targets — they're stateless reads and get duplicated into each.

Demonolith handles parsing (via HCL AST traversal), boundary computation, code emission (with hclwrite-preserved formatting), state carving, and the topologically-threaded proof — in a single command:

# Emit carved roots only (code, no state):
demonolith split ./infra

# Also carve state into per-module local files:
demonolith split ./infra --state

# Carve + prove every module plans to zero create/destroy:
demonolith split ./infra --state --proof

The carved roots are plain Terraform — valid standalone. The cross-module edges Demonolith computes are exactly the wiring you'd configure in Snap CD via snapcd_module_input_from_output.

Going through the steps manually

1. Identify natural boundaries

Look at your resources and group them by lifecycle and ownership. Common boundaries:

Networking — VPCs, subnets, route tables, NAT gateways. Changes rarely, underpins everything.
DNS — Zones, records. Usually owned by a platform team.
Compute — Kubernetes clusters, VM scale sets, container services. Changes more often, depends on networking.
Application infrastructure — Databases, caches, queues, storage accounts. Owned by application teams.
Monitoring — Dashboards, alerts, log sinks. Changes frequently, depends on everything but nothing depends on it.

A useful test: if two resources would never be changed in the same PR by the same person, they probably belong in different states.
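One way to get a first inventory for this grouping is to count the resources in the existing state by type. The sketch below assumes addresses in the form printed by terraform state list, one per line. count_by_type is a hypothetical helper name, and module prefixes whose keys contain dots are not handled.

```shell
# Hypothetical helper: summarise state addresses (one per line on stdin)
# by resource type.
count_by_type() {
  sed -E 's/^(module\.[^.]+\.)*//' |  # drop leading module.<name>. prefixes
    cut -d. -f1 |                     # keep the first segment: the resource type ("data" for data sources)
    sort | uniq -c | sort -rn         # count per type, most frequent first
}

# Intended usage against a real root (requires an initialised working directory):
#   terraform state list | count_by_type
```

Types that cluster together (all the aws_route_* and aws_subnet entries, for instance) are usually a reasonable first guess at a boundary, to be refined by ownership.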

2. Map the dependency graph

Before you move anything, build a resource-level reference graph. For every resource, identify what it references — and trace those references across the boundaries you drew in step 1. References that cross a boundary become the variable/output pairs you'll need to create.

networking          dns
    │                 ▲
    ▼                 │
  compute ──────────►─┘
    │
    ▼
application
    │
    ▼
monitoring

The values that cross these boundaries are the wiring surface of the split. Typical examples:

Networking → Compute: vpc_id, private_subnet_ids
Compute → DNS: load_balancer_ip
Compute → Application: cluster_endpoint, cluster_ca_certificate
Application → Monitoring: database_id, cache_name
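Given a placement of resources into modules and a list of resource-level references, the boundary-crossing edges fall out mechanically. A minimal sketch, assuming two hypothetical plain-text inputs: a placement file with "<address> <module>" per line and a references file with "<consumer> <producer>" per line. cross_edges is an illustrative name, not part of Terraform or Demonolith.

```shell
# Hypothetical helper: print every reference whose two ends live in
# different modules, as "<producer module> -> <consumer module>: <address>".
#   $1 = placement file   ("<address> <module>" per line)
#   $2 = references file  ("<consumer address> <producer address>" per line)
cross_edges() {
  awk '
    NR == FNR { mod[$1] = $2; next }          # first file: remember each placement
    mod[$1] != mod[$2] {                      # second file: ends in different modules
      print mod[$2] " -> " mod[$1] ": " $2
    }
  ' "$1" "$2"
}

# Usage:
#   cross_edges placement.txt references.txt
```

Each printed line corresponds to one output on the producer side and one variable on the consumer side. The sketch assumes every referenced address appears in the placement file; an unplaced address would be reported as crossing into an empty module name.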

Check for cycles: if module A needs an output of module B and B needs an output of A, no valid apply order exists. You'll need to break the cycle before proceeding — move one of the cross-referencing resources to the other side, or extract the shared resource into a third module.
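The cycle check and the apply order can both be obtained from the standard tsort utility. A small sketch using the module edges from the diagram above; the helper names are hypothetical, and detecting a cycle by looking for output on stderr assumes GNU coreutils behaviour.

```shell
# Module-level edges, "<producer> <consumer>" per line (from the diagram above).
edges='networking compute
compute dns
compute application
application monitoring'

# Print one valid apply order: every producer before its consumers.
apply_order() {
  printf '%s\n' "$1" | tsort
}

# Succeed (status 0) when tsort reports a loop on stderr.
has_cycle() {
  err=$(printf '%s\n' "$1" | tsort 2>&1 >/dev/null) || true
  [ -n "$err" ]
}

if has_cycle "$edges"; then
  echo "cycle detected: fix the placement before writing any files" >&2
else
  apply_order "$edges"
fi
```

When several orders are valid, tsort picks one of them; only the relative order of connected modules is guaranteed.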

3. Carve the code

For each new root, create a directory and move the assigned resources into it. Three things happen at the boundary:

On the producer side, expose cross-boundary values as output blocks:

# networking/outputs.tf
output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}

On the consumer side, declare those values as variable blocks:

# compute/variables.tf
variable "private_subnet_ids" {
  type = list(string)
}

In the consumer's resource definitions, rewrite the hard references to use the new variable:

# Before (monolith) — direct reference
resource "aws_eks_cluster" "main" {
  vpc_config {
    subnet_ids = aws_subnet.private[*].id
  }
}

# After (split) — variable reference
resource "aws_eks_cluster" "main" {
  vpc_config {
    subnet_ids = var.private_subnet_ids
  }
}

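Don't forget structural blocks: provider configurations, locals, and root variable declarations need to be carried into every module that uses them. A depends_on that pointed at a resource now in another root must be removed, because Terraform cannot resolve an address outside the current root. Where the two roots also exchange a value, the input/output wiring carries the ordering. A depends_on-only edge has no value to carry it, so record it as a pure ordering dependency in whatever orchestrates the roots.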
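After rewriting, it is worth checking that no consumer still mentions an address that moved away. A rough sketch using grep follows. stale_refs is a hypothetical helper, it matches comments as well as code, and it assumes addresses without index suffixes. Running terraform validate in the carved root (after terraform init -backend=false) is the more authoritative check, since Terraform itself reports references to undeclared resources.

```shell
# Hypothetical helper: list lines in <dir>/*.tf that still mention any of the
# given resource addresses. Returns 0 if at least one stale reference is found.
#   stale_refs <dir> <address>...
stale_refs() {
  dir=$1
  shift
  found=1
  for addr in "$@"; do
    # -F: match the address literally; -w: whole words only, so
    # aws_subnet.private does not match aws_subnet.private_a.
    # /dev/null forces grep to print the file name even for a single file.
    if grep -nwF -- "$addr" "$dir"/*.tf /dev/null; then
      found=0
    fi
  done
  return "$found"
}

# Usage:
#   stale_refs ./compute aws_vpc.main aws_subnet.private
```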

4. Carve the state

Terraform's state mv command lets you move resources from one state to another without destroying and recreating them. Work on local copies of the state — never against the live backend during the migration.

# Pull the monolith state to a local file
cd monolith
terraform state pull > terraform.tfstate

# Switch to the new root's directory so the relative paths below
# point at two different files
mkdir -p ../networking
cd ../networking

# Move resources to the new root's state
terraform state mv \
  -state=../monolith/terraform.tfstate \
  -state-out=./terraform.tfstate \
  aws_vpc.main aws_vpc.main

terraform state mv \
  -state=../monolith/terraform.tfstate \
  -state-out=./terraform.tfstate \
  aws_subnet.private aws_subnet.private

The monolith state file, after all moves are complete, becomes the remainder module's state — it contains exactly the resources that weren't moved out.
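With more than a handful of resources, the moves are easier to drive from a list. A sketch, assuming a hypothetical file of addresses (one per line) for the target root. move_addresses and the TF override variable are illustrative names, and the -state / -state-out flags are the same ones used in the commands above.

```shell
# Terraform binary to call; overridable, e.g. TF=tofu (assumption: same CLI surface).
TF=${TF:-terraform}

# Hypothetical helper: move every address read from stdin from one local
# state file to another, stopping at the first failure.
#   move_addresses <source state> <destination state> < addresses.txt
move_addresses() {
  src=$1
  dst=$2
  while IFS= read -r addr; do
    [ -n "$addr" ] || continue               # skip blank lines
    "$TF" state mv -state="$src" -state-out="$dst" \
      "$addr" "$addr" </dev/null || return 1 # keep terraform away from the list on stdin
  done
}

# Usage, from the new root's directory:
#   move_addresses ../monolith/terraform.tfstate ./terraform.tfstate < networking-addresses.txt
```

Keeping the address list in a file also gives you a reviewable record of what was moved where.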

5. Verify the split

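After carving code and state, every new root must plan to zero resource changes: nothing created, nothing destroyed. Newly declared outputs will still appear under "Changes to Outputs", which is expected. A plan with no creates and no destroys is the proof that the split is operationally inert.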

The catch: a carved module planned in isolation has its upstream-sourced variables unset, because the input/output wiring doesn't exist yet at the Terraform level. You need to supply those values manually for the verification plan. Walk the modules in topological order: plan each producer first, extract its output values, and feed them as -var arguments into the consumer's plan.

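# Plan the producer (no upstream dependencies) and save the plan
cd networking
terraform plan -out=tfplan

# Extract outputs from the saved plan. The carved state holds no output
# values until an apply records them, so `terraform output` would be empty here.
mkdir -p ../outputs
terraform show -json tfplan | jq '.planned_values.outputs' > ../outputs/networking.json

# Plan the consumer with the producer's outputs
cd ../compute
terraform plan \
  -var="vpc_id=$(jq -r '.vpc_id.value' ../outputs/networking.json)" \
  -var="private_subnet_ids=$(jq -c '.private_subnet_ids.value' ../outputs/networking.json)"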

If any module shows creates or destroys, something went wrong — a resource was missed in the state move, a reference was rewritten incorrectly, or a variable type doesn't match.
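Reading plan output by eye gets error-prone across many roots. A sketch that checks a saved plan mechanically, assuming jq is installed and the plan was saved with terraform plan -out=tfplan. destructive_changes is a hypothetical helper name.

```shell
# Hypothetical helper: read `terraform show -json <planfile>` on stdin and
# print one line per resource the plan would create or delete
# (replacements show up as "delete+create" or "create+delete").
destructive_changes() {
  jq -r '
    .resource_changes[]?
    | select(.change.actions | any(. == "create" or . == "delete"))
    | "\(.change.actions | join("+")) \(.address)"
  '
}

# Usage: fail the verification if anything is listed.
#   bad=$(terraform show -json tfplan | destructive_changes)
#   [ -z "$bad" ] || { printf '%s\n' "$bad" >&2; exit 1; }
```

In-place updates are deliberately not flagged here, matching the zero-creates, zero-destroys criterion; tightening the filter to include "update" is a reasonable variation.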

6. Wire up the cross-state dependencies

Once the split is verified, you need a runtime mechanism to pass outputs from producers to consumers on every deploy. There are several options:

Option A: terraform_remote_state data sources

The built-in approach. Each consuming module reads the producer's state directly:

data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "my-terraform-state"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_eks_cluster" "main" {
  vpc_config {
    subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
  }
}

This works but has significant drawbacks:

Every consumer needs to know the backend configuration of every producer.
There's no enforcement of the dependency order — you have to manually ensure networking is applied before compute.
Changes to networking outputs don't automatically trigger a re-plan of compute.

Option B: Wrapper scripts and CI glue

You write shell scripts or CI pipeline steps that run terraform output on one state and feed the values into terraform apply -var on the next. Many teams end up here, and it's fragile: the dependency graph lives in CI config rather than in code.
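As an illustration of the kind of glue this involves, the sketch below flattens terraform output -json into a variables file the consumer root can load. It assumes jq is installed and that the producer has been applied, so its outputs are recorded in state. outputs_to_tfvars and the file name networking.auto.tfvars.json are hypothetical.

```shell
# Hypothetical helper: turn the JSON printed by `terraform output -json`
# ({"name": {"value": ..., "type": ..., "sensitive": ...}, ...}) into a flat
# {"name": value, ...} object suitable for a *.tfvars.json file.
outputs_to_tfvars() {
  jq 'map_values(.value)'
}

# Usage, from the directory that contains both roots:
#   (cd networking && terraform output -json) | outputs_to_tfvars \
#     > compute/networking.auto.tfvars.json
#   (cd compute && terraform apply)
```

Terraform loads files ending in .auto.tfvars.json automatically, which avoids building long -var argument lists. The ordering between the two roots still lives in the script, though, which is exactly the fragility described above.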

Option C: Terragrunt

Terragrunt adds a dependency layer on top of Terraform:

# compute/terragrunt.hcl
dependency "networking" {
  config_path = "../networking"
}

inputs = {
  vpc_id             = dependency.networking.outputs.vpc_id
  private_subnet_ids = dependency.networking.outputs.private_subnet_ids
}

This is a genuine improvement — dependencies are declared in code, ordering is enforced, and terragrunt run-all apply handles the graph. But Terragrunt is a local CLI tool. It doesn't provide a persistent view of deployment status, approval gates, automatic re-deployment when upstream outputs change, or scoped permissions.

Option D: Snap CD

Snap CD was built for this problem. Each split becomes a Snap CD Module, and cross-state dependencies are declared as code using the Terraform Provider for Snap CD. Snap CD enforces apply ordering, runs independent Modules in parallel, and automatically cascades changes when upstream Outputs change. The cross-module edges from the split — every output/variable pair — map directly to snapcd_module_input_from_output resources. See Modular Deployments for a detailed walkthrough of how the Module and Input system works.

Tips

Split incrementally. Move one logical group at a time. Don't try to split everything in one go.
Start with the layer that changes least. Networking is usually the best first candidate — it has many dependents but few dependencies.
Keep shared modules small. If a Terraform module (in the module {} sense) is used by multiple states, keep it focused. A module that provisions "everything for an app" is just a monolith in disguise.
Test with terraform plan after every move. A clean plan (no changes) on both the source and destination states confirms the migration was correct.
