Karl Schriek

Posted on • Originally published at snapcd.io

Splitting a Terraform Monolith into Smaller States

If your Terraform plans are slow, your blast radius is too wide, or multiple teams are stepping on each other's changes, it's time to split your monolith. See The Problem with Large Terraform States for how to diagnose whether you've reached that point.

This guide walks through the process of breaking a monolithic Terraform state into smaller, focused states — and how Snap CD can manage the dependencies between them so you don't have to.

The approach

1. Identify natural boundaries

Look at your resources and group them by lifecycle and ownership. Common boundaries:

  • Networking — VPCs, subnets, route tables, NAT gateways. Changes rarely, underpins everything.
  • DNS — Zones, records. Usually owned by a platform team.
  • Compute — Kubernetes clusters, VM scale sets, container services. Changes more often, depends on networking.
  • Application infrastructure — Databases, caches, queues, storage accounts. Owned by application teams.
  • Monitoring — Dashboards, alerts, log sinks. Changes frequently, depends on everything but nothing depends on it.

A useful test: if two resources would never be changed in the same PR by the same person, they probably belong in different states.
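
One way to get a first view of candidate groups is to count the resources in the monolith by type. The sketch below is a hypothetical helper (count_resource_types is not a Terraform command) that reads the output of terraform state list on stdin. It assumes module names and keys contain no dots, and it groups any data source addresses under a single "data" entry.

```shell
# Hypothetical helper: summarize `terraform state list` output by resource type.
# Reads resource addresses on stdin, prints "<count> <type>" lines, largest first.
# Step 1 (sed): drop leading "module.<name>." prefixes so nested resources group by type.
# Step 2 (awk): count the first dot-separated field, which is the resource type.
# Step 3 (sort): order by count descending, then by type name.
count_resource_types() {
  sed -E 's/^(module\.[^.]+\.)+//' |
    awk -F. '{ counts[$1]++ } END { for (t in counts) print counts[t], t }' |
    sort -k1,1nr -k2,2
}

# Usage, from the monolith's working directory:
#   terraform state list | count_resource_types
```

The counts do not decide the boundaries for you, but a type that dominates the list is often a good place to start asking who owns it and how often it changes.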

2. Map the dependencies

Before you move anything, draw the dependency graph. Which groups produce values that other groups consume?

networking          dns
    │                 ▲
    ▼                 │
  compute ──────────►─┘
    │
    ▼
application
    │
    ▼
monitoring

The outputs that cross these boundaries are what you'll need to wire up after the split. Typical examples:

  • Networking → Compute: vpc_id, private_subnet_ids
  • Compute → DNS: load_balancer_ip
  • Compute → Application: cluster_endpoint, cluster_ca_certificate
  • Application → Monitoring: database_id, cache_name
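
If you write the graph down as producer/consumer pairs, the standard tsort utility can turn it into a valid apply order. This is only a sketch: the edge list is transcribed from the diagram above, and apply_order is a hypothetical name.

```shell
# Print one valid apply order for the example graph, producers first.
# Each input line is "<producer> <consumer>"; tsort emits a topological order.
apply_order() {
  tsort <<'EOF'
networking compute
compute dns
compute application
application monitoring
EOF
}

apply_order
```

The relative order of independent groups (dns and application here) may differ between tsort implementations. Any order it prints respects the edges, and tsort reports an error if the pairs contain a cycle.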

3. Use terraform state mv to migrate resources

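Terraform's state mv command lets you move resources from one state to another without destroying and recreating them. The -state and -state-out flags shown here operate on local state files. If your state lives in a remote backend, one approach is to pull each state to a local file with terraform state pull, move the resources between the files, and upload the results with terraform state push. Back up both states before you start.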

# Initialize the destination state
cd modules/networking
terraform init

# Move resources from the monolith to the new state
terraform state mv \
  -state=../monolith/terraform.tfstate \
  -state-out=./terraform.tfstate \
  aws_vpc.main aws_vpc.main

terraform state mv \
  -state=../monolith/terraform.tfstate \
  -state-out=./terraform.tfstate \
  aws_subnet.private aws_subnet.private
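
When a group contains many addresses, a small wrapper can keep the moves consistent. The function below is a hypothetical sketch, not part of Terraform: move_group and its argument order are assumptions. It first checks every address with state mv -dry-run and only then performs the real moves, using the same local-state flags as above.

```shell
# Hypothetical helper: move several addresses between two local state files.
# Usage: move_group <source-state> <destination-state> <address>...
move_group() {
  src_state="$1"
  dst_state="$2"
  shift 2

  # Pass 1: dry run, so an address that matches nothing stops the whole
  # group before any state file is modified.
  for addr in "$@"; do
    terraform state mv -dry-run -state="$src_state" -state-out="$dst_state" \
      "$addr" "$addr" >/dev/null || {
      echo "dry run failed for $addr; nothing was moved" >&2
      return 1
    }
  done

  # Pass 2: the real moves, keeping each address unchanged in the new state.
  for addr in "$@"; do
    terraform state mv -state="$src_state" -state-out="$dst_state" \
      "$addr" "$addr" || return 1
  done
}

# Usage, from the destination directory:
#   move_group ../monolith/terraform.tfstate ./terraform.tfstate \
#     aws_vpc.main aws_subnet.private
```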

Do this methodically, one logical group at a time. After each move:

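  1. Run terraform plan on the new state. Once the matching resource blocks have been copied into the new configuration, it should show no changes.
  2. Remove the moved resource blocks from the monolith's configuration and run terraform plan there. The moved resources should not appear, either as additions or as deletions.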
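
These checks are easy to script. The sketch below relies on the -detailed-exitcode flag of terraform plan, which exits with 0 when there are no changes, 1 on error, and 2 when changes are pending. The function name plan_is_clean and the directory names in the usage comment are illustrative assumptions.

```shell
# Hypothetical helper: succeed only if `terraform plan` finds nothing to change.
# -detailed-exitcode makes plan exit 0 for no changes, 1 for errors, 2 for changes.
plan_is_clean() {
  plan_dir="$1"
  plan_rc=0
  terraform -chdir="$plan_dir" plan -detailed-exitcode -input=false >/dev/null || plan_rc=$?
  case "$plan_rc" in
    0) return 0 ;;
    2) echo "unexpected changes planned in $plan_dir" >&2; return 1 ;;
    *) echo "terraform plan failed in $plan_dir (exit $plan_rc)" >&2; return 2 ;;
  esac
}

# Usage, with illustrative directory names:
#   plan_is_clean modules/networking && plan_is_clean modules/monolith
```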

4. Replace hard references with inputs

In the monolith, your compute module might directly reference aws_subnet.private[*].id. After the split, those subnets live in a different state. You need to replace the hard reference with a variable:

# Before (monolith)
resource "aws_eks_cluster" "main" {
  # name, role_arn, and other required arguments omitted for brevity
  vpc_config {
    subnet_ids = aws_subnet.private[*].id
  }
}

# After (separate compute module)
variable "private_subnet_ids" {
  type = list(string)
}

resource "aws_eks_cluster" "main" {
  # name, role_arn, and other required arguments omitted for brevity
  vpc_config {
    subnet_ids = var.private_subnet_ids
  }
}

And in the networking module, expose the value as an output:

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}

5. Wire up the cross-state dependencies

This is where it gets interesting. You've split the monolith, and now you need the outputs from one state to flow into another. There are a few ways to do this:

Option A: terraform_remote_state data sources

The built-in approach. Each consuming module reads the producer's state directly:

data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "my-terraform-state"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_eks_cluster" "main" {
  # name, role_arn, and other required arguments omitted for brevity
  vpc_config {
    subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
  }
}

This works but has significant drawbacks:

  • Every consumer needs to know the backend configuration of every producer, and needs read access to the producer's entire state snapshot, which can include sensitive values.
  • There's no enforcement of the dependency order — you have to manually ensure networking is applied before compute.
  • Changes to networking outputs don't automatically trigger a re-plan of compute.

Option B: Wrapper scripts and CI glue

You write shell scripts or CI pipeline steps that run terraform output on one state and feed the values into terraform apply -var on the next. This is a common pattern, and it's fragile: the dependency graph lives in CI config rather than in code, and nothing re-runs the consumer when an upstream output changes.
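
A minimal version of such glue might look like the sketch below. The function name and the directory and output names are hypothetical. It assumes the output is a list or map, since Terraform parses -var values for those variable types as expressions and so accepts the JSON that terraform output -json prints.

```shell
# Hypothetical glue: read one output from a producer state and pass it to the
# consumer as an input variable. Directory and output names are illustrative.
apply_with_upstream_output() {
  producer_dir="$1"
  consumer_dir="$2"
  name="$3"

  # -json keeps lists and maps intact, e.g. ["subnet-a","subnet-b"].
  value="$(terraform -chdir="$producer_dir" output -json "$name")" || return 1

  # No -auto-approve here, so Terraform still asks for confirmation.
  terraform -chdir="$consumer_dir" apply -var "${name}=${value}"
}

# Usage:
#   apply_with_upstream_output networking compute private_subnet_ids
```

For plain string outputs, terraform output -raw is usually the better fit, because the JSON form would carry its surrounding quotes into the variable. Every new dependency means another call like this, in the right order, which is how the graph ends up encoded in pipeline steps.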

Option C: Terragrunt

Terragrunt adds a dependency layer on top of Terraform:

# compute/terragrunt.hcl
dependency "networking" {
  config_path = "../networking"
}

inputs = {
  vpc_id             = dependency.networking.outputs.vpc_id
  private_subnet_ids = dependency.networking.outputs.private_subnet_ids
}

This is a genuine improvement — dependencies are declared in code, ordering is enforced, and terragrunt run-all apply handles the graph. But Terragrunt is a local CLI tool. It doesn't provide a persistent view of deployment status, approval gates, automatic re-deployment when upstream outputs change, or scoped permissions.

Option D: Snap CD

Snap CD was built for this problem. Each split becomes a Snap CD Module, and cross-state dependencies are declared as code. Snap CD enforces apply ordering, runs independent Modules in parallel, and automatically cascades changes when upstream outputs change. See Modular Deployments for a detailed walkthrough of how the Module and input system works.

Tips

  • Split incrementally. Move one logical group at a time. Don't try to split everything in one go.
  • Start with the layer that changes least. Networking is usually the best first candidate — it has many dependents but few dependencies.
  • Keep shared modules small. If a Terraform module (in the module {} sense) is used by multiple states, keep it focused. A module that provisions "everything for an app" is just a monolith in disguise.
  • Test with terraform plan after every move. A clean plan (no changes) on both the source and destination states confirms the migration was correct.
