Migrating a Terraform Monolith to Terragrunt: State Slicing Without Downtime

#devops #terraform #terragrunt #aws

What I Built

I decomposed a monolithic Terraform state containing 19 logical AWS infrastructure components into a Terragrunt monorepo. The migration gave components such as VPC, EKS, and RDS their own isolated state files, which enables independent locking, a smaller blast radius, and faster plans, all without triggering any infrastructure changes or downtime.

System Architecture

Monolith State — A single S3-backed state file containing all 19 infrastructure components under a nested module hierarchy.

Terragrunt Modules — 13 independent module directories, each inheriting root configuration and managing a unique S3 state key.

Dependency Graph — Explicit inter-module wiring using Terragrunt dependency blocks to pass versioned outputs between isolated states.
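
To illustrate how dependency blocks translate into an execution order, here is a minimal Python sketch. The wiring shown (eks and rds both reading outputs from vpc) is an assumption for illustration, not the actual graph of this platform, and apply_order is a hypothetical helper rather than anything Terragrunt exposes.

```python
from graphlib import TopologicalSorter

# Assumed wiring for illustration only: each key maps a module to the
# upstream modules whose outputs it reads.
DEPENDENCIES = {
    "vpc": set(),
    "rds": {"vpc"},
    "eks": {"vpc"},
}


def apply_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return modules ordered so every upstream module comes first.

    Raises graphlib.CycleError if the dependency graph contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(apply_order(DEPENDENCIES))
```

The useful property is that a cycle fails loudly instead of producing a partial order, which is the behavior you want from any tool that walks isolated states in sequence.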

Core Technical Behavior

The locking model changed from a single global lock to one lock per component. In the monolith, any change to a load balancer rule forced Terraform to re-evaluate the entire stack, including the RDS and EKS clusters. After slicing the state, Terraform reconciles only the resources that belong to the component being changed.

The migration process relied on address rewriting to drop the top-level parent prefixes used in the monolith. For example, a resource originally located at module.client_stage.module.database.module.rds.aws_db_instance.this[0] was moved to module.rds.aws_db_instance.this[0] within the new isolated rds module state.
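
The rewrite rule can be sketched in Python as follows. This is a minimal illustration, not the script used in the migration: rewrite_address and its regex are hypothetical, and it assumes that index keys never contain a closing bracket.

```python
import re

# One parent module segment, with an optional [index] or ["key"] suffix.
_SEGMENT = r"module\.[^.\[]+(?:\[[^\]]*\])?"


def rewrite_address(address: str, target_module: str) -> str:
    """Drop every parent module segment that precedes module.<target_module>."""
    pattern = re.compile(
        rf"^(?:{_SEGMENT}\.)*?"
        rf"(module\.{re.escape(target_module)}(?:\[[^\]]*\])?\..+)$"
    )
    match = pattern.match(address)
    if match is None:
        raise ValueError(f"{address!r} is not under module.{target_module}")
    return match.group(1)


if __name__ == "__main__":
    old = "module.client_stage.module.database.module.rds.aws_db_instance.this[0]"
    print(rewrite_address(old, "rds"))
```

Raising on a non-matching address is deliberate: a resource that does not belong to the target module should stop the run rather than be moved under a guessed name.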

Pulling the monolith state to a local file for immutable processing:

terraform state pull > monolith.tfstate

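Dynamically discovering direct child modules from the state list:

# Assumption: STATE_LIST was populated from the local snapshot, e.g.:
STATE_LIST=$(terraform state list -state="$MONOLITH_STATE")
DIRECT_MODULES=$(echo "$STATE_LIST" | grep "^${MODULE_PREFIX}\.module\." | \
  sed "s|^${MODULE_PREFIX}\.module\.||" | \
  sed 's/^\([^.[]*\).*/\1/' | sort -u)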
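
For comparison, the same discovery step can be written in Python. This is a sketch of the pipeline above, not a drop-in replacement: direct_child_modules is a hypothetical name, and unlike the shell version it escapes the dots in the prefix so they match literally.

```python
import re


def direct_child_modules(state_list: list[str], module_prefix: str) -> list[str]:
    """Return the sorted, de-duplicated names of modules directly under the prefix."""
    pattern = re.compile(rf"^{re.escape(module_prefix)}\.module\.([^.\[]+)")
    names = set()
    for address in state_list:
        match = pattern.match(address)
        if match:
            names.add(match.group(1))
    return sorted(names)


if __name__ == "__main__":
    addresses = [
        "module.client_stage.module.database.module.rds.aws_db_instance.this[0]",
        "module.client_stage.module.vpc.aws_vpc.this",
    ]
    print(direct_child_modules(addresses, "module.client_stage"))
```

Resources declared directly under the prefix, with no child module in between, match neither version and would need separate handling.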

Executing the state move from the local monolith source to individual module states:

terraform state mv \
  -state="$MONOLITH_STATE" \
  -state-out="$TARGET_STATE" \
  "$resource" "$new_address"

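A script-driven version of that move can be sketched in Python. One caveat worth checking against your Terraform version: to my understanding, terraform state mv also rewrites the file passed to -state, because the moved resource is removed from it. The sketch therefore works on a throwaway copy of the pulled snapshot. Both helper names are hypothetical.

```python
import shutil
from pathlib import Path


def working_copy(snapshot: Path, workdir: Path) -> Path:
    """Copy the pulled snapshot so the original file is never handed to Terraform."""
    workdir.mkdir(parents=True, exist_ok=True)
    copy = workdir / snapshot.name
    shutil.copy2(snapshot, copy)
    return copy


def state_mv_command(source: Path, target: Path, old: str, new: str) -> list[str]:
    """Build the argv for a single move without executing it."""
    return [
        "terraform", "state", "mv",
        f"-state={source}",
        f"-state-out={target}",
        old, new,
    ]
```

Each command can then be run with subprocess.run(cmd, check=True). Building the argv as a list avoids shell quoting problems with bracket-indexed addresses such as this[0].
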
Final runtime verification involved a run-all plan across the dependency graph. This confirmed that downstream modules could successfully read VPC IDs and RDS endpoints from upstream modules via typed outputs stored in the new isolated state files.
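
The per-module side of this verification can be automated by inspecting the JSON form of a saved plan, as produced by terraform show -json on a plan file. The sketch below is a hypothetical helper, not part of the original tooling. It reads only the resource_changes list and treats any action other than no-op as a diff. It does not look at output changes.

```python
import json


def changed_resources(plan_json: str) -> list[str]:
    """Return the addresses of resources whose planned action is not a no-op."""
    plan = json.loads(plan_json)
    return [
        change["address"]
        for change in plan.get("resource_changes", [])
        if change["change"]["actions"] != ["no-op"]
    ]
```

An empty list means the plan proposes no resource changes for that module. For a simple pass or fail signal, terraform plan -detailed-exitcode gives similar information through its exit status (0 for no changes, 2 for a pending diff).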

Key Engineering Decisions

Script-driven slicing replaced manual commands so that moving hundreds of resources across 13 modules stayed reproducible and free of typos.

Immutable source state management used separate -state and -state-out files to ensure the local monolith snapshot was never modified during the slicing process, allowing for clean retries.

Dynamic module discovery derived module names directly from the state list rather than a hardcoded inventory, preventing the silent omission of existing infrastructure from the migration.

Python-based regex processing was used for address rewriting because dot-separated and bracket-indexed resource addresses are hard to handle safely with standard shell tools.

Local backend validation was performed before migrating to S3 to verify each module against a zero-diff plan, ensuring the state perfectly matched live infrastructure before pushing to remote storage.
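
A zero-diff plan says nothing about resources that were never moved, so a complementary completeness check is useful alongside the decisions above. The sketch below is an assumption about how such a check could look, not the author's script: coverage_report is a hypothetical helper that compares the monolith's address list with the original addresses assigned to each module.

```python
from collections import Counter


def coverage_report(
    monolith: list[str], moved: dict[str, list[str]]
) -> tuple[list[str], list[str]]:
    """Return (missing, duplicated) monolith addresses.

    missing: addresses that no module claimed.
    duplicated: addresses claimed more than once across all modules.
    """
    counts = Counter(addr for addrs in moved.values() for addr in addrs)
    missing = sorted(set(monolith) - set(counts))
    duplicated = sorted(addr for addr, n in counts.items() if n > 1)
    return missing, duplicated
```

Both lists should be empty before any state is pushed to remote storage.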

Trade-offs

Optimized for: blast radius reduction, per-module state locking, and faster iteration via targeted plan/apply cycles.

Sacrificed: operational simplicity during the migration window, requiring a change freeze to prevent drift while state existed in both monolithic and sliced forms.

Results / Cost Impact

The platform now operates 13 independent state files in S3, each protected by its own DynamoDB lock.

Parallel workstreams no longer block each other: a Kubernetes deployment change does not lock the VPC or database state.

The system enforces explicit ownership boundaries, where changes are restricted to specific infrastructure concerns without the risk of affecting adjacent resources in the same state file.

Conclusion

This migration turned a monolithic bottleneck into independently managed components by performing state surgery instead of re-creating infrastructure. The resulting system shows zero drift relative to the original monolith while letting the team make parallel changes with isolated failure modes.

A state migration can be considered correct when every isolated module produces a clean plan with zero diff and every resource from the monolith is accounted for in exactly one module.

Need Help?

If you're working on a similar state decomposition or evaluating Terragrunt adoption for a growing SaaS platform, feel free to reach out at hello@jakops.cloud.

https://jakops.cloud