At some point every growing Terraform project hits a wall. Plans that used to finish in seconds now take minutes. Applies feel risky because hundreds of resources share a single blast radius. Colleagues avoid running terraform plan because it hammers cloud APIs hard enough to trigger throttling. The state file itself becomes a liability — large, slow to lock, and one bad write away from corruption.
This guide covers the symptoms of an oversized state, the band-aids teams reach for, and the structural fix that actually works.
How Terraform state works under the hood
Every terraform plan does two things:
- Refresh: for every resource in state, Terraform calls the provider's API to read the current real-world status. A state with 500 resources means 500+ API calls, and often more, because some resources need several calls to read and data sources are read as well.
- Diff: compare the refreshed state against the desired configuration and produce a change set.
The refresh phase is the bottleneck. Terraform reads resources concurrently, but only up to a fixed limit (10 operations at a time by default), and every resource pays the cost whether you changed it or not. Adding ten resources to a 500-resource state doesn't just make one plan 2% slower; it makes the refresh 2% slower on every single plan, for every engineer, forever.
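As a back-of-the-envelope illustration of that scaling, the sketch below estimates refresh time from three assumed inputs: the resource count, an average read latency per resource, and the number of reads in flight at once. The function name and the batch model are hypothetical simplifications, not Terraform's actual scheduler. Set the concurrency to 1 to model a fully sequential refresh.

```shell
# Rough model (an assumption, not Terraform's real scheduler):
#   refresh time ~= ceil(resources / concurrency) * average read latency
estimate_refresh_ms() {
  resources=$1     # resources in state
  latency_ms=$2    # assumed average API read latency per resource, in ms
  concurrency=$3   # assumed number of reads in flight at once
  batches=$(( (resources + concurrency - 1) / concurrency ))  # ceiling division
  echo $(( batches * latency_ms ))
}

estimate_refresh_ms 500 200 10   # 500 resources at 200 ms each, 10 at a time
estimate_refresh_ms 510 200 10   # the same state after adding ten resources
```

Under these assumptions the two calls print 10000 and 10200 (milliseconds): ten extra resources add 2% to the refresh, and that cost recurs on every plan.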
Symptoms of a state that's too large
Slow plans
The most visible symptom. Plan time scales with resource count because every resource is refreshed on every plan, regardless of whether its configuration changed. The exact speed depends on the provider: AWS resources with complex nested structures (IAM policies, security group rules) are slower to refresh than simple ones, and Azure resources that require multiple API calls per refresh are worse still.
These aren't edge cases. Users regularly report 2,900-resource states taking 20–25 minutes to plan and 1,600-resource states taking 8+ minutes. Even starting Terraform with a large state can take minutes before a single API call is made.
There's a long-standing proposal for terraform plan -light that would only refresh resources whose configuration changed, but it remains unimplemented. OpenTofu has a similar request to skip refreshing unchanged resources and a proposal for state compression to reduce the overhead of large state files.
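To see how many resources a plan has to refresh, and which types dominate, one option is to group the output of terraform state list by resource type. This is a minimal sketch: count_by_type is a hypothetical helper, and its address parsing is deliberately simple (it strips leading module paths and keeps the data. prefix for data sources).

```shell
# Reads resource addresses on stdin (one per line, as printed by
# `terraform state list`) and prints "<count> <type>", largest first.
count_by_type() {
  # Drop any leading module.<name>[<key>]. segments from each address.
  sed -E 's/^(module\.[^.[]+(\[[^]]*\])?\.)+//' |
    awk -F'[.]' '
      NF == 0 { next }
      { type = ($1 == "data") ? $1 "." $2 : $1; n[type]++ }
      END { for (t in n) print n[t], t }
    ' |
    sort -k1,1nr -k2,2
}

# Usage (assumes terraform is on PATH and the working directory is initialised):
#   terraform state list | count_by_type
```

If one or two types account for most of the state, they are natural candidates for a state of their own.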
API rate limiting
Cloud providers throttle API calls. When Terraform refreshes hundreds of resources, it can exhaust rate limits:
- AWS: ThrottlingException or Rate exceeded errors, especially on IAM, EC2 describe calls, and CloudFormation.
- Azure: 429 Too Many Requests, particularly on Resource Manager and Key Vault APIs.
- GCP: rateLimitExceeded on Compute Engine and IAM.
Terraform retries on throttling, which makes plans even slower. In severe cases, retries exhaust their budget and the plan fails entirely.
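One way to check whether throttling is contributing to slow plans is to capture a debug log and count the lines that mention the errors listed above. The sketch below assumes those strings appear verbatim in the provider's debug output, which can vary by provider and version. count_throttle_lines and plan.log are hypothetical names.

```shell
# Counts stdin lines that mention one of the throttling errors named above.
count_throttle_lines() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps that from
  # looking like a failure to the caller.
  grep -c -E 'ThrottlingException|Rate exceeded|429 Too Many Requests|rateLimitExceeded' || true
}

# Usage (assumes terraform is on PATH; TF_LOG and TF_LOG_PATH are
# Terraform's logging environment variables):
#   TF_LOG=DEBUG TF_LOG_PATH=plan.log terraform plan
#   count_throttle_lines < plan.log
```

A count that grows with the size of the state suggests the refresh itself is what is exhausting the rate limit.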
Blast radius
Every resource in a state shares a blast radius. A typo in a DNS record can, in the same plan, sit alongside a database resize. One bad terraform apply can damage resources the operator didn't intend to touch.
This isn't theoretical. Common incidents:
- A for_each key change causes Terraform to destroy and recreate resources it shouldn't.
- A provider upgrade changes how a resource is read, causing phantom diffs on dozens of resources.
- An engineer runs terraform apply after reviewing a plan that has since gone stale: someone else merged a change to a different resource in the same state, and the apply picks up both.
- A third-party API is down or throttling, so the refresh fails for a resource you weren't even changing, blocking the entire plan. With a smaller state, that resource would be in a different state file and wouldn't affect your work at all.
With smaller states, each of these incidents affects only the resources in that state. With a monolith, everything is in play.
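The stale-plan case in particular can be narrowed, though not eliminated, by saving the plan and applying exactly that file, since Terraform is expected to reject a saved plan once the state has changed underneath it. A minimal sketch, where plan_then_apply and the default file name tfplan are hypothetical:

```shell
# Plan once, review the saved plan, then apply that exact plan file.
plan_then_apply() {
  plan_file=${1:-tfplan}
  terraform plan -out="$plan_file" || return 1
  terraform show "$plan_file" || return 1   # human-readable view of the saved plan
  printf 'Apply this plan? [y/N] '
  read -r answer
  [ "$answer" = "y" ] || { echo "aborted"; return 1; }
  terraform apply "$plan_file"
}

# Usage (assumes terraform is on PATH):
#   plan_then_apply tfplan
```

This does not shrink the blast radius. It only ensures that what gets applied is what was reviewed.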
Locking contention
Remote state backends use locking to prevent concurrent writes. The longer a plan or apply takes, the longer the lock is held. With a 10-minute plan, other engineers are blocked for 10 minutes. If an apply follows, that's another stretch of locked state.
Teams start working around locks: using -lock=false (dangerous), splitting work by time of day (inefficient), or simply waiting. None of these are real solutions. Writes to large state files are also significantly slower, because each write serialises the entire state, which keeps the lock held even longer.
State file size and corruption risk
State files grow linearly with resource count. A 1,000-resource state file can be several megabytes of JSON. Every plan downloads the full state, and every apply uploads a new version. On slow connections or with large states, this adds latency.
More critically, large state files are harder to recover when they are corrupted. If a write is interrupted (network failure during apply, process killed), the state can become inconsistent. With a small state, recovery is straightforward: reimport a handful of resources. With a monolith, you're reimporting hundreds.
Large state files also compound the secrets problem. Terraform stores sensitive values in plaintext in state, so a bigger state means more secrets exposed in a single file. OpenTofu implemented state encryption, but Terraform's proposal has been open since 2016.
Band-aids that don't fix the problem
terraform plan -target
The -target flag tells Terraform to only refresh and plan specific resources:
terraform plan -target=aws_instance.web
This makes individual plans fast, but it's a trap:
- You must know which resources to target. Terraform pulls in the things a targeted resource depends on, but not the resources that depend on it, so the plan can be incomplete.
- Targeted plans leave those downstream resources unchecked. You can apply a change that breaks a resource you didn't target.
- It's manual and error-prone. There's no guardrail preventing someone from running a full plan and waiting 15 minutes.
- Terraform itself warns that resource targeting is meant for exceptional situations and should not be part of a normal workflow.
terraform plan -refresh=false
Skipping refresh makes plans fast because Terraform uses the last-known state instead of querying APIs:
terraform plan -refresh=false
The problem is obvious: if the real world has drifted from state, the plan is wrong. An engineer deleted a resource manually, someone changed a security group in the console, a colleague applied from a different branch — none of this shows up. You're planning against fiction.
Workspaces
Terraform workspaces let you maintain multiple state files from the same configuration. They're designed for deploying the same infrastructure to different environments (dev, staging, prod), not for splitting a large state into smaller pieces.
Workspaces don't reduce the number of resources per state. If your monolith has 500 resources, each workspace still has 500 resources. They solve a different problem.
terraform state rm and manual state surgery
When a single resource is causing problems, engineers sometimes remove it from state and reimport it:
terraform state rm aws_instance.problematic
terraform import aws_instance.problematic i-0123456789abcdef0
This is a valid recovery technique but not a scaling strategy. It's manual, risky (removing the wrong resource is destructive), and doesn't address the underlying size problem.
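Because state surgery is destructive, it is worth taking a copy of the state first. The sketch below uses terraform state pull to write a timestamped backup. backup_state and the file naming are hypothetical, and the backup contains the same plaintext secrets as the state itself, so it needs the same protection.

```shell
# Writes the current state to <dir>/state-<timestamp>.tfstate.backup
# and prints the path of the file it created.
backup_state() {
  dest=${1:-.}/state-$(date +%Y%m%dT%H%M%S).tfstate.backup
  terraform state pull > "$dest" || { rm -f "$dest"; return 1; }
  echo "$dest"
}

# Usage (assumes terraform is on PATH and the backend is initialised):
#   backup_state ./backups
#   terraform state rm aws_instance.problematic
```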
The real fix: smaller states
The only way to permanently fix a large state is to break it into smaller ones. Each state contains a logical group of resources (networking, compute, databases, monitoring) with its own lifecycle, credentials, and blast radius.
If your state spans multiple cloud providers, splitting along provider boundaries is one of the most effective first moves. Each provider has its own API rate limits, its own authentication, and its own failure modes: an Azure outage shouldn't block a plan that only touches AWS resources. Separate states per provider also let you scope credentials more tightly and parallelise plans that would otherwise run sequentially through a single refresh cycle.
The hard part isn't the split itself — it's managing the dependencies between the resulting states. Networking outputs need to flow into compute. Compute outputs need to flow into application infrastructure. Changes to one state need to trigger re-plans in dependent states. Snap CD was built for exactly this workflow — it tracks cross-state dependencies declaratively and cascades changes automatically, so you get the benefits of smaller states without the coordination overhead. For a discussion of approaches to breaking a monolith into smaller states, see Splitting a Terraform Monolith. To learn more about how Snap CD approaches modular deployments, see Modular Deployments.
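To make the dependency problem concrete, here is a hand-rolled sketch of passing one output from an upstream state into a downstream plan using only the Terraform CLI. It is not how Snap CD works. The directory names networking and compute, the output and variable name vpc_id, and the plan_downstream helper are all hypothetical, and the downstream configuration is assumed to declare a matching input variable.

```shell
# Reads one output from the upstream root module and passes it to the
# downstream plan as an input variable of the same name.
plan_downstream() {
  upstream_dir=$1
  downstream_dir=$2
  name=$3
  value=$(terraform -chdir="$upstream_dir" output -raw "$name") || return 1
  terraform -chdir="$downstream_dir" plan -var "$name=$value"
}

# Usage (assumes terraform is on PATH and both directories are initialised):
#   plan_downstream networking compute vpc_id
```

Doing this by hand for every output, in the right order, and again whenever an upstream state changes, is the coordination overhead described above.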
How to tell when it's time
There's no universal threshold, but if any of these are true, you should start planning a split:
- terraform plan consistently takes more than a few minutes.
- More than one team commits to the same Terraform root module.
- You've had an incident where an apply affected resources the operator didn't intend to change.
- Applies are failing due to issues with unrelated resources in the same state.
- Your state spans multiple cloud providers, and an outage or rate limit on one provider blocks plans for resources on another.
- Engineers routinely use -target or -refresh=false to work around slowness.
Start with the layer that changes least (usually networking) and work outward. The Splitting a Terraform Monolith guide has the step-by-step process.
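For the first item on that checklist, a simple way to track plan duration over time is to wrap the command in a timer. A minimal sketch, where plan_seconds is a hypothetical helper and the result has one-second resolution:

```shell
# Prints how many whole seconds a full `terraform plan` took.
plan_seconds() {
  start=$(date +%s)
  terraform plan -input=false >/dev/null || return 1
  end=$(date +%s)
  echo $(( end - start ))
}

# Usage (assumes terraform is on PATH):
#   plan_seconds
```

Recording this number for each root module makes it easier to see which states are trending toward the thresholds above.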
Tips
- Split by cloud provider early. If your state has resources across AWS, Azure, and GCP, separating them into per-provider states is one of the highest-value splits. Each provider has independent rate limits, authentication, and failure modes — keeping them together means a slow Azure API refresh delays your AWS plan for no reason.
- Watch for provider-specific bottlenecks. Even within a single cloud, some resource types are slower than others. If most of your plan time is AWS IAM resources, splitting out IAM alone might cut plan time dramatically. This is also a prerequisite if you are serious about not mixing credentials on the workers that are responsible for the deployments.
- Don't over-split. Five resources that always change together, owned by the same team, with the same credentials, should stay in one state. The goal is fast plans and small blast radius, not one resource per state.
- Use -parallelism wisely. Terraform's -parallelism flag (default 10) controls concurrent provider operations. Increasing it can speed up plans but also increases the risk of hitting API rate limits. With smaller states, the default is usually fine.
See also
- Splitting a Terraform Monolith — approaches to breaking a monolith into smaller states
- Modular Deployments — how Snap CD manages cross-state dependencies after the split
- Self-Hosted Terraform Runners with Credential Isolation — scoping credentials per environment with dedicated Runners
- Managing Secrets in Terraform — why smaller states reduce secret exposure