Karl Schriek

Posted on Jun 30 • Edited on Jul 6 • Originally published at snapcd.io

The Problem with Large Terraform States

#terraform #cicd #infrastructureascode #cloud

At some point every growing Terraform project hits a wall. Plans that used to finish in seconds now take minutes. Applies feel risky because hundreds of resources share a single blast radius. Colleagues avoid running terraform plan because it hammers cloud APIs hard enough to trigger throttling. The state file itself becomes a liability — large, slow to lock, and one bad write away from corruption.

This guide covers the symptoms of an oversized state, the band-aids teams reach for, and the structural fix that actually works.

How Terraform state works under the hood

Every terraform plan does two things:

Refresh: for every resource in state, Terraform calls the provider's API to read the current real-world status. A state with 500 resources means 500+ API calls, often more because many resource types need several calls to read in full.
Diff: compare the refreshed state against the desired configuration and produce a change set.

The refresh phase is the bottleneck. Terraform reads resources with limited concurrency (the -parallelism setting, 10 by default, further constrained by dependency ordering), and every resource pays the cost whether you changed it or not. Adding ten resources to a 500-resource state doesn't slow down one plan by 2%; it adds roughly 2% to the refresh on every single plan, for every engineer, forever.
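
The scaling argument above can be put into numbers with a back-of-the-envelope model. This is only a sketch: the 0.5 seconds per read and the batch-style concurrency are illustrative assumptions, not measurements, and the function name is made up for this example. Set `concurrency=1` to model fully sequential reads.

```python
import math


def estimated_refresh_seconds(resource_count: int,
                              seconds_per_read: float = 0.5,
                              concurrency: int = 10) -> float:
    """Rough estimate of refresh time, assuming reads run in batches.

    seconds_per_read and concurrency are assumed values for illustration;
    real providers vary widely and many resources need several API calls.
    """
    if resource_count < 0 or concurrency < 1:
        raise ValueError("resource_count must be >= 0 and concurrency >= 1")
    batches = math.ceil(resource_count / concurrency)
    return batches * seconds_per_read


if __name__ == "__main__":
    for count in (500, 510, 2900):
        print(count, "resources ->", estimated_refresh_seconds(count), "seconds")
```

Under these assumptions the cost is small per plan but is paid on every plan, which is the point of the paragraph above.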

Symptoms of a state that's too large

Slow plans

The most visible symptom. Plan time scales with resource count because every resource is refreshed on every plan, regardless of whether its configuration changed. The exact speed depends on the provider: AWS resources with complex nested structures (IAM policies, security group rules) are slower to refresh than simple ones, and Azure resources that require multiple API calls per refresh are worse still.

These aren't edge cases. Users regularly report 2,900-resource states taking 20–25 minutes to plan and 1,600-resource states taking 8+ minutes. Even starting Terraform with a large state can take minutes before a single API call is made. There's a long-standing proposal for terraform plan -light that would only refresh resources whose configuration changed, but it remains unimplemented. OpenTofu has a similar request to skip refreshing unchanged resources and a proposal for state compression to reduce the overhead of large state files.

API rate limiting

Cloud providers throttle API calls. When Terraform refreshes hundreds of resources, it can exhaust rate limits:

AWS: ThrottlingException or Rate exceeded errors, especially on IAM, EC2 describe calls, and CloudFormation.
Azure: 429 Too Many Requests, particularly on Resource Manager and Key Vault APIs.
GCP: rateLimitExceeded on Compute Engine and IAM.

Terraform retries on throttling, which makes plans even slower. In severe cases, retries exhaust their budget and the plan fails entirely.
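
To see why retries make a throttled plan slower and can eventually fail it, here is a generic exponential-backoff sketch. It is not the retry logic of Terraform or of any particular provider; `Throttled`, `call_with_backoff`, and the delay values are hypothetical names and numbers chosen for illustration.

```python
import time


class Throttled(Exception):
    """Stand-in for a throttling error such as HTTP 429 or 'Rate exceeded'."""


def call_with_backoff(read, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call read() until it succeeds or the retry budget is spent.

    Returns (result, seconds_waited). Re-raises Throttled once the
    budget is exhausted, which is when the whole plan would fail.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    waited = 0.0
    for attempt in range(max_attempts):
        try:
            return read(), waited
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # budget exhausted
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, 8s, ...
            sleep(delay)
            waited += delay


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_read():
        # Simulated API call that is throttled twice before succeeding.
        attempts["count"] += 1
        if attempts["count"] <= 2:
            raise Throttled("Rate exceeded")
        return "ok"

    # A no-op sleep keeps the demo instant; waited reports the simulated delay.
    print(call_with_backoff(flaky_read, sleep=lambda seconds: None))
```

Every throttled read adds its own waiting time, so a refresh that trips rate limits on many resources slows down far more than the raw call count suggests.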

Blast radius

Every resource in a state shares a blast radius. A typo in a DNS record and a database resize can land in the same plan, and one bad terraform apply can damage resources the operator didn't intend to touch.

This isn't theoretical. Common incidents:

A for_each key change causes Terraform to destroy and recreate resources it shouldn't.
A provider upgrade changes how a resource is read, causing phantom diffs on dozens of resources.
An engineer runs terraform apply after reviewing a plan that has since gone stale: someone else merged a change to a different resource in the same state, and the apply picks up both.
A third-party API is down or throttling, so the refresh fails for a resource you weren't even changing — blocking the entire plan. With a smaller state, that resource would be in a different state file and wouldn't affect your work at all.

With smaller states, each of these incidents affects only the resources in that state. With a monolith, everything is in play.
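
One way to make the blast radius of a plan visible before applying is to inspect the machine-readable plan (the output of `terraform show -json` on a saved plan file) and list every resource whose actions include a delete. The sketch below assumes the documented `resource_changes` layout of that JSON; the function name and the resource addresses in the sample are hypothetical.

```python
def destructive_changes(plan: dict) -> list[str]:
    """Return addresses whose planned actions include a delete.

    A replace shows up as both "delete" and "create" in the actions list,
    so it is reported here as well.
    """
    return [
        change["address"]
        for change in plan.get("resource_changes", [])
        if "delete" in change.get("change", {}).get("actions", [])
    ]


# Hypothetical, heavily trimmed plan used only to exercise the function.
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "aws_route53_record.www", "change": {"actions": ["update"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_instance.web", "change": {"actions": ["no-op"]}},
    ]
}

if __name__ == "__main__":
    for address in destructive_changes(SAMPLE_PLAN):
        print("will be destroyed or replaced:", address)
```

A check like this does not shrink the blast radius; it only flags a plan that touches more than the operator expected.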

Locking contention

Remote state backends use locking to prevent concurrent writes. The longer a plan or apply takes, the longer the lock is held. With a 10-minute plan, other engineers are blocked for 10 minutes. If an apply follows, that's another stretch of locked state.

Teams start working around locks: using -lock=false (dangerous), splitting work by time of day (inefficient), or simply waiting. None of these are real solutions. Large state files make the contention worse, because each write serialises the entire state and is therefore significantly slower.

State file size and corruption risk

State files grow linearly with resource count. A 1,000-resource state file can be several megabytes of JSON. Every plan downloads the full state, and every apply uploads a new version. On slow connections or with large states, this adds latency.

More critically, large state files are harder to recover from corruption. If a write is interrupted (network failure during apply, process killed), the state can become inconsistent. With a small state, recovery is straightforward — reimport a handful of resources. With a monolith, you're reimporting hundreds. Large state files also compound the secrets problem — Terraform stores sensitive values in plaintext in state, so a bigger state means more secrets exposed in a single file. OpenTofu implemented state encryption, but Terraform's proposal has been open since 2016.
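
To gauge how large a state has become and where its weight sits, you can count managed resources per provider from the JSON produced by `terraform show -json` for the current state. This is a minimal sketch that assumes the documented `values.root_module` layout with nested `child_modules`; the helper names and the sample data are hypothetical.

```python
from collections import Counter


def iter_resources(module: dict):
    """Yield resources from a module and, recursively, from its child modules."""
    yield from module.get("resources", [])
    for child in module.get("child_modules", []):
        yield from iter_resources(child)


def count_by_provider(state: dict) -> Counter:
    """Count managed resources (not data sources) per provider."""
    root = state.get("values", {}).get("root_module", {})
    return Counter(
        resource.get("provider_name", "unknown")
        for resource in iter_resources(root)
        if resource.get("mode") == "managed"
    )


# Hypothetical, heavily trimmed state used only to exercise the functions.
SAMPLE_STATE = {
    "values": {
        "root_module": {
            "resources": [
                {"mode": "managed", "type": "aws_instance",
                 "provider_name": "registry.terraform.io/hashicorp/aws"},
                {"mode": "managed", "type": "aws_vpc",
                 "provider_name": "registry.terraform.io/hashicorp/aws"},
            ],
            "child_modules": [
                {"resources": [
                    {"mode": "managed", "type": "azurerm_resource_group",
                     "provider_name": "registry.terraform.io/hashicorp/azurerm"},
                    {"mode": "data", "type": "azurerm_client_config",
                     "provider_name": "registry.terraform.io/hashicorp/azurerm"},
                ]}
            ],
        }
    }
}

if __name__ == "__main__":
    for provider, count in count_by_provider(SAMPLE_STATE).most_common():
        print(provider, count)
```

The per-provider totals also give a first hint about where a split would remove the most refresh work.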

Band-aids that don't fix the problem

`terraform plan -target`

The -target flag tells Terraform to refresh and plan only specific resources, plus whatever those resources depend on:

terraform plan -target=aws_instance.web

This makes individual plans fast, but it's a trap:

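You must know which resources to target. Terraform pulls in whatever a target depends on, but not the resources that depend on the target, so a missed dependent leaves the plan incomplete.
Targeted plans ignore everything outside the target set. You can apply a change that breaks a resource you didn't target.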
It's manual and error-prone. There's no guardrail preventing someone from running a full plan and waiting 15 minutes.
Terraform itself warns that resource targeting is meant for exceptional situations and should not be part of a routine workflow.

`terraform plan -refresh=false`

Skipping refresh makes plans fast because Terraform uses the last-known state instead of querying APIs:

terraform plan -refresh=false

The problem is obvious: if the real world has drifted from state, the plan is wrong. An engineer deleted a resource manually, someone changed a security group in the console, a colleague applied from a different branch — none of this shows up. You're planning against fiction.

Workspaces

Terraform workspaces let you maintain multiple state files from the same configuration. They're designed for deploying the same infrastructure to different environments (dev, staging, prod), not for splitting a large state into smaller pieces.

Workspaces don't reduce the number of resources per state. If your monolith has 500 resources, each workspace still has 500 resources. They solve a different problem.

`terraform state rm` and manual state surgery

When a single resource is causing problems, engineers sometimes remove it from state and reimport it:

terraform state rm aws_instance.problematic
terraform import aws_instance.problematic i-0123456789abcdef0

This is a valid recovery technique but not a scaling strategy. It's manual, risky (removing the wrong resource is destructive), and doesn't address the underlying size problem.

The real fix: smaller states

The only way to permanently fix a large state is to break it into smaller ones. Each state contains a logical group of resources (networking, compute, databases, monitoring) with its own lifecycle, credentials, and blast radius.

If your state spans multiple cloud providers, splitting along provider boundaries is one of the most effective first moves. Each provider has its own API rate limits, its own authentication, and its own failure modes: an Azure outage shouldn't block a plan that only touches AWS resources. Separate states per provider also let you scope credentials more tightly and parallelise plans that would otherwise run sequentially through a single refresh cycle.

The hard part isn't the split itself — it's managing the dependencies between the resulting states. Networking outputs need to flow into compute. Compute outputs need to flow into application infrastructure. Changes to one state need to trigger re-plans in dependent states. Snap CD was built for exactly this workflow — it tracks cross-state dependencies declaratively and cascades changes automatically, so you get the benefits of smaller states without the coordination overhead. For a discussion of approaches to breaking a monolith into smaller states, see Splitting a Terraform Monolith. To learn more about how Snap CD approaches modular deployments, see Modular Deployments.
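
The cascade described above comes down to a graph problem: given which state consumes which state's outputs, find everything downstream of a change and re-plan it in dependency order. The sketch below illustrates that idea only; it is not how Snap CD or any other tool implements it, and the state names and the `DEPENDS_ON` layout are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical layout: each state maps to the states whose outputs it consumes.
DEPENDS_ON = {
    "networking": [],
    "compute": ["networking"],
    "databases": ["networking"],
    "application": ["compute", "databases"],
    "monitoring": [],
}


def states_to_replan(changed: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Return the states downstream of `changed`, dependencies first."""
    # Invert the edges: which states consume each state's outputs?
    consumers: dict[str, list[str]] = {}
    for name, deps in depends_on.items():
        for dep in deps:
            consumers.setdefault(dep, []).append(name)

    # Collect everything reachable downstream of the changed state.
    affected: set[str] = set()
    stack = [changed]
    while stack:
        for consumer in consumers.get(stack.pop(), []):
            if consumer not in affected:
                affected.add(consumer)
                stack.append(consumer)

    # Order the affected states so each follows the states it depends on.
    subgraph = {s: [d for d in depends_on[s] if d in affected] for s in affected}
    return list(TopologicalSorter(subgraph).static_order())


if __name__ == "__main__":
    print(states_to_replan("networking", DEPENDS_ON))
    print(states_to_replan("monitoring", DEPENDS_ON))
```

In this example a change to networking triggers compute and databases (in either order) and then application, while a change to monitoring triggers nothing, which is the isolation that smaller states are meant to provide.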

How to tell when it's time

There's no universal threshold, but if any of these are true, you should start planning a split:

terraform plan consistently takes more than a few minutes.
More than one team commits to the same Terraform root module.
You've had an incident where an apply affected resources the operator didn't intend to change.
Applies are failing due to issues with unrelated resources in the same state.
Your state spans multiple cloud providers, and an outage or rate limit on one provider blocks plans for resources on another.
Engineers routinely use -target or -refresh=false to work around slowness.

Start with the layer that changes least (usually networking) and work outward. The Splitting a Terraform Monolith guide has the step-by-step process.

Tips

Split by cloud provider early. If your state has resources across AWS, Azure, and GCP, separating them into per-provider states is one of the highest-value splits. Each provider has independent rate limits, authentication, and failure modes — keeping them together means a slow Azure API refresh delays your AWS plan for no reason.
Watch for provider-specific bottlenecks. Even within a single cloud, some resource types are slower than others. If most of your plan time is AWS IAM resources, splitting out IAM alone might cut plan time dramatically. Splitting along these lines is also a prerequisite if you are serious about not mixing credentials on the workers that run the deployments.
Don't over-split. Five resources that always change together, owned by the same team, with the same credentials, should stay in one state. The goal is fast plans and small blast radius, not one resource per state.
Use -parallelism wisely. Terraform's -parallelism flag (default 10) controls concurrent provider operations. Increasing it can speed up plans but also increases the risk of hitting API rate limits. With smaller states, the default is usually fine.
