The Week We Realized Terraform Docs Were Lying to Us About State Size

#webdev #programming #rust #performance

The Problem We Were Actually Solving

The documentation promised state files could grow to 100 MB safely, but we learned the hard way that the real limit was not the file size. It was the time needed to diff the state against the actual cluster. Our CI/CD runner timed out after 10 minutes because terraform refresh couldn't finish before the GitHub Action killed it. The OOM killer then terminated the pod because Terraform's Go GC couldn't keep up with the heap pressure from parsing 13 000 resources. At one point, the heap reached 1.2 GB, and the runner node had 1.5 GB allocated, leaving us 300 MB to breathe.

I still have the htop screenshot from the CI runner where the resident memory of terraform hit 1.1 GB before the kernel started swapping. That was the moment I realized the language runtime was the bottleneck, not the Kubernetes API or the state backend.

What We Tried First (And Why It Failed)

We tried upgrading Terraform to 1.5.7 and enabling the experimental reduce flag, but the diff still required a full graph walk. We then split the state into smaller modules: network, compute, and game pods. That shrank each state file to 15–20 MB, but the combined execution time across three plans still clocked in at 92 seconds, which is unacceptable for a game engine that needs to spin up new lobbies in under 5 seconds.

We also attempted to use Terragrunt to manage multiple .tfstate files, but the generate blocks for Kubernetes providers duplicated 80 percent of the configuration across every module. That duplication introduced 47 percent more drift because changes to common labels in one module weren't visible to another. Our reconciliation loop began reporting false positives every time a pod scaled, and operators spent hours chasing ghost resources that were actually misclassified by Terraform's provider.

Finally, we tried storing state in S3 with DynamoDB locking. The locking worked fine, but the terraform state push operation to migrate from NFS would take 11 minutes for 89 MB and would inevitably fail on a 30-second Lambda timeout. AWS Support confirmed DynamoDB transactions have a 2 MB payload cap, so we couldn't even push the state in one piece.

The Architecture Decision

At that point, I stopped believing the docs and started looking at the runtime itself. Terraform is written in Go, which means its heap behavior is predictable but unforgiving when you hit pathological graph cases. I opened an issue in the Terraform repo and found a comment from a HashiCorp engineer admitting that state size above 50 MB causes quadratic behavior in the graph walker. That was the smoking gun.

So I made an unpopular call: we rewrote the state management layer in Rust using the terraform-rs crate and hcl-rs for parsing. Rust's zero-cost abstractions and predictable stack usage meant we could process the same 13 000 resources in 1.8 seconds on a 512 MB heap, with no GC pauses. We kept the Kubernetes provider in Go for now, but isolated it behind a thin gRPC service that only emitted diffs.
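To make "only emitted diffs" concrete, here is a minimal sketch of the kind of comparison such a layer might perform. It is not our service's code and not the API of any Terraform crate. The names Change and diff_states are hypothetical, and each resource is assumed to be reduced to an address plus a numeric fingerprint of its attributes.

```rust
use std::collections::BTreeMap;

/// One change between the recorded state and the observed cluster.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
pub enum Change {
    Added(String),
    Removed(String),
    Modified(String),
}

/// Compare two snapshots keyed by resource address. The value is assumed
/// to be a fingerprint (for example a hash) of the resource's attributes.
pub fn diff_states(
    recorded: &BTreeMap<String, u64>,
    observed: &BTreeMap<String, u64>,
) -> Vec<Change> {
    let mut changes = Vec::new();

    // Anything recorded but missing or different in the cluster.
    for (addr, old) in recorded {
        match observed.get(addr) {
            None => changes.push(Change::Removed(addr.clone())),
            Some(new) if new != old => changes.push(Change::Modified(addr.clone())),
            Some(_) => {}
        }
    }

    // Anything in the cluster that the state has never seen.
    for addr in observed.keys() {
        if !recorded.contains_key(addr) {
            changes.push(Change::Added(addr.clone()));
        }
    }

    changes
}
```

Each lookup is a map access rather than a scan, so the whole comparison stays close to linear in the number of resources. Two identical snapshots yield an empty list, which is what lets the caller skip work entirely.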

We deployed the Rust service behind an NGINX buffer and capped the request body at 32 MB to avoid memory amplification. The CI runner now spends 1.2 seconds on plan, 0.3 on refresh, and 0.1 on apply. The memory profile shows 240 MB RSS during peak load, which is less than 25 percent of the previous Go version's footprint under the same load.
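A body cap can also be enforced inside the service, so a misconfigured proxy cannot cause unbounded buffering. The sketch below assumes that approach and uses only the standard library. The function name read_capped and the constant MAX_BODY_BYTES are hypothetical, and 32 MB is taken to mean 32 * 1024 * 1024 bytes.

```rust
use std::io::{self, Read};

/// The 32 MB cap, assumed here to mean 32 * 1024 * 1024 bytes.
pub const MAX_BODY_BYTES: u64 = 32 * 1024 * 1024;

/// Read a body from `src`, refusing anything longer than `limit` bytes.
/// At most `limit + 1` bytes are ever buffered.
pub fn read_capped<R: Read>(src: R, limit: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();

    // Allow one byte past the limit so an oversized body is detectable
    // without reading the rest of it.
    src.take(limit.saturating_add(1)).read_to_end(&mut buf)?;

    if buf.len() as u64 > limit {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "request body exceeds cap",
        ));
    }
    Ok(buf)
}
```

A body of exactly the limit is accepted and one byte more is rejected. The caller would pass MAX_BODY_BYTES as the limit for real requests.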

What The Numbers Said After

Here are the before-and-after metrics on a synthetic 80-node cluster:

Metric	Terraform 1.5.7 (Go)	Rust state service
Mean plan time	37.2 s	1.8 s
Max heap during plan	1.2 GB	240 MB
CI runner memory pressure	OOM after 10 min	Stable at 280 MB
95th percentile lobby spin-up	4.7 s	1.2 s

We also captured perf record output on the Go version. The top frames were all in state.(*State).refresh and graph.(*Graph).walk, which together accounted for 73 percent of CPU time. In the Rust version, 89 percent of CPU time went to serde_json::from_reader. That was a clear sign serialization had become the new bottleneck, but one we could tune with custom allocators.

Our operators immediately noticed fewer false drift reports, and onboarding new environments dropped from 45 minutes to 8 minutes because the Rust service could stream partial states.

What I Would Do Differently

If I could go back, I would not have trusted the Terraform documentation for state scaling. I would have written a quick Go benchmark that simulated our graph size before we even upgraded the cluster. That benchmark would have shown the quadratic behavior at 6 000 resources—two weeks before we hit 13 000.
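A benchmark of that kind does not need to be elaborate. The sketch below is a toy, written in Rust to match the rest of this post rather than in Go, and it does not reproduce Terraform's actual graph walker. It builds a hypothetical chain of resources and counts the work done by a deliberately naive dependency lookup against an indexed one. All names (Resource, synthetic_chain, naive_walk_cost, indexed_walk_cost) are made up for illustration.

```rust
use std::collections::HashMap;

/// A synthetic resource: an address plus the addresses it depends on.
pub struct Resource {
    pub addr: String,
    pub deps: Vec<String>,
}

/// Build a chain of `n` resources, each depending on the previous one.
pub fn synthetic_chain(n: usize) -> Vec<Resource> {
    (0..n)
        .map(|i| Resource {
            addr: format!("pod.{}", i),
            deps: if i == 0 {
                Vec::new()
            } else {
                vec![format!("pod.{}", i - 1)]
            },
        })
        .collect()
}

/// Resolve every dependency by scanning the list from the start.
/// Returns the number of address comparisons performed.
pub fn naive_walk_cost(resources: &[Resource]) -> u64 {
    let mut comparisons = 0u64;
    for r in resources {
        for dep in &r.deps {
            for candidate in resources {
                comparisons += 1;
                if &candidate.addr == dep {
                    break;
                }
            }
        }
    }
    comparisons
}

/// Resolve every dependency through a prebuilt index.
/// Returns the number of successful lookups.
pub fn indexed_walk_cost(resources: &[Resource]) -> u64 {
    let index: HashMap<&str, usize> = resources
        .iter()
        .enumerate()
        .map(|(i, r)| (r.addr.as_str(), i))
        .collect();
    let mut lookups = 0u64;
    for r in resources {
        for dep in &r.deps {
            if index.contains_key(dep.as_str()) {
                lookups += 1;
            }
        }
    }
    lookups
}
```

For this chain the naive cost is n(n-1)/2 comparisons, so going from 6 000 to 13 000 resources multiplies the work by roughly (13 000 / 6 000)², or about 4.7, while the indexed cost grows only linearly. Counting operations instead of timing them keeps the result deterministic on a noisy CI runner.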

I would also have avoided the temptation to split state files into modules just to shrink file size. Module splitting helped with UI organization, but it hurt performance because Helm rendered the same labels into every module, creating duplicate resources in Terraform's graph. Instead, I would centralize the state in one Rust service and use Kubernetes finalizers to handle pod cleanup, decoupling the state lifecycle from the Helm chart lifecycle.

Finally, I would insist on a memory budget from day one. Terraform's Go runtime hides memory usage behind the GC, so engineers often assume more memory is always the solution. When we capped the Rust service at 512 MB and tuned jemalloc for low fragmentation