<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Josh Pollara</title>
    <description>The latest articles on DEV Community by Josh Pollara (@josh_pollara_2f8bb369b3f3).</description>
    <link>https://dev.to/josh_pollara_2f8bb369b3f3</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3494507%2F24ddf2df-3b7e-430d-b7fa-6b3eb2c9295e.png</url>
      <title>DEV Community: Josh Pollara</title>
      <link>https://dev.to/josh_pollara_2f8bb369b3f3</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/josh_pollara_2f8bb369b3f3"/>
    <language>en</language>
    <item>
      <title>The infrastructure stack is getting faster. Terraform is not.</title>
      <dc:creator>Josh Pollara</dc:creator>
      <pubDate>Sat, 01 Nov 2025 20:59:31 +0000</pubDate>
      <link>https://dev.to/josh_pollara_2f8bb369b3f3/the-infrastructure-stack-is-getting-faster-terraform-is-not-4dga</link>
      <guid>https://dev.to/josh_pollara_2f8bb369b3f3/the-infrastructure-stack-is-getting-faster-terraform-is-not-4dga</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;velocity-gap.tldr
• Every layer of the stack is getting faster except infrastructure
• Terraform's state system is the bottleneck, not the execution model
• This is a solvable engineering problem, not an inherent limitation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;Application deployment got fast. CI pipelines got fast. Container orchestration got fast. Observability got fast. Infrastructure provisioning did not. That's not an accident. It's architecture.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Look at the modern software stack. Kubernetes deployments converge in seconds. GitHub Actions runs complete in minutes. Observability platforms ingest and query terabytes in real time. Every layer has been optimized for velocity because velocity is table stakes. Except infrastructure. Terraform plans take minutes. Applies queue behind locks. State operations serialize. Developers wait. Platform teams work around. Executives ask why infrastructure is the slow part.&lt;/p&gt;

&lt;p&gt;The answer isn't that infrastructure is inherently slower. The answer is that Terraform's state system wasn't designed for the concurrency and scale modern teams demand. It was designed for solo operators managing dozens of resources, not distributed teams managing thousands. That design worked when it shipped. It doesn't work now. Not because the model is wrong, but because the execution substrate (flat files, global locks, filesystem semantics) can't deliver the performance the industry needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Teams are routing around Terraform because it's too slow
&lt;/h2&gt;

&lt;p&gt;The industry has responded in two predictable ways. One, abandon Terraform entirely and migrate to Crossplane or some Kubernetes-native control plane. Two, wrap Terraform in so much orchestration and tooling that developers never touch it directly. Crossplane requires a full rewrite (throw away modules, provider knowledge, operational muscle memory). Internal platforms add layers of custom orchestration on top of Terraform. Both are symptoms of the same diagnosis. Terraform works, but it doesn't work fast enough.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Problem Is Clear&lt;/strong&gt;&lt;br&gt;
Nobody wants to replace Terraform. They want Terraform to stop being the bottleneck. The ecosystem is irreplaceable. The execution speed is not.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Crossplane understood the problem but picked the wrong solution
&lt;/h2&gt;

&lt;p&gt;Here's what Crossplane got right. Infrastructure that reconciles continuously instead of waiting for humans to run commands. Drift detected and corrected automatically instead of discovered weeks later during the next apply. Declarative state that converges without manual intervention. That operational model is correct. The problem is everything else.&lt;/p&gt;

&lt;p&gt;Crossplane has no equivalent to &lt;code&gt;terraform plan&lt;/code&gt;. You can't preview changes before they happen. There's no diff, no dry-run, no "here's what will change" before you commit. You declare what you want in YAML, apply it, and hope it does what you expect. You're flying blind until applied. For teams used to Terraform's safety net (the plan output that shows exactly what will be created, modified, or destroyed), this is unacceptable. Change control goes out the window. You're back to "deploy and pray."&lt;/p&gt;

&lt;p&gt;Then there's the complexity tax. Crossplane doesn't work well out of the box. You can't just install it and start provisioning resources like you can with Terraform. You have to build Compositions (abstractions that wrap provider resources into higher-level APIs), write XRDs (CompositeResourceDefinitions that define your platform's interface), and in many cases write custom Functions or controllers to handle edge cases the generated providers don't cover. This is significant upfront work. Crossplane is really built for orgs with enough complexity to support a platform engineering team. If you're a small-to-medium team that just wants to provision infrastructure, Crossplane asks you to become a Kubernetes platform engineering shop first. That's not simplicity. That's a second full-time job.&lt;/p&gt;

&lt;p&gt;And you're locked into Kubernetes. Even if your application doesn't run on Kubernetes, even if you're just managing cloud resources, Crossplane forces you to operate a Kubernetes cluster (reliably, because it's now your infrastructure control plane), understand CRDs, debug controllers, and think in Kubernetes semantics. For teams that aren't already deeply invested in K8s, this is pure overhead.&lt;/p&gt;

&lt;p&gt;So teams end up in hybrid mode. Terraform for base infrastructure (networking, clusters, foundational resources) and Crossplane for application-specific resources (databases, buckets, queues that developers request). The pattern works, but it's an admission that neither tool is complete. You're maintaining two systems, two sets of expertise, two operational models. The quote that keeps appearing is "tools aren't all or nothing." That's pragmatism, not a solution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyfa0g9qwaiesjqkphvn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyfa0g9qwaiesjqkphvn.png" alt=" " width="800" height="480"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Crossplane forces you to choose. Stategraph gives you both.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stategraph gives you both
&lt;/h2&gt;

&lt;p&gt;Continuous reconciliation without flying blind. Automatic drift detection with full visibility into what will change. The operational model Crossplane promises, built on the foundation teams already trust. You don't abandon Terraform. You don't rewrite everything as Kubernetes resources. You don't need a platform engineering team just to get started. You point Terraform at Stategraph instead of S3 and DynamoDB, and you get the control plane characteristics modern infrastructure demands.&lt;/p&gt;

&lt;p&gt;Because state is a queryable graph, drift detection runs continuously in the background. The system always knows what's supposed to exist and what actually exists. When they diverge, it surfaces immediately. But unlike Crossplane, you still get plan output. Before any change applies, you see the diff. You see what will be created, modified, or destroyed. The safety net stays. Terraform's change control workflow stays. The preview-before-apply model that keeps infrastructure changes predictable stays. You just get it with continuous operation instead of manual runs.&lt;/p&gt;

&lt;p&gt;This isn't either-or. It's both. The reconciliation loop people want from Crossplane with the visibility and ecosystem they need from Terraform. No Kubernetes required. No compositions to write. No custom controllers. Just Terraform, running continuously, with the execution speed and operational characteristics the industry is demanding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix the state system, unlock the model
&lt;/h2&gt;

&lt;p&gt;Stategraph fixes the actual problem. Not by replacing Terraform, but by replacing the part of Terraform that doesn't scale (the state system). Instead of flat files and global locks, Stategraph treats state as a transactional graph database. Resources are nodes. Dependencies are edges. Updates are transactions with ACID guarantees. Concurrent applies lock only the subgraph they modify, not the entire state. Plans read from snapshots, so they never block. Drift detection is a background query, not a blocking operation.&lt;/p&gt;
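Stategraph's internals aren't public beyond this description, so purely to illustrate the subgraph-locking idea (all names invented; this is not Stategraph's implementation), here is a toy sketch in Python:

```python
# Toy sketch of subgraph locking: an apply locks only the resource
# addresses it touches, so applies over disjoint subgraphs proceed
# concurrently, while overlapping applies wait. Illustrative only.
import threading


class SubgraphLockManager:
    def __init__(self):
        self._cond = threading.Condition()
        self._locked = set()  # resource addresses currently held by some apply

    def acquire(self, resources):
        """Block until every address in `resources` is free, then lock them atomically."""
        resources = set(resources)
        with self._cond:
            while resources & self._locked:
                self._cond.wait()
            self._locked |= resources

    def release(self, resources):
        with self._cond:
            self._locked -= set(resources)
            self._cond.notify_all()  # wake applies waiting on freed addresses


manager = SubgraphLockManager()
manager.acquire({"aws_vpc.main", "aws_subnet.a"})
manager.acquire({"aws_s3_bucket.logs"})  # disjoint subgraph: not blocked
manager.release({"aws_vpc.main", "aws_subnet.a"})
manager.release({"aws_s3_bucket.logs"})
```

Contrast this with a single global lock, where the second apply would wait for the first even though the two touch unrelated resources.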

&lt;p&gt;The result is Terraform that performs like a modern system. Applies that used to serialize behind a global lock now run in parallel when they don't conflict. Plans that used to take minutes now take seconds because the system only reads what it needs. Developers stop waiting. Platform teams stop building workarounds. Infrastructure feels fast because it actually is fast.&lt;/p&gt;

&lt;p&gt;This isn't research. This is applying database concurrency patterns (row-level locking, MVCC, transactional isolation) to infrastructure state. Postgres does this. MySQL does this. Every modern database does this. Stategraph does it for Terraform state. The ecosystem stays. The modules stay. The providers stay. The execution engine changes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Engineering Reality&lt;/strong&gt;&lt;br&gt;
The hard part isn't the idea. It's building a backend that presents file-based semantics (because that's what Terraform expects) while implementing graph-based concurrency underneath. That's solvable. We're solving it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  When you fix the substrate, everything downstream changes
&lt;/h2&gt;

&lt;p&gt;When Terraform stops being slow, the downstream effects cascade. Platform teams can finally build the simple interfaces they've been trying to build. REST APIs that provision infrastructure instantly. CLIs that feel like &lt;code&gt;kubectl&lt;/code&gt;. Self-service portals where developers request environments and get them in seconds, not minutes. The backend is still Terraform (still using your modules, still enforcing your policies, still auditing every change), but developers don't see that. They see fast, reliable infrastructure that doesn't require understanding state locks.&lt;/p&gt;

&lt;p&gt;Executives get the velocity they're demanding without throwing away the maturity they need. Terraform stays. The governance stays. What changes is the execution speed. Infrastructure provisioning stops being the slow part of the stack. The system delivers what modern engineering organizations require, which is velocity and control, not velocity or control.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is not a hypothetical problem
&lt;/h2&gt;

&lt;p&gt;We see this at &lt;a href="https://terrateam.io" rel="noopener noreferrer"&gt;Terrateam&lt;/a&gt; constantly. Teams adopt Terraform because it's the right tool. They scale up. Velocity drops. Platform teams split state, add CI orchestration, implement queueing, build internal tools. It helps. It doesn't fix it. You can't fix a performance problem by adding more layers. You fix it by removing the bottleneck.&lt;/p&gt;

&lt;p&gt;Stategraph is the fix. A graph-native state engine that eliminates false serialization. Transactional semantics. MVCC concurrency that makes plans instant. Subgraph locking that lets teams work in parallel. This isn't a fork. It's a backend. You point Terraform at Stategraph instead of S3 and DynamoDB, and it gets fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we're building
&lt;/h2&gt;

&lt;p&gt;Stategraph starts by fixing state, but that's not the destination. It's the foundation that unlocks what comes next. Once Terraform has a graph-native substrate, teams can build the operational patterns they actually want. Continuous reconciliation becomes possible without abandoning the provider ecosystem. Platform teams can offer infrastructure that converges automatically while developers still get &lt;code&gt;terraform plan&lt;/code&gt; visibility. Policy and compliance can run in real time without blocking deployments. The control plane scales with complexity without losing correctness.&lt;/p&gt;

&lt;p&gt;This opens the door for what Terraform should have become. A mature ecosystem with modern execution semantics. Governance and velocity, not governance or velocity. The operational characteristics teams see in Kubernetes, built on the foundation they already trust.&lt;/p&gt;

&lt;p&gt;We're not building a better Terraform. We're building what teams can do with Terraform once the constraints disappear.&lt;/p&gt;




&lt;h2&gt;
  
  
  Technical Preview
&lt;/h2&gt;

&lt;p&gt;Stategraph is in development. Design partners welcome.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix state. Fix Terraform.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Graph-native storage. Row-level locking. MVCC concurrency.&lt;br&gt;
Your Terraform becomes as fast as the systems it manages.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://stategraph.dev/#updates" rel="noopener noreferrer"&gt;Get Updates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stategraph.dev/design-partners" rel="noopener noreferrer"&gt;Become a Design Partner&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Zero spam. Just progress updates as we build Stategraph.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>devops</category>
    </item>
    <item>
      <title>Terraform State: A Practical Guide to Backends, Locks and Safe CI/CD</title>
      <dc:creator>Josh Pollara</dc:creator>
      <pubDate>Sat, 04 Oct 2025 05:54:57 +0000</pubDate>
      <link>https://dev.to/josh_pollara_2f8bb369b3f3/terraform-state-a-practical-guide-to-backends-locks-and-safe-cicd-57dh</link>
      <guid>https://dev.to/josh_pollara_2f8bb369b3f3/terraform-state-a-practical-guide-to-backends-locks-and-safe-cicd-57dh</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;terraform-state.tldr
• State &lt;span class="o"&gt;=&lt;/span&gt; JSON map from Terraform config to real infrastructure
• Local state breaks with teams. Remote backend required &lt;span class="o"&gt;(&lt;/span&gt;S3/Azure/GCS&lt;span class="o"&gt;)&lt;/span&gt;
• Locking prevents concurrent writes that corrupt state
• Always encrypt, lock down access, never commit to Git
• CI/CD: remote backend + locking + IAM/RBAC credentials
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;Terraform state is your infrastructure's source of truth, but most teams treat it like an afterthought until something breaks. By the time you're debugging a corrupted state file at 2 AM or explaining to your CTO why prod is down because two engineers applied changes simultaneously, it's too late.&lt;/p&gt;

&lt;p&gt;State management is not optional infrastructure. It's the foundation that determines whether your Terraform workflows are reliable or a liability. The difference between teams that ship confidently and teams that fear every apply comes down to how they handle state.&lt;/p&gt;

&lt;p&gt;This guide covers everything you need to know about Terraform state for production environments: what state actually is, how to configure remote backends properly, why locking matters, how to secure sensitive data, and how to integrate state management into CI/CD without creating bottlenecks or security holes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Terraform State Actually Is
&lt;/h2&gt;

&lt;p&gt;Terraform state is a JSON file that maps your configuration code to real infrastructure resources. When you run &lt;code&gt;terraform apply&lt;/code&gt; for the first time, Terraform creates a &lt;code&gt;terraform.tfstate&lt;/code&gt; file in your working directory. This file becomes Terraform's database of what exists and where.&lt;/p&gt;

&lt;p&gt;Without state, Terraform cannot determine what infrastructure already exists or what needs to change. The state file records resource IDs, attributes, dependencies, and outputs. It's the binding between your declarative configuration and the actual resources running in your cloud provider.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Core Concept:&lt;/strong&gt; State is Terraform's source of truth. Your configuration describes what &lt;em&gt;should&lt;/em&gt; exist. State describes what &lt;em&gt;does&lt;/em&gt; exist. The diff between them is your plan.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every state file contains several critical components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource mappings&lt;/strong&gt; connect your &lt;code&gt;aws_instance.web_server&lt;/code&gt; to EC2 instance &lt;code&gt;i-01234abcd&lt;/code&gt;. This one-to-one mapping lets Terraform know exactly which real-world resource corresponds to which line of code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependency metadata&lt;/strong&gt; ensures operations happen in the correct order. Terraform won't delete a security group that an EC2 instance depends on, because the state file tracks these relationships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outputs&lt;/strong&gt; allow other configurations or automation tools to query values from your infrastructure. These are stored in state and can be referenced remotely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Serial and lineage&lt;/strong&gt; provide state versioning. The serial number increments with each change. The lineage ID uniquely identifies the state file's history. These prevent conflicting updates from mixing incompatible state histories.&lt;/p&gt;
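As a trimmed, illustrative excerpt (field values invented; real state files carry many more attributes), those components look roughly like this:

```json
{
  "version": 4,
  "terraform_version": "1.9.0",
  "serial": 42,
  "lineage": "3f2a9c1e-4b5c-6d7e-8f9a-0b1c2d3e4f5a",
  "outputs": {
    "web_ip": { "value": "203.0.113.10", "type": "string" }
  },
  "resources": [
    {
      "mode": "managed",
      "type": "aws_instance",
      "name": "web_server",
      "provider": "provider[\"registry.terraform.io/hashicorp/aws\"]",
      "instances": [
        {
          "attributes": { "id": "i-01234abcd" },
          "dependencies": ["aws_security_group.web"]
        }
      ]
    }
  ]
}
```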

&lt;h3&gt;
  
  
  State Lifecycle
&lt;/h3&gt;

&lt;p&gt;Before every plan or apply, Terraform refreshes state by checking actual infrastructure for changes. If someone manually terminated a VM outside Terraform, the refresh detects it and updates state accordingly. After a successful apply, Terraform writes a new state snapshot and saves the previous version as &lt;code&gt;terraform.tfstate.backup&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This lifecycle is automatic. You don't manually edit state files. Instead, you use Terraform CLI commands that handle state modifications safely and maintain format compatibility across versions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;terraform plan
→ Refresh state &lt;span class="o"&gt;(&lt;/span&gt;check real infrastructure&lt;span class="o"&gt;)&lt;/span&gt;
→ Compare config vs. state
→ Generate plan

&lt;span class="nv"&gt;$ &lt;/span&gt;terraform apply
→ Execute plan
→ Write new state snapshot
→ Backup previous state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Local State Fails at Scale
&lt;/h2&gt;

&lt;p&gt;The default local state file works for solo projects and learning, but it fails immediately when you add teammates or automation. Local state creates several problems that remote backends solve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No collaboration.&lt;/strong&gt; When state lives on your laptop, nobody else can run Terraform. If you're on vacation and production needs an emergency change, your team is stuck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No locking.&lt;/strong&gt; If two people somehow share a state file (via Dropbox, Git, or network drive), concurrent runs will corrupt state. There's no coordination mechanism to prevent simultaneous writes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No durability.&lt;/strong&gt; Laptop crashes, accidental deletions, and disk failures mean permanent state loss. Without state, Terraform thinks nothing exists and will try to recreate everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No history.&lt;/strong&gt; Local state keeps one backup file. If you need to roll back further or audit changes, you're out of luck.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Rule of Thumb:&lt;/strong&gt; If more than one person touches Terraform, or if any CI system runs it, you need remote state. Local state is a prototype-only solution.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Remote backends store state in a shared, durable location. All major cloud platforms offer backends that Terraform can use: S3 on AWS, Blob Storage on Azure, Cloud Storage on GCP, and managed options like Terraform Cloud. These backends add locking, versioning, encryption, and access control that local state cannot provide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring Remote Backends
&lt;/h2&gt;

&lt;p&gt;Using a remote backend requires two steps: configure the backend block in your Terraform code and run &lt;code&gt;terraform init&lt;/code&gt; to migrate state. Below are practical configurations for each major cloud provider.&lt;/p&gt;

&lt;h3&gt;
  
  
  S3 Backend (AWS)
&lt;/h3&gt;

&lt;p&gt;S3 provides durable object storage with versioning and encryption. A production S3 backend configuration looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bucket&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"my-terraform-state"&lt;/span&gt;
    &lt;span class="nx"&gt;key&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod/terraform.tfstate"&lt;/span&gt;
    &lt;span class="nx"&gt;region&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
    &lt;span class="nx"&gt;encrypt&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;use_lockfile&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# Native S3 locking (TF 1.5+)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enable S3 bucket versioning to recover from accidental deletions or corrupted state. Versioning keeps every state update as a separate object version, giving you a complete history.&lt;/p&gt;
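Versioning can be enabled alongside the bucket in Terraform itself; a minimal sketch using the AWS provider (v4+), assuming the bucket is declared elsewhere as `aws_s3_bucket.state`:

```hcl
# Keep every state update as a recoverable object version.
# Assumes the state bucket is defined elsewhere as aws_s3_bucket.state.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}
```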

&lt;p&gt;The &lt;code&gt;encrypt = true&lt;/code&gt; flag enables server-side encryption. Your state file contains resource IDs, IP addresses, and sometimes secrets. Encryption at rest is not optional.&lt;/p&gt;

&lt;p&gt;For state locking, Terraform 1.10+ supports native S3 locking via &lt;code&gt;use_lockfile = true&lt;/code&gt;. This creates a &lt;code&gt;.tflock&lt;/code&gt; object in the bucket to coordinate concurrent access. Older versions required a DynamoDB table for locking, but the S3-native approach is simpler and recommended for new deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credentials and access:&lt;/strong&gt; Never hardcode AWS keys in your backend config. Use environment variables (&lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt;, &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;) or IAM roles. For CI/CD, configure the pipeline to assume an IAM role with minimal permissions: &lt;code&gt;s3:GetObject&lt;/code&gt;, &lt;code&gt;s3:PutObject&lt;/code&gt;, and &lt;code&gt;s3:ListBucket&lt;/code&gt; on the specific state bucket and path only.&lt;/p&gt;
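An illustrative least-privilege policy matching the backend example above (bucket and key names are the invented ones from that example; note that lock-file locking also needs `s3:DeleteObject` so the `.tflock` object can be removed, which the wildcard below covers):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-terraform-state"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-terraform-state/prod/terraform.tfstate*"
    }
  ]
}
```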

&lt;h3&gt;
  
  
  Azure Blob Storage Backend
&lt;/h3&gt;

&lt;p&gt;Azure Storage accounts provide blob containers for state storage. Native locking via blob leases handles concurrency automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"azurerm"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;resource_group_name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"rg-terraform-state"&lt;/span&gt;
    &lt;span class="nx"&gt;storage_account_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tfstateaccount"&lt;/span&gt;
    &lt;span class="nx"&gt;container_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tfstate"&lt;/span&gt;
    &lt;span class="nx"&gt;key&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod.tfstate"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Azure Storage encrypts data at rest by default. Restrict access via Azure RBAC or SAS tokens so only authorized users and service principals can read or write state. Disable public access on the storage account entirely and consider private endpoints to limit network exposure.&lt;/p&gt;

&lt;p&gt;The AzureRM backend handles locking automatically using blob leases. When Terraform writes state, it acquires a lease on the blob, preventing other processes from writing simultaneously. No additional coordination service required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication:&lt;/strong&gt; Use managed identities or service principals instead of hardcoding credentials. Supply authentication via environment variables or Azure CLI login rather than embedding secrets in your Terraform configuration.&lt;/p&gt;
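For a service principal, the azurerm backend reads credentials from environment variables; a sketch with placeholder values:

```shell
# Service principal credentials for the azurerm backend.
# All values below are placeholders; inject real ones from your secret store.
export ARM_SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"
export ARM_TENANT_ID="00000000-0000-0000-0000-000000000000"
export ARM_CLIENT_ID="00000000-0000-0000-0000-000000000000"
export ARM_CLIENT_SECRET="placeholder-secret"
```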

&lt;h3&gt;
  
  
  Google Cloud Storage Backend
&lt;/h3&gt;

&lt;p&gt;GCS buckets store state with automatic locking via generation numbers and preconditions. Enable object versioning to keep historical state versions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"gcs"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"my-terraform-state"&lt;/span&gt;
    &lt;span class="nx"&gt;prefix&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform/state/prod"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform places the state file under the specified prefix path. The workspace name gets appended automatically, so the default workspace creates &lt;code&gt;terraform/state/prod/default.tfstate&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;GCS encrypts data at rest by default. For additional security, supply a customer-managed encryption key if required by your organization's policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Access control:&lt;/strong&gt; Use Cloud IAM to restrict the bucket. Grant the service account or user running Terraform &lt;code&gt;roles/storage.objectAdmin&lt;/code&gt; on the specific bucket or prefix. Ensure no public access. Handle credentials via &lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt; environment variable or gcloud application-default credentials.&lt;/p&gt;
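Expressed in Terraform (bucket, project, and service-account names invented), the binding might look like:

```hcl
# Grant the CI service account object-level access to the state bucket only.
resource "google_storage_bucket_iam_member" "terraform_state" {
  bucket = "my-terraform-state"
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:terraform@my-project.iam.gserviceaccount.com"
}
```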

&lt;h3&gt;
  
  
  Backend Migration
&lt;/h3&gt;

&lt;p&gt;Moving from local to remote state is straightforward. Add the backend block to your configuration and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;terraform init &lt;span class="nt"&gt;-migrate-state&lt;/span&gt;
Initializing the backend...
Do you want to copy existing state to the new backend? &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;yes&lt;/span&gt;/no&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;yes
&lt;/span&gt;Successfully configured the backend &lt;span class="s2"&gt;"s3"&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform detects the backend change, prompts for confirmation, and copies your local state to the remote backend. After migration, delete your local state file and rely entirely on the remote backend as the source of truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  State Locking: Why It Matters
&lt;/h2&gt;

&lt;p&gt;State locking prevents concurrent modifications that corrupt state. When multiple engineers or CI jobs run Terraform simultaneously without locking, you get race conditions where writes overwrite each other, leaving the state file inconsistent or broken.&lt;/p&gt;

&lt;p&gt;Terraform's locking mechanism is simple. For backends that support it, Terraform automatically acquires a lock before any write operation. If a lock already exists, Terraform waits until it's released. Only one process can hold the lock at a time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Locking Prevents Disasters:&lt;/strong&gt; Without locking, two concurrent applies can both read the same state, make different changes, and write back their versions. The second write wins, silently discarding the first. Resources get lost, state becomes corrupted, and recovery is painful.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All major remote backends support locking: S3 with lock files or DynamoDB, Azure with blob leases, GCS with object locking, Terraform Cloud with automatic locking. Local backends do not support locking, which is another reason they fail in team environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Locking Works
&lt;/h3&gt;

&lt;p&gt;When you run &lt;code&gt;terraform apply&lt;/code&gt;, Terraform attempts to acquire a lock before making changes. This happens automatically. You don't see it unless there's contention.&lt;/p&gt;

&lt;p&gt;If another process holds the lock, Terraform waits and displays a message like "Waiting for state lock." Once the lock releases, your operation proceeds. After finishing, Terraform releases the lock automatically.&lt;/p&gt;

&lt;p&gt;Terraform provides a &lt;code&gt;-lock=false&lt;/code&gt; flag to bypass locking, but using it is dangerous. Only disable locking in emergencies when you're absolutely certain no other process is running. The correct approach to lock contention is to fix the coordination problem, not disable the safety mechanism.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stuck Locks and Recovery
&lt;/h3&gt;

&lt;p&gt;If Terraform crashes or a process terminates abnormally, the lock might not release. Your next run will fail with a lock error and display a lock ID.&lt;/p&gt;

&lt;p&gt;First, verify no Terraform process is actually running. Check your CI jobs, ask your teammates, ensure nothing's applying. Then force-unlock using the lock ID from the error message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;terraform force-unlock 8a1d2f3e-4b5c-6d7e-8f9a-0b1c2d3e4f5a
Do you really want to force-unlock?
  Terraform will remove the lock on the remote state.
  This will allow other Terraform commands to obtain a lock.
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;yes
&lt;/span&gt;Terraform state has been successfully unlocked!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use force-unlock carefully. The lock ID acts as a safety check to prevent accidentally unlocking a different lock. If you see frequent stuck locks, fix the root cause (crashed processes, timeout issues, interrupted CI jobs) rather than routinely force-unlocking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security Best Practices
&lt;/h2&gt;

&lt;p&gt;Terraform state can contain sensitive information: resource IDs, IP addresses, database connection strings, and sometimes secrets in plaintext. Securing state is not optional for production environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Encryption at Rest
&lt;/h3&gt;

&lt;p&gt;Always encrypt state files at rest. Enable server-side encryption on your backend storage. For S3, use &lt;code&gt;encrypt = true&lt;/code&gt; in your backend config. For Azure and GCS, encryption is enabled by default, but verify it's active and consider customer-managed keys for additional control.&lt;/p&gt;
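
&lt;p&gt;For S3, that looks like the following sketch (the KMS key is optional and the ARN is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;terraform {
  backend "s3" {
    bucket     = "my-terraform-state"
    key        = "prod/terraform.tfstate"
    region     = "us-east-1"
    encrypt    = true                        # server-side encryption at rest
    kms_key_id = "arn:aws:kms:...:key/..."   # optional customer-managed key (placeholder ARN)
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;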

&lt;p&gt;Encryption in transit happens automatically via TLS when Terraform communicates with the backend. Ensure you're using HTTPS endpoints, never unencrypted HTTP.&lt;/p&gt;

&lt;h3&gt;
  
  
  Access Control
&lt;/h3&gt;

&lt;p&gt;Restrict who can read or write state. Use IAM policies, bucket policies, or RBAC to limit access to the state storage location. Only Terraform processes and administrators should have access.&lt;/p&gt;

&lt;p&gt;For AWS S3, grant minimal permissions to the CI/CD role: &lt;code&gt;s3:GetObject&lt;/code&gt;, &lt;code&gt;s3:PutObject&lt;/code&gt;, &lt;code&gt;s3:ListBucket&lt;/code&gt; on the specific bucket and path. Block public access entirely.&lt;/p&gt;
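
&lt;p&gt;A minimal policy along these lines captures that (bucket name and path are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-terraform-state"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-terraform-state/prod/*"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;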

&lt;p&gt;For Azure, use RBAC to grant the appropriate AAD principals access. Disable public access on the storage account and consider private endpoints to limit network exposure.&lt;/p&gt;

&lt;p&gt;For GCS, grant &lt;code&gt;roles/storage.objectAdmin&lt;/code&gt; on the specific bucket or prefix only. Ensure no public access.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Principle of Least Privilege:&lt;/strong&gt; State storage should be treated like a database of infrastructure secrets. Only the processes that need to read or write state should have access. Everyone else gets denied.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Credential Handling
&lt;/h3&gt;

&lt;p&gt;Never hardcode credentials in backend configurations. Use environment variables, IAM roles, managed identities, or service principals. Embedding secrets in your Terraform code means they end up in version control and CI logs.&lt;/p&gt;

&lt;p&gt;For AWS, use IAM roles so no static keys are required. For Azure, use managed identities or service principals with credentials supplied via environment. For GCP, use application-default credentials or service account key files referenced via environment variables.&lt;/p&gt;

&lt;p&gt;Terraform's backend configuration supports partial configuration, allowing you to omit sensitive fields from the config and supply them via environment or command-line flags.&lt;/p&gt;
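
&lt;p&gt;A sketch of partial configuration: leave the sensitive or environment-specific fields out of the block and supply them at init time (values are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;terraform {
  backend "s3" {
    # bucket, key, and region intentionally omitted (partial configuration)
  }
}

# Supplied at init time instead:
#   terraform init \
#     -backend-config="bucket=my-terraform-state" \
#     -backend-config="key=prod/terraform.tfstate" \
#     -backend-config="region=us-east-1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;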

&lt;h3&gt;
  
  
  Never Commit State to Git
&lt;/h3&gt;

&lt;p&gt;This is a common anti-pattern. State files contain secrets and should never go in version control. If you accidentally commit state, you must purge it from Git history and rotate any exposed credentials.&lt;/p&gt;

&lt;p&gt;Add &lt;code&gt;*.tfstate&lt;/code&gt; and &lt;code&gt;*.tfstate.backup&lt;/code&gt; to your &lt;code&gt;.gitignore&lt;/code&gt; immediately. Use remote backend versioning for state history, not Git.&lt;/p&gt;
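
&lt;p&gt;A typical set of entries (most teams also exclude the &lt;code&gt;.terraform/&lt;/code&gt; directory, which holds provider binaries and local backend metadata):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Terraform local artifacts: never commit these
*.tfstate
*.tfstate.backup
.terraform/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;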

&lt;h3&gt;
  
  
  Sensitive Data in State
&lt;/h3&gt;

&lt;p&gt;Terraform stores all resource attributes in state, including sensitive values. Marking outputs as &lt;code&gt;sensitive = true&lt;/code&gt; prevents them from displaying in CLI output, but they're still stored in plaintext in the state file.&lt;/p&gt;
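
&lt;p&gt;For example (the resource reference is hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;output "db_endpoint" {
  value     = aws_db_instance.main.endpoint  # hypothetical resource
  sensitive = true  # hidden in CLI output, still plaintext in the state file
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;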

&lt;p&gt;This is why state file encryption and access control matter. Some teams avoid putting highly sensitive secrets in Terraform-managed resources entirely, instead using HashiCorp Vault or cloud secret managers for dynamic secret injection.&lt;/p&gt;

&lt;p&gt;Weigh the convenience of Terraform-managed secrets against the exposure risk. For production systems with strict compliance requirements, consider external secret management integrated with Terraform rather than embedding secrets in configurations.&lt;/p&gt;

&lt;h2&gt;
  
  
  State Management in CI/CD
&lt;/h2&gt;

&lt;p&gt;Integrating Terraform into CI/CD pipelines requires careful state management. Pipelines run in ephemeral environments, so remote state is mandatory. Concurrent pipeline runs need locking. Credentials need secure injection. Get any of this wrong and you create security holes or broken state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Always Use Remote State
&lt;/h3&gt;

&lt;p&gt;When Terraform runs in CI, the pipeline environment doesn't retain local files between runs. Without a remote backend, each run starts from scratch and treats existing infrastructure as new, trying to recreate everything.&lt;/p&gt;

&lt;p&gt;Configure your CI jobs to use the same remote backend as developers. The pipeline initializes with &lt;code&gt;terraform init&lt;/code&gt;, pulls the latest state from the backend, runs plan or apply, and pushes state updates back.&lt;/p&gt;

&lt;h3&gt;
  
  
  Provide Secure Credentials
&lt;/h3&gt;

&lt;p&gt;CI jobs need credentials to access the remote state backend. Use your CI platform's secret management to inject credentials as environment variables at runtime.&lt;/p&gt;

&lt;p&gt;For AWS, configure the CI job to assume an IAM role with permissions to the S3 bucket and DynamoDB table (if using DynamoDB locking). For Azure, use a service principal with RBAC permissions to the storage account. For GCP, use a service account key stored in CI secrets and injected via &lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt;.&lt;/p&gt;
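
&lt;p&gt;As an illustrative fragment (syntax varies by CI platform; the variable names are placeholders), the pipeline references the secret store rather than literal values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  - name: apply
    run: terraform init &amp;amp;&amp;amp; terraform apply tfplan
    env:
      # Resolved from the CI platform's secret store at runtime,
      # never written into the repository or the pipeline file itself
      GOOGLE_APPLICATION_CREDENTIALS: $CI_SECRET_GCP_KEY_FILE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;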

&lt;p&gt;Never put credentials in your Terraform configuration or CI pipeline definition files. They should come from secure secret stores and exist only as environment variables during pipeline execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handle Locking and Concurrency
&lt;/h3&gt;

&lt;p&gt;In busy environments, multiple pipeline runs can trigger simultaneously. State locking serializes the applies to prevent conflicts. However, you should also configure your CI orchestrator to handle concurrency intelligently.&lt;/p&gt;

&lt;p&gt;Some CI systems allow queueing jobs per environment or setting concurrency limits. Use these features to prevent multiple applies from constantly fighting for the lock. Terraform's lock will work, but a better approach is pipeline-level coordination so only one job runs at a time per state.&lt;/p&gt;

&lt;p&gt;If using Terraform Cloud's remote runs, it handles queueing automatically. For self-managed CI, configure job concurrency limits per environment to reduce lock contention.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;plan&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform init &amp;amp;&amp;amp; terraform plan -out=tfplan&lt;/span&gt;
    &lt;span class="na"&gt;artifacts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tfplan&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apply&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform init &amp;amp;&amp;amp; terraform apply tfplan&lt;/span&gt;
    &lt;span class="na"&gt;requires&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;manual_approval&lt;/span&gt;
    &lt;span class="na"&gt;concurrency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# Only one apply per environment&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Separate State per Environment
&lt;/h3&gt;

&lt;p&gt;Your CI pipelines likely deploy to multiple environments: dev, staging, production. Each environment must use separate state to prevent accidental cross-environment changes.&lt;/p&gt;

&lt;p&gt;Common patterns include separate backend configurations per environment, different state file keys or prefixes, or Terraform workspaces. For example, your production pipeline uses &lt;code&gt;key = "prod/terraform.tfstate"&lt;/code&gt; while staging uses &lt;code&gt;key = "staging/terraform.tfstate"&lt;/code&gt;.&lt;/p&gt;
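
&lt;p&gt;For example, the same configuration can be initialized against a different state key in each pipeline (bucket and region settings omitted for brevity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;$ terraform init -backend-config="key=staging/terraform.tfstate"  # staging pipeline
$ terraform init -backend-config="key=prod/terraform.tfstate"     # prod pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;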

&lt;p&gt;This isolation ensures a deployment to dev doesn't accidentally read or write prod's state, reducing blast radius and enabling parallel development across environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Plan and Apply Stages
&lt;/h3&gt;

&lt;p&gt;Many pipelines split Terraform into separate plan and apply stages with manual approval in between. Both stages must use the same remote state.&lt;/p&gt;

&lt;p&gt;The plan stage runs &lt;code&gt;terraform plan -out=tfplan&lt;/code&gt; and saves the plan file as a pipeline artifact. The apply stage runs &lt;code&gt;terraform apply tfplan&lt;/code&gt; using the exact plan from the previous stage.&lt;/p&gt;

&lt;p&gt;Between plan and apply, state could change if someone else runs Terraform. The apply will detect this and fail, prompting a re-plan. Some teams implement additional checks or short-lived locks, but Terraform's built-in refresh on apply provides baseline safety.&lt;/p&gt;

&lt;h3&gt;
  
  
  Avoid Storing State in Pipeline Artifacts
&lt;/h3&gt;

&lt;p&gt;Rely on the remote backend as the source of truth, not pipeline artifacts. Saving state files between pipeline jobs creates confusion and risks applying with stale state.&lt;/p&gt;

&lt;p&gt;If you need to pass information to subsequent jobs, use &lt;code&gt;terraform output -json&lt;/code&gt; to extract outputs after apply rather than passing the raw state file around.&lt;/p&gt;
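
&lt;p&gt;For example (assuming an output named &lt;code&gt;vpc_id&lt;/code&gt; and &lt;code&gt;jq&lt;/code&gt; available in the job image):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;$ terraform output -json | jq -r '.vpc_id.value'
# Pass the value to the next job as an env var or artifact,
# not by shipping the state file around
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;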

&lt;h2&gt;
  
  
  Common Pitfalls and How to Fix Them
&lt;/h2&gt;

&lt;p&gt;Even with best practices, teams encounter state issues. Here are the most common problems and their solutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  State File Corrupted or Lost
&lt;/h3&gt;

&lt;p&gt;If your state file gets corrupted or accidentally deleted, and you have versioning enabled on your backend, retrieve the last good version.&lt;/p&gt;

&lt;p&gt;For S3, use the version history in the AWS console or CLI to restore a previous state version. For Azure and GCS, similar version recovery is available. For Terraform Cloud, state history is built-in.&lt;/p&gt;
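
&lt;p&gt;With S3 versioning enabled, recovery is a matter of locating and fetching the last good version (bucket and key are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;$ aws s3api list-object-versions \
    --bucket my-terraform-state --prefix prod/terraform.tfstate
$ aws s3api get-object --bucket my-terraform-state \
    --key prod/terraform.tfstate \
    --version-id &amp;lt;VERSION_ID&amp;gt; terraform.tfstate.restored
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;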

&lt;p&gt;If you have no backups, you'll need to reconstruct state by importing resources. Use &lt;code&gt;terraform import&lt;/code&gt; to bring existing resources under Terraform management by mapping them to your configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;terraform import aws_instance.web i-01234abcd
aws_instance.web: Importing from ID &lt;span class="s2"&gt;"i-01234abcd"&lt;/span&gt;...
Import successful!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is tedious for large infrastructures, which is why backend versioning is critical. Always enable it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Drift Between State and Reality
&lt;/h3&gt;

&lt;p&gt;Resources sometimes change outside Terraform when someone modifies infrastructure manually via the cloud console. Terraform detects this during the refresh phase of plan or apply.&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;terraform plan&lt;/code&gt; to see what differs between state and reality. Terraform will show changes needed to bring reality back in line with your configuration.&lt;/p&gt;

&lt;p&gt;If you want reality to win (adopting the manual change), update your configuration to match what currently exists, then run apply. If you want your configuration to win (reverting the manual change), just apply and Terraform will fix the drift.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stuck Lock Won't Release
&lt;/h3&gt;

&lt;p&gt;If you get a lock error, first confirm no other Terraform process is running. Then use &lt;code&gt;terraform force-unlock &amp;lt;LOCK_ID&amp;gt;&lt;/code&gt; with the ID from the error message.&lt;/p&gt;

&lt;p&gt;If this happens frequently, investigate why processes are crashing or getting interrupted. Fix the root cause rather than routinely force-unlocking.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resource Already Exists
&lt;/h3&gt;

&lt;p&gt;This occurs when you try to create a resource that already exists, often because it was provisioned outside Terraform or is managed in a different state file.&lt;/p&gt;

&lt;p&gt;The fix is importing the existing resource rather than trying to create it. Use &lt;code&gt;terraform import&lt;/code&gt; to bring it under Terraform management in your current state.&lt;/p&gt;

&lt;p&gt;If the resource exists in two different state files (a coordination problem), remove it from one using &lt;code&gt;terraform state rm&lt;/code&gt; to maintain the one-to-one mapping principle. Each real resource should be managed by exactly one Terraform state.&lt;/p&gt;
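
&lt;p&gt;For example, to drop the duplicate from the state that shouldn't own it (the resource address is hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;$ terraform state rm aws_s3_bucket.assets
Removed aws_s3_bucket.assets
Successfully removed 1 resource instance(s).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;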

&lt;h3&gt;
  
  
  State File Too Large
&lt;/h3&gt;

&lt;p&gt;If your state file contains thousands of resources, Terraform operations slow down. Large states also increase the chance of team coordination problems.&lt;/p&gt;

&lt;p&gt;The solution is splitting state into logical units. Separate by environment, application, or functional area. Use &lt;code&gt;terraform state mv&lt;/code&gt; to move resources between state files, or create new Terraform projects with separate backends for independent infrastructure components.&lt;/p&gt;

&lt;p&gt;Over-modularizing has its own costs (you must manage dependencies between states), so find a balance that limits the blast radius and keeps state files manageable.&lt;/p&gt;
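
&lt;p&gt;One sketch of moving a resource between two local state files (with remote backends, &lt;code&gt;terraform state pull&lt;/code&gt; each side first and push after; paths and the address are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;$ terraform state mv \
    -state=terraform.tfstate \
    -state-out=../networking/terraform.tfstate \
    aws_vpc.main aws_vpc.main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;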

&lt;h3&gt;
  
  
  Manual State Editing
&lt;/h3&gt;

&lt;p&gt;Never manually edit state files. A JSON formatting error can corrupt the entire state, and removing a resource from state doesn't destroy the real resource.&lt;/p&gt;

&lt;p&gt;Instead, use Terraform's state subcommands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;terraform state list&lt;/code&gt; shows all resources in state&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;terraform state show &amp;lt;resource&amp;gt;&lt;/code&gt; displays a resource's attributes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;terraform state rm &amp;lt;resource&amp;gt;&lt;/code&gt; removes a resource from state without destroying it&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;terraform state mv &amp;lt;source&amp;gt; &amp;lt;dest&amp;gt;&lt;/code&gt; renames or moves a resource within state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These commands operate safely on state without altering real infrastructure. Use them for cleanups, renames, and migrations. When in doubt, back up state first (most backends provide versioning for this).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Terraform state management is not the exciting part of infrastructure as code, but it's the foundation that determines whether your workflows are reliable or fragile.&lt;/p&gt;

&lt;p&gt;Use remote backends with encryption, versioning, and access controls. Enable state locking to prevent concurrent modifications. Never commit state to Git. Handle credentials securely via IAM roles, managed identities, or environment variables. Integrate state management properly into CI/CD with remote backends, secure credential injection, and concurrency controls. Know how to recover from common issues using Terraform's state subcommands.&lt;/p&gt;

&lt;p&gt;State is Terraform's database of what exists. Treat it accordingly. The teams that get this right ship confidently. The teams that ignore it spend their time firefighting corrupted state and explaining outages.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;State Management is Infrastructure:&lt;/strong&gt; You wouldn't run production databases without backups, encryption, and access controls. Your Terraform state deserves the same care. It's the system of record for your entire infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Want to see how graph-based state management can eliminate lock contention? Check out &lt;a href="https://stategraph.dev" rel="noopener noreferrer"&gt;Stategraph&lt;/a&gt; - we're building resource-level locking and graph state so teams can work in parallel without blocking each other.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
    </item>
    <item>
      <title>Inside Terraform's DAG: How Dependency Ordering Really Works</title>
      <dc:creator>Josh Pollara</dc:creator>
      <pubDate>Mon, 29 Sep 2025 21:47:55 +0000</pubDate>
      <link>https://dev.to/josh_pollara_2f8bb369b3f3/inside-terraforms-dag-how-dependency-ordering-really-works-alk</link>
      <guid>https://dev.to/josh_pollara_2f8bb369b3f3/inside-terraforms-dag-how-dependency-ordering-really-works-alk</guid>
      <description>&lt;p&gt;Every &lt;code&gt;terraform plan&lt;/code&gt; starts with graph construction. Before Terraform talks to a single cloud API, before it compares state to configuration, it builds a dependency graph. This graph is the engine. Everything else is orchestration.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;terraform-dag.tldr
• Terraform builds a Directed Acyclic Graph &lt;span class="o"&gt;(&lt;/span&gt;DAG&lt;span class="o"&gt;)&lt;/span&gt; from your configuration
• Implicit dependencies &lt;span class="o"&gt;(&lt;/span&gt;resource references&lt;span class="o"&gt;)&lt;/span&gt; + explicit &lt;span class="o"&gt;(&lt;/span&gt;depends_on&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; edges
• Graph walker executes up to 10 resources &lt;span class="k"&gt;in &lt;/span&gt;parallel &lt;span class="o"&gt;(&lt;/span&gt;default&lt;span class="o"&gt;)&lt;/span&gt;
• Unknown values during plan &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt; placeholders
• The DAG is regenerated on every terraform plan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Graphs?
&lt;/h2&gt;

&lt;p&gt;Infrastructure has dependencies. You can't attach an EC2 instance to a subnet that doesn't exist. You can't reference an RDS endpoint before the database is created. You can't destroy a VPC while instances are still running inside it.&lt;/p&gt;

&lt;p&gt;The naive approach is sequential: create everything in the order you write it. That's slow. The dangerous approach is fully parallel: create everything at once and hope. That breaks.&lt;/p&gt;

&lt;p&gt;Terraform uses a Directed Acyclic Graph (DAG). Resources are nodes. Dependencies are edges. The graph ensures correct ordering while maximizing parallelism. If two resources don't depend on each other, Terraform creates them simultaneously. If one depends on another, Terraform waits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;terraform graph | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\-&lt;/span&gt;&lt;span class="s2"&gt;&amp;gt;"&lt;/span&gt;
847

&lt;span class="c"&gt;# 847 dependency edges across 312 resources&lt;/span&gt;
&lt;span class="c"&gt;# Every edge: "Resource X must complete before Y starts"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The DAG isn't an optimization. It's the correctness guarantee. Without it, Terraform can't promise your infrastructure will be created in a valid order.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Graph Is Built
&lt;/h2&gt;

&lt;p&gt;When you run &lt;code&gt;terraform plan&lt;/code&gt;, Terraform constructs the dependency graph through a series of well-defined steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Parse Configuration:&lt;/strong&gt; Terraform reads your HCL files and creates a resource node for every declared resource. If you have &lt;code&gt;count = 3&lt;/code&gt;, that's three nodes. If you use &lt;code&gt;for_each&lt;/code&gt;, each instance becomes its own node.&lt;/p&gt;
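
&lt;p&gt;For instance, this single block produces three graph nodes (a minimal sketch):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_instance" "web" {
  count         = 3              # nodes: aws_instance.web[0], web[1], web[2]
  ami           = "ami-12345678"
  instance_type = "t3.micro"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;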

&lt;p&gt;&lt;strong&gt;2. Add Provider Dependencies:&lt;/strong&gt; Every resource depends on its provider being configured. Terraform adds edges from each &lt;code&gt;aws_instance&lt;/code&gt; to the AWS provider node, from each &lt;code&gt;google_compute_instance&lt;/code&gt; to the Google provider node. This guarantees provider initialization happens first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Apply Explicit depends_on:&lt;/strong&gt; If you've declared &lt;code&gt;depends_on = [aws_s3_bucket.example]&lt;/code&gt;, Terraform adds that edge immediately. Explicit dependencies override the default behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Include Orphaned Resources:&lt;/strong&gt; Resources in state but not in configuration become nodes marked for destruction. Terraform adds them to the graph so they can be removed in the correct order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Infer Implicit Dependencies:&lt;/strong&gt; This is where the magic happens. Terraform's expression evaluator analyzes every resource attribute for references to other resources. Any reference like &lt;code&gt;aws_instance.app.vpc_id&lt;/code&gt; automatically creates an edge: the instance depends on the VPC.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/16"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_subnet"&lt;/span&gt; &lt;span class="s2"&gt;"app"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;  &lt;span class="c1"&gt;# Implicit dependency&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.1.0/24"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"web"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;  &lt;span class="c1"&gt;# Implicit dependency&lt;/span&gt;
  &lt;span class="nx"&gt;ami&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ami-12345678"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# DAG edges: vpc → subnet → instance&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;6. Add Root Node:&lt;/strong&gt; Terraform inserts an artificial root node that points to all top-level resources. This gives the graph a single entry point for traversal. The root node doesn't execute anything—it's purely structural.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Handle Replacements:&lt;/strong&gt; If a resource must be destroyed and recreated (because you changed an immutable attribute), Terraform splits it into separate destroy and create nodes. By default: destroy first, then create. With &lt;code&gt;create_before_destroy = true&lt;/code&gt;: create first, then destroy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Validate for Cycles:&lt;/strong&gt; Finally, Terraform checks that the graph is acyclic. If it finds a circular dependency (A depends on B, B depends on C, C depends on A), it errors immediately. Cycles are unresolvable.&lt;/p&gt;
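
&lt;p&gt;A minimal sketch of an unresolvable cycle (hypothetical resources; &lt;code&gt;terraform plan&lt;/code&gt; rejects this with a cycle error):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_security_group" "a" {
  depends_on = [aws_security_group.b]  # A waits for B...
}

resource "aws_security_group" "b" {
  depends_on = [aws_security_group.a]  # ...and B waits for A: cycle
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;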

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; The order you write resources in .tf files doesn't matter. Terraform only cares about the dependency graph, not file order. You could declare your VPC after your instances—Terraform will still create the VPC first because the graph says so.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Implicit vs Explicit Dependencies
&lt;/h2&gt;

&lt;p&gt;Most dependencies in Terraform are &lt;strong&gt;implicit&lt;/strong&gt;—inferred automatically from resource references. This is by design. If you reference another resource's attribute, you obviously depend on it existing.&lt;/p&gt;

&lt;p&gt;Explicit dependencies via &lt;code&gt;depends_on&lt;/code&gt; are for rare cases where the dependency isn't captured by data flow. The classic example: a service that must wait for another service to be running, but doesn't directly consume its data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_s3_bucket"&lt;/span&gt; &lt;span class="s2"&gt;"logs"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"app-logs"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"app"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;ami&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ami-12345678"&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t3.micro"&lt;/span&gt;

  &lt;span class="c1"&gt;# Instance doesn't reference bucket attributes,&lt;/span&gt;
  &lt;span class="c1"&gt;# but app config assumes bucket exists&lt;/span&gt;
  &lt;span class="nx"&gt;depends_on&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_s3_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt; Overusing &lt;code&gt;depends_on&lt;/code&gt; makes plans more conservative. Terraform will mark more values as unknown during planning, showing &lt;code&gt;(known after apply)&lt;/code&gt; even when it could compute them earlier. Use explicit dependencies sparingly.&lt;/p&gt;

&lt;p&gt;Best practice: Let implicit dependencies do the work. Only reach for &lt;code&gt;depends_on&lt;/code&gt; when you're waiting on side effects, not data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Graph Walking: Execution with Parallelism
&lt;/h2&gt;

&lt;p&gt;Once the graph is built, Terraform walks it to execute the plan. The algorithm is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find all nodes whose dependencies are satisfied&lt;/li&gt;
&lt;li&gt;Execute those nodes in parallel (up to &lt;code&gt;-parallelism&lt;/code&gt; limit)&lt;/li&gt;
&lt;li&gt;When a node completes, check if any waiting nodes can now start&lt;/li&gt;
&lt;li&gt;Repeat until all nodes are complete or an error occurs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By default, Terraform runs 10 operations concurrently. If you have 50 independent resources, Terraform will process 10 at a time, starting the next as each finishes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;terraform apply &lt;span class="nt"&gt;-parallelism&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10

aws_vpc.main: Creating...
aws_s3_bucket.logs: Creating...
aws_iam_role.app: Creating...
&lt;span class="c"&gt;# ^ No dependencies, all start in parallel&lt;/span&gt;

aws_vpc.main: Creation &lt;span class="nb"&gt;complete&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;12s]
aws_subnet.app: Creating...
&lt;span class="c"&gt;# ^ Started immediately after VPC completed&lt;/span&gt;

aws_subnet.app: Creation &lt;span class="nb"&gt;complete&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;6s]
aws_instance.web: Creating...
&lt;span class="c"&gt;# ^ Waited for subnet, now executing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dependency edges ensure correctness. The parallelism ensures speed. Terraform won't start a resource until all its dependencies are satisfied, but it won't wait unnecessarily either.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Graph Execution is Deterministic:&lt;/strong&gt; Given the same configuration and state, Terraform will always produce the same graph and execute nodes in the same relative order. The DAG guarantees consistency across runs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Unknown Values and Plan-Time Constraints
&lt;/h2&gt;

&lt;p&gt;Here's the problem: during &lt;code&gt;terraform plan&lt;/code&gt;, Terraform doesn't know the ID of a VPC that doesn't exist yet. It doesn't know the IP address of an RDS instance that hasn't been created. But other resources might reference these values.&lt;/p&gt;

&lt;p&gt;Terraform's solution: &lt;strong&gt;unknown value placeholders&lt;/strong&gt;. During planning, any value that depends on a not-yet-created resource is marked as &lt;code&gt;(known after apply)&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;terraform plan

&lt;span class="c"&gt;# aws_vpc.main will be created&lt;/span&gt;
  + resource &lt;span class="s2"&gt;"aws_vpc"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      + &lt;span class="nb"&gt;id&lt;/span&gt;         &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + cidr_block &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/16"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# aws_subnet.app will be created&lt;/span&gt;
  + resource &lt;span class="s2"&gt;"aws_subnet"&lt;/span&gt; &lt;span class="s2"&gt;"app"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      + vpc_id     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + cidr_block &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.1.0/24"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform's expression engine propagates unknowns automatically. If you concatenate a known string with an unknown ID, the result is unknown. If you pass an unknown value into a child module, any resource using it sees it as unknown.&lt;/p&gt;

&lt;p&gt;This mechanism is crucial. It allows Terraform to build a valid plan without executing anything. The plan is a promise: "If nothing external changes after this plan, apply will perform exactly these actions."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deferred Data Sources:&lt;/strong&gt; If a data source depends on an unknown value (like fetching AMI details based on a VPC that doesn't exist yet), Terraform defers reading it until apply. You'll see &lt;code&gt;(data resources may read after apply)&lt;/code&gt; in the plan.&lt;/p&gt;

&lt;p&gt;Unknown values are a big part of why Terraform uses a custom DSL. HCL's evaluator propagates unknowns through every expression automatically; a general-purpose language would need that tracking bolted on by hand.&lt;/p&gt;
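&lt;p&gt;A minimal sketch of the idea in Python (not Terraform's actual implementation, which uses cty unknowns in Go): a sentinel value that absorbs any operation performed on it.&lt;/p&gt;

```python
# Sketch of unknown-value propagation: any expression touching an
# unknown produces an unknown (hypothetical values, simplified model).

class Unknown:
    """Placeholder for a value that is (known after apply)."""
    def __repr__(self):
        return "(known after apply)"
    def __add__(self, other):
        return self      # combining unknown with anything stays unknown
    __radd__ = __add__

vpc_id = Unknown()                     # aws_vpc.main.id during plan
tag = "managed-" + "prod"              # fully known expression

print(tag)                             # managed-prod
print("subnet-in-" + vpc_id)           # (known after apply)
print(vpc_id + "-suffix")              # (known after apply)
```

&lt;p&gt;Real unknowns also carry type information, so Terraform can still type-check a plan even when values are deferred.&lt;/p&gt;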

&lt;h2&gt;
  
  
  Modules Don't Break the Graph
&lt;/h2&gt;

&lt;p&gt;Modules in Terraform are organizational boundaries, not execution boundaries. When you call a module, Terraform doesn't create a separate graph. It integrates all module resources into one unified graph.&lt;/p&gt;

&lt;p&gt;Dependencies flow across module boundaries via inputs and outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"network"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./modules/network"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"compute"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./modules/compute"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;network&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_id&lt;/span&gt;  &lt;span class="c1"&gt;# Implicit dependency&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;network&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;subnet_id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Graph edges: network resources → compute resources&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If module B takes an input from module A's output, Terraform traces that output back to the resource that produces it. It then creates edges: module A's resources must complete before module B's resources start.&lt;/p&gt;

&lt;p&gt;You usually don't need &lt;code&gt;depends_on&lt;/code&gt; between modules. Data flow establishes ordering automatically.&lt;/p&gt;

&lt;p&gt;Since Terraform 0.13, you can use &lt;code&gt;depends_on&lt;/code&gt; in module blocks for cases where modules don't exchange data but still need ordering. Terraform interprets this by adding edges from all resources in the dependency module to all resources in the dependent module.&lt;/p&gt;

&lt;p&gt;Warning: module-level &lt;code&gt;depends_on&lt;/code&gt; can serialize what could be concurrent. Use sparingly.&lt;/p&gt;
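&lt;p&gt;To see the cost, here's a toy edge-count comparison with hypothetical module contents: data flow creates one edge per actual reference, while module-level &lt;code&gt;depends_on&lt;/code&gt; implies all-to-all edges.&lt;/p&gt;

```python
# Illustrative edge counts: data-flow dependencies vs. module-level
# depends_on (hypothetical module contents, not real Terraform output).

network = ["aws_vpc.main", "aws_subnet.a", "aws_subnet.b"]
compute = ["aws_instance.web", "aws_instance.worker"]

# Data flow: only the resource that consumes an output gets an edge.
data_flow_edges = [("aws_vpc.main", "aws_instance.web")]

# module.compute depends_on module.network: every network resource
# must finish before any compute resource starts.
depends_on_edges = [(n, c) for n in network for c in compute]

print(len(data_flow_edges))   # 1
print(len(depends_on_edges))  # 6
```

&lt;p&gt;Six edges instead of one means less available parallelism and more conservative plans, which is why data references are preferable when they exist.&lt;/p&gt;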

&lt;h2&gt;
  
  
  Error Handling: No Automatic Rollback
&lt;/h2&gt;

&lt;p&gt;When a resource creation fails, Terraform stops. It does not roll back successful operations.&lt;/p&gt;

&lt;p&gt;This is deliberate. If Terraform successfully created an IAM role, then failed to create an EC2 instance, why destroy the role? The role is fine. You can fix the instance config and re-run apply. The role will already exist (no changes needed), and Terraform will proceed to create the instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws_iam_role.app: Creation &lt;span class="nb"&gt;complete&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;2s]
aws_s3_bucket.logs: Creation &lt;span class="nb"&gt;complete&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;3s]
aws_instance.web: Error creating instance: InvalidSubnetID

Error: Apply failed

&lt;span class="c"&gt;# IAM role and S3 bucket remain in place&lt;/span&gt;
&lt;span class="c"&gt;# State file updated to reflect them&lt;/span&gt;
&lt;span class="c"&gt;# Fix config and re-run apply&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare this to AWS CloudFormation, which by default rolls back the entire stack on failure. CloudFormation's approach leaves you with a clean slate but destroys successful work and can mask the root cause.&lt;/p&gt;

&lt;p&gt;Terraform's approach: failures leave partial infrastructure. You're responsible for cleanup or continuation. Most teams prefer this—infrastructure changes shouldn't be undone just because a later step failed.&lt;/p&gt;
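&lt;p&gt;The behavior can be sketched as a toy apply loop (hypothetical resource names, not Terraform's internals): each success is committed to state immediately, so a later failure leaves earlier work in place.&lt;/p&gt;

```python
# Toy apply loop: successes are recorded in state as they complete,
# so a failure midway leaves them intact, with no rollback.

def apply(ordered_resources, create, state):
    for res in ordered_resources:
        try:
            state[res] = create(res)   # persist immediately, like tfstate
        except RuntimeError as err:
            print("Error:", err)
            return state               # stop; earlier work is kept
    return state

def create(res):
    if res == "aws_instance.web":
        raise RuntimeError("InvalidSubnetID")
    return "created"

state = {}
apply(["aws_iam_role.app", "aws_s3_bucket.logs", "aws_instance.web"],
      create, state)
print(sorted(state))   # ['aws_iam_role.app', 'aws_s3_bucket.logs']
```

&lt;p&gt;On the next apply, the two recorded resources show no changes and Terraform proceeds straight to the failed instance.&lt;/p&gt;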

&lt;h2&gt;
  
  
  Destroy Is Apply in Reverse
&lt;/h2&gt;

&lt;p&gt;When you run &lt;code&gt;terraform destroy&lt;/code&gt;, Terraform uses the same graph—but walks it in reverse dependency order.&lt;/p&gt;

&lt;p&gt;If resource A depends on B, Terraform creates A after B. During destroy, Terraform deletes A before B. The edges encode dependency, not a fixed walk direction: the orchestrator walks them forward for creation and reversed for destruction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;terraform destroy

aws_instance.web: Destroying...
&lt;span class="c"&gt;# ^ Instances first&lt;/span&gt;
aws_instance.web: Destruction &lt;span class="nb"&gt;complete&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;42s]

aws_subnet.app: Destroying...
&lt;span class="c"&gt;# ^ Then subnets&lt;/span&gt;
aws_subnet.app: Destruction &lt;span class="nb"&gt;complete&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;8s]

aws_vpc.main: Destroying...
&lt;span class="c"&gt;# ^ Finally VPC&lt;/span&gt;
aws_vpc.main: Destruction &lt;span class="nb"&gt;complete&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;6s]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents Terraform from deleting a VPC before the instances in it, or destroying a module's outputs before the resources depending on them are gone.&lt;/p&gt;
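&lt;p&gt;A small sketch using Python's standard-library &lt;code&gt;graphlib&lt;/code&gt; (illustrative resources, not Terraform's internals): one dependency graph, two walk orders.&lt;/p&gt;

```python
# Create in topological order, destroy in the reverse of that order.

from graphlib import TopologicalSorter

# Each key maps to its predecessors: vpc before subnet before instance.
deps = {
    "aws_subnet.app": {"aws_vpc.main"},
    "aws_instance.web": {"aws_subnet.app"},
}

create_order = list(TopologicalSorter(deps).static_order())
destroy_order = list(reversed(create_order))

print(create_order)   # ['aws_vpc.main', 'aws_subnet.app', 'aws_instance.web']
print(destroy_order)  # ['aws_instance.web', 'aws_subnet.app', 'aws_vpc.main']
```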

&lt;h2&gt;
  
  
  Common Pitfalls
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Missing dependencies:&lt;/strong&gt; If you forget to reference a dependency, Terraform might create resources in parallel that should be sequential. Always model real dependencies via data references (or explicit &lt;code&gt;depends_on&lt;/code&gt; as a last resort).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependency cycles:&lt;/strong&gt; Terraform will catch direct cycles and error. But logical cycles (two resources that each need the other's ID) can't be resolved in one apply. You must break the cycle—often by using placeholder values or splitting into multiple runs.&lt;/p&gt;
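&lt;p&gt;Direct cycle detection is cheap and happens before any API call; a sketch with Python's &lt;code&gt;graphlib&lt;/code&gt; and a hypothetical two-resource cycle:&lt;/p&gt;

```python
# A topological sort fails fast on a cycle, which is conceptually what
# Terraform's graph validation does (illustrative resource names).

from graphlib import TopologicalSorter, CycleError

# Two security groups that each reference the other's ID.
deps = {
    "aws_security_group.a": {"aws_security_group.b"},
    "aws_security_group.b": {"aws_security_group.a"},
}

try:
    list(TopologicalSorter(deps).static_order())
except CycleError as err:
    print("Cycle detected:", err.args[1])
```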

&lt;p&gt;&lt;strong&gt;Over-constraining with depends_on:&lt;/strong&gt; Adding unnecessary &lt;code&gt;depends_on&lt;/code&gt; slows apply and makes plans more conservative (more unknowns). Use explicit dependencies only when Terraform genuinely can't infer them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using &lt;code&gt;-target&lt;/code&gt; carelessly:&lt;/strong&gt; &lt;code&gt;terraform apply -target=resource.name&lt;/code&gt; ignores resources not in the target (except direct dependencies). This can violate overall dependency rules. Use &lt;code&gt;-target&lt;/code&gt; for debugging, not routine deploys.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Graph Construction is Stateless:&lt;/strong&gt; Terraform regenerates the graph on every plan. It doesn't remember previous dependency ordering. The graph always reflects current configuration, not historical state.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why Terraform's DAG Matters for State Storage
&lt;/h2&gt;

&lt;p&gt;The DAG proves that Terraform understands infrastructure as graph-structured data. But Terraform stores that graph as a flat JSON file.&lt;/p&gt;

&lt;p&gt;Every operation deserializes the entire state file, operates on it in memory, and serializes it back. Even when you're modifying one resource out of 2,847.&lt;/p&gt;

&lt;p&gt;The DAG knows exactly which resources need refreshing. It knows which subgraph is affected by your change. But because state is a file with a global lock, Terraform refreshes everything and blocks everything.&lt;/p&gt;

&lt;p&gt;Terraform spent years solving the hard problem: graph-based dependency ordering with parallelism, unknowns, and safety guarantees. Then it stores the result in a format that can't leverage any of it.&lt;/p&gt;

&lt;p&gt;This is the architectural mismatch at the heart of Terraform's scalability problems. The execution engine is graph-native. The storage layer is file-native. And that mismatch is why teams hit walls at scale.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# What the DAG knows:&lt;/span&gt;
Only 12 resources &lt;span class="k"&gt;in &lt;/span&gt;this subgraph need refreshing
Only 4 resources need locking
2,835 other resources can proceed &lt;span class="k"&gt;in &lt;/span&gt;parallel

&lt;span class="c"&gt;# What the state file forces:&lt;/span&gt;
Refresh all 2,847 resources
Lock all 2,847 resources
Block all other operations

&lt;span class="c"&gt;# The DAG is O(subgraph). The state file is O(everything).&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://stategraph.dev" rel="noopener noreferrer"&gt;Stategraph&lt;/a&gt; fixes this by storing state as an actual graph, in a database, with row-level locking. The execution model Terraform already uses. Just with a storage layer that matches it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://stategraph.dev/blog/terraform-dag-internals/" rel="noopener noreferrer"&gt;stategraph.dev&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Building &lt;a href="https://stategraph.dev" rel="noopener noreferrer"&gt;Stategraph&lt;/a&gt; - graph-native Terraform state storage with subgraph isolation, row-level locks, and SQL-queryable state.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
    </item>
    <item>
      <title>Stategraph: Terraform state as a distributed systems problem</title>
      <dc:creator>Josh Pollara</dc:creator>
      <pubDate>Thu, 25 Sep 2025 20:31:10 +0000</pubDate>
      <link>https://dev.to/josh_pollara_2f8bb369b3f3/stategraph-terraform-state-as-a-distributed-systems-problem-hlm</link>
      <guid>https://dev.to/josh_pollara_2f8bb369b3f3/stategraph-terraform-state-as-a-distributed-systems-problem-hlm</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;• Terraform state shows distributed coordination issues but uses file primitives.&lt;br&gt;
• File blob (100% read/lock) vs. change cone (~3%).&lt;br&gt;
• &lt;a href="https://stategraph.dev" rel="noopener noreferrer"&gt;Stategraph&lt;/a&gt; → graph state, ACID transactions, subgraph isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Terraform ecosystem has spent a decade working around a fundamental architectural mismatch: we're using filesystem semantics to solve a distributed systems problem. The result is predictable and painful.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When we started building infrastructure automation at scale, we discovered that Terraform's state management exhibits all the classic symptoms of impedance mismatch between data representation and access patterns. Teams implement increasingly elaborate workarounds: state file splitting, wrapper orchestration, external locking mechanisms. These aren't solutions; they're evidence that we're solving the wrong problem.&lt;/p&gt;

&lt;p&gt;Stategraph addresses this by treating state for what it actually is: a directed acyclic graph of resources with partial update semantics, not a monolithic document.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Pathology of File-Based State
&lt;/h2&gt;

&lt;p&gt;Terraform state, at its core, is a coordination problem. Multiple actors (engineers, CI systems, drift detection) need to read and modify overlapping subsets of infrastructure state concurrently. This is a well-studied problem in distributed systems, with established solutions around fine-grained locking, multi-version concurrency control, and transaction isolation.&lt;/p&gt;

&lt;p&gt;Instead, Terraform implements the simplest possible solution: a global mutex on a JSON file.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Observation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The probability of lock contention in a shared state file increases super-linearly with both team size and resource count. At 100 resources and 5 engineers, you're coordinating 500 potential interaction points through a single mutex.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Consider the actual data access patterns in a typical Terraform operation:&lt;/p&gt;
&lt;h3&gt;
  
  
  Current Model
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tfstate.json (2.3MB)
Read: 100%
Lock: 100%
Modify: 0.5%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Actual Requirement
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Graph nodes: VPC → Subnet → RDS → ALB → ASG → SG
Read: 3%
Lock: 3%
Modify: 3%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This mismatch between granularity of operation and granularity of locking is the root cause of every Terraform scaling problem. It violates the fundamental principle of isolation in concurrent systems: non-overlapping operations should not block each other.&lt;/p&gt;

&lt;p&gt;The standard response, splitting state files, doesn't solve the problem. It redistributes it. Now you have N coordination problems instead of one, plus the additional complexity of managing cross-state dependencies. You've traded false contention for distributed transaction coordination, which is arguably worse.&lt;/p&gt;
&lt;h2&gt;
  
  
  State as a Graph: The Natural Representation
&lt;/h2&gt;

&lt;p&gt;Infrastructure state is inherently a directed graph. Resources have dependencies, which form edges. Changes propagate along these edges. Terraform already knows this: the internal representation is a graph, and the planner performs graph traversal. But at the storage layer, we flatten this rich structure into a blob.&lt;/p&gt;

&lt;p&gt;This is akin to storing a B-tree in a CSV file. You can do it, but you're destroying the very properties that make the data structure useful.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;stategraph&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="c1"&gt;-- Find resource subgraph for planned change&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;affected&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;resources&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'prod-api-cluster'&lt;/span&gt;
    &lt;span class="k"&gt;UNION&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;resources&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dependencies&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dependent_id&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;affected&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;affected&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt; &lt;span class="n"&gt;resources&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt; &lt;span class="k"&gt;scope&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;003&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Compared&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;847&lt;/span&gt; &lt;span class="n"&gt;resources&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="k"&gt;full&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When state is properly normalized into a graph database, several properties emerge naturally:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subgraph isolation:&lt;/strong&gt; Operations on disjoint subgraphs are inherently parallelizable. If Team A is modifying RDS instances and Team B is updating CloudFront distributions, there's no shared state to coordinate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Precise locking:&lt;/strong&gt; We can implement row-level locking on resources and edge-level locking on dependencies. Lock acquisition follows the dependency graph, preventing deadlocks through consistent ordering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incremental refresh:&lt;/strong&gt; Given a change set, we can compute the minimal refresh set by traversing the dependency graph. Most changes affect a small cone of resources, not the entire state space.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concurrency Control Through Proper Abstractions
&lt;/h2&gt;

&lt;p&gt;The distributed systems community solved these problems decades ago. Multi-version concurrency control (MVCC) allows readers to proceed without blocking writers. Write-ahead logging provides durability without sacrificing performance. Transaction isolation levels let operators choose their consistency guarantees.&lt;/p&gt;

&lt;p&gt;Stategraph implements these patterns at the Terraform state layer:&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional: Global Lock
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;terraform apply
Acquiring global lock… waiting
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All resources locked (100%)&lt;/p&gt;

&lt;h3&gt;
  
  
  Stategraph: Subgraph Isolation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;stategraph apply
Locking subgraph &lt;span class="o"&gt;(&lt;/span&gt;3 resources&lt;span class="o"&gt;)&lt;/span&gt;… ready
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only affected resources locked (3%)&lt;/p&gt;

&lt;p&gt;Each operation acquires locks only on its subgraph. The lock manager uses the dependency graph to ensure consistent ordering, preventing deadlocks. Readers use MVCC to access consistent snapshots without blocking writers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Implementation Detail&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lock acquisition follows a strict partial order derived from the resource dependency graph. Resources are locked in topological order, with ties broken by resource ID. This guarantees deadlock freedom without requiring global coordination.&lt;/p&gt;
&lt;/blockquote&gt;
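&lt;p&gt;A simplified sketch of that ordering discipline (toy resource ids, not Stategraph's actual lock manager): every transaction sorts the resources it needs by topological rank, ties broken by id, and acquires locks in that order, so no two transactions can wait on each other in a cycle.&lt;/p&gt;

```python
# Deadlock freedom by consistent lock ordering: all transactions
# acquire locks in the same global order (simplified model).

from graphlib import TopologicalSorter

deps = {
    "subnet": {"vpc"},
    "rds": {"subnet"},
    "sg": {"vpc"},
}
topo_rank = {r: i for i, r in
             enumerate(TopologicalSorter(deps).static_order())}

def lock_order(needed):
    # Rank ties are broken by resource id, as described above.
    return sorted(needed, key=lambda r: (topo_rank[r], r))

print(lock_order({"rds", "sg", "vpc"}))
```

&lt;p&gt;Because every transaction requests locks in this same global order, a waits-for cycle between transactions cannot form.&lt;/p&gt;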

&lt;p&gt;The result is dramatic improvement in concurrent throughput:&lt;/p&gt;

&lt;h3&gt;
  
  
  Parallel Execution
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Transaction A&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lock: RDS:prod-db&lt;/li&gt;
&lt;li&gt;Lock: SG:prod-db-sg&lt;/li&gt;
&lt;li&gt;Apply changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Transaction B&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lock: CF:cdn-dist&lt;/li&gt;
&lt;li&gt;Lock: S3:static-assets&lt;/li&gt;
&lt;li&gt;Apply changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Transaction C&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lock: ASG:workers&lt;/li&gt;
&lt;li&gt;Lock: LC:worker-config&lt;/li&gt;
&lt;li&gt;Apply changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three teams, three transactions, zero contention. This isn't possible with file-based state, regardless of how you split it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Refresh Problem
&lt;/h2&gt;

&lt;p&gt;Terraform refresh is O(n) in the number of resources, regardless of change scope. Change one security group rule and you still walk the entire state. That's an algorithmic bottleneck, not just an implementation detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  File-Based State
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Changing 1 resource&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Refreshing all 30&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Graph State
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Changing 1 resource&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Refreshing only 3&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With a graph representation, refresh work can be scoped to the affected subgraph instead of the entire state. Most changes touch only a small fraction of resources, not everything.&lt;/p&gt;
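&lt;p&gt;A toy version of the scoping computation (a hypothetical five-edge state; real scoping rules are more nuanced): walk the dependency edges reachable from the changed resource and refresh only what you find.&lt;/p&gt;

```python
# Compute the refresh cone of a change by traversing dependency edges
# outward from the changed resource (simplified illustration).

from collections import deque

edges = [("vpc", "subnet"), ("subnet", "rds"), ("subnet", "asg"),
         ("vpc", "sg"), ("cdn", "s3")]   # rest of the state elided

neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

def refresh_cone(changed):
    seen, queue = {changed}, deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(refresh_cone("rds")))   # ['asg', 'rds', 'sg', 'subnet', 'vpc']
```

&lt;p&gt;The unrelated &lt;code&gt;cdn&lt;/code&gt;/&lt;code&gt;s3&lt;/code&gt; pair never enters the cone, so it is never read, locked, or refreshed.&lt;/p&gt;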

&lt;h2&gt;
  
  
  Why We Built This
&lt;/h2&gt;

&lt;p&gt;At &lt;a href="https://terrateam.io" rel="noopener noreferrer"&gt;Terrateam&lt;/a&gt;, we've watched hundreds of teams struggle with the same fundamental problems. They start with a single state file, hit scaling limits, split their state, discover coordination complexity, build orchestration layers, and eventually resign themselves to living with the pain.&lt;/p&gt;

&lt;p&gt;This is a solvable problem. The computer science is well-understood. The implementation is straightforward once you acknowledge that state management is a distributed systems problem, not a file storage problem.&lt;/p&gt;

&lt;p&gt;Stategraph isn't revolutionary. It's the application of established distributed systems principles to a problem that's been mischaracterized since its inception. We're not inventing new algorithms; we're applying the right ones.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Design Principle&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The storage layer should match the access patterns. Terraform state exhibits graph traversal patterns, partial update patterns, and concurrent access patterns. The storage layer should be a graph database with ACID transactions and fine-grained locking. Anything else is impedance mismatch.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The infrastructure industry has accepted file-based state as an immutable constraint for too long. It's not. It's a choice, and it's the wrong one for systems at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Implementation
&lt;/h2&gt;

&lt;p&gt;Stategraph is implemented as a PostgreSQL schema with a backend that speaks the Terraform/OpenTofu remote backend protocol. We chose PostgreSQL for its robust MVCC, proven scalability, and operational familiarity. The schema normalizes state into three primary relations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;resources:&lt;/strong&gt; one row per resource, with type, provider, and attribute columns.&lt;br&gt;
&lt;strong&gt;dependencies:&lt;/strong&gt; edge table representing the resource dependency graph.&lt;br&gt;
&lt;strong&gt;transactions:&lt;/strong&gt; append-only log of all state mutations with full attribution.&lt;/p&gt;
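&lt;p&gt;The shape of those relations can be sketched with SQLite for a self-contained demo (Stategraph itself targets PostgreSQL, and the column names here are illustrative, echoing the recursive query shown earlier):&lt;/p&gt;

```python
# Normalized state as relations, plus the subgraph query from earlier,
# modeled in SQLite (illustrative schema, not Stategraph's actual DDL).

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE resources (id TEXT PRIMARY KEY, type TEXT, name TEXT);
CREATE TABLE dependencies (resource_id TEXT, dependent_id TEXT);
CREATE TABLE transactions (id INTEGER PRIMARY KEY, actor TEXT, at TEXT);
INSERT INTO resources VALUES
  ('r1', 'aws_vpc', 'main'),
  ('r2', 'aws_subnet', 'app'),
  ('r3', 'aws_instance', 'prod-api-cluster');
INSERT INTO dependencies VALUES ('r1', 'r2'), ('r2', 'r3');
""")

# Walk the edge table from a changed resource down through dependents.
rows = db.execute("""
WITH RECURSIVE affected AS (
    SELECT id, type, name FROM resources WHERE name = 'main'
    UNION
    SELECT r.id, r.type, r.name FROM resources r
    JOIN dependencies d ON r.id = d.dependent_id
    JOIN affected a ON d.resource_id = a.id
) SELECT name FROM affected
""").fetchall()
print(sorted(n for (n,) in rows))   # ['app', 'main', 'prod-api-cluster']
```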

&lt;p&gt;The backend extends Terraform's protocol with graph-aware operations. Lock acquisition and state queries operate directly on the database representation of the graph, enabling precision and concurrency that file-based backends can't provide.&lt;/p&gt;

&lt;p&gt;This isn't a wrapper or an orchestrator. It's a replacement for the storage layer that preserves Terraform's execution model while fixing its coordination problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adoption Path
&lt;/h2&gt;

&lt;p&gt;Stategraph reads existing tfstate files and constructs the graph representation automatically. No changes to Terraform configurations are required. The backend protocol is unchanged. From Terraform's perspective, Stategraph is just another backend, like S3 or GCS.&lt;/p&gt;

&lt;p&gt;But from an operational perspective, everything changes. Lock contention disappears. Refresh times drop by orders of magnitude. Teams stop blocking each other. State becomes queryable, auditable, and comprehensible.&lt;/p&gt;

&lt;p&gt;We're not asking teams to rewrite their infrastructure. We're asking them to store it properly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The question isn't whether Terraform state should be a graph. It already is. The question is whether we'll continue pretending it's a file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Technical Preview
&lt;/h2&gt;

&lt;p&gt;Stategraph is in active development. We're working with design partners to validate the approach at scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://stategraph.dev/" rel="noopener noreferrer"&gt;Get Updates at https://stategraph.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
    </item>
  </channel>
</rss>
