DevOps Start

Posted on Apr 13 • Originally published at devopsstart.com

Terraform State Locking: A Guide for Growing Teams

#terraformstatelocking #terraforms3backend #terraformazurebackend #terraformremotestate

Prevent state file corruption and race conditions in your collaborative projects. Originally published on devopsstart.com, this guide walks through implementing Terraform state locking using AWS, Azure, or Terraform Cloud.

It starts with a simple terraform apply. You've just onboarded a new developer and they're making their first change to your shared infrastructure. At the same time, you're fixing a small bug in production. You both hit enter. Suddenly, your terminal floods with errors, or worse, everything seems to work until you check your cloud console and find two of every new resource, an inconsistent state and a corrupted terraform.tfstate file. This isn't a hypothetical scenario; it's a rite of passage for small teams discovering the sharp edges of infrastructure as code.

Without proper Terraform state locking, concurrent operations are like two people editing the same text document without a central save button. One person's changes will inevitably overwrite the other's, leading to a garbled mess.

This article is your guide to preventing that chaos. You'll learn why state locking is non-negotiable for collaborative Terraform development. We'll walk through three pragmatic, low-cost strategies for implementing state locking for your small team, complete with copy-pasteable code. You'll learn how to set up the classic AWS S3 and DynamoDB combo, the integrated Azure Blob Storage solution and the "it just works" approach with Terraform Cloud. Finally, we'll cover how to fix the most common problem you'll face: a "stuck" lock.

The Danger of Concurrent Terraform Runs

Unless you configure a backend, Terraform writes a file named terraform.tfstate to your working directory the first time you run terraform apply. This file is the single source of truth for your infrastructure. It maps the resources defined in your .tf files to the actual resources running in your cloud provider. When you run terraform plan or terraform apply, Terraform reads this file to understand what currently exists before calculating and executing changes.

This works fine for a single person. But the moment a second person clones the repository and runs terraform apply, you have a problem. Both of you now have a local copy of the state file.

Here's how a typical disaster unfolds:

You run terraform plan to add a new server. Your plan is based on the current state.
Your colleague, at the same time, runs terraform plan to delete an old database. Their plan is also based on the same current state.
Your colleague runs terraform apply first. The database is deleted and their local terraform.tfstate file is updated.
You run terraform apply. Your apply command knows nothing about the database deletion. It creates the new server and updates your local terraform.tfstate.

When you both push your code and the now-divergent state files to a shared repository, you're in trouble. The state no longer reflects reality. This is a race condition and it can lead to:

Resource Duplication: Both runs try to create the same resource, leading to errors or duplicate infrastructure.
State File Corruption: The final state file might get partially overwritten, becoming a nonsensical mix of both operations. Recovering from this often means manual intervention and comparing the broken state file with your actual cloud resources.
Silent Drift: One person's changes are silently overwritten by the other's. The infrastructure no longer matches what's in your Git repository, defeating the purpose of infrastructure as code.

To prevent this, you need two things: a shared, central location for the state file and a mechanism to ensure only one person can modify it at a time. This is where remote backends and state locking come in.

Strategy 1: AWS S3 and DynamoDB

The most common and battle-tested pattern for teams using AWS is to pair an S3 bucket for state file storage with a DynamoDB table for locking.

Amazon S3: The S3 bucket will store your terraform.tfstate file. It's durable, inexpensive and globally accessible.
Amazon DynamoDB: This is your lock manager. When you run terraform plan or terraform apply, Terraform first attempts to create an entry in a specific DynamoDB table. If the entry already exists, someone else holds the lock. By default your command then fails with a lock error; if you pass -lock-timeout (for example -lock-timeout=5m), Terraform keeps retrying until the timeout expires. DynamoDB's conditional writes guarantee that only one process can create that lock entry at a time.

You'll need to create the S3 bucket and DynamoDB table before you can configure Terraform to use them.

Create an S3 Bucket: Use a globally unique name. Make sure "Block all public access" is checked and enable bucket versioning to recover from accidental deletions or state file corruption.
Create a DynamoDB Table: The only requirement is that the table must have a Partition key named LockID (with a capital L and ID) of type String. No other attributes or settings are necessary. Name the table something clear, like terraform-state-locks.
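The two console steps above can also be captured as code. The following is a minimal sketch, assuming a separate one-off "bootstrap" configuration that is applied once with local state. The file path and the resource labels (tf_state, tf_locks) are illustrative, the on-demand billing mode is an assumption, and the bucket and table names simply match the backend block used later in this article.

```hcl
# bootstrap/main.tf (hypothetical path): run once, before configuring the backend

provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "tf_state" {
  bucket = "my-terraform-state-bucket-20240523" # Must be globally unique
}

# Keep old state versions so a bad write or deletion can be rolled back
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Code equivalent of ticking "Block all public access" in the console
resource "aws_s3_bucket_public_access_block" "tf_state" {
  bucket                  = aws_s3_bucket.tf_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST" # Assumption: on-demand; provisioned capacity also works
  hash_key     = "LockID"          # Must be spelled exactly like this, including case

  attribute {
    name = "LockID"
    type = "S" # String
  }
}
```

Keeping this bootstrap configuration apart from your main project avoids a chicken-and-egg problem: the main project's backend depends on resources that must already exist when terraform init runs.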

Once this infrastructure is ready, you configure the backend in your Terraform code.

# main.tf - Terraform v1.8.0, AWS Provider v5.47.0

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "my-terraform-state-bucket-20240523" # Must be globally unique
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}

provider "aws" {
  region = "us-east-1"
}

# Your resource definitions go here...

After adding this block, you'll need to run terraform init. Terraform will detect the backend configuration and ask if you want to copy your existing local state file to the new S3 backend. Confirm by typing yes.

From now on, every time a team member runs terraform apply, this happens:

Terraform attempts to write a lock item to the terraform-state-locks DynamoDB table.
If it succeeds, it proceeds to read the state from the S3 bucket.
It performs the apply, updates the state file and uploads the new terraform.tfstate to S3.
Finally, it deletes the lock item from the DynamoDB table, allowing the next person to run their command.
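Each step in the sequence above maps to a small set of AWS permissions, so team members and CI roles do not need broad S3 or DynamoDB access. The sketch below is one way to express that, assuming the bucket, key, region and table names from the backend block in this article. The policy name and resource labels are hypothetical, and the action list follows what the S3 backend documentation describes; verify it against the version you run.

```hcl
# Look up the current account ID instead of hard-coding it in the table ARN
data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "tf_backend" {
  # Terraform lists the bucket to locate the state object
  statement {
    sid       = "ListStateBucket"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::my-terraform-state-bucket-20240523"]
  }

  # Read the current state and upload the new one
  statement {
    sid       = "ReadWriteStateObject"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::my-terraform-state-bucket-20240523/prod/network/terraform.tfstate"]
  }

  # Create and delete the lock item around each operation
  statement {
    sid = "ManageLockItems"
    actions = [
      "dynamodb:DescribeTable",
      "dynamodb:GetItem",
      "dynamodb:PutItem",
      "dynamodb:DeleteItem",
    ]
    resources = [
      "arn:aws:dynamodb:us-east-1:${data.aws_caller_identity.current.account_id}:table/terraform-state-locks",
    ]
  }
}

resource "aws_iam_policy" "tf_backend" {
  name   = "terraform-backend-access" # Hypothetical name
  policy = data.aws_iam_policy_document.tf_backend.json
}
```

Attach the resulting policy to whichever users, groups or roles run Terraform against this state.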

Strategy 2: Azure Blob Storage

If your team works primarily in Microsoft Azure, the process is even simpler. The Azure Blob Storage backend has native locking built in, so you don't need a separate service like DynamoDB. Terraform uses the blob's "lease" mechanism to achieve the same locking behavior.

All you need is an Azure Storage Account and a container within it.

Create an Azure Storage Account: A standard general-purpose v2 account will work perfectly. Ensure the name is globally unique.
Create a Blob Container: Inside the storage account, create a container (for example, named tfstate).

Your team members will need the Storage Blob Data Contributor role on this storage account to allow Terraform to read and write the state file and manage its lease.
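If you prefer to create these pieces with Terraform rather than the portal, the setup might look like the sketch below. It assumes the AzureRM provider 3.x pinned in this article and a one-off bootstrap configuration applied with local state. The location, the replication type, the resource labels and the team_principal_id variable are all illustrative assumptions.

```hcl
# bootstrap/main.tf (hypothetical path): run once, before configuring the backend

provider "azurerm" {
  features {}
}

variable "team_principal_id" {
  description = "Object ID of the Azure AD group or user that runs Terraform (assumed input)"
  type        = string
}

resource "azurerm_resource_group" "tf_state" {
  name     = "rg-terraform-state"
  location = "eastus" # Assumption: pick the region closest to your team
}

resource "azurerm_storage_account" "tf_state" {
  name                     = "tfstateacct20240523" # Must be globally unique
  resource_group_name      = azurerm_resource_group.tf_state.name
  location                 = azurerm_resource_group.tf_state.location
  account_kind             = "StorageV2" # General-purpose v2
  account_tier             = "Standard"
  account_replication_type = "LRS" # Assumption: locally redundant storage
}

resource "azurerm_storage_container" "tf_state" {
  name                  = "tfstate"
  storage_account_name  = azurerm_storage_account.tf_state.name
  container_access_type = "private"
}

# Lets the team read and write the state blob and manage its lease
resource "azurerm_role_assignment" "tf_state_rw" {
  scope                = azurerm_storage_account.tf_state.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = var.team_principal_id
}
```

Assigning the role to a group rather than to individual users keeps onboarding down to a single group membership change.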

The backend configuration is straightforward.

# main.tf - Terraform v1.8.0, AzureRM Provider v3.102.0

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }

  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "tfstateacct20240523" # Must be globally unique
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
  }
}

provider "azurerm" {
  features {}
}

# Your resource definitions go here...

Just like with the S3 backend, you'll run terraform init to migrate your local state to the new Azure Blob Storage backend. The locking process is completely transparent. When one user runs a command, Terraform acquires a lease on the state blob. If another user tries to run a command, Terraform sees the existing lease and refuses to proceed: it fails with a lock error by default, or retries until the lease is released if you set -lock-timeout. This integrated approach is a great example of how choosing a backend native to your cloud provider can simplify your setup.

Strategy 3: Terraform Cloud

For teams that don't want to manage any backend infrastructure at all, Terraform Cloud (since renamed HCP Terraform) is the ideal solution. It's a managed service from HashiCorp that provides remote state storage, locking and execution in a single platform. It has a free tier that covers the needs of many small teams; check the current limits before you commit, since the plans change over time.

With Terraform Cloud, there's no need to create S3 buckets or storage accounts.

Sign up for a Terraform Cloud account and create an organization.
Create a Workspace in the Terraform Cloud UI. A workspace is the container for your state file, variables and run history. Choose the "CLI-driven workflow" option when creating it.

Next, you configure the cloud backend in your code.

# main.tf - Terraform v1.8.0

terraform {
  cloud {
    organization = "my-awesome-startup"

    workspaces {
      name = "production-networking"
    }
  }
}

# Provider and resource definitions go here.
# Declaring required_providers in the terraform block is still recommended
# with the cloud block; it is omitted here only to keep the example short.

provider "aws" {
  region = "us-east-1"
}

To connect your local environment to your new backend, run terraform login. This will open a browser window for you to generate an API token. Once authenticated, run terraform init. Terraform will automatically detect the cloud configuration and configure itself to use your Terraform Cloud workspace.

State locking is automatic and built-in. All plan and apply operations will now be visible in the Terraform Cloud UI, giving you a complete audit trail of who changed what and when. This approach completely offloads the administrative burden of managing state.

When Locks Get Stuck

State locking is fantastic until it isn't. The most common frustration you'll encounter is a "stuck" lock. A Terraform process might crash, your network connection could drop during an apply or a CI/CD job might get cancelled midway. In these cases, the final step of releasing the lock never happens and the lock item remains.

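The next person who tries to run terraform plan will see the message below. By default Terraform then fails almost immediately with a lock error; if -lock-timeout is set, it keeps retrying until the timeout expires, which can look like a hang: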

$ terraform plan
Acquiring state lock. This may take a few moments...

When this happens, don't panic. The solution is terraform force-unlock. But before you use it, you must investigate.

Identify the Lock: When a command fails because of an existing lock, Terraform often prints the Lock ID and who acquired it.

$ terraform plan

Error: Error acquiring the state lock

Error message: ConditionalCheckFailedException: The conditional request failed
Lock Info:
  ID:        a1b2c3d4-e5f6-a7b8-c9d0-e1f2a3b4c5d6
  Path:      my-terraform-state-bucket-20240523/prod/network/terraform.tfstate
  Operation: Plan
  Who:       jane.doe@example.com
  Version:   1.8.0
  Created:   2024-05-23T10:00:00Z
  Info:

Terraform acquires a state lock to protect the state from being written
by multiple users at the same time. Please resolve the issue above and try
again. For most backends, this can be resolved with the "force-unlock"
command.


Communicate: Find jane.doe@example.com. Is their apply still running? Did their computer crash? If you can confirm that no Terraform process is actively modifying the infrastructure, it's safe to proceed.

Force the Unlock: Use the force-unlock command with the Lock ID from the error message.

$ terraform force-unlock a1b2c3d4-e5f6-a7b8-c9d0-e1f2a3b4c5d6
Terraform state lock has been unlocked!
The state lock is no longer present.

If you are using Terraform Cloud, you can also unlock the state from the workspace's UI, which can be safer as it provides more context.

Warning: Never use force-unlock if there's any chance a terraform apply is still running. Doing so defeats the entire purpose of locking and can lead to a corrupted state file. It is a tool for cleaning up after a failed process, not for overriding an active one.

Best Practices for Small Teams

Implementing a locking backend is the most important step, but you can build on that foundation with a few good habits.

Always Use Remote State: Get in the habit of using a remote backend from day one, even for personal projects. It makes it trivial to share the project later or to run it from a CI/CD pipeline. The cost is negligible, but the benefit in discipline and future-proofing is enormous.
Integrate with CI/CD Early: The best way to manage state locking is to have a single, automated system responsible for running terraform apply. A simple GitHub Actions or GitLab CI workflow can enforce that all changes go through a standardized process. This eliminates "it works on my machine" problems and centralizes execution, so you never have to wonder if a colleague is running an apply from their laptop.
Establish a -lock=false Policy: Terraform provides an escape hatch, the -lock=false flag, which tells the command to proceed without acquiring a lock. In a team setting, this flag is dangerous. Its use should be restricted to documented, emergency state recovery scenarios by a designated person and never used in a shared pipeline.
Use Workspaces for Environments: Don't copy-paste your entire project into new directories for dev, staging and prod. Instead, use Terraform workspaces. A workspace is a named environment that has its own separate state file but uses the same configuration code. You can switch between them with terraform workspace select dev and manage all your environments from a single codebase, neatly separated by their state.
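To make the workspace idea concrete, here is a minimal sketch of a configuration that adapts to the selected workspace through the built-in terraform.workspace value. The instance sizes, the ami_id variable and the resource labels are illustrative assumptions, not recommendations.

```hcl
variable "ami_id" {
  description = "AMI to launch (assumed input)"
  type        = string
}

locals {
  # Name of the currently selected workspace, e.g. "dev", "staging" or "prod"
  environment = terraform.workspace

  # Hypothetical per-environment sizing
  instance_type_by_env = {
    dev     = "t3.micro"
    staging = "t3.small"
    prod    = "t3.large"
  }

  # Fall back to the smallest size for any workspace not listed above
  instance_type = lookup(local.instance_type_by_env, local.environment, "t3.micro")
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = local.instance_type

  tags = {
    Name        = "app-${local.environment}"
    Environment = local.environment
  }
}
```

Create each environment once with terraform workspace new dev, then switch with terraform workspace select dev. Each workspace gets its own state and its own lock, so an apply in dev does not block a plan in prod.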

FAQ

Can I just use Git to store my terraform.tfstate file?

No, you should never commit your state file to Git. Git is a version control system, not a locking mechanism. Pushing and pulling state files manually is a recipe for merge conflicts and race conditions. A proper backend like S3, Azure Blob or Terraform Cloud provides the atomic locking that prevents multiple users from writing to the state at the same time.

What is the real-world cost of using S3 and DynamoDB for state locking?

For the vast majority of small teams, the cost is effectively zero. A typical terraform.tfstate file is a few hundred kilobytes. Storing this in S3 costs fractions of a penny per month. The DynamoDB operations for locking (one write, one delete per Terraform run) fall well within the free tier. For reference, the DynamoDB free tier includes 25 WCUs (Write Capacity Units) of provisioned capacity. One WCU covers one small write per second, so this is far more than a team's Terraform runs will ever need.

My terraform plan is stuck on "Acquiring state lock" but no error appears. What do I do?

If your command was started with -lock-timeout, this means the system is working as designed! Another process has the lock and your command is politely retrying until its turn comes or the timeout expires. Without -lock-timeout, Terraform does not wait; it fails right away with a lock error. Either way, your first step should be to check with your team. Is someone else running an apply? Did a CI/CD job just kick off? If you can't find an active process, it might be a stuck lock from a previous failed run. Wait a few minutes, then investigate using the steps in the "When Locks Get Stuck" section.

Our team uses Google Cloud. What's the equivalent solution?

For Google Cloud Platform (GCP), the best practice is to use the gcs backend. It works similarly to the Azure Blob Storage backend. You store the state file in a Google Cloud Storage (GCS) bucket and Terraform handles locking inside that same bucket by creating a lock object next to the state. You don't need a separate database service. You can find detailed instructions in the official Terraform GCS Backend documentation.
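A gcs backend block might look like the following sketch. The bucket name and prefix are hypothetical placeholders, and the bucket must already exist before you run terraform init.

```hcl
terraform {
  backend "gcs" {
    bucket = "my-terraform-state-bucket-20240523" # Hypothetical; must be globally unique
    prefix = "prod/network"                       # State objects are stored under this prefix
  }
}
```

As with the S3 bucket, enabling object versioning on the GCS bucket gives you a way to recover an earlier state if a write goes wrong.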

Conclusion

Embracing state locking isn't about adding bureaucracy; it's about enabling your team to move quickly and safely. By eliminating the risks of concurrent operations, you trade a few minutes of initial setup for hours of saved debugging and state reconciliation down the line.

You've seen three effective strategies:

AWS S3 + DynamoDB: The robust, industry-standard choice for teams on AWS.
Azure Blob Storage: A streamlined, integrated solution for Azure-native teams.
Terraform Cloud: The managed, "zero-maintenance" option for teams who want to focus purely on code.

Your next step is simple and actionable. If your team is running Terraform without a remote backend, stop what you are doing. Pick the strategy that fits your cloud provider and budget, then set it up now. The code examples in this article are all you need to get started. If you're using AWS, create your S3 bucket and DynamoDB table. If you're on Azure, create the storage account. If you want a managed solution, sign up for Terraform Cloud. Then, add the backend block to your configuration and run terraform init.

This small investment of time is one of the highest-leverage improvements you can make to your team's workflow. It will save you from future headaches and allow your team to scale its infrastructure with confidence.