Executive Summary
TL;DR: Moving Terraform from local machines to a robust CI/CD pipeline is crucial to prevent production outages caused by state file drift and unreviewed changes. This transition involves centralizing state, implementing automated testing and approval workflows, and establishing a formal "Break Glass" procedure for emergencies.
Key Takeaways
- Centralize Terraform state in a remote backend (e.g., AWS S3 with DynamoDB locking) and restrict engineers' direct `terraform apply` permissions to prevent state file drift and unauthorized changes.
- Implement a comprehensive CI/CD pipeline (e.g., using GitHub Actions) that automates `terraform fmt`, `validate`, and `plan`, requires PR reviews, and includes a manual approval gate before `terraform apply` to ensure auditable and controlled deployments.
- Establish a formal "Break Glass" procedure for emergency manual fixes, which includes declaring an incident, assuming a highly privileged temporary role with extensive logging, and, critically, reconciling the Terraform state immediately after the fix to prevent future pipeline reverts.
Tired of "it worked on my machine" Terraform errors? Learn how to transition from local applies to a robust CI/CD pipeline with state locking, testing, and drift detection to prevent production outages.
From Laptop to Launchpad: The Uncomfortable Truth About Moving Terraform to CI/CD
I still remember the 3 AM pager alert. A high-severity "database connection error" cascade was taking down our entire e-commerce platform. I stumbled to my desk, heart pounding, and started digging. The app servers couldn't reach prod-db-01. A quick check of the security groups showed that the ingress rule from the app server security group was just… gone. Vanished. It took us another frantic hour to discover that a junior engineer, trying to be helpful, had run a `terraform apply` from his laptop to "quickly" open a port for a debugging tool. His local state was stale, so Terraform dutifully "fixed" the production environment to match his outdated code, wiping out the critical rule my pipeline had applied just hours before. That night, I swore off local applies for anything touching production. For good.
The Root of All Evil: Your Laptop Isn't Production
That story isn't unique. I've seen variations of it play out at nearly every company I've worked for. The core problem isn't malice or incompetence; it's a broken process. When your team grows beyond one person, running Terraform from individual machines is a ticking time bomb. Why?
- State File Drift: Without a remote, locked state, it's a race to see who can run `apply` first. Bob adds a subnet, and Jane removes a load balancer. If they work from slightly different versions of the state file, the last person to apply "wins," potentially reverting the other's changes without even knowing it.
- No Single Source of Truth: The CI/CD pipeline, connected to your main Git branch, should be the only thing with the credentials to change your infrastructure. When anyone can run an `apply`, the source of truth is fragmented across every engineer's machine.
- Lack of Audit and Review: Who applied that change? Why was it applied? Was it peer-reviewed? On a laptop, the answers are "I dunno," "because they felt like it," and "nope." A CI/CD process tied to pull requests gives you a permanent, reviewable record of every single change.
Taming the Chaos: Three Levels of Control
Moving from the wild west of local applies to a controlled pipeline can feel daunting, but you don't have to boil the ocean. Here are the three approaches I've used to get teams on the right track, from immediate damage control to a rock-solid, automated system.
Solution 1: The First Aid Kit (Immediate Lockdown)
This is the "stop the bleeding now" approach. It's not perfect, but you can implement it in an afternoon and it will prevent 90% of the common foot-guns. The goal is to centralize the state file and restrict who can actually perform a write operation.
Step 1: Centralize and Lock State. Move your state file to a remote backend that supports locking. For AWS, this is the classic S3 bucket for storage and a DynamoDB table for locking. This prevents two people from running `apply` at the exact same time.
```hcl
terraform {
  backend "s3" {
    bucket         = "techresolve-tf-state-prod"
    key            = "global/s3/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
```
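The bucket and lock table have to exist before `terraform init` can use them. A minimal bootstrap sketch (resource names here simply mirror the backend block above; versioning is enabled so you can recover an earlier state if something goes wrong, and the `LockID` hash key is the attribute name Terraform's S3 backend expects):

```hcl
# Hypothetical bootstrap config for the remote backend itself.
resource "aws_s3_bucket" "tf_state" {
  bucket = "techresolve-tf-state-prod"
}

resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration {
    status = "Enabled" # keeps old state versions recoverable
  }
}

resource "aws_dynamodb_table" "tf_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # attribute name required by the S3 backend's locking

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

A common chicken-and-egg note: this bootstrap config is usually applied once with local state (or created by hand) before the backend block is added everywhere else.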
Step 2: Use Your IAM Hammer. This is the most critical part. Change the IAM permissions for your engineers. Take away their `terraform apply` rights in production. They should have read-only access to inspect the environment and permissions to run `terraform plan` to see proposed changes, but that's it. Only a service account for your pipeline or a small number of lead engineers should have the keys to the kingdom.
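In practice, a plan-only engineer needs broad read access to your services plus access to the state backend (a `plan` reads state from S3 and takes the DynamoDB lock). One way to sketch that is the AWS-managed `ReadOnlyAccess` policy plus an inline statement like the one below; the bucket and table names are the hypothetical ones from the backend block, so substitute your own:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerraformPlanStateAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": [
        "arn:aws:s3:::techresolve-tf-state-prod",
        "arn:aws:s3:::techresolve-tf-state-prod/*",
        "arn:aws:dynamodb:us-east-1:*:table/terraform-state-lock"
      ]
    }
  ]
}
```

Note that `PutItem`/`DeleteItem` on the lock table are needed because even a plain `plan` acquires and releases the state lock by default.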
Darian's Take: Yes, this creates a bottleneck. You, the senior engineer, might become the "human pipeline" for a week or two. That's okay. A slow but safe process is infinitely better than a fast and chaotic one that wakes you up at 3 AM.
Solution 2: The Production Blueprint (A Real CI/CD Pipeline)
This is the goal state. A fully automated workflow, triggered from a Git commit, that provides visibility, requires approval, and executes changes predictably. Here, we make the Git repository the true source of truth.
Our process at TechResolve, using GitHub Actions, looks like this:
- A developer opens a Pull Request (PR) with their infrastructure changes.
- The pipeline automatically triggers, running `terraform fmt`, `validate`, and `plan`.
- The output of the `plan` is posted as a comment on the PR for the team to review.
- At least one other engineer must approve the PR.
- Once merged into the `main` branch, a final workflow runs that requires a manual approval step in the GitHub UI (e.g., clicking a button).
- Once approved, the pipeline runs `terraform apply -auto-approve`.
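You can also make the plan step actively guard against the kind of silent deletion from the opening story. Here is a small sketch of a CI guard (the script name and CI wiring are assumptions) that reads the machine-readable output of `terraform show -json plan.out` and fails the job if anything would be destroyed:

```python
import json
import sys


def destructive_changes(plan: dict) -> list:
    """Return addresses of resources the plan would delete or replace.

    `plan` is the dict produced by `terraform show -json plan.out`.
    """
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # A replacement shows up as ["delete", "create"], so checking for
        # "delete" catches both outright destroys and replacements.
        if "delete" in actions:
            flagged.append(change["address"])
    return flagged


def main() -> int:
    # Usage in CI: terraform show -json plan.out | python plan_guard.py
    plan = json.load(sys.stdin)
    doomed = destructive_changes(plan)
    if doomed:
        print("Plan would destroy:", ", ".join(doomed))
        return 1  # non-zero exit fails the CI job so a human must look closer
    print("No destructive changes detected.")
    return 0
```

A guard like this doesn't replace human review; it just makes the scariest class of change impossible to miss in a wall of plan output.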
Here's a simplified conceptual example of what that GitHub Actions workflow might look like:
```yaml
name: 'Terraform CD'

on:
  push:
    branches: [ "main" ]

jobs:
  terraform:
    name: 'Terraform Apply'
    runs-on: ubuntu-latest
    environment: production # Requires manual approval gate

    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          cli_config_credentials_token: ${{ secrets.TF_API_TOKEN }}

      - name: Terraform Init
        run: terraform init

      - name: Terraform Apply
        run: terraform apply -auto-approve
```
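The PR-time half of the process (steps 2 and 3 above) lives in a companion workflow. A minimal sketch, assuming the same Terraform Cloud token secret as the apply workflow:

```yaml
name: 'Terraform Plan'

on:
  pull_request:
    branches: [ "main" ]

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - uses: hashicorp/setup-terraform@v2
        with:
          cli_config_credentials_token: ${{ secrets.TF_API_TOKEN }}

      - name: Format Check
        run: terraform fmt -check -recursive

      - name: Init and Validate
        run: |
          terraform init
          terraform validate

      - name: Plan
        run: terraform plan -no-color -out=plan.out
```

Posting the plan output back as a PR comment is typically done with an extra step (e.g., `actions/github-script` or a marketplace action); the exact choice depends on your setup, so it's omitted here.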
Pro Tip: Look into tools like Atlantis or platforms like Terraform Cloud/Enterprise. They are purpose-built for this exact workflow and can save you a lot of time writing custom pipeline scripts.
Solution 3: The "Break Glass" Procedure (The Planned Emergency)
You can have the world's greatest pipeline, but one day, something will be on fire and you won't have time to wait for a PR review. The pipeline might be down, or you need to make an immediate, surgical change. This is where a formal "Break Glass" procedure comes in. It is NOT a return to the wild west; it is an emergency protocol with full accountability.
How it works:
- Declare an Incident: The process starts by officially declaring a production incident in your company chat (Slack, Teams, etc.). This creates a public record.
- Assume an Emergency Role: The on-call engineer uses a system like AWS SSO or a bastion host to assume a temporary, highly-privileged IAM role. This role should have logging turned up to 11. Every single API call is tracked in CloudTrail.
- Make the Manual Fix: The engineer makes the minimum necessary change directly in the console or via the CLI to mitigate the incident.
- RECONCILE THE STATE: This is the most important step. As soon as the fire is out, you have a new priority: update your Terraform code to match the manual change you just made. If you don't, the very next pipeline run will revert your fix. You can use `terraform import` or simply update the HCL.
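As an illustration of that reconciliation, suppose the on-call engineer re-created a missing ingress rule by hand during the incident (all names and IDs below are hypothetical). You add matching HCL, then adopt the hand-made rule into state so the next plan doesn't try to create a duplicate. On Terraform 1.5+ this can be an `import` block; on older versions the CLI equivalent is `terraform import aws_security_group_rule.app_to_db <rule-id>`:

```hcl
# Adopt the manually created rule into state (Terraform >= 1.5).
# The ID format for security group rules is an assumption to verify
# against the AWS provider docs for your version.
import {
  to = aws_security_group_rule.app_to_db
  id = "sg-0a1b2c3d4e5f_ingress_tcp_5432_5432_sg-0f9e8d7c6b5a"
}

resource "aws_security_group_rule" "app_to_db" {
  type                     = "ingress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = "sg-0a1b2c3d4e5f" # prod-db-01's SG (hypothetical)
  source_security_group_id = "sg-0f9e8d7c6b5a" # app servers' SG (hypothetical)
}
```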
This process gives you the speed you need in an emergency without sacrificing the long-term integrity of your infrastructure as code.
Choosing Your Path
To make it simple, here's how I think about these solutions:
| Solution | Best For | Pros | Cons |
|---|---|---|---|
| 1. The First Aid Kit | Immediate damage control; small teams. | Fast to implement, stops concurrent applies. | Creates a human bottleneck, not automated. |
| 2. The Production Blueprint | The default for any serious team. | Fully automated, auditable, and scalable. | Requires time to set up and maintain. |
| 3. The "Break Glass" Procedure | A necessary supplement for a mature pipeline. | Allows for safe emergency manual overrides. | Requires discipline to follow the process. |
Ultimately, moving away from laptop-driven infrastructure isn't just a technical upgrade; it's a cultural one. It's about agreeing as a team that predictability, visibility, and safety are more important than the convenience of a quick, un-reviewed apply. Trust me, your sleep schedule will thank you.
Read the original article on TechResolve.blog