
Darian Vance

Posted on • Originally published at wp.me

Solved: Moved from laptop Terraform to full CI/CD with testing and drift detection

🚀 Executive Summary

TL;DR: Moving Terraform from local machines to a robust CI/CD pipeline is crucial to prevent production outages caused by state file drift and unreviewed changes. This transition involves centralizing state, implementing automated testing and approval workflows, and establishing a formal 'Break Glass' procedure for emergencies.

🎯 Key Takeaways

  • Centralize Terraform state in a remote backend (e.g., AWS S3 with DynamoDB locking) and restrict direct terraform apply permissions for engineers to prevent state file drift and unauthorized changes.
  • Implement a comprehensive CI/CD pipeline (e.g., using GitHub Actions) that automates terraform fmt, validate, plan, requires PR reviews, and includes a manual approval gate before terraform apply to ensure auditable and controlled deployments.
  • Establish a formal 'Break Glass' procedure for emergency manual fixes, which includes declaring an incident, assuming a highly-privileged temporary role with extensive logging, and critically, reconciling the Terraform state immediately after the fix to prevent future pipeline reverts.

Tired of 'it worked on my machine' Terraform errors? Learn how to transition from local applies to a robust CI/CD pipeline with state locking, testing, and drift detection to prevent production outages.

From Laptop to Launchpad: The Uncomfortable Truth About Moving Terraform to CI/CD

I still remember the 3 AM pager alert. A high-severity "database connection error" cascade was taking down our entire e-commerce platform. I stumbled to my desk, heart pounding, and started digging. The app servers couldn't reach prod-db-01. A quick check of the security groups showed that the ingress rule from the app server security group was just… gone. Vanished. It took us another frantic hour to discover that a junior engineer, trying to be helpful, had run a terraform apply from his laptop to "quickly" open a port for a debugging tool. His local state was stale, so Terraform dutifully "fixed" the production environment to match his outdated code, wiping out the critical rule my pipeline had applied just hours before. That night, I swore off local applies for anything touching production. For good.

The Root of All Evil: Your Laptop Isn’t Production

That story isn’t unique. I’ve seen variations of it play out at nearly every company I’ve worked for. The core problem isn’t malice or incompetence; it’s a broken process. When your team grows beyond one person, running Terraform from individual machines is a ticking time bomb. Why?

  • State File Drift: Without a remote, locked state, it's a race to see who can run apply first. Bob adds a subnet, and Jane removes a load balancer. If they work from slightly different versions of the state file, the last person to apply "wins," potentially reverting the other's changes without even knowing it.
  • No Single Source of Truth: The CI/CD pipeline, connected to your main Git branch, should be the only thing with the credentials to change your infrastructure. When anyone can run an apply, the source of truth is fragmented across every engineer’s machine.
  • Lack of Audit and Review: Who applied that change? Why was it applied? Was it peer-reviewed? On a laptop, the answers are "I dunno," "because they felt like it," and "nope." A CI/CD process tied to pull requests gives you a permanent, reviewable record of every single change.

Taming the Chaos: Three Levels of Control

Moving from the wild west of local applies to a controlled pipeline can feel daunting, but you don’t have to boil the ocean. Here are the three approaches I’ve used to get teams on the right track, from immediate damage control to a rock-solid, automated system.

Solution 1: The First Aid Kit (Immediate Lockdown)

This is the "stop the bleeding now" approach. It's not perfect, but you can implement it in an afternoon and it will prevent 90% of the common foot-guns. The goal is to centralize the state file and restrict who can actually perform a write operation.

Step 1: Centralize and Lock State. Move your state file to a remote backend that supports locking. For AWS, this is the classic S3 bucket for storage and a DynamoDB table for locking. This prevents two people from running apply at the exact same time.

terraform {
  backend "s3" {
    bucket         = "techresolve-tf-state-prod"
    key            = "global/s3/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
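Assuming the bucket and DynamoDB table already exist, pointing a project with existing local state at the new backend is a one-time migration:

```shell
# One-time migration: re-initialize so Terraform copies the local
# state file into the new S3 backend (it prompts for confirmation).
terraform init -migrate-state

# Sanity check: the listed resources should now come from the remote state.
terraform state list
```

After verifying the remote state looks right, delete (or at least stop trusting) the local terraform.tfstate files.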

Step 2: Use Your IAM Hammer. This is the most critical part. Change the IAM permissions for your engineers. Take away their terraform apply rights in production. They should have read-only access to inspect the environment and permissions to run terraform plan to see proposed changes, but that’s it. Only a service account for your pipeline or a small number of lead engineers should have the keys to the kingdom.
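As a sketch of what "plan but don't apply" permissions can look like (HCL, with hypothetical names based on the backend above; the exact read actions depend on which services your modules touch). Note that even terraform plan takes the DynamoDB lock unless you run it with -lock=false, so the lock table needs write access:

```hcl
# Hypothetical plan-only policy for engineers: enough to inspect
# infrastructure and run `terraform plan`, but no write access to it.
data "aws_iam_policy_document" "plan_only" {
  statement {
    sid       = "ReadInfra"
    actions   = ["ec2:Describe*", "rds:Describe*", "elasticloadbalancing:Describe*"]
    resources = ["*"]
  }

  statement {
    sid     = "ReadState"
    actions = ["s3:GetObject", "s3:ListBucket"]
    resources = [
      "arn:aws:s3:::techresolve-tf-state-prod",
      "arn:aws:s3:::techresolve-tf-state-prod/*",
    ]
  }

  statement {
    sid       = "StateLock"
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = ["arn:aws:dynamodb:us-east-1:*:table/terraform-state-lock"]
  }
}
```

Attach this to the engineers' group, and reserve s3:PutObject on the state bucket and the broad mutating actions for the pipeline's service account.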

Darian's Take: Yes, this creates a bottleneck. You, the senior engineer, might become the "human pipeline" for a week or two. That's okay. A slow but safe process is infinitely better than a fast and chaotic one that wakes you up at 3 AM.

Solution 2: The Production Blueprint (A Real CI/CD Pipeline)

This is the goal state. A fully automated workflow, triggered from a Git commit, that provides visibility, requires approval, and executes changes predictably. Here, we make the Git repository the true source of truth.

Our process at TechResolve, using GitHub Actions, looks like this:

  1. A developer opens a Pull Request (PR) with their infrastructure changes.
  2. The pipeline automatically triggers, running terraform fmt, validate, and plan.
  3. The output of the plan is posted as a comment on the PR for the team to review.
  4. At least one other engineer must approve the PR.
  5. Once merged into the main branch, a final workflow runs that requires a manual approval step in the GitHub UI (e.g., clicking a button).
  6. Once approved, the pipeline runs terraform apply -auto-approve.

Here’s a simplified conceptual example of what that GitHub Actions workflow might look like:

name: 'Terraform CD'

on:
  push:
    branches: [ "main" ]

jobs:
  terraform:
    name: 'Terraform Apply'
    runs-on: ubuntu-latest
    environment: production # Requires manual approval gate

    steps:
    - name: Checkout
      uses: actions/checkout@v3

    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v2

    # Credentials for the S3 backend (e.g., AWS keys or an OIDC role via
    # aws-actions/configure-aws-credentials) would be configured in a step here.

    - name: Terraform Init
      run: terraform init

    - name: Terraform Apply
      run: terraform apply -auto-approve
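Steps 1 through 3 map to a companion workflow that runs on pull requests. Here's a sketch; posting the plan output back to the PR (step 3) is usually done with actions/github-script or a marketplace action, omitted here for brevity:

```yaml
name: 'Terraform CI'

on:
  pull_request:
    branches: [ "main" ]

jobs:
  plan:
    name: 'Terraform Plan'
    runs-on: ubuntu-latest

    steps:
    - name: Checkout
      uses: actions/checkout@v3

    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v2

    - name: Terraform Fmt
      run: terraform fmt -check -recursive

    - name: Terraform Init
      run: terraform init

    - name: Terraform Validate
      run: terraform validate

    - name: Terraform Plan
      id: plan
      run: terraform plan -no-color -input=false
```

Because this job only ever plans, it can safely run with read-only credentials, which is exactly the split Solution 1 set up.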

Pro Tip: Look into tools like Atlantis or platforms like Terraform Cloud/Enterprise. They are purpose-built for this exact workflow and can save you a lot of time writing custom pipeline scripts.
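For the drift-detection half of the story, a scheduled plan is a common pattern: terraform plan -detailed-exitcode exits with 2 when real infrastructure no longer matches the code, i.e., drift. A sketch (it relies on setup-terraform's wrapper exposing the step's exit code as an output; the alert step is a placeholder):

```yaml
name: 'Terraform Drift Detection'

on:
  schedule:
    - cron: '0 6 * * *'   # every day at 06:00 UTC

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3

    - uses: hashicorp/setup-terraform@v2

    - name: Terraform Init
      run: terraform init

    # -detailed-exitcode: 0 = no changes, 1 = error, 2 = drift detected
    - name: Detect drift
      id: plan
      run: terraform plan -detailed-exitcode -input=false
      continue-on-error: true

    - name: Alert on drift
      if: steps.plan.outputs.exitcode == '2'
      run: echo "Drift detected in production!"  # swap for a Slack/PagerDuty step
```

If this job had been running at the time of my 3 AM story, the stale-laptop apply would have been flagged within a day instead of discovered mid-outage.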

Solution 3: The 'Break Glass' Procedure (The Planned Emergency)

You can have the world's greatest pipeline, but one day, something will be on fire and you won't have time to wait for a PR review. The pipeline might be down, or you need to make an immediate, surgical change. This is where a formal "Break Glass" procedure comes in. It is NOT a return to the wild west; it is an emergency protocol with full accountability.

How it works:

  1. Declare an Incident: The process starts by officially declaring a production incident in your company chat (Slack, Teams, etc.). This creates a public record.
  2. Assume an Emergency Role: The on-call engineer uses a system like AWS SSO or a bastion host to assume a temporary, highly-privileged IAM role. This role should have logging turned up to 11. Every single API call is tracked in CloudTrail.
  3. Make the Manual Fix: The engineer makes the minimum necessary change directly in the console or via the CLI to mitigate the incident.
  4. RECONCILE THE STATE: This is the most important step. As soon as the fire is out, you have a new priority: update your Terraform code to match the manual change you just made. If you don’t, the very next pipeline run will revert your fix. You can use terraform import or simply update the HCL.
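Steps 2 and 4 in CLI form (the role ARN, session name, and resource identifiers here are all hypothetical):

```shell
# Step 2: assume the heavily-audited emergency role.
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/break-glass-prod \
  --role-session-name incident-2931

# ...make the minimal manual fix in the console or CLI...

# Step 4: reconcile. If the fix created a resource by hand, write the
# matching HCL, then import it so the next pipeline run doesn't try to
# destroy or recreate it.
terraform import aws_security_group.emergency_debug sg-0123456789abcdef0

# Done when this shows "No changes": code and reality agree again.
terraform plan
```

A clean plan at the end is the exit criterion for the incident; until then, the pipeline stays paused.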

This process gives you the speed you need in an emergency without sacrificing the long-term integrity of your infrastructure as code.

Choosing Your Path

To make it simple, here’s how I think about these solutions:

| Solution | Best For | Pros | Cons |
| --- | --- | --- | --- |
| 1. The First Aid Kit | Immediate damage control; small teams. | Fast to implement, stops concurrent applies. | Creates a human bottleneck, not automated. |
| 2. The Production Blueprint | The default for any serious team. | Fully automated, auditable, and scalable. | Requires time to set up and maintain. |
| 3. The 'Break Glass' Procedure | A necessary supplement for a mature pipeline. | Allows for safe emergency manual overrides. | Requires discipline to follow the process. |

Ultimately, moving away from laptop-driven infrastructure isn’t just a technical upgrade—it’s a cultural one. It’s about agreeing as a team that predictability, visibility, and safety are more important than the convenience of a quick, un-reviewed apply. Trust me, your sleep schedule will thank you.


Darian Vance

👉 Read the original article on TechResolve.blog


☕ Support my work

If this article helped you, you can buy me a coffee:

👉 https://buymeacoffee.com/darianvance
