
Guatu

Posted on • Originally published at guatulabs.dev

Infrastructure as Code, but Automated: OpenTofu and GitHub Actions

I once spent three hours debugging a "successful" pipeline that had actually failed to deploy a critical security group update because I had set continue-on-error: true in a shell script step. The logs said green, the UI said green, but my actual infrastructure was still wide open to the internet. It is a specific type of dread that only hits when you realize your automation is lying to you.

If you are managing even a modest cluster of VMs or a few bare-metal nodes, you eventually hit a wall where manual tofu apply commands from your laptop become a liability. You start worrying about which version of the binary you're running, whether your local state is out of sync with the remote, and if you accidentally left a sensitive variable in your shell history.

This is the problem for anyone moving from "I run scripts" to "I manage systems." Whether you are orchestrating Kubernetes nodes on Proxmox or managing cloud resources, the goal is to move the source of truth from your terminal to a version-controlled workflow.

What I Tried First

My first instinct was the "Lazy Engineer" approach. I just wanted a GitHub Action that ran tofu apply whenever I pushed to main. I didn't bother with a plan phase, a pull request review, or even a remote backend. I just pointed a runner at my state file and hoped for the best.

It was a disaster.

I pushed a change to a networking module that had a small typo in a CIDR block. Because there was no "Plan" step to inspect, the runner immediately started destroying and recreating the primary network interface. My entire lab went dark. I couldn't even SSH into the nodes to fix the mistake because the automation had nuked the route I was using.

I also learned the hard way that running apply directly on a push is dangerous. Without a decoupled plan step that attaches to a Pull Request, you lose the ability to peer-review the impact of your changes. You aren't reviewing code; you are reviewing side effects.

The Actual Solution

A real automation pipeline needs three distinct stages: Validation, Planning, and Deployment. You want the Plan to happen when a Pull Request is opened, and the Apply to happen only after that Plan has been merged into your main branch.

Here is the architecture I use for my infrastructure projects.

1. The Backend Configuration

First, you cannot use local state. If the GitHub runner dies or the workspace is wiped, your infrastructure is orphaned. You need a remote backend—S3, GCS, or even a specialized state server.

# backend.tf
terraform {
  backend "s3" {
    bucket         = "my-infra-state-bucket"
    key            = "network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-lock"
  }
}
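One thing the backend block above quietly assumes is that the dynamodb_table already exists before your first tofu init. A one-time setup sketch with the AWS CLI (the table name must match backend.tf, and LockID is the hash key the S3 backend expects for its lock records):

```shell
aws dynamodb create-table \
  --table-name terraform-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
```

Pay-per-request billing keeps the cost of the lock table effectively zero for a lab-sized setup.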

2. The GitHub Actions Workflow

The workflow is split into two jobs: plan (triggered on PR) and apply (triggered on push to main). I use the opentofu/setup-opentofu action because it handles the binary installation cleanly.

name: Infrastructure CI/CD

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4

      - name: Setup OpenTofu
        uses: opentofu/setup-opentofu@v1
        with:
          tofu_version: 1.6.0

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Tofu Init
        run: tofu init

      - name: Tofu Plan
        id: plan
        run: tofu plan -no-color > plan.txt

      - name: Upload Plan Artifact
        uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: plan.txt

  apply:
    needs: plan
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4

      - name: Setup OpenTofu
        uses: opentofu/setup-opentofu@v1
        with:
          tofu_version: 1.6.0

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Tofu Init
        run: tofu init

      - name: Tofu Apply
        run: tofu apply -auto-approve

Note: In a production-grade setup, you would actually download the tfplan artifact from the previous job and run tofu apply tfplan. This ensures that the exact changes you reviewed in the PR are the ones being applied, rather than a fresh plan that might have drifted in the minutes since the PR was merged.
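As a sketch of that production-grade flow: the plan job writes a binary plan file with -out and uploads it, and the apply job downloads that exact file and applies it verbatim instead of re-planning. Step names here are illustrative.

```yaml
# In the plan job: save a binary plan and upload it as an artifact.
- name: Tofu Plan
  run: tofu plan -no-color -out=tfplan

- name: Upload Plan Artifact
  uses: actions/upload-artifact@v4
  with:
    name: tfplan
    path: tfplan

# In the apply job: fetch the reviewed plan and apply it as-is.
- name: Download Plan Artifact
  uses: actions/download-artifact@v4
  with:
    name: tfplan

- name: Tofu Apply
  run: tofu apply tfplan
```

Applying the saved plan file means OpenTofu refuses to act if the state has drifted since the plan was created, which is exactly the guarantee you want.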

Why It Works

This setup works because it enforces a "Review-First" culture. When you open a PR, the plan job runs. You can look at the GitHub Actions logs, see exactly which resources are being added, changed, or destroyed, and then leave comments on the PR.
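You can make that review loop even tighter by posting the plan output straight onto the PR, so reviewers never have to dig through Actions logs. A sketch using actions/github-script, assuming the plan job wrote plan.txt as in the workflow above:

```yaml
- name: Comment Plan on PR
  if: github.event_name == 'pull_request'
  uses: actions/github-script@v7
  with:
    script: |
      const fs = require('fs');
      const plan = fs.readFileSync('plan.txt', 'utf8');
      // Truncate to stay under the GitHub comment size limit.
      await github.rest.issues.createComment({
        issue_number: context.issue.number,
        owner: context.repo.owner,
        repo: context.repo.repo,
        body: '```\n' + plan.slice(0, 60000) + '\n```',
      });
```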

The separation of plan and apply creates a gate. The apply job is guarded by a conditional check: if: github.event_name == 'push'. This prevents accidental deployments from feature branches.

Using a remote backend with locking (like DynamoDB) is the secret sauce for stability. If two developers try to run a pipeline at the same time, OpenTofu will see the lock and fail the second job rather than corrupting the state file. This is the difference between a professional deployment and a lucky one.
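GitHub Actions can back that state lock up at the workflow level with a concurrency group, so a second run queues instead of even attempting to grab the lock:

```yaml
# Top level of the workflow file: one infrastructure run at a time.
# Queued runs wait rather than being cancelled mid-apply.
concurrency:
  group: infrastructure-state
  cancel-in-progress: false
```

Keep cancel-in-progress false here; killing an apply halfway through is far worse than waiting for it.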

If you are managing more complex environments, such as AI agent deployments that require specific GPU-backed compute nodes, this level of automation is non-negotiable. You cannot afford to manually tweak instance types or disk sizes when your workload depends on specific hardware availability.

Lessons Learned

If I could go back and rewrite my first few workflows, here is what I would change:

  1. Never use auto-approve on a PR. It defeats the purpose of the plan. Only use it in the apply job where the human review has already happened in the PR stage.
  2. Validate your logic before you run it. I have seen pipelines pass because a grep command failed to find a string, but the actual deployment failed. Always check the exit codes of your custom scripts. If you are using an automation tool like n8n to trigger these workflows, ensure your error-handling logic is explicit. A failed check should result in an immediate exit 1.
  3. Secrets management is a minefield. Do not pass secrets as environment variables in your YAML if you can avoid it. Use the official provider actions (like aws-actions/configure-aws-credentials) which handle the heavy lifting of credential injection securely.
  4. The "Drift" problem is real. Automation only works if you actually use it. The moment you manually change a setting in a web console or via the CLI, your OpenTofu state is out of sync. I have learned to treat any manual change as a "broken build" that needs to be codified immediately.
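The second lesson is especially easy to get wrong in shell steps. A minimal sketch of explicit exit-code handling (the log file name and success marker are illustrative, not a real tofu contract):

```shell
#!/usr/bin/env bash
# Fail-fast settings: -e aborts on any error, -u on unset variables,
# and pipefail surfaces failures from the middle of a pipe.
set -euo pipefail

# Simulated apply output; in a real pipeline this would be the
# captured log of `tofu apply`.
printf 'Apply complete! Resources: 1 added, 0 changed, 0 destroyed.\n' > apply.log

# grep -q exits 1 when the marker is missing, so guard it explicitly
# instead of letting a "not found" result pass silently.
if ! grep -q 'Apply complete' apply.log; then
  echo 'success marker missing from apply output' >&2
  exit 1
fi
echo 'deployment verified'
```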

Automating infrastructure is not about making things "faster"—it is about making them predictable. When you can trust that a push to main will do exactly what the PR promised, you stop being a firefighter and start being an engineer.
