Abraham Naiborhu

Posted on May 26

Terraform Drift Detection and Recovery on Google Cloud: Plan, Import, State, and GitHub Actions

#terraform #googlecloud #devops #sre

Hi! This is my fourth Terraform artifact. Creating infrastructure with Terraform is one thing; using Terraform when reality changes outside the code is another.

For the fourth artifact in my Terraform x Google Cloud portfolio, I wanted to explore a more operational topic:

Terraform drift detection, import, and state recovery.

In this project, I am not building a large infrastructure platform. The project is about understanding what happens when:

someone changes infrastructure manually
Terraform state no longer matches real infrastructure
an existing resource needs to be imported
a resource is accidentally removed from state
drift needs to be detected automatically

The artifact repository:

terraform-gcp-drift-import-recovery

The main goal:

Understand Terraform failure modes, not only Terraform provisioning.

Why I Built This Artifact

My previous Terraform artifacts focused on building and delivering infrastructure.

The portfolio journey looked like this:

Artifact 1 — Terraform GCP Foundation
Artifact 2 — Production-Lite GCP Web Platform
Artifact 3 — Terraform CI/CD with GitHub Actions and WIF
Artifact 4 — Drift Detection, Import, and State Recovery

The first three artifacts answered:

Can I create infrastructure?
Can I structure it properly?
Can I expose it securely?
Can I automate Terraform plan and apply?

This fourth artifact answers a different question:

What happens when Terraform is no longer the only actor changing infrastructure?

That question matters because real cloud environments are rarely perfect.

Someone may change a firewall rule manually.

Someone may create a resource directly from the console.

Someone may remove a resource from Terraform state by mistake.

Someone may perform an emergency change and forget to update the code.

This artifact is about detecting, understanding, and recovering from those situations.

What This Project Demonstrates

This project demonstrates:

Terraform remote state
baseline GCP infrastructure creation
manual drift simulation
terraform plan drift detection
terraform plan -detailed-exitcode
terraform plan -refresh-only
terraform state list
terraform state show
terraform state pull
terraform state rm
terraform import
importing manually created GCP resources
recovering after state removal
scheduled drift detection with GitHub Actions
GitHub issue creation when drift is detected
Workload Identity Federation
no service account JSON key

The lab intentionally uses simple Google Cloud resources:

VPC
subnet
firewall rule
service account

That is deliberate.

The objective is not infrastructure complexity.

The objective is Terraform behavior.

Why I Created a Separate Repository

I created this as a separate repository instead of adding it to my production-lite web platform repo.

Reason:

This lab intentionally creates abnormal infrastructure situations.

It includes:

manual changes outside Terraform
state removal
resource import
drift simulation
drift detection
recovery scenarios

Those activities should not be mixed into a main application platform repository.

This project is a controlled failure-mode lab.

Repository Structure

The repository is structured like this:

terraform-gcp-drift-import-recovery/
├── README.md
├── .gitignore
├── versions.tf
├── providers.tf
├── backend.tf.example
├── main.tf
├── variables.tf
├── outputs.tf
├── terraform.tfvars.example
├── locals.tf
├── environments/
│   └── dev.tfvars
├── imports/
│   ├── imported-network.tf.example
│   └── imported-firewall.tf.example
├── docs/
│   ├── architecture.md
│   ├── drift-detection-lab.md
│   ├── import-lab.md
│   ├── state-recovery-lab.md
│   ├── scheduled-drift-detection.md
│   ├── operations-runbook.md
│   ├── verification.md
│   ├── design-decisions.md
│   └── release-process.md
├── scripts/
│   ├── simulate-drift.sh
│   ├── create-manual-import-resources.sh
│   ├── cleanup-manual-import-resources.sh
│   └── bootstrap-wif-github.sh
└── .github/
    ├── workflows/
    │   ├── terraform-plan.yml
    │   └── drift-detection.yml
    └── ISSUE_TEMPLATE/
        └── drift_report.md

The structure separates:

baseline Terraform code
manual drift scripts
import examples
state recovery documentation
scheduled drift detection workflow
GitHub issue template

Baseline Infrastructure

The baseline infrastructure is intentionally small.

It creates:

custom VPC
subnet
firewall rule
service account

The firewall rule is the main drift target.

Why?

Because firewall rules are easy to modify manually and easy for Terraform to detect.

For example, Terraform may define:

allowed port: tcp:80

Then someone manually changes it to:

allowed ports: tcp:80,tcp:8080

That difference is drift.

Remote State Design

This project uses a GCS backend for Terraform state.

I added two GitHub repository variables for the state configuration:

TF_STATE_BUCKET = terraform-gcp-production-lite-tfstate
TF_STATE_PREFIX = terraform-gcp-drift-import-recovery/dev

Instead of committing a real backend.tf, the GitHub Actions workflow writes the backend config dynamically.

The workflow step looks like this:

- name: Write Terraform backend config
  shell: bash
  env:
    TF_STATE_BUCKET: ${{ vars.TF_STATE_BUCKET }}
    TF_STATE_PREFIX: ${{ vars.TF_STATE_PREFIX }}
  run: |
    set -euo pipefail

    if [[ -z "${TF_STATE_BUCKET}" ]]; then
      echo "TF_STATE_BUCKET repository variable is required." >&2
      exit 1
    fi

    if [[ -z "${TF_STATE_PREFIX}" ]]; then
      echo "TF_STATE_PREFIX repository variable is required." >&2
      exit 1
    fi

    cat > backend.tf <<EOF
    terraform {
      backend "gcs" {
        bucket = "${TF_STATE_BUCKET}"
        prefix = "${TF_STATE_PREFIX}"
      }
    }
    EOF

This keeps the backend configuration flexible across environments.

It also avoids hardcoding my state bucket and prefix directly in the workflow logic.
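The same idea can be reused outside CI, for example when running the labs from a laptop against the same state. The following is a minimal sketch, assuming the bucket and prefix are available as shell values; the function name `write_backend` and its optional third argument are illustrative and not part of the repository.

```shell
# Write a GCS backend block to a file (default: backend.tf).
# Arguments: bucket, prefix, optional target path.
write_backend() {
  bucket="$1"; prefix="$2"; target="${3:-backend.tf}"

  # Refuse to write a backend with missing values.
  if [ -z "$bucket" ] || [ -z "$prefix" ]; then
    echo "bucket and prefix are required" >&2
    return 1
  fi

  cat > "$target" <<EOF
terraform {
  backend "gcs" {
    bucket = "${bucket}"
    prefix = "${prefix}"
  }
}
EOF
}
```

A local run could then look like `write_backend "$TF_STATE_BUCKET" "$TF_STATE_PREFIX" && terraform init -input=false`, which mirrors what the workflow step does.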

Lab 1 — Simulating Drift

The first lab simulates drift manually.

After deploying the baseline infrastructure, I modify the firewall rule outside Terraform.

Example:

export FIREWALL_RULE_NAME="$(terraform output -raw firewall_rule_name)"

gcloud compute firewall-rules update "${FIREWALL_RULE_NAME}" \
  --allow=tcp:80,tcp:8080

At this point:

Terraform code says: tcp:80
Google Cloud says: tcp:80,tcp:8080

That is drift.

Real infrastructure no longer matches the Terraform code, and Terraform state still records the old value until the next refresh.
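Conceptually, drift is a mismatch between a desired value and an observed value. The sketch below illustrates that comparison for the firewall example, independent of Terraform. The function names are hypothetical, and the inputs are assumed to be comma-separated `protocol:port` strings like the ones shown above.

```shell
# Normalize a comma-separated rule list so ordering and duplicates do not matter.
normalize_rules() {
  printf '%s\n' "$1" | tr ',' '\n' | sed '/^$/d' | sort -u | paste -sd ',' -
}

# Succeeds (exit 0) when the observed rules differ from the desired rules.
# Arguments: desired, observed.
rules_drifted() {
  [ "$(normalize_rules "$1")" != "$(normalize_rules "$2")" ]
}
```

For example, `rules_drifted "tcp:80" "tcp:80,tcp:8080"` succeeds, while the same ports in a different order are treated as equal. Terraform performs a far richer version of this comparison across every attribute of every managed resource.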

Lab 2 — Detecting Drift with Terraform Plan

After simulating drift, I run:

terraform plan -var-file=environments/dev.tfvars

Terraform detects that the real firewall rule no longer matches the desired configuration.

The plan proposes to restore the firewall rule back to the Terraform-defined state.

That is the simplest drift detection mechanism:

Run terraform plan.
If Terraform proposes unexpected changes, investigate.

Lab 3 — Using `-detailed-exitcode`

For automation, I used:

terraform plan \
  -input=false \
  -no-color \
  -detailed-exitcode \
  -var-file=environments/dev.tfvars \
  -out=tfplan

The important behavior:

exit code 0 = no changes
exit code 1 = error
exit code 2 = changes detected

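This is useful because a CI/CD workflow can use the exit code to determine whether drift exists. Exit code 2 only says that the plan is non-empty, so it signals drift only when the committed configuration has already been applied.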

In this artifact:

0 means no drift
2 means drift detected
1 means Terraform error
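One practical detail: in a script that runs with `set -e`, a non-zero exit code would normally abort the script before it can be inspected. A minimal sketch of capturing and classifying the code follows. The function names and the labels `clean`, `changes`, and `error` are illustrative assumptions.

```shell
# Map a `terraform plan -detailed-exitcode` status to a readable label.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;    # plan succeeded, no changes
    2) echo "changes" ;;  # plan succeeded, changes present
    *) echo "error" ;;    # anything else is treated as a failure
  esac
}

# Run a command, send its output to the file named by the first argument,
# and print the label for its exit status. The `|| status=$?` form keeps
# `set -e` from aborting the script on exit code 2.
plan_status() {
  out_file="$1"; shift
  status=0
  "$@" > "$out_file" 2>&1 || status=$?
  classify_plan_exit "$status"
}
```

Usage could look like `plan_status drift-plan.txt terraform plan -input=false -no-color -detailed-exitcode -var-file=environments/dev.tfvars`.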

Lab 4 — Refresh-Only Planning

I also documented refresh-only planning:

terraform plan \
  -var-file=environments/dev.tfvars \
  -refresh-only

This is useful when I want to inspect how real infrastructure differs from Terraform state.

Normal plan answers:

What would Terraform change to make infrastructure match the code?

Refresh-only plan answers:

What has changed in real infrastructure compared to Terraform state?

Both are useful, but they answer different operational questions.

Lab 5 — Recovering from Drift

If the manual change was not approved, the recovery is simple:

terraform apply -var-file=environments/dev.tfvars

Terraform restores the resource to the desired configuration.

In this case, it removes the manually added port 8080.

Then I can confirm that the environment is clean:

terraform plan \
  -var-file=environments/dev.tfvars \
  -detailed-exitcode

Expected result:

exit code 0

That means Terraform sees no pending changes.

Lab 6 — Importing Existing Infrastructure

The next scenario is different.

Instead of changing an existing Terraform-managed resource, I create a resource manually outside Terraform.

Example:

gcloud compute networks create manual-import-network \
  --project="${PROJECT_ID}" \
  --subnet-mode=custom

This VPC exists in Google Cloud, but Terraform does not know about it.

To bring it under Terraform management, I need two things:

Terraform resource configuration
terraform import

Example Terraform configuration:

resource "google_compute_network" "imported_network" {
  name                    = "manual-import-network"
  auto_create_subnetworks = false
  routing_mode            = "REGIONAL"
}

Then import:

terraform import \
  -var-file=environments/dev.tfvars \
  google_compute_network.imported_network \
  "projects/${PROJECT_ID}/global/networks/manual-import-network"

After import:

terraform state list
terraform state show google_compute_network.imported_network
terraform plan -var-file=environments/dev.tfvars

The objective is to reach:

No changes.

That means the Terraform configuration and imported real resource are aligned.
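The import ID is the easiest part of the command to get wrong. A small helper can build it consistently. This is a sketch that covers only the two ID formats used in this post (global networks and global firewalls); the function name is hypothetical.

```shell
# Build the import ID for a global Compute Engine resource.
# kind is the collection segment of the ID: "networks" or "firewalls".
gcp_global_import_id() {
  project="$1"; kind="$2"; name="$3"
  case "$kind" in
    networks|firewalls)
      printf 'projects/%s/global/%s/%s\n' "$project" "$kind" "$name"
      ;;
    *)
      echo "unsupported kind: $kind" >&2
      return 1
      ;;
  esac
}
```

It could be used as `terraform import -var-file=environments/dev.tfvars google_compute_network.imported_network "$(gcp_global_import_id "$PROJECT_ID" networks manual-import-network)"`.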

Lab 7 — State Inspection

Terraform state is the mapping between Terraform resource addresses and real infrastructure objects.

Useful commands:

terraform state list

terraform state show google_compute_firewall.allow_http_internal

terraform state pull > state-backups/before-recovery.json

I included state inspection because drift and import are hard to understand without understanding state.

Terraform state answers:

Which real resource is this Terraform resource address connected to?

Lab 8 — State Recovery After Accidental State Removal

This lab simulates a state mistake.

First, I back up state:

mkdir -p state-backups

terraform state pull > state-backups/before-state-rm.json

Then I remove a resource from state:

terraform state rm google_compute_firewall.allow_http_internal

Important:

This does not delete the real firewall rule.
It only removes the mapping from Terraform state.

Now Terraform no longer knows that the real firewall rule belongs to the configuration.

If I run:

terraform plan -var-file=environments/dev.tfvars

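Terraform plans to create the firewall rule again.

But the resource already exists in Google Cloud, so applying that plan would most likely fail with an "already exists" error instead of fixing anything.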

The correct recovery is to import it back:

terraform import \
  -var-file=environments/dev.tfvars \
  google_compute_firewall.allow_http_internal \
  "projects/${PROJECT_ID}/global/firewalls/${FIREWALL_RULE_NAME}"

Then verify:

terraform state list
terraform plan -var-file=environments/dev.tfvars

The target result:

No changes.
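Before importing, it can help to confirm that the address really is missing from state, since importing an address that is already tracked fails. A minimal sketch, assuming the output of `terraform state list` is piped in on stdin; both function names are hypothetical.

```shell
# Succeeds when the exact resource address appears in the state list on stdin.
address_in_state() {
  grep -Fxq -- "$1"
}

# Print the action to take for an address, given a state list on stdin.
import_action() {
  if address_in_state "$1"; then
    echo "already-managed"
  else
    echo "import"
  fi
}
```

For example, `terraform state list | import_action google_compute_firewall.allow_http_internal` prints `import` right after the `terraform state rm` step and `already-managed` once the import has succeeded.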

Lab 9 — Scheduled Drift Detection with GitHub Actions

The final part of this artifact is scheduled drift detection.

The workflow is triggered by:

workflow_dispatch
schedule: every day at 23:00 UTC (cron: 0 23 * * *)

The workflow performs:

checkout repository
set up Terraform
authenticate to Google Cloud using WIF
write backend.tf from repository variables
terraform init
terraform validate
terraform plan -detailed-exitcode
upload drift plan artifact
create GitHub issue if drift is detected

The key step:

terraform plan \
  -input=false \
  -no-color \
  -detailed-exitcode \
  -var-file=environments/dev.tfvars \
  -out=tfplan > drift-plan.txt

Then the workflow evaluates the exit code:

if [[ "${EXIT_CODE}" -eq 0 ]]; then
  echo "No drift detected."
  exit 0
elif [[ "${EXIT_CODE}" -eq 2 ]]; then
  echo "Drift detected."
  exit 0
else
  echo "Terraform plan failed."
  exit 1
fi

This is intentional.

If drift is detected, the workflow does not fail as a broken pipeline.

It creates a GitHub issue.

Why?

Because drift is not always a technical failure.

Sometimes drift is an intentional emergency change that needs to be reconciled.

The workflow should create visibility, not immediately destroy or revert something.
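The exit-code check in this lab reads `EXIT_CODE`, but the post does not show where that value is set. One possible way to capture it and expose it to later steps is sketched here. The function name and the output key `exit_code` are assumptions; `GITHUB_OUTPUT` is the file path that GitHub Actions provides to each step.

```shell
# Run the plan command given as arguments, save its output to drift-plan.txt,
# and record the exit code for later workflow steps.
record_plan_exit() {
  EXIT_CODE=0
  # `|| EXIT_CODE=$?` keeps a `set -e` shell alive when the plan returns 2.
  "$@" > drift-plan.txt 2>&1 || EXIT_CODE=$?
  # Outside GitHub Actions GITHUB_OUTPUT is unset, so fall back to /dev/null.
  echo "exit_code=${EXIT_CODE}" >> "${GITHUB_OUTPUT:-/dev/null}"
}
```

With a step id such as `plan` (also an assumption), the issue-creation step could then be guarded with a condition like `steps.plan.outputs.exit_code == '2'`.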

GitHub Issue Creation When Drift Is Detected

When Terraform returns exit code 2, the workflow creates a GitHub issue:

gh issue create \
  --title "${ISSUE_TITLE}" \
  --body-file drift-issue-body.md \
  --label "drift,terraform,incident"

The issue tells the operator to:

download the drift plan artifact
review whether the drift was intentional or accidental
revert accidental drift with Terraform apply
update Terraform code if the change was intentional
close the issue only after plan returns exit code 0
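How `drift-issue-body.md` is produced is not shown in this post. The following sketch is one way such a body could be assembled from the saved plan output. The function name, the wording of the body, and the 200-line default are all assumptions; truncation is there because very long plans make poor issue bodies and the full plan is available as an artifact anyway.

```shell
# Compose an issue body from a saved plan output.
# Arguments: plan file, body file, optional maximum number of plan lines.
write_issue_body() {
  plan_file="$1"; body_file="$2"; max_lines="${3:-200}"
  total="$(wc -l < "$plan_file" | tr -d ' ')"
  {
    echo "Terraform plan reported changes (exit code 2)."
    echo
    echo "Plan output:"
    echo
    # Indent by four spaces so Markdown renders the plan as a code block.
    head -n "$max_lines" "$plan_file" | sed 's/^/    /'
    if [ "$total" -gt "$max_lines" ]; then
      echo
      echo "Output truncated: showing ${max_lines} of ${total} lines. See the uploaded plan artifact for the full plan."
    fi
  } > "$body_file"
}
```

Called as `write_issue_body drift-plan.txt drift-issue-body.md`, it produces a file that can be passed to `gh issue create --body-file`.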

This turns drift detection into an operational workflow.

Not just a command.

Authentication with Workload Identity Federation

Like my previous Terraform CI/CD artifact, this project does not use a service account JSON key.

GitHub Actions authenticates to Google Cloud using Workload Identity Federation.

The flow is:

GitHub Actions OIDC token
  -> Google Workload Identity Provider
  -> Terraform drift CI/CD service account
  -> Google Cloud APIs

This keeps the workflow keyless.

No long-lived Google Cloud service account key is stored in GitHub.
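The link between the identity provider and the service account is an IAM binding whose member identifies the GitHub repository. The sketch below builds that member string. It assumes the provider maps the token's repository claim to `attribute.repository`, which is a common setup but is not confirmed by this post; the function name and all three argument values are placeholders.

```shell
# Build the IAM member that matches one GitHub repository in a
# Workload Identity Pool.
# Arguments: project number, pool id, repository as "owner/name".
wif_repo_member() {
  project_number="$1"; pool_id="$2"; repo="$3"
  printf 'principalSet://iam.googleapis.com/projects/%s/locations/global/workloadIdentityPools/%s/attribute.repository/%s\n' \
    "$project_number" "$pool_id" "$repo"
}
```

That member would typically be granted `roles/iam.workloadIdentityUser` on the CI/CD service account, so only workflows from that repository can impersonate it.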

Why I Do Not Auto-Apply Drift Fixes

It may be tempting to do this:

detect drift
automatically run terraform apply

I intentionally did not do that.

Reason:

Not all drift is accidental.

Some drift may come from:

emergency production fix
approved console change
another migration process
temporary incident workaround

The right first response is:

detect
notify
review
decide
then reconcile

Automatic remediation can be dangerous if the workflow does not understand why the drift happened.

So this artifact creates a GitHub issue instead.

What This Artifact Taught Me

The biggest lesson:

Terraform is not only a provisioning tool.
Terraform is also a control system.

What I Would Do Differently in Production

For a production environment, I would improve this further with:

least-privilege custom IAM role
Slack or email notification
deduplication of drift issues
severity classification
separate environments
policy-as-code checks
cost impact review
state backup automation
manual approval before remediation

But for this artifact, the core objective is complete:

simulate drift
detect drift
understand state
recover state
import resources
automate drift visibility
