Hi! This is my fourth Terraform artifact. Creating infrastructure with Terraform is one thing; using Terraform when reality changes outside the code is another.
For the fourth artifact in my Terraform x Google Cloud portfolio, I wanted to explore a more operational topic:
Terraform drift detection, import, and state recovery.
In this project, I am not building a large infrastructure platform. The project is about understanding what happens when:
someone changes infrastructure manually
Terraform state no longer matches real infrastructure
an existing resource needs to be imported
a resource is accidentally removed from state
drift needs to be detected automatically
The artifact repository:
terraform-gcp-drift-import-recovery
The main goal:
Understand Terraform failure modes, not only Terraform provisioning.
Why I Built This Artifact
My previous Terraform artifacts focused on building and delivering infrastructure.
The portfolio journey looked like this:
Artifact 1 — Terraform GCP Foundation
Artifact 2 — Production-Lite GCP Web Platform
Artifact 3 — Terraform CI/CD with GitHub Actions and WIF
Artifact 4 — Drift Detection, Import, and State Recovery
The first three artifacts answered:
Can I create infrastructure?
Can I structure it properly?
Can I expose it securely?
Can I automate Terraform plan and apply?
This fourth artifact answers a different question:
What happens when Terraform is no longer the only actor changing infrastructure?
That question matters because real cloud environments are rarely perfect.
Someone may change a firewall rule manually.
Someone may create a resource directly from the console.
Someone may remove a resource from Terraform state by mistake.
Someone may perform an emergency change and forget to update the code.
This artifact is about detecting, understanding, and recovering from those situations.
What This Project Demonstrates
This project demonstrates:
Terraform remote state
baseline GCP infrastructure creation
manual drift simulation
terraform plan drift detection
terraform plan -detailed-exitcode
terraform plan -refresh-only
terraform state list
terraform state show
terraform state pull
terraform state rm
terraform import
importing manually created GCP resources
recovering after state removal
scheduled drift detection with GitHub Actions
GitHub issue creation when drift is detected
Workload Identity Federation
no service account JSON key
The lab intentionally uses simple Google Cloud resources:
VPC
subnet
firewall rule
service account
That is deliberate.
The objective is not infrastructure complexity.
The objective is Terraform behavior.
Why I Created a Separate Repository
I created this as a separate repository instead of adding it into my production-lite web platform repo.
Reason:
This lab intentionally creates abnormal infrastructure situations.
It includes:
manual changes outside Terraform
state removal
resource import
drift simulation
drift detection
recovery scenarios
Those activities should not be mixed into a main application platform repository.
This project is a controlled failure-mode lab.
Repository Structure
The repository is structured like this:
terraform-gcp-drift-import-recovery/
├── README.md
├── .gitignore
├── versions.tf
├── providers.tf
├── backend.tf.example
├── main.tf
├── variables.tf
├── outputs.tf
├── terraform.tfvars.example
├── locals.tf
├── environments/
│   └── dev.tfvars
├── imports/
│   ├── imported-network.tf.example
│   └── imported-firewall.tf.example
├── docs/
│   ├── architecture.md
│   ├── drift-detection-lab.md
│   ├── import-lab.md
│   ├── state-recovery-lab.md
│   ├── scheduled-drift-detection.md
│   ├── operations-runbook.md
│   ├── verification.md
│   ├── design-decisions.md
│   └── release-process.md
├── scripts/
│   ├── simulate-drift.sh
│   ├── create-manual-import-resources.sh
│   ├── cleanup-manual-import-resources.sh
│   └── bootstrap-wif-github.sh
└── .github/
    ├── workflows/
    │   ├── terraform-plan.yml
    │   └── drift-detection.yml
    └── ISSUE_TEMPLATE/
        └── drift_report.md
The structure separates:
baseline Terraform code
manual drift scripts
import examples
state recovery documentation
scheduled drift detection workflow
GitHub issue template
Baseline Infrastructure
The baseline infrastructure is intentionally small.
It creates:
custom VPC
subnet
firewall rule
service account
The firewall rule is the main drift target.
Why?
Because firewall rules are easy to modify manually, and changes to them are easy for Terraform to detect.
For example, Terraform may define:
allowed port: tcp:80
Then someone manually changes it to:
allowed ports: tcp:80,tcp:8080
That difference is drift.
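To make the idea concrete, here is a minimal sketch of the comparison in plain shell. This is not how Terraform works internally, and the function name extra_ports is hypothetical. It only lists entries that exist in the real rule but not in the code, using the values from this lab.

```shell
# extra_ports DESIRED ACTUAL
# Prints entries present in ACTUAL but not in DESIRED, one per line.
# Both arguments are comma-separated lists such as "tcp:80,tcp:8080".
# Hypothetical helper for illustration only.
extra_ports() {
  desired="$1"
  actual="$2"
  for entry in $(printf '%s' "$actual" | tr ',' ' '); do
    case ",${desired}," in
      *",${entry},"*) ;;            # declared in code: not drift
      *) printf '%s\n' "$entry" ;;  # only in the cloud: drift
    esac
  done
}

# Values from this lab: code says tcp:80, cloud says tcp:80,tcp:8080.
extra_ports "tcp:80" "tcp:80,tcp:8080"   # prints tcp:8080
```

In the real lab, terraform plan does this comparison for every attribute of every managed resource, which is why I rely on it rather than on hand-written checks.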
Remote State Design
This project uses a GCS backend for Terraform state.
I added two GitHub repository variables for the state configuration:
TF_STATE_BUCKET = terraform-gcp-production-lite-tfstate
TF_STATE_PREFIX = terraform-gcp-drift-import-recovery/dev
Instead of committing a real backend.tf, the GitHub Actions workflow writes the backend config dynamically.
The workflow step looks like this:
- name: Write Terraform backend config
  shell: bash
  env:
    TF_STATE_BUCKET: ${{ vars.TF_STATE_BUCKET }}
    TF_STATE_PREFIX: ${{ vars.TF_STATE_PREFIX }}
  run: |
    set -euo pipefail
    if [[ -z "${TF_STATE_BUCKET}" ]]; then
      echo "TF_STATE_BUCKET repository variable is required." >&2
      exit 1
    fi
    if [[ -z "${TF_STATE_PREFIX}" ]]; then
      echo "TF_STATE_PREFIX repository variable is required." >&2
      exit 1
    fi
    cat > backend.tf <<EOF
    terraform {
      backend "gcs" {
        bucket = "${TF_STATE_BUCKET}"
        prefix = "${TF_STATE_PREFIX}"
      }
    }
    EOF
This keeps the backend configuration flexible across environments.
It also avoids hardcoding my state bucket and prefix directly in the workflow logic.
Lab 1 — Simulating Drift
The first lab simulates drift manually.
After deploying the baseline infrastructure, I modify the firewall rule outside Terraform.
Example:
export FIREWALL_RULE_NAME="$(terraform output -raw firewall_rule_name)"
gcloud compute firewall-rules update "${FIREWALL_RULE_NAME}" \
  --allow=tcp:80,tcp:8080
At this point:
Terraform code says: tcp:80
Google Cloud says: tcp:80,tcp:8080
That is drift.
The Terraform code and the last recorded Terraform state still say tcp:80, but the real infrastructure no longer matches them.
Lab 2 — Detecting Drift with Terraform Plan
After simulating drift, I run:
terraform plan -var-file=environments/dev.tfvars
Terraform detects that the real firewall rule no longer matches the desired configuration.
The plan proposes to restore the firewall rule back to the Terraform-defined state.
That is the simplest drift detection mechanism:
Run terraform plan.
If Terraform proposes unexpected changes, investigate.
Lab 3 — Using -detailed-exitcode
For automation, I used:
terraform plan \
  -input=false \
  -no-color \
  -detailed-exitcode \
  -var-file=environments/dev.tfvars \
  -out=tfplan
The important behavior:
exit code 0 = no changes
exit code 1 = error
exit code 2 = changes detected
This is useful because a CI/CD workflow can use the exit code to determine whether drift exists.
In this artifact:
0 means no drift
2 means drift detected
1 means Terraform error
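The mapping above can be sketched as a small shell function. The function name classify_plan_exit and the labels it prints are my own choices for illustration; only the three exit codes come from Terraform.

```shell
# classify_plan_exit CODE
# Maps the exit status of `terraform plan -detailed-exitcode` to a label.
# Hypothetical helper; the labels are arbitrary.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;   # plan succeeded, nothing to change
    2) echo "drift" ;;   # plan succeeded, changes are pending
    *) echo "error" ;;   # 1 (or anything unexpected) is treated as failure
  esac
}

classify_plan_exit 2   # prints drift
```

Treating every unknown code as an error is a deliberate choice: I would rather get a false alarm than silently report "no drift" after a broken run.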
Lab 4 — Refresh-Only Planning
I also documented refresh-only planning:
terraform plan \
  -var-file=environments/dev.tfvars \
  -refresh-only
This is useful when I want to inspect how real infrastructure differs from Terraform state.
Normal plan answers:
What would Terraform change to make infrastructure match the code?
Refresh-only plan answers:
What has changed in real infrastructure compared to Terraform state?
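Both are useful, but they answer different operational questions. A refresh-only plan never proposes changes to real infrastructure; if it is applied, it only updates the state to match what exists.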
Lab 5 — Recovering from Drift
If the manual change was not approved, the recovery is simple:
terraform apply -var-file=environments/dev.tfvars
Terraform restores the resource to the desired configuration.
In this case, it removes the manually added port 8080.
Then I can confirm that the environment is clean:
terraform plan \
  -var-file=environments/dev.tfvars \
  -detailed-exitcode
Expected result:
exit code 0
That means Terraform sees no pending changes.
Lab 6 — Importing Existing Infrastructure
The next scenario is different.
Instead of changing an existing Terraform-managed resource, I create a resource manually outside Terraform.
Example:
gcloud compute networks create manual-import-network \
  --project="${PROJECT_ID}" \
  --subnet-mode=custom
This VPC exists in Google Cloud, but Terraform does not know about it.
To bring it under Terraform management, I need two things:
Terraform resource configuration
terraform import
Example Terraform configuration:
resource "google_compute_network" "imported_network" {
  name                    = "manual-import-network"
  auto_create_subnetworks = false
  routing_mode            = "REGIONAL"
}
Then import:
terraform import \
  -var-file=environments/dev.tfvars \
  google_compute_network.imported_network \
  "projects/${PROJECT_ID}/global/networks/manual-import-network"
After import:
terraform state list
terraform state show google_compute_network.imported_network
terraform plan -var-file=environments/dev.tfvars
The objective is to reach:
No changes.
That means the Terraform configuration and imported real resource are aligned.
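One small thing that bit me during import is a wrong or half-empty import ID, for example when PROJECT_ID is not set in the shell. A minimal sketch of a guard, assuming the projects/.../global/... ID format used in this article; the function name gcp_global_import_id is hypothetical.

```shell
# gcp_global_import_id PROJECT COLLECTION NAME
# Builds an import ID of the form projects/<PROJECT>/global/<COLLECTION>/<NAME>
# and refuses to build one if any part is empty.
# Hypothetical helper for illustration only.
gcp_global_import_id() {
  if [ -z "${1:-}" ] || [ -z "${2:-}" ] || [ -z "${3:-}" ]; then
    echo "usage: gcp_global_import_id PROJECT COLLECTION NAME" >&2
    return 1
  fi
  printf 'projects/%s/global/%s/%s\n' "$1" "$2" "$3"
}

# "demo-project" is a placeholder project ID.
gcp_global_import_id "demo-project" networks manual-import-network
```

The printed ID can then be passed as the last argument of terraform import. The same helper covers the firewall import later in this article by using firewalls as the collection.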
Lab 7 — State Inspection
Terraform state is the mapping between Terraform resource addresses and real infrastructure objects.
Useful commands:
terraform state list
terraform state show google_compute_firewall.allow_http_internal
terraform state pull > state-backups/before-recovery.json
I included state inspection because drift and import are hard to understand without understanding state.
Terraform state answers:
Which real resource is this Terraform resource address connected to?
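Because every risky state operation should start with a backup, the state pull command above can be wrapped in a helper. This is a minimal sketch under my own assumptions: the function name backup_state and the STATE_BACKUP_DIR variable are hypothetical, and the state document is read from standard input.

```shell
# backup_state LABEL
# Reads a state document on stdin and stores it as
# <dir>/<LABEL>-<UTC timestamp>.json, refusing to keep an empty file.
# STATE_BACKUP_DIR is a hypothetical override; the default matches this lab.
backup_state() {
  label="$1"
  dir="${STATE_BACKUP_DIR:-state-backups}"
  target="${dir}/${label}-$(date -u +%Y%m%dT%H%M%SZ).json"
  mkdir -p "$dir"
  cat > "$target"
  if [ ! -s "$target" ]; then
    echo "state backup is empty, removing ${target}" >&2
    rm -f "$target"
    return 1
  fi
  printf '%s\n' "$target"   # print the path so callers can log it
}

# Usage (assumes an initialised Terraform working directory):
#   terraform state pull | backup_state before-state-rm
```

The empty-file check matters because a failed state pull on the left side of the pipe would otherwise leave a zero-byte "backup" behind. State files can contain sensitive values, so the backup directory should stay out of version control.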
Lab 8 — State Recovery After Accidental State Removal
This lab simulates a state mistake.
First, I back up state:
mkdir -p state-backups
terraform state pull > state-backups/before-state-rm.json
Then I remove a resource from state:
terraform state rm google_compute_firewall.allow_http_internal
Important:
This does not delete the real firewall rule.
It only removes the mapping from Terraform state.
Now Terraform no longer knows that the real firewall rule belongs to the configuration.
If I run:
terraform plan -var-file=environments/dev.tfvars
Terraform proposes to create the firewall rule again.
But the resource already exists in Google Cloud, so applying that plan would be expected to fail with a name conflict instead of fixing anything.
The correct recovery is to import it back:
terraform import \
  -var-file=environments/dev.tfvars \
  google_compute_firewall.allow_http_internal \
  "projects/${PROJECT_ID}/global/firewalls/${FIREWALL_RULE_NAME}"
Then verify:
terraform state list
terraform plan -var-file=environments/dev.tfvars
The target result:
No changes.
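Before and after the import, it helps to check whether the address is really in state instead of reading the list by eye. A minimal sketch, assuming the output of terraform state list is piped in; the function name state_has is hypothetical.

```shell
# state_has ADDRESS
# Reads `terraform state list` output on stdin and succeeds only if ADDRESS
# appears as an exact, whole-line match (no regex, no partial matches).
# Hypothetical helper for illustration only.
state_has() {
  grep -Fxq -- "$1"
}

# Usage (assumes an initialised Terraform working directory):
#   if terraform state list | state_has google_compute_firewall.allow_http_internal; then
#     echo "already tracked, no import needed"
#   else
#     echo "missing from state, import required"
#   fi
```

The -F and -x flags are the important part: resource addresses contain dots, and a plain grep would treat them as wildcards or match a longer address that only starts with the same text.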
Lab 9 — Scheduled Drift Detection with GitHub Actions
The final part of this artifact is scheduled drift detection.
The workflow has two triggers:
workflow_dispatch (manual run)
schedule: every day at 23:00 UTC
The workflow performs:
checkout repository
set up Terraform
authenticate to Google Cloud using WIF
write backend.tf from repository variables
terraform init
terraform validate
terraform plan -detailed-exitcode
upload drift plan artifact
create GitHub issue if drift is detected
The key step:
terraform plan \
  -input=false \
  -no-color \
  -detailed-exitcode \
  -var-file=environments/dev.tfvars \
  -out=tfplan > drift-plan.txt
Then the workflow evaluates the exit code:
if [[ "${EXIT_CODE}" -eq 0 ]]; then
  echo "No drift detected."
  exit 0
elif [[ "${EXIT_CODE}" -eq 2 ]]; then
  echo "Drift detected."
  exit 0
else
  echo "Terraform plan failed."
  exit 1
fi
This is intentional.
If drift is detected, the workflow does not fail as a broken pipeline.
It creates a GitHub issue.
Why?
Because drift is not always a technical failure.
Sometimes drift is an intentional emergency change that needs to be reconciled.
The workflow should create visibility, not immediately destroy or revert something.
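One detail the snippets above do not show is how EXIT_CODE gets its value. A shell running in fail-fast mode (set -e) would stop the step as soon as the plan returns 2, so the code has to be captured without aborting. A minimal sketch of one way to do that; the function name run_and_capture is hypothetical and this is not necessarily how the workflow in the repository does it.

```shell
# run_and_capture CMD [ARGS...]
# Runs the command, never aborts the caller (even under `set -e`), and
# stores the command's exit status in the EXIT_CODE variable.
# Hypothetical helper for illustration only.
run_and_capture() {
  EXIT_CODE=0
  "$@" || EXIT_CODE=$?
}

# Usage inside the workflow step (assumes terraform is installed):
#   run_and_capture terraform plan -input=false -no-color -detailed-exitcode \
#     -var-file=environments/dev.tfvars -out=tfplan > drift-plan.txt
#   echo "exit_code=${EXIT_CODE}" >> "${GITHUB_OUTPUT}"
```

Writing the value to GITHUB_OUTPUT lets a later step, such as the issue creation, run only when the code is 2.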
GitHub Issue Creation When Drift Is Detected
When Terraform returns exit code 2, the workflow creates a GitHub issue:
gh issue create \
  --title "${ISSUE_TITLE}" \
  --body-file drift-issue-body.md \
  --label "drift,terraform,incident"
The issue tells the operator to:
download the drift plan artifact
review whether the drift was intentional or accidental
revert accidental drift with Terraform apply
update Terraform code if the change was intentional
close the issue only after plan returns exit code 0
This turns drift detection into an operational workflow, not just a command.
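The issue body file can be generated from the saved plan output. Here is a minimal sketch, assuming the plan text was saved to a file such as drift-plan.txt; the function name render_drift_issue_body, the heading text, and the default of 200 lines are my own choices, not taken from the repository.

```shell
# render_drift_issue_body PLAN_FILE [MAX_LINES]
# Prints a Markdown issue body: an operator checklist followed by the last
# MAX_LINES lines of the plan output (default 200).
# Hypothetical helper for illustration only.
render_drift_issue_body() {
  plan_file="$1"
  max_lines="${2:-200}"
  cat <<EOF
## Terraform drift detected

Operator checklist:

- [ ] Download the drift plan artifact
- [ ] Decide whether the drift was intentional or accidental
- [ ] Accidental: revert with terraform apply
- [ ] Intentional: update the Terraform code
- [ ] Close this issue only after plan returns exit code 0

Plan output (last ${max_lines} lines):

EOF
  # Indent by four spaces so Markdown renders the plan as a code block.
  tail -n "$max_lines" "$plan_file" | sed 's/^/    /'
}

# Usage:
#   render_drift_issue_body drift-plan.txt > drift-issue-body.md
```

Only the tail of the plan goes into the issue because the full plan is already uploaded as an artifact, and the change summary sits at the end of the output.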
Authentication with Workload Identity Federation
Like my previous Terraform CI/CD artifact, this project does not use a service account JSON key.
GitHub Actions authenticates to Google Cloud using Workload Identity Federation.
The flow is:
GitHub Actions OIDC token
-> Google Workload Identity Provider
-> Terraform drift CI/CD service account
-> Google Cloud APIs
This keeps the workflow keyless.
No long-lived Google Cloud service account key is stored in GitHub.
Why I Do Not Auto-Apply Drift Fixes
It may be tempting to do this:
detect drift
automatically run terraform apply
I intentionally did not do that.
Reason:
Not all drift is accidental.
Some drift may come from:
emergency production fix
approved console change
another migration process
temporary incident workaround
The right first response is:
detect
notify
review
decide
then reconcile
Automatic remediation can be dangerous if the workflow does not understand why the drift happened.
So this artifact creates a GitHub issue instead.
What This Artifact Taught Me
The biggest lesson:
Terraform is not only a provisioning tool.
Terraform is also a control system.
What I Would Do Differently in Production
For a production environment, I would improve this further with:
least-privilege custom IAM role
Slack or email notification
deduplication of drift issues
severity classification
separate environments
policy-as-code checks
cost impact review
state backup automation
manual approval before remediation
But for this artifact, the core objective is complete:
simulate drift
detect drift
understand state
recover state
import resources
automate drift visibility