Sudarshan Thakur

Posted on Jun 6

Setting Up Continuous Terraform Drift Monitoring With GitHub Actions and Slack

#devops #githubactions #monitoring #terraform

Most teams discover Terraform drift the hard way — someone runs terraform plan before a deploy and gets a screen full of unexpected changes. By then the drift might have been sitting there for weeks. Maybe longer.

What if you could catch it automatically? Run a scan every few hours, get a Slack message only when something important drifts, and ignore the noise?

That's what this tutorial sets up. By the end, you'll have:

A GitHub Actions workflow that scans your Terraform infrastructure on a schedule
Slack alerts that only fire for High and Critical severity drift
A JSON report saved as an artifact for audit trails
The whole thing running hands-free, zero maintenance

I'm using tfdrift for this — it's an open-source CLI I built that classifies drift by severity. But the GitHub Actions + Slack pattern works with any drift detection approach.

Prerequisites

Before we start, you'll need:

A GitHub repo with your Terraform code
AWS credentials (or whatever cloud provider you use)
A Slack workspace where you can create a webhook
Python 3.9+ (tfdrift is a Python CLI)
About 20 minutes

Step 1: Create a Slack webhook

First, let's set up where the alerts will go.

Go to api.slack.com/apps
Click "Create New App" → "From scratch"
Name it something like "Drift Alerts" and pick your workspace
Go to "Incoming Webhooks" in the sidebar → toggle it On
Click "Add New Webhook to Workspace"
Pick the channel you want alerts in (I use #infra-alerts)
Copy the webhook URL — it looks like https://hooks.slack.com/services/T00000/B00000/XXXX

Save that URL. We'll need it in a minute.
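Before wiring the webhook into CI, it can help to confirm it accepts messages. Here is a minimal sketch using only the Python standard library. Slack incoming webhooks accept a JSON body with a text field; the function names and the test message are my own placeholders, not part of tfdrift.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build a POST request carrying a plain-text Slack message."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_message(webhook_url: str) -> int:
    """Send a test message and return the HTTP status code."""
    request = build_slack_request(webhook_url, "Drift alerts webhook test")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling send_test_message with your real webhook URL should post one message to the channel you picked. Keep the URL out of shell history and committed files, since anyone who has it can post to that channel.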

Step 2: Add secrets to your GitHub repo

You need to store your cloud credentials and Slack webhook as GitHub secrets so they're not exposed in your workflow file.

Go to your repo → Settings → Secrets and variables → Actions → New repository secret

Add these:

AWS_ACCESS_KEY_ID        → your AWS access key
AWS_SECRET_ACCESS_KEY    → your AWS secret key
AWS_DEFAULT_REGION       → us-east-1 (or whatever region you use)
SLACK_WEBHOOK_URL        → the webhook URL from Step 1

If you're using AWS SSO or assuming roles, you'll need AWS_ROLE_ARN too — but the basic access key approach works for getting started.
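A secret that was never set expands to an empty string inside a workflow, which tends to surface later as a confusing authentication error. A small preflight check along these lines can catch that early. This is a sketch of my own, not something tfdrift ships; the variable list simply mirrors the four secrets above.

```python
import os

# Names of the secrets configured in this step.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "SLACK_WEBHOOK_URL",
)


def missing_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

In CI you could call missing_vars() with no arguments and exit non-zero if the returned list is not empty.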

Step 3: Create the GitHub Actions workflow

Create this file in your repo at .github/workflows/drift-check.yml:

name: Terraform Drift Check

on:
  schedule:
    # Run every 6 hours
    - cron: '0 */6 * * *'
  workflow_dispatch:
    # Allow manual triggering from the Actions tab

jobs:
  drift-check:
    runs-on: ubuntu-latest
    timeout-minutes: 30

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0
          terraform_wrapper: false

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install tfdrift
        run: pip install tfdrift

      - name: Run drift scan
        id: drift_scan
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}
          # Pass the webhook through the environment instead of
          # interpolating the secret into the shell script
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        run: |
          tfdrift scan \
            --path ./infrastructure \
            --format json \
            --output drift-report.json \
            --slack-webhook "$SLACK_WEBHOOK_URL" \
            --quiet
        continue-on-error: true

      - name: Upload drift report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: drift-report-${{ github.run_number }}
          path: drift-report.json
          retention-days: 30

      - name: Fail on drift
        if: steps.drift_scan.outcome == 'failure'
        run: |
          echo "::warning::Terraform drift detected! Check the drift report artifact."
          exit 1

Let me walk through what each part does.

The schedule trigger: cron: '0 */6 * * *' runs the scan every 6 hours — at midnight, 6am, noon, and 6pm UTC. You can adjust this. Every hour is '0 * * * *', once a day at midnight is '0 0 * * *'. I find every 6 hours is a good balance between catching drift quickly and not burning through GitHub Actions minutes.
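If you want to sanity-check a schedule before committing it, the sketch below expands the minute and hour fields of a cron expression into the times it fires each day. It only handles the simple forms used in this tutorial ('*', a single number, and '*/N'), and the function names are my own.

```python
def expand_step_field(field: str, minimum: int, maximum: int) -> list[int]:
    """Expand a cron field of the form '*', 'N', or '*/N' into explicit values."""
    if field == "*":
        return list(range(minimum, maximum + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(minimum, maximum + 1, step))
    return [int(field)]


def daily_run_times(cron: str) -> list[str]:
    """Return the UTC HH:MM times a five-field cron expression fires each day.

    Only the minute and hour fields are inspected; day, month, and weekday
    are assumed to be '*', as in the schedules used in this tutorial.
    """
    minute_field, hour_field, *_ = cron.split()
    minutes = expand_step_field(minute_field, 0, 59)
    hours = expand_step_field(hour_field, 0, 23)
    return [f"{h:02d}:{m:02d}" for h in hours for m in minutes]
```

For '0 */6 * * *' this yields 00:00, 06:00, 12:00, and 18:00. Treat those as nominal times: GitHub can delay scheduled runs when its runners are busy.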

workflow_dispatch: This lets you trigger the scan manually from the Actions tab. Useful for testing or when you want an immediate check.

terraform_wrapper: false: This is important. The hashicorp/setup-terraform action wraps the terraform binary by default, which breaks JSON output parsing. Setting this to false gives you the raw terraform binary.
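To see why clean output matters, here is a rough sketch of reading Terraform's JSON plan representation (what terraform show -json prints for a saved plan). The resource_changes list and its change.actions field are part of Terraform's documented JSON format. The function and the sample data are illustrative and not necessarily how tfdrift parses plans. Any extra text around the JSON would make json.loads raise an error.

```python
import json


def changed_resources(plan_json: str) -> list[tuple[str, list[str]]]:
    """Return (address, actions) for every resource the plan would change."""
    plan = json.loads(plan_json)
    changes = []
    for resource in plan.get("resource_changes", []):
        actions = resource["change"]["actions"]
        # Terraform reports untouched resources with the single action "no-op".
        if actions != ["no-op"]:
            changes.append((resource["address"], actions))
    return changes


# Hand-written sample in the shape of Terraform's plan JSON.
SAMPLE_PLAN = json.dumps({
    "format_version": "1.2",
    "resource_changes": [
        {"address": "aws_instance.web_server",
         "change": {"actions": ["update"]}},
        {"address": "aws_s3_bucket.logs",
         "change": {"actions": ["no-op"]}},
    ],
})
```

With the sample above, changed_resources(SAMPLE_PLAN) reports only aws_instance.web_server.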

continue-on-error: true on the scan step: tfdrift exits with code 1 when drift is detected. With continue-on-error, that non-zero exit doesn't fail the job on the spot, so the remaining steps run normally. We save the report first, then fail the run explicitly in the last step by checking steps.drift_scan.outcome, which still records the original failure.
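The difference between a step's outcome and its conclusion is easy to mix up, so here is a simplified model of the three steps involved. It is plain Python that mimics the GitHub Actions behavior, not real Actions code, and the step labels are my own shorthand.

```python
def simulate_job(scan_exit_code: int) -> dict:
    """Model the step flow of the workflow for a given scanner exit code."""
    steps_run = []

    # "Run drift scan": outcome reflects the real exit code, while
    # continue-on-error: true forces the step's conclusion to success.
    steps_run.append("drift_scan")
    outcome = "success" if scan_exit_code == 0 else "failure"
    conclusion = "success"

    # "Upload drift report" has if: always(), so it runs either way.
    steps_run.append("upload_report")

    # "Fail on drift" checks outcome, not conclusion, then exits 1.
    job_failed = False
    if outcome == "failure":
        steps_run.append("fail_on_drift")
        job_failed = True

    return {
        "steps_run": steps_run,
        "outcome": outcome,
        "conclusion": conclusion,
        "job_failed": job_failed,
    }
```

simulate_job(1) runs all three steps and marks the job failed, while simulate_job(0) skips the last step and passes.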

The --quiet flag: Suppresses terminal output since we're running in CI. The JSON report captures everything we need.

Artifact upload with if: always(): Saves the JSON report regardless of whether drift was found. This gives you an audit trail: you can go back and see what each scan reported over the last 30 days.

Step 4: Configure severity filtering

Right now the workflow scans everything and alerts on all drift. To filter out the noise, create a .tfdrift.yml in your repo root:

# .tfdrift.yml
scan:
  paths:
    - ./infrastructure
  exclude:
    - "**/.terraform/**"
    - "**/test/**"
    - "**/modules/**"

notifications:
  slack:
    webhook_url: ${SLACK_WEBHOOK_URL}
    channel: "#infra-alerts"
    min_severity: high   # Only alert on High and Critical

severity:
  critical:
    - aws_security_group.*.ingress
    - aws_security_group.*.egress
    - aws_iam_policy.*.policy
    - aws_iam_role.*.assume_role_policy
    - aws_s3_bucket_public_access_block.*
  high:
    - aws_instance.*.instance_type
    - aws_db_instance.*.publicly_accessible
    - aws_db_instance.*.storage_encrypted

The key setting is min_severity: high. This means:

Critical drift (security groups, IAM policies) → Slack alert immediately
High drift (instance types, encryption) → Slack alert immediately
Medium drift → logged in the JSON report but no alert
Low drift (tags) → logged in the JSON report but no alert

This filtering cut our alert volume to 27% of what it was while still catching 94% of the changes that actually mattered.
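Conceptually, the filtering comes down to two steps: map each drifted attribute to a severity, then compare that severity against the threshold. The sketch below shows the idea with a few of the rules from the config above. It uses fnmatch globbing as a stand-in and assumes unmatched attributes default to low; tfdrift's actual matching and defaults may differ.

```python
from fnmatch import fnmatchcase

# Lowest to highest.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]

# A few of the rules from .tfdrift.yml, for illustration.
RULES = {
    "critical": ["aws_security_group.*.ingress", "aws_iam_policy.*.policy"],
    "high": ["aws_instance.*.instance_type"],
}


def classify(path: str, default: str = "low") -> str:
    """Return the highest severity whose patterns match the attribute path."""
    for severity in ("critical", "high"):
        if any(fnmatchcase(path, pattern) for pattern in RULES[severity]):
            return severity
    # Assumption: anything not listed falls back to the default severity.
    return default


def should_alert(severity: str, min_severity: str = "high") -> bool:
    """True when the severity is at or above the configured threshold."""
    return SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(min_severity)
```

For example, classify("aws_security_group.api_sg.ingress") returns "critical", which passes the high threshold, while a tag change falls back to "low" and stays in the report only.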

Step 5: Add an ignore file

Create .tfdriftignore in your repo root for drift you never want to see:

# Autoscaling — these change every few minutes by design
aws_autoscaling_group.*.desired_capacity
aws_ecs_service.*.desired_count

# Tags managed by external cost allocation tools
*.tags.CostCenter
*.tags.LastModified
*.tags.UpdatedBy
*.tags.ManagedBy

# Bookkeeping tags and provider-computed tags_all
*.tags.terraform
*.tags_all.*

Without this file, you'll get constant alerts about autoscaling changes. Those aren't drift — that's the system working correctly.
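If you're curious how an ignore file like this could be applied, here is a rough sketch: skip blank lines and comments, then drop any drifted attribute that matches a pattern. Again this uses fnmatch as a stand-in for whatever matching tfdrift really does, and the helper names are hypothetical.

```python
from fnmatch import fnmatchcase


def load_ignore_patterns(text: str) -> list[str]:
    """Parse ignore-file text, skipping blank lines and '#' comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def filter_ignored(paths: list[str], patterns: list[str]) -> list[str]:
    """Keep only the attribute paths that match none of the patterns."""
    return [
        path for path in paths
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


# Shortened copy of the ignore file above.
IGNORE_TEXT = """
# Autoscaling
aws_autoscaling_group.*.desired_capacity

# Tags managed by external cost allocation tools
*.tags.CostCenter
"""
```

With these patterns, a desired_capacity change on any autoscaling group and a CostCenter tag change on any resource are filtered out, while an instance_type change is kept.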

Step 6: Test it

Push everything and trigger a manual run:

Commit and push .github/workflows/drift-check.yml, .tfdrift.yml, and .tfdriftignore
Go to your repo → Actions tab → Terraform Drift Check → Run workflow
Watch it run

If there's drift, you'll see:

A Slack message in #infra-alerts with the severity breakdown
A JSON artifact attached to the workflow run
The workflow marked as failed (exit code 1)

If there's no drift:

No Slack message (nothing to alert about)
A JSON artifact with an empty drift report
The workflow marked as passed (exit code 0)

What the Slack alert looks like

When tfdrift finds High or Critical drift, it sends a message like:

⚠️ Terraform Drift Detected — 3 resource(s)

Workspaces scanned: 12
With drift: 2
Severity: critical: 1 | high: 2

🔴 CRITICAL — aws_security_group.api_sg
Action: update | Changed: ingress

🟠 HIGH — aws_instance.web_server
Action: update | Changed: instance_type, ami

🟠 HIGH — aws_db_instance.primary
Action: update | Changed: publicly_accessible

You immediately know what drifted, how bad it is, and how many workspaces are affected. No log diving, no running terraform plan manually, no guessing.

Making it production-ready

A few things I'd add for a real production setup:

Multiple environments

If you have separate Terraform directories for dev, staging, and production, you can either scan them all in one workflow or create separate workflows with different schedules:

# Production — check every 2 hours
- cron: '0 */2 * * *'

# Staging — check every 6 hours
- cron: '0 */6 * * *'

# Dev — check once a day
- cron: '0 8 * * *'

You can also use separate .tfdrift.yml files per environment with different severity thresholds. Maybe you only alert on Critical for dev but alert on Medium+ for production.

Branch protection

You can add drift checking to your PR workflow and mark the job as a required status check in your branch protection rules, so that drift must be resolved before merging:

on:
  pull_request:
    branches: [main]
    paths:
      - 'infrastructure/**'

This catches drift before it becomes a deployment surprise. If someone opened a security group in the console and you're about to deploy, the PR pipeline will flag it.

PagerDuty for critical drift

For production infrastructure, you might want Critical drift to page someone instead of just going to Slack:

notifications:
  slack:
    webhook_url: ${SLACK_WEBHOOK_URL}
    min_severity: high
  pagerduty:
    routing_key: ${PAGERDUTY_ROUTING_KEY}
    min_severity: critical

This means High drift goes to Slack, but Critical drift (someone modified a security group or IAM policy) pages the on-call engineer. That's the kind of change you want someone looking at within minutes, not hours.
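One detail worth spelling out: if each min_severity is an inclusive minimum (my assumption here), Critical drift reaches both channels, not just PagerDuty. A small sketch of that routing, with the thresholds copied from the config above:

```python
# Lowest to highest.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]

# Channel -> min_severity, mirroring the notifications config.
CHANNEL_THRESHOLDS = {"slack": "high", "pagerduty": "critical"}


def channels_for(severity: str) -> list[str]:
    """Return the channels whose threshold this severity meets or exceeds."""
    rank = SEVERITY_ORDER.index(severity)
    return [
        channel
        for channel, minimum in CHANNEL_THRESHOLDS.items()
        if rank >= SEVERITY_ORDER.index(minimum)
    ]
```

channels_for("high") gives only Slack, channels_for("critical") gives Slack and PagerDuty, and anything lower goes nowhere.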

HTML reports

For weekly reviews or compliance, you can generate an HTML report:

- name: Generate HTML report
  run: |
    tfdrift report \
      --path ./infrastructure \
      --output drift-report.html

The HTML report gives you a standalone page with severity breakdowns, drifted resources, and workspace details — useful for sharing with security teams or attaching to compliance documentation.

Cost and limits

A few practical things to know:

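GitHub Actions minutes: On the Free plan, private repositories get 2,000 minutes/month, and public repositories can use standard GitHub-hosted runners for free. A drift scan typically takes 2-5 minutes depending on how many workspaces you have. Running every 6 hours is 4 runs/day × 30 days = ~120 runs/month; at ~3 minutes each, that's ~360 minutes. Well within the free tier.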
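The same arithmetic works for any schedule, so here it is as a tiny helper you can plug your own numbers into. The function name and the 30-day month are my simplifications.

```python
def monthly_minutes(interval_hours: int, minutes_per_run: float, days: int = 30) -> float:
    """Estimate Actions minutes per month for a scan run every interval_hours."""
    runs_per_day = 24 // interval_hours
    return runs_per_day * days * minutes_per_run
```

monthly_minutes(6, 3) gives the 360 minutes above. Moving production to every 2 hours at the same run length gives 1,080 minutes, which still fits in 2,000 but leaves less room for other workflows. GitHub bills each job rounded up to the whole minute, so the estimate works best with whole-minute run times.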

Terraform API calls: Each terraform plan makes API calls to your cloud provider to refresh state. AWS doesn't charge for most describe/get API calls, but if you have hundreds of workspaces, the volume might matter. Monitor your AWS CloudTrail to be safe.

Slack rate limits: Slack incoming webhooks are limited to roughly 1 message per second. If you're scanning 50 workspaces and 10 have drift, tfdrift batches them into a single Slack message, so this shouldn't be an issue.

The full file structure

After setup, your repo should look like:

your-terraform-repo/
├── .github/
│   └── workflows/
│       └── drift-check.yml    # The scan workflow
├── .tfdrift.yml               # Severity rules and notification config
├── .tfdriftignore             # Expected drift exclusions
└── infrastructure/
    ├── production/
    │   ├── main.tf
    │   └── terraform.tfvars
    ├── staging/
    │   └── main.tf
    └── dev/
        └── main.tf

Try it

If you don't have tfdrift installed yet:

pip install tfdrift
tfdrift scan --path ./your-terraform-dir

The GitHub Actions workflow, .tfdrift.yml, and .tfdriftignore templates are all in the tfdrift repo under the examples/ directory.

If you set this up and run into issues, open a GitHub issue — I'd genuinely like to hear about edge cases I haven't hit yet.

This is part 4 of a series on infrastructure drift detection. Part 1: I Built a Free Terraform Drift Detector — Here's Why. Part 2: Why Severity Classification Changes Everything. Part 3: How I Built a Terraform Plan JSON Parser in Python.

DEV Community