Sudarshan Thakur

I Built tfdrift: Free Terraform Drift Detection With Severity Alerts

I Built a Free Terraform Drift Detector — Here's Why

If you manage Terraform infrastructure, you've probably experienced this: someone tweaks a security group in the AWS console "just for testing," forgets about it, and three months later your terraform apply blows up — or worse, that change silently creates a security hole that nobody catches.

This is infrastructure drift. And there's no good free tool to detect it properly.

So I built tfdrift — a free, open-source CLI that detects Terraform drift, classifies it by severity, and alerts your team.


The problem with "just run terraform plan"

Yes, terraform plan detects drift. But in practice, it falls short for teams managing real infrastructure:

No severity awareness. Someone changed a tag? Same alert as someone opening port 22 to the world. terraform plan treats all changes equally — but they're not. An IAM policy change is a security incident. A tag change is noise.

No multi-workspace scanning. If you have 20 Terraform workspaces across environments, you need to cd into each one and run plan manually. Nobody does this consistently.

No notifications. You only discover drift when you happen to run plan. By then, the damage may already be done.

No ignore rules. Auto-scaling groups constantly change desired_capacity. ECS services change desired_count. These are expected — but plan flags them every time, creating alert fatigue.


What tfdrift does

pip install tfdrift
tfdrift scan --path ./infrastructure

It recursively discovers all Terraform workspaces, runs terraform plan -json on each, parses the output, and gives you a structured report with severity classification:

  • CRITICAL — Security group ingress/egress changes, IAM policy modifications, S3 public access changes
  • HIGH — Instance type changes, RDS public accessibility, encryption config changes
  • MEDIUM — Most other attribute changes
  • LOW — Tags, descriptions

Here's what the output looks like:

⚠️  Drift detected: 7 resource(s) across 2/4 workspace(s)

🔴 critical: 2  🟠 high: 2  🟡 medium: 2  🔵 low: 1

📂 infrastructure/production (5 drifted, 3.2s)
┌──────────┬───────────────────────────────────┬────────┬─────────────────────┐
│ Severity │ Resource                          │ Action │ Changed attributes  │
├──────────┼───────────────────────────────────┼────────┼─────────────────────┤
│ CRITICAL │ aws_security_group.api_sg         │ update │ ingress             │
│ CRITICAL │ aws_iam_role_policy.lambda_exec   │ update │ policy              │
│ HIGH     │ aws_instance.web_server           │ update │ instance_type, ami  │
│ HIGH     │ aws_rds_instance.primary          │ update │ publicly_accessible │
│ LOW      │ aws_s3_bucket.assets              │ update │ tags                │
└──────────┴───────────────────────────────────┴────────┴─────────────────────┘

Now you immediately know what matters.
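Under the hood, a pipeline like this can be approximated with `terraform plan -out` followed by `terraform show -json`, which emits a `resource_changes` list with `before`/`after` values for every resource. Here's a minimal sketch of that discover-plan-parse loop — not tfdrift's actual code, and the helper names are mine:

```python
# Sketch of the scan pipeline: find workspaces, ask Terraform for a
# machine-readable plan, and diff each resource's before/after attributes.
import json
import subprocess
from pathlib import Path

def find_workspaces(root: str):
    """Treat any directory containing .tf files as a workspace."""
    return sorted({p.parent for p in Path(root).rglob("*.tf")})

def changed_attributes(before, after):
    """Top-level attributes whose value differs between the two states."""
    before, after = before or {}, after or {}
    return sorted(k for k in set(before) | set(after)
                  if before.get(k) != after.get(k))

def drifted_resources(plan: dict):
    """Extract (address, actions, changed attrs) from `terraform show -json` output."""
    drifted = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue  # nothing actually changed for this resource
        drifted.append((rc["address"], actions,
                        changed_attributes(rc["change"]["before"],
                                           rc["change"]["after"])))
    return drifted

def scan(workspace: Path):
    """Plan one workspace and return its drifted resources."""
    subprocess.run(["terraform", "plan", "-out=drift.tfplan"],
                   cwd=workspace, check=True, capture_output=True)
    show = subprocess.run(["terraform", "show", "-json", "drift.tfplan"],
                          cwd=workspace, check=True, capture_output=True)
    return drifted_resources(json.loads(show.stdout))
```

The real tool adds severity classification, ignore rules, and parallel workspace scanning on top of this core loop.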


I tested it against real AWS infrastructure

To validate the tool beyond unit tests, I set up a sample AWS environment with EC2 instances managed by Terraform. I ran tfdrift against it:

tfdrift scan --path ./dev-terraform/development/ec2 --verbose

Within 4 seconds, it detected a drifted EC2 instance — an aws_instance resource that had been modified outside of Terraform. The tool correctly classified it as HIGH severity because instance_type was among the changed attributes:

⚠️  Drift detected: 1 resource(s) across 1/4 workspace(s)

🟠 high: 1

📂 development/ec2/dev-machines (1 drifted, 4.2s)
┌──────────┬──────────────────────┬────────┬────────────────────────────────────┐
│ Severity │ Resource             │ Action │ Changed attributes                 │
├──────────┼──────────────────────┼────────┼────────────────────────────────────┤
│ HIGH     │ aws_instance.this[0] │ create │ instance_type, tags, monitoring,  │
│          │                      │        │ metadata_options, volume_tags      │
└──────────┴──────────────────────┴────────┴────────────────────────────────────┘

That drift had been sitting there undetected. A simple terraform plan would have found it too — but nobody was running plan against that workspace regularly. With tfdrift's watch mode, this could have been caught automatically and sent to Slack within minutes.


Key features

Watch mode — continuous monitoring with Slack alerts:

tfdrift watch --interval 30m --slack-webhook https://hooks.slack.com/services/XXX

Auto-remediation — with safety guards that block destructive changes on critical resources:

tfdrift scan --auto-fix --confirm --env dev

Ignore rules — a .tfdriftignore file for expected drift:

aws_autoscaling_group.*.desired_capacity
aws_ecs_service.*.desired_count
*.tags.LastModified

CI/CD integration — exit codes designed for pipelines (0 = clean, 1 = drift, 2 = error, 3 = remediated), plus a ready-made GitHub Actions workflow.
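A scheduled workflow built around those exit codes might look like this. This is an illustrative sketch, not the workflow shipped with tfdrift — the schedule, step layout, and action versions are my assumptions:

```yaml
# Hypothetical scheduled drift check; exit code 1 from tfdrift fails the job.
name: drift-check
on:
  schedule:
    - cron: "0 */6 * * *"   # every 6 hours
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: pip install tfdrift
      - run: tfdrift scan --path ./infrastructure
```

Because a drifted scan exits non-zero, the scheduled run shows up red and can page whoever is on call.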

JSON and Markdown output — for programmatic use or generating reports:

tfdrift scan --format json --output drift-report.json
tfdrift scan --format markdown --output drift-report.md

How it compares

Feature                   tfdrift   terraform plan   Terraform Enterprise
Actively maintained       ✅         ✅                ✅
Severity classification   ✅         ❌                ❌
Multi-workspace scan      ✅         ❌                ✅
Auto-remediation          ✅         ❌                ❌
Watch mode + alerts       ✅         ❌                ✅
Ignore rules              ✅         ❌                ❌
Cost                      Free      Free             $15K+/yr

The architecture

tfdrift is written in Python and built with a modular architecture:

tfdrift/
├── detectors/       # Drift detection engine (terraform plan parser)
├── reporters/       # Output formatters (JSON, Markdown, table, Slack)
├── remediators/     # Auto-fix logic with safety guards
├── severity.py      # 30+ built-in AWS severity rules
├── config.py        # .tfdrift.yml and .tfdriftignore parser
└── cli.py           # CLI (Click)

The severity engine uses pattern matching to classify drift. For example, any change to aws_security_group.*.ingress is automatically classified as CRITICAL, while *.tags changes are classified as LOW. You can extend these rules in your .tfdrift.yml:

severity:
  critical:
    - aws_security_group.*.ingress
    - aws_iam_policy.*.policy
  high:
    - aws_instance.*.instance_type
    - aws_rds_instance.*.publicly_accessible
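Evaluating rules like these comes down to glob-matching a `resource_type.name.attribute` path against each tier in priority order. A minimal sketch of such a matcher, using Python's `fnmatch` — the real rule engine may differ, and the function name is mine:

```python
# Glob-based severity classification: first matching tier wins,
# anything unmatched falls through to MEDIUM.
from fnmatch import fnmatch

DEFAULT_RULES = {
    "critical": ["aws_security_group.*.ingress", "aws_iam_policy.*.policy"],
    "high": ["aws_instance.*.instance_type", "aws_rds_instance.*.publicly_accessible"],
    "low": ["*.tags", "*.description"],
}
TIERS = ["critical", "high", "medium", "low"]

def classify(resource_type: str, name: str, attribute: str,
             rules=DEFAULT_RULES) -> str:
    """Match '<type>.<name>.<attribute>' against each tier's patterns in order."""
    path = f"{resource_type}.{name}.{attribute}"
    for tier in TIERS:
        if any(fnmatch(path, pattern) for pattern in rules.get(tier, [])):
            return tier
    return "medium"
```

Merging your `.tfdrift.yml` overrides into the defaults is then just a dict update before calling the classifier.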

Why I built this

I'm a DevOps engineer working with AWS, Terraform, and Kubernetes daily. Infrastructure drift was a recurring problem — security-critical changes mixed in with harmless noise, and no good free tool to sort through it.

The existing options were either expensive (Terraform Enterprise at $15K+/year) or too basic (raw terraform plan in a cron job).

I wanted something that could:

  • Tell me what drifted and how bad it is
  • Scan all my workspaces in one command
  • Send me a Slack alert at 2 AM when someone modifies a security group
  • Let me ignore the noise (auto-scaling changes, tag updates)
  • Run in CI/CD and fail the pipeline if critical drift is found

So I built it.


Try it

pip install tfdrift

# Scan your infrastructure
tfdrift scan --path ./your-terraform-dir

# Set up continuous monitoring
tfdrift watch --interval 1h --slack-webhook $SLACK_WEBHOOK

# Generate a config file
tfdrift init

The code is fully open source: github.com/sudarshan8417/tfdrift

I'd love feedback, bug reports, and contributions. If you've dealt with Terraform drift at your organization, I'd especially love to hear what features you'd want next — Azure/GCP support, a web dashboard, PagerDuty integration?

Drop a comment or open an issue on GitHub.


If you found this useful, consider giving the repo a ⭐ on GitHub — it helps other engineers find the tool.

Top comments (1)

Hollow House Institute

This is a solid build.

What stands out is the shift from “detect drift” to “classify how risky it is.”

That’s already closer to control than most setups.

But it also shows the same pattern across systems.

Detection improves.
Response time improves.

The system still continues operating after drift is found.

So drift becomes visible, but not constrained.

That’s the gap between detection and governance.

Right now tfdrift tells you:
what changed
where it changed
how severe it is

The next layer is what happens when severity crosses a Decision Boundary.

Does Escalation trigger automatically?
Does execution pause for critical conditions?
Does Stop Authority exist for high-risk drift?

Without that, the loop is still:

drift → detect → alert → human response

Which works, but only if someone acts fast enough.

Tying severity directly to enforcement conditions would turn this from visibility into control.