Sudarshan Thakur

Posted on Jun 23

Five Features I Shipped in tfdrift That Changed How We Handle Terraform Drift

#terraform #devops #opensource #cloudcomputing

When I first released tfdrift, it had one job: scan Terraform workspaces, classify drift by severity, and tell you what matters. That was enough to be useful, but after running it against real infrastructure and hearing from other engineers, five pain points kept coming up.

Scanning 20+ workspaces took forever. There was no way to see what drifted last week. CI/CD integration was clunky. The watch command didn't have severity gating. And when an instance type drifted, nobody could tell if it was costing more or less.

Over the last few weeks I shipped fixes for all five. Here's what's new in tfdrift v0.2.3 through v0.2.5.

1. Parallel workspace scanning (v0.2.3)

This was the most requested improvement. Previously, tfdrift scanned workspaces sequentially — one terraform plan at a time. For a repo with 5 workspaces, that's fine. For a repo with 50, you're waiting 10+ minutes staring at a spinner.

Now tfdrift scans workspaces in parallel using a thread pool:

# Scan with 8 parallel workers
tfdrift scan --path ./infrastructure --workers 8

# Default is 4 workers
tfdrift scan --path ./infrastructure

You can also set it in your .tfdrift.yml:

scan:
  workers: 8

For a team I was working with that had 24 Terraform workspaces, scan time dropped from about 3 minutes to under 20 seconds. That's the difference between "I'll run it before lunch" and "I'll run it before every commit."

One thing to watch: each worker runs its own terraform plan, which means parallel API calls to your cloud provider. If your provider rate-limits API calls, as AWS does, keep workers at 4-6. If you still see throttling errors, drop to 2.
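To make the thread-pool idea concrete, here's a minimal Python sketch of the pattern. It is not tfdrift's actual code: scan_workspace and scan_all are hypothetical names, and the worker only fakes a result where a real one would shell out to terraform plan.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scan_workspace(path):
    # Hypothetical stand-in for running `terraform plan` in one workspace.
    # A real worker would invoke terraform and parse the resulting plan.
    return {"workspace": path, "drifted": path.endswith("prod")}


def scan_all(paths, workers=4):
    """Scan every workspace using at most `workers` concurrent threads."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(scan_workspace, p) for p in paths]
        for future in as_completed(futures):
            results.append(future.result())
    # Sort so the output order does not depend on which worker finished first.
    return sorted(results, key=lambda r: r["workspace"])


if __name__ == "__main__":
    for result in scan_all(["envs/dev", "envs/prod", "envs/staging"], workers=8):
        print(result)
```

Threads fit here because each worker would spend nearly all of its time waiting on an external process and on network calls, not running Python code.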

2. Severity gating for watch mode (v0.2.3)

The scan command already had --fail-on, which sets a severity threshold for the exit code: tfdrift exits 1 only when it detects drift at or above that severity. That's useful for CI pipelines where you don't want tag changes to fail your build.

But the watch command didn't have this. It would alert on everything or nothing. Now it has parity:

# Watch mode: only exit/alert on High or Critical drift
tfdrift watch --interval 30m --fail-on high --slack-webhook "$SLACK_URL"

This means the watch loop keeps running when it finds Low or Medium drift — it logs it but doesn't fire alerts or exit. When it finds High or Critical drift, it alerts and stops. This is what you want for a CI watch job: keep checking quietly until something actually dangerous shows up.
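The gating rule itself is small. Here's a sketch of how a threshold check like this could work, assuming the four severity levels are ordered Low < Medium < High < Critical. The names SEVERITY_ORDER and exit_code are mine for illustration, not tfdrift internals.

```python
# Assumed ordering of the four severity levels, lowest first.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def exit_code(found, fail_on):
    """Return 1 if any severity in `found` is at or above `fail_on`, else 0."""
    threshold = SEVERITY_ORDER.index(fail_on.lower())
    hit = any(SEVERITY_ORDER.index(s.lower()) >= threshold for s in found)
    return 1 if hit else 0


if __name__ == "__main__":
    print(exit_code(["low", "medium"], "high"))    # below the threshold
    print(exit_code(["low", "critical"], "high"))  # at or above the threshold
```

With a threshold of high, a cycle that only finds Low and Medium drift returns 0 and the loop carries on; a single High or Critical finding returns 1.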

3. Drift history tracking (v0.2.4)

This one came from a question I kept getting: "What did our drift look like last week?" Before this, every scan was ephemeral — results went to stdout or a JSON file and that was it. No historical record, no trend tracking.

Now every scan and watch cycle automatically saves results to a local SQLite database at ~/.tfdrift/history.db. And there's a new command to query it:

# Show last 10 scans
tfdrift history

# Show last 30 scans
tfdrift history --limit 30

# Output as JSON for scripting
tfdrift history --format json

# Use a custom database path
tfdrift history --db /path/to/history.db

The output shows timestamps, workspace count, drift count, severity breakdown, and scan duration for every past scan. Now you can answer questions like:

"Is our drift getting better or worse over time?"
"How many Critical drift events did we have this month?"
"When did that security group change first appear?"

It's also useful for compliance — some teams need to prove they're actively monitoring for drift. A history of scans with timestamps is exactly that evidence.
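To show the kind of query a scan history makes possible, here's a sketch using Python's built-in sqlite3 module. The table and column names are assumptions I made up for this example, not the real schema of history.db, and the timestamps are invented sample data.

```python
import sqlite3


def open_history(path=":memory:"):
    # Hypothetical schema for illustration; the real history.db may differ.
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS scans (
               scanned_at     TEXT    NOT NULL,  -- ISO 8601 timestamp
               workspaces     INTEGER NOT NULL,
               drift_count    INTEGER NOT NULL,
               critical_count INTEGER NOT NULL
           )"""
    )
    return conn


def record_scan(conn, scanned_at, workspaces, drift_count, critical_count):
    conn.execute(
        "INSERT INTO scans VALUES (?, ?, ?, ?)",
        (scanned_at, workspaces, drift_count, critical_count),
    )
    conn.commit()


def critical_findings_in_month(conn, month):
    """Sum Critical findings over all scans in `month` ("YYYY-MM").

    This counts per-scan findings, so drift that persists across
    several scans is counted once for each scan that saw it.
    """
    row = conn.execute(
        "SELECT COALESCE(SUM(critical_count), 0) FROM scans WHERE scanned_at LIKE ?",
        (month + "%",),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = open_history()
    record_scan(conn, "2025-06-01T09:00:00", 24, 5, 1)
    record_scan(conn, "2025-06-08T09:00:00", 24, 3, 2)
    print(critical_findings_in_month(conn, "2025-06"))
```

Storing timestamps as ISO 8601 text keeps month filters and chronological sorting down to plain string operations.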

4. GitHub Actions native output (v0.2.4)

This was a friction point for anyone running tfdrift in CI. The tool worked fine in GitHub Actions, but the output was just text in the log. You had to click into the job, scroll through output, and find the drift table. Not great when you're reviewing a PR.

Now tfdrift auto-detects when it's running in GitHub Actions (via the $GITHUB_ACTIONS environment variable) and emits native annotations:

# Auto-detected in GitHub Actions, or force it:
tfdrift scan --path ./infrastructure --gha

What this does:

Inline annotations: Each drifted resource appears as a ::error::, ::warning::, or ::notice:: annotation depending on severity. Critical and High show as errors, Medium as warnings, Low as notices. These appear inline in the Actions log and as annotations on PR diffs.

Job summary: tfdrift writes a Markdown drift table to $GITHUB_STEP_SUMMARY, which renders as a rich overview on the GitHub Actions job summary page. Instead of digging through logs, reviewers see a clean severity-classified table right on the summary.

No configuration needed. If you're already running tfdrift in GitHub Actions, just update to v0.2.4 and the annotations appear automatically. If you want to force the behavior outside of Actions (for testing), use the --gha flag.
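If you're curious what this looks like under the hood, GitHub reads annotations from specially formatted lines on stdout and reads the job summary from the file named in $GITHUB_STEP_SUMMARY. Here's a sketch of the severity mapping described above. The function names are hypothetical, and it skips the escaping GitHub requires for special characters in titles and messages.

```python
import os

# Severity-to-annotation mapping as described above.
LEVELS = {"critical": "error", "high": "error", "medium": "warning", "low": "notice"}


def annotation(severity, resource, message):
    """Build one GitHub Actions workflow command line for a drifted resource."""
    level = LEVELS[severity.lower()]
    return f"::{level} title={resource}::{message}"


def summary_markdown(rows):
    """Render (severity, resource) pairs as a Markdown table."""
    lines = ["| Severity | Resource |", "| --- | --- |"]
    lines += [f"| {sev.upper()} | {res} |" for sev, res in rows]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    rows = [("high", "aws_instance.web[0]"), ("low", "aws_s3_bucket.logs")]
    for sev, res in rows:
        print(annotation(sev, res, "drift detected"))
    # Only write the summary when GitHub Actions provides the file path.
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path:
        with open(summary_path, "a", encoding="utf-8") as fh:
            fh.write(summary_markdown(rows))
```

The summary file is opened in append mode so that output from other steps in the same job is preserved.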

5. Cost impact estimation (v0.2.5)

This is the one I'm most excited about. When an instance type drifts — say someone manually changed a t3.micro to a t3.xlarge — knowing the severity is useful. But knowing it'll cost an extra $90/month is what gets the finance team's attention.

tfdrift now estimates the monthly cost delta when compute instance types or classes change:

tfdrift scan --path ./infrastructure

The table output now includes a "Cost Impact" column:

Severity  Resource                 Action  Changed         Cost Impact
HIGH      aws_instance.web[0]      update  instance_type   +$89.28/mo
HIGH      aws_instance.worker[2]   update  instance_type   -$43.80/mo
MEDIUM    aws_db_instance.prod     update  instance_class  +$215.00/mo

And at the bottom:

Estimated total cost impact: +$260.48/mo

The pricing data covers on-demand us-east-1 prices for EC2 (T2, T3, M5, M6i, C5, C6i, R5, R6i families), RDS, Aurora, ElastiCache, and EKS node groups. The output labels these figures as approximate, because actual costs depend on reserved instances, savings plans, region, and other factors.

But even an approximate number changes the conversation. "Someone changed an instance type" becomes "someone's change is costing us an extra $260 a month." That gets fixed faster.

In JSON output, the cost delta is exposed as cost_delta_monthly on each drifted resource, so you can pipe it into your own dashboards or cost tracking tools.
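As an example of that kind of scripting, here's a sketch that totals the deltas from a list of drifted resources. Only the cost_delta_monthly field name comes from tfdrift. The shape of the surrounding JSON is an assumption, and the sample values are the three deltas from the table above plus one resource with no cost data.

```python
import json


def total_cost_impact(resources):
    """Sum cost_delta_monthly over resources, skipping any without a value."""
    deltas = [
        r["cost_delta_monthly"]
        for r in resources
        if r.get("cost_delta_monthly") is not None
    ]
    return round(sum(deltas), 2)


def format_delta(amount):
    """Format a monthly delta with an explicit sign, e.g. +$89.28/mo."""
    sign = "+" if amount >= 0 else "-"
    return f"{sign}${abs(amount):.2f}/mo"


# Assumed structure: a flat list of resource objects.
SAMPLE = json.loads("""
[
  {"cost_delta_monthly": 89.28},
  {"cost_delta_monthly": -43.80},
  {"cost_delta_monthly": 215.00},
  {"cost_delta_monthly": null}
]
""")

if __name__ == "__main__":
    print("Estimated total cost impact:", format_delta(total_cost_impact(SAMPLE)))
```

Rounding the total to two decimals keeps floating-point noise out of the dollar figure.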

What's next

These five features round out what I think is a solid drift detection workflow: scan fast (parallel), know what matters (severity), track it over time (history), integrate cleanly (GitHub Actions), and understand the impact (cost estimation).

There's still more I want to build. Environment-aware severity — where the same change can be Critical in production but Low in dev — is high on the list. Value-aware classification (inspecting what an attribute changed to, not just that it changed) is the other big one. And I've been getting feedback about drift governance — using severity to actually block deployments, not just alert.

If you're using tfdrift, I'd love to hear what features would actually make a difference for your team. And if you're not using it yet:

pip install tfdrift
tfdrift scan --path ./your-terraform-dir

All of these features are live on PyPI right now. The code is at github.com/sudarshan8417/tfdrift.

This is part 5 of a series on infrastructure drift detection. Part 1: I Built a Free Terraform Drift Detector. Part 2: Why Severity Classification Changes Everything. Part 3: How I Built a Terraform Plan JSON Parser. Part 4: Automated Drift Monitoring With GitHub Actions and Slack.

DEV Community
