Here's a scenario every DevOps engineer has lived through at least once.
You set up drift detection for your Terraform infrastructure. Maybe it's a cron job running terraform plan, maybe it's a fancier tool. Either way, it works. It finds drift. It sends alerts.
The problem is it finds everything. Tag changes. Description updates. Auto-scaling group adjustments. Security group modifications. Instance type changes. All of it lands in the same Slack channel with the same urgency.
For the first week, your team looks at every alert. By week three, people start skimming. By month two, that Slack channel is muted. I know this because it happened to us.
And buried in that noise? A security group change that opened SSH to the internet. Nobody caught it for eleven days.
The core problem isn't detection. It's prioritization.
terraform plan is perfectly capable of detecting drift. It's been doing it since 2014. The detection part is solved. What isn't solved is the question that comes after detection: so what?
When plan tells you that 47 resources have changed across your infrastructure, the natural human response is "I don't have time for this." And honestly, that's rational. Most of those 47 changes probably don't matter. Tags get updated by cost allocation scripts. Auto-scaling groups adjust desired capacity constantly — that's literally their job. Someone updated a description field three weeks ago.
But somewhere in those 47 changes, there might be an IAM policy that was broadened to allow s3:* on all buckets. Or a KMS key policy that added an unknown account. Or a database that got its publicly_accessible flag flipped to true.
The problem is you can't tell which is which without reading through every single diff. Every. Single. Time.
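Mechanically, pulling changed attributes out of a plan isn't hard. Here's a minimal Python sketch (the function name is mine, not tfdrift's) that walks `terraform show -json` output and yields a resource/attribute pair for every in-place update:

```python
import json

def changed_attributes(plan_json: dict):
    """Yield (resource_address, attribute) pairs for every in-place change
    in a `terraform show -json` plan. Minimal sketch: ignores nested diffs,
    computed values, and create/delete actions."""
    for rc in plan_json.get("resource_changes", []):
        change = rc.get("change", {})
        if "update" not in change.get("actions", []):
            continue
        before = change.get("before") or {}
        after = change.get("after") or {}
        for attr in set(before) | set(after):
            if before.get(attr) != after.get(attr):
                yield rc["address"], attr

# A tiny, hand-built stand-in for real plan JSON:
plan = {
    "resource_changes": [
        {
            "address": "aws_security_group.api_gateway",
            "change": {
                "actions": ["update"],
                "before": {"ingress": [], "tags": {"env": "prod"}},
                "after": {"ingress": [{"from_port": 22}], "tags": {"env": "prod"}},
            },
        }
    ]
}
print(list(changed_attributes(plan)))
# [('aws_security_group.api_gateway', 'ingress')]
```

Getting this list is the easy part. Deciding which entries deserve a human's attention is the hard part.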
Research on alert fatigue in operations backs this up. When teams receive more than about 50 alerts per day, response quality drops measurably. Critical alert response times degrade by up to 40%. The monitoring industry learned this lesson a decade ago with tools like PagerDuty adding severity levels and intelligent routing. Infrastructure drift detection hasn't caught up.

## Not all drift is created equal
Once you actually think about it, drift falls into pretty natural categories based on security and operational impact.
Some drift is a fire. Someone modified a security group's ingress rules. Someone changed an IAM policy. Someone disabled encryption on a storage bucket. Someone opened up public access on a database. These changes directly affect your security perimeter. If you miss them, you might end up in a breach disclosure filing. I'd call these Critical.
Some drift is a problem. Someone changed an instance type in production — your capacity planning just went out the window. Someone modified a database engine version — your upgrade runbooks are now wrong. Someone changed a Lambda runtime — hope nothing depends on that specific Python version. These won't get you breached, but they can cause outages. That's High.
Some drift is worth knowing about. Most attribute changes that don't fall into the above categories. A CloudFront distribution's cache behavior was modified. A load balancer's idle timeout was changed. Not urgent, but you should probably investigate when you get a chance. That's Medium.
Some drift is noise. Tags. Descriptions. Labels. Metadata that external systems update constantly. Flagging these as drift is technically correct and operationally useless. That's Low.
The moment you apply these categories, your 47 alerts become 2 Critical, 3 High, 8 Medium, and 34 Low. Suddenly you know exactly where to look.
## How we implemented this
When I built tfdrift, severity classification was the feature I cared about most. The technical approach is straightforward but the design decisions were interesting.
Pattern matching on resource type + attribute. Each severity rule is a pattern like aws_security_group.*.ingress → Critical. The * matches any resource name. When terraform plan reports that aws_security_group.api_gateway.ingress changed, it matches the Critical pattern.
Why resource type + attribute and not just resource type? Because a security group changing its tags is Low, but a security group changing its ingress rules is Critical. The attribute matters as much as the resource type.
Maximum severity wins. When a single resource has multiple changed attributes at different severity levels — say, both a tag change and an ingress change — we classify it at the highest level. A security group with a Low tag change and a Critical ingress change gets reported as Critical. You don't want someone to dismiss a Critical change because it also had some Low-severity noise attached.
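Put together, pattern matching and max-severity aggregation fit in a few lines. This is an illustrative sketch rather than tfdrift's actual code: I'm assuming shell-style globbing (Python's `fnmatch`) for the patterns, a truncated rule list, and a Medium default for unmatched attributes.

```python
from enum import IntEnum
from fnmatch import fnmatch

class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

# Truncated rule set for illustration; the real tool ships 60+ of these.
RULES = [
    ("aws_security_group.*.ingress", Severity.CRITICAL),
    ("*.tags", Severity.LOW),
]

def classify(address: str, changed_attrs: list) -> Severity:
    """Classify a resource at the max severity of its changed attributes.
    Unmatched attributes default to MEDIUM (an assumption on my part)."""
    per_attr = []
    for attr in changed_attrs:
        target = f"{address}.{attr}"  # e.g. aws_security_group.api_gateway.ingress
        matched = [sev for pattern, sev in RULES if fnmatch(target, pattern)]
        per_attr.append(max(matched, default=Severity.MEDIUM))
    return max(per_attr, default=Severity.LOW)

# Low tag noise plus a Critical ingress change: the Critical wins.
print(classify("aws_security_group.api_gateway", ["tags", "ingress"]).name)
# CRITICAL
```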
60+ built-in rules. We ship default patterns for AWS (26 rules), Azure (13 rules), and GCP (10 rules), plus cloud-agnostic Low rules for tags, labels, and descriptions. These are opinionated defaults based on what we think matters most from a security perspective. You can override everything in a YAML config file.
Here's what the default Critical rules look like for AWS:
```
aws_security_group.*.ingress        # network access
aws_security_group.*.egress         # network access
aws_iam_policy.*.policy             # identity & access
aws_iam_role.*.assume_role_policy   # identity & access
aws_s3_bucket_public_access_block.* # data exposure
aws_s3_bucket_policy.*.policy       # data exposure
aws_kms_key.*.key_policy            # encryption
aws_network_acl_rule.*              # network access
```
And for Azure:
```
azurerm_network_security_group.*.security_rule # network access
azurerm_role_assignment.*                      # identity & access
azurerm_key_vault_access_policy.*              # secrets management
azurerm_storage_account.*.network_rules        # data exposure
```
And GCP:
```
google_compute_firewall.*.allow         # network access
google_compute_firewall.*.source_ranges # network access
google_project_iam_binding.*            # identity & access
google_project_iam_member.*             # identity & access
```
Ignore rules for expected drift. Beyond severity, we also support a .tfdriftignore file for drift you never want to see. The classic example:
```
aws_autoscaling_group.*.desired_capacity
aws_ecs_service.*.desired_count
```
Auto-scaling groups change desired_capacity every few minutes. That's not drift — that's the system working correctly. Filtering this out before severity classification reduces noise even further.
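Applying the ignore file is a simple pre-filter that runs before classification. A sketch, again assuming glob-style patterns (the function names here are illustrative, not tfdrift's API):

```python
from fnmatch import fnmatch

# Hypothetical .tfdriftignore contents: one pattern per line, "#" for comments.
IGNORE_FILE = """
# expected churn from autoscaling
aws_autoscaling_group.*.desired_capacity
aws_ecs_service.*.desired_count
"""

def load_ignore_patterns(text: str) -> list:
    """Parse ignore patterns, skipping blank lines and comments."""
    return [line.strip() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]

def is_ignored(address: str, attribute: str, patterns: list) -> bool:
    """True if this resource/attribute change should be dropped entirely."""
    return any(fnmatch(f"{address}.{attribute}", p) for p in patterns)

patterns = load_ignore_patterns(IGNORE_FILE)
print(is_ignored("aws_autoscaling_group.web", "desired_capacity", patterns))  # True
print(is_ignored("aws_autoscaling_group.web", "max_size", patterns))          # False
```

Dropping these changes before classification means they never even count toward the Low bucket.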
## The numbers
I tested this approach against a sandbox AWS environment with 150+ Terraform workspaces — EC2, IAM, S3, RDS, VPC, the usual mix. I introduced drift through the AWS console and CLI to simulate the kind of changes that happen in real environments: security group tweaks, tag updates, instance type changes, IAM policy modifications.
The results:
| Approach | Alert volume | Security changes caught |
|---|---|---|
| Binary detection (terraform plan) | 100% | 100% (buried in noise) |
| Severity ≥ High only | 27% | 94% |
| Severity ≥ High + ignore rules | 19% | 97% |
Filtering to High and Critical severity cut the alert volume by 73% while still catching 94% of the changes that actually mattered from a security perspective. Adding ignore rules brought noise down to just 19% of the original volume while catching 97% of security-relevant changes.
The 6% of security-relevant changes that got missed at the High threshold were Medium-severity changes that a human might consider security-adjacent — things like Lambda runtime version changes. Whether those matter depends on your risk tolerance, and you can always adjust the severity rules to promote specific attributes.
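Promoting an attribute is a one-line config change. For example, a hypothetical override that treats Lambda runtime changes as High, assuming the same severity-block schema tfdrift's YAML config uses:

```yaml
# Hypothetical .tfdrift.yml override: treat runtime version drift as High
severity:
  high:
    - aws_lambda_function.*.runtime
```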
## The `.tfdrift.yml` as a living document
Here's something I didn't fully appreciate until I started getting feedback from other engineers: the severity configuration file becomes a living document of your team's operational values.
When you configure which changes are Critical and which are Low, you're encoding institutional knowledge about what matters to your organization. A fintech company might classify aws_rds_instance.*.storage_encrypted as Critical. A dev agency might not care about it at all.
The .tfdrift.yml file captures these decisions in version-controlled, reviewable code. When a new engineer joins the team and asks "do we care about CloudFront distribution changes?", the answer is in the config file. When an incident happens because of undetected drift, the post-mortem action item is "add this resource type to our Critical patterns" — a one-line YAML change.
```yaml
severity:
  critical:
    - aws_security_group.*.ingress
    - aws_iam_policy.*.policy
    # Added after the March 15 incident — ticket INC-4521
    - aws_cloudfront_distribution.*.origin
  high:
    - aws_instance.*.instance_type
    - aws_rds_instance.*.publicly_accessible
```
That comment in the config file? That's tribal knowledge being captured instead of lost.
## What severity classification doesn't solve
I want to be honest about the limitations.
It can't see values. The classification is based on which attribute changed, not what it changed to. A security group rule opening port 443 to a known CIDR range is very different from one opening port 22 to 0.0.0.0/0. Both get classified as Critical because the attribute is ingress. Value-aware classification is on the roadmap, but it adds significant complexity.
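Value-aware classification would mean inspecting what the attribute changed *to*, not just which attribute changed. A rough sketch of one such check, purely illustrative of the idea (nothing like this ships in tfdrift today):

```python
def is_world_open_ssh(after_ingress: list) -> bool:
    """Sketch of a value-aware check: flag ingress rules that open
    port 22 to the entire internet. Assumes the attribute shape used
    by aws_security_group ingress blocks in plan JSON."""
    return any(
        rule.get("from_port", 0) <= 22 <= rule.get("to_port", 0)
        and "0.0.0.0/0" in rule.get("cidr_blocks", [])
        for rule in after_ingress
    )

# SSH open to the world: worth a page at 3 a.m.
print(is_world_open_ssh([{"from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]}]))   # True
# HTTPS from an internal range: routine.
print(is_world_open_ssh([{"from_port": 443, "to_port": 443, "cidr_blocks": ["10.0.0.0/8"]}]))  # False
```

Multiply that by every attribute on every resource type and you can see why value-aware rules add significant complexity.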
It doesn't understand context. The same change might be Critical in production and Low in dev. Right now you handle this by maintaining separate config files per environment. Native environment-aware severity is something I want to build.
It's still just detection. Even with severity classification, the loop is: drift happens → tool detects it → alert goes out → human decides what to do. One of the more interesting pieces of feedback I've gotten is about moving from detection to governance — where Critical drift can automatically block deployments or trigger remediation. That's the next frontier, but it's also the scariest one.
## Try it
If this resonates with your experience managing Terraform infrastructure, give tfdrift a try:
```shell
pip install tfdrift
tfdrift scan --path ./your-terraform-dir
```
The severity rules are opinionated out of the box, and you can customize everything. If you have thoughts on what the default rules should cover — especially for Azure and GCP where our coverage is thinner — I'd genuinely love to hear about it.
GitHub: github.com/sudarshan8417/tfdrift
This is part of a series on infrastructure drift detection. Previously: I Built a Free Terraform Drift Detector — Here's Why. Next up: How I Built a Terraform Plan JSON Parser in Python.