Adil Khan

Posted on Mar 23

I built a CLI that catches dangerous Terraform changes before you apply them

#aws #terraform #devops #ai

Before every terraform apply I was doing the same thing. Read the plan output. Switch to the AWS console. Check security groups. Try to remember what depends on what.

Then a security group with port 22 open to 0.0.0.0/0 slipped through. Caught it fast, nothing broke — but I kept thinking about it. That whole review process was just me, manually, every time, hoping I didn't miss anything in 300 lines of text. That's not a process. That's vibes.

So I built IACGuard. This is how it works and what I got wrong along the way.

The gap in terraform plan

The output doesn't tell you what matters. A production database being replaced and a tag change look identical in the plan — same format, same indentation, same weight. You have to already know what's dangerous to spot it.

The second gap is pipelines. You can fail a PR on test failures or linting. You can't natively fail a PR because the plan replaces a production RDS instance. That gap means risky infrastructure changes get the same review as safe ones, which under time pressure is basically no review.

What IACGuard does

IACGuard reads a Terraform plan file, runs deterministic rules against every planned change, and outputs a risk report.

iacguard plan results
─────────────────────────────────────────────────────────
Critical : 1  |  High : 0  |  Medium : 0  |  Low : 0
Resources analyzed : 5  |  Changes : 5
─────────────────────────────────────────────────────────

  CRITICAL  SG001  aws_vpc_security_group_ingress_rule.sg_inbound_ssh  [CREATE]
           Security group ingress rule 'sg_inbound_ssh' allows SSH (port 22)
           from the entire internet (0.0.0.0/0 or ::/0).

─────────────────────────────────────────────────────────
[iacguard] Rules checked   : 3
[iacguard] Drift check     : SKIPPED (use --region to enable)
─────────────────────────────────────────────────────────

Exit code 0 = nothing critical. Exit code 1 = stop and look at this before applying.

pip install iacguard


The first version used AI for detection. That was wrong.

My original plan was to send the whole Terraform plan to Claude and let it decide what was risky.

plan.json → Claude API → risk scores + findings

The problem is non-determinism. LLMs give different answers on different runs. That's fine for generating text. It's not fine when your tool blocks deployments. If iacguard exits with code 1 in a CI pipeline, that exit code has to mean the same thing every single time — not "Claude was in a cautious mood today."

So the architecture changed:

plan.json → parser → rule engine (deterministic) → findings → output
                                                            ↓
                                              AI explanation (--explain, optional)

Rules decide what's risky. AI only explains what the rules already found. AI never runs in CI mode at all.
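
A minimal sketch of that split, under stated assumptions: `Finding`, `analyze`, and the `explainer` callable are hypothetical names for illustration, not IACGuard's actual API. The point it shows is that the exit code is computed from rule findings alone, and the explanation step is skipped whenever CI mode is on.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Finding:
    # Hypothetical shape; the real tool's Finding may carry more fields.
    rule_id: str
    severity: str  # "CRITICAL", "HIGH", "MEDIUM" or "LOW"
    message: str


def analyze(
    findings: list[Finding],
    explain: bool = False,
    ci: bool = False,
    explainer: Optional[Callable[[Finding], str]] = None,
) -> tuple[int, list[str]]:
    """Return (exit_code, explanations).

    The exit code depends only on the deterministic findings. The
    explainer (an LLM call in a real tool, any callable here) can add
    text but cannot change the result, and is never invoked in CI mode.
    """
    exit_code = 1 if any(f.severity == "CRITICAL" for f in findings) else 0
    explanations: list[str] = []
    if explain and not ci and explainer is not None:
        explanations = [explainer(f) for f in findings]
    return exit_code, explanations
```

Because the explainer only ever receives findings the rules already produced, swapping or removing it leaves the pass/fail decision untouched.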

The parser

terraform show -json prints the plan as JSON. The useful part is resource_changes: one entry for every resource being created, updated, or destroyed.

{
  "address": "aws_db_instance.primary",
  "mode": "managed",
  "type": "aws_db_instance",
  "change": {
    "actions": ["delete", "create"],
    "before": { "instance_class": "db.t3.medium" },
    "after":  { "instance_class": "db.t3.large" },
    "replacing": true
  }
}

The non-obvious parts:

actions can be ["delete", "create"] (or ["create", "delete"] when create_before_destroy is set). That's a replacement, not two separate operations. The parser normalizes everything to CREATE, UPDATE, DESTROY, or REPLACE before rules run.

Data sources (mode: "data") and no-ops get filtered out. They're reads, not changes.

change.before is null for creates. change.after is null for destroys. Rules have to handle both.

Resources inside modules have addresses like module.vpc.aws_security_group.bastion. The output shows a readable name, and the full address is preserved so a finding points at the exact resource.

I spent more time on the parser than anything else. If it reads something wrong, every rule downstream gets bad data.
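
The normalization steps above can be sketched roughly like this. It is an illustration under assumptions, not IACGuard's parser: `normalize`, `parse_plan`, and the flattened dict shape are hypothetical names. Only the input field names (`resource_changes`, `mode`, `change.actions`, `before`, `after`) come from Terraform's plan JSON.

```python
import json
from typing import Optional

# Map Terraform's "actions" list onto one normalized action.
# Both orderings of delete/create describe a replacement.
_ACTIONS = {
    ("create",): "CREATE",
    ("update",): "UPDATE",
    ("delete",): "DESTROY",
    ("delete", "create"): "REPLACE",
    ("create", "delete"): "REPLACE",
}


def normalize(resource_change: dict) -> Optional[dict]:
    """Return a flattened change, or None for entries rules should skip."""
    if resource_change.get("mode") == "data":
        return None  # data sources are reads, not changes
    change = resource_change.get("change") or {}
    action = _ACTIONS.get(tuple(change.get("actions") or []))
    if action is None:
        # "no-op", "read", or anything this sketch does not recognize.
        return None
    return {
        "address": resource_change["address"],  # full address, modules included
        "type": resource_change["type"],
        "action": action,
        # before is null for creates and after is null for destroys;
        # handing rules an empty dict keeps .get() lookups safe.
        "before": change.get("before") or {},
        "after": change.get("after") or {},
    }


def parse_plan(plan_json: str) -> list:
    """Parse `terraform show -json` output into normalized changes."""
    plan = json.loads(plan_json)
    changes = (normalize(rc) for rc in plan.get("resource_changes") or [])
    return [c for c in changes if c is not None]
```

One design choice worth questioning in a real tool: this sketch silently drops unrecognized action lists, where failing loudly might be the safer default for something that gates deployments.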

The three rules

Each rule is a pure Python function. Takes a ResourceChange, returns a Finding or None. No shared state, independently testable.

RDS001 — database replacement

An RDS replacement deletes the database and recreates it. Anything written between delete and restore is gone. The rule fires on actions == ["delete", "create"] or change.replacing == True — both checked because real plans sometimes only set one.

class RDS001(RuleBase):
    rule_id  = "RDS001"
    severity = Severity.CRITICAL

    def check(self, change, all_changes):
        if change.resource_type not in {"aws_db_instance", "aws_rds_cluster"}:
            return None
        if change.action != Action.REPLACE and not change.replacing:
            return None
        return Finding(
            ...
            message=f"RDS instance '{change.name}' will be replaced — potential data loss.",
            recommendation="Ensure a snapshot exists before applying."
        )
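
The snippet above leans on types the post doesn't show. Here is a minimal sketch of what they could look like, so the rule logic can be run on its own. Every name and field below (`Action`, `Severity`, `ResourceChange`, `Finding`, `check_rds_replacement`) is an assumption for illustration, not IACGuard's actual code, and the fields elided by `...` in the original remain unknown.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    CREATE = "create"
    UPDATE = "update"
    DESTROY = "destroy"
    REPLACE = "replace"


class Severity(Enum):
    CRITICAL = "critical"
    MEDIUM = "medium"


@dataclass(frozen=True)
class ResourceChange:
    address: str
    resource_type: str
    name: str
    action: Action
    replacing: bool = False


@dataclass(frozen=True)
class Finding:
    rule_id: str
    severity: Severity
    address: str
    message: str


def check_rds_replacement(change: ResourceChange) -> Optional[Finding]:
    """Stand-in for RDS001.check: fire on a replaced RDS instance or cluster."""
    if change.resource_type not in {"aws_db_instance", "aws_rds_cluster"}:
        return None
    if change.action != Action.REPLACE and not change.replacing:
        return None
    return Finding(
        rule_id="RDS001",
        severity=Severity.CRITICAL,
        address=change.address,
        message=f"'{change.name}' will be replaced; potential data loss.",
    )


replaced = ResourceChange(
    "aws_db_instance.primary", "aws_db_instance", "primary", Action.REPLACE
)
resized = ResourceChange(
    "aws_db_instance.primary", "aws_db_instance", "primary", Action.UPDATE
)
print(check_rds_replacement(replaced).rule_id)  # RDS001
print(check_rds_replacement(resized))           # None
```

Keeping the rule a function of its input alone is what makes the trigger/no-trigger pair this easy to write.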

SG001 — SSH open to the internet

This one had a bug I only found by running it on my own infrastructure.

The rule originally covered aws_security_group and aws_security_group_rule. When I ran it against my real Terraform code — which has port 22 open to 0.0.0.0/0 — it caught nothing.

Why: my code uses aws_vpc_security_group_ingress_rule, a newer resource type in recent versions of the AWS provider. Its structure is different: from_port, to_port, and cidr_ipv4 are top-level fields, not nested inside an ingress block.

I had test fixtures for the old types. Hadn't thought to write one for the new type. You don't know what you haven't tested until real infrastructure breaks it.

Fixed rule covers all three:

aws_security_group                   → ingress[] array
aws_security_group_rule              → type=ingress fields
aws_vpc_security_group_ingress_rule  → top-level from_port/to_port/cidr_ipv4

Fires if from_port <= 22 <= to_port. A rule with from_port=0, to_port=65535 exposes SSH. IPv6 (::/0) also checked.
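
A rough sketch of how one check can cover all three shapes, assuming it receives the resource type and the `after` object from the plan. The function names are hypothetical and this is not the shipped SG001 code. The attribute names (`ingress`, `cidr_blocks`, `ipv6_cidr_blocks`, `cidr_ipv4`, `cidr_ipv6`) are the AWS provider's.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def _covers_ssh(from_port, to_port) -> bool:
    # Ports can be null in plan JSON. This sketch only implements the
    # from_port <= 22 <= to_port test described above, so all-protocol
    # rules ("-1") would need separate handling.
    if from_port is None or to_port is None:
        return False
    return from_port <= 22 <= to_port


def _cidrs(block: dict) -> list:
    return list(block.get("cidr_blocks") or []) + list(
        block.get("ipv6_cidr_blocks") or []
    )


def _entries(resource_type: str, after: dict):
    """Yield (from_port, to_port, cidrs) for each ingress entry."""
    if resource_type == "aws_security_group":
        for rule in after.get("ingress") or []:
            yield rule.get("from_port"), rule.get("to_port"), _cidrs(rule)
    elif resource_type == "aws_security_group_rule":
        if after.get("type") == "ingress":
            yield after.get("from_port"), after.get("to_port"), _cidrs(after)
    elif resource_type == "aws_vpc_security_group_ingress_rule":
        cidrs = [c for c in (after.get("cidr_ipv4"), after.get("cidr_ipv6")) if c]
        yield after.get("from_port"), after.get("to_port"), cidrs


def ssh_open_to_world(resource_type: str, after: dict) -> bool:
    """True if any ingress entry spans port 22 from 0.0.0.0/0 or ::/0."""
    return any(
        _covers_ssh(lo, hi) and bool(OPEN_CIDRS.intersection(cidrs))
        for lo, hi, cidrs in _entries(resource_type, after)
    )
```

Reducing each resource type to the same `(from_port, to_port, cidrs)` tuple means a new shape only needs a new branch in `_entries`; the port and CIDR logic stays in one place.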

S3001 — missing public access block

Medium severity, not Critical. AWS accounts can have account-level Block Public Access settings that protect every bucket regardless of what Terraform configures. Flagging this Critical would false-positive on any account with account-level protection — which is a lot of accounts.

The output is:

S3 bucket 'assets' has no explicit block_public_access configuration.
Account-level settings may still protect this bucket.

The engineer decides. The tool surfaces it without over-reacting.

CI mode

The --ci flag prints JSON to stdout with no color, and signals the result through the exit code.

iacguard plan --plan plan.json --ci

Exit 0 = clean. Exit 1 = critical finding. Exit 2 = tool error.

- name: IACGuard
  run: |
    pip install iacguard
    terraform show -json tfplan > plan.json
    iacguard plan --plan plan.json --ci

PR blocks on exit 1. That's it.
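
For illustration, the exit-code contract could be implemented along these lines. This is a sketch under assumptions: `run_ci`, the `analyze` callable, and the `{"findings": [...]}` output shape are hypothetical, since the post does not show the tool's actual JSON format.

```python
import json
import sys
from typing import Callable


def run_ci(plan_path: str, analyze: Callable[[dict], list], out=sys.stdout) -> int:
    """Return 0 (clean), 1 (critical finding) or 2 (tool error)."""
    try:
        with open(plan_path, encoding="utf-8") as fh:
            plan = json.load(fh)
        findings = analyze(plan)  # list of dicts, each with a "severity" key
    except (OSError, ValueError) as exc:
        # Unreadable or malformed plan: a tool error, not a finding.
        # json.JSONDecodeError is a subclass of ValueError.
        print(json.dumps({"error": str(exc)}), file=sys.stderr)
        return 2
    json.dump({"findings": findings}, out)
    out.write("\n")
    return 1 if any(f["severity"] == "CRITICAL" for f in findings) else 0
```

Keeping "tool error" separate from "critical finding" lets a pipeline tell a broken scan apart from a risky plan, which matters when deciding whether to retry or to block the PR.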

Testing

Every rule has two tests minimum: a plan that triggers it and one that doesn't. All tests use real Terraform plan JSON files — no synthetic JSON built in test code. Synthetic fixtures pass even when the parser would fail on a real plan.

pytest tests/ -v
# 15 passed in 0.01s

What I got wrong

The original design had eight subcommands, a React dashboard, Kubernetes scanning, a custom cost engine, AI-generated fix code, and multi-region support — all in v1. I went through multiple rounds of design review, including running the spec past other LLMs to pressure-test it. Most of that got cut. What shipped is one command, three rules, and a CLI. It works on real infrastructure. The original would have shipped half-finished on everything.

The other mistake: I assumed I knew the Terraform plan JSON structure well enough to write the parser from memory. I didn't generate real plan files and read them first. The missing aws_vpc_security_group_ingress_rule fixture, and the bug it hid, is a direct result of that. Next time I'm reading real data before writing code that parses it.

What's next

Blast radius is the main one: given a resource being changed, compute what else depends on it from the Terraform graph. You change a security group, IACGuard tells you which EC2 instances, load balancers, and RDS clusters are downstream. I haven't found a free CLI tool that does this cleanly in a pre-deploy workflow.
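
Since this isn't built yet, the following is only a sketch of the core idea: invert the dependency edges, then walk outward from the changed resource. `blast_radius` and the `depends_on` mapping are hypothetical. How the edges would be extracted from the Terraform graph is left open.

```python
from collections import defaultdict, deque


def blast_radius(depends_on: dict, changed: str) -> set:
    """Return every resource that directly or transitively depends on `changed`.

    `depends_on` maps a resource address to the addresses it references.
    """
    # Invert the edges: for each target, which resources point at it?
    dependents = defaultdict(set)
    for resource, targets in depends_on.items():
        for target in targets:
            dependents[target].add(resource)

    # Breadth-first walk over dependents; `seen` also guards against cycles.
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in seen and dependent != changed:
                seen.add(dependent)
                queue.append(dependent)
    return seen


graph = {
    "aws_instance.web": ["aws_security_group.app"],
    "aws_lb.front": ["aws_security_group.app"],
    "aws_route53_record.www": ["aws_lb.front"],
    "aws_security_group.app": ["aws_vpc.main"],
}
print(sorted(blast_radius(graph, "aws_security_group.app")))
# ['aws_instance.web', 'aws_lb.front', 'aws_route53_record.www']
```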

After that: more rules, drift detection (Terraform state vs live AWS), and a browser graph for the dependency chain.

Try it

pip install iacguard

terraform plan -out=tfplan
terraform show -json tfplan > plan.json

iacguard plan --plan plan.json
iacguard plan --plan plan.json --ci

Source: https://github.com/adil-khan-723/iacguard

If you hit a resource type the rules miss, open an issue or PR. Each rule is one file.
