DEV Community

M Suhail Tahir
Posted on • Originally published at Medium

I Benchmarked 11 AI Models on Terraform Compliance. My Default Was Wrong.

Running the same compliance scan across 11 models revealed that cost and accuracy are independent variables, and that my default was missing 1 in 5 violations.

The problem — picking models by reputation, not by task fit

When you build an AI agent, one question nobody tells you how to answer is: which model do you use?

The default instinct is “bigger is better.” More expensive means more capable. GPT-4 over GPT-4-mini. Opus over Haiku. Frontier model, frontier results.

So I put it to the test.

ADAG is an open-source multi-agent system that scans Terraform infrastructure against your organisation’s own policy documents — not a fixed CVE ruleset, but your rules. The audit agent reads your .tf files, retrieves relevant policies, and reports violations. Simple job. One agent. One task.

I ran the exact same audit across 11 models. Same Terraform files. Same policies. Same 7 test cases. The only variable was the model.

The model I was defaulting to missed 1 in 5 violations. A cheaper model caught them all.

Test setup
This section covers the 7 test cases, what each one tests, and why recall matters more than accuracy for compliance.

Seven Terraform test cases — five designed to fail, two designed to pass. The violations covered the most common compliance gaps in production infrastructure: an S3 bucket without encryption, an RDS instance without deletion protection, an RDS instance with public access enabled, an IAM policy with wildcard resources, and a security group with SSH open to the world.

For each test case, the model either caught the violation or it didn’t. No partial credit.
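A minimal sketch of that all-or-nothing scoring, assuming each fixture is labelled with the single violation it is meant to trigger (or none for the clean cases). Apart from `s3_no_encryption`, the case names, the `Case` type, and the `fake_audit` stand-in for a real model call are hypothetical and not taken from the ADAG repo.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Case:
    name: str
    expected: Optional[str]  # violation the fixture should trigger, None if clean


# Hypothetical labels for the 7 fixtures: 5 violating, 2 clean.
CASES = [
    Case("s3_no_encryption", "s3_no_encryption"),
    Case("rds_no_deletion_protection", "rds_no_deletion_protection"),
    Case("rds_public_access", "rds_public_access"),
    Case("iam_wildcard_resource", "iam_wildcard_resource"),
    Case("sg_ssh_open", "sg_ssh_open"),
    Case("clean_s3", None),
    Case("clean_rds", None),
]


def score(case: Case, findings: set[str]) -> bool:
    """Binary outcome: True only if the model's verdict matches the label."""
    if case.expected is None:
        return not findings           # a clean file must produce no findings
    return case.expected in findings  # a violating file must name the violation


def run(audit: Callable[[str], set[str]]) -> dict[str, bool]:
    """Run one model (wrapped as `audit`) over every fixture."""
    return {c.name: score(c, audit(c.name)) for c in CASES}


def fake_audit(case_name: str) -> set[str]:
    """Stand-in for a model call: catches everything except the S3 case."""
    if case_name.startswith("clean_") or case_name == "s3_no_encryption":
        return set()
    return {case_name}


results = run(fake_audit)
```

Swapping `fake_audit` for a function that calls a real model is the only change needed per model, which keeps the fixtures and the scoring identical across runs.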

Why recall over accuracy? Because in compliance scanning, a false negative (a missed violation) is the dangerous outcome. A model that flags a clean file wastes 30 seconds of an engineer’s time. A model that approves a misconfigured RDS instance ships a vulnerability to production. The threshold for production use is ≥95% recall. With only five violating cases, a single miss drops recall to 80%, so in practice a model has to catch all five. Miss that and you’re out.
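To make the difference concrete, here is a small sketch that computes both metrics for a model that misses one of the five violations. It assumes the model handled both clean fixtures correctly, which is an illustrative assumption rather than a reported result.

```python
def recall(outcomes: list[tuple[bool, bool]]) -> float:
    """outcomes holds (has_violation, flagged) pairs, one per test case."""
    caught = sum(1 for has_violation, flagged in outcomes if has_violation and flagged)
    missed = sum(1 for has_violation, flagged in outcomes if has_violation and not flagged)
    return caught / (caught + missed)


def accuracy(outcomes: list[tuple[bool, bool]]) -> float:
    """Share of cases where the verdict matched the label, clean or not."""
    correct = sum(1 for has_violation, flagged in outcomes if has_violation == flagged)
    return correct / len(outcomes)


# 5 violating fixtures with one miss, plus 2 clean fixtures passed correctly.
outcomes = [(True, True)] * 4 + [(True, False)] + [(False, False)] * 2

print(f"recall={recall(outcomes):.2f} accuracy={accuracy(outcomes):.2f}")
# prints: recall=0.80 accuracy=0.86
```

Accuracy gets credit for the clean files and lands at about 86%, which hides the fact that one real violation in five was approved.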

adag-benchmark


Hardest violation: s3_no_encryption. It exposed the most models, missed by 4 of the 11, including GPT-4.1 and Sonnet 4.6.

Cost and accuracy are independent variables. In this benchmark, the data bears that out.

GPT-4.1 — my current default: $0.067/run → 80% recall. That means 1 in 5 real violations gets approved.

Claude Haiku 4.5: $0.039/run → 100% recall. Cheaper. More accurate. Not the model I was using.

The s3_no_encryption violation is not subtle. It is a missing encryption block on an S3 bucket, the kind of finding that shows up in breach reports.

The models that missed it weren’t small or cheap. GPT-4.1 at $2/million tokens. Sonnet 4.6 at $3/million tokens. Both failed on the most basic S3 check.

For compliance tooling there’s no “pretty good.”

Either deletion_protection = false gets caught or it ships to production.

“I required 100% recall. Only 5 of 11 models qualified. Among those 5, Haiku was the cheapest.”

5 models hit 100% recall. The sweet spot on cost + accuracy: Claude Haiku 4.5 ($0.039) and Gemini 2.5 Pro ($0.044).
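The selection rule is simple enough to write down: filter by the recall threshold first, then sort by cost. The sketch below uses only the three models whose numbers are quoted in this post; the `RESULTS` layout and the `cheapest_qualifying` helper are illustrative and not part of the benchmark script.

```python
from typing import Optional

# Figures quoted in this post; the other 8 models are omitted here.
RESULTS = {
    "GPT-4.1": {"cost_per_run": 0.067, "recall": 0.80},
    "Claude Haiku 4.5": {"cost_per_run": 0.039, "recall": 1.00},
    "Gemini 2.5 Pro": {"cost_per_run": 0.044, "recall": 1.00},
}


def cheapest_qualifying(results: dict, min_recall: float) -> Optional[str]:
    """Return the cheapest model at or above min_recall, or None if none qualify."""
    qualified = {m: r for m, r in results.items() if r["recall"] >= min_recall}
    if not qualified:
        return None
    return min(qualified, key=lambda m: qualified[m]["cost_per_run"])


print(cheapest_qualifying(RESULTS, min_recall=0.95))
# prints: Claude Haiku 4.5
```

Treating recall as a hard gate rather than one term in a weighted score means a cheap model can never buy its way past a missed violation.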

Full benchmark + test fixtures open source 👇
https://github.com/m3dcodie/adag_test/tree/init-import/test_cases

https://github.com/m3dcodie/adag_test/blob/init-import/benchmark/run_benchmark.py

Why the cheaper model won
This wasn’t luck. My LLM Capability Framework maps task complexity to model tier. Compliance scanning is an L1–L2 task — deterministic extraction and rule matching. Haiku is optimised for exactly this. GPT-4.1 is an L4 model doing an L1 job. The benchmark confirmed what the framework predicted.

What this means for how I build
ADAG now defaults to Claude Haiku 4.5 for the audit agent. The benchmark runs on every PR — if recall drops below 95% on any model update, the pipeline flags it. Model selection isn’t a one-time decision. It’s something you measure continuously.
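A rough sketch of what such a gate could look like, assuming the benchmark step produces a recall figure per model. The function names and the input shape are hypothetical; the actual pipeline may be wired differently.

```python
import sys

RECALL_THRESHOLD = 0.95  # threshold stated in this post


def failing_models(recall_by_model: dict[str, float],
                   threshold: float = RECALL_THRESHOLD) -> list[str]:
    """Names of models whose recall fell below the threshold, sorted."""
    return sorted(m for m, r in recall_by_model.items() if r < threshold)


def gate(recall_by_model: dict[str, float]) -> int:
    """Report failures on stderr and return a process exit code (0 = pass)."""
    failures = failing_models(recall_by_model)
    for model in failures:
        print(
            f"FAIL {model}: recall {recall_by_model[model]:.0%} "
            f"is below {RECALL_THRESHOLD:.0%}",
            file=sys.stderr,
        )
    return 1 if failures else 0


# Example input using two recall figures quoted in this post.
exit_code = gate({"Claude Haiku 4.5": 1.00, "GPT-4.1": 0.80})
# A CI entry point would pass this value to sys.exit().
```

Returning an exit code instead of raising keeps the check easy to call from any CI runner, and a non-zero code is what marks the PR as flagged.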
