<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: M Suhail Tahir</title>
    <description>The latest articles on DEV Community by M Suhail Tahir (@m3d_suhail).</description>
    <link>https://dev.to/m3d_suhail</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3939749%2Fbf0775eb-6e90-4fd7-971c-b7d928a2622c.png</url>
      <title>DEV Community: M Suhail Tahir</title>
      <link>https://dev.to/m3d_suhail</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/m3d_suhail"/>
    <language>en</language>
    <item>
      <title>I Benchmarked 11 AI Models on Terraform Compliance. My Default Was Wrong.</title>
      <dc:creator>M Suhail Tahir</dc:creator>
      <pubDate>Tue, 19 May 2026 08:19:51 +0000</pubDate>
      <link>https://dev.to/m3d_suhail/i-benchmarked-11-ai-models-on-terraform-compliance-my-default-was-wrong-4hp9</link>
      <guid>https://dev.to/m3d_suhail/i-benchmarked-11-ai-models-on-terraform-compliance-my-default-was-wrong-4hp9</guid>
      <description>&lt;p&gt;&lt;em&gt;Running the same compliance scan across 11 models revealed that cost and accuracy are independent variables, and that my default was missing 1 in 5 violations.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem: picking models by reputation, not by task fit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you build an AI agent, one question nobody tells you how to answer is: which model do you use?&lt;/p&gt;

&lt;p&gt;The default instinct is &lt;strong&gt;“bigger is better.”&lt;/strong&gt; More expensive means more capable. GPT-4 over GPT-4-mini. Opus over Haiku. Frontier model, frontier results.&lt;/p&gt;

&lt;p&gt;So I put it to the test.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/m3dcodie/arch_agent" rel="noopener noreferrer"&gt;ADAG&lt;/a&gt; is an open-source multi-agent system that scans Terraform infrastructure against your organisation’s own policy documents — not a fixed CVE ruleset, but your rules. The audit agent reads your .tf files, retrieves relevant policies, and reports violations. Simple job. One agent. One task.&lt;/p&gt;

&lt;p&gt;I ran the exact same audit across 11 models. Same Terraform files. Same policies. Same 7 test cases. The only variable was the model.&lt;/p&gt;

&lt;p&gt;The model I was defaulting to missed 1 in 5 violations. A cheaper model caught them all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test setup&lt;/strong&gt;&lt;br&gt;
The suite has 7 test cases. Here is what each one tests, and why recall matters more than accuracy for compliance.&lt;/p&gt;

&lt;p&gt;Seven Terraform test cases — five designed to fail, two designed to pass. The violations covered the most common compliance gaps in production infrastructure: an S3 bucket without encryption, an RDS instance without deletion protection, an RDS instance with public access enabled, an IAM policy with wildcard resources, and a security group with SSH open to the world.&lt;/p&gt;

&lt;p&gt;For each test case, the model either caught the violation or it didn’t. No partial credit.&lt;/p&gt;

&lt;p&gt;Why recall over accuracy? Because in compliance scanning, a false negative (a missed violation) is the dangerous outcome. A model that flags a clean file wastes 30 seconds of an engineer’s time. A model that approves a misconfigured RDS instance ships a vulnerability to production. The threshold for production use is ≥95% recall. With only five violations in the suite, a single miss drops recall to 80%, so in practice that means catching all five. Miss that and you’re out.&lt;/p&gt;
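
&lt;p&gt;As a rough sketch of the scoring, recall here is the share of intentionally broken fixtures that a model flagged; clean fixtures do not enter the calculation. The result shape and every fixture name except s3_no_encryption are assumptions made for illustration, not the format used by run_benchmark.py.&lt;/p&gt;

```python
# Hypothetical result shape: fixture name -> (has_violation, model_flagged_it).
# Only "s3_no_encryption" is a name taken from the article; the rest are made up.
def recall(results):
    """Share of real violations the model caught (clean fixtures are ignored)."""
    caught = [flagged for has_violation, flagged in results.values() if has_violation]
    if not caught:
        raise ValueError("no violating fixtures in results")
    return sum(caught) / len(caught)


# Illustrative run: five violating fixtures, two clean ones, one miss.
example_run = {
    "s3_no_encryption": (True, False),  # missed violation: a false negative
    "rds_no_deletion_protection": (True, True),
    "rds_public_access": (True, True),
    "iam_wildcard_resource": (True, True),
    "sg_ssh_open": (True, True),
    "clean_s3": (False, False),
    "clean_rds": (False, False),
}

print(f"recall = {recall(example_run):.0%}")  # prints "recall = 80%"
```

&lt;p&gt;One miss out of five violations already lands at 80%, which is why a 95% floor and a 100% requirement are the same bar on a suite this size.&lt;/p&gt;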

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2f7hx6g6uf258lmuq8d8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2f7hx6g6uf258lmuq8d8.png" alt="adag-benchmark" width="618" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8x14ooc25avc8demupis.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8x14ooc25avc8demupis.png" alt="adag-benchmark" width="624" height="740"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8jm8jvl2h8z992916zu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8jm8jvl2h8z992916zu.png" alt="adag-benchmark" width="603" height="632"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardest violation:&lt;/strong&gt; The violation that exposed the most models? s3_no_encryption, missed by 4 models including GPT-4.1 and Sonnet 4.6.&lt;/p&gt;

&lt;p&gt;Cost and accuracy are independent variables. In this benchmark, a higher price per run did not buy higher recall.&lt;/p&gt;

&lt;p&gt;GPT-4.1, my default going in: $0.067/run → 80% recall. That means 1 in 5 real violations gets approved.&lt;/p&gt;

&lt;p&gt;Claude Haiku 4.5: $0.039/run → 100% recall. Cheaper. More accurate. Not the model I was using.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not subtle.&lt;/strong&gt; The violation GPT-4.1 missed was a missing encryption block on an S3 bucket, the kind of finding that shows up in breach reports.&lt;/p&gt;

&lt;p&gt;The models that missed it weren’t small or cheap. GPT-4.1 at $2/million tokens. Sonnet 4.6 at $3/million tokens. Both failed on the most basic S3 check.&lt;/p&gt;

&lt;p&gt;For compliance tooling there’s no “pretty good.”&lt;/p&gt;

&lt;p&gt;Either deletion_protection = false gets caught or it ships to production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“I required 100% recall. Only 5 of 11 models qualified. Among those 5, Haiku was the cheapest.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;5 models hit 100% recall. The sweet spot on cost + accuracy: Claude Haiku 4.5 ($0.039) and Gemini 2.5 Pro ($0.044).&lt;/p&gt;
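
&lt;p&gt;The selection rule above (filter on recall first, then take the lowest cost) can be sketched in a few lines. The three rows are the cost and recall pairs quoted in this article; the other eight models are left out rather than guessed, and the function name is hypothetical.&lt;/p&gt;

```python
# Cost/recall pairs quoted in the article. The remaining models are omitted
# because their per-run figures are not stated in the text.
RESULTS = [
    {"model": "GPT-4.1", "cost_per_run": 0.067, "recall": 0.80},
    {"model": "Claude Haiku 4.5", "cost_per_run": 0.039, "recall": 1.00},
    {"model": "Gemini 2.5 Pro", "cost_per_run": 0.044, "recall": 1.00},
]


def cheapest_qualifying(results, min_recall=1.0):
    """Drop models below the recall bar, then pick the cheapest survivor."""
    qualified = [r for r in results if r["recall"] >= min_recall]
    if not qualified:
        return None
    return min(qualified, key=lambda r: r["cost_per_run"])


best = cheapest_qualifying(RESULTS)
print(best["model"])  # prints "Claude Haiku 4.5"
```

&lt;p&gt;Treating recall as a hard filter rather than a weighted score keeps a cheap but leaky model from winning on price alone.&lt;/p&gt;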

&lt;p&gt;&lt;strong&gt;Full benchmark + test fixtures open source&lt;/strong&gt; 👇&lt;br&gt;
&lt;a href="https://github.com/m3dcodie/adag_test/tree/init-import/test_cases" rel="noopener noreferrer"&gt;https://github.com/m3dcodie/adag_test/tree/init-import/test_cases&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/m3dcodie/adag_test/blob/init-import/benchmark/run_benchmark.py" rel="noopener noreferrer"&gt;https://github.com/m3dcodie/adag_test/blob/init-import/benchmark/run_benchmark.py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why the cheaper model won&lt;/strong&gt;&lt;br&gt;
This wasn’t luck. My &lt;a href="https://github.com/m3dcodie/LLM-Capability-Framework-LCF" rel="noopener noreferrer"&gt;LLM Capability Framework&lt;/a&gt; maps task complexity to model tier. Compliance scanning is an L1–L2 task: deterministic extraction and rule matching. Haiku is optimised for exactly this. GPT-4.1 is an L4 model doing an L1 job. The benchmark confirmed what the framework predicted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this means for how I build&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/m3dcodie/arch_agent" rel="noopener noreferrer"&gt;ADAG&lt;/a&gt; now defaults to Claude Haiku 4.5 for the audit agent. The benchmark runs on every PR: if recall drops below 95% on any model update, the pipeline flags it. Model selection isn’t a one-time decision. It’s something you measure continuously.&lt;/p&gt;
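
&lt;p&gt;A minimal sketch of that PR gate, assuming the benchmark step hands over a mapping of model name to recall. The function names and the input shape are hypothetical and are not taken from the ADAG pipeline.&lt;/p&gt;

```python
RECALL_THRESHOLD = 0.95  # the 95% floor mentioned above


def failing_models(recall_by_model, threshold=RECALL_THRESHOLD):
    """Return, in sorted order, the models whose recall is under the threshold."""
    return sorted(m for m, r in recall_by_model.items() if threshold > r)


def check(recall_by_model):
    """Print one line per failing model and return a CI-style exit code."""
    failing = failing_models(recall_by_model)
    for model in failing:
        print(f"FAIL {model}: recall {recall_by_model[model]:.0%}")
    return 1 if failing else 0


# Hypothetical input using the two recall figures quoted in the article.
exit_code = check({"Claude Haiku 4.5": 1.00, "GPT-4.1": 0.80})
print(exit_code)  # prints "FAIL GPT-4.1: recall 80%" and then 1
```

&lt;p&gt;In a real CI job the return value would be passed to sys.exit so that a recall regression fails the build instead of only logging it.&lt;/p&gt;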

</description>
      <category>ai</category>
      <category>terraform</category>
      <category>devops</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
