<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Amit Bhatt</title>
    <description>The latest articles on DEV Community by Amit Bhatt (@baremetal-dev).</description>
    <link>https://dev.to/baremetal-dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869131%2F9229145b-c825-440c-873e-f83c12aa93a5.png</url>
      <title>DEV Community: Amit Bhatt</title>
      <link>https://dev.to/baremetal-dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/baremetal-dev"/>
    <language>en</language>
    <item>
      <title>We Let AI Write Our Terraform. Then We Gave It a Security Conscience</title>
      <dc:creator>Amit Bhatt</dc:creator>
      <pubDate>Thu, 09 Apr 2026 12:07:11 +0000</pubDate>
      <link>https://dev.to/baremetal-dev/we-let-ai-write-our-terraform-then-we-gave-it-a-security-conscience-480e</link>
      <guid>https://dev.to/baremetal-dev/we-let-ai-write-our-terraform-then-we-gave-it-a-security-conscience-480e</guid>
      <description>&lt;p&gt;Designing cloud infrastructure usually takes three meetings.&lt;/p&gt;

&lt;p&gt;One with the architect to decide which services to use. One with the DevOps engineer to actually write the Terraform. One with the security team to explain, again, why &lt;code&gt;0.0.0.0/0&lt;/code&gt; is not an acceptable production CIDR.&lt;/p&gt;

&lt;p&gt;By the time all three conversations happen, the architecture diagram is already out of date.&lt;/p&gt;

&lt;p&gt;So we asked a different question: what if those three roles, plus one to keep the diagram current, ran as AI agents in a single automated pipeline?&lt;/p&gt;

&lt;p&gt;You type your requirements in plain English. You get back deployable Terraform HCL, a security audit with specific remediation guidance, and a rendered architecture diagram. In one shot, without the meetings.&lt;/p&gt;

&lt;p&gt;That's InfraSquad. This post is about what we learned building it, what broke badly, and what we would tell ourselves at the start.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; InfraSquad is a multi-agent system built on LangGraph. Four agents collaborate in a cyclic state machine. Security findings loop back to the DevOps agent for fixes, capped at three cycles. Without that cap, the loop runs forever. We learned this during testing. The code is open source at &lt;a href="https://github.com/Andela-AI-Engineering-Bootcamp/infrasquad" rel="noopener noreferrer"&gt;Andela-AI-Engineering-Bootcamp/infrasquad&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Meet the Squad
&lt;/h2&gt;

&lt;p&gt;Four agents. One shared pipeline. Here is what each one actually does:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Product Architect&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reads your requirements, considers scale, compliance, cost&lt;/td&gt;
&lt;td&gt;A numbered AWS architecture plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DevOps Engineer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Translates the plan into code; fixes security findings when sent back&lt;/td&gt;
&lt;td&gt;Valid Terraform HCL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security Auditor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs tfsec or checkov via MCP; classifies every finding by severity&lt;/td&gt;
&lt;td&gt;A structured JSON security report&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Visualizer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reads the final plan and code after security passes&lt;/td&gt;
&lt;td&gt;A Mermaid architecture diagram rendered to PNG&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The critical word in that table is "sent back." The Security Auditor does not just generate a report and hand it off. It can send the DevOps Engineer back to fix its own code. That feedback loop is the most interesting design decision in the system. It is also how we nearly created an infinite loop on the second day of integration testing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fykwzu6t2ivtn3mkd4vyt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fykwzu6t2ivtn3mkd4vyt.png" alt="InfraSquad-four AI agents collaborating on cloud infrastructure design" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pipeline (and the Two Places It Can Loop)
&lt;/h2&gt;

&lt;p&gt;Here is the full state machine:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjvpy1l6ujzmjdk9h2e9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjvpy1l6ujzmjdk9h2e9.png" alt="InfraSquad pipeline diagram showing the LangGraph state machine-validate_input, architect, devops, validate_output, security, visualizer, with two loop-back arrows for HCL errors and security findings" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The happy path is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;validate_input&lt;/strong&gt; runs three checks before anything expensive happens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;architect&lt;/strong&gt; produces a numbered AWS architecture plan&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;devops&lt;/strong&gt; writes Terraform HCL from that plan&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;validate_output&lt;/strong&gt; checks the HCL deterministically for forbidden patterns and structural validity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;security&lt;/strong&gt; scans with tfsec or checkov via MCP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;visualizer&lt;/strong&gt; renders the architecture as a Mermaid diagram&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Two of those six nodes can send the pipeline backwards. That is intentional. It is also dangerous if you do not cap the cycle count, which we did not do initially.&lt;/p&gt;

&lt;p&gt;All six nodes share a single typed state object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;user_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;architecture_plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;terraform_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;security_report&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;security_passed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;remediation_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;hcl_remediation_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;hcl_validation_errors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;current_phase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;total=False&lt;/code&gt; matters. Without it, every agent would need to set every field, even fields it knows nothing about. With it, agents only write what they own. Silent downstream failures from unexpected &lt;code&gt;None&lt;/code&gt; values were the most frustrating class of bug we hit early on.&lt;/p&gt;
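&lt;p&gt;A minimal, dependency-free sketch of the behavior this buys (the &lt;code&gt;update&lt;/code&gt; calls are an illustrative stand-in for the framework's state merging):&lt;/p&gt;

```python
from typing import TypedDict

class AgentState(TypedDict, total=False):
    user_request: str
    terraform_code: str
    security_passed: bool
    remediation_count: int

# Each node returns only the keys it owns; the framework merges the
# partial updates into the shared state between steps.
state: AgentState = {}
state.update({"user_request": "VPC with RDS"})         # architect's slice
state.update({"terraform_code": 'provider "aws" {}'})  # devops's slice

# Unset keys are absent rather than None, so reads use .get() with an
# explicit default instead of silently propagating None downstream.
count = state.get("remediation_count", 0)
print(count)                       # 0
print("security_passed" in state)  # False
```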




&lt;h2&gt;
  
  
  We Almost Created an Infinite Loop on Day Two
&lt;/h2&gt;

&lt;p&gt;During integration testing, we ran a request for an internet-facing Application Load Balancer.&lt;/p&gt;

&lt;p&gt;The Security Auditor flagged it: &lt;code&gt;AVD-AWS-0107 (HIGH): security group allows unrestricted ingress from 0.0.0.0/0&lt;/code&gt;. The DevOps agent tried to fix it. The Security Auditor re-scanned. Same finding. The DevOps agent tried again. Same finding.&lt;/p&gt;

&lt;p&gt;The problem: a public ALB is supposed to have unrestricted public ingress. That is what "internet-facing" means. The security finding was technically correct and permanently unfixable given the design intent. The LLM had no way to distinguish "security issue to remediate" from "accepted design constraint."&lt;/p&gt;

&lt;p&gt;Without an exit condition, this loop runs forever.&lt;/p&gt;

&lt;p&gt;Here is what the routing logic looks like with the cap in place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_after_security&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;visualizer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;devops&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security_passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;visualizer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remediation_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_remediation_cycles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;visualizer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# move on regardless
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;devops&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After three cycles, the pipeline proceeds with whatever state it has. Unresolved findings appear as advisory warnings in the Security tab, not hard failures. The same cap exists on the HCL validation loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add your cycle caps before your first integration test.&lt;/strong&gt; Not after. You will hit this case.&lt;/p&gt;
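&lt;p&gt;The failure mode is easy to reproduce without an LLM in the loop. Here is a dependency-free simulation (the always-failing scan and the inline cap value are illustrative) showing the router terminating even when a finding never clears:&lt;/p&gt;

```python
MAX_REMEDIATION_CYCLES = 3  # stand-in for settings.max_remediation_cycles

def route_after_security(state: dict) -> str:
    if state.get("security_passed", False):
        return "visualizer"
    if state.get("remediation_count", 0) >= MAX_REMEDIATION_CYCLES:
        return "visualizer"  # move on regardless
    return "devops"

def run_pipeline() -> int:
    # Simulate a finding that can never be fixed, like a public ALB's
    # intentionally unrestricted ingress: every re-scan fails.
    state = {"security_passed": False, "remediation_count": 0}
    attempts = 0
    while route_after_security(state) == "devops":
        state["remediation_count"] += 1  # devops retries, scan fails again
        attempts += 1
    return attempts

print(run_pipeline())  # 3: without the cap, this loop never exits
```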




&lt;h2&gt;
  
  
  The One Bad Habit We Couldn't Engineer Away
&lt;/h2&gt;

&lt;p&gt;Every model we tested had the same behavior: for any internet-facing resource, it generated &lt;code&gt;0.0.0.0/0&lt;/code&gt; as the security group ingress CIDR. Even with explicit instructions in the system prompt. Even with examples. Even with counter-examples.&lt;/p&gt;

&lt;p&gt;We tried prompt engineering for weeks. The model would acknowledge the constraint, then generate &lt;code&gt;0.0.0.0/0&lt;/code&gt; anyway on the next call.&lt;/p&gt;

&lt;p&gt;So we stopped fighting it and added a deterministic sanitizer that runs on the DevOps agent's output before validation even starts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_CIDR_SANITISATIONS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pattern&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;0\.0\.0\.0/0&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;10.0.0.0/8&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;::/0&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;         &lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;fc00::/7&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_sanitize_hcl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hcl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;replacement&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_CIDR_SANITISATIONS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;hcl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;replacement&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hcl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hcl&lt;/span&gt;

&lt;span class="n"&gt;clean_hcl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_sanitize_hcl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;terraform_hcl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;10.0.0.0/8&lt;/code&gt; is a broad internal placeholder. Operators narrow it before deploying to production.&lt;/p&gt;

&lt;p&gt;This single function eliminated the most common trigger of the HCL validation loop. First-pass generations stopped tripping the guardrail on CIDR issues almost entirely, so when the guardrail fires now, it catches a genuine structural problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When a model reliably produces the same wrong output, fix it deterministically. Do not prompt your way out of a consistency problem.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Questions Before the LLM Sees Anything
&lt;/h2&gt;

&lt;p&gt;The most expensive mistake in an agentic pipeline is burning tokens on requests that should never reach the agents. InfraSquad catches these at three successive layers, only the last of which costs an LLM call.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxqfvjwsxvwo1l6guttb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxqfvjwsxvwo1l6guttb.png" alt="Three layers of input validation-chitchat detection, keyword matching, and LLM classification as a narrowing funnel" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Chitchat detection (zero cost)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A frozenset of 40+ conversational tokens lets the validator return immediately. "Thanks", "ok cool", "sounds good", a thumbs-up emoji: none of these should reach the architect.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_CHITCHAT_TOKENS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;okay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thanks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thank you&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;great&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;awesome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;got it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yep&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_is_chitchat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;!.,?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;!.,?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_CHITCHAT_TOKENS&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No LLM call. No latency. Instant return.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Keyword matching (zero cost)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A compiled regex matches 45 AWS infrastructure keywords. Two or more matches skip the LLM check entirely. "VPC with RDS Postgres and ALB" is obviously a valid request. Spending 2 seconds and tokens to confirm this is wasteful.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;keyword_match_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# High confidence-skip the LLM round-trip (~2s saved per clear request)
&lt;/span&gt;    &lt;span class="n"&gt;is_valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;About 70% of valid requests take this fast path.&lt;/p&gt;
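&lt;p&gt;The helper behind that branch can be a single compiled regex. A minimal sketch with an abbreviated keyword list (the real list has 45 entries; counting distinct hits is one plausible design):&lt;/p&gt;

```python
import re

# Illustrative subset of the 45-keyword AWS infrastructure list.
_AWS_KEYWORDS = re.compile(
    r"\b(vpc|rds|postgres|alb|s3|lambda|ec2|redis|elasticache|dynamodb)\b",
    re.IGNORECASE,
)

def keyword_match_count(text: str) -> int:
    # Count distinct keywords so repetition doesn't fake confidence.
    return len({m.lower() for m in _AWS_KEYWORDS.findall(text)})

print(keyword_match_count("VPC with RDS Postgres and ALB"))  # 4 -> fast path
print(keyword_match_count("a server for my app"))            # 0 -> LLM check
```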

&lt;p&gt;&lt;strong&gt;Layer 3: LLM plausibility (borderline cases only)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Single-keyword matches are genuinely ambiguous. "Server" could be valid. "AWS tomato server" should not be. For these, a lightweight LLM call returns one of three outcomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;_FirstMessageClassification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proceed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clarify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;clarify&lt;/code&gt; triggers a helpful guidance message. &lt;code&gt;reject&lt;/code&gt; returns a polite explanation. Both avoid running the full pipeline on nonsense input.&lt;/p&gt;

&lt;p&gt;There is a catch. In active conversations, keyword matching stops working correctly. "Explain the Terraform code" contains the word "Terraform" but is clearly a follow-up question, not a new generation request. So in active sessions, we switch to a full intent classifier that distinguishes &lt;code&gt;new_generation&lt;/code&gt;, &lt;code&gt;follow_up&lt;/code&gt;, and &lt;code&gt;off_topic&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fgsfcnowl1zqzezhk3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fgsfcnowl1zqzezhk3u.png" alt="InfraSquad handling an off-topic query-the guardrail returns a helpful explanation without triggering the pipeline" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Security Check That Does Not Trust the LLM
&lt;/h2&gt;

&lt;p&gt;Two patterns are blocked by hardcoded regex, independent of everything else:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_ADMIN_ACCESS_PATTERN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AdministratorAccess&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;_STAR_POLICY_PATTERN&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;Action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\s*:\s*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\*&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;AdministratorAccess&lt;/code&gt; policies and wildcard IAM actions are blocked regardless of what the model thought it generated. Not by the HCL guardrail. Not by the Security Auditor. By a function that runs on every output, unconditionally.&lt;/p&gt;

&lt;p&gt;The reason for running this separately from the guardrail: the HCL guardrail checks for &lt;code&gt;AdministratorAccess&lt;/code&gt; in a string pattern that could miss an IAM policy embedded inside a heredoc JSON block. The standalone regex catches it regardless of context.&lt;/p&gt;

&lt;p&gt;Two independent checks. Neither relying on the other being correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  The HCL Guardrail: Before Security Even Runs
&lt;/h2&gt;

&lt;p&gt;Before the Terraform reaches the Security Auditor, it passes through a deterministic validator. This runs on every generation-first pass and every remediation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_FORBIDDEN_PATTERNS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AdministratorAccess&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Uses AdministratorAccess IAM policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0\.0\.0\.0/0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Contains 0.0.0.0/0 CIDR-opens resource to the internet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;public\s*=\s*true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sets public access to true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It also checks structural validity: &lt;code&gt;provider&lt;/code&gt; and &lt;code&gt;resource&lt;/code&gt; blocks must be present, resource signatures must be well-formed, and braces must balance. Any failure sends the code back to the DevOps agent with the specific error list attached to the next prompt.&lt;/p&gt;
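&lt;p&gt;A minimal sketch of those structural checks (names and exact rules are illustrative; the real validator does more):&lt;/p&gt;

```python
import re

def validate_hcl_structure(hcl: str) -> list[str]:
    """Deterministic structural checks; any error loops back to devops."""
    errors: list[str] = []
    if not re.search(r'\bprovider\s+"', hcl):
        errors.append("Missing provider block")
    if not re.search(r'\bresource\s+"[\w-]+"\s+"[\w-]+"\s*\{', hcl):
        errors.append("Missing or malformed resource block")
    if hcl.count("{") != hcl.count("}"):
        errors.append("Unbalanced braces")
    return errors

good = 'provider "aws" {}\nresource "aws_vpc" "main" { cidr_block = "10.0.0.0/16" }'
print(validate_hcl_structure(good))                  # []
print(validate_hcl_structure("resource aws_vpc {"))  # all three errors
```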

&lt;p&gt;The CIDR sanitizer runs before this check, and the ordering is intentional: removing &lt;code&gt;0.0.0.0/0&lt;/code&gt; first means the guardrail fires only on real structural problems.&lt;/p&gt;
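&lt;p&gt;The sanitizer itself can be a couple of lines. A sketch (the replacement CIDR is illustrative; the real project may substitute something else, such as a variable reference):&lt;/p&gt;

```python
import re

# Replace any open-to-the-world CIDR with a placeholder private range.
# The replacement value is illustrative, not the project's actual choice.
_OPEN_CIDR = re.compile(r'"0\.0\.0\.0/0"')

def sanitize_cidrs(code: str) -> str:
    return _OPEN_CIDR.sub('"10.0.0.0/16"', code)
```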




&lt;h2&gt;
  
  
  What the Pipeline Actually Produces
&lt;/h2&gt;

&lt;p&gt;Here is a real run. Request: "VPC with an RDS Postgres instance, an Application Load Balancer, and a Redis caching layer."&lt;/p&gt;

&lt;p&gt;The DevOps agent follows a security baseline baked into its system prompt. S3 buckets get KMS encryption, versioning, and public access blocks by default. RDS gets &lt;code&gt;storage_encrypted = true&lt;/code&gt;, &lt;code&gt;deletion_protection = true&lt;/code&gt;, and 7-day backup retention. ElastiCache gets encryption at rest and in transit. VPCs get flow logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0lfvwjff1kop2cl4i8et.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0lfvwjff1kop2cl4i8et.png" alt="Terraform HCL generated by the DevOps agent in the InfraSquad UI, showing encryption and security defaults applied" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the security check passes, the Visualizer reads the finalized plan and code and generates a Mermaid diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ui7x2718geev430xkas.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ui7x2718geev430xkas.png" alt="InfraSquad UI showing the generated Mermaid architecture diagram as source and rendered PNG" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;mmdc&lt;/code&gt; is not installed, the Mermaid source is saved as-is. It is still fully useful: paste it into any Mermaid viewer and you get the diagram.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Security Audit Loop: How It Actually Works
&lt;/h2&gt;

&lt;p&gt;When the Security Auditor finds issues, it does not just list them. It produces a structured prompt that becomes the DevOps agent's next input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MANDATORY SECURITY REMEDIATION-3 finding(s)
Fix EVERY numbered item below. Do NOT skip any.

Finding 1. [HIGH] AVD-AWS-0107 - aws_security_group.app_sg
   Issue: Security group allows unrestricted ingress on port 443
   Fix:   Restrict ingress to specific CIDR ranges or security group references.

Finding 2. [HIGH] AVD-AWS-0132 - aws_s3_bucket.assets
   Issue: S3 bucket does not use KMS encryption with a customer-managed key
   Fix:   Add aws_kms_key + aws_s3_bucket_server_side_encryption_configuration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The DevOps agent sees &lt;code&gt;MANDATORY SECURITY REMEDIATION&lt;/code&gt; in its next prompt and treats every numbered item as a required fix.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;security_passed = True&lt;/code&gt; only when there are zero CRITICAL and zero HIGH findings. MEDIUM and LOW findings get reported but do not block the pipeline. The visualization still renders.&lt;/p&gt;
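&lt;p&gt;The gate itself reduces to a few lines. A sketch (the &lt;code&gt;severity&lt;/code&gt; field name is an assumption about the finding schema):&lt;/p&gt;

```python
def security_passed(findings: list[dict]) -> bool:
    """Block the pipeline only on CRITICAL or HIGH findings.

    Each finding is assumed to carry a "severity" key; MEDIUM and LOW
    are reported but do not gate the run.
    """
    blocking = {"CRITICAL", "HIGH"}
    return all(f["severity"] not in blocking for f in findings)
```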




&lt;h2&gt;
  
  
  Why LangGraph Over CrewAI or AutoGen
&lt;/h2&gt;

&lt;p&gt;This came down to one question: does the framework support cycles with explicit state management and typed contracts?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Cyclic workflows&lt;/th&gt;
&lt;th&gt;Typed shared state&lt;/th&gt;
&lt;th&gt;Explicit retry caps&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LangGraph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native conditional edges&lt;/td&gt;
&lt;td&gt;TypedDict, full control&lt;/td&gt;
&lt;td&gt;Direct in routing logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CrewAI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Workarounds required&lt;/td&gt;
&lt;td&gt;Role-based model&lt;/td&gt;
&lt;td&gt;Not built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AutoGen&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conversation-driven&lt;/td&gt;
&lt;td&gt;Implicit&lt;/td&gt;
&lt;td&gt;Not built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The security remediation loop is cyclic by design. The Security Auditor sends the DevOps Engineer back; the DevOps Engineer generates new code; the new code gets re-scanned. Both CrewAI and AutoGen require workarounds for this pattern. LangGraph's conditional edges handle it natively.&lt;/p&gt;

&lt;p&gt;The typed state was also non-negotiable. Without a clear contract on what each agent receives and produces, integration failures are silent. An agent gets &lt;code&gt;None&lt;/code&gt; where it expected a string and fails three nodes downstream with a cryptic error. &lt;code&gt;TypedDict&lt;/code&gt; with &lt;code&gt;total=False&lt;/code&gt; gives every agent a contract it cannot accidentally break.&lt;/p&gt;
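&lt;p&gt;In practice that contract is just a &lt;code&gt;TypedDict&lt;/code&gt;. A sketch with illustrative field names (the real project defines its own):&lt;/p&gt;

```python
from typing import TypedDict

class PipelineState(TypedDict, total=False):
    # Field names are illustrative, not the project's actual schema.
    # total=False makes every key optional, so agents can fill the
    # state incrementally while static checkers still catch typos.
    user_request: str
    architecture_plan: str
    terraform_code: str
    security_findings: list
    security_passed: bool
    remediation_attempts: int
```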




&lt;h2&gt;
  
  
  External Tools Through MCP
&lt;/h2&gt;

&lt;p&gt;tfsec and &lt;code&gt;mmdc&lt;/code&gt; (Mermaid rendering) run as MCP tools, not direct imports: each agent invokes them through the Model Context Protocol rather than calling library code in-process.&lt;/p&gt;

&lt;p&gt;This looks like over-engineering for a project at this scale. The argument for it: tfsec and &lt;code&gt;mmdc&lt;/code&gt; are external processes that can time out, crash, or produce unexpected output. Wrapping them in an MCP tool forces explicit failure handling at every call site.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_try_tfsec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmpdir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;_try_checkov&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmpdir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;tfsec unavailable? Try checkov. checkov unavailable? LLM security review with the full security system prompt. &lt;code&gt;mmdc&lt;/code&gt; unavailable? Save Mermaid source. Every external dependency ended up with a fallback path, which would not have happened if they were direct imports.&lt;/p&gt;

&lt;p&gt;The MCP server also runs independently. It can be swapped or extended without touching any agent code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five Things We'd Tell Ourselves at the Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Hard-cap your cycles before the first integration test.&lt;/strong&gt;&lt;br&gt;
You will hit the infinite loop case. Probably on a public-facing resource where the security finding is technically correct and architecturally intentional. Add the counter before you need it.&lt;/p&gt;
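&lt;p&gt;A counter-guarded router is all it takes. A sketch (node names and the cap value are illustrative):&lt;/p&gt;

```python
MAX_REMEDIATION_CYCLES = 3  # illustrative cap

def route_after_audit(state: dict) -> str:
    """Pick the next node after a security audit."""
    if state.get("security_passed"):
        return "visualize"
    attempts = state.get("remediation_attempts", 0)
    if attempts in range(MAX_REMEDIATION_CYCLES):
        return "devops"      # loop back for another remediation pass
    return "visualize"       # cap reached: stop looping, surface remaining findings
```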

&lt;p&gt;&lt;strong&gt;2. Regex beats prompting for deterministic security invariants.&lt;/strong&gt;&lt;br&gt;
If a property can be expressed as a pattern, enforce it with code. LLM compliance on security constraints is probabilistic. Code compliance is guaranteed. The CIDR sanitizer took 10 lines to write and immediately eliminated the majority of first-pass HCL failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Typed state is not optional in multi-agent systems.&lt;/strong&gt;&lt;br&gt;
Silent failures are the worst kind. A &lt;code&gt;TypedDict&lt;/code&gt; with &lt;code&gt;total=False&lt;/code&gt; is a contract every agent signs. Without it, you are debugging &lt;code&gt;None&lt;/code&gt; errors three nodes downstream and trying to reconstruct which agent set which field when.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Pydantic schema retry saves more than you expect.&lt;/strong&gt;&lt;br&gt;
Without &lt;code&gt;invoke_with_schema_retry&lt;/code&gt;, the pipeline fails silently on every malformed JSON response. With it, about 80% of schema failures resolve on the first retry with an error correction prompt. Make this load-bearing from day one.&lt;/p&gt;
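&lt;p&gt;A simplified version of that wrapper, using stdlib &lt;code&gt;json&lt;/code&gt; in place of Pydantic so the sketch stays self-contained (the retry-with-feedback shape is the point, not the exact API):&lt;/p&gt;

```python
import json

def invoke_with_schema_retry(call, validate, max_attempts=3):
    """Re-invoke an LLM call until its JSON output validates.

    `call(feedback)` returns raw text; `validate(obj)` raises
    ValueError on a bad payload. The real project validates with
    Pydantic models; stdlib json stands in here.
    """
    feedback = ""
    for attempt in range(max_attempts):
        raw = call(feedback)
        try:
            obj = json.loads(raw)
            validate(obj)
            return obj
        except (json.JSONDecodeError, ValueError) as exc:
            # The error text becomes the correction prompt for the retry.
            feedback = f"Previous response was invalid: {exc}. Return valid JSON only."
    raise RuntimeError("schema validation failed after retries")
```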

&lt;p&gt;&lt;strong&gt;5. Input validation pays for itself in saved tokens.&lt;/strong&gt;&lt;br&gt;
Chitchat and off-topic requests are common in demo environments. Every one that reaches the architect burns tokens before returning an unhelpful or confusing response. The three-layer guardrail means only genuine infrastructure requests reach the expensive part of the pipeline.&lt;/p&gt;
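&lt;p&gt;The cheapest of those layers can be a plain keyword heuristic that runs before any model call. A sketch (the keyword list is illustrative, not the project's actual guardrail):&lt;/p&gt;

```python
import re

# Illustrative first-layer filter: only requests that look like
# infrastructure work proceed to the LLM-based layers.
_INFRA_HINTS = re.compile(
    r"\b(vpc|s3|rds|ec2|lambda|terraform|bucket|cluster|load balancer|"
    r"database|redis|subnet|iam)\b",
    re.IGNORECASE,
)

def looks_like_infra_request(text: str) -> bool:
    return bool(_INFRA_HINTS.search(text))
```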


&lt;h2&gt;
  
  
  Run It Yourself
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Andela-AI-Engineering-Bootcamp/infrasquad.git
&lt;span class="nb"&gt;cd &lt;/span&gt;infrasquad
uv venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; uv &lt;span class="nb"&gt;sync&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Add your OpenRouter API key to &lt;code&gt;.env&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-or-...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the UI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python app.py              &lt;span class="c"&gt;# localhost:7860&lt;/span&gt;
python app.py &lt;span class="nt"&gt;--share&lt;/span&gt;      &lt;span class="c"&gt;# public Gradio URL&lt;/span&gt;
python app.py &lt;span class="nt"&gt;--port&lt;/span&gt; 8080  &lt;span class="c"&gt;# custom port&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default model is &lt;code&gt;openai/gpt-4o-mini&lt;/code&gt; via OpenRouter. Swap to any model OpenRouter supports by changing two env vars:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;LLM_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;anthropic/claude-3-5-sonnet
&lt;span class="nv"&gt;LLM_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://openrouter.ai/api/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or point it at a local Ollama instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;LLM_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;qwen2.5:72b
&lt;span class="nv"&gt;LLM_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:11434/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Optional tools for real scanner output and rendered diagrams:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;tfsec
pip &lt;span class="nb"&gt;install &lt;/span&gt;checkov
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @mermaid-js/mermaid-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If none of these are installed, the pipeline still completes. Security falls back to LLM review and diagrams save as Mermaid source.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpbejnu9f8ny2qsfxlih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpbejnu9f8ny2qsfxlih.png" alt="Built with: LangGraph, Python 3.12+, OpenRouter, FastMCP, Gradio, pydantic-settings, tfsec/checkov, Mermaid.js, uv" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Full source: &lt;a href="https://github.com/Andela-AI-Engineering-Bootcamp/infrasquad" rel="noopener noreferrer"&gt;infrasquad on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built at &lt;a href="https://help.andela.com/hc/en-us/articles/48808339012115-Welcome-to-the-AI-Engineering-Bootcamp" rel="noopener noreferrer"&gt;Andela AI Engineering Bootcamp&lt;/a&gt; by &lt;a href="https://linkedin.com/in/amit-bhatt" rel="noopener noreferrer"&gt;Amit&lt;/a&gt;, Ayesha, Elijah, Joel, Stella, and Adetayo.&lt;/p&gt;

&lt;p&gt;If you are building anything with &lt;a href="https://github.com/langchain-ai/langgraph" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt;, multi-agent pipelines, or IaC automation, drop a comment. Especially curious whether anyone else hit the public ALB infinite loop case, and how you handled it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>aws</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Sick of API costs and rate limits? I turned my M1 Mac into a fully offline AI coding agent. No cloud. No API keys. Just raw local compute using Llama.cpp and a 26B model. Check out the architecture and build it yourself! 🚀👇</title>
      <dc:creator>Amit Bhatt</dc:creator>
      <pubDate>Thu, 09 Apr 2026 08:44:23 +0000</pubDate>
      <link>https://dev.to/baremetal-dev/sick-of-api-costs-and-rate-limits-i-turned-my-m1-mac-into-a-fully-offline-ai-coding-agent-no-51gb</link>
      <guid>https://dev.to/baremetal-dev/sick-of-api-costs-and-rate-limits-i-turned-my-m1-mac-into-a-fully-offline-ai-coding-agent-no-51gb</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/baremetal-dev/i-turned-my-m1-macbook-into-an-offline-ai-coding-agent-0-api-cost-zero-cloud-14hb" class="crayons-story__hidden-navigation-link"&gt;I Turned My M1 MacBook Into an Offline AI Coding Agent - $0 API Cost, Zero Cloud&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/baremetal-dev" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869131%2F9229145b-c825-440c-873e-f83c12aa93a5.png" alt="baremetal-dev profile" class="crayons-avatar__image" width="606" height="626"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/baremetal-dev" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Amit Bhatt
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Amit Bhatt
                
              
              &lt;div id="story-author-preview-content-3475178" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/baremetal-dev" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869131%2F9229145b-c825-440c-873e-f83c12aa93a5.png" class="crayons-avatar__image" alt="" width="606" height="626"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Amit Bhatt&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/baremetal-dev/i-turned-my-m1-macbook-into-an-offline-ai-coding-agent-0-api-cost-zero-cloud-14hb" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 9&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/baremetal-dev/i-turned-my-m1-macbook-into-an-offline-ai-coding-agent-0-api-cost-zero-cloud-14hb" id="article-link-3475178"&gt;
          I Turned My M1 MacBook Into an Offline AI Coding Agent - $0 API Cost, Zero Cloud
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/llm"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;llm&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/privacy"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;privacy&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devex"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devex&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/baremetal-dev/i-turned-my-m1-macbook-into-an-offline-ai-coding-agent-0-api-cost-zero-cloud-14hb" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/fire-f60e7a582391810302117f987b22a8ef04a2fe0df7e3258a5f49332df1cec71e.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;2&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/baremetal-dev/i-turned-my-m1-macbook-into-an-offline-ai-coding-agent-0-api-cost-zero-cloud-14hb#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            9 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I Turned My M1 MacBook Into an Offline AI Coding Agent - $0 API Cost, Zero Cloud</title>
      <dc:creator>Amit Bhatt</dc:creator>
      <pubDate>Thu, 09 Apr 2026 08:15:55 +0000</pubDate>
      <link>https://dev.to/baremetal-dev/i-turned-my-m1-macbook-into-an-offline-ai-coding-agent-0-api-cost-zero-cloud-14hb</link>
      <guid>https://dev.to/baremetal-dev/i-turned-my-m1-macbook-into-an-offline-ai-coding-agent-0-api-cost-zero-cloud-14hb</guid>
      <description>&lt;p&gt;The cloud is great — until you hit a rate limit mid-refactor. Or you're on a flight. Or you're working on code that should never leave your machine.&lt;/p&gt;

&lt;p&gt;I spent three weeks obsessing over one question: &lt;strong&gt;how close can you actually get to a GPT-4-level agentic coding experience running 100% locally, with zero internet?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer surprised me. My M1 MacBook Pro — no discrete GPU, no cloud subscription, no API key — now runs a 26-billion parameter model that reads my codebase, writes code, applies diffs, and proposes Git changes. Autonomously. Offline.&lt;/p&gt;

&lt;p&gt;This post is the exact, reproducible blueprint. Every command is copy-pasteable. Every decision is explained.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — I compiled &lt;code&gt;llama.cpp&lt;/code&gt; with Metal GPU acceleration on an M1 Mac, loaded Google's Gemma-4 26B via Unsloth's quantization, and wired it to OpenCode for a fully agentic, offline coding workflow. Total API cost: &lt;strong&gt;$0&lt;/strong&gt;. Data sent to the cloud: &lt;strong&gt;0 bytes&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why This Matters Right Now
&lt;/h2&gt;

&lt;p&gt;Most conversations about "local AI" treat it as a hobbyist curiosity — small models, toy tasks, nothing you'd trust on real work. That was true 18 months ago. It isn't anymore.&lt;/p&gt;

&lt;p&gt;Three things converged to make this actually viable:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What Changed&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apple's Unified Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPU and CPU share the same RAM pool. A 32GB M1 can feed a 26B parameter model to the GPU like a dedicated VRAM machine.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;llama.cpp&lt;/code&gt; + Metal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CPU/GPU inference optimized specifically for Apple Silicon. Not a port — built for it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unsloth quantizations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Aggressive, quality-preserving quantization that fits Gemma-4 26B into ~16GB without meaningful quality loss.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Put those three together and a standard developer laptop becomes a credible inference machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hardware and the Brain
&lt;/h2&gt;

&lt;p&gt;I'm running this on an &lt;strong&gt;M1 MacBook Pro with 32GB of unified memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For the model, I chose &lt;a href="https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF" rel="noopener noreferrer"&gt;unsloth/gemma-4-26B-A4B-it-GGUF&lt;/a&gt; after reviewing the &lt;a href="https://unsloth.ai/docs/models/gemma-4#hardware-requirements" rel="noopener noreferrer"&gt;hardware requirements for Gemma-4&lt;/a&gt;. Here's why each component of that model name matters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unsloth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The leading framework for efficient LLM quantization, with recent bugfixes not yet in the &lt;a href="https://huggingface.co/ggml-org/gemma-4-26B-A4B-it-GGUF" rel="noopener noreferrer"&gt;ggml-org&lt;/a&gt; or &lt;a href="https://huggingface.co/google/gemma-4-26B-A4B-it" rel="noopener noreferrer"&gt;Google&lt;/a&gt; releases.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemma-4 26B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A massive, highly capable architecture from Google DeepMind.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Instruction-Tuned (it)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Crucial for agentic workflows — the model follows complex commands, not just predicts text.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GGUF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The optimized file format required for local CPU/Metal execution via &lt;code&gt;llama.cpp&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At a Q4 quantization, the 26B model requires roughly 15–16GB of memory. On 32GB unified memory, that leaves more than enough overhead for macOS, your IDE, and OpenCode running simultaneously.&lt;/p&gt;
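&lt;p&gt;The back-of-envelope math, assuming the Q4_K_XL mix averages roughly 4.9 bits per weight (an assumption; effective bit-rates vary by layer):&lt;/p&gt;

```python
params = 26e9            # 26B parameters
bits_per_weight = 4.9    # rough effective average for a Q4_K_XL mix (assumption)
model_gb = params * bits_per_weight / 8 / 1e9
print(f"{model_gb:.1f} GB")   # prints "15.9 GB", matching the file size on disk
```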




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Everything below is available through Homebrew or pip. No manual compilation required except &lt;code&gt;llama.cpp&lt;/code&gt; itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Xcode Command Line Tools (required for cmake, git, and Metal framework headers)&lt;/span&gt;
xcode-select &lt;span class="nt"&gt;--install&lt;/span&gt;

&lt;span class="c"&gt;# Core build dependencies&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;cmake libomp

&lt;span class="c"&gt;# Hugging Face CLI for model downloads&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;huggingface_hub hf_transfer

&lt;span class="c"&gt;# Parallel download engine (optional but strongly recommended for large models)&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;aria2

&lt;span class="c"&gt;# OpenCode — the agentic coding orchestrator&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;anomalyco/tap/opencode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these in place, every step below works on a clean macOS install.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Compile &lt;code&gt;llama.cpp&lt;/code&gt; from Scratch with Metal
&lt;/h2&gt;

&lt;p&gt;You &lt;em&gt;could&lt;/em&gt; download a pre-built binary. But if you want every drop of performance from the M1's Metal GPU framework, build from source.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ggml-org/llama.cpp.git
&lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_METAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;sysctl &lt;span class="nt"&gt;-n&lt;/span&gt; hw.ncpu&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key flag is &lt;code&gt;-DGGML_METAL=ON&lt;/code&gt; — this compiles with Apple's Metal GPU framework. The &lt;code&gt;-j$(sysctl -n hw.ncpu)&lt;/code&gt; parallelizes the build across all CPU cores.&lt;/p&gt;

&lt;p&gt;When you compile directly on the M1, inference speeds jump dramatically. You aren't just running code — you're running code hyper-optimized for your specific silicon.&lt;/p&gt;

&lt;p&gt;After the build, create symlinks to keep commands clean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; ./llama.cpp/build/bin/llama-cli llama-cli
&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; ./llama.cpp/build/bin/llama-server llama-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;llama-cli&lt;/code&gt; handles interactive terminal prompts. &lt;code&gt;llama-server&lt;/code&gt; is the HTTP inference server that exposes an OpenAI-compatible API — the piece that connects to OpenCode. Both are built from the same source tree.&lt;/p&gt;
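&lt;p&gt;Because the server speaks the OpenAI chat-completions wire format, any HTTP client works and no SDK is required. A stdlib-only sketch (port 8080 is &lt;code&gt;llama-server&lt;/code&gt;'s default; adjust if you pass &lt;code&gt;--port&lt;/code&gt;):&lt;/p&gt;

```python
import json
from urllib import request

def build_chat_request(prompt, temperature=0.2):
    """OpenAI-style chat-completions payload; llama-server serves whatever model it loaded."""
    return {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt, base_url="http://localhost:8080/v1"):
    """Send one chat turn to a local llama-server instance and return the reply text."""
    req = request.Request(
        base_url + "/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```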

&lt;h3&gt;
  
  
  Validate the Build Before the Big Download
&lt;/h3&gt;

&lt;p&gt;Before downloading the massive 18GB Gemma-4 model, validate the entire pipeline with a smaller model: &lt;a href="https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Nano-4B-GGUF" rel="noopener noreferrer"&gt;NVIDIA Nemotron-3-Nano-4B&lt;/a&gt; at Q8 quantization, just 3.9GB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This step is not optional.&lt;/strong&gt; You don't want to wait hours for an 18GB download only to discover your build is broken.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jgozahxs4j3fx081usl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jgozahxs4j3fx081usl.png" alt="Downloading NVIDIA Nemotron-3-Nano-4B as a pipeline validation step" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Booting the server with the smaller model confirms what you need to see: Metal framework fully initialized, unified memory detected, all GPU families registered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprud2bg5sr0wgyfv436b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprud2bg5sr0wgyfv436b.png" alt="Metal GPU framework initialization on the M1 — unified memory confirmed, bfloat support active, and recommendedMaxWorkingSetSize showing access to the full 32GB memory pool" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The critical line in that output: &lt;code&gt;has unified memory = true&lt;/code&gt;. The &lt;code&gt;recommendedMaxWorkingSetSize&lt;/code&gt; of roughly 26,800 MB tells you exactly how much VRAM the Metal backend can access — and on the M1, it draws directly from system RAM.&lt;/p&gt;

&lt;p&gt;Pipeline is solid. Now bring in the real model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Download Gemma-4 26B Weights
&lt;/h2&gt;

&lt;p&gt;Because we're building an offline environment, the model file needs to be local.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hf download unsloth/gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; unsloth/gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"*mmproj-BF16*"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"*UD-Q4_K_XL*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--include&lt;/code&gt; filters are important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;*mmproj-BF16*&lt;/code&gt; — the multimodal vision projector, giving the model the ability to understand images alongside code&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;*UD-Q4_K_XL*&lt;/code&gt; — the sweet spot quantization for quality vs. memory on 32GB (~15.9GB on disk)&lt;/li&gt;
&lt;/ul&gt;
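
&lt;p&gt;Once the download finishes, it's worth confirming both patterns actually matched files before moving on. A minimal sketch, assuming &lt;code&gt;MODEL_DIR&lt;/code&gt; is the &lt;code&gt;--local-dir&lt;/code&gt; from the command above:&lt;/p&gt;

```shell
# Sanity-check that both --include patterns matched files on disk.
# MODEL_DIR assumes the --local-dir from the download command above.
MODEL_DIR="${MODEL_DIR:-unsloth/gemma-4-26B-A4B-it-GGUF}"

check_pattern() {
  # Succeeds if at least one file in MODEL_DIR matches the glob fragment.
  ls "$MODEL_DIR"/*"$1"* >/dev/null 2>/dev/null
}

for pattern in mmproj-BF16 UD-Q4_K_XL; do
  if check_pattern "$pattern"; then
    echo "ok: $pattern present"
  else
    echo "MISSING: $pattern -- re-run the download"
  fi
done
```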

&lt;h3&gt;
  
  
  Fair Warning: 18.3GB Downloads Are Fragile
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1xzbbbyqfykn980ib46x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1xzbbbyqfykn980ib46x.png" alt="An 18.3GB download failing mid-transfer via the default hf CLI — hours of progress, gone" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My download crawled to 519KB/s before failing entirely. The default &lt;code&gt;hf download&lt;/code&gt; CLI supports resuming in theory, but in practice it's fragile on large files over unstable connections.&lt;/p&gt;

&lt;p&gt;Switch to &lt;a href="https://gist.github.com/yeahjack/31f542ee6cab3c3e2c30594b7693cb22#file-hfd-sh" rel="noopener noreferrer"&gt;&lt;code&gt;hfd.sh&lt;/code&gt;&lt;/a&gt; with &lt;code&gt;aria2c&lt;/code&gt; as the engine. Unlike the default CLI, &lt;code&gt;aria2c&lt;/code&gt; tracks per-segment progress in &lt;code&gt;.aria2&lt;/code&gt; control files — a dropped connection picks up exactly where it left off instead of restarting the entire file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./hfd.sh unsloth/gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; unsloth/gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"*mmproj-BF16*"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"*UD-Q4_K_XL*"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tool&lt;/span&gt; aria2c &lt;span class="nt"&gt;-x&lt;/span&gt; 16 &lt;span class="nt"&gt;-n&lt;/span&gt; 8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;-x 16&lt;/code&gt; opens 16 connections per server. &lt;code&gt;-n 8&lt;/code&gt; splits each file into 8 parallel segments. On a decent connection, this is dramatically faster and more resilient than the default downloader.&lt;/p&gt;
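
&lt;p&gt;You can see the resume mechanism for yourself: any file still mid-transfer has a &lt;code&gt;.aria2&lt;/code&gt; control file sitting next to it, which &lt;code&gt;aria2c&lt;/code&gt; removes on completion. A quick check before re-running the download (the directory path assumes the &lt;code&gt;--local-dir&lt;/code&gt; from above):&lt;/p&gt;

```shell
# List partially-downloaded files: aria2c leaves a .aria2 control file
# next to each incomplete download and deletes it when the file completes.
# DOWNLOAD_DIR assumes the --local-dir used with hfd.sh above.
DOWNLOAD_DIR="${DOWNLOAD_DIR:-unsloth/gemma-4-26B-A4B-it-GGUF}"

incomplete=$(find "$DOWNLOAD_DIR" -name "*.aria2" 2>/dev/null)
if [ -n "$incomplete" ]; then
  echo "resumable partial downloads:"
  echo "$incomplete"
else
  echo "no .aria2 control files -- nothing to resume"
fi
```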




&lt;h2&gt;
  
  
  Step 3: Wire the Brain to OpenCode
&lt;/h2&gt;

&lt;p&gt;Having a powerful local LLM is interesting. Having it autonomously write, edit, and debug your code is a different thing entirely. That's where &lt;a href="https://opencode.ai/docs/" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt; comes in.&lt;/p&gt;

&lt;p&gt;OpenCode bridges the local LLM and your codebase. The key insight: &lt;code&gt;llama-server&lt;/code&gt; exposes an OpenAI-compatible &lt;code&gt;/v1/chat/completions&lt;/code&gt; endpoint out of the box. OpenCode's &lt;code&gt;@ai-sdk/openai-compatible&lt;/code&gt; adapter speaks that protocol natively. No custom prompt templates, no manual token wrangling — the chat template baked into the GGUF handles everything at the server level.&lt;/p&gt;
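
&lt;p&gt;Before wiring up OpenCode, you can hit that endpoint directly. A minimal smoke test; the &lt;code&gt;model&lt;/code&gt; field must match the &lt;code&gt;--alias&lt;/code&gt; you pass to &lt;code&gt;llama-server&lt;/code&gt; in Step 4, and the prompt is just a placeholder:&lt;/p&gt;

```shell
# Build and validate a chat-completions payload, then (optionally) send it.
# The model name must match the --alias given to llama-server.
PAYLOAD='{
  "model": "gemma-4-26B",
  "messages": [{"role": "user", "content": "Say hello in five words."}],
  "max_tokens": 32
}'

# Validate locally before sending anything over the wire.
if echo "$PAYLOAD" | python3 -m json.tool >/dev/null 2>/dev/null; then
  echo "payload is valid JSON"
fi

# Uncomment once llama-server is listening on port 8001:
# curl -s http://127.0.0.1:8001/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```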

&lt;p&gt;Create &lt;code&gt;opencode.json&lt;/code&gt; at your project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"$schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://opencode.ai/config.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"llama.cpp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"npm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@ai-sdk/openai-compatible"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama-server (local)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"baseURL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://127.0.0.1:8001"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"gemma-4:26b-a4b-it"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Gemma-4-26B-A4B-it (local)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;65536&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"nvidia-nemotron-3-nano:4b"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NVIDIA-Nemotron-3-Nano-4B (local)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;65536&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth noting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;baseURL&lt;/code&gt; points to &lt;code&gt;127.0.0.1:8001&lt;/code&gt; where &lt;code&gt;llama-server&lt;/code&gt; will listen&lt;/li&gt;
&lt;li&gt;Context is set to 32K tokens — the model supports up to 262K, but 32K is a practical ceiling for stable agentic sessions on 32GB RAM&lt;/li&gt;
&lt;li&gt;The second model (Nemotron-3 Nano 4B) is configured as a lightweight alternative for fast, low-overhead tasks&lt;/li&gt;
&lt;/ul&gt;
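
&lt;p&gt;The 32K figure comes from simple budgeting against the Metal working set from Step 1. A back-of-envelope sketch; the KV-cache cost below is a placeholder assumption, not a measured number, so substitute your own from the server logs:&lt;/p&gt;

```shell
# Rough memory budget for the 32K-context setting.
# WEIGHTS_GB and METAL_BUDGET_GB come from earlier in the article;
# KV_GB_AT_32K is a placeholder estimate -- measure your own.
WEIGHTS_GB=15.9        # Q4_K_XL weights on disk
METAL_BUDGET_GB=26.8   # recommendedMaxWorkingSetSize from Step 1
KV_GB_AT_32K=4.0       # assumed KV-cache cost at 32K tokens

headroom=$(awk -v w="$WEIGHTS_GB" -v b="$METAL_BUDGET_GB" -v k="$KV_GB_AT_32K" \
  'BEGIN { printf "%.1f", b - w - k }')
echo "estimated headroom at 32K context: ${headroom} GB"
```

&lt;p&gt;If the headroom goes negative at a larger context, macOS starts paging and throughput collapses, which is why 32K is the conservative default here.&lt;/p&gt;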

&lt;p&gt;Verify the model is running correctly by checking the llama.cpp web interface at &lt;code&gt;http://127.0.0.1:8001&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F318b3k7e51wl9jjrfnam.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F318b3k7e51wl9jjrfnam.png" alt="Gemma-4 26B loaded and serving — 15.9GB model, 25.23B parameters, 262K context window, Vision modality confirmed" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;25.23 billion parameters. A 262,144-token context window. Vision capability. Running from a file on local disk, served over localhost. No cloud, no API key, no rate limit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Full Offline Agentic Coding
&lt;/h2&gt;

&lt;p&gt;With the model downloaded, &lt;code&gt;llama.cpp&lt;/code&gt; compiled, and &lt;code&gt;opencode.json&lt;/code&gt; locked in, I turned off Wi-Fi.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero internet. Zero API calls. Zero data leaving the machine.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open two terminals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 1: Start the llama.cpp inference server with Gemma-4&lt;/span&gt;
./llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--temp&lt;/span&gt; 0.6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--top-p&lt;/span&gt; 0.95 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--alias&lt;/span&gt; &lt;span class="s2"&gt;"gemma-4-26B"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8001 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="nt"&gt;-t&lt;/span&gt; 8 &lt;span class="nt"&gt;-b&lt;/span&gt; 512 &lt;span class="nt"&gt;--mmap&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Still validating with Nemotron before the full download? Same flags work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Alternative: lighter Nemotron model for testing&lt;/span&gt;
./llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; unsloth/NVIDIA-Nemotron-3-Nano-4B-GGUF/NVIDIA-Nemotron-3-Nano-4B-Q8_0.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--temp&lt;/span&gt; 0.6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--top-p&lt;/span&gt; 0.95 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--alias&lt;/span&gt; &lt;span class="s2"&gt;"nvidia/nemotron-3-nano-4B-GGUF"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8001 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning&lt;/span&gt; on &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="nt"&gt;-t&lt;/span&gt; 8 &lt;span class="nt"&gt;-b&lt;/span&gt; 512 &lt;span class="nt"&gt;--mmap&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;code&gt;--reasoning on&lt;/code&gt; flag on Nemotron — it activates a built-in chain-of-thought mode that improves output quality on complex tasks. Useful for validating multi-step reasoning before scaling up to Gemma-4.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 2: Launch OpenCode&lt;/span&gt;
opencode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
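
&lt;p&gt;One practical refinement: &lt;code&gt;llama-server&lt;/code&gt; exposes a &lt;code&gt;/health&lt;/code&gt; endpoint that returns 200 once the model has finished loading, so Terminal 2 can wait for readiness instead of launching OpenCode blind. A small sketch; the retry count and interval are arbitrary choices:&lt;/p&gt;

```shell
# Poll llama-server's /health endpoint until the model is loaded.
# Usage: wait_for_server URL [TRIES] [INTERVAL_SECONDS]
wait_for_server() {
  url="$1"
  tries="${2:-60}"
  interval="${3:-2}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    status=$(curl -s -o /dev/null -w "%{http_code}" "$url" 2>/dev/null)
    if [ "$status" = "200" ]; then
      echo "server ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "server not healthy after $tries attempts"
  return 1
}

# wait_for_server http://127.0.0.1:8001/health
```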



&lt;p&gt;Here's what each server flag does:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flag&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-ngl 99&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Offloads all model layers to the Metal GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-t 8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sets 8 CPU threads for operations that fall back to CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-b 512&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Controls batch size for prompt processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--mmap&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Memory-maps the model file — macOS manages paging without loading all 15.9GB upfront&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--temp 0.6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Slightly below default for more deterministic code generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--top-p 0.95&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Nucleus sampling — keeps output focused while allowing creativity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zz3dxqbho9usk3m8gdi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zz3dxqbho9usk3m8gdi.png" alt="OpenCode analyzing a project's architecture, generating documentation, and proposing Git changes across 5 files — powered locally by gemma-4-26B in 45 seconds" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result: Gemma-4 26B analyzed my codebase, understood its architecture from local files alone, and began writing, diffing, and applying code. In the screenshot above, it analyzed &lt;code&gt;architect.py&lt;/code&gt;, broke down the Pydantic data models, explained the &lt;code&gt;run_architect&lt;/code&gt; function flow, and proposed 5 Git changes across the project.&lt;/p&gt;

&lt;p&gt;The M1 pushed out tokens fast enough for real-time development. The footer confirms it: &lt;code&gt;gemma-4-26B&lt;/code&gt;, 45 seconds for a full architectural analysis and code generation pass.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Actually Means
&lt;/h2&gt;

&lt;p&gt;We are crossing a threshold.&lt;/p&gt;

&lt;p&gt;For two years, the industry assumed truly capable AI agents require data centers. This experiment proves otherwise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For engineering teams working on sensitive codebases&lt;/strong&gt; — defense, healthcare, fintech — this means AI coding assistants without a single byte crossing a network boundary. No SOC 2 reviews for another SaaS vendor. No data processing agreements. No trust boundaries to negotiate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For engineering leaders&lt;/strong&gt;, the math is compelling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero marginal API cost per developer&lt;/li&gt;
&lt;li&gt;Zero vendor lock-in&lt;/li&gt;
&lt;li&gt;Works identically on an airplane, in a SCIF, or behind an air-gapped network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For individual developers&lt;/strong&gt;, the practical reality: you can now run a frontier-class coding agent on hardware you already own, using models that are openly licensed, with no subscription, no quota, no latency spikes on someone else's overloaded GPU cluster.&lt;/p&gt;

&lt;p&gt;Absolute privacy and top-tier AI capability are no longer mutually exclusive.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'm Exploring Next
&lt;/h2&gt;

&lt;p&gt;This setup is a foundation, not a ceiling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Larger context windows.&lt;/strong&gt; The 32K context in &lt;code&gt;opencode.json&lt;/code&gt; is conservative. With careful memory management and &lt;code&gt;llama.cpp&lt;/code&gt;'s Flash Attention support, 128K+ is feasible on 32GB for longer agentic sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-model routing.&lt;/strong&gt; Running Nemotron for fast, lightweight tasks and Gemma-4 for heavy reasoning — switching between models based on task complexity, all locally. Think of it as a cheap/smart tier system without the cloud bill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning on proprietary code.&lt;/strong&gt; Unsloth supports LoRA and QLoRA fine-tuning. Training a domain-specific adapter on your team's codebase and merging it into the GGUF gives you a model that &lt;em&gt;thinks&lt;/em&gt; in your architecture and naming conventions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Team-wide access.&lt;/strong&gt; Embed &lt;code&gt;llama-server&lt;/code&gt; in a container behind your internal network so the entire team gets local AI without each developer maintaining their own build.&lt;/p&gt;
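
&lt;p&gt;As a sketch of that last idea: the &lt;code&gt;llama.cpp&lt;/code&gt; project publishes a server container image, so a shared instance can be as simple as one &lt;code&gt;docker run&lt;/code&gt;. Treat the image tag and paths below as assumptions to verify against the current &lt;code&gt;llama.cpp&lt;/code&gt; docs, and note that Metal acceleration is not available inside a container, so this fits a Linux host running CUDA or CPU inference:&lt;/p&gt;

```shell
# Illustrative only: serve the model to the whole team from one shared box.
# Image tag and mount paths are assumptions -- check the llama.cpp docs.
docker run -p 8001:8001 \
  -v "$PWD/unsloth/gemma-4-26B-A4B-it-GGUF:/models" \
  ghcr.io/ggml-org/llama.cpp:server \
  --model /models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8001
```

&lt;p&gt;Each developer's &lt;code&gt;opencode.json&lt;/code&gt; would then point its &lt;code&gt;baseURL&lt;/code&gt; at the shared host instead of &lt;code&gt;127.0.0.1&lt;/code&gt;.&lt;/p&gt;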




&lt;h2&gt;
  
  
  The Full Stack, in One Place
&lt;/h2&gt;

&lt;p&gt;For anyone who wants to reproduce this exactly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;What It Is&lt;/th&gt;
&lt;th&gt;Link&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama.cpp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Metal-accelerated inference engine&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/ggml-org/llama.cpp" rel="noopener noreferrer"&gt;github.com/ggml-org/llama.cpp&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma-4 26B (Unsloth)&lt;/td&gt;
&lt;td&gt;The model, Q4_K_XL quantization&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF" rel="noopener noreferrer"&gt;huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenCode&lt;/td&gt;
&lt;td&gt;Agentic coding orchestrator&lt;/td&gt;
&lt;td&gt;&lt;a href="https://opencode.ai/docs/" rel="noopener noreferrer"&gt;opencode.ai/docs&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hfd.sh&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reliable large-file downloader&lt;/td&gt;
&lt;td&gt;&lt;a href="https://gist.github.com/yeahjack/31f542ee6cab3c3e2c30594b7693cb22#file-hfd-sh" rel="noopener noreferrer"&gt;gist.github.com/yeahjack&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tools are here. The models are capable enough. The only question is what you build with them.&lt;/p&gt;




&lt;p&gt;If you found this useful, the full write-up with additional context lives on &lt;a href="https://sectumpsempra.github.io" rel="noopener noreferrer"&gt;my site&lt;/a&gt;. Questions, improvements, or your own local AI stack? Drop them in the comments — I'd genuinely like to hear what you're running.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by &lt;a href="https://linkedin.com/in/amit-bhatt" rel="noopener noreferrer"&gt;Amit Bhatt&lt;/a&gt; — &lt;a href="https://github.com/sectumpsempra" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>privacy</category>
      <category>devex</category>
    </item>
  </channel>
</rss>
