<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Amit Bhatt</title>
    <description>The latest articles on DEV Community by Amit Bhatt (@baremetal-dev).</description>
    <link>https://dev.to/baremetal-dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869131%2F9229145b-c825-440c-873e-f83c12aa93a5.png</url>
      <title>DEV Community: Amit Bhatt</title>
      <link>https://dev.to/baremetal-dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/baremetal-dev"/>
    <language>en</language>
    <item>
      <title>We Let AI Write Our Terraform. Then We Gave It a Security Conscience</title>
      <dc:creator>Amit Bhatt</dc:creator>
      <pubDate>Thu, 09 Apr 2026 12:07:11 +0000</pubDate>
      <link>https://dev.to/baremetal-dev/we-let-ai-write-our-terraform-then-we-gave-it-a-security-conscience-480e</link>
      <guid>https://dev.to/baremetal-dev/we-let-ai-write-our-terraform-then-we-gave-it-a-security-conscience-480e</guid>
      <description>&lt;p&gt;Designing cloud infrastructure usually takes three meetings.&lt;/p&gt;

&lt;p&gt;One with the architect to decide which services to use. One with the DevOps engineer to actually write the Terraform. One with the security team to explain, again, why &lt;code&gt;0.0.0.0/0&lt;/code&gt; is not an acceptable production CIDR.&lt;/p&gt;

&lt;p&gt;By the time all three conversations happen, the architecture diagram is already out of date.&lt;/p&gt;

&lt;p&gt;So we asked a different question: what if those three roles, plus one to keep the diagram current, ran as AI agents in a single automated pipeline?&lt;/p&gt;

&lt;p&gt;You type your requirements in plain English. You get back deployable Terraform HCL, a security audit with specific remediation guidance, and a rendered architecture diagram. In one shot, without the meetings.&lt;/p&gt;

&lt;p&gt;That's InfraSquad. This post is about what we learned building it, what broke badly, and what we would tell ourselves at the start.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; InfraSquad is a multi-agent system built on LangGraph. Four agents collaborate in a cyclic state machine. Security findings loop back to the DevOps agent for fixes, capped at three cycles. Without that cap, the loop runs forever. We learned this during testing. The code is open source at &lt;a href="https://github.com/Andela-AI-Engineering-Bootcamp/infrasquad" rel="noopener noreferrer"&gt;Andela-AI-Engineering-Bootcamp/infrasquad&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Meet the Squad
&lt;/h2&gt;

&lt;p&gt;Four agents. One shared pipeline. Here is what each one actually does:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Product Architect&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reads your requirements, considers scale, compliance, cost&lt;/td&gt;
&lt;td&gt;A numbered AWS architecture plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DevOps Engineer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Translates the plan into code; fixes security findings when sent back&lt;/td&gt;
&lt;td&gt;Valid Terraform HCL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security Auditor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs tfsec or checkov via MCP; classifies every finding by severity&lt;/td&gt;
&lt;td&gt;A structured JSON security report&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Visualizer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reads the final plan and code after security passes&lt;/td&gt;
&lt;td&gt;A Mermaid architecture diagram rendered to PNG&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The critical word in that table is "sent back." The Security Auditor does not just generate a report and hand it off. It can send the DevOps Engineer back to fix its own code. That feedback loop is the most interesting design decision in the system. It is also how we nearly created an infinite loop on the second day of integration testing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fykwzu6t2ivtn3mkd4vyt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fykwzu6t2ivtn3mkd4vyt.png" alt="InfraSquad-four AI agents collaborating on cloud infrastructure design" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pipeline (and the Two Places It Can Loop)
&lt;/h2&gt;

&lt;p&gt;Here is the full state machine:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjvpy1l6ujzmjdk9h2e9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjvpy1l6ujzmjdk9h2e9.png" alt="InfraSquad pipeline diagram showing the LangGraph state machine-validate_input, architect, devops, validate_output, security, visualizer, with two loop-back arrows for HCL errors and security findings" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The happy path is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;validate_input&lt;/strong&gt; runs three checks before anything expensive happens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;architect&lt;/strong&gt; produces a numbered AWS architecture plan&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;devops&lt;/strong&gt; writes Terraform HCL from that plan&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;validate_output&lt;/strong&gt; checks the HCL deterministically for forbidden patterns and structural validity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;security&lt;/strong&gt; scans with tfsec or checkov via MCP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;visualizer&lt;/strong&gt; renders the architecture as a Mermaid diagram&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Two of those six nodes can send the pipeline backwards. That is intentional. It is also dangerous if you do not cap the cycle count, which we did not do initially.&lt;/p&gt;

&lt;p&gt;All six nodes share a single typed state object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;user_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;architecture_plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;terraform_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;security_report&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;security_passed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;remediation_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;hcl_remediation_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;hcl_validation_errors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;current_phase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;total=False&lt;/code&gt; matters. Without it, every agent would need to set every field, even fields it knows nothing about. With it, agents only write what they own. Silent downstream failures from unexpected &lt;code&gt;None&lt;/code&gt; values were the most frustrating class of bug we hit early on.&lt;/p&gt;
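&lt;p&gt;A minimal, dependency-free sketch of the behavior this buys (the &lt;code&gt;update&lt;/code&gt; calls are an illustrative stand-in for the framework's state merging):&lt;/p&gt;

```python
from typing import TypedDict

class AgentState(TypedDict, total=False):
    user_request: str
    terraform_code: str
    security_passed: bool
    remediation_count: int

# Each node returns only the keys it owns; the framework merges the
# partial updates into the shared state between steps.
state: AgentState = {}
state.update({"user_request": "VPC with RDS"})         # architect's slice
state.update({"terraform_code": 'provider "aws" {}'})  # devops's slice

# Unset keys are absent rather than None, so reads use .get() with an
# explicit default instead of silently propagating None downstream.
count = state.get("remediation_count", 0)
print(count)                       # 0
print("security_passed" in state)  # False
```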




&lt;h2&gt;
  
  
  We Almost Created an Infinite Loop on Day Two
&lt;/h2&gt;

&lt;p&gt;During integration testing, we ran a request for an internet-facing Application Load Balancer.&lt;/p&gt;

&lt;p&gt;The Security Auditor flagged it: &lt;code&gt;AVD-AWS-0107 (HIGH): security group allows unrestricted ingress from 0.0.0.0/0&lt;/code&gt;. The DevOps agent tried to fix it. The Security Auditor re-scanned. Same finding. The DevOps agent tried again. Same finding.&lt;/p&gt;

&lt;p&gt;The problem: a public ALB is supposed to have unrestricted public ingress. That is what "internet-facing" means. The security finding was technically correct and permanently unfixable given the design intent. The LLM had no way to distinguish "security issue to remediate" from "accepted design constraint."&lt;/p&gt;

&lt;p&gt;Without an exit condition, this loop runs forever.&lt;/p&gt;

&lt;p&gt;Here is what the routing logic looks like with the cap in place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_after_security&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;visualizer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;devops&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security_passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;visualizer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remediation_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_remediation_cycles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;visualizer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# move on regardless
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;devops&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After three cycles, the pipeline proceeds with whatever state it has. Unresolved findings appear as advisory warnings in the Security tab, not hard failures. The same cap exists on the HCL validation loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add your cycle caps before your first integration test.&lt;/strong&gt; Not after. You will hit this case.&lt;/p&gt;
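&lt;p&gt;The failure mode is easy to reproduce without an LLM in the loop. Here is a dependency-free simulation (the always-failing scan and the inline cap value are illustrative) showing the router terminating even when a finding never clears:&lt;/p&gt;

```python
MAX_REMEDIATION_CYCLES = 3  # stand-in for settings.max_remediation_cycles

def route_after_security(state: dict) -> str:
    if state.get("security_passed", False):
        return "visualizer"
    if state.get("remediation_count", 0) >= MAX_REMEDIATION_CYCLES:
        return "visualizer"  # move on regardless
    return "devops"

def run_pipeline() -> int:
    # Simulate a finding that can never be fixed, like a public ALB's
    # intentionally unrestricted ingress: every re-scan fails.
    state = {"security_passed": False, "remediation_count": 0}
    attempts = 0
    while route_after_security(state) == "devops":
        state["remediation_count"] += 1  # devops retries, scan fails again
        attempts += 1
    return attempts

print(run_pipeline())  # 3: without the cap, this loop never exits
```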




&lt;h2&gt;
  
  
  The One Bad Habit We Couldn't Engineer Away
&lt;/h2&gt;

&lt;p&gt;Every model we tested had the same behavior: for any internet-facing resource, it generated &lt;code&gt;0.0.0.0/0&lt;/code&gt; as the security group ingress CIDR. Even with explicit instructions in the system prompt. Even with examples. Even with counter-examples.&lt;/p&gt;

&lt;p&gt;We tried prompt engineering for weeks. The model would acknowledge the constraint, then generate &lt;code&gt;0.0.0.0/0&lt;/code&gt; anyway on the next call.&lt;/p&gt;

&lt;p&gt;So we stopped fighting it and added a deterministic sanitizer that runs on the DevOps agent's output before validation even starts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_CIDR_SANITISATIONS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pattern&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;0\.0\.0\.0/0&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;10.0.0.0/8&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;::/0&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;         &lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;fc00::/7&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_sanitize_hcl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hcl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;replacement&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_CIDR_SANITISATIONS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;hcl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;replacement&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hcl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hcl&lt;/span&gt;

&lt;span class="n"&gt;clean_hcl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_sanitize_hcl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;terraform_hcl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;10.0.0.0/8&lt;/code&gt; is a broad internal placeholder. Operators narrow it before deploying to production.&lt;/p&gt;

&lt;p&gt;This single function eliminated the most common trigger of the HCL validation loop. First-pass generations stopped tripping the guardrail on CIDR issues almost entirely, so when the guardrail fires now, it catches a genuine structural problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When a model reliably produces the same wrong output, fix it deterministically. Do not prompt your way out of a consistency problem.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Questions Before the LLM Sees Anything
&lt;/h2&gt;

&lt;p&gt;The most expensive mistake in an agentic pipeline is burning tokens on requests that should never reach the agents. InfraSquad catches these at three successive layers, only the last of which costs an LLM call.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxqfvjwsxvwo1l6guttb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxqfvjwsxvwo1l6guttb.png" alt="Three layers of input validation-chitchat detection, keyword matching, and LLM classification as a narrowing funnel" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Chitchat detection (zero cost)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A frozenset of 40+ conversational tokens lets the validator return immediately. "Thanks", "ok cool", "sounds good", a thumbs-up emoji: none of these should reach the architect.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_CHITCHAT_TOKENS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;okay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thanks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thank you&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;great&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;awesome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;got it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yep&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_is_chitchat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;!.,?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;!.,?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_CHITCHAT_TOKENS&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No LLM call. No latency. Instant return.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Keyword matching (zero cost)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A compiled regex matches 45 AWS infrastructure keywords. Two or more matches skip the LLM check entirely. "VPC with RDS Postgres and ALB" is obviously a valid request. Spending 2 seconds and tokens to confirm this is wasteful.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;keyword_match_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# High confidence-skip the LLM round-trip (~2s saved per clear request)
&lt;/span&gt;    &lt;span class="n"&gt;is_valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;About 70% of valid requests take this fast path.&lt;/p&gt;
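&lt;p&gt;The helper behind that branch can be a single compiled regex. A minimal sketch with an abbreviated keyword list (the real list has 45 entries; counting distinct hits is one plausible design):&lt;/p&gt;

```python
import re

# Illustrative subset of the 45-keyword AWS infrastructure list.
_AWS_KEYWORDS = re.compile(
    r"\b(vpc|rds|postgres|alb|s3|lambda|ec2|redis|elasticache|dynamodb)\b",
    re.IGNORECASE,
)

def keyword_match_count(text: str) -> int:
    # Count distinct keywords so repetition doesn't fake confidence.
    return len({m.lower() for m in _AWS_KEYWORDS.findall(text)})

print(keyword_match_count("VPC with RDS Postgres and ALB"))  # 4 -> fast path
print(keyword_match_count("a server for my app"))            # 0 -> LLM check
```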

&lt;p&gt;&lt;strong&gt;Layer 3: LLM plausibility (borderline cases only)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Single-keyword matches are genuinely ambiguous. "Server" could be valid. "AWS tomato server" should not be. For these, a lightweight LLM call returns one of three outcomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;_FirstMessageClassification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proceed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clarify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;clarify&lt;/code&gt; triggers a helpful guidance message. &lt;code&gt;reject&lt;/code&gt; returns a polite explanation. Both avoid running the full pipeline on nonsense input.&lt;/p&gt;

&lt;p&gt;There is a catch. In active conversations, keyword matching stops working correctly. "Explain the Terraform code" contains the word "Terraform" but is clearly a follow-up question, not a new generation request. So in active sessions, we switch to a full intent classifier that distinguishes &lt;code&gt;new_generation&lt;/code&gt;, &lt;code&gt;follow_up&lt;/code&gt;, and &lt;code&gt;off_topic&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fgsfcnowl1zqzezhk3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fgsfcnowl1zqzezhk3u.png" alt="InfraSquad handling an off-topic query-the guardrail returns a helpful explanation without triggering the pipeline" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Security Check That Does Not Trust the LLM
&lt;/h2&gt;

&lt;p&gt;Two patterns are blocked by hardcoded regex, independent of everything else:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_ADMIN_ACCESS_PATTERN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AdministratorAccess&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;_STAR_POLICY_PATTERN&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;Action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\s*:\s*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\*&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;AdministratorAccess&lt;/code&gt; policies and wildcard IAM actions are blocked regardless of what the model thought it generated. Not by the HCL guardrail. Not by the Security Auditor. By a function that runs on every output, unconditionally.&lt;/p&gt;

&lt;p&gt;The reason for running this separately from the guardrail: the HCL guardrail checks for &lt;code&gt;AdministratorAccess&lt;/code&gt; in a string pattern that could miss an IAM policy embedded inside a heredoc JSON block. The standalone regex catches it regardless of context.&lt;/p&gt;

&lt;p&gt;Two independent checks. Neither relying on the other being correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  The HCL Guardrail: Before Security Even Runs
&lt;/h2&gt;

&lt;p&gt;Before the Terraform reaches the Security Auditor, it passes through a deterministic validator. This runs on every generation-first pass and every remediation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_FORBIDDEN_PATTERNS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AdministratorAccess&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Uses AdministratorAccess IAM policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0\.0\.0\.0/0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Contains 0.0.0.0/0 CIDR-opens resource to the internet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;public\s*=\s*true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sets public access to true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It also checks structural validity: &lt;code&gt;provider&lt;/code&gt; and &lt;code&gt;resource&lt;/code&gt; blocks must be present, resource signatures must be well-formed, and braces must balance. Any failure sends the code back to the DevOps agent with the specific error list attached to the next prompt.&lt;/p&gt;
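&lt;p&gt;A minimal sketch of those structural checks (names and exact rules are illustrative; the real validator does more):&lt;/p&gt;

```python
import re

def validate_hcl_structure(hcl: str) -> list[str]:
    """Deterministic structural checks; any error loops back to devops."""
    errors: list[str] = []
    if not re.search(r'\bprovider\s+"', hcl):
        errors.append("Missing provider block")
    if not re.search(r'\bresource\s+"[\w-]+"\s+"[\w-]+"\s*\{', hcl):
        errors.append("Missing or malformed resource block")
    if hcl.count("{") != hcl.count("}"):
        errors.append("Unbalanced braces")
    return errors

good = 'provider "aws" {}\nresource "aws_vpc" "main" { cidr_block = "10.0.0.0/16" }'
print(validate_hcl_structure(good))                  # []
print(validate_hcl_structure("resource aws_vpc {"))  # all three errors
```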

&lt;p&gt;The CIDR sanitizer runs before this check, and the ordering is intentional: removing &lt;code&gt;0.0.0.0/0&lt;/code&gt; first means the guardrail fires only on real structural problems.&lt;/p&gt;
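&lt;p&gt;The sanitizer itself can be a couple of lines. A sketch (the replacement CIDR is illustrative; the real project may substitute something else, such as a variable reference):&lt;/p&gt;

```python
import re

# Replace any open-to-the-world CIDR with a placeholder private range.
# The replacement value is illustrative, not the project's actual choice.
_OPEN_CIDR = re.compile(r'"0\.0\.0\.0/0"')

def sanitize_cidrs(code: str) -> str:
    return _OPEN_CIDR.sub('"10.0.0.0/16"', code)
```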




&lt;h2&gt;
  
  
  What the Pipeline Actually Produces
&lt;/h2&gt;

&lt;p&gt;Here is a real run. Request: "VPC with an RDS Postgres instance, an Application Load Balancer, and a Redis caching layer."&lt;/p&gt;

&lt;p&gt;The DevOps agent follows a security baseline baked into its system prompt. S3 buckets get KMS encryption, versioning, and public access blocks by default. RDS gets &lt;code&gt;storage_encrypted = true&lt;/code&gt;, &lt;code&gt;deletion_protection = true&lt;/code&gt;, and 7-day backup retention. ElastiCache gets encryption at rest and in transit. VPCs get flow logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0lfvwjff1kop2cl4i8et.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0lfvwjff1kop2cl4i8et.png" alt="Terraform HCL generated by the DevOps agent in the InfraSquad UI, showing encryption and security defaults applied" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the security check passes, the Visualizer reads the finalized plan and code and generates a Mermaid diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ui7x2718geev430xkas.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ui7x2718geev430xkas.png" alt="InfraSquad UI showing the generated Mermaid architecture diagram as source and rendered PNG" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;mmdc&lt;/code&gt; is not installed, the Mermaid source is saved as-is. It is still fully useful: paste it into any Mermaid viewer and you get the diagram.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Security Audit Loop: How It Actually Works
&lt;/h2&gt;

&lt;p&gt;When the Security Auditor finds issues, it does not just list them. It produces a structured prompt that becomes the DevOps agent's next input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MANDATORY SECURITY REMEDIATION-3 finding(s)
Fix EVERY numbered item below. Do NOT skip any.

Finding 1. [HIGH] AVD-AWS-0107 - aws_security_group.app_sg
   Issue: Security group allows unrestricted ingress on port 443
   Fix:   Restrict ingress to specific CIDR ranges or security group references.

Finding 2. [HIGH] AVD-AWS-0132 - aws_s3_bucket.assets
   Issue: S3 bucket does not use KMS encryption with a customer-managed key
   Fix:   Add aws_kms_key + aws_s3_bucket_server_side_encryption_configuration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The DevOps agent sees &lt;code&gt;MANDATORY SECURITY REMEDIATION&lt;/code&gt; in its next prompt and treats every numbered item as a required fix.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;security_passed = True&lt;/code&gt; only when there are zero CRITICAL and zero HIGH findings. MEDIUM and LOW findings get reported but do not block the pipeline. The visualization still renders.&lt;/p&gt;
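&lt;p&gt;The gate itself reduces to a few lines. A sketch (the &lt;code&gt;severity&lt;/code&gt; field name is an assumption about the finding schema):&lt;/p&gt;

```python
def security_passed(findings: list[dict]) -> bool:
    """Block the pipeline only on CRITICAL or HIGH findings.

    Each finding is assumed to carry a "severity" key; MEDIUM and LOW
    are reported but do not gate the run.
    """
    blocking = {"CRITICAL", "HIGH"}
    return all(f["severity"] not in blocking for f in findings)
```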




&lt;h2&gt;
  
  
  Why LangGraph Over CrewAI or AutoGen
&lt;/h2&gt;

&lt;p&gt;This came down to one question: does the framework support cycles with explicit state management and typed contracts?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Cyclic workflows&lt;/th&gt;
&lt;th&gt;Typed shared state&lt;/th&gt;
&lt;th&gt;Explicit retry caps&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LangGraph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native conditional edges&lt;/td&gt;
&lt;td&gt;TypedDict, full control&lt;/td&gt;
&lt;td&gt;Direct in routing logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CrewAI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Workarounds required&lt;/td&gt;
&lt;td&gt;Role-based model&lt;/td&gt;
&lt;td&gt;Not built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AutoGen&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conversation-driven&lt;/td&gt;
&lt;td&gt;Implicit&lt;/td&gt;
&lt;td&gt;Not built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The security remediation loop is cyclic by design. The Security Auditor sends the DevOps Engineer back; the DevOps Engineer generates new code; the new code gets re-scanned. Both CrewAI and AutoGen require workarounds for this pattern. LangGraph's conditional edges handle it natively.&lt;/p&gt;

&lt;p&gt;The typed state was also non-negotiable. Without a clear contract on what each agent receives and produces, integration failures are silent. An agent gets &lt;code&gt;None&lt;/code&gt; where it expected a string and fails three nodes downstream with a cryptic error. &lt;code&gt;TypedDict&lt;/code&gt; with &lt;code&gt;total=False&lt;/code&gt; gives every agent a contract it cannot accidentally break.&lt;/p&gt;
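&lt;p&gt;In practice that contract is just a &lt;code&gt;TypedDict&lt;/code&gt;. A sketch with illustrative field names (the real project defines its own):&lt;/p&gt;

```python
from typing import TypedDict

class PipelineState(TypedDict, total=False):
    # Field names are illustrative, not the project's actual schema.
    # total=False makes every key optional, so agents can fill the
    # state incrementally while static checkers still catch typos.
    user_request: str
    architecture_plan: str
    terraform_code: str
    security_findings: list
    security_passed: bool
    remediation_attempts: int
```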




&lt;h2&gt;
  
  
  External Tools Through MCP
&lt;/h2&gt;

&lt;p&gt;tfsec and &lt;code&gt;mmdc&lt;/code&gt; (Mermaid rendering) run as MCP tools, not direct imports: each agent invokes them through the Model Context Protocol rather than calling library code in-process.&lt;/p&gt;

&lt;p&gt;This looks like over-engineering for a project at this scale. The argument for it: tfsec and &lt;code&gt;mmdc&lt;/code&gt; are external processes that can time out, crash, or produce unexpected output. Wrapping them in an MCP tool forces explicit failure handling at every call site.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_try_tfsec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmpdir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;_try_checkov&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmpdir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;tfsec unavailable? Try checkov. checkov unavailable? LLM security review with the full security system prompt. &lt;code&gt;mmdc&lt;/code&gt; unavailable? Save Mermaid source. Every external dependency ended up with a fallback path, which would not have happened if they were direct imports.&lt;/p&gt;

&lt;p&gt;The MCP server also runs independently. It can be swapped or extended without touching any agent code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five Things We'd Tell Ourselves at the Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Hard-cap your cycles before the first integration test.&lt;/strong&gt;&lt;br&gt;
You will hit the infinite loop case. Probably on a public-facing resource where the security finding is technically correct and architecturally intentional. Add the counter before you need it.&lt;/p&gt;
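&lt;p&gt;A counter-guarded router is all it takes. A sketch (node names and the cap value are illustrative):&lt;/p&gt;

```python
MAX_REMEDIATION_CYCLES = 3  # illustrative cap

def route_after_audit(state: dict) -> str:
    """Pick the next node after a security audit."""
    if state.get("security_passed"):
        return "visualize"
    attempts = state.get("remediation_attempts", 0)
    if attempts in range(MAX_REMEDIATION_CYCLES):
        return "devops"      # loop back for another remediation pass
    return "visualize"       # cap reached: stop looping, surface remaining findings
```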

&lt;p&gt;&lt;strong&gt;2. Regex beats prompting for deterministic security invariants.&lt;/strong&gt;&lt;br&gt;
If a property can be expressed as a pattern, enforce it with code. LLM compliance on security constraints is probabilistic. Code compliance is guaranteed. The CIDR sanitizer took 10 lines to write and immediately eliminated the majority of first-pass HCL failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Typed state is not optional in multi-agent systems.&lt;/strong&gt;&lt;br&gt;
Silent failures are the worst kind. A &lt;code&gt;TypedDict&lt;/code&gt; with &lt;code&gt;total=False&lt;/code&gt; is a contract every agent signs. Without it, you are debugging &lt;code&gt;None&lt;/code&gt; errors three nodes downstream and trying to reconstruct which agent set which field when.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Pydantic schema retry saves more than you expect.&lt;/strong&gt;&lt;br&gt;
Without &lt;code&gt;invoke_with_schema_retry&lt;/code&gt;, the pipeline fails silently on every malformed JSON response. With it, about 80% of schema failures resolve on the first retry with an error correction prompt. Make this load-bearing from day one.&lt;/p&gt;
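&lt;p&gt;A simplified version of that wrapper, using stdlib &lt;code&gt;json&lt;/code&gt; in place of Pydantic so the sketch stays self-contained (the retry-with-feedback shape is the point, not the exact API):&lt;/p&gt;

```python
import json

def invoke_with_schema_retry(call, validate, max_attempts=3):
    """Re-invoke an LLM call until its JSON output validates.

    `call(feedback)` returns raw text; `validate(obj)` raises
    ValueError on a bad payload. The real project validates with
    Pydantic models; stdlib json stands in here.
    """
    feedback = ""
    for attempt in range(max_attempts):
        raw = call(feedback)
        try:
            obj = json.loads(raw)
            validate(obj)
            return obj
        except (json.JSONDecodeError, ValueError) as exc:
            # The error text becomes the correction prompt for the retry.
            feedback = f"Previous response was invalid: {exc}. Return valid JSON only."
    raise RuntimeError("schema validation failed after retries")
```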

&lt;p&gt;&lt;strong&gt;5. Input validation pays for itself in saved tokens.&lt;/strong&gt;&lt;br&gt;
Chitchat and off-topic requests are common in demo environments. Every one that reaches the architect burns tokens before returning an unhelpful or confusing response. The three-layer guardrail means only genuine infrastructure requests reach the expensive part of the pipeline.&lt;/p&gt;
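&lt;p&gt;The cheapest of those layers can be a plain keyword heuristic that runs before any model call. A sketch (the keyword list is illustrative, not the project's actual guardrail):&lt;/p&gt;

```python
import re

# Illustrative first-layer filter: only requests that look like
# infrastructure work proceed to the LLM-based layers.
_INFRA_HINTS = re.compile(
    r"\b(vpc|s3|rds|ec2|lambda|terraform|bucket|cluster|load balancer|"
    r"database|redis|subnet|iam)\b",
    re.IGNORECASE,
)

def looks_like_infra_request(text: str) -> bool:
    return bool(_INFRA_HINTS.search(text))
```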


&lt;h2&gt;
  
  
  Run It Yourself
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Andela-AI-Engineering-Bootcamp/infrasquad.git
&lt;span class="nb"&gt;cd &lt;/span&gt;infrasquad
uv venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; uv &lt;span class="nb"&gt;sync&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Add your OpenRouter API key to &lt;code&gt;.env&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-or-...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the UI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python app.py              &lt;span class="c"&gt;# localhost:7860&lt;/span&gt;
python app.py &lt;span class="nt"&gt;--share&lt;/span&gt;      &lt;span class="c"&gt;# public Gradio URL&lt;/span&gt;
python app.py &lt;span class="nt"&gt;--port&lt;/span&gt; 8080  &lt;span class="c"&gt;# custom port&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default model is &lt;code&gt;openai/gpt-4o-mini&lt;/code&gt; via OpenRouter. Swap to any model OpenRouter supports by changing two env vars:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;LLM_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;anthropic/claude-3-5-sonnet
&lt;span class="nv"&gt;LLM_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://openrouter.ai/api/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or point it at a local Ollama instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;LLM_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;qwen2.5:72b
&lt;span class="nv"&gt;LLM_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:11434/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Optional tools for real scanner output and rendered diagrams:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;tfsec
pip &lt;span class="nb"&gt;install &lt;/span&gt;checkov
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @mermaid-js/mermaid-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If none of these are installed, the pipeline still completes. Security falls back to LLM review and diagrams save as Mermaid source.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpbejnu9f8ny2qsfxlih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpbejnu9f8ny2qsfxlih.png" alt="Built with: LangGraph, Python 3.12+, OpenRouter, FastMCP, Gradio, pydantic-settings, tfsec/checkov, Mermaid.js, uv" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Full source: &lt;a href="https://github.com/Andela-AI-Engineering-Bootcamp/infrasquad" rel="noopener noreferrer"&gt;infrasquad on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built at &lt;a href="https://help.andela.com/hc/en-us/articles/48808339012115-Welcome-to-the-AI-Engineering-Bootcamp" rel="noopener noreferrer"&gt;Andela AI Engineering Bootcamp&lt;/a&gt; by &lt;a href="https://linkedin.com/in/amit-bhatt" rel="noopener noreferrer"&gt;Amit&lt;/a&gt;, Ayesha, Elijah, Joel, Stella, and Adetayo.&lt;/p&gt;

&lt;p&gt;If you are building anything with &lt;a href="https://github.com/langchain-ai/langgraph" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt;, multi-agent pipelines, or IaC automation, drop a comment. Especially curious whether anyone else hit the public ALB infinite loop case, and how you handled it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>aws</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Sick of API costs and rate limits? I turned my M1 Mac into a fully offline AI coding agent. No cloud. No API keys. Just raw local compute using Llama.cpp and a 26B model. Check out the architecture and build it yourself! 🚀👇</title>
      <dc:creator>Amit Bhatt</dc:creator>
      <pubDate>Thu, 09 Apr 2026 08:44:23 +0000</pubDate>
      <link>https://dev.to/baremetal-dev/sick-of-api-costs-and-rate-limits-i-turned-my-m1-mac-into-a-fully-offline-ai-coding-agent-no-51gb</link>
      <guid>https://dev.to/baremetal-dev/sick-of-api-costs-and-rate-limits-i-turned-my-m1-mac-into-a-fully-offline-ai-coding-agent-no-51gb</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/baremetal-dev/i-turned-my-m1-macbook-into-an-offline-ai-coding-agent-0-api-cost-zero-cloud-14hb" class="crayons-story__hidden-navigation-link"&gt;I Turned My M1 MacBook Into an Offline AI Coding Agent - $0 API Cost, Zero Cloud&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/baremetal-dev" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869131%2F9229145b-c825-440c-873e-f83c12aa93a5.png" alt="baremetal-dev profile" class="crayons-avatar__image" width="606" height="626"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/baremetal-dev" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Amit Bhatt
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Amit Bhatt
                
              
              &lt;div id="story-author-preview-content-3475178" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/baremetal-dev" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869131%2F9229145b-c825-440c-873e-f83c12aa93a5.png" class="crayons-avatar__image" alt="" width="606" height="626"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Amit Bhatt&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/baremetal-dev/i-turned-my-m1-macbook-into-an-offline-ai-coding-agent-0-api-cost-zero-cloud-14hb" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 9&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/baremetal-dev/i-turned-my-m1-macbook-into-an-offline-ai-coding-agent-0-api-cost-zero-cloud-14hb" id="article-link-3475178"&gt;
          I Turned My M1 MacBook Into an Offline AI Coding Agent - $0 API Cost, Zero Cloud
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/llm"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;llm&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/privacy"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;privacy&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devex"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devex&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/baremetal-dev/i-turned-my-m1-macbook-into-an-offline-ai-coding-agent-0-api-cost-zero-cloud-14hb" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/fire-f60e7a582391810302117f987b22a8ef04a2fe0df7e3258a5f49332df1cec71e.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;2&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/baremetal-dev/i-turned-my-m1-macbook-into-an-offline-ai-coding-agent-0-api-cost-zero-cloud-14hb#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            9 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I Turned My M1 MacBook Into an Offline AI Coding Agent - $0 API Cost, Zero Cloud</title>
      <dc:creator>Amit Bhatt</dc:creator>
      <pubDate>Thu, 09 Apr 2026 08:15:55 +0000</pubDate>
      <link>https://dev.to/baremetal-dev/i-turned-my-m1-macbook-into-an-offline-ai-coding-agent-0-api-cost-zero-cloud-14hb</link>
      <guid>https://dev.to/baremetal-dev/i-turned-my-m1-macbook-into-an-offline-ai-coding-agent-0-api-cost-zero-cloud-14hb</guid>
      <description>&lt;p&gt;The cloud is great — until you hit a rate limit mid-refactor. Or you're on a flight. Or you're working on code that should never leave your machine.&lt;/p&gt;

&lt;p&gt;I spent three weeks obsessing over one question: &lt;strong&gt;how close can you actually get to a GPT-4-level agentic coding experience running 100% locally, with zero internet?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer surprised me. My M1 MacBook Pro — no discrete GPU, no cloud subscription, no API key — now runs a 26-billion parameter model that reads my codebase, writes code, applies diffs, and proposes Git changes. Autonomously. Offline.&lt;/p&gt;

&lt;p&gt;This post is the exact, reproducible blueprint. Every command is copy-pasteable. Every decision is explained.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — I compiled &lt;code&gt;llama.cpp&lt;/code&gt; with Metal GPU acceleration on an M1 Mac, loaded Google's Gemma-4 26B via Unsloth's quantization, and wired it to OpenCode for a fully agentic, offline coding workflow. Total API cost: &lt;strong&gt;$0&lt;/strong&gt;. Data sent to the cloud: &lt;strong&gt;0 bytes&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why This Matters Right Now
&lt;/h2&gt;

&lt;p&gt;Most conversations about "local AI" treat it as a hobbyist curiosity — small models, toy tasks, nothing you'd trust on real work. That was true 18 months ago. It isn't anymore.&lt;/p&gt;

&lt;p&gt;Three things converged to make this actually viable:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What Changed&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apple's Unified Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPU and CPU share the same RAM pool. A 32GB M1 can feed a 26B parameter model to the GPU like a dedicated VRAM machine.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;llama.cpp&lt;/code&gt; + Metal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CPU/GPU inference optimized specifically for Apple Silicon. Not a port — built for it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unsloth quantizations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Aggressive, quality-preserving quantization that fits Gemma-4 26B into ~16GB without meaningful quality loss.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Put those three together and a standard developer laptop becomes a credible inference machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hardware and the Brain
&lt;/h2&gt;

&lt;p&gt;I'm running this on an &lt;strong&gt;M1 MacBook Pro with 32GB of unified memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For the model, I chose &lt;a href="https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF" rel="noopener noreferrer"&gt;unsloth/gemma-4-26B-A4B-it-GGUF&lt;/a&gt; after reviewing the &lt;a href="https://unsloth.ai/docs/models/gemma-4#hardware-requirements" rel="noopener noreferrer"&gt;hardware requirements for Gemma-4&lt;/a&gt;. Here's why each component of that model name matters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unsloth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The leading framework for efficient LLM quantization, with recent bugfixes not yet in the &lt;a href="https://huggingface.co/ggml-org/gemma-4-26B-A4B-it-GGUF" rel="noopener noreferrer"&gt;ggml-org&lt;/a&gt; or &lt;a href="https://huggingface.co/google/gemma-4-26B-A4B-it" rel="noopener noreferrer"&gt;Google&lt;/a&gt; releases.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemma-4 26B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A massive, highly capable architecture from Google DeepMind.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Instruction-Tuned (it)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Crucial for agentic workflows — the model follows complex commands, not just predicts text.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GGUF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The optimized file format required for local CPU/Metal execution via &lt;code&gt;llama.cpp&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At a Q4 quantization, the 26B model requires roughly 15–16GB of memory. On 32GB unified memory, that leaves more than enough overhead for macOS, your IDE, and OpenCode running simultaneously.&lt;/p&gt;
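&lt;p&gt;The back-of-envelope math, assuming the Q4_K_XL mix averages roughly 4.9 bits per weight (an assumption; effective bit-rates vary by layer):&lt;/p&gt;

```python
params = 26e9            # 26B parameters
bits_per_weight = 4.9    # rough effective average for a Q4_K_XL mix (assumption)
model_gb = params * bits_per_weight / 8 / 1e9
print(f"{model_gb:.1f} GB")   # prints "15.9 GB", matching the file size on disk
```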




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Everything below is available through Homebrew or pip. No manual compilation required except &lt;code&gt;llama.cpp&lt;/code&gt; itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Xcode Command Line Tools (required for cmake, git, and Metal framework headers)&lt;/span&gt;
xcode-select &lt;span class="nt"&gt;--install&lt;/span&gt;

&lt;span class="c"&gt;# Core build dependencies&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;cmake libomp

&lt;span class="c"&gt;# Hugging Face CLI for model downloads&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;huggingface_hub hf_transfer

&lt;span class="c"&gt;# Parallel download engine (optional but strongly recommended for large models)&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;aria2

&lt;span class="c"&gt;# OpenCode — the agentic coding orchestrator&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;anomalyco/tap/opencode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these in place, every step below works on a clean macOS install.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Compile &lt;code&gt;llama.cpp&lt;/code&gt; from Scratch with Metal
&lt;/h2&gt;

&lt;p&gt;You &lt;em&gt;could&lt;/em&gt; download a pre-built binary. But if you want every drop of performance from the M1's Metal GPU framework, build from source.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ggml-org/llama.cpp.git
&lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_METAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;sysctl &lt;span class="nt"&gt;-n&lt;/span&gt; hw.ncpu&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key flag is &lt;code&gt;-DGGML_METAL=ON&lt;/code&gt; — this compiles with Apple's Metal GPU framework. The &lt;code&gt;-j$(sysctl -n hw.ncpu)&lt;/code&gt; parallelizes the build across all CPU cores.&lt;/p&gt;

&lt;p&gt;When you compile directly on the M1, inference speeds jump dramatically. You aren't just running code — you're running code hyper-optimized for your specific silicon.&lt;/p&gt;

&lt;p&gt;After the build, create symlinks to keep commands clean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; ./llama.cpp/build/bin/llama-cli llama-cli
&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; ./llama.cpp/build/bin/llama-server llama-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;llama-cli&lt;/code&gt; handles interactive terminal prompts. &lt;code&gt;llama-server&lt;/code&gt; is the HTTP inference server that exposes an OpenAI-compatible API — the piece that connects to OpenCode. Both are built from the same source tree.&lt;/p&gt;
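&lt;p&gt;Because the server speaks the OpenAI chat-completions wire format, any HTTP client works and no SDK is required. A stdlib-only sketch (port 8080 is &lt;code&gt;llama-server&lt;/code&gt;'s default; adjust if you pass &lt;code&gt;--port&lt;/code&gt;):&lt;/p&gt;

```python
import json
from urllib import request

def build_chat_request(prompt, temperature=0.2):
    """OpenAI-style chat-completions payload; llama-server serves whatever model it loaded."""
    return {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt, base_url="http://localhost:8080/v1"):
    """Send one chat turn to a local llama-server instance and return the reply text."""
    req = request.Request(
        base_url + "/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```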

&lt;h3&gt;
  
  
  Validate the Build Before the Big Download
&lt;/h3&gt;

&lt;p&gt;Before downloading the massive 18GB Gemma-4 model, validate the entire pipeline with a smaller model: &lt;a href="https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Nano-4B-GGUF" rel="noopener noreferrer"&gt;NVIDIA Nemotron-3-Nano-4B&lt;/a&gt; at Q8 quantization, just 3.9GB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This step is not optional.&lt;/strong&gt; You don't want to wait hours for an 18GB download only to discover your build is broken.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jgozahxs4j3fx081usl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jgozahxs4j3fx081usl.png" alt="Downloading NVIDIA Nemotron-3-Nano-4B as a pipeline validation step" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Booting the server with the smaller model confirms what you need to see: Metal framework fully initialized, unified memory detected, all GPU families registered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprud2bg5sr0wgyfv436b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprud2bg5sr0wgyfv436b.png" alt="Metal GPU framework initialization on the M1 — unified memory confirmed, bfloat support active, and recommendedMaxWorkingSetSize showing access to the full 32GB memory pool" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The critical line in that output: &lt;code&gt;has unified memory = true&lt;/code&gt;. The &lt;code&gt;recommendedMaxWorkingSetSize&lt;/code&gt; of roughly 26,800 MB tells you exactly how much VRAM the Metal backend can access — and on the M1, it draws directly from system RAM.&lt;/p&gt;

&lt;p&gt;Pipeline is solid. Now bring in the real model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Download Gemma-4 26B Weights
&lt;/h2&gt;

&lt;p&gt;Because we're building an offline environment, the model file needs to be local.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hf download unsloth/gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; unsloth/gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"*mmproj-BF16*"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"*UD-Q4_K_XL*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--include&lt;/code&gt; filters are important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;*mmproj-BF16*&lt;/code&gt; — the multimodal vision projector, giving the model the ability to understand images alongside code&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;*UD-Q4_K_XL*&lt;/code&gt; — the sweet spot quantization for quality vs. memory on 32GB (~15.9GB on disk)&lt;/li&gt;
&lt;/ul&gt;
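
&lt;p&gt;Once the download finishes, it's worth confirming both patterns actually matched files before moving on. A minimal sketch, assuming &lt;code&gt;MODEL_DIR&lt;/code&gt; is the &lt;code&gt;--local-dir&lt;/code&gt; from the command above:&lt;/p&gt;

```shell
# Sanity-check that both --include patterns matched files on disk.
# MODEL_DIR assumes the --local-dir from the download command above.
MODEL_DIR="${MODEL_DIR:-unsloth/gemma-4-26B-A4B-it-GGUF}"

check_pattern() {
  # Succeeds if at least one file in MODEL_DIR matches the glob fragment.
  ls "$MODEL_DIR"/*"$1"* >/dev/null 2>/dev/null
}

for pattern in mmproj-BF16 UD-Q4_K_XL; do
  if check_pattern "$pattern"; then
    echo "ok: $pattern present"
  else
    echo "MISSING: $pattern -- re-run the download"
  fi
done
```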

&lt;h3&gt;
  
  
  Fair Warning: 18.3GB Downloads Are Fragile
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1xzbbbyqfykn980ib46x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1xzbbbyqfykn980ib46x.png" alt="An 18.3GB download failing mid-transfer via the default hf CLI — hours of progress, gone" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My download crawled to 519KB/s before failing entirely. The default &lt;code&gt;hf download&lt;/code&gt; CLI supports resuming in theory, but in practice it's fragile on large files over unstable connections.&lt;/p&gt;

&lt;p&gt;Switch to &lt;a href="https://gist.github.com/yeahjack/31f542ee6cab3c3e2c30594b7693cb22#file-hfd-sh" rel="noopener noreferrer"&gt;&lt;code&gt;hfd.sh&lt;/code&gt;&lt;/a&gt; with &lt;code&gt;aria2c&lt;/code&gt; as the engine. Unlike the default CLI, &lt;code&gt;aria2c&lt;/code&gt; tracks per-segment progress in &lt;code&gt;.aria2&lt;/code&gt; control files — a dropped connection picks up exactly where it left off instead of restarting the entire file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./hfd.sh unsloth/gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; unsloth/gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"*mmproj-BF16*"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"*UD-Q4_K_XL*"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tool&lt;/span&gt; aria2c &lt;span class="nt"&gt;-x&lt;/span&gt; 16 &lt;span class="nt"&gt;-n&lt;/span&gt; 8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;-x 16&lt;/code&gt; opens 16 connections per server. &lt;code&gt;-n 8&lt;/code&gt; splits each file into 8 parallel segments. On a decent connection, this is dramatically faster and more resilient than the default downloader.&lt;/p&gt;
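
&lt;p&gt;You can see the resume mechanism for yourself: any file still mid-transfer has a &lt;code&gt;.aria2&lt;/code&gt; control file sitting next to it, which &lt;code&gt;aria2c&lt;/code&gt; removes on completion. A quick check before re-running the download (the directory path assumes the &lt;code&gt;--local-dir&lt;/code&gt; from above):&lt;/p&gt;

```shell
# List partially-downloaded files: aria2c leaves a .aria2 control file
# next to each incomplete download and deletes it when the file completes.
# DOWNLOAD_DIR assumes the --local-dir used with hfd.sh above.
DOWNLOAD_DIR="${DOWNLOAD_DIR:-unsloth/gemma-4-26B-A4B-it-GGUF}"

incomplete=$(find "$DOWNLOAD_DIR" -name "*.aria2" 2>/dev/null)
if [ -n "$incomplete" ]; then
  echo "resumable partial downloads:"
  echo "$incomplete"
else
  echo "no .aria2 control files -- nothing to resume"
fi
```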




&lt;h2&gt;
  
  
  Step 3: Wire the Brain to OpenCode
&lt;/h2&gt;

&lt;p&gt;Having a powerful local LLM is interesting. Having it autonomously write, edit, and debug your code is a different thing entirely. That's where &lt;a href="https://opencode.ai/docs/" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt; comes in.&lt;/p&gt;

&lt;p&gt;OpenCode bridges the local LLM and your codebase. The key insight: &lt;code&gt;llama-server&lt;/code&gt; exposes an OpenAI-compatible &lt;code&gt;/v1/chat/completions&lt;/code&gt; endpoint out of the box. OpenCode's &lt;code&gt;@ai-sdk/openai-compatible&lt;/code&gt; adapter speaks that protocol natively. No custom prompt templates, no manual token wrangling — the chat template baked into the GGUF handles everything at the server level.&lt;/p&gt;
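
&lt;p&gt;Before wiring up OpenCode, you can hit that endpoint directly. A minimal smoke test; the &lt;code&gt;model&lt;/code&gt; field must match the &lt;code&gt;--alias&lt;/code&gt; you pass to &lt;code&gt;llama-server&lt;/code&gt; in Step 4, and the prompt is just a placeholder:&lt;/p&gt;

```shell
# Build and validate a chat-completions payload, then (optionally) send it.
# The model name must match the --alias given to llama-server.
PAYLOAD='{
  "model": "gemma-4-26B",
  "messages": [{"role": "user", "content": "Say hello in five words."}],
  "max_tokens": 32
}'

# Validate locally before sending anything over the wire.
if echo "$PAYLOAD" | python3 -m json.tool >/dev/null 2>/dev/null; then
  echo "payload is valid JSON"
fi

# Uncomment once llama-server is listening on port 8001:
# curl -s http://127.0.0.1:8001/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```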

&lt;p&gt;Create &lt;code&gt;opencode.json&lt;/code&gt; at your project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"$schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://opencode.ai/config.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"llama.cpp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"npm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@ai-sdk/openai-compatible"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama-server (local)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"baseURL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://127.0.0.1:8001"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"gemma-4:26b-a4b-it"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Gemma-4-26B-A4B-it (local)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;65536&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"nvidia-nemotron-3-nano:4b"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NVIDIA-Nemotron-3-Nano-4B (local)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;65536&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth noting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;baseURL&lt;/code&gt; points to &lt;code&gt;127.0.0.1:8001&lt;/code&gt; where &lt;code&gt;llama-server&lt;/code&gt; will listen&lt;/li&gt;
&lt;li&gt;Context is set to 32K tokens — the model supports up to 262K, but 32K is a practical ceiling for stable agentic sessions on 32GB RAM&lt;/li&gt;
&lt;li&gt;The second model (Nemotron-3 Nano 4B) is configured as a lightweight alternative for fast, low-overhead tasks&lt;/li&gt;
&lt;/ul&gt;
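
&lt;p&gt;The 32K figure comes from simple budgeting against the Metal working set from Step 1. A back-of-envelope sketch; the KV-cache cost below is a placeholder assumption, not a measured number, so substitute your own from the server logs:&lt;/p&gt;

```shell
# Rough memory budget for the 32K-context setting.
# WEIGHTS_GB and METAL_BUDGET_GB come from earlier in the article;
# KV_GB_AT_32K is a placeholder estimate -- measure your own.
WEIGHTS_GB=15.9        # Q4_K_XL weights on disk
METAL_BUDGET_GB=26.8   # recommendedMaxWorkingSetSize from Step 1
KV_GB_AT_32K=4.0       # assumed KV-cache cost at 32K tokens

headroom=$(awk -v w="$WEIGHTS_GB" -v b="$METAL_BUDGET_GB" -v k="$KV_GB_AT_32K" \
  'BEGIN { printf "%.1f", b - w - k }')
echo "estimated headroom at 32K context: ${headroom} GB"
```

&lt;p&gt;If the headroom goes negative at a larger context, macOS starts paging and throughput collapses, which is why 32K is the conservative default here.&lt;/p&gt;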

&lt;p&gt;Verify the model is running correctly by checking the llama.cpp web interface at &lt;code&gt;http://127.0.0.1:8001&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F318b3k7e51wl9jjrfnam.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F318b3k7e51wl9jjrfnam.png" alt="Gemma-4 26B loaded and serving — 15.9GB model, 25.23B parameters, 262K context window, Vision modality confirmed" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;25.23 billion parameters. A 262,144-token context window. Vision capability. Running from a file on local disk, served over localhost. No cloud, no API key, no rate limit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Full Offline Agentic Coding
&lt;/h2&gt;

&lt;p&gt;With the model downloaded, &lt;code&gt;llama.cpp&lt;/code&gt; compiled, and &lt;code&gt;opencode.json&lt;/code&gt; locked in, I turned off Wi-Fi.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero internet. Zero API calls. Zero data leaving the machine.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open two terminals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 1: Start the llama.cpp inference server with Gemma-4&lt;/span&gt;
./llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--temp&lt;/span&gt; 0.6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--top-p&lt;/span&gt; 0.95 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--alias&lt;/span&gt; &lt;span class="s2"&gt;"gemma-4-26B"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8001 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="nt"&gt;-t&lt;/span&gt; 8 &lt;span class="nt"&gt;-b&lt;/span&gt; 512 &lt;span class="nt"&gt;--mmap&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Still validating with Nemotron before the full download? Same flags work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Alternative: lighter Nemotron model for testing&lt;/span&gt;
./llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; unsloth/NVIDIA-Nemotron-3-Nano-4B-GGUF/NVIDIA-Nemotron-3-Nano-4B-Q8_0.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--temp&lt;/span&gt; 0.6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--top-p&lt;/span&gt; 0.95 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--alias&lt;/span&gt; &lt;span class="s2"&gt;"nvidia/nemotron-3-nano-4B-GGUF"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8001 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning&lt;/span&gt; on &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="nt"&gt;-t&lt;/span&gt; 8 &lt;span class="nt"&gt;-b&lt;/span&gt; 512 &lt;span class="nt"&gt;--mmap&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;code&gt;--reasoning on&lt;/code&gt; flag on Nemotron — it activates a built-in chain-of-thought mode that improves output quality on complex tasks. Useful for validating multi-step reasoning before scaling up to Gemma-4.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 2: Launch OpenCode&lt;/span&gt;
opencode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
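
&lt;p&gt;One practical refinement: &lt;code&gt;llama-server&lt;/code&gt; exposes a &lt;code&gt;/health&lt;/code&gt; endpoint that returns 200 once the model has finished loading, so Terminal 2 can wait for readiness instead of launching OpenCode blind. A small sketch; the retry count and interval are arbitrary choices:&lt;/p&gt;

```shell
# Poll llama-server's /health endpoint until the model is loaded.
# Usage: wait_for_server URL [TRIES] [INTERVAL_SECONDS]
wait_for_server() {
  url="$1"
  tries="${2:-60}"
  interval="${3:-2}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    status=$(curl -s -o /dev/null -w "%{http_code}" "$url" 2>/dev/null)
    if [ "$status" = "200" ]; then
      echo "server ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "server not healthy after $tries attempts"
  return 1
}

# wait_for_server http://127.0.0.1:8001/health
```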



&lt;p&gt;Here's what each server flag does:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flag&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-ngl 99&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Offloads all model layers to the Metal GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-t 8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sets 8 CPU threads for operations that fall back to CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-b 512&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Controls batch size for prompt processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--mmap&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Memory-maps the model file — macOS manages paging without loading all 15.9GB upfront&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--temp 0.6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Slightly below default for more deterministic code generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--top-p 0.95&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Nucleus sampling — keeps output focused while allowing creativity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zz3dxqbho9usk3m8gdi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zz3dxqbho9usk3m8gdi.png" alt="OpenCode analyzing a project's architecture, generating documentation, and proposing Git changes across 5 files — powered locally by gemma-4-26B in 45 seconds" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result: Gemma-4 26B analyzed my codebase, understood its architecture from local files alone, and began writing, diffing, and applying code. In the screenshot above, it analyzed &lt;code&gt;architect.py&lt;/code&gt;, broke down the Pydantic data models, explained the &lt;code&gt;run_architect&lt;/code&gt; function flow, and proposed 5 Git changes across the project.&lt;/p&gt;

&lt;p&gt;The M1 pushed out tokens fast enough for real-time development. The footer confirms it: &lt;code&gt;gemma-4-26B&lt;/code&gt;, 45 seconds for a full architectural analysis and code generation pass.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Actually Means
&lt;/h2&gt;

&lt;p&gt;We are crossing a threshold.&lt;/p&gt;

&lt;p&gt;For two years, the industry assumed truly capable AI agents require data centers. This experiment proves otherwise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For engineering teams working on sensitive codebases&lt;/strong&gt; — defense, healthcare, fintech — this means AI coding assistants without a single byte crossing a network boundary. No SOC 2 reviews for another SaaS vendor. No data processing agreements. No trust boundaries to negotiate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For engineering leaders&lt;/strong&gt;, the math is compelling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero marginal API cost per developer&lt;/li&gt;
&lt;li&gt;Zero vendor lock-in&lt;/li&gt;
&lt;li&gt;Works identically on an airplane, in a SCIF, or behind an air-gapped network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For individual developers&lt;/strong&gt;, the practical reality: you can now run a frontier-class coding agent on hardware you already own, using models that are openly licensed, with no subscription, no quota, no latency spikes on someone else's overloaded GPU cluster.&lt;/p&gt;

&lt;p&gt;Absolute privacy and top-tier AI capability are no longer mutually exclusive.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'm Exploring Next
&lt;/h2&gt;

&lt;p&gt;This setup is a foundation, not a ceiling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Larger context windows.&lt;/strong&gt; The 32K context in &lt;code&gt;opencode.json&lt;/code&gt; is conservative. With careful memory management and &lt;code&gt;llama.cpp&lt;/code&gt;'s Flash Attention support, 128K+ is feasible on 32GB for longer agentic sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-model routing.&lt;/strong&gt; Running Nemotron for fast, lightweight tasks and Gemma-4 for heavy reasoning — switching between models based on task complexity, all locally. Think of it as a cheap/smart tier system without the cloud bill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning on proprietary code.&lt;/strong&gt; Unsloth supports LoRA and QLoRA fine-tuning. Training a domain-specific adapter on your team's codebase and merging it into the GGUF gives you a model that &lt;em&gt;thinks&lt;/em&gt; in your architecture and naming conventions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Team-wide access.&lt;/strong&gt; Embed &lt;code&gt;llama-server&lt;/code&gt; in a container behind your internal network so the entire team gets local AI without each developer maintaining their own build.&lt;/p&gt;
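
&lt;p&gt;As a sketch of that last idea: the &lt;code&gt;llama.cpp&lt;/code&gt; project publishes a server container image, so a shared instance can be as simple as one &lt;code&gt;docker run&lt;/code&gt;. Treat the image tag and paths below as assumptions to verify against the current &lt;code&gt;llama.cpp&lt;/code&gt; docs, and note that Metal acceleration is not available inside a container, so this fits a Linux host running CUDA or CPU inference:&lt;/p&gt;

```shell
# Illustrative only: serve the model to the whole team from one shared box.
# Image tag and mount paths are assumptions -- check the llama.cpp docs.
docker run -p 8001:8001 \
  -v "$PWD/unsloth/gemma-4-26B-A4B-it-GGUF:/models" \
  ghcr.io/ggml-org/llama.cpp:server \
  --model /models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8001
```

&lt;p&gt;Each developer's &lt;code&gt;opencode.json&lt;/code&gt; would then point its &lt;code&gt;baseURL&lt;/code&gt; at the shared host instead of &lt;code&gt;127.0.0.1&lt;/code&gt;.&lt;/p&gt;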




&lt;h2&gt;
  
  
  The Full Stack, in One Place
&lt;/h2&gt;

&lt;p&gt;For anyone who wants to reproduce this exactly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;What It Is&lt;/th&gt;
&lt;th&gt;Link&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama.cpp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Metal-accelerated inference engine&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/ggml-org/llama.cpp" rel="noopener noreferrer"&gt;github.com/ggml-org/llama.cpp&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma-4 26B (Unsloth)&lt;/td&gt;
&lt;td&gt;The model, Q4_K_XL quantization&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF" rel="noopener noreferrer"&gt;huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenCode&lt;/td&gt;
&lt;td&gt;Agentic coding orchestrator&lt;/td&gt;
&lt;td&gt;&lt;a href="https://opencode.ai/docs/" rel="noopener noreferrer"&gt;opencode.ai/docs&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hfd.sh&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reliable large-file downloader&lt;/td&gt;
&lt;td&gt;&lt;a href="https://gist.github.com/yeahjack/31f542ee6cab3c3e2c30594b7693cb22#file-hfd-sh" rel="noopener noreferrer"&gt;gist.github.com/yeahjack&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tools are here. The models are capable enough. The only question is what you build with them.&lt;/p&gt;




&lt;p&gt;If you found this useful, the full write-up with additional context lives on &lt;a href="https://sectumpsempra.github.io" rel="noopener noreferrer"&gt;my site&lt;/a&gt;. Questions, improvements, or your own local AI stack? Drop them in the comments — I'd genuinely like to hear what you're running.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by &lt;a href="https://linkedin.com/in/amit-bhatt" rel="noopener noreferrer"&gt;Amit Bhatt&lt;/a&gt; — &lt;a href="https://github.com/sectumpsempra" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>privacy</category>
      <category>devex</category>
    </item>
  </channel>
</rss>
