Max Quimby

Posted on Jun 16 • Originally published at agentconn.com

Run Your Coding Agent on Local Weights: Operator Playbook

#ai #coding #localai #agents

Your frontier model just got pulled. On June 12, the US government issued an export control directive that forced Anthropic to disable Claude Fable 5 worldwide — with almost no notice. If your coding agent was wired to Fable 5, it went dark. Hugging Face CEO Clément Delangue's response summed up the mood: "Fable is banned. Long live local AI."

📖 Read the full version with charts and embedded sources on AgentConn →

The Hacker News thread that followed — "Has anyone replaced Claude/GPT with a local model for daily coding?" — collected 93 points in hours. The answers were surprisingly practical: developers reporting that Qwen 3.6 and Gemma 4 on RTX 3090s handle 80% of their daily coding. Not 100%. But enough to survive a frontier model going dark overnight.

This isn't a model review. It's an operator playbook: what hardware you need, which agent harnesses work with local weights, where tool calling breaks down, and how to build an 80/20 hybrid that makes your agent stack ban-proof.

The Hardware You Actually Need

The single most common question in every local-model thread: "How much VRAM?" The answer in 2026 is more nuanced than "buy a 4090," because Mixture of Experts (MoE) architectures have dramatically changed the math.

VRAM	Best Model Fit	What You Can Do	What Breaks
8 GB	Qwen 2.5 7B	Basic completions, simple edits	No agentic workflows, poor tool calling
16 GB	Qwen 3.6 35B-A3B (MoE)	Comfortable agentic coding, most daily tasks	Struggles with 5+ file refactors
24 GB	Qwen3-Coder-Next / Gemma 4 26B	Serious agentic work, SWE-bench 58.7%	Still 80/20 vs frontier on hard problems
48 GB+	Full-precision large models	Near-frontier for most tasks	Diminishing returns vs cloud cost

The sweet spot is 24 GB — matching the RTX 3090 ($489 used), RTX 4090, and RTX 5090 entry tier. At that budget, a developer spending $60–100/month on Claude API tokens recoups the GPU cost in 5–8 months.

💡 MoE (Mixture of Experts) models like Qwen 3.6 35B-A3B activate only a fraction of their parameters per token. This is why a "35B" model runs on 16 GB VRAM.

Which Models Run Agentic Coding on Consumer GPUs

Qwen 3.6 35B-A3B — The current community favorite. Released April 2026 with explicit agentic coding focus. Runs on 16 GB. The HN consensus: "it's the first local model that doesn't feel like a science experiment."

Qwen3-Coder-Next — The specialized coding variant. Scores 58.7% on SWE-bench Verified with 256K context, running on a single 24 GB GPU.

Gemma 4 26B-A4B — Google's MoE entry. Fast, low-VRAM, excellent for completions. Struggles in agentic scenarios — better as a copilot than an autonomous agent.

Pick Your Harness: PI, Aider, Cline

PI Agent (61K+ stars, MIT) — Terminal-native, ships with four core tools, works with any local Ollama model.

Aider — Most mature open-source terminal coding agent with architect/editor mode for the 80/20 pattern.

Cline — VS Code-native with Plan/Act mode, supports multiple LLM backends.

The Reliability Problem (and How Forge Solved It)

If each tool-calling step succeeds 90% of the time, a 5-step workflow has a 59% success rate. Forge, published as an ACM CAIS 2026 paper, wraps any self-hosted LLM with guardrails: an 8B model goes from 53% to 99% task completion.

⚠️ The compounding reliability problem: 90% per-step accuracy = 59% on 5 steps, 35% on 10 steps. Forge addresses this at the harness layer, not the model layer.

The 80/20 Hybrid

The operators getting the most value aren't going all-in on local. They route 80% of routine coding locally (completions, single-file edits, tests) and 20% to cloud (multi-file refactors, complex debugging). Monthly cloud spend drops from $80 to ~$16. The GPU pays for itself in 8 months.

The Operator's Checklist

Hardware: 24 GB VRAM (RTX 3090 used: ~$489)
Model runtime: Ollama or vLLM with Qwen 3.6 35B-A3B
Agent harness: PI Agent, Aider, or Cline
Reliability layer: Forge-style guardrails for models under 24B
Cloud fallback: One provider for the hard 20%
Routing logic: Single-file → local, multi-file → cloud

The harness is the moat, not the model. Your coding agent should run on whatever weights are available — local, cloud, or both.

Originally published at AgentConn

DEV Community