Your frontier model just got pulled. On June 12, the US government issued an export control directive that forced Anthropic to disable Claude Fable 5 worldwide — with almost no notice. If your coding agent was wired to Fable 5, it went dark. Hugging Face CEO Clément Delangue's response summed up the mood: "Fable is banned. Long live local AI."
📖 Read the full version with charts and embedded sources on AgentConn →
The Hacker News thread that followed — "Has anyone replaced Claude/GPT with a local model for daily coding?" — collected 93 points in hours. The answers were surprisingly practical: developers reporting that Qwen 3.6 and Gemma 4 on RTX 3090s handle 80% of their daily coding. Not 100%. But enough to survive a frontier model going dark overnight.
This isn't a model review. It's an operator playbook: what hardware you need, which agent harnesses work with local weights, where tool calling breaks down, and how to build an 80/20 hybrid that makes your agent stack ban-proof.
The Hardware You Actually Need
The single most common question in every local-model thread: "How much VRAM?" The answer in 2026 is more nuanced than "buy a 4090," because Mixture of Experts (MoE) architectures have dramatically changed the math.
| VRAM | Best Model Fit | What You Can Do | What Breaks |
|---|---|---|---|
| 8 GB | Qwen 2.5 7B | Basic completions, simple edits | No agentic workflows, poor tool calling |
| 16 GB | Qwen 3.6 35B-A3B (MoE) | Comfortable agentic coding, most daily tasks | Struggles with 5+ file refactors |
| 24 GB | Qwen3-Coder-Next / Gemma 4 26B | Serious agentic work, SWE-bench 58.7% | Still 80/20 vs frontier on hard problems |
| 48 GB+ | Full-precision large models | Near-frontier for most tasks | Diminishing returns vs cloud cost |
The sweet spot is 24 GB — matching the RTX 3090 ($489 used), RTX 4090, and RTX 5090 entry tier. At that budget, a developer spending $60–100/month on Claude API tokens recoups the GPU cost in 5–8 months.
💡 MoE (Mixture of Experts) models like Qwen 3.6 35B-A3B activate only a fraction of their parameters per token. This is why a "35B" model runs on 16 GB VRAM.
Which Models Run Agentic Coding on Consumer GPUs
Qwen 3.6 35B-A3B — The current community favorite. Released April 2026 with explicit agentic coding focus. Runs on 16 GB. The HN consensus: "it's the first local model that doesn't feel like a science experiment."
Qwen3-Coder-Next — The specialized coding variant. Scores 58.7% on SWE-bench Verified with 256K context, running on a single 24 GB GPU.
Gemma 4 26B-A4B — Google's MoE entry. Fast, low-VRAM, excellent for completions. Struggles in agentic scenarios — better as a copilot than an autonomous agent.
Pick Your Harness: PI, Aider, Cline
PI Agent (61K+ stars, MIT) — Terminal-native, ships with four core tools, works with any local Ollama model.
Aider — Most mature open-source terminal coding agent with architect/editor mode for the 80/20 pattern.
Cline — VS Code-native with Plan/Act mode, supports multiple LLM backends.
The Reliability Problem (and How Forge Solved It)
If each tool-calling step succeeds 90% of the time, a 5-step workflow has a 59% success rate. Forge, published as an ACM CAIS 2026 paper, wraps any self-hosted LLM with guardrails: an 8B model goes from 53% to 99% task completion.
⚠️ The compounding reliability problem: 90% per-step accuracy = 59% on 5 steps, 35% on 10 steps. Forge addresses this at the harness layer, not the model layer.
The 80/20 Hybrid
The operators getting the most value aren't going all-in on local. They route 80% of routine coding locally (completions, single-file edits, tests) and 20% to cloud (multi-file refactors, complex debugging). Monthly cloud spend drops from $80 to ~$16. The GPU pays for itself in 8 months.
The Operator's Checklist
- Hardware: 24 GB VRAM (RTX 3090 used: ~$489)
- Model runtime: Ollama or vLLM with Qwen 3.6 35B-A3B
- Agent harness: PI Agent, Aider, or Cline
- Reliability layer: Forge-style guardrails for models under 24B
- Cloud fallback: One provider for the hard 20%
- Routing logic: Single-file → local, multi-file → cloud
The harness is the moat, not the model. Your coding agent should run on whatever weights are available — local, cloud, or both.
Originally published at AgentConn
Top comments (0)