I Built a Local AI Coding Agent for $0/Month. Here's the Setup That Actually Works.
With AI coding tools moving to usage-based pricing and subscription prices going up, I decided to go fully local. Total cost: zero (assuming you have the hardware). Here's what I learned setting up Qwen3.6-27B as a local coding agent.
Why Now Is the Right Time
Six months ago, I would have said "local models aren't good enough for coding agents." That's changed. Three things happened:
- Reasoning capabilities in small models — models like Qwen3.6-27B can "think" longer about hard problems, compensating for raw capability gaps
- Mixture-of-experts architectures — you don't need terabytes/second of memory bandwidth for an interactive experience
- Tool calling maturity — models can now interact with codebases, shell environments, and the web reliably
The Register did a hands-on test recently and concluded that for vibe coding — side projects, prototypes, the 80% of work that's not production-critical — local models are now viable.
The Hardware Reality Check
You need:
- GPU: Nvidia, AMD, or Intel with at least 24 GB VRAM, OR
- Mac: M-series with at least 32 GB unified memory
On a 24-32GB machine, you can run medium-sized LLMs interactively. Older M-series Macs may struggle with large context lengths required for agentic coding.
The Setup: Llama.cpp + Qwen3.6-27B
The Register's recommended setup is Llama.cpp (or Ollama if you want a UI). For coding agents specifically, the recommended Qwen3.6-27B parameters matter:
{
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"min_p": 0.0,
"presence_penalty": 0.0,
"repetition_penalty": 1.0
}
These aren't arbitrary — Alibaba, who built Qwen, tuned these for code generation specifically. Setting temperature too low produces repetitive, mechanical code. Setting it too high produces creative but broken code. The 0.6 sweet spot was validated for this model.
Context Window Sizing
The context window is how many tokens the model can track in a single request. For coding work on large codebases, this adds up fast — every file you open, every chat message, counts against the window.
Set your context window as large as your hardware allows. On 24GB VRAM, The Register suggests prioritizing context length over model size if you have to choose. A smaller model with a larger context window is more useful for coding agents than a larger model with a cramped context.
The Agent Framework Question
You have options for the agent harness:
- Cline (VS Code extension, free, Apache 2.0) — best MCP marketplace in open source, flexible model choice
- Continue.dev (VS Code/JetBrains) — strong local-model flexibility, self-hosted option
- Aider (terminal, free) — best terminal-first local workflow
All three support Ollama for local model inference. Pick the interface you actually want to use daily.
The Real Trade-off
Local models work for 80% of what most developers use AI coding tools for: autocomplete, small refactors, documentation, test generation. The remaining 20% — hard architectural decisions, complex debugging, unfamiliar codebases — still benefits from frontier models.
The practical setup: use local for the 80%, route hard problems to Claude or GPT-5 via BYOK when needed. Most days, you'll spend zero on API calls.
Hardware note: M-series Mac with 32GB unified memory or 24GB VRAM GPU recommended. Qwen3.6-27B parameters tested by The Register on actual codebases.
Top comments (0)