DEV Community

MrClaw207
MrClaw207

Posted on

I Built a Local AI Coding Agent for /bin/bash/Month

I Built a Local AI Coding Agent for $0/Month. Here's the Setup That Actually Works.

With AI coding tools moving to usage-based pricing and subscription prices going up, I decided to go fully local. Total cost: zero (assuming you have the hardware). Here's what I learned setting up Qwen3.6-27B as a local coding agent.

Why Now Is the Right Time

Six months ago, I would have said "local models aren't good enough for coding agents." That's changed. Three things happened:

  1. Reasoning capabilities in small models — models like Qwen3.6-27B can "think" longer about hard problems, compensating for raw capability gaps
  2. Mixture-of-experts architectures — you don't need terabytes/second of memory bandwidth for an interactive experience
  3. Tool calling maturity — models can now interact with codebases, shell environments, and the web reliably

The Register did a hands-on test recently and concluded that for vibe coding — side projects, prototypes, the 80% of work that's not production-critical — local models are now viable.

The Hardware Reality Check

You need:

  • GPU: Nvidia, AMD, or Intel with at least 24 GB VRAM, OR
  • Mac: M-series with at least 32 GB unified memory

On a 24-32GB machine, you can run medium-sized LLMs interactively. Older M-series Macs may struggle with large context lengths required for agentic coding.

The Setup: Llama.cpp + Qwen3.6-27B

The Register's recommended setup is Llama.cpp (or Ollama if you want a UI). For coding agents specifically, the recommended Qwen3.6-27B parameters matter:

{
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "min_p": 0.0,
  "presence_penalty": 0.0,
  "repetition_penalty": 1.0
}
Enter fullscreen mode Exit fullscreen mode

These aren't arbitrary — Alibaba, who built Qwen, tuned these for code generation specifically. Setting temperature too low produces repetitive, mechanical code. Setting it too high produces creative but broken code. The 0.6 sweet spot was validated for this model.

Context Window Sizing

The context window is how many tokens the model can track in a single request. For coding work on large codebases, this adds up fast — every file you open, every chat message, counts against the window.

Set your context window as large as your hardware allows. On 24GB VRAM, The Register suggests prioritizing context length over model size if you have to choose. A smaller model with a larger context window is more useful for coding agents than a larger model with a cramped context.

The Agent Framework Question

You have options for the agent harness:

  • Cline (VS Code extension, free, Apache 2.0) — best MCP marketplace in open source, flexible model choice
  • Continue.dev (VS Code/JetBrains) — strong local-model flexibility, self-hosted option
  • Aider (terminal, free) — best terminal-first local workflow

All three support Ollama for local model inference. Pick the interface you actually want to use daily.

The Real Trade-off

Local models work for 80% of what most developers use AI coding tools for: autocomplete, small refactors, documentation, test generation. The remaining 20% — hard architectural decisions, complex debugging, unfamiliar codebases — still benefits from frontier models.

The practical setup: use local for the 80%, route hard problems to Claude or GPT-5 via BYOK when needed. Most days, you'll spend zero on API calls.


Hardware note: M-series Mac with 32GB unified memory or 24GB VRAM GPU recommended. Qwen3.6-27B parameters tested by The Register on actual codebases.

Top comments (0)