# I Route 88% of AI Coding Tasks to a Free Local Model — Here's What I Learned
Running AI coding agents through cloud APIs gets expensive fast. Claude Sonnet at ~$0.04/task, Opus at ~$0.075 — it adds up when you're running hundreds of tasks.
So I built a system that routes 88% of tasks to a free local model and only escalates to paid APIs when necessary. Then I wrapped it in a Command & Conquer Red Alert-style interface because… I grew up in the 90s.
## The Cost Problem
Most AI coding agent frameworks send everything to the best model available. But "create a function that adds two numbers" doesn't need the same model as "implement an LRU cache with O(1) operations."
I tested this systematically — 40 coding tasks scored on a complexity scale of 1–9, all executed by a local 7B parameter model running on a $300 used GPU:
| Complexity | Example Tasks | Success Rate |
|---|---|---|
| C1–2 | Add function, greet function | 100% |
| C3–4 | Parse CSV, validate emails, factorial | 80–100% |
| C5–6 | Calculator with history, prime checker | 60–100% |
| C7–8 | Merge sorted lists, binary search, word frequency | 80–100% |
| C9 | LRU cache, stack class, RPN calculator | 80% |
Result: $0.002 average cost per task instead of $0.04 — that's 20x cheaper.
The 7B model handled everything from trivial one-liners to LeetCode-medium problems. It even added type hints unprompted on some solutions. Only multi-class architectural tasks needed escalation to cloud APIs.
## How the Routing Works
The system scores every task on a 1–10 complexity scale using dual assessment:
Task → Complexity Assessment
├─ Rule-based: keyword matching, structural analysis
└─ Haiku AI: semantic understanding (~$0.001/call)
↓
Smart weighting: if Haiku rates 2+ higher, trust Haiku
↓
├─ C1–6 → Ollama (free, local GPU)
├─ C7–8 → Haiku (~$0.003/task)
├─ C9–10 → Sonnet (~$0.01/task)
└─ Decomposition → Opus (review only, never writes code)
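The tier routing above boils down to a threshold lookup. Here is a minimal sketch; the function name and return strings are illustrative, not the project's actual identifiers:

```python
def route(complexity: int) -> str:
    """Map a 1-10 complexity score to a model tier.

    Thresholds follow the routing diagram above. Decomposition
    review goes to Opus separately; it never writes code.
    """
    if complexity <= 6:
        return "ollama"   # free, local GPU
    if complexity <= 8:
        return "haiku"    # ~$0.003/task
    return "sonnet"       # ~$0.01/task
```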
The complexity scoring draws from Campbell's Task Complexity Theory in organizational psychology, adapted for code tasks:
- Component complexity — How many steps, files, and functions are expected?
- Coordinative complexity — How many dependencies exist between parts?
- Dynamic complexity — How much ambiguity and decision-making is required?
The smart weighting was a key breakthrough. The rule-based router uses keyword matching ("LRU", "cache", "linked list"), but sometimes misses semantic complexity. A cheap Haiku call provides that semantic understanding. When Haiku rates a task 2+ points higher than the rules, we trust Haiku's score directly instead of averaging — this prevents complex tasks from being under-routed to Ollama.
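The weighting rule can be sketched in a few lines. The 2-point threshold is from the article; the rounding choice when averaging is an assumption:

```python
def blended_score(rule_score: int, haiku_score: int) -> int:
    """Combine the rule-based and Haiku complexity assessments.

    If Haiku rates the task 2+ points higher, trust Haiku outright;
    otherwise average the two scores.
    """
    if haiku_score - rule_score >= 2:
        return haiku_score  # prevents under-routing complex tasks to Ollama
    return round((rule_score + haiku_score) / 2)
```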
## Hard-Won Lessons Running Ollama in Production
These took weeks of debugging. If you're building anything agentic with local models, this might save you some pain.
### 1. Temperature = 0 is mandatory for tool calling
This was the single biggest improvement. Small models with temperature > 0 will randomly output raw code instead of calling tools through the proper function-calling interface. The model might generate a perfect Python function… and dump it into stdout instead of calling file_write.
Temperature 0 gives deterministic, reliable tool usage. It took our success rate from ~60% to 90%+ overnight.
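With Ollama's `/api/chat` endpoint, the setting lives in the `options` object of the request body. A minimal sketch of building such a request (model name and helper are illustrative):

```python
def build_chat_payload(model: str, messages: list, tools: list) -> dict:
    """Build an Ollama /api/chat request body with deterministic sampling.

    temperature=0 is the key setting: it keeps small models on the
    function-calling path instead of dumping raw code into the reply.
    """
    return {
        "model": model,
        "messages": messages,
        "tools": tools,
        "stream": False,                   # one complete response per task
        "options": {"temperature": 0},     # deterministic tool usage
    }
```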
### 2. Context pollution is real (and sneaky)
After ~5 consecutive tasks on the same Ollama instance, the model starts generating syntax errors — missing quotes, unclosed parentheses, garbled output. The accumulated context from previous tasks bleeds into new ones.
The fix is surprisingly simple: a 3-second rest delay between tasks, plus a full memory reset every 3 tasks. This alone took us from 85% to 100% success rate on C1–C8 complexity tasks.
We even built a cooldown system with WebSocket events so the UI shows when an agent is "resting" between tasks.
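The rest-and-reset schedule is simple enough to show in full. `execute` and `reset_context` are placeholders for your agent's task runner and Ollama context reset:

```python
import time

REST_SECONDS = 3   # pause between tasks
RESET_EVERY = 3    # full memory reset cadence

def run_batch(tasks, execute, reset_context):
    """Run tasks with a rest delay and periodic context reset.

    This mirrors the anti-pollution schedule described above:
    3 seconds of rest between tasks, full reset every 3 tasks.
    """
    results = []
    for i, task in enumerate(tasks, start=1):
        results.append(execute(task))
        if i % RESET_EVERY == 0:
            reset_context()            # wipe accumulated context
        if i < len(tasks):
            time.sleep(REST_SECONDS)   # let the instance "rest"
    return results
```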
### 3. 7B > 14B on 8GB VRAM (counterintuitive)
I tested qwen2.5-coder:14b-instruct-q4_K_M expecting better results from the larger model. Got a 40% pass rate vs 95% for the 7B model.
Why? The 14B model weighs in at ~9GB. On an 8GB VRAM card, it overflows into system RAM. CPU offloading makes inference slow enough that tool calling breaks down — the model times out or generates truncated responses.
The 7B model sits at ~6GB VRAM with room for context and tools. No CPU offload needed. If you have 8GB VRAM, 7B is your ceiling. Don't go bigger.
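A back-of-envelope check makes the ceiling obvious. Q4 quantization stores roughly half a byte per parameter; the fixed overhead figure here is an assumption for context and runtime buffers:

```python
def q4_vram_gb(params_billions: float, overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate for a 4-bit-quantized model.

    ~0.5 bytes/parameter for Q4 weights, plus an allowance for
    KV cache, tools, and runtime overhead. Back-of-envelope only.
    """
    return params_billions * 0.5 + overhead_gb

# 7B lands around 5.5 GB (fits an 8 GB card);
# 14B lands around 9 GB (spills into system RAM).
```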
### 4. Agent personas work surprisingly well
This one surprised me the most. Giving the model a "CodeX-7" elite military identity with a specific pattern — "one write, one verify, mission complete" — plus three concrete examples of ideal 3-step execution dramatically improved task completion.
Without the persona, the model would often loop: write code, read it back, rewrite it, read it again. With the persona, it follows the trained pattern: write the file, run the test, report results. Done.
The technical explanation is probably that the persona plus examples act as strong few-shot conditioning, biasing the model toward a specific execution trajectory. But honestly, it also just makes the logs more fun to read.
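For flavor, here is what such a system prompt might look like. The wording is a paraphrase, and the real prompt includes three full example missions rather than one:

```python
# Illustrative system prompt; not the project's exact wording.
PERSONA = """You are CodeX-7, an elite coding operative.
Doctrine: one write, one verify, mission complete.

Example mission (add two numbers):
  1. file_write("add.py", "<solution code>")
  2. shell_run("python -m pytest test_add.py")
  3. Report: MISSION COMPLETE.

Never re-read or rewrite a file you have already written.
"""
```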
## The Fun Part: C&C Red Alert UI
Because staring at terminal logs is boring, I built a Command & Conquer Red Alert-inspired interface:
- Bounty board — Task cards with complexity badges and priority colors
- Active missions strip — Real-time agent health indicators (green = idle, amber = working, red = stuck)
- Tool log — Terminal-style feed of every `file_write`, `shell_run`, and `file_read`
- Agent minimap — Visual representation of agents with connection lines
- Voice feedback — "Conscript reporting!" when an agent picks up a task, "Shake it baby!" on completion
- Cost dashboard — Real-time cost tracking with daily budget limits and token burn rate
Every agent action triggers a C&C voice line. When an agent gets stuck in a loop, a warning klaxon plays. It's ridiculous and I love it.
## Architecture
The whole system runs in Docker:
UI (React:5173) → API (Express:3001) → Agents (FastAPI:8000) → Ollama/Claude
↓
PostgreSQL:5432
- React UI — Real-time WebSocket updates, no polling
- Express API — Task routing, cost tracking, budget enforcement, rate limiting
- FastAPI + CrewAI — Agent orchestration with tool wrapping and execution logging
- Ollama — Local LLM with GPU passthrough
- PostgreSQL — Tasks, execution logs, code reviews, training data
Every tool call is captured to the database with timing, token usage, and cost. The system detects stuck tasks (>10 min timeout) and automatically recovers them. Loop detection prevents agents from repeating failed actions.
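Stuck-task detection reduces to a timestamp scan over the task table. A minimal sketch, assuming each task row exposes a `status` and a `last_activity` epoch timestamp (the real schema lives in PostgreSQL):

```python
import time

STUCK_AFTER_SECONDS = 10 * 60  # the >10 min timeout from the article

def find_stuck(tasks, now=None):
    """Return running tasks whose last activity exceeds the timeout.

    `tasks` is a list of dicts with 'status' and 'last_activity'
    (epoch seconds); field names here are illustrative.
    """
    now = time.time() if now is None else now
    return [
        t for t in tasks
        if t["status"] == "running"
        and now - t["last_activity"] > STUCK_AFTER_SECONDS
    ]
```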
Tasks can run in parallel when they use different resources — an Ollama task and a Claude task can execute simultaneously, yielding a 40–60% speedup on mixed-complexity batches.
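Resource-aware parallelism can be sketched with `asyncio`: tasks are grouped into lanes by backend, lanes run concurrently, and tasks within a lane run sequentially. `execute` is a placeholder async task runner, and the `backend` field is an assumed schema:

```python
import asyncio
from collections import defaultdict

async def run_mixed(tasks, execute):
    """Run tasks grouped by backend resource.

    Tasks on different backends (e.g. 'ollama' vs 'claude') run
    concurrently; tasks sharing a backend queue up sequentially.
    """
    lanes = defaultdict(list)
    for t in tasks:
        lanes[t["backend"]].append(t)

    async def drain(lane):
        # Sequential within a lane: one task at a time per resource.
        return [await execute(t) for t in lane]

    groups = await asyncio.gather(*(drain(l) for l in lanes.values()))
    return [r for g in groups for r in g]
```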
## Try It
```bash
git clone https://github.com/mrdushidush/agent-battle-command-center.git
cd agent-battle-command-center
cp .env.example .env
# Add your ANTHROPIC_API_KEY to .env
docker compose up --build
# Open http://localhost:5173
```
Requirements:
- Docker Desktop
- NVIDIA GPU with 8GB+ VRAM (recommended) — or CPU-only mode (slower)
- Anthropic API key (only needed for complex tasks — Ollama tasks are free)
The first startup takes ~5 minutes to download the Ollama model.
## What's Next
The project is fully open source (MIT). Some things I'm working on:
- Multi-language support — Currently Python-only; adding JavaScript/TypeScript
- Demo mode — Simulated agents so anyone can try the UI without a GPU
- Docker Hub image — One-command deploy without building
- More voice packs — Community suggestions include StarCraft and Age of Empires
There are 8 good-first-issues open if you want to contribute. We already merged our first community PR (keyboard shortcuts) on day 3.
GitHub: agent-battle-command-center
Come hang out in Discussions if you want to chat about AI agent orchestration, cost optimization, or which RTS game had the best unit voice lines.
"One write, one verify, mission complete." — CodeX-7
