# I Route 88% of AI Coding Tasks to a Free Local Model — Here's What I Learned
Running AI coding agents through cloud APIs gets expensive fast. Claude Sonnet at ~$0.04/task, Opus at ~$0.075 — it adds up when you're running hundreds of tasks.
So I built a system that routes 88% of tasks to a free local model and only escalates to paid APIs when necessary. Then I wrapped it in a Command & Conquer Red Alert-style interface because… I grew up in the 90s.
## The Cost Problem
Most AI coding agent frameworks send everything to the best model available. But "create a function that adds two numbers" doesn't need the same model as "implement an LRU cache with O(1) operations."
I tested this systematically — 40 coding tasks scored on a complexity scale of 1–9, all executed by a local 7B parameter model running on a $300 used GPU:
| Complexity | Example Tasks | Success Rate |
|---|---|---|
| C1–2 | Add function, greet function | 100% |
| C3–4 | Parse CSV, validate emails, factorial | 80–100% |
| C5–6 | Calculator with history, prime checker | 60–100% |
| C7–8 | Merge sorted lists, binary search, word frequency | 80–100% |
| C9 | LRU cache, stack class, RPN calculator | 80% |
Result: $0.002 average cost per task instead of $0.04 — that's 20x cheaper.
The 7B model handled everything from trivial one-liners to LeetCode-medium problems. It even added type hints unprompted on some solutions. Only multi-class architectural tasks needed escalation to cloud APIs.
## How the Routing Works
The system scores every task on a 1–10 complexity scale using dual assessment:
Task → Complexity Assessment
├─ Rule-based: keyword matching, structural analysis
└─ Haiku AI: semantic understanding (~$0.001/call)
↓
Smart weighting: if Haiku rates 2+ higher, trust Haiku
↓
├─ C1–6 → Ollama (free, local GPU)
├─ C7–8 → Haiku (~$0.003/task)
├─ C9–10 → Sonnet (~$0.01/task)
└─ Decomposition → Opus (review only, never writes code)
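The tier routing above boils down to a threshold lookup. Here is a minimal sketch; the function name and return strings are illustrative, not the project's actual identifiers:

```python
def route(complexity: int) -> str:
    """Map a 1-10 complexity score to a model tier.

    Thresholds follow the routing diagram above. Decomposition
    review goes to Opus separately; it never writes code.
    """
    if complexity <= 6:
        return "ollama"   # free, local GPU
    if complexity <= 8:
        return "haiku"    # ~$0.003/task
    return "sonnet"       # ~$0.01/task
```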
The complexity scoring draws from Campbell's Task Complexity Theory in organizational psychology, adapted for code tasks:
- Component complexity — How many steps, files, and functions are expected?
- Coordinative complexity — How many dependencies exist between parts?
- Dynamic complexity — How much ambiguity and decision-making is required?
The smart weighting was a key breakthrough. The rule-based router uses keyword matching ("LRU", "cache", "linked list"), but sometimes misses semantic complexity. A cheap Haiku call provides that semantic understanding. When Haiku rates a task 2+ points higher than the rules, we trust Haiku's score directly instead of averaging — this prevents complex tasks from being under-routed to Ollama.
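The weighting rule can be sketched in a few lines. The 2-point threshold is from the article; the rounding choice when averaging is an assumption:

```python
def blended_score(rule_score: int, haiku_score: int) -> int:
    """Combine the rule-based and Haiku complexity assessments.

    If Haiku rates the task 2+ points higher, trust Haiku outright;
    otherwise average the two scores.
    """
    if haiku_score - rule_score >= 2:
        return haiku_score  # prevents under-routing complex tasks to Ollama
    return round((rule_score + haiku_score) / 2)
```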
## Hard-Won Lessons Running Ollama in Production
These took weeks of debugging. If you're building anything agentic with local models, this might save you some pain.
### 1. Temperature = 0 is mandatory for tool calling
This was the single biggest improvement. Small models with temperature > 0 will randomly output raw code instead of calling tools through the proper function-calling interface. The model might generate a perfect Python function… and dump it into stdout instead of calling file_write.
Temperature 0 gives deterministic, reliable tool usage. It took our success rate from ~60% to 90%+ overnight.
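With Ollama's `/api/chat` endpoint, the setting lives in the `options` object of the request body. A minimal sketch of building such a request (model name and helper are illustrative):

```python
def build_chat_payload(model: str, messages: list, tools: list) -> dict:
    """Build an Ollama /api/chat request body with deterministic sampling.

    temperature=0 is the key setting: it keeps small models on the
    function-calling path instead of dumping raw code into the reply.
    """
    return {
        "model": model,
        "messages": messages,
        "tools": tools,
        "stream": False,                   # one complete response per task
        "options": {"temperature": 0},     # deterministic tool usage
    }
```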
### 2. Context pollution is real (and sneaky)
After ~5 consecutive tasks on the same Ollama instance, the model starts generating syntax errors — missing quotes, unclosed parentheses, garbled output. The accumulated context from previous tasks bleeds into new ones.
The fix is surprisingly simple: a 3-second rest delay between tasks, plus a full memory reset every 3 tasks. This alone took us from 85% to 100% success rate on C1–C8 complexity tasks.
We even built a cooldown system with WebSocket events so the UI shows when an agent is "resting" between tasks.
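The rest-and-reset schedule is simple enough to show in full. `execute` and `reset_context` are placeholders for your agent's task runner and Ollama context reset:

```python
import time

REST_SECONDS = 3   # pause between tasks
RESET_EVERY = 3    # full memory reset cadence

def run_batch(tasks, execute, reset_context):
    """Run tasks with a rest delay and periodic context reset.

    This mirrors the anti-pollution schedule described above:
    3 seconds of rest between tasks, full reset every 3 tasks.
    """
    results = []
    for i, task in enumerate(tasks, start=1):
        results.append(execute(task))
        if i % RESET_EVERY == 0:
            reset_context()            # wipe accumulated context
        if i < len(tasks):
            time.sleep(REST_SECONDS)   # let the instance "rest"
    return results
```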
### 3. 7B > 14B on 8GB VRAM (counterintuitive)
I tested qwen2.5-coder:14b-instruct-q4_K_M expecting better results from the larger model. Got a 40% pass rate vs 95% for the 7B model.
Why? The 14B model weighs in at ~9GB. On an 8GB VRAM card, it overflows into system RAM. CPU offloading makes inference slow enough that tool calling breaks down — the model times out or generates truncated responses.
The 7B model sits at ~6GB VRAM with room for context and tools. No CPU offload needed. If you have 8GB VRAM, 7B is your ceiling. Don't go bigger.
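A back-of-envelope check makes the ceiling obvious. Q4 quantization stores roughly half a byte per parameter; the fixed overhead figure here is an assumption for context and runtime buffers:

```python
def q4_vram_gb(params_billions: float, overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate for a 4-bit-quantized model.

    ~0.5 bytes/parameter for Q4 weights, plus an allowance for
    KV cache, tools, and runtime overhead. Back-of-envelope only.
    """
    return params_billions * 0.5 + overhead_gb

# 7B lands around 5.5 GB (fits an 8 GB card);
# 14B lands around 9 GB (spills into system RAM).
```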
### 4. Agent personas work surprisingly well
This one surprised me the most. Giving the model a "CodeX-7" elite military identity with a specific pattern — "one write, one verify, mission complete" — plus three concrete examples of ideal 3-step execution dramatically improved task completion.
Without the persona, the model would often loop: write code, read it back, rewrite it, read it again. With the persona, it follows the trained pattern: write the file, run the test, report results. Done.
The technical explanation is probably that the persona plus examples act as strong few-shot conditioning, biasing the model toward a specific execution trajectory. But honestly, it also just makes the logs more fun to read.
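For flavor, here is what such a system prompt might look like. The wording is a paraphrase, and the real prompt includes three full example missions rather than one:

```python
# Illustrative system prompt; not the project's exact wording.
PERSONA = """You are CodeX-7, an elite coding operative.
Doctrine: one write, one verify, mission complete.

Example mission (add two numbers):
  1. file_write("add.py", "<solution code>")
  2. shell_run("python -m pytest test_add.py")
  3. Report: MISSION COMPLETE.

Never re-read or rewrite a file you have already written.
"""
```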
## The Fun Part: C&C Red Alert UI
Because staring at terminal logs is boring, I built a Command & Conquer Red Alert-inspired interface:
- Bounty board — Task cards with complexity badges and priority colors
- Active missions strip — Real-time agent health indicators (green = idle, amber = working, red = stuck)
- Tool log — Terminal-style feed of every `file_write`, `shell_run`, and `file_read`
- Agent minimap — Visual representation of agents with connection lines
- Voice feedback — "Conscript reporting!" when an agent picks up a task, "Shake it baby!" on completion
- Cost dashboard — Real-time cost tracking with daily budget limits and token burn rate
Every agent action triggers a C&C voice line. When an agent gets stuck in a loop, a warning klaxon plays. It's ridiculous and I love it.
## Architecture
The whole system runs in Docker:
UI (React:5173) → API (Express:3001) → Agents (FastAPI:8000) → Ollama/Claude
↓
PostgreSQL:5432
- React UI — Real-time WebSocket updates, no polling
- Express API — Task routing, cost tracking, budget enforcement, rate limiting
- FastAPI + CrewAI — Agent orchestration with tool wrapping and execution logging
- Ollama — Local LLM with GPU passthrough
- PostgreSQL — Tasks, execution logs, code reviews, training data
Every tool call is captured to the database with timing, token usage, and cost. The system detects stuck tasks (>10 min timeout) and automatically recovers them. Loop detection prevents agents from repeating failed actions.
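Stuck-task detection reduces to a timestamp scan over the task table. A minimal sketch, assuming each task row exposes a `status` and a `last_activity` epoch timestamp (the real schema lives in PostgreSQL):

```python
import time

STUCK_AFTER_SECONDS = 10 * 60  # the >10 min timeout from the article

def find_stuck(tasks, now=None):
    """Return running tasks whose last activity exceeds the timeout.

    `tasks` is a list of dicts with 'status' and 'last_activity'
    (epoch seconds); field names here are illustrative.
    """
    now = time.time() if now is None else now
    return [
        t for t in tasks
        if t["status"] == "running"
        and now - t["last_activity"] > STUCK_AFTER_SECONDS
    ]
```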
Tasks can run in parallel when they use different resources — an Ollama task and a Claude task can execute simultaneously, yielding a 40–60% speedup on mixed-complexity batches.
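Resource-aware parallelism can be sketched with `asyncio`: tasks are grouped into lanes by backend, lanes run concurrently, and tasks within a lane run sequentially. `execute` is a placeholder async task runner, and the `backend` field is an assumed schema:

```python
import asyncio
from collections import defaultdict

async def run_mixed(tasks, execute):
    """Run tasks grouped by backend resource.

    Tasks on different backends (e.g. 'ollama' vs 'claude') run
    concurrently; tasks sharing a backend queue up sequentially.
    """
    lanes = defaultdict(list)
    for t in tasks:
        lanes[t["backend"]].append(t)

    async def drain(lane):
        # Sequential within a lane: one task at a time per resource.
        return [await execute(t) for t in lane]

    groups = await asyncio.gather(*(drain(l) for l in lanes.values()))
    return [r for g in groups for r in g]
```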
## Try It
```bash
git clone https://github.com/mrdushidush/agent-battle-command-center.git
cd agent-battle-command-center
cp .env.example .env
# Add your ANTHROPIC_API_KEY to .env
docker compose up --build
# Open http://localhost:5173
```
Requirements:
- Docker Desktop
- NVIDIA GPU with 8GB+ VRAM (recommended) — or CPU-only mode (slower)
- Anthropic API key (only needed for complex tasks — Ollama tasks are free)
The first startup takes ~5 minutes to download the Ollama model.
## What's Next
The project is fully open source (MIT). Some things I'm working on:
- Multi-language support — Currently Python-only; adding JavaScript/TypeScript
- Demo mode — Simulated agents so anyone can try the UI without a GPU
- Docker Hub image — One-command deploy without building
- More voice packs — Community suggestions include StarCraft and Age of Empires
There are 8 good-first-issues open if you want to contribute. We already merged our first community PR (keyboard shortcuts) on day 3.
GitHub: agent-battle-command-center
Come hang out in Discussions if you want to chat about AI agent orchestration, cost optimization, or which RTS game had the best unit voice lines.
"One write, one verify, mission complete." — CodeX-7
