<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: david</title>
    <description>The latest articles on DEV Community by david (@mrdushidush).</description>
    <link>https://dev.to/mrdushidush</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3769586%2Fc83b985f-d5a2-4fc9-a18c-f70e5333d0ab.png</url>
      <title>DEV Community: david</title>
      <link>https://dev.to/mrdushidush</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mrdushidush"/>
    <language>en</language>
    <item>
      <title>How I Run 88% of AI Coding Tasks for Free on a $300 GPU (and Built a C&amp;C Red Alert UI for It)</title>
      <dc:creator>david</dc:creator>
      <pubDate>Thu, 12 Feb 2026 20:04:33 +0000</pubDate>
      <link>https://dev.to/mrdushidush/how-i-run-88-of-ai-coding-tasks-for-free-on-a-300-gpu-and-built-a-cc-red-alert-ui-for-it-3o7j</link>
      <guid>https://dev.to/mrdushidush/how-i-run-88-of-ai-coding-tasks-for-free-on-a-300-gpu-and-built-a-cc-red-alert-ui-for-it-3o7j</guid>
      <description>&lt;p&gt;`# I Route 88% of AI Coding Tasks to a Free Local Model — Here's What I Learned&lt;/p&gt;

&lt;p&gt;Running AI coding agents through cloud APIs gets expensive fast. Claude Sonnet at ~$0.04/task, Opus at ~$0.075 — it adds up when you're running hundreds of tasks.&lt;/p&gt;

&lt;p&gt;So I built a system that routes 88% of tasks to a free local model and only escalates to paid APIs when necessary. Then I wrapped it in a Command &amp;amp; Conquer Red Alert-style interface because… I grew up in the 90s.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc111cduc901t7velczxa.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc111cduc901t7velczxa.gif" alt="Command Center Demo" width="480" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The Cost Problem&lt;/h2&gt;

&lt;p&gt;Most AI coding agent frameworks send everything to the best model available. But "create a function that adds two numbers" doesn't need the same model as "implement an LRU cache with O(1) operations."&lt;/p&gt;

&lt;p&gt;I tested this systematically — 40 coding tasks scored on a complexity scale of 1–9, all executed by a local 7B parameter model running on a $300 used GPU:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;th&gt;Example Tasks&lt;/th&gt;
&lt;th&gt;Success Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C1–2&lt;/td&gt;
&lt;td&gt;Add function, greet function&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C3–4&lt;/td&gt;
&lt;td&gt;Parse CSV, validate emails, factorial&lt;/td&gt;
&lt;td&gt;80–100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C5–6&lt;/td&gt;
&lt;td&gt;Calculator with history, prime checker&lt;/td&gt;
&lt;td&gt;60–100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C7–8&lt;/td&gt;
&lt;td&gt;Merge sorted lists, binary search, word frequency&lt;/td&gt;
&lt;td&gt;80–100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C9&lt;/td&gt;
&lt;td&gt;LRU cache, stack class, RPN calculator&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result: $0.002 average cost per task instead of $0.04 — that's 20x cheaper.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The 7B model handled everything from trivial one-liners to LeetCode-medium problems. It even added type hints unprompted on some solutions. Only multi-class architectural tasks needed escalation to cloud APIs.&lt;/p&gt;

&lt;h2&gt;How the Routing Works&lt;/h2&gt;

&lt;p&gt;The system scores every task on a 1–10 complexity scale using dual assessment:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Task → Complexity Assessment
  ├─ Rule-based: keyword matching, structural analysis
  └─ Haiku AI: semantic understanding (~$0.001/call)
      ↓
  Smart weighting: if Haiku rates 2+ higher, trust Haiku
      ↓
  ├─ C1–6  → Ollama (free, local GPU)
  ├─ C7–8  → Haiku (~$0.003/task)
  ├─ C9–10 → Sonnet (~$0.01/task)
  └─ Decomposition → Opus (review only, never writes code)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The complexity scoring draws from Campbell's Task Complexity Theory in organizational psychology, adapted for code tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Component complexity&lt;/strong&gt; — How many steps, files, and functions are expected?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coordinative complexity&lt;/strong&gt; — How many dependencies exist between parts?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic complexity&lt;/strong&gt; — How much ambiguity and decision-making is required?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The smart weighting was a key breakthrough. The rule-based router uses keyword matching ("LRU", "cache", "linked list"), but sometimes misses semantic complexity. A cheap Haiku call provides that semantic understanding. When Haiku rates a task 2+ points higher than the rules, we trust Haiku's score directly instead of averaging — this prevents complex tasks from being under-routed to Ollama.&lt;/p&gt;
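The routing and smart-weighting rules above fit in a few lines. This is a minimal Python sketch, not the production router; `weighted_score` and `route` are illustrative names, with the tier thresholds and per-tier prices taken from the diagram above.

```python
# Sketch of the dual-assessment router: combine a rule-based score and a
# Haiku score, then map the result to an execution tier.

def weighted_score(rule_score: int, haiku_score: int) -> int:
    """Smart weighting: if Haiku rates the task 2+ points higher than the
    rule-based score, trust Haiku outright instead of averaging."""
    if haiku_score - rule_score >= 2:
        return haiku_score
    return round((rule_score + haiku_score) / 2)

def route(score: int) -> str:
    """Map a 1-10 complexity score to an execution tier."""
    if score >= 9:
        return "sonnet"   # C9-10, ~$0.01/task
    if score >= 7:
        return "haiku"    # C7-8, ~$0.003/task
    return "ollama"       # C1-6, free local GPU
```

The key property is that a semantic red flag from Haiku can never be averaged away by an optimistic rule-based score.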

&lt;h2&gt;Hard-Won Lessons Running Ollama in Production&lt;/h2&gt;

&lt;p&gt;These took weeks of debugging. If you're building anything agentic with local models, this might save you some pain.&lt;/p&gt;

&lt;h3&gt;1. Temperature = 0 is mandatory for tool calling&lt;/h3&gt;

&lt;p&gt;This was the single biggest improvement. Small models with temperature &amp;gt; 0 will randomly output raw code instead of calling tools through the proper function-calling interface. The model might generate a perfect Python function… and dump it into stdout instead of calling &lt;code&gt;file_write&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Temperature 0 gives deterministic, reliable tool usage. It took our success rate from ~60% to 90%+ overnight.&lt;/p&gt;
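Concretely, the sampler is pinned per request: Ollama's `/api/chat` endpoint accepts an `options` object for sampler parameters. The payload below is a sketch; the model name and tool list are illustrative.

```python
# Build a deterministic Ollama chat request for tool calling.
# Setting options.temperature to 0 removes the random drift that makes
# small models dump raw code to stdout instead of calling tools.

def build_ollama_request(messages, tools):
    return {
        "model": "qwen2.5-coder:7b",
        "messages": messages,
        "tools": tools,
        "stream": False,
        "options": {
            "temperature": 0,  # deterministic decoding, reliable tool usage
        },
    }
```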

&lt;h3&gt;2. Context pollution is real (and sneaky)&lt;/h3&gt;

&lt;p&gt;After ~5 consecutive tasks on the same Ollama instance, the model starts generating syntax errors — missing quotes, unclosed parentheses, garbled output. The accumulated context from previous tasks bleeds into new ones.&lt;/p&gt;

&lt;p&gt;The fix is surprisingly simple: a 3-second rest delay between tasks, plus a full memory reset every 3 tasks. This alone took us from 85% to 100% success rate on C1–C8 complexity tasks.&lt;/p&gt;

&lt;p&gt;We even built a cooldown system with WebSocket events so the UI shows when an agent is "resting" between tasks.&lt;/p&gt;
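The whole rest/reset policy reduces to a small scheduler. This sketch uses hypothetical names; in our stack the reset hook clears the agent's accumulated Ollama context.

```python
import time

# Rest/reset policy: a short pause after every task, plus a full context
# reset every third task, before accumulated context starts to bleed.

class CooldownScheduler:
    def __init__(self, rest_seconds=3.0, reset_every=3):
        self.rest_seconds = rest_seconds
        self.reset_every = reset_every
        self.completed = 0

    def after_task(self, reset_context, sleep=time.sleep):
        self.completed += 1
        if self.completed % self.reset_every == 0:
            reset_context()        # wipe context on tasks 3, 6, 9, ...
        sleep(self.rest_seconds)   # brief rest before the next task
```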

&lt;h3&gt;3. 7B &amp;gt; 14B on 8GB VRAM (counterintuitive)&lt;/h3&gt;

&lt;p&gt;I tested &lt;code&gt;qwen2.5-coder:14b-instruct-q4_K_M&lt;/code&gt; expecting better results from the larger model. Got a &lt;strong&gt;40% pass rate&lt;/strong&gt; vs &lt;strong&gt;95% for the 7B model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why? The 14B model weighs in at ~9GB. On an 8GB VRAM card, it overflows into system RAM. CPU offloading makes inference slow enough that tool calling breaks down — the model times out or generates truncated responses.&lt;/p&gt;

&lt;p&gt;The 7B model sits at ~6GB VRAM with room for context and tools. No CPU offload needed. &lt;strong&gt;If you have 8GB VRAM, 7B is your ceiling. Don't go bigger.&lt;/strong&gt;&lt;/p&gt;
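You can sanity-check the fit with back-of-the-envelope arithmetic: quantized weight size is roughly parameters times bits per weight divided by 8, plus an overhead term for KV cache and runtime buffers. The 2 GB overhead default below is an assumption, not a measurement.

```python
# Rough VRAM fit check for a quantized model.
# weights_gb: 7B at q4 is about 3.5 GB of weights; 14B at q4 is about 7 GB,
# which plus overhead overflows an 8 GB card into system RAM.

def fits_in_vram(params_billion, quant_bits, vram_gb, overhead_gb=2.0):
    weights_gb = params_billion * quant_bits / 8
    return vram_gb >= weights_gb + overhead_gb
```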

&lt;h3&gt;4. Agent personas work surprisingly well&lt;/h3&gt;

&lt;p&gt;This one surprised me the most. Giving the model a "CodeX-7" elite military identity with a specific pattern — "one write, one verify, mission complete" — plus three concrete examples of ideal 3-step execution dramatically improved task completion.&lt;/p&gt;

&lt;p&gt;Without the persona, the model would often loop: write code, read it back, rewrite it, read it again. With the persona, it follows the trained pattern: write the file, run the test, report results. Done.&lt;/p&gt;

&lt;p&gt;The technical explanation is probably that the persona plus examples act as strong few-shot conditioning, biasing the model toward a specific execution trajectory. But honestly, it also just makes the logs more fun to read.&lt;/p&gt;
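The shape of that system prompt looks roughly like the sketch below. The exact wording is a hypothetical reconstruction; what mattered in practice was the structure: identity, the one-write-one-verify pattern, and three worked examples.

```python
# Illustrative persona + few-shot system prompt builder.
# The three examples anchor the model on a 3-step trajectory:
# write the file, run the test, report results.

EXAMPLES = [
    "file_write(solution.py) -> shell_run(pytest) -> report: PASS",
    "file_write(parser.py) -> shell_run(pytest) -> report: PASS",
    "file_write(cache.py) -> shell_run(pytest) -> report: PASS",
]

def build_system_prompt():
    lines = [
        "You are CodeX-7, an elite coding operative.",
        "Pattern: one write, one verify, mission complete.",
        "Ideal executions:",
    ]
    lines.extend(f"  {i + 1}. {ex}" for i, ex in enumerate(EXAMPLES))
    return "\n".join(lines)
```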

&lt;h2&gt;The Fun Part: C&amp;amp;C Red Alert UI&lt;/h2&gt;

&lt;p&gt;Because staring at terminal logs is boring, I built a Command &amp;amp; Conquer Red Alert-inspired interface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bounty board&lt;/strong&gt; — Task cards with complexity badges and priority colors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Active missions strip&lt;/strong&gt; — Real-time agent health indicators (green = idle, amber = working, red = stuck)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool log&lt;/strong&gt; — Terminal-style feed of every &lt;code&gt;file_write&lt;/code&gt;, &lt;code&gt;shell_run&lt;/code&gt;, &lt;code&gt;file_read&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent minimap&lt;/strong&gt; — Visual representation of agents with connection lines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice feedback&lt;/strong&gt; — "Conscript reporting!" when an agent picks up a task, "Shake it baby!" on completion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost dashboard&lt;/strong&gt; — Real-time cost tracking with daily budget limits and token burn rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every agent action triggers a C&amp;amp;C voice line. When an agent gets stuck in a loop, a warning klaxon plays. It's ridiculous and I love it.&lt;/p&gt;

&lt;h2&gt;Architecture&lt;/h2&gt;

&lt;p&gt;The whole system runs in Docker:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;UI (React:5173) → API (Express:3001) → Agents (FastAPI:8000) → Ollama/Claude
                         ↓
                   PostgreSQL:5432
&lt;/code&gt;&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;React UI&lt;/strong&gt; — Real-time WebSocket updates, no polling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Express API&lt;/strong&gt; — Task routing, cost tracking, budget enforcement, rate limiting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI + CrewAI&lt;/strong&gt; — Agent orchestration with tool wrapping and execution logging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; — Local LLM with GPU passthrough&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; — Tasks, execution logs, code reviews, training data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every tool call is captured to the database with timing, token usage, and cost. The system detects stuck tasks (&amp;gt;10 min timeout) and automatically recovers them. Loop detection prevents agents from repeating failed actions.&lt;/p&gt;
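A minimal sketch of that watchdog logic, using the 10-minute threshold from above; the data shapes and function names are illustrative, not the production schema.

```python
import time

# Watchdog: flag tasks running past the timeout, and detect an agent
# repeating the same action (the loop signature of a stuck model).

STUCK_AFTER_SECONDS = 10 * 60

def find_stuck(tasks, now=None):
    """tasks: list of dicts with 'id', 'started_at' (epoch), 'done' (bool)."""
    now = time.time() if now is None else now
    return [t["id"] for t in tasks
            if not t["done"] and now - t["started_at"] > STUCK_AFTER_SECONDS]

def is_looping(recent_actions, window=3):
    """True if the last `window` actions are identical."""
    if window > len(recent_actions):
        return False
    tail = recent_actions[-window:]
    return len(set(tail)) == 1
```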

&lt;p&gt;Tasks can run in parallel when they use different resources — an Ollama task and a Claude task can execute simultaneously, yielding a 40–60% speedup on mixed-complexity batches.&lt;/p&gt;
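That resource-aware parallelism can be sketched with one lock per backend: tasks on different backends interleave freely, while tasks sharing a backend queue behind its lock. Names and task shapes here are illustrative.

```python
import asyncio

# Resource-aware batch runner: an Ollama task and a Claude task run
# concurrently; two Ollama tasks serialize on the "ollama" lock.

async def run_task(task, locks, execute):
    async with locks[task["resource"]]:  # serialize within one backend
        return await execute(task)

async def run_batch(tasks, execute):
    locks = {"ollama": asyncio.Lock(), "claude": asyncio.Lock()}
    return await asyncio.gather(*(run_task(t, locks, execute) for t in tasks))
```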

&lt;h2&gt;Try It&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;git clone https://github.com/mrdushidush/agent-battle-command-center.git
cd agent-battle-command-center
cp .env.example .env

# Add your ANTHROPIC_API_KEY to .env
docker compose up --build

# Open http://localhost:5173
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker Desktop&lt;/li&gt;
&lt;li&gt;NVIDIA GPU with 8GB+ VRAM (recommended) — or CPU-only mode (slower)&lt;/li&gt;
&lt;li&gt;Anthropic API key (only needed for complex tasks — Ollama tasks are free)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first startup takes ~5 minutes to download the Ollama model.&lt;/p&gt;

&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;The project is fully open source (MIT). Some things I'm working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-language support&lt;/strong&gt; — Currently Python-only; adding JavaScript/TypeScript&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Demo mode&lt;/strong&gt; — Simulated agents so anyone can try the UI without a GPU&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Hub image&lt;/strong&gt; — One-command deploy without building&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More voice packs&lt;/strong&gt; — Community suggestions include StarCraft and Age of Empires&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are 8 good-first-issues open if you want to contribute. We already merged our first community PR (keyboard shortcuts) on day 3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/mrdushidush/agent-battle-command-center" rel="noopener noreferrer"&gt;agent-battle-command-center&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Come hang out in &lt;a href="https://github.com/mrdushidush/agent-battle-command-center/discussions" rel="noopener noreferrer"&gt;Discussions&lt;/a&gt; if you want to chat about AI agent orchestration, cost optimization, or which RTS game had the best unit voice lines.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;"One write, one verify, mission complete." — CodeX-7&lt;/em&gt;`&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
