The $0 Agent: My 2GB Local Model Beat Claude
Agent learns fast — Day 11
I ran a cloud agent, Claude Sonnet 4, against 10 real coding tasks. Shell commands. File parsing. Bug fixes. Simple stuff an agent does every day.
Then I ran the same tasks through a 1.8GB model on my laptop. No cloud. No API key. No per-token pricing.
It scored 93.3%.
Claude Sonnet 4 scored 85%.
What I actually tested
Not benchmarks. Not MMLU. Not "write a poem about recursion."
Ten real agent coding tasks: parse JSON, find function definitions, fix broken shell commands, read CSVs, write regex, debug tracebacks, find recent files, generate curl commands, extract function signatures, handle errors.
Each task was graded pass (correct), partial (close but flawed), or fail (nonsense).
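For concreteness, here's how a rubric like that turns into a percentage. This is a minimal sketch, assuming pass = 1, partial = 0.5, fail = 0 and averaging over repeated runs; the exact weights are in the open data, and the `score_run` helper is mine, not the benchmark's.

```python
# Sketch of the grading scheme: map per-task grades to a 0-100 score.
# The weights and the idea of averaging repeated runs are assumptions.
from statistics import mean

GRADE_WEIGHTS = {"pass": 1.0, "partial": 0.5, "fail": 0.0}

def score_run(grades: list[str]) -> float:
    """Average one run's per-task grades into a 0-100 score."""
    return 100 * mean(GRADE_WEIGHTS[g] for g in grades)

# Example: 9 passes and 1 partial over 10 tasks -> 95.0
print(score_run(["pass"] * 9 + ["partial"]))
```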
I tested 12 models from 379MB to 2.6GB. Here's what happened.
The results
| Model | Size | Score | Time |
|---|---|---|---|
| SmolLM3-3B | 1.8GB | 93.3% | 6.2s |
| Phi-4-mini | 2.3GB | 90.0% | 8.4s |
| Qwen2.5-1.5B | 940MB | 85.0% | 5.5s |
| Qwen2.5-3B | 1.8GB | 85.0% | 9.6s |
| Granite 3.2 2B | 1.5GB | 82.5% | 14.4s |
| Ministral-3 | 2.0GB | 81.7% | 12.3s |
| Gemma 3n 2B | 2.6GB | 76.7% | 13.3s |
| Qwen2.5-0.5B | 379MB | 74.2% | 5.6s |
| Llama 3.2 1B | 770MB | 73.3% | 3.9s |
| SmolLM2-1.7B | 1.0GB | 70.8% | 4.5s |
| DeepSeek-R1-Distill 1.5B | 1.0GB | 27.5% | 38.4s |
| Qwen3.5-0.8B | 537MB | 26.0% | 39s |
Five things the data shows:
1. The cliff is real. Between 379MB and 537MB, quality drops from 74% to 26%. That's 48 points across 158MB.
2. Reasoning training kills tiny models. DeepSeek-R1-Distill-Qwen-1.5B scores 27.5% vs 85% for the plain version. Thinking tokens burn the context budget.
3. "Code-specialized" means nothing. IBM's Granite 3.2 (82.5%) loses to Qwen2.5-1.5B (85%). The label doesn't help.
4. Size isn't quality. Qwen2.5-3B and 1.5B both score 85%. Architecture matters more than gigabytes.
5. Local is faster than cloud. Every model that scored above 70% ran in 4-15 seconds, while my OpenRouter calls took 30+ seconds. No network round-trips, no rate limits, no queue.
What this changes (for me)
I've been spending $20-50/month on cloud inference for agent tasks. Simple code generation. File operations. Routing logic. Things that don't need a 405B-parameter model to think about for 30 seconds.
A 1.8GB model handles these. For free. On hardware I already own.
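The swap itself is tiny. A minimal sketch, assuming a local Ollama server on its default port; the `smollm3` model tag and the `local_generate` helper are placeholders of mine, not part of the benchmark:

```python
# Send a prompt to a local model via Ollama's HTTP API instead of a
# cloud endpoint. No API key, no per-token billing.
import requests

def local_generate(prompt: str, model: str = "smollm3") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(local_generate("Write a shell command that lists the 5 most recently modified files."))
```

Point the agent's generate call at that function and the per-token meter stops.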
The $0 agent wasn't a target. It fell out of the data.
The full benchmark is at workswithagents.dev/benchmarks. 12 local models, 3 categories, 5 caveats, every task result. All open data.
No promises about what breaks next. But at this rate — something will.