We Tested 10 Untested LLMs on Agent Coding — The Results Are In
Yesterday I promised to benchmark 10 LLMs that have never been tested on real agent coding tasks. I ran all 10 overnight. Some surprised me. Some embarrassed themselves.
## The board
10 models. 10 tasks each. Tasks are real agent work: parse JSON, write regex, fix a bug, query SQL, handle errors. Full pass requires correct, working code.
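To make that concrete, here's a hypothetical sketch of what one task and its checker could look like. The prompt, function names, and test cases are mine, not the harness's actual contents, and per-case partial credit is an assumption (it would explain the non-round scores in the table):

```python
import json

# Hypothetical task prompt -- illustrative only, not the benchmark's wording.
TASK = ("Write a function parse_config(text) that parses a JSON string and "
        "returns its 'settings' dict, or {} for invalid or missing input.")

def reference_parse_config(text):
    """Reference solution the checker compares against."""
    try:
        return json.loads(text).get("settings", {})
    except (json.JSONDecodeError, AttributeError):
        return {}

def score(submission) -> float:
    """Score a submitted function: 1.0 is a full pass, in between is a partial."""
    cases = [
        '{"settings": {"retries": 3}}',  # happy path
        'not json at all',               # must not raise
        '{"other": 1}',                  # key missing
    ]
    passed = 0
    for text in cases:
        try:
            if submission(text) == reference_parse_config(text):
                passed += 1
        except Exception:
            pass  # an uncaught exception fails the case
    return passed / len(cases)
```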
| Model | Score | Pass / partial / fail | Cost/task |
|---|---|---|---|
| Grok 4.20 | 75.0% | 6 / 3 / 1 | $0.0003 |
| Grok 4.1 Fast | 74.9% | 6 / 2 / 2 | $0.0009 |
| Xiaomi MiMo V2.5 Pro | 68.2% | 7 / 0 / 3 | $0.001 |
| Ring 2.6 (free) | 65.0% | 6 / 1 / 3 | free |
| DeepSeek V4 Flash | 60.0% | 4 / 3 / 3 | $0.0001 |
| GPT-5.4 Pro | 51.6% | 5 / 1 / 4 | $0.06 |
| GPT-5.5 Pro | 43.3% | 4 / 1 / 5 | $0.065 |
| DeepSeek V4 Pro | 38.3% | 4 / 0 / 6 | $0.001 |
| Google Lyria 3 Pro | 8.3% | 1 / 0 / 9 | free (preview) |
| Google Lyria 3 Clip | 0.0% | 0 / 0 / 10 | free (preview) |
Total cost: $1.37 for the entire run.
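The per-task figures in the table are rounded, but they roughly reproduce that number:

```python
# Sanity check: reconstruct the run cost from the table's (rounded) per-task figures.
cost_per_task = {
    "Grok 4.20": 0.0003, "Grok 4.1 Fast": 0.0009, "Xiaomi MiMo V2.5 Pro": 0.001,
    "Ring 2.6": 0.0, "DeepSeek V4 Flash": 0.0001, "GPT-5.4 Pro": 0.06,
    "GPT-5.5 Pro": 0.065, "DeepSeek V4 Pro": 0.001,
    "Lyria 3 Pro": 0.0, "Lyria 3 Clip": 0.0,
}
total = sum(cost * 10 for cost in cost_per_task.values())  # 10 tasks each
print(f"~${total:.2f}")  # ~$1.28; table rounding covers the gap to $1.37
```

Note where the money went: the two GPT Pro variants account for nearly the entire bill.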
## What jumped out
Grok 4.20 won, though the margin over second place is thin. What sets it apart is speed: 14.5 seconds for all 10 tasks. Grok 4.1 Fast scored nearly identically but took 225 seconds. Same family, wildly different speed profiles.
The "Pro" suffix is a trap. GPT-5.4 Pro scored 51.6%. Regular GPT-5.4 scored 76.6% on the same tasks. GPT-5.5 Pro scored 43.3%. Regular GPT-5.5 scored 60%. The Pro variants are slower, more expensive, and worse at this specific workload. If you're building agents, the base models are better.
DeepSeek V4 Flash beat DeepSeek V4 Pro, 60.0% to 38.3%, and Flash costs a tenth as much per task. For agent coding, smaller and faster beats bigger and slower again.
Ring 2.6 is free and beats paid models. Six passes, one partial, $0.00. It outperforms both GPT Pro variants, DeepSeek V4 Pro, and both Lyria previews.
Google Lyria 3 is not ready. Clip failed every single task with 502 errors. Pro barely scored. Both are marked "preview" on OpenRouter. Fair enough — but worth knowing before you build on them.
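If you want to poke at a preview model anyway, wrap the call in retries with backoff so a flaky endpoint doesn't sink the whole run. Here's a minimal sketch of the pattern against OpenRouter's chat completions API (the model ID in the final comment is a guess, not a confirmed identifier):

```python
import os
import time
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def call_with_retry(model: str, prompt: str, retries: int = 3):
    """Call a model via OpenRouter, backing off on 5xx; None if every attempt fails."""
    for attempt in range(retries):
        resp = requests.post(
            OPENROUTER_URL,
            headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=60,
        )
        if resp.status_code < 500:
            resp.raise_for_status()  # surface 4xx errors immediately
            return resp.json()["choices"][0]["message"]["content"]
        time.sleep(2 ** attempt)  # 1s, 2s, 4s between 502s
    return None  # score it as a failed task instead of crashing the run

# "google/lyria-3-clip" is a guess at the ID -- check OpenRouter's model list.
```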
## Raw scores with context
For comparison, here's where these new models land against the existing leaderboard:
- Claude Sonnet 4 — 85.0%
- Mistral Large 3 — 79.6%
- Gemma 4 31B — 78.3%
- Gemma 4 26B A4B — 78.3%
- Qwen 3.6 Plus — 76.6%
- GPT-5.4 — 76.6%
- Gemini 2.5 Flash — 76.4%
- Kimi K2.6 — 75.0%
- Grok 4.20 — 75.0% ← new
- Grok 4.1 Fast — 74.9% ← new
- MiniMax M2.7 — 69.9%
- Xiaomi MiMo V2.5 Pro — 68.2% ← new
- Ring 2.6 — 65.0% ← new (free)
- GPT-5.5 — 60.0%
- DeepSeek V4 Flash — 60.0% ← new
- GPT-5.4 Pro — 51.6% ← new
- GPT-5.5 Pro — 43.3% ← new
- DeepSeek V4 Pro — 38.3% ← new
- Lyria 3 Pro — 8.3% ← new
- Lyria 3 Clip — 0.0% ← new
## What this means
If I were building an agent today and had to pick a model:
For reliability: Claude Sonnet 4 (85%) or Mistral Large 3 (79.6%). These aren't new — they've been at the top since the first benchmark.
For speed at good quality: Grok 4.20. 75% score in 14.5 seconds. That's under 2 seconds per task.
For free: Ring 2.6 if you qualify for OpenRouter's free tier. 65% at $0 is hard to beat.
What to avoid: the "Pro" suffix on GPT models, Google's Lyria previews, and DeepSeek V4 Pro, since Flash is cheaper and better.
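Wired together, those picks make a simple fallback chain: fast and cheap first, reliable second, free last. A sketch using the retry helper from above (the model IDs are my guesses at OpenRouter naming, not confirmed):

```python
# Hypothetical preference order reflecting the picks above; IDs are guesses.
PREFERENCE = [
    "x-ai/grok-4.20",             # fastest at good quality
    "anthropic/claude-sonnet-4",  # most reliable on the leaderboard
    "ring/ring-2.6:free",         # free fallback
]

def run_task(prompt: str) -> str:
    for model in PREFERENCE:
        result = call_with_retry(model, prompt)  # retry helper sketched earlier
        if result is not None:
            return result
    raise RuntimeError("every model in the chain failed")
```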
All results are live at workswithagents.dev/benchmarks — updated daily. Full interactive dashboard with local models at benchmarks.workswithagents.dev.
## One thing I'm watching
The Pro variants of GPT-5.4 and GPT-5.5 should, in theory, be better. They're not. This might mean OpenAI optimized them for something other than quick-turn agent coding, or it might mean the base models are just better tuned. Either way: don't assume Pro means better. Test it.
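Testing it is cheap: run both variants over your own task set and count what comes back. A bare-bones sketch, reusing the retry helper above (model IDs assumed; swap in a real correctness checker like the score() function at the top for anything beyond a smoke test):

```python
# Compare a base model against its Pro variant on your own prompts.
TASKS = [
    "Parse this JSON and return the settings dict: ...",
    "Write a regex matching ISO 8601 dates.",
]

def compare(base: str, pro: str) -> None:
    for model in (base, pro):
        # Here "passed" only means a response came back; plug in your own
        # checker for a real quality comparison.
        passed = sum(1 for t in TASKS if call_with_retry(model, t) is not None)
        print(f"{model}: {passed}/{len(TASKS)}")

compare("openai/gpt-5.4", "openai/gpt-5.4-pro")  # assumed IDs
```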