We Tested 10 Untested LLMs on Agent Coding — The Results Are In
Yesterday I promised to benchmark 10 LLMs that have never been tested on real agent coding tasks. I ran all 10 overnight. Some surprised me. Some embarrassed themselves.
## The board
10 models. 10 tasks each. Tasks are real agent work: parse JSON, write regex, fix a bug, query SQL, handle errors. Full pass requires correct, working code.
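To make that concrete, here's a hypothetical sketch of what one task and its checker could look like. The prompt, function names, and test cases are mine, not the harness's actual contents, and per-case partial credit is an assumption (it would explain the non-round scores in the table):

```python
import json

# Hypothetical task prompt -- illustrative only, not the benchmark's wording.
TASK = ("Write a function parse_config(text) that parses a JSON string and "
        "returns its 'settings' dict, or {} for invalid or missing input.")

def reference_parse_config(text):
    """Reference solution the checker compares against."""
    try:
        return json.loads(text).get("settings", {})
    except (json.JSONDecodeError, AttributeError):
        return {}

def score(submission) -> float:
    """Score a submitted function: 1.0 is a full pass, in between is a partial."""
    cases = [
        '{"settings": {"retries": 3}}',  # happy path
        'not json at all',               # must not raise
        '{"other": 1}',                  # key missing
    ]
    passed = 0
    for text in cases:
        try:
            if submission(text) == reference_parse_config(text):
                passed += 1
        except Exception:
            pass  # an uncaught exception fails the case
    return passed / len(cases)
```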
| Model | Score | Pass / partial / fail | Cost/task |
|---|---|---|---|
| Grok 4.20 | 75.0% | 6 / 3 / 1 | $0.0003 |
| Grok 4.1 Fast | 74.9% | 6 / 2 / 2 | $0.0009 |
| Xiaomi MiMo V2.5 Pro | 68.2% | 7 / 0 / 3 | $0.001 |
| Ring 2.6 (free) | 65.0% | 6 / 1 / 3 | free |
| DeepSeek V4 Flash | 60.0% | 4 / 3 / 3 | $0.0001 |
| GPT-5.4 Pro | 51.6% | 5 / 1 / 4 | $0.06 |
| GPT-5.5 Pro | 43.3% | 4 / 1 / 5 | $0.065 |
| DeepSeek V4 Pro | 38.3% | 4 / 0 / 6 | $0.001 |
| Google Lyria 3 Pro | 8.3% | 1 / 0 / 9 | free (preview) |
| Google Lyria 3 Clip | 0.0% | 0 / 0 / 10 | free (preview) |
Total cost: $1.37 for the entire run.
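The per-task figures in the table are rounded, but they roughly reproduce that number:

```python
# Sanity check: reconstruct the run cost from the table's (rounded) per-task figures.
cost_per_task = {
    "Grok 4.20": 0.0003, "Grok 4.1 Fast": 0.0009, "Xiaomi MiMo V2.5 Pro": 0.001,
    "Ring 2.6": 0.0, "DeepSeek V4 Flash": 0.0001, "GPT-5.4 Pro": 0.06,
    "GPT-5.5 Pro": 0.065, "DeepSeek V4 Pro": 0.001,
    "Lyria 3 Pro": 0.0, "Lyria 3 Clip": 0.0,
}
total = sum(cost * 10 for cost in cost_per_task.values())  # 10 tasks each
print(f"~${total:.2f}")  # ~$1.28; table rounding covers the gap to $1.37
```

Note where the money went: the two GPT Pro variants account for nearly the entire bill.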
## What jumped out
Grok 4.20 won, though the margin over second place is thin. What sets it apart is speed: 14.5 seconds for all 10 tasks. Grok 4.1 Fast scored nearly identically but took 225 seconds. Same family, wildly different speed profiles.
The "Pro" suffix is a trap. GPT-5.4 Pro scored 51.6%. Regular GPT-5.4 scored 76.6% on the same tasks. GPT-5.5 Pro scored 43.3%. Regular GPT-5.5 scored 60%. The Pro variants are slower, more expensive, and worse at this specific workload. If you're building agents, the base models are better.
DeepSeek V4 Flash beat DeepSeek V4 Pro, 60.0% to 38.3%, and Flash costs a tenth as much per task. For agent coding, smaller and faster beats bigger and slower again.
Ring 2.6 is free and beats paid models. Six passes, one partial, $0.00. It outperforms both GPT Pro variants, DeepSeek V4 Pro, and both Lyria previews.
Google Lyria 3 is not ready. Clip failed every single task with 502 errors. Pro barely scored. Both are marked "preview" on OpenRouter. Fair enough — but worth knowing before you build on them.
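If you want to poke at a preview model anyway, wrap the call in retries with backoff so a flaky endpoint doesn't sink the whole run. Here's a minimal sketch of the pattern against OpenRouter's chat completions API (the model ID in the final comment is a guess, not a confirmed identifier):

```python
import os
import time
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def call_with_retry(model: str, prompt: str, retries: int = 3):
    """Call a model via OpenRouter, backing off on 5xx; None if every attempt fails."""
    for attempt in range(retries):
        resp = requests.post(
            OPENROUTER_URL,
            headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=60,
        )
        if resp.status_code < 500:
            resp.raise_for_status()  # surface 4xx errors immediately
            return resp.json()["choices"][0]["message"]["content"]
        time.sleep(2 ** attempt)  # 1s, 2s, 4s between 502s
    return None  # score it as a failed task instead of crashing the run

# "google/lyria-3-clip" is a guess at the ID -- check OpenRouter's model list.
```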
## Raw scores with context
For comparison, here's where these new models land against the existing leaderboard:
- Claude Sonnet 4 — 85.0%
- Mistral Large 3 — 79.6%
- Gemma 4 31B — 78.3%
- Gemma 4 26B A4B — 78.3%
- Qwen 3.6 Plus — 76.6%
- GPT-5.4 — 76.6%
- Gemini 2.5 Flash — 76.4%
- Kimi K2.6 — 75.0%
- Grok 4.20 — 75.0% ← new
- Grok 4.1 Fast — 74.9% ← new
- MiniMax M2.7 — 69.9%
- Xiaomi MiMo V2.5 Pro — 68.2% ← new
- Ring 2.6 — 65.0% ← new (free)
- GPT-5.5 — 60.0%
- DeepSeek V4 Flash — 60.0% ← new
- GPT-5.4 Pro — 51.6% ← new
- GPT-5.5 Pro — 43.3% ← new
- DeepSeek V4 Pro — 38.3% ← new
- Lyria 3 Pro — 8.3% ← new
- Lyria 3 Clip — 0.0% ← new
## What this means
If I were building an agent today and had to pick a model:
For reliability: Claude Sonnet 4 (85%) or Mistral Large 3 (79.6%). These aren't new — they've been at the top since the first benchmark.
For speed at good quality: Grok 4.20. 75% score in 14.5 seconds. That's under 2 seconds per task.
For free: Ring 2.6 if you qualify for OpenRouter's free tier. 65% at $0 is hard to beat.
What to avoid: the "Pro" suffix on GPT models, Google's Lyria previews, and DeepSeek V4 Pro, since Flash is cheaper and better.
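Wired together, those picks make a simple fallback chain: fast and cheap first, reliable second, free last. A sketch using the retry helper from above (the model IDs are my guesses at OpenRouter naming, not confirmed):

```python
# Hypothetical preference order reflecting the picks above; IDs are guesses.
PREFERENCE = [
    "x-ai/grok-4.20",             # fastest at good quality
    "anthropic/claude-sonnet-4",  # most reliable on the leaderboard
    "ring/ring-2.6:free",         # free fallback
]

def run_task(prompt: str) -> str:
    for model in PREFERENCE:
        result = call_with_retry(model, prompt)  # retry helper sketched earlier
        if result is not None:
            return result
    raise RuntimeError("every model in the chain failed")
```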
All results are live at workswithagents.dev/benchmarks — updated daily. Full interactive dashboard with local models at benchmarks.workswithagents.dev.
## One thing I'm watching
The Pro variants of GPT-5.4 and GPT-5.5 should, in theory, be better. They're not. This might mean OpenAI optimized them for something other than quick-turn agent coding, or it might mean the base models are just better tuned. Either way: don't assume Pro means better. Test it.
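Testing it is cheap: run both variants over your own task set and count what comes back. A bare-bones sketch, reusing the retry helper above (model IDs assumed; swap in a real correctness checker like the score() function at the top for anything beyond a smoke test):

```python
# Compare a base model against its Pro variant on your own prompts.
TASKS = [
    "Parse this JSON and return the settings dict: ...",
    "Write a regex matching ISO 8601 dates.",
]

def compare(base: str, pro: str) -> None:
    for model in (base, pro):
        # Here "passed" only means a response came back; plug in your own
        # checker for a real quality comparison.
        passed = sum(1 for t in TASKS if call_with_retry(model, t) is not None)
        print(f"{model}: {passed}/{len(TASKS)}")

compare("openai/gpt-5.4", "openai/gpt-5.4-pro")  # assumed IDs
```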