We benchmarked 10 LLMs on 10 real agent coding tasks — here are the results
By Vilius Vystartas | May 2026
I ran 10 cloud models through 10 real-world agent coding tasks last night. File parsing, SQL queries, regex extraction, async HTTP — the kind of things an agent actually does, not LeetCode problems. Total cost: $0.19 across 100 API calls. Here's what landed.
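For the curious: the harness is dead simple. Every model gets the same prompt, the raw completion is regex-scored against expected patterns, and I record wall-clock time. Below is a minimal sketch of that loop; the gateway URL, task prompts, and scoring regexes are illustrative stand-ins, not my actual suite (that's on the benchmarks page).

```python
import re
import time
from openai import OpenAI  # any OpenAI-compatible gateway works

# Two illustrative tasks in the same shape as the real suite:
# (prompt, regex the completion must match to score a point).
TASKS = [
    ("Write a Python snippet that sums the 'price' column of a CSV string s.",
     r"csv\.|split\("),
    ("Write a SQL query returning the top 5 customers by total order value.",
     r"(?i)ORDER\s+BY[\s\S]+LIMIT\s+5"),
]

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-...")

def run_model(model: str) -> tuple[float, float]:
    """Return (score in [0, 1], wall-clock seconds) for one model."""
    hits, start = 0, time.monotonic()
    for prompt, pattern in TASKS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1024,  # the tight budget that bites verbose models later
        )
        if re.search(pattern, resp.choices[0].message.content or ""):
            hits += 1
    return hits / len(TASKS), time.monotonic() - start
```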
The leaderboard
| Model | Score | Time (all 10 tasks) | Cost (all 10 tasks) |
|---|---|---|---|
| Claude Sonnet 4 | 83% | 23s | $0.019 |
| Gemma 4 31B | 80% | 119s | $0.001 |
| Gemma 4 26B | 78% | 66s | $0.001 |
| Mistral Large 3 | 78% | 19s | $0.002 |
| Qwen 3.6 Plus | 77% | 574s | $0.061 |
| Gemini 2.5 Flash | 76% | 12s | $0.004 |
| Kimi K2.6 | 75% | 40s | $0.005 |
| GPT-5.4 | 75% | 19s | $0.015 |
| MiniMax M2.7 | 60% | 137s | $0.019 |
| GPT-5.5 | 58% | 67s | $0.066 |
Surprise #1: Google's Gemma 4 is the real winner
Second place at 80%, nearly free (about $0.0005 per full 10-task run, which rounds to the $0.001 in the table), and it beats every premium model except Claude. The 26B version scores 78%, the same tier as Mistral Large 3 at roughly a quarter of the cost. I kept checking the numbers because they didn't make sense. They held up.
Surprise #2: Free beats paid
Google's free-tier Gemini 2.5 Flash (76%) outscored OpenAI's GPT-5.4 (75%). The free model is also the fastest — 12 seconds flat for all 10 tasks.
Surprise #3: Mistral is the value king
94% of Claude's accuracy (78% vs 83%) at about 10% of the cost. 19 seconds for $0.002. If you're building agents at scale, this is your default.
GPT-5.5 is a regression
Costs about 3.5x what Claude does ($0.066 vs $0.019) and scores 58%. Three tasks hit the token ceiling with verbose output that missed the scoring patterns. It's not a bad model; it's a bad fit for agent coding at tight token budgets.
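The practical lesson: detect truncation instead of pattern-scoring a stub. OpenAI-compatible APIs report it via `finish_reason == "length"`. A small guard, reusing the `client` from the harness sketch above (nothing here is GPT-5.5-specific):

```python
def safe_completion(client, model: str, prompt: str, budget: int = 1024) -> str | None:
    """Return the completion text, or None if it hit the token ceiling."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=budget,
    )
    choice = resp.choices[0]
    if choice.finish_reason == "length":
        return None  # truncated mid-answer; don't pattern-score a stub
    return choice.message.content
```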
MiniMax M2.7: brilliant, but unreliable
On the 6 tasks it completes, M2.7 scores 97% — higher than Claude. But it fails 4 out of 10 tasks entirely. Mandatory internal reasoning burns the output budget before it produces anything. It's a brilliant colleague who randomly freezes mid-sentence.
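If you still want that 97% ceiling, wrap the model instead of trusting it. Here's a sketch of the obvious fallback pattern, building on `safe_completion` above; the model ids are illustrative placeholders:

```python
def with_fallback(client, prompt: str,
                  primary: str = "minimax/m2.7",              # illustrative id
                  fallback: str = "anthropic/claude-sonnet-4") -> str:
    """Try the high-ceiling model; hand off when it freezes or truncates."""
    text = safe_completion(client, primary, prompt)
    if text:
        return text
    return safe_completion(client, fallback, prompt) or ""
```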
Qwen's thinking tax
77% at 574 seconds: nearly 10 minutes for a benchmark Claude finishes in 23 seconds. Qwen enables chain-of-thought by default, and it can't be fully disabled. The accuracy is there, but you'll wait.
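The mitigation is a hard wall-clock budget per call, so one thinking-heavy model can't stall your whole agent loop. A sketch using the async client; the 60-second cap is an arbitrary example, not a recommendation:

```python
import asyncio
from openai import AsyncOpenAI

async def bounded_completion(client: AsyncOpenAI, model: str, prompt: str,
                             timeout_s: float = 60.0) -> str | None:
    """Cap a single model call at timeout_s seconds of wall-clock time."""
    try:
        resp = await asyncio.wait_for(
            client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            ),
            timeout=timeout_s,
        )
        return resp.choices[0].message.content
    except asyncio.TimeoutError:
        return None  # still thinking after a minute; treat as a failed task
```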
My recommendations
- Best accuracy: Claude Sonnet 4
- Best value: Mistral Large 3 (94% of Claude's accuracy at about 10% of the cost)
- Best free: Gemma 4 31B (nearly free, second-highest score)
- Fastest: Gemini 2.5 Flash (12 seconds, free tier)
- Avoid for agents: GPT-5.5 and MiniMax M2.7
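If you'd rather have those picks as code, they collapse into a tiny routing table. A sketch, with illustrative model ids (swap in whatever your gateway calls them):

```python
from typing import Literal

Priority = Literal["accuracy", "value", "speed"]

ROUTES: dict[Priority, str] = {
    "accuracy": "anthropic/claude-sonnet-4",
    "value": "mistralai/mistral-large-3",
    "speed": "google/gemini-2.5-flash",
}

def pick_model(priority: Priority = "value") -> str:
    """Default to the value pick; escalate to accuracy only when it matters."""
    return ROUTES[priority]
```

An agent can then call `pick_model("accuracy")` only on the steps where a wrong answer is expensive, and ride the cheap default everywhere else.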
The benchmarks page at benchmarks.workswithagents.dev refreshes nightly. Raw data, methodology, and per-task scores are there for anyone who wants to pick holes in my numbers.
Because you should.