We benchmarked 10 LLMs on 10 real agent coding tasks — here are the results

By Vilius Vystartas | May 2026

I ran 10 cloud models through 10 real-world agent coding tasks last night. File parsing, SQL queries, regex extraction, async HTTP — the kind of things an agent actually does, not LeetCode problems. Total cost: $0.19 across 100 API calls. Here's what landed.
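
To make the setup concrete, here's a minimal sketch of the kind of harness this describes. It is not the actual benchmark code: the sample task, the scoring regexes, the 1024-token ceiling, and the run_model helper are placeholders invented for illustration, and it assumes an OpenAI-compatible chat-completions endpoint.

```python
import re
import requests  # assumes an OpenAI-compatible /chat/completions endpoint

# Placeholder tasks: each has a prompt plus regex patterns the answer is scored against.
TASKS = [
    {
        "prompt": "Write a Python function that extracts all email addresses from a string.",
        "patterns": [r"re\.findall", r"def \w+\("],
    },
    # ...nine more: file parsing, SQL queries, async HTTP, and so on.
]

def run_model(base_url: str, api_key: str, model: str) -> float:
    """Send every task to one model and return the fraction of scoring patterns it hits."""
    hits = total = 0
    for task in TASKS:
        resp = requests.post(
            f"{base_url}/chat/completions",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "model": model,
                "messages": [{"role": "user", "content": task["prompt"]}],
                "max_tokens": 1024,  # same output ceiling for every model
            },
            timeout=600,
        )
        answer = resp.json()["choices"][0]["message"]["content"]
        for pattern in task["patterns"]:
            total += 1
            hits += bool(re.search(pattern, answer))
    return hits / total
```

In a setup like this, the leaderboard score would simply be the pattern hit rate per model, with wall-clock time and token-based cost tracked around the same loop.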

The leaderboard

| Model | Score | Time | Cost |
| --- | --- | --- | --- |
| Claude Sonnet 4 | 83% | 23s | $0.019 |
| Gemma 4 31B | 80% | 119s | $0.001 |
| Gemma 4 26B | 78% | 66s | $0.001 |
| Mistral Large 3 | 78% | 19s | $0.002 |
| Qwen 3.6 Plus | 77% | 574s | $0.061 |
| Gemini 2.5 Flash | 76% | 12s | $0.004 |
| Kimi K2.6 | 75% | 40s | $0.005 |
| GPT-5.4 | 75% | 19s | $0.015 |
| MiniMax M2.7 | 60% | 137s | $0.019 |
| GPT-5.5 | 58% | 67s | $0.066 |

Surprise #1: Google's Gemma 4 is the real winner

Second place at 80%, nearly free ($0.0005 per run), and beats every premium model except Claude. The 26B version scores 78% — same tier as Mistral Large 3, at one-quarter the cost. I kept checking the numbers because they didn't make sense. They held up.

Surprise #2: Free beats paid

Google's free-tier Gemini 2.5 Flash (76%) outscored OpenAI's GPT-5.4 (75%). The free model is also the fastest — 12 seconds flat for all 10 tasks.

Surprise #3: Mistral is the value king

94% of Claude's accuracy at roughly 10% of the cost. 19 seconds for $0.002. If you're building agents at scale, this is your default.

GPT-5.5 is a regression

Costs about 3.5x what Claude does ($0.066 vs $0.019) and scores 58%. Three tasks hit the token ceiling: the output was verbose enough that the parts the scorer looks for never made it out. It's not a bad model; it's a bad fit for agent coding under tight token budgets.
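
To make that failure mode concrete, here's a toy illustration. Everything in it is invented (the completion text, the 256-token ceiling, the 4-characters-per-token shortcut); the point is only that when a model spends its output budget on preamble, the answer a pattern-based scorer looks for never arrives.

```python
import re

MAX_TOKENS = 256  # the per-call output ceiling

# A verbose completion: a long explanation first, the actual answer last.
verbose_answer = (
    "Great question! Before writing the query, let's talk about normalization, "
    "indexes, and why parameterized statements matter. " * 20
    + "SELECT name FROM users WHERE active = 1;"
)

# Crude stand-in for truncation at the ceiling (~4 characters per token).
truncated = verbose_answer[: MAX_TOKENS * 4]

# The scorer never sees the SQL, so the task gets no credit.
print(bool(re.search(r"SELECT\s+\w+\s+FROM\s+users", truncated)))  # False
```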

MiniMax M2.7: brilliant, but unreliable

On the 6 tasks it completes, M2.7 scores 97% — higher than Claude. But it fails 4 out of 10 tasks entirely. Mandatory internal reasoning burns the output budget before it produces anything. It's a brilliant colleague who randomly freezes mid-sentence.
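
The overall number follows from that split. A quick back-of-the-envelope check, assuming the four failed tasks contribute nothing:

```python
completed = 6 * 0.97   # six tasks at roughly 97%
failed = 4 * 0.0       # four tasks that never produced scoreable output
print((completed + failed) / 10)  # ~0.58, which lands near the 60% in the table
```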

Qwen's thinking tax

77% at 574 seconds — 10 minutes for a benchmark that takes Claude 23 seconds. Qwen enables chain-of-thought by default and can't be fully disabled. The accuracy is there, but you'll wait.

My recommendations

  • Best accuracy: Claude Sonnet 4
  • Best value: Mistral Large 3 — 94% of Claude's accuracy at 10% of the cost
  • Best free: Gemma 4 31B — nearly free, second highest score
  • Fastest: Gemini 2.5 Flash — 12 seconds, free
  • Avoid for agents: GPT-5.5 and MiniMax M2.7

The benchmarks page at benchmarks.workswithagents.dev refreshes nightly. Raw data, methodology, and per-task scores are there for anyone who wants to pick holes in my numbers.

Because you should.
