We benchmarked 10 LLMs on 10 real agent coding tasks — here are the results

By Vilius Vystartas | May 2026

I ran 10 cloud models through 10 real-world agent coding tasks last night. File parsing, SQL queries, regex extraction, async HTTP — the kind of things an agent actually does, not LeetCode problems. Total cost: $0.19 across 100 API calls. Here's what landed.
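
To make the setup concrete, here's a minimal sketch of the kind of harness this describes. It is not the actual benchmark code: the sample task, the scoring regexes, the 1024-token ceiling, and the run_model helper are placeholders invented for illustration, and it assumes an OpenAI-compatible chat-completions endpoint.

```python
import re
import requests  # assumes an OpenAI-compatible /chat/completions endpoint

# Placeholder tasks: each has a prompt plus regex patterns the answer is scored against.
TASKS = [
    {
        "prompt": "Write a Python function that extracts all email addresses from a string.",
        "patterns": [r"re\.findall", r"def \w+\("],
    },
    # ...nine more: file parsing, SQL queries, async HTTP, and so on.
]

def run_model(base_url: str, api_key: str, model: str) -> float:
    """Send every task to one model and return the fraction of scoring patterns it hits."""
    hits = total = 0
    for task in TASKS:
        resp = requests.post(
            f"{base_url}/chat/completions",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "model": model,
                "messages": [{"role": "user", "content": task["prompt"]}],
                "max_tokens": 1024,  # same output ceiling for every model
            },
            timeout=600,
        )
        answer = resp.json()["choices"][0]["message"]["content"]
        for pattern in task["patterns"]:
            total += 1
            hits += bool(re.search(pattern, answer))
    return hits / total
```

In a setup like this, the leaderboard score would simply be the pattern hit rate per model, with wall-clock time and token-based cost tracked around the same loop.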

The leaderboard

| Model | Score | Time | Cost |
| --- | --- | --- | --- |
| Claude Sonnet 4 | 83% | 23s | $0.019 |
| Gemma 4 31B | 80% | 119s | $0.001 |
| Gemma 4 26B | 78% | 66s | $0.001 |
| Mistral Large 3 | 78% | 19s | $0.002 |
| Qwen 3.6 Plus | 77% | 574s | $0.061 |
| Gemini 2.5 Flash | 76% | 12s | $0.004 |
| Kimi K2.6 | 75% | 40s | $0.005 |
| GPT-5.4 | 75% | 19s | $0.015 |
| MiniMax M2.7 | 60% | 137s | $0.019 |
| GPT-5.5 | 58% | 67s | $0.066 |

Surprise #1: Google's Gemma 4 is the real winner

Second place at 80%, nearly free ($0.0005 per run), and beats every premium model except Claude. The 26B version scores 78% — same tier as Mistral Large 3, at one-quarter the cost. I kept checking the numbers because they didn't make sense. They held up.

Surprise #2: Free beats paid

Google's free-tier Gemini 2.5 Flash (76%) outscored OpenAI's GPT-5.4 (75%). The free model is also the fastest — 12 seconds flat for all 10 tasks.

Surprise #3: Mistral is the value king

94% of Claude's accuracy at roughly 10% of the cost. 19 seconds for $0.002. If you're building agents at scale, this is your default.

GPT-5.5 is a regression

Costs about 3.5x what Claude does ($0.066 vs $0.019) and scores 58%. Three tasks hit the token ceiling: the output was verbose enough that the parts the scorer looks for never made it out. It's not a bad model; it's a bad fit for agent coding under tight token budgets.
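
To make that failure mode concrete, here's a toy illustration. Everything in it is invented (the completion text, the 256-token ceiling, the 4-characters-per-token shortcut); the point is only that when a model spends its output budget on preamble, the answer a pattern-based scorer looks for never arrives.

```python
import re

MAX_TOKENS = 256  # the per-call output ceiling

# A verbose completion: a long explanation first, the actual answer last.
verbose_answer = (
    "Great question! Before writing the query, let's talk about normalization, "
    "indexes, and why parameterized statements matter. " * 20
    + "SELECT name FROM users WHERE active = 1;"
)

# Crude stand-in for truncation at the ceiling (~4 characters per token).
truncated = verbose_answer[: MAX_TOKENS * 4]

# The scorer never sees the SQL, so the task gets no credit.
print(bool(re.search(r"SELECT\s+\w+\s+FROM\s+users", truncated)))  # False
```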

MiniMax M2.7: brilliant, but unreliable

On the 6 tasks it completes, M2.7 scores 97% — higher than Claude. But it fails 4 out of 10 tasks entirely. Mandatory internal reasoning burns the output budget before it produces anything. It's a brilliant colleague who randomly freezes mid-sentence.
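
The overall number follows from that split. A quick back-of-the-envelope check, assuming the four failed tasks contribute nothing:

```python
completed = 6 * 0.97   # six tasks at roughly 97%
failed = 4 * 0.0       # four tasks that never produced scoreable output
print((completed + failed) / 10)  # ~0.58, which lands near the 60% in the table
```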

Qwen's thinking tax

77% at 574 seconds — 10 minutes for a benchmark that takes Claude 23 seconds. Qwen enables chain-of-thought by default and can't be fully disabled. The accuracy is there, but you'll wait.

My recommendations

  • Best accuracy: Claude Sonnet 4
  • Best value: Mistral Large 3 — 94% of Claude's accuracy at 10% of the cost
  • Best free: Gemma 4 31B — nearly free, second highest score
  • Fastest: Gemini 2.5 Flash — 12 seconds, free
  • Avoid for agents: GPT-5.5 and MiniMax M2.7

The benchmarks page at benchmarks.workswithagents.dev refreshes nightly. Raw data, methodology, and per-task scores are there for anyone who wants to pick holes in my numbers.

Because you should.
