## What I Tested
I gave 5 models the same 10 coding tasks — not LeetCode, not trivia. Tasks an autonomous agent actually does: parse a JSON config, find large files with a shell one-liner, fix a buggy merge function, write a concurrent HTTP fetcher. The kind of things my agents ask for at 3 AM.
Each task was scored by pattern matching: does the output contain the expected function names, error handling, and edge cases? Pass (75%+ of expected patterns), partial (50-74%), or fail (below 50%).
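To make the scoring concrete, here's a minimal sketch of the kind of pattern-matching scorer I mean. The task prompt, the regex patterns, and the helper names (`score_output`, `grade`) are illustrative, not the exact ones from the benchmark; only the 75%/50% cutoffs are the real thresholds.

```python
import re

# Illustrative task spec: the prompt plus regex patterns expected in a good answer.
# These particular patterns are examples, not the exact ones used in the benchmark.
TASK = {
    "prompt": "Write a shell one-liner that finds files larger than 100 MB under the current directory.",
    "patterns": [r"\bfind\b", r"-type\s+f", r"-size", r"\+100M"],
}

def score_output(output: str, patterns: list[str]) -> float:
    """Fraction of expected patterns that appear in the model's output."""
    hits = sum(1 for p in patterns if re.search(p, output))
    return hits / len(patterns)

def grade(score: float) -> str:
    """Pass at 75%+ of patterns, partial at 50-74%, fail below 50%."""
    if score >= 0.75:
        return "pass"
    if score >= 0.50:
        return "partial"
    return "fail"

if __name__ == "__main__":
    model_output = "find . -type f -size +100M"
    s = score_output(model_output, TASK["patterns"])
    print(f"{s:.0%} -> {grade(s)}")  # 100% -> pass
```

Pattern matching like this is cheap and deterministic, but it's also why a correct answer written with different syntax can land in the partial bucket, as the Claude results below show.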
No model knew it was being benchmarked. Same prompt, same 500-token limit, temperature 0.1. OpenRouter for all calls.
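Every call went through OpenRouter's OpenAI-compatible chat completions endpoint. A stripped-down version of the request looks roughly like this; the model slug in the comment is illustrative, and you'd need your own `OPENROUTER_API_KEY`:

```python
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def run_task(model: str, prompt: str) -> str:
    """Send one benchmark prompt through OpenRouter and return the completion text."""
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 500,   # the 500-token cap
            "temperature": 0.1,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Example call (model slug illustrative):
# answer = run_task("google/gemini-2.5-flash", TASK["prompt"])
```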
## The Results
| Model | Score | Passes | Cost | Time |
|---|---|---|---|---|
| Claude Sonnet 4 | 89.4% | 8/10 | $0.063 | 54s |
| Gemini 2.5 Flash | 88.9% | 10/10 | $0.008 | 17s |
| GPT-5.4 | 86.5% | 9/10 | $0.027 | 26s |
| GPT-5.5 | 53.8% | 5/10 | $0.104 | 109s |
| DeepSeek V3 | 0.0% | 0/10 | $0.000 | 1s |
DeepSeek returned HTTP 400 on every call. That's an OpenRouter compatibility issue, not a model problem, so I excluded it from the analysis rather than pretend it scored zero.
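If you're wiring up a similar harness, one way to keep transport errors from being scored as model failures is to catch the HTTP error and mark the model excluded. A hedged sketch, reusing `run_task`, `score_output`, and `grade` from the snippets above (model slugs and the `results` shape are illustrative):

```python
results = {}
for model in ["google/gemini-2.5-flash", "deepseek/deepseek-chat"]:  # slugs illustrative
    try:
        answer = run_task(model, TASK["prompt"])
    except requests.HTTPError as exc:  # e.g. an HTTP 400 from the provider
        results[model] = {"excluded": True, "reason": str(exc)}
        continue
    score = score_output(answer, TASK["patterns"])
    results[model] = {"score": score, "grade": grade(score)}
```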
## What Surprised Me
Gemini 2.5 Flash scored 10/10 passes. Not a single task fell below 75%. It cost $0.008 total, less than a single GPT-5.5 call, and it finished in 17 seconds, roughly 6x faster than GPT-5.5.
GPT-5.5 failed 4 tasks. The pattern was consistent: it over-explained. On the shell one-liner task, it returned a 500-token essay about `find` instead of the actual command. On the CSV stats task, it discussed three approaches and never wrote the code. GPT-5.5 is the smartest model I've ever used for reasoning, but for concise code generation the verbosity hurts.
Claude Sonnet 4 was the most reliable. 8/10 perfect passes, 2 partials, zero fails. The 2 partials were on shell tasks where it used different syntax — correct, but didn't match my expected patterns. At $0.063 for 10 tasks, it's the premium pick for production agents.
## What This Means for Agent Builders
If you're building agents that generate code:
- Best value: Gemini 2.5 Flash. Free tier exists. 10/10 passes. Fast.
- Most reliable: Claude Sonnet 4. Zero fails. Worth the $0.006/task.
- Avoid for code gen: GPT-5.5. It's brilliant at reasoning — use it for architecture decisions, not shell scripts.
I'm not claiming this is a comprehensive benchmark. 10 tasks, one run each, pattern-matching scoring. But it's real — the same tasks my agents run every day. Not a synthetic benchmark designed for a paper.
## What I'll Test Next
Error recovery. The four models that completed the run handled the happy path. I want to know how they handle partial failures, contradictory instructions, and corrupted inputs. The benchmark that matters for agents isn't "can you sort a list"; it's "can you recover when the filesystem is read-only and the config is missing."
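As a rough illustration of what one such task could look like under the same pattern-matching setup; nothing below is in the current suite, and the prompt and patterns are hypothetical:

```python
# Hypothetical future task: reward answers that anticipate failure, not just the happy path.
RECOVERY_TASK = {
    "prompt": (
        "Write a Python function load_config(path) that reads a JSON config file. "
        "The file may be missing or unreadable; return {} in that case instead of crashing."
    ),
    "patterns": [
        r"def\s+load_config",
        r"\btry\b",
        r"(FileNotFoundError|OSError|PermissionError)",
        r"return\s+\{\}",
    ],
}
```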
Total cost of this experiment: $0.20.
Full results: workswithagents.dev