## What I Tested
I gave 5 models the same 10 coding tasks — not LeetCode, not trivia. Tasks an autonomous agent actually does: parse a JSON config, find large files with a shell one-liner, fix a buggy merge function, write a concurrent HTTP fetcher. The kind of things my agents ask for at 3 AM.
Each task was scored by pattern matching: does the output contain the expected function names, error handling, and edge cases? Pass (75%+ of expected patterns), partial (50-74%), or fail (below 50%).
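To make the scoring concrete, here's a minimal sketch of the kind of pattern-matching scorer I mean. The task prompt, the regex patterns, and the helper names (`score_output`, `grade`) are illustrative, not the exact ones from the benchmark; only the 75%/50% cutoffs are the real thresholds.

```python
import re

# Illustrative task spec: the prompt plus regex patterns expected in a good answer.
# These particular patterns are examples, not the exact ones used in the benchmark.
TASK = {
    "prompt": "Write a shell one-liner that finds files larger than 100 MB under the current directory.",
    "patterns": [r"\bfind\b", r"-type\s+f", r"-size", r"\+100M"],
}

def score_output(output: str, patterns: list[str]) -> float:
    """Fraction of expected patterns that appear in the model's output."""
    hits = sum(1 for p in patterns if re.search(p, output))
    return hits / len(patterns)

def grade(score: float) -> str:
    """Pass at 75%+ of patterns, partial at 50-74%, fail below 50%."""
    if score >= 0.75:
        return "pass"
    if score >= 0.50:
        return "partial"
    return "fail"

if __name__ == "__main__":
    model_output = "find . -type f -size +100M"
    s = score_output(model_output, TASK["patterns"])
    print(f"{s:.0%} -> {grade(s)}")  # 100% -> pass
```

Pattern matching like this is cheap and deterministic, but it's also why a correct answer written with different syntax can land in the partial bucket, as the Claude results below show.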
No model knew it was being benchmarked. Same prompt, same 500-token limit, temperature 0.1. OpenRouter for all calls.
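Every call went through OpenRouter's OpenAI-compatible chat completions endpoint. A stripped-down version of the request looks roughly like this; the model slug in the comment is illustrative, and you'd need your own `OPENROUTER_API_KEY`:

```python
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def run_task(model: str, prompt: str) -> str:
    """Send one benchmark prompt through OpenRouter and return the completion text."""
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 500,   # the 500-token cap
            "temperature": 0.1,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Example call (model slug illustrative):
# answer = run_task("google/gemini-2.5-flash", TASK["prompt"])
```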
## The Results
| Model | Score | Passes | Cost | Time |
|---|---|---|---|---|
| Claude Sonnet 4 | 89.4% | 8/10 | $0.063 | 54s |
| Gemini 2.5 Flash | 88.9% | 10/10 | $0.008 | 17s |
| GPT-5.4 | 86.5% | 9/10 | $0.027 | 26s |
| GPT-5.5 | 53.8% | 5/10 | $0.104 | 109s |
| DeepSeek V3 | 0.0% | 0/10 | $0.000 | 1s |
DeepSeek returned HTTP 400 on every call. That's an OpenRouter compatibility issue, not a model problem, so I excluded it from the analysis rather than pretend it scored zero.
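If you're wiring up a similar harness, one way to keep transport errors from being scored as model failures is to catch the HTTP error and mark the model excluded. A hedged sketch, reusing `run_task`, `score_output`, and `grade` from the snippets above (model slugs and the `results` shape are illustrative):

```python
results = {}
for model in ["google/gemini-2.5-flash", "deepseek/deepseek-chat"]:  # slugs illustrative
    try:
        answer = run_task(model, TASK["prompt"])
    except requests.HTTPError as exc:  # e.g. an HTTP 400 from the provider
        results[model] = {"excluded": True, "reason": str(exc)}
        continue
    score = score_output(answer, TASK["patterns"])
    results[model] = {"score": score, "grade": grade(score)}
```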
## What Surprised Me
Gemini 2.5 Flash scored 10/10 passes. Not a single task fell below 75%. It cost $0.008 total, less than a single GPT-5.5 call, and it finished in 17 seconds, roughly 6x faster than GPT-5.5.
GPT-5.5 failed 4 tasks. The pattern was consistent: it over-explained. On the shell one-liner task, it returned a 500-token essay about `find` instead of the actual command. On the CSV stats task, it discussed three approaches and never wrote the code. GPT-5.5 is the smartest model I've ever used for reasoning, but for concise code generation the verbosity hurts.
Claude Sonnet 4 was the most reliable. 8/10 perfect passes, 2 partials, zero fails. The 2 partials were on shell tasks where it used different syntax — correct, but didn't match my expected patterns. At $0.063 for 10 tasks, it's the premium pick for production agents.
## What This Means for Agent Builders
If you're building agents that generate code:
- Best value: Gemini 2.5 Flash. Free tier exists. 10/10 passes. Fast.
- Most reliable: Claude Sonnet 4. Zero fails. Worth the $0.006/task.
- Avoid for code gen: GPT-5.5. It's brilliant at reasoning — use it for architecture decisions, not shell scripts.
I'm not claiming this is a comprehensive benchmark. 10 tasks, one run each, pattern-matching scoring. But it's real — the same tasks my agents run every day. Not a synthetic benchmark designed for a paper.
## What I'll Test Next
Error recovery. The four models that completed the run handled the happy path. I want to know how they handle partial failures, contradictory instructions, and corrupted inputs. The benchmark that matters for agents isn't "can you sort a list"; it's "can you recover when the filesystem is read-only and the config is missing."
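As a rough illustration of what one such task could look like under the same pattern-matching setup; nothing below is in the current suite, and the prompt and patterns are hypothetical:

```python
# Hypothetical future task: reward answers that anticipate failure, not just the happy path.
RECOVERY_TASK = {
    "prompt": (
        "Write a Python function load_config(path) that reads a JSON config file. "
        "The file may be missing or unreadable; return {} in that case instead of crashing."
    ),
    "patterns": [
        r"def\s+load_config",
        r"\btry\b",
        r"(FileNotFoundError|OSError|PermissionError)",
        r"return\s+\{\}",
    ],
}
```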
Total cost of this experiment: $0.20.
Full results: workswithagents.dev