I spent three weeks and about $340 benchmarking three LLMs on the actual tasks my autonomous agents run in production. Not the demo tasks. Not "summarize this article." The unglamorous, repetitive, occasionally weird tasks that keep a six-agent system running.
Here's what I found, including the parts that surprised me.
## Why this benchmark is different
Most LLM benchmarks test general reasoning on clean, standardized tasks. That's useful for comparing models in theory. It's less useful for answering "which model should I pay for when my agent needs to do X twelve times a day, every day, indefinitely."
My agents perform four categories of tasks:
- Content generation — drafting posts, writing summaries, creating structured data from unstructured inputs
- Code review and generation — reviewing PRs, generating utility functions, catching obvious bugs
- Planning and task decomposition — breaking a goal into subtasks, prioritizing a backlog, deciding what to work on next
- Tool use and structured output — calling APIs, generating valid JSON, following format constraints reliably
I tested Claude Sonnet 4.6, GPT-4o, and Gemini 2.0 Flash on each category. Each test ran 30 times to smooth out variance. I measured cost per task, latency (median and p95), and a quality score that I'll explain below.
## Quality scoring methodology
Quality is subjective until you operationalize it. Here's what I used:
For content generation, I scored three dimensions: does it match the specified tone/voice (0-3), is the structure correct (0-3), and does it contain factual errors or hallucinations (0-3, reverse-scored). Max 9 points, normalized to 0-100.
For code tasks: automated linting + unit tests for correctness, manual review for idiomatic quality. Pass/fail for correctness, 0-100 for quality.
For planning: I had a second model (always Claude) review the plans for logical consistency, coverage of edge cases, and appropriate scope. Subjective, but consistent.
For structured output: JSON schema validation. 100 if valid, 0 if not. No partial credit for "almost valid."
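Two of these scorers are mechanical enough to sketch. The function names and shape below are my shorthand for illustration, not code from the benchmark repo:

```python
import json

import jsonschema


def score_content(tone: int, structure: int, hallucination: int) -> float:
    """Normalize the three 0-3 content dimensions to 0-100.

    `hallucination` is reverse-scored: a 3 (heavily hallucinated)
    contributes zero points.
    """
    if not all(0 <= d <= 3 for d in (tone, structure, hallucination)):
        raise ValueError("each dimension must be in 0..3")
    raw = tone + structure + (3 - hallucination)  # max 9 points
    return round(raw / 9 * 100, 1)


def score_structured(raw: str, schema: dict) -> int:
    """All-or-nothing: 100 for schema-valid JSON, 0 otherwise."""
    try:
        jsonschema.validate(json.loads(raw), schema)
        return 100
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return 0
```

The reverse-scoring is the only non-obvious part: a perfectly clean output scores `score_content(3, 3, 0)`, not `(3, 3, 3)`.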
I'm publishing all test prompts and evaluation criteria in this repo — though fair warning, it's not polished documentation.
## Results by category
### Content Generation
| Model | Avg Quality | Median Latency | p95 Latency | Cost/task |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 84.2 | 3.1s | 6.8s | $0.0048 |
| GPT-4o | 79.6 | 2.7s | 5.9s | $0.0062 |
| Gemini 2.0 Flash | 71.4 | 1.9s | 4.2s | $0.0021 |
Claude wins on quality. Gemini wins on cost and speed. GPT-4o is the middle option that doesn't clearly win anything.
What drives the quality difference: Claude is significantly better at maintaining a consistent voice across multiple paragraphs. When I have a 400-word post with a specific tone signature, Claude holds it together more reliably. Gemini tends to drift — the opening and closing feel like different models wrote them.
For short-form content (under 100 words), Gemini's quality gap shrinks substantially. Under 50 words, there's no meaningful quality difference among the three. That's a useful insight: Gemini at half the cost for anything short.
### Code Review and Generation
| Model | Correctness | Avg Quality | Median Latency | Cost/task |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 93% | 88.1 | 4.2s | $0.0071 |
| GPT-4o | 89% | 82.4 | 3.8s | $0.0089 |
| Gemini 2.0 Flash | 84% | 74.8 | 2.8s | $0.0031 |
Claude is meaningfully better here. The nine-percentage-point correctness gap versus Gemini compounds when you're running code review 20+ times a day.
The failure modes are different and worth knowing about:
- Claude fails by being overly conservative. It flags things that aren't bugs. False positives.
- GPT-4o fails by missing context. It treats each file in isolation even when I provide related files.
- Gemini fails by being too agreeable. It'll say code looks fine when it has obvious issues, then correct itself if you push back.
For the specific task of "does this code do what it's supposed to do," Claude. For "generate a utility function matching these requirements," Claude. For "quickly check if there are syntax errors," Gemini at one-third the cost is fine.
### Planning and Task Decomposition
| Model | Avg Quality | Logical Consistency | Edge Case Coverage | Cost/task |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 87.3 | 91% | 78% | $0.0095 |
| GPT-4o | 83.1 | 85% | 69% | $0.0112 |
| Gemini 2.0 Flash | 74.6 | 79% | 61% | $0.0038 |
Planning is where the quality gap widens. Claude's task decomposition is substantially better at handling ambiguity — it asks clarifying questions more appropriately, makes better assumptions when it doesn't ask, and catches edge cases that the other models don't mention.
The cost-per-task for planning is higher across all models because the prompts and outputs are longer. This is the task type where I'm most confident spending more for better quality. A bad plan generates bad work downstream; a good plan compounds.
One unexpected finding: GPT-4o has a tendency to create plans with more steps than necessary. It decomposes tasks very granularly, which sounds good but often produces plans that feel like theater — lots of activity, unclear priorities. Claude creates slightly fewer steps with clearer decision points.
### Structured Output (JSON Generation)
| Model | Schema Validity | Avg Attempts to Valid | Cost/valid output |
|---|---|---|---|
| Claude Sonnet 4.6 | 98.3% | 1.03 | $0.0041 |
| GPT-4o | 96.1% | 1.09 | $0.0058 |
| Gemini 2.0 Flash | 91.7% | 1.19 | $0.0028 |
All three are good here, with Claude slightly ahead. The metric that actually matters for production is "average attempts to valid output" — because when a model fails schema validation, you retry with an error message, which costs another API call.
Gemini's 91.7% validity rate sounds high, but at scale it means roughly 1 in 12 structured outputs requires a retry. At 50 structured outputs per day, that's 4 retries per day, which compounds into meaningful cost and latency overhead.
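That back-of-envelope number is just the first-pass failure rate times daily volume:

```python
def expected_retries_per_day(validity_rate: float, outputs_per_day: int) -> float:
    """Expected first-pass failures per day; each one costs at
    least one extra API call (and its latency)."""
    return (1 - validity_rate) * outputs_per_day
```

At Gemini's 91.7% validity and 50 outputs/day, that works out to roughly 4 extra calls per day before counting second-attempt failures.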
```python
import json

import jsonschema


# This is my retry wrapper for structured outputs: validate the JSON,
# feed the validation error back into the prompt, and retry.
def get_structured_output(prompt: str, schema: dict, model: str, max_retries: int = 3) -> dict:
    for attempt in range(max_retries):
        response = call_llm(prompt, model)
        try:
            data = json.loads(response)
            jsonschema.validate(data, schema)
            return data
        except (json.JSONDecodeError, jsonschema.ValidationError) as e:
            if attempt == max_retries - 1:
                raise
            prompt = f"{prompt}\n\nPrevious output was invalid: {e}\nPlease fix and retry."
    raise RuntimeError("Max retries exceeded")  # defensive; the loop returns or raises
```
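To exercise the retry path without burning API calls, you can use a trimmed variant with the LLM call injected and validation reduced to a required-keys check (swap in `jsonschema.validate` for real use). Everything below is my test scaffolding, not production code:

```python
import json


def get_structured_output_testable(prompt: str, required: set, call_llm, max_retries: int = 3) -> dict:
    """Same retry loop as the wrapper above, with the LLM call passed in
    and a simplified required-keys check standing in for schema validation."""
    for attempt in range(max_retries):
        response = call_llm(prompt)
        try:
            data = json.loads(response)
            missing = required - data.keys()
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return data
        except (json.JSONDecodeError, ValueError) as e:
            if attempt == max_retries - 1:
                raise
            prompt = f"{prompt}\n\nPrevious output was invalid: {e}\nPlease fix and retry."
    raise RuntimeError("Max retries exceeded")


# Stub that returns invalid JSON on the first call, valid JSON on the second,
# which simulates exactly one retry.
calls = []
def flaky_llm(prompt: str) -> str:
    calls.append(prompt)
    return '{"status": "ok"}' if len(calls) > 1 else "oops, not json"


result = get_structured_output_testable("Return status as JSON.", {"status"}, flaky_llm)
# result == {"status": "ok"} after exactly two calls
```

The second prompt handed to the stub contains the validation error text, which is the feedback mechanism the wrapper relies on.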
## How I actually allocate tasks in production
Given these results, here's my current model routing:
| Task type | Model | Reasoning |
|---|---|---|
| Long-form content (500+ words) | Claude Sonnet 4.6 | Voice consistency matters |
| Short content (<100 words) | Gemini 2.0 Flash | No quality difference, half cost |
| Code review | Claude Sonnet 4.6 | 9% correctness gap compounds |
| Quick syntax check | Gemini 2.0 Flash | Sufficient for this use case |
| Planning | Claude Sonnet 4.6 | Bad plans cost more than the model savings |
| Structured output (critical) | Claude Sonnet 4.6 | Retry overhead matters at scale |
| Structured output (non-critical) | Gemini 2.0 Flash | Acceptable retry rate |
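In code, the routing table collapses to a small dispatch function. The task names, model strings, and the 100-word threshold here are illustrative choices of mine, not the production implementation:

```python
def route_model(task_type: str, word_count: int = 0, critical: bool = False) -> str:
    """Map a task to a model per the routing table above (sketch)."""
    if task_type == "content":
        # Short content shows no quality gap, so take the cheaper model.
        return "gemini-2.0-flash" if word_count < 100 else "claude-sonnet-4.6"
    if task_type in ("code_review", "planning"):
        return "claude-sonnet-4.6"  # quality compounds downstream
    if task_type == "syntax_check":
        return "gemini-2.0-flash"  # sufficient at a third of the cost
    if task_type == "structured_output":
        return "claude-sonnet-4.6" if critical else "gemini-2.0-flash"
    raise ValueError(f"unknown task type: {task_type}")
```

Keeping the routing in one function makes it trivial to re-run the benchmark and flip a single branch when the numbers change.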
This routing reduces my monthly LLM spend by roughly 35% compared to running everything on Claude.
## Limitations and caveats
This is one system's data. My agents have specific prompt patterns, task distributions, and quality thresholds. Your mileage will vary, possibly significantly.
Models change. GPT-4o in March 2026 is not the same as GPT-4o six months ago. These benchmarks will drift.
I'm biased toward Claude. I use Claude as my primary development tool. My prompts are probably better calibrated for Claude's response patterns. I tried to control for this, but I can't fully rule it out.
"Quality" is operationalized, not objective. My scoring system reflects what matters for my use case. If your priorities are different, reweight accordingly.
## Running your own benchmark
The core loop is simple:
```python
import time
from statistics import median


def benchmark_task(prompt_fn, models, n_runs=30):
    results = {}
    for model in models:
        latencies, costs, qualities = [], [], []
        for _ in range(n_runs):
            start = time.time()
            response, cost = call_model(model, prompt_fn())
            latency = time.time() - start
            quality = score_response(response)
            latencies.append(latency)
            costs.append(cost)
            qualities.append(quality)
        results[model] = {
            "median_latency": median(latencies),
            "p95_latency": sorted(latencies)[int(0.95 * n_runs)],
            "avg_cost": sum(costs) / len(costs),
            "avg_quality": sum(qualities) / len(qualities),
        }
    return results
```
The hard part is defining `score_response`. That's the part that requires knowing what "good" means for your specific task — which is worth figuring out before you pick a model.
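For illustration, here's what `score_response` could look like for a hypothetical "summary plus three bullet points" task. The checks and weights are arbitrary stand-ins, not what I run in production:

```python
def score_response(response: str) -> float:
    """Example scorer: cheap proxies for 'good' on one specific task.

    Checks length-in-spec, expected bullet structure, and absence of
    filler words; weights (40/40/20) are arbitrary and should be tuned.
    """
    score = 0.0
    words = response.split()
    if 50 <= len(words) <= 150:  # length within spec
        score += 40
    bullets = [line for line in response.splitlines() if line.strip().startswith("- ")]
    if len(bullets) == 3:  # exactly three bullets requested
        score += 40
    if not any(w.lower() in {"delve", "tapestry"} for w in words):  # banned filler
        score += 20
    return score
```

The point isn't these particular checks; it's that every check is mechanical, so the score is reproducible across 30 runs and three models.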
## The unsatisfying conclusion
The right answer is "it depends," which is almost always unsatisfying and almost always true.
For complex, high-stakes tasks where quality compounds downstream, Claude. For simple, high-volume tasks where speed and cost matter, Gemini. For everything else, benchmark it.
The specific numbers in this post will age. The framework for thinking about model routing — by task type, quality requirements, cost sensitivity, and volume — probably won't.
I run six AI agents in production. This is part of an ongoing build-in-public series on what actually works and what doesn't.
→ Read the full architecture post on write.as/timzinin
→ All benchmark code: github.com/TimmyZinin (coming soon)