I spent three weeks and about $340 benchmarking three LLMs on the actual tasks my autonomous agents run in production. Not the demo tasks. Not "summarize this article." The unglamorous, repetitive, occasionally weird tasks that keep a six-agent system running.
Here's what I found, including the parts that surprised me.
## Why this benchmark is different
Most LLM benchmarks test general reasoning on clean, standardized tasks. That's useful for comparing models in theory. It's less useful for answering "which model should I pay for when my agent needs to do X twelve times a day, every day, indefinitely."
My agents perform four categories of tasks:
- Content generation — drafting posts, writing summaries, creating structured data from unstructured inputs
- Code review and generation — reviewing PRs, generating utility functions, catching obvious bugs
- Planning and task decomposition — breaking a goal into subtasks, prioritizing a backlog, deciding what to work on next
- Tool use and structured output — calling APIs, generating valid JSON, following format constraints reliably
I tested Claude Sonnet 4.6, GPT-4o, and Gemini 2.0 Flash on each category. Each test ran 30 times to smooth out variance. I measured cost per task, latency (median and p95), and a quality score that I'll explain below.
## Quality scoring methodology
Quality is subjective until you operationalize it. Here's what I used:
For content generation, I scored three dimensions: does it match the specified tone/voice (0-3), is the structure correct (0-3), and does it contain factual errors or hallucinations (0-3, reverse-scored). Max 9 points, normalized to 0-100.
For code tasks: automated linting + unit tests for correctness, manual review for idiomatic quality. Pass/fail for correctness, 0-100 for quality.
For planning: I had a second model (always Claude) review the plans for logical consistency, coverage of edge cases, and appropriate scope. Subjective, but consistent.
For structured output: JSON schema validation. 100 if valid, 0 if not. No partial credit for "almost valid."
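Two of these scorers are mechanical enough to sketch. The function names and shape below are my shorthand for illustration, not code from the benchmark repo:

```python
import json

import jsonschema


def score_content(tone: int, structure: int, hallucination: int) -> float:
    """Normalize the three 0-3 content dimensions to 0-100.

    `hallucination` is reverse-scored: a 3 (heavily hallucinated)
    contributes zero points.
    """
    if not all(0 <= d <= 3 for d in (tone, structure, hallucination)):
        raise ValueError("each dimension must be in 0..3")
    raw = tone + structure + (3 - hallucination)  # max 9 points
    return round(raw / 9 * 100, 1)


def score_structured(raw: str, schema: dict) -> int:
    """All-or-nothing: 100 for schema-valid JSON, 0 otherwise."""
    try:
        jsonschema.validate(json.loads(raw), schema)
        return 100
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return 0
```

The reverse-scoring is the only non-obvious part: a perfectly clean output scores `score_content(3, 3, 0)`, not `(3, 3, 3)`.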
I'm publishing all test prompts and evaluation criteria in this repo — though fair warning, it's not polished documentation.
## Results by category
### Content Generation
| Model | Avg Quality | Median Latency | p95 Latency | Cost/task |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 84.2 | 3.1s | 6.8s | $0.0048 |
| GPT-4o | 79.6 | 2.7s | 5.9s | $0.0062 |
| Gemini 2.0 Flash | 71.4 | 1.9s | 4.2s | $0.0021 |
Claude wins on quality. Gemini wins on cost and speed. GPT-4o is the middle option that doesn't clearly win anything.
What drives the quality difference: Claude is significantly better at maintaining a consistent voice across multiple paragraphs. When I have a 400-word post with a specific tone signature, Claude holds it together more reliably. Gemini tends to drift — the opening and closing feel like different models wrote them.
For short-form content (under 100 words), Gemini's quality gap shrinks substantially. Under 50 words, there's no meaningful quality difference among the three. That's a useful insight: Gemini at half the cost for anything short.
### Code Review and Generation
| Model | Correctness | Avg Quality | Median Latency | Cost/task |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 93% | 88.1 | 4.2s | $0.0071 |
| GPT-4o | 89% | 82.4 | 3.8s | $0.0089 |
| Gemini 2.0 Flash | 84% | 74.8 | 2.8s | $0.0031 |
Claude is meaningfully better here. The nine-percentage-point correctness gap versus Gemini compounds when you're running code review 20+ times a day.
The failure modes are different and worth knowing about:
- Claude fails by being overly conservative. It flags things that aren't bugs. False positives.
- GPT-4o fails by missing context. It treats each file in isolation even when I provide related files.
- Gemini fails by being too agreeable. It'll say code looks fine when it has obvious issues, then correct itself if you push back.
For the specific task of "does this code do what it's supposed to do," Claude. For "generate a utility function matching these requirements," Claude. For "quickly check if there are syntax errors," Gemini at one-third the cost is fine.
### Planning and Task Decomposition
| Model | Avg Quality | Logical Consistency | Edge Case Coverage | Cost/task |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 87.3 | 91% | 78% | $0.0095 |
| GPT-4o | 83.1 | 85% | 69% | $0.0112 |
| Gemini 2.0 Flash | 74.6 | 79% | 61% | $0.0038 |
Planning is where the quality gap widens. Claude's task decomposition is substantially better at handling ambiguity — it asks clarifying questions more appropriately, makes better assumptions when it doesn't ask, and catches edge cases that the other models don't mention.
The cost-per-task for planning is higher across all models because the prompts and outputs are longer. This is the task type where I'm most confident spending more for better quality. A bad plan generates bad work downstream; a good plan compounds.
One unexpected finding: GPT-4o has a tendency to create plans with more steps than necessary. It decomposes tasks very granularly, which sounds good but often produces plans that feel like theater — lots of activity, unclear priorities. Claude creates slightly fewer steps with clearer decision points.
### Structured Output (JSON Generation)
| Model | Schema Validity | Avg Attempts to Valid | Cost/valid output |
|---|---|---|---|
| Claude Sonnet 4.6 | 98.3% | 1.03 | $0.0041 |
| GPT-4o | 96.1% | 1.09 | $0.0058 |
| Gemini 2.0 Flash | 91.7% | 1.19 | $0.0028 |
All three are good here, with Claude slightly ahead. The metric that actually matters for production is "average attempts to valid output" — because when a model fails schema validation, you retry with an error message, which costs another API call.
Gemini's 91.7% validity rate sounds high, but at scale it means roughly 1 in 12 structured outputs requires a retry. At 50 structured outputs per day, that's 4 retries per day, which compounds into meaningful cost and latency overhead.
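That back-of-envelope number is just the first-pass failure rate times daily volume:

```python
def expected_retries_per_day(validity_rate: float, outputs_per_day: int) -> float:
    """Expected first-pass failures per day; each one costs at
    least one extra API call (and its latency)."""
    return (1 - validity_rate) * outputs_per_day
```

At Gemini's 91.7% validity and 50 outputs/day, that works out to roughly 4 extra calls per day before counting second-attempt failures.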
```python
import json

import jsonschema


# This is my retry wrapper for structured outputs: validate the JSON,
# feed the validation error back into the prompt, and retry.
def get_structured_output(prompt: str, schema: dict, model: str, max_retries: int = 3) -> dict:
    for attempt in range(max_retries):
        response = call_llm(prompt, model)
        try:
            data = json.loads(response)
            jsonschema.validate(data, schema)
            return data
        except (json.JSONDecodeError, jsonschema.ValidationError) as e:
            if attempt == max_retries - 1:
                raise
            prompt = f"{prompt}\n\nPrevious output was invalid: {e}\nPlease fix and retry."
    raise RuntimeError("Max retries exceeded")  # defensive; the loop returns or raises
```
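To exercise the retry path without burning API calls, you can use a trimmed variant with the LLM call injected and validation reduced to a required-keys check (swap in `jsonschema.validate` for real use). Everything below is my test scaffolding, not production code:

```python
import json


def get_structured_output_testable(prompt: str, required: set, call_llm, max_retries: int = 3) -> dict:
    """Same retry loop as the wrapper above, with the LLM call passed in
    and a simplified required-keys check standing in for schema validation."""
    for attempt in range(max_retries):
        response = call_llm(prompt)
        try:
            data = json.loads(response)
            missing = required - data.keys()
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return data
        except (json.JSONDecodeError, ValueError) as e:
            if attempt == max_retries - 1:
                raise
            prompt = f"{prompt}\n\nPrevious output was invalid: {e}\nPlease fix and retry."
    raise RuntimeError("Max retries exceeded")


# Stub that returns invalid JSON on the first call, valid JSON on the second,
# which simulates exactly one retry.
calls = []
def flaky_llm(prompt: str) -> str:
    calls.append(prompt)
    return '{"status": "ok"}' if len(calls) > 1 else "oops, not json"


result = get_structured_output_testable("Return status as JSON.", {"status"}, flaky_llm)
# result == {"status": "ok"} after exactly two calls
```

The second prompt handed to the stub contains the validation error text, which is the feedback mechanism the wrapper relies on.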
## How I actually allocate tasks in production
Given these results, here's my current model routing:
| Task type | Model | Reasoning |
|---|---|---|
| Long-form content (500+ words) | Claude Sonnet 4.6 | Voice consistency matters |
| Short content (<100 words) | Gemini 2.0 Flash | No quality difference, half cost |
| Code review | Claude Sonnet 4.6 | 9% correctness gap compounds |
| Quick syntax check | Gemini 2.0 Flash | Sufficient for this use case |
| Planning | Claude Sonnet 4.6 | Bad plans cost more than the model savings |
| Structured output (critical) | Claude Sonnet 4.6 | Retry overhead matters at scale |
| Structured output (non-critical) | Gemini 2.0 Flash | Acceptable retry rate |
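In code, the routing table collapses to a small dispatch function. The task names, model strings, and the 100-word threshold here are illustrative choices of mine, not the production implementation:

```python
def route_model(task_type: str, word_count: int = 0, critical: bool = False) -> str:
    """Map a task to a model per the routing table above (sketch)."""
    if task_type == "content":
        # Short content shows no quality gap, so take the cheaper model.
        return "gemini-2.0-flash" if word_count < 100 else "claude-sonnet-4.6"
    if task_type in ("code_review", "planning"):
        return "claude-sonnet-4.6"  # quality compounds downstream
    if task_type == "syntax_check":
        return "gemini-2.0-flash"  # sufficient at a third of the cost
    if task_type == "structured_output":
        return "claude-sonnet-4.6" if critical else "gemini-2.0-flash"
    raise ValueError(f"unknown task type: {task_type}")
```

Keeping the routing in one function makes it trivial to re-run the benchmark and flip a single branch when the numbers change.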
This routing reduces my monthly LLM spend by roughly 35% compared to running everything on Claude.
## Limitations and caveats
This is one system's data. My agents have specific prompt patterns, task distributions, and quality thresholds. Your mileage will vary, possibly significantly.
Models change. GPT-4o in March 2026 is not the same as GPT-4o six months ago. These benchmarks will drift.
I'm biased toward Claude. I use Claude as my primary development tool. My prompts are probably better calibrated for Claude's response patterns. I tried to control for this, but I can't fully rule it out.
"Quality" is operationalized, not objective. My scoring system reflects what matters for my use case. If your priorities are different, reweight accordingly.
## Running your own benchmark
The core loop is simple:
```python
import time
from statistics import median


def benchmark_task(prompt_fn, models, n_runs=30):
    results = {}
    for model in models:
        latencies, costs, qualities = [], [], []
        for _ in range(n_runs):
            start = time.time()
            response, cost = call_model(model, prompt_fn())
            latency = time.time() - start
            quality = score_response(response)
            latencies.append(latency)
            costs.append(cost)
            qualities.append(quality)
        results[model] = {
            "median_latency": median(latencies),
            "p95_latency": sorted(latencies)[int(0.95 * n_runs)],
            "avg_cost": sum(costs) / len(costs),
            "avg_quality": sum(qualities) / len(qualities),
        }
    return results
```
The hard part is defining `score_response`. That's the part that requires knowing what "good" means for your specific task — which is worth figuring out before you pick a model.
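For illustration, here's what `score_response` could look like for a hypothetical "summary plus three bullet points" task. The checks and weights are arbitrary stand-ins, not what I run in production:

```python
def score_response(response: str) -> float:
    """Example scorer: cheap proxies for 'good' on one specific task.

    Checks length-in-spec, expected bullet structure, and absence of
    filler words; weights (40/40/20) are arbitrary and should be tuned.
    """
    score = 0.0
    words = response.split()
    if 50 <= len(words) <= 150:  # length within spec
        score += 40
    bullets = [line for line in response.splitlines() if line.strip().startswith("- ")]
    if len(bullets) == 3:  # exactly three bullets requested
        score += 40
    if not any(w.lower() in {"delve", "tapestry"} for w in words):  # banned filler
        score += 20
    return score
```

The point isn't these particular checks; it's that every check is mechanical, so the score is reproducible across 30 runs and three models.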
## The unsatisfying conclusion
The right answer is "it depends," which is almost always unsatisfying and almost always true.
For complex, high-stakes tasks where quality compounds downstream, Claude. For simple, high-volume tasks where speed and cost matter, Gemini. For everything else, benchmark it.
The specific numbers in this post will age. The framework for thinking about model routing — by task type, quality requirements, cost sensitivity, and volume — probably won't.
I run six AI agents in production. This is part of an ongoing build-in-public series on what actually works and what doesn't.
→ Read the full architecture post on write.as/timzinin
→ All benchmark code: github.com/TimmyZinin (coming soon)