Every few months a new leaderboard claims one model has leapt ahead. The problem: those benchmarks usually test toy problems, not the messy, context-heavy tasks you encounter daily. I spent two weeks running the same 30 real-world coding tasks against GPT-4.1, Claude Sonnet 4.5, and Gemini 2.5 Pro to see which one I would actually trust with a production codebase.
The Test Setup
All tests were run through official APIs at default temperature. I avoided playground UIs to eliminate any prompt preprocessing on the provider side. Tasks fell into three categories:
- Code generation — write a working implementation from an underspecified requirement
- Debugging — identify the root cause and fix a broken snippet
- Context-heavy analysis — reason about a 3,000-line file
Here is the Python harness I used to call each API and log token counts alongside wall-clock latency:
import time
from dataclasses import dataclass
from openai import OpenAI
import anthropic
import google.generativeai as genai
@dataclass
class Result:
model: str
task_id: str
output: str
input_tokens: int
output_tokens: int
latency_ms: int
def run_task(task_id: str, prompt: str) -> list[Result]:
results = []
# GPT-4.1
oai = OpenAI()
t0 = time.monotonic()
r = oai.chat.completions.create(
model="gpt-4.1",
messages=[{"role": "user", "content": prompt}],
)
results.append(Result(
model="gpt-4.1",
task_id=task_id,
output=r.choices[0].message.content,
input_tokens=r.usage.prompt_tokens,
output_tokens=r.usage.completion_tokens,
latency_ms=int((time.monotonic() - t0) * 1000),
))
# Claude Sonnet 4.5
ant = anthropic.Anthropic()
t0 = time.monotonic()
r = ant.messages.create(
model="claude-sonnet-4-5",
max_tokens=4096,
messages=[{"role": "user", "content": prompt}],
)
results.append(Result(
model="claude-sonnet-4-5",
task_id=task_id,
output=r.content[0].text,
input_tokens=r.usage.input_tokens,
output_tokens=r.usage.output_tokens,
latency_ms=int((time.monotonic() - t0) * 1000),
))
# Gemini 2.5 Pro
genai.configure(api_key="YOUR_GEMINI_KEY")
m = genai.GenerativeModel("gemini-2.5-pro")
t0 = time.monotonic()
r = m.generate_content(prompt)
results.append(Result(
model="gemini-2.5-pro",
task_id=task_id,
output=r.text,
input_tokens=0,
output_tokens=0,
latency_ms=int((time.monotonic() - t0) * 1000),
))
return results
Tasks were scored manually on a 0–3 scale: 0 = broken/wrong, 1 = partially correct, 2 = correct but suboptimal, 3 = correct and clean. Two reviewers scored independently and averaged.
Code Generation Quality
For straightforward tasks — "write a FastAPI endpoint that validates a JWT and returns user claims" — all three performed within noise of each other. The divergence showed up on ambiguous specs and tasks with non-obvious edge cases.
GPT-4.1 handled ambiguity by asking clarifying questions before generating. Useful in a conversational context, disruptive inside an automated pipeline. When forced to produce output directly, it made reasonable assumptions and documented them inline.
Claude Sonnet 4.5 produced the most consistent structure. It separated concerns — validation, business logic, error handling — without being asked. On a concurrent job-queue task in Python, it was the only model that reached immediately for asyncio.Lock rather than a bare asyncio.Queue, which was the correct call for that spec.
Gemini 2.5 Pro surprised on complexity. It outperformed both on a task requiring reasoning across three SQL JOIN conditions, and its attached explanation was the most detailed of the three.
Aggregate score across 12 generation tasks: Claude Sonnet 4.5 (2.41), Gemini 2.5 Pro (2.38), GPT-4.1 (2.29).
Debugging and Root-Cause Analysis
This is where real gaps appear. I fed each model a broken Python rate limiter — an off-by-one in the sliding window boundary check — and asked for the root cause and a corrected implementation.
# Broken implementation — spot the bug
from collections import deque
import time
_windows: dict[str, deque] = {}
def is_allowed(user_id: str, limit: int, window_s: int) -> bool:
now = time.time()
if user_id not in _windows:
_windows[user_id] = deque()
window = _windows[user_id]
# Bug: should be < not <=
# Timestamps exactly on the boundary get evicted, inflating capacity
while window and window[0] <= now - window_s:
window.popleft()
if len(window) >= limit:
return False
window.append(now)
return True
GPT-4.1 found the <= vs < issue immediately, but its proposed fix introduced a secondary edge case around clock skew. A follow-up prompt was needed to converge on a correct solution.
Claude Sonnet 4.5 caught the boundary bug, flagged that _windows as a module-level mutable global was a separate problem, and returned a clean class-based rewrite without being asked — one pass, no follow-up required.
Gemini 2.5 Pro identified the bug but explained it in terms of the boundary condition abstractly rather than pointing to the specific line. Technically accurate, but slower to act on.
For debugging workflows, Claude Sonnet 4.5 was the most actionable across all ten debugging tasks. Gemini's explanations were thorough but verbose. GPT-4.1 was fast but occasionally introduced new issues in the fix.
Context Window Handling in Practice
All three models now support 128K+ token contexts, so raw capacity is not the differentiator. What matters is how accurately the model attends to information far from the prompt boundary.
I loaded a 2,800-line Go service and asked: "List all places where database transactions are opened but not deferred for rollback." This is a retrieval-plus-reasoning task with a concrete right answer.
- GPT-4.1: found 4 of 6 locations
- Claude Sonnet 4.5: found 5 of 6, and correctly noted that one was intentional (a read-only transaction)
- Gemini 2.5 Pro: found all 6, but flagged 2 false positives
For security-focused code review — the kind you would run against controls defined in a security hardening checklist — precision matters more than recall. A false positive wastes an engineer's time; a false negative is a missed vulnerability. Claude Sonnet 4.5 had the best precision/recall balance across all ten context tasks.
Cost and Latency
Normalized per 1M output tokens as of May 2026:
| Model | Input | Output | Median latency |
|---|---|---|---|
| GPT-4.1 | ~$2 | ~$8 | ~1.2s |
| Claude Sonnet 4.5 | ~$3 | ~$15 | ~1.8s |
| Gemini 2.5 Pro | ~$3.50 | ~$10 | ~2.1s (high variance) |
GPT-4.1 wins on throughput cost. For high-volume generation pipelines where correctness tolerances are lower, that gap is real. For interactive use or automated review where one wrong answer costs more than the price difference, Claude Sonnet 4.5's consistency is worth the premium.
The Takeaway
There is no universal winner. Pick by workload:
- Batch code generation at scale — GPT-4.1 on cost, Claude Sonnet 4.5 on quality
- Debugging and code review — Claude Sonnet 4.5, consistently most actionable
- Complex multi-step reasoning — Gemini 2.5 Pro when recall matters, Claude Sonnet 4.5 when precision matters
- Pipeline integration — GPT-4.1 has the broadest tooling support today
The most important advice: run your own benchmark on your actual prompts. Aggregate scores on synthetic tasks tell you where to start looking, not where to stop.
I run AYI NEDJIMI Consultants, a cybersecurity consulting firm. We publish free security hardening checklists — PDF and Excel.
Top comments (0)