Three weeks ago, I was knee-deep in a gnarly refactor — converting a multi-tenant Express API to a proper service-layer architecture — and I realized I'd been tab-switching between three different AI assistants without a clear sense of why I was picking any particular one for any particular task. Just vibes, basically. That bothered me. So I stopped, set up a proper (if informal) comparison, and spent the next two weeks routing real work through Claude Sonnet 4.6, GPT-4o, and Gemini 2.0 Pro with actual intention.
This is what I found.
My Setup and What Counts as a Real Test
Quick context: I'm one of four engineers at a B2B SaaS startup. TypeScript on the frontend (Next.js), Python on the backend (FastAPI), PostgreSQL, deployed on AWS. I use VS Code. Our team isn't doing anything exotic, but we do have a moderately complex codebase — around 80k lines of production code when you exclude tests and generated files.
I was not running synthetic benchmarks. I ran actual work tasks:
- Code review of pull requests (mine and teammates')
- Debugging sessions where I pasted stack traces and asked for help
- Writing migration scripts
- Explaining unfamiliar code patterns (I inherited some truly cursed SQLAlchemy usage)
- Drafting technical specs and ADRs
- Refactoring with constraints ("make this function testable without changing the public interface")
I used each model through its API, not the chat UI — partly because that's how our internal tooling works, and partly because UI polish isn't really the interesting variable here.
Where Each Model Actually Earns Its Keep
Claude Sonnet 4.6 is where I spent most of my time, and honestly, it was the most consistent across task types. The thing I kept noticing was how it handled ambiguity. When I gave it underspecified instructions — "can you refactor this so it's easier to test" — it didn't just pick an interpretation and run with it. It would surface the tradeoffs. "I can extract the database calls into an injected dependency, but that'll change your function signature. Alternatively, I can wrap the db client in a module-level mock point. Which fits better with how your tests are set up?" That kind of back-and-forth saved me from having to undo half-finished changes more than once.
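To make that tradeoff concrete, here is a minimal sketch of the two options. Every name in it (`FakeDb`, `deps`, `count_active_users`) is a hypothetical stand-in rather than code from the real codebase, and the "database" is just an in-memory object.

```python
from types import SimpleNamespace
from unittest import mock


class FakeDb:
    """Hypothetical stand-in for a real database client."""

    def __init__(self, rows):
        self.rows = rows

    def fetch_users(self):
        return list(self.rows)


# Module-level "mock point": the one place tests swap out the client (option 2).
deps = SimpleNamespace(
    db=FakeDb([{"id": 1, "active": True}, {"id": 2, "active": False}])
)


# Option 1: inject the dependency. Simple to test, but the signature changes.
def count_active_users_injected(db):
    return sum(1 for user in db.fetch_users() if user["active"])


# Option 2: keep the public signature and resolve the client through `deps`.
def count_active_users():
    return sum(1 for user in deps.db.fetch_users() if user["active"])


def demo():
    test_db = FakeDb([{"id": 8, "active": True}, {"id": 9, "active": True}])
    # Option 1: pass the test double directly.
    injected = count_active_users_injected(test_db)
    # Option 2: patch the mock point only for the duration of the test.
    with mock.patch.object(deps, "db", test_db):
        patched = count_active_users()
    # Outside the patch, the original client is back in place.
    return injected, patched, count_active_users()
```

Option 1 makes the dependency explicit at every call site; option 2 leaves callers untouched but hides the dependency in module state. Which one fits depends on how the existing tests are wired, which is exactly the question Claude asked.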
For code generation, Claude's outputs tended to be more conservative — fewer fancy patterns, more boring correct code. I mean that as a compliment. I don't need the AI to show off.
GPT-4o is still excellent at a specific thing: rapid-fire, back-and-forth debugging. I have a pretty conversational debugging style where I'll paste a stack trace, get a theory, try something, paste the new error, and iterate. GPT-4o handles that cadence well. It's fast, it tracks context well within a conversation, and it doesn't seem to "forget" the architecture I described three exchanges ago. I also found it better at handling mixed-language contexts — like when I have a Python script that generates TypeScript type definitions, and I need reasoning that spans both.
What surprised me was how GPT-4o handled open-ended code review. I pasted a 300-line PR and asked for a thorough review. It gave me a list of issues, all technically accurate, but... it felt like a linter with better prose. It found the things, but it didn't notice that the real problem was a leaky abstraction two levels up that made several of the specific issues inevitable. Claude caught that. Gemini missed it too.
Gemini 2.0 Pro: I'll be honest, I came into this more skeptical of Gemini than of the others, based on earlier experiences. Some of that skepticism was warranted and some wasn't. Where Gemini genuinely impressed me was long-context document work. I fed it a 40,000-token technical specification and asked it to identify inconsistencies. It did this well: better than I expected, and arguably better than the other two at that specific task. It also handled multimodal stuff smoothly when I dropped in architecture diagrams, though that's less of a daily need for me.
Where it fell short: code generation for anything non-trivial. I asked it to write a FastAPI dependency that validates a JWT, checks organization membership, and returns a typed user context object. The code it wrote was... plausible-looking but wrong in three separate ways. It used a deprecated Pydantic v1 pattern, it didn't handle the case where the org claim was missing (which is the interesting case), and it silently swallowed a potential database error. None of these were obvious bugs — they were the kind of thing that would survive a casual code review and surface in production.
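For reference, here is a rough, framework-agnostic sketch of the logic that task calls for, written to avoid the second and third failure modes above. It is not the code any of the models produced, and it makes several assumptions: the JWT signature has already been verified upstream, the organization lives in a claim named `org`, and `lookup_role` is a hypothetical membership query. It uses a stdlib dataclass to stay self-contained; in a real FastAPI app this would more likely be a Pydantic v2 model wrapped in a dependency.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


class AuthError(Exception):
    """Raised when the claims do not identify a valid organization member."""


@dataclass(frozen=True)
class UserContext:
    user_id: str
    org_id: str
    role: str


def build_user_context(
    claims: Mapping[str, object],
    lookup_role: Callable[[str, str], Optional[str]],
) -> UserContext:
    """Turn already-verified JWT claims into a typed user context.

    `lookup_role` (hypothetical) returns the user's role in the org, or None
    if they are not a member. Any exception it raises, such as a database
    error, propagates to the caller instead of being swallowed.
    """
    user_id = claims.get("sub")
    org_id = claims.get("org")  # claim name is an assumption
    if not isinstance(user_id, str) or not user_id:
        raise AuthError("token has no subject")
    if not isinstance(org_id, str) or not org_id:
        # The interesting case: an otherwise valid token with no org claim.
        raise AuthError("token has no organization claim")
    role = lookup_role(user_id, org_id)
    if role is None:
        raise AuthError("user is not a member of this organization")
    return UserContext(user_id=user_id, org_id=org_id, role=role)
```

The point of the sketch is the shape of the checks: a missing org claim is an explicit rejection, and a failing lookup surfaces as an error rather than as an unauthenticated-looking `None`.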
The Long Context Problem Is Real
One of the things I actually needed to do during this period was onboard a new contractor. Part of that involved explaining how our authentication middleware chain works, which is... a lot. Several files, some historical decisions baked into the structure, a non-obvious dependency on a third-party library.
I took the same approach with each model: dump the relevant files as context, then ask "explain how auth works end-to-end and flag anything that looks fragile."
# Quick example of how I was structuring these calls
# (simplified — real version has retry logic and token counting)
import anthropic

client = anthropic.Anthropic()


def load_files(paths: list[str]) -> str:
    chunks = []
    for path in paths:
        with open(path) as f:
            # Label each file so the model can reason about boundaries
            chunks.append(f"### {path}\n\n{f.read()}")
    return "\n\n".join(chunks)


context = load_files([
    "src/middleware/auth.py",
    "src/middleware/rate_limit.py",
    "src/deps/user_context.py",
    "src/lib/jwt.py",
])

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": (
            f"{context}\n\n"
            "Explain how auth works end-to-end across these files. "
            "Flag anything that looks fragile or that a new engineer might misunderstand."
        )
    }]
)
Claude's response was the most useful — it traced the actual request lifecycle, identified a spot where we were doing a database lookup that could be cached (which I already knew about but hadn't documented), and flagged a subtle race condition in how we were invalidating sessions. Whether the race condition is real or theoretical, I'm still not sure, but it was worth knowing about.
GPT-4o gave me a good structural explanation but didn't flag anything I didn't already know.
Gemini's summary was accurate but high-level — more like documentation than analysis.
Your mileage may vary here. If your main need is "explain this codebase to me," all three are honestly decent. If you need "critique this codebase," Claude's the one I'd reach for.
The Mistake I Made That Skewed My Early Results
About four days in, I realized I was unconsciously writing better prompts for Claude than for the other two. I'd been using Claude via API longer, I knew roughly what worked, and I was applying that knowledge without realizing it.
So I spent an evening writing standardized prompts for a set of five benchmark tasks and ran those exact prompts through all three models. The results shifted a little — GPT-4o did better on debugging tasks than my naturalistic testing suggested, and Gemini did worse on code generation than I'd expected based on its marketing. I'm not going to claim my two-week test is rigorous science. But I tried to correct for the bias once I noticed it.
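The standardized run is simple to sketch: every prompt goes, byte for byte, to every model. The version below is an illustration, not the actual harness. The model callables are offline stubs standing in for real API clients, and the task names and prompts are made up.

```python
from typing import Callable, Dict

# A "model" here is any callable that takes a prompt and returns a reply.
ModelFn = Callable[[str], str]


def run_benchmark(
    prompts: Dict[str, str],
    models: Dict[str, ModelFn],
) -> Dict[str, Dict[str, str]]:
    """Send each prompt, unchanged, to each model and collect the replies."""
    results: Dict[str, Dict[str, str]] = {}
    for task, prompt in prompts.items():
        results[task] = {name: ask(prompt) for name, ask in models.items()}
    return results


# Hypothetical stubs so the sketch runs without network access or API keys.
stub_models: Dict[str, ModelFn] = {
    "model_a": lambda prompt: f"A saw {len(prompt)} chars",
    "model_b": lambda prompt: f"B saw {len(prompt)} chars",
}

example_prompts = {
    "debugging": "Here is a stack trace. What is the most likely cause?",
    "codegen": "Write a function that parses ISO 8601 dates.",
}

results = run_benchmark(example_prompts, stub_models)
```

Keeping the prompts in one dictionary is what removes the bias: nobody gets a hand-tuned version, so differences in the replies come from the models rather than from the person writing the prompt.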
The practical implication: if you're switching from a model you've used for a while, budget time to learn how to prompt the new one. This is obvious in retrospect and I still didn't do it at first.
Cost Reality After Two Weeks
I tracked my API spend across all three. Some rough numbers — these will shift as pricing changes, but the ratios are useful:
# Approximate spend over ~2 weeks of moderately heavy dev use
# Input-heavy workloads (lots of code context)
Claude Sonnet 4.6:  ~$47  (primary workhorse)
GPT-4o:             ~$31  (debugging sessions, second opinion)
Gemini 2.0 Pro:     ~$12  (long-doc tasks, experimentation)
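Spelled out, the ratios look like this. The snippet only does arithmetic on the three approximate figures above. Note that these are shares of total spend, so they reflect how much work went to each model, not per-token price.

```python
# Approximate two-week spend in USD, taken from the figures above.
spend_usd = {
    "Claude Sonnet 4.6": 47,
    "GPT-4o": 31,
    "Gemini 2.0 Pro": 12,
}

total = sum(spend_usd.values())
smallest = min(spend_usd.values())

# Percentage of total spend per model, rounded to one decimal place.
shares = {model: round(100 * cost / total, 1) for model, cost in spend_usd.items()}

# Each bill as a multiple of the smallest one.
multiples = {model: round(cost / smallest, 1) for model, cost in spend_usd.items()}

for model in spend_usd:
    print(f"{model:<18} {shares[model]:5.1f}% of spend, {multiples[model]:.1f}x the smallest bill")
```

On these numbers, roughly half of the ~$90 total went to Claude, about a third to GPT-4o, and the rest to Gemini.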
Gemini 2.0 Flash (not Pro) is dramatically cheaper if you can tolerate slightly lower quality — I used it for some lower-stakes tasks (summarizing internal docs, first-pass classification jobs) and it punched above its price point there. I'm not 100% sure it scales beyond simple tasks, but for high-volume background jobs it's worth a look.
If cost is a hard constraint for your team, Gemini Flash for grunt work + Claude Sonnet for anything that actually matters is a defensible stack.
What I'd Actually Tell My Team to Use
Not "it depends." Here's my actual recommendation:
Default to Claude Sonnet 4.6. For general-purpose development work — code review, debugging, refactoring, writing tests, understanding unfamiliar code — it's the most reliable across the widest range of tasks. It's also the one that most often catches problems rather than just answering the question you asked. That asymmetry matters when you're tired or moving fast.
Keep GPT-4o in your toolkit for iterative debugging sessions. If you're in a back-and-forth loop chasing a bug, GPT-4o's conversational flow is excellent. It's also worth having as a second opinion when Claude's answer doesn't feel right.
Use Gemini 2.0 Pro for long documents and anything multimodal. Reading a 50-page spec, extracting requirements from a PDF, making sense of a diagram: this is where it earns its place. Don't use it as your primary coding assistant.
If I had to pick just one: Claude. It's the one I trust when the stakes are higher than usual — production incidents, security-sensitive code, architectural decisions I'll have to live with. The others are good tools. This one feels like a careful colleague.
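If a team wanted to encode this recommendation in internal tooling, it could be as small as a lookup table. The sketch below is one hypothetical way to do that. The task categories are invented for illustration, and the strings are display names, not API model identifiers.

```python
from enum import Enum


class Task(Enum):
    CODE_REVIEW = "code_review"
    REFACTORING = "refactoring"
    ITERATIVE_DEBUGGING = "iterative_debugging"
    LONG_DOCUMENT = "long_document"
    MULTIMODAL = "multimodal"
    BULK_SUMMARIZATION = "bulk_summarization"


# Anything not listed in ROUTES falls through to the default.
DEFAULT_MODEL = "Claude Sonnet 4.6"

ROUTES = {
    Task.ITERATIVE_DEBUGGING: "GPT-4o",
    Task.LONG_DOCUMENT: "Gemini 2.0 Pro",
    Task.MULTIMODAL: "Gemini 2.0 Pro",
    Task.BULK_SUMMARIZATION: "Gemini 2.0 Flash",  # low-stakes grunt work
}


def pick_model(task: Task) -> str:
    """Return the display name of the model suggested for a task type."""
    return ROUTES.get(task, DEFAULT_MODEL)
```

Making Claude the fallback rather than an explicit entry mirrors the advice above: exceptions are listed, and everything else goes to the default.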
One last thing: all three models will confidently write plausible-looking wrong code. The difference is in how often, and whether the errors are the obvious kind or the subtle kind. Subtle wrong code is worse than obvious wrong code. Keep that in mind when evaluating any of them.