I got tired of the copy-paste workflow.
You know the one: write a prompt in ChatGPT, screenshot the result, open a new tab for Claude, paste the same prompt, screenshot again, repeat for Gemini. By the time you've done this across three models, you've forgotten what you were originally trying to accomplish.
So I started running structured comparisons using OneAIWorld, which sends the same prompt to multiple LLMs simultaneously and shows results side by side. I ran 50 prompts across GPT-4o, Claude 3, and Gemini 1.5 Pro, split across five categories. Here's what I actually found.
## The categories I tested
- Code generation — write a function, fix a bug, explain this snippet
- Structured output — generate JSON, create a table, format a report
- Creative writing — story openings, product descriptions, email copy
- Reasoning/logic — word problems, multi-step instructions, edge cases
- Summarisation — compress a long article into key points
## Code generation
Winner: GPT-4o (but it's close)
GPT-4o produced cleaner, more immediately runnable code with better inline comments. Claude was a close second but had a tendency to over-explain — you'd ask for a function and get three paragraphs of context before the code block.
Gemini struggled most with edge cases. On a prompt asking to handle null inputs gracefully, it produced code that would throw on undefined while claiming it handled null. GPT-4o and Claude both caught this.
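For context, the null-handling prompts were along these lines. Here's a minimal Python sketch of the behaviour being tested (the function name and defaults are hypothetical, not the exact prompt used):

```python
def safe_length(value, default=0):
    """Return len(value), falling back to a default when the input is
    None or has no length. This is the 'handle null inputs gracefully'
    case that tripped Gemini up: its version claimed to handle null
    but still threw on a missing value."""
    if value is None:
        return default
    try:
        return len(value)
    except TypeError:
        # Covers inputs like ints that have no length at all.
        return default
```

The failing pattern was code that checked for one kind of missing value while letting another slip through, so the explicit `None` check plus the `TypeError` fallback is the shape you want to see in a correct answer.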
One surprise: Claude was noticeably better at explaining existing code. If you paste a confusing snippet and ask "what does this do?", Claude's explanations were more accurate and better structured than the other two.
## Structured output
Winner: GPT-4o
If you're building something that needs reliable JSON, GPT-4o is the most consistent. It followed schema instructions exactly, even for nested structures.
Claude occasionally added prose before the JSON block ("Here's the data you requested:") which breaks parsers if you're not stripping it. Gemini sometimes returned malformed JSON on complex structures.
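If you're consuming model output in a pipeline, it's worth defending against that prose prefix regardless of which model you pick. A minimal sketch of the defensive pattern (this is not OneAIWorld's code, just a crude heuristic that assumes the payload is the first bracketed span in the reply):

```python
import json
import re

def extract_json(text):
    """Strip any prose or markdown fences wrapped around a JSON payload,
    so a reply like "Here's the data you requested: ..." still parses.
    Crude: assumes the first '{' or '[' starts the payload and the last
    '}' or ']' ends it."""
    match = re.search(r"[\[{]", text)
    if not match:
        raise ValueError("no JSON found in response")
    start = match.start()
    end = max(text.rfind("}"), text.rfind("]"))
    return json.loads(text[start:end + 1])
```

It won't survive genuinely malformed JSON (Gemini's failure mode), but it neutralises the prose-prefix problem entirely.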
## Creative writing
Winner: Claude
This wasn't close. Claude's creative output had better rhythm, more natural pacing, and less of that distinctive "AI tone" that makes generated copy feel generic. GPT-4o produces competent copy but it reads like copy. Claude reads more like a person wrote it.
For product descriptions and email subject lines specifically, Claude was the standout by a significant margin.
## Reasoning and logic
Winner: Claude (GPT-4o close second)
Multi-step problems and word problems with misdirection — Claude handled these most reliably. It was better at flagging when a question contained a false premise rather than just running with it.
GPT-4o was better on pure maths. Gemini was the least reliable here, occasionally producing confident but incorrect reasoning chains.
## Summarisation
Winner: Gemini
This was Gemini's clearest win. Given a long article, Gemini's summaries were consistently tighter and better at identifying the actual key points vs supporting detail. Claude tended to be too comprehensive (summaries that were 60% the length of the original).
GPT-4o was middle ground.
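That 60% figure came from a simple word-count ratio, which is easy to compute yourself when eyeballing summaries (a sketch, not how OneAIWorld's quality score works):

```python
def compression_ratio(summary, original):
    """Word-count ratio of summary to original; lower means tighter.
    A ratio around 0.6 is the 'too comprehensive' case described above,
    where the summary is barely shorter than the source."""
    return len(summary.split()) / len(original.split())
```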
## The honest summary
| Task | Best model |
|---|---|
| Code generation | GPT-4o |
| Structured/JSON output | GPT-4o |
| Creative writing | Claude |
| Reasoning/logic | Claude |
| Summarisation | Gemini |
There is no single best model. The right answer genuinely depends on your use case, which is exactly why the copy-paste comparison workflow exists in the first place.
The practical takeaway: if you're building something code-heavy, default to GPT-4o. If you're generating content or copy, route those prompts to Claude. If you're doing research summarisation or digest pipelines, try Gemini first.
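If you're wiring this into a pipeline, the routing logic is a small lookup table. A hypothetical sketch of the results above as code (model identifiers are illustrative, not exact API model names):

```python
# Task-to-model routing table reflecting the category winners above.
# Model names here are illustrative labels, not exact API identifiers.
DEFAULT_MODEL_BY_TASK = {
    "code": "gpt-4o",
    "structured_output": "gpt-4o",
    "creative": "claude-3",
    "reasoning": "claude-3",
    "summarisation": "gemini-1.5-pro",
}

def pick_model(task, fallback="gpt-4o"):
    """Route a prompt to the model that won its category, falling back
    to a sensible default for unlisted task types."""
    return DEFAULT_MODEL_BY_TASK.get(task, fallback)
```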
## How I ran these tests
All comparisons were done using OneAIWorld, which sends the same prompt to multiple models simultaneously and displays results side by side. It saves a significant amount of time when you're doing structured comparisons rather than one-off queries. Each result also gets an automated quality score, which helps when you're comparing 50 variations.
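If you'd rather roll your own fan-out, the core pattern is just concurrent requests with the same prompt. A minimal Python sketch with hypothetical client callables (OneAIWorld's actual API isn't shown here, and a real version would need per-model API clients and error handling):

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(prompt, clients):
    """Send the same prompt to several model clients concurrently and
    collect the replies side by side. `clients` maps a model name to a
    callable that takes the prompt; the callables are stand-ins for
    real API clients."""
    with ThreadPoolExecutor(max_workers=len(clients)) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in clients.items()}
        # .result() blocks until each reply arrives, so the dict comes
        # back complete, keyed by model name.
        return {name: f.result() for name, f in futures.items()}
```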
## What surprised me most
The gap between models on creative tasks is much larger than the gap on technical tasks. For code, all three are genuinely competitive and the difference is often stylistic. For writing, Claude is operating at a different level.
The other surprise: Gemini is underrated for summarisation tasks. Most developer discussions treat it as the third-place option, but for specific pipelines it's the clear winner.
What patterns are you seeing in your own model comparisons? Curious whether the code generation results hold up across different languages — my tests skewed Python and JavaScript.