# I Tested Claude Haiku, GPT-4o Mini, and Gemini Flash on Real Tasks. Here's What Actually Happened.
Every few weeks someone posts a new model comparison and it's always the same: benchmark scores, carefully designed test prompts, neat bar charts. Then you try the "winning" model on your actual workload and something weird happens.
I've been running all three in production for a few months. Here's what I actually found, including the parts that don't make for clean charts.
## Quick Pricing Reality Check
| Model | Input (per 1M tokens) | Context |
|---|---|---|
| Claude Haiku 4.5 | $1.00 | 200K |
| GPT-4o Mini | $0.15 | 128K |
| Gemini 1.5 Flash | $0.075 | 1M |
Gemini Flash is genuinely 13× cheaper than Haiku on input tokens. That's not a rounding error. But before you migrate everything: keep reading.
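That ratio is easy to sanity-check. A quick back-of-envelope using the input prices from the table (output pricing differs per model and is ignored here, so treat these as lower bounds):

```javascript
// Input price in dollars per 1M tokens, taken from the table above.
const inputPricePerM = {
  'claude-haiku-4-5': 1.0,
  'gpt-4o-mini': 0.15,
  'gemini-1.5-flash': 0.075,
};

// Estimated input-side cost in dollars for a given token volume.
function inputCost(model, tokens) {
  return (tokens / 1_000_000) * inputPricePerM[model];
}

// Example: summarizing 500 documents at ~20K input tokens each = 10M tokens.
const tokens = 500 * 20_000;
console.log(inputCost('gemini-1.5-flash', tokens)); // 0.75
console.log(inputCost('claude-haiku-4-5', tokens)); // 10
```

Same workload, $0.75 versus $10. At real volume that gap is the whole argument for routing by task.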
## What I Was Testing
Three real workloads from a side project:
- Document summarization — long PDFs, messy formatting, inconsistent structure
- JSON extraction — pull structured data from unstructured user input
- Code explanation — explain what a function does in plain English
Not "write a poem" or "solve this logic puzzle." Real things an app might do.
## Document Summarization: Gemini Flash Wins
Flash's 1M context window is genuinely useful here, not just a spec sheet number. I was chunking documents into pieces to fit smaller context windows. With Flash I stopped doing that.
It's also significantly cheaper, which matters when you're summarizing hundreds of documents.
The catch: Flash occasionally invents a section heading that wasn't in the original. It's subtle and sounds plausible, which makes it worse. For internal tools where someone will verify the output, fine. For anything customer-facing, I'd want a review step.
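One cheap guard for that review step: flag any heading in the summary that never appears in the source document. A minimal sketch — the markdown-heading regex and the case-insensitive substring match are assumptions about your summary format, not anything these APIs provide:

```javascript
// Return summary headings (markdown "# ..." lines) whose text never
// occurs in the source. Assumes the summary uses markdown headings;
// adjust the regex for whatever format your prompt asks for.
function suspectHeadings(sourceText, summaryText) {
  const headings = summaryText
    .split('\n')
    .filter((line) => /^#{1,6}\s/.test(line))
    .map((line) => line.replace(/^#{1,6}\s+/, '').trim());
  // A heading is suspect if its text never occurs in the source.
  return headings.filter(
    (h) => !sourceText.toLowerCase().includes(h.toLowerCase())
  );
}

const source = 'Quarterly Report\nRevenue grew 4% year over year.';
const summary = '# Quarterly Report\n## Risk Factors\nRevenue grew.';
console.log(suspectHeadings(source, summary)); // [ 'Risk Factors' ]
```

It won't catch paraphrased hallucinations, but invented headings are exactly the failure I kept seeing, and this flags those for free.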
## JSON Extraction: Haiku Wins, Annoyingly
I really wanted Flash to win here because of the price. It didn't.
The task: given a paragraph of user-written text, extract structured fields into a specific JSON schema. Claude Haiku followed the schema reliably. Flash followed it most of the time, but occasionally added fields that weren't in the schema, renamed fields it didn't like, or decided to nest things differently.
Each of these breaks downstream code. The debugging cost per incident outweighed the token savings.
Haiku is predictable in a way that sounds boring but is exactly what you want when you're processing thousands of records.
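That failure mode is specific enough to guard against in code. A minimal strictness check, assuming you know the exact set of top-level keys the schema allows — added, missing, and renamed fields all surface as violations:

```javascript
// Reject any extraction whose top-level keys don't exactly match the schema.
// Catches the three failure modes above: added fields show up as extras,
// renamed or re-nested fields show up as one missing key plus one extra.
function checkKeys(obj, expectedKeys) {
  const actual = new Set(Object.keys(obj));
  const expected = new Set(expectedKeys);
  const extra = [...actual].filter((k) => !expected.has(k));
  const missing = [...expected].filter((k) => !actual.has(k));
  return { ok: extra.length === 0 && missing.length === 0, extra, missing };
}

const schemaKeys = ['name', 'email', 'company'];
console.log(checkKeys({ name: 'A', email: 'a@b.c', company: 'X' }, schemaKeys).ok); // true
console.log(checkKeys({ name: 'A', email: 'a@b.c', employer: 'X' }, schemaKeys));
// { ok: false, extra: [ 'employer' ], missing: [ 'company' ] }
```

With a check like this in front of the pipeline, a schema drift becomes a retry instead of a 2 a.m. debugging session.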
## Code Explanation: GPT-4o Mini Wins
This one surprised me. GPT-4o Mini is clearly well-trained on code-related tasks. Its explanations are accurate and well-structured. It also handled edge cases (unusual syntax, legacy patterns) better than the other two.
For anything code-adjacent, it's my first reach now.
## What I Actually Use
I stopped trying to find one winner and started routing:
```javascript
function pickModel(taskType, inputTokens) {
  if (taskType === 'summarize' && inputTokens > 50000) return 'gemini-1.5-flash';
  if (taskType === 'extract_json') return 'claude-haiku-4-5';
  if (taskType === 'explain_code') return 'gpt-4o-mini';
  return 'claude-haiku-4-5'; // safe default
}
```
You can check price differences before committing to a migration:
```shell
npm install -g promptfuel
promptfuel compare claude-haiku-4-5 gpt-4o-mini gemini-1.5-flash
```
## The Actual Bottom Line
- Cheapest by far: Gemini Flash. Use it for volume tasks where edge cases are acceptable.
- Most format-reliable: Claude Haiku. Use it when you need strict schema compliance.
- Best at code: GPT-4o Mini. Use it for anything developer-facing.
The honest answer is: don't pick one. Pick based on the task. The cost difference between routing intelligently versus defaulting to one model is real, and it compounds over time.