Real production cost data from the Benchwright /compare calculator across 12 LLMs — input/output ratios, latency tradeoffs, and 3 decisions you should make differently today.
Everyone knows the sticker price. Nobody knows the bill.
You see "$5 per million tokens" and do mental math: that's cheap, this will cost almost nothing. Then you ship to production, context windows bloat with conversation history, your retry logic fires on 3% of calls, and the response tokens are 4× your estimates because you underestimated how verbose the model is. Three months later your AI feature is costing you $800/month instead of $80.
This isn't a niche problem. It's the default outcome for teams that benchmark cost in a notebook and deploy to production without re-measuring.
We built the Benchwright /compare calculator to make the gap between sticker price and real production cost visible — and to keep it visible as models update. After running 12 models through it, here's what the data actually shows.
Methodology
The /compare tool calculates monthly production cost from three inputs you control: API calls per day, average prompt tokens, and average completion tokens. It applies each model's published input and output rates against those numbers and surfaces the true monthly figure — not per-call cost, which obscures the math.
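For concreteness, here's that arithmetic as a minimal Python sketch. The function name and the 30-day month are our simplifications, not the tool's internals:

```python
def monthly_cost(calls_per_day, prompt_tok, completion_tok, in_rate, out_rate, days=30):
    """Monthly spend from call volume, per-call token counts, and $/1M-token rates."""
    daily_in = calls_per_day * prompt_tok    # input tokens per day
    daily_out = calls_per_day * completion_tok  # output tokens per day
    return (daily_in * in_rate + daily_out * out_rate) / 1_000_000 * days

# GPT-4o at 10,000 calls/day, 600-token prompts, 900-token completions:
print(monthly_cost(10_000, 600, 900, in_rate=5.00, out_rate=15.00))  # 4950.0
```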
Models in this comparison:
| Provider | Models |
|---|---|
| OpenAI | GPT-4o, GPT-4o mini, GPT-4 Turbo, o1-mini |
| Anthropic | Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus |
| Google | Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash |
| Other | Mistral Large, Llama 3.1 70B (via Together.ai) |
All pricing reflects published rates as of May 2026. Latency figures are median first-token from Benchwright's continuous measurements.
The Full Pricing Picture
Before we get to surprises, here's the complete dataset:
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Latency (p50 TTFT) |
|---|---|---|---|
| GPT-4o | $5.00 | $15.00 | 1,200ms |
| GPT-4o mini | $0.15 | $0.60 | 600ms |
| GPT-4 Turbo | $10.00 | $30.00 | — |
| o1-mini | $3.00 | $12.00 | — |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 1,000ms |
| Claude 3.5 Haiku | $0.80 | $4.00 | 500ms |
| Claude 3 Opus | $15.00 | $75.00 | — |
| Gemini 1.5 Flash | $0.075 | $0.30 | 700ms |
| Gemini 1.5 Pro | $1.25 | $5.00 | — |
| Gemini 2.0 Flash | $0.10 | $0.40 | 500ms |
| Mistral Large | $2.00 | $6.00 | — |
| Llama 3.1 70B | $0.90 | $0.90 | — |
The raw numbers don't tell you much until you model your actual workload. That's where the surprises are.
3 Non-Obvious Findings
1. Claude 3.5 Haiku vs GPT-4o mini: the cheapest model per token isn't the cheapest per task
At first glance GPT-4o mini looks like the budget champion: $0.15 input vs Haiku's $0.80, and $0.60 output vs Haiku's $4.00. On sticker price it wins at every prompt-to-completion mix. That framing is misleading all the same, because per-token price isn't the unit you're actually buying.
Output tokens are where you actually spend money at scale, and production completions are rarely short. Customer support responses, code explanations, document summaries, structured JSON outputs — these run 500–2,000 tokens routinely, which multiplies the absolute dollar gap between the models.
At 1,000 output tokens per call, 10,000 calls/day:
- GPT-4o mini: $6/day in output costs alone
- Claude 3.5 Haiku: $40/day in output costs
On raw tokens, GPT-4o mini wins decisively. But here's what changes the math: quality per output token. Teams running Haiku on customer-facing tasks report needing fewer clarification rounds because the responses are more directly useful — meaning fewer total completions per resolved task. If Haiku resolves a support ticket in one exchange and GPT-4o mini takes two, you're comparing $40/day to $12/day: a 3.3× gap instead of 6.7×, before you count the doubled input and context costs of the second round. On workloads where the exchange gap is wider, the per-task ranking can flip outright.
The decision: Don't pick the cheapest model per token. Pick the cheapest model per resolved task. Benchwright's continuous monitoring measures this over time so you're not guessing.
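As a back-of-the-envelope sketch of the per-task framing (the exchange counts are the hypothetical ones above, output side only):

```python
def output_cost_per_task(exchanges, tokens_per_exchange, out_rate):
    """Output-side dollars to resolve one task, given how many exchanges it takes."""
    return exchanges * tokens_per_exchange / 1_000_000 * out_rate

haiku = output_cost_per_task(1, 1_000, 4.00)  # $0.0040 per resolved ticket
mini = output_cost_per_task(2, 1_000, 0.60)   # $0.0012 per resolved ticket
print(f"gap narrows from 6.7x to {haiku / mini:.1f}x")  # 3.3x
```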
2. Gemini 2.0 Flash is the price-performance anomaly nobody is talking about
$0.10 input, $0.40 output, 500ms p50 latency. That's faster than GPT-4o mini and cheaper on both sides of the ledger: $0.10 vs $0.15 on input, $0.40 vs $0.60 on output.
For most production workloads — classification, summarization, extraction, light reasoning — Gemini 2.0 Flash is a legitimate default choice that teams are sleeping on. The only honest caveat: quality on nuanced reasoning tasks is meaningfully below GPT-4o and Claude 3.5 Sonnet. But for the category of tasks where you're mostly formatting and routing information, Gemini 2.0 Flash at $0.10/$0.40 per million tokens is hard to beat.
Run your actual eval dataset against it before dismissing it. Most teams that do are surprised.
3. The real cost of Claude 3 Opus isn't $15/$75 — it's the opportunity cost of not switching
Claude 3 Opus is $15 input, $75 output. Claude 3.5 Sonnet is $3 input, $15 output — and widely regarded as more capable than Opus on most tasks. Sonnet's release made Opus a legacy cost center.
At 5,000 calls/day, 500 input tokens, 800 output tokens (30-day month):
- Opus monthly: ~$10,125
- Sonnet monthly: ~$2,025
That's an $8,100/month difference for a model that's worse on most benchmarks. Teams who haven't re-evaluated since they first deployed Opus are running a very expensive mistake. This is exactly what silent regression monitoring is designed to catch — not just when models get worse, but when a better option emerges.
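Plugging those inputs into the same formula as the methodology sketch:

```python
def monthly_cost(calls_per_day, prompt_tok, completion_tok, in_rate, out_rate, days=30):
    return calls_per_day * (prompt_tok * in_rate + completion_tok * out_rate) / 1_000_000 * days

print(monthly_cost(5_000, 500, 800, 15.00, 75.00))  # Claude 3 Opus:     10125.0
print(monthly_cost(5_000, 500, 800, 3.00, 15.00))   # Claude 3.5 Sonnet:  2025.0
```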
The Latency Tradeoff
Cost is only half the equation. Latency shapes UX in ways that cost doesn't.
Here's the p50 first-token picture for the models where we have consistent data:
| Model | p50 TTFT | Practical implication |
|---|---|---|
| Claude 3.5 Haiku | 500ms | Streaming feels near-instant; fine for interactive chat |
| Gemini 2.0 Flash | 500ms | Excellent for inline UX patterns |
| GPT-4o mini | 600ms | Acceptable for most UI contexts |
| Gemini 1.5 Flash | 700ms | Slight perceptible delay in fast interactions |
| Claude 3.5 Sonnet | 1,000ms | Noticeable pause; needs streaming UX |
| GPT-4o | 1,200ms | Requires skeleton loading states |
What p95 reveals: Median latency is misleading for customer-facing features. The 1-in-20 call that takes 4–6 seconds is the one that gets a bug report. Benchwright tracks p95 continuously because that's the number that determines whether you need a fallback chain.
Practical rule: if your feature is synchronous and user-facing, you need p95 under 2 seconds. GPT-4o and Claude 3.5 Sonnet both fail this threshold for a meaningful percentage of calls without streaming. Haiku and Gemini 2.0 Flash pass it comfortably.
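One way to operationalize that rule is a timeout-based failover. A minimal asyncio sketch, not Benchwright's implementation; `call_primary` and `call_fallback` are hypothetical stand-ins for your real clients:

```python
import asyncio

async def call_primary(prompt: str) -> str:
    # Hypothetical stand-in for your primary client (e.g., GPT-4o).
    await asyncio.sleep(1.2)  # simulate first-token latency
    return "primary response"

async def call_fallback(prompt: str) -> str:
    # Hypothetical stand-in for a faster fallback (e.g., Gemini 2.0 Flash).
    await asyncio.sleep(0.5)
    return "fallback response"

async def generate(prompt: str, budget_s: float = 2.0) -> str:
    # Enforce the p95 budget: fail over when the primary exceeds it.
    try:
        return await asyncio.wait_for(call_primary(prompt), timeout=budget_s)
    except asyncio.TimeoutError:
        return await call_fallback(prompt)

print(asyncio.run(generate("classify this support ticket")))
```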
Hidden Costs
The three things not in any sticker price:
1. Retries
Most production setups have retry logic for rate limits and transient failures. A 3% retry rate on 10,000 calls/day is 300 bonus calls you didn't budget. On GPT-4o at a typical 600-token prompt + 900-token response, that's ~$4.95/day, or roughly $150/month of invisible overhead, pushing $1,800 a year. Benchmark your retry rate, not just your happy-path cost.
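The arithmetic, using the numbers above (30-day month assumed):

```python
calls_per_day, retry_rate = 10_000, 0.03
prompt_tok, completion_tok = 600, 900
in_rate, out_rate = 5.00, 15.00  # GPT-4o, $/1M tokens

extra_calls = calls_per_day * retry_rate  # 300 unbudgeted calls/day
daily = extra_calls * (prompt_tok * in_rate + completion_tok * out_rate) / 1_000_000
print(f"${daily:.2f}/day, ${daily * 30:.2f}/month")  # $4.95/day, $148.50/month
```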
2. Context window bloat
Conversation history accumulates. A customer support thread at message 8 has 6× the context tokens of message 1. Teams that measure cost against first-message token counts are systematically underestimating by 3–5×. Evaluating this pattern over time is one of the 5 metrics that actually matter.
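A toy model of the compounding. The 600-token first message and ~400 tokens added per exchange are illustrative assumptions, not measurements:

```python
first_message = 600  # tokens in the opening prompt
per_exchange = 400   # tokens each user/assistant round adds to history (assumption)

for msg in (1, 4, 8):
    context = first_message + (msg - 1) * per_exchange
    print(f"message {msg}: {context:,} prompt tokens ({context / first_message:.1f}x)")
# message 1: 600 (1.0x) · message 4: 1,800 (3.0x) · message 8: 3,400 (5.7x)
```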
3. Fallback chains
If you're running GPT-4o with a Claude 3.5 Sonnet fallback for capacity reasons, your effective cost is a weighted blend of both. At 15% fallback rate, you're paying 85% of one price and 15% of another. Model your actual fallback frequency or your budget math is wrong.
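The blend per call, at the same 600-token prompt / 900-token response shape used in the retry example (a sketch):

```python
def per_call_cost(prompt_tok, completion_tok, in_rate, out_rate):
    return (prompt_tok * in_rate + completion_tok * out_rate) / 1_000_000

gpt4o = per_call_cost(600, 900, 5.00, 15.00)   # $0.01650/call
sonnet = per_call_cost(600, 900, 3.00, 15.00)  # $0.01530/call
blended = 0.85 * gpt4o + 0.15 * sonnet         # ~$0.01632/call at a 15% fallback rate
```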
3 Decisions You Should Make Differently After This
1. Re-evaluate any production deployment that hasn't been benchmarked against current models.
If you picked your model over 6 months ago, the landscape has changed. Claude 3.5 Sonnet vs Opus alone could be saving you thousands per month. Set a quarterly model review on the calendar — or better, run continuous cost monitoring so you catch the delta automatically.
2. Stop using input price as your primary cost filter.
Input tokens are cheap across the board. Output tokens are where the meaningful variation is. Sort by output cost, then model your actual input-to-output ratio. Your real number is usually 2–4× the sticker you're anchoring on.
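One way to see the anchoring problem is to compute the effective rate at your real token mix; the 1:2 input-to-output ratio below is an assumption, not a benchmark:

```python
def effective_rate(in_rate, out_rate, output_share):
    """Blended $/1M tokens when output_share of your tokens are completions."""
    return in_rate * (1 - output_share) + out_rate * output_share

# GPT-4o with a 1:2 input-to-output mix (two-thirds of tokens are output):
print(f"{effective_rate(5.00, 15.00, 2 / 3):.2f}")  # 11.67, about 2.3x the $5 sticker
```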
3. Don't skip Gemini 2.0 Flash in your next eval.
Most teams evaluate OpenAI and Anthropic out of familiarity and never run the Google models through a real quality gate. For a large category of production tasks, Gemini 2.0 Flash at $0.10/$0.40 is the right answer. You won't know unless you measure.
Try It on Your Numbers
Every workload is different. The Benchwright /compare tool lets you plug in your actual API call volume, prompt length, and completion length to get your real monthly number across all 12 models — not a hypothetical.
Once you have a baseline, continuous monitoring tells you when that number shifts because a model changed under you. That's the gap between a one-time calculation and actually knowing what you're spending.
→ Run your numbers in /compare
Want ongoing monitoring instead of a one-time check? Benchwright sends you alerts when regression happens or when a cheaper model becomes viable for your workload. Sign up for early access.
Related reading:
• How LLM Model Updates Silently Break Production Features — why "stable" models aren't
• Why Unit Tests Aren't Enough for LLM Features — what you're missing
• 5 Metrics That Actually Matter When Evaluating LLM Providers — what to track
Benchwright Calculator
Benchwright runs continuous LLM evaluations so teams know what works before they deploy.
Try the free calculator → benchwright.polsia.app/compare
No credit card required. No infrastructure to manage.