I maintain an open dataset of LLM API list prices, and every time I refresh it the same thing jumps out: the gap between the cheapest and most expensive model is enormous. As of July 2026, across 17 models from 5 providers, the spread on a blended (3:1 input:output) cost is 114x: from about $0.18 per 1M tokens at the low end to $20 per 1M tokens at the top.
That is not a rounding error. If you route a high-volume, low-stakes workload (classification, tagging, bulk drafts) to a frontier model out of habit, you can pay roughly 100x more than you need to for output a cheaper model handles fine.
Here is a slice of the current numbers (USD per 1M tokens; blended = (3 × input + output) / 4, i.e. a 3:1 input:output token mix):
| Provider | Model | Input | Output | Blended | Tier |
|---|---|---|---|---|---|
| DeepSeek | DeepSeek V4 Flash | 0.14 | 0.28 | 0.18 | budget |
| xAI | Grok 4.1 Fast | 0.20 | 0.50 | 0.28 | budget |
| Google | Gemini 3.1 Flash-Lite | 0.25 | 1.50 | 0.56 | budget |
| OpenAI | GPT-5 mini | 0.25 | 2.00 | 0.69 | budget |
| Anthropic | Claude Haiku 4.5 | 1.00 | 5.00 | 2.00 | budget |
| Anthropic | Claude Sonnet 4.6 | 3.00 | 15.00 | 6.00 | mid |
| Anthropic | Claude Opus 4.8 | 5.00 | 25.00 | 10.00 | frontier |
| OpenAI | GPT-5.5 | 5.00 | 30.00 | 11.25 | frontier |
| Anthropic | Claude Fable 5 | 10.00 | 50.00 | 20.00 | frontier |
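
The Blended column can be reproduced as a weighted average of the input and output prices. Here is a minimal sketch, assuming the 3:1 input:output weighting described above; the function name `blended_cost` and the `PRICES` dictionary are my own illustration, not part of the dataset's collector script.

```python
def blended_cost(input_price: float, output_price: float,
                 input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


# (input, output) list prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "DeepSeek V4 Flash": (0.14, 0.28),
    "Claude Fable 5": (10.00, 50.00),
}

cheapest = blended_cost(*PRICES["DeepSeek V4 Flash"])
priciest = blended_cost(*PRICES["Claude Fable 5"])
spread = priciest / cheapest

print(f"cheapest: ${cheapest:.3f}, priciest: ${priciest:.2f}, spread: {spread:.0f}x")
```

Changing `input_weight` and `output_weight` to match your own traffic mix is usually more informative than the default 3:1 blend.
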
Three things I did not appreciate until the data was in one table:
1. Output is where the money goes
Output tokens cost 2x to 8x the input price, depending on the model. The Gemini and GPT families run a 6x-8x output multiple; the DeepSeek and Claude families sit at roughly 2x-5x. So a chatty, long-answer workload is priced very differently from a "read a big document, return a short label" workload, even on the same model. If your app generates long responses, the output rate matters far more than the input rate you probably compared first.
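
To make the difference concrete, here is a rough sketch that prices two request shapes at the GPT-5 mini list rates from the table. The token counts are made-up examples, and `request_cost` and `output_share` are hypothetical helper names.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in USD of one request; prices are USD per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def output_share(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Fraction of the request cost that comes from output tokens."""
    total = request_cost(input_tokens, output_tokens, input_price, output_price)
    return (output_tokens * output_price / 1_000_000) / total


# GPT-5 mini list prices from the table: $0.25 input, $2.00 output per 1M tokens.
IN_PRICE, OUT_PRICE = 0.25, 2.00

# Hypothetical shapes: "read a big document, return a label" vs. a long answer.
label_job = (20_000, 50)   # (input tokens, output tokens)
chatty_job = (500, 2_000)

for name, (tokens_in, tokens_out) in [("label", label_job), ("chatty", chatty_job)]:
    cost = request_cost(tokens_in, tokens_out, IN_PRICE, OUT_PRICE)
    share = output_share(tokens_in, tokens_out, IN_PRICE, OUT_PRICE)
    print(f"{name}: ${cost:.6f} per request, {share:.0%} of it from output")
```

With these example shapes, output is a small slice of the labeling job's bill and nearly all of the chatty job's, so the rate worth comparing first depends on which shape your traffic has.
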
2. "Cheap" and "expensive" are workload-dependent, not model-dependent
There is no single best-value model. A budget model at $0.18 blended is the obvious pick for bulk summarization or first drafts. But for a step where a wrong answer is expensive (agent tool-calls, code you will actually ship, legal or medical reasoning), the frontier model's higher accuracy is usually cheaper than the human time to catch its mistakes. The trap is using ONE model for everything. Match the model to the job.
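
One simple way to avoid the one-model-for-everything trap is a per-task routing table. The sketch below is only an illustration: the task names, the tier assignments, the model chosen for each tier, and the fallback to the mid tier are all my assumptions, and you would want to validate each choice against your own quality checks.

```python
# Hypothetical mapping from task type to price tier, following the examples
# in the text: bulk, low-stakes work goes cheap; costly-if-wrong work goes up.
TIER_BY_TASK = {
    "classification": "budget",
    "tagging": "budget",
    "bulk_draft": "budget",
    "summarization": "budget",
    "agent_tool_call": "frontier",
    "production_code": "frontier",
    "legal_reasoning": "frontier",
    "medical_reasoning": "frontier",
}

# Illustrative pick per tier, taken from the table above (not a recommendation).
MODEL_BY_TIER = {
    "budget": "DeepSeek V4 Flash",
    "mid": "Claude Sonnet 4.6",
    "frontier": "Claude Opus 4.8",
}


def pick_model(task: str) -> str:
    """Return the model for a task; unknown tasks fall back to the mid tier."""
    tier = TIER_BY_TASK.get(task, "mid")  # the fallback is an assumption
    return MODEL_BY_TIER[tier]


print(pick_model("tagging"))
print(pick_model("agent_tool_call"))
```
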
3. Caching quietly changes the ranking
Several providers offer cached-input discounts (DeepSeek advertises ~98% off cache hits; GPT-5.5 and Gemini 3.5 Flash cut cached input roughly 90%). If your prompts share a large stable prefix (a system prompt, a knowledge base, few-shot examples), your effective input cost can drop by an order of magnitude, which reshuffles the whole table. Benchmark with YOUR cache-hit rate, not the list price.
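
As a back-of-the-envelope sketch, the effective input price is a mix of the cached and uncached rates, weighted by the share of input tokens served from cache. The 90% discount below is the rough GPT-5.5 figure quoted above; the 80% hit rate is a made-up example, and `effective_input_price` is a hypothetical helper. Real billing can differ (for example, some providers charge for cache writes), so check the provider's pricing page.

```python
def effective_input_price(list_price: float, cache_discount: float,
                          hit_rate: float) -> float:
    """Average input price per 1M tokens given a cache-hit rate.

    cache_discount: fraction taken off cached input (0.90 means 90% off).
    hit_rate: fraction of input tokens served from cache, between 0 and 1.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    cached_price = list_price * (1.0 - cache_discount)
    return hit_rate * cached_price + (1.0 - hit_rate) * list_price


# GPT-5.5 list prices from the table: $5.00 input, $30.00 output per 1M tokens.
list_input, list_output = 5.00, 30.00

# Assumed: ~90% cached-input discount and a hypothetical 80% hit rate.
effective = effective_input_price(list_input, cache_discount=0.90, hit_rate=0.80)

# Re-blend at 3:1 input:output using the effective input price.
blended_list = (3 * list_input + list_output) / 4
blended_cached = (3 * effective + list_output) / 4

print(f"effective input: ${effective:.2f} per 1M tokens")
print(f"blended: ${blended_list:.2f} at list price, ${blended_cached:.2f} with caching")
```

Only the input side gets cheaper, so the effect on the blended number is largest for input-heavy workloads with a long shared prefix.
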
A quick way to pick
If you just want a starting point by task and budget rather than reading a spreadsheet, I built a free, no-signup picker that ranks a top 3 (model + API) from three taps: task, budget, priority. It uses this same pricing data: AI Model Picker.
The raw dataset (CSV, JSON, and a reproducible collector script, CC BY 4.0) is public and linked from the picker page if you want to run your own numbers or cite it.
One honest caveat: API prices move constantly, so treat any table (including this one) as a snapshot dated July 2026 and confirm the current rate on each provider's pricing page before you commit a budget. If a number here is already stale by the time you read this, that is exactly why the collector script exists.
If you like this kind of no-hype, numbers-first AI tooling breakdown, I post them daily on Telegram: t.me/aitoolsinsiderhq.