TL;DR: The frontier-model price gap in May 2026 is more than 100x on output tokens — GPT-5.5 charges $30/M, Claude Haiku 4.5 charges $5/M, and DeepSeek V4 Flash charges $0.28/M for a 1M-context model that benchmarks above GPT-4o. The sticker price almost never tells you what you'll actually pay: caching, batching, and long-context surcharges swing real costs by 50–90% in either direction. This table compiles every major API's verified May 2026 price, with the discount math built in.
The cheapest API in 2026 is not cheaper than the most expensive one — it's eighty-eight times cheaper on a real coding task, hundreds of times cheaper on raw sticker. That's not a rounding error, that's an architecture decision.
What's in this table
Nine providers, twenty-three models, verified from each vendor's pricing page in late May 2026. Prices in USD per million tokens (M = 1,000,000). Output is always more expensive than input — usually 4–6x; on GPT-5.5 it's a punishing 6x.
| Model | Input $/M | Output $/M | Cached input $/M | Context | Notes |
|---|---|---|---|---|---|
| OpenAI GPT-5.5 | $5.00 | $30.00 | $0.50 | 1M | Flagship reasoning |
| OpenAI GPT-5.5 Pro | $30.00 | $180.00 | — | 1M | Hardest-tasks tier |
| OpenAI GPT-5.4 | $2.50 | $15.00 | $0.25 | 272K | Previous flagship; 1M extended on opt-in |
| OpenAI GPT-5.4 Mini | $0.75 | $4.50 | — | 200K | Mid-tier workhorse |
| OpenAI GPT-5.4 Nano | $0.20 | $1.25 | — | 128K | Cheapest OpenAI option |
| Anthropic Claude Opus 4.7 | $5.00 | $25.00 | $0.50 | 1M | Best agent reliability |
| Anthropic Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 | 1M | Best price/perf in flagship class |
| Anthropic Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | 1M | Cheapest US-lab frontier model |
| Google Gemini 3.1 Pro | $2.00 | $12.00 | $0.20 | 1M | $4.00/$18.00 above 200K |
| Google Gemini 2.5 Pro | $1.25 | $10.00 | $0.125 | 1M | $2.50/$15.00 above 200K |
| Google Gemini 2.5 Flash | $0.30 | $2.50 | — | 1M | Cheap multimodal |
| Google Gemini 2.5 Flash-Lite | $0.10 | $0.40 | — | 1M | Cheapest Google option |
| xAI Grok 4.3 | $1.25 | $2.50 | $0.20 | 1M | Output-cheap reasoner |
| xAI Grok 4.20 | $1.25 | $2.50 | $0.20 | 2M | Speed + tool-calling tier |
| xAI Grok 4.1 Fast | $0.20 | $0.50 | $0.05 | 2M | Budget agentic tool-caller |
| DeepSeek V4 Flash | $0.14 | $0.28 | $0.0028 | 1M | Best $/quality globally |
| DeepSeek V4 Pro | $0.435 | $0.87 | $0.0036 | 1M | Flagship (75% off until 2026-05-31) |
| Moonshot Kimi K2.6 | $0.95 | $4.00 | $0.16 | 262K | Strong coding model |
| Moonshot Kimi K2.5 | $0.60 | $3.00 | $0.10 | 262K | Cheaper sibling |
| Alibaba Qwen3-Max | $1.20 | $6.00 | $0.24 | 262K | Tiered: 2x above 32K input, 2.5x above 128K |
| Mistral Large 3 | $0.50 | $1.50 | — | 262K | Aggressive EU pricing |
| Meta Llama 4 Maverick (via DeepInfra) | $0.15 | $0.60 | — | 1M | Cheap open-weight large |
| Meta Llama 4 Scout (via DeepInfra) | $0.08 | $0.30 | — | 10M native (DeepInfra caps at 320K) | Cheapest tier overall |
A few things the sticker doesn't show. Those come next.
The discount math behind the sticker
Prompt caching cuts repeated-prefix input by ~90%
Every major US-lab model offers some form of prompt caching now. The shape is the same: a long system prompt or document gets cached, and subsequent reads charge a fraction of the input rate. Anthropic and OpenAI both cut cached input by 90%. DeepSeek's cache hit on V4 is 98% off ($0.0028/M vs $0.14/M miss). Gemini caches at ~10% of base.
The catch: caching only helps when the same prefix repeats across many requests inside a short TTL window (typically 5 minutes on Anthropic, longer on others). A chatbot serving many users with shared system prompts: huge win. A coding agent rewriting context every turn: zero help.
Worked example. You're running a customer-support bot with a 4K-token system prompt and 1K-token user turns, serving 100 messages an hour. On Claude Sonnet 4.6:
- Without caching: 100 × 5K × $3/M = $1.50/hr input
- With caching (system prompt cached): 100 × 4K × $0.30/M + 100 × 1K × $3/M = $0.12 + $0.30 = $0.42/hr input
A 72% cut on a workload most teams already run.
Batch API cuts everything by 50%
If you can wait 24 hours, every major provider gives you exactly 50% off. Anthropic, OpenAI, Google, Mistral — all the same. For offline jobs (overnight document processing, dataset labeling, summary generation on yesterday's data) this is free money. Most production traffic can't use it because users want answers in seconds, not tomorrow.
Long-context surcharges on Gemini
Google is the only major provider charging a long-context premium. Above 200K tokens, both Gemini 2.5 Pro and Gemini 3.1 Pro roughly double their input price and add ~50% to output. Anthropic, which also offers 1M-context Claude models, charges flat across the full context.
If your typical request is below 100K tokens, this is moot. If you're feeding entire codebases or 500-page PDFs, the headline Gemini price is misleading by a factor of two.
What it actually costs to do real work
Sticker prices in isolation are useless. Here's what one realistic workload costs across the lineup.
Scenario: a coding agent processing one task end-to-end. Roughly 40K input tokens (context + retrieved code + tool results) and 8K output tokens (reasoning + final code). About one task is one minute of human-developer-equivalent work.
| Model | Cost per task | Tasks per $1 |
|---|---|---|
| GPT-5.5 | $0.44 | 2.3 |
| Claude Opus 4.7 | $0.40 | 2.5 |
| Gemini 3.1 Pro | $0.18 | 5.6 |
| Claude Sonnet 4.6 | $0.24 | 4.2 |
| GPT-5.4 | $0.22 | 4.5 |
| Kimi K2.6 | $0.07 | 14 |
| Qwen3-Max | $0.10 | 10 |
| Claude Haiku 4.5 | $0.08 | 12.5 |
| Grok 4.3 | $0.07 | 14 |
| GPT-5.4 Nano | $0.02 | 50 |
| DeepSeek V4 Flash | $0.008 | 125 |
| Llama 4 Scout | $0.005 | 200 |
The ratio between cheapest and most expensive at this workload is 88x. That gap, run a million times, is the difference between a $5,000 month and a $440,000 month.
How to pick a tier
A simple decision tree that holds up across most teams.
Do you need the absolute best reasoning on the hardest 5% of tasks? GPT-5.5 Pro or Claude Opus 4.7. Pay the premium, don't try to be clever.
Do you need frontier quality on routine work? Claude Sonnet 4.6 or Gemini 3.1 Pro. Sonnet wins on agent reliability; Gemini wins on multimodal and 1M context recall.
Are you on a budget but need US-lab quality? Claude Haiku 4.5 or GPT-5.4 Mini. Both punch above their price tag.
Are you cost-sensitive and OK with open-weight quality? DeepSeek V4 Flash is the answer for most teams — 1M context at $0.14/$0.28. Llama 4 Scout if you can route through DeepInfra and don't need vision.
Are you doing offline / batch work? Pick anything and add --batch for 50% off. The model choice matters less than turning batch on.
This is the same logic our LLM API selection decision matrix lays out by use case if you want a longer breakdown.
What this table doesn't show
Three caveats worth knowing before you route off these numbers.
- Rate limits matter more than price for many teams. A $0.30/M model you can't get capacity on at peak is more expensive than a $5/M model you can. OpenAI and Anthropic have the most generous tiers; the cheaper Chinese models often gate hard on enterprise quotas.
- Quality is not flat within a price band. Claude Sonnet 4.6 and Gemini 3.1 Pro are priced similarly but win on different tasks. Sonnet leads on multi-turn agent reliability; Gemini leads on 1M+ token recall and image input. There's no substitute for running your eval on both.
- Provider markup is real. Going through a reseller adds 5–20% in most cases. We break down OpenRouter's actual margin versus first-party APIs in a separate piece — short version: it's higher than they advertise once you account for routing costs.
- First-party Claude pricing matches ofox pricing. Anthropic does not let resellers undercut; the only saving is from removing the need for multiple billing relationships. That logic applies to all the big labs.
The aggregator question
You can pay nine providers separately, manage nine API keys, and reconcile nine invoices. Or you can route everything through one OpenAI-compatible endpoint. ofox.ai is the aggregator we run — one key for every model in this table, OpenAI-compatible SDK, and prices that match each provider's first-party rate for flagship models with up to 70% off on open-weight ones. We're not the only option, but the math is similar across aggregators: the value is in not maintaining nine integrations, not in saving 2% on token cost.
For a deeper read on flagship-level differences, Claude vs GPT vs Gemini is the pillar piece this article links into. For first-party tier breakdowns specifically: Claude API pricing breakdown, Gemini 3.1 Pro pricing, GPT-5.4 Pro pricing, DeepSeek V4 pricing, and how to actually reduce AI API costs. The May 2026 LLM leaderboard is the quality-side companion to this price-side table.
The right model is the one whose price you don't have to think about — pick it for capability and let the bill take care of itself, or pick it for cost and let the capability ceiling decide your roadmap. Anything in between just means you'll switch in six months.
Pricing sources (verified May 26, 2026)
- OpenAI: openai.com/api/pricing
- Anthropic: platform.claude.com/docs/en/about-claude/pricing
- Google Gemini: ai.google.dev/gemini-api/docs/pricing
- xAI: x.ai/api
- DeepSeek: api-docs.deepseek.com/quick_start/pricing
- Moonshot Kimi: official pricing via openrouter.ai/moonshotai
- Alibaba Qwen: alibabacloud.com/help/en/model-studio/model-pricing
- Mistral: mistral.ai/pricing
- Meta Llama 4 (via DeepInfra): deepinfra.com/meta-llama
This table will be re-verified at the start of each month. If a number here disagrees with the provider's page, the provider wins — but tell us, because we want to keep this honest.
Originally published on ofox.ai/blog.
Top comments (0)