Owen

Posted on May 26 • Originally published at ofox.ai

AI API Pricing Comparison May 2026: Every Major Model in One Table

#ai #llm #api #pricing

TL;DR: The frontier-model price gap in May 2026 is more than 100x on output tokens — GPT-5.5 charges $30/M, Claude Haiku 4.5 charges $5/M, and DeepSeek V4 Flash charges $0.28/M for a 1M-context model that benchmarks above GPT-4o. The sticker price almost never tells you what you'll actually pay: caching, batching, and long-context surcharges swing real costs by 50–90% in either direction. This table compiles every major API's verified May 2026 price, with the discount math built in.

The cheapest API in 2026 is not cheaper than the most expensive one — it's eighty-eight times cheaper on a real coding task, hundreds of times cheaper on raw sticker. That's not a rounding error, that's an architecture decision.

What's in this table

Nine providers, twenty-three models, verified from each vendor's pricing page in late May 2026. Prices in USD per million tokens (M = 1,000,000). Output is always more expensive than input — usually 4–6x; on GPT-5.5 it's a punishing 6x.

Model	Input $/M	Output $/M	Cached input $/M	Context	Notes
OpenAI GPT-5.5	$5.00	$30.00	$0.50	1M	Flagship reasoning
OpenAI GPT-5.5 Pro	$30.00	$180.00	—	1M	Hardest-tasks tier
OpenAI GPT-5.4	$2.50	$15.00	$0.25	272K	Previous flagship; 1M extended on opt-in
OpenAI GPT-5.4 Mini	$0.75	$4.50	—	200K	Mid-tier workhorse
OpenAI GPT-5.4 Nano	$0.20	$1.25	—	128K	Cheapest OpenAI option
Anthropic Claude Opus 4.7	$5.00	$25.00	$0.50	1M	Best agent reliability
Anthropic Claude Sonnet 4.6	$3.00	$15.00	$0.30	1M	Best price/perf in flagship class
Anthropic Claude Haiku 4.5	$1.00	$5.00	$0.10	1M	Cheapest US-lab frontier model
Google Gemini 3.1 Pro	$2.00	$12.00	$0.20	1M	$4.00/$18.00 above 200K
Google Gemini 2.5 Pro	$1.25	$10.00	$0.125	1M	$2.50/$15.00 above 200K
Google Gemini 2.5 Flash	$0.30	$2.50	—	1M	Cheap multimodal
Google Gemini 2.5 Flash-Lite	$0.10	$0.40	—	1M	Cheapest Google option
xAI Grok 4.3	$1.25	$2.50	$0.20	1M	Output-cheap reasoner
xAI Grok 4.20	$1.25	$2.50	$0.20	2M	Speed + tool-calling tier
xAI Grok 4.1 Fast	$0.20	$0.50	$0.05	2M	Budget agentic tool-caller
DeepSeek V4 Flash	$0.14	$0.28	$0.0028	1M	Best $/quality globally
DeepSeek V4 Pro	$0.435	$0.87	$0.0036	1M	Flagship (75% off until 2026-05-31)
Moonshot Kimi K2.6	$0.95	$4.00	$0.16	262K	Strong coding model
Moonshot Kimi K2.5	$0.60	$3.00	$0.10	262K	Cheaper sibling
Alibaba Qwen3-Max	$1.20	$6.00	$0.24	262K	Tiered: 2x above 32K input, 2.5x above 128K
Mistral Large 3	$0.50	$1.50	—	262K	Aggressive EU pricing
Meta Llama 4 Maverick (via DeepInfra)	$0.15	$0.60	—	1M	Cheap open-weight large
Meta Llama 4 Scout (via DeepInfra)	$0.08	$0.30	—	10M native (DeepInfra caps at 320K)	Cheapest tier overall

A few things the sticker doesn't show. Those come next.

The discount math behind the sticker

Prompt caching cuts repeated-prefix input by ~90%

Every major US-lab model offers some form of prompt caching now. The shape is the same: a long system prompt or document gets cached, and subsequent reads charge a fraction of the input rate. Anthropic and OpenAI both cut cached input by 90%. DeepSeek's cache hit on V4 is 98% off ($0.0028/M vs $0.14/M miss). Gemini caches at ~10% of base.

The catch: caching only helps when the same prefix repeats across many requests inside a short TTL window (typically 5 minutes on Anthropic, longer on others). A chatbot serving many users with shared system prompts: huge win. A coding agent rewriting context every turn: zero help.

Worked example. You're running a customer-support bot with a 4K-token system prompt and 1K-token user turns, serving 100 messages an hour. On Claude Sonnet 4.6:

Without caching: 100 × 5K × $3/M = $1.50/hr input
With caching (system prompt cached): 100 × 4K × $0.30/M + 100 × 1K × $3/M = $0.12 + $0.30 = $0.42/hr input

A 72% cut on a workload most teams already run.

Batch API cuts everything by 50%

If you can wait 24 hours, every major provider gives you exactly 50% off. Anthropic, OpenAI, Google, Mistral — all the same. For offline jobs (overnight document processing, dataset labeling, summary generation on yesterday's data) this is free money. Most production traffic can't use it because users want answers in seconds, not tomorrow.

Long-context surcharges on Gemini

Google is the only major provider charging a long-context premium. Above 200K tokens, both Gemini 2.5 Pro and Gemini 3.1 Pro roughly double their input price and add ~50% to output. Anthropic, which also offers 1M-context Claude models, charges flat across the full context.

If your typical request is below 100K tokens, this is moot. If you're feeding entire codebases or 500-page PDFs, the headline Gemini price is misleading by a factor of two.

What it actually costs to do real work

Sticker prices in isolation are useless. Here's what one realistic workload costs across the lineup.

Scenario: a coding agent processing one task end-to-end. Roughly 40K input tokens (context + retrieved code + tool results) and 8K output tokens (reasoning + final code). About one task is one minute of human-developer-equivalent work.

Model	Cost per task	Tasks per $1
GPT-5.5	$0.44	2.3
Claude Opus 4.7	$0.40	2.5
Gemini 3.1 Pro	$0.18	5.6
Claude Sonnet 4.6	$0.24	4.2
GPT-5.4	$0.22	4.5
Kimi K2.6	$0.07	14
Qwen3-Max	$0.10	10
Claude Haiku 4.5	$0.08	12.5
Grok 4.3	$0.07	14
GPT-5.4 Nano	$0.02	50
DeepSeek V4 Flash	$0.008	125
Llama 4 Scout	$0.005	200

The ratio between cheapest and most expensive at this workload is 88x. That gap, run a million times, is the difference between a $5,000 month and a $440,000 month.

How to pick a tier

A simple decision tree that holds up across most teams.

Do you need the absolute best reasoning on the hardest 5% of tasks? GPT-5.5 Pro or Claude Opus 4.7. Pay the premium, don't try to be clever.

Do you need frontier quality on routine work? Claude Sonnet 4.6 or Gemini 3.1 Pro. Sonnet wins on agent reliability; Gemini wins on multimodal and 1M context recall.

Are you on a budget but need US-lab quality? Claude Haiku 4.5 or GPT-5.4 Mini. Both punch above their price tag.

Are you cost-sensitive and OK with open-weight quality? DeepSeek V4 Flash is the answer for most teams — 1M context at $0.14/$0.28. Llama 4 Scout if you can route through DeepInfra and don't need vision.

Are you doing offline / batch work? Pick anything and add --batch for 50% off. The model choice matters less than turning batch on.

This is the same logic our LLM API selection decision matrix lays out by use case if you want a longer breakdown.

What this table doesn't show

Three caveats worth knowing before you route off these numbers.

Rate limits matter more than price for many teams. A $0.30/M model you can't get capacity on at peak is more expensive than a $5/M model you can. OpenAI and Anthropic have the most generous tiers; the cheaper Chinese models often gate hard on enterprise quotas.
Quality is not flat within a price band. Claude Sonnet 4.6 and Gemini 3.1 Pro are priced similarly but win on different tasks. Sonnet leads on multi-turn agent reliability; Gemini leads on 1M+ token recall and image input. There's no substitute for running your eval on both.
Provider markup is real. Going through a reseller adds 5–20% in most cases. We break down OpenRouter's actual margin versus first-party APIs in a separate piece — short version: it's higher than they advertise once you account for routing costs.
First-party Claude pricing matches ofox pricing. Anthropic does not let resellers undercut; the only saving is from removing the need for multiple billing relationships. That logic applies to all the big labs.

The aggregator question

You can pay nine providers separately, manage nine API keys, and reconcile nine invoices. Or you can route everything through one OpenAI-compatible endpoint. ofox.ai is the aggregator we run — one key for every model in this table, OpenAI-compatible SDK, and prices that match each provider's first-party rate for flagship models with up to 70% off on open-weight ones. We're not the only option, but the math is similar across aggregators: the value is in not maintaining nine integrations, not in saving 2% on token cost.

For a deeper read on flagship-level differences, Claude vs GPT vs Gemini is the pillar piece this article links into. For first-party tier breakdowns specifically: Claude API pricing breakdown, Gemini 3.1 Pro pricing, GPT-5.4 Pro pricing, DeepSeek V4 pricing, and how to actually reduce AI API costs. The May 2026 LLM leaderboard is the quality-side companion to this price-side table.

The right model is the one whose price you don't have to think about — pick it for capability and let the bill take care of itself, or pick it for cost and let the capability ceiling decide your roadmap. Anything in between just means you'll switch in six months.

Pricing sources (verified May 26, 2026)

OpenAI: openai.com/api/pricing
Anthropic: platform.claude.com/docs/en/about-claude/pricing
Google Gemini: ai.google.dev/gemini-api/docs/pricing
xAI: x.ai/api
DeepSeek: api-docs.deepseek.com/quick_start/pricing
Moonshot Kimi: official pricing via openrouter.ai/moonshotai
Alibaba Qwen: alibabacloud.com/help/en/model-studio/model-pricing
Mistral: mistral.ai/pricing
Meta Llama 4 (via DeepInfra): deepinfra.com/meta-llama

This table will be re-verified at the start of each month. If a number here disagrees with the provider's page, the provider wins — but tell us, because we want to keep this honest.

Originally published on ofox.ai/blog.

DEV Community