Something broke in the AI pricing market between January and May 2026.
A year ago, "frontier model" meant "expensive model." Claude Opus was $15/$75 per million tokens. GPT-4 was $5/$15. If you wanted the best coding performance, you paid the best price. The correlation between quality and cost was loose, but it existed.
That correlation is gone.
The Numbers That Changed Everything
Here's SWE-bench Verified — the benchmark that tests AI models against real GitHub issues from projects like Django, Flask, and scikit-learn — plotted against output price per million tokens:
Model SWE-bench Output $/1M Score/Dollar
─────────────────────────────────────────────────────────────────
Claude Opus 4.7 87.6% $25.00 3.5
Claude Opus 4.6 80.8% $25.00 3.2
Gemini 3.1 Pro 80.6% $15.00 5.4
GPT-5.2 80.0% $10.00 8.0
DeepSeek V4 Pro (Max) 80.6% $3.48 23.2
Kimi K2.6 80.2% $4.00 20.1
Qwen3.6 Plus 78.8% $3.00 26.3
MiniMax M2.5 80.2% $1.20 66.8
DeepSeek V4 Flash (Max) 79.0% $0.28 282.1
Read that last line again. DeepSeek V4 Flash scores 79% on SWE-bench at $0.28 per million output tokens. Claude Opus 4.7 scores 87.6% at $25.00.
The performance gap is 8.6 percentage points. The price gap is 89x.
For a team running 100 million tokens per month, that's the difference between $28/month and $2,500/month. For a 9-point improvement in code completion accuracy.
It's Not Just One Outlier
This isn't a DeepSeek anomaly. Look at the cluster of models scoring 78-80% on SWE-bench:
- DeepSeek V4 Pro: $3.48/1M output — open source, 1M context
- Kimi K2.6: $4.00/1M output — open source, 256K context
- MiniMax M2.5: $1.20/1M output — open source, 200K context
- Qwen3.6 Plus: $3.00/1M output — open source, 1M context
- MiMo-V2-Pro: $3.00/1M output — open source, 1M context
Five models from five different Chinese labs, all scoring within 2 points of GPT-5.2 ($10.00/1M) and Gemini 3.1 Pro ($15.00/1M), all at 1/3 to 1/10 the price.
And they're all open source.
What Happened
Three things converged:
1. Mixture-of-Experts architectures matured. DeepSeek V4 uses a 1-trillion parameter MoE architecture where only ~60B parameters activate per token. You get the knowledge capacity of a 1T model at the inference cost of a 70B model. MiniMax M2.5 achieved 80.2% SWE-bench with only 10B active parameters.
2. Chinese labs optimized for cost from day one. While Western labs built premium-priced APIs and recouped GPU investments through margin, Chinese labs — facing export restrictions on top-tier NVIDIA hardware — were forced to squeeze more performance from less compute. That constraint became a competitive advantage.
3. Reinforcement learning on code got cheap. The techniques that powered Claude's SWE-bench dominance (RL on real-world code feedback) diffused rapidly. By early 2026, multiple labs had replicated and improved on these methods.
The Cache Pricing Dimension
There's a second pricing war happening that most developers haven't noticed: cache pricing.
When you send the same context to an API repeatedly (as agents do), cached input tokens cost a fraction of fresh ones:
| Provider | Normal Input | Cached Input | Discount |
|---|---|---|---|
| Gemini 3.5 Flash | $1.50/1M | $0.15/1M | 90% |
| DeepSeek V4 Pro | $1.74/1M | $0.44/1M | 75% |
| MiniMax M2.7 | $0.30/1M | $0.06/1M | 80% |
For agentic workloads — where an AI reads the same codebase context dozens of times — cache pricing changes the math entirely. Gemini 3.5 Flash at $0.15/1M cached input is effectively free for most agent loops.
Context Windows: The Silent Differentiator
While everyone debates price and benchmarks, context window size quietly determines what you can actually do:
| Model | Context Window |
|---|---|
| Gemini 3.0 Pro | 2,000,000 tokens |
| GPT-5.5 | 1,000,000 |
| Claude Opus 4.7 | 200,000 |
Google's 2M context lets you load an entire mid-sized codebase into a single prompt. Anthropic's 200K — the smallest among frontier models — means you're chunking and summarizing for anything beyond a few thousand lines.
This matters for code review, documentation generation, and refactoring tasks where the model needs to see the full picture. If your use case involves large codebases, the "cheapest model per token" calculation needs a "how many calls do I actually need" multiplier.
The Practical Decision Framework
Given all this data, here's how I'd actually choose a model in May 2026:
Daily coding assistance (autocomplete, inline suggestions):
→ DeepSeek V4 Flash. 79% SWE-bench at $0.28/1M output. For high-volume, low-stakes completions, nothing else makes economic sense.
Code review and bug fixing:
→ MiniMax M2.5 or Kimi K2.6. 80%+ SWE-bench at $1-4/1M output. The quality is genuinely close to frontier — you'll catch 95% of the bugs that Opus catches.
Complex refactoring across large codebases:
→ Gemini 3.1 Pro. 1M context + 80.6% SWE-bench. When you need the model to see everything, context window trumps per-token cost.
When the code absolutely must be right:
→ Claude Opus 4.7. 87.6% SWE-bench is a real, measurable improvement. For security-critical code, infrastructure, or anything where a bug costs more than the API call, pay the premium.
Agentic workflows (repeated context reads):
→ Gemini 3.5 Flash with cache. $0.15/1M cached input makes multi-step agent loops affordable.
The Data Behind These Claims
All the benchmark scores and pricing data I've referenced come from AI Models Navi, which tracks 260+ models across SWE-bench, GPQA Diamond, ARC-AGI-2, FrontierMath, and other benchmarks, along with real-time API pricing from every major provider.
The interactive benchmark explorer lets you compare any models head-to-head. The cost calculator estimates monthly spend based on your actual token usage patterns. And the value ranking normalizes benchmark performance per dollar — which is where the real surprises are.
The site is currently primarily in Japanese, but the English version is live with full data.
The Uncomfortable Truth
Here's what the data actually says that nobody wants to hear:
The "best" model and the "best value" model have never been further apart.
Claude Opus 4.7 at 87.6% SWE-bench is the best coding model. DeepSeek V4 Flash at 79% and $0.28/1M is the best value. The performance gap is 8.6 points. The cost gap is 89x.
For most development tasks — writing boilerplate, fixing typos, generating tests, writing docs — that 8.6-point gap doesn't matter. You're paying 89x for edge cases.
The developers who figure this out first will ship faster and spend less. The ones who default to "the best model" for everything will wonder why their AWS bill doubled.
Model selection in 2026 is a math problem. Treat it like one.
What's your current default model for daily development? Curious whether anyone has done their own cost/performance analysis — would love to compare notes in the comments.
Top comments (0)