There is no single best model in April 2026 — the leaderboard has fractured by task.
Claude Opus 4.7 dominates coding benchmarks at 82% on SWE-bench Verified and ranks first on LM Arena with a 1504 Elo rating. Three models tie at the top of the Artificial Analysis Intelligence Index with a score of 57: Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.4. DeepSeek V3.2 offers the lowest pricing of the group at $0.29 per million input tokens.
How These Rankings Work
Four independent benchmarking systems feed these rankings:
- LM Arena — Blind human preference voting across 339 models with 5.7M+ votes. The largest human-preference dataset in existence, ranked with chess-style Elo ratings (a minimal sketch of the rating math follows this list).
- SWE-bench Verified — Evaluates whether models can resolve actual GitHub issues through agent-based testing.
- GPQA Diamond — Graduate-level science questions where human PhD experts typically score 65-70%.
- Artificial Analysis Intelligence Index — Combines multiple benchmarks into composite scoring.
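For intuition on how the Elo rankings below are produced, here is a minimal sketch of turning blind pairwise votes into ratings. The model names, starting rating, and K-factor are illustrative assumptions, and LM Arena's production pipeline estimates ratings over millions of votes rather than with a simple online update like this one.

```python
# Minimal sketch: deriving chess-style Elo ratings from pairwise preference votes.
# Illustrative only. Starting rating, K-factor, and model names are assumptions;
# LM Arena's actual pipeline fits ratings over the full vote history.
from collections import defaultdict

K = 4  # small step size, since each model accumulates many votes

def expected_score(r_a: float, r_b: float) -> float:
    """Probability the Elo model assigns to A being preferred over B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def record_vote(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed outcome of one blind comparison."""
    surprise = 1.0 - expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * surprise
    ratings[loser] -= K * surprise

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same baseline
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
for winner, loser in votes:
    record_vote(ratings, winner, loser)
print(dict(ratings))
```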
Overall Leaderboard (LM Arena Top 10)
| Rank | Model | Elo Score |
|---|---|---|
| 1 | claude-opus-4-7-thinking | 1504 |
| 2 | claude-opus-4-6-thinking | 1502 |
| 3 | claude-opus-4-7 | 1497 |
| 4 | claude-opus-4-6 | 1496 |
| 5 | muse-spark (Meta) | 1493 |
| 6 | gemini-3.1-pro-preview | 1493 |
| 7 | gemini-3-pro | 1486 |
| 8 | grok-4.20-beta1 | 1482 |
| 9 | gpt-5.4-high | 1482 |
| 10 | grok-4.20-beta-0309-reasoning | 1480 |
Anthropic holds the top four spots outright. The 24-point gap between first and tenth place is statistically meaningful but not a blowout.
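To put that 24-point gap in perspective, the standard Elo expected-score formula converts a rating difference into an implied head-to-head preference rate:

```python
# Implied head-to-head preference rate for a given Elo gap (standard Elo formula).
def implied_win_rate(elo_gap: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))

print(f"{implied_win_rate(1504 - 1480):.1%}")  # 24-point gap -> roughly 53.4%
```

In other words, voters would be expected to prefer the first-place model over the tenth-place model only slightly more often than a coin flip.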
Best for Coding: SWE-bench Rankings
| Model | Score | Notes |
|---|---|---|
| Claude Opus 4.7 | 82.0% | Released April 16, 2026 |
| Gemini 3.1 Pro Preview | 78.8% | Best price among top-3 |
| Claude Opus 4.6 (Thinking) | 78.2% | Prior Opus release |
| GPT-5.4 | 78.2% | Tied with Opus 4.6 |
| GPT-5.3 Codex | 78.0% | Coding-tuned variant |
The spread between #1 and #5 is roughly 4 percentage points. The differences show up in edge cases: complex multi-file refactors, ambiguous specs, and long-running tasks.
Best for Reasoning: Composite Intelligence Index
| Model | AA Score |
|---|---|
| Claude Opus 4.7 | 57 |
| Gemini 3.1 Pro Preview | 57 |
| GPT-5.4 | 57 |
| Kimi K2.6 | 54 |
| Claude Opus 4.6 | 53 |
The three-way tie at 57 points indicates the current frontier is a plateau. Selection depends on cost, context window, and task-specific requirements rather than performance differentiation.
Best Value: Price-Performance Comparison
| Model | Input $/M | Output $/M | Context | SWE-bench |
|---|---|---|---|---|
| DeepSeek V3.2 | $0.29 | $0.43 | 164K | — |
| Kimi K2.6 | $0.60 | $2.50 | 256K | vendor-reported only |
| Gemini 3.1 Pro Preview | $2.00 | $12.00 | 1M | 78.8% |
| GPT-5.4 | $2.50 | $15.00 | 1M | 78.2% |
| Claude Opus 4.7 | $5.00 | $25.00 | 1M | 82.0% |
| Claude Opus 4.6 | $5.00 | $25.00 | 1M | 78.2% |
DeepSeek V3.2 is 17x cheaper than Claude Opus 4.7 on input tokens. Kimi K2.6 is roughly 8x cheaper on input, with an Intelligence Index score only 3 points below the frontier band.
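To make those ratios concrete, here is a quick cost projection for a hypothetical workload of 100M input and 20M output tokens per month, using the per-million-token prices from the table above. The workload size and mix are assumptions chosen for illustration.

```python
# Projected monthly cost for a hypothetical workload (100M input, 20M output tokens),
# using the $/M-token list prices from the table above. The workload mix is assumed.
PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "DeepSeek V3.2":          (0.29, 0.43),
    "Kimi K2.6":              (0.60, 2.50),
    "Gemini 3.1 Pro Preview": (2.00, 12.00),
    "GPT-5.4":                (2.50, 15.00),
    "Claude Opus 4.7":        (5.00, 25.00),
}

INPUT_M, OUTPUT_M = 100, 20  # millions of tokens per month (assumed)

for model, (price_in, price_out) in PRICES.items():
    monthly = INPUT_M * price_in + OUTPUT_M * price_out
    print(f"{model:<24} ${monthly:>8,.2f} / month")
```

Note that the exact multiple shifts with the input/output mix: because the output-price gap is even wider than the input-price gap, an output-heavy workload stretches the difference beyond the input-only ratios quoted above.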
Best Open-Source Model
Kimi K2.6 from Moonshot AI is a 1-trillion-parameter Mixture-of-Experts model with 32B active parameters per token and a 256K context window. It scores 54 on the Intelligence Index, ahead of Claude Opus 4.6 (53).
Which Model Should You Pick?
- For Coding: Claude Opus 4.7 leads at 82% SWE-bench. Cost-conscious teams should evaluate Kimi K2.6.
- For Long-Context Work: Gemini 3.1 Pro Preview — 1M-token window with tied frontier performance.
- For High-Volume Production: DeepSeek V3.2 as cost-effective alternative.
- For General Chat: Claude Opus 4.7 (thinking mode) leads, but gaps are negligible for most apps.
- For Self-Hosted: Kimi K2.6 — the only open-weight model that belongs in this conversation.
Originally published at ofox.ai