8-min read · Part 1 of 4 · AI Model Comparison Series
Who's on top? How big is the gap?
In Q2 2026, the large language model industry entered a period of unusually rapid iteration. Within just 11 weeks, OpenAI, Anthropic, Google, DeepSeek, and MiniMax each released flagship models, forming a competitive landscape of "three pillars plus an open-source rise."
This is Part 1 of a 4-part series. Using BenchLM composite scores and Arena Elo human preference rankings, we map where eight major AI models stand as of June 2026.
I. Two Evaluation Systems, One Ruler
Before diving into rankings, let's understand our measuring tools:
📊 BenchLM — Weighted aggregate of 237 benchmarks across 8 dimensions including Agentic (22%), Coding (20%), Reasoning (17%). Scored 0-100. Currently the most comprehensive objective evaluation system.
🏟️ Arena Elo — LMSYS Chatbot Arena's 6M+ anonymous blind votes, reflecting actual human preferences rather than standardized test scores.
Using both together = checking both "exam performance" (BenchLM) and "real-world feel" (Arena Elo).
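To make "weighted aggregate" concrete, here is a minimal sketch of how a composite could be computed from per-dimension scores. Only the Agentic, Coding, and Reasoning weights come from the description above; the `other` bucket (the remaining 41% spread across the five dimensions not listed), the function name, and the input scores are assumptions for illustration, not BenchLM's published formula or data.

```python
# Weights for the three dimensions named in the article. The remaining 41%
# is lumped into a hypothetical "other" bucket, because the article does not
# list the weights of the other five dimensions.
WEIGHTS = {"agentic": 0.22, "coding": 0.20, "reasoning": 0.17, "other": 0.41}


def composite_score(dimension_scores, weights=WEIGHTS):
    """Return the weighted mean of per-dimension scores (each 0-100)."""
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[d] * dimension_scores[d] for d in weights)
    return weighted_sum / total_weight


# Made-up inputs for illustration only; these are not real BenchLM numbers.
example = {"agentic": 90.0, "coding": 80.0, "reasoning": 70.0, "other": 60.0}
score = composite_score(example)  # about 72.3
```

Because Agentic and Coding together carry 42% of the weight in this sketch, a model that is strong in those two dimensions can rank high overall even with a weaker showing elsewhere.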
II. BenchLM Rankings: Three Tiers at a Glance
Tier 1 (91-95): Flagship Showdown
| Model | BenchLM Score | Strongest Dimension |
|---|---|---|
| Claude Opus 4.8 🥇 | 95 | Coding 98.9, Knowledge 99.3 |
| GPT-5.5 | 91 | Agentic 98.0, Reasoning 96.9 |
- Opus 4.8 leads by 4 points; Coding 98.9 beats GPT-5.5 by nearly 15 points
- But GPT-5.5 excels in Agent capability and long-context retrieval
- Key takeaway: Opus for coding, GPT for Agents
Tier 2 (85-89): Strengths and Niches
| Model | Score | Core Positioning |
|---|---|---|
| GPT-5.4 | 89 | Knowledge & reasoning specialist, Reasoning 95.6 |
| Gemini 3.5 Flash | 87 | Agent + multimodal dark horse, Pro-grade at Flash price |
| DeepSeek V4 Pro (Max) | 87 | MIT open-source flagship, LiveCodeBench 93.5 |
| Claude Opus 4.7 (Adaptive) | 85 | Best human preference, Arena #3 |
- Four models within 4 points — price and ecosystem matter more than absolute score
- Gemini 3.5 Flash hits 96.9 in Agentic at $1.50/M input — shattering "Flash = compromise"
Tier 3 (57-76): Niche Champions
| Model | Score | One-line Positioning |
|---|---|---|
| MiniMax M3 | 76 | New challenger, weights not yet released |
| DeepSeek V4 Flash | 57 | Extreme cost efficiency, 313.2 points/$ |
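The "points/$" figure is not defined in this article. A plausible reading, assumed here, is the BenchLM score divided by the input price in USD per million tokens. Under that assumption, the sketch below computes the ratio for Gemini 3.5 Flash from the numbers quoted above and backs out the price implied for DeepSeek V4 Flash. Both function names are hypothetical.

```python
def points_per_dollar(score, usd_per_million_input):
    """BenchLM points per USD of input price (assumed definition)."""
    if usd_per_million_input <= 0:
        raise ValueError("price must be positive")
    return score / usd_per_million_input


def implied_price(score, ppd):
    """Invert the ratio: the input price implied by a points/$ figure."""
    if ppd <= 0:
        raise ValueError("points per dollar must be positive")
    return score / ppd


# Gemini 3.5 Flash: score 87 at $1.50/M input (both from the article).
gemini_flash_ppd = points_per_dollar(87, 1.50)   # 58.0

# DeepSeek V4 Flash: score 57 at 313.2 points/$ (both from the article).
# The resulting price is only as valid as the assumed definition.
deepseek_flash_price = implied_price(57, 313.2)  # about 0.182 USD per M
```

If the assumed definition holds, DeepSeek V4 Flash delivers roughly five times as many points per dollar as Gemini 3.5 Flash, which is the sense in which a 57-point model can still be a "niche champion."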
III. Arena Elo: Human Preference Speaks
Most counterintuitive finding: Opus 4.7 (#3, 1491) ranks above Opus 4.8 (#7, 1479).
This is not because Opus 4.7 is stronger. The reasons:
- Insufficient vote accumulation: Opus 4.8 launched only ~12 days ago and has had far less time to collect votes than Opus 4.7, which already has 11,000+
- Elo convergence lag: the Bradley-Terry system needs 4-8 weeks to stabilize
- Thinking variant confusion: Opus 4.8's Thinking mode is not yet broadly deployed
Standardized benchmarks all show Opus 4.8 comprehensively ahead: SWE-bench Pro 69.2% vs 64.3%, BenchLM 95 vs 85.
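It also helps to see how small a 12-point Elo gap is. The sketch below uses the standard Elo expected-score formula with the conventional base-10, 400-point scale. That Arena's published ratings map onto exactly this scale is an assumption here.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected head-to-head win rate of A over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings quoted in the article: Opus 4.7 at 1491, Opus 4.8 at 1479.
p_opus47_wins = elo_expected_score(1491, 1479)  # about 0.517
```

Under this model, Opus 4.7 would be preferred in roughly 52 of 100 blind matchups, barely better than a coin flip. With few votes collected so far, a gap that small can easily reverse as more votes arrive.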
| Model Type | Representative (Arena vs. BenchLM deviation) | Selection Signal |
|---|---|---|
| Arena-friendly ↑ | DeepSeek V4 Flash (+22), MiniMax M3 (+5) | Best for interactive apps |
| BenchLM-friendly ↓ | GPT-5.5 (-6), Opus 4.8 (-5) | Best for batch processing |
| High consistency ≈ | DeepSeek V4 Pro (-3), GPT-5.4 (+4) | Most reliable for selection |
Core conclusion: BenchLM measures "capability ceiling" (peak performance under optimal reasoning), while Arena Elo measures "daily experience" (human preference in casual conversation). The direction of deviation itself is a selection signal.
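The grouping in the table can be expressed as a simple threshold rule on the signed deviation. The deviations below are copied from the table. The cutoff of 5 is an assumption chosen because it reproduces the table's grouping, not a rule stated by either leaderboard, and the function and label names are hypothetical.

```python
# Signed deviations as listed in the table (positive = stronger on Arena).
DEVIATIONS = {
    "DeepSeek V4 Flash": 22,
    "MiniMax M3": 5,
    "GPT-5.5": -6,
    "Opus 4.8": -5,
    "DeepSeek V4 Pro": -3,
    "GPT-5.4": 4,
}


def classify(deviation, threshold=5):
    """Bucket a model by the direction and size of its deviation."""
    if deviation >= threshold:
        return "arena-friendly"      # candidate for interactive apps
    if deviation <= -threshold:
        return "benchlm-friendly"    # candidate for batch processing
    return "high-consistency"        # both systems roughly agree


labels = {model: classify(dev) for model, dev in DEVIATIONS.items()}
```

Note that MiniMax M3 (+5) and GPT-5.4 (+4) land in different buckets by a single point, so models near the cutoff are better read as borderline than as firmly in one camp.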
Coming Next
Part 2 will break down 7 capability dimensions: Agentic, Coding, Reasoning, Knowledge, Multimodal, Long Context, Math — top model and runner-up in each dimension, and how big the gap is.
See you tomorrow at 7 PM JST.
Data sources: BenchLM Leaderboard · lmmarketcap Arena Elo · BuildFastWithAI
