8-min read · Part 2 of 4 · AI Model Comparison Series
Part 1 revealed the overall ranking: Claude Opus 4.8 at 95, GPT-5.5 at 91, and a pack of four within 4 points.
But here's the problem: no single model leads across the board. Opus 4.8 owns coding but narrowly trails on agentic tasks. GPT-5.5 crushes reasoning but falls apart on multimodal. DeepSeek V4 Pro wins math contests but struggles with long context.
This is Part 2 of our 4-part series. We break down 7 capability dimensions — who leads each one, by how much, and what that means for your use case.
1. Agentic: GPT-5.5 (98.0)
Top 3: GPT-5.5 (98.0) > Claude Opus 4.8 (97.7) > Gemini 3.5 Flash (96.9)
All three are within 1.1 points, but their paths diverge sharply:
| Sub-dimension | Champion | Score | Runner-up | Score |
|---|---|---|---|---|
| Terminal tasks | GPT-5.5 | 82.7% | Opus 4.8 | 74.6% |
| Tool orchestration | Gemini 3.5 Flash | 83.6% | Opus 4.7 | 78.0% |
| Coding Agent | Claude Opus 4.8 | 69.2% (SWE-bench Pro) | GPT-5.5 | 58.6% |
GPT-5.5's edge comes from its end-to-end reasoning design: as a reasoning model, it is better at backtracking and error correction in multi-step agent loops. Gemini 3.5 Flash, at just $1.50/M input tokens, achieves 83.6% on MCP Atlas tool orchestration, making it the agentic value king.
2. Coding: Claude Opus 4.8 (98.9) — By a Landslide
Claude Opus 4.8 scores 98.9, a full 11.7 points ahead of second place. Among the dimensions scored on the same scale, only the multimodal lead (11.8 points) is wider.
| Benchmark | Champion | Score | Runner-up | Score |
|---|---|---|---|---|
| SWE-bench Pro | Claude Opus 4.8 | 69.2% | GPT-5.5 | 58.6% |
| LiveCodeBench | DeepSeek V4 Pro | 93.5% | V4 Flash | 91.6% |
But there's a critical split: competitive programming and real-world software engineering are two different things. DeepSeek V4 Pro leads on LiveCodeBench (93.5%), a competition benchmark. Opus 4.8 dominates SWE-bench Pro (69.2%), which tests real issue-fixing. Know your coding scenario before choosing.
3. Reasoning: GPT-5.5 (96.9)
| Benchmark | Champion | Score | Runner-up | Score |
|---|---|---|---|---|
| ARC-AGI-2 | GPT-5.5 | 85% | GPT-5.4 | 83.3% |
| HLE | Claude Opus 4.8 | 57.9% | GPT-5.5 | 53.4% |
| Putnam Math | DeepSeek V4 Pro | 120/120 | — | — |
GPT-5.5 is the only general model to break the ARC-AGI-2 85% prize threshold — humans average 66%. DeepSeek V4 Pro scores a perfect 120/120 on the Putnam math competition at one-third the price of GPT-5.5.
4. Knowledge: Three-Way Tie (99.3 vs 99.2 vs 97.8)
Knowledge has the tightest race at the top: 0.1 points separate first and second, with GPT-5.5 another 1.4 points back:
- Opus 4.8 (99.3)
- GPT-5.4 (99.2)
- GPT-5.5 (97.8)
The gap is negligible. Pick any of the three.
5. Multimodal: Gemini 3.5 Flash (80.6) — The Dark Horse
| Model | Multimodal Score | MMMU-Pro |
|---|---|---|
| Gemini 3.5 Flash 🥇 | 80.6 | 84.2% |
| Claude Opus 4.8 | 68.8 | — |
| GPT-5.4 | 60.0 | — |
| GPT-5.5 | 57.2 ❌ | 79.8% |
GPT-5.5's weakest dimension is multimodal — scoring only 57.2. If your workflow depends on image/video understanding, Gemini 3.5 Flash is the uncontested choice.
6. Long Context: GPT-5.5 (94.8% at 128K)
At short context (<100K tokens), the field is tight. Real divergence starts at 200K+ tokens:
| Scenario | GPT-5.5 | Claude Opus 4.7 | Gap |
|---|---|---|---|
| 128K retrieval | 94.8% | 89.1% | 5.7pp |
| 512K-1M retrieval | 74.0% | 32.2% | 2.3x |
Against Claude Opus 4.7, GPT-5.5 is the more reliable choice for long-context work. Once context exceeds 500K tokens, its retrieval accuracy (74.0%) is roughly 2.3x Opus 4.7's (32.2%).
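The "Gap" column above mixes two units: a percentage-point difference at 128K and a ratio at 512K-1M. Here is a minimal Python sketch of that arithmetic, using only the four values from the table. The function names `pp_gap` and `accuracy_ratio` are illustrative, not from any benchmark tooling.

```python
def pp_gap(a: float, b: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(a - b, 1)


def accuracy_ratio(a: float, b: float) -> float:
    """How many times larger a is than b, rounded to one decimal."""
    return round(a / b, 1)


# Retrieval accuracy (%) copied from the table: GPT-5.5 vs Claude Opus 4.7.
gap_128k = pp_gap(94.8, 89.1)
ratio_512k_1m = accuracy_ratio(74.0, 32.2)

print(f"128K gap: {gap_128k}pp")
print(f"512K-1M ratio: {ratio_512k_1m}x")
```

Reading the long-range row as a ratio rather than a point difference is a presentation choice: the same two numbers are also 41.8 percentage points apart.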
7. Math: DeepSeek V4 Pro (120/120 Putnam)
The math champion is DeepSeek V4 Pro — a perfect score on the Putnam competition, at just $0.33/M input (one-third the price of GPT-5.5).
Cheat Sheet: Best Model by Capability
| Capability | Best Model | Score | Runner-up | Gap |
|---|---|---|---|---|
| Agentic | GPT-5.5 | 98.0 | Opus 4.8 | 0.3 |
| Coding | Claude Opus 4.8 | 98.9 | GPT-5.4 | 11.7 |
| Reasoning | GPT-5.5 | 96.9 | GPT-5.4 | 1.3 |
| Knowledge | Opus 4.8 | 99.3 | GPT-5.4 | 0.1 |
| Multimodal | Gemini 3.5 Flash | 80.6 | Opus 4.8 | 11.8 |
| Long Context | GPT-5.5 | 94.8% | Opus 4.7 | 5.7pp |
| Math | DeepSeek V4 Pro | 120/120 | GPT-5.5 | — |
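If you want the cheat sheet in code, for example to pick a default model per task type, one possible shape is a plain lookup table. This is a sketch only: the dictionary layout and the helper names `best_model` and `close_calls` are assumptions made for illustration, and the values are transcribed from the table above. Gaps are kept only where the table reports them in score points.

```python
# Values transcribed from the cheat-sheet table.
# "gap" is in score points; None where the table uses other units or gives no gap.
CHEAT_SHEET = {
    "agentic": {"best": "GPT-5.5", "runner_up": "Opus 4.8", "gap": 0.3},
    "coding": {"best": "Claude Opus 4.8", "runner_up": "GPT-5.4", "gap": 11.7},
    "reasoning": {"best": "GPT-5.5", "runner_up": "GPT-5.4", "gap": 1.3},
    "knowledge": {"best": "Opus 4.8", "runner_up": "GPT-5.4", "gap": 0.1},
    "multimodal": {"best": "Gemini 3.5 Flash", "runner_up": "Opus 4.8", "gap": 11.8},
    "long context": {"best": "GPT-5.5", "runner_up": "Opus 4.7", "gap": None},
    "math": {"best": "DeepSeek V4 Pro", "runner_up": "GPT-5.5", "gap": None},
}


def best_model(capability: str) -> str:
    """Return the top-ranked model for a capability (case-insensitive)."""
    return CHEAT_SHEET[capability.strip().lower()]["best"]


def close_calls(max_gap: float = 1.5) -> list[str]:
    """Capabilities where the runner-up trails by at most max_gap points."""
    return [
        name
        for name, row in CHEAT_SHEET.items()
        if row["gap"] is not None and row["gap"] <= max_gap
    ]


print(best_model("Coding"))
print(close_calls())
```

The `close_calls` helper flags the dimensions where the runner-up is close enough that price or latency may reasonably decide the choice. The 1.5-point default is an arbitrary threshold, not something the benchmarks define.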
Coming Next
Part 3 uncovers two dimensions most people overlook: design ability and price-to-value. Did you know MiniMax M3 ranks second in design capability despite scoring 76 on BenchLM? And that the price gap between the most and least expensive models is 69x?
Data sources: BenchLM · BuildFastWithAI · AIMadeTools · CallSphere
