HIROKI II

Posted on Jun 10

Deep Dive: 7 Capability Dimensions × 8 AI Models -- Who Leads Where?

#agents #research #llm #ai

5-min read · Curated by an AI Systems Architect
Focus: AI Model Benchmarks · Capability Dimensions · Model Selection

In the first part of this series, we saw the overall rankings. But one thing stood out: there is no "all-round champion."

Claude Opus 4.8 tops BenchLM at 95, but GPT-5.5 is stronger in Agentic capabilities and long context. Choosing a model isn't about picking the highest total score -- it's about picking the one that fits your use case.

This is Part 2 of the series, drilling into 7 capability dimensions: who leads and who is the runner-up in each, and how big is the gap?

1. Agentic Capability

Top 3: GPT-5.5 (98.0) > Claude Opus 4.8 (97.7) > Gemini 3.5 Flash (96.9)

The difference between the three is only 1.1 points, but their strength profiles are completely different:

Benchmark	Champion	Score	Runner-up	Score
Terminal Tasks	GPT-5.5	82.7%	Opus 4.8	74.6%
Tool Orchestration	Gemini 3.5 Flash	83.6%	Opus 4.7	78.0%
Coding Agent	Claude Opus 4.8	69.2% (SWE-bench Pro)	GPT-5.5	58.6%

Source: BuildFastWithAI

GPT-5.5's Agentic edge comes from its end-to-end reasoning design -- as a reasoning model, it excels at backtracking and error correction in multi-step agent loops. Meanwhile, Gemini 3.5 Flash, at $1.50 per million input tokens, scores 83.6% on MCP Atlas (the tool orchestration benchmark in the table above), making it the best-value Agentic pick.

2. Coding Capability

Top 3: Claude Opus 4.8 (98.9) > GPT-5.4 / MiniMax M3 (87.2) > GPT-5.5 (84.0)

Opus 4.8 leads the runner-up by 11.7 points -- one of the two largest gaps across the 7 dimensions, alongside the 11.8-point gap in Multimodal.

Benchmark	Champion	Score	Runner-up	Score
SWE-bench Pro	Claude Opus 4.8	69.2%	GPT-5.5	58.6%
LiveCodeBench	DeepSeek V4 Pro	93.5% (self-reported)	V4 Flash	91.6%

Sources: AIMadeTools, DeepSeek V4 Benchmark

Key divergence: Competitive programming and real-world software engineering are two different things. DeepSeek V4 Pro leads LiveCodeBench (competitive programming) at 93.5%, but on SWE-bench Pro (real issue fixing), it's soundly beaten by Opus 4.8. Know your coding scenario before choosing a model.

3. Reasoning & Mathematics

Benchmark	Champion	Score	Runner-up	Score
ARC-AGI-2 (Hardest Reasoning)	GPT-5.5	85%	GPT-5.4 Pro	83.3%
HLE (Cross-domain Reasoning)	Claude Opus 4.8	57.9%	GPT-5.5	53.4%
Putnam Math Competition	DeepSeek V4 Pro	120/120 Perfect	--	--

Sources: BenchLM ARC-AGI-2, AIMadeTools

GPT-5.5 is the only general-purpose model to reach the 85% ARC-AGI-2 grand prize threshold -- humans average only 66%. DeepSeek V4 Pro, meanwhile, scores a perfect 120/120 on the Putnam Math Competition, making it the value pick for math scenarios (at one-third the price of GPT-5.5).

4. Knowledge & Multimodal

Knowledge -- Near three-way tie: Opus 4.8 (99.3) ≈ GPT-5.4 (99.2) ≈ GPT-5.5 (97.8). The top two are 0.1 points apart, and GPT-5.5 trails the leader by only 1.5.

Multimodal -- Big divergence:

Model	Multimodal Score	MMMU-Pro
Gemini 3.5 Flash	80.6	84.2%
Claude Opus 4.8	68.8	--
GPT-5.4	60.0	--
GPT-5.5	57.2	79.8%

Source: BuildFastWithAI, BenchLM

GPT-5.5's most significant weakness is multimodal. If your workflow depends on image/video understanding, Gemini 3.5 Flash is the undisputed choice.
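To make "pick the model that fits your use case" concrete, here is a minimal sketch that ranks models by a weighted average of the dimension scores quoted so far in this article. The weights and the function name `rank_models` are illustrative assumptions, not part of any benchmark, and models with no quoted score for a weighted dimension are skipped rather than guessed at.

```python
# Dimension scores quoted in this article; a missing key means no score was given.
SCORES = {
    "Claude Opus 4.8": {"agentic": 97.7, "coding": 98.9, "knowledge": 99.3, "multimodal": 68.8},
    "GPT-5.5": {"agentic": 98.0, "coding": 84.0, "knowledge": 97.8, "multimodal": 57.2},
    "GPT-5.4": {"coding": 87.2, "knowledge": 99.2, "multimodal": 60.0},
    "Gemini 3.5 Flash": {"agentic": 96.9, "multimodal": 80.6},
}


def rank_models(weights, scores=SCORES):
    """Rank models by weighted average over the dimensions named in `weights`.

    Models that lack a score for any weighted dimension are left out.
    """
    total = sum(weights.values())
    ranked = []
    for model, dims in scores.items():
        if all(dim in dims for dim in weights):
            value = sum(dims[dim] * w for dim, w in weights.items()) / total
            ranked.append((model, round(value, 2)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights for a vision-heavy agent workflow.
vision_agent = rank_models({"agentic": 0.5, "multimodal": 0.5})
# Hypothetical weights for a coding-agent workflow.
coding_agent = rank_models({"coding": 0.7, "agentic": 0.3})

print(vision_agent[0][0])
print(coding_agent[0][0])
```

With these example weights, Gemini 3.5 Flash ranks first for the vision-heavy workflow and Claude Opus 4.8 for the coding-agent workflow, which matches the point that the best model changes with the use case.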

5. Long Context

On short text (<100K tokens), the models are close. The real divergence starts at 200K+ tokens:

Scenario	GPT-5.5	Claude Opus 4.7	Gap
128K Retrieval	94.8%	89.1%	5.7pp
512K-1M Retrieval	74.0%	32.2%	2.3x

Source: CallSphere

On this data, GPT-5.5 is the only reliable choice for long context -- in the 512K-1M token range, its retrieval accuracy (74.0%) is more than 2x that of Claude Opus 4.7 (32.2%).
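The long-context table reports its gap in percentage points for one row and as a ratio for the other. As a quick sketch, the snippet below computes both measures for both rows from the table's figures; the helper name `compare` is an illustrative assumption.

```python
# Retrieval accuracy (%) from the long-context table: (GPT-5.5, Claude Opus 4.7).
RETRIEVAL = {
    "128K": (94.8, 89.1),
    "512K-1M": (74.0, 32.2),
}


def compare(a, b):
    """Return (gap in percentage points, ratio a/b), each rounded to 1 decimal."""
    return round(a - b, 1), round(a / b, 1)


for scenario, (gpt, claude) in RETRIEVAL.items():
    gap_pp, ratio = compare(gpt, claude)
    print(f"{scenario}: {gap_pp}pp gap, {ratio}x ratio")
```

Put on the same footing, the 128K row is a 5.7pp gap (about 1.1x), while the 512K-1M row is a 41.8pp gap (about 2.3x).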

6. Capability Dimension Quick Reference

Dimension	Top Model	Score	Runner-up	Gap
Agentic	GPT-5.5	98.0	Claude Opus 4.8	0.3
Coding	Claude Opus 4.8	98.9	GPT-5.4	11.7
Reasoning	GPT-5.5	96.9	GPT-5.4	1.3
Knowledge	Claude Opus 4.8	99.3	GPT-5.4	0.1
Multimodal	Gemini 3.5 Flash	80.6	Claude Opus 4.8	11.8
Long Context	GPT-5.5	94.8%	Claude Opus 4.7	5.7pp
Math Competition	DeepSeek V4 Pro	Perfect 120	GPT-5.5	--
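The gap column can be rechecked from the scores quoted earlier in the article. The sketch below does this for the four dimensions where both the leader's and the runner-up's scores are stated, then sorts them by gap size; the function name `gaps_by_size` is an illustrative assumption.

```python
# (leader score, runner-up score) for dimensions where this article quotes both.
DIMENSIONS = {
    "Agentic": (98.0, 97.7),
    "Coding": (98.9, 87.2),
    "Knowledge": (99.3, 99.2),
    "Multimodal": (80.6, 68.8),
}


def gaps_by_size(dimensions=DIMENSIONS):
    """Return (dimension, gap) pairs sorted from largest to smallest gap."""
    gaps = {
        name: round(top - second, 1)
        for name, (top, second) in dimensions.items()
    }
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


for name, gap in gaps_by_size():
    print(f"{name}: {gap}")
```

Sorted this way, Multimodal (11.8) and Coding (11.7) are the two dimensions where the choice of model matters most, while Agentic (0.3) and Knowledge (0.1) are effectively ties.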

Next in This Series

Part 3 will cover two model selection dimensions that 99% of people overlook: design capability and cost-effectiveness. Is MiniMax M3 really second only to Claude in design ability? What does a 69x price gap actually mean?

Tomorrow at 7 PM JST.

Sources: BenchLM · BuildFastWithAI · AIMadeTools · DeepSeek V4 Benchmark · CallSphere