Owen

Posted on May 19 • Originally published at ofox.ai

AI Model Rankings May 2026: Top LLMs Ranked by Coding, Reasoning & Cost

#ai #llm #modelcomparison #coding

AI Model Rankings May 2026: Top LLMs Ranked by Coding, Reasoning & Cost

TL;DR

As of May 2026, GPT-5.5 leads SWE-bench Verified coding at 88.7%, Claude Opus 4.7 and Gemini 3.1 Pro compete for top reasoning on GPQA Diamond (94.2% vs 94.3%), and DeepSeek V4 Pro dominates cost-quality at $0.43/$0.87 per million tokens with an 80.6% SWE score. The gap between the most expensive flagship and cheapest open-weight model is now four task-points and 30x in price.

How this ranking was built

Rankings pulled from three sources without mixing: SWE-bench Verified (coding agentic tasks), GPQA Diamond (PhD-level science MCQ for reasoning), and published per-million-token API pricing as of writing. Every score was reconfirmed against the public leaderboard the week of May 19, 2026.

Models included: GPT-5.5, GPT-5.3-Codex, Claude Opus 4.7, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 3.1 Pro, Gemini 3.1 Flash, DeepSeek V4 Pro Max, Kimi K2.6, Qwen 3.6 Plus, Grok 4.1, Llama 4 Maverick.

Models excluded: Anything not generally API-available or still labeled "preview" without public pricing.

Coding rankings (SWE-bench Verified)

GPT-5.5 leads coding at 88.7%, but by less than one task-point over Claude Opus 4.7. OpenAI's April 23 release was the only major shift since April 2026.

Rank	Model	SWE-bench Verified	Notes
1	GPT-5.5	88.7%	Released 2026-04-23, $5/$30
2	Claude Opus 4.7	87.6%	$5/$25, new 2026 tokenizer adds ~35% tokens vs 4.6
3	GPT-5.3-Codex	85.0%	Specialized for code, lower price than 5.5
4	Claude Opus 4.5	80.9%	Older flagship, still useful for cost-sensitive coding
5	Claude Opus 4.6	80.8%	Pre-4.7 flagship, prompt-caching mature
6	DeepSeek V4 Pro Max	80.6%	First open-weight in this tier
7	Gemini 3.1 Pro	80.6%	Strong on long-context refactors
8	Kimi K2.6	~72% (Tier A coding bench)	384 routed experts, strong Chinese-domain code
9	Qwen 3.6 Plus	~71%	Tier B coding bench, open-weight

The top four sit within ~8 points and any will ship working code on standard tasks. Real-world differentiation lives in tasks SWE-bench doesn't capture — multi-file refactors, long debugging loops, and graceful recovery from bad initial diffs.

Tokenizer note: Opus 4.7 ships a new 2026 tokenizer producing up to 35% more tokens for the same English text vs 4.6. The headline $5/$25 rate is closer to $6.75/$33.75 in effective spend for English-heavy workloads.

Reasoning rankings (GPQA Diamond)

The top reasoning leaderboard is a statistical tie — the top four span 0.5 percentage points on a 198-question benchmark.

Rank	Model	GPQA Diamond	Caveat
1	Gemini 3.1 Pro	94.3%	Strongest on multi-step science
2	Claude Opus 4.7	94.2%	Best on prompts needing self-correction
3	GPT-5.5 (xhigh effort)	93.5%	Higher with extended thinking budget
4	GPT-5.5 (high effort)	93.2%	Cheaper effort tier, slight quality drop
5	Claude Opus 4.6	~92%	Still excellent, half the latency of Mythos
6	DeepSeek V4 Pro Max	~89%	Best open-weight reasoning
7	Grok 4.1	~87%	Improved sharply from 4.0

The effort-budget knob now matters more than the model name. "Which model is smartest" has collapsed into "how much thinking budget can I afford."

Cost rankings (price per million tokens)

DeepSeek V4 Pro is the cost-quality king — roughly 1/12th the price of GPT-5.5 for an 80.6% SWE score.

Model	Input	Output	Effective cost-per-quality (output $ ÷ SWE score)
DeepSeek V4 Pro	$0.43	$0.87	$0.0108/point
Gemini 3.1 Flash	$0.15	$0.60	~$0.0086/point (lower score)
Gemini 3.1 Pro	$2.00	$12.00	$0.149/point
Claude Opus 4.7	$5.00	$25.00	$0.285/point
GPT-5.5	$5.00	$30.00	$0.338/point
Kimi K2.6	~$0.60	~$2.50	~$0.035/point
Qwen 3.6 Plus	~$0.50	~$2.00	~$0.028/point

Important considerations:

DeepSeek's 75% promo ends May 31, 2026. Post-promo pricing returns to roughly $1.72/$3.48 — still cheap, but no longer exceptional value.
Prompt caching changes everything. Claude's 5-minute cache discount is 90% on cached input. If your workload re-sends the same system prompt or context, Opus 4.7 can be cheaper than its rate card implies.
Effective cost ≠ headline cost. The tokenizer change on Opus 4.7, batch-API discounts, and provider-specific deals all distort the simple rate-card comparison.

Cross-axis: pick by use case

Production coding agent with a tight budget: DeepSeek V4 Pro for heavy lifting, GPT-5.5 only when hitting hard tasks DeepSeek can't close. The hybrid pattern is documented in the Claude Code hybrid routing pattern — same routing logic works with DeepSeek as the cheap tier.

Research / hard reasoning: Claude Opus 4.7 or Gemini 3.1 Pro — pick based on latency tolerance. Opus is slower; Gemini Pro is faster and slightly cheaper. Within 0.1 GPQA points, they're interchangeable.

Long-context refactors (>200K tokens): Gemini 3.1 Pro wins. It's the only model in this list with a 1M-token native context that doesn't degrade badly past 500K.

Cheap multi-step agent loops: Gemini 3.1 Flash-Lite or DeepSeek V4 Flash for the inner loop, with one Opus or GPT-5.5 call at the end for synthesis.

Bring-your-own infra, no API spend: Qwen 3.6 27B locally if you have a 24GB GPU, DeepSeek V4 Pro via API for everything exceeding local capacity.

Just give me one default: Claude Opus 4.7 if budget isn't tight, DeepSeek V4 Pro if it is. The middle (Gemini 3.1 Pro at $2/$12) is good but ends up being a "neither one nor the other" pick for most teams.

How to access these models without nine API keys

Every model in this ranking is callable through ofox.ai's unified gateway using one OpenAI-compatible endpoint and one key. The platform hosts Claude, GPT, Gemini, DeepSeek, Kimi, Qwen, and Llama via OpenAI-compat — you change the model string, not the SDK.

What's about to change

Three things will reshape the next ranking:

Anthropic's Mythos Preview. Not generally available yet but already topping GPQA. When it lands as a real product (June-July 2026), the reasoning ranking will shift.
DeepSeek's promo expires May 31. Post-promo pricing is still excellent but the "30x cheaper than GPT-5.5" gap narrows to "10x cheaper."
GPT-5.5-Codex. OpenAI is expected to ship a Codex-specialized 5.5 variant in Q3, which would likely take the SWE-bench crown.

The leaderboard moves roughly once a month. When locking in a model for the next 90 days, build the routing layer first and the model choice second. The fact that swapping flagships used to be a quarter-long project and is now a one-line config change is the actual story of 2026 in LLM infrastructure.

Originally published on ofox.ai/blog.

DEV Community

AI Model Rankings May 2026: Top LLMs Ranked by Coding, Reasoning & Cost

AI Model Rankings May 2026: Top LLMs Ranked by Coding, Reasoning & Cost

TL;DR

How this ranking was built

Coding rankings (SWE-bench Verified)

Reasoning rankings (GPQA Diamond)

Cost rankings (price per million tokens)

Cross-axis: pick by use case

How to access these models without nine API keys

What's about to change

Top comments (0)