Remember when picking an AI model meant choosing between GPT-4 and Claude? Yeah, me neither. That was apparently a lifetime ago — or about 18 months, which in AI time is roughly the same thing.
As of early 2026, the Artificial Analysis leaderboard tracks 282 models. 182 of them are open weights. 142 can reason. The market went from "a few options" to "decision paralysis" faster than you can say "mixture of experts."
I've been deep in the benchmarks, the pricing pages, and the API docs so you don't have to. Here's the actual state of play — who's winning, who's surprisingly good, and where your money should go depending on what you're building.
The Frontier: Where Intelligence Gets Expensive
Let's start at the top. The highest score on the Artificial Analysis Intelligence Index currently stands at 57, and two models share that crown:
| Model | Intelligence Index | Type |
|---|---|---|
| Gemini 3.1 Pro Preview | 57 | Proprietary |
| GPT-5.4 xhigh | 57 | Proprietary |
| GPT-5.3 Codex xhigh | 54 | Proprietary |
| Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | 53 | Proprietary |
| Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | 52 | Proprietary |
Gemini 3.1 Pro Preview
Google finally stopped fumbling the bag. Gemini 3.1 Pro ties for the top Intelligence Index score at 57, and it does so with Google's characteristic strength in multimodal tasks. The context window situation remains generous, and the model handles long-document analysis with an ease that still feels slightly unfair.
The "Preview" tag is doing some heavy lifting here — this isn't even the final version. If you're building anything that touches vision, audio, or needs to process entire codebases in one shot, Gemini 3.1 Pro is the model to beat right now.
GPT-5.4 xhigh
OpenAI's naming scheme has gotten... creative. The "xhigh" compute tier is their way of saying "we'll burn extra inference compute if you pay for it," and honestly? It works. GPT-5.4 matches Gemini at the top of the Intelligence Index.
The Codex variant (GPT-5.3 Codex xhigh, Intelligence Index 54) is purpose-built for code generation and is the go-to if you're doing heavy software engineering tasks. OpenAI has clearly learned that specialization matters — one model doesn't have to win everything if you can route to the right variant.
Claude Opus 4.6 & Sonnet 4.6
Anthropic's Adaptive Reasoning models land at 53 and 52 respectively. The "Max Effort" mode is key here — these models dynamically adjust how much reasoning they apply based on task complexity. Simple question? Quick answer. Complex multi-step problem? They'll think longer.
In practice, Claude's models remain the best at nuanced writing, careful instruction-following, and tasks where you need the model to not hallucinate. The Intelligence Index gap between Claude and the GPT/Gemini leaders is small enough that for many real-world tasks, you won't notice it. Where you will notice Claude is in the vibes — the outputs tend to be more natural, more carefully hedged when uncertainty exists, and less prone to confident nonsense.
The Honest Take on Frontier Models
Here's the thing: at Intelligence Index scores of 52-57, these models are remarkably close. The differences show up in specific benchmarks and edge cases more than in day-to-day usage. If you're picking a frontier model for a production application, your decision should be driven by:
- Pricing (they vary significantly at scale)
- Latency requirements (some are faster than others)
- Specific task fit (code vs. writing vs. analysis)
- API ecosystem (tooling, fine-tuning options, reliability)
Not by who has the marginally higher benchmark score.
The Open-Weights Revolution: This Is the Real Story
If 2024 was the year open-source AI proved it was viable, 2026 is the year it became dominant by volume. 182 of the 282 tracked models are open weights. And the top open models are closing in on frontier proprietary performance at a pace that should make every API pricing team nervous.
GLM-5 Reasoning — Intelligence Index: 50
Zhipu AI's GLM-5 Reasoning model sits at an Intelligence Index of 50, which puts it just 7 points behind the absolute best proprietary models. Let that sink in. An open-weights model is within striking distance of GPT-5.4.
This is a reasoning model, so it does the "think before answering" thing, and it does it well. For teams that need to self-host a genuinely powerful model — whether for data privacy, latency, or cost reasons — GLM-5 is the new default recommendation.
Kimi K2.5 Reasoning — Intelligence Index: 47
Moonshot AI's Kimi K2.5 continues the trend of Chinese AI labs producing exceptional open-weights models. At 47 on the Intelligence Index, it's competitive with where frontier proprietary models were less than a year ago.
Qwen3.5 397B A17B Reasoning — Intelligence Index: 45
Alibaba's Qwen family deserves its own section, honestly. The Qwen3.5 397B model uses a mixture-of-experts architecture (397B total parameters, 17B active), which means you get big-model intelligence with smaller-model inference costs. At Intelligence Index 45, it's a serious contender for any reasoning-heavy workload.
But the Qwen ecosystem is broader than just the flagship:
- QwQ-32B: Apache 2.0 licensed, scores 0.78 on AIME, 0.976 on HumanEval, and 0.957 on MATH-500. For a 32B parameter model, these numbers are borderline absurd. This is your go-to if you need a compact reasoning model you can actually run on reasonable hardware.
- Qwen3 235B A22B: The non-reasoning variant still hits GPQA 0.613 and MATH-500 0.902, with an agentic index of 19.23. Solid all-rounder.
- Qwen3 Max Preview: Proprietary variant hitting Intelligence Index 26.08, GPQA 0.764, and AIME25 0.75 — proof that Alibaba is playing both sides of the open/closed divide.
- Qwen3 4B through 235B: The full range means you can pick the exact size that fits your hardware and budget. That matters enormously for production deployments.
Seed-OSS-36B-Instruct: The Dark Horse
ByteDance's Seed-OSS-36B deserves special attention. At just 36 billion parameters, it posts:
- AIME25: 0.847 (higher than many models 10x its size)
- GPQA: 0.726
- LiveCodeBench: 0.765
- Agentic Index: 27.73
That agentic index number is particularly interesting — it means this model is exceptionally good at multi-step tool use and autonomous task completion. If you're building AI agents, Seed-OSS-36B should be on your shortlist. The parameter efficiency here is remarkable.
What Open Weights Means in Practice
Let's be concrete about why this matters:
- Self-hosting: Run these models on your own infrastructure. Your data never leaves your servers.
- Fine-tuning: Take QwQ-32B, fine-tune it on your domain data, and you've got a specialist that might outperform GPT-5 on your specific task.
- Cost control: No per-token API fees. Just compute costs, which you can optimize and predict.
- No vendor lock-in: Switch models, run multiple models, A/B test freely.
The gap between open and proprietary is now small enough that the operational advantages of open weights often outweigh the raw intelligence gap.
Speed Kings: When Latency Is the Feature
Sometimes you don't need the smartest model. You need the fastest one.
| Model | Speed (tokens/sec) |
|---|---|
| Mercury 2 | 727.2 |
| Granite 3.3 8B | 405.8 |
| Gemini 2.5 Flash-Lite Preview | 340.6 |
Mercury 2: 727 Tokens Per Second
Inception Labs' Mercury 2 is in a different league. At 727.2 tokens per second, it's nearly twice as fast as the next fastest model. This is purpose-built for latency-sensitive applications — real-time chat, voice assistants, inline code completion, anything where the user is watching a cursor blink.
The intelligence isn't frontier-level, but that's not the point. For use cases where "good enough and instantaneous" beats "brilliant but slow," Mercury 2 is the answer.
Granite 3.3 8B
IBM's Granite 3.3 8B at 405.8 t/s is the speed pick for people who want an open model they can run locally. At 8B parameters, this fits on consumer GPUs and still moves fast enough for real-time applications.
Gemini 2.5 Flash-Lite Preview
Google's lightweight offering at 340.6 t/s gives you the Gemini ecosystem advantages (multimodal, good context handling) with speed that works for interactive applications.
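To make those throughput numbers concrete, here's a quick back-of-the-envelope sketch. The tokens-per-second figures come from the table above; the 500-token response length is my own assumption about a typical chat reply:

```python
# Rough time-to-complete for a streamed response at each model's
# measured decode throughput (tokens/sec from the table above).
SPEEDS_TPS = {
    "Mercury 2": 727.2,
    "Granite 3.3 8B": 405.8,
    "Gemini 2.5 Flash-Lite Preview": 340.6,
}

def completion_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Seconds to stream num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_sec

# Example: a 500-token answer, a plausible length for a chat reply.
for model, tps in SPEEDS_TPS.items():
    print(f"{model}: {completion_seconds(500, tps):.2f}s")
```

At these speeds the user-visible difference is under a second per reply, which is exactly why this tier matters for interactive products and barely matters for batch jobs.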
Best Value: Maximum Intelligence Per Dollar
This is where 2026 gets really interesting. The floor on model pricing has basically fallen through:
| Model | Cost (per 1M tokens, blended) |
|---|---|
| Gemma 3n E4B Instruct | $0.03 |
| LFM2 24B A2B | $0.05 |
| Nova Micro | $0.06 |
Three cents per million tokens. Google's Gemma 3n E4B is practically free. For batch processing, classification, extraction, summarization of non-critical content — you can process absurd volumes of text for the cost of a coffee.
LFM2 24B (Liquid AI's model, 24B total with 2B active parameters) at $0.05 per million tokens is the efficiency play. Mixture-of-experts architectures have made large models cheap to run, and LFM2 is the current poster child.
Amazon's Nova Micro at $0.06 rounds out the budget tier. If you're already in the AWS ecosystem, this is the path of least resistance for cost-sensitive workloads.
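A quick sanity check on how far those prices actually go. The per-million-token prices are from the table above; the corpus size is an arbitrary assumption for illustration:

```python
# What "three cents per million tokens" buys: cost to run a large
# corpus through each budget-tier model (prices from the table above).
PRICE_PER_M = {"Gemma 3n E4B": 0.03, "LFM2 24B A2B": 0.05, "Nova Micro": 0.06}

corpus_tokens_m = 500  # 500M tokens -- roughly a million short documents (assumption)
for model, price in PRICE_PER_M.items():
    print(f"{model}: ${corpus_tokens_m * price:.2f} for {corpus_tokens_m}M tokens")
```

Half a billion tokens through Gemma 3n costs about as much as lunch. That's the scale of batch work this tier unlocks.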
The Value Strategy
Smart teams in 2026 aren't picking one model. They're routing:
- Tier 1 (frontier, $$$): GPT-5.4 or Gemini 3.1 Pro for complex reasoning, important customer-facing generation, and tasks where quality directly impacts revenue.
- Tier 2 (open-weights, $$): QwQ-32B or Seed-OSS-36B self-hosted for the bulk of reasoning tasks, agentic workflows, and anything touching sensitive data.
- Tier 3 (budget, ¢): Gemma 3n or Nova Micro for classification, extraction, routing, and high-volume low-stakes tasks.
This tiered approach can cut your AI infrastructure costs by 80%+ compared to routing everything through a frontier model.
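Here's a rough sketch of the math behind that 80%+ claim. The $0.03 budget price comes from the table above; the frontier cost, self-hosted cost, and traffic split are illustrative assumptions, not published numbers:

```python
# Back-of-the-envelope: all-frontier routing vs. a tiered split.
# Only the $0.03/1M budget price is from the article's table; the
# frontier rate, self-hosted rate, and traffic split are assumptions.
FRONTIER_COST = 10.00    # $ per 1M tokens (assumed)
SELF_HOSTED_COST = 0.50  # $ per 1M tokens, amortized compute (assumed)
BUDGET_COST = 0.03       # $ per 1M tokens (Gemma 3n, from the table)

def blended_cost(million_tokens: float, split: dict) -> float:
    """Total cost for a traffic split across the three tiers."""
    rates = {"frontier": FRONTIER_COST,
             "self_hosted": SELF_HOSTED_COST,
             "budget": BUDGET_COST}
    return sum(million_tokens * share * rates[tier]
               for tier, share in split.items())

volume = 1_000  # 1B tokens/month, expressed in millions
baseline = blended_cost(volume, {"frontier": 1.0})
tiered = blended_cost(volume, {"frontier": 0.05,
                               "self_hosted": 0.25,
                               "budget": 0.70})
print(f"all-frontier: ${baseline:,.0f}  tiered: ${tiered:,.0f}  "
      f"savings: {1 - tiered / baseline:.0%}")
```

Under these assumptions the tiered split comes in around 94% cheaper. The exact figure depends entirely on your traffic mix, but the shape of the result holds as long as most of your tokens are routine.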
Head-to-Head: Picking the Right Model for Your Task
Let me be opinionated here, because the benchmarks only tell part of the story.
Coding
Best proprietary: GPT-5.3 Codex xhigh. Purpose-built, and it shows.
Best open-weights: Seed-OSS-36B (LiveCodeBench 0.765) or QwQ-32B (HumanEval 0.976). The QwQ HumanEval score is almost perfect — for standard coding tasks, it's absurdly good for a 32B model.
Best budget: Qwen3 series at the appropriate size for your hardware.
Math & Reasoning
Best proprietary: Gemini 3.1 Pro or GPT-5.4 (tied at Intelligence Index 57).
Best open-weights: GLM-5 Reasoning (Intelligence Index 50). QwQ-32B is the efficiency pick with MATH-500 at 0.957.
Best value: QwQ-32B. Seriously. A 32B model hitting 0.957 on MATH-500 with an Apache 2.0 license is almost unfair.
Writing & Content
Best overall: Claude Opus 4.6. Still the king of natural, nuanced prose. The Adaptive Reasoning feature means it doesn't overthink simple writing tasks.
Best open-weights: GLM-5 or Qwen3.5 397B, depending on your language and style preferences.
Agentic Workloads
Best proprietary: This is model-dependent, but Claude Sonnet 4.6 and GPT-5.4 both handle tool use well.
Best open-weights: Seed-OSS-36B (agentic index 27.73). ByteDance specifically optimized for this, and it shows.
Competition-Level Problem Solving
Best open-weights: Seed-OSS-36B with AIME25 at 0.847. That's a competition-math benchmark, and this 36B model is solving problems that would stump most CS graduates.
The Trends That Actually Matter
1. Reasoning Models Won
142 of 282 models on the leaderboard are reasoning models. The "think step by step" approach went from research novelty to industry standard. If your model can't reason, it's already a generation behind.
2. Open Weights Caught Up (Mostly)
The Intelligence Index gap between the best open model (GLM-5 at 50) and the best proprietary model (57) is just 7 points. A year ago, that gap was closer to 20. The curve hasn't flattened yet.
3. Mixture of Experts Changed the Economics
Models like Qwen3.5 (397B total, 17B active) and LFM2 (24B total, 2B active) mean you can have big-brain performance without big-brain costs. This architecture shift is why the budget tier exists at all.
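The economics are easy to see from the numbers above: per-token compute scales with active parameters, not total parameters. A quick calculation using the figures already cited:

```python
# Active-parameter fraction for the MoE models mentioned above --
# a rough proxy for per-token inference cost relative to a dense
# model of the same total size (billions of parameters).
MOE_MODELS = {
    "Qwen3.5 397B A17B": (397, 17),
    "Qwen3 235B A22B":   (235, 22),
    "LFM2 24B A2B":      (24, 2),
}

for name, (total_b, active_b) in MOE_MODELS.items():
    frac = active_b / total_b
    print(f"{name}: {active_b}B of {total_b}B active ({frac:.0%} per token)")
```

Qwen3.5 activates roughly 4% of its parameters per token. That's the whole trick: 397B-scale capability with inference costs closer to a 17B dense model.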
4. China Is a Superpower in Open AI
GLM-5, Kimi K2.5, the entire Qwen family, Seed-OSS-36B, DeepSeek — Chinese labs are producing some of the most impressive open-weights models in the world. If you're not paying attention to what's coming out of Zhipu AI, Alibaba, Moonshot AI, and ByteDance, you're missing half the landscape.
5. Speed and Cost Matter More Than Benchmarks
Mercury 2 at 727 t/s and Gemma 3n at $0.03/1M tokens aren't the smartest models. But they're enabling use cases that would be economically impossible with frontier models. The "good enough, fast enough, cheap enough" tier is where most production tokens actually flow.
6. The Model Router Is the New Product
The winning strategy isn't picking the best model. It's building the system that routes each request to the right model. Hard problem? GPT-5.4. Simple extraction? Gemma 3n. Code generation? QwQ-32B. Latency-critical? Mercury 2. The orchestration layer is where the real engineering challenge lives now.
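A toy version of such a router, to make the idea concrete. Real systems use a trained classifier or a cheap scoring model to estimate difficulty; the keyword heuristics, thresholds, and model name strings here are purely illustrative:

```python
# A toy request router sketching the tiered strategy described above.
# The heuristics are illustrative only -- production routers score
# difficulty with a classifier or a cheap LLM call, not keywords.
def route(prompt: str, latency_critical: bool = False) -> str:
    p = prompt.lower()
    if latency_critical:
        return "mercury-2"                 # speed tier
    if any(k in p for k in ("implement", "refactor", "def ", "class ")):
        return "qwq-32b"                   # code -> open-weights coder
    if any(k in p for k in ("prove", "step by step", "why")) or len(p) > 2000:
        return "gpt-5.4"                   # hard reasoning -> frontier
    return "gemma-3n"                      # everything else -> budget tier

assert route("Classify this ticket as bug or feature") == "gemma-3n"
assert route("Implement a rate limiter in Go") == "qwq-32b"
assert route("Prove that this scheduler is deadlock-free") == "gpt-5.4"
assert route("hi", latency_critical=True) == "mercury-2"
```

The hard part isn't this function; it's the evaluation loop that tells you when your routing rules are sending frontier-grade problems to the budget tier.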
What I'd Actually Pick (March 2026)
If you're starting a new project today and asked me to pick:
- "I just need something good": Claude Sonnet 4.6 via API. Great balance of intelligence, speed, and not-insane pricing.
- "I need the absolute best": Gemini 3.1 Pro or GPT-5.4, depending on your task.
- "I need to self-host": QwQ-32B if you have modest hardware, GLM-5 Reasoning if you have serious GPUs.
- "I'm building agents": Seed-OSS-36B. That agentic index doesn't lie.
- "I'm processing millions of documents": Gemma 3n at $0.03/1M tokens. Don't overthink it.
- "I need real-time responses": Mercury 2. Nothing else comes close on speed.
The Bottom Line
282 models. A year ago, keeping track of the AI landscape was a hobby. Now it's a full-time job.
The good news: competition is driving prices down, quality up, and open-weights models into genuine contention with the best proprietary offerings. The bad news: picking the right model for your use case now requires actual analysis instead of just defaulting to "the OpenAI one."
But that's a good problem to have.
Data sourced from the Artificial Analysis leaderboard as of March 2026. Benchmark numbers, Intelligence Index scores, and pricing reflect what was available at time of writing. This space moves fast — if you're reading this more than a month after publication, some of these numbers have probably already changed.
What models are you running in production? I'm genuinely curious — drop a comment below.