Amit

Posted on Jun 6 • Originally published at artificialcuriositylabs.ai

The Pareto Frontier Is the Model Market Map

#ainative #infrastructure #patterns #modeleconomics

TL;DR

A model is on the intelligence-price Pareto frontier when no cheaper model has equal or higher intelligence. It is the set of models that survive a basic economic sanity check.
In the current Artificial Analysis snapshot (2026-06-02), only 13 of 351 priced LLMs sit on the frontier. The rest are dominated on this specific intelligence-price plane.
The frontier is not the same as "best model." It is the map of rational tradeoffs: maximum capability at each price point.
The current frontier has three zones: premium reasoning, mid-price compression, and cheap small-model value.
From 2026-05-25 to 2026-05-31, the frontier compressed from 15 to 13 models. Claude Opus 4.8 entered and displaced GPT-5.5 xhigh and Claude Opus 4.7 at the premium end. The frontier held at 13 models through 2026-06-02.
How to read frontier movement: shifts down and right → capability getting cheaper. Gets thinner → more models being dominated. A lab loses slots while keeping benchmark rank → its premium is harder to defend on general workloads.

Model leaderboards answer the wrong first question.

"Which model is smartest?" matters when the task has no budget constraint. Most production systems have a budget constraint. They also have latency requirements, routing rules, reliability targets, and workload mixes. Once those enter the room, the useful question changes:

Which models are still rational choices after price is included?

That is what the Pareto frontier gives you.

What a Pareto frontier means

A model is on the frontier if it is not dominated by another model.

For this analysis, dominance is simple:

Higher Artificial Analysis Intelligence Index is better.
Lower blended price per 1M tokens is better.
A model is dominated if another model costs no more and scores at least as high, while being strictly cheaper or strictly more intelligent.

The frontier is the set that survives that filter.
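The filter can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline behind the chart: the `Model` type and the `dominates` and `pareto_frontier` helpers are names invented here, the five rows are figures quoted elsewhere in this article, and ties are handled with the usual rule (no worse on both axes, strictly better on at least one).

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    intelligence: float  # Artificial Analysis Intelligence Index (higher is better)
    price: float         # blended $ per 1M tokens (lower is better)


def dominates(a: Model, b: Model) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.price <= b.price and a.intelligence >= b.intelligence
    strictly_better = a.price < b.price or a.intelligence > b.intelligence
    return no_worse and strictly_better


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Keep models that no other model dominates, sorted by descending intelligence."""
    survivors = [m for m in models if not any(dominates(o, m) for o in models)]
    return sorted(survivors, key=lambda m: m.intelligence, reverse=True)


# A five-model subset using figures quoted in this article (2026-06-02 snapshot).
sample = [
    Model("Claude Opus 4.8", 61.4, 10.94),
    Model("GPT-5.5 xhigh", 60.2, 11.25),
    Model("Gemini 3.1 Pro Preview", 57.2, 4.50),
    Model("Kimi K2.6", 53.9, 1.71),
    Model("MiMo-V2.5-Pro", 53.8, 0.54),
]

frontier_names = [m.name for m in pareto_frontier(sample)]
print(frontier_names)
```

In this subset only GPT-5.5 xhigh is filtered out, because Claude Opus 4.8 is both cheaper and higher scoring. The pairwise scan is quadratic, which is fine for a few hundred models.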

This is not a model ranking. It is a market structure. A model can rank high on intelligence and still fall off the frontier if another model buys equal or better capability for less money. A model can have a much lower intelligence score and still belong on the frontier if it sets a new low price point.

That distinction matters because evaluation changes once a workload becomes real. A lab benchmark asks, "How high can this model score?" A production architecture asks, "At this price, is there any reason to call this model instead of something cheaper?"

The frontier makes that second question visible.

Why it matters

The frontier is useful because it separates model prestige from model economics.

Here is the current read from our 2026-06-02 snapshot, built from Artificial Analysis data:

Metric	Current snapshot
Total LLMs in snapshot	528
Priced models plotted	351
Models on frontier	13
Frontier share of priced cohort	3.7%

Figure: Priced LLMs by Artificial Analysis Intelligence Index and blended price per 1M tokens. Orange dots are the intelligence-price Pareto frontier (2026-06-02).

That means roughly 96% of priced models are dominated on this specific intelligence-price plane. That does not mean 96% of models are useless — some win on coding, math, latency, context length, safety profile, local deployment, or modality. But if a model is not on this frontier, the burden of proof shifts to the dimension where it wins.

This is the operational value. The frontier turns a model list into a routing shortlist.

The current frontier

The current frontier has three zones.

1. Premium reasoning

At the top of the curve, Claude Opus 4.8 sets the current maximum intelligence point in the priced cohort:

Rank	Model	Intelligence	Blended $/1M
1	Claude Opus 4.8	61.4	$10.94
2	Gemini 3.1 Pro Preview	57.2	$4.50
3	Qwen3.7 Max	56.6	$3.75
4	Gemini 3.5 Flash (high)	55.3	$3.38

The notable absence: GPT-5.5 xhigh scores 60.2 intelligence at $11.25 — higher price, lower intelligence than Opus 4.8. It sits just below the frontier line on the chart. Once a model is dominated on this plane, the reason to choose it has to come from a different axis.

2. Mid-price compression

The next zone is where the economics get sharper:

Rank	Model	Intelligence	Blended $/1M
5	Kimi K2.6	53.9	$1.71
6	MiMo-V2.5-Pro	53.8	$0.54
7	MiniMax-M2.7	49.6	$0.53
8	MiMo-V2.5	49.0	$0.18

MiMo-V2.5-Pro is only 0.1 intelligence points below Kimi K2.6, but its blended price is about 68% lower. MiMo-V2.5 drops further in capability but lands at $0.175 per 1M blended tokens.
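The comparison above is simple arithmetic, shown here so the percentage can be rechecked against a fresh snapshot. The helper name `pct_cheaper` is made up for this sketch; the four numbers come from the table.

```python
def pct_cheaper(reference_price: float, candidate_price: float) -> float:
    """Percentage by which candidate_price undercuts reference_price."""
    return (reference_price - candidate_price) / reference_price * 100


kimi_price, mimo_pro_price = 1.71, 0.54    # blended $/1M, from the table above
kimi_score, mimo_pro_score = 53.9, 53.8    # Intelligence Index, same table

gap_points = round(kimi_score - mimo_pro_score, 1)
discount = pct_cheaper(kimi_price, mimo_pro_price)
print(f"{gap_points} points for {discount:.0f}% less")
```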

That is cost compression in concrete form. The question is no longer whether cheap models are useful. The question is how many production workloads need the final 6–8 intelligence points at all.

For agent systems, this matters more than it does for a single chat UI. Agent workloads multiply calls. Planning, retrieval, classification, critique, synthesis, tool repair, and final answer generation can all be separate model calls. If only the final synthesis needs the premium model, the frontier gives you candidates for the other layers.
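A rough sketch of what that routing does to per-request cost. The prices are the blended figures from the tables in this article; the per-stage token counts in `STAGE_TOKENS` and the stage-to-model mapping in `ROUTING` are hypothetical, chosen only to make the arithmetic concrete. Blended price assumes a 3:1 input:output mix, so real stage costs will differ with the actual mix.

```python
# Blended $/1M tokens, from the tables in this article.
PRICE_PER_1M = {
    "Claude Opus 4.8": 10.94,
    "MiMo-V2.5-Pro": 0.54,
    "MiMo-V2.5": 0.18,
}

# Hypothetical per-request token volumes for each agent stage (illustrative only).
STAGE_TOKENS = {
    "planning": 4_000,
    "retrieval": 6_000,
    "classification": 1_000,
    "critique": 3_000,
    "synthesis": 5_000,
}

# Hypothetical routing: only synthesis gets the premium model.
ROUTING = {
    "planning": "MiMo-V2.5-Pro",
    "retrieval": "MiMo-V2.5",
    "classification": "MiMo-V2.5",
    "critique": "MiMo-V2.5-Pro",
    "synthesis": "Claude Opus 4.8",
}


def request_cost(routing: dict[str, str]) -> float:
    """Dollar cost of one request given a stage -> model mapping."""
    return sum(
        STAGE_TOKENS[stage] / 1_000_000 * PRICE_PER_1M[model]
        for stage, model in routing.items()
    )


all_premium = request_cost({stage: "Claude Opus 4.8" for stage in STAGE_TOKENS})
routed = request_cost(ROUTING)
print(f"all premium: ${all_premium:.4f}  routed: ${routed:.4f}")
```

Under these made-up volumes the premium synthesis call still accounts for most of the routed cost, which is the point: the cheaper layers become close to a rounding error.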

3. Cheap small-model value

The tail of the frontier is held by small Qwen reasoning models:

Rank	Model	Intelligence	Blended $/1M
10	Qwen3.5 9B (Reasoning)	32.4	$0.11
11	Qwen3.5 4B (Reasoning)	27.1	$0.06
12	Qwen3.5 2B (Reasoning)	16.3	$0.04
13	Qwen3.5 0.8B (Reasoning)	10.5	$0.02

These are not frontier models in the colloquial sense. They are frontier points in the economic sense. Each creates a cheaper capability band where no cheaper model in the priced cohort has equal or higher intelligence. Classification, extraction, weak signal filtering, formatting, and routing often need consistency more than maximum reasoning depth. Small models earn their spot by defining the lower envelope of cost.
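One way to use the three zones is as a lookup: given a minimum score a task needs, pick the cheapest frontier model that clears it. The sketch below assumes the twelve frontier rows shown in this article's tables; the function name `cheapest_at_least` and the score floors in the usage lines are hypothetical, and a real floor would come from task-specific testing.

```python
from typing import Optional

# (model, Intelligence Index, blended $/1M) for the frontier rows shown in this article.
FRONTIER = [
    ("Claude Opus 4.8", 61.4, 10.94),
    ("Gemini 3.1 Pro Preview", 57.2, 4.50),
    ("Qwen3.7 Max", 56.6, 3.75),
    ("Gemini 3.5 Flash (high)", 55.3, 3.38),
    ("Kimi K2.6", 53.9, 1.71),
    ("MiMo-V2.5-Pro", 53.8, 0.54),
    ("MiniMax-M2.7", 49.6, 0.53),
    ("MiMo-V2.5", 49.0, 0.18),
    ("Qwen3.5 9B (Reasoning)", 32.4, 0.11),
    ("Qwen3.5 4B (Reasoning)", 27.1, 0.06),
    ("Qwen3.5 2B (Reasoning)", 16.3, 0.04),
    ("Qwen3.5 0.8B (Reasoning)", 10.5, 0.02),
]


def cheapest_at_least(min_score: float) -> Optional[str]:
    """Cheapest listed frontier model whose score meets the floor, or None."""
    eligible = [(price, name) for name, score, price in FRONTIER if score >= min_score]
    return min(eligible)[1] if eligible else None


# Hypothetical score floors for three kinds of work.
print(cheapest_at_least(50))   # a reasoning-heavy step
print(cheapest_at_least(25))   # a classification or extraction step
print(cheapest_at_least(70))   # above anything in the snapshot
```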

What the frontier looked like eight days ago

The May 25 snapshot had 15 models on the frontier. Here are the premium and mid-price rows:

Model	Intelligence	Blended $/1M
GPT-5.5 xhigh	60.2	$11.25
Claude Opus 4.7	57.3	$10.94
Gemini 3.1 Pro Preview	57.2	$4.50
Qwen3.7 Max	56.6	$3.75
Gemini 3.5 Flash (high)	55.3	$3.38
Kimi K2.6	53.9	$1.71
MiMo-V2.5-Pro	53.8	$1.50
DeepSeek V4 Pro	51.5	$0.54
DeepSeek V4 Flash	46.5	$0.18

By May 31 the frontier had thinned to 13, with a new model at the top and new entries in the mid-price rows. That change is the interesting part.

What changed

Two things happened between May 25 and May 31:

Claude Opus 4.8 entered. At 61.4 intelligence and $10.94, it immediately dominated both models that had been holding the premium end. GPT-5.5 xhigh (60.2 / $11.25) is more expensive and less intelligent. Claude Opus 4.7 (57.3 / $10.94) is the same price but less intelligent. Both fell off.

MiMo-V2.5-Pro had a price cut. It moved from $1.50 to $0.54 — a 64% drop — with no change in intelligence score. At that new price it dominated DeepSeek V4 Pro (51.5 / $0.54) on intelligence and matched it on price. DeepSeek V4 Flash (46.5 / $0.18) lost its position to MiMo-V2.5, which entered the frontier at 49.0 / $0.175 in the same window.
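These displacements can be replayed as a small experiment. The sketch below starts from the nine May 25 rows listed above (not the full 15-model frontier), applies the three changes described in the text, and recomputes which rows survive. The `frontier` helper is an illustrative name, and prices are compared at the values quoted in this article, so the DeepSeek V4 Pro result depends on the two $0.54 figures being an exact tie.

```python
def frontier(models: dict[str, tuple[float, float]]) -> set[str]:
    """Names of models not dominated on (intelligence up, price down)."""
    def dominated(name: str) -> bool:
        score, price = models[name]
        return any(
            s >= score and p <= price and (s > score or p < price)
            for other, (s, p) in models.items()
            if other != name
        )
    return {name for name in models if not dominated(name)}


# (Intelligence Index, blended $/1M) for the May 25 rows listed above.
may_25 = {
    "GPT-5.5 xhigh": (60.2, 11.25),
    "Claude Opus 4.7": (57.3, 10.94),
    "Gemini 3.1 Pro Preview": (57.2, 4.50),
    "Qwen3.7 Max": (56.6, 3.75),
    "Gemini 3.5 Flash (high)": (55.3, 3.38),
    "Kimi K2.6": (53.9, 1.71),
    "MiMo-V2.5-Pro": (53.8, 1.50),
    "DeepSeek V4 Pro": (51.5, 0.54),
    "DeepSeek V4 Flash": (46.5, 0.18),
}

# Apply the changes described in the text.
after = dict(may_25)
after["Claude Opus 4.8"] = (61.4, 10.94)   # new entrant
after["MiMo-V2.5-Pro"] = (53.8, 0.54)      # price cut from $1.50
after["MiMo-V2.5"] = (49.0, 0.175)         # new entrant

before_set, after_set = frontier(may_25), frontier(after)
dropped = sorted(before_set - after_set)
entered = sorted(after_set - before_set)
print("dropped:", dropped)
print("entered:", entered)
```

Run against these rows, the four models the text names fall off and the two new entrants appear.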

The summary:

	May 25	Jun 2
Frontier models	15	13
Premium anchor	GPT-5.5 xhigh	Claude Opus 4.8
OpenAI frontier slots	2	0
Anthropic frontier slots	1	1
Xiaomi frontier slots	1	3
DeepSeek frontier slots	2	0

The lab-level read is direct:

Lab	Jun 2 frontier slots
Alibaba	5
Xiaomi	3
Google	2
Anthropic	1
Kimi	1
MiniMax	1

OpenAI has several top intelligence and coding models in the same snapshot but zero intelligence-price frontier slots. That is the difference between capability leadership and economic frontier leadership.

What the frontier does not tell you

The frontier is a filter, not an oracle.

It does not tell you which model writes the best code, follows your team style guide, handles long context, refuses unsafe requests, or produces the best structured output. It does not account for caching, batch pricing, provider reliability, private deployment, data residency, or multimodal capability. It uses Artificial Analysis Intelligence Index and blended price, so it inherits the strengths and limits of that methodology.

The most natural third dimension to add is latency and throughput: tokens per second, time to first token, p95 response time. Practitioners who apply it as a requirement find the shortlist thins further. But latency is not a static number the way published price is. It varies by provider, region, time of day, and your own request shape and concurrency. That means it cannot be read off a chart; it has to be measured against your actual traffic pattern. The frontier gives you the shortlist; latency testing on that shortlist gives you the final answer.
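A minimal sketch of that measurement step, using only the Python standard library. The `p95` and `measure` helpers are illustrative names, and `fake_model_call` is a stand-in that just sleeps; in practice it would be replaced by a real request shaped like your own traffic, with far more runs and realistic concurrency.

```python
import statistics
import time
from typing import Callable


def p95(samples: list[float]) -> float:
    """95th percentile of the samples (needs at least two data points)."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(samples, n=20)[-1]


def measure(call: Callable[[], object], runs: int = 50) -> list[float]:
    """Time `call` `runs` times and return the latencies in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return samples


def fake_model_call() -> None:
    """Stand-in for a real request; replace with your own client call."""
    time.sleep(0.001)


latencies = measure(fake_model_call, runs=20)
print(f"p95: {p95(latencies) * 1000:.1f} ms")
```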

That is why the frontier belongs at the front of evaluation, not at the end.

Start here. Remove dominated options. Then run task-specific tests on what's left.

So what

The model market is too crowded for one leaderboard.

The Pareto frontier is the right first map because it forces the question leaderboards skip: "Is a cheaper model already good enough?" It also gives you a consistent way to read market movement:

Frontier shifts down and right — capability is getting cheaper at a given intelligence band.
Frontier gets thinner — more models are being dominated; the rational shortlist is shrinking.
A lab loses frontier slots while retaining benchmark rank — its premium is harder to defend on general workloads. OpenAI's current position is the clearest example of this.

The open thread: the next useful frontier is not two-dimensional. Production routing needs intelligence, price, latency, context length, coding score, reliability, and maybe modality in the same view. The hard part is not drawing that chart. The hard part is deciding which dimensions are universal enough to compare across workloads, and which ones only matter inside a specific application. That is why model evaluation is not a one-time exercise — it is the ongoing work of keeping your routing decisions honest.

Data: Artificial Analysis Intelligence Index v4.0.4, blended price 3:1 input:output. Snapshots: 2026-05-25, 2026-05-31, 2026-06-02.
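For readers recomputing the price axis, here is a sketch of a 3:1 blend, assuming "3:1 input:output" means a weighted average with three input tokens for every output token. The function name and the two list prices in the example are hypothetical, chosen only to show the arithmetic.

```python
def blended_price(input_per_1m: float, output_per_1m: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average of input and output prices, in $ per 1M tokens."""
    total = input_weight + output_weight
    return (input_weight * input_per_1m + output_weight * output_per_1m) / total


# Hypothetical list prices, chosen only to illustrate the arithmetic.
example = blended_price(input_per_1m=2.00, output_per_1m=12.00)
print(f"${example:.2f} per 1M blended tokens")
```

A workload that is more output-heavy than 3:1 will see a higher effective price than the blended figure, which is one reason to recompute the axis with your own weights before trusting the shortlist.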
