Models Keep Getting Stronger, but 'Strongest' Has No Single Answer

#ai #models #evaluation #reinforcementlearning

June is shaping up to be another packed month for model releases. Opus 4.8 dropped at the end of May, MiniMax's M3 landed a couple of days ago, GPT 5.6 is supposedly around the corner, and some are already waiting on DeepSeek's next drop. It looks like we'll see a new model every few days. Pretty lively.

But for the past two days, what I've actually been thinking about is a friend's experience with models.

He started out using models to build small things—writing web pages, making little tools and plugins. He was pretty excited at first, telling me how amazing today's models are. He more or less picked a decent domestic model at random and found it more than enough. He couldn't even imagine where else models could get stronger; they already worked so well.

Then his work got more complex. He moved from small tools to trying to build an auto-editing tool, video cropping and the like. That's when the problems started.

The model told him it was done. He said okay, tried it, no dice. A moment later it said this time it was really done. He tried again, still no good. Back and forth for several rounds.

He couldn't tell anymore: part of him felt he was getting better at working with the model, that he needed to give more guidance and try different approaches; part of him started wondering if the model itself just wasn't cutting it, whether he should switch to something like Claude Opus.

This pattern is all too common. Behind it is something a lot of people haven't caught on to yet: model strength is forking in different directions. A single score no longer tells the whole story.

Scores Bunch at the Ceiling, Real-World Feel Diverges

First, the weird state of things: on mainstream benchmarks, top models score terrifyingly high, all squeezed into a narrow band.

Take GPQA. It's a graduate-level benchmark, hard enough that the PhD-level domain experts recruited to take it scored only around 65%. Yet top models now routinely hit 92% to 94%, bunched together. Older benchmarks like MMLU were cleared long ago, with pretty much every top model scoring over 90%. Hard problems aren't hard anymore; scores have hit the ceiling, and you can't tell models apart.

So benchmark makers have to keep inventing harder tests. The new Humanity's Last Exam states it plainly: it was created because models had exceeded 90% on MMLU and the old questions weren't enough anymore. One study looked at sixty mainstream benchmarks and found that nearly half are already highly saturated—top models are "statistically indistinguishable" on them.
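To make "statistically indistinguishable" concrete, here is a rough sketch using a plain two-proportion z-test. The question count of 200 is an assumption chosen for illustration, and treating the two models' results as independent samples is a simplification (real comparisons on the same questions are paired), so read it as intuition rather than the method the study used.

```python
import math


def z_statistic(acc_a: float, acc_b: float, n_questions: int) -> float:
    """Two-proportion z statistic for two accuracies measured on n_questions each."""
    pooled = (acc_a + acc_b) / 2.0
    std_err = math.sqrt(pooled * (1.0 - pooled) * 2.0 / n_questions)
    if std_err == 0.0:
        # Both models scored exactly 0% or 100%: no measurable difference.
        return 0.0
    return (acc_a - acc_b) / std_err


def distinguishable(acc_a: float, acc_b: float, n_questions: int,
                    z_crit: float = 1.96) -> bool:
    """True if the gap clears the usual 95% two-sided threshold."""
    return abs(z_statistic(acc_a, acc_b, n_questions)) > z_crit


# Hypothetical benchmark size; the accuracies echo the ranges quoted in the text.
N = 200
print(distinguishable(0.94, 0.92, N))  # False: a 2-point gap is within noise
print(distinguishable(0.85, 0.70, N))  # True: a 15-point gap is not
```

Under these assumptions, a two-point gap near the ceiling on a couple of hundred questions is simply too small to separate from sampling noise, which is why a harder test is needed before the gap shows up.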

But when you actually use them, the difference in feel is absurd. I wrote in my last post how Opus 4.8 kept letting me down on engineering and research tasks—work that later all moved to GPT 5.5. By the scores, the two are close; in practice, worlds apart.

The ARC-AGI suite is a perfect example. On the old version, top models had already saturated at 96%. Switch to the harder ARC-AGI-2, and the same models immediately show their true colors: GPT 5.5 still manages 85%, while Opus 4.8 drops to barely over 70%. Switch again to ARC-AGI-3, which requires actual interaction and exploration, and almost everyone flatlines to zero.

So benchmarks are still useful. It's just that tests humans can define and grade are becoming less and less able to tell models apart. To understand why, you have to look at training.

Solving the Hardest Problems vs. Reliably Doing Messy Work

The main technique making models stronger right now is called "verifiable rewards." In short: pick hard problems with standard answers that machines can automatically grade, and use them for reinforcement learning. Math and code are the classic examples. Right answer gets points, wrong gets zero, rinse and repeat.

The DeepSeek-R1 paper puts it clearly: math problems are verified with rules, code is thrown straight into compilers to run test cases. They specifically note they avoided neural-network-based reward models, because those are too easy for models to game. OpenAI's o series follows the same playbook. It's highly effective; this is exactly how models learned to solve hard problems.
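As a minimal sketch of what a rule-based reward can look like: one function checks a math answer by exact match, the other runs candidate code against assert-style tests in a subprocess. The function names, the exact-match rule, and the 0/1 scoring are assumptions for illustration, not the pipeline from the DeepSeek-R1 paper or any lab's actual setup.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def math_reward(model_answer: str, reference: str) -> float:
    """Hypothetical rule: 1.0 if the final answer matches the reference exactly."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0


def code_reward(candidate_source: str, test_source: str,
                timeout_s: float = 10.0) -> float:
    """Hypothetical rule: 1.0 if the candidate passes every assert in test_source.

    The candidate and its tests are written to one script and run in a
    separate Python process; a non-zero exit code or a timeout scores 0.0.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_with_tests.py"
        script.write_text(candidate_source + "\n\n" + test_source + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0
    return 1.0 if result.returncode == 0 else 0.0


good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"

print(math_reward(" 42 ", "42"))   # 1.0
print(code_reward(good, tests))    # 1.0
print(code_reward(bad, tests))     # 0.0
```

Notice what the grader needs: a reference answer or a test suite written in advance. A task like "make me an auto-editing tool" comes with neither, which is the gap the next paragraphs are about.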

But it has one characteristic: what it excels at is taking the "hardest problems humans can define and grade" and grinding them out. That is an entirely different capability from taking a fuzzy, not-that-hard but very real task and getting it done reliably in one go.

My friend's editing tool is the latter. The task isn't extremely difficult, but the intent is fuzzy. The model has to break it down on its own, and it has to get it done cleanly in one go. A model that can solve Olympiad problems may not handle this kind of messy work cleanly in one shot. It might go in circles, need three rounds of back-and-forth, and finally say "I'm done" when it isn't. Conversely, a model that's great at messy work might completely choke when you throw a really hard problem at it.

These are two directions of capability, each going its own way. You can't rank them on a single line.
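One way to picture "can't rank them on a single line" is a dominance check: a model only counts as strictly better if it is at least as good on every dimension. The two profiles and their numbers below are entirely made up for illustration; they are not measurements of any real model.

```python
from typing import Mapping


def dominates(a: Mapping[str, float], b: Mapping[str, float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


# Hypothetical capability profiles on a 0-to-1 scale (invented numbers).
hard_problem_solver = {"hard_problems": 0.95, "messy_work": 0.60}
messy_work_finisher = {"hard_problems": 0.70, "messy_work": 0.90}

print(dominates(hard_problem_solver, messy_work_finisher))  # False
print(dominates(messy_work_finisher, hard_problem_solver))  # False
```

Neither profile dominates the other, so any single ranking has to pick weights for the two dimensions first, and those weights depend on what you are trying to get done.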

The trouble is, ninety percent of people need the latter in daily life. They want to take a poorly specified task and get it done reliably, without hassle. Yet the score we use to rank models measures almost entirely the former. It's completely normal for "highest score" and "most useful for me" to not line up.

Another Dimension: Exploration

The first two types are still in the world of standard answers: either solve a gradable hard problem, or finish a verifiable task. The truly difficult one is the third.

When my friend got stuck, I thought of another class of problems. Like driving toward an intersection with a traffic light ahead. Do you go straight, or weave through the middle? There's no standard answer; you have to find your own direction in the ambiguity. Exploring a domain people haven't clearly defined, or don't even know the answer to, is a completely different capability.

Benchmarks can't measure this capability at all. The whole premise of evaluation is having a standard answer, something that can be graded right or wrong. But exploration has no right or wrong at all, only efficiency: can you fish out something new in the ambiguity, use it to move forward, and push out a boundary that didn't exist before?

It's also precisely the blind spot of the "verifiable rewards" approach. Research has already pointed out that open-ended tasks have no unique standard answer to begin with, so there is nothing to construct a reward from, and the method can't get traction. Some have even found that this training approach doesn't necessarily give models new capabilities, and may instead narrow the range of what they explore, with the capability ceiling hard-capped by the base model.

The result is that a model great at exploration, thrown into a cage with clear standard answers, might seem a bit stupid. A model that tests incredibly well may not possess exploration ability at all. In my own experience, GPT and Claude show the clearest difference on this dimension.

And this dimension happens to be the most important. Because truly valuable things often start without standard answers. Yet it's the hardest to measure, and the hardest to train.

The Chat Era Already Ran This Course

Model capabilities forking along different dimensions and sorting into tiers isn't new. The chatbot era already ran the full course.

Back then, everyone also thought for a while that the biggest model was the strongest. But they quickly discovered that for chatting specifically, the biggest model wasn't much better. In 2023, the LMSYS Chatbot Arena leaderboard dedicated a section to "Smaller Models Are Competitive": a 13B Vicuna ranked in the top five, its Elo score even beating Google's PaLM 2. 7B models also squeezed into the top ten, trading blows with models twice their size.
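For readers unfamiliar with Elo, the textbook formula below turns a rating gap into an expected head-to-head win rate. This is the standard Elo expectation only; the leaderboard's exact rating procedure may differ, and the ratings used here are hypothetical round numbers, not actual Arena scores.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for A against B (1.0 means A always wins)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Hypothetical ratings, chosen only to show the scale of the effect.
print(round(expected_score(1000, 1000), 3))  # 0.5
print(round(expected_score(1050, 1000), 3))  # 0.571
print(round(expected_score(1400, 1000), 3))  # 0.909
```

A lead of a few dozen points means winning only slightly more than half of the pairwise votes, so "beating" a larger model on Elo is consistent with the two being close in chat quality.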

Later studies echoed this: scaling models from tens of millions of parameters to hundreds of billions, all the way up to GPT-4 class, showed gains topping out quickly on softer tasks. Models with a few tens of billions of parameters weren't far behind frontier models.

In other words: for chatting, for emotional support, the marginal returns to scale are low. A few tens of billions is enough; scaling up to hundreds of billions is pure waste.

So the market sorted itself out. If you want emotional value, someone to chat with, a smaller model that sounds human is enough. Only when you need serious research or hardcore engineering do top-tier models come into play. Models sorted themselves into different price-performance tiers by use case.

Today's round is the same plot, replaying at a higher capability level.

Closing: Don't Ask Which Is Strongest—Ask Which Dimension You Need

Back to my friend's dilemma: "should I switch to a stronger model?" He's asking the wrong question.

There is no "stronger" that simultaneously covers solving hard problems, doing messy work, and exploring. These three things are diverging onto different models.

New-generation models are still pushing forward, of course. But the progress they fight for increasingly lands on "the hardest problems humans can define and grade," which is exactly where most people can't perceive it. So you see a split: leaderboard scores keep climbing generation after generation, yet most people just feel "it's been good enough for a while, can't see where it's stronger." Neither side is wrong, because what they want are fundamentally different dimensions of capability.

So stop vaguely asking "which model is strongest." First ask clearly: which dimension of work do you need it to do? Solve a hard problem with an answer, finish a messy task that wasn't clearly specified, or join you in something no one knows the answer to yet?

"Strongest" is becoming a question without a standard answer.