UDDITwork

Posted on • Originally published at newsletter.uddit.site

The AI Benchmark Where Simple Beats Smart

The results from ARC-AGI-3 landed this week, and everyone reported the same headline: frontier AI scored below 1%. GPT-5.4 got 0.26%. Claude Opus 4.6 got 0.25%. Gemini 3.1 Pro led the pack at a breathtaking 0.37%. Humans, meanwhile, solved 100% of the environments.

That is a damning result. But it is not the most interesting one.

The most interesting result is what scored higher than all of them. Simple CNN and graph-search algorithms — the kind of techniques that computer science students learn in their second year — reached 12.58%. Old-school, non-neural, deterministic algorithms outperformed every large language model on the planet by a factor of 30 to 50.

Sit with that for a moment. The same labs whose models recently crossed the threshold of passing the bar exam, writing publishable code, and reasoning through graduate-level physics problems have built systems that cannot do what a basic image classifier plus a search tree can do. Why?

ARC-AGI-3 tests something specific: novel visual reasoning in interactive environments. Each task is procedurally generated, which means you cannot have seen it before. There is no training data shortcut. You must actually understand the underlying rule of the environment and apply it efficiently. The benchmark scores you not just on correctness but on efficiency — how many actions you take relative to a human baseline. Being correct but slow is penalized.
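To make the efficiency penalty concrete, here is a minimal sketch of what an action-count-discounted score could look like. The exact formula is an illustrative assumption, not the official ARC-AGI-3 metric; the point is only that a correct-but-slow agent earns partial credit.

```python
def efficiency_score(solved: bool, agent_actions: int, human_actions: int) -> float:
    """Score one environment: 0 if unsolved, otherwise discounted by how
    many actions the agent used relative to a human baseline.
    NOTE: this formula is a hypothetical illustration, not the
    benchmark's published metric."""
    if not solved or agent_actions <= 0:
        return 0.0
    # Correct but slow is penalized: cap the ratio at 1.0 so an agent
    # cannot score above 100% by beating the human baseline.
    return min(1.0, human_actions / agent_actions)

# An agent that solves a task with 4x the human action count
# earns a quarter of the credit.
print(efficiency_score(True, 80, 20))   # 0.25
print(efficiency_score(True, 15, 20))   # 1.0
print(efficiency_score(False, 5, 20))   # 0.0
```

Under a scheme like this, an agent's headline percentage collapses quickly if it flails through environments, which is consistent with sub-1% scores even when some environments are eventually solved.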

Large language models are, at their core, extraordinarily sophisticated pattern matchers. They have seen so much text, code, and structured data that they can simulate reasoning across an enormous range of domains. But ARC-AGI-3 cuts off that pathway. When you cannot retrieve a pattern from memory because no such pattern was ever in your training distribution, what do you have left?

For GPT-5.4, Claude, and Gemini, the answer turns out to be: not much. Their scores suggest they are largely guessing.

For a CNN plus graph search, the answer is different. These systems were not trying to retrieve a pattern. They were doing something more primitive and more reliable: extracting visual features and running a search over possible action sequences. They do not need to have seen this environment before. The algorithm applies regardless.
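The graph-search half of that recipe can be sketched in a few lines. The grid world, actions, and transition rules below are toy assumptions standing in for a procedurally generated environment; the takeaway is that breadth-first search needs no training data at all, only a transition function, which is why it transfers to environments it has never seen.

```python
from collections import deque

# Toy stand-in for a procedurally generated environment: navigate a
# grid from `start` to `goal`. Cells marked '#' are walls.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def solve(grid, start, goal):
    """Return a shortest action sequence from start to goal via
    breadth-first search over states, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for name, (dr, dc) in ACTIONS.items():
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [name]))
    return None

grid = ["..#.",
        "..#.",
        "....",
        "#..."]
print(solve(grid, (0, 0), (0, 3)))  # a 7-action detour around the wall
```

In a real agent, the CNN's job is to turn raw pixels into the discrete state representation this search operates on; the search itself is deterministic and, because BFS explores by depth, the path it returns is also the shortest, which matters under an efficiency-scored benchmark.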

This is not an argument that old-school AI is better than transformers. It is a much more uncomfortable argument: that the specific kind of generalization frontier models are good at is not the only kind of generalization that matters, and may not be the kind that scales toward whatever we mean by genuine intelligence.

The ARC Prize Foundation, which runs the benchmark, has been making this point since ARC-AGI-1. Francois Chollet, who created the original ARC challenge, has long argued that LLMs achieve what he calls "crystallized intelligence" — the ability to retrieve and recombine patterns from training — rather than "fluid intelligence," the ability to construct solutions to genuinely novel problems from first principles.

ARC-AGI-3 is the clearest test of that distinction yet. And the results suggest the gap is wider than most people assumed.

This matters beyond benchmark trivia. The whole premise of the current AI investment supercycle is that scaling laws will get us to human-level general intelligence. More data, more compute, bigger models. Scores on ARC-AGI-3 raise a direct challenge to that premise. If the path to 100% on this benchmark does not run through larger transformers — if it runs through architectural changes, hybrid systems, or entirely different paradigms — then the roadmap most AI labs are running on is, at minimum, incomplete.

None of the major AI newsletters wrote this part of the story. They reported the surface number. They did not ask what that number implies about the architectural limits of the systems we are betting trillions of dollars on.

The ARC Prize Foundation has released an open-source agent toolkit alongside ARC-AGI-3. Researchers can now build and test agents against the benchmark publicly. Some of the most interesting work over the next six months will probably come from small teams experimenting with non-transformer architectures — combinations of symbolic reasoning, search, and perception that do not rely on memorized patterns.

If one of those approaches cracks it, it will matter far more than the next GPT release. And it probably will not come from the largest labs, which are too invested in their current direction to pivot.

The benchmark where simple beats smart might end up being the most important test of 2026. Not because AI failed it. Because of what that failure tells us about where the ceiling actually is.


Originally on The Signal — free AI newsletter. Subscribe: newsletter.uddit.site
