The Score That Should Have Everyone Worried
On March 28, 2026, the AI world got a number that should be printed on every AGI roadmap in a very large font: 0.3%.
That's the score GPT-5.4 High and Claude Opus 4.6 Max — the two most capable AI systems on the planet — achieved on ARC-AGI V3. At a cost of $5,000 to $9,000 per task.
Humans? 100%.
Symbolica's Agentica SDK? 36% — and a total bill of about $1,005 for 113 of 182 levels.
This isn't a minor benchmark update. ARC-AGI V3 is the clearest signal yet that the AI industry has been solving the wrong problem.
📊 The V3 Scoreboard
- Humans: 100% success rate
- Symbolica Agentica SDK: 36.08% (113/182 levels, $1,005 total)
- GPT-5.4 High: ~0.3% (at $5,000-9,000 per task)
- Claude Opus 4.6 Max: ~0.25-0.3% (similar cost profile)
What Is ARC-AGI, and Why Does It Keep Mattering?
The Abstraction and Reasoning Corpus (ARC) is the benchmark that François Chollet — Keras creator, former Google AI researcher, and arguably the AI field's most credible skeptic — designed specifically to measure fluid intelligence rather than memorized knowledge.
The core insight: if a model has seen enough training examples, it can score well on almost any benchmark. ARC was designed from the start to resist this.
V1 to V2 to V3: Closing the Escape Hatches
ARC-AGI V1 (2019): Static 2D grid puzzles. Given a few input-output examples, derive the transformation rule and apply it to a new input.
ARC-AGI V2 (2025): Addressed the contamination problem with harder, more compositional puzzles and stricter novelty guarantees.
ARC-AGI V3 (2026): A complete category shift — interactive video game environments instead of static puzzles.
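To make the original static format concrete, a V1-style task is just a handful of input-output grid pairs plus a held-out test input: the solver must infer the transformation from the demonstrations and apply it. The grids and the "mirror" rule below are illustrative inventions, not an actual ARC puzzle.

```python
# Illustrative sketch of an ARC V1-style task (not a real ARC puzzle):
# a few demonstration input-output grid pairs, plus a test input whose
# output must be produced by inferring the transformation rule.

Grid = list[list[int]]  # cell values are color indices 0-9

task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": {"input": [[3, 0], [0, 0]]},
}

def mirror_horizontal(grid: Grid) -> Grid:
    """Candidate rule: reflect each row left-to-right."""
    return [row[::-1] for row in grid]

# Verify the candidate rule reproduces every demonstration pair...
assert all(mirror_horizontal(p["input"]) == p["output"] for p in task["train"])

# ...then apply it to the held-out test input.
prediction = mirror_horizontal(task["test"]["input"])
print(prediction)  # [[0, 3], [0, 0]]
```

The point of the format is that the rule is trivial for a human to spot from two examples, yet cannot be memorized in advance, because each task uses a different rule.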
How ARC-AGI V3 Actually Works
V3 drops agents into interactive video game environments. Here's what that means:
- The agent is presented with a novel mini-game it has never seen before
- There are zero instructions — no goal, no controls, no rules explained
- The agent has a limited number of turns to figure everything out
- Success means: discover the goal, learn the controls, understand the rules, complete the task
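The interaction pattern those bullets describe can be sketched as an observe-act loop against a black-box environment. Everything below is an illustrative assumption: the `ToyEnvironment` class, its hidden goal, and the agent are stand-ins, not the actual ARC-AGI V3 interface.

```python
# Hypothetical sketch of the V3 interaction pattern: the agent sees only
# raw observations and a fixed action set, with no goal, controls, or
# rules explained, and a limited turn budget. The environment here is a
# toy stand-in, not the real ARC-AGI V3 API.

import random

class ToyEnvironment:
    """Trivial stand-in: the hidden goal is three consecutive 'up' actions."""
    def __init__(self):
        self.progress = 0
        self.turns_left = 20  # limited number of turns

    def step(self, action: str) -> tuple[str, bool]:
        self.turns_left -= 1
        self.progress = self.progress + 1 if action == "up" else 0
        done = self.progress >= 3 or self.turns_left == 0
        return f"progress={self.progress}", done  # observation, episode over?

def random_explorer(env: ToyEnvironment) -> bool:
    """An agent must discover the goal, controls, and rules by trial."""
    actions = ["up", "down", "left", "right"]  # controls are unexplained
    done = False
    while not done:
        observation, done = env.step(random.choice(actions))
    return env.progress >= 3  # success only if the hidden goal was met

print(random_explorer(ToyEnvironment()))
```

Even this toy version shows the shape of the problem: nothing in the observation stream says what "winning" means, so the agent must form and test hypotheses from scratch.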
This is how humans learn to play new games. A 10-year-old can master a new mobile game in minutes. Current AI systems? They break.
💡 Chollet's Core Thesis: LLMs didn't get smarter. They got better-trained on verifiable domains like code. Move to genuinely novel, non-verifiable tasks — and the apparent intelligence evaporates.
François Chollet's Vision — and Warning
His key arguments:
- LLMs improved on measurable tasks, not on intelligence itself.
- Move to unverifiable domains and progress stalls.
- AGI timeline: 2030, but not via the current path. The core engine will fit in fewer than 10,000 lines of code.
- The $600K ARC Prize exists to redirect research incentives away from benchmark gaming.
The Results in Context: What 0.3% Actually Means
GPT-5.4 High, running at $5,000-9,000 per task, scored 0.3% on an evaluation where humans score 100%. This is not a narrow gap.
Symbolica's Agentica SDK: 36% at roughly $9 per level, which is on the order of 1,000x cheaper than the frontier models that scored near-zero at $5,000-9,000 per task.
The 36% vs 0.3% comparison is a signal about architecture. Symbolica's program synthesis approach outperforms pure LLM scaling by 120x on this benchmark.
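Symbolica has not published Agentica's internals, but the generic technique the article names, program synthesis, can be sketched: enumerate small programs over a domain-specific language of primitives and keep the first one that explains all the demonstrations. The DSL and demo pair below are invented for illustration.

```python
# Minimal sketch of program synthesis over a tiny DSL: enumerate
# compositions of primitive grid operations and return the first program
# consistent with all demonstration pairs. A generic illustration of the
# technique, not Symbolica's actual Agentica implementation.

from itertools import product

Grid = list[list[int]]

PRIMITIVES = {
    "mirror": lambda g: [row[::-1] for row in g],  # reflect left-right
    "flip": lambda g: g[::-1],                     # reflect top-bottom
    "identity": lambda g: g,
}

def synthesize(pairs: list[tuple[Grid, Grid]], max_depth: int = 2):
    """Search programs (sequences of primitives) up to max_depth long."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def run(g, names=names):
                for n in names:
                    g = PRIMITIVES[n](g)
                return g
            # Keep the first program that explains every demonstration.
            if all(run(i) == o for i, o in pairs):
                return names, run
    return None

demo = [([[1, 0], [0, 0]], [[0, 0], [0, 1]])]  # mirror, then flip
names, program = synthesize(demo)
print(names)  # ('mirror', 'flip')
```

The appeal of this family of approaches is exactly what V3 rewards: the output is an explicit, checkable program rather than a pattern-match, so a correct hypothesis generalizes to new inputs by construction.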
Why This Exposes the Agent Industry's Blind Spot
AI agents can write code, browse the web, draft emails. But these are all examples of applying learned patterns to familiar scenarios.
Put an agent in a genuinely novel environment and it collapses. ARC-AGI V3 puts numbers on the failure.
The kicker: the agent industry's go-to defense ("it gets better with more context") directly contradicts the V3 premise. You don't get examples. You get to figure it out.
What Happens Next
- Researchers: Hybrid architectures combining pattern matching with program synthesis are more likely to crack V3-style problems than pure scaling
- Builders: Stop overselling adaptability. Design for the limitation.
- Buyers: Benchmark performance ≠ general capability
- AGI watchers: Chollet's 2030 estimate looks more credible. We're not in the final miles.
The score is 100% humans, 36% Symbolica, 0.3% everything else. The gap is the map.