The Score That Should Have Everyone Worried
On March 28, 2026, the AI world got a number that should be printed on every AGI roadmap in a very large font: 0.3%.
That's the score GPT-5.4 High and Claude Opus 4.6 Max — the two most capable AI systems on the planet — achieved on ARC-AGI V3. At a cost of $5,000 to $9,000 per task.
Humans? 100%.
Symbolica's Agentica SDK? 36% — and a total bill of about $1,005 for 113 of 182 levels.
This isn't a minor benchmark update. ARC-AGI V3 is the clearest signal yet that the AI industry has been solving the wrong problem.
📊 The V3 Scoreboard
- Humans: 100% success rate
- Symbolica Agentica SDK: 36.08% (113/182 levels, $1,005 total)
- GPT-5.4 High: ~0.3% (at $5,000-9,000 per task)
- Claude Opus 4.6 Max: ~0.25-0.3% (similar cost profile)
What Is ARC-AGI, and Why Does It Keep Mattering?
The Abstraction and Reasoning Corpus (ARC) is the benchmark that François Chollet — Keras creator, former Google AI researcher, and arguably the AI field's most credible skeptic — designed specifically to measure fluid intelligence rather than memorized knowledge.
The core insight: if a model has seen enough training examples, it can score well on almost any benchmark. ARC was designed from the start to resist this.
V1 to V2 to V3: Closing the Escape Hatches
ARC-AGI V1 (2019): Static 2D grid puzzles. Given a few input-output examples, derive the transformation rule and apply it to a new input.
ARC-AGI V2 (2025): Addressed the contamination problem with harder, more compositional puzzles and stricter novelty guarantees.
ARC-AGI V3 (2026): A complete category shift — interactive video game environments instead of static puzzles.
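To make the original static format concrete, a V1-style task is just a handful of input-output grid pairs plus a held-out test input: the solver must infer the transformation from the demonstrations and apply it. The grids and the "mirror" rule below are illustrative inventions, not an actual ARC puzzle.

```python
# Illustrative sketch of an ARC V1-style task (not a real ARC puzzle):
# a few demonstration input-output grid pairs, plus a test input whose
# output must be produced by inferring the transformation rule.

Grid = list[list[int]]  # cell values are color indices 0-9

task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": {"input": [[3, 0], [0, 0]]},
}

def mirror_horizontal(grid: Grid) -> Grid:
    """Candidate rule: reflect each row left-to-right."""
    return [row[::-1] for row in grid]

# Verify the candidate rule reproduces every demonstration pair...
assert all(mirror_horizontal(p["input"]) == p["output"] for p in task["train"])

# ...then apply it to the held-out test input.
prediction = mirror_horizontal(task["test"]["input"])
print(prediction)  # [[0, 3], [0, 0]]
```

The point of the format is that the rule is trivial for a human to spot from two examples, yet cannot be memorized in advance, because each task uses a different rule.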
How ARC-AGI V3 Actually Works
V3 drops agents into interactive video game environments. Here's what that means:
- The agent is presented with a novel mini-game it has never seen before
- There are zero instructions — no goal, no controls, no rules explained
- The agent has a limited number of turns to figure everything out
- Success means: discover the goal, learn the controls, understand the rules, complete the task
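The interaction pattern those bullets describe can be sketched as an observe-act loop against a black-box environment. Everything below is an illustrative assumption: the `ToyEnvironment` class, its hidden goal, and the agent are stand-ins, not the actual ARC-AGI V3 interface.

```python
# Hypothetical sketch of the V3 interaction pattern: the agent sees only
# raw observations and a fixed action set, with no goal, controls, or
# rules explained, and a limited turn budget. The environment here is a
# toy stand-in, not the real ARC-AGI V3 API.

import random

class ToyEnvironment:
    """Trivial stand-in: the hidden goal is three consecutive 'up' actions."""
    def __init__(self):
        self.progress = 0
        self.turns_left = 20  # limited number of turns

    def step(self, action: str) -> tuple[str, bool]:
        self.turns_left -= 1
        self.progress = self.progress + 1 if action == "up" else 0
        done = self.progress >= 3 or self.turns_left == 0
        return f"progress={self.progress}", done  # observation, episode over?

def random_explorer(env: ToyEnvironment) -> bool:
    """An agent must discover the goal, controls, and rules by trial."""
    actions = ["up", "down", "left", "right"]  # controls are unexplained
    done = False
    while not done:
        observation, done = env.step(random.choice(actions))
    return env.progress >= 3  # success only if the hidden goal was met

print(random_explorer(ToyEnvironment()))
```

Even this toy version shows the shape of the problem: nothing in the observation stream says what "winning" means, so the agent must form and test hypotheses from scratch.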
This is how humans learn to play new games. A 10-year-old can master a new mobile game in minutes. Current AI systems? They break.
💡 Chollet's Core Thesis: LLMs didn't get smarter. They got better-trained on verifiable domains like code. Move to genuinely novel, non-verifiable tasks — and the apparent intelligence evaporates.
François Chollet's Vision — and Warning
His key arguments:
- LLMs improved on measurable tasks, not on intelligence itself.
- Move to unverifiable domains and progress stalls.
- AGI timeline: 2030, but not via the current path. The core engine will fit in fewer than 10,000 lines of code.
- The $600K ARC Prize exists to redirect research incentives away from benchmark gaming.
The Results in Context: What 0.3% Actually Means
GPT-5.4 High, running at $5,000-9,000 per task, scored 0.3% on an evaluation where humans score 100%. This is not a narrow gap.
Symbolica's Agentica SDK: 36% at roughly $9 per level, which is on the order of 1,000x cheaper than the frontier models that scored near-zero at $5,000-9,000 per task.
The 36% vs 0.3% comparison is a signal about architecture. Symbolica's program synthesis approach outperforms pure LLM scaling by 120x on this benchmark.
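Symbolica has not published Agentica's internals, but the generic technique the article names, program synthesis, can be sketched: enumerate small programs over a domain-specific language of primitives and keep the first one that explains all the demonstrations. The DSL and demo pair below are invented for illustration.

```python
# Minimal sketch of program synthesis over a tiny DSL: enumerate
# compositions of primitive grid operations and return the first program
# consistent with all demonstration pairs. A generic illustration of the
# technique, not Symbolica's actual Agentica implementation.

from itertools import product

Grid = list[list[int]]

PRIMITIVES = {
    "mirror": lambda g: [row[::-1] for row in g],  # reflect left-right
    "flip": lambda g: g[::-1],                     # reflect top-bottom
    "identity": lambda g: g,
}

def synthesize(pairs: list[tuple[Grid, Grid]], max_depth: int = 2):
    """Search programs (sequences of primitives) up to max_depth long."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def run(g, names=names):
                for n in names:
                    g = PRIMITIVES[n](g)
                return g
            # Keep the first program that explains every demonstration.
            if all(run(i) == o for i, o in pairs):
                return names, run
    return None

demo = [([[1, 0], [0, 0]], [[0, 0], [0, 1]])]  # mirror, then flip
names, program = synthesize(demo)
print(names)  # ('mirror', 'flip')
```

The appeal of this family of approaches is exactly what V3 rewards: the output is an explicit, checkable program rather than a pattern-match, so a correct hypothesis generalizes to new inputs by construction.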
Why This Exposes the Agent Industry's Blind Spot
AI agents can write code, browse the web, draft emails. But these are all examples of applying learned patterns to familiar scenarios.
Put an agent in a genuinely novel environment and it collapses. ARC-AGI V3 puts numbers on the failure.
The kicker: the agent industry's go-to defense ("it gets better with more context") directly contradicts the V3 premise. You don't get examples. You get to figure it out.
What Happens Next
- Researchers: Hybrid architectures combining pattern matching with program synthesis are more likely to crack V3-style problems than pure scaling
- Builders: Stop overselling adaptability. Design for the limitation.
- Buyers: Benchmark performance ≠ general capability
- AGI watchers: Chollet's 2030 estimate looks more credible. We're not in the final miles.
The score is 100% humans, 36% Symbolica, 0.3% everything else. The gap is the map.