DEV Community

CodePawl


GPT-5, Claude, Gemini All Score Below 1% - ARC AGI 3 Just Broke Every Frontier Model

Three system types compared on ARC-AGI: Reasoning Systems (models like o1/o3 tested at varying thinking levels, showing diminishing returns as reasoning time increases), Base LLMs (single-shot inference from standard models, no extended reasoning), and Kaggle Systems (competition submissions optimized under a $50 compute budget, purpose-built for ARC). Key insight: purpose-built Kaggle systems outperform both base LLMs and reasoning-augmented models despite far less compute.

ARC-AGI-3, launched just yesterday on March 25, 2026, represents the most radical transformation of the ARC benchmark since François Chollet introduced it in 2019 — abandoning static grid puzzles entirely in favor of interactive, video-game-like environments where AI agents must discover rules, set goals, and solve problems with zero instructions.

The competition carries over $2 million in prizes across three tracks. Early preview results: frontier LLMs like GPT-5 and Claude score below 1%, while simple CNN and graph-search approaches reach 12.58%. The gap between human performance (100%) and the best AI agent remains enormous.

From grid puzzles to game worlds: what changed

Can you build an agent to beat this game?

ARC-AGI-3 is not an incremental difficulty upgrade — it is a fundamentally different benchmark. Previous versions (ARC-AGI-1 and ARC-AGI-2) presented static input-output grid pairs where systems inferred transformation rules and applied them. ARC-AGI-3 instead drops agents into turn-based game environments with no stated rules, no instructions, and no win conditions. Agents observe a 64×64 grid with 16 colors, take actions (move, click, reset), and must figure out both what to do and how to do it through pure interaction.
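The observe-act loop this implies can be sketched as follows. The environment class and action names here are hypothetical stand-ins (the real arc-agi toolkit exposes its own API, which may differ); the point is that the agent receives only raw grid observations and a level-complete signal, never a rule description.

```python
import random

# Hypothetical stand-in for an ARC-AGI-3 environment. The real arc-agi
# toolkit has its own interface; this toy version only mimics the shape
# of the problem: act, observe, and get no feedback except completion.
class ToyEnv:
    """A clamped grid world; the goal cell is never told to the agent."""
    ACTIONS = ["up", "down", "left", "right", "click", "reset"]

    def __init__(self):
        self.pos = [0, 0]
        self.goal = (3, 3)

    def step(self, action):
        moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
        if action == "reset":
            self.pos = [0, 0]
        elif action in moves:
            dr, dc = moves[action]
            self.pos[0] = min(63, max(0, self.pos[0] + dr))
            self.pos[1] = min(63, max(0, self.pos[1] + dc))
        done = action == "click" and tuple(self.pos) == self.goal
        return tuple(self.pos), done  # observation, level-complete flag

def random_agent(env, max_steps=10_000, seed=0):
    """Weakest possible baseline: act at random until the undisclosed
    win condition happens to trigger. Returns the action count, or
    None if the level was never completed."""
    rng = random.Random(seed)
    for t in range(max_steps):
        obs, done = env.step(rng.choice(env.ACTIONS))
        if done:
            return t + 1
    return None

steps = random_agent(ToyEnv())
```

Any real entry has to do better than this baseline by discovering *which* actions matter and *why*, which is exactly what the benchmark measures.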

The benchmark comprises 1,000+ levels across 150+ handcrafted environments, each game containing 8–10 levels that progressively introduce new mechanics. Three preview games illustrate the range: ls20 requires navigating a map and transforming symbols, ft09 involves matching patterns across overlapping grids, and vc33 tasks agents with adjusting volumes to hit target heights. Scoring uses action efficiency — how many actions the agent needs compared to a human baseline — rather than binary pass/fail. A perfect 100% means the AI matches human efficiency across all games.
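Action-efficiency scoring can be illustrated with a small sketch. The exact official formula is not spelled out in the post, so the version below is an assumption: per-level efficiency is the ratio of the human action count to the agent's, capped at 100%, averaged across levels.

```python
def efficiency_score(agent_actions, human_actions):
    """Assumed per-level metric: ratio of human to agent action counts,
    capped at 1.0 (matching human efficiency is a perfect score).
    The official ARC-AGI-3 formula may differ in detail."""
    if agent_actions is None or agent_actions <= 0:
        return 0.0  # level unsolved
    return min(1.0, human_actions / agent_actions)

def benchmark_score(per_level):
    """Average efficiency across levels; 1.0 = human-level everywhere."""
    return sum(efficiency_score(a, h) for a, h in per_level) / len(per_level)

# An agent that needs twice the human action count on one level,
# matches humans on a second, and fails a third:
score = benchmark_score([(40, 20), (15, 15), (None, 30)])  # (agent, human)
```

Under this assumed metric, brute-force solutions that eventually stumble into a win still score poorly, because every wasted action drags the ratio down.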

The evolution across versions tells a clear story of escalating challenge:

| Feature | ARC-AGI-1 (2019) | ARC-AGI-2 (2025) | ARC-AGI-3 (2026) |
| --- | --- | --- | --- |
| Format | Static grid puzzles | Static grid puzzles (harder) | Interactive game environments |
| Instructions | Input-output demo pairs | Input-output demo pairs | None — discover through interaction |
| Best AI score | ~90%+ (saturated) | 24% (competition) | 12.58% (preview) |
| Human baseline | ~85% | ~60% average | 100% |
| Scoring | Binary accuracy | Accuracy + cost-per-task | Action efficiency vs. humans |
| Tasks | ~400 training + 100 eval | 1,000 training + 120 eval per split | 1,000+ levels, 150+ environments |

ARC-AGI-1 became effectively saturated by 2025, with frontier models hitting 90%+ through brute-force engineering. ARC-AGI-2 introduced harder compositional tasks — symbolic interpretation, contextual rule application, multiple interacting rules — that dropped the best competition score to 24%. ARC-AGI-3 tests four entirely new capabilities: exploration (actively gathering information), modeling (building generalizable world models), goal-setting (identifying objectives without instructions), and planning with execution (strategic action with course-correction).

Preview leaderboard reveals LLMs' interactive reasoning gap

The competition launched just yesterday, so the official Kaggle leaderboard has no entries yet. However, a 30-day developer preview preceding the launch produced highly informative results from 12 submissions (8 of which were tested on private games):

| Rank | Team | Approach | Score | Levels solved |
| --- | --- | --- | --- | --- |
| 1st | StochasticGoose (Tufa Labs) | CNN + RL action-learning | 12.58% | 18 |
| 2nd | Blind Squirrel | State graph exploration + ResNet18 | 6.71% | 13 |
| 3rd | Explore It Till You Solve It | Training-free frame graph | 3.64% | 12 |
| — | Best frontier LLM agent | LLM-based | <1% | ~2–3 |
| — | Human players | Human cognition | 100% | All |

All three top systems used non-LLM approaches. StochasticGoose, built by Dries Smit at Tufa Labs, employed a CNN-based action prediction model with simple reinforcement learning and sparse rewards (only level completion signals). It stored frame transitions in memory for off-policy training, used hash tables to avoid duplicate states, and iteratively retrained its model between levels. The team explicitly avoided LLMs because the observation complexity — hundreds of interaction steps — would generate millions of tokens.
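The hash-table deduplication and sparse-reward transition memory described above can be sketched in a few lines. This is an illustration of the idea, not StochasticGoose's actual code; all names here are invented for the example.

```python
import hashlib
from collections import deque

def frame_hash(frame):
    """Hash a grid observation (nested tuples of color ints) so
    already-visited states can be detected cheaply."""
    raw = ",".join(str(c) for row in frame for c in row).encode()
    return hashlib.md5(raw).hexdigest()

class TransitionMemory:
    """Store (state, action, next_state, reward) for off-policy training.
    Reward is sparse: 1.0 only on level completion, 0.0 otherwise."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)
        self.seen = set()

    def add(self, frame, action, next_frame, level_done):
        h = frame_hash(next_frame)
        novel = h not in self.seen
        self.seen.add(h)
        self.buffer.append((frame, action, next_frame,
                            1.0 if level_done else 0.0))
        return novel  # caller can prioritize exploring novel states

mem = TransitionMemory()
grid_a = ((0, 1), (2, 3))
grid_b = ((0, 1), (2, 4))
first = mem.add(grid_a, "click", grid_b, level_done=False)
repeat = mem.add(grid_b, "up", grid_b, level_done=False)
```

The `novel` return value is the key trick: with only level-completion rewards available, state novelty becomes the practical exploration signal between wins.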

The third-place system, documented in an arXiv paper (Rudakov et al., 2512.24156), used a completely training-free graph-based exploration method, building state graphs and systematically exploring them. It solved a median of 30 out of 52 levels across 6 games but was limited by computational scaling with state space size.
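A training-free state-graph explorer in this spirit can be sketched as a breadth-first search over hashable states of a deterministic environment, recording the action edges it discovers. The function names and the toy world below are illustrative, not taken from the cited paper's code.

```python
from collections import deque

def explore_state_graph(start, step_fn, actions, max_states=10_000):
    """Breadth-first exploration of a deterministic environment.
    `step_fn(state, action)` returns (next_state, done). Builds a
    state graph and returns it plus the goal state if one is found.
    Note the limitation the text mentions: work grows with the
    number of reachable states."""
    graph = {}                 # state -> {action: next_state}
    frontier = deque([start])
    seen = {start}
    while frontier and len(seen) < max_states:
        state = frontier.popleft()
        graph[state] = {}
        for a in actions:
            nxt, done = step_fn(state, a)
            graph[state][a] = nxt
            if done:
                return graph, nxt  # goal reached
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return graph, None

# Toy 1-D world: positions 0..4, done when clicking at position 4.
def step_fn(s, a):
    if a == "right":
        return min(4, s + 1), False
    if a == "left":
        return max(0, s - 1), False
    return s, (a == "click" and s == 4)

graph, goal = explore_state_graph(0, step_fn, ["left", "right", "click"])
```

Once the graph exists, a shortest path through it gives an action-efficient solution for free, which is why this approach scores well relative to its simplicity but degrades as the state space grows.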

Frontier LLMs' sub-1% performance is perhaps the most significant data point. The interactive format — requiring sustained sequential reasoning, state tracking across hundreds of steps, and learning from environmental feedback — exposes a fundamental limitation of current language models that static benchmarks never tested.

$2 million across three tracks with strict open-source requirements

The ARC Prize 2026 splits its prize pool across three parallel competition tracks, each hosted on Kaggle:

ARC-AGI-3 Track — $850,000 total:
The grand prize of $700,000 goes to the first agent scoring 100% on evaluation (carries over if unclaimed). A guaranteed $75,000 top-score award distributes $40K/1st, $15K/2nd, $10K/3rd, and $5K each for 4th–5th. Two milestone prizes totaling $75,000 reward early progress: $25K/$10K/$2.5K at each milestone (June 30 and September 30).

ARC-AGI-2 Track — ~$1 million: The $700K grand prize for scoring 85% on ARC-AGI-2 remains unclaimed from both 2024 and 2025, and continues into 2026 alongside separate score awards.

Paper Prize Track: Awards for research papers advancing understanding of strong ARC-AGI performance.

Critical competition constraints shape viable approaches. All winning solutions must be open-sourced under permissive licenses (CC0 or MIT-0) before receiving private evaluation scores. Kaggle evaluation runs with no internet access — meaning no API calls to OpenAI, Anthropic, Google, or any cloud inference endpoint. Teams must either use open-weight models running locally or build entirely non-LLM systems. The ARC-AGI-3 toolkit is open-source (MIT license, pip install arc-agi) and runs at 2,000+ FPS locally, but requires an API key from arcprize.org.

What approaches are competitors likely to pursue

The preview results and historical ARC competition patterns suggest several viable research directions for ARC-AGI-3:

Reinforcement learning with lightweight neural networks is the proven frontrunner. StochasticGoose's CNN + sparse RL approach dominated the preview. Simple action prediction models that learn which actions cause meaningful state changes, combined with systematic exploration, appear far more effective than sophisticated language understanding.
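The core of "learning which actions cause meaningful state changes" can be shown in miniature. The class below is a toy illustration of that principle only, not StochasticGoose's actual model, and all names are invented for the example.

```python
from collections import defaultdict

class ActionEffectTracker:
    """Track, per action, how often it altered the observation, and
    rank exploration actions by that empirical effect rate."""
    def __init__(self, actions):
        self.actions = actions
        self.changed = defaultdict(int)
        self.tried = defaultdict(int)

    def record(self, action, frame_before, frame_after):
        self.tried[action] += 1
        if frame_before != frame_after:
            self.changed[action] += 1

    def best_actions(self):
        """Actions sorted by observed probability of changing the state."""
        def rate(a):
            return self.changed[a] / self.tried[a] if self.tried[a] else 0.0
        return sorted(self.actions, key=rate, reverse=True)

t = ActionEffectTracker(["up", "down", "click"])
t.record("up", (0,), (1,))      # changed the frame
t.record("down", (1,), (1,))    # no effect
t.record("click", (1,), (2,))   # changed the frame
ranked = t.best_actions()
```

A CNN generalizes this idea by predicting effect-fulness from the frame contents rather than from raw action counts, but the exploration bias is the same: spend your action budget where the environment responds.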

Graph-based state exploration offers a training-free alternative. Building explicit state graphs, pruning loops, and systematically mapping environment dynamics worked surprisingly well (6.71% for Blind Squirrel). This approach trades compute for algorithmic efficiency but scales poorly with state space size.

Meta-learning and curiosity-driven RL are natural fits given ARC-AGI-3's requirement for rapid adaptation to novel environments. Methods like BYOL-Hindsight and intrinsic motivation were discussed during the preview period but proved finicky with short timeframes and sparse rewards.

World models (Dreamer family, latent dynamics models) could learn environment physics in imagination before acting, but are limited by ARC-AGI-3's sparse reward signal — only level completion provides feedback.

For the continuing ARC-AGI-2 track, the dominant paradigm from 2025 was synthetic data generation combined with test-time training — NVARC's winning approach used Qwen3-4B fine-tuned on 103K synthetic puzzles plus 3.2M augmented samples. Other strong directions include masked diffusion models (ARChitects), evolutionary program synthesis (SOAR), and minimum description length approaches (CompressARC).
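The augmentation half of that recipe typically relies on symmetries of the grid format. The sketch below shows one plausible way a base puzzle set gets multiplied into millions of training samples; the winning pipeline's exact augmentations are not documented here and may differ.

```python
import numpy as np

def augment_grid(grid, rng):
    """Common ARC-style augmentations: a random rotation, an optional
    reflection, and a random permutation of the 10 ARC colors. Each
    output is a distinct but equivalent version of the same puzzle."""
    g = np.array(grid)
    g = np.rot90(g, k=int(rng.integers(4)))  # random 90-degree rotation
    if rng.random() < 0.5:
        g = np.fliplr(g)                     # random horizontal flip
    perm = rng.permutation(10)               # remap colors 0..9
    return perm[g]

rng = np.random.default_rng(0)
sample = [[0, 1], [2, 3]]
augmented = augment_grid(sample, rng)
```

Because each transform preserves the underlying rule, the model sees the same reasoning problem in many surface forms, which is what makes small fine-tuned models like a 4B-parameter Qwen competitive on the static tracks.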

Key dates and competition timeline

| Date | Milestone |
| --- | --- |
| March 25, 2026 | Competition opens on Kaggle |
| June 30, 2026 | ARC-AGI-3 Milestone #1 ($37,500 in prizes) |
| September 30, 2026 | ARC-AGI-3 Milestone #2 ($37,500 in prizes) |
| November 2, 2026 | All submissions due |
| November 8, 2026 | Paper track submissions due |
| December 4, 2026 | Results announced |

During the competition, Kaggle leaderboard standings reflect scores on a semi-private dataset. Final rankings and prize eligibility use a separate private dataset, following the same anti-gaming structure as previous years. Human calibration data was collected from 1,200+ players across 3,900+ games during the preview, with a controlled study of 200+ participants establishing production baselines.

Try it yourself

The ARC-AGI-3 toolkit is open-source and runs locally:

```shell
pip install arc-agi
```

You'll need an API key from arcprize.org to access the environments. The toolkit runs at 2,000+ FPS locally.

Full competition details and submission: ARC Prize 2026 on Kaggle

Conclusion

ARC-AGI-3 is not merely a harder test — it measures a fundamentally different kind of intelligence. The shift from static pattern recognition to interactive exploration and goal discovery exposes capabilities that current AI systems, including frontier LLMs, demonstrably lack. The preview data is unambiguous: simple RL and graph search at 12.58% versus frontier LLMs below 1% suggests that the path to solving ARC-AGI-3 runs through novel algorithmic ideas rather than model scaling.

With $850K on the line for the interactive track alone and milestone prizes creating incentives for early progress, the next eight months should produce significant advances in adaptive AI reasoning — all of which, by competition rules, will be open-sourced for the broader research community.

Are you planning to compete? What approach would you try? Drop your thoughts in the comments.


We're CodePawl — an open-source-first firm building tools for developers. Follow us on X or join our Discord.
