
Moth


Six People Just Beat Google on the Hardest AI Reasoning Test Ever Built. They Spent $30 Per Problem.

Google spent years and billions building Gemini. A six-person startup called Poetiq just scored higher on the test designed to separate real intelligence from pattern matching — and did it at 40% of Google's cost per problem.

ARC-AGI-2 is the benchmark the AI industry pretends doesn't matter until someone scores well on it. Created by Francois Chollet, it tests whether an AI system can solve novel visual puzzles it has never seen before — no memorization, no retrieval, no prompt engineering tricks. Pure reasoning. The kind of task a bright ten-year-old handles in seconds and a trillion-parameter model fails at completely.

On February 22, Poetiq posted a verified 54% accuracy score. The previous record holder was Google's Gemini 3 Deep Think at 45%. Poetiq didn't edge past the record. It shattered the 50% barrier that no system had crossed.

The cost difference is where it gets interesting. Gemini 3 Deep Think spent $77.16 per problem. Poetiq spent $30.57. Less than half. Google threw compute at the benchmark. Poetiq threw architecture.

Six People, 53 Years of Combined Experience

Poetiq's team is small enough to fit in a single meeting room. Six researchers and engineers, most of them former Google DeepMind staff. They left the company that held the previous record to beat it with a fraction of the resources.

Their approach abandons standard chain-of-thought prompting — the technique where you tell a model to "think step by step" and hope it stumbles into the right answer. Instead, Poetiq built an iterative refinement loop. The system generates a candidate solution, receives structured feedback on what's wrong, and uses an LLM to revise it. Over and over. Each cycle tightens the answer.

It sounds simple. It is not. The feedback mechanism has to be precise enough to guide revision without giving away the answer. The refinement model has to know when to change strategy rather than polish a bad approach. Getting this right required understanding why reasoning fails, not just when.
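Poetiq hasn't published its system, so the details above are all we have to go on. But the generate–feedback–revise cycle can be sketched in miniature. In this toy version, the "model" is a stub function and the feedback is a list of failing positions; a real system would route both steps through an LLM, and the function names here (`get_feedback`, `refine`, `refinement_loop`) are invented for illustration.

```python
# Illustrative sketch of an iterative refinement loop.
# Poetiq's actual feedback format and revision strategy are not public;
# this only shows the control flow: generate -> check -> revise -> repeat.

def get_feedback(candidate, checker):
    """Structured feedback: which positions fail the checker."""
    return [i for i, cell in enumerate(candidate) if not checker(i, cell)]

def refine(candidate, feedback, revise):
    """Revise only the flagged positions, leaving correct ones alone."""
    out = list(candidate)
    for i in feedback:
        out[i] = revise(i, out[i])
    return out

def refinement_loop(initial, checker, revise, max_iters=10):
    candidate = initial
    for _ in range(max_iters):
        feedback = get_feedback(candidate, checker)
        if not feedback:          # no errors left: accept the solution
            return candidate
        candidate = refine(candidate, feedback, revise)
    return candidate

# Toy usage: repair a row of a puzzle grid by flipping flagged cells.
target = [0, 1, 1, 0, 1]
checker = lambda i, v: v == target[i]
revise = lambda i, v: 1 - v       # stand-in for an LLM revision step
print(refinement_loop([0, 0, 0, 0, 0], checker, revise))  # [0, 1, 1, 0, 1]
```

The key design property, per the article, is that the feedback pinpoints *what* is wrong without handing over the answer; here the checker only flags positions, and the revise step has to supply the fix.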

ARC Prize, the nonprofit that maintains the benchmark, independently verified the score. This isn't a self-reported leaderboard claim. The problems are held out, the evaluation is standardized, and Poetiq's system was tested on puzzles it had never encountered during development.

Why This Matters More Than Another Benchmark Win

ARC-AGI-2 exists because every other AI benchmark is saturated. MMLU, the multi-subject knowledge exam that once separated frontier models from the rest, now sees scores above 90% from dozens of systems. HumanEval, the coding benchmark, is similarly topped out. When every model aces the test, the test stops measuring anything.

ARC-AGI-2 was designed to resist this. The problems require abstraction — recognizing that a pattern in colored squares represents a rule, then applying that rule to a new configuration. Language models trained on internet text have no shortcut here. The benchmark is adversarially constructed to defeat memorization.

Google's Gemini 3 Deep Think hit 84.6% on the original ARC-AGI — but only 45% on ARC-AGI-2. The jump from version one to version two roughly halves every system's score. This is by design. Version two is calibrated so that an average human scores around 60%. The best AI systems are still below human average.

Poetiq's 54% doesn't mean machines reason like people. It means the gap is closing faster than expected, and the teams closing it aren't the ones with the largest GPU clusters.

The Compute Efficiency Signal

The $30 versus $77 cost difference carries a message the hyperscalers don't want amplified: throwing more compute at reasoning problems has diminishing returns. Google can afford to spend $77 per problem across millions of API calls. But the economics of deploying reasoning at scale — in autonomous agents, scientific research, engineering copilots — require costs an order of magnitude lower.

Poetiq's refinement loop works because it's selective about when to think harder. Most chain-of-thought systems burn tokens uniformly, reasoning just as hard about trivial sub-problems as critical decision points. Refinement allocates compute where the error is. This is closer to how humans actually solve problems — you don't reconsider every step equally when you're stuck. You find the bottleneck and focus there.
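The "allocate compute where the error is" idea can be made concrete with a small sketch. Nothing below comes from Poetiq; the error estimates, budget units, and `allocate_budget` function are all invented to illustrate proportional allocation versus uniform spending.

```python
# Hedged sketch: split a fixed compute budget across sub-problems in
# proportion to estimated error, rather than spending it uniformly.

def allocate_budget(error_estimates, total_budget):
    """Each sub-problem gets a floor of 1 unit, then the remainder
    is divided in proportion to its estimated error."""
    floors = [1] * len(error_estimates)
    remaining = total_budget - sum(floors)
    total_err = sum(error_estimates) or 1.0
    extra = [int(remaining * e / total_err) for e in error_estimates]
    return [f + x for f, x in zip(floors, extra)]

# One step is the bottleneck; it should absorb most of the budget.
errors = [0.05, 0.70, 0.10, 0.15]
print(allocate_budget(errors, 100))  # [5, 68, 10, 15]
```

Uniform chain-of-thought would spend 25 units per step here; the bottleneck step instead gets roughly two-thirds of the budget, which is the efficiency argument the article is making.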

Seventeen companies have now raised $100 million or more for AI in the first eight weeks of 2026. Most are racing to train larger models on more data with more GPUs. Poetiq's result suggests the next frontier isn't scale. It's architecture.

Six people. Thirty dollars per problem. And a question the rest of the industry would rather not answer: what exactly are the other billions buying?


If you're building with AI, check out my AI Prompt Engineering Toolkit on Polar.sh — battle-tested prompts for developers who want better outputs from any LLM.
