<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Muzammil Ibrahim</title>
    <description>The latest articles on DEV Community by Muzammil Ibrahim (@muzammil-13).</description>
    <link>https://dev.to/muzammil-13</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1449962%2F60abe38b-79dd-4231-997d-a595e0e9b2c0.jpg</url>
      <title>DEV Community: Muzammil Ibrahim</title>
      <link>https://dev.to/muzammil-13</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/muzammil-13"/>
    <language>en</language>
    <item>
      <title>Proof or Bluff? Why Today's AI Still Fails the Math Olympiad Test</title>
      <dc:creator>Muzammil Ibrahim</dc:creator>
      <pubDate>Sat, 03 May 2025 04:44:18 +0000</pubDate>
      <link>https://dev.to/muzammil-13/proof-or-bluff-why-todays-ai-still-fails-the-math-olympiad-test-4pom</link>
      <guid>https://dev.to/muzammil-13/proof-or-bluff-why-todays-ai-still-fails-the-math-olympiad-test-4pom</guid>
<description>&lt;p&gt;Can today’s most advanced AI models really solve math like a human genius? Recent benchmarks show impressive results on problems like those from the AIME and HMMT competitions. But those tasks mostly require a final answer, not a full, rigorous proof.&lt;/p&gt;

&lt;p&gt;That’s where the new study, “Proof or Bluff?” from ETH Zurich and INSAIT, comes in. The researchers challenged top-tier language models, including Gemini-2.5-PRO, Claude 3.7, and Grok-3, with the 2025 USAMO (USA Mathematical Olympiad), a competition famous for demanding deep insight and bulletproof logic.&lt;/p&gt;

&lt;h2&gt;The Verdict? AI Still Flops on Hard Math&lt;/h2&gt;

&lt;p&gt;Even the best model, Gemini-2.5-PRO, averaged only 10.1 out of 42 points, roughly 24% of the maximum. Every other model scored below 5%. That’s nowhere near human Olympiad-level performance.&lt;/p&gt;
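
&lt;p&gt;For scale: the USAMO awards up to 7 points on each of its 6 problems, 42 in total. A minimal sketch of the conversion (the helper below is mine, not the paper’s):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative helper, not from the paper: convert a raw USAMO score
# (6 problems x 7 points = 42) into a percentage of the maximum.
def pct(raw, max_points=42):
    return 100.0 * raw / max_points

print(round(pct(10.1), 1))  # 24.0 -- Gemini-2.5-PRO's average
print(round(pct(2.0), 1))   # 4.8  -- a made-up score under the 5% line
&lt;/code&gt;&lt;/pre&gt;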

&lt;h2&gt;Why They Failed&lt;/h2&gt;

&lt;p&gt;Human judges (all former IMO finalists) identified four common failure patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Flawed logic: Skipping reasoning steps or drawing false conclusions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wrong assumptions: Using unsupported ideas to bridge gaps.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Low creativity: Sticking to one (wrong) strategy across multiple runs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hallucinations: Making up citations or boxing trivial answers due to training biases.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More ironic still: many models confidently claimed they had solved the problem, even when their logic was clearly broken.&lt;/p&gt;

&lt;h2&gt;The Hidden Bias of Optimization&lt;/h2&gt;

&lt;p&gt;Training techniques like reinforcement learning (RLHF, GRPO) push models to "box the final answer," even when the problem asks for a proof rather than a number. Worse, models like QwQ and Gemini fabricated academic-sounding theorems that don’t exist, just to sound convincing.&lt;/p&gt;
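
&lt;p&gt;To make that incentive concrete, here’s a toy reward check in the spirit of final-answer RL pipelines (my own sketch, not the training code of any of these models). Notice that it never looks at the reasoning: a bluffed proof with the right box earns full reward.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

# Hypothetical, simplified final-answer reward, for illustration only.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def toy_reward(output, reference):
    """1.0 iff the boxed answer matches the reference.
    Nothing here checks whether the surrounding proof is sound."""
    m = BOXED.search(output)
    if m is None:
        return 0.0  # no boxed answer, no reward
    return 1.0 if m.group(1).strip() == reference else 0.0

# A hand-wavy "proof" with the correct final box still scores full marks:
print(toy_reward(r"Clearly the answer is \boxed{7}.", "7"))  # 1.0
&lt;/code&gt;&lt;/pre&gt;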

&lt;h2&gt;Automated Grading? Not Yet.&lt;/h2&gt;

&lt;p&gt;The team also tried using LLMs to grade each other’s proofs. It’s an appealing idea, but the machine-assigned scores were inflated by up to 20x: the models couldn’t distinguish a shallow bluff from real insight.&lt;/p&gt;
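
&lt;p&gt;The 20x figure is just the ratio of machine-awarded points to human-awarded points. With made-up numbers, purely for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical scores, only to show what "inflated by up to 20x" means:
# the LLM judge awards far more points than the expert human graders did.
human_points = 0.5        # points the human graders actually awarded
llm_judge_points = 10.0   # points an LLM judging the same proofs awarded

inflation = llm_judge_points / human_points
print(f"{inflation:.0f}x inflation")  # 20x inflation
&lt;/code&gt;&lt;/pre&gt;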

&lt;h2&gt;What This Means for AI + Math&lt;/h2&gt;

&lt;p&gt;This paper sends a clear signal: today’s LLMs aren’t ready for formal mathematical reasoning that demands proof, creativity, and logical precision. We’re seeing polished performance on shallow tasks, but in-depth reasoning remains out of reach.&lt;/p&gt;

&lt;h2&gt;The Road Ahead&lt;/h2&gt;

&lt;p&gt;To build truly trustworthy AI mathematicians, we need a next-generation leap beyond pattern matching and into genuine, verifiable reasoning. Whether that comes through better alignment, curriculum learning, or symbolic tools, the future of math + AI is still wide open.&lt;/p&gt;

&lt;p&gt;Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://matharena.ai/" rel="noopener noreferrer"&gt;matharena&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/eth-sri/matharena" rel="noopener noreferrer"&gt;github&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; AI models can bluff their way through final-answer math, but real, Olympiad-level proofs break them down. We’re not in the age of automated mathematicians yet, but this research is a solid step toward that future.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>mathematics</category>
    </item>
  </channel>
</rss>
