pulkitgovrani

Posted on May 24

Gemma 4 Scored 89.2% on AIME. Here's Why That Number Should Change How You Think About Open-Source AI

#devchallenge #gemmachallenge #gemma

Gemma 4 Challenge: Write about Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

AIME — the American Invitational Mathematics Examination — is the test given to the top 5% of high school math competitors in the US. The problems require multi-step proof construction, elegant reasoning, and a comfort with number theory, combinatorics, and geometry that most adults don't have.

Gemma 3 scored 20.8% on AIME 2026.

Gemma 4 scored 89.2%.

That's not an incremental improvement. That's a qualitative category change — and it happened in one model generation, in an open-weight model that runs on a consumer GPU.

Here's what I think that actually means.

The Numbers, In Full

Don't just take AIME. Look at the whole picture:

Benchmark	Gemma 3 27B	Gemma 4 31B	What it measures
AIME 2026	20.8%	89.2%	Competition math
GPQA Diamond	42.4%	84.3%	Expert science QA
Codeforces ELO	110	2150	Competitive programming
Agentic Tool Use	6.6%	86.4%	Multi-step tool calling
MMLU Pro	—	85.2%	Professional knowledge
LiveCodeBench v6	—	80.0%	Real-world coding

A Codeforces ELO of 2150 is Grandmaster level — top 0.1% of competitive programmers globally. Gemma 3 at ELO 110 was essentially a beginner. Gemma 4 at 2150 would beat virtually every professional software engineer in a competitive programming contest.

The agentic tool use jump — 6.6% to 86.4% — is the one that matters most for developers. That's not an academic benchmark. That's the model's ability to chain together tool calls, handle errors, and complete multi-step tasks autonomously. An agent that succeeds 86% of the time on agentic tasks is a practical agent. One that succeeds 6.6% of the time is a toy.

What Changed Between Gemma 3 and Gemma 4

This wasn't just more compute and more data. The architectural and training changes were substantial:

Thinking mode. Gemma 4 was trained with chain-of-thought reasoning built in — up to 4,000+ tokens of working through a problem before committing to an answer. AIME at 20.8% is what you get from a model that answers immediately. AIME at 89.2% is what you get from a model that has 4,000 tokens of scratch paper.

Native function calling. Agentic tool use going from 6.6% to 86.4% is almost entirely explained by this. Gemma 3 wasn't trained for function calling — it was prompted into it. Gemma 4 was trained with tool use as a first-class objective.

MoE architecture. The 26B A4B MoE model achieves 88.3% on AIME — nearly matching the 31B dense model — while activating only 4B parameters per token. The implication is that expert specialization is doing real work: math problems route to math-specialized experts.

256K context. Multi-step reasoning problems often require holding a complex state across many reasoning steps. More context = less information loss as the reasoning chain grows.

None of these are incremental improvements to the same approach. They're a different approach.

The Open-Source Gap Is Closing Faster Than Anyone Expected

A year ago, the conventional wisdom was: open-source models are 6-12 months behind the frontier, they'll stay there, and for anything serious you need GPT-4 or Claude.

Here's what Gemma 4 31B benchmarks against:

Model	AIME 2026	GPQA Diamond	Codeforces ELO
Gemma 4 31B	89.2%	84.3%	2150
GPT-4o (May '24)	~56%	~53%	~900
Claude 3.5 Sonnet	~68%	~65%	~1200

I want to be careful here: these benchmarks are not identical versions tested simultaneously, and model capabilities change with updates. The point isn't "Gemma 4 beats GPT-4o on everything." It's that a locally-runnable, open-weight model is now in the same conversation as frontier commercial models on the hardest reasoning tasks.

A year ago that was not true. The gap was not closing this fast.

Why This Matters for Developers Specifically

You can run reasoning-capable AI on your hardware.

The 26B A4B fits on a 16GB GPU at 4-bit quantization and achieves 88.3% AIME. That's not a cloud service. That's not a $200/month subscription. It's an Ollama command.

The data never leaves your machine.

For a lot of real reasoning tasks — auditing financial models, analyzing proprietary codebases, processing sensitive documents — the reason you don't send them to GPT-4 isn't cost. It's that the data can't leave your infrastructure. A locally-runnable model at this capability level removes that barrier.

The licensing is Apache 2.0.

Build with it commercially. Fine-tune it. Distribute it. The benchmark improvement doesn't come with new licensing restrictions.

Agents that actually work.

86.4% agentic tool use success rate means you can build multi-step automated pipelines that are reliable enough to deploy. 6.6% means you're debugging agent failures constantly. This is the practical inflection point for building AI agents with an open-weight model.

The Honest Counterpoint

Benchmark performance and real-world task performance are not the same thing.

AIME 89.2% tells you the model can solve structured, well-defined math problems with clear right/wrong answers. It says less about:

Ambiguous tasks where the "correct" answer is subjective
Novel problem types not represented in training
Long-running, multi-day autonomous tasks
Tasks that require external world knowledge past the training cutoff

Codeforces ELO 2150 tells you the model writes excellent competitive programming solutions. It says less about:

Large-scale software architecture decisions
Debugging complex distributed systems
Understanding poorly-documented legacy code

The model is genuinely excellent at structured reasoning. It's not a general-purpose replacement for a senior engineer. These things are both true simultaneously.

What I Think This Actually Signals

The story of AI progress so far has been: capability concentrates at the frontier, the frontier is closed, open-source catches up slowly. The assumption baked into most developer tooling decisions is that if you need serious capability, you go to the API.

Gemma 4 disrupts that assumption. Not completely — the absolute frontier is still ahead — but enough to change the calculus for a large class of applications.

If your application needs:

Math or logical reasoning
Code generation and review
Tool-calling agents
Structured information extraction

...then Gemma 4 is now a legitimate option where it simply wasn't before. Not a compromise. Not "good enough for prototyping." A legitimate option.

The AIME score is a proxy for something more important: the capability level where local, private, open-weight AI becomes the right choice for production use cases, not just experimentation. Gemma 4 crossed it.

That's the story the 89.2% is telling.