This is a submission for the Gemma 4 Challenge: Write About Gemma 4
AIME — the American Invitational Mathematics Examination — is the test given to the top 5% of high school math competitors in the US. The problems require multi-step proof construction, elegant reasoning, and a comfort with number theory, combinatorics, and geometry that most adults don't have.
Gemma 3 scored 20.8% on AIME 2026.
Gemma 4 scored 89.2%.
That's not an incremental improvement. That's a qualitative category change — and it happened in one model generation, in an open-weight model that runs on a consumer GPU.
Here's what I think that actually means.
The Numbers, In Full
Don't just take AIME. Look at the whole picture:
| Benchmark | Gemma 3 27B | Gemma 4 31B | What it measures |
|---|---|---|---|
| AIME 2026 | 20.8% | 89.2% | Competition math |
| GPQA Diamond | 42.4% | 84.3% | Expert science QA |
| Codeforces ELO | 110 | 2150 | Competitive programming |
| Agentic Tool Use | 6.6% | 86.4% | Multi-step tool calling |
| MMLU Pro | — | 85.2% | Professional knowledge |
| LiveCodeBench v6 | — | 80.0% | Real-world coding |
A Codeforces ELO of 2150 is Grandmaster level — top 0.1% of competitive programmers globally. Gemma 3 at ELO 110 was essentially a beginner. Gemma 4 at 2150 would beat virtually every professional software engineer in a competitive programming contest.
The agentic tool use jump — 6.6% to 86.4% — is the one that matters most for developers. That's not an academic benchmark. That's the model's ability to chain together tool calls, handle errors, and complete multi-step tasks autonomously. An agent that succeeds 86% of the time on agentic tasks is a practical agent. One that succeeds 6.6% of the time is a toy.
What Changed Between Gemma 3 and Gemma 4
This wasn't just more compute and more data. The architectural and training changes were substantial:
Thinking mode. Gemma 4 was trained with chain-of-thought reasoning built in — up to 4,000+ tokens of working through a problem before committing to an answer. AIME at 20.8% is what you get from a model that answers immediately. AIME at 89.2% is what you get from a model that has 4,000 tokens of scratch paper.
Native function calling. Agentic tool use going from 6.6% to 86.4% is almost entirely explained by this. Gemma 3 wasn't trained for function calling — it was prompted into it. Gemma 4 was trained with tool use as a first-class objective.
MoE architecture. The 26B A4B MoE model achieves 88.3% on AIME — nearly matching the 31B dense model — while activating only 4B parameters per token. The implication is that expert specialization is doing real work: math problems route to math-specialized experts.
256K context. Multi-step reasoning problems often require holding a complex state across many reasoning steps. More context = less information loss as the reasoning chain grows.
None of these are incremental improvements to the same approach. They're a different approach.
The Open-Source Gap Is Closing Faster Than Anyone Expected
A year ago, the conventional wisdom was: open-source models are 6-12 months behind the frontier, they'll stay there, and for anything serious you need GPT-4 or Claude.
Here's what Gemma 4 31B benchmarks against:
| Model | AIME 2026 | GPQA Diamond | Codeforces ELO |
|---|---|---|---|
| Gemma 4 31B | 89.2% | 84.3% | 2150 |
| GPT-4o (May '24) | ~56% | ~53% | ~900 |
| Claude 3.5 Sonnet | ~68% | ~65% | ~1200 |
I want to be careful here: these benchmarks are not identical versions tested simultaneously, and model capabilities change with updates. The point isn't "Gemma 4 beats GPT-4o on everything." It's that a locally-runnable, open-weight model is now in the same conversation as frontier commercial models on the hardest reasoning tasks.
A year ago that was not true. The gap was not closing this fast.
Why This Matters for Developers Specifically
You can run reasoning-capable AI on your hardware.
The 26B A4B fits on a 16GB GPU at 4-bit quantization and achieves 88.3% AIME. That's not a cloud service. That's not a $200/month subscription. It's an Ollama command.
The data never leaves your machine.
For a lot of real reasoning tasks — auditing financial models, analyzing proprietary codebases, processing sensitive documents — the reason you don't send them to GPT-4 isn't cost. It's that the data can't leave your infrastructure. A locally-runnable model at this capability level removes that barrier.
The licensing is Apache 2.0.
Build with it commercially. Fine-tune it. Distribute it. The benchmark improvement doesn't come with new licensing restrictions.
Agents that actually work.
86.4% agentic tool use success rate means you can build multi-step automated pipelines that are reliable enough to deploy. 6.6% means you're debugging agent failures constantly. This is the practical inflection point for building AI agents with an open-weight model.
The Honest Counterpoint
Benchmark performance and real-world task performance are not the same thing.
AIME 89.2% tells you the model can solve structured, well-defined math problems with clear right/wrong answers. It says less about:
- Ambiguous tasks where the "correct" answer is subjective
- Novel problem types not represented in training
- Long-running, multi-day autonomous tasks
- Tasks that require external world knowledge past the training cutoff
Codeforces ELO 2150 tells you the model writes excellent competitive programming solutions. It says less about:
- Large-scale software architecture decisions
- Debugging complex distributed systems
- Understanding poorly-documented legacy code
The model is genuinely excellent at structured reasoning. It's not a general-purpose replacement for a senior engineer. These things are both true simultaneously.
What I Think This Actually Signals
The story of AI progress so far has been: capability concentrates at the frontier, the frontier is closed, open-source catches up slowly. The assumption baked into most developer tooling decisions is that if you need serious capability, you go to the API.
Gemma 4 disrupts that assumption. Not completely — the absolute frontier is still ahead — but enough to change the calculus for a large class of applications.
If your application needs:
- Math or logical reasoning
- Code generation and review
- Tool-calling agents
- Structured information extraction
...then Gemma 4 is now a legitimate option where it simply wasn't before. Not a compromise. Not "good enough for prototyping." A legitimate option.
The AIME score is a proxy for something more important: the capability level where local, private, open-weight AI becomes the right choice for production use cases, not just experimentation. Gemma 4 crossed it.
That's the story the 89.2% is telling.
Top comments (0)