Gemini 3.1 Pro: 77% ARC-AGI-2 Score, Full Benchmarks vs Claude Opus 4.6 & GPT-5.2 (2026)

Google just dropped Gemini 3.1 Pro — and the numbers are staggering. With a 77.1% score on ARC-AGI-2, more than doubling its predecessor's 31.1%, this isn't an incremental update. It's a generational leap that puts Google firmly ahead of both Anthropic and OpenAI on key reasoning benchmarks.

If you're a developer, researcher, or AI enthusiast trying to figure out whether Gemini 3.1 Pro is worth switching to, this article breaks down every benchmark, compares it head-to-head with Claude Opus 4.6 and GPT-5.2, and helps you decide whether it's time to move.

What Is Google Gemini 3.1 Pro?

Gemini 3.1 Pro is the next iteration in Google's Gemini 3 series, building on Gemini 3 Pro, which launched in November 2025. It's Google's most advanced model for complex tasks — a natively multimodal system capable of processing text, audio, images, video, and entire code repositories in a single context.

The headline specs:

  • 1 million token context window — process entire codebases, lengthy documents, or hours of video in one pass (a minimal API sketch follows this list)
  • 64,000 token output — generate complete applications, comprehensive reports, or detailed analyses without truncation
  • Natively multimodal — text, audio, images, video, and code are all first-class inputs
  • Released February 19, 2026 — available today in preview
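
To get a feel for what the 1M-token window means in practice, here is a minimal sketch using the google-genai Python SDK (pip install google-genai). The model ID gemini-3.1-pro-preview is an assumption on my part; check Google's model list for the actual preview identifier.

```python
# Minimal sketch: one-pass analysis of an entire codebase.
# Assumes the google-genai SDK and an assumed preview model ID.
import pathlib

from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Concatenate every Python file in a repo into a single prompt.
# The 1M-token window is what makes this single-pass approach viable.
sources = "\n\n".join(
    f"# {path}\n{path.read_text(errors='ignore')}"
    for path in pathlib.Path("my_repo").rglob("*.py")
)

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed preview ID, not confirmed
    contents=f"Summarize the architecture of this codebase:\n\n{sources}",
)
print(response.text)
```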

The Headline Number: 77.1% on ARC-AGI-2

Gemini 3.1 Pro scored 77.1% on ARC-AGI-2 (verified) — up from Gemini 3 Pro's 31.1%. That's a 148% increase in a single generation.

ARC-AGI-2 measures genuine reasoning ability — the kind of fluid intelligence that separates understanding from pattern matching.

To put 77.1% in context:

  • Claude Opus 4.6 scores 68.8% — Gemini leads by 8.3 points
  • GPT-5.2 scores 52.9% — Gemini leads by 24.2 points
  • Claude Sonnet 4.6 scores 58.3% — Gemini leads by 18.8 points

Full Benchmark Breakdown

Benchmark                 Gemini 3.1 Pro   Gemini 3 Pro   Sonnet 4.6   Opus 4.6   GPT-5.2
HLE (no tools)            44.4%            37.5%          33.2%        40.0%      34.5%
HLE (Search+Code)         51.4%            45.8%          49.0%        53.1%      45.5%
ARC-AGI-2                 77.1%            31.1%          58.3%        68.8%      52.9%
GPQA Diamond              94.3%            91.9%          89.9%        91.3%      92.4%
Terminal-Bench 2.0        68.5%            56.9%          59.1%        65.4%      54.0%
SWE-Bench Verified        80.6%            76.2%          79.6%        80.8%      80.0%
SWE-Bench Pro             54.2%            43.3%          —            —          55.6%
LiveCodeBench Pro (Elo)   2887             2439           —            —          2393
SciCode                   59%              56%            47%          52%        52%
APEX-Agents               33.5%            18.4%          —            29.8%      23.0%
BrowseComp                85.9%            59.2%          74.7%        84.0%      65.8%
MMMU-Pro                  80.5%            81.0%          74.5%        73.9%      79.5%
MMMLU                     92.6%            91.8%          89.3%        91.1%      89.6%

(— = no score reported for that model)

Where Gemini 3.1 Pro Dominates

  • ARC-AGI-2 (77.1%) — Largest gap between frontier models on any major reasoning benchmark
  • BrowseComp (85.9%) — Web browsing comprehension, 20+ points ahead of GPT-5.2
  • MCP Atlas (69.2%) — Tool-using AI agents, ~10 points above Opus 4.6
  • APEX-Agents (33.5%) — Agentic task completion, leads all competitors
  • LiveCodeBench Pro (2887 Elo) — Competitive programming, the highest Elo of any model in this comparison

Where Competitors Still Lead

  • HLE with tools — Opus 4.6 leads (53.1% vs 51.4%)
  • SWE-Bench Verified — Opus 4.6 leads by 0.2 points
  • SWE-Bench Pro — GPT-5.2 leads (55.6% vs 54.2%)

The Gen-Over-Gen Improvement

Benchmark        Gemini 3 Pro   Gemini 3.1 Pro   Improvement
ARC-AGI-2        31.1%          77.1%            +46.0 pts (+148%)
BrowseComp       59.2%          85.9%            +26.7 pts
MCP Atlas        54.1%          69.2%            +15.1 pts
APEX-Agents      18.4%          33.5%            +15.1 pts (+82%)
Terminal-Bench   56.9%          68.5%            +11.6 pts
LiveCodeBench    2439 Elo       2887 Elo         +448 Elo

The ARC-AGI-2 jump is historic. Going from 31.1% to 77.1% in ~3 months suggests fundamental architectural or training breakthroughs.
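
If you want to sanity-check those figures, the arithmetic is the same as for the head-to-head gaps quoted earlier: the absolute gain in points plus the relative gain over the old score. A quick sketch using the numbers from the table above:

```python
# Reproduce the gen-over-gen improvement column from the table above.
scores = {
    # benchmark: (Gemini 3 Pro, Gemini 3.1 Pro)
    "ARC-AGI-2":      (31.1, 77.1),
    "BrowseComp":     (59.2, 85.9),
    "MCP Atlas":      (54.1, 69.2),
    "APEX-Agents":    (18.4, 33.5),
    "Terminal-Bench": (56.9, 68.5),
}

for name, (old, new) in scores.items():
    delta = new - old                    # absolute gain in points
    relative = (new - old) / old * 100   # relative gain in percent
    print(f"{name:14s} +{delta:.1f} pts ({relative:+.0f}%)")

# ARC-AGI-2 works out to +46.0 pts, a +148% relative jump.
```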

What This Means for Agentic AI

Three benchmarks signal where the industry is heading: APEX-Agents, MCP Atlas, and BrowseComp. Gemini 3.1 Pro leads all three by significant margins.

2026 is the year of AI agents. Models aren't just answering questions — they're browsing the web, executing code, using tools via MCP, and completing multi-step workflows autonomously.
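
To make that concrete at the API level, here is a hedged sketch of a single tool-using step built on the google-genai SDK's automatic function calling. The model ID is again an assumed preview name, and get_open_tickets is a hypothetical helper invented purely for illustration.

```python
# Sketch of one agentic step: the model decides whether to call a tool.
# get_open_tickets is a hypothetical helper; the model ID is an assumed
# preview name -- swap in whatever your project actually uses.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

def get_open_tickets(customer_id: str) -> list[dict]:
    """Return open support tickets for a customer (stub for illustration)."""
    return [{"id": "T-1042", "subject": "Billing discrepancy", "age_days": 3}]

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed preview ID, not confirmed
    contents="Customer 8841 says they were double-charged. What should I check first?",
    # With automatic function calling, the SDK runs the Python function when
    # the model requests it and feeds the result back for a final answer.
    config=types.GenerateContentConfig(tools=[get_open_tickets]),
)
print(response.text)
```

Loops of steps like this one are what benchmarks such as APEX-Agents and MCP Atlas are probing.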

For developers building AI agents — whether for customer support, research automation, code generation, or data analysis — these results matter.

Who Should Upgrade?

Definitely upgrade if you:

  • Build AI agents (best APEX-Agents + MCP Atlas scores)
  • Need strong abstract reasoning (77.1% ARC-AGI-2 is unmatched)
  • Work with large codebases (1M context + 64K output)
  • Do multimodal work (native text/audio/image/video/code)
  • Need competitive programming-level code (2887 Elo)

Consider staying with Opus 4.6 if you:

  • Rely on tool-augmented reasoning (slight edge on HLE with tools)
  • Focus on SWE-Bench-style tasks (essentially tied)
  • Are deeply invested in Anthropic's ecosystem

Free users: Gemini 3.1 Pro is available free in the Gemini app, so you can test the top-scoring reasoning model on these benchmarks without spending a dollar.

Bottom Line

Gemini 3.1 Pro is Google's strongest model release to date. The 77.1% ARC-AGI-2 score is the headline, but the consistent dominance across reasoning, agentic, browsing, and coding benchmarks tells a broader story: Google has gone from playing catch-up to leading on multiple fronts.

The AI model race just got a lot more interesting.


Originally published on Serenities AI
