
Claude Just Beat ChatGPT on Benchmarks - How Long Will It Last?

Originally published on DropThe.org.


The Shrinking Throne: A History of AI Benchmark Kings

GPT-4 held the crown for nearly a year. Claude Opus 4.6 might hold it for weeks. The pattern is clear: every new AI king rules for less time than the last.

On February 5, 2026, Anthropic released Claude Opus 4.6. It topped SWE-bench Verified at 80.8%, crushed ARC-AGI-2 at 68.8%, and posted an 84.0% on BrowseComp search. Nine days later, it still sits at or near the top of most leaderboards.

But how long will that last? History says: not very.

Every AI Model That Sat on the Throne

We tracked every major model release since late 2022 and measured how long each one held the top spot on key benchmarks before being overtaken. The results tell a story of accelerating displacement.

AI MODEL REIGN DURATION — Days as benchmark leader

ChatGPT (GPT-4): ~365 days
Claude 3 Opus: ~67 days
GPT-4o: ~92 days
Claude 3.5 Sonnet: ~80 days
o1-preview: ~90 days
DeepSeek R1: ~30 days
Gemini 2.5 Pro: ~40 days
Claude 3.5 Sonnet v2: ~50 days
Claude 4 Opus: ~42 days
Gemini 3 Pro: ~35 days
GPT-5.2: ~37 days
Claude Opus 4.6: 9+ days and counting

Source: DropThe.org analysis of benchmark leadership periods

DROPTHE INTEL
GPT-4 held the AI crown for roughly 365 days. By late 2025, new models were being dethroned in under 40 days. At this rate, the 2027 benchmark leader might hold the title for a single week.

The numbers tell the whole story. GPT-4 reigned for roughly a year. By 2025, no model held the top spot for more than three months. In 2026, we measure reigns in weeks.
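For what it's worth, the "at this rate" extrapolation is easy to reproduce. Below is a minimal sketch, assuming only the reign durations listed above and a simple exponential-decay fit over release order in NumPy; the release index chosen for the hypothetical 2027 leader is our own assumption, not DropThe's methodology.

import numpy as np

# Reign durations in days, in release order, taken from the list above.
# The current leader's open-ended "9+ days" is excluded from the fit.
days = np.array([365, 67, 92, 80, 90, 30, 40, 50, 42, 35, 37])

# Fit an exponential decay over release order: days ≈ a * exp(b * n),
# via a least-squares line on log(days).
n = np.arange(len(days))
b, log_a = np.polyfit(n, np.log(days), 1)

print(f"Fitted shrinkage per new leader: ~{np.exp(b):.2f}x")

# Project a hypothetical 2027-era leader a few releases further out.
# The index 16 here is purely an assumption for illustration.
projected = np.exp(log_a + b * 16)
print(f"Projected reign for a leader at release #17: ~{projected:.0f} days")

On this toy fit, each reign comes out roughly 15 percent shorter than the one before it, which lands a 2027-era leader in the ten-day range, broadly in line with the single-week projection above. Whether that decay actually continues is the open question.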

The Numbers Behind the Throne

Claude Opus 4.6 posted strong numbers across the board. On SWE-bench Verified, the gold standard for real-world coding ability, it scored 80.8%. On ARC-AGI-2, a test of abstract reasoning on novel problems, it hit 68.8%, nearly double Opus 4.5's 37.6% and well ahead of Google's Gemini 3 Pro at 45.1%.

The agentic benchmarks are where it really pulls away. BrowseComp search: 84.0%, up from 67.8% for Opus 4.5. OSWorld computer use: 72.7%. Tool orchestration on tau-2-bench Retail: 91.9%. These are not incremental gains. They represent a different class of capability.

DropThe Data: Claude Opus 4.6 scores 68.8% on ARC-AGI-2, nearly double its predecessor and 14.6 points ahead of GPT-5.2 Pro’s 54.2%. But GPT-5.2 Pro still leads on Humanity’s Last Exam at 50.0% vs Claude’s 40.0%. No single model dominates every benchmark anymore.


This is an excerpt. Read the full analysis with charts and data on DropThe.org


About DropThe

DropThe is a data platform tracking 1.83 million entities across movies, games, companies, people, and crypto — connected by 2.18 million knowledge graph links. We don't guess. We count.

dropthe.org
