
TildAlice

Posted on • Originally published at tildalice.io

GPT-4 vs Claude 3.5 vs Gemini: MMLU Zero-Shot Accuracy

GPT-4 Barely Wins on MMLU — But the Margin Is Smaller Than You Think

OpenAI claims GPT-4 hits 86.4% on MMLU. Anthropic says Claude 3.5 Sonnet gets 88.7%. Google reports Gemini 1.5 Pro at 85.9%. These numbers are plastered across every model card and benchmark leaderboard — but they're not measuring the same thing.

I ran zero-shot MMLU evals on all three models using the same 1,000-question subset from the original Hendrycks et al. (2021) dataset. Once you control for prompt format, temperature, and sampling, the gaps shrink to 2-3 percentage points.
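For anyone who wants to run something similar, here's a minimal sketch of that setup against the Hugging Face cais/mmlu mirror, with the OpenAI Python client standing in as the example backend. The seed, prompt wording, and temperature=0 here are illustrative choices, not the exact harness behind my numbers; Claude and Gemini follow the same pattern through their own SDKs.

```python
# Minimal sketch: fixed 1,000-question zero-shot MMLU subset, one model.
# Assumes the `datasets` and `openai` packages; seed, prompt wording, and
# temperature=0 are illustrative choices, not the exact harness I used.
import random

from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Load the MMLU test split (14k+ questions across 57 subjects).
mmlu = load_dataset("cais/mmlu", "all", split="test")

# Deterministic subset so every model answers exactly the same 1,000 items.
rng = random.Random(42)
subset = mmlu.select(rng.sample(range(len(mmlu)), 1000))

LETTERS = ["A", "B", "C", "D"]

def format_prompt(item):
    """One shared zero-shot prompt format for every model."""
    options = "\n".join(f"{l}. {c}" for l, c in zip(LETTERS, item["choices"]))
    return (
        f"Answer the following multiple-choice question about "
        f"{item['subject'].replace('_', ' ')}.\n"
        f"Respond with a single letter: A, B, C, or D.\n\n"
        f"{item['question']}\n{options}\nAnswer:"
    )

correct = 0
for item in subset:
    resp = client.chat.completions.create(
        model="gpt-4",                      # swap in the model under test
        messages=[{"role": "user", "content": format_prompt(item)}],
        temperature=0,                      # no sampling noise between runs
        max_tokens=1,
    )
    pred = (resp.choices[0].message.content or "").strip().upper()[:1]
    correct += pred == LETTERS[item["answer"]]

print(f"Zero-shot accuracy: {correct / len(subset):.1%}")
```

Pinning the subset with a seed is what makes the runs comparable: every model sees the same 1,000 questions rendered as the same prompt string, so the remaining differences are the models, not the eval.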

Here's what actually matters: which model gets the hard questions right.


Photo by Markus Winkler on Pexels

What MMLU Actually Tests (And Why Everyone Cites It)

MMLU — Massive Multitask Language Understanding — is 15,908 multiple-choice questions spanning 57 subjects: college chemistry, moral scenarios, high school math, US foreign policy, professional law. It's the de facto standard for LLM evaluation because it's (1) comprehensive, (2) human-validated, and (3) hard enough that GPT-3 scored only 43.9%.
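If you want to see the raw structure, each item is a question stem, four options, and an answer index, tagged with one of the 57 subjects. A quick way to poke at the test split (again assuming the Hugging Face cais/mmlu mirror):

```python
# Inspect MMLU's test split via the Hugging Face mirror (cais/mmlu).
from collections import Counter

from datasets import load_dataset

test = load_dataset("cais/mmlu", "all", split="test")

print(len(test))                                # 14k+ test questions; the 15,908 figure
                                                # also counts the dev/validation splits
print(Counter(test["subject"]).most_common(3))  # per-subject question counts

item = test[0]
print(item["subject"])    # e.g. "abstract_algebra"
print(item["question"])   # the question stem
print(item["choices"])    # list of four answer options
print(item["answer"])     # integer index (0-3) of the correct option
```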


Continue reading the full article on TildAlice
