Every time a new AI model drops, the same ritual plays out. The leaderboard updates. Twitter erupts. Someone posts a chart showing Model X beat Model Y by 2.3% on MMLU. People make purchasing decisions based on these numbers.
And I think most of it is nonsense.
I don't say this lightly. I've spent the last year building OpenMark, a platform that lets you benchmark AI models on your own tasks with deterministic scoring and real API cost tracking. The deeper I go into benchmarking, the more I realize how fundamentally broken the way we evaluate AI models is.
Let me show you what I mean with real data.
The Experiment: Can AI Read Human Emotions?
I took four movie stills, scenes most humans would immediately recognize, and asked 10 AI models to identify the emotion in each. The twist: increasing complexity.
- Julia Roberts, Pretty Woman: obviously happy. Baseline.
- Matthew McConaughey, Interstellar: obviously sad. Still straightforward.
- Michael Scott, The Office: happy but teary-eyed expression? This is where it gets ambiguous.
- Joaquin Phoenix, Joker: neutral expression (cover picture). The Joker makeup really interferes with the models' ability to read what is going on.
Each model ran the task 3 times (so 12 total calls per model, 4 images × 3 runs) with stability tracking. Here's what happened:
The Results
| Model | Score | Stability | Cost/task |
|---|---|---|---|
| gpt-5.2 (OpenAI) | 75% | ±0.000 | $0.0085 |
| gemini-3-pro (Google) | 75% | ±0.000 | $0.0614 |
| gemini-3-flash (Google) | 68% | ±1.000 | $0.0060 |
| grok-4-1-fast (xAI) | 57% | ±1.000 | $0.0009 |
| sonar (Perplexity) | 57% | ±1.000 | $0.0256 |
| llama4-maverick (Meta) | 50% | ±0.000 | $0.0020 |
| Qwen3.5-397B (Alibaba) | 50% | ±0.000 | $0.0073 |
| claude-sonnet-4.6 (Anthropic) | 50% | ±0.000 | $0.0148 |
| claude-opus-4.6 (Anthropic) | 50% | ±0.000 | $0.0246 |
| mistral-medium (Mistral) | 42% | ±1.000 | $0.0022 |
10 models. Real API costs. 3 runs per model for stability. Exported from OpenMark.
Now, stop and look at this data.
The Most Expensive Model Tied With One Costing 12x Less
Claude Opus 4.6, Anthropic's flagship in the "Very High" pricing tier at $0.025 per task, scored exactly 50%. The same score as Llama 4 Maverick at $0.002 per task.
That's a 12x price difference for identical performance.
On any generic leaderboard, Opus 4.6 ranks significantly above Maverick. MMLU, HumanEval, MATH: Opus wins on all of them. And yet, on this specific task, with this specific prompt, the budget model matched the premium one perfectly.
If you were making a purchasing decision based on leaderboard rankings, you'd be overpaying by 1,200%.
Half the Models Changed Their Mind
Look at the stability column. Half the models scored ±0.000: they gave the exact same answer every single run. The other half scored ±1.000: they literally changed their interpretation of the same image across runs.
Gemini 3 Flash, Grok, Sonar, Mistral: all unstable. Same image, same prompt, different answer depending on when you ask.
This is why single-run benchmarks are fundamentally meaningless. If your model can't give the same answer twice, what exactly did your benchmark measure? The model's capability? Or just... luck?
The 80,000-Call Problem (And Why Every Leaderboard Is Lying to You)
Here's where I get genuinely frustrated.
To properly benchmark a model on a single task, you'd need to account for:
- Stability: Run each prompt at least 10 times to get reliable variance data
- Language variation: Test across at least 20 languages (tokenization affects reasoning)
- Syntax variation: Rephrase the same question 20 different ways (formal, casual, terse, verbose, with typos, without)
- Prompt variation: 20 fundamentally different phrasings of the same underlying question
That's 10 × 20 × 20 × 20 = 80,000 calls. For one task. On one model.
And this is conservative. Add tool use? Multiply by another N. Add multimodal inputs? Another N. Different system prompts? Another N. You're easily at 500,000+ calls to truly benchmark one model on one capability.
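The arithmetic above is trivial, but seeing it multiply out makes the point. This sketch uses exactly the counts from the list (which are the article's assumptions about what "proper" coverage would take, not measured requirements):

```python
# Back-of-the-envelope call count for "properly" benchmarking one model
# on one task, using the dimensions listed above. All counts are the
# article's assumptions, not measured requirements.
runs_for_stability = 10
languages = 20
syntax_variants = 20
prompt_variants = 20

base_calls = runs_for_stability * languages * syntax_variants * prompt_variants
print(base_calls)  # 80000

# Each extra axis (tool use, modalities, system prompts, ...) multiplies
# the total again. Even small extra axes explode the count quickly:
for extra_axis_size in (2, 3, 5):
    print(f"x{extra_axis_size} axis -> {base_calls * extra_axis_size} calls")
```

One extra five-way axis already puts you at 400,000 calls, which is how the 500,000+ figure below becomes plausible.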
No leaderboard does this. Not MMLU. Not HumanEval. Not LMArena. Not SWE-bench. Why? Because it's not possible: the resources required to run 500,000 calls minimum, for each task, for each model, would be unfathomable. They run each question once, maybe a handful of times, and call it a score. Then people use that score to decide which model to bet their product on.
Brilliant researchers are out there designing these benchmarks, and I respect the work deeply. But the fundamental limitation isn't effort or intelligence, it's that you can never escape the prompt problem. The way you ask the question is the test, as much as the question itself.
The Car Wash Problem: When the Benchmark Is Dumber Than the Model
There's a popular "gotcha" making the rounds. The car wash problem:
"I need to get my car washed. The car wash is 100 meters away. Should I go by car or by foot?"
Many models say "by foot" because it's only 100 meters. And people hold this up triumphantly: "See? AI can't reason! You need your car at the car wash!"
But think about this for two seconds. The question is intentionally ambiguous. Maybe your car is already at the car wash. Maybe someone else is driving it there. Maybe you're asking about how you should get there, not the car. The question doesn't specify.
This isn't a model failure. It's a prompt failure. The question is designed to be misleading, and then we blame the model for being misled.
You know what's worse than the car wash problem? Ask humans what's heavier, 1 kg of feathers or 1 kg of lead. Way too many will say lead. And that's a question with an objectively correct, unambiguous answer, not an intentionally vague one. The car wash example is manufactured outrage, and people cling to it because it confirms the narrative that "AI isn't ready."
AI might not be ready. But the car wash problem doesn't prove it.
What Actually Matters
Here's what I've learned from building a benchmarking platform and running thousands of model evaluations:
The only benchmark that matters is yours.
Not MMLU. Not HumanEval. Not some leaderboard aggregating scores across tasks you'll never use. The question is brutally simple:
Does this specific model, with this specific prompt, for this specific task, give me the result I expect, reliably and at a price I can afford?
That's it. That's the whole question.
In my emotion detection test, GPT-5.2 won at 75% with perfect stability for $0.0085 per task. But if your use case is high-volume classification where "good enough" works and cost matters, Grok 4.1 Fast at 57% for $0.0009 gives you roughly 633 accuracy-per-dollar (0.57 / $0.0009), about 7x better value than the winner's 88.
No leaderboard will tell you that. Only your benchmark will.
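The value comparison above can be reproduced directly from the results table. A minimal sketch, using score as a fraction divided by cost per task (the model subset and the ranking code are mine, not OpenMark's export format):

```python
# Accuracy-per-dollar from the results table: score (as a fraction)
# divided by cost per task. Figures come from the table above; the
# ranking logic is an illustrative sketch.
results = {
    "gpt-5.2":         (0.75, 0.0085),
    "grok-4-1-fast":   (0.57, 0.0009),
    "llama4-maverick": (0.50, 0.0020),
    "claude-opus-4.6": (0.50, 0.0246),
}

value = {model: score / cost for model, (score, cost) in results.items()}
for model, v in sorted(value.items(), key=lambda kv: -kv[1]):
    print(f"{model:16s} {v:8.1f} accuracy per dollar")
```

Run it and the "winner" flips: Grok 4.1 Fast tops the value ranking while Opus 4.6 lands dead last, the exact inversion a quality-only leaderboard hides.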
Try It Yourself
I built OpenMark because I was tired of making model decisions based on other people's benchmarks. You can write any task (code review, classification, creative writing, vision analysis, anything), pick your models, and get deterministic scores with real API costs.
100+ models. Side-by-side comparison. Stability metrics. Accuracy-per-dollar. The stuff leaderboards don't show you.
The benchmark I ran for this article took about 2 minutes to set up.
Stop trusting leaderboards. Benchmark your own work.