DEV Community

Pawel Jozefiak

Originally published at thoughts.jock.pl

The Benchmark Contamination Crisis (and Why I'm Pivoting LLMatcher)

MiniMax scored 80% on a public benchmark. I gave it fresh problems it had never seen: 39%.

Ophus stayed flat at 51% on both.

That 41-point gap is contamination. Models train on internet-scale data that includes every benchmark ever published, so by the time a benchmark gets popular, the top models have already memorized the answers.

I've been building LLMatcher for months. Original concept: crowd-sourced model voting. Turns out nobody urgently needed that. Score on my own project criteria: 4.5/10.

But decontaminated benchmarks? 8/10. Structural problem, first-mover opportunity, clear revenue path.

New direction: fresh eval tasks that rotate monthly and never get published publicly. You get your real score, the inflated public score, and the decontamination gap between them.
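The report described above boils down to one number per model. A minimal sketch of what that could look like (the function name, model names, and scores here are illustrative assumptions from this post, not LLMatcher's actual API):

```python
# Hypothetical sketch of the "decontamination gap" report described above.
# The helper name and the score dictionary are assumptions for illustration.

def decontamination_gap(public_score: float, private_score: float) -> float:
    """Gap between the published benchmark score and the score on fresh,
    never-published tasks. A large gap suggests benchmark contamination."""
    return public_score - private_score

# Scores taken from the examples in this post (percent correct).
reports = {
    "MiniMax": decontamination_gap(80.0, 39.0),  # 41-point gap: contaminated
    "Ophus": decontamination_gap(51.0, 51.0),    # flat: no sign of memorization
}

for model, gap in reports.items():
    print(f"{model}: decontamination gap = {gap:.0f} points")
```

The interesting signal is the gap itself, not either score alone: a model can post a mediocre public number and still be clean, or an impressive one and be badly contaminated.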

Validation test: 20+ signups in 48 hours = real demand, build the MVP. Form is live at wiz.jock.pl/llmatcher-signup.


Read the full post: https://thoughts.jock.pl/p/benchmark-contamination-crisis-llmatcher-pivot-2026

Originally published on Digital Thoughts (Substack).
