MiniMax scored 80% on a public benchmark. I gave it fresh problems it had never seen: 39%.
Ophus stayed flat at 51% on both.
That 41-point gap is contamination. Models train on internet-scale data that includes every benchmark ever published. By the time a benchmark gets popular, the top models have already memorized the answers.
I've been building LLMatcher for months. Original concept: crowd-sourced model voting. Turns out nobody urgently needed that. Score on my own project criteria: 4.5/10.
But decontaminated benchmarks? 8/10. Structural problem, first-mover opportunity, clear revenue path.
New direction: fresh eval tasks that rotate monthly and are never released publicly. You get three numbers: your real score on the private set, the inflated public score, and the decontamination gap between them.
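To make that report concrete, here's a minimal sketch of the gap calculation. The function name and the per-model figures are illustrative (the numbers are the ones from this post), not the actual LLMatcher pipeline.

```python
# Minimal sketch: the decontamination gap is just public score minus
# private score, in percentage points. Illustrative only.

def contamination_gap(public_score: float, private_score: float) -> float:
    """Gap between the published benchmark score and the score on
    fresh, unpublished tasks (percentage points)."""
    return public_score - private_score

# Example using the figures quoted above.
for model, public, private in [("MiniMax", 80.0, 39.0), ("Ophus", 51.0, 51.0)]:
    gap = contamination_gap(public, private)
    print(f"{model}: public {public:.0f}%, private {private:.0f}%, gap {gap:.0f} pts")
```

A big gap suggests the public score is propped up by memorized benchmark data; a near-zero gap, like Ophus's, suggests the public number is trustworthy.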
Validation test: 20+ signups in 48 hours means real demand, and I build the MVP. The form is live at wiz.jock.pl/llmatcher-signup.
Read the full post: https://thoughts.jock.pl/p/benchmark-contamination-crisis-llmatcher-pivot-2026
Originally published on Digital Thoughts (Substack).