SWE‑bench Verified is contaminated and mismeasures frontier coding progress. OpenAI now recommends SWE‑bench Pro. Read their analysis: https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified
OpenAI’s takeaway: flawed tests that reward shortcuts, plus training‑data leakage that inflates scores. A model can look good on Verified even when it hasn’t actually improved on real tasks.
The practical upshot: don’t treat Verified as the sole basis for model selection or product claims. Run SWE‑bench Pro or private holdouts, and read Verified scores skeptically.
Quick checklist for sane benchmarking: time‑split your evals so every task postdates the model’s training cutoff, run duplicate/overlap scans against training corpora, and do human spot‑checks on edge cases. How do you detect eval leakage in your stack?
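For the overlap scan, my starting point is brute‑force n‑gram matching between eval items and the training corpus. A minimal sketch in Python, assuming whitespace tokenization and a 13‑gram window (a common contamination heuristic); all names here are hypothetical, adapt to your own data loading:

```python
from typing import Iterable, List, Set

N = 13  # n-gram length; 13-gram matching is a common contamination heuristic

def ngrams(text: str, n: int = N) -> Set[str]:
    """Return the set of whitespace-token n-grams in `text`."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(eval_items: Iterable[str], corpus_docs: Iterable[str]) -> List[int]:
    """Indices of eval items sharing at least one n-gram with the corpus."""
    corpus_grams: Set[str] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc)
    return [i for i, item in enumerate(eval_items) if ngrams(item) & corpus_grams]

if __name__ == "__main__":
    evals = ["fix the off by one error in the pagination helper so the last page renders"]
    corpus = ["readme: fix the off by one error in the pagination helper so the last page renders correctly"]
    print(contaminated(evals, corpus))  # -> [0]
```

Substring or embedding‑based matching will catch paraphrases this misses, but exact n‑gram hits are cheap to compute and hard to argue with.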