SWE‑bench Verified is contaminated and mismeasures frontier coding progress. OpenAI now recommends SWE‑bench Pro. Read their analysis: https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified
OpenAI’s takeaway: flawed tests that reward shortcuts, plus training‑data leakage that inflates scores. A model can look good on Verified even when it hasn’t actually improved on real tasks.
The practical upshot: don’t treat Verified as the sole basis for model selection or product claims. Run SWE‑bench Pro or private holdouts, and read Verified scores skeptically.
Quick checklist for sane benchmarking: time‑split your evals so every task postdates the model’s training cutoff, run duplicate/overlap scans against training corpora, and do human spot‑checks on edge cases. How do you detect eval leakage in your stack?
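For the overlap scan, my starting point is brute‑force n‑gram matching between eval items and the training corpus. A minimal sketch in Python, assuming whitespace tokenization and a 13‑gram window (a common contamination heuristic); all names here are hypothetical, adapt to your own data loading:

```python
from typing import Iterable, List, Set

N = 13  # n-gram length; 13-gram matching is a common contamination heuristic

def ngrams(text: str, n: int = N) -> Set[str]:
    """Return the set of whitespace-token n-grams in `text`."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(eval_items: Iterable[str], corpus_docs: Iterable[str]) -> List[int]:
    """Indices of eval items sharing at least one n-gram with the corpus."""
    corpus_grams: Set[str] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc)
    return [i for i, item in enumerate(eval_items) if ngrams(item) & corpus_grams]

if __name__ == "__main__":
    evals = ["fix the off by one error in the pagination helper so the last page renders"]
    corpus = ["readme: fix the off by one error in the pagination helper so the last page renders correctly"]
    print(contaminated(evals, corpus))  # -> [0]
```

Substring or embedding‑based matching will catch paraphrases this misses, but exact n‑gram hits are cheap to compute and hard to argue with.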