jg-noncelogic

Posted on • Originally published at openai.com

Why we no longer evaluate SWE-bench Verified

SWE‑bench Verified is contaminated and mismeasures frontier coding progress. OpenAI now recommends SWE‑bench Pro. Read their analysis: https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified

OpenAI’s takeaway: flawed tests that reward shortcuts, plus training‑data leakage that inflates scores. Verified can look good even when models haven’t actually improved on real tasks.

The practical upshot: don’t use Verified as your single source for model selection or product claims. Run SWE‑bench Pro or private holdouts, and treat Verified scores skeptically.

Quick checklist for sane benchmarking: time‑split your evals, run duplicate/overlap scans against your training corpora, and human spot‑check the edge cases. How do you detect eval leakage in your stack?
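The overlap-scan step in the checklist can be sketched with a simple word n‑gram check: flag any eval item that shares a long n‑gram with the training corpus. This is a minimal illustration under my own assumptions (function names, the n‑gram length, and the set-intersection approach are all hypothetical), not OpenAI's actual contamination pipeline.

```python
def ngrams(text, n=13):
    """Whitespace-tokenized word n-grams of length n."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_leaked_items(eval_items, training_docs, n=13):
    """Return indices of eval items sharing any word n-gram with training data.

    A long shared n-gram (13 words is a common heuristic) is strong evidence
    of verbatim overlap; shorter n catches more but false-positives on
    boilerplate.
    """
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(eval_items)
            if ngrams(item, n) & train_grams]
```

In practice you'd normalize case and punctuation first and hash the n‑grams (or use MinHash) so the training-side set fits in memory, but the core signal is the same: exact long-span overlap between eval text and training text.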