Benchmarks are getting stale and gameable. VeRA fixes that by turning a single seed problem into an executable spec: template + generator + deterministic verifier. Unlimited verified variants, near-zero marginal cost. Read: https://arxiv.org/abs/2602.13217 (arXiv:2602.13217)
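To make "template + generator + deterministic verifier" concrete, here is a minimal toy sketch. It is not the paper's code or API: the template text and the names `TEMPLATE`, `Variant`, `generate`, and `verify` are all hypothetical, chosen only to show how the three parts could fit together.

```python
import random
from dataclasses import dataclass

# Hypothetical toy spec for illustration; not taken from the VeRA codebase.
TEMPLATE = "A train travels {speed} km/h for {hours} hours. How many km does it cover?"


@dataclass
class Variant:
    prompt: str   # the rendered problem text shown to the model
    params: dict  # the sampled parameters the verifier needs


def generate(seed: int) -> Variant:
    """Sample parameters from a seeded RNG and fill the template."""
    rng = random.Random(seed)
    params = {"speed": rng.randint(20, 120), "hours": rng.randint(1, 9)}
    return Variant(TEMPLATE.format(**params), params)


def verify(variant: Variant, answer: str) -> bool:
    """Deterministically recompute the expected answer from the parameters."""
    expected = variant.params["speed"] * variant.params["hours"]
    try:
        return int(answer.strip()) == expected
    except ValueError:
        return False  # non-numeric answers are simply wrong
```

Because the generator is seeded, the same seed always reproduces the same variant, and the label comes from `verify` rather than from a stored answer key.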
Two modes matter: VeRA-E = equivalent rewrites to detect memorization/contamination; VeRA-H = hardened variants that systematically increase difficulty while still auto-verifiable. Practical: labels come from the verifier, not human guesswork.
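A toy illustration of the two modes, under my own assumptions about what "equivalent" and "hardened" could look like for a simple arithmetic seed problem. The function names, phrasings, and the `legs` parameter are hypothetical and not from the paper.

```python
import random

# Hypothetical toy generators; the paper's actual generators may differ.
PHRASINGS = [
    "A train travels {v} km/h for {t} hours. How far does it go, in km?",
    "Driving at a steady {v} km/h, how many km do you cover in {t} hours?",
]


def equivalent_variant(seed: int) -> tuple[str, int]:
    """E-style in this toy: same one-step task, fresh wording and numbers."""
    rng = random.Random(seed)
    v, t = rng.randint(20, 120), rng.randint(1, 9)
    prompt = rng.choice(PHRASINGS).format(v=v, t=t)
    return prompt, v * t  # label is computed, never hand-written


def hardened_variant(seed: int, legs: int) -> tuple[str, int]:
    """H-style in this toy: chain several legs so more steps are required."""
    rng = random.Random(seed)
    pairs = [(rng.randint(20, 120), rng.randint(1, 9)) for _ in range(legs)]
    parts = "; then ".join(f"{v} km/h for {t} hours" for v, t in pairs)
    prompt = f"A train travels {parts}. What is the total distance in km?"
    return prompt, sum(v * t for v, t in pairs)
```

A model that solves the seed problem but fails many `equivalent_variant` outputs is a candidate for memorization, while raising `legs` gives a difficulty knob whose labels are still computed automatically.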
Evaluated on 16 frontier models: VeRA-E exposed contamination patterns; VeRA-H produced hard, machine-checked tasks. Bottom line: you can keep evaluation fresh and trustworthy without labeling costs growing as you scale.
If you ship models and care about real progress, stop relying on static problem sets. Use executable specs (VeRA-style) to generate fresh, verified tests. Code and data are open-sourced (see the paper), so try it. https://arxiv.org/abs/2602.13217