
jg-noncelogic

Posted on • Originally published at arxiv.org

VeRA: Verified Reasoning Data Augmentation at Scale

Benchmarks are getting stale and gameable. VeRA addresses that by turning a single seed problem into an executable spec with three parts: a template, a generator, and a deterministic verifier. Once the spec exists, it yields unlimited verified variants at near-zero marginal cost. Read: https://arxiv.org/abs/2602.13217 (arXiv:2602.13217)
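To make the template + generator + verifier idea concrete, here is a minimal Python sketch. It is not the paper's code: `ProblemSpec`, `make_variant`, and the train problem are hypothetical names and data chosen for illustration, and the real VeRA specs may be structured differently.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class ProblemSpec:
    """Hypothetical executable spec: template + generator + verifier."""
    template: str                                       # question text with placeholders
    generator: Callable[[random.Random], Dict[str, int]]  # samples parameters
    verifier: Callable[[Dict[str, int]], int]           # computes the ground-truth label

    def make_variant(self, seed: int) -> Tuple[str, int]:
        # A seeded RNG makes every variant reproducible.
        rng = random.Random(seed)
        params = self.generator(rng)
        question = self.template.format(**params)
        answer = self.verifier(params)  # the label comes from code, not a human
        return question, answer


def gen_params(rng: random.Random) -> Dict[str, int]:
    # Illustrative parameter ranges, not taken from the paper.
    return {"speed": rng.randint(30, 90), "hours": rng.randint(2, 9)}


def verify(params: Dict[str, int]) -> int:
    # Deterministic: the same parameters always give the same answer.
    return params["speed"] * params["hours"]


TRAIN_SPEC = ProblemSpec(
    template="A train travels at {speed} km/h for {hours} hours. "
             "How many km does it cover?",
    generator=gen_params,
    verifier=verify,
)

# Three fresh (question, answer) pairs from one seed problem.
variants = [TRAIN_SPEC.make_variant(seed) for seed in range(3)]
for question, answer in variants:
    print(question, "->", answer)
```

The design point is that the answer is never written by hand: it is computed from the sampled parameters, so each new variant costs one function call.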

Two modes matter. VeRA-E produces equivalent rewrites of a problem, which helps detect memorization and contamination. VeRA-H produces hardened variants that systematically increase difficulty while staying automatically verifiable. In practice, labels come from the verifier, not human guesswork.
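One way to read the VeRA-E idea is as a gap measurement: score a model on the original seed problems and on equivalent variants, then compare. The sketch below assumes a model is just a function from question text to an integer answer. `contamination_gap` and both toy models are hypothetical, the variants here are simple number swaps standing in for equivalent rewrites, and the paper's actual metric may differ.

```python
import re
from typing import Callable, List, Tuple

Item = Tuple[str, int]          # (question, verified answer)
Model = Callable[[str], int]    # assumed interface: question in, answer out


def accuracy(model: Model, items: List[Item]) -> float:
    if not items:
        raise ValueError("items must be non-empty")
    correct = sum(1 for question, answer in items if model(question) == answer)
    return correct / len(items)


def contamination_gap(model: Model, seed_items: List[Item],
                      variant_items: List[Item]) -> float:
    # A large positive gap suggests the model recalls seed answers
    # rather than solving the underlying problem.
    return accuracy(model, seed_items) - accuracy(model, variant_items)


seed_items = [("What is 12 * 7?", 84)]
variant_items = [("What is 13 * 6?", 78), ("What is 9 * 11?", 99)]

# Toy model that has only memorized the seed problem.
memorized = {"What is 12 * 7?": 84}


def lookup_model(question: str) -> int:
    return memorized.get(question, -1)


# Toy model that actually computes the product.
def solver_model(question: str) -> int:
    left, right = (int(n) for n in re.findall(r"\d+", question))
    return left * right


print(contamination_gap(lookup_model, seed_items, variant_items))
print(contamination_gap(solver_model, seed_items, variant_items))
```

The memorizing model scores perfectly on the seed and fails every variant, while the solver shows no gap. A VeRA-H style check would reuse the same harness with harder parameter ranges in the generator.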

The paper evaluates 16 frontier models: VeRA-E exposed contamination patterns, and VeRA-H produced genuinely hard, machine-checked tasks. Bottom line: you can keep evaluation fresh and trustworthy without labeling costs growing as you scale.

If you ship models and care about real progress, stop relying on static problem sets. Use executable specs (VeRA-style) to generate fresh, verified tests. According to the paper, the code and data are open-sourced, so try it: https://arxiv.org/abs/2602.13217
