Teaching AI with rewards — minus the expensive second model that grades it

#rlposttraining #efficiency #training

VIMPO eliminates the need for a separate critic model during the reward-polishing phase of language-model training. By deriving value estimates directly from the model being trained, it cuts memory and compute costs in half while staying steadier than existing critic-free methods when rewards are noisy.

Key facts

What: The standard way to polish a model with rewards quietly runs a second 'critic' model alongside it. A new method derives the critic's judgment from the model itself, dropping the extra cost.
When: 2026-06-20
Primary source: arXiv 2606.20008

After a language model is first trained to predict text, it goes through a polishing phase where it's rewarded for good answers and nudged away from bad ones. This is the step that turns a raw text-predictor into a focused, helpful assistant, and much of the recent progress in reasoning models comes from doing it well. The hidden cost: many of these methods quietly run a second model alongside the one you care about, whose only job is to estimate how promising the answer-in-progress looks at each point.

That second model exists because of a credit-assignment problem. When you reward a model for a long answer — a multi-step math solution, say — you need to know which steps deserve credit when the final answer is right, and which deserve blame when it's wrong. The traditional fix, borrowed from classical reinforcement learning, is to train a separate critic (sometimes called a value model) that watches along and estimates, at each point, how well things are going. This critic enables fine-grained credit assignment, but it is itself a large model — it costs memory, compute, and engineering effort to train and keep in sync. You're effectively running two models to improve one.
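To make the critic's role concrete, here is a minimal sketch of the classical recipe, assuming the simplest one-step form of the calculation. The function name, the four-step example, and the critic's numbers are all made up for illustration; real training pipelines typically use more elaborate estimators and real critic outputs.

```python
def td_advantages(rewards, values, gamma=1.0):
    """One-step advantages: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values[t]` is the critic's estimate just before step t. The value
    after the final step is taken to be 0 because the answer has ended.
    """
    advantages = []
    for t, reward in enumerate(rewards):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        advantages.append(reward + gamma * next_value - values[t])
    return advantages


# Hypothetical 4-step math solution: the only reward is a 1.0 at the end
# for a correct final answer.
rewards = [0.0, 0.0, 0.0, 1.0]

# Made-up critic estimates of how well things are going before each step.
# The jump from 0.5 to 0.9 says the second step is where the solution
# got on track.
values = [0.5, 0.5, 0.9, 0.9]

step_advantages = td_advantages(rewards, values)
for t, adv in enumerate(step_advantages):
    print(f"step {t}: advantage {adv:+.2f}")
```

In this toy case the second step receives most of the credit, even though the reward itself only arrived at the end. That per-step signal is what the critic is paid for, and it is only as good as the critic's estimates.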

VIMPO shows you can skip the separate critic entirely. Its trick is mathematical: the policy you're already training (the assistant itself) implicitly contains the information a critic would provide. By exploiting the conditions that an optimally trained model must satisfy, VIMPO derives a value estimate directly from the model's own behavior, without ever building a second network. The judgment was hiding inside the model all along; you just have to read it out.

An analogy: imagine training for a sport with a separate coach standing on the sideline rating each move. VIMPO is like discovering that, if you set up your practice correctly, your own sense of how the play is going already encodes everything the coach would have told you, so you can let the coach go home. You keep the feedback, you drop the second salary.
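One way to see how a value can hide inside a policy is a textbook identity from KL-regularized reinforcement learning. This is an assumption about the flavor of the argument, not a reproduction of the paper's derivation. If the optimal policy has the form pi*(a) = ref(a) * exp((Q(a) - V) / beta), then the advantage Q(a) - V can be read back as beta * log(pi*(a) / ref(a)). The sketch below checks this numerically for a single decision with three options; the Q-values, reference probabilities, and beta are invented.

```python
import math


def soft_value(q_values, ref_probs, beta):
    """V = beta * log( sum_a ref(a) * exp(Q(a) / beta) )."""
    total = sum(p * math.exp(q / beta) for q, p in zip(q_values, ref_probs))
    return beta * math.log(total)


def optimal_policy(q_values, ref_probs, beta):
    """pi*(a) = ref(a) * exp((Q(a) - V) / beta)."""
    v = soft_value(q_values, ref_probs, beta)
    return [p * math.exp((q - v) / beta) for q, p in zip(q_values, ref_probs)]


def implicit_advantages(policy_probs, ref_probs, beta):
    """Read Q(a) - V back out of the policy alone: beta * log(pi(a) / ref(a))."""
    return [beta * math.log(pi / p) for pi, p in zip(policy_probs, ref_probs)]


# Invented numbers for one decision point with three possible actions.
q_values = [1.0, 0.0, -1.0]
ref_probs = [0.2, 0.5, 0.3]
beta = 0.5  # strength of the pull toward the reference model (assumed)

policy = optimal_policy(q_values, ref_probs, beta)
value = soft_value(q_values, ref_probs, beta)

true_advantages = [q - value for q in q_values]
recovered = implicit_advantages(policy, ref_probs, beta)

for true_adv, rec_adv in zip(true_advantages, recovered):
    print(f"true {true_adv:+.4f}   recovered from policy {rec_adv:+.4f}")
```

The two columns agree, with no critic network anywhere in the calculation. The catch is that the identity holds exactly only for the optimal policy, so for a model still in training any value read out this way is an approximation.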

Beyond saving the cost of the extra model, the authors make a second claim that matters in practice: their approach is steadier when the rewards are noisy. In the real world, the signal telling a model whether it did well is rarely clean: graders disagree, automated checks are imperfect, and some "correct" answers got lucky. The dominant critic-free method in wide use today, GRPO (the one behind several well-known reasoning models, including DeepSeek-R1), can be thrown off by that noise. VIMPO is designed to stay more stable when the feedback is unreliable, which is most of the time.
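A toy illustration of why noisy rewards can matter for a critic-free method, assuming a group-relative baseline of the kind used in GRPO, where each answer is scored against the other answers sampled for the same prompt. This is not the paper's analysis. The rewards are invented, and using the population standard deviation is a choice made for this sketch.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Score each answer against its group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt; only the first is actually correct.
clean_rewards = [1.0, 0.0, 0.0, 0.0]

# Same answers, but the grader wrongly marks the second one as correct.
noisy_rewards = [1.0, 1.0, 0.0, 0.0]

clean_adv = group_relative_advantages(clean_rewards)
noisy_adv = group_relative_advantages(noisy_rewards)

for i, (c, n) in enumerate(zip(clean_adv, noisy_adv)):
    print(f"answer {i}: clean {c:+.2f}   noisy {n:+.2f}")
```

A single grading mistake does more than reward one wrong answer. Because the baseline is computed from the group itself, the error shifts the score of every answer in the group, including the one that was graded correctly.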

The reward-polishing phase is where much of a model's usefulness and reasoning ability is forged, and it's run constantly across the industry. Shaving off an entire auxiliary model makes that phase cheaper and simpler — fewer moving parts, less memory, less that can go wrong. As reasoning models proliferate and labs run this phase over and over, methods that deliver the same quality with half the machinery compound into real savings. It also fits a clear pattern in this week's research: a steady push toward doing the expensive parts of training with less apparatus.

The honest caveat is about scale. Reading the value signal out of the model implicitly, rather than training a dedicated critic to provide it, leans on a mathematical relationship that can become delicate as models grow. A purpose-built critic, for all its expense, is a stable and well-understood source of feedback. Whether the implicit approach stays accurate and steady at the largest scales — or whether the estimation gets shaky when the stakes and sizes go up — is exactly what broader adoption will test. But as a cleaner, cheaper way to run one of AI's most important training steps, VIMPO is a notable entry in a fast-moving area.

Originally published on Ground Truth, where every claim is checked against the primary source.