A new paper introduces a method to speed up reward-based fine-tuning by having the model generate a cheap, compressed copy of itself to draft text, which the full model then verifies rather than writing from scratch. The approach, called self-speculative decoding, achieves meaningful speedups in generation with no loss in final model quality — the finished model is identical to one trained without the trick. The key insight is that the clone is re-created from the live model at every training step, so it never drifts out of sync.
Key facts
- What: Teaching a model with rewards is slow because it has to write out endless practice answers. A new trick: make a cheap, shrunk-down copy of the model to crank those out faster.
- When: 2026-06-19
- Primary source: arXiv 2606.18967
During the reward phase of training — where the model practices, gets graded, and improves, as covered in our explainer on reward-based fine-tuning — most time is spent waiting. The model must write out complete answers word by word, thousands of times over, before it can be graded. That generation step is slow and dominates the clock. The paper attacks this bottleneck directly by borrowing from speculative decoding, a technique already used to speed up chatbots: a small, fast model drafts the next chunk of text, and the large model only checks the draft rather than composing every word. Checking is far quicker than writing, so you get the big model's quality at closer to the small model's speed.
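The draft-then-verify loop described above can be sketched with toy stand-ins. This is a minimal illustration under stated assumptions, not the paper's implementation: `target_next` and `draft_next` are hypothetical deterministic functions playing the roles of the large and small models, decoding is greedy, and verification calls the target once per position for clarity (a real system would check a whole draft in one batched forward pass, which is where the speedup comes from).

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def target_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix) * 7) % 50


def draft_next(prefix: List[int]) -> int:
    """Hypothetical cheap drafter: agrees with the target except at some positions."""
    guess = target_next(prefix)
    return (guess + 1) % 50 if len(prefix) % 4 == 3 else guess


def greedy_decode(next_token: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: the model writes every token itself."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Drafter proposes up to k tokens; target keeps the longest correct prefix."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. Draft a short run of tokens cheaply.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. Verify each drafted token against what the target would have written.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # Keep the agreed prefix, then substitute the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every drafted token was accepted
    return out


prompt = [3, 1, 4]
fast = speculative_decode(target_next, draft_next, prompt, n_new=12)
slow = greedy_decode(target_next, prompt, n_new=12)
print(fast == slow)  # prints True
```

Every token that ends up in the output is one the target itself would have produced, so a bad drafter only costs speed, never correctness.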
The challenge in the training setting is that the model being accelerated is constantly changing — it's mid-training — so any fixed helper quickly falls out of step, and its guesses stop matching what the big model would say. The paper's fix is to make a compressed copy of the current model at every step: a stripped-down, lower-precision snapshot that serves as the fast drafter. Because the clone is regenerated constantly from the live model, it never drifts. The researchers add one more practical refinement: early in each batch, when hardware is already running flat-out, speculation buys nothing, so they switch it off and turn it on only when spare capacity exists.
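A rough sketch of the two ideas in this paragraph, under loud assumptions: `snapshot_drafter`, `should_speculate`, the rounding step, and the `saturation_point` threshold are all hypothetical names and values chosen for illustration. The source does not say how the paper compresses the model or how it detects spare capacity.

```python
from typing import List


def snapshot_drafter(weights: List[float], step: float = 0.1) -> List[float]:
    """Hypothetical compression: snap each live weight to a coarse grid."""
    return [round(w / step) * step for w in weights]


def drift(live: List[float], drafter: List[float]) -> float:
    """Largest gap between the live weights and a drafter's copy of them."""
    return max(abs(a - b) for a, b in zip(live, drafter))


def should_speculate(active_sequences: int, saturation_point: int) -> bool:
    """Assumed heuristic: speculate only once the hardware has spare capacity."""
    return active_sequences < saturation_point


live = [0.12, -0.53, 0.98, 0.31]
stale = snapshot_drafter(live)  # helper frozen before training continues

for _ in range(5):
    live = [w + 0.2 for w in live]   # stand-in for an optimizer update
    fresh = snapshot_drafter(live)   # drafter rebuilt from the live model

print("stale drift:", drift(live, stale))
print("fresh drift:", drift(live, fresh))

# In-flight sequences shrink as rollouts in a batch finish (made-up counts).
active_over_time = [64, 48, 20, 6, 2]
flags = [should_speculate(n, saturation_point=16) for n in active_over_time]
print(flags)
```

The frozen helper's error grows with every update, while the rebuilt one stays within the rounding error of a single snapshot. The flags stay off while many sequences are in flight and turn on for the tail of the batch.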
The "compressed copy" is the same model stored in a coarser, lower-precision form: the numbers that make it up are rounded to a smaller set of values so they use far less memory and run faster. Some fidelity is lost in the copy, but that's acceptable because the copy only has to guess, and the full-quality model still checks every guess. The rounding never touches the final result; it only makes drafting cheaper. It's a small, well-contained piece of engineering rather than a sweeping change to how training works.
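To make "lower-precision" concrete, here is a minimal sketch of one common scheme, symmetric round-to-nearest integer quantization. This scheme is an assumption picked for illustration; the source does not specify which format or bit width the paper uses, and `quantize` and `dequantize` are hypothetical helper names.

```python
from typing import List, Tuple


def quantize(weights: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Map floats to small signed integers plus one shared scale factor."""
    qmax = 2 ** (bits - 1) - 1                        # 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


def max_error(weights: List[float], bits: int) -> float:
    """Worst-case rounding error introduced at a given bit width."""
    codes, scale = quantize(weights, bits)
    approx = dequantize(codes, scale)
    return max(abs(w - a) for w, a in zip(weights, approx))


weights = [0.5, -1.27, 0.003, 0.9]
codes8, scale8 = quantize(weights, bits=8)
print(codes8)
print("8-bit max error:", max_error(weights, 8))
print("4-bit max error:", max_error(weights, 4))
```

Each weight lands within half a grid step of its original value, and fewer bits mean a coarser grid and larger error. That error only affects how often the drafter's guesses are accepted, since the full-precision model still makes the final call.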
The speedups are real but modest — meaningfully faster generation and a smaller but worthwhile cut to total training time — and, crucially, lossless: the finished model is no worse for it, because the big model still checks everything that matters. That stands out in a field where efficiency claims are often wildly inflated. The authors aren't promising to halve your training bill; they're promising to shave a real, dependable slice off the slowest step with essentially no downside.
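Why the cut to total training time is smaller than the generation speedup comes down to simple arithmetic: only the generation share of each step gets faster. The sketch below uses made-up numbers purely for illustration; neither the 60% generation share nor the 1.5x figure comes from the paper, and `end_to_end_speedup` is a hypothetical helper.

```python
def end_to_end_speedup(gen_fraction: float, gen_speedup: float) -> float:
    """Overall speedup when only the generation share of a step is accelerated.

    gen_fraction: share of step time spent generating (between 0 and 1).
    gen_speedup:  how many times faster generation becomes.
    """
    if not 0.0 <= gen_fraction <= 1.0 or gen_speedup <= 0.0:
        raise ValueError("need 0 <= gen_fraction <= 1 and gen_speedup > 0")
    remaining = (1.0 - gen_fraction) + gen_fraction / gen_speedup
    return 1.0 / remaining


# Illustrative only: generation is 60% of a step and becomes 1.5x faster.
overall = end_to_end_speedup(gen_fraction=0.6, gen_speedup=1.5)
print(round(overall, 3))  # prints 1.25
```

The grading and weight-update portions of each step are untouched, so the overall figure always trails the generation figure.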
This is one of several results this week aimed at the same target from different angles: doing the reward phase smarter. Another shows how to give a model fine-grained credit for its good steps without a second judge model; another protects the rare words that keep a model from getting repetitive and overconfident. The common thread is a field finding savings and insight inside the machinery it already has. Training these models is staggeringly expensive, and the reward phase is becoming one of the most important — and most compute-hungry — parts of building a strong reasoning model. Quiet, no-strings savings on the slowest step compound across an entire industry, even when no single number is dramatic.
The caveats are appropriately small: it's new work, the gains are larger for some model families than for others, and "modest but lossless" is a feature rather than a headline. That's the point: it's a sober, buildable optimization, not a miracle, and the self-cloning idea is clever enough that it'll likely turn up in other people's training pipelines before long.
Originally published on Ground Truth, where every claim is checked against the primary source.