Originally published on AI Tech Connect.
What you need to know

The idea in one line. The small "student" model generates its own answers, and a stronger "teacher" grades those answers token by token, so the student learns from its own mistakes, not from a transcript it can only mimic.

Why it beats plain fine-tuning. Copying a teacher's perfect outputs (off-policy) makes small errors compound over long reasoning chains. On-policy learning trains the model on the states it actually visits, which is exactly where it needs help.

Why it beats reinforcement learning on cost. RL gives one sparse reward per trajectory; on-policy distillation gives a dense per-token signal. Thinking Machines Lab reports this is roughly 9 to 30 times cheaper in compute to reach the same score.

The evidence is stacking up. An April 2026 survey, a…
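
To make the contrast between a dense per-token signal and a sparse per-trajectory reward concrete, here is a minimal sketch in plain Python. It assumes the teacher's "grade" for each token is the teacher's log-probability of that token minus the student's, which is one common choice and not something the summary above specifies. The function names and all the numbers are hypothetical and exist only for illustration.

```python
import math


def per_token_signal(student_logprobs, teacher_logprobs):
    """Dense signal: one score for every token the student sampled.

    Assumption (not from the article): the score is
    teacher log-prob minus student log-prob for the sampled token.
    A strongly negative score marks a token the teacher finds unlikely.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("need one teacher score per student token")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


def sparse_rl_signal(num_tokens, final_reward):
    """Sparse signal for contrast: a single reward for the whole trajectory,
    attached to the last token, with zeros everywhere else."""
    return [0.0] * (num_tokens - 1) + [final_reward]


# Hypothetical log-probs for four tokens that the student itself sampled.
student_lp = [math.log(0.9), math.log(0.8), math.log(0.7), math.log(0.9)]
teacher_lp = [math.log(0.9), math.log(0.85), math.log(0.05), math.log(0.8)]

dense = per_token_signal(student_lp, teacher_lp)
sparse = sparse_rl_signal(len(student_lp), final_reward=0.0)

# Position of the token the teacher disagrees with most.
worst = min(range(len(dense)), key=dense.__getitem__)

print("dense :", [round(x, 3) for x in dense])
print("sparse:", sparse)
print("worst token index:", worst)
```

In this toy trajectory the dense signal singles out the third token as the step where the student went wrong, while the sparse signal only says that the trajectory as a whole failed. A real training loop would get these log-probabilities from the two models and feed the per-token scores into a gradient update, which this sketch leaves out.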