On-Policy Distillation: Frontier Reasoning on Small Models

#opensource #research #ai #machinelearning

Originally published on AI Tech Connect.

What you need to know

The idea in one line. The small "student" model generates its own answers, and a stronger "teacher" grades those answers token by token, so the student learns from its own mistakes, not from a transcript it can only mimic.

Why it beats plain fine-tuning. Copying a teacher's perfect outputs (off-policy) makes small errors compound over long reasoning chains. On-policy learning trains the model on the states it actually visits, which is exactly where it needs help.

Why it beats reinforcement learning on cost. RL gives one sparse reward per trajectory; on-policy distillation gives a dense per-token signal. Thinking Machines Lab reports this is roughly 9 to 30 times cheaper in compute to reach the same score.

The evidence is stacking up. An April 2026 survey, a…
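To make the "dense per-token signal" concrete, here is a minimal sketch in plain Python. It assumes the teacher's grade at each position is the reverse KL divergence between the student's and the teacher's next-token distributions, computed on a sequence the student itself sampled. That choice of divergence, the three-token toy vocabulary, the logit values, and the function names are all illustrative assumptions, not details taken from the article.

```python
import math


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_token_reverse_kl(student_logits, teacher_logits):
    """Return KL(student || teacher) at every position of one sequence.

    Each argument is a list of logit vectors, one per position, over the
    same toy vocabulary. The positions are assumed to come from a sequence
    the student sampled, which is what makes the signal on-policy.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        p_student = softmax(s_row)
        p_teacher = softmax(t_row)
        kl = sum(
            ps * (math.log(ps) - math.log(pt))
            for ps, pt in zip(p_student, p_teacher)
        )
        losses.append(kl)
    return losses


# Hypothetical logits: the student agrees with the teacher on the first
# two positions and diverges on the third.
student = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 0.0, 0.0]]

losses = per_token_reverse_kl(student, teacher)
for position, loss in enumerate(losses):
    print(f"position {position}: reverse KL = {loss:.4f}")
```

In this toy case the loss is zero at the first two positions and positive only at the third, so the student is told exactly where it went wrong. A trajectory-level RL reward would instead collapse the same sequence into a single number.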

Read the full article on AI Tech Connect →
