The DPO Hype Promised to Kill RLHF — It Didn't
Everyone said Direct Preference Optimization would replace RLHF. The paper from Rafailov et al. (NeurIPS 2023) showed you could skip the reward model entirely, train directly on preference pairs, and get comparable results. Simpler pipeline. Fewer moving parts. What's not to love?
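To make "train directly on preference pairs" concrete, here is a minimal sketch of the per-pair DPO objective as I understand it from the paper: the negative log-sigmoid of a scaled difference between policy-vs-reference log-ratios for the chosen and rejected responses. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are my own illustrative choices, and the inputs are assumed to be summed sequence log-probabilities you have already computed elsewhere.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (illustrative sketch, not a library API).

    All four inputs are assumed to be summed log-probabilities of a full
    response under the trainable policy or the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between chosen and rejected, scaled by beta.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch for stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Note that no reward model appears anywhere: the only signal is which response in each pair was preferred. That is the whole appeal, and it is also why the loss can keep falling on the training pairs without telling you much about held-out prompts.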
I bought into the hype initially. Then I ran DPO on a custom assistant task with 15K preference pairs and watched it memorize the training distribution so hard it couldn't generalize to slightly rephrased instructions. The loss curve looked beautiful. The actual outputs were brittle.
RLHF's extra complexity (the separate reward model, the PPO training loop, the KL divergence constraint) turns out not to be a bug. Those pieces act as regularization mechanisms that prevent exactly this kind of overfitting. In 2026, after two years of the field trying every shortcut, the original Anthropic and OpenAI approach of training a reward model and then running PPO against it remains the gold standard for production systems.
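For a rough picture of how the KL constraint regularizes, here is a sketch of the reward shaping commonly used in PPO-based RLHF: each generated token is penalized for drifting from the reference model, and the reward model's score is added at the final token. The function name `kl_shaped_rewards`, the `kl_coef` default, and the simple per-token log-ratio estimate of the KL term are assumptions for illustration, not a description of any particular production pipeline.

```python
def kl_shaped_rewards(reward_model_score, policy_logps, ref_logps, kl_coef=0.05):
    """Per-token rewards for one sampled response (illustrative sketch).

    policy_logps / ref_logps: log-probabilities of each generated token
    under the trainable policy and the frozen reference model.
    """
    if not policy_logps or len(policy_logps) != len(ref_logps):
        raise ValueError("need two non-empty log-prob lists of equal length")

    # Penalize every token where the policy assigns more probability than
    # the reference does; this pulls the policy back toward the reference.
    rewards = [-kl_coef * (p - r) for p, r in zip(policy_logps, ref_logps)]

    # The learned reward model scores the whole response, so its score is
    # credited at the last token.
    rewards[-1] += reward_model_score
    return rewards
```

With this shaping, a response only comes out ahead if the reward model's score outweighs the accumulated drift penalty, which is one way to read the "regularization" claim above.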
Continue reading the full article on TildAlice
