Why DPO Changes Everything About LLM Alignment
RLHF works. We know this because GPT-4, Claude, and pretty much every useful LLM today uses it. But here's the dirty secret: the RL part of RLHF is a nightmare to implement correctly. You need to train a reward model, then run PPO with careful clipping, KL penalties, advantage estimation, and a dozen hyperparameters that interact in ways nobody fully understands.
DPO throws all of that out. No reward model. No PPO. No actor-critic architecture. Just a single supervised learning objective that achieves the same result.
The paper by Rafailov et al. (NeurIPS 2023) makes a mathematical observation that seems obvious in hindsight: if we know what the optimal policy looks like under a given reward function, we can invert that relationship and express the reward directly in terms of the policy. Then we never need to learn the reward model separately — we optimize the policy directly on preference data.
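Concretely, the DPO objective is -log σ(β[(log π(y_w|x) − log π_ref(y_w|x)) − (log π(y_l|x) − log π_ref(y_l|x))]), where y_w and y_l are the preferred and rejected completions. Here is a minimal sketch of that per-example loss in plain Python, assuming you already have summed token log-probabilities from the policy and reference models (the function name and arguments are illustrative, not from the paper's code):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed token log-probabilities.

    beta scales the implicit reward; the reference model keeps the
    policy anchored to where it started.
    """
    # Implicit reward of each completion: beta * log(pi / pi_ref)
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Bradley-Terry preference loss: -log sigmoid(reward margin)
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy is identical to the reference, both implicit rewards are zero and the loss is log 2; as the policy shifts probability mass toward preferred completions, the margin grows and the loss falls. That's the entire training signal — no sampled rollouts, no value network.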