RLHF vs DPO vs IPO vs KTO: which alignment method should you use
You have a base model, say Llama 3.1 8B, that can write poetry in any meter and pass the bar exam. It can also generate instructions for synthesizing controlled substances, roleplay as a manipulative therapist, and explain in loving detail why your pull request is an affront to good taste. You need to align it: remove the harmful outputs while keeping the capability. Your mentor says "use RLHF." A paper on your feed says DPO is simpler. Your colleague swears by KTO because they only have thumbs-up/thumbs-down log data from production. Where do you start?
Choosing an alignment method is not a theoretical debate. It is a practical decision that depends on your data, your compute budget, and the failure modes you are trying to avoid. This post compares the four dominant approaches side by side, with the actual math, the data requirements, and the sharp edges you will hit in production.
Why this matters
The alignment method you pick determines three things that directly affect shipping timelines:
- Data requirements. Some methods need pairwise preferences (A beats B). Others work with per-sample binary scores. If you have production logs, you probably already have the latter. If you have a human annotation pipeline, you can collect the former — at a cost.
- Compute budget. RLHF requires training a separate reward model of comparable size to your policy model, then running PPO, which is notoriously sample-inefficient and sensitive to hyperparameters. DPO, IPO, and KTO collapse the process into a single training loop on static data.
- Stability and robustness. PPO can destabilize and collapse your policy. DPO can overfit to preference noise. IPO adds a regularization term that mitigates that. KTO handles scenarios where you have no strict pairwise comparisons at all.
Understanding these tradeoffs is the difference between an aligned model that ships in two weeks and an alignment project that drags for three months.
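To make the data requirement concrete, here is a minimal sketch of the two record shapes. The field names (prompt/chosen/rejected and prompt/completion/label) follow a common convention but are assumptions here, and the example text is made up. Note the one-way street: a pairwise record can always be unrolled into two binary records, while binary logs generally cannot be turned back into pairs.

```python
# Hypothetical record shapes; field names and example text are illustrative only.
pairwise_example = {
    "prompt": "Explain what a KL penalty does.",
    "chosen": "It keeps the tuned policy close to the reference model.",
    "rejected": "It makes the model bigger.",
}


def pair_to_binary(pair):
    """Split one pairwise preference into two per-sample binary records."""
    return [
        {"prompt": pair["prompt"], "completion": pair["chosen"], "label": True},
        {"prompt": pair["prompt"], "completion": pair["rejected"], "label": False},
    ]


binary_examples = pair_to_binary(pairwise_example)
```

This is why having pairwise data keeps every method on the table, while binary-only logs narrow your options.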
RLHF, DPO, IPO, and KTO: how each method works
All four methods start from the same place: a supervised fine-tuned (SFT) model and a dataset that captures human preferences. How they use that data differs fundamentally.
RLHF (Reinforcement Learning from Human Feedback)
The canonical approach, popularized by OpenAI's InstructGPT paper (Ouyang et al., 2022), is a three-stage pipeline:
- Collect human preferences — annotators rank model outputs for a set of prompts, producing pairwise preferences (chosen vs rejected).
- Train a reward model — a separate model (usually the same architecture as the policy) is trained to predict the human preference score from a given output. It learns a scalar reward function that approximates human judgment.
- Optimize the policy with PPO — the policy model generates outputs, the reward model scores them, and PPO (Proximal Policy Optimization) updates the policy to increase the expected reward. A KL penalty keeps the policy from diverging too far from the SFT model.
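The reward model step can be sketched with the pairwise Bradley-Terry loss that most reward models are trained on. This is a per-pair illustration on plain floats, not a training loop; the function name and the idea of passing in precomputed scalar rewards are assumptions for the sake of the example.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen output beats the rejected one."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed without overflow
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss only depends on the difference between the two rewards, which is why a reward model's absolute scale is arbitrary.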
# Simplified PPO update (conceptual)
# reward = reward_model.score(prompt, response) - beta * KL(policy || ref_policy)
# advantage = reward - value_estimate
# ratio = exp(new_logprobs - old_logprobs)
# policy_loss = -min(ratio * advantage, clip(ratio, 1 - eps, 1 + eps) * advantage)
The pipeline is expensive: annotation costs money, and the reward model and PPO stages each need their own training run, GPU budget, and hyperparameter sweep. The reward model can latch onto spurious correlations that the policy then learns to exploit (reward hacking), and PPO is sensitive to the learning rate and KL penalty coefficient. On the plus side, online PPO can in theory discover outputs that are better than any human annotation in the dataset.
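For a feel of the two moving parts PPO has to balance, here is a single-sample sketch of the KL-shaped reward and the clipped surrogate loss. It works on plain floats, skips value estimation and batching entirely, and uses the log-probability difference as a simple per-sample KL estimate; the function names and default values are assumptions.

```python
import math


def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus a per-sample KL estimate against the reference."""
    return rm_score - beta * (logp_policy - logp_ref)


def ppo_clip_loss(new_logp, old_logp, advantage, eps=0.2):
    """Clipped surrogate loss for one sample (a quantity to minimize)."""
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the min keeps the update pessimistic: gains are capped, losses are not.
    return -min(ratio * advantage, clipped_ratio * advantage)
```

Both beta and eps interact with the learning rate, which is a large part of why PPO sweeps get expensive.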
DPO (Direct Preference Optimization)
Rafailov et al. (2023) showed that you do not need to train an explicit reward model. The key insight is that the KL-regularized reward maximization objective used in RLHF has a closed-form optimal policy. Inverting that solution expresses the reward in terms of the policy and the reference policy, and substituting it into the Bradley-Terry preference model (the statistical model behind most reward models) yields a loss defined directly on the preference data.
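The algebra behind this can be sketched in three steps. The symbols r (reward), Z(x) (a per-prompt normalizer), and sigma (the logistic function) are notation introduced here for the sketch; treat it as an outline of the argument rather than a full derivation.

```latex
% 1. KL-regularized objective and its closed-form optimum
\max_{\pi}\; \mathbb{E}_{x,\, y \sim \pi}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\quad\Longrightarrow\quad
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big)

% 2. Invert to express the reward through the policy
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% 3. Substitute into Bradley-Terry; the Z(x) terms cancel in the difference
p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)
  = \sigma\!\left( \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                 - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right)
```

Maximizing the log-likelihood of the observed preferences under step 3 gives a loss that contains only the policy, the reference, and the data.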
DPO eliminates the reward model entirely. The training loss is:
L_DPO = -E[log sigmoid(beta * (log pi(y_w|x)/pi_ref(y_w|x) - log pi(y_l|x)/pi_ref(y_l|x)))]
Where y_w is the chosen output, y_l is the rejected output, pi is the current policy, pi_ref is the frozen reference policy (the SFT model), and beta controls how far the policy can diverge.
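As a sanity check on the formula, here is a per-pair version on plain floats. It assumes you already have summed sequence log-probabilities from the policy and the reference; the function and argument names are made up for this sketch.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities."""
    log_ratio_w = logp_w - ref_logp_w  # log pi(y_w|x) - log pi_ref(y_w|x)
    log_ratio_l = logp_l - ref_logp_l  # log pi(y_l|x) - log pi_ref(y_l|x)
    return -log_sigmoid(beta * (log_ratio_w - log_ratio_l))
```

When the policy equals the reference, both log-ratios are zero and the loss is log 2; it only drops as the policy raises the chosen output relative to the rejected one, measured against the reference.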
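# DPO in practice (using Hugging Face TRL; recent releases take beta via DPOConfig)
from trl import DPOConfig, DPOTrainer

# Assumes policy_model, ref_model, tokenizer, and preference_dataset already exist.
training_args = DPOConfig(output_dir="dpo-out", beta=0.1)  # beta: KL regularization strength
dpo_trainer = DPOTrainer(
    model=policy_model,
    ref_model=ref_model,               # frozen SFT checkpoint
    args=training_args,
    train_dataset=preference_dataset,  # columns: prompt, chosen, rejected
    processing_class=tokenizer,        # named tokenizer= in older TRL releases
)
dpo_trainer.train()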
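DPO runs in a single training loop on a static dataset. There is no reward model, no PPO, no online generation during training. This makes it dramatically cheaper: as a rough rule of thumb, about 3x less compute than RLHF, often with comparable benchmark results. The exact ratio depends on model size, rollout length, and how much reward model tuning you would otherwise need.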
The tradeoff: DPO is an offline method. It never sees the model's own generations during training, so it can over-optimize for preferences that do not generalize. It also requires pairwise preference data — you need two outputs per prompt, one explicitly preferred over the other.
IPO (Identity Preference Optimization)
Azar et al. (2023) at DeepMind identified a subtle problem with DPO: when preferences in the data are nearly deterministic (the chosen output always wins), the Bradley-Terry fit pushes the implied reward gap toward infinity, and the KL regularization stops constraining the policy the way it should. IPO replaces the log-odds transform of preference probabilities with an identity mapping, which keeps the regularization effective.
The IPO loss is:
L_IPO = E[(log pi(y_w|x)/pi_ref(y_w|x) - log pi(y_l|x)/pi_ref(y_l|x) - 1/(2*tau))^2]
Where tau is a regularization parameter. The squared loss pulls the log-ratio gap toward the target margin 1/(2*tau) from both sides: it penalizes a gap that is too small and one that overshoots. Because the loss has a finite optimum, the policy has no incentive to keep widening the gap, which is what bounds its drift from the reference.
# IPO loss (conceptual)
# margin = log_ratio_w - log_ratio_l
# loss = (margin - 1/(2*tau))**2  # two-sided: penalizes margins below AND above the target
IPO requires the same pairwise data as DPO. It is slightly more stable in practice, especially on noisy preference datasets where DPO can amplify annotator disagreement.
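The difference from DPO is easiest to see numerically. Below is a per-pair sketch of the IPO loss on plain floats, assuming precomputed sequence log-probabilities; the function name and the tau default are assumptions. Unlike the DPO loss, which keeps shrinking as the gap grows, this one is zero exactly at the target margin and rises again when the gap overshoots.

```python
def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """Per-pair IPO loss: squared distance of the log-ratio gap from 1/(2*tau)."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    target = 1.0 / (2.0 * tau)
    return (margin - target) ** 2
```

With tau=0.1 the target gap is 5, so a gap of 3 and a gap of 7 are penalized equally.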
KTO (Kahneman-Tversky Optimization)
Ethayarajh et al. (2024) at Contextual AI took a different tack. Inspired by prospect theory (Kahneman and Tversky, 1979), they built an alignment method that works with per-sample binary feedback — thumbs up or thumbs down — instead of pairwise preferences.
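The KTO loss scores each example on its own and treats gains (desirable, thumbs-up responses) and losses (undesirable, thumbs-down responses) asymmetrically:
L_KTO = E[w(y) * (1 - v(x, y))], with r = log pi(y|x)/pi_ref(y|x), v(x, y) = sigmoid(beta * (r - z_ref)) if y is desirable, and v(x, y) = sigmoid(beta * (z_ref - r)) if y is undesirable
Where w(y) is a weighting factor that differs for desirable and undesirable examples (lambda_D and lambda_U), and z_ref is a reference point: an estimate of the KL divergence between the current policy and the reference policy, computed from the batch and treated as a constant. The asymmetry mirrors the loss aversion documented in behavioral economics: setting lambda_U above lambda_D penalizes bad outputs more heavily than it rewards good ones. In practice the two weights are also used to rebalance datasets with very different numbers of positive and negative examples.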
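Here is a per-example sketch of the KTO loss in the form used in the KTO paper, where the sign inside the sigmoid flips for undesirable examples. It works on plain floats and takes z_ref as a given number; estimating z_ref from a batch is omitted, and the names lambda_d and lambda_u are assumptions for the two weights.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def kto_loss(logp, ref_logp, desirable, z_ref, beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Per-example KTO loss; z_ref is treated as a constant reference point."""
    log_ratio = logp - ref_logp  # log pi(y|x) - log pi_ref(y|x)
    if desirable:
        # Good outputs: loss falls as the policy raises them above the reference point.
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - z_ref)))
    # Bad outputs: loss falls as the policy pushes them below the reference point.
    return lambda_u * (1.0 - sigmoid(beta * (z_ref - log_ratio)))
```

Each example contributes on its own, which is what lets KTO train on unpaired thumbs-up and thumbs-down logs.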
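# KTO trainer in Hugging Face TRL
from trl import KTOConfig, KTOTrainer

# Assumes policy_model, ref_model, tokenizer, and binary_feedback_dataset already exist.
training_args = KTOConfig(output_dir="kto-out", beta=0.1)
kto_trainer = KTOTrainer(
    model=policy_model,
    ref_model=ref_model,                    # frozen SFT checkpoint
    args=training_args,
    train_dataset=binary_feedback_dataset,  # no pairs needed: prompt, completion, label (True/False)
    processing_class=tokenizer,             # named tokenizer= in older TRL releases
)
kto_trainer.train()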
KTO's major advantage is data availability. Many production systems log per-output user feedback (clicks, likes, flags) without recording a pairwise comparison, and KTO can train directly on this signal. The tradeoff is less information per annotated example: a pairwise comparison tells you which of two outputs is better, while a binary label only tells you which side of an unstated bar one output falls on.
Comparison: which method for which situation
| Dimension | RLHF | DPO | IPO | KTO |
|---|---|---|---|---|
| Data required | Pairwise comparisons | Pairwise comparisons | Pairwise comparisons | Binary (good/bad) |
| Reward model needed | Yes (separate training) | No | No | No |
| Training stages | 2 after SFT (RM + PPO) | 1 (after SFT) | 1 (after SFT) | 1 (after SFT) |
| Compute cost | Highest (~3x DPO) | Low | Low | Low |
| Online generation | Yes (PPO samples during training) | No (offline) | No (offline) | No (offline) |
| Stability | Tricky (PPO hyperparameters) | Good, can overfit to noise | Better (identity regularization) | Good |
| Best for | High-quality RM, large compute budget | Clean pair data, tight budget | Noisy pair data, production stability | Production logs (binary feedback) |
| Key risk | Reward hacking, training collapse | Overfitting on static data | Slightly more complex loss | Needs enough binary data |
Here is the decision flow:
flowchart TD
A[Do you have pairwise<br/>preference data?] -->|Yes| B{Do you have budget<br/>for a reward model<br/>and PPO?}
A -->|No / only binary feedback| C[Use KTO]
B -->|Yes| D[RLHF — full pipeline<br/>highest potential ceiling]
B -->|No| E{Is your preference<br/>data clean or noisy?}
E -->|Clean| F[DPO — simplest<br/>single-stage training]
E -->|Noisy| G[IPO — better regularization<br/>for noisy preferences]
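The same flow reads naturally as a small function. This is just the flowchart restated in code; the function and argument names are made up, and the three yes/no inputs are judgment calls you still have to make yourself.

```python
def choose_method(has_pairwise: bool, can_afford_rm_and_ppo: bool, data_is_noisy: bool) -> str:
    """Mirror the decision flow: data shape first, then budget, then data quality."""
    if not has_pairwise:
        return "KTO"   # only binary feedback available
    if can_afford_rm_and_ppo:
        return "RLHF"  # full pipeline, highest potential ceiling
    return "IPO" if data_is_noisy else "DPO"
```

For example, `choose_method(True, False, True)` returns `"IPO"`: pairwise data, no PPO budget, noisy labels.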
Common pitfalls
Running DPO on binary data. DPO requires pairwise preferences: a chosen output and a rejected output for the same prompt. If you concatenate unrelated good and bad outputs into pairs, DPO will learn arbitrary decision boundaries. Use KTO for binary data.
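If some prompts in your logs happen to have both a liked and a disliked completion, those can form legitimate pairs. A minimal sketch, assuming records shaped as prompt/completion/label dicts (hypothetical field names): it pairs outputs only within the same prompt and never across prompts.

```python
from collections import defaultdict
from itertools import product


def pairs_from_binary(records):
    """Build (chosen, rejected) pairs only from completions that share a prompt."""
    by_prompt = defaultdict(lambda: {True: [], False: []})
    for record in records:
        by_prompt[record["prompt"]][bool(record["label"])].append(record["completion"])

    pairs = []
    for prompt, groups in by_prompt.items():
        # Prompts with only good or only bad completions produce no pairs at all.
        for good, bad in product(groups[True], groups[False]):
            pairs.append({"prompt": prompt, "chosen": good, "rejected": bad})
    return pairs
```

Prompts that yield no pairs are simply dropped here, so if this leaves you with little data, that is a sign to stay with KTO rather than force the pairing.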
Ignoring the reference model. DPO, IPO, and KTO all require a frozen reference model (usually your SFT checkpoint). The loss depends on the log-ratio between the current policy and the reference, so swapping in a different reference silently changes the optimization target. Use the SFT checkpoint the policy is initialized from, which ideally is also the model that generated the responses in your preference data.
Skipping SFT. None of these methods work well on a raw pretrained base model. You need an SFT model that can produce reasonable completions. The alignment stage assumes the model can already generate coherent, on-task outputs — it is steering existing behavior, not teaching the model to generate text from scratch.
Leaving beta untuned. The beta (or tau) parameter controls how far the aligned policy can stray from the reference. Set beta too high and the policy barely moves, so you get little alignment effect. Set it too low and the policy drifts far enough to unlearn general capabilities (catastrophic forgetting). Sweep it systematically: try at least 3 values (e.g., 0.01, 0.1, 0.5) and compare them on a validation set before committing to a full run.
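One way to build intuition for those three values is to ask how large a log-ratio gap the DPO loss demands before a pair counts as "learned". The sketch below solves the per-pair DPO loss for the gap at a given loss level; the function name and the 0.1 loss threshold are arbitrary choices for illustration, and the same reasoning only loosely carries over to IPO and KTO.

```python
import math


def margin_for_loss(target_loss: float, beta: float) -> float:
    """Log-ratio gap at which the per-pair DPO loss equals target_loss."""
    # Solve log(1 + exp(-beta * m)) = target_loss for m.
    return -math.log(math.expm1(target_loss)) / beta


for beta in (0.01, 0.1, 0.5):
    print(f"beta={beta}: gap of about {margin_for_loss(0.1, beta):.1f} nats")
```

The gaps come out at roughly 225, 22.5, and 4.5 nats. At beta=0.01 the policy has to move a long way from the reference before the loss is satisfied, which is where capability loss tends to come from.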
Assuming RLHF always wins. DPO has been reported to match or exceed PPO-based RLHF on a number of benchmarks at a fraction of the compute, although results vary by task and by how carefully PPO is tuned. The main advantage of RLHF is the online generation during PPO, which can discover novel high-reward outputs not present in the training data. For most production use cases where you already have a representative dataset, DPO/IPO/KTO are the better starting point.
When NOT to use each method
Do not use any of these methods if you have fewer than a few hundred preference examples. The signal-to-noise ratio at that scale is too low. Collect at least 500–1000 examples, and prefer 5000+ for reliable results.
Do not use RLHF if you are budget-constrained or shipping on a timeline under four weeks. The full pipeline (preference collection, reward model training, PPO) with hyperparameter tuning and reward model debugging routinely takes 2–3 months for teams that are new to it.
Do not use DPO or IPO if your data is binary per-output feedback with no pairwise structure. You will have to fabricate pairs from unrelated outputs, which introduces noise. Use KTO instead.
Do not use KTO if you have clean pairwise preferences and enough compute for DPO. Pairwise comparisons carry more information per example, so DPO will converge faster with fewer total annotations.
Do not skip evaluating your aligned model on capability benchmarks. Alignment training can cost some general capability. If your aligned model drops 5% on MMLU relative to the SFT checkpoint, the policy has likely drifted too far from the reference (under-regularized, not over-regularized): raise beta or train for fewer steps. Run MMLU, HellaSwag, and a task specific to your domain before and after alignment.
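A before/after check is simple enough to automate. The sketch below assumes you already have benchmark scores as fractions in two dicts keyed by benchmark name; the function name, the 0.02 threshold, and the scores in the usage example are all made up for illustration.

```python
def capability_regressions(before, after, max_drop=0.02):
    """Return benchmarks whose score fell by more than max_drop (absolute)."""
    return {
        name: round(before[name] - after[name], 4)
        for name in before
        if name in after and before[name] - after[name] > max_drop
    }


# Hypothetical scores, not real measurements.
sft_scores = {"mmlu": 0.65, "hellaswag": 0.80}
aligned_scores = {"mmlu": 0.60, "hellaswag": 0.795}
flagged = capability_regressions(sft_scores, aligned_scores)
```

With these made-up numbers only MMLU is flagged, since the HellaSwag drop is under the threshold. Wiring a check like this into your training pipeline keeps a bad beta from slipping through unnoticed.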
TL;DR
- RLHF uses a trained reward model plus PPO optimization. It is the most expensive but supports online exploration. Use it when you have large compute budgets and a team that can manage the complexity.
- DPO eliminates the reward model and optimizes a closed-form loss on static preference pairs. It is the simplest and cheapest. Use it for clean pairwise data when compute is constrained.
- IPO adds identity regularization to DPO, producing more stable training on noisy preferences. Use it when your annotation quality is inconsistent.
- KTO works with binary per-example feedback (good/bad) instead of pairwise comparisons. Use it when you only have production logs without explicit preference pairs.
- All four require a strong SFT base model, a frozen reference model, and a minimum of several hundred examples. All four risk capability regression — evaluate on standard benchmarks before and after alignment.
Next post
Pairwise preference data is the gold standard for alignment, but collecting it at scale is expensive and annotator agreement is often low. Next time: how to build and maintain a preference dataset — sampling strategy, inter-annotator agreement metrics, and detecting when your annotation pipeline is quietly poisoning your model.