(Please note that this content was translated using GPT-o1, so there might be some mistakes or inaccuracies in the translation.)
paper link: https://arxiv.org/abs/2403.07691
Overview
First, let’s interpret the title and understand what it means:
Monolithic Preference Optimization:
- Monolithic -> Refers to working as one large, single-structure system rather than separating multiple modules or elements.
- Preference Optimization -> The process of optimizing a model toward a specific goal or criterion that the user cares about (e.g., user preference, satisfaction).
- In other words, it means optimizing preferences within a single system without separate sub-modules.
Abstract
- SFT (Supervised Fine-Tuning): A process in which a pre-trained language model (such as GPT, Llama, etc.) undergoes additional supervised learning on a specific task or domain -> leads to better performance on the given task.
- SFT plays a key role in preference alignment, and a minor penalty on disfavored generation styles during SFT is sufficient for preference alignment.
- Preference Alignment: Training a language model to generate responses in line with a specific standard (e.g., user preferences, task requirements). For example, if a user demands concise and clear answers, the model is trained to produce them. Conversely, providing excessive information or ambiguous answers is considered a “non-preferred generation style.”
- Main results:
- Up to a 12.20% win rate on AlpacaEval 2.0
- 66.19% on IFEval
- Achieved 7.32 on MT-Bench
1 Introduction
Limitations of the Existing Approach
- Pre-trained Language Models (PLMs), trained on large-scale data such as web text or textbooks, demonstrate excellent performance on various NLP tasks. However, for practical applications, additional tuning such as Instruction Tuning or Preference Alignment is needed.
- Instruction Tuning: A process where the model is trained to generalize to new tasks given instructions in natural language. It uses a unified dataset so that it can handle various inputs and produce outputs accordingly. As a result, the model acquires zero-shot or few-shot capabilities.
- Preference Alignment: Uses paired preference data to align models with human values. RLHF (reinforcement learning from human feedback) and DPO (directly adjusting the output probability distribution using preference data) fall under this category.
- Traditional preference alignment generally follows a multi-step process (SFT -> reward model training -> reinforcement learning), thus requiring an additional reference model and a supervised fine-tuning phase.
A New Approach
- A new method called ORPO (Odds Ratio Preference Optimization) is proposed as a novel monolithic alignment approach. Unlike existing methods, it can effectively suppress undesired generation styles without the need for a preparatory SFT stage or a reference model. It utilizes resources efficiently while implementing preference alignment.
- ORPO demonstrates strong performance on leaderboards for the Phi-2, Llama-2, and Mistral models. It has also outperformed existing methods (RLHF, DPO) on various benchmarks such as AlpacaEval 2.0 and IFEval.
- The figure above, based on AlpacaEval 2.0, shows the results of fine-tuning Llama-2 (7B) and Mistral (7B) using different algorithms. It displays the win rate (%) of each model, comparing RLHF (in red), DPO (in green), and ORPO (in blue). From this, we can see that ORPO outperforms RLHF and DPO.
This graph compares three alignment methods—RLHF, DPO, and ORPO.
- RLHF: After the SFT phase, a reward model is built, and the model is fine-tuned via policy gradient based on feedback from that reward model. RLHF needs a reference model and a policy, making it a relatively complex multi-stage process.
- Reward Model: Trained from human feedback -> provides a score for how “good” a model output is.
- Policy Gradient: The central algorithm for reinforcement learning. It updates the current language model (policy) based on the reward signal from the reward model.
- Reference Model: A fixed model that serves as a baseline for the current policy. By comparing policy and reference, one can calculate the KL-divergence to ensure that the policy does not deviate too far from the reference.
- DPO: A simplified alternative to RLHF that trains the language model directly on preference data, without training an explicit reward model. The standard formulation still keeps a frozen reference model, usually the SFT checkpoint, to keep the policy from drifting too far.
- DPO updates the model using the log ratios of policy-to-reference probabilities for the accepted and rejected responses, pushing the accepted response's relative likelihood up and the rejected one's down (see the loss sketch after this list).
- ORPO: A new method for aligning the language model with human preferences using preference data. Like DPO, it works from relative likelihoods, but it uses the log odds ratio of the policy's own outputs and therefore needs no reference model, learning directly from preference data. The odds-ratio term is added to the SFT loss, so preference alignment happens in a single training stage rather than a separate multi-stage pipeline.
- The figure above illustrates how log odds ratio optimization works: good responses receive a strong adaptation signal, whereas poor responses receive a weaker penalty, and these differences are reflected in the model's gradient updates.
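For contrast, here is a sketch of the two losses in standard notation (σ is the sigmoid, β is DPO's temperature hyperparameter, and π_ref is the frozen reference policy; this follows the usual formulations rather than anything shown in the figure):

```latex
% DPO: contrasts policy-vs-reference log ratios for the chosen (y_w) and rejected (y_l)
% responses, which is why a frozen reference model \pi_{ref} is required.
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\;
        \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]

% ORPO's relative term: only the policy's own odds appear, so no reference model is needed.
\mathcal{L}_{\mathrm{OR}}
  = -\log \sigma\!\left(
      \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}
    \right),
\qquad
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
```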
2 Related Works
Connections to Reinforcement Learning
- RLHF uses a Bradley-Terry model to estimate the probability that one response is preferred over a competing response, training a reward model to maximize the score of the chosen response. PPO (Proximal Policy Optimization) is the typical RL algorithm used here, enabling the model to learn from human preferences. This approach has proven scalable and general for instruction-following language models, and it was later extended by RLAIF.
- PPO (Proximal Policy Optimization): A reinforcement learning update algorithm that restricts large updates to the current policy (the language model) by clipping, making it less sensitive and more stable than some other RL methods.
- RLAIF: While RLHF requires human evaluators to rank or compare outputs, RLAIF uses a language model instead of humans to evaluate responses and train the reward model. This reduces costs and can be applied to larger data.
- However, RLHF suffers from instability in PPO, requiring extensive hyperparameter search, and a reward model’s sensitivity also causes challenges. Hence, more stable preference-alignment algorithms are needed.
Without a Reward Model
- Various methods have been studied for performing preference alignment without a separate reward model:
- DPO (Direct Preference Optimization): Integrates reward modeling into the preference learning stage, using preference data to push the probability of favorable responses above that of unfavorable ones.
- IPO (Identity Preference Optimization): Developed to avoid the overfitting that can arise in DPO; rather than pushing the preference gap ever wider, it regularizes the gap toward a fixed target.
- KTO (Kahneman-Tversky Optimization): An approach grounded in the behavioral economics theories of Kahneman and Tversky. Unlike RLHF or DPO, it does not rely on comparison data but instead incorporates cognitive biases inherent to human decision-making.
- ULMA (Unified Language Model Alignment): A unified approach to aligning language models that features a consistent training framework and varied data. It’s said to be efficient, with simpler data preparation and a universal training structure for easier extension.
SFT (Supervised Fine-Tuning) Revisited
- Preference alignment methods often rely on SFT to build a stable starting policy, which then serves as the frozen reference against which the actively updated policy is compared.
- In RLHF, the SFT model is considered the baseline policy, and empirical results also suggest that SFT is important in non-RL alignment approaches.
- Some research shows that performing SFT with a carefully curated dataset alone can yield a human-aligned language model.
- Even a small set of meticulously filtered data can produce a useful language model assistant, or one can iteratively select self-generated model outputs to refine alignment. In some cases, using only a portion of a preference dataset—carefully chosen—can suffice.
- However, theoretical and empirical work on how SFT integrates with preference alignment is limited.
3 The Role of Supervised Fine-Tuning
- This study examines the role of Supervised Fine-Tuning as the initial step for preference alignment. SFT is crucial for adapting a pre-trained language model to a desired domain by increasing the log probability of related tokens. However, it can also inadvertently increase the likelihood of undesired styles. Thus, we need to maintain the domain-adaptation effect of SFT while distinguishing and reducing undesired generation styles.
- The figure above shows the log probabilities of accepted vs. rejected responses while SFT is performed on OPT-350M with the HH-RLHF dataset. Over training, the log probabilities of both accepted and rejected responses increase. Ideally, the accepted responses' log probabilities would rise while the rejected ones remained low, but this figure shows otherwise.
- In other words, while SFT helps domain adaptation, additional measures are needed to suppress undesired response styles.
No Direct Penalty from Cross-Entropy
- Cross-entropy loss penalizes low logit values for the reference (correct) tokens. Mathematically, it takes the standard token-level form, written out after the variable definitions below.
- yᵢ: A boolean that indicates if the token is the correct token (1 if correct, 0 if not).
- pᵢ: The predicted probability that the token is correct (between 0 and 1).
- m: The length of the sequence. For instance, if the input sentence is “Hello World,” m=2 by word, m=11 by character.
- |V|: The vocabulary size of the model.
- Cross-entropy aims to maximize log(pᵢ) only for the correct token yᵢ=1. For the other tokens yᵢ=0, there is no additional suppression effect.
- Consequently, the log probability of rejected responses can increase as well, which is undesirable from a preference alignment standpoint.
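For reference, using the variables defined above (with an explicit position index k added for clarity), the cross-entropy loss is presumably the standard token-level form:

```latex
% Token-level cross-entropy, averaged over the m positions of the sequence and
% summed over the vocabulary V; y_{k,i} is the one-hot target at position k.
\mathcal{L}_{\mathrm{CE}} = -\frac{1}{m} \sum_{k=1}^{m} \sum_{i=1}^{|V|} y_{k,i} \log p_{k,i}
```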
Generalization Across Two Response Styles
- SFT alone does not resolve the over-adjustment of accepted vs. rejected responses, as demonstrated in the study below:
- Fine-tuning an OPT-350M model on the HH-RLHF dataset using accepted responses only.
- Monitoring the log probability of rejected responses showed that both accepted and rejected responses increased.
- This indicates that while cross-entropy loss effectively orients the model toward conversational (or domain-specific) data, it does not penalize undesired responses, so rejected responses can end up with higher log probability than accepted ones.
Penalizing Undesired Generations
- Past research indicates that adding a penalty term to the loss function for undesired outputs can mitigate degenerative effects.
- Example: To prevent repetitive outputs, prior work adds a penalty term for assigning high probability to tokens that appear frequently in the recent context (an unlikelihood-style sketch follows this list).
- Building on the idea that dynamic penalization of rejected responses is effective, we designed a unified preference-alignment method that penalizes undesired responses on the fly.
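As a concrete illustration of the kind of penalty mentioned above, an unlikelihood-style term (in the spirit of earlier work on repetition; the exact form used there may differ) adds a negative-candidate penalty to the usual NLL, where C_t is a set of undesired tokens at step t (for example, recently generated tokens) and α weights the penalty:

```latex
% NLL on the reference token plus a penalty that pushes down the probability
% of each undesired candidate c in C_t; alpha controls the penalty strength.
\mathcal{L}_t = -\log p_\theta(x_t \mid x_{<t})
  \;-\; \alpha \sum_{c \in \mathcal{C}_t} \log\bigl(1 - p_\theta(c \mid x_{<t})\bigr)
```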
4 Odds Ratio Preference Optimization
- Now we introduce a new algorithm called ORPO for preference alignment. This algorithm adds an odds-ratio-based penalty to the usual negative log-likelihood (NLL) loss so that the model effectively differentiates between preferred and non-preferred responses.
4.1 Background
- ORPO starts from the average log-likelihood of the output sequence y given the input sequence x (the formulas referenced in this subsection are reconstructed after the definitions below).
- y: The output sequence, consisting of m tokens.
- x: The input sequence.
- Pθ(yₜ | x, y<ₜ): The probability that the t-th token yₜ of the output sequence is generated given the input x and the preceding tokens y<ₜ.
- m: The length (in tokens) of the output sequence y.
- Essentially, the log probability of the entire sequence is summed across tokens and then averaged to avoid dependence on sequence length.
- odds: The ratio between the probability that y is generated and that it is not generated, indicating the relative likelihood of generation. If the generation probability is Pθ(y | x), then its non-generation probability is 1 - Pθ(y | x).
- oddsθ(y | x) = k: Means that for a given input x, the model θ finds y k times more likely than not-y.
- The odds ratio between a winning response yw (accepted) and a losing response yl (rejected) indicates how much more preferred yw is over yl.
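Written out, the quantities above take the following standard forms (a reconstruction of the referenced formulas):

```latex
% Length-normalized log-likelihood of the output sequence y (m tokens) given input x
\log P_\theta(y \mid x) = \frac{1}{m} \sum_{t=1}^{m} \log P_\theta\bigl(y_t \mid x, y_{<t}\bigr)

% Odds of generating y versus not generating it
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}

% Odds ratio between the chosen response y_w and the rejected response y_l
\mathbf{OR}_\theta(y_w, y_l) = \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}
```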
4.2 Objective Function of ORPO
- Objective Function: In machine learning or optimization problems, this is the function that the model aims to optimize during training.
- LSFT: The SFT loss term in the ORPO objective function.
- LOR: The term that maximizes the odds ratio between accepted responses and rejected responses. “OR” stands for “Odds Ratio.”
- λ: A hyperparameter that controls the relative importance of the two loss terms.
- E(x,yw,yl): The expected value over the probability distribution of (x, yw, yl).
- LOR: Encourages a higher log odds ratio between yw (preferred) and yl (non-preferred). It wraps the log odds ratio with the log sigmoid function, so we minimize LOR by increasing the log odds ratio between yw and yl.
- σ: By wrapping the log ratio in a sigmoid function, we ensure smoothness and differentiability. Large input values approach 1, while small inputs approach 0. Consequently, LOR ranges from 0 to +∞ and we want to push it closer to 0.
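Putting the pieces together, here is a minimal PyTorch sketch of the objective described above. It assumes chosen_logps and rejected_logps are length-normalized log-likelihoods log Pθ(y|x) computed elsewhere, and lam stands in for λ; this is an illustration, not the paper's official implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
    """Sketch of L_ORPO = L_SFT + lambda * L_OR.

    chosen_logps / rejected_logps: length-normalized log P_theta(y|x)
    for the chosen and rejected responses (both < 0), shape (batch,).
    """
    # log odds = log(p / (1 - p)) = log p - log(1 - p); log(1 - p) is
    # computed stably from log p via log1p(-exp(log p)).
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # L_OR = -log sigmoid(log odds ratio): minimized by widening the gap
    # between the chosen and rejected odds.
    l_or = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    # L_SFT: negative log-likelihood of the chosen response.
    l_sft = -chosen_logps

    return (l_sft + lam * l_or).mean()

# Toy usage with made-up average log-probabilities.
chosen = torch.tensor([-0.9, -1.2])
rejected = torch.tensor([-2.0, -1.8])
print(orpo_loss(chosen, rejected))
```

Minimizing this total loss performs SFT on the chosen response while simultaneously pushing its odds above those of the rejected one, which is exactly the single-stage behavior described above.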
4.3 Gradient of ORPO
- The gradient of LOR (∇θLOR) shows that using an odds-ratio-based loss is justified for preference alignment. The gradient effectively adjusts the probability difference between the winning and losing responses.
- The gradient of LOR is described as the product of δ(d) and h(d).
- δ(d) is a penalty term that suppresses non-preferred responses.
- In the extreme, if oddsθ(yw|x) ≈ oddsθ(yl|x) (i.e., when the odds of the winning and losing responses are similar), δ(d) approaches 1/2.
- Conversely, if oddsθ(yw|x) ≫ oddsθ(yl|x), meaning the winning response has a much higher odds than the losing response, δ(d) approaches 0.
- h(d) updates the parameters θ in a way that raises the probability of the desired label yw and lowers that of yl.
- Because ∇θlog p(θ) = (1/p(θ))∇θp(θ), when Pθ(yw|x) is already high, the update magnitude is larger. Conversely, if Pθ(yw|x) is small, the update magnitude is smaller. Intuitively, “if the model is already confident about yw, a small parameter change could significantly alter that probability, so we apply a larger update.”
- 1 − Pθ(yw|x) is the probability mass not yet assigned to yw. Because h(d) divides by this term, the update scale increases as Pθ(yw|x) grows and shrinks as it falls.
- For yl, a negative sign is applied, so it is updated in the opposite direction of yw.
Why “the higher the probability, the larger the update”?
- ORPO aims to widen the gap between the probabilities of yw (preferred) and yl (non-preferred). The 1/(1 − Pθ(·|x)) scaling in h(d) concentrates updates where the model is already confident, making the good responses better and pushing the bad ones further down (the full decomposition is reconstructed below).
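For reference, the decomposition discussed in this section can be written as follows (a reconstruction based on the descriptions above, with d denoting a preference triple (x, yw, yl); see the paper for the exact derivation):

```latex
\nabla_\theta \mathcal{L}_{\mathrm{OR}} = \delta(d) \cdot h(d)

% Gating term: about 1/2 when the two odds are similar, near 0 when y_w already dominates
\delta(d) = \left[ 1 + \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right]^{-1}

% Contrast term: raises P_theta(y_w|x) and lowers P_theta(y_l|x), scaled by
% 1 / (1 - P_theta(. | x)) so that already-confident responses get larger updates
h(d) = \frac{\nabla_\theta \log P_\theta(y_w \mid x)}{1 - P_\theta(y_w \mid x)}
     - \frac{\nabla_\theta \log P_\theta(y_l \mid x)}{1 - P_\theta(y_l \mid x)}
```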
5 Experimental Settings
5.1 Training Setup
Model
- We scale the OPT models from 125M to 1.3B parameters and compare four methods:
- Supervised Fine-Tuning (SFT)
- Proximal Policy Optimization (PPO)
- Direct Preference Optimization (DPO)
- Odds Ratio Preference Optimization (ORPO)
- Starting from an SFT model trained on the accepted responses for a single epoch, the PPO and DPO models are then trained using the TRL library (an illustrative usage sketch follows this model list).
- We denote this by adding a “+” (e.g., +DPO).
- Additionally, the following models are included in the training:
- Phi-2 (2.7B): A pre-trained language model with strong downstream performance
- Llama2 (7B)
- Mistral (7B)
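As noted above, the PPO and DPO baselines are trained with the TRL library. Below is a hedged sketch of what DPO training with TRL typically looks like; the checkpoint and toy dataset are placeholders rather than the paper's actual configuration, and exact argument names (for example, tokenizer vs. processing_class) vary across TRL releases.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholder checkpoint; the paper starts +DPO from its own SFT models.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
ref_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # frozen reference
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# Tiny toy preference set with the prompt/chosen/rejected columns DPOTrainer expects;
# real runs would use HH-RLHF or UltraFeedback preprocessed into this format.
train_dataset = Dataset.from_dict({
    "prompt": ["How can I stay focused while studying?"],
    "chosen": [" Work in short, scheduled blocks and remove distractions."],
    "rejected": [" Just try harder."],
})

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=DPOConfig(output_dir="opt-350m-dpo", num_train_epochs=1, beta=0.1),
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # newer TRL versions use processing_class instead
)
trainer.train()
```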
Dataset
- Each training configuration and model is tested on two datasets:
- Anthropic’s HH-RLHF
- Binarized UltraFeedback
- We filter out cases where yw = yl (preferred and non-preferred are identical) or where yw or yl is empty.
Reward Model
- We train separate reward models for OPT-350M and OPT-1.3B.
- The objective function of the reward model is as follows:
- The difference between the scores of the preferred and non-preferred responses is passed through a sigmoid, so the output is close to 1 when the preferred response scores much higher and close to 0 otherwise. The training objective maximizes the log of this value (equivalently, minimizes its negative), rewarding a large positive score gap.
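Written out, this is the standard Bradley-Terry style pairwise loss (a reconstruction of the referenced formula; r denotes the reward model's scalar score):

```latex
% Maximize the log-sigmoid of the score gap between the chosen response y_w
% and the rejected response y_l; equivalently, minimize the negative of it.
\mathcal{L}_{\mathrm{RM}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Bigl[
  \log \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)
\Bigr]
```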
5.2 Leaderboard Evaluation Summary
- AlpacaEval Evaluation: We compare ORPO to other instruction-tuned models on the AlpacaEval 1.0/2.0 benchmarks.
- Models under comparison: Llama-2 Chat (7B, 13B), Zephyr (α, β)
- AlpacaEval 1.0: Uses GPT-4 as the evaluator, checking whether a model's response is preferred over text-davinci-003.
- AlpacaEval 2.0: Uses GPT-4-turbo as the evaluator, checking whether a model's response is preferred over GPT-4 responses.
MT-Bench Evaluation: This benchmark assesses how well the model follows instructions in multi-turn conversations. GPT-4 is used as the evaluator to judge the quality of answers to challenging questions.
6 Results and Analysis
6.1 Performance of ORPO
- Compared to RLHF or DPO, ORPO enables the model to learn user preferences more quickly and stably.
- Even with a small amount of data and limited training, it achieves high-level instruction-following performance (shown in Llama-2, Mistral, Phi-2, etc.).
6.2 Single-Turn Instruction-Following
- Applying ORPO to Phi-2 (2.7B) significantly boosts its AlpacaEval scores, surpassing Llama-2 Chat 7B (AlpacaEval 1.0: 71.80% vs. 71.34%; AlpacaEval 2.0: 6.35% vs. 4.96%).
- Applying ORPO to Llama-2 (7B) outperforms the RLHF-tuned Llama-2 Chat 7B and 13B models on AlpacaEval (81.26% on 1.0, 9.44% on 2.0).
- Applying ORPO to Mistral (7B) produces Mistral-ORPO-α (7B) and β (7B), which rival or outperform the Zephyr series in single-turn tasks.
6.3 Multi-Turn Instruction-Following
- According to MT-Bench, the Mistral-ORPO series (7B) achieves quite competitive scores (7.23–7.32) compared to larger or commercial models such as Llama-2-Chat 70B and Claude.
- This suggests that even single-turn training data can help the model generalize to multi-turn dialogues.
6.4 Win Rate from the Reward Model Perspective
- On OPT (125M, 350M, 1.3B), ORPO achieves a much higher win rate compared to SFT, PPO, or DPO.
- The larger the model, the more ORPO outperforms DPO (e.g., a 70.9% win rate against DPO at the OPT-1.3B scale).
- ORPO reliably obtains higher rewards across different datasets (UltraFeedback, HH-RLHF).
6.5 Lexical Diversity Analysis
- ORPO-trained models exhibit slightly lower per-input diversity compared to DPO (i.e., they provide more consistent responses for a given prompt), but higher across-input diversity overall.
- This indicates that ORPO is producing more optimized answers for each individual request while maintaining varied response patterns across multiple inputs.
In summary, ORPO is an efficient and stable method compared to previous approaches (DPO, RLHF, PPO) and achieves improved instruction-following and reward gains across various model sizes. It also balances intra-prompt consistency with inter-prompt variability, producing responses that better meet users’ demands.
7 Discussion
7.1 Comparison to Probability Ratio
Odds Ratio vs. Probability Ratio
- Traditional preference alignment methods often apply probability ratios (PR) within SFT.
- ORPO, however, adopts the odds ratio (OR), which offers the following advantages over probability ratio:
- The odds ratio reflects the model's preference between responses more gently, without exaggerating small probability differences.
- Using a probability ratio can lead to overly strong suppression of non-preferred responses.
- The odds ratio avoids extreme suppression, enabling stable co-training of SFT and preference alignment.
Experimental Simulation
- Drawing random probability pairs (X₁, X₂), computing both PR and OR, and then comparing log-scale distributions shows that:
- PR yields a sharper, more skewed distribution,
- OR, by contrast, is smoother and more spread out.
- When minimizing log sigmoid loss in preference alignment, PR can overly suppress non-preferred tokens, causing abnormal model convergence.
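A small simulation in the spirit of the comparison above; it assumes the probability pairs are drawn uniformly from (0, 1), which may differ from the paper's exact sampling scheme, but it shows the qualitative difference between the two log-ratio distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Random probability pairs (X1, X2) standing in for P(y_w|x) and P(y_l|x).
x1 = rng.uniform(1e-6, 1 - 1e-6, size=n)
x2 = rng.uniform(1e-6, 1 - 1e-6, size=n)

# Log probability ratio vs. log odds ratio for the same pairs.
log_pr = np.log(x1 / x2)
log_or = np.log((x1 / (1 - x1)) / (x2 / (1 - x2)))

# The probability-ratio distribution is sharply peaked, while the
# odds-ratio distribution is smoother and more spread out.
for name, vals in [("log PR", log_pr), ("log OR", log_or)]:
    print(f"{name}: std={vals.std():.2f}, "
          f"1%={np.quantile(vals, 0.01):.2f}, 99%={np.quantile(vals, 0.99):.2f}")
```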
Conclusion: Why Odds Ratio is More Suitable
- In the SFT-inclusive preference alignment setting, moderate “prioritization” outperforms harsh suppression.
- Probability ratio triggers excessive suppression, so ORPO chooses odds ratio to properly reduce non-preferred responses and promote preferred ones.
- This prevents undue logit flattening (which degrades quality) and lets SFT + preference alignment proceed in harmony.
7.2 Minimizing LOR
Reflecting Preferences During ORPO Training
- Visualization of the training process under ORPO reveals that while preferred responses maintain or increase their log probabilities, undesired ones progressively decline.
Changes in Log Odds Ratio
- As the log odds ratio rises, log probabilities of undesired responses steadily drop. Hence, ORPO retains the domain-adaptation benefit of SFT while reducing the likelihood of undesirable outputs.
Effect of λ Parameter
- By varying λ in ORPO’s loss function, one can analyze how the gap in log probability between preferred and non-preferred responses changes. A larger λ results in stronger suppression of undesired responses, whereas a smaller λ applies milder suppression.
8 Conclusion
- This paper revisited the role of SFT and introduced a single-stage preference alignment method called ORPO, which forgoes a reference model and integrates preference alignment directly.
- Compared to SFT or RLHF, ORPO was consistently more preferred by reward models at all scales.
- ORPO’s win rate against DPO also increases as model size grows.
- Specifically, on the 2.7B and 7B models, ORPO surpassed larger state-of-the-art models on AlpacaEval, and Mistral-ORPO-α and β reached win rates of 11.33% and 12.20% on AlpacaEval 2.0 and scores of 7.23 and 7.32 on MT-Bench, demonstrating both efficiency and efficacy.
Limitations
- While we compared DPO and RLHF, we have not covered a more extensive range of preference alignment algorithms.
- Future work will explore models beyond 7B and validate generalization across multiple domains and data quality settings.
- Additionally, we plan to deepen our analysis of how ORPO affects the internal workings of pre-trained models across subsequent preference alignment stages.