The famous InstructGPT result is still the cleanest argument for post-training: labelers preferred the 1.3B aligned model's outputs over the 175B GPT-3 base on instruction-following, despite a 100x scale gap (the 175B aligned model won head-to-head against base GPT-3 ~85% of the time). Alignment beat scale.
That number got a lot of people to implement RLHF. Most of them later ripped it out and switched to DPO. A smaller group skipped both and went to verifier-based RL.
This post is the decision tree I wish I'd had when I started: what each pipeline actually looks like in TRL, where it breaks, and which one you should reach for first in 2026. The code blocks are runnable end-to-end against open weights — pick one and you have a working stack by tomorrow.
The three-way choice
Before any code, the picture:
- PPO RLHF — sample, score with a reward model, update with PPO under a KL leash. The original InstructGPT recipe. Powerful, fiddly, expensive.
- DPO — collapse the reward model and the RL loop into a single supervised loss on preference pairs. Trains like SFT, no sampling loop.
- RLVR — verifier-based RL. The reward is ground truth (unit tests pass, math answer is correct, JSON parses). No human preferences at all.
A rough rule that holds in most post-training shops I've talked to:
- Style, tone, instruction-following → DPO by default, PPO only if you can afford on-policy sampling.
- Math, code, structured output, tool-use → RLVR. Don't waste a reward model on something a checker can score.
- Mixed product behavior → SFT first, then DPO, then a verifier-RL pass on the verifiable slices.
The rest of this post is the why behind that table, and the actual training code.
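The rule of thumb above is small enough to encode as a toy dispatcher. The task labels are my own shorthand for this post, not any library's API:

```python
def pick_pipeline(task: str, can_sample_online: bool = False) -> str:
    """Toy encoding of the rule of thumb above -- labels are illustrative."""
    verifiable = {"math", "code", "structured_output", "tool_use"}
    if task in verifiable:
        return "RLVR"  # a checker can score it; don't waste a reward model
    if task in {"style", "tone", "instruction_following"}:
        return "PPO" if can_sample_online else "DPO"
    # Mixed product behavior: layered passes
    return "SFT -> DPO -> RLVR on verifiable slices"

print(pick_pipeline("code"))  # RLVR
print(pick_pipeline("tone"))  # DPO
```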
SFT first, always
Every pipeline below assumes you've done SFT. The SFT model is both the starting policy for the RL/DPO step and the frozen reference the KL term anchors against.
```python
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft").select(range(5000))

trainer = SFTTrainer(
    model=MODEL,
    train_dataset=ds,
    args=SFTConfig(
        output_dir="qwen-sft",
        per_device_train_batch_size=4,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    processing_class=tokenizer,  # 'tokenizer=' in older TRL versions
)
trainer.train()
```
SFT teaches the model to imitate a fixed target. It runs out of road the moment "good" isn't a single sentence away — helpfulness, tone, "did you actually answer the question" are comparative judgments, not next-token predictions. That's the whole reason the other stages exist.
Path A: classical PPO RLHF
Step 1 — train the reward model
The reward model (RM) is a scalar head on top of a transformer: (prompt, response) → r. You train it on pairwise comparisons with the Bradley-Terry loss:
L = -log σ(r(x, y_chosen) - r(x, y_rejected))
Translation: push the score of the chosen response above the rejected one, by enough margin that the implied sigmoid probability matches how often labelers actually preferred it.
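In code the loss is one line. Here is a scalar, pure-Python sketch of what the trainer computes batched over pairs (the function name is mine, not TRL's):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected) for one comparison.

    Written as log1p(exp(-margin)), the numerically stable form of
    -log(sigmoid(margin)).
    """
    margin = r_chosen - r_rejected
    return math.log1p(math.exp(-margin))

print(round(bradley_terry_loss(1.0, 1.0), 3))  # 0.693 -- equal scores, chance level
print(round(bradley_terry_loss(3.0, 1.0), 3))  # 0.127 -- a 2-point margin
```

Note that equal scores give log 2: the RM assigns 50/50, exactly what "no preference learned yet" should cost.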
```python
from trl import RewardTrainer, RewardConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset

RM_BASE = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(RM_BASE)
model = AutoModelForSequenceClassification.from_pretrained(RM_BASE, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id  # sequence classification needs an explicit pad id

ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train").select(range(10000))

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(
        output_dir="qwen-rm",
        per_device_train_batch_size=8,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
        max_length=1024,
    ),
    train_dataset=ds,
    processing_class=tokenizer,  # 'tokenizer=' in older TRL versions
)
trainer.train()
```
Warning: the RM overfits fast. Track validation pairwise accuracy, not training loss. If train accuracy keeps climbing while eval plateaus around 0.65–0.70, stop training. A slightly underfit RM is far better than a sharp one — sharp RMs are the easiest to exploit.
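Pairwise accuracy is just "how often does the RM score chosen above rejected" on held-out pairs. A sketch of the metric itself (the scores here are illustrative stand-ins for real RM outputs):

```python
def pairwise_accuracy(scores):
    """scores: list of (r_chosen, r_rejected) pairs from a held-out preference set."""
    wins = sum(1 for r_c, r_r in scores if r_c > r_r)
    return wins / len(scores)

# Held-out pairs scored by the RM (made-up numbers):
eval_scores = [(1.9, 0.2), (0.4, 1.1), (2.3, 2.0), (0.8, 0.7)]
print(pairwise_accuracy(eval_scores))  # 0.75
```

Human inter-annotator agreement on preference data is itself often in the 0.7 range, which is why eval accuracy plateauing there is a signal to stop, not to push harder.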
OpenAI used a 6B RM against a 175B policy. The RM doesn't need to be as big as the policy; it just needs to be a stable judge.
Step 2 — PPO with a KL penalty
PPO samples completions from the current policy, scores them with the RM, and updates the policy with clipped policy-gradient. The KL penalty is what keeps the run from imploding:
r_total = r_RM(x, y) - β · KL(π_θ(·|x) || π_ref(·|x))
Drop the KL term and the policy walks off the manifold the RM was trained on, finds a strange region of token space that scores high, and produces nonsense. With KL, every step is leashed to the SFT reference.
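Concretely, the penalty is applied per token as the log-ratio between policy and reference, with the scalar RM score added on the final token, as in InstructGPT-style implementations. A pure-Python sketch (names and shapes are illustrative, not TRL internals):

```python
def shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.05):
    """Per-token KL-penalized reward for one sampled completion.

    logp_policy / logp_ref: per-token log-probs of the sampled tokens under
    the current policy and the frozen SFT reference. The -beta * (log pi -
    log ref) term leashes every token; the RM score lands on the last one.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

# Policy slightly more confident than the reference on every token:
r = shaped_rewards(rm_score=1.0,
                   logp_policy=[-1.0, -0.5, -0.2],
                   logp_ref=[-1.2, -0.6, -0.4],
                   beta=0.05)
# Every token pays a small KL tax; the final token also collects the RM score.
```

The shape of this is the point: the RM only speaks once per completion, but the KL leash tugs on every single token.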
```python
from trl import PPOTrainer, PPOConfig
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("qwen-sft")
policy = AutoModelForCausalLM.from_pretrained("qwen-sft")
ref = AutoModelForCausalLM.from_pretrained("qwen-sft")  # frozen reference
rm = AutoModelForSequenceClassification.from_pretrained("qwen-rm", num_labels=1)
# PPO also needs a value model; initializing it from the RM is a common choice.
value_model = AutoModelForSequenceClassification.from_pretrained("qwen-rm", num_labels=1)

# Prompt-only dataset: recent PPOTrainer versions expect tokenized prompts
# (an "input_ids" column); swap in your own prompt source.
prompts = load_dataset("trl-lib/ultrafeedback_binarized", split="train").select(range(2000))
prompt_dataset = prompts.map(
    lambda x: tokenizer(x["chosen"][0]["content"]),
    remove_columns=prompts.column_names,
)

config = PPOConfig(
    output_dir="qwen-ppo",
    learning_rate=1e-6,
    per_device_train_batch_size=4,
    num_mini_batches=2,  # older TRL versions called this mini_batch_size
    num_ppo_epochs=4,
    kl_coef=0.05,        # β — start here
    cliprange=0.2,
    cliprange_value=0.2,
    bf16=True,
)
trainer = PPOTrainer(
    args=config,
    processing_class=tokenizer,  # 'tokenizer=' in older TRL versions
    model=policy,
    ref_model=ref,
    reward_model=rm,
    value_model=value_model,
    train_dataset=prompt_dataset,
)
trainer.train()
```
Three dashboards to keep open:
- Mean reward — should rise, then plateau. If it keeps climbing past your RM's eval accuracy ceiling, the policy is hacking the RM.
- KL to reference — should stay bounded. A spike means the policy is sprinting away from SFT. Raise kl_coef.
- A separate judge on held-out prompts — never trust the RM as ground truth. Read samples, or score with a different model entirely.
kl_coef between 0.02 and 0.2 covers most cases. I start at 0.05 and only move it when the KL graph misbehaves.
Why PPO breaks
After a few runs the failure modes get predictable:
- Reward hacking — the policy finds outputs the RM loves and humans don't. Karpathy's line that RLHF is "just barely RL" is exactly this — the RM is a vibe check trained on a few thousand comparisons, and the policy is a much stronger optimizer than the RM is a judge.
- Sycophancy — if labelers preferred responses that agreed with them, the RM learns "agreement = good," and the policy agrees with factual errors. Fix the data, not the optimizer.
- Mode collapse — the policy narrows onto a few high-reward templates. Entropy drops, and you'll see the same opener over and over at temperature 1.0.
- Alignment tax — RLHF'd models often regress on raw capability benchmarks like MMLU. You're trading capability for instruction-following, which is the right call for chat products and the wrong one for a model used as a backbone.
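Mode collapse in particular is cheap to smoke-test: sample at temperature 1.0 and count distinct openers. A toy version of that check (the sample strings and the 5-word window are arbitrary choices of mine):

```python
def opener_diversity(samples, n_words=5):
    """Fraction of distinct n-word openers across sampled completions.

    A healthy policy at temperature 1.0 should score near 1.0; a collapsed
    one converges on a handful of templates and the ratio craters.
    """
    openers = {" ".join(s.split()[:n_words]) for s in samples}
    return len(openers) / len(samples)

samples = [
    "Great question! Let me break this down step by step for you.",
    "Great question! Let me break this down into three parts here.",
    "Sure, here is a short answer to your question about PPO.",
    "Great question! Let me break this down carefully before answering.",
]
print(opener_diversity(samples))  # 0.5
```

It's crude, but a number that drops run-over-run is exactly the early warning the entropy curve gives you, without needing logprobs.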
Path B: DPO — skip the RL loop
Direct Preference Optimization (Rafailov et al., 2023) folds the RM and PPO into a single supervised loss directly on preference pairs:
L_DPO = -log σ(β · [log π_θ(y_w|x)/π_ref(y_w|x) - log π_θ(y_l|x)/π_ref(y_l|x)])
No reward model. No sampling loop. No value head. Same (chosen, rejected) data as the RM stage above, plus your frozen reference policy.
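The loss is simple enough to write out for a single pair. A pure-Python sketch using sequence-level log-probs (the function name and example numbers are mine):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_w / logp_l: sequence log-probs of the chosen (w) and rejected (l)
    responses under the policy; ref_* are the same under the frozen reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # -log sigmoid(margin), stable form

# Policy has moved toward the chosen response relative to the reference:
loss = dpo_loss(logp_w=-10.0, logp_l=-14.0, ref_logp_w=-12.0, ref_logp_l=-13.0)
print(round(loss, 3))  # 0.554
```

At initialization the policy equals the reference, both log-ratios are zero, and the loss is log 2 — the same chance-level starting point as the Bradley-Terry RM loss, which is no accident: DPO's implicit reward is exactly that log-ratio.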
```python
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

policy = AutoModelForCausalLM.from_pretrained("qwen-sft")
ref = AutoModelForCausalLM.from_pretrained("qwen-sft")
tokenizer = AutoTokenizer.from_pretrained("qwen-sft")
ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train").select(range(10000))

trainer = DPOTrainer(
    model=policy,
    ref_model=ref,
    args=DPOConfig(
        output_dir="qwen-dpo",
        per_device_train_batch_size=4,
        learning_rate=5e-7,
        beta=0.1,  # KL strength, same role as PPO's kl_coef
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=ds,
    processing_class=tokenizer,  # 'tokenizer=' in older TRL versions
)
trainer.train()
```
DPO wins when:
- You have static preference data and don't want to maintain an RM service.
- You want a training run that looks like SFT operationally — same trainer pattern, same monitoring, same failure profile.
- You don't need on-policy exploration. DPO learns from a fixed dataset; PPO can sample fresh comparisons.
PPO still wins when:
- You can generate fresh comparisons mid-training (online RLHF).
- The preference signal is non-stationary and DPO's frozen dataset goes stale.
- You're running a frontier-scale RM whose inference cost is justified.
For most teams shipping post-training in 2026, DPO (or its variants — IPO, KTO, SimPO) is the default. PPO RLHF still earns its place at the top of the budget curve.
Path C: RLVR — when you have a real checker
For domains with a real verifier — math, code, structured output, tool-use success — RLVR sidesteps the reward-model problem entirely. The reward is ground truth, not a learned vibe check. DeepSeek-R1 and o1-style training are the canonical examples.
The pipeline shape:
- Sample completions from the policy.
- Run them through a checker — execute the code, check the math answer, validate the JSON.
- Reward = 1 if pass, 0 if fail (or a richer shaped reward if you have partial credit).
- PPO-update against that reward, KL-anchored to the SFT reference exactly like classical RLHF.
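A verifier reward is just a function from completion to a score. Two minimal instances — JSON validity and exact-match math answers — with toy answer extraction (real math checkers normalize much harder):

```python
import json

def json_reward(completion: str) -> float:
    """1.0 if the completion parses as a JSON object, else 0.0."""
    try:
        return 1.0 if isinstance(json.loads(completion), dict) else 0.0
    except json.JSONDecodeError:
        return 0.0

def math_reward(completion: str, answer: str) -> float:
    """Exact match on the final token as the 'answer' -- toy extraction."""
    return 1.0 if completion.strip().split()[-1] == answer else 0.0

print(json_reward('{"name": "qwen"}'))        # 1.0
print(json_reward('not json'))                # 0.0
print(math_reward("The answer is 42", "42"))  # 1.0
```

Recent TRL releases let you plug plain Python reward functions like these into their online trainers (e.g. GRPOTrainer's reward_funcs) — check your version's expected callable signature before wiring them in.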
The big advantage is that reward hacking gets much harder. A unit test either passes or it doesn't — there's no spurious phrase the policy can latch onto. The reward signal scales with capability instead of fighting it, which is why the recent capability jumps on reasoning benchmarks came from this direction rather than from bigger RMs.
The catch: it only works where you can build a cheap, reliable checker. For "is this helpful and polite," you're still in RLHF/DPO territory.
Putting it together
The minimal mental model:
- SFT gives you a model that follows instructions in form.
- Reward modeling lets you express comparative preferences when no ground truth exists.
- PPO + KL tunes the policy against those preferences without letting it wander.
- DPO collapses 2 and 3 into one supervised step — usually the right call for offline preference data.
- RLVR replaces all of the above wherever you have a real checker.
If I were standing up alignment from scratch this quarter: SFT, then DPO on offline preference data for style and helpfulness, then a verifier-RL pass on the math/code/tool-use slices where checkers are cheap. PPO RLHF would only show up if I had budget for online sampling and a serious RM team to back it.
What does your alignment stack look like in 2026 — PPO, DPO, or have you moved on to verifier-based RL where you can? I'm curious which step everyone is keeping versus dropping.
If you want to go deeper:
- InstructGPT paper — arxiv.org/abs/2203.02155. Canonical reference for the full pipeline.
- DPO paper — arxiv.org/abs/2305.18290. Short, worth reading in full.