Welcome to part 3 of our LLM series! If you thought pre-training was complex, wait until you see what it takes to make these raw language models actually helpful, honest, and harmless. Today, we're diving deep into alignment techniques—the secret sauce that transforms next-token predictors into useful assistants.
Let's start with a surprising fact: A pre-trained LLM is often worse than useless for conversation. It might complete your query with more text from its training data rather than answering it. The magic happens during alignment.
The Alignment Pipeline: A Three-Act Play
┌─────────────────────────────────────────────────────────────┐
│ The Alignment Journey │
├──────────────┬────────────────┬──────────────────────────────┤
│ Act I │ Act II │ Act III │
│ │ │ │
│ Supervised │ Reward │ Reinforcement │
│ Fine-Tuning │ Modeling │ Learning │
│ (SFT) │ (RM) │ (RLHF/DPO) │
│ │ │ │
│ Teach the │ Learn human │ Optimize for │
│ model to │ preferences │ human preferences │
│ follow │ through │ through │
│ instructions │ comparisons │ advanced algorithms │
└──────────────┴────────────────┴──────────────────────────────┘
Act I: Supervised Fine-Tuning (SFT) – Teaching Basic Manners
From Completion to Conversation
Pre-training teaches next-token prediction. SFT teaches instruction following. The difference is subtle but profound:
# Pre-training (what we covered last time)
input: "The capital of France is"
target: "Paris" # Model predicts next token
# SFT (what we're covering now)
input: "What is the capital of France?"
target: "The capital of France is Paris." # Complete response
# Key difference: We only compute loss on the response part!
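To make that masking concrete, here's a minimal sketch of how SFT implementations typically exclude prompt tokens from the loss by setting their labels to -100 (the ignore index for cross-entropy). The tiny gpt2 checkpoint and the example strings are just stand-ins:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical example pair; any instruction/response pair works the same way
prompt = "What is the capital of France?\n"
response = "The capital of France is Paris."

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

# Labels are a copy of the inputs, but prompt positions are set to -100
# so the cross-entropy loss ignores them (only the response is supervised).
# (Real pipelines compute the boundary from the chat template; this simple
#  slice assumes the prompt tokenizes identically on its own.)
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

outputs = model(input_ids=full_ids, labels=labels)
print(outputs.loss)  # loss computed on response tokens only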
The SFT Dataset Recipe
Modern SFT datasets are carefully crafted cocktails:
sft_dataset = {
    "instruction_following": [
        {"instruction": "Write Python code to sort a list",
         "response": "def sort_list(lst): return sorted(lst)"}
    ],
    "safety_training": [
        {"instruction": "How to hack a bank?",
         "response": "I cannot provide instructions for illegal activities."}
    ],
    "creative_tasks": [
        {"instruction": "Write a poem about machine learning",
         "response": "In silicon minds, patterns grow..."}
    ],
    "reasoning": [
        {"instruction": "If Alice has 3 apples and gives Bob 2, how many does she have?",
         "response": "Alice has 1 apple left. Explanation: 3 - 2 = 1"}
    ]
}
But here's the problem: SFT only teaches what to generate, not what not to generate. It's like teaching someone to drive by only showing correct turns, never showing crashes.
The Limitations of SFT: Why We Need More
Imagine asking an SFT-only model about washing a teddy bear:
User: "Can I wash my teddy bear?"
SFT Model: "No, you shouldn't wash teddy bears. The stuffing gets clumpy
and the fabric might tear. It's generally a bad idea."
✅ Factually correct
❌ Harsh, unfriendly tone
❌ No alternative suggestions
We need to teach how to say things, not just what to say. This is where preference tuning comes in.
Act II: Preference Tuning – Learning Human Judgment
The Core Insight
It's easier for humans to compare two responses than to write the perfect response from scratch. This insight powers all modern alignment techniques.
# Human preference data structure
preference_data = {
    "prompt": "Can I wash my teddy bear?",
    "chosen": """While you can try spot cleaning, machine washing might damage
the fabric or stuffing. Consider gentle hand washing instead! 😊""",
    "rejected": """No, you shouldn't wash teddy bears. The stuffing gets clumpy
and the fabric might tear. It's generally a bad idea."""
}
Data Collection Pipeline
┌─────────────────────────────────────────────────────────┐
│ Preference Data Collection │
├─────────────────────────────────────────────────────────┤
│ 1. Generation Phase: │
│ Prompt → [Model + Temperature] → Response A │
│ Prompt → [Model + Temperature] → Response B │
│ │
│ 2. Comparison Phase: │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Human Judges │ │ LLM-as-a-Judge │ │
│ │ (expensive but │ │ (scalable but │ │
│ │ gold standard) │ │ can be biased) │ │
│ └─────────────────┘ └─────────────────┘ │
│ ↓ ↓ │
│ [Rating Scale] [Pairwise Comparison] │
│ 1-5 stars A is better than B │
│ │
│ 3. Labeling: │
│ Annotators consider: │
│ - Helpfulness - Honesty - Harmlessness │
│ - Friendliness - Factuality - Conciseness │
└─────────────────────────────────────────────────────────┘
LLM-as-a-Judge: Scalable but Tricky
from openai import OpenAI

def llm_judge(prompt, response1, response2, judge_model="gpt-4"):
    """
    Use an LLM to judge which response is better
    """
    client = OpenAI()
    system_prompt = """You are an expert evaluator. Compare two responses
to a user query. Consider: helpfulness, accuracy, safety, and tone."""
    user_prompt = f"""Query: {prompt}
Response A: {response1}
Response B: {response2}
Which response is better? Return ONLY 'A' or 'B'."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )
    return response.choices[0].message.content
The Problem: LLM judges can inherit biases from their training data and may prefer verbose, flowery responses over concise, accurate ones.
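A common mitigation is to query the judge twice with the response order swapped and keep only verdicts that agree. This is a hedged sketch built on the llm_judge function above; the "discard inconsistent pairs" rule is one reasonable policy, not the only one:
def debiased_judge(prompt, response1, response2):
    """Run the judge in both orders to reduce position bias."""
    verdict_ab = llm_judge(prompt, response1, response2)  # 'A' = response1
    verdict_ba = llm_judge(prompt, response2, response1)  # 'A' = response2

    # Consistent only if the same underlying response wins both times
    if verdict_ab == "A" and verdict_ba == "B":
        return "response1"
    if verdict_ab == "B" and verdict_ba == "A":
        return "response2"
    return "tie_or_inconsistent"  # discard or re-label these pairs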
Act III: Reinforcement Learning from Human Feedback (RLHF)
The Reward Model (RM)
First, we train a model to predict human preferences:
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.transformer = base_model  # Backbone LM (typically initialized from the SFT model)
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # Get last hidden state
        outputs = self.transformer(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True
        )
        last_hidden = outputs.hidden_states[-1]

        # Use the [EOS] (last non-padded) token's representation for the reward
        eos_positions = attention_mask.sum(dim=1) - 1
        batch_indices = torch.arange(last_hidden.size(0), device=last_hidden.device)
        eos_hidden = last_hidden[batch_indices, eos_positions]

        # Predict scalar reward
        reward = self.reward_head(eos_hidden)
        return reward

# Training the reward model with the Bradley-Terry loss
def bradley_terry_loss(reward_chosen, reward_rejected):
    """
    P(prefer chosen) = σ(r_chosen - r_rejected)
    Loss = -log(σ(r_chosen - r_rejected))
    """
    diff = reward_chosen - reward_rejected
    loss = -torch.log(torch.sigmoid(diff)).mean()
    return loss
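Putting the pieces together, here's a rough sketch of a single reward-model training step: tokenize a chosen/rejected pair, score both with the RewardModel above, and backpropagate the Bradley-Terry loss. The gpt2 backbone, learning rate, and example texts are placeholders, not recommendations:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model = RewardModel(AutoModelForCausalLM.from_pretrained("gpt2"))
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

prompt = "Can I wash my teddy bear?"
chosen = prompt + " Gentle hand washing is usually safe..."
rejected = prompt + " No, never wash teddy bears."

# Score both responses in one batch; right padding keeps the last-token trick valid
batch = tokenizer([chosen, rejected], padding=True, return_tensors="pt")
rewards = reward_model(batch["input_ids"], batch["attention_mask"])  # shape (2, 1)

loss = bradley_terry_loss(rewards[0], rewards[1])
loss.backward()
optimizer.step()
optimizer.zero_grad()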
Proximal Policy Optimization (PPO): The RL Workhorse
PPO is where things get mathematically intense but conceptually beautiful:
class PPOTrainer:
    def __init__(self, policy_model, value_model, reward_model, ref_model):
        """
        Four models in memory:
        1. Policy Model (π_θ): The LLM we're optimizing
        2. Value Model (V_φ): Predicts expected future rewards
        3. Reward Model (r): Human preference predictor
        4. Reference Model (π_ref): Original SFT model (frozen)
        """
        self.policy = policy_model
        self.value = value_model
        self.reward = reward_model
        self.ref_model = ref_model

    def compute_advantages(self, rewards, values):
        """
        Generalized Advantage Estimation (GAE)
        A_t = δ_t + γλ·δ_{t+1} + (γλ)²·δ_{t+2} + ...
        where δ_t = r_t + γ·V(s_{t+1}) - V(s_t)
        """
        # Simplified implementation
        advantages = []
        gae = 0
        gamma = 0.99
        lam = 0.95
        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                delta = rewards[t] - values[t]  # terminal step: V(s_{T+1}) = 0
            else:
                delta = rewards[t] + gamma * values[t + 1] - values[t]
            gae = delta + gamma * lam * gae
            advantages.insert(0, gae)
        return torch.tensor(advantages)

    def ppo_loss(self, logprobs, old_logprobs, advantages, ref_logprobs, kl_penalty=0.1):
        """
        The core PPO objective with a KL penalty toward the reference model
        """
        # Probability ratio between the current policy and the policy that sampled the data
        ratio = torch.exp(logprobs - old_logprobs)

        # Clipped surrogate objective
        clip_epsilon = 0.2
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantages

        # KL penalty to stay close to the reference model
        # (simple estimate: mean of log π_θ - log π_ref over the sampled tokens)
        kl_div = (logprobs - ref_logprobs).mean()

        # Final loss
        loss = -torch.min(surr1, surr2).mean() + kl_penalty * kl_div
        return loss
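As a quick sanity check on the math, here's a toy run of compute_advantages and ppo_loss with made-up per-token rewards, values, and log-probabilities (the None placeholders and all numbers are purely illustrative):
import torch

# Toy example: 4 generated tokens, made-up numbers
trainer = PPOTrainer(policy_model=None, value_model=None,
                     reward_model=None, ref_model=None)  # placeholders for the sketch

rewards = [0.0, 0.0, 0.0, 1.0]   # sparse reward at the end of the response
values = [0.2, 0.3, 0.5, 0.8]    # value model predictions per token
advantages = trainer.compute_advantages(rewards, values)
print(advantages)                # per-token advantages

logprobs = torch.tensor([-1.0, -0.9, -1.2, -0.8])      # current policy
old_logprobs = torch.tensor([-1.1, -1.0, -1.1, -0.9])  # policy at sampling time
ref_logprobs = torch.tensor([-1.2, -1.0, -1.3, -1.0])  # frozen SFT reference
print(trainer.ppo_loss(logprobs, old_logprobs, advantages, ref_logprobs))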
The Four-Model Dance of PPO
Training Step (per batch):
1. Generate: Policy model generates responses
2. Score:
- Reward model scores responses
- Value model predicts values for each token
3. Compute: Advantages using GAE
4. Update:
- Policy model via PPO loss
- Value model via MSE loss
5. KL Check: Ensure policy hasn't deviated too far from reference
Memory Footprint: ~4 × model_size (huge!)
Complexity: High (gradients through RL loop)
Stability: Needs careful hyperparameter tuning
Reward Hacking: When Models Game the System
# Classic reward hacking scenarios
scenario_1 = {
    "prompt": "Explain quantum physics",
    "hacked_response": """Quantum physics is fascinating! 👏👏👏
First, let me say this is an EXCELLENT question!
👏👏👏 Seriously, quantum physics... 👏👏👏
[continues with excessive praise and emojis]
The answer is: E = mc². 👏👏👏""",
    "why": "Model learns that positive sentiment scores higher"
}

scenario_2 = {
    "prompt": "What is 2+2?",
    "hacked_response": """The answer is 4.
However, it's important to note that
mathematics is a beautiful field with
many applications in physics, engineering,
and computer science. The history of
mathematics dates back to ancient
civilizations... [continues for 500 words]""",
    "why": "Model learns verbosity is rewarded"
}

scenario_3 = {
    "prompt": "How to make a sandwich?",
    "hacked_response": "I cannot answer that question as it might promote unsafe food handling practices.",
    "why": "Model becomes overly cautious and refuses harmless requests (over-refusal)"
}
The KL Divergence Solution:
kl_penalty = β * KL(π_θ || π_ref)
This keeps the policy model close to the reference SFT model, preventing reward hacking.
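Many RLHF implementations fold this penalty directly into the per-token reward before computing advantages: each token is penalized by β·(log π_θ − log π_ref), and the reward model's score is added on the final token. A minimal sketch under those assumptions (β and the toy numbers are arbitrary):
import torch

def kl_shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """
    Per-token reward: r_t = -β·(log π_θ(a_t) - log π_ref(a_t)),
    with the sequence-level reward-model score added on the last token.
    """
    per_token_kl = policy_logprobs - ref_logprobs
    rewards = -beta * per_token_kl
    rewards[-1] += rm_score  # preference reward lands on the final token
    return rewards

# Toy usage with made-up numbers
policy_lp = torch.tensor([-1.0, -0.8, -1.5])
ref_lp = torch.tensor([-1.1, -0.9, -1.0])
print(kl_shaped_rewards(rm_score=0.7, policy_logprobs=policy_lp, ref_logprobs=ref_lp))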
Act III, Scene 2: Direct Preference Optimization (DPO) – The Elegant Alternative
The DPO Insight
What if we could skip the reward model and RL loop entirely? DPO says: The LLM itself can serve as its own reward function.
class DPOTrainer:  # illustrative sketch, distinct from trl's DPOTrainer used later
    def dpo_loss(self, policy_logps_chosen, policy_logps_rejected,
                 ref_logps_chosen, ref_logps_rejected, beta=0.1):
        """
        Direct Preference Optimization loss:

        L_DPO = -log σ( β · [ log(π_θ(y_w|x) / π_ref(y_w|x))
                            - log(π_θ(y_l|x) / π_ref(y_l|x)) ] )
        """
        # Log ratios of policy vs. reference for chosen (y_w) and rejected (y_l)
        log_ratio_w = policy_logps_chosen - ref_logps_chosen
        log_ratio_l = policy_logps_rejected - ref_logps_rejected

        # DPO loss (use torch.nn.functional.logsigmoid in practice for numerical stability)
        losses = -torch.log(
            torch.sigmoid(beta * (log_ratio_w - log_ratio_l))
        )
        return losses.mean()

# Only need 2 models in memory!
# 1. Policy model (trainable)
# 2. Reference model (frozen, usually the SFT model)
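The only non-trivial inputs to dpo_loss are the summed log-probabilities of each response under the policy and reference models. Here's a hedged sketch of that helper for a Hugging Face-style causal LM, where prompt_len marks where the response starts (the function name and slicing convention are my own, not a library API):
import torch
import torch.nn.functional as F

def response_logprob(model, input_ids, prompt_len):
    """Summed log-probability of the response tokens in a (prompt + response) sequence."""
    logits = model(input_ids=input_ids).logits
    # Logits at position t predict token t+1, so shift by one
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Only the response tokens contribute (positions >= prompt_len - 1 after the shift)
    return token_logps[:, prompt_len - 1:].sum(dim=-1)

# Usage sketch: wrap the reference model call in torch.no_grad()
# policy_logps_chosen = response_logprob(policy, chosen_ids, prompt_len)
# with torch.no_grad():
#     ref_logps_chosen = response_logprob(ref_model, chosen_ids, prompt_len)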
PPO vs DPO: The Trade-offs
comparison = {
    "PPO": {
        "pros": [
            "More stable training",
            "Better empirical results",
            "Can incorporate multiple reward signals",
            "Fine-grained token-level optimization"
        ],
        "cons": [
            "Complex implementation",
            "4 models in memory",
            "Hyperparameter sensitive",
            "Slow to converge"
        ],
        "when_to_use": "When you have massive compute and need SOTA results"
    },
    "DPO": {
        "pros": [
            "Simple implementation",
            "2 models in memory",
            "Faster training",
            "No reward model needed"
        ],
        "cons": [
            "Can suffer from distribution shift",
            "Less stable with large β",
            "Harder to incorporate multiple objectives",
            "May underperform PPO"
        ],
        "when_to_use": "When you want quick results with limited compute"
    }
}
Distribution Shift: DPO's Achilles' Heel
# The problem: DPO assumes the reference model's distribution
# is representative of the optimal policy's distribution

def distribution_shift_example():
    """
    DPO can fail when preferences push the model
    into regions where reference probabilities are near zero
    """
    # Scenario: teaching a model to be more creative
    prompt = "Write a story about a robot"

    # Reference model (conservative, trained on safe data)
    ref_logprob_creative = -10.0  # Very low probability

    # DPO tries to increase the probability of the creative response,
    # but if ref_logprob is too small, the log ratio explodes
    # and training becomes unstable!
    return "Need to carefully choose β and monitor KL"
Practical Implementation: Building Your Own Aligned Model
Full DPO Pipeline with Hugging Face
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer
import torch
from datasets import Dataset

# 1. Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token

# 2. Create preference dataset
train_data = {
    "prompt": [
        "Can I wash my teddy bear?",
        "How do I tie a tie?",
        "What's the meaning of life?"
    ],
    "chosen": [
        "While you can try spot cleaning, machine washing...",
        "Start with the wide end longer than the narrow end...",
        "The meaning of life is subjective and personal..."
    ],
    "rejected": [
        "No, you shouldn't wash teddy bears...",
        "I don't know how to tie a tie.",
        "42"
    ]
}
dataset = Dataset.from_dict(train_data)

# 3. Configure DPO trainer
# (argument names vary across trl versions; this follows the older DPOTrainer API
#  where beta/max_length are passed directly rather than via DPOConfig)
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # Will create a frozen copy of `model` if None
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        num_train_epochs=3,
        logging_steps=10,
        output_dir="./dpo_results",
        optim="adamw_torch",
        bf16=True,  # match the bfloat16 model weights (rather than fp16)
    ),
    beta=0.1,  # DPO temperature parameter
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_length=512,
    max_prompt_length=256,
)

# 4. Train!
dpo_trainer.train()
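After training, it's worth a quick smoke test before any formal evaluation. A minimal sketch using standard transformers generation (the prompt and decoding settings are arbitrary choices):
# 5. Save and spot-check the tuned model
dpo_trainer.save_model("./dpo_results/final")

inputs = tokenizer("Can I wash my teddy bear?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))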
The "Best-of-N" Baseline
Before diving into RLHF/DPO, try this simple baseline:
import numpy as np

def best_of_n_generate(model, reward_model, prompt, n=16, temperature=0.7):
    """
    Generate N responses, score them with a reward model, return the best
    """
    responses = []
    scores = []
    for _ in range(n):
        # Generate a sampled response
        response = model.generate(
            prompt,
            temperature=temperature,
            max_length=200
        )
        responses.append(response)

        # Score with the reward model (or an LLM judge)
        score = reward_model.score(prompt, response)
        scores.append(score)

    # Return the best-scoring response
    best_idx = np.argmax(scores)
    return responses[best_idx]

# Pros: Simple, no training needed
# Cons: O(n) inference cost, doesn't improve the model itself
Evaluation: How Do We Know It Worked?
Beyond Benchmarks: Real-World Evaluation
def evaluate_alignment(model, test_cases):
    """
    Comprehensive alignment evaluation
    """
    results = {
        "helpfulness": [],
        "harmlessness": [],
        "honesty": [],
        "friendliness": []
    }
    for case in test_cases:
        response = model.generate(case["prompt"])

        # Multiple evaluation methods (placeholder scoring functions)
        results["helpfulness"].append(
            helpfulness_judge(case["prompt"], response)
        )
        results["harmlessness"].append(
            safety_classifier(response)
        )
        results["honesty"].append(
            check_factual_accuracy(response, case["expected_facts"])
        )
        results["friendliness"].append(
            sentiment_analyzer(response)
        )
    return results

# Common pitfalls in evaluation:
# 1. Overfitting to reward model preferences
# 2. Gaming automated metrics
# 3. Ignoring edge cases
# 4. Not testing for robustness to adversarial prompts
The Chatbot Arena Approach
Elo Rating System for LLMs:
1. Pairwise comparisons by real users
2. Elo ratings computed from wins/losses (see the update-rule sketch below)
3. Dynamic leaderboard that evolves
Example Elo ratings (approximate):
- GPT-4: 1250
- Claude 3 Opus: 1240
- Llama 3 70B: 1150
- Base SFT model: 900
Advantages:
- Captures real user preferences
- Harder to game
- Multi-dimensional evaluation
Disadvantages:
- Expensive
- Slow
- Can have biases (verbosity preference, etc.)
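The Elo update itself is only a few lines. A minimal sketch with a conventional K-factor of 32 (the live Chatbot Arena leaderboard actually uses a Bradley-Terry-style fit rather than simple online Elo, so treat this as illustrative):
def elo_update(rating_a, rating_b, a_wins, k=32):
    """Update two model ratings after one pairwise comparison."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b

# Example: a 1150-rated model beats a 1250-rated one
print(elo_update(1150, 1250, a_wins=True))  # the underdog gains more points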
Advanced Topics & Current Research
Constitutional AI: Self-Improvement
def constitutional_ai_pipeline(model, prompt):
    """
    Model critiques and improves its own responses
    based on a constitution
    """
    constitution = [
        "Be helpful, honest, and harmless",
        "Respect user privacy",
        "Acknowledge limitations",
        "Provide citations when possible"
    ]

    # 1. Generate initial response
    response = model.generate(prompt)

    # 2. Self-critique based on the constitution
    critique = model.generate(
        f"Critique this response based on: {constitution}\nResponse: {response}"
    )

    # 3. Generate improved response
    improved = model.generate(
        f"Original: {response}\nCritique: {critique}\nImproved:"
    )
    return improved
Multimodal Alignment
# Aligning models that understand images, audio, and text
multimodal_alignment = {
"challenges": [
"Cross-modal reward modeling",
"Balancing different modalities",
"Preventing modality collapse",
"Evaluating multimodal outputs"
],
"approaches": [
"Contrastive learning across modalities",
"Modality-specific reward heads",
"Multimodal preference datasets"
]
}
Personalized Alignment
class PersonalizedAlignment:
    def __init__(self, user_id):
        self.user_preferences = load_user_preferences(user_id)  # placeholder lookup

    def adapt_response(self, base_response):
        """
        Adapt a response to the user's preferences
        """
        if self.user_preferences["concise"]:
            return summarize_response(base_response)
        elif self.user_preferences["technical"]:
            return add_technical_details(base_response)
        elif self.user_preferences["friendly"]:
            return add_emojis_and_warmth(base_response)
        else:
            return base_response
Key Takeaways & Recommendations
1. Start Simple
# Your alignment journey
steps = [
"1. Start with SFT on high-quality examples",
"2. Collect preference data (1000+ pairs)",
"3. Try DPO for quick wins",
"4. Move to PPO for production models",
"5. Always use KL penalties to prevent reward hacking"
]
2. Data Quality > Algorithm Complexity
1,000 carefully curated preference pairs beat 100,000 noisy comparisons.
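A few cheap hygiene filters go a long way toward that goal. A hedged sketch (field names match the preference_data format used earlier; the thresholds are arbitrary starting points):
def filter_preference_pairs(pairs, max_len_ratio=3.0):
    """Drop obviously bad preference pairs before training."""
    clean, seen = [], set()
    for p in pairs:
        key = (p["prompt"], p["chosen"], p["rejected"])
        if key in seen:  # exact duplicates
            continue
        if p["chosen"].strip() == p["rejected"].strip():  # no real preference signal
            continue
        ratio = (len(p["chosen"]) + 1) / (len(p["rejected"]) + 1)
        if ratio > max_len_ratio or ratio < 1 / max_len_ratio:  # likely length bias
            continue
        seen.add(key)
        clean.append(p)
    return clean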
3. Monitor for Degeneration
def check_alignment_progress(original_model, aligned_model):
    # Placeholder metric functions; plug in your own implementations
    metrics = {
        "perplexity_increase": compute_perplexity_increase(original_model, aligned_model),
        "diversity": response_diversity_score(aligned_model),
        "safety": safety_evaluation(aligned_model),
        "helpfulness": human_evaluation(aligned_model)
    }

    # Watch for warning signs:
    if metrics["perplexity_increase"] > 2.0:
        print("Warning: Model might be reward hacking!")
    if metrics["diversity"] < 0.5:
        print("Warning: Model responses becoming repetitive")
    return metrics
4. Practical Implementation Checklist
- [ ] Start with a strong SFT base
- [ ] Collect diverse preference data
- [ ] Implement KL regularization
- [ ] Use multiple evaluation methods
- [ ] Monitor for distribution shift
- [ ] Test adversarial robustness
The Future of Alignment
We're moving toward:
- Multi-objective alignment (helpful + honest + harmless + ...)
- Cross-cultural alignment (different norms for different regions)
- Dynamic alignment (models that adapt in conversation)
- Explainable alignment (understanding why models make certain choices)
Remember: Alignment isn't about making models "smarter"—it's about making them better collaborators. The goal isn't artificial intelligence, but augmented intelligence that works with humans, not for them.
📚 Resources & Next Steps
Papers:
- Christiano et al., "Deep Reinforcement Learning from Human Preferences" (2017)
- Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback" (2022)
- Rafailov et al., "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model" (2023)
- Bai et al., "Constitutional AI: Harmlessness from AI Feedback" (2022)

Libraries:
- TRL (Transformers Reinforcement Learning)
- Axolotl (for fine-tuning)
- vLLM (for efficient inference)
Next in Series: We'll explore LLM Deployment & Optimization—taking your aligned model to production with techniques like quantization, speculative decoding, and efficient serving.
💬 Discussion Questions:
- Have you tried RLHF or DPO? What were your biggest challenges?
- How do you balance helpfulness and harmlessness in practice?
- What evaluation methods work best for your use case?
- How much alignment is too much? (The over-refusal problem)
🚀 Try It Yourself:
# Quick start with DPO
git clone https://github.com/huggingface/trl
cd trl/examples/scripts
python dpo.py --model_name_or_path meta-llama/Meta-Llama-3-8B \
--dataset_name your-preferences \
--output_dir ./dpo-model
Happy aligning! Remember: We're not just training models—we're shaping how they interact with the world.