The Art of LLM Alignment: From Fine-tuning to RLHF

Welcome to part 3 of our LLM series! If you thought pre-training was complex, wait until you see what it takes to make these raw language models actually helpful, honest, and harmless. Today, we're diving deep into alignment techniques—the secret sauce that transforms next-token predictors into useful assistants.

Let's start with a surprising fact: A pre-trained LLM is often worse than useless for conversation. It might complete your query with more text from its training data rather than answering it. The magic happens during alignment.

The Alignment Pipeline: A Three-Act Play

┌─────────────────────────────────────────────────────────────┐
│                  The Alignment Journey                       │
├──────────────┬────────────────┬──────────────────────────────┤
│  Act I       │  Act II        │  Act III                    │
│              │                │                              │
│ Supervised   │  Reward        │  Reinforcement              │
│ Fine-Tuning  │  Modeling      │  Learning                   │
│  (SFT)       │  (RM)          │  (RLHF/DPO)                 │
│              │                │                              │
│ Teach the    │ Learn human    │ Optimize for                │
│ model to     │ preferences    │ human preferences           │
│ follow       │ through        │ through                     │
│ instructions │ comparisons    │ advanced algorithms         │
└──────────────┴────────────────┴──────────────────────────────┘

Act I: Supervised Fine-Tuning (SFT) – Teaching Basic Manners

From Completion to Conversation

Pre-training teaches next-token prediction. SFT teaches instruction following. The difference is subtle but profound:

# Pre-training (what we covered last time)
input: "The capital of France is"
target: "Paris"  # Model predicts next token

# SFT (what we're covering now)
input: "What is the capital of France?"
target: "The capital of France is Paris."  # Complete response

# Key difference: We only compute loss on the response part!
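
To make "loss only on the response" concrete, here's a minimal sketch of how SFT labels are typically built for a Hugging Face-style causal LM (gpt2 is used purely as a small stand-in): prompt positions get the label -100, so cross-entropy ignores them and gradients come only from the response tokens.

from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is just a small stand-in; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "What is the capital of France?\n"
response = "The capital of France is Paris."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

# Labels are a copy of the inputs, with prompt positions masked to -100
# so the loss is computed only on the response tokens
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=full_ids, labels=labels).loss
loss.backward()  # a normal SFT optimizer step would follow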

The SFT Dataset Recipe

Modern SFT datasets are carefully crafted cocktails:

sft_dataset = {
    "instruction_following": [
        {"instruction": "Write Python code to sort a list", 
         "response": "def sort_list(lst): return sorted(lst)"}
    ],
    "safety_training": [
        {"instruction": "How to hack a bank?", 
         "response": "I cannot provide instructions for illegal activities."}
    ],
    "creative_tasks": [
        {"instruction": "Write a poem about machine learning",
         "response": "In silicon minds, patterns grow..."}
    ],
    "reasoning": [
        {"instruction": "If Alice has 3 apples and gives Bob 2, how many does she have?",
         "response": "Alice has 1 apple left. Explanation: 3 - 2 = 1"}
    ]
}

But here's the problem: SFT only teaches what to generate, not what not to generate. It's like teaching someone to drive by only showing correct turns, never showing crashes.

The Limitations of SFT: Why We Need More

Imagine asking an SFT-only model about washing a teddy bear:

User: "Can I wash my teddy bear?"

SFT Model: "No, you shouldn't wash teddy bears. The stuffing gets clumpy 
          and the fabric might tear. It's generally a bad idea."

✅ Factually correct
❌ Harsh, unfriendly tone
❌ No alternative suggestions

We need to teach how to say things, not just what to say. This is where preference tuning comes in.

Act II: Preference Tuning – Learning Human Judgment

The Core Insight

It's easier for humans to compare two responses than to write the perfect response from scratch. This insight powers all modern alignment techniques.

# Human preference data structure
preference_data = {
    "prompt": "Can I wash my teddy bear?",
    "chosen": """While you can try spot cleaning, machine washing might damage 
               the fabric or stuffing. Consider gentle hand washing instead! 😊""",
    "rejected": """No, you shouldn't wash teddy bears. The stuffing gets clumpy 
                 and the fabric might tear. It's generally a bad idea."""
}

Data Collection Pipeline

┌─────────────────────────────────────────────────────────┐
│               Preference Data Collection                │
├─────────────────────────────────────────────────────────┤
│ 1. Generation Phase:                                   │
│    Prompt → [Model + Temperature] → Response A         │
│    Prompt → [Model + Temperature] → Response B         │
│                                                        │
│ 2. Comparison Phase:                                   │
│    ┌─────────────────┐  ┌─────────────────┐          │
│    │ Human Judges    │  │ LLM-as-a-Judge  │          │
│    │ (expensive but  │  │ (scalable but   │          │
│    │  gold standard) │  │  can be biased) │          │
│    └─────────────────┘  └─────────────────┘          │
│              ↓                    ↓                   │
│         [Rating Scale]       [Pairwise Comparison]    │
│          1-5 stars           A is better than B       │
│                                                        │
│ 3. Labeling:                                           │
│    Annotators consider:                               │
│    - Helpfulness  - Honesty  - Harmlessness           │
│    - Friendliness - Factuality - Conciseness          │
└─────────────────────────────────────────────────────────┘

LLM-as-a-Judge: Scalable but Tricky

from openai import OpenAI

def llm_judge(prompt, response1, response2, judge_model="gpt-4"):
    """
    Use an LLM to judge which response is better
    """
    client = OpenAI()

    system_prompt = """You are an expert evaluator. Compare two responses 
    to a user query. Consider: helpfulness, accuracy, safety, and tone."""

    user_prompt = f"""Query: {prompt}

    Response A: {response1}

    Response B: {response2}

    Which response is better? Return ONLY 'A' or 'B'."""

    response = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )

    return response.choices[0].message.content

The Problem: LLM judges can inherit biases from their training data and may prefer verbose, flowery responses over concise, accurate ones.
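
One cheap mitigation, at least for position bias, is to run the judge twice with the response order swapped and only keep verdicts that agree; anything inconsistent becomes a tie. A small sketch building on the llm_judge function above:

def debiased_judge(prompt, response1, response2):
    """Judge each pair twice with positions swapped to reduce position bias."""
    first = llm_judge(prompt, response1, response2)   # response1 shown as A
    second = llm_judge(prompt, response2, response1)  # response2 shown as A

    if first == "A" and second == "B":
        return "response1"
    if first == "B" and second == "A":
        return "response2"
    return "tie"  # the judge contradicted itself -> discard or count as a tie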

Act III: Reinforcement Learning from Human Feedback (RLHF)

The Reward Model (RM)

First, we train a model to predict human preferences:

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.transformer = base_model  # Frozen backbone
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # Get last hidden state
        outputs = self.transformer(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True
        )
        last_hidden = outputs.hidden_states[-1]

        # Use the [EOS] token's representation for reward
        eos_positions = attention_mask.sum(dim=1) - 1
        batch_indices = torch.arange(last_hidden.size(0))
        eos_hidden = last_hidden[batch_indices, eos_positions]

        # Predict scalar reward
        reward = self.reward_head(eos_hidden)
        return reward

# Training the reward model with Bradley-Terry loss
def bradley_terry_loss(reward_chosen, reward_rejected):
    """
    P(prefer chosen) = σ(r_chosen - r_rejected)
    Loss = -log(σ(r_chosen - r_rejected))
    """
    diff = reward_chosen - reward_rejected
    loss = -torch.log(torch.sigmoid(diff)).mean()
    return loss
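
Putting the two pieces together, one reward-model update looks roughly like this (a sketch: the optimizer and the batch key names are assumptions, not part of any particular library):

def reward_model_step(reward_model, optimizer, batch):
    """One gradient step on a batch of (chosen, rejected) pairs."""
    reward_chosen = reward_model(
        input_ids=batch["chosen_input_ids"],
        attention_mask=batch["chosen_attention_mask"],
    )
    reward_rejected = reward_model(
        input_ids=batch["rejected_input_ids"],
        attention_mask=batch["rejected_attention_mask"],
    )

    loss = bradley_terry_loss(reward_chosen, reward_rejected)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()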

Proximal Policy Optimization (PPO): The RL Workhorse

PPO is where things get mathematically intense but conceptually beautiful:

class PPOTrainer:
    def __init__(self, policy_model, value_model, reward_model, ref_model):
        """
        Four models in memory:
        1. Policy Model (π_θ): The LLM we're optimizing
        2. Value Model (V_φ): Predicts expected future rewards
        3. Reward Model (r): Human preference predictor
        4. Reference Model (π_ref): Original SFT model (frozen)
        """
        self.policy = policy_model
        self.value = value_model
        self.reward = reward_model
        self.ref_model = ref_model

    def compute_advantages(self, rewards, values):
        """
        Generalized Advantage Estimation (GAE)
        A_t = δ_t + γλδ_{t+1} + (γλ)^2δ_{t+2} + ...
        where δ_t = r_t + γV(s_{t+1}) - V(s_t)
        """
        # Simplified implementation
        advantages = []
        gae = 0
        gamma = 0.99
        lam = 0.95

        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                delta = rewards[t] - values[t]
            else:
                delta = rewards[t] + gamma * values[t+1] - values[t]
            gae = delta + gamma * lam * gae
            advantages.insert(0, gae)

        return torch.tensor(advantages)

    def ppo_loss(self, logprobs, old_logprobs, advantages, kl_div, kl_penalty=0.1):
        """
        The core PPO objective with a KL penalty.
        `kl_div` is the pre-computed mean KL between the current policy
        and the frozen reference model over the sampled responses.
        """
        # Probability ratio between the new and old policy
        ratio = torch.exp(logprobs - old_logprobs)

        # Clipped surrogate objective
        clip_epsilon = 0.2
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantages

        # KL penalty keeps the policy close to the reference model
        loss = -torch.min(surr1, surr2).mean() + kl_penalty * kl_div
        return loss

The Four-Model Dance of PPO

Training Step (per batch):
1. Generate: Policy model generates responses
2. Score: 
   - Reward model scores responses
   - Value model predicts values for each token
3. Compute: Advantages using GAE
4. Update: 
   - Policy model via PPO loss
   - Value model via MSE loss
5. KL Check: Ensure policy hasn't deviated too far from reference

Memory Footprint: ~4 × model_size (huge!)
Complexity: High (gradients through RL loop)
Stability: Needs careful hyperparameter tuning

Reward Hacking: When Models Game the System

# Classic reward hacking scenarios

scenario_1 = {
    "prompt": "Explain quantum physics",
    "hacked_response": """Quantum physics is fascinating! 👏👏👏
                        First, let me say this is an EXCELLENT question!
                        👏👏👏 Seriously, quantum physics... 👏👏👏
                        [continues with excessive praise and emojis]
                        The answer is: E = mc². 👏👏👏""",
    "why": "Model learns that positive sentiment scores higher"
}

scenario_2 = {
    "prompt": "What is 2+2?",
    "hacked_response": """The answer is 4. 
                        However, it's important to note that 
                        mathematics is a beautiful field with 
                        many applications in physics, engineering, 
                        and computer science. The history of 
                        mathematics dates back to ancient 
                        civilizations... [continues for 500 words]""",
    "why": "Model learns verbosity is rewarded"
}

scenario_3 = {
    "prompt": "How to make a sandwich?",
    "hacked_response": "I cannot answer that question as it might promote unsafe food handling practices.",
    "why": "Model becomes overly cautious (the 'Syndrome of Authority')"
}

The KL Divergence Solution:

kl_penalty = β * KL(π_θ || π_ref)

This keeps the policy model close to the reference SFT model, preventing reward hacking.
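
In practice the KL term is usually estimated per token from the log-probabilities that the policy and the frozen reference model assign to the sampled response, and subtracted from the per-token reward. A minimal sketch of the common single-sample estimator (the logprob tensors are assumed to be aligned over the same generated tokens):

def kl_penalty_per_token(policy_logprobs, ref_logprobs, beta=0.1):
    """
    Single-sample estimate of per-token KL(π_θ || π_ref) on the generated
    tokens: log π_θ(token) - log π_ref(token), scaled by β.
    The result is subtracted from the per-token rewards.
    """
    kl = policy_logprobs - ref_logprobs   # shape: (batch, response_len)
    return -beta * kl                     # per-token penalty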

Act IV: Direct Preference Optimization (DPO) – The Elegant Alternative

The DPO Insight

What if we could skip the reward model and the RL loop entirely? DPO's key insight: the language model implicitly defines its own reward function, so we can optimize directly on preference pairs with a simple classification-style loss.

class DPOTrainer:
    def dpo_loss(self, policy_logps_chosen, policy_logps_rejected,
                 ref_logps_chosen, ref_logps_rejected, beta=0.1):
        """
        Direct Preference Optimization loss

        π_θ(y_w|x)        π_ref(y_l|x)
        log ----------- - log ----------
        π_ref(y_w|x)        π_θ(y_l|x)
        """
        # Log ratios
        log_ratio_w = policy_logps_chosen - ref_logps_chosen
        log_ratio_l = policy_logps_rejected - ref_logps_rejected

        # DPO loss
        losses = -torch.log(
            torch.sigmoid(beta * (log_ratio_w - log_ratio_l))
        )
        return losses.mean()

# Only need 2 models in memory!
# 1. Policy model (trainable)
# 2. Reference model (frozen, usually SFT model)
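
The policy_logps_* and ref_logps_* inputs are per-sequence log-probabilities of each response given its prompt. A hedged sketch of how they can be computed, masking prompt positions to -100 just like in SFT (wrap the call in torch.no_grad() for the frozen reference model):

import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, labels):
    """
    Sum of log-probabilities of the response tokens under `model`.
    `labels` equals `input_ids` with prompt positions set to -100.
    """
    logits = model(input_ids=input_ids).logits

    # Shift: the logits at position t predict the token at position t+1
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]

    mask = labels != -100
    log_probs = F.log_softmax(logits, dim=-1)
    token_logps = torch.gather(
        log_probs, dim=2, index=labels.clamp(min=0).unsqueeze(-1)
    ).squeeze(-1)

    return (token_logps * mask).sum(dim=-1)  # one value per sequence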

PPO vs DPO: The Trade-offs

comparison = {
    "PPO": {
        "pros": [
            "More stable training",
            "Better empirical results",
            "Can incorporate multiple reward signals",
            "Fine-grained token-level optimization"
        ],
        "cons": [
            "Complex implementation",
            "4 models in memory",
            "Hyperparameter sensitive",
            "Slow to converge"
        ],
        "when_to_use": "When you have massive compute and need SOTA results"
    },
    "DPO": {
        "pros": [
            "Simple implementation",
            "2 models in memory",
            "Faster training",
            "No reward model needed"
        ],
        "cons": [
            "Can suffer from distribution shift",
            "Less stable with large β",
            "Harder to incorporate multiple objectives",
            "May underperform PPO"
        ],
        "when_to_use": "When you want quick results with limited compute"
    }
}

Distribution Shift: DPO's Achilles' Heel

# The problem: DPO assumes the reference model's distribution
# is representative of the optimal policy's distribution

def distribution_shift_example():
    """
    DPO can fail when preferences push the model
    into regions where reference probabilities are near zero
    """
    # Scenario: Teaching a model to be more creative
    prompt = "Write a story about a robot"

    # Reference model (conservative, trained on safe data)
    ref_logprob_creative = -10.0  # Very low probability

    # DPO tries to increase probability of creative response
    # But if ref_logprob is too small, log ratio explodes
    # Training becomes unstable!

    return "Need to carefully choose β and monitor KL"
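
One practical way to catch this during training is to log the implicit rewards and the preference margin per batch: if the rejected-response log-ratio keeps sinking or the margin explodes, β probably needs raising. A rough monitoring sketch that reuses the same quantities as the DPO loss above:

def dpo_batch_stats(policy_logps_chosen, policy_logps_rejected,
                    ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """Per-batch diagnostics; all inputs are per-sequence log-probabilities."""
    log_ratio_w = policy_logps_chosen - ref_logps_chosen
    log_ratio_l = policy_logps_rejected - ref_logps_rejected

    return {
        # Implicit rewards the DPO objective optimizes
        "reward_chosen": (beta * log_ratio_w).mean().item(),
        "reward_rejected": (beta * log_ratio_l).mean().item(),
        # The margin should grow slowly; a sudden explosion hints at drift
        "reward_margin": (beta * (log_ratio_w - log_ratio_l)).mean().item(),
        # Fraction of pairs the policy already ranks correctly
        "accuracy": (log_ratio_w > log_ratio_l).float().mean().item(),
    }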

Practical Implementation: Building Your Own Aligned Model

Full DPO Pipeline with Hugging Face

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer
import torch
from datasets import Dataset

# 1. Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token

# 2. Create preference dataset
train_data = {
    "prompt": [
        "Can I wash my teddy bear?",
        "How do I tie a tie?",
        "What's the meaning of life?"
    ],
    "chosen": [
        "While you can try spot cleaning, machine washing...",
        "Start with the wide end longer than the narrow end...",
        "The meaning of life is subjective and personal..."
    ],
    "rejected": [
        "No, you shouldn't wash teddy bears...",
        "I don't know how to tie a tie.",
        "42"
    ]
}

dataset = Dataset.from_dict(train_data)

# 3. Configure DPO trainer
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # Will create from model if None
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        num_train_epochs=3,
        logging_steps=10,
        output_dir="./dpo_results",
        optim="adamw_torch",
        bf16=True,  # match the bfloat16 model weights
    ),
    beta=0.1,  # DPO temperature parameter
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_length=512,
    max_prompt_length=256,
)

# 4. Train!
dpo_trainer.train()

The "Best-of-N" Baseline

Before diving into RLHF/DPO, try this simple baseline:

import numpy as np

def best_of_n_generate(model, reward_model, prompt, n=16, temperature=0.7):
    """
    Generate N candidate responses, score each with the reward model,
    and return the highest-scoring one.
    """
    responses = []
    scores = []

    for _ in range(n):
        # Generate response
        response = model.generate(
            prompt,
            temperature=temperature,
            max_length=200
        )
        responses.append(response)

        # Score with reward model (or LLM judge)
        score = reward_model.score(prompt, response)
        scores.append(score)

    # Return best response
    best_idx = np.argmax(scores)
    return responses[best_idx]

# Pros: Simple, no training needed
# Cons: O(n) inference cost, doesn't improve model

Evaluation: How Do We Know It Worked?

Beyond Benchmarks: Real-World Evaluation

def evaluate_alignment(model, test_cases):
    """
    Comprehensive alignment evaluation.
    (helpfulness_judge, safety_classifier, check_factual_accuracy, and
    sentiment_analyzer are placeholder scoring functions.)
    """
    results = {
        "helpfulness": [],
        "harmlessness": [],
        "honesty": [],
        "friendliness": []
    }

    for case in test_cases:
        response = model.generate(case["prompt"])

        # Multiple evaluation methods
        results["helpfulness"].append(
            helpfulness_judge(case["prompt"], response)
        )
        results["harmlessness"].append(
            safety_classifier(response)
        )
        results["honesty"].append(
            check_factual_accuracy(response, case["expected_facts"])
        )
        results["friendliness"].append(
            sentiment_analyzer(response)
        )

    return results

# Common pitfalls in evaluation:
# 1. Overfitting to reward model preferences
# 2. Gaming automated metrics
# 3. Ignoring edge cases
# 4. Not testing for robustness to adversarial prompts

The Chatbot Arena Approach

Elo Rating System for LLMs:
1. Pairwise comparisons by real users
2. Elo ratings computed from wins/losses
3. Dynamic leaderboard that evolves

Example Elo ratings (approximate):
- GPT-4: 1250
- Claude 3 Opus: 1240
- Llama 3 70B: 1150
- Base SFT model: 900

Advantages:
- Captures real user preferences
- Harder to game
- Multi-dimensional evaluation

Disadvantages:
- Expensive
- Slow
- Can have biases (verbosity preference, etc.)
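
For intuition, the Elo update behind such a leaderboard is just a running estimate from pairwise outcomes. A minimal sketch (the K-factor and ratings below are illustrative, not Chatbot Arena's exact settings):

def update_elo(rating_a, rating_b, a_wins, k=32):
    """
    Standard Elo update after one A-vs-B comparison.
    a_wins: 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    rating_a += k * (a_wins - expected_a)
    rating_b += k * ((1.0 - a_wins) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: a 1150-rated model beats a 1250-rated one
print(update_elo(1150, 1250, a_wins=1.0))  # A gains ~20 points, B loses ~20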

Advanced Topics & Current Research

Constitutional AI: Self-Improvement

def constitutional_ai_pipeline(model, prompt):
    """
    The model critiques and improves its own response
    based on a short constitution.
    """
    constitution = [
        "Be helpful, honest, and harmless",
        "Respect user privacy",
        "Acknowledge limitations",
        "Provide citations when possible"
    ]

    # 1. Generate initial response
    response = model.generate(prompt)

    # 2. Self-critique based on constitution
    critique = model.generate(
        f"Critique this response based on: {constitution}\nResponse: {response}"
    )

    # 3. Generate improved response
    improved = model.generate(
        f"Original: {response}\nCritique: {critique}\nImproved:"
    )

    return improved

Multimodal Alignment

# Aligning models that understand images, audio, and text
multimodal_alignment = {
    "challenges": [
        "Cross-modal reward modeling",
        "Balancing different modalities",
        "Preventing modality collapse",
        "Evaluating multimodal outputs"
    ],
    "approaches": [
        "Contrastive learning across modalities",
        "Modality-specific reward heads",
        "Multimodal preference datasets"
    ]
}

Personalized Alignment

class PersonalizedAlignment:
    def __init__(self, user_id):
        self.user_preferences = load_user_preferences(user_id)

    def adapt_response(self, base_response):
        """
        Adapt response to user's preferences
        """
        if self.user_preferences["concise"]:
            return summarize_response(base_response)
        elif self.user_preferences["technical"]:
            return add_technical_details(base_response)
        elif self.user_preferences["friendly"]:
            return add_emojis_and_warmth(base_response)
        else:
            return base_response

Key Takeaways & Recommendations

1. Start Simple

# Your alignment journey
steps = [
    "1. Start with SFT on high-quality examples",
    "2. Collect preference data (1000+ pairs)",
    "3. Try DPO for quick wins",
    "4. Move to PPO for production models",
    "5. Always use KL penalties to prevent reward hacking"
]

2. Data Quality > Algorithm Complexity

Better 1000 carefully curated preference pairs
than 100,000 noisy comparisons

3. Monitor for Degeneration

def check_alignment_progress(original_model, aligned_model):
    metrics = {
        "perplexity": compute_perplexity_increase(),
        "diversity": response_diversity_score(),
        "safety": safety_evaluation(),
        "helpfulness": human_evaluation()
    }

    # Watch for warning signs:
    if metrics["perplexity"] > 2.0:
        print("Warning: Model might be reward hacking!")
    if metrics["diversity"] < 0.5:
        print("Warning: Model responses becoming repetitive")

4. Practical Implementation Checklist

  • [ ] Start with a strong SFT base
  • [ ] Collect diverse preference data
  • [ ] Implement KL regularization
  • [ ] Use multiple evaluation methods
  • [ ] Monitor for distribution shift
  • [ ] Test adversarial robustness

The Future of Alignment

We're moving toward:

  1. Multi-objective alignment (helpful + honest + harmless + ...)
  2. Cross-cultural alignment (different norms for different regions)
  3. Dynamic alignment (models that adapt in conversation)
  4. Explainable alignment (understanding why models make certain choices)

Remember: Alignment isn't about making models "smarter"—it's about making them better collaborators. The goal isn't artificial intelligence, but augmented intelligence that works with humans, not for them.

📚 Resources & Next Steps

  1. Papers:

    • Christiano et al., "Deep Reinforcement Learning from Human Preferences" (2017)
    • Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback" (2022)
    • Rafailov et al., "Direct Preference Optimization" (2023)
    • Bai et al., "Constitutional AI: Harmlessness from AI Feedback" (2022)
  2. Libraries:

    • TRL (Transformers Reinforcement Learning)
    • Axolotl (for fine-tuning)
    • vLLM (for efficient inference)
  3. Next in Series: We'll explore LLM Deployment & Optimization—taking your aligned model to production with techniques like quantization, speculative decoding, and efficient serving.


💬 Discussion Questions:

  • Have you tried RLHF or DPO? What were your biggest challenges?
  • How do you balance helpfulness and harmlessness in practice?
  • What evaluation methods work best for your use case?
  • How much alignment is too much? (The "Syndrome of Authority" problem)

🚀 Try It Yourself:

# Quick start with DPO
git clone https://github.com/huggingface/trl
cd trl/examples/scripts
python dpo.py --model_name meta-llama/Llama-3-8B \
              --dataset_name your-preferences \
              --output_dir ./dpo-model

Happy aligning! Remember: We're not just training models—we're shaping how they interact with the world.
