How LLMs Are Trained: Pretraining, SFT, and RLHF

#ai #llm #machinelearning #beginners

ChatGPT didn't pop out of the box knowing how to be helpful. It went through three distinct training stages — and understanding them explains almost everything about how LLMs behave. Here's the pipeline, shown by how the SAME answer improves at each stage.

🏗️ Step through Pretraining → SFT → RLHF: https://dev48v.infy.uk/ai/days/day17-how-llms-trained.html

Stage 1 — Pretraining

Train the model to predict the next token across a huge slice of the internet. The result is a base model: it knows a staggering amount and writes fluent text, but all it does is continue your text. Ask "How do I make tea?" and it might reply with more questions, because on the web a question is often followed by another question. Smart, but no manners.
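To make "predict the next token" concrete, here is a toy sketch. It is not how a real LLM is built: it swaps the neural network for a simple bigram counter, and the three-sentence corpus and the function names are made up for illustration. The objective is the same idea, though: score the model by how much probability it gave to the token that actually came next.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the raw follow-up counts for `prev` into probabilities."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def next_token_loss(counts, tokens):
    """Average negative log-likelihood of each actual next token."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        # Tiny floor so an unseen pair does not produce log(0).
        p = next_token_probs(counts, prev).get(nxt, 1e-9)
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

# Hypothetical toy "internet": three questions in a row.
corpus = "how do i make tea ? how do i make coffee ? how do i fix this ?".split()
model = train_bigram(corpus)

print(next_token_probs(model, "make"))  # {'tea': 0.5, 'coffee': 0.5}
print(next_token_probs(model, "?"))     # {'how': 1.0}
print(next_token_loss(model, corpus))   # lower is better
```

Note what the toy model predicts after "?": another "how". It has only ever seen questions followed by more questions, so that is what it continues with. That is the base-model behavior described above, in miniature.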

Stage 2 — Supervised fine-tuning (SFT)

Train on curated instruction→response examples. Now it follows instructions and answers the question directly — plainly, but it works.
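A rough sketch of what one SFT training example can look like. The training objective is still next-token prediction; what changes is the data. A common choice (assumed here, not universal) is to count the loss only on the response tokens, so the model learns to answer rather than to write prompts. The "### Instruction / ### Response" template, the whitespace tokenizer, and the function names are all hypothetical stand-ins.

```python
def build_sft_example(prompt, response):
    """Join prompt and response; mark which tokens count toward the loss."""
    # Hypothetical template; real chat templates vary by model.
    prompt_tokens = f"### Instruction: {prompt} ### Response:".split()
    response_tokens = response.split() + ["<eos>"]
    tokens = prompt_tokens + response_tokens
    # 0 = ignore (prompt), 1 = train on this token (response).
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

def masked_loss(per_token_losses, loss_mask):
    """Average the per-token losses over response tokens only."""
    total = sum(l * m for l, m in zip(per_token_losses, loss_mask))
    return total / sum(loss_mask)

tokens, mask = build_sft_example(
    "How do I make tea?",
    "Boil water, then steep the tea bag.",
)
for tok, m in zip(tokens, mask):
    print(m, tok)
```

The "<eos>" token at the end is part of the response on purpose: it teaches the model to stop once the question is answered instead of continuing forever.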

Stage 3 — RLHF

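Humans rank multiple responses to the same prompt. You train a reward model on those preferences, then use RL (typically PPO) to optimize the model toward what people prefer. DPO is a popular alternative that trains on the preference pairs directly, with no separate reward model or RL loop. Either way, the model comes out more helpful, better formatted, and more likely to decline unsafe requests. This is the "assistant" polish.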
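The reward model step can be sketched with the pairwise (Bradley-Terry style) loss that is commonly used for it: given a reward score for the response humans preferred and one for the response they rejected, the loss is -log(sigmoid(chosen - rejected)). The scores below are made-up numbers; in practice they would come from a neural network, and the function name is just for illustration.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(chosen - rejected)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # Same value, rearranged so exp() never sees a large positive number.
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward scores for two answers to the same prompt.
print(preference_loss(2.0, 0.0))  # reward model agrees with the human: small loss
print(preference_loss(0.0, 0.0))  # reward model has no opinion: log(2)
print(preference_loss(0.0, 2.0))  # reward model disagrees: large loss
```

Minimizing this loss pushes the preferred response's score above the rejected one's. The RL stage then uses that learned score as the signal to optimize the language model against.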

Why it matters

"Base" vs "chat/instruct" models differ because of stages 2-3.
Alignment ≠ truth — it can still confidently hallucinate.
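Pretraining costs millions of dollars in compute; SFT and RLHF need far less compute, and their main cost is human labor: written examples and preference rankings.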

🔨 The pipeline (next-token pretraining → SFT on instructions → reward model + RL) on the page: https://dev48v.infy.uk/ai/days/day17-how-llms-trained.html

Part of AIFromZero. 🌐 https://dev48v.infy.uk