ChatGPT didn't pop out of the box knowing how to be helpful. It went through three distinct training stages, and understanding them explains almost everything about how LLMs behave. Here's the pipeline, shown by how the SAME answer improves at each stage.
Step through Pretraining → SFT → RLHF: https://dev48v.infy.uk/ai/days/day17-how-llms-trained.html
Stage 1: Pretraining
Predict the next token across a huge slice of the internet. The result is a base model: it knows a staggering amount and writes fluent text, but it just continues your text. Ask "How do I make tea?" and it might reply with more questions. Smart, but no manners.
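To make "it just continues your text" concrete, here is a minimal sketch of next-token prediction using bigram counts instead of a neural network. The corpus, `train_bigram`, and `continue_text` are all made up for illustration; a real LLM learns the same kind of "what comes next" statistics with a transformer over billions of tokens.

```python
from collections import Counter, defaultdict

# Hypothetical toy "internet": mostly questions, just like a forum page.
CORPUS = "how do i make tea ? how do i make coffee ? how do i make tea ?"


def train_bigram(corpus: str) -> dict:
    """Count, for every token, which tokens follow it and how often."""
    tokens = corpus.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def continue_text(counts: dict, prompt: str, max_new_tokens: int = 4) -> str:
    """Greedily append the most frequent next token, one token at a time."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break  # token never seen with a successor: nothing to predict
        tokens.append(followers.most_common(1)[0][0])
    return " ".join(tokens)


counts = train_bigram(CORPUS)
completion = continue_text(counts, "how do i make tea ?")
```

Here `completion` comes out as "how do i make tea ? how do i make": the model never answers, it starts another question, because that is what followed a question in its training text.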
Stage 2: Supervised fine-tuning (SFT)
Train on curated instruction-response examples. Now it follows instructions and answers the question directly. The answer is plain, but it works.
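As a rough sketch of what one SFT training example can look like: the instruction and response are joined into a single token sequence, and a mask marks which tokens count toward the loss. The `### Instruction:` / `### Response:` template, the `<eos>` marker, whitespace tokenization, and the `build_sft_example` helper are all assumptions for illustration; real pipelines use a model-specific chat template and tokenizer, and masking the prompt is a common choice rather than a universal one.

```python
def build_sft_example(instruction: str, response: str):
    """Return (tokens, loss_mask) for one instruction-response pair.

    loss_mask is 0 for prompt tokens and 1 for response tokens, so the
    model is only trained to produce the answer, not to echo the question.
    """
    # Hypothetical template; whitespace split stands in for a real tokenizer.
    prompt_tokens = f"### Instruction: {instruction} ### Response:".split()
    response_tokens = response.split() + ["<eos>"]  # <eos> teaches it to stop

    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


tokens, loss_mask = build_sft_example(
    "How do I make tea?",
    "Boil water, then steep the tea bag.",
)
```

The training objective is still next-token prediction, exactly as in pretraining. What changes is the data: every example ends with a direct answer followed by a stop marker, which is what teaches the model to answer and then stop rather than keep going.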
Stage 3: RLHF
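Humans rank multiple responses to the same prompt; you train a reward model on those preferences, then use RL (typically PPO) to optimize the model toward what people prefer. DPO is a common alternative that skips the separate reward model and the RL loop and trains on the preference pairs directly. Now it's genuinely helpful, well-formatted, and declines unsafe requests. This is the "assistant" polish.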
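The reward-model step can be sketched with a pairwise (Bradley-Terry style) loss: for each human comparison, push the score of the preferred response above the score of the rejected one. Everything below is a toy under stated assumptions: responses are represented by two hand-made features, the reward is a linear function, and `PAIRS`, `reward`, and `train_reward_model` are hypothetical names. Real reward models are fine-tuned LLMs, and the later RL step is not shown.

```python
import math

# Hypothetical features per response: [answers_the_question, is_polite].
# Each pair is (chosen, rejected) as ranked by a human labeler.
PAIRS = [
    ([1.0, 1.0], [0.0, 1.0]),
    ([1.0, 0.0], [0.0, 0.0]),
]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reward(w, feats) -> float:
    """Linear reward: a weighted sum of the response features."""
    return sum(wi * fi for wi, fi in zip(w, feats))


def pairwise_loss(w, chosen, rejected) -> float:
    """-log sigmoid(r_chosen - r_rejected): small when chosen scores higher."""
    return -math.log(sigmoid(reward(w, chosen) - reward(w, rejected)))


def train_reward_model(pairs, dim: int, lr: float = 0.5, epochs: int = 100):
    """Plain gradient descent on the pairwise loss."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            grad_scale = sigmoid(margin) - 1.0  # d(loss)/d(margin)
            for i in range(dim):
                w[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return w


w = train_reward_model(PAIRS, dim=2)
```

After training, `w[0]` is positive while `w[1]` stays at zero: in these toy rankings, answering the question is what separated chosen from rejected, and politeness never did. The RL stage then adjusts the language model to produce responses that this learned scorer rates highly.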
Why it matters
- "Base" vs "chat/instruct" models differ because of stages 2-3.
- Alignment ≠ truth: it can still confidently hallucinate.
- Pretraining costs millions of dollars in compute; SFT/RLHF mostly cost human labor.
The pipeline (next-token pretraining → SFT on instructions → reward model + RL) on the page: https://dev48v.infy.uk/ai/days/day17-how-llms-trained.html
Part of AIFromZero. https://dev48v.infy.uk