DEV Community

Devanshu Biswas

How LLMs Are Trained: Pretraining, SFT, and RLHF

ChatGPT didn't pop out of the box knowing how to be helpful. It went through three distinct training stages, and understanding them explains a lot about how LLMs behave. Here's the pipeline, shown by how the same answer improves at each stage.

🏗️ Step through Pretraining → SFT → RLHF: https://dev48v.infy.uk/ai/days/day17-how-llms-trained.html

Stage 1: Pretraining

The model is trained to predict the next token across a huge slice of the internet. The result is a base model: it knows a staggering amount and writes fluent text, but all it does is continue your text. Ask "How do I make tea?" and it might reply with more questions, because a list of similar questions is a perfectly plausible continuation. Smart, but no manners.
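To make "it just continues your text" concrete, here is a minimal sketch of next-token prediction. It is a toy bigram counter, not a neural network, and the corpus, `train_bigram`, and `continue_text` are made up for illustration. Real pretraining learns the same kind of "what comes next" statistics, only with a transformer and vastly more data.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def continue_text(counts, prompt_tokens, max_new_tokens=5):
    """Greedily append the most frequent next token, one step at a time."""
    out = list(prompt_tokens)
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # context never seen in training: nothing to predict
        out.append(followers.most_common(1)[0][0])
    return out

# Hypothetical tiny "internet": a page that is just a list of questions.
corpus = "how do i make tea ? how do i make coffee ? how do i make tea ?".split()
model = train_bigram(corpus)

continuation = continue_text(model, ["how", "do"], max_new_tokens=4)
print(" ".join(continuation))  # how do i make tea ?

# After a question mark, the most likely next token is another question.
after_question = continue_text(model, ["tea", "?"], max_new_tokens=3)
print(" ".join(after_question))  # tea ? how do i
```

Notice that the toy model never answers anything. Given a question, the likeliest continuation in its training data is another question, which is the base-model behavior described above.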

Stage 2: Supervised fine-tuning (SFT)

Continue training the base model on curated instruction→response examples. Now it follows instructions and answers the question directly. The answer is plain, but it works.
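Here is a rough sketch of what one SFT training example could look like. The `### Instruction:` template and the `build_sft_example` helper are hypothetical (real chat templates differ from model to model), and masking the prompt out of the loss is a common convention rather than something every setup does. The training objective is still next-token prediction; the sketch only shows which tokens would count toward it.

```python
def build_sft_example(instruction, response):
    """Join prompt and response, and mark which tokens count toward the loss."""
    # Hypothetical template; whitespace splitting stands in for a real tokenizer.
    prompt_tokens = f"### Instruction: {instruction} ### Response:".split()
    response_tokens = response.split()
    tokens = prompt_tokens + response_tokens
    # 0 = prompt token (ignored by the loss), 1 = response token (trained on).
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

tokens, loss_mask = build_sft_example(
    "How do I make tea?",
    "Boil water, steep the tea bag for 3 minutes, then remove it.",
)

for token, counted in zip(tokens, loss_mask):
    print(counted, token)
```

With a mask like this, the model is only graded on producing the response, so it learns "when you see an instruction, answer it" instead of learning to write more instructions.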

Stage 3: RLHF

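Humans rank multiple responses to the same prompt; you train a reward model on those preferences, then use RL (commonly PPO) to optimize the model toward what people prefer. DPO is a popular alternative that skips the separate reward model and RL loop and trains on the preference pairs directly. Either way, the model now gives helpful, well-formatted answers and declines unsafe requests. This is the "assistant" polish.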
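The reward-model step can be sketched with a pairwise loss. This assumes the common Bradley-Terry style formulation, where the loss is -log(sigmoid(r_chosen - r_rejected)); the post does not say which loss is used, and the reward numbers below are invented for illustration.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    x = reward_rejected - reward_chosen
    # Numerically stable softplus(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Hypothetical reward-model scores for two answers to "How do I make tea?"
helpful_answer_reward = 2.0   # the answer the human ranked higher
rambling_answer_reward = 0.0  # the answer the human ranked lower

agrees = preference_loss(helpful_answer_reward, rambling_answer_reward)
disagrees = preference_loss(rambling_answer_reward, helpful_answer_reward)
undecided = preference_loss(1.0, 1.0)

print(f"reward model agrees with the human:    {agrees:.3f}")
print(f"reward model disagrees with the human: {disagrees:.3f}")
print(f"reward model cannot tell them apart:   {undecided:.3f}")
```

The loss is small when the reward model scores the human-preferred answer higher and large when it gets the order wrong. Minimizing it over many ranked pairs gives a scorer that the RL step can then optimize against.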

Why it matters

  • "Base" vs "chat/instruct" models differ because of stages 2-3.
  • Alignment ≠ truth: an aligned model can still confidently hallucinate.
  • Pretraining costs millions of dollars in compute; SFT and RLHF mostly cost human labor (writing examples and ranking responses).

🔨 See the pipeline (next-token pretraining → SFT on instructions → reward model + RL) on the page: https://dev48v.infy.uk/ai/days/day17-how-llms-trained.html

Part of AIFromZero. 🌐 https://dev48v.infy.uk
