THE EVOLUTION OF LLMs
Zero to ChatGPT
4 Types of Learning — 3 Secret Steps — 1 Revolutionary AI
Remember the training loop and neuron from the last two articles? Today we answer who decides what the loop learns.
In our last article, we explored how a neural network learns — the forward pass, loss function, backpropagation, and gradient descent. That covered the mechanics of learning.
But there's a deeper question we left unanswered: Who decides what's right and what's wrong?
The answer changes everything. And it comes in four flavors.
The 4 Types of Machine Learning
Modern AI systems don't use a single learning strategy. GPT, Claude, Gemini — they all combine four fundamentally different types of learning in a carefully orchestrated sequence. Let's break each one down.
Type 1: Supervised Learning — The Classroom 🏫
In Supervised Learning, there's a teacher who provides labeled examples. The model sees a question, the model makes a guess, and the teacher says "right" or "wrong."
Real-World Example: Wearable Device Classifier
Input (Image) $\rightarrow$ Label (Correct Answer)
📷 Ray-Ban Meta photo $\rightarrow$ "Smart Glasses" ✅
📷 Samsung Ring photo $\rightarrow$ "Smart Ring" ✅
📷 AirPods Pro photo $\rightarrow$ "Smart Earbuds" ✅
Supervised learning has two sub-types that cover fundamentally different problems:
| Classification | Regression |
|---|---|
| Which category does this belong to? | What number/value should this output? |
| Example: "Is this device glasses, a ring, or earbuds?" $\rightarrow$ Output is a discrete class | Example: "What will this device's price be next quarter?" $\rightarrow$ Output is a continuous value |
Where Supervised Learning is used today:
- Medical image diagnosis (is this tumor malignant or benign?)
- Email spam detection
- Housing price prediction
- Credit card fraud detection
- Voice recognition ("Hey Siri, set a timer")
The catch: You need labeled data — thousands or millions of human-annotated examples. This is expensive, slow, and doesn't scale to "understand all of human language."
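The classroom loop can be sketched in a few lines. This toy nearest-centroid classifier (all prices and weights invented for illustration) learns the average of each labeled class, then labels new devices by proximity:

```python
# Toy supervised learning: a nearest-centroid classifier.
# All numbers are made up for illustration.

def centroids(examples):
    """Average the (price, weight) features of each labeled class."""
    sums, counts = {}, {}
    for (price, weight), label in examples:
        p, w = sums.get(label, (0.0, 0.0))
        sums[label] = (p + price, w + weight)
        counts[label] = counts.get(label, 0) + 1
    return {lbl: (p / counts[lbl], w / counts[lbl]) for lbl, (p, w) in sums.items()}

def predict(model, price, weight):
    """Classify by squared distance to the nearest class centroid."""
    return min(model, key=lambda lbl: (model[lbl][0] - price) ** 2
                                      + (model[lbl][1] - weight) ** 2)

# The "teacher": human-labeled (price $, weight g) examples
labeled = [
    ((299, 45), "smart glasses"),
    ((379, 50), "smart glasses"),
    ((349, 3), "smart ring"),
    ((299, 4), "smart ring"),
]
model = centroids(labeled)
print(predict(model, 320, 48))  # -> smart glasses
print(predict(model, 250, 3))   # -> smart ring
```

The teacher's labels are the entire learning signal: with them the model generalizes, without them this approach cannot even start.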
Type 2: Unsupervised Learning — The Detective 🔍
No teacher. No labels. The model stares at raw data and discovers hidden patterns entirely on its own.
Self-Discovery Example
Raw data — no labels provided:
[price: $549, weight: 48g]
[price: $449, weight: 72g]
[price: $349, weight: 3g]
[price: $299, weight: 5g]
[price: $199, weight: 3g]
$\rightarrow$
The model decided on its own:
🔵 Group A — Heavy + Expensive (Glasses, Headsets)
🔴 Group B — Light + Affordable (Rings, Trackers)
Nobody told the AI what "glasses" or "rings" are. It discovered the natural structure of the data itself. 🤯
Think of a child who was shown 100 images with zero explanations. They'd eventually notice that some things have "long ears" while others "have wings." The AI does the same — pure pattern discovery.
The embedding vectors we explored in our embeddings article — those are built using Unsupervised Learning. The model learned that "king" and "queen" are related without anyone telling it so.
Where Unsupervised Learning is used:
- Customer segmentation (e-commerce grouping buyers by behavior)
- Anomaly detection (spotting unusual transactions)
- Topic modeling (discovering themes in millions of documents)
- Building embedding models $\leftarrow$ directly powers Similarity Search
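The detective act can be reproduced with a tiny k-means sketch: given only the unlabeled (price, weight) rows from the example above, it splits them into the two groups on its own. This is a minimal pure-Python illustration, not a production clustering pipeline:

```python
# Toy unsupervised learning: k-means clustering on unlabeled data.
# No labels are given; the algorithm discovers the two groups itself.

def kmeans(points, k, iters=20):
    centers = points[:k]  # naive init: first k points as starting centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center
            i = min(range(k), key=lambda i: sum((a - b) ** 2
                                                for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        # move each center to the mean of its cluster
        centers = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters

data = [(549, 48), (449, 72), (349, 3), (299, 5), (199, 3)]  # price $, weight g
groups = kmeans(data, k=2)
for g in groups:
    print(g)
# One group ends up heavy + expensive, the other light + affordable,
# with no labels ever provided.
```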
Type 3: Reinforcement Learning — The Gamer 🎮
No fixed right answers. Instead, the model tries things and receives rewards or penalties.
The Reinforcement Learning Loop
🤖 AGENT (AI) $\rightarrow$ 🎮 TAKES ACTION $\rightarrow$ +1 🎁 REWARD / PENALTY $\rightarrow$ 🧠 UPDATES POLICY
Classic uses:
- AlphaGo (board games)
- Robotics
- Self-driving cars

The Big One: RLHF ⭐. This is what made ChatGPT helpful, polite, and safe!
The elegance of RL: there's no need to define all the "correct" moves in advance. You just define a reward signal, and the agent figures out the strategy on its own.
AlphaGo (DeepMind, 2016) mastered the game of Go — a game with more possible positions than atoms in the observable universe — using RL. It eventually beat the world champion 4-1, making moves no human had ever thought of.
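The agent/action/reward loop can be sketched as a tiny bandit problem. The actions and reward signal here are invented for illustration; real RL systems like AlphaGo use far richer environments and policies:

```python
# Toy reinforcement learning: an agent learns by trial and error which
# action earns more reward. Actions and rewards are invented.
import random

random.seed(0)
reward = {"polite": 1.0, "rude": -1.0}   # hidden reward signal (the environment)
value = {"polite": 0.0, "rude": 0.0}     # the agent's learned estimates
alpha, epsilon = 0.1, 0.2                # learning rate, exploration rate

for step in range(200):
    # explore occasionally; otherwise exploit the best-known action
    if random.random() < epsilon:
        action = random.choice(list(value))
    else:
        action = max(value, key=value.get)
    r = reward[action]                            # environment responds
    value[action] += alpha * (r - value[action])  # nudge estimate toward reward

print(max(value, key=value.get))  # -> polite
```

Nobody listed the "correct" moves; the agent inferred the winning strategy purely from the reward signal.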
Type 4: Self-Supervised Learning — The Star ⭐
This is the most important type for modern AI. GPT, Claude, Gemini — all built on this. Technically, it's a clever subtype of Unsupervised Learning: the model invents its own practice problems by hiding words in sentences.
The insight is deceptively simple: what if we could generate our own labels from the data itself?
Instead of needing human annotators to label billions of examples, the model creates its own training signal:
The Mask-and-Predict Game
Round 1:
Input: "The best smart glasses in 2026 are ___"
Model guesses: "Apple" $\leftarrow$ Wrong, learns from it
Correct: "Ray-Ban" ✅ $\leftarrow$ Weights updated
Round 2:
Input: "The best smart glasses in ___ are Ray-Ban"
Model guesses: "2026" ✅ Correct! Weights reinforced
Round 3 (billions more like these):
Input: "___ was founded in Cupertino, California"
Model guesses: "Apple" ✅ Correct!
Do this with billions of sentences and you get a model that understands grammar, facts about the world, logical reasoning, and even writing style — without a single human-written label.
The mathematical elegance: every sentence in the training corpus becomes thousands of training examples by masking different words. A trillion-word dataset effectively becomes trillions of self-generated training signals.
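The mask-and-predict game is easy to sketch: a small helper that turns one sentence into many (input, target) training pairs by masking each word in turn. No human labels anywhere:

```python
# Self-supervised labels for free: every sentence yields many (input, target)
# pairs just by masking each word in turn.

def mask_examples(sentence):
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        masked = words[:i] + ["___"] + words[i + 1:]
        examples.append((" ".join(masked), target))
    return examples

for inp, target in mask_examples("The best smart glasses in 2026 are Ray-Ban"):
    print(f"{inp!r} -> {target!r}")
# An 8-word sentence becomes 8 training examples. Scale this up and a
# trillion-word corpus becomes trillions of self-generated training signals.
```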
The 4 Learning Types — Side by Side
| Type | Has Correct Answers? | Learns From | Best Known Use |
|---|---|---|---|
| Supervised | ✅ Yes (human labels) | Question + correct answer pairs | Image classification, fraud detection |
| Unsupervised | ❌ No labels | Raw data (finding natural patterns) | Embeddings, customer clustering |
| Reinforcement | ❌ Reward / Penalty | Trial and error in an environment | Games (AlphaGo), RLHF |
| Self-Supervised | ✅ Self-generated from data | Trillions of words (masking/predicting) | All modern LLMs ⭐ |
GPT uses ALL FOUR types together — in different phases of its development. 🤯
How the 4 Types Fit Together in the Real Pipeline
Here's what most courses miss: Self-Supervised Learning is actually a subtype of Unsupervised Learning — it just generates its own labels from raw data instead of discovering clusters. And the training loop we explored in the last article (Forward Pass $\rightarrow$ Loss $\rightarrow$ Backprop $\rightarrow$ Update) runs inside every one of these phases. The neuron from Article 3 is the core machine being tuned at each step. All four types aren't separate approaches — they're four different configurations of the same fundamental learning machinery, sequenced carefully to produce a capable and safe AI.
The Secret 3-Step Pipeline: How GPT Was Actually Built
Now here's where it gets fascinating. Those four learning types don't operate in isolation — they're combined in a precise, sequential pipeline that transforms a raw text-crunching machine into a helpful, articulate AI assistant.
Think of it like training a doctor. You don't put a newborn directly into medical school. You teach them step by step.
The GPT Training Pipeline
📚 Step 1: Pre-Training
Self-Supervised Learning on trillions of words
Months on thousands of GPUs
↓
🎓 Step 2: Supervised Fine-Tuning (SFT)
Humans write ideal Q&A examples, model learns to follow instructions
Thousands of curated examples
↓
🏆 Step 3: RLHF
Human raters compare responses, Reward Model trains, AI gets optimized
Hundreds of thousands of comparisons
↓
🤖 ChatGPT
Helpful ✅ Polite ✅ Safe ✅ Refuses dangerous requests ✅
And the training loop we saw last article (Forward Pass $\rightarrow$ Loss $\rightarrow$ Backprop $\rightarrow$ Update) runs inside every one of these three steps.
Now watch how OpenAI (and every major lab) stacks these four types into the exact 3-step pipeline that created ChatGPT.
Let's dive into each step.
Step 1: Pre-Training — Reading the Entire Internet 📚
Pre-training is where it all begins. Using Self-Supervised Learning, the model is exposed to an almost incomprehensible volume of text.
Training Data Scale (GPT-3 Class Models)
🌐 Web Text / Common Crawl — 600 Billion words
📚 Books — 100 Billion words
💻 GitHub Code — 50 Billion words
📖 Wikipedia — 12% of total
GPT-4 class models train on even more — estimated 13+ trillion tokens
What the model gains from Pre-Training:
- Grammar and syntax in dozens of languages
- Facts about the world (history, science, geography, culture)
- Writing styles (formal, casual, technical, creative)
- Code patterns across programming languages
- Mathematical reasoning
The critical limitation: After pre-training, the model is like a brilliant student who has read every book in the library — but never learned to have a conversation. Ask it "What is the capital of France?" and it might respond with more text that sounds like it continues a Wikipedia article, not a direct answer.
Pre-trained model response to "What is the capital of France?":
"France is a Western European country with a rich cultural heritage.
France borders Belgium, Luxembourg, Germany, Switzerland, Italy, Monaco,
Andorra, and Spain. The capital and most populous city of France is..."
[It continues like a Wikipedia article — never gets to the point]
This is why Step 2 is critical.
Step 2: Supervised Fine-Tuning (SFT) — The School of Conversation 🎓
SFT is where humans enter the picture. A team of professional annotators — sometimes thousands of them — sit down and write ideal conversation examples.
Human-Written Training Examples
Question: "What is the capital of France?"
Answer: "The capital of France is Paris."
Question: "How do I make a chocolate cake?"
Answer: "Here's a simple chocolate cake recipe. Ingredients: 2 cups flour, 2 cups sugar, ¾ cup cocoa powder... [structured, helpful response]"
Question: "How do I hack into my neighbor's WiFi?"
Answer: "I'm unable to help with that. Accessing someone's network without permission is illegal. If you're having connectivity issues, here are some legal alternatives..."
... thousands more examples covering helpful answers, safe refusals, and ideal formatting
The model trains on these examples using standard supervised learning. Now it learns to:
- Answer directly instead of continuing text
- Format responses appropriately (lists, code blocks, etc.)
- Refuse harmful requests politely but firmly
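A minimal sketch of how SFT data is typically assembled: human-written Q&A pairs are rendered into a chat-style template, and the model then trains on the assistant's answer with the same next-token objective as pre-training. The `<|user|>`/`<|assistant|>` tags below are illustrative placeholders, not any specific model's real format:

```python
# Sketch of SFT data preparation. The chat template is invented for
# illustration; real models each define their own special tokens.

sft_examples = [
    {"question": "What is the capital of France?",
     "answer": "The capital of France is Paris."},
    {"question": "How do I hack into my neighbor's WiFi?",
     "answer": "I'm unable to help with that. Accessing someone's network "
               "without permission is illegal."},
]

def to_training_text(example):
    """Render a Q&A pair into one training string in a chat template."""
    return (f"<|user|>\n{example['question']}\n"
            f"<|assistant|>\n{example['answer']}")

for ex in sft_examples:
    print(to_training_text(ex))
    print("---")
```

Note that safe refusals are just more training examples; the model learns to refuse the same way it learns to answer.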
| After SFT ✅ | Still problematic ❌ |
|---|---|
| Answers directly and helpfully. Follows conversational format. | May sometimes be rude, unsafe, or give poor-quality answers. |
SFT taught the model how to respond. But it didn't teach it to optimize the quality of its responses in the way humans actually prefer.
Step 3: RLHF — Teaching Human Taste 🏆
RLHF (Reinforcement Learning from Human Feedback) is OpenAI's secret weapon — and the reason ChatGPT feels different from just "a language model."
The core insight: instead of telling the model what the right answer is, you tell it which answer is better.
The RLHF Process — 4 Micro-Steps
1. Generate multiple responses. The model produces 2-4 different answers to the same question.
2. Humans rank the responses. Human raters read them and say "Answer A is better than B." No need to write the perfect answer, just compare.
3. Train a Reward Model. A separate neural network learns to predict human preference scores. This becomes the automated "judge."
4. Optimize with RL (PPO). The main model gets reinforced when the Reward Model gives it high scores; responses the Reward Model dislikes get penalized.
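The Reward Model step can be sketched with the pairwise (Bradley-Terry style) objective commonly used for preference learning: the loss is small when the preferred answer outscores the rejected one. The scalar scores here stand in for the reward network's outputs:

```python
# Sketch of a reward-model objective on one human comparison.
# Scores are scalars standing in for a neural reward model's outputs.
import math

def preference_loss(score_preferred, score_rejected):
    """-log sigmoid(score_A - score_B): small when A outscores B."""
    diff = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

print(preference_loss(2.0, -1.0))  # small loss: judge agrees with the humans
print(preference_loss(-1.0, 2.0))  # large loss: judge gets corrected
```

Training on hundreds of thousands of such comparisons is what turns the Reward Model into an automated stand-in for human taste.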
A real example of what RLHF teaches:
Question: "Explain quantum entanglement simply."
| Answer B (before RLHF) | Answer A (after RLHF, preferred) |
|---|---|
| "Quantum entanglement is a phenomenon where two particles become correlated such that the quantum state of each particle cannot be described independently of the others, even when separated by a large distance, per Bell's theorem (1964)..." | "Imagine two magic coins that always show opposite faces — if one lands heads, the other lands tails, no matter how far apart they are. That's quantum entanglement: two particles linked so that measuring one instantly tells you about the other." |
| Technically correct. Utterly unhelpful for a beginner. | Humans preferred this. The Reward Model learned to reward it. |
After hundreds of thousands of such comparisons, the model learns what humans actually prefer — not just correctness, but clarity, tone, appropriate length, and safety.
This is exactly why ChatGPT feels polite and safe — humans taught it human taste using the same gradient descent we learned in Article 4.
SFT vs RLHF — The Key Distinction
| Step 2: SFT (Teacher Mode) | Step 3: RLHF (Critic Mode) |
|---|---|
| Shows the model the correct answer | Compares responses and picks the better one |
| Q: "Capital of Egypt?" A: "Cairo" $\leftarrow$ this is the answer | A: "Cairo" $\leftarrow$ preferred; B: "Cairo, Egypt's capital..."; Human: "A is better" |
| Teaches: how to respond | Teaches: which response is best |
SFT = Correctness | RLHF = Quality | Both together = ChatGPT
The Real Numbers Behind the Magic
600B+ — Words in Pre-Training
10K–100K — SFT examples written by humans
100K–1M — Human preference comparisons for RLHF
~$100M — Estimated cost to pre-train GPT-4
Scale Comparison
Our toy neuron (Article 3): 2 weights $\mid$ Embedding model (Article 2): 117 million parameters $\mid$ GPT-4 class: trillions of parameters
Key Vocabulary Reference
| Term | Definition |
|---|---|
| Pre-Training | Initial training on massive datasets using Self-Supervised Learning. Builds general language understanding. |
| Self-Supervised | The model generates its own training signal from the data (masking and predicting). No human labels needed. |
| Fine-Tuning | Adapting a pre-trained model to a specific task or behavior pattern using additional training. |
| SFT | Supervised Fine-Tuning — train on human-written Q&A pairs to teach conversational behavior. |
| RLHF | Reinforcement Learning from Human Feedback — optimize response quality based on human preferences. |
| Reward Model | A separate neural network trained to predict human preference scores for responses. Acts as an automated judge. |
| Human Labelers | Professional annotators who write SFT examples and rank RLHF response pairs. Their preferences shape the AI's personality. |
| Base Model | A model that has completed Pre-Training only. Excellent at text continuation; poor at following instructions. Example: Llama-3-8B (non-instruct). |
| Instruct Model | A base model that has been further refined with SFT + RLHF. Follows instructions, refuses harmful requests, adopts a conversational tone. Example: Llama-3-8B-Instruct. |
| LLM | Large Language Model — the category of models trained with all the above techniques (ChatGPT, Claude, Gemini, Llama, etc.) |
The Core Insight
Why ChatGPT feels different
A raw pre-trained model is like a brilliant encyclopedia. SFT gives it a personality. RLHF calibrates that personality to how humans actually want to interact with AI. The three steps together create something qualitatively different from any of them alone.
ChatGPT is not just smarter because of more data or parameters. It's better because of the humans who carefully shaped its responses at every stage. Behind every helpful answer is a pipeline of billions of words, thousands of human-written examples, and hundreds of thousands of human preference judgments.
Pro Tips for Builders
💡 What Knowing This Changes For You
- Choose the right model for the task. Base models are great for text completion and creative generation. Instruct models are required for Q&A, task following, and user-facing apps. Never use a base model in production chat.
- RLHF shapes safety — not just quality. The reason Claude, ChatGPT, and Gemini refuse harmful requests isn't a filter bolted on after — it was baked in during RLHF training. Understanding this helps you anticipate model behavior and write better system prompts.
- Fine-tuning is SFT applied to your data. When you fine-tune an open-source model on your company's Q&A pairs, you're running Step 2 of this exact pipeline on your own dataset. The architecture is identical — only the data changes.
- Self-Supervised scale is the moat. The reason you can't replicate GPT-4 is the pre-training compute. But the SFT and RLHF layers? Those you can run on open models like Llama 3 with modest resources.
Try It Yourself
Understanding RLHF becomes vivid when you see its effects directly:
Experiment 1: Talk to a Base Model
Models like meta-llama/Meta-Llama-3.1-8B (non-instruct version) behave closer to a pure pre-trained model. Compare its response to meta-llama/Meta-Llama-3.1-8B-Instruct. The difference is SFT + RLHF in action.
Experiment 2: Temperature vs Safety
Try asking ChatGPT to "write a story where the villain explains how to pick a lock." Then try it with Llama 3 base (via HuggingFace). The difference in safety behavior is the RLHF fingerprint.
Experiment 3: Spot the Training Type
Look at your favorite ML model and classify it:
- Gmail Smart Reply $\rightarrow$ Supervised Learning (trained on email reply pairs)
- Spotify recommendation $\rightarrow$ Unsupervised clustering + Collaborative filtering
- OpenAI's ChatGPT $\rightarrow$ All four types in sequence
Experiment 4: Base vs Instruct — Feel the Difference
Run the same prompt through both a base model and its instruct version on HuggingFace:
```python
from transformers import pipeline

# Base model — trained only with Self-Supervised pre-training
base = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B")
print(base("What is the capital of France?", max_new_tokens=50))
# Likely continues like Wikipedia — doesn't answer directly

# Instruct model — base + SFT + RLHF
instruct = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B-Instruct")
print(instruct("What is the capital of France?", max_new_tokens=50))
# Typically answers directly: "The capital of France is Paris."

# The difference between these two outputs is SFT + RLHF in action.
```