THE EVOLUTION OF LLMs
Zero to ChatGPT
4 Types of Learning — 3 Secret Steps — 1 Revolutionary AI
Remember the training loop and neuron from the last two articles? Today we answer who decides what the loop learns.
In our last article, we explored how a neural network learns — the forward pass, loss function, backpropagation, and gradient descent. That covered the mechanics of learning.
But there's a deeper question we left unanswered: Who decides what's right and what's wrong?
The answer changes everything. And it comes in four flavors.
The 4 Types of Machine Learning
Modern AI systems don't use a single learning strategy. GPT, Claude, Gemini — they all combine four fundamentally different types of learning in a carefully orchestrated sequence. Let's break each one down.
Type 1: Supervised Learning — The Classroom 🏫
In Supervised Learning, there's a teacher who provides labeled examples. The model sees a question, the model makes a guess, and the teacher says "right" or "wrong."
Real-World Example: Wearable Device Classifier
Input (Image) $\rightarrow$ Label (Correct Answer)
📷 Ray-Ban Meta photo $\rightarrow$ "Smart Glasses" ✅
📷 Samsung Ring photo $\rightarrow$ "Smart Ring" ✅
📷 AirPods Pro photo $\rightarrow$ "Smart Earbuds" ✅
Supervised learning has two sub-types that cover fundamentally different problems:
| Classification | Regression |
|---|---|
| Which category does this belong to? | What number/value should this output? |
| Example: "Is this device glasses, a ring, or earbuds?" $\rightarrow$ Output is a discrete class | Example: "What will this device's price be next quarter?" $\rightarrow$ Output is a continuous value |
Where Supervised Learning is used today:
- Medical image diagnosis (is this tumor malignant or benign?)
- Email spam detection
- Housing price prediction
- Credit card fraud detection
- Voice recognition ("Hey Siri, set a timer")
The catch: You need labeled data — thousands or millions of human-annotated examples. This is expensive, slow, and doesn't scale to "understand all of human language."
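The classroom loop can be sketched in a few lines. This toy nearest-centroid classifier (all prices and weights invented for illustration) learns the average of each labeled class, then labels new devices by proximity:

```python
# Toy supervised learning: a nearest-centroid classifier.
# All numbers are made up for illustration.

def centroids(examples):
    """Average the (price, weight) features of each labeled class."""
    sums, counts = {}, {}
    for (price, weight), label in examples:
        p, w = sums.get(label, (0.0, 0.0))
        sums[label] = (p + price, w + weight)
        counts[label] = counts.get(label, 0) + 1
    return {lbl: (p / counts[lbl], w / counts[lbl]) for lbl, (p, w) in sums.items()}

def predict(model, price, weight):
    """Classify by squared distance to the nearest class centroid."""
    return min(model, key=lambda lbl: (model[lbl][0] - price) ** 2
                                      + (model[lbl][1] - weight) ** 2)

# The "teacher": human-labeled (price $, weight g) examples
labeled = [
    ((299, 45), "smart glasses"),
    ((379, 50), "smart glasses"),
    ((349, 3), "smart ring"),
    ((299, 4), "smart ring"),
]
model = centroids(labeled)
print(predict(model, 320, 48))  # -> smart glasses
print(predict(model, 250, 3))   # -> smart ring
```

The teacher's labels are the entire learning signal: with them the model generalizes, without them this approach cannot even start.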
Type 2: Unsupervised Learning — The Detective 🔍
No teacher. No labels. The model stares at raw data and discovers hidden patterns entirely on its own.
Self-Discovery Example
Raw data — no labels provided:
[price: $549, weight: 48g]
[price: $449, weight: 72g]
[price: $349, weight: 3g]
[price: $299, weight: 5g]
[price: $199, weight: 3g]
$\rightarrow$
The model decided on its own:
🔵 Group A — Heavy + Expensive (Glasses, Headsets)
🔴 Group B — Light + Affordable (Rings, Trackers)
Nobody told the AI what "glasses" or "rings" are. It discovered the natural structure of the data itself. 🤯
Think of a child who was shown 100 images with zero explanations. They'd eventually notice that some things have "long ears" while others "have wings." The AI does the same — pure pattern discovery.
The embedding vectors we explored in our embeddings article — those are built using Unsupervised Learning. The model learned that "king" and "queen" are related without anyone telling it so.
Where Unsupervised Learning is used:
- Customer segmentation (e-commerce grouping buyers by behavior)
- Anomaly detection (spotting unusual transactions)
- Topic modeling (discovering themes in millions of documents)
- Building embedding models $\leftarrow$ directly powers Similarity Search
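The detective act can be reproduced with a tiny k-means sketch: given only the unlabeled (price, weight) rows from the example above, it splits them into the two groups on its own. This is a minimal pure-Python illustration, not a production clustering pipeline:

```python
# Toy unsupervised learning: k-means clustering on unlabeled data.
# No labels are given; the algorithm discovers the two groups itself.

def kmeans(points, k, iters=20):
    centers = points[:k]  # naive init: first k points as starting centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center
            i = min(range(k), key=lambda i: sum((a - b) ** 2
                                                for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        # move each center to the mean of its cluster
        centers = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters

data = [(549, 48), (449, 72), (349, 3), (299, 5), (199, 3)]  # price $, weight g
groups = kmeans(data, k=2)
for g in groups:
    print(g)
# One group ends up heavy + expensive, the other light + affordable,
# with no labels ever provided.
```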
Type 3: Reinforcement Learning — The Gamer 🎮
No fixed right answers. Instead, the model tries things and receives rewards or penalties.
The Reinforcement Learning Loop
🤖 AGENT (AI) $\rightarrow$ 🎮 TAKES ACTION $\rightarrow$ +1 🎁 REWARD / PENALTY $\rightarrow$ 🧠 UPDATES POLICY
Classic uses:
- AlphaGo (board games)
- Robotics
- Self-driving cars

The Big One: RLHF ⭐. This is what made ChatGPT helpful, polite, and safe!
The elegance of RL: there's no need to define all the "correct" moves in advance. You just define a reward signal, and the agent figures out the strategy on its own.
AlphaGo (DeepMind, 2016) mastered the game of Go — a game with more possible positions than atoms in the observable universe — using RL. It eventually beat the world champion 4-1, making moves no human had ever thought of.
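The agent/action/reward loop can be sketched as a tiny bandit problem. The actions and reward signal here are invented for illustration; real RL systems like AlphaGo use far richer environments and policies:

```python
# Toy reinforcement learning: an agent learns by trial and error which
# action earns more reward. Actions and rewards are invented.
import random

random.seed(0)
reward = {"polite": 1.0, "rude": -1.0}   # hidden reward signal (the environment)
value = {"polite": 0.0, "rude": 0.0}     # the agent's learned estimates
alpha, epsilon = 0.1, 0.2                # learning rate, exploration rate

for step in range(200):
    # explore occasionally; otherwise exploit the best-known action
    if random.random() < epsilon:
        action = random.choice(list(value))
    else:
        action = max(value, key=value.get)
    r = reward[action]                            # environment responds
    value[action] += alpha * (r - value[action])  # nudge estimate toward reward

print(max(value, key=value.get))  # -> polite
```

Nobody listed the "correct" moves; the agent inferred the winning strategy purely from the reward signal.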
Type 4: Self-Supervised Learning — The Star ⭐
This is the most important type for modern AI. GPT, Claude, Gemini — all built on this. Technically, it's a clever subtype of Unsupervised Learning: the model invents its own practice problems by hiding words in sentences.
The insight is deceptively simple: what if we could generate our own labels from the data itself?
Instead of needing human annotators to label billions of examples, the model creates its own training signal:
The Mask-and-Predict Game
Round 1:
Input: "The best smart glasses in 2026 are ___"
Model guesses: "Apple" $\leftarrow$ Wrong, learns from it
Correct: "Ray-Ban" ✅ $\leftarrow$ Weights updated
Round 2:
Input: "The best smart glasses in ___ are Ray-Ban"
Model guesses: "2026" ✅ Correct! Weights reinforced
Round 3 (billions more like these):
Input: "___ was founded in Cupertino, California"
Model guesses: "Apple" ✅ Correct!
Do this with billions of sentences and you get a model that understands grammar, facts about the world, logical reasoning, and even writing style — without a single human-written label.
The mathematical elegance: every sentence in the training corpus becomes thousands of training examples by masking different words. A trillion-word dataset effectively becomes trillions of self-generated training signals.
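The mask-and-predict game is easy to sketch: a small helper that turns one sentence into many (input, target) training pairs by masking each word in turn. No human labels anywhere:

```python
# Self-supervised labels for free: every sentence yields many (input, target)
# pairs just by masking each word in turn.

def mask_examples(sentence):
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        masked = words[:i] + ["___"] + words[i + 1:]
        examples.append((" ".join(masked), target))
    return examples

for inp, target in mask_examples("The best smart glasses in 2026 are Ray-Ban"):
    print(f"{inp!r} -> {target!r}")
# An 8-word sentence becomes 8 training examples. Scale this up and a
# trillion-word corpus becomes trillions of self-generated training signals.
```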
The 4 Learning Types — Side by Side
| Type | Has Correct Answers? | Learns From | Best Known Use |
|---|---|---|---|
| Supervised | ✅ Yes (human labels) | Question + correct answer pairs | Image classification, fraud detection |
| Unsupervised | ❌ No labels | Raw data (finding natural patterns) | Embeddings, customer clustering |
| Reinforcement | ❌ Reward / Penalty | Trial and error in an environment | Games (AlphaGo), RLHF |
| Self-Supervised | ✅ Self-generated from data | Trillions of words (masking/predicting) | All modern LLMs ⭐ |
GPT uses ALL FOUR types together — in different phases of its development. 🤯
How the 4 Types Fit Together in the Real Pipeline
Here's what most courses miss: Self-Supervised Learning is actually a subtype of Unsupervised Learning — it just generates its own labels from raw data instead of discovering clusters. And the training loop we explored in the last article (Forward Pass $\rightarrow$ Loss $\rightarrow$ Backprop $\rightarrow$ Update) runs inside every one of these phases. The neuron from Article 3 is the core machine being tuned at each step. All four types aren't separate approaches — they're four different configurations of the same fundamental learning machinery, sequenced carefully to produce a capable and safe AI.
The Secret 3-Step Pipeline: How GPT Was Actually Built
Now here's where it gets fascinating. Those four learning types don't operate in isolation — they're combined in a precise, sequential pipeline that transforms a raw text-crunching machine into a helpful, articulate AI assistant.
Think of it like training a doctor. You don't put a newborn directly into medical school. You teach them step by step.
The GPT Training Pipeline
📚 Step 1: Pre-Training
Self-Supervised Learning on trillions of words
Months on thousands of GPUs
↓
🎓 Step 2: Supervised Fine-Tuning (SFT)
Humans write ideal Q&A examples, model learns to follow instructions
Thousands of curated examples
↓
🏆 Step 3: RLHF
Human raters compare responses, Reward Model trains, AI gets optimized
Hundreds of thousands of comparisons
↓
🤖 ChatGPT
Helpful ✅ Polite ✅ Safe ✅ Refuses dangerous requests ✅
And the training loop we saw last article (Forward Pass $\rightarrow$ Loss $\rightarrow$ Backprop $\rightarrow$ Update) runs inside every one of these three steps.
Now watch how OpenAI (and every major lab) stacks these four types into the exact 3-step pipeline that created ChatGPT.
Let's dive into each step.
Step 1: Pre-Training — Reading the Entire Internet 📚
Pre-training is where it all begins. Using Self-Supervised Learning, the model is exposed to an almost incomprehensible volume of text.
Training Data Scale (GPT-3 Class Models)
🌐 Web Text / Common Crawl — 600 Billion words
📚 Books — 100 Billion words
💻 GitHub Code — 50 Billion words
📖 Wikipedia — 12% of total
GPT-4 class models train on even more — estimated 13+ trillion tokens
What the model gains from Pre-Training:
- Grammar and syntax in dozens of languages
- Facts about the world (history, science, geography, culture)
- Writing styles (formal, casual, technical, creative)
- Code patterns across programming languages
- Mathematical reasoning
The critical limitation: After pre-training, the model is like a brilliant student who has read every book in the library — but never learned to have a conversation. Ask it "What is the capital of France?" and it might respond with more text that sounds like it continues a Wikipedia article, not a direct answer.
Pre-trained model response to "What is the capital of France?":
"France is a Western European country with a rich cultural heritage.
France borders Belgium, Luxembourg, Germany, Switzerland, Italy, Monaco,
Andorra, and Spain. The capital and most populous city of France is..."
[It continues like a Wikipedia article — never gets to the point]
This is why Step 2 is critical.
Step 2: Supervised Fine-Tuning (SFT) — The School of Conversation 🎓
SFT is where humans enter the picture. A team of professional annotators — sometimes thousands of them — sit down and write ideal conversation examples.
Human-Written Training Examples
Question: "What is the capital of France?"
Answer: "The capital of France is Paris."
Question: "How do I make a chocolate cake?"
Answer: "Here's a simple chocolate cake recipe. Ingredients: 2 cups flour, 2 cups sugar, ¾ cup cocoa powder... [structured, helpful response]"
Question: "How do I hack into my neighbor's WiFi?"
Answer: "I'm unable to help with that. Accessing someone's network without permission is illegal. If you're having connectivity issues, here are some legal alternatives..."
... thousands more examples covering helpful answers, safe refusals, and ideal formatting
The model trains on these examples using standard supervised learning. Now it learns to:
- Answer directly instead of continuing text
- Format responses appropriately (lists, code blocks, etc.)
- Refuse harmful requests politely but firmly
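A minimal sketch of how SFT data is typically assembled: human-written Q&A pairs are rendered into a chat-style template, and the model then trains on the assistant's answer with the same next-token objective as pre-training. The `<|user|>`/`<|assistant|>` tags below are illustrative placeholders, not any specific model's real format:

```python
# Sketch of SFT data preparation. The chat template is invented for
# illustration; real models each define their own special tokens.

sft_examples = [
    {"question": "What is the capital of France?",
     "answer": "The capital of France is Paris."},
    {"question": "How do I hack into my neighbor's WiFi?",
     "answer": "I'm unable to help with that. Accessing someone's network "
               "without permission is illegal."},
]

def to_training_text(example):
    """Render a Q&A pair into one training string in a chat template."""
    return (f"<|user|>\n{example['question']}\n"
            f"<|assistant|>\n{example['answer']}")

for ex in sft_examples:
    print(to_training_text(ex))
    print("---")
```

Note that safe refusals are just more training examples; the model learns to refuse the same way it learns to answer.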
| After SFT ✅ | Still problematic ❌ |
|---|---|
| Answers directly and helpfully. Follows conversational format. | May sometimes be rude, unsafe, or give poor-quality answers. |
SFT taught the model how to respond. But it didn't teach it to optimize the quality of its responses in the way humans actually prefer.
Step 3: RLHF — Teaching Human Taste 🏆
RLHF (Reinforcement Learning from Human Feedback) is OpenAI's secret weapon — and the reason ChatGPT feels different from just "a language model."
The core insight: instead of telling the model what the right answer is, you tell it which answer is better.
The RLHF Process — 4 Micro-Steps
1. Generate multiple responses. The model produces 2-4 different answers to the same question.
2. Humans rank the responses. Human raters read them and say "Answer A is better than B." No need to write the perfect answer, just compare.
3. Train a Reward Model. A separate neural network learns to predict human preference scores. This becomes the automated "judge."
4. Optimize with RL (PPO). The main model gets reinforced when the Reward Model gives it high scores; responses the Reward Model dislikes get penalized.
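The Reward Model step can be sketched with the pairwise (Bradley-Terry style) objective commonly used for preference learning: the loss is small when the preferred answer outscores the rejected one. The scalar scores here stand in for the reward network's outputs:

```python
# Sketch of a reward-model objective on one human comparison.
# Scores are scalars standing in for a neural reward model's outputs.
import math

def preference_loss(score_preferred, score_rejected):
    """-log sigmoid(score_A - score_B): small when A outscores B."""
    diff = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

print(preference_loss(2.0, -1.0))  # small loss: judge agrees with the humans
print(preference_loss(-1.0, 2.0))  # large loss: judge gets corrected
```

Training on hundreds of thousands of such comparisons is what turns the Reward Model into an automated stand-in for human taste.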
A real example of what RLHF teaches:
Question: "Explain quantum entanglement simply."
| Answer B (before RLHF) | Answer A (after RLHF, preferred) |
|---|---|
| "Quantum entanglement is a phenomenon where two particles become correlated such that the quantum state of each particle cannot be described independently of the others, even when separated by a large distance, per Bell's theorem (1964)..." | "Imagine two magic coins that always show opposite faces — if one lands heads, the other lands tails, no matter how far apart they are. That's quantum entanglement: two particles linked so that measuring one instantly tells you about the other." |
| Technically correct. Utterly unhelpful for a beginner. | Humans preferred this. The Reward Model learned to reward it. |
After hundreds of thousands of such comparisons, the model learns what humans actually prefer — not just correctness, but clarity, tone, appropriate length, and safety.
This is exactly why ChatGPT feels polite and safe — humans taught it human taste using the same gradient descent we learned in Article 4.
SFT vs RLHF — The Key Distinction
| Step 2: SFT (Teacher Mode) | Step 3: RLHF (Critic Mode) |
|---|---|
| Shows the model the correct answer | Compares responses and picks the better one |
| Q: "Capital of Egypt?" A: "Cairo" $\leftarrow$ this is the answer | A: "Cairo" $\leftarrow$ preferred; B: "Cairo, Egypt's capital..."; Human: "A is better" |
| Teaches: how to respond | Teaches: which response is best |
SFT = Correctness | RLHF = Quality | Both together = ChatGPT
The Real Numbers Behind the Magic
600B+ — Words in Pre-Training
10K–100K — SFT examples written by humans
100K–1M — Human preference comparisons for RLHF
~$100M — Estimated cost to pre-train GPT-4
Scale Comparison
Our toy neuron (Article 3): 2 weights $\mid$ Embedding model (Article 2): 117 million parameters $\mid$ GPT-4 class: trillions of parameters
Key Vocabulary Reference
| Term | Definition |
|---|---|
| Pre-Training | Initial training on massive datasets using Self-Supervised Learning. Builds general language understanding. |
| Self-Supervised | The model generates its own training signal from the data (masking and predicting). No human labels needed. |
| Fine-Tuning | Adapting a pre-trained model to a specific task or behavior pattern using additional training. |
| SFT | Supervised Fine-Tuning — train on human-written Q&A pairs to teach conversational behavior. |
| RLHF | Reinforcement Learning from Human Feedback — optimize response quality based on human preferences. |
| Reward Model | A separate neural network trained to predict human preference scores for responses. Acts as an automated judge. |
| Human Labelers | Professional annotators who write SFT examples and rank RLHF response pairs. Their preferences shape the AI's personality. |
| Base Model | A model that has completed Pre-Training only. Excellent at text continuation; poor at following instructions. Example: Llama-3-8B (non-instruct). |
| Instruct Model | A base model that has been further refined with SFT + RLHF. Follows instructions, refuses harmful requests, adopts a conversational tone. Example: Llama-3-8B-Instruct. |
| LLM | Large Language Model — the category of models trained with all the above techniques (ChatGPT, Claude, Gemini, Llama, etc.) |
The Core Insight
Why ChatGPT feels different
A raw pre-trained model is like a brilliant encyclopedia. SFT gives it a personality. RLHF calibrates that personality to how humans actually want to interact with AI. The three steps together create something qualitatively different from any of them alone.
ChatGPT is not just smarter because of more data or parameters. It's better because of the humans who carefully shaped its responses at every stage. Behind every helpful answer is a pipeline of billions of words, thousands of human-written examples, and hundreds of thousands of human preference judgments.
Pro Tips for Builders
💡 What Knowing This Changes For You
- Choose the right model for the task. Base models are great for text completion and creative generation. Instruct models are required for Q&A, task following, and user-facing apps. Never use a base model in production chat.
- RLHF shapes safety — not just quality. The reason Claude, ChatGPT, and Gemini refuse harmful requests isn't a filter bolted on after — it was baked in during RLHF training. Understanding this helps you anticipate model behavior and write better system prompts.
- Fine-tuning is SFT applied to your data. When you fine-tune an open-source model on your company's Q&A pairs, you're running Step 2 of this exact pipeline on your own dataset. The architecture is identical — only the data changes.
- Self-Supervised scale is the moat. The reason you can't replicate GPT-4 is the pre-training compute. But the SFT and RLHF layers? Those you can run on open models like Llama 3 with modest resources.
Try It Yourself
Understanding RLHF becomes vivid when you see its effects directly:
Experiment 1: Talk to a Base Model
Models like meta-llama/Meta-Llama-3.1-8B (non-instruct version) behave closer to a pure pre-trained model. Compare its response to meta-llama/Meta-Llama-3.1-8B-Instruct. The difference is SFT + RLHF in action.
Experiment 2: Temperature vs Safety
Try asking ChatGPT to "write a story where the villain explains how to pick a lock." Then try it with Llama 3 base (via HuggingFace). The difference in safety behavior is the RLHF fingerprint.
Experiment 3: Spot the Training Type
Look at your favorite ML model and classify it:
- Gmail Smart Reply $\rightarrow$ Supervised Learning (trained on email reply pairs)
- Spotify recommendation $\rightarrow$ Unsupervised clustering + Collaborative filtering
- OpenAI's ChatGPT $\rightarrow$ All four types in sequence
Experiment 4: Base vs Instruct — Feel the Difference
Run the same prompt through both a base model and its instruct version on HuggingFace:
```python
from transformers import pipeline

# Base model — trained only with Self-Supervised pre-training
base = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B")
print(base("What is the capital of France?", max_new_tokens=50))
# Likely continues like Wikipedia — doesn't answer directly

# Instruct model — base + SFT + RLHF
instruct = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B-Instruct")
print(instruct("What is the capital of France?", max_new_tokens=50))
# Typically answers directly: "The capital of France is Paris."

# The difference between these two outputs is SFT + RLHF in action.
```