One model objective, stated simply: given all previous words, predict the next word.
That is the complete description of GPT's training. No labels. No human annotations. No special setup. Just take any text, hide the last word, and train the model to predict it. Then hide the last two words. Then the last three. Repeat on three hundred billion tokens of internet text.
The result is a model that learns grammar, facts, reasoning, coding conventions, mathematical patterns, writing styles, and argumentation structures, not because anyone taught it these things explicitly, but because they are all necessary for predicting the next word well.
When you can predict the next word with high accuracy, you have learned something deep about language and the world it describes.
The Autoregressive Objective
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import warnings
warnings.filterwarnings("ignore")
torch.manual_seed(42)
print("GPT Training Objective: Next Token Prediction")
print()
print("Given the text: 'The cat sat on the mat'")
print()
print("Training pairs:")
training_pairs = [
    ("The", "cat"),
    ("The cat", "sat"),
    ("The cat sat", "on"),
    ("The cat sat on", "the"),
    ("The cat sat on the", "mat"),
]
for context, target in training_pairs:
    print(f" Context: '{context:<28}' → Target: '{target}'")
print()
print("One sentence gives 5 training examples.")
print("A 1,000-word document gives ~999 training examples.")
print("300 billion tokens give ~300 billion training examples.")
print("No labels needed. The text is its own supervision.")
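One practical detail the pair listing hides: the model is not trained on these pairs one at a time. A single forward pass produces a prediction at every position, and the targets are just the input sequence shifted one token to the left, so one cross-entropy call scores every pair at once. A minimal sketch (the token IDs and 32-token vocabulary are made up for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy token IDs standing in for "The cat sat on the mat".
tokens = torch.tensor([[5, 12, 7, 9, 5, 20]])  # shape (batch=1, seq=6)

inputs = tokens[:, :-1]   # the context at each position
targets = tokens[:, 1:]   # the next token at each position

# Stand-in for model output: random logits over a 32-token vocabulary.
logits = torch.randn(1, inputs.size(1), 32)

# One cross-entropy call scores all five context -> next-token pairs at once.
loss = F.cross_entropy(logits.reshape(-1, 32), targets.reshape(-1))
print(inputs.shape, targets.shape, round(loss.item(), 3))
```

This shift-by-one trick is why one 1,000-word document really does yield ~999 training signals per pass, not one.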
Building a GPT Decoder From Scratch
import math
class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        # Project to queries, keys, values in one matmul, then split per head
        Q, K, V = self.qkv(x).split(C, dim=2)
        Q = Q.reshape(B, T, self.n_heads, self.d_k).transpose(1, 2)
        K = K.reshape(B, T, self.n_heads, self.d_k).transpose(1, 2)
        V = V.reshape(B, T, self.n_heads, self.d_k).transpose(1, 2)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        # Causal mask: each position may only attend to itself and the past
        causal_mask = torch.tril(torch.ones(T, T, device=x.device))
        scores = scores.masked_fill(causal_mask == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        weights = self.dropout(weights)
        out = torch.matmul(weights, V)
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.proj(out)
class GPTBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = CausalSelfAttention(d_model, n_heads, dropout)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Pre-norm residual connections, as in GPT-2
        x = x + self.attn(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x
class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, d_ff,
                 n_layers, max_len=256, dropout=0.1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList([
            GPTBlock(d_model, n_heads, d_ff, dropout) for _ in range(n_layers)
        ])
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the input embedding and output projection share weights
        self.token_emb.weight = self.head.weight
        self.max_len = max_len

    def forward(self, idx, targets=None):
        B, T = idx.shape
        assert T <= self.max_len
        pos = torch.arange(T, device=idx.device)
        x = self.token_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        x = self.norm(x)
        logits = self.head(x)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   targets.reshape(-1))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens=50, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.max_len:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float("-inf")
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_token], dim=1)
        return idx
VOCAB_SIZE = 256
D_MODEL = 128
N_HEADS = 4
D_FF = 512
N_LAYERS = 4
MAX_LEN = 128
model = MiniGPT(VOCAB_SIZE, D_MODEL, N_HEADS, D_FF, N_LAYERS, MAX_LEN)
params = sum(p.numel() for p in model.parameters())
print(f"MiniGPT architecture:")
print(f" vocab_size: {VOCAB_SIZE}")
print(f" d_model: {D_MODEL}")
print(f" n_heads: {N_HEADS}")
print(f" n_layers: {N_LAYERS}")
print(f" max_len: {MAX_LEN}")
print(f" Parameters: {params:,}")
print()
x = torch.randint(0, VOCAB_SIZE, (2, 32))
logits, _ = model(x)
print(f" Input: {x.shape} (batch=2, seq=32)")
print(f" Output: {logits.shape} (batch=2, seq=32, vocab=256)")
print(f" Each position predicts the next token over the full vocabulary.")
Training on Real Text
text = """
Machine learning is a fascinating field of artificial intelligence that enables computers
to learn from data and improve their performance without being explicitly programmed.
Deep learning uses neural networks with many layers to learn hierarchical representations.
Natural language processing allows computers to understand and generate human language.
The transformer architecture revolutionized NLP by using attention mechanisms.
Large language models like GPT and BERT are pretrained on massive text corpora.
Transfer learning allows these pretrained models to be fine-tuned for specific tasks.
""" * 20
chars = sorted(set(text))
char2idx = {c: i for i, c in enumerate(chars)}
idx2char = {i: c for c, i in char2idx.items()}
VOCAB = len(chars)
data = torch.tensor([char2idx[c] for c in text], dtype=torch.long)
print(f"Training on character-level text:")
print(f" Vocabulary size: {VOCAB} characters")
print(f" Dataset length: {len(data):,} characters")
BLOCK_SIZE = 64
BATCH_SIZE = 32
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def get_batch(data, block_size, batch_size):
    # Inputs are random slices; targets are the same slices shifted by one
    idxs = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in idxs])
    y = torch.stack([data[i+1:i+block_size+1] for i in idxs])
    return x.to(device), y.to(device)
mini_model = MiniGPT(VOCAB, d_model=64, n_heads=4, d_ff=256,
                     n_layers=3, max_len=BLOCK_SIZE).to(device)
optimizer = torch.optim.AdamW(mini_model.parameters(), lr=3e-4)
print("\nTraining MiniGPT:")
for step in range(500):
    x_b, y_b = get_batch(data, BLOCK_SIZE, BATCH_SIZE)
    _, loss = mini_model(x_b, y_b)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(mini_model.parameters(), 1.0)
    optimizer.step()
    if (step + 1) % 100 == 0:
        print(f" Step {step+1}: loss = {loss.item():.4f}")
print("\nGenerating text after training:")
seed_text = "Machine learning"
seed_ids = torch.tensor([char2idx[c] for c in seed_text],
dtype=torch.long).unsqueeze(0).to(device)
mini_model.eval()
generated_ids = mini_model.generate(seed_ids, max_new_tokens=100,
temperature=0.8, top_k=10)
generated_text = "".join([idx2char[i.item()] for i in generated_ids[0]])
print(f"\n'{generated_text}'")
Temperature, Top-k, and Top-p Sampling
print("Text generation strategies:")
print()
print("GREEDY (temperature=0, no sampling):")
print(" Always pick the highest probability token.")
print(" Deterministic. Often repetitive.")
print(" 'The cat sat on the mat. The cat sat on the mat. The cat...'")
print()
print("TEMPERATURE SAMPLING:")
print(" Divide logits by temperature before softmax.")
print(" temperature < 1: sharper distribution, more conservative")
print(" temperature = 1: true model distribution")
print(" temperature > 1: flatter distribution, more random")
print()
print("TOP-K SAMPLING:")
print(" Keep only the top K tokens. Sample from those.")
print(" Prevents very low probability tokens from being generated.")
print(" GPT-2 default: top_k=50")
print()
print("TOP-P (NUCLEUS) SAMPLING:")
print(" Keep the smallest set of tokens whose cumulative probability ≥ p.")
print(" Adapts dynamically to the distribution shape.")
print(" Most modern systems use top-p=0.9 or 0.95")
print()
def show_temperature_effect(temperatures):
    print("Effect of temperature on probability distribution:")
    raw_logits = torch.tensor([3.0, 1.5, 0.8, 0.2, -0.5])
    tokens = ["cat", "dog", "hat", "mat", "car"]
    print(f"\n{'Token':<8}", end="")
    for t in temperatures:
        print(f" temp={t:.1f}", end="")
    print()
    print("-" * 50)
    for i, token in enumerate(tokens):
        print(f"{token:<8}", end="")
        for t in temperatures:
            prob = F.softmax(raw_logits / t, dim=0)[i].item()
            print(f" {prob:8.4f}", end="")
        print()

show_temperature_effect([0.5, 1.0, 1.5, 2.0])
Output:
Effect of temperature on probability distribution:
Token    temp=0.5 temp=1.0 temp=1.5 temp=2.0
--------------------------------------------------
cat        0.9375   0.7018   0.5405   0.4493
dog        0.0467   0.1566   0.1988   0.2122
hat        0.0115   0.0778   0.1247   0.1496
mat        0.0035   0.0427   0.0836   0.1108
car        0.0009   0.0212   0.0524   0.0781
Low temperature concentrates probability on the most likely token. High temperature spreads it across alternatives. Most applications use temperature between 0.7 and 1.0.
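Top-p (nucleus) sampling is described above but not implemented in MiniGPT's generate. Here is a minimal single-sequence sketch of the filtering step; real implementations, such as the one behind `top_p` in HuggingFace's generate, also handle batches and edge cases:

```python
import torch
import torch.nn.functional as F

def top_p_filter(logits, p=0.9):
    """Mask all tokens outside the smallest set with cumulative prob >= p."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = F.softmax(sorted_logits, dim=-1)
    cumulative = torch.cumsum(probs, dim=-1)
    # Drop a token if the cumulative mass *before* it already exceeds p;
    # the highest-probability token is always kept.
    drop = cumulative - probs > p
    sorted_logits[drop] = float("-inf")
    filtered = torch.full_like(logits, float("-inf"))
    return filtered.scatter(0, sorted_idx, sorted_logits)

# Same five-token example as the temperature demo above
logits = torch.tensor([3.0, 1.5, 0.8, 0.2, -0.5])
probs = F.softmax(top_p_filter(logits, p=0.9), dim=0)
print(probs)  # the two lowest-probability tokens are zeroed out
```

With p=0.9, only "cat", "dog", and "hat" survive (their cumulative mass first crosses 0.9), and the softmax renormalizes probability over those three.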
Using Real GPT-2
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2_model.eval()
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token
prompts = [
"The most important thing about machine learning is",
"In 2030, artificial intelligence will",
"The key difference between BERT and GPT is",
]
print("GPT-2 Text Generation:")
print()
for prompt in prompts:
    inputs = gpt2_tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = gpt2_model.generate(
            **inputs,
            max_new_tokens=40,
            temperature=0.8,
            do_sample=True,
            top_k=50,
            top_p=0.92,
            repetition_penalty=1.2,
            pad_token_id=gpt2_tokenizer.eos_token_id,
        )
    generated = gpt2_tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"Prompt: '{prompt}'")
    print(f"Output: '{generated}'")
    print()
GPT Model Sizes and What They Changed
gpt_history = {
    "GPT-1 (2018)": {
        "params": "117M",
        "tokens": "~1B",
        "context": 512,
        "significance": "Showed pretraining + fine-tuning works for NLP"
    },
    "GPT-2 (2019)": {
        "params": "1.5B",
        "tokens": "40B",
        "context": 1024,
        "significance": "Zero-shot capabilities. Initially held back from release."
    },
    "GPT-3 (2020)": {
        "params": "175B",
        "tokens": "300B",
        "context": 2048,
        "significance": "Few-shot learning. API-first. Commercial launch."
    },
    "GPT-3.5 / ChatGPT (2022)": {
        "params": "~175B",
        "tokens": "300B+",
        "context": 4096,
        "significance": "RLHF alignment. Conversational AI. 100M users in 2 months."
    },
    "GPT-4 (2023)": {
        "params": "Unknown (est. ~1T+)",
        "tokens": "Unknown",
        "context": "8K-32K",
        "significance": "Multimodal. Passed bar exam. Professional-level reasoning."
    },
}
print(f"{'Model':<25} {'Params':>8} {'Tokens':>10} {'Context':>10} Significance")
print("=" * 90)
for name, info in gpt_history.items():
    print(f"{name:<25} {info['params']:>8} {info['tokens']:>10} "
          f"{str(info['context']):>10} {info['significance'][:50]}")
What Made GPT-3 Different: Emergent Capabilities
print("GPT-3's emergent capabilities that surprised even its creators:")
print()
examples = {
    "Few-shot math": (
        "Q: 17 + 28 = 45\nQ: 93 + 67 = 160\nQ: 156 + 247 = ",
        "403 (learned arithmetic from examples, never trained on it explicitly)"
    ),
    "Code generation": (
        "# Python function to reverse a string\ndef ",
        "reverse_string(s):\n    return s[::-1] (generates working code)"
    ),
    "Translation": (
        "English: Hello, how are you?\nFrench: ",
        "Bonjour, comment allez-vous? (translates despite never being explicitly trained to translate)"
    ),
    "Chain of thought": (
        "If there are 5 apples and you eat 2, then buy 3 more...",
        "Model can reason through multi-step problems when prompted correctly"
    ),
}
for capability, (prompt, result) in examples.items():
    print(f" {capability}:")
    print(f" Prompt: '{prompt[:50]}...'")
    print(f" Result: {result}")
    print()
print("These capabilities were not explicitly trained.")
print("They emerged from scale: more data + more parameters + more compute.")
print("This is the 'scaling hypothesis': capability scales with compute.")
The RLHF Revolution: From GPT to ChatGPT
print("Why GPT-3 alone was not enough:")
print()
print(" GPT-3 predicts the next token from the training distribution.")
print(" Training internet text is helpful, harmful, and everything in between.")
print(" GPT-3 could generate racist content, misinformation, harmful instructions.")
print(" The model has no concept of 'what should I say' vs 'what could I say'.")
print()
print("RLHF: Reinforcement Learning from Human Feedback")
print()
print(" Step 1: Supervised Fine-Tuning (SFT)")
print(" Human trainers write high-quality responses to prompts.")
print(" Fine-tune GPT-3 on these examples.")
print(" This teaches the basic conversational format.")
print()
print(" Step 2: Reward Model Training")
print(" Human raters rank multiple model responses from best to worst.")
print(" Train a reward model to predict which response humans prefer.")
print()
print(" Step 3: PPO Fine-Tuning")
print(" Use the reward model as a signal.")
print(" Fine-tune the language model via reinforcement learning")
print(" to generate responses the reward model rates highly.")
print()
print(" Result: ChatGPT")
print(" Helpful, harmless, honest (usually).")
print(" 100 million users in 60 days.")
print(" Changed how the world thinks about AI.")
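The reward model in Step 2 is typically trained with a pairwise preference loss: for each prompt, it should score the human-preferred response higher than the rejected one. A sketch of that loss with made-up scalar scores (this Bradley-Terry-style formulation is the one described in OpenAI's InstructGPT paper):

```python
import torch
import torch.nn.functional as F

# Hypothetical scalar scores a reward model assigns to two responses
# for the same prompt: one a human rater preferred, one they rejected.
r_chosen = torch.tensor([1.8, 0.4, 2.1])
r_rejected = torch.tensor([0.5, -0.2, 1.9])

# Pairwise loss: -log sigmoid(r_chosen - r_rejected), averaged over pairs.
# It is low when the preferred response scores higher by a wide margin.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(round(loss.item(), 4))
```

Note how the third pair, where the margin is only 0.2, contributes most of the loss: the reward model is pushed hardest on comparisons it barely gets right.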
A Resource Worth Reading
Andrej Karpathy recorded a video called "Let's build GPT: from scratch, in code, spelled out" on YouTube, in which he builds a character-level GPT from nothing in roughly two hours of live coding: the attention mechanism, the training loop, the generation, all of it. It is one of the most-watched deep learning tutorials ever made. Watch it after this post. Search "Karpathy build GPT from scratch YouTube."
The original GPT-3 paper "Language Models are Few-Shot Learners" by Brown et al. (2020) from OpenAI describes the few-shot learning capabilities, the training setup, and the evaluation on 24 benchmarks. The paper that showed scaling works. Search "Brown language models few-shot learners GPT-3 2020."
Try This
Create gpt_practice.py.
Part 1: train the MiniGPT from this post on a text of your choice (a book chapter, lyrics, code, whatever interests you). Use character-level tokenization. Train for 1000 steps. Generate 200 characters from a seed prompt. Does it capture the style of your training text?
Part 2: experiment with generation parameters. Generate from the same seed with temperature 0.5, 1.0, and 1.5. Generate with top_k=5, top_k=50, top_k=None. Print all 6 outputs. Describe the differences in creativity and coherence.
Part 3: load GPT-2 from HuggingFace. Generate completions for 5 different prompts. For each, generate 3 different completions using different random seeds. How much variance is there between completions?
Part 4: compute perplexity. Load a test text not in your training data. Compute the cross-entropy loss of your trained model on this text. Convert to perplexity (exp(loss)). Lower perplexity = better language model.
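For Part 4, the perplexity computation can be sketched as follows, assuming a model with the same `(logits, loss)` interface as MiniGPT above. The `Uniform` dummy model here is only a sanity check: a model that guesses uniformly over V tokens should have perplexity exactly V.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def perplexity(model, ids, block_size=64):
    """exp of the average next-token cross-entropy of `model` over `ids`."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for start in range(0, len(ids) - block_size, block_size):
            x = ids[start:start + block_size].unsqueeze(0)
            y = ids[start + 1:start + block_size + 1].unsqueeze(0)
            _, loss = model(x, y)
            total += loss.item()
            count += 1
    return float(torch.exp(torch.tensor(total / count)))

# Sanity check: uniform logits over 32 tokens -> perplexity 32,
# no matter what the text is.
class Uniform(nn.Module):
    def __init__(self, vocab):
        super().__init__()
        self.vocab = vocab
    def forward(self, x, targets=None):
        logits = torch.zeros(*x.shape, self.vocab)
        loss = F.cross_entropy(logits.reshape(-1, self.vocab),
                               targets.reshape(-1))
        return logits, loss

ppl = perplexity(Uniform(32), torch.randint(0, 32, (1000,)), block_size=64)
print(ppl)  # 32.0 (up to float precision)
```

Your trained character model should land well below its vocabulary size on held-out text; the gap to the uniform baseline is how much language structure it has learned.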
What's Next
You have seen BERT (understand) and GPT (generate). The HuggingFace library wraps both with a unified API that makes loading, fine-tuning, and deploying 50,000+ pretrained models trivial. That is the next post: the library that put state-of-the-art NLP within reach of every practitioner.