BERT reads everything at once and understands. GPT reads left to right and predicts what comes next. Forever.
That difference sounds limiting. It's not.
When you train a decoder-only transformer on billions of tokens of text and code, predicting the next word forces the model to learn grammar, facts, reasoning patterns, writing styles, and more. Not because you told it to. Because that's what you need to predict text well.
GPT-1 was interesting. GPT-2 was surprising. GPT-3 was a shock. GPT-4 changed how people work. All of them do the same thing: predict the next token.
What You'll Learn Here
- How autoregressive generation works step by step
- What temperature does to output randomness
- Greedy, top-k, top-p (nucleus) sampling explained
- Building a character-level GPT from scratch
- Using HuggingFace GPT-2 for text generation
- What makes GPT different from BERT and when to use which
Autoregressive Generation: The Core Idea
GPT generates text one token at a time. Each new token is conditioned on all previous tokens.
Step 1: Input: "The cat"
Predict next token → "sat" (highest probability)
Step 2: Input: "The cat sat"
Predict next token → "on"
Step 3: Input: "The cat sat on"
Predict next token → "the"
Step 4: Input: "The cat sat on the"
Predict next token → "mat"
...continues until [EOS] token or max length
At each step the model produces a probability distribution over the entire vocabulary. You pick one token from that distribution. Feed it back in. Repeat.
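The loop described above can be sketched in a few lines. This is a toy illustration rather than a real model: the hypothetical `NEXT_TOKEN_PROBS` table stands in for the network and looks only at the last token, whereas GPT conditions on the whole prefix. The probabilities are made up for the example.

```python
# Toy stand-in for a language model (hypothetical, hand-written numbers).
# It maps the LAST token to a distribution over next tokens; a real GPT
# conditions on the entire prefix instead.
NEXT_TOKEN_PROBS = {
    "The": {"cat": 0.7, "mat": 0.3},
    "cat": {"sat": 0.8, "on": 0.2},
    "sat": {"on": 0.9, "the": 0.1},
    "on": {"the": 0.95, "mat": 0.05},
    "the": {"mat": 0.6, "cat": 0.4},
    "mat": {"[EOS]": 1.0},
}

def generate_greedy(prompt_tokens, max_new_tokens=10):
    """Autoregressive loop: predict, append, feed back in, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]      # distribution over next token
        next_token = max(dist, key=dist.get)     # greedy: take the most likely
        if next_token == "[EOS]":                # stop at end-of-sequence
            break
        tokens.append(next_token)                # feed the choice back in
    return tokens

print(" ".join(generate_greedy(["The", "cat"])))
```

Swapping the `max(...)` line for a random draw from `dist` turns greedy decoding into sampling; the surrounding loop stays the same.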
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
# Minimal decoder-only transformer (from Post 91)
class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads, max_len=256, dropout=0.1):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_qkv = nn.Linear(d_model, 3 * d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.drop = nn.Dropout(dropout)
        # Causal mask registered as buffer
        mask = torch.tril(torch.ones(max_len, max_len))
        self.register_buffer('mask', mask.view(1, 1, max_len, max_len))

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.W_qkv(x).chunk(3, dim=-1)
        Q, K, V = [t.view(B, T, self.n_heads, self.d_k).transpose(1, 2) for t in qkv]
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        attn = self.drop(F.softmax(scores, dim=-1))
        out = (attn @ V).transpose(1, 2).contiguous().view(B, T, C)
        return self.W_o(out)
class GPTBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, max_len=256, dropout=0.1):
        super().__init__()
        # Pass max_len through so the causal mask matches the model's context size
        self.attn = CausalSelfAttention(d_model, n_heads, max_len=max_len, dropout=dropout)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # pre-norm (modern GPT style)
        x = x + self.ff(self.ln2(x))
        return x

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4,
                 n_layers=4, d_ff=512, max_len=256, dropout=0.1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.drop = nn.Dropout(dropout)
        self.blocks = nn.ModuleList([
            GPTBlock(d_model, n_heads, d_ff, max_len, dropout) for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.max_len = max_len
        # Weight tying: token embedding and output head share weights
        self.head.weight = self.token_emb.weight
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.drop(self.token_emb(idx) + self.pos_emb(pos))
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.head(x)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
            # Crop context to max_len
            idx_cond = idx[:, -self.max_len:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]  # last position only
            # Apply temperature
            logits = logits / temperature
            # Apply top-k: mask everything below the k-th largest logit
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_token], dim=1)
        return idx
# Show model size
model = MiniGPT(vocab_size=65, d_model=128, n_heads=4, n_layers=4)
n_params = sum(p.numel() for p in model.parameters())
print(f"MiniGPT parameters: {n_params:,}")
Output:
MiniGPT parameters: 834,432
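The causal mask in CausalSelfAttention is what keeps each position from seeing the future. A minimal sketch of its effect, in plain Python so it runs without PyTorch: setting future scores to -inf before softmax is equivalent to taking the softmax over the visible positions only and giving the rest zero weight. The helper name `causal_attention_weights` is made up for this illustration.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix under a causal mask."""
    T = len(scores)
    weights = []
    for t in range(T):
        # Position t may only attend to positions 0..t.
        visible = scores[t][: t + 1]
        m = max(visible)                          # subtract max for stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions end up with exactly zero weight.
        row = [e / total for e in exps] + [0.0] * (T - t - 1)
        weights.append(row)
    return weights

# Equal scores everywhere, so each row spreads weight evenly over its past.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
```

The first row can only attend to itself, the second splits its weight over two positions, and only the last row sees the whole sequence.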
Training on Character-Level Shakespeare
Let's train MiniGPT on Shakespeare text. Character-level means each character is a token.
import requests
import torch
from torch.utils.data import Dataset, DataLoader
# Download Shakespeare
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text = requests.get(url).text
print(f"Total characters: {len(text):,}")
print(f"Sample:\n{text[:200]}")
# Build character vocabulary
chars = sorted(set(text))
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size} unique characters")
stoi = {c: i for i, c in enumerate(chars)} # char to index
itos = {i: c for i, c in enumerate(chars)} # index to char
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join(itos[i] for i in l)
# Encode full dataset
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Encoded length: {len(data):,} tokens")
# Train/val split
n_train = int(0.9 * len(data))
train_data = data[:n_train]
val_data = data[n_train:]
print(f"Train tokens: {len(train_data):,}")
print(f"Val tokens: {len(val_data):,}")
# Dataset
class CharDataset(Dataset):
    def __init__(self, data, block_size):
        self.data = data
        self.block_size = block_size

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        x = self.data[idx:idx + self.block_size]
        y = self.data[idx + 1:idx + self.block_size + 1]
        return x, y
block_size = 128
train_set = CharDataset(train_data, block_size)
val_set = CharDataset(val_data, block_size)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64, shuffle=False)
print(f"Training batches: {len(train_loader)}")
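The key detail in CharDataset is that the target `y` is the input `x` shifted one character to the right, so a single block yields one prediction task per position. A small sketch on a toy string (the string and the `training_pairs` helper are illustrative, not part of the training code):

```python
sample = "hello"
block = 4

x = sample[0:block]          # input window
y = sample[1:block + 1]      # same window shifted right by one character

def training_pairs(x, y):
    """At position t the model sees x[:t+1] and must predict y[t]."""
    return [(x[: t + 1], y[t]) for t in range(len(x))]

for context, target in training_pairs(x, y):
    print(f"{context!r} -> {target!r}")
```

Because of the causal mask, all of these prefix-to-next-character tasks are trained in one forward pass over the block.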
import torch.optim as optim
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = MiniGPT(
    vocab_size=vocab_size,
    d_model=128,
    n_heads=4,
    n_layers=4,
    d_ff=512,
    max_len=block_size
).to(device)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)

def evaluate(model, loader, max_batches=20):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for i, (x, y) in enumerate(loader):
            if i >= max_batches:
                break
            x, y = x.to(device), y.to(device)
            _, loss = model(x, y)
            total_loss += loss.item()
    return total_loss / min(max_batches, len(loader))

print(f"Training on: {device}")
print(f"{'Epoch':<8} {'Train Loss':<12} {'Val Loss':<12}")
print("-" * 35)
for epoch in range(1, 6):
    model.train()
    train_loss = 0
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        _, loss = model(x, y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        train_loss += loss.item()
    train_loss /= len(train_loader)
    val_loss = evaluate(model, val_loader)
    scheduler.step()
    print(f"{epoch:<8} {train_loss:<12.4f} {val_loss:.4f}")
Output:
Training on: cuda
Epoch Train Loss Val Loss
-----------------------------------
1 2.8341 2.6123
2 2.1045 2.0843
3 1.8921 1.9104
4 1.7632 1.8231
5 1.6891 1.7843
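Two reference points help when reading these losses. A model that guesses uniformly over V characters has cross-entropy ln(V), and any cross-entropy loss converts to perplexity via exp(loss). The sketch below assumes the 65-character vocabulary used in the model-size check and takes the final validation loss from the table above; the function names are made up for the example.

```python
import math

def uniform_baseline_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over the vocabulary."""
    return math.log(vocab_size)

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the cross-entropy loss."""
    return math.exp(cross_entropy_loss)

print(f"Uniform-guess loss for 65 chars: {uniform_baseline_loss(65):.4f}")
print(f"Perplexity at val loss 1.7843:   {perplexity(1.7843):.2f}")
```

A loss well below the uniform baseline means the model has learned real structure; a perplexity around 6 means it is, roughly speaking, as uncertain as if it were choosing among six equally likely characters at each step.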
Temperature: Controlling Randomness
Temperature is the generation parameter you will tune most often. The logits are divided by the temperature T before softmax: T < 1 sharpens the distribution, T > 1 flattens it.
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np
# Example logits for 5 tokens: A, B, C, D, E
logits = torch.tensor([3.0, 1.5, 0.8, 0.3, -0.5])
temperatures = [0.1, 0.5, 1.0, 1.5, 2.0]
vocab = ['A', 'B', 'C', 'D', 'E']
fig, axes = plt.subplots(1, 5, figsize=(15, 4))
for ax, temp in zip(axes, temperatures):
    probs = F.softmax(logits / temp, dim=0).numpy()
    bars = ax.bar(vocab, probs, color=['#4ECDC4' if i == 0 else '#95A5A6' for i in range(5)])
    ax.set_title(f'temp={temp}')
    ax.set_ylim(0, 1)
    ax.set_ylabel('Probability' if temp == 0.1 else '')
    for bar, prob in zip(bars, probs):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                f'{prob:.2f}', ha='center', va='bottom', fontsize=8)
plt.suptitle('Effect of Temperature on Token Probabilities', y=1.02)
plt.tight_layout()
plt.savefig('temperature_effect.png', dpi=100)
plt.show()

print(f"{'Temp':<8} {'P(A)':<10} {'P(B)':<10} {'P(C)':<10} {'P(D)':<10} {'P(E)'}")
print("-" * 55)
for temp in temperatures:
    probs = F.softmax(logits / temp, dim=0)
    print(f"{temp:<8} " + " ".join(f"{p.item():<10.4f}" for p in probs))
Output:
Temp P(A) P(B) P(C) P(D) P(E)
-------------------------------------------------------
0.1 1.0000 0.0000 0.0000 0.0000 0.0000
0.5 0.9368 0.0466 0.0115 0.0042 0.0009
1.0 0.6986 0.1559 0.0774 0.0470 0.0211
1.5 0.5374 0.1977 0.1240 0.0888 0.0521
2.0 0.4468 0.2110 0.1487 0.1158 0.0776
Temperature = 0.1: extremely peaked, almost always picks "A". Deterministic, repetitive.
Temperature = 1.0: original distribution. Balanced randomness.
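Temperature = 2.0: much flatter. "A" is still the most likely token, but low-probability tokens get picked far more often. Very random, often incoherent.
As a rule of thumb, 0.7 to 1.0 is a reasonable starting range for creative writing and 0.2 to 0.5 for code or factual tasks. Treat these as starting points to tune, not fixed settings.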
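One way to put a number on "peaked" versus "flat" is the entropy of the distribution. The sketch below reuses the five example logits from the temperature demo and computes entropy in bits at each temperature, in plain Python. The helper names are illustrative; the upper bound for five tokens is log2(5) bits, reached only by a perfectly uniform distribution.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [3.0, 1.5, 0.8, 0.3, -0.5]
max_entropy = math.log2(len(logits))
for t in [0.1, 0.5, 1.0, 1.5, 2.0]:
    h = entropy_bits(softmax_with_temperature(logits, t))
    print(f"temp={t:<4} entropy={h:.3f} bits (max {max_entropy:.3f})")
```

Entropy rises with temperature but stays below the uniform maximum, which is another way of seeing that raising T flattens the distribution without making it fully uniform.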
Sampling Strategies
Greedy: always pick the highest probability token. Fast. Repetitive. Boring.
Top-k: only consider the k highest probability tokens. Sample from those.
Top-p (Nucleus sampling): consider the smallest set of tokens whose cumulative probability exceeds p, and sample from those. The candidate set shrinks when the model is confident and grows when it is not.
def greedy_sample(logits):
    return torch.argmax(logits, dim=-1)

def top_k_sample(logits, k=50, temperature=1.0):
    logits = logits / temperature
    top_k_logits, top_k_indices = torch.topk(logits, k)
    probs = F.softmax(top_k_logits, dim=-1)
    chosen = torch.multinomial(probs, num_samples=1)
    return top_k_indices[chosen]

def top_p_sample(logits, p=0.9, temperature=1.0):
    logits = logits / temperature
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    # Remove tokens with cumulative prob above threshold
    sorted_indices_to_remove = cumulative_probs > p
    # Shift to keep at least one token
    sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].clone()
    sorted_indices_to_remove[0] = False
    sorted_logits[sorted_indices_to_remove] = float('-inf')
    probs = F.softmax(sorted_logits, dim=-1)
    chosen = torch.multinomial(probs, num_samples=1)
    return sorted_indices[chosen]
# Demonstrate on example logits
logits_example = torch.randn(100) # 100-token vocabulary
greedy_choice = greedy_sample(logits_example)
topk_choice = top_k_sample(logits_example, k=10)
topp_choice = top_p_sample(logits_example, p=0.9)
print(f"Greedy picked token: {greedy_choice.item()}")
print(f"Top-k (k=10) picked: {topk_choice.item()}")
print(f"Top-p (p=0.9) picked: {topp_choice.item()}")
# How many tokens qualify for top-p at p=0.9?
sorted_logits, _ = torch.sort(logits_example, descending=True)
cumprobs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
n_tokens_in_nucleus = (cumprobs <= 0.9).sum().item() + 1
print(f"\nTokens in nucleus (p=0.9): {n_tokens_in_nucleus} out of 100")
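To see why top-p adapts while top-k does not, it helps to count nucleus sizes on two hand-made distributions. This is a plain-Python sketch with made-up probabilities. The `nucleus_size` helper is hypothetical and counts a token as soon as the running total reaches p; how to treat a total that lands exactly on p is a convention choice and does not matter for these examples.

```python
def nucleus_size(probs, p):
    """Number of tokens kept by top-p: the shortest prefix of the
    descending-sorted distribution whose cumulative probability reaches p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)

confident = [0.90, 0.05, 0.03, 0.01, 0.01]   # model is sure of one token
uncertain = [0.20, 0.20, 0.20, 0.20, 0.20]   # model has no preference

print("confident:", nucleus_size(confident, p=0.75), "token(s) kept")
print("uncertain:", nucleus_size(uncertain, p=0.75), "token(s) kept")
```

With the same p, the confident distribution keeps a single candidate while the flat one keeps most of the vocabulary. A fixed top-k would keep k tokens in both cases.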
Generating Text With Our MiniGPT
def generate_text(model, prompt, max_new_tokens=200,
                  temperature=0.8, top_k=40, device='cpu'):
    model.eval()
    # Encode prompt
    context = torch.tensor(encode(prompt), dtype=torch.long).unsqueeze(0).to(device)
    # Generate
    with torch.no_grad():
        generated = model.generate(
            context,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k
        )
    # Decode
    generated_tokens = generated[0].tolist()
    return decode(generated_tokens)

# Try different temperatures
print("=" * 60)
print("LOW TEMPERATURE (0.3) - Conservative and repetitive:")
print("=" * 60)
print(generate_text(model, "HAMLET:", max_new_tokens=150,
                    temperature=0.3, top_k=10, device=device))
print("\n" + "=" * 60)
print("MEDIUM TEMPERATURE (0.8) - Balanced:")
print("=" * 60)
print(generate_text(model, "HAMLET:", max_new_tokens=150,
                    temperature=0.8, top_k=40, device=device))
print("\n" + "=" * 60)
print("HIGH TEMPERATURE (1.5) - Chaotic:")
print("=" * 60)
print(generate_text(model, "HAMLET:", max_new_tokens=150,
                    temperature=1.5, top_k=None, device=device))
Output (after 5 epochs on Shakespeare):
============================================================
LOW TEMPERATURE (0.3) - Conservative and repetitive:
============================================================
HAMLET:
I will not be the good the good the good the good
the good the good the good the good...
============================================================
MEDIUM TEMPERATURE (0.8) - Balanced:
============================================================
HAMLET:
I have been a man of the king and speak
The lord, and the great heart of the lord
That I am not the death of the lord...
============================================================
HIGH TEMPERATURE (1.5) - Chaotic:
============================================================
HAMLET:
Vxqo! zj kin, thae wath gof amd
jek lpe mhek ther whi...
Low temperature: repetitive but coherent. High temperature: gibberish. Medium: something that at least sounds vaguely Shakespearean after just 5 epochs.
Train longer and the quality improves dramatically.
Using GPT-2 With HuggingFace
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline
# Load GPT-2
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2')
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token
# Generate with different strategies
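# Reuse the model and tokenizer loaded above instead of loading GPT-2 a second time
generator = pipeline('text-generation', model=gpt2_model, tokenizer=gpt2_tokenizer)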
prompt = "The future of artificial intelligence is"
print("GREEDY (do_sample=False):")
result = generator(prompt, max_new_tokens=50, do_sample=False)
print(result[0]['generated_text'])
print("\nTOP-K SAMPLING (k=50, temp=0.9):")
result = generator(prompt, max_new_tokens=50,
do_sample=True, top_k=50, temperature=0.9)
print(result[0]['generated_text'])
print("\nNUCLEUS SAMPLING (top_p=0.9):")
result = generator(prompt, max_new_tokens=50,
do_sample=True, top_p=0.9, temperature=0.8)
print(result[0]['generated_text'])
Manual GPT-2 Generation With Full Control
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
prompt = "Once upon a time in a land far away"
input_ids = tokenizer.encode(prompt, return_tensors='pt')
print(f"Prompt tokens: {input_ids.shape[1]}")
print(f"Prompt: '{prompt}'\n")
# Generate step by step and show probabilities
current_ids = input_ids.clone()
for step in range(5):
    with torch.no_grad():
        outputs = model(current_ids)
    logits = outputs.logits[:, -1, :]  # last position
    # Get top 5 candidates
    probs = torch.softmax(logits, dim=-1)
    top5_probs, top5_ids = torch.topk(probs, 5)
    print(f"Step {step+1} - Top 5 candidates:")
    for prob, token_id in zip(top5_probs[0], top5_ids[0]):
        token_str = tokenizer.decode([token_id.item()])
        print(f"  '{token_str}' : {prob.item():.4f}")
    # Pick top token (greedy)
    next_token = top5_ids[0, 0].unsqueeze(0).unsqueeze(0)
    current_ids = torch.cat([current_ids, next_token], dim=1)
    print(f"  -> Picked: '{tokenizer.decode([next_token.item()])}'\n")
final_text = tokenizer.decode(current_ids[0])
print(f"Final: '{final_text}'")
Output:
Prompt tokens: 9
Prompt: 'Once upon a time in a land far away'
Step 1 - Top 5 candidates:
',' : 0.2341
'there' : 0.1823
'called' : 0.0912
'from' : 0.0634
'where' : 0.0521
-> Picked: ','
Step 2 - Top 5 candidates:
'there' : 0.3412
'a' : 0.1234
'the' : 0.0891
'an' : 0.0432
'people' : 0.0321
-> Picked: 'there'
...
Final: 'Once upon a time in a land far away, there was a'
What GPT Learns by Predicting the Next Word
This seems like a simple task. It's not. To predict the next word well, the model must learn:
- Grammar: what word types follow others
- Facts: "The capital of France is..." → "Paris"
- Reasoning: "If A > B and B > C, then A > ..." → "C"
- Style: given "HAMLET:", continue in Shakespearean style
- Code: given "def fibonacci(n):", complete correctly
- Math: "2 + 2 = " → "4"
None of these were explicitly taught. They emerged from predicting tokens. This is called emergent behavior and it's why scaling up GPT surprised everyone.
Quick Cheat Sheet
| Concept | What it means |
|---|---|
| Autoregressive | Generate one token at a time, feed back to input |
| Temperature | Higher = more random, lower = more deterministic |
| Greedy | Always pick highest prob token. Repetitive. |
| Top-k | Sample from top k tokens only |
| Top-p (nucleus) | Sample from smallest set with cumulative prob > p |
| Perplexity | exp(cross-entropy loss); lower = better |
| Weight tying | Embedding and output head share weights |
| Pre-norm | LayerNorm before attention and feed-forward (GPT-2 onward); more stable to train |
| Task | Code |
|---|---|
| Load GPT-2 | GPT2LMHeadModel.from_pretrained('gpt2') |
| Quick generation | pipeline('text-generation', model='gpt2') |
| Control randomness | temperature=0.8, top_k=50, top_p=0.9 |
| Stop at end-of-text token | eos_token_id=tokenizer.eos_token_id |
| Greedy | do_sample=False |
| Sampling | do_sample=True |
Practice Challenges
Level 1:
Use the pipeline('text-generation') with GPT-2. Generate the same prompt 5 times with temperature=0.9. Compare the outputs. Now do it with temperature=0.1. How different are the results?
Level 2:
Train MiniGPT on a different text dataset: a collection of Python code, song lyrics, or any repetitive text. After training, generate samples and evaluate quality by eye. How many epochs until the samples look like the training data?
Level 3:
Implement beam search on top of MiniGPT. Beam search keeps the top-B most likely sequences at each step instead of just one. Compare beam search (B=5) output quality vs greedy and top-k sampling on the trained Shakespeare model. Which one produces the most coherent text?
Next up, Post 94: HuggingFace: Your Library for Every Pretrained Model. Pipelines, tokenizers, the model hub, and how to load any state-of-the-art model in three lines of code.