Akhilesh

93. GPT: The Model That Predicts the Next Word Forever

BERT reads everything at once and understands. GPT reads left to right and predicts what comes next. Forever.

That difference sounds limiting. It's not.

When you train a decoder-only transformer on billions of tokens of text and code, predicting the next word forces the model to learn grammar, facts, reasoning patterns, writing styles, and more. Not because you told it to. Because that's what you need to predict text well.

GPT-1 was interesting. GPT-2 was surprising. GPT-3 was a shock. GPT-4 changed how people work. All of them do the same thing: predict the next token.


What You'll Learn Here

  • How autoregressive generation works step by step
  • What temperature does to output randomness
  • Greedy, top-k, top-p (nucleus) sampling explained
  • Building a character-level GPT from scratch
  • Using HuggingFace GPT-2 for text generation
  • What makes GPT different from BERT and when to use which

Autoregressive Generation: The Core Idea

GPT generates text one token at a time. Each new token is conditioned on all previous tokens.
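Written as a formula, this is the chain rule of probability applied to a token sequence. The notation is introduced here for illustration and is not from the original post: x_t is the token at position t and T is the sequence length.

```latex
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})
```

Training pushes each factor toward the real next token; generation samples from one factor at a time.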

Step 1: Input: "The cat"
        Predict next token → "sat" (highest probability)

Step 2: Input: "The cat sat"
        Predict next token → "on" 

Step 3: Input: "The cat sat on"
        Predict next token → "the"

Step 4: Input: "The cat sat on the"
        Predict next token → "mat"

...continues until [EOS] token or max length

At each step the model produces a probability distribution over the entire vocabulary. You pick one token from that distribution. Feed it back in. Repeat.
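Before the full model, here is a minimal sketch of that loop in plain Python. The lookup table `TOY_MODEL` and the function `toy_generate` are hypothetical stand-ins, not part of the original post: the table only looks at the last token, whereas a real GPT conditions on the whole context.

```python
import random

# Hypothetical stand-in for a trained model: maps the last token to a
# probability distribution over the next token.
TOY_MODEL = {
    "The": {"cat": 0.9, "mat": 0.1},
    "cat": {"sat": 0.8, "[EOS]": 0.2},
    "sat": {"on": 1.0},
    "on":  {"the": 1.0},
    "the": {"mat": 0.7, "cat": 0.3},
    "mat": {"[EOS]": 1.0},
}

def toy_generate(prompt, max_new_tokens=10, greedy=True, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = TOY_MODEL[tokens[-1]]            # 1. get a distribution
        if greedy:
            nxt = max(dist, key=dist.get)       # 2a. take the most likely token
        else:
            nxt = rng.choices(list(dist), weights=list(dist.values()))[0]  # 2b. sample
        if nxt == "[EOS]":                      # the stop token ends generation
            break
        tokens.append(nxt)                      # 3. feed it back in
    return tokens

print(" ".join(toy_generate(["The", "cat"])))   # greedy decoding
```

The only thing a real GPT changes is step 1: the lookup becomes a forward pass through a transformer, as in the decoder-only model that follows.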

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# Minimal decoder-only transformer (from Post 91)
class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads, max_len=256, dropout=0.1):
        super().__init__()
        self.n_heads = n_heads
        self.d_k     = d_model // n_heads

        self.W_qkv = nn.Linear(d_model, 3 * d_model)
        self.W_o   = nn.Linear(d_model, d_model)
        self.drop  = nn.Dropout(dropout)

        # Causal mask registered as buffer
        mask = torch.tril(torch.ones(max_len, max_len))
        self.register_buffer('mask', mask.view(1, 1, max_len, max_len))

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.W_qkv(x).chunk(3, dim=-1)
        Q, K, V = [t.view(B, T, self.n_heads, self.d_k).transpose(1, 2) for t in qkv]

        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        attn   = self.drop(F.softmax(scores, dim=-1))

        out = (attn @ V).transpose(1, 2).contiguous().view(B, T, C)
        return self.W_o(out)


class GPTBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = CausalSelfAttention(d_model, n_heads, dropout=dropout)
        self.ff   = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # pre-norm (modern GPT style)
        x = x + self.ff(self.ln2(x))
        return x


class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4,
                 n_layers=4, d_ff=512, max_len=256, dropout=0.1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb   = nn.Embedding(max_len, d_model)
        self.drop      = nn.Dropout(dropout)
        self.blocks    = nn.ModuleList([
            GPTBlock(d_model, n_heads, d_ff, dropout) for _ in range(n_layers)
        ])
        self.ln_f  = nn.LayerNorm(d_model)
        self.head  = nn.Linear(d_model, vocab_size, bias=False)
        self.max_len = max_len

        # Weight tying: token embedding and output head share weights
        self.head.weight = self.token_emb.weight

        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos  = torch.arange(T, device=idx.device)

        x = self.drop(self.token_emb(idx) + self.pos_emb(pos))
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.head(x)   # (B, T, vocab_size)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
            # Crop context to max_len
            idx_cond = idx[:, -self.max_len:]

            logits, _ = self(idx_cond)
            logits     = logits[:, -1, :]  # last position only

            # Apply temperature
            logits = logits / temperature

            # Apply top-k
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')

            probs     = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx        = torch.cat([idx, next_token], dim=1)

        return idx

# Show model size
model = MiniGPT(vocab_size=65, d_model=128, n_heads=4, n_layers=4)
n_params = sum(p.numel() for p in model.parameters())
print(f"MiniGPT parameters: {n_params:,}")

Output:

MiniGPT parameters: 834,432

Training on Character-Level Shakespeare

Let's train MiniGPT on Shakespeare text. Character-level means each character is a token.
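To make "each character is a token" concrete, here is a small sketch on a toy string, assuming nothing beyond the standard library. The names `sample`, `s2i`, and `i2s` are illustrative only. It also shows the one-step shift between inputs and targets that next-token training relies on.

```python
sample = "to be or not to be"

# Character-level vocabulary: every distinct character gets an integer id.
sample_chars = sorted(set(sample))
s2i = {c: i for i, c in enumerate(sample_chars)}
i2s = {i: c for c, i in s2i.items()}

ids = [s2i[c] for c in sample]

# One training example: the target is the input shifted left by one,
# so position t is trained to predict the character at position t + 1.
block = 8
x = ids[0:block]
y = ids[1:block + 1]

for t in range(block):
    context = "".join(i2s[i] for i in x[: t + 1])
    print(f"{context!r:>10} -> {i2s[y[t]]!r}")
```

A single window of length `block` therefore yields `block` prediction problems at once, one per position.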

import requests
import torch
from torch.utils.data import Dataset, DataLoader

# Download Shakespeare
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
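response = requests.get(url, timeout=30)
response.raise_for_status()  # fail loudly instead of training on an error page
text = response.text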
print(f"Total characters: {len(text):,}")
print(f"Sample:\n{text[:200]}")
# Build character vocabulary
chars = sorted(set(text))
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size} unique characters")

stoi = {c: i for i, c in enumerate(chars)}  # char to index
itos = {i: c for i, c in enumerate(chars)}  # index to char

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join(itos[i] for i in l)

# Encode full dataset
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Encoded length: {len(data):,} tokens")

# Train/val split
n_train = int(0.9 * len(data))
train_data = data[:n_train]
val_data   = data[n_train:]
print(f"Train tokens: {len(train_data):,}")
print(f"Val tokens:   {len(val_data):,}")
# Dataset
class CharDataset(Dataset):
    def __init__(self, data, block_size):
        self.data       = data
        self.block_size = block_size

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        x = self.data[idx:idx + self.block_size]
        y = self.data[idx + 1:idx + self.block_size + 1]
        return x, y

block_size  = 128
train_set   = CharDataset(train_data, block_size)
val_set     = CharDataset(val_data, block_size)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader   = DataLoader(val_set,   batch_size=64, shuffle=False)

print(f"Training batches: {len(train_loader)}")
import torch.optim as optim

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model  = MiniGPT(
    vocab_size=vocab_size,
    d_model=128,
    n_heads=4,
    n_layers=4,
    d_ff=512,
    max_len=block_size
).to(device)

optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)

def evaluate(model, loader, max_batches=20):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for i, (x, y) in enumerate(loader):
            if i >= max_batches:
                break
            x, y = x.to(device), y.to(device)
            _, loss = model(x, y)
            total_loss += loss.item()
    return total_loss / min(max_batches, len(loader))

print(f"Training on: {device}")
print(f"{'Epoch':<8} {'Train Loss':<12} {'Val Loss':<12}")
print("-" * 35)

for epoch in range(1, 6):
    model.train()
    train_loss = 0

    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        _, loss = model(x, y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        train_loss += loss.item()

    train_loss /= len(train_loader)
    val_loss    = evaluate(model, val_loader)
    scheduler.step()

    print(f"{epoch:<8} {train_loss:<12.4f} {val_loss:.4f}")

Output:

Training on: cuda
Epoch    Train Loss   Val Loss
-----------------------------------
1        2.8341       2.6123
2        2.1045       2.0843
3        1.8921       1.9104
4        1.7632       1.8231
5        1.6891       1.7843

Temperature: Controlling Randomness

Temperature is one of the most important generation parameters. It divides the logits before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it.
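As a formula, with z_i for the logit of token i and T for the temperature (symbols chosen here for illustration, not taken from the original post):

```latex
p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}
```

At T = 1 this is the ordinary softmax. As T approaches 0 the probability mass concentrates on the largest logit, and as T grows the distribution approaches uniform.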

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np

# Example logits for 5 tokens: A, B, C, D, E
logits = torch.tensor([3.0, 1.5, 0.8, 0.3, -0.5])

temperatures = [0.1, 0.5, 1.0, 1.5, 2.0]
vocab        = ['A', 'B', 'C', 'D', 'E']

fig, axes = plt.subplots(1, 5, figsize=(15, 4))

for ax, temp in zip(axes, temperatures):
    probs = F.softmax(logits / temp, dim=0).numpy()
    bars  = ax.bar(vocab, probs, color=['#4ECDC4' if i == 0 else '#95A5A6' for i in range(5)])
    ax.set_title(f'temp={temp}')
    ax.set_ylim(0, 1)
    ax.set_ylabel('Probability' if temp == 0.1 else '')
    for bar, prob in zip(bars, probs):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                f'{prob:.2f}', ha='center', va='bottom', fontsize=8)

plt.suptitle('Effect of Temperature on Token Probabilities', y=1.02)
plt.tight_layout()
plt.savefig('temperature_effect.png', dpi=100)
plt.show()

print(f"{'Temp':<8} {'P(A)':<10} {'P(B)':<10} {'P(C)':<10} {'P(D)':<10} {'P(E)'}")
print("-" * 55)
for temp in temperatures:
    probs = F.softmax(logits / temp, dim=0)
    print(f"{temp:<8} " + " ".join(f"{p.item():<10.4f}" for p in probs))

Output:

Temp     P(A)       P(B)       P(C)       P(D)       P(E)
-------------------------------------------------------
0.1      1.0000     0.0000     0.0000     0.0000     0.0000
0.5      0.9368     0.0466     0.0115     0.0042     0.0009
1.0      0.6986     0.1559     0.0774     0.0470     0.0211
1.5      0.5374     0.1977     0.1240     0.0888     0.0521
2.0      0.4468     0.2110     0.1487     0.1158     0.0776

Temperature = 0.1: extremely peaked, almost always picks "A". Deterministic, repetitive.

Temperature = 1.0: original distribution. Balanced randomness.

Temperature = 2.0: much flatter. "A" drops below 50% and even the least likely token gets picked now and then. Very random, often incoherent.

As a rule of thumb, 0.7 to 1.0 is a good range for creative writing, and 0.2 to 0.5 for code or factual tasks.
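The two extremes can be checked numerically. This is a minimal standard-library sketch; `softmax_with_temperature` is a helper written for this illustration, and the temperatures 0.01 and 100 are arbitrary values chosen to exaggerate the effect.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                        # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [3.0, 1.5, 0.8, 0.3, -0.5]   # same example logits as above

for t in (0.01, 1.0, 100.0):
    probs = softmax_with_temperature(example_logits, t)
    print(f"T={t:<6} max={max(probs):.3f} min={min(probs):.3f}")
```

At the very low temperature the largest probability is effectively 1 (greedy decoding in disguise); at the very high one all five probabilities sit close to 1/5.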


Sampling Strategies

Greedy: always pick the highest probability token. Fast. Repetitive. Boring.

Top-k: only consider the k highest probability tokens. Sample from those.

Top-p (nucleus sampling): consider the smallest set of tokens whose cumulative probability exceeds p. The candidate set shrinks when the model is confident and grows when it is unsure.
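The difference between a fixed-size and an adaptive candidate set is easiest to see on two hand-made distributions. The sketch below uses plain Python lists; `top_k_set`, `nucleus_set`, and both example distributions are made up for illustration.

```python
def top_k_set(probs, k):
    """Indices of the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def nucleus_set(probs, p):
    """Smallest prefix of the sorted tokens whose cumulative probability exceeds p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative > p:
            break
    return kept

confident = [0.90, 0.05, 0.03, 0.01, 0.01]   # model is sure about token 0
unsure    = [0.25, 0.25, 0.20, 0.15, 0.15]   # model is hedging

for name, probs in [("confident", confident), ("unsure", unsure)]:
    print(f"{name:<10} top-k(k=3): {top_k_set(probs, 3)}  "
          f"top-p(p=0.8): {nucleus_set(probs, 0.8)}")
```

Top-k keeps three candidates in both cases, while the nucleus keeps one token for the confident distribution and four for the unsure one. The tensor-based versions of both strategies follow.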

def greedy_sample(logits):
    return torch.argmax(logits, dim=-1)

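def top_k_sample(logits, k=50, temperature=1.0):
    # Expects 1-D logits for a single position
    logits = logits / temperature
    k = min(k, logits.size(-1))   # guard against k larger than the vocabulary
    top_k_logits, top_k_indices = torch.topk(logits, k)
    probs  = F.softmax(top_k_logits, dim=-1)   # renormalize over the k survivors
    chosen = torch.multinomial(probs, num_samples=1)
    return top_k_indices[chosen]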

def top_p_sample(logits, p=0.9, temperature=1.0):
    logits = logits / temperature
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

    # Remove tokens with cumulative prob above threshold
    sorted_indices_to_remove = cumulative_probs > p
    # Shift to keep at least one token
    sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].clone()
    sorted_indices_to_remove[0]  = False

    sorted_logits[sorted_indices_to_remove] = float('-inf')
    probs  = F.softmax(sorted_logits, dim=-1)
    chosen = torch.multinomial(probs, num_samples=1)
    return sorted_indices[chosen]

# Demonstrate on example logits
logits_example = torch.randn(100)   # 100-token vocabulary

greedy_choice = greedy_sample(logits_example)
topk_choice   = top_k_sample(logits_example, k=10)
topp_choice   = top_p_sample(logits_example, p=0.9)

print(f"Greedy picked token:  {greedy_choice.item()}")
print(f"Top-k (k=10) picked:  {topk_choice.item()}")
print(f"Top-p (p=0.9) picked: {topp_choice.item()}")

# How many tokens qualify for top-p at p=0.9?
sorted_logits, _ = torch.sort(logits_example, descending=True)
cumprobs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
n_tokens_in_nucleus = (cumprobs <= 0.9).sum().item() + 1
print(f"\nTokens in nucleus (p=0.9): {n_tokens_in_nucleus} out of 100")

Generating Text With Our MiniGPT

def generate_text(model, prompt, max_new_tokens=200,
                  temperature=0.8, top_k=40, device='cpu'):
    model.eval()

    # Encode prompt
    context = torch.tensor(encode(prompt), dtype=torch.long).unsqueeze(0).to(device)

    # Generate
    with torch.no_grad():
        generated = model.generate(
            context,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k
        )

    # Decode
    generated_tokens = generated[0].tolist()
    return decode(generated_tokens)

# Try different temperatures
print("=" * 60)
print("LOW TEMPERATURE (0.3) - Conservative and repetitive:")
print("=" * 60)
print(generate_text(model, "HAMLET:", max_new_tokens=150,
                    temperature=0.3, top_k=10, device=device))

print("\n" + "=" * 60)
print("MEDIUM TEMPERATURE (0.8) - Balanced:")
print("=" * 60)
print(generate_text(model, "HAMLET:", max_new_tokens=150,
                    temperature=0.8, top_k=40, device=device))

print("\n" + "=" * 60)
print("HIGH TEMPERATURE (1.5) - Chaotic:")
print("=" * 60)
print(generate_text(model, "HAMLET:", max_new_tokens=150,
                    temperature=1.5, top_k=None, device=device))

Output (after 5 epochs on Shakespeare):

============================================================
LOW TEMPERATURE (0.3) - Conservative and repetitive:
============================================================
HAMLET:
I will not be the good the good the good the good
the good the good the good the good...

============================================================
MEDIUM TEMPERATURE (0.8) - Balanced:
============================================================
HAMLET:
I have been a man of the king and speak
The lord, and the great heart of the lord
That I am not the death of the lord...

============================================================
HIGH TEMPERATURE (1.5) - Chaotic:
============================================================
HAMLET:
Vxqo! zj kin, thae wath gof amd
jek lpe mhek ther whi...

Low temperature: repetitive but coherent. High temperature: gibberish. Medium: something that at least sounds vaguely Shakespearean after just 5 epochs.

Train longer and the samples typically get noticeably better.


Using GPT-2 With HuggingFace

from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

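# Load GPT-2 once and hand the same objects to the pipeline
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
gpt2_model     = GPT2LMHeadModel.from_pretrained('gpt2')

# GPT-2 ships without a pad token; reuse EOS for padding
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

# Generate with different strategies
generator = pipeline('text-generation', model=gpt2_model, tokenizer=gpt2_tokenizer)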

prompt = "The future of artificial intelligence is"

print("GREEDY (do_sample=False):")
result = generator(prompt, max_new_tokens=50, do_sample=False)
print(result[0]['generated_text'])

print("\nTOP-K SAMPLING (k=50, temp=0.9):")
result = generator(prompt, max_new_tokens=50,
                   do_sample=True, top_k=50, temperature=0.9)
print(result[0]['generated_text'])

print("\nNUCLEUS SAMPLING (top_p=0.9):")
result = generator(prompt, max_new_tokens=50,
                   do_sample=True, top_p=0.9, temperature=0.8)
print(result[0]['generated_text'])

Manual GPT-2 Generation With Full Control

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model     = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

prompt = "Once upon a time in a land far away"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

print(f"Prompt tokens: {input_ids.shape[1]}")
print(f"Prompt: '{prompt}'\n")

# Generate step by step and show probabilities
current_ids = input_ids.clone()

for step in range(5):
    with torch.no_grad():
        outputs = model(current_ids)
        logits  = outputs.logits[:, -1, :]  # last position

    # Get top 5 candidates
    probs      = torch.softmax(logits, dim=-1)
    top5_probs, top5_ids = torch.topk(probs, 5)

    print(f"Step {step+1} - Top 5 candidates:")
    for prob, token_id in zip(top5_probs[0], top5_ids[0]):
        token_str = tokenizer.decode([token_id.item()])
        print(f"  '{token_str}' : {prob.item():.4f}")

    # Pick top token (greedy)
    next_token = top5_ids[0, 0].unsqueeze(0).unsqueeze(0)
    current_ids = torch.cat([current_ids, next_token], dim=1)

    print(f"  -> Picked: '{tokenizer.decode([next_token.item()])}'\n")

final_text = tokenizer.decode(current_ids[0])
print(f"Final: '{final_text}'")

Output:

Prompt tokens: 9
Prompt: 'Once upon a time in a land far away'

Step 1 - Top 5 candidates:
  ',' : 0.2341
  'there' : 0.1823
  'called' : 0.0912
  'from' : 0.0634
  'where' : 0.0521
  -> Picked: ','

Step 2 - Top 5 candidates:
  'there' : 0.3412
  'a' : 0.1234
  'the' : 0.0891
  'an' : 0.0432
  'people' : 0.0321
  -> Picked: 'there'
...

Final: 'Once upon a time in a land far away, there was a'

What GPT Learns by Predicting the Next Word

This seems like a simple task. It's not. To predict the next word well, the model must learn:

  • Grammar: what word types follow others
  • Facts: "The capital of France is..." → "Paris"
  • Reasoning: "If A > B and B > C, then A > ..." → "C"
  • Style: given "HAMLET:", continue in Shakespearean style
  • Code: given def fibonacci(n):, complete correctly
  • Math: "2 + 2 = " → "4"

None of these was explicitly taught. They emerged from predicting tokens. This is often called emergent behavior, and it is a large part of why scaling up GPT surprised so many people.


Quick Cheat Sheet

Concept           What it means
Autoregressive    Generate one token at a time and feed it back into the input
Temperature       Higher = more random, lower = more deterministic
Greedy            Always pick the highest-probability token. Repetitive.
Top-k             Sample from the k most probable tokens only
Top-p (nucleus)   Sample from the smallest set with cumulative probability > p
Perplexity        exp(cross-entropy loss) of a language model; lower = better
Weight tying      Token embedding and output head share one weight matrix
Pre-norm          LayerNorm before attention and feed-forward (modern GPT style); more stable

Task                        Code
Load GPT-2                  GPT2LMHeadModel.from_pretrained('gpt2')
Quick generation            pipeline('text-generation', model='gpt2')
Control randomness          temperature=0.8, top_k=50, top_p=0.9
Stop at end-of-text token   eos_token_id=tokenizer.eos_token_id
Greedy                      do_sample=False
Sampling                    do_sample=True
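Perplexity, listed in the cheat sheet, is the exponential of the average cross-entropy loss, so the losses printed during training convert directly. The sketch below reuses the validation losses from the training log in this post and assumes the 65-character vocabulary from the model-size demo; the uniform baseline is added here for comparison.

```python
import math

vocab_size = 65                        # assumption: character vocabulary size used earlier
uniform_loss = math.log(vocab_size)    # loss of a model that guesses uniformly

# Validation losses copied from the training log above (epochs 1 and 5)
for epoch, loss in [(1, 2.6123), (5, 1.7843)]:
    print(f"epoch {epoch}: loss={loss:.4f}  perplexity={math.exp(loss):.2f}")

print(f"uniform baseline: loss={uniform_loss:.4f}  "
      f"perplexity={math.exp(uniform_loss):.2f}")
```

Read this way, a validation loss of 1.7843 means the model is about as uncertain as if it were choosing among roughly 6 equally likely characters, down from 65 for blind guessing.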

Practice Challenges

Level 1:
Use the pipeline('text-generation') with GPT-2. Generate the same prompt 5 times with temperature=0.9. Compare the outputs. Now do it with temperature=0.1. How different are the results?

Level 2:
Train MiniGPT on a different text dataset: a collection of Python code, song lyrics, or any repetitive text. After training, generate samples and evaluate quality by eye. How many epochs until the samples look like the training data?

Level 3:
Implement beam search on top of MiniGPT. Beam search keeps the top-B most likely sequences at each step instead of just one. Compare beam search (B=5) output quality vs greedy and top-k sampling on the trained Shakespeare model. Which one produces the most coherent text?



Next up, Post 94: HuggingFace: Your Library for Every Pretrained Model. Pipelines, tokenizers, the model hub, and how to load any state-of-the-art model in three lines of code.
