Akhilesh

Posted on May 21

93. GPT: The Model That Predicts the Next Word Forever

#beginners #ai #productivity #python

BERT reads everything at once and understands. GPT reads left to right and predicts what comes next. Forever.

That difference sounds limiting. It's not.

When you train a decoder-only transformer on billions of tokens of text and code, predicting the next word forces the model to learn grammar, facts, reasoning patterns, writing styles, and more. Not because you told it to. Because that's what you need to predict text well.

GPT-1 was interesting. GPT-2 was surprising. GPT-3 was a shock. GPT-4 changed how people work. All of them do the same thing: predict the next token.

What You'll Learn Here

How autoregressive generation works step by step
What temperature does to output randomness
Greedy, top-k, top-p (nucleus) sampling explained
Building a character-level GPT from scratch
Using HuggingFace GPT-2 for text generation
What makes GPT different from BERT and when to use which

Autoregressive Generation: The Core Idea

GPT generates text one token at a time. Each new token is conditioned on all previous tokens.

Step 1: Input: "The cat"
        Predict next token → "sat" (highest probability)

Step 2: Input: "The cat sat"
        Predict next token → "on" 

Step 3: Input: "The cat sat on"
        Predict next token → "the"

Step 4: Input: "The cat sat on the"
        Predict next token → "mat"

...continues until [EOS] token or max length

At each step the model produces a probability distribution over the entire vocabulary. You pick one token from that distribution. Feed it back in. Repeat.
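
The loop itself is simple enough to sketch without a neural network. The block below is a minimal illustration, not the real model: `TOY_MODEL` is a made-up lookup table standing in for the forward pass, and `next_token_distribution` and `generate_greedy` are hypothetical helper names. A real GPT conditions on every previous token; this toy only looks at the last one.

```python
# Hypothetical toy "model": a hand-written table of next-token probabilities.
TOY_MODEL = {
    "The": {"cat": 0.9, "dog": 0.1},
    "cat": {"sat": 0.8, "ran": 0.2},
    "sat": {"on": 0.95, "down": 0.05},
    "on": {"the": 1.0},
    "the": {"mat": 0.7, "sofa": 0.3},
    "mat": {"<eos>": 1.0},
}


def next_token_distribution(tokens):
    """Stand-in for a forward pass: returns a distribution for the next token."""
    # Assumption for this toy: only the last token matters.
    return TOY_MODEL.get(tokens[-1], {"<eos>": 1.0})


def generate_greedy(prompt_tokens, max_new_tokens=10):
    """Autoregressive loop: predict, append, feed the longer sequence back in."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        token = max(dist, key=dist.get)   # greedy: pick the most likely token
        if token == "<eos>":              # stop at the end-of-sequence marker
            break
        tokens.append(token)
    return tokens


print(" ".join(generate_greedy(["The", "cat"])))  # The cat sat on the mat
```

Swapping the `max(...)` line for a random draw from `dist` would turn the same loop into sampling, which is where temperature, top-k, and top-p come in.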

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# Minimal decoder-only transformer (from Post 91)
class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads, max_len=256, dropout=0.1):
        super().__init__()
        self.n_heads = n_heads
        self.d_k     = d_model // n_heads

        self.W_qkv = nn.Linear(d_model, 3 * d_model)
        self.W_o   = nn.Linear(d_model, d_model)
        self.drop  = nn.Dropout(dropout)

        # Causal mask registered as buffer
        mask = torch.tril(torch.ones(max_len, max_len))
        self.register_buffer('mask', mask.view(1, 1, max_len, max_len))

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.W_qkv(x).chunk(3, dim=-1)
        Q, K, V = [t.view(B, T, self.n_heads, self.d_k).transpose(1, 2) for t in qkv]

        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        attn   = self.drop(F.softmax(scores, dim=-1))

        out = (attn @ V).transpose(1, 2).contiguous().view(B, T, C)
        return self.W_o(out)


class GPTBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = CausalSelfAttention(d_model, n_heads, dropout=dropout)
        self.ff   = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # pre-norm (modern GPT style)
        x = x + self.ff(self.ln2(x))
        return x


class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4,
                 n_layers=4, d_ff=512, max_len=256, dropout=0.1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb   = nn.Embedding(max_len, d_model)
        self.drop      = nn.Dropout(dropout)
        self.blocks    = nn.ModuleList([
            GPTBlock(d_model, n_heads, d_ff, dropout) for _ in range(n_layers)
        ])
        self.ln_f  = nn.LayerNorm(d_model)
        self.head  = nn.Linear(d_model, vocab_size, bias=False)
        self.max_len = max_len

        # Weight tying: token embedding and output head share weights
        self.head.weight = self.token_emb.weight

        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos  = torch.arange(T, device=idx.device)

        x = self.drop(self.token_emb(idx) + self.pos_emb(pos))
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.head(x)   # (B, T, vocab_size)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
            # Crop context to max_len
            idx_cond = idx[:, -self.max_len:]

            logits, _ = self(idx_cond)
            logits     = logits[:, -1, :]  # last position only

            # Apply temperature
            logits = logits / temperature

            # Apply top-k
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')

            probs     = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx        = torch.cat([idx, next_token], dim=1)

        return idx

# Show model size
model = MiniGPT(vocab_size=65, d_model=128, n_heads=4, n_layers=4)
n_params = sum(p.numel() for p in model.parameters())
print(f"MiniGPT parameters: {n_params:,}")

Output:

MiniGPT parameters: 834,432
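
The causal mask in `CausalSelfAttention` is what keeps the model left to right: position i may only attend to positions 0..i. Here is a small pure-Python sketch of that idea, assuming made-up attention scores. The helper names `causal_mask` and `masked_softmax` are illustrative and not part of the model code above.

```python
import math


def causal_mask(T):
    """Lower-triangular mask: row i has 1s for columns 0..i, 0s after."""
    return [[1 if j <= i else 0 for j in range(T)] for i in range(T)]


def masked_softmax(scores, mask_row):
    """Softmax over scores, with masked-out positions forced to -inf first."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


T = 4
mask = causal_mask(T)
scores = [0.5, 1.0, -0.3, 2.0]   # made-up scores of one query against 4 keys

for i in range(T):
    weights = masked_softmax(scores, mask[i])
    print(f"position {i} attends with weights {[round(w, 3) for w in weights]}")
```

Every row still sums to 1, but future positions always get weight 0, so a prediction at position i cannot peek at tokens that come later.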

Training on Character-Level Shakespeare

Let's train MiniGPT on Shakespeare text. Character-level means each character is a token.

import requests
import torch
from torch.utils.data import Dataset, DataLoader

# Download Shakespeare
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text = requests.get(url).text
print(f"Total characters: {len(text):,}")
print(f"Sample:\n{text[:200]}")

# Build character vocabulary
chars = sorted(set(text))
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size} unique characters")

stoi = {c: i for i, c in enumerate(chars)}  # char to index
itos = {i: c for i, c in enumerate(chars)}  # index to char

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join(itos[i] for i in l)

# Encode full dataset
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Encoded length: {len(data):,} tokens")

# Train/val split
n_train = int(0.9 * len(data))
train_data = data[:n_train]
val_data   = data[n_train:]
print(f"Train tokens: {len(train_data):,}")
print(f"Val tokens:   {len(val_data):,}")

# Dataset
class CharDataset(Dataset):
    def __init__(self, data, block_size):
        self.data       = data
        self.block_size = block_size

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        x = self.data[idx:idx + self.block_size]
        y = self.data[idx + 1:idx + self.block_size + 1]
        return x, y

block_size  = 128
train_set   = CharDataset(train_data, block_size)
val_set     = CharDataset(val_data, block_size)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader   = DataLoader(val_set,   batch_size=64, shuffle=False)

print(f"Training batches: {len(train_loader)}")
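
To see what `CharDataset` hands to the model, it helps to look at one tiny example. The sketch below mirrors the slicing in `__getitem__` on a short made-up string. `make_example` is a hypothetical helper, and the string "hello world" is only for illustration.

```python
def make_example(tokens, start, block_size):
    """Same slicing as CharDataset.__getitem__: targets are inputs shifted by one."""
    x = tokens[start:start + block_size]
    y = tokens[start + 1:start + block_size + 1]
    return x, y


sample = "hello world"
sample_chars = sorted(set(sample))
char_to_id = {c: i for i, c in enumerate(sample_chars)}
id_to_char = {i: c for c, i in char_to_id.items()}
sample_tokens = [char_to_id[c] for c in sample]

x, y = make_example(sample_tokens, start=0, block_size=4)

# Each position t is its own training example: predict y[t] from x[:t+1].
for t in range(len(x)):
    context = "".join(id_to_char[i] for i in x[:t + 1])
    target = id_to_char[y[t]]
    print(f"context {context!r} -> target {target!r}")
```

One block of length 4 yields 4 next-character predictions, which is why a single forward pass over a 128-character block trains 128 predictions at once.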

import torch.optim as optim

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model  = MiniGPT(
    vocab_size=vocab_size,
    d_model=128,
    n_heads=4,
    n_layers=4,
    d_ff=512,
    max_len=block_size
).to(device)

optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)

def evaluate(model, loader, max_batches=20):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for i, (x, y) in enumerate(loader):
            if i >= max_batches:
                break
            x, y = x.to(device), y.to(device)
            _, loss = model(x, y)
            total_loss += loss.item()
    return total_loss / min(max_batches, len(loader))

print(f"Training on: {device}")
print(f"{'Epoch':<8} {'Train Loss':<12} {'Val Loss':<12}")
print("-" * 35)

for epoch in range(1, 6):
    model.train()
    train_loss = 0

    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        _, loss = model(x, y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        train_loss += loss.item()

    train_loss /= len(train_loader)
    val_loss    = evaluate(model, val_loader)
    scheduler.step()

    print(f"{epoch:<8} {train_loss:<12.4f} {val_loss:.4f}")

Output:

Training on: cuda
Epoch    Train Loss   Val Loss
-----------------------------------
1        2.8341       2.6123
2        2.1045       2.0843
3        1.8921       1.9104
4        1.7632       1.8231
5        1.6891       1.7843
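
A cross-entropy loss is easier to interpret as perplexity, which is exp(loss) when the loss is measured in nats (as `F.cross_entropy` does). The sketch below converts the first and last validation losses from the table and compares them with a uniform guess over 65 characters. `perplexity` is a hypothetical helper name.

```python
import math


def perplexity(cross_entropy_nats):
    """Perplexity is the exponential of the average cross-entropy (in nats)."""
    return math.exp(cross_entropy_nats)


vocab_size = 65                       # character vocabulary used for MiniGPT above
uniform_loss = math.log(vocab_size)   # loss of a model that guesses uniformly

print(f"uniform guess: loss={uniform_loss:.3f}  perplexity={perplexity(uniform_loss):.1f}")
for val_loss in (2.6123, 1.7843):     # epoch 1 and epoch 5 validation losses
    print(f"val loss={val_loss:.4f}  perplexity={perplexity(val_loss):.2f}")
```

Read perplexity as "the model is about as unsure as if it were choosing uniformly among this many characters". Dropping from 65 toward single digits means the model has picked up a lot of structure.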

Temperature: Controlling Randomness

Temperature is one of the most influential generation parameters. It divides the logits before softmax: values below 1 sharpen the distribution, values above 1 flatten it.

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np

# Example logits for 5 tokens: A, B, C, D, E
logits = torch.tensor([3.0, 1.5, 0.8, 0.3, -0.5])

temperatures = [0.1, 0.5, 1.0, 1.5, 2.0]
vocab        = ['A', 'B', 'C', 'D', 'E']

fig, axes = plt.subplots(1, 5, figsize=(15, 4))

for ax, temp in zip(axes, temperatures):
    probs = F.softmax(logits / temp, dim=0).numpy()
    bars  = ax.bar(vocab, probs, color=['#4ECDC4' if i == 0 else '#95A5A6' for i in range(5)])
    ax.set_title(f'temp={temp}')
    ax.set_ylim(0, 1)
    ax.set_ylabel('Probability' if temp == 0.1 else '')
    for bar, prob in zip(bars, probs):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                f'{prob:.2f}', ha='center', va='bottom', fontsize=8)

plt.suptitle('Effect of Temperature on Token Probabilities', y=1.02)
plt.tight_layout()
plt.savefig('temperature_effect.png', dpi=100)
plt.show()

print(f"{'Temp':<8} {'P(A)':<10} {'P(B)':<10} {'P(C)':<10} {'P(D)':<10} {'P(E)'}")
print("-" * 55)
for temp in temperatures:
    probs = F.softmax(logits / temp, dim=0)
    print(f"{temp:<8} " + " ".join(f"{p.item():<10.4f}" for p in probs))

Output:

Temp     P(A)       P(B)       P(C)       P(D)       P(E)
-------------------------------------------------------
0.1      1.0000     0.0000     0.0000     0.0000     0.0000
0.5      0.9368     0.0466     0.0115     0.0042     0.0009
1.0      0.6986     0.1559     0.0774     0.0470     0.0211
1.5      0.5374     0.1977     0.1240     0.0888     0.0521
2.0      0.4468     0.2110     0.1487     0.1158     0.0776

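Temperature = 0.1: extremely peaked, almost always picks "A". Nearly deterministic, so the output gets repetitive.

Temperature = 1.0: the model's original distribution, unchanged. Balanced randomness.

Temperature = 2.0: much flatter. Unlikely tokens get picked far more often, so the output is very random and often incoherent.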

A common starting range for creative writing is 0.7 to 1.0. For code or factual tasks, 0.2 to 0.5 is typical. Treat these as rules of thumb and tune them for your model.
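
The two extremes are worth checking numerically: as temperature goes toward 0 the distribution collapses onto the top token (greedy), and as it grows large the distribution approaches uniform. The sketch below uses only the standard library and the same five logits. `softmax_with_temperature` and `entropy_bits` are hypothetical helper names, and entropy is used here as a rough "how random is this" score.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits: 0 means deterministic, log2(n) means uniform."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


logits = [3.0, 1.5, 0.8, 0.3, -0.5]

for t in (0.1, 1.0, 2.0, 100.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t:<6} P(A)={probs[0]:.4f}  entropy={entropy_bits(probs):.3f} bits")

print(f"uniform over 5 tokens would be {math.log2(5):.3f} bits")
```

Entropy rises steadily with temperature, which is another way of saying that the sampler's choices become harder to predict.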

Sampling Strategies

Greedy: always pick the highest probability token. Fast. Repetitive. Boring.

Top-k: only consider the k highest probability tokens. Sample from those.

Top-p (nucleus sampling): consider the smallest set of tokens whose cumulative probability exceeds p, and sample from those. The candidate set shrinks when the model is confident and grows when it is unsure.

def greedy_sample(logits):
    return torch.argmax(logits, dim=-1)

def top_k_sample(logits, k=50, temperature=1.0):
    logits = logits / temperature
    top_k_logits, top_k_indices = torch.topk(logits, k)
    probs  = F.softmax(top_k_logits, dim=-1)
    chosen = torch.multinomial(probs, num_samples=1)
    return top_k_indices[chosen]

def top_p_sample(logits, p=0.9, temperature=1.0):
    logits = logits / temperature
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

    # Remove tokens with cumulative prob above threshold
    sorted_indices_to_remove = cumulative_probs > p
    # Shift to keep at least one token
    sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].clone()
    sorted_indices_to_remove[0]  = False

    sorted_logits[sorted_indices_to_remove] = float('-inf')
    probs  = F.softmax(sorted_logits, dim=-1)
    chosen = torch.multinomial(probs, num_samples=1)
    return sorted_indices[chosen]

# Demonstrate on example logits
logits_example = torch.randn(100)   # 100-token vocabulary

greedy_choice = greedy_sample(logits_example)
topk_choice   = top_k_sample(logits_example, k=10)
topp_choice   = top_p_sample(logits_example, p=0.9)

print(f"Greedy picked token:  {greedy_choice.item()}")
print(f"Top-k (k=10) picked:  {topk_choice.item()}")
print(f"Top-p (p=0.9) picked: {topp_choice.item()}")

# How many tokens qualify for top-p at p=0.9?
sorted_logits, _ = torch.sort(logits_example, descending=True)
cumprobs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
n_tokens_in_nucleus = (cumprobs <= 0.9).sum().item() + 1
print(f"\nTokens in nucleus (p=0.9): {n_tokens_in_nucleus} out of 100")
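
The adaptive part of top-p is easier to see on two hand-made distributions than on random logits. The sketch below counts how many tokens land in the nucleus for a confident and an uncertain distribution. `nucleus_size` is a hypothetical helper, both distributions are made up, and it treats "cumulative probability reaches p" as the cutoff.

```python
def nucleus_size(probs, p=0.9):
    """Smallest number of top tokens whose cumulative probability reaches p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)


confident = [0.92, 0.05, 0.02, 0.01]   # the model is almost sure of one token
uncertain = [0.25, 0.25, 0.25, 0.25]   # the model has no preference

print(f"confident: {nucleus_size(confident)} token(s) in the nucleus")  # 1
print(f"uncertain: {nucleus_size(uncertain)} token(s) in the nucleus")  # 4
```

Top-k with k=2 would keep exactly two candidates in both cases, adding a weak option to the confident distribution and dropping two equally good options from the uncertain one.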

Generating Text With Our MiniGPT

def generate_text(model, prompt, max_new_tokens=200,
                  temperature=0.8, top_k=40, device='cpu'):
    model.eval()

    # Encode prompt
    context = torch.tensor(encode(prompt), dtype=torch.long).unsqueeze(0).to(device)

    # Generate
    with torch.no_grad():
        generated = model.generate(
            context,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k
        )

    # Decode
    generated_tokens = generated[0].tolist()
    return decode(generated_tokens)

# Try different temperatures
print("=" * 60)
print("LOW TEMPERATURE (0.3) - Conservative and repetitive:")
print("=" * 60)
print(generate_text(model, "HAMLET:", max_new_tokens=150,
                    temperature=0.3, top_k=10, device=device))

print("\n" + "=" * 60)
print("MEDIUM TEMPERATURE (0.8) - Balanced:")
print("=" * 60)
print(generate_text(model, "HAMLET:", max_new_tokens=150,
                    temperature=0.8, top_k=40, device=device))

print("\n" + "=" * 60)
print("HIGH TEMPERATURE (1.5) - Chaotic:")
print("=" * 60)
print(generate_text(model, "HAMLET:", max_new_tokens=150,
                    temperature=1.5, top_k=None, device=device))

Output (after 5 epochs on Shakespeare):

============================================================
LOW TEMPERATURE (0.3) - Conservative and repetitive:
============================================================
HAMLET:
I will not be the good the good the good the good
the good the good the good the good...

============================================================
MEDIUM TEMPERATURE (0.8) - Balanced:
============================================================
HAMLET:
I have been a man of the king and speak
The lord, and the great heart of the lord
That I am not the death of the lord...

============================================================
HIGH TEMPERATURE (1.5) - Chaotic:
============================================================
HAMLET:
Vxqo! zj kin, thae wath gof amd
jek lpe mhek ther whi...

Low temperature: repetitive but coherent. High temperature: gibberish. Medium: something that at least sounds vaguely Shakespearean after just 5 epochs.

Train longer and the quality improves dramatically.

Using GPT-2 With HuggingFace

from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

# Load GPT-2
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
gpt2_model     = GPT2LMHeadModel.from_pretrained('gpt2')

gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

# Generate with different strategies, reusing the model and tokenizer loaded above
generator = pipeline('text-generation', model=gpt2_model, tokenizer=gpt2_tokenizer)

prompt = "The future of artificial intelligence is"

print("GREEDY (do_sample=False):")
result = generator(prompt, max_new_tokens=50, do_sample=False)
print(result[0]['generated_text'])

print("\nTOP-K SAMPLING (k=50, temp=0.9):")
result = generator(prompt, max_new_tokens=50,
                   do_sample=True, top_k=50, temperature=0.9)
print(result[0]['generated_text'])

print("\nNUCLEUS SAMPLING (top_p=0.9):")
# top_k=0 disables top-k filtering, so only the top-p cutoff applies
result = generator(prompt, max_new_tokens=50,
                   do_sample=True, top_k=0, top_p=0.9, temperature=0.8)
print(result[0]['generated_text'])

Manual GPT-2 Generation With Full Control

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model     = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

prompt = "Once upon a time in a land far away"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

print(f"Prompt tokens: {input_ids.shape[1]}")
print(f"Prompt: '{prompt}'\n")

# Generate step by step and show probabilities
current_ids = input_ids.clone()

for step in range(5):
    with torch.no_grad():
        outputs = model(current_ids)
        logits  = outputs.logits[:, -1, :]  # last position

    # Get top 5 candidates
    probs      = torch.softmax(logits, dim=-1)
    top5_probs, top5_ids = torch.topk(probs, 5)

    print(f"Step {step+1} - Top 5 candidates:")
    for prob, token_id in zip(top5_probs[0], top5_ids[0]):
        token_str = tokenizer.decode([token_id.item()])
        print(f"  '{token_str}' : {prob.item():.4f}")

    # Pick top token (greedy)
    next_token = top5_ids[0, 0].unsqueeze(0).unsqueeze(0)
    current_ids = torch.cat([current_ids, next_token], dim=1)

    print(f"  -> Picked: '{tokenizer.decode([next_token.item()])}'\n")

final_text = tokenizer.decode(current_ids[0])
print(f"Final: '{final_text}'")

Output:

Prompt tokens: 9
Prompt: 'Once upon a time in a land far away'

Step 1 - Top 5 candidates:
  ',' : 0.2341
  'there' : 0.1823
  'called' : 0.0912
  'from' : 0.0634
  'where' : 0.0521
  -> Picked: ','

Step 2 - Top 5 candidates:
  'there' : 0.3412
  'a' : 0.1234
  'the' : 0.0891
  'an' : 0.0432
  'people' : 0.0321
  -> Picked: 'there'
...

Final: 'Once upon a time in a land far away, there was a'

What GPT Learns by Predicting the Next Word

This seems like a simple task. It's not. To predict the next word well, the model must learn:

Grammar: what word types follow others
Facts: "The capital of France is..." → "Paris"
Reasoning: "If A > B and B > C, then A > ..." → "C"
Style: given "HAMLET:", continue in Shakespearean style
Code: given def fibonacci(n):, complete correctly
Math: "2 + 2 = " → "4"

None of these were explicitly taught. They emerged from predicting tokens. This is often called emergent behavior, and it's a big reason why scaling up GPT surprised so many people.

Quick Cheat Sheet

Concept	What it means
Autoregressive	Generate one token at a time, feed back to input
Temperature	Higher = more random, lower = more deterministic
Greedy	Always pick highest prob token. Repetitive.
Top-k	Sample from top k tokens only
Top-p (nucleus)	Sample from smallest set with cumulative prob > p
Perplexity	exp(cross-entropy loss) of a language model: lower = better
Weight tying	Embedding and output head share weights
Pre-norm	LayerNorm before attention (modern GPT), more stable

Task	Code
Load GPT-2	`GPT2LMHeadModel.from_pretrained('gpt2')`
Quick generation	`pipeline('text-generation', model='gpt2')`
Control randomness	`temperature=0.8, top_k=50, top_p=0.9`
Stop at end-of-text token	`eos_token_id=tokenizer.eos_token_id`
Greedy	`do_sample=False`
Sampling	`do_sample=True`

Practice Challenges

Level 1:
Use the pipeline('text-generation') with GPT-2. Generate the same prompt 5 times with temperature=0.9. Compare the outputs. Now do it with temperature=0.1. How different are the results?

Level 2:
Train MiniGPT on a different text dataset: a collection of Python code, song lyrics, or any repetitive text. After training, generate samples and evaluate quality by eye. How many epochs until the samples look like the training data?

Level 3:
Implement beam search on top of MiniGPT. Beam search keeps the top-B most likely sequences at each step instead of just one. Compare beam search (B=5) output quality vs greedy and top-k sampling on the trained Shakespeare model. Which one produces the most coherent text?

References

Next up, Post 94: HuggingFace: Your Library for Every Pretrained Model. Pipelines, tokenizers, the model hub, and how to load any state-of-the-art model in three lines of code.
