Core Components:
- Data Download & Preprocessing - Downloads text from Project Gutenberg with fallback
- SimpleTokenizer - Word-based tokenization with vocabulary building
- TextDataset - PyTorch dataset for sequence-to-sequence training
- PositionalEncoding - Adds position information to embeddings
- MultiHeadAttention - Core attention mechanism
- FeedForward - Feed-forward neural network
- TransformerBlock - Complete transformer layer
- SimpleGPT - Full GPT model
- GPTTrainer - Training loop with validation
- Text Generation - Advanced text generation with top-k sampling
Key Features:
- Automatic device detection (MPS/CUDA/CPU)
- Proper weight initialization
- Gradient clipping and learning rate scheduling
- Model checkpointing
- Causal masking for autoregressive generation
Usage:
Simply run python filename.py
and it will:
- Download/prepare the dataset
- Build the tokenizer
- Create and train the model
- Save the complete model
- Generate sample text
Step 1: Initial Setup and Device Detection
import torch
import torch.nn as nn
import torch.nn.functional as F
# ... other imports
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
What's happening here:
- Random Seeds: Setting the seeds to 42 ensures reproducible results
  - Every time you run the code, you'll get the same "random" numbers
  - This makes debugging easier and results consistent
- Device Detection:
  - MPS (Metal Performance Shaders): Apple's GPU acceleration for M1/M2 Macs
  - CUDA: NVIDIA GPU acceleration
  - CPU: Fallback for any computer
Why this matters:
- GPUs are much faster for matrix operations (10-100x speedup)
- Your model will automatically use the best available hardware
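To make this concrete, here is a minimal sketch (using a hypothetical tiny model, not the tutorial's GPT) of how the device variable is used later: both the model and every data batch have to be moved onto it.
import torch
import torch.nn as nn
device = torch.device("mps" if torch.backends.mps.is_available()
                      else "cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(128, 800).to(device)   # hypothetical tiny model
batch = torch.randn(8, 128).to(device)   # hypothetical input batch
output = model(batch)                    # computation runs on the selected device
print(output.shape, output.device)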
What gets stored:
device = torch.device("mps") # or "cuda" or "cpu"
Key Concept - Tensors:
- All data in PyTorch is stored as "tensors" (multi-dimensional arrays)
- Tensors can live on CPU or GPU
- Example:
# CPU tensor
x = torch.tensor([1, 2, 3])
# GPU tensor (much faster for large operations)
x_gpu = torch.tensor([1, 2, 3]).to(device)
Questions to check understanding:
- Why do we set random seeds?
- What's the difference between CPU and GPU processing?
- What is a tensor?
Step 2: Data Download and Preprocessing
def download_dataset():
    Path("data").mkdir(exist_ok=True)
    url = "https://www.gutenberg.org/files/11/11-0.txt"
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        text = response.text
        # Strip the Project Gutenberg header/footer markers
        start_idx = text.find("*** START OF")
        end_idx = text.find("*** END OF")
        if start_idx != -1 and end_idx != -1:
            clean_text = text[start_idx:end_idx]
            newline_idx = clean_text.find('\n\n')
            if newline_idx != -1:
                clean_text = clean_text[newline_idx + 2:]
        else:
            clean_text = text
        clean_text = clean_text[:50000]  # Take first 50,000 characters
        with open('data/dataset.txt', 'w', encoding='utf-8') as f:
            f.write(clean_text)
        return clean_text
    except requests.RequestException:
        # Download failed - use the fallback text described in item 5 below
        ...
What's happening step by step:
1. Create Data Directory
Path("data").mkdir(exist_ok=True)
- Creates a folder called "data" if it doesn't exist
- exist_ok=True means "don't crash if the folder already exists"
2. Download Raw Text
url = "https://www.gutenberg.org/files/11/11-0.txt"
response = requests.get(url, timeout=30)
- Downloads "Alice's Adventures in Wonderland" from Project Gutenberg
- Project Gutenberg = free digital library of books
- File 11 = Alice in Wonderland (classic text for ML experiments)
3. Clean the Text
Raw downloaded text looks like:
The Project Gutenberg eBook of Alice's Adventures in Wonderland
*** START OF THE PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***
Alice was beginning to get very tired of sitting by her sister on the bank...
[ACTUAL STORY CONTENT]
*** END OF THE PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***
End of the Project Gutenberg EBook...
Cleaning process:
start_idx = text.find("*** START OF") # Find where story begins
end_idx = text.find("*** END OF") # Find where story ends
clean_text = text[start_idx:end_idx] # Extract only the story part
4. Final Processing
clean_text = clean_text[:50000] # Take first 50,000 characters
- Limits size to 50K characters for faster training on laptops
- Full Alice in Wonderland is ~150K characters
What gets saved to disk:
data/dataset.txt
├── Content: "Alice was beginning to get very tired of sitting by her sister..."
├── Size: ~50,000 characters
└── Format: Plain text, UTF-8 encoding
Example of cleaned text:
"Alice was beginning to get very tired of sitting by her sister on the bank,
and of having nothing to do: once or twice she had peeped into the book her
sister was reading, but it had no pictures or conversations in it, 'and what
is the use of a book,' thought Alice 'without pictures or conversation?'"
5. Fallback Data (if download fails)
fallback_text = """
The quick brown fox jumps over the lazy dog...
Alice was beginning to get very tired...
""" * 100
- If internet fails, uses simple repeated sentences
- Ensures code always works, even offline
Key Concepts:
- Text Preprocessing: Cleaning raw data before feeding to ML models
- Character vs Word Count: 50K characters ≈ 8-10K words
- UTF-8 Encoding: Standard way to store text with special characters
What you now have:
- A clean text file with story content
- No headers, footers, or metadata
- Ready for the next step: tokenization
Memory representation:
clean_text = "Alice was beginning to get very tired..."
# Type: string
# Length: 50,000 characters
# Storage: ~50KB in memory
Step 3: Tokenization - Converting Text to Numbers
This is CRUCIAL - neural networks can't understand text, only numbers! We need to convert words to numbers.
class SimpleTokenizer:
    def __init__(self, vocab_size=1000):
        self.vocab_size = vocab_size
        self.word_to_int = {}          # Dictionary: word -> number
        self.int_to_word = {}          # Dictionary: number -> word
        self.vocab = []                # List of all words
        self.word_freq = Counter()     # How often each word appears
Step 3A: Text Cleaning
def clean_text(self, text):
    text = text.lower()                 # "Alice" -> "alice"
    text = re.sub(r'[^a-zA-Z0-9\s\.\,\!\?\;\:\-\'\"]', ' ', text)
    text = re.sub(r'\s+', ' ', text)    # Multiple spaces -> single space
    text = text.strip()
    return text
What happens:
Input: "Alice was VERY tired!!! She thought, 'This is boring...'"
Step 1 (lowercase): "alice was very tired!!! she thought, 'this is boring...'"
Step 2 (drop disallowed characters): unchanged here - basic punctuation like . , ! ? ' is kept by the regex
Step 3 (collapse whitespace and strip): "alice was very tired!!! she thought, 'this is boring...'"
Output: "alice was very tired!!! she thought, 'this is boring...'"
Step 3B: Build Vocabulary
def build_vocab(self, text):
    clean_text = self.clean_text(text)
    words = clean_text.split()          # Split into individual words
    self.word_freq = Counter(words)     # Count frequency of each word
Example word counting:
text = "alice was tired alice was very tired"
words = ["alice", "was", "tired", "alice", "was", "very", "tired"]
word_freq = Counter(words)
# Result: {'alice': 2, 'was': 2, 'tired': 2, 'very': 1}
Create final vocabulary:
special_tokens = ['<PAD>', '<UNK>', '<BOS>', '<EOS>']
most_common_words = self.word_freq.most_common(vocab_size - 4)
self.vocab = special_tokens + [word for word, _ in most_common_words]
Special tokens explained:
- <PAD>: Padding (fills empty spaces)
- <UNK>: Unknown word (not in the vocabulary)
- <BOS>: Beginning of sequence
- <EOS>: End of sequence
Example vocabulary (first 10 items):
vocab = [
'<PAD>', # ID: 0
'<UNK>', # ID: 1
'<BOS>', # ID: 2
'<EOS>', # ID: 3
'the', # ID: 4 (most common word)
'and', # ID: 5
'to', # ID: 6
'a', # ID: 7
'alice', # ID: 8
'was', # ID: 9
# ... up to vocab_size=800 words
]
Step 3C: Create Word-to-Number Mappings
self.word_to_int = {word: i for i, word in enumerate(self.vocab)}
self.int_to_word = {i: word for i, word in enumerate(self.vocab)}
Resulting dictionaries:
word_to_int = {
'<PAD>': 0,
'<UNK>': 1,
'<BOS>': 2,
'<EOS>': 3,
'the': 4,
'and': 5,
'alice': 8,
'was': 9,
# ... 800 total words
}
int_to_word = {
0: '<PAD>',
1: '<UNK>',
2: '<BOS>',
3: '<EOS>',
4: 'the',
5: 'and',
8: 'alice',
9: 'was',
# ... 800 total words
}
Step 3D: Encoding (Text → Numbers)
def encode(self, text):
    clean_text = self.clean_text(text)
    words = clean_text.split()
    numbers = []
    for word in words:
        if word in self.word_to_int:
            numbers.append(self.word_to_int[word])
        else:
            numbers.append(self.word_to_int['<UNK>'])  # Unknown word
    return numbers
Example encoding:
text = "Alice was tired"
clean_text = "alice was tired"
words = ["alice", "was", "tired"]
# Look up each word:
numbers = [
word_to_int["alice"], # 8
word_to_int["was"], # 9
word_to_int["tired"] # 45 (assuming "tired" is 45th most common)
]
result = [8, 9, 45]
Step 3E: Decoding (Numbers → Text)
def decode(self, numbers):
    words = []
    for num in numbers:
        if num in self.int_to_word:
            words.append(self.int_to_word[num])
        else:
            words.append('<UNK>')
    return ' '.join(words)
Example decoding:
numbers = [8, 9, 45]
# Look up each number:
words = [
int_to_word[8], # "alice"
int_to_word[9], # "was"
int_to_word[45] # "tired"
]
result = "alice was tired"
What Gets Saved:
# File: data/tokenizer.pkl
tokenizer_data = {
'vocab_size': 800,
'word_to_int': {'<PAD>': 0, '<UNK>': 1, ..., 'tired': 45, ...},
'int_to_word': {0: '<PAD>', 1: '<UNK>', ..., 45: 'tired', ...},
'vocab': ['<PAD>', '<UNK>', '<BOS>', '<EOS>', 'the', 'and', ...],
'word_freq': Counter({'the': 1234, 'and': 987, 'alice': 156, ...})
}
Memory Format:
Before tokenization:
"Alice was beginning to get very tired" (string, ~37 characters)
After tokenization:
[8, 9, 234, 4, 67, 12, 45] (list of integers, 7 numbers)
Why This Matters:
- Neural networks only understand numbers
- Consistent mapping: Same word always gets same number
- Vocabulary size controls model complexity: 800 words = manageable for small model
- Unknown words handled gracefully: rare words become <UNK>
Key Insight:
Your entire book is now represented as a sequence of numbers between 0 and 799!
Original: "Alice was beginning to get very tired of sitting by her sister..."
Tokenized: [8, 9, 234, 4, 67, 12, 45, 23, 156, 34, 89, 234, ...]
Step 4: Creating the Training Dataset
Now we need to convert our tokenized text into training examples that teach the model to predict the next word.
class TextDataset(Dataset):
    def __init__(self, text, tokenizer, seq_length=32):
        self.tokenizer = tokenizer
        self.seq_length = seq_length
        self.tokens = tokenizer.encode(text)   # Convert entire text to numbers
        self.examples = []
        for i in range(len(self.tokens) - seq_length):
            input_seq = self.tokens[i:i + seq_length]
            target_seq = self.tokens[i + 1:i + seq_length + 1]
            self.examples.append((input_seq, target_seq))
Step 4A: Understanding the Core Concept
The key insight: To predict the next word, the model learns from input → target pairs where target is input shifted by 1 position.
Example with small sequence:
# Original tokenized text:
tokens = [8, 9, 234, 4, 67, 12, 45, 23, 156, 34, 89, 67, 234, 445, 23]
# alice was beginning to get very tired of sitting by her get beginning long of
# With seq_length = 5, we create these training examples:
Step 4B: Creating Training Examples (Sliding Window)
seq_length = 5 # Model sees 5 words at once
# Example 1:
i = 0
input_seq = tokens[0:5] # [8, 9, 234, 4, 67] "alice was beginning to get"
target_seq = tokens[1:6] # [9, 234, 4, 67, 12] "was beginning to get very"
# Example 2:
i = 1
input_seq = tokens[1:6] # [9, 234, 4, 67, 12] "was beginning to get very"
target_seq = tokens[2:7] # [234, 4, 67, 12, 45] "beginning to get very tired"
# Example 3:
i = 2
input_seq = tokens[2:7] # [234, 4, 67, 12, 45] "beginning to get very tired"
target_seq = tokens[3:8] # [4, 67, 12, 45, 23] "to get very tired of"
Step 4C: Visual Representation
Position: 0 1 2 3 4 5 6 7 8 9
Tokens: [ 8, 9, 234, 4, 67, 12, 45, 23, 156, 34 ]
Words: alice was beginning to get very tired of sitting by
Training Example 1:
Input: [ 8, 9, 234, 4, 67 ] "alice was beginning to get"
Target: [ 9, 234, 4, 67, 12 ] "was beginning to get very"
↑ ↑ ↑ ↑ ↑
Predict these from the inputs above
Training Example 2:
Input: [ 9, 234, 4, 67, 12 ] "was beginning to get very"
Target: [234, 4, 67, 12, 45 ] "beginning to get very tired"
Step 4D: What Each Training Example Teaches
Each position in the sequence learns a different prediction:
input_seq = [8, 9, 234, 4, 67] # "alice was beginning to get"
target_seq = [9, 234, 4, 67, 12] # "was beginning to get very"
# What the model learns:
# Position 0: Given "alice" → predict "was"
# Position 1: Given "alice was" → predict "beginning"
# Position 2: Given "alice was beginning" → predict "to"
# Position 3: Given "alice was beginning to" → predict "get"
# Position 4: Given "alice was beginning to get" → predict "very"
Step 4E: Dataset Class Methods
def __len__(self):
    return len(self.examples)   # How many training examples we have

def __getitem__(self, idx):
    input_seq, target_seq = self.examples[idx]
    # Convert to PyTorch tensors (required format)
    input_tensor = torch.tensor(input_seq, dtype=torch.long)
    target_tensor = torch.tensor(target_seq, dtype=torch.long)
    return input_tensor, target_tensor
Step 4F: Complete Example with Real Numbers
Let's say our tokenized text is:
tokens = [8, 9, 234, 4, 67, 12, 45, 23, 156, 34, 89, 67, 234, 445, 23, 67, 89]
# Length: 17 tokens
# With seq_length = 5, we get: 17 - 5 = 12 training examples
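As a sanity check, the same sliding-window pairs can be built directly with list slicing (a standalone sketch using the illustrative token IDs above):
tokens = [8, 9, 234, 4, 67, 12, 45, 23, 156, 34, 89, 67, 234, 445, 23, 67, 89]
seq_length = 5
examples = [
    (tokens[i:i + seq_length], tokens[i + 1:i + seq_length + 1])
    for i in range(len(tokens) - seq_length)
]
print(len(examples))   # 12 training examples
print(examples[0])     # ([8, 9, 234, 4, 67], [9, 234, 4, 67, 12])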
All training examples:
examples = [
# (input_seq, target_seq)
([8, 9, 234, 4, 67], [9, 234, 4, 67, 12]), # Example 0
([9, 234, 4, 67, 12], [234, 4, 67, 12, 45]), # Example 1
([234, 4, 67, 12, 45], [4, 67, 12, 45, 23]), # Example 2
([4, 67, 12, 45, 23], [67, 12, 45, 23, 156]), # Example 3
([67, 12, 45, 23, 156], [12, 45, 23, 156, 34]), # Example 4
# ... and so on for 12 total examples
]
Step 4G: PyTorch Tensors (Data Format)
# When we call dataset[0], we get:
input_tensor = torch.tensor([8, 9, 234, 4, 67], dtype=torch.long)
target_tensor = torch.tensor([9, 234, 4, 67, 12], dtype=torch.long)
# Tensor properties:
print(input_tensor.shape) # torch.Size([5]) - 1D tensor with 5 elements
print(input_tensor.dtype) # torch.int64 - 64-bit integers
print(input_tensor.device) # cpu - stored on CPU (for now)
Step 4H: Memory Layout
In memory, each example looks like:
Example 0:
├── input_tensor: [8, 9, 234, 4, 67] # Shape: [5]
└── target_tensor: [9, 234, 4, 67, 12] # Shape: [5]
Example 1:
├── input_tensor: [9, 234, 4, 67, 12] # Shape: [5]
└── target_tensor: [234, 4, 67, 12, 45] # Shape: [5]
Step 4I: Dataset Split (Train/Validation)
train_size = int(0.8 * len(dataset)) # 80% for training
val_size = len(dataset) - train_size # 20% for validation
train_dataset, val_dataset = torch.utils.data.random_split(
    dataset, [train_size, val_size]
)
Why split the data?
- Training set: Model learns from these examples
- Validation set: Test how well model generalizes to unseen data
- Prevents overfitting: Model memorizing instead of learning patterns
Step 4J: DataLoader (Batching)
batch_size = 8
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,     # Mix up the order each epoch
    drop_last=True    # Drop incomplete batches
)
What batching does:
Instead of processing one example at a time, we process 8 examples together:
# Single example:
input_shape: [5] # 5 tokens
target_shape: [5] # 5 tokens
# Batch of 8 examples:
input_shape: [8, 5] # 8 examples, each with 5 tokens
target_shape: [8, 5] # 8 targets, each with 5 tokens
Batch visualization:
batch_input = [
[8, 9, 234, 4, 67], # Example 1
[9, 234, 4, 67, 12], # Example 2
[234, 4, 67, 12, 45], # Example 3
[4, 67, 12, 45, 23], # Example 4
[67, 12, 45, 23, 156], # Example 5
[12, 45, 23, 156, 34], # Example 6
[45, 23, 156, 34, 89], # Example 7
[23, 156, 34, 89, 67] # Example 8
]
# Shape: [8, 5] = [batch_size, seq_length]
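A minimal sketch of inspecting one batch (it assumes the TextDataset object from Step 4 already exists as dataset):
from torch.utils.data import DataLoader
loader = DataLoader(dataset, batch_size=8, shuffle=True, drop_last=True)
input_batch, target_batch = next(iter(loader))
print(input_batch.shape)    # torch.Size([8, seq_length])
print(target_batch.shape)   # torch.Size([8, seq_length])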
Key Insights:
- Sliding window creates many examples from single text
- Each position learns different context lengths (1 word, 2 words, 3 words, etc.)
- Target is always input shifted by 1 (next word prediction)
- Batching enables parallel processing on GPU
- Train/val split prevents overfitting
What's stored in memory:
dataset.examples = [
([8, 9, 234, 4, 67], [9, 234, 4, 67, 12]),
([9, 234, 4, 67, 12], [234, 4, 67, 12, 45]),
# ... thousands of examples
]
Step 5: Positional Encoding - Teaching the Model About Word Order
The Problem: Neural networks don't naturally understand that "cat sat mat" is different from "mat sat cat". They process all words at the same time!
The Solution: Add special position numbers to each word's embedding.
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_length=1000):
        super().__init__()
        pe = torch.zeros(max_length, d_model)
        position = torch.arange(0, max_length).float().unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)   # Even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # Odd dimensions
        self.register_buffer('pe', pe.unsqueeze(0))
Step 5A: Understanding the Problem
Without positional encoding:
# These sentences would look IDENTICAL to the neural network:
sentence1 = ["the", "cat", "sat", "on", "mat"]
sentence2 = ["mat", "on", "sat", "cat", "the"]
# Because neural networks process them as a "bag of words"
# Missing: WHERE each word appears in the sentence
Step 5B: The Mathematical Formula
For each position pos
and dimension i
:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) # Even dimensions
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) # Odd dimensions
Breaking down the formula:
# Parameters:
pos = 0, 1, 2, 3, 4, ... # Position in sentence (0=first word, 1=second, etc.)
d_model = 128 # Embedding dimension
i = 0, 1, 2, 3, ..., d_model/2 # Dimension index
Step 5C: Step-by-Step Calculation
Let's calculate position encoding for position 0 (first word) and d_model=4 (simplified):
# Step 1: Create position tensor
position = torch.arange(0, max_length).float().unsqueeze(1)
# Result: [[0.], [1.], [2.], [3.], [4.], ...] Shape: [max_length, 1]
# Step 2: Calculate division term
div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
# For d_model=4:
# torch.arange(0, 4, 2) = [0, 2] # Even dimensions only
# -(math.log(10000.0) / 4) = -2.302
# torch.exp([0, 2] * -2.302) = torch.exp([0, -4.605]) = [1.0, 0.01]
div_term = [1.0, 0.01]
Step 3: Calculate sine and cosine values
For position 0 (first word):
pos = 0
# Even dimensions (0, 2):
pe[0, 0] = sin(0 * 1.0) = sin(0) = 0.0
pe[0, 2] = sin(0 * 0.01) = sin(0) = 0.0
# Odd dimensions (1, 3):
pe[0, 1] = cos(0 * 1.0) = cos(0) = 1.0
pe[0, 3] = cos(0 * 0.01) = cos(0) = 1.0
# Position 0 encoding: [0.0, 1.0, 0.0, 1.0]
For position 1 (second word):
pos = 1
# Even dimensions:
pe[1, 0] = sin(1 * 1.0) = sin(1) = 0.841
pe[1, 2] = sin(1 * 0.01) = sin(0.01) = 0.01
# Odd dimensions:
pe[1, 1] = cos(1 * 1.0) = cos(1) = 0.540
pe[1, 3] = cos(1 * 0.01) = cos(0.01) = 0.9999
# Position 1 encoding: [0.841, 0.540, 0.01, 0.9999]
Step 5D: Complete Positional Encoding Matrix
For a sequence length of 5 and d_model=4:
pe = [
[0.000, 1.000, 0.000, 1.000], # Position 0
[0.841, 0.540, 0.010, 0.999], # Position 1
[0.909, -0.416, 0.020, 0.999], # Position 2
[0.141, -0.989, 0.030, 0.999], # Position 3
[-0.756, -0.654, 0.040, 0.999] # Position 4
]
# Shape: [5, 4] = [seq_length, d_model]
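You can reproduce this small matrix with a few lines of PyTorch (a standalone sketch, separate from the PositionalEncoding class):
import math
import torch
d_model, seq_len = 4, 5
position = torch.arange(0, seq_len).float().unsqueeze(1)            # [5, 1]
div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                     -(math.log(10000.0) / d_model))                # [2]
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
print(pe)   # rows match the table above (up to rounding)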
Step 5E: How Positional Encoding is Applied
def forward(self, x):
    seq_len = x.size(1)
    x = x + self.pe[:, :seq_len]   # Add position info to word embeddings
    return x
Example application:
# Input: Word embeddings for "the cat sat"
word_embeddings = [
[0.1, 0.2, 0.3, 0.4], # "the" embedding
[0.5, 0.6, 0.7, 0.8], # "cat" embedding
[0.9, 1.0, 1.1, 1.2] # "sat" embedding
]
# Shape: [3, 4] = [seq_length, d_model]
# Add positional encoding:
positional_encoding = [
[0.000, 1.000, 0.000, 1.000], # Position 0
[0.841, 0.540, 0.010, 0.999], # Position 1
[0.909, -0.416, 0.020, 0.999] # Position 2
]
# Final embeddings (word + position):
final_embeddings = [
[0.1+0.000, 0.2+1.000, 0.3+0.000, 0.4+1.000] = [0.1, 1.2, 0.3, 1.4],
[0.5+0.841, 0.6+0.540, 0.7+0.010, 0.8+0.999] = [1.341, 1.14, 0.71, 1.799],
[0.9+0.909, 1.0-0.416, 1.1+0.020, 1.2+0.999] = [1.809, 0.584, 1.12, 2.199]
]
Step 5F: Why This Works
Key properties:
- Unique fingerprint: Each position gets a unique pattern
- Relative positions: Model can learn "word A comes before word B"
- Periodic patterns: Similar positions have similar encodings
- Scalable: Works for any sequence length
Step 5G: Visual Understanding
Imagine each position as a unique "barcode":
Position 0: |||| | |||| | (0.0, 1.0, 0.0, 1.0)
Position 1: |||| || | |||| |||| (0.841, 0.540, 0.01, 0.999)
Position 2: |||||||| |||||||| (0.909, -0.416, 0.02, 0.999)
Position 3: ||| ||| |||||||| (0.141, -0.989, 0.03, 0.999)
Position 4: || || |||||||| (-0.756, -0.654, 0.04, 0.999)
Step 5H: Memory Storage
# What gets stored in memory:
self.pe = torch.tensor([
[[0.000, 1.000, 0.000, 1.000, ...], # Position 0
[0.841, 0.540, 0.010, 0.999, ...], # Position 1
[0.909, -0.416, 0.020, 0.999, ...], # Position 2
... # Up to max_length positions
[pos_n_encoding...]] # Position max_length-1
])
# Shape: [1, max_length, d_model] = [1, 1000, 128]
Step 5I: The Unsqueeze Operation
pe = torch.zeros(max_length, d_model) # Shape: [1000, 128]
self.register_buffer('pe', pe.unsqueeze(0)) # Shape: [1, 1000, 128]
Why unsqueeze(0)?
- Adds batch dimension for broadcasting
- Allows same positional encoding to be applied to all examples in a batch
Step 5J: Real-World Example
For our model with d_model=128 and seq_length=24:
# Sentence: "alice was beginning to get very tired"
# Positions: [0, 1, 2, 3, 4, 5, 6]
# Each word gets word_embedding + position_encoding:
alice_final = alice_embedding + position_0_encoding # [128] + [128] = [128]
was_final = was_embedding + position_1_encoding # [128] + [128] = [128]
beginning_final = beginning_embedding + position_2_encoding # [128] + [128] = [128]
# ... and so on
Key Insights:
- Position encoding is not learned from data: the specific values come from math, not training
- Same word, different positions: "was" at position 1 vs position 5 will have different final embeddings
- Preserves meaning: Original word meaning + position information
- No extra parameters: Just mathematical computation, no weights to learn
What happens next:
The model now knows that "cat sat mat" ≠ "mat sat cat" because each word has position information encoded into it!
Step 6: Multi-Head Attention - The Heart of the Transformer
This is the most important part! Attention lets the model decide which words to focus on when predicting the next word.
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model             # 128 (total embedding size)
        self.num_heads = num_heads         # 8 (number of attention heads)
        self.d_k = d_model // num_heads    # 16 (size per head: 128/8=16)
        self.W_q = nn.Linear(d_model, d_model, bias=False)   # Query projection
        self.W_k = nn.Linear(d_model, d_model, bias=False)   # Key projection
        self.W_v = nn.Linear(d_model, d_model, bias=False)   # Value projection
        self.W_o = nn.Linear(d_model, d_model)                # Output projection
Step 6A: The Attention Concept
Real-world analogy: When reading "The cat sat on the mat", to understand "sat", you need to pay attention to:
- "cat" (what is sitting?)
- "on" (where is it sitting?)
- "mat" (what is it sitting on?)
In neural networks: Attention computes how much each word should influence the understanding of every other word.
Step 6B: Query, Key, Value (QKV) Concept
Think of attention like a search engine:
- Query (Q): "What am I looking for?"
- Key (K): "What do I have to offer?"
- Value (V): "What information do I contain?"
Example:
sentence = "the cat sat on the mat"
# When processing "sat":
query = "what_action_is_happening?"
keys = ["article", "animal", "action", "preposition", "article", "object"]
values = ["the_info", "cat_info", "sat_info", "on_info", "the_info", "mat_info"]
# Attention finds: "cat" and "mat" are most relevant for understanding "sat"
Step 6C: Mathematical Foundation
The core attention formula:
Attention(Q,K,V) = softmax(QK^T / √d_k)V
Let's break this down step by step:
Step 6D: Creating Q, K, V Matrices
def forward(self, query, key, value, mask=None):
    batch_size, seq_length = query.size(0), query.size(1)
    Q = self.W_q(query)   # [batch, seq_len, d_model] → [batch, seq_len, d_model]
    K = self.W_k(key)     # [batch, seq_len, d_model] → [batch, seq_len, d_model]
    V = self.W_v(value)   # [batch, seq_len, d_model] → [batch, seq_len, d_model]
Example with real numbers:
# Input embeddings for "the cat sat" (simplified to d_model=4)
input_embeddings = [
[0.1, 0.2, 0.3, 0.4], # "the" with position encoding
[0.5, 0.6, 0.7, 0.8], # "cat" with position encoding
[0.9, 1.0, 1.1, 1.2] # "sat" with position encoding
]
# Shape: [1, 3, 4] = [batch_size, seq_length, d_model]
# Linear transformations (W_q, W_k, W_v are learned weight matrices):
W_q = [[0.1, 0.2, 0.3, 0.4],
[0.2, 0.3, 0.4, 0.5],
[0.3, 0.4, 0.5, 0.6],
[0.4, 0.5, 0.6, 0.7]] # Shape: [4, 4]
# Q = input_embeddings @ W_q
Q = [[0.30, 0.40, 0.50, 0.60],  # Query for "the"
     [0.70, 0.96, 1.22, 1.48],  # Query for "cat"
     [1.10, 1.52, 1.94, 2.36]]  # Query for "sat"
# Shape: [1, 3, 4]
# K and V are computed similarly with W_k and W_v
Step 6E: Multi-Head Split
Instead of one big attention, we split into multiple "heads":
# Reshape for multi-head attention:
Q = Q.view(batch_size, seq_length, num_heads, d_k).transpose(1, 2)
K = K.view(batch_size, seq_length, num_heads, d_k).transpose(1, 2)
V = V.view(batch_size, seq_length, num_heads, d_k).transpose(1, 2)
Visual representation:
# Before reshaping:
Q.shape = [1, 3, 8] # [batch, seq_len, d_model] where d_model=8, num_heads=4, d_k=2
Q = [[Q_word1_all8dims],
[Q_word2_all8dims],
[Q_word3_all8dims]]
# After reshaping and transpose:
Q.shape = [1, 4, 3, 2] # [batch, num_heads, seq_len, d_k]
Q = [[[Q_word1_head1], [Q_word2_head1], [Q_word3_head1]], # Head 1
[[Q_word1_head2], [Q_word2_head2], [Q_word3_head2]], # Head 2
[[Q_word1_head3], [Q_word2_head3], [Q_word3_head3]], # Head 3
[[Q_word1_head4], [Q_word2_head4], [Q_word3_head4]]] # Head 4
Step 6F: Scaled Dot-Product Attention
def scaled_dot_product_attention(self, Q, K, V, mask=None):
    # Step 1: Calculate attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
    # Step 2: Apply mask (prevent looking at future words)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    # Step 3: Convert to probabilities
    attention_weights = F.softmax(scores, dim=-1)
    # Step 4: Apply attention to values
    output = torch.matmul(attention_weights, V)
    return output, attention_weights
Step 6G: Step-by-Step Attention Calculation
Let's compute attention for "the cat sat" with simplified numbers:
Step 1: Compute scores (QK^T)
# For one head, simplified to d_k=2:
Q = [[0.1, 0.2], # Query for "the"
[0.3, 0.4], # Query for "cat"
[0.5, 0.6]] # Query for "sat"
K = [[0.2, 0.1], # Key for "the"
[0.4, 0.3], # Key for "cat"
[0.6, 0.5]] # Key for "sat"
# Matrix multiplication QK^T:
scores = Q @ K.T = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]] @ [[0.2, 0.4, 0.6],
[0.1, 0.3, 0.5]]
scores = [[0.04, 0.10, 0.16],  # How much "the" attends to [the, cat, sat]
          [0.10, 0.24, 0.38],  # How much "cat" attends to [the, cat, sat]
          [0.16, 0.38, 0.60]]  # How much "sat" attends to [the, cat, sat]
Step 2: Scale by √d_k
scores = scores / math.sqrt(2) = scores / 1.414
scores = [[0.028, 0.071, 0.113],
          [0.071, 0.170, 0.269],
          [0.113, 0.269, 0.424]]
Step 3: Apply causal mask
# Causal mask (prevent looking at future words):
mask = [[1, 0, 0], # "the" can only see "the"
[1, 1, 0], # "cat" can see "the, cat"
[1, 1, 1]] # "sat" can see "the, cat, sat"
# Apply mask (set forbidden positions to -∞):
masked_scores = [[0.028,  -∞,    -∞  ],
                 [0.071, 0.170,  -∞  ],
                 [0.113, 0.269, 0.424]]
Step 4: Softmax to get probabilities
attention_weights = softmax(masked_scores)
# For each row, probabilities sum to 1:
attention_weights = [[1.000, 0.000, 0.000],  # "the" pays 100% attention to itself
                     [0.475, 0.525, 0.000],  # "cat" pays ~48% to "the", ~52% to itself
                     [0.283, 0.331, 0.386]]  # "sat" spreads attention over all three words
Step 5: Apply attention to values
V = [[0.1, 0.3], # Value for "the"
[0.2, 0.4], # Value for "cat"
[0.5, 0.7]] # Value for "sat"
# Weighted combination:
output = attention_weights @ V
# For "the": 1.0*[0.1,0.3] + 0.0*[0.2,0.4] + 0.0*[0.5,0.7] = [0.1, 0.3]
# For "cat": 0.481*[0.1,0.3] + 0.519*[0.2,0.4] + 0.0*[0.5,0.7] = [0.152, 0.351]
# For "sat": 0.307*[0.1,0.3] + 0.340*[0.2,0.4] + 0.353*[0.5,0.7] = [0.276, 0.565]
output = [[0.1, 0.3 ], # Updated representation for "the"
[0.152, 0.351], # Updated representation for "cat"
[0.276, 0.565]] # Updated representation for "sat"
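The same numbers can be checked with a few tensor operations (a standalone sketch using the toy Q, K, V above):
import math
import torch
import torch.nn.functional as F
Q = torch.tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
K = torch.tensor([[0.2, 0.1], [0.4, 0.3], [0.6, 0.5]])
V = torch.tensor([[0.1, 0.3], [0.2, 0.4], [0.5, 0.7]])
scores = Q @ K.T / math.sqrt(2)                         # scaled dot products
mask = torch.tril(torch.ones(3, 3))                     # causal mask
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)                     # each row sums to 1
output = weights @ V
print(weights)   # ≈ [[1, 0, 0], [0.475, 0.525, 0], [0.283, 0.331, 0.386]]
print(output)    # ≈ [[0.100, 0.300], [0.153, 0.353], [0.288, 0.488]]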
Step 6H: Multi-Head Combination
# After all heads compute their outputs:
head_outputs = [
output_head_1, # [batch, seq_len, d_k]
output_head_2, # [batch, seq_len, d_k]
# ... 8 heads total
]
# Concatenate all heads:
attn_output = torch.cat(head_outputs, dim=-1) # [batch, seq_len, d_model]
# Final linear transformation:
output = self.W_o(attn_output)
Step 6I: Why Multiple Heads?
Each head can learn different types of relationships:
# Example with "The cat sat on the mat":
# Head 1: Subject-Verb relationships
# "cat" → "sat" (who is doing the action?)
# Head 2: Spatial relationships
# "sat" → "on" → "mat" (where is the action happening?)
# Head 3: Article-Noun relationships
# "the" → "cat", "the" → "mat" (which specific objects?)
# Head 4: Sequential relationships
# Each word → previous word (word order patterns)
Step 6J: Memory Layout
# What's stored for each attention head:
attention_weights = [
# Head 1:
[[1.0, 0.0, 0.0, 0.0, 0.0], # Word 1 attention distribution
[0.3, 0.7, 0.0, 0.0, 0.0], # Word 2 attention distribution
[0.1, 0.4, 0.5, 0.0, 0.0], # Word 3 attention distribution
[0.2, 0.2, 0.3, 0.3, 0.0], # Word 4 attention distribution
[0.1, 0.2, 0.2, 0.3, 0.2]], # Word 5 attention distribution
# Head 2:
[[1.0, 0.0, 0.0, 0.0, 0.0],
[0.8, 0.2, 0.0, 0.0, 0.0],
# ... different attention pattern
],
# ... 8 heads total
]
# Shape: [num_heads, seq_len, seq_len] = [8, 5, 5]
Step 6K: Causal Mask (Critical for Language Modeling)
def create_causal_mask(seq_length):
    mask = torch.tril(torch.ones(seq_length, seq_length))
    return mask.unsqueeze(0).unsqueeze(0)
# For seq_length=5:
mask = [[1, 0, 0, 0, 0],
[1, 1, 0, 0, 0],
[1, 1, 1, 0, 0],
[1, 1, 1, 1, 0],
[1, 1, 1, 1, 1]]
Why causal mask?
- Prevents model from "cheating" by looking at future words
- Word at position i can only attend to positions 0, 1, ..., i
- Essential for autoregressive generation (predicting next word)
Key Insights:
- Attention = Weighted Average: Each word becomes a weighted combination of all previous words
- Learning What Matters: Model learns which words are important for understanding each position
- Context-Dependent: Same word gets different representations based on context
- Parallel Processing: All positions computed simultaneously (unlike RNNs)
- Multiple Perspectives: Each head learns different types of relationships
The Magic:
After attention, "sat" doesn't just mean "sat" - it means "the cat's action of sitting" because it has been enriched with information from "the" and "cat"!
Step 7: Feed Forward Network - Processing the Attended Information
After attention tells us WHAT to focus on, the feed forward network decides WHAT TO DO with that information.
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.d_model = d_model                     # 128 (input/output dimension)
        self.d_ff = d_ff                           # 512 (hidden dimension - 4x larger!)
        self.linear1 = nn.Linear(d_model, d_ff)    # Expand: 128 → 512
        self.linear2 = nn.Linear(d_ff, d_model)    # Compress: 512 → 128
        self.dropout = nn.Dropout(dropout)
Step 7A: The Concept - Think Then Decide
Real-world analogy: After paying attention to relevant words, you need to "think" about what they mean together.
# After attention: "sat" now contains information about "cat" and "mat"
attended_sat = "cat's_action_of_sitting_on_mat"
# Feed forward network thinks:
# "Hmm, I see a cat + sitting + mat... this suggests a resting action on furniture"
# Then outputs: enhanced_understanding_of_sat
Step 7B: The Architecture - Expand, Activate, Compress
def forward(self, x):
    # Step 1: Expand to larger dimension (more "thinking space")
    x = self.linear1(x)   # [batch, seq_len, 128] → [batch, seq_len, 512]
    # Step 2: Apply ReLU activation (non-linearity)
    x = F.relu(x)         # Remove negative values, keep positive
    # Step 3: Apply dropout (prevent overfitting)
    x = self.dropout(x)
    # Step 4: Compress back to original dimension
    x = self.linear2(x)   # [batch, seq_len, 512] → [batch, seq_len, 128]
    return x
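A quick shape check for this block (a sketch that assumes the FeedForward class as defined here):
import torch
ff = FeedForward(d_model=128, d_ff=512)
x = torch.randn(2, 24, 128)    # [batch, seq_len, d_model]
out = ff(x)
print(out.shape)               # torch.Size([2, 24, 128]) - same shape, richer features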
Step 7C: Why 4x Expansion?
d_model = 128 # Input dimension
d_ff = 512 # Hidden dimension (4x larger)
# Think of it like this:
# Input: "I have 128 pieces of information about this word"
# Expansion: "Let me spread this into 512 thinking slots"
# Processing: "Now I can do complex reasoning in this larger space"
# Compression: "Summarize my thoughts back into 128 pieces"
Why larger dimension helps:
- More parameters = more complex patterns
- More "thinking space" for the model
- Can combine information in sophisticated ways
Step 7D: Step-by-Step Example
Let's process the attended representation of "sat":
# Input: attended "sat" representation
input_sat = [0.3, 0.5, -0.2, 0.8] # Simplified to 4 dimensions
# Step 1: Linear expansion (128 → 512, simplified to 4 → 8)
W1 = [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
[0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
[0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
[0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1]]
expanded = input_sat @ W1 = [0.39, 0.53, 0.67, 0.81, 0.95, 1.09, 1.23, 1.37]
# Step 2: ReLU activation (remove negatives)
activated = relu(expanded) = [0.39, 0.53, 0.67, 0.81, 0.95, 1.09, 1.23, 1.37]
# (All values were positive, so no change)
# Step 3: Dropout (randomly set some to 0 during training; the rescaling of surviving values is omitted here for simplicity)
after_dropout = [0.39, 0.0, 0.67, 0.81, 0.0, 1.09, 1.23, 0.0]  # Random example
# Step 4: Linear compression (8 → 4)
W2 = [[0.1, 0.2, 0.3, 0.4],
[0.2, 0.3, 0.4, 0.5],
[0.3, 0.4, 0.5, 0.6],
[0.4, 0.5, 0.6, 0.7],
[0.5, 0.6, 0.7, 0.8],
[0.6, 0.7, 0.8, 0.9],
[0.7, 0.8, 0.9, 1.0],
[0.8, 0.9, 1.0, 1.1]]
final_output = after_dropout @ W2 = [2.08, 2.50, 2.92, 3.34]
Step 7E: What Feed Forward Actually Learns
The feed forward network learns feature combinations:
# Example patterns the network might learn:
# Pattern 1: "Action + Object" detector
if expanded[0] > 0.5 and expanded[3] > 0.8:
    # This might indicate "action happening to object"
    output[0] = high_value
# Pattern 2: "Spatial relationship" detector
if expanded[1] > 0.6 and expanded[4] > 0.7:
    # This might indicate "spatial positioning"
    output[1] = high_value
# Pattern 3: "Temporal sequence" detector
if expanded[2] > 0.4 and expanded[5] > 0.9:
    # This might indicate "time-based action"
    output[2] = high_value
Step 7F: ReLU Activation - Why It Matters
# Without ReLU (just linear transformations):
# Model can only learn linear relationships
# Example: output = 2*input + 3
# With ReLU:
# Model can learn complex, non-linear patterns
# Example: if input > threshold then activate_pattern_A else activate_pattern_B
def relu_example():
    values = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
    after_relu = [max(0, x) for x in values]
    # Result: [0.0, 0.0, 0.0, 0.5, 1.0, 1.5]
    # Effect: Creates "gates" - some neurons turn off (0), others stay on
Step 7G: Memory Layout
# What gets stored in each layer:
# Linear1 weights: [d_model, d_ff] = [128, 512]
W1 = torch.tensor([
[w1_00, w1_01, w1_02, ..., w1_0511], # How input dim 0 connects to all 512 hidden dims
[w1_10, w1_11, w1_12, ..., w1_1511], # How input dim 1 connects to all 512 hidden dims
# ... 128 rows total
])
# Linear2 weights: [d_ff, d_model] = [512, 128]
W2 = torch.tensor([
[w2_00, w2_01, w2_02, ..., w2_0127], # How hidden dim 0 connects to all 128 output dims
[w2_10, w2_11, w2_12, ..., w2_1127], # How hidden dim 1 connects to all 128 output dims
# ... 512 rows total
])
Step 7H: Position-wise Processing
Key insight: Feed forward processes each position independently!
sentence = "the cat sat on mat"
positions = [pos_0, pos_1, pos_2, pos_3, pos_4]
# Each position goes through the SAME feed forward network:
for position in positions:
enhanced_position = feed_forward(position)
# This means:
# - "cat" at position 1 gets the same processing as "mat" at position 4
# - But their inputs are different (due to attention), so outputs differ
# - No information flows between positions in feed forward
Step 7I: Parameter Count
# Feed forward parameters:
linear1_params = d_model * d_ff = 128 * 512 = 65,536
linear2_params = d_ff * d_model = 512 * 128 = 65,536
total_ff_params = 131,072
# This is typically the LARGEST component of the transformer!
# Much bigger than attention: ~131K vs ~65K parameters
Step 7J: What Happens in Practice
# Input: Attended representations
input_batch = [
[[0.1, 0.3, -0.2, 0.5, ...], # "the" (128 dims)
[0.4, 0.6, 0.1, 0.8, ...], # "cat" (128 dims)
[0.2, 0.9, -0.1, 0.3, ...] # "sat" (128 dims)
],
# ... more examples in batch
]
# After feed forward:
output_batch = [
[[0.2, 0.4, 0.1, 0.7, ...], # Enhanced "the"
[0.3, 0.8, 0.2, 0.9, ...], # Enhanced "cat"
[0.5, 0.6, 0.3, 0.4, ...] # Enhanced "sat"
],
# ... enhanced representations
]
# Each word now has:
# 1. Original word meaning (from embedding)
# 2. Position information (from positional encoding)
# 3. Context from other words (from attention)
# 4. Complex feature combinations (from feed forward)
Key Insights:
- Expansion and compression: Gives model more "thinking space"
- Non-linearity: ReLU enables learning complex patterns
- Position-wise: Each word processed independently
- Feature combination: Learns to combine attended information
- Most parameters: Usually 2/3 of transformer's parameters
The Role in the Big Picture:
- Attention says: "Focus on these words"
- Feed Forward says: "Now that I'm focused, here's what it all means"
Step 8: Transformer Block - Combining Everything with Crucial Tricks
This is where we combine attention + feed forward + some essential tricks that make training actually work!
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        # CRUCIAL COMPONENTS:
        self.norm1 = nn.LayerNorm(d_model)   # Applied before attention (pre-norm)
        self.norm2 = nn.LayerNorm(d_model)   # Applied before feed forward (pre-norm)
        self.dropout = nn.Dropout(dropout)
Step 8A: The Two Essential Tricks
Trick 1: Residual Connections (Skip connections)
Trick 2: Layer Normalization
Without these, deep transformers DON'T WORK AT ALL!
Step 8B: Residual Connections - The Highway for Information
def forward(self, x, mask=None):
    # Attention with residual connection
    attn_output, attention_weights = self.attention(
        self.norm1(x), self.norm1(x), self.norm1(x), mask
    )
    x = x + self.dropout(attn_output)   # ← THIS IS THE RESIDUAL CONNECTION
    # Feed forward with residual connection
    ff_output = self.feed_forward(self.norm2(x))
    x = x + self.dropout(ff_output)     # ← THIS IS THE RESIDUAL CONNECTION
    return x, attention_weights
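A minimal sketch of running one block end to end (it assumes the MultiHeadAttention and FeedForward classes defined in the previous steps):
import torch
block = TransformerBlock(d_model=128, num_heads=8, d_ff=512)
x = torch.randn(2, 24, 128)                                        # [batch, seq_len, d_model]
mask = torch.tril(torch.ones(24, 24)).unsqueeze(0).unsqueeze(0)    # causal mask
out, attn_weights = block(x, mask)
print(out.shape)   # torch.Size([2, 24, 128]) - shape preserved by the residual connections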
Step 8C: Why Residual Connections Are Essential
The Problem: Deep networks suffer from "vanishing gradients"
# Without residual connections (BAD):
x = input_embedding # [0.1, 0.2, 0.3, 0.4]
x = attention_layer(x) # [0.05, 0.08, 0.12, 0.15] (getting smaller)
x = feed_forward_layer(x) # [0.02, 0.03, 0.04, 0.05] (even smaller)
x = another_attention_layer(x) # [0.008, 0.01, 0.015, 0.02] (almost zero!)
# After many layers: [0.0001, 0.0002, 0.0003, 0.0004] (information is lost!)
# With residual connections (GOOD):
x = input_embedding # [0.1, 0.2, 0.3, 0.4]
x = x + attention_layer(x) # [0.1, 0.2, 0.3, 0.4] + [small_changes] = preserved!
x = x + feed_forward_layer(x) # Original info + new info = still rich!
# Information never disappears!
Step 8D: Layer Normalization - Keeping Numbers Stable
class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # Learnable scale
        self.beta = nn.Parameter(torch.zeros(d_model))   # Learnable shift
        self.eps = eps

    def forward(self, x):
        # Calculate mean and (biased) variance across the last dimension,
        # matching nn.LayerNorm and the worked example below
        mean = x.mean(dim=-1, keepdim=True)                   # Average of each embedding
        var = x.var(dim=-1, unbiased=False, keepdim=True)     # Variance of each embedding
        # Normalize: subtract mean, divide by standard deviation
        normalized = (x - mean) / torch.sqrt(var + self.eps)
        # Scale and shift (learnable parameters)
        return self.gamma * normalized + self.beta
Step 8E: Layer Norm Step-by-Step Example
# Input: one word's embedding after attention
x = [2.0, 8.0, 1.0, 5.0] # Unbalanced values!
# Step 1: Calculate statistics
mean = (2.0 + 8.0 + 1.0 + 5.0) / 4 = 4.0
variance = ((2-4)² + (8-4)² + (1-4)² + (5-4)²) / 4 = (4 + 16 + 9 + 1) / 4 = 7.5
std_dev = sqrt(7.5) = 2.74
# Step 2: Normalize (mean=0, std=1)
normalized = [(2.0-4.0)/2.74, (8.0-4.0)/2.74, (1.0-4.0)/2.74, (5.0-4.0)/2.74]
normalized = [-0.73, 1.46, -1.09, 0.36]
# Step 3: Apply learnable parameters
gamma = [1.0, 1.0, 1.0, 1.0] # (learned during training)
beta = [0.0, 0.0, 0.0, 0.0] # (learned during training)
final = gamma * normalized + beta = [-0.73, 1.46, -1.09, 0.36]
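The same arithmetic with PyTorch's built-in nn.LayerNorm (a standalone check; gamma starts at 1 and beta at 0, so the output is just the normalized values):
import torch
import torch.nn as nn
x = torch.tensor([2.0, 8.0, 1.0, 5.0])
ln = nn.LayerNorm(4)
print(ln(x))   # ≈ [-0.7303, 1.4606, -1.0954, 0.3651], matching the values above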
Step 8F: Why Layer Norm Helps
Problem: Neural networks are sensitive to input scale
# Without layer norm:
word1 = [0.1, 0.2, 0.1, 0.15] # Small values
word2 = [10.0, 20.0, 15.0, 25.0] # Large values
# Network treats these VERY differently, even if they represent similar concepts!
# With layer norm:
word1_normalized = [-0.8, 0.8, -0.8, 0.0] # Standardized scale
word2_normalized = [-0.9, 0.9, -0.3, 1.2] # Same scale range
# Now network can focus on patterns, not magnitudes!
Step 8G: Pre-Norm vs Post-Norm
Our implementation uses Pre-Norm (normalize first, then apply layer):
# Pre-Norm (what we use - MORE STABLE):
attn_output = self.attention(self.norm1(x), self.norm1(x), self.norm1(x), mask)
x = x + attn_output
# Post-Norm (older approach - LESS STABLE):
attn_output = self.attention(x, x, x, mask)
x = self.norm1(x + attn_output)
Why Pre-Norm is better:
- More stable gradients
- Easier to train deep models
- Less likely to explode or vanish
Step 8H: Complete Forward Pass Example
# Input: word embeddings with position encoding
input_x = [
[0.5, 0.3, 0.8, 0.2], # "the"
[0.1, 0.9, 0.4, 0.7], # "cat"
[0.6, 0.2, 0.1, 0.8] # "sat"
]
# Step 1: Layer norm before attention
normed_x = layer_norm(input_x)
# Result: normalized versions of each embedding
# Step 2: Multi-head attention
attn_output, weights = attention(normed_x, normed_x, normed_x, mask)
# Result: contextual information for each word
# Step 3: Residual connection + dropout
x = input_x + dropout(attn_output)
# Result: original info + attention info
# Step 4: Layer norm before feed forward
normed_x2 = layer_norm(x)
# Step 5: Feed forward network
ff_output = feed_forward(normed_x2)
# Step 6: Second residual connection + dropout
final_x = x + dropout(ff_output)
# Result: original + attention + feed forward info
# Final output: Each word now has rich, multi-layered representation
Step 8I: Information Flow Visualization
# What each word contains at each step:
# Initial:
"cat" = word_embedding + position_encoding
# After attention:
"cat" = word_embedding + position_encoding + attention_to_context
# After feed forward:
"cat" = word_embedding + position_encoding + attention_to_context + complex_features
# Each layer ADDS information, never replaces it!
Step 8J: Why This Architecture Works
1. Information Preservation:
# Residual connections ensure no information is lost
original_meaning + contextual_info + processed_features = rich_representation
2. Stable Training:
# Layer norm keeps values in good range for learning
no_explosion + no_vanishing = successful_training
3. Parallel Processing:
# All positions processed simultaneously
fast_computation + gpu_efficient = scalable_model
Step 8K: Memory Layout
# What gets stored for a transformer block:
# Layer Norm 1 parameters:
norm1_gamma = [learnable_scale_factor_for_each_dim] # Shape: [128]
norm1_beta = [learnable_bias_for_each_dim] # Shape: [128]
# Attention parameters:
attention_weights = {
'W_q': [128, 128], # Query projection
'W_k': [128, 128], # Key projection
'W_v': [128, 128], # Value projection
'W_o': [128, 128] # Output projection
}
# Layer Norm 2 parameters:
norm2_gamma = [learnable_scale_factor_for_each_dim] # Shape: [128]
norm2_beta = [learnable_bias_for_each_dim] # Shape: [128]
# Feed Forward parameters:
ff_weights = {
'linear1': [128, 512], # Expansion
'linear2': [512, 128] # Compression
}
# Total parameters per block: ~200K parameters
Step 8L: Multiple Blocks Create Depth
# GPT-style model stacks multiple transformer blocks:
input_embeddings
↓
TransformerBlock_1 # Learn basic patterns
↓
TransformerBlock_2 # Learn more complex patterns
↓
TransformerBlock_3 # Learn even more complex patterns
↓
TransformerBlock_4 # Learn very sophisticated patterns
↓
output_predictions
# Each layer builds on the previous layer's understanding
Key Insights:
- Residual connections = Information highway (nothing gets lost)
- Layer normalization = Keeps training stable
- Pre-norm = Better for deep models
- Additive nature = Each layer enriches representation
- Parallel processing = All positions computed together
The Magic:
After going through a transformer block, each word doesn't just know about itself - it knows about its context, its relationships, and complex patterns, while still retaining its original meaning!
Step 9: Complete GPT Model - Putting It All Together
Now we combine everything into a complete language model that can generate text!
class SimpleGPT(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        # 1. Convert token IDs to dense vectors
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # 2. Add position information
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)
        self.dropout = nn.Dropout(dropout)
        # 3. Stack multiple transformer blocks
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        # 4. Final processing
        self.ln_final = nn.LayerNorm(d_model)
        # 5. Convert back to vocabulary probabilities
        self.output_head = nn.Linear(d_model, vocab_size)
Step 9A: The Complete Information Flow
# Input: Token IDs
input_ids = [8, 9, 234, 4, 67] # "alice was beginning to get"
# Step 1: Token Embedding
token_embeddings = [
[0.1, 0.2, 0.3, ..., 0.5], # "alice" → 128-dim vector
[0.4, 0.3, 0.8, ..., 0.2], # "was" → 128-dim vector
[0.2, 0.9, 0.1, ..., 0.7], # "beginning" → 128-dim vector
[0.6, 0.1, 0.4, ..., 0.3], # "to" → 128-dim vector
[0.3, 0.7, 0.2, ..., 0.9] # "get" → 128-dim vector
]
# Shape: [1, 5, 128] = [batch_size, seq_length, d_model]
# Step 2: Add positional encoding
x = token_embeddings + positional_encoding
# Each word now knows WHAT it is and WHERE it is
# Step 3: Pass through transformer blocks
for transformer_block in self.transformer_blocks:
x, attention_weights = transformer_block(x, mask)
# Each layer adds more sophisticated understanding
# Step 4: Final layer normalization
x = self.ln_final(x)
# Step 5: Convert to vocabulary predictions
logits = self.output_head(x) # [1, 5, 800] = [batch, seq_len, vocab_size]
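For reference, here is a plausible forward method that wires these five steps together. It is a sketch consistent with the components above (create_causal_mask, the embedding, the blocks, the final norm and output head), not necessarily a line-for-line copy of the original script.
def forward(self, input_ids, targets=None):
    batch_size, seq_length = input_ids.shape
    mask = create_causal_mask(seq_length).to(input_ids.device)   # [1, 1, seq, seq]
    x = self.token_embedding(input_ids)       # 1. token IDs -> vectors
    x = self.positional_encoding(x)           # 2. add position information
    x = self.dropout(x)
    attention_maps = []
    for block in self.transformer_blocks:     # 3. stacked transformer blocks
        x, attn = block(x, mask)
        attention_maps.append(attn)
    x = self.ln_final(x)                      # 4. final layer norm
    logits = self.output_head(x)              # 5. scores over the vocabulary
    loss = None
    if targets is not None:
        loss = F.cross_entropy(logits.view(-1, self.vocab_size),
                               targets.view(-1), ignore_index=-1)
    return logits, loss, attention_maps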
Step 9B: Token Embedding - Converting Numbers to Vectors
# Embedding lookup table:
token_embedding = nn.Embedding(vocab_size=800, d_model=128)
# What this creates:
embedding_table = [
[0.1, 0.2, 0.3, ..., 0.8], # Embedding for token 0 (<PAD>)
[0.4, 0.1, 0.9, ..., 0.2], # Embedding for token 1 (<UNK>)
[0.7, 0.3, 0.1, ..., 0.5], # Embedding for token 2 (<BOS>)
# ... 800 rows total, one for each word in vocabulary
[0.2, 0.8, 0.4, ..., 0.1] # Embedding for token 799
]
# Shape: [800, 128]
# Lookup process:
input_ids = [8, 9, 234] # "alice was beginning"
embeddings = [
embedding_table[8], # Get row 8 for "alice"
embedding_table[9], # Get row 9 for "was"
embedding_table[234] # Get row 234 for "beginning"
]
Step 9C: The Causal Mask - Preventing Cheating
def create_causal_mask(seq_length):
    mask = torch.tril(torch.ones(seq_length, seq_length))
    return mask.unsqueeze(0).unsqueeze(0)
# For "alice was beginning to get":
mask = [
[1, 0, 0, 0, 0], # "alice" can only see "alice"
[1, 1, 0, 0, 0], # "was" can see "alice, was"
[1, 1, 1, 0, 0], # "beginning" can see "alice, was, beginning"
[1, 1, 1, 1, 0], # "to" can see "alice, was, beginning, to"
[1, 1, 1, 1, 1] # "get" can see all previous words
]
Why this is crucial:
# Without causal mask (CHEATING):
# When predicting what comes after "alice was", model can see "beginning to get"
# This makes training useless - model learns to copy, not predict!
# With causal mask (PROPER TRAINING):
# When predicting what comes after "alice was", model can only see "alice was"
# Model must actually learn language patterns!
Step 9D: Output Head - Converting to Predictions
# After all transformer blocks:
final_representations = [
[0.3, 0.7, 0.1, ..., 0.9], # Rich representation of "alice"
[0.8, 0.2, 0.5, ..., 0.1], # Rich representation of "was"
[0.1, 0.9, 0.3, ..., 0.6], # Rich representation of "beginning"
[0.4, 0.1, 0.8, ..., 0.2], # Rich representation of "to"
[0.6, 0.3, 0.1, ..., 0.7] # Rich representation of "get"
]
# Shape: [1, 5, 128]
# Output head: Linear layer [128 → 800]
logits = self.output_head(final_representations)
# Result: Predictions for each position
logits = [
[2.3, 0.1, 0.8, ..., 1.2], # Predictions for position 0 (after "alice")
[0.5, 3.1, 0.2, ..., 0.9], # Predictions for position 1 (after "was")
[1.1, 0.4, 2.8, ..., 0.3], # Predictions for position 2 (after "beginning")
[0.7, 1.9, 0.1, ..., 2.5], # Predictions for position 3 (after "to")
[2.1, 0.3, 1.4, ..., 0.6] # Predictions for position 4 (after "get")
]
# Shape: [1, 5, 800] - Each position predicts over full vocabulary
Step 9E: Training Target Alignment
# Input sequence: [8, 9, 234, 4, 67] "alice was beginning to get"
# Target sequence: [9, 234, 4, 67, 12] "was beginning to get very"
# Training alignment:
# Position 0: Input="alice" Target="was" → Learn: alice → was
# Position 1: Input="was" Target="beginning" → Learn: was → beginning
# Position 2: Input="beginning" Target="to" → Learn: beginning → to
# Position 3: Input="to" Target="get" → Learn: to → get
# Position 4: Input="get" Target="very" → Learn: get → very
Step 9F: Loss Calculation
def forward(self, input_ids, targets=None):
    # ... (all the processing steps)
    loss = None
    if targets is not None:
        # Reshape for cross-entropy loss
        logits_flat = logits.view(-1, self.vocab_size)   # [batch*seq_len, vocab_size]
        targets_flat = targets.view(-1)                  # [batch*seq_len]
        # Calculate cross-entropy loss
        loss = F.cross_entropy(logits_flat, targets_flat, ignore_index=-1)
    return logits, loss, attention_maps
Cross-entropy loss explained:
# For one prediction:
predicted_logits = [2.3, 0.1, 0.8, 1.2, ...] # Raw scores for each word
target_word_id = 9 # Correct answer is word 9
# Convert logits to probabilities
probabilities = softmax(predicted_logits) # [0.82, 0.01, 0.03, 0.05, ...]
# Loss = -log(probability of correct word)
loss = -log(probabilities[9])
# If model predicts correctly: probability[9] = 0.9 → loss = -log(0.9) = 0.1 (low)
# If model predicts wrong: probability[9] = 0.1 → loss = -log(0.1) = 2.3 (high)
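A two-line check of that relationship with PyTorch (standalone sketch with a 3-word vocabulary):
import torch
import torch.nn.functional as F
logits = torch.tensor([[2.0, 0.5, 4.0]])             # raw scores for a 3-word vocabulary
print(F.cross_entropy(logits, torch.tensor([2])))    # ≈ 0.15 - correct word has the highest score
print(F.cross_entropy(logits, torch.tensor([1])))    # ≈ 3.65 - correct word has a low score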
Step 9G: Weight Initialization - Starting Smart
def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    elif isinstance(module, nn.LayerNorm):
        torch.nn.init.zeros_(module.bias)
        torch.nn.init.ones_(module.weight)
Why careful initialization matters:
# Bad initialization (random large values):
weights = [[-5.0, 8.2, -3.1, 9.7], ...]
# → Exploding gradients, unstable training
# Good initialization (small random values):
weights = [[0.02, -0.01, 0.03, -0.02], ...]
# → Stable gradients, successful training
Step 9H: Model Configuration Example
config = {
'vocab_size': 800, # Number of unique words
'd_model': 128, # Embedding dimension
'num_heads': 8, # Attention heads (128/8 = 16 dims per head)
'num_layers': 4, # Transformer blocks
'd_ff': 512, # Feed forward hidden dimension (4x d_model)
'max_seq_length': 256, # Maximum sequence length
'dropout': 0.1 # Dropout rate
}
# Parameter count (approximate):
# Embeddings: 800 × 128 = 102,400
# 4 Transformer blocks: 4 × ~200,000 = ~800,000
# Output head: 128 × 800 = 102,400
# Total: ~1M parameters
Step 9I: Generation Process
def generate_text(model, tokenizer, prompt="alice", max_length=20):
    model.eval()
    # Start with prompt
    input_ids = torch.tensor([tokenizer.encode(prompt)])   # [1, prompt_length]
    for _ in range(max_length):
        # Forward pass
        logits, _, _ = model(input_ids)        # [1, current_length, vocab_size]
        # Get predictions for last position only
        next_token_logits = logits[0, -1, :]   # [vocab_size]
        # Convert to probabilities
        probs = F.softmax(next_token_logits, dim=-1)
        # Sample next token
        next_token = torch.multinomial(probs, num_samples=1)   # [1]
        # Add to sequence
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=1)
    # Decode back to text
    return tokenizer.decode(input_ids[0].tolist())
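The feature list at the top mentions top-k sampling. One common way to add it (a sketch, not necessarily the original implementation) is to keep only the k most likely tokens before sampling:
import torch
import torch.nn.functional as F
def sample_top_k(next_token_logits, k=20, temperature=1.0):
    logits = next_token_logits / temperature
    top_values, top_indices = torch.topk(logits, k)     # keep the k best-scoring tokens
    probs = F.softmax(top_values, dim=-1)                # probabilities over the top-k only
    choice = torch.multinomial(probs, num_samples=1)     # sample an index into the top-k
    return top_indices[choice]                           # map back to a vocabulary ID
Inside generate_text, the line next_token = torch.multinomial(probs, num_samples=1) would then become next_token = sample_top_k(next_token_logits).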
Step 9J: Memory Layout During Forward Pass
# Forward pass memory usage:
batch_size = 2
seq_length = 24
d_model = 128
vocab_size = 800
# Step-by-step memory:
input_ids: [2, 24] # Input token IDs
embeddings: [2, 24, 128] # After token embedding
pos_encoded: [2, 24, 128] # After positional encoding
transformer_out: [2, 24, 128] # After transformer blocks
logits: [2, 24, 800] # After output head
# Peak memory: mainly from logits (largest tensor)
# 2 × 24 × 800 × 4 bytes = ~150KB for this batch
Key Insights:
- Modular design - Each component has a clear purpose
- Information flow - Token → Embedding → Position → Transform → Predict
- Causal masking - Ensures proper language modeling
- Parallel processing - All positions computed simultaneously
- Scalable architecture - Can adjust size by changing config parameters
The Complete Picture:
Your GPT model can now:
- Convert text to numbers (tokenization)
- Understand word meanings (embeddings)
- Track word positions (positional encoding)
- Focus on relevant context (attention)
- Process information (feed forward)
- Generate predictions (output head)
- Learn from data (training loop)
Ready for Training - where we teach the model to predict the next word?
Step 10: Training - Teaching the Model to Predict the Next Word
This is where the magic happens! We feed the model thousands of examples and it learns to understand language patterns.
class GPTTrainer:
    def __init__(self, model, train_loader, val_loader, tokenizer, device):
        self.model = model
        self.train_loader = train_loader   # Training examples
        self.val_loader = val_loader       # Validation examples
        self.tokenizer = tokenizer
        self.device = device
        # Track progress
        self.train_losses = []
        self.val_losses = []
        self.learning_rates = []
Step 10A: The Training Concept
Core idea: Show the model millions of examples where it guesses the next word, then tell it if it was right or wrong.
# Training example:
input_sequence = "alice was beginning to"
target_sequence = "was beginning to get"
# Model learns:
# Given "alice" → predict "was"
# Given "alice was" → predict "beginning"
# Given "alice was beginning" → predict "to"
# Given "alice was beginning to" → predict "get"
Step 10B: Single Training Step
def train_step(self, batch):
    input_ids, targets = batch
    # 1. Move data to GPU/device
    input_ids = input_ids.to(self.device)   # [batch_size, seq_length]
    targets = targets.to(self.device)       # [batch_size, seq_length]
    # 2. Zero out previous gradients
    self.optimizer.zero_grad()
    # 3. Forward pass - make predictions
    logits, loss, _ = self.model(input_ids, targets)
    # logits: [batch_size, seq_length, vocab_size] - model's guesses
    # loss: scalar - how wrong the model was
    # 4. Backward pass - calculate gradients
    loss.backward()
    # 5. Clip gradients (prevent exploding)
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
    # 6. Update weights
    self.optimizer.step()
    return loss.item()
Step 10C: Detailed Forward Pass During Training
# Example batch:
input_ids = [
[8, 9, 234, 4, 67], # "alice was beginning to get"
[156, 23, 89, 12, 45] # "she said hello very loud"
]
targets = [
[9, 234, 4, 67, 12], # "was beginning to get very"
[23, 89, 12, 45, 78] # "said hello very loud again"
]
# Forward pass:
logits, loss, _ = model(input_ids, targets)
# What happens inside:
# 1. Convert IDs to embeddings
# 2. Add positional encoding
# 3. Pass through transformer layers
# 4. Get predictions for each position
# 5. Compare predictions with targets using cross-entropy loss
Step 10D: Loss Calculation Deep Dive
# For each position in each example:
# Example 1, Position 0:
input_context = "alice"
model_prediction = [0.1, 0.8, 0.05, 0.02, ...] # Probabilities for each word
correct_answer = 9 # "was"
position_loss = -log(model_prediction[9]) = -log(0.8) = 0.22
# Example 1, Position 1:
input_context = "alice was"
model_prediction = [0.05, 0.1, 0.7, 0.1, ...]
correct_answer = 234 # "beginning"
position_loss = -log(model_prediction[234]) = -log(0.7) = 0.36
# Total loss = average of all position losses across all examples in batch
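Those per-position numbers are just negative log probabilities. A minimal sketch (with made-up probabilities for the correct words):
import torch
# Probability the model assigned to the correct word at each position
correct_word_probs = torch.tensor([0.8, 0.7])
position_losses = -torch.log(correct_word_probs)
print(position_losses)         # tensor([0.2231, 0.3567])
print(position_losses.mean())  # tensor(0.2899) -- the batch loss is the average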
Step 10E: Gradient Calculation and Backpropagation
# After loss.backward(), each parameter gets a gradient:
# Example: One weight in attention layer
weight_value = 0.5
gradient = -0.02 # Negative gradient tells us: "increase this weight slightly to reduce the loss"
# Weight update:
learning_rate = 0.001
new_weight = weight_value - learning_rate * gradient
new_weight = 0.5 - 0.001 * (-0.02) = 0.5 + 0.00002 = 0.50002
# This happens for ALL 1.3 million parameters simultaneously!
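Here is a toy end-to-end example of loss.backward() filling in a gradient and the optimizer-style update that follows (a single made-up weight, not the real model):
import torch
w = torch.tensor(0.5, requires_grad=True)
x, target = torch.tensor(2.0), torch.tensor(1.2)
loss = (w * x - target) ** 2   # Simple squared error
loss.backward()                # Autograd computes d(loss)/dw
print(w.grad)                  # tensor(-0.8000) -> negative, so increase w to reduce the loss
with torch.no_grad():
    w -= 0.001 * w.grad        # Same "weight - lr * gradient" rule the optimizer applies
print(w)                       # tensor(0.5008, requires_grad=True)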
Step 10F: Why Gradient Clipping Is Essential
# Without gradient clipping (BAD):
gradients = [100.0, -80.0, 150.0, -200.0, ...] # Huge gradients!
learning_rate = 0.001
weight_updates = learning_rate * gradients = [0.1, -0.08, 0.15, -0.2, ...]
# Weights change dramatically → model becomes unstable
# With gradient clipping (GOOD):
original_gradients = [100.0, -80.0, 150.0, -200.0, ...]
gradient_norm = sqrt(100² + 80² + 150² + 200²) = 280.9
clip_norm = 1.0
if gradient_norm > clip_norm:
    clipped_gradients = original_gradients * (clip_norm / gradient_norm)
    clipped_gradients = [0.36, -0.28, 0.53, -0.71, ...]  # Much smaller!
# Result: Stable training
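You can verify this arithmetic with PyTorch's own clipping utility (one made-up parameter whose gradient holds the four values above):
import torch
p = torch.nn.Parameter(torch.zeros(4))
p.grad = torch.tensor([100.0, -80.0, 150.0, -200.0])
total_norm = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
print(total_norm)  # tensor(280.8914) -- the gradient norm before clipping
print(p.grad)      # tensor([ 0.3560, -0.2848,  0.5340, -0.7120]) -- rescaled to norm 1.0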
Step 10G: Complete Training Epoch
def train_epoch(self, optimizer, epoch):
self.model.train() # Set model to training mode
total_loss = 0
num_batches = 0
for batch_idx, (input_ids, targets) in enumerate(self.train_loader):
# Move to device
input_ids = input_ids.to(self.device) # [8, 24] - 8 examples, 24 tokens each
targets = targets.to(self.device) # [8, 24]
# Training step
batch_loss = self.train_step((input_ids, targets))
total_loss += batch_loss
num_batches += 1
# Print progress every 50 batches
if batch_idx % 50 == 0:
avg_loss = total_loss / num_batches
print(f"Batch {batch_idx}: Loss = {batch_loss:.4f}, Avg = {avg_loss:.4f}")
return total_loss / num_batches # Average loss for the epoch
Step 10H: Validation - Testing Without Learning
def validate(self):
self.model.eval() # Set to evaluation mode (disables dropout)
total_loss = 0
num_batches = 0
with torch.no_grad(): # Don't calculate gradients (saves memory)
for input_ids, targets in self.val_loader:
input_ids = input_ids.to(self.device)
targets = targets.to(self.device)
# Forward pass only (no backward pass)
logits, loss, _ = self.model(input_ids, targets)
total_loss += loss.item()
num_batches += 1
return total_loss / num_batches
Why validation matters:
# Training loss: "How well does the model do on data it has seen?"
# Validation loss: "How well does the model generalize to new data?"
# Good scenario:
train_loss = 2.1
val_loss = 2.3
# Model is learning and generalizing well
# Overfitting scenario:
train_loss = 1.2 # Very low
val_loss = 3.8 # Much higher
# Model memorized training data but can't generalize
Step 10I: Learning Rate Scheduling
# Learning rate scheduler automatically adjusts learning rate:
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
optimizer, mode='min', factor=0.5, patience=2
)
# How it works:
# Epoch 1: val_loss = 3.5, lr = 0.001
# Epoch 2: val_loss = 3.2, lr = 0.001 (improving, keep same lr)
# Epoch 3: val_loss = 3.1, lr = 0.001 (still improving)
# Epoch 4: val_loss = 3.2, lr = 0.001 (got worse, patience = 1)
# Epoch 5: val_loss = 3.3, lr = 0.001 (got worse again, patience = 0)
# Epoch 6: val_loss = 3.4, lr = 0.0005 (reduce lr by factor of 0.5)
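The same schedule can be reproduced with a tiny standalone sketch (a dummy parameter and made-up validation losses, just to drive the scheduler):
import torch
from torch import optim
param = torch.nn.Parameter(torch.zeros(1))
optimizer = optim.AdamW([param], lr=1e-3)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2)
for epoch, val_loss in enumerate([3.5, 3.2, 3.1, 3.2, 3.3, 3.4], start=1):
    scheduler.step(val_loss)
    print(f"Epoch {epoch}: val_loss = {val_loss}, lr = {optimizer.param_groups[0]['lr']}")
# The learning rate drops to 0.0005 at epoch 6, once the two epochs of patience are used up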
Step 10J: Complete Training Loop
def train(self, num_epochs=5, learning_rate=1e-3):
# Create optimizer (stored on self so train_step() can access it as self.optimizer)
optimizer = optim.AdamW(self.model.parameters(), lr=learning_rate, weight_decay=0.01)
self.optimizer = optimizer
# Create scheduler
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2)
best_val_loss = float('inf')
for epoch in range(num_epochs):
print(f"Epoch {epoch + 1}/{num_epochs}")
# Training phase
train_loss = self.train_epoch(optimizer, epoch)
# Validation phase
val_loss = self.validate()
# Update learning rate
old_lr = optimizer.param_groups[0]['lr']
scheduler.step(val_loss)
current_lr = optimizer.param_groups[0]['lr']
# Record progress (these lists were created in __init__)
self.train_losses.append(train_loss)
self.val_losses.append(val_loss)
self.learning_rates.append(current_lr)
# Save best model
if val_loss < best_val_loss:
best_val_loss = val_loss
torch.save(self.model.state_dict(), 'best_model.pth')
print(f"New best model saved! Val Loss: {val_loss:.4f}")
# Print epoch results
print(f"Train Loss: {train_loss:.4f}")
print(f"Val Loss: {val_loss:.4f}")
print(f"Learning Rate: {current_lr:.2e}")
# Test generation every epoch
sample_text = self.generate_sample("alice was", max_length=10)
print(f"Sample: '{sample_text}'")
print("-" * 50)
Step 10K: What the Model Learns Over Time
# Epoch 1 (Random initialization):
# Input: "alice was"
# Output: "chocolate banana computer elephant"
# Loss: 4.7 (very bad)
# Epoch 2 (Starting to learn):
# Input: "alice was"
# Output: "alice was was the the"
# Loss: 3.8 (still bad, but learning word repetition)
# Epoch 3 (Learning grammar):
# Input: "alice was"
# Output: "alice was very tired and"
# Loss: 2.1 (much better! Learning proper grammar)
# Final model:
# Input: "alice was"
# Output: "alice was beginning to get very tired of sitting"
# Loss: 1.8 (good! Generating coherent text)
Step 10L: Memory Usage During Training
# Training memory breakdown (batch_size=8, seq_length=24):
# Forward pass:
activations_memory = batch_size * seq_length * d_model * num_layers * 4_bytes
activations_memory = 8 * 24 * 128 * 4 * 4 = ~400KB
# Gradients (same size as parameters):
gradient_memory = num_parameters * 4_bytes = 1.3M * 4 = ~5MB
# Optimizer state (AdamW keeps momentum and variance for each parameter):
optimizer_memory = num_parameters * 8_bytes = 1.3M * 8 = ~10MB
# Total training memory: ~15-20MB (very manageable!)
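A rough sketch for checking these numbers against the real model (assuming model is the SimpleGPT instance; float32 weights at 4 bytes each, AdamW keeping two extra float32 buffers per weight):
num_parameters = sum(p.numel() for p in model.parameters())
param_bytes     = num_parameters * 4   # float32 weights
gradient_bytes  = num_parameters * 4   # one gradient per weight
optimizer_bytes = num_parameters * 8   # AdamW: momentum + variance per weight
total_mb = (param_bytes + gradient_bytes + optimizer_bytes) / 1024 ** 2
print(f"{num_parameters:,} parameters -> roughly {total_mb:.1f} MB (plus activations)")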
Step 10M: Training Progress Visualization
# What you see during training:
# Epoch 1/3
# Batch 0: Loss = 4.712, Avg = 4.712
# Batch 50: Loss = 3.891, Avg = 4.201
# Batch 100: Loss = 3.456, Avg = 3.876
# Train Loss: 3.654
# Val Loss: 3.812
# Sample: 'alice was the the cat cat'
# Epoch 2/3
# Batch 0: Loss = 3.234, Avg = 3.234
# Batch 50: Loss = 2.891, Avg = 3.045
# Batch 100: Loss = 2.567, Avg = 2.789
# Train Loss: 2.634
# Val Loss: 2.945
# Sample: 'alice was very tired of sitting'
# Epoch 3/3
# Batch 0: Loss = 2.345, Avg = 2.345
# Batch 50: Loss = 2.123, Avg = 2.234
# Batch 100: Loss = 1.987, Avg = 2.089
# Train Loss: 1.967
# Val Loss: 2.123
# Sample: 'alice was beginning to get very tired'
Key Insights:
- Iterative learning - Model gets better with each epoch
- Gradient-based optimization - Small weight updates accumulate into intelligence
- Validation prevents overfitting - Ensures model generalizes
- Loss decreases over time - Quantitative measure of improvement
- Text quality improves - Qualitative measure of learning
The Training Miracle:
Through millions of tiny weight adjustments, your model learns to:
- Understand grammar rules
- Form coherent sentences
- Follow narrative patterns
- Generate contextually appropriate text
Ready for Text Generation - where we see the trained model in action?
Step 11: Text Generation - Seeing Your Trained Model in Action
This is the exciting part! Your trained model can now generate human-like text by predicting one word at a time.
def generate_text(model, tokenizer, prompt="alice", max_length=50, temperature=1.0, top_k=10):
model.eval() # Set to evaluation mode
# Start with the prompt
input_ids = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long).to(device)
generated_ids = input_ids.clone()
with torch.no_grad(): # Don't need gradients for generation
for step in range(max_length):
# Get model predictions
logits, _, _ = model(generated_ids)
# Get predictions for the last position only
next_token_logits = logits[0, -1, :] / temperature
# Apply top-k sampling
if top_k > 0:
top_k_logits, top_k_indices = torch.topk(next_token_logits, min(top_k, len(next_token_logits)))
filtered_logits = torch.full_like(next_token_logits, float('-inf'))
filtered_logits[top_k_indices] = top_k_logits
next_token_logits = filtered_logits
# Convert to probabilities and sample
probs = F.softmax(next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
# Add to sequence
generated_ids = torch.cat([generated_ids, next_token.unsqueeze(0)], dim=1)
# Convert back to text
return tokenizer.decode(generated_ids[0].tolist())
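Calling it looks like this (assuming model and tokenizer are the trained objects from the previous steps; the exact output changes every run because sampling is random):
sample = generate_text(model, tokenizer, prompt="alice was", max_length=30, temperature=0.8, top_k=10)
print(sample)  # e.g. "alice was beginning to get very tired of sitting by her sister ..."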
Step 11A: The Generation Process Step-by-Step
# Starting prompt: "alice was"
prompt = "alice was"
input_ids = [8, 9] # "alice" = 8, "was" = 9
# Step 1: First prediction
current_sequence = [8, 9] # "alice was"
logits, _, _ = model(current_sequence)
next_word_logits = logits[0, -1, :] # Predictions after "was"
# Model outputs probabilities for each word:
probabilities = {
234: 0.25, # "beginning" - 25% probability
67: 0.20, # "very" - 20% probability
45: 0.15, # "tired" - 15% probability
156: 0.12, # "sitting" - 12% probability
89: 0.10, # "walking" - 10% probability
# ... other words get remaining 18%
}
# Sample: Let's say we pick "beginning" (ID 234)
current_sequence = [8, 9, 234] # "alice was beginning"
# Step 2: Second prediction
logits, _, _ = model(current_sequence)
next_word_logits = logits[0, -1, :] # Predictions after "beginning"
probabilities = {
4: 0.35, # "to" - 35% probability (very likely after "beginning")
67: 0.20, # "very" - 20% probability
12: 0.15, # "her" - 15% probability
# ... etc
}
# Sample: Pick "to" (ID 4)
current_sequence = [8, 9, 234, 4] # "alice was beginning to"
# Continue this process for max_length iterations...
Step 11B: Temperature - Controlling Creativity
Temperature controls how "creative" vs "conservative" the model is:
# Original logits (raw model outputs):
raw_logits = [3.0, 2.5, 1.0, 0.5, 0.2]
# Temperature = 0.1 (VERY CONSERVATIVE):
scaled_logits = [30.0, 25.0, 10.0, 5.0, 2.0]    # Divide by 0.1
probabilities = [0.99, 0.01, 0.00, 0.00, 0.00]  # Almost always picks the best word
# Output: "alice was beginning to get very tired of sitting by her sister"
# Temperature = 1.0 (BALANCED):
scaled_logits = [3.0, 2.5, 1.0, 0.5, 0.2]       # No change
probabilities = [0.53, 0.32, 0.07, 0.04, 0.03]  # Reasonable distribution
# Output: "alice was beginning to feel quite sleepy and drowsy"
# Temperature = 2.0 (VERY CREATIVE):
scaled_logits = [1.5, 1.25, 0.5, 0.25, 0.1]     # Divide by 2.0
probabilities = [0.37, 0.29, 0.14, 0.11, 0.09]  # Much flatter, more random
# Output: "alice was purple elephants dancing rainbow yesterday mountains"
Step 11C: Top-K Sampling - Quality Control
Top-K sampling only considers the K most likely words:
# All word probabilities:
all_probs = {
234: 0.25, # "beginning"
67: 0.20, # "very"
45: 0.15, # "tired"
156: 0.12, # "sitting"
89: 0.10, # "walking"
23: 0.05, # "happy"
78: 0.03, # "sad"
445: 0.02, # "elephant" ← Weird word!
167: 0.01, # "purple" ← Very weird!
# ... 791 more words with tiny probabilities
}
# Top-K = 5 sampling:
# Only consider top 5 words: [234, 67, 45, 156, 89]
# Renormalize their probabilities:
top_k_probs = {
234: 0.25/0.82 = 0.30, # "beginning"
67: 0.20/0.82 = 0.24, # "very"
45: 0.15/0.82 = 0.18, # "tired"
156: 0.12/0.82 = 0.15, # "sitting"
89: 0.10/0.82 = 0.12, # "walking"
}
# Benefits:
# - Prevents weird words like "elephant" or "purple"
# - Maintains creativity within reasonable bounds
# - Much better text quality
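The renormalization step in code (made-up probabilities matching the table above; generate_text does the equivalent filtering on the logits before the softmax):
import torch
all_probs = torch.tensor([0.25, 0.20, 0.15, 0.12, 0.10, 0.05, 0.03, 0.02, 0.01])
top_k_probs, top_k_indices = torch.topk(all_probs, k=5)
renormalized = top_k_probs / top_k_probs.sum()   # Divide by 0.82 so the top 5 sum to 1
print(renormalized)  # tensor([0.3049, 0.2439, 0.1829, 0.1463, 0.1220])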
Step 11D: Different Generation Strategies
# 1. GREEDY DECODING (always pick the best word):
def greedy_decode(next_token_logits):
    probs = F.softmax(next_token_logits, dim=-1)
    return torch.argmax(probs)  # Always pick the highest probability
# Result: Deterministic but boring
# "alice was beginning to get very tired of sitting by her sister on the bank"
# 2. RANDOM SAMPLING:
def random_sample(next_token_logits):
    probs = F.softmax(next_token_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # Sample according to the probabilities
# Result: Creative but sometimes incoherent
# "alice was beginning to purple elephant dance mountain yesterday"
# 3. TOP-K + TEMPERATURE (our approach):
def top_k_temperature_sample(next_token_logits, temperature, k):
    scaled_logits = next_token_logits / temperature              # Apply temperature
    top_k_logits, top_k_indices = torch.topk(scaled_logits, k)   # Apply top-k filtering
    # ... then sample from the filtered distribution (see generate_text above)
# Result: Good balance of quality and creativity
# "alice was beginning to feel quite drowsy and sleepy in the warm afternoon sun"
Step 11E: Real Example - Before vs After Training
# BEFORE TRAINING (random weights):
prompt = "alice was"
# Model output: "alice was purple mountain chocolate elephant computer banana"
# Explanation: Model has no understanding, just random word generation
# AFTER TRAINING (learned patterns):
prompt = "alice was"
# Model output: "alice was beginning to get very tired of sitting by her sister"
# Explanation: Model learned:
# - "alice" is often followed by "was" (subject-verb pattern)
# - "beginning to" is a common phrase
# - "tired of sitting" makes semantic sense
# - Overall sentence structure follows English grammar
Step 11F: What the Model Actually Learned
Through training, your model internalized these patterns:
# 1. GRAMMATICAL PATTERNS:
# Subject → Verb: "alice" → "was", "cat" → "sat"
# Article → Noun: "the" → "cat", "a" → "book"
# Adjective → Noun: "big" → "house", "red" → "car"
# 2. SEMANTIC RELATIONSHIPS:
# Actions → Objects: "sat" → "chair/mat", "read" → "book"
# Spatial: "on" → "table/floor", "in" → "house/box"
# Temporal: "then" → past_events, "will" → future_events
# 3. NARRATIVE FLOW:
# Story beginnings: "once upon" → "a time"
# Character actions: "alice" → walking/sitting/thinking
# Dialogue patterns: "said" → quotes, questions → answers
# 4. WORLD KNOWLEDGE:
# People sit on chairs, not walls
# Books are read, not eaten
# Day comes before night
# Characters have consistent behaviors
Step 11G: Generation Quality Metrics
# How to evaluate generation quality:
# 1. PERPLEXITY (mathematical measure):
# Perplexity = exp(cross-entropy loss); lower perplexity = better predictions
# Before training: perplexity ≈ 800 (random guessing over the 800-word vocabulary)
# After training: perplexity ≈ exp(2.1) ≈ 8 (matches the final validation loss above)
# 2. HUMAN EVALUATION:
# Fluency: Does it sound natural? (1-5 scale)
# Coherence: Does it make sense? (1-5 scale)
# Relevance: Does it follow from the prompt? (1-5 scale)
# 3. AUTOMATED METRICS:
# BLEU score: Compared to reference text
# ROUGE score: Content overlap measures
# Sentence similarity: Semantic coherence
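Because perplexity is just the exponential of the cross-entropy loss, you can read it directly off the validation loss (using the final value from the training log above):
import math
val_loss = 2.123                # Final validation loss from the training log
perplexity = math.exp(val_loss)
print(round(perplexity, 1))     # 8.4 -- roughly as uncertain as choosing among 8 words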
Step 11H: Interactive Generation Example
# Live generation session:
print("GPT Model Ready! Type prompts to generate text.")
while True:
prompt = input("Enter prompt: ")
if prompt == "quit":
break
# Generate with different settings
conservative = generate_text(model, tokenizer, prompt, max_length=20, temperature=0.7, top_k=5)
creative = generate_text(model, tokenizer, prompt, max_length=20, temperature=1.2, top_k=15)
print(f"Conservative: {conservative}")
print(f"Creative: {creative}")
print("-" * 50)
# Example session:
# Enter prompt: alice was
# Conservative: alice was beginning to get very tired of sitting by her sister on the bank
# Creative: alice was feeling quite drowsy and started to wonder about the peculiar rabbit
# Enter prompt: once upon
# Conservative: once upon a time there was a little girl who lived in a small house
# Creative: once upon a magical evening, strange creatures began dancing under the moonlight
Step 11I: Memory Usage During Generation
# Generation is much lighter than training:
# No gradients needed: 0 MB (vs ~10MB during training)
# No optimizer state: 0 MB (vs ~10MB during training)
# Only forward pass: ~1MB for activations
# Growing sequence: starts small, grows with each token
# Peak memory for 50-token generation: ~2-3MB total
# Very efficient! Can run on phone/laptop easily
Step 11J: Generation Speed
# Generation speed depends on model size:
# Our small model (1.3M parameters):
# ~100-200 tokens/second on CPU
# ~500-1000 tokens/second on GPU
# For comparison:
# GPT-3 (175B parameters): ~20-50 tokens/second
# Our model is much faster because it's much smaller!
# Generation time for 50 tokens:
# CPU: ~0.3 seconds
# GPU: ~0.05 seconds
Step 11K: Common Generation Issues and Solutions
# PROBLEM 1: Repetition
# Output: "alice was was was was was"
# Solution: Add repetition penalty or use different sampling
# PROBLEM 2: Incoherence
# Output: "alice was purple elephant mountain"
# Solution: Lower temperature, use top-k sampling
# PROBLEM 3: Too boring
# Output: "alice was tired alice was tired alice was tired"
# Solution: Increase temperature, increase top-k
# PROBLEM 4: Doesn't follow prompt
# Output: Prompt="alice was happy" → "bob went shopping"
# Solution: Better training data, longer context
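For the repetition problem, one common trick (not part of the code above; a hedged sketch of a CTRL-style repetition penalty) is to push down the logits of words that were already generated before sampling:
def apply_repetition_penalty(next_token_logits, generated_ids, penalty=1.2):
    # Penalize every token that already appears in the generated sequence
    for token_id in set(generated_ids[0].tolist()):
        logit = next_token_logits[token_id]
        # Dividing positive logits (and multiplying negative ones) lowers their probability
        next_token_logits[token_id] = logit / penalty if logit > 0 else logit * penalty
    return next_token_logits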
Key Insights:
- Autoregressive generation - One word at a time, each depends on previous
- Probabilistic sampling - Model outputs probabilities, we sample from them
- Quality vs creativity tradeoff - Temperature and top-k control this balance
- Learned patterns emerge - Training creates understanding of language structure
- Interactive capability - Model can respond to any prompt in real-time
The Generation Miracle:
Your 1.3M parameter model can now:
- Complete any sentence you start
- Write coherent stories
- Follow grammatical rules
- Maintain narrative consistency
- Generate creative but sensible text
Final result: You've built a miniature GPT that understands language and can generate human-like text!
This is the same fundamental technology behind ChatGPT, GPT-4, and other large language models - just scaled up with more parameters, more data, and more compute!