If you've used ChatGPT, GitHub Copilot, or Claude, you've experienced the power of a single neural network architecture that made it all possible: The Transformer.
In 2017, a Google research paper titled "Attention Is All You Need" introduced what would become the foundation of every modern Large Language Model (LLM). Today, we're going to unpack how this architecture works, why it beat everything that came before it, and why its decoder-only variant powers models like GPT-4, Llama 3, and Gemini.
The Problem with Old-School Sequence Models
Before Transformers, we mostly used Recurrent Neural Networks (RNNs) and their fancier cousins, LSTMs. These processed text sequentially — word by word — passing a "hidden state" along like a baton in a relay race. This approach had three fatal flaws:
# Pseudo-code for the RNN bottleneck
def process_sentence_rnn(sentence):
    hidden_state = zeros()
    for word in sentence:  # Sequential processing 😢
        hidden_state = rnn_cell(word, hidden_state)
        # Each step depends on the previous step
        # No parallelization possible!
    return hidden_state
The problems:
- No parallelization — GPUs couldn't speed up sequential processing
- Vanishing gradients — Context from early words got lost in long texts
- Memory limitations — Couldn't handle long-range dependencies well
Transformers solved all of this by processing all words simultaneously and using a clever mechanism called self-attention to understand relationships.
Self-Attention: The Heart of the Transformer
Instead of processing words in order, Transformers look at all words at once and decide which ones are most relevant to each other. Here's how it works conceptually:
# Simplified self-attention in code
def self_attention(words):
    # Each word gets three representations:
    Q = compute_query(words)  # "What am I looking for?"
    K = compute_key(words)    # "What do I contain?"
    V = compute_value(words)  # "What should I output?"
    # Attention scores = similarity between Q and K
    scores = matmul(Q, K.transpose())
    # Softmax to get weights (which words to focus on);
    # d_k is the key dimension, and dividing by sqrt(d_k) keeps the softmax well-behaved
    weights = softmax(scores / sqrt(d_k))
    # Weighted sum of values
    output = matmul(weights, V)
    return output
Real-world example: In the sentence "The cat chased its tail because it was playful", the word "it" needs to figure out what it refers to. Self-attention lets it assign high scores to "cat" and "tail", lower scores to other words.
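To make this concrete, here is a minimal runnable NumPy sketch of scaled dot-product self-attention. The weight matrices W_q, W_k, and W_v are random stand-ins for learned parameters, and the function name is ours, not the paper's:
import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """Minimal self-attention over a sequence of word vectors X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # each output is a weighted mix of values

# Toy usage: 5 "words", 8-dimensional embeddings, random (untrained) projections
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(scaled_dot_product_attention(X, W_q, W_k, W_v).shape)  # (5, 8)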
Multi-Head Attention: Multiple Perspectives
Transformers don't use just one attention mechanism — they use multiple in parallel (typically 8-128 "heads"):
# Multi-head attention lets the model focus on different things simultaneously
heads = []
for i in range(num_heads):
    # Each head might learn different patterns:
    if i == 0:    # Head 1: syntactic relationships
        focus = ["cat", "chased", "tail"]
    elif i == 1:  # Head 2: semantic meaning
        focus = ["playful", "because"]
    # ... etc.
    head_output = attention_layer(Q[i], K[i], V[i])
    heads.append(head_output)
# Combine all heads
multi_head_output = concat_and_project(heads)
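If you want to see the mechanics end to end, here is a minimal sketch that splits the channels of a single projection into heads and reuses the row-wise softmax from the earlier sketch. The weights are again random stand-ins; real implementations learn separate per-head projections plus an output projection:
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=4):
    """Split d_model into num_heads smaller attention computations, then recombine."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                 # (seq_len, d_model) each
    head_outputs = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)        # this head's slice of the channels
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        head_outputs.append(weights @ V[:, sl])
    return np.concatenate(head_outputs, axis=-1) @ W_o  # concat heads, project back to d_model

# Toy usage with random (untrained) weights
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_q, W_k, W_v, W_o = (rng.normal(size=(16, 16)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape)  # (5, 16)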
Positional Encoding: Adding Order to Chaos
Since Transformers process all words simultaneously, they lose word order information. We fix this with positional encoding — adding information about each word's position:
import numpy as np
def positional_encoding(position, dimension, d_model=512):
    """Sinusoidal positional encoding from the original paper"""
    angle_rates = 1 / np.power(10000, (2 * (dimension // 2)) / d_model)
    angle = position * angle_rates
    # Even dimensions use sine, odd dimensions use cosine
    if dimension % 2 == 0:
        return np.sin(angle)
    else:
        return np.cos(angle)
# Example: different positions get unique encodings (values illustrative)
pos_encodings = {
    "The[0]": [0.0, 1.0, 0.0, 1.0, ...],
    "cat[1]": [0.841, 0.540, 0.002, 0.999, ...],
    "sat[2]": [0.909, -0.416, 0.004, 0.999, ...],
}
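Stacking these scalar values gives one encoding vector per position. A small sketch using the function above (seq_len and d_model here are arbitrary toy values):
# Build the full (seq_len, d_model) positional-encoding matrix
seq_len, d_model = 10, 16
pe_matrix = np.array([
    [positional_encoding(pos, dim, d_model) for dim in range(d_model)]
    for pos in range(seq_len)
])
print(pe_matrix.shape)  # (10, 16): one unique encoding vector per position
# In practice, this matrix is simply added to the token embeddings.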
The Original Transformer: Encoder-Decoder Architecture
The original Transformer had two main parts (designed for translation tasks):
┌─────────────────────────────────────────────────────────────┐
│                    ORIGINAL TRANSFORMER                     │
├─────────────────────────────────┬───────────────────────────┤
│             ENCODER             │          DECODER          │
│  (Reads and understands input)  │    (Generates output)     │
├─────────────────────────────────┼───────────────────────────┤
│ Input Embedding + Position      │ Output Embedding + Pos    │
│                ↓                │             ↓             │
│   ┌─────────────────────────┐   │  ┌─────────────────────┐  │
│   │    Encoder Layer × N    │   │  │  Decoder Layer × N  │  │
│   │ • Self-Attention        │───┼─→│ • Masked Attn       │  │
│   │ • Feed Forward          │   │  │ • Cross-Attn        │  │
│   │ • Layer Norm            │   │  │ • Feed Forward      │  │
│   │ • Residual Connections  │   │  │ • Layer Norm        │  │
│   └─────────────────────────┘   │  └─────────────────────┘  │
└─────────────────────────────────┴───────────────────────────┘
Key difference in the decoder: It uses masked self-attention — each word can only attend to previous words (not future ones). This prevents cheating during text generation.
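Conceptually, masking just means setting the attention scores for future positions to negative infinity before the softmax, so their weights come out as zero. A minimal NumPy sketch of that idea (the function name is ours, not from the paper):
import numpy as np

def apply_causal_mask(scores):
    """Block attention to future tokens. scores: (seq_len, seq_len) attention scores."""
    seq_len = scores.shape[-1]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True strictly above the diagonal
    return np.where(future, -np.inf, scores)  # -inf turns into weight 0 after softmax

print(apply_causal_mask(np.zeros((4, 4))))
# Row i keeps scores only for positions 0..i; everything to its right is -inf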
Why Modern LLMs Use Decoder-Only Architectures
You might notice that GPT, Llama, and Claude don't use the full encoder-decoder setup. They use just the decoder part. Here's why:
# Architecture comparison
architectures = {
    "encoder_decoder": {  # Original Transformer, T5, BART
        "use_case": "Translation, summarization",
        "pros": "Great for seq2seq tasks",
        "cons": "Twice the parameters, complex training",
        "modern_llm": False,
    },
    "encoder_only": {  # BERT, RoBERTa
        "use_case": "Classification, Q&A",
        "pros": "Bidirectional context",
        "cons": "Not good for generation",
        "modern_llm": False,
    },
    "decoder_only": {  # GPT, Llama, Claude, Gemini
        "use_case": "Text generation, chat",
        "pros": "Perfect for next-token prediction",
        "cons": "Unidirectional context",
        "modern_llm": True,  # 🎯 Winner for LLMs!
    },
}
Three reasons decoder-only won:
- Generative pre-training — Predicting the next word trains a general understanding of language (see the sketch just after this list)
- Efficiency — One stack uses fewer parameters than two
- Transfer learning — A model trained to predict next tokens can learn almost any language task
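To make the first point concrete, here is a sketch of the next-token objective in PyTorch. It assumes a decoder-only model that maps token IDs of shape (batch, seq_len) to logits of shape (batch, seq_len, vocab_size); the helper name next_token_loss is just for illustration:
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Generative pre-training: predict token t+1 from tokens 0..t."""
    inputs = token_ids[:, :-1]    # every token except the last
    targets = token_ids[:, 1:]    # the same sequence shifted left by one
    logits = model(inputs)        # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and sequence dimensions
        targets.reshape(-1),
    )
Every position in every training document supplies a label for free, which is a big part of why this objective scales so well.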
The Transformer Block: A Layer-by-Layer Look
Let's look at what happens inside a single Transformer layer:
class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, n_heads)
        self.norm1 = LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = LayerNorm(d_model)

    def forward(self, x):
        # 1. Self-attention with residual connection
        attn_output = self.attention(x)
        x = self.norm1(x + attn_output)  # Add & normalize
        # 2. Feed-forward network with residual connection
        ffn_output = self.ffn(x)
        x = self.norm2(x + ffn_output)  # Add & normalize
        return x

# Why residuals and normalization matter:
# Residual connections: help gradients flow through deep networks
# LayerNorm: stabilizes training across different samples
From Transformer to LLM: The Scaling Secret
The Transformer architecture was brilliant, but turning it into ChatGPT required massive scaling:
# How scaling transformed the Transformer
scaling_timeline = {
    "2018": {
        "model": "GPT-1",
        "params": "117M",
        "layers": 12,
        "d_model": 768,
        "training_data": "~5GB of text",
    },
    "2020": {
        "model": "GPT-3",
        "params": "175B",  # ~1,500x more parameters than GPT-1!
        "layers": 96,
        "d_model": 12288,
        "training_data": "~570GB of text",
    },
    "2024": {
        "model": "Llama 3 405B",
        "params": "405B",
        "layers": 126,
        "d_model": 16384,
        "training_data": "~15T tokens",
    },
}
The scaling laws from the Chinchilla paper give rough rules of thumb for compute-optimal training:
Optimal training tokens ≈ 20 × number of parameters
Training compute ≈ 6 × parameters × training tokens (in FLOPs)
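A quick back-of-the-envelope sketch of those rules of thumb (the constants 20 and 6 are rough approximations from the literature, and chinchilla_estimates is just an illustrative helper):
def chinchilla_estimates(n_params):
    """Rough compute-optimal training budget for a model with n_params parameters."""
    optimal_tokens = 20 * n_params               # Chinchilla rule of thumb: ~20 tokens per parameter
    train_flops = 6 * n_params * optimal_tokens  # ~6 FLOPs per parameter per training token
    return {"tokens": optimal_tokens, "flops": train_flops}

# Example: a 70B-parameter model
print(chinchilla_estimates(70e9))  # roughly 1.4 trillion tokens and ~6e23 FLOPs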
Key Innovations That Made It Work
1. Layer Normalization
# Without LayerNorm (training is unstable)
def unstable_forward(x):
    return gelu(linear(x))  # Activations can explode or vanish

# With LayerNorm (stable training)
def stable_forward(x):
    x_normalized = (x - mean(x)) / sqrt(var(x) + eps)
    return gelu(linear(x_normalized))
2. Residual (Skip) Connections
# Without residuals: vanishing gradients in deep networks
def deep_network(x):
    for i in range(100):  # 100 layers!
        x = layer(x)  # Signal degrades through many layers
    return x

# With residuals: a gradient highway!
def resnet_style(x):
    for i in range(100):
        x = x + layer(x)  # Original signal is preserved
    return x
3. Feed-Forward Networks
Each attention layer is followed by a simple but wide FFN:
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        # Typically: d_ff = 4 × d_model
        self.w1 = nn.Linear(d_model, d_ff)  # Expand
        self.w2 = nn.Linear(d_ff, d_model)  # Project back
        self.activation = nn.GELU()

    def forward(self, x):
        return self.w2(self.activation(self.w1(x)))
Putting It All Together: Code for a Mini-Transformer
Here's a complete (simplified) Transformer implementation:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class MiniTransformer(nn.Module):
    def __init__(self, vocab_size=50000, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)  # sinusoidal encoding module (not shown here)
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, n_heads)  # the block defined earlier
            for _ in range(n_layers)
        ])
        self.final_norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # 1. Convert tokens to vectors
        x = self.embedding(x)
        # 2. Add positional information
        x = self.pos_encoding(x)
        # 3. Process through transformer layers
        # (for language modeling, each block's attention must use the causal mask described earlier)
        for layer in self.layers:
            x = layer(x)
        # 4. Final normalization and projection to vocabulary logits
        x = self.final_norm(x)
        logits = self.lm_head(x)
        return logits

# Usage example
model = MiniTransformer()
input_ids = torch.randint(0, 50000, (1, 128))  # Batch of 1, 128 tokens
logits = model(input_ids)  # Shape: (1, 128, 50000)
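To actually generate text, the model is fed its own output one token at a time. A minimal greedy-decoding sketch (it assumes the attention is causally masked; real systems add sampling, temperature, and a KV cache):
@torch.no_grad()
def generate(model, input_ids, max_new_tokens=20):
    """Autoregressive decoding: repeatedly predict the next token and append it."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                     # (batch, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1)  # greedy: pick the most likely next token
        input_ids = torch.cat([input_ids, next_token.unsqueeze(1)], dim=1)
    return input_ids

generated = generate(model, input_ids[:, :10])  # continue the first 10 tokens
print(generated.shape)  # (1, 30)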
Why This Architecture Won
- Parallelizability — All tokens processed simultaneously (great for GPUs)
- Long-range dependencies — Attention connects any two tokens directly
- Scalability — Add more layers/heads for better performance
- Transferability — Pre-trained models work on many tasks
Common Misconceptions
Myth 1: "Transformers understand language like humans do"
- Reality: They're pattern matchers, not thinkers. They predict what comes next based on training data statistics.
Myth 2: "More parameters always means better"
- Reality: The Chinchilla paper showed we need more data, not just bigger models.
Myth 3: "Self-attention is computationally efficient"
- Reality: It's O(n²) in sequence length! That's why we need optimizations like FlashAttention (coming in the next article).
What's Next?
The Transformer architecture was just the beginning. To build actual LLMs, we need:
- Massive-scale training (next article!)
- Efficiency optimizations (FlashAttention, quantization)
- Alignment techniques (RLHF, DPO)
In the next article, we'll dive into how LLMs are trained — from petabyte-scale datasets to distributed training across thousands of GPUs, and the clever tricks like Mixture of Experts that make trillion-parameter models possible.
Further Reading & Resources
- Original Paper: Attention Is All You Need — The 2017 paper that started it all
- Visual Guide: The Illustrated Transformer — Best visual explanation online
Key Takeaways
- Transformers process all tokens in parallel using self-attention
- Self-attention calculates relevance scores between all token pairs
- Positional encoding adds order information
- Modern LLMs use decoder-only architectures for efficiency
- The real magic happens at scale (billions of parameters, trillions of tokens)
*Got questions or insights about Transformers? Drop them in the comments below!*
Next up: We'll explore how these architectures are trained at massive scale, covering distributed training, the Chinchilla scaling laws, and how models like GPT-4 are actually built.