In 2017, a paper titled "Attention Is All You Need" changed the trajectory of deep learning. The Transformer architecture it introduced isn't just the backbone of GPT, BERT, Claude, and every major LLM today — it's also the foundation for vision models (ViT), audio models (Whisper), and multimodal architectures.
But here's the catch: most tutorials skip the "why" and jump straight to the "how." You get a diagram of Q, K, V, a few equations, and a "just trust me, it works."
This article takes the opposite approach. We'll build the Transformer from first principles — starting with the problem it solves, then layering each component one by one, with visuals and runnable code.
1. The Problem: Why RNNs Hit a Wall
Before Transformers, sequence modeling meant Recurrent Neural Networks (RNNs), LSTMs, and GRUs. These architectures process tokens one at a time:
Input: "The cat sat on the ..."
RNN: h₀ → h₁ → h₂ → h₃ → h₄ → ...
(sequential — each step waits for the previous one)
This has two fundamental limitations:
① Sequential bottleneck. Token #100 can't be processed until tokens #1–99 are done. No parallelism = slow training, especially on GPUs that thrive on parallel computation.
② Long-range forgetting. In theory, an LSTM can remember information for hundreds of steps. In practice, by step 50, the signal from step 1 has degraded significantly. The model struggles to connect "She was born in Paris" with "She speaks fluent ___" when there are 30 tokens in between.
These problems aren't minor engineering issues — they're architectural ceilings. Transformers solve both simultaneously, and the key insight is deceptively simple:
Every token should be able to directly look at every other token — in one step.
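To make the sequential bottleneck concrete, here is a minimal sketch of a vanilla RNN forward pass in NumPy. The names `rnn_forward`, `W_xh`, `W_hh` and all sizes are illustrative assumptions, not taken from any library. The point is the loop: step t cannot begin until the hidden state from step t-1 exists.

```python
import numpy as np

def rnn_forward(X, W_xh, W_hh, h0):
    """Vanilla RNN: h_t = tanh(x_t @ W_xh + h_{t-1} @ W_hh)."""
    h = h0
    states = []
    for x_t in X:  # one token at a time; each step needs the previous h
        h = np.tanh(x_t @ W_xh + h @ W_hh)
        states.append(h)
    return np.stack(states)  # (seq_len, d_hidden)

rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 6, 8, 16  # illustrative sizes
X = rng.normal(size=(seq_len, d_in))
W_xh = rng.normal(size=(d_in, d_hidden)) * 0.1
W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.1

H = rnn_forward(X, W_xh, W_hh, np.zeros(d_hidden))
print(H.shape)  # (6, 16)
```

Information from the first token reaches the last one only by passing through every intermediate `tanh` update, which is where the long-range signal tends to fade.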
2. The Core Innovation: Self-Attention
Let's build self-attention from scratch, step by step.
2.1 What Are We Trying to Do?
Given a sequence of input vectors:
x₁ = "The" → vector
x₂ = "cat" → vector
x₃ = "sat" → vector
x₄ = "on" → vector
x₅ = "the" → vector
x₆ = "mat" → vector
We want to produce a new set of vectors where each one contains context from the entire sequence. For example, the new vector for "mat" should know that it's a noun being sat on by a cat.
2.2 The Query-Key-Value Mechanism
Self-attention works by asking three questions for each token:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "If matched, what information should I pass along?"
For each token in the sequence, we:
- Compute its Query vector
- Compare it against every token's Key vector (including itself)
- Use the similarity scores to weight each token's Value vector
- Sum the weighted values — that's the new representation
Think of it like a library: you walk in with a query ("books about neural networks"), the librarian checks the keys (book titles/tags), and hands you the values (the actual books).
2.3 The Math (It's Simpler Than You Think)
Let X be our input matrix (sequence length × embedding dimension).
Step 1: Compute Q, K, V via learned weight matrices:
Q = X · W_Q (queries)
K = X · W_K (keys)
V = X · W_V (values)
Step 2: Compute attention scores — dot product of every query with every key:
S = Q · Kᵀ (shape: seq_len × seq_len)
Step 3: Scale and normalize with softmax:
S_scaled = S / √d_k
A = softmax(S_scaled, dim=-1)
The √d_k scaling prevents the dot products from growing too large (which would push softmax into regions with extremely small gradients).
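A small experiment can illustrate the scaling argument. This is a sketch under the assumption that query and key components are roughly unit-variance, in which case the raw dot product has a standard deviation of about √d_k. The sizes and the random data are made up for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 512     # illustrative key dimension
n_keys = 10

q = rng.normal(size=d_k)
K = rng.normal(size=(n_keys, d_k))

raw = K @ q                   # scores with std of roughly sqrt(d_k)
scaled = raw / np.sqrt(d_k)   # scores with std of roughly 1

print("largest weight, unscaled:", softmax(raw).max())
print("largest weight, scaled:  ", softmax(scaled).max())
```

Without scaling, the softmax tends to put nearly all of its weight on a single key, which is the saturated regime where gradients become very small.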
Step 4: Weighted sum of values:
Output = A · V
That's it. Four matrix operations. Here's the code:
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """
    X:   input        (seq_len × d_model)
    W_q: query weight (d_model × d_k)
    W_k: key weight   (d_model × d_k)
    W_v: value weight (d_model × d_v)
    """
    d_k = W_q.shape[1]

    # Step 1: Project to Q, K, V
    Q = X @ W_q  # (seq_len × d_k)
    K = X @ W_k  # (seq_len × d_k)
    V = X @ W_v  # (seq_len × d_v)

    # Step 2: Compute attention scores
    S = Q @ K.T  # (seq_len × seq_len)

    # Step 3: Scale + softmax (subtract the row max so exp() cannot overflow)
    S = S / np.sqrt(d_k)
    S = S - S.max(axis=-1, keepdims=True)
    A = np.exp(S)
    A = A / A.sum(axis=-1, keepdims=True)

    # Step 4: Weighted sum of values
    return A @ V  # (seq_len × d_v)
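As a quick sanity check, the four steps can be run on random data. The sketch below restates the function compactly so it runs on its own, and also returns the attention weights for inspection. The sizes and random weights are illustrative assumptions.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Same four steps as above, also returning the attention weights."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    S = Q @ K.T / np.sqrt(W_q.shape[1])
    S = S - S.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(S)
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
seq_len, d_model, d_k, d_v = 6, 16, 8, 8  # six tokens, e.g. "The cat sat on the mat"
X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

out, A = self_attention(X, W_q, W_k, W_v)
print(out.shape)       # (6, 8)
print(A.shape)         # (6, 6)
print(A.sum(axis=-1))  # every row of weights sums to 1
```

Row i of `A` is the distribution token i uses to mix the value vectors, so each output row is a convex combination of the rows of `V`.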
2.4 Why This Changes Everything
Notice what happened: every token directly interacts with every other token in a single operation. There's no sequential dependency. The parallelism is baked into the matrix multiplication.
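And there's no built-in distance penalty: the path between token #1 and token #1000 is a single attention step, the same as between neighbors. Information no longer has to survive hundreds of recurrent updates, which removes the main cause of long-range forgetting.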
3. Multi-Head Attention: Many Perspectives
A single attention head produces one set of attention weights per token, so it tends to capture one kind of relationship at a time. But real language has many: syntax, semantics, coreference, positional relationships, etc.
Multi-head attention runs multiple attention operations in parallel, each with its own Q, K, V projections:
Head 1: "Which words are verbs?"
Head 2: "Which noun does this pronoun refer to?"
Head 3: "What's the subject of this sentence?"
... and so on (typically 8–16 heads)
class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # One projection matrix per head (concatenated for efficiency)
        self.W_q = np.random.randn(d_model, d_model) * 0.01
        self.W_k = np.random.randn(d_model, d_model) * 0.01
        self.W_v = np.random.randn(d_model, d_model) * 0.01
        self.W_o = np.random.randn(d_model, d_model) * 0.01

    def split_heads(self, X):
        """(seq_len × d_model) → (num_heads × seq_len × d_k)"""
        seq_len = X.shape[0]
        X = X.reshape(seq_len, self.num_heads, self.d_k)
        return X.transpose(1, 0, 2)  # (num_heads, seq_len, d_k)

    def __call__(self, X):
        # Project to Q, K, V (all heads at once)
        Q = X @ self.W_q  # (seq_len × d_model)
        K = X @ self.W_k
        V = X @ self.W_v

        # Split into heads
        Q = self.split_heads(Q)  # (num_heads × seq_len × d_k)
        K = self.split_heads(K)
        V = self.split_heads(V)

        # Attention per head (heads are independent, so this loop could be batched)
        outputs = []
        for h in range(self.num_heads):
            S = Q[h] @ K[h].T / np.sqrt(self.d_k)
            S = S - S.max(axis=-1, keepdims=True)  # keep exp() from overflowing
            A = np.exp(S) / np.exp(S).sum(axis=-1, keepdims=True)
            head_output = A @ V[h]
            outputs.append(head_output)

        # Concatenate heads
        concat = np.concatenate(outputs, axis=-1)  # (seq_len × d_model)
        return concat @ self.W_o  # Final projection
The outputs of all heads are concatenated and projected one more time. This lets the model simultaneously attend to different types of relationships — something no single RNN state could do.
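The per-head loop above is easy to read, but the heads do not depend on each other, so the same computation can be written as one batched matrix multiplication. The following is a sketch of that idea as a standalone function; `multi_head_attention` and its argument layout are assumptions made for this example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_k = d_model // num_heads

    def split(M):  # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return M.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    # Batched matmul: one (seq_len, seq_len) score matrix per head
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    heads = softmax(S) @ V  # (num_heads, seq_len, d_k)
    # Merge heads back into (seq_len, d_model), head 0 first
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2).shape)  # (5, 8)
```

NumPy's `@` treats the leading axis as a batch dimension, so all heads are computed in one call instead of a Python loop.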
4. Positional Encoding: Putting Things in Order
Here's an important catch: self-attention has no notion of order (formally, it is permutation-equivariant). Feed it ["cat", "sat", "mat"] or ["mat", "sat", "cat"] and each word ends up with exactly the same output vector, just in a different row, because dot products don't know about position.
But "the cat sat on the mat" and "the mat sat on the cat" have very different meanings. We need to inject positional information.
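The order-blindness described above can be checked numerically. In this sketch (random weights and sizes are illustrative assumptions), shuffling the input rows only shuffles the output rows in the same way; no token's output changes.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))  # six token embeddings, no position information
W_q, W_k, W_v = (rng.normal(size=(16, 16)) for _ in range(3))

perm = rng.permutation(6)
out = self_attention(X, W_q, W_k, W_v)
out_shuffled = self_attention(X[perm], W_q, W_k, W_v)

# Shuffling the inputs just shuffles the outputs the same way
print(np.allclose(out_shuffled, out[perm]))  # True
```

Without position information, the model has no way to tell which of the two word orders it was given.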
The original Transformer uses sinusoidal positional encoding:
def positional_encoding(seq_len, d_model):
    # Assumes d_model is even: each (sin, cos) pair fills two columns
    pe = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            pe[pos, i] = np.sin(pos / (10000 ** (i / d_model)))
            pe[pos, i + 1] = np.cos(pos / (10000 ** (i / d_model)))
    return pe

# Shape: (seq_len × d_model); just add it to the input embeddings X
X_with_position = X + positional_encoding(seq_len, d_model)
Why sines and cosines? Two elegant properties:
- Relative positions emerge naturally: PE(pos+k) can be expressed as a linear function of PE(pos), so the model can learn relative position relationships.
- Unbounded sequence length: unlike learned embeddings, sinusoidal encodings are defined for any position, so in principle they can extrapolate to sequences longer than anything seen in training.
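The first property follows from the angle-addition identities: for each frequency, the (sin, cos) pair at position pos+k is a rotation of the pair at position pos. The sketch below verifies this numerically. `shift_matrix` is a hypothetical helper written for this example, and the code assumes an even `d_model`.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                            # (seq_len, 1)
    freqs = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

def shift_matrix(k, d_model):
    """Block-diagonal M_k such that PE(pos + k) = M_k @ PE(pos) for every pos."""
    freqs = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)
    M = np.zeros((d_model, d_model))
    for j, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return M

pe = positional_encoding(50, 16)
M = shift_matrix(5, 16)
print(np.allclose(pe[12 + 5], M @ pe[12]))  # True
print(np.allclose(pe[30 + 5], M @ pe[30]))  # True, same matrix for any pos
```

The matrix depends only on the offset k, not on the absolute position, which is what makes relative offsets easy for a linear layer to pick up.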
Later models use learned positional embeddings (BERT, GPT-2) or rotary position embeddings (RoPE, as in Llama) instead, but the principle is the same: inject order information into a permutation-blind mechanism.
5. Putting It Together: The Transformer Block
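A single Transformer block = Multi-Head Attention + Feed-Forward Network + Residual Connections + LayerNorm. The layout below is the pre-norm variant, where LayerNorm comes before each sub-layer; the original paper applied it after each residual add (post-norm).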
Input → [LayerNorm → Multi-Head Attention → Add (residual)] → [LayerNorm → FFN → Add (residual)]
Each component serves a purpose:
| Component | What It Does | Why It Matters |
|---|---|---|
| Multi-Head Attention | Lets tokens exchange information | The core reasoning mechanism |
| Feed-Forward Network | Two linear layers with ReLU/GELU | Adds per-token computation depth |
| Residual Connection | Output = Layer(x) + x | Gradients flow directly through, enabling very deep models |
| LayerNorm | Normalizes across features | Stabilizes training, reduces sensitivity to initialization |
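class TransformerBlock:
    def __init__(self, d_model, num_heads, d_ff):
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)

    def __call__(self, x):
        # Self-attention + residual
        attn_out = self.attention(self.norm1(x))
        x = x + attn_out
        # FFN + residual
        ffn_out = self.ffn(self.norm2(x))
        x = x + ffn_out
        return x

class FeedForward:
    def __init__(self, d_model, d_ff):
        self.W1 = np.random.randn(d_model, d_ff) * 0.01
        self.W2 = np.random.randn(d_ff, d_model) * 0.01

    def __call__(self, x):
        return np.maximum(0, x @ self.W1) @ self.W2  # ReLU

class LayerNorm:
    # Minimal sketch so the block runs: normalize each token's features,
    # then apply a per-feature scale (gamma) and shift (beta)
    def __init__(self, d_model, eps=1e-5):
        self.gamma = np.ones(d_model)
        self.beta = np.zeros(d_model)
        self.eps = eps

    def __call__(self, x):
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta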
Stack 6, 12, or 96 of these blocks, and you get a Transformer.
6. The Big Picture: Encoder vs. Decoder
The original Transformer has two stacks:
Encoder (6 blocks, bidirectional):
- Each token attends to ALL tokens (past and future)
- Used for understanding tasks: classification, sentiment, NER
- BERT uses only the encoder
Decoder (6 blocks, autoregressive):
- Each token attends ONLY to itself and previous tokens (masked attention)
- Used for generation tasks: translation, text generation
- GPT uses only the decoder
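The masked attention used by the decoder can be sketched by setting the scores for future positions to negative infinity before the softmax, so their weights come out as exactly zero. This is a minimal illustration with made-up sizes; `causal_self_attention` is a name chosen for this example.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    seq_len, d_k = Q.shape
    S = Q @ K.T / np.sqrt(d_k)
    # True above the diagonal marks "future" positions that must be hidden
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    S = np.where(future, -np.inf, S)
    S = S - S.max(axis=-1, keepdims=True)  # the diagonal is never masked, so max is finite
    A = np.exp(S)                          # exp(-inf) = 0, so future weights vanish
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, A = causal_self_attention(Q, K, V)
print(np.round(A, 2))  # lower-triangular: row t has weight only on positions <= t
```

The first token can only attend to itself, so its output is simply its own value vector.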
Encoder-Decoder Architecture (the original):
- Encoder processes the input; Decoder generates output while attending to encoder's representations
- Used for translation, summarization
- T5 uses this
┌──────────────────┐ ┌──────────────────┐
│ Encoder │ │ Decoder │
│ (bidirectional) │ │ (autoregressive) │
│ │ │ │
│ Block 6 │ │ Block 6 │
│ Block 5 │ │ Block 5 │
│ ... │ │ ... │
│ Block 1 │ │ Block 1 │
│ │ │ │
│ Input Embedding │ │ Output Embedding│
│ + PE │ │ + PE │
└────────┬─────────┘ └────────┬──────────┘
│ │
└────── Cross-Attn ──────┘
(Decoder attends
to Encoder output)
The cross-attention in the decoder is what lets the model "look at" the input while generating output: producing "Je suis étudiant" token by token while reading "I am a student."
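Mechanically, cross-attention is the same computation as self-attention with one change: queries come from the decoder, while keys and values come from the encoder output. A minimal sketch, with random stand-ins for the encoder and decoder states and illustrative sizes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec_states, enc_states, W_q, W_k, W_v):
    Q = dec_states @ W_q  # queries from the decoder  (tgt_len, d_k)
    K = enc_states @ W_k  # keys from the encoder     (src_len, d_k)
    V = enc_states @ W_v  # values from the encoder   (src_len, d_v)
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (tgt_len, src_len)
    return A @ V, A

rng = np.random.default_rng(0)
d_model = 16
enc_states = rng.normal(size=(4, d_model))  # stand-in for 4 source tokens
dec_states = rng.normal(size=(3, d_model))  # stand-in for 3 target tokens so far
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, A = cross_attention(dec_states, enc_states, W_q, W_k, W_v)
print(out.shape, A.shape)  # (3, 16) (3, 4)
```

The weight matrix is rectangular: each target position gets one distribution over the source positions. No causal mask is needed here because the whole input is already available.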
7. Why Transformers Won (And What They Cost)
What We Gained
| Feature | RNN/LSTM | Transformer |
|---|---|---|
| Parallelism | ❌ Sequential | ✅ Full parallel |
| Long-range (1K+ tokens) | ❌ Forgets | ✅ No distance penalty |
| Training speed (wall-clock) | Slow (steps run in sequence) | Much faster on parallel hardware |
| Scaling | Limited | Up to trillions of params |
The Price We Pay
O(n²) complexity. Every token attends to every token. For a 1,000-token sequence, that's 1M attention pairs. For 100,000 tokens (roughly a short book), it's 10 billion, which is impractical to compute and store as one full matrix.
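The arithmetic behind those numbers is easy to reproduce. This sketch assumes 2 bytes per score (fp16) and counts a single attention matrix for one head in one layer; real models multiply this by the number of heads and layers.

```python
def attention_matrix_bytes(seq_len, bytes_per_score=2):
    """Memory for one seq_len x seq_len score matrix (one head, one layer)."""
    return seq_len * seq_len * bytes_per_score

for n in (1_000, 100_000):
    pairs = n * n
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>7} tokens: {pairs:,} pairs, {gib:.3f} GiB per head at fp16")
```

At 100,000 tokens a single fp16 score matrix already takes about 18.6 GiB, which is why the full matrix cannot simply be materialized at that length.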
This is why modern optimizations exist:
- Sparse attention (only attend to local + a few global tokens)
- Sliding window attention (Mistral, Gemma)
- FlashAttention (hardware-efficient attention that avoids materializing the full matrix)
- KV-cache (reuse computed keys/values during generation)
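Of these, the KV-cache is the simplest to sketch. During generation, the keys and values of earlier tokens never change, so only the newest token needs to be projected at each step. `KVCache` below is a hypothetical helper for a single attention head, not an API from any library.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Hypothetical helper: stores keys/values of tokens already processed."""
    def __init__(self):
        self.K = []
        self.V = []

    def step(self, x_t, W_q, W_k, W_v):
        # Only the new token is projected; earlier K/V rows are reused
        self.K.append(x_t @ W_k)
        self.V.append(x_t @ W_v)
        K, V = np.stack(self.K), np.stack(self.V)
        q = x_t @ W_q
        w = softmax(q @ K.T / np.sqrt(q.shape[-1]))  # weights over tokens so far
        return w @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # five token embeddings, fed one at a time
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

cache = KVCache()
outs = np.stack([cache.step(x_t, W_q, W_k, W_v) for x_t in X])
print(outs.shape)  # (5, 8)
```

Each step costs time proportional to the current length instead of recomputing the whole sequence, and the results match full causal attention over the same tokens.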
8. From Here to GPT: What Changed
The Transformer you just built is the foundation. Modern LLMs add:
- Scale: GPT-3 has 175B params across 96 layers. GPT-4's size has not been disclosed; unofficial estimates put it around 1.8T params in a mixture-of-experts design (8×220B experts)
- Pre-training + Fine-tuning: Learn from internet text, then specialize
- RLHF: Align outputs with human preferences (what makes Claude helpful, honest, and harmless)
- Architecture tweaks: GQA (grouped query attention), RoPE, SwiGLU, RMSNorm instead of LayerNorm
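The scale figure can be roughly reproduced from the block structure built earlier. This back-of-the-envelope sketch counts only the attention and feed-forward weight matrices, ignoring embeddings, biases, and LayerNorm parameters. The width `d_model = 12288` and `d_ff = 4 * d_model` come from the published GPT-3 configuration and are stated here as assumptions, since the article does not give them.

```python
def block_params(d_model, d_ff=None):
    """Weight count for one block, ignoring biases and LayerNorm parameters."""
    d_ff = 4 * d_model if d_ff is None else d_ff
    attention = 4 * d_model * d_model  # W_q, W_k, W_v, W_o
    ffn = 2 * d_model * d_ff           # W1, W2
    return attention + ffn

# Assumed GPT-3 175B configuration: d_model = 12288, 96 layers, d_ff = 4 * d_model
total = 96 * block_params(12288)
print(f"{total / 1e9:.1f}B")  # 173.9B
```

With `d_ff = 4 * d_model`, each block holds about 12 × d_model² weights, so nearly all of the 175B parameters sit in the stacked blocks; the token embeddings account for most of the remainder.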
But the core innovation — every token directly attends to every other token in parallel — remains unchanged since 2017. If you understand self-attention, you understand the engine driving the AI revolution.
9. Play With It Yourself
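Here's a minimal, inference-only Transformer for character-level text generation. It reuses the TransformerBlock class from section 5 and, to stay short, leaves out positional encoding and causal masking: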
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class TinyTransformer:
    def __init__(self, vocab_size, d_model=64, num_heads=4):
        self.embed = np.random.randn(vocab_size, d_model) * 0.01
        # TransformerBlock is the class defined in section 5
        self.blocks = [TransformerBlock(d_model, num_heads, d_model*4) for _ in range(3)]
        self.output = np.random.randn(d_model, vocab_size) * 0.01

    def generate(self, token, length=100, temperature=1.0):
        tokens = [token]
        for _ in range(length):
            x = self.embed[tokens[-64:]]  # last 64 tokens context
            for block in self.blocks:
                x = block(x)
            logits = x[-1] @ self.output / temperature
            probs = softmax(logits)
            token = int(np.random.choice(len(probs), p=probs))
            tokens.append(token)
        return tokens
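As written, the weights are random and there is no backward pass, so the output is noise. To train it, port the same structure to an autodiff framework such as PyTorch or JAX and fit it to a small text file (around 1MB) with a standard optimizer; models this small typically start picking up character-level structure quickly.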
The Bottom Line
- Self-attention lets every token directly interact with every other token — no sequential bottleneck, no distance decay
- Multi-head attention captures different relationship types simultaneously
- Positional encoding tells the model about order (since attention is permutation-blind)
- Residual connections + LayerNorm enable stacking dozens of layers
- The combination is so powerful it has largely displaced RNNs for sequence modeling, competes with CNNs in vision, and now extends to audio and video
The Transformer is 8 years old, and we're still discovering what it can do. The age of attention is just getting started.