马国锦

Posted on Jun 15

Transformers From Scratch: How Attention Really Works (With Visuals & Code)

#transformer #deeplearning #ai #tutorial

In 2017, a paper titled "Attention Is All You Need" changed the trajectory of deep learning. The Transformer architecture it introduced isn't just the backbone of GPT, BERT, Claude, and every major LLM today — it's also the foundation for vision models (ViT), audio models (Whisper), and multimodal architectures.

But here's the catch: most tutorials skip the "why" and jump straight to the "how." You get a diagram of Q, K, V, a few equations, and a "just trust me, it works."

This article takes the opposite approach. We'll build the Transformer from first principles — starting with the problem it solves, then layering each component one by one, with visuals and runnable code.

1. The Problem: Why RNNs Hit a Wall

Before Transformers, sequence modeling meant Recurrent Neural Networks (RNNs), LSTMs, and GRUs. These architectures process tokens one at a time:

Input:  "The cat sat on the ..."
RNN:    h₀ → h₁ → h₂ → h₃ → h₄ → ...
        (sequential — each step waits for the previous one)

This has two fundamental limitations:

① Sequential bottleneck. Token #100 can't be processed until tokens #1–99 are done. No parallelism across time steps = slow training, especially on GPUs that thrive on parallel computation.

② Long-range forgetting. In theory, an LSTM can remember information for hundreds of steps. In practice, by step 50, the signal from step 1 has degraded significantly. The model struggles to connect "She was born in Paris" with "She speaks fluent ___" when there are 30 tokens in between.
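To make the sequential bottleneck concrete, here is a minimal sketch of a vanilla RNN forward pass. The function name, the weight names (`W_xh`, `W_hh`), and the sizes are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def rnn_forward(X, W_xh, W_hh):
    """Vanilla RNN forward pass (illustrative sketch).

    X:    inputs (seq_len x d_in)
    W_xh: input-to-hidden weights (d_in x d_hidden)
    W_hh: hidden-to-hidden weights (d_hidden x d_hidden)
    """
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in X:  # one token at a time
        # h depends on the previous h, so step t cannot start before step t-1 finishes
        h = np.tanh(x_t @ W_xh + h @ W_hh)
        states.append(h)
    return np.stack(states)  # (seq_len x d_hidden)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
W_xh = rng.normal(size=(4, 8)) * 0.1
W_hh = rng.normal(size=(8, 8)) * 0.1
H = rnn_forward(X, W_xh, W_hh)
print(H.shape)  # (6, 8)
```

The `for` loop is the problem: no matter how many GPU cores you have, the time steps of one sequence run one after another.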

These problems aren't minor engineering issues — they're architectural ceilings. Transformers solve both simultaneously, and the key insight is deceptively simple:

Every token should be able to directly look at every other token — in one step.

2. The Core Innovation: Self-Attention

Let's build self-attention from scratch, step by step.

2.1 What Are We Trying to Do?

Given a sequence of input vectors:

x₁ = "The"    →  vector
x₂ = "cat"    →  vector
x₃ = "sat"    →  vector
x₄ = "on"     →  vector
x₅ = "the"    →  vector
x₆ = "mat"    →  vector

We want to produce a new set of vectors where each one contains context from the entire sequence. For example, the new vector for "mat" should know that it's a noun being sat on by a cat.

2.2 The Query-Key-Value Mechanism

Self-attention works by asking three questions for each token:

Query (Q): "What am I looking for?"
Key (K): "What do I contain?"
Value (V): "If matched, what information should I pass along?"

For each token in the sequence, we:

Compute its Query vector
Compare it against every token's Key vector (including itself)
Use the similarity scores to weight each token's Value vector
Sum the weighted values — that's the new representation

Think of it like a library: you walk in with a query ("books about neural networks"), the librarian checks the keys (book titles/tags), and hands you the values (the actual books).
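Here is the same idea for a single query, as a toy sketch. The vectors are made-up numbers chosen so the result is easy to read, and the scaling factor introduced in section 2.3 is left out for now.

```python
import numpy as np

# Hypothetical toy vectors: 3 tokens, 2-dimensional keys and values.
q = np.array([1.0, 0.0])        # query of the current token
K = np.array([[1.0, 0.0],       # key of token 0 (points the same way as the query)
              [0.0, 1.0],       # key of token 1 (orthogonal to the query)
              [-1.0, 0.0]])     # key of token 2 (opposite to the query)
V = np.array([[10.0, 0.0],
              [0.0, 10.0],
              [-10.0, 0.0]])

scores = K @ q                           # similarity of the query with each key
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()        # softmax: positive weights that sum to 1
new_repr = weights @ V                   # weighted sum of the value vectors

print(weights)   # largest weight goes to token 0, smallest to token 2
print(new_repr)
```

The best-matching key gets the largest weight, so its value dominates the new representation, but every token still contributes a little.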

2.3 The Math (It's Simpler Than You Think)

Let X be our input matrix (sequence length × embedding dimension).

Step 1: Compute Q, K, V via learned weight matrices:

Q = X · W_Q      (queries)
K = X · W_K      (keys)  
V = X · W_V      (values)

Step 2: Compute attention scores — dot product of every query with every key:

S = Q · Kᵀ       (shape: seq_len × seq_len)

Step 3: Scale and normalize with softmax:

S_scaled = S / √d_k
A = softmax(S_scaled, dim=-1)

The √d_k scaling prevents the dot products from growing too large (which would push softmax into regions with extremely small gradients).
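A quick experiment shows why the scaling matters. This is a sketch under the assumption that query and key components are independent with unit variance; real activations will differ, but the trend is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 512
q = rng.normal(size=d_k)            # one query with unit-variance components
K = rng.normal(size=(1000, d_k))    # 1000 keys

raw = K @ q                         # unscaled scores: spread grows like sqrt(d_k)
scaled = raw / np.sqrt(d_k)         # scaled scores: spread stays near 1

print(raw.std(), scaled.std())
print(softmax(raw).max(), softmax(scaled).max())
```

Without scaling, the scores are spread so widely that softmax puts almost all of its weight on a single key. That near one-hot output is exactly the saturated region where gradients become tiny.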

Step 4: Weighted sum of values:

Output = A · V

That's it. Four matrix operations. Here's the code:

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """
    X:    input (seq_len × d_model)
    W_q:  query weight (d_model × d_k)
    W_k:  key weight (d_model × d_k)
    W_v:  value weight (d_model × d_v)
    """
    d_k = W_q.shape[1]

    # Step 1: Project to Q, K, V
    Q = X @ W_q   # (seq_len × d_k)
    K = X @ W_k   # (seq_len × d_k)
    V = X @ W_v   # (seq_len × d_v)

    # Step 2: Compute attention scores
    S = Q @ K.T   # (seq_len × seq_len)

    # Step 3: Scale + softmax (subtract the row max first so exp() cannot overflow)
    S = S / np.sqrt(d_k)
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)

    # Step 4: Weighted sum of values
    return A @ V  # (seq_len × d_v)
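To see the function in action, here is a small usage sketch. It repeats the same four steps in a hypothetical variant, `self_attention_with_weights`, that also returns the attention matrix so we can inspect it; the random inputs and the sizes are assumptions for illustration.

```python
import numpy as np

def self_attention_with_weights(X, W_q, W_k, W_v):
    """Same four steps as above, but also returns the attention matrix A."""
    d_k = W_q.shape[1]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    S = Q @ K.T / np.sqrt(d_k)
    S = S - S.max(axis=-1, keepdims=True)  # stabilizes exp() without changing the softmax
    A = np.exp(S)
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
seq_len, d_model, d_k, d_v = 6, 16, 8, 8   # e.g. the six tokens of "The cat sat on the mat"
X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

out, A = self_attention_with_weights(X, W_q, W_k, W_v)
print(out.shape)        # (6, 8)
print(A.shape)          # (6, 6)
print(A.sum(axis=-1))   # each row sums to 1 (up to floating-point error)
```

Row i of `A` tells you how much token i draws from every token in the sequence, including itself.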

2.4 Why This Changes Everything

Notice what happened: every token directly interacts with every other token in a single operation. There's no sequential dependency. The parallelism is baked into the matrix multiplication.

And there's no distance penalty — token #1 and token #1000 have exactly the same ability to attend to each other. The long-range forgetting problem? Gone.

3. Multi-Head Attention: Many Perspectives

A single attention head produces one set of attention weights per token, so it can only focus on one kind of relationship at a time. But real language has many: syntax, semantics, coreference, positional relationships, etc.

Multi-head attention runs multiple attention operations in parallel, each with its own Q, K, V projections:

Head 1:  "Which words are verbs?"  
Head 2:  "Which noun does this pronoun refer to?"
Head 3:  "What's the subject of this sentence?"
... and so on (typically 8–16 heads)

class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # One projection matrix per head (concatenated for efficiency)
        self.W_q = np.random.randn(d_model, d_model) * 0.01
        self.W_k = np.random.randn(d_model, d_model) * 0.01
        self.W_v = np.random.randn(d_model, d_model) * 0.01
        self.W_o = np.random.randn(d_model, d_model) * 0.01

    def split_heads(self, X):
        """(seq_len × d_model) → (num_heads × seq_len × d_k)"""
        seq_len = X.shape[0]
        X = X.reshape(seq_len, self.num_heads, self.d_k)
        return X.transpose(1, 0, 2)  # (num_heads, seq_len, d_k)

    def __call__(self, X):
        # Project to Q, K, V (all heads at once)
        Q = X @ self.W_q  # (seq_len × d_model)
        K = X @ self.W_k
        V = X @ self.W_v

        # Split into heads
        Q = self.split_heads(Q)  # (num_heads × seq_len × d_k)
        K = self.split_heads(K)
        V = self.split_heads(V)

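        # Attention per head (heads are independent; the loop is for clarity and could be batched)
        outputs = []
        for h in range(self.num_heads):
            S = Q[h] @ K[h].T / np.sqrt(self.d_k)
            S = S - S.max(axis=-1, keepdims=True)  # stabilize exp()
            A = np.exp(S) / np.exp(S).sum(axis=-1, keepdims=True)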
            head_output = A @ V[h]
            outputs.append(head_output)

        # Concatenate heads
        concat = np.concatenate(outputs, axis=-1)  # (seq_len × d_model)
        return concat @ self.W_o  # Final projection

The outputs of all heads are concatenated and projected one more time. This lets the model simultaneously attend to different types of relationships — something no single RNN state could do.
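The class above loops over heads for readability. Because the heads are independent, the loop can be replaced by one batched matrix multiplication. Here is a sketch of that idea; the function name `multi_head_attention` and the random weights are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Batched multi-head self-attention without a Python loop over heads."""
    seq_len, d_model = X.shape
    d_k = d_model // num_heads

    def split(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return M.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)

    # The @ operator broadcasts over the leading (head) axis
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (num_heads, seq_len, seq_len)
    A = softmax(S)
    heads = A @ V                                  # (num_heads, seq_len, d_k)

    # (num_heads, seq_len, d_k) -> (seq_len, d_model); head h fills columns h*d_k:(h+1)*d_k
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, num_heads = 16, 4
X = rng.normal(size=(5, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (5, 16)
```

The result matches the looped version; the batched form simply hands all heads to the matrix-multiply routine at once, which is how real implementations get their parallelism.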

4. Positional Encoding: Putting Things in Order

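Here's an important catch: self-attention has no built-in notion of order (formally, it is permutation-equivariant). Shuffle the input tokens and every token gets exactly the same output vector as before, just in its new position, because the dot products between queries and keys don't depend on where the tokens sit in the sequence.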

But "the cat sat on the mat" and "the mat sat on the cat" have very different meanings. We need to inject positional information.
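You can check this order-blindness numerically. The sketch below uses a compact single-head attention function and random inputs, both assumptions for illustration, and compares the output for a shuffled sequence with the shuffled output of the original sequence.

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    S = Q @ K.T / np.sqrt(W_q.shape[1])
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

perm = np.array([4, 2, 0, 3, 1])                   # an arbitrary reordering of the 5 tokens
out = attention(X, W_q, W_k, W_v)
out_shuffled = attention(X[perm], W_q, W_k, W_v)

print(np.allclose(out_shuffled, out[perm]))  # True
```

Each token ends up with the same vector regardless of where it appears, so without extra information the model cannot tell the two word orders apart.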

The original Transformer uses sinusoidal positional encoding:

def positional_encoding(seq_len, d_model):
    pe = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            pe[pos, i] = np.sin(pos / (10000 ** (i / d_model)))
            pe[pos, i + 1] = np.cos(pos / (10000 ** (i / d_model)))
    return pe

# Shape: (seq_len × d_model), so just add it to the input embeddings X
seq_len, d_model = X.shape
X_with_position = X + positional_encoding(seq_len, d_model)

Why sines and cosines? Two elegant properties:

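Relative positions emerge naturally: for a fixed offset k, PE(pos+k) is a linear function of PE(pos), so the model can learn relative position relationships.
Defined for any position: unlike a learned embedding table, which has a fixed number of rows, sinusoidal encodings can be computed for positions beyond anything seen in training. The original paper hypothesized that this would help the model extrapolate to longer sequences, though in practice quality often degrades well past the training length.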
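The "linear function" claim can be verified directly. Each sin/cos pair rotates by a fixed angle when the position shifts by k, so one block-diagonal matrix maps PE(pos) to PE(pos+k) for every pos. The sketch below assumes an even `d_model`; the helper name `shift_matrix` is hypothetical.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Vectorized sinusoidal encoding; assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]       # even feature indices
    angles = pos / (10000 ** (i / d_model))     # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def shift_matrix(k, d_model):
    """Block-diagonal matrix M_k with PE(pos + k) = M_k @ PE(pos) for every pos."""
    M = np.zeros((d_model, d_model))
    for i in range(0, d_model, 2):
        w = 1.0 / (10000 ** (i / d_model))      # frequency of this sin/cos pair
        c, s = np.cos(w * k), np.sin(w * k)
        M[i:i + 2, i:i + 2] = [[c, s], [-s, c]]
    return M

pe = positional_encoding(seq_len=50, d_model=16)
M = shift_matrix(k=7, d_model=16)
print(np.allclose(pe[7:], pe[:-7] @ M.T))  # True
```

The matrix depends only on the offset k, not on the absolute position, which is what makes relative positions easy for the model to pick up.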

Many later models use other schemes: BERT and GPT-2 use learned positional embeddings, and Llama uses rotary position encoding (RoPE). The principle is the same: inject order information into a permutation-blind mechanism.

5. Putting It Together: The Transformer Block

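A single Transformer block = Multi-Head Attention + Feed-Forward Network + Residual Connections + LayerNorm. The layout below is the pre-norm variant (LayerNorm before each sublayer), which most recent LLMs use; the original 2017 paper applied LayerNorm after each residual addition (post-norm).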

Input → [LayerNorm → Multi-Head Attention → Add (residual)] → [LayerNorm → FFN → Add (residual)]

Each component serves a purpose:

Component	What It Does	Why It Matters
Multi-Head Attention	Lets tokens exchange information	The core reasoning mechanism
Feed-Forward Network	Two linear layers with ReLU/GELU	Adds per-token computation depth
Residual Connection	`Output = Layer(x) + x`	Gradients flow directly through, enabling very deep models
LayerNorm	Normalizes across features	Stabilizes training, reduces sensitivity to initialization

class TransformerBlock:
    def __init__(self, d_model, num_heads, d_ff):
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)

    def __call__(self, x):
        # Self-attention + residual
        attn_out = self.attention(self.norm1(x))
        x = x + attn_out

        # FFN + residual
        ffn_out = self.ffn(self.norm2(x))
        x = x + ffn_out

        return x

class FeedForward:
    def __init__(self, d_model, d_ff):
        self.W1 = np.random.randn(d_model, d_ff) * 0.01
        self.W2 = np.random.randn(d_ff, d_model) * 0.01

    def __call__(self, x):
        return np.maximum(0, x @ self.W1) @ self.W2  # ReLU
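The `TransformerBlock` above relies on a `LayerNorm` class that isn't defined anywhere in this article. Here is a minimal sketch of one, assuming the usual formulation with a learnable scale (`gamma`) and shift (`beta`); the `eps` value is a common default, not something the article specifies.

```python
import numpy as np

class LayerNorm:
    """Layer normalization over the last (feature) axis, a minimal sketch."""
    def __init__(self, d_model, eps=1e-5):
        self.gamma = np.ones(d_model)    # learnable scale, initialized to 1
        self.beta = np.zeros(d_model)    # learnable shift, initialized to 0
        self.eps = eps                   # avoids division by zero

    def __call__(self, x):
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta

x = np.random.default_rng(0).normal(loc=3.0, scale=5.0, size=(4, 64))
y = LayerNorm(64)(x)
print(y.mean(axis=-1))  # close to 0 for every token
print(y.std(axis=-1))   # close to 1 for every token
```

Note that each token is normalized on its own, across its features, so the result doesn't depend on the other tokens or on the batch.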

Stack 6, 12, or 96 of these blocks, and you get a Transformer.

6. The Big Picture: Encoder vs. Decoder

The original Transformer has two stacks:

Encoder (6 blocks, bidirectional):

Each token attends to ALL tokens (past and future)
Used for understanding tasks: classification, sentiment, NER
BERT uses only the encoder

Decoder (6 blocks, autoregressive):

Each token attends ONLY to itself and previous tokens (masked attention)
Used for generation tasks: translation, text generation
GPT uses only the decoder
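Masked attention is a small change to the attention code: set the scores for future positions to negative infinity before the softmax. A sketch, assuming a single head and random inputs; the function name is illustrative.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention where position i only sees positions <= i."""
    seq_len, d_k = Q.shape
    S = Q @ K.T / np.sqrt(d_k)
    # True above the diagonal marks "future" positions
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    S = np.where(future, -np.inf, S)   # exp(-inf) = 0, so future weights vanish
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, A = causal_self_attention(Q, K, V)
print(np.round(A, 2))  # lower-triangular: zeros above the diagonal
```

The first token can only attend to itself, the second to the first two, and so on, which is what lets the decoder be trained on whole sequences without peeking at the answer.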

Encoder-Decoder Architecture (the original):

Encoder processes the input; Decoder generates output while attending to encoder's representations
Used for translation, summarization
T5 uses this

┌──────────────────┐     ┌──────────────────┐
│    Encoder       │     │    Decoder       │
│  (bidirectional) │     │ (autoregressive) │
│                  │     │                  │
│  Block 6         │     │  Block 6         │
│  Block 5         │     │  Block 5         │
│  ...             │     │  ...             │
│  Block 1         │     │  Block 1         │
│                  │     │                  │
│  Input Embedding │     │  Output Embedding│
│      + PE        │     │      + PE        │
└────────┬─────────┘     └────────┬──────────┘
         │                        │
         └────── Cross-Attn ──────┘
                  (Decoder attends
                   to Encoder output)

The cross-attention in the decoder is what lets the model "look at" the input while generating output: for example, producing "Je suis étudiant" one token at a time while reading "I am a student."
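Cross-attention uses the same math as self-attention. The only difference is where the inputs come from: queries from the decoder, keys and values from the encoder output. A single-head sketch, with illustrative names and sizes:

```python
import numpy as np

def cross_attention(dec_x, enc_out, W_q, W_k, W_v):
    """Queries come from the decoder; keys and values come from the encoder output."""
    Q = dec_x @ W_q                      # (tgt_len, d_k)
    K = enc_out @ W_k                    # (src_len, d_k)
    V = enc_out @ W_v                    # (src_len, d_v)
    S = Q @ K.T / np.sqrt(Q.shape[-1])   # (tgt_len, src_len)
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V                         # (tgt_len, d_v)

rng = np.random.default_rng(0)
d_model = 16
enc_out = rng.normal(size=(4, d_model))  # e.g. 4 source tokens
dec_x = rng.normal(size=(3, d_model))    # e.g. 3 target tokens generated so far
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
result = cross_attention(dec_x, enc_out, W_q, W_k, W_v)
print(result.shape)  # (3, 16)
```

The attention matrix here is rectangular (target length × source length), and no causal mask is needed because the whole input sentence is already known.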

7. Why Transformers Won (And What They Cost)

What We Gained

Feature	RNN/LSTM	Transformer
Parallelism	❌ Sequential	✅ Full parallel
Long-range (1K+ tokens)	❌ Forgets	✅ No distance penalty
Training speed (wall-clock)	Slow	Often much faster on GPUs/TPUs
Scaling	Limited	Up to trillions of params

The Price We Pay

O(n²) complexity. Every token attends to every token. For a 1,000-token sequence, that's 1M attention pairs. For 100,000 tokens (roughly a short book), it's 10 billion pairs per head per layer, which is prohibitively expensive to compute and store naively.
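The arithmetic is easy to reproduce. This sketch counts the entries of one attention matrix and estimates its size, assuming 2 bytes per entry (16-bit floats) and counting a single head in a single layer.

```python
def attention_matrix_cost(seq_len, bytes_per_entry=2):
    """Entries and bytes of one seq_len x seq_len attention matrix (one head, one layer)."""
    entries = seq_len * seq_len
    return entries, entries * bytes_per_entry

for n in (1_000, 10_000, 100_000):
    entries, nbytes = attention_matrix_cost(n)
    print(f"{n:>7} tokens: {entries:>14,} pairs, {nbytes / 1e9:.3f} GB in 16-bit floats")
```

Multiply by the number of heads and layers, and it's clear why storing the full matrix stops being practical for long sequences.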

This is why modern optimizations exist:

Sparse attention (only attend to local + a few global tokens)
Sliding window attention (Mistral, Gemma)
FlashAttention (hardware-efficient attention that avoids materializing the full matrix)
KV-cache (reuse computed keys/values during generation)
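As one example from this list, here is a sketch of the KV-cache idea for a single attention head. The `KVCache` class and its `step` method are hypothetical names for illustration; production implementations preallocate memory and handle batches, heads, and layers.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Keeps the keys and values of all previous tokens for one attention head."""
    def __init__(self, d_k, d_v):
        self.K = np.zeros((0, d_k))
        self.V = np.zeros((0, d_v))

    def step(self, x_new, W_q, W_k, W_v):
        """Attend from the newest token only, reusing cached keys/values."""
        q = x_new @ W_q
        self.K = np.vstack([self.K, x_new @ W_k])   # project only the new token
        self.V = np.vstack([self.V, x_new @ W_v])
        scores = self.K @ q / np.sqrt(q.shape[-1])
        return softmax(scores) @ self.V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

cache = KVCache(d_k=8, d_v=8)
for x_t in X:
    out_t = cache.step(x_t, W_q, W_k, W_v)

# Reference: recompute everything for the last token from scratch
Q, K, V = X @ W_q, X @ W_k, X @ W_v
full = softmax(Q[-1] @ K.T / np.sqrt(8)) @ V
print(np.allclose(out_t, full))  # True
```

Each generation step now projects one new token instead of the whole prefix. The attention over past tokens still grows with sequence length, but the redundant recomputation is gone.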

8. From Here to GPT: What Changed

The Transformer you just built is the foundation. Modern LLMs add:

Scale: GPT-3 has 175B params and 96 layers. GPT-4's size has not been officially disclosed; unconfirmed reports suggest roughly 1.8T params in a mixture-of-experts design
Pre-training + Fine-tuning: learn from internet text, then specialize
RLHF: align outputs with human preferences (part of how assistants such as Claude are trained to be helpful, honest, and harmless)
Architecture tweaks: GQA (grouped query attention), RoPE, SwiGLU, RMSNorm instead of LayerNorm
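To give a feel for how small some of these tweaks are, here is a sketch of RMSNorm. It assumes the common formulation with a learnable scale `gamma` and no mean subtraction or shift; the `eps` value is an illustrative default.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square of the features (no mean subtraction)."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.random.default_rng(0).normal(loc=3.0, scale=5.0, size=(4, 64))
y = rms_norm(x, gamma=np.ones(64))
print(np.sqrt(np.mean(y ** 2, axis=-1)))  # close to 1 for every token
```

Compared with LayerNorm, it skips computing the mean and the shift term, which saves a little work per token.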

But the core innovation — every token directly attends to every other token in parallel — remains unchanged since 2017. If you understand self-attention, you understand the engine driving the AI revolution.

9. Play With It Yourself

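Here's a minimal Transformer skeleton for character-level text generation. It reuses the TransformerBlock class from section 5 and covers only the forward pass and sampling; positional encoding, causal masking, and training are left out to keep it short: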

import numpy as np

def softmax(x): 
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class TinyTransformer:
    def __init__(self, vocab_size, d_model=64, num_heads=4):
        self.embed = np.random.randn(vocab_size, d_model) * 0.01
        self.blocks = [TransformerBlock(d_model, num_heads, d_model*4) for _ in range(3)]
        self.output = np.random.randn(d_model, vocab_size) * 0.01

    def generate(self, token, length=100, temperature=1.0):
        tokens = [token]
        for _ in range(length):
            x = self.embed[tokens[-64:]]  # last 64 tokens context
            for block in self.blocks:
                x = block(x)
            logits = x[-1] @ self.output / temperature
            probs = softmax(logits)
            token = np.random.choice(len(probs), p=probs)
            tokens.append(token)
        return tokens

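As written, the model has random weights and no backward pass, so it will generate noise. To train it, port the same structure to a framework with automatic differentiation (such as PyTorch or JAX), add positional encoding and a causal mask, and fit it on a ~1MB text file. A model this small can start picking up spelling and word structure fairly quickly.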

The Bottom Line

Self-attention lets every token directly interact with every other token — no sequential bottleneck, no distance decay
Multi-head attention captures different relationship types simultaneously
Positional encoding tells the model about order (since attention is permutation-blind)
Residual connections + LayerNorm enable stacking dozens of layers
The combination is so powerful it has largely replaced RNNs for language, competes with CNNs in vision, and now extends to audio and video

The Transformer is 8 years old, and we're still discovering what it can do. The age of attention is just getting started.

Found this helpful? I write deep-dive tutorials on LLMs, RAG systems, and production AI. Follow me for more content that actually builds understanding, not just surface-level overviews.

I also maintain an Interview Guide with 300+ AI/ML system design questions covering Transformers, RAG, Agents, and production deployment patterns — designed to help you bridge the gap between theory and real-world system design.
