DEV Community: 马国锦

Transformers From Scratch: How Attention Really Works (With Visuals & Code)

马国锦 — Mon, 15 Jun 2026 02:02:36 +0000

In 2017, a paper titled "Attention Is All You Need" changed the trajectory of deep learning. The Transformer architecture it introduced isn't just the backbone of GPT, BERT, Claude, and every major LLM today — it's also the foundation for vision models (ViT), audio models (Whisper), and multimodal architectures.

But here's the catch: most tutorials skip the "why" and jump straight to the "how." You get a diagram of Q, K, V, a few equations, and a "just trust me, it works."

This article takes the opposite approach. We'll build the Transformer from first principles — starting with the problem it solves, then layering each component one by one, with visuals and runnable code.

1. The Problem: Why RNNs Hit a Wall

Before Transformers, sequence modeling meant Recurrent Neural Networks (RNNs), LSTMs, and GRUs. These architectures process tokens one at a time:

Input:  "The cat sat on the ..."
RNN:    h₀ → h₁ → h₂ → h₃ → h₄ → ...
        (sequential — each step waits for the previous one)

This has two fundamental limitations:

① Sequential bottleneck. Token #100 can't be processed until tokens #1–99 are done. No parallelism = slow training, especially on GPUs that thrive on parallel computation.

② Long-range forgetting. In theory, an LSTM can remember information for hundreds of steps. In practice, by step 50, the signal from step 1 has degraded significantly. The model struggles to connect "She was born in Paris" with "She speaks fluent ___" when there are 30 tokens in between.

These problems aren't minor engineering issues — they're architectural ceilings. Transformers solve both simultaneously, and the key insight is deceptively simple:

Every token should be able to directly look at every other token — in one step.

2. The Core Innovation: Self-Attention

Let's build self-attention from scratch, step by step.

2.1 What Are We Trying to Do?

Given a sequence of input vectors:

x₁ = "The"    →  vector
x₂ = "cat"    →  vector
x₃ = "sat"    →  vector
x₄ = "on"     →  vector
x₅ = "the"    →  vector
x₆ = "mat"    →  vector

We want to produce a new set of vectors where each one contains context from the entire sequence. For example, the new vector for "mat" should know that it's a noun being sat on by a cat.

2.2 The Query-Key-Value Mechanism

Self-attention works by asking three questions for each token:

Query (Q): "What am I looking for?"
Key (K): "What do I contain?"
Value (V): "If matched, what information should I pass along?"

For each token in the sequence, we:

Compute its Query vector
Compare it against every token's Key vector (including itself)
Use the similarity scores to weight each token's Value vector
Sum the weighted values — that's the new representation

Think of it like a library: you walk in with a query ("books about neural networks"), the librarian checks the keys (book titles/tags), and hands you the values (the actual books).

2.3 The Math (It's Simpler Than You Think)

Let X be our input matrix (sequence length × embedding dimension).

Step 1: Compute Q, K, V via learned weight matrices:

Q = X · W_Q      (queries)
K = X · W_K      (keys)  
V = X · W_V      (values)

Step 2: Compute attention scores — dot product of every query with every key:

S = Q · Kᵀ       (shape: seq_len × seq_len)

Step 3: Scale and normalize with softmax:

S_scaled = S / √d_k
A = softmax(S_scaled, dim=-1)

The √d_k scaling prevents the dot products from growing too large (which would push softmax into regions with extremely small gradients).
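To see why the scaling matters, here is a small sketch. It assumes query and key entries are independent with zero mean and unit variance, which is an idealization of real activations. Under that assumption the raw dot product has a standard deviation of about √d_k, and dividing by √d_k brings it back to about 1. The helper name `dot_product_std` is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, n_samples=5_000):
    """Empirical std of q·k for random unit-variance vectors of size d_k."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    return (q * k).sum(axis=1).std()

for d_k in (4, 64, 512):
    raw = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw:6.2f}  scaled std={raw / np.sqrt(d_k):.2f}")
```

The raw column grows with d_k while the scaled column stays near 1, which keeps the softmax inputs in a range where gradients are still useful.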

Step 4: Weighted sum of values:

Output = A · V

That's it. Four matrix operations. Here's the code:

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """
    X:    input (seq_len × d_model)
    W_q:  query weight (d_model × d_k)
    W_k:  key weight (d_model × d_k)
    W_v:  value weight (d_model × d_v)
    """
    d_k = W_q.shape[1]

    # Step 1: Project to Q, K, V
    Q = X @ W_q   # (seq_len × d_k)
    K = X @ W_k   # (seq_len × d_k)
    V = X @ W_v   # (seq_len × d_v)

    # Step 2: Compute attention scores
    S = Q @ K.T   # (seq_len × seq_len)

    # Step 3: Scale + softmax
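    scaled = S / np.sqrt(d_k)
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # subtract row max so exp() cannot overflow
    A = np.exp(scaled)
    A = A / A.sum(axis=-1, keepdims=True)  # softmax over each row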

    # Step 4: Weighted sum of values
    return A @ V  # (seq_len × d_v)
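A quick usage sketch may help here. The function below repeats the same four steps but also returns the attention matrix A so its rows can be inspected. The dimensions and the random weights are arbitrary choices for illustration, and `attention_with_weights` is a name invented for this example.

```python
import numpy as np

def attention_with_weights(X, W_q, W_k, W_v):
    """Same four steps as above, returning (output, attention matrix)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(W_q.shape[1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 16, 8
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))

out, A = attention_with_weights(X, W_q, W_k, W_v)
print(out.shape)        # (6, 8): one d_v-sized vector per token
print(A.sum(axis=-1))   # every row of A sums to 1
```

Row i of A is the set of weights token i places on every token in the sequence, itself included.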

2.4 Why This Changes Everything

Notice what happened: every token directly interacts with every other token in a single operation. There's no sequential dependency. The parallelism is baked into the matrix multiplication.

And there's no distance penalty — token #1 and token #1000 have exactly the same ability to attend to each other. The long-range forgetting problem? Gone.

3. Multi-Head Attention: Many Perspectives

A single attention head produces just one set of attention weights per token, so it tends to capture one kind of relationship at a time. But real language has many: syntax, semantics, coreference, positional relationships, etc.

Multi-head attention runs multiple attention operations in parallel, each with its own Q, K, V projections:

Head 1:  "Which words are verbs?"  
Head 2:  "Which noun does this pronoun refer to?"
Head 3:  "What's the subject of this sentence?"
... and so on (typically 8–16 heads)

class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # One projection matrix per head (concatenated for efficiency)
        self.W_q = np.random.randn(d_model, d_model) * 0.01
        self.W_k = np.random.randn(d_model, d_model) * 0.01
        self.W_v = np.random.randn(d_model, d_model) * 0.01
        self.W_o = np.random.randn(d_model, d_model) * 0.01

    def split_heads(self, X):
        """(seq_len × d_model) → (num_heads × seq_len × d_k)"""
        seq_len = X.shape[0]
        X = X.reshape(seq_len, self.num_heads, self.d_k)
        return X.transpose(1, 0, 2)  # (num_heads, seq_len, d_k)

    def __call__(self, X):
        # Project to Q, K, V (all heads at once)
        Q = X @ self.W_q  # (seq_len × d_model)
        K = X @ self.W_k
        V = X @ self.W_v

        # Split into heads
        Q = self.split_heads(Q)  # (num_heads × seq_len × d_k)
        K = self.split_heads(K)
        V = self.split_heads(V)

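        # Attention per head (a Python loop for clarity; the heads are independent, so real implementations batch them)
        outputs = []
        for h in range(self.num_heads):
            S = Q[h] @ K[h].T / np.sqrt(self.d_k)
            S = S - S.max(axis=-1, keepdims=True)  # stabilize softmax against overflow
            A = np.exp(S) / np.exp(S).sum(axis=-1, keepdims=True)
            head_output = A @ V[h]
            outputs.append(head_output)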

        # Concatenate heads
        concat = np.concatenate(outputs, axis=-1)  # (seq_len × d_model)
        return concat @ self.W_o  # Final projection

The outputs of all heads are concatenated and projected one more time. This lets the model simultaneously attend to different types of relationships — something no single RNN state could do.
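The per-head loop in the class is written for readability. As a sketch of how the same computation can be batched, NumPy's `@` operator broadcasts over a leading head axis, so all heads run in one call. The function name and the sizes below are illustrative assumptions, not part of the class above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_k = d_model // num_heads

    def split(M):  # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return M.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, seq, seq)
    heads = softmax(scores) @ V                        # (heads, seq, d_k)
    # Move the head axis back next to d_k and flatten: same as concatenating heads
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 32))
W_q, W_k, W_v, W_o = (rng.standard_normal((32, 32)) * 0.1 for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=4)
print(out.shape)  # (5, 32)
```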

4. Positional Encoding: Putting Things in Order

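Here's an important catch: self-attention has no built-in notion of order. Strictly speaking it is permutation-equivariant: feed it "mat sat cat" instead of "cat sat mat" and each token gets exactly the same output vector as before, just in its shuffled position, because dot products don't know about order.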

But "the cat sat on the mat" and "the mat sat on the cat" have very different meanings. We need to inject positional information.
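This order-blindness is easy to check numerically. The sketch below uses random inputs and weights (arbitrary sizes, chosen only for the demo), shuffles the input rows, and confirms that the outputs are the same vectors in shuffled order.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    S = Q @ K.T / np.sqrt(W_q.shape[1])
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    return (A / A.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 16))
W_q, W_k, W_v = (rng.standard_normal((16, 8)) for _ in range(3))

perm = rng.permutation(6)                      # a random reordering of the 6 tokens
out = self_attention(X, W_q, W_k, W_v)
out_shuffled = self_attention(X[perm], W_q, W_k, W_v)

# Shuffling the inputs only shuffles the outputs; no vector changes.
same_up_to_order = np.allclose(out[perm], out_shuffled)
print(same_up_to_order)
```

Without some positional signal, the model has no way to tell the two orderings apart.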

The original Transformer uses sinusoidal positional encoding:

def positional_encoding(seq_len, d_model):
    pe = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            pe[pos, i] = np.sin(pos / (10000 ** (i / d_model)))
            pe[pos, i + 1] = np.cos(pos / (10000 ** (i / d_model)))
    return pe

# Shape: (seq_len × d_model) — just add it to the input embeddings
X_with_position = X + positional_encoding(seq_len, d_model)

Why sines and cosines? Two elegant properties:

Relative positions emerge naturally: PE(pos+k) can be expressed as a linear function of PE(pos), so the model can learn relative position relationships.
Unbounded sequence length: unlike a learned embedding table, the sinusoidal formula is defined for any position, so the model can at least be run on sequences longer than anything seen in training (how well it generalizes there is a separate question).
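The first property can be verified directly. For each sine/cosine pair with frequency ω, shifting the position by k is a 2×2 rotation by the angle ωk, so one block-diagonal matrix maps PE(pos) to PE(pos+k) for every pos. The sketch below assumes an even d_model and uses a vectorized version of the encoding; `shift_matrix` is a helper invented for this check.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)   # even columns
    pe[:, 1::2] = np.cos(pos * freqs)   # odd columns
    return pe

def shift_matrix(k, d_model):
    """Block-diagonal M such that PE(pos + k) = M @ PE(pos) for every pos."""
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    M = np.zeros((d_model, d_model))
    for j, w in enumerate(freqs):
        c, s = np.cos(w * k), np.sin(w * k)
        M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return M

pe = positional_encoding(seq_len=50, d_model=16)
M = shift_matrix(k=7, d_model=16)
matches = np.allclose(pe[10 + 7], M @ pe[10])
print(matches)
```

Note that M depends only on the offset k, not on the starting position, which is what makes relative offsets learnable.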

Most later models replace the sinusoids: BERT and GPT-2 use learned positional embeddings, and Llama uses rotary position embeddings (RoPE). The principle is the same: inject order information into a permutation-blind mechanism.

5. Putting It Together: The Transformer Block

A single Transformer block = Multi-Head Attention + Feed-Forward Network + Residual Connections + LayerNorm.

Input → [LayerNorm → Multi-Head Attention → Add (residual)] → [LayerNorm → FFN → Add (residual)]

Each component serves a purpose:

Component	What It Does	Why It Matters
Multi-Head Attention	Lets tokens exchange information	The core reasoning mechanism
Feed-Forward Network	Two linear layers with ReLU/GELU	Adds per-token computation depth
Residual Connection	`Output = Layer(x) + x`	Gradients flow directly through, enabling very deep models
LayerNorm	Normalizes across features	Stabilizes training, reduces sensitivity to initialization

class TransformerBlock:
    def __init__(self, d_model, num_heads, d_ff):
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)

    def __call__(self, x):
        # Self-attention + residual
        attn_out = self.attention(self.norm1(x))
        x = x + attn_out

        # FFN + residual
        ffn_out = self.ffn(self.norm2(x))
        x = x + ffn_out

        return x

class FeedForward:
    def __init__(self, d_model, d_ff):
        self.W1 = np.random.randn(d_model, d_ff) * 0.01
        self.W2 = np.random.randn(d_ff, d_model) * 0.01

    def __call__(self, x):
        return np.maximum(0, x @ self.W1) @ self.W2  # ReLU
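The block above uses a LayerNorm class that the article never defines. A minimal sketch, assuming the standard formulation with a learnable scale and shift (the names `gamma`, `beta`, and `eps` are choices made here), could look like this:

```python
import numpy as np

class LayerNorm:
    def __init__(self, d_model, eps=1e-5):
        self.gamma = np.ones(d_model)    # learnable scale, initialized to 1
        self.beta = np.zeros(d_model)    # learnable shift, initialized to 0
        self.eps = eps                   # avoids division by zero

    def __call__(self, x):
        # Normalize each token's feature vector independently
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta

x = np.random.default_rng(0).standard_normal((4, 8)) * 5 + 3
normed = LayerNorm(8)(x)
print(normed.mean(axis=-1).round(6))  # close to 0 for every token
print(normed.std(axis=-1).round(3))   # close to 1 for every token
```

With this class defined before TransformerBlock is instantiated, the block's forward pass runs end to end.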

Stack 6, 12, or 96 of these blocks, and you get a Transformer.

6. The Big Picture: Encoder vs. Decoder

The original Transformer has two stacks:

Encoder (6 blocks, bidirectional):

Each token attends to ALL tokens (past and future)
Used for understanding tasks: classification, sentiment, NER
BERT uses only the encoder

Decoder (6 blocks, autoregressive):

Each token attends ONLY to itself and previous tokens (masked attention)
Used for generation tasks: translation, text generation
GPT uses only the decoder
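Masked attention can be sketched in a few lines: scores for future positions are set to negative infinity before the softmax, so their weights come out as exactly zero. The function name and the sizes are illustrative assumptions.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True above the diagonal: position i must not see positions j > i
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
out, weights = causal_self_attention(Q, K, V)
print(np.round(weights, 2))  # lower-triangular: zeros above the diagonal
```

The first token can only attend to itself, so its row is a single 1 followed by zeros.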

Encoder-Decoder Architecture (the original):

Encoder processes the input; Decoder generates output while attending to encoder's representations
Used for translation, summarization
T5 uses this

┌──────────────────┐     ┌──────────────────┐
│    Encoder       │     │    Decoder       │
│  (bidirectional) │     │ (autoregressive) │
│                  │     │                  │
│  Block 6         │     │  Block 6         │
│  Block 5         │     │  Block 5         │
│  ...             │     │  ...             │
│  Block 1         │     │  Block 1         │
│                  │     │                  │
│  Input Embedding │     │  Output Embedding│
│      + PE        │     │      + PE        │
└────────┬─────────┘     └────────┬─────────┘
         │                        │
         └────── Cross-Attn ──────┘
                  (Decoder attends
                   to Encoder output)

The cross-attention in the decoder is what lets the model "look at" the input while generating output: producing "Je suis étudiant" one token at a time while reading "I am a student."
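Mechanically, cross-attention is the same formula as self-attention with one change: queries come from the decoder, while keys and values come from the encoder output. A minimal sketch, with made-up sizes (5 source tokens, 3 target tokens):

```python
import numpy as np

def cross_attention(decoder_x, encoder_out, W_q, W_k, W_v):
    """Queries from the decoder; keys and values from the encoder."""
    Q = decoder_x @ W_q                        # (tgt_len, d_k)
    K = encoder_out @ W_k                      # (src_len, d_k)
    V = encoder_out @ W_v                      # (src_len, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (tgt_len, src_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_k = 16, 8
encoder_out = rng.standard_normal((5, d_model))   # encoded source sentence
decoder_x = rng.standard_normal((3, d_model))     # target tokens generated so far
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))

out, weights = cross_attention(decoder_x, encoder_out, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (3, 8) (3, 5)
```

The attention matrix is no longer square: each target position gets one weight per source token.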

7. Why Transformers Won (And What They Cost)

What We Gained

Feature	RNN/LSTM	Transformer
Parallelism	❌ Sequential	✅ Full parallel
Long-range (1K+ tokens)	❌ Forgets	✅ No distance penalty
Training speed (wall-clock)	Slow	3-5x faster
Scaling	Limited	Up to trillions of params

The Price We Pay

O(n²) complexity. Every token attends to every token. For a 1,000-token sequence, that's 1M attention pairs per head, per layer. For 100,000 tokens (roughly a short book), it's 10 billion, which is impractical with a naive implementation that stores the full attention matrix.
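The arithmetic is worth doing once. This sketch assumes the full attention matrix is stored in float32 (4 bytes per score) and counts a single head in a single layer; real models multiply this by heads, layers, and batch size.

```python
def attention_matrix_cost(seq_len, bytes_per_score=4):
    """Return (number of attention pairs, bytes to store them)."""
    pairs = seq_len ** 2
    return pairs, pairs * bytes_per_score

for n in (1_000, 100_000):
    pairs, nbytes = attention_matrix_cost(n)
    print(f"{n:>7,} tokens: {pairs:,} pairs, {nbytes / 1e9:.3f} GB")
```

At 1,000 tokens the matrix takes 4 MB; at 100,000 tokens it takes 40 GB for one head in one layer.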

This is why modern optimizations exist:

Sparse attention (only attend to local + a few global tokens)
Sliding window attention (Mistral, Gemma)
FlashAttention (hardware-efficient attention that avoids materializing the full matrix)
KV-cache (reuse computed keys/values during generation)

8. From Here to GPT: What Changed

The Transformer you just built is the foundation. Modern LLMs add:

Scale: GPT-3 has 175B params and 96 layers. GPT-4's size has not been officially disclosed; unconfirmed reports put it around 1.8T params in a mixture-of-experts design (8×220B experts)
Pre-training + Fine-tuning: Learn from internet text, then specialize
RLHF: Align outputs with human preferences (what makes Claude helpful, honest, and harmless)
Architecture tweaks: GQA (grouped query attention), RoPE, SwiGLU, RMSNorm instead of LayerNorm

But the core innovation — every token directly attends to every other token in parallel — remains unchanged since 2017. If you understand self-attention, you understand the engine driving the AI revolution.

9. Play With It Yourself

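Here's a minimal character-level generation loop (about 20 lines). It reuses the TransformerBlock class from section 5, which in turn needs a LayerNorm implementation, and it starts from random weights: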

import numpy as np

def softmax(x): 
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class TinyTransformer:
    def __init__(self, vocab_size, d_model=64, num_heads=4):
        self.embed = np.random.randn(vocab_size, d_model) * 0.01
        self.blocks = [TransformerBlock(d_model, num_heads, d_model*4) for _ in range(3)]
        self.output = np.random.randn(d_model, vocab_size) * 0.01

    def generate(self, token, length=100, temperature=1.0):
        tokens = [token]
        for _ in range(length):
            x = self.embed[tokens[-64:]]  # last 64 tokens context
            for block in self.blocks:
                x = block(x)
            logits = x[-1] @ self.output / temperature
            probs = softmax(logits)
            token = np.random.choice(len(probs), p=probs)
            tokens.append(token)
        return tokens

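To train it you still need gradients. This NumPy version has only a forward pass, so port the same structure to an autodiff framework such as PyTorch or JAX, or derive the backward pass by hand. For training, also add a causal mask and positional encodings, which this minimal loop omits. On a small text file (around 1MB), a model this size can start picking up character-level structure within minutes.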

The Bottom Line

Self-attention lets every token directly interact with every other token — no sequential bottleneck, no distance decay
Multi-head attention captures different relationship types simultaneously
Positional encoding tells the model about order (since attention is permutation-blind)
Residual connections + LayerNorm enable stacking dozens of layers
The combination is powerful enough that it has largely replaced RNNs for sequence modeling, rivals CNNs in vision, and now extends to audio and video

The Transformer is 9 years old, and we're still discovering what it can do. The age of attention is just getting started.

Found this helpful? I write deep-dive tutorials on LLMs, RAG systems, and production AI. Follow me for more content that actually builds understanding, not just surface-level overviews.

I also maintain an Interview Guide with 300+ AI/ML system design questions covering Transformers, RAG, Agents, and production deployment patterns — designed to help you bridge the gap between theory and real-world system design.

AE Sentinel – AI-powered adverse event monitoring for clinical trials

马国锦 — Fri, 12 Jun 2026 08:40:33 +0000

I just open-sourced AE Sentinel, an AI-powered pharmacovigilance platform that automates adverse event monitoring in clinical trials.

🔗 GitHub: https://github.com/mgj10086/AEPD-Sentinel

What it does

Module	Description
🏷️ AE Coding	Auto MedDRA coding with keyword-based synonym matching
📋 SAE Reports	CIOMS-I report generation (PDF/DOCX/JSON)
⚠️ Deviation Detection	7 built-in rules, auto-triggered from AE processing
📊 Signal Mining	Organ-class aggregation, Fisher's exact test simulation
✅ Compliance	GCP compliance auditing
📚 Knowledge Base	RAG-powered semantic search (ChromaDB)
🔐 Audit Trail	HMAC hash chain, tamper-evident
🔔 Notifications	Real-time alerts for deviations/SAEs

Tech Stack

Backend: Python 3.9+ / FastAPI / Uvicorn
Frontend: Vue 3 / Pinia / Element Plus / ECharts
Database: MySQL (production) / SQLite (dev) — auto-switch via env
Vector DB: ChromaDB (persistent, semantic search)
Auth: JWT (HS256) + role-based access (Admin / PV Specialist / CRA)
Container: Multi-stage Docker build (Node → Python + Nginx)

Architecture

Three-layer separation:

The $0 RAG Stack: Build a Production Retrieval System Without Paying a Cent

马国锦 — Thu, 11 Jun 2026 08:53:17 +0000

You do not need Pinecone, OpenAI embeddings, or a $200/month vector database to build a solid RAG system.

The Free Stack

Component	Free Option
Embedding Model	BAAI/bge-large-zh-v1.5
Vector Database	FAISS / Chroma
LLM	DeepSeek / Ollama
Reranker	BAAI/bge-reranker-base
BM25	rank-bm25 (pip)

Zero API costs. Zero monthly fees.

Embedding Without OpenAI

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-zh-v1.5")
embeddings = model.encode(["Your text"])

Free, MIT-licensed, beats ada-002 on Chinese benchmarks.

Vector Search Without Pinecone

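import faiss
import numpy as np

index = faiss.IndexHNSWFlat(1024, 32)  # 1024 = bge-large embedding size, 32 = HNSW links per node
index.add(embeddings.astype(np.float32))
query_embedding = model.encode(["Your question"]).astype(np.float32)  # shape (1, 1024)
D, I = index.search(query_embedding, k=10)  # D = distances, I = row ids of the nearest vectors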

FAISS for speed. Chroma for ease. Both free.

LLM Without API Keys

Option A: Ollama local — ollama pull qwen2.5:7b
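Option B: DeepSeek API free tier (500 requests/day at the time of writing; this route does require an API key, and quotas change, so check the current terms)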

Reranking for Free

from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-base")
pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)

One step that often improves precision more than weeks of prompt tuning.

The Full Pipeline

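import faiss
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

class FreeRAG:
    def __init__(self):
        self.embedder = SentenceTransformer("BAAI/bge-large-zh-v1.5")
        self.index = faiss.IndexHNSWFlat(1024, 32)
        self.reranker = CrossEncoder("BAAI/bge-reranker-base")
        self.docs = []  # raw texts; position i matches FAISS id i

    def add(self, docs):
        self.index.add(self.embedder.encode(docs).astype(np.float32))
        self.docs.extend(docs)

    def search(self, query, k=10):
        q_emb = self.embedder.encode([query]).astype(np.float32)
        _, ids = self.index.search(q_emb, 100)        # stage 1: vector recall
        cand = [i for i in ids[0][:30] if i >= 0]     # FAISS pads missing hits with -1
        if not cand:
            return []
        scores = self.reranker.predict([[query, self.docs[i]] for i in cand])
        best = sorted(range(len(cand)), key=lambda j: scores[j], reverse=True)[:k]  # stage 2: rerank
        return [self.docs[cand[j]] for j in best]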

Vector search + cross-encoder rerank. Same pipeline that costs $200/month on managed services.

When to Graduate to Paid

Trigger	Move To
> 1M vectors	Milvus
Need GPU inference	TEI
Team access	Weaviate Cloud

For the first 90% of your journey, free tools are enough.

☕ Support This Content

If my articles saved you money on SaaS bills, scan the QR code below to buy me a coffee.

Follow @mgj for weekly practical AI engineering content.

Agentic RAG: When Your Retrieval System Learns to Think for Itself

马国锦 — Thu, 11 Jun 2026 08:45:29 +0000

Traditional RAG retrieves once and generates. Agentic RAG retrieves, evaluates, and decides whether to retrieve again.

Traditional RAG vs Agentic RAG

Traditional: User asks → Vector search once → Generate answer
Agentic: User asks → Agent analyzes → Decides search strategy → Verifies results → Re-searches if needed → Generates

The agent thinks between steps.

The Core Loop

class AgenticRAG:
    def __init__(self, retriever, llm, max_rounds=3):
        self.retriever = retriever
        self.llm = llm
        self.max_rounds = max_rounds

    def _decide(self, query, context):
        prompt = f"Query: {query}. Existing info: {context or 'none'}. Decide: RETRIEVE | GENERATE | REFINE"
        return self.llm.generate(prompt)

    def _verify(self, answer, sources):
        prompt = f"Answer: {answer}. Sources: {sources}. Is every claim backed by a source? yes/no"
        return "yes" in self.llm.generate(prompt).lower()

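    def run(self, query):
        context = []
        answer = None  # stays None until the agent first chooses GENERATE
        for _ in range(self.max_rounds):
            decision = self._decide(query, "\n".join(context))
            if "GENERATE" in decision:
                answer = self.llm.generate(f"Based on: {context}\nQuery: {query}")
                if self._verify(answer, context):
                    return answer
            # RETRIEVE, REFINE, or a failed verification: gather more context
            results = self.retriever.search(query, top_k=5)
            context.extend(results)
        # Round limit reached: return the last unverified draft, or generate one now
        return answer or self.llm.generate(f"Based on: {context}\nQuery: {query}")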

Three Critical Decisions

1. When to Stop Searching

Combine confidence threshold + round limit (max 3) + information gain check.
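One way to combine those three signals is a single stop check called after each round. This is a sketch under assumptions: the function name, the thresholds, and the idea of measuring information gain as the share of newly retrieved documents not seen before are all choices made for illustration. Only the round limit of 3 comes from the text above.

```python
def should_stop(round_idx, confidence, new_docs, seen_docs,
                max_rounds=3, min_confidence=0.8, min_new_fraction=0.2):
    """Return (stop, reason) after a retrieval round (round_idx is 0-based)."""
    if round_idx + 1 >= max_rounds:
        return True, "round limit"
    if confidence >= min_confidence:
        return True, "confident"
    # Information gain: how much of this round's retrieval is actually new?
    fresh = [d for d in new_docs if d not in seen_docs]
    if new_docs and len(fresh) / len(new_docs) < min_new_fraction:
        return True, "no information gain"
    return False, "keep searching"

seen = {"doc-1", "doc-2"}
print(should_stop(0, 0.4, ["doc-1", "doc-2"], seen))   # same docs again: stop
print(should_stop(0, 0.4, ["doc-3", "doc-4"], seen))   # new material: continue
```

The confidence value would have to come from somewhere, for example a self-rating requested from the LLM; that part is left out here.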

2. Which Tools to Give the Agent

Start: vector + BM25 + RRF. Add SQL and web search later.
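RRF (reciprocal rank fusion) merges ranked lists without needing comparable scores: each document earns 1 / (k + rank) from every list it appears in. A minimal sketch, using the commonly cited constant k=60 and made-up document ids:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]   # hypothetical vector-search results
bm25_hits = ["d1", "d9", "d3"]     # hypothetical BM25 results
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['d1', 'd3', 'd9', 'd7']
```

Documents that both retrievers agree on (d1 and d3) rise to the top, ahead of documents found by only one.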

3. How Deep to Verify

Light check (1 LLM call) for every query. Deep check (re-search + compare) for high-stakes answers.

When to Use Each

Scenario	Use
Simple fact lookup	Traditional RAG
Multi-hop reasoning	Agentic RAG
Numerical aggregation	Agentic RAG + SQL
Latency-sensitive (<500ms)	Traditional RAG
High accuracy (medical/legal)	Agentic RAG + deep verify

Traditional RAG hopes. Agentic RAG verifies. Extra retrieval rounds can push recall from around 60% to 90%+, though the gain depends on the corpus and the queries, and each round adds latency.


How I Ship 10x Faster with Claude Code: The 5-Layer Workflow System

马国锦 — Thu, 11 Jun 2026 03:37:23 +0000

After 8 months of daily Claude Code use, I have distilled my workflow into a 5-layer system.

Layer 1: CLAUDE.md — Project Memory

This is the foundation. CLAUDE.md is a file at your project root that Claude reads automatically every session.

Bad CLAUDE.md: Generic advice like "write clean code, use Git."

Good CLAUDE.md: Specific tech stack, exact commands, architecture diagram, unique conventions.

Rule: only write what is unique to your project.

Layer 2: Plan Mode — Direction Before Speed

Trigger: any change touching 3+ files. Claude explores code, designs a plan, waits for your approval before writing anything.

Layer 3: Small Tasks — One Unit, One Commit

Each task changes ONE logical unit. After each task, the project is still runnable. You know exactly which commit introduced a bug.

Layer 4: Git — Your Undo Button

Commit after every small task. The commit history IS your project journal.

Layer 5: Worktree — Parallel Without Collision

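git worktree add -b hotfix ../bugfix   # check out a new branch in ../bugfix (the branch name "hotfix" is an example)
cd ../bugfix
# Fix bug in isolation, then commit
cd -
git worktree remove ../bugfix          # delete the worktree once the fix is merged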

Bonus: Context Compounding

Level	Vehicle
Project	CLAUDE.md
Session	Memory system
Knowledge Base	Linked notes

After 8 months, Claude does not just understand my project — it understands how I think.

Quick Start

Day 1: Write a proper CLAUDE.md
Day 2: Use Plan Mode before cross-file changes
Day 3: Break tasks into small steps
Day 5: Try Worktree
Day 7: Write your first Memory
Day 14: Review commits for patterns


Build Your RAG System Right the First Time: 6 Decisions That Make or Break It

马国锦 — Thu, 11 Jun 2026 03:36:21 +0000

After debugging 20+ broken RAG systems, I have identified the 6 decisions that determine whether yours works.

Decision 1: Embedding Model

Language	Use This
Chinese	BAAI/bge-large-zh-v1.5
Chinese + English	BAAI/bge-m3
English	text-embedding-3-large

Non-negotiable: indexing model and query model must be byte-for-byte identical.

Decision 2: Chunk Size

Document Type	Sweet Spot	Overlap
FAQ	128-256	20
Technical docs	512	50
Long-form	768-1024	100

Use recursive splitting, not fixed-length.
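Recursive splitting means trying the coarsest separator first (paragraphs), and only falling back to finer ones (lines, sentences, words) for pieces that are still too long. The sketch below is a simplified illustration: it measures length in characters rather than tokens, leaves out overlap, and the separator list is an assumption.

```python
def recursive_split(text, max_len=512, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_len characters."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    for sep in separators:
        if sep in text:
            parts = text.split(sep)
            break
    else:
        # No separator left: fall back to a hard cut
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    chunks, current = [], ""
    for part in parts:
        candidate = current + sep + part if current else part
        if len(candidate) <= max_len:
            current = candidate            # keep packing pieces into this chunk
            continue
        if current:
            chunks.append(current)
        if len(part) > max_len:
            # Piece is still too long: recurse, which reaches a finer separator
            chunks.extend(recursive_split(part, max_len, separators))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks

sample = "First paragraph. " * 20 + "\n\n" + "second paragraph words " * 30
chunks = recursive_split(sample, max_len=100)
print(len(chunks), max(len(c) for c in chunks))
```

Compared with fixed-length cuts, chunk boundaries land on paragraph or sentence breaks whenever the text allows it.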

Decision 3: Index Type — HNSW vs IVF

Scale	Use
< 1M vectors	HNSW (recall > 0.95)
1-5M, RAM tight	IVF + PQ
> 5M	IVF + PQ + Sharding

Decision 4: Metadata

Without metadata filtering, every query scans all vectors. Add department=engineering AND date > 2024-01-01 to go from 5M to 50K vectors.

Decision 5: Deduplication — Do It Twice

Document-level: MinHash + LSH, threshold 0.85
Chunk-level: SimHash, threshold 0.95
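To make the document-level step concrete, here is a pure-Python MinHash sketch. It estimates the Jaccard similarity between word-shingle sets, which is the number the 0.85 threshold applies to. It does not include the LSH bucketing that makes this fast at scale, and the shingle size and number of hash functions are assumptions.

```python
import hashlib

def shingles(text, n=3):
    """Set of overlapping n-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_perm=128):
    """For each seeded hash function, keep the smallest hash over all items."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(), "big")
            for item in items))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "transformers use attention to mix information across all tokens"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b))  # high: near-duplicates
print(estimated_jaccard(sig_a, sig_c))  # low: unrelated
```

A pair whose estimate is above the threshold would be treated as duplicates and only one copy indexed.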

Decision 6: Query Processing

Technique	When
Query rewriting	Short/fuzzy queries
HyDE	Factual QA
RRF fusion	Semantic + exact-match
Cross-Encoder rerank	Post-retrieval

Minimum viable stack: Query rewriting + Cross-Encoder rerank.

Optimization Priority

1. Is the embedding model language-appropriate?
2. Is the chunk size reasonable (256-768)?
3. Are you deduplicating?
4. Query rewriting
5. Cross-Encoder reranking
6. Metadata filtering
