Yesterday: attention, where each token decides how much to look at every other token. Today we assemble the rest of the Transformer block around it. Stack a dozen or more of these blocks and you have the backbone of models like GPT and BERT. Here's the full block, visualized stage by stage.
🧠 Step through a Transformer block: https://dev48v.infy.uk/dl/day13-transformers.html
Attention isn't enough on its own
Attention mixes information across positions, but you still need to (a) know word order and (b) transform features. So a block wraps attention with a few more pieces.
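To make the "order-blind" point concrete, here is a minimal NumPy sketch (not the code from the linked page). The function `self_attention` and the random weights `w_q`, `w_k`, `w_v` are illustrative assumptions. It shuffles the input tokens and checks that plain self-attention produces the same outputs, just shuffled the same way, so nothing in attention itself encodes position.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention, no positional information.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8  # toy sizes, chosen only for illustration
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(seq_len)                        # a random reordering of the tokens
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the input only shuffles the output rows in the same way.
order_blind = np.allclose(out[perm], out_shuffled)
print(order_blind)
```

Because the outputs match up to the same reordering, the model needs something extra to tell "dog bites man" from "man bites dog".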
The block, top to bottom
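- Token + positional embeddings — turn tokens into vectors and inject order (a sinusoidal pattern in the original Transformer; GPT and BERT learn position vectors instead), since attention is order-blind. This happens once at the input, before the first block.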
- Multi-head self-attention — several attention "views" in parallel, then combined.
- Add & LayerNorm — a residual connection (add the input back) keeps gradients flowing; layer-norm stabilizes training.
- Feed-forward network — a small per-token MLP that transforms each position.
- Add & LayerNorm again.
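The stages above can be sketched in a few dozen lines of NumPy. This is a rough forward-pass sketch under stated assumptions, not the build from the linked page: the weights are random, the names (`params`, `transformer_block`, `sinusoidal_positions`) are made up for illustration, layer-norm omits its learned gain and bias, the feed-forward network uses ReLU, and `d_model` is assumed to be even and divisible by `n_heads`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_positions(seq_len, d_model):
    # Fixed sine/cosine position pattern; assumes d_model is even.
    pos = np.arange(seq_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector (learned gain/bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_attention(x, p, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):  # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["w_q"]), split(x @ p["w_k"]), split(x @ p["w_v"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # one score matrix per head
    heads = softmax(scores) @ v                            # (heads, seq, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ p["w_o"]                               # combine the heads

def transformer_block(x, p, n_heads):
    # Attention sub-layer, then Add & LayerNorm.
    x = layer_norm(x + multi_head_attention(x, p, n_heads))
    # Per-token feed-forward network, then Add & LayerNorm again.
    hidden = np.maximum(0.0, x @ p["w1"] + p["b1"])
    return layer_norm(x + hidden @ p["w2"] + p["b2"])

rng = np.random.default_rng(0)
vocab_size, seq_len, d_model, d_ff, n_heads = 50, 6, 16, 64, 4  # toy sizes

params = {name: rng.normal(scale=0.1, size=(d_model, d_model))
          for name in ("w_q", "w_k", "w_v", "w_o")}
params["w1"] = rng.normal(scale=0.1, size=(d_model, d_ff))
params["b1"] = np.zeros(d_ff)
params["w2"] = rng.normal(scale=0.1, size=(d_ff, d_model))
params["b2"] = np.zeros(d_model)

embedding = rng.normal(scale=0.1, size=(vocab_size, d_model))  # token embedding table
token_ids = rng.integers(0, vocab_size, size=seq_len)          # a fake input sentence
pe = sinusoidal_positions(seq_len, d_model)

x = embedding[token_ids] + pe            # token + positional embeddings
out = transformer_block(x, params, n_heads)
print(out.shape)
```

The block returns one vector per token with the same width it received, which is what lets you feed its output straight into another block.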
Why it took over
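Stack N blocks → deeper abstraction. Unlike RNNs, which step through a sequence one token at a time, a Transformer processes every position in parallel during training, so it makes good use of GPUs on huge datasets. Decoder-only (each token attends only to earlier tokens) = GPT; encoder-only (each token attends in both directions) = BERT.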
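Here is a small sketch of the GPT-versus-BERT difference, stripped down to the part that matters: whether attention is allowed to look at later tokens. The names `attention` and `stack` and the random weights are assumptions for illustration, and the layers here keep only attention plus the residual connection (no layer-norm or feed-forward) to stay short. It stacks a few layers, nudges the last token, and checks which earlier outputs move.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, w_q, w_k, w_v, causal):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if causal:
        # Decoder-style mask: position t may not attend to positions after t.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def stack(x, layers, causal):
    # Apply N simplified layers in sequence, each with a residual connection.
    for w_q, w_k, w_v in layers:
        x = x + attention(x, w_q, w_k, w_v, causal)
    return x

rng = np.random.default_rng(0)
seq_len, d_model, n_layers = 4, 8, 3  # toy sizes
layers = [tuple(rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
          for _ in range(n_layers)]

x = rng.normal(size=(seq_len, d_model))
x_edited = x.copy()
x_edited[-1] += 1.0  # change only the final token

# Decoder-style: earlier positions cannot see the edited final token.
causal_unchanged = np.allclose(stack(x, layers, True)[:-1],
                               stack(x_edited, layers, True)[:-1])
# Encoder-style: every position sees the edit, so earlier outputs shift.
bidirectional_unchanged = np.allclose(stack(x, layers, False)[:-1],
                                      stack(x_edited, layers, False)[:-1])
print(causal_unchanged, bidirectional_unchanged)
```

With the mask on, earlier positions never depend on later ones, which is what lets a decoder-only model generate text left to right.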
🔨 Full build (embed+positional → multi-head → residual+norm → FFN → stack) on the page: https://dev48v.infy.uk/dl/day13-transformers.html
Part of DeepLearningFromZero. 🌐 https://dev48v.infy.uk