DEV Community

Devanshu Biswas

Transformers From Scratch: Assembling the Block Behind GPT

Yesterday we covered attention: each token deciding how much to look at every other token. Today we assemble the rest of the Transformer block around it. Stack a dozen or more of these and you have the core of GPT and BERT. Here's the full block, visualized stage by stage.

🧠 Step through a Transformer block: https://dev48v.infy.uk/dl/day13-transformers.html

Attention isn't enough on its own

Attention mixes information across positions, but you still need to (a) know word order and (b) transform features. So a block wraps attention with a few more pieces.
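To see why word order needs separate handling, here is a minimal NumPy sketch of single-head self-attention with no positional information. The weight names (`w_q`, `w_k`, `w_v`) and the sizes are made up for illustration. If attention really is order-blind, shuffling the input tokens should just shuffle the output rows the same way.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention, no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8  # illustrative sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(seq_len)
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the tokens only reorders the output rows: nothing encodes position
order_blind = np.allclose(out[perm], out_shuffled)
print(order_blind)  # True
```

Each token gets the same output vector no matter where it sits in the sequence, which is exactly the gap that positional embeddings fill.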

The block, top to bottom

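  1. Token + positional embeddings: turn tokens into vectors and inject order (the original Transformer adds a fixed sine/cosine pattern; GPT and BERT learn their position vectors instead), since attention is order-blind. This happens once, before the first block.
  2. Multi-head self-attention: several attention "views" computed in parallel, then concatenated and projected back to the model dimension.
  3. Add & LayerNorm: a residual connection (add the sublayer's input back to its output) keeps gradients flowing; layer-norm stabilizes training.
  4. Feed-forward network: a small MLP applied to each position independently, transforming its features.
  5. Add & LayerNorm again, this time around the feed-forward network.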
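The five steps above can be sketched in NumPy roughly like this. It is a forward pass only, with random untrained weights, and it assumes an even `d_model`, a ReLU feed-forward layer, and a layer-norm without the learned gain and bias. The parameter names (`w_q`, `w_o`, `w_1`, ...) and all sizes are placeholders for illustration, not the code from the linked page.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def positional_encoding(seq_len, d_model):
    """Sinusoidal position table: sines on even columns, cosines on odd ones."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_attention(x, p, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["w_q"]), split(x @ p["w_k"]), split(x @ p["w_v"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ v  # one attention "view" per head
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ p["w_o"]  # combine the heads

def feed_forward(x, p):
    hidden = np.maximum(0, x @ p["w_1"] + p["b_1"])  # ReLU, applied per token
    return hidden @ p["w_2"] + p["b_2"]

def transformer_block(x, p, n_heads):
    x = layer_norm(x + multi_head_attention(x, p, n_heads))  # steps 2 and 3
    x = layer_norm(x + feed_forward(x, p))                   # steps 4 and 5
    return x

def init_params(rng, d_model, d_ff, scale=0.02):
    shapes = {"w_q": (d_model, d_model), "w_k": (d_model, d_model),
              "w_v": (d_model, d_model), "w_o": (d_model, d_model),
              "w_1": (d_model, d_ff), "w_2": (d_ff, d_model)}
    p = {name: rng.normal(0, scale, shape) for name, shape in shapes.items()}
    p["b_1"], p["b_2"] = np.zeros(d_ff), np.zeros(d_model)
    return p

rng = np.random.default_rng(0)
vocab_size, seq_len, d_model, d_ff, n_heads = 50, 6, 16, 64, 4  # toy sizes
embedding = rng.normal(0, 0.02, (vocab_size, d_model))
tokens = rng.integers(0, vocab_size, size=seq_len)

x = embedding[tokens] + positional_encoding(seq_len, d_model)  # step 1
params = init_params(rng, d_model, d_ff)
y = transformer_block(x, params, n_heads)
print(y.shape)  # (6, 16)
```

The output has the same shape as the input, which is what makes stacking possible: the result of one block can be fed straight into the next.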

Why it took over

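Stack N blocks → deeper abstraction. Unlike RNNs, which step through a sequence one token at a time, a Transformer processes every position in parallel during training, so it trains fast on huge data. Decoder-only (each token sees only earlier tokens) = GPT, encoder-only (each token sees the whole sequence) = BERT.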
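Here is a rough sketch of both ideas, stacking and the decoder/encoder split, using a deliberately simplified block (single head, no biases, random weights). The `causal` flag and the helper names are my own illustration. The one real difference it shows is the attention mask: a decoder-style stack hides future tokens, an encoder-style stack does not.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def block(x, p, causal):
    """Simplified block: single-head attention + ReLU FFN, each with Add & LayerNorm."""
    n = x.shape[0]
    q, k, v = x @ p["w_q"], x @ p["w_k"], x @ p["w_v"]
    scores = q @ k.T / np.sqrt(x.shape[-1])
    if causal:
        # Decoder-style: position i may only attend to positions <= i
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    x = layer_norm(x + softmax(scores) @ v)
    x = layer_norm(x + np.maximum(0, x @ p["w_1"]) @ p["w_2"])
    return x

def run_stack(x, layers, causal):
    # Stacking is just feeding each block's output into the next block
    for p in layers:
        x = block(x, p, causal)
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, n_layers = 5, 8, 32, 3  # toy sizes
shapes = {"w_q": (d_model, d_model), "w_k": (d_model, d_model),
          "w_v": (d_model, d_model), "w_1": (d_model, d_ff),
          "w_2": (d_ff, d_model)}
layers = [{name: rng.normal(0, 0.3, shape) for name, shape in shapes.items()}
          for _ in range(n_layers)]

x = rng.normal(size=(seq_len, d_model))
x_edit = x.copy()
x_edit[-1] += 1.0  # change only the last token

# Compare the outputs for every position except the last one
causal_same = np.allclose(run_stack(x, layers, True)[:-1],
                          run_stack(x_edit, layers, True)[:-1])
bidir_same = np.allclose(run_stack(x, layers, False)[:-1],
                         run_stack(x_edit, layers, False)[:-1])
print(causal_same, bidir_same)
```

With the causal mask, editing the last token should leave every earlier position untouched, which is what lets a GPT-style model predict the next token without peeking. Without the mask, the edit leaks into every position, which is the bidirectional context a BERT-style encoder relies on.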

🔨 Full build (embed+positional → multi-head → residual+norm → FFN → stack) on the page: https://dev48v.infy.uk/dl/day13-transformers.html

Part of DeepLearningFromZero. 🌐 https://dev48v.infy.uk
