THE ENGINE OF THE FUTURE
Transformer
"Attention Is All You Need" — the paper that changed everything
Last article we saw how the four learning types + training loop built ChatGPT. Today we open the box and see the exact architecture that made all of it possible.
June 2017. Eight researchers at Google Brain sat down and asked a dangerous question:
"Why do we even need the RNN?"
Then they deleted it.
The paper they published — "Attention Is All You Need" — was not patented. It was released freely to the world. And that single decision launched ChatGPT, Claude, Gemini, Llama, and every significant language model that exists today.
This is the story of the Transformer: what problem it solved, how it works, and why understanding it makes you a fundamentally better AI developer.
Before 2017: The World Ran on RNNs
To understand why the Transformer was revolutionary, you need to understand what it replaced.
The dominant architecture for language before 2017 was the Recurrent Neural Network (RNN). The idea was elegant: read text the way humans do — one word at a time, remembering what came before.
How the RNN Read a Sentence
The glasses $\rightarrow$ remember $\rightarrow$ are $\rightarrow$ remember $\rightarrow$ light $\rightarrow$ ... $\rightarrow$ but their battery...

By the time it reaches "battery", the beginning of the sentence has almost completely faded from memory.
The RNN had three fatal problems that held AI back for years:
Problem 1: Memory Decay (The Forgetting Problem)
The RNN maintained a "hidden state" — a compressed memory that got updated with each new word. The trouble: each update overwrote part of the previous memory.
Sentence: "The smart glasses are light but their battery is very weak and doesn't last a full day"
glasses: 100%
smart: 90%
light: 75%
battery: 50%
full day...: 5% ❌

By the time it reaches "full day" — it has forgotten that the sentence started with "glasses"!
Engineers tried to fix this with LSTMs (Long Short-Term Memory networks) in 1997. They helped, but didn't fully solve the problem. Long documents remained an unsolvable challenge.
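The fading can be made concrete with a toy simulation — not a real RNN, and the `RETENTION` constant is invented purely for illustration:

```python
# Toy illustration (NOT a real RNN): each hidden-state update keeps only
# a fraction of the existing memory, so the first word's share shrinks
# with every step. RETENTION = 0.75 is a made-up teaching constant.
RETENTION = 0.75

words = ["glasses", "smart", "are", "light", "but", "their",
         "battery", "is", "very", "weak"]

first_word_share = 1.0
for i, word in enumerate(words[1:], start=2):
    first_word_share *= RETENTION
    print(f"after word {i} ({word!r}): 'glasses' share ≈ {first_word_share:.0%}")

# After 9 updates, 0.75 ** 9 ≈ 7.5% — the first word has nearly vanished.
```

The exact curve depends on the retention fraction, but the shape is the point: influence decays geometrically with distance, which is exactly why long documents broke RNNs.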
Problem 2: Sequential Processing (The Speed Problem)
RNNs are inherently sequential. Word 2 can't be processed until Word 1 is done. Word 3 waits for Word 2.
| RNN — Sequential ❌ | Transformer — Parallel ✅ |
|---|---|
| Word 1 $\rightarrow$ finish, Word 2 $\rightarrow$ finish, Word 3 $\rightarrow$ finish... 100 steps in a row. Even with 8,000 GPUs, you can't parallelize — each step depends on the previous. | Word 1, Word 2, Word 3 ⚡ ALL AT ONCE. All 100 words processed simultaneously across thousands of GPUs. |
A 100-word sentence takes the RNN 100 sequential steps. The Transformer handles all 100 words in a single parallel pass — which is why it could scale to billions of parameters in a way RNNs never could.
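A tiny numpy sketch makes the contrast concrete (toy sizes and random vectors — the point is the data dependency, not the numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 16                      # 100 words, 16-dim vectors (toy sizes)
x = rng.normal(size=(n, d))         # the sentence as 100 word vectors
W = rng.normal(size=(d, d)) * 0.1   # random toy weights

# RNN-style: 100 dependent steps — step t cannot begin until step t-1 is done
h = np.zeros(d)
for t in range(n):
    h = np.tanh(W @ h + x[t])       # each update waits on the previous h

# Attention-style: one matrix product compares all 100 positions at once
scores = x @ x.T / np.sqrt(d)       # (100, 100) — every pair, in parallel
print(scores.shape)
```

The loop cannot be parallelized no matter how much hardware you have; the single `x @ x.T` is exactly the kind of operation GPUs eat for breakfast.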
Problem 3: Long-Range Dependencies
Short sentence — no problem:
"The glasses are red" ✅ — "red" clearly refers to "glasses"

Long sentence — serious problem:

"The glasses I bought from the store in downtown that's been open for 20 years and everyone says is trustworthy are red"

By the time the RNN reached "red" — it forgot that the sentence began with "glasses." It might confusingly connect "red" to "years" instead. ❌
These three problems — forgetting, slowness, and poor long-range connections — had been the ceiling of AI language abilities for over a decade.
The 2015 Band-Aid: The Original Attention Mechanism
Before the Transformer, researchers found a partial fix: Attention.
The insight was brilliant in its simplicity. Instead of relying on the hidden state to carry all information forward, what if at each step, the model could look back at any previous word and focus on the most relevant ones?
Attention: The Flashlight Analogy
When the model processes the word "battery" in our long sentence, Attention lets it shine a flashlight backwards across the entire sentence and ask: "Which earlier words are most relevant to understanding 'battery'?"

glasses $\leftrightarrow$ battery
Attention links "battery" to "glasses" even if there are 100 words between them. 🔗
This helped significantly. But it was still bolted onto the RNN — it didn't fix the fundamental speed problem, and it added computational cost on top of an already slow architecture.
2017: The Paper That Changed Everything
Eight researchers at Google Brain — Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin — looked at all these problems and asked the audacious question:
"Why do we even need the RNN?"
Their answer, published in June 2017:

"Let's remove it entirely!"

"And use Attention alone!" 🚀
The elegance of the solution: if Attention already lets you look at any word in the sentence, why process words sequentially at all? Instead, look at all words simultaneously and let them all "attend" to each other in parallel.
They called it the Transformer.
Self-Attention: The Core Innovation
The key mechanism inside the Transformer is Self-Attention. Here's exactly how it works.
Each word in the input sentence simultaneously asks three questions about every other word:
| 🔍 Query (Q) | 🗝️ Key (K) | 💎 Value (V) |
|---|---|---|
| "What am I looking for?" — each word broadcasts its search intent | "What do I offer?" — each word announces its content/identity | "What do I actually contribute?" — the actual information passed forward |
The attention score for each word pair is computed as:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) × V
Q·Kᵀ measures how well each query matches each key (compatibility). Dividing by √dₖ prevents the dot products from getting too large. Softmax converts the scores to probabilities. Multiplying by V produces the weighted sum of information that gets passed forward.
(Don't worry — the 3-word numeric walkthrough below turns every symbol above into plain arithmetic.)
In plain English: each word votes on how much attention to pay to every other word. The votes are weighted by relevance. The information from relevant words flows through.
The Dinner Party: Self-Attention as a Conversation
Forget formulas for a moment. Picture a dinner party with three guests: river, bank, and overflowed.
The word bank is sitting at the table feeling ambiguous — is it a riverbank, or the place where you keep your money? It has no idea. So it does what anyone confused would do: it looks around the room and asks the other guests for context.
🍷 The Scene at the Table
bank turns to river: "How related are you to me?" — river shrugs: "Pretty related, I'd say 25%."
bank checks itself in the mirror: "I'm obviously 50% me."
bank turns to overflowed: "And you?" — overflowed nods: "25% connected."
The three numbers add up to 100%. That's the whole point — bank has a fixed amount of attention to spend, and it just decided how to split it.
Now bank takes a weighted sip of each guest's meaning — a big gulp of its own identity, smaller sips of river and overflowed. When it swallows, it's no longer a plain "bank." It's now "the kind of bank that hangs out with rivers and floods." A riverbank. The financial-institution meaning never even entered the picture.
That's self-attention. One ambiguous word, a room full of context, and a weighted blend that resolves the meaning. No formulas needed.
👀 Peek under the hood — the actual arithmetic
For the curious: those percentages (25%, 50%, 25%) aren't magic — they come from four lines of arithmetic. Each word carries three tiny vectors (Q, K, V). Here's what the model actually does when "bank" looks around the room:
1. Give each word a Q, K, V:
river = ([1,0], [1,0], [0.9, 0.1])
bank = ([1,1], [1,1], [0.5, 0.5])
overflowed = ([0,1], [0,1], [0.1, 0.9])
2. Match bank's Q against every K (compatibility):
scores = [1.0, 2.0, 1.0], then ÷√2 $\rightarrow$ [0.71, 1.41, 0.71]
3. Squish into percentages (softmax):
$\rightarrow$ [0.25, 0.50, 0.25] $\leftarrow$ the 25/50/25 split above
4. Blend the V vectors by those percentages:
new_bank = [0.50, 0.50] — now carries river + flood context
The √2 just keeps numbers from exploding when vectors get big — safely ignore on a regular read. The Q, K, V numbers above are made up for teaching; in a real model they're learned during training.
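The four steps above can be checked in a few lines of numpy, using the same made-up Q/K/V vectors from the walkthrough:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy K and V rows for river, bank, overflowed (made up for teaching,
# not learned weights), plus bank's query vector.
K = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
V = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])
q_bank = np.array([1.0, 1.0])

scores = K @ q_bank / np.sqrt(2)   # [1, 2, 1] scaled by √d_k
weights = softmax(scores)          # ≈ [0.25, 0.50, 0.25]
new_bank = weights @ V             # ≈ [0.50, 0.50] — river + flood context

print(weights.round(2), new_bank.round(2))
```

Swap in different V vectors and watch `new_bank` shift — the blend is entirely determined by who `bank` decides to listen to.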
A bigger example for intuition: In the sentence "The bank by the river overflowed":
- "bank" attends heavily to "river" $\rightarrow$ understands it's a riverbank, not a financial bank
- "overflowed" attends to both "bank" and "river" $\rightarrow$ understands the event context
- All of this happens simultaneously, not sequentially
Attention Scores Matrix — "The bank by the river overflowed":

| | The | bank | river | overflowed |
|---|---|---|---|---|
| bank | 0.05 | 0.45 $\leftarrow$ self | 0.92 ⭐ | 0.38 |
| river | 0.03 | 0.88 ⭐ | 0.52 $\leftarrow$ self | 0.71 |
| overflowed | 0.02 | 0.79 ⭐ | 0.85 ⭐ | 0.60 $\leftarrow$ self |
Higher score = stronger attention. "bank" scoring 0.92 on "river" is how the model learns this is a riverbank, not a financial institution. (Scores above are illustrative — real attention weights are learned during training and sum to 1.0 per row after softmax.)
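To see the "sum to 1.0 per row" point concretely, here are the table's illustrative raw scores pushed through a row-wise softmax (the numbers remain illustrative, not from a real model):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative raw scores for bank, river, overflowed (from the table above)
raw = np.array([
    [0.05, 0.45, 0.92, 0.38],   # bank
    [0.03, 0.88, 0.52, 0.71],   # river
    [0.02, 0.79, 0.85, 0.60],   # overflowed
])

weights = softmax(raw)           # each row becomes a probability distribution
print(weights.sum(axis=1))       # every row now sums to exactly 1.0
```

This is why "attention" is literally a budget: softmax forces each word to split 100% of its focus across the sentence.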
Multi-Head Attention: A Panel of Experts Reading the Same Sentence
The dinner-party conversation from last section was only one type of conversation. But real sentences have many kinds of relationships happening at once — grammar, contrast, mood, big-picture meaning — and a single conversation can't catch them all.
So the Transformer hires a panel of experts. Each one listens to the same sentence through a completely different lens, then they all hand in their reports.
Take the sentence: "The smart glasses are light but their battery is very weak."
Here's the panel arguing about it in real time:

🔍 The Grammar Cop
"'weak' is describing 'battery' — that's a clean adjective-noun pairing. Move on."

⚖️ The Contrast Detective
"The word 'but' is the whole point. Somebody's pitting 'light' against 'weak' here — there's a trade-off being drawn."

🎭 The Sentiment Reader
"Something positive ('light') is being undercut by something negative ('weak'). The mood in this sentence is disappointment."

🔭 The Big-Picture Thinker
"Zooming out — this whole sentence is a complaint about a gadget. File it under 'product review.'"

Each expert writes up their own attention matrix. Then the model staples all their reports together into one rich representation of the sentence.
That's multi-head attention: one sentence, many simultaneous readings, then a combined verdict. It's the same trick a good doctor uses — a cardiologist, neurologist, and radiologist all examining the same patient, pooling notes, producing a diagnosis sharper than any specialist could alone.
And the scale is wild: GPT-3 runs 96 of these experts in parallel, inside every single layer. GPT-4 likely runs even more. Nobody told them what to specialize in — each expert just learned their niche during training.
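A minimal sketch of the mechanics, assuming random (untrained) projection matrices and toy sizes — real models learn these weights during training:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 6, 16, 4          # 6 tokens, 16-dim embeddings, 4 heads (toy sizes)
d_head = d // h             # each head works in its own 4-dim subspace
x = rng.normal(size=(n, d))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

head_outputs = []
for _ in range(h):
    # Each head gets its own Q/K/V projections — random here, learned in reality
    Wq, Wk, Wv = (rng.normal(size=(d, d_head)) * 0.1 for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_head))   # (6, 6) per-head attention
    head_outputs.append(attn @ V)               # (6, 4) per-head "report"

# Staple the four expert reports back into one 16-dim representation
combined = np.concatenate(head_outputs, axis=-1)
print(combined.shape)
```

Because each head attends in its own subspace, nothing forces them to agree — which is exactly how one ends up specializing in grammar while another tracks contrast.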
Positional Encoding: Remembering Order
Here's a subtle problem with reading everything in parallel: word order gets lost.
If you process all words simultaneously with no sense of position:
- "The dog bit the man"
- "The man bit the dog"
...look identical to the attention mechanism — just the same three tokens rearranged.
| The Problem ❌ | The Solution ✅ |
|---|---|
| Self-Attention sees all words at once — without position info, "The dog bit the man" and "The man bit the dog" are identical bags of tokens. | Add a unique position vector to each word's embedding: "dog" at position 1 gets a different fingerprint than "dog" at position 5. |
"dog" + position 1 encoding $\rightarrow$ knows it's the subject
"bit" + position 2 encoding $\rightarrow$ knows it's the verb
"man" + position 3 encoding $\rightarrow$ knows it's the object
Now "The dog(1) bit(2) the man(3)" is mathematically distinct from "The man(1) bit(2) the dog(3)." Order preserved — without losing parallelism.
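Here's a sketch of the sinusoidal encoding from the original paper — one common choice; many modern models instead learn position vectors or use rotary embeddings:

```python
import numpy as np

def positional_encoding(n_positions, d):
    """Sinusoidal positional encoding: even dimensions get sin, odd get cos,
    at wavelengths that grow geometrically across the embedding."""
    pos = np.arange(n_positions)[:, None]               # (n, 1)
    i = np.arange(d)[None, :]                           # (1, d)
    angles = pos / np.power(10000, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

pe = positional_encoding(10, 8)

# "dog" at position 1 and "dog" at position 5 get different fingerprints:
print(np.allclose(pe[1], pe[5]))   # False — the positions are distinguishable
```

Adding `pe[t]` to the embedding of the word at position `t` is all it takes: the same token at two positions now enters the attention layers as two different vectors.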
Inside a Transformer Block
A complete Transformer isn't just attention — it's a stack of blocks, each containing multiple components:
One Transformer Block (repeated N times)
1. Input Embeddings + Positional Encoding ↓
2. Multi-Head Self-Attention — each word attends to all others in parallel ↓
3. Add & Normalize (Residual Connection) — original input added back, prevents information loss ↓
4. Feed-Forward Network — each position independently processed for richer representations ↓ (repeat 12-96x)
5. Final Output Layer — probability distribution over vocabulary; next token predicted
The Residual Connection (step 3) is worth calling out: at each layer, the original input is added back to the attention output. This ensures that even if an attention head learns something unhelpful, the original information isn't destroyed. It's the architectural equivalent of "don't erase the original — build on top of it." This is the same "add original input back" trick that let us train the deep networks in Article 4 without losing early information.
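The whole block can be sketched in numpy — a minimal single-head, pre-norm variant with random weights, for intuition only, not a faithful reimplementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 16
x = rng.normal(size=(n, d))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(z):
    return (z - z.mean(-1, keepdims=True)) / (z.std(-1, keepdims=True) + 1e-5)

def self_attention(z):
    # Single head, no projections — just the attend-and-blend step
    return softmax(z @ z.T / np.sqrt(d)) @ z

W1 = rng.normal(size=(d, 4 * d)) * 0.1   # random toy FFN weights
W2 = rng.normal(size=(4 * d, d)) * 0.1

def feed_forward(z):
    return np.maximum(0, z @ W1) @ W2    # ReLU MLP applied per position

def block(z):
    z = z + self_attention(layer_norm(z))  # residual: input added back
    z = z + feed_forward(layer_norm(z))    # residual again
    return z

out = x
for _ in range(3):        # real models stack 12-96 of these blocks
    out = block(out)
print(out.shape)
```

Notice the two `z + ...` lines: even if the attention or FFN output were garbage, `z` itself survives untouched — that's the residual safety net in one line of code each.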
How This Architecture Powers Everything You've Learned So Far
Every concept from the previous articles lives inside this diagram:
- The neuron from Article 3 is inside the Feed-Forward Network — every position runs through dense layers of neurons after attention.
- The training loop from Article 4 (Forward Pass $\rightarrow$ Loss $\rightarrow$ Backprop $\rightarrow$ Update) runs across all 96+ attention heads simultaneously during pre-training.
- The 384-dimensional embeddings from Article 2 are built inside this stack — the Transformer's layers are the machine that creates them.
- The 4 learning types from Article 5 — Self-Supervised pre-training, SFT, and RLHF — all use this exact stack as their underlying model.
Also, the Positional Encoding we just saw is exactly why the embeddings we learned in Article 2 carry both meaning and order — position is baked into every vector from the first layer.
The Timeline: From Research to Revolution
- 2014: RNN + LSTM Dominates Language AI reads word-by-word. Long texts break. Slow. Can't parallelize.
- 2015: Attention Mechanism Added Bolted onto RNN. Better long-range connections, but still sequential. Partial fix.
- June 2017: "Attention Is All You Need" Google Brain removes RNN entirely. Parallel processing. Scales to billions of parameters. Released openly — no patent.
- 2018–2019: BERT + GPT-1/2 Launch OpenAI and Google apply Transformer at scale. First demonstrations of emergent language understanding.
- 2020: GPT-3 — 175 billion parameters (weights inside its neurons) The first model to show that scaling Transformers produces qualitatively new capabilities: reasoning, writing, code.
- 2022–2026: ChatGPT, Claude, Gemini, Llama... Transformer-based models enter everyday use. The architecture that started in a Google paper now runs on billions of devices. Every capability we've covered (embeddings, similarity search, training loop, RLHF) only became possible because the Transformer removed the RNN bottleneck.
RNN vs Transformer — The Final Scoreboard
| Before 2017 (RNN + Attention) | After 2017 (Transformer) |
|---|---|
| Reads word by word ❌ | Reads the whole sentence at once ✅ |
| Forgets distant words ❌ | Every word attends to every other word ✅ |
| Hard to parallelize on GPUs ❌ | Runs on thousands of GPUs simultaneously ✅ |
| Long texts cause failures ❌ | Scales to 1M+ token context windows ✅ |
| RNN max context ~500 tokens ❌ | Transformer today: 1M+ tokens (Gemini 1.5 Pro) ✅ |
The Four Key Components — Summary
- Multi-Head Attention: Allows the model to see multiple types of relationships simultaneously — like a team of specialists each analyzing the same sentence from a different angle.
- Residual Connections: Guarantees that original information is never lost, even as it passes through dozens of transformation layers. The safety net of deep learning.
- Positional Encoding: Since the model reads everything in parallel, positional encodings inject word order information so the model can distinguish "dog bites man" from "man bites dog."
- Stacked Layers: Each block builds deeper understanding. Early layers capture surface patterns (syntax). Later layers capture abstract meaning (semantics, reasoning). This is what built ChatGPT and Claude.
The Core Insight
The numbers are impressive — but the real magic is how these four components work together inside every model you use.
Why the Transformer won
The Transformer's fundamental advantage isn't just accuracy — it's scalability. Because it's fully parallelizable, you can throw more GPUs at it and it gets proportionally faster. This enabled training on hundreds of billions of words in days rather than years. And as models scaled, entirely new capabilities emerged — reasoning, code generation, creative writing — that nobody had programmed explicitly.
The decision not to patent the Transformer architecture was arguably the most consequential act of open science in the history of AI. Every model you interact with today — when you ask ChatGPT a question, when Claude writes code, when Gemini translates text — runs on this architecture.
Pro Tips for Builders
💡 What Knowing the Transformer Changes For You
- Encoder vs Decoder matters for your use case. BERT-style (encoder-only) models are best for understanding tasks — classification, embeddings, similarity search. GPT-style (decoder-only) models are best for generation. Knowing the architecture helps you pick the right tool.
- Context window = Transformer memory. The reason models have a context limit is the self-attention mechanism — attention cost scales quadratically with sequence length. 1M-token models require architectural tricks (sparse attention, sliding windows) to make this tractable.
- More layers = more abstraction. Early layers in a 96-layer GPT capture syntax. Middle layers capture facts. Late layers handle reasoning and abstraction. This is why larger models are qualitatively better — not just quantitatively.
- Attention heads are interpretable. Tools like BertViz can show you which words each head attends to. This is one of the few places in deep learning where you can actually see what the model "thinks."
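The quadratic cost mentioned above is easy to verify with plain arithmetic — every token scores every other token, so the attention matrix has n² entries:

```python
# Each token attends to every other token: the score matrix has n * n entries.
for n in [1_000, 10_000, 1_000_000]:
    pairs = n * n
    print(f"{n:>9,} tokens -> {pairs:,} attention scores")

# Going from 1K to 1M tokens multiplies the score count by a million —
# which is why 1M-token models need sparse attention or sliding windows.
```

A 1,000x longer context means 1,000,000x more attention scores per layer, before even counting memory for storing them.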
Try It Yourself
Experiment 1: Visualize Attention
The tool BertViz lets you visualize how attention heads in BERT (a Transformer model) focus on different words. Watch how the head that handles syntax behaves differently from the head that handles semantics.
Experiment 2: Feel the Difference
Load bert-base-uncased (encoder-only Transformer) and gpt2 (decoder-only Transformer) via HuggingFace. BERT sees the whole sentence at once. GPT-2 generates tokens one at a time using its Transformer decoder. Same architecture, different configurations.
from transformers import pipeline
# BERT (encoder) — sees the full sentence at once and fills the blank
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
prediction = fill_mask("The bank by the [MASK] overflowed.")
print(prediction[0])
# e.g. {'token_str': 'river', 'score': ..., ...} — exact score varies by model version
#
# BERT picks "river" because it reads "overflowed" simultaneously
# with "bank" — context flows in both directions.
# GPT-2 (decoder) — generates tokens left-to-right
generator = pipeline("text-generation", model="gpt2")
continuation = generator("The bank by the river", max_new_tokens=5)
print(continuation[0]["generated_text"])
# e.g. "The bank by the river was flooded..." (generated text varies)
Experiment 3: Count Attention Heads
from transformers import GPT2Config
config = GPT2Config()
heads = config.n_head
layers = config.n_layer
print(f"GPT-2 Small: {heads} heads × {layers} layers = {heads * layers} attention ops")
# GPT-2 Small: 12 heads × 12 layers = 144 attention ops
Experiment 4: Test Long-Range Dependencies (Transformer vs RNN)
from transformers import pipeline
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")
sentence = """
The glasses I bought from the store in downtown Cairo
that my friend recommended last summer are [MASK].
"""
prediction = fill_mask(sentence)
print(prediction[0]["token_str"])
# e.g. "beautiful" — an adjective linked back to "glasses" despite the long gap.
# An RNN would likely have forgotten "glasses" by the time it reached [MASK].
Everything we've covered — from one neuron to embeddings to the full Transformer — comes together when the model actually writes its answer, one token at a time.