Ever wondered how ChatGPT or Gemini understands your queries and writes the perfect answer? It all started with the 2017 paper Attention Is All You Need. The Transformer model introduced in that paper revolutionized deep learning and natural language processing (NLP) by enabling models to process sequences in parallel, leading to breakthroughs in tasks like machine translation, text summarization, and question answering. It powers BERT, GPT, T5, and every major LLM we know today.
💡 Why This Paper Matters
- Replaces RNNs/CNNs in sequence tasks like translation
- Introduces self-attention as a universal computation layer
- Enables massive parallelization
- Paves the way for large language models (LLMs)
💡 The Problem with RNNs and CNNs
Property | RNN (e.g. LSTM/GRU) | CNN (e.g. ConvS2S) | Transformer |
---|---|---|---|
Parallelizable? | No | Yes | Yes |
Long-term dependencies | Hard | Better | Excellent |
Interpretability | Poor | Medium | Good |
Contents:
- Attention
- Self-Attention
- Multi-Head Attention
- Positional Encoding
- Transformer Architecture
- Encoder-Decoder Structure
- Self-Attention Mechanism
- Positional Encoding
- Feed-Forward Neural Networks
- Residual Connections and Layer Normalization
- Pizza🍕 Example
- References
Before this, models like RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks) were used for understanding sequences like text. But they had limitations: they were slow, couldn’t remember long sentences well, and were hard to train.
The Transformer is an encoder-decoder architecture built entirely on self-attention and position-wise feed-forward networks:
- Encoder: 6 identical layers
  - Self-Attention + Feed-Forward + Residual + LayerNorm
- Decoder: 6 identical layers
  - Masked Self-Attention + Encoder-Decoder Attention + Feed-Forward
What Is Attention 👀?
In human terms:
Imagine you're reading a sentence:
"Kaustubh is typing something on laptop, Because he has to submit his assignment before deadline."
We know that “He” refers to “Kaustubh,” not “laptop.”
That’s attention — your brain is figuring out which word “He” relates to.
In AI terms:
Attention is a mechanism in neural networks that helps the model figure out which words to focus on when processing language. It assigns importance (attention scores) to each word depending on how relevant it is to the current word being looked at.
This allows models to weigh the relevance of different parts of the input data dynamically, enabling them to focus on the most pertinent information when generating outputs.
Self-Attention 👀 (Looking Around the Sentence)
Self-Attention means every word looks at every other word in the sentence to understand the context.
Self-Attention allows each word in a sequence to consider every other word, enabling the model to capture contextual relationships regardless of their distance in the sequence. This is crucial for understanding nuances in language, such as resolving pronouns or interpreting idiomatic expressions.
Each token is transformed into 3 vectors:
- Query (Q)
- Key (K)
- Value (V)
The attention output is computed as:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
- Measures compatibility between query and keys
- Outputs weighted average of values
Example:
"The cat sat on the mat."
- To understand "sat", it attends to "cat" (subject) and "mat" (location)
Multi-Head Attention 👀🧠 (MHA)
Instead of using one attention lens, Transformers use multiple attention heads. Each head learns to focus on different parts of the sentence and works like a detective looking for something different —
one looks at verbs, another checks pronouns, another scans for sentiment.
This gives the model a richer understanding of language.
Multi-Head Attention employs multiple attention mechanisms in parallel, allowing the model to capture various types of relationships (e.g., syntactic and semantic) simultaneously. Each head operates in a different representation subspace, providing a richer understanding of the input.
Instead of a single attention operation, we run multiple in parallel:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
Each head_i is:
head_i = Attention(QW^Q_i, KW^K_i, VW^V_i)
This allows the model to capture multiple relationships (e.g., syntax, semantics) in parallel.
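A minimal sketch of the multi-head version, assuming NumPy and randomly initialized placeholder weights (in a trained model the per-head projections W^Q_i, W^K_i, W^V_i and the output projection W^O are learned):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    # Run attention once per head in its own subspace, then Concat(head_1, ..., head_h) W^O
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

# Toy setup: 6 tokens, d_model = 512, h = 8 heads of size d_k = 64
rng = np.random.default_rng(0)
d_model, h, d_k = 512, 8, 64
Wq = [rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d_model)) * 0.02

X = rng.normal(size=(6, d_model))                       # 6 token embeddings
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)    # (6, 512)
```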
https://poloclub.github.io/transformer-explainer/
Positional Encoding
Transformers don’t naturally understand the order of words (unlike RNNs which process words one by one).
Positional Encoding is like giving each word a unique signal based on where it appears in the sentence. Since Transformers lack inherent sequence order awareness, Positional Encoding injects information about the position of each token using sinusoidal functions, enabling the model to distinguish between different word orders.
“The cat sat on the mat” ≠ “Mat the sat cat on the”
Positional encoding ensures the model knows which is which.
Because there's no recurrence or convolution, we inject positional info into embeddings:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
This lets the model learn relative position information.
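A small NumPy sketch of these sinusoidal encodings (assuming d_model is even; the sizes here are only to illustrate the computation):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions get cos
    return pe

pe = positional_encoding(max_len=10, d_model=512)
print(pe.shape)   # (10, 512): one vector per position, added to the token embeddings
```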
🌟Transformer Architecture:
The Transformer is a deep learning model architecture designed to process sequences of data — especially text. It helps the model understand the meaning of the input sentence, figure out what’s important, and generate a meaningful response. It does this by analyzing all the words in your sentence at once, paying special attention to how each word relates to every other word.
The Transformer model is made up of two main parts:
1. Encoder: reads the input sentence (like a question)
2. Decoder: generates the output sentence (like the answer)
Each of these parts has multiple layers of:
- Self-Attention
- Feed-Forward Neural Networks (tiny ML models that learn patterns)
- Layer Normalization (helps training stay stable)
- Residual Connections (skip connections that prevent data loss)
and this whole block is stacked 6 times (or more in bigger models) to learn deeper patterns.
Left Side: Encoder Block (Stacked N times)
This part reads and understands the input sentence.
Components:
1.Input Embedding
Converts words into vector form (numerical representation).
2.Positional Encoding
Since Transformers don’t know the order of words, this encoding adds info about word position.
3.Multi-Head Attention
The model looks at other words in the sentence to understand context.
→ Multi-head = several attention mechanisms in parallel.
4.Add & Norm
Add : Applies residual connection (original input + output of layer)
Norm: Normalizes values for stable training
5.Feed Forward Network (FFN)
A small neural network applied to each token to learn complex patterns.
All this is repeated N times (usually 6 in the original model) to build deep understanding.
Right Side: Decoder Block (Stacked N times)
This part generates the output sentence based on what the encoder understood.
Components:
1.Output Embedding
Converts previous output words (shifted right) into vectors.
2.Positional Encoding
Again, tells the model the order of output words.
3.Masked Multi-Head Attention
Prevents the decoder from seeing future tokens during training.
(So it predicts one word at a time without cheating!)
4.Multi-Head Attention (over Encoder Output)
Now it attends over the encoder's final output — combining what it "heard" with what it wants to "say".
5.Add & Norm
Add : Applies residual connection (original input + output of layer)
Norm: Normalizes values for stable training
6.Feed Forward Network (FFN)
A small neural network applied to each token to learn complex patterns.
7.Linear Layer
Converts the decoder’s output into logits (raw scores for each word in the vocabulary).
8.Softmax
The Softmax activation function turns logits into probabilities (range 0 to 1). The word with the highest probability is selected as the output.
Let’s understand this with an example.
Suppose you're chatting with a restaurant chatbot🤖 and ask:
🗣️ "I want a large pepperoni pizza with extra cheese."
Let’s see step by step how Transformer handles this behind the scenes.
🧠 Step 1: Input Embedding (Encoder)
The sentence gets broken into words:
"I", "want", "a", "large", "pepperoni", "pizza", "with", "extra", "cheese" etc.
and each word is turned into a vector (basically a bunch of numbers).
(visualization of high-dimensional data- https://projector.tensorflow.org/ )
The Transformer architecture natively processes numerical data, not text.
So how do words become vectors? Let’s understand this step by step:
- Tokenization
- Assigning Token IDs (Numerical Encoding)
- Word Embeddings
- Adding Positional Encoding
1. Tokenization - First, your sentence is split into tokens — which can be words, subwords, or even characters. On average, 1 word ≈ 1.3 tokens, i.e.
No. of tokens ≈ No. of words × 1.3
https://platform.openai.com/tokenizer
👉 Sentence:
"I want a large pepperoni pizza with extra cheese."
👉 After tokenization (using any high-performance tokenizer ):
["I", "want", "a", "large", "pepperoni","pizza"..................]
👉Some models might split further:
["I", "want", "a", "large", "peppe", "roni", "pi", "zza"........]
2. Assigning token IDs - Each token gets a unique ID from the model’s vocabulary.
Token | ID |
---|---|
I | 101 |
want | 2054 |
a | 2053 |
large | 6587 |
pepperoni | 7892 |
pizza | 4782 |
(These are just numbers used to look up the next step — embeddings.)
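As a toy sketch of this ID-assignment step (the vocabulary below is made up for illustration; real models use learned subword vocabularies such as BPE with tens of thousands of entries):

```python
# Hypothetical mini-vocabulary, roughly matching the IDs in the table above
toy_vocab = {"I": 101, "want": 2054, "a": 2053, "large": 6587,
             "pepperoni": 7892, "pizza": 4782, "with": 3011,
             "extra": 5120, "cheese": 6644, ".": 1012}

sentence = "I want a large pepperoni pizza with extra cheese ."
tokens = sentence.split()                      # naive whitespace tokenization
token_ids = [toy_vocab[t] for t in tokens]     # vocabulary lookup
print(token_ids)                               # [101, 2054, 2053, 6587, 7892, 4782, ...]
```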
3. Vector Embeddings - Now, each token ID is mapped to a vector — a long list of decimal numbers (typically 512 or 768 values).
So:
"I" → [0.02, -0.31, ..., 0.11]
"want" → [0.65, 0.21, ..., -0.09]
"a" → [0.69, 0.23, ..., -0.06]
"large" → [0.70, 0.25, ..., -0.05]
"pepperoni" → [0.75, -0.30, ..., 0.02]
"pizza" → [0.89, -0.45, ..., 0.03]
These vectors are learned representations — the model figures them out during training so that similar words get similar vectors. "pizza" and "burger" will have similar vectors (both are food 🍔🍕).
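Under the hood, the embedding table is just a big matrix and the lookup is row indexing. A rough sketch (the matrix here is random; in a trained model its rows are learned so that related words end up close):

```python
import numpy as np

vocab_size, d_model = 10000, 512
embedding_table = np.random.default_rng(0).normal(size=(vocab_size, d_model)) * 0.02

token_ids = [101, 2054, 2053, 6587, 7892, 4782]   # "I want a large pepperoni pizza"
token_vectors = embedding_table[token_ids]        # row lookup, shape (6, 512)
print(token_vectors.shape)
```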
4. Positional Encoding - Transformers don’t process tokens in order like RNNs do — so we need to tell them where each word appears. To give the model a sense of the position of words in the sequence, we add positional encoding to the token embeddings.
A fixed-size vector added to each token's embedding to inject sequence order information.
👇 The formula (from the paper):
For position pos and dimension i:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
This creates unique wave patterns per position across dimensions that help the model identify order and relative distance. Suppose embedding size = 4 (just for simplicity):
Token | Word Embedding | Positional Encoding | Final Input Vector (Sum) |
---|---|---|---|
“I” | [0.1, 0.2, 0.3, 0.4] | [0.0, 1.0, 0.0, 1.0] | [0.1, 1.2, 0.3, 1.4] |
“want” | [0.2, 0.3, 0.4, 0.5] | [0.84, 0.54, 0.91, -0.41] | [1.04, 0.84, 1.31, 0.09] |
… | … | … | … |
If you plot these encodings, each colored line represents one dimension of the positional encoding vector:
- X-axis = Position in the sequence (e.g., token 0 to 9)
- Y-axis = Value of the encoding in that dimension
- Sin and Cos waves allow the model to detect:
- Absolute position (thanks to the shape of the wave)
- Relative distance (via wave phase shifts)
This encoding is added directly to token embeddings, enabling the Transformer to understand word order without any RNNs or CNNs.
In this way word is turned into a vector.
🧠 Step 2: Multi-Head Self-Attention (Encoder)
The model tries to understand the context by looking at each word in relation to others.
For example:
It understands "pepperoni" modifies "pizza", and "extra" belongs with "cheese".
Self-Attention: The Core Idea
For each input token, the model calculates attention over all tokens in the sequence, including itself. This enables contextualized representations—each word "looks" at every other word to gather relevant information.
1. Input Representation
Given a sequence of tokens represented by embeddings:
X = [x₁, x₂, ..., xₙ] where each xᵢ ∈ ℝᵈ (e.g., d = 512)
We transform X into three matrices through learned linear projections:
Query: Q = XW^Q --> a Query vector (what it wants)
Key: K = XW^K --> a Key vector (what it contains)
Value: V = XW^V --> a Value vector (the actual content to share)
Each W is a learnable weight matrix of shape ℝᵈˣᵈₖ
For a token like "pepperoni":
It compares its Query vector with all other tokens' Key vectors via dot product.
Attention Score("pepperoni" → "pizza") = Q_pepperoni · K_pizza
This gives a score matrix (before softmax):
 | I | want | a | large | pepperoni | pizza | with | extra | cheese |
---|---|---|---|---|---|---|---|---|---|
pepperoni | 0.1 | 0.2 | 0.3 | 0.4 | 1.0 | 1.9 | 0.7 | 0.2 | 0.3 |
Apply Softmax (Normalization)
This turns raw scores into probabilities:
Weights:
I = 0.01, want = 0.02, ..., pizza = **0.55**, ..., cheese = 0.05
It means "pepperoni" mostly attends to "pizza", but a bit to others too.
Compute Output for "pepperoni"
Each score is multiplied with the corresponding Value vector, and summed:
output_pepperoni = Σ (Attention Weight_i × V_i)
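Here is that step in NumPy on the toy score row above (note: the weights quoted in the text are rounded illustrations; the exact numbers depend on the scores and on the 1/√d_k scaling, which is omitted in this toy row):

```python
import numpy as np

tokens = ["I", "want", "a", "large", "pepperoni", "pizza", "with", "extra", "cheese"]
scores = np.array([0.1, 0.2, 0.3, 0.4, 1.0, 1.9, 0.7, 0.2, 0.3])   # pepperoni's score row

weights = np.exp(scores) / np.exp(scores).sum()   # softmax: "pizza" gets the largest weight
for t, w in zip(tokens, weights):
    print(f"{t:>10}: {w:.2f}")

# Output vector for "pepperoni": weighted sum of all Value vectors
V = np.random.default_rng(0).normal(size=(9, 4))  # toy Value vectors, d_v = 4
output_pepperoni = weights @ V                    # Σ (Attention Weight_i × V_i)
```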
💡Multi-Head Part
Repeat all the above across multiple heads (e.g., 8) with different learned weights:
- Head 1 might focus on syntactic relations ("I" ↔ "want")
- Head 2 might capture ingredients ("pepperoni" ↔ "pizza" ↔ "cheese")
- Head 3 might model modifiers ("extra" ↔ "cheese", "large" ↔ "pizza")
Each head learns a different perspective, and their outputs are concatenated and projected into one final output vector per token.
2. Scaled Dot-Product Attention
For each token, attention is computed as Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, where:
- QKᵀ: dot product to get similarity between the query and each key
- softmax: normalizes the scores into weights
- √d_k: scaling factor to avoid large values going into the softmax
- V: the weights are multiplied by the values to aggregate information
3. Multi-Head Attention
Instead of performing attention just once, we do it h times (e.g., 8 or 12):
Each head has its own Q, K, V linear projections. The outputs of all heads are concatenated and projected using W^O.
This allows each head to capture different relationships or features across tokens.
🧠 Step 3: Feed-Forward Layer
After each token gets its updated, context-aware vector from the self-attention block, that vector is passed through a small neural network — this is the Feed-Forward Neural Network (FFN).
The same FFN is applied independently to each token (i.e., position-wise). That means it doesn’t mix information across tokens anymore — it just transforms the vector at each position.
💡 Why?
To add non-linearity and help the model learn more complex patterns beyond attention.(like "pepperoni + pizza = topping request").
It lets the model refine what it’s learned from context.
Each FFN has two linear (dense) layers with a ReLU activation in between:
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
This may look complicated, but it’s just two linear transformations with a ReLU activation in between.
We’re taking a vector (x) that represents a token (like a word in context), and transforming it into something deeper and more meaningful by applying:
1. First Linear Layer:
h=xW1 + b1
x: input vector (size d_model, say 512)
W₁: weight matrix that increases dimensionality (512 → 2048)
b₁: bias vector (added after the multiplication)
2. Non-Linearity (ReLU Activation):
a=max(0,h)
This is the ReLU (Rectified Linear Unit) activation. It keeps only the positive parts of the vector and turns negative values into 0, which helps the network learn non-linear patterns.
3. Second Linear Layer:
output=aW2 + b2
W₂: weight matrix that shrinks the vector back down (2048 → 512)
b₂: another bias vector
This brings the high-dimensional vector back to the original size so it fits nicely into the next step in the Transformer block. This final vector is the refined version of the input token — it now contains richer patterns and representations, ready to move forward in the model.
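A minimal sketch of this position-wise FFN in NumPy, using the paper’s sizes (d_model = 512, d_ff = 2048) and random placeholder weights:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    h = x @ W1 + b1          # first linear layer: 512 -> 2048
    a = np.maximum(0, h)     # ReLU: keep positives, zero out negatives
    return a @ W2 + b2       # second linear layer: 2048 -> 512

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

x = rng.normal(size=(6, d_model))                # 6 context-aware token vectors
print(feed_forward(x, W1, b1, W2, b2).shape)     # (6, 512): same shape, refined content
```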
💡Why Not Just One Layer?
Using two layers with ReLU in between allows the model to:
- Learn complex, non-linear mappings
- Separate important features
- Introduce more parameters = more learning capacity
After either the Self-Attention or the Feed-Forward Network (FFN), the model applies Add & Norm.
Add (Residual Connection):
The original input (before the Attention or FFN layer) is added back to the output of that layer. This is called a residual connection or skip connection.
Output = LayerInput + LayerOutput
💡It helps the model retain original information and avoids vanishing gradients (helps in training deep networks).
Norm (Layer Normalization):
The result is then passed through LayerNorm, which normalizes the data across its features.
FinalOutput = LayerNorm(Output)
💡This keeps values balanced and stable, so the model trains faster and more reliably.
Example
Imagine you're revising an essay:
-The original draft is the input.
-The editor’s suggestions (Self-Attention or FFN) are applied.
-But instead of throwing away the original, you add the suggestions on top of it.
-Then you polish (normalize) the final version for consistency and tone.
Let’s say the input vector is:
x = [0.4, 0.7, -0.2]
The output of Attention is:
attention_output = [0.3, 0.1, 0.5]
Add:
sum = x + attention_output = [0.7, 0.8, 0.3]
LayerNorm is applied across all the features of a single data point (token). It ensures that the features (the vector) have:
Mean = 0
Standard Deviation = 1
Then it adds two trainable parameters:
γ (gamma) = scale
β (beta) = shift
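Putting the example together, a small NumPy sketch of Add & Norm (γ and β start at 1 and 0 here; in training they are learned):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta   # mean ~0, std ~1, then scale & shift

x = np.array([0.4, 0.7, -0.2])                 # layer input
attention_output = np.array([0.3, 0.1, 0.5])   # layer output
added = x + attention_output                   # residual connection -> [0.7, 0.8, 0.3]

gamma, beta = np.ones(3), np.zeros(3)          # trainable scale and shift
print(layer_norm(added, gamma, beta))
```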
🧠 Step 4: Repeat Layer N times
The Transformer repeats a full "layer block" N times (in the original paper, N = 6).
Why Repeat N Times?
It is like training your brain with layers of understanding:
Layer 1 learns basic relationships (e.g., "Cats" and "cute" are connected).
Layer 2 understands slightly more complex patterns (e.g., "are cute" is a phrase).
...
Layer N captures very deep dependencies, context, and grammar.
More layers = deeper understanding of the input/output sentence structure and meaning. Each layer is like a new pair of glasses that sees deeper features. After 6 rounds (or more), the model understands subtle relationships between words even if they’re far apart — like how in:
“The dog that chased the cat was fast.”
The model knows “dog” is the subject of “was fast.”
Let's get back to our original example of Restaurant Chatbot Prompt:
“I want a large pepperoni pizza with extra cheese.”
💡 Layer 1 – Word Recognition
The model just identifies important keywords:
large
pepperoni
pizza
extra
cheese
But it doesn’t really "get" what you want yet.
💡 Layer 2 – Basic Structure
It starts understanding structure:
You’re making an order.
“Large” refers to pizza size, not pepperoni or cheese.
“Extra” is an intensifier, modifying cheese.
💡 Layer 3 – Contextual Attention
Now it gets better:
Connects "extra" → cheese
Knows pepperoni is a topping, not a size
Starts forming a mental image of a pizza order
💡 Layer 4 – Intent Formation
The model figures out your intent:
You’re requesting a custom pizza: large size, pepperoni topping, and more cheese than usual.
💡 Layer 5 – Response Logic
It prepares a relevant action or response, like:
“Got it! One large pepperoni pizza with extra cheese coming up.”
Or if it’s an app: it builds the right order object in code.
🧾 Layer 6 – Final Output Ready
By now, your sentence has been passed through 6 refined filters. The model fully understands:
1.Structure
2.Grammar
3.Intent
4.Meaning
5.Action
So it’s ready to generate a fluent, helpful response (or pass this meaning to the decoder if it's a translation or chatbot).
Now the Encoder has a deep, contextual understanding of what you're asking. The Decoder takes over and begins to generate a response, word by word.
Step 5: Decoder Starts Generating a Response
Key Concepts in the Decoder:
Each decoder layer also has three main blocks:
1.Masked Self-Attention – So it doesn’t "peek" ahead at future words
2.Encoder-Decoder Attention – To focus on the input sentence
3.Feed-Forward Layer – Just like in the encoder
💡The Response Begins:
1.Masked Self-Attention:
It says:
“OK, I’ve got the start token, let’s guess the first word…”
It attends to nothing yet because this is the first token.
2.Encoder-Decoder Attention:
Now it looks at the encoder output — which understood the pizza order.
It pays attention to:
"large" (size)
"pepperoni" (topping)
"extra cheese" (modifier)
3.Feed-Forward Layer:
Processes everything so far and generates the first word:
“Got”
💡 Repeat for Next Word:
Now it has:
Got
Again:
- Masked Self-Attention makes sure it only looks at the start token and “Got”, not future words.
- Encoder-Decoder Attention still references the full pizza order.
- Generates the next word: "it!"
Then:
Got it!
Next: "One", "large", "pepperoni", etc.
💡 Why Masked Attention?
It prevents the model from cheating by looking at future words while generating. At each step, it can only look at previously generated tokens.
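A quick NumPy sketch of how the mask is applied: future positions are set to −∞ before the softmax, so their attention weights come out as exactly 0 (the scores here are random, just to show the mechanics):

```python
import numpy as np

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))    # toy Q·Kᵀ scores
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)          # True above the diagonal
scores[mask] = -np.inf                                                # hide future tokens

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)               # row-wise softmax
print(np.round(weights, 2))   # upper triangle is all zeros: no attention to the future
```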
💡When Does It Stop?
The decoder keeps generating until it produces a special end-of-sequence token.
So the final output might be:
“Got it! One large pepperoni pizza with extra cheese coming right up!”
In short, the Decoder works like this:
1.Starts with an empty sentence (just a start token).
2.Masked Self-Attention ensures the bot generates one word at a time.
3.Cross-Attention lets it look at the encoder’s understanding of your input sentence.
4.Word-by-word, it builds (a minimal decoding-loop sketch follows below):
["Got", "it!", "One", "large", "pepperoni", "pizza", "with", "extra", "cheese", ...]
Step 6: Final Layer
The last layer maps the internal representation to actual words using a Softmax over the vocabulary.
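Sketch of that final step, with placeholder sizes (the vocabulary size and projection weights are made up; in a real model the weights are learned):

```python
import numpy as np

d_model, vocab_size = 512, 10000
rng = np.random.default_rng(0)
W_vocab = rng.normal(size=(d_model, vocab_size)) * 0.02   # final linear layer

decoder_vector = rng.normal(size=(d_model,))   # decoder output for one position
logits = decoder_vector @ W_vocab              # raw score for every word in the vocabulary
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()                    # softmax: probabilities in [0, 1]
next_word_id = int(np.argmax(probs))           # pick the most probable next token
```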
Final Wrap-Up: Beyond the Layers
We've journeyed through the Transformer model like ordering pizza from a multilingual robot — from input embeddings to multi-head attention, through feed-forward nets, and finally watching the decoder craft the best response.
But there’s more.... Let’s break it down:
💡1. Why Positional Encoding Isn’t Just Math
Sine and cosine — they give each word a unique wave signature based on its position. Unlike RNNs, Transformers don’t process sequentially, so this tells the model, “Hey, ‘pepperoni’ comes before ‘pizza’.”
💡 2. Decoder’s Mask = No Peeking Policy 🤦♂️
The decoder’s attention is masked so it can’t “look ahead” during training. This keeps it honest while learning to predict one word at a time.
💡 3. Multi-Head Attention = Parallel Thoughts
Each attention head learns something different. One might track verbs, another focuses on names or locations. It’s like getting advice from 8 detectives before making a decision.
💡 4. Training
Optimizer: Adam with warmup steps (a sketch of the schedule follows below).
Loss Function: Cross-entropy + a dash of label smoothing to avoid overconfidence.
Layer Count: 6 for both encoder and decoder in the original model.
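For reference, the paper’s learning-rate schedule (linear warmup, then inverse-square-root decay, with warmup_steps = 4000 in the original setup) in a few lines of Python:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Rises during warmup, then decays:
# transformer_lr(100) < transformer_lr(4000) > transformer_lr(40000)
```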
💡 5. Scaling the Infra
The paper used 6 layers. Fast-forward to today’s models (GPT-4, Gemini, Claude): they have 100+ layers, billions of parameters, and are trained on terabytes of data.
💡 6. What Transformers Dropped: No RNNs, No CNNs
Transformers dropped recurrence and convolutions and replaced them with pure attention. The result: faster training and full-sequence visibility at once.
💡 7. How They Evaluated It: BLEU Score
The original task in the paper was machine translation. They used the BLEU score to measure how close the output was to human translation.
💡 8. Training vs Inference
-During Training: The model sees the correct sentence.
-During Inference: It has to generate one word at a time from scratch.
References -
https://arxiv.org/abs/1706.03762
https://nlp.seas.harvard.edu/annotated-transformer/
https://jalammar.github.io/illustrated-transformer/
https://www.wikiwand.com/en/articles/Transformer_%28deep_learning_architecture%29
https://d2l.ai
https://bbycroft.net/llm
https://poloclub.github.io/transformer-explainer/