Yuvaraj
Transformer - Encoder Deep Dive - Part 3: What is Self-Attention

Recap

  • Embedding: "The", "dog", "bit", "the", "man" each have a unique semantic identity.
  • Positional Encoding: Each word now knows exactly where it sits in the sentence.

Wait... what exactly is the Encoder's job? (A recap from Part 2.)

The sole purpose of the Encoder is to understand Context.

With the example, "The dog bit the man" - let’s look at the word "bit".

On its own, "bit" could mean:

  • A small piece of something (a "bit" of chocolate).
  • The past tense of "bite" (the action).
  • A digital 0 or 1 (a computer "bit").

The Encoder doesn't know which one it is until it pays Attention to the words around it.

Right now, those words are like strangers in an elevator: they are standing near each other, but they aren't talking.

What exactly is "Self-Attention"?

Self: The model is looking at the same sentence it is currently processing. It isn't looking at a dictionary or a translation yet; it's just looking at its own words.

Attention: The model decides which other words in that sentence are relevant to the word it is currently "thinking" about.

The Definition: Self-Attention is a mechanism that allows a word to "look" at every other word in its own sentence to find the context it needs to define itself.

The "Relationship" Logic
In our sentence "The dog bit the man," Self-Attention is the reason the model knows that:

  • "dog" is related to "bit" (as the actor).
  • "man" is related to "bit" (as the receiver).
  • "the" is related to "dog" (telling us it's a specific dog).

Without Self-Attention, the word "bit" is just a three-letter string. With Self-Attention, "bit" becomes a bridge that connects a subject (dog) to an object (man).

Attention is the conversation.

Our positionally encoded matrix from Part 2 is now standing at the door of the first Multi-Head Attention block.

Let's understand Self-Attention in this article.

In a real Transformer, 8 of these heads work together to create 'Multi-Head Attention,' which we will glue together in Part 4.

Visualization of the Encoder-Decoder Transformer architecture


Queries, Keys, and Values (Q, K, V)

To calculate attention, we don't just use the input matrix as it is.
Self-Attention transforms our input matrix into three different versions of itself using three learnable weight matrices (W^Q, W^K, W^V).

Think of this like taking the same word and looking at it through three different lenses:

  1. Query (Q) - "The Search": This is what a word is looking for.
    Example: The word "dog" asks: "Is there an action in this sentence that I performed?"

  2. Key (K) - "The Label": This is how a word identifies itself to others.
    Example: The word "bit" says: "I am an action involving teeth."

  3. Value (V) - "The Cargo": This is the actual information a word carries.
    Example: The word 'dog' (Query) found a match with the label 'action' (Key) on the word 'bit.' It then reached inside the truck and took the 'biting information' (Value) to update its own identity.
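To make the "three lenses" idea concrete, here is a minimal NumPy sketch. The numbers and dimensions are made up (a 4-dimensional toy vector instead of 512), and the weight matrices are random rather than trained; the point is only that the same word vector produces three different views of itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 4                       # toy embedding size (real models use 512+)

# One word's embedding, e.g. "dog" after positional encoding (made-up numbers).
dog = rng.standard_normal(d_model)

# Three learnable "lenses" (random and untrained here, for illustration only).
W_Q = rng.standard_normal((d_model, d_model))
W_K = rng.standard_normal((d_model, d_model))
W_V = rng.standard_normal((d_model, d_model))

q = dog @ W_Q   # the Search: what "dog" is looking for
k = dog @ W_K   # the Label: how "dog" identifies itself to others
v = dog @ W_V   # the Cargo: the content "dog" carries

# Same word, three different views of it.
print(q, k, v, sep="\n")
```

The key takeaway: Q, K, and V are not new words, just the same word passed through three different learned matrices.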


Here is a series of four images that visually break down the concept of Self-Attention, using our "The dog bit the man" example.

"The", "dog", "bit", and "man" sit side by side without knowing how they are connected to each other in the sentence "The dog bit the man".

the word vectors for "dog" and "bit" are isolated. The "dog" vector is a generic noun with no knowledge of the action it's about to perform.

Visualized: the "dog" and "bit" words as isolated matrices

How does 'dog' find 'bit'?

The "dog" vector (acting as the Query) "shines a light" on all other words to find its match. The "bit" vector (acting as the Key) responds strongly, creating a high Attention Score.

Visualized: how "dog" finds its meaning through the word "bit"

The Transfer of Meaning

Using the Attention Score as a weight, the "bit" vector's actual content (Value) is transferred to the "dog" vector. The "dog" is now "absorbing" the meaning of the action.

Visualized: how the meaning of "bit" is transferred to the word "dog"

Contextualized

After the process, the "dog" vector is transformed. Its mathematical representation has changed (visualized here by the color blending), and it is now a "context-aware" vector that knows it is the subject of the bite.

The word 'dog' is no longer a generic noun; it’s a subject tied to an action.

Visualized: the final matrix contains the subject who did the biting


Self-Attention Formula: Deep Dive

For the developers who want to see the code or the math, everything we just discussed (Query, Key, Value) is condensed into one famous formula:

Attention(Q, K, V) = softmax( (Q · K^T) / √d_k ) · V

Here is a step-by-step visual breakdown of the Self-Attention mechanism, using our sentence "The dog bit the man". We'll follow the mathematical formula and visualize how the input matrices are transformed at each stage.

Step 1: The Initial Learned Matrices (Q, K, V)

Before any attention is calculated, the input matrix is multiplied by three separate, learnable weight matrices (W^Q, W^K, W^V) to create three new matrices: Query (Q), Key (K), and Value (V). These matrices are the starting point for our calculation.

Visualized Q,K,V matrices

You might be wondering: "If the input is just our 'The dog bit the man' matrix, why do Q, K, and V have different numbers?"

This happens through Linear Transformation.

We take our input and multiply it by three separate "Weight Matrices." These weights are like filters or lenses that highlight different parts of the word's meaning.

  • Input × W^Q = Q: This transformation extracts the "Question" part of the word.
  • Input × W^K = K: This extracts the "Label" part of the word.
  • Input × W^V = V: This extracts the "Cargo" (Content) part of the word.

Why this matters:
These W matrices are learnable. At first, the model is bad at asking questions. But over time, it learns exactly how to adjust the numbers in W^Q so that the word "dog" asks the perfect question to find its verb "bit."
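Step 1 can be sketched in a few lines of NumPy. The dimensions are toy-sized (5 tokens, 8-dimensional vectors instead of 512), and the weight matrices are random stand-ins for what training would learn:

```python
import numpy as np

rng = np.random.default_rng(42)
seq_len, d_model, d_k = 5, 8, 8   # 5 tokens: "The dog bit the man" (toy sizes)

X = rng.standard_normal((seq_len, d_model))   # embeddings + positional encoding

# Learnable projection weights (random here; training would tune them).
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q = X @ W_Q   # (5, d_k): one "question" row per word
K = X @ W_K   # (5, d_k): one "label" row per word
V = X @ W_V   # (5, d_k): one "cargo" row per word
print(Q.shape, K.shape, V.shape)
```

Note that all three come from the same input X; only the lenses differ.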

Step 2: The Compatibility Check (Raw Scores)

Now, we calculate how much every word should listen to every other word (their "compatibility"). To do this, we perform a Dot Product between the Query (Q) matrix and the transposed Key (K^T) matrix.

💡 Quick Recall: If you need a refresher on how the math of multiplying these matrices works, check out Step 5 of Part 1, where we saw how rows and columns collide to create a "Relationship Score."

In this step, we multiply the Query of "dog" by the Transpose of the Key "bit".

The Result: We get a raw "Attention Score."

The Logic: If the "Search Query" of the dog matches the "Label" of the bite, the math produces a high number. If they don't match, the number stays near zero.

For example, the high score of 15.2 between "dog" and "bit" indicates a strong connection.

Visualized dot product q and transposed k
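The compatibility check is one matrix multiplication. Continuing the toy setup from Step 1 (random Q and K, not trained values, so the actual numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
seq_len, d_k = 5, 8
Q = rng.standard_normal((seq_len, d_k))   # toy Queries for our 5 words
K = rng.standard_normal((seq_len, d_k))   # toy Keys for our 5 words

scores = Q @ K.T       # (5, 5): row i = how word i's query matches every key
print(scores.shape)

# With real trained weights, scores[1, 2] would be the raw attention score
# of "dog" (row 1) toward "bit" (column 2), e.g. a high value like 15.2.
```

Each row of this 5×5 matrix is one word "shining its light" on all five words, including itself.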

Step 3: Scaling

These are the two critical steps that turn raw, unstable scores into clear probabilities for the model.

3.1. The Scaling Step: Stabilizing the Math

Before we can turn our scores into percentages, we have to manage their size.
The raw scores from the dot product (Q * K^T) can be very large, especially with high-dimensional vectors (d_model=512).

Why is this a problem? Large numbers can cause the training process to become unstable. The model's gradients can become too small ("vanishing gradients"), meaning it stops learning.

Now, why do we care if gradients get too small?

When we apply Softmax to very large numbers (like our unscaled scores of 15 or 20), the function becomes "extremely confident." It gives one word 99.999% of the attention and everything else 0.00001%.

Deep dive into this problem: what is a Gradient?
A gradient is a "directional signal" telling the model how to improve. When the scores get too large, this signal becomes so weak that the model gets "confused" and stops learning.

Let's imagine you are standing on a foggy mountain in the dark, and your goal is to reach the lowest valley (the "Loss" or "Error"). Because of the fog, you can’t see the bottom.

The Gradient is like feeling the ground with your foot to see which way it slopes.

  • If your foot feels a steep slope downward, that is a Strong Gradient. It tells you exactly which way to step to get closer to the bottom.
  • If the ground feels almost perfectly flat, that is a Vanishing Gradient. You have no idea which way to move to improve. You are stuck.

The Solution: We divide the raw scores by a scaling factor: the square root of the dimension of the keys, √d_k. This "squashes" the scores back into a manageable range without changing their relative order.

Visualized scaling

Visualized: softmax applied row-wise
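You can see the "extreme confidence" problem and the fix in a few lines. The raw scores below are made-up examples, but the effect is general: without scaling, softmax gives nearly all the attention to one word; after dividing by √d_k, the distribution softens and gradients can flow again.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

raw = np.array([15.2, 9.0, 3.0, 1.0, 0.5])   # made-up raw scores for one row
d_k = 64

print(softmax(raw))                  # unscaled: ~99.8% on the 15.2 entry
print(softmax(raw / np.sqrt(d_k)))   # scaled: a softer, trainable distribution
```

Note that dividing every score by the same constant preserves their order: the biggest score still wins, it just stops winning by a landslide.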

3.1.1. What is d_k? (The Width of the Key)

Remember our "Semantic Passport" analogy from Part 2? Each word has a vector of 512 numbers (d_model = 512). However, when it comes time to talk in the Engine Room, the model doesn't use all 512 pages at once.

Instead, it breaks those 512 dimensions into smaller, specialized chunks. d_k is the size of one of those chunks—typically 64.

3.1.2. Why not use all 512 at once? (The Specialization Problem)

You might ask: "Why not just calculate one massive attention score for all 512 pages?" The answer is Specialization. If you use all 512 dimensions at once, you get one single "Attention Score." This score becomes an average of every word's relationship in the sentence, and in language, averages are dangerous.

The Analogy: Imagine you are at a business meeting. If you try to listen to the CEO, the Accountant, and the Engineer through one single "ear," their voices blur together. You might get the "average" topic, but you’ll miss the specific details of the budget or the technical specs.

By breaking the 512 dimensions into 8 chunks of 64, the model creates 8 specialized "Attention Heads."

Each head acts like a specialist:

  • Head 1: Focuses on Grammar (Subject-Verb relationship).
  • Head 2: Focuses on Entity Relationships (Who is the "dog" and who is the "man"?).
  • Head 3: Focuses on the "Tense" or "Time" of the sentence.
  • Head 4: ...
  • Head n: ...
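The 512-into-8-chunks idea is just a reshape. (In a real implementation each head has its own projection weights rather than a literal slice of the vector, but the bookkeeping of dimensions looks like this.)

```python
import numpy as np

d_model, n_heads = 512, 8
d_k = d_model // n_heads             # 64 dimensions per head

x = np.arange(d_model, dtype=float)  # one word's 512-dim vector (dummy values)
heads = x.reshape(n_heads, d_k)      # 8 specialist chunks of 64 numbers each

print(heads.shape)                   # each row is one head's view of the word
```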

Step 4: The Softmax Step: The "Winner-Takes-All" Filter

Now that our scores are stable, we need to convert them into probabilities that we can use as weights. This is where the Softmax function comes in.

Softmax is a mathematical function that takes a list of numbers (which can be positive, negative, or zero) and turns them into a list of probabilities that sum up to exactly 1.0 (or 100%).

Why is this useful?

  1. Normalization: It gives us a clear "attention budget" for each word. The total attention a word pays to the entire sentence must always be 100%.
  2. Amplification: It highlights the highest score and suppresses the lower ones. As seen in the image, the highest scaled score of 1.9 gets a massive 65% of the attention, while the negative scores get almost none.

Visualized softmax calculation

"Softmax looks at each word individually (each row). It takes the 100% attention budget for that word and distributes it across the sentence."

Let's visualize the softmax of the dot product (Q · K^T) divided by the scaling factor √d_k.

Visualized step softmax output

Step 5: The Transfer of Meaning (Weighted Sum)

Finally, we use the attention weights (probabilities) from Step 4 to create a weighted sum of the Value (V) matrix. This is the step where the actual context is transferred.

For example, the new vector for "dog" is calculated by taking 80% of the "bit" Value vector, 5% of the "dog" Value vector, and so on. The result is a new matrix where each word's vector has been updated with information from the words it "paid attention" to.

Visualized weighted sum output
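The transfer of meaning is a single matrix-vector product. The 80%-on-"bit" weights below are illustrative, matching the example above:

```python
import numpy as np

# Made-up attention weights for "dog" (one softmax row) over the 5 words.
weights = np.array([0.05, 0.05, 0.80, 0.05, 0.05])   # 80% on "bit"

rng = np.random.default_rng(7)
V = rng.standard_normal((5, 8))   # one toy Value row per word

new_dog = weights @ V             # weighted sum of all Value vectors
print(new_dog.shape)              # "dog" is now a context-aware vector
```

The new "dog" vector is mostly "bit" content, which is exactly the "absorbing the action" effect from the images.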

NOTE: this is just the report from 1 of 8 specialists (heads). In the next part, we'll see how the results from all 8 specialists are combined to form the final Multi-Head Attention output.

One Specialist = Self-Attention


Summary:

The Attention Interface: We have successfully turned our raw input into a contextual masterpiece. Q, K, and V gave us the tools for the search.

  • Q · K^T found the relationships.
  • Scaling & Softmax stabilized the math and gave us clear percentages.
  • Value (V) provided the cargo that updated our word meanings.
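All five steps fit in one small function. This is a minimal single-head sketch with toy dimensions and random untrained weights, not a production implementation:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """One head: Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # Step 1: three lenses
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # Steps 2-3: scores, scaled
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)  # Step 4: row-wise softmax
    return weights @ V                           # Step 5: weighted sum of Values

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                  # "The dog bit the man", toy dims
W_Q, W_K, W_V = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)                                 # one context-aware row per word
```

The output has the same shape as the input, which is what lets the Encoder stack these blocks on top of each other.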

What’s Next?:

We’ve seen how a single "specialist" (one Self-Attention head) handles a 64-dimensional chunk of our data.
But our Encoder is a powerhouse that runs 8 of these specialists at the exact same time.

In Part 4, we will dive into Multi-Head Attention to move deeper into the Transformer tower.
