Jyoti Prajapati
Transformer Series - Blog #4 How the word "Bank" knows what it means: Self-Attention explained intuitively

Welcome back to the Transformer Series!

In Blog #3, we gave our Transformer a sense of "Time" using Positional Encoding. It now knows the order of words. But even with order, words are still fundamentally lonely.

If I just say the word "Bank," what comes to mind?

A place to store your money? 🏦

The edge of a river? 🛶

As a human, you don't even think about this ambiguity. You instantly look at the surrounding words to decide:

"I went to the bank to deposit a check." (Money context) "I sat on the bank and watched the river flow." (Nature context)

Before Transformers, older models struggled with this. They often treated the word "Bank" the same way regardless of the sentence.

The Transformer solves this with Self-Attention. It’s the mechanism that allows the word "Bank" to "look around" at the other words in the sentence and update its own meaning based on who it is hanging out with.

The Context Crisis: Why we needed a revolution

In the early days of NLP, we had a "memory" problem.

Models like RNNs (Recurrent Neural Networks) and LSTMs were like readers who had a very short attention span. They read from left to right, one word at a time. By the time they reached the end of a long sentence, the "state" of the first few words had started to fade.

Imagine reading this:

"The cat, which was chased by the neighbor's massive, loud, and energetic dog that always escapes its yard, ran up the tree."

An RNN might struggle to remember that "ran" refers to the "cat" and not the "dog" because so much "noise" happened in between.

Self-Attention changed everything. It said: "Stop reading in order. Let every word look at every other word, all at once."

The Deep Dive: How the "Search Engine" Math Works
To make this happen, we don't just use one vector per word. We use three.
For every input word, we generate a Query (Q), a Key (K), and a Value (V).

These aren't magic; they are created by multiplying our input embedding by three weight matrices ($W^Q, W^K, W^V$) that the model learns during training.
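To make the shapes concrete, here is a minimal sketch of that projection step in PyTorch. The sizes (d_model, d_k) and the bias-free nn.Linear layers are illustrative choices, not the exact configuration from the paper:

import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration
d_model = 8   # size of the input word embeddings
d_k = 8       # size of the Q/K/V vectors

# The three learned weight matrices W^Q, W^K, W^V, expressed as linear layers
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)

# A toy "sentence" of 5 word embeddings: [batch=1, seq_len=5, d_model]
x = torch.randn(1, 5, d_model)

q = W_q(x)  # Queries: what each word is looking for
k = W_k(x)  # Keys:    what each word advertises about itself
v = W_v(x)  # Values:  the information each word actually carries

print(q.shape, k.shape, v.shape)  # each: torch.Size([1, 5, 8])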

1. The Query (Q): The "Ask"
Think of this as the word's "Personal Interest Profile." The word "Bank" sends out a Query: "I am looking for anything related to finance or geography."

2. The Key (K): The "Label"
Every word in the sentence has a Key. It’s like a metadata tag. The word "Deposit" has a Key that says: "I am a financial action."

3. The Score: The "Compatibility Test"
We take the Query of "Bank" and the Key of "Deposit" and do a Dot Product.

  • If they are highly related, the number is huge.
  • If they are unrelated (like "Bank" and "The"), the number is near zero.
Score = Q.K^T
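To see the "compatibility test" in action, here is a tiny sketch with made-up 4-dimensional vectors. The numbers are invented purely for intuition: a related pair scores high, an unrelated pair scores near zero.

import torch

# Toy 4-dimensional query and keys (made-up numbers, for intuition only)
q_bank    = torch.tensor([0.9, 0.1, 0.8, 0.0])   # "Bank" asking about finance
k_deposit = torch.tensor([1.0, 0.0, 0.9, 0.1])   # "Deposit": a financial action
k_the     = torch.tensor([0.0, 0.1, 0.0, 0.2])   # "The": carries little meaning

print(torch.dot(q_bank, k_deposit))  # tensor(1.6200) -> strongly related
print(torch.dot(q_bank, k_the))      # tensor(0.0100) -> mostly ignored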

4. The Softmax: The "Attention Filter"
We take all those scores and pass them through a Softmax function. This turns the scores into probabilities that add up to 100%.

  • Deposit: 0.85
  • The: 0.02
  • Check: 0.13

This tells the model: "When processing 'Bank', spend 85% of your energy looking at 'Deposit'."
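Here is a quick sketch of that filtering step. The raw scores are made-up values, chosen so the softmax output roughly reproduces the percentages above:

import torch
import torch.nn.functional as F

# Raw compatibility scores for "Bank" against each word (illustrative values)
words  = ["Deposit", "The", "Check"]
scores = torch.tensor([3.0, -0.75, 1.12])

weights = F.softmax(scores, dim=-1)
print(dict(zip(words, weights.tolist())))
# Roughly {'Deposit': ~0.85, 'The': ~0.02, 'Check': ~0.13} -- and they sum to 1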

5. The Value (V): The "Payload"
Finally, we multiply our percentages by the Value vectors. The Value is the actual information the word carries. We sum them up, and BAM — we have a new, "Context-Aware" version of the word "Bank."
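A minimal sketch of that final weighted sum, reusing the attention weights from the softmax step with some made-up Value vectors:

import torch

# Attention weights for "Bank" (from the softmax step above)
weights = torch.tensor([0.85, 0.02, 0.13])

# Toy Value vectors for "Deposit", "The", "Check" (illustrative numbers)
values = torch.tensor([
    [1.0, 0.0, 0.5],   # Deposit
    [0.0, 0.1, 0.0],   # The
    [0.8, 0.0, 0.3],   # Check
])

# New, context-aware representation of "Bank":
# 0.85 * V(Deposit) + 0.02 * V(The) + 0.13 * V(Check)
bank_context = weights @ values
print(bank_context)  # tensor([0.9540, 0.0020, 0.4640])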

Below is one more example for Q, K, and V:

(Figure: Q, K, V concept)

The "Scaling" Secret: Why we divide by sqrt(d_k) You’ll often see this in the official formula:

Attention (Q, K, V) = softmax((QK^T)\sqrt(d_k))V

Technical "Senior" Tip: Why the division?
As the dimensionality (d_k) of our vectors grows, the dot product QK^T can grow very large in magnitude. When these values are huge, the softmax function gets pushed into regions where the gradient is extremely small (the "vanishing gradient" problem). By scaling down by sqrt(d_k), we keep the math stable and ensure the model can actually learn.
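You can see the effect directly. The toy scores below are made up, but they show how large unscaled dot products push softmax toward a near one-hot distribution, while dividing by sqrt(d_k) keeps the weights spread out:

import torch
import torch.nn.functional as F

d_k = 64
# With high-dimensional vectors, raw dot products tend to be large in magnitude
raw_scores = torch.tensor([24.0, 16.0, 8.0])

print(F.softmax(raw_scores, dim=-1))
# ~[0.9997, 0.0003, 0.0000] -> nearly one-hot, so gradients almost vanish

print(F.softmax(raw_scores / torch.sqrt(torch.tensor(float(d_k))), dim=-1))
# ~[0.6652, 0.2447, 0.0900] -> still peaked, but every word keeps some weight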

The Code: A Raw Implementation
As a developer, you don't really understand it until you see the shapes. Here is how you would write this in PyTorch (without the abstractions):

import torch
import torch.nn.functional as F

def basic_self_attention(q, k, v):
    # q, k, v are of shape [batch_size, seq_len, d_k]
    d_k = q.size(-1)

    # 1. Compute scores (Matrix multiplication)
    # [batch, seq_len, seq_len]
    scores = torch.matmul(q, k.transpose(-2, -1)) 

    # 2. Scale
    scores = scores / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))

    # 3. Apply Softmax to get weights
    weights = F.softmax(scores, dim=-1)

    # 4. Multiply by Values
    # [batch, seq_len, d_k]
    output = torch.matmul(weights, v)

    return output, weights
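A quick sanity check you can run on the function above, using random tensors (the shapes are arbitrary). In a real layer, q, k, and v would come from three different projections of the same input, but passing the same tensor is enough to verify the shapes and that each row of attention weights sums to 1:

q = k = v = torch.randn(2, 5, 16)   # batch=2, seq_len=5, d_k=16

output, weights = basic_self_attention(q, k, v)

print(output.shape)          # torch.Size([2, 5, 16]) -- one context-aware vector per word
print(weights.shape)         # torch.Size([2, 5, 5])  -- each word's attention over every word
print(weights[0, 0].sum())   # ~1.0 -- every softmax row sums to 1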

Why This Architecture Won

1. Total Parallelization: Unlike RNNs, which have to wait for word 1 to finish before starting word 2, Self-Attention calculates the entire sentence's relationships simultaneously. This is why we can train massive models like GPT-4 on thousands of GPUs.

2. Global Reach: A word at the very beginning of a 1,000-word document can "attend" to a word at the very end in a single calculation. No memory loss.

Summary
Self-Attention is the "Brain" of the Transformer. It's the mechanism that turns a sequence of isolated words into a cohesive, contextual map of meaning.

  • Queries ask the questions.
  • Keys provide the labels that queries match against.
  • Values provide the content.

Next up in Blog #5: What's better than one "Attention" spotlight? Eight of them. We’re diving into Multi-Head Attention and why "splitting your focus" is the secret to high performance.

Happy Reading

Found this deep dive useful? Bookmark it for your next interview, and let me know in the comments: What was the specific 'aha' moment for you when learning about Q, K, and V?
