Self-Attention is not just “looking at important words.”
It is a matrix operation.
And that is exactly why Transformers scale.
Core Idea
Self-Attention lets each token compare itself with every other token in the same sequence.
Each token asks:
Which other tokens are useful for updating my representation?
This matters because meaning is contextual.
A token should not stay as a static embedding.
It should become a representation shaped by the sentence around it.
The Key Structure
Self-Attention follows this pipeline:
Input Embeddings
→ Query, Key, Value Projection
→ Similarity Scores
→ Scaling
→ Softmax Weights
→ Weighted Sum of Values
→ Contextual Token Output
More compactly:
Self-Attention = matching + weighting + information mixing
The full formula is:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
This equation looks dense.
But the idea is simple:
Compare tokens.
Convert scores into weights.
Use weights to mix information.
Pseudo-code View
At a high level, Self-Attention works like this:
X = token_embeddings
Q = X @ W_Q
K = X @ W_K
V = X @ W_V
scores = Q @ K.T
scaled_scores = scores / sqrt(d_k)
weights = softmax(scaled_scores)  # applied to each row, so every token's weights sum to 1
output = weights @ V
That is the core computation.
In real Transformer implementations, this is done for all tokens at once.
Not token by token.
That is why the matrix form matters.
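The pseudo-code above can be turned into a small runnable sketch. This is a minimal NumPy version, not a production implementation: the toy sizes, the random weights standing in for learned matrices, and the helper names `softmax` and `self_attention` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    d_k = K.shape[-1]
    scaled_scores = Q @ K.T / np.sqrt(d_k)       # (n, n) pairwise scores
    weights = softmax(scaled_scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 3, 8, 4                        # hypothetical toy sizes
X = rng.normal(size=(n, d_model))                # stand-in token embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_Q, W_K, W_V)
print(output.shape)    # (3, 4)
print(weights.shape)   # (3, 3)
```

All three tokens are processed in the same call. There is no loop over tokens.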
Concrete Example
Take this sentence:
I love you
When updating the token “love”, Self-Attention compares it with:
I
love
you
The token “love” may strongly attend to “I” and “you”.
So its representation becomes more contextual.
It no longer means only the word “love.”
It becomes something closer to:
love as an action between I and you
That is why Self-Attention is powerful.
It turns isolated token vectors into relationship-aware vectors.
QKV Intuition
Each token is projected into three roles:
Query, Key, and Value.
Query:
What am I looking for?
Key:
What do I contain that others can match against?
Value:
What information do I pass forward if selected?
Search analogy:
Query = search request
Key = searchable index
Value = retrieved content
This separation is important.
The model can learn different spaces for matching and information transfer.
Step 1: Generate Q, K, and V
Given input embeddings X:
Q = XW_Q
K = XW_K
V = XW_V
W_Q, W_K, and W_V are learned matrices.
They are trained with the model.
This means QKV is not manually designed.
The model learns how to project tokens into attention roles.
Implementation-wise, this is just matrix multiplication.
Conceptually, it creates three different views of the same token.
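A quick shape check makes the three views concrete. This sketch assumes 3 tokens, an embedding size of 8, and a projection size of 4. The random matrices only stand in for weights that a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 3, 8, 4                 # hypothetical toy sizes
X = rng.normal(size=(n, d_model))         # one row per token

# Random stand-ins for the learned projection matrices.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)          # (3, 4) (3, 4) (3, 4)
```

Row i of Q, K, and V all describe the same token i, seen through three different projections.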
Step 2: Compute Attention Scores
The model compares Query and Key vectors.
For one token:
score = q · k
A larger dot product means stronger similarity.
Example:
q₁ · k₁ = 112
q₁ · k₂ = 96
The first key matches more strongly.
But these are still raw scores.
They are not probabilities yet.
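The point about raw scores can be checked with a tiny sketch. The query and key vectors below are made-up numbers chosen for easy arithmetic, not values from any model.

```python
import numpy as np

q = np.array([1.0, 2.0, 0.5])             # hypothetical query vector
keys = np.array([[1.0, 1.5, 0.0],         # hypothetical key 1
                 [-1.0, -0.5, 2.0]])      # hypothetical key 2

raw_scores = keys @ q                     # one dot product per key
print(raw_scores)                         # raw scores 4.0 and -1.0
print(raw_scores.sum())                   # 3.0, not 1
```

Raw scores can be negative and do not sum to 1, so they cannot be used as weights directly.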
Step 3: Scale and Apply Softmax
Dot products tend to grow in magnitude as the key dimension dₖ grows.
Large scores push Softmax toward a near one-hot output.
In that saturated range, gradients become very small, which makes training slow or unstable.
So Self-Attention scales the scores:
score = (q · k) / √dₖ
Then Softmax converts scores into weights.
Example, assuming dₖ = 64, so the raw scores 112 and 96 from Step 2 scale to 14 and 12:
scores = [14, 12]
softmax(scores) ≈ [0.88, 0.12]
Now the model has attention weights.
These weights say how much each token should contribute.
This matters in practice.
Without scaling, attention can collapse too aggressively onto one token.
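The numbers above can be reproduced directly. This sketch assumes dₖ = 64, which is the value that maps the raw scores 112 and 96 to 14 and 12. The `softmax` helper is a plain NumPy stand-in, not a library function.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

d_k = 64                                  # assumed: sqrt(64) = 8
raw = np.array([112.0, 96.0])
scaled = raw / np.sqrt(d_k)               # 14 and 12

print(np.round(softmax(scaled), 2))       # [0.88 0.12]
print(softmax(raw))                       # first weight ~1.0, second ~1e-7
```

With scaling, the second token still contributes about 12%. Without it, nearly all the weight lands on the first token.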
Step 4: Weighted Sum of Values
The final output is a weighted sum of Value vectors.
z = Σ αᵢvᵢ
Example:
values = [10, 20]
weights = [0.88, 0.12]
output = 0.88 × 10 + 0.12 × 20 = 11.2
The first value contributes more.
The second value contributes less.
That is the basic meaning of attention output.
It is not a simple average.
It is selective information mixing.
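The same arithmetic in code, next to the plain average for comparison. The second half uses two made-up 2-dimensional Value vectors, since in a real model each Value is a vector rather than a single number.

```python
import numpy as np

values = np.array([10.0, 20.0])
weights = np.array([0.88, 0.12])

output = weights @ values                 # 0.88 * 10 + 0.12 * 20
print(round(float(output), 2))            # 11.2
print(values.mean())                      # 15.0, the unweighted average

# Hypothetical vector-valued case: one row per Value vector.
V = np.array([[1.0, 0.0],
              [0.0, 1.0]])
print(weights @ V)                        # [0.88 0.12]
```

The weighted result sits closer to the first value because the first weight is larger.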
Self-Attention vs Cross-Attention
Self-Attention:
- Query, Key, and Value come from the same sequence
- models relationships inside one sequence
- used in Transformer encoders and decoders
Cross-Attention:
- Query comes from the decoder
- Key and Value come from the encoder
- models relationships between two sequences
- used in encoder-decoder models
In short:
Self-Attention = inside the same sequence
Cross-Attention = between different sequences
This difference matters when reading Transformer code.
If Q, K, and V come from the same tensor, it is Self-Attention.
If Q comes from one tensor and K/V come from another, it is Cross-Attention.
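In code, the difference is only which tensors are passed in. The sketch below uses a hypothetical `attention` helper and random arrays as stand-ins for encoder and decoder states. For brevity it reuses one set of weights for both calls and leaves out the causal mask that decoder Self-Attention normally applies. Real models use separate learned weights per attention layer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x_q, x_kv, W_Q, W_K, W_V):
    # Queries come from x_q; keys and values come from x_kv.
    Q, K, V = x_q @ W_Q, x_kv @ W_K, x_kv @ W_V
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                               # hypothetical toy sizes
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
encoder_states = rng.normal(size=(5, d_model))    # 5 source tokens
decoder_states = rng.normal(size=(3, d_model))    # 3 target tokens

_, self_w = attention(decoder_states, decoder_states, W_Q, W_K, W_V)
_, cross_w = attention(decoder_states, encoder_states, W_Q, W_K, W_V)
print(self_w.shape)    # (3, 3): same sequence on both sides
print(cross_w.shape)   # (3, 5): decoder queries over encoder keys
```

A square weight matrix over one sequence signals Self-Attention. A rectangular one over two sequences signals Cross-Attention.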
Naive vs Matrix View
Naive view:
Each token compares with every other token one by one.
Matrix view:
All token relationships are computed at once.
Naive logic:
for token_i in tokens:
    for token_j in tokens:
        compute_similarity(token_i, token_j)
Matrix logic:
scores = Q @ K.T
That single matrix multiplication computes all pairwise token scores.
This is why Transformers are GPU-friendly.
They replace sequential loops with dense linear algebra.
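The two views give the same numbers. This sketch fills the score matrix with an explicit double loop and compares it with the single matrix product, using random toy Q and K as assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 4, 8                             # hypothetical toy sizes
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))

# Naive view: one dot product per (i, j) pair.
naive = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        naive[i, j] = np.dot(Q[i], K[j])

# Matrix view: all pairs in one call.
matrix = Q @ K.T

print(np.allclose(naive, matrix))         # True
```

The loop does n² separate dot products. The matrix form hands the same work to one optimized routine.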
Why Matrix Computation Matters
The attention matrix contains token-to-token relationships.
If the sequence length is n, the score matrix is n × n.
Each row means:
How much one token attends to every token.
Each column means:
How much that token is attended to by others.
This structure is powerful.
But it also creates a cost problem.
The compute and memory cost of full Self-Attention grows as O(n²) in the sequence length n.
Longer context means more computation and memory.
So the same design that makes attention expressive also makes it expensive.
That is why efficient attention methods exist.
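A rough back-of-the-envelope sketch shows how fast the n × n score matrix grows. It assumes 4-byte float32 entries and counts a single matrix for one head in one layer. The helper name `score_matrix_bytes` is made up for this example.

```python
def score_matrix_bytes(n, bytes_per_element=4):
    # One n x n score matrix, for a single head in a single layer.
    return n * n * bytes_per_element

for n in (1_024, 8_192, 65_536):
    print(n, score_matrix_bytes(n) / 2**20, "MiB")
# 1024 4.0 MiB
# 8192 256.0 MiB
# 65536 16384.0 MiB
```

Making the sequence 8 times longer makes the matrix 64 times larger.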
Important Conditions and Limits
Self-Attention needs positional information.
By itself, attention compares token content.
It does not automatically know token order.
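This can be checked directly. In the sketch below, with random toy inputs and no positional information added, shuffling the tokens only shuffles the output rows in the same way. The helper names are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))               # 3 toy tokens, no position signal
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))

perm = [2, 0, 1]                          # reorder the tokens
out = self_attention(X, W_Q, W_K, W_V)
out_perm = self_attention(X[perm], W_Q, W_K, W_V)

# Each token gets the same output vector wherever it sits in the sequence.
print(np.allclose(out[perm], out_perm))   # True
```

Without a position signal, the model treats any reordering of the same tokens as the same set.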
Self-Attention also gets expensive as sequence length grows.
For short and medium sequences, full attention is powerful.
For very long sequences, memory and compute become major constraints.
Another important point:
Attention weights are not always faithful explanations.
They show how information is mixed.
But they should not be read as a trace of human-like reasoning.
Implementation Perspective
In real models, QKV projection is often implemented as one combined linear layer.
Instead of computing three separate matrix multiplications:
Q = XW_Q
K = XW_K
V = XW_V
Implementations often compute:
QKV = XW_QKV
Then split the result into Q, K, and V.
This is usually faster, because one large matrix multiplication replaces three smaller ones.
It also keeps the code compact.
The math stays the same.
The implementation is optimized.
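A small sketch confirms that the combined projection gives the same Q, K, and V. It assumes the combined matrix is built by placing W_Q, W_K, and W_V side by side. Real libraries may order or interleave the columns differently, so this layout is only an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 3, 8, 4                 # hypothetical toy sizes
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

# Place the three projections side by side: shape (d_model, 3 * d_k).
W_QKV = np.concatenate([W_Q, W_K, W_V], axis=1)

QKV = X @ W_QKV                           # one matrix multiplication
Q, K, V = np.split(QKV, 3, axis=1)        # three (n, d_k) blocks

print(np.allclose(Q, X @ W_Q),
      np.allclose(K, X @ W_K),
      np.allclose(V, X @ W_V))            # True True True
```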
That is the developer mindset:
Understand the formula.
Then recognize the optimized tensor layout in code.
Takeaway
Self-Attention is the core operation behind Transformers.
It works by projecting tokens into Q, K, and V.
Q and K compute relevance.
Softmax turns relevance into weights.
Weights mix V into contextual outputs.
The shortest version is:
Self-Attention = compare tokens → weight information → update representations
If you understand QKᵀ and weighted Values, you understand the heart of Transformer computation.
Discussion
When reading Transformer code, which part feels most confusing?
QKV projection, Softmax attention weights, or the final matrix multiplication with V?
Originally published at zeromathai.com.
Original article: https://zeromathai.com/en/self-attention-qkv-matrix-en/
GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai