zeromathai

Posted on Jun 18 • Originally published at zeromathai.com

How Self-Attention Works — QKV, Softmax, and Matrix Computation

#ai #machinelearning #nlp #transformers

Self-Attention is not just “looking at important words.”

It is a matrix operation.

And that is exactly why Transformers scale.

Core Idea

Self-Attention lets each token compare itself with every token in the same sequence, including itself.

Each token asks:

Which other tokens are useful for updating my representation?

This matters because meaning is contextual.

A token should not stay as a static embedding.

It should become a representation shaped by the sentence around it.

The Key Structure

Self-Attention follows this pipeline:

Input Embeddings

→ Query, Key, Value Projection

→ Similarity Scores

→ Scaling

→ Softmax Weights

→ Weighted Sum of Values

→ Contextual Token Output

More compactly:

Self-Attention = matching + weighting + information mixing

The full formula is:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

This equation looks dense.

But the idea is simple:

Compare tokens.

Convert scores into weights.

Use weights to mix information.

Pseudo-code View

At a high level, Self-Attention works like this:

X = token_embeddings

Q = X @ W_Q
K = X @ W_K
V = X @ W_V

scores = Q @ K.T

scaled_scores = scores / sqrt(d_k)

weights = softmax(scaled_scores)

output = weights @ V

That is the core computation.

In real Transformer implementations, this is done for all tokens at once.

Not token by token.

That is why the matrix form matters.
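
The pseudo-code above can be made runnable with NumPy. This is a minimal single-head sketch, not a production implementation. The function name `self_attention`, the toy sizes, and the random weights (stand-ins for learned ones) are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # X: (n_tokens, d_model); each W_*: (d_model, d_k)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # output: (n_tokens, d_k)

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 3, 8, 4         # toy sizes, chosen arbitrarily
X = rng.normal(size=(n_tokens, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_Q, W_K, W_V)
print(output.shape, weights.shape)       # (3, 4) (3, 3)
```

All three tokens are processed in the same call. There is no per-token loop anywhere in the function.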

Concrete Example

Take this sentence:

I love you

When updating the token “love”, Self-Attention compares it with:

I

love

you

The token “love” may strongly attend to “I” and “you”.

So its representation becomes more contextual.

It no longer means only the word “love.”

It becomes something closer to:

love as an action between I and you

That is why Self-Attention is powerful.

It turns isolated token vectors into relationship-aware vectors.

QKV Intuition

Each token is projected into three roles:

Query, Key, and Value.

Query:

What am I looking for?

Key:

What do I contain that others can match against?

Value:

What information do I pass forward if selected?

Search analogy:

Query = search request

Key = searchable index

Value = retrieved content

This separation is important.

The model can learn different spaces for matching and information transfer.
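
One way to make the search analogy concrete is to compare a hard lookup with a soft one. The keys, values, and query below are made up, and scaling is left out to keep the sketch short. A hard lookup returns one stored value; attention returns a blend of all values, weighted by how well each key matches the query.

```python
import numpy as np

# Made-up keys (one per row) and the values stored under them.
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.5, 0.5]])
values = np.array([10.0, 20.0, 30.0])
query = np.array([1.0, 0.0])

# Match strength: dot product of the query with every key.
match = keys @ query

# Hard lookup: return only the value of the best-matching key.
hard = values[np.argmax(match)]

# Soft lookup (attention-style): blend all values by match strength.
weights = np.exp(match) / np.exp(match).sum()
soft = weights @ values

print(hard, soft)
```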

Step 1: Generate Q, K, and V

Given input embeddings X:

Q = XW_Q

K = XW_K

V = XW_V

W_Q, W_K, and W_V are learned matrices.

They are trained with the model.

This means QKV is not manually designed.

The model learns how to project tokens into attention roles.

Implementation-wise, this is just matrix multiplication.

Conceptually, it creates three different views of the same token.
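
A quick shape check can help here. The sketch below uses random matrices as stand-ins for the learned weights, with arbitrary toy sizes. It also shows one detail worth knowing: Q and K must share the same width (d_k) so their dot product is defined, while V may use a different width (called `d_v` here).

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 3, 6      # toy sizes, chosen arbitrarily
d_k, d_v = 4, 5               # Q and K must share d_k; V may differ

X = rng.normal(size=(n_tokens, d_model))   # stand-in for input embeddings

# Random stand-ins for the learned projection matrices.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

# Three different views of the same tokens.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)           # (3, 4) (3, 4) (3, 5)
```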

Step 2: Compute Attention Scores

The model compares Query and Key vectors.

For one query-key pair:

score = q · k

A larger dot product means a stronger match between the query and that key.

Example:

q₁ · k₁ = 112

q₁ · k₂ = 96

The first key matches more strongly.

But these are still raw scores.

They are not probabilities yet.
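
As a small illustration with made-up vectors (not the ones behind the numbers above), the sketch below computes one query-key score, then the scores of the same query against every key in a single product.

```python
import numpy as np

q1 = np.array([1.0, 2.0, 3.0])     # made-up query vector
K = np.array([[2.0, 1.0, 3.0],     # k1
              [0.0, 1.0, 1.0]])    # k2

score_11 = q1 @ K[0]               # single pair: q1 · k1
row_scores = K @ q1                # q1 against every key at once

print(score_11, row_scores)        # 13.0 [13.  5.]
```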

Step 3: Scale and Apply Softmax

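Dot products tend to grow in magnitude as the vector dimension dₖ grows.

Large scores make Softmax too sharp, putting almost all weight on one token.

In that saturated range, gradients become very small, which makes training harder.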

So Self-Attention scales the scores:

score = (q · k) / √dₖ

Then Softmax converts scores into weights.

Example, using the raw scores from Step 2 and assuming dₖ = 64 (so √dₖ = 8):

scores = [112 / 8, 96 / 8] = [14, 12]

softmax(scores) ≈ [0.88, 0.12]

Now the model has attention weights.

These weights say how much each token should contribute.

This matters in practice.

Without scaling, attention can collapse too aggressively onto one token.
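
The effect of scaling can be checked with the standard library alone. This sketch takes the raw scores from Step 2 and assumes dₖ = 64, which is the value that turns 112 and 96 into 14 and 12. The `softmax` helper is written here for illustration.

```python
import math

def softmax(scores):
    # Subtract the max first so exp() cannot overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw = [112.0, 96.0]        # raw dot products from Step 2
d_k = 64                   # assumed key dimension, so sqrt(d_k) = 8

scaled = [s / math.sqrt(d_k) for s in raw]   # [14.0, 12.0]

print(softmax(scaled))     # roughly [0.88, 0.12]
print(softmax(raw))        # nearly all weight on the first token
```

With scaling, the second token still contributes about 12%. Without it, the second token's weight drops to roughly one part in ten million.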

Step 4: Weighted Sum of Values

The final output is a weighted sum of Value vectors.

z = Σ αᵢvᵢ

Example:

values = [10, 20]

weights = [0.88, 0.12]

output = 0.88 × 10 + 0.12 × 20 = 11.2

The first value contributes more.

The second value contributes less.

That is the basic meaning of attention output.

It is not a simple average.

It is selective information mixing.
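
The arithmetic above takes only a few lines to verify. The sketch also computes the plain average of the same values for comparison, to show what "not a simple average" means in numbers.

```python
values = [10.0, 20.0]
weights = [0.88, 0.12]

# Attention output: each value scaled by its weight, then summed.
weighted = sum(w * v for w, v in zip(weights, values))

# Plain average: every value counts equally.
average = sum(values) / len(values)

print(weighted)   # about 11.2
print(average)    # 15.0
```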

Self-Attention vs Cross-Attention

Self-Attention:

Query, Key, and Value come from the same sequence
models relationships inside one sequence
used in Transformer encoders and decoders

Cross-Attention:

Query comes from the decoder
Key and Value come from the encoder
models relationships between two sequences
used in encoder-decoder models

In short:

Self-Attention = inside the same sequence

Cross-Attention = between different sequences

This difference matters when reading Transformer code.

If Q, K, and V come from the same tensor, it is Self-Attention.

If Q comes from one tensor and K/V come from another, it is Cross-Attention.
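
In code, the difference can be as small as which tensors are passed in. The sketch below is an assumption-laden illustration: the `attention` function and the variable names are hypothetical, and the same random weights are reused for both calls only to keep it short. Real models use separate learned weights for each attention layer.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x_q, x_kv, W_Q, W_K, W_V):
    # Queries come from x_q; keys and values come from x_kv.
    Q, K, V = x_q @ W_Q, x_kv @ W_K, x_kv @ W_V
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 4
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
decoder_states = rng.normal(size=(2, d))   # 2 target-side tokens
encoder_states = rng.normal(size=(5, d))   # 5 source-side tokens

# Self-Attention: the same tensor supplies Q, K, and V.
_, self_w = attention(decoder_states, decoder_states, W_Q, W_K, W_V)

# Cross-Attention: Q from one tensor, K and V from another.
_, cross_w = attention(decoder_states, encoder_states, W_Q, W_K, W_V)

print(self_w.shape, cross_w.shape)         # (2, 2) (2, 5)
```

The weight matrix is square for Self-Attention and rectangular for Cross-Attention, because the two sequences can have different lengths.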

Naive vs Matrix View

Naive view:

Each token compares with every other token one by one.

Matrix view:

All token relationships are computed at once.

Naive logic:

for token_i in tokens:
    for token_j in tokens:
        compute_similarity(token_i, token_j)

Matrix logic:

scores = Q @ K.T

That single matrix multiplication computes all pairwise token scores.

This is why Transformers are GPU-friendly.

They replace sequential loops with dense linear algebra.
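
The two views give the same numbers. The sketch below checks that with random Q and K of arbitrary toy size.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))

# Naive: one dot product per (i, j) pair.
naive = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        naive[i, j] = np.dot(Q[i], K[j])

# Matrix: all pairs in one multiplication.
matrix = Q @ K.T

print(np.allclose(naive, matrix))   # True
```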

Why Matrix Computation Matters

The attention matrix contains token-to-token relationships.

If the sequence length is n, the score matrix is n × n.

Each row means:

How much one token attends to every token.

Each column means:

How much that token is attended to by others.

This structure is powerful.

But it also creates a cost problem.

Full Self-Attention grows as O(n²) in the sequence length n, in both compute and memory for the score matrix.

Longer context means more computation and memory.

So the same design that makes attention expressive also makes it expensive.

That is why efficient attention methods exist.
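
A back-of-the-envelope estimate shows how fast the n × n matrix grows. The helper below is hypothetical and assumes 4 bytes per entry (float32), one score matrix, and a naive implementation that stores the whole matrix. Real models multiply this by heads, layers, and batch size, and efficient implementations may avoid storing it at all.

```python
def score_matrix_bytes(n_tokens, bytes_per_value=4):
    # One n x n score matrix, assuming 4 bytes per float32 entry.
    return n_tokens * n_tokens * bytes_per_value

for n in (1_024, 4_096, 16_384):
    mib = score_matrix_bytes(n) / 2**20
    print(f"n={n:>6}: {mib:,.0f} MiB per score matrix")
```

Making the sequence 4 times longer makes the matrix 16 times larger.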

Important Conditions and Limits

Self-Attention needs positional information.

By itself, attention compares token content.

It does not automatically know token order.
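
This can be checked directly. In the sketch below (random stand-in weights, no positional information added), shuffling the input tokens just shuffles the output rows the same way. Each token gets the same output vector no matter where it sits in the sequence.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                # 3 tokens, no positional signal
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))

perm = [2, 0, 1]                           # reorder the tokens
out = self_attention(X, W_Q, W_K, W_V)
out_perm = self_attention(X[perm], W_Q, W_K, W_V)

# Reordering the inputs only reorders the outputs.
print(np.allclose(out[perm], out_perm))    # True
```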

Self-Attention also gets expensive as sequence length grows.

For short and medium sequences, full attention is powerful.

For very long sequences, memory and compute become major constraints.

Another important point:

Attention weights are not always faithful explanations of model behavior.

They show how information is mixed.

But they should not always be treated as human-level reasoning traces.

Implementation Perspective

In real models, QKV projection is often implemented as one combined linear layer.

Instead of computing three separate matrix multiplications:

Q = XW_Q

K = XW_K

V = XW_V

Implementations often compute:

QKV = XW_QKV

Then split the result into Q, K, and V.

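This is usually faster, because one larger matrix multiplication tends to use the hardware better than three smaller ones, and it keeps the code compact.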

The math stays the same.

The implementation is optimized.

That is the developer mindset:

Understand the formula.

Then recognize the optimized tensor layout in code.
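
The equivalence is easy to confirm. The sketch below builds a combined weight matrix by placing W_Q, W_K, and W_V side by side, which is one possible layout. Actual libraries may order or interleave the blocks differently, so treat the layout here as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 3, 8, 4           # toy sizes, chosen arbitrarily
X = rng.normal(size=(n_tokens, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

# Stack the three weight matrices side by side: (d_model, 3 * d_k).
W_QKV = np.concatenate([W_Q, W_K, W_V], axis=1)

QKV = X @ W_QKV                            # one matmul instead of three
Q, K, V = np.split(QKV, 3, axis=1)         # three (n_tokens, d_k) blocks

# Same results as the three separate projections.
print(np.allclose(Q, X @ W_Q),
      np.allclose(K, X @ W_K),
      np.allclose(V, X @ W_V))             # True True True
```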

Takeaway

Self-Attention is the core operation behind Transformers.

It works by projecting tokens into Q, K, and V.

Q and K compute relevance.

Softmax turns relevance into weights.

Weights mix V into contextual outputs.

The shortest version is:

Self-Attention = compare tokens → weight information → update representations

If you understand QKᵀ and weighted Values, you understand the heart of Transformer computation.

Discussion

When reading Transformer code, which part feels most confusing?

QKV projection, Softmax attention weights, or the final matrix multiplication with V?

Original article: https://zeromathai.com/en/self-attention-qkv-matrix-en/

GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai
