Self-Attention is powerful.
But on its own, it has three gaps.
It sees context through a single view, it has no sense of word order, and it is hard to train once many layers are stacked.
That is why Multi-Head Attention, Positional Encoding, and Add & Norm exist.
Core Idea
A Transformer block is not just attention.
Attention computes token relationships.
Multi-Head Attention makes those relationships richer.
Positional Encoding tells the model where tokens are.
Add & Norm keeps deep Transformer blocks trainable.
This matters because modern LLMs are deep.
Without these support structures, attention alone is not enough.
The Key Structure
A simplified Transformer encoder block looks like this:
Input
→ Positional Information
→ Multi-Head Attention
→ Add & Norm
→ Feed-Forward Network
→ Add & Norm
→ Output
More compactly:
Transformer Block = attention + position + residual flow + normalization + feed-forward
Each part solves a specific problem.
Multi-Head Attention solves the “single view” problem.
Positional Encoding solves the “no order” problem.
Add & Norm solves the “deep training stability” problem.
Implementation View
At a high level, the block works like this:
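# Post-LN style, as in the original Transformer (function names are illustrative)
tokens = tokenize(text)
x = embedding(tokens)                       # token IDs -> vectors
x = x + positional_encoding                 # inject order information
attention_output = multi_head_attention(x)
x = layer_norm(x + attention_output)        # Add & Norm after attention
ffn_output = feed_forward(x)
output = layer_norm(x + ffn_output)         # Add & Norm after the FFN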
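In the Pre-LN style used by many modern models, normalization moves inside the residual branch:
attention_output = multi_head_attention(layer_norm(x))   # normalize before the sublayer
x = x + attention_output                                  # residual path stays un-normalized
ffn_output = feed_forward(layer_norm(x))
output = x + ffn_output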
The idea is the same.
Keep the original signal flowing.
Normalize activations.
Let attention and FFN update the representation.
Multi-Head Attention
A single attention head produces one relationship map.
But language has many relationship types.
A token may need to track:
- nearby words
- subject-verb structure
- semantic similarity
- long-distance references
- coreference
One attention head cannot easily capture all of these at once.
Multi-Head Attention fixes this by running several attention heads in parallel.
Each head gets its own learned Q, K, and V projections.
So each head can learn a different representation subspace.
The formula is:
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ)Wᴼ
where headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)
In plain English:
Run attention multiple ways.
Concatenate the results.
Project them back into one vector space.
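The three steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the helper names, the hand-picked projection matrices, and the toy input are all assumptions made for the example, and real models learn these weights and run batched tensor code.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(x, heads, w_o):
    """heads holds one (w_q, w_k, w_v) triple per head; w_o is the output projection."""
    outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
               for w_q, w_k, w_v in heads]
    # Concatenate the head outputs token by token, then project back to d_model.
    concat = [[value for head in outputs for value in head[t]] for t in range(len(x))]
    return matmul(concat, w_o)

# Hypothetical hand-picked weights: d_model = 4, two heads of width 2.
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
heads = [(first_half, first_half, first_half), (second_half, second_half, second_half)]

x = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
out = multi_head_attention(x, heads, identity)
print(len(out), len(out[0]))  # prints: 3 4
```

Each head here reads a different half of the feature vector, so the two heads build their attention maps from different information. The output keeps the input shape: one d_model-sized vector per token.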
Concrete Example
Take this sentence:
The animal did not cross the street because it was tired.
What does “it” refer to?
One attention head may focus on “animal.”
Another may focus on “tired.”
Another may track the structure around “because.”
This is useful because the model does not need one attention map to explain everything.
Different heads can specialize.
That is why Multi-Head Attention matters.
It gives the model multiple ways to read the same sentence.
Single-Head vs Multi-Head Attention
Single-head attention:
- uses one attention distribution
- sees token relationships from one perspective
- is simpler
- has to blend different patterns into one map
Multi-head attention:
- uses multiple attention distributions
- views tokens through different learned projections
- captures diverse relationships
- recombines the results afterward
The key difference:
Single-head = one view of context
Multi-head = multiple views of context
This is not just repeated computation.
It is structured parallel interpretation.
Why Positional Encoding Is Needed
Self-Attention compares all tokens with each other at once.
That is great for parallelism.
But it creates a problem.
Attention alone does not know token order.
Consider:
dog bites man
man bites dog
Same words.
Different meaning.
Without position information, the model does not naturally know which token came first.
So Transformers inject position into token representations.
The basic structure is:
Input Representation = Token Embedding + Positional Encoding
This matters because word order changes meaning.
A language model must know both:
what the token is
where the token is
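The order problem can be checked directly. The sketch below runs a stripped-down self-attention with no positional information on both sentences. The toy embeddings are made up for illustration, and the Q, K, and V projections are replaced by the identity, which is a simplification.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention with identity projections (a simplification)."""
    d = len(x[0])
    out = []
    for query in x:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)
        out.append([sum(w * value[j] for w, value in zip(weights, x)) for j in range(d)])
    return out

# Hypothetical toy embeddings, chosen only for this example.
emb = {"dog": [1.0, 0.0, 0.5], "bites": [0.0, 1.0, 0.5], "man": [0.5, 0.5, 1.0]}

def encode(sentence):
    """Return each word's attention output, keyed by the word."""
    words = sentence.split()
    return dict(zip(words, self_attention([emb[w] for w in words])))

a = encode("dog bites man")
b = encode("man bites dog")

# Each word gets the same output vector in both sentences (up to float rounding).
same_outputs = all(
    math.isclose(p, q, abs_tol=1e-9)
    for word in emb
    for p, q in zip(a[word], b[word])
)
print(same_outputs)
```

Swapping the word order only reorders the output rows. The vector for "dog" is the same whether the dog bites or gets bitten, which is exactly the information a position signal has to supply.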
Sinusoidal Positional Encoding
The original Transformer used fixed sine and cosine patterns.
Even dimensions use sine.
Odd dimensions use cosine.
The idea is:
different positions get different wave patterns.
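The formulas are:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model))
Here pos is the token position, i indexes each sine/cosine pair, and d_model is the embedding width.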
This is not just a position ID.
It creates smooth position signals across dimensions.
The model can use these signals to reason about position and distance.
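A small sketch of the sinusoidal encoding follows. The function name and the tiny sizes are assumptions for illustration. The loop variable steps over even dimension indices, so it plays the role of 2i in the formula.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of fixed sine/cosine position signals."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for even_dim in range(0, d_model, 2):
            # even_dim corresponds to 2i in the formula.
            angle = pos / (10000 ** (even_dim / d_model))
            pe[pos][even_dim] = math.sin(angle)
            if even_dim + 1 < d_model:
                pe[pos][even_dim + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(seq_len=4, d_model=4)
print(pe[0])  # prints: [0.0, 1.0, 0.0, 1.0]
```

Position 0 always encodes to alternating 0 and 1, because sin(0) = 0 and cos(0) = 1. Later positions move along waves of different frequencies, so every row of the table is distinct.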
APE, RPE, and RoPE
There are several ways to inject position.
Absolute Positional Embedding:
- assigns a position vector to each absolute index
- position 1 has one vector
- position 2 has another vector
Relative Positional Embedding:
- focuses on distance between tokens
- useful when relative position matters more than absolute index
Rotary Positional Embedding:
- rotates Query and Key vectors using position
- makes the attention score between two tokens depend on their relative offset
- commonly used in modern LLMs
The shared goal is simple:
Give attention a way to understand order.
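The rotary idea can be sketched in a few lines. This version rotates adjacent pairs of dimensions, which is one common convention (some implementations pair dimensions differently). The query and key values and the base of 10000 are assumptions for the example.

```python
import math

def rotate(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle that grows with pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Hypothetical query and key vectors.
q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.9, 0.4]

score_near = dot(rotate(q, 5), rotate(k, 2))       # positions 5 and 2, offset 3
score_far = dot(rotate(q, 105), rotate(k, 102))    # positions 105 and 102, offset 3
score_other = dot(rotate(q, 5), rotate(k, 4))      # positions 5 and 4, offset 1

print(math.isclose(score_near, score_far, abs_tol=1e-9))
```

The first two scores match because both pairs of tokens are three positions apart, even though their absolute positions differ by 100. The third score differs because the offset changed.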
Add & Norm
A Transformer block also needs stability.
That is where Add & Norm comes in.
Add means residual connection.
Norm means layer normalization.
The classic Post-LN formula is:
Output = LayerNorm(x + Sublayer(x))
The residual connection preserves the original input.
The sublayer adds a learned update.
LayerNorm keeps the representation stable.
This is important because Transformers stack many layers.
Without residual paths, the input signal and its gradients can degrade as depth grows.
Without normalization, training can become unstable.
Residual Connection Intuition
A sublayer should not have to rebuild everything from scratch.
It should only need to learn an update.
Instead of:
new_output = sublayer(x)
Use:
new_output = x + sublayer(x)
This is a huge difference.
The original representation can pass forward directly.
The sublayer only adds useful changes.
That makes deep networks easier to train.
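A toy experiment makes the difference visible. The sublayer below is a hypothetical stand-in that always produces a small output. It is not a real attention or FFN layer, and the depth of 20 is arbitrary.

```python
def small_update(x):
    """Hypothetical sublayer whose output is small relative to its input."""
    return [0.01 * v for v in x]

def run_stack(x, depth, use_residual):
    """Apply the same toy sublayer depth times, with or without a residual path."""
    for _ in range(depth):
        update = small_update(x)
        x = [a + b for a, b in zip(x, update)] if use_residual else update
    return x

x = [1.0, -2.0, 0.5]
plain = run_stack(x, depth=20, use_residual=False)
residual = run_stack(x, depth=20, use_residual=True)

print(plain)     # values collapse toward zero
print(residual)  # values stay close to the input
```

Without the residual path, the original signal is gone after 20 layers. With it, the input survives and each layer only nudges it.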
Layer Normalization Intuition
LayerNorm rescales each token representation to zero mean and unit variance, then applies a learned scale and shift.
It works across the feature dimensions of a token.
Not across the sequence.
That means each token vector is stabilized independently.
In practice, this helps keep activation values in a manageable range.
This matters when many Transformer blocks are stacked.
Small instability in one layer can grow across dozens or hundreds of layers.
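Here is a minimal sketch of per-token layer normalization. It leaves out the learned scale and shift parameters that real implementations include, and the two-token sequence is made up for the example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector across its features (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Two tokens with very different magnitudes.
sequence = [[1.0, 2.0, 3.0, 4.0], [100.0, 200.0, 300.0, 400.0]]

# Each token is normalized on its own; no statistics are shared across the sequence.
normalized = [layer_norm(token) for token in sequence]

for token in normalized:
    print([round(v, 3) for v in token])
```

Both tokens land in nearly the same range after normalization, even though the second one started 100 times larger. That is the sense in which each token vector is stabilized independently.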
Pre-LN vs Post-LN
Post-LN:
x = LayerNorm(x + Sublayer(x))
Pre-LN:
x = x + Sublayer(LayerNorm(x))
The original Transformer used Post-LN.
Many modern large Transformers use Pre-LN.
Why?
Pre-LN often improves training stability in deep models.
The difference is placement.
Post-LN normalizes after the residual addition.
Pre-LN normalizes before the sublayer.
Both use residual connections.
But their training behavior can be different.
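The two placements can be compared side by side. In this sketch, toy_sublayer is a hypothetical stand-in for attention or the FFN, and layer_norm omits the learned scale and shift.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector across its features (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln(x, sublayer):
    """Post-LN: add the residual first, then normalize the sum."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln(x, sublayer):
    """Pre-LN: normalize the sublayer input, leave the residual path untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [0.5 * v for v in x]

x = [10.0, 20.0, 30.0, 40.0]
post = post_ln(x, toy_sublayer)
pre = pre_ln(x, toy_sublayer)

print([round(v, 3) for v in post])  # normalized output, centered on zero
print([round(v, 3) for v in pre])   # input scale preserved, small update added
```

In the Post-LN result, the whole output is rescaled, including the residual. In the Pre-LN result, the input passes through unchanged and only the update was computed from normalized values. Because the Pre-LN residual path is never normalized, models built this way usually apply one final LayerNorm after the last block.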
Naive vs Practical View
Naive view:
Transformer block = attention layer
Practical view:
Transformer block = attention + position + residuals + normalization + FFN
Naive implementation mindset:
run attention
return output
Practical implementation mindset:
add position
run multi-head attention
preserve input through residuals
normalize representations
apply feed-forward updates
repeat safely across many layers
This is why implementation details matter.
The architecture works because these parts support each other.
Important Conditions and Limits
Multi-Head Attention is powerful, but more heads are not always better.
With a fixed model width, each extra head shrinks the dimension available per head, which limits what a single head can represent.
Some heads may become redundant.
Positional Encoding is necessary because attention is order-agnostic by default.
But positional methods behave differently in long-context settings: learned absolute embeddings, for example, have no vectors for positions beyond the trained length.
Add & Norm improves stability.
But the exact Pre-LN or Post-LN choice affects optimization.
So these are not decorative components.
They are architectural decisions.
Takeaway
Multi-Head Attention gives the model multiple views of token relationships.
Positional Encoding gives attention a sense of order.
Add & Norm keeps deep Transformer blocks stable.
The shortest version is:
Transformer Block = multi-view attention + position signal + stable residual updates
If Self-Attention is the engine, these components are the systems that make the engine usable at scale.
Discussion
When reading Transformer architecture, which part feels most important to understand first?
Multi-Head Attention, Positional Encoding, or Add & Norm?
Originally published at zeromathai.com.
Original article: https://zeromathai.com/en/multi-head-attention-positional-encoding-add-norm-en/
GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai