zeromathai

Posted on Jun 22 • Originally published at zeromathai.com

Why Multi-Head Attention Needs Position, Residuals, and Normalization


Self-Attention is powerful.

But by itself, it leaves three gaps.

It reads context through a single view, it ignores word order, and it does not stay stable when many layers are stacked.

That is why Multi-Head Attention, Positional Encoding, and Add & Norm exist.

Core Idea

A Transformer block is not just attention.

Attention computes token relationships.

Multi-Head Attention makes those relationships richer.

Positional Encoding tells the model where tokens are.

Add & Norm keeps deep Transformer blocks trainable.

This matters because modern LLMs are deep.

Without these support structures, attention alone is not enough.

The Key Structure

A simplified Transformer encoder block looks like this:

Input

→ Positional Information (added to the embeddings before the first block)

→ Multi-Head Attention

→ Add & Norm

→ Feed-Forward Network

→ Add & Norm

→ Output

More compactly:

Transformer Block = attention + position + residual flow + normalization + feed-forward

Each part solves a specific problem.

Multi-Head Attention solves the “single view” problem.

Positional Encoding solves the “no order” problem.

Add & Norm solves the “deep training stability” problem.

Implementation View

At a high level, the block works like this:

tokens = tokenize(text)

x = embedding(tokens)

x = x + positional_encoding

attention_output = multi_head_attention(x)

x = layer_norm(x + attention_output)

ffn_output = feed_forward(x)

output = layer_norm(x + ffn_output)

In the Pre-LN style used by many modern models, normalization moves inside the residual branch:

attention_output = multi_head_attention(layer_norm(x))

x = x + attention_output

ffn_output = feed_forward(layer_norm(x))

output = x + ffn_output

The idea is the same.

Keep the original signal flowing.

Normalize activations.

Let attention and FFN update the representation.
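To make the Post-LN pseudocode above concrete, here is a minimal NumPy sketch. It is a toy under stated assumptions, not a reference implementation: the weights are random and untrained, the positional table `P` is a random stand-in, a single attention head stands in for `multi_head_attention`, and LayerNorm omits its learned scale and shift. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, vocab_size, max_len = 8, 32, 10, 16  # illustrative sizes

def layer_norm(x, eps=1e-5):
    # Normalize each token vector over its feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy parameters (random, untrained).
E = rng.normal(size=(vocab_size, d_model))      # token embedding table
P = rng.normal(size=(max_len, d_model)) * 0.1   # stand-in positional table
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d_model)) * 0.1

def attention(x):
    # Single-head stand-in for multi_head_attention, to keep the sketch short.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(d_model)) @ v

def feed_forward(x):
    # Two-layer ReLU MLP applied to each token independently.
    return np.maximum(x @ W1, 0.0) @ W2

def embed_and_run_block(token_ids):
    x = E[token_ids] + P[: len(token_ids)]   # x = embedding + position
    x = layer_norm(x + attention(x))         # Add & Norm around attention
    return layer_norm(x + feed_forward(x))   # Add & Norm around the FFN

out = embed_and_run_block(np.array([1, 4, 2, 7]))
print(out.shape)
```

The output keeps one vector per input token, so blocks like this can be stacked.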

Multi-Head Attention

Single-head attention gives one relationship map.

But language has many relationship types.

A token may need to track:

nearby words
subject-verb structure
semantic similarity
long-distance references
coreference

One attention head cannot easily capture all of these at once.

Multi-Head Attention fixes this by running several attention heads in parallel.

Each head gets its own learned Q, K, and V projections.

So each head can learn a different representation subspace.

The formula is:

MultiHead(Q, K, V) = Concat(head₁, ..., headₕ)Wᴼ

where headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)

In plain English:

Run attention multiple ways.

Concatenate the results.

Project them back into one vector space.
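The three steps above can be sketched in NumPy. This is a minimal illustration, assuming self-attention (Q, K, and V all come from the same input `x`), random untrained weights, no masking, and no biases. Slicing one large projection into `num_heads` chunks is equivalent to giving each head its own smaller projection matrix. The function and variable names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

    # Run attention multiple ways: one (seq_len, seq_len) map per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores)
    heads = weights @ v  # (num_heads, seq_len, d_head)

    # Concatenate the results, then project back into one vector space.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_self_attention(x, Wq, Wk, Wv, Wo, num_heads)
print(out.shape, weights.shape)
```

Each slice `weights[h]` is a separate attention distribution over the same tokens, which is the "multiple views" idea in array form.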

Concrete Example

Take this sentence:

The animal did not cross the street because it was tired.

What does “it” refer to?

One attention head may focus on “animal.”

Another may focus on “tired.”

Another may track the structure around “because.”

This is useful because the model does not need one attention map to explain everything.

Different heads can specialize.

That is why Multi-Head Attention matters.

It gives the model multiple ways to read the same sentence.

Single-Head vs Multi-Head Attention

Single-head attention:

uses one attention distribution
sees token relationships from one perspective
is simpler
has to blend different patterns into one attention map

Multi-head attention:

uses multiple attention distributions
views tokens through different learned projections
captures diverse relationships
recombines the results afterward

The key difference:

Single-head = one view of context

Multi-head = multiple views of context

This is not just repeated computation.

It is structured parallel interpretation.

Why Positional Encoding Is Needed

Self-Attention compares all tokens with each other at the same time.

That is great for parallelism.

But it creates a problem.

Attention alone does not know token order.

Consider:

dog bites man

man bites dog

Same words.

Different meaning.

Without position information, the model does not naturally know which token came first.

So Transformers inject position into token representations.
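A small experiment shows the problem. This sketch assumes a single attention head with random weights and no positional signal; the embeddings for "dog", "bites", and "man" are random stand-ins. Reversing the sentence only reverses the rows of the output, so each word ends up with exactly the same vector in both sentences.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Plain scaled dot-product self-attention, no position information.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
d = 8
emb = {w: rng.normal(size=d) for w in ["dog", "bites", "man"]}  # toy embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

sent_a = np.stack([emb[w] for w in ["dog", "bites", "man"]])
sent_b = np.stack([emb[w] for w in ["man", "bites", "dog"]])

out_a = self_attention(sent_a, Wq, Wk, Wv)
out_b = self_attention(sent_b, Wq, Wk, Wv)

# sent_b is sent_a reversed, and the outputs are just reversed as well.
same_up_to_order = np.allclose(out_a[::-1], out_b)
print(same_up_to_order)
```

In other words, "dog" as the subject and "dog" as the object are indistinguishable to this layer until position is added.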

The basic structure is:

Input Representation = Token Embedding + Positional Encoding

This matters because word order changes meaning.

A language model must know both:

what the token is

where the token is

Sinusoidal Positional Encoding

The original Transformer used fixed sine and cosine patterns.

Even dimensions use sine.

Odd dimensions use cosine.

The idea is:

different positions get different wave patterns.

A simplified view:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))

PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model))

This is not just a position ID.

Each pair of dimensions oscillates at a different wavelength, so every position gets its own pattern of fast-changing and slow-changing signals.

The model can use these signals to reason about position and distance.
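The two formulas above translate almost directly into NumPy. This is a minimal sketch that assumes `d_model` is even; the function name and the sizes are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # Assumes d_model is even, so dimensions pair up as (2i, 2i + 1).
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)
```

The table would then be added to the token embeddings row by row, as in `x = x + positional_encoding`.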

APE, RPE, and RoPE

There are several ways to inject position.

Absolute Positional Embedding:

assigns a position vector to each absolute index
position 1 has one vector
position 2 has another vector

Relative Positional Embedding:

focuses on distance between tokens
useful when relative position matters more than absolute index

Rotary Positional Embedding:

rotates Query and Key vectors using position
makes the Query-Key dot product depend on the relative offset between tokens
commonly used in modern LLMs

The shared goal is simple:

Give attention a way to understand order.
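The rotary idea is the least obvious of the three, so here is a small sketch of it. It assumes the interleaved pairing convention, where dimensions (2i, 2i + 1) are rotated together, and a base of 10000; real implementations differ in how they pair dimensions and in how they batch the computation. The function name `rope` is illustrative.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # x: 1-D vector with an even number of dimensions.
    # Rotate each pair (x[2i], x[2i + 1]) by the angle pos * theta_i.
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)

    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (3 positions apart), very different absolute positions.
score_a = rope(q, 5) @ rope(k, 2)
score_b = rope(q, 105) @ rope(k, 102)
print(np.isclose(score_a, score_b))
```

The two scores match because rotating both vectors by the same extra angle leaves their dot product unchanged. Only the offset between Query and Key positions matters.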

Add & Norm

A Transformer block also needs stability.

That is where Add & Norm comes in.

Add means residual connection.

Norm means layer normalization.

The classic Post-LN formula is:

Output = LayerNorm(x + Sublayer(x))

The residual connection preserves the original input.

The sublayer adds a learned update.

LayerNorm keeps the representation stable.

This is important because Transformers stack many layers.

Without residual paths, the input signal and its gradients can degrade as they pass through many layers.

Without normalization, training can become unstable.

Residual Connection Intuition

A sublayer should not have to rebuild everything from scratch.

It should only need to learn an update.

Instead of:

new_output = sublayer(x)

Use:

new_output = x + sublayer(x)

The change is small in code but large in effect.

The original representation can pass forward directly.

The sublayer only adds useful changes.

That makes deep networks easier to train.
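A toy experiment illustrates the difference. This sketch stacks 50 random sublayers with deliberately small weights (an assumption chosen to make the effect obvious, not a claim about trained networks) and compares the two update rules.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 16, 50
x0 = rng.normal(size=d)
# Deliberately small random weights: each sublayer output is tiny.
layer_weights = [rng.normal(size=(d, d)) * 0.01 for _ in range(depth)]

def sublayer(x, W):
    return np.tanh(x @ W)

plain, residual = x0.copy(), x0.copy()
for W in layer_weights:
    plain = sublayer(plain, W)                    # new_output = sublayer(x)
    residual = residual + sublayer(residual, W)   # new_output = x + sublayer(x)

plain_norm = np.linalg.norm(plain)
residual_norm = np.linalg.norm(residual)
print(plain_norm, residual_norm, np.linalg.norm(x0))
```

Without the residual path, the signal shrinks toward zero after a few layers. With it, the original representation survives all 50 layers and each sublayer only nudges it.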

Layer Normalization Intuition

LayerNorm normalizes each token representation to zero mean and unit variance, then applies a learned scale and shift.

It works across the feature dimensions of a token.

Not across the sequence.

That means each token vector is stabilized independently.

In practice, this helps keep activation values in a manageable range.

This matters when many Transformer blocks are stacked.

Small instability in one layer can grow across dozens or hundreds of layers.
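Here is a minimal NumPy sketch of that behavior. The `gamma` and `beta` arguments are the learned scale and shift, set here to ones and zeros for illustration; the input values are made up.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are computed per token, over the feature axis only.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=10.0, size=(4, 8))  # 4 tokens, 8 features
x[0] *= 100.0                                     # one token with a huge scale

gamma, beta = np.ones(8), np.zeros(8)
y = layer_norm(x, gamma, beta)

# Changing token 0 does not affect how the other tokens are normalized.
x_changed = x.copy()
x_changed[0] = 0.0
y_changed = layer_norm(x_changed, gamma, beta)

print(y.mean(axis=-1), y.std(axis=-1))
print(np.allclose(y[1:], y_changed[1:]))
```

Every token comes out with roughly zero mean and unit spread, including the one that started 100 times larger.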

Pre-LN vs Post-LN

Post-LN:

x = LayerNorm(x + Sublayer(x))

Pre-LN:

x = x + Sublayer(LayerNorm(x))

The original Transformer used Post-LN.

Many modern large Transformers use Pre-LN.

Why?

Pre-LN often improves training stability in deep models, because the residual path is never normalized and gradients can flow back through it directly.

The difference is placement.

Post-LN normalizes after the residual addition.

Pre-LN normalizes before the sublayer.

Both use residual connections.

But their training behavior can be different.
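The placement difference is easy to see in code. In this sketch, `post_ln` and `pre_ln` are illustrative wrapper names, the sublayer is a random toy function, and LayerNorm omits its learned scale and shift.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln(x, sublayer):
    # Normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln(x, sublayer):
    # Normalize only the sublayer input; x itself passes through untouched.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=5.0, size=(3, 8))
W = rng.normal(size=(8, 8)) * 0.1

def toy_sublayer(h):
    # Stand-in for attention or the FFN.
    return np.tanh(h @ W)

y_post = post_ln(x, toy_sublayer)
y_pre = pre_ln(x, toy_sublayer)

print(y_post.mean(axis=-1))      # about 0: the whole sum was renormalized
print(np.abs(y_pre - x).max())   # small: x plus a bounded update
```

In Post-LN the residual stream itself is rescaled at every block. In Pre-LN the stream is left alone and only the update is computed from a normalized copy, which is why Pre-LN stacks usually add one final LayerNorm after the last block.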

Naive vs Practical View

Naive view:

Transformer block = attention layer

Practical view:

Transformer block = attention + position + residuals + normalization + FFN

Naive implementation mindset:

run attention
return output

Practical implementation mindset:

add position
run multi-head attention
preserve input through residuals
normalize representations
apply feed-forward updates
repeat safely across many layers

This is why implementation details matter.

The architecture works because these parts support each other.

Important Conditions and Limits

Multi-Head Attention is powerful, but more heads are not always better.

More heads mean more attention maps to store, and when the model dimension is fixed, fewer dimensions per head.

Some heads may become redundant.
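A little arithmetic makes the trade-off concrete. This sketch assumes the common convention that each head gets `d_model / num_heads` dimensions, which the article does not state explicitly; the function name and the sizes are illustrative, and biases are ignored.

```python
def head_tradeoff(d_model, num_heads, seq_len):
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    # Q, K, V, and output projections, ignoring biases.
    projection_params = 4 * d_model * d_model
    # One (seq_len x seq_len) attention map per head.
    attention_map_entries = num_heads * seq_len * seq_len
    return d_head, projection_params, attention_map_entries

for h in (1, 8, 64):
    print(h, head_tradeoff(d_model=512, num_heads=h, seq_len=128))
```

Under this convention the projection parameter count does not change with the number of heads. What changes is that each head works in a narrower subspace while the number of attention maps grows.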

Positional Encoding is necessary because attention is order-agnostic by default.

But different positional methods behave differently in long-context settings, especially on sequences longer than those seen in training.

Add & Norm improves stability.

But the exact Pre-LN or Post-LN choice affects optimization.

So these are not decorative components.

They are architectural decisions.

Takeaway

Multi-Head Attention gives the model multiple views of token relationships.

Positional Encoding gives attention a sense of order.

Add & Norm keeps deep Transformer blocks stable.

The shortest version is:

Transformer Block = multi-view attention + position signal + stable residual updates

If Self-Attention is the engine, these components are the systems that make the engine usable at scale.

Discussion

When reading Transformer architecture, which part feels most important to understand first?

Multi-Head Attention, Positional Encoding, or Add & Norm?

Original article: https://zeromathai.com/en/multi-head-attention-positional-encoding-add-norm-en/

GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai
