Self-Attention can compare every token with every other token.
But there is a catch.
By itself, it does not know the order of tokens.
That is a serious problem because “dog bites man” and “man bites dog” use the same words but mean completely different things.
Core Idea
A Transformer needs two kinds of information:
what the token is
where the token is
Token embeddings provide the “what.”
Positional embeddings provide the “where.”
This matters because attention without position is order-blind.
It can compare tokens, but it does not naturally know which token came first.
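To make "order-blind" concrete, here is a minimal NumPy sketch. It assumes a single attention head, no causal mask, and random weights; the `self_attention` helper and the random embeddings are illustrative, not taken from any library. It feeds both word orders through the same attention layer and checks whether each token gets the same output either way.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no positional signal."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 4
# Hypothetical random embeddings for the three tokens.
emb = {t: rng.normal(size=d) for t in ["dog", "bites", "man"]}
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

s1 = np.stack([emb[t] for t in ["dog", "bites", "man"]])
s2 = np.stack([emb[t] for t in ["man", "bites", "dog"]])

out1 = self_attention(s1, w_q, w_k, w_v)
out2 = self_attention(s2, w_q, w_k, w_v)

# "dog" is row 0 in the first sentence and row 2 in the second.
same_dog = np.allclose(out1[0], out2[2])
same_bites = np.allclose(out1[1], out2[1])
print(same_dog, same_bites)
```

Both checks should come out true: reordering the input only reorders the output rows. Under these assumptions the layer cannot tell the two sentences apart.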
The Key Structure
A simple positional embedding flow looks like this:
Token Embedding + Positional Information → Input Representation
For Absolute Positional Embedding:
E = X + P
Where:
X = token embedding
P = positional embedding
E = final input representation
More compactly:
Transformer input = meaning vector + position signal
Different positional methods change how the position signal is injected.
Pseudo-code View
Basic positional injection:
tokens = tokenize(text)
x = embedding(tokens)
position = positional_embedding(token_positions)
input_representation = x + position
For attention-based position methods:
q = project_query(x)
k = project_key(x)
q = apply_position(q)
k = apply_position(k)
attention_scores = q @ k.T
APE usually modifies the input embedding.
RPE usually modifies the attention score.
RoPE usually modifies Query and Key.
That difference is the whole story.
Concrete Example
Compare these two sentences:
dog bites man
man bites dog
The token set is the same:
dog, bites, man
But the order changes the meaning.
Without positional information, Self-Attention sees token relationships but has no built-in sequence order.
With positional information, each token representation includes location.
So “dog” at position 1 is different from “dog” at position 3.
This is why positional information matters.
Without it, attention has no way to tell these two sentences apart.
APE: Absolute Positional Embedding
Absolute Positional Embedding assigns a vector to each position index.
Position 1 has one vector.
Position 2 has another vector.
Position 3 has another vector.
Then the model adds that position vector to the token embedding.
Example:
Token embedding:
X = [0.2, 0.5]
Position embedding:
P = [0.1, -0.2]
Final representation:
E = [0.3, 0.3]
APE is easy to understand.
It says:
this token is at this exact position
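The arithmetic above can be sketched as a table lookup plus an add. The table values below are made up for illustration (only the first row matches the example), and the code indexes positions from 0 while the text counts from 1.

```python
import numpy as np

# Hypothetical position table: one row per position index (values are made up).
position_table = np.array([
    [0.1, -0.2],   # index 0 (the text's position 1), same P as the example
    [0.4, 0.3],    # index 1
    [-0.3, 0.6],   # index 2
])

token_embedding = np.array([0.2, 0.5])  # X from the example

def add_absolute_position(x, position_index, table):
    """APE: look up the vector for this position and add it to the token embedding."""
    return x + table[position_index]

e_first = add_absolute_position(token_embedding, 0, position_table)
e_last = add_absolute_position(token_embedding, 2, position_table)
print(e_first)  # approximately [0.3, 0.3]
print(e_last)   # same token, different position, different vector
```

The same token embedding produces a different input vector at each position. That is all APE does.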
Why APE Is Useful
APE is simple.
It is easy to implement.
It works well when sequence lengths stay close to what the model saw during training.
Implementation-wise, it is just:
x = token_embedding + position_embedding
That makes it cheap and clean.
But the simplicity has a cost.
APE treats position as a fixed index.
If the model sees much longer inputs than it was trained on, unseen positions can become unreliable.
That makes APE weaker for long-context extrapolation.
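For a learned position table, the limit is easy to see in code. This sketch assumes a table with one row per trained position; the `position_vector` helper, the sizes, and the random values are illustrative stand-ins for trained parameters.

```python
import numpy as np

max_trained_len = 4  # assumed training-time maximum length
d_model = 2
rng = np.random.default_rng(0)
# Stand-in for a learned table: random values, one row per trained position.
position_table = rng.normal(size=(max_trained_len, d_model))

def position_vector(index, table):
    """Return the row for `index`, or None if the table has no such position."""
    if 0 <= index < table.shape[0]:
        return table[index]
    return None

seen = position_vector(3, position_table)     # last trained position
unseen = position_vector(10, position_table)  # beyond the trained range
print(seen is not None, unseen is None)
```

A formula-based absolute encoding can produce a vector for any index. But the model still never saw those values during training, so the same concern applies.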
RPE: Relative Positional Embedding
Relative Positional Embedding focuses on distance.
Instead of asking:
What position is this token at?
It asks:
How far apart are these two tokens?
This is often more natural for language.
A subject and verb may appear at different absolute positions.
But their relative distance and direction still matter.
A simplified RPE attention score looks like this:
Aᵢⱼ = (QᵢKⱼᵀ + Rᵢ₋ⱼ) / √d
Rᵢ₋ⱼ represents the relative position between token i and token j.
This means position directly affects attention.
Concrete RPE Example
Suppose:
QᵢKⱼᵀ = 12
Rᵢ₋ⱼ = 4
√d = 4
Then:
Aᵢⱼ = (12 + 4) / 4 = 4
Without the relative term:
Aᵢⱼ = 12 / 4 = 3
So the relative-position term raised the pre-softmax score from 3 to 4.
That is the intuition.
RPE lets the model say:
This token is more relevant because of where it is relative to me.
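Here is a minimal sketch of the simplified score formula, assuming one scalar bias per relative offset and clipping of far-away offsets to a maximum distance. Real RPE variants differ in both choices. The `relative_attention_scores` helper and the bias values are illustrative, picked to reproduce the numbers above.

```python
import numpy as np

def relative_attention_scores(q, k, rel_bias, max_distance):
    """Scores (q_i . k_j + R[i - j]) / sqrt(d), with i - j clipped to +/- max_distance."""
    n, d = q.shape
    idx = np.arange(n)
    rel = np.clip(idx[:, None] - idx[None, :], -max_distance, max_distance)
    bias = rel_bias[rel + max_distance]  # shift offsets so table indices start at 0
    return (q @ k.T + bias) / np.sqrt(d)

d = 16  # sqrt(d) = 4, as in the example
q = np.zeros((2, d))
k = np.zeros((2, d))
q[0, 0] = 3.0
k[1, 0] = 4.0  # q_0 . k_1 = 12

# Hypothetical bias table for offsets -1, 0, +1 (i - j = -1 gets a bias of 4).
rel_bias = np.array([4.0, 0.0, 0.0])

with_bias = relative_attention_scores(q, k, rel_bias, max_distance=1)
without_bias = relative_attention_scores(q, k, np.zeros(3), max_distance=1)
print(with_bias[0, 1], without_bias[0, 1])  # 4.0 and 3.0
```

The bias depends only on the offset i - j, so the same table works at any absolute position.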
RoPE: Rotary Positional Embedding
Rotary Positional Embedding takes a different path.
It does not add a position vector to the input.
It rotates Query and Key vectors based on position.
The core idea:
position becomes rotation
A 2D rotation matrix looks like this:
Rθ = [[cosθ, -sinθ], [sinθ, cosθ]]
If you rotate [1, 0] by 90 degrees:
[1, 0] → [0, 1]
RoPE applies this idea across Query and Key dimensions.
Different positions get different rotations.
Then attention scores naturally include relative position.
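The rotation example can be checked directly. In this sketch, `rotate_by_position` and its `theta` step are illustrative assumptions: a single 2D pair rotated by an angle proportional to position, not a full RoPE implementation.

```python
import numpy as np

def rotation_matrix(theta):
    """Standard 2D counter-clockwise rotation by theta radians."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

rotated = rotation_matrix(np.pi / 2) @ np.array([1.0, 0.0])
print(np.round(rotated, 6))  # approximately [0, 1]

def rotate_by_position(vec, position, theta=0.5):
    """Rotate a 2D vector by an angle proportional to its position (theta is assumed)."""
    return rotation_matrix(position * theta) @ vec

v = np.array([1.0, 2.0])
v_pos3 = rotate_by_position(v, 3)
print(np.linalg.norm(v), np.linalg.norm(v_pos3))  # rotation keeps the length
```

Rotation changes direction but not length. Position is encoded in the angle, and the vector's magnitude is left alone.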
Why RoPE Works Well
RoPE uses absolute position to rotate Q and K.
But when Q and K are compared, the score depends on their relative position difference.
The key relationship is:
(RθⁱQ)ᵀ(RθʲK) = QᵀRθʲ⁻ⁱK
The absolute positions i and j cancel, and only the offset j - i remains in the score.
That is the relative distance.
So RoPE gives you a useful combination:
absolute-position injection + relative-position behavior
This is why RoPE became popular in modern LLMs.
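The relative-position property can be checked numerically. This sketch assumes the interleaved pairing convention, where dimensions (0, 1), (2, 3), and so on each form one 2D pair, and a frequency base of 10000. Real implementations differ in how they pair dimensions, and the `apply_rope` helper here is illustrative, not a library function.

```python
import numpy as np

def apply_rope(x, position, base=10000.0):
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... by position * freq."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)

score_a = apply_rope(q, 2) @ apply_rope(k, 5)      # positions 2 and 5, offset 3
score_b = apply_rope(q, 102) @ apply_rope(k, 105)  # positions 102 and 105, offset 3
score_c = apply_rope(q, 2) @ apply_rope(k, 9)      # positions 2 and 9, offset 7

print(np.isclose(score_a, score_b))  # same offset
print(np.isclose(score_a, score_c))  # different offset
```

Shifting both positions by 100 leaves the score unchanged, up to floating-point error. Changing the offset changes it.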
APE vs RPE vs RoPE
APE:
- adds position vectors to token embeddings
- simple and cheap
- good for fixed or known sequence lengths
- weaker for long-context extrapolation
RPE:
- adds relative distance information to attention scores
- directly models token-to-token distance
- flexible for variable lengths
- can complicate attention implementation
RoPE:
- rotates Query and Key vectors by position
- makes relative distance appear inside attention
- memory-efficient (no learned position table)
- works well with modern long-context LLMs
The key difference:
APE = where am I?
RPE = how far are we?
RoPE = rotate Q/K so distance appears in attention
Implementation Perspective
If you are reading Transformer code, look at where position enters the model.
APE usually appears near the embedding layer:
x = token_embedding + position_embedding
RPE usually appears inside attention score computation:
scores = q @ k.T + relative_position_bias
RoPE usually appears after Q and K projection:
q = apply_rope(q, positions)
k = apply_rope(k, positions)
scores = q @ k.T
This is the developer shortcut.
Find the injection point.
Then you know which positional method the model uses.
Naive vs Practical View
Naive view:
Positional embedding just tells the model token order.
Practical view:
Positional design affects long-context behavior, caching, memory, and attention quality.
Naive mindset:
add positions
run attention
Practical mindset:
choose how position enters attention
consider context length
consider extrapolation
consider KV Cache compatibility
consider implementation complexity
This matters because positional encoding is not a small detail.
It changes how the model behaves when the context becomes long.
Why This Matters Again
Short inputs can hide positional weaknesses.
Long-context models expose them.
If positional information does not extrapolate well, output quality can degrade sharply once inputs exceed the training length.
This is why modern LLMs care so much about RoPE variants and long-context scaling.
The position method affects whether a model can reliably handle long prompts, code files, documents, and conversations.
Important Conditions and Limits
APE is easy but tied to absolute indices.
RPE is expressive but can complicate attention computation.
RoPE is efficient and practical, but still needs careful scaling for very long contexts.
Also:
Positional embeddings do not create reasoning by themselves.
They only give attention a way to use order.
The model still needs training to learn useful patterns.
Takeaway
Self-Attention needs positional information because it is order-blind by default.
APE adds absolute position to embeddings.
RPE adds relative distance to attention scores.
RoPE rotates Query and Key vectors so relative position appears naturally.
The shortest version:
Positional Embedding = the order signal that makes attention understand sequence structure
If you understand where position enters the model, you understand the difference between APE, RPE, and RoPE.
Discussion
When learning Transformer internals, which positional method feels most intuitive to you?
APE, RPE, or RoPE?
Originally published at zeromathai.com.
Original article: https://zeromathai.com/en/advanced-positional-embeddings-en/
GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai