Samyak Jain

Positional Encoding - Sense of direction for Transformers

I have been trying to understand how transformers work lately, and whenever you read or hear about them, one word comes up more than any other: ‘Attention’. It's not something that first appeared with transformers, but it has become the centerpiece of the whole architecture.

That said, attention by itself doesn’t get us very far. There’s another idea that doesn’t always get the same spotlight, but without it, self-attention would completely fall apart. That idea is positional encoding: the thing that lets transformers keep track of word order.

But why is self-attention in transformers useless without positional encoding?

The issue is that self-attention has no built-in sense of order. Each token looks at every other token simultaneously, like in the diagram from 3Blue1Brown.

[Image: self-attention in a transformer, from a 3Blue1Brown video illustration]

Because each token (roughly a word) looks at every other token at the same time, there is no information about how far one token is from another or what order the tokens come in.

So if my sentence is “A cat ran behind a mouse,” a transformer without positional encoding would only see it as a bag of words: {A, cat, ran, behind, a, mouse}. Now flip it to “A mouse ran behind a cat” — the bag looks identical to a transformer that lacks positional encoding. The meaning, however, is completely different. That’s the blind spot positional encoding is designed to fix.

This wasn’t really a problem for RNNs, because the architecture itself processed tokens step by step. Each word was fed in after the previous one, and the hidden state carried along a memory of everything seen so far. In other words, the order of the sequence was baked into the way RNNs worked.

Transformers flipped that idea. Instead of moving sequentially, they look at all the tokens in parallel. That parallelism is a big part of what makes them so powerful and efficient — but it also means they lose the natural sense of order that RNNs got for free.

That’s where positional encoding steps in: it reintroduces the concept of order without giving up the transformer’s parallelism.

So how do we encode this "sense" of position in each token?

Well, if you look at the original "Attention Is All You Need" paper, they solved this problem using these two formulas:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

Haha, this didn't make a lot of sense to me at first glance, so I'll try to explain what I have understood about it so far.

I had questions like: how is this formula derived? Why is there a pair of sine and cosine? What's with using 10,000? Let's try to reach this formula step by step.

Idea 1: Just Count (Integer Indexing)

The simplest idea to have positional information is to assign each token its index in the sequence.

  • The Idea: The first word is 1, the second is 2, the third is 3, and so on. [1, 2, 3, 4, ...]

  • The Flaw: The numbers can become very large for long sentences. Neural networks work best with small, normalized values, and large numbers can make training unstable. Also, a model might see very different ranges of numbers during training vs. testing, making it hard to generalize.

Ok, so we can't feed in raw integers: they aren't neural-network friendly and can contribute to vanishing or exploding gradients. Neural networks generally prefer their inputs small and balanced between positive and negative values.

Idea 2: Normalize the Count

To fix the scaling issue, the next logical step is to normalize the indices to a fixed range, like [0, 1].

  • The Idea: Divide each index by the length of the sentence. For a 4-word sentence, the encoding would be: [1/4, 2/4, 3/4, 4/4] or [0.25, 0.5, 0.75, 1.0]

  • The Flaw: This makes the meaning of a position dependent on the sentence length. For example, the value 0.5 means the 2nd position in a 4-word sentence, but it would mean the 10th position in a 20-word sentence. The model has no consistent way to interpret what a position value means, which is a deal-breaker.
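
To make this concrete, here is a tiny NumPy sketch of Ideas 1 and 2 (the code and variable names are my own, purely for illustration):

```python
import numpy as np

# Idea 1: raw integer positions -- the values grow without bound for long sequences
int_positions = np.arange(1, 51)
print(int_positions[-3:])                      # [48 49 50] -> large, unnormalized inputs

# Idea 2: normalize by sentence length -- the same value now means different positions
pos_in_4_words = np.arange(1, 5) / 4           # [0.25 0.5  0.75 1.  ]
pos_in_20_words = np.arange(1, 21) / 20        # [0.05 0.1  ... 1.  ]
print(pos_in_4_words[1], pos_in_20_words[9])   # 0.5 0.5 -> 2nd word vs. 10th word look identical
```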

Ok, so we need a static representation of position: it can't depend on the length of the sentence or any other variable factor.

Idea 3: Use a Binary Vector

To handle arbitrary lengths consistently while keeping values small, the next idea is to represent the position index in binary and turn it into a vector.

  • The Idea: Represent each position index by its binary equivalent, creating a vector for each position. This solves the previous two problems: the values stay between 0 and 1, and the encoding is static regardless of the sentence length (although all the values are still non-negative).

    • Position 2 -> [0, 0, 1, 0]
    • Position 3 -> [0, 0, 1, 1]
    • Position 4 -> [0, 1, 0, 0]
  • The Flaw: This is discrete and "jagged." A small change in position (e.g., from 7 (0111) to 8 (1000)) can cause every single value in the vector to flip. This gives the model no smooth, continuous sense of distance or proximity between nearby positions.

Below is an example of what it means by "jagged"

[Image: example of the "jagged" jumps between binary encodings of adjacent positions]

Check out the source here
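
Here is the same "jagged" behaviour as a quick sketch (my own code, not from the source above):

```python
import numpy as np

def binary_position(pos: int, n_bits: int = 4) -> np.ndarray:
    """Idea 3: represent a position index as a fixed-width binary vector."""
    return np.array([(pos >> b) & 1 for b in reversed(range(n_bits))])

print(binary_position(7))   # [0 1 1 1]
print(binary_position(8))   # [1 0 0 0] -> every single bit flips between two adjacent positions
```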

Ok, so we want the model to see a smooth curve, not an abrupt change between two adjacent positions.

Idea 4: Using Sine Waves

Alright, so we need something smooth, something that glides from 0 up to 1 and back down again rather than jumping. The sine function does this perfectly, and it is also bounded in the range [-1, 1], which solves our problem of wanting balanced positive and negative values.

Like this

[Image: illustration of using sine waves for positional encoding]
Check out the source here

Alright, let's try to use a sine function to compute positions for our tokens.

Does this sine wave approach address all the problems we found with the other approaches?

  1. It gives a smooth curve.
  2. It is static, irrespective of the length of the sequence.
  3. The values are small and bounded rather than raw integers, so they won't cause problems for the neural network.

This is a very brief version of the derivation; it has been explained really well here and here.

Alright, does that mean we’re all set now? Can we just use one sine function per position and call it a day?

Well... not exactly

1. Periodicity problem

  • Sine is periodic, which means it repeats its values over and over.
  • For example, if we move along the sine wave by 1 radian per position:

    • sin(1), sin(2), …
    • By the time we get past sin(6) (sine's period is 2π ≈ 6.28), the values start circling back to what we saw at the beginning, simply because sine repeats.
  • This creates a problem: the model sees almost the same number at different positions, even though they are far apart in the sequence. It can’t reliably tell positions apart.

2. Phase problem

  • Even without worrying about repetition, phase ambiguity is an issue.
  • Consider sin(30°) and sin(150°). Both return 0.5.
  • For the model, if you only provide the sine value, it sees 0.5 and has no idea whether it came from position 30 or 150.
  • This is why sine alone doesn’t uniquely identify positions — the same value can appear at many different points in the sequence.
  • You can't locate a point in 2D space with just its y-coordinate (the sine), as the quick check below shows.
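
A quick numerical check of the phase problem (treating the positions as angles in degrees, purely for illustration):

```python
import numpy as np

print(np.sin(np.deg2rad(30)))    # ≈ 0.5
print(np.sin(np.deg2rad(150)))   # ≈ 0.5 -> the model sees the same value for two far-apart positions
```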

3. Linear Transformation

Another major issue is linear transformation. You might think that moving from pos = 6 to pos = 9 should be equivalent to moving from pos = 18 to pos = 21, because the offset k = 3 is the same. But that doesn't hold with the current approach.

Before we look at why it doesn't hold, let's understand why we want this property in the first place. It is very well explained in a blog post that I'm quoting here:

> Why is this a very desirable property to have? Imagine we have a network that is trying to translate the sentence "I am going to eat."
>
> The combination "is/am/are" + "going" + "to" + "verb" is a very common grammatical structure, with a fixed positional structure. "going" always ends up at index 1, "to" at index 2, etc.
>
> In this case, when translating "verb", we may want the network to learn to pay attention to the noun "I" that occurs before "am going to". "I" is located 4 position units to the left of "verb".
>
> Since our attention layer uses linear transformations to form the keys, queries, and values, it would be nice if we have positional encodings such that the linear transformation can translate the position vector located 4 units to the left, so that it lines up with the position vector of "verb".
>
> The query and key would then match up perfectly.

Check out the source here

However, when you use only sine values in positional encoding, this isn’t true. Here’s why:

  1. Sine depends on absolute position:
     - Sine is a wave that oscillates between -1 and 1.
     - The amount the sine value changes when you move by `k` depends on **where you start** on the wave.
     - For example, moving 3 steps near a peak of the wave might produce a tiny change, while moving 3 steps near zero might produce a large change.
  2. The shift is not consistent:
     - The same offset `k` produces **different differences in sine values** depending on the starting position.
     - That means the "distance" between `pos` and `pos+k` is **variable**, unlike what you would expect from a true linear transformation.
  3. Consequence:
     - Moving by the same offset doesn't give the same change in information.
     - The sine-only encoding is **position-dependent**, so relative shifts aren't uniform across the sequence.
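
You can see this variability in a few lines of NumPy (the frequency and the starting positions below are arbitrary choices of mine):

```python
import numpy as np

f, k = 1.0, 3   # one arbitrary frequency and a fixed offset

for pos in (6, 18):
    change = np.sin(f * (pos + k)) - np.sin(f * pos)
    print(f"pos {pos} -> {pos + k}: change in sine value = {change:.3f}")
# The two printed changes differ, so the same offset k does not produce the same shift everywhere.
```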

For an illustration, check out this video:

[Video: illustration of shifting by an offset with sine only vs. sine + cosine]

Enter Sin + Cos based embeddings

We are getting close to the formula from the paper. We saw that sine-based embeddings, even though they gave us a smooth curve, came with their own problems.

Let's see if we can solve those problems with sin + cos.

Each position is now represented as a pair [sin(f*pos), cos(f*pos)] for each frequency.

Unique positions:

The combination of sine and cosine gives a unique point on the circle for every position.

Earlier, using sine alone, we ran into the phase problem:

  • sin(30°) ≈ 0.5
  • sin(150°) ≈ 0.5

The model would see 0.5 in both cases and wouldn’t know whether it came from position 30 or 150 — clearly a problem!

Now, with sin + cos:

  • Represent position as a 2D point: [sin(pos), cos(pos)]
  • Example:
| Position | sin(pos) | cos(pos) | Point (sin, cos) |
|----------|----------|----------|------------------|
| 30°      | 0.5      | 0.866    | (0.5, 0.866)     |
| 150°     | 0.5      | -0.866   | (0.5, -0.866)    |

Even though the sine values are the same, the cosine values are different.

  • The points (0.5, 0.866) and (0.5, -0.866) are distinct on the circle.
  • The model can now tell these positions apart.

So, combining sine and cosine removes the ambiguity and makes each position uniquely identifiable on the circle.
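
In code, the disambiguation looks like this (a small sketch of mine, again using degrees for readability):

```python
import numpy as np

def point_on_circle(deg: float) -> tuple[float, float]:
    """Represent a position (here an angle in degrees) by its (sin, cos) pair."""
    rad = np.deg2rad(deg)
    return round(float(np.sin(rad)), 3), round(float(np.cos(rad)), 3)

print(point_on_circle(30))    # (0.5, 0.866)
print(point_on_circle(150))   # (0.5, -0.866) -> same sine, but a different point on the circle
```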

Periodicity Problem

The sine + cosine pairs don't directly solve this problem; I think the authors just delayed it by creating multiple pairs of sin and cos (spanning the d_model dimensions), and on top of that, this explains why they used the scaling of 10,000:

  • Using a large number like 10,000 spreads the frequencies logarithmically across dimensions.
  • Low dimensions oscillate fast, high dimensions oscillate slowly.
  • Because of this spread, the combined vector of [sin, cos] across all dimensions is very unlikely to repeat for nearby positions, even though each individual sine/cosine is periodic.
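
Putting the pieces together, here is a minimal NumPy sketch of the paper's formula (my own implementation, so treat it as illustrative rather than reference code):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings: sin on even dimensions, cos on odd ones."""
    positions = np.arange(max_len)[:, np.newaxis]            # shape (max_len, 1)
    two_i = np.arange(0, d_model, 2)[np.newaxis, :]          # the "2i" indices, shape (1, d_model/2)
    angle_rates = 1.0 / np.power(10000.0, two_i / d_model)   # one frequency per sin/cos pair
    angles = positions * angle_rates                         # shape (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)   # PE(pos, 2i+1)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)      # (50, 16)
print(pe[0, :4])     # position 0 -> sin(0)=0 and cos(0)=1 alternate: [0. 1. 0. 1.]
```

The first few dimensions oscillate quickly while the last ones barely move over a 50-token sequence, which is exactly the logarithmic spread of frequencies described above.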

Linear Transformation

Now, each position is represented as two numbers: [sin(f * pos), cos(f * pos)]

Think of this as a point on a circle, where the angle = f * pos.

- `sin(f * pos)` = vertical coordinate        
- `cos(f * pos)` = horizontal coordinate
Enter fullscreen mode Exit fullscreen mode

How shifting positions works
  • Moving by k positions = increasing the angle by f * k
  • In 2D, this is equivalent to rotating the point around the circle.
  • A rotation has a fixed 2×2 matrix:

\begin{bmatrix} \sin(f \cdot (pos+k)) \\ \cos(f \cdot (pos+k)) \end{bmatrix} = \begin{bmatrix} \cos(f \cdot k) & \sin(f \cdot k) \\ -\sin(f \cdot k) & \cos(f \cdot k) \end{bmatrix} \begin{bmatrix} \sin(f \cdot pos) \\ \cos(f \cdot pos) \end{bmatrix}

  • Key point: the rotation matrix depends only on k, not on the starting position pos.

Why this solves linear transformation

  1. Uniform distance changes

    • The Euclidean distance between a point and its rotated version is constant for a given k
    • No matter where you start, moving forward by k produces the same shift in the vector
  2. Predictable relative shifts

    • Because rotation is linear, the model can learn a fixed transformation for any offset k
    • This is exactly what the Transformer exploits to reason about relative positions
  3. Intuition

    1. Sine-only = wavy line → inconsistent movement
    2. Sin+cos = circle → rotation → consistent movement
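
We can verify the rotation claim numerically (the frequency, starting position, and offset below are arbitrary values I picked):

```python
import numpy as np

f, pos, k = 0.3, 7, 4   # arbitrary frequency, starting position, and offset

def pe_pair(p: float) -> np.ndarray:
    """One [sin(f*p), cos(f*p)] pair of the positional encoding."""
    return np.array([np.sin(f * p), np.cos(f * p)])

# Rotation matrix that depends only on the offset k, not on pos
rotation = np.array([[ np.cos(f * k), np.sin(f * k)],
                     [-np.sin(f * k), np.cos(f * k)]])

print(np.allclose(rotation @ pe_pair(pos), pe_pair(pos + k)))   # True, for any choice of pos
```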

This concept of linear transformation in positional encoding is very well explained here and here.

Experiment

To see the difference in practice, I created positional embeddings in two ways:

  1. Sine-only embeddings
  2. Sine + Cosine pairs

I then computed the cosine similarity between two pairs of positions:

  • Pair 1: pos = 5 and pos = 9
  • Pair 2: pos = 13 and pos = 17

Both pairs have the same offset (Δpos = 4).

Results:

  • Sine-only embeddings:

    • The cosine similarity for Pair 1 was different from Pair 2.
    • This confirms that sine-only embeddings produce variable shifts depending on the starting position.
  • Sine + Cos embeddings:

    • The cosine similarity for Pair 1 exactly matched Pair 2.
    • Using [sin, cos] ensures that shifts are consistent across the sequence, independent of the starting position.

Here is the Google Colab to check this experiment out.
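
If you'd rather not open the notebook, here is a rough re-implementation of the same comparison (my own sketch; the actual Colab may differ in details):

```python
import numpy as np

def encode(pos: int, d_model: int = 64, use_cos: bool = True) -> np.ndarray:
    """Positional vector for one position: sine-only, or interleaved sin/cos pairs."""
    two_i = np.arange(0, d_model, 2)
    angles = pos / np.power(10000.0, two_i / d_model)
    if not use_cos:
        return np.sin(angles)            # sine-only baseline
    out = np.zeros(d_model)
    out[0::2], out[1::2] = np.sin(angles), np.cos(angles)
    return out

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for use_cos in (False, True):
    sim_5_9 = cosine_similarity(encode(5, use_cos=use_cos), encode(9, use_cos=use_cos))
    sim_13_17 = cosine_similarity(encode(13, use_cos=use_cos), encode(17, use_cos=use_cos))
    print("sin+cos  " if use_cos else "sine-only", round(sim_5_9, 6), round(sim_13_17, 6))
# With sin/cos pairs the two similarities match; with sine-only they differ.
```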

Conclusion

So yeah, positional encoding might not get the same spotlight as attention, but it’s what keeps transformers from being just a fancy bag-of-words model. We went from simple counting to sine waves and then to sine + cosine pairs, and along the way, we saw why each step mattered. In the end, sin + cos gives each position a unique spot on a circle, solves phase issues, and makes relative shifts predictable—letting the model actually make sense of word order while still enjoying the parallelism that makes transformers so powerful.

Positional encoding is still an active area of research. People are exploring relative positional encodings instead of static ones, like we discussed—check it out here.

There are also models, such as some autoregressive ones, that run fine without using positional encoding at all, which makes this topic even more interesting: when do we need positional encodings and when don't we? I still need to read a lot more before I can try to explain that in my own words.

Since this topic is really vast, I’d love for readers to critique this blog and point out any mistakes I might have made or assumptions that could be wrong.
