Welcome back to the Transformer Series!
In Blog #2, we turned words into numbers (Embeddings). We took the word "King" and turned it into a vector of numbers that captures its meaning.
But we have a massive problem.
Transformers are designed to be fast. Unlike their predecessors (RNNs), which read a sentence one word at a time (left-to-right), Transformers gulp down the entire sentence in one go.
This Parallel Processing is great for speed, but terrible for keeping track of word order.
If you feed the sentence "The dog bit the man" into a Transformer without help, it sees the exact same input as "The man bit the dog." It's just a bag of words floating in space. It knows who is involved, but it has no idea who did what because it has lost the concept of Order.
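You can see this "bag of words" effect in a deliberately simplified sketch (made-up token IDs, no real tokenizer, and just summing embeddings instead of running full attention):

import torch
import torch.nn as nn

torch.manual_seed(0)

# Made-up token IDs: the=0, dog=1, bit=2, man=3
dog_bites_man = torch.tensor([0, 1, 2, 0, 3])
man_bites_dog = torch.tensor([0, 3, 2, 0, 1])

embed = nn.Embedding(4, 8)  # tiny vocabulary, 8-dimensional embeddings

# Without any positional signal, both sentences collapse into the same representation
print(torch.allclose(embed(dog_bites_man).sum(dim=0),
                     embed(man_bites_dog).sum(dim=0)))  # True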
Today, we are going to fix that using one of the smartest tricks in Deep Learning: Positional Encoding.
The Problem: The "Shuffled Book"
Imagine I hand you a book, but I've torn out all the pages and shuffled them. You can still read the individual sentences and understand the topics (Embeddings), but you can't follow the story (Sequence).
RNNs (Old School): They read page 1, then page 2, then page 3. They know the order implicitly because they process things sequentially.
Transformers (New School): They look at all the pages at the exact same time.
To fix this, we need to write the page number on every single page before we throw them into the pile. That is exactly what Positional Encoding is.
Attempt #1: The Naive Approach (Integers) ❌
Why don't we just add a simple number to each word?
"The" = 1
"Dog" = 2
"Bit" = 3
The Issue: If your sentence is 5 words long, the last number is 5. If your sentence is 500 words long, the last number is 500. Neural Networks hate unbounded numbers. They work best when numbers are small and normalized (usually between -1 and 1). If we feed in a huge number like 500, it will blow up the gradients and ruin the training.
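A tiny toy example (not from the original post) shows the scale problem: the raw position index completely drowns out the embedding values the network is supposed to learn from.

import torch

torch.manual_seed(0)
embedding = torch.randn(8)   # typical embedding values, roughly between -3 and 3
print(embedding)
print(embedding + 500)       # "position 500" added naively: the meaning is gone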
Attempt #2: Fractions? ❌
Okay, let's normalize it.
First word = 0
Last word = 1
Everything else is a fraction in between.
The Issue: In a short sentence (5 words), the "step" between words is 0.25. In a long sentence (100 words), the "step" shrinks to about 0.01. The meaning of a position changes depending on the sentence length. We need a method where the "distance" between position 5 and 6 is consistent, regardless of how long the text is.
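A quick sketch in plain Python (a made-up helper, just for illustration) makes the inconsistency concrete:

def normalized_positions(num_words):
    # First word = 0, last word = 1, evenly spaced in between
    return [i / (num_words - 1) for i in range(num_words)]

print(normalized_positions(5)[1])    # step between neighbours: 0.25
print(normalized_positions(100)[1])  # step between neighbours: ~0.0101

The same "one word apart" relationship gets encoded as two very different numbers.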
The Solution: The "Wiggly" Timestamps (Sine & Cosine)
The authors of the Transformer paper (Attention Is All You Need) came up with a brilliant solution using waves.
Instead of adding a single number (like "3") to the word, they add a whole vector of values derived from Sine and Cosine waves.
The Analogy: The Multi-Hand Clock
Imagine a clock with many hands.
- The Second hand moves very fast (High frequency).
- The Minute hand moves slower.
- The Hour hand moves very slow (Low frequency).
If you look at the position of all the hands together, you get a unique "fingerprint" for that specific time.
Positional Encodings work the same way.
- For the first dimension of the vector, we use a wave that wiggles very fast.
- For the last dimension, we use a wave that wiggles very slowly.
By combining these, every single position (Index 1, Index 2, Index 100) gets a unique pattern of numbers, with every value staying neatly between -1 and 1.
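For the curious, here is the exact recipe from the paper written as a tiny scalar helper (pe_value is a hypothetical name, just for illustration): even dimensions use a sine wave, odd dimensions use a cosine wave, and the wave gets slower as the dimension index grows.

import math

def pe_value(pos, i, d_model):
    # Formula from "Attention Is All You Need":
    # PE(pos, 2k) = sin(pos / 10000^(2k / d_model)), PE(pos, 2k+1) = cos(same angle)
    angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
    return math.sin(angle) if i % 2 == 0 else math.cos(angle)

print(pe_value(3, 0, 512))    # early dimension: fast wiggle (the "second hand")
print(pe_value(3, 510, 512))  # late dimension: very slow wiggle (the "hour hand")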
The Visual "Addition"
We don't append this info; we actually add it.
Final Input = Word Embedding (Meaning) + Positional Encoding (Order)
The model learns that "Meaning" comes from the inherent value, and "Order" comes from these specific wave patterns added on top.
Show Me The Code (PyTorch)
You rarely have to write this from scratch (most Transformer codebases ship it as a small, reusable module), but here is what it looks like conceptually:
import torch
import math

def create_positional_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1)

    # The "Wiggle" Factor
    div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))

    # Apply Sine to even indices, Cosine to odd indices
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)

    return pe
# This creates a unique "timestamp" vector for every word position!
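A quick sanity check (toy sizes and made-up token IDs, assuming the function above is in scope) also shows the Final Input = Embedding + Positional Encoding step in action:

import torch.nn as nn

pe = create_positional_encoding(max_len=50, d_model=16)
print(pe.shape)            # torch.Size([50, 16]) -> one 16-dim "timestamp" per position
print(pe.min(), pe.max())  # every value stays between -1 and 1

embed = nn.Embedding(1000, 16)               # toy vocabulary of 1,000 tokens
tokens = torch.tensor([5, 123, 7, 999, 42])  # a fake 5-token sentence
final_input = embed(tokens) + pe[:5]         # meaning + order, shape (5, 16)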
Why This Is Genius
- Deterministic: The position of the 5th word is always the same mathematical pattern.
- Extensible: Because it uses waves, the model can, in principle, handle sequences longer than the ones it was trained on (the waves just keep going).
- Distance Awareness: The math allows the model to easily calculate the distance between words (e.g., Word A is 3 steps away from Word B) just by comparing their wave patterns.
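You can check that last claim numerically (continuing from the code above): the dot product between two positional "timestamps" depends only on how far apart they are, not on where they sit in the sentence.

pe = create_positional_encoding(max_len=200, d_model=64)
print(torch.dot(pe[10], pe[13]))    # positions 3 steps apart, early in the sequence
print(torch.dot(pe[150], pe[153]))  # positions 3 steps apart, much later -> virtually the same value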
Summary
- Transformers process words in parallel, so they lose order.
- We fix this by adding a Positional Vector to the Word Embedding.
- We use Sine and Cosine waves to create these vectors so they are unique, bounded, and consistent.
Now that our Transformer knows what the words mean (Embeddings) and where they are (Positional Encoding), it is finally ready to start understanding the relationships between them.
Next up in Blog #4: The superstar of the show. Self-Attention. (Or: How the word "Bank" knows if it's a river bank or a money bank).
Stay tuned!


