In our journey so far, we have explored the high-level intuition of why Transformers exist and mapped out the blueprint and notations in Part 1.
Wait... What exactly is the Encoder?
Before we prep the ingredients, let's look at the "Chef."
In the Transformer diagram, the Encoder is the tower on the left.
If you think of a Transformer as a translation system:
- The Encoder is the "Scholar" who reads the English sentence and understands every hidden nuance, relationship, and bit of context.
- The Decoder is the "Writer" who takes that scholar's notes and writes the sentence out in French.
Where is it used?
While the original paper used both towers for translation, the tech world realized the Encoder is a powerhouse on its own. Below are a few examples:
- Search Engines: To understand the intent behind your query, not just the keywords.
- Sentiment Analysis: To "feel" if a product review is happy or angry.
- Text Classification: To automatically sort your emails into "Spam" or "Work."
In short: The Encoder's sole job is to turn a human sentence into a "Context-Rich" mathematical map.
Let's learn how to build this map.
1. Encoder Input: Why does the Encoder need (Seq, d_model)?
As we discussed in Part 1, the hardware (GPU) is built for speed. It doesn't want to read a sentence word-by-word. It wants a Matrix.
Specifically, the Encoder requires a matrix of shape (Seq, d_model).
- Seq (Sequence Length): The number of words (e.g., 5 for "The dog bit the man").
- d_model (Model Dimension): The "width" of our mathematical understanding (usually 512).
The Encoder's job is to understand and refine this matrix.
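To make that shape concrete, here is a minimal numpy sketch (the zero matrix is just a placeholder for illustration, not real data):

```python
import numpy as np

seq_len = 5      # "The dog bit the man" -> 5 tokens
d_model = 512    # the model "width" used in the original paper

# The Encoder expects one matrix for the whole sentence,
# not five separate word vectors processed one at a time.
encoder_input = np.zeros((seq_len, d_model))
print(encoder_input.shape)   # (5, 512)
```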
Let's learn how to feed input to the Encoder in this structure, using its two ingredients: Meaning (Embedding) and Order (Positional Encoding).
2. Input Embedding: Giving Words a Digital Identity
A computer doesn't know what a "dog" is. To a machine, "dog" is just a string of bits. Input Embedding is the process of giving that word a "Semantic Passport."
How is this vector actually generated?
To understand how we get our (Seq, d_model) matrix, let's follow the word "dog" through the three-step mechanical process we teased in Part 1:
Step 1: The ID Lookup (One-Hot Vector)
First, the model looks up "dog" in its dictionary (Vocabulary) and finds its unique ID (e.g., ID 432). It creates a One-Hot Vector: a massive list of zeros with a single 1 at position 432.
Recall from Part 1: This is a "Sparse" representation. It's huge, mostly empty, and contains zero meaning.
Step 2: The Embedding Matrix
This is where the magic happens. The model maintains a giant Embedding Matrix of size (Vocab_Size, d_model).
- Each row in this matrix is a 512-dimension vector.
- Initially, these numbers are random nonsense.
Step 3: The Row Selection
The model uses the ID from Step 1 to select exactly one row from this matrix. This row is our d_model vector. We have now successfully converted a "Sparse" ID into a "Dense" list of numbers.
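Here is a rough numpy sketch of those three steps. The ID 432 and d_model = 512 come from the text above; the 30,000-word vocabulary and the random initialization are illustrative assumptions:

```python
import numpy as np

vocab_size, d_model = 30_000, 512
rng = np.random.default_rng(0)

# Step 2: the Embedding Matrix -- starts as random numbers, learned later.
embedding_matrix = rng.normal(size=(vocab_size, d_model))

# Step 1: "dog" -> its vocabulary ID, represented as a sparse one-hot vector.
dog_id = 432
one_hot = np.zeros(vocab_size)
one_hot[dog_id] = 1.0

# Step 3: the row selection. Multiplying the one-hot vector by the matrix
# is mathematically the same as simply indexing row 432.
dog_vector = one_hot @ embedding_matrix          # dense vector, shape (512,)
assert np.allclose(dog_vector, embedding_matrix[dog_id])
```

In practice, libraries skip the one-hot multiplication entirely and just index the row directly, which is what layers like `torch.nn.Embedding` do under the hood.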
The Analogy: The Semantic Passport
Imagine every word in our sentence, "The dog bit the man," gets a passport with 512 pages (our d_model). Each page describes a feature that the model has learned:
- Page 1 (Noun-ness): High value for "dog", low for "bit".
- Page 2 (Living): High value for "dog" and "man".
- Page 3 (Animal): High value for "dog", low for "man".
💡 Important: These Values Change!
Unlike a fixed dictionary, these embedding values are learnable weights.
- Why? At the start of training, the model's "understanding" is random. As it reads millions of sentences, it uses backpropagation to adjust these numbers.
- The Goal: It learns that "dog" and "wolf" often appear in similar contexts (near words like "bark", "forest", or "pack"). To satisfy the math, it moves their 512-dimension coordinates closer together in space.
- The Result: The embedding vector evolves as the model gets "smarter."
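If you use PyTorch, you can see this directly: the embedding table is an ordinary trainable parameter (a small sketch, not the paper's code; the vocabulary size is an assumption):

```python
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=30_000, embedding_dim=512)

# The whole table is a learnable weight: backpropagation nudges these
# 30,000 x 512 numbers every training step, so the "passports" keep evolving.
print(embedding.weight.shape)           # torch.Size([30000, 512])
print(embedding.weight.requires_grad)   # True
```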
3. The Parallelism Paradox
In Part 1, we praised Transformers for their Parallelism. Unlike the "Drunken Narrator" (RNN), the Transformer looks at the entire input matrix at the same time.
The Paradox: If you look at every word simultaneously, you lose the sense of time.
To a Transformer, these two sentences look identical because the words are the same:
- "The dog bit the man."
- "The man bit the dog."
We have Meaning from our Embeddings, but we are missing Order.
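A tiny sketch of that blindness (the toy vocabulary and the small dimension are made up purely for illustration): without order information, both sentences hand the model the same set of embedding rows, just shuffled.

```python
import numpy as np

vocab = {"the": 0, "dog": 1, "bit": 2, "man": 3}
embedding_matrix = np.random.default_rng(0).normal(size=(len(vocab), 8))

s1 = ["the", "dog", "bit", "the", "man"]
s2 = ["the", "man", "bit", "the", "dog"]

e1 = embedding_matrix[[vocab[w] for w in s1]]   # (5, 8) matrix for sentence 1
e2 = embedding_matrix[[vocab[w] for w in s2]]   # (5, 8) matrix for sentence 2

# Any order-insensitive summary of the rows is identical for both sentences.
print(np.allclose(e1.sum(axis=0), e2.sum(axis=0)))   # True
```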
4. Positional Encoding: The "Home Address"
To solve this, we need to "stamp" a position onto our embeddings. This tells the model where each word is standing in line.
The Formulas (How to calculate)
For an index i in our 512-dimensional vector:
- For Even Steps (2i): PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
- For Odd Steps (2i+1): PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
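A direct translation of those two formulas into numpy (a sketch; the function name `positional_encoding` is mine):

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Build the (seq_len, d_model) sinusoidal position matrix."""
    pe = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]           # positions 0..seq_len-1, as a column
    two_i = np.arange(0, d_model, 2)            # the even indices 2i = 0, 2, 4, ...
    angle = pos / (10000 ** (two_i / d_model))  # pos / 10000^(2i/d_model)
    pe[:, 0::2] = np.sin(angle)                 # even steps get sine
    pe[:, 1::2] = np.cos(angle)                 # odd steps get cosine
    return pe

print(positional_encoding(5, 512).shape)   # (5, 512)
```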
Why Sine and Cosine?
The researchers used Sine and Cosine waves because they allow the model to understand relative positions. Because these functions oscillate, the model can easily learn that "word A is a specific distance away from word B."
> Using waves to create a unique mathematical "signature" for every seat in the sentence.
Important: These Values are CONSTANT
Unlike Embeddings, Positional Encodings are fixed.
- Why? The "meaning" of a word like "dog" might change as the model learns, but the "meaning" of Position #2 should never change. Position #2 is always Position #2. By keeping this constant, the model has a stable "grid" to map its learnable meanings onto.
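In PyTorch code this constancy usually shows up as a buffer rather than a parameter: the grid is saved with the model but never touched by the optimizer. A common pattern, sketched here (not the original implementation):

```python
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        two_i = torch.arange(0, d_model, 2, dtype=torch.float32)
        angle = pos / (10000 ** (two_i / d_model))
        pe[:, 0::2] = torch.sin(angle)
        pe[:, 1::2] = torch.cos(angle)
        # A buffer is part of the module's state but NOT a trainable parameter,
        # so backpropagation never changes the positional grid.
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, d_model) -- just add the constant grid for those positions.
        return x + self.pe[: x.size(0)]

print(list(PositionalEncoding(5, 512).parameters()))   # [] -- nothing for the optimizer to train
```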
5. Why do we ADD them? (Meaning + Order)
The researchers chose to ADD the Positional Encoding vector directly to the Input Embedding vector.
> Final Vector = Learnable Meaning + Constant Position
Why add instead of just attaching it to the end (concatenation)?
It keeps the matrix size at (Seq, d_model). We don't make the matrix "fatter," which keeps the hardware processing fast.
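A quick shape comparison makes the trade-off visible (both matrices here are random stand-ins, purely to show the shapes):

```python
import numpy as np

seq_len, d_model = 5, 512
emb = np.random.default_rng(0).normal(size=(seq_len, d_model))  # stand-in for the learnable meaning
pe  = np.random.default_rng(1).normal(size=(seq_len, d_model))  # stand-in for the constant sin/cos grid

added  = emb + pe
concat = np.concatenate([emb, pe], axis=-1)

print(added.shape)    # (5, 512)  -- same width, downstream layers stay the same size
print(concat.shape)   # (5, 1024) -- a "fatter" matrix that would force every layer to grow
```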
6. Summary: The Prepared Matrix
We have successfully prepared our ingredients. We started with raw text and ended with a refined (Seq, d_model) matrix where every word knows who it is and where it sits.
This matrix is finally ready to enter the first actual "box" of the Encoder.
What's Next?
Now that the input is prepared, it's time to feed the Encoder.
In Part 3, we will explore Multi-Head Self-Attention. This is where the words actually start "talking" to each other using the Matrix Multiplication logic we teased in Part 1. We'll see how the model decides which words are the most important in the sentence.
References
- Official Paper: Attention Is All You Need
- Visual Guide: Visualizing Positional Encoding, which helped me visualize the mechanics of Transformers.



