Yuvaraj
Transformers Encoder Deep Dive - Part 1

In my previous article, "What Are Transformers? Why Do They Dominate the AI World?", we explored the intuition behind this revolution. We saw how the "Search Warrant" (Attention) replaced the "Drunken Narrator" (RNNs) to solve the problem of long-distance memory.

But how does that logic actually live inside a machine? To understand that, we have to look at the "Blueprint."

The Master Map

When you look at the original architecture from the landmark paper "Attention Is All You Need," you see two main towers: the Encoder (left) and the Decoder (right).

Visualisation of the Encoder and Decoder Transformer architecture

In this series, we are going to break down these boxes into logical mental models. We want to understand not just what they are, but why they exist and how they enable the massive parallelism that makes modern AI so fast and powerful.

Before we jump into the Encoder's inner workings, we need to learn the "Language of the Machine."
Let's master the notations using a simple sentence:

"The dog bit the man."


1. What is d_model?

In the world of AI, words aren't letters; they are lists of numbers called vectors.
d_model is the dimension (or the length) of that list. If d_model = 512, it means every single word in our sentence is represented by a list of 512 different numbers. These numbers capture the "meaning" of the word.

Here is the step-by-step visual explanation of how a d_model vector is generated for a single word, using "The" as our example.

Step 1: The Vocabulary & One-Hot Vector

Before a model can understand a word, it needs a digital ID. We start with a huge list of every word the model knows (its Vocabulary).

We then create a very long vector that is almost entirely zeros, with a single "1" at the position corresponding to our word. This is called a One-Hot Vector. It's simple, but it doesn't capture any meaning—it's just an index.

Visualisation of finding the word's index in the vocabulary
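The idea above can be sketched in a few lines of NumPy. Note that the five-word vocabulary here is made up purely for illustration; a real model's vocabulary has tens of thousands of entries:

```python
import numpy as np

# Toy vocabulary (illustrative only; real vocabularies are huge)
vocab = ["the", "dog", "bit", "man", "ran"]

def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("the", vocab))  # [1. 0. 0. 0. 0.]
```

As the printout shows, the vector carries no meaning at all: it is just an index written as numbers.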

Step 2: The Embedding Matrix (The Lookup Table)

This is where the magic happens. The model has a giant, learnable matrix called the Embedding Matrix. You can think of it as a massive lookup table.

  • Rows: Each row corresponds to a word in the vocabulary.
  • Columns: The number of columns is the d_model size (we use 4 here for simplicity; real models use something like 512).

When we feed the One-Hot Vector into this matrix, the "1" acts like a selector switch. It activates and "selects" the corresponding row in the Embedding Matrix. This row contains a list of dense, learnable numbers that represent the word's meaning.

Visualisation of the one-hot vector selecting a row in the embedding matrix
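Here is a minimal sketch of the "selector switch" effect. The embedding values are random stand-ins (in a real model they are learned during training), but the key identity holds: multiplying by a one-hot vector is the same as picking out a row.

```python
import numpy as np

vocab = ["the", "dog", "bit", "man", "ran"]
d_model = 4  # kept tiny for illustration; real models use 512 or more

# Learnable embedding matrix: one row per vocabulary word
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), d_model))

# One-hot vector for "the" (index 0)
one_hot = np.zeros(len(vocab))
one_hot[vocab.index("the")] = 1.0

# Multiplying by the one-hot vector "activates" exactly one row...
selected = one_hot @ embedding_matrix

# ...which is identical to a direct row lookup
assert np.allclose(selected, embedding_matrix[vocab.index("the")])
```

This is why, in practice, frameworks skip the multiplication entirely and just index into the table.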

Step 3: The Final d_model Vector

The result of this lookup operation is a single, dense vector. This is the d_model vector for the word "The".

Instead of a sparse vector of zeros, we now have a compact list of numbers (of size d_model) that the model can use to perform mathematical operations. When we do this for every word in the sentence, we get the (Seq, d_model) input matrix covered in the next sections.

Visualisation of the final d_model vector


2. What is Sequence Length (Seq)?

This is simply the number of words (or tokens) we are feeding into the model at once.
For our sentence, "The dog bit the man":

  • The (1), dog (2), bit (3), the (4), man (5).
  • Our Sequence Length (Seq) = 5.

(In a real Transformer, text is split into subword tokens rather than whole words, but for this mental model we will treat each word as a token.)
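Under our simplified word-per-token assumption, computing the sequence length is just a split and a count:

```python
sentence = "The dog bit the man"
tokens = sentence.lower().split()  # naive word-per-token split (our assumption)
seq_len = len(tokens)

print(tokens)   # ['the', 'dog', 'bit', 'the', 'man']
print(seq_len)  # 5
```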


3. The Input Matrix (Seq, d_model)

When we stack these words together, we get our input matrix. Imagine a table where each row is a word and each column is a feature of that word. For our 5-word sentence with a model dimension of 4, it looks like this:

Visualisation of input matrix
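We can build this matrix by stacking one embedding row per token. The embedding values below are random placeholders for learned weights; the point is the shape:

```python
import numpy as np

vocab = ["the", "dog", "bit", "man"]
d_model = 4  # tiny for illustration
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), d_model))

tokens = ["the", "dog", "bit", "the", "man"]
# One embedding row per token: each row is a word, each column a feature
X = np.stack([embedding_matrix[vocab.index(t)] for t in tokens])

print(X.shape)  # (5, 4), i.e. (Seq, d_model)
```

Note that both occurrences of "the" get the identical row at this stage; it is Positional Encoding (covered in Part 2) that later makes them distinguishable.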


4. The Transpose Matrix (d_model, Seq)

To perform the "Search Warrant" logic, the model needs to compare words against each other. To do this mathematically, we transpose the matrix. We flip it so the rows become columns. This allows the model to look at the sentence from a different "angle."

Visualisation of transpose matrix
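In NumPy the flip is a single operation. Here X is a stand-in input matrix filled with counting numbers so the shape change is easy to see:

```python
import numpy as np

X = np.arange(20).reshape(5, 4)  # stand-in (Seq, d_model) matrix
X_T = X.T                        # rows become columns

print(X.shape)    # (5, 4)
print(X_T.shape)  # (4, 5), now (d_model, Seq)
```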


5. The Multiplication Concept (How the magic happens)

How does the model calculate how much "dog" relates to "bit"? It uses Matrix Multiplication.

Even if you aren't a math expert, the mental model is simple: We take a Row from our first matrix (a word) and multiply it against a Column from the second matrix (another word).

  1. We multiply the corresponding numbers.
  2. We sum them all up.
  3. The result is a single "Score" that represents the relationship between those two words.

Step-by-Step Calculation:
Here is a step-by-step visual breakdown of how the matrix multiplication (Seq, d_model) x (d_model, Seq) works. This process is what creates the "attention scores" between every word in the sentence.

Step 1: The Setup

We start with two matrices. Matrix A represents our input sentence where each row is a word vector. Matrix B is the transposed version, where each column is a word vector. We also have an empty Result Matrix where we will store the scores.

Visualisation of input matrix and transposed input matrix

Step 2: Calculating the First Cell

To find the score for how much the first word ("The") relates to itself, we take the dot product of the first row of Matrix A and the first column of Matrix B. We multiply corresponding elements and sum them up.

Visualisation of how to calculate first cell

Step 3: Moving to the Next Word

We stay on the first row of Matrix A ("The") but move to the second column of Matrix B ("dog"). The dot product of these two vectors gives us the score for how much "The" relates to "dog".

Visualisation of how to calculate next word

Step 4: Calculating for the Second Row

After completing the first row of the Result Matrix, we move to the second row of Matrix A ("dog") and reset to the first column of Matrix B ("The"). This gives us the score for how much "dog" relates to "The".

Visualisation of how to calculate the next row

Step 5: The Final Result Matrix

By repeating this row-by-column multiplication for every combination, we get a final (Seq x Seq) matrix. This is a map of all pairwise relationships in the sentence, which is the core of the self-attention mechanism.

Visualisation of the multiplied matrix
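All five steps collapse into one line of NumPy. With a random stand-in input matrix, the whole (Seq, Seq) map of pairwise scores comes out of a single matrix multiply:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))  # the (Seq, d_model) input matrix

# One matrix multiply computes every pairwise score at once
scores = X @ X.T
print(scores.shape)  # (5, 5), i.e. (Seq, Seq)

# Cell [i, j] is exactly the dot product of word i and word j
assert np.allclose(scores[1, 2], np.dot(X[1], X[2]))
```

No loop over word pairs appears anywhere; the hardware handles all combinations in one shot, which is the point of the next section.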


Why this matters

By representing our sentence as these matrices, the computer doesn't have to read the sentence word-by-word like a human (or an RNN). Because of this matrix structure, the hardware (GPU) can calculate all these word relationships at the same time.

This is the foundation of Parallelism.


References & Further Learning

If you want to dive even deeper into the original research or see these concepts in motion, I highly recommend checking out these foundational resources:

Official Paper: "Attention Is All You Need" (Vaswani et al., 2017) – The research paper that started it all.

Visual Guide: Transformers Explained Clearly – A fantastic YouTube deep-dive that helped me visualize the mechanics behind the math.


What's Next?

Now that we have our map and a grasp of the notations, we are ready to start building a mental model of the Encoder. In Part 2, we will start with Embeddings and Positional Encoding inside the Encoder — the process of turning raw text into these mathematical "Ingredients" and giving them a "Home Address" so the model knows the order of the sentence.
