<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yuvaraj</title>
    <description>The latest articles on DEV Community by Yuvaraj (@iamyuvaraj).</description>
    <link>https://dev.to/iamyuvaraj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3751247%2F77f91644-df57-45a0-8ec8-7afd8af6cc5b.png</url>
      <title>DEV Community: Yuvaraj</title>
      <link>https://dev.to/iamyuvaraj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/iamyuvaraj"/>
    <language>en</language>
    <item>
      <title>Transformer - Encoder Deep Dive - Part 3: What is Self-Attention</title>
      <dc:creator>Yuvaraj</dc:creator>
      <pubDate>Sun, 08 Mar 2026 20:10:23 +0000</pubDate>
      <link>https://dev.to/iamyuvaraj/transformer-encoder-deep-dive-part-3-what-is-self-attention-1aen</link>
      <guid>https://dev.to/iamyuvaraj/transformer-encoder-deep-dive-part-3-what-is-self-attention-1aen</guid>
      <description>&lt;h3&gt;
  
  
  Recap
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embedding:&lt;/strong&gt; "The", "dog", "bit", "the", "man" each have a unique semantic identity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positional Encoding:&lt;/strong&gt; Each word now knows exactly where it sits in the sentence.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Wait... What exactly is the Encoder's job? &lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-2-3lig"&gt;Part 2&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The sole purpose of the Encoder is to understand Context. &lt;/p&gt;

&lt;p&gt;Take the example "The dog bit the man" and look at the word "bit".&lt;/p&gt;

&lt;p&gt;On its own, "bit" could mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A small piece of something (a "bit" of chocolate).&lt;/li&gt;
&lt;li&gt;The past tense of "bite" (the action).&lt;/li&gt;
&lt;li&gt;A digital 0 or 1 (a computer "bit").&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Encoder doesn't know which one it is until it pays Attention to the words around it through association. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Those words are like strangers in an elevator—they are standing &lt;strong&gt;near each other, but they aren't talking.&lt;/strong&gt; &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What exactly is "Self-Attention"?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Self:&lt;/strong&gt; The model is looking at the same sentence it is currently processing. It isn't looking at a dictionary or a translation yet; it's just looking at its own words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attention:&lt;/strong&gt; The model decides which other words in that sentence are relevant to the word it is currently "thinking" about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Definition:&lt;/strong&gt; Self-Attention is a mechanism that allows a word to "look" at every other word in its own sentence to find the context it needs to define itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Relationship" Logic&lt;/strong&gt;&lt;br&gt;
In our sentence "The dog bit the man," Self-Attention is the reason the model knows that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"dog" is related to "bit" (as the actor).&lt;/li&gt;
&lt;li&gt;"man" is related to "bit" (as the receiver).&lt;/li&gt;
&lt;li&gt;"the" is related to "dog" (telling us it's a specific dog).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without Self-Attention, the word "bit" is just a three-letter string. With Self-Attention, "bit" becomes a bridge that connects a subject (dog) to an object (man).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Attention&lt;/strong&gt; is the conversation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This matrix is now standing at the door of the first &lt;strong&gt;Multi-Head Attention&lt;/strong&gt; block.&lt;/p&gt;

&lt;p&gt;Let's understand &lt;strong&gt;Self-Attention&lt;/strong&gt; in this article.&lt;/p&gt;

&lt;p&gt;In a real Transformer, 8 of these heads work together to create 'Multi-Head Attention,' which we will glue together in Part 4.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqzlcosv0aniuspvqgc9.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqzlcosv0aniuspvqgc9.PNG" alt="visualisation about Encode and Decoder transformers archetecture" width="800" height="874"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Queries, Keys, and Values (Q, K, V)
&lt;/h3&gt;

&lt;p&gt;To calculate attention, we don't just use the input matrix as it is. &lt;br&gt;
&lt;strong&gt;Self-Attention&lt;/strong&gt; transforms our input matrix into three different versions of itself using three learnable weight matrices (W^Q, W^K, W^V).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Think of this like taking the same word and looking at it through three different lenses:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query (Q)&lt;/strong&gt; - "The Search": This is what a word is looking for.&lt;br&gt;
&lt;em&gt;Example: The word "dog" asks: "Is there an action in this sentence that I performed?"&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Key (K)&lt;/strong&gt; - "The Label": This is how a word identifies itself to others.&lt;br&gt;
&lt;em&gt;Example: The word "bit" says: "I am an action involving teeth."&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Value (V)&lt;/strong&gt; - "The Cargo": This is the actual information a word carries.&lt;br&gt;
&lt;em&gt;Example: The word 'dog' (Query) found a match with the label 'action' (Key) on the word 'bit.' It then reached inside the truck and took the 'biting information' (Value) to update its own identity.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Here is a series of four images that visually break down the concept of Self-Attention, using our "The dog bit the man" example.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The words "the", "dog", "bit", and "man" are present, but the model does not yet know how they are connected to each other in the sentence "The dog bit the man".&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The word vectors for "dog" and "bit" are isolated. The "dog" vector is a generic noun with no knowledge of the action it's about to perform.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hv9i0rd7j8eldrlyyg6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hv9i0rd7j8eldrlyyg6.png" alt="Visualized the dog, bit words as isolated matrix" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How does 'dog' find 'bit'?
&lt;/h3&gt;

&lt;p&gt;The "dog" vector (acting as the &lt;strong&gt;Query&lt;/strong&gt;) "shines a light" on all other words to find its match. The "bit" vector (acting as the &lt;strong&gt;Key&lt;/strong&gt;) responds strongly, creating a high &lt;strong&gt;Attention Score&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42xzygcl9lyyailpnppt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42xzygcl9lyyailpnppt.png" alt="Visualized how dog is finding its meaning with the word bit" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  The Transfer of Meaning
&lt;/h4&gt;

&lt;p&gt;Using the Attention Score as a weight, the "bit" vector's actual content (&lt;strong&gt;Value&lt;/strong&gt;) is transferred to the "dog" vector. The "dog" is now "absorbing" the meaning of the action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F476xnrf61uoxzxttfwwy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F476xnrf61uoxzxttfwwy.png" alt="Visualized how the bit meaning is transferred to the word dog" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Contextualized
&lt;/h3&gt;

&lt;p&gt;After the process, the "dog" vector is transformed. Its mathematical representation has changed (visualized here by the color blending), and it is now a "context-aware" vector that knows it is the &lt;strong&gt;subject of the bite&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The word 'dog' is no longer a generic noun; it’s a subject tied to an action.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqeo823ulagj3x8lj0udn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqeo823ulagj3x8lj0udn.png" alt="Visualized final formed matrix contains the subject who bite" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Self-Attention Formula: Deep Dive
&lt;/h3&gt;

&lt;p&gt;For the developers who want to see the code or the math, everything we just discussed (Query, Key, Value) is condensed into one famous formula:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66t7ymqb03ybsu01c71s.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66t7ymqb03ybsu01c71s.PNG" alt="Attention formula" width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is a step-by-step visual breakdown of the Self-Attention mechanism, using our sentence &lt;strong&gt;"The dog bit the man"&lt;/strong&gt;. We'll follow the mathematical formula and visualize how the input matrices are transformed at each stage.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: The Initial Learned Matrices (Q, K, V)
&lt;/h4&gt;

&lt;p&gt;Before any attention is calculated, the input matrix is multiplied by three separate, learnable weight matrices (W^Q, W^K, W^V) to create three new matrices: &lt;strong&gt;Query (Q)&lt;/strong&gt;, &lt;strong&gt;Key (K)&lt;/strong&gt;, and &lt;strong&gt;Value (V)&lt;/strong&gt;. These matrices are the starting point for our calculation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzoxby02sfbufw7dt0ge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzoxby02sfbufw7dt0ge.png" alt="Visualized Q,K,V matrices" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might be wondering: "If the input is just our 'The dog bit the man' matrix, why do Q, K, and V have different numbers?"&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This happens through Linear Transformation. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We take our input and multiply it by three separate "Weight Matrices." These weights are like filters or lenses that highlight different parts of the word's meaning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input × W^Q = Q&lt;/strong&gt;: This transformation extracts the "Question" part of the word.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input × W^K = K&lt;/strong&gt;: This extracts the "Label" part of the word.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input × W^V = V&lt;/strong&gt;: This extracts the "Cargo" (Content) part of the word.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why this matters: &lt;br&gt;
These &lt;strong&gt;W&lt;/strong&gt; matrices are learnable. At first, the model is bad at asking questions. But over time, it learns exactly how to adjust the numbers in &lt;strong&gt;W^Q&lt;/strong&gt; so that the word "dog" asks the perfect question to find its verb "bit."&lt;/p&gt;
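&lt;p&gt;The three transformations above can be sketched in a few lines of NumPy. This is an illustrative toy (random weights and a tiny d_model of 8 instead of 512), not the trained model:&lt;/p&gt;

```python
import numpy as np

np.random.seed(0)
seq_len, d_model = 5, 8                 # 5 words ("The dog bit the man"); toy width instead of 512

X = np.random.randn(seq_len, d_model)   # input matrix: embeddings + positional encoding

# Three separate learnable weight matrices (random here; training tunes them)
W_Q = np.random.randn(d_model, d_model)
W_K = np.random.randn(d_model, d_model)
W_V = np.random.randn(d_model, d_model)

Q = X @ W_Q   # the "Question" part of each word
K = X @ W_K   # the "Label" part of each word
V = X @ W_V   # the "Cargo" (content) part of each word

print(Q.shape, K.shape, V.shape)   # (5, 8) (5, 8) (5, 8)
```

&lt;p&gt;During training, backpropagation adjusts W^Q, W^K, and W^V so that these three "lenses" become genuinely useful.&lt;/p&gt;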

&lt;h4&gt;
  
  
  Step 2: The Compatibility Check (Raw Scores)
&lt;/h4&gt;

&lt;p&gt;Now, we calculate the "compatibility": how much every word should listen to every other word. To do this, we perform a &lt;strong&gt;Dot Product&lt;/strong&gt; between the Query (Q) and the transposed Key (K^T) matrices.&lt;/p&gt;

&lt;p&gt;💡 Quick Recall: If you need a refresher on how the math of multiplying these matrices works, check out &lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0#5-the-multiplication-concept-how-the-magic-happens"&gt;Step 5 of Part 1&lt;/a&gt;, where we saw how rows and columns collide to create a "Relationship Score."&lt;/p&gt;

&lt;p&gt;In this step, we multiply the Query of "dog" by the Transpose of the Key "bit".&lt;/p&gt;

&lt;p&gt;The Result: We get a raw "Attention Score."&lt;/p&gt;

&lt;p&gt;The Logic: If the "Search Query" of the dog matches the "Label" of the bite, the math produces a &lt;strong&gt;high number&lt;/strong&gt;. If they don't match, the number stays &lt;strong&gt;near zero&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example, the high score of &lt;code&gt;15.2&lt;/code&gt; between "dog" and "bit" indicates a strong connection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31sfi1ou93favvyzjmar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31sfi1ou93favvyzjmar.png" alt="Visualized dot product q and transposed k" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3: Scaling
&lt;/h4&gt;

&lt;p&gt;These are the two critical steps that turn raw, unstable scores into clear probabilities for the model.&lt;/p&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;3.1. The Scaling Step: Stabilizing the Math&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Before we can turn our scores into percentages, we have to manage their size. &lt;br&gt;
The raw scores from the dot product (Q * K^T) can be very large, especially with high-dimensional vectors (d_model=512).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is this a problem?&lt;/strong&gt; Large numbers can cause the training process to become unstable. The model's gradients can become too small ("vanishing gradients"), meaning it stops learning.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Now, why do we care if gradients get too small?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When we apply Softmax to very large numbers (like our unscaled scores of 15 or 20), the function becomes "extremely confident." It gives one word 99.999% of the attention and everything else 0.00001%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deep dive into this problem: What is a Gradient?&lt;/strong&gt;&lt;br&gt;
A gradient is a "directional signal" telling the model how to improve. When the numbers get too extreme, the signal becomes so weak that the model gets "confused" and stops learning.&lt;/p&gt;

&lt;p&gt;Let's imagine you are standing on a foggy mountain in the dark, and your goal is to reach the lowest valley (the "Loss" or "Error"). Because of the fog, you can’t see the bottom.&lt;/p&gt;

&lt;p&gt;The Gradient is like feeling the ground with your foot to see which way it slopes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your foot feels a steep slope downward, that is a &lt;strong&gt;Strong Gradient&lt;/strong&gt;. It tells you exactly which way to step to get closer to the bottom.&lt;/li&gt;
&lt;li&gt;If the ground feels almost perfectly flat, that is a &lt;strong&gt;Vanishing Gradient&lt;/strong&gt;. You have no idea which way to move to improve. You are stuck.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Solution:&lt;/strong&gt; We divide the raw scores by a scaling factor, which is the square root of the dimension of the keys &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2x9bhz6d8tp7w9n3eka9.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2x9bhz6d8tp7w9n3eka9.PNG" alt="Visualized square root of dimension of keys are formula" width="71" height="54"&gt;&lt;/a&gt;. This "squashes" the scores back into a manageable range without changing their relative order.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fife37u8lii0mslrbfdcj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fife37u8lii0mslrbfdcj.png" alt="Visualized scaling" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Softmax is applied row-wise.&lt;/p&gt;
&lt;/blockquote&gt;
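&lt;p&gt;A minimal sketch of the scaling step (the matrix values are illustrative): dividing by sqrt(d_k) shrinks the magnitudes without reordering them.&lt;/p&gt;

```python
import numpy as np

# Illustrative raw Q * K^T scores (e.g., 15.2 for the "dog" / "bit" pair)
scores = np.array([[15.2,  2.1, -3.0],
                   [ 1.0,  9.8,  0.5],
                   [-2.2,  0.7, 12.4]])

d_k = 64
scaled = scores / np.sqrt(d_k)   # sqrt(64) = 8.0, so 15.2 becomes 1.9

# The relative order within each row is unchanged; only the magnitudes shrink
print(scaled[0])
```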

&lt;h5&gt;
  
  
  &lt;strong&gt;3.1.1. What is d_k? (The Width of the Key)&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Remember our &lt;strong&gt;"&lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-2-3lig#the-analogy-the-semantic-passport"&gt;Semantic Passport&lt;/a&gt;"&lt;/strong&gt; analogy from Part 2? Each word has a vector of 512 numbers (d_model = 512). However, when it comes time to talk in the Engine Room, the model doesn't use all 512 pages at once.&lt;/p&gt;

&lt;p&gt;Instead, it breaks those 512 dimensions into smaller, specialized chunks. &lt;strong&gt;d_k&lt;/strong&gt; is the size of one of those chunks—typically &lt;strong&gt;64&lt;/strong&gt;.&lt;/p&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;3.1.2. Why not use all 512 at once? (The Specialization Problem)&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;You might ask: &lt;em&gt;"Why not just calculate one massive attention score for all 512 pages?"&lt;/em&gt; The answer is &lt;strong&gt;Specialization&lt;/strong&gt;. If you use all 512 dimensions at once, you get one single "Attention Score." This score becomes an &lt;strong&gt;average&lt;/strong&gt; of every word's relationship in the sentence, and in language, averages are dangerous.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Analogy:&lt;/strong&gt; Imagine you are at a business meeting. If you try to listen to the CEO, the Accountant, and the Engineer through one single "ear," their voices blur together. You might get the "average" topic, but you’ll miss the specific details of the budget or the technical specs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By breaking the 512 dimensions into &lt;strong&gt;8 chunks of 64&lt;/strong&gt;, the model creates 8 specialized "Attention Heads."&lt;/p&gt;

&lt;p&gt;Each head acts like a specialist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Head 1: Focuses on Grammar (Subject-Verb relationship).&lt;/li&gt;
&lt;li&gt;Head 2: Focuses on Entity Relationships (Who is the "dog" and who is the "man"?).&lt;/li&gt;
&lt;li&gt;Head 3: Focuses on the "Tense" or "Time" of the sentence.&lt;/li&gt;
&lt;li&gt;Head 4: ...&lt;/li&gt;
&lt;li&gt;Head n: ...&lt;/li&gt;
&lt;/ul&gt;
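&lt;p&gt;The chunking itself is just a reshape. A quick sketch of how 512 dimensions become 8 specialist views of 64 each:&lt;/p&gt;

```python
import numpy as np

d_model, n_heads = 512, 8
d_k = d_model // n_heads           # 64: the width of each specialist's chunk

np.random.seed(0)
x = np.random.randn(5, d_model)    # 5 words, 512 dims each

# Split each word's 512 numbers into 8 chunks of 64, one per head
heads = x.reshape(5, n_heads, d_k).transpose(1, 0, 2)
print(heads.shape)                 # (8, 5, 64): 8 specialists, each with its own 64-dim slice
```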

&lt;h4&gt;
  
  
  Step 4: The Softmax Step: The "Winner-Takes-All" Filter
&lt;/h4&gt;

&lt;p&gt;Now that our scores are stable, we need to convert them into probabilities that we can use as weights. This is where the &lt;strong&gt;Softmax function&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;Softmax is a mathematical function that takes a list of numbers (which can be positive, negative, or zero) and turns them into a list of probabilities that &lt;strong&gt;sum up to exactly 1.0 (or 100%)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is this useful?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Normalization:&lt;/strong&gt; It gives us a clear "attention budget" for each word. The total attention a word pays to the entire sentence must always be 100%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amplification:&lt;/strong&gt; It highlights the highest score and suppresses the lower ones. As seen in the image, the highest scaled score of &lt;strong&gt;1.9&lt;/strong&gt; gets a massive &lt;strong&gt;65%&lt;/strong&gt; of the attention, while the negative scores get almost none.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgckd3kdslkflbfxbuyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgckd3kdslkflbfxbuyy.png" alt="Visualized softmax calculation" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Softmax looks at each word individually (each row). It takes the 100% attention budget for that word and distributes it across the sentence."&lt;/p&gt;
&lt;/blockquote&gt;
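&lt;p&gt;Row-wise softmax can be sketched like this (the input row is illustrative; the "subtract the max" line is a standard numerical-stability trick):&lt;/p&gt;

```python
import numpy as np

def softmax_rows(scores: np.ndarray) -> np.ndarray:
    """Apply softmax independently to each row (each word's 100% attention budget)."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

scaled = np.array([[1.9, 0.3, -0.4, 0.8, -1.1]])  # one word's scaled scores (illustrative)
weights = softmax_rows(scaled)

print(weights.round(2))   # the highest score (1.9) wins the biggest share
print(weights.sum())      # every row sums to 1.0 (up to float rounding)
```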

&lt;p&gt;Let's visualize the softmax of the dot product (Q * K^T) divided by the scaling factor (the square root of the dimension of the keys):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4gbzstn6dlk8yxp691m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4gbzstn6dlk8yxp691m.png" alt="Visualized step softmax output" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: The Transfer of Meaning (Weighted Sum)
&lt;/h3&gt;

&lt;p&gt;Finally, we use the attention weights (probabilities) from Step 4 to create a &lt;strong&gt;weighted sum&lt;/strong&gt; of the &lt;strong&gt;Value (V)&lt;/strong&gt; matrix. This is the step where the actual context is transferred.&lt;/p&gt;

&lt;p&gt;For example, the new vector for "dog" is calculated by taking 80% of the "bit" Value vector, 5% of the "dog" Value vector, and so on. The result is a new matrix where each word's vector has been updated with information from the words it "paid attention" to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofbklodx1xvr0bbfnd5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofbklodx1xvr0bbfnd5k.png" alt="Vizualised weighted sum output" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;NOTE: this is just the report from 1 of 8 specialists (heads). In the next part, we'll see how the results from all 8 specialists (heads) are combined to form the final Multi-Head Attention output.&lt;/p&gt;

&lt;p&gt;One Specialist = Self-Attention&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h4&gt;
  
  
  Summary:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;The Attention Interface:&lt;/strong&gt; We have successfully turned our raw input into a contextual masterpiece. Q, K, and V gave us the tools for the search.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Q * K^T&lt;/strong&gt; found the relationships.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling &amp;amp; Softmax&lt;/strong&gt; stabilized the math and gave us clear percentages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value (V)&lt;/strong&gt; provided the cargo that updated our word meanings.&lt;/li&gt;
&lt;/ul&gt;
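&lt;p&gt;Putting the whole summary together, one attention head fits in a short function. A minimal NumPy sketch with toy sizes and random weights, not a production implementation:&lt;/p&gt;

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """One head: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V             # the three lenses
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # compatibility, scaled
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of cargo

np.random.seed(1)
X = np.random.randn(5, 8)                                 # 5 words, toy d_model = 8
W_Q, W_K, W_V = (np.random.randn(8, 4) for _ in range(3))  # project down to d_k = 4

out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)   # (5, 4): every word's vector updated with context from the others
```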




&lt;h3&gt;
  
  
  What’s Next?:
&lt;/h3&gt;

&lt;p&gt;We’ve seen how a single "specialist" (&lt;strong&gt;Self-Attention&lt;/strong&gt;) handles a 64-dimension chunk of our data. &lt;br&gt;
But our Encoder is a powerhouse that runs 8 of these specialists at the exact same time.&lt;/p&gt;

&lt;p&gt;In Part 4, we will dive into Multi-Head Attention to move deeper into the Transformer tower.&lt;/p&gt;

&lt;h4&gt;
  
  
  References
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;&lt;strong&gt;Official Paper:&lt;/strong&gt; Attention Is All You Need&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=dichIcUZfOw" rel="noopener noreferrer"&gt;&lt;strong&gt;Visual Guide:&lt;/strong&gt; Visualizing Positional Encoding&lt;/a&gt; that helped me visualize the mechanics of the transformers.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Transformers - Encoder Deep Dive - Part 2</title>
      <dc:creator>Yuvaraj</dc:creator>
      <pubDate>Mon, 16 Feb 2026 08:33:53 +0000</pubDate>
      <link>https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-2-3lig</link>
      <guid>https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-2-3lig</guid>
      <description>&lt;p&gt;In our journey so far, we have explored the &lt;a href="https://dev.to/iamyuvaraj/what-are-transformers-why-do-they-dominate-the-ai-world-2278"&gt;high-level intuition of why Transformers exist&lt;/a&gt; and mapped out the blueprint and notations in &lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0"&gt;Part 1&lt;/a&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Wait... What exactly is the Encoder?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before we prep the ingredients, let's look at the "Chef."&lt;/p&gt;

&lt;p&gt;In the Transformer diagram, the &lt;strong&gt;Encoder&lt;/strong&gt; is the tower on the left.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zk01td260yk9q89k6rs.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zk01td260yk9q89k6rs.PNG" alt="Visualisation explain about transformers architecture" width="800" height="874"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you think of a Transformer as a &lt;strong&gt;translation system&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Encoder&lt;/strong&gt; is the "Scholar" who reads the English sentence and understands every hidden nuance, relationship, and bit of context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Decoder&lt;/strong&gt; is the "Writer" who takes that scholar's notes and writes the sentence out in French.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Where is it used?&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;While the original paper used both towers for translation, the tech world realized the Encoder is a powerhouse on its own. Below are a few examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search Engines:&lt;/strong&gt; To understand the &lt;em&gt;intent&lt;/em&gt; behind your query, not just the keywords.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentiment Analysis:&lt;/strong&gt; To "feel" if a product review is happy or angry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text Classification:&lt;/strong&gt; To automatically sort your emails into "Spam" or "Work."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short: &lt;strong&gt;The Encoder's sole job is to turn a human sentence into a "Context-Rich" mathematical map.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Let's learn how to build this map.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Encoder Input: Why does the Encoder need &lt;code&gt;(Seq, d_model)&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;As we discussed in &lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0"&gt;Part 1&lt;/a&gt;, the hardware (GPU) is built for speed. It doesn't want to read a sentence word-by-word. It wants a &lt;strong&gt;Matrix&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Specifically, the Encoder requires a matrix of shape &lt;strong&gt;&lt;code&gt;(Seq, d_model)&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Seq&lt;/code&gt; (Sequence Length):&lt;/strong&gt; The number of words (e.g., 5 for "The dog bit the man").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;d_model&lt;/code&gt; (Model Dimension):&lt;/strong&gt; The "width" of our mathematical understanding (usually 512).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Encoder's job is to understand and refine this matrix.&lt;/p&gt;

&lt;p&gt;Let's learn how to feed input to the Encoder in this structure: &lt;strong&gt;Meaning&lt;/strong&gt; (Embedding) and &lt;strong&gt;Order&lt;/strong&gt; (Positional Encoding).&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Input Embedding: Giving Words a Digital Identity
&lt;/h2&gt;

&lt;p&gt;A computer doesn't know what a "dog" is. To a machine, "dog" is just a string of bits. &lt;strong&gt;Input Embedding&lt;/strong&gt; is the process of giving that word a &lt;strong&gt;"Semantic Passport."&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How is this vector actually generated?
&lt;/h3&gt;

&lt;p&gt;To understand how we get our &lt;code&gt;(Seq, d_model)&lt;/code&gt; matrix, let's follow the word &lt;strong&gt;"dog"&lt;/strong&gt; through the three-step mechanical process we teased in &lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0"&gt;Part 1&lt;/a&gt;:&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: The ID Lookup &lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0#step-1-the-vocabulary-amp-onehot-vector"&gt;One-Hot Vector&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;First, the model looks up "dog" in its dictionary (Vocabulary) and finds its unique ID (e.g., ID 432). It creates a &lt;strong&gt;One-Hot Vector&lt;/strong&gt;: a massive list of zeros with a single &lt;code&gt;1&lt;/code&gt; at position 432.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Recall from Part 1: This is a "Sparse" representation. It's huge, mostly empty, and contains zero meaning.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
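&lt;p&gt;In code, the one-hot vector is exactly what it sounds like (toy vocabulary size; ID 432 is from the example above):&lt;/p&gt;

```python
import numpy as np

vocab_size = 10_000
dog_id = 432              # "dog"'s unique ID in the vocabulary

one_hot = np.zeros(vocab_size)
one_hot[dog_id] = 1.0     # a single 1 in a sea of zeros

print(int(one_hot.sum()))  # 1 (sparse, huge, and meaning-free)
```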

&lt;h4&gt;
  
  
  Step 2: &lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0#step-2-the-embedding-matrix-the-lookup-table"&gt;The Embedding Matrix&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;This is where the magic happens. The model maintains a giant &lt;strong&gt;Embedding Matrix&lt;/strong&gt; of size &lt;code&gt;(Vocab_Size, d_model)&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each &lt;strong&gt;row&lt;/strong&gt; in this matrix is a 512-dimensional vector.&lt;/li&gt;
&lt;li&gt;Initially, these numbers are random nonsense.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Step 3: &lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0#step-3-the-final-raw-dmodel-endraw-vector"&gt;The Row Selection&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;The model uses the ID from Step 1 to select exactly one row from this matrix. This row is our &lt;strong&gt;&lt;code&gt;d_model&lt;/code&gt; vector&lt;/strong&gt;. We have now successfully converted a "Sparse" ID into a "Dense" list of numbers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptqjz1m59ae99tdom4f4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptqjz1m59ae99tdom4f4.png" alt="Visualisation explaining embedding vector" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  The Analogy: The Semantic Passport
&lt;/h3&gt;

&lt;p&gt;Imagine every word in our sentence, &lt;strong&gt;"The dog bit the man,"&lt;/strong&gt; gets a passport with 512 pages (our &lt;code&gt;d_model&lt;/code&gt;). Each page describes a feature that the model has learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Page 1 (Noun-ness):&lt;/strong&gt; High value for "dog", low for "bit".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Page 2 (Living):&lt;/strong&gt; High value for "dog" and "man".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Page 3 (Animal):&lt;/strong&gt; High value for "dog", low for "man".&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  💡 Important: These Values Change!
&lt;/h3&gt;

&lt;p&gt;Unlike a fixed dictionary, these embedding values are &lt;strong&gt;learnable weights&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why?&lt;/strong&gt; At the start of training, the model's "understanding" is random. As it reads millions of sentences, it uses backpropagation to adjust these numbers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Goal:&lt;/strong&gt; It learns that "dog" and "wolf" often appear in similar contexts (near words like "bark", "forest", or "pack"). To satisfy the math, it moves their 512-dimensional coordinates closer together in space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Result:&lt;/strong&gt; The embedding vector evolves as the model gets "smarter."&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3. The Parallelism Paradox
&lt;/h3&gt;

&lt;p&gt;In Part 1, we praised Transformers for their &lt;strong&gt;Parallelism&lt;/strong&gt;. Unlike the "Drunken Narrator" (RNN), the Transformer looks at the entire input matrix at the same time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Paradox:&lt;/strong&gt; If you look at every word simultaneously, you lose the sense of time.&lt;br&gt;
To a Transformer, these two sentences look identical because the words are the same:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"The &lt;strong&gt;dog&lt;/strong&gt; bit the &lt;strong&gt;man&lt;/strong&gt;."&lt;/li&gt;
&lt;li&gt;"The &lt;strong&gt;man&lt;/strong&gt; bit the &lt;strong&gt;dog&lt;/strong&gt;."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We have &lt;strong&gt;Meaning&lt;/strong&gt; from our Embeddings, but we are missing &lt;strong&gt;Order&lt;/strong&gt;.&lt;/p&gt;
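&lt;p&gt;A tiny NumPy sketch (hypothetical 4-dimensional embeddings) makes the paradox concrete: the two sentences produce matrices that differ only by a row permutation, so any order-blind summary of them is identical.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "dog": 1, "bit": 2, "man": 3}
E = rng.normal(size=(len(vocab), 4))   # hypothetical 4-dim embeddings

s1 = ["the", "dog", "bit", "the", "man"]
s2 = ["the", "man", "bit", "the", "dog"]

X1 = np.stack([E[vocab[w]] for w in s1])   # (Seq, d_model)
X2 = np.stack([E[vocab[w]] for w in s2])

# The rows differ (row 1 is "dog" vs "man")...
print(np.allclose(X1, X2))                          # False
# ...but as an unordered bag of vectors the two are identical:
print(np.allclose(X1.sum(axis=0), X2.sum(axis=0)))  # True
```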




&lt;h3&gt;
  
  
  4. Positional Encoding: The "Home Address"
&lt;/h3&gt;

&lt;p&gt;To solve this, we need to "stamp" a position onto our embeddings. This tells the model where each word is standing in line.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Formulas (How to calculate)
&lt;/h4&gt;

&lt;p&gt;For an index &lt;code&gt;i&lt;/code&gt; in our 512-dimensional vector:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;For Even Steps (&lt;code&gt;2i&lt;/code&gt;):&lt;/strong&gt;
&lt;code&gt;PE(pos, 2i) = sin(pos / 10000^(2i/d_model))&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For Odd Steps (&lt;code&gt;2i+1&lt;/code&gt;):&lt;/strong&gt;
&lt;code&gt;PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1rwh0bpmn85nospmyqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1rwh0bpmn85nospmyqu.png" alt="Visulaisation about positonal encoding calculation" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Why Sine and Cosine?
&lt;/h4&gt;

&lt;p&gt;The researchers used Sine and Cosine waves because they allow the model to understand &lt;strong&gt;relative positions&lt;/strong&gt;. Because these functions oscillate, the model can easily learn that "word A is a specific distance away from word B."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&amp;gt; Waves give every seat in the sentence a unique mathematical "signature."&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Important: These Values are CONSTANT
&lt;/h4&gt;

&lt;p&gt;Unlike Embeddings, Positional Encodings are &lt;strong&gt;fixed&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why?&lt;/strong&gt; The "meaning" of a word like "dog" might change as the model learns, but the "meaning" of &lt;strong&gt;Position #2&lt;/strong&gt; should never change. Position #2 is always Position #2. By keeping this constant, the model has a stable "grid" to map its learnable meanings onto.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  5. Why do we ADD them? (&lt;code&gt;Meaning + Order&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;The researchers chose to &lt;strong&gt;ADD&lt;/strong&gt; the Positional Encoding vector directly to the Input Embedding vector.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&amp;gt; Final Vector = Learnable Meaning + Constant Position&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why add instead of just attaching it to the end (concatenation)?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It keeps the matrix size at &lt;code&gt;(Seq, d_model)&lt;/code&gt;. We don't make the matrix "fatter," which keeps the hardware processing fast.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgv55phkyo12ifadlsmv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgv55phkyo12ifadlsmv.png" alt="Visualisation explains why do we need to add input embedding with positional encoding" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  6. Summary: The Prepared Matrix
&lt;/h3&gt;

&lt;p&gt;We have successfully prepared our ingredients. We started with raw text and ended with a refined &lt;strong&gt;&lt;code&gt;(Seq, d_model)&lt;/code&gt;&lt;/strong&gt; matrix where every word knows &lt;strong&gt;who it is&lt;/strong&gt; and &lt;strong&gt;where it sits&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This matrix is finally ready to enter the first actual "box" of the Encoder.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbaxk8ggqkdc29gdcbdb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbaxk8ggqkdc29gdcbdb.png" alt="Visualisation about the input matrix fed to encoder" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  What's Next?
&lt;/h3&gt;

&lt;p&gt;Now that the input is prepared, it’s time to feed the &lt;strong&gt;Encoder&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Part 3&lt;/strong&gt;, we will explore &lt;strong&gt;Multi-Head Self-Attention&lt;/strong&gt;. This is where the words actually start "talking" to each other using the Matrix Multiplication logic we teased in Part 1. We’ll see how the model decides which words are the most important in the sentence.&lt;/p&gt;

&lt;h4&gt;
  
  
  References
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;&lt;strong&gt;Official Paper:&lt;/strong&gt; Attention Is All You Need&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=dichIcUZfOw" rel="noopener noreferrer"&gt;&lt;strong&gt;Visual Guide:&lt;/strong&gt; Visualizing Positional Encoding&lt;/a&gt; that helped me visualize the mechanics of the transformers.&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Transformers Encoder Deep Dive - Part 1</title>
      <dc:creator>Yuvaraj</dc:creator>
      <pubDate>Sun, 15 Feb 2026 16:20:18 +0000</pubDate>
      <link>https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0</link>
      <guid>https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0</guid>
      <description>&lt;p&gt;In my previous article, &lt;strong&gt;&lt;a href="https://dev.to/iamyuvaraj/what-are-transformers-why-do-they-dominate-the-ai-world-2278"&gt;"What Are Transformers? Why Do They Dominate the AI World?"&lt;/a&gt;&lt;/strong&gt;, we explored the intuition behind this revolution. We saw how the &lt;strong&gt;"Search Warrant"&lt;/strong&gt; (Attention) replaced the &lt;strong&gt;"Drunken Narrator"&lt;/strong&gt; (RNNs) to solve the problem of long-distance memory.&lt;/p&gt;

&lt;p&gt;But how does that logic actually live inside a machine? To understand that, we have to look at the &lt;strong&gt;"Blueprint."&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Master Map
&lt;/h3&gt;

&lt;p&gt;When you look at the original architecture from the landmark paper &lt;em&gt;"&lt;a href="https://arxiv.org/html/1706.03762v7#S3" rel="noopener noreferrer"&gt;Attention Is All You Need&lt;/a&gt;,"&lt;/em&gt; you see two main towers: the &lt;strong&gt;Encoder&lt;/strong&gt; (left) and the &lt;strong&gt;Decoder&lt;/strong&gt; (right).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqzlcosv0aniuspvqgc9.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqzlcosv0aniuspvqgc9.PNG" alt="visualisation about Encode and Decoder transformers archetecture" width="800" height="874"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this series, we are going to break down these boxes into logical mental models. We want to understand not just what they are, but &lt;em&gt;why&lt;/em&gt; they exist and how they enable the massive &lt;strong&gt;parallelism&lt;/strong&gt; that makes modern AI so fast and powerful.&lt;/p&gt;

&lt;p&gt;Before we jump into the Encoder's inner workings, we need to learn the "Language of the Machine." &lt;br&gt;
Let's master the notations using a simple sentence: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"The dog bit the man."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  1. What is &lt;code&gt;d_model&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;In the world of AI, words aren't letters; they are lists of numbers called &lt;strong&gt;vectors&lt;/strong&gt;.&lt;br&gt;
&lt;strong&gt;&lt;code&gt;d_model&lt;/code&gt;&lt;/strong&gt; is the &lt;strong&gt;dimension&lt;/strong&gt; (or the length) of that list. If &lt;code&gt;d_model = 512&lt;/code&gt;, it means every single word in our sentence is represented by a list of 512 different numbers. These numbers capture the "meaning" of the word.&lt;/p&gt;

&lt;p&gt;Here is the step-by-step visual explanation of how a &lt;code&gt;d_model&lt;/code&gt; vector is generated for a single word, using "The" as our example.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: The Vocabulary &amp;amp; One-Hot Vector
&lt;/h4&gt;

&lt;p&gt;Before a model can understand a word, it needs a digital ID. We start with a huge list of every word the model knows (its &lt;strong&gt;Vocabulary&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;We then create a very long vector that is almost entirely zeros, with a single "1" at the position corresponding to our word. This is called a &lt;strong&gt;One-Hot Vector&lt;/strong&gt;. It's simple, but it doesn't capture any meaning—it's just an index.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frz5fj71b6tu3a46bhg6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frz5fj71b6tu3a46bhg6z.png" alt="visualisation explaining finding index in the vocabulary" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: The Embedding Matrix (The Lookup Table)
&lt;/h4&gt;

&lt;p&gt;This is where the magic happens. The model has a giant, learnable matrix called the &lt;strong&gt;Embedding Matrix&lt;/strong&gt;. You can think of it as a massive lookup table.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rows:&lt;/strong&gt; Each row corresponds to a word in the vocabulary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Columns:&lt;/strong&gt; The number of columns is the &lt;code&gt;d_model&lt;/code&gt; size (4 is used here for simplicity; a real model might use 512).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When we feed the One-Hot Vector into this matrix, the "1" acts like a selector switch. It activates and "selects" the corresponding row in the Embedding Matrix. This row contains a list of dense, learnable numbers that represent the word's meaning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faf4ic33xfo7gwl0exkd6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faf4ic33xfo7gwl0exkd6.png" alt="Visualisation explaining finding row with one hot vector in embedding matrix" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3: The Final &lt;code&gt;d_model&lt;/code&gt; Vector
&lt;/h4&gt;

&lt;p&gt;The result of this lookup operation is a single, dense vector. This is the &lt;strong&gt;&lt;code&gt;d_model&lt;/code&gt; vector&lt;/strong&gt; for the word "The".&lt;/p&gt;

&lt;p&gt;Instead of a sparse vector of zeros, we now have a compact list of numbers (of size &lt;code&gt;d_model&lt;/code&gt;) that the model can use to perform mathematical operations. When we do this for every word in the sentence, we get the &lt;code&gt;(Seq, d_model)&lt;/code&gt; input matrix you saw earlier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxnrf2atucpa7ns2xv6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxnrf2atucpa7ns2xv6b.png" alt="visulisation explaining final d_model vector" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  2. What is Sequence Length (&lt;code&gt;Seq&lt;/code&gt;)?
&lt;/h3&gt;

&lt;p&gt;This is simply the number of words (or tokens) we are feeding into the model at once.&lt;br&gt;
For our sentence, &lt;strong&gt;"The dog bit the man"&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The&lt;/strong&gt; (1), &lt;strong&gt;dog&lt;/strong&gt; (2), &lt;strong&gt;bit&lt;/strong&gt; (3), &lt;strong&gt;the&lt;/strong&gt; (4), &lt;strong&gt;man&lt;/strong&gt; (5).&lt;/li&gt;
&lt;li&gt;Our &lt;strong&gt;Sequence Length (&lt;code&gt;Seq&lt;/code&gt;) = 5&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;(A real Transformer splits text into sub-word tokens for efficiency, but for this mental model we will treat each word as one token.)&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  3. The Input Matrix &lt;code&gt;(Seq, d_model)&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;When we stack these words together, we get our input matrix. Imagine a table where each row is a word and each column is a feature of that word. For our 5-word sentence with a model dimension of 4, it looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlt2o83xyurt6b2uj950.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlt2o83xyurt6b2uj950.PNG" alt="Visualisation of input matrix" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4. The Transpose Matrix &lt;code&gt;(d_model, Seq)&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;To perform the &lt;a href="https://dev.to/iamyuvaraj/what-are-transformers-why-do-they-dominate-the-ai-world-2278#2-how-the-transformer-processes-it-the-search-warrant"&gt;"Search Warrant"&lt;/a&gt; logic, the model needs to compare words against each other. To do this mathematically, we &lt;strong&gt;transpose&lt;/strong&gt; the matrix. We flip it so the rows become columns. This allows the model to look at the sentence from a different "angle."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmr57midpgrexubulwpb2.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmr57midpgrexubulwpb2.PNG" alt="Visualisation of transpose matrix" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  5. The Multiplication Concept (How the magic happens)
&lt;/h3&gt;

&lt;p&gt;How does the model calculate how much "dog" relates to "bit"? It uses &lt;strong&gt;Matrix Multiplication&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Even if you aren't a math expert, the mental model is simple: We take a &lt;strong&gt;Row&lt;/strong&gt; from our first matrix (a word) and multiply it against a &lt;strong&gt;Column&lt;/strong&gt; from the second matrix (another word).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We multiply the corresponding numbers.&lt;/li&gt;
&lt;li&gt;We sum them all up.&lt;/li&gt;
&lt;li&gt;The result is a single "Score" that represents the relationship between those two words.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step-by-Step Calculation:&lt;/strong&gt;&lt;br&gt;
Here is a step-by-step visual breakdown of how the matrix multiplication &lt;code&gt;(Seq, d_model) x (d_model, Seq)&lt;/code&gt; works. This process is what creates the "attention scores" between every word in the sentence.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: The Setup
&lt;/h4&gt;

&lt;p&gt;We start with two matrices. &lt;strong&gt;Matrix A&lt;/strong&gt; represents our input sentence where each row is a word vector. &lt;strong&gt;Matrix B&lt;/strong&gt; is the transposed version, where each column is a word vector. We also have an empty &lt;strong&gt;Result Matrix&lt;/strong&gt; where we will store the scores.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9rbf41wdcxkbmiwlf9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9rbf41wdcxkbmiwlf9t.png" alt="Visualisation of input matrix and transposed input matrix" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: Calculating the First Cell
&lt;/h4&gt;

&lt;p&gt;To find the score for how much the first word ("The") relates to itself, we take the &lt;strong&gt;dot product&lt;/strong&gt; of the first row of Matrix A and the first column of Matrix B. We multiply corresponding elements and sum them up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4urtblfnuubq8l59ldf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4urtblfnuubq8l59ldf.png" alt="Visualisation of how to calculate first cell" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3: Moving to the Next Word
&lt;/h4&gt;

&lt;p&gt;We stay on the first row of Matrix A ("The") but move to the &lt;em&gt;second&lt;/em&gt; column of Matrix B ("dog"). The dot product of these two vectors gives us the score for how much "The" relates to "dog".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxj12h7osc2k5loor83pe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxj12h7osc2k5loor83pe.png" alt="Visualisation of how to calculate next word" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 4: Calculating for the Second Row
&lt;/h4&gt;

&lt;p&gt;After completing the first row of the Result Matrix, we move to the &lt;em&gt;second&lt;/em&gt; row of Matrix A ("dog") and reset to the first column of Matrix B ("The"). This gives us the score for how much "dog" relates to "The".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkuxjhutou0jqgz6lawb0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkuxjhutou0jqgz6lawb0.png" alt="Visulaisation of how to calculate next row" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 5: The Final Result Matrix
&lt;/h4&gt;

&lt;p&gt;By repeating this row-by-column multiplication for every combination, we get a final &lt;code&gt;(Seq x Seq)&lt;/code&gt; matrix. This is a map of all pairwise relationships in the sentence, which is the core of the self-attention mechanism.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F216mankjwzo9eqxv1w39.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F216mankjwzo9eqxv1w39.png" alt="Visulaisation of multiplied matrix" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;By representing our sentence as these matrices, the computer doesn't have to read the sentence word-by-word like a human (or an RNN). Because of this matrix structure, the hardware (GPU) can calculate all these word relationships &lt;strong&gt;at the same time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the foundation of &lt;strong&gt;Parallelism.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  References &amp;amp; Further Learning
&lt;/h2&gt;

&lt;p&gt;If you want to dive even deeper into the original research or see these concepts in motion, I highly recommend checking out these foundational resources:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Official Paper:&lt;/strong&gt; &lt;a href="https://arxiv.org/html/1706.03762v7" rel="noopener noreferrer"&gt;"Attention Is All You Need"&lt;/a&gt; (Vaswani et al., 2017) – The research paper that started it all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visual Guide:&lt;/strong&gt; &lt;a href="https://www.youtube.com/watch?v=bCz4OMemCcA" rel="noopener noreferrer"&gt;Transformers Explained Clearly – A fantastic YouTube deep-dive&lt;/a&gt; that helped me visualize the mechanics behind the math.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Now that we have our map and an understanding of the notations, we are ready to build the mental model of the Encoder. In &lt;strong&gt;&lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-2-3lig"&gt;Part 2&lt;/a&gt;&lt;/strong&gt;, we will start with &lt;strong&gt;Embeddings and Positional Encoding inside the Encoder&lt;/strong&gt;: the process of turning raw text into these mathematical "Ingredients" and giving them a "Home Address" so the model knows the order of the sentence.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>What are Transformers, Why do they Dominate the AI World?</title>
      <dc:creator>Yuvaraj</dc:creator>
      <pubDate>Sun, 15 Feb 2026 10:47:14 +0000</pubDate>
      <link>https://dev.to/iamyuvaraj/what-are-transformers-why-do-they-dominate-the-ai-world-2278</link>
      <guid>https://dev.to/iamyuvaraj/what-are-transformers-why-do-they-dominate-the-ai-world-2278</guid>
      <description>&lt;p&gt;In the world of AI, we have to deal with &lt;strong&gt;Sequences&lt;/strong&gt;—data where the order isn't just a detail; it's the entire meaning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Language:&lt;/strong&gt; "The dog bit the man" is a very different story than "The man bit the dog."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code:&lt;/strong&gt; &lt;code&gt;x = 5; y = x + 2;&lt;/code&gt; works. &lt;code&gt;y = x + 2; x = 5;&lt;/code&gt; crashes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To process these sequences, AI has evolved through two major architectural eras:&lt;/p&gt;




&lt;h4&gt;
  
  
  1. RNNs (The "Linked List" Approach)
&lt;/h4&gt;

&lt;p&gt;For years, &lt;strong&gt;Recurrent Neural Networks (RNNs)&lt;/strong&gt; were the industry standard. They treat a sentence like a &lt;strong&gt;Ticker Tape&lt;/strong&gt; or a &lt;strong&gt;Linked List&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Logic:&lt;/strong&gt; To understand word #10, the model &lt;em&gt;must&lt;/em&gt; first pass through words #1 through #9 in a strict, sequential order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Constraint:&lt;/strong&gt; It’s a &lt;code&gt;for&lt;/code&gt; loop. It cannot skip ahead, and it cannot process word #10 until word #9 is finished.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhklgogyd2i8s3egxx9gp.png" alt="Visualisation of how the sentence is processed by an RNN" width="800" height="436"&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Transformers (The "Random Access" Approach)
&lt;/h4&gt;

&lt;p&gt;In 2017, a landmark paper titled &lt;em&gt;"Attention is All You Need"&lt;/em&gt; introduced the &lt;strong&gt;Transformer&lt;/strong&gt;. It stopped treating sentences like strings to be iterated over and started treating them like &lt;strong&gt;Arrays with an Index&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Logic:&lt;/strong&gt; It doesn't wait in line. It takes a &lt;strong&gt;Snapshot&lt;/strong&gt; of the entire sequence at once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Breakthrough:&lt;/strong&gt; It sees every item in the "Array" simultaneously. It understands the relationship between word #1 and word #10 without having to "walk" the distance between them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlnwrafaef15hx1f9cwf.png" alt="Visualisation of how the sentence is processed by a Transformer" width="800" height="436"&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How AI "Behaves": The Language Learner Analogy
&lt;/h3&gt;

&lt;p&gt;To understand why the AI world shifted toward Transformers, we need to look at how these models actually experience a sentence. It’s the difference between struggling through a translation and reading fluently.&lt;/p&gt;

&lt;h4&gt;
  
  
  The RNN: The "Beginner Learner"
&lt;/h4&gt;

&lt;p&gt;Imagine an adult who has just started learning a new language. They are reading a long, complex sentence with a dictionary in hand.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Struggle:&lt;/strong&gt; They translate the first word, then the second, then the third. Because they process data sequentially, their mental energy is entirely spent on the &lt;em&gt;current&lt;/em&gt; word.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Result:&lt;/strong&gt; By the time they reach the end of a long paragraph, the specific details of the beginning have started to fade. They have a very narrow "window" of focus. If the beginning of the sentence affects the end, they often have to stop and re-read.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  The Transformer: The "Fluent Reader"
&lt;/h4&gt;

&lt;p&gt;Now, imagine a fluent adult reading the same sentence. They don’t process it word-by-word in a vacuum.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Behavior:&lt;/strong&gt; Their eyes scan the entire block of text almost instantly. Even as they read the final word, they remain "aware" of the subject at the very beginning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Advantage:&lt;/strong&gt; They can ignore "filler" words and focus their &lt;strong&gt;Attention&lt;/strong&gt; only on the words that carry the most meaning. They aren't just "remembering" the start of the sentence; they are actively &lt;strong&gt;connecting&lt;/strong&gt; it to the end.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The "Memory" Problem
&lt;/h2&gt;

&lt;p&gt;The problem with being a "Beginner Learner" (RNN) isn't just that it’s slow—it’s that it's &lt;strong&gt;unreliable over long distances.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzfabo8eu9r0biecfeyvv.png" alt="Visualisation of RNN vs Transformer processing the sentence, with the analogy explained above" width="800" height="436"&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deep dive to understand it better
&lt;/h3&gt;

&lt;p&gt;To truly understand why the AI world moved toward Transformers, we need to see where the "old way" breaks. In linguistics, we call this a &lt;strong&gt;Long-Range Dependency&lt;/strong&gt; problem—when two words that need each other are separated by a long distance.&lt;/p&gt;

&lt;p&gt;Let’s look at this deceptively simple sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The &lt;strong&gt;keys&lt;/strong&gt; to the old house that my grandfather built in 1945 were &lt;strong&gt;lost&lt;/strong&gt;."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  The Challenge:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The subject is &lt;strong&gt;"keys"&lt;/strong&gt; (plural).&lt;/li&gt;
&lt;li&gt;Therefore, the verb at the end must be &lt;strong&gt;"were lost"&lt;/strong&gt; (plural).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Between the subject and the verb sit three singular distractors designed to confuse the model: &lt;em&gt;house&lt;/em&gt;, &lt;em&gt;grandfather&lt;/em&gt;, and &lt;em&gt;1945&lt;/em&gt;.&lt;/p&gt;




&lt;h4&gt;
  
  
  1. How the RNN processes it: The Drunken Narrator
&lt;/h4&gt;

&lt;p&gt;As our "Beginner Learner," the RNN must carry the memory of the first word through the entire sentence, step-by-step. We call this the &lt;strong&gt;Drunken Narrator&lt;/strong&gt; effect. Imagine a narrator telling a story, but with every new word, their memory of the start gets a little fuzzier.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start:&lt;/strong&gt; The RNN reads "The &lt;strong&gt;keys&lt;/strong&gt;." Internal state: &lt;em&gt;Subject is Plural&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Middle:&lt;/strong&gt; It reads "...house..." The memory updates. The singular "house" slightly dilutes the "plural" signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distraction:&lt;/strong&gt; It reads "...grandfather... 1945." After three singular nouns in a row, the original "plural" signal is now a faint whisper.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure:&lt;/strong&gt; It reaches the end and needs to predict the verb. Since the most recent memory is singular ("1945"), it incorrectly predicts: &lt;strong&gt;"was lost."&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Technical Reality:&lt;/strong&gt; This is the &lt;strong&gt;Vanishing Gradient&lt;/strong&gt; problem. In long sequences, the training signal from the beginning shrinks a little at every step, so by the time it reaches the end it has effectively "vanished."&lt;/p&gt;
&lt;/blockquote&gt;
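&lt;p&gt;We can sketch this fading signal numerically. The decay factor and step counts below are illustrative, not taken from any real network; the point is only that multiplying by a value below 1 at every word shrinks the signal exponentially:&lt;/p&gt;

```javascript
// Toy illustration of the vanishing gradient: backpropagating through
// each time step multiplies the gradient by a local derivative < 1.
function survivingSignal(steps, localDerivative = 0.5) {
  let signal = 1.0; // strength of the "keys is plural" signal at the start
  for (let i = 0; i < steps; i++) {
    signal *= localDerivative; // shrinks at every word in between
  }
  return signal;
}

console.log(survivingSignal(3));  // short sentence: 0.125
console.log(survivingSignal(12)); // long sentence: ~0.000244
```

&lt;p&gt;After a dozen words, the original signal is a fraction of a percent of its starting strength. That is the "faint whisper" our narrator is left with.&lt;/p&gt;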

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qyva92nds11iinjedeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qyva92nds11iinjedeg.png" alt="Visualization explaining how RNN processing sentence" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  2. How the Transformer processes it: The Search Warrant
&lt;/h4&gt;

&lt;p&gt;The Transformer (our "Fluent Reader") doesn't struggle with memory. It processes the whole sentence at once using the &lt;strong&gt;Search Warrant&lt;/strong&gt; approach.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Setup:&lt;/strong&gt; The Transformer takes a snapshot. It sees "keys" and "lost" simultaneously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Query:&lt;/strong&gt; To understand the word "&lt;strong&gt;lost&lt;/strong&gt;," it doesn't rely on a fading memory. It issues a &lt;strong&gt;Query&lt;/strong&gt; to every other word: &lt;em&gt;"Who is the subject of being lost?"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Attention:&lt;/strong&gt; "House" and "Grandfather" return low scores, while "&lt;strong&gt;Keys&lt;/strong&gt;" returns a massive &lt;strong&gt;Attention Score&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Success:&lt;/strong&gt; The model forms a direct, high-speed connection between "&lt;strong&gt;lost&lt;/strong&gt;" and "&lt;strong&gt;keys&lt;/strong&gt;," ignoring the "distance" entirely. It correctly predicts: &lt;strong&gt;"were lost."&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Technical Reality:&lt;/strong&gt; This is &lt;strong&gt;Self-Attention&lt;/strong&gt;. It allows any word to "attend" to any other word in the sequence, so the effective distance between any two words is a single step, no matter how far apart they sit.&lt;/p&gt;
&lt;/blockquote&gt;
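&lt;p&gt;The Query and Attention steps can be sketched in a few lines. The word vectors here are invented for illustration (real models learn them, and use separate query/key projections), but the mechanics are the real ones: dot products turned into a probability distribution by softmax.&lt;/p&gt;

```javascript
// Toy self-attention: "lost" scores every other word by dot product,
// then softmax turns the scores into attention weights that sum to 1.
// The 2-D vectors are invented for illustration; real models learn them.
const words = {
  keys:        [0.9, 0.1],
  house:       [0.1, 0.8],
  grandfather: [0.2, 0.7],
  lost:        [1.0, 0.1], // the querying word
};

const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

function attentionWeights(queryWord) {
  const q = words[queryWord];
  const others = Object.keys(words).filter(w => w !== queryWord);
  const scores = others.map(w => dot(q, words[w]));
  const exps = scores.map(Math.exp);             // softmax, step 1: exponentiate
  const total = exps.reduce((a, b) => a + b, 0); // softmax, step 2: normalize
  return Object.fromEntries(others.map((w, i) => [w, exps[i] / total]));
}

console.log(attentionWeights("lost")); // "keys" gets the largest weight
```

&lt;p&gt;Because every word can issue such a query against every other word in the same snapshot, no signal has to survive a step-by-step relay.&lt;/p&gt;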

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftv9rheky8fn0cfgyg7zd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftv9rheky8fn0cfgyg7zd.png" alt="Visualization explaining how transformer processing sentence" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Summary: Why Transformers are better
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;RNN (Drunken Narrator)&lt;/th&gt;
&lt;th&gt;Transformer (Search Warrant)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Processing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sequential (Slow)&lt;/td&gt;
&lt;td&gt;Parallel (Fast)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fades with distance&lt;/td&gt;
&lt;td&gt;Perfect, direct access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The "Keys" Test&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Fails&lt;/strong&gt; (confused by "1945")&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Succeeds&lt;/strong&gt; (looks at "keys" directly)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Conclusion: The New Standard
&lt;/h2&gt;

&lt;p&gt;The transition from RNNs to Transformers wasn't just a minor upgrade; it was a fundamental shift in how we handle information. By moving from the &lt;strong&gt;"Drunken Narrator"&lt;/strong&gt; (Sequential) to the &lt;strong&gt;"Search Warrant"&lt;/strong&gt; (Parallel), we unlocked the ability to train on the massive scale of data that powers today’s LLMs.&lt;/p&gt;

&lt;p&gt;As a developer, understanding this shift is crucial. It’s the difference between building a system that merely follows a loop and one that understands the entire context of its environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Frontend Perspective
&lt;/h3&gt;

&lt;p&gt;As a &lt;strong&gt;Frontend Developer&lt;/strong&gt;, I’m used to thinking about state and data flow. Seeing how Transformers manage 'context' through &lt;strong&gt;Attention&lt;/strong&gt; feels remarkably similar to modern state management—it's about making sure the right information is available at the right time, regardless of where it lives in the application.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vaswani, A., et al. (2017).&lt;/strong&gt; &lt;em&gt;&lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;"Attention Is All You Need"&lt;/a&gt;&lt;/em&gt;. &lt;em&gt;Advances in Neural Information Processing Systems.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;What’s Next?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Let me know in the comments: which analogy clicked better for you, the "language learner" or the "Search Warrant"?&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>transformers</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>React Compiler: Stop Manually Optimizing Your React Apps</title>
      <dc:creator>Yuvaraj</dc:creator>
      <pubDate>Tue, 03 Feb 2026 21:47:00 +0000</pubDate>
      <link>https://dev.to/iamyuvaraj/react-compiler-stop-manually-optimizing-your-react-apps-5co4</link>
      <guid>https://dev.to/iamyuvaraj/react-compiler-stop-manually-optimizing-your-react-apps-5co4</guid>
<description>&lt;p&gt;During our team KATA session, a colleague asked a question that I bet you've thought about too:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"If React already knows to only render the elements that changed, why do we need to optimize anything manually?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was a brilliant question. The answer reveals a major pain point we’ve lived with for years, and it shows exactly why the React Compiler exists. Let's see how it addresses the problem.&lt;/p&gt;

&lt;p&gt;Let’s take a journey through the evolution of React optimization, using a simple analogy: &lt;strong&gt;The Restaurant Kitchen&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  🍝 The Restaurant Kitchen: How React &lt;em&gt;Actually&lt;/em&gt; Works
&lt;/h2&gt;

&lt;p&gt;Imagine your App is a kitchen.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Head Chef (Parent Component):&lt;/strong&gt; Manages the kitchen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Line cooks (Child Components):&lt;/strong&gt; Cook their own dishes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a standard React app, every time the Head Chef changes something—even just restocking the salt—they ring a giant bell. &lt;strong&gt;Every single cook stops and redoes their work from scratch&lt;/strong&gt;, even if the ingredients for their dish haven’t changed.&lt;/p&gt;

&lt;p&gt;This is React’s default behavior: &lt;strong&gt;When a parent re-renders, all children re-render.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For years, to stop this waste, we had to write additional code (memoization hooks) to tell React's optimization machinery what was safe to skip. Let’s look at how a single component evolved: first without hooks, then with manual hooks, and finally with the React Compiler optimizing it automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Evolution of a Component
&lt;/h2&gt;

&lt;p&gt;Let's look at a &lt;code&gt;RestaurantMenu&lt;/code&gt; that does three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Holds a list of dishes.&lt;/li&gt;
&lt;li&gt;Filters them (an expensive calculation).&lt;/li&gt;
&lt;li&gt;Renders a list of items (child components).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Phase 1: The Code (Clean but Slow)
&lt;/h3&gt;

&lt;p&gt;Here is the code most beginners write. It looks clean, but it has hidden performance traps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;useState&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;react&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// A simple child component&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;DishList&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;dishes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;onOrder&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;🍝 Rendering DishList (Child)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// &amp;lt;--- Watch this log!&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* items... */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;;
&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;RestaurantMenu&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;allDishes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;theme&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setCategory&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pasta&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// ⚠️ PROBLEM 1: Expensive Calculation runs every render&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filteredDishes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;allDishes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;🧮 Filtering... (Slow Math)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handleOrder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Ordered:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt; &lt;span class="nx"&gt;className&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* Clicking this causes a re-render */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt; &lt;span class="nx"&gt;onClick&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setCategory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;salad&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Switch&lt;/span&gt; &lt;span class="nx"&gt;Category&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/button&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* ⚠️ PROBLEM 2: Inline Arrow Function */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* Writing (dish) =&amp;gt; handleOrder(dish) creates a BRAND NEW function 
          in memory every single time this component renders. 
          This forces DishList to re-render. */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;DishList&lt;/span&gt; 
        &lt;span class="nx"&gt;dishes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;filteredDishes&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; 
        &lt;span class="nx"&gt;onOrder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;handleOrder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt; 
      &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happens in the Console?&lt;/strong&gt;&lt;br&gt;
Even if the parent re-renders for a minor reason (or if we click the button), everything runs again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🖥️ CONSOLE OUTPUT:
---------------------------------------------
🧮 Filtering... (Slow Math)
🍝 Rendering DishList (Child)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(Every single interaction triggers these logs. Wasteful!)&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Phase 2: The Solution with Hooks (Manual Instructions)
&lt;/h3&gt;

&lt;p&gt;To fix this in React, we had to introduce hooks: we wrap the calculation in &lt;code&gt;useMemo&lt;/code&gt;, the handler in &lt;code&gt;useCallback&lt;/code&gt;, and the child component in &lt;code&gt;memo&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;useMemo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;useCallback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;memo&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;react&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Solution A: Wrap child in memo to prevent useless re-renders&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;DishList&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;memo&lt;/span&gt;&lt;span class="p"&gt;(({&lt;/span&gt; &lt;span class="nx"&gt;dishes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;onOrder&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;🍝 Rendering DishList (Child)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* items... */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;;
&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;RestaurantMenu&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;allDishes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;theme&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setCategory&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pasta&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Solution B: Cache calculation with useMemo&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filteredDishes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useMemo&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;🧮 Filtering... (Slow Math)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;allDishes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;allDishes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt; 

  &lt;span class="c1"&gt;// Solution C: Freeze function with useCallback&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handleOrder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useCallback&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Ordered:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[]);&lt;/span&gt; 

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt; &lt;span class="nx"&gt;className&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt; &lt;span class="nx"&gt;onClick&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setCategory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;salad&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Switch&lt;/span&gt; &lt;span class="nx"&gt;Category&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/button&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* ⚠️ THE TRAP: We CANNOT use an inline arrow here! 
          If we wrote: onOrder={(dish) =&amp;gt; handleOrder(dish)}
          It would BREAK the optimization because the arrow wrapper 
          is a new reference. We are FORCED to pass the function directly. */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;DishList&lt;/span&gt; 
        &lt;span class="nx"&gt;dishes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;filteredDishes&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; 
        &lt;span class="nx"&gt;onOrder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;handleOrder&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; 
      &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happens in the Console now?&lt;/strong&gt;&lt;br&gt;
If the parent re-renders (for example, if &lt;code&gt;theme&lt;/code&gt; changes but &lt;code&gt;category&lt;/code&gt; stays the same), the console stays silent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🖥️ CONSOLE OUTPUT:
---------------------------------------------
(Silent. No logs appear.)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(Performance is achieved, but the code is harder to read because of the hook boilerplate.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;And it is fragile: if a colleague changes &lt;code&gt;onOrder={handleOrder}&lt;/code&gt; to &lt;code&gt;onOrder={() =&amp;gt; handleOrder()}&lt;/code&gt;, the optimization breaks silently. The arrow wrapper creates a new function on every render, so &lt;code&gt;memo&lt;/code&gt; sees a "new" prop and re-renders &lt;code&gt;DishList&lt;/code&gt; anyway.&lt;/em&gt;&lt;/p&gt;
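&lt;p&gt;The reason the break is silent is plain JavaScript reference equality, which is what &lt;code&gt;memo&lt;/code&gt;'s default prop comparison relies on. A minimal sketch outside React:&lt;/p&gt;

```javascript
const handleOrder = (dish) => console.log("Ordered:", dish);

// memo() compares each prop with Object.is, i.e. reference equality.
// Passing the same function object on two renders: props look equal.
const renderA = { onOrder: handleOrder };
const renderB = { onOrder: handleOrder };
console.log(Object.is(renderA.onOrder, renderB.onOrder)); // true, so skip re-render

// Wrapping it in a fresh arrow creates a brand-new function each time:
const renderC = { onOrder: (dish) => handleOrder(dish) };
const renderD = { onOrder: (dish) => handleOrder(dish) };
console.log(Object.is(renderC.onOrder, renderD.onOrder)); // false, so re-render
```

&lt;p&gt;Two arrow functions with identical source code are still two different objects in memory, which is all &lt;code&gt;memo&lt;/code&gt; can see.&lt;/p&gt;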




&lt;h3&gt;
  
  
  Phase 3: The React Compiler Solution (No Additional Code)
&lt;/h3&gt;

&lt;p&gt;This is the magic of the React Compiler. You go back to writing the code from &lt;strong&gt;Phase 1&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// No useMemo. No useCallback. No memo.&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;RestaurantMenu&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;allDishes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;theme&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setCategory&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pasta&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// The Compiler AUTOMATICALLY memoizes this&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filteredDishes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;allDishes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;🧮 Filtering... (Slow Math)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// The Compiler AUTOMATICALLY stabilizes this function&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handleOrder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Ordered:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt; &lt;span class="nx"&gt;className&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt; &lt;span class="nx"&gt;onClick&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setCategory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;salad&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Switch&lt;/span&gt; &lt;span class="nx"&gt;Category&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/button&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* ✅ COMPILER MAGIC: We can use an inline arrow again! 
          The compiler is smart enough to "memoize" this arrow function 
          wrapper automatically. It sees that 'handleOrder' is stable, 
          so it makes this arrow stable too. */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;DishList&lt;/span&gt; &lt;span class="nx"&gt;dishes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;filteredDishes&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="nx"&gt;onOrder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;handleOrder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happens in the Console?&lt;/strong&gt;&lt;br&gt;
Even though we deleted all the hooks, the result is identical to Phase 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🖥️ CONSOLE OUTPUT:
---------------------------------------------
(Silent. No logs appear.)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What just happened?&lt;/strong&gt;&lt;br&gt;
The React Compiler analyzed your code at build time. It understands data flow better than we do.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It sees &lt;code&gt;filteredDishes&lt;/code&gt; only changes when &lt;code&gt;category&lt;/code&gt; changes.&lt;/li&gt;
&lt;li&gt;It sees you wrapped &lt;code&gt;handleOrder&lt;/code&gt; in an arrow function: &lt;code&gt;(dish) =&amp;gt; handleOrder(dish)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;It automatically caches that arrow function wrapper so it remains the exact same reference across renders.&lt;/li&gt;
&lt;li&gt;It effectively generates the optimized code from &lt;strong&gt;Phase 2&lt;/strong&gt; for you, behind the scenes.&lt;/li&gt;
&lt;/ul&gt;
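To make "the exact same reference across renders" concrete, here is a minimal plain-JavaScript sketch (no JSX; `createRenderCache`, `memo`, and `render` are hypothetical helpers for illustration, not real React or compiler APIs) of the caching idea the compiler applies:

```javascript
// Hypothetical cache mimicking the compiler's generated memoization.
function createRenderCache() {
  const cache = new Map();
  return function memo(key, factory) {
    if (!cache.has(key)) cache.set(key, factory());
    return cache.get(key); // same reference on every later call
  };
}

const memo = createRenderCache();

// A stand-in for a component body that runs on every render.
function render() {
  // Without caching: a brand-new arrow is created on every render.
  const freshWrapper = (dish) => console.log("Ordered:", dish);

  // With caching (what the compiler generates for us): the wrapper
  // function is created once and reused on subsequent renders.
  const stableWrapper = memo("onOrder", () => (dish) => console.log("Ordered:", dish));

  return { freshWrapper, stableWrapper };
}

const first = render();
const second = render();

console.log(first.freshWrapper === second.freshWrapper);   // false → child would re-render
console.log(first.stableWrapper === second.stableWrapper); // true  → child can skip re-rendering
```

Because `DishList` receives the same `onOrder` reference each time, React can bail out of re-rendering it, exactly as if we had written `useCallback` by hand.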




&lt;h2&gt;
  
  
  The Philosophy Shift
&lt;/h2&gt;

&lt;p&gt;For years, we had to manually tell the framework: &lt;em&gt;"Remember this variable! Freeze this function!"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The React Compiler solves this problem.&lt;/strong&gt;&lt;br&gt;
React now assumes the burden of optimization. It allows us to stop worrying about render cycles and dependency arrays, and start focusing on what actually matters: &lt;strong&gt;shipping features.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What Now?
&lt;/h3&gt;

&lt;p&gt;The best part is that the React Compiler is &lt;strong&gt;backward compatible (it can target React v17 and v18 as well)&lt;/strong&gt;. You don't have to rewrite your codebase: you can enable it, and it will optimize your "plain" components while leaving your existing &lt;code&gt;useMemo&lt;/code&gt; and &lt;code&gt;useCallback&lt;/code&gt; calls untouched.&lt;/p&gt;
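As a sketch of what enabling it looks like, here is a minimal Babel configuration using `babel-plugin-react-compiler` with the `target` option for older React versions (the exact option names may differ between compiler releases, so treat this as an assumption and check the official docs for your version):

```javascript
// babel.config.js — minimal sketch of enabling the React Compiler.
// For React 17/18, `target` tells the plugin to emit calls against the
// separate react-compiler-runtime package instead of React 19's built-ins.
module.exports = {
  plugins: [
    ["babel-plugin-react-compiler", { target: "18" }],
  ],
};
```

With this in place, the compiler runs at build time and memoizes eligible components automatically; no component code changes are required.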




&lt;p&gt;&lt;em&gt;Thanks for reading! This is my first post on Dev.to, and I wrote it to help solidify my own understanding of the Compiler. I’d love your feedback—did the restaurant analogy make sense to you? Let me know in the comments!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>welcome</category>
      <category>react</category>
      <category>webdev</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
