<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yuvaraj</title>
    <description>The latest articles on DEV Community by Yuvaraj (@iamyuvaraj).</description>
    <link>https://dev.to/iamyuvaraj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3751247%2F77f91644-df57-45a0-8ec8-7afd8af6cc5b.png</url>
      <title>DEV Community: Yuvaraj</title>
      <link>https://dev.to/iamyuvaraj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/iamyuvaraj"/>
    <language>en</language>
    <item>
      <title>Transformer - Encoder Deep Dive - Part 3: What is Self-Attention</title>
      <dc:creator>Yuvaraj</dc:creator>
      <pubDate>Sun, 08 Mar 2026 20:10:23 +0000</pubDate>
      <link>https://dev.to/iamyuvaraj/transformer-encoder-deep-dive-part-3-what-is-self-attention-1aen</link>
      <guid>https://dev.to/iamyuvaraj/transformer-encoder-deep-dive-part-3-what-is-self-attention-1aen</guid>
      <description>&lt;h3&gt;
  
  
  Recap
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embedding:&lt;/strong&gt; "The", "dog", "bit", "the", "man" each have a unique semantic identity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positional Encoding:&lt;/strong&gt; Each word now knows exactly where it sits in the sentence.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Wait... What exactly is the Encoder's job? &lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-2-3lig"&gt;Part 2&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The sole purpose of the Encoder is to understand Context. &lt;/p&gt;

&lt;p&gt;Take the example "The dog bit the man" and look at the word "bit".&lt;/p&gt;

&lt;p&gt;On its own, "bit" could mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A small piece of something (a "bit" of chocolate).&lt;/li&gt;
&lt;li&gt;The past tense of "bite" (the action).&lt;/li&gt;
&lt;li&gt;A digital 0 or 1 (a computer "bit").&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Encoder doesn't know which one it is until it pays Attention to the words around it through association. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Those words are like strangers in an elevator—they are standing &lt;strong&gt;near each other, but they aren't talking.&lt;/strong&gt; &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What exactly is "Self-Attention"?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Self:&lt;/strong&gt; The model is looking at the same sentence it is currently processing. It isn't looking at a dictionary or a translation yet; it's just looking at its own words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attention:&lt;/strong&gt; The model decides which other words in that sentence are relevant to the word it is currently "thinking" about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Definition:&lt;/strong&gt; Self-Attention is a mechanism that allows a word to "look" at every other word in its own sentence to find the context it needs to define itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Relationship" Logic&lt;/strong&gt;&lt;br&gt;
In our sentence "The dog bit the man," Self-Attention is the reason the model knows that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"dog" is related to "bit" (as the actor).&lt;/li&gt;
&lt;li&gt;"man" is related to "bit" (as the receiver).&lt;/li&gt;
&lt;li&gt;"the" is related to "dog" (telling us it's a specific dog).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without Self-Attention, the word "bit" is just a three-letter string. With Self-Attention, "bit" becomes a bridge that connects a subject (dog) to an object (man).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Attention&lt;/strong&gt; is the conversation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This matrix is now standing at the door of the first &lt;strong&gt;Multi-Head Attention&lt;/strong&gt; block.&lt;/p&gt;

&lt;p&gt;Let's understand &lt;strong&gt;Self-Attention&lt;/strong&gt; in this article.&lt;/p&gt;

&lt;p&gt;In a real Transformer, 8 of these heads work together to create 'Multi-Head Attention,' which we will glue together in Part 4.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqzlcosv0aniuspvqgc9.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqzlcosv0aniuspvqgc9.PNG" alt="visualisation about Encode and Decoder transformers archetecture" width="800" height="874"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Queries, Keys, and Values (Q, K, V)
&lt;/h3&gt;

&lt;p&gt;To calculate attention, we don't just use the input matrix as it is. &lt;br&gt;
&lt;strong&gt;Self-Attention&lt;/strong&gt; transforms our input matrix into three different versions of itself using three learnable weight matrices (W^Q, W^K, W^V).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Think of this like taking the same word and looking at it through three different lenses:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query (Q)&lt;/strong&gt; - "The Search": This is what a word is looking for.&lt;br&gt;
&lt;em&gt;Example: The word "dog" asks: "Is there an action in this sentence that I performed?"&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Key (K)&lt;/strong&gt; - "The Label": This is how a word identifies itself to others.&lt;br&gt;
&lt;em&gt;Example: The word "bit" says: "I am an action involving teeth."&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Value (V)&lt;/strong&gt; - "The Cargo": This is the actual information a word carries.&lt;br&gt;
&lt;em&gt;Example: The word 'dog' (Query) found a match with the label 'action' (Key) on the word 'bit.' It then reached inside the truck and took the 'biting information' (Value) to update its own identity.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Here is a series of four images that visually break down the concept of Self-Attention, using our "The dog bit the man" example.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The words "the", "dog", "bit", and "man" are present, but the model does not yet know how they are connected to each other in the sentence "The dog bit the man".&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The word vectors for "dog" and "bit" are isolated. The "dog" vector is a generic noun with no knowledge of the action it's about to perform.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hv9i0rd7j8eldrlyyg6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hv9i0rd7j8eldrlyyg6.png" alt="Visualized the dog, bit words as isolated matrix" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How does 'dog' find 'bit'?
&lt;/h3&gt;

&lt;p&gt;The "dog" vector (acting as the &lt;strong&gt;Query&lt;/strong&gt;) "shines a light" on all other words to find its match. The "bit" vector (acting as the &lt;strong&gt;Key&lt;/strong&gt;) responds strongly, creating a high &lt;strong&gt;Attention Score&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42xzygcl9lyyailpnppt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42xzygcl9lyyailpnppt.png" alt="Visualized how dog is finding its meaning with the word bit" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  The Transfer of Meaning
&lt;/h4&gt;

&lt;p&gt;Using the Attention Score as a weight, the "bit" vector's actual content (&lt;strong&gt;Value&lt;/strong&gt;) is transferred to the "dog" vector. The "dog" is now "absorbing" the meaning of the action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F476xnrf61uoxzxttfwwy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F476xnrf61uoxzxttfwwy.png" alt="Visualized how the bit meaning is transferred to the word dog" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Contextualized
&lt;/h3&gt;

&lt;p&gt;After the process, the "dog" vector is transformed. Its mathematical representation has changed (visualized here by the color blending), and it is now a "context-aware" vector that knows it is the &lt;strong&gt;subject of the bite&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The word 'dog' is no longer a generic noun; it’s a subject tied to an action.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqeo823ulagj3x8lj0udn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqeo823ulagj3x8lj0udn.png" alt="Visualized final formed matrix contains the subject who bite" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Self-Attention Formula: Deep Dive
&lt;/h3&gt;

&lt;p&gt;For the developers who want to see the code or the math, everything we just discussed (Query, Key, Value) is condensed into one famous formula:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66t7ymqb03ybsu01c71s.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66t7ymqb03ybsu01c71s.PNG" alt="Attention formula" width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is a step-by-step visual breakdown of the Self-Attention mechanism, using our sentence &lt;strong&gt;"The dog bit the man"&lt;/strong&gt;. We'll follow the mathematical formula and visualize how the input matrices are transformed at each stage.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: The Initial Learned Matrices (Q, K, V)
&lt;/h4&gt;

&lt;p&gt;Before any attention is calculated, the input matrix is multiplied by three separate, learnable weight matrices (W^Q, W^K, W^V) to create three new matrices: &lt;strong&gt;Query (Q)&lt;/strong&gt;, &lt;strong&gt;Key (K)&lt;/strong&gt;, and &lt;strong&gt;Value (V)&lt;/strong&gt;. These matrices are the starting point for our calculation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzoxby02sfbufw7dt0ge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzoxby02sfbufw7dt0ge.png" alt="Visualized Q,K,V matrices" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might be wondering: "If the input is just our 'The dog bit the man' matrix, why do Q, K, and V have different numbers?"&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This happens through Linear Transformation. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We take our input and multiply it by three separate "Weight Matrices." These weights are like filters or lenses that highlight different parts of the word's meaning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input × W^Q = Q&lt;/strong&gt;: This transformation extracts the "Question" part of the word.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input × W^K = K&lt;/strong&gt;: This extracts the "Label" part of the word.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input × W^V = V&lt;/strong&gt;: This extracts the "Cargo" (Content) part of the word.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why this matters: &lt;br&gt;
These &lt;strong&gt;W&lt;/strong&gt; matrices are learnable. At first, the model is bad at asking questions. But over time, it learns exactly how to adjust the numbers in &lt;strong&gt;W^Q&lt;/strong&gt; so that the word "dog" asks the perfect question to find its verb "bit."&lt;/p&gt;
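&lt;p&gt;The three transformations above can be sketched in a few lines of NumPy. This is an illustrative toy (random weights and a tiny d_model of 8 instead of 512), not the trained model:&lt;/p&gt;

```python
import numpy as np

np.random.seed(0)
seq_len, d_model = 5, 8                 # 5 words ("The dog bit the man"); toy width instead of 512

X = np.random.randn(seq_len, d_model)   # input matrix: embeddings + positional encoding

# Three separate learnable weight matrices (random here; training tunes them)
W_Q = np.random.randn(d_model, d_model)
W_K = np.random.randn(d_model, d_model)
W_V = np.random.randn(d_model, d_model)

Q = X @ W_Q   # the "Question" part of each word
K = X @ W_K   # the "Label" part of each word
V = X @ W_V   # the "Cargo" (content) part of each word

print(Q.shape, K.shape, V.shape)   # (5, 8) (5, 8) (5, 8)
```

&lt;p&gt;During training, backpropagation adjusts W^Q, W^K, and W^V so that these three "lenses" become genuinely useful.&lt;/p&gt;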

&lt;h4&gt;
  
  
  Step 2: The Compatibility Check (Raw Scores)
&lt;/h4&gt;

&lt;p&gt;Now, we calculate the "compatibility": how much every word should listen to every other word. To do this, we perform a &lt;strong&gt;Dot Product&lt;/strong&gt; between the Query (Q) and the transposed Key (K^T) matrices.&lt;/p&gt;

&lt;p&gt;💡 Quick Recall: If you need a refresher on how the math of multiplying these matrices works, check out &lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0#5-the-multiplication-concept-how-the-magic-happens"&gt;Step 5 of Part 1&lt;/a&gt;, where we saw how rows and columns collide to create a "Relationship Score."&lt;/p&gt;

&lt;p&gt;In this step, we multiply the Query of "dog" by the Transpose of the Key "bit".&lt;/p&gt;

&lt;p&gt;The Result: We get a raw "Attention Score."&lt;/p&gt;

&lt;p&gt;The Logic: If the "Search Query" of the dog matches the "Label" of the bite, the math produces a &lt;strong&gt;high number&lt;/strong&gt;. If they don't match, the number stays &lt;strong&gt;near zero&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example, the high score of &lt;code&gt;15.2&lt;/code&gt; between "dog" and "bit" indicates a strong connection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31sfi1ou93favvyzjmar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31sfi1ou93favvyzjmar.png" alt="Visualized dot product q and transposed k" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3: Scaling
&lt;/h4&gt;

&lt;p&gt;These are the two critical steps that turn raw, unstable scores into clear probabilities for the model.&lt;/p&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;3.1. The Scaling Step: Stabilizing the Math&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Before we can turn our scores into percentages, we have to manage their size. &lt;br&gt;
The raw scores from the dot product (Q * K^T) can be very large, especially with high-dimensional vectors (d_model=512).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is this a problem?&lt;/strong&gt; Large numbers can cause the training process to become unstable. The model's gradients can become too small ("vanishing gradients"), meaning it stops learning.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Now, why do we care if gradients get too small?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When we apply Softmax to very large numbers (like our unscaled scores of 15 or 20), the function becomes "extremely confident." It gives one word 99.999% of the attention and everything else 0.00001%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deep dive into this problem: What is a Gradient?&lt;/strong&gt;&lt;br&gt;
A gradient is a "directional signal" telling the model how to improve. When the numbers get too extreme, the signal becomes so weak that the model gets "confused" and stops learning.&lt;/p&gt;

&lt;p&gt;Let's imagine you are standing on a foggy mountain in the dark, and your goal is to reach the lowest valley (the "Loss" or "Error"). Because of the fog, you can’t see the bottom.&lt;/p&gt;

&lt;p&gt;The Gradient is like feeling the ground with your foot to see which way it slopes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your foot feels a steep slope downward, that is a &lt;strong&gt;Strong Gradient&lt;/strong&gt;. It tells you exactly which way to step to get closer to the bottom.&lt;/li&gt;
&lt;li&gt;If the ground feels almost perfectly flat, that is a &lt;strong&gt;Vanishing Gradient&lt;/strong&gt;. You have no idea which way to move to improve. You are stuck.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Solution:&lt;/strong&gt; We divide the raw scores by a scaling factor, which is the square root of the dimension of the keys &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2x9bhz6d8tp7w9n3eka9.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2x9bhz6d8tp7w9n3eka9.PNG" alt="Visualized square root of dimension of keys are formula" width="71" height="54"&gt;&lt;/a&gt;. This "squashes" the scores back into a manageable range without changing their relative order.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fife37u8lii0mslrbfdcj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fife37u8lii0mslrbfdcj.png" alt="Visualized scaling" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Softmax is applied row-wise.&lt;/p&gt;
&lt;/blockquote&gt;
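&lt;p&gt;A minimal sketch of the scaling step (the matrix values are illustrative): dividing by sqrt(d_k) shrinks the magnitudes without reordering them.&lt;/p&gt;

```python
import numpy as np

# Illustrative raw Q * K^T scores (e.g., 15.2 for the "dog" / "bit" pair)
scores = np.array([[15.2,  2.1, -3.0],
                   [ 1.0,  9.8,  0.5],
                   [-2.2,  0.7, 12.4]])

d_k = 64
scaled = scores / np.sqrt(d_k)   # sqrt(64) = 8.0, so 15.2 becomes 1.9

# The relative order within each row is unchanged; only the magnitudes shrink
print(scaled[0])
```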

&lt;h5&gt;
  
  
  &lt;strong&gt;3.1.1. What is d_k? (The Width of the Key)&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Remember our &lt;strong&gt;"&lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-2-3lig#the-analogy-the-semantic-passport"&gt;Semantic Passport&lt;/a&gt;"&lt;/strong&gt; analogy from Part 2? Each word has a vector of 512 numbers (d_model = 512). However, when it comes time to talk in the Engine Room, the model doesn't use all 512 pages at once.&lt;/p&gt;

&lt;p&gt;Instead, it breaks those 512 dimensions into smaller, specialized chunks. &lt;strong&gt;d_k&lt;/strong&gt; is the size of one of those chunks—typically &lt;strong&gt;64&lt;/strong&gt;.&lt;/p&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;3.1.2. Why not use all 512 at once? (The Specialization Problem)&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;You might ask: &lt;em&gt;"Why not just calculate one massive attention score for all 512 pages?"&lt;/em&gt; The answer is &lt;strong&gt;Specialization&lt;/strong&gt;. If you use all 512 dimensions at once, you get one single "Attention Score." This score becomes an &lt;strong&gt;average&lt;/strong&gt; of every word's relationship in the sentence, and in language, averages are dangerous.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Analogy:&lt;/strong&gt; Imagine you are at a business meeting. If you try to listen to the CEO, the Accountant, and the Engineer through one single "ear," their voices blur together. You might get the "average" topic, but you’ll miss the specific details of the budget or the technical specs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By breaking the 512 dimensions into &lt;strong&gt;8 chunks of 64&lt;/strong&gt;, the model creates 8 specialized "Attention Heads."&lt;/p&gt;

&lt;p&gt;Each head acts like a specialist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Head 1: Focuses on Grammar (Subject-Verb relationship).&lt;/li&gt;
&lt;li&gt;Head 2: Focuses on Entity Relationships (Who is the "dog" and who is the "man"?).&lt;/li&gt;
&lt;li&gt;Head 3: Focuses on the "Tense" or "Time" of the sentence.&lt;/li&gt;
&lt;li&gt;Head 4: ...&lt;/li&gt;
&lt;li&gt;Head n: ...&lt;/li&gt;
&lt;/ul&gt;
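&lt;p&gt;The chunking itself is just a reshape. A quick sketch of how 512 dimensions become 8 specialist views of 64 each:&lt;/p&gt;

```python
import numpy as np

d_model, n_heads = 512, 8
d_k = d_model // n_heads           # 64: the width of each specialist's chunk

np.random.seed(0)
x = np.random.randn(5, d_model)    # 5 words, 512 dims each

# Split each word's 512 numbers into 8 chunks of 64, one per head
heads = x.reshape(5, n_heads, d_k).transpose(1, 0, 2)
print(heads.shape)                 # (8, 5, 64): 8 specialists, each with its own 64-dim slice
```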

&lt;h4&gt;
  
  
  Step 4: The Softmax Step: The "Winner-Takes-All" Filter
&lt;/h4&gt;

&lt;p&gt;Now that our scores are stable, we need to convert them into probabilities that we can use as weights. This is where the &lt;strong&gt;Softmax function&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;Softmax is a mathematical function that takes a list of numbers (which can be positive, negative, or zero) and turns them into a list of probabilities that &lt;strong&gt;sum up to exactly 1.0 (or 100%)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is this useful?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Normalization:&lt;/strong&gt; It gives us a clear "attention budget" for each word. The total attention a word pays to the entire sentence must always be 100%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amplification:&lt;/strong&gt; It highlights the highest score and suppresses the lower ones. As seen in the image, the highest scaled score of &lt;strong&gt;1.9&lt;/strong&gt; gets a massive &lt;strong&gt;65%&lt;/strong&gt; of the attention, while the negative scores get almost none.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgckd3kdslkflbfxbuyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgckd3kdslkflbfxbuyy.png" alt="Visualized softmax calculation" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Softmax looks at each word individually (each row). It takes the 100% attention budget for that word and distributes it across the sentence."&lt;/p&gt;
&lt;/blockquote&gt;
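&lt;p&gt;Row-wise softmax can be sketched like this (the input row is illustrative; the "subtract the max" line is a standard numerical-stability trick):&lt;/p&gt;

```python
import numpy as np

def softmax_rows(scores: np.ndarray) -> np.ndarray:
    """Apply softmax independently to each row (each word's 100% attention budget)."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

scaled = np.array([[1.9, 0.3, -0.4, 0.8, -1.1]])  # one word's scaled scores (illustrative)
weights = softmax_rows(scaled)

print(weights.round(2))   # the highest score (1.9) wins the biggest share
print(weights.sum())      # every row sums to 1.0 (up to float rounding)
```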

&lt;p&gt;Let's visualize the softmax of the dot product (Q * K^T) divided by the scaling factor (the square root of the dimension of the keys):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4gbzstn6dlk8yxp691m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4gbzstn6dlk8yxp691m.png" alt="Visualized step softmax output" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: The Transfer of Meaning (Weighted Sum)
&lt;/h3&gt;

&lt;p&gt;Finally, we use the attention weights (probabilities) from Step 4 to create a &lt;strong&gt;weighted sum&lt;/strong&gt; of the &lt;strong&gt;Value (V)&lt;/strong&gt; matrix. This is the step where the actual context is transferred.&lt;/p&gt;

&lt;p&gt;For example, the new vector for "dog" is calculated by taking 80% of the "bit" Value vector, 5% of the "dog" Value vector, and so on. The result is a new matrix where each word's vector has been updated with information from the words it "paid attention" to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofbklodx1xvr0bbfnd5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofbklodx1xvr0bbfnd5k.png" alt="Vizualised weighted sum output" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;NOTE: this is just the report from 1 of 8 specialists (heads). In the next part, we'll see how the results from all 8 specialists (heads) are combined to form the final Multi-Head Attention output.&lt;/p&gt;

&lt;p&gt;One Specialist = Self-Attention&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h4&gt;
  
  
  Summary:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;The Attention Interface:&lt;/strong&gt; We have successfully turned our raw input into a contextual masterpiece. Q, K, and V gave us the tools for the search.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Q * K^T&lt;/strong&gt; found the relationships.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling &amp;amp; Softmax&lt;/strong&gt; stabilized the math and gave us clear percentages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value (V)&lt;/strong&gt; provided the cargo that updated our word meanings.&lt;/li&gt;
&lt;/ul&gt;
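&lt;p&gt;Putting the whole summary together, one attention head fits in a short function. A minimal NumPy sketch with toy sizes and random weights, not a production implementation:&lt;/p&gt;

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """One head: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V             # the three lenses
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # compatibility, scaled
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of cargo

np.random.seed(1)
X = np.random.randn(5, 8)                                 # 5 words, toy d_model = 8
W_Q, W_K, W_V = (np.random.randn(8, 4) for _ in range(3))  # project down to d_k = 4

out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)   # (5, 4): every word's vector updated with context from the others
```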




&lt;h3&gt;
  
  
  What’s Next?:
&lt;/h3&gt;

&lt;p&gt;We’ve seen how a single "specialist" (&lt;strong&gt;Self-Attention&lt;/strong&gt;) handles a 64-dimension chunk of our data. &lt;br&gt;
But our Encoder is a powerhouse that runs 8 of these specialists at the exact same time.&lt;/p&gt;

&lt;p&gt;In Part 4, we will dive into Multi-Head Attention to move deeper into the Transformer tower.&lt;/p&gt;

&lt;h4&gt;
  
  
  References
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;&lt;strong&gt;Official Paper:&lt;/strong&gt; Attention Is All You Need&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=dichIcUZfOw" rel="noopener noreferrer"&gt;&lt;strong&gt;Visual Guide:&lt;/strong&gt; Visualizing Positional Encoding&lt;/a&gt; that helped me visualize the mechanics of the transformers.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Transformers - Encoder Deep Dive - Part 2</title>
      <dc:creator>Yuvaraj</dc:creator>
      <pubDate>Mon, 16 Feb 2026 08:33:53 +0000</pubDate>
      <link>https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-2-3lig</link>
      <guid>https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-2-3lig</guid>
      <description>&lt;p&gt;In our journey so far, we have explored the &lt;a href="https://dev.to/iamyuvaraj/what-are-transformers-why-do-they-dominate-the-ai-world-2278"&gt;high-level intuition of why Transformers exist&lt;/a&gt; and mapped out the blueprint and notations in &lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0"&gt;Part 1&lt;/a&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Wait... What exactly is the Encoder?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before we prep the ingredients, let's look at the "Chef."&lt;/p&gt;

&lt;p&gt;In the Transformer diagram, the &lt;strong&gt;Encoder&lt;/strong&gt; is the tower on the left.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zk01td260yk9q89k6rs.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zk01td260yk9q89k6rs.PNG" alt="Visualisation explain about transformers architecture" width="800" height="874"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you think of a Transformer as a &lt;strong&gt;translation system&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Encoder&lt;/strong&gt; is the "Scholar" who reads the English sentence and understands every hidden nuance, relationship, and bit of context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Decoder&lt;/strong&gt; is the "Writer" who takes that scholar's notes and writes the sentence out in French.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Where is it used?&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;While the original paper used both towers for translation, the tech world realized the Encoder is a powerhouse on its own. Below are a few examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search Engines:&lt;/strong&gt; To understand the &lt;em&gt;intent&lt;/em&gt; behind your query, not just the keywords.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentiment Analysis:&lt;/strong&gt; To "feel" if a product review is happy or angry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text Classification:&lt;/strong&gt; To automatically sort your emails into "Spam" or "Work."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short: &lt;strong&gt;The Encoder's sole job is to turn a human sentence into a "Context-Rich" mathematical map.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Let's learn how to build this map.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Encoder Input: Why does the Encoder need &lt;code&gt;(Seq, d_model)&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;As we discussed in &lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0"&gt;Part 1&lt;/a&gt;, the hardware (GPU) is built for speed. It doesn't want to read a sentence word-by-word. It wants a &lt;strong&gt;Matrix&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Specifically, the Encoder requires a matrix of shape &lt;strong&gt;&lt;code&gt;(Seq, d_model)&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Seq&lt;/code&gt; (Sequence Length):&lt;/strong&gt; The number of words (e.g., 5 for "The dog bit the man").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;d_model&lt;/code&gt; (Model Dimension):&lt;/strong&gt; The "width" of our mathematical understanding (usually 512).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Encoder's job is to understand and refine this matrix.&lt;/p&gt;

&lt;p&gt;Let's learn how to feed input to the Encoder in this structure: &lt;strong&gt;Meaning&lt;/strong&gt; (Embedding) and &lt;strong&gt;Order&lt;/strong&gt; (Positional Encoding).&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Input Embedding: Giving Words a Digital Identity
&lt;/h2&gt;

&lt;p&gt;A computer doesn't know what a "dog" is. To a machine, "dog" is just a string of bits. &lt;strong&gt;Input Embedding&lt;/strong&gt; is the process of giving that word a &lt;strong&gt;"Semantic Passport."&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How is this vector actually generated?
&lt;/h3&gt;

&lt;p&gt;To understand how we get our &lt;code&gt;(Seq, d_model)&lt;/code&gt; matrix, let's follow the word &lt;strong&gt;"dog"&lt;/strong&gt; through the three-step mechanical process we teased in &lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0"&gt;Part 1&lt;/a&gt;:&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: The ID Lookup &lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0#step-1-the-vocabulary-amp-onehot-vector"&gt;One-Hot Vector&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;First, the model looks up "dog" in its dictionary (Vocabulary) and finds its unique ID (e.g., ID 432). It creates a &lt;strong&gt;One-Hot Vector&lt;/strong&gt;: a massive list of zeros with a single &lt;code&gt;1&lt;/code&gt; at position 432.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Recall from Part 1: This is a "Sparse" representation. It's huge, mostly empty, and contains zero meaning.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
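&lt;p&gt;In code, the one-hot vector is exactly what it sounds like (toy vocabulary size; ID 432 is from the example above):&lt;/p&gt;

```python
import numpy as np

vocab_size = 10_000
dog_id = 432              # "dog"'s unique ID in the vocabulary

one_hot = np.zeros(vocab_size)
one_hot[dog_id] = 1.0     # a single 1 in a sea of zeros

print(int(one_hot.sum()))  # 1 (sparse, huge, and meaning-free)
```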

&lt;h4&gt;
  
  
  Step 2: &lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0#step-2-the-embedding-matrix-the-lookup-table"&gt;The Embedding Matrix&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;This is where the magic happens. The model maintains a giant &lt;strong&gt;Embedding Matrix&lt;/strong&gt; of size &lt;code&gt;(Vocab_Size, d_model)&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each &lt;strong&gt;row&lt;/strong&gt; in this matrix is a 512-dimensional vector.&lt;/li&gt;
&lt;li&gt;Initially, these numbers are random nonsense.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Step 3: &lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0#step-3-the-final-raw-dmodel-endraw-vector"&gt;The Row Selection&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;The model uses the ID from Step 1 to select exactly one row from this matrix. This row is our &lt;strong&gt;&lt;code&gt;d_model&lt;/code&gt; vector&lt;/strong&gt;. We have now successfully converted a "Sparse" ID into a "Dense" list of numbers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptqjz1m59ae99tdom4f4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptqjz1m59ae99tdom4f4.png" alt="Visualisation explaining embedding vector" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  The Analogy: The Semantic Passport
&lt;/h3&gt;

&lt;p&gt;Imagine every word in our sentence, &lt;strong&gt;"The dog bit the man,"&lt;/strong&gt; gets a passport with 512 pages (our &lt;code&gt;d_model&lt;/code&gt;). Each page describes a feature that the model has learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Page 1 (Noun-ness):&lt;/strong&gt; High value for "dog", low for "bit".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Page 2 (Living):&lt;/strong&gt; High value for "dog" and "man".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Page 3 (Animal):&lt;/strong&gt; High value for "dog", low for "man".&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  💡 Important: These Values Change!
&lt;/h3&gt;

&lt;p&gt;Unlike a fixed dictionary, these embedding values are &lt;strong&gt;learnable weights&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why?&lt;/strong&gt; At the start of training, the model's "understanding" is random. As it reads millions of sentences, it uses backpropagation to adjust these numbers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Goal:&lt;/strong&gt; It learns that "dog" and "wolf" often appear in similar contexts (near words like "bark", "forest", or "pack"). To satisfy the math, it moves their 512-dimensional coordinates closer together in space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Result:&lt;/strong&gt; The embedding vector evolves as the model gets "smarter."&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3. The Parallelism Paradox
&lt;/h3&gt;

&lt;p&gt;In Part 1, we praised Transformers for their &lt;strong&gt;Parallelism&lt;/strong&gt;. Unlike the "Drunken Narrator" (RNN), the Transformer looks at the entire input matrix at the same time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Paradox:&lt;/strong&gt; If you look at every word simultaneously, you lose the sense of time.&lt;br&gt;
To a Transformer, these two sentences look identical because the words are the same:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"The &lt;strong&gt;dog&lt;/strong&gt; bit the &lt;strong&gt;man&lt;/strong&gt;."&lt;/li&gt;
&lt;li&gt;"The &lt;strong&gt;man&lt;/strong&gt; bit the &lt;strong&gt;dog&lt;/strong&gt;."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We have &lt;strong&gt;Meaning&lt;/strong&gt; from our Embeddings, but we are missing &lt;strong&gt;Order&lt;/strong&gt;.&lt;/p&gt;
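&lt;p&gt;A tiny NumPy sketch (hypothetical 4-dimensional embeddings) makes the paradox concrete: the two sentences produce matrices that differ only by a row permutation, so any order-blind summary of them is identical.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "dog": 1, "bit": 2, "man": 3}
E = rng.normal(size=(len(vocab), 4))   # hypothetical 4-dim embeddings

s1 = ["the", "dog", "bit", "the", "man"]
s2 = ["the", "man", "bit", "the", "dog"]

X1 = np.stack([E[vocab[w]] for w in s1])   # (Seq, d_model)
X2 = np.stack([E[vocab[w]] for w in s2])

# The rows differ (row 1 is "dog" vs "man")...
print(np.allclose(X1, X2))                          # False
# ...but as an unordered bag of vectors the two are identical:
print(np.allclose(X1.sum(axis=0), X2.sum(axis=0)))  # True
```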




&lt;h3&gt;
  
  
  4. Positional Encoding: The "Home Address"
&lt;/h3&gt;

&lt;p&gt;To solve this, we need to "stamp" a position onto our embeddings. This tells the model where each word is standing in line.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Formulas (How to calculate)
&lt;/h4&gt;

&lt;p&gt;For an index &lt;code&gt;i&lt;/code&gt; in our 512-dimensional vector:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;For Even Steps (&lt;code&gt;2i&lt;/code&gt;):&lt;/strong&gt;
&lt;code&gt;PE(pos, 2i) = sin(pos / 10000^(2i/d_model))&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For Odd Steps (&lt;code&gt;2i+1&lt;/code&gt;):&lt;/strong&gt;
&lt;code&gt;PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1rwh0bpmn85nospmyqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1rwh0bpmn85nospmyqu.png" alt="Visulaisation about positonal encoding calculation" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Why Sine and Cosine?
&lt;/h4&gt;

&lt;p&gt;The researchers used Sine and Cosine waves because they allow the model to understand &lt;strong&gt;relative positions&lt;/strong&gt;. Because these functions oscillate, the model can easily learn that "word A is a specific distance away from word B."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&amp;gt; Waves give every seat in the sentence a unique mathematical "signature."&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Important: These Values are CONSTANT
&lt;/h4&gt;

&lt;p&gt;Unlike Embeddings, Positional Encodings are &lt;strong&gt;fixed&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why?&lt;/strong&gt; The "meaning" of a word like "dog" might change as the model learns, but the "meaning" of &lt;strong&gt;Position #2&lt;/strong&gt; should never change. Position #2 is always Position #2. By keeping this constant, the model has a stable "grid" to map its learnable meanings onto.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  5. Why do we ADD them? (&lt;code&gt;Meaning + Order&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;The researchers chose to &lt;strong&gt;ADD&lt;/strong&gt; the Positional Encoding vector directly to the Input Embedding vector.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&amp;gt; Final Vector = Learnable Meaning + Constant Position&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why add instead of just attaching it to the end (concatenation)?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It keeps the matrix size at &lt;code&gt;(Seq, d_model)&lt;/code&gt;. We don't make the matrix "fatter," which keeps the hardware processing fast.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgv55phkyo12ifadlsmv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgv55phkyo12ifadlsmv.png" alt="Visualisation explains why do we need to add input embedding with positional encoding" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  6. Summary: The Prepared Matrix
&lt;/h3&gt;

&lt;p&gt;We have successfully prepared our ingredients. We started with raw text and ended with a refined &lt;strong&gt;&lt;code&gt;(Seq, d_model)&lt;/code&gt;&lt;/strong&gt; matrix where every word knows &lt;strong&gt;who it is&lt;/strong&gt; and &lt;strong&gt;where it sits&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This matrix is finally ready to enter the first actual "box" of the Encoder.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbaxk8ggqkdc29gdcbdb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbaxk8ggqkdc29gdcbdb.png" alt="Visualisation about the input matrix fed to encoder" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  What's Next?
&lt;/h3&gt;

&lt;p&gt;Now that the input is prepared, it’s time to feed the &lt;strong&gt;Encoder&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Part 3&lt;/strong&gt;, we will explore &lt;strong&gt;Multi-Head Self-Attention&lt;/strong&gt;. This is where the words actually start "talking" to each other using the Matrix Multiplication logic we teased in Part 1. We’ll see how the model decides which words are the most important in the sentence.&lt;/p&gt;

&lt;h4&gt;
  
  
  References
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;&lt;strong&gt;Official Paper:&lt;/strong&gt; Attention Is All You Need&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=dichIcUZfOw" rel="noopener noreferrer"&gt;&lt;strong&gt;Visual Guide:&lt;/strong&gt; Visualizing Positional Encoding&lt;/a&gt; that helped me visualize the mechanics of the transformers.&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Transformers Encoder Deep Dive - Part 1</title>
      <dc:creator>Yuvaraj</dc:creator>
      <pubDate>Sun, 15 Feb 2026 16:20:18 +0000</pubDate>
      <link>https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0</link>
      <guid>https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-1-1g0</guid>
      <description>&lt;p&gt;In my previous article, &lt;strong&gt;&lt;a href="https://dev.to/iamyuvaraj/what-are-transformers-why-do-they-dominate-the-ai-world-2278"&gt;"What Are Transformers? Why Do They Dominate the AI World?"&lt;/a&gt;&lt;/strong&gt;, we explored the intuition behind this revolution. We saw how the &lt;strong&gt;"Search Warrant"&lt;/strong&gt; (Attention) replaced the &lt;strong&gt;"Drunken Narrator"&lt;/strong&gt; (RNNs) to solve the problem of long-distance memory.&lt;/p&gt;

&lt;p&gt;But how does that logic actually live inside a machine? To understand that, we have to look at the &lt;strong&gt;"Blueprint."&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Master Map
&lt;/h3&gt;

&lt;p&gt;When you look at the original architecture from the landmark paper &lt;em&gt;"&lt;a href="https://arxiv.org/html/1706.03762v7#S3" rel="noopener noreferrer"&gt;Attention Is All You Need&lt;/a&gt;,"&lt;/em&gt; you see two main towers: the &lt;strong&gt;Encoder&lt;/strong&gt; (left) and the &lt;strong&gt;Decoder&lt;/strong&gt; (right).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqzlcosv0aniuspvqgc9.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqzlcosv0aniuspvqgc9.PNG" alt="visualisation about Encode and Decoder transformers archetecture" width="800" height="874"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this series, we are going to break down these boxes into logical mental models. We want to understand not just what they are, but &lt;em&gt;why&lt;/em&gt; they exist and how they enable the massive &lt;strong&gt;parallelism&lt;/strong&gt; that makes modern AI so fast and powerful.&lt;/p&gt;

&lt;p&gt;Before we jump into the Encoder's inner workings, we need to learn the "Language of the Machine." &lt;br&gt;
Let's master the notations using a simple sentence: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"The dog bit the man."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  1. What is &lt;code&gt;d_model&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;In the world of AI, words aren't letters; they are lists of numbers called &lt;strong&gt;vectors&lt;/strong&gt;.&lt;br&gt;
&lt;strong&gt;&lt;code&gt;d_model&lt;/code&gt;&lt;/strong&gt; is the &lt;strong&gt;dimension&lt;/strong&gt; (or the length) of that list. If &lt;code&gt;d_model = 512&lt;/code&gt;, it means every single word in our sentence is represented by a list of 512 different numbers. These numbers capture the "meaning" of the word.&lt;/p&gt;

&lt;p&gt;Here is the step-by-step visual explanation of how a &lt;code&gt;d_model&lt;/code&gt; vector is generated for a single word, using "The" as our example.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: The Vocabulary &amp;amp; One-Hot Vector
&lt;/h4&gt;

&lt;p&gt;Before a model can understand a word, it needs a digital ID. We start with a huge list of every word the model knows (its &lt;strong&gt;Vocabulary&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;We then create a very long vector that is almost entirely zeros, with a single "1" at the position corresponding to our word. This is called a &lt;strong&gt;One-Hot Vector&lt;/strong&gt;. It's simple, but it doesn't capture any meaning—it's just an index.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frz5fj71b6tu3a46bhg6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frz5fj71b6tu3a46bhg6z.png" alt="visualisation explaining finding index in the vocabulary" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: The Embedding Matrix (The Lookup Table)
&lt;/h4&gt;

&lt;p&gt;This is where the magic happens. The model has a giant, learnable matrix called the &lt;strong&gt;Embedding Matrix&lt;/strong&gt;. You can think of it as a massive lookup table.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rows:&lt;/strong&gt; Each row corresponds to a word in the vocabulary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Columns:&lt;/strong&gt; The number of columns is the &lt;code&gt;d_model&lt;/code&gt; size (4 is used here for simplicity; a real model might use 512).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When we feed the One-Hot Vector into this matrix, the "1" acts like a selector switch. It activates and "selects" the corresponding row in the Embedding Matrix. This row contains a list of dense, learnable numbers that represent the word's meaning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faf4ic33xfo7gwl0exkd6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faf4ic33xfo7gwl0exkd6.png" alt="Visualisation explaining finding row with one hot vector in embedding matrix" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3: The Final &lt;code&gt;d_model&lt;/code&gt; Vector
&lt;/h4&gt;

&lt;p&gt;The result of this lookup operation is a single, dense vector. This is the &lt;strong&gt;&lt;code&gt;d_model&lt;/code&gt; vector&lt;/strong&gt; for the word "The".&lt;/p&gt;

&lt;p&gt;Instead of a sparse vector of zeros, we now have a compact list of numbers (of size &lt;code&gt;d_model&lt;/code&gt;) that the model can use to perform mathematical operations. When we do this for every word in the sentence, we get the &lt;code&gt;(Seq, d_model)&lt;/code&gt; input matrix you saw earlier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxnrf2atucpa7ns2xv6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxnrf2atucpa7ns2xv6b.png" alt="visulisation explaining final d_model vector" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  2. What is Sequence Length (&lt;code&gt;Seq&lt;/code&gt;)?
&lt;/h3&gt;

&lt;p&gt;This is simply the number of words (or tokens) we are feeding into the model at once.&lt;br&gt;
For our sentence, &lt;strong&gt;"The dog bit the man"&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The&lt;/strong&gt; (1), &lt;strong&gt;dog&lt;/strong&gt; (2), &lt;strong&gt;bit&lt;/strong&gt; (3), &lt;strong&gt;the&lt;/strong&gt; (4), &lt;strong&gt;man&lt;/strong&gt; (5).&lt;/li&gt;
&lt;li&gt;Our &lt;strong&gt;Sequence Length (&lt;code&gt;Seq&lt;/code&gt;) = 5&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;(A real Transformer splits text into sub-word tokens for efficiency, but for this mental model we will treat each word as one token.)&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  3. The Input Matrix &lt;code&gt;(Seq, d_model)&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;When we stack these words together, we get our input matrix. Imagine a table where each row is a word and each column is a feature of that word. For our 5-word sentence with a model dimension of 4, it looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlt2o83xyurt6b2uj950.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlt2o83xyurt6b2uj950.PNG" alt="Visualisation of input matrix" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4. The Transpose Matrix &lt;code&gt;(d_model, Seq)&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;To perform the &lt;a href="https://dev.to/iamyuvaraj/what-are-transformers-why-do-they-dominate-the-ai-world-2278#2-how-the-transformer-processes-it-the-search-warrant"&gt;"Search Warrant"&lt;/a&gt; logic, the model needs to compare words against each other. To do this mathematically, we &lt;strong&gt;transpose&lt;/strong&gt; the matrix. We flip it so the rows become columns. This allows the model to look at the sentence from a different "angle."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmr57midpgrexubulwpb2.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmr57midpgrexubulwpb2.PNG" alt="Visualisation of transpose matrix" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  5. The Multiplication Concept (How the magic happens)
&lt;/h3&gt;

&lt;p&gt;How does the model calculate how much "dog" relates to "bit"? It uses &lt;strong&gt;Matrix Multiplication&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Even if you aren't a math expert, the mental model is simple: We take a &lt;strong&gt;Row&lt;/strong&gt; from our first matrix (a word) and multiply it against a &lt;strong&gt;Column&lt;/strong&gt; from the second matrix (another word).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We multiply the corresponding numbers.&lt;/li&gt;
&lt;li&gt;We sum them all up.&lt;/li&gt;
&lt;li&gt;The result is a single "Score" that represents the relationship between those two words.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step-by-Step Calculation:&lt;/strong&gt;&lt;br&gt;
Here is a step-by-step visual breakdown of how the matrix multiplication &lt;code&gt;(Seq, d_model) x (d_model, Seq)&lt;/code&gt; works. This process is what creates the "attention scores" between every word in the sentence.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: The Setup
&lt;/h4&gt;

&lt;p&gt;We start with two matrices. &lt;strong&gt;Matrix A&lt;/strong&gt; represents our input sentence where each row is a word vector. &lt;strong&gt;Matrix B&lt;/strong&gt; is the transposed version, where each column is a word vector. We also have an empty &lt;strong&gt;Result Matrix&lt;/strong&gt; where we will store the scores.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9rbf41wdcxkbmiwlf9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9rbf41wdcxkbmiwlf9t.png" alt="Visualisation of input matrix and transposed input matrix" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: Calculating the First Cell
&lt;/h4&gt;

&lt;p&gt;To find the score for how much the first word ("The") relates to itself, we take the &lt;strong&gt;dot product&lt;/strong&gt; of the first row of Matrix A and the first column of Matrix B. We multiply corresponding elements and sum them up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4urtblfnuubq8l59ldf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4urtblfnuubq8l59ldf.png" alt="Visualisation of how to calculate first cell" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3: Moving to the Next Word
&lt;/h4&gt;

&lt;p&gt;We stay on the first row of Matrix A ("The") but move to the &lt;em&gt;second&lt;/em&gt; column of Matrix B ("dog"). The dot product of these two vectors gives us the score for how much "The" relates to "dog".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxj12h7osc2k5loor83pe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxj12h7osc2k5loor83pe.png" alt="Visualisation of how to calculate next word" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 4: Calculating for the Second Row
&lt;/h4&gt;

&lt;p&gt;After completing the first row of the Result Matrix, we move to the &lt;em&gt;second&lt;/em&gt; row of Matrix A ("dog") and reset to the first column of Matrix B ("The"). This gives us the score for how much "dog" relates to "The".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkuxjhutou0jqgz6lawb0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkuxjhutou0jqgz6lawb0.png" alt="Visulaisation of how to calculate next row" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 5: The Final Result Matrix
&lt;/h4&gt;

&lt;p&gt;By repeating this row-by-column multiplication for every combination, we get a final &lt;code&gt;(Seq x Seq)&lt;/code&gt; matrix. This is a map of all pairwise relationships in the sentence, which is the core of the self-attention mechanism.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F216mankjwzo9eqxv1w39.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F216mankjwzo9eqxv1w39.png" alt="Visulaisation of multiplied matrix" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;By representing our sentence as these matrices, the computer doesn't have to read the sentence word-by-word like a human (or an RNN). Because of this matrix structure, the hardware (GPU) can calculate all these word relationships &lt;strong&gt;at the same time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the foundation of &lt;strong&gt;Parallelism.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  References &amp;amp; Further Learning
&lt;/h2&gt;

&lt;p&gt;If you want to dive even deeper into the original research or see these concepts in motion, I highly recommend checking out these foundational resources:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Official Paper:&lt;/strong&gt; &lt;a href="https://arxiv.org/html/1706.03762v7" rel="noopener noreferrer"&gt;"Attention Is All You Need"&lt;/a&gt; (Vaswani et al., 2017) – The research paper that started it all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visual Guide:&lt;/strong&gt; &lt;a href="https://www.youtube.com/watch?v=bCz4OMemCcA" rel="noopener noreferrer"&gt;Transformers Explained Clearly – A fantastic YouTube deep-dive&lt;/a&gt; that helped me visualize the mechanics behind the math.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Now that we have our map and an understanding of the notations, we are ready to build the mental model of the Encoder. In &lt;strong&gt;&lt;a href="https://dev.to/iamyuvaraj/transformers-encoder-deep-dive-part-2-3lig"&gt;Part 2&lt;/a&gt;&lt;/strong&gt;, we will start with &lt;strong&gt;Embeddings and Positional Encoding inside the Encoder&lt;/strong&gt;: the process of turning raw text into these mathematical "Ingredients" and giving them a "Home Address" so the model knows the order of the sentence.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>What are Transformers, Why do they Dominate the AI World?</title>
      <dc:creator>Yuvaraj</dc:creator>
      <pubDate>Sun, 15 Feb 2026 10:47:14 +0000</pubDate>
      <link>https://dev.to/iamyuvaraj/what-are-transformers-why-do-they-dominate-the-ai-world-2278</link>
      <guid>https://dev.to/iamyuvaraj/what-are-transformers-why-do-they-dominate-the-ai-world-2278</guid>
      <description>&lt;p&gt;In the world of AI, we have to deal with &lt;strong&gt;Sequences&lt;/strong&gt;—data where the order isn't just a detail; it's the entire meaning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Language:&lt;/strong&gt; "The dog bit the man" is a very different story than "The man bit the dog."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code:&lt;/strong&gt; &lt;code&gt;x = 5; y = x + 2;&lt;/code&gt; works. &lt;code&gt;y = x + 2; x = 5;&lt;/code&gt; crashes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To process these sequences, AI has evolved through two major architectural eras:&lt;/p&gt;




&lt;h4&gt;
  
  
  1. RNNs (The "Linked List" Approach)
&lt;/h4&gt;

&lt;p&gt;For years, &lt;strong&gt;Recurrent Neural Networks (RNNs)&lt;/strong&gt; were the industry standard. They treat a sentence like a &lt;strong&gt;Ticker Tape&lt;/strong&gt; or a &lt;strong&gt;Linked List&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Logic:&lt;/strong&gt; To understand word #10, the model &lt;em&gt;must&lt;/em&gt; first pass through words #1 through #9 in a strict, sequential order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Constraint:&lt;/strong&gt; It’s a &lt;code&gt;for&lt;/code&gt; loop. It cannot skip ahead, and it cannot process word #10 until word #9 is finished.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhklgogyd2i8s3egxx9gp.png" alt="Visualisation of how the sentence is processed by an RNN" width="800" height="436"&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Transformers (The "Random Access" Approach)
&lt;/h4&gt;

&lt;p&gt;In 2017, a landmark paper titled &lt;em&gt;"Attention is All You Need"&lt;/em&gt; introduced the &lt;strong&gt;Transformer&lt;/strong&gt;. It stopped treating sentences like strings to be iterated over and started treating them like &lt;strong&gt;Arrays with an Index&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Logic:&lt;/strong&gt; It doesn't wait in line. It takes a &lt;strong&gt;Snapshot&lt;/strong&gt; of the entire sequence at once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Breakthrough:&lt;/strong&gt; It sees every item in the "Array" simultaneously. It understands the relationship between word #1 and word #10 without having to "walk" the distance between them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlnwrafaef15hx1f9cwf.png" alt="Visualisation of how the sentence is processed by a Transformer" width="800" height="436"&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How AI "Behaves": The Language Learner Analogy
&lt;/h3&gt;

&lt;p&gt;To understand why the AI world shifted toward Transformers, we need to look at how these models actually experience a sentence. It’s the difference between struggling through a translation and reading fluently.&lt;/p&gt;

&lt;h4&gt;
  
  
  The RNN: The "Beginner Learner"
&lt;/h4&gt;

&lt;p&gt;Imagine an adult who has just started learning a new language. They are reading a long, complex sentence with a dictionary in hand.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Struggle:&lt;/strong&gt; They translate the first word, then the second, then the third. Because they process data sequentially, their mental energy is entirely spent on the &lt;em&gt;current&lt;/em&gt; word.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Result:&lt;/strong&gt; By the time they reach the end of a long paragraph, the specific details of the beginning have started to fade. They have a very narrow "window" of focus. If the beginning of the sentence affects the end, they often have to stop and re-read.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  The Transformer: The "Fluent Reader"
&lt;/h4&gt;

&lt;p&gt;Now, imagine a fluent adult reading the same sentence. They don’t process it word-by-word in a vacuum.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Behavior:&lt;/strong&gt; Their eyes scan the entire block of text almost instantly. Even as they read the final word, they remain "aware" of the subject at the very beginning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Advantage:&lt;/strong&gt; They can ignore "filler" words and focus their &lt;strong&gt;Attention&lt;/strong&gt; only on the words that carry the most meaning. They aren't just "remembering" the start of the sentence; they are actively &lt;strong&gt;connecting&lt;/strong&gt; it to the end.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The "Memory" Problem
&lt;/h2&gt;

&lt;p&gt;The problem with being a "Beginner Learner" (RNN) isn't just that it’s slow—it’s that it's &lt;strong&gt;unreliable over long distances.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzfabo8eu9r0biecfeyvv.png" alt="Visualisation of RNN vs Transformer processing the sentence, with the analogy explained above" width="800" height="436"&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deep dive to understand it better
&lt;/h3&gt;

&lt;p&gt;To truly understand why the AI world moved toward Transformers, we need to see where the "old way" breaks. In linguistics, we call this a &lt;strong&gt;Long-Range Dependency&lt;/strong&gt; problem—when two words that need each other are separated by a long distance.&lt;/p&gt;

&lt;p&gt;Let’s look at this deceptively simple sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The &lt;strong&gt;keys&lt;/strong&gt; to the old house that my grandfather built in 1945 were &lt;strong&gt;lost&lt;/strong&gt;."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  The Challenge:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The subject is &lt;strong&gt;"keys"&lt;/strong&gt; (plural).&lt;/li&gt;
&lt;li&gt;Therefore, the verb at the end must be &lt;strong&gt;"were lost"&lt;/strong&gt; (plural).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Between the subject and the verb sit three singular distractors designed to confuse the model: &lt;em&gt;house&lt;/em&gt;, &lt;em&gt;grandfather&lt;/em&gt;, and &lt;em&gt;1945&lt;/em&gt;.&lt;/p&gt;




&lt;h4&gt;
  
  
  1. How the RNN processes it: The Drunken Narrator
&lt;/h4&gt;

&lt;p&gt;As our "Beginner Learner," the RNN must carry the memory of the first word through the entire sentence, step-by-step. We call this the &lt;strong&gt;Drunken Narrator&lt;/strong&gt; effect. Imagine a narrator telling a story, but with every new word, their memory of the start gets a little fuzzier.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start:&lt;/strong&gt; The RNN reads "The &lt;strong&gt;keys&lt;/strong&gt;." Internal state: &lt;em&gt;Subject is Plural&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Middle:&lt;/strong&gt; It reads "...house..." The memory updates. The singular "house" slightly dilutes the "plural" signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distraction:&lt;/strong&gt; It reads "...grandfather... 1945." After three singular nouns in a row, the original "plural" signal is now a faint whisper.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure:&lt;/strong&gt; It reaches the end and needs to predict the verb. Since the most recent memory is singular ("1945"), it incorrectly predicts: &lt;strong&gt;"was lost."&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Technical Reality:&lt;/strong&gt; This is the &lt;strong&gt;Vanishing Gradient&lt;/strong&gt; problem. In long sequences, the training signal from the beginning shrinks a little at every step, so by the time it reaches the end it has effectively "vanished."&lt;/p&gt;
&lt;/blockquote&gt;
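&lt;p&gt;We can sketch this fading signal numerically. The decay factor and step counts below are illustrative, not taken from any real network; the point is only that multiplying by a value below 1 at every word shrinks the signal exponentially:&lt;/p&gt;

```javascript
// Toy illustration of the vanishing gradient: backpropagating through
// each time step multiplies the gradient by a local derivative < 1.
function survivingSignal(steps, localDerivative = 0.5) {
  let signal = 1.0; // strength of the "keys is plural" signal at the start
  for (let i = 0; i < steps; i++) {
    signal *= localDerivative; // shrinks at every word in between
  }
  return signal;
}

console.log(survivingSignal(3));  // short sentence: 0.125
console.log(survivingSignal(12)); // long sentence: ~0.000244
```

&lt;p&gt;After a dozen words, the original signal is a fraction of a percent of its starting strength. That is the "faint whisper" our narrator is left with.&lt;/p&gt;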

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qyva92nds11iinjedeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qyva92nds11iinjedeg.png" alt="Visualization explaining how RNN processing sentence" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  2. How the Transformer processes it: The Search Warrant
&lt;/h4&gt;

&lt;p&gt;The Transformer (our "Fluent Reader") doesn't struggle with memory. It processes the whole sentence at once using the &lt;strong&gt;Search Warrant&lt;/strong&gt; approach.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Setup:&lt;/strong&gt; The Transformer takes a snapshot. It sees "keys" and "lost" simultaneously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Query:&lt;/strong&gt; To understand the word "&lt;strong&gt;lost&lt;/strong&gt;," it doesn't rely on a fading memory. It issues a &lt;strong&gt;Query&lt;/strong&gt; to every other word: &lt;em&gt;"Who is the subject of being lost?"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Attention:&lt;/strong&gt; "House" and "Grandfather" return low scores, while "&lt;strong&gt;Keys&lt;/strong&gt;" returns a massive &lt;strong&gt;Attention Score&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Success:&lt;/strong&gt; The model forms a direct, high-speed connection between "&lt;strong&gt;lost&lt;/strong&gt;" and "&lt;strong&gt;keys&lt;/strong&gt;," ignoring the "distance" entirely. It correctly predicts: &lt;strong&gt;"were lost."&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Technical Reality:&lt;/strong&gt; This is &lt;strong&gt;Self-Attention&lt;/strong&gt;. It allows any word to "attend" to any other word in the sequence, so the effective distance between any two words is a single step, no matter how far apart they sit.&lt;/p&gt;
&lt;/blockquote&gt;
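&lt;p&gt;The Query and Attention steps can be sketched in a few lines. The word vectors here are invented for illustration (real models learn them, and use separate query/key projections), but the mechanics are the real ones: dot products turned into a probability distribution by softmax.&lt;/p&gt;

```javascript
// Toy self-attention: "lost" scores every other word by dot product,
// then softmax turns the scores into attention weights that sum to 1.
// The 2-D vectors are invented for illustration; real models learn them.
const words = {
  keys:        [0.9, 0.1],
  house:       [0.1, 0.8],
  grandfather: [0.2, 0.7],
  lost:        [1.0, 0.1], // the querying word
};

const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

function attentionWeights(queryWord) {
  const q = words[queryWord];
  const others = Object.keys(words).filter(w => w !== queryWord);
  const scores = others.map(w => dot(q, words[w]));
  const exps = scores.map(Math.exp);             // softmax, step 1: exponentiate
  const total = exps.reduce((a, b) => a + b, 0); // softmax, step 2: normalize
  return Object.fromEntries(others.map((w, i) => [w, exps[i] / total]));
}

console.log(attentionWeights("lost")); // "keys" gets the largest weight
```

&lt;p&gt;Because every word can issue such a query against every other word in the same snapshot, no signal has to survive a step-by-step relay.&lt;/p&gt;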

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftv9rheky8fn0cfgyg7zd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftv9rheky8fn0cfgyg7zd.png" alt="Visualization explaining how transformer processing sentence" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Summary: Why Transformers are better
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;RNN (Drunken Narrator)&lt;/th&gt;
&lt;th&gt;Transformer (Search Warrant)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Processing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sequential (Slow)&lt;/td&gt;
&lt;td&gt;Parallel (Fast)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fades with distance&lt;/td&gt;
&lt;td&gt;Perfect, direct access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The "Keys" Test&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Fails&lt;/strong&gt; (confused by "1945")&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Succeeds&lt;/strong&gt; (looks at "keys" directly)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Conclusion: The New Standard
&lt;/h2&gt;

&lt;p&gt;The transition from RNNs to Transformers wasn't just a minor upgrade; it was a fundamental shift in how we handle information. By moving from the &lt;strong&gt;"Drunken Narrator"&lt;/strong&gt; (Sequential) to the &lt;strong&gt;"Search Warrant"&lt;/strong&gt; (Parallel), we unlocked the ability to train on the massive scale of data that powers today’s LLMs.&lt;/p&gt;

&lt;p&gt;As a developer, understanding this shift is crucial. It’s the difference between building a system that merely follows a loop and one that understands the entire context of its environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Frontend Perspective
&lt;/h3&gt;

&lt;p&gt;As a &lt;strong&gt;Frontend Developer&lt;/strong&gt;, I’m used to thinking about state and data flow. Seeing how Transformers manage 'context' through &lt;strong&gt;Attention&lt;/strong&gt; feels remarkably similar to modern state management—it's about making sure the right information is available at the right time, regardless of where it lives in the application.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vaswani, A., et al. (2017).&lt;/strong&gt; &lt;em&gt;&lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;"Attention Is All You Need"&lt;/a&gt;&lt;/em&gt;. &lt;em&gt;Advances in Neural Information Processing Systems.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;What’s Next?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Let me know in the comments: which analogy clicked better for you, the "language learner" or the "Search Warrant"?&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>transformers</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>React Compiler: Stop Manually Optimizing Your React Apps</title>
      <dc:creator>Yuvaraj</dc:creator>
      <pubDate>Tue, 03 Feb 2026 21:47:00 +0000</pubDate>
      <link>https://dev.to/iamyuvaraj/react-compiler-stop-manually-optimizing-your-react-apps-5co4</link>
      <guid>https://dev.to/iamyuvaraj/react-compiler-stop-manually-optimizing-your-react-apps-5co4</guid>
<description>&lt;p&gt;During our team KATA session, a colleague asked a question that I bet you've thought about too:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"If React already knows to only render the elements that changed, why do we need to optimize anything manually?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was a brilliant question. The answer reveals a major pain point we’ve lived with for years, and it shows exactly why the React Compiler exists. Let's see how it addresses the problem.&lt;/p&gt;

&lt;p&gt;Let’s take a journey through the evolution of React optimization, using a simple analogy: &lt;strong&gt;The Restaurant Kitchen&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  🍝 The Restaurant Kitchen: How React &lt;em&gt;Actually&lt;/em&gt; Works
&lt;/h2&gt;

&lt;p&gt;Imagine your App is a kitchen.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Head Chef (Parent Component):&lt;/strong&gt; Manages the kitchen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Line cooks (Child Components):&lt;/strong&gt; Cook their own dishes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a standard React app, every time the Head Chef changes something—even just restocking the salt—they ring a giant bell. &lt;strong&gt;Every single cook stops and redoes their work from scratch&lt;/strong&gt;, even if the ingredients for their dish haven’t changed.&lt;/p&gt;

&lt;p&gt;This is React’s default behavior: &lt;strong&gt;When a parent re-renders, all children re-render.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For years, to stop this waste, we had to write additional code (memoization hooks) to tell React's optimization machinery what was safe to skip. Let’s look at how a single component evolved: first without hooks, then with manual hooks, and finally with the React Compiler optimizing it automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Evolution of a Component
&lt;/h2&gt;

&lt;p&gt;Let's look at a &lt;code&gt;RestaurantMenu&lt;/code&gt; that does three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Holds a list of dishes.&lt;/li&gt;
&lt;li&gt;Filters them (an expensive calculation).&lt;/li&gt;
&lt;li&gt;Renders a list of items (child components).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Phase 1: The Code (Clean but Slow)
&lt;/h3&gt;

&lt;p&gt;Here is the code most beginners write. It looks clean, but it has hidden performance traps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;useState&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;react&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// A simple child component&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;DishList&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;dishes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;onOrder&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;🍝 Rendering DishList (Child)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// &amp;lt;--- Watch this log!&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* items... */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;;
&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;RestaurantMenu&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;allDishes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;theme&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setCategory&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pasta&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// ⚠️ PROBLEM 1: Expensive Calculation runs every render&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filteredDishes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;allDishes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;🧮 Filtering... (Slow Math)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handleOrder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Ordered:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt; &lt;span class="nx"&gt;className&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* Clicking this causes a re-render */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt; &lt;span class="nx"&gt;onClick&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setCategory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;salad&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Switch&lt;/span&gt; &lt;span class="nx"&gt;Category&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/button&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* ⚠️ PROBLEM 2: Inline Arrow Function */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* Writing (dish) =&amp;gt; handleOrder(dish) creates a BRAND NEW function 
          in memory every single time this component renders. 
          This forces DishList to re-render. */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;DishList&lt;/span&gt; 
        &lt;span class="nx"&gt;dishes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;filteredDishes&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; 
        &lt;span class="nx"&gt;onOrder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;handleOrder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt; 
      &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happens in the Console?&lt;/strong&gt;&lt;br&gt;
Even if the parent re-renders for a minor reason (or if we click the button), everything runs again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🖥️ CONSOLE OUTPUT:
---------------------------------------------
🧮 Filtering... (Slow Math)
🍝 Rendering DishList (Child)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(Every single interaction triggers these logs. Wasteful!)&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Phase 2: The Solution with Hooks (Manual Instructions)
&lt;/h3&gt;

&lt;p&gt;To fix this in React, we had to introduce hooks: we wrap the calculation in &lt;code&gt;useMemo&lt;/code&gt;, the handler in &lt;code&gt;useCallback&lt;/code&gt;, and the child component in &lt;code&gt;memo&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;useMemo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;useCallback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;memo&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;react&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Solution A: Wrap child in memo to prevent useless re-renders&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;DishList&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;memo&lt;/span&gt;&lt;span class="p"&gt;(({&lt;/span&gt; &lt;span class="nx"&gt;dishes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;onOrder&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;🍝 Rendering DishList (Child)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* items... */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;;
&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;RestaurantMenu&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;allDishes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;theme&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setCategory&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pasta&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Solution B: Cache calculation with useMemo&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filteredDishes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useMemo&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;🧮 Filtering... (Slow Math)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;allDishes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;allDishes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt; 

  &lt;span class="c1"&gt;// Solution C: Freeze function with useCallback&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handleOrder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useCallback&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Ordered:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[]);&lt;/span&gt; 

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt; &lt;span class="nx"&gt;className&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt; &lt;span class="nx"&gt;onClick&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setCategory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;salad&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Switch&lt;/span&gt; &lt;span class="nx"&gt;Category&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/button&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* ⚠️ THE TRAP: We CANNOT use an inline arrow here! 
          If we wrote: onOrder={(dish) =&amp;gt; handleOrder(dish)}
          It would BREAK the optimization because the arrow wrapper 
          is a new reference. We are FORCED to pass the function directly. */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;DishList&lt;/span&gt; 
        &lt;span class="nx"&gt;dishes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;filteredDishes&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; 
        &lt;span class="nx"&gt;onOrder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;handleOrder&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; 
      &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happens in the Console now?&lt;/strong&gt;&lt;br&gt;
If the parent re-renders (for example, if &lt;code&gt;theme&lt;/code&gt; changes but &lt;code&gt;category&lt;/code&gt; stays the same), the console stays silent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🖥️ CONSOLE OUTPUT:
---------------------------------------------
(Silent. No logs appear.)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(Performance is achieved, but the code is harder to read because of the hook boilerplate.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;And it is fragile: if a colleague changes &lt;code&gt;onOrder={handleOrder}&lt;/code&gt; to &lt;code&gt;onOrder={() =&amp;gt; handleOrder()}&lt;/code&gt;, the optimization breaks silently. The arrow wrapper creates a new function on every render, so &lt;code&gt;memo&lt;/code&gt; sees a "new" prop and re-renders &lt;code&gt;DishList&lt;/code&gt; anyway.&lt;/em&gt;&lt;/p&gt;
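&lt;p&gt;The reason the break is silent is plain JavaScript reference equality, which is what &lt;code&gt;memo&lt;/code&gt;'s default prop comparison relies on. A minimal sketch outside React:&lt;/p&gt;

```javascript
const handleOrder = (dish) => console.log("Ordered:", dish);

// memo() compares each prop with Object.is, i.e. reference equality.
// Passing the same function object on two renders: props look equal.
const renderA = { onOrder: handleOrder };
const renderB = { onOrder: handleOrder };
console.log(Object.is(renderA.onOrder, renderB.onOrder)); // true, so skip re-render

// Wrapping it in a fresh arrow creates a brand-new function each time:
const renderC = { onOrder: (dish) => handleOrder(dish) };
const renderD = { onOrder: (dish) => handleOrder(dish) };
console.log(Object.is(renderC.onOrder, renderD.onOrder)); // false, so re-render
```

&lt;p&gt;Two arrow functions with identical source code are still two different objects in memory, which is all &lt;code&gt;memo&lt;/code&gt; can see.&lt;/p&gt;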




&lt;h3&gt;
  
  
  Phase 3: The React Compiler Solution (No Additional Code)
&lt;/h3&gt;

&lt;p&gt;This is the magic of the React Compiler. You go back to writing the code from &lt;strong&gt;Phase 1&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// No useMemo. No useCallback. No memo.&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;RestaurantMenu&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;allDishes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;theme&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setCategory&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pasta&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// The Compiler AUTOMATICALLY memoizes this&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filteredDishes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;allDishes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;🧮 Filtering... (Slow Math)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// The Compiler AUTOMATICALLY stabilizes this function&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handleOrder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Ordered:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt; &lt;span class="nx"&gt;className&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt; &lt;span class="nx"&gt;onClick&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setCategory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;salad&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Switch&lt;/span&gt; &lt;span class="nx"&gt;Category&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/button&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* ✅ COMPILER MAGIC: We can use an inline arrow again! 
          The compiler is smart enough to "memoize" this arrow function 
          wrapper automatically. It sees that 'handleOrder' is stable, 
          so it makes this arrow stable too. */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;DishList&lt;/span&gt; &lt;span class="nx"&gt;dishes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;filteredDishes&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="nx"&gt;onOrder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;handleOrder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dish&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happens in the Console?&lt;/strong&gt;&lt;br&gt;
Even though we deleted all the hooks, the result is identical to Phase 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🖥️ CONSOLE OUTPUT:
---------------------------------------------
(Silent. No logs appear.)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What just happened?&lt;/strong&gt;&lt;br&gt;
The React Compiler analyzed your code at build time. It understands data flow better than we do.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It sees &lt;code&gt;filteredDishes&lt;/code&gt; only changes when &lt;code&gt;category&lt;/code&gt; changes.&lt;/li&gt;
&lt;li&gt;It sees you wrapped &lt;code&gt;handleOrder&lt;/code&gt; in an arrow function: &lt;code&gt;(dish) =&amp;gt; handleOrder(dish)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;It automatically caches that arrow function wrapper so it remains the exact same reference across renders.&lt;/li&gt;
&lt;li&gt;It effectively generates the optimized code from &lt;strong&gt;Phase 2&lt;/strong&gt; for you, behind the scenes.&lt;/li&gt;
&lt;/ul&gt;
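To make "the exact same reference across renders" concrete, here is a minimal plain-JavaScript sketch (no JSX; `createRenderCache`, `memo`, and `render` are hypothetical helpers for illustration, not real React or compiler APIs) of the caching idea the compiler applies:

```javascript
// Hypothetical cache mimicking the compiler's generated memoization.
function createRenderCache() {
  const cache = new Map();
  return function memo(key, factory) {
    if (!cache.has(key)) cache.set(key, factory());
    return cache.get(key); // same reference on every later call
  };
}

const memo = createRenderCache();

// A stand-in for a component body that runs on every render.
function render() {
  // Without caching: a brand-new arrow is created on every render.
  const freshWrapper = (dish) => console.log("Ordered:", dish);

  // With caching (what the compiler generates for us): the wrapper
  // function is created once and reused on subsequent renders.
  const stableWrapper = memo("onOrder", () => (dish) => console.log("Ordered:", dish));

  return { freshWrapper, stableWrapper };
}

const first = render();
const second = render();

console.log(first.freshWrapper === second.freshWrapper);   // false → child would re-render
console.log(first.stableWrapper === second.stableWrapper); // true  → child can skip re-rendering
```

Because `DishList` receives the same `onOrder` reference each time, React can bail out of re-rendering it, exactly as if we had written `useCallback` by hand.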




&lt;h2&gt;
  
  
  The Philosophy Shift
&lt;/h2&gt;

&lt;p&gt;For years, we had to manually tell the framework: &lt;em&gt;"Remember this variable! Freeze this function!"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The React Compiler solves this problem.&lt;/strong&gt;&lt;br&gt;
React now assumes the burden of optimization. It allows us to stop worrying about render cycles and dependency arrays, and start focusing on what actually matters: &lt;strong&gt;shipping features.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What Now?
&lt;/h3&gt;

&lt;p&gt;The best part is that the React Compiler is &lt;strong&gt;backward compatible (it can target React v17 and v18 as well)&lt;/strong&gt;. You don't have to rewrite your codebase: you can enable it, and it will optimize your "plain" components while leaving your existing &lt;code&gt;useMemo&lt;/code&gt; and &lt;code&gt;useCallback&lt;/code&gt; calls untouched.&lt;/p&gt;
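As a sketch of what enabling it looks like, here is a minimal Babel configuration using `babel-plugin-react-compiler` with the `target` option for older React versions (the exact option names may differ between compiler releases, so treat this as an assumption and check the official docs for your version):

```javascript
// babel.config.js — minimal sketch of enabling the React Compiler.
// For React 17/18, `target` tells the plugin to emit calls against the
// separate react-compiler-runtime package instead of React 19's built-ins.
module.exports = {
  plugins: [
    ["babel-plugin-react-compiler", { target: "18" }],
  ],
};
```

With this in place, the compiler runs at build time and memoizes eligible components automatically; no component code changes are required.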




&lt;p&gt;&lt;em&gt;Thanks for reading! This is my first post on Dev.to, and I wrote it to help solidify my own understanding of the Compiler. I’d love your feedback—did the restaurant analogy make sense to you? Let me know in the comments!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>welcome</category>
      <category>react</category>
      <category>webdev</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
