<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rijul Rajesh</title>
    <description>The latest articles on DEV Community by Rijul Rajesh (@rijultp).</description>
    <link>https://dev.to/rijultp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1207862%2F2d1456e5-ef74-42a1-ac31-d0e6d6bc547f.webp</url>
      <title>DEV Community: Rijul Rajesh</title>
      <link>https://dev.to/rijultp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rijultp"/>
    <language>en</language>
    <item>
      <title>Understanding Multi-Head Attention in Transformers</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Sun, 03 May 2026 20:08:43 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-multi-head-attention-in-transformers-gj1</link>
      <guid>https://dev.to/rijultp/understanding-multi-head-attention-in-transformers-gj1</guid>
      <description>&lt;p&gt;Self-attention already helps a transformer understand relationships between words using Query, Key, and Value. But there’s a problem.&lt;/p&gt;

&lt;p&gt;One attention mechanism usually ends up focusing on a limited kind of relationship at a time.&lt;/p&gt;

&lt;p&gt;Language doesn’t work like that. A sentence can have structure, meaning, and long-range links all at once.&lt;/p&gt;

&lt;p&gt;That’s why transformers use &lt;strong&gt;multi-head attention&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happens in multi-head attention
&lt;/h2&gt;

&lt;p&gt;Instead of doing attention once, the model does it multiple times in parallel.&lt;/p&gt;

&lt;p&gt;Each run is called a head, and each head has its own learned weights for Query, Key, and Value.&lt;/p&gt;

&lt;p&gt;So every head looks at the same sentence, but in its own way.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it flows
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The input embeddings are first prepared&lt;/li&gt;
&lt;li&gt;They are split into multiple heads using linear projections&lt;/li&gt;
&lt;li&gt;Each head runs its own self-attention&lt;/li&gt;
&lt;li&gt;Each head produces its own output&lt;/li&gt;
&lt;li&gt;All outputs are joined back together&lt;/li&gt;
&lt;li&gt;A final layer mixes them into one result&lt;/li&gt;
&lt;/ul&gt;
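&lt;p&gt;The flow above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up sizes (4 tokens, embedding size 8, 2 heads) and random weights, not a trained model:&lt;/p&gt;

```python
# Minimal multi-head self-attention sketch. Sizes and weights are
# illustrative assumptions, not values from a real transformer.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    outputs = []
    for _ in range(n_heads):
        # Each head has its own learned weights for Query, Key, and Value.
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        scores = softmax(Q @ K.T / np.sqrt(d_head))  # attention per head
        outputs.append(scores @ V)                   # each head's own output
    concat = np.concatenate(outputs, axis=-1)        # join heads back together
    W_o = rng.normal(size=(d_model, d_model))        # final layer mixes them
    return concat @ W_o

x = rng.normal(size=(4, 8))   # 4 tokens, embedding size 8
out = multi_head_attention(x, n_heads=2)
print(out.shape)              # same shape as the input: (4, 8)
```

&lt;p&gt;Note that the output has the same shape as the input, which is what lets attention layers be stacked.&lt;/p&gt;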

&lt;h2&gt;
  
  
  Why this works better than a single attention head
&lt;/h2&gt;

&lt;p&gt;Different heads naturally pick up different things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;word order and grammar&lt;/li&gt;
&lt;li&gt;nearby word relationships&lt;/li&gt;
&lt;li&gt;long-distance links&lt;/li&gt;
&lt;li&gt;meaning-based connections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So instead of forcing one attention mechanism to do everything, the model spreads the job across multiple perspectives.&lt;/p&gt;

&lt;p&gt;One head is like reading a sentence with one focus.&lt;/p&gt;

&lt;p&gt;Using multiple heads is like reading it several times, each time noticing something different, then combining those notes.&lt;/p&gt;

&lt;p&gt;Multi-head attention doesn’t change the idea of self-attention. It just runs it multiple times in parallel so the model can understand language from different angles at once.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Looking for an easier way to install tools, libraries, or entire repositories?&lt;/strong&gt;&lt;br&gt;
Try &lt;strong&gt;Installerpedia&lt;/strong&gt;: a &lt;strong&gt;community-driven, structured installation platform&lt;/strong&gt; that lets you install almost anything with &lt;strong&gt;minimal hassle&lt;/strong&gt; and &lt;strong&gt;clear, reliable guidance&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Just run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ipm &lt;span class="nb"&gt;install &lt;/span&gt;repo-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;… and you’re done! 🚀&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hexmos.com/ipm" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2s3mzj8pfcq94a1y4at.png" alt="Installerpedia Screenshot" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://hexmos.com/ipm/" rel="noopener noreferrer"&gt;&lt;strong&gt;Explore Installerpedia here&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Transformers Part 17: Generating the Output Word</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Fri, 01 May 2026 20:53:08 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-transformers-part-17-generating-the-output-word-35ol</link>
      <guid>https://dev.to/rijultp/understanding-transformers-part-17-generating-the-output-word-35ol</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-transformers-part-16-preparing-for-output-prediction-with-residual-connections-1n06"&gt;previous article&lt;/a&gt;, we set up the residual connections to get the final output values from the decoder.&lt;/p&gt;

&lt;p&gt;In this article, we begin by passing these two output values through a fully connected layer.&lt;/p&gt;

&lt;p&gt;This layer has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One input for each value representing the current token
(in this case, 2 inputs)&lt;/li&gt;
&lt;li&gt;One output for each word in the output vocabulary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since our vocabulary has 4 tokens, this gives us 4 output values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fch2kyaj0ttufavtu4qe2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fch2kyaj0ttufavtu4qe2.png" alt=" " width="680" height="758"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Selecting the Output Word
&lt;/h4&gt;

&lt;p&gt;Next, we pass these 4 output values through a softmax function.&lt;/p&gt;

&lt;p&gt;This allows us to select the most likely output word, which in this case is “vamos”.&lt;/p&gt;

&lt;p&gt;So far, the translation is correct. However, the process does not stop here.&lt;/p&gt;
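&lt;p&gt;The two steps above (fully connected layer, then softmax) can be sketched like this. The weight and decoder-output numbers are made up for illustration; only the shapes (2 inputs, 4 vocabulary outputs) follow the article:&lt;/p&gt;

```python
# Sketch: 2 decoder values -> fully connected layer -> 4 logits -> softmax.
import numpy as np

vocab = ["ir", "vamos", "y", "<EOS>"]

decoder_out = np.array([1.0, -0.5])       # two values for the current token
W = np.array([[ 0.2, 1.5, -0.3, 0.1],     # fully connected layer, shape (2, 4)
              [-0.4, 0.6,  0.2, 0.9]])
b = np.zeros(4)

logits = decoder_out @ W + b              # one value per vocabulary token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax -> probabilities

print(vocab[int(np.argmax(probs))])       # prints "vamos"
```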

&lt;h4&gt;
  
  
  Continuing the Decoding Process
&lt;/h4&gt;

&lt;p&gt;The decoder continues generating words until it produces an &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token, which indicates the end of the sentence.&lt;/p&gt;

&lt;p&gt;To generate the next word, we feed the predicted word back into the decoder.&lt;/p&gt;

&lt;p&gt;We will explore this step in the next article.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Transformers – Part 16: Preparing for Output Prediction with Residual Connections</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Wed, 29 Apr 2026 21:26:20 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-transformers-part-16-preparing-for-output-prediction-with-residual-connections-1n06</link>
      <guid>https://dev.to/rijultp/understanding-transformers-part-16-preparing-for-output-prediction-with-residual-connections-1n06</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-transformers-part-15-scaling-and-combining-values-in-encoder-decoder-attention-4dfm"&gt;previous article&lt;/a&gt;, we handled values in encoder-decoder attention, now we will simplify the diagram a bit add another set of residual connections.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2o38bbt86p8t40cvkss9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2o38bbt86p8t40cvkss9.png" alt=" " width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This allows the encoder–decoder attention to focus on the relationships between the output words and the input words, without needing to preserve the self-attention and positional encoding from earlier.&lt;/p&gt;
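&lt;p&gt;A residual connection is nothing more than adding a layer's input back to its output, which is what carries the earlier self-attention and positional information forward. A tiny sketch, with made-up numbers:&lt;/p&gt;

```python
# Residual connection sketch: the attention output is added back to the
# values that entered the layer. Both vectors are illustrative numbers.
import numpy as np

pre_attention = np.array([2.70, -0.34])       # values entering the layer
attention_out = np.array([-1.10, 0.85])       # made-up attention output

residual_out = pre_attention + attention_out  # residual = simple addition
print(residual_out)
```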

&lt;p&gt;Lastly, we need a way to take these two values that represent the &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token in the decoder and select one of the four output tokens: ir, vamos, y, or &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4u4mu2hg41fyec6edqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4u4mu2hg41fyec6edqc.png" alt=" " width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To do this, we pass these two values through a fully connected layer.&lt;/p&gt;

&lt;p&gt;We will explore this further in the next article.&lt;/p&gt;


</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Transformers Part 15: Scaling and Combining Values in Encoder–Decoder Attention</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Tue, 28 Apr 2026 19:23:58 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-transformers-part-15-scaling-and-combining-values-in-encoder-decoder-attention-4dfm</link>
      <guid>https://dev.to/rijultp/understanding-transformers-part-15-scaling-and-combining-values-in-encoder-decoder-attention-4dfm</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-transformers-part-14-calculating-encoder-decoder-attention-2hjl"&gt;previous article&lt;/a&gt;, we gained an understanding how much each input word contributes, in this article we will start to compute the value vectors for each input word and combine them accordingly.&lt;/p&gt;

&lt;p&gt;We &lt;strong&gt;scale those values using the Softmax percentages&lt;/strong&gt;, and &lt;strong&gt;add the scaled values together&lt;/strong&gt; to obtain the &lt;strong&gt;encoder–decoder attention values&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwqch95hes73nhxgo4a9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwqch95hes73nhxgo4a9.png" alt=" " width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;sets of weights&lt;/strong&gt; used to calculate the &lt;strong&gt;queries, keys, and values&lt;/strong&gt; for &lt;strong&gt;encoder–decoder attention&lt;/strong&gt; are &lt;strong&gt;different&lt;/strong&gt; from the sets of weights used in &lt;strong&gt;self-attention&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Just like in &lt;strong&gt;self-attention&lt;/strong&gt;, these sets of weights are &lt;strong&gt;copied and reused for each word&lt;/strong&gt;, which allows the model to be &lt;strong&gt;flexible with different input and output lengths&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We can also &lt;strong&gt;stack encoder–decoder attention layers&lt;/strong&gt;, just like we do with &lt;strong&gt;self-attention&lt;/strong&gt;, to better handle &lt;strong&gt;more complex phrases&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We will continue with more details in the next article.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Transformers Part 14: Calculating Encoder–Decoder Attention</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Sun, 26 Apr 2026 19:34:55 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-transformers-part-14-calculating-encoder-decoder-attention-2hjl</link>
      <guid>https://dev.to/rijultp/understanding-transformers-part-14-calculating-encoder-decoder-attention-2hjl</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-transformers-part-13-introducing-encoder-decoder-attention-544e"&gt;previous article&lt;/a&gt;, we just began introducing the concept of encoder-decoder attention. &lt;/p&gt;

&lt;p&gt;Now let’s start digging into the details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Encoder–Decoder Attention in Action
&lt;/h2&gt;

&lt;p&gt;Just like in self-attention, we start by creating &lt;strong&gt;query values&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this case, we create &lt;strong&gt;two values to represent the query for the &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token&lt;/strong&gt; in the decoder.&lt;/p&gt;

&lt;p&gt;Next, we create &lt;strong&gt;key values for each word in the encoder output&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvnlukey2sy7vbcl7ian.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvnlukey2sy7vbcl7ian.png" alt=" " width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Calculating Similarity
&lt;/h2&gt;

&lt;p&gt;Now, we calculate the similarity between the &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token in the decoder and each word in the encoder.&lt;/p&gt;

&lt;p&gt;This is done using the &lt;strong&gt;dot product&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tzv38qv3z46f32lynvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tzv38qv3z46f32lynvh.png" alt=" " width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Applying Softmax
&lt;/h2&gt;

&lt;p&gt;We then pass these similarity scores through a &lt;strong&gt;softmax function&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydq1ynkcl9ji2sb1hnrd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydq1ynkcl9ji2sb1hnrd.png" alt=" " width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This gives us weights that determine how much attention the decoder should pay to each input word.&lt;/p&gt;

&lt;p&gt;In this example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first input word gets &lt;strong&gt;100% attention&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The second word gets &lt;strong&gt;0% attention&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means the decoder will focus entirely on the first input word when deciding the first translated word.&lt;/p&gt;
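&lt;p&gt;The dot-product-and-softmax steps above can be sketched as follows. The query and key numbers are illustrative assumptions, chosen so that the first input word gets nearly all of the attention, as in the example:&lt;/p&gt;

```python
# Similarity via dot products, then softmax to get attention weights.
import numpy as np

query = np.array([1.0, 2.0])      # query for the <EOS> token in the decoder
keys = np.array([[3.0, 1.0],      # key for input word 1 (made up)
                 [-2.0, 0.5]])    # key for input word 2 (made up)

scores = keys @ query             # dot-product similarities: [5.0, -1.0]
weights = np.exp(scores - scores.max())
weights /= weights.sum()          # softmax -> attention percentages

print(weights)                    # first word gets almost 100% attention
```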

&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;Now that we know how much each input word contributes, the next step is to compute the &lt;strong&gt;value vectors&lt;/strong&gt; for each input word and combine them accordingly.&lt;/p&gt;

&lt;p&gt;We will explore this in the next article.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Transformers Part 13: Introducing Encoder–Decoder Attention</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Sat, 25 Apr 2026 19:40:51 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-transformers-part-13-introducing-encoder-decoder-attention-544e</link>
      <guid>https://dev.to/rijultp/understanding-transformers-part-13-introducing-encoder-decoder-attention-544e</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-transformers-part-12-building-the-decoder-layers-36j"&gt;previous article&lt;/a&gt;, we built up the decoder layers and stopped at the relationship between input and output sentence.&lt;/p&gt;

&lt;p&gt;This brings us to the concept of encoder–decoder attention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Encoder–Decoder Attention Matters
&lt;/h2&gt;

&lt;p&gt;Consider the input sentence:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“Don’t eat the delicious looking and smelling pizza.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When translating this sentence, it is very important to keep track of the word &lt;strong&gt;“Don’t”&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If the translation ignores this word, we might end up with:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“Eat the delicious looking and smelling pizza.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These two sentences have completely opposite meanings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Idea
&lt;/h2&gt;

&lt;p&gt;Because of this, the decoder must pay close attention to the &lt;strong&gt;important words in the input&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;encoder–decoder attention&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;It allows the decoder to focus on the most relevant parts of the input sentence while generating the output.&lt;/p&gt;

&lt;p&gt;In simple terms, encoder–decoder attention helps the decoder keep track of significant words in the input.&lt;/p&gt;

&lt;h2&gt;
  
  
  Updated Transformer Structure
&lt;/h2&gt;

&lt;p&gt;With this idea, our current encoder–decoder structure looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lynkdp6qyj8zidu47zf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lynkdp6qyj8zidu47zf.png" alt=" " width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will build on this and explore the details in the next article.&lt;/p&gt;





</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Understanding Transformers Part 12: Building the Decoder Layers</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:23:30 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-transformers-part-12-building-the-decoder-layers-36j</link>
      <guid>https://dev.to/rijultp/understanding-transformers-part-12-building-the-decoder-layers-36j</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-transformers-part-11-how-decoding-begins-4dal"&gt;previous article&lt;/a&gt;, we just began with the concept of decoders in a transformer.&lt;/p&gt;

&lt;p&gt;Now we will start adding the positional encoding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding Positional Encoding in the Decoder
&lt;/h2&gt;

&lt;p&gt;Now, for the decoder, let’s add &lt;strong&gt;positional encoding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Just like before, we use the same sine and cosine curves to get positional values based on the embedding positions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvz99r6qdaguisfybdjun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvz99r6qdaguisfybdjun.png" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are the &lt;strong&gt;same curves&lt;/strong&gt; that were used earlier when encoding the input.&lt;/p&gt;

&lt;h2&gt;
  
  
  Applying Positional Values
&lt;/h2&gt;

&lt;p&gt;Since the &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token is in the &lt;strong&gt;first position&lt;/strong&gt; and has &lt;strong&gt;two embedding values&lt;/strong&gt;, we take the corresponding positional values from the curves.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the &lt;strong&gt;first embedding&lt;/strong&gt;, the value is &lt;strong&gt;0&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;For the &lt;strong&gt;second embedding&lt;/strong&gt;, the value is &lt;strong&gt;1&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, we add these positional values to the embedding:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F898xly58c48hkjtmjgka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F898xly58c48hkjtmjgka.png" alt=" " width="643" height="743"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a result, we get &lt;strong&gt;2.70 and -0.34&lt;/strong&gt;, which represent the &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token after adding positional encoding.&lt;/p&gt;
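&lt;p&gt;For a token at position 0 with two embedding values, the sine curve gives 0 and the cosine curve gives 1. A tiny sketch of the addition, where the raw embedding values are assumptions chosen so the result matches the 2.70 and -0.34 above:&lt;/p&gt;

```python
# Positional encoding for position 0: sin(0) = 0 for the first embedding
# value, cos(0) = 1 for the second, added to the embedding.
import math

embedding = [2.70, -1.34]                 # <EOS> embedding (assumed values)
pos = 0                                   # first position in the output

pos_enc = [math.sin(pos), math.cos(pos)]  # [0.0, 1.0] for position 0
encoded = [round(e + p, 2) for e, p in zip(embedding, pos_enc)]
print(encoded)                            # [2.7, -0.34]
```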

&lt;h2&gt;
  
  
  Adding Self-Attention
&lt;/h2&gt;

&lt;p&gt;Next, we add the &lt;strong&gt;self-attention layer&lt;/strong&gt; so the decoder can keep track of relationships between output words.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61el0t66d4wu1puzmxfk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61el0t66d4wu1puzmxfk.png" alt=" " width="577" height="824"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The self-attention values for the &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token are &lt;strong&gt;-2.8 and -2.3&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Note that the &lt;strong&gt;weights used in the decoder’s self-attention&lt;/strong&gt; (for queries, keys, and values) are &lt;strong&gt;different from those used in the encoder&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding Residual Connections
&lt;/h2&gt;

&lt;p&gt;Now, we add &lt;strong&gt;residual connections&lt;/strong&gt;, just like we did in the encoder.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyuupdtfpl297azfp3h0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyuupdtfpl297azfp3h0.png" alt=" " width="549" height="765"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;So far, we have seen how self-attention helps the transformer understand relationships &lt;strong&gt;within the output sentence&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;However, for tasks like translation, the model also needs to understand relationships &lt;strong&gt;between the input sentence and the output sentence&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We will explore this in the next article.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Transformers Part 11: How Decoding Begins</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Wed, 22 Apr 2026 19:31:56 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-transformers-part-11-how-decoding-begins-4dal</link>
      <guid>https://dev.to/rijultp/understanding-transformers-part-11-how-decoding-begins-4dal</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-transformers-part-10-final-step-in-encoding-4f55"&gt;previous article&lt;/a&gt; we wrapped up the encoder part,  In this article, we will start building the second part of the transformer: the &lt;strong&gt;decoder&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Just like the encoder, the decoder also begins with &lt;strong&gt;word embeddings&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;However, this time the embeddings are created for the &lt;strong&gt;output vocabulary&lt;/strong&gt;, which consists of Spanish words such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;ir&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;vamos&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;y&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foteuuzkbucm5wvcgblq1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foteuuzkbucm5wvcgblq1.png" alt=" " width="800" height="662"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting the Decoding Process
&lt;/h2&gt;

&lt;p&gt;To begin decoding, we use the &lt;strong&gt;&lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token&lt;/strong&gt; as the input.&lt;/p&gt;

&lt;p&gt;This is a common way to initialize the decoding process for an encoded sentence.&lt;/p&gt;

&lt;p&gt;In some cases, people use a &lt;strong&gt;&lt;code&gt;&amp;lt;SOS&amp;gt;&lt;/code&gt; (Start of Sentence)&lt;/strong&gt; token instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  Creating the Initial Input
&lt;/h2&gt;

&lt;p&gt;We represent the &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token as a vector by assigning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1&lt;/strong&gt; to &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0&lt;/strong&gt; to all other words in the vocabulary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbr0cpk2e2ogeo78ini0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbr0cpk2e2ogeo78ini0d.png" alt=" " width="800" height="592"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this, we can see that &lt;strong&gt;2.70&lt;/strong&gt; and &lt;strong&gt;-1.34&lt;/strong&gt; are the embedding values that represent the &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token.&lt;/p&gt;
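&lt;p&gt;As a quick sketch of this lookup: multiplying the one-hot vector by the embedding weight matrix simply selects the row belonging to the &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; token. Only the EOS row below uses the 2.70 and -1.34 values from the figure; the other rows are made up.&lt;/p&gt;

```python
import numpy as np

# Toy output vocabulary: ir, vamos, y, EOS (end-of-sentence token)
vocab = ["ir", "vamos", "y", "EOS"]

# Embedding weights (4 words x 2 dims); only the EOS row uses the
# 2.70 and -1.34 values from the figure, the rest are made up.
W_embed = np.array([
    [ 0.51, -0.30],   # ir
    [-0.12,  1.10],   # vamos
    [ 0.94,  0.27],   # y
    [ 2.70, -1.34],   # EOS
])

# One-hot input: 1 for EOS, 0 for every other word in the vocabulary
one_hot = np.array([0.0, 0.0, 0.0, 1.0])

# The matrix product picks out the EOS row of the weight matrix
embedding = one_hot @ W_embed   # [2.70, -1.34]
```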

&lt;p&gt;Now that we have the initial input for the decoder, the next step is to &lt;strong&gt;add positional encoding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We will explore this in the next article.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Transformers Part 10: Final Step in Encoding</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Tue, 21 Apr 2026 19:36:28 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-transformers-part-10-final-step-in-encoding-4f55</link>
      <guid>https://dev.to/rijultp/understanding-transformers-part-10-final-step-in-encoding-4f55</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-transformers-part-9-stacking-self-attention-layers-3gg3"&gt;previous article&lt;/a&gt;, we explored the use of self-attention layers, now we will dive into the final step of encoding and start moving into decoders&lt;/p&gt;

&lt;p&gt;As the final step, we take the &lt;strong&gt;positional encoded values&lt;/strong&gt; and add them to the &lt;strong&gt;self-attention values&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;These connections are called &lt;strong&gt;residual connections&lt;/strong&gt;. They make it easier to train complex neural networks by allowing the self-attention layer to focus on learning relationships between words, without needing to preserve the original word embedding and positional information.&lt;/p&gt;
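&lt;p&gt;A residual connection is just an element-wise addition. The sketch below uses made-up positional-encoded values; the "go" row of the self-attention values reuses the 2.5 and -2.1 computed earlier in this series.&lt;/p&gt;

```python
import numpy as np

# Positional-encoded values for "Let's" and "go" (illustrative)
pos_encoded = np.array([[1.87,  0.65],
                        [0.82, -1.78]])

# Self-attention outputs for the same two words; the "go" row uses the
# 2.5 and -2.1 values computed earlier in this series
self_attention = np.array([[1.0, -1.9],
                           [2.5, -2.1]])

# Residual connection: add the two sets of values element-wise, so the
# attention output keeps the original embedding + position information
encoder_output = pos_encoded + self_attention
```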

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88f0hdawj6e44254gmb2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88f0hdawj6e44254gmb2.png" alt=" " width="800" height="562"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, we have everything needed to encode the input for this simple transformer.&lt;/p&gt;

&lt;p&gt;These four components work together to convert words into meaningful numerical representations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Word embedding&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Positional encoding&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Self-attention&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Residual connections&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that we have encoded the English input phrase &lt;strong&gt;“Let’s go”&lt;/strong&gt;, the next step is to &lt;strong&gt;decode it into Spanish&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To do this, we need to build a &lt;strong&gt;decoder&lt;/strong&gt;, which we will explore in the next article.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Transformers Part 9: Stacking Self-Attention Layers</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Fri, 17 Apr 2026 20:50:25 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-transformers-part-9-stacking-self-attention-layers-3gg3</link>
      <guid>https://dev.to/rijultp/understanding-transformers-part-9-stacking-self-attention-layers-3gg3</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-transformers-part-8-shared-weights-in-self-attention-2pbe"&gt;previous article&lt;/a&gt;, we explored how the weights are shared in self-attention.&lt;/p&gt;

&lt;p&gt;Now we will see why we use these self-attention values instead of the original positional encoding values.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Self-Attention Values
&lt;/h2&gt;

&lt;p&gt;We now use the &lt;strong&gt;self-attention values&lt;/strong&gt; instead of the original positional encoded values.&lt;/p&gt;

&lt;p&gt;This is because the self-attention values for each word include information from all the other words in the sentence. This helps give each word &lt;strong&gt;context&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It also helps establish how each word in the input is related to the others.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrjmoknigxi9rs0n743q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrjmoknigxi9rs0n743q.png" alt=" " width="800" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we think of this unit, along with its weights for calculating queries, keys, and values, as a &lt;strong&gt;self-attention cell&lt;/strong&gt;, then we can extend this idea further.&lt;/p&gt;

&lt;p&gt;To correctly capture relationships in more complex sentences and paragraphs, we can &lt;strong&gt;stack multiple self-attention cells&lt;/strong&gt;, each with its own set of weights. These layers are applied to the position-encoded values of each word, allowing the model to learn different types of relationships.&lt;/p&gt;
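&lt;p&gt;The stacking idea can be sketched as a list of self-attention cells, each holding its own Query, Key, and Value weights and each applied to the same position-encoded inputs. The weights here are random stand-ins for learned ones, and all sizes are illustrative.&lt;/p&gt;

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_cell(X, W_q, W_k, W_v):
    """One self-attention cell with its own set of weights."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T) @ V

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 2))   # position-encoded values for "Let's" and "go"

# Three stacked cells, each with its own (random stand-in) weight set
cells = [tuple(rng.normal(size=(2, 2)) for _ in range(3)) for _ in range(3)]

# Every cell sees the same inputs, but its own weights let it learn a
# different type of relationship between the words
outputs = [attention_cell(X, *w) for w in cells]
```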

&lt;p&gt;Going back to our example, there is one more step required to fully encode the input. We will explore that in the next article.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Transformers Part 8: Shared Weights in Self-Attention</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Thu, 16 Apr 2026 21:08:46 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-transformers-part-8-shared-weights-in-self-attention-2pbe</link>
      <guid>https://dev.to/rijultp/understanding-transformers-part-8-shared-weights-in-self-attention-2pbe</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-transformers-part-7-from-similarity-scores-to-self-attention-3noo"&gt;previous article&lt;/a&gt;, we started calculating the self-attention values.&lt;/p&gt;

&lt;p&gt;Let’s now calculate the self-attention values for the word &lt;strong&gt;“go”&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We do not need to recalculate the &lt;strong&gt;keys&lt;/strong&gt; and &lt;strong&gt;values&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead, we only need to create the &lt;strong&gt;query&lt;/strong&gt; that represents the word &lt;strong&gt;“go”&lt;/strong&gt;, and then perform the same calculations as before.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9f7405sfueefr9nix6p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9f7405sfueefr9nix6p.png" alt=" " width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After completing the calculations, we get the self-attention values for &lt;strong&gt;“go”&lt;/strong&gt; as:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.5 and -2.1&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Observations About Self-Attention
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The &lt;strong&gt;weights used to calculate queries&lt;/strong&gt; are the same for both &lt;strong&gt;“Let’s”&lt;/strong&gt; and &lt;strong&gt;“go”&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This means that regardless of the number of words, we use &lt;strong&gt;one shared set of weights&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Similarly, the same sets of weights are reused to calculate &lt;strong&gt;keys&lt;/strong&gt; and &lt;strong&gt;values&lt;/strong&gt; for every input word.&lt;/li&gt;
&lt;li&gt;No matter how many words are given as input, the transformer reuses the same weights for queries, keys, and values.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;We do not need to compute queries, keys, and values &lt;strong&gt;sequentially&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All of them can be computed &lt;strong&gt;at the same time&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;This allows transformers to take advantage of &lt;strong&gt;parallel computation&lt;/strong&gt;, making them very efficient.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
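&lt;p&gt;Both observations can be seen in a small NumPy sketch: stacking the words into a matrix lets one shared weight matrix produce the queries (or keys, or values) for every word in a single multiplication, no matter how many words there are. All numbers below are illustrative.&lt;/p&gt;

```python
import numpy as np

# Position-encoded values for "Let's" and "go", one row per word (illustrative)
X = np.array([[1.87,  0.65],
              [0.82, -1.78]])

# One shared set of weights each for queries, keys, and values (illustrative)
W_q = np.array([[0.5,  0.3], [-0.2, 0.8]])
W_k = np.array([[0.1,  0.7], [ 0.9, -0.4]])
W_v = np.array([[0.6, -0.1], [ 0.2,  0.5]])

# A single matrix multiply computes the result for every word at once,
# so nothing has to be done word by word, and the same weights are
# reused however many rows X has
Q = X @ W_q
K = X @ W_k
V = X @ W_v
```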

&lt;p&gt;We will continue building our transformer step by step in the next article.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Transformers Part 7: From Similarity Scores to Self-Attention</title>
      <dc:creator>Rijul Rajesh</dc:creator>
      <pubDate>Wed, 15 Apr 2026 02:24:37 +0000</pubDate>
      <link>https://dev.to/rijultp/understanding-transformers-part-7-from-similarity-scores-to-self-attention-3noo</link>
      <guid>https://dev.to/rijultp/understanding-transformers-part-7-from-similarity-scores-to-self-attention-3noo</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/rijultp/understanding-transformers-part-6-calculating-similarity-between-queries-and-keys-25o7"&gt;previous article&lt;/a&gt;, we calculated the similarities between Queries and Keys.&lt;/p&gt;

&lt;p&gt;We can use the output of the &lt;strong&gt;softmax function&lt;/strong&gt; to determine how much each input word should contribute when encoding the word &lt;strong&gt;“Let’s”&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2a14whe27xi9e8q4scau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2a14whe27xi9e8q4scau.png" alt=" " width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Interpreting the Weights
&lt;/h3&gt;

&lt;p&gt;In this case, &lt;strong&gt;“Let’s”&lt;/strong&gt; is much more similar to itself than to &lt;strong&gt;“go”&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So after applying softmax:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;“Let’s” gets a weight close to 1 (100%)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;“go” gets a weight close to 0 (0%)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Let’s” contributes almost entirely to its own encoding&lt;/li&gt;
&lt;li&gt;“go” contributes very little&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Creating Value Representations
&lt;/h2&gt;

&lt;p&gt;To apply these weights, we create another set of values for each word.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we create &lt;strong&gt;two values to represent “Let’s”&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8p1v8o7r4foc4bcigdq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8p1v8o7r4foc4bcigdq.png" alt=" " width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Then, we &lt;strong&gt;scale these values by 1&lt;/strong&gt; (since its weight is 100%)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Next, we create &lt;strong&gt;two values to represent “go”&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpb1ezeuhgw7y8c7l6uhu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpb1ezeuhgw7y8c7l6uhu.png" alt=" " width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;These values are &lt;strong&gt;scaled by 0&lt;/strong&gt; (since its weight is 0%)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Combining the Values
&lt;/h2&gt;

&lt;p&gt;Finally, we add the scaled values together:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81a3fi42bbj5ny5lnver.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81a3fi42bbj5ny5lnver.png" alt=" " width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result is a new set of values that represent the word &lt;strong&gt;“Let’s”&lt;/strong&gt;, now enriched by its relationship with all input words.&lt;/p&gt;

&lt;p&gt;These final values are called the &lt;strong&gt;self-attention values&lt;/strong&gt; for &lt;strong&gt;“Let’s”&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;They combine information from all words in the sentence, weighted by how relevant each word is to &lt;strong&gt;“Let’s”&lt;/strong&gt;.&lt;/p&gt;
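&lt;p&gt;The scale-and-add procedure above can be sketched in a few lines: softmax turns the similarity scores into weights, each word's values are scaled by its weight, and the scaled values are summed. The scores and value vectors below are illustrative.&lt;/p&gt;

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Illustrative similarity scores of "Let's" against ("Let's", "go");
# "Let's" is far more similar to itself than to "go"
scores = np.array([4.0, -2.0])
weights = softmax(scores)          # roughly [1.0, 0.0]

# Illustrative value vectors for "Let's" and "go" (two numbers per word)
values = np.array([[1.0, -1.9],
                   [2.2,  0.3]])

# Scale each word's values by its weight, then add them together:
# this weighted sum is the self-attention output for "Let's"
self_attention = weights @ values
```

Because the weight for "Let's" is close to 1, the result is dominated by its own value vector, exactly as described above.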

&lt;p&gt;We can now repeat the same process for the word &lt;strong&gt;“go”&lt;/strong&gt;, which we will explore in the next article.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
