<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kushagra Gupta</title>
    <description>The latest articles on DEV Community by Kushagra Gupta (@kushagra_gupta_13239507ec).</description>
    <link>https://dev.to/kushagra_gupta_13239507ec</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3972047%2F0c2b1cab-764c-49cc-b104-c3f29586aeea.png</url>
      <title>DEV Community: Kushagra Gupta</title>
      <link>https://dev.to/kushagra_gupta_13239507ec</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kushagra_gupta_13239507ec"/>
    <language>en</language>
    <item>
      <title>Understanding Attention in Transformers — Intuition Before Equations</title>
      <dc:creator>Kushagra Gupta</dc:creator>
      <pubDate>Sun, 07 Jun 2026 04:23:49 +0000</pubDate>
      <link>https://dev.to/kushagra_gupta_13239507ec/understanding-attention-in-transformers-intuition-before-equations-1nfj</link>
      <guid>https://dev.to/kushagra_gupta_13239507ec/understanding-attention-in-transformers-intuition-before-equations-1nfj</guid>
      <description>&lt;p&gt;When people first hear about Transformers, they often encounter words like Query, Key, Value, and Attention Heads and feel confused.&lt;/p&gt;

&lt;p&gt;But the main idea of attention is actually simple.&lt;/p&gt;

&lt;p&gt;Attention answers one question:&lt;/p&gt;

&lt;p&gt;While processing one word, which other words should the model pay attention to?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Was Attention Needed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before Transformers, models like RNNs and LSTMs processed words one by one.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;"The animal didn’t cross the street because it was tired."&lt;/p&gt;

&lt;p&gt;The model needs to understand that "it" refers to "animal".&lt;/p&gt;

&lt;p&gt;Older models struggled with long-distance relationships because information had to pass through many steps.&lt;/p&gt;

&lt;p&gt;Attention solved this problem by allowing every word to directly look at every other word.&lt;/p&gt;

&lt;p&gt;Instead of remembering everything through a long chain, the model can simply ask:&lt;/p&gt;

&lt;p&gt;Which words are important for me right now?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tokens Become Vectors&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A sentence like:&lt;/p&gt;

&lt;p&gt;"The cat sat"&lt;/p&gt;

&lt;p&gt;is broken into tokens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The&lt;/li&gt;
&lt;li&gt;cat&lt;/li&gt;
&lt;li&gt;sat&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each token is converted into a vector called an embedding.&lt;/p&gt;

&lt;p&gt;These vectors contain learned semantic meaning.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"cat" and "dog" may have similar vectors&lt;/li&gt;
&lt;li&gt;"king" and "queen" may also be related&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the sentence becomes a collection of vectors instead of plain text.&lt;/p&gt;
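&lt;p&gt;As a rough illustration of "similar words have similar vectors", here is a minimal sketch that compares toy embeddings with cosine similarity. The three-dimensional vectors and the &lt;code&gt;cosine_similarity&lt;/code&gt; helper are made up for this example; real models learn much larger vectors during training.&lt;/p&gt;

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
# Real models learn vectors with hundreds or thousands of dimensions.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "sat": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_sat = cosine_similarity(embeddings["cat"], embeddings["sat"])
print(round(cat_dog, 3), round(cat_sat, 3))
```

&lt;p&gt;With these toy numbers, "cat" and "dog" score much closer to 1 than "cat" and "sat", which is the kind of pattern learned embeddings tend to show for related words.&lt;/p&gt;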

&lt;p&gt;&lt;strong&gt;The Main Idea of Attention&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Suppose the model is processing the word "sat".&lt;/p&gt;

&lt;p&gt;To understand "sat", the model may weight the other words differently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more attention on "cat"&lt;/li&gt;
&lt;li&gt;less attention on "The"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Attention allows each word to update itself using information from surrounding words.&lt;/p&gt;

&lt;p&gt;This makes words context-aware.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"bank" in "river bank"&lt;/li&gt;
&lt;li&gt;"bank" in "bank account"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Attention helps the model understand the correct meaning from context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query, Key, and Value&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the part many people find confusing.&lt;/p&gt;

&lt;p&gt;Imagine entering a library looking for physics books.&lt;/p&gt;

&lt;p&gt;You:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ask a question&lt;/li&gt;
&lt;li&gt;Compare it with shelf labels&lt;/li&gt;
&lt;li&gt;Retrieve useful books&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Attention works similarly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Query means:&lt;/p&gt;

&lt;p&gt;What information am I looking for?&lt;/p&gt;

&lt;p&gt;If the token is "sat", the query may implicitly ask:&lt;/p&gt;

&lt;p&gt;Who is doing the sitting?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Key means:&lt;/p&gt;

&lt;p&gt;What kind of information do I contain?&lt;/p&gt;

&lt;p&gt;The word "cat" may contain information related to an animal or subject.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query-Key Matching&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model compares the Query with all Keys.&lt;/p&gt;

&lt;p&gt;If a Query and a Key match strongly (for example, their dot product is large), the model treats the two words as related.&lt;/p&gt;

&lt;p&gt;So the query from "sat" may strongly match the key from "cat".&lt;/p&gt;

&lt;p&gt;This tells the model:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"cat" is important for understanding "sat".&lt;/p&gt;
&lt;/blockquote&gt;
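&lt;p&gt;The matching step can be sketched with a plain dot product. The two-dimensional Query and Key vectors below are hypothetical values chosen so that "cat" wins; in a real Transformer they come from learned projections of the embeddings.&lt;/p&gt;

```python
# Hypothetical 2-dimensional Query and Key vectors, invented for illustration.
# In a real Transformer these come from learned projections of the embeddings.
query_sat = [1.0, 0.5]
keys = {
    "The": [0.1, 0.2],
    "cat": [0.9, 0.8],
    "sat": [0.3, 0.4],
}

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# One relevance score per token: a larger score means a stronger match.
scores = {token: dot(query_sat, key) for token, key in keys.items()}
best_match = max(scores, key=scores.get)
print(scores, best_match)
```

&lt;p&gt;Here the key for "cat" gets the largest score, so it is the strongest match for the query from "sat".&lt;/p&gt;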

&lt;p&gt;&lt;strong&gt;Value&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Value contains the actual information passed forward.&lt;/p&gt;

&lt;p&gt;We can think of attention like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query asks the question&lt;/li&gt;
&lt;li&gt;Key is matched against the Query to score relevance&lt;/li&gt;
&lt;li&gt;Value provides the information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Important words contribute more information.&lt;/p&gt;

&lt;p&gt;Less important words contribute less.&lt;/p&gt;
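&lt;p&gt;"Contributing more" can be read as a weighted sum of Value vectors. The sketch below assumes the attention weights are already known; both the weights and the Value vectors are hypothetical numbers picked for illustration, and &lt;code&gt;weighted_sum&lt;/code&gt; is a helper written for this example.&lt;/p&gt;

```python
# Hypothetical attention weights for the token "sat" (they sum to 1).
weights = {"The": 0.1, "cat": 0.7, "sat": 0.2}

# Hypothetical 2-dimensional Value vectors, invented for illustration.
values = {
    "The": [0.0, 1.0],
    "cat": [1.0, 0.0],
    "sat": [0.5, 0.5],
}

def weighted_sum(weights, values):
    """Blend the Value vectors, scaling each one by its attention weight."""
    size = len(next(iter(values.values())))
    output = [0.0] * size
    for token, weight in weights.items():
        for i, component in enumerate(values[token]):
            output[i] += weight * component
    return output

updated_sat = weighted_sum(weights, values)
print(updated_sat)
```

&lt;p&gt;Because "cat" has the largest weight, the result sits closest to the Value vector of "cat".&lt;/p&gt;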

&lt;p&gt;&lt;strong&gt;Scaled Dot-Product Attention&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full attention formula is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcljxdb7sfqiwvj5wool.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcljxdb7sfqiwvj5wool.png" alt="Scaled dot-product attention formula" width="796" height="154"&gt;&lt;/a&gt;&lt;/p&gt;
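&lt;p&gt;In its usual textbook form, scaled dot-product attention is written as &lt;code&gt;softmax(Q K^T / sqrt(d_k)) V&lt;/code&gt;, where &lt;code&gt;d_k&lt;/code&gt; is the length of each Key vector. The sketch below is one plain-Python reading of that form, not a reference implementation. The inputs &lt;code&gt;Q&lt;/code&gt;, &lt;code&gt;K&lt;/code&gt;, and &lt;code&gt;V&lt;/code&gt; are hypothetical toy values, with one row per token.&lt;/p&gt;

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract max for stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Hypothetical toy inputs for the tokens "The", "cat", "sat".
Q = [[0.1, 0.2], [0.9, 0.8], [1.0, 0.5]]
K = [[0.1, 0.2], [0.9, 0.8], [0.3, 0.4]]
V = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
result = attention(Q, K, V)
print(result)
```

&lt;p&gt;Dividing by &lt;code&gt;sqrt(d_k)&lt;/code&gt; keeps the scores from growing with the vector size, which would otherwise push softmax toward putting almost all weight on a single word.&lt;/p&gt;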

&lt;p&gt;&lt;strong&gt;Simple Workflow&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tokens are converted into embeddings (vectors).&lt;/li&gt;
&lt;li&gt;Each word updates its meaning using surrounding words (context).&lt;/li&gt;
&lt;li&gt;The Query asks: “What information am I looking for?”&lt;/li&gt;
&lt;li&gt;The dot product of a Query and a Key measures relevance between words.&lt;/li&gt;
&lt;li&gt;Values are weighted by the softmax of those scores and summed to create the final context-aware representation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Simple Attention Flow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query from "sat"
       |
Compare with all Keys
       |
Find important words
       |
Give higher importance to relevant words
       |
Combine information
       |
Create updated meaning of "sat"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Multi-Head Attention
&lt;/h2&gt;

&lt;p&gt;Transformers run attention several times in parallel, each time with its own learned Query, Key, and Value projections.&lt;/p&gt;

&lt;p&gt;These are called attention heads.&lt;/p&gt;

&lt;p&gt;Different heads can focus on different relationships:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grammar&lt;/li&gt;
&lt;li&gt;Pronouns&lt;/li&gt;
&lt;li&gt;Long-distance meaning&lt;/li&gt;
&lt;li&gt;Nearby words&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows the model to observe language from multiple perspectives at the same time.&lt;/p&gt;
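&lt;p&gt;A heavily simplified sketch of the idea: split each token vector into slices, run attention on each slice separately, then join the results. The function names and the four-dimensional inputs are hypothetical. A real Transformer also applies learned Query, Key, Value, and output projections, which this sketch leaves out.&lt;/p&gt;

```python
import math

def attend(queries, keys, values):
    """Scaled dot-product attention for one head (lists of row vectors)."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        exps = [math.exp(s - max(scores)) for s in scores]
        weights = [e / sum(exps) for e in exps]
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

def multi_head_self_attention(x, num_heads):
    """Split each vector into num_heads slices, attend per slice, then concatenate.

    Simplification: a real Transformer applies learned Query, Key, Value, and
    output projections; here each head reuses its raw slice for all three roles.
    """
    head_dim = len(x[0]) // num_heads
    combined = [[] for _ in x]
    for h in range(num_heads):
        start = h * head_dim
        head_input = [row[start:start + head_dim] for row in x]
        head_output = attend(head_input, head_input, head_input)
        for token_index, vector in enumerate(head_output):
            combined[token_index].extend(vector)
    return combined

# Hypothetical 4-dimensional embeddings for "The", "cat", "sat".
x = [[0.1, 0.2, 0.3, 0.1], [0.9, 0.8, 0.1, 0.4], [0.2, 0.7, 0.9, 0.5]]
output = multi_head_self_attention(x, num_heads=2)
print(len(output), len(output[0]))
```

&lt;p&gt;Each head only sees its own slice of the vector, so the two heads can end up with different attention weights for the same pair of words.&lt;/p&gt;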

&lt;h2&gt;
  
  
  Why Attention Became Important
&lt;/h2&gt;

&lt;p&gt;Attention solved major problems of older sequence models.&lt;/p&gt;

&lt;p&gt;Transformers gained several advantages:&lt;/p&gt;

&lt;ul&gt;
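&lt;li&gt;Better long-range understanding: any two words are connected in a single step&lt;/li&gt;
&lt;li&gt;Parallel processing: all tokens in a sequence can be processed at once during training&lt;/li&gt;
&lt;li&gt;Improved scalability: that parallelism makes larger models and datasets practical to train&lt;/li&gt;
&lt;li&gt;Stronger language understanding through context-aware word representations&lt;/li&gt;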
&lt;/ul&gt;

&lt;p&gt;This became the foundation of modern large language models.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>nlp</category>
    </item>
  </channel>
</rss>
