<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shrijith Venkatramana</title>
    <description>The latest articles on DEV Community by Shrijith Venkatramana (@shrsv).</description>
    <link>https://dev.to/shrsv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1001514%2F17b7d334-44b1-417a-9268-346e6a34988a.jpg</url>
      <title>DEV Community: Shrijith Venkatramana</title>
      <link>https://dev.to/shrsv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shrsv"/>
    <language>en</language>
    <item>
      <title>Scaled Dot-Product Attention: The 4-Line Algorithm That Powers Modern LLMs</title>
      <dc:creator>Shrijith Venkatramana</dc:creator>
      <pubDate>Wed, 01 Jul 2026 17:49:09 +0000</pubDate>
      <link>https://dev.to/shrsv/scaled-dot-product-attention-the-4-line-algorithm-that-powers-modern-llms-c21</link>
      <guid>https://dev.to/shrsv/scaled-dot-product-attention-the-4-line-algorithm-that-powers-modern-llms-c21</guid>
      <description>&lt;p&gt;&lt;em&gt;Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;Star Us&lt;/a&gt; to help devs discover the project. Do give it a try and share your feedback for improving the product.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;In 2017, a small group of Google researchers removed recurrence, removed convolution, and bet everything on one deceptively simple idea: every word should decide for itself what deserves attention.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That idea—&lt;strong&gt;Scaled Dot-Product Attention&lt;/strong&gt;—became the computational primitive behind GPT, Claude, Gemini, Llama, DeepSeek, and nearly every modern Large Language Model.&lt;/p&gt;

&lt;p&gt;The remarkable part isn't just that it works.&lt;/p&gt;

&lt;p&gt;It's that the core algorithm fits into a single equation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Attention(Q, K, V) = softmax((Q * K^T) / sqrt(d_k)) * V&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this article we'll build intuition first, then gradually unpack the mathematics, engineering, and economics behind the mechanism that made modern LLMs possible.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;(Include the screenshot from the "Attention Is All You Need" paper here.)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Before Transformers: The Memory Problem
&lt;/h1&gt;

&lt;p&gt;Imagine asking someone to finish this sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The trophy didn't fit in the suitcase because &lt;strong&gt;it&lt;/strong&gt; was too small."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What does &lt;strong&gt;it&lt;/strong&gt; refer to?&lt;/p&gt;

&lt;p&gt;The trophy?&lt;/p&gt;

&lt;p&gt;Or the suitcase?&lt;/p&gt;

&lt;p&gt;Humans answer almost instantly because our brains naturally connect related concepts across a sentence.&lt;/p&gt;

&lt;p&gt;Earlier neural networks struggled.&lt;/p&gt;

&lt;h3&gt;
  
  
  RNNs
&lt;/h3&gt;

&lt;p&gt;Recurrent Neural Networks processed words one at a time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The -&amp;gt; trophy -&amp;gt; didn't -&amp;gt; fit -&amp;gt; ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Each word updated a hidden state.&lt;/p&gt;

&lt;p&gt;The problem was that information had to travel through dozens or hundreds of sequential steps before reaching later words.&lt;/p&gt;

&lt;p&gt;By the time the network reached the end of a paragraph, early information had often faded away.&lt;/p&gt;

&lt;p&gt;LSTMs improved the situation with gating mechanisms, but they still fundamentally processed sequences sequentially.&lt;/p&gt;

&lt;p&gt;That became an enormous bottleneck.&lt;/p&gt;

&lt;p&gt;Both computationally.&lt;/p&gt;

&lt;p&gt;And conceptually.&lt;/p&gt;
&lt;h1&gt;
  
  
  The Insight: Let Every Word Look Everywhere
&lt;/h1&gt;

&lt;p&gt;One of the authors of the Transformer paper, Ashish Vaswani, later described the goal simply:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Instead of carrying memory forward step-by-step, why not allow every word to directly inspect every other word?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Suppose we have:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The cat sat on the mat.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;When processing &lt;strong&gt;sat&lt;/strong&gt;, perhaps the model mostly cares about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cat&lt;/li&gt;
&lt;li&gt;on&lt;/li&gt;
&lt;li&gt;mat&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It doesn't need to care very much about &lt;strong&gt;The&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of forcing information through intermediate states, attention allows direct communication.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        sat
      /  |  \
     /   |   \
   cat   on  mat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Every token asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Which other tokens are relevant to me?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's attention.&lt;/p&gt;
&lt;h1&gt;
  
  
  Queries, Keys and Values: Think Like a Search Engine
&lt;/h1&gt;

&lt;p&gt;The names sound intimidating.&lt;/p&gt;

&lt;p&gt;They're actually borrowed from information retrieval.&lt;/p&gt;

&lt;p&gt;Imagine Google Search.&lt;/p&gt;

&lt;p&gt;When you search:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;best pizza near me
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You issue a &lt;strong&gt;query&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every webpage has characteristics that determine whether it matches.&lt;/p&gt;

&lt;p&gt;Those are analogous to &lt;strong&gt;keys&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The content you finally read is the &lt;strong&gt;value&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Exactly the same thing happens inside attention.&lt;/p&gt;

&lt;p&gt;Every word generates three vectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query (Q)&lt;/strong&gt; - What am I looking for?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key (K)&lt;/strong&gt; - What information do I offer?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value (V)&lt;/strong&gt; - What information should I contribute if selected?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suppose we have:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The animal didn't cross the road because it was tired.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;For the token &lt;strong&gt;it&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Its Query might strongly match:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;animal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;instead of&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;road
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;because their semantic representations are more compatible.&lt;/p&gt;

&lt;p&gt;Attention is therefore a sophisticated matching process.&lt;/p&gt;
&lt;h1&gt;
  
  
  The Famous Equation (That Looks Scarier Than It Is)
&lt;/h1&gt;

&lt;p&gt;The Transformer paper defines attention as:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attention(Q, K, V) = softmax((Q * K^T) / sqrt(d_k)) * V
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Let's decode it piece by piece.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1: Compare Queries Against Every Key
&lt;/h2&gt;

&lt;p&gt;Each Query is compared with every Key using a dot product.&lt;/p&gt;

&lt;p&gt;A dot product is simply a similarity score.&lt;/p&gt;

&lt;p&gt;Large positive value:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Very relevant.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Near zero:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Mostly unrelated.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Negative:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Probably irrelevant.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Suppose our similarities are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Word&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cat&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dog&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;road&lt;/td&gt;
&lt;td&gt;-1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Higher means stronger relevance.&lt;/p&gt;

&lt;p&gt;Notice that we don't compare a Query against just one Key.&lt;/p&gt;

&lt;p&gt;Every Query is compared against &lt;strong&gt;every Key&lt;/strong&gt; simultaneously.&lt;/p&gt;

&lt;p&gt;If there are 100 tokens in the sentence, each Query computes 100 similarity scores.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 2: Why Divide by sqrt(d_k)?
&lt;/h2&gt;

&lt;p&gt;This is the "Scaled" part.&lt;/p&gt;

&lt;p&gt;Without scaling, dot products become enormous.&lt;/p&gt;

&lt;p&gt;Imagine vectors of length 512.&lt;/p&gt;

&lt;p&gt;Even if each component averages only around 1, adding hundreds of multiplications quickly produces very large numbers.&lt;/p&gt;

&lt;p&gt;A useful back-of-the-envelope calculation is that the variance of a dot product grows roughly in proportion to &lt;code&gt;d_k&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;d_k = 512
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;then&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sqrt(512) is approximately 22.6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Those values are fed into the softmax function.&lt;/p&gt;

&lt;p&gt;Softmax contains exponentials.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;exp(22) is roughly 3.5 billion
exp(10) is roughly 22 thousand
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A small increase in the input suddenly creates a huge difference in the output.&lt;/p&gt;

&lt;p&gt;One score completely dominates.&lt;/p&gt;

&lt;p&gt;Everything else effectively becomes zero.&lt;/p&gt;

&lt;p&gt;That creates two problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;unstable gradients&lt;/li&gt;
&lt;li&gt;slower learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dividing every score by &lt;code&gt;sqrt(d_k)&lt;/code&gt; keeps the values in a healthy numerical range.&lt;/p&gt;

&lt;p&gt;It's essentially variance normalization.&lt;/p&gt;

&lt;p&gt;A remarkably small trick with enormous practical consequences.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 3: Softmax Creates Probabilities
&lt;/h2&gt;

&lt;p&gt;Suppose the scaled scores become:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3
2
0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Softmax converts them into something approximately like:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.71
0.26
0.03
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;These become attention weights.&lt;/p&gt;

&lt;p&gt;Now the model knows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;spend about 71% of attention here&lt;/li&gt;
&lt;li&gt;spend about 26% here&lt;/li&gt;
&lt;li&gt;mostly ignore the rest&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The probabilities always sum to 1.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 4: Weighted Sum of Values
&lt;/h2&gt;

&lt;p&gt;Finally those probabilities weight the Value vectors.&lt;/p&gt;

&lt;p&gt;Think of it like averaging expert opinions.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Expert A : 70%
Expert B : 25%
Expert C : 5%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The final representation becomes:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.70 * A + 0.25 * B + 0.05 * C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That weighted average becomes the new representation for the current token.&lt;/p&gt;

&lt;p&gt;Instead of copying information from a single location, attention intelligently blends information from multiple relevant tokens.&lt;/p&gt;


&lt;h1&gt;
  
  
  Why Matrix Multiplication Changed Everything
&lt;/h1&gt;

&lt;p&gt;The equation often looks abstract because it's written with matrices.&lt;/p&gt;

&lt;p&gt;That choice was an engineering breakthrough.&lt;/p&gt;

&lt;p&gt;Instead of processing one word at a time:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;word 1
word 2
word 3
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;the Transformer processes &lt;strong&gt;every token simultaneously&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If a sentence contains 128 words,&lt;/p&gt;

&lt;p&gt;it computes attention for all 128 together using large matrix multiplications.&lt;/p&gt;

&lt;p&gt;Modern GPUs are extraordinarily efficient at matrix multiplication.&lt;/p&gt;

&lt;p&gt;This wasn't merely mathematically elegant.&lt;/p&gt;

&lt;p&gt;It matched the hardware.&lt;/p&gt;

&lt;p&gt;Google's TPUs were designed around massive matrix operations.&lt;/p&gt;

&lt;p&gt;NVIDIA GPUs excel at them too.&lt;/p&gt;

&lt;p&gt;The algorithm and the hardware reinforced one another.&lt;/p&gt;

&lt;p&gt;This is one reason Transformers scaled so dramatically.&lt;/p&gt;

&lt;p&gt;Sometimes the biggest breakthrough isn't inventing a new algorithm.&lt;/p&gt;

&lt;p&gt;It's inventing one that perfectly matches the hardware already available.&lt;/p&gt;
&lt;h1&gt;
  
  
  The Hidden Cost: Attention Isn't Free
&lt;/h1&gt;

&lt;p&gt;Attention is powerful.&lt;/p&gt;

&lt;p&gt;It is also expensive.&lt;/p&gt;

&lt;p&gt;Suppose a sequence contains &lt;code&gt;n&lt;/code&gt; tokens.&lt;/p&gt;

&lt;p&gt;Every token compares itself with every other token.&lt;/p&gt;

&lt;p&gt;That means roughly:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;n * n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;or simply:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;O(n^2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;comparisons.&lt;/p&gt;

&lt;p&gt;Double the sequence length:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1,000 tokens
      -&amp;gt;
2,000 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;and you perform about &lt;strong&gt;four times as much work&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Approximate example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Pairwise Comparisons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;1 million&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000&lt;/td&gt;
&lt;td&gt;100 million&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;td&gt;10 billion&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This single operation dominates both memory usage and inference cost.&lt;/p&gt;

&lt;p&gt;Much of today's LLM research—including FlashAttention, sparse attention, sliding-window attention, grouped-query attention, and linear attention—is fundamentally about making this computation cheaper without sacrificing quality.&lt;/p&gt;

&lt;p&gt;In many ways, modern AI engineering has become an optimization problem built around this one equation.&lt;/p&gt;
&lt;h1&gt;
  
  
  A Historical Moment Few Papers Ever Achieve
&lt;/h1&gt;

&lt;p&gt;When Ashish Vaswani and seven colleagues published &lt;strong&gt;"Attention Is All You Need"&lt;/strong&gt; in 2017, they were solving a machine translation problem.&lt;/p&gt;

&lt;p&gt;They were &lt;strong&gt;not&lt;/strong&gt; trying to build ChatGPT.&lt;/p&gt;

&lt;p&gt;Yet within a few years:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI built GPT on the Transformer architecture.&lt;/li&gt;
&lt;li&gt;Google introduced BERT using the same core attention mechanism.&lt;/li&gt;
&lt;li&gt;Nearly every frontier LLM adopted Scaled Dot-Product Attention as its computational primitive.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some research papers introduce new techniques.&lt;/p&gt;

&lt;p&gt;Very few redefine an entire field.&lt;/p&gt;

&lt;p&gt;This was one of them.&lt;/p&gt;

&lt;p&gt;Today, when billions of people interact with ChatGPT, Claude, Gemini, or Llama, they're ultimately benefiting from an idea that occupies only a few lines in a research paper.&lt;/p&gt;
&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;The beauty of Scaled Dot-Product Attention lies in its simplicity.&lt;/p&gt;

&lt;p&gt;Every token asks a question.&lt;/p&gt;

&lt;p&gt;Every other token advertises what it knows.&lt;/p&gt;

&lt;p&gt;Similarity determines relevance.&lt;/p&gt;

&lt;p&gt;Softmax decides how much to trust each source.&lt;/p&gt;

&lt;p&gt;The answers are blended into a richer representation.&lt;/p&gt;

&lt;p&gt;From those four operations emerged language models capable of writing code, translating languages, solving mathematical problems, generating images, and powering AI assistants used by hundreds of millions of people.&lt;/p&gt;

&lt;p&gt;Sometimes revolutions begin not with thousands of lines of code, but with a single elegant equation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What surprised you most about Scaled Dot-Product Attention?&lt;/strong&gt; Was it the simplicity of the mathematics, the engineering insight of matching GPUs with matrix operations, or the fact that dividing by &lt;code&gt;sqrt(d_k)&lt;/code&gt; turned out to be one of the key ingredients that made today's LLMs train reliably at scale?&lt;/p&gt;



&lt;p&gt;*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.&lt;/p&gt;

&lt;p&gt;git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*&lt;/p&gt;

&lt;p&gt;Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/HexmosTech" rel="noopener noreferrer"&gt;
        HexmosTech
      &lt;/a&gt; / &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;
        git-lrc
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Free, Micro AI Code Reviews That Run on Git Commit
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div&gt;
&lt;p&gt;| &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.da.md" rel="noopener noreferrer"&gt;🇩🇰 Dansk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.es.md" rel="noopener noreferrer"&gt;🇪🇸 Español&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fa.md" rel="noopener noreferrer"&gt;🇮🇷 Farsi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fi.md" rel="noopener noreferrer"&gt;🇫🇮 Suomi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ja.md" rel="noopener noreferrer"&gt;🇯🇵 日本語&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.nn.md" rel="noopener noreferrer"&gt;🇳🇴 Norsk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.pt.md" rel="noopener noreferrer"&gt;🇵🇹 Português&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ru.md" rel="noopener noreferrer"&gt;🇷🇺 Русский&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.sq.md" rel="noopener noreferrer"&gt;🇦🇱 Shqip&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.zh.md" rel="noopener noreferrer"&gt;🇨🇳 中文&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.hi.md" rel="noopener noreferrer"&gt;🇮🇳 हिन्दी&lt;/a&gt; |&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;img width="60" alt="git-lrc logo" src="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;git-lrc&lt;/h1&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Free, Micro AI Code Reviews That Run on Commit&lt;/h2&gt;
&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.producthunt.com/products/git-lrc?embed=true&amp;amp;utm_source=badge-top-post-badge&amp;amp;utm_medium=badge&amp;amp;utm_campaign=badge-git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="git-lrc - Free, micro AI code reviews that run on commit | Product Hunt" width="200" src="https://camo.githubusercontent.com/87bf2d4283c1e0aa99e254bd17fefb1c67c0c0d39300043a243a4aa633b6cecc/68747470733a2f2f6170692e70726f6475637468756e742e636f6d2f776964676574732f656d6265642d696d6167652f76312f746f702d706f73742d62616467652e7376673f706f73745f69643d31303739323632267468656d653d6c6967687426706572696f643d6461696c7926743d31373731373439313730383638"&gt;&lt;/a&gt;
&amp;nbsp;&lt;/p&gt;
&lt;br&gt;
&lt;a href="https://discord.gg/sGdnKwB3qq" rel="nofollow noopener noreferrer"&gt;
  &lt;img alt="Discord Community" src="https://camo.githubusercontent.com/b8f979318aaabc8dec512b9d4e6e2a12431fba3c8a3b8738e1a97a0722d4e4bf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d436f6d6d756e6974792d3538363546323f6c6f676f3d646973636f7264266c6162656c436f6c6f723d7768697465"&gt;
&lt;/a&gt; &lt;a href="https://goreportcard.com/report/github.com/HexmosTech/git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="Go Report Card" src="https://camo.githubusercontent.com/e74c0651c3ee9165a2ed01cb0f6842c494029960df30eb9c24cf622d3d21bf46/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f4865786d6f73546563682f6769742d6c7263"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml" rel="noopener noreferrer"&gt;&lt;img alt="gitleaks.yml" title="gitleaks.yml: Secret scanning workflow" src="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml" rel="noopener noreferrer"&gt;&lt;img alt="osv-scanner.yml" title="osv-scanner.yml: Dependency vulnerability scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml" rel="noopener noreferrer"&gt;&lt;img alt="govulncheck.yml" title="govulncheck.yml: Go vulnerability check" src="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml" rel="noopener noreferrer"&gt;&lt;img alt="semgrep.yml" title="semgrep.yml: Static analysis security scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/dependabot-enabled.svg"&gt;&lt;img alt="dependabot-enabled" title="dependabot-enabled: Automated dependency updates are enabled" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fdependabot-enabled.svg"&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/a_few_micro_reviews.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fa_few_micro_reviews.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GenAI today is a &lt;strong&gt;race car without brakes&lt;/strong&gt;. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents &lt;em&gt;silently break things&lt;/em&gt;: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. You often find out in production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;git-lrc&lt;/code&gt; is your braking system.&lt;/strong&gt; It hooks into &lt;code&gt;git commit&lt;/code&gt; and runs an AI review on every diff &lt;em&gt;before&lt;/em&gt; it lands. 60-second setup. Completely free.&lt;/p&gt;
&lt;p&gt;In short, git-lrc helps &lt;strong&gt;Prevent Outages, Breaches, and Technical Debt Before They Happen&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At a glance:&lt;/strong&gt; &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;10 risk categories&lt;/a&gt; · &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;100+ failure patterns tracked&lt;/a&gt; · every commit…&lt;/p&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>What Actually Happens When You Train an LLM? Following the First 12 Hours of the Original Transformer</title>
      <dc:creator>Shrijith Venkatramana</dc:creator>
      <pubDate>Tue, 30 Jun 2026 18:03:36 +0000</pubDate>
      <link>https://dev.to/shrsv/what-actually-happens-when-you-train-an-llm-following-the-first-12-hours-of-the-original-5hd6</link>
      <guid>https://dev.to/shrsv/what-actually-happens-when-you-train-an-llm-following-the-first-12-hours-of-the-original-5hd6</guid>
      <description>&lt;p&gt;&lt;em&gt;Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;Star Us&lt;/a&gt; to help devs discover the project. Do give it a try and share your feedback for improving the product.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;In 2017, eight NVIDIA P100 GPUs sat in a Google data center for about twelve hours.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F1eqvex6nz84si48wsp12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F1eqvex6nz84si48wsp12.png" alt="p100 gpus" width="301" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During those twelve hours, they repeatedly did something almost embarrassingly simple.&lt;/p&gt;

&lt;p&gt;They picked up a batch of sentences.&lt;/p&gt;

&lt;p&gt;Made predictions.&lt;/p&gt;

&lt;p&gt;Measured how wrong those predictions were.&lt;/p&gt;

&lt;p&gt;Adjusted a few million numbers.&lt;/p&gt;

&lt;p&gt;Then did it again.&lt;/p&gt;

&lt;p&gt;Exactly &lt;strong&gt;100,000 times&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Those twelve hours produced the base Transformer model described in &lt;em&gt;Attention Is All You Need&lt;/em&gt;. The larger version trained for &lt;strong&gt;300,000 optimization steps&lt;/strong&gt;, taking roughly &lt;strong&gt;3.5 days&lt;/strong&gt; on the same hardware.&lt;/p&gt;

&lt;p&gt;Today, frontier LLMs train on tens of thousands of GPUs for weeks, but if you inspected the training logs, the core loop would still look remarkably familiar.&lt;/p&gt;

&lt;p&gt;This article follows that loop.&lt;/p&gt;

&lt;p&gt;We'll watch one training run unfold—from raw text on disk to a model that can translate languages—and along the way learn a bit about every intimidating term that deal with training a transformer model.&lt;/p&gt;

&lt;h1&gt;
  
  
  8:00 AM — Nothing Has Been Learned Yet
&lt;/h1&gt;

&lt;p&gt;Imagine switching on the machine.&lt;/p&gt;

&lt;p&gt;The Transformer knows absolutely no language. It doesn't know English, or German. And it doesn't know grammar. It doesn't even have conception of what a "word" is.&lt;/p&gt;

&lt;p&gt;Internally it contains millions of parameters—ordinary floating-point numbers initialized almost randomly.&lt;/p&gt;

&lt;p&gt;The training data, however, already contains knowledge.&lt;/p&gt;

&lt;p&gt;For the English-German task, the authors used the &lt;strong&gt;WMT 2014&lt;/strong&gt; dataset containing roughly &lt;strong&gt;4.5 million sentence pairs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A tiny sample might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;English:
The meeting begins tomorrow.

German:
Das Treffen beginnt morgen.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Or&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;English:
The cat sat on the mat.

German:
Die Katze saß auf der Matte.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Notice what's missing.&lt;/p&gt;

&lt;p&gt;Nobody wrote rules like&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Adjectives come before nouns."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;or&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"German verbs often appear at the end."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The only supervision is example after example after example.&lt;/p&gt;

&lt;p&gt;The model's job is to discover those rules itself.&lt;/p&gt;
&lt;h1&gt;
  
  
  8:00:01 — The Computer Doesn't See Words
&lt;/h1&gt;

&lt;p&gt;Before training starts, the text is transformed into something GPUs understand.&lt;/p&gt;

&lt;p&gt;Integers.&lt;/p&gt;

&lt;p&gt;The paper uses &lt;strong&gt;Byte Pair Encoding (BPE)&lt;/strong&gt;, introduced a year earlier by Rico Sennrich and colleagues.&lt;/p&gt;

&lt;p&gt;Instead of storing every possible English word, BPE builds a vocabulary of common subword pieces.&lt;/p&gt;

&lt;p&gt;For example,&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;unbelievable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;might become&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;un
believ
able
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Those pieces become IDs.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;un      → 517
believ  → 10328
able    → 294
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Why go through this trouble?&lt;/p&gt;

&lt;p&gt;Imagine giving every English word its own entry.&lt;/p&gt;

&lt;p&gt;You'd need hundreds of thousands of entries, and every new word—"ChatGPT", "Kubernetes", "DeepSeek"—would be unknown.&lt;/p&gt;

&lt;p&gt;Subwords solve that elegantly.&lt;/p&gt;

&lt;p&gt;Once the model understands "micro", "service" and "architecture", it already has much of what it needs to interpret "microservice architecture", even if it has never encountered the exact phrase before.&lt;/p&gt;

&lt;p&gt;Modern tokenizers have evolved, but this basic idea remains.&lt;/p&gt;
&lt;h1&gt;
  
  
  8:00:02 — The First Batch Arrives
&lt;/h1&gt;

&lt;p&gt;One beginner misconception is that the GPU trains on one sentence at a time.&lt;/p&gt;

&lt;p&gt;That would waste almost all of its computational power.&lt;/p&gt;

&lt;p&gt;GPUs are throughput machines.&lt;/p&gt;

&lt;p&gt;They become efficient only when thousands of arithmetic units work simultaneously.&lt;/p&gt;

&lt;p&gt;Instead, the Transformer paper groups examples into batches containing approximately&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;25,000 source-language tokens&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;25,000 target-language tokens&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;or about &lt;strong&gt;50,000 tokens&lt;/strong&gt; in total.&lt;/p&gt;

&lt;p&gt;Think of a factory.&lt;/p&gt;

&lt;p&gt;Running one car down an assembly line would be absurd.&lt;/p&gt;

&lt;p&gt;Factories move hundreds of products simultaneously because keeping machines idle is expensive.&lt;/p&gt;

&lt;p&gt;GPU training works the same way.&lt;/p&gt;

&lt;p&gt;Batching is not a machine-learning trick.&lt;/p&gt;

&lt;p&gt;It's operations optimization.&lt;/p&gt;
&lt;h1&gt;
  
  
  8:00:02.4 — The Model Makes Its First Mistake
&lt;/h1&gt;

&lt;p&gt;The first forward pass takes roughly &lt;strong&gt;0.4 seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The model receives&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The cat sat on the mat.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;and produces...&lt;/p&gt;

&lt;p&gt;garbage.&lt;/p&gt;

&lt;p&gt;Maybe something equivalent to&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;House.

Tomorrow.

Blue.

Water.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That isn't failure.&lt;/p&gt;

&lt;p&gt;It's exactly what we expect.&lt;/p&gt;

&lt;p&gt;Every parameter was random only moments ago.&lt;/p&gt;

&lt;p&gt;Now comes the crucial question.&lt;/p&gt;

&lt;p&gt;How wrong was the prediction?&lt;/p&gt;

&lt;p&gt;The answer is summarized by a single number called the &lt;strong&gt;loss&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Everything that follows exists solely to reduce that number.&lt;/p&gt;


&lt;h1&gt;
  
  
  8:00:02.5 — Which of the 65 Million Parameters Was Responsible?
&lt;/h1&gt;

&lt;p&gt;Suppose I asked you to tune an old radio using sixty-five million knobs.&lt;/p&gt;

&lt;p&gt;After hearing static, which knob would you turn?&lt;/p&gt;

&lt;p&gt;You wouldn't know.&lt;/p&gt;

&lt;p&gt;Yet that's essentially the problem.&lt;/p&gt;

&lt;p&gt;The Transformer base model contains about &lt;strong&gt;65 million trainable parameters&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The larger model contains around &lt;strong&gt;213 million&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Backpropagation solves this enormous credit-assignment problem.&lt;/p&gt;

&lt;p&gt;Rather than saying&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Parameter #18,423 is wrong,"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;it computes&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"If this parameter increased slightly, would the loss increase or decrease?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;for every single parameter.&lt;/p&gt;

&lt;p&gt;The result is a gigantic map of tiny suggested adjustments called &lt;strong&gt;gradients&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now another algorithm enters the story.&lt;/p&gt;
&lt;h1&gt;
  
  
  Adam: The Engineer Who Turns the Knobs
&lt;/h1&gt;

&lt;p&gt;The paper uses the &lt;strong&gt;Adam optimizer&lt;/strong&gt; with&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;β₁ = 0.9&lt;/li&gt;
&lt;li&gt;β₂ = 0.98&lt;/li&gt;
&lt;li&gt;ε = 10⁻⁹&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't arbitrary constants copied from Stack Overflow.&lt;/p&gt;

&lt;p&gt;Adam remembers recent gradients, rather like giving the optimization process momentum.&lt;/p&gt;

&lt;p&gt;Imagine descending a foggy mountain.&lt;/p&gt;

&lt;p&gt;If every step depended only on the slope beneath your feet, you'd zigzag constantly.&lt;/p&gt;

&lt;p&gt;Adam remembers where you've been heading over the past several steps, smoothing the journey downhill.&lt;/p&gt;

&lt;p&gt;Interestingly, the paper chose &lt;strong&gt;β₂ = 0.98&lt;/strong&gt; rather than the more familiar &lt;strong&gt;0.999&lt;/strong&gt; found in many deep-learning libraries today. That makes Adam respond more quickly to changing gradients—a small but deliberate engineering decision.&lt;/p&gt;

&lt;p&gt;Millions of parameters are nudged.&lt;/p&gt;

&lt;p&gt;Tiny, incremental changes.&lt;/p&gt;

&lt;p&gt;Often by less than one thousandth.&lt;/p&gt;

&lt;p&gt;Then the next batch arrives.&lt;/p&gt;
&lt;h1&gt;
  
  
  9:00 AM — The Strange Equation Everyone Hates
&lt;/h1&gt;

&lt;p&gt;One of the paper's most intimidating equations defines the learning rate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F6ha2c93b31vuc9y592yu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F6ha2c93b31vuc9y592yu.png" alt="learning rate eq" width="559" height="55"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks frightening.&lt;/p&gt;

&lt;p&gt;Its purpose is not.&lt;/p&gt;

&lt;p&gt;Early in training, every parameter is effectively random.&lt;/p&gt;

&lt;p&gt;Large updates can make optimization unstable.&lt;/p&gt;

&lt;p&gt;So the authors &lt;strong&gt;warm up&lt;/strong&gt; the learning rate over the first &lt;strong&gt;4,000 optimization steps&lt;/strong&gt;, gradually increasing it rather than starting at full speed.&lt;/p&gt;

&lt;p&gt;After warmup, the learning rate begins shrinking.&lt;/p&gt;

&lt;p&gt;Imagine sanding a table.&lt;/p&gt;

&lt;p&gt;At first you remove material aggressively.&lt;/p&gt;

&lt;p&gt;Near the end you make tiny finishing passes.&lt;/p&gt;

&lt;p&gt;Training behaves similarly.&lt;/p&gt;

&lt;p&gt;The equation also contains the term (d_{\text{model}}^{-1/2}).&lt;/p&gt;

&lt;p&gt;This compensates for model size.&lt;/p&gt;

&lt;p&gt;As hidden representations become larger, gradients naturally change scale. Dividing by the square root of the model dimension helps keep parameter updates numerically well behaved as architectures grow.&lt;/p&gt;

&lt;p&gt;The equation is just common sensical engineering.&lt;/p&gt;
&lt;h1&gt;
  
  
  Noon — Preventing the Model From Memorizing
&lt;/h1&gt;

&lt;p&gt;If optimization only chased lower loss, the network could simply memorize the training data.&lt;/p&gt;

&lt;p&gt;The paper deliberately makes learning harder.&lt;/p&gt;

&lt;p&gt;First comes &lt;strong&gt;dropout&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Ten percent of activations are randomly disabled during training.&lt;/p&gt;

&lt;p&gt;Every batch therefore sees a slightly different network.&lt;/p&gt;

&lt;p&gt;No neuron can become indispensable.&lt;/p&gt;

&lt;p&gt;Second comes &lt;strong&gt;label smoothing&lt;/strong&gt; with a value of &lt;strong&gt;0.1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of pretending the correct next token has probability exactly one, the target distribution is softened slightly.&lt;/p&gt;

&lt;p&gt;That sounds counterintuitive.&lt;/p&gt;

&lt;p&gt;Yet translation quality improved.&lt;/p&gt;

&lt;p&gt;Real language is messy.&lt;/p&gt;

&lt;p&gt;There are often several acceptable translations.&lt;/p&gt;

&lt;p&gt;Slight uncertainty produces a less overconfident model.&lt;/p&gt;
&lt;h1&gt;
  
  
  8:00 PM — Twelve Hours Later
&lt;/h1&gt;

&lt;p&gt;After roughly &lt;strong&gt;100,000 optimization steps&lt;/strong&gt;, the base model has finished training.&lt;/p&gt;

&lt;p&gt;The larger model continues until &lt;strong&gt;300,000 steps&lt;/strong&gt;, taking approximately &lt;strong&gt;3.5 days&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Each step processed about &lt;strong&gt;50,000 tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Back-of-the-envelope, that's around &lt;strong&gt;five billion token presentations&lt;/strong&gt; during the base run—not unique tokens, but training exposures. The same examples are revisited across multiple passes through the dataset.&lt;/p&gt;

&lt;p&gt;The paper doesn't stop by reporting translation accuracy.&lt;/p&gt;

&lt;p&gt;It also reports &lt;strong&gt;FLOPs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That's significant.&lt;/p&gt;

&lt;p&gt;Even in 2017, the authors understood that machine learning was becoming an engineering discipline constrained not only by accuracy, but also by computation.&lt;/p&gt;

&lt;p&gt;A model that is 1% better but requires ten times more compute is often a poor engineering trade-off.&lt;/p&gt;

&lt;p&gt;That thinking has only become more relevant.&lt;/p&gt;

&lt;p&gt;Today, training an LLM is as much about distributed systems, networking, storage bandwidth, GPU utilization, checkpointing and failure recovery as it is about neural networks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;People often remember &lt;em&gt;Attention Is All You Need&lt;/em&gt; for introducing self-attention.&lt;/p&gt;

&lt;p&gt;Equally important was something less glamorous: it demonstrated a training recipe that scaled.&lt;/p&gt;

&lt;p&gt;Large batches kept GPUs busy.&lt;/p&gt;

&lt;p&gt;Carefully designed learning-rate schedules stabilized optimization.&lt;/p&gt;

&lt;p&gt;Adam made billions of tiny updates practical.&lt;/p&gt;

&lt;p&gt;Regularization techniques prevented memorization.&lt;/p&gt;

&lt;p&gt;None of these ideas are individually magical. Together, repeated hundreds of thousands of times, they turned random numbers into a model that could translate language.&lt;/p&gt;

&lt;p&gt;Nearly a decade later, today's frontier LLMs still follow the same rhythm.&lt;/p&gt;

&lt;p&gt;The numbers have changed by orders of magnitude.&lt;/p&gt;

&lt;p&gt;The loop has not.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Load a batch. Predict. Measure the loss. Update the weights. Repeat.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the heartbeat of every modern language model.&lt;/p&gt;



&lt;p&gt;*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.&lt;/p&gt;

&lt;p&gt;git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*&lt;/p&gt;

&lt;p&gt;Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/HexmosTech" rel="noopener noreferrer"&gt;
        HexmosTech
      &lt;/a&gt; / &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;
        git-lrc
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Free, Micro AI Code Reviews That Run on Git Commit
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div&gt;
&lt;p&gt;| &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.da.md" rel="noopener noreferrer"&gt;🇩🇰 Dansk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.es.md" rel="noopener noreferrer"&gt;🇪🇸 Español&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fa.md" rel="noopener noreferrer"&gt;🇮🇷 Farsi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fi.md" rel="noopener noreferrer"&gt;🇫🇮 Suomi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ja.md" rel="noopener noreferrer"&gt;🇯🇵 日本語&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.nn.md" rel="noopener noreferrer"&gt;🇳🇴 Norsk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.pt.md" rel="noopener noreferrer"&gt;🇵🇹 Português&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ru.md" rel="noopener noreferrer"&gt;🇷🇺 Русский&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.sq.md" rel="noopener noreferrer"&gt;🇦🇱 Shqip&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.zh.md" rel="noopener noreferrer"&gt;🇨🇳 中文&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.hi.md" rel="noopener noreferrer"&gt;🇮🇳 हिन्दी&lt;/a&gt; |&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;img width="60" alt="git-lrc logo" src="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;git-lrc&lt;/h1&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Free, Micro AI Code Reviews That Run on Commit&lt;/h2&gt;
&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.producthunt.com/products/git-lrc?embed=true&amp;amp;utm_source=badge-top-post-badge&amp;amp;utm_medium=badge&amp;amp;utm_campaign=badge-git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="git-lrc - Free, micro AI code reviews that run on commit | Product Hunt" width="200" src="https://camo.githubusercontent.com/87bf2d4283c1e0aa99e254bd17fefb1c67c0c0d39300043a243a4aa633b6cecc/68747470733a2f2f6170692e70726f6475637468756e742e636f6d2f776964676574732f656d6265642d696d6167652f76312f746f702d706f73742d62616467652e7376673f706f73745f69643d31303739323632267468656d653d6c6967687426706572696f643d6461696c7926743d31373731373439313730383638"&gt;&lt;/a&gt;
&amp;nbsp;&lt;/p&gt;
&lt;br&gt;
&lt;a href="https://discord.gg/sGdnKwB3qq" rel="nofollow noopener noreferrer"&gt;
  &lt;img alt="Discord Community" src="https://camo.githubusercontent.com/b8f979318aaabc8dec512b9d4e6e2a12431fba3c8a3b8738e1a97a0722d4e4bf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d436f6d6d756e6974792d3538363546323f6c6f676f3d646973636f7264266c6162656c436f6c6f723d7768697465"&gt;
&lt;/a&gt; &lt;a href="https://goreportcard.com/report/github.com/HexmosTech/git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="Go Report Card" src="https://camo.githubusercontent.com/e74c0651c3ee9165a2ed01cb0f6842c494029960df30eb9c24cf622d3d21bf46/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f4865786d6f73546563682f6769742d6c7263"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml" rel="noopener noreferrer"&gt;&lt;img alt="gitleaks.yml" title="gitleaks.yml: Secret scanning workflow" src="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml" rel="noopener noreferrer"&gt;&lt;img alt="osv-scanner.yml" title="osv-scanner.yml: Dependency vulnerability scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml" rel="noopener noreferrer"&gt;&lt;img alt="govulncheck.yml" title="govulncheck.yml: Go vulnerability check" src="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml" rel="noopener noreferrer"&gt;&lt;img alt="semgrep.yml" title="semgrep.yml: Static analysis security scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/dependabot-enabled.svg"&gt;&lt;img alt="dependabot-enabled" title="dependabot-enabled: Automated dependency updates are enabled" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fdependabot-enabled.svg"&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/a_few_micro_reviews.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fa_few_micro_reviews.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GenAI today is a &lt;strong&gt;race car without brakes&lt;/strong&gt;. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents &lt;em&gt;silently break things&lt;/em&gt;: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. You often find out in production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;git-lrc&lt;/code&gt; is your braking system.&lt;/strong&gt; It hooks into &lt;code&gt;git commit&lt;/code&gt; and runs an AI review on every diff &lt;em&gt;before&lt;/em&gt; it lands. 60-second setup. Completely free.&lt;/p&gt;
&lt;p&gt;In short, git-lrc helps &lt;strong&gt;Prevent Outages, Breaches, and Technical Debt Before They Happen&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At a glance:&lt;/strong&gt; &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;10 risk categories&lt;/a&gt; · &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;100+ failure patterns tracked&lt;/a&gt; · every commit…&lt;/p&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>The Transformer Architecture Behind Modern LLMs: A Developer's Guide to the Diagram That Changed AI</title>
      <dc:creator>Shrijith Venkatramana</dc:creator>
      <pubDate>Mon, 29 Jun 2026 17:09:22 +0000</pubDate>
      <link>https://dev.to/shrsv/the-transformer-architecture-behind-modern-llms-a-developers-guide-to-the-diagram-that-changed-ai-2e4</link>
      <guid>https://dev.to/shrsv/the-transformer-architecture-behind-modern-llms-a-developers-guide-to-the-diagram-that-changed-ai-2e4</guid>
      <description>&lt;p&gt;&lt;em&gt;Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;Star Us&lt;/a&gt; to help devs discover the project. Do give it a try and share your feedback for improving the product.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've used ChatGPT, Claude, Gemini, or any modern LLM, you've already benefited from one of the most influential transformer architecture in modern AI.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In 2017, a team of researchers at Google published a paper titled &lt;strong&gt;"Attention Is All You Need."&lt;/strong&gt; Hidden inside it was the now-famous architecture diagram shown below—the blueprint for the Transformer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fspfod3rgaw8ktpku3t9n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fspfod3rgaw8ktpku3t9n.png" alt="blueprint for transformer" width="592" height="698"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At first glance, it looks intimidating: boxes, arrows, loops, encoder stacks, decoder stacks, attention blocks...&lt;/p&gt;

&lt;p&gt;But underneath the complexity lies a surprisingly elegant idea.&lt;/p&gt;

&lt;p&gt;By the end of this article, you'll understand &lt;strong&gt;every single component&lt;/strong&gt; in this diagram, why it exists, how information flows through it, and how all these pieces collaborate to generate human-like language.&lt;/p&gt;

&lt;p&gt;Let's build our understanding layer by layer.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Big Picture: Two Factories Working Together
&lt;/h1&gt;

&lt;p&gt;Before diving into the individual blocks, ignore the details and look at the overall shape.&lt;/p&gt;

&lt;p&gt;There are &lt;strong&gt;two major halves&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Encoder (left)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Decoder (right)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of them as two specialized factories.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input sentence
      │
      ▼
 ┌───────────────┐
 │    Encoder    │
 └───────────────┘
      │
 Learned representation
      │
      ▼
 ┌───────────────┐
 │    Decoder    │
 └───────────────┘
      │
      ▼
 Generated text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Originally, this architecture was designed for &lt;strong&gt;machine translation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;English:
"The cat sat on the mat."

Encoder understands it.

↓

Decoder generates:

"Le chat était assis sur le tapis."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The encoder's only responsibility is &lt;strong&gt;understanding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The decoder's responsibility is &lt;strong&gt;generation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Today's GPT models actually use &lt;strong&gt;only the decoder&lt;/strong&gt;, while models like BERT use &lt;strong&gt;only the encoder&lt;/strong&gt;. The original Transformer paper contained both because translation requires understanding one language before producing another.&lt;/p&gt;
&lt;h1&gt;
  
  
  Step 1 — Input Embeddings: Converting Words into Numbers
&lt;/h1&gt;

&lt;p&gt;Computers don't understand words.&lt;/p&gt;

&lt;p&gt;They understand vectors.&lt;/p&gt;

&lt;p&gt;When the sentence&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The cat sat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;enters the Transformer, each word is converted into a dense numerical vector.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"The"
↓

[0.17, -0.42, 1.33, ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;These vectors are called &lt;strong&gt;embeddings&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Words with similar meanings naturally end up close together in this high-dimensional space.&lt;/p&gt;

&lt;p&gt;For example,&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;king
queen
prince
princess
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;occupy nearby regions.&lt;/p&gt;

&lt;p&gt;Instead of manually designing these vectors, the Transformer &lt;strong&gt;learns them during training&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the architecture diagram, this is the very first pink box:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Inputs
   │
   ▼
Input Embedding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;At this stage, every token is simply represented as a learned numerical vector.&lt;/p&gt;
&lt;h1&gt;
  
  
  Step 2 — Positional Encoding: Giving Words an Order
&lt;/h1&gt;

&lt;p&gt;Here's an interesting problem.&lt;/p&gt;

&lt;p&gt;Attention doesn't inherently know word order.&lt;/p&gt;

&lt;p&gt;Consider:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dog bites man

Man bites dog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Same words.&lt;/p&gt;

&lt;p&gt;Completely different meaning.&lt;/p&gt;

&lt;p&gt;Since attention processes every word simultaneously, we must explicitly tell the model where each word appears.&lt;/p&gt;

&lt;p&gt;That's the purpose of &lt;strong&gt;Positional Encoding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The positional vector is added directly to the embedding.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Embedding

+

Position

=

Final input vector
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This explains the small &lt;strong&gt;⊕ symbol&lt;/strong&gt; in the diagram.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Embedding
      │
      ▼
      ⊕
     / \
Embedding Position
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Rather than learning grammar from sequence alone, the model receives positional information immediately.&lt;/p&gt;

&lt;p&gt;You can think of it as giving every token both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;what it is&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;where it occurs&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Step 3 — The Encoder Stack: Understanding the Entire Sentence
&lt;/h1&gt;

&lt;p&gt;Now we reach the large box labelled &lt;strong&gt;N×&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────┐
│ Multi-Head Attention│
│ Add &amp;amp; Norm          │
│ Feed Forward        │
│ Add &amp;amp; Norm          │
└─────────────────────┘

Repeated N times
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The paper used &lt;strong&gt;N = 6&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Modern LLMs often use dozens or even hundreds of layers.&lt;/p&gt;

&lt;p&gt;Each encoder layer gradually refines the representation.&lt;/p&gt;

&lt;p&gt;Think of reading a paragraph.&lt;/p&gt;

&lt;p&gt;Your first reading identifies words.&lt;/p&gt;

&lt;p&gt;The second discovers phrases.&lt;/p&gt;

&lt;p&gt;The third understands relationships.&lt;/p&gt;

&lt;p&gt;The fourth extracts meaning.&lt;/p&gt;

&lt;p&gt;Each encoder layer performs another refinement pass.&lt;/p&gt;
&lt;h1&gt;
  
  
  Step 4 — Multi-Head Attention: The Heart of the Transformer
&lt;/h1&gt;

&lt;p&gt;This is the innovation that changed deep learning.&lt;/p&gt;

&lt;p&gt;Suppose the sentence is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The animal didn't cross the street because &lt;strong&gt;it&lt;/strong&gt; was tired.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What does &lt;strong&gt;it&lt;/strong&gt; refer to?&lt;/p&gt;

&lt;p&gt;Attention allows every word to examine every other word before updating its representation.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;animal  ←──────────┐
                   │
cross ─────────────┤
                   │
street ────────────┤
                   │
it  ◄──────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Instead of only looking at nearby words like RNNs, every token has access to the &lt;strong&gt;entire sentence&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This allows long-distance dependencies to be captured naturally.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Multiple Heads?
&lt;/h2&gt;

&lt;p&gt;One attention mechanism isn't enough.&lt;/p&gt;

&lt;p&gt;Different relationships matter.&lt;/p&gt;

&lt;p&gt;One head may learn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;grammatical structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pronoun resolution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;verb-object relationships&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;semantic similarity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Imagine several experts reading the same sentence simultaneously.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Head 1:
Grammar

Head 2:
Meaning

Head 3:
Syntax

Head 4:
Long-range context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Their outputs are combined into a richer representation.&lt;/p&gt;

&lt;p&gt;This is why it's called &lt;strong&gt;Multi-Head Attention&lt;/strong&gt;.&lt;/p&gt;
&lt;h1&gt;
  
  
  Step 5 — Add &amp;amp; Norm: Keeping Training Stable
&lt;/h1&gt;

&lt;p&gt;Notice that after every major block we see&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Add &amp;amp; Norm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This performs two operations.&lt;/p&gt;
&lt;h2&gt;
  
  
  Residual Connection (Add)
&lt;/h2&gt;

&lt;p&gt;Instead of replacing information, we preserve the original.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Output

=

Attention(x)

+

x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This shortcut helps gradients flow through deep networks.&lt;/p&gt;

&lt;p&gt;Without it, training hundreds of layers becomes extremely difficult.&lt;/p&gt;
&lt;h2&gt;
  
  
  Layer Normalization (Norm)
&lt;/h2&gt;

&lt;p&gt;Different layers naturally produce values on different scales.&lt;/p&gt;

&lt;p&gt;Layer normalization keeps activations well-behaved.&lt;/p&gt;

&lt;p&gt;Think of it as recalibrating measurements after every processing stage.&lt;/p&gt;

&lt;p&gt;Without normalization:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1

0.3

Layer 2

500

Layer 3

0.0004
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Training quickly becomes unstable.&lt;/p&gt;

&lt;p&gt;Normalization keeps everything numerically manageable.&lt;/p&gt;
&lt;h1&gt;
  
  
  Step 6 — Feed Forward Networks: Thinking Independently
&lt;/h1&gt;

&lt;p&gt;Attention allows tokens to exchange information.&lt;/p&gt;

&lt;p&gt;The Feed Forward layer allows each token to process what it has learned.&lt;/p&gt;

&lt;p&gt;For every token independently:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Vector

↓

Linear

↓

Activation

↓

Linear
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;No interaction happens here.&lt;/p&gt;

&lt;p&gt;Instead, this stage performs deeper feature extraction.&lt;/p&gt;

&lt;p&gt;An analogy:&lt;/p&gt;

&lt;p&gt;Attention is a group discussion.&lt;/p&gt;

&lt;p&gt;Feed Forward is everyone quietly thinking afterward.&lt;/p&gt;

&lt;p&gt;This alternating pattern—&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Discuss

↓

Think

↓

Discuss

↓

Think
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;is repeated across every Transformer layer.&lt;/p&gt;
&lt;h1&gt;
  
  
  Step 7 — The Decoder: Generating One Token at a Time
&lt;/h1&gt;

&lt;p&gt;Now we move to the right half of the diagram.&lt;/p&gt;

&lt;p&gt;The decoder is responsible for producing text.&lt;/p&gt;

&lt;p&gt;Its input isn't the original sentence.&lt;/p&gt;

&lt;p&gt;Instead it receives:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;START&amp;gt;

↓

The

↓

The cat

↓

The cat sat

↓

...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Notice the label:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Outputs
(shifted right)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This means that during training, the decoder receives the &lt;strong&gt;correct previous token&lt;/strong&gt; as input while learning to predict the next one.&lt;/p&gt;

&lt;p&gt;If the target sentence is:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The cat sat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;the decoder sees:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;START&amp;gt;

The

The cat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;and learns to predict:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The

cat

sat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This "teacher forcing" strategy makes training much more efficient because the model always conditions on the correct history rather than its own mistakes.&lt;/p&gt;
&lt;h1&gt;
  
  
  Step 8 — Masked Multi-Head Attention: Preventing Cheating
&lt;/h1&gt;

&lt;p&gt;Imagine predicting the next word in:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The cat sat on the __
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If the model could already see&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;there would be nothing to learn.&lt;/p&gt;

&lt;p&gt;So the decoder applies &lt;strong&gt;Masked Multi-Head Attention&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Current token

↓

Can see:

Previous words ✔

Future words ✘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;During generation:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I love
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;cannot attend to&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pizza
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;until pizza has actually been generated.&lt;/p&gt;

&lt;p&gt;The mask preserves causality.&lt;/p&gt;

&lt;p&gt;This is precisely why GPT models generate text one token at a time.&lt;/p&gt;
&lt;h1&gt;
  
  
  Step 9 — Cross-Attention: Looking Back at the Encoder
&lt;/h1&gt;

&lt;p&gt;The second attention block inside the decoder is different.&lt;/p&gt;

&lt;p&gt;Here, the decoder attends to the encoder output.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Encoder

↓

Sentence meaning

↓

Decoder consults it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Suppose we're translating:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"The red car."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;While generating&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;voiture
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;the decoder continually asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Which parts of the original sentence are relevant right now?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This interaction between encoder and decoder is called &lt;strong&gt;cross-attention&lt;/strong&gt; (shown simply as "Multi-Head Attention" in the original diagram, with arrows coming from the encoder stack).&lt;/p&gt;

&lt;p&gt;It lets the decoder ground each generated token in the encoded meaning of the source sentence instead of relying only on previously generated words.&lt;/p&gt;

&lt;p&gt;Decoder-only models like GPT omit this block because there is no separate encoder to consult.&lt;/p&gt;
&lt;h1&gt;
  
  
  Step 10 — Linear Layer + Softmax: Choosing the Next Word
&lt;/h1&gt;

&lt;p&gt;After the decoder finishes processing, we finally reach the top of the diagram.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Decoder Output

↓

Linear

↓

Softmax

↓

Output Probabilities
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;strong&gt;Linear&lt;/strong&gt; layer converts the decoder's hidden representation into one score for every token in the vocabulary.&lt;/p&gt;

&lt;p&gt;Imagine a vocabulary containing 50,000 words.&lt;/p&gt;

&lt;p&gt;The output might look like:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat      5.8
dog      2.1
apple   -0.4
car      1.7
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;These raw scores (often called &lt;em&gt;logits&lt;/em&gt;) aren't probabilities yet.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Softmax&lt;/strong&gt; layer transforms them into a probability distribution:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat      0.81
dog      0.09
car      0.04
apple    0.01
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The model can then choose the next token—either the most probable one or a sampled alternative depending on the decoding strategy.&lt;/p&gt;

&lt;p&gt;That chosen token is fed back into the decoder, and the entire process repeats until an end-of-sequence token is produced.&lt;/p&gt;
&lt;h1&gt;
  
  
  Putting It All Together: Following the Data Through the Diagram
&lt;/h1&gt;

&lt;p&gt;Now the entire figure becomes much easier to read.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input text

↓

Input Embedding

↓

Positional Encoding

↓

Encoder Stack (N layers)
    ├─ Multi-Head Attention
    ├─ Add &amp;amp; Norm
    ├─ Feed Forward
    └─ Add &amp;amp; Norm

↓

Context-rich representation

↓

Decoder receives previous outputs
(shifted right)

↓

Output Embedding

↓

Positional Encoding

↓

Masked Multi-Head Attention
(looks only at earlier generated tokens)

↓

Cross-Attention
(consults the encoder output)

↓

Feed Forward

↓

Repeat N layers

↓

Linear

↓

Softmax

↓

Next token probability

↓

Repeat until complete sentence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Every block has a specific role:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input Embedding&lt;/td&gt;
&lt;td&gt;Convert tokens into dense vectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Positional Encoding&lt;/td&gt;
&lt;td&gt;Encode word order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Head Attention&lt;/td&gt;
&lt;td&gt;Let tokens exchange information globally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Masked Multi-Head Attention&lt;/td&gt;
&lt;td&gt;Prevent access to future tokens during generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-Attention&lt;/td&gt;
&lt;td&gt;Allow the decoder to consult the encoder's understanding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feed Forward&lt;/td&gt;
&lt;td&gt;Transform each token's representation independently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Add &amp;amp; Norm&lt;/td&gt;
&lt;td&gt;Stabilize optimization and preserve information via residual connections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encoder Stack (N×)&lt;/td&gt;
&lt;td&gt;Build increasingly rich contextual representations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decoder Stack (N×)&lt;/td&gt;
&lt;td&gt;Generate the output sequence one token at a time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear&lt;/td&gt;
&lt;td&gt;Produce a score for every vocabulary token&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Softmax&lt;/td&gt;
&lt;td&gt;Convert scores into probabilities for selecting the next token&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h1&gt;
  
  
  A Crisp Summary To Help You Remember This
&lt;/h1&gt;

&lt;p&gt;The Transformer succeeded because it replaced sequential processing with &lt;strong&gt;parallel attention&lt;/strong&gt;, enabling models to reason over entire sequences at once while still generating coherent text token by token.&lt;/p&gt;

&lt;p&gt;Almost every major language model today—from GPT and Claude to Llama, Mistral, and Gemini—can trace its lineage back to this deceptively simple diagram. While modern architectures introduce refinements such as rotary positional embeddings, grouped-query attention, mixture-of-experts layers, and optimized decoding strategies, the core ideas remain strikingly similar to those introduced in 2017.&lt;/p&gt;

&lt;p&gt;The next time someone says an LLM is "just predicting the next token," remember what's happening under the hood: &lt;strong&gt;embeddings capture meaning, positional encodings preserve order, attention weaves relationships across the sequence, feed-forward networks refine those representations, residual connections keep deep networks trainable, and the decoder repeatedly transforms all of that into one probability distribution after another until a coherent response emerges.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What part of the Transformer architecture surprised you the most—the fact that every token can attend to every other token, the masking that enables autoregressive generation, or how such a simple stack of repeated blocks scales to models with hundreds of billions of parameters?&lt;/p&gt;



&lt;p&gt;*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.&lt;/p&gt;

&lt;p&gt;git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*&lt;/p&gt;

&lt;p&gt;Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/HexmosTech" rel="noopener noreferrer"&gt;
        HexmosTech
      &lt;/a&gt; / &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;
        git-lrc
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Free, Micro AI Code Reviews That Run on Git Commit
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div&gt;
&lt;p&gt;| &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.da.md" rel="noopener noreferrer"&gt;🇩🇰 Dansk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.es.md" rel="noopener noreferrer"&gt;🇪🇸 Español&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fa.md" rel="noopener noreferrer"&gt;🇮🇷 Farsi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fi.md" rel="noopener noreferrer"&gt;🇫🇮 Suomi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ja.md" rel="noopener noreferrer"&gt;🇯🇵 日本語&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.nn.md" rel="noopener noreferrer"&gt;🇳🇴 Norsk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.pt.md" rel="noopener noreferrer"&gt;🇵🇹 Português&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ru.md" rel="noopener noreferrer"&gt;🇷🇺 Русский&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.sq.md" rel="noopener noreferrer"&gt;🇦🇱 Shqip&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.zh.md" rel="noopener noreferrer"&gt;🇨🇳 中文&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.hi.md" rel="noopener noreferrer"&gt;🇮🇳 हिन्दी&lt;/a&gt; |&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;img width="60" alt="git-lrc logo" src="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;git-lrc&lt;/h1&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Free, Micro AI Code Reviews That Run on Commit&lt;/h2&gt;
&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.producthunt.com/products/git-lrc?embed=true&amp;amp;utm_source=badge-top-post-badge&amp;amp;utm_medium=badge&amp;amp;utm_campaign=badge-git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="git-lrc - Free, micro AI code reviews that run on commit | Product Hunt" width="200" src="https://camo.githubusercontent.com/87bf2d4283c1e0aa99e254bd17fefb1c67c0c0d39300043a243a4aa633b6cecc/68747470733a2f2f6170692e70726f6475637468756e742e636f6d2f776964676574732f656d6265642d696d6167652f76312f746f702d706f73742d62616467652e7376673f706f73745f69643d31303739323632267468656d653d6c6967687426706572696f643d6461696c7926743d31373731373439313730383638"&gt;&lt;/a&gt;
&amp;nbsp;&lt;/p&gt;
&lt;br&gt;
&lt;a href="https://discord.gg/sGdnKwB3qq" rel="nofollow noopener noreferrer"&gt;
  &lt;img alt="Discord Community" src="https://camo.githubusercontent.com/b8f979318aaabc8dec512b9d4e6e2a12431fba3c8a3b8738e1a97a0722d4e4bf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d436f6d6d756e6974792d3538363546323f6c6f676f3d646973636f7264266c6162656c436f6c6f723d7768697465"&gt;
&lt;/a&gt; &lt;a href="https://goreportcard.com/report/github.com/HexmosTech/git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="Go Report Card" src="https://camo.githubusercontent.com/e74c0651c3ee9165a2ed01cb0f6842c494029960df30eb9c24cf622d3d21bf46/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f4865786d6f73546563682f6769742d6c7263"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml" rel="noopener noreferrer"&gt;&lt;img alt="gitleaks.yml" title="gitleaks.yml: Secret scanning workflow" src="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml" rel="noopener noreferrer"&gt;&lt;img alt="osv-scanner.yml" title="osv-scanner.yml: Dependency vulnerability scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml" rel="noopener noreferrer"&gt;&lt;img alt="govulncheck.yml" title="govulncheck.yml: Go vulnerability check" src="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml" rel="noopener noreferrer"&gt;&lt;img alt="semgrep.yml" title="semgrep.yml: Static analysis security scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/dependabot-enabled.svg"&gt;&lt;img alt="dependabot-enabled" title="dependabot-enabled: Automated dependency updates are enabled" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fdependabot-enabled.svg"&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/a_few_micro_reviews.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fa_few_micro_reviews.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GenAI today is a &lt;strong&gt;race car without brakes&lt;/strong&gt;. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents &lt;em&gt;silently break things&lt;/em&gt;: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. You often find out in production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;git-lrc&lt;/code&gt; is your braking system.&lt;/strong&gt; It hooks into &lt;code&gt;git commit&lt;/code&gt; and runs an AI review on every diff &lt;em&gt;before&lt;/em&gt; it lands. 60-second setup. Completely free.&lt;/p&gt;
&lt;p&gt;In short, git-lrc helps &lt;strong&gt;Prevent Outages, Breaches, and Technical Debt Before They Happen&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At a glance:&lt;/strong&gt; &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;10 risk categories&lt;/a&gt; · &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;100+ failure patterns tracked&lt;/a&gt; · every commit…&lt;/p&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Self-Attention: The Brilliant Idea That Made Large Language Models Possible</title>
      <dc:creator>Shrijith Venkatramana</dc:creator>
      <pubDate>Sun, 28 Jun 2026 17:18:46 +0000</pubDate>
      <link>https://dev.to/shrsv/self-attention-the-brilliant-idea-that-made-large-language-models-possible-1oj</link>
      <guid>https://dev.to/shrsv/self-attention-the-brilliant-idea-that-made-large-language-models-possible-1oj</guid>
      <description>&lt;p&gt;&lt;em&gt;Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;Star Us&lt;/a&gt; to help devs discover the project. Do give it a try and share your feedback for improving the product.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;How a seemingly simple mathematical trick replaced decades of sequential neural networks and unlocked the age of GPT.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Imagine asking ten software engineers to summarize a pull request.&lt;/p&gt;

&lt;p&gt;One engineer reads every line from top to bottom. Another immediately jumps to the files that seem most relevant. A senior engineer skims most of the code but pays close attention to the parts that affect authentication, concurrency, or performance.&lt;/p&gt;

&lt;p&gt;The senior engineer isn't processing every line equally.&lt;/p&gt;

&lt;p&gt;They're &lt;strong&gt;paying attention&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That simple observation eventually became one of the most important ideas in modern machine learning. In 2017, a group of researchers at Google published a paper with an almost understated title: "&lt;strong&gt;\"Attention Is All You Need.\"&lt;/strong&gt; The paper introduced the &lt;strong&gt;Transformer&lt;/strong&gt;, a new neural network architecture that abandoned recurrent networks entirely in favor of one central mechanism: &lt;strong&gt;self-attention&lt;/strong&gt;."&lt;/p&gt;

&lt;p&gt;Today, nearly every major Large Language Model—GPT, Claude, Gemini, Llama, DeepSeek, Mistral—builds upon this idea.&lt;/p&gt;

&lt;p&gt;Let's understand why.&lt;/p&gt;

&lt;h1&gt;
  
  
  Before Transformers: Language Was Processed Like a Conveyor Belt
&lt;/h1&gt;

&lt;p&gt;For nearly two decades, sequence models were dominated by &lt;strong&gt;Recurrent Neural Networks (RNNs)&lt;/strong&gt; and later &lt;strong&gt;LSTMs&lt;/strong&gt; and &lt;strong&gt;GRUs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Suppose we have the sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The animal didn't cross the road because it was tired.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An RNN processes it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The
 ↓
animal
 ↓
didn't
 ↓
cross
 ↓
...
 ↓
tired
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Every new word updates a hidden state.&lt;/p&gt;

&lt;p&gt;If the model wants to understand &lt;strong&gt;"it"&lt;/strong&gt;, information about &lt;strong&gt;"animal"&lt;/strong&gt; has already travelled through six or seven intermediate computations.&lt;/p&gt;

&lt;p&gt;It's a little like the children's game of telephone. Every time information is passed forward, a little noise is introduced.&lt;/p&gt;

&lt;p&gt;The longer the sentence becomes, the harder it is to preserve distant information.&lt;/p&gt;

&lt;p&gt;This caused several practical problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long-range dependencies became difficult.&lt;/li&gt;
&lt;li&gt;Training was inherently sequential.&lt;/li&gt;
&lt;li&gt;GPUs—which thrive on parallel computation—were underutilized.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even clever improvements like LSTMs only partially solved these issues.&lt;/p&gt;

&lt;p&gt;Researchers began asking a different question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if every word could simply look at every other word directly?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That question became self-attention.&lt;/p&gt;
&lt;h1&gt;
  
  
  The Core Idea: Every Word Gets to Read the Entire Sentence
&lt;/h1&gt;

&lt;p&gt;Instead of processing words one after another, self-attention lets every token consult every other token before deciding what it should represent.&lt;/p&gt;

&lt;p&gt;Consider:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The trophy didn't fit into the suitcase because it was too small.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When humans read "it", we naturally ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;trophy?&lt;/li&gt;
&lt;li&gt;suitcase?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our brain briefly looks backward.&lt;/p&gt;

&lt;p&gt;Transformers perform the same operation mathematically.&lt;/p&gt;

&lt;p&gt;When computing the representation for &lt;strong&gt;it&lt;/strong&gt;, the model assigns attention weights:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Word&lt;/th&gt;
&lt;th&gt;Importance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;trophy&lt;/td&gt;
&lt;td&gt;0.08&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;suitcase&lt;/td&gt;
&lt;td&gt;0.67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;small&lt;/td&gt;
&lt;td&gt;0.17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fit&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;others&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These numbers are not programmed.&lt;/p&gt;

&lt;p&gt;They are learned from enormous amounts of text.&lt;/p&gt;

&lt;p&gt;The new representation becomes approximately:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;representation(it)

=
0.67 × suitcase
+
0.17 × small
+
0.08 × trophy
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Notice something subtle.&lt;/p&gt;

&lt;p&gt;The word &lt;strong&gt;it&lt;/strong&gt; itself never changes.&lt;/p&gt;

&lt;p&gt;Instead, its &lt;strong&gt;vector representation&lt;/strong&gt; becomes richer because it incorporates contextual information from the rest of the sentence.&lt;/p&gt;

&lt;p&gt;This is why the mechanism is called &lt;strong&gt;self-attention&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The sentence is attending to itself.&lt;/p&gt;
&lt;h1&gt;
  
  
  Why This Was Revolutionary
&lt;/h1&gt;

&lt;p&gt;The Google paper's title—&lt;strong&gt;Attention Is All You Need&lt;/strong&gt;—was intentionally provocative.&lt;/p&gt;

&lt;p&gt;At the time, attention mechanisms already existed.&lt;/p&gt;

&lt;p&gt;Bahdanau and colleagues had introduced attention in neural machine translation in 2014. However, attention was only an add-on to recurrent networks.&lt;/p&gt;

&lt;p&gt;The Transformer asked a far bolder question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What happens if we remove recurrence completely?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input
 ↓
LSTM
 ↓
LSTM
 ↓
LSTM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;the Transformer became:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input
 ↓
Self Attention
 ↓
Feed Forward
 ↓
Self Attention
 ↓
Feed Forward
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;No recurrence.&lt;/p&gt;

&lt;p&gt;No convolutions.&lt;/p&gt;

&lt;p&gt;Just attention layers stacked dozens—or eventually hundreds—of times.&lt;/p&gt;

&lt;p&gt;Many researchers initially viewed this as risky.&lt;/p&gt;

&lt;p&gt;Within a year, it became obvious the idea worked astonishingly well.&lt;/p&gt;
&lt;h1&gt;
  
  
  The Math Is Simple and Elegant
&lt;/h1&gt;

&lt;p&gt;The mathematics often intimidates newcomers, but the underlying idea is straightforward.&lt;/p&gt;

&lt;p&gt;Each token produces three vectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query (Q)&lt;/strong&gt; → What information am I looking for?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key (K)&lt;/strong&gt; → What information do I contain?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value (V)&lt;/strong&gt; → What information should I contribute?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of attending a technical conference.&lt;/p&gt;

&lt;p&gt;Every attendee carries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a list of questions they're interested in (Query),&lt;/li&gt;
&lt;li&gt;a badge describing their expertise (Key),&lt;/li&gt;
&lt;li&gt;the knowledge they can share (Value).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conversation happens when someone's questions match another person's expertise.&lt;/p&gt;

&lt;p&gt;Mathematically:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;score = Query · Key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The dot product measures compatibility.&lt;/p&gt;

&lt;p&gt;Large dot product?&lt;/p&gt;

&lt;p&gt;Pay attention.&lt;/p&gt;

&lt;p&gt;Small dot product?&lt;/p&gt;

&lt;p&gt;Ignore.&lt;/p&gt;

&lt;p&gt;The scores are normalized using the Softmax function:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;QKᵀ&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="err"&gt;√&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The division by √d prevents very large vector dimensions from producing excessively large dot products that would make Softmax saturate. Without this scaling, gradients become small and training becomes unstable.&lt;/p&gt;

&lt;p&gt;Finally,&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Each token becomes a weighted combination of information from every other token.&lt;/p&gt;

&lt;p&gt;That's the entire mechanism.&lt;/p&gt;

&lt;p&gt;The famous equation occupies only a single line in the original paper.&lt;/p&gt;

&lt;p&gt;Yet it changed AI forever.&lt;/p&gt;
&lt;h1&gt;
  
  
  A Back-of-the-Envelope Calculation: Why Attention Is Expensive
&lt;/h1&gt;

&lt;p&gt;Self-attention's biggest strength is also its biggest weakness.&lt;/p&gt;

&lt;p&gt;Suppose a context contains &lt;strong&gt;4,096 tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every token compares itself against every other token.&lt;/p&gt;

&lt;p&gt;Total comparisons:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;4096 × 4096

≈ 16.8 million
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now consider modern models.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8,192 tokens&lt;/li&gt;
&lt;li&gt;32 attention heads&lt;/li&gt;
&lt;li&gt;dozens of Transformer layers&lt;/li&gt;
&lt;li&gt;billions of parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The number of operations quickly reaches the trillions during training.&lt;/p&gt;

&lt;p&gt;This explains why training frontier models requires thousands of GPUs running continuously for weeks or months.&lt;/p&gt;

&lt;p&gt;The economics become equally striking.&lt;/p&gt;

&lt;p&gt;A single GPU might cost tens of thousands of dollars. Training clusters contain thousands of them.&lt;/p&gt;

&lt;p&gt;Electricity, cooling, networking, storage, engineering time, and failed experiments all contribute to training costs that can reach tens or even hundreds of millions of dollars for the largest models.&lt;/p&gt;

&lt;p&gt;This computational expense has also motivated an entire research field devoted to making attention cheaper.&lt;/p&gt;

&lt;p&gt;Techniques such as FlashAttention, grouped-query attention, sparse attention, and linear attention all attempt to preserve the quality of self-attention while reducing memory usage or computational complexity.&lt;/p&gt;

&lt;p&gt;Ironically, many innovations in modern LLM engineering are really innovations in making self-attention practical at scale.&lt;/p&gt;
&lt;h1&gt;
  
  
  Why Self-Attention Became the Foundation of LLMs
&lt;/h1&gt;

&lt;p&gt;Language isn't fundamentally sequential.&lt;/p&gt;

&lt;p&gt;Relationships often span entire documents.&lt;/p&gt;

&lt;p&gt;A variable declared hundreds of lines earlier influences the current line of code.&lt;/p&gt;

&lt;p&gt;A pronoun refers to a noun introduced several paragraphs ago.&lt;/p&gt;

&lt;p&gt;An API call depends on documentation presented earlier in the conversation.&lt;/p&gt;

&lt;p&gt;Self-attention naturally models these relationships.&lt;/p&gt;

&lt;p&gt;It also parallelizes beautifully.&lt;/p&gt;

&lt;p&gt;Unlike RNNs, every token in a sequence can be processed simultaneously on modern GPUs.&lt;/p&gt;

&lt;p&gt;That single architectural decision dramatically increased hardware utilization and enabled models to scale from millions of parameters to today's trillion-parameter frontier.&lt;/p&gt;

&lt;p&gt;Perhaps the greatest lesson is that breakthroughs are not always about making systems more complicated.&lt;/p&gt;

&lt;p&gt;Sometimes they're about removing assumptions.&lt;/p&gt;

&lt;p&gt;The Transformer removed the assumption that language must be processed one word at a time.&lt;/p&gt;

&lt;p&gt;Everything that followed—from GPT-2 to ChatGPT—was built on that realization.&lt;/p&gt;
&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;It's easy to think of GPT as an impossibly complex black box.&lt;/p&gt;

&lt;p&gt;But underneath the billions of parameters lies a surprisingly elegant principle.&lt;/p&gt;

&lt;p&gt;Every word asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Which other words should I pay attention to?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That single question replaced decades of recurrent architectures and reshaped artificial intelligence.&lt;/p&gt;

&lt;p&gt;Sometimes, the most revolutionary ideas aren't new ways of computing.&lt;/p&gt;

&lt;p&gt;They're new ways of deciding &lt;strong&gt;what deserves attention&lt;/strong&gt;.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;What surprised you most about self-attention?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Was it that the core algorithm fits into a single equation, or that one architectural decision replaced decades of recurrent neural networks? I'd love to hear your thoughts—or any clever analogies you've found useful when explaining Transformers to other developers.&lt;/p&gt;



&lt;p&gt;*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.&lt;/p&gt;

&lt;p&gt;git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*&lt;/p&gt;

&lt;p&gt;Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/HexmosTech" rel="noopener noreferrer"&gt;
        HexmosTech
      &lt;/a&gt; / &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;
        git-lrc
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Free, Micro AI Code Reviews That Run on Git Commit
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div&gt;
&lt;p&gt;| &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.da.md" rel="noopener noreferrer"&gt;🇩🇰 Dansk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.es.md" rel="noopener noreferrer"&gt;🇪🇸 Español&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fa.md" rel="noopener noreferrer"&gt;🇮🇷 Farsi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fi.md" rel="noopener noreferrer"&gt;🇫🇮 Suomi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ja.md" rel="noopener noreferrer"&gt;🇯🇵 日本語&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.nn.md" rel="noopener noreferrer"&gt;🇳🇴 Norsk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.pt.md" rel="noopener noreferrer"&gt;🇵🇹 Português&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ru.md" rel="noopener noreferrer"&gt;🇷🇺 Русский&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.sq.md" rel="noopener noreferrer"&gt;🇦🇱 Shqip&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.zh.md" rel="noopener noreferrer"&gt;🇨🇳 中文&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.hi.md" rel="noopener noreferrer"&gt;🇮🇳 हिन्दी&lt;/a&gt; |&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;img width="60" alt="git-lrc logo" src="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;git-lrc&lt;/h1&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Free, Micro AI Code Reviews That Run on Commit&lt;/h2&gt;
&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.producthunt.com/products/git-lrc?embed=true&amp;amp;utm_source=badge-top-post-badge&amp;amp;utm_medium=badge&amp;amp;utm_campaign=badge-git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="git-lrc - Free, micro AI code reviews that run on commit | Product Hunt" width="200" src="https://camo.githubusercontent.com/87bf2d4283c1e0aa99e254bd17fefb1c67c0c0d39300043a243a4aa633b6cecc/68747470733a2f2f6170692e70726f6475637468756e742e636f6d2f776964676574732f656d6265642d696d6167652f76312f746f702d706f73742d62616467652e7376673f706f73745f69643d31303739323632267468656d653d6c6967687426706572696f643d6461696c7926743d31373731373439313730383638"&gt;&lt;/a&gt;
&amp;nbsp;&lt;/p&gt;
&lt;br&gt;
&lt;a href="https://discord.gg/sGdnKwB3qq" rel="nofollow noopener noreferrer"&gt;
  &lt;img alt="Discord Community" src="https://camo.githubusercontent.com/b8f979318aaabc8dec512b9d4e6e2a12431fba3c8a3b8738e1a97a0722d4e4bf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d436f6d6d756e6974792d3538363546323f6c6f676f3d646973636f7264266c6162656c436f6c6f723d7768697465"&gt;
&lt;/a&gt; &lt;a href="https://goreportcard.com/report/github.com/HexmosTech/git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="Go Report Card" src="https://camo.githubusercontent.com/e74c0651c3ee9165a2ed01cb0f6842c494029960df30eb9c24cf622d3d21bf46/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f4865786d6f73546563682f6769742d6c7263"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml" rel="noopener noreferrer"&gt;&lt;img alt="gitleaks.yml" title="gitleaks.yml: Secret scanning workflow" src="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml" rel="noopener noreferrer"&gt;&lt;img alt="osv-scanner.yml" title="osv-scanner.yml: Dependency vulnerability scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml" rel="noopener noreferrer"&gt;&lt;img alt="govulncheck.yml" title="govulncheck.yml: Go vulnerability check" src="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml" rel="noopener noreferrer"&gt;&lt;img alt="semgrep.yml" title="semgrep.yml: Static analysis security scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/dependabot-enabled.svg"&gt;&lt;img alt="dependabot-enabled" title="dependabot-enabled: Automated dependency updates are enabled" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fdependabot-enabled.svg"&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/a_few_micro_reviews.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fa_few_micro_reviews.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GenAI today is a &lt;strong&gt;race car without brakes&lt;/strong&gt;. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents &lt;em&gt;silently break things&lt;/em&gt;: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. You often find out in production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;git-lrc&lt;/code&gt; is your braking system.&lt;/strong&gt; It hooks into &lt;code&gt;git commit&lt;/code&gt; and runs an AI review on every diff &lt;em&gt;before&lt;/em&gt; it lands. 60-second setup. Completely free.&lt;/p&gt;
&lt;p&gt;In short, git-lrc helps &lt;strong&gt;Prevent Outages, Breaches, and Technical Debt Before They Happen&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At a glance:&lt;/strong&gt; &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;10 risk categories&lt;/a&gt; · &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;100+ failure patterns tracked&lt;/a&gt; · every commit…&lt;/p&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Sequence Transduction: The Forgotten Problem That Led to Modern LLMs</title>
      <dc:creator>Shrijith Venkatramana</dc:creator>
      <pubDate>Sat, 27 Jun 2026 17:15:41 +0000</pubDate>
      <link>https://dev.to/shrsv/sequence-transduction-the-forgotten-problem-that-led-to-modern-llms-439e</link>
      <guid>https://dev.to/shrsv/sequence-transduction-the-forgotten-problem-that-led-to-modern-llms-439e</guid>
      <description>&lt;p&gt;&lt;em&gt;Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;Star Us&lt;/a&gt; to help devs discover the project. Do give it a try and share your feedback for improving the product.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Most developers think large language models were built to predict the next word. They weren't—not at first.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you travel back to the early 2010s, the hardest problems in AI weren't writing poems or generating code. They were translating English into French, converting speech into text, and summarizing documents. These were all instances of the same challenge: &lt;strong&gt;sequence transduction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The term appears almost casually in the opening paragraph of the Transformer paper:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"...sequence modeling and transduction problems such as language modeling and machine translation."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Today, almost everyone knows the Transformer. Very few remember the problem it was invented to solve.&lt;/p&gt;

&lt;p&gt;Ironically, solving sequence transduction turned out to create the foundation upon which modern LLMs would later emerge.&lt;/p&gt;

&lt;p&gt;Let's explore why.&lt;/p&gt;

&lt;h1&gt;
  
  
  What Exactly Is Sequence Transduction?
&lt;/h1&gt;

&lt;p&gt;Imagine you own a factory.&lt;/p&gt;

&lt;p&gt;A sequence modeling problem asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Given everything that has happened so far, what comes next?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Like predicting the next product coming off the conveyor belt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The cat sat on the _____
                ↓
               mat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is language modeling.&lt;/p&gt;

&lt;p&gt;A sequence transduction problem is larger idea.&lt;/p&gt;

&lt;p&gt;Instead of predicting one missing piece, you transform an entire sequence into another.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;English
↓

"The weather is nice."

↓

French

"Il fait beau."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Or&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Audio
↓

Waveform

↓

Text

"Welcome everyone."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Or&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Buggy code

↓

Correct code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Different input.&lt;/p&gt;

&lt;p&gt;Different output.&lt;/p&gt;

&lt;p&gt;Often different lengths.&lt;/p&gt;

&lt;p&gt;The model must understand the entire source or at least large parts of it before generating the target.&lt;/p&gt;

&lt;p&gt;In hindsight, modern AI assistants spend almost all of their time doing sequence transduction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;translating languages&lt;/li&gt;
&lt;li&gt;summarizing reports&lt;/li&gt;
&lt;li&gt;generating SQL&lt;/li&gt;
&lt;li&gt;converting Python to Rust&lt;/li&gt;
&lt;li&gt;explaining stack traces&lt;/li&gt;
&lt;li&gt;producing commit messages&lt;/li&gt;
&lt;li&gt;writing documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They all reduce to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Input sequence → Output sequence&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;
  
  
  Why Was This Such a Hard Problem?
&lt;/h1&gt;

&lt;p&gt;Humans underestimate how much memory translation requires.&lt;/p&gt;

&lt;p&gt;Consider translating:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The committee, after reviewing several proposals over three months, finally approved the budget."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Suppose you're translating into German.&lt;/p&gt;

&lt;p&gt;The verb may not appear until the end.&lt;/p&gt;

&lt;p&gt;To translate correctly, the model must remember:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;who performed the action&lt;/li&gt;
&lt;li&gt;tense&lt;/li&gt;
&lt;li&gt;plurality&lt;/li&gt;
&lt;li&gt;grammatical structure&lt;/li&gt;
&lt;li&gt;dependencies dozens of words apart&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Early neural networks processed text one word at a time.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Word₁ → hidden state
              ↓
Word₂ → hidden state
              ↓
Word₃ → hidden state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Everything had to be compressed into one hidden vector.&lt;/p&gt;

&lt;p&gt;It was like asking someone to summarize an entire novel using only one sticky note.&lt;/p&gt;

&lt;p&gt;Eventually information disappears.&lt;/p&gt;

&lt;p&gt;This became known as the &lt;strong&gt;long-range dependency problem&lt;/strong&gt;.&lt;/p&gt;
&lt;h1&gt;
  
  
  The Rise—and Limits—of Recurrent Neural Networks
&lt;/h1&gt;

&lt;p&gt;During the late 1980s and 1990s, researchers developed &lt;strong&gt;Recurrent Neural Networks (RNNs)&lt;/strong&gt; to process sequential data.&lt;/p&gt;

&lt;p&gt;Unlike ordinary neural networks, RNNs reused the same parameters at every time step.&lt;/p&gt;

&lt;p&gt;Instead of building a different network for every word, one network repeatedly updated an internal memory.&lt;/p&gt;

&lt;p&gt;Mathematically:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;hidden_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;previous_hidden_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The same computation runs repeatedly.&lt;/p&gt;

&lt;p&gt;This parameter sharing was elegant.&lt;/p&gt;

&lt;p&gt;Suppose an RNN contains one million parameters.&lt;/p&gt;

&lt;p&gt;A thousand-word paragraph still uses one million parameters—not a billion.&lt;/p&gt;

&lt;p&gt;The network simply reuses them.&lt;/p&gt;

&lt;p&gt;Economically, this was attractive. But computationally, it was painful.&lt;/p&gt;

&lt;p&gt;Everything had to happen sequentially.&lt;/p&gt;

&lt;p&gt;Word 500 could not begin until word 499 finished.&lt;/p&gt;

&lt;p&gt;No parallelism. No GPUs in picture. Training was slow.&lt;/p&gt;
&lt;h1&gt;
  
  
  LSTMs: Teaching Neural Networks to Remember
&lt;/h1&gt;

&lt;p&gt;In 1997, Sepp Hochreiter and Jürgen Schmidhuber introduced one of the most influential ideas in deep learning:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long Short-Term Memory (LSTM).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of blindly overwriting memory every step, the network learned gates.&lt;/p&gt;

&lt;p&gt;Think of memory like a whiteboard.&lt;/p&gt;

&lt;p&gt;Each word asks three questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Should I erase something?&lt;/li&gt;
&lt;li&gt;Should I remember this?&lt;/li&gt;
&lt;li&gt;Should I reveal this later?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those questions became three learned gates.&lt;/p&gt;

&lt;p&gt;Forget gate.&lt;/p&gt;

&lt;p&gt;Input gate.&lt;/p&gt;

&lt;p&gt;Output gate.&lt;/p&gt;

&lt;p&gt;Instead of forcing every piece of information through the same bottleneck, the model learned what deserved long-term storage.&lt;/p&gt;

&lt;p&gt;A surprisingly intuitive analogy is human note-taking.&lt;/p&gt;

&lt;p&gt;Most conversations are forgotten.&lt;/p&gt;

&lt;p&gt;A few facts are written into your notebook.&lt;/p&gt;

&lt;p&gt;LSTMs learned which facts deserved the notebook.&lt;/p&gt;

&lt;p&gt;For over a decade, LSTMs dominated speech recognition, handwriting recognition, language translation, and time-series forecasting.&lt;/p&gt;

&lt;p&gt;Google, Apple, Microsoft, and Baidu all deployed enormous production systems powered by them.&lt;/p&gt;
&lt;h1&gt;
  
  
  The Encoder–Decoder Revolution
&lt;/h1&gt;

&lt;p&gt;Around 2014, another breakthrough appeared.&lt;/p&gt;

&lt;p&gt;Instead of using one RNN/LSTMs for everything, researchers separated the task into two parts.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input sentence
      ↓
Encoder
      ↓
Meaning vector
      ↓
Decoder
      ↓
Output sentence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This architecture became known as the &lt;strong&gt;sequence-to-sequence (Seq2Seq)&lt;/strong&gt; model.&lt;/p&gt;

&lt;p&gt;For the first time, neural networks learned translation end-to-end.&lt;/p&gt;

&lt;p&gt;No phrase tables or handcrafted grammar or brittle rules. It was just millions of examples.&lt;/p&gt;

&lt;p&gt;One famous anecdote came from Google.&lt;/p&gt;

&lt;p&gt;Traditional statistical machine translation systems consisted of dozens of independently engineered components accumulated over years.&lt;/p&gt;

&lt;p&gt;Neural machine translation replaced much of that complexity with a single differentiable model trained from data. In 2016, Google reported that its neural system substantially reduced translation errors across multiple language pairs while simplifying the overall pipeline.&lt;/p&gt;

&lt;p&gt;This represented an engineering improvement for sure, but more importantly it was a philosophical shift.&lt;/p&gt;

&lt;p&gt;Instead of programming language knowledge -- we trained it.&lt;/p&gt;
&lt;h1&gt;
  
  
  Attention Changed Everything
&lt;/h1&gt;

&lt;p&gt;The Seq2Seq model still had one weakness.&lt;/p&gt;

&lt;p&gt;Everything had to fit inside one vector.&lt;/p&gt;

&lt;p&gt;Information gets lost.&lt;/p&gt;

&lt;p&gt;In 2014, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio proposed &lt;strong&gt;attention&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of remembering everything, just look back whenever necessary.&lt;/p&gt;

&lt;p&gt;While generating each output word, the decoder asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Which input words matter right now?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not every word.&lt;/p&gt;

&lt;p&gt;Only the relevant ones.&lt;/p&gt;

&lt;p&gt;Translation suddenly became much easier.&lt;/p&gt;

&lt;p&gt;Long sentences improved dramatically.&lt;/p&gt;
&lt;h1&gt;
  
  
  From Translation Engine to ChatGPT
&lt;/h1&gt;

&lt;p&gt;The Transformer paper in 2017 -- instead of improving recurrent networks, removed recurrence entirely.&lt;/p&gt;

&lt;p&gt;Every word could attend directly to every other word.&lt;/p&gt;

&lt;p&gt;Parallel computation became possible.&lt;/p&gt;

&lt;p&gt;Training speed increased enormously.&lt;/p&gt;

&lt;p&gt;GPUs became dramatically more efficient because every token in a sequence could be processed simultaneously rather than one after another.&lt;/p&gt;

&lt;p&gt;Even more interesting was the economics.&lt;/p&gt;

&lt;p&gt;Suppose translating a sentence of 100 words with an RNN requires roughly 100 sequential computation steps.&lt;/p&gt;

&lt;p&gt;A Transformer still performs similar amounts of arithmetic overall, but many of those operations can execute in parallel on modern accelerators.&lt;/p&gt;

&lt;p&gt;The wall-clock training time drops dramatically because GPUs are optimized for large batches of matrix multiplications rather than long chains of sequential dependencies.&lt;/p&gt;

&lt;p&gt;That operational advantage—not merely higher accuracy—made scaling practical.&lt;/p&gt;

&lt;p&gt;The remarkable twist is that the architecture built to solve translation generalized astonishingly well.&lt;/p&gt;

&lt;p&gt;Replace:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;English → French
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;with&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Question → Answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Code → Documentation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Prompt&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="n"&gt;program&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The underlying problem barely changes.&lt;/p&gt;

&lt;p&gt;It remains sequence transduction.&lt;/p&gt;

&lt;p&gt;Modern LLMs still perform next-token prediction during training.&lt;/p&gt;

&lt;p&gt;But from a developer's perspective, they are universal transduction engines.&lt;/p&gt;

&lt;p&gt;Every prompt is transformed into another sequence.&lt;/p&gt;

&lt;p&gt;The interface changed.&lt;/p&gt;

&lt;p&gt;The underlying abstraction survived.&lt;/p&gt;
&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;The history of AI is often told as a story about predicting the next word.&lt;/p&gt;

&lt;p&gt;That story is incomplete.&lt;/p&gt;

&lt;p&gt;For decades, researchers wrestled with a harder question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do we transform one complex sequence into another while preserving meaning?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That single question drove the invention of encoder–decoder architectures, LSTMs, attention mechanisms, and ultimately the Transformer itself.&lt;/p&gt;

&lt;p&gt;The next time you ask an LLM to refactor code, summarize a meeting, or generate a SQL query, remember what it's really doing.&lt;/p&gt;

&lt;p&gt;Not merely predicting words.&lt;/p&gt;

&lt;p&gt;Performing sequence transduction at an extraordinary scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What surprised you most about this history?&lt;/strong&gt; Did you always think LLMs grew out of language modeling, or is it more useful to think of them as the latest—and perhaps most powerful—generation of sequence transduction systems?&lt;/p&gt;



&lt;p&gt;*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.&lt;/p&gt;

&lt;p&gt;git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*&lt;/p&gt;

&lt;p&gt;Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/HexmosTech" rel="noopener noreferrer"&gt;
        HexmosTech
      &lt;/a&gt; / &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;
        git-lrc
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Free, Micro AI Code Reviews That Run on Git Commit
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div&gt;
&lt;p&gt;| &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.da.md" rel="noopener noreferrer"&gt;🇩🇰 Dansk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.es.md" rel="noopener noreferrer"&gt;🇪🇸 Español&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fa.md" rel="noopener noreferrer"&gt;🇮🇷 Farsi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fi.md" rel="noopener noreferrer"&gt;🇫🇮 Suomi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ja.md" rel="noopener noreferrer"&gt;🇯🇵 日本語&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.nn.md" rel="noopener noreferrer"&gt;🇳🇴 Norsk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.pt.md" rel="noopener noreferrer"&gt;🇵🇹 Português&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ru.md" rel="noopener noreferrer"&gt;🇷🇺 Русский&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.sq.md" rel="noopener noreferrer"&gt;🇦🇱 Shqip&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.zh.md" rel="noopener noreferrer"&gt;🇨🇳 中文&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.hi.md" rel="noopener noreferrer"&gt;🇮🇳 हिन्दी&lt;/a&gt; |&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;img width="60" alt="git-lrc logo" src="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;git-lrc&lt;/h1&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Free, Micro AI Code Reviews That Run on Commit&lt;/h2&gt;
&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.producthunt.com/products/git-lrc?embed=true&amp;amp;utm_source=badge-top-post-badge&amp;amp;utm_medium=badge&amp;amp;utm_campaign=badge-git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="git-lrc - Free, micro AI code reviews that run on commit | Product Hunt" width="200" src="https://camo.githubusercontent.com/87bf2d4283c1e0aa99e254bd17fefb1c67c0c0d39300043a243a4aa633b6cecc/68747470733a2f2f6170692e70726f6475637468756e742e636f6d2f776964676574732f656d6265642d696d6167652f76312f746f702d706f73742d62616467652e7376673f706f73745f69643d31303739323632267468656d653d6c6967687426706572696f643d6461696c7926743d31373731373439313730383638"&gt;&lt;/a&gt;
&amp;nbsp;&lt;/p&gt;
&lt;br&gt;
&lt;a href="https://discord.gg/sGdnKwB3qq" rel="nofollow noopener noreferrer"&gt;
  &lt;img alt="Discord Community" src="https://camo.githubusercontent.com/b8f979318aaabc8dec512b9d4e6e2a12431fba3c8a3b8738e1a97a0722d4e4bf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d436f6d6d756e6974792d3538363546323f6c6f676f3d646973636f7264266c6162656c436f6c6f723d7768697465"&gt;
&lt;/a&gt; &lt;a href="https://goreportcard.com/report/github.com/HexmosTech/git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="Go Report Card" src="https://camo.githubusercontent.com/e74c0651c3ee9165a2ed01cb0f6842c494029960df30eb9c24cf622d3d21bf46/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f4865786d6f73546563682f6769742d6c7263"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml" rel="noopener noreferrer"&gt;&lt;img alt="gitleaks.yml" title="gitleaks.yml: Secret scanning workflow" src="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml" rel="noopener noreferrer"&gt;&lt;img alt="osv-scanner.yml" title="osv-scanner.yml: Dependency vulnerability scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml" rel="noopener noreferrer"&gt;&lt;img alt="govulncheck.yml" title="govulncheck.yml: Go vulnerability check" src="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml" rel="noopener noreferrer"&gt;&lt;img alt="semgrep.yml" title="semgrep.yml: Static analysis security scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/dependabot-enabled.svg"&gt;&lt;img alt="dependabot-enabled" title="dependabot-enabled: Automated dependency updates are enabled" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fdependabot-enabled.svg"&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/a_few_micro_reviews.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fa_few_micro_reviews.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GenAI today is a &lt;strong&gt;race car without brakes&lt;/strong&gt;. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents &lt;em&gt;silently break things&lt;/em&gt;: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. You often find out in production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;git-lrc&lt;/code&gt; is your braking system.&lt;/strong&gt; It hooks into &lt;code&gt;git commit&lt;/code&gt; and runs an AI review on every diff &lt;em&gt;before&lt;/em&gt; it lands. 60-second setup. Completely free.&lt;/p&gt;
&lt;p&gt;In short, git-lrc helps &lt;strong&gt;Prevent Outages, Breaches, and Technical Debt Before They Happen&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At a glance:&lt;/strong&gt; &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;10 risk categories&lt;/a&gt; · &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;100+ failure patterns tracked&lt;/a&gt; · every commit…&lt;/p&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How We Actually Measure Whether an LLM's Output Is Good - BLEU, COMET and BLEURT</title>
      <dc:creator>Shrijith Venkatramana</dc:creator>
      <pubDate>Fri, 26 Jun 2026 18:36:27 +0000</pubDate>
      <link>https://dev.to/shrsv/how-we-actually-measure-whether-an-llms-output-is-good-bleu-comet-and-bleurt-3c0f</link>
      <guid>https://dev.to/shrsv/how-we-actually-measure-whether-an-llms-output-is-good-bleu-comet-and-bleurt-3c0f</guid>
      <description>&lt;p&gt;&lt;em&gt;Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;Star Us&lt;/a&gt; to help devs discover the project. Do give it a try and share your feedback for improving the product.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;An AI model writes a paragraph. It sounds fluent. It looks convincing. But how do you know whether it's actually good?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This deceptively simple question has occupied researchers for more than two decades.&lt;/p&gt;

&lt;p&gt;Long before ChatGPT, machine translation researchers faced exactly the same problem. Human evaluation was expensive, inconsistent, and painfully slow. If every new model required thousands of humans to compare translations, research would crawl.&lt;/p&gt;

&lt;p&gt;That necessity gave rise to &lt;strong&gt;BLEU&lt;/strong&gt;, one of the most influential evaluation metrics in AI history. Years later, as language models became better at paraphrasing and reasoning, BLEU started to show its age. Researchers responded with learned metrics like &lt;strong&gt;BLEURT&lt;/strong&gt; and &lt;strong&gt;COMET&lt;/strong&gt;, which use neural networks to judge language much more like humans do.&lt;/p&gt;

&lt;p&gt;Interestingly, this mirrors software engineering itself. We first wrote simple unit tests, then integration tests, and today we increasingly rely on sophisticated observability systems. Evaluation metrics for LLMs have undergone a similar evolution.&lt;/p&gt;

&lt;p&gt;Let's see why.&lt;/p&gt;

&lt;h1&gt;
  
  
  Before BLEU: The Evaluation Bottleneck
&lt;/h1&gt;

&lt;p&gt;Imagine you're building Google Translate in 2001.&lt;/p&gt;

&lt;p&gt;Every time your team improves the model, someone has to read thousands of translated sentences and score them.&lt;/p&gt;

&lt;p&gt;Suppose a single sentence pair takes only 20 seconds to judge.&lt;/p&gt;

&lt;p&gt;Evaluating 50,000 sentences would require nearly &lt;strong&gt;280 human-hours&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now imagine dozens of experiments every week.&lt;/p&gt;

&lt;p&gt;Evaluation—not training—quickly becomes the bottleneck.&lt;/p&gt;

&lt;p&gt;Researchers at IBM, led by &lt;strong&gt;Kishore Papineni&lt;/strong&gt;, introduced &lt;strong&gt;BLEU (Bilingual Evaluation Understudy)&lt;/strong&gt; in 2002 to automate this process.&lt;/p&gt;

&lt;p&gt;Their idea was surprisingly simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If a machine translation resembles what professional translators write, it's probably good.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This became one of the most cited papers in natural language processing.&lt;/p&gt;

&lt;h1&gt;
  
  
  BLEU: Counting Shared Phrases
&lt;/h1&gt;

&lt;p&gt;BLEU compares a model's output against one or more human reference translations.&lt;/p&gt;

&lt;p&gt;Suppose the reference is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The cat is sitting on the mat.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model produces:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The cat sat on the mat.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Many words and short phrases overlap.&lt;/p&gt;

&lt;p&gt;Now consider:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A feline rested indoors.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A human recognizes this as a perfectly reasonable translation.&lt;/p&gt;

&lt;p&gt;BLEU mostly doesn't.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because BLEU isn't measuring meaning.&lt;/p&gt;

&lt;p&gt;It measures &lt;strong&gt;shared n-grams&lt;/strong&gt;—contiguous sequences of words.&lt;/p&gt;

&lt;p&gt;The score combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;word matches (1-grams)&lt;/li&gt;
&lt;li&gt;two-word phrases&lt;/li&gt;
&lt;li&gt;three-word phrases&lt;/li&gt;
&lt;li&gt;four-word phrases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also penalizes outputs that are suspiciously short.&lt;/p&gt;

&lt;p&gt;High-level intuition:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More matching phrases → higher score&lt;/li&gt;
&lt;li&gt;Longer matching phrases → even better&lt;/li&gt;
&lt;li&gt;Too short → penalty&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This simple idea turned out to correlate surprisingly well with human judgments across large datasets.&lt;/p&gt;

&lt;p&gt;When the famous &lt;strong&gt;Transformer&lt;/strong&gt; paper &lt;em&gt;Attention Is All You Need&lt;/em&gt; reported &lt;strong&gt;28.4 BLEU&lt;/strong&gt; on the WMT English-German benchmark, that represented roughly a &lt;strong&gt;2 BLEU improvement&lt;/strong&gt; over previous systems—a significant jump that helped establish the Transformer as the new state of the art.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why BLEU Eventually Broke Down
&lt;/h1&gt;

&lt;p&gt;BLEU assumes that good translations look similar.&lt;/p&gt;

&lt;p&gt;Modern LLMs don't.&lt;/p&gt;

&lt;p&gt;Consider these summaries.&lt;/p&gt;

&lt;p&gt;Reference:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The meeting was postponed because the client requested additional documentation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Output A:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The meeting was delayed after the client asked for more documents.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Output B:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Client requested more paperwork, so the meeting moved.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Humans would probably give both excellent scores.&lt;/p&gt;

&lt;p&gt;BLEU prefers whichever shares more exact phrases.&lt;/p&gt;

&lt;p&gt;Now imagine asking ChatGPT:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Explain recursion like I'm five.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are hundreds of excellent answers.&lt;/p&gt;

&lt;p&gt;BLEU expects one.&lt;/p&gt;

&lt;p&gt;This becomes even worse for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summarization&lt;/li&gt;
&lt;li&gt;question answering&lt;/li&gt;
&lt;li&gt;code explanations&lt;/li&gt;
&lt;li&gt;dialogue&lt;/li&gt;
&lt;li&gt;reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As models became more creative, exact word overlap became a poor proxy for quality.&lt;/p&gt;

&lt;p&gt;Researchers needed evaluation metrics that understood meaning rather than wording.&lt;/p&gt;

&lt;h1&gt;
  
  
  BLEURT: Teaching AI to Judge AI
&lt;/h1&gt;

&lt;p&gt;Google Research introduced &lt;strong&gt;BLEURT&lt;/strong&gt; in 2020.&lt;/p&gt;

&lt;p&gt;Instead of counting words, BLEURT fine-tunes a pretrained Transformer to predict human evaluation scores.&lt;/p&gt;

&lt;p&gt;Think of it as hiring a reviewer instead of using a spell checker.&lt;/p&gt;

&lt;p&gt;During training, BLEURT sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;candidate answer&lt;/li&gt;
&lt;li&gt;reference answer&lt;/li&gt;
&lt;li&gt;human quality score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After millions of examples, it learns patterns humans value:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;preserved meaning&lt;/li&gt;
&lt;li&gt;factual consistency&lt;/li&gt;
&lt;li&gt;grammatical quality&lt;/li&gt;
&lt;li&gt;fluency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An interesting engineering trick made BLEURT particularly effective.&lt;/p&gt;

&lt;p&gt;Human-scored datasets are relatively small.&lt;/p&gt;

&lt;p&gt;The researchers first generated large amounts of &lt;strong&gt;synthetically corrupted text&lt;/strong&gt;—introducing deletions, substitutions, and paraphrases—to pretrain the evaluator before fine-tuning on expensive human judgments.&lt;/p&gt;

&lt;p&gt;This significantly reduced the amount of labeled data needed.&lt;/p&gt;

&lt;h1&gt;
  
  
  COMET: Learning from Human Preferences
&lt;/h1&gt;

&lt;p&gt;Around the same time, researchers at &lt;strong&gt;Unbabel&lt;/strong&gt; developed &lt;strong&gt;COMET&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Like BLEURT, COMET uses a neural network.&lt;/p&gt;

&lt;p&gt;But it has access to something BLEURT often doesn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;original source sentence&lt;/li&gt;
&lt;li&gt;reference translation&lt;/li&gt;
&lt;li&gt;candidate translation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That additional context matters.&lt;/p&gt;

&lt;p&gt;Suppose the French sentence is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Il fait froid.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One candidate says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is cold.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Another says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is freezing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Without seeing the original sentence, both might seem acceptable.&lt;/p&gt;

&lt;p&gt;With the source available, COMET can better judge whether meaning has shifted.&lt;/p&gt;

&lt;p&gt;Modern COMET models consistently show stronger correlation with professional human evaluators than BLEU across many translation benchmarks.&lt;/p&gt;

&lt;p&gt;Today, COMET is frequently reported alongside BLEU in machine translation research.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Economics of Evaluation
&lt;/h1&gt;

&lt;p&gt;Training frontier models can cost millions of dollars.&lt;/p&gt;

&lt;p&gt;Evaluation, surprisingly, can become expensive too.&lt;/p&gt;

&lt;p&gt;Imagine comparing three model versions.&lt;/p&gt;

&lt;p&gt;Each produces answers for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 benchmark datasets&lt;/li&gt;
&lt;li&gt;5,000 prompts each&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's &lt;strong&gt;150,000 outputs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If humans spend only 15 seconds per evaluation:&lt;/p&gt;

&lt;p&gt;150,000 × 15 seconds ≈ &lt;strong&gt;625 hours&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At $40/hour for expert annotators, that's &lt;strong&gt;$25,000&lt;/strong&gt; for a single evaluation round.&lt;/p&gt;

&lt;p&gt;And that's before measuring agreement between multiple reviewers.&lt;/p&gt;

&lt;p&gt;Automatic metrics dramatically reduce this cost.&lt;/p&gt;

&lt;p&gt;A common workflow today looks like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Evaluate every experiment automatically.&lt;/li&gt;
&lt;li&gt;Keep only the best-performing candidates.&lt;/li&gt;
&lt;li&gt;Send those few models to human reviewers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The automatic metric acts as a high-quality filter rather than a replacement for humans.&lt;/p&gt;

&lt;h1&gt;
  
  
  Where We Are Today
&lt;/h1&gt;

&lt;p&gt;No single metric captures quality perfectly.&lt;/p&gt;

&lt;p&gt;BLEU remains valuable because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it's simple&lt;/li&gt;
&lt;li&gt;reproducible&lt;/li&gt;
&lt;li&gt;historically comparable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;BLEURT improves semantic understanding by learning from human judgments.&lt;/p&gt;

&lt;p&gt;COMET goes even further by incorporating the original source sentence and demonstrating stronger agreement with professional evaluators.&lt;/p&gt;

&lt;p&gt;For frontier LLMs, evaluation has become even broader.&lt;/p&gt;

&lt;p&gt;Researchers increasingly combine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;learned metrics like COMET&lt;/li&gt;
&lt;li&gt;human preference studies&lt;/li&gt;
&lt;li&gt;benchmark suites&lt;/li&gt;
&lt;li&gt;domain-specific tests&lt;/li&gt;
&lt;li&gt;LLM-as-a-judge evaluations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The lesson is larger than machine translation.&lt;/p&gt;

&lt;p&gt;As AI systems become more capable, &lt;strong&gt;evaluating them becomes an AI problem itself.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The future of language models may depend just as much on better judges as on better generators.&lt;/p&gt;




&lt;p&gt;*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.&lt;/p&gt;

&lt;p&gt;git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*&lt;/p&gt;

&lt;p&gt;Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/HexmosTech" rel="noopener noreferrer"&gt;
        HexmosTech
      &lt;/a&gt; / &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;
        git-lrc
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Free, Micro AI Code Reviews That Run on Git Commit
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div&gt;
&lt;p&gt;| &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.da.md" rel="noopener noreferrer"&gt;🇩🇰 Dansk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.es.md" rel="noopener noreferrer"&gt;🇪🇸 Español&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fa.md" rel="noopener noreferrer"&gt;🇮🇷 Farsi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fi.md" rel="noopener noreferrer"&gt;🇫🇮 Suomi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ja.md" rel="noopener noreferrer"&gt;🇯🇵 日本語&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.nn.md" rel="noopener noreferrer"&gt;🇳🇴 Norsk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.pt.md" rel="noopener noreferrer"&gt;🇵🇹 Português&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ru.md" rel="noopener noreferrer"&gt;🇷🇺 Русский&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.sq.md" rel="noopener noreferrer"&gt;🇦🇱 Shqip&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.zh.md" rel="noopener noreferrer"&gt;🇨🇳 中文&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.hi.md" rel="noopener noreferrer"&gt;🇮🇳 हिन्दी&lt;/a&gt; |&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;img width="60" alt="git-lrc logo" src="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;git-lrc&lt;/h1&gt;
&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Free, Micro AI Code Reviews That Run on Commit&lt;/h2&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;p&gt;&lt;a href="https://www.producthunt.com/products/git-lrc?embed=true&amp;amp;utm_source=badge-top-post-badge&amp;amp;utm_medium=badge&amp;amp;utm_campaign=badge-git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="git-lrc - Free, micro AI code reviews that run on commit | Product Hunt" width="200" src="https://camo.githubusercontent.com/87bf2d4283c1e0aa99e254bd17fefb1c67c0c0d39300043a243a4aa633b6cecc/68747470733a2f2f6170692e70726f6475637468756e742e636f6d2f776964676574732f656d6265642d696d6167652f76312f746f702d706f73742d62616467652e7376673f706f73745f69643d31303739323632267468656d653d6c6967687426706572696f643d6461696c7926743d31373731373439313730383638"&gt;&lt;/a&gt;
&amp;nbsp;&lt;/p&gt;
&lt;br&gt;
&lt;a href="https://discord.gg/sGdnKwB3qq" rel="nofollow noopener noreferrer"&gt;
  &lt;img alt="Discord Community" src="https://camo.githubusercontent.com/b8f979318aaabc8dec512b9d4e6e2a12431fba3c8a3b8738e1a97a0722d4e4bf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d436f6d6d756e6974792d3538363546323f6c6f676f3d646973636f7264266c6162656c436f6c6f723d7768697465"&gt;
&lt;/a&gt; &lt;a href="https://goreportcard.com/report/github.com/HexmosTech/git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="Go Report Card" src="https://camo.githubusercontent.com/e74c0651c3ee9165a2ed01cb0f6842c494029960df30eb9c24cf622d3d21bf46/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f4865786d6f73546563682f6769742d6c7263"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml" rel="noopener noreferrer"&gt;&lt;img alt="gitleaks.yml" title="gitleaks.yml: Secret scanning workflow" src="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml" rel="noopener noreferrer"&gt;&lt;img alt="osv-scanner.yml" title="osv-scanner.yml: Dependency vulnerability scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml" rel="noopener noreferrer"&gt;&lt;img alt="govulncheck.yml" title="govulncheck.yml: Go vulnerability check" src="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml" rel="noopener noreferrer"&gt;&lt;img alt="semgrep.yml" title="semgrep.yml: Static analysis security scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/dependabot-enabled.svg"&gt;&lt;img alt="dependabot-enabled" title="dependabot-enabled: Automated dependency updates are enabled" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fdependabot-enabled.svg"&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/a_few_micro_reviews.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fa_few_micro_reviews.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GenAI today is a &lt;strong&gt;race car without brakes&lt;/strong&gt;. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents &lt;em&gt;silently break things&lt;/em&gt;: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. You often find out in production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;git-lrc&lt;/code&gt; is your braking system.&lt;/strong&gt; It hooks into &lt;code&gt;git commit&lt;/code&gt; and runs an AI review on every diff &lt;em&gt;before&lt;/em&gt; it lands. 60-second setup. Completely free.&lt;/p&gt;
&lt;p&gt;In short, git-lrc helps &lt;strong&gt;Prevent Outages, Breaches, and Technical Debt Before They Happen&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At a glance:&lt;/strong&gt; &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;10 risk categories&lt;/a&gt; · &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;100+ failure patterns tracked&lt;/a&gt; · every commit…&lt;/p&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Synthetic Data: The Hidden Ingredient That Made Modern LLMs Scale</title>
      <dc:creator>Shrijith Venkatramana</dc:creator>
      <pubDate>Thu, 25 Jun 2026 18:07:49 +0000</pubDate>
      <link>https://dev.to/shrsv/synthetic-data-the-hidden-ingredient-that-made-modern-llms-scale-2njm</link>
      <guid>https://dev.to/shrsv/synthetic-data-the-hidden-ingredient-that-made-modern-llms-scale-2njm</guid>
      <description>&lt;p&gt;&lt;em&gt;Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;Star Us&lt;/a&gt; to help devs discover the project. Do give it a try and share your feedback for improving the product.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Everyone knows modern AI runs on data. Fewer people realize that today's most capable AI systems increasingly learn from data they created themselves.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For years, the common belief was simple: collect more human-written text, train bigger models, and intelligence would emerge.&lt;/p&gt;

&lt;p&gt;That worked—until it didn't.&lt;/p&gt;

&lt;p&gt;By 2022, frontier AI labs had consumed a significant fraction of the publicly available high-quality text on the internet. The next leap wasn't going to come from scraping another few billion web pages.&lt;/p&gt;

&lt;p&gt;Instead, researchers turned to something different:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They started letting AI generate its own training data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Today, synthetic data powers reasoning models, coding assistants, math solvers, autonomous agents, and many of the capabilities developers now take for granted. In many cases, the most valuable dataset isn't written by humans anymore—it's produced by another AI.&lt;/p&gt;

&lt;p&gt;Let's explore how that works out.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Factory Analogy
&lt;/h1&gt;

&lt;p&gt;Imagine you're building a car factory.&lt;/p&gt;

&lt;p&gt;Initially, every part has to be handcrafted by skilled workers. Production is slow and expensive.&lt;/p&gt;

&lt;p&gt;Eventually, you build machines that manufacture car parts automatically.&lt;/p&gt;

&lt;p&gt;Now the factory spends less time making parts manually and more time building better machines that manufacture even better parts.&lt;/p&gt;

&lt;p&gt;Synthetic data works similarly.&lt;/p&gt;

&lt;p&gt;Initially:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Humans write examples&lt;/li&gt;
&lt;li&gt;Humans solve problems&lt;/li&gt;
&lt;li&gt;Humans explain reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Eventually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI generates new examples&lt;/li&gt;
&lt;li&gt;AI writes explanations&lt;/li&gt;
&lt;li&gt;AI creates harder problems&lt;/li&gt;
&lt;li&gt;AI critiques and improves its own answers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model becomes both the student &lt;strong&gt;and&lt;/strong&gt; one of the teachers.&lt;/p&gt;

&lt;h1&gt;
  
  
  This Idea Was Proven Long Before ChatGPT
&lt;/h1&gt;

&lt;p&gt;Perhaps the most famous example isn't from language models at all.&lt;/p&gt;

&lt;p&gt;In 2017, DeepMind introduced &lt;strong&gt;AlphaGo Zero&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Earlier versions of AlphaGo learned partly from millions of expert human games.&lt;/p&gt;

&lt;p&gt;AlphaGo Zero didn't.&lt;/p&gt;

&lt;p&gt;It started knowing only the rules of Go.&lt;/p&gt;

&lt;p&gt;Then it played against itself.&lt;/p&gt;

&lt;p&gt;Millions of games later, it became stronger than every previous version—and stronger than every human on Earth.&lt;/p&gt;

&lt;p&gt;No additional human demonstrations.&lt;/p&gt;

&lt;p&gt;Only synthetic experience generated through self-play.&lt;/p&gt;

&lt;p&gt;Researchers realized something profound:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Once a system becomes good enough, it can manufacture experiences that are more useful than collecting additional human ones.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That idea quietly became one of the foundations of modern AI.&lt;/p&gt;

&lt;h1&gt;
  
  
  What Does Synthetic Data Look Like for LLMs?
&lt;/h1&gt;

&lt;p&gt;Those less acquainted often imagine synthetic data as "ChatGPT writing more paragraphs."&lt;/p&gt;

&lt;p&gt;That's only a tiny piece.&lt;/p&gt;

&lt;p&gt;Modern models generate many different kinds of training data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of waiting for humans to ask questions, models invent thousands of new ones.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Human:
How do binary search trees work?

Synthetic examples:

Explain AVL trees.
Compare B-trees with BSTs.
Design a filesystem index.
Implement interval trees.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;One human question becomes hundreds of training examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reasoning Traces&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of merely generating answers:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The model produces detailed reasoning explaining &lt;em&gt;how&lt;/em&gt; it reached the answer.&lt;/p&gt;

&lt;p&gt;Those reasoning traces become valuable training material for future models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A coding model can generate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;buggy code&lt;/li&gt;
&lt;li&gt;corrected versions&lt;/li&gt;
&lt;li&gt;optimized implementations&lt;/li&gt;
&lt;li&gt;documentation&lt;/li&gt;
&lt;li&gt;test cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One programming problem becomes an entire software engineering dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Harder Problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Models can intentionally create problems just beyond their current capability.&lt;/p&gt;

&lt;p&gt;Just like a good teacher gradually increases difficulty, the dataset evolves as the model improves.&lt;/p&gt;
&lt;h1&gt;
  
  
  The Economics Are Incredible
&lt;/h1&gt;

&lt;p&gt;Suppose hiring experts costs roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$40 per hour&lt;/li&gt;
&lt;li&gt;each example takes 3 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's approximately &lt;strong&gt;$2 per example&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A million examples?&lt;/p&gt;

&lt;p&gt;Around &lt;strong&gt;$2 million&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now imagine a frontier model generates one million candidate examples overnight.&lt;/p&gt;

&lt;p&gt;Even after filtering aggressively, perhaps only 20% are worth keeping.&lt;/p&gt;

&lt;p&gt;That's still 200,000 useful examples produced in hours instead of months.&lt;/p&gt;

&lt;p&gt;Of course, GPUs aren't free.&lt;/p&gt;

&lt;p&gt;But frontier labs already own massive compute clusters.&lt;/p&gt;

&lt;p&gt;Once the infrastructure exists, generating another million examples is often dramatically cheaper—and much faster—than coordinating thousands of human annotators around the world.&lt;/p&gt;

&lt;p&gt;Synthetic data changes the limiting resource from &lt;strong&gt;human labor&lt;/strong&gt; to &lt;strong&gt;compute&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is a profound shift.&lt;/p&gt;
&lt;h1&gt;
  
  
  Why Doesn't the Model Eventually Teach Itself Nonsense?
&lt;/h1&gt;

&lt;p&gt;A natural concern is:&lt;/p&gt;

&lt;p&gt;"If AI keeps learning from AI, won't errors compound forever?"&lt;/p&gt;

&lt;p&gt;Absolutely—if done carelessly.&lt;/p&gt;

&lt;p&gt;Modern pipelines therefore look more like factories with quality control than simple generators.&lt;/p&gt;

&lt;p&gt;A typical loop is:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Generate
      ↓
Verify
      ↓
Filter
      ↓
Rank
      ↓
Train
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Only a fraction of generated examples survive.&lt;/p&gt;

&lt;p&gt;For coding tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the code compile?&lt;/li&gt;
&lt;li&gt;Do unit tests pass?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For mathematics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the answer verifiably correct?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For reasoning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do multiple models agree?&lt;/li&gt;
&lt;li&gt;Does a stronger model approve?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Synthetic data is valuable precisely because most of it gets thrown away.&lt;/p&gt;

&lt;p&gt;Quality matters far more than quantity.&lt;/p&gt;
&lt;h1&gt;
  
  
  How You can Put This Into Practice in Your Apps
&lt;/h1&gt;

&lt;p&gt;Many developers think synthetic data is something only OpenAI or DeepMind can use.&lt;/p&gt;

&lt;p&gt;Not anymore.&lt;/p&gt;

&lt;p&gt;Suppose you're building an AI assistant for SQL.&lt;/p&gt;

&lt;p&gt;Instead of manually writing 5,000 examples, you might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write 100 excellent examples.&lt;/li&gt;
&lt;li&gt;Ask a strong model to generate variations.&lt;/li&gt;
&lt;li&gt;Automatically execute the SQL.&lt;/li&gt;
&lt;li&gt;Keep only queries that execute correctly.&lt;/li&gt;
&lt;li&gt;Fine-tune on the verified dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or imagine building an AI tutor.&lt;/p&gt;

&lt;p&gt;Generate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;beginner questions&lt;/li&gt;
&lt;li&gt;intermediate questions&lt;/li&gt;
&lt;li&gt;trick questions&lt;/li&gt;
&lt;li&gt;incorrect student answers&lt;/li&gt;
&lt;li&gt;ideal explanations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One carefully designed seed dataset can grow into thousands of high-quality training examples.&lt;/p&gt;

&lt;p&gt;The bottleneck becomes designing good verification systems—not endlessly producing data by hand.&lt;/p&gt;

&lt;p&gt;That's a very different engineering problem.&lt;/p&gt;
&lt;h1&gt;
  
  
  The Bigger Picture
&lt;/h1&gt;

&lt;p&gt;Scaling laws taught us that more compute and more data generally produce better models.&lt;/p&gt;

&lt;p&gt;Synthetic data adds a fascinating twist:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The model itself becomes part of the data-generation pipeline.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of relying solely on humanity's existing knowledge, AI systems increasingly create new training experiences for future AI systems.&lt;/p&gt;

&lt;p&gt;In some domains—coding, mathematics, games, and formal reasoning—that approach is already proving remarkably effective because correctness can often be verified automatically.&lt;/p&gt;

&lt;p&gt;It is one of the reasons today's reasoning models feel dramatically more capable than models from just a few years ago.&lt;/p&gt;

&lt;p&gt;Ironically, one of the biggest breakthroughs in machine learning wasn't finding more human data.&lt;/p&gt;

&lt;p&gt;It was discovering that, with the right safeguards, machines can help create the next generation of training data themselves.&lt;/p&gt;
&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AlphaGo Zero&lt;/strong&gt; (Silver et al., 2017) — Demonstrated that superhuman performance could emerge through self-play rather than human demonstrations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-Instruct&lt;/strong&gt; (Wang et al., 2023) — Showed that language models can bootstrap instruction-following datasets by generating and filtering their own instruction-response pairs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phi-1: Textbooks Are All You Need&lt;/strong&gt; (Microsoft Research, 2023) — Demonstrated that carefully curated synthetic "textbook-quality" data can outperform much larger quantities of noisy web data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What do you think?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you had to improve an AI application today, would you spend your effort collecting more human data—or designing a better pipeline to generate and verify synthetic data? As frontier models improve, that trade-off is becoming one of the most interesting engineering decisions in AI.&lt;/p&gt;



&lt;p&gt;*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.&lt;/p&gt;

&lt;p&gt;git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*&lt;/p&gt;

&lt;p&gt;Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/HexmosTech" rel="noopener noreferrer"&gt;
        HexmosTech
      &lt;/a&gt; / &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;
        git-lrc
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Free, Micro AI Code Reviews That Run on Git Commit
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div&gt;
&lt;p&gt;| &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.da.md" rel="noopener noreferrer"&gt;🇩🇰 Dansk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.es.md" rel="noopener noreferrer"&gt;🇪🇸 Español&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fa.md" rel="noopener noreferrer"&gt;🇮🇷 Farsi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fi.md" rel="noopener noreferrer"&gt;🇫🇮 Suomi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ja.md" rel="noopener noreferrer"&gt;🇯🇵 日本語&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.nn.md" rel="noopener noreferrer"&gt;🇳🇴 Norsk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.pt.md" rel="noopener noreferrer"&gt;🇵🇹 Português&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ru.md" rel="noopener noreferrer"&gt;🇷🇺 Русский&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.sq.md" rel="noopener noreferrer"&gt;🇦🇱 Shqip&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.zh.md" rel="noopener noreferrer"&gt;🇨🇳 中文&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.hi.md" rel="noopener noreferrer"&gt;🇮🇳 हिन्दी&lt;/a&gt; |&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;img width="60" alt="git-lrc logo" src="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;git-lrc&lt;/h1&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Free, Micro AI Code Reviews That Run on Commit&lt;/h2&gt;
&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.producthunt.com/products/git-lrc?embed=true&amp;amp;utm_source=badge-top-post-badge&amp;amp;utm_medium=badge&amp;amp;utm_campaign=badge-git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="git-lrc - Free, micro AI code reviews that run on commit | Product Hunt" width="200" src="https://camo.githubusercontent.com/87bf2d4283c1e0aa99e254bd17fefb1c67c0c0d39300043a243a4aa633b6cecc/68747470733a2f2f6170692e70726f6475637468756e742e636f6d2f776964676574732f656d6265642d696d6167652f76312f746f702d706f73742d62616467652e7376673f706f73745f69643d31303739323632267468656d653d6c6967687426706572696f643d6461696c7926743d31373731373439313730383638"&gt;&lt;/a&gt;
&amp;nbsp;&lt;/p&gt;
&lt;br&gt;
&lt;a href="https://discord.gg/sGdnKwB3qq" rel="nofollow noopener noreferrer"&gt;
  &lt;img alt="Discord Community" src="https://camo.githubusercontent.com/b8f979318aaabc8dec512b9d4e6e2a12431fba3c8a3b8738e1a97a0722d4e4bf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d436f6d6d756e6974792d3538363546323f6c6f676f3d646973636f7264266c6162656c436f6c6f723d7768697465"&gt;
&lt;/a&gt; &lt;a href="https://goreportcard.com/report/github.com/HexmosTech/git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="Go Report Card" src="https://camo.githubusercontent.com/e74c0651c3ee9165a2ed01cb0f6842c494029960df30eb9c24cf622d3d21bf46/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f4865786d6f73546563682f6769742d6c7263"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml" rel="noopener noreferrer"&gt;&lt;img alt="gitleaks.yml" title="gitleaks.yml: Secret scanning workflow" src="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml" rel="noopener noreferrer"&gt;&lt;img alt="osv-scanner.yml" title="osv-scanner.yml: Dependency vulnerability scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml" rel="noopener noreferrer"&gt;&lt;img alt="govulncheck.yml" title="govulncheck.yml: Go vulnerability check" src="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml" rel="noopener noreferrer"&gt;&lt;img alt="semgrep.yml" title="semgrep.yml: Static analysis security scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/dependabot-enabled.svg"&gt;&lt;img alt="dependabot-enabled" title="dependabot-enabled: Automated dependency updates are enabled" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fdependabot-enabled.svg"&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/a_few_micro_reviews.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fa_few_micro_reviews.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GenAI today is a &lt;strong&gt;race car without brakes&lt;/strong&gt;. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents &lt;em&gt;silently break things&lt;/em&gt;: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. You often find out in production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;git-lrc&lt;/code&gt; is your braking system.&lt;/strong&gt; It hooks into &lt;code&gt;git commit&lt;/code&gt; and runs an AI review on every diff &lt;em&gt;before&lt;/em&gt; it lands. 60-second setup. Completely free.&lt;/p&gt;
&lt;p&gt;In short, git-lrc helps &lt;strong&gt;Prevent Outages, Breaches, and Technical Debt Before They Happen&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At a glance:&lt;/strong&gt; &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;10 risk categories&lt;/a&gt; · &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;100+ failure patterns tracked&lt;/a&gt; · every commit…&lt;/p&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>The Scaling Laws That Made LLMs Work</title>
      <dc:creator>Shrijith Venkatramana</dc:creator>
      <pubDate>Wed, 24 Jun 2026 18:58:52 +0000</pubDate>
      <link>https://dev.to/shrsv/the-scaling-laws-that-made-llms-work-5h5h</link>
      <guid>https://dev.to/shrsv/the-scaling-laws-that-made-llms-work-5h5h</guid>
      <description>&lt;p&gt;&lt;em&gt;Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;Star Us&lt;/a&gt; to help devs discover the project. Do give it a try and share your feedback for improving the product.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How the AI industry accidentally discovered that "just make it bigger" was one of the most important scientific findings of the decade
&lt;/h2&gt;

&lt;p&gt;In 2018, many researchers believed language models were clever toys.&lt;/p&gt;

&lt;p&gt;They could autocomplete text, generate amusing sentences, and occasionally fool people for a paragraph or two. But few expected them to become software engineers, researchers, tutors, designers, and writing assistants.&lt;/p&gt;

&lt;p&gt;Then something strange happened.&lt;/p&gt;

&lt;p&gt;Teams at OpenAI, Google, DeepMind, Anthropic and elsewhere kept increasing three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model size&lt;/li&gt;
&lt;li&gt;Training data&lt;/li&gt;
&lt;li&gt;Compute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And performance kept improving.&lt;/p&gt;

&lt;p&gt;Not linearly.&lt;/p&gt;

&lt;p&gt;Not randomly.&lt;/p&gt;

&lt;p&gt;Predictably.&lt;/p&gt;

&lt;p&gt;The shocking discovery was that intelligence-like capabilities emerged from scale itself.&lt;/p&gt;

&lt;p&gt;Today, ChatGPT, Claude, Gemini, and other frontier models exist largely because researchers discovered scaling laws—empirical mathematical relationships that revealed how performance improves as models become larger and are trained on more data.&lt;/p&gt;

&lt;p&gt;This is the story of that discovery, why it mattered, and why it changed the economics of software forever.&lt;/p&gt;

&lt;h1&gt;
  
  
  Before Scaling Laws: The Era of Clever Tricks
&lt;/h1&gt;

&lt;p&gt;For decades, AI progress often came from clever architecture changes.&lt;/p&gt;

&lt;p&gt;Researchers would invent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better feature engineering&lt;/li&gt;
&lt;li&gt;Better optimization algorithms&lt;/li&gt;
&lt;li&gt;Better linguistic rules&lt;/li&gt;
&lt;li&gt;Better neural network structures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Progress was often irregular.&lt;/p&gt;

&lt;p&gt;A breakthrough would appear.&lt;/p&gt;

&lt;p&gt;Then improvements would stall.&lt;/p&gt;

&lt;p&gt;Many people assumed future progress would continue this way.&lt;/p&gt;

&lt;p&gt;Then deep learning arrived.&lt;/p&gt;

&lt;p&gt;Researchers began noticing something unusual.&lt;/p&gt;

&lt;p&gt;A bigger neural network often outperformed a smaller one.&lt;/p&gt;

&lt;p&gt;A lot.&lt;/p&gt;

&lt;p&gt;Even when nobody fully understood why.&lt;/p&gt;

&lt;p&gt;One famous observation came from researchers at Google working on machine translation.&lt;/p&gt;

&lt;p&gt;Instead of hand-crafting linguistic rules, larger neural networks trained on larger datasets simply worked better.&lt;/p&gt;

&lt;p&gt;The trend kept repeating.&lt;/p&gt;

&lt;h1&gt;
  
  
  The AlexNet Lesson: Compute Can Beat Cleverness
&lt;/h1&gt;

&lt;p&gt;A key moment occurred in 2012.&lt;/p&gt;

&lt;p&gt;At the ImageNet competition, a neural network called AlexNet built by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton dramatically outperformed competitors.&lt;/p&gt;

&lt;p&gt;The architecture was important.&lt;/p&gt;

&lt;p&gt;But equally important was something less glamorous:&lt;/p&gt;

&lt;p&gt;They used GPUs.&lt;/p&gt;

&lt;p&gt;Lots of compute.&lt;/p&gt;

&lt;p&gt;The lesson was subtle but profound:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;More computation could unlock capabilities that smaller systems never exhibited.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This idea would later become the foundation of modern LLM development.&lt;/p&gt;

&lt;h1&gt;
  
  
  The OpenAI Scaling Laws Paper That Changed Everything
&lt;/h1&gt;

&lt;p&gt;In 2020, researchers at &lt;a href="https://openai.com?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; published a landmark paper:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Scaling Laws for Neural Language Models"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Authored by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jared Kaplan&lt;/li&gt;
&lt;li&gt;Sam McCandlish&lt;/li&gt;
&lt;li&gt;and colleagues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The paper reported a surprising result.&lt;/p&gt;

&lt;p&gt;Language model loss followed a smooth power-law relationship with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parameter count&lt;/li&gt;
&lt;li&gt;Dataset size&lt;/li&gt;
&lt;li&gt;Training compute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of hitting obvious plateaus, performance improved according to remarkably predictable mathematical curves.&lt;/p&gt;

&lt;p&gt;The researchers found relationships resembling:&lt;/p&gt;

&lt;p&gt;L(N) is proportional to N^(-alpha)&lt;/p&gt;

&lt;p&gt;where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;L = loss&lt;/li&gt;
&lt;li&gt;N = number of parameters&lt;/li&gt;
&lt;li&gt;α = scaling exponent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The exact constants differed across experiments, but the important insight was this:&lt;/p&gt;

&lt;p&gt;Every additional order of magnitude in scale delivered measurable gains.&lt;/p&gt;

&lt;p&gt;No magic tricks required.&lt;/p&gt;

&lt;p&gt;No fundamentally new algorithms required.&lt;/p&gt;

&lt;p&gt;Just scale.&lt;/p&gt;

&lt;p&gt;This result was shocking because many researchers expected diminishing returns to arrive much sooner.&lt;/p&gt;

&lt;p&gt;Instead, the curves kept going.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Back-of-the-Envelope Economics
&lt;/h1&gt;

&lt;p&gt;Let's build intuition.&lt;/p&gt;

&lt;p&gt;Imagine a model with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 billion parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suppose increasing it to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 billion parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;reduces error by a meaningful amount.&lt;/p&gt;

&lt;p&gt;Then increasing to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100 billion parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;reduces error again.&lt;/p&gt;

&lt;p&gt;Each improvement costs vastly more compute.&lt;/p&gt;

&lt;p&gt;But here's the key:&lt;/p&gt;

&lt;p&gt;For large organizations, even small quality improvements are worth enormous amounts of money.&lt;/p&gt;

&lt;p&gt;Consider search engines.&lt;/p&gt;

&lt;p&gt;If improving answer quality by 1% generates hundreds of millions of dollars in user value, spending tens of millions on training becomes rational.&lt;/p&gt;

&lt;p&gt;The economics start resembling semiconductor manufacturing.&lt;/p&gt;

&lt;p&gt;The biggest players can afford massive upfront investment because performance gains compound downstream.&lt;/p&gt;

&lt;p&gt;This is one reason frontier AI rapidly became a contest among organizations with access to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Massive capital&lt;/li&gt;
&lt;li&gt;GPU clusters&lt;/li&gt;
&lt;li&gt;Infrastructure expertise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scaling laws transformed AI from a pure research problem into an industrial production problem.&lt;/p&gt;

&lt;h1&gt;
  
  
  Chinchilla: The Industry Discovers It Was Scaling Wrong
&lt;/h1&gt;

&lt;p&gt;Then another surprise arrived.&lt;/p&gt;

&lt;p&gt;In 2022, researchers at DeepMind published the famous Chinchilla paper "Training Compute-Optimal Large Language Models," led by Jordan Hoffmann.&lt;/p&gt;

&lt;p&gt;The team discovered something important.&lt;/p&gt;

&lt;p&gt;Many models were too large relative to the amount of training data they consumed.&lt;/p&gt;

&lt;p&gt;The industry had been spending enormous compute training gigantic models that were under-trained.&lt;/p&gt;

&lt;p&gt;Chinchilla showed that for a fixed compute budget, better performance often comes from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller models&lt;/li&gt;
&lt;li&gt;More tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;rather than:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Larger models&lt;/li&gt;
&lt;li&gt;Fewer tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result fundamentally changed training strategies across the industry.&lt;/p&gt;

&lt;p&gt;Many later frontier models incorporated lessons from Chinchilla-style compute-optimal training.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why Emergent Abilities Appeared
&lt;/h1&gt;

&lt;p&gt;One of the most fascinating observations came from large language models exhibiting capabilities not visible in smaller versions.&lt;/p&gt;

&lt;p&gt;Examples included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-step reasoning&lt;/li&gt;
&lt;li&gt;Code generation&lt;/li&gt;
&lt;li&gt;Translation&lt;/li&gt;
&lt;li&gt;Few-shot learning&lt;/li&gt;
&lt;li&gt;Tool usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A small model might completely fail a task.&lt;/p&gt;

&lt;p&gt;A larger version suddenly succeeds.&lt;/p&gt;

&lt;p&gt;Researchers called these behaviors &lt;strong&gt;emergent abilities&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The exact mechanisms remain debated.&lt;/p&gt;

&lt;p&gt;However, scaling laws provided an important clue.&lt;/p&gt;

&lt;p&gt;If performance improves smoothly on underlying representations, task-level capabilities may appear abrupt only because evaluation thresholds are discrete.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;45% accuracy feels useless&lt;/li&gt;
&lt;li&gt;55% accuracy feels useful&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A small continuous improvement underneath can create a seemingly sudden jump in usefulness.&lt;/p&gt;

&lt;p&gt;This observation continues to influence modern research into reasoning models.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Operations Story Matters As Well
&lt;/h1&gt;

&lt;p&gt;The public often imagines AI breakthroughs occurring through genius insights alone.&lt;/p&gt;

&lt;p&gt;The reality is much messier.&lt;/p&gt;

&lt;p&gt;Scaling laws forced organizations to become experts in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed systems&lt;/li&gt;
&lt;li&gt;Networking&lt;/li&gt;
&lt;li&gt;Data pipelines&lt;/li&gt;
&lt;li&gt;Storage systems&lt;/li&gt;
&lt;li&gt;Cluster scheduling&lt;/li&gt;
&lt;li&gt;GPU utilization&lt;/li&gt;
&lt;li&gt;Fault tolerance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Training frontier models became one of the largest computing operations ever attempted.&lt;/p&gt;

&lt;p&gt;Modern training runs can involve tens of thousands of GPUs operating simultaneously.&lt;/p&gt;

&lt;p&gt;At that scale, hardware failures become routine.&lt;/p&gt;

&lt;p&gt;Engineers must design systems assuming components will constantly fail.&lt;/p&gt;

&lt;p&gt;Ironically, many advances enabling modern AI came not from machine learning itself but from classical systems engineering.&lt;/p&gt;

&lt;p&gt;The people building the training infrastructure often look more like distributed systems engineers than traditional AI researchers.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why Developers Should Care
&lt;/h1&gt;

&lt;p&gt;Scaling laws explain why capabilities keep arriving that seem surprising.&lt;/p&gt;

&lt;p&gt;Many developers ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"How did models suddenly become good at coding?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer is often less mysterious than it appears.&lt;/p&gt;

&lt;p&gt;A large portion of progress comes from moving further along predictable scaling curves.&lt;/p&gt;

&lt;p&gt;More compute.&lt;/p&gt;

&lt;p&gt;More data.&lt;/p&gt;

&lt;p&gt;More parameters.&lt;/p&gt;

&lt;p&gt;Better optimization.&lt;/p&gt;

&lt;p&gt;The resulting improvements accumulate until tasks become economically useful.&lt;/p&gt;

&lt;p&gt;This perspective is valuable because it reframes AI progress.&lt;/p&gt;

&lt;p&gt;Instead of viewing each new model as a miracle, we can see many advances as the expected outcome of operating larger and more efficient training systems.&lt;/p&gt;

&lt;p&gt;The future may contain architectural breakthroughs.&lt;/p&gt;

&lt;p&gt;But one lesson from the past decade is difficult to ignore:&lt;/p&gt;

&lt;p&gt;Scale itself turned out to be one of the most important algorithms.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;One of the great scientific surprises of modern AI is that intelligence-like capabilities did not emerge primarily from increasingly clever hand-designed systems.&lt;/p&gt;

&lt;p&gt;They emerged from discovering a predictable relationship between computation and capability.&lt;/p&gt;

&lt;p&gt;The researchers who uncovered scaling laws effectively found a map.&lt;/p&gt;

&lt;p&gt;That map allowed organizations to forecast future performance before spending billions of dollars building larger systems.&lt;/p&gt;

&lt;p&gt;Few discoveries have reshaped an industry so quickly.&lt;/p&gt;

&lt;p&gt;The next time a new language model appears with capabilities that seem impossibly better than its predecessor, it is worth remembering:&lt;/p&gt;

&lt;p&gt;The improvement may not be magic.&lt;/p&gt;

&lt;p&gt;It may simply be another point on a scaling curve that researchers have been following for years.&lt;/p&gt;

&lt;p&gt;If scaling laws continue holding for another decade, do you think future breakthroughs will come primarily from &lt;strong&gt;more compute&lt;/strong&gt;, &lt;strong&gt;better architectures&lt;/strong&gt;, or &lt;strong&gt;entirely new paradigms beyond today's transformers&lt;/strong&gt;?&lt;/p&gt;




&lt;p&gt;*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.&lt;/p&gt;

&lt;p&gt;git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*&lt;/p&gt;

&lt;p&gt;Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/HexmosTech" rel="noopener noreferrer"&gt;
        HexmosTech
      &lt;/a&gt; / &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;
        git-lrc
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Free, Micro AI Code Reviews That Run on Git Commit
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div&gt;
&lt;p&gt;| &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.da.md" rel="noopener noreferrer"&gt;🇩🇰 Dansk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.es.md" rel="noopener noreferrer"&gt;🇪🇸 Español&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fa.md" rel="noopener noreferrer"&gt;🇮🇷 Farsi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fi.md" rel="noopener noreferrer"&gt;🇫🇮 Suomi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ja.md" rel="noopener noreferrer"&gt;🇯🇵 日本語&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.nn.md" rel="noopener noreferrer"&gt;🇳🇴 Norsk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.pt.md" rel="noopener noreferrer"&gt;🇵🇹 Português&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ru.md" rel="noopener noreferrer"&gt;🇷🇺 Русский&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.sq.md" rel="noopener noreferrer"&gt;🇦🇱 Shqip&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.zh.md" rel="noopener noreferrer"&gt;🇨🇳 中文&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.hi.md" rel="noopener noreferrer"&gt;🇮🇳 हिन्दी&lt;/a&gt; |&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;img width="60" alt="git-lrc logo" src="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;git-lrc&lt;/h1&gt;
&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Free, Micro AI Code Reviews That Run on Commit&lt;/h2&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;p&gt;&lt;a href="https://www.producthunt.com/products/git-lrc?embed=true&amp;amp;utm_source=badge-top-post-badge&amp;amp;utm_medium=badge&amp;amp;utm_campaign=badge-git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="git-lrc - Free, micro AI code reviews that run on commit | Product Hunt" width="200" src="https://camo.githubusercontent.com/87bf2d4283c1e0aa99e254bd17fefb1c67c0c0d39300043a243a4aa633b6cecc/68747470733a2f2f6170692e70726f6475637468756e742e636f6d2f776964676574732f656d6265642d696d6167652f76312f746f702d706f73742d62616467652e7376673f706f73745f69643d31303739323632267468656d653d6c6967687426706572696f643d6461696c7926743d31373731373439313730383638"&gt;&lt;/a&gt;
&amp;nbsp;&lt;/p&gt;
&lt;br&gt;
&lt;a href="https://discord.gg/sGdnKwB3qq" rel="nofollow noopener noreferrer"&gt;
  &lt;img alt="Discord Community" src="https://camo.githubusercontent.com/b8f979318aaabc8dec512b9d4e6e2a12431fba3c8a3b8738e1a97a0722d4e4bf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d436f6d6d756e6974792d3538363546323f6c6f676f3d646973636f7264266c6162656c436f6c6f723d7768697465"&gt;
&lt;/a&gt; &lt;a href="https://goreportcard.com/report/github.com/HexmosTech/git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="Go Report Card" src="https://camo.githubusercontent.com/e74c0651c3ee9165a2ed01cb0f6842c494029960df30eb9c24cf622d3d21bf46/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f4865786d6f73546563682f6769742d6c7263"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml" rel="noopener noreferrer"&gt;&lt;img alt="gitleaks.yml" title="gitleaks.yml: Secret scanning workflow" src="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml" rel="noopener noreferrer"&gt;&lt;img alt="osv-scanner.yml" title="osv-scanner.yml: Dependency vulnerability scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml" rel="noopener noreferrer"&gt;&lt;img alt="govulncheck.yml" title="govulncheck.yml: Go vulnerability check" src="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml" rel="noopener noreferrer"&gt;&lt;img alt="semgrep.yml" title="semgrep.yml: Static analysis security scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/dependabot-enabled.svg"&gt;&lt;img alt="dependabot-enabled" title="dependabot-enabled: Automated dependency updates are enabled" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fdependabot-enabled.svg"&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/a_few_micro_reviews.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fa_few_micro_reviews.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GenAI today is a &lt;strong&gt;race car without brakes&lt;/strong&gt;. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents &lt;em&gt;silently break things&lt;/em&gt;: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. You often find out in production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;git-lrc&lt;/code&gt; is your braking system.&lt;/strong&gt; It hooks into &lt;code&gt;git commit&lt;/code&gt; and runs an AI review on every diff &lt;em&gt;before&lt;/em&gt; it lands. 60-second setup. Completely free.&lt;/p&gt;
&lt;p&gt;In short, git-lrc helps &lt;strong&gt;Prevent Outages, Breaches, and Technical Debt Before They Happen&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At a glance:&lt;/strong&gt; &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;10 risk categories&lt;/a&gt; · &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;100+ failure patterns tracked&lt;/a&gt; · every commit…&lt;/p&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Reinforcement Learning with Verifiable Rewards: Why AI is Learning to Grade Its Own Homework</title>
      <dc:creator>Shrijith Venkatramana</dc:creator>
      <pubDate>Tue, 23 Jun 2026 19:56:18 +0000</pubDate>
      <link>https://dev.to/shrsv/reinforcement-learning-with-verifiable-rewards-why-ai-is-learning-to-grade-its-own-homework-4oif</link>
      <guid>https://dev.to/shrsv/reinforcement-learning-with-verifiable-rewards-why-ai-is-learning-to-grade-its-own-homework-4oif</guid>
      <description>&lt;p&gt;&lt;em&gt;Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;Star Us&lt;/a&gt; to help devs discover the project. Do give it a try and share your feedback for improving the product.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Large Language Models have gotten remarkably good at generating text.&lt;/p&gt;

&lt;p&gt;But there has always been a fundamental problem:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do you tell an AI whether its answer is actually correct?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For creative writing, opinions, brainstorming, and conversations, correctness is fuzzy. Human feedback is usually required.&lt;/p&gt;

&lt;p&gt;But what about problems where correctness can be objectively verified?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the code passing the tests?&lt;/li&gt;
&lt;li&gt;Did the SQL query return the expected result?&lt;/li&gt;
&lt;li&gt;Does the mathematical proof produce the correct answer?&lt;/li&gt;
&lt;li&gt;Does the generated webpage match the specification?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This simple observation is driving one of the most interesting developments in modern AI training:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reinforcement Learning with Verifiable Rewards (RLVR).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of asking humans to score every answer, we let reality score it.&lt;/p&gt;

&lt;p&gt;And that changes everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Traditional RLHF Approach
&lt;/h2&gt;

&lt;p&gt;Most developers have heard of Reinforcement Learning from Human Feedback (RLHF).&lt;/p&gt;

&lt;p&gt;The basic process looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Model generates answers.&lt;/li&gt;
&lt;li&gt;Humans rank the answers.&lt;/li&gt;
&lt;li&gt;A reward model learns human preferences.&lt;/li&gt;
&lt;li&gt;Reinforcement learning optimizes against that reward model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Conceptually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Question
    ↓
Model Output
    ↓
Human Evaluation
    ↓
Reward Signal
    ↓
Model Improvement
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This worked well for making models more helpful, harmless, and conversational.&lt;/p&gt;

&lt;p&gt;But it has a major limitation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Humans are expensive.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You need thousands or millions of human judgments.&lt;/p&gt;

&lt;p&gt;Even worse, humans often disagree.&lt;/p&gt;

&lt;p&gt;Ask ten programmers whether a piece of code is elegant and you'll get eleven opinions.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Key Insight: Some Tasks Are Self-Verifying
&lt;/h2&gt;

&lt;p&gt;Now imagine a different task:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Write a Python function that reverses a linked list.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You don't necessarily need a human reviewer.&lt;/p&gt;

&lt;p&gt;You can simply run:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If all tests pass:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reward = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If tests fail:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reward = 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The reward becomes objective.&lt;/p&gt;

&lt;p&gt;This is the central idea behind RLVR.&lt;/p&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Does a human like this answer?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;we ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Can we verify this answer automatically?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Whenever verification is possible, reward generation becomes dramatically cheaper and more scalable.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why This Works So Well for Coding
&lt;/h2&gt;

&lt;p&gt;Coding is one of the most natural domains for RLVR.&lt;/p&gt;

&lt;p&gt;Consider a coding benchmark:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input:
Implement binary search.

Output:
Generated code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Verification is straightforward:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;run_tests&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;binary_search&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;passes for all test cases, the model receives a high reward.&lt;/p&gt;

&lt;p&gt;Otherwise it receives a low reward.&lt;/p&gt;

&lt;p&gt;The model gradually learns patterns that lead to successful execution.&lt;/p&gt;

&lt;p&gt;Over millions of examples, it begins discovering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better debugging strategies&lt;/li&gt;
&lt;li&gt;Better decomposition strategies&lt;/li&gt;
&lt;li&gt;Better reasoning chains&lt;/li&gt;
&lt;li&gt;Better code structures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;without needing humans to manually inspect every solution.&lt;/p&gt;

&lt;p&gt;This is one reason coding models have improved so rapidly in recent years.&lt;/p&gt;
&lt;h2&gt;
  
  
  Beyond Coding: Mathematics
&lt;/h2&gt;

&lt;p&gt;Mathematics is another ideal RLVR environment.&lt;/p&gt;

&lt;p&gt;Suppose the task is:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Solve:
127 × 348
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The final answer can be checked automatically.&lt;/p&gt;

&lt;p&gt;Even more interesting:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Find x:
2x + 5 = 17
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Verification is trivial:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Substitute x
Check equation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Correct answer?&lt;/p&gt;

&lt;p&gt;Reward = 1.&lt;/p&gt;

&lt;p&gt;Incorrect answer?&lt;/p&gt;

&lt;p&gt;Reward = 0.&lt;/p&gt;

&lt;p&gt;This allows models to practice enormous numbers of mathematical problems without requiring armies of human annotators.&lt;/p&gt;

&lt;p&gt;Many recent reasoning-focused models have benefited heavily from this kind of training.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Is Actually Being Optimized?
&lt;/h2&gt;

&lt;p&gt;Under the hood, RLVR still looks like reinforcement learning.&lt;/p&gt;

&lt;p&gt;The model generates a solution:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;State → Action → Outcome
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The difference is the source of the reward.&lt;/p&gt;

&lt;p&gt;Traditional RLHF:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reward = Human Preference
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;RLVR:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reward = Verifiable Correctness
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A simplified objective looks like:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;maximize E[reward]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;where reward comes from an automated verifier.&lt;/p&gt;

&lt;p&gt;The verifier might be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unit tests&lt;/li&gt;
&lt;li&gt;Mathematical checking&lt;/li&gt;
&lt;li&gt;Compilation success&lt;/li&gt;
&lt;li&gt;Benchmark execution&lt;/li&gt;
&lt;li&gt;Formal proof validation&lt;/li&gt;
&lt;li&gt;Simulation outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is effectively searching for behaviors that maximize success rates.&lt;/p&gt;
&lt;h2&gt;
  
  
  An Example: Training a Coding Model
&lt;/h2&gt;

&lt;p&gt;Imagine training an AI on algorithmic problems.&lt;/p&gt;

&lt;p&gt;For each problem:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Problem
 ↓
Model generates solution
 ↓
Compile
 ↓
Run tests
 ↓
Assign reward
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;factorial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Tests:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;factorial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Fails.&lt;/p&gt;

&lt;p&gt;Reward:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The model tries another approach:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;factorial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;factorial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Tests pass.&lt;/p&gt;

&lt;p&gt;Reward:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Over time, reinforcement learning shifts probability mass toward successful behaviors.&lt;/p&gt;

&lt;p&gt;The model isn't memorizing solutions.&lt;/p&gt;

&lt;p&gt;It's learning patterns associated with success.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Hidden Superpower: Scaling Rewards
&lt;/h2&gt;

&lt;p&gt;The most important consequence of RLVR is not accuracy.&lt;/p&gt;

&lt;p&gt;It's scalability.&lt;/p&gt;

&lt;p&gt;Suppose you want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 million training examples&lt;/li&gt;
&lt;li&gt;100 million training examples&lt;/li&gt;
&lt;li&gt;1 billion training examples&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Human evaluation becomes impossible.&lt;/p&gt;

&lt;p&gt;Automated verification remains feasible.&lt;/p&gt;

&lt;p&gt;Once a verifier exists, reward generation can scale almost indefinitely.&lt;/p&gt;

&lt;p&gt;This transforms the economics of model training.&lt;/p&gt;

&lt;p&gt;Instead of hiring more evaluators, you simply generate more problems and run more verifications.&lt;/p&gt;

&lt;p&gt;Many researchers believe this is one of the major reasons reasoning and coding models have improved so quickly over the last few years.&lt;/p&gt;
&lt;h2&gt;
  
  
  Limitations and Open Problems
&lt;/h2&gt;

&lt;p&gt;RLVR is powerful, but it is not universal.&lt;/p&gt;

&lt;p&gt;Many important tasks lack objective verification.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing a compelling novel&lt;/li&gt;
&lt;li&gt;Designing a great product strategy&lt;/li&gt;
&lt;li&gt;Creating a persuasive marketing campaign&lt;/li&gt;
&lt;li&gt;Conducting a nuanced negotiation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these domains, correctness is subjective.&lt;/p&gt;

&lt;p&gt;Human judgment remains necessary.&lt;/p&gt;

&lt;p&gt;Another challenge is reward hacking.&lt;/p&gt;

&lt;p&gt;If a model discovers shortcuts that exploit the verifier rather than solving the underlying problem, training can become misleading.&lt;/p&gt;

&lt;p&gt;The verifier itself must be robust.&lt;/p&gt;

&lt;p&gt;In practice, designing good reward functions is often harder than training the model.&lt;/p&gt;
&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;For years, the AI community focused on teaching models through human preferences.&lt;/p&gt;

&lt;p&gt;RLVR introduces a different idea:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Whenever reality can verify an answer, let reality provide the reward.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For coding, mathematics, theorem proving, scientific reasoning, and other objective domains, this approach dramatically reduces the need for human supervision while enabling massive training scale.&lt;/p&gt;

&lt;p&gt;The result is a new generation of models that aren't just learning from people.&lt;/p&gt;

&lt;p&gt;They're learning from whether their outputs actually work.&lt;/p&gt;

&lt;p&gt;And that may be one of the most important shifts in modern AI training.&lt;/p&gt;

&lt;p&gt;If you were training an AI for your domain, what would serve as the verifier? Unit tests, simulations, customer metrics, formal proofs, or something else entirely?&lt;/p&gt;



&lt;p&gt;*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.&lt;/p&gt;

&lt;p&gt;git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*&lt;/p&gt;

&lt;p&gt;Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/HexmosTech" rel="noopener noreferrer"&gt;
        HexmosTech
      &lt;/a&gt; / &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;
        git-lrc
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Free, Micro AI Code Reviews That Run on Git Commit
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div&gt;
&lt;p&gt;| &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.da.md" rel="noopener noreferrer"&gt;🇩🇰 Dansk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.es.md" rel="noopener noreferrer"&gt;🇪🇸 Español&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fa.md" rel="noopener noreferrer"&gt;🇮🇷 Farsi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fi.md" rel="noopener noreferrer"&gt;🇫🇮 Suomi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ja.md" rel="noopener noreferrer"&gt;🇯🇵 日本語&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.nn.md" rel="noopener noreferrer"&gt;🇳🇴 Norsk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.pt.md" rel="noopener noreferrer"&gt;🇵🇹 Português&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ru.md" rel="noopener noreferrer"&gt;🇷🇺 Русский&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.sq.md" rel="noopener noreferrer"&gt;🇦🇱 Shqip&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.zh.md" rel="noopener noreferrer"&gt;🇨🇳 中文&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.hi.md" rel="noopener noreferrer"&gt;🇮🇳 हिन्दी&lt;/a&gt; |&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;img width="60" alt="git-lrc logo" src="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;git-lrc&lt;/h1&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Free, Micro AI Code Reviews That Run on Commit&lt;/h2&gt;
&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.producthunt.com/products/git-lrc?embed=true&amp;amp;utm_source=badge-top-post-badge&amp;amp;utm_medium=badge&amp;amp;utm_campaign=badge-git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="git-lrc - Free, micro AI code reviews that run on commit | Product Hunt" width="200" src="https://camo.githubusercontent.com/87bf2d4283c1e0aa99e254bd17fefb1c67c0c0d39300043a243a4aa633b6cecc/68747470733a2f2f6170692e70726f6475637468756e742e636f6d2f776964676574732f656d6265642d696d6167652f76312f746f702d706f73742d62616467652e7376673f706f73745f69643d31303739323632267468656d653d6c6967687426706572696f643d6461696c7926743d31373731373439313730383638"&gt;&lt;/a&gt;
&amp;nbsp;&lt;/p&gt;
&lt;br&gt;
&lt;a href="https://discord.gg/sGdnKwB3qq" rel="nofollow noopener noreferrer"&gt;
  &lt;img alt="Discord Community" src="https://camo.githubusercontent.com/b8f979318aaabc8dec512b9d4e6e2a12431fba3c8a3b8738e1a97a0722d4e4bf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d436f6d6d756e6974792d3538363546323f6c6f676f3d646973636f7264266c6162656c436f6c6f723d7768697465"&gt;
&lt;/a&gt; &lt;a href="https://goreportcard.com/report/github.com/HexmosTech/git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="Go Report Card" src="https://camo.githubusercontent.com/e74c0651c3ee9165a2ed01cb0f6842c494029960df30eb9c24cf622d3d21bf46/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f4865786d6f73546563682f6769742d6c7263"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml" rel="noopener noreferrer"&gt;&lt;img alt="gitleaks.yml" title="gitleaks.yml: Secret scanning workflow" src="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml" rel="noopener noreferrer"&gt;&lt;img alt="osv-scanner.yml" title="osv-scanner.yml: Dependency vulnerability scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml" rel="noopener noreferrer"&gt;&lt;img alt="govulncheck.yml" title="govulncheck.yml: Go vulnerability check" src="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml" rel="noopener noreferrer"&gt;&lt;img alt="semgrep.yml" title="semgrep.yml: Static analysis security scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/dependabot-enabled.svg"&gt;&lt;img alt="dependabot-enabled" title="dependabot-enabled: Automated dependency updates are enabled" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fdependabot-enabled.svg"&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/a_few_micro_reviews.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fa_few_micro_reviews.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GenAI today is a &lt;strong&gt;race car without brakes&lt;/strong&gt;. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents &lt;em&gt;silently break things&lt;/em&gt;: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. You often find out in production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;git-lrc&lt;/code&gt; is your braking system.&lt;/strong&gt; It hooks into &lt;code&gt;git commit&lt;/code&gt; and runs an AI review on every diff &lt;em&gt;before&lt;/em&gt; it lands. 60-second setup. Completely free.&lt;/p&gt;
&lt;p&gt;In short, git-lrc helps &lt;strong&gt;Prevent Outages, Breaches, and Technical Debt Before They Happen&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At a glance:&lt;/strong&gt; &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;10 risk categories&lt;/a&gt; · &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;100+ failure patterns tracked&lt;/a&gt; · every commit…&lt;/p&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>TPUs vs GPUs: How Google's Tensor Processing Units Actually Work</title>
      <dc:creator>Shrijith Venkatramana</dc:creator>
      <pubDate>Sun, 21 Jun 2026 16:44:01 +0000</pubDate>
      <link>https://dev.to/shrsv/tpus-vs-gpus-how-googles-tensor-processing-units-actually-work-c8i</link>
      <guid>https://dev.to/shrsv/tpus-vs-gpus-how-googles-tensor-processing-units-actually-work-c8i</guid>
      <description>&lt;p&gt;&lt;em&gt;Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;Star Us&lt;/a&gt; to help devs discover the project. Do give it a try and share your feedback for improving the product.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Machine learning engineers spend countless hours optimizing models, tweaking architectures, and squeezing performance out of hardware.&lt;/p&gt;

&lt;p&gt;Yet many developers who train large models today have only a vague understanding of the machines doing the actual work.&lt;/p&gt;

&lt;p&gt;Ask a developer how a GPU works, and you'll usually hear something about "lots of parallel cores."&lt;/p&gt;

&lt;p&gt;Ask how a TPU works, and the answer is often, "Google made a chip for AI."&lt;/p&gt;

&lt;p&gt;But the design differences are much more interesting than that.&lt;/p&gt;

&lt;p&gt;TPUs weren't built as faster GPUs. They were built around a different assumption: that neural networks spend most of their time performing enormous matrix multiplications. Once you accept that premise, the entire chip architecture changes.&lt;/p&gt;

&lt;p&gt;Let's explore how TPUs work, why Google built them, and where they outperform GPUs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Deep Learning Is Mostly Matrix Multiplication
&lt;/h2&gt;

&lt;p&gt;At a high level, modern neural networks are giant collections of matrix operations.&lt;/p&gt;

&lt;p&gt;Consider a simple transformer layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;X&lt;/code&gt; is the input activation matrix&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;W&lt;/code&gt; is the weight matrix&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Under the hood, this becomes millions or billions of multiply-and-add operations.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A (4096 x 4096)
×
B (4096 x 4096)
=
C (4096 x 4096)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This single operation contains over 68 billion multiply-accumulate computations.&lt;/p&gt;

&lt;p&gt;Training and inference repeatedly execute these operations.&lt;/p&gt;

&lt;p&gt;The hardware question becomes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What is the fastest possible machine for multiplying giant matrices?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;GPUs and TPUs answer this question differently.&lt;/p&gt;
&lt;h2&gt;
  
  
  How GPUs Became the First AI Accelerators
&lt;/h2&gt;

&lt;p&gt;GPUs were never originally designed for machine learning.&lt;/p&gt;

&lt;p&gt;They were built to render graphics.&lt;/p&gt;

&lt;p&gt;Rendering a video game requires performing similar operations on millions of pixels simultaneously.&lt;/p&gt;

&lt;p&gt;This naturally led GPU manufacturers to create architectures containing thousands of lightweight processing cores.&lt;/p&gt;

&lt;p&gt;A simplified GPU architecture looks like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CPU
 |
 | launches kernels
 |
GPU
 ├── Thousands of parallel cores
 ├── Shared memory
 ├── Global memory
 └── Scheduling logic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The key idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Many threads run simultaneously&lt;/li&gt;
&lt;li&gt;Each thread performs small calculations&lt;/li&gt;
&lt;li&gt;Hardware schedules work dynamically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach works extremely well for deep learning because matrix multiplication can be broken into many independent tasks.&lt;/p&gt;

&lt;p&gt;The result was almost accidental:&lt;/p&gt;

&lt;p&gt;The hardware built for gaming turned out to be excellent for neural networks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Google's Observation: GPUs Are Doing Too Much
&lt;/h2&gt;

&lt;p&gt;Around 2013–2015, Google's infrastructure was serving billions of machine learning predictions every day.&lt;/p&gt;

&lt;p&gt;Engineers noticed something important.&lt;/p&gt;

&lt;p&gt;Many GPU features were rarely used during inference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex scheduling&lt;/li&gt;
&lt;li&gt;Branch prediction&lt;/li&gt;
&lt;li&gt;General-purpose execution logic&lt;/li&gt;
&lt;li&gt;Graphics-related circuitry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These features are valuable for a broad range of workloads.&lt;/p&gt;

&lt;p&gt;But neural networks are highly predictable.&lt;/p&gt;

&lt;p&gt;Most of the work boils down to:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Multiply
Add
Multiply
Add
Multiply
Add
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Over and over.&lt;/p&gt;

&lt;p&gt;Google asked a radical question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if we remove everything that isn't useful for matrix multiplication?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer became the TPU.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Heart of a TPU: The Systolic Array
&lt;/h2&gt;

&lt;p&gt;The most important component inside a TPU is the systolic array.&lt;/p&gt;

&lt;p&gt;A systolic array is a grid of processing elements that pass data rhythmically through the chip.&lt;/p&gt;

&lt;p&gt;Imagine a matrix multiplication:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A × B = C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Instead of sending data back and forth to memory repeatedly, the TPU streams values through a grid.&lt;/p&gt;

&lt;p&gt;A simplified example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A →
[PE][PE][PE]
[PE][PE][PE]
[PE][PE][PE]
      ↓
      B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Each Processing Element (PE):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Receives values from neighbors&lt;/li&gt;
&lt;li&gt;Multiplies them&lt;/li&gt;
&lt;li&gt;Accumulates partial results&lt;/li&gt;
&lt;li&gt;Passes data onward&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The data "flows" through the chip like blood through arteries.&lt;/p&gt;

&lt;p&gt;That's where the name systolic comes from.&lt;/p&gt;

&lt;p&gt;This architecture dramatically reduces memory movement, which is often the true bottleneck in modern computing.&lt;/p&gt;

&lt;p&gt;Moving data frequently costs more energy and time than performing arithmetic.&lt;/p&gt;

&lt;p&gt;TPUs are designed around minimizing that movement.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Memory Bandwidth Matters More Than Compute
&lt;/h2&gt;

&lt;p&gt;Many developers assume AI workloads are limited by compute.&lt;/p&gt;

&lt;p&gt;In reality, large models are often limited by memory.&lt;/p&gt;

&lt;p&gt;Consider two scenarios.&lt;/p&gt;
&lt;h3&gt;
  
  
  Scenario 1
&lt;/h3&gt;

&lt;p&gt;The processor performs:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2 × 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This operation is extremely cheap.&lt;/p&gt;
&lt;h3&gt;
  
  
  Scenario 2
&lt;/h3&gt;

&lt;p&gt;The processor fetches:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2
3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;from distant memory before performing the multiplication.&lt;/p&gt;

&lt;p&gt;The memory access can cost far more than the arithmetic.&lt;/p&gt;

&lt;p&gt;As models scale, this becomes increasingly important.&lt;/p&gt;

&lt;p&gt;TPUs address this problem using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large on-chip buffers&lt;/li&gt;
&lt;li&gt;High-bandwidth memory&lt;/li&gt;
&lt;li&gt;Data reuse strategies&lt;/li&gt;
&lt;li&gt;Systolic execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Move data as little as possible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is one reason TPUs achieve impressive performance-per-watt.&lt;/p&gt;
&lt;h2&gt;
  
  
  TPU Training Pods: Scaling Beyond a Single Chip
&lt;/h2&gt;

&lt;p&gt;One TPU is powerful.&lt;/p&gt;

&lt;p&gt;A TPU Pod is where things become interesting.&lt;/p&gt;

&lt;p&gt;Google connects thousands of TPUs using specialized high-speed interconnects.&lt;/p&gt;

&lt;p&gt;Conceptually:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TPU  TPU  TPU  TPU
 |    |    |    |
TPU  TPU  TPU  TPU
 |    |    |    |
TPU  TPU  TPU  TPU
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;These chips behave almost like one giant distributed accelerator.&lt;/p&gt;

&lt;p&gt;Large language models frequently require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model parallelism&lt;/li&gt;
&lt;li&gt;Data parallelism&lt;/li&gt;
&lt;li&gt;Pipeline parallelism&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TPU Pods were designed with these workloads in mind.&lt;/p&gt;

&lt;p&gt;This is one reason many frontier-scale models have historically been trained on TPU infrastructure.&lt;/p&gt;

&lt;p&gt;The networking architecture becomes nearly as important as the chips themselves.&lt;/p&gt;


&lt;h2&gt;
  
  
  TPU vs GPU: Which Is Better?
&lt;/h2&gt;

&lt;p&gt;The answer depends on the workload.&lt;/p&gt;
&lt;h3&gt;
  
  
  GPUs excel when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Running diverse workloads&lt;/li&gt;
&lt;li&gt;Supporting many frameworks&lt;/li&gt;
&lt;li&gt;Performing graphics and AI together&lt;/li&gt;
&lt;li&gt;Requiring maximum ecosystem support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mature tooling&lt;/li&gt;
&lt;li&gt;Massive developer community&lt;/li&gt;
&lt;li&gt;Broad software compatibility&lt;/li&gt;
&lt;li&gt;Flexible execution&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  TPUs excel when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Running large-scale neural networks&lt;/li&gt;
&lt;li&gt;Using TensorFlow or JAX ecosystems&lt;/li&gt;
&lt;li&gt;Maximizing throughput&lt;/li&gt;
&lt;li&gt;Optimizing energy efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specialized matrix hardware&lt;/li&gt;
&lt;li&gt;Excellent scaling characteristics&lt;/li&gt;
&lt;li&gt;High utilization for AI workloads&lt;/li&gt;
&lt;li&gt;Lower overhead for tensor operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff is flexibility.&lt;/p&gt;

&lt;p&gt;A GPU is a powerful general-purpose parallel computer.&lt;/p&gt;

&lt;p&gt;A TPU is a highly specialized neural network machine.&lt;/p&gt;

&lt;p&gt;Think of it like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU = Swiss Army knife&lt;/li&gt;
&lt;li&gt;TPU = Industrial assembly line&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The assembly line wins if your workload matches its design.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why This Matters for Modern AI Engineers
&lt;/h2&gt;

&lt;p&gt;As models continue growing, hardware architecture is becoming a first-class concern.&lt;/p&gt;

&lt;p&gt;Ten years ago, most developers could treat hardware as a black box.&lt;/p&gt;

&lt;p&gt;Today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training costs millions of dollars&lt;/li&gt;
&lt;li&gt;Model efficiency directly impacts profitability&lt;/li&gt;
&lt;li&gt;Inference latency affects user experience&lt;/li&gt;
&lt;li&gt;Hardware choices influence architecture decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding TPUs isn't just about learning another chip.&lt;/p&gt;

&lt;p&gt;It's about understanding a broader trend:&lt;/p&gt;

&lt;p&gt;The era of general-purpose computing is giving way to increasingly specialized hardware.&lt;/p&gt;

&lt;p&gt;TPUs are one example.&lt;/p&gt;

&lt;p&gt;AI accelerators from NVIDIA, AMD, Amazon, Microsoft, Cerebras, Groq, and many others are pushing the same idea further.&lt;/p&gt;

&lt;p&gt;The future of AI may not belong to the fastest processor.&lt;/p&gt;

&lt;p&gt;It may belong to the processor whose architecture most closely matches the mathematics of machine learning.&lt;/p&gt;
&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;GPUs helped ignite the deep learning revolution because they offered massive parallelism at scale. TPUs took the next step by asking a narrower question: if neural networks mostly perform matrix multiplication, why not build hardware specifically for that task?&lt;/p&gt;

&lt;p&gt;The result was a radically different architecture centered around systolic arrays, data movement efficiency, and large-scale distributed training.&lt;/p&gt;

&lt;p&gt;As AI systems continue growing, understanding these architectural choices becomes increasingly valuable—not just for hardware engineers, but for every developer building machine learning systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you were training a large model today, would you prioritize the flexibility of GPUs or the specialization of TPUs—and why?&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.&lt;/p&gt;

&lt;p&gt;git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*&lt;/p&gt;

&lt;p&gt;Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/HexmosTech" rel="noopener noreferrer"&gt;
        HexmosTech
      &lt;/a&gt; / &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;
        git-lrc
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Free, Micro AI Code Reviews That Run on Git Commit
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div&gt;
&lt;p&gt;| &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.da.md" rel="noopener noreferrer"&gt;🇩🇰 Dansk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.es.md" rel="noopener noreferrer"&gt;🇪🇸 Español&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fa.md" rel="noopener noreferrer"&gt;🇮🇷 Farsi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fi.md" rel="noopener noreferrer"&gt;🇫🇮 Suomi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ja.md" rel="noopener noreferrer"&gt;🇯🇵 日本語&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.nn.md" rel="noopener noreferrer"&gt;🇳🇴 Norsk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.pt.md" rel="noopener noreferrer"&gt;🇵🇹 Português&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ru.md" rel="noopener noreferrer"&gt;🇷🇺 Русский&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.sq.md" rel="noopener noreferrer"&gt;🇦🇱 Shqip&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.zh.md" rel="noopener noreferrer"&gt;🇨🇳 中文&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.hi.md" rel="noopener noreferrer"&gt;🇮🇳 हिन्दी&lt;/a&gt; |&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;img width="60" alt="git-lrc logo" src="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;git-lrc&lt;/h1&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Free, Micro AI Code Reviews That Run on Commit&lt;/h2&gt;
&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.producthunt.com/products/git-lrc?embed=true&amp;amp;utm_source=badge-top-post-badge&amp;amp;utm_medium=badge&amp;amp;utm_campaign=badge-git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="git-lrc - Free, micro AI code reviews that run on commit | Product Hunt" width="200" src="https://camo.githubusercontent.com/87bf2d4283c1e0aa99e254bd17fefb1c67c0c0d39300043a243a4aa633b6cecc/68747470733a2f2f6170692e70726f6475637468756e742e636f6d2f776964676574732f656d6265642d696d6167652f76312f746f702d706f73742d62616467652e7376673f706f73745f69643d31303739323632267468656d653d6c6967687426706572696f643d6461696c7926743d31373731373439313730383638"&gt;&lt;/a&gt;
&amp;nbsp;&lt;/p&gt;
&lt;br&gt;
&lt;a href="https://discord.gg/sGdnKwB3qq" rel="nofollow noopener noreferrer"&gt;
  &lt;img alt="Discord Community" src="https://camo.githubusercontent.com/b8f979318aaabc8dec512b9d4e6e2a12431fba3c8a3b8738e1a97a0722d4e4bf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d436f6d6d756e6974792d3538363546323f6c6f676f3d646973636f7264266c6162656c436f6c6f723d7768697465"&gt;
&lt;/a&gt; &lt;a href="https://goreportcard.com/report/github.com/HexmosTech/git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="Go Report Card" src="https://camo.githubusercontent.com/e74c0651c3ee9165a2ed01cb0f6842c494029960df30eb9c24cf622d3d21bf46/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f4865786d6f73546563682f6769742d6c7263"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml" rel="noopener noreferrer"&gt;&lt;img alt="gitleaks.yml" title="gitleaks.yml: Secret scanning workflow" src="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml" rel="noopener noreferrer"&gt;&lt;img alt="osv-scanner.yml" title="osv-scanner.yml: Dependency vulnerability scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml" rel="noopener noreferrer"&gt;&lt;img alt="govulncheck.yml" title="govulncheck.yml: Go vulnerability check" src="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml" rel="noopener noreferrer"&gt;&lt;img alt="semgrep.yml" title="semgrep.yml: Static analysis security scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/dependabot-enabled.svg"&gt;&lt;img alt="dependabot-enabled" title="dependabot-enabled: Automated dependency updates are enabled" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fdependabot-enabled.svg"&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/a_few_micro_reviews.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fa_few_micro_reviews.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GenAI today is a &lt;strong&gt;race car without brakes&lt;/strong&gt;. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents &lt;em&gt;silently break things&lt;/em&gt;: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. You often find out in production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;git-lrc&lt;/code&gt; is your braking system.&lt;/strong&gt; It hooks into &lt;code&gt;git commit&lt;/code&gt; and runs an AI review on every diff &lt;em&gt;before&lt;/em&gt; it lands. 60-second setup. Completely free.&lt;/p&gt;
&lt;p&gt;In short, git-lrc helps &lt;strong&gt;Prevent Outages, Breaches, and Technical Debt Before They Happen&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At a glance:&lt;/strong&gt; &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;10 risk categories&lt;/a&gt; · &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;100+ failure patterns tracked&lt;/a&gt; · every commit…&lt;/p&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Multi-Step Learning Rate Schedulers in LLM Training: Why Some Teams Are Moving Beyond Cosine Decay</title>
      <dc:creator>Shrijith Venkatramana</dc:creator>
      <pubDate>Sat, 20 Jun 2026 11:42:42 +0000</pubDate>
      <link>https://dev.to/shrsv/multi-step-learning-rate-schedulers-in-llm-training-why-some-teams-are-moving-beyond-cosine-decay-3nef</link>
      <guid>https://dev.to/shrsv/multi-step-learning-rate-schedulers-in-llm-training-why-some-teams-are-moving-beyond-cosine-decay-3nef</guid>
      <description>&lt;p&gt;&lt;em&gt;Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;Star Us&lt;/a&gt; to help devs discover the project. Do give it a try and share your feedback for improving the product.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Training modern Large Language Models is expensive.&lt;/p&gt;

&lt;p&gt;When a single training run can consume millions of GPU hours, even small optimization decisions become important. Most developers focus on model architecture, dataset quality, and scaling laws. Yet one of the most influential knobs in training is surprisingly simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How should the learning rate change over time?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For years, cosine decay has been the default answer. But many recent LLM projects have quietly adopted an alternative: the &lt;strong&gt;multi-step learning rate scheduler&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What's interesting is that the reason isn't necessarily better final accuracy.&lt;/p&gt;

&lt;p&gt;It's because multi-step schedules make something else much easier: &lt;strong&gt;continuing training later without wasting previous compute.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's explore why.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Learning Rate Is the Model's Step Size
&lt;/h2&gt;

&lt;p&gt;A neural network learns by repeatedly adjusting its parameters.&lt;/p&gt;

&lt;p&gt;The learning rate controls the size of those adjustments.&lt;/p&gt;

&lt;p&gt;Imagine hiking toward a destination in dense fog:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Too large a step → you overshoot repeatedly.&lt;/li&gt;
&lt;li&gt;Too small a step → progress becomes painfully slow.&lt;/li&gt;
&lt;li&gt;A well-chosen step size gets you there efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During LLM training, we rarely keep the learning rate constant.&lt;/p&gt;

&lt;p&gt;Instead, we start with a relatively large learning rate to make rapid progress and gradually reduce it as training converges.&lt;/p&gt;

&lt;p&gt;The scheduler determines &lt;em&gt;how&lt;/em&gt; that reduction happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Popular Choice: Cosine Decay
&lt;/h2&gt;

&lt;p&gt;The most common scheduler in modern LLM training is cosine decay.&lt;/p&gt;

&lt;p&gt;The learning rate follows a smooth curve:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fa64nh0ocavo96rjwrstb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fa64nh0ocavo96rjwrstb.png" alt="learning rate 1" width="251" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the beginning, the learning rate is high.&lt;/p&gt;

&lt;p&gt;As training progresses, it smoothly decreases until reaching a very small value near the end.&lt;/p&gt;

&lt;p&gt;Visually, it looks like a gently descending hill.&lt;/p&gt;

&lt;p&gt;Why is cosine decay popular?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple&lt;/li&gt;
&lt;li&gt;Stable&lt;/li&gt;
&lt;li&gt;Works across many model sizes&lt;/li&gt;
&lt;li&gt;Requires little tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For years, it became the default choice for transformer training.&lt;/p&gt;

&lt;p&gt;However, it has an important limitation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Cosine Decay in Continual Training
&lt;/h2&gt;

&lt;p&gt;Suppose your original plan was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train for 1 trillion tokens&lt;/li&gt;
&lt;li&gt;Stop&lt;/li&gt;
&lt;li&gt;Evaluate results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A month later you decide:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let's continue training for another trillion tokens.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With cosine decay, things become awkward.&lt;/p&gt;

&lt;p&gt;The scheduler assumed training would end at a specific point.&lt;/p&gt;

&lt;p&gt;By the time you reach that endpoint, the learning rate has already decayed close to zero.&lt;/p&gt;

&lt;p&gt;Extending training now requires redesigning the schedule.&lt;/p&gt;

&lt;p&gt;You must decide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restart the scheduler?&lt;/li&gt;
&lt;li&gt;Stretch the curve?&lt;/li&gt;
&lt;li&gt;Create a new decay function?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each choice changes optimization behavior.&lt;/p&gt;

&lt;p&gt;This becomes increasingly inconvenient for organizations that frequently expand training runs as new compute becomes available.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter Multi-Step Learning Rate Scheduling
&lt;/h2&gt;

&lt;p&gt;A multi-step scheduler divides training into distinct phases.&lt;/p&gt;

&lt;p&gt;Instead of continuously decreasing the learning rate, it stays constant for long periods and then drops abruptly.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Training Progress&lt;/th&gt;
&lt;th&gt;Learning Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0% - 80%&lt;/td&gt;
&lt;td&gt;1.0×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;80% - 90%&lt;/td&gt;
&lt;td&gt;0.1×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;90% - 100%&lt;/td&gt;
&lt;td&gt;0.01×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Rather than a smooth curve, the graph resembles a staircase.&lt;/p&gt;

&lt;p&gt;The learning rate remains fixed during a stage and changes only at predefined milestones.&lt;/p&gt;

&lt;p&gt;Conceptually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stage 1: High LR
───────────────

Stage 2: Lower LR
        ─────────

Stage 3: Very Low LR
                 ───
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Many recent LLM efforts use schedules resembling an &lt;strong&gt;80% / 10% / 10%&lt;/strong&gt; distribution.&lt;/p&gt;

&lt;p&gt;Most computation happens in the first phase, while later phases act as refinement stages.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Multi-Step Schedulers Work Surprisingly Well
&lt;/h2&gt;

&lt;p&gt;At first glance, abrupt drops seem less elegant than smooth cosine decay.&lt;/p&gt;

&lt;p&gt;Yet empirical results often show something surprising:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final model quality is frequently very similar.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because optimization is dominated by the large early stage.&lt;/p&gt;

&lt;p&gt;Most useful learning occurs when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The learning rate is relatively high&lt;/li&gt;
&lt;li&gt;The model is far from convergence&lt;/li&gt;
&lt;li&gt;Large parameter updates are still beneficial&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The later stages mainly fine-tune the model.&lt;/p&gt;

&lt;p&gt;Whether the transition between stages is smooth or abrupt often matters less than developers expect.&lt;/p&gt;

&lt;p&gt;In practice, many teams observe nearly identical downstream performance between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cosine schedules&lt;/li&gt;
&lt;li&gt;Carefully tuned multi-step schedules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes the operational advantages of multi-step scheduling very attractive.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Hidden Advantage: Reusing Training Compute
&lt;/h2&gt;

&lt;p&gt;This is where multi-step scheduling becomes especially interesting for large-scale training.&lt;/p&gt;

&lt;p&gt;Imagine training proceeds like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Tokens Trained&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stage 1&lt;/td&gt;
&lt;td&gt;800B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 2&lt;/td&gt;
&lt;td&gt;100B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 3&lt;/td&gt;
&lt;td&gt;100B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now suppose additional funding or GPU capacity appears.&lt;/p&gt;

&lt;p&gt;Instead of redesigning the entire learning-rate trajectory, you can simply:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extend Stage 1&lt;/li&gt;
&lt;li&gt;Continue training at the same learning rate&lt;/li&gt;
&lt;li&gt;Delay later stage transitions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The optimization process remains consistent.&lt;/p&gt;

&lt;p&gt;The expensive computation already completed during Stage 1 remains fully reusable.&lt;/p&gt;

&lt;p&gt;This flexibility becomes valuable when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scaling budgets change&lt;/li&gt;
&lt;li&gt;New datasets arrive&lt;/li&gt;
&lt;li&gt;Training targets evolve&lt;/li&gt;
&lt;li&gt;Additional compute becomes available unexpectedly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For frontier model teams, this operational convenience can outweigh theoretical elegance.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementing a Multi-Step Scheduler
&lt;/h2&gt;

&lt;p&gt;A simplified PyTorch example looks like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AdamW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scheduler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lr_scheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MultiStepLR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;milestones&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;800000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;900000&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This configuration means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learning rate = 1e-4 initially&lt;/li&gt;
&lt;li&gt;At step 800,000 → drop by 10×&lt;/li&gt;
&lt;li&gt;At step 900,000 → drop by another 10×&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;LR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0 - 800k&lt;/td&gt;
&lt;td&gt;1e-4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;800k - 900k&lt;/td&gt;
&lt;td&gt;1e-5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;900k+&lt;/td&gt;
&lt;td&gt;1e-6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Real LLM training systems usually include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Warmup stages&lt;/li&gt;
&lt;li&gt;More sophisticated milestone selection&lt;/li&gt;
&lt;li&gt;Token-based rather than step-based scheduling&lt;/li&gt;
&lt;li&gt;Distributed optimizer considerations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the core idea remains the same.&lt;/p&gt;
&lt;h2&gt;
  
  
  When Should You Use Multi-Step Scheduling?
&lt;/h2&gt;

&lt;p&gt;For smaller projects, cosine decay remains an excellent default.&lt;/p&gt;

&lt;p&gt;However, multi-step scheduling becomes compelling when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training runs may be extended later&lt;/li&gt;
&lt;li&gt;Compute availability is uncertain&lt;/li&gt;
&lt;li&gt;Continual pretraining is expected&lt;/li&gt;
&lt;li&gt;Multiple training phases are planned&lt;/li&gt;
&lt;li&gt;Reusing partially completed training is important&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these environments, optimization quality may remain nearly unchanged while operational flexibility improves significantly.&lt;/p&gt;

&lt;p&gt;Sometimes the best engineering decision isn't the theoretically cleanest one.&lt;/p&gt;

&lt;p&gt;It's the one that makes future decisions easier.&lt;/p&gt;
&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Learning-rate schedulers are often treated as a minor implementation detail.&lt;/p&gt;

&lt;p&gt;Yet at LLM scale, they influence not only optimization but also the economics of training.&lt;/p&gt;

&lt;p&gt;Cosine decay offers smooth and reliable convergence. Multi-step schedules often achieve similar final performance while making continual training far easier to manage.&lt;/p&gt;

&lt;p&gt;That tradeoff explains why several modern LLM training efforts have adopted multi-step schedulers as their default strategy.&lt;/p&gt;

&lt;p&gt;As model training increasingly becomes a long-running, iterative process rather than a single fixed experiment, flexibility may matter just as much as raw optimization performance.&lt;/p&gt;



&lt;p&gt;*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.&lt;/p&gt;

&lt;p&gt;git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*&lt;/p&gt;

&lt;p&gt;Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/HexmosTech" rel="noopener noreferrer"&gt;
        HexmosTech
      &lt;/a&gt; / &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;
        git-lrc
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Free, Micro AI Code Reviews That Run on Git Commit
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div&gt;
&lt;p&gt;| &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.da.md" rel="noopener noreferrer"&gt;🇩🇰 Dansk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.es.md" rel="noopener noreferrer"&gt;🇪🇸 Español&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fa.md" rel="noopener noreferrer"&gt;🇮🇷 Farsi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fi.md" rel="noopener noreferrer"&gt;🇫🇮 Suomi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ja.md" rel="noopener noreferrer"&gt;🇯🇵 日本語&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.nn.md" rel="noopener noreferrer"&gt;🇳🇴 Norsk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.pt.md" rel="noopener noreferrer"&gt;🇵🇹 Português&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ru.md" rel="noopener noreferrer"&gt;🇷🇺 Русский&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.sq.md" rel="noopener noreferrer"&gt;🇦🇱 Shqip&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.zh.md" rel="noopener noreferrer"&gt;🇨🇳 中文&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.hi.md" rel="noopener noreferrer"&gt;🇮🇳 हिन्दी&lt;/a&gt; |&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;img width="60" alt="git-lrc logo" src="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;git-lrc&lt;/h1&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Free, Micro AI Code Reviews That Run on Commit&lt;/h2&gt;
&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.producthunt.com/products/git-lrc?embed=true&amp;amp;utm_source=badge-top-post-badge&amp;amp;utm_medium=badge&amp;amp;utm_campaign=badge-git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="git-lrc - Free, micro AI code reviews that run on commit | Product Hunt" width="200" src="https://camo.githubusercontent.com/87bf2d4283c1e0aa99e254bd17fefb1c67c0c0d39300043a243a4aa633b6cecc/68747470733a2f2f6170692e70726f6475637468756e742e636f6d2f776964676574732f656d6265642d696d6167652f76312f746f702d706f73742d62616467652e7376673f706f73745f69643d31303739323632267468656d653d6c6967687426706572696f643d6461696c7926743d31373731373439313730383638"&gt;&lt;/a&gt;
&amp;nbsp;&lt;/p&gt;
&lt;br&gt;
&lt;a href="https://discord.gg/sGdnKwB3qq" rel="nofollow noopener noreferrer"&gt;
  &lt;img alt="Discord Community" src="https://camo.githubusercontent.com/b8f979318aaabc8dec512b9d4e6e2a12431fba3c8a3b8738e1a97a0722d4e4bf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d436f6d6d756e6974792d3538363546323f6c6f676f3d646973636f7264266c6162656c436f6c6f723d7768697465"&gt;
&lt;/a&gt; &lt;a href="https://goreportcard.com/report/github.com/HexmosTech/git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="Go Report Card" src="https://camo.githubusercontent.com/e74c0651c3ee9165a2ed01cb0f6842c494029960df30eb9c24cf622d3d21bf46/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f4865786d6f73546563682f6769742d6c7263"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml" rel="noopener noreferrer"&gt;&lt;img alt="gitleaks.yml" title="gitleaks.yml: Secret scanning workflow" src="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml" rel="noopener noreferrer"&gt;&lt;img alt="osv-scanner.yml" title="osv-scanner.yml: Dependency vulnerability scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml" rel="noopener noreferrer"&gt;&lt;img alt="govulncheck.yml" title="govulncheck.yml: Go vulnerability check" src="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml" rel="noopener noreferrer"&gt;&lt;img alt="semgrep.yml" title="semgrep.yml: Static analysis security scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/dependabot-enabled.svg"&gt;&lt;img alt="dependabot-enabled" title="dependabot-enabled: Automated dependency updates are enabled" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fdependabot-enabled.svg"&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/a_few_micro_reviews.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fa_few_micro_reviews.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GenAI today is a &lt;strong&gt;race car without brakes&lt;/strong&gt;. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents &lt;em&gt;silently break things&lt;/em&gt;: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. You often find out in production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;git-lrc&lt;/code&gt; is your braking system.&lt;/strong&gt; It hooks into &lt;code&gt;git commit&lt;/code&gt; and runs an AI review on every diff &lt;em&gt;before&lt;/em&gt; it lands. 60-second setup. Completely free.&lt;/p&gt;
&lt;p&gt;In short, git-lrc helps &lt;strong&gt;Prevent Outages, Breaches, and Technical Debt Before They Happen&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At a glance:&lt;/strong&gt; &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;10 risk categories&lt;/a&gt; · &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;100+ failure patterns tracked&lt;/a&gt; · every commit…&lt;/p&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>The Three Phases of Post-Training: How LLMs Learn to Provide Sensible Responses</title>
      <dc:creator>Shrijith Venkatramana</dc:creator>
      <pubDate>Fri, 19 Jun 2026 19:02:11 +0000</pubDate>
      <link>https://dev.to/shrsv/the-three-phases-of-post-training-how-llms-learn-to-be-provide-sensible-responses-10a9</link>
      <guid>https://dev.to/shrsv/the-three-phases-of-post-training-how-llms-learn-to-be-provide-sensible-responses-10a9</guid>
      <description>&lt;p&gt;&lt;em&gt;Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;Star Us&lt;/a&gt; to help devs discover the project. Do give it a try and share your feedback for improving the product.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Most developers have heard the phrase:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"LLMs are trained on massive amounts of internet data."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While technically true, it leaves out the most interesting part.&lt;/p&gt;

&lt;p&gt;Pretraining teaches a model how language works. But it doesn't teach the model how to be helpful, harmless, honest, or aligned with human expectations.&lt;/p&gt;

&lt;p&gt;If pretraining creates a brilliant but socially awkward intern, post-training turns that intern into a productive teammate.&lt;/p&gt;

&lt;p&gt;Modern AI systems such as ChatGPT, Gemini, Claude, and others rely heavily on a multi-stage post-training pipeline. While implementations differ, the overall pattern is surprisingly consistent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Supervised Fine-Tuning (SFT)&lt;/li&gt;
&lt;li&gt;Reward Modeling (RM)&lt;/li&gt;
&lt;li&gt;Reinforcement Learning (RL)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's explore what each stage does, why it exists, and how they work together.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why Pretraining Isn't Enough
&lt;/h1&gt;

&lt;p&gt;Imagine we train a model on the entire internet and ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"How do I become a better software engineer?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model has seen thousands of answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Good advice&lt;/li&gt;
&lt;li&gt;Bad advice&lt;/li&gt;
&lt;li&gt;Contradictory advice&lt;/li&gt;
&lt;li&gt;Sarcasm&lt;/li&gt;
&lt;li&gt;Reddit arguments&lt;/li&gt;
&lt;li&gt;Technical blog posts&lt;/li&gt;
&lt;li&gt;Motivational speeches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model learns patterns in text, but it doesn't inherently know which response humans would prefer.&lt;/p&gt;

&lt;p&gt;It only knows what tends to come next.&lt;/p&gt;

&lt;p&gt;This is the core limitation of pretraining.&lt;/p&gt;

&lt;p&gt;The model learns:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"What people write."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"What humans want."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Post-training bridges this gap.&lt;/p&gt;

&lt;h1&gt;
  
  
  Stage 1: Supervised Fine-Tuning (SFT)
&lt;/h1&gt;

&lt;p&gt;The first step is teaching the model what good behavior looks like.&lt;/p&gt;

&lt;p&gt;Researchers create high-quality examples consisting of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: Explain TCP vs UDP.

Assistant:
TCP provides reliable ordered delivery...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Or:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: Write a Python function that reverses a linked list.

Assistant:
def reverse(head):
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Thousands or millions of these examples are collected.&lt;/p&gt;

&lt;p&gt;The model is then trained to imitate the desired responses.&lt;/p&gt;

&lt;p&gt;Conceptually:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Question → Ideal Answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;becomes&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model → Learn to reproduce ideal answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The objective is still next-token prediction, but now on carefully curated data instead of arbitrary internet text.&lt;/p&gt;
&lt;h3&gt;
  
  
  Software Engineering Analogy
&lt;/h3&gt;

&lt;p&gt;Think of SFT as onboarding a new engineer.&lt;/p&gt;

&lt;p&gt;Instead of letting them learn exclusively from random GitHub repositories, you provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Coding standards&lt;/li&gt;
&lt;li&gt;Architecture guidelines&lt;/li&gt;
&lt;li&gt;Example pull requests&lt;/li&gt;
&lt;li&gt;Internal best practices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engineer begins to imitate the patterns you want.&lt;/p&gt;
&lt;h3&gt;
  
  
  What SFT Solves
&lt;/h3&gt;

&lt;p&gt;SFT dramatically improves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instruction following&lt;/li&gt;
&lt;li&gt;Formatting&lt;/li&gt;
&lt;li&gt;Tool usage&lt;/li&gt;
&lt;li&gt;Coding style&lt;/li&gt;
&lt;li&gt;Conversational behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, it still has a limitation.&lt;/p&gt;

&lt;p&gt;For many prompts, there isn't one correct answer.&lt;/p&gt;

&lt;p&gt;There may be multiple reasonable responses with varying quality levels.&lt;/p&gt;

&lt;p&gt;That's where Reward Modeling enters.&lt;/p&gt;
&lt;h1&gt;
  
  
  Stage 2: Reward Modeling (RM)
&lt;/h1&gt;

&lt;p&gt;Suppose a user asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"How should I learn distributed systems?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Three responses might all be technically correct.&lt;/p&gt;

&lt;p&gt;Response A:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Read a textbook.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Response B:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Read a textbook and build projects.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Response C:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Study networking, databases, consensus algorithms,
then implement a small Raft cluster.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Most humans would likely prefer C.&lt;/p&gt;

&lt;p&gt;But how does a model learn that preference?&lt;/p&gt;

&lt;p&gt;The answer is Reward Modeling.&lt;/p&gt;
&lt;h3&gt;
  
  
  Collecting Human Preferences
&lt;/h3&gt;

&lt;p&gt;Human evaluators compare multiple outputs:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prompt

Answer A
Answer B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;They choose the better response.&lt;/p&gt;

&lt;p&gt;Thousands or millions of comparisons are collected.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prompt:
How do I learn Go?

Preferred:
Build projects and read effective Go.

Rejected:
Just read documentation.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A separate model is trained to predict these preferences.&lt;/p&gt;

&lt;p&gt;This becomes the Reward Model.&lt;/p&gt;

&lt;p&gt;Conceptually:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Response → Quality Score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The reward model acts like an automated judge.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;SFT teaches:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Produce answers similar to examples."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Reward Modeling teaches:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Recognize which answers humans prefer."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This distinction is subtle but important.&lt;/p&gt;

&lt;p&gt;One is imitation.&lt;/p&gt;

&lt;p&gt;The other is evaluation.&lt;/p&gt;
&lt;h1&gt;
  
  
  Stage 3: Reinforcement Learning (RL)
&lt;/h1&gt;

&lt;p&gt;Now we have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A policy model (the assistant)&lt;/li&gt;
&lt;li&gt;A reward model (the judge)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The final stage uses Reinforcement Learning to optimize the assistant.&lt;/p&gt;

&lt;p&gt;The process looks like:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prompt
   ↓
Model generates answer
   ↓
Reward model scores answer
   ↓
Update model to increase reward
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Repeated millions of times.&lt;/p&gt;

&lt;p&gt;Over time, the assistant learns to generate responses that maximize the reward signal.&lt;/p&gt;
&lt;h3&gt;
  
  
  PPO and Modern Variants
&lt;/h3&gt;

&lt;p&gt;Historically, many systems used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PPO (Proximal Policy Optimization)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More recently, newer approaches such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DPO (Direct Preference Optimization)&lt;/li&gt;
&lt;li&gt;RLAIF (Reinforcement Learning from AI Feedback)&lt;/li&gt;
&lt;li&gt;GRPO and related techniques&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;have gained popularity.&lt;/p&gt;

&lt;p&gt;The exact algorithm matters less than the goal:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Move the model toward outputs that humans consistently prefer.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Software Engineering Analogy
&lt;/h3&gt;

&lt;p&gt;Imagine code review automation.&lt;/p&gt;

&lt;p&gt;SFT teaches an engineer using examples of good pull requests.&lt;/p&gt;

&lt;p&gt;Reward Modeling creates a senior reviewer that scores submissions.&lt;/p&gt;

&lt;p&gt;RL repeatedly updates the engineer based on reviewer feedback.&lt;/p&gt;

&lt;p&gt;Eventually the engineer starts producing code that receives better review scores.&lt;/p&gt;
&lt;h1&gt;
  
  
  The New Trend: Models Training Models
&lt;/h1&gt;

&lt;p&gt;One interesting detail from recent Gemini research is the growing use of models themselves in post-training workflows.&lt;/p&gt;

&lt;p&gt;Instead of relying exclusively on humans, powerful models help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate candidate responses&lt;/li&gt;
&lt;li&gt;Identify low-quality data&lt;/li&gt;
&lt;li&gt;Detect inconsistencies&lt;/li&gt;
&lt;li&gt;Perform ranking tasks&lt;/li&gt;
&lt;li&gt;Assist evaluation pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates a feedback loop:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model
  ↓
Generates data
  ↓
Humans verify
  ↓
Improved model
  ↓
Generates better data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The result is dramatically improved scalability.&lt;/p&gt;

&lt;p&gt;The future of post-training may involve humans increasingly acting as supervisors while models handle much of the operational workload.&lt;/p&gt;


&lt;h1&gt;
  
  
  Why Data Quality Beats Model Size
&lt;/h1&gt;

&lt;p&gt;A common assumption is that better AI comes primarily from larger models.&lt;/p&gt;

&lt;p&gt;The industry increasingly suggests otherwise.&lt;/p&gt;

&lt;p&gt;Many recent gains come not from:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;More parameters
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;but from:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Better post-training data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A smaller model trained on excellent preference data can often outperform a larger model trained on mediocre data.&lt;/p&gt;

&lt;p&gt;This explains why modern research papers frequently emphasize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Preference datasets&lt;/li&gt;
&lt;li&gt;Evaluation quality&lt;/li&gt;
&lt;li&gt;Synthetic data generation&lt;/li&gt;
&lt;li&gt;Data filtering&lt;/li&gt;
&lt;li&gt;Alignment pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The quality of feedback often matters more than the quantity of compute.&lt;/p&gt;
&lt;h1&gt;
  
  
  Pretraining is about language, SFT is about sensible responses
&lt;/h1&gt;

&lt;p&gt;Pretraining teaches a model how language works.&lt;/p&gt;

&lt;p&gt;Supervised Fine-Tuning teaches it how to respond.&lt;/p&gt;

&lt;p&gt;Reward Modeling teaches it what humans prefer.&lt;/p&gt;

&lt;p&gt;Reinforcement Learning teaches it to consistently optimize for those preferences.&lt;/p&gt;

&lt;p&gt;Together, these stages transform a statistical text predictor into something that feels surprisingly useful.&lt;/p&gt;

&lt;p&gt;As foundation models become increasingly capable, the competitive advantage may shift away from raw model size and toward the sophistication of post-training systems and data quality pipelines.&lt;/p&gt;

&lt;p&gt;The next major breakthrough in AI might not come from a bigger model.&lt;/p&gt;

&lt;p&gt;It might come from a better teacher.&lt;/p&gt;

&lt;p&gt;If you were designing a reward model for a coding assistant, what signals would you optimize for—correctness, readability, maintainability, performance, or something else entirely?&lt;/p&gt;



&lt;p&gt;*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.&lt;/p&gt;

&lt;p&gt;git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*&lt;/p&gt;

&lt;p&gt;Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/HexmosTech" rel="noopener noreferrer"&gt;
        HexmosTech
      &lt;/a&gt; / &lt;a href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;
        git-lrc
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Free, Micro AI Code Reviews That Run on Git Commit
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div&gt;
&lt;p&gt;| &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.da.md" rel="noopener noreferrer"&gt;🇩🇰 Dansk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.es.md" rel="noopener noreferrer"&gt;🇪🇸 Español&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fa.md" rel="noopener noreferrer"&gt;🇮🇷 Farsi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.fi.md" rel="noopener noreferrer"&gt;🇫🇮 Suomi&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ja.md" rel="noopener noreferrer"&gt;🇯🇵 日本語&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.nn.md" rel="noopener noreferrer"&gt;🇳🇴 Norsk&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.pt.md" rel="noopener noreferrer"&gt;🇵🇹 Português&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.ru.md" rel="noopener noreferrer"&gt;🇷🇺 Русский&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.sq.md" rel="noopener noreferrer"&gt;🇦🇱 Shqip&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.zh.md" rel="noopener noreferrer"&gt;🇨🇳 中文&lt;/a&gt; | &lt;a href="https://github.com/HexmosTech/git-lrc/readme/README.hi.md" rel="noopener noreferrer"&gt;🇮🇳 हिन्दी&lt;/a&gt; |&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;img width="60" alt="git-lrc logo" src="https://camo.githubusercontent.com/948c8f2d5cf41b48985cd364d48c3a2dc9bfbfd42eab3e0a9a1b3e61f5f17ce3/68747470733a2f2f6865786d6f732e636f6d2f66726565646576746f6f6c732f7075626c69632f6c725f6c6f676f2e737667"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;git-lrc&lt;/h1&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Free, Micro AI Code Reviews That Run on Commit&lt;/h2&gt;
&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.producthunt.com/products/git-lrc?embed=true&amp;amp;utm_source=badge-top-post-badge&amp;amp;utm_medium=badge&amp;amp;utm_campaign=badge-git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="git-lrc - Free, micro AI code reviews that run on commit | Product Hunt" width="200" src="https://camo.githubusercontent.com/87bf2d4283c1e0aa99e254bd17fefb1c67c0c0d39300043a243a4aa633b6cecc/68747470733a2f2f6170692e70726f6475637468756e742e636f6d2f776964676574732f656d6265642d696d6167652f76312f746f702d706f73742d62616467652e7376673f706f73745f69643d31303739323632267468656d653d6c6967687426706572696f643d6461696c7926743d31373731373439313730383638"&gt;&lt;/a&gt;
&amp;nbsp;&lt;/p&gt;
&lt;br&gt;
&lt;a href="https://discord.gg/sGdnKwB3qq" rel="nofollow noopener noreferrer"&gt;
  &lt;img alt="Discord Community" src="https://camo.githubusercontent.com/b8f979318aaabc8dec512b9d4e6e2a12431fba3c8a3b8738e1a97a0722d4e4bf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d436f6d6d756e6974792d3538363546323f6c6f676f3d646973636f7264266c6162656c436f6c6f723d7768697465"&gt;
&lt;/a&gt; &lt;a href="https://goreportcard.com/report/github.com/HexmosTech/git-lrc" rel="nofollow noopener noreferrer"&gt;&lt;img alt="Go Report Card" src="https://camo.githubusercontent.com/e74c0651c3ee9165a2ed01cb0f6842c494029960df30eb9c24cf622d3d21bf46/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f4865786d6f73546563682f6769742d6c7263"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml" rel="noopener noreferrer"&gt;&lt;img alt="gitleaks.yml" title="gitleaks.yml: Secret scanning workflow" src="https://github.com/HexmosTech/git-lrc/actions/workflows/gitleaks.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml" rel="noopener noreferrer"&gt;&lt;img alt="osv-scanner.yml" title="osv-scanner.yml: Dependency vulnerability scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/osv-scanner.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml" rel="noopener noreferrer"&gt;&lt;img alt="govulncheck.yml" title="govulncheck.yml: Go vulnerability check" src="https://github.com/HexmosTech/git-lrc/actions/workflows/govulncheck.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a href="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml" rel="noopener noreferrer"&gt;&lt;img alt="semgrep.yml" title="semgrep.yml: Static analysis security scan" src="https://github.com/HexmosTech/git-lrc/actions/workflows/semgrep.yml/badge.svg"&gt;&lt;/a&gt;&amp;nbsp;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/dependabot-enabled.svg"&gt;&lt;img alt="dependabot-enabled" title="dependabot-enabled: Automated dependency updates are enabled" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fdependabot-enabled.svg"&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/HexmosTech/git-lrc/./gfx/a_few_micro_reviews.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FHexmosTech%2Fgit-lrc%2FHEAD%2F.%2Fgfx%2Fa_few_micro_reviews.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GenAI today is a &lt;strong&gt;race car without brakes&lt;/strong&gt;. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents &lt;em&gt;silently break things&lt;/em&gt;: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. You often find out in production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;git-lrc&lt;/code&gt; is your braking system.&lt;/strong&gt; It hooks into &lt;code&gt;git commit&lt;/code&gt; and runs an AI review on every diff &lt;em&gt;before&lt;/em&gt; it lands. 60-second setup. Completely free.&lt;/p&gt;
&lt;p&gt;In short, git-lrc helps &lt;strong&gt;Prevent Outages, Breaches, and Technical Debt Before They Happen&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At a glance:&lt;/strong&gt; &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;10 risk categories&lt;/a&gt; · &lt;a href="https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for" rel="noopener noreferrer"&gt;100+ failure patterns tracked&lt;/a&gt; · every commit…&lt;/p&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/HexmosTech/git-lrc" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
