<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: zeromathai</title>
    <description>The latest articles on DEV Community by zeromathai (@zeromathai).</description>
    <link>https://dev.to/zeromathai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3872570%2Fc7bba9ef-1a14-44b5-a02d-f6720ab48ab8.png</url>
      <title>DEV Community: zeromathai</title>
      <link>https://dev.to/zeromathai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zeromathai"/>
    <language>en</language>
    <item>
      <title>How Modern Transformer Blocks Work — From RMSNorm to MoE</title>
      <dc:creator>zeromathai</dc:creator>
      <pubDate>Mon, 29 Jun 2026 10:42:05 +0000</pubDate>
      <link>https://dev.to/zeromathai/how-modern-transformer-blocks-work-from-rmsnorm-to-moe-44cc</link>
      <guid>https://dev.to/zeromathai/how-modern-transformer-blocks-work-from-rmsnorm-to-moe-44cc</guid>
      <description>&lt;p&gt;The original Transformer idea is still alive.&lt;/p&gt;

&lt;p&gt;But modern LLM blocks are not just the 2017 Transformer copied and scaled.&lt;/p&gt;

&lt;p&gt;They are engineered for deeper training, longer context, cheaper inference, and larger capacity.&lt;/p&gt;

&lt;p&gt;That is why components like RMSNorm, GQA, RoPE, SwiGLU, and MoE matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Idea
&lt;/h2&gt;

&lt;p&gt;A modern Transformer block still follows the same basic pattern:&lt;/p&gt;

&lt;p&gt;Attention updates token relationships.&lt;/p&gt;

&lt;p&gt;The Feed-Forward Network transforms each token representation.&lt;/p&gt;

&lt;p&gt;Residual connections keep information flowing.&lt;/p&gt;

&lt;p&gt;But modern LLMs changed the details.&lt;/p&gt;

&lt;p&gt;Those details are not cosmetic.&lt;/p&gt;

&lt;p&gt;They make large-scale training and inference practical.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Structure
&lt;/h2&gt;

&lt;p&gt;A typical modern Transformer block looks like this:&lt;/p&gt;

&lt;p&gt;Input&lt;br&gt;&lt;br&gt;
→ Pre-normalization (RMSNorm, or LayerNorm in older designs)&lt;br&gt;&lt;br&gt;
→ Self-Attention with GQA and RoPE&lt;br&gt;&lt;br&gt;
→ Residual Connection&lt;br&gt;&lt;br&gt;
→ Pre-normalization (RMSNorm, or LayerNorm in older designs)&lt;br&gt;&lt;br&gt;
→ Feed-Forward Network with SwiGLU or MoE&lt;br&gt;&lt;br&gt;
→ Residual Connection&lt;/p&gt;

&lt;p&gt;More compactly:&lt;/p&gt;

&lt;p&gt;Modern Transformer Block = stable normalization + efficient attention + stronger FFN + residual flow&lt;/p&gt;

&lt;p&gt;Each component solves a real scaling problem.&lt;/p&gt;

&lt;p&gt;Pre-LN improves deep training stability.&lt;/p&gt;

&lt;p&gt;GQA reduces KV Cache memory.&lt;/p&gt;

&lt;p&gt;RoPE injects position into attention.&lt;/p&gt;

&lt;p&gt;SwiGLU improves FFN expressiveness.&lt;/p&gt;

&lt;p&gt;MoE increases capacity without activating all parameters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pseudo-code View
&lt;/h2&gt;

&lt;p&gt;A simplified modern block looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def transformer_block(x):
    h = rms_norm(x)

    attn = grouped_query_attention(
        q=apply_rope(query(h)),
        k=apply_rope(key(h)),
        v=value(h)
    )

    x = x + attn

    h = rms_norm(x)

    ffn = swiglu_ffn(h)

    x = x + ffn

    return x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With MoE, the FFN part can become:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;h = rms_norm(x)

selected_experts = router(h)

ffn = run_top_k_experts(h, selected_experts)

x = x + ffn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The pattern stays simple.&lt;/p&gt;

&lt;p&gt;Normalize.&lt;/p&gt;

&lt;p&gt;Transform.&lt;/p&gt;

&lt;p&gt;Add back.&lt;/p&gt;

&lt;p&gt;Repeat.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concrete Example
&lt;/h2&gt;

&lt;p&gt;Imagine the model processes this token:&lt;/p&gt;

&lt;p&gt;"bank"&lt;/p&gt;

&lt;p&gt;The attention block helps decide whether “bank” means:&lt;/p&gt;

&lt;p&gt;a financial institution&lt;/p&gt;

&lt;p&gt;or the side of a river&lt;/p&gt;

&lt;p&gt;RoPE helps the model understand token order and distance.&lt;/p&gt;

&lt;p&gt;GQA helps attention run with a smaller KV Cache.&lt;/p&gt;

&lt;p&gt;The FFN then transforms the contextual representation.&lt;/p&gt;

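&lt;p&gt;If the model uses MoE, the router sends this token to a few selected experts. A useful mental picture is experts for finance, geography, or general language, although the specializations a router actually learns are often less tidy than that.&lt;/p&gt;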

&lt;p&gt;That is the intuition.&lt;/p&gt;

&lt;p&gt;Modern Transformer blocks are not just bigger.&lt;/p&gt;

&lt;p&gt;They are more selective, stable, and hardware-aware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-LN vs Post-LN
&lt;/h2&gt;

&lt;p&gt;The original Transformer used Post-LN.&lt;/p&gt;

&lt;p&gt;Post-LN:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = LayerNorm(x + Sublayer(x))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Modern LLMs often use Pre-LN.&lt;/p&gt;

&lt;p&gt;Pre-LN:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = x + Sublayer(LayerNorm(x))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The difference looks small.&lt;/p&gt;

&lt;p&gt;But it matters.&lt;/p&gt;

&lt;p&gt;Pre-LN normalizes before the sublayer.&lt;/p&gt;

&lt;p&gt;That helps gradients flow through deep Transformer stacks.&lt;/p&gt;

&lt;p&gt;When a model has dozens or hundreds of layers, this becomes critical.&lt;/p&gt;

&lt;p&gt;Pre-LN is not just a formatting choice.&lt;/p&gt;

&lt;p&gt;It is a training stability choice.&lt;/p&gt;
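
&lt;p&gt;As a minimal sketch of the two wirings, the Python below uses a plain LayerNorm without learned parameters and a hypothetical toy sublayer that just doubles each value. Both are assumptions for illustration, not code from a real model.&lt;/p&gt;

```python
import math

def layer_norm(x, eps=1e-5):
    # Plain LayerNorm without learned scale or bias: recenter, then rescale.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the FFN.
    return [2.0 * v for v in x]

def post_ln_step(x):
    # Post-LN: normalize after the residual addition.
    summed = [a + b for a, b in zip(x, toy_sublayer(x))]
    return layer_norm(summed)

def pre_ln_step(x):
    # Pre-LN: normalize only the sublayer input; the residual path stays untouched.
    branch = toy_sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, branch)]

x = [1.0, 2.0, 3.0]
print(post_ln_step(x))
print(pre_ln_step(x))
```

&lt;p&gt;In the Pre-LN version the input reaches the output through a plain addition, so there is always an unnormalized identity path for gradients to follow.&lt;/p&gt;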

&lt;h2&gt;
  
  
  RMSNorm
&lt;/h2&gt;

&lt;p&gt;RMSNorm is a simpler normalization method.&lt;/p&gt;

&lt;p&gt;LayerNorm recenters and rescales.&lt;/p&gt;

&lt;p&gt;RMSNorm only rescales, using the root mean square of the vector.&lt;/p&gt;

&lt;p&gt;The RMS is:&lt;/p&gt;

&lt;p&gt;RMS(h) = sqrt((1 / n) * Σ hᵢ²)&lt;/p&gt;

&lt;p&gt;Then the normalized vector is:&lt;/p&gt;

&lt;p&gt;h_norm = h / (RMS(h) + ε) * g&lt;/p&gt;

&lt;p&gt;Why use it?&lt;/p&gt;

&lt;p&gt;It keeps activation scale stable.&lt;/p&gt;

&lt;p&gt;It skips the mean subtraction that LayerNorm performs, which removes some computation.&lt;/p&gt;

&lt;p&gt;It works well in large LLMs.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;h = [3, 4]&lt;/p&gt;

&lt;p&gt;RMS(h) = sqrt((9 + 16) / 2) ≈ 3.54&lt;/p&gt;

&lt;p&gt;Normalized h ≈ [0.85, 1.13] (taking g = 1 and ignoring ε)&lt;/p&gt;

&lt;p&gt;The key idea:&lt;/p&gt;

&lt;p&gt;RMSNorm stabilizes scale without doing more than necessary.&lt;/p&gt;
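
&lt;p&gt;The worked example can be checked with a few lines of Python. This is a minimal sketch that follows the formula as written in this section; the identity gain and the eps default are assumptions for the demo.&lt;/p&gt;

```python
import math

def rms_norm(h, gain=None, eps=1e-6):
    # Root mean square of the vector: sqrt(mean(h_i^2)).
    rms = math.sqrt(sum(v * v for v in h) / len(h))
    if gain is None:
        gain = [1.0] * len(h)  # assumption: identity gain for the demo
    # Follows the formula above: h / (RMS(h) + eps) * g.
    return [v / (rms + eps) * g for v, g in zip(h, gain)]

print([round(v, 2) for v in rms_norm([3.0, 4.0])])  # [0.85, 1.13]
```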

&lt;h2&gt;
  
  
  Attention Block: GQA + RoPE
&lt;/h2&gt;

&lt;p&gt;Modern attention is often not plain Multi-Head Attention.&lt;/p&gt;

&lt;p&gt;It usually combines memory-aware attention with positional encoding.&lt;/p&gt;

&lt;p&gt;Grouped-Query Attention reduces KV Cache size.&lt;/p&gt;

&lt;p&gt;Rotary Positional Embedding injects position into Query and Key.&lt;/p&gt;

&lt;p&gt;The attention flow becomes:&lt;/p&gt;

&lt;p&gt;Input&lt;br&gt;&lt;br&gt;
→ Q, K, V projection&lt;br&gt;&lt;br&gt;
→ Apply RoPE to Q and K&lt;br&gt;&lt;br&gt;
→ Share K/V by groups using GQA&lt;br&gt;&lt;br&gt;
→ Compute attention&lt;br&gt;&lt;br&gt;
→ Output projection&lt;/p&gt;

&lt;p&gt;This matters for inference.&lt;/p&gt;

&lt;p&gt;Long-context generation is often limited by KV Cache memory.&lt;/p&gt;

&lt;p&gt;GQA reduces that pressure.&lt;/p&gt;

&lt;p&gt;RoPE keeps position information inside attention without adding a large position table.&lt;/p&gt;
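
&lt;p&gt;To make the memory effect concrete, here is a rough sketch of the KV Cache size arithmetic and of how query heads map onto shared K/V heads. The model configuration (32 layers, 32 query heads, head dimension 128, 4096 tokens, 16-bit values) is hypothetical, and both helper functions are illustrative names.&lt;/p&gt;

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Two tensors (K and V) per layer, each of shape [seq_len, kv_heads, head_dim].
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    # Consecutive query heads share one K/V head.
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

# Hypothetical configuration, one sequence.
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)

print(mha // 2**20, "MiB vs", gqa // 2**20, "MiB")  # 2048 MiB vs 512 MiB
print([kv_head_for_query_head(h, 32, 8) for h in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

&lt;p&gt;With 8 K/V heads instead of 32, the cache shrinks by the same factor of 4, while the number of query heads stays unchanged.&lt;/p&gt;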

&lt;h2&gt;
  
  
  SwiGLU
&lt;/h2&gt;

&lt;p&gt;The Feed-Forward Network is not just a simple MLP anymore.&lt;/p&gt;

&lt;p&gt;Many modern LLMs use SwiGLU.&lt;/p&gt;

&lt;p&gt;SwiGLU is a gated activation.&lt;/p&gt;

&lt;p&gt;One path carries information.&lt;/p&gt;

&lt;p&gt;Another path controls how much passes through.&lt;/p&gt;

&lt;p&gt;A simplified formula:&lt;/p&gt;

&lt;p&gt;SwiGLU(x) = (W₁x) * Swish(W₂x)&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;W₁x = 4&lt;/p&gt;

&lt;p&gt;Swish(W₂x) = 0.5&lt;/p&gt;

&lt;p&gt;Output = 2&lt;/p&gt;

&lt;p&gt;The gate decides how much information moves forward.&lt;/p&gt;

&lt;p&gt;That gives the FFN more control than a plain activation.&lt;/p&gt;
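
&lt;p&gt;The gating can be sketched in plain Python. Here value_path plays the role of W₁x and gate_path the role of W₂x; the numbers are made up, and the projections themselves are left out to keep the example short.&lt;/p&gt;

```python
import math

def swish(z):
    # Swish (SiLU): z * sigmoid(z).
    return z / (1.0 + math.exp(-z))

def swiglu(value_path, gate_path):
    # Elementwise product of the value path and the Swish-activated gate path.
    return [v * swish(g) for v, g in zip(value_path, gate_path)]

# Same value (4.0) in both slots, but very different gate inputs.
print(swiglu([4.0, 4.0], [0.0, 10.0]))
```

&lt;p&gt;The first slot is closed completely because Swish(0) = 0, while the second passes the value through scaled by roughly 10.&lt;/p&gt;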

&lt;h2&gt;
  
  
  Mixture of Experts
&lt;/h2&gt;

&lt;p&gt;Mixture of Experts increases model capacity without activating every parameter for every token.&lt;/p&gt;

&lt;p&gt;Instead of one FFN, the model has multiple expert networks.&lt;/p&gt;

&lt;p&gt;A router chooses which experts handle each token.&lt;/p&gt;

&lt;p&gt;Example router output:&lt;/p&gt;

&lt;p&gt;Expert 1 = 0.45&lt;br&gt;&lt;br&gt;
Expert 2 = 0.19&lt;br&gt;&lt;br&gt;
Expert 3 = 0.05&lt;br&gt;&lt;br&gt;
Expert 4 = 0.31  &lt;/p&gt;

&lt;p&gt;With Top-2 routing:&lt;/p&gt;

&lt;p&gt;Expert 1 and Expert 4 are selected.&lt;/p&gt;

&lt;p&gt;Only those experts run.&lt;/p&gt;

&lt;p&gt;This is why MoE is called sparse.&lt;/p&gt;

&lt;p&gt;The model may have many parameters.&lt;/p&gt;

&lt;p&gt;But each token uses only a small subset.&lt;/p&gt;
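
&lt;p&gt;A small sketch of Top-2 routing, using the router output from the example. The toy experts (simple multipliers) and the renormalization of the selected weights are assumptions; real routers differ in how they weight and balance experts.&lt;/p&gt;

```python
def top_k_experts(router_probs, k=2):
    # Indices of the k largest router probabilities, highest first.
    order = sorted(range(len(router_probs)), key=lambda i: router_probs[i], reverse=True)
    chosen = order[:k]
    # Renormalize the selected weights so they sum to 1 (a common choice, assumed here).
    total = sum(router_probs[i] for i in chosen)
    return [(i, router_probs[i] / total) for i in chosen]

def moe_forward(x, experts, router_probs, k=2):
    # Only the selected experts are evaluated; the rest are skipped entirely.
    return sum(weight * experts[i](x) for i, weight in top_k_experts(router_probs, k))

router_probs = [0.45, 0.19, 0.05, 0.31]  # Expert 1..4 from the example above
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]  # toy experts

print([i + 1 for i, _ in top_k_experts(router_probs)])  # [1, 4]
print(moe_forward(1.0, experts, router_probs))
```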

&lt;h2&gt;
  
  
  Dense FFN vs MoE
&lt;/h2&gt;

&lt;p&gt;Dense FFN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;every token uses the same FFN&lt;/li&gt;
&lt;li&gt;all FFN parameters are active&lt;/li&gt;
&lt;li&gt;simpler to train and serve&lt;/li&gt;
&lt;li&gt;compute grows directly with FFN size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MoE:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;each token is routed to selected experts&lt;/li&gt;
&lt;li&gt;only part of the model activates&lt;/li&gt;
&lt;li&gt;increases total capacity efficiently&lt;/li&gt;
&lt;li&gt;adds routing and load-balancing complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key difference:&lt;/p&gt;

&lt;p&gt;Dense FFN = same compute path for every token&lt;/p&gt;

&lt;p&gt;MoE = conditional compute path per token&lt;/p&gt;

&lt;p&gt;MoE is powerful.&lt;/p&gt;

&lt;p&gt;But it is not free.&lt;/p&gt;

&lt;p&gt;It introduces routing instability, expert imbalance, and distributed communication overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Token Prediction
&lt;/h2&gt;

&lt;p&gt;Standard language modeling predicts one next token.&lt;/p&gt;

&lt;p&gt;At position t:&lt;/p&gt;

&lt;p&gt;predict token t + 1&lt;/p&gt;

&lt;p&gt;Multi-Token Prediction trains the model to predict multiple future tokens.&lt;/p&gt;

&lt;p&gt;At position t:&lt;/p&gt;

&lt;p&gt;predict token t + 1, t + 2, t + 3 ...&lt;/p&gt;

&lt;p&gt;This gives more learning signals from the same representation.&lt;/p&gt;

&lt;p&gt;Standard training:&lt;/p&gt;

&lt;p&gt;one position → one supervision signal&lt;/p&gt;

&lt;p&gt;MTP training:&lt;/p&gt;

&lt;p&gt;one position → multiple supervision signals&lt;/p&gt;

&lt;p&gt;This can improve sample efficiency.&lt;/p&gt;

&lt;p&gt;In some systems, it can also support faster generation, for example by using the extra predictions as draft tokens for speculative decoding.&lt;/p&gt;
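
&lt;p&gt;The difference in supervision can be sketched by building the targets for each position. The function name and the horizon parameter are illustrative assumptions; real MTP setups add extra prediction heads or modules on top of this.&lt;/p&gt;

```python
def mtp_targets(tokens, horizon):
    # For each position t, collect the tokens at t+1 .. t+horizon that exist.
    targets = []
    for t in range(len(tokens)):
        targets.append(tokens[t + 1 : t + 1 + horizon])
    return targets

tokens = ["the", "cat", "sat", "down"]

print(mtp_targets(tokens, horizon=1))  # standard next-token targets
print(mtp_targets(tokens, horizon=2))  # two future tokens per position
```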

&lt;h2&gt;
  
  
  Naive vs Modern View
&lt;/h2&gt;

&lt;p&gt;Naive view:&lt;/p&gt;

&lt;p&gt;Transformer block = attention + FFN&lt;/p&gt;

&lt;p&gt;Modern view:&lt;/p&gt;

&lt;p&gt;Transformer block = stable normalization + efficient attention + gated FFN + sparse scaling&lt;/p&gt;

&lt;p&gt;Naive block:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;attention
ffn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Modern block:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rmsnorm
rope
gqa
residual
rmsnorm
swiglu or moe
residual
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This matters because modern LLM performance is not just about parameter count.&lt;/p&gt;

&lt;p&gt;It is about architecture details that make those parameters trainable and deployable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Perspective
&lt;/h2&gt;

&lt;p&gt;When reading modern LLM code, look for these patterns:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;self.input_layernorm = RMSNorm(...)

self.self_attn = Attention(..., rope=True, num_key_value_heads=...)

self.post_attention_layernorm = RMSNorm(...)

self.mlp = SwiGLU(...)  # or MoE(...) in sparse models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The key clue for GQA is:&lt;/p&gt;

&lt;p&gt;number of query heads &amp;gt; number of key-value heads&lt;/p&gt;

&lt;p&gt;The key clue for RoPE is:&lt;/p&gt;

&lt;p&gt;position is applied to Q and K before attention&lt;/p&gt;

&lt;p&gt;The key clue for MoE is:&lt;/p&gt;

&lt;p&gt;router logits decide which experts run&lt;/p&gt;

&lt;p&gt;These details tell you what kind of Transformer block you are actually looking at.&lt;/p&gt;
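
&lt;p&gt;The GQA clue can be turned into a tiny check. This is a hypothetical helper, not part of any library; the parameter names mirror the num_key_value_heads field in the snippet above, and MQA is the special case where all query heads share a single K/V head.&lt;/p&gt;

```python
def attention_kind(num_attention_heads, num_key_value_heads):
    # Classify the attention layout from the two head counts.
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("query heads must divide evenly into K/V heads")
    if num_key_value_heads == num_attention_heads:
        return "MHA"   # every query head has its own K/V head
    if num_key_value_heads == 1:
        return "MQA"   # all query heads share a single K/V head
    return "GQA"       # query heads share K/V heads in groups

print(attention_kind(32, 32), attention_kind(32, 8), attention_kind(32, 1))  # MHA GQA MQA
```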

&lt;h2&gt;
  
  
  Important Conditions and Limits
&lt;/h2&gt;

&lt;p&gt;Pre-LN improves stability, but the whole optimization setup still matters.&lt;/p&gt;

&lt;p&gt;RMSNorm is efficient, but it does not replace good initialization or training design.&lt;/p&gt;

&lt;p&gt;GQA reduces KV Cache memory, but several query heads share each K/V head, which can cost some attention flexibility.&lt;/p&gt;

&lt;p&gt;RoPE works well for long contexts, but very long extrapolation may still need scaling techniques.&lt;/p&gt;

&lt;p&gt;SwiGLU improves FFN behavior, but it needs an extra projection matrix for the gate, so the FFN structure is more complex.&lt;/p&gt;

&lt;p&gt;MoE increases capacity, but adds routing and system complexity.&lt;/p&gt;

&lt;p&gt;Modern Transformer design is a trade-off system.&lt;/p&gt;

&lt;p&gt;Every upgrade solves one bottleneck and introduces another design choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Again
&lt;/h2&gt;

&lt;p&gt;Modern LLMs are not just large neural networks.&lt;/p&gt;

&lt;p&gt;They are carefully engineered stacks.&lt;/p&gt;

&lt;p&gt;If you understand the block, you can better understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;why inference needs KV Cache optimization&lt;/li&gt;
&lt;li&gt;why RoPE appears in attention code&lt;/li&gt;
&lt;li&gt;why RMSNorm replaces LayerNorm&lt;/li&gt;
&lt;li&gt;why GQA changes memory usage&lt;/li&gt;
&lt;li&gt;why MoE models can be huge but still sparse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the difference between using LLMs and understanding how they scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Modern Transformer blocks preserve the original Transformer idea.&lt;/p&gt;

&lt;p&gt;But they upgrade almost every practical detail.&lt;/p&gt;

&lt;p&gt;The shortest version:&lt;/p&gt;

&lt;p&gt;Modern Transformer Block = Pre-LN/RMSNorm + GQA/RoPE Attention + SwiGLU/MoE FFN + Residual Connections&lt;/p&gt;

&lt;p&gt;If Self-Attention is the core idea, the modern block is the production-grade version of that idea.&lt;/p&gt;

&lt;p&gt;It is built for depth, context length, inference memory, and scalable capacity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;When reading modern LLM architecture, which component feels most important to understand first?&lt;/p&gt;

&lt;p&gt;RMSNorm, RoPE, GQA, SwiGLU, or MoE?&lt;/p&gt;

&lt;p&gt;Originally published at zeromathai.com.&lt;br&gt;
Original article: &lt;a href="https://zeromathai.com/en/modern-transformer-blocks-llm-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/modern-transformer-blocks-llm-en/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub Resources&lt;br&gt;
AI diagrams, study notes, and visual guides:&lt;br&gt;
&lt;a href="https://github.com/zeromathai/zeromathai-ai" rel="noopener noreferrer"&gt;https://github.com/zeromathai/zeromathai-ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>transformers</category>
    </item>
    <item>
      <title>Why Positional Embeddings Matter — APE, RPE, and RoPE Explained for Developers</title>
      <dc:creator>zeromathai</dc:creator>
      <pubDate>Fri, 26 Jun 2026 15:01:50 +0000</pubDate>
      <link>https://dev.to/zeromathai/why-positional-embeddings-matter-ape-rpe-and-rope-explained-for-developers-27gn</link>
      <guid>https://dev.to/zeromathai/why-positional-embeddings-matter-ape-rpe-and-rope-explained-for-developers-27gn</guid>
      <description>&lt;p&gt;Self-Attention can compare every token with every other token.&lt;/p&gt;

&lt;p&gt;But there is a catch.&lt;/p&gt;

&lt;p&gt;By itself, it does not know the order of tokens.&lt;/p&gt;

&lt;p&gt;That is a serious problem because “dog bites man” and “man bites dog” use the same words but mean completely different things.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Idea
&lt;/h2&gt;

&lt;p&gt;A Transformer needs two kinds of information:&lt;/p&gt;

&lt;p&gt;what the token is&lt;/p&gt;

&lt;p&gt;where the token is&lt;/p&gt;

&lt;p&gt;Token embeddings provide the “what.”&lt;/p&gt;

&lt;p&gt;Positional embeddings provide the “where.”&lt;/p&gt;

&lt;p&gt;This matters because attention without position is order-blind.&lt;/p&gt;

&lt;p&gt;It can compare tokens, but it does not naturally know which token came first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Structure
&lt;/h2&gt;

&lt;p&gt;A simple positional embedding flow looks like this:&lt;/p&gt;

&lt;p&gt;Token Embedding + Positional Information → Input Representation&lt;/p&gt;

&lt;p&gt;For Absolute Positional Embedding:&lt;/p&gt;

&lt;p&gt;E = X + P&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;p&gt;X = token embedding&lt;/p&gt;

&lt;p&gt;P = positional embedding&lt;/p&gt;

&lt;p&gt;E = final input representation&lt;/p&gt;

&lt;p&gt;More compactly:&lt;/p&gt;

&lt;p&gt;Transformer input = meaning vector + position signal&lt;/p&gt;

&lt;p&gt;Different positional methods change how the position signal is injected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pseudo-code View
&lt;/h2&gt;

&lt;p&gt;Basic positional injection:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tokens = tokenize(text)

x = embedding(tokens)

position = positional_embedding(token_positions)

input_representation = x + position
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For attention-based position methods:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;q = project_query(x)

k = project_key(x)

q = apply_position(q)

k = apply_position(k)

attention_scores = q @ k.T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;APE usually modifies the input embedding.&lt;/p&gt;

&lt;p&gt;RPE usually modifies the attention score.&lt;/p&gt;

&lt;p&gt;RoPE usually modifies Query and Key.&lt;/p&gt;

&lt;p&gt;That difference is the whole story.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concrete Example
&lt;/h2&gt;

&lt;p&gt;Compare these two sentences:&lt;/p&gt;

&lt;p&gt;dog bites man&lt;/p&gt;

&lt;p&gt;man bites dog&lt;/p&gt;

&lt;p&gt;The token set is the same:&lt;/p&gt;

&lt;p&gt;dog, bites, man&lt;/p&gt;

&lt;p&gt;But the order changes the meaning.&lt;/p&gt;

&lt;p&gt;Without positional information, Self-Attention sees token relationships but has no built-in sequence order.&lt;/p&gt;

&lt;p&gt;With positional information, each token representation includes location.&lt;/p&gt;

&lt;p&gt;So “dog” at position 1 is different from “dog” at position 3.&lt;/p&gt;

&lt;p&gt;This is why positional encoding is not optional.&lt;/p&gt;

&lt;p&gt;It is required for language understanding.&lt;/p&gt;

&lt;h2&gt;
  
  
  APE: Absolute Positional Embedding
&lt;/h2&gt;

&lt;p&gt;Absolute Positional Embedding assigns a vector to each position index.&lt;/p&gt;

&lt;p&gt;Position 1 has one vector.&lt;/p&gt;

&lt;p&gt;Position 2 has another vector.&lt;/p&gt;

&lt;p&gt;Position 3 has another vector.&lt;/p&gt;

&lt;p&gt;Then the model adds that position vector to the token embedding.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Token embedding:&lt;/p&gt;

&lt;p&gt;X = [0.2, 0.5]&lt;/p&gt;

&lt;p&gt;Position embedding:&lt;/p&gt;

&lt;p&gt;P = [0.1, -0.2]&lt;/p&gt;

&lt;p&gt;Final representation:&lt;/p&gt;

&lt;p&gt;E = [0.3, 0.3]&lt;/p&gt;

&lt;p&gt;APE is easy to understand.&lt;/p&gt;

&lt;p&gt;It says:&lt;/p&gt;

&lt;p&gt;this token is at this exact position&lt;/p&gt;

&lt;h2&gt;
  
  
  Why APE Is Useful
&lt;/h2&gt;

&lt;p&gt;APE is simple.&lt;/p&gt;

&lt;p&gt;It is easy to implement.&lt;/p&gt;

&lt;p&gt;It works well when sequence lengths stay close to what the model saw during training.&lt;/p&gt;

&lt;p&gt;Implementation-wise, it is just:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = token_embedding + position_embedding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That makes it cheap and clean.&lt;/p&gt;

&lt;p&gt;But the simplicity has a cost.&lt;/p&gt;

&lt;p&gt;APE treats position as a fixed index.&lt;/p&gt;

&lt;p&gt;If the model sees much longer inputs than it was trained on, unseen positions can become unreliable.&lt;/p&gt;

&lt;p&gt;That makes APE weaker for long-context extrapolation.&lt;/p&gt;
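
&lt;p&gt;A minimal sketch of a learned position table shows both the addition and the length limit. The table values are made up, and the three-position limit is a deliberately tiny assumption.&lt;/p&gt;

```python
# Hypothetical learned table: one vector per position index seen in training.
position_table = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]]

def embed(token_vectors):
    out = []
    for pos, x in enumerate(token_vectors):
        p = position_table[pos]  # raises IndexError past the trained length
        out.append([a + b for a, b in zip(x, p)])
    return out

print([round(v, 2) for v in embed([[0.2, 0.5]])[0]])  # [0.3, 0.3]

try:
    embed([[0.2, 0.5]] * 4)  # four tokens, but only three trained positions
except IndexError:
    print("position 3 has no trained vector")
```

&lt;p&gt;A learned table simply has nothing to look up beyond its last row, which is the hard version of the extrapolation problem.&lt;/p&gt;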

&lt;h2&gt;
  
  
  RPE: Relative Positional Embedding
&lt;/h2&gt;

&lt;p&gt;Relative Positional Embedding focuses on distance.&lt;/p&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;p&gt;What position is this token at?&lt;/p&gt;

&lt;p&gt;It asks:&lt;/p&gt;

&lt;p&gt;How far apart are these two tokens?&lt;/p&gt;

&lt;p&gt;This is often more natural for language.&lt;/p&gt;

&lt;p&gt;A subject and verb may appear at different absolute positions.&lt;/p&gt;

&lt;p&gt;But their relative distance and direction still matter.&lt;/p&gt;

&lt;p&gt;A simplified RPE attention score looks like this:&lt;/p&gt;

&lt;p&gt;Aᵢⱼ = (QᵢKⱼᵀ + Rᵢ₋ⱼ) / √d&lt;/p&gt;

&lt;p&gt;Rᵢ₋ⱼ is a bias term indexed by the offset i - j between token i and token j.&lt;/p&gt;

&lt;p&gt;This means position directly affects attention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concrete RPE Example
&lt;/h2&gt;

&lt;p&gt;Suppose:&lt;/p&gt;

&lt;p&gt;QᵢKⱼᵀ = 12&lt;/p&gt;

&lt;p&gt;Rᵢ₋ⱼ = 4&lt;/p&gt;

&lt;p&gt;√d = 4&lt;/p&gt;

&lt;p&gt;Then:&lt;/p&gt;

&lt;p&gt;Aᵢⱼ = (12 + 4) / 4 = 4&lt;/p&gt;

&lt;p&gt;Without the relative term:&lt;/p&gt;

&lt;p&gt;Aᵢⱼ = 12 / 4 = 3&lt;/p&gt;

&lt;p&gt;So the distance relationship increased the attention score.&lt;/p&gt;

&lt;p&gt;That is the intuition.&lt;/p&gt;

&lt;p&gt;RPE lets the model say:&lt;/p&gt;

&lt;p&gt;This token is more relevant because of where it is relative to me.&lt;/p&gt;
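
&lt;p&gt;The arithmetic above can be reproduced with a short sketch. The vectors are chosen so that the dot product is 12 and d = 16; the function name is illustrative and the bias is passed in directly rather than looked up from a learned table.&lt;/p&gt;

```python
import math

def rpe_score(q, k, relative_bias):
    # Dot product plus a bias chosen by the offset i - j, then scaled by sqrt(d).
    d = len(q)
    dot = sum(a * b for a, b in zip(q, k))
    return (dot + relative_bias) / math.sqrt(d)

q = [3.0] + [0.0] * 15   # d = 16, so sqrt(d) = 4
k = [4.0] + [0.0] * 15   # q . k = 12

print(rpe_score(q, k, relative_bias=4.0))  # 4.0
print(rpe_score(q, k, relative_bias=0.0))  # 3.0
```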

&lt;h2&gt;
  
  
  RoPE: Rotary Positional Embedding
&lt;/h2&gt;

&lt;p&gt;Rotary Positional Embedding takes a different path.&lt;/p&gt;

&lt;p&gt;It does not add a position vector to the input.&lt;/p&gt;

&lt;p&gt;It rotates Query and Key vectors based on position.&lt;/p&gt;

&lt;p&gt;The core idea:&lt;/p&gt;

&lt;p&gt;position becomes rotation&lt;/p&gt;

&lt;p&gt;A 2D rotation matrix looks like this:&lt;/p&gt;

&lt;p&gt;Rθ = [[cosθ, -sinθ], [sinθ, cosθ]]&lt;/p&gt;

&lt;p&gt;If you rotate [1, 0] by 90 degrees:&lt;/p&gt;

&lt;p&gt;[1, 0] → [0, 1]&lt;/p&gt;

&lt;p&gt;RoPE applies this idea across Query and Key dimensions.&lt;/p&gt;

&lt;p&gt;Different positions get different rotations.&lt;/p&gt;

&lt;p&gt;Then attention scores naturally include relative position.&lt;/p&gt;
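
&lt;p&gt;Here is a small sketch of the rotation and of how it extends to longer vectors by rotating consecutive pairs of dimensions. The frequency schedule with base 10000 follows the common RoPE convention and is an assumption here; real implementations may pair dimensions differently.&lt;/p&gt;

```python
import math

def rotate_2d(vec, theta):
    # Multiply [x, y] by the rotation matrix [[cos, -sin], [sin, cos]].
    x, y = vec
    return [x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta)]

def apply_rope(vec, position, base=10000.0):
    # Rotate each consecutive pair of dimensions by position * frequency.
    # For the pair starting at index i, the frequency is base ** (-i / d).
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        out.extend(rotate_2d(vec[i:i + 2], theta))
    return out

print([round(v, 6) for v in rotate_2d([1.0, 0.0], math.pi / 2)])  # [0.0, 1.0]
print(apply_rope([1.0, 2.0, 3.0, 4.0], position=0))  # position 0 leaves the vector unchanged
```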

&lt;h2&gt;
  
  
  Why RoPE Works Well
&lt;/h2&gt;

&lt;p&gt;RoPE uses absolute position to rotate Q and K.&lt;/p&gt;

&lt;p&gt;But when Q and K are compared, the score depends on their relative position difference.&lt;/p&gt;

&lt;p&gt;The key relationship is:&lt;/p&gt;

&lt;p&gt;(RθⁱQ)ᵀ(RθʲK) = QᵀRθʲ⁻ⁱK&lt;/p&gt;

&lt;p&gt;This means the attention score depends on the positions only through the offset j - i.&lt;/p&gt;

&lt;p&gt;That is the relative distance.&lt;/p&gt;

&lt;p&gt;So RoPE gives you a useful combination:&lt;/p&gt;

&lt;p&gt;absolute-position injection + relative-position behavior&lt;/p&gt;

&lt;p&gt;This is why RoPE became popular in modern LLMs.&lt;/p&gt;
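
&lt;p&gt;The identity can be checked numerically in two dimensions. The vectors and the rotation step of 0.3 per position are arbitrary choices for this sketch.&lt;/p&gt;

```python
import math

def rotate_2d(vec, theta):
    # Multiply [x, y] by the rotation matrix [[cos, -sin], [sin, cos]].
    x, y = vec
    return [x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

theta = 0.3            # arbitrary rotation step per position
q, k = [1.0, 2.0], [0.5, -1.0]

def score(i, j):
    # Rotate q by its position i and k by its position j, then compare.
    return dot(rotate_2d(q, i * theta), rotate_2d(k, j * theta))

# Same offset j - i = 3 at different absolute positions gives the same score.
print(round(score(2, 5), 6), round(score(10, 13), 6))
# It also equals q . R(3 * theta) k, the right-hand side of the identity.
print(round(dot(q, rotate_2d(k, 3 * theta)), 6))
```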

&lt;h2&gt;
  
  
  APE vs RPE vs RoPE
&lt;/h2&gt;

&lt;p&gt;APE:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;adds position vectors to token embeddings&lt;/li&gt;
&lt;li&gt;simple and cheap&lt;/li&gt;
&lt;li&gt;good for fixed or known sequence lengths&lt;/li&gt;
&lt;li&gt;weaker for long-context extrapolation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RPE:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;adds relative distance information to attention scores&lt;/li&gt;
&lt;li&gt;directly models token-to-token distance&lt;/li&gt;
&lt;li&gt;flexible for variable lengths&lt;/li&gt;
&lt;li&gt;can complicate attention implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RoPE:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rotates Query and Key vectors by position&lt;/li&gt;
&lt;li&gt;makes relative distance appear inside attention&lt;/li&gt;
&lt;li&gt;memory-efficient, with no learned position table to store&lt;/li&gt;
&lt;li&gt;works well with modern long-context LLMs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key difference:&lt;/p&gt;

&lt;p&gt;APE = where am I?&lt;/p&gt;

&lt;p&gt;RPE = how far are we?&lt;/p&gt;

&lt;p&gt;RoPE = rotate Q/K so distance appears in attention&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Perspective
&lt;/h2&gt;

&lt;p&gt;If you are reading Transformer code, look at where position enters the model.&lt;/p&gt;

&lt;p&gt;APE usually appears near the embedding layer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = token_embedding + position_embedding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;RPE usually appears inside attention score computation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scores = q @ k.T + relative_position_bias
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;RoPE usually appears after Q and K projection:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;q = apply_rope(q, positions)

k = apply_rope(k, positions)

scores = q @ k.T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is the developer shortcut.&lt;/p&gt;

&lt;p&gt;Find the injection point.&lt;/p&gt;

&lt;p&gt;Then you know which positional method the model uses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Naive vs Practical View
&lt;/h2&gt;

&lt;p&gt;Naive view:&lt;/p&gt;

&lt;p&gt;Positional embedding just tells the model token order.&lt;/p&gt;

&lt;p&gt;Practical view:&lt;/p&gt;

&lt;p&gt;Positional design affects long-context behavior, caching, memory, and attention quality.&lt;/p&gt;

&lt;p&gt;Naive mindset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;add positions
run attention
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Practical mindset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;choose how position enters attention
consider context length
consider extrapolation
consider KV Cache compatibility
consider implementation complexity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This matters because positional encoding is not a small detail.&lt;/p&gt;

&lt;p&gt;It changes how the model behaves when the context becomes long.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Again
&lt;/h2&gt;

&lt;p&gt;Short inputs can hide positional weaknesses.&lt;/p&gt;

&lt;p&gt;Long-context models expose them.&lt;/p&gt;

&lt;p&gt;If positional information does not extrapolate well, the model may become unstable outside its training length.&lt;/p&gt;

&lt;p&gt;This is why modern LLMs care so much about RoPE variants and long-context scaling.&lt;/p&gt;

&lt;p&gt;The position method affects whether a model can reliably handle long prompts, code files, documents, and conversations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Conditions and Limits
&lt;/h2&gt;

&lt;p&gt;APE is easy but tied to absolute indices.&lt;/p&gt;

&lt;p&gt;RPE is expressive but can complicate attention computation.&lt;/p&gt;

&lt;p&gt;RoPE is efficient and practical, but still needs careful scaling for very long contexts.&lt;/p&gt;

&lt;p&gt;Also:&lt;/p&gt;

&lt;p&gt;Positional embeddings do not create reasoning by themselves.&lt;/p&gt;

&lt;p&gt;They only give attention a way to use order.&lt;/p&gt;

&lt;p&gt;The model still needs training to learn useful patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Self-Attention needs positional information because it is order-blind by default.&lt;/p&gt;

&lt;p&gt;APE adds absolute position to embeddings.&lt;/p&gt;

&lt;p&gt;RPE adds relative distance to attention scores.&lt;/p&gt;

&lt;p&gt;RoPE rotates Query and Key vectors so relative position appears naturally.&lt;/p&gt;

&lt;p&gt;The shortest version:&lt;/p&gt;

&lt;p&gt;Positional Embedding = the order signal that makes attention understand sequence structure&lt;/p&gt;

&lt;p&gt;If you understand where position enters the model, you understand the difference between APE, RPE, and RoPE.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;When learning Transformer internals, which positional method feels most intuitive to you?&lt;/p&gt;

&lt;p&gt;APE, RPE, or RoPE?&lt;/p&gt;

&lt;p&gt;Originally published at zeromathai.com.&lt;br&gt;
Original article: &lt;a href="https://zeromathai.com/en/advanced-positional-embeddings-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/advanced-positional-embeddings-en/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub Resources&lt;br&gt;
AI diagrams, study notes, and visual guides:&lt;br&gt;
&lt;a href="https://github.com/zeromathai/zeromathai-ai" rel="noopener noreferrer"&gt;https://github.com/zeromathai/zeromathai-ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>transformers</category>
    </item>
    <item>
      <title>Why KV Cache Matters — How MQA, GQA, and MLA Make LLM Inference Faster</title>
      <dc:creator>zeromathai</dc:creator>
      <pubDate>Thu, 25 Jun 2026 14:15:58 +0000</pubDate>
      <link>https://dev.to/zeromathai/why-kv-cache-matters-how-mqa-gqa-and-mla-make-llm-inference-faster-5gb4</link>
      <guid>https://dev.to/zeromathai/why-kv-cache-matters-how-mqa-gqa-and-mla-make-llm-inference-faster-5gb4</guid>
      <description>&lt;p&gt;LLMs generate text one token at a time.&lt;/p&gt;

&lt;p&gt;That sounds simple.&lt;/p&gt;

&lt;p&gt;But without KV Cache, every new token would repeat a lot of old work.&lt;/p&gt;

&lt;p&gt;That is why inference optimization starts with keys and values.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Idea
&lt;/h2&gt;

&lt;p&gt;KV Cache stores previously computed Key and Value tensors.&lt;/p&gt;

&lt;p&gt;During generation, the model only needs to compute the new token’s Query, Key, and Value.&lt;/p&gt;

&lt;p&gt;Then the new Query attends to cached Keys and Values.&lt;/p&gt;

&lt;p&gt;This matters because autoregressive generation would otherwise reprocess the same context at every step.&lt;/p&gt;

&lt;p&gt;KV Cache removes a huge amount of duplicated computation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Structure
&lt;/h2&gt;

&lt;p&gt;Autoregressive generation:&lt;/p&gt;

&lt;p&gt;Prompt tokens&lt;br&gt;&lt;br&gt;
→ compute K/V&lt;br&gt;&lt;br&gt;
→ store K/V in cache&lt;br&gt;&lt;br&gt;
→ generate next token&lt;br&gt;&lt;br&gt;
→ append new K/V&lt;br&gt;&lt;br&gt;
→ repeat&lt;/p&gt;

&lt;p&gt;More compactly:&lt;/p&gt;

&lt;p&gt;KV Cache = reuse past K/V + compute only new K/V&lt;/p&gt;

&lt;p&gt;But there is a trade-off.&lt;/p&gt;

&lt;p&gt;KV Cache reduces recomputation.&lt;/p&gt;

&lt;p&gt;It does not remove attention cost.&lt;/p&gt;

&lt;p&gt;And as context length grows, the cache itself becomes large.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pseudo-code View
&lt;/h2&gt;

&lt;p&gt;Without KV Cache:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;context = prompt_tokens

while not finished:
    Q, K, V = compute_qkv(context)

    output = attention(Q, K, V)

    next_token = sample(output)

    context.append(next_token)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With KV Cache:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;context = prompt_tokens

# prefill: run the prompt once, cache its K/V, sample the first new token
K_cache, V_cache, output = compute_and_store_kv(context)
new_token = sample(output)

while not finished:
    # only the newest token is projected
    q_new, k_new, v_new = compute_qkv(new_token)

    K_cache.append(k_new)
    V_cache.append(v_new)

    output = attention(q_new, K_cache, V_cache)

    new_token = sample(output)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The optimized version avoids recomputing K and V for old tokens.&lt;/p&gt;

&lt;p&gt;That is the main speedup.&lt;/p&gt;
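
&lt;p&gt;A minimal runnable sketch of the same idea, assuming one attention head in one layer, random weights, and plain Python lists standing in for tensors. The names matvec, attend, step_without_cache, and KVCache are illustrative, not from any library.&lt;/p&gt;

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def step_without_cache(embeddings, Wq, Wk, Wv):
    """Re-project K and V for every token seen so far, then attend."""
    keys = [matvec(Wk, x) for x in embeddings]
    values = [matvec(Wv, x) for x in embeddings]
    return attend(matvec(Wq, embeddings[-1]), keys, values)

class KVCache:
    """Keeps the K and V vectors of tokens that were already processed."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, Wq, Wk, Wv):
        # Project only the new token, append its K/V, attend over the cache.
        self.keys.append(matvec(Wk, x))
        self.values.append(matvec(Wv, x))
        return attend(matvec(Wq, x), self.keys, self.values)

rng = random.Random(0)
dim = 4
Wq, Wk, Wv = ([[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
              for _ in range(3))
embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(6)]

cache = KVCache()
for t in range(len(embeddings)):
    cached_out = cache.step(embeddings[t], Wq, Wk, Wv)
    full_out = step_without_cache(embeddings[: t + 1], Wq, Wk, Wv)
    assert all(math.isclose(a, b) for a, b in zip(cached_out, full_out))
print("cached and uncached outputs match at every step")
```

&lt;p&gt;Both paths give the same output. The cached path projects one token per step, while the uncached path projects all of them again.&lt;/p&gt;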

&lt;h2&gt;
  
  
  Concrete Example
&lt;/h2&gt;

&lt;p&gt;Prompt:&lt;/p&gt;

&lt;p&gt;Dear&lt;/p&gt;

&lt;p&gt;The model generates:&lt;/p&gt;

&lt;p&gt;Sarah&lt;/p&gt;

&lt;p&gt;Next context:&lt;/p&gt;

&lt;p&gt;Dear Sarah&lt;/p&gt;

&lt;p&gt;Without KV Cache:&lt;/p&gt;

&lt;p&gt;The model recomputes K/V for “Dear” again.&lt;/p&gt;

&lt;p&gt;With KV Cache:&lt;/p&gt;

&lt;p&gt;The model reuses the cached K/V for “Dear.”&lt;/p&gt;

&lt;p&gt;It only computes new K/V for “Sarah.”&lt;/p&gt;

&lt;p&gt;Now extend this to a 10,000-token conversation.&lt;/p&gt;

&lt;p&gt;Recomputing old tokens becomes wasteful.&lt;/p&gt;

&lt;p&gt;Caching becomes essential.&lt;/p&gt;

&lt;h2&gt;
  
  
  What KV Cache Reduces
&lt;/h2&gt;

&lt;p&gt;KV Cache reduces repeated computation.&lt;/p&gt;

&lt;p&gt;Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;past Key computation&lt;/li&gt;
&lt;li&gt;past Value computation&lt;/li&gt;
&lt;li&gt;repeated projection work for old tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it does not eliminate everything.&lt;/p&gt;

&lt;p&gt;The new Query still attends to cached Keys and Values.&lt;/p&gt;

&lt;p&gt;So longer context still costs more.&lt;/p&gt;

&lt;p&gt;This matters in production.&lt;/p&gt;

&lt;p&gt;A long chat can become memory-heavy even if generation is optimized.&lt;/p&gt;

&lt;h2&gt;
  
  
  The New Bottleneck
&lt;/h2&gt;

&lt;p&gt;KV Cache speeds up inference.&lt;/p&gt;

&lt;p&gt;But it also creates a memory problem.&lt;/p&gt;

&lt;p&gt;For every layer, every token stores Key and Value tensors.&lt;/p&gt;

&lt;p&gt;Longer context means larger cache.&lt;/p&gt;

&lt;p&gt;More users mean more cache memory.&lt;/p&gt;

&lt;p&gt;More heads mean more K/V tensors.&lt;/p&gt;

&lt;p&gt;So the bottleneck shifts:&lt;/p&gt;

&lt;p&gt;Before KV Cache:&lt;/p&gt;

&lt;p&gt;recompute cost&lt;/p&gt;

&lt;p&gt;After KV Cache:&lt;/p&gt;

&lt;p&gt;memory cost&lt;/p&gt;

&lt;p&gt;This is why MQA, GQA, and MLA exist.&lt;/p&gt;
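
&lt;p&gt;A rough way to size the cache, assuming every layer stores one Key and one Value vector per token per K/V head. The function name and the model shape below are hypothetical, chosen only to make the arithmetic concrete.&lt;/p&gt;

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache K and V for every layer, K/V head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len * batch_size

GIB = 1024 ** 3

# Hypothetical model: 32 layers, 32 K/V heads of size 128, 16-bit values.
print(kv_cache_bytes(32, 32, 128, seq_len=4096) / GIB)                 # 2.0
print(kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16) / GIB)  # 32.0

# Same model with only 8 K/V heads (a GQA-style layout).
print(kv_cache_bytes(32, 8, 128, seq_len=4096) / GIB)                  # 0.5
```

&lt;p&gt;Context length, batch size, and K/V head count all multiply together. That is why cutting the number of K/V heads pays off so directly.&lt;/p&gt;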

&lt;h2&gt;
  
  
  MHA vs MQA vs GQA vs MLA
&lt;/h2&gt;

&lt;p&gt;The main difference is how Key and Value tensors are stored.&lt;/p&gt;

&lt;p&gt;Standard Multi-Head Attention:&lt;/p&gt;

&lt;p&gt;Each head has its own K/V.&lt;/p&gt;

&lt;p&gt;Multi-Query Attention:&lt;/p&gt;

&lt;p&gt;All heads share one K/V.&lt;/p&gt;

&lt;p&gt;Grouped-Query Attention:&lt;/p&gt;

&lt;p&gt;Groups of heads share K/V.&lt;/p&gt;

&lt;p&gt;Multi-Head Latent Attention:&lt;/p&gt;

&lt;p&gt;K/V information is stored in compressed latent form.&lt;/p&gt;

&lt;p&gt;The goal is the same:&lt;/p&gt;

&lt;p&gt;reduce KV Cache size while preserving useful attention behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Head Attention
&lt;/h2&gt;

&lt;p&gt;In standard Multi-Head Attention, each head has separate Query, Key, and Value projections.&lt;/p&gt;

&lt;p&gt;If there are 8 heads:&lt;/p&gt;

&lt;p&gt;8 heads → 8 K/V pairs&lt;/p&gt;

&lt;p&gt;This is expressive.&lt;/p&gt;

&lt;p&gt;Each head can learn its own representation.&lt;/p&gt;

&lt;p&gt;But it is expensive during inference.&lt;/p&gt;

&lt;p&gt;More heads mean larger cache.&lt;/p&gt;

&lt;p&gt;So MHA gives quality and flexibility.&lt;/p&gt;

&lt;p&gt;But it pays with memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Query Attention
&lt;/h2&gt;

&lt;p&gt;Multi-Query Attention keeps different Queries for each head.&lt;/p&gt;

&lt;p&gt;But all heads share the same Key and Value.&lt;/p&gt;

&lt;p&gt;If there are 8 heads:&lt;/p&gt;

&lt;p&gt;8 query heads → 1 shared K/V pair&lt;/p&gt;

&lt;p&gt;This sharply reduces cache size.&lt;/p&gt;

&lt;p&gt;It is memory-efficient.&lt;/p&gt;

&lt;p&gt;But there is a trade-off.&lt;/p&gt;

&lt;p&gt;Because all heads share K/V, head diversity can decrease.&lt;/p&gt;

&lt;p&gt;So MQA is fast and compact.&lt;/p&gt;

&lt;p&gt;But it may lose some expressiveness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grouped-Query Attention
&lt;/h2&gt;

&lt;p&gt;Grouped-Query Attention is the compromise.&lt;/p&gt;

&lt;p&gt;Instead of one shared K/V for all heads, it divides heads into groups.&lt;/p&gt;

&lt;p&gt;Each group shares one K/V pair.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;8 heads&lt;br&gt;&lt;br&gt;
2 groups&lt;br&gt;&lt;br&gt;
→ 2 K/V pairs&lt;/p&gt;

&lt;p&gt;This sits between MHA and MQA.&lt;/p&gt;

&lt;p&gt;MHA stores 8 K/V pairs.&lt;/p&gt;

&lt;p&gt;MQA stores 1 K/V pair.&lt;/p&gt;

&lt;p&gt;GQA stores a configurable middle ground.&lt;/p&gt;

&lt;p&gt;That makes GQA practical for modern LLM inference.&lt;/p&gt;
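
&lt;p&gt;The sharing layout can be sketched as a mapping from query heads to K/V heads. This assumes contiguous groups of equal size, which is one common choice; the function names are illustrative.&lt;/p&gt;

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the shared K/V head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into K/V groups")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def sharing_layout(n_query_heads, n_kv_heads):
    """K/V head index for every query head, in order."""
    return [kv_head_for(h, n_query_heads, n_kv_heads)
            for h in range(n_query_heads)]

print(sharing_layout(8, 8))  # MHA: [0, 1, 2, 3, 4, 5, 6, 7]
print(sharing_layout(8, 2))  # GQA: [0, 0, 0, 0, 1, 1, 1, 1]
print(sharing_layout(8, 1))  # MQA: [0, 0, 0, 0, 0, 0, 0, 0]
```

&lt;p&gt;MHA and MQA are the two ends of the same dial. GQA is every setting in between.&lt;/p&gt;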

&lt;h2&gt;
  
  
  Multi-Head Latent Attention
&lt;/h2&gt;

&lt;p&gt;Multi-Head Latent Attention goes further.&lt;/p&gt;

&lt;p&gt;Instead of storing full K/V tensors directly, it stores compressed latent representations.&lt;/p&gt;

&lt;p&gt;Then it reconstructs or projects the needed information during attention.&lt;/p&gt;

&lt;p&gt;The idea is:&lt;/p&gt;

&lt;p&gt;store less&lt;/p&gt;

&lt;p&gt;recover enough&lt;/p&gt;

&lt;p&gt;This is especially useful for long-context inference.&lt;/p&gt;

&lt;p&gt;Because when context length grows, KV Cache grows with it.&lt;/p&gt;

&lt;p&gt;MLA attacks the memory problem at the representation level.&lt;/p&gt;
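
&lt;p&gt;A heavily simplified sketch of the store-less, recover-enough idea, assuming a plain low-rank bottleneck: one down-projection into a small latent and two up-projections back to K and V. Real MLA designs include details this omits, such as how positional information is handled. The weights are random stand-ins, and the class and dimension names are hypothetical.&lt;/p&gt;

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

class LatentKVCache:
    """Caches one small latent vector per token instead of full K and V."""
    def __init__(self, model_dim, latent_dim, kv_dim, seed=0):
        rng = random.Random(seed)
        # Random stand-ins for weights that a real model would learn.
        self.W_down = rand_matrix(latent_dim, model_dim, rng)  # compress
        self.W_up_k = rand_matrix(kv_dim, latent_dim, rng)     # latent to K
        self.W_up_v = rand_matrix(kv_dim, latent_dim, rng)     # latent to V
        self.latents = []

    def append(self, hidden):
        self.latents.append(matvec(self.W_down, hidden))

    def keys(self):
        return [matvec(self.W_up_k, c) for c in self.latents]

    def values(self):
        return [matvec(self.W_up_v, c) for c in self.latents]

    def floats_stored(self):
        return sum(len(c) for c in self.latents)

model_dim, latent_dim, kv_dim, n_tokens = 64, 8, 64, 10
cache = LatentKVCache(model_dim, latent_dim, kv_dim)
rng = random.Random(1)
for _ in range(n_tokens):
    cache.append([rng.gauss(0, 1) for _ in range(model_dim)])

full_kv_floats = n_tokens * 2 * kv_dim
print(cache.floats_stored(), "floats cached instead of", full_kv_floats)
# 80 floats cached instead of 1280
```

&lt;p&gt;The saving comes from caching the latent only. The price is extra projection work whenever K and V are needed.&lt;/p&gt;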

&lt;h2&gt;
  
  
  Comparison Table
&lt;/h2&gt;

&lt;p&gt;MHA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;separate K/V per head&lt;/li&gt;
&lt;li&gt;high expressiveness&lt;/li&gt;
&lt;li&gt;large KV Cache&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MQA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one shared K/V for all heads&lt;/li&gt;
&lt;li&gt;smallest cache among the shared-K/V designs&lt;/li&gt;
&lt;li&gt;possible quality trade-off&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GQA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shared K/V per head group&lt;/li&gt;
&lt;li&gt;balanced memory and quality&lt;/li&gt;
&lt;li&gt;common practical compromise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MLA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compressed latent K/V&lt;/li&gt;
&lt;li&gt;strong cache reduction&lt;/li&gt;
&lt;li&gt;useful for long-context models&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation Perspective
&lt;/h2&gt;

&lt;p&gt;In real inference systems, KV Cache is not just a model detail.&lt;/p&gt;

&lt;p&gt;It affects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;li&gt;GPU memory&lt;/li&gt;
&lt;li&gt;batch size&lt;/li&gt;
&lt;li&gt;max context length&lt;/li&gt;
&lt;li&gt;serving cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A model with a smaller KV Cache can serve longer contexts or more users on the same hardware.&lt;/p&gt;

&lt;p&gt;That is why shared K/V designs matter.&lt;/p&gt;

&lt;p&gt;They are not just architecture theory.&lt;/p&gt;

&lt;p&gt;They directly affect deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Naive vs Practical View
&lt;/h2&gt;

&lt;p&gt;Naive view:&lt;/p&gt;

&lt;p&gt;LLM inference = run the model repeatedly&lt;/p&gt;

&lt;p&gt;Practical view:&lt;/p&gt;

&lt;p&gt;LLM inference = manage cached states efficiently&lt;/p&gt;

&lt;p&gt;Naive generation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recompute all token states every step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Optimized generation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cache past K/V
compute only new token states
reduce K/V storage with MQA, GQA, or MLA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is one of the biggest differences between understanding Transformers conceptually and running them efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Conditions and Limits
&lt;/h2&gt;

&lt;p&gt;KV Cache does not make attention free.&lt;/p&gt;

&lt;p&gt;The new Query still attends over cached tokens.&lt;/p&gt;

&lt;p&gt;Long context still increases memory and latency.&lt;/p&gt;

&lt;p&gt;MQA reduces memory but may reduce head diversity.&lt;/p&gt;

&lt;p&gt;GQA balances memory and quality.&lt;/p&gt;

&lt;p&gt;MLA reduces cache size through compression, but adds architectural complexity.&lt;/p&gt;

&lt;p&gt;So the real design question is:&lt;/p&gt;

&lt;p&gt;How much memory can we save without hurting generation quality too much?&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Again
&lt;/h2&gt;

&lt;p&gt;Long-context models are useful only if inference is practical.&lt;/p&gt;

&lt;p&gt;A model that supports huge context but cannot fit enough cache in GPU memory is hard to serve.&lt;/p&gt;

&lt;p&gt;KV Cache makes autoregressive generation faster.&lt;/p&gt;

&lt;p&gt;MQA, GQA, and MLA make KV Cache more scalable.&lt;/p&gt;

&lt;p&gt;That is why modern LLM architecture spends so much effort on shared or compressed Key-Value attention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;KV Cache reuses past Keys and Values.&lt;/p&gt;

&lt;p&gt;MQA shares K/V across all heads.&lt;/p&gt;

&lt;p&gt;GQA shares K/V within groups.&lt;/p&gt;

&lt;p&gt;MLA compresses K/V into latent representations.&lt;/p&gt;

&lt;p&gt;The shortest version:&lt;/p&gt;

&lt;p&gt;KV optimization = faster generation + smaller memory footprint&lt;/p&gt;

&lt;p&gt;If attention is the engine, KV Cache is the memory system that keeps generation practical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;When optimizing LLM inference, which bottleneck do you usually notice first?&lt;/p&gt;

&lt;p&gt;Latency, GPU memory, context length, or serving cost?&lt;/p&gt;

&lt;p&gt;Originally published at zeromathai.com.&lt;br&gt;
Original article: &lt;a href="https://zeromathai.com/en/kv-cache-shared-key-value-attention-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/kv-cache-shared-key-value-attention-en/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub Resources&lt;br&gt;
AI diagrams, study notes, and visual guides:&lt;br&gt;
&lt;a href="https://github.com/zeromathai/zeromathai-ai" rel="noopener noreferrer"&gt;https://github.com/zeromathai/zeromathai-ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>transformers</category>
    </item>
    <item>
      <title>Why Attention Becomes the Bottleneck — And How Efficient Attention Fixes It</title>
      <dc:creator>zeromathai</dc:creator>
      <pubDate>Wed, 24 Jun 2026 14:23:42 +0000</pubDate>
      <link>https://dev.to/zeromathai/why-attention-becomes-the-bottleneck-and-how-efficient-attention-fixes-it-2dkg</link>
      <guid>https://dev.to/zeromathai/why-attention-becomes-the-bottleneck-and-how-efficient-attention-fixes-it-2dkg</guid>
      <description>&lt;p&gt;Your model got smarter.&lt;/p&gt;

&lt;p&gt;But suddenly it got slower.&lt;/p&gt;

&lt;p&gt;Why does increasing context length explode compute?&lt;/p&gt;

&lt;p&gt;Because attention is O(n²).&lt;/p&gt;

&lt;p&gt;And that becomes the real bottleneck in modern LLMs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Idea
&lt;/h2&gt;

&lt;p&gt;Attention compares every token with every other token.&lt;/p&gt;

&lt;p&gt;That is powerful.&lt;/p&gt;

&lt;p&gt;But it is expensive.&lt;/p&gt;

&lt;p&gt;Efficient Attention methods try to answer one question:&lt;/p&gt;

&lt;p&gt;How do we keep useful context while reducing cost?&lt;/p&gt;

&lt;p&gt;This matters because long-context LLMs are useless if they are too slow or too expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Structure
&lt;/h2&gt;

&lt;p&gt;Full Attention cost:&lt;/p&gt;

&lt;p&gt;Attention Cost = O(n²)&lt;/p&gt;

&lt;p&gt;Meaning:&lt;/p&gt;

&lt;p&gt;n tokens → n × n comparisons&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;1,000 tokens → 1M comparisons&lt;br&gt;&lt;br&gt;
10,000 tokens → 100M comparisons  &lt;/p&gt;

&lt;p&gt;10× longer input → 100× more work&lt;/p&gt;

&lt;p&gt;That is the bottleneck.&lt;/p&gt;

&lt;p&gt;More compactly:&lt;/p&gt;

&lt;p&gt;Attention = full connectivity + quadratic cost&lt;/p&gt;

&lt;p&gt;Efficient Attention = reduce connections or optimize computation&lt;/p&gt;

&lt;h2&gt;
  
  
  Pseudo-code View
&lt;/h2&gt;

&lt;p&gt;Full attention:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for i in tokens:
    for j in tokens:
        score[i][j] = dot(Q[i], K[j])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Efficient attention idea:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;restrict or optimize comparisons

for i in tokens:
    for j in selected_tokens:
        score[i][j] = dot(Q[i], K[j])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Or:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;compute same attention
but optimize memory access
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Two strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reduce what you compute&lt;/li&gt;
&lt;li&gt;optimize how you compute&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Concrete Example
&lt;/h2&gt;

&lt;p&gt;Imagine reading a 10,000-token document.&lt;/p&gt;

&lt;p&gt;Full Attention:&lt;/p&gt;

&lt;p&gt;Every word looks at every other word.&lt;/p&gt;

&lt;p&gt;That is like comparing every sentence to every sentence.&lt;/p&gt;

&lt;p&gt;Local Attention:&lt;/p&gt;

&lt;p&gt;Each word looks only at nearby words.&lt;/p&gt;

&lt;p&gt;Like reading paragraph by paragraph.&lt;/p&gt;

&lt;p&gt;Sparse Attention:&lt;/p&gt;

&lt;p&gt;Each word looks at selected words.&lt;/p&gt;

&lt;p&gt;Like focusing on keywords and headings.&lt;/p&gt;

&lt;p&gt;FlashAttention:&lt;/p&gt;

&lt;p&gt;Still reads everything.&lt;/p&gt;

&lt;p&gt;But does it efficiently by avoiding unnecessary memory movement.&lt;/p&gt;

&lt;p&gt;Different methods.&lt;/p&gt;

&lt;p&gt;Same goal:&lt;/p&gt;

&lt;p&gt;Reduce cost without losing important context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Full Attention vs Efficient Attention
&lt;/h2&gt;

&lt;p&gt;Full Attention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;connects every token to every token&lt;/li&gt;
&lt;li&gt;captures long-range dependencies&lt;/li&gt;
&lt;li&gt;expensive in compute and memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Efficient Attention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reduces connections or optimizes execution&lt;/li&gt;
&lt;li&gt;scales to longer sequences&lt;/li&gt;
&lt;li&gt;may trade some flexibility for efficiency (local and sparse variants do; FlashAttention stays exact)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key difference:&lt;/p&gt;

&lt;p&gt;Full = maximum connectivity&lt;/p&gt;

&lt;p&gt;Efficient = selective or optimized connectivity&lt;/p&gt;

&lt;h2&gt;
  
  
  Local Attention
&lt;/h2&gt;

&lt;p&gt;Local Attention limits attention to a window.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Each token attends to the last 128 tokens.&lt;/p&gt;

&lt;p&gt;Cost becomes:&lt;/p&gt;

&lt;p&gt;O(n × window)&lt;/p&gt;

&lt;p&gt;Instead of O(n²)&lt;/p&gt;

&lt;p&gt;This works because:&lt;/p&gt;

&lt;p&gt;Nearby context often matters most.&lt;/p&gt;

&lt;p&gt;The limitation:&lt;/p&gt;

&lt;p&gt;Long-range dependencies can be missed.&lt;/p&gt;
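
&lt;p&gt;A small sketch that counts scored pairs under a causal sliding window, assuming the window includes the token itself. The function names are illustrative.&lt;/p&gt;

```python
def local_window(i, window):
    """Positions token i may attend to: itself plus the previous window - 1 tokens."""
    return range(max(0, i - window + 1), i + 1)

def count_local_pairs(n_tokens, window):
    """Total query-key pairs scored with a sliding window."""
    return sum(len(local_window(i, window)) for i in range(n_tokens))

n = 10_000
print(list(local_window(5, 3)))    # [3, 4, 5]
print(n * n)                       # 100000000 pairs for full attention
print(count_local_pairs(n, 128))   # 1271872 pairs with a 128-token window
```

&lt;p&gt;The windowed count stays close to n × window. Doubling the sequence length roughly doubles the work instead of quadrupling it.&lt;/p&gt;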

&lt;h2&gt;
  
  
  Sparse Attention
&lt;/h2&gt;

&lt;p&gt;Sparse Attention generalizes Local Attention.&lt;/p&gt;

&lt;p&gt;Instead of full connections:&lt;/p&gt;

&lt;p&gt;Use structured patterns.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;local windows&lt;/li&gt;
&lt;li&gt;strided attention&lt;/li&gt;
&lt;li&gt;global tokens&lt;/li&gt;
&lt;li&gt;block patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reduces cost while keeping some long-range connections.&lt;/p&gt;

&lt;p&gt;The trade-off:&lt;/p&gt;

&lt;p&gt;Too sparse → lose important relationships&lt;/p&gt;

&lt;p&gt;So many models mix:&lt;/p&gt;

&lt;p&gt;full attention + sparse attention layers&lt;/p&gt;
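
&lt;p&gt;One way to picture a structured pattern is as a set of allowed positions per token. The sketch below combines a local window, a strided pattern, and global tokens under a causal constraint. The exact combination is an assumption for illustration, not a specific published pattern.&lt;/p&gt;

```python
def sparse_pattern(i, window, stride, global_tokens):
    """Sorted positions that token i attends to (never a later position)."""
    local = set(range(max(0, i - window + 1), i + 1))
    # Every stride-th earlier position, aligned so that i itself is included.
    strided = set(range(i % stride, i + 1, stride))
    visible_globals = {g for g in global_tokens if i >= g}
    return sorted(local | strided | visible_globals)

print(sparse_pattern(20, window=4, stride=8, global_tokens=[0]))
# [0, 4, 12, 17, 18, 19, 20]
```

&lt;p&gt;Token 20 scores 7 positions instead of 21. The strided and global entries are what keep some long-range paths open.&lt;/p&gt;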

&lt;h2&gt;
  
  
  FlashAttention
&lt;/h2&gt;

&lt;p&gt;FlashAttention does NOT change attention logic.&lt;/p&gt;

&lt;p&gt;It changes how attention is computed.&lt;/p&gt;

&lt;p&gt;Problem:&lt;/p&gt;

&lt;p&gt;Attention is often memory-bound.&lt;/p&gt;

&lt;p&gt;The GPU spends more of its time moving data than computing.&lt;/p&gt;

&lt;p&gt;FlashAttention solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compute attention in blocks&lt;/li&gt;
&lt;li&gt;keep data in fast SRAM&lt;/li&gt;
&lt;li&gt;avoid storing large intermediate matrices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;p&gt;store full attention matrix → read again&lt;/p&gt;

&lt;p&gt;It does:&lt;/p&gt;

&lt;p&gt;compute on-the-fly → minimize memory movement&lt;/p&gt;

&lt;p&gt;Key idea:&lt;/p&gt;

&lt;p&gt;Optimize IO, not just math&lt;/p&gt;
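
&lt;p&gt;The block-wise trick rests on a running softmax that never needs all scores at once. The sketch below shows only that numerical idea for a single query in plain Python. It is not a GPU kernel, the score scaling is left out for brevity, and the function names are illustrative.&lt;/p&gt;

```python
import math
import random

def attention_reference(q, keys, values):
    """Materialise every score, softmax them, then take the weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def attention_blockwise(q, keys, values, block_size):
    """Same result, but only one block of scores exists at any time."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

rng = random.Random(0)
dim, n = 4, 37
q = [rng.gauss(0, 1) for _ in range(dim)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

ref = attention_reference(q, keys, values)
blk = attention_blockwise(q, keys, values, block_size=8)
assert all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(ref, blk))
print("block-wise result matches the reference")
```

&lt;p&gt;The output is the same attention. Only the order of work and the amount of intermediate storage change.&lt;/p&gt;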

&lt;h2&gt;
  
  
  Naive vs Optimized View
&lt;/h2&gt;

&lt;p&gt;Naive view:&lt;/p&gt;

&lt;p&gt;Attention cost = math operations&lt;/p&gt;

&lt;p&gt;Optimized view:&lt;/p&gt;

&lt;p&gt;Attention cost = math + memory movement&lt;/p&gt;

&lt;p&gt;Naive:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;compute QK^T
store full score matrix
apply softmax
multiply by V
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Optimized (FlashAttention):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;compute in chunks
avoid large memory writes
reuse data efficiently
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is why FlashAttention speeds up real systems.&lt;/p&gt;

&lt;p&gt;Not by changing theory.&lt;/p&gt;

&lt;p&gt;But by fixing hardware inefficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters (Again)
&lt;/h2&gt;

&lt;p&gt;Early:&lt;/p&gt;

&lt;p&gt;Attention made Transformers powerful.&lt;/p&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;p&gt;Attention limits how far they can scale.&lt;/p&gt;

&lt;p&gt;If you cannot optimize attention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;context stays short&lt;/li&gt;
&lt;li&gt;inference becomes slow&lt;/li&gt;
&lt;li&gt;cost explodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Efficient attention enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;longer context windows&lt;/li&gt;
&lt;li&gt;faster inference&lt;/li&gt;
&lt;li&gt;lower GPU cost&lt;/li&gt;
&lt;li&gt;production-scale LLM systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Important Conditions and Limits
&lt;/h2&gt;

&lt;p&gt;Local Attention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fast&lt;/li&gt;
&lt;li&gt;but weak for long-range dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sparse Attention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;flexible&lt;/li&gt;
&lt;li&gt;but pattern design matters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FlashAttention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exact attention&lt;/li&gt;
&lt;li&gt;but requires hardware-aware implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also:&lt;/p&gt;

&lt;p&gt;Even with optimized attention, cost still grows with sequence length.&lt;/p&gt;

&lt;p&gt;There is no free lunch.&lt;/p&gt;

&lt;p&gt;Only better trade-offs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Attention is the core of Transformers.&lt;/p&gt;

&lt;p&gt;But it is also the bottleneck.&lt;/p&gt;

&lt;p&gt;Full Attention = powerful but expensive&lt;/p&gt;

&lt;p&gt;Efficient Attention = scalable but selective or optimized&lt;/p&gt;

&lt;p&gt;The shortest version:&lt;/p&gt;

&lt;p&gt;Efficient Attention = reduce connections OR optimize memory access&lt;/p&gt;

&lt;p&gt;If you understand that, you understand why modern LLM engineering focuses so much on attention optimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;When working with long-context models, which matters more to you?&lt;/p&gt;

&lt;p&gt;Accuracy from full attention or efficiency from optimized attention?&lt;/p&gt;

&lt;p&gt;Originally published at zeromathai.com&lt;br&gt;
Original article: &lt;a href="https://zeromathai.com/en/efficient-attention-flashattention-sparse-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/efficient-attention-flashattention-sparse-en/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub Resources&lt;br&gt;
AI diagrams, study notes, and visual guides:&lt;br&gt;
&lt;a href="https://github.com/zeromathai/zeromathai-ai" rel="noopener noreferrer"&gt;https://github.com/zeromathai/zeromathai-ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>transformers</category>
    </item>
    <item>
      <title>How Transformer Decoders Generate Text — From Causal Masking to Decoding</title>
      <dc:creator>zeromathai</dc:creator>
      <pubDate>Tue, 23 Jun 2026 14:43:46 +0000</pubDate>
      <link>https://dev.to/zeromathai/how-transformer-decoders-generate-text-from-causal-masking-to-decoding-1fh8</link>
      <guid>https://dev.to/zeromathai/how-transformer-decoders-generate-text-from-causal-masking-to-decoding-1fh8</guid>
      <description>&lt;p&gt;A Transformer Decoder does not generate a sentence all at once.&lt;/p&gt;

&lt;p&gt;It predicts one token.&lt;/p&gt;

&lt;p&gt;Then it feeds that token back and predicts the next one.&lt;/p&gt;

&lt;p&gt;That simple loop is the core of modern LLM generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Idea
&lt;/h2&gt;

&lt;p&gt;A Transformer Decoder is built for autoregressive generation.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;p&gt;previous tokens → next token prediction → repeat&lt;/p&gt;

&lt;p&gt;The Decoder creates hidden representations.&lt;/p&gt;

&lt;p&gt;The LM Head converts those representations into vocabulary scores.&lt;/p&gt;

&lt;p&gt;A decoding strategy chooses the actual next token.&lt;/p&gt;

&lt;p&gt;This matters because generation quality is not only about the model.&lt;/p&gt;

&lt;p&gt;It also depends on how tokens are selected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Structure
&lt;/h2&gt;

&lt;p&gt;A simplified generation pipeline looks like this:&lt;/p&gt;

&lt;p&gt;Input Context&lt;br&gt;&lt;br&gt;
→ Decoder Layers&lt;br&gt;&lt;br&gt;
→ Hidden State&lt;br&gt;&lt;br&gt;
→ LM Head&lt;br&gt;&lt;br&gt;
→ Logits&lt;br&gt;&lt;br&gt;
→ Softmax&lt;br&gt;&lt;br&gt;
→ Decoding Strategy&lt;br&gt;&lt;br&gt;
→ Next Token&lt;/p&gt;

&lt;p&gt;More compactly:&lt;/p&gt;

&lt;p&gt;Text Generation = decoder representation + vocabulary scoring + token selection&lt;/p&gt;

&lt;p&gt;The Decoder answers:&lt;/p&gt;

&lt;p&gt;What should the next representation be?&lt;/p&gt;

&lt;p&gt;The LM Head answers:&lt;/p&gt;

&lt;p&gt;Which vocabulary tokens are likely?&lt;/p&gt;

&lt;p&gt;The decoding strategy answers:&lt;/p&gt;

&lt;p&gt;Which token should we actually output?&lt;/p&gt;

&lt;h2&gt;
  
  
  Pseudo-code View
&lt;/h2&gt;

&lt;p&gt;Autoregressive decoding looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;context = prompt_tokens

while not stop:
    hidden = decoder(context)

    logits = lm_head(hidden[-1])

    probs = softmax(logits / temperature)

    next_token = decode(probs)

    context.append(next_token)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The key loop is:&lt;/p&gt;

&lt;p&gt;predict → append → repeat&lt;/p&gt;

&lt;p&gt;This is why LLM inference is sequential.&lt;/p&gt;

&lt;p&gt;Even if training can be parallelized, generation still produces tokens one step at a time.&lt;/p&gt;
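
&lt;p&gt;A runnable toy version of the loop, assuming a made-up model whose logits depend only on the last token. The vocabulary, the logit table, and the [END] marker are all hypothetical. Token selection here is a plain argmax.&lt;/p&gt;

```python
# Hypothetical toy "model": next-token logits depend only on the last token.
VOCAB = ["Dear", "Sarah", ",", "hello", "[END]"]
TOY_LOGITS = {
    "Dear":  [0.0, 3.0, 0.5, 0.2, 0.1],
    "Sarah": [0.0, 0.0, 3.0, 0.5, 0.3],
    ",":     [0.0, 0.1, 0.0, 3.0, 0.5],
    "hello": [0.0, 0.1, 0.2, 0.0, 3.0],
}

def toy_lm(context):
    """Stand-in for decoder + LM head: returns one logit per vocabulary entry."""
    return TOY_LOGITS[context[-1]]

def generate(prompt, max_new_tokens=10):
    context = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_lm(context)                                   # predict
        best = max(range(len(logits)), key=logits.__getitem__)
        next_token = VOCAB[best]
        if next_token == "[END]":                                  # stop condition
            break
        context.append(next_token)                                 # append, then repeat
    return context

print(generate(["Dear"]))  # ['Dear', 'Sarah', ',', 'hello']
```

&lt;p&gt;Each pass through the loop needs the token chosen in the previous pass. That dependency is what makes generation sequential.&lt;/p&gt;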

&lt;h2&gt;
  
  
  Transformer Decoder Structure
&lt;/h2&gt;

&lt;p&gt;A Transformer Decoder layer usually contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Masked Self-Attention&lt;/li&gt;
&lt;li&gt;Cross-Attention&lt;/li&gt;
&lt;li&gt;Feed-Forward Network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Masked Self-Attention lets the Decoder look only at previous tokens.&lt;/p&gt;

&lt;p&gt;Cross-Attention lets it look at Encoder outputs when an input sequence exists.&lt;/p&gt;

&lt;p&gt;The Feed-Forward Network transforms each token representation.&lt;/p&gt;

&lt;p&gt;For decoder-only LLMs, Cross-Attention is usually removed.&lt;/p&gt;

&lt;p&gt;The model only continues from the current context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Causal Masking
&lt;/h2&gt;

&lt;p&gt;The Decoder must not cheat.&lt;/p&gt;

&lt;p&gt;When predicting token 5, it cannot look at token 6.&lt;/p&gt;

&lt;p&gt;That is the role of the causal mask.&lt;/p&gt;

&lt;p&gt;The generation probability can be written as:&lt;/p&gt;

&lt;p&gt;P(y₁, y₂, ..., y_T | x) = Π over t = 1..T of P(yₜ | y₁, ..., yₜ₋₁, x)&lt;/p&gt;

&lt;p&gt;Each token depends only on previous output tokens and the input.&lt;/p&gt;

&lt;p&gt;This is important.&lt;/p&gt;

&lt;p&gt;Without causal masking, the model could see future answers during training.&lt;/p&gt;

&lt;p&gt;Then it would fail during real generation.&lt;/p&gt;
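
&lt;p&gt;A minimal sketch of a causal mask and a masked softmax over one row of scores, using plain lists. The function names are illustrative.&lt;/p&gt;

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i is allowed to see position j."""
    return [[i >= j for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores; masked positions get zero weight."""
    kept = [s for s, keep in zip(scores, mask_row) if keep]
    top = max(kept)
    exps = [math.exp(s - top) if keep else 0.0
            for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
for row in mask:
    print(["x" if keep else "." for keep in row])
# ['x', '.', '.', '.']
# ['x', 'x', '.', '.']
# ['x', 'x', 'x', '.']
# ['x', 'x', 'x', 'x']

# Position 1 ignores the large scores at future positions 2 and 3.
print(masked_softmax([1.0, 1.0, 5.0, 5.0], mask[1]))  # [0.5, 0.5, 0.0, 0.0]
```

&lt;p&gt;The future scores can be as large as they like. After masking, they contribute nothing.&lt;/p&gt;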

&lt;h2&gt;
  
  
  Concrete Example
&lt;/h2&gt;

&lt;p&gt;Target sentence:&lt;/p&gt;

&lt;p&gt;I love you&lt;/p&gt;

&lt;p&gt;During training, the Decoder input is the target shifted right behind a start-of-sequence token, written here as [START]:&lt;/p&gt;

&lt;p&gt;Input:&lt;/p&gt;

&lt;p&gt;[START] I love&lt;/p&gt;

&lt;p&gt;Target:&lt;/p&gt;

&lt;p&gt;I love you&lt;/p&gt;

&lt;p&gt;So the model learns:&lt;/p&gt;

&lt;p&gt;[START] → I&lt;/p&gt;

&lt;p&gt;[START] I → love&lt;/p&gt;

&lt;p&gt;[START] I love → you&lt;/p&gt;

&lt;p&gt;At inference time, there is no target sentence.&lt;/p&gt;

&lt;p&gt;The model must use its own previous output.&lt;/p&gt;

&lt;p&gt;That is why errors can accumulate during generation.&lt;/p&gt;
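
&lt;p&gt;The shift-right step can be sketched in a few lines. The start-of-sequence marker is written as the hypothetical string [START], and the function names are illustrative.&lt;/p&gt;

```python
START = "[START]"  # hypothetical start-of-sequence marker

def shift_right(target_tokens):
    """Decoder input: the target moved one step right behind a start marker."""
    return [START] + target_tokens[:-1]

def training_pairs(target_tokens):
    """(visible context, token to predict) for every position in the target."""
    decoder_input = shift_right(target_tokens)
    return [(decoder_input[: t + 1], target_tokens[t])
            for t in range(len(target_tokens))]

for context, label in training_pairs(["I", "love", "you"]):
    print(" ".join(context), "→", label)
# [START] → I
# [START] I → love
# [START] I love → you
```

&lt;p&gt;During training all three pairs come from the known target. During inference the context on the left is whatever the model produced so far.&lt;/p&gt;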

&lt;h2&gt;
  
  
  Teacher Forcing
&lt;/h2&gt;

&lt;p&gt;Teacher forcing is used during training.&lt;/p&gt;

&lt;p&gt;Instead of feeding the model’s wrong prediction back into the next step, we feed the correct previous token.&lt;/p&gt;

&lt;p&gt;This makes training more stable.&lt;/p&gt;

&lt;p&gt;Training:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input = correct previous tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Inference:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input = model-generated previous tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This difference matters.&lt;/p&gt;

&lt;p&gt;A model can behave well during training but drift during generation.&lt;/p&gt;

&lt;p&gt;That is why decoding strategy and evaluation matter in real systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  LM Head and Logits
&lt;/h2&gt;

&lt;p&gt;The Decoder outputs hidden vectors.&lt;/p&gt;

&lt;p&gt;But hidden vectors are not tokens.&lt;/p&gt;

&lt;p&gt;The LM Head maps a hidden vector to vocabulary-sized scores.&lt;/p&gt;

&lt;p&gt;These scores are called logits.&lt;/p&gt;

&lt;p&gt;If the vocabulary size is 50,000, the LM Head outputs 50,000 scores.&lt;/p&gt;

&lt;p&gt;Each score corresponds to one possible next token.&lt;/p&gt;

&lt;p&gt;Logits are not probabilities yet.&lt;/p&gt;

&lt;p&gt;Softmax converts them into probabilities.&lt;/p&gt;

&lt;p&gt;The pipeline is:&lt;/p&gt;

&lt;p&gt;hidden state → logits → probabilities → selected token&lt;/p&gt;
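
&lt;p&gt;A tiny sketch of that pipeline, assuming the LM Head is a single matrix with one row per vocabulary entry and no bias. The three-word vocabulary and the weights are made up.&lt;/p&gt;

```python
import math

def lm_head(hidden, W_vocab):
    """One logit per vocabulary entry: dot product of hidden with each row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in W_vocab]

def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["cat", "dog", "fish"]                    # hypothetical vocabulary
W_vocab = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # hypothetical weights
hidden = [2.0, 1.0]

logits = lm_head(hidden, W_vocab)
probs = softmax(logits)
print(logits)                                      # [2.0, 1.0, 3.0]
print(vocab[probs.index(max(probs))])              # fish
```

&lt;p&gt;A real LM Head does the same thing with tens of thousands of rows instead of three.&lt;/p&gt;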

&lt;h2&gt;
  
  
  Temperature Scaling
&lt;/h2&gt;

&lt;p&gt;Temperature controls how sharp or flat the probability distribution becomes.&lt;/p&gt;

&lt;p&gt;The formula is:&lt;/p&gt;

&lt;p&gt;pᵢ(τ) = exp(zᵢ / τ) / Σ exp(zⱼ / τ)&lt;/p&gt;

&lt;p&gt;Lower temperature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sharper distribution&lt;/li&gt;
&lt;li&gt;more deterministic output&lt;/li&gt;
&lt;li&gt;less randomness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Higher temperature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;flatter distribution&lt;/li&gt;
&lt;li&gt;more diverse output&lt;/li&gt;
&lt;li&gt;more randomness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;With logits [2, 1, 0]:&lt;/p&gt;

&lt;p&gt;temperature = 0.5 makes the top token much stronger.&lt;/p&gt;

&lt;p&gt;temperature = 2 makes lower-ranked tokens more likely.&lt;/p&gt;

&lt;p&gt;This matters in practice.&lt;/p&gt;

&lt;p&gt;Temperature is one of the simplest ways to control creativity.&lt;/p&gt;
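
&lt;p&gt;The example with logits [2, 1, 0] can be checked directly. The function name is illustrative.&lt;/p&gt;

```python
import math

def softmax_with_temperature(logits, temperature):
    if not temperature > 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature([2, 1, 0], t)])
# 0.5 [0.867, 0.117, 0.016]
# 1.0 [0.665, 0.245, 0.09]
# 2.0 [0.506, 0.307, 0.186]
```

&lt;p&gt;At 0.5 the top token takes about 87% of the mass. At 2 it drops to about 51%, and the last token rises from under 2% to almost 19%.&lt;/p&gt;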

&lt;h2&gt;
  
  
  What Decoding Means
&lt;/h2&gt;

&lt;p&gt;Decoding means selecting the next token from probabilities.&lt;/p&gt;

&lt;p&gt;The model gives a distribution.&lt;/p&gt;

&lt;p&gt;The decoding algorithm makes a choice.&lt;/p&gt;

&lt;p&gt;That choice affects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;correctness&lt;/li&gt;
&lt;li&gt;creativity&lt;/li&gt;
&lt;li&gt;repetition&lt;/li&gt;
&lt;li&gt;diversity&lt;/li&gt;
&lt;li&gt;determinism&lt;/li&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So decoding is not a small detail.&lt;/p&gt;

&lt;p&gt;It is part of the generation behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Greedy Decoding
&lt;/h2&gt;

&lt;p&gt;Greedy decoding always chooses the most likely token.&lt;/p&gt;

&lt;p&gt;If probabilities are:&lt;/p&gt;

&lt;p&gt;A = 0.70&lt;br&gt;&lt;br&gt;
B = 0.20&lt;br&gt;&lt;br&gt;
C = 0.10  &lt;/p&gt;

&lt;p&gt;Greedy always picks A.&lt;/p&gt;

&lt;p&gt;It is simple and fast.&lt;/p&gt;

&lt;p&gt;But it can be repetitive.&lt;/p&gt;

&lt;p&gt;It can also choose a locally good token that leads to a worse full sentence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beam Search
&lt;/h2&gt;

&lt;p&gt;Beam search keeps multiple candidate sequences.&lt;/p&gt;

&lt;p&gt;Instead of only keeping the best next token, it keeps the best k paths.&lt;/p&gt;

&lt;p&gt;If beam size = 3, the model tracks three candidate continuations.&lt;/p&gt;

&lt;p&gt;This can improve structured generation.&lt;/p&gt;

&lt;p&gt;But it can also reduce diversity.&lt;/p&gt;

&lt;p&gt;When k = 1, beam search becomes greedy decoding.&lt;/p&gt;
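
&lt;p&gt;A compact sketch over a made-up next-token table, showing how a beam of 2 can find a more probable sentence than greedy decoding. The table, the [START] and [END] markers, and the function name are hypothetical.&lt;/p&gt;

```python
import math

# Hypothetical toy model: next-token probabilities given the last token.
NEXT = {
    "[START]": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"[END]": 1.0},
    "dog": {"[END]": 1.0},
}

def beam_search(beam_size, max_steps=5):
    beams = [(0.0, ["[START]"])]  # (log-probability, tokens so far)
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "[END]":
                candidates.append((score, tokens))  # finished paths carry over
                continue
            for token, p in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [token]))
        # Keep only the best beam_size paths.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams[0][1]

print(beam_search(1))  # ['[START]', 'the', 'cat', '[END]']  probability 0.33
print(beam_search(2))  # ['[START]', 'a', 'cat', '[END]']    probability 0.36
```

&lt;p&gt;With a beam of 1 the search commits to the locally best first word. With a beam of 2 the second-best first word survives long enough to win.&lt;/p&gt;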

&lt;h2&gt;
  
  
  Top-k Sampling
&lt;/h2&gt;

&lt;p&gt;Top-k sampling keeps only the k most likely tokens.&lt;/p&gt;

&lt;p&gt;Then it samples from that smaller set.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;k = 3&lt;/p&gt;

&lt;p&gt;Only the top 3 tokens can be selected.&lt;/p&gt;

&lt;p&gt;This prevents the model from choosing extremely unlikely tokens.&lt;/p&gt;

&lt;p&gt;But it still allows some randomness.&lt;/p&gt;

&lt;p&gt;Top-k is useful when you want controlled diversity.&lt;/p&gt;
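
&lt;p&gt;A minimal sketch of the filter-then-sample step, assuming the probabilities arrive as a token-to-probability dictionary. The function names and the example distribution are illustrative.&lt;/p&gt;

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

probs = {"A": 0.50, "B": 0.30, "C": 0.15, "D": 0.05}
filtered = top_k_filter(probs, k=3)
print(list(filtered))  # ['A', 'B', 'C']

rng = random.Random(0)
draws = [sample(filtered, rng) for _ in range(1000)]
print("D" in draws)    # False
```

&lt;p&gt;Token D can never be drawn, however many samples are taken. The three survivors keep their relative order of likelihood.&lt;/p&gt;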

&lt;h2&gt;
  
  
  Top-p Sampling
&lt;/h2&gt;

&lt;p&gt;Top-p sampling is also called nucleus sampling.&lt;/p&gt;

&lt;p&gt;Instead of keeping a fixed number of tokens, it keeps the smallest set whose cumulative probability exceeds p.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Token probabilities:&lt;/p&gt;

&lt;p&gt;honeycomb = 0.45&lt;br&gt;&lt;br&gt;
gingerbread = 0.20&lt;br&gt;&lt;br&gt;
donut = 0.12&lt;br&gt;&lt;br&gt;
cupcake = 0.04  &lt;/p&gt;

&lt;p&gt;If p = 0.6:&lt;/p&gt;

&lt;p&gt;honeycomb + gingerbread = 0.65&lt;/p&gt;

&lt;p&gt;So only those two tokens enter the sampling set.&lt;/p&gt;

&lt;p&gt;Top-p adapts to the confidence of the model.&lt;/p&gt;

&lt;p&gt;That makes it more flexible than fixed Top-k.&lt;/p&gt;
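
&lt;p&gt;The same example as a small filter function. It stops as soon as the cumulative probability reaches p; treating reaches-or-exceeds as the cutoff is an assumption here. The function name is illustrative.&lt;/p&gt;

```python
def top_p_filter(probs, p):
    """Smallest set of top tokens whose cumulative probability reaches p."""
    kept = {}
    cumulative = 0.0
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}

probs = {"honeycomb": 0.45, "gingerbread": 0.20, "donut": 0.12, "cupcake": 0.04}
nucleus = top_p_filter(probs, p=0.6)
print(list(nucleus))  # ['honeycomb', 'gingerbread']
print({token: round(prob, 3) for token, prob in nucleus.items()})
# {'honeycomb': 0.692, 'gingerbread': 0.308}
```

&lt;p&gt;If the model were more confident, say honeycomb at 0.70, the nucleus would shrink to a single token for the same p.&lt;/p&gt;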

&lt;h2&gt;
  
  
  Deterministic vs Stochastic Decoding
&lt;/h2&gt;

&lt;p&gt;Deterministic decoding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;greedy decoding&lt;/li&gt;
&lt;li&gt;beam search&lt;/li&gt;
&lt;li&gt;the same input usually gives the same output&lt;/li&gt;
&lt;li&gt;useful for predictable tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stochastic decoding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Top-k sampling&lt;/li&gt;
&lt;li&gt;Top-p sampling&lt;/li&gt;
&lt;li&gt;can generate different outputs&lt;/li&gt;
&lt;li&gt;useful for creative tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference is simple:&lt;/p&gt;

&lt;p&gt;Deterministic = choose the best-looking path&lt;/p&gt;

&lt;p&gt;Stochastic = sample from likely paths&lt;/p&gt;

&lt;p&gt;For coding tasks, deterministic settings are often useful.&lt;/p&gt;

&lt;p&gt;For brainstorming, stochastic settings are often better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Encoder-Decoder vs Decoder-Only Models
&lt;/h2&gt;

&lt;p&gt;Encoder-Decoder models use both input understanding and output generation.&lt;/p&gt;

&lt;p&gt;They are useful for tasks like translation.&lt;/p&gt;

&lt;p&gt;The Encoder reads the source sequence.&lt;/p&gt;

&lt;p&gt;The Decoder generates the target sequence.&lt;/p&gt;

&lt;p&gt;Decoder-only models use only the generation stack.&lt;/p&gt;

&lt;p&gt;They predict the next token from the previous context.&lt;/p&gt;

&lt;p&gt;Most GPT-style LLMs are decoder-only.&lt;/p&gt;

&lt;p&gt;The architecture is simpler for open-ended text generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Perspective
&lt;/h2&gt;

&lt;p&gt;In real inference code, generation is not just:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model(prompt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It is closer to:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tokenize prompt

run decoder

get logits from LM Head

apply temperature

filter with top-k or top-p

sample or choose token

append token

repeat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This matters because small decoding changes can produce very different outputs.&lt;/p&gt;

&lt;p&gt;A model can feel precise, boring, creative, unstable, or repetitive depending on decoding settings.&lt;/p&gt;

&lt;p&gt;The model gives probabilities.&lt;/p&gt;

&lt;p&gt;Your decoding pipeline turns those probabilities into behavior.&lt;/p&gt;
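
&lt;p&gt;The loop above can be sketched in plain Python. Everything model-related here is a stand-in: &lt;code&gt;toy_logits&lt;/code&gt;, the five-word vocabulary, and the whitespace tokenizer are hypothetical and exist only so the control flow can run.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
import random

VOCAB = ["the", "cat", "sat", "down", "EOS"]  # hypothetical toy vocabulary

def toy_logits(context):
    # Stand-in for "run decoder + LM Head": one made-up score per vocabulary entry.
    return [math.sin(len(context) + i) for i in range(len(VOCAB))]

def next_token(logits, temperature, top_k, rng):
    scaled = [x / temperature for x in logits]             # apply temperature
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    keep = ranked[:top_k]                                  # filter with top-k
    peak = max(scaled[i] for i in keep)
    weights = [math.exp(scaled[i] - peak) for i in keep]   # unnormalized softmax
    return rng.choices(keep, weights=weights, k=1)[0]      # sample a token id

def generate(prompt, max_new_tokens=8, temperature=0.8, top_k=3, seed=0):
    rng = random.Random(seed)
    context = prompt.split()                               # toy tokenizer
    for _ in range(max_new_tokens):
        token = VOCAB[next_token(toy_logits(context), temperature, top_k, rng)]
        if token == "EOS":                                 # stop generation
            break
        context.append(token)                              # append, then repeat
    return " ".join(context)

print(generate("the cat"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Setting &lt;code&gt;top_k=1&lt;/code&gt; makes this loop greedy, so the seed stops mattering.&lt;/p&gt;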

&lt;h2&gt;
  
  
  Naive vs Practical View
&lt;/h2&gt;

&lt;p&gt;Naive view:&lt;/p&gt;

&lt;p&gt;LLM = text in, text out&lt;/p&gt;

&lt;p&gt;Practical view:&lt;/p&gt;

&lt;p&gt;LLM = token loop + logits + decoding policy&lt;/p&gt;

&lt;p&gt;Naive mindset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ask model
receive answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Practical mindset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;manage context
control temperature
choose decoding strategy
stop generation correctly
handle repetition
optimize inference cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is why developers need to understand the Decoder.&lt;/p&gt;

&lt;p&gt;Generation is a system, not a single function call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Conditions and Limits
&lt;/h2&gt;

&lt;p&gt;Decoder generation is sequential.&lt;/p&gt;

&lt;p&gt;Each new token depends on previous tokens.&lt;/p&gt;

&lt;p&gt;That can make inference slow.&lt;/p&gt;

&lt;p&gt;Causal masking is required to prevent future-token leakage.&lt;/p&gt;

&lt;p&gt;Teacher forcing helps training, but inference uses the model’s own predictions.&lt;/p&gt;

&lt;p&gt;Decoding strategy changes output behavior.&lt;/p&gt;

&lt;p&gt;Temperature, Top-k, and Top-p are not cosmetic options.&lt;/p&gt;

&lt;p&gt;They directly shape the generated text.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;The Transformer Decoder generates text by predicting one token at a time.&lt;/p&gt;

&lt;p&gt;Masked Self-Attention prevents future-token access.&lt;/p&gt;

&lt;p&gt;The LM Head converts hidden states into vocabulary logits.&lt;/p&gt;

&lt;p&gt;Softmax turns logits into probabilities.&lt;/p&gt;

&lt;p&gt;Decoding chooses the actual next token.&lt;/p&gt;

&lt;p&gt;The shortest version is:&lt;/p&gt;

&lt;p&gt;Decoder generation = causal attention + LM Head + decoding loop&lt;/p&gt;

&lt;p&gt;If you understand that loop, you understand how LLMs actually produce text.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;When tuning LLM output, which setting do you usually adjust first?&lt;/p&gt;

&lt;p&gt;Temperature, Top-k, Top-p, or the prompt itself?&lt;/p&gt;

&lt;p&gt;Originally published at zeromathai.com.&lt;br&gt;
Original article: &lt;a href="https://zeromathai.com/en/transformer-decoder-lm-head-decoding-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/transformer-decoder-lm-head-decoding-en/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub Resources&lt;br&gt;
AI diagrams, study notes, and visual guides:&lt;br&gt;
&lt;a href="https://github.com/zeromathai/zeromathai-ai" rel="noopener noreferrer"&gt;https://github.com/zeromathai/zeromathai-ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Why Multi-Head Attention Needs Position, Residuals, and Normalization</title>
      <dc:creator>zeromathai</dc:creator>
      <pubDate>Mon, 22 Jun 2026 14:29:20 +0000</pubDate>
      <link>https://dev.to/zeromathai/why-multi-head-attention-needs-position-residuals-and-normalization-15nj</link>
      <guid>https://dev.to/zeromathai/why-multi-head-attention-needs-position-residuals-and-normalization-15nj</guid>
      <description>&lt;p&gt;Self-Attention is powerful.&lt;/p&gt;

&lt;p&gt;But by itself, it has three problems.&lt;/p&gt;

&lt;p&gt;It needs multiple views, it needs word order, and it needs stable training.&lt;/p&gt;

&lt;p&gt;That is why Multi-Head Attention, Positional Encoding, and Add &amp;amp; Norm exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Idea
&lt;/h2&gt;

&lt;p&gt;A Transformer block is not just attention.&lt;/p&gt;

&lt;p&gt;Attention computes token relationships.&lt;/p&gt;

&lt;p&gt;Multi-Head Attention makes those relationships richer.&lt;/p&gt;

&lt;p&gt;Positional Encoding tells the model where tokens are.&lt;/p&gt;

&lt;p&gt;Add &amp;amp; Norm keeps deep Transformer blocks trainable.&lt;/p&gt;

&lt;p&gt;This matters because modern LLMs are deep.&lt;/p&gt;

&lt;p&gt;Without these support structures, attention alone is not enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Structure
&lt;/h2&gt;

&lt;p&gt;A simplified Transformer encoder block looks like this:&lt;/p&gt;

&lt;p&gt;Input&lt;br&gt;&lt;br&gt;
→ Positional Information&lt;br&gt;&lt;br&gt;
→ Multi-Head Attention&lt;br&gt;&lt;br&gt;
→ Add &amp;amp; Norm&lt;br&gt;&lt;br&gt;
→ Feed-Forward Network&lt;br&gt;&lt;br&gt;
→ Add &amp;amp; Norm&lt;br&gt;&lt;br&gt;
→ Output&lt;/p&gt;

&lt;p&gt;More compactly:&lt;/p&gt;

&lt;p&gt;Transformer Block = attention + position + residual flow + normalization&lt;/p&gt;

&lt;p&gt;Each part solves a specific problem.&lt;/p&gt;

&lt;p&gt;Multi-Head Attention solves the “single view” problem.&lt;/p&gt;

&lt;p&gt;Positional Encoding solves the “no order” problem.&lt;/p&gt;

&lt;p&gt;Add &amp;amp; Norm solves the “deep training stability” problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation View
&lt;/h2&gt;

&lt;p&gt;At a high level, the block works like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tokens = tokenize(text)

x = embedding(tokens)

x = x + positional_encoding

attention_output = multi_head_attention(x)

x = layer_norm(x + attention_output)

ffn_output = feed_forward(x)

output = layer_norm(x + ffn_output)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In modern Pre-LN style, the order often changes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;attention_output = multi_head_attention(layer_norm(x))

x = x + attention_output

ffn_output = feed_forward(layer_norm(x))

output = x + ffn_output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The idea is the same.&lt;/p&gt;

&lt;p&gt;Keep the original signal flowing.&lt;/p&gt;

&lt;p&gt;Normalize activations.&lt;/p&gt;

&lt;p&gt;Let attention and FFN update the representation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Head Attention
&lt;/h2&gt;

&lt;p&gt;Single attention gives one relationship map.&lt;/p&gt;

&lt;p&gt;But language has many relationship types.&lt;/p&gt;

&lt;p&gt;A token may need to track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;nearby words&lt;/li&gt;
&lt;li&gt;subject-verb structure&lt;/li&gt;
&lt;li&gt;semantic similarity&lt;/li&gt;
&lt;li&gt;long-distance references&lt;/li&gt;
&lt;li&gt;coreference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One attention head cannot easily capture all of these at once.&lt;/p&gt;

&lt;p&gt;Multi-Head Attention fixes this by running several attention heads in parallel.&lt;/p&gt;

&lt;p&gt;Each head gets its own learned Q, K, and V projections.&lt;/p&gt;

&lt;p&gt;So each head can learn a different representation subspace.&lt;/p&gt;

&lt;p&gt;The formula is:&lt;/p&gt;

&lt;p&gt;MultiHead(Q, K, V) = Concat(head₁, ..., headₕ)Wᴼ&lt;/p&gt;

&lt;p&gt;In plain English:&lt;/p&gt;

&lt;p&gt;Run attention multiple ways.&lt;/p&gt;

&lt;p&gt;Concatenate the results.&lt;/p&gt;

&lt;p&gt;Project them back into one vector space.&lt;/p&gt;
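
&lt;p&gt;A rough, dependency-free sketch of that formula. The weights are random and the sizes (3 tokens, d_model = 8, 2 heads) are arbitrary choices for illustration. Real implementations use batched tensor operations instead of Python lists.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    d_k = len(q[0])
    scores = matmul(q, [list(col) for col in zip(*k)])     # Q @ K.T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head(x, heads, w_o):
    # Each head has its own (W_Q, W_K, W_V) projections.
    outputs = [attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
               for wq, wk, wv in heads]
    # Concat(head_1, ..., head_h): join the per-head vectors token by token.
    concat = [sum((head[t] for head in outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)                              # project back with W_O

rng = random.Random(0)
n_tokens, d_model, n_heads = 3, 8, 2
d_head = d_model // n_heads
x = random_matrix(n_tokens, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
w_o = random_matrix(n_heads * d_head, d_model, rng)
out = multi_head(x, heads, w_o)
print(len(out), len(out[0]))  # 3 8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The output has the same shape as the input, which is what lets the block be stacked.&lt;/p&gt;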

&lt;h2&gt;
  
  
  Concrete Example
&lt;/h2&gt;

&lt;p&gt;Take this sentence:&lt;/p&gt;

&lt;p&gt;The animal did not cross the street because it was tired.&lt;/p&gt;

&lt;p&gt;What does “it” refer to?&lt;/p&gt;

&lt;p&gt;One attention head may focus on “animal.”&lt;/p&gt;

&lt;p&gt;Another may focus on “tired.”&lt;/p&gt;

&lt;p&gt;Another may track the structure around “because.”&lt;/p&gt;

&lt;p&gt;This is useful because the model does not need one attention map to explain everything.&lt;/p&gt;

&lt;p&gt;Different heads can specialize.&lt;/p&gt;

&lt;p&gt;That is why Multi-Head Attention matters.&lt;/p&gt;

&lt;p&gt;It gives the model multiple ways to read the same sentence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Single-Head vs Multi-Head Attention
&lt;/h2&gt;

&lt;p&gt;Single-head attention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;uses one attention distribution&lt;/li&gt;
&lt;li&gt;sees token relationships from one perspective&lt;/li&gt;
&lt;li&gt;is simpler&lt;/li&gt;
&lt;li&gt;can mix different patterns together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multi-head attention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;uses multiple attention distributions&lt;/li&gt;
&lt;li&gt;views tokens through different learned projections&lt;/li&gt;
&lt;li&gt;captures diverse relationships&lt;/li&gt;
&lt;li&gt;recombines the results afterward&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key difference:&lt;/p&gt;

&lt;p&gt;Single-head = one view of context&lt;/p&gt;

&lt;p&gt;Multi-head = multiple views of context&lt;/p&gt;

&lt;p&gt;This is not just repeated computation.&lt;/p&gt;

&lt;p&gt;It is structured parallel interpretation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Positional Encoding Is Needed
&lt;/h2&gt;

&lt;p&gt;Self-Attention compares all tokens with each other at the same time.&lt;/p&gt;

&lt;p&gt;That is great for parallelism.&lt;/p&gt;

&lt;p&gt;But it creates a problem.&lt;/p&gt;

&lt;p&gt;Attention alone does not know token order.&lt;/p&gt;

&lt;p&gt;Consider:&lt;/p&gt;

&lt;p&gt;dog bites man&lt;/p&gt;

&lt;p&gt;man bites dog&lt;/p&gt;

&lt;p&gt;Same words.&lt;/p&gt;

&lt;p&gt;Different meaning.&lt;/p&gt;

&lt;p&gt;Without position information, the model does not naturally know which token came first.&lt;/p&gt;

&lt;p&gt;So Transformers inject position into token representations.&lt;/p&gt;

&lt;p&gt;The basic structure is:&lt;/p&gt;

&lt;p&gt;Input Representation = Token Embedding + Positional Encoding&lt;/p&gt;

&lt;p&gt;This matters because word order changes meaning.&lt;/p&gt;

&lt;p&gt;A language model must know both:&lt;/p&gt;

&lt;p&gt;what the token is&lt;/p&gt;

&lt;p&gt;where the token is&lt;/p&gt;

&lt;h2&gt;
  
  
  Sinusoidal Positional Encoding
&lt;/h2&gt;

&lt;p&gt;The original Transformer used fixed sine and cosine patterns.&lt;/p&gt;

&lt;p&gt;Even dimensions use sine.&lt;/p&gt;

&lt;p&gt;Odd dimensions use cosine.&lt;/p&gt;

&lt;p&gt;The idea is:&lt;/p&gt;

&lt;p&gt;different positions get different wave patterns.&lt;/p&gt;

&lt;p&gt;A simplified view:&lt;/p&gt;

&lt;p&gt;PE(pos, 2i) = sin(pos / 10000^(2i / d_model))&lt;/p&gt;

&lt;p&gt;PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model))&lt;/p&gt;

&lt;p&gt;This is not just a position ID.&lt;/p&gt;

&lt;p&gt;It creates smooth position signals across dimensions.&lt;/p&gt;

&lt;p&gt;The model can use these signals to reason about position and distance.&lt;/p&gt;
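
&lt;p&gt;The two formulas translate almost line by line into code. This sketch assumes an even &lt;code&gt;d_model&lt;/code&gt;. The function name is illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def sinusoidal_encoding(n_positions, d_model):
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle))  # dimension 2i
            row.append(math.cos(angle))  # dimension 2i + 1
        table.append(row)
    return table

pe = sinusoidal_encoding(n_positions=4, d_model=6)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Low dimensions oscillate quickly across positions. High dimensions change slowly.&lt;/p&gt;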

&lt;h2&gt;
  
  
  APE, RPE, and RoPE
&lt;/h2&gt;

&lt;p&gt;There are several ways to inject position.&lt;/p&gt;

&lt;p&gt;Absolute Positional Embedding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;assigns a position vector to each absolute index&lt;/li&gt;
&lt;li&gt;position 1 has one vector&lt;/li&gt;
&lt;li&gt;position 2 has another vector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Relative Positional Embedding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;focuses on distance between tokens&lt;/li&gt;
&lt;li&gt;useful when relative position matters more than absolute index&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rotary Positional Embedding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rotates Query and Key vectors using position&lt;/li&gt;
&lt;li&gt;makes relative position work naturally inside attention&lt;/li&gt;
&lt;li&gt;commonly used in modern LLMs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The shared goal is simple:&lt;/p&gt;

&lt;p&gt;Give attention a way to understand order.&lt;/p&gt;
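
&lt;p&gt;A small sketch of the rotary idea. It assumes the adjacent-pair layout, where dimensions (0, 1), (2, 3), and so on are rotated together. Some implementations pair dimensions differently. The vectors are arbitrary.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def rotate(vec, pos, base=10000.0):
    # Rotate each (even, odd) pair of dimensions by an angle proportional to pos.
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.4, 0.9, 0.1]

# Same offset (3 positions apart), very different absolute positions.
near = dot(rotate(q, 5), rotate(k, 2))
far = dot(rotate(q, 105), rotate(k, 102))
print(math.isclose(near, far, abs_tol=1e-9))  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The score depends only on the distance between the two tokens. That is the sense in which relative position works naturally inside attention.&lt;/p&gt;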

&lt;h2&gt;
  
  
  Add &amp;amp; Norm
&lt;/h2&gt;

&lt;p&gt;A Transformer block also needs stability.&lt;/p&gt;

&lt;p&gt;That is where Add &amp;amp; Norm comes in.&lt;/p&gt;

&lt;p&gt;Add means residual connection.&lt;/p&gt;

&lt;p&gt;Norm means layer normalization.&lt;/p&gt;

&lt;p&gt;The classic Post-LN formula is:&lt;/p&gt;

&lt;p&gt;Output = LayerNorm(x + Sublayer(x))&lt;/p&gt;

&lt;p&gt;The residual connection preserves the original input.&lt;/p&gt;

&lt;p&gt;The sublayer adds a learned update.&lt;/p&gt;

&lt;p&gt;LayerNorm keeps the representation stable.&lt;/p&gt;

&lt;p&gt;This is important because Transformers stack many layers.&lt;/p&gt;

&lt;p&gt;Without residual paths, information can degrade.&lt;/p&gt;

&lt;p&gt;Without normalization, training can become unstable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Residual Connection Intuition
&lt;/h2&gt;

&lt;p&gt;A sublayer should not have to rebuild everything from scratch.&lt;/p&gt;

&lt;p&gt;It should only need to learn an update.&lt;/p&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;p&gt;new_output = sublayer(x)&lt;/p&gt;

&lt;p&gt;Use:&lt;/p&gt;

&lt;p&gt;new_output = x + sublayer(x)&lt;/p&gt;

&lt;p&gt;This is a huge difference.&lt;/p&gt;

&lt;p&gt;The original representation can pass forward directly.&lt;/p&gt;

&lt;p&gt;The sublayer only adds useful changes.&lt;/p&gt;

&lt;p&gt;That makes deep networks easier to train.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer Normalization Intuition
&lt;/h2&gt;

&lt;p&gt;LayerNorm normalizes each token representation.&lt;/p&gt;

&lt;p&gt;It works across the feature dimensions of a token.&lt;/p&gt;

&lt;p&gt;Not across the sequence.&lt;/p&gt;

&lt;p&gt;That means each token vector is stabilized independently.&lt;/p&gt;

&lt;p&gt;In practice, this helps keep activation values in a manageable range.&lt;/p&gt;

&lt;p&gt;This matters when many Transformer blocks are stacked.&lt;/p&gt;

&lt;p&gt;A small instability in one layer can grow across dozens or hundreds of layers.&lt;/p&gt;
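
&lt;p&gt;A minimal sketch of the per-token computation. It leaves out the learned scale and shift parameters that real LayerNorm layers carry, and the input values are arbitrary.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def layer_norm(token, eps=1e-5):
    # Normalize one token vector across its feature dimensions.
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]

# Two tokens whose features differ in scale by a factor of 10.
sequence = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
normed = [layer_norm(token) for token in sequence]  # each token independently

for row in normed:
    print([round(v, 3) for v in row])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Both rows come out nearly identical. The scale difference between the two tokens is gone, and no statistic was shared across the sequence.&lt;/p&gt;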

&lt;h2&gt;
  
  
  Pre-LN vs Post-LN
&lt;/h2&gt;

&lt;p&gt;Post-LN:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = LayerNorm(x + Sublayer(x))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pre-LN:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = x + Sublayer(LayerNorm(x))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The original Transformer used Post-LN.&lt;/p&gt;

&lt;p&gt;Many modern large Transformers use Pre-LN.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Pre-LN often improves training stability in deep models.&lt;/p&gt;

&lt;p&gt;The difference is placement.&lt;/p&gt;

&lt;p&gt;Post-LN normalizes after the residual addition.&lt;/p&gt;

&lt;p&gt;Pre-LN normalizes before the sublayer.&lt;/p&gt;

&lt;p&gt;Both use residual connections.&lt;/p&gt;

&lt;p&gt;But their training behavior can be different.&lt;/p&gt;
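
&lt;p&gt;The placement difference is easy to see on one vector. In this sketch &lt;code&gt;sublayer&lt;/code&gt; is a hypothetical stand-in for attention or the FFN, and &lt;code&gt;layer_norm&lt;/code&gt; omits learned parameters.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    # Hypothetical stand-in for attention or the FFN: any vector-to-vector map.
    return [0.5 * v + 1.0 for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def post_ln(x):
    return layer_norm(add(x, sublayer(x)))   # normalize after the residual add

def pre_ln(x):
    return add(x, sublayer(layer_norm(x)))   # normalize only the sublayer input

x = [100.0, 200.0, 300.0]
print([round(v, 3) for v in post_ln(x)])  # the residual sum itself is rescaled
print([round(v, 3) for v in pre_ln(x)])   # x passes through the residual path untouched
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In Pre-LN the residual path carries x forward unnormalized, which gives a direct route through the stack. That is one common explanation for its stability in deep models.&lt;/p&gt;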

&lt;h2&gt;
  
  
  Naive vs Practical View
&lt;/h2&gt;

&lt;p&gt;Naive view:&lt;/p&gt;

&lt;p&gt;Transformer block = attention layer&lt;/p&gt;

&lt;p&gt;Practical view:&lt;/p&gt;

&lt;p&gt;Transformer block = attention + position + residuals + normalization + FFN&lt;/p&gt;

&lt;p&gt;Naive implementation mindset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;run attention
return output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Practical implementation mindset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;add position
run multi-head attention
preserve input through residuals
normalize representations
apply feed-forward updates
repeat safely across many layers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is why implementation details matter.&lt;/p&gt;

&lt;p&gt;The architecture works because these parts support each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Conditions and Limits
&lt;/h2&gt;

&lt;p&gt;Multi-Head Attention is powerful, but more heads are not always better.&lt;/p&gt;

&lt;p&gt;Too many heads can increase cost.&lt;/p&gt;

&lt;p&gt;Some heads may become redundant.&lt;/p&gt;

&lt;p&gt;Positional Encoding is necessary because attention is order-agnostic by default.&lt;/p&gt;

&lt;p&gt;But different positional methods behave differently in long-context settings.&lt;/p&gt;

&lt;p&gt;Add &amp;amp; Norm improves stability.&lt;/p&gt;

&lt;p&gt;But the exact Pre-LN or Post-LN choice affects optimization.&lt;/p&gt;

&lt;p&gt;So these are not decorative components.&lt;/p&gt;

&lt;p&gt;They are architectural decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Multi-Head Attention gives the model multiple views of token relationships.&lt;/p&gt;

&lt;p&gt;Positional Encoding gives attention a sense of order.&lt;/p&gt;

&lt;p&gt;Add &amp;amp; Norm keeps deep Transformer blocks stable.&lt;/p&gt;

&lt;p&gt;The shortest version is:&lt;/p&gt;

&lt;p&gt;Transformer Block = multi-view attention + position signal + stable residual updates&lt;/p&gt;

&lt;p&gt;If Self-Attention is the engine, these components are the systems that make the engine usable at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;When reading Transformer architecture, which part feels most important to understand first?&lt;/p&gt;

&lt;p&gt;Multi-Head Attention, Positional Encoding, or Add &amp;amp; Norm?&lt;/p&gt;

&lt;p&gt;Originally published at zeromathai.com.&lt;br&gt;
Original article: &lt;a href="https://zeromathai.com/en/multi-head-attention-positional-encoding-add-norm-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/multi-head-attention-positional-encoding-add-norm-en/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub Resources&lt;br&gt;
AI diagrams, study notes, and visual guides:&lt;br&gt;
&lt;a href="https://github.com/zeromathai/zeromathai-ai" rel="noopener noreferrer"&gt;https://github.com/zeromathai/zeromathai-ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>How Self-Attention Works — QKV, Softmax, and Matrix Computation</title>
      <dc:creator>zeromathai</dc:creator>
      <pubDate>Thu, 18 Jun 2026 14:19:01 +0000</pubDate>
      <link>https://dev.to/zeromathai/how-self-attention-works-qkv-softmax-and-matrix-computation-514j</link>
      <guid>https://dev.to/zeromathai/how-self-attention-works-qkv-softmax-and-matrix-computation-514j</guid>
      <description>&lt;p&gt;Self-Attention is not just “looking at important words.”&lt;/p&gt;

&lt;p&gt;It is a matrix operation.&lt;/p&gt;

&lt;p&gt;And that is exactly why Transformers scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Idea
&lt;/h2&gt;

&lt;p&gt;Self-Attention lets each token compare itself with every other token in the same sequence.&lt;/p&gt;

&lt;p&gt;Each token asks:&lt;/p&gt;

&lt;p&gt;Which other tokens are useful for updating my representation?&lt;/p&gt;

&lt;p&gt;This matters because meaning is contextual.&lt;/p&gt;

&lt;p&gt;A token should not stay as a static embedding.&lt;/p&gt;

&lt;p&gt;It should become a representation shaped by the sentence around it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Structure
&lt;/h2&gt;

&lt;p&gt;Self-Attention follows this pipeline:&lt;/p&gt;

&lt;p&gt;Input Embeddings&lt;br&gt;&lt;br&gt;
→ Query, Key, Value Projection&lt;br&gt;&lt;br&gt;
→ Similarity Scores&lt;br&gt;&lt;br&gt;
→ Scaling&lt;br&gt;&lt;br&gt;
→ Softmax Weights&lt;br&gt;&lt;br&gt;
→ Weighted Sum of Values&lt;br&gt;&lt;br&gt;
→ Contextual Token Output&lt;/p&gt;

&lt;p&gt;More compactly:&lt;/p&gt;

&lt;p&gt;Self-Attention = matching + weighting + information mixing&lt;/p&gt;

&lt;p&gt;The full formula is:&lt;/p&gt;

&lt;p&gt;Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V&lt;/p&gt;

&lt;p&gt;This equation looks dense.&lt;/p&gt;

&lt;p&gt;But the idea is simple:&lt;/p&gt;

&lt;p&gt;Compare tokens.&lt;/p&gt;

&lt;p&gt;Convert scores into weights.&lt;/p&gt;

&lt;p&gt;Use weights to mix information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pseudo-code View
&lt;/h2&gt;

&lt;p&gt;At a high level, Self-Attention works like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = token_embeddings

Q = X @ W_Q
K = X @ W_K
V = X @ W_V

scores = Q @ K.T

scaled_scores = scores / sqrt(d_k)

weights = softmax(scaled_scores)

output = weights @ V
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That is the core computation.&lt;/p&gt;

&lt;p&gt;In real Transformer implementations, this is done for all tokens at once.&lt;/p&gt;

&lt;p&gt;Not token by token.&lt;/p&gt;

&lt;p&gt;That is why the matrix form matters.&lt;/p&gt;
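
&lt;p&gt;The pseudo-code can be made runnable with only the standard library. This is a sketch, not an efficient implementation. The identity projection matrices and the 3-token input are illustrative values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)                        # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))       # all pairwise q · k scores at once
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return weights, matmul(weights, v)

# Toy inputs: 3 tokens, 2 features, identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, output = self_attention(x, identity, identity, identity)

print([round(sum(row), 6) for row in weights])  # [1.0, 1.0, 1.0]
print(len(output), len(output[0]))              # 3 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;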

&lt;h2&gt;
  
  
  Concrete Example
&lt;/h2&gt;

&lt;p&gt;Take this sentence:&lt;/p&gt;

&lt;p&gt;I love you&lt;/p&gt;

&lt;p&gt;When updating the token “love”, Self-Attention compares it with:&lt;/p&gt;

&lt;p&gt;I&lt;br&gt;&lt;br&gt;
love&lt;br&gt;&lt;br&gt;
you&lt;/p&gt;

&lt;p&gt;The token “love” may strongly attend to “I” and “you”.&lt;/p&gt;

&lt;p&gt;So its representation becomes more contextual.&lt;/p&gt;

&lt;p&gt;It no longer means only the word “love.”&lt;/p&gt;

&lt;p&gt;It becomes something closer to:&lt;/p&gt;

&lt;p&gt;love as an action between I and you&lt;/p&gt;

&lt;p&gt;That is why Self-Attention is powerful.&lt;/p&gt;

&lt;p&gt;It turns isolated token vectors into relationship-aware vectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  QKV Intuition
&lt;/h2&gt;

&lt;p&gt;Each token is projected into three roles:&lt;/p&gt;

&lt;p&gt;Query, Key, and Value.&lt;/p&gt;

&lt;p&gt;Query:&lt;/p&gt;

&lt;p&gt;What am I looking for?&lt;/p&gt;

&lt;p&gt;Key:&lt;/p&gt;

&lt;p&gt;What do I contain that others can match against?&lt;/p&gt;

&lt;p&gt;Value:&lt;/p&gt;

&lt;p&gt;What information do I pass forward if selected?&lt;/p&gt;

&lt;p&gt;Search analogy:&lt;/p&gt;

&lt;p&gt;Query = search request&lt;/p&gt;

&lt;p&gt;Key = searchable index&lt;/p&gt;

&lt;p&gt;Value = retrieved content&lt;/p&gt;

&lt;p&gt;This separation is important.&lt;/p&gt;

&lt;p&gt;The model can learn different spaces for matching and information transfer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Generate Q, K, and V
&lt;/h2&gt;

&lt;p&gt;Given input embeddings X:&lt;/p&gt;

&lt;p&gt;Q = XW_Q&lt;br&gt;&lt;br&gt;
K = XW_K&lt;br&gt;&lt;br&gt;
V = XW_V&lt;/p&gt;

&lt;p&gt;W_Q, W_K, and W_V are learned matrices.&lt;/p&gt;

&lt;p&gt;They are trained with the model.&lt;/p&gt;

&lt;p&gt;This means QKV is not manually designed.&lt;/p&gt;

&lt;p&gt;The model learns how to project tokens into attention roles.&lt;/p&gt;

&lt;p&gt;Implementation-wise, this is just matrix multiplication.&lt;/p&gt;

&lt;p&gt;Conceptually, it creates three different views of the same token.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Compute Attention Scores
&lt;/h2&gt;

&lt;p&gt;The model compares Query and Key vectors.&lt;/p&gt;

&lt;p&gt;For one token:&lt;/p&gt;

&lt;p&gt;score = q · k&lt;/p&gt;

&lt;p&gt;A larger dot product means stronger similarity.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;q₁ · k₁ = 112&lt;br&gt;&lt;br&gt;
q₁ · k₂ = 96  &lt;/p&gt;

&lt;p&gt;The first key matches more strongly.&lt;/p&gt;

&lt;p&gt;But these are still raw scores.&lt;/p&gt;

&lt;p&gt;They are not probabilities yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Scale and Apply Softmax
&lt;/h2&gt;

&lt;p&gt;Dot products can become large when vector dimensions grow.&lt;/p&gt;

&lt;p&gt;Large scores can make Softmax too sharp.&lt;/p&gt;

&lt;p&gt;That can make training unstable.&lt;/p&gt;

&lt;p&gt;So Self-Attention scales the scores:&lt;/p&gt;

&lt;p&gt;score = (q · k) / √dₖ&lt;/p&gt;

&lt;p&gt;Then Softmax converts scores into weights.&lt;/p&gt;

&lt;p&gt;Example, assuming dₖ = 64 so that √dₖ = 8:&lt;/p&gt;

&lt;p&gt;scores = [112 / 8, 96 / 8] = [14, 12]&lt;/p&gt;

&lt;p&gt;softmax(scores) ≈ [0.88, 0.12]&lt;/p&gt;

&lt;p&gt;Now the model has attention weights.&lt;/p&gt;

&lt;p&gt;These weights say how much each token should contribute.&lt;/p&gt;

&lt;p&gt;This matters in practice.&lt;/p&gt;

&lt;p&gt;Without scaling, attention can collapse too aggressively onto one token.&lt;/p&gt;
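
&lt;p&gt;The numbers can be checked directly. The raw scores 112 and 96 come from Step 2. The key dimension of 64 is an assumption, chosen because it turns them into the 14 and 12 used here.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw = [112.0, 96.0]   # q1 · k1 and q1 · k2
d_k = 64              # assumed key dimension, so sqrt(d_k) = 8
scaled = [s / math.sqrt(d_k) for s in raw]

print(scaled)                                  # [14.0, 12.0]
print([round(w, 2) for w in softmax(scaled)])  # [0.88, 0.12]
print([round(w, 6) for w in softmax(raw)])     # [1.0, 0.0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The last line is the collapse: without scaling, the second token contributes essentially nothing.&lt;/p&gt;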

&lt;h2&gt;
  
  
  Step 4: Weighted Sum of Values
&lt;/h2&gt;

&lt;p&gt;The final output is a weighted sum of Value vectors.&lt;/p&gt;

&lt;p&gt;z = Σ αᵢvᵢ&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;values = [10, 20]&lt;/p&gt;

&lt;p&gt;weights = [0.88, 0.12]&lt;/p&gt;

&lt;p&gt;output = 0.88 × 10 + 0.12 × 20 = 11.2&lt;/p&gt;

&lt;p&gt;The first value contributes more.&lt;/p&gt;

&lt;p&gt;The second value contributes less.&lt;/p&gt;

&lt;p&gt;That is the basic meaning of attention output.&lt;/p&gt;

&lt;p&gt;It is not a simple average.&lt;/p&gt;

&lt;p&gt;It is selective information mixing.&lt;/p&gt;
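
&lt;p&gt;The same arithmetic in code, first with the scalar values above and then with small made-up value vectors, since real Value entries are vectors.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;weights = [0.88, 0.12]
values = [10.0, 20.0]

# Scalar case from the text: z = sum of alpha_i * v_i.
z = sum(a * v for a, v in zip(weights, values))
print(round(z, 2))  # 11.2

# Vector case (illustrative values): the same sum runs per dimension.
value_vectors = [[10.0, 0.0], [20.0, 4.0]]
z_vec = [sum(a * v[d] for a, v in zip(weights, value_vectors)) for d in range(2)]
print([round(c, 2) for c in z_vec])  # [11.2, 0.48]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;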

&lt;h2&gt;
  
  
  Self-Attention vs Cross-Attention
&lt;/h2&gt;

&lt;p&gt;Self-Attention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query, Key, and Value come from the same sequence&lt;/li&gt;
&lt;li&gt;models relationships inside one sequence&lt;/li&gt;
&lt;li&gt;used in Transformer encoders and decoders&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cross-Attention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query comes from the decoder&lt;/li&gt;
&lt;li&gt;Key and Value come from the encoder&lt;/li&gt;
&lt;li&gt;models relationships between two sequences&lt;/li&gt;
&lt;li&gt;used in encoder-decoder models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short:&lt;/p&gt;

&lt;p&gt;Self-Attention = inside the same sequence&lt;/p&gt;

&lt;p&gt;Cross-Attention = between different sequences&lt;/p&gt;

&lt;p&gt;This difference matters when reading Transformer code.&lt;/p&gt;

&lt;p&gt;If Q, K, and V come from the same tensor, it is Self-Attention.&lt;/p&gt;

&lt;p&gt;If Q comes from one tensor and K/V come from another, it is Cross-Attention.&lt;/p&gt;
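
&lt;p&gt;In code, the only difference is which sequence feeds which argument. This sketch skips the learned Q, K, and V projections to keep the focus on the data flow. The state values are made up.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def attend(queries, keys, values):
    # Generic scaled dot-product attention over lists of vectors.
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        weights = [e / sum(exps) for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

decoder_states = [[1.0, 0.0], [0.0, 1.0]]              # 2 target tokens
encoder_states = [[1.0, 1.0], [0.5, 0.0], [0.0, 2.0]]  # 3 source tokens

# Self-Attention: Q, K, and V all come from the same sequence.
self_out = attend(decoder_states, decoder_states, decoder_states)
# Cross-Attention: Q from the decoder, K and V from the encoder.
cross_out = attend(decoder_states, encoder_states, encoder_states)

print(len(self_out), len(cross_out))  # 2 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Both calls return one output per query token, regardless of how many keys and values were attended over.&lt;/p&gt;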

&lt;h2&gt;
  
  
  Naive vs Matrix View
&lt;/h2&gt;

&lt;p&gt;Naive view:&lt;/p&gt;

&lt;p&gt;Each token compares with every other token one by one.&lt;/p&gt;

&lt;p&gt;Matrix view:&lt;/p&gt;

&lt;p&gt;All token relationships are computed at once.&lt;/p&gt;

&lt;p&gt;Naive logic:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for token_i in tokens:
    for token_j in tokens:
        compute_similarity(token_i, token_j)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Matrix logic:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scores = Q @ K.T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That single matrix multiplication computes all pairwise token scores.&lt;/p&gt;

&lt;p&gt;This is why Transformers are GPU-friendly.&lt;/p&gt;

&lt;p&gt;They replace sequential loops with dense linear algebra.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Matrix Computation Matters
&lt;/h2&gt;

&lt;p&gt;The attention matrix contains token-to-token relationships.&lt;/p&gt;

&lt;p&gt;If the sequence length is n, the score matrix is n × n.&lt;/p&gt;

&lt;p&gt;Each row means:&lt;/p&gt;

&lt;p&gt;How much one token attends to every token.&lt;/p&gt;

&lt;p&gt;Each column means:&lt;/p&gt;

&lt;p&gt;How much that token is attended to by others.&lt;/p&gt;

&lt;p&gt;This structure is powerful.&lt;/p&gt;

&lt;p&gt;But it also creates a cost problem.&lt;/p&gt;

&lt;p&gt;The compute and memory cost of full Self-Attention grows as O(n²) in the sequence length n.&lt;/p&gt;

&lt;p&gt;Longer context means more computation and memory.&lt;/p&gt;

&lt;p&gt;So the same design that makes attention expressive also makes it expensive.&lt;/p&gt;

&lt;p&gt;That is why efficient attention methods exist.&lt;/p&gt;
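
&lt;p&gt;A back-of-the-envelope sketch of the memory side. It counts one n × n score matrix for a single head and assumes 2 bytes per score. Both are simplifying assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def score_matrix_bytes(n_tokens, bytes_per_score=2):
    # One n x n score matrix for a single head, assuming 2-byte scores.
    return n_tokens * n_tokens * bytes_per_score

for n in (1_000, 10_000, 100_000):
    print(n, score_matrix_bytes(n) / 1e9, "GB")
# 1000 0.002 GB
# 10000 0.2 GB
# 100000 20.0 GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Ten times more tokens means one hundred times more scores.&lt;/p&gt;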

&lt;h2&gt;
  
  
  Important Conditions and Limits
&lt;/h2&gt;

&lt;p&gt;Self-Attention needs positional information.&lt;/p&gt;

&lt;p&gt;By itself, attention compares token content.&lt;/p&gt;

&lt;p&gt;It does not automatically know token order.&lt;/p&gt;

&lt;p&gt;Self-Attention also gets expensive as sequence length grows.&lt;/p&gt;

&lt;p&gt;For short and medium sequences, full attention is powerful.&lt;/p&gt;

&lt;p&gt;For very long sequences, memory and compute become major constraints.&lt;/p&gt;

&lt;p&gt;Another important point:&lt;/p&gt;

&lt;p&gt;Attention weights are not always perfect explanations.&lt;/p&gt;

&lt;p&gt;They show how information is mixed.&lt;/p&gt;

&lt;p&gt;But they should not always be treated as human-level reasoning traces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Perspective
&lt;/h2&gt;

&lt;p&gt;In real models, QKV projection is often implemented as one combined linear layer.&lt;/p&gt;

&lt;p&gt;Instead of computing three separate matrix multiplications:&lt;/p&gt;

&lt;p&gt;Q = XW_Q&lt;br&gt;&lt;br&gt;
K = XW_K&lt;br&gt;&lt;br&gt;
V = XW_V&lt;/p&gt;

&lt;p&gt;Implementations often compute:&lt;/p&gt;

&lt;p&gt;QKV = XW_QKV&lt;/p&gt;

&lt;p&gt;Then split the result into Q, K, and V.&lt;/p&gt;

&lt;p&gt;This is usually faster and cleaner, because one larger matrix multiplication replaces three smaller ones.&lt;/p&gt;

&lt;p&gt;The math stays the same.&lt;/p&gt;

&lt;p&gt;The implementation is optimized.&lt;/p&gt;

&lt;p&gt;That is the developer mindset:&lt;/p&gt;

&lt;p&gt;Understand the formula.&lt;/p&gt;

&lt;p&gt;Then recognize the optimized tensor layout in code.&lt;/p&gt;
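
&lt;p&gt;A tiny sketch showing that the fused projection gives the same Q, K, and V as three separate ones. The weight values are arbitrary, and stacking the matrices side by side is one possible layout, not the only one used in practice.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

x = [[1.0, 2.0], [3.0, 4.0]]        # 2 tokens, d_model = 2
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[0.5, 0.0], [0.0, 0.5]]
w_v = [[0.0, 1.0], [1.0, 0.0]]

# Fuse the three weight matrices side by side: shape (d_model, 3 * d_k).
w_qkv = [rq + rk + rv for rq, rk, rv in zip(w_q, w_k, w_v)]

qkv = matmul(x, w_qkv)              # one projection instead of three
d_k = len(w_q[0])
q = [row[:d_k] for row in qkv]
k = [row[d_k:2 * d_k] for row in qkv]
v = [row[2 * d_k:] for row in qkv]

print(q == matmul(x, w_q), k == matmul(x, w_k), v == matmul(x, w_v))  # True True True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;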

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Self-Attention is the core operation behind Transformers.&lt;/p&gt;

&lt;p&gt;It works by projecting tokens into Q, K, and V.&lt;/p&gt;

&lt;p&gt;Q and K compute relevance.&lt;/p&gt;

&lt;p&gt;Softmax turns relevance into weights.&lt;/p&gt;

&lt;p&gt;Weights mix V into contextual outputs.&lt;/p&gt;

&lt;p&gt;The shortest version is:&lt;/p&gt;

&lt;p&gt;Self-Attention = compare tokens → weight information → update representations&lt;/p&gt;

&lt;p&gt;If you understand QKᵀ and weighted Values, you understand the heart of Transformer computation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;When reading Transformer code, which part feels most confusing?&lt;/p&gt;

&lt;p&gt;QKV projection, Softmax attention weights, or the final matrix multiplication with V?&lt;/p&gt;

&lt;p&gt;Originally published at zeromathai.com.&lt;br&gt;
Original article: &lt;a href="https://zeromathai.com/en/self-attention-qkv-matrix-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/self-attention-qkv-matrix-en/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub Resources&lt;br&gt;
AI diagrams, study notes, and visual guides:&lt;br&gt;
&lt;a href="https://github.com/zeromathai/zeromathai-ai" rel="noopener noreferrer"&gt;https://github.com/zeromathai/zeromathai-ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>transformers</category>
    </item>
    <item>
      <title>How Attention Actually Works — From Next-Token Prediction to QKV Intuition</title>
      <dc:creator>zeromathai</dc:creator>
      <pubDate>Wed, 17 Jun 2026 15:38:00 +0000</pubDate>
      <link>https://dev.to/zeromathai/how-attention-actually-works-from-next-token-prediction-to-qkv-intuition-29l2</link>
      <guid>https://dev.to/zeromathai/how-attention-actually-works-from-next-token-prediction-to-qkv-intuition-29l2</guid>
      <description>&lt;p&gt;A language model does not “write sentences.”&lt;/p&gt;

&lt;p&gt;It predicts the next token. One step at a time.&lt;/p&gt;

&lt;p&gt;So the real question is:&lt;/p&gt;

&lt;p&gt;How does it decide what matters right now?&lt;/p&gt;

&lt;p&gt;That is why attention exists.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Idea
&lt;/h2&gt;

&lt;p&gt;A Language Model = next-token probability estimator.&lt;/p&gt;

&lt;p&gt;Given previous tokens, it predicts the next token.&lt;/p&gt;

&lt;p&gt;Attention = mechanism that decides which past tokens matter more.&lt;/p&gt;

&lt;p&gt;This is critical.&lt;/p&gt;

&lt;p&gt;Because not all context is equally useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Structure
&lt;/h2&gt;

&lt;p&gt;Language Modeling can be reduced to:&lt;/p&gt;

&lt;p&gt;P(x₁, x₂, ..., xₙ) = Πₜ₌₁ⁿ P(xₜ | x₁, ..., xₜ₋₁)&lt;/p&gt;
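
&lt;p&gt;A minimal Python sketch of that factorization. The conditional probabilities are made-up numbers for illustration, not output from a real model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

# Hypothetical conditional probabilities P(token | all previous tokens).
cond_probs = [
    ("I", 0.20),     # P("I")
    ("love", 0.10),  # P("love" | "I")
    ("you", 0.60),   # P("you" | "I", "love")
]

# Chain rule: the sequence probability is the product of the conditionals.
sequence_prob = math.prod(p for _, p in cond_probs)

# Summing logs is the numerically safer form of the same product.
log_prob = sum(math.log(p) for _, p in cond_probs)

print(sequence_prob)
print(math.exp(log_prob))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Both printed values describe the same quantity. Real systems usually work with the log form, because long products of small numbers underflow.&lt;/p&gt;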

&lt;p&gt;And attention adds:&lt;/p&gt;

&lt;p&gt;weighted context selection&lt;/p&gt;

&lt;p&gt;More concretely:&lt;/p&gt;

&lt;p&gt;Language Model = context + weighting + prediction&lt;/p&gt;

&lt;p&gt;Without attention:&lt;/p&gt;

&lt;p&gt;All context is compressed into one fixed-size vector.&lt;/p&gt;

&lt;p&gt;With attention:&lt;/p&gt;

&lt;p&gt;Context is dynamically re-weighted at every step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pseudo-code View
&lt;/h2&gt;

&lt;p&gt;Autoregressive generation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;context = ["I", "love"]

while not finished:
    probs = model(context)
    next_token = sample(probs)

    context.append(next_token)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
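
&lt;p&gt;A runnable sketch of the same loop. Here &lt;code&gt;toy_model&lt;/code&gt;, its lookup table, and the &lt;code&gt;[END]&lt;/code&gt; marker are hypothetical stand-ins. A real model computes probabilities from the whole context, not just the last token:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random

# Hypothetical next-token table keyed by the last token only.
TABLE = {
    "love": {"you": 0.6, "it": 0.2, "this": 0.1, "pizza": 0.1},
}

def toy_model(context):
    # Unknown contexts end the sequence with probability 1.
    return TABLE.get(context[-1], {"[END]": 1.0})

def sample(probs, rng):
    # Draw one token according to its probability.
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def generate(context, max_new_tokens=5, seed=0):
    rng = random.Random(seed)
    context = list(context)
    for _ in range(max_new_tokens):
        next_token = sample(toy_model(context), rng)
        if next_token == "[END]":
            break
        context.append(next_token)
    return context

print(generate(["I", "love"]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;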

&lt;p&gt;Attention inside the model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for each token t:
    score = compare(query_t, keys)

    weights = softmax(score)

    output_t = sum(weights * values)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Together, these two pieces form the core loop.&lt;/p&gt;

&lt;p&gt;Attend → predict → append → repeat.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concrete Example
&lt;/h2&gt;

&lt;p&gt;Input:&lt;/p&gt;

&lt;p&gt;"I love"&lt;/p&gt;

&lt;p&gt;Possible next tokens:&lt;/p&gt;

&lt;p&gt;you, it, this, pizza&lt;/p&gt;

&lt;p&gt;The model assigns probabilities:&lt;/p&gt;

&lt;p&gt;you → 0.6&lt;br&gt;&lt;br&gt;
it → 0.2&lt;br&gt;&lt;br&gt;
this → 0.1&lt;br&gt;&lt;br&gt;
pizza → 0.1  &lt;/p&gt;

&lt;p&gt;Why does “you” win?&lt;/p&gt;

&lt;p&gt;Because attention focuses on relationships in context.&lt;/p&gt;

&lt;p&gt;“I” + “love” → strong pattern → “you”&lt;/p&gt;

&lt;p&gt;Now extend:&lt;/p&gt;

&lt;p&gt;"I love you because"&lt;/p&gt;

&lt;p&gt;The model must now decide:&lt;/p&gt;

&lt;p&gt;What does “because” relate to?&lt;/p&gt;

&lt;p&gt;Attention allows it to re-evaluate the entire context.&lt;/p&gt;

&lt;p&gt;Not just the last token.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Attention Is Needed
&lt;/h2&gt;

&lt;p&gt;Old Seq2Seq models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compress entire input into one vector&lt;/li&gt;
&lt;li&gt;lose information as sequence grows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Attention fixes this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keeps all token representations&lt;/li&gt;
&lt;li&gt;dynamically selects relevant ones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters because:&lt;/p&gt;

&lt;p&gt;Long sentences break fixed representations.&lt;/p&gt;

&lt;p&gt;Attention removes that bottleneck.&lt;/p&gt;

&lt;h2&gt;
  
  
  QKV Intuition
&lt;/h2&gt;

&lt;p&gt;Attention uses three vectors:&lt;/p&gt;

&lt;p&gt;Query, Key, Value&lt;/p&gt;

&lt;p&gt;Think like search:&lt;/p&gt;

&lt;p&gt;Query = what I want&lt;br&gt;&lt;br&gt;
Key = what each token offers&lt;br&gt;&lt;br&gt;
Value = the actual information  &lt;/p&gt;

&lt;p&gt;Flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;compare Query with Keys&lt;/li&gt;
&lt;li&gt;compute similarity scores&lt;/li&gt;
&lt;li&gt;normalize with softmax&lt;/li&gt;
&lt;li&gt;combine Values using weights&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is how context is selected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Formula
&lt;/h2&gt;

&lt;p&gt;Attention is:&lt;/p&gt;

&lt;p&gt;Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V, where dₖ is the dimension of each Key vector&lt;/p&gt;

&lt;p&gt;Meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;match Query with Keys&lt;/li&gt;
&lt;li&gt;turn matches into probabilities&lt;/li&gt;
&lt;li&gt;use those probabilities to mix Values&lt;/li&gt;
&lt;/ul&gt;
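
&lt;p&gt;The three steps above can be sketched in plain Python. This is a minimal version with made-up 2-dimensional vectors and no learned projections. &lt;code&gt;d_k&lt;/code&gt; is the Key dimension used for scaling in the formula:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def softmax(scores):
    # Subtract the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # 1. Match the Query with every Key (scaled dot product).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # 2. Turn matches into probabilities.
        weights = softmax(scores)
        # 3. Mix the Values using those probabilities.
        mixed = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(mixed)
    return outputs

# Toy vectors for three tokens (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

for row in attention(Q, K, V):
    print([round(x, 3) for x in row])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each output row is a weighted average of the Value rows, so it always stays inside the range spanned by the Values.&lt;/p&gt;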

&lt;p&gt;Result:&lt;/p&gt;

&lt;p&gt;Each token becomes context-aware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross Attention and Context Vector
&lt;/h2&gt;

&lt;p&gt;In encoder-decoder models:&lt;/p&gt;

&lt;p&gt;Decoder does not rely only on its own tokens.&lt;/p&gt;

&lt;p&gt;It looks at Encoder outputs.&lt;/p&gt;

&lt;p&gt;Context vector:&lt;/p&gt;

&lt;p&gt;c = Σ (attention_weight × encoder_hidden_state)&lt;/p&gt;

&lt;p&gt;This is dynamic.&lt;/p&gt;

&lt;p&gt;At every step, the model recomputes what matters.&lt;/p&gt;

&lt;p&gt;Not a fixed summary.&lt;/p&gt;

&lt;p&gt;A moving focus.&lt;/p&gt;
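
&lt;p&gt;The context vector can be sketched as a weighted sum. The encoder states and attention weights below are made-up numbers. In a real model the weights come from softmax over Query-Key scores:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical encoder hidden states, one 2-dimensional vector per input token.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

def context_vector(attention_weights, states):
    # c = sum over input positions of (weight * hidden state)
    dim = len(states[0])
    return [sum(w * h[j] for w, h in zip(attention_weights, states)) for j in range(dim)]

# Made-up attention weights for two different decoder steps.
step_1 = [0.8, 0.1, 0.1]  # focus on the first input token
step_2 = [0.1, 0.1, 0.8]  # focus on the last input token

print(context_vector(step_1, encoder_states))
print(context_vector(step_2, encoder_states))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Same encoder states, different weights, different context vector. That is the moving focus.&lt;/p&gt;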

&lt;h2&gt;
  
  
  Naive vs Real View
&lt;/h2&gt;

&lt;p&gt;Naive view:&lt;/p&gt;

&lt;p&gt;Language model = next word generator&lt;/p&gt;

&lt;p&gt;Real view:&lt;/p&gt;

&lt;p&gt;Language model = dynamic context weighting system&lt;/p&gt;

&lt;p&gt;Naive:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predict next token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Real:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;compute attention
reweight context
then predict token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That difference is everything.&lt;/p&gt;

&lt;p&gt;It is a large part of why Transformers outperform older recurrent models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Constraints
&lt;/h2&gt;

&lt;p&gt;Attention is powerful, but not free.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cost grows quadratically with sequence length&lt;/li&gt;
&lt;li&gt;requires memory for all tokens&lt;/li&gt;
&lt;li&gt;depends on good tokenization&lt;/li&gt;
&lt;li&gt;still generates sequentially at inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also:&lt;/p&gt;

&lt;p&gt;Attention does not understand meaning by itself.&lt;/p&gt;

&lt;p&gt;It only learns patterns from data.&lt;/p&gt;

&lt;p&gt;So quality depends on training.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters (Again)
&lt;/h2&gt;

&lt;p&gt;Early:&lt;/p&gt;

&lt;p&gt;Without attention → information bottleneck&lt;/p&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;p&gt;With attention → full context + selective focus&lt;/p&gt;

&lt;p&gt;This is why modern LLMs work.&lt;/p&gt;

&lt;p&gt;Not because they “know language.”&lt;/p&gt;

&lt;p&gt;But because they efficiently manage context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Language Model = next-token prediction.&lt;/p&gt;

&lt;p&gt;Attention = context selection.&lt;/p&gt;

&lt;p&gt;QKV = mechanism for selecting information.&lt;/p&gt;

&lt;p&gt;If you remember one thing:&lt;/p&gt;

&lt;p&gt;Attention lets a model decide what to look at before predicting what to say.&lt;/p&gt;

&lt;p&gt;That is the core of modern LLMs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;When you think about LLM behavior, do you see it more as:&lt;/p&gt;

&lt;p&gt;a probability engine or a context selection system?&lt;/p&gt;

&lt;p&gt;Originally published at zeromathai.com.&lt;br&gt;
Original article: &lt;a href="https://zeromathai.com/en/attention-language-modeling-basics-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/attention-language-modeling-basics-en/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub Resources&lt;br&gt;
AI diagrams, study notes, and visual guides:&lt;br&gt;
&lt;a href="https://github.com/zeromathai/zeromathai-ai" rel="noopener noreferrer"&gt;https://github.com/zeromathai/zeromathai-ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>transformers</category>
    </item>
    <item>
      <title>How Transformer Architecture Works — Encoder, Decoder, Tokens, and Context</title>
      <dc:creator>zeromathai</dc:creator>
      <pubDate>Tue, 16 Jun 2026 15:14:10 +0000</pubDate>
      <link>https://dev.to/zeromathai/how-transformer-architecture-works-encoder-decoder-tokens-and-context-4i8c</link>
      <guid>https://dev.to/zeromathai/how-transformer-architecture-works-encoder-decoder-tokens-and-context-4i8c</guid>
      <description>&lt;p&gt;Transformers changed NLP because they stopped treating text as a simple left-to-right chain.&lt;/p&gt;

&lt;p&gt;Instead of reading one token at a time, they compare tokens directly.&lt;/p&gt;

&lt;p&gt;That shift made modern language models faster, more scalable, and better at understanding context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Idea
&lt;/h2&gt;

&lt;p&gt;The original Transformer is a sequence-to-sequence architecture.&lt;/p&gt;

&lt;p&gt;It maps an input sequence to an output sequence.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;English sentence → Korean sentence&lt;/p&gt;

&lt;p&gt;Question → Answer&lt;/p&gt;

&lt;p&gt;Document → Summary&lt;/p&gt;

&lt;p&gt;But the key idea is not “replace one word with another word.”&lt;/p&gt;

&lt;p&gt;The key idea is:&lt;/p&gt;

&lt;p&gt;Transformers build contextual token representations first.&lt;/p&gt;

&lt;p&gt;Then they generate or transform output from those representations.&lt;/p&gt;

&lt;p&gt;That is why the architecture matters.&lt;/p&gt;

&lt;p&gt;It gives the model a structured way to understand relationships inside text.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Structure
&lt;/h2&gt;

&lt;p&gt;A simplified Transformer flow looks like this:&lt;/p&gt;

&lt;p&gt;Input Text&lt;br&gt;&lt;br&gt;
→ Tokens&lt;br&gt;&lt;br&gt;
→ Token Embeddings&lt;br&gt;&lt;br&gt;
→ Encoder&lt;br&gt;&lt;br&gt;
→ Contextual Representations&lt;br&gt;&lt;br&gt;
→ Decoder&lt;br&gt;&lt;br&gt;
→ Output Tokens&lt;/p&gt;

&lt;p&gt;More compactly:&lt;/p&gt;

&lt;p&gt;Transformer = tokenization + embeddings + attention + encoder-decoder structure&lt;/p&gt;

&lt;p&gt;The model first converts raw text into tokens.&lt;/p&gt;

&lt;p&gt;Then each token becomes a vector.&lt;/p&gt;

&lt;p&gt;Then attention updates each vector based on relationships with other tokens.&lt;/p&gt;

&lt;p&gt;The Encoder understands the input.&lt;/p&gt;

&lt;p&gt;The Decoder generates the output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation View
&lt;/h2&gt;

&lt;p&gt;At a high level, the architecture works like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;split input text into tokens

convert tokens into embedding vectors

pass embeddings through encoder layers

for each encoder layer:
    compute self-attention

    mix information across tokens

    apply feed-forward transformation

    produce contextual token representations

pass previous output tokens into decoder

for each decoder layer:
    apply masked self-attention

    attend to encoder output with cross-attention

    apply feed-forward transformation

project the final decoder output onto the vocabulary

predict the next output token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This structure is practical because attention can be computed with matrix operations.&lt;/p&gt;

&lt;p&gt;That makes Transformers much more GPU-friendly than step-by-step recurrent models.&lt;/p&gt;

&lt;p&gt;This is one of the biggest reasons Transformers scaled so well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concrete Example
&lt;/h2&gt;

&lt;p&gt;Take this sentence:&lt;/p&gt;

&lt;p&gt;I love you.&lt;/p&gt;

&lt;p&gt;An RNN reads it step by step:&lt;/p&gt;

&lt;p&gt;I → love → you&lt;/p&gt;

&lt;p&gt;A Transformer can compare all tokens directly.&lt;/p&gt;

&lt;p&gt;When processing “love”, it can look at both “I” and “you” at the same time.&lt;/p&gt;

&lt;p&gt;So “love” is not treated as an isolated word.&lt;/p&gt;

&lt;p&gt;It becomes a contextual representation.&lt;/p&gt;

&lt;p&gt;The model learns:&lt;/p&gt;

&lt;p&gt;Who loves?&lt;/p&gt;

&lt;p&gt;Who is loved?&lt;/p&gt;

&lt;p&gt;Which tokens are related?&lt;/p&gt;

&lt;p&gt;This matters because language is not just a sequence of words.&lt;/p&gt;

&lt;p&gt;Language is a structure of relationships.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sequence-to-Sequence View
&lt;/h2&gt;

&lt;p&gt;A Transformer can be understood as a sequence-to-sequence model.&lt;/p&gt;

&lt;p&gt;It receives one sequence.&lt;/p&gt;

&lt;p&gt;It produces another sequence.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;translation&lt;/li&gt;
&lt;li&gt;summarization&lt;/li&gt;
&lt;li&gt;question answering&lt;/li&gt;
&lt;li&gt;text generation&lt;/li&gt;
&lt;li&gt;code generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The input and output lengths do not need to match.&lt;/p&gt;

&lt;p&gt;That is important.&lt;/p&gt;

&lt;p&gt;A short sentence can become a long explanation.&lt;/p&gt;

&lt;p&gt;A long document can become a short summary.&lt;/p&gt;

&lt;p&gt;The model is not copying token positions.&lt;/p&gt;

&lt;p&gt;It is transforming meaning.&lt;/p&gt;

&lt;h2&gt;
  
  
  RNN vs Transformer
&lt;/h2&gt;

&lt;p&gt;This comparison explains why Transformers became dominant.&lt;/p&gt;

&lt;p&gt;RNN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;processes tokens one by one&lt;/li&gt;
&lt;li&gt;keeps information in a hidden state&lt;/li&gt;
&lt;li&gt;naturally handles order&lt;/li&gt;
&lt;li&gt;is hard to parallelize&lt;/li&gt;
&lt;li&gt;can struggle with long-range dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Transformer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;processes tokens in parallel&lt;/li&gt;
&lt;li&gt;compares tokens directly&lt;/li&gt;
&lt;li&gt;uses attention instead of recurrence&lt;/li&gt;
&lt;li&gt;scales better on GPUs&lt;/li&gt;
&lt;li&gt;models long-distance relationships more directly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference is simple:&lt;/p&gt;

&lt;p&gt;RNN = memory through sequence steps&lt;/p&gt;

&lt;p&gt;Transformer = relationships through attention&lt;/p&gt;

&lt;p&gt;This is why Transformers are not just “faster RNNs.”&lt;/p&gt;

&lt;p&gt;They represent sequence information in a different way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Encoder-Decoder Architecture
&lt;/h2&gt;

&lt;p&gt;The original Transformer uses an Encoder-Decoder structure.&lt;/p&gt;

&lt;p&gt;The Encoder reads the input sequence.&lt;/p&gt;

&lt;p&gt;The Decoder generates the output sequence.&lt;/p&gt;

&lt;p&gt;Encoder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;receives input tokens&lt;/li&gt;
&lt;li&gt;applies self-attention&lt;/li&gt;
&lt;li&gt;builds contextual representations&lt;/li&gt;
&lt;li&gt;outputs one vector per token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Decoder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;receives previously generated tokens&lt;/li&gt;
&lt;li&gt;uses masked self-attention&lt;/li&gt;
&lt;li&gt;attends to encoder output&lt;/li&gt;
&lt;li&gt;predicts the next token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Encoder answers:&lt;/p&gt;

&lt;p&gt;What does the input mean?&lt;/p&gt;

&lt;p&gt;The Decoder answers:&lt;/p&gt;

&lt;p&gt;What should be generated next?&lt;/p&gt;

&lt;h2&gt;
  
  
  Transformer Encoder
&lt;/h2&gt;

&lt;p&gt;The Transformer Encoder is a stack of repeated encoder layers.&lt;/p&gt;

&lt;p&gt;Each layer has two main parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-Attention&lt;/li&gt;
&lt;li&gt;Feed-Forward Network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Self-Attention lets each token look at other tokens in the same input.&lt;/p&gt;

&lt;p&gt;The Feed-Forward Network transforms each token representation independently.&lt;/p&gt;

&lt;p&gt;A simplified encoder layer looks like this:&lt;/p&gt;

&lt;p&gt;Input&lt;br&gt;&lt;br&gt;
→ Self-Attention&lt;br&gt;&lt;br&gt;
→ Feed-Forward Network&lt;br&gt;&lt;br&gt;
→ Contextual Output&lt;/p&gt;

&lt;p&gt;The important part is that every token representation becomes context-aware.&lt;/p&gt;

&lt;p&gt;A word is no longer just a word vector.&lt;/p&gt;

&lt;p&gt;It becomes a word vector shaped by the sentence around it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Word Embedding, Tokens, and Vocabulary
&lt;/h2&gt;

&lt;p&gt;A Transformer does not understand raw text directly.&lt;/p&gt;

&lt;p&gt;It first splits text into tokens.&lt;/p&gt;

&lt;p&gt;A token can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a word&lt;/li&gt;
&lt;li&gt;a subword&lt;/li&gt;
&lt;li&gt;a character-like unit&lt;/li&gt;
&lt;li&gt;a special symbol&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full set of possible tokens is called the vocabulary.&lt;/p&gt;

&lt;p&gt;Each token is mapped to a vector through an embedding layer.&lt;/p&gt;

&lt;p&gt;The flow looks like this:&lt;/p&gt;

&lt;p&gt;Raw text&lt;br&gt;&lt;br&gt;
→ Tokens&lt;br&gt;&lt;br&gt;
→ Token IDs&lt;br&gt;&lt;br&gt;
→ Embedding vectors&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;"I love you"&lt;br&gt;&lt;br&gt;
→ ["I", "love", "you"]&lt;br&gt;&lt;br&gt;
→ [token_id_1, token_id_2, token_id_3]&lt;br&gt;&lt;br&gt;
→ [vector_1, vector_2, vector_3]&lt;/p&gt;
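
&lt;p&gt;A minimal sketch of that pipeline. The vocabulary, the whitespace tokenizer, and the random embedding values are all assumptions for illustration. Real models use subword tokenizers and learned embeddings:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random

# Hypothetical tiny vocabulary mapping tokens to IDs.
vocab = {"[UNK]": 0, "I": 1, "love": 2, "you": 3}

def tokenize(text):
    # Whitespace split stands in for a real subword tokenizer.
    return text.split()

def encode(tokens):
    # Tokens missing from the vocabulary map to the unknown ID.
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

# Embedding table: one vector per vocabulary entry.
# Random values here; a trained model learns them.
rng = random.Random(0)
embedding_dim = 4
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)] for _ in vocab]

tokens = tokenize("I love you")
token_ids = encode(tokens)
vectors = [embedding_table[i] for i in token_ids]

print(tokens)
print(token_ids)
print(len(vectors), len(vectors[0]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The embedding layer is just a table lookup: token ID in, vector out.&lt;/p&gt;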

&lt;p&gt;This matters in practice.&lt;/p&gt;

&lt;p&gt;When building with LLMs, tokenization affects cost, context length, latency, and output behavior.&lt;/p&gt;

&lt;p&gt;So tokens are not just preprocessing details.&lt;/p&gt;

&lt;p&gt;They are part of the model interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transformer Decoder
&lt;/h2&gt;

&lt;p&gt;The Transformer Decoder generates output tokens.&lt;/p&gt;

&lt;p&gt;It has three main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Masked Self-Attention&lt;/li&gt;
&lt;li&gt;Cross-Attention&lt;/li&gt;
&lt;li&gt;Feed-Forward Network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Masked Self-Attention prevents the model from seeing future tokens.&lt;/p&gt;

&lt;p&gt;This is required for autoregressive generation.&lt;/p&gt;

&lt;p&gt;When predicting the next token, the model can only use previous tokens.&lt;/p&gt;
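
&lt;p&gt;One common way to implement the mask is to set scores for future positions to negative infinity before softmax, so their weights become zero. A small sketch with made-up scores:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

NEG_INF = float("-inf")

def causal_mask(scores):
    # Row i keeps columns 0..i; every later column is replaced with -inf.
    n = len(scores)
    return [row[: i + 1] + [NEG_INF] * (n - i - 1) for i, row in enumerate(scores)]

def softmax(row):
    top = max(row)
    exps = [math.exp(s - top) for s in row]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw attention scores for 3 tokens (row = query, column = key).
scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]

weights = [softmax(row) for row in causal_mask(scores)]
for row in weights:
    print([round(w, 3) for w in row])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first token can only attend to itself. The last token can attend to everything before it.&lt;/p&gt;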

&lt;p&gt;The flow looks like this:&lt;/p&gt;

&lt;p&gt;Previous output tokens&lt;br&gt;&lt;br&gt;
→ Masked Self-Attention&lt;br&gt;&lt;br&gt;
→ Cross-Attention with Encoder Output&lt;br&gt;&lt;br&gt;
→ Feed-Forward Network&lt;br&gt;&lt;br&gt;
→ Next Token Prediction&lt;/p&gt;

&lt;p&gt;This is how the model generates text step by step.&lt;/p&gt;

&lt;p&gt;It predicts one token.&lt;/p&gt;

&lt;p&gt;Then it appends that token.&lt;/p&gt;

&lt;p&gt;Then it predicts the next token.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-Attention
&lt;/h2&gt;

&lt;p&gt;Cross-Attention connects the Decoder to the Encoder.&lt;/p&gt;

&lt;p&gt;The Decoder asks:&lt;/p&gt;

&lt;p&gt;Which part of the input should I focus on right now?&lt;/p&gt;

&lt;p&gt;This is especially useful in translation.&lt;/p&gt;

&lt;p&gt;The output word order may be different from the input word order.&lt;/p&gt;

&lt;p&gt;A phrase in one language may correspond to several words in another language.&lt;/p&gt;

&lt;p&gt;Cross-Attention helps the Decoder align output generation with the encoded input.&lt;/p&gt;

&lt;p&gt;Without Cross-Attention, the Decoder would generate mainly from its own previous tokens.&lt;/p&gt;

&lt;p&gt;With Cross-Attention, it can reference the input meaning directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context Length
&lt;/h2&gt;

&lt;p&gt;Context length means:&lt;/p&gt;

&lt;p&gt;How many tokens the model can process at once.&lt;/p&gt;

&lt;p&gt;A longer context allows the model to use more information.&lt;/p&gt;

&lt;p&gt;This is useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;long documents&lt;/li&gt;
&lt;li&gt;long conversations&lt;/li&gt;
&lt;li&gt;code files&lt;/li&gt;
&lt;li&gt;retrieval-augmented generation&lt;/li&gt;
&lt;li&gt;summarization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But longer context is not free.&lt;/p&gt;

&lt;p&gt;Attention compares tokens with other tokens.&lt;/p&gt;

&lt;p&gt;So with full attention, computational cost grows roughly quadratically as the sequence gets longer.&lt;/p&gt;

&lt;p&gt;This is why context length is both powerful and expensive.&lt;/p&gt;

&lt;p&gt;In real systems, context length affects memory usage, latency, and price.&lt;/p&gt;

&lt;h2&gt;
  
  
  Naive vs Practical View
&lt;/h2&gt;

&lt;p&gt;Naive view:&lt;/p&gt;

&lt;p&gt;A Transformer is a model that takes text and returns text.&lt;/p&gt;

&lt;p&gt;Practical developer view:&lt;/p&gt;

&lt;p&gt;A Transformer is a token-processing system with attention, context limits, and generation constraints.&lt;/p&gt;

&lt;p&gt;Naive mindset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input text
get output text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Practical mindset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tokenize input

manage context length

understand attention cost

choose decoding strategy

optimize inference

control output quality
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This matters because production AI systems are not only about model accuracy.&lt;/p&gt;

&lt;p&gt;They are also about speed, memory, cost, and reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Conditions and Limits
&lt;/h2&gt;

&lt;p&gt;Transformers are powerful, but they have important constraints.&lt;/p&gt;

&lt;p&gt;They need tokenization before processing text.&lt;/p&gt;

&lt;p&gt;They need positional information because attention alone does not know order.&lt;/p&gt;

&lt;p&gt;They can become expensive with long context.&lt;/p&gt;

&lt;p&gt;Decoder generation is sequential during inference.&lt;/p&gt;

&lt;p&gt;Context length limits how much information the model can use at once.&lt;/p&gt;

&lt;p&gt;These limits explain why modern LLM engineering focuses so much on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;efficient attention&lt;/li&gt;
&lt;li&gt;KV Cache&lt;/li&gt;
&lt;li&gt;long-context optimization&lt;/li&gt;
&lt;li&gt;better tokenization&lt;/li&gt;
&lt;li&gt;inference speed&lt;/li&gt;
&lt;li&gt;memory reduction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture is elegant.&lt;/p&gt;

&lt;p&gt;But scaling it requires engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transformer vs Traditional Seq2Seq
&lt;/h2&gt;

&lt;p&gt;Traditional Seq2Seq:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;often uses RNN-based Encoder and Decoder&lt;/li&gt;
&lt;li&gt;compresses input into hidden states&lt;/li&gt;
&lt;li&gt;processes sequence step by step&lt;/li&gt;
&lt;li&gt;may lose information in long sequences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Transformer Seq2Seq:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;uses attention-based Encoder and Decoder&lt;/li&gt;
&lt;li&gt;keeps contextual representations for all tokens&lt;/li&gt;
&lt;li&gt;supports parallel computation&lt;/li&gt;
&lt;li&gt;models token relationships directly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key difference:&lt;/p&gt;

&lt;p&gt;Traditional Seq2Seq compresses through recurrence.&lt;/p&gt;

&lt;p&gt;Transformer Seq2Seq connects through attention.&lt;/p&gt;

&lt;p&gt;That is why Transformers became the foundation for modern NLP systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;A Transformer works by turning tokens into contextual representations.&lt;/p&gt;

&lt;p&gt;The Encoder understands the input.&lt;/p&gt;

&lt;p&gt;The Decoder generates the output.&lt;/p&gt;

&lt;p&gt;Self-Attention models relationships inside a sequence.&lt;/p&gt;

&lt;p&gt;Cross-Attention connects generated output to encoded input.&lt;/p&gt;

&lt;p&gt;Context length controls how much information the model can use.&lt;/p&gt;

&lt;p&gt;If you remember one structure, remember this:&lt;/p&gt;

&lt;p&gt;Text → Tokens → Embeddings → Attention → Contextual Representations → Output&lt;/p&gt;

&lt;p&gt;That is the backbone of Transformer architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;When learning Transformers, which part helped you understand the architecture fastest?&lt;/p&gt;

&lt;p&gt;The Encoder-Decoder structure, Self-Attention, tokenization, or the generation loop?&lt;/p&gt;

&lt;p&gt;Originally published at zeromathai.com.&lt;br&gt;
Original article: &lt;a href="https://zeromathai.com/en/transformer-architecture-core-components-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/transformer-architecture-core-components-en/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub Resources&lt;br&gt;
AI diagrams, study notes, and visual guides:&lt;br&gt;
&lt;a href="https://github.com/zeromathai/zeromathai-ai" rel="noopener noreferrer"&gt;https://github.com/zeromathai/zeromathai-ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>transformers</category>
    </item>
    <item>
      <title>How Transformers Work — From Self-Attention to Modern LLM Architecture</title>
      <dc:creator>zeromathai</dc:creator>
      <pubDate>Mon, 15 Jun 2026 15:12:47 +0000</pubDate>
      <link>https://dev.to/zeromathai/how-transformers-work-from-self-attention-to-modern-llm-architecture-4j1o</link>
      <guid>https://dev.to/zeromathai/how-transformers-work-from-self-attention-to-modern-llm-architecture-4j1o</guid>
      <description>&lt;p&gt;Transformers changed AI because they stopped reading sequences one token at a time.&lt;/p&gt;

&lt;p&gt;Instead of moving step by step like an RNN, a Transformer compares tokens directly.&lt;/p&gt;

&lt;p&gt;That one design shift made modern LLMs possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Idea
&lt;/h2&gt;

&lt;p&gt;A Transformer is a neural network architecture built around attention.&lt;/p&gt;

&lt;p&gt;It looks at a sequence of tokens and learns how those tokens relate to each other.&lt;/p&gt;

&lt;p&gt;This matters because language is contextual.&lt;/p&gt;

&lt;p&gt;A word is not understood alone.&lt;/p&gt;

&lt;p&gt;It is understood through its relationship with surrounding words.&lt;/p&gt;

&lt;p&gt;That is why Self-Attention became the core mechanism.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Structure
&lt;/h2&gt;

&lt;p&gt;A simplified Transformer flow looks like this:&lt;/p&gt;

&lt;p&gt;Tokens → Embeddings → Positional Information → Self-Attention → Feed-Forward Network → Output&lt;/p&gt;

&lt;p&gt;More compactly:&lt;/p&gt;

&lt;p&gt;Transformer = token representations + attention + position + stacked blocks&lt;/p&gt;

&lt;p&gt;The model first converts text into token vectors.&lt;/p&gt;

&lt;p&gt;Then it injects position information.&lt;/p&gt;

&lt;p&gt;Then each Transformer block updates the token representations using attention and feed-forward layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation View
&lt;/h2&gt;

&lt;p&gt;At a high level, a Transformer processes text like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;split text into tokens

convert tokens into embeddings

add positional information

for each Transformer block:
    compute Self-Attention

    mix token information

    apply feed-forward transformation

    keep stable flow with residual connections and normalization

produce contextual token representations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For decoder-based LLMs, generation continues like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predict next token

append generated token

reuse cached keys and values

repeat until stopping condition
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is why Transformers are practical for large-scale generation.&lt;/p&gt;

&lt;p&gt;They can learn relationships across many tokens.&lt;/p&gt;

&lt;p&gt;And with caching, they can generate efficiently.&lt;/p&gt;
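
&lt;p&gt;A simplified sketch of key-value caching. The &lt;code&gt;KVCache&lt;/code&gt; class and the toy vectors are hypothetical, and each token vector doubles as its own query, key, and value. The point is that each step computes only the new token's key and value, then attends over everything already stored:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def softmax(row):
    top = max(row)
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

class KVCache:
    # Stores keys and values of all tokens processed so far.
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def attend(q, cache):
    # One query attends over every cached key and value.
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in cache.keys]
    weights = softmax(scores)
    dim = len(cache.values[0])
    return [sum(w * v[j] for w, v in zip(weights, cache.values)) for j in range(dim)]

# Made-up token vectors, processed one generation step at a time.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
for step, x in enumerate(token_vectors):
    cache.append(x, x)      # only the new token's key and value are added
    out = attend(x, cache)  # earlier keys and values are reused, not recomputed
    print(step, len(cache.keys), [round(o, 3) for o in out])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Without the cache, every step would recompute keys and values for the whole prefix.&lt;/p&gt;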

&lt;h2&gt;
  
  
  Concrete Example
&lt;/h2&gt;

&lt;p&gt;Take this sentence:&lt;/p&gt;

&lt;p&gt;The animal did not cross the street because it was tired.&lt;/p&gt;

&lt;p&gt;What does “it” refer to?&lt;/p&gt;

&lt;p&gt;A strictly step-by-step model may struggle when the answer depends on tokens seen much earlier.&lt;/p&gt;

&lt;p&gt;Self-Attention lets the token “it” compare itself with other tokens like “animal” and “street.”&lt;/p&gt;

&lt;p&gt;The model can assign stronger attention to the token that best explains the meaning.&lt;/p&gt;

&lt;p&gt;That is the intuition.&lt;/p&gt;

&lt;p&gt;Attention lets tokens ask:&lt;/p&gt;

&lt;p&gt;Which other tokens matter for understanding me?&lt;/p&gt;

&lt;h2&gt;
  
  
  RNN vs Transformer
&lt;/h2&gt;

&lt;p&gt;This comparison explains why Transformers became so important.&lt;/p&gt;

&lt;p&gt;RNN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;processes tokens step by step&lt;/li&gt;
&lt;li&gt;carries information through hidden state&lt;/li&gt;
&lt;li&gt;naturally captures order&lt;/li&gt;
&lt;li&gt;is harder to parallelize&lt;/li&gt;
&lt;li&gt;can struggle with long-range dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Transformer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;processes tokens in parallel&lt;/li&gt;
&lt;li&gt;compares tokens directly through attention&lt;/li&gt;
&lt;li&gt;needs positional information for order&lt;/li&gt;
&lt;li&gt;scales well on GPUs&lt;/li&gt;
&lt;li&gt;handles long-range relationships more flexibly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the Transformer was not just faster.&lt;/p&gt;

&lt;p&gt;It changed how sequence relationships are represented.&lt;/p&gt;

&lt;p&gt;RNNs remember through recurrence.&lt;/p&gt;

&lt;p&gt;Transformers relate through attention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-Attention
&lt;/h2&gt;

&lt;p&gt;Self-Attention computes relationships between tokens in the same sequence.&lt;/p&gt;

&lt;p&gt;Each token creates three vectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query&lt;/li&gt;
&lt;li&gt;Key&lt;/li&gt;
&lt;li&gt;Value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The intuition is simple:&lt;/p&gt;

&lt;p&gt;Query = what this token is looking for&lt;/p&gt;

&lt;p&gt;Key = what each token offers for matching&lt;/p&gt;

&lt;p&gt;Value = information to retrieve if the match is strong&lt;/p&gt;

&lt;p&gt;The core formula is:&lt;/p&gt;

&lt;p&gt;Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;compare queries and keys&lt;/li&gt;
&lt;li&gt;turn scores into weights&lt;/li&gt;
&lt;li&gt;use those weights to combine values&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is how each token becomes context-aware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Head Attention
&lt;/h2&gt;

&lt;p&gt;One attention calculation is useful.&lt;/p&gt;

&lt;p&gt;But one view is not enough.&lt;/p&gt;

&lt;p&gt;Multi-Head Attention runs several attention heads in parallel.&lt;/p&gt;

&lt;p&gt;Each head can focus on a different type of relationship.&lt;/p&gt;

&lt;p&gt;One head may track syntax.&lt;/p&gt;

&lt;p&gt;Another may track semantic similarity.&lt;/p&gt;

&lt;p&gt;Another may track long-distance references.&lt;/p&gt;

&lt;p&gt;Then the outputs are combined into one representation.&lt;/p&gt;

&lt;p&gt;This makes attention richer than a single similarity calculation.&lt;/p&gt;
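
&lt;p&gt;A rough sketch of the split, attend, and combine pattern. It is deliberately simplified: it slices each vector into heads and uses the same slice as query, key, and value. Real implementations apply learned projections per head plus an output projection, which are omitted here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def softmax(row):
    top = max(row)
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

def split_heads(X, num_heads):
    # Slice each token vector into num_heads equal chunks.
    head_dim = len(X[0]) // num_heads
    return [[x[h * head_dim:(h + 1) * head_dim] for x in X] for h in range(num_heads)]

def multi_head_attention(X, num_heads):
    # Simplification: Q = K = V = the head's slice, with no learned projections.
    heads = [attention(Xh, Xh, Xh) for Xh in split_heads(X, num_heads)]
    # Concatenate the per-head outputs back into one vector per token.
    return [sum((head[t] for head in heads), []) for t in range(len(X))]

# Three tokens with made-up 4-dimensional vectors, split across 2 heads.
X = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]

out = multi_head_attention(X, num_heads=2)
print(len(out), len(out[0]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each head computes its own attention weights, so two heads can attend to different tokens for the same position.&lt;/p&gt;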

&lt;h2&gt;
  
  
  Why Positional Encoding Is Needed
&lt;/h2&gt;

&lt;p&gt;Self-Attention does not automatically know token order.&lt;/p&gt;

&lt;p&gt;If you only give it a bag of token embeddings, the model needs another signal to know which token came first.&lt;/p&gt;

&lt;p&gt;That is why positional information is added.&lt;/p&gt;

&lt;p&gt;Common positional methods include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Absolute Positional Embedding (APE)&lt;/li&gt;
&lt;li&gt;Relative Positional Embedding (RPE)&lt;/li&gt;
&lt;li&gt;Rotary Positional Embedding (RoPE)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;APE gives each position its own vector.&lt;/p&gt;

&lt;p&gt;RPE focuses on relative distance between tokens.&lt;/p&gt;

&lt;p&gt;RoPE rotates query and key vectors based on position, making relative position work naturally inside attention.&lt;/p&gt;

&lt;p&gt;This is why RoPE became common in modern LLMs.&lt;/p&gt;
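
&lt;p&gt;The rotation idea can be checked with a two-dimensional sketch. The vectors and the frequency &lt;code&gt;theta&lt;/code&gt; are arbitrary illustrative values. Real RoPE rotates many pairs of dimensions, each at its own frequency:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def rotate(vec, position, theta=0.5):
    # Rotate a 2-dimensional vector by an angle proportional to its position.
    angle = position * theta
    x, y = vec
    return [x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0]   # made-up query vector
k = [0.5, -1.0]  # made-up key vector

# Same relative distance (3), different absolute positions.
score_a = dot(rotate(q, 5), rotate(k, 2))
score_b = dot(rotate(q, 13), rotate(k, 10))

# Different relative distance (1).
score_c = dot(rotate(q, 5), rotate(k, 4))

print(round(score_a, 6), round(score_b, 6), round(score_c, 6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first two scores match because the dot product of two rotated vectors depends only on the difference between their angles. No separate relative-position term is needed.&lt;/p&gt;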

&lt;h2&gt;
  
  
  Encoder, Decoder, and LLMs
&lt;/h2&gt;

&lt;p&gt;The original Transformer used an Encoder-Decoder structure.&lt;/p&gt;

&lt;p&gt;Encoder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reads the input&lt;/li&gt;
&lt;li&gt;builds contextual representations&lt;/li&gt;
&lt;li&gt;works well for understanding tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Decoder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generates output tokens&lt;/li&gt;
&lt;li&gt;uses causal masking&lt;/li&gt;
&lt;li&gt;works well for autoregressive generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Encoder-Decoder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;connects input understanding with output generation&lt;/li&gt;
&lt;li&gt;useful for translation-style tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern GPT-style LLMs are mostly decoder-only.&lt;/p&gt;

&lt;p&gt;They generate text one token at a time.&lt;/p&gt;

&lt;p&gt;The decoder predicts the next token, appends it, and repeats.&lt;/p&gt;
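
&lt;p&gt;The loop itself is short. Here is a sketch with a hypothetical toy "model" that scores the next token from the last token only; a real decoder conditions on the whole prefix, and the table below is invented for the example.&lt;/p&gt;

```python
# Hypothetical toy "model": next-token scores keyed by the last token only.
NEXT_SCORES = {
    "the": {"cat": 2.0, "dog": 1.0},
    "cat": {"sat": 3.0, "ran": 0.5},
    "sat": {"END": 4.0},
    "dog": {"ran": 2.0},
    "ran": {"END": 1.0},
}

def generate(prompt, max_new_tokens=10):
    """Predict the next token, append it, and repeat until END or the limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = NEXT_SCORES[tokens[-1]]
        next_token = max(scores, key=scores.get)  # pick the top-scoring token
        if next_token == "END":
            break
        tokens.append(next_token)
    return tokens

result = generate(["the"])
print(result)
```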

&lt;h2&gt;
  
  
  Decoding Strategies
&lt;/h2&gt;

&lt;p&gt;Once the model produces logits, it needs to choose the next token.&lt;/p&gt;

&lt;p&gt;Different decoding strategies create different behavior.&lt;/p&gt;

&lt;p&gt;Greedy decoding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;chooses the most likely token&lt;/li&gt;
&lt;li&gt;simple and deterministic&lt;/li&gt;
&lt;li&gt;can be repetitive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beam search:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keeps multiple candidate sequences&lt;/li&gt;
&lt;li&gt;useful for structured generation&lt;/li&gt;
&lt;li&gt;tends to produce less diverse text than sampling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Top-k sampling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;samples from the top k likely tokens&lt;/li&gt;
&lt;li&gt;adds diversity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Top-p sampling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;samples from the smallest set of tokens whose cumulative probability reaches a threshold p&lt;/li&gt;
&lt;li&gt;adapts the candidate set dynamically&lt;/li&gt;
&lt;/ul&gt;
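
&lt;p&gt;The strategies above differ mainly in how they pick the candidate set. A rough sketch in plain Python, where the logits, the token names, and the helper functions are assumptions made for the example:&lt;/p&gt;

```python
import math
import random

def softmax(logits):
    """Convert a dict of logits into a dict of probabilities."""
    peak = max(logits.values())
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def greedy(logits):
    """Pick the single most likely token."""
    return max(logits, key=logits.get)

def top_k_candidates(probs, k):
    """Keep the k most probable tokens."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

def top_p_candidates(probs, p):
    """Keep the smallest set of top tokens whose total probability reaches p."""
    kept, total = [], 0.0
    for tok in sorted(probs, key=probs.get, reverse=True):
        kept.append(tok)
        total += probs[tok]
        if total >= p:
            break
    return kept

def sample(probs, candidates, rng):
    """Sample one candidate, weighted by its probability."""
    weights = [probs[tok] for tok in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

logits = {"cat": 2.0, "dog": 1.5, "car": 0.2, "sky": -1.0}
probs = softmax(logits)
rng = random.Random(0)  # fixed seed so the run is repeatable
choice_k = sample(probs, top_k_candidates(probs, 2), rng)
choice_p = sample(probs, top_p_candidates(probs, 0.9), rng)
```

&lt;p&gt;With these logits, top-k with k = 2 always keeps two tokens, while top-p with p = 0.9 keeps three because the top two together fall just short of 0.9.&lt;/p&gt;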

&lt;p&gt;So generation quality is not only about the model.&lt;/p&gt;

&lt;p&gt;It also depends on decoding.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Efficiency Problem
&lt;/h2&gt;

&lt;p&gt;Full Attention is powerful but expensive.&lt;/p&gt;

&lt;p&gt;If the sequence length is n, every token attends to every other token, so the cost grows roughly as O(n^2).&lt;/p&gt;

&lt;p&gt;That means longer context becomes expensive quickly.&lt;/p&gt;

&lt;p&gt;This is why efficient attention matters.&lt;/p&gt;

&lt;p&gt;Local Attention reduces the view to nearby tokens.&lt;/p&gt;

&lt;p&gt;Sparse Attention computes only selected attention links.&lt;/p&gt;

&lt;p&gt;FlashAttention keeps the exact attention formula but computes it in tiles, so the full n-by-n score matrix never has to be written to slow GPU memory.&lt;/p&gt;

&lt;p&gt;The key idea:&lt;/p&gt;

&lt;p&gt;Do less unnecessary work, or move data more efficiently.&lt;/p&gt;

&lt;p&gt;Both make longer context more practical.&lt;/p&gt;
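
&lt;p&gt;A quick count makes the scaling concrete. This sketch counts query-key pairs only and ignores constants, heads, and hardware effects; the window size of 128 is an arbitrary choice for the example.&lt;/p&gt;

```python
def full_attention_links(n):
    """Every token attends to every token: n * n score computations."""
    return n * n

def local_attention_links(n, window):
    """Each token attends only to tokens within `window` positions on either side."""
    total = 0
    for i in range(n):
        start = max(0, i - window)
        stop = min(n, i + window + 1)
        total += stop - start
    return total

for n in (1_000, 2_000, 4_000):
    print(n, full_attention_links(n), local_attention_links(n, window=128))
```

&lt;p&gt;Doubling n quadruples the full-attention count, while the local count grows roughly linearly.&lt;/p&gt;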

&lt;h2&gt;
  
  
  KV Cache
&lt;/h2&gt;

&lt;p&gt;Autoregressive generation has another problem.&lt;/p&gt;

&lt;p&gt;When generating one token at a time, the model repeatedly needs past key and value tensors.&lt;/p&gt;

&lt;p&gt;KV Cache stores those tensors.&lt;/p&gt;

&lt;p&gt;So the model does not recompute them from scratch at every step.&lt;/p&gt;

&lt;p&gt;The flow looks like this:&lt;/p&gt;

&lt;p&gt;Generated tokens → cached keys and values → new query attends to cache → next token&lt;/p&gt;

&lt;p&gt;This makes inference faster.&lt;/p&gt;
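
&lt;p&gt;A toy single-head sketch of that flow, assuming stand-in projections that just scale the input (a real model uses learned weight matrices). The class and function names are hypothetical.&lt;/p&gt;

```python
import math

def project(token_vec, scale):
    """Stand-in for a learned key or value projection (assumption: just scales)."""
    return [scale * x for x in token_vec]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over all cached entries."""
    d = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / d for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[i] for e, v in zip(exps, values)) for i in range(dim)]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # number of key/value projections computed so far

    def step(self, token_vec):
        """Project only the new token, append it, then attend over the whole cache."""
        self.keys.append(project(token_vec, 0.5))
        self.values.append(project(token_vec, 2.0))
        self.projections += 1
        return attend(token_vec, self.keys, self.values)

cache = KVCache()
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs = [cache.step(tok) for tok in sequence]
recompute_cost = sum(range(1, len(sequence) + 1))  # projections without a cache
```

&lt;p&gt;For four tokens the cache needs four projections, while recomputing every prefix from scratch would need ten.&lt;/p&gt;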

&lt;p&gt;But it creates a memory problem.&lt;/p&gt;

&lt;p&gt;Longer context means a larger KV Cache.&lt;/p&gt;

&lt;p&gt;That is why modern LLMs use techniques like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-Query Attention&lt;/li&gt;
&lt;li&gt;Grouped-Query Attention&lt;/li&gt;
&lt;li&gt;Multi-Head Latent Attention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These methods reduce the memory cost of storing key-value information.&lt;/p&gt;
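
&lt;p&gt;A back-of-the-envelope sketch shows why sharing key-value heads helps. The model shape below is hypothetical and 2 bytes per value assumes 16-bit storage. Multi-Head Latent Attention compresses the cache in a different way and is not modeled here.&lt;/p&gt;

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes for the cached keys and values of one sequence (factor 2 = keys + values)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=32, head_dim=128, seq_len=8192)
mha = kv_cache_bytes(kv_heads=32, **shape)  # one K/V head per query head
gqa = kv_cache_bytes(kv_heads=8, **shape)   # 4 query heads share each K/V head
mqa = kv_cache_bytes(kv_heads=1, **shape)   # all query heads share one K/V head

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 2**30:.3f} GiB")
```

&lt;p&gt;Under these assumptions the cache shrinks in direct proportion to the number of key-value heads.&lt;/p&gt;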

&lt;h2&gt;
  
  
  Modern Transformer Blocks
&lt;/h2&gt;

&lt;p&gt;Modern LLMs still use the Transformer idea.&lt;/p&gt;

&lt;p&gt;But the block has evolved.&lt;/p&gt;

&lt;p&gt;A typical modern block looks like this:&lt;/p&gt;

&lt;p&gt;Input&lt;br&gt;
→ Pre-Normalization (RMSNorm or LayerNorm)&lt;br&gt;
→ Self-Attention with GQA and RoPE&lt;br&gt;
→ Residual Connection&lt;br&gt;
→ Pre-Normalization (RMSNorm or LayerNorm)&lt;br&gt;
→ Feed-Forward Network with SwiGLU or Mixture of Experts&lt;br&gt;
→ Residual Connection&lt;/p&gt;
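
&lt;p&gt;The wiring can be sketched for a single token vector. This is a simplified illustration: the attention and feed-forward sub-layers are hypothetical stand-ins without learned weights, and real attention mixes information across tokens rather than acting on one vector.&lt;/p&gt;

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Scale x by its root mean square; no mean subtraction, unlike LayerNorm."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu(gate, up):
    """SwiGLU gating: SiLU(gate) times up, elementwise.

    In a full FFN, gate and up come from two learned projections and a third
    projection maps the result back; those projections are omitted here.
    """
    return [silu(g) * u for g, u in zip(gate, up)]

def block(x, attention, feed_forward):
    """Pre-norm residual block: x + f(norm(x)), applied twice."""
    x = [a + b for a, b in zip(x, attention(rms_norm(x)))]
    x = [a + b for a, b in zip(x, feed_forward(rms_norm(x)))]
    return x

def toy_attention(h):
    """Hypothetical stand-in for the attention sub-layer."""
    return [0.1 * v for v in h]

def toy_ffn(h):
    """Hypothetical stand-in for the feed-forward sub-layer."""
    return swiglu(h, h)

normed = rms_norm([3.0, 4.0, 0.0, 0.0])
out = block([3.0, 4.0, 0.0, 0.0], toy_attention, toy_ffn)
```

&lt;p&gt;Normalizing before each sub-layer and adding the result back to the untouched input is what keeps the residual path clean in deep stacks.&lt;/p&gt;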

&lt;p&gt;Important upgrades include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RMSNorm for simpler normalization&lt;/li&gt;
&lt;li&gt;RoPE for positional representation&lt;/li&gt;
&lt;li&gt;GQA for efficient inference&lt;/li&gt;
&lt;li&gt;SwiGLU for stronger feed-forward layers&lt;/li&gt;
&lt;li&gt;MoE for sparse expert-based scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So today’s Transformer is not a direct copy of the 2017 design.&lt;/p&gt;

&lt;p&gt;It is an evolved architecture family.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transformer vs Modern LLM Architecture
&lt;/h2&gt;

&lt;p&gt;Original Transformer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;encoder-decoder structure&lt;/li&gt;
&lt;li&gt;standard multi-head attention&lt;/li&gt;
&lt;li&gt;sinusoidal positional encoding&lt;/li&gt;
&lt;li&gt;layer normalization applied after each sub-layer&lt;/li&gt;
&lt;li&gt;dense feed-forward layers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern LLM architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;often decoder-only&lt;/li&gt;
&lt;li&gt;causal self-attention&lt;/li&gt;
&lt;li&gt;RoPE&lt;/li&gt;
&lt;li&gt;RMSNorm&lt;/li&gt;
&lt;li&gt;GQA or related KV-sharing methods&lt;/li&gt;
&lt;li&gt;SwiGLU&lt;/li&gt;
&lt;li&gt;sometimes Mixture of Experts&lt;/li&gt;
&lt;li&gt;KV Cache for inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core idea stayed the same.&lt;/p&gt;

&lt;p&gt;The engineering changed dramatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended Learning Order
&lt;/h2&gt;

&lt;p&gt;If Transformer architecture feels too large, learn it in this order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Attention Mechanism&lt;/li&gt;
&lt;li&gt;Self-Attention&lt;/li&gt;
&lt;li&gt;QKV Computation&lt;/li&gt;
&lt;li&gt;Multi-Head Attention&lt;/li&gt;
&lt;li&gt;Positional Encoding&lt;/li&gt;
&lt;li&gt;Encoder-Decoder Architecture&lt;/li&gt;
&lt;li&gt;Transformer Decoder&lt;/li&gt;
&lt;li&gt;KV Cache&lt;/li&gt;
&lt;li&gt;Efficient Attention&lt;/li&gt;
&lt;li&gt;Modern Transformer Block&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This order works because you first understand the relationship mechanism.&lt;/p&gt;

&lt;p&gt;Then you understand generation.&lt;/p&gt;

&lt;p&gt;Then you understand why modern LLMs needed efficiency upgrades.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;The Transformer is the architecture language of modern LLMs.&lt;/p&gt;

&lt;p&gt;The shortest version is:&lt;/p&gt;

&lt;p&gt;Transformer = attention + position + stacked blocks + efficient generation&lt;/p&gt;

&lt;p&gt;Self-Attention computes token relationships.&lt;/p&gt;

&lt;p&gt;Positional encoding injects order.&lt;/p&gt;

&lt;p&gt;The decoder generates tokens.&lt;/p&gt;

&lt;p&gt;KV Cache makes autoregressive inference practical.&lt;/p&gt;

&lt;p&gt;Modern upgrades like RoPE, RMSNorm, GQA, SwiGLU, and MoE make the architecture scalable.&lt;/p&gt;

&lt;p&gt;If you remember one idea, remember this:&lt;/p&gt;

&lt;p&gt;Transformers work by turning a sequence into a set of contextual relationships, then refining those relationships through stacked attention-based blocks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;When learning Transformers, do you find it easier to start from the attention formula, the decoder generation loop, or the modern LLM block structure?&lt;/p&gt;

&lt;p&gt;Originally published at zeromathai.com.&lt;br&gt;
Original article: &lt;a href="https://zeromathai.com/en/transformer-architecture-overview-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/transformer-architecture-overview-en/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub Resources&lt;br&gt;
AI diagrams, study notes, and visual guides:&lt;br&gt;
&lt;a href="https://github.com/zeromathai/zeromathai-ai" rel="noopener noreferrer"&gt;https://github.com/zeromathai/zeromathai-ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>What AI Really Is — From Turing Test to Deep Learning</title>
      <dc:creator>zeromathai</dc:creator>
      <pubDate>Mon, 18 May 2026 23:57:30 +0000</pubDate>
      <link>https://dev.to/zeromathai/what-ai-really-is-from-turing-test-to-deep-learning-39o2</link>
      <guid>https://dev.to/zeromathai/what-ai-really-is-from-turing-test-to-deep-learning-39o2</guid>
      <description>&lt;p&gt;AI is not just chatbots or neural networks.&lt;/p&gt;

&lt;p&gt;It is a long-running attempt to answer one question:&lt;/p&gt;

&lt;p&gt;Can a machine behave intelligently?&lt;/p&gt;

&lt;p&gt;That question shaped everything from symbolic AI to modern deep learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Idea
&lt;/h2&gt;

&lt;p&gt;Artificial Intelligence is the field of building systems that can perform tasks requiring intelligence.&lt;/p&gt;

&lt;p&gt;That can include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reasoning&lt;/li&gt;
&lt;li&gt;learning&lt;/li&gt;
&lt;li&gt;planning&lt;/li&gt;
&lt;li&gt;perception&lt;/li&gt;
&lt;li&gt;language understanding&lt;/li&gt;
&lt;li&gt;decision-making&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But AI is not one single technique.&lt;/p&gt;

&lt;p&gt;It is a collection of paradigms.&lt;/p&gt;

&lt;p&gt;Different eras of AI tried different answers to the same question.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Structure
&lt;/h2&gt;

&lt;p&gt;A simple map of AI looks like this:&lt;/p&gt;

&lt;p&gt;Turing Test → Symbolic AI → AI Winter → Neural Networks → Deep Learning → Modern AI&lt;/p&gt;

&lt;p&gt;More compactly:&lt;/p&gt;

&lt;p&gt;AI = reasoning systems + learning systems + decision systems&lt;/p&gt;

&lt;p&gt;The important shift is this:&lt;/p&gt;

&lt;p&gt;AI moved from hand-coded rules toward data-driven learning.&lt;/p&gt;

&lt;p&gt;That shift explains why modern AI looks so different from early AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation View
&lt;/h2&gt;

&lt;p&gt;At a high level, an AI system often works like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;receive input from the environment

represent the problem internally

apply rules, search, or learned patterns

make a prediction or decision

act or generate an output

improve through feedback or training
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is why AI is broader than one model.&lt;/p&gt;

&lt;p&gt;A chatbot, a search algorithm, and a recommendation system may look different.&lt;/p&gt;

&lt;p&gt;But they all transform input into decisions or outputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concrete Example
&lt;/h2&gt;

&lt;p&gt;Imagine a spam detection system.&lt;/p&gt;

&lt;p&gt;A symbolic AI approach might use explicit rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if subject contains suspicious phrase, increase risk&lt;/li&gt;
&lt;li&gt;if sender is unknown, increase risk&lt;/li&gt;
&lt;li&gt;if many links exist, increase risk&lt;/li&gt;
&lt;/ul&gt;
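
&lt;p&gt;Those rules translate almost directly into code. A minimal sketch, where the phrases, the sender list, the link count, and the threshold are all made-up values for illustration:&lt;/p&gt;

```python
# Hypothetical phrases, senders, and thresholds, chosen only for illustration.
SUSPICIOUS_PHRASES = ("free money", "act now", "winner")
KNOWN_SENDERS = {"alice@example.com", "bob@example.com"}

def spam_risk(subject, sender, body):
    """Add one point of risk for each hand-written rule that fires."""
    risk = 0
    if any(phrase in subject.lower() for phrase in SUSPICIOUS_PHRASES):
        risk += 1
    if sender not in KNOWN_SENDERS:
        risk += 1
    if body.count("http") >= 3:
        risk += 1
    return risk

def is_spam(subject, sender, body, threshold=2):
    """Flag the message when enough rules fire."""
    return spam_risk(subject, sender, body) >= threshold

flagged = is_spam("You are a WINNER", "promo@example.net",
                  "http://a.example http://b.example http://c.example")
trusted = is_spam("Lunch tomorrow?", "alice@example.com", "See you at noon.")
```

&lt;p&gt;Every rule is visible and easy to inspect, but each new spam pattern needs a new hand-written rule.&lt;/p&gt;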

&lt;p&gt;A machine learning approach learns patterns from labeled examples.&lt;/p&gt;

&lt;p&gt;A deep learning approach may learn internal representations directly from text.&lt;/p&gt;

&lt;p&gt;Same task.&lt;/p&gt;

&lt;p&gt;Different AI paradigm.&lt;/p&gt;

&lt;h2&gt;
  
  
  Symbolic AI vs Connectionism
&lt;/h2&gt;

&lt;p&gt;This is one of the most important comparisons in AI history.&lt;/p&gt;

&lt;p&gt;Symbolic AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;uses explicit rules and logic&lt;/li&gt;
&lt;li&gt;represents knowledge with symbols&lt;/li&gt;
&lt;li&gt;is easier to inspect&lt;/li&gt;
&lt;li&gt;struggles with messy real-world data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connectionism:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;uses neural-network-style learning&lt;/li&gt;
&lt;li&gt;learns patterns from data&lt;/li&gt;
&lt;li&gt;handles complex inputs better&lt;/li&gt;
&lt;li&gt;can be harder to interpret&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Symbolic AI asks:&lt;/p&gt;

&lt;p&gt;“What rules should the system follow?”&lt;/p&gt;

&lt;p&gt;Connectionism asks:&lt;/p&gt;

&lt;p&gt;“What patterns can the system learn from data?”&lt;/p&gt;

&lt;p&gt;Modern AI is strongly shaped by the second question.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI Winter Happened
&lt;/h2&gt;

&lt;p&gt;AI did not grow in a straight line.&lt;/p&gt;

&lt;p&gt;Early expectations were extremely high.&lt;/p&gt;

&lt;p&gt;But hardware, data, algorithms, and practical results could not always keep up.&lt;/p&gt;

&lt;p&gt;This led to periods known as AI winters.&lt;/p&gt;

&lt;p&gt;The important lesson is simple:&lt;/p&gt;

&lt;p&gt;AI progress depends on more than ideas.&lt;/p&gt;

&lt;p&gt;It also depends on compute, data, algorithms, and realistic expectations.&lt;/p&gt;

&lt;p&gt;That is why modern AI surged when those conditions improved together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Current AI Stands
&lt;/h2&gt;

&lt;p&gt;Most current AI systems are narrow AI.&lt;/p&gt;

&lt;p&gt;They perform specific tasks well.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;image recognition&lt;/li&gt;
&lt;li&gt;translation&lt;/li&gt;
&lt;li&gt;recommendation&lt;/li&gt;
&lt;li&gt;text generation&lt;/li&gt;
&lt;li&gt;code assistance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They are not general human-level intelligence.&lt;/p&gt;

&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;Narrow AI solves defined tasks.&lt;/p&gt;

&lt;p&gt;AGI would be able to generalize across many domains more like a human.&lt;/p&gt;

&lt;p&gt;Superintelligence would go beyond human-level cognitive ability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Modern AI Became So Powerful
&lt;/h2&gt;

&lt;p&gt;Modern AI grew because several factors converged:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;neural networks&lt;/li&gt;
&lt;li&gt;deep learning&lt;/li&gt;
&lt;li&gt;large datasets&lt;/li&gt;
&lt;li&gt;GPUs and accelerators&lt;/li&gt;
&lt;li&gt;representation learning&lt;/li&gt;
&lt;li&gt;Transformer architectures&lt;/li&gt;
&lt;li&gt;large language models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The big change was representation learning.&lt;/p&gt;

&lt;p&gt;Instead of manually defining every feature, models learned useful internal structures from data.&lt;/p&gt;

&lt;p&gt;That made AI much more flexible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical vs Philosophical AI
&lt;/h2&gt;

&lt;p&gt;AI also raises deeper questions.&lt;/p&gt;

&lt;p&gt;Can a system follow rules without understanding?&lt;/p&gt;

&lt;p&gt;Does producing intelligent behavior mean it has intelligence?&lt;/p&gt;

&lt;p&gt;Where do choice, intention, and consciousness fit?&lt;/p&gt;

&lt;p&gt;These questions appear in debates like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chinese Room Argument&lt;/li&gt;
&lt;li&gt;strong AI vs weak AI&lt;/li&gt;
&lt;li&gt;free will discussions&lt;/li&gt;
&lt;li&gt;AGI and superintelligence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You do not need to answer them first.&lt;/p&gt;

&lt;p&gt;But they explain why AI is not only an engineering topic.&lt;/p&gt;

&lt;p&gt;It is also a question about mind and intelligence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended Learning Order
&lt;/h2&gt;

&lt;p&gt;If AI feels too broad, learn it in this order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Turing Test&lt;/li&gt;
&lt;li&gt;AI Paradigms&lt;/li&gt;
&lt;li&gt;Symbolic AI&lt;/li&gt;
&lt;li&gt;Connectionism&lt;/li&gt;
&lt;li&gt;AI Winter&lt;/li&gt;
&lt;li&gt;Neural Networks&lt;/li&gt;
&lt;li&gt;Deep Learning&lt;/li&gt;
&lt;li&gt;Narrow AI vs Broad AI&lt;/li&gt;
&lt;li&gt;AGI&lt;/li&gt;
&lt;li&gt;Singularity&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This order works because you first understand the question.&lt;/p&gt;

&lt;p&gt;Then you understand the paradigm shift.&lt;/p&gt;

&lt;p&gt;Then you connect it to modern AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;AI is not one algorithm.&lt;/p&gt;

&lt;p&gt;It is a field built around machines that reason, learn, decide, or act intelligently.&lt;/p&gt;

&lt;p&gt;The shortest version is:&lt;/p&gt;

&lt;p&gt;AI = systems that turn information into intelligent behavior&lt;/p&gt;

&lt;p&gt;Symbolic AI uses rules.&lt;/p&gt;

&lt;p&gt;Connectionist AI learns patterns.&lt;/p&gt;

&lt;p&gt;Modern AI is largely powered by neural networks, deep learning, and large-scale data.&lt;/p&gt;

&lt;p&gt;If you remember one idea, remember this:&lt;/p&gt;

&lt;p&gt;AI evolved from asking machines to follow rules into training machines to learn patterns from data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;When explaining AI to beginners, do you start from the Turing Test and history, or from modern examples like neural networks and LLMs?&lt;/p&gt;

&lt;p&gt;Originally published at zeromathai.com.&lt;br&gt;
Original article: &lt;a href="https://zeromathai.com/en/ai-overview-hub-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/ai-overview-hub-en/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub Resources&lt;br&gt;
AI diagrams, study notes, and visual guides:&lt;br&gt;
&lt;a href="https://github.com/zeromathai/zeromathai-ai" rel="noopener noreferrer"&gt;https://github.com/zeromathai/zeromathai-ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>How RNNs Work — Remembering Previous States in Sequential Data</title>
      <dc:creator>zeromathai</dc:creator>
      <pubDate>Mon, 18 May 2026 23:57:02 +0000</pubDate>
      <link>https://dev.to/zeromathai/how-rnns-work-remembering-previous-states-in-sequential-data-560o</link>
      <guid>https://dev.to/zeromathai/how-rnns-work-remembering-previous-states-in-sequential-data-560o</guid>
      <description>&lt;p&gt;A normal neural network treats each input mostly as a fixed snapshot.&lt;/p&gt;

&lt;p&gt;But many problems are not snapshots.&lt;/p&gt;

&lt;p&gt;Text, speech, and time-series data depend on order.&lt;/p&gt;

&lt;p&gt;That is why RNNs exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Idea
&lt;/h2&gt;

&lt;p&gt;A Recurrent Neural Network is designed for sequential data.&lt;/p&gt;

&lt;p&gt;It does not only look at the current input.&lt;/p&gt;

&lt;p&gt;It also carries information from previous steps.&lt;/p&gt;

&lt;p&gt;That carried information is called the hidden state.&lt;/p&gt;

&lt;p&gt;So an RNN can process a sequence one step at a time while keeping memory of what came before.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Structure
&lt;/h2&gt;

&lt;p&gt;A simple RNN flow looks like this:&lt;/p&gt;

&lt;p&gt;Previous Hidden State + Current Input → New Hidden State → Output&lt;/p&gt;

&lt;p&gt;More compactly:&lt;/p&gt;

&lt;p&gt;RNN = current input + previous state&lt;/p&gt;

&lt;p&gt;At each time step:&lt;/p&gt;

&lt;p&gt;h_t = f(x_t, h_{t-1})&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;x_t = input at the current time step&lt;/li&gt;
&lt;li&gt;h_{t-1} = previous hidden state&lt;/li&gt;
&lt;li&gt;h_t = updated hidden state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This recurrence is the core mechanism.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation View
&lt;/h2&gt;

&lt;p&gt;At a high level, an RNN processes a sequence like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;initialize hidden state

for each time step in the sequence:
    read current input

    combine it with previous hidden state

    update hidden state

    optionally produce output

return final output or all outputs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
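
&lt;p&gt;The same steps can be sketched as runnable Python. One assumption is made loudly here: the update function f is taken to be a tanh of a weighted sum, which is a common choice for a basic RNN but is not specified above. The weights are fixed, hypothetical numbers rather than trained values.&lt;/p&gt;

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, bias):
    """One recurrence: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    h_t = []
    for j in range(len(h_prev)):
        total = bias[j]
        total += sum(w_x[j][i] * x_t[i] for i in range(len(x_t)))
        total += sum(w_h[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def run_rnn(inputs, w_x, w_h, bias):
    """Process a sequence step by step and return every hidden state."""
    h = [0.0] * len(bias)      # initialize hidden state
    states = []
    for x_t in inputs:         # for each time step in the sequence
        h = rnn_step(x_t, h, w_x, w_h, bias)  # combine input with previous state
        states.append(h)
    return states

# Hypothetical fixed weights: 2 hidden units, 1 input feature.
w_x = [[0.5], [-0.3]]
w_h = [[0.1, 0.2], [0.0, 0.4]]
bias = [0.0, 0.1]
states = run_rnn([[1.0], [0.0], [-1.0]], w_x, w_h, bias)
```

&lt;p&gt;The second input is zero, yet the second hidden state is not, because it still carries information from the first step.&lt;/p&gt;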

&lt;p&gt;This is why RNNs are useful for ordered data.&lt;/p&gt;

&lt;p&gt;The model can carry context forward.&lt;/p&gt;

&lt;p&gt;It does not restart from zero at every step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concrete Example
&lt;/h2&gt;

&lt;p&gt;Imagine a sentence:&lt;/p&gt;

&lt;p&gt;I love machine learning&lt;/p&gt;

&lt;p&gt;A basic feedforward network may process words as independent inputs.&lt;/p&gt;

&lt;p&gt;But an RNN reads them in order.&lt;/p&gt;

&lt;p&gt;Step 1:&lt;/p&gt;

&lt;p&gt;I&lt;/p&gt;

&lt;p&gt;Step 2:&lt;/p&gt;

&lt;p&gt;I love&lt;/p&gt;

&lt;p&gt;Step 3:&lt;/p&gt;

&lt;p&gt;I love machine&lt;/p&gt;

&lt;p&gt;Step 4:&lt;/p&gt;

&lt;p&gt;I love machine learning&lt;/p&gt;

&lt;p&gt;At each step, the hidden state carries previous context.&lt;/p&gt;

&lt;p&gt;That is how the model remembers earlier words while reading later ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standard Neural Network vs RNN
&lt;/h2&gt;

&lt;p&gt;This comparison makes the difference clear.&lt;/p&gt;

&lt;p&gt;Standard neural network:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;processes fixed-size input&lt;/li&gt;
&lt;li&gt;has no built-in memory across time&lt;/li&gt;
&lt;li&gt;works well for static feature vectors&lt;/li&gt;
&lt;li&gt;does not naturally model order&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RNN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;processes sequences step by step&lt;/li&gt;
&lt;li&gt;carries hidden state forward&lt;/li&gt;
&lt;li&gt;models temporal or ordered dependence&lt;/li&gt;
&lt;li&gt;fits text, speech, and time-series tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key difference is state.&lt;/p&gt;

&lt;p&gt;A standard network transforms input.&lt;/p&gt;

&lt;p&gt;An RNN transforms input while remembering previous context.&lt;/p&gt;
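
&lt;p&gt;A small experiment illustrates the role of state. The sketch below compares an order-blind summary (a plain sum, used here as a stand-in for a model with no notion of order) with a scalar tanh RNN; the weights are arbitrary illustrative values.&lt;/p&gt;

```python
import math

def bag_of_inputs(sequence):
    """Order-blind summary: just add the inputs up."""
    return sum(sequence)

def rnn_final_state(sequence, w_x=1.0, w_h=0.5):
    """Scalar RNN: h = tanh(w_x * x + w_h * h), carried across steps."""
    h = 0.0
    for x in sequence:
        h = math.tanh(w_x * x + w_h * h)
    return h

forward = [1.0, 0.0, -1.0]
backward = list(reversed(forward))

same_bag = bag_of_inputs(forward) == bag_of_inputs(backward)
same_state = rnn_final_state(forward) == rnn_final_state(backward)
print(same_bag, same_state)
```

&lt;p&gt;Reversing the sequence leaves the sum unchanged but changes the RNN's final hidden state, which is what it means for the model to be sensitive to order.&lt;/p&gt;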

&lt;h2&gt;
  
  
  Why Hidden State Matters
&lt;/h2&gt;

&lt;p&gt;The hidden state is the memory of the RNN.&lt;/p&gt;

&lt;p&gt;It is not memory in the human sense.&lt;/p&gt;

&lt;p&gt;It is a vector that summarizes previous information.&lt;/p&gt;

&lt;p&gt;At each step, the hidden state is updated.&lt;/p&gt;

&lt;p&gt;That updated state influences the next step.&lt;/p&gt;

&lt;p&gt;This lets the model capture patterns like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;word order&lt;/li&gt;
&lt;li&gt;temporal trends&lt;/li&gt;
&lt;li&gt;repeated signals&lt;/li&gt;
&lt;li&gt;dependency across earlier and later inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without hidden state, the sequence becomes just a list of disconnected inputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Deep RNNs Exist
&lt;/h2&gt;

&lt;p&gt;A basic RNN can model sequences.&lt;/p&gt;

&lt;p&gt;But some patterns are more complex.&lt;/p&gt;

&lt;p&gt;A Deep RNN stacks recurrent layers.&lt;/p&gt;

&lt;p&gt;That allows the model to build richer sequence representations.&lt;/p&gt;

&lt;p&gt;Basic RNN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one recurrent layer&lt;/li&gt;
&lt;li&gt;simpler sequence modeling&lt;/li&gt;
&lt;li&gt;easier to understand&lt;/li&gt;
&lt;li&gt;limited representational depth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deep RNN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multiple recurrent layers&lt;/li&gt;
&lt;li&gt;more expressive sequence modeling&lt;/li&gt;
&lt;li&gt;can capture more complex temporal patterns&lt;/li&gt;
&lt;li&gt;harder to train&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So Deep RNNs extend the same idea.&lt;/p&gt;

&lt;p&gt;They do not replace recurrence.&lt;/p&gt;

&lt;p&gt;They deepen it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where RNNs Fit in Deep Learning
&lt;/h2&gt;

&lt;p&gt;RNNs became important because different data types need different architectures.&lt;/p&gt;

&lt;p&gt;CNNs work well for images because images have spatial structure.&lt;/p&gt;

&lt;p&gt;RNNs work well for sequences because sequences have order.&lt;/p&gt;

&lt;p&gt;CNN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;local spatial patterns&lt;/li&gt;
&lt;li&gt;image-centered tasks&lt;/li&gt;
&lt;li&gt;convolution kernels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RNN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;temporal or sequential patterns&lt;/li&gt;
&lt;li&gt;text, speech, time series&lt;/li&gt;
&lt;li&gt;recurrent hidden state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why RNNs became one of the major deep learning architectures.&lt;/p&gt;

&lt;p&gt;They match the structure of sequential data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Limits
&lt;/h2&gt;

&lt;p&gt;RNNs are powerful, but they have limits.&lt;/p&gt;

&lt;p&gt;Long sequences are hard.&lt;/p&gt;

&lt;p&gt;Information from early steps can weaken over time.&lt;/p&gt;

&lt;p&gt;Training can become unstable because gradients must pass through many time steps.&lt;/p&gt;
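
&lt;p&gt;The shrinking signal can be seen numerically. This sketch assumes a scalar tanh RNN with hypothetical weights and multiplies the per-step derivatives of the hidden state, which is the factor a gradient picks up when it flows backward through time.&lt;/p&gt;

```python
import math

def run(inputs, w_x, w_h):
    """Scalar tanh RNN; returns the hidden state after every step."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def gradient_through_time(h_values, w_h):
    """Product of d h_t / d h_{t-1} = w_h * (1 - h_t**2) over the given steps."""
    grad = 1.0
    for h_t in h_values:
        grad *= w_h * (1.0 - h_t ** 2)
    return grad

states = run([1.0] * 50, w_x=1.0, w_h=0.5)
short_range = gradient_through_time(states[-5:], w_h=0.5)  # last 5 steps
long_range = gradient_through_time(states, w_h=0.5)        # all 50 steps
print(short_range, long_range)
```

&lt;p&gt;With these weights every factor is below one, so the product over 50 steps is vastly smaller than over 5, and the earliest inputs barely influence learning. With larger recurrent weights the same product can grow instead of shrink.&lt;/p&gt;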

&lt;p&gt;This is one reason later architectures became important.&lt;/p&gt;

&lt;p&gt;Attention mechanisms and Transformers changed the landscape by making long-range relationships easier to model.&lt;/p&gt;

&lt;p&gt;But RNNs remain a useful starting point for understanding sequence modeling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended Learning Order
&lt;/h2&gt;

&lt;p&gt;If RNNs feel abstract, learn them in this order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Neural Network&lt;/li&gt;
&lt;li&gt;Recurrent Neural Network&lt;/li&gt;
&lt;li&gt;Hidden State&lt;/li&gt;
&lt;li&gt;Deep RNN&lt;/li&gt;
&lt;li&gt;CNN vs RNN comparison&lt;/li&gt;
&lt;li&gt;Attention Mechanism&lt;/li&gt;
&lt;li&gt;Transformer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This order works because you first understand normal neural networks.&lt;/p&gt;

&lt;p&gt;Then you see what changes when order matters.&lt;/p&gt;

&lt;p&gt;Then you understand why modern sequence models moved beyond basic recurrence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;An RNN is a neural network designed for sequences.&lt;/p&gt;

&lt;p&gt;The shortest version is:&lt;/p&gt;

&lt;p&gt;RNN = current input + previous hidden state&lt;/p&gt;

&lt;p&gt;It reads data step by step.&lt;/p&gt;

&lt;p&gt;It carries context forward.&lt;/p&gt;

&lt;p&gt;It uses that context to make better predictions on ordered data.&lt;/p&gt;

&lt;p&gt;If you remember one idea, remember this:&lt;/p&gt;

&lt;p&gt;RNNs make neural networks sequence-aware by passing hidden state from one time step to the next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;When learning sequence models, do you find it easier to start from RNNs first, or jump directly to Attention and Transformers?&lt;/p&gt;

&lt;p&gt;Originally published at zeromathai.com.&lt;br&gt;
Original article: &lt;a href="https://zeromathai.com/en/rnn-complete-hub-en/" rel="noopener noreferrer"&gt;https://zeromathai.com/en/rnn-complete-hub-en/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub Resources&lt;br&gt;
AI diagrams, study notes, and visual guides:&lt;br&gt;
&lt;a href="https://github.com/zeromathai/zeromathai-ai" rel="noopener noreferrer"&gt;https://github.com/zeromathai/zeromathai-ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>deeplearning</category>
      <category>neuralnetworks</category>
    </item>
  </channel>
</rss>
