<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: pueding</title>
    <description>The latest articles on DEV Community by pueding (@pueding).</description>
    <link>https://dev.to/pueding</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F453161%2F9dc2c7a4-3298-46c4-bf96-00395ec12416.png</url>
      <title>DEV Community: pueding</title>
      <link>https://dev.to/pueding</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pueding"/>
    <language>en</language>
    <item>
      <title>SGLang v0.5.14: LPLB Expert-Parallel Load Balancing</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Tue, 30 Jun 2026 11:19:23 +0000</pubDate>
      <link>https://dev.to/pueding/sglang-v0514-lplb-expert-parallel-load-balancing-2dan</link>
      <guid>https://dev.to/pueding/sglang-v0514-lplb-expert-parallel-load-balancing-2dan</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/QCcBgX1CYrI"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; The &lt;strong&gt;SGLang v0.5.14&lt;/strong&gt; release ships &lt;strong&gt;LPLB&lt;/strong&gt; — a &lt;strong&gt;linear-programming load balancer&lt;/strong&gt; for serving mixture-of-experts models, where the experts are split across many GPUs and each step routes every token to a few of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; In expert-parallel MoE serving, token routing is &lt;strong&gt;uneven and shifts every step&lt;/strong&gt;, so one overloaded GPU stalls the whole step at a sync barrier; &lt;strong&gt;evening that load is what unlocks throughput&lt;/strong&gt; on big MoE models like DeepSeek-V4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; Earlier setups used &lt;strong&gt;static, hand-tuned expert placement&lt;/strong&gt; and ate the imbalance; LPLB keeps &lt;strong&gt;redundant replicas of the hot experts&lt;/strong&gt; and solves a small linear program &lt;strong&gt;each step&lt;/strong&gt; to minimize the busiest GPU's share of the work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A warehouse store opening duplicate counters to even out the longest line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;       40% of this step's tokens want one hot expert

   WITHOUT LPLB                 WITH LPLB (3 replicas)
   ┌──────────────┐             ┌──────────────┐
   │ GPU1 ####### │ 40%         │ GPU1 ##      │ 14%
   │ GPU2 #       │  5%         │ GPU2 ##      │ 14%
   │ GPU3 #       │  5%         │ GPU3 ##      │ 14%
   └──────┬───────┘             └──────┬───────┘
          ▼                            ▼
   barrier waits on GPU1        lanes finish together
   ✗ others idle ~1/3 step      ✓ idle time deleted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;a customer = a token routed to its experts this step&lt;/li&gt;
&lt;li&gt;a specialty counter = an expert (a sub-network in a mixture-of-experts model)&lt;/li&gt;
&lt;li&gt;a checkout lane = a GPU the experts are spread across&lt;/li&gt;
&lt;li&gt;one counter mobbed while others sit idle = per-GPU load imbalance&lt;/li&gt;
&lt;li&gt;duplicate copies of the busy counter = redundant expert replicas&lt;/li&gt;
&lt;li&gt;the floor manager who evens the longest line each wave = LPLB&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MoE (Mixture-of-Experts)&lt;/strong&gt; — A model whose feed-forward layer is split into many &lt;strong&gt;experts&lt;/strong&gt; (sub-networks); a small &lt;strong&gt;router&lt;/strong&gt; sends each token to only a few. Total parameters are huge, but the &lt;strong&gt;active&lt;/strong&gt; ones per token stay small. DeepSeek-V4 is a large MoE model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expert parallelism (EP)&lt;/strong&gt; — The serving layout that &lt;strong&gt;spreads a MoE's experts across many GPUs&lt;/strong&gt;, because all the experts together do not fit on one. Each step, tokens must be shipped to whichever GPU holds their chosen expert and the results shipped back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load imbalance&lt;/strong&gt; — When this step's router sends far more tokens to some experts than others, the GPUs holding the &lt;strong&gt;popular experts&lt;/strong&gt; get swamped while the rest sit idle. The pattern is &lt;strong&gt;data-dependent&lt;/strong&gt;, so it shifts batch to batch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redundant expert replicas&lt;/strong&gt; — Keeping &lt;strong&gt;extra copies of the hot experts&lt;/strong&gt; on several GPUs so their token load can be split, instead of one GPU owning a popular expert alone. The balancer decides how to divide each expert's tokens among its copies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LPLB&lt;/strong&gt; — SGLang's &lt;strong&gt;Linear-Programming Load Balancer&lt;/strong&gt;. Each step it solves a tiny &lt;strong&gt;linear program&lt;/strong&gt; over the current token counts to assign load across replicas so the &lt;strong&gt;maximum per-GPU load is as small as possible&lt;/strong&gt; (a min-max objective).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Waterfill&lt;/strong&gt; — The second expert-parallel balancer the release ships alongside LPLB. SGLang names it but does not detail how it works; the name points to a classic &lt;strong&gt;water-filling&lt;/strong&gt; heuristic — fill the least-loaded replica first — which would be a lighter alternative to solving the LP each step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All-to-all&lt;/strong&gt; — The expert-parallel &lt;strong&gt;communication step&lt;/strong&gt; that ships tokens out to their experts' GPUs and the results back. It runs every layer and &lt;strong&gt;waits for the slowest GPU&lt;/strong&gt;, which is why imbalance is so costly here.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On &lt;strong&gt;June 26, 2026&lt;/strong&gt;, the SGLang team &lt;a href="https://github.com/sgl-project/sglang/releases/tag/v0.5.14" rel="noopener noreferrer"&gt;released v0.5.14&lt;/a&gt;, with work from 56 contributors. The headline is &lt;strong&gt;5x higher throughput at the same interactivity&lt;/strong&gt; serving &lt;strong&gt;DeepSeek-V4&lt;/strong&gt; on NVIDIA GB300, driven by two new expert-parallel load balancers — &lt;strong&gt;Waterfill&lt;/strong&gt; and &lt;strong&gt;LPLB&lt;/strong&gt; (a linear-programming load balancer) — plus CuteDSL prefill kernels for Blackwell and int8 checkpoint pooling for linear-attention prefix caches. &lt;a href="https://github.com/sgl-project/sglang/releases/tag/v0.5.14" rel="noopener noreferrer"&gt;Read the release →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture a warehouse store at peak rush. The checkout &lt;strong&gt;lanes&lt;/strong&gt; are the GPUs; the specialty &lt;strong&gt;counters&lt;/strong&gt; (deli, pharmacy, bakery) are the model's experts, and because no single lane can hold them all, the store spreads the counters across the lanes. That spread is &lt;strong&gt;expert parallelism&lt;/strong&gt;: a mixture-of-experts model has too many experts to fit on one GPU, so they live across many, and each decode step the router sends every customer (token) to the few counters they need. The trouble is that the rush is &lt;strong&gt;lumpy&lt;/strong&gt;. This wave, everyone wants the deli; next wave, the pharmacy. &lt;strong&gt;So one counter gets mobbed while the rest stand idle, and the store can't close out the rush until that longest line clears.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That last clause is the whole problem, because the lanes do not finish independently. Every GPU has to meet at a sync barrier — the &lt;strong&gt;all-to-all&lt;/strong&gt; that ships tokens to their experts and the answers back — and that barrier waits for the slowest lane. &lt;strong&gt;The GPU holding this step's most popular expert therefore sets the pace for all of them, and the fast lanes burn the difference as idle time.&lt;/strong&gt; Add more GPUs and the imbalance can get &lt;em&gt;worse&lt;/em&gt;, not better, because the hot expert still lives on one lane while you have paid for more lanes to stand around.&lt;/p&gt;

&lt;p&gt;SGLang v0.5.14's fix is to stop letting one counter bottleneck the floor. It keeps &lt;strong&gt;redundant replicas&lt;/strong&gt; of the hot experts — duplicate deli counters on several lanes — and then, each wave, the floor manager solves a quick assignment problem: given how many customers want each counter &lt;em&gt;right now&lt;/em&gt;, divide every counter's line across its copies so the &lt;strong&gt;busiest lane does as little as possible&lt;/strong&gt;. That floor manager is &lt;strong&gt;LPLB&lt;/strong&gt;, and "as little as possible" is literal: it solves a small &lt;strong&gt;linear program&lt;/strong&gt; whose objective is to &lt;strong&gt;minimize the maximum per-GPU load&lt;/strong&gt; (a min-max). &lt;strong&gt;Waterfill&lt;/strong&gt; is the other balancer the release pairs it with, and SGLang does not spell out how it works. The name, though, points to a classic &lt;em&gt;water-filling&lt;/em&gt; heuristic — fill the least-loaded replica first — which would be a lighter alternative to running the LP every step.&lt;/p&gt;
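
&lt;p&gt;To make the min-max objective concrete, here is a minimal Python sketch of the simplest special case: one hot expert whose tokens are split across its replicas, where each replica GPU already carries some other load. In that case the optimum can be computed directly as a water level, with no solver. This illustrates the objective only and is not SGLang's implementation; &lt;code&gt;split_hot_expert&lt;/code&gt;, its arguments, and the token counts are hypothetical, and the general case with many experts sharing GPUs is where a linear program is needed.&lt;/p&gt;

```python
def split_hot_expert(hot_tokens, base_loads):
    """Split one expert's tokens across its replica GPUs, minimizing the max load.

    hot_tokens: tokens routed to the hot expert this step.
    base_loads: tokens each replica GPU already handles for its other experts.
    Returns (shares, level): shares[i] goes to replica i, and every replica
    that receives tokens ends up at the same load, `level`.
    """
    bases = sorted(base_loads)
    # If the k least-loaded replicas absorbed everything and ended up level,
    # that level would be (hot_tokens + their base load) / k. The true water
    # level is the smallest of these candidates over all k.
    level = min(
        (hot_tokens + sum(bases[:k])) / k
        for k in range(1, len(bases) + 1)
    )
    # Replicas already above the water level receive nothing.
    shares = [max(0.0, level - base) for base in base_loads]
    return shares, level


# Hypothetical step: 1,200 hot-expert tokens, three replicas with uneven base load.
shares, level = split_hot_expert(1200, [100, 300, 900])
print(shares, level)
```

&lt;p&gt;With these made-up numbers the first two replicas absorb 700 and 500 tokens and finish level at 800, while the third, already at 900, takes none, so its existing load is what the barrier then waits on.&lt;/p&gt;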

&lt;p&gt;Hold the layout fixed and walk the imbalance math &lt;em&gt;(illustrative — the release reports only the end-to-end 5x)&lt;/em&gt;. Say &lt;strong&gt;8 GPUs&lt;/strong&gt; serve a batch, and the router sends &lt;strong&gt;40%&lt;/strong&gt; of this step's tokens to one hot expert that lives on a single GPU, while another GPU draws just &lt;strong&gt;5%&lt;/strong&gt;. The step can't end until that one GPU finishes its &lt;strong&gt;40%&lt;/strong&gt;, so the other seven idle for roughly a third of the step — you own 8 GPUs but move at the speed of the busiest one. Now place &lt;strong&gt;3 replicas&lt;/strong&gt; of that hot expert and let LPLB split its tokens across them: its share per GPU falls from &lt;strong&gt;40%&lt;/strong&gt; toward about &lt;strong&gt;14%&lt;/strong&gt;, the barrier wait shrinks sharply, and the lanes finish much closer together. &lt;strong&gt;The win isn't a faster kernel — it's deleting the idle time that imbalance was manufacturing.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Expert-parallel balancing&lt;/th&gt;
&lt;th&gt;How it assigns load&lt;/th&gt;
&lt;th&gt;Per-step cost&lt;/th&gt;
&lt;th&gt;Balance quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Static / hand-tuned placement&lt;/td&gt;
&lt;td&gt;fixed expert→GPU map, set before serving&lt;/td&gt;
&lt;td&gt;~none&lt;/td&gt;
&lt;td&gt;poor under shifting, data-dependent routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Waterfill (this release)&lt;/td&gt;
&lt;td&gt;the release's second balancer; name implies water-filling, internals not detailed&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;a lighter companion to LPLB (inferred from the name)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LPLB (this release)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;solves a linear program to minimize the busiest GPU's load&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;a small solve each step&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;tightest — a min-max optimum over replicas&lt;/strong&gt; &lt;a href="https://github.com/sgl-project/sglang/releases/tag/v0.5.14" rel="noopener noreferrer"&gt;(SGLang v0.5.14)&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Where it earns its keep is exactly the regime DeepSeek-V4 lives in: a &lt;strong&gt;large MoE&lt;/strong&gt; served with expert parallelism across many Blackwell GPUs, where the &lt;strong&gt;all-to-all&lt;/strong&gt; and its sync barrier are a leading cost in each decode step. The release's headline — &lt;strong&gt;5x higher throughput at the same interactivity&lt;/strong&gt; — is a goodput claim: more tokens per second &lt;em&gt;without&lt;/em&gt; making any single user wait longer. &lt;strong&gt;Read it as the lanes finishing together instead of seven of them waiting on one — the same hardware, far less idle time.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/sglang-v0-5-12-tokenspeed-mla" rel="noopener noreferrer"&gt;SGLang v0.5.12 — TokenSpeed MLA backend&lt;/a&gt; — the prior SGLang release, a &lt;strong&gt;kernel-level&lt;/strong&gt; cache-write win rather than a &lt;strong&gt;scheduling&lt;/strong&gt; one&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/manifold-power-iteration-router-alignment" rel="noopener noreferrer"&gt;Manifold Power Iteration — MoE router alignment&lt;/a&gt; — the &lt;em&gt;other&lt;/em&gt; MoE balance problem: &lt;strong&gt;which&lt;/strong&gt; expert a token picks (router design), not &lt;strong&gt;where&lt;/strong&gt; that expert runs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/glm-5-2-active-vs-total-parameters" rel="noopener noreferrer"&gt;GLM-5.2 — active vs total parameters&lt;/a&gt; — why MoE serving is its own discipline: huge total weights, small active compute per token&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is LPLB (linear-programming load balancing)?
&lt;/h3&gt;

&lt;p&gt;LPLB is the Linear-Programming Load Balancer added in SGLang v0.5.14. When a mixture-of-experts model is served with expert parallelism — its experts split across many GPUs — the router sends an uneven, step-by-step-changing number of tokens to each expert, so some GPUs get swamped while others idle. LPLB keeps redundant replicas of the hot experts and, each step, solves a small linear program over the current token counts to divide every expert's load across its replicas so the maximum per-GPU load is minimized. Evening the load shrinks the wait at the all-to-all sync barrier that gates each decode step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does expert-parallel MoE serving need load balancing at all?
&lt;/h3&gt;

&lt;p&gt;Because expert parallelism makes the GPUs finish a step together, not independently. Every MoE layer runs an all-to-all that ships tokens to their experts' GPUs and the results back, and that barrier waits for the slowest GPU. Since token-to-expert routing is data-dependent and shifts every batch, whichever GPU holds this step's most popular expert becomes the bottleneck for all of them, and the rest burn the difference as idle time. Without balancing, adding more GPUs can even make it worse, because the hot expert still lives on one GPU. SGLang reports a 5x throughput gain at the same interactivity for DeepSeek-V4 on NVIDIA GB300 once the load is evened.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does LPLB differ from Waterfill, and from a MoE router?
&lt;/h3&gt;

&lt;p&gt;Waterfill and LPLB are the two expert-parallel balancers the release ships, both aimed at spreading each step's token load across expert replicas. SGLang details LPLB — it solves a linear program for a tight min-max balance at a small per-step cost — but does not spell out Waterfill's internals; the name points to a classic water-filling heuristic (fill the least-loaded replica first), which would be a lighter alternative to an LP solve. Both differ from the MoE router: the router decides which expert each token should go to (a quality choice about the model's output), whereas the balancers decide where, among the redundant copies of that chosen expert, the work actually runs (a serving choice about GPU utilization).&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/sglang-v0-5-14-lplb-load-balancing" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>devops</category>
    </item>
    <item>
      <title>CacheWeaver Reorders RAG Evidence for Prefix-Cache Reuse: Prefix-Cache-Aware Evidence Reordering</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Mon, 29 Jun 2026 11:19:38 +0000</pubDate>
      <link>https://dev.to/pueding/cacheweaver-reorders-rag-evidence-for-prefix-cache-reuse-prefix-cache-aware-evidence-reordering-g8i</link>
      <guid>https://dev.to/pueding/cacheweaver-reorders-rag-evidence-for-prefix-cache-reuse-prefix-cache-aware-evidence-reordering-g8i</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/3BMvDWnDHlI"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; On &lt;strong&gt;June 18, 2026&lt;/strong&gt;, researchers posted &lt;strong&gt;CacheWeaver&lt;/strong&gt; — a prompt-layer method built on &lt;strong&gt;prefix-cache-aware evidence reordering&lt;/strong&gt;. It changes only the &lt;em&gt;order&lt;/em&gt; retrieved RAG chunks appear in the prompt, so the serving engine can reuse more of its &lt;strong&gt;KV prefix cache&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; In retrieval-augmented serving, &lt;strong&gt;time-to-first-token is dominated by prefilling the evidence&lt;/strong&gt;; reusing a cached prefix skips that work, and CacheWeaver squeezes out the reuse a naive system leaves on the table — with &lt;strong&gt;no engine change&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; Versus &lt;strong&gt;naive retrieval-order caching&lt;/strong&gt; — chunks left in relevance order, so each prompt's opening rarely matches the cache — CacheWeaver re-sequences the same chunks to &lt;strong&gt;maximize the shared opening prefix&lt;/strong&gt;, the part the engine can actually reuse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A kitchen that reuses orders already half-cooked on a warming shelf.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Shelf tray:  c1 c2 c3 | c4 c5     already cooked
New order:   c1 c2 c3 | cX cY     just arrived
             └──────┘   └───┘
              shared    differs
              opening   from here
                 │          │
                 ▼          ▼
           ✓ reuse it,  ✗ cook fresh —
             no prefill   must prefill
first plate out sooner  =  lower TTFT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;retrieved evidence chunk = one course in a multi-course order&lt;/li&gt;
&lt;li&gt;the prompt sent to the model = the full order, course by course in sequence&lt;/li&gt;
&lt;li&gt;KV prefix cache = the warming shelf of orders already partly cooked&lt;/li&gt;
&lt;li&gt;reusable prefix = the opening courses your order shares with one on the shelf&lt;/li&gt;
&lt;li&gt;CacheWeaver reordering = re-plating the same courses so the opening matches the shelf&lt;/li&gt;
&lt;li&gt;time-to-first-token = how soon the first plate leaves the kitchen&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;TTFT (time-to-first-token)&lt;/strong&gt; — How long after a request arrives before the model emits its &lt;strong&gt;first&lt;/strong&gt; output token. For a long RAG prompt this is almost entirely prefill time — the engine has to read all the evidence before it can answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prefill&lt;/strong&gt; — The first phase of inference, where the model processes the &lt;strong&gt;entire prompt at once&lt;/strong&gt; to build its KV cache. Its cost grows with prompt length, which is why stuffing evidence into a RAG prompt is what makes TTFT slow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KV prefix cache (RadixAttention)&lt;/strong&gt; — Serving engines store the KV computed for a prompt and &lt;strong&gt;reuse it for any later prompt that starts with the exact same tokens&lt;/strong&gt;. SGLang's RadixAttention keeps these shared prefixes in a tree; vLLM does it by hashing fixed blocks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prefix (and why order matters)&lt;/strong&gt; — Reuse works only from the &lt;strong&gt;front&lt;/strong&gt;, token-for-token. Two prompts that share their opening reuse that opening; the instant they diverge, everything after must be recomputed — so the &lt;em&gt;order&lt;/em&gt; of the evidence decides how much is reusable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG (retrieval-augmented generation)&lt;/strong&gt; — Before answering, the system retrieves relevant documents and pastes them into the prompt as evidence. The retriever returns them ranked by &lt;strong&gt;relevance&lt;/strong&gt;, which is the order CacheWeaver rearranges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Oracle ordering&lt;/strong&gt; — The best ordering you could pick if you knew the whole future cache state in advance — an upper bound, not a runnable policy. CacheWeaver's cheap greedy choice reaches &lt;strong&gt;about 97.5%&lt;/strong&gt; of this ideal.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On &lt;strong&gt;June 18, 2026&lt;/strong&gt;, researchers posted &lt;strong&gt;CacheWeaver&lt;/strong&gt;, a lightweight prompt-layer method that &lt;strong&gt;reorders retrieved evidence so grounded RAG requests reuse as much of the KV prefix cache as possible&lt;/strong&gt;. It changes neither the serving engine nor the retrieved documents — only the order the chunks appear in the prompt. Across three vLLM configurations it cuts &lt;strong&gt;median time-to-first-token by about 20–33%&lt;/strong&gt; relative to naive retrieval-order prefix caching, reaching &lt;strong&gt;97.5% of the gain an oracle ordering would give&lt;/strong&gt;, with no measured answer-quality degradation. &lt;a href="https://arxiv.org/abs/2606.19667" rel="noopener noreferrer"&gt;Read the paper →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture a kitchen with a &lt;strong&gt;warming shelf&lt;/strong&gt; of orders that are already half-cooked. A new order comes in, and its first few courses happen to be the same dishes, in the same sequence, as a tray already sitting on the shelf. The cook doesn't start over — &lt;strong&gt;they grab the matching tray and only cook the courses that differ&lt;/strong&gt;, and the first plate leaves the pass far sooner. The whole trick is that the reuse runs strictly &lt;em&gt;from the front&lt;/em&gt;: the moment your order's courses stop matching the shelf, every course after that has to be cooked fresh.&lt;/p&gt;

&lt;p&gt;In serving terms, each course is a &lt;strong&gt;retrieved evidence chunk&lt;/strong&gt;, the order is the &lt;strong&gt;prompt&lt;/strong&gt;, and the warming shelf is the &lt;strong&gt;KV prefix cache&lt;/strong&gt;. A serving engine keeps the keys and values it already computed for a prompt and reuses them for any later prompt that &lt;strong&gt;begins with the exact same tokens&lt;/strong&gt;. But that matching is unforgiving: change a single token near the front and the cached prefix stops matching at that token (in a block-hashing engine such as vLLM, that block's hash and every later one change), so everything from there on misses the cache and is recomputed.&lt;/p&gt;
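
&lt;p&gt;A minimal Python sketch of why the match is anchored at the front, assuming a block-hashing cache in which each block's hash also covers everything before it. The block size of 4, the function names, and the use of SHA-256 are assumptions for illustration, not any engine's actual scheme.&lt;/p&gt;

```python
import hashlib


def block_hashes(tokens, block_size=4):
    """Chained hashes of the full blocks in a token sequence."""
    hashes = []
    parent = ""
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        # Each hash covers this block plus the hash of all earlier blocks.
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def reusable_tokens(cached, new, block_size=4):
    """Count the leading tokens of `new` whose blocks match `cached`."""
    matched = 0
    pairs = zip(block_hashes(cached, block_size), block_hashes(new, block_size))
    for cached_hash, new_hash in pairs:
        if cached_hash != new_hash:
            break  # the first mismatch ends reuse for everything after it
        matched += 1
    return matched * block_size


cached = list(range(12))                      # an already-served prompt
late_edit = cached[:9] + [99] + cached[10:]   # differs in the third block
early_edit = [cached[0], 99] + cached[2:]     # differs in the first block
print(reusable_tokens(cached, cached))        # 12
print(reusable_tokens(cached, late_edit))     # 8
print(reusable_tokens(cached, early_edit))    # 0
```

&lt;p&gt;The early edit changes one token yet forfeits all twelve, which is the sense in which the order of the evidence decides how much is reusable.&lt;/p&gt;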

&lt;p&gt;Here is the catch retrieval creates. A RAG retriever returns chunks ranked by &lt;strong&gt;relevance&lt;/strong&gt;, and that ranking is different for almost every question — so even two requests that pull &lt;em&gt;the same documents&lt;/em&gt; arrange them differently, and their prompts share almost no opening. Because &lt;strong&gt;TTFT for a long RAG prompt is essentially the time to prefill all that evidence&lt;/strong&gt;, a cache that almost never hits means the GPU re-reads thousands of evidence tokens on every request.&lt;/p&gt;

&lt;p&gt;CacheWeaver's move is to treat the chunk order as a free variable. It maintains a &lt;strong&gt;prefix tree&lt;/strong&gt; of recently served evidence sequences and runs a &lt;strong&gt;greedy&lt;/strong&gt; algorithm that, for each incoming request, surfaces the most reusable cached prefix and then &lt;strong&gt;re-plates the retrieved chunks to match it&lt;/strong&gt;. Because the reordering happens entirely at the prompt layer, &lt;strong&gt;the serving engine and the retrieval results stay untouched&lt;/strong&gt; — and in the paper's evaluations the reordering shows no answer-quality loss, since the same evidence is present either way and only its order changes.&lt;/p&gt;
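
&lt;p&gt;The mechanism described above can be sketched in a few lines of Python. This is a simplified reading of the description, not the paper's algorithm: the prefix tree is assumed to be a nested dict keyed by chunk ID, the greedy step follows the first cached branch that matches a retrieved chunk, and leftover chunks keep their retrieval order. &lt;code&gt;record&lt;/code&gt;, &lt;code&gt;reorder&lt;/code&gt;, and the chunk IDs are hypothetical names.&lt;/p&gt;

```python
def record(sequence, tree):
    """Add a served chunk sequence to the prefix tree (a nested dict)."""
    node = tree
    for chunk in sequence:
        node = node.setdefault(chunk, {})


def reorder(retrieved, tree):
    """Greedily move retrieved chunks to the front along a cached path."""
    remaining = list(retrieved)
    ordered = []
    node = tree
    while remaining:
        # Retrieved chunks that continue the cached path from this node.
        matches = [chunk for chunk in remaining if chunk in node]
        if not matches:
            break
        chunk = matches[0]  # assumption: break ties by retrieval rank
        ordered.append(chunk)
        remaining.remove(chunk)
        node = node[chunk]
    # Chunks with no cached continuation keep their retrieval order.
    return ordered + remaining


tree = {}
record(["c1", "c2", "c3", "c4", "c5"], tree)            # served earlier
print(reorder(["c3", "cX", "c1", "c2", "cY"], tree))
# ['c1', 'c2', 'c3', 'cX', 'cY']
```

&lt;p&gt;The output matches the shelf diagram: c1 c2 c3 line up with the cached tray, and only cX and cY are left to prefill. The same five chunks reach the model either way.&lt;/p&gt;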

&lt;p&gt;Here is where it earns its keep, with illustrative numbers. Say a request's prompt (system preamble plus evidence) comes to &lt;strong&gt;5,000 tokens&lt;/strong&gt;, and prefill cost is roughly proportional to the tokens the GPU must process. Left in &lt;strong&gt;retrieval order&lt;/strong&gt;, only the shared system preamble and one lucky chunk match the cache: about &lt;strong&gt;1,500 tokens reused&lt;/strong&gt;, so the engine prefills the remaining &lt;strong&gt;3,500&lt;/strong&gt;. CacheWeaver reorders the same chunks so the opening matches a cached sequence of &lt;strong&gt;~2,500 tokens&lt;/strong&gt;, leaving just &lt;strong&gt;2,500&lt;/strong&gt; to prefill fresh. TTFT tracks the recomputed portion, so it falls in proportion as prefill drops from 3,500 to 2,500 tokens: &lt;strong&gt;about a 29% cut&lt;/strong&gt; (1,000 / 3,500), squarely inside the paper's reported &lt;strong&gt;20–33%&lt;/strong&gt; band. Separately, the paper reports that the greedy policy captures &lt;strong&gt;97.5%&lt;/strong&gt; of the gain an oracle ordering would give. &lt;em&gt;(The 5,000- and reuse-token figures are illustrative; the 20–33% and 97.5% are from the CacheWeaver paper.)&lt;/em&gt;&lt;/p&gt;
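
&lt;p&gt;The arithmetic above, written out as a small Python check. It assumes, as the paragraph does, that TTFT scales linearly with the tokens that must be prefilled; &lt;code&gt;ttft_cut&lt;/code&gt; is a hypothetical helper and the token counts are the illustrative ones, not measurements.&lt;/p&gt;

```python
def ttft_cut(prompt_tokens, reused_before, reused_after):
    """Fractional TTFT reduction when cached-prefix reuse grows.

    Assumes TTFT is proportional to the tokens left to prefill.
    """
    prefill_before = prompt_tokens - reused_before
    prefill_after = prompt_tokens - reused_after
    return (prefill_before - prefill_after) / prefill_before


cut = ttft_cut(5000, 1500, 2500)
print(f"{cut:.1%}")  # 28.6%
```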

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;What it changes&lt;/th&gt;
&lt;th&gt;Prefix reuse&lt;/th&gt;
&lt;th&gt;Median TTFT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval-order prefix caching&lt;/td&gt;
&lt;td&gt;nothing — chunks left in relevance order&lt;/td&gt;
&lt;td&gt;only when two orders happen to match&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CacheWeaver (greedy reorder)&lt;/td&gt;
&lt;td&gt;re-sequences evidence at the prompt layer&lt;/td&gt;
&lt;td&gt;maximized via a prefix-tree match&lt;/td&gt;
&lt;td&gt;&lt;a href="https://arxiv.org/abs/2606.19667" rel="noopener noreferrer"&gt;~20–33% lower (CacheWeaver paper)&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oracle ordering&lt;/td&gt;
&lt;td&gt;best possible order, known in hindsight&lt;/td&gt;
&lt;td&gt;maximal (upper bound)&lt;/td&gt;
&lt;td&gt;CacheWeaver reaches &lt;a href="https://arxiv.org/abs/2606.19667" rel="noopener noreferrer"&gt;~97.5% of this gain (paper)&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer quality&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;no measured degradation &lt;em&gt;(CacheWeaver paper)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The reason this is worth noticing is &lt;em&gt;where&lt;/em&gt; it sits: it is not a new attention kernel or a smaller cache, but a &lt;strong&gt;scheduling decision at the boundary between retrieval and serving&lt;/strong&gt; — the kind of free win that appears once you stop treating the retrieve-then-generate pipeline and the serving engine as two sealed boxes. The same evidence, the same engine, the same answer — only the order changes, and the cache does the rest.&lt;/p&gt;

&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/aoiayn-stateful-prefix" rel="noopener noreferrer"&gt;Attention Once Is All You Need — persistent KV cache across queries&lt;/a&gt; — the same "reuse the prefix instead of recomputing it" idea, taken to a stateful cache that survives across separate queries.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/amd-atom-prefill-decode-disaggregation" rel="noopener noreferrer"&gt;AMD ATOM — prefill/decode disaggregation&lt;/a&gt; — why TTFT lives in the prefill phase, and another way to attack it: splitting prefill off onto its own hardware.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/grep-vs-vector-agentic-retrieval" rel="noopener noreferrer"&gt;Is Grep All You Need? — grep vs vector retrieval&lt;/a&gt; — the retrieval side of the pipeline CacheWeaver reorders, and how the choice of retriever shapes what lands in the prompt.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is prefix-cache-aware evidence reordering?
&lt;/h3&gt;

&lt;p&gt;It is reordering the retrieved chunks in a RAG prompt so the serving engine's KV prefix cache can reuse as much of the prompt's opening as possible. The serving engine caches the keys and values it computed for earlier prompts and reuses them for any later prompt that begins with the exact same tokens. Because retrieval returns chunks ranked by relevance — a different order for almost every question — prompts rarely share an opening, so the cache misses. CacheWeaver re-sequences the same chunks at the prompt layer to maximize that shared opening prefix, without touching the engine or the retrieved documents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does it lower time-to-first-token?
&lt;/h3&gt;

&lt;p&gt;Time-to-first-token for a long RAG prompt is essentially the time to prefill all the evidence — the model must read every chunk before it can answer. When the prompt's opening matches a cached prefix, the engine reuses that work and only prefills the remaining tokens, so TTFT tracks the part it has to recompute. By making more of the opening reusable, CacheWeaver shrinks that recomputed portion, cutting median TTFT by about 20–33% across three vLLM configurations and reaching roughly 97.5% of an oracle ordering's gain, with no measured loss in answer quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does CacheWeaver relate to prefix caching and RAG?
&lt;/h3&gt;

&lt;p&gt;It sits exactly between them. Prefix caching (RadixAttention in SGLang, block hashing in vLLM) is the serving-side mechanism that reuses a shared opening; RAG is the retrieval-side pipeline that pastes ranked evidence into the prompt. CacheWeaver changes neither — it adds a prompt-layer scheduler that keeps a prefix tree of recently served sequences and greedily reorders each request's retrieved chunks to match the most reusable cached prefix. It is complementary to the engine's caching and to the retriever's ranking, because it only governs the order in which the already-chosen chunks are laid out.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/cacheweaver-prefix-cache-evidence-reordering" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Qwen-AgentWorld Trains a Language Model as a World Model for RL Agents: World Model as a Decoupled RL Simulator</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Sun, 28 Jun 2026 11:20:08 +0000</pubDate>
      <link>https://dev.to/pueding/qwen-agentworld-trains-a-language-model-as-a-world-model-for-rl-agents-world-model-as-a-decoupled-3ea2</link>
      <guid>https://dev.to/pueding/qwen-agentworld-trains-a-language-model-as-a-world-model-for-rl-agents-world-model-as-a-decoupled-3ea2</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/wMi3EsGK0Xg"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; The &lt;strong&gt;Qwen-AgentWorld release&lt;/strong&gt; (arXiv 2606.24597) trains a language model to be a &lt;strong&gt;world model&lt;/strong&gt;: given the current observation and an agent's action, it &lt;strong&gt;predicts the next environment state&lt;/strong&gt;. The idea it makes concrete is using that model as a &lt;strong&gt;decoupled simulator for reinforcement-learning (RL) agents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Training an agent with RL needs a vast number of &lt;strong&gt;trial-and-error attempts in an environment&lt;/strong&gt; — and real environments are slow, costly, and hard to run in parallel. A learned simulator lets you generate that experience &lt;strong&gt;cheaply and at massive scale&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; Standard agent RL is &lt;strong&gt;coupled to a live environment&lt;/strong&gt; — every step waits on the real web page, terminal, or game; Qwen-AgentWorld &lt;strong&gt;decouples the two&lt;/strong&gt; by predicting the environment's response itself, and also serves as a &lt;strong&gt;warm-start foundation model&lt;/strong&gt; for downstream agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A flight simulator that pilots train in, instead of a real, costly plane.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                 THE RL AGENT (trainee pilot)
                            │
           ┌────────────────┴────────────────┐
           │                                 │
   ┌───────▼───────┐                 ┌───────▼───────┐
   │ World-model   │                 │ Real          │
   │ simulator     │                 │ environment   │
   │ (flight sim)  │                 │ (actual jet)  │
   └───────┬───────┘                 └───────┬───────┘
           │                                 │
   predicts next state              waits on the live
   in one forward pass              page/terminal/game
           │                                 │
           ▼                                 ▼
   ✓ thousands of runs at           ✗ slow, serial, and
     once — cheap to scale            costly to parallelize
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;world model = a flight simulator that predicts what happens next&lt;/li&gt;
&lt;li&gt;real environment = the actual aircraft, costly and slow to train in&lt;/li&gt;
&lt;li&gt;RL agent = the trainee pilot learning by trial and error&lt;/li&gt;
&lt;li&gt;next-state prediction = the simulator computing your next instrument reading&lt;/li&gt;
&lt;li&gt;decoupled simulator = running thousands of sim sessions at once, no real planes&lt;/li&gt;
&lt;li&gt;agent warm-start = the hours logged in the sim before the first real flight&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;World model&lt;/strong&gt; — A model that &lt;strong&gt;predicts how an environment changes&lt;/strong&gt;: feed it the current state and an action, and it returns the likely next state. Qwen-AgentWorld trains a &lt;em&gt;language&lt;/em&gt; model to do this for agent environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reinforcement learning (RL)&lt;/strong&gt; — Training by &lt;strong&gt;trial and error toward a reward&lt;/strong&gt; — the agent acts, sees what happens, and adjusts. It is data-hungry: it needs many environment steps, which is exactly what a fast simulator supplies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next-state prediction&lt;/strong&gt; — The world model's core job: &lt;strong&gt;given (observation, action), output the next observation&lt;/strong&gt;. Get this accurate enough and the model can replace the real environment for training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rollout&lt;/strong&gt; — One full &lt;strong&gt;trial run of an agent in an environment&lt;/strong&gt;, from start to finish. RL learns from thousands of rollouts; in a live environment each one is slow, in a simulator each one is cheap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decoupled (vs coupled)&lt;/strong&gt; — A &lt;strong&gt;coupled&lt;/strong&gt; setup ties each training step to the real environment; a &lt;strong&gt;decoupled&lt;/strong&gt; one swaps in the simulator, so training no longer waits on the live web page, terminal, or game.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warm-start / foundation model&lt;/strong&gt; — Using a pre-trained model as a &lt;strong&gt;head start&lt;/strong&gt; rather than training from scratch. Qwen-AgentWorld doubles as a foundation model that &lt;strong&gt;warms up downstream agents&lt;/strong&gt; before task-specific fine-tuning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid reward&lt;/strong&gt; — A reward signal that &lt;strong&gt;combines more than one objective&lt;/strong&gt;. Qwen-AgentWorld's final RL stage uses one to &lt;strong&gt;sharpen simulation fidelity&lt;/strong&gt; — how faithfully its predicted states match reality.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On June 24, 2026, the &lt;strong&gt;Qwen-AgentWorld&lt;/strong&gt; team released a language model trained to act as a &lt;strong&gt;world model for agents&lt;/strong&gt;: given the current observation and an agent's action, it &lt;strong&gt;predicts the next environment state&lt;/strong&gt;. It is used two ways — as a &lt;strong&gt;decoupled environment simulator&lt;/strong&gt; for training RL agents across thousands of scenarios, and as a &lt;strong&gt;foundation model&lt;/strong&gt; that warms up downstream agents. Training is a three-stage pipeline (continual pre-training → supervised fine-tuning → RL with a hybrid reward), and the team reports it &lt;strong&gt;outperforms existing frontier models on AgentWorldBench across seven domains&lt;/strong&gt; (the gain is stated qualitatively, without a single headline number). &lt;a href="https://arxiv.org/abs/2606.24597" rel="noopener noreferrer"&gt;Read the paper →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Think about how you train a pilot. &lt;strong&gt;You do not hand a beginner the controls of a real jet and let them crash a few hundred times — you put them in a flight simulator that predicts what the plane &lt;em&gt;would&lt;/em&gt; do in response to each input.&lt;/strong&gt; The simulator is cheaper, safer, and you can run a thousand of them at once. Qwen-AgentWorld does exactly this for software agents: instead of training in the slow, live environment, it trains a language model to &lt;em&gt;be&lt;/em&gt; the environment — to predict, from the current screen and the agent's action, what the next screen looks like.&lt;/p&gt;

&lt;p&gt;Why does this matter so much for RL? Because reinforcement learning is hungry for experience: it improves by trying an action, seeing the environment's response, and adjusting, thousands and thousands of times. &lt;strong&gt;When every one of those steps is coupled to a real web page or terminal, the environment, not the GPU, becomes the bottleneck.&lt;/strong&gt; A learned world model breaks that coupling: predicting the next state is just a forward pass, so you can run enormous numbers of rollouts in parallel, none of them waiting on the real world.&lt;/p&gt;
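
&lt;p&gt;The decoupling can be pictured as two interchangeable objects behind one &lt;code&gt;step&lt;/code&gt; call. The Python sketch below is an assumption-laden toy, not the paper's code: &lt;code&gt;LiveEnvironment&lt;/code&gt;, &lt;code&gt;WorldModelSimulator&lt;/code&gt;, &lt;code&gt;toy_policy&lt;/code&gt;, and &lt;code&gt;toy_predictor&lt;/code&gt; are hypothetical names, and the "world model" is a plain function standing in for a language model's prediction.&lt;/p&gt;

```python
# Hypothetical sketch: the rollout loop talks to one step() interface,
# so a learned simulator can be swapped in for the live environment.

class LiveEnvironment:
    """Stand-in for a real web page, terminal, or game."""

    def step(self, observation, action):
        # A real environment would block here on the live system.
        return observation + (action,)


class WorldModelSimulator:
    """Stand-in for a learned world model."""

    def __init__(self, predict_next_state):
        # predict_next_state(observation, action) plays the role of the
        # model's prediction; here it is any plain Python callable.
        self.predict_next_state = predict_next_state

    def step(self, observation, action):
        return self.predict_next_state(observation, action)


def rollout(env, policy, start, horizon):
    """Run one episode; return (observation, action, next_observation) triples."""
    observation = start
    trajectory = []
    for _ in range(horizon):
        action = policy(observation)
        next_observation = env.step(observation, action)
        trajectory.append((observation, action, next_observation))
        observation = next_observation
    return trajectory


def toy_policy(observation):
    # Alternate between two made-up actions based on history length.
    return "click" if len(observation) % 2 == 0 else "type"


def toy_predictor(observation, action):
    # A perfect-fidelity toy: predicts exactly what LiveEnvironment returns.
    return observation + (action,)


live = rollout(LiveEnvironment(), toy_policy, start=(), horizon=4)
simulated = rollout(WorldModelSimulator(toy_predictor), toy_policy, start=(), horizon=4)
print(live == simulated)
```

&lt;p&gt;Because the loop only sees &lt;code&gt;step&lt;/code&gt;, nothing in the training code has to change when the simulator replaces the live system. The toy predictor is perfect by construction; with a learned model, any gap between the two trajectories is the fidelity risk described above.&lt;/p&gt;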

&lt;p&gt;How does Qwen-AgentWorld get a language model good enough to &lt;em&gt;be&lt;/em&gt; a simulator? &lt;strong&gt;Three stages, each adding one capability: continual pre-training instills broad world-modeling, supervised fine-tuning activates explicit next-state-prediction reasoning, and a final RL stage with a hybrid reward sharpens simulation fidelity&lt;/strong&gt; — how faithfully its predicted states match what the real environment would have done. The same trained model then does double duty as a &lt;strong&gt;warm-start foundation model&lt;/strong&gt;, giving downstream agents a head start before any task-specific fine-tuning.&lt;/p&gt;

&lt;p&gt;Walk the economics with illustrative numbers &lt;em&gt;(the paper does not publish step-rate figures)&lt;/em&gt;. Suppose a single rollout in a &lt;em&gt;live&lt;/em&gt; web environment takes &lt;strong&gt;30 seconds&lt;/strong&gt; and you can afford &lt;strong&gt;10 in parallel&lt;/strong&gt;: that is about &lt;strong&gt;1,200 rollouts an hour&lt;/strong&gt;. Now suppose the world model predicts a next state in &lt;strong&gt;~50 milliseconds&lt;/strong&gt; and you run &lt;strong&gt;1,000 in parallel&lt;/strong&gt;: that is about &lt;strong&gt;72 million predicted steps an hour&lt;/strong&gt; &lt;em&gt;(illustrative)&lt;/em&gt;, which is still more than a million rollouts even if each rollout runs 50 steps. &lt;strong&gt;That gap of several orders of magnitude in experience-per-hour is the whole point: it is what lets an agent be trained across thousands of scenarios that a live-environment budget could never reach.&lt;/strong&gt; The catch, of course, is fidelity: an agent trained in a simulator only transfers if the simulator's predictions stay close to reality, which is exactly what the final RL stage targets.&lt;/p&gt;
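
&lt;p&gt;The same back-of-the-envelope arithmetic, written out as a short Python sketch. Every number is illustrative, as in the paragraph above, and the 50 steps per rollout used to compare like with like is an added assumption, not a figure from the paper.&lt;/p&gt;

```python
# Illustrative throughput arithmetic; none of these figures come from the paper.
MS_PER_HOUR = 3_600_000


def units_per_hour(ms_per_unit, parallel):
    """Units finished per hour when `parallel` workers each need ms_per_unit."""
    return parallel * MS_PER_HOUR // ms_per_unit


# Live environment: 30 s per rollout, 10 rollouts in parallel.
live_rollouts = units_per_hour(ms_per_unit=30_000, parallel=10)

# World model: 50 ms per predicted step, 1,000 predictions in parallel.
sim_steps = units_per_hour(ms_per_unit=50, parallel=1_000)

# Assumption for comparison only: a rollout is 50 steps long.
STEPS_PER_ROLLOUT = 50
sim_rollouts = sim_steps // STEPS_PER_ROLLOUT

print(live_rollouts, sim_steps, sim_rollouts, sim_rollouts // live_rollouts)
```

&lt;p&gt;Under these assumptions the simulator yields 72,000,000 steps, or 1,440,000 rollouts, per hour against 1,200 live rollouts, a factor of 1,200. Changing the assumed rollout length moves that factor, but not the conclusion that the gap spans orders of magnitude.&lt;/p&gt;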

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Training setup&lt;/th&gt;
&lt;th&gt;Where each step's "what happens next" comes from&lt;/th&gt;
&lt;th&gt;Cost of experience&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Coupled to a live environment&lt;/td&gt;
&lt;td&gt;the real web page / terminal / game&lt;/td&gt;
&lt;td&gt;Slow and hard to parallelize — the environment is the bottleneck&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Decoupled world-model simulator (Qwen-AgentWorld)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;the model's own next-state prediction&lt;/strong&gt; (&lt;a href="https://arxiv.org/abs/2606.24597" rel="noopener noreferrer"&gt;paper&lt;/a&gt;)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;A forward pass — cheap and massively parallel; fidelity is the risk to manage&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/agent-env-survey-symbolic-vs-neural-synthesis" rel="noopener noreferrer"&gt;Agent environment survey — symbolic vs neural synthesis&lt;/a&gt; — the broader map of how to build an agent's training world; a learned world model is the "neural" end of that split.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/envfactory-tool-env-synthesis" rel="noopener noreferrer"&gt;EnvFactory — synthesizing tool environments&lt;/a&gt; — a different way to manufacture the environments agents train in.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/openthoughts-agent-task-source-diversity" rel="noopener noreferrer"&gt;OpenThoughts-Agent — task-source diversity&lt;/a&gt; — what you feed an agent in training; Qwen-AgentWorld is about &lt;em&gt;where&lt;/em&gt; that training experience comes from.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/role-agent-dual-role-self-play" rel="noopener noreferrer"&gt;Role-Agent — dual-role self-play&lt;/a&gt; — another case of a model imagining the other side of the interaction to train itself.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a world model used as a decoupled RL simulator?
&lt;/h3&gt;

&lt;p&gt;A world model is a model that predicts how an environment changes: given the current observation and an action, it returns the likely next state. Qwen-AgentWorld (arXiv 2606.24597, June 2026) trains a language model to do this for agent environments, then uses it as a decoupled simulator — a stand-in for the real environment so reinforcement-learning agents can be trained across thousands of scenarios without waiting on a live web page, terminal, or game. The same model also serves as a foundation model that warms up downstream agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why train an agent in a learned simulator instead of the real environment?
&lt;/h3&gt;

&lt;p&gt;Reinforcement learning needs an enormous number of trial-and-error steps, and when each step runs against a real environment, that environment becomes the bottleneck — it is slow and hard to parallelize. A world model predicts the next state in a single forward pass, so rollouts become cheap and massively parallel, letting agents train across far more scenarios than a live-environment budget allows. The risk is fidelity: the agent only transfers to the real world if the simulator's predictions stay close to reality, which Qwen-AgentWorld's final RL stage targets with a hybrid reward.&lt;/p&gt;

&lt;h3&gt;
  
  
  How was Qwen-AgentWorld trained?
&lt;/h3&gt;

&lt;p&gt;Through a three-stage pipeline: continual pre-training to instill broad world-modeling capability, supervised fine-tuning to activate explicit next-state-prediction reasoning, and reinforcement learning with a hybrid reward to sharpen simulation fidelity. The team reports it outperforms existing frontier models on AgentWorldBench across seven domains, stated qualitatively rather than with a single headline number.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/qwen-agentworld-world-model-simulator" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>OpenAI and Broadcom's Jalapeño, a Custom Inference ASIC: Inference ASIC vs GPU</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Sat, 27 Jun 2026 11:21:30 +0000</pubDate>
      <link>https://dev.to/pueding/openai-and-broadcoms-jalapeno-a-custom-inference-asic-inference-asic-vs-gpu-36jm</link>
      <guid>https://dev.to/pueding/openai-and-broadcoms-jalapeno-a-custom-inference-asic-inference-asic-vs-gpu-36jm</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/xH0oi16XmvQ"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; The &lt;strong&gt;OpenAI and Broadcom Jalapeño announcement&lt;/strong&gt; (June 24, 2026) is OpenAI's &lt;strong&gt;first custom LLM-inference ASIC&lt;/strong&gt; — a reticle-sized compute chiplet paired with HBM, built to &lt;strong&gt;run&lt;/strong&gt; models rather than train them. The idea it makes concrete is an &lt;strong&gt;inference-optimized ASIC versus a general-purpose GPU&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; At decode time the bottleneck is usually &lt;strong&gt;moving data, not doing math&lt;/strong&gt;, so a chip co-designed around that movement can serve the same tokens using &lt;strong&gt;far less power per token&lt;/strong&gt; — early testing reports substantially better performance-per-watt (final numbers still being measured), which at OpenAI's scale materially changes serving cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; A &lt;strong&gt;general-purpose GPU&lt;/strong&gt; runs anything — training, graphics, every model — and pays in silicon and power for that flexibility; Jalapeño is &lt;strong&gt;hard-wired for inference only&lt;/strong&gt;, trading the GPU's versatility for a shorter, faster path between memory and compute.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A kitchen rebuilt to cook one dish, with the pantry moved beside the stove.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  THE ONE DISH: LLM inference
                            │
            ┌───────────────┴───────────────┐
            │                               │
     ┌──────▼───────┐                ┌──────▼───────┐
     │ Inference    │                │ General      │
     │ ASIC         │                │ GPU          │
     │ (one dish)   │                │ (whole menu) │
     └──────┬───────┘                └──────┬───────┘
            │                               │
   pantry beside the stove        pantry down the hall
   (HBM next to compute)          (data travels far)
            │                               │
            ▼                               ▼
   ✓ most plates per gas          ✗ pays power for
     (perf-per-watt)                flexibility unused
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;inference ASIC = a kitchen rebuilt to cook one dish, as fast and cheaply as possible&lt;/li&gt;
&lt;li&gt;general-purpose GPU = a restaurant kitchen that can cook anything on the menu&lt;/li&gt;
&lt;li&gt;data-movement bottleneck = cooks spending the night carrying ingredients from a far pantry&lt;/li&gt;
&lt;li&gt;HBM beside the compute chiplet = moving the pantry right next to the stove&lt;/li&gt;
&lt;li&gt;performance-per-watt = more plates served for every unit of gas burned&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ASIC&lt;/strong&gt; — An &lt;strong&gt;Application-Specific Integrated Circuit&lt;/strong&gt; — silicon built for &lt;strong&gt;one kind of job&lt;/strong&gt; rather than general-purpose computing. Giving up a general processor's flexibility buys speed and energy efficiency on that job. Jalapeño's job is LLM inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HBM&lt;/strong&gt; — High-Bandwidth Memory — stacked DRAM placed &lt;strong&gt;physically very close to the compute die&lt;/strong&gt; so data reaches the math units faster. It is the same fast memory used on high-end GPUs, and it is where the model actually lives during serving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inference vs training&lt;/strong&gt; — Training &lt;strong&gt;builds&lt;/strong&gt; a model's weights; inference &lt;strong&gt;runs&lt;/strong&gt; the finished weights to generate tokens. They stress hardware differently, so a chip can be excellent at one and unable to do the other. Jalapeño is &lt;strong&gt;inference-only&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory-bandwidth-bound&lt;/strong&gt; — When a computation spends most of its time &lt;strong&gt;waiting for data to arrive from memory&lt;/strong&gt; rather than doing arithmetic. Single-token decode is the classic example: lots of bytes read, little math per byte.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tape-out&lt;/strong&gt; — The moment a chip design is finished and &lt;strong&gt;sent to the fab to be manufactured&lt;/strong&gt;. Jalapeño went from first design to tape-out in &lt;strong&gt;roughly nine months&lt;/strong&gt;, which OpenAI describes as one of the fastest such cycles to date.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reticle-sized chiplet&lt;/strong&gt; — The &lt;em&gt;reticle limit&lt;/em&gt; is the largest area a lithography scanner can pattern in a single exposure (about 26 mm × 33 mm, roughly 858 mm²). A &lt;strong&gt;reticle-sized compute chiplet&lt;/strong&gt; is about as large as one die can physically get; Jalapeño pairs one such tile with HBM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance-per-watt&lt;/strong&gt; — Useful work (tokens generated) divided by the &lt;strong&gt;electrical power it costs&lt;/strong&gt;. At data-center scale this — not peak speed alone — sets the bill, which is why a custom inference chip targets it directly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On June 24, 2026, &lt;strong&gt;OpenAI and Broadcom&lt;/strong&gt; unveiled &lt;strong&gt;Jalapeño&lt;/strong&gt;, OpenAI's first "Intelligence Processor" — a purpose-built &lt;strong&gt;ASIC for LLM inference&lt;/strong&gt;, not a repurposed training accelerator or a general-purpose AI chip. It pairs a single &lt;strong&gt;reticle-sized compute chiplet&lt;/strong&gt; with &lt;strong&gt;HBM&lt;/strong&gt; (not commodity DRAM) to hold high throughput and low latency together, and was co-designed from first design to &lt;strong&gt;tape-out in roughly nine months&lt;/strong&gt;. Engineering samples are already running production workloads in the lab, including &lt;strong&gt;GPT-5.3-Codex-Spark&lt;/strong&gt;, with early testing reporting performance-per-watt "substantially better" than current state-of-the-art (final numbers still being measured). Initial deployment is targeted for &lt;strong&gt;end of 2026&lt;/strong&gt;. &lt;a href="https://openai.com/index/openai-broadcom-jalapeno-inference-chip/" rel="noopener noreferrer"&gt;Read the announcement →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture a restaurant kitchen that can cook anything on the menu — pastry, grill, soup, all of it. That flexibility is wonderful, and it is exactly what a &lt;strong&gt;general-purpose GPU&lt;/strong&gt; gives you: thousands of programmable cores that will run any parallel workload you throw at them, from training a model to rendering a game. &lt;strong&gt;Jalapeño is that kitchen torn down and rebuilt to cook one dish — LLM inference — and nothing else.&lt;/strong&gt; The bet is that if you only ever cook one dish, a kitchen shaped around that single dish will cook it faster and far more cheaply than the do-everything kitchen ever could.&lt;/p&gt;

&lt;p&gt;So what is the "one dish" actually limited by? Here is the part that surprises people: &lt;strong&gt;at decode time, the thing slowing the kitchen down is not the chef's hands — it is the cooks walking ingredients in from a far pantry.&lt;/strong&gt; When a model generates a token, at small batch sizes it must stream the model's weights out of memory and through the compute units once, while doing comparatively little arithmetic per byte read. That makes single-token decode &lt;strong&gt;memory-bandwidth-bound&lt;/strong&gt; — the roofline tips toward memory, and the math units sit mostly idle, waiting on data. The bottleneck the whole chip is fighting is &lt;em&gt;data movement&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Single-token decode — where the time goes:

moving data  ████████████████████████████████  dominates
computing    █                                  a sliver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The diagram makes the imbalance concrete: in the bandwidth-bound regime, the "moving data" bar dominates and the "computing" bar is a sliver. &lt;strong&gt;Jalapeño's answer is the obvious one once you see the problem: move the pantry next to the stove.&lt;/strong&gt; It pairs that big compute chiplet with &lt;strong&gt;HBM kept physically close&lt;/strong&gt;, so the costly trip between memory and compute is as short and as fast as the silicon allows. OpenAI says the design was derived from its &lt;em&gt;own&lt;/em&gt; measurements of how its models behave at serving time, which is what "co-designed" really means here: the chip is shaped around the bottleneck the company actually observed, not a generic one.&lt;/p&gt;

&lt;p&gt;Walk the decode math on a single token &lt;em&gt;(illustrative numbers — OpenAI has not published Jalapeño's figures)&lt;/em&gt;. Say a model holds &lt;strong&gt;100 GB of weights&lt;/strong&gt; and the accelerator reads them from memory at &lt;strong&gt;4 TB/s&lt;/strong&gt;. Generating one token must stream those weights through compute roughly once, so the time is about &lt;strong&gt;100 GB ÷ 4 TB/s = 25 ms&lt;/strong&gt; — and across that 25 ms the arithmetic units are mostly idle, waiting. Now &lt;strong&gt;double the effective memory bandwidth and that 25 ms roughly halves&lt;/strong&gt;; double the raw compute instead and almost nothing changes. &lt;strong&gt;That is the whole reason an inference chip is built around feeding the math units, not stacking more of them&lt;/strong&gt; — and why the headline metric is &lt;em&gt;performance-per-watt&lt;/em&gt;, not peak FLOPs.&lt;/p&gt;
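
&lt;p&gt;The decode arithmetic above can be checked with a small roofline-style sketch in Python. The 100 GB and 4 TB/s figures are the illustrative ones from the text; the compute side (2 FLOPs per weight byte, a 1 PFLOP/s peak) is an added assumption chosen only to show the shape of the trade-off, and none of it describes Jalapeño.&lt;/p&gt;

```python
# Roofline-style estimate for one decoded token at small batch size.
# All figures are illustrative; none are published Jalapeño numbers.
GB = 10**9
TB = 10**12


def token_time_ms(weight_bytes, bandwidth_bytes_per_s, flops_per_token, peak_flops_per_s):
    """Per-token time is bounded by the slower of memory streaming and math."""
    memory_ms = weight_bytes * 1000 / bandwidth_bytes_per_s
    compute_ms = flops_per_token * 1000 / peak_flops_per_s
    return max(memory_ms, compute_ms)


weights = 100 * GB           # from the text
bandwidth = 4 * TB           # from the text, bytes per second
flops = 2 * weights          # assumption: 2 FLOPs per weight byte read
peak = 10**15                # assumption: 1 PFLOP/s of peak compute

baseline = token_time_ms(weights, bandwidth, flops, peak)
more_bandwidth = token_time_ms(weights, 2 * bandwidth, flops, peak)
more_compute = token_time_ms(weights, bandwidth, flops, 2 * peak)
print(baseline, more_bandwidth, more_compute)
```

&lt;p&gt;With these inputs the baseline is 25 ms per token, doubling bandwidth brings it to 12.5 ms, and doubling compute leaves it at 25 ms, because the memory term is the larger of the two throughout.&lt;/p&gt;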

&lt;p&gt;None of this means GPUs are going away. The trade Jalapeño makes is real and one-directional: &lt;strong&gt;you give up the GPU's ability to train, to switch to a very different kind of workload, to run the whole range of models and tasks a GPU handles.&lt;/strong&gt; A custom ASIC only pays off when you run one workload at enormous, sustained scale — which is precisely OpenAI's situation, and precisely why a startup serving a thousand requests a day would still reach for a GPU. The interesting signal is not "ASICs beat GPUs"; it is that LLM inference has become a large and stable enough workload to justify burning a chip for it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chip&lt;/th&gt;
&lt;th&gt;Built for&lt;/th&gt;
&lt;th&gt;Flexibility&lt;/th&gt;
&lt;th&gt;Where it wins&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;General-purpose GPU&lt;/td&gt;
&lt;td&gt;training + inference + any parallel workload&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;The default — runs anything, backed by a mature software ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repurposed training accelerator&lt;/td&gt;
&lt;td&gt;training, also used to serve&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Strong throughput, but carries training-only hardware that idles during inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inference ASIC (Jalapeño)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;LLM inference only&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Built for top performance-per-watt on its one workload at scale (early results); inference-only, far less flexible&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/nvidia-ai-factories-tokens-per-mw" rel="noopener noreferrer"&gt;NVIDIA AI factories — tokens per megawatt&lt;/a&gt; — frames serving as a performance-per-watt problem at the datacenter scale Jalapeño is built to win.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/amd-atom-prefill-decode-disaggregation" rel="noopener noreferrer"&gt;AMD Atom — prefill/decode disaggregation&lt;/a&gt; — another hardware answer to the fact that prefill and decode stress the chip in opposite ways.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/blackwell-mlperf-6-0-strong-scaling" rel="noopener noreferrer"&gt;Blackwell on MLPerf 6.0 — strong scaling&lt;/a&gt; — the general-purpose GPU side of the same inference-efficiency race.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/jetson-thor-edge-blackwell" rel="noopener noreferrer"&gt;Jetson Thor — edge Blackwell&lt;/a&gt; — purpose-built inference silicon at the opposite end of the scale, the edge.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is an inference ASIC like Jalapeño?
&lt;/h3&gt;

&lt;p&gt;An inference ASIC is an Application-Specific Integrated Circuit — silicon built for one kind of job rather than general-purpose computing — made to run (not train) large language models. OpenAI and Broadcom's Jalapeño, unveiled June 24, 2026, is OpenAI's first such chip: a reticle-sized compute chiplet paired with HBM, co-designed around the data-movement bottleneck of serving models at scale. It gives up a GPU's general-purpose flexibility in exchange for higher performance-per-watt on that single workload (early testing reports substantially better, with final numbers still being measured).&lt;/p&gt;

&lt;h3&gt;
  
  
  Why build a custom inference chip instead of using GPUs?
&lt;/h3&gt;

&lt;p&gt;At decode time, generating a token is usually memory-bandwidth-bound — the chip spends most of its time moving the model's weights out of memory, not doing arithmetic. A general-purpose GPU pays in silicon and power for flexibility that inference never uses. A chip co-designed around the data-movement bottleneck — a large compute chiplet with HBM kept close — can serve the same tokens at substantially better performance-per-watt in early testing (final numbers still being measured), which at OpenAI's scale materially changes serving cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is Jalapeño different from a GPU?
&lt;/h3&gt;

&lt;p&gt;A GPU is general-purpose: thousands of programmable cores that run training, graphics, and any model. Jalapeño is an ASIC built for LLM inference only — it cannot train and is far less flexible than a general-purpose GPU. That is the trade: it loses the GPU's versatility and gains a shorter, faster path between memory and compute, which is what matters when the bottleneck is data movement rather than raw math. A custom ASIC pays off only when you run one workload at enormous, sustained scale.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/jalapeno-inference-asic-vs-gpu" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>hardware</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Baidu Unlimited OCR Holds the KV Cache Constant for 40+ Pages: Reference Sliding Window Attention</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Fri, 26 Jun 2026 11:16:58 +0000</pubDate>
      <link>https://dev.to/pueding/baidu-unlimited-ocr-holds-the-kv-cache-constant-for-40-pages-reference-sliding-window-attention-3on5</link>
      <guid>https://dev.to/pueding/baidu-unlimited-ocr-holds-the-kv-cache-constant-for-40-pages-reference-sliding-window-attention-3on5</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/pQL9ihQ1yl8"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; The &lt;strong&gt;Unlimited OCR release&lt;/strong&gt; (Baidu, arXiv 2606.23050) is a &lt;strong&gt;3-billion-parameter open OCR model&lt;/strong&gt; whose decoder replaces standard attention with &lt;strong&gt;Reference Sliding Window Attention (R-SWA)&lt;/strong&gt;, the trick that lets it transcribe 40+ pages in a single pass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; The &lt;strong&gt;KV cache&lt;/strong&gt; is the memory that grows with every token a model writes; on a 40-page transcription that growth can dominate inference memory and slow generation, so &lt;strong&gt;holding the cache constant&lt;/strong&gt; is what makes one-pass, whole-document OCR practical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; A &lt;strong&gt;standard decoder&lt;/strong&gt; makes each new token attend to the &lt;strong&gt;entire growing output&lt;/strong&gt;, so its KV cache grows linearly; R-SWA makes each token attend to the &lt;strong&gt;fixed document plus only the last 128 output tokens&lt;/strong&gt;, so the cache stays a constant size.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A scribe copying a long book, with the source kept open on the desk and only the last few lines they wrote still in view.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                 COPYING ONE 40-PAGE BOOK
                            │
              ┌─────────────┴─────────────┐
              │                           │
     ┌────────▼────────┐         ┌────────▼────────┐
     │  R-SWA scribe   │         │   plain scribe  │
     │  (disciplined)  │         │  (stacks pages) │
     └────────┬────────┘         └────────┬────────┘
              │                           │
     desk = source book pinned    desk = every page copied
       + last 128 output lines       stacked so far, growing
              │                           │
              ▼                           ▼
     ✓ desk never overflows      ✗ desk overflows by page 40
       KV cache stays constant     KV cache grows linearly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;reference tokens = the source book kept open on the desk, always in reach&lt;/li&gt;
&lt;li&gt;sliding window = only the last line the scribe glances at to keep continuity&lt;/li&gt;
&lt;li&gt;constant KV cache = a desk that never overflows, however long the book&lt;/li&gt;
&lt;li&gt;linear KV growth = stacking every page you've copied on the desk until it spills&lt;/li&gt;
&lt;li&gt;40+ pages in one pass = copying a whole long book in a single sitting&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;KV cache&lt;/strong&gt; — The stored &lt;strong&gt;keys and values&lt;/strong&gt; for every token already processed, so attention never recomputes them. It is the dominant memory cost of inference, and in a standard decoder it grows with every token generated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference Sliding Window Attention (R-SWA)&lt;/strong&gt; — Baidu's replacement for every decoder attention layer: each generated token attends to &lt;strong&gt;all reference tokens plus only the preceding 128 output tokens&lt;/strong&gt;, instead of the entire growing sequence. That caps the cache at a constant size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference tokens&lt;/strong&gt; — The &lt;strong&gt;document (visual) tokens&lt;/strong&gt; the encoder produces from the pages being read. R-SWA keeps these &lt;strong&gt;fully visible to every output token&lt;/strong&gt; — they are the fixed part of the attention window, never slid past.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sliding window attention&lt;/strong&gt; — An attention mask where each token sees only the last &lt;em&gt;W&lt;/em&gt; tokens, not all of them. It bounds memory but, on its own, would &lt;strong&gt;slide off the document&lt;/strong&gt; a model is reading — which is the problem R-SWA's pinned reference tokens fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forward pass&lt;/strong&gt; — One run of the model over its inputs to produce outputs. Unlimited OCR transcribes &lt;strong&gt;40+ pages in a single forward pass&lt;/strong&gt; rather than chunking the document into many smaller passes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Active parameters&lt;/strong&gt; — Unlimited OCR has &lt;strong&gt;3 billion total parameters but activates only 500 million per token&lt;/strong&gt; — a sparse design where most weights stay idle on any given step, keeping compute low.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek-OCR encoder&lt;/strong&gt; — A high-compression visual encoder that turns a page image into a &lt;strong&gt;small number of tokens&lt;/strong&gt;. Pairing it with R-SWA's constant-cache decoder is what lets dozens of pages fit inside a single 32,000-token context.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On June 22, 2026, Baidu released &lt;strong&gt;Unlimited OCR&lt;/strong&gt;, a 3-billion-parameter (500 million active) end-to-end OCR model that transcribes &lt;strong&gt;40+ pages of documents in a single forward pass&lt;/strong&gt; under a 32,000-token context. It replaces every decoder attention layer with &lt;strong&gt;Reference Sliding Window Attention (R-SWA)&lt;/strong&gt;, which holds the KV cache at a constant size throughout decoding instead of letting it grow with output length, and reports new end-to-end state-of-the-art on OmniDocBench v1.5 and v1.6. Weights and code are public under CC-BY 4.0. &lt;a href="https://arxiv.org/abs/2606.23050" rel="noopener noreferrer"&gt;Read the paper →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture a scribe copying a long book by hand. The trick that keeps the desk clear is not memory — it's &lt;em&gt;what stays on the desk&lt;/em&gt;. The &lt;strong&gt;source book lies open, always in reach&lt;/strong&gt;, and the scribe glances at &lt;strong&gt;only the last line they wrote&lt;/strong&gt; to keep the handwriting and spelling continuous. They never re-read the hundred pages already copied; those go in a drawer. So the desk holds the same two things on page 1 and on page 200 — &lt;strong&gt;the source, and the current line&lt;/strong&gt; — and it never overflows, no matter how long the book.&lt;/p&gt;

&lt;p&gt;A standard transformer decoder is the opposite scribe: it keeps &lt;strong&gt;every page it has copied stacked on the desk&lt;/strong&gt;, because each new token attends to all previous tokens. That stack is the KV cache, and &lt;strong&gt;its size grows linearly with the length of the output&lt;/strong&gt;, which is fine for a one-paragraph answer and ruinous for a 40-page transcription, where the output is enormous. The cache is already one of the largest memory costs in long-sequence inference; let it grow with every page and a long document becomes much harder to fit and serve efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R-SWA is the disciplined scribe.&lt;/strong&gt; It replaces every decoder attention layer so that each generated token attends to exactly two things: the &lt;strong&gt;full set of reference tokens&lt;/strong&gt; — the document the encoder produced, kept pinned and fully visible — &lt;strong&gt;plus only the preceding 128 output tokens&lt;/strong&gt;, a short sliding window over what was just written. The document never slides out of view, but the &lt;em&gt;output&lt;/em&gt; history does. &lt;strong&gt;Because both pieces are bounded — the document is fixed and the window is 128 — the KV cache stays a constant size from the first page to the fortieth.&lt;/strong&gt; This is the move a plain sliding window can't make on its own: slide a fixed window over everything and you'd lose the document you're reading; R-SWA exempts the reference tokens from the slide.&lt;/p&gt;
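
&lt;p&gt;As a rough illustration of that rule, the Python sketch below lists which cached positions one output token may attend to. The function name &lt;code&gt;rswa_visible_keys&lt;/code&gt;, the flat position layout (reference tokens first, outputs after), and the choice to count the window as the 128 tokens strictly before the current one are assumptions made for this example, not details taken from Baidu's code.&lt;/p&gt;

```python
def rswa_visible_keys(num_reference, output_index, window=128):
    """Return the key positions visible to one output token under an R-SWA-style rule.

    Assumed layout (hypothetical): positions 0..num_reference-1 hold the
    reference (document) tokens, and output token j sits at num_reference + j.
    Assumption: the window covers the `window` outputs strictly before the
    current one; whether the current token also counts is not specified here.
    """
    # The reference tokens are pinned: every output token sees all of them.
    reference = list(range(num_reference))
    # Only the most recent `window` outputs stay visible; older ones slide out.
    first_output = max(0, output_index - window)
    recent_outputs = [num_reference + j for j in range(first_output, output_index)]
    return reference + recent_outputs


# Early and late in a long transcription, with 1,000 assumed reference tokens.
for step in (0, 50, 128, 12_000):
    visible = rswa_visible_keys(num_reference=1_000, output_index=step)
    print(f"output {step:6d}: attends to {len(visible)} cached positions")
```

&lt;p&gt;The length of the returned list is the number of cache entries one decoding step reads, and it stops growing once the output is longer than the window.&lt;/p&gt;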

&lt;p&gt;Here is why "grows with sequence length" is the term that hurts. KV-cache memory is a product: 2 (one key and one value) × layers × KV heads × head dimension × bytes per element × &lt;strong&gt;sequence length&lt;/strong&gt;, and only that last factor moves as the model writes more. &lt;strong&gt;R-SWA freezes that factor for the output: instead of the sequence length climbing toward 32,000, the output's contribution is clamped at 128&lt;/strong&gt;, while the reference tokens add a fixed, encoder-compressed amount. Pair that constant-cache decoder with DeepSeek-OCR's high-compression visual encoder, which compresses each page image into far fewer visual tokens, and dozens of pages fit in one 32,000-token pass.&lt;/p&gt;

&lt;p&gt;Walk the numbers on one long document. Say transcribing 40 pages produces roughly &lt;strong&gt;12,000 output tokens&lt;/strong&gt; &lt;em&gt;(illustrative; the real count depends on the document)&lt;/em&gt;. A standard decoder's cache holds all 12,000 on top of the document tokens, and the 12,000th token attends back across &lt;strong&gt;11,999&lt;/strong&gt; output predecessors, so both memory and per-token attention work climb with every page. R-SWA caps the output window at &lt;strong&gt;128&lt;/strong&gt;, so that same final token attends to just the &lt;strong&gt;last 128 outputs&lt;/strong&gt; plus the fixed document tokens, and the output's contribution to the cache stays flat at &lt;strong&gt;128 entries&lt;/strong&gt; whether the document is 4 pages or 40. &lt;strong&gt;That clamp, from a number that grows with the page count to a constant 128, is the decoder-side reason this can pair with a compressed visual encoder and read 40+ pages in one forward pass.&lt;/strong&gt;&lt;/p&gt;
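
&lt;p&gt;The same arithmetic can be sketched in a few lines of Python. Only the 128-token window comes from the paper; the layer count, head count, head dimension, element size and reference-token count below are hypothetical values picked to make the comparison concrete, and they are not Unlimited OCR's real configuration.&lt;/p&gt;

```python
def kv_cache_bytes(cached_tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache: one key and one value vector per layer, head and cached token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * cached_tokens


# Hypothetical decoder shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_ELEM = 24, 8, 128, 2
REFERENCE_TOKENS = 4_000  # assumed size of the fixed document prefix
WINDOW = 128              # R-SWA output window, as stated in the paper


def standard_cached_tokens(outputs_so_far):
    # A standard decoder keeps every output token next to the document tokens.
    return REFERENCE_TOKENS + outputs_so_far


def rswa_cached_tokens(outputs_so_far):
    # R-SWA keeps the document tokens plus at most WINDOW recent outputs.
    return REFERENCE_TOKENS + min(outputs_so_far, WINDOW)


for n in (128, 1_200, 12_000):
    shape = (LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_ELEM)
    std = kv_cache_bytes(standard_cached_tokens(n), *shape)
    rswa = kv_cache_bytes(rswa_cached_tokens(n), *shape)
    print(f"{n:6d} outputs: standard {std / 2**20:7.1f} MiB, R-SWA {rswa / 2**20:7.1f} MiB")
```

&lt;p&gt;Under these assumed shapes the standard cache keeps climbing with the output, while the R-SWA figure stops changing once the output passes 128 tokens.&lt;/p&gt;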

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attention scheme&lt;/th&gt;
&lt;th&gt;Each output token attends to…&lt;/th&gt;
&lt;th&gt;KV cache vs output length&lt;/th&gt;
&lt;th&gt;Where it earns its keep&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard causal attention&lt;/td&gt;
&lt;td&gt;every previous token&lt;/td&gt;
&lt;td&gt;Grows linearly&lt;/td&gt;
&lt;td&gt;Accurate, but memory explodes on long outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain sliding-window attention&lt;/td&gt;
&lt;td&gt;only the last ~W tokens &lt;em&gt;(W is a fixed window, model-dependent)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;Constant (~W)&lt;/td&gt;
&lt;td&gt;Cheap streaming, but it slides off the document being read&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;R-SWA (Unlimited OCR)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;all reference tokens + the last 128 outputs&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2606.23050" rel="noopener noreferrer"&gt;[paper]&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Constant&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Long-document OCR: keeps the full source visible while bounding output memory&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The honest caveats. The 128-token window is a &lt;strong&gt;default&lt;/strong&gt;, and a short window is a bet that the next line of a transcription rarely depends on text written thousands of tokens earlier — true for reading a document top-to-bottom, less obviously true for tasks with long-range output structure. And the constant-cache win leans on the encoder doing real work: if the reference tokens themselves were not compressed, "all reference tokens" would be its own large, fixed cost. But the deeper lesson generalizes past OCR — the paper itself notes R-SWA is &lt;strong&gt;"a general-purpose parsing attention mechanism… equally applicable to tasks such as ASR, translation, etc."&lt;/strong&gt; &lt;strong&gt;Once you accept that an output token rarely needs the &lt;em&gt;entire&lt;/em&gt; output history, the question stops being "how do we shrink the cache" and becomes "what must stay pinned, and how short can the window be" — and the cache stops growing at all.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Goes deeper in: LLM Internals → KV Cache → Memory Cost&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/flashmemory-lookahead-sparse-attention" rel="noopener noreferrer"&gt;FlashMemory — lookahead sparse attention&lt;/a&gt; — also bounds the KV cache by having each token attend to fewer keys; R-SWA bounds it by a fixed window plus a pinned reference instead.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/sp-kv-self-pruned-kv-cache" rel="noopener noreferrer"&gt;SP-KV — self-pruned KV cache&lt;/a&gt; — &lt;em&gt;drops&lt;/em&gt; low-value KV pairs to shrink the cache; R-SWA never writes the far output history in the first place.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/subq-1-1-subquadratic-sparse-attention" rel="noopener noreferrer"&gt;SubQ 1.1 — subquadratic sparse attention&lt;/a&gt; — near-linear attention for million-token context; R-SWA is the OCR-decoder cousin of the same "don't attend to everything" idea.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/deepseek-v4-long-context-cost" rel="noopener noreferrer"&gt;DeepSeek V4 — long-context cost cut to a fraction&lt;/a&gt; — attacks the same long-context memory pressure at the architecture level, and shares the optical-compression encoder lineage R-SWA builds on.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Reference Sliding Window Attention (R-SWA)?
&lt;/h3&gt;

&lt;p&gt;R-SWA is the decoder attention scheme in Baidu's Unlimited OCR (arXiv 2606.23050, June 2026). It replaces every decoder attention layer so that each generated token attends to all reference tokens — the document tokens the encoder produced — plus only the preceding 128 output tokens, rather than the entire growing output sequence. Because the document is fixed and the output window is capped at 128, the KV cache stays a constant size throughout decoding instead of growing linearly with output length. That is what lets the 3-billion-parameter (500 million active) model transcribe 40+ pages in a single 32,000-token forward pass.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does holding the KV cache constant matter for OCR?
&lt;/h3&gt;

&lt;p&gt;The KV cache is often a dominant memory cost of long-sequence inference, and in a standard decoder it grows with every token generated. Transcribing a long document produces an enormous output, so a linearly growing cache can strain GPU memory, which is one reason many OCR systems chunk a document into many small passes and stitch the results. R-SWA caps the output's contribution to the cache at 128 tokens, so memory does not grow with page count, and the whole document can be read in one forward pass. Baidu reports new end-to-end state-of-the-art on OmniDocBench v1.5 and v1.6 with this design.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is R-SWA different from normal sliding-window attention?
&lt;/h3&gt;

&lt;p&gt;Plain sliding-window attention (as in some streaming models) lets each token see only the last W tokens of everything — which bounds memory but would slide the document a model is reading out of view. R-SWA splits the window in two: the reference (document) tokens are pinned and stay fully visible to every output token, while only the output history is subject to the 128-token slide. So the model keeps the entire source in sight while still bounding the part of the cache that would otherwise grow — the constant-size cache without losing the document.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/baidu-unlimited-ocr-rswa-constant-kv" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>GLM-5.2 Becomes the Top Open-Weights Model: Active vs Total Parameters</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Tue, 23 Jun 2026 11:17:24 +0000</pubDate>
      <link>https://dev.to/pueding/glm-52-becomes-the-top-open-weights-model-active-vs-total-parameters-2284</link>
      <guid>https://dev.to/pueding/glm-52-becomes-the-top-open-weights-model-active-vs-total-parameters-2284</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/Edl4LHWkcr0"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; The news anchor is &lt;strong&gt;GLM-5.2&lt;/strong&gt;, Zhipu AI's open-weights model that just topped the Artificial Analysis Intelligence Index; the concept it makes concrete is &lt;strong&gt;active vs total parameters&lt;/strong&gt; — the two numbers in its &lt;strong&gt;"744B total / 40B active"&lt;/strong&gt; spec.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Those two numbers &lt;strong&gt;price two different things&lt;/strong&gt;: &lt;strong&gt;total&lt;/strong&gt; sets the memory footprint and the GPU you need, while &lt;strong&gt;active&lt;/strong&gt; sets the compute and bandwidth you pay &lt;strong&gt;per token&lt;/strong&gt;. Reading both tells you what a model release actually costs to run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; The old habit was to quote &lt;strong&gt;one parameter count&lt;/strong&gt; — which assumes a &lt;strong&gt;dense&lt;/strong&gt; model where every weight fires on every token, so active equals total. A sparse &lt;strong&gt;Mixture-of-Experts&lt;/strong&gt; splits that into two, and the &lt;strong&gt;gap between them is the design lever&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A big engine that fires only a few of its cylinders at a time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;           GLM-5.2 ENGINE: 744 CYLINDERS BUILT IN
                           │
      ┌────────────────────┴─────────────────────┐
      │  the whole engine block is hauled along   │
      │  . . . . . . . . . . . . . . . . . . . .  │
      │  . . # . . . # . . # . . . # . . . # . .  │
      │  but only ~40 cylinders ( # ) fire now    │
      └────────────────────┬─────────────────────┘
                           │
          ┌────────────────┴────────────────┐
          ▼                                  ▼
   MEMORY = the whole block         COMPUTE = firing only
   all 744B resident, ~744 GB       ~40B active, ~80 GFLOP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;total parameters (744B) = every cylinder built into the engine block — the full capacity&lt;/li&gt;
&lt;li&gt;active parameters (40B) = the cylinders actually firing on this stroke — what burns fuel right now&lt;/li&gt;
&lt;li&gt;router = the engine controller deciding which cylinders fire for each token&lt;/li&gt;
&lt;li&gt;memory footprint = the whole engine block you still haul around, firing or idle&lt;/li&gt;
&lt;li&gt;sparsity ratio = how few cylinders fire (40) versus how many exist (744)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Total parameters&lt;/strong&gt; — Every weight the model contains — here &lt;strong&gt;744 billion&lt;/strong&gt;. The total sets the model's knowledge capacity and, critically, the &lt;strong&gt;memory footprint&lt;/strong&gt;: all 744 B must be loaded into GPU memory whether or not they are used on a given token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Active parameters&lt;/strong&gt; — The subset of weights &lt;strong&gt;actually read and multiplied for a single token&lt;/strong&gt; — here ~&lt;strong&gt;40 billion&lt;/strong&gt;. In a dense model active equals total; in a sparse model active is a fraction. Per-token compute and bandwidth track the &lt;em&gt;active&lt;/em&gt; count, not the total.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mixture-of-Experts (MoE)&lt;/strong&gt; — A transformer variant that replaces each dense feed-forward network with many smaller "expert" sub-networks, plus a router that activates only a handful per token. It decouples total capacity from per-token cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Router&lt;/strong&gt; — The small learned network inside an MoE layer that &lt;strong&gt;assigns each token to its top-k experts&lt;/strong&gt;. It is what makes "which weights are active" change from token to token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sparsity ratio&lt;/strong&gt; — The fraction of total parameters that are active per token. GLM-5.2's 40 B of 744 B is roughly &lt;strong&gt;5%&lt;/strong&gt; — about one weight in eighteen. A lower ratio means more capacity sits idle on any given token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dense model&lt;/strong&gt; — A model with &lt;strong&gt;no routing&lt;/strong&gt;: every weight participates in every token, so active equals total. Per-token FLOPs scale linearly with the full parameter count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FLOP&lt;/strong&gt; — A &lt;strong&gt;floating-point operation&lt;/strong&gt; — one multiply or add. A useful rule of thumb: a forward pass costs about &lt;strong&gt;2 × (active parameters)&lt;/strong&gt; FLOPs per token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Artificial Analysis Intelligence Index&lt;/strong&gt; — A third-party benchmark that &lt;strong&gt;aggregates many evals&lt;/strong&gt; (reasoning, coding, knowledge) into a single comparable score. GLM-5.2 scored &lt;strong&gt;51&lt;/strong&gt; on v4.1, leading all open-weights models.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On June 17, 2026, &lt;strong&gt;Artificial Analysis&lt;/strong&gt; reported that &lt;strong&gt;Zhipu AI's GLM-5.2&lt;/strong&gt; became the leading &lt;strong&gt;open-weights&lt;/strong&gt; model on its &lt;strong&gt;Intelligence Index v4.1&lt;/strong&gt;, scoring &lt;strong&gt;51&lt;/strong&gt; — ahead of &lt;strong&gt;MiniMax-M3&lt;/strong&gt; and &lt;strong&gt;DeepSeek V4 Pro&lt;/strong&gt;, both at &lt;strong&gt;44&lt;/strong&gt;. The model carries &lt;strong&gt;744 B total parameters but activates only ~40 B per token&lt;/strong&gt;, ships under an &lt;strong&gt;MIT license&lt;/strong&gt;, and keeps GLM-5.1's architecture while showing particular strength on scientific reasoning. &lt;a href="https://artificialanalysis.ai/articles/glm-5-2-is-the-new-leading-open-weights-model-on-the-artificial-analysis-intelligence-index" rel="noopener noreferrer"&gt;Read the report →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture a very large engine. It has hundreds of cylinders machined into the block, but at any instant only a handful are firing — and &lt;strong&gt;which few are firing changes constantly&lt;/strong&gt; as a controller picks the right ones for the moment. &lt;strong&gt;The size of the engine is one number; the cylinders burning fuel right now are a completely different one.&lt;/strong&gt; That is exactly the gap GLM-5.2 puts on its spec sheet: &lt;strong&gt;744 billion&lt;/strong&gt; cylinders built in, but only about &lt;strong&gt;40 billion&lt;/strong&gt; firing per token. The first number is the engine you have to build and haul; the second is the fuel you actually burn each stroke.&lt;/p&gt;

&lt;p&gt;Model releases used to come with &lt;strong&gt;one&lt;/strong&gt; headline number, because the model was &lt;strong&gt;dense&lt;/strong&gt;: every weight fired on every token, so the engine you built and the engine you ran were the same. &lt;strong&gt;A Mixture-of-Experts model breaks that equality on purpose.&lt;/strong&gt; Most of a transformer's parameters (roughly two-thirds) live in its feed-forward layers, so MoE replaces each big dense feed-forward block with many smaller expert blocks and a &lt;strong&gt;router&lt;/strong&gt; that lights up only the few each token needs. The 744 B stays resident, but the per-token bill tracks the ~40 B that fire.&lt;/p&gt;

&lt;p&gt;So the two numbers price two genuinely different resources. &lt;strong&gt;The total parameter count sets your memory footprint&lt;/strong&gt;: every one of the 744 B weights has to sit in GPU memory, idle or not, which is why running an open-weights model this large calls for a multi-GPU node and makes quantizing the weights attractive. &lt;strong&gt;The active count sets your per-token compute and bandwidth&lt;/strong&gt;: at ~40 B active, GLM-5.2 computes each token at roughly the cost of a 40 B model even though it holds 744 B parameters of capacity. The notable part of this release is not just that an open-weights model topped the leaderboard; it is that it did so at a &lt;strong&gt;~5% sparsity ratio&lt;/strong&gt; (about one weight in eighteen), pushing the frontier on a very lean per-token budget.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Per token, you pay…&lt;/th&gt;
&lt;th&gt;If GLM-5.2 were dense (744B active)&lt;/th&gt;
&lt;th&gt;GLM-5.2 as shipped (744B total, ~40B active)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Active parameters&lt;/td&gt;
&lt;td&gt;744B (all of them)&lt;/td&gt;
&lt;td&gt;~40B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute per token&lt;/td&gt;
&lt;td&gt;~1.49 TFLOP &lt;em&gt;(illustrative, ≈2× active-params rule)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;~80 GFLOP &lt;em&gt;(illustrative, ≈2× active-params rule)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weights held in memory&lt;/td&gt;
&lt;td&gt;~744 GB &lt;em&gt;(approx., 1 byte/param at FP8)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;~744 GB &lt;em&gt;(approx., 1 byte/param at FP8)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intelligence Index v4.1&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;51 (leading open weights)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Work one token through the numbers to see why the gap matters. Using the rule that a forward pass costs about &lt;strong&gt;2 × (active parameters)&lt;/strong&gt; FLOPs, a &lt;strong&gt;dense 744 B&lt;/strong&gt; model would burn &lt;strong&gt;2 × 744 B ≈ 1.49 TFLOP&lt;/strong&gt; every token; GLM-5.2, firing only &lt;strong&gt;~40 B&lt;/strong&gt;, burns &lt;strong&gt;2 × 40 B ≈ 80 GFLOP&lt;/strong&gt; — roughly &lt;strong&gt;18× less compute per token&lt;/strong&gt; &lt;em&gt;(illustrative — derived from the parameter counts, not measured)&lt;/em&gt;. &lt;strong&gt;But both versions still have to keep all 744 B weights resident — about 744 GB at one byte each — so the memory bill is identical.&lt;/strong&gt; That is the trade in parameter-count terms: &lt;strong&gt;MoE is designed to give you the per-token compute of a small model and the capacity of a large one — while still charging you the memory of the large one.&lt;/strong&gt; (Real systems also pay routing overhead and run dense attention layers, so the picture is more nuanced than the two counts alone.) Whether the trade is worth it depends on what binds you — if memory is the constraint, a smaller dense model can win, which is the flip side explored in the related Granite explainer below.&lt;/p&gt;
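
&lt;p&gt;For readers who want to rerun that arithmetic, here is a small Python sketch. It uses only the parameter counts quoted above, the 2 × (active parameters) rule of thumb, and the one-byte-per-parameter assumption; the function names are invented for this example, and the results are estimates derived from counts, not measurements.&lt;/p&gt;

```python
def forward_flops_per_token(active_params):
    """Rule of thumb: about 2 FLOPs per active parameter for one forward pass."""
    return 2 * active_params


def weight_bytes(total_params, bytes_per_param=1):
    """Memory for the weights alone; 1 byte per parameter assumes FP8 storage."""
    return total_params * bytes_per_param


TOTAL_PARAMS = 744 * 10**9   # every weight in the model
ACTIVE_PARAMS = 40 * 10**9   # approximate weights used per token

dense_flops = forward_flops_per_token(TOTAL_PARAMS)   # hypothetical dense 744B model
moe_flops = forward_flops_per_token(ACTIVE_PARAMS)    # the model as shipped

print(f"dense compute per token: {dense_flops / 1e12:.2f} TFLOP")
print(f"MoE compute per token:   {moe_flops / 1e9:.0f} GFLOP")
print(f"compute ratio:           {dense_flops / moe_flops:.1f}x")
print(f"active share of weights: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
print(f"weights in memory:       {weight_bytes(TOTAL_PARAMS) / 1e9:.0f} GB for both")
```

&lt;p&gt;Changing &lt;code&gt;bytes_per_param&lt;/code&gt; to 2 models 16-bit weights and doubles the memory line without touching the compute lines, which is the decoupling the two counts describe.&lt;/p&gt;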

&lt;p&gt;&lt;em&gt;Goes deeper in: LLM Internals → Transformer Block → The Feed-Forward Network and LLM Internals → Quantization → Why Quantize&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/ibm-granite-4-1-dense-vs-moe" rel="noopener noreferrer"&gt;IBM Granite 4.1 — 8B dense matches the prior 32B MoE&lt;/a&gt; — the serving-cost flip side: when memory is what binds, a smaller dense model can beat a larger MoE&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/mobilemoe-dram-aware-scaling" rel="noopener noreferrer"&gt;MobileMoE — DRAM-aware MoE scaling&lt;/a&gt; — what the active-vs-total gap buys you on memory-constrained devices&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/softmoe-differentiable-routing" rel="noopener noreferrer"&gt;SoftMoE — differentiable soft top-k routing&lt;/a&gt; — how the router actually decides which experts fire each token&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between active and total parameters?
&lt;/h3&gt;

&lt;p&gt;Total parameters are every weight the model contains — they set its knowledge capacity and its memory footprint, because all of them must be loaded into GPU memory. Active parameters are the subset actually read and multiplied for a single token; they set the per-token compute and bandwidth. In a dense model the two are equal; in a sparse Mixture-of-Experts model like GLM-5.2, active (~40B) is a small fraction of total (744B).&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does GLM-5.2 list two parameter counts (744B total, 40B active)?
&lt;/h3&gt;

&lt;p&gt;Because it is a Mixture-of-Experts model. Its feed-forward layers are split into many expert sub-networks, and a router activates only a handful per token — so the model holds 744B weights but fires only ~40B on any given token. The total predicts the memory and GPU you need; the active count predicts how fast and how cheaply it runs per token. A single number would hide that the two costs have decoupled.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does a lower active-parameter count make a model cheaper to run?
&lt;/h3&gt;

&lt;p&gt;It makes the per-token compute and bandwidth cheaper — GLM-5.2 computes a token at roughly the cost of a 40B model. But it does not lower the memory bill: all 744B total parameters still have to fit in GPU memory whether or not they fire. So a very sparse model is cheap on compute and expensive on memory, which is why deployments often pair it with quantization and multi-GPU nodes.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/glm-5-2-active-vs-total-parameters" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Agent Leaderboards Mislead Under Distribution Shift (IBM): Predictive Validity</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Mon, 22 Jun 2026 11:17:42 +0000</pubDate>
      <link>https://dev.to/pueding/agent-leaderboards-mislead-under-distribution-shift-ibm-predictive-validity-4d0c</link>
      <guid>https://dev.to/pueding/agent-leaderboards-mislead-under-distribution-shift-ibm-predictive-validity-4d0c</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/PkEHYoxtX6c"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; A new IBM paper, &lt;strong&gt;"Beyond Static Leaderboards"&lt;/strong&gt;, argues that the way we rank AI agents is broken: a leaderboard collapses each agent into one &lt;strong&gt;aggregate score&lt;/strong&gt; and sorts by it. The fix it proposes is &lt;strong&gt;predictive validity&lt;/strong&gt; — the &lt;strong&gt;rank correlation between a benchmark's ranking and the ranking you'd see out-of-distribution&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; A single leaderboard number is a &lt;strong&gt;weak signal for real deployment&lt;/strong&gt;. The whole point of an eval is to tell you which agent to ship — and if the benchmark's #1 isn't your deployment's #1, the ranking you trusted pointed at the wrong agent. This is the core lesson of &lt;strong&gt;Evals &amp;amp; Diagnostics&lt;/strong&gt; and &lt;strong&gt;Production Evals&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; Where the old way ranks agents by their &lt;strong&gt;aggregate mean score&lt;/strong&gt; on one benchmark and trusts that order, predictive validity asks a sharper question: &lt;strong&gt;does that order survive a distribution shift?&lt;/strong&gt; IBM's finding is blunt — aggregate-score rankings &lt;strong&gt;do not transfer&lt;/strong&gt; out-of-distribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;Ranking sprinters by their indoor bests, then racing them outdoors in the wind.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          SAME SPRINTERS, RANKED TWO WAYS
                         │
          ┌──────────────┴──────────────┐
          │                             │
   ┌──────▼──────┐               ┌──────▼──────┐
   │  INDOORS    │               │  OUTDOORS   │
   │  (no wind)  │               │  (windy)    │
   │   1. A      │               │   1. B      │
   │   2. B      │               │   2. C      │
   │   3. C      │               │   3. A      │
   └──────┬──────┘               └──────┬──────┘
          │                             │
   the leaderboard               the deployment
          └──────────────┬──────────────┘
                         ▼
       predictive validity = does the indoor
       order survive once the wind hits?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;sprinter = a model competing on the leaderboard&lt;/li&gt;
&lt;li&gt;indoor personal-best ranking = the aggregate-score leaderboard, measured in one controlled setting&lt;/li&gt;
&lt;li&gt;racing outdoors in the wind = deployment under a shifted, out-of-distribution workload&lt;/li&gt;
&lt;li&gt;the podium reshuffling = rank instability when the conditions change&lt;/li&gt;
&lt;li&gt;predictive validity = how well the indoor ranking predicts who actually wins outdoors&lt;/li&gt;
&lt;/ul&gt;
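
&lt;p&gt;To put a number on the podium reshuffle above, the sketch below scores the indoor and outdoor orderings with Spearman's rank correlation. Spearman's rho is one common rank-correlation statistic and is used here as an assumption; this explainer does not say which statistic the IBM paper uses, and the helper name &lt;code&gt;spearman_rho&lt;/code&gt; is made up for the example.&lt;/p&gt;

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation between two rankings of the same items, no ties.

    rank_a and rank_b map each item to its 1-based rank.
    """
    n = len(rank_a)
    # Sum of squared rank differences across all items.
    d_squared = sum((rank_a[item] - rank_b[item]) ** 2 for item in rank_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# The three sprinters from the diagram: same agents, two conditions.
indoors = {"A": 1, "B": 2, "C": 3}    # leaderboard (in-sample) order
outdoors = {"B": 1, "C": 2, "A": 3}   # deployment (out-of-distribution) order

print(spearman_rho(indoors, indoors))   # identical order gives 1.0
print(spearman_rho(indoors, outdoors))  # the reshuffled podium gives -0.5
```

&lt;p&gt;A value near +1 would mean the leaderboard order carried over to deployment; the negative value here says the indoor ranking was worse than uninformative for this toy example.&lt;/p&gt;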

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Predictive validity&lt;/strong&gt; — Borrowed from measurement theory: &lt;strong&gt;does a test's score predict the real-world outcome it claims to measure?&lt;/strong&gt; For agent evals, IBM defines it as the &lt;strong&gt;rank correlation between in-sample and out-of-distribution results&lt;/strong&gt; — not the raw score, but whether the &lt;em&gt;ordering&lt;/em&gt; of agents holds up when conditions change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aggregate score&lt;/strong&gt; — The single number a leaderboard reports per agent — typically a mean across many tasks. It is easy to sort by, but it &lt;strong&gt;throws away the variance&lt;/strong&gt; that tells you whether the ranking is stable. See AI Agents → Evals &amp;amp; Diagnostics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In-sample vs out-of-distribution (OOD)&lt;/strong&gt; — &lt;strong&gt;In-sample&lt;/strong&gt; = the conditions the benchmark actually measured. &lt;strong&gt;Out-of-distribution&lt;/strong&gt; = anything different in deployment — new task types, a new orchestration, a shifted input mix. The gap between them is where leaderboards quietly fail; production teams watch it as drift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rank correlation&lt;/strong&gt; — A measure of how well two rankings of the same items agree — &lt;strong&gt;+1&lt;/strong&gt; is identical order, &lt;strong&gt;0&lt;/strong&gt; is unrelated, &lt;strong&gt;−1&lt;/strong&gt; is reversed. Predictive validity &lt;em&gt;is&lt;/em&gt; this number, computed between the in-sample and OOD rankings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rank instability&lt;/strong&gt; — When a small change in conditions reshuffles the leaderboard — the agent ranked first in-sample lands third out-of-distribution. IBM points to &lt;strong&gt;public-to-hidden competition retrospectives&lt;/strong&gt; as direct evidence this happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Falsifiable criterion&lt;/strong&gt; — A pass/fail test you can actually fail. IBM frames predictive validity through &lt;strong&gt;three falsifiable out-of-distribution criteria&lt;/strong&gt;, so a benchmark's claim to validity can be checked and rejected — not just asserted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP-based agent benchmark&lt;/strong&gt; — A benchmark built on the Model Context Protocol tool interface, so the same agent harness can be re-implemented many ways. IBM ran &lt;strong&gt;fourteen parallel implementations&lt;/strong&gt; of one such industrial-agent benchmark.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On &lt;strong&gt;June 18, 2026&lt;/strong&gt;, an IBM-led team (Dhaval Patel et al.) posted &lt;a href="https://arxiv.org/abs/2606.19704" rel="noopener noreferrer"&gt;&lt;em&gt;Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents&lt;/em&gt;&lt;/a&gt; to arXiv. They ran &lt;strong&gt;fourteen parallel implementation studies&lt;/strong&gt; of an MCP-based industrial-agent benchmark — varying asset classes, orchestrations, retrieval strategies, and reasoning modes — and aggregated &lt;strong&gt;seven prior agent benchmarks&lt;/strong&gt;. The headline: rankings derived from aggregate scores &lt;strong&gt;do not transfer&lt;/strong&gt; to out-of-distribution settings. In place of one number, they propose ranking benchmark configurations by &lt;strong&gt;predictive validity&lt;/strong&gt;: the correlation between in-sample and out-of-distribution rank, structured as a &lt;strong&gt;twelve-tier measurement apparatus&lt;/strong&gt; with &lt;strong&gt;three falsifiable criteria&lt;/strong&gt;. &lt;a href="https://arxiv.org/abs/2606.19704" rel="noopener noreferrer"&gt;Read the paper →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture timing a field of sprinters indoors, on a fast track with no wind, and printing the ranking from their personal bests. On paper you now know exactly who is fastest — first, second, third, in order. Then race day comes, outdoors, into a gusting headwind, and &lt;strong&gt;the podium reshuffles: the indoor record-holder fades to third, and someone who was never the fastest indoors wins the race that counts&lt;/strong&gt;. The indoor clock wasn't lying — it measured real speed in &lt;em&gt;one&lt;/em&gt; setting. It just had no way to tell you whether that order would survive the wind. The sprinter is an agent, the indoor ranking is an aggregate-score leaderboard, the outdoor race is deployment, and the question the indoor clock can't answer is &lt;strong&gt;predictive validity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A leaderboard does exactly what the indoor clock does. It runs each agent over a fixed battery of tasks, averages the results into one &lt;strong&gt;aggregate score&lt;/strong&gt;, and sorts. That sort is the product everyone consumes — the tweet, the ranking row, the "best open agent" headline. &lt;strong&gt;But the average is measured under one distribution of tasks, and IBM's central result is that the ordering it produces does not hold once the distribution moves.&lt;/strong&gt; When they built the &lt;em&gt;same&lt;/em&gt; industrial-agent benchmark fourteen different ways — swapping orchestrations, retrieval strategies, and reasoning modes — the rankings disagreed with each other, and public-to-hidden competition retrospectives showed the same rank instability in the wild.&lt;/p&gt;

&lt;p&gt;The deeper move is to stop treating the benchmark as a scoreboard and start treating it as a &lt;strong&gt;measurement instrument&lt;/strong&gt; — and to ask of any instrument the measurement-theory question: does its reading predict the thing you actually care about? &lt;strong&gt;IBM operationalizes that as predictive validity: the rank correlation between a configuration's in-sample ranking and its out-of-distribution ranking&lt;/strong&gt; — a number near +1 means the leaderboard predicts reality, a number near 0 means it doesn't. They wrap it in a twelve-tier apparatus with three falsifiable criteria, so a benchmark's claim to validity is something you can test and reject, not just assert. In production terms, it is the difference between trusting an offline leaderboard and watching how rankings hold under shifted, online traffic.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;How you read the benchmark&lt;/th&gt;
&lt;th&gt;What it reports&lt;/th&gt;
&lt;th&gt;What it misses&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Aggregate score (today's leaderboard)&lt;/td&gt;
&lt;td&gt;one mean number per agent → a sorted ranking&lt;/td&gt;
&lt;td&gt;whether that ranking survives any change in conditions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Score + confidence interval&lt;/td&gt;
&lt;td&gt;the mean plus its in-sample noise&lt;/td&gt;
&lt;td&gt;still in-sample only — no view of the out-of-distribution shift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Predictive validity (&lt;a href="https://arxiv.org/abs/2606.19704" rel="noopener noreferrer"&gt;IBM&lt;/a&gt;)&lt;/td&gt;
&lt;td&gt;rank correlation between in-sample and out-of-distribution rankings&lt;/td&gt;
&lt;td&gt;— &lt;em&gt;(directly tests transfer; 14 implementations, 12-tier apparatus, 3 falsifiable criteria)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Where the ranking breaks
&lt;/h3&gt;

&lt;p&gt;Here is why an unstable ranking is worse than a noisy one. Take an illustrative slice of &lt;strong&gt;three agents&lt;/strong&gt; — call them A, B, C — that an aggregate-score leaderboard ranks &lt;strong&gt;A &amp;gt; B &amp;gt; C&lt;/strong&gt; by a hair: scores of 71, 70, 68. The gaps are tiny, but the leaderboard reports a confident order, and a team reading it ships &lt;strong&gt;A&lt;/strong&gt;. Now shift the distribution — a new asset class, a different orchestration — and re-score: A drops to 64, B holds at 69, C climbs to 67. The out-of-distribution order is now &lt;strong&gt;B &amp;gt; C &amp;gt; A&lt;/strong&gt;: A has fallen from first to last, and both B and C now outrank it. The rank correlation between the two orderings is &lt;strong&gt;negative&lt;/strong&gt; — the leaderboard didn't just lose precision, it pointed at the &lt;em&gt;wrong&lt;/em&gt; agent. &lt;em&gt;(Only the 14 implementations, 12-tier apparatus, and 3 falsifiable criteria come from the paper; the A/B/C scores are illustrative.)&lt;/em&gt; &lt;strong&gt;A single aggregate number with a tidy sort hid the one fact that mattered: that order was never stable enough to ship on.&lt;/strong&gt;&lt;/p&gt;
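
&lt;p&gt;To make the A/B/C example concrete, here is a minimal Python sketch that ranks both score tables and computes a rank correlation between them. It assumes Spearman's coefficient and no tied scores; which coefficient the paper uses is not stated here, and the helper names &lt;code&gt;ranks&lt;/code&gt; and &lt;code&gt;spearman_rho&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
def ranks(scores):
    """Map each name to its rank (1 = highest score). Assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def spearman_rho(in_sample, out_of_dist):
    """Spearman rank correlation between two score tables over the same agents."""
    r_in, r_ood = ranks(in_sample), ranks(out_of_dist)
    n = len(r_in)
    d_squared = sum((r_in[name] - r_ood[name]) ** 2 for name in r_in)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Illustrative scores from the paragraph above (not from the paper).
in_sample = {"A": 71, "B": 70, "C": 68}
out_of_dist = {"A": 64, "B": 69, "C": 67}

rho = spearman_rho(in_sample, out_of_dist)
print(ranks(in_sample))    # {'A': 1, 'B': 2, 'C': 3}
print(ranks(out_of_dist))  # {'B': 1, 'C': 2, 'A': 3}
print(rho)                 # -0.5
```

&lt;p&gt;Three agents make for a toy result, but the same function applies to any pair of score tables that cover the same agents.&lt;/p&gt;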


&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;p&gt;This explainer covers a single concept, so its closest neighbors are &lt;strong&gt;other explainers on how a single evaluation number can quietly mislead&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/weavebench-trajectory-aware-grading" rel="noopener noreferrer"&gt;WeaveBench — trajectory-aware vs outcome-only grading&lt;/a&gt; — the sibling failure: WeaveBench shows a &lt;em&gt;single run's grade&lt;/em&gt; can be inflated; predictive validity shows a &lt;em&gt;whole ranking&lt;/em&gt; can be invalid&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/futuresim-harness-level-eval" rel="noopener noreferrer"&gt;FutureSim — harness-level agent eval&lt;/a&gt; — evaluating the agent's process rather than a single final number, the same "one score hides the truth" theme&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/efc-feedback-quality-scaling-law" rel="noopener noreferrer"&gt;Effective Feedback Compute (EFC)&lt;/a&gt; — another result that a headline number (raw compute) is the wrong predictor of agent success&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is predictive validity for AI agent evals?
&lt;/h3&gt;

&lt;p&gt;Predictive validity is a measurement-theory idea IBM applies to agent leaderboards: instead of ranking agents by their aggregate score, you measure the rank correlation between a benchmark's in-sample ranking and the ranking it produces out-of-distribution. A high correlation means the leaderboard predicts real-world ordering; a low correlation means the score is a poor guide to which agent to actually deploy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why are aggregate-score agent leaderboards misleading?
&lt;/h3&gt;

&lt;p&gt;Because they collapse a whole agent into one mean number measured under a single distribution of tasks, then sort by it. IBM's "Beyond Static Leaderboards" ran the same industrial-agent benchmark fourteen ways and found the rankings disagreed, and public-to-hidden competition retrospectives show the same rank instability. The sorted order looks authoritative but does not transfer once conditions shift, so it is a weak signal for deciding what to ship.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does predictive validity relate to distribution shift?
&lt;/h3&gt;

&lt;p&gt;Distribution shift is exactly the condition predictive validity tests. In-sample means the tasks the benchmark measured; out-of-distribution means anything different in deployment — new task types, a new orchestration, a shifted input mix. Predictive validity asks whether the agent ranking holds across that gap, and IBM structures it as a twelve-tier apparatus with three falsifiable out-of-distribution criteria so the claim can be checked rather than assumed.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/agent-leaderboards-predictive-validity" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>AMD ATOM + ATOMesh: Prefill/decode Disaggregation on ROCm</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Sun, 21 Jun 2026 11:20:15 +0000</pubDate>
      <link>https://dev.to/pueding/amd-atom-atomesh-prefilldecode-disaggregation-on-rocm-2p0a</link>
      <guid>https://dev.to/pueding/amd-atom-atomesh-prefilldecode-disaggregation-on-rocm-2p0a</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/5zDR-eQJXYM"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; AMD shipped &lt;strong&gt;ATOM + ATOMesh&lt;/strong&gt;, a ROCm-native LLM serving stack whose headline trick is &lt;strong&gt;prefill/decode disaggregation&lt;/strong&gt; — splitting the two phases of inference onto separate pools of GPUs instead of crowding them onto one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Prefill and decode have &lt;strong&gt;opposite bottlenecks&lt;/strong&gt; — prefill is compute-bound, decode is memory-bandwidth-bound — so running them on the same worker wastes hardware and lets one long prompt &lt;strong&gt;stall everyone else's token stream&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; A &lt;strong&gt;co-located server (vanilla single-pool vLLM)&lt;/strong&gt; interleaves prefill and decode on the same GPUs; disaggregation runs each on its own pool tuned for its bottleneck, paying for it by &lt;strong&gt;shipping the KV cache across the interconnect&lt;/strong&gt; between them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A restaurant kitchen that splits the prep station from the plating line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   ORDER (the prompt)
        │
        ▼
  ┌──────────────┐    KV cart      ┌──────────────┐
  │ PREP STATION │   down the      │ PLATING LINE │
  │  (prefill)   │═══ hallway ════▶│   (decode)   │──▶ tokens
  │ compute-heavy│  (KV transfer)  │ memory-bound │
  └──────────────┘                 └──────────────┘
   chops a whole                    plates dishes
   order at once                    one at a time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;prefill = the prep cook chopping a whole order's ingredients in one compute-heavy burst&lt;/li&gt;
&lt;li&gt;decode = the plating cook building dishes one at a time, going back to the fridge for each plate&lt;/li&gt;
&lt;li&gt;KV cache = the fridge of prepped ingredients every plate reaches into&lt;/li&gt;
&lt;li&gt;disaggregation = giving prep and plating their own stations and staff, each tuned to its job&lt;/li&gt;
&lt;li&gt;KV-cache transfer = wheeling the prep cart down the hallway from prep to the plating line&lt;/li&gt;
&lt;li&gt;KV-aware scheduling = sending each order to the line whose fridge already holds its prep&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prefill&lt;/strong&gt; — The first phase of inference: the model reads your &lt;strong&gt;entire prompt in parallel&lt;/strong&gt; in one pass, building the KV cache. It does a lot of math per byte of memory it touches, so it is compute-bound.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decode&lt;/strong&gt; — The second phase: the model generates &lt;strong&gt;one output token at a time&lt;/strong&gt;, and each step must read the whole KV cache plus all the weights to produce that single token. It moves a lot of memory for little math, so it is &lt;strong&gt;memory-bandwidth-bound&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KV cache&lt;/strong&gt; — The stored &lt;strong&gt;keys and values&lt;/strong&gt; for every token already processed, so the model never recomputes them. It is the dominant memory cost of inference — and, in a disaggregated stack, the thing that has to travel from the prefill pool to the decode pool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute-bound vs memory-bound&lt;/strong&gt; — The roofline distinction: a job is compute-bound when the GPU's math units are the limit, and memory-bound when memory bandwidth is. Prefill and decode sit on &lt;strong&gt;opposite sides of that line&lt;/strong&gt;, which is the whole reason to split them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disaggregation&lt;/strong&gt; — Running prefill and decode on separate pools of workers instead of one shared pool, so each pool can be sized and scheduled for its own bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KV-aware scheduling&lt;/strong&gt; — A scheduler that routes a request with knowledge of &lt;strong&gt;where its KV-cache blocks already live&lt;/strong&gt; — so it can reuse a cached prefix (prefix caching) or steer a request to the worker that avoids a transfer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ROCm / AITER / MORI / Instinct&lt;/strong&gt; — &lt;strong&gt;ROCm&lt;/strong&gt; is AMD's CUDA-equivalent software stack and &lt;strong&gt;Instinct&lt;/strong&gt; its datacenter GPU line. &lt;strong&gt;AITER&lt;/strong&gt; supplies the optimized ROCm kernels (the analogue of CUDA kernels), while &lt;strong&gt;MORI&lt;/strong&gt; handles the distributed, RDMA-style communication for tensor/expert parallelism (AMD's own collective library, RCCL, is the closer NCCL analogue).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On June 16, 2026, AMD published &lt;strong&gt;ATOM + ATOMesh&lt;/strong&gt;, a paired ROCm-native LLM serving stack for Instinct GPUs, shipped as an early (alpha) preview. &lt;strong&gt;ATOM&lt;/strong&gt; is an AITER-optimized inference engine (kernel acceleration via AITER, distributed communication via MORI); &lt;strong&gt;ATOMesh&lt;/strong&gt; is the orchestration layer on top — it exposes an OpenAI-compatible API, manages multiple engine backends, and applies &lt;strong&gt;prefill/decode disaggregation and KV-aware scheduling&lt;/strong&gt;, evaluated serving DeepSeek-V4-Pro on Instinct hardware. In AMD's framing it deliberately mirrors the vLLM/SGLang design — the same serving primitives, now on AMD silicon. &lt;a href="https://rocm.blogs.amd.com/software-tools-optimization/atomesh-inference/README.html" rel="noopener noreferrer"&gt;Read the release →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture a restaurant kitchen where one cook does everything. First they &lt;strong&gt;prep&lt;/strong&gt; an order — chopping, slicing, mixing every ingredient the dish needs, all at once, in a furious burst of knife work. Then they &lt;strong&gt;plate&lt;/strong&gt; it — assembling the dish one component at a time, walking back to the fridge for each piece. Prep is a flat-out, hands-busy job; plating is a lot of trips to the fridge and not much knife work. &lt;strong&gt;Cram both onto one cook and they fight: a big prep order makes every waiting plate go cold, and during the slow plating trips the knives sit idle.&lt;/strong&gt; That single overloaded cook is one GPU running an LLM, and the two jobs are prefill and decode.&lt;/p&gt;

&lt;p&gt;When a model answers, it first runs prefill: it reads your &lt;strong&gt;entire prompt in one parallel pass&lt;/strong&gt;, doing dense matrix math and filling the KV cache. Then it runs &lt;strong&gt;decode&lt;/strong&gt;: it emits output &lt;strong&gt;one token per step&lt;/strong&gt;, and every step drags the whole KV cache and all the weights out of memory to produce that single token. &lt;strong&gt;Prefill is compute-bound — limited by the GPU's math units — while decode is memory-bandwidth-bound, limited by how fast it can stream the cache out of memory.&lt;/strong&gt; They are the prep cook and the plating cook: opposite appetites, forced to share one station.&lt;/p&gt;

&lt;p&gt;That opposite-appetites problem is why a single shared worker wastes hardware. &lt;strong&gt;Pack prefill and decode together and a long prompt's prefill burst blocks the queue of decode steps behind it — a head-of-line stall — while the memory-bound decodes leave the expensive compute units sitting idle.&lt;/strong&gt; You can never shape one machine to be right for both jobs at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disaggregation is the fix: give prep and plating their own stations.&lt;/strong&gt; Prefill runs on one pool of GPUs, scheduled for compute-heavy bursts; decode runs on a separate pool, scheduled for steady memory-bound streaming with large batches. When a request finishes prefill, the prefill worker &lt;strong&gt;hands its KV cache across the interconnect to a decode worker&lt;/strong&gt;, which then streams the tokens out. Each pool is now sized and tuned for the one bottleneck it actually has — and AMD's ATOMesh is the orchestration layer that does exactly this routing on ROCm. &lt;strong&gt;This is the same playbook vLLM and SGLang made standard; ATOM + ATOMesh shows AMD building a ROCm-native path to it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But disaggregation is not free, and the bill comes due at the handoff. After prefill, the KV cache has to physically travel from the prefill pool to the decode pool. For a 70B-class model with a 2,048-token prompt, that cache is &lt;code&gt;2 × 80 layers × 8 KV-heads × 128 dim × 2,048 tokens × 2 B ≈&lt;/code&gt; &lt;strong&gt;&lt;code&gt;0.67 GB&lt;/code&gt;&lt;/strong&gt; &lt;em&gt;(illustrative, Llama-3.1-70B with grouped-query attention)&lt;/em&gt;. Move it over a PCIe 4.0 x16 link (assuming about 32 GB/s) and you pay roughly &lt;strong&gt;&lt;code&gt;21 ms&lt;/code&gt;&lt;/strong&gt;; over NVLink (assuming about 900 GB/s), about &lt;strong&gt;&lt;code&gt;0.75 ms&lt;/code&gt;&lt;/strong&gt;, which is &lt;strong&gt;a ~28× gap&lt;/strong&gt; &lt;em&gt;(all three figures illustrative: the size is from the formula above, the times are that size divided by each link's assumed bandwidth, none measured on ATOM)&lt;/em&gt;. &lt;strong&gt;That gap is why disaggregated stacks live or die by their interconnect — and why KV-aware scheduling tries to dodge the transfer entirely, steering a request to a worker that already holds its prefix.&lt;/strong&gt;&lt;/p&gt;
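
&lt;p&gt;The arithmetic behind those figures can be checked with a short Python sketch. The link bandwidths (32 GB/s for PCIe 4.0 x16, 900 GB/s for NVLink) are assumed nominal values, chosen because they reproduce the times quoted above; they are not from AMD's post, and a real transfer would add latency and protocol overhead on top.&lt;/p&gt;

```python
# KV-cache size for one request, following the formula in the text:
# 2 (keys and values) x layers x KV heads x head dim x tokens x bytes per value.
LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
TOKENS = 2048
BYTES_PER_VALUE = 2  # 16-bit cache entries

kv_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * TOKENS * BYTES_PER_VALUE
kv_gb = kv_bytes / 1e9

# Assumed nominal link bandwidths in GB/s (illustrative, not from AMD's post).
LINK_GB_PER_S = {"PCIe 4.0 x16": 32, "NVLink": 900}


def transfer_ms(size_gb, bandwidth_gb_per_s):
    """Ideal transfer time in milliseconds, ignoring latency and overhead."""
    return size_gb / bandwidth_gb_per_s * 1000


print(f"KV cache: {kv_bytes} bytes (about {kv_gb:.2f} GB)")
for link, bandwidth in LINK_GB_PER_S.items():
    print(f"{link}: {transfer_ms(kv_gb, bandwidth):.2f} ms")
# KV cache: 671088640 bytes (about 0.67 GB)
# PCIe 4.0 x16: 20.97 ms
# NVLink: 0.75 ms
```

&lt;p&gt;Because the transfer time is just size divided by bandwidth, the ratio between the two links equals the ratio of their bandwidths, which is where the roughly 28× gap comes from.&lt;/p&gt;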

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;What it processes&lt;/th&gt;
&lt;th&gt;Bottleneck (roofline)&lt;/th&gt;
&lt;th&gt;What it wants from the hardware&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prefill&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The whole prompt, in one parallel pass&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Compute-bound&lt;/strong&gt; — high arithmetic intensity&lt;/td&gt;
&lt;td&gt;Raw matmul throughput; fewer, fatter GPUs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Decode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One output token per step, reading the full KV cache&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Memory-bandwidth-bound&lt;/strong&gt; — low arithmetic intensity&lt;/td&gt;
&lt;td&gt;Memory bandwidth and large batches to amortize the weight reads&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The honest caveat: ATOM + ATOMesh ship as an &lt;strong&gt;early (alpha) preview&lt;/strong&gt;, and AMD's post describes the &lt;strong&gt;mechanism&lt;/strong&gt;, not head-to-head numbers. It reports that ATOMesh mirrors the vLLM/SGLang design and was evaluated serving DeepSeek-V4-Pro, but the post text gives no usable throughput or latency figures. Treat any performance claim as &lt;strong&gt;not yet quantified here&lt;/strong&gt;, and check the source for benchmarks. The KV-transfer figures above are &lt;strong&gt;illustrative&lt;/strong&gt;, sized to a representative model rather than measured on ATOM. But the durable lesson stands: &lt;strong&gt;once you see that prefill and decode sit on opposite sides of the roofline, "one GPU does both" stops looking efficient — and a serving stack's real job is to split the two phases and move the KV cache between them cheaply.&lt;/strong&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/sglang-v0-5-12-tokenspeed-mla" rel="noopener noreferrer"&gt;SGLang v0.5.12 — TokenSpeed MLA backend&lt;/a&gt; — SGLang is one half of the vLLM/SGLang design ATOMesh mirrors; this is the engine-level optimization that lives &lt;em&gt;inside&lt;/em&gt; a pool like ATOM's.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/hf-async-continuous-batching" rel="noopener noreferrer"&gt;HuggingFace — Async continuous batching&lt;/a&gt; — the other lever for keeping decode workers busy; disaggregation and continuous batching are complementary ways to fight the same memory-bound decode problem.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/tangram-per-head-kv-budgets" rel="noopener noreferrer"&gt;Tangram — Per-head KV cache budgets&lt;/a&gt; — shrinks the KV cache itself, which is exactly the payload a disaggregated stack has to transfer between pools.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/spec-decode-latency-load-model" rel="noopener noreferrer"&gt;Spec-decode latency — Load-dependent latency model&lt;/a&gt; — models how decode latency moves with load, the phase disaggregation isolates onto its own pool.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is prefill/decode disaggregation?
&lt;/h3&gt;

&lt;p&gt;It is a serving design that runs the two phases of LLM inference on separate pools of GPUs. Prefill — reading the whole prompt in one parallel, compute-heavy pass — runs on one pool, and decode — generating output one token at a time, bottlenecked by memory bandwidth — runs on another. After prefill, the request's KV cache is transferred across the interconnect to a decode worker. Splitting them lets each pool be sized and scheduled for its own bottleneck instead of compromising on one shared machine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why split prefill and decode onto separate GPUs?
&lt;/h3&gt;

&lt;p&gt;Because they have opposite bottlenecks. Prefill is compute-bound (limited by the GPU's math units), while decode is memory-bandwidth-bound (limited by how fast it streams the KV cache and weights out of memory). On one shared worker a long prefill stalls the decode steps queued behind it, and the memory-bound decodes leave the compute units idle. Running each phase on hardware tuned for its own limit avoids that mutual interference — at the cost of moving the KV cache between the two pools.&lt;/p&gt;

&lt;h3&gt;
  
  
  What do AMD's ATOM and ATOMesh add, and how do they relate to vLLM and SGLang?
&lt;/h3&gt;

&lt;p&gt;ATOM is a ROCm-native inference engine (optimized kernels via AITER, cross-GPU communication via MORI) and ATOMesh is the orchestration layer above it — an OpenAI-compatible API that applies prefill/decode disaggregation and KV-aware scheduling. AMD describes it as deliberately mirroring the vLLM/SGLang design, so the contribution is not a new algorithm but the same modern serving primitives brought to AMD Instinct GPUs — a second-vendor implementation of the stack the LLM Serving track teaches.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/amd-atom-prefill-decode-disaggregation" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>NVIDIA Blackwell Sweeps MLPerf Training 6.0: Strong Scaling</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Sat, 20 Jun 2026 11:15:43 +0000</pubDate>
      <link>https://dev.to/pueding/nvidia-blackwell-sweeps-mlperf-training-60-strong-scaling-2epd</link>
      <guid>https://dev.to/pueding/nvidia-blackwell-sweeps-mlperf-training-60-strong-scaling-2epd</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/b7Xqa6w251g"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; On &lt;strong&gt;June 16, 2026&lt;/strong&gt;, NVIDIA's &lt;strong&gt;Blackwell&lt;/strong&gt; platform posted the fastest time on all seven &lt;strong&gt;MLPerf Training 6.0&lt;/strong&gt; benchmarks. The lens this explainer uses to read that result is &lt;strong&gt;strong scaling&lt;/strong&gt; — how much faster a &lt;em&gt;fixed&lt;/em&gt; model trains as you pour in more GPUs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Frontier pretraining now runs on &lt;strong&gt;5,000–8,000-GPU clusters&lt;/strong&gt;, and &lt;strong&gt;how well those GPUs scale together&lt;/strong&gt; — not just how many you own — decides both the wall-clock and the bill for training a model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; The naive assumption is that &lt;strong&gt;twice the GPUs means half the time&lt;/strong&gt;. Strong scaling is the reality check: at every step the GPUs must &lt;strong&gt;stop and synchronize&lt;/strong&gt;, so a loosely wired cluster gives sub-linear speedup, while a &lt;strong&gt;rack-scale NVLink domain&lt;/strong&gt; keeps it close to linear.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A rowing crew racing one boat to the finish.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    2× THE ROWERS (GPUs)
                            │
              ┌─────────────┴─────────────┐
              │                           │
      ┌───────▼───────┐          ┌────────▼───────┐
      │  NVLink crew  │          │  Loose cluster │
      │ (one cadence) │          │ (off the beat) │
      └───────┬───────┘          └────────┬───────┘
              │                           │
     every oar hits the          rowers miss the catch,
     catch in unison             power cancels at sync
              │                           │
              ▼                           ▼
        ✓ near-linear              ✗ far below 2×
          (~2× faster)              (sync tax eats it)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;GPU = one rower pulling one oar&lt;/li&gt;
&lt;li&gt;training step = one stroke the whole crew takes together&lt;/li&gt;
&lt;li&gt;gradient sync = the catch, where every oar must hit the water at the same instant&lt;/li&gt;
&lt;li&gt;adding GPUs = adding rowers to the boat&lt;/li&gt;
&lt;li&gt;NVLink rack domain = the racing shell and coxswain that keep a huge crew in perfect time&lt;/li&gt;
&lt;li&gt;low-precision math = lighter oars, so there is less to move on every stroke&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MLPerf Training&lt;/strong&gt; — The industry-standard training benchmark. It measures one thing: the &lt;strong&gt;wall-clock time to train a model to a fixed quality target&lt;/strong&gt; — so a faster time is a real, comparable result, not a vendor's peak-throughput number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strong scaling&lt;/strong&gt; — Hold the problem fixed (one model, one quality target), add more GPUs, and measure the speedup. Its sibling, &lt;strong&gt;weak scaling&lt;/strong&gt;, grows the problem &lt;em&gt;with&lt;/em&gt; the hardware. Strong scaling is the harder test, because the work per GPU keeps shrinking while the coordination cost does not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gradient synchronization (AllReduce)&lt;/strong&gt; — Each GPU trains on a different slice of the batch, then they must &lt;strong&gt;average their gradients&lt;/strong&gt; before the next step starts. That collective exchange, an AllReduce, is a barrier: nobody moves on until everyone has caught up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NVLink domain (NVL72)&lt;/strong&gt; — 72 GPUs wired by fifth-generation NVLink into &lt;strong&gt;one coherent, high-bandwidth fabric&lt;/strong&gt; — a single rack that behaves like one big accelerator. The fast fabric is what makes the synchronization barrier cheap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Low-precision math (FP8 / NVFP4)&lt;/strong&gt; — Running the heavy matrix multiplies in 8-bit FP8 or 4-bit NVFP4 instead of 16-bit, so there is &lt;strong&gt;less data to move and less to compute&lt;/strong&gt; on every step. Blackwell's tensor cores support both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling efficiency&lt;/strong&gt; — The actual speedup divided by the ideal one. Double the GPUs and perfectly halve the time and you are at &lt;strong&gt;100% — perfectly linear&lt;/strong&gt;. Anything less is the time the GPUs spent waiting on each other.&lt;/p&gt;
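
&lt;p&gt;As a rough sketch of the scaling-efficiency definition, the Python function below divides the actual speedup by the ideal one. The GPU counts and timings are hypothetical, chosen only to show the arithmetic, and the name &lt;code&gt;scaling_efficiency&lt;/code&gt; does not come from any MLPerf tooling.&lt;/p&gt;

```python
def scaling_efficiency(base_gpus, base_minutes, scaled_gpus, scaled_minutes):
    """Actual speedup divided by ideal speedup for a fixed training job."""
    actual_speedup = base_minutes / scaled_minutes
    ideal_speedup = scaled_gpus / base_gpus
    return actual_speedup / ideal_speedup


# Hypothetical timings, chosen only to illustrate the arithmetic.
perfect = scaling_efficiency(base_gpus=512, base_minutes=40.0,
                             scaled_gpus=1024, scaled_minutes=20.0)
taxed = scaling_efficiency(base_gpus=512, base_minutes=40.0,
                           scaled_gpus=1024, scaled_minutes=25.0)

print(f"{perfect:.0%}")  # 100%
print(f"{taxed:.0%}")    # 80%
```

&lt;p&gt;In the second case, doubling the GPUs cut the time from 40 to 25 minutes instead of 20, so a fifth of the ideal speedup was lost to waiting.&lt;/p&gt;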

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On &lt;strong&gt;June 16, 2026&lt;/strong&gt;, NVIDIA reported that its &lt;strong&gt;Blackwell&lt;/strong&gt; platform posted the fastest time on &lt;strong&gt;every one of MLPerf Training 6.0's seven benchmarks&lt;/strong&gt;. The new &lt;strong&gt;GB300 NVL72&lt;/strong&gt; rack trained up to &lt;strong&gt;1.6× faster than the previous GB200 NVL72&lt;/strong&gt;. Submissions scaled to &lt;strong&gt;8,192 GPUs&lt;/strong&gt; — CoreWeave trained &lt;strong&gt;DeepSeek-V3 671B&lt;/strong&gt; to target in &lt;strong&gt;2.02 minutes&lt;/strong&gt;, and Microsoft Azure hit the quality target on &lt;strong&gt;Llama 3.1 405B in 7.07 minutes&lt;/strong&gt; at 8,192-GPU scale. The round also added new &lt;strong&gt;mixture-of-experts&lt;/strong&gt; pretraining workloads. &lt;a href="https://blogs.nvidia.com/blog/blackwell-mlperf-training-6-0/" rel="noopener noreferrer"&gt;Read the release →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture the rowing crew. The finish line is the model's quality target, and &lt;strong&gt;one stroke of the whole crew is one training step&lt;/strong&gt; — every rower pulls, the boat lurches forward, and they reset for the next stroke. Each rower is a GPU, working a different slice of the same race. The trick is the &lt;em&gt;catch&lt;/em&gt;: the instant the oars enter the water. &lt;strong&gt;If every oar hits at the same moment the boat surges; if they're even slightly out of time, the power cancels and the boat wallows.&lt;/strong&gt; Adding rowers should make the boat faster — but only if the bigger crew can still hit the catch together.&lt;/p&gt;

&lt;p&gt;That "only if" is the whole story, and its real name is &lt;strong&gt;strong scaling&lt;/strong&gt;: fix the model, add GPUs, and see how much the clock actually drops. The catch is the catch. Every step, the GPUs have to &lt;strong&gt;stop and combine their partial results&lt;/strong&gt; — gradients across the data-parallel replicas, plus activations and weights traded inside the tensor- and expert-parallel groups — before the next step can begin. &lt;strong&gt;That synchronization is a tax that grows as the crew grows, so doubling the GPUs buys you less than 2× — the speedup curve bends below the straight line.&lt;/strong&gt; A naive cluster, like a crew that can't hold its timing, gives back most of what each new rower adds.&lt;/p&gt;
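
&lt;p&gt;A toy model shows why the curve bends. The sketch below assumes, purely for illustration, that a step costs a fixed amount of compute split evenly across GPUs plus a synchronization term that grows with the logarithm of the GPU count. The functional form, the function names, and both constants are assumptions, not measurements from any MLPerf submission.&lt;/p&gt;

```python
import math

def step_time(n_gpus, compute_s=64.0, sync_per_hop_s=0.05):
    """Toy per-step time: evenly split compute plus a log-scaled sync cost.

    compute_s and sync_per_hop_s are made-up constants for illustration.
    """
    return compute_s / n_gpus + sync_per_hop_s * math.log2(n_gpus)

def speedup_from_doubling(n_gpus):
    """How much faster one step gets when the GPU count doubles."""
    return step_time(n_gpus) / step_time(2 * n_gpus)

for n in (8, 64, 512):
    print(n, round(speedup_from_doubling(n), 3))
```

&lt;p&gt;With these made-up constants, doubling from 8 GPUs still buys roughly 1.94×, while doubling from 512 buys almost nothing: the compute has been divided so finely that the sync term dominates the step.&lt;/p&gt;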

&lt;p&gt;So the engineering is all about making the catch cheap. NVIDIA's answer is the &lt;strong&gt;rack-scale NVLink domain&lt;/strong&gt;: the GB300 NVL72 ties &lt;strong&gt;72 GPUs into one coherent fabric&lt;/strong&gt; — the racing shell and coxswain that keep a huge crew locked to a single cadence — so the per-step exchange finishes fast enough that thousands of GPUs still row almost as one. Pair that with &lt;strong&gt;lower-precision math&lt;/strong&gt; — Blackwell's tensor cores run the matmuls in 8-bit FP8 and 4-bit NVFP4, lighter oars with less to move every stroke — plus a stronger software stack, and NVIDIA credits that combination for the sweep: a GB300 rack trains up to &lt;strong&gt;1.6× faster&lt;/strong&gt; than last generation, and &lt;strong&gt;8,192 GPUs finish in minutes&lt;/strong&gt;, not hours.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;MLPerf Training 6.0 result&lt;/th&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;What it shows&lt;/th&gt;
&lt;th&gt;Time to target&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GB300 NVL72 vs GB200 NVL72&lt;/td&gt;
&lt;td&gt;72-GPU rack&lt;/td&gt;
&lt;td&gt;hardware-generation speedup&lt;/td&gt;
&lt;td&gt;up to ~1.6× faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V3 671B (MoE)&lt;/td&gt;
&lt;td&gt;~8,192 GPUs&lt;/td&gt;
&lt;td&gt;strong scaling, new MoE workload&lt;/td&gt;
&lt;td&gt;~2.02 min (CoreWeave)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 405B (dense)&lt;/td&gt;
&lt;td&gt;~8,192-GPU scale&lt;/td&gt;
&lt;td&gt;strong scaling at frontier size&lt;/td&gt;
&lt;td&gt;~7.07 min (Azure)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Strong scaling, in one calculation
&lt;/h3&gt;

&lt;p&gt;Hold the model fixed — DeepSeek-V3, 671B parameters, trained to MLPerf's quality target — and watch the clock as you add rowers. On &lt;strong&gt;8,192&lt;/strong&gt; Blackwell GPUs, CoreWeave's run finished in &lt;strong&gt;2.02 minutes&lt;/strong&gt;. Now ask the strong-scaling question: had you used half as many GPUs, would it have taken exactly twice as long? Perfect scaling says yes. Suppose &lt;em&gt;(illustrative)&lt;/em&gt; the 4,096-GPU run had actually taken &lt;strong&gt;3.7 minutes&lt;/strong&gt;. Then doubling the GPUs cut the time from 3.7 to 2.02 — a &lt;strong&gt;1.83× speedup, not the ideal 2.0×&lt;/strong&gt;. Divide the two and you get a &lt;strong&gt;scaling efficiency of ~92%&lt;/strong&gt;; the missing ~8% is the time the GPUs spent at the catch, waiting on each other. &lt;strong&gt;The entire job at this scale is keeping that number pinned near 100% — which is exactly what a faster NVLink fabric and lighter low-precision oars are for.&lt;/strong&gt; &lt;em&gt;(The 2.02-min, 8,192-GPU, and 1.6× figures are from NVIDIA's MLPerf 6.0 report; the 4,096-GPU split is illustrative.)&lt;/em&gt;&lt;/p&gt;
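
&lt;p&gt;The same calculation as a few lines of Python, in case you want to plug in other run times. The function name and arguments are illustrative, and the 3.7-minute input remains the hypothetical figure used above.&lt;/p&gt;

```python
def scaling_efficiency(t_small, t_large, gpu_ratio):
    """Return (speedup, efficiency) when the GPU count grows by gpu_ratio."""
    speedup = t_small / t_large          # actual speedup
    return speedup, speedup / gpu_ratio  # efficiency = actual / ideal

# 3.7 min on 4,096 GPUs is an illustrative figure, not a reported result.
speedup, efficiency = scaling_efficiency(t_small=3.7, t_large=2.02, gpu_ratio=2)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
# prints: speedup 1.83x, efficiency 92%
```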


&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/vera-rubin-nvl72-nvlink-rack-domain" rel="noopener noreferrer"&gt;Vera Rubin NVL72 — the NVLink rack domain&lt;/a&gt; — the "72 GPUs as one coherent fabric" idea this article leans on, taken apart on its own.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/nvidia-ai-factories-tokens-per-mw" rel="noopener noreferrer"&gt;NVIDIA AI factories — tokens per megawatt&lt;/a&gt; — once a cluster scales well, the next question is energy: useful work per watt, not just per GPU.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/fused-int8-gemm-tensor-cores" rel="noopener noreferrer"&gt;Fused INT8 GEMM — INT8 beats FP8 on the tensor cores&lt;/a&gt; — the same "shrink the numbers so there's less to move" lever, on the inference side instead of training.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is strong scaling in distributed training?
&lt;/h3&gt;

&lt;p&gt;Strong scaling fixes the problem (one model, one quality target) and measures how much faster it trains as you add GPUs. Perfect strong scaling means N times the GPUs finish in 1/N the time. In practice the speedup falls short of that line, because on every training step the GPUs must stop and synchronize their partial results before the next step can start, and that coordination cost grows with the number of GPUs. Scaling efficiency is the ratio of the actual speedup to the ideal one; the shortfall from 100% is time spent waiting on coordination.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why doesn't doubling the GPUs halve the training time?
&lt;/h3&gt;

&lt;p&gt;Because training is synchronous. Each GPU works on a different slice of the batch, but at the end of every step they must average their gradients (an AllReduce) and exchange activations and weights across the tensor- and expert-parallel groups before moving on. That barrier is overhead that does not shrink as fast as the per-GPU work does, so adding GPUs gives less than a proportional speedup. The fix is to make the synchronization cheap — a fast, rack-scale NVLink fabric and lower-precision (FP8/NVFP4) numbers — so the speedup curve stays close to linear.&lt;/p&gt;
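
&lt;p&gt;The gradient-averaging barrier can be simulated in plain Python. This is a single-process sketch of what an AllReduce computes, not how a collective-communication library implements it: the function name is hypothetical, and real implementations spread the exchange across the interconnect rather than gathering everything in one place.&lt;/p&gt;

```python
def allreduce_mean(per_gpu_grads):
    """Average gradients element-wise and give every GPU the same result.

    per_gpu_grads: one list of floats per simulated GPU, all the same length.
    """
    n = len(per_gpu_grads)
    averaged = [sum(column) / n for column in zip(*per_gpu_grads)]
    # After the barrier, every replica holds an identical copy.
    return [list(averaged) for _ in range(n)]

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
synced = allreduce_mean(grads)
print(synced[0])  # [4.0, 5.0]
```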

&lt;h3&gt;
  
  
  What did NVIDIA Blackwell actually win in MLPerf Training 6.0?
&lt;/h3&gt;

&lt;p&gt;NVIDIA reported the fastest time on all seven MLPerf Training 6.0 benchmarks. The new GB300 NVL72 rack trained up to 1.6× faster than the prior GB200 NVL72, submissions scaled to 8,192 GPUs (CoreWeave trained DeepSeek-V3 671B to target in 2.02 minutes; Microsoft Azure hit the target on Llama 3.1 405B in 7.07 minutes at 8,192-GPU scale), and the round added new mixture-of-experts pretraining workloads. The headline is less about one chip than about how well thousands of them scale together.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/blackwell-mlperf-6-0-strong-scaling" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>FlashMemory Cuts DeepSeek-V4's KV Cache to 13.5%: Lookahead Sparse Attention</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Thu, 18 Jun 2026 11:18:45 +0000</pubDate>
      <link>https://dev.to/pueding/flashmemory-cuts-deepseek-v4s-kv-cache-to-135-lookahead-sparse-attention-5coe</link>
      <guid>https://dev.to/pueding/flashmemory-cuts-deepseek-v4s-kv-cache-to-135-lookahead-sparse-attention-5coe</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/CdIAWRAIHy4"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; The &lt;strong&gt;FlashMemory-DeepSeek-V4&lt;/strong&gt; paper introduces &lt;strong&gt;Lookahead Sparse Attention (LSA)&lt;/strong&gt; — decoding very long context without loading the whole KV cache, by training a small &lt;strong&gt;Neural Memory Indexer&lt;/strong&gt; to predict which chunks of the cached past a token will actually use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; At long context the binding cost is &lt;strong&gt;memory, not math&lt;/strong&gt;: the KV cache grows with every token until it dominates GPU serving memory, so LSA cuts the &lt;strong&gt;physical cache footprint to 13.5%&lt;/strong&gt; of the full version while nudging accuracy up 0.6%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; Where a plain &lt;strong&gt;full KV cache&lt;/strong&gt; holds every token's Key and Value for the entire context, LSA's indexer keeps only the predicted-relevant chunks — and, unusually, it is trained &lt;strong&gt;backbone-free&lt;/strong&gt;, so the trillion-scale model never has to sit in GPU memory while the indexer learns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;An assistant who sets out only the files you'll open next.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  THE NEXT WORD TO WRITE
                            │
            ┌───────────────┴───────────────┐
            │                               │
            ▼                               ▼
    ┌───────────────┐              ┌────────────────┐
    │  Dense recall │              │  The assistant │
    │ (full cache)  │              │  (LSA indexer) │
    └───────┬───────┘              └───────┬────────┘
            │                              │
   haul the WHOLE archive         set out only the few
   to the desk, every step        folders you'll open next
            │                              │
            ▼                              ▼
   ✗ ~40 GB on the desk           ✓ ~5.4 GB on the desk
     at 500K tokens                 (13.5% — a seventh)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;context token = a document filed away as the model reads&lt;/li&gt;
&lt;li&gt;KV cache = the whole archive of everything read so far&lt;/li&gt;
&lt;li&gt;Neural Memory Indexer = an assistant who predicts which folders you'll open&lt;/li&gt;
&lt;li&gt;Lookahead Sparse Attention = bringing only those few folders to the desk&lt;/li&gt;
&lt;li&gt;13.5% footprint = a desk that holds about a seventh of the archive&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;KV cache&lt;/strong&gt; — The stored &lt;strong&gt;Key&lt;/strong&gt; and &lt;strong&gt;Value&lt;/strong&gt; vectors for every token already processed, so the model never recomputes the past. It grows with context length — at 500K tokens it is enormous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lookahead Sparse Attention (LSA)&lt;/strong&gt; — FlashMemory's mechanism: rather than attending over the entire cache, it &lt;strong&gt;looks ahead&lt;/strong&gt; and keeps only the KV entries a token is predicted to need, so most of the cache is never loaded during decode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Neural Memory Indexer&lt;/strong&gt; — The lightweight model that does the predicting. It scores &lt;strong&gt;chunks&lt;/strong&gt; of the cached context and flags which ones matter for the current query — the part that decides what LSA keeps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backbone-free / decoupled training&lt;/strong&gt; — Training the indexer &lt;strong&gt;on its own&lt;/strong&gt;, without the giant base model loaded. Because the trillion-scale "backbone" never sits in GPU memory during indexer training, the method is far cheaper to train.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decode vs prefill&lt;/strong&gt; — &lt;strong&gt;Prefill&lt;/strong&gt; reads the whole prompt in one parallel pass; &lt;strong&gt;decode&lt;/strong&gt; emits one token at a time, each time reading the cache. LSA targets the decode step, where the cache is read over and over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Physical footprint&lt;/strong&gt; — The actual GPU memory the cache occupies — not just how much compute it costs. LSA's headline is a footprint cut (to &lt;strong&gt;13.5%&lt;/strong&gt;), which is a memory win, not only a speed win.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek-V4&lt;/strong&gt; — The long-context model (released April 2026) that FlashMemory is built on. LSA is the change aimed at making its ultra-long-context serving affordable.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On June 9, 2026, a Tencent team (Wang et al., arXiv 2606.09079) released FlashMemory-DeepSeek-V4, introducing Lookahead Sparse Attention (LSA). Instead of loading the entire KV cache during long-context decode, a Neural Memory Indexer predicts which context chunks matter and keeps only their KV entries. On DeepSeek-V4 the authors report cutting the physical KV-cache footprint to &lt;strong&gt;13.5%&lt;/strong&gt; of the full-context baseline, with a reduction of over 90% at 500K context, while improving accuracy by 0.6% on average. &lt;a href="https://arxiv.org/abs/2606.09079" rel="noopener noreferrer"&gt;Read the paper →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture the metaphor for a moment. The model has read an enormous document and filed every page into an archive — that archive is the KV cache. To write the next word, the naive approach hauls the &lt;em&gt;entire&lt;/em&gt; archive to the desk, every single time. LSA hires an assistant who has learned your habits: before each step they glance at what you're working on, &lt;strong&gt;predict which few folders you'll actually open&lt;/strong&gt;, and set out only those. The desk stays nearly empty, and in the paper's reported evaluations the quality holds. That "predict, then fetch only the few" move is the whole idea.&lt;/p&gt;

&lt;p&gt;Why does the archive — not the arithmetic — become the wall? Every token the model reads leaves a Key and a Value in the KV cache, and the cache is multiplied out across layers and heads, so it balloons with context length. At a few thousand tokens nobody notices; at 500K the cache becomes the dominant draw on GPU serving memory and caps how much the device can hold. LSA's Neural Memory Indexer attacks that directly: it scores chunks of the cached past, flags the ones the current token is predicted to need, and keeps only their KV entries — a selective gather of the query-critical chunks rather than a full sweep of the cache.&lt;/p&gt;
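
&lt;p&gt;To see how quickly a plain cache balloons, here is a back-of-the-envelope size calculation. It assumes a textbook uncompressed KV cache and a hypothetical model shape chosen for illustration; it is not DeepSeek-V4's actual configuration, which may store its cache differently.&lt;/p&gt;

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of a plain (uncompressed) KV cache for one sequence.

    The leading 2 counts one Key and one Value vector per token,
    per layer, per KV head.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Hypothetical model shape chosen for illustration only.
shape = dict(layers=40, kv_heads=4, head_dim=128)
for tokens in (4_000, 500_000):
    gb = kv_cache_bytes(tokens, **shape) / 1e9
    print(f"{tokens:7,d} tokens: {gb:.2f} GB")
```

&lt;p&gt;Under these assumed numbers the cache is about a third of a gigabyte at 4K tokens and about 41 GB at 500K: the growth is linear in context length, which is why it only becomes the wall at long context.&lt;/p&gt;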

&lt;p&gt;The genuinely new trick is &lt;em&gt;how the indexer is trained&lt;/em&gt;. A straightforward way to learn which chunks matter would be to run the full model end-to-end, which means holding the trillion-scale backbone in GPU memory the whole time. FlashMemory trains the indexer &lt;strong&gt;backbone-free and decoupled&lt;/strong&gt; — the base model never has to be loaded during indexer training — so the part that earns the savings is also cheap to build.&lt;/p&gt;

&lt;p&gt;This is a different lever from the block-sparse schemes you may have seen. &lt;a href="https://learnaivisually.com/ai-explained/minimax-m3-msa-block-sparse-attention" rel="noopener noreferrer"&gt;MiniMax's MSA&lt;/a&gt; gathers a few KV blocks &lt;em&gt;per query&lt;/em&gt; to cut attention compute, but it still keeps the whole cache resident. LSA aims one layer down: it avoids holding most of the cache at all, so the win is the &lt;strong&gt;physical memory footprint&lt;/strong&gt;. Same family — select instead of sweep — but one saves FLOPs and the other saves gigabytes.&lt;/p&gt;
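
&lt;p&gt;The shared "select instead of sweep" move can be sketched as top-k chunk selection. This is a minimal stand-in, not the paper's indexer: the dot-product scoring, the per-chunk summary vectors, and every name below are assumptions made for illustration.&lt;/p&gt;

```python
def score_chunks(query, chunk_summaries):
    """Score each chunk by the dot product of the query with its summary.

    A stand-in for a learned indexer: the real scoring function is not
    described here, so a plain dot product is assumed.
    """
    return [sum(q * s for q, s in zip(query, summary))
            for summary in chunk_summaries]

def select_chunks(query, chunk_summaries, keep):
    """Return the indices of the `keep` highest-scoring chunks, in cache order."""
    scores = score_chunks(query, chunk_summaries)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

summaries = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3], [0.1, 0.2]]
kept = select_chunks(query=[1.0, 0.0], chunk_summaries=summaries, keep=2)
print(kept)  # [0, 2]
```

&lt;p&gt;In this sketch a compute-saving scheme would still hold all four chunks in memory and read only chunks 0 and 2, whereas a footprint-saving scheme would hold only those two.&lt;/p&gt;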

&lt;h3&gt;
  
  
  Where the 13.5% comes from
&lt;/h3&gt;

&lt;p&gt;Hold the setup fixed and walk it. Say the full-context KV cache at 500K tokens weighs 40 GB &lt;em&gt;(illustrative)&lt;/em&gt;, already a heavy load for a single GPU. LSA's indexer keeps only the predicted-relevant chunks; at the paper's reported ratio it preserves 13.5% of that cache, so the resident footprint drops to about &lt;strong&gt;5.4 GB&lt;/strong&gt;, a ~34.6 GB saving, or 86.5% at this ratio (the authors also report a reduction of over 90% at 500K context). The surprising part is the accuracy line: selection usually &lt;em&gt;costs&lt;/em&gt; a little quality, but FlashMemory reports +0.6% on average, because the indexer drops mostly the chunks a token wasn't going to use anyway.&lt;/p&gt;
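
&lt;p&gt;The footprint arithmetic in a few lines, so other cache sizes or retained shares can be tried. The function name is hypothetical, the 40 GB input is the illustrative size used above, and 13.5% is the reported ratio; at that retained share the reduction works out to 86.5%.&lt;/p&gt;

```python
def resident_footprint(full_gb, kept_fraction):
    """Return (resident_gb, saved_gb, reduction) for a selective KV cache."""
    resident = full_gb * kept_fraction
    saved = full_gb - resident
    return resident, saved, saved / full_gb

# 40 GB is an illustrative cache size; 13.5% is the reported retained share.
resident, saved, reduction = resident_footprint(full_gb=40.0, kept_fraction=0.135)
print(f"{resident:.1f} GB resident, {saved:.1f} GB saved, {reduction:.1%} smaller")
```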

&lt;h3&gt;
  
  
  How the long-context attention options compare
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;What stays in the KV cache&lt;/th&gt;
&lt;th&gt;What it saves&lt;/th&gt;
&lt;th&gt;Note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full KV cache (dense)&lt;/td&gt;
&lt;td&gt;every token's Key and Value, always&lt;/td&gt;
&lt;td&gt;nothing — the baseline&lt;/td&gt;
&lt;td&gt;exact, but the footprint balloons with length&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sliding-window&lt;/td&gt;
&lt;td&gt;only a fixed recent window&lt;/td&gt;
&lt;td&gt;memory, by forgetting far context&lt;/td&gt;
&lt;td&gt;cheap; loses long-range recall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Block-sparse gather (DSA / MoBA / MSA)&lt;/td&gt;
&lt;td&gt;the whole cache, but reads a few blocks per query&lt;/td&gt;
&lt;td&gt;mostly attention &lt;em&gt;compute&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;cache stays resident; saves FLOPs, not footprint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LSA (FlashMemory)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;only indexer-predicted chunks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;physical footprint → ~13.5%&lt;/strong&gt; &lt;em&gt;(Wang et al.; setup-dependent)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;+0.6% accuracy; indexer trained backbone-free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One caveat worth keeping: the headline numbers — 13.5% footprint, over-90% reduction at 500K, +0.6% accuracy — are the authors' own results on DeepSeek-V4 at a single operating point, and selective-cache methods are setup-dependent: chunk size, how aggressively the indexer prunes, the sequence length, and the task all move them. The durable lesson is the lever (predict which chunks matter, hold only those); the exact percentage is a reported headline, not a guarantee at every length.&lt;/p&gt;


&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/minimax-m3-msa-block-sparse-attention" rel="noopener noreferrer"&gt;MiniMax Sparse Attention (MSA)&lt;/a&gt; — the &lt;strong&gt;compute&lt;/strong&gt; sibling: block-sparse gather that saves FLOPs while the cache stays resident&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/sp-kv-self-pruned-kv-cache" rel="noopener noreferrer"&gt;SP-KV — a utility predictor for the KV cache&lt;/a&gt; — another &lt;strong&gt;learned selector&lt;/strong&gt;, pruning the cache by predicted usefulness&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/tangram-per-head-kv-budgets" rel="noopener noreferrer"&gt;Tangram — per-head KV cache budgets&lt;/a&gt; — a different way to &lt;strong&gt;shrink long-context cache&lt;/strong&gt;, by sizing each head's budget instead of indexing chunks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Lookahead Sparse Attention (LSA)?
&lt;/h3&gt;

&lt;p&gt;LSA is the mechanism inside FlashMemory-DeepSeek-V4 that decodes long context without loading the entire KV cache. A lightweight Neural Memory Indexer predicts which chunks of the cached past the current token will actually use and keeps only those KV entries, rather than holding the whole cache resident. On DeepSeek-V4 it cuts the physical KV-cache footprint to about 13.5% of the full-context baseline while improving average accuracy by 0.6%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does LSA matter?
&lt;/h3&gt;

&lt;p&gt;At very long context the KV cache — not the attention math — is the binding cost: it grows with every token across all layers and heads, and at 500K tokens it dominates GPU memory. By keeping only the chunks a token is predicted to need, LSA reportedly shrinks that footprint by over 90% at 500K context, which is what makes ultra-long-context serving on a model like DeepSeek-V4 affordable. Its indexer is also trained backbone-free, so the trick that saves memory is itself cheap to build.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does LSA differ from block-sparse attention like MSA?
&lt;/h3&gt;

&lt;p&gt;Both belong to the "select instead of sweep" family, but they save different resources. Block-sparse schemes (DSA, MoBA, MiniMax's MSA) keep the whole KV cache resident and gather only a few blocks per query, which cuts attention compute. LSA goes one layer down: it avoids holding most of the cache at all, so its win is the physical memory footprint. One saves FLOPs; the other saves gigabytes.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/flashmemory-lookahead-sparse-attention" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Microsoft FastContext: a Repo-Explorer Subagent Cuts Coding-Agent Tokens 60%: Explorer-Subagent Context Offloading</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Wed, 17 Jun 2026 11:23:06 +0000</pubDate>
      <link>https://dev.to/pueding/microsoft-fastcontext-a-repo-explorer-subagent-cuts-coding-agent-tokens-60-explorer-subagent-2lpk</link>
      <guid>https://dev.to/pueding/microsoft-fastcontext-a-repo-explorer-subagent-cuts-coding-agent-tokens-60-explorer-subagent-2lpk</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/uJi6C5J3lxI"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; The &lt;strong&gt;FastContext&lt;/strong&gt; paper (Microsoft) trains a dedicated &lt;strong&gt;explorer subagent&lt;/strong&gt; — a 4B-30B model the main coding agent calls to find code — that issues read-only searches and returns compact file-line citations instead of dumping files into the main context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Reading and searching a repository is the biggest single drain on a coding agent: in GPT-5.4 traces it ate &lt;strong&gt;56.2% of tool-use turns and 46.5% of the main agent's tokens&lt;/strong&gt;, so moving that work off the main agent is where the token budget is won.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; A normal coding agent &lt;strong&gt;greps and reads files itself&lt;/strong&gt;, so every raw file lands in its own context window and crowds out the actual coding. FastContext &lt;strong&gt;offloads&lt;/strong&gt; exploration to a separate subagent that returns only &lt;strong&gt;citations&lt;/strong&gt; — the evidence, not the haystack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A reference librarian you send into the stacks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                   ONE CODE QUESTION
                          │
            ┌─────────────┴─────────────┐
            │                           │
   ┌────────▼────────┐         ┌────────▼────────┐
   │   READ IT       │         │   SEND A        │
   │   YOURSELF      │         │   LIBRARIAN     │
   │   (baseline)    │         │  (FastContext)  │
   └────────┬────────┘         └────────┬────────┘
            │                           │
   haul every file into        explorer greps the
   your own context            stacks, hands back
                               an index card
            │                           │
            ▼                           ▼
   ✗ ~18,000 tokens            ✓ ~480 tokens of
     bury the desk               citations — desk
     before you code             stays clear
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;main agent = you at a small desk, no room to pile up whole books&lt;/li&gt;
&lt;li&gt;explorer subagent = the librarian you send into the stacks to look&lt;/li&gt;
&lt;li&gt;Read / Glob / Grep = the librarian skimming many shelves in parallel&lt;/li&gt;
&lt;li&gt;file-line citation = an index card: shelf 3, page 88 — not the whole book&lt;/li&gt;
&lt;li&gt;context window = the desk; pile whole books on it and it overflows&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Explorer subagent&lt;/strong&gt; — A separate model the main agent delegates a sub-task to. Here its one job is exploration: take a natural-language query, search the repo, and hand back what it found — it never writes code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context offloading&lt;/strong&gt; — Keeping the bulky, raw evidence &lt;strong&gt;out&lt;/strong&gt; of the main agent's context window and bringing back only a compact result. The reading still happens — just not in the context that has to do the reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read / Glob / Grep&lt;/strong&gt; — The three &lt;strong&gt;read-only&lt;/strong&gt; tools an explorer uses: &lt;strong&gt;Read&lt;/strong&gt; opens a file, &lt;strong&gt;Glob&lt;/strong&gt; matches file &lt;em&gt;names&lt;/em&gt; by pattern, &lt;strong&gt;Grep&lt;/strong&gt; searches file &lt;em&gt;contents&lt;/em&gt;. None of them change anything, so running many at once is safe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File-line citation&lt;/strong&gt; — A pointer of the form &lt;code&gt;path/to/file.ts:88-104&lt;/code&gt; — the exact place the answer lives. Returning the citation instead of the whole file is what keeps the result compact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SFT (supervised fine-tuning)&lt;/strong&gt; — Training a model on example &lt;em&gt;(query → good exploration)&lt;/em&gt; pairs so it imitates them. It's the first of FastContext's two training stages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task-grounded RL&lt;/strong&gt; — Reinforcement learning where the reward isn't "did the search look reasonable" but &lt;strong&gt;did the exploration actually help solve the downstream task&lt;/strong&gt;. It tunes the explorer toward evidence that the main agent can act on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mini-SWE-Agent&lt;/strong&gt; — A small open-source coding-agent harness. FastContext was plugged into it to measure the end-to-end effect on real software-engineering tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token budget&lt;/strong&gt; — The total tokens an agent spends on a task — what you pay for in cost &lt;em&gt;and&lt;/em&gt; latency. Exploration dominates it, which is why offloading it moves the number so much.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On &lt;strong&gt;June 15, 2026&lt;/strong&gt;, Microsoft released &lt;strong&gt;FastContext&lt;/strong&gt;, a system that attacks the most expensive thing a coding agent does: finding the right code. Analyzing GPT-5.4 trajectories, the authors found reading and searching accounted for &lt;strong&gt;56.2% of tool-use turns&lt;/strong&gt; and &lt;strong&gt;46.5% of the main agent's total tokens&lt;/strong&gt;. FastContext trains dedicated &lt;strong&gt;4B-30B exploration models&lt;/strong&gt; that the main agent queries in natural language; the explorer fires read-only &lt;code&gt;Read&lt;/code&gt;/&lt;code&gt;Glob&lt;/code&gt;/&lt;code&gt;Grep&lt;/code&gt; calls in parallel and returns focused file-line citations. Plugged into Mini-SWE-Agent, it reports &lt;strong&gt;up to +5.5% resolution rate&lt;/strong&gt; and &lt;strong&gt;up to 60% fewer tokens&lt;/strong&gt;. Weights are open on Hugging Face. &lt;a href="https://arxiv.org/abs/2606.14066" rel="noopener noreferrer"&gt;Read the paper →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture yourself at a small desk in a vast library, trying to answer one question. The naive way is to walk the stacks yourself, haul every promising book back, and stack them on the desk — and within a dozen volumes the desk is buried, the early books slide onto the floor, and you can't even see the question anymore. &lt;strong&gt;The desk is the bottleneck, and you filled it with raw material you mostly didn't need.&lt;/strong&gt; A coding agent does exactly this when it greps and reads files itself: every file it opens lands in its own context window, and long before it starts writing the fix, the window is full of source it skimmed once and will never look at again.&lt;/p&gt;

&lt;p&gt;That's not a small inefficiency; it's &lt;em&gt;the&lt;/em&gt; inefficiency. When FastContext's authors traced real GPT-5.4 coding runs, &lt;strong&gt;reading and searching the repository accounted for 56.2% of tool-use turns and 46.5% of the main agent's tokens&lt;/strong&gt;. Roughly half the main agent's budget goes to &lt;em&gt;finding&lt;/em&gt; code, not changing it. And exploration is an especially context-polluting kind of work: it pulls in big, low-signal blobs of text whose only useful output is usually a single line number.&lt;/p&gt;

&lt;p&gt;So FastContext stops doing the exploring on the main desk. &lt;strong&gt;It sends a librarian into the stacks.&lt;/strong&gt; The main agent delegates a natural-language query — "where is the retry budget enforced?" — to a separate &lt;strong&gt;explorer subagent&lt;/strong&gt;, a 4B-30B model trained for exactly this. The explorer reads, globs, and greps its way through the repo in parallel read-only calls, then hands back not an armful of files but an &lt;strong&gt;index card&lt;/strong&gt;: &lt;code&gt;scheduler/retry.go:88-104&lt;/code&gt;, the exact evidence. The main agent's desk stays clear, holding citations instead of haystacks — the reading happened, but &lt;strong&gt;the bulk never touched the context that has to reason.&lt;/strong&gt; Because the explorer only ever uses read-only tools, running a swarm of those searches at once is safe by construction.&lt;/p&gt;
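
&lt;p&gt;The shape of that hand-off can be sketched with the standard library alone. This is a plain regex search standing in for the trained explorer model, so it illustrates only the plumbing: read-only searches fanned out in parallel, with citations coming back instead of file contents. The function names are hypothetical, and it returns single-line citations where the paper's examples use line ranges.&lt;/p&gt;

```python
import re
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def grep_file(path, pattern):
    """Read-only search: return 'path:line' citations for matching lines."""
    regex = re.compile(pattern)
    lines = Path(path).read_text(encoding="utf-8", errors="ignore").splitlines()
    return [f"{path}:{number}" for number, line in enumerate(lines, start=1)
            if regex.search(line)]

def explore(root, pattern, glob="**/*.py", workers=8):
    """Search every matching file in parallel and return only citations."""
    paths = sorted(str(p) for p in Path(root).glob(glob) if p.is_file())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_file = list(pool.map(lambda p: grep_file(p, pattern), paths))
    return [citation for hits in per_file for citation in hits]

# Tiny throwaway repo so the sketch runs end to end.
with tempfile.TemporaryDirectory() as root:
    Path(root, "retry.py").write_text("MAX = 3\nretry_budget = MAX\n")
    Path(root, "other.py").write_text("print('hello')\n")
    citations = explore(root, r"retry_budget")

print([Path(c).name for c in citations])  # ['retry.py:2']
```

&lt;p&gt;Nothing in &lt;code&gt;grep_file&lt;/code&gt; writes to disk, which is why fanning it out across a thread pool needs no locking; the caller sees one short string per hit, whatever the size of the files behind it.&lt;/p&gt;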

&lt;p&gt;The explorer earns its accuracy in two training stages. First &lt;strong&gt;supervised fine-tuning&lt;/strong&gt; teaches it to imitate good exploration traces; then &lt;strong&gt;task-grounded RL&lt;/strong&gt; rewards it not for searches that merely &lt;em&gt;look&lt;/em&gt; thorough but for evidence that actually lets the main agent solve the downstream task. A scout that brings back the wrong shelf is worse than useless, so the reward is tied to the &lt;em&gt;outcome&lt;/em&gt;, not the search.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Who reads the repo&lt;/th&gt;
&lt;th&gt;What lands in the main context&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Main agent itself (baseline)&lt;/td&gt;
&lt;td&gt;every file it opens — raw source&lt;/td&gt;
&lt;td&gt;~46.5% of tokens spent exploring &lt;a href="https://arxiv.org/abs/2606.14066" rel="noopener noreferrer"&gt;(paper)&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A prompted, untrained sub-call&lt;/td&gt;
&lt;td&gt;often the whole transcript dumped back&lt;/td&gt;
&lt;td&gt;re-floods context; little net saving &lt;em&gt;(illustrative)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FastContext explorer subagent&lt;/td&gt;
&lt;td&gt;compact file-line citations only&lt;/td&gt;
&lt;td&gt;up to &lt;strong&gt;60% fewer tokens&lt;/strong&gt;, +5.5% resolution &lt;a href="https://arxiv.org/abs/2606.14066" rel="noopener noreferrer"&gt;(paper)&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Where does a 60% cut actually come from? Walk one task &lt;em&gt;(token counts here are illustrative — the paper reports the percentages, not these absolute numbers)&lt;/em&gt;. Say solving a bug needs evidence from &lt;strong&gt;12 files&lt;/strong&gt; averaging &lt;strong&gt;1,500 tokens&lt;/strong&gt; each. A baseline agent that reads them all carries &lt;strong&gt;18,000 tokens&lt;/strong&gt; of raw source in its working context — and that's before it writes a line. FastContext's explorer reads the same 12 files in its &lt;em&gt;own&lt;/em&gt; scratch context, then returns &lt;strong&gt;12 citations at ~40 tokens each = ~480 tokens&lt;/strong&gt;. The main agent now reasons over &lt;strong&gt;~480 tokens&lt;/strong&gt; instead of &lt;strong&gt;18,000&lt;/strong&gt; — a &lt;strong&gt;~37× lighter&lt;/strong&gt; exploration footprint on the desk that matters. Multiply that across a long task where exploration was already &lt;strong&gt;46.5% of the budget&lt;/strong&gt;, and a headline &lt;strong&gt;60% token reduction&lt;/strong&gt; stops looking surprising — it's just the haystack never landing on the desk.&lt;/p&gt;
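&lt;p&gt;The walk-through above is plain arithmetic, so it can be checked in a few lines. The inputs are the same illustrative figures (12 files, 1,500 tokens per file, roughly 40 tokens per citation), not measurements from the paper, and the function name &lt;code&gt;exploration_footprint&lt;/code&gt; is made up for this sketch.&lt;/p&gt;

```python
def exploration_footprint(files, tokens_per_file, tokens_per_citation):
    """Return (baseline, offloaded, ratio) for the main agent's context.

    baseline:  raw source the main agent carries if it reads every file itself
    offloaded: citation tokens it carries if an explorer does the reading
    """
    baseline = files * tokens_per_file
    offloaded = files * tokens_per_citation
    return baseline, offloaded, baseline / offloaded


# Illustrative inputs from the walk-through, not numbers reported by the paper.
baseline, offloaded, ratio = exploration_footprint(
    files=12, tokens_per_file=1_500, tokens_per_citation=40
)
print(baseline, offloaded, ratio)  # 18000 480 37.5
```

&lt;p&gt;Changing the inputs shows how sensitive the ratio is to citation size: it depends only on tokens per file divided by tokens per citation, not on how many files are read.&lt;/p&gt;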

&lt;p&gt;&lt;em&gt;Goes deeper in: AI Agents → Context Engineering → Subagents for context isolation&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/searchswarm-distilled-delegation" rel="noopener noreferrer"&gt;SearchSwarm — distilling delegation into the weights&lt;/a&gt; — a &lt;em&gt;different&lt;/em&gt; lever on the same problem: bake decomposition-and-delegation into one model's weights, rather than splitting off a separate explorer at runtime&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/grep-vs-vector-agentic-retrieval" rel="noopener noreferrer"&gt;Grep vs vector retrieval for agentic search&lt;/a&gt; — what the explorer is actually doing under the hood when it greps the repo instead of embedding it&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/grepseek-grpo-shell-command-search" rel="noopener noreferrer"&gt;GrepSeek — GRPO-trained shell-command search&lt;/a&gt; — another search agent trained with RL to use shell tools well&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is explorer-subagent context offloading?
&lt;/h3&gt;

&lt;p&gt;It's a pattern where a coding agent doesn't search the codebase itself but delegates the search to a separate "explorer" model. The explorer reads and greps files in its own context, then returns only compact pointers — file paths and line ranges — to the main agent. The bulky raw source never enters the main agent's context window, which is what frees up its budget for the actual coding. FastContext trains that explorer (SFT plus task-grounded RL) at 4B-30B scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does it cut tokens so much?
&lt;/h3&gt;

&lt;p&gt;Because finding code is the dominant cost. In FastContext's analysis of GPT-5.4 traces, reading and searching was 56.2% of tool-use turns and 46.5% of the main agent's tokens. Most of that text is low-signal — its only useful output is a line number. Offloading the reading to a subagent that returns citations instead of files removes the haystack from the main context, which is where the up-to-60% token reduction comes from.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is this different from SearchSwarm's distilled delegation?
&lt;/h3&gt;

&lt;p&gt;Both reduce context pressure through delegation, but at different layers. SearchSwarm bakes task-decomposition-and-delegation into one model's weights via supervised fine-tuning, so a single model delegates by reflex. FastContext keeps two separate agents at inference time: a general main agent plus a specialized read-only explorer it calls for context. One trains the behavior into a model; the other architects it into the system.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/fastcontext-explorer-subagent-offloading" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>Google Releases Gemma 4 12B: Encoder-Free Multimodal Projection</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Tue, 16 Jun 2026 11:17:41 +0000</pubDate>
      <link>https://dev.to/pueding/google-releases-gemma-4-12b-encoder-free-multimodal-projection-1p4i</link>
      <guid>https://dev.to/pueding/google-releases-gemma-4-12b-encoder-free-multimodal-projection-1p4i</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/h6fSqCWYVnY"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; Google released &lt;strong&gt;Gemma 4 12B&lt;/strong&gt;, an open multimodal model whose headline trick is &lt;strong&gt;encoder-free multimodal projection&lt;/strong&gt; — it turns images and audio into tokens by projecting them straight into the token space, instead of running them through a dedicated encoder network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; The separate vision and audio encoders most multimodal models carry are extra parameters, compute, and latency that run &lt;strong&gt;before&lt;/strong&gt; the language model sees anything; dropping them is a big reason a 12B model can field pictures &lt;em&gt;and&lt;/em&gt; sound inside &lt;strong&gt;16 GB&lt;/strong&gt; of memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; Versus the standard recipe — a &lt;strong&gt;frozen vision transformer (ViT)&lt;/strong&gt; plus a projector bolted onto a text model — Gemma 4 12B has no separate encoder at all: each image patch becomes a token through &lt;strong&gt;one matrix multiply&lt;/strong&gt; directly into the backbone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A meeting where guests either go through a translator or speak the language.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              IMAGE / AUDIO ARRIVES
                       │
        ┌──────────────┴──────────────┐
        │                             │
 ┌──────▼───────┐             ┌───────▼──────┐
 │  THE OLD WAY │             │ ENCODER-FREE │
 │via translator│             │ speak direct │
 └──────┬───────┘             └───────┬──────┘
        │                             │
  a whole vision/audio          one matrix-multiply
  encoder runs first            projects to a token
        │                             │
        ▼                             ▼
 ✗ extra params + latency      ✓ same token space,
   before the LLM looks          ~16 GB, lower latency
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;text token = a guest who already speaks the room's language&lt;/li&gt;
&lt;li&gt;vision/audio encoder = a separate translator the old way routes pictures and sound through&lt;/li&gt;
&lt;li&gt;encoder-free projection = one matrix-multiply that puts vision and audio into the room's language directly&lt;/li&gt;
&lt;li&gt;shared token space = the single language every guest speaks once inside&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Encoder-free (VLM)&lt;/strong&gt; — A multimodal model with &lt;strong&gt;no separate encoder&lt;/strong&gt; for non-text inputs — rather than run an image through a vision network first, it projects the raw input straight into the model's token space. The lineage runs through research models like Fuyu and EVE.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vision encoder / ViT&lt;/strong&gt; — A &lt;strong&gt;Vision Transformer&lt;/strong&gt; — a stack of attention-and-MLP layers that turns an image into feature vectors. In the usual recipe it sits in front of the language model as a second network; encoder-free designs delete it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Patch&lt;/strong&gt; — An image is cut into a grid of small squares (e.g. 16×16 pixels). Each &lt;strong&gt;patch&lt;/strong&gt; is flattened into a list of raw numbers and treated as one unit of input — the visual equivalent of a text token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Projection&lt;/strong&gt; — A &lt;strong&gt;single matrix multiply&lt;/strong&gt; that maps a vector of one size onto a vector of another. Here it maps a flattened image patch onto a vector the same width as a word's embedding — so the result &lt;em&gt;is&lt;/em&gt; a token; audio is folded into that same space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token / embedding space&lt;/strong&gt; — A transformer doesn't read words or pixels; it reads &lt;strong&gt;dense vectors&lt;/strong&gt;. The "embedding space" is the shared vector format every input must arrive in — putting images and audio there is what lets one backbone read all three.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native audio&lt;/strong&gt; — Audio handled &lt;strong&gt;inside&lt;/strong&gt; the model as tokens, rather than transcribed to text by a separate speech model first. Gemma 4 12B is the first mid-sized Gemma to take audio in natively.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On June 3, 2026, Google released Gemma 4 12B, an Apache-2.0 model that drops the separate vision and audio encoders most multimodal models bolt on. Instead it projects both kinds of input &lt;em&gt;straight&lt;/em&gt; into the language backbone: vision through a lightweight module — reportedly a single matrix multiply plus positional and normalization terms — and audio into the same dimensional space as text tokens. It is the first mid-sized Gemma to take native audio input, runs on 16 GB of VRAM or unified memory, and reportedly scores near Google's larger 26B mixture-of-experts model. &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/" rel="noopener noreferrer"&gt;Read the announcement →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture the meeting. A text prompt is a guest who already speaks the room's language — it walks in and starts talking. A picture and a sound clip don't: the usual fix hires a separate translator for each, a whole second staffer who listens, re-voices everything, and only then lets the guest join. Those translators are the model's &lt;strong&gt;vision and audio encoders&lt;/strong&gt; — extra networks that run before the language model sees a thing. Gemma 4 12B fires the translators. It teaches pictures and sound to speak the room's language directly, in one quick step, so every guest — text, image, audio — sits at the same table as an ordinary &lt;strong&gt;token&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Underneath the metaphor, "speaking the room's language" means landing in the model's embedding space: the dense vectors a transformer actually consumes. A token ID becomes a vector by a lookup; an image patch becomes one by a projection. As a toy example, cut a 256×256 RGB image into 16×16-pixel patches and you get a 16×16 grid of 256 patches, each a flat list of 16·16·3 = 768 raw numbers. The old way pushes patches like these through a vision transformer (tens of attention-and-MLP layers) before the LLM gets a single feature. Gemma's encoder-free path, by Google's description, instead applies a &lt;strong&gt;single matrix multiply&lt;/strong&gt; (plus a positional term and normalization) that turns each patch straight into a token, the same shape as a word's embedding. Audio is projected into that same space too. The whole pre-LLM encoder stack collapses to that one projection, and the backbone itself takes over the visual and acoustic processing.&lt;/p&gt;
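&lt;p&gt;The toy example can be run end to end. The sketch below is not Gemma's code: the embedding width &lt;code&gt;D_MODEL = 8&lt;/code&gt; is an arbitrary small value chosen so the demo runs quickly in pure Python, the weights are random rather than learned, and the positional and normalization terms are left out. The helper names &lt;code&gt;patchify&lt;/code&gt; and &lt;code&gt;project&lt;/code&gt; are hypothetical, and the image size is assumed to be a multiple of the patch size. It only confirms the shapes: 256 patches of 768 numbers each go in, 256 token-shaped vectors come out.&lt;/p&gt;

```python
import random


def patchify(image, patch):
    """Split an H x W x 3 image (nested lists) into flat patches, row-major.

    Assumes H and W are multiples of `patch`.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in image[top:top + patch]:
                for pixel in row[left:left + patch]:
                    flat.extend(pixel)  # append the three channel values
            patches.append(flat)
    return patches


def project(patch_vec, weight):
    """One matrix multiply. `weight` is stored as d_model columns of length patch_dim."""
    return [sum(x * w for x, w in zip(patch_vec, column)) for column in weight]


rng = random.Random(0)
SIDE, PATCH, D_MODEL = 256, 16, 8  # D_MODEL is a toy value, far smaller than a real model's

# Random pixels and random weights stand in for a real image and learned parameters.
image = [[[rng.random() for _ in range(3)] for _ in range(SIDE)] for _ in range(SIDE)]
patches = patchify(image, PATCH)
weight = [[rng.gauss(0.0, 0.02) for _ in range(PATCH * PATCH * 3)] for _ in range(D_MODEL)]
tokens = [project(p, weight) for p in patches]

print(len(patches), len(patches[0]), len(tokens), len(tokens[0]))  # 256 768 256 8
```

&lt;p&gt;Everything that would otherwise live in a vision tower is reduced here to the single &lt;code&gt;weight&lt;/code&gt; matrix, which is the point of the encoder-free design as described.&lt;/p&gt;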

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;How an image enters&lt;/th&gt;
&lt;th&gt;Separate encoder?&lt;/th&gt;
&lt;th&gt;Cost profile&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Encoder-based (ViT + projector)&lt;/td&gt;
&lt;td&gt;image → vision transformer (tens of layers) → projector → tokens&lt;/td&gt;
&lt;td&gt;yes — a full vision network runs first&lt;/td&gt;
&lt;td&gt;more parameters and latency before the first output token&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encoder-free (Gemma 4 12B)&lt;/td&gt;
&lt;td&gt;patches → one matrix multiply (+ position/norm) → tokens&lt;/td&gt;
&lt;td&gt;no separate encoder&lt;/td&gt;
&lt;td&gt;~16 GB, lower pre-decode latency &lt;em&gt;(Google, reported)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Removing the encoder stack has consequences, but the wins are concrete. A separate vision tower is parameters you store, compute you run, and latency you pay &lt;em&gt;before&lt;/em&gt; the first output token; deleting it is a big reason a 12B model can field images and audio inside &lt;strong&gt;16 GB&lt;/strong&gt; rather than needing a datacenter card, and part of why Google can claim quality near its &lt;strong&gt;26B&lt;/strong&gt; mixture-of-experts model despite the smaller, simpler stack. The catch is that the backbone now has to learn visual and acoustic structure itself, with no pretrained encoder doing that work for it — which is plausibly why this ships as a &lt;em&gt;12B model trained for it from the start&lt;/em&gt; rather than a vision adapter glued onto an existing text model. The architectural specifics beyond the single-matmul description are not yet fully documented.&lt;/p&gt;

&lt;p&gt;The payoff is a cleaner idea of what "multimodal" even requires. You don't strictly need a bespoke eye and ear bolted onto a language model; if every input can be projected into the same token space, one backbone can read all of them. Gemma 4 12B is a bet that for a small, open model meant to run on modest hardware, fewer moving parts beats a heavier, more specialized stack.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Goes deeper in: LLM Internals → Embeddings → From Token IDs to Vectors&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/glm-5v-native-multimodal" rel="noopener noreferrer"&gt;GLM-5V — native multimodal vs vision-bolted&lt;/a&gt; — the neighboring question: training a model multimodal from the start versus adapting a text model, a different axis than removing the encoder&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/gemini-omni-shared-token-space" rel="noopener noreferrer"&gt;Gemini Omni — modality unification in a shared token space&lt;/a&gt; — the same "one token space for every modality" idea, taken to full any-to-any generation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/gemma-4-qat" rel="noopener noreferrer"&gt;Gemma 4 QAT — quantization-aware training&lt;/a&gt; — the other route to running a real model on modest hardware: shrink the bits, instead of removing the encoder&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is encoder-free multimodal projection?
&lt;/h3&gt;

&lt;p&gt;It is a way to make a language model multimodal without a separate vision or audio encoder. Instead of running an image through a dedicated network first, the model cuts it into patches and turns each patch into a token with a single matrix multiply, projecting it directly into the same embedding space as text tokens. Audio is projected into that same space. One backbone then reads text, image, and audio tokens as one stream.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does removing the vision encoder matter?
&lt;/h3&gt;

&lt;p&gt;A separate vision encoder is extra parameters to store, extra compute to run, and extra latency before the language model produces its first token. Dropping it is a big part of why Gemma 4 12B can handle images and native audio inside about 16 GB of memory and still report quality near Google's larger 26B mixture-of-experts model. The trade-off is that the backbone has to learn visual and acoustic structure itself, which is why the design ships as a model trained for it rather than a bolt-on.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does it relate to native multimodal models like GLM-5V?
&lt;/h3&gt;

&lt;p&gt;They answer different questions. "Native vs vision-bolted" is about training: was the model multimodal from the start, or was a vision module added to a finished text model? "Encoder-free" is about architecture: is there a separate encoder network at all, or does the input get projected straight into the token space? A model can be natively trained and still use a vision encoder; Gemma 4 12B is unusual in being both natively multimodal and encoder-free.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/gemma-4-12b-encoder-free-multimodal" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>google</category>
    </item>
  </channel>
</rss>
