DEV Community: pueding

Dockerless Verifies Coding-Agent Patches Without Containers: Execution-Free Patch Verification

pueding — Fri, 03 Jul 2026 11:15:14 +0000

What: Dockerless (arXiv 2606.28436) is a method for deciding whether a coding agent's patch is correct without running the repository's tests. An "execution-free" judge explores the repo and reasons about the change instead.

Why: To train or grade a coding agent you must know if each generated patch actually fixes the issue. The standard check runs unit tests inside a per-repository Docker container, and building that container is the slow, expensive part — repeated for every repo.

vs prior: The usual approach is execution-based verification: build a Docker image per repo and run its tests. Dockerless drops the container and judges by agentic evidence-gathering, so the same signal can drive both SFT and RL with no environment at all.

Think of it as

A senior reviewer approving a pull request by reading it, not by running CI.

               A CODING AGENT'S PATCH
                        │
          ┌─────────────┴─────────────┐
          │                           │
  ┌───────▼───────┐          ┌────────▼────────┐
  │ EXECUTION-    │          │  DOCKERLESS     │
  │ BASED (Docker)│          │ (read & reason) │
  └───────┬───────┘          └────────┬────────┘
          │                           │
   build a container,          explore the repo,
   run the test suite          reason about the diff
          │                           │
          ▼                           ▼
    ✓ correct verdict          ✓ correct verdict
      but pays the               no container —
      environment tax            62.0% on Verified

code patch = the pull request waiting for a correct / not-correct verdict
running unit tests in Docker = spinning up CI to build the repo and run the whole test suite
execution-free judge = a senior reviewer who reads the diff and reasons instead of running it
agentic repository exploration = the reviewer opening nearby files to gather evidence
one signal for SFT and RL = the same read-and-judge verdict both picks training data and scores the reward

Quick glossary

SWE-bench (Verified / Multilingual / Pro) — A family of benchmarks built from real GitHub issues, where an agent must produce a patch that resolves each issue. Verified is a human-checked subset; Multilingual and Pro are related benchmarks built in the same style. The score is the resolve rate: the fraction of issues that the agent's patches actually fix.

Execution-based verification — Checking a patch by actually running the repository's unit tests, usually inside a Docker image built for that repo. It is accurate, but the per-repo environment setup is the costly part.

Execution-free / environment-free — Judging correctness without building or running the repo — by reading and reasoning about the code. Dockerless's whole idea: no container, no test run.

Agentic repository exploration — Letting the judge act as an agent that opens files, greps, and reads surrounding code to gather evidence about whether a change is correct.

SFT (supervised fine-tuning) — Training a model on curated example trajectories. Here, the execution-free verdict selects which trajectories are good enough to learn from.

RL reward — The signal reinforcement learning optimizes. Dockerless uses the execution-free correct/not verdict as the reward, instead of a pass/fail from actually running tests.

Trajectory — The sequence of actions and edits an agent took to produce a patch. Trajectory selection keeps only the ones that led to a correct patch.

The news. On July 1, 2026, researchers released Dockerless (arXiv 2606.28436), a way to verify coding-agent patches without per-repository containers. Its starting point: standard execution-based verification "requires running unit tests inside per-repository environments such as Docker images, incurring substantial environment setup costs." Dockerless replaces that with an environment-free judge that explores the repository agentically and gathers evidence to decide correctness. It reaches a 62.0% resolve rate on SWE-bench Verified (50.0% on Multilingual, 35.2% on Pro), surpassing the Qwen3.5-9B baseline by 2.4 / 8.7 / 2.9 points and matching environment-based post-training.

Picture a pull request landing in your review queue. The by-the-book way to approve it is to spin up CI: let it build the whole project, then run every unit test. Thorough — but each repository needs its own build environment, and standing that environment up is the slow, costly part. A senior reviewer often does something else entirely. They read the diff, open the files around it, trace how the change ripples through the code, and sign off by reasoning — no CI run at all.

That second move is exactly what Dockerless does for coding agents. To train or grade an agent, you have to know whether the patch it generated really fixes the bug. The standard check decides pass or fail by running the repo's unit tests inside a per-repository Docker image — and building that image, once per repo, is where the cost lives. Dockerless drops the container entirely: an execution-free judge explores the repository agentically, gathers evidence about the change, and decides correct-or-not by reasoning, not by running tests.

Here is the part that makes it more than a cheaper grader, and it is the whole idea. The same execution-free verdict does double duty in training. In supervised fine-tuning it selects which trajectories are worth learning from — keep the ones judged correct, drop the rest. In reinforcement learning it is the reward. Because neither step ever runs code, the entire post-training pipeline becomes environment-free: no per-repo containers anywhere, not for evaluation and not for reward. What was an offline eval you had to build infrastructure for collapses into a model reading a repo.
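A minimal sketch of that double duty, assuming a judge that returns a plain correct / not-correct boolean. The `Trajectory` type, the function names, and the toy judge are hypothetical stand-ins for illustration; the paper's judge is an agent that explores the repository, and the 1.0 / 0.0 reward mapping is an assumption rather than a detail from the source.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    issue_id: str
    patch: str  # the diff the agent produced


# Hypothetical judge interface: True means "patch judged correct".
Judge = Callable[[Trajectory], bool]


def select_for_sft(trajectories: List[Trajectory], judge: Judge) -> List[Trajectory]:
    """SFT stage: keep only trajectories whose patch the judge accepts."""
    return [t for t in trajectories if judge(t)]


def rl_reward(trajectory: Trajectory, judge: Judge) -> float:
    """RL stage: the same verdict, mapped to a binary reward (assumed 1.0 / 0.0)."""
    return 1.0 if judge(trajectory) else 0.0


def toy_judge(trajectory: Trajectory) -> bool:
    # Placeholder only: a real execution-free judge would explore the repo and reason.
    return "return" in trajectory.patch


candidates = [
    Trajectory("issue-1", "+    return total / max(count, 1)"),
    Trajectory("issue-2", "+    pass"),
]
kept = select_for_sft(candidates, toy_judge)
rewards = [rl_reward(t, toy_judge) for t in candidates]
```

Neither function builds a container or runs a test, which is the property the environment-free pipeline relies on.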

Walk the cost. Hold the base model fixed (the Qwen3.5-9B this work post-trains from) and change only how each patch gets verified. Say a patch check splits into two parts: standing up the environment (building a Docker image for that repo) and running the check (executing its tests). Execution-based verification pays both, and the environment build dominates: as an illustrative figure, call it roughly 80% of the verification bill, because it repeats for every repository. Dockerless replaces both parts with a read-and-reason step, so it erases that environment tax and still lands 62.0% on SWE-bench Verified, which is 2.4 points above that same Qwen3.5-9B base and matches the container-based pipeline. The measured pattern repeats on the other two benchmarks: 50.0% on Multilingual (+8.7) and 35.2% on Pro (+2.9).

Verification approach	How it checks a patch	Infra per repo	SWE-bench Verified resolve rate
Execution-based (Docker)	build a container, run the repo's unit tests	a Docker image + test run per repo	reference pipeline (Dockerless matches it)
Dockerless (arXiv 2606.28436)	agentic repo exploration → reason about the patch, no execution	none	62.0% (+2.4 vs base; 50.0% / 35.2% on Multilingual / Pro)
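
The reported gains imply the baseline's own scores. A small arithmetic sketch, assuming the gains are absolute percentage points over the same Qwen3.5-9B baseline on each benchmark (the variable names are illustrative, and the implied values are derived here rather than quoted from the paper):

```python
# Reported Dockerless resolve rates (%) and gains over the baseline (points).
dockerless = {"Verified": 62.0, "Multilingual": 50.0, "Pro": 35.2}
gain = {"Verified": 2.4, "Multilingual": 8.7, "Pro": 2.9}

# Implied baseline = reported score minus reported gain, rounded to one decimal.
implied_baseline = {
    name: round(dockerless[name] - gain[name], 1) for name in dockerless
}

for name, score in implied_baseline.items():
    print(f"{name}: baseline ~{score}% -> Dockerless {dockerless[name]}%")
```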

Because correctness is decided by reading rather than running, the training loop sheds the one component that scaled with the number of repositories — the per-repo container — without giving up accuracy. The headline is not a new architecture or a higher ceiling; it is that the verification signal a coding agent learns from does not actually have to come from executed tests: a model that explores the repo and reasons can stand in for the whole environment.

Related explainers

The Verification Horizon — co-evolving verifiers — why verifying coding agents is its own hard problem, approached by training the verifier alongside the coder
MaxProof — defense-in-depth generative verifier — another verifier that reasons about correctness rather than only executing
NatureBench — discovery vs reproduction — how coding agents fare on real, hard software tasks
FutureSim — harness-level agent eval — grading an agent by its whole run, not a single answer

FAQ

What is execution-free patch verification (Dockerless)?

Execution-free patch verification decides whether a coding agent's code patch is correct without running the repository's tests. Dockerless (arXiv 2606.28436) replaces the standard per-repository Docker container — which builds the project and runs its unit tests — with a judge that explores the repository agentically, gathers evidence about the change, and reasons about correctness. It reaches 62.0% resolve rate on SWE-bench Verified while removing the per-repo container entirely.

Why verify a coding-agent patch without running its tests?

Running tests is accurate, but it requires building and running a per-repository environment (typically a Docker image), and that environment setup is the slow, expensive part — and it repeats for every repository, on every patch you want to check. An execution-free judge skips the container, so evaluation and reward generation cost only the read-and-reason step. Dockerless shows this environment-free signal matches environment-based post-training rather than trading accuracy for speed.

How does Dockerless relate to SWE-bench and RL reward generation?

SWE-bench is the benchmark: real GitHub issues an agent must resolve with a patch, scored by resolve rate. Dockerless's key move is that the same execution-free verdict drives two training stages at once — in supervised fine-tuning it selects which trajectories to learn from, and in reinforcement learning it serves as the reward. Because neither stage runs code, the whole post-training pipeline is environment-free, and it posts 62.0% Verified, 50.0% Multilingual, and 35.2% Pro (+2.4 / +8.7 / +2.9 over the Qwen3.5-9B baseline).

Originally posted on Learn AI Visually.

Cluster-Route-Escalate Cascade Serves LLMs at 97-99% Accuracy for Less Cost: Cost-Aware LLM Cascade

pueding — Thu, 02 Jul 2026 11:17:52 +0000

What: A new paper, Cluster, Route, Escalate (arXiv 2606.27457), is a cost-aware cascade for serving large language models: it sends most queries to a cheap model and reserves an expensive one for the hard cases.

Why: Serving cost is set by which model answers each query, and using one model for everything either overpays on easy queries or fails the hard ones — picking per query is the most direct cost lever you have.

vs prior: The usual setup runs one fixed model for all traffic. This cascade instead routes by query and escalates by answer quality, so a strong model is paid for only when a cheap one's answer looks weak.

Think of it as

A walk-in clinic: a nurse sees most patients, a specialist only the hard cases.

           incoming queries (patients)
                     │
              ┌──────▼──────┐
              │   triage    │  cluster + route
              └──────┬──────┘
                     │
              ┌──────▼──────┐
              │  NURSE      │  cheap model,
              │  (most)     │  answers by default
              └──────┬──────┘
             am I sure? (quality check)
             ┌───────┴───────┐
           yes               no, escalate
            │                     │
            ▼              ┌──────▼──────┐
       ✓ done, cheap       │ SPECIALIST  │ strong model:
                           │  (few)      │ accurate, pricey
                           └─────────────┘

incoming query = a patient walking into the clinic
query clustering = triage sorting patients by the kind of complaint
cheap model = the nurse who handles the routine visits
cost budget = how much specialist time the clinic will pay for
quality check = the nurse pausing to ask, am I sure about this?
escalation = referring an unsure case up to the specialist
strong model = the specialist: accurate, but expensive and slow

Quick glossary

Cascade — A chain of models ordered cheap-to-strong, where a query stops at the first model whose answer is good enough. The whole design is about moving as few queries as possible down the chain.

Query clustering — Grouping incoming queries by similarity so a routing decision can be made per group rather than re-derived for every single query. It is Stage 1 of this system.

Routing — Choosing which model answers a given query. Here the route is set by the cluster a query lands in.

Cost budget — A single tunable knob, fixed offline, that decides how aggressively traffic is pushed onto cheap models. Turn it up and you save more but risk more weak answers; turn it down and you spend more for safety.

Quality estimator — A small learned model that scores how good a cheap model's answer is, without seeing the right answer. It is the gate that decides whether to escalate. The paper notes it needs only task-correctness labels to train.

Escalation — Re-asking the same query of a stronger, pricier model when the quality estimator judges the cheap answer too weak. Only the hard or low-confidence cases trigger it.

TPOT — Time per output token — how long the system takes to emit each token of a response. Cheap models stream faster, so routing easy traffic to them cuts TPOT.

The news. On June 25, 2026, researchers posted Cluster, Route, Escalate (arXiv 2606.27457), a two-stage cascade for cost-aware LLM serving. Stage 1 clusters incoming queries and assigns each cluster to its most cost-effective model, with the aggressiveness set by a single cost-budget hyperparameter tuned offline. Stage 2 adds a quality-estimation cascade: when a Stage-1 answer is judged low-quality, the query is escalated to a stronger model, so only the hard or low-confidence cases reach the expensive models. The authors report retaining 97-99% of the strongest model's accuracy while reducing time per output token, using only task-correctness labels and adapting as the model pool changes.

Picture a walk-in clinic on a busy morning. If you sent every patient straight to the specialist, you would diagnose almost everyone correctly — and go broke paying specialist rates for sore throats. If you staffed only a nurse, the routine visits would fly by, but the genuinely tricky cases would get missed. The waste isn't in the easy patients or the hard ones — it's in deciding the staffing once, for everyone, instead of per patient. That is exactly the bind a serving operator is in: pick one model to answer every request and you are either overpaying on the easy traffic or under-serving the hard traffic.

The clinic's fix is triage: a quick sort at the door sends most patients to the nurse and flags the unusual ones. Cluster, Route, Escalate does the same thing in two moves. Stage 1 — Cluster and Route groups similar queries and sends each group to the cheapest model that can handle it, with one offline-tuned cost budget dialing how much traffic gets pushed onto cheap models. Stage 2 — Escalate is the nurse's second thought: a quality estimator reads each cheap answer and, when it looks weak, re-asks the question of a stronger model. The clever part, the authors note, is that the gate is trained from task-correctness labels alone and the cascade adapts as models enter or leave the pool, with no manual reconfiguration — so it keeps working as the model lineup changes.
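
The two stages can be sketched as a single serving function. Everything here is a hypothetical stand-in: the keyword-based `cluster_of`, the routing table, the lambda "models", the toy quality score, and the 0.5 threshold are illustrative assumptions, not the paper's components.

```python
from typing import Callable, Dict, Tuple

Model = Callable[[str], str]


def serve(
    query: str,
    cluster_of: Callable[[str], str],
    route: Dict[str, str],
    models: Dict[str, Model],
    quality: Callable[[str, str], float],
    strong: str,
    threshold: float = 0.5,
) -> Tuple[str, str]:
    """Return (model_used, answer) for one query."""
    # Stage 1: cluster the query and use that cluster's pre-assigned model.
    chosen = route[cluster_of(query)]
    answer = models[chosen](query)
    # Stage 2: escalate only when the estimated quality of a cheap answer is low.
    if chosen != strong and quality(query, answer) < threshold:
        return strong, models[strong](query)
    return chosen, answer


# Toy components, for illustration only.
models = {
    "cheap": lambda q: "4" if q == "2+2?" else "not sure",
    "strong": lambda q: f"detailed answer to: {q}",
}
route = {"arithmetic": "cheap", "other": "cheap"}
cluster_of = lambda q: "arithmetic" if "+" in q else "other"
quality = lambda q, a: 0.1 if a == "not sure" else 0.9

easy = serve("2+2?", cluster_of, route, models, quality, strong="strong")
hard = serve("Prove the lemma.", cluster_of, route, models, quality, strong="strong")
```

The easy query stops at the cheap model, while the hard one is re-asked of the strong model because the toy estimator scores its cheap answer below the threshold.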

Where it earns its keep

Picture 1,000 queries an hour, of which 800 are easy and 200 are hard (illustrative split). Send all 1,000 to the strong model and you pay the strong-model price 1,000 times — and every easy answer streams at the strong model's slower time per output token. With the cascade, the cheap model answers the 800 easy ones fast and cheap; the quality estimator escalates only the roughly 200 it is unsure about. You now pay the expensive price about 200 times instead of 1,000 — a ~5× cut in expensive calls (illustrative — the paper reports accuracy and latency, not a fixed cost multiple), while the paper reports the cascade still keeps 97-99% of the strong model's accuracy. The quality estimator itself is lightweight to run, far cheaper than the expensive model calls it cancels.
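
The arithmetic above, written out. The query counts are the illustrative 800 / 200 split from the text; the per-call prices are hypothetical and only there to show that a 5x cut in expensive calls is not the same as a 5x cut in total cost, since in this simplified model the cheap model attempts every query first.

```python
total_queries = 1000
escalated = 200  # queries the quality estimator sends on to the strong model

# Hypothetical per-call prices in arbitrary units (not from the paper).
cheap_price = 1.0
strong_price = 10.0

strong_only_cost = total_queries * strong_price
# Simplified cascade: every query is tried on the cheap model, some are re-asked.
cascade_cost = total_queries * cheap_price + escalated * strong_price

expensive_call_cut = total_queries / escalated
total_cost_cut = strong_only_cost / cascade_cost

print(f"expensive calls cut: {expensive_call_cut:.1f}x")
print(f"total cost cut:      {total_cost_cut:.2f}x")
```

With these made-up prices the total bill falls by about 3.3x; the real ratio depends on the actual price gap between the models and on the escalation rate.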

Serving strategy	Who answers a query	Cost on easy traffic	Accuracy on hard traffic
One cheap model	the cheap model, always	lowest	weak — hard cases fail
One strong model	the strong model, always	highest — overpays	best (the 100% reference)
Cluster, Route, Escalate	cheap by default, strong only on escalation	low — most stays cheap	~97-99% of the strong model (per paper)

Because the decision lives in how queries are routed and answers are graded — not in any model's weights — the cascade is a serving-layer wrapper that composes with the rest of the cost toolkit: prompt and result caching, batching, and the agent cost profile all still apply underneath it. Its real-world payoff depends on the gap between your cheap and strong models and on how skewed your traffic is toward easy queries — the more lopsided the traffic, the more the cascade saves.

Related explainers

Co-failure ceiling — routing, voting, and mixture-of-agents — the limit of model ensembles: when every model misses the same query, no router or vote can recover it
Tool router — contextual-bandit tool routing — the same "pick the cheapest thing that works" idea, applied to choosing tools instead of models
VIA-SD — tiered confidence-gated verification — a cheap-to-expensive escalation gate inside one model's decode loop, rather than across a model pool

FAQ

What is the Cluster-Route-Escalate cost-aware cascade?

It is a two-stage system for serving large language models more cheaply. Stage 1 clusters incoming queries and routes each cluster to the cheapest model that can handle it, with a single offline-tuned cost budget deciding how aggressively traffic is pushed onto cheap models. Stage 2 adds a quality estimator that reads each cheap answer and escalates only the low-quality ones to a stronger model — so the expensive models run only on hard or low-confidence cases.

Why does it cut cost without losing much accuracy?

In a typical workload, many queries are easy and a small model answers them correctly. A single strong model bills full price for all of that easy traffic. The cascade keeps the easy queries on cheap models and pays for the strong model only when the quality estimator flags a weak answer. The paper reports retaining 97-99% of the strongest model's accuracy while reducing time per output token, because the rare hard cases still reach the strong model.

How is this different from plain model routing?

Plain routing makes one upfront decision: pick a model for the query and commit. This cascade adds a second, answer-aware stage: it routes cheap first, then grades the result and escalates only if the answer looks weak. The quality estimator is trained from task-correctness labels alone and never sees the true answer at serving time, and the authors report that the cascade adapts as models are added to or removed from the pool without manual reconfiguration.

Originally posted on Learn AI Visually.

S-Agent: Spatial Tool-Use Makes an 8B Agent Rival GPT-5.4 on Spatial Reasoning: Spatio-Temporal Evidence Accumulation

pueding — Wed, 01 Jul 2026 11:18:43 +0000

What: S-Agent is an agent framework for spatial reasoning over multi-view images and video, where a vision-language model acts as a planner that directs a hierarchy of spatial tools to build one shared 3-D model of the scene — what the paper calls spatio-temporal evidence accumulation.

Why: Spatial questions — how far apart, how many, which way is it facing — need geometry that no single flat frame carries. Moving that geometry out of the model's head and into an explicit 3-D store lets an 8-billion-parameter agent rival GPT-5.4 and Gemini 3 on spatial reasoning.

vs prior: A standard VLM does frame-by-frame reasoning: it re-derives the whole 3-D scene inside its context on every frame and loses track as the camera moves. S-Agent instead keeps the scene in an external Scene Memory that tools refine, frame after frame.

Think of it as

A detective rebuilding a room as a 3-D scale model from a stack of photos.

             A STACK OF PHOTOS (multi-view frames)
                          │
            ┌─────────────┴─────────────┐
            │                           │
    frame-by-frame                S-Agent detective:
    re-imagines the               each photo adds one
    room every photo              measurement to one
    (the picture slips)           shared 3-D scale model
            │                           │
            ▼                           ▼
    ✗ count drifts to 11          ✓ re-sightings collapse;
      (3, 3, 2, 3 seen)             count resolves to 4

the VLM planner = the lead detective who decides the next measurement, but never measures by hand
spatial tools & experts = the specialists called in — one spots each object, one lifts it into 3-D, one measures distance and angle
the multi-view frames = the stack of flat photos, each shot from a different angle
Scene Memory = the 3-D scale model on the table that every photo refines
Agent Memory = the case notebook holding the reasoning so far
evidence accumulation = each photo adds one measurement to the model; the answer is read off the model, not re-imagined

Quick glossary

VLM (vision-language model) — A model that takes images or video plus text and reasons over both. In S-Agent the VLM is not the thing that answers — it is the planner that decides which spatial tool to call next.

Semantic planner — The role the VLM plays: it reads the question and the scene so far and chooses the next action — ground this object, lift that one into 3-D, measure this distance — rather than computing the geometry itself.

Spatial tools & experts — A hierarchy of specialized tools the planner directs: 2-D object grounding, depth/geometry that lifts a 2-D detection into a 3-D position, and measurement (counting, distance, orientation).

Spatio-temporal evidence accumulation — The core idea: each frame contributes partial geometric evidence, and the tools aggregate it across space and time into one continuous 3-D world — so the answer is built up, not guessed from one view.

Scene Memory — An external store holding the evolving 3-D state of the world — where each object sits, how big, which way it faces. It is refined as frames arrive, and the planner reads its state instead of re-deriving it.

Agent Memory — The second store: the reasoning context — what the planner has asked, which tools ran, what is still unknown. Scene Memory is what the world looks like; Agent Memory is what the agent has done about it.

Training-free — The tool hierarchy improves spatial benchmarks with no weight updates at all. Fine-tuning an 8B model on the traces it produces then yields S-Agent-8B.

The news. On June 18, 2026, researchers posted S-Agent, an LLM-agent framework for spatial intelligence over multi-view images and video, to arXiv. Instead of reasoning frame-by-frame, a vision-language model acts as a semantic planner that directs a hierarchy of spatial tools: it grounds objects in 2-D, lifts them into 3-D, and aggregates geometric evidence across frames. The tool hierarchy already improves results on multiple spatial benchmarks with no training; after fine-tuning on its own traces, S-Agent-8B rivals GPT-5.4 and Gemini 3 on spatial reasoning.

Picture a detective walking into a room they have never seen, handed a thick stack of photos — the same space shot from a dozen angles. The hopeless way to work is to riffle through the photos and try to picture the whole room: where the chair sits relative to the door, how far the table is from the window, which way the lamp is turned. Flat photos do not carry that, and the mental image slips with every page. What actually works is to build a small 3-D scale model on the table, and let each photo add one measurement to it — then answer every question by looking at the model, not by re-imagining it from the stack. That scale model is Scene Memory, the detective deciding what to measure next is the VLM planner, and "add one measurement per photo" is spatio-temporal evidence accumulation.

Underneath the metaphor, S-Agent is moving the 3-D scene out of the model's context window and into an explicit store. A plain VLM asked a spatial question sees only a sequence of flat frames and must re-build the geometry in its head on every step — exactly the frame-by-frame approach that loses track as the camera moves. S-Agent instead casts the VLM as a planner that directs a hierarchy of tools, the same orchestrator-and-workers shape the Agents track names: one tool grounds each object in 2-D, another lifts it into a 3-D position, another measures. Their outputs land in Scene Memory — the running 3-D model — while the planner's own reasoning lives in Agent Memory, keeping what the world looks like separate from what the agent has done.
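
A toy sketch of that separation, under heavy assumptions. The class and function names are hypothetical, a fixed `plan` list stands in for the VLM planner's decisions, a dictionary lookup stands in for the 2-D grounding tool, and the lifting step is a plain pinhole back-projection with made-up camera intrinsics and no camera pose. None of this is the paper's code.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SceneMemory:
    """What the world looks like: object name -> 3-D position."""
    positions: dict = field(default_factory=dict)


@dataclass
class AgentMemory:
    """What the agent has done: an ordered log of tool calls."""
    log: list = field(default_factory=list)


def ground_2d(frame, name):
    # Stand-in for a 2-D grounding tool: look up (u, v, depth) in a toy frame.
    return frame[name]


def lift_3d(detection, focal=500.0, cx=320.0, cy=240.0):
    # Pinhole back-projection with assumed intrinsics; camera pose is ignored.
    u, v, depth = detection
    return ((u - cx) * depth / focal, (v - cy) * depth / focal, depth)


def run_plan(plan, frame, scene, agent):
    # `plan` stands in for the planner's choices: a list of (tool, args).
    result = None
    for tool, args in plan:
        if tool == "locate":
            scene.positions[args] = lift_3d(ground_2d(frame, args))
        elif tool == "distance":
            a, b = args
            result = math.dist(scene.positions[a], scene.positions[b])
        agent.log.append((tool, args))
    return result


frame = {"chair": (320.0, 240.0, 2.0), "table": (820.0, 240.0, 2.0)}
scene, agent = SceneMemory(), AgentMemory()
plan = [("locate", "chair"), ("locate", "table"), ("distance", ("chair", "table"))]
distance = run_plan(plan, frame, scene, agent)
```

The measurement is read from `scene.positions`, not recomputed from pixels, while `agent.log` records only which tools ran.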

Because the geometry now accumulates in a store the tools update, the same loop runs training-free — it changes how the agent acts, not the weights. The contrast with a code-as-action spatial agent is instructive: both move beyond asking one VLM to answer directly from the frames, but where that agent writes executable code as its action, S-Agent routes the work to typed spatial experts and a shared 3-D memory.

Approach	Where the 3-D scene lives	Spatial-reasoning result
Large VLM answering directly (e.g. GPT-5.4)	re-derived in the model's context, every frame	strong, but heavyweight
Code-as-action agent (SpatialClaw)	a stateful kernel the agent writes code against	+11.2 pts to 59.9% across 20 benchmarks
S-Agent (planner + spatial tools + Scene Memory)	an explicit 3-D model the tools refine	8B rivals GPT-5.4 & Gemini 3

Where it earns its keep

Picture a four-frame clip of a kitchen and the question "how many chairs?" A frame-by-frame counter tallies sightings: say it spots 3, then 3, then 2, then 3 chairs, and with no shared model it has no way to know which sightings are the same chair seen again, so it can drift toward 11. S-Agent places each detected chair at a 3-D coordinate in Scene Memory, so re-sightings from new angles collapse onto the same point, and the count resolves to 4. (The four-frame count is illustrative; only the 8-billion-parameter scale, the training-free gains, and the GPT-5.4 / Gemini 3 parity come from the paper.) That is the whole bet of accumulating evidence into one 3-D store rather than re-reasoning each flat frame: the geometry stops slipping, and an 8B agent reaches the neighborhood of frontier models built at far larger scale.
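
The chair count can be mimicked with a few lines, assuming each sighting already arrives as a 3-D point in a shared world frame. The `ChairStore` class, the 0.3 m merge radius, and the coordinates are all invented for illustration; how S-Agent actually associates re-sightings is not described in this post.

```python
import math


class ChairStore:
    """Toy scene store: merges sightings that land near a known object."""

    def __init__(self, merge_radius=0.3):
        self.merge_radius = merge_radius
        self.positions = []

    def add(self, position):
        # A sighting within merge_radius of a stored chair is treated as a re-sighting.
        for known in self.positions:
            if math.dist(known, position) <= self.merge_radius:
                return False
        self.positions.append(position)
        return True


# Four chairs one metre apart; each frame sees a subset, with small position noise.
frames = [
    [(0.00, 0.0, 0.0), (1.00, 0.0, 0.0), (2.00, 0.0, 0.0)],
    [(1.05, 0.0, 0.0), (2.02, 0.0, 0.1), (3.00, 0.0, 0.0)],
    [(0.04, 0.0, 0.0), (2.97, 0.1, 0.0)],
    [(0.02, 0.1, 0.0), (0.98, 0.0, 0.0), (3.03, 0.0, 0.0)],
]

naive_count = sum(len(frame) for frame in frames)

store = ChairStore()
for frame in frames:
    for sighting in frame:
        store.add(sighting)
merged_count = len(store.positions)
```

Summing sightings gives 11, while merging by 3-D proximity leaves 4 distinct chairs.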

FAQ

What is spatio-temporal evidence accumulation?

It is the core mechanism of S-Agent: instead of answering a spatial question from one flat frame, the agent treats each frame as partial geometric evidence and aggregates it across space and time into a single continuous 3-D model of the scene. A vision-language model acts as a planner that directs spatial tools — grounding objects in 2-D, lifting them into 3-D, measuring distance, count, and orientation — and the answer is read off the assembled 3-D model rather than re-imagined from the frames.

Why does S-Agent matter?

It shows that spatial intelligence is gated by how the scene is represented, not just by model size. By moving the 3-D scene out of the model's context and into an explicit Scene Memory that tools refine frame by frame, an 8-billion-parameter agent rivals GPT-5.4 and Gemini 3 on spatial reasoning, and the tool hierarchy improves benchmarks training-free — before any fine-tuning at all.

How is S-Agent different from a vision-language model answering directly?

A VLM answering directly does frame-by-frame reasoning: it must re-derive the whole 3-D scene inside its context on every frame, which is lossy and slips as the camera moves. S-Agent recasts the VLM as a planner that directs a hierarchy of spatial tools and keeps the evolving 3-D state in an external Scene Memory, with the reasoning context held separately in Agent Memory — so geometry accumulates instead of being re-guessed each step.

Originally posted on Learn AI Visually.

SGLang v0.5.14: LPLB Expert-Parallel Load Balancing

pueding — Tue, 30 Jun 2026 11:19:23 +0000

What: The SGLang v0.5.14 release ships LPLB — a linear-programming load balancer for serving mixture-of-experts models, where the experts are split across many GPUs and each step routes every token to a few of them.

Why: In expert-parallel MoE serving, token routing is uneven and shifts every step, so one overloaded GPU stalls the whole step at a sync barrier; evening that load is what unlocks throughput on big MoE models like DeepSeek-V4.

vs prior: Earlier setups used static, hand-tuned expert placement and ate the imbalance; LPLB keeps redundant replicas of the hot experts and solves a small linear program each step to minimize the busiest GPU's share of the work.

Think of it as

A warehouse store opening duplicate counters to even out the longest line.

       40% of this step's tokens want one hot expert

   WITHOUT LPLB                 WITH LPLB (3 replicas)
   ┌──────────────┐             ┌──────────────┐
   │ GPU1 ####### │ 40%         │ GPU1 ##      │ 14%
   │ GPU2 #       │  5%         │ GPU2 ##      │ 14%
   │ GPU3 #       │  5%         │ GPU3 ##      │ 14%
   └──────┬───────┘             └──────┬───────┘
          ▼                            ▼
   barrier waits on GPU1        lanes finish together
   ✗ others idle ~1/3 step      ✓ idle time deleted

a customer = a token routed to its experts this step
a specialty counter = an expert (a sub-network in a mixture-of-experts model)
a checkout lane = a GPU the experts are spread across
one counter mobbed while others sit idle = per-GPU load imbalance
duplicate copies of the busy counter = redundant expert replicas
the floor manager who evens the longest line each wave = LPLB

Quick glossary

MoE (Mixture-of-Experts) — A model whose feed-forward layer is split into many experts (sub-networks); a small router sends each token to only a few. Total parameters are huge, but the active ones per token stay small. DeepSeek-V4 is a large MoE model.

Expert parallelism (EP) — The serving layout that spreads a MoE's experts across many GPUs, because all the experts together do not fit on one. Each step, tokens must be shipped to whichever GPU holds their chosen expert and the results shipped back.

Load imbalance — When this step's router sends far more tokens to some experts than others, the GPUs holding the popular experts get swamped while the rest sit idle. The pattern is data-dependent, so it shifts batch to batch.

Redundant expert replicas — Keeping extra copies of the hot experts on several GPUs so their token load can be split, instead of one GPU owning a popular expert alone. The balancer decides how to divide each expert's tokens among its copies.

LPLB — SGLang's Linear-Programming Load Balancer. Each step it solves a tiny linear program over the current token counts to assign load across replicas so the maximum per-GPU load is as small as possible (a min-max objective).

Waterfill — The second expert-parallel balancer the release ships alongside LPLB. SGLang names it but does not detail how it works; the name points to a classic water-filling heuristic — fill the least-loaded replica first — which would be a lighter alternative to solving the LP each step.

All-to-all — The expert-parallel communication step that ships tokens out to their experts' GPUs and the results back. It runs every layer and waits for the slowest GPU, which is why imbalance is so costly here.

The news. On June 26, 2026, the SGLang team released v0.5.14, with work from 56 contributors. The headline is 5x higher throughput at the same interactivity when serving DeepSeek-V4 on NVIDIA GB300, driven by two new expert-parallel load balancers, Waterfill and LPLB (a linear-programming load balancer), plus CuteDSL prefill kernels for Blackwell and int8 checkpoint pooling for linear-attention prefix caches.

Picture a warehouse store at peak rush. The checkout lanes are the GPUs; the specialty counters — deli, pharmacy, bakery — are the model's experts, and because no single lane can hold them all, the store spreads the counters across the lanes. That spread is expert parallelism: a mixture-of-experts model has too many experts to fit on one GPU, so they live across many, and each decode step the router sends every customer (token) to the one or two counters they need. The trouble is that the rush is lumpy. This wave, everyone wants the deli; next wave, the pharmacy. So one counter gets mobbed while the rest stand idle — and the store can't close out the rush until that longest line clears.

That last clause is the whole problem, because the lanes do not finish independently. Every GPU has to meet at a sync barrier — the all-to-all that ships tokens to their experts and the answers back — and that barrier waits for the slowest lane. The GPU holding this step's most popular expert therefore sets the pace for all of them, and the fast lanes burn the difference as idle time. Add more GPUs and the imbalance can get worse, not better, because the hot expert still lives on one lane while you have paid for more lanes to stand around.

SGLang v0.5.14's fix is to stop letting one counter bottleneck the floor. It keeps redundant replicas of the hot experts — duplicate deli counters on several lanes — and then, each wave, the floor manager solves a quick assignment problem: given how many customers want each counter right now, divide every counter's line across its copies so the busiest lane does as little as possible. That floor manager is LPLB, and "as little as possible" is literal: it solves a small linear program whose objective is to minimize the maximum per-GPU load (a min-max). Waterfill is the other balancer the release pairs it with, and SGLang does not spell out how it works. The name, though, points to a classic water-filling heuristic — fill the least-loaded replica first — which would be a lighter alternative to running the LP every step.
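The release does not describe Waterfill's internals, so what follows is only a sketch of the classic water-filling idea the name suggests, not SGLang's code. The function name `waterfill`, the `loads` list, and the token counts are all hypothetical. It pours one expert's tokens onto the GPUs holding its replicas, topping up the least-loaded ones until they reach a common level.

```python
def waterfill(loads, replica_gpus, tokens):
    """Pour `tokens` onto `replica_gpus`, topping up the least-loaded GPUs first.

    Mutates and returns `loads`, a list of current per-GPU token counts.
    Assumes `replica_gpus` is non-empty and tokens can be split fractionally.
    """
    order = sorted(replica_gpus, key=lambda g: loads[g])
    heights = [loads[g] for g in order]
    filled = 0.0
    for k in range(1, len(order) + 1):
        filled += heights[k - 1]
        level = (tokens + filled) / k  # common level if only the k lowest GPUs take tokens
        if k == len(order) or level <= heights[k]:
            break  # the next GPU already sits above the water line
    for g in order[:k]:
        loads[g] = level
    return loads


# Hypothetical per-GPU loads from all the other experts (8 GPUs, 60 tokens total).
base = [2, 5, 10, 9, 9, 9, 8, 8]

# No replicas: the hot expert's 40 tokens all land on GPU 0.
no_replicas = base.copy()
no_replicas[0] += 40

# Three replicas on GPUs 0, 1 and 2: water-fill the same 40 tokens across them.
balanced = waterfill(base.copy(), [0, 1, 2], 40)

print("busiest GPU without replicas:", max(no_replicas))
print("busiest GPU with water-filling:", max(balanced))
```

For a single expert this levelling gives the lowest possible peak among its replicas. With several replicated experts competing for the same GPUs, filling them one at a time is only a heuristic, which is where a joint LP solve would be expected to do better.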

Hold the layout fixed and walk the imbalance math (illustrative; the release reports only the end-to-end 5x). Say 8 GPUs serve a batch, and the router sends 40% of this step's tokens to one hot expert that lives on a single GPU, while another GPU draws just 5%. The step can't end until that one GPU finishes its 40%, so the other seven, which average under 9% of the tokens each, sit idle for most of the step: you own 8 GPUs but move at the speed of the busiest one. Now place 3 replicas of that hot expert and let LPLB split its tokens across them: its share per GPU falls from 40% toward about 13% (40 ÷ 3), the barrier wait shrinks sharply, and the lanes finish much closer together. The win isn't a faster kernel; it's deleting the idle time that imbalance was manufacturing.
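The idle-time arithmetic in that walk-through can be checked in a few lines. This is a minimal sketch under the same illustrative numbers. It assumes the step time is set purely by the busiest GPU and that the six unnamed GPUs split the remaining 55% evenly; both are simplifications, not figures from the release.

```python
def idle_fractions(loads):
    """Fraction of the step each GPU spends waiting at the barrier."""
    step = max(loads)  # the barrier releases only when the busiest GPU is done
    return [1 - load / step for load in loads]


# Illustrative shares of this step's tokens (percent), one entry per GPU.
# Assumption: the six unnamed GPUs split the remaining 55% evenly.
loads = [40.0, 5.0] + [55.0 / 6] * 6

idle = idle_fractions(loads)
others_idle = sum(idle[1:]) / len(idle[1:])            # mean idle share of the 7 non-hot GPUs
utilization = sum(loads) / (len(loads) * max(loads))   # useful work / paid-for capacity

# Splitting the hot expert over 3 replicas divides its per-GPU share by 3.
hot_share_per_replica = 40.0 / 3

print(f"other GPUs idle {others_idle:.0%} of the step; utilization {utilization:.0%}")
print(f"hot expert share per replica: {hot_share_per_replica:.1f}%")
```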

Expert-parallel balancing	How it assigns load	Per-step cost	Balance quality
Static / hand-tuned placement	fixed expert→GPU map, set before serving	~none	poor under shifting, data-dependent routing
Waterfill (this release)	the release's second balancer; name implies water-filling, internals not detailed	—	a lighter companion to LPLB (inferred from the name)
LPLB (this release)	solves a linear program to minimize the busiest GPU's load	a small solve each step	tightest — a min-max optimum over replicas (SGLang v0.5.14)

Where it earns its keep is exactly the regime DeepSeek-V4 lives in: a large MoE served with expert parallelism across many Blackwell GPUs, where the all-to-all and its sync barrier are a leading cost in each decode step. The release's headline — 5x higher throughput at the same interactivity — is a goodput claim: more tokens per second without making any single user wait longer. Read it as the lanes finishing together instead of seven of them waiting on one — the same hardware, far less idle time.

Related explainers

SGLang v0.5.12 — TokenSpeed MLA backend — the prior SGLang release, a kernel-level cache-write win rather than a scheduling one
Manifold Power Iteration — MoE router alignment — the other MoE balance problem: which expert a token picks (router design), not where that expert runs
GLM-5.2 — active vs total parameters — why MoE serving is its own discipline: huge total weights, small active compute per token

FAQ

What is LPLB (linear-programming load balancing)?

LPLB is the Linear-Programming Load Balancer added in SGLang v0.5.14. When a mixture-of-experts model is served with expert parallelism — its experts split across many GPUs — the router sends an uneven, step-by-step-changing number of tokens to each expert, so some GPUs get swamped while others idle. LPLB keeps redundant replicas of the hot experts and, each step, solves a small linear program over the current token counts to divide every expert's load across its replicas so the maximum per-GPU load is minimized. Evening the load shrinks the wait at the all-to-all sync barrier that gates each decode step.
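One way to write the min-max objective described above is the textbook linear program below. The symbols are assumptions introduced for illustration: $n_e$ is expert $e$'s token count this step, $R(e)$ is the set of GPUs holding a replica of $e$, $x_{e,g}$ is the number of $e$'s tokens sent to GPU $g$, and $t$ is the busiest GPU's load. SGLang's actual formulation may differ.

```latex
\begin{aligned}
\min_{x,\,t}\quad & t \\
\text{s.t.}\quad & \sum_{g \in R(e)} x_{e,g} = n_e && \text{for every expert } e \\
& \sum_{e \,:\, g \in R(e)} x_{e,g} \le t && \text{for every GPU } g \\
& x_{e,g} \ge 0 && \text{for every expert } e,\ g \in R(e)
\end{aligned}
```

The first constraint says every token of every expert is assigned to one of its replicas; the second caps each GPU's total at $t$, so minimizing $t$ minimizes the busiest GPU's load.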

Why does expert-parallel MoE serving need load balancing at all?

Because expert parallelism makes the GPUs finish a step together, not independently. Every layer runs an all-to-all that ships tokens to their experts' GPUs and the results back, and that barrier waits for the slowest GPU. Since token-to-expert routing is data-dependent and shifts every batch, whichever GPU holds this step's most popular expert becomes the bottleneck for all of them — and the rest burn the difference as idle time. Without balancing, adding more GPUs can even make it worse, because the hot expert still lives on one GPU. SGLang reports a 5x throughput gain at the same interactivity for DeepSeek-V4 on NVIDIA GB300 once the load is evened.

How does LPLB differ from Waterfill, and from a MoE router?

Waterfill and LPLB are the two expert-parallel balancers the release ships, both aimed at spreading each step's token load across expert replicas. SGLang details LPLB — it solves a linear program for a tight min-max balance at a small per-step cost — but does not spell out Waterfill's internals; the name points to a classic water-filling heuristic (fill the least-loaded replica first), which would be a lighter alternative to an LP solve. Both differ from the MoE router: the router decides which expert each token should go to (a quality choice about the model's output), whereas the balancers decide where, among the redundant copies of that chosen expert, the work actually runs (a serving choice about GPU utilization).

Originally posted on Learn AI Visually.

CacheWeaver Reorders RAG Evidence for Prefix-Cache Reuse: Prefix-Cache-Aware Evidence Reordering

pueding — Mon, 29 Jun 2026 11:19:38 +0000

What: On June 18, 2026, researchers posted CacheWeaver — a prompt-layer method built on prefix-cache-aware evidence reordering. It changes only the order retrieved RAG chunks appear in the prompt, so the serving engine can reuse more of its KV prefix cache.

Why: In retrieval-augmented serving, time-to-first-token is dominated by prefilling the evidence; reusing a cached prefix skips that work, and CacheWeaver squeezes out the reuse a naive system leaves on the table — with no engine change.

vs prior: Versus naive retrieval-order caching — chunks left in relevance order, so each prompt's opening rarely matches the cache — CacheWeaver re-sequences the same chunks to maximize the shared opening prefix, the part the engine can actually reuse.

Think of it as

A kitchen that reuses orders already half-cooked on a warming shelf.

Shelf tray:  c1 c2 c3 | c4 c5     already cooked
New order:   c1 c2 c3 | cX cY     just arrived
             └──────┘   └───┘
              shared    differs
              opening   from here
                 │          │
                 ▼          ▼
           ✓ reuse it,  ✗ cook fresh —
             no prefill   must prefill
first plate out sooner  =  lower TTFT

retrieved evidence chunk = one course in a multi-course order
the prompt sent to the model = the full order, course by course in sequence
KV prefix cache = the warming shelf of orders already partly cooked
reusable prefix = the opening courses your order shares with one on the shelf
CacheWeaver reordering = re-plating the same courses so the opening matches the shelf
time-to-first-token = how soon the first plate leaves the kitchen

Quick glossary

TTFT (time-to-first-token) — How long after a request arrives before the model emits its first output token. For a long RAG prompt this is almost entirely prefill time — the engine has to read all the evidence before it can answer.

Prefill — The first phase of inference, where the model processes the entire prompt at once to build its KV cache. Its cost grows with prompt length, which is why stuffing evidence into a RAG prompt is what makes TTFT slow.

KV prefix cache (RadixAttention) — Serving engines store the KV computed for a prompt and reuse it for any later prompt that starts with the exact same tokens. SGLang's RadixAttention keeps these shared prefixes in a tree; vLLM does it by hashing fixed blocks.

Prefix (and why order matters) — Reuse works only from the front, token-for-token. Two prompts that share their opening reuse that opening; the instant they diverge, everything after must be recomputed — so the order of the evidence decides how much is reusable.

RAG (retrieval-augmented generation) — Before answering, the system retrieves relevant documents and pastes them into the prompt as evidence. The retriever returns them ranked by relevance, which is the order CacheWeaver rearranges.

Oracle ordering — The best ordering you could pick if you knew the whole future cache state in advance — an upper bound, not a runnable policy. CacheWeaver's cheap greedy choice reaches about 97.5% of this ideal.

The news. On June 18, 2026, researchers posted CacheWeaver, a lightweight prompt-layer method that reorders retrieved evidence so grounded RAG requests reuse as much of the KV prefix cache as possible. It changes neither the serving engine nor the retrieved documents — only the order the chunks appear in the prompt. Across three vLLM configurations it cuts median time-to-first-token by about 20–33% relative to naive retrieval-order prefix caching, reaching 97.5% of the gain an oracle ordering would give, with no measured answer-quality degradation. Read the paper →

Picture a kitchen with a warming shelf of orders that are already half-cooked. A new order comes in, and its first few courses happen to be the same dishes, in the same sequence, as a tray already sitting on the shelf. The cook doesn't start over — they grab the matching tray and only cook the courses that differ, and the first plate leaves the pass far sooner. The whole trick is that the reuse runs strictly from the front: the moment your order's courses stop matching the shelf, every course after that has to be cooked fresh.

In serving terms, each course is a retrieved evidence chunk, the order is the prompt, and the warming shelf is the KV prefix cache. A serving engine keeps the keys and values it already computed for a prompt and reuses them for any later prompt that begins with the exact same tokens. But that matching is unforgiving: change a single token near the front and the block hash no longer matches, so the cache misses and the work is redone.
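To see why one early token change wipes out the reuse, here is a simplified illustration of chained block hashing. It is a sketch of the general idea, not vLLM's code: the block size of 4, the use of SHA-256, and the function names are all assumptions chosen for readability.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (assumed; real engines use larger blocks)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block together with the hash of everything before it."""
    hashes, parent = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        payload = parent + repr(tokens[start:start + block_size]).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


def reusable_blocks(cached_tokens, new_tokens):
    """Count leading blocks of new_tokens whose hashes match the cached prompt."""
    count = 0
    for old, new in zip(block_hashes(cached_tokens), block_hashes(new_tokens)):
        if old != new:
            break
        count += 1
    return count


cached = list(range(12))                      # 3 blocks of 4 toy token IDs
late_edit = cached[:9] + [99] + cached[10:]   # differs only inside the last block
early_edit = [99] + cached[1:]                # differs in the very first token

print(reusable_blocks(cached, late_edit))     # the leading blocks still match
print(reusable_blocks(cached, early_edit))    # the chain breaks at block 0
```

Because each block's hash folds in its parent's hash, a mismatch anywhere changes every hash after it, which is the "strictly from the front" behavior the kitchen analogy describes.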

Here is the catch retrieval creates. A RAG retriever returns chunks ranked by relevance, and that ranking is different for almost every question — so even two requests that pull the same documents arrange them differently, and their prompts share almost no opening. Because TTFT for a long RAG prompt is essentially the time to prefill all that evidence, a cache that almost never hits means the GPU re-reads thousands of evidence tokens on every request.

CacheWeaver's move is to treat the chunk order as a free variable. It maintains a prefix tree of recently served evidence sequences and runs a greedy algorithm that, for each incoming request, surfaces the most reusable cached prefix and then re-plates the retrieved chunks to match it. Because the reordering happens entirely at the prompt layer, the serving engine and the retrieval results stay untouched — and in the paper's evaluations the reordering shows no answer-quality loss, since the same evidence is present either way and only its order changes.
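The prefix-tree-plus-greedy idea can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's algorithm: the `PrefixTree` class, its method names, and the tie-break rule (take the first still-unplaced chunk, in relevance order, that extends a cached sequence) are all hypothetical, and chunks are treated as opaque IDs rather than token spans.

```python
class PrefixTree:
    """Trie over the chunk-ID sequences of recently served prompts."""

    def __init__(self):
        self.root = {}

    def insert(self, chunk_ids):
        node = self.root
        for cid in chunk_ids:
            node = node.setdefault(cid, {})

    def reorder(self, retrieved):
        """Put a cached prefix first, then the leftover chunks in retrieval order."""
        remaining = list(retrieved)
        prefix, node = [], self.root
        while True:
            # Greedy step: first unplaced chunk (in relevance order) that
            # continues a cached sequence from the current tree node.
            match = next((cid for cid in remaining if cid in node), None)
            if match is None:
                break
            prefix.append(match)
            remaining.remove(match)
            node = node[match]
        return prefix + remaining


tree = PrefixTree()
tree.insert(["c1", "c2", "c3", "c4", "c5"])      # a sequence already on the "shelf"

retrieved = ["cX", "c3", "c1", "c2", "cY"]       # relevance order from the retriever
prompt_order = tree.reorder(retrieved)
tree.insert(prompt_order)                        # remember what was actually served

print(prompt_order)
```

The reordered prompt contains exactly the retrieved chunks, only resequenced, which mirrors the claim that the evidence itself is untouched.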

Here is where it earns its keep, with illustrative numbers. Say a request's evidence prefills to 5,000 tokens, and prefill cost is roughly proportional to the tokens the GPU must process. Left in retrieval order, only the shared system preamble and one lucky chunk match the cache: about 1,500 tokens reused, so the engine prefills the remaining 3,500. CacheWeaver reorders the same chunks so the opening matches a cached sequence of ~2,500 tokens, leaving just 2,500 to prefill fresh. TTFT tracks the recomputed portion, so the prefill work falls from 3,500 to 2,500 tokens, about a 29% cut and squarely inside the paper's reported 20–33% band. (The 5,000- and reuse-token figures are illustrative; the 20–33% band and the greedy policy reaching 97.5% of the oracle's gain are from the CacheWeaver paper.)
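The same arithmetic, written out so the numbers can be swapped. It is a minimal sketch that assumes, as the paragraph does, that TTFT is proportional to the tokens left to prefill; the function name `ttft_cut` is made up for illustration.

```python
def ttft_cut(total_tokens, reused_naive, reused_reordered):
    """Relative drop in prefill work when more of the prompt is served from cache."""
    naive_prefill = total_tokens - reused_naive
    reordered_prefill = total_tokens - reused_reordered
    return 1 - reordered_prefill / naive_prefill


# Illustrative figures from the walk-through above, not measurements.
cut = ttft_cut(total_tokens=5000, reused_naive=1500, reused_reordered=2500)
print(f"prefill work cut: {cut:.1%}")
```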

Approach	What it changes	Prefix reuse	Median TTFT
Retrieval-order prefix caching	nothing — chunks left in relevance order	only when two orders happen to match	baseline
CacheWeaver (greedy reorder)	re-sequences evidence at the prompt layer	maximized via a prefix-tree match	~20–33% lower (CacheWeaver paper)
Oracle ordering	best possible order, known in hindsight	maximal (upper bound)	CacheWeaver reaches ~97.5% of this gain (paper)
Answer quality	—	—	no measured degradation (CacheWeaver paper)

The reason this is worth noticing is where it sits: it is not a new attention kernel or a smaller cache, but a scheduling decision at the boundary between retrieval and serving — the kind of free win that appears once you stop treating the retrieve-then-generate pipeline and the serving engine as two sealed boxes. The same evidence, the same engine, the same answer — only the order changes, and the cache does the rest.

Goes deeper in: LLM Serving → Prefix Caching & RadixAttention → The prefix tree

Related explainers

Attention Once Is All You Need — persistent KV cache across queries — the same "reuse the prefix instead of recomputing it" idea, taken to a stateful cache that survives across separate queries.
AMD ATOM — prefill/decode disaggregation — why TTFT lives in the prefill phase, and another way to attack it: splitting prefill off onto its own hardware.
Is Grep All You Need? — grep vs vector retrieval — the retrieval side of the pipeline CacheWeaver reorders, and how the choice of retriever shapes what lands in the prompt.

FAQ

What is prefix-cache-aware evidence reordering?

It is reordering the retrieved chunks in a RAG prompt so the serving engine's KV prefix cache can reuse as much of the prompt's opening as possible. The serving engine caches the keys and values it computed for earlier prompts and reuses them for any later prompt that begins with the exact same tokens. Because retrieval returns chunks ranked by relevance — a different order for almost every question — prompts rarely share an opening, so the cache misses. CacheWeaver re-sequences the same chunks at the prompt layer to maximize that shared opening prefix, without touching the engine or the retrieved documents.

Why does it lower time-to-first-token?

Time-to-first-token for a long RAG prompt is essentially the time to prefill all the evidence — the model must read every chunk before it can answer. When the prompt's opening matches a cached prefix, the engine reuses that work and only prefills the remaining tokens, so TTFT tracks the part it has to recompute. By making more of the opening reusable, CacheWeaver shrinks that recomputed portion, cutting median TTFT by about 20–33% across three vLLM configurations and reaching roughly 97.5% of an oracle ordering's gain, with no measured loss in answer quality.

How does CacheWeaver relate to prefix caching and RAG?

It sits exactly between them. Prefix caching (RadixAttention in SGLang, block hashing in vLLM) is the serving-side mechanism that reuses a shared opening; RAG is the retrieval-side pipeline that pastes ranked evidence into the prompt. CacheWeaver changes neither — it adds a prompt-layer scheduler that keeps a prefix tree of recently served sequences and greedily reorders each request's retrieved chunks to match the most reusable cached prefix. It is complementary to the engine's caching and to the retriever's ranking, because it only governs the order in which the already-chosen chunks are laid out.

Originally posted on Learn AI Visually.

Qwen-AgentWorld Trains a Language Model as a World Model for RL Agents: World Model as a Decoupled RL Simulator

pueding — Sun, 28 Jun 2026 11:20:08 +0000

What: The Qwen-AgentWorld release (arXiv 2606.24597) trains a language model to be a world model: given the current observation and an agent's action, it predicts the next environment state. The idea it makes concrete is using that model as a decoupled simulator for reinforcement-learning (RL) agents.

Why: Training an agent with RL needs a vast number of trial-and-error attempts in an environment — and real environments are slow, costly, and hard to run in parallel. A learned simulator lets you generate that experience cheaply and at massive scale.

vs prior: Standard agent RL is coupled to a live environment — every step waits on the real web page, terminal, or game; Qwen-AgentWorld decouples the two by predicting the environment's response itself, and also serves as a warm-start foundation model for downstream agents.

Think of it as

A flight simulator pilots train in instead of a real, costly plane.

                 THE RL AGENT (trainee pilot)
                            │
           ┌────────────────┴────────────────┐
           │                                 │
   ┌───────▼───────┐                 ┌───────▼───────┐
   │ World-model   │                 │ Real          │
   │ simulator     │                 │ environment   │
   │ (flight sim)  │                 │ (actual jet)  │
   └───────┬───────┘                 └───────┬───────┘
           │                                 │
   predicts next state              waits on the live
   in one forward pass              page/terminal/game
           │                                 │
           ▼                                 ▼
   ✓ thousands of runs at           ✗ slow, serial, and
     once — cheap to scale            costly to parallelize

world model = a flight simulator that predicts what happens next
real environment = the actual aircraft, costly and slow to train in
RL agent = the trainee pilot learning by trial and error
next-state prediction = the simulator computing your next instrument reading
decoupled simulator = running thousands of sim sessions at once, no real planes
agent warm-start = the hours logged in the sim before the first real flight

Quick glossary

World model — A model that predicts how an environment changes: feed it the current state and an action, and it returns the likely next state. Qwen-AgentWorld trains a language model to do this for agent environments.

Reinforcement learning (RL) — Training by trial and error toward a reward — the agent acts, sees what happens, and adjusts. It is data-hungry: it needs many environment steps, which is exactly what a fast simulator supplies.

Next-state prediction — The world model's core job: given (observation, action), output the next observation. Get this accurate enough and the model can replace the real environment for training.

Rollout — One full trial run of an agent in an environment, from start to finish. RL learns from thousands of rollouts; in a live environment each one is slow, in a simulator each one is cheap.

Decoupled (vs coupled) — A coupled setup ties each training step to the real environment; a decoupled one swaps in the simulator, so training no longer waits on the live web page, terminal, or game.

Warm-start / foundation model — Using a pre-trained model as a head start rather than training from scratch. Qwen-AgentWorld doubles as a foundation model that warms up downstream agents before task-specific fine-tuning.

Hybrid reward — A reward signal that combines more than one objective. Qwen-AgentWorld's final RL stage uses one to sharpen simulation fidelity — how faithfully its predicted states match reality.

The news. On June 24, 2026, the Qwen-AgentWorld team released a language model trained to act as a world model for agents: given the current observation and an agent's action, it predicts the next environment state. It is used two ways — as a decoupled environment simulator for training RL agents across thousands of scenarios, and as a foundation model that warms up downstream agents. Training is a three-stage pipeline (continual pre-training → supervised fine-tuning → RL with a hybrid reward), and the team reports it outperforms existing frontier models on AgentWorldBench across seven domains (the gain is stated qualitatively, without a single headline number). Read the paper →

Think about how you train a pilot. You do not hand a beginner the controls of a real jet and let them crash a few hundred times — you put them in a flight simulator that predicts what the plane would do in response to each input. The simulator is cheaper, safer, and you can run a thousand of them at once. Qwen-AgentWorld does exactly this for software agents: instead of training in the slow, live environment, it trains a language model to be the environment — to predict, from the current screen and the agent's action, what the next screen looks like.

Why does this matter so much for RL? Because reinforcement learning is gluttonous for experience: it improves by trying an action, seeing the environment's response, and adjusting — thousands and thousands of times. When every one of those steps is coupled to a real web page or terminal, the environment, not the GPU, becomes the bottleneck. A learned world model breaks that coupling: predicting the next state is just a forward pass, so you can run enormous numbers of rollouts in parallel, none of them waiting on the real world.
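What "breaking the coupling" means in code is that the training loop only needs something that maps (observation, action) to the next observation. The sketch below is a toy under stated assumptions: `rollout`, `toy_world_model`, and `toy_policy` are hypothetical names, and the string-based stand-ins replace what would be a trained model's forward pass.

```python
from typing import Callable, List, Tuple

# Anything that maps (observation, action) -> next observation can drive training.
StepFn = Callable[[str, str], str]
Policy = Callable[[str], str]


def rollout(step: StepFn, policy: Policy, first_obs: str,
            horizon: int) -> List[Tuple[str, str, str]]:
    """Collect (obs, action, next_obs) transitions for one trial run."""
    obs, trajectory = first_obs, []
    for _ in range(horizon):
        action = policy(obs)
        next_obs = step(obs, action)  # live environment OR world-model prediction
        trajectory.append((obs, action, next_obs))
        obs = next_obs
    return trajectory


# Toy stand-ins (hypothetical): a real setup would call the trained model here.
def toy_world_model(obs: str, action: str) -> str:
    return f"{obs}>{action}"


def toy_policy(obs: str) -> str:
    return "click" if obs.endswith("start") else "type"


trajectory = rollout(toy_world_model, toy_policy, "start", horizon=3)
for transition in trajectory:
    print(transition)
```

Because `rollout` never touches a browser or terminal, swapping a live environment's step function for the world model's requires no change to the loop, and many such loops can run side by side.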

How does Qwen-AgentWorld get a language model good enough to be a simulator? Three stages, each adding one capability: continual pre-training instills broad world-modeling, supervised fine-tuning activates explicit next-state-prediction reasoning, and a final RL stage with a hybrid reward sharpens simulation fidelity — how faithfully its predicted states match what the real environment would have done. The same trained model then does double duty as a warm-start foundation model, giving downstream agents a head start before any task-specific fine-tuning.

Walk the economics with illustrative numbers (the paper does not publish step-rate figures). Suppose a single rollout in a live web environment takes 30 seconds and you can afford 10 in parallel: that is about 1,200 rollouts an hour. Now suppose the world model predicts a next state in ~50 milliseconds and you run 1,000 in parallel: that is about 72 million predicted steps an hour, or more than a million rollouts an hour if a rollout runs around 50 steps (all illustrative). That gap of roughly three orders of magnitude in experience-per-hour is the whole point: it is what lets an agent be trained across thousands of scenarios that a live-environment budget could never reach. The catch, of course, is fidelity: an agent trained in a simulator only transfers if the simulator's predictions stay close to reality, which is exactly what the final RL stage targets.
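The throughput comparison can be reproduced directly. Every number below is illustrative, and `steps_per_rollout` in particular is an assumption added here so that simulated steps can be compared with live rollouts in the same unit.

```python
SECONDS_PER_HOUR = 3600

# Live environment (illustrative): 30 s per rollout, 10 running in parallel.
live_rollout_s = 30
live_parallel = 10
live_rollouts_per_hour = SECONDS_PER_HOUR // live_rollout_s * live_parallel

# World model (illustrative): 50 ms per predicted step, 1,000 running in parallel.
sim_step_ms = 50
sim_parallel = 1000
sim_steps_per_hour = SECONDS_PER_HOUR * 1000 // sim_step_ms * sim_parallel

# Assumption: a rollout is about 50 steps long, to compare like with like.
steps_per_rollout = 50
sim_rollouts_per_hour = sim_steps_per_hour // steps_per_rollout

print(f"live:      {live_rollouts_per_hour:,} rollouts/hour")
print(f"simulated: {sim_steps_per_hour:,} steps/hour "
      f"(~{sim_rollouts_per_hour:,} rollouts/hour)")
print(f"ratio:     {sim_rollouts_per_hour // live_rollouts_per_hour:,}x")
```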

Training setup	Where each step's "what happens next" comes from	Cost of experience
Coupled to a live environment	the real web page / terminal / game	Slow and hard to parallelize — the environment is the bottleneck
Decoupled world-model simulator (Qwen-AgentWorld)	the model's own next-state prediction (paper)	A forward pass — cheap and massively parallel; fidelity is the risk to manage

Goes deeper in: AI Agents → Agent Loop & State → Inside a Tick

Related explainers

Agent environment survey — symbolic vs neural synthesis — the broader map of how to build an agent's training world; a learned world model is the "neural" end of that split.
EnvFactory — synthesizing tool environments — a different way to manufacture the environments agents train in.
OpenThoughts-Agent — task-source diversity — what you feed an agent in training; Qwen-AgentWorld is about where that training experience comes from.
Role-Agent — dual-role self-play — another case of a model imagining the other side of the interaction to train itself.

FAQ

What is a world model used as a decoupled RL simulator?

A world model is a model that predicts how an environment changes: given the current observation and an action, it returns the likely next state. Qwen-AgentWorld (arXiv 2606.24597, June 2026) trains a language model to do this for agent environments, then uses it as a decoupled simulator — a stand-in for the real environment so reinforcement-learning agents can be trained across thousands of scenarios without waiting on a live web page, terminal, or game. The same model also serves as a foundation model that warms up downstream agents.

Why train an agent in a learned simulator instead of the real environment?

Reinforcement learning needs an enormous number of trial-and-error steps, and when each step runs against a real environment, that environment becomes the bottleneck — it is slow and hard to parallelize. A world model predicts the next state in a single forward pass, so rollouts become cheap and massively parallel, letting agents train across far more scenarios than a live-environment budget allows. The risk is fidelity: the agent only transfers to the real world if the simulator's predictions stay close to reality, which Qwen-AgentWorld's final RL stage targets with a hybrid reward.

How was Qwen-AgentWorld trained?

Through a three-stage pipeline: continual pre-training to instill broad world-modeling capability, supervised fine-tuning to activate explicit next-state-prediction reasoning, and reinforcement learning with a hybrid reward to sharpen simulation fidelity. The team reports it outperforms existing frontier models on AgentWorldBench across seven domains, stated qualitatively rather than with a single headline number.

Originally posted on Learn AI Visually.

OpenAI and Broadcom's Jalapeño, a Custom Inference ASIC: Inference ASIC vs GPU

pueding — Sat, 27 Jun 2026 11:21:30 +0000

What: Jalapeño, announced by OpenAI and Broadcom on June 24, 2026, is OpenAI's first custom LLM-inference ASIC: a reticle-sized compute chiplet paired with HBM, built to run models rather than train them. The idea it makes concrete is an inference-optimized ASIC versus a general-purpose GPU.

Why: At decode time the bottleneck is usually moving data, not doing math, so a chip co-designed around that movement can serve the same tokens using far less power per token — early testing reports substantially better performance-per-watt (final numbers still being measured), which at OpenAI's scale materially changes serving cost.

vs prior: A general-purpose GPU runs anything — training, graphics, every model — and pays in silicon and power for that flexibility; Jalapeño is hard-wired for inference only, trading the GPU's versatility for a shorter, faster path between memory and compute.

Think of it as

A kitchen rebuilt to cook one dish, with the pantry moved beside the stove.

                  THE ONE DISH: LLM inference
                            │
            ┌───────────────┴───────────────┐
            │                               │
     ┌──────▼───────┐                ┌──────▼───────┐
     │ Inference    │                │ General      │
     │ ASIC         │                │ GPU          │
     │ (one dish)   │                │ (whole menu) │
     └──────┬───────┘                └──────┬───────┘
            │                               │
   pantry beside the stove        pantry down the hall
   (HBM next to compute)          (data travels far)
            │                               │
            ▼                               ▼
   ✓ most plates per gas          ✗ pays power for
     (perf-per-watt)                flexibility unused

inference ASIC = a kitchen rebuilt to cook one dish, as fast and cheaply as possible
general-purpose GPU = a restaurant kitchen that can cook anything on the menu
data-movement bottleneck = cooks spending the night carrying ingredients from a far pantry
HBM beside the compute chiplet = moving the pantry right next to the stove
performance-per-watt = more plates served for every unit of gas burned

Quick glossary

ASIC — An Application-Specific Integrated Circuit — silicon built for one kind of job rather than general-purpose computing. Giving up a general processor's flexibility buys speed and energy efficiency on that job. Jalapeño's job is LLM inference.

HBM — High-Bandwidth Memory — stacked DRAM placed physically very close to the compute die so data reaches the math units faster. It is the same fast memory used on high-end GPUs, and it is where the model actually lives during serving.

Inference vs training — Training builds a model's weights; inference runs the finished weights to generate tokens. They stress hardware differently, so a chip can be excellent at one and unable to do the other. Jalapeño is inference-only.

Memory-bandwidth-bound — When a computation spends most of its time waiting for data to arrive from memory rather than doing arithmetic. Single-token decode is the classic example: lots of bytes read, little math per byte.

Tape-out — The moment a chip design is finished and sent to the fab to be manufactured. Jalapeño went from first design to tape-out in roughly nine months, which OpenAI describes as one of the fastest such cycles to date.

Reticle-sized chiplet — The reticle is the largest area a chip-making machine can pattern in a single exposure (around 800 mm²). A reticle-sized compute chiplet is about as large as one die can physically get — Jalapeño pairs one such tile with HBM.

Performance-per-watt — Useful work (tokens generated) divided by the electrical power it costs. At data-center scale this — not peak speed alone — sets the bill, which is why a custom inference chip targets it directly.

The news. On June 24, 2026, OpenAI and Broadcom unveiled Jalapeño, OpenAI's first "Intelligence Processor": a purpose-built ASIC for LLM inference, not a repurposed training accelerator or a general-purpose AI chip. It pairs a single reticle-sized compute chiplet with HBM (not commodity DRAM) to hold high throughput and low latency together, and was co-designed from first design to tape-out in roughly nine months. Engineering samples are already running production workloads in the lab, including GPT-5.3-Codex-Spark, with early testing reporting performance-per-watt "substantially better" than the current state of the art (final numbers still being measured). Initial deployment is targeted for the end of 2026. Read the announcement →

Picture a restaurant kitchen that can cook anything on the menu — pastry, grill, soup, all of it. That flexibility is wonderful, and it is exactly what a general-purpose GPU gives you: thousands of programmable cores that will run any parallel workload you throw at them, from training a model to rendering a game. Jalapeño is that kitchen torn down and rebuilt to cook one dish — LLM inference — and nothing else. The bet is that if you only ever cook one dish, a kitchen shaped around that single dish will cook it faster and far more cheaply than the do-everything kitchen ever could.

So what is the "one dish" actually limited by? Here is the part that surprises people: at decode time, the thing slowing the kitchen down is not the chef's hands — it is the cooks walking ingredients in from a far pantry. When a model generates a token, at small batch sizes it must stream the model's weights out of memory and through the compute units once, while doing comparatively little arithmetic per byte read. That makes single-token decode memory-bandwidth-bound — the roofline tips toward memory, and the math units sit mostly idle, waiting on data. The bottleneck the whole chip is fighting is data movement.

Single-token decode — where the time goes:

moving data  ████████████████████████████████  dominates
computing    █                                  a sliver

The diagram makes the imbalance concrete: in the bandwidth-bound regime, the "moving data" bar dominates and the "computing" bar is a sliver. Jalapeño's answer is the obvious one once you see the problem: move the pantry next to the stove. It pairs that big compute chiplet with HBM kept physically close, so the costly trip between memory and compute is as short and as fast as the silicon allows. OpenAI says the design was derived from its own measurements of how its models behave at serving time, which is what "co-designed" really means here: the chip is shaped around the bottleneck the company actually observed, not a generic one.

Walk the decode math on a single token (illustrative numbers — OpenAI has not published Jalapeño's figures). Say a model holds 100 GB of weights and the accelerator reads them from memory at 4 TB/s. Generating one token must stream those weights through compute roughly once, so the time is about 100 GB ÷ 4 TB/s = 25 ms — and across that 25 ms the arithmetic units are mostly idle, waiting. Now double the effective memory bandwidth and that 25 ms roughly halves; double the raw compute instead and almost nothing changes. That is the whole reason an inference chip is built around feeding the math units, not stacking more of them — and why the headline metric is performance-per-watt, not peak FLOPs.
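The arithmetic above fits in a few lines. This is a minimal roofline-style sketch, not a model of Jalapeño: the 100 GB and 4 TB/s figures are the illustrative ones from the text, while the FLOPs-per-token estimate and the 1 PFLOP/s peak are assumptions added only so that the "double the compute" case has something to double.

```python
def decode_time_ms(weight_bytes, mem_bw_bytes_per_s, flops_per_token, peak_flops_per_s):
    """Roofline-style estimate: one token takes as long as the slower of
    streaming the weights and doing the arithmetic."""
    memory_s = weight_bytes / mem_bw_bytes_per_s
    compute_s = flops_per_token / peak_flops_per_s
    return max(memory_s, compute_s) * 1e3


GB, TB = 1e9, 1e12

WEIGHT_BYTES = 100 * GB   # illustrative figure from the text
MEM_BW = 4 * TB           # illustrative figure from the text
# Assumptions for this sketch only: 1 byte per weight (so 100e9 parameters),
# about 2 FLOPs per parameter per token, and a made-up 1 PFLOP/s peak.
FLOPS_PER_TOKEN = 2 * 100e9
PEAK_FLOPS = 1e15

baseline = decode_time_ms(WEIGHT_BYTES, MEM_BW, FLOPS_PER_TOKEN, PEAK_FLOPS)
double_bw = decode_time_ms(WEIGHT_BYTES, 2 * MEM_BW, FLOPS_PER_TOKEN, PEAK_FLOPS)
double_compute = decode_time_ms(WEIGHT_BYTES, MEM_BW, FLOPS_PER_TOKEN, 2 * PEAK_FLOPS)

print(f"baseline:     {baseline:.1f} ms")
print(f"2x bandwidth: {double_bw:.1f} ms")
print(f"2x compute:   {double_compute:.1f} ms")
```

Under these assumptions the memory term is more than a hundred times larger than the compute term, so only the bandwidth knob moves the result. Larger batches change the picture, because the same weight read is then shared across many tokens.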

None of this means GPUs are going away. The trade Jalapeño makes is real and one-directional: you give up the GPU's ability to train, to switch to a very different kind of workload, to run the whole range of models and tasks a GPU handles. A custom ASIC only pays off when you run one workload at enormous, sustained scale — which is precisely OpenAI's situation, and precisely why a startup serving a thousand requests a day would still reach for a GPU. The interesting signal is not "ASICs beat GPUs"; it is that LLM inference has become a large and stable enough workload to justify burning a chip for it.

Chip	Built for	Flexibility	Where it wins
General-purpose GPU	training + inference + any parallel workload	Highest	The default — runs anything, backed by a mature software ecosystem
Repurposed training accelerator	training, also used to serve	High	Strong throughput, but carries training-only hardware that idles during inference
Inference ASIC (Jalapeño)	LLM inference only	Lowest	Built for top performance-per-watt on its one workload at scale (early results); inference-only, far less flexible

Goes deeper in: GPU & CUDA → Roofline Model → The Bottleneck Question

Related explainers

NVIDIA AI factories — tokens per megawatt — frames serving as a performance-per-watt problem at the datacenter scale Jalapeño is built to win.
AMD Atom — prefill/decode disaggregation — another hardware answer to the fact that prefill and decode stress the chip in opposite ways.
Blackwell on MLPerf 6.0 — strong scaling — the general-purpose GPU side of the same inference-efficiency race.
Jetson Thor — edge Blackwell — purpose-built inference silicon at the opposite end of the scale, the edge.

FAQ

What is an inference ASIC like Jalapeño?

An inference ASIC is an Application-Specific Integrated Circuit, silicon built for one kind of job rather than general-purpose computing, made to run (not train) large language models. OpenAI and Broadcom's Jalapeño, unveiled June 24, 2026, is OpenAI's first such chip: a reticle-sized compute chiplet paired with HBM, co-designed around the data-movement bottleneck of serving models at scale. It gives up a GPU's general-purpose flexibility in exchange for higher performance-per-watt on that single workload (early testing reportedly shows a substantial gain, with final numbers still being measured).

Why build a custom inference chip instead of using GPUs?

At decode time, generating a token is usually memory-bandwidth-bound — the chip spends most of its time moving the model's weights out of memory, not doing arithmetic. A general-purpose GPU pays in silicon and power for flexibility that inference never uses. A chip co-designed around the data-movement bottleneck — a large compute chiplet with HBM kept close — can serve the same tokens at substantially better performance-per-watt in early testing (final numbers still being measured), which at OpenAI's scale materially changes serving cost.

How is Jalapeño different from a GPU?

A GPU is general-purpose: thousands of programmable cores that run training, graphics, and any model. Jalapeño is an ASIC built for LLM inference only — it cannot train and is far less flexible than a general-purpose GPU. That is the trade: it loses the GPU's versatility and gains a shorter, faster path between memory and compute, which is what matters when the bottleneck is data movement rather than raw math. A custom ASIC pays off only when you run one workload at enormous, sustained scale.

Originally posted on Learn AI Visually.

Baidu Unlimited OCR Holds the KV Cache Constant for 40+ Pages: Reference Sliding Window Attention

pueding — Fri, 26 Jun 2026 11:16:58 +0000

What: The Unlimited OCR release (Baidu, arXiv 2606.23050) is a 3-billion-parameter open OCR model whose decoder replaces standard attention with Reference Sliding Window Attention (R-SWA) — the trick that lets it transcribe 40+ pages in a single forward pass.

Why: The KV cache is the memory that grows with every token a model writes; on a 40-page transcription that growth can dominate inference memory and slow generation, so holding the cache constant is what makes one-pass, whole-document OCR practical.

vs prior: A standard decoder makes each new token attend to the entire growing output, so its KV cache grows linearly; R-SWA makes each token attend to the fixed document plus only the last 128 output tokens, so the cache stays a constant size.

Think of it as

A scribe copying a long book — the source kept open on the desk, and only the last line they wrote still in view.

                 COPYING ONE 40-PAGE BOOK
                            │
              ┌─────────────┴─────────────┐
              │                           │
     ┌────────▼────────┐         ┌────────▼────────┐
     │  R-SWA scribe   │         │   plain scribe  │
     │  (disciplined)  │         │  (stacks pages) │
     └────────┬────────┘         └────────┬────────┘
              │                           │
     desk = source book pinned    desk = every page copied
       + last 128 output lines       stacked so far, growing
              │                           │
              ▼                           ▼
     ✓ desk never overflows      ✗ desk overflows by page 40
       KV cache stays constant     KV cache grows linearly

reference tokens = the source book kept open on the desk, always in reach
sliding window = only the last line the scribe glances at to keep continuity
constant KV cache = a desk that never overflows, however long the book
linear KV growth = stacking every page you've copied on the desk until it spills
40+ pages in one pass = copying a whole long book in a single sitting

Quick glossary

KV cache — The stored keys and values for every token already processed, so attention never recomputes them. It is the dominant memory cost of inference, and in a standard decoder it grows with every token generated.

Reference Sliding Window Attention (R-SWA) — Baidu's replacement for every decoder attention layer: each generated token attends to all reference tokens plus only the preceding 128 output tokens, instead of the entire growing sequence. That caps the cache at a constant size.

Reference tokens — The document (visual) tokens the encoder produces from the pages being read. R-SWA keeps these fully visible to every output token — they are the fixed part of the attention window, never slid past.

Sliding window attention — An attention mask where each token sees only the last W tokens, not all of them. It bounds memory but, on its own, would slide off the document a model is reading — which is the problem R-SWA's pinned reference tokens fix.

Forward pass — One run of the model over its inputs to produce outputs. Unlimited OCR transcribes 40+ pages in a single forward pass rather than chunking the document into many smaller passes.

Active parameters — Unlimited OCR has 3 billion total parameters but activates only 500 million per token — a sparse design where most weights stay idle on any given step, keeping compute low.

DeepSeek-OCR encoder — A high-compression visual encoder that turns a page image into a small number of tokens. Pairing it with R-SWA's constant-cache decoder is what lets dozens of pages fit inside a single 32,000-token context.

The news. On June 22, 2026, Baidu released Unlimited OCR, a 3-billion-parameter (500 million active) end-to-end OCR model that transcribes 40+ pages of documents in a single forward pass under a 32,000-token context. It replaces every decoder attention layer with Reference Sliding Window Attention (R-SWA), which holds the KV cache at a constant size throughout decoding instead of letting it grow with output length, and reports new end-to-end state-of-the-art on OmniDocBench v1.5 and v1.6. Weights and code are public under CC-BY 4.0. Read the paper →

Picture a scribe copying a long book by hand. The trick that keeps the desk clear is not memory — it's what stays on the desk. The source book lies open, always in reach, and the scribe glances at only the last line they wrote to keep the handwriting and spelling continuous. They never re-read the hundred pages already copied; those go in a drawer. So the desk holds the same two things on page 1 and on page 200 — the source, and the current line — and it never overflows, no matter how long the book.

A standard transformer decoder is the opposite scribe: it keeps every page it has copied stacked on the desk, because each new token attends to all previous tokens. That stack is the KV cache, and its size grows linearly with the length of the output — which is fine for a one-paragraph answer and ruinous for a 40-page transcription, where the output is enormous. The cache is already the biggest memory cost in inference; let it grow with every page and a long document becomes much harder to fit and serve efficiently.

R-SWA is the disciplined scribe. It replaces every decoder attention layer so that each generated token attends to exactly two things: the full set of reference tokens — the document the encoder produced, kept pinned and fully visible — plus only the preceding 128 output tokens, a short sliding window over what was just written. The document never slides out of view, but the output history does. Because both pieces are bounded — the document is fixed and the window is 128 — the KV cache stays a constant size from the first page to the fortieth. This is the move a plain sliding window can't make on its own: slide a fixed window over everything and you'd lose the document you're reading; R-SWA exempts the reference tokens from the slide.
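The attention pattern described above can be written down as a visibility rule. The sketch below is not Baidu's implementation: the function names and the reference-token count are hypothetical, and it assumes the 128-token window includes the current output position, which is a common sliding-window convention that the text does not spell out.

```python
def rswa_visible_positions(n_ref, out_index, window=128):
    """Positions that output token `out_index` (0-based) may attend to.

    Positions 0..n_ref-1 are reference tokens; output token i sits at n_ref + i.
    Assumption: the window counts the current output position plus the
    ones just before it.
    """
    first_out = max(0, out_index - window + 1)
    reference = list(range(n_ref))                      # pinned, never slid past
    recent_outputs = [n_ref + i for i in range(first_out, out_index + 1)]
    return reference + recent_outputs


def causal_visible_positions(n_ref, out_index):
    """Standard causal attention: everything up to and including this token."""
    return list(range(n_ref + out_index + 1))


N_REF = 1000  # hypothetical number of reference tokens
for i in (0, 127, 128, 11_999):
    rswa = len(rswa_visible_positions(N_REF, i))
    full = len(causal_visible_positions(N_REF, i))
    print(f"output token {i:>6}: R-SWA sees {rswa:>5}, causal sees {full:>6}")
```

Once the output passes 128 tokens, the R-SWA count stops changing while the causal count keeps climbing by one per token.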

Here is why "grows with sequence length" is the term that hurts. KV-cache memory is a product, 2 (keys and values) × layers × KV heads × head dimension × bytes per element × sequence length, and only that last factor moves as the model writes more. R-SWA freezes that factor for the output: instead of the sequence length climbing toward 32,000, the output's contribution is clamped at 128, while the reference tokens add a fixed, encoder-compressed amount. Pair that constant-cache decoder with DeepSeek-OCR's high-compression visual encoder, which compresses each page image into far fewer visual tokens, and dozens of pages fit in one 32,000-token pass.
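To see how much the sequence-length factor matters, here is a rough calculator for the product above. Every model dimension in it is hypothetical (Unlimited OCR's layer count, head count, and precision are not given here), as is the reference-token count; only the 128-token window comes from the text. The leading factor of 2 counts keys and values separately.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, bytes_per_elem, cached_tokens):
    """Bytes of keys plus values held for `cached_tokens` positions."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * cached_tokens


# Hypothetical decoder shape, chosen only to make the arithmetic concrete.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 24, 8, 128, 2   # 2 bytes = FP16/BF16
N_REF = 4_000    # hypothetical reference-token count
WINDOW = 128     # output window from the text


def standard_tokens(n_out):
    """Standard decoder: the cache holds the document plus every output token."""
    return N_REF + n_out


def rswa_tokens(n_out):
    """R-SWA-style cache: the document plus at most WINDOW output tokens."""
    return N_REF + min(n_out, WINDOW)


for n_out in (128, 4_000, 12_000, 28_000):
    std = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, BYTES, standard_tokens(n_out))
    rswa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, BYTES, rswa_tokens(n_out))
    print(f"{n_out:>6} output tokens: standard {std / 2**20:8.1f} MiB, "
          f"R-SWA {rswa / 2**20:8.1f} MiB")
```

The absolute sizes depend entirely on the assumed shape. The point is the pattern: the standard column grows with the output, and the R-SWA column is the same on every row.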

Walk the numbers on one long document. Say transcribing 40 pages produces roughly 12,000 output tokens (illustrative — the real count depends on the document). A standard decoder's cache holds all 12,000, and the 12,000th token attends back across 11,999 predecessors — so both memory and per-token attention work climb with every page. R-SWA caps the output window at 128. So that same final token attends to just the last 128 outputs plus the fixed document tokens, and the output's contribution to the cache stays flat at 128 entries whether the document is 4 pages or 40. That clamp — from a number that grows with the page count to a constant 128 — is the decoder-side reason this can pair with a compressed visual encoder and read 40+ pages in one forward pass.

Attention scheme	Each output token attends to…	KV cache vs output length	Where it earns its keep
Standard causal attention	every previous token	Grows linearly	Accurate, but memory explodes on long outputs
Plain sliding-window attention	only the last ~W tokens (W is a fixed window, model-dependent)	Constant (~W)	Cheap streaming, but it slides off the document being read
R-SWA (Unlimited OCR)	all reference tokens + the last 128 outputs	Constant	Long-document OCR: keeps the full source visible while bounding output memory

The honest caveats. The 128-token window is a default, and a short window is a bet that the next line of a transcription rarely depends on text written thousands of tokens earlier — true for reading a document top-to-bottom, less obviously true for tasks with long-range output structure. And the constant-cache win leans on the encoder doing real work: if the reference tokens themselves were not compressed, "all reference tokens" would be its own large, fixed cost. But the deeper lesson generalizes past OCR — the paper itself notes R-SWA is "a general-purpose parsing attention mechanism… equally applicable to tasks such as ASR, translation, etc." Once you accept that an output token rarely needs the entire output history, the question stops being "how do we shrink the cache" and becomes "what must stay pinned, and how short can the window be" — and the cache stops growing at all.

Goes deeper in: LLM Internals → KV Cache → Memory Cost

Related explainers

FlashMemory — lookahead sparse attention — also bounds the KV cache by having each token attend to fewer keys; R-SWA bounds it by a fixed window plus a pinned reference instead.
SP-KV — self-pruned KV cache — drops low-value KV pairs to shrink the cache; R-SWA never writes the far output history in the first place.
SubQ 1.1 — subquadratic sparse attention — near-linear attention for million-token context; R-SWA is the OCR-decoder cousin of the same "don't attend to everything" idea.
DeepSeek V4 — long-context cost cut to a fraction — attacks the same long-context memory pressure at the architecture level, and shares the optical-compression encoder lineage R-SWA builds on.

FAQ

What is Reference Sliding Window Attention (R-SWA)?

R-SWA is the decoder attention scheme in Baidu's Unlimited OCR (arXiv 2606.23050, June 2026). It replaces every decoder attention layer so that each generated token attends to all reference tokens — the document tokens the encoder produced — plus only the preceding 128 output tokens, rather than the entire growing output sequence. Because the document is fixed and the output window is capped at 128, the KV cache stays a constant size throughout decoding instead of growing linearly with output length. That is what lets the 3-billion-parameter (500 million active) model transcribe 40+ pages in a single 32,000-token forward pass.

Why does holding the KV cache constant matter for OCR?

The KV cache is the dominant memory cost of inference, and in a standard decoder it grows with every token generated. Transcribing a long document produces an enormous output, so a linearly growing cache can outgrow GPU memory, which is why many OCR systems chunk a document into smaller passes and stitch the results. R-SWA caps the output's contribution to the cache at 128 tokens, so memory does not grow with page count, and the whole document can be read in one forward pass. Baidu reports new end-to-end state-of-the-art on OmniDocBench v1.5 and v1.6 with this design.

How is R-SWA different from normal sliding-window attention?

Plain sliding-window attention (as in some streaming models) lets each token see only the last W tokens of everything — which bounds memory but would slide the document a model is reading out of view. R-SWA splits the window in two: the reference (document) tokens are pinned and stay fully visible to every output token, while only the output history is subject to the 128-token slide. So the model keeps the entire source in sight while still bounding the part of the cache that would otherwise grow — the constant-size cache without losing the document.

Originally posted on Learn AI Visually.

GLM-5.2 Becomes the Top Open-Weights Model: Active vs Total Parameters

pueding — Tue, 23 Jun 2026 11:17:24 +0000

What: The news anchor is GLM-5.2, Zhipu AI's open-weights model that just topped the Artificial Analysis Intelligence Index; the concept it makes concrete is active vs total parameters — the two numbers in its "744B total / 40B active" spec.

Why: Those two numbers price two different things: total sets the memory footprint and the GPU you need, while active sets the compute and bandwidth you pay per token. Reading both tells you what a model release actually costs to run.

vs prior: The old habit was to quote one parameter count — which assumes a dense model where every weight fires on every token, so active equals total. A sparse Mixture-of-Experts splits that into two, and the gap between them is the design lever.

Think of it as

A big engine that fires only a few of its cylinders at a time.

           GLM-5.2 ENGINE: 744 CYLINDERS BUILT IN
                           │
      ┌────────────────────┴─────────────────────┐
      │  the whole engine block is hauled along   │
      │  . . . . . . . . . . . . . . . . . . . .  │
      │  . . # . . . # . . # . . . # . . . # . .  │
      │  but only ~40 cylinders ( # ) fire now    │
      └────────────────────┬─────────────────────┘
                           │
          ┌────────────────┴────────────────┐
          ▼                                  ▼
   MEMORY = the whole block         COMPUTE = firing only
   all 744B resident, ~744 GB       ~40B active, ~80 GFLOP

total parameters (744B) = every cylinder built into the engine block — the full capacity
active parameters (40B) = the cylinders actually firing on this stroke — what burns fuel right now
router = the engine controller deciding which cylinders fire for each token
memory footprint = the whole engine block you still haul around, firing or idle
sparsity ratio = how few cylinders fire (40) versus how many exist (744)

Quick glossary

Total parameters — Every weight the model contains — here 744 billion. The total sets the model's knowledge capacity and, critically, the memory footprint: all 744 B must be loaded into GPU memory whether or not they are used on a given token.

Active parameters — The subset of weights actually read and multiplied for a single token — here ~40 billion. In a dense model active equals total; in a sparse model active is a fraction. Per-token compute and bandwidth track the active count, not the total.

Mixture-of-Experts (MoE) — A transformer variant that replaces each dense feed-forward network with many smaller "expert" sub-networks, plus a router that activates only a handful per token. It decouples total capacity from per-token cost.

Router — The small learned network inside an MoE layer that assigns each token to its top-k experts. It is what makes "which weights are active" change from token to token.

Sparsity ratio — The fraction of total parameters that are active per token. GLM-5.2's 40 B of 744 B is roughly 5% — about one weight in eighteen. A lower ratio means more capacity sits idle on any given token.

Dense model — A model with no routing: every weight participates in every token, so active equals total. Per-token FLOPs scale linearly with the full parameter count.

FLOP — A floating-point operation — one multiply or add. A useful rule of thumb: a forward pass costs about 2 × (active parameters) FLOPs per token.

Artificial Analysis Intelligence Index — A third-party benchmark that aggregates many evals (reasoning, coding, knowledge) into a single comparable score. GLM-5.2 scored 51 on v4.1, leading all open-weights models.

The news. On June 17, 2026, Artificial Analysis reported that Zhipu AI's GLM-5.2 became the leading open-weights model on its Intelligence Index v4.1, scoring 51 — ahead of MiniMax-M3 and DeepSeek V4 Pro, both at 44. The model carries 744 B total parameters but activates only ~40 B per token, ships under an MIT license, and keeps GLM-5.1's architecture while showing particular strength on scientific reasoning. Read the report →

Picture a very large engine. It has hundreds of cylinders machined into the block, but at any instant only a handful are firing — and which few are firing changes constantly as a controller picks the right ones for the moment. The size of the engine is one number; the cylinders burning fuel right now are a completely different one. That is exactly the gap GLM-5.2 puts on its spec sheet: 744 billion cylinders built in, but only about 40 billion firing per token. The first number is the engine you have to build and haul; the second is the fuel you actually burn each stroke.

Model releases used to come with one headline number because the models were dense: every weight fired on every token, so the engine you built and the engine you ran were the same. A Mixture-of-Experts model breaks that equality on purpose. Most of a transformer's parameters live in its feed-forward layers (roughly two-thirds of them), so MoE replaces that one big dense feed-forward block with many smaller expert blocks and a router that lights up only the few each token needs. The 744 B stays resident, but the per-token bill tracks the ~40 B that fire.
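A router that "lights up only the few" is usually some form of top-k selection. The sketch below shows that generic idea in plain Python. It is not GLM-5.2's router: the expert count, the value of k, and the scores are all made up for illustration.

```python
import math


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and renormalise their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected experts only (subtract the max for stability).
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for one token over 8 experts; k=2 is also assumed.
logits = [0.1, 2.3, -1.0, 0.7, 1.9, -0.4, 0.0, 0.5]
chosen = top_k_route(logits, k=2)
for expert, weight in chosen:
    print(f"expert {expert}: weight {weight:.3f}")
print(f"experts run for this token: {len(chosen)} of {len(logits)}")
```

Only the chosen experts' feed-forward weights are read and multiplied for this token. The other six stay in memory untouched, which is the active-versus-total gap in miniature.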

So the two numbers price two genuinely different resources. The total parameter count sets your memory footprint — every one of the 744 B weights has to sit in GPU memory, idle or not, which is why running an open-weights model this large means a multi-GPU node and a good reason to shrink the weights with quantization. The active count sets your per-token compute and bandwidth — and at ~40 B active, GLM-5.2 computes each token at roughly the cost of a 40 B model even though it holds 744 B parameters of capacity. The notable part of this release is not just that an open-weights model topped the leaderboard; it is that it did so at a ~5% sparsity ratio — about one weight in eighteen — pushing the frontier on a very lean per-token budget.

Per token, you pay…	If GLM-5.2 were dense (744B active)	GLM-5.2 as shipped (744B-total, ~40B active)
Active parameters	744B (all of them)	~40B
Compute per token	~1.49 TFLOP (illustrative, ≈2× active-params rule)	~80 GFLOP (illustrative, ≈2× active-params rule)
Weights held in memory	~744 GB (1 byte/param at FP8)	~744 GB (1 byte/param at FP8)
Intelligence Index v4.1	—	51 (leading open weights)

Work one token through the numbers to see why the gap matters. Using the rule that a forward pass costs about 2 × (active parameters) FLOPs, a dense 744 B model would burn 2 × 744 B ≈ 1.49 TFLOP every token; GLM-5.2, firing only ~40 B, burns 2 × 40 B ≈ 80 GFLOP — roughly 18× less compute per token (illustrative — derived from the parameter counts, not measured). But both versions still have to keep all 744 B weights resident — about 744 GB at one byte each — so the memory bill is identical. That is the trade in parameter-count terms: MoE is designed to give you the per-token compute of a small model and the capacity of a large one — while still charging you the memory of the large one. (Real systems also pay routing overhead and run dense attention layers, so the picture is more nuanced than the two counts alone.) Whether the trade is worth it depends on what binds you — if memory is the constraint, a smaller dense model can win, which is the flip side explored in the related Granite explainer below.
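The same back-of-envelope numbers, as a script you can rerun with other parameter counts. It uses only the 2 × (active parameters) rule of thumb and the counts quoted above, so it inherits their limits: routing overhead and attention cost are ignored, and the FP16 line is an extra data point added here, not a figure from the release.

```python
def flops_per_token(active_params):
    """Rule of thumb from the text: about 2 FLOPs per active parameter."""
    return 2 * active_params


def weight_memory_gb(total_params, bytes_per_param=1):
    """Resident weight memory; 1 byte/param corresponds to FP8."""
    return total_params * bytes_per_param / 1e9


TOTAL = 744e9    # total parameters (from the text)
ACTIVE = 40e9    # approximate active parameters per token (from the text)

dense_flops = flops_per_token(TOTAL)   # hypothetical dense version
moe_flops = flops_per_token(ACTIVE)    # the model as shipped

print(f"dense compute/token: {dense_flops / 1e12:.2f} TFLOP")
print(f"MoE compute/token:   {moe_flops / 1e9:.0f} GFLOP")
print(f"compute ratio:       {dense_flops / moe_flops:.1f}x")
print(f"active share:        {ACTIVE / TOTAL:.1%}")
print(f"weights at FP8:      {weight_memory_gb(TOTAL):.0f} GB")
print(f"weights at FP16:     {weight_memory_gb(TOTAL, 2):.0f} GB")
```

Changing ACTIVE moves the compute lines and leaves the memory lines alone. Changing bytes_per_param does the opposite, which is why quantization and sparsity are separate levers.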

Goes deeper in: LLM Internals → Transformer Block → The Feed-Forward Network and LLM Internals → Quantization → Why Quantize

Related explainers

IBM Granite 4.1 — 8B dense matches the prior 32B MoE — the serving-cost flip side: when memory is what binds, a smaller dense model can beat a larger MoE
MobileMoE — DRAM-aware MoE scaling — what the active-vs-total gap buys you on memory-constrained devices
SoftMoE — differentiable soft top-k routing — how the router actually decides which experts fire each token

FAQ

What is the difference between active and total parameters?

Total parameters are every weight the model contains — they set its knowledge capacity and its memory footprint, because all of them must be loaded into GPU memory. Active parameters are the subset actually read and multiplied for a single token; they set the per-token compute and bandwidth. In a dense model the two are equal; in a sparse Mixture-of-Experts model like GLM-5.2, active (~40B) is a small fraction of total (744B).

Why does GLM-5.2 list two parameter counts (744B total, 40B active)?

Because it is a Mixture-of-Experts model. Its feed-forward layers are split into many expert sub-networks, and a router activates only a handful per token — so the model holds 744B weights but fires only ~40B on any given token. The total predicts the memory and GPU you need; the active count predicts how fast and how cheaply it runs per token. A single number would hide that the two costs have decoupled.

Does a lower active-parameter count make a model cheaper to run?

It makes the per-token compute and bandwidth cheaper — GLM-5.2 computes a token at roughly the cost of a 40B model. But it does not lower the memory bill: all 744B total parameters still have to fit in GPU memory whether or not they fire. So a very sparse model is cheap on compute and expensive on memory, which is why deployments often pair it with quantization and multi-GPU nodes.

Originally posted on Learn AI Visually.

Agent Leaderboards Mislead Under Distribution Shift (IBM): Predictive Validity

pueding — Mon, 22 Jun 2026 11:17:42 +0000

What: A new IBM paper, "Beyond Static Leaderboards", argues that the way we rank AI agents is broken: a leaderboard collapses each agent into one aggregate score and sorts by it. The fix it proposes is predictive validity — the rank correlation between a benchmark's ranking and the ranking you'd see out-of-distribution.

Why: A single leaderboard number is a weak signal for real deployment. The whole point of an eval is to tell you which agent to ship — and if the benchmark's #1 isn't your deployment's #1, the ranking you trusted pointed at the wrong agent. This is the core lesson of Evals & Diagnostics and Production Evals.

vs prior: Where the old way ranks agents by their aggregate mean score on one benchmark and trusts that order, predictive validity asks a sharper question: does that order survive a distribution shift? IBM's finding is blunt — aggregate-score rankings do not transfer out-of-distribution.

Think of it as

Ranking sprinters by their indoor bests, then racing them outdoors in the wind.

          SAME SPRINTERS, RANKED TWO WAYS
                         │
          ┌──────────────┴──────────────┐
          │                             │
   ┌──────▼──────┐               ┌──────▼──────┐
   │  INDOORS    │               │  OUTDOORS   │
   │  (no wind)  │               │  (windy)    │
   │   1. A      │               │   1. B      │
   │   2. B      │               │   2. C      │
   │   3. C      │               │   3. A      │
   └──────┬──────┘               └──────┬──────┘
          │                             │
   the leaderboard               the deployment
          └──────────────┬──────────────┘
                         ▼
       predictive validity = does the indoor
       order survive once the wind hits?

sprinter = a model competing on the leaderboard
indoor personal-best ranking = the aggregate-score leaderboard, measured in one controlled setting
racing outdoors in the wind = deployment under a shifted, out-of-distribution workload
the podium reshuffling = rank instability when the conditions change
predictive validity = how well the indoor ranking predicts who actually wins outdoors

Quick glossary

Predictive validity — Borrowed from measurement theory: does a test's score predict the real-world outcome it claims to measure? For agent evals, IBM defines it as the rank correlation between in-sample and out-of-distribution results — not the raw score, but whether the ordering of agents holds up when conditions change.

Aggregate score — The single number a leaderboard reports per agent — typically a mean across many tasks. It is easy to sort by, but it throws away the variance that tells you whether the ranking is stable. See AI Agents → Evals & Diagnostics.

In-sample vs out-of-distribution (OOD) — In-sample = the conditions the benchmark actually measured. Out-of-distribution = anything different in deployment — new task types, a new orchestration, a shifted input mix. The gap between them is where leaderboards quietly fail; production teams watch it as drift.

Rank correlation — A measure of how well two rankings of the same items agree — +1 is identical order, 0 is unrelated, −1 is reversed. Predictive validity is this number, computed between the in-sample and OOD rankings.

Rank instability — When a small change in conditions reshuffles the leaderboard — the agent ranked first in-sample lands third out-of-distribution. IBM points to public-to-hidden competition retrospectives as direct evidence this happens.

Falsifiable criterion — A pass/fail test you can actually fail. IBM frames predictive validity through three falsifiable out-of-distribution criteria, so a benchmark's claim to validity can be checked and rejected — not just asserted.

MCP-based agent benchmark — A benchmark built on the Model Context Protocol tool interface, so the same agent harness can be re-implemented many ways. IBM ran fourteen parallel implementations of one such industrial-agent benchmark.

The news. On June 18, 2026, an IBM-led team (Dhaval Patel et al.) posted Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents to arXiv. They ran fourteen parallel implementation studies of an MCP-based industrial-agent benchmark — varying asset classes, orchestrations, retrieval strategies, and reasoning modes — and aggregated seven prior agent benchmarks. The headline: rankings derived from aggregate scores do not transfer to out-of-distribution settings. In place of one number, they propose ranking benchmark configurations by predictive validity: the correlation between in-sample and out-of-distribution rank, structured as a twelve-tier measurement apparatus with three falsifiable criteria. Read the paper →

Picture timing a field of sprinters indoors, on a fast track with no wind, and printing the ranking from their personal bests. On paper you now know exactly who is fastest — first, second, third, in order. Then race day comes, outdoors, into a gusting headwind, and the podium reshuffles: the indoor record-holder fades to third, and someone who was never the fastest indoors wins the race that counts. The indoor clock wasn't lying — it measured real speed in one setting. It just had no way to tell you whether that order would survive the wind. The sprinter is an agent, the indoor ranking is an aggregate-score leaderboard, the outdoor race is deployment, and the question the indoor clock can't answer is predictive validity.

A leaderboard does exactly what the indoor clock does. It runs each agent over a fixed battery of tasks, averages the results into one aggregate score, and sorts. That sort is the product everyone consumes — the tweet, the ranking row, the "best open agent" headline. But the average is measured under one distribution of tasks, and IBM's central result is that the ordering it produces does not hold once the distribution moves. When they built the same industrial-agent benchmark fourteen different ways — swapping orchestrations, retrieval strategies, and reasoning modes — the rankings disagreed with each other, and public-to-hidden competition retrospectives showed the same rank instability in the wild.

The deeper move is to stop treating the benchmark as a scoreboard and start treating it as a measurement instrument — and to ask of any instrument the measurement-theory question: does its reading predict the thing you actually care about? IBM operationalizes that as predictive validity: the rank correlation between a configuration's in-sample ranking and its out-of-distribution ranking — a number near +1 means the leaderboard predicts reality, a number near 0 means it doesn't. They wrap it in a twelve-tier apparatus with three falsifiable criteria, so a benchmark's claim to validity is something you can test and reject, not just assert. In production terms, it is the difference between trusting an offline leaderboard and watching how rankings hold under shifted, online traffic.
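A rank correlation is easy to compute by hand. The sketch below uses Spearman's coefficient as one reasonable choice; which coefficient the paper uses is not stated here, so treat that as an assumption. The agent names and scores are hypothetical, and ties are not handled.

```python
def ranks(scores):
    """Rank 1 = best score. Assumes no ties, to keep the sketch short."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {agent: position for position, agent in enumerate(order, start=1)}


def spearman_rho(in_sample, out_of_dist):
    """Spearman rank correlation for two score dicts over the same agents."""
    r_in, r_out = ranks(in_sample), ranks(out_of_dist)
    n = len(r_in)
    d_squared = sum((r_in[a] - r_out[a]) ** 2 for a in r_in)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical scores for four agents under three conditions.
benchmark = {"w": 80, "x": 75, "y": 70, "z": 65}
same_order = {"w": 60, "x": 58, "y": 51, "z": 40}
reversed_order = {"w": 40, "x": 51, "y": 58, "z": 60}

print(spearman_rho(benchmark, same_order))      # 1.0
print(spearman_rho(benchmark, reversed_order))  # -1.0
```

Note that the scores in `same_order` are all lower than on the benchmark, yet the correlation is still +1. The coefficient looks only at the ordering, which is exactly the property a "which agent do I ship" decision depends on.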

How you read the benchmark	What it reports	What it misses
Aggregate score (today's leaderboard)	one mean number per agent → a sorted ranking	whether that ranking survives any change in conditions
Score + confidence interval	the mean plus its in-sample noise	still in-sample only — no view of the out-of-distribution shift
Predictive validity (IBM)	rank correlation between in-sample and out-of-distribution rankings	— (directly tests transfer; ~14 implementations, 12-tier apparatus, 3 falsifiable criteria)

Where the ranking breaks

Here is why an unstable ranking is worse than a noisy one. Take an illustrative slice of three agents (call them A, B, C) that an aggregate-score leaderboard ranks A > B > C by a hair: scores of 71, 70, 68. The gaps are tiny, but the leaderboard reports a confident order, and a team reading it ships A. Now shift the distribution (a new asset class, a different orchestration) and re-score: A drops to 64, while B slips only to 69 and C only to 67. The out-of-distribution order is now B > C > A: the in-sample leader finishes last, and A and C have traded places. The rank correlation between the two orderings is negative (Spearman's ρ = −0.5 for these three ranks), so the leaderboard didn't just lose precision, it pointed at the wrong agent. (Only the 14 implementations, 12-tier apparatus, and 3 falsifiable criteria come from the paper; the A/B/C scores are illustrative.) A single aggregate number with a tidy sort hid the one fact that mattered: that order was never stable enough to ship on.
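To check the A/B/C arithmetic by hand, here is a minimal sketch in plain Python. The post does not say which rank-correlation statistic the paper uses, so the sketch computes two common ones (Spearman's rho and Kendall's tau) as an assumption; the function names are illustrative and the scores are the illustrative ones above.

```python
from itertools import combinations

def ranks(scores):
    """Map each agent to its rank (1 = best) by descending score; assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {agent: position for position, agent in enumerate(ordered, start=1)}

def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation for two tie-free rankings of the same agents."""
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    n = len(rank_a)
    d_squared = sum((rank_a[k] - rank_b[k]) ** 2 for k in rank_a)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

def kendall_tau(scores_a, scores_b):
    """Kendall tau: (concordant - discordant) pairs over all pairs; assumes no ties."""
    concordant = discordant = 0
    for x, y in combinations(scores_a, 2):
        # Positive product: both benchmarks order this pair the same way.
        if (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

in_sample = {"A": 71, "B": 70, "C": 68}   # illustrative scores from the text
shifted = {"A": 64, "B": 69, "C": 67}     # illustrative out-of-distribution scores

rho = spearman_rho(in_sample, shifted)
tau = kendall_tau(in_sample, shifted)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
# prints: Spearman rho = -0.50, Kendall tau = -0.33
```

Both statistics come out negative, which is the point of the example: the in-sample order is not merely uninformative about the shifted order, it leans the wrong way. With only three agents the values are coarse, so a real check would use many more configurations.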

Goes deeper in: AI Agents → Evals & Diagnostics → Pass/Fail vs Score

Related explainers

This explainer stands alone from its news item (one concept), so the closest neighbors are other explainers on how a single evaluation number can quietly mislead:

WeaveBench — trajectory-aware vs outcome-only grading — the sibling failure: WeaveBench shows a single run's grade can be inflated; predictive validity shows a whole ranking can be invalid
FutureSim — harness-level agent eval — evaluating the agent's process rather than a single final number, the same "one score hides the truth" theme
Effective Feedback Compute (EFC) — another result that a headline number (raw compute) is the wrong predictor of agent success

FAQ

What is predictive validity for AI agent evals?

Predictive validity is a measurement-theory idea IBM applies to agent leaderboards: instead of ranking agents by their aggregate score, you measure the rank correlation between a benchmark's in-sample ranking and the ranking it produces out-of-distribution. A high correlation means the leaderboard predicts real-world ordering; a low correlation means the score is a poor guide to which agent to actually deploy.

Why are aggregate-score agent leaderboards misleading?

Because they collapse a whole agent into one mean number measured under a single distribution of tasks, then sort by it. IBM's "Beyond Static Leaderboards" ran the same industrial-agent benchmark fourteen ways and found the rankings disagreed, and public-to-hidden competition retrospectives show the same rank instability. The sorted order looks authoritative but does not transfer once conditions shift, so it is a weak signal for deciding what to ship.

How does predictive validity relate to distribution shift?

Distribution shift is exactly the condition predictive validity tests. In-sample means the tasks the benchmark measured; out-of-distribution means anything different in deployment — new task types, a new orchestration, a shifted input mix. Predictive validity asks whether the agent ranking holds across that gap, and IBM structures it as a twelve-tier apparatus with three falsifiable out-of-distribution criteria so the claim can be checked rather than assumed.

Originally posted on Learn AI Visually.

AMD ATOM + ATOMesh: Prefill/decode Disaggregation on ROCm

pueding — Sun, 21 Jun 2026 11:20:15 +0000

What: AMD shipped ATOM + ATOMesh, a ROCm-native LLM serving stack whose headline trick is prefill/decode disaggregation — splitting the two phases of inference onto separate pools of GPUs instead of crowding them onto one.

Why: Prefill and decode have opposite bottlenecks — prefill is compute-bound, decode is memory-bandwidth-bound — so running them on the same worker wastes hardware and lets one long prompt stall everyone else's token stream.

vs prior: A co-located server (vanilla single-pool vLLM) interleaves prefill and decode on the same GPUs; disaggregation runs each on its own pool tuned for its bottleneck, paying for it by shipping the KV cache across the interconnect between them.

Think of it as

A restaurant kitchen that splits the prep station from the plating line.

   ORDER (the prompt)
        │
        ▼
  ┌──────────────┐    KV cart      ┌──────────────┐
  │ PREP STATION │   down the      │ PLATING LINE │
  │  (prefill)   │═══ hallway ════▶│   (decode)   │──▶ tokens
  │ compute-heavy│  (KV transfer)  │ memory-bound │
  └──────────────┘                 └──────────────┘
   chops a whole                    plates dishes
   order at once                    one at a time

prefill = the prep cook chopping a whole order's ingredients in one compute-heavy burst
decode = the plating cook building dishes one at a time, back to the fridge each plate
KV cache = the fridge of prepped ingredients every plate reaches into
disaggregation = giving prep and plating their own stations and staff, each tuned to its job
KV-cache transfer = wheeling the prep cart down the hallway from prep to the plating line
KV-aware scheduling = sending each order to the line whose fridge already holds its prep

Quick glossary

Prefill — The first phase of inference: the model reads your entire prompt in parallel in one pass, building the KV cache. It does a lot of math per byte of memory it touches, so it is compute-bound.

Decode — The second phase: the model generates one output token at a time, and each step must read the whole KV cache plus all the weights to produce that single token. It moves a lot of memory for little math, so it is memory-bandwidth-bound.

KV cache — The stored keys and values for every token already processed, so the model never recomputes them. It is the dominant memory cost of inference — and, in a disaggregated stack, the thing that has to travel from the prefill pool to the decode pool.

Compute-bound vs memory-bound — The roofline distinction: a job is compute-bound when the GPU's math units are the limit, and memory-bound when memory bandwidth is. Prefill and decode sit on opposite sides of that line, which is the whole reason to split them.

Disaggregation — Running prefill and decode on separate pools of workers instead of one shared pool, so each pool can be sized and scheduled for its own bottleneck.

KV-aware scheduling — A scheduler that routes a request with knowledge of where its KV-cache blocks already live — so it can reuse a cached prefix (prefix caching) or steer a request to the worker that avoids a transfer.

ROCm / AITER / MORI / Instinct — ROCm is AMD's CUDA-equivalent software stack and Instinct its datacenter GPU line. AITER supplies the optimized ROCm kernels (the analogue of CUDA kernels), while MORI handles the distributed, RDMA-style communication for tensor/expert parallelism (AMD's own collective library, RCCL, is the closer NCCL analogue).

The news. On June 16, 2026, AMD published ATOM + ATOMesh, a paired ROCm-native LLM serving stack for Instinct GPUs, shipped as an early (alpha) preview. ATOM is an AITER-optimized inference engine (kernel acceleration via AITER, distributed communication via MORI); ATOMesh is the orchestration layer on top — it exposes an OpenAI-compatible API, manages multiple engine backends, and applies prefill/decode disaggregation and KV-aware scheduling, evaluated serving DeepSeek-V4-Pro on Instinct hardware. In AMD's framing it deliberately mirrors the vLLM/SGLang design — the same serving primitives, now on AMD silicon.

Picture a restaurant kitchen where one cook does everything. First they prep an order — chopping, slicing, mixing every ingredient the dish needs, all at once, in a furious burst of knife work. Then they plate it — assembling the dish one component at a time, walking back to the fridge for each piece. Prep is a flat-out, hands-busy job; plating is a lot of trips to the fridge and not much knife work. Cram both onto one cook and they fight: a big prep order makes every waiting plate go cold, and during the slow plating trips the knives sit idle. That single overloaded cook is one GPU running an LLM, and the two jobs are prefill and decode.

When a model answers, it first runs prefill: it reads your entire prompt in one parallel pass, doing dense matrix math and filling the KV cache. Then it runs decode: it emits output one token per step, and every step drags the whole KV cache and all the weights out of memory to produce that single token. Prefill is compute-bound — limited by the GPU's math units — while decode is memory-bandwidth-bound, limited by how fast it can stream the cache out of memory. They are the prep cook and the plating cook: opposite appetites, forced to share one station.
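The compute-bound versus memory-bound split can be made concrete with a back-of-the-envelope roofline check. The sketch below is a simplification under stated assumptions: it looks at a single dense layer with 16-bit values, ignores attention, and uses a hypothetical accelerator (1 PFLOP/s, 4 TB/s) that is not any specific AMD or NVIDIA part. The 8,192 layer width is likewise just a representative size.

```python
def arithmetic_intensity(tokens, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte moved for one dense layer applied to `tokens` tokens at once."""
    flops = 2 * tokens * d_in * d_out                  # one multiply and one add per weight per token
    weight_bytes = d_in * d_out * bytes_per_value      # weights are read once per pass
    activation_bytes = tokens * (d_in + d_out) * bytes_per_value  # inputs read, outputs written
    return flops / (weight_bytes + activation_bytes)

# Hypothetical accelerator, chosen only to make the arithmetic concrete.
PEAK_FLOPS = 1.0e15       # assumed: 1 PFLOP/s of 16-bit math
PEAK_BANDWIDTH = 4.0e12   # assumed: 4 TB/s of memory bandwidth
RIDGE_POINT = PEAK_FLOPS / PEAK_BANDWIDTH   # FLOPs per byte needed to keep the math units busy

def bound(intensity):
    """Classify a workload against the roofline ridge point."""
    return "compute-bound" if intensity >= RIDGE_POINT else "memory-bound"

prefill = arithmetic_intensity(tokens=2048, d_in=8192, d_out=8192)  # whole prompt in one pass
decode = arithmetic_intensity(tokens=1, d_in=8192, d_out=8192)      # one token per step

print(f"ridge point: {RIDGE_POINT:.0f} FLOPs/byte")
print(f"prefill: {prefill:.0f} FLOPs/byte -> {bound(prefill)}")
print(f"decode:  {decode:.2f} FLOPs/byte -> {bound(decode)}")
```

Under these assumptions prefill lands well above the ridge point and single-request decode lands far below it, because the same weights are read from memory whether they serve 2,048 tokens or one. Batching many decode requests together raises the decode intensity, which is why decode pools favor large batches.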

That opposite-appetites problem is why a single shared worker wastes hardware. Pack prefill and decode together and a long prompt's prefill burst blocks the queue of decode steps behind it — a head-of-line stall — while the memory-bound decodes leave the expensive compute units sitting idle. You can never shape one machine to be right for both jobs at once.

Disaggregation is the fix: give prep and plating their own stations. Prefill runs on one pool of GPUs, scheduled for compute-heavy bursts; decode runs on a separate pool, scheduled for steady memory-bound streaming with large batches. When a request finishes prefill, the prefill worker hands its KV cache across the interconnect to a decode worker, which then streams the tokens out. Each pool is now sized and tuned for the one bottleneck it actually has — and AMD's ATOMesh is the orchestration layer that does exactly this routing on ROCm. This is the same playbook vLLM and SGLang made standard; ATOM + ATOMesh shows AMD building a ROCm-native path to it.

But disaggregation is not free, and the bill comes due at the handoff. After prefill, the KV cache has to physically travel from the prefill pool to the decode pool. For a 70B-class model with a 2,048-token prompt, that cache is 2 (keys and values) × 80 layers × 8 KV-heads × 128 dim × 2,048 tokens × 2 B per 16-bit value ≈ 0.67 GB (illustrative, Llama-3.1-70B with grouped-query attention). Move it over PCIe 4.0 and you pay roughly 21 ms; over NVLink, about 0.75 ms, a ~28× gap. All three figures are illustrative: the size comes from the formula, the times are that size divided by each interconnect's bandwidth, and none were measured on ATOM. That gap is why disaggregated stacks live or die by their interconnect, and why KV-aware scheduling tries to dodge the transfer entirely, steering a request to a worker that already holds its prefix.
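The handoff cost above is simple enough to reproduce. This sketch recomputes the cache size from the same formula and divides by two nominal bandwidths. The 32 GB/s and 900 GB/s figures are assumptions picked because they reproduce the illustrative times in the text; real links deliver less than their nominal rate, and nothing here is measured on ATOM.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Size of one request's KV cache; the leading 2 counts keys plus values."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

def transfer_ms(num_bytes, bandwidth_gb_per_s):
    """Idealized transfer time in milliseconds (no latency or protocol overhead)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3

# Llama-3.1-70B-style shape with grouped-query attention, as in the text.
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=2048)

PCIE4_GB_PER_S = 32     # assumed nominal PCIe 4.0 x16 bandwidth
NVLINK_GB_PER_S = 900   # assumed nominal NVLink bandwidth

pcie_ms = transfer_ms(size, PCIE4_GB_PER_S)
nvlink_ms = transfer_ms(size, NVLINK_GB_PER_S)

print(f"KV cache: {size / 1e9:.2f} GB")
print(f"PCIe 4.0: {pcie_ms:.1f} ms, NVLink: {nvlink_ms:.2f} ms, gap: {pcie_ms / nvlink_ms:.0f}x")
# prints: KV cache: 0.67 GB
# prints: PCIe 4.0: 21.0 ms, NVLink: 0.75 ms, gap: 28x
```

The size scales linearly with prompt length, so a 32K-token prompt would carry a cache sixteen times larger and pay sixteen times the transfer on the same link.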

Phase	What it processes	Bottleneck (roofline)	What it wants from the hardware
Prefill	The whole prompt, in one parallel pass	Compute-bound — high arithmetic intensity	Raw matmul throughput; fewer, fatter GPUs
Decode	One output token per step, reading the full KV cache	Memory-bandwidth-bound — low arithmetic intensity	Memory bandwidth and large batches to amortize the weight reads
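The KV-aware scheduling idea mentioned earlier (send each request to the worker that already holds its prefix) can be sketched as a small routing function. This is a hypothetical illustration, not ATOMesh's scheduler: the post does not describe its internals, and the `Worker` class, its fields, and the tie-breaking rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    """Hypothetical view of one worker as seen by the router."""
    name: str
    active_requests: int = 0
    cached_prefixes: list = field(default_factory=list)  # token-id tuples already in its KV cache

def shared_prefix_len(cached, prompt):
    """Number of leading tokens two sequences have in common."""
    count = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        count += 1
    return count

def best_cached_prefix(worker, prompt):
    """Longest prefix of `prompt` that this worker already holds (0 if none)."""
    return max((shared_prefix_len(p, prompt) for p in worker.cached_prefixes), default=0)

def route(workers, prompt):
    """Prefer the longest cached prefix; break ties by sending to the lighter-loaded worker."""
    return max(workers, key=lambda w: (best_cached_prefix(w, prompt), -w.active_requests))

workers = [
    Worker("worker-0", active_requests=3, cached_prefixes=[(1, 2, 3, 4)]),
    Worker("worker-1", active_requests=1, cached_prefixes=[(9, 9)]),
    Worker("worker-2", active_requests=0),
]

print(route(workers, (1, 2, 3, 4, 5)).name)  # worker-0: holds a 4-token prefix despite its load
print(route(workers, (7, 8)).name)           # worker-2: nobody has the prefix, so least loaded wins
```

A production scheduler would weigh the saved transfer or recompute against queueing delay rather than always favoring the cache hit; the strict ordering here just keeps the idea visible.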

The honest caveat: ATOM + ATOMesh ship as an early (alpha) preview, and AMD's post describes the mechanism, not head-to-head numbers. It reports that ATOMesh mirrors the vLLM/SGLang design and was evaluated serving DeepSeek-V4-Pro, but the post text gives no usable throughput or latency figures, so treat performance as not yet quantified and check the source for benchmarks. The KV-transfer figures above are illustrative, sized to a representative model rather than measured on ATOM. But the durable lesson stands: once you see that prefill and decode sit on opposite sides of the roofline, "one GPU does both" stops looking efficient, and a serving stack's real job is to split the two phases and move the KV cache between them cheaply.

Goes deeper in: LLM Serving → Prefill/Decode Disaggregation → Disaggregation

Related explainers

SGLang v0.5.12 — TokenSpeed MLA backend — SGLang is one half of the vLLM/SGLang design ATOMesh mirrors; this is the engine-level optimization that lives inside a pool like ATOM's.
HuggingFace — Async continuous batching — the other lever for keeping decode workers busy; disaggregation and continuous batching are complementary ways to fight the same memory-bound decode problem.
Tangram — Per-head KV cache budgets — shrinks the KV cache itself, which is exactly the payload a disaggregated stack has to transfer between pools.
Spec-decode latency — Load-dependent latency model — models how decode latency moves with load, the phase disaggregation isolates onto its own pool.

FAQ

What is prefill/decode disaggregation?

It is a serving design that runs the two phases of LLM inference on separate pools of GPUs. Prefill — reading the whole prompt in one parallel, compute-heavy pass — runs on one pool, and decode — generating output one token at a time, bottlenecked by memory bandwidth — runs on another. After prefill, the request's KV cache is transferred across the interconnect to a decode worker. Splitting them lets each pool be sized and scheduled for its own bottleneck instead of compromising on one shared machine.

Why split prefill and decode onto separate GPUs?

Because they have opposite bottlenecks. Prefill is compute-bound (limited by the GPU's math units), while decode is memory-bandwidth-bound (limited by how fast it streams the KV cache and weights out of memory). On one shared worker a long prefill stalls the decode steps queued behind it, and the memory-bound decodes leave the compute units idle. Running each phase on hardware tuned for its own limit avoids that mutual interference — at the cost of moving the KV cache between the two pools.

What do AMD's ATOM and ATOMesh add, and how do they relate to vLLM and SGLang?

ATOM is a ROCm-native inference engine (optimized kernels via AITER, cross-GPU communication via MORI) and ATOMesh is the orchestration layer above it — an OpenAI-compatible API that applies prefill/decode disaggregation and KV-aware scheduling. AMD describes it as deliberately mirroring the vLLM/SGLang design, so the contribution is not a new algorithm but the same modern serving primitives brought to AMD Instinct GPUs — a second-vendor implementation of the stack the LLM Serving track teaches.

Originally posted on Learn AI Visually.

NVIDIA Blackwell Sweeps MLPerf Training 6.0: Strong Scaling

pueding — Sat, 20 Jun 2026 11:15:43 +0000

What: On June 16, 2026, NVIDIA's Blackwell platform posted the fastest time on all seven MLPerf Training 6.0 benchmarks. The lens this explainer uses to read that result is strong scaling — how much faster a fixed model trains as you pour in more GPUs.

Why: Frontier pretraining now runs on 5,000–8,000-GPU clusters, and how well those GPUs scale together — not just how many you own — decides both the wall-clock and the bill for training a model.

vs prior: The naive assumption is that twice the GPUs means half the time. Strong scaling is the reality check: on every step the GPUs must stop and synchronize, so a loosely wired cluster gives sub-linear speedup where a rack-scale NVLink domain keeps it close to linear.

Think of it as

A rowing crew racing one boat to the finish.

                    2× THE ROWERS (GPUs)
                            │
              ┌─────────────┴─────────────┐
              │                           │
      ┌───────▼───────┐          ┌────────▼───────┐
      │  NVLink crew  │          │  Loose cluster │
      │ (one cadence) │          │ (off the beat) │
      └───────┬───────┘          └────────┬───────┘
              │                           │
     every oar hits the          rowers miss the catch,
     catch in unison             power cancels at sync
              │                           │
              ▼                           ▼
        ✓ near-linear              ✗ far below 2×
          (~2× faster)              (sync tax eats it)

GPU = one rower pulling one oar
training step = one stroke the whole crew takes together
gradient sync = the catch, where every oar must hit the water at the same instant
adding GPUs = adding rowers to the boat
NVLink rack domain = the racing shell and coxswain that keep a huge crew in perfect time
low-precision math = lighter oars, so there is less to move on every stroke

Quick glossary

MLPerf Training — The industry-standard training benchmark. It measures one thing: the wall-clock time to train a model to a fixed quality target — so a faster time is a real, comparable result, not a vendor's peak-throughput number.

Strong scaling — Hold the problem fixed (one model, one quality target), add more GPUs, and measure the speedup. Its sibling, weak scaling, grows the problem with the hardware. Strong scaling is the harder test, because the work per GPU keeps shrinking while the coordination cost does not.

Gradient synchronization (AllReduce) — Each GPU trains on a different slice of the batch, then they must average their gradients before the next step starts. That collective exchange, an AllReduce, is a barrier: nobody moves on until everyone has caught up.

NVLink domain (NVL72) — 72 GPUs wired by fifth-generation NVLink into one coherent, high-bandwidth fabric — a single rack that behaves like one big accelerator. The fast fabric is what makes the synchronization barrier cheap.

Low-precision math (FP8 / NVFP4) — Running the heavy matrix multiplies in 8-bit FP8 or 4-bit NVFP4 instead of 16-bit, so there is less data to move and less to compute on every step. Blackwell's tensor cores support both.

Scaling efficiency — The actual speedup divided by the ideal one. Double the GPUs and perfectly halve the time and you are at 100% — perfectly linear. Anything less is the time the GPUs spent waiting on each other.

The news. On June 16, 2026, NVIDIA reported that its Blackwell platform posted the fastest time on every one of MLPerf Training 6.0's seven benchmarks. The new GB300 NVL72 rack trained up to 1.6× faster than the previous GB200 NVL72. Submissions scaled to 8,192 GPUs — CoreWeave trained DeepSeek-V3 671B to target in 2.02 minutes, and Microsoft Azure hit the quality target on Llama 3.1 405B in 7.07 minutes at 8,192-GPU scale. The round also added new mixture-of-experts pretraining workloads.

Picture the rowing crew. The finish line is the model's quality target, and one stroke of the whole crew is one training step — every rower pulls, the boat lurches forward, and they reset for the next stroke. Each rower is a GPU, working a different slice of the same race. The trick is the catch: the instant the oars enter the water. If every oar hits at the same moment the boat surges; if they're even slightly out of time, the power cancels and the boat wallows. Adding rowers should make the boat faster — but only if the bigger crew can still hit the catch together.

That "only if" is the whole story, and its real name is strong scaling: fix the model, add GPUs, and see how much the clock actually drops. The catch is the catch. Every step, the GPUs have to stop and combine their partial results — gradients across the data-parallel replicas, plus activations and weights traded inside the tensor- and expert-parallel groups — before the next step can begin. That synchronization is a tax that grows as the crew grows, so doubling the GPUs buys you less than 2× — the speedup curve bends below the straight line. A naive cluster, like a crew that can't hold its timing, gives back most of what each new rower adds.
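A toy model makes the bend in the curve concrete. The sketch below assumes each step costs a perfectly divisible chunk of compute plus a fixed synchronization cost. Both constants are hypothetical, not MLPerf measurements, and holding the sync cost constant is the gentle case, since the paragraph above describes a tax that grows with the crew.

```python
def step_time(gpus, compute_seconds=64.0, sync_seconds=0.002):
    """One training step: the parallel work shrinks with GPU count, the barrier does not."""
    return compute_seconds / gpus + sync_seconds

def speedup_from_doubling(gpus):
    """How much faster a step gets when the GPU count goes from `gpus` to 2 * `gpus`."""
    return step_time(gpus) / step_time(2 * gpus)

results = {}
for gpus in (64, 512, 4096):
    speedup = speedup_from_doubling(gpus)
    results[gpus] = speedup
    # Efficiency is the actual speedup divided by the ideal 2x.
    print(f"{gpus:>5} -> {2 * gpus:>5} GPUs: {speedup:.2f}x speedup, {speedup / 2:.0%} efficiency")
```

The same fixed barrier that is negligible at 64 GPUs eats a visible share of the step at 4,096, because the per-GPU compute it is being compared against keeps shrinking. That is why strong scaling gets harder the further you push it.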

So the engineering is all about making the catch cheap. NVIDIA's answer is the rack-scale NVLink domain: the GB300 NVL72 ties 72 GPUs into one coherent fabric — the racing shell and coxswain that keep a huge crew locked to a single cadence — so the per-step exchange finishes fast enough that thousands of GPUs still row almost as one. Pair that with lower-precision math — Blackwell's tensor cores run the matmuls in 8-bit FP8 and 4-bit NVFP4, lighter oars with less to move every stroke — plus a stronger software stack, and NVIDIA credits that combination for the sweep: a GB300 rack trains up to 1.6× faster than last generation, and 8,192 GPUs finish in minutes, not hours.

MLPerf Training 6.0 result	Scale	What it shows	Reported figure
GB300 NVL72 vs GB200 NVL72	72-GPU rack	hardware-generation speedup	up to 1.6× faster
DeepSeek-V3 671B (MoE)	8,192 GPUs	strong scaling, new MoE workload	2.02 min to target (CoreWeave)
Llama 3.1 405B (dense)	8,192-GPU scale	strong scaling at frontier size	7.07 min to target (Azure)

Strong scaling, in one calculation

Hold the model fixed — DeepSeek-V3, 671B parameters, trained to MLPerf's quality target — and watch the clock as you add rowers. On 8,192 Blackwell GPUs, CoreWeave's run finished in 2.02 minutes. Now ask the strong-scaling question: had you used half as many GPUs, would it have taken exactly twice as long? Perfect scaling says yes. Suppose (illustrative) the 4,096-GPU run had actually taken 3.7 minutes. Then doubling the GPUs cut the time from 3.7 to 2.02 — a 1.83× speedup, not the ideal 2.0×. Divide the two and you get a scaling efficiency of ~92%; the missing ~8% is the time the GPUs spent at the catch, waiting on each other. The entire job at this scale is keeping that number pinned near 100% — which is exactly what a faster NVLink fabric and lighter low-precision oars are for. (The 2.02-min, 8,192-GPU, and 1.6× figures are from NVIDIA's MLPerf 6.0 report; the 4,096-GPU split is illustrative.)
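The calculation above fits in a few lines. The function name is illustrative; the 2.02-minute figure is the reported one and the 3.7-minute figure is the illustrative half-scale run from the text.

```python
def scaling_efficiency(time_small, gpus_small, time_large, gpus_large):
    """Return (actual speedup, efficiency) when moving from the small run to the large one."""
    actual_speedup = time_small / time_large
    ideal_speedup = gpus_large / gpus_small
    return actual_speedup, actual_speedup / ideal_speedup

speedup, efficiency = scaling_efficiency(
    time_small=3.7, gpus_small=4096,    # illustrative half-scale run
    time_large=2.02, gpus_large=8192,   # reported CoreWeave time
)
print(f"speedup = {speedup:.2f}x, efficiency = {efficiency:.0%}")
# prints: speedup = 1.83x, efficiency = 92%
```

For comparison, perfect scaling would put the 4,096-GPU run at exactly 4.04 minutes, twice the 8,192-GPU time.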

Goes deeper in: GPU & CUDA → Memory Hierarchy → NVLink & PCIe

Related explainers

Vera Rubin NVL72 — the NVLink rack domain — the "72 GPUs as one coherent fabric" idea this article leans on, taken apart on its own.
NVIDIA AI factories — tokens per megawatt — once a cluster scales well, the next question is energy: useful work per watt, not just per GPU.
Fused INT8 GEMM — INT8 beats FP8 on the tensor cores — the same "shrink the numbers so there's less to move" lever, on the inference side instead of training.

FAQ

What is strong scaling in distributed training?

Strong scaling fixes the problem (one model, one quality target) and measures how much faster it trains as you add GPUs. Perfect strong scaling means N times the GPUs finishes in 1/N the time. In practice the speedup falls short of that line, because every training step the GPUs must stop and synchronize their partial results before the next step can start, and that coordination cost grows with the number of GPUs. The ratio of the actual speedup to the ideal one is the scaling efficiency; the shortfall from 100% is the time lost to coordination.

Why doesn't doubling the GPUs halve the training time?

Because training is synchronous. Each GPU works on a different slice of the batch, but at the end of every step they must average their gradients (an AllReduce) and exchange activations and weights across the tensor- and expert-parallel groups before moving on. That barrier is overhead that does not shrink as fast as the per-GPU work does, so adding GPUs gives less than a proportional speedup. The fix is to make the synchronization cheap — a fast, rack-scale NVLink fabric and lower-precision (FP8/NVFP4) numbers — so the speedup curve stays close to linear.

What did NVIDIA Blackwell actually win in MLPerf Training 6.0?

NVIDIA reported the fastest time on all seven MLPerf Training 6.0 benchmarks. The new GB300 NVL72 rack trained up to 1.6× faster than the prior GB200 NVL72, submissions scaled to 8,192 GPUs (CoreWeave trained DeepSeek-V3 671B to target in 2.02 minutes; Microsoft Azure hit the target on Llama 3.1 405B in 7.07 minutes at 8,192-GPU scale), and the round added new mixture-of-experts pretraining workloads. The headline is less about one chip than about how well thousands of them scale together.

Originally posted on Learn AI Visually.