<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: pueding</title>
    <description>The latest articles on DEV Community by pueding (@pueding).</description>
    <link>https://dev.to/pueding</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F453161%2F9dc2c7a4-3298-46c4-bf96-00395ec12416.png</url>
      <title>DEV Community: pueding</title>
      <link>https://dev.to/pueding</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pueding"/>
    <language>en</language>
    <item>
      <title>Agent-Harness Scaling Law: Feedback Quality Predicts Success, Not Raw Compute: Effective Feedback Compute (EFC)</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Wed, 10 Jun 2026 11:15:58 +0000</pubDate>
      <link>https://dev.to/pueding/agent-harness-scaling-law-feedback-quality-predicts-success-not-raw-compute-effective-feedback-58hl</link>
      <guid>https://dev.to/pueding/agent-harness-scaling-law-feedback-quality-predicts-success-not-raw-compute-effective-feedback-58hl</guid>
      <description>

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; A new &lt;strong&gt;agent-harness scaling-law paper&lt;/strong&gt; introduces &lt;strong&gt;Effective Feedback Compute (EFC)&lt;/strong&gt; — a single quantity that predicts whether an agent finishes a task from the quality of the feedback its harness returns each step, scored on four axes and normalized by how hard the task is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; It reframes agent reliability as a &lt;strong&gt;feedback-quality problem, not a token-budget problem&lt;/strong&gt; — plotted against EFC, harness-run success follows a clean law (R²≈0.94–0.99), while against raw compute the same runs barely fit (R²≈0.33–0.42).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; Prior reliability work leaned on &lt;strong&gt;raw-compute scaling&lt;/strong&gt; — more tokens, more tool calls, bigger reasoning budgets — but EFC shows that axis is nearly flat, since lifting only feedback quality moved success from 0.27 to 0.90 with cost and tool-call counts held fixed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;a student with a sharp tutor instead of just re-reading the textbook&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  SAME EXAM, SAME HOURS LOGGED
                             │
               ┌─────────────┴──────────────┐
               ▼                            ▼
       ┌───────────────┐          ┌───────────────┐
       │  RE-READ THE  │          │  SHARP TUTOR  │
       │    TEXTBOOK   │          │  per problem  │
       │ (raw compute) │          │ (feedback Q)  │
       └───────┬───────┘          └───────┬───────┘
               │                          │
      pages logged, but          points at the exact
      no correction lands        mistake — and it sticks
               │                          │
               ▼                          ▼
         ✗ grade ~0.27              ✓ grade ~0.90
         effort, no signal         signal absorbed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;agent harness = the study setup that feeds you a correction each round&lt;/li&gt;
&lt;li&gt;raw compute = hours logged and pages re-read&lt;/li&gt;
&lt;li&gt;feedback quality = how useful the tutor's correction is each time&lt;/li&gt;
&lt;li&gt;informativeness = the tutor points at the exact mistake, not "study harder"&lt;/li&gt;
&lt;li&gt;validity = the correction is actually right, not misleading&lt;/li&gt;
&lt;li&gt;non-redundancy = the tutor doesn't repeat a note you already wrote down&lt;/li&gt;
&lt;li&gt;retention = you keep the correction in your notes for the next problem&lt;/li&gt;
&lt;li&gt;EFC = total useful correction absorbed, divided by how hard the exam is&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;EFC&lt;/strong&gt; — &lt;strong&gt;Effective Feedback Compute&lt;/strong&gt; — the paper's core metric. It measures how much &lt;em&gt;useful&lt;/em&gt; feedback signal a harness feeds back into the agent loop, scored on four axes (informativeness, validity, non-redundancy, retention) and normalized by task demand. It is the x-axis of the proposed scaling law, replacing "tokens and tool calls spent."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent harness&lt;/strong&gt; — The scaffolding around the model — the loop that runs tool calls, observes results, and feeds the next observation back to the model. The harness is what &lt;em&gt;delivers&lt;/em&gt; feedback, so it is where EFC is won or lost. Covered in Agent Engineering → Production Harness Architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling law&lt;/strong&gt; — An empirical curve that predicts an outcome (here, task success rate) from one quantity (here, EFC). A &lt;strong&gt;tight&lt;/strong&gt; scaling law means the curve explains most of the variation; a loose one means the quantity is a poor predictor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R² (fit quality)&lt;/strong&gt; — The fraction of variation in success the curve explains, from 0 (the x-axis predicts nothing) to 1 (it predicts everything). EFC reaches &lt;strong&gt;R²≈0.94–0.99&lt;/strong&gt;; the raw-compute baseline only &lt;strong&gt;0.33–0.42&lt;/strong&gt;. Higher R² = a better predictor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The four feedback axes&lt;/strong&gt; — &lt;strong&gt;Informativeness&lt;/strong&gt; (does the message localize the error?), &lt;strong&gt;validity&lt;/strong&gt; (is the correction actually right?), &lt;strong&gt;non-redundancy&lt;/strong&gt; (is it new, or a repeat?), and &lt;strong&gt;retention&lt;/strong&gt; (does the agent still have it later?). EFC is built from all four, so a harness can fail on any one of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task demand&lt;/strong&gt; — How much corrective signal a task actually &lt;em&gt;needs&lt;/em&gt; to be solved. EFC divides feedback quality by task demand so harnesses can be compared fairly across easy and hard tasks — the same crisp feedback is worth more on a demanding task than a trivial one.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On May 28, 2026, researchers posted &lt;a href="https://arxiv.org/abs/2605.29682" rel="noopener noreferrer"&gt;an agent-harness scaling-law paper&lt;/a&gt; to arXiv introducing &lt;strong&gt;Effective Feedback Compute (EFC)&lt;/strong&gt; — a metric that predicts agent success from the &lt;em&gt;quality&lt;/em&gt; of feedback the harness returns, not the compute it spends. Plotted against EFC, harness-run success rates fit a clean scaling law (reported &lt;strong&gt;R²≈0.94–0.99&lt;/strong&gt; across datasets); plotted against raw compute, the same runs barely fit (R²≈0.33–0.42, rising to ~0.88 only with a hand-built multivariate baseline). In one controlled comparison, lifting feedback quality moved success from &lt;strong&gt;0.27 to 0.90&lt;/strong&gt; with token cost and tool calls held fixed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture two students prepping for the same exam. The first logs &lt;em&gt;ten hours re-reading the textbook&lt;/em&gt; cover to cover — enormous effort, page after page. The second spends one hour with a sharp tutor who, after each practice problem, points at the &lt;em&gt;exact&lt;/em&gt; line where the reasoning went wrong, confirms the fix is correct, never repeats a note already written down, and makes sure it lands in the margin for next time. On exam day the second student wins, and it is not close. The hours-logged number — the &lt;strong&gt;raw compute&lt;/strong&gt; — told you almost nothing. The number that predicted the grade was how much &lt;em&gt;useful correction&lt;/em&gt; actually got absorbed. That second number is what this paper names &lt;strong&gt;Effective Feedback Compute&lt;/strong&gt;, and the claim is that agent harnesses behave the same way.&lt;/p&gt;

&lt;p&gt;The mechanism is a re-definition of the x-axis. Instead of counting tokens or tool invocations, EFC measures the &lt;strong&gt;useful signal the harness feeds back each step&lt;/strong&gt; — scored on four axes (informativeness, validity, non-redundancy, retention) — and then normalizes by task demand so a crisp correction counts for more on a hard task than an easy one. That normalized quantity becomes the horizontal axis of a scaling law that fits success rates across the paper's datasets. The practical reading for anyone building agents: the lever is not your reasoning budget but what your harness chooses to log and return after every tool call.&lt;/p&gt;

&lt;p&gt;This is why the raw-compute axis goes flat. A harness can burn an enormous budget returning &lt;em&gt;low-quality&lt;/em&gt; feedback — a terse &lt;code&gt;exit code 1&lt;/code&gt; with no stack trace (low informativeness), a linter warning that is actually a false positive (low validity), the same "tests failed" string ten turns in a row (high redundancy), or an error the agent has already forgotten by the time it matters (low retention). All of that is real compute and real tool calls, and on the EFC axis it is worth almost nothing. The tutor who just says "study harder" for an hour spent the hour; the student learned nothing. Worse, in a long rollout the low-signal steps let compounding errors accumulate unchecked, so the spend actively buys you a longer path to the same failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where the feedback gap actually comes from
&lt;/h3&gt;

&lt;p&gt;Hold three variables fixed. One agent. One task. Two runs at the same budget — 40 tool calls, ~120K tokens each. The only difference is the harness's feedback quality. In Run A, every step returns a terse pass/fail string; say each step carries about &lt;code&gt;0.1&lt;/code&gt; units of useful, valid, non-redundant, retained signal, so over 40 steps the agent accumulates &lt;code&gt;40 × 0.1 = 4&lt;/code&gt; units. The task demands roughly &lt;code&gt;30&lt;/code&gt; units to solve, so EFC = &lt;code&gt;4 / 30 ≈ 0.13&lt;/code&gt; — low on the law's curve, landing near the &lt;strong&gt;0.27&lt;/strong&gt; success rate the paper reports at the bottom of its range. In Run B, the harness returns the failing assertion, the offending input, and a one-line diff each step — call it &lt;code&gt;0.8&lt;/code&gt; units per step, &lt;code&gt;40 × 0.8 = 32&lt;/code&gt; units, EFC = &lt;code&gt;32 / 30 ≈ 1.07&lt;/code&gt;, high on the curve and up near &lt;strong&gt;0.90&lt;/strong&gt; success. Same cost, same tool count, &lt;strong&gt;~8× the effective feedback&lt;/strong&gt; &lt;em&gt;(illustrative decomposition calibrated to the paper's 0.27→0.90 and R² headline figures — the per-step unit values and task-demand figure are stand-ins, not measured constants)&lt;/em&gt;. The success jump is the headline; the per-call yield jump is the deeper story.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scaling-law x-axis&lt;/th&gt;
&lt;th&gt;What it counts&lt;/th&gt;
&lt;th&gt;Fit to success (R²)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw compute&lt;/td&gt;
&lt;td&gt;tokens + tool calls spent&lt;/td&gt;
&lt;td&gt;~0.33–0.42 — poor &lt;em&gt;(&lt;a href="https://arxiv.org/abs/2605.29682" rel="noopener noreferrer"&gt;paper&lt;/a&gt;)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multivariate compute baseline&lt;/td&gt;
&lt;td&gt;several spend features combined&lt;/td&gt;
&lt;td&gt;~0.88 — better, hand-built &lt;em&gt;(&lt;a href="https://arxiv.org/abs/2605.29682" rel="noopener noreferrer"&gt;paper&lt;/a&gt;)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Effective Feedback Compute (EFC)&lt;/td&gt;
&lt;td&gt;4-axis feedback quality ÷ task demand&lt;/td&gt;
&lt;td&gt;~0.94–0.99 — tight &lt;em&gt;(&lt;a href="https://arxiv.org/abs/2605.29682" rel="noopener noreferrer"&gt;paper&lt;/a&gt;)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A caveat worth stating plainly: this is a &lt;em&gt;scaling-law fit on the paper's own datasets&lt;/em&gt;, and a tight fit is a strong correlation, not a guaranteed control knob. EFC is also harder to move than a token budget — "return better feedback" is a design problem, not a slider, and scoring the four axes reliably is itself non-trivial. The honest framing is that EFC gives you a &lt;em&gt;yardstick&lt;/em&gt; and a direction: instrument the feedback your harness returns, A/B candidate changes in shadow, and treat feedback quality as a first-class number alongside latency and cost. Whether the exact coefficients transfer to your stack is exactly the kind of thing you should measure, not assume.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Goes deeper in: AI Agents → Evals &amp;amp; Diagnostics → Error analysis first&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/pushbench-qgp" rel="noopener noreferrer"&gt;PushBench — Quantitative Goal Persistence (QGP)&lt;/a&gt; — another harness-level number for long-horizon agent reliability&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/futuresim-harness-level-eval" rel="noopener noreferrer"&gt;FutureSim — harness-level agent eval&lt;/a&gt; — why evaluating the harness, not the model alone, is the trend&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/cursor-composer-2-5-targeted-textual-feedback-rl" rel="noopener noreferrer"&gt;Cursor Composer 2.5 — targeted textual feedback RL&lt;/a&gt; — the training-time analogue: a sharp, targeted correction beats a blunt end-of-rollout reward&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Effective Feedback Compute (EFC)?
&lt;/h3&gt;

&lt;p&gt;EFC is a metric that predicts agent-harness success from the quality of the feedback the harness returns each step, rather than from the raw compute it spends. It scores feedback on four axes — informativeness, validity, non-redundancy, and retention — and normalizes by task demand so harnesses can be compared fairly across easy and hard tasks. Plotted against EFC, the paper reports success rates fitting a scaling law at R²≈0.94–0.99, far tighter than the ~0.33–0.42 fit against raw compute.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does feedback quality predict success better than raw compute?
&lt;/h3&gt;

&lt;p&gt;A harness can spend an enormous budget returning low-quality feedback — terse pass/fail strings, false-positive warnings, repeated messages, or errors the agent has already forgotten. That is real compute that carries almost no useful signal, so the raw-compute axis goes nearly flat. EFC captures the signal that actually reaches the agent, which is why it fits success so much more tightly. In one controlled comparison, lifting only feedback quality moved success from 0.27 to 0.90 with token cost and tool-call counts held fixed.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I improve a harness's EFC in practice?
&lt;/h3&gt;

&lt;p&gt;Treat the feedback your harness returns as a first-class design surface: make tool-call results localize the error (informativeness), verify the signal is correct before returning it (validity), suppress repeated or stale messages (non-redundancy), and persist corrections so they survive later in the rollout (retention). Because EFC is a measurable yardstick rather than a slider, the practical loop is to instrument the feedback you return, A/B candidate changes in shadow mode, and track feedback quality alongside latency and cost.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/efc-feedback-quality-scaling-law" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>AutoLab Benchmarks Frontier Agents on Long-Horizon R&amp;D Tasks: Iterative Experiment-Loop Evaluation</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Tue, 09 Jun 2026 11:25:04 +0000</pubDate>
      <link>https://dev.to/pueding/autolab-benchmarks-frontier-agents-on-long-horizon-rd-tasks-iterative-experiment-loop-evaluation-470o</link>
      <guid>https://dev.to/pueding/autolab-benchmarks-frontier-agents-on-long-horizon-rd-tasks-iterative-experiment-loop-evaluation-470o</guid>
      <description>

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; The &lt;strong&gt;AutoLab benchmark&lt;/strong&gt; scores agents with &lt;strong&gt;iterative experiment-loop evaluation&lt;/strong&gt; — 36 realistic R&amp;amp;D tasks (optimize a system, tune a CUDA kernel, build a model) where the agent has to propose a change, run an experiment, measure the result, and refine, over and over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Across 17 frontier models, the strongest predictor of success was &lt;strong&gt;sustained iteration that incorporates empirical feedback&lt;/strong&gt; plus &lt;strong&gt;time-awareness&lt;/strong&gt; — knowing when to keep going — rather than the quality of the first answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; Most LLM benchmarks grade &lt;strong&gt;a single answer once&lt;/strong&gt;; AutoLab grades the whole &lt;strong&gt;propose → run → measure → refine loop under a budget&lt;/strong&gt;, exposing two failure modes a one-shot score is blind to: &lt;strong&gt;stopping too early&lt;/strong&gt; and &lt;strong&gt;burning the budget with no measured progress&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;tuning a race car in the pit, reading lap times until qualifying closes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         SAME CAR, SAME LAP BUDGET (12 laps)
                          │
        ┌─────────────────┬─────────────────┐
        ▼                 ▼                 ▼
   ┌─────────┐       ┌─────────┐       ┌─────────┐
   │ PARK    │       │ RE-TUNE │       │ TIME +  │
   │ EARLY   │       │ NEVER   │       │ TUNE    │
   │         │       │ TIME    │       │ EVERY   │
   │ 4 laps, │       │ 12 laps,│       │ LAP     │
   │ then    │       │ no clock│       │ 8 timed │
   │ quit    │       │ reading │       │ laps    │
   └────┬────┘       └────┬────┘       └────┬────┘
        ▼                 ▼                 ▼
   stops at          random-walks       compounds to
   ~0.46             ~0.27              ~0.76
   ✗ budget          ✗ no measured      ✓ best lap
     left unused       progress           wins slot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;task = set the fastest lap before qualifying closes&lt;/li&gt;
&lt;li&gt;experiment loop = adjust the setup → run a lap → read the lap time → adjust again&lt;/li&gt;
&lt;li&gt;empirical feedback = the lap time on the stopwatch, not a guess from the spec sheet&lt;/li&gt;
&lt;li&gt;budget = the laps you have before the qualifying flag drops&lt;/li&gt;
&lt;li&gt;stopping early = parking after two laps with time still on the clock&lt;/li&gt;
&lt;li&gt;burning the budget = re-tuning every lap but never reading the timer&lt;/li&gt;
&lt;li&gt;persistence = keep timing and tuning until the very last lap&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Long-horizon task&lt;/strong&gt; — A task that takes many steps and a real budget to finish — not one question with one answer, but a goal you reach by doing work, checking it, and adjusting. AutoLab's tasks run for many tool-using steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Experiment loop&lt;/strong&gt; — The repeating cycle at the heart of R&amp;amp;D work: &lt;strong&gt;propose&lt;/strong&gt; a change → &lt;strong&gt;run&lt;/strong&gt; an experiment or benchmark → &lt;strong&gt;measure&lt;/strong&gt; the outcome → &lt;strong&gt;refine&lt;/strong&gt;. AutoLab scores whether an agent actually keeps this loop turning, not just whether its first attempt looked good.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empirical feedback&lt;/strong&gt; — A result you &lt;em&gt;measured&lt;/em&gt; by running something — a benchmark number, a test pass/fail, a latency reading — as opposed to a guess. The key move is conditioning the next edit on a number the agent ran itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time-awareness&lt;/strong&gt; — The agent's sense of how much budget is left and whether more iteration is worth it. Failing it shows up two ways: quitting with budget unspent, or thrashing until the budget runs out with nothing to show.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent harness&lt;/strong&gt; — The runtime that wraps a model into an agent — it schedules tool calls, runs the experiments, and feeds results back into the loop. The same model in a better harness can score very differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CUDA-kernel optimization&lt;/strong&gt; — One of AutoLab's four domains: rewrite a GPU kernel to run faster, then benchmark it to see if it actually did. It is a textbook measure-and-refine loop — and it ties this agent benchmark to the GPU &amp;amp; CUDA track.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; Posted to arXiv on June 3, 2026, &lt;strong&gt;AutoLab&lt;/strong&gt; is a benchmark of 36 long-horizon R&amp;amp;D tasks across four domains — system optimization, puzzle &amp;amp; challenge, model development, and CUDA-kernel optimization — that ask an agent to propose changes, run experiments, measure outcomes, and iterate. Evaluating 17 state-of-the-art models, the dominant predictor of success was &lt;strong&gt;persistence in repeatedly benchmarking, editing, and incorporating empirical feedback&lt;/strong&gt; — not the quality of the initial response. Most frontier models either stopped prematurely or burned their budget with minimal progress; Claude-opus-4.6 showed the strongest long-horizon optimization behavior. &lt;a href="https://arxiv.org/abs/2606.05080" rel="noopener noreferrer"&gt;Read the paper →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture a pit crew with a fixed number of laps before qualifying closes. The car that wins the slot isn't the one that posted the best first lap — it's the one whose crew keeps reading the lap time, adjusting the setup, and sending it back out until the flag drops. AutoLab is built on exactly this insight for agents: it hands an agent a real engineering goal and a budget, then watches not the first attempt but whether the agent keeps the &lt;strong&gt;experiment loop&lt;/strong&gt; — propose → run → measure → refine — turning all the way to the deadline.&lt;/p&gt;

&lt;p&gt;That loop is the whole concept of &lt;strong&gt;iterative experiment-loop evaluation&lt;/strong&gt;. A classic LLM benchmark asks one question and grades one answer; the agent never gets to &lt;em&gt;run&lt;/em&gt; anything. AutoLab instead scores the agent on tasks where it must execute its own experiments and read its own results — the errors compound across a long trajectory, so the only way to climb is to measure, learn, and correct. Crucially, the useful signal here is empirical feedback the agent generates itself (it benchmarks its own kernel and reads the number), which is a different lever from feedback a harness hands back step-by-step.&lt;/p&gt;

&lt;p&gt;The benchmark's headline finding is that frontier models fail this in two distinct ways, and both are about knowing when to stop. Some agents &lt;strong&gt;stop too early&lt;/strong&gt; — they post a decent second attempt and quit with most of the budget unspent. Others &lt;strong&gt;burn the whole budget&lt;/strong&gt; but skip the &lt;em&gt;measure&lt;/em&gt; step: they keep editing without conditioning each change on a result, so the score random-walks and never compounds. The agents that did well — led by Claude-opus-4.6 — spent their reasoning budget on a disciplined measure-then-refine cadence, which is exactly the time-awareness a one-shot eval can never see.&lt;/p&gt;

&lt;p&gt;Why does this matter beyond a leaderboard? Because it relocates the bottleneck for long-horizon agents from &lt;em&gt;raw capability&lt;/em&gt; to &lt;em&gt;behavior under a budget&lt;/em&gt;. The same skill that tops AutoLab — sustained, measured iteration — is what production teams care about when an agent tunes a config, optimizes a kernel, or chases a flaky test over an afternoon. That makes AutoLab a production-eval signal, not just an academic one: it predicts whether an agent will actually grind a real task to a good result instead of giving up or spinning.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AutoLab domain&lt;/th&gt;
&lt;th&gt;What the agent iterates on&lt;/th&gt;
&lt;th&gt;What it measures each loop&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System optimization&lt;/td&gt;
&lt;td&gt;Configs, flags, resource allocation&lt;/td&gt;
&lt;td&gt;Throughput / latency of a benchmark run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CUDA-kernel optimization&lt;/td&gt;
&lt;td&gt;A GPU kernel's implementation&lt;/td&gt;
&lt;td&gt;Wall-clock kernel time vs a baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model development&lt;/td&gt;
&lt;td&gt;Training / architecture choices&lt;/td&gt;
&lt;td&gt;A validation metric on a held-out set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Puzzle &amp;amp; challenge&lt;/td&gt;
&lt;td&gt;Candidate solutions to a hard problem&lt;/td&gt;
&lt;td&gt;Pass / fail against the checker&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Four domains, 36 tasks total across them; the exact per-task scores are reported in the paper, and the row examples above are illustrative of the loop structure (&lt;a href="https://arxiv.org/abs/2606.05080" rel="noopener noreferrer"&gt;AutoLab, arXiv 2606.05080&lt;/a&gt;).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where the budget actually goes&lt;/strong&gt; &lt;em&gt;(numbers illustrative — AutoLab reports the model ranking and the persistence finding, not these per-task point values)&lt;/em&gt;. Hold three things fixed: a budget of 12 experiment runs, a starting score of ~0.23 (the first answer — roughly the same for all three agents), and a per-loop gain that only lands when the agent &lt;em&gt;measures&lt;/em&gt;. Agent A makes 4 measured runs at about +0.06 each, reaches ~0.46, then stops with 8 runs unused. Agent B spends all 12 runs but skips the measure step, so its edits aren't conditioned on a read result — its score random-walks around ~0.27 and never compounds. Agent C makes 8 measured runs, each conditioned on the last result, compounding to &lt;strong&gt;~0.76&lt;/strong&gt;. Same start, same budget; the entire gap comes from &lt;strong&gt;how the loop was spent&lt;/strong&gt;, not from the first try.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Goes deeper in: AI Agents → Evals &amp;amp; Diagnostics → Compounding errors&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is iterative experiment-loop evaluation?
&lt;/h3&gt;

&lt;p&gt;It is scoring an agent on whether it keeps a propose → run → measure → refine loop turning, rather than grading a single answer. AutoLab gives the agent a real R&amp;amp;D task and a budget, then rewards measured iteration toward a better result instead of a good-looking first attempt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does sustained iteration beat initial answer quality?
&lt;/h3&gt;

&lt;p&gt;On long-horizon tasks the first attempt is rarely the best one, and errors compound. The agents that win are the ones that read an empirical result, correct, and repeat — using their whole budget. AutoLab found this disposition, not first-shot quality, was the dominant predictor across 17 models.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does AutoLab relate to benchmarks like EFC and QGP?
&lt;/h3&gt;

&lt;p&gt;They are complementary lenses on long-horizon agent reliability. EFC isolates the quality of the feedback signal a harness returns; QGP measures whether an agent finishes a fixed count of work without spinning; AutoLab measures whether the agent sustains its own measure-and-refine loop under a budget on realistic R&amp;amp;D tasks.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/autolab-experiment-loop-eval" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>MCP SEP-2106: Full JSON Schema 2020-12 in Tool I/O</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Mon, 08 Jun 2026 11:18:18 +0000</pubDate>
      <link>https://dev.to/pueding/mcp-sep-2106-full-json-schema-2020-12-in-tool-io-ee7</link>
      <guid>https://dev.to/pueding/mcp-sep-2106-full-json-schema-2020-12-in-tool-io-ee7</guid>
      <description>

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; &lt;strong&gt;MCP SEP-2106&lt;/strong&gt; — merged into the protocol on May 18, 2026 — lets an MCP tool describe its inputs and outputs with the full JSON Schema 2020-12 keyword set in &lt;code&gt;inputSchema&lt;/code&gt; and &lt;code&gt;outputSchema&lt;/code&gt;, and widens &lt;code&gt;structuredContent&lt;/code&gt; from object-only to any JSON value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Composition (&lt;code&gt;oneOf&lt;/code&gt; / &lt;code&gt;anyOf&lt;/code&gt; / &lt;code&gt;allOf&lt;/code&gt;), conditionals (&lt;code&gt;if&lt;/code&gt; / &lt;code&gt;then&lt;/code&gt; / &lt;code&gt;else&lt;/code&gt;), and references (&lt;code&gt;$ref&lt;/code&gt; / &lt;code&gt;$defs&lt;/code&gt;) let a tool author &lt;strong&gt;push contract rules out of free-form description prose and into the schema&lt;/strong&gt;, where runtimes and SDKs can validate them before the call ever reaches the tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; The previous MCP spec accepted only a narrow JSON Schema subset (object root with a basic &lt;code&gt;type&lt;/code&gt; / &lt;code&gt;properties&lt;/code&gt; / &lt;code&gt;required&lt;/code&gt; vocabulary); composition, conditionals, refs, and non-object output shapes were not part of the wire vocabulary and had to live in tool &lt;code&gt;description&lt;/code&gt; prose.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;It's like a job application form with conditional sections, alternatives, and refs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;           THE TOOL'S inputSchema (a form)
                        │
        ┌───────────────┴────────────────┐
 ┌──────▼───────┐                 ┌───────▼──────┐
 │ BEFORE 2106  │                 │ AFTER 2106   │
 │ plain fields │                 │ fields PLUS  │
 │ + prose note │                 │ oneOf / if / │
 │ at the bottom│                 │ then / $ref  │
 └──────┬───────┘                 └───────┬──────┘
        │                                 │
  rules live in                    rules live in
  English prose                    the schema
        │                                 │
        ▼                                 ▼
 ✗ runtime CANNOT                 ✓ runtime REJECTS
   check them                       bad calls early
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;MCP tool inputSchema = the application form a tool requires from the agent&lt;/li&gt;
&lt;li&gt;basic keywords (type / properties / required) = plain text fields and required checkboxes&lt;/li&gt;
&lt;li&gt;oneOf / anyOf / allOf = pick exactly one / any combination / all of these alternatives&lt;/li&gt;
&lt;li&gt;if / then / else = if you marked 'married', also fill spouse details&lt;/li&gt;
&lt;li&gt;$ref / $defs = see the 'Company Address' subform on page 4&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MCP&lt;/strong&gt; — The Model Context Protocol — a JSON-RPC wire protocol that lets LLM clients (Claude, ChatGPT, IDEs) discover and call tools served by external processes. See the MCP step in the Tool Use module.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SEP&lt;/strong&gt; — A Specification Enhancement Proposal — the MCP equivalent of a Python PEP or a TC39 proposal. Each SEP is a numbered RFC merged into the spec only after review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;inputSchema / outputSchema&lt;/strong&gt; — The two JSON Schema documents an MCP server attaches to a tool definition — one for the arguments the agent must send, one for the structured value the tool returns. The runtime validates traffic against them before either side sees a malformed payload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;structuredContent&lt;/strong&gt; — The field inside a tool result that carries a typed value alongside the human-readable &lt;code&gt;content&lt;/code&gt; blocks. Pre-SEP-2106 the TypeScript type was &lt;code&gt;{ [key: string]: unknown }&lt;/code&gt; — objects only; after SEP-2106 it is plain &lt;code&gt;unknown&lt;/code&gt;, so arrays and primitives are wire-legal too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JSON Schema 2020-12&lt;/strong&gt; — The 2020-12 draft of the JSON Schema spec — the most recent stable version. Adds composition (&lt;code&gt;oneOf&lt;/code&gt; / &lt;code&gt;anyOf&lt;/code&gt; / &lt;code&gt;allOf&lt;/code&gt; / &lt;code&gt;not&lt;/code&gt;), conditionals (&lt;code&gt;if&lt;/code&gt; / &lt;code&gt;then&lt;/code&gt; / &lt;code&gt;else&lt;/code&gt;), references (&lt;code&gt;$ref&lt;/code&gt; / &lt;code&gt;$defs&lt;/code&gt;), and tighter &lt;code&gt;$dynamicRef&lt;/code&gt; semantics over the older draft-07 vocabulary MCP previously implied.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;oneOf / anyOf / allOf&lt;/strong&gt; — JSON Schema composition keywords. &lt;code&gt;oneOf&lt;/code&gt; = match exactly one of N subschemas; &lt;code&gt;anyOf&lt;/code&gt; = match at least one; &lt;code&gt;allOf&lt;/code&gt; = match every subschema (intersection).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;if / then / else&lt;/strong&gt; — JSON Schema conditional keywords. If a value matches the &lt;code&gt;if&lt;/code&gt; subschema, it must also match &lt;code&gt;then&lt;/code&gt;; otherwise it must match &lt;code&gt;else&lt;/code&gt;. Lets a single schema express "if &lt;code&gt;roundTrip&lt;/code&gt; is true, &lt;code&gt;return_date&lt;/code&gt; is required."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;$ref / $defs&lt;/strong&gt; — JSON Schema reference keywords. &lt;code&gt;$defs&lt;/code&gt; declares reusable named subschemas; &lt;code&gt;$ref&lt;/code&gt; points at one of them by JSON Pointer. Lets a long schema avoid copy-pasting the same address or money sub-shape three times.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On May 18, 2026, &lt;a href="https://github.com/modelcontextprotocol/modelcontextprotocol/commit/142b3c3cffcd10012e3dc1b07db5818877e64f9b" rel="noopener noreferrer"&gt;SEP-2106&lt;/a&gt; merged into the MCP specification. The change widens the schema vocabulary that tools may use to describe their input and output: &lt;code&gt;inputSchema&lt;/code&gt; now allows the full JSON Schema 2020-12 keyword set inside its required &lt;code&gt;type: "object"&lt;/code&gt; root, &lt;code&gt;outputSchema&lt;/code&gt; drops the object-root constraint entirely and accepts any 2020-12 schema, and &lt;code&gt;structuredContent&lt;/code&gt; is retyped from object-only to plain &lt;code&gt;unknown&lt;/code&gt;. Loosening on paper — but the SEP is explicit that compatibility is &lt;strong&gt;asymmetric&lt;/strong&gt;: a newer server emitting a non-object &lt;code&gt;structuredContent&lt;/code&gt; or a composition-rich schema may be rejected by an older client that hasn't been updated, so the SEP recommends servers also emit a serialized &lt;code&gt;TextContent&lt;/code&gt; fallback for non-object results during the transition.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture a job application that, until last week, only let you fill in plain text fields and checkboxes — name, address, "married?" yes/no. If the form needed something conditional ("if married, also provide spouse name") or alternative ("attach exactly ONE of passport, driver's license, or state ID"), the only way to express it was a paragraph of free-text instructions at the bottom of the page. SEP-2106 hands the form designer a richer &lt;strong&gt;template language&lt;/strong&gt;: now the conditional, the alternatives, and the cross-references to other subforms are spelled out &lt;em&gt;on the form itself&lt;/em&gt;, in a way the form's automated validator can actually check before the application gets routed.&lt;/p&gt;

&lt;p&gt;The technical reason mirrors the metaphor. Before SEP-2106, the MCP wire spec implied a narrow JSON Schema subset — basically the keywords a 2014-era schema validator would understand: &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;properties&lt;/code&gt;, &lt;code&gt;required&lt;/code&gt;, &lt;code&gt;items&lt;/code&gt;, &lt;code&gt;enum&lt;/code&gt;, &lt;code&gt;additionalProperties&lt;/code&gt;. If a tool needed to express "either a one-way booking (no return date) or a round-trip booking (return date required)," the schema author had two bad options: split it into two separate tools (now the model has to pick), or leave it as one tool with a permissive schema and a paragraph of natural-language instructions in &lt;code&gt;description&lt;/code&gt;. The first option inflates the agent's tool registry; the second relies on the model honoring prose constraints that the runtime can't enforce.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three surfaces, three changes
&lt;/h3&gt;

&lt;p&gt;SEP-2106 touches three places on the wire, with slightly different shapes of change.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Surface&lt;/th&gt;
&lt;th&gt;Before SEP-2106&lt;/th&gt;
&lt;th&gt;After SEP-2106&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;inputSchema&lt;/code&gt; root&lt;/td&gt;
&lt;td&gt;must be &lt;code&gt;type: "object"&lt;/code&gt; (SEP-2106 commit)&lt;/td&gt;
&lt;td&gt;must be &lt;code&gt;type: "object"&lt;/code&gt; (unchanged)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;inputSchema&lt;/code&gt; keywords inside the object root&lt;/td&gt;
&lt;td&gt;restricted vocabulary the spec named — &lt;code&gt;type&lt;/code&gt; / &lt;code&gt;properties&lt;/code&gt; / &lt;code&gt;required&lt;/code&gt; &lt;em&gt;(SDKs typically also accepted &lt;code&gt;items&lt;/code&gt;, &lt;code&gt;enum&lt;/code&gt;, &lt;code&gt;additionalProperties&lt;/code&gt;)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;full JSON Schema 2020-12 — adds &lt;code&gt;oneOf&lt;/code&gt; / &lt;code&gt;anyOf&lt;/code&gt; / &lt;code&gt;allOf&lt;/code&gt; / &lt;code&gt;not&lt;/code&gt;, &lt;code&gt;if&lt;/code&gt; / &lt;code&gt;then&lt;/code&gt; / &lt;code&gt;else&lt;/code&gt;, &lt;code&gt;$ref&lt;/code&gt; / &lt;code&gt;$defs&lt;/code&gt;, and the rest of the 2020-12 keyword set (SEP-2106 commit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;outputSchema&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;basic, object-rooted (mirrored &lt;code&gt;inputSchema&lt;/code&gt;) (SEP-2106 commit)&lt;/td&gt;
&lt;td&gt;fully flexible — any 2020-12 schema, including array roots, primitive roots, and composition (SEP-2106 commit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;structuredContent&lt;/code&gt; TypeScript type&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;{ [key: string]: unknown }&lt;/code&gt; — object only (SEP-2106 commit)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;unknown&lt;/code&gt; — array, primitive, union, object all wire-legal (SEP-2106 commit)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The root constraint on &lt;code&gt;inputSchema&lt;/code&gt; is preserved because every tool call still ships a JSON-RPC &lt;code&gt;arguments&lt;/code&gt; object — the call is &lt;code&gt;arguments: { ... }&lt;/code&gt;, not &lt;code&gt;arguments: 7&lt;/code&gt;. What changed is everything inside that object, plus the symmetric story for what a tool can return.&lt;/p&gt;

&lt;h3&gt;
  
  
  A worked example
&lt;/h3&gt;

&lt;p&gt;Picture a &lt;code&gt;book_flight&lt;/code&gt; tool. Before SEP-2106, its &lt;code&gt;inputSchema&lt;/code&gt; could declare four fields — &lt;code&gt;from&lt;/code&gt;, &lt;code&gt;to&lt;/code&gt;, &lt;code&gt;departure&lt;/code&gt;, optional &lt;code&gt;return&lt;/code&gt; — using the restricted vocabulary the spec named (&lt;code&gt;type&lt;/code&gt;, &lt;code&gt;properties&lt;/code&gt;, &lt;code&gt;required&lt;/code&gt;). To express "round-trip flights require &lt;code&gt;return&lt;/code&gt;, one-way flights forbid it," the author had three options: split into two tools (&lt;code&gt;book_one_way&lt;/code&gt;, &lt;code&gt;book_round_trip&lt;/code&gt;), leave a permissive schema and write a paragraph of &lt;code&gt;description&lt;/code&gt; prose, or both. After SEP-2106, the same tool fits in one schema using composition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"departure"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"return"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"roundTrip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"boolean"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"departure"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"roundTrip"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"oneOf"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"roundTrip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"const"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"return"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"roundTrip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"const"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"not"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"return"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The new schema reaches for &lt;code&gt;oneOf&lt;/code&gt;, two branch subschemas with their own &lt;code&gt;properties&lt;/code&gt; and &lt;code&gt;required&lt;/code&gt;, a &lt;code&gt;not&lt;/code&gt;, and two &lt;code&gt;const&lt;/code&gt; guards — every one of those keywords lived in the JSON Schema 2020-12 standard already, but none were in the wire vocabulary MCP would accept before this SEP. The runtime can now &lt;strong&gt;reject a malformed call before it ever reaches the tool&lt;/strong&gt;, instead of relying on the LLM to read and honor a paragraph of English in the &lt;code&gt;description&lt;/code&gt; field.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this lands now
&lt;/h3&gt;

&lt;p&gt;Two pressures converged. First, tool authors kept hitting the prose-vs-schema boundary: every nontrivial real-world tool grew a &lt;code&gt;description&lt;/code&gt; paragraph explaining what its schema couldn't say, and that paragraph then needed to be re-explained to every model that called the tool. Second, the structured tool I/O step of the agent stack — where output validation lives — assumed an object-rooted &lt;code&gt;structuredContent&lt;/code&gt; shape that forced tools returning a list (&lt;code&gt;list_files&lt;/code&gt;) or a scalar (&lt;code&gt;count_rows&lt;/code&gt;) to wrap their result in &lt;code&gt;{ "value": ... }&lt;/code&gt;. Both pressures land at the schema vocabulary, so SEP-2106 widens both at once.&lt;/p&gt;

&lt;p&gt;The rollout story is more nuanced than "strictly loosening." Existing tools that already used only the previously-allowed keywords keep working unchanged, and the wire protocol stays backward-compatible at the schema vocabulary level — composition keywords like &lt;code&gt;oneOf&lt;/code&gt; are legal JSON either way, so an older client that doesn't validate them will simply skip the extra checks (the schema still parses, just with weaker validation). The friction is asymmetric: a newer server emitting a non-object &lt;code&gt;structuredContent&lt;/code&gt; or a primitive-rooted &lt;code&gt;outputSchema&lt;/code&gt; may be rejected by an older client whose type checks still expect an object, which is why the SEP recommends servers also emit a serialized &lt;code&gt;TextContent&lt;/code&gt; fallback for non-object results during the transition. SDK consumers also see one TypeScript source break — the narrower &lt;code&gt;{ [k]: unknown }&lt;/code&gt; type loses to plain &lt;code&gt;unknown&lt;/code&gt;, and any code that depended on the narrower type needs to widen its own annotations to match.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Goes deeper in: AI Agents → Tool Use → Structured tool I/O and AI Agents → Tool Use → MCP&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What changed in MCP SEP-2106 in one sentence?
&lt;/h3&gt;

&lt;p&gt;SEP-2106 lets MCP tool authors describe their inputs and outputs with the full JSON Schema 2020-12 keyword set — composition (&lt;code&gt;oneOf&lt;/code&gt; / &lt;code&gt;anyOf&lt;/code&gt; / &lt;code&gt;allOf&lt;/code&gt; / &lt;code&gt;not&lt;/code&gt;), conditionals (&lt;code&gt;if&lt;/code&gt; / &lt;code&gt;then&lt;/code&gt; / &lt;code&gt;else&lt;/code&gt;), and references (&lt;code&gt;$ref&lt;/code&gt; / &lt;code&gt;$defs&lt;/code&gt;) — and widens &lt;code&gt;structuredContent&lt;/code&gt; from an object-only TypeScript type to plain &lt;code&gt;unknown&lt;/code&gt;, while keeping &lt;code&gt;inputSchema&lt;/code&gt;'s root &lt;code&gt;type: "object"&lt;/code&gt; constraint unchanged.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does richer tool-schema vocabulary matter for agents?
&lt;/h3&gt;

&lt;p&gt;The wire vocabulary is the only contract the runtime can validate before traffic reaches the tool. Anything that lives in the tool's free-form &lt;code&gt;description&lt;/code&gt; prose has to be re-explained to every model that calls the tool, and the runtime can't reject a malformed call until the tool itself errors out. Pushing rules like "if &lt;code&gt;roundTrip&lt;/code&gt; is true then &lt;code&gt;return&lt;/code&gt; is required" into the schema means the SDK can reject the call before invocation and the model gets a structured error it can react to, instead of a tool-side stack trace.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does SEP-2106 break existing MCP tools?
&lt;/h3&gt;

&lt;p&gt;Existing tool definitions remain valid because the change only adds allowed keywords and widens types — nothing is removed. Compatibility is asymmetric, though: a newer server emitting a non-object &lt;code&gt;structuredContent&lt;/code&gt; or a primitive-rooted &lt;code&gt;outputSchema&lt;/code&gt; may be rejected by an older client whose type checks still expect an object. The SEP recommends servers also emit a serialized &lt;code&gt;TextContent&lt;/code&gt; fallback for non-object results during the transition. There is also one source-level TypeScript break — consumers whose generic types narrowed &lt;code&gt;structuredContent&lt;/code&gt; from &lt;code&gt;unknown&lt;/code&gt; to &lt;code&gt;{ [key: string]: unknown }&lt;/code&gt; see a type error when they upgrade SDK versions, fixed by widening the consumer's type to match.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/mcp-sep-2106-json-schema-2020-12" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>agents</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>MarginGate: Margin-Gated Verification for Batch-Invariant Decoding</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Sun, 07 Jun 2026 11:16:59 +0000</pubDate>
      <link>https://dev.to/pueding/margingate-margin-gated-verification-for-batch-invariant-decoding-1cko</link>
      <guid>https://dev.to/pueding/margingate-margin-gated-verification-for-batch-invariant-decoding-1cko</guid>
      <description>

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; The &lt;strong&gt;MarginGate&lt;/strong&gt; paper (arXiv) targets a subtle serving bug with &lt;strong&gt;margin-gated verification for batch-invariant decoding&lt;/strong&gt;: temperature-0 BF16 decoding is treated as reproducible, yet the same prompt can emit different tokens decoded alone versus inside a larger batch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Reproducibility is load-bearing for &lt;strong&gt;debugging, evals, caching, and audits&lt;/strong&gt; — yet in BF16 greedy serving, the batch a request lands in can silently change which token it emits from one run to the next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; &lt;strong&gt;Always-on FP32 verification&lt;/strong&gt; also restores determinism, but MarginGate re-checks only the sparse &lt;strong&gt;low-margin&lt;/strong&gt; steps to reach it at roughly &lt;strong&gt;2× less verification overhead&lt;/strong&gt; in the paper.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;An airport security line with a fast lane and a secondary-screening booth.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                      DECODE STEP
                          │
                 how wide is the margin?
                          │
           ┌──────────────┴──────────────┐
           │                             │
    ┌──────▼───────┐             ┌───────▼──────┐
    │  clean scan  │             │   near-tie   │
    │ wide margin  │             │  tiny margin │
    └──────┬───────┘             └───────┬──────┘
           │                             │
     FAST LANE (BF16)            SECONDARY (FP32)
     wave through                re-check the step
           │                             │
           ▼                             ▼
    ✓ same token, every         ✓ flip caught; K/V
      batch (no jitter)           column repaired
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;decode step = a traveler reaching the security checkpoint&lt;/li&gt;
&lt;li&gt;logit margin = how clearly their boarding pass scans&lt;/li&gt;
&lt;li&gt;high-margin step = a clean scan → waved through the fast lane (BF16)&lt;/li&gt;
&lt;li&gt;low-margin step = a borderline scan → pulled into secondary screening (FP32)&lt;/li&gt;
&lt;li&gt;K/V cache column repair = fixing the one mis-tagged bag before boarding&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;BF16 (bfloat16)&lt;/strong&gt; — A 16-bit floating-point format used for fast inference. It keeps FP32's exponent range but drops mantissa bits, so rounding errors are larger — enough that the &lt;strong&gt;order&lt;/strong&gt; of a sum can change the result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FP32&lt;/strong&gt; — 32-bit floating point — slower but far more precise. MarginGate uses it as the trusted reference to re-check only the steps that might be wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;logit margin&lt;/strong&gt; — The gap between the top-1 and top-2 token scores at a decode step. A &lt;strong&gt;large&lt;/strong&gt; margin means the winner is unambiguous; a &lt;strong&gt;tiny&lt;/strong&gt; margin means a small numerical nudge can flip it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;greedy decoding (temperature 0)&lt;/strong&gt; — Always emit the single highest-scoring token. People assume this is deterministic — the catch is that "highest-scoring" can change when the arithmetic changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;floating-point reduction order&lt;/strong&gt; — Summing numbers in a different order gives slightly different results in finite precision (addition isn't perfectly associative). GPU kernels pick their reduction order based on batch size — so the logits shift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;batch-invariance&lt;/strong&gt; — The property MarginGate restores: a request produces the &lt;strong&gt;same tokens&lt;/strong&gt; no matter how many other requests share its batch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;K/V cache&lt;/strong&gt; — The cached keys and values from earlier tokens. When a step is repaired, MarginGate swaps the offending &lt;strong&gt;column&lt;/strong&gt; of this cache so the rest of the sequence stays consistent. See the KV Cache module.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;continuous batching&lt;/strong&gt; — A serving technique where requests join and leave the running batch every step — which is exactly why a request's batch size (and its results) can vary run to run. See Batching.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On May 28, 2026, a paper introduced &lt;strong&gt;MarginGate&lt;/strong&gt; (arXiv 2605.30218), starting from an uncomfortable fact: temperature-0, greedy BF16 decoding is usually assumed to be &lt;strong&gt;reproducible&lt;/strong&gt;, yet the same request can return different tokens depending on how many other requests happen to share its batch. MarginGate measures that batch-induced token flips are rare, then verifies only the steps at risk. &lt;a href="https://arxiv.org/abs/2605.30218" rel="noopener noreferrer"&gt;Read the paper →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture an &lt;strong&gt;airport security line&lt;/strong&gt;. Almost every traveler has a boarding pass that scans cleanly, so the agent waves them straight through the &lt;strong&gt;fast lane&lt;/strong&gt; — that's a decode step with a wide &lt;strong&gt;logit margin&lt;/strong&gt;, where the top token wins by a mile and no amount of numerical jitter would change it. The trouble is the occasional borderline pass: a near-tie between the top two tokens. For those travelers, a tiny nudge decides which way they go — and at temperature 0, that nudge can come from something as invisible as the &lt;strong&gt;batch they were standing in&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why would the batch matter? Because the GPU sums each token's scores in a &lt;strong&gt;reduction order that depends on batch size&lt;/strong&gt;, and in &lt;strong&gt;BF16&lt;/strong&gt; addition isn't perfectly associative — re-order the sum and the last bit can change. For a confident step that is harmless. For a near-tie it can &lt;strong&gt;flip the winner&lt;/strong&gt;, so the very same prompt emits one token when decoded alone and another when it rides inside a larger batch. The root cause lives one level down, in how BF16 trades mantissa bits for speed versus FP32.&lt;/p&gt;

&lt;p&gt;MarginGate's move is to &lt;strong&gt;gate on the margin&lt;/strong&gt;. High-margin steps keep the cheap &lt;strong&gt;BF16&lt;/strong&gt; fast lane untouched. Only the sparse &lt;strong&gt;low-margin&lt;/strong&gt; steps are sent to secondary screening — a re-computation in FP32, the same verify-then-correct shape that speculative decoding uses. If the trusted FP32 result disagrees with what BF16 produced, MarginGate &lt;strong&gt;repairs&lt;/strong&gt; the step by swapping the offending column of the K/V cache so the rest of the sequence stays consistent. The expensive check fires on a handful of travelers, not the whole terminal.&lt;/p&gt;

&lt;p&gt;How much does that save? Take a &lt;strong&gt;1,000-token&lt;/strong&gt; completion (illustrative). MarginGate flags the low-margin steps — about &lt;strong&gt;18%&lt;/strong&gt;, or ~180 steps — for an FP32 re-check, while the other ~820 keep the fast path. Of those 180, only a few are genuine flips: the paper measures flip rates of &lt;strong&gt;0.3–1.3%&lt;/strong&gt; of all steps (just &lt;strong&gt;0.48%&lt;/strong&gt; for Llama-3.1-8B on MATH500), so on the order of &lt;strong&gt;3–13 tokens&lt;/strong&gt; would actually have changed. In the paper's tested settings, MarginGate catches and repairs each one. Always-on verification would instead re-run all &lt;strong&gt;1,000&lt;/strong&gt; steps in FP32 for the identical result — which is why margin-gating reports &lt;strong&gt;~2× lower overhead&lt;/strong&gt; (2.23× and 1.99× in the paper) while still restoring &lt;strong&gt;100% sequence-level determinism&lt;/strong&gt; on the models the paper tested (Llama-3.1-8B and Qwen2.5-14B).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Steps re-checked&lt;/th&gt;
&lt;th&gt;Determinism&lt;/th&gt;
&lt;th&gt;Relative overhead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Trust BF16 (no verify)&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;✗ batch-dependent&lt;/td&gt;
&lt;td&gt;1× (baseline)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Always-on FP32 verify&lt;/td&gt;
&lt;td&gt;every step&lt;/td&gt;
&lt;td&gt;✓ 100%&lt;/td&gt;
&lt;td&gt;~2× the gate, varies by model (paper)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MarginGate (margin-gated)&lt;/td&gt;
&lt;td&gt;~15–18% (paper)&lt;/td&gt;
&lt;td&gt;✓ 100%&lt;/td&gt;
&lt;td&gt;~2× lower than always-on (2.23× / 1.99×, paper)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The deeper lesson is that &lt;strong&gt;temperature 0 was never a determinism guarantee&lt;/strong&gt; — it only fixes the sampling rule, not the arithmetic underneath it. MarginGate is cheap precisely because the failure is rare and &lt;em&gt;predictable from the margin&lt;/em&gt;: you don't have to distrust every token, just the few that are genuinely on the fence.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Goes deeper in: LLM Internals → Batching → Continuous Batching, and LLM Serving → Serving Metrics &amp;amp; SLOs.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is batch-invariant decoding?
&lt;/h3&gt;

&lt;p&gt;Batch-invariant decoding means a request produces the exact same tokens regardless of how many other requests share its GPU batch. It is the property most people assume temperature-0 greedy decoding already has — and MarginGate is a method for restoring it cheaply when it has quietly broken.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does temperature-0 BF16 inference give different tokens in a batch?
&lt;/h3&gt;

&lt;p&gt;Because the GPU sums each step's scores in a reduction order that depends on batch size, and BF16 addition isn't perfectly associative, the logits shift by a tiny amount. On a near-tie between the top two tokens (a low logit margin), that tiny shift can flip which token wins, so the same prompt can emit a different token alone versus inside a larger batch. The paper measures these flips at roughly 0.3–1.3% of steps on the models it tested.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is MarginGate different from always-on FP32 verification?
&lt;/h3&gt;

&lt;p&gt;Always-on verification re-checks every decode step in FP32; it restores determinism but carries roughly 2× the verification overhead MarginGate does in the paper. MarginGate verifies only the sparse low-margin steps — about 15–18% in the paper — and repairs a true flip by swapping the offending K/V cache column, reaching the same determinism the paper reports (100% sequence-level on Llama-3.1-8B and Qwen2.5-14B).&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/margingate-batch-invariant-decoding" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>MCP 2026-07-28 RC: Stateless Transport</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Sat, 06 Jun 2026 11:16:43 +0000</pubDate>
      <link>https://dev.to/pueding/mcp-2026-07-28-rc-stateless-transport-4ma9</link>
      <guid>https://dev.to/pueding/mcp-2026-07-28-rc-stateless-transport-4ma9</guid>
      <description>

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; The &lt;strong&gt;MCP 2026-07-28 release candidate&lt;/strong&gt; reworks transport so the &lt;code&gt;tools/call&lt;/code&gt; request itself carries every field a server needs to handle it — protocol version, capabilities, auth context, routing keys. The headline framing is &lt;strong&gt;stateless transport&lt;/strong&gt;: any server in a fleet can serve any request, with no per-session pin to a specific instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; The previous design forced &lt;strong&gt;sticky routing&lt;/strong&gt;: a session was bound to a single server for its lifetime, so load balancers had to either pin connections by session ID or replicate session state out-of-band. Horizontal scaling, blue/green deploys, and crash-recovery all suffered. The 2026-07-28 RC is the headline change of the next stable MCP spec — and it touches every harness that talks to MCP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; Earlier MCP transports treated the first request as a handshake that established server-local state; subsequent requests had to land on the same instance. The new design drops the in-process session: each request is self-contained, and when long-lived cross-request state is genuinely needed (subscriptions, sampling sessions, auth tokens) it lives in a &lt;strong&gt;shared store&lt;/strong&gt; any server can read — not in one server's memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A self-addressed envelope at a post office with many windows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    tools/call  (one letter)
                              │
                ┌─────────────┴─────────────┐
                │                           │
        ┌───────▼───────┐           ┌───────▼───────┐
        │ sticky clerk  │           │ self-addressed│
        │ (one window)  │           │   envelope    │
        └───────┬───────┘           └───────┬───────┘
                │                           │
       state in HER drawer          address + tracking
       notebook, hers alone         ID on the envelope
                │                           │
                ▼                           ▼
       ✗ wait at her window         ✓ any open window
         again; if she's              serves it — open
         out, trail is lost           ten more, all equal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;tools/call = handing a letter to a clerk&lt;/li&gt;
&lt;li&gt;sticky routing = a clerk who only remembers your shipment from a notebook on her desk — come back to HER for status&lt;/li&gt;
&lt;li&gt;self-addressed request = a letter with the destination, sender, and tracking ID printed on the envelope — any window reads it&lt;/li&gt;
&lt;li&gt;shared session store (when needed) = the post office's central tracking database — any clerk queries it&lt;/li&gt;
&lt;li&gt;horizontal scaling = open ten more windows in the same office; any one serves you&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MCP&lt;/strong&gt; — The &lt;strong&gt;Model Context Protocol&lt;/strong&gt; — an open protocol for connecting LLM hosts to external tool servers. The host runs the model and the agent's tool loop; servers expose tools, resources, and prompts over JSON-RPC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SEP&lt;/strong&gt; — &lt;strong&gt;Specification Enhancement Proposal&lt;/strong&gt; — MCP's RFC-style change document. The 2026-07-28 RC bundles twenty-two scoped SEPs covering the transport rework, the new Extensions framework, MCP Apps, Tasks, and authorization fixes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sticky routing&lt;/strong&gt; — A load-balancing pattern where a session ID is pinned to a single backend instance for its lifetime. The load balancer hashes the session ID and always routes to the same server. Works fine until that one server is overloaded, restarted, or replaced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-contained request&lt;/strong&gt; — A request shape where every field the server needs to handle it — protocol version, declared client capabilities, routing keys, auth context — travels with the request itself. The server does not assume any prior state from earlier messages on the same socket.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared session store&lt;/strong&gt; — An out-of-process store (Redis-equivalent, a database, an object store) that any server in the fleet can read and write. Used for the small subset of MCP interactions that genuinely need cross-request state — long-lived subscriptions, sampling sessions, OAuth tokens. The transport itself is still stateless; the store is an &lt;em&gt;implementation pattern&lt;/em&gt; for state that has to survive across requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tasks extension (SEP-2663)&lt;/strong&gt; — The async-handle model for long-running tools: a server returns a &lt;code&gt;Task&lt;/code&gt; handle the client drives with &lt;code&gt;tasks/get&lt;/code&gt;, &lt;code&gt;tasks/update&lt;/code&gt;, &lt;code&gt;tasks/cancel&lt;/code&gt;. It composes naturally with stateless transport because the task handle is the only cross-request key the client needs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On May 22, 2026, the MCP project landed &lt;a href="https://github.com/modelcontextprotocol/specification/pull/2750" rel="noopener noreferrer"&gt;PR #2750&lt;/a&gt; — the blog announcement for the 2026-07-28 specification release candidate. The post leads with the &lt;strong&gt;stateless transport rework&lt;/strong&gt; as the headline change, with a before/after HTTP example showing a self-contained &lt;code&gt;tools/call&lt;/code&gt; request. Extensions, MCP Apps, and Tasks follow as the new capability story; the authorization changes are summarized by the failure modes they fix rather than enumerated SEP-by-SEP. All twenty-two scoped SEPs are linked from the announcement.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture the post office with many windows. The slow path is the &lt;strong&gt;sticky clerk&lt;/strong&gt;: you hand your letter to clerk #3, and clerk #3 jots the details in a notebook only she keeps in her drawer. If you come back to check on your shipment, you have to wait at &lt;em&gt;her&lt;/em&gt; window — none of the other clerks can tell you anything. If clerk #3 is busy, or goes on break, or quits, the trail of your shipment goes with her. The line at her window grows; the other windows are quiet. &lt;strong&gt;That is exactly what sticky-routed MCP looks like today.&lt;/strong&gt; The agent's tool-use loop opens a session, the load balancer pins that session to one server, and every follow-up call has to land on that same server. One server gets the traffic; the others sit idle.&lt;/p&gt;

&lt;p&gt;The fast path is &lt;strong&gt;the self-addressed envelope&lt;/strong&gt;. You write the destination, the sender, and a tracking ID on the front of every letter, and the post office stops needing any one clerk to remember anything about your shipment. &lt;strong&gt;Any open window will do.&lt;/strong&gt; That is the 2026-07-28 framing: each &lt;code&gt;tools/call&lt;/code&gt; carries the protocol version it expects, the client capabilities it declared, any routing keys the server fleet needs, and the auth context — all in the request itself. The server reads the envelope and acts. No drawer notebook. No "come back to me." A second request half a second later can land on a different server entirely and produce identical behavior.&lt;/p&gt;

&lt;p&gt;There is a real subtlety worth saying out loud. A few MCP interactions genuinely do need cross-request memory — long-lived subscriptions, sampling sessions, OAuth tokens that have to outlive a single call. The new design does not pretend those don't exist. It externalizes them: the central tracking database the metaphor mentions is a &lt;strong&gt;shared store&lt;/strong&gt; (a Redis-equivalent, a database, an object store) that any server queries when it needs to hydrate that bit of cross-request state. The transport is still stateless — the request itself is self-contained — and the &lt;em&gt;implementation pattern&lt;/em&gt; of a shared store is what makes the small slice of stateful behavior work across a fleet. Mixing those two ideas up is easy and worth keeping straight: the protocol's change is at the transport layer; the shared store is one way servers can choose to persist what little state has to outlive a request.&lt;/p&gt;

&lt;p&gt;The capacity argument writes itself. Consider 300 concurrent agent sessions, each holding open MCP traffic at ~2 calls per second, hitting a fleet of 3 servers. Sticky routing assigns each session to one server at session open. Distribution is rarely uniform — three or four "power user" sessions can pin one server's load near saturation while the others sit at 10-20%. Numerically: a typical sticky-imbalance run might leave &lt;strong&gt;S1 at ~92% utilization while S2 and S3 sit at ~8% and ~41%&lt;/strong&gt; &lt;em&gt;(illustrative)&lt;/em&gt;. Under stateless transport with the same workload, the load balancer can spray every call independently. The same 600 calls/sec land on three servers at &lt;strong&gt;~49% each&lt;/strong&gt; &lt;em&gt;(illustrative)&lt;/em&gt; — a ~1.9× improvement in usable fleet headroom before any vertical scaling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the rework earns its keep
&lt;/h2&gt;

&lt;p&gt;Sticky routing's failure modes are well-known in the agent harness world: one hot server, blue/green deploys that have to drain sessions for minutes, crash recovery that can't transparently re-route. The 2026-07-28 RC closes all three at the transport level. Self-contained requests do not pin to anything, so a deploy that rolls a server out of rotation finishes in seconds — pending requests just hit the next server. A server that crashes drops its in-flight requests, and the client retries against the fleet — the next call lands somewhere else and proceeds. The only state that needs to survive the crash is whatever the workload chose to put in the shared store, which is the small minority of interactions.&lt;/p&gt;

&lt;p&gt;The shape of what the RC actually changes is concrete. The table below contrasts the legacy and new transport.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Sticky-routed transport (legacy)&lt;/th&gt;
&lt;th&gt;Stateless transport (2026-07-28 RC)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Session lifetime&lt;/td&gt;
&lt;td&gt;Bound to one server for the session's life&lt;/td&gt;
&lt;td&gt;No per-session server binding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routing key&lt;/td&gt;
&lt;td&gt;Session ID hashed to a specific instance&lt;/td&gt;
&lt;td&gt;None — any instance, any request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First request&lt;/td&gt;
&lt;td&gt;Handshake that creates server-local state&lt;/td&gt;
&lt;td&gt;Self-contained, no implicit setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-request state&lt;/td&gt;
&lt;td&gt;In server memory&lt;/td&gt;
&lt;td&gt;In a shared store, only when needed (subscriptions, sampling, auth)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Horizontal scale-out&lt;/td&gt;
&lt;td&gt;Awkward — uneven load by session hash&lt;/td&gt;
&lt;td&gt;Native — load balancer sprays calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Server restart&lt;/td&gt;
&lt;td&gt;Drops the session; client must rebuild&lt;/td&gt;
&lt;td&gt;Drops in-flight; retry hits any other server&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A related design point is worth knowing. The Tasks extension (SEP-2663) ships a complementary idea one layer up: it gives the client a long-lived &lt;code&gt;taskId&lt;/code&gt; it can poll across reconnects. SEP-2663 needed the transport rework to be fully useful — a &lt;code&gt;taskId&lt;/code&gt; polled across reconnects only works if the next &lt;code&gt;tasks/get&lt;/code&gt; doesn't have to land on the &lt;em&gt;same&lt;/em&gt; server that issued the handle. Stateless transport is what makes that work: the &lt;code&gt;taskId&lt;/code&gt; is the only cross-request key the client carries, the server fleet hydrates the task's state from the shared store, and the polling call goes to whichever server is least busy.&lt;/p&gt;

&lt;p&gt;The boundary of what the RC changes is the transport itself, not the protocol semantics. Tools still return tool results; resources still return resource contents; the wire format of a method call is the same JSON-RPC envelope. What changes is what a server is &lt;em&gt;allowed to assume&lt;/em&gt;: nothing about prior calls on the same connection. That single discipline is enough to make every harness operator's life easier and to make the parallel-tool-call patterns the Cost &amp;amp; Latency module recommends actually achievable in a fleet.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What does stateless transport mean in the MCP 2026-07-28 RC?
&lt;/h3&gt;

&lt;p&gt;It means the &lt;code&gt;tools/call&lt;/code&gt; request itself carries every field a server needs to handle it — protocol version, declared client capabilities, routing keys, auth context. The server is not allowed to assume any state from prior calls on the same connection. A consequence is that any server in a fleet can serve any request, so no sticky session binding is needed at the load balancer.&lt;/p&gt;

&lt;h3&gt;
  
  
  What replaces sticky routing for state that genuinely has to live across requests?
&lt;/h3&gt;

&lt;p&gt;A shared store. The small subset of MCP interactions that need cross-request memory — long-lived subscriptions, sampling sessions, OAuth tokens — moves out of any one server's process and into a Redis-equivalent (or database, or object store) the entire fleet reads. The transport itself is still stateless; the shared store is an implementation pattern for the slice of state that must survive across requests.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does the transport rework relate to the Tasks extension (SEP-2663)?
&lt;/h3&gt;

&lt;p&gt;They compose. SEP-2663 lets a server return a long-lived &lt;code&gt;taskId&lt;/code&gt; the client polls later. Stateless transport is what makes that poll robust across a fleet: the next &lt;code&gt;tasks/get&lt;/code&gt; does not need to land on the same server that issued the handle. Together they let an agent harness survive server restarts, blue/green deploys, and load-balancer reshuffles without any session affinity.&lt;/p&gt;

&lt;h3&gt;
  
  
  What needs to change in existing MCP server code to support stateless transport?
&lt;/h3&gt;

&lt;p&gt;Concretely: stop reading state from the connection. Any field the server used to learn once at session-establish and remember for the lifetime of the connection — declared client capabilities, protocol version, auth identity, routing tenant — must now be read from each &lt;code&gt;tools/call&lt;/code&gt; request instead. Servers that already drove every decision off the incoming request payload need minimal changes. Servers that built up per-connection caches (negotiated capabilities, OAuth introspection results, tenant routing decisions) need to externalize those caches into a shared store the whole fleet reads, or push them to the client to re-send. Most production MCP servers will land in the middle: a few small migrations rather than a rewrite.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does stateless transport affect MCP authentication and authorization?
&lt;/h3&gt;

&lt;p&gt;Auth context becomes a per-request field rather than a per-session attribute. The 2026-07-28 RC expects every &lt;code&gt;tools/call&lt;/code&gt; to carry whatever proof the server needs — a bearer token, a signed capability, a tenant identifier — so any server in the fleet can verify the call without consulting prior connection state. The net effect on a production stack is that a load-balancer reshuffle, a server restart, or a blue/green deploy mid-flight no longer drops the agent's authorization, because no server held it in process memory in the first place. Token introspection caches still live somewhere, but in a shared store the entire fleet shares (Redis-equivalent), not in any single server's per-connection state.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/mcp-2026-07-28-stateless-transport" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>agents</category>
      <category>devops</category>
    </item>
    <item>
      <title>Token Budgets Paper: Affine-Typed Budget Ownership</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Fri, 05 Jun 2026 11:16:05 +0000</pubDate>
      <link>https://dev.to/pueding/token-budgets-paper-affine-typed-budget-ownership-4elj</link>
      <guid>https://dev.to/pueding/token-budgets-paper-affine-typed-budget-ownership-4elj</guid>
      <description>

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; The &lt;strong&gt;Token Budgets&lt;/strong&gt; paper catalogs 63 real LLM-agent cost-overrun incidents and ships a Rust crate that models a token/cost budget as an &lt;strong&gt;affine-typed&lt;/strong&gt; (use-at-most-once) resource the compiler tracks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Cost is a production failure mode, and the paper finds it's &lt;strong&gt;multi-agent delegation&lt;/strong&gt; — not single agents — that drives the overruns: fan out work to parallel sub-agents and each one quietly reserves budget against a cap nobody is decrementing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; Versus a &lt;strong&gt;runtime budget guard&lt;/strong&gt; — an &lt;code&gt;assert&lt;/code&gt; that fires at spend time, after the tokens are already committed — affine typing makes an overrun a &lt;strong&gt;compile-time&lt;/strong&gt; error, so the unsafe code path can't ship in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;One prepaid gift card a group splits at dinner.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  ONE $1,000 GIFT CARD
                          │
          ┌───────────────┴───────────────┐
          │                               │
  ┌───────▼───────┐               ┌───────▼───────┐
  │  PHOTOCOPY IT │               │   SPLIT IT    │
  │ (static copy) │               │ (affine move) │
  └───────┬───────┘               └───────┬───────┘
          │                               │
 4 copies x $350 each         $300+$220+$260+$220
 nobody debits the card       money moves out, no copy
          │                               │
          ▼                               ▼
   ✗ bill = $1,400               ✓ total = $1,000
     over a $1,000 cap             bounded to the cap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;token budget = the card's balance ($1,000)&lt;/li&gt;
&lt;li&gt;sub-agent = a friend who wants to spend&lt;/li&gt;
&lt;li&gt;static reservation = everyone photocopies the card and assumes the full balance&lt;/li&gt;
&lt;li&gt;overshoot = four copies each spend $350 — the bill hits $1,400 on a $1,000 card&lt;/li&gt;
&lt;li&gt;affine ownership = split the card into prepaid sub-cards — money moves out, can't be photocopied&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Token / cost budget&lt;/strong&gt; — A hard cap on how many tokens (and therefore dollars) one agent task is allowed to spend. Where those tokens go is the first thing a production agent has to account for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Affine type&lt;/strong&gt; — A type that may be used &lt;strong&gt;at most once&lt;/strong&gt;. The compiler tracks the value's ownership, so you can &lt;strong&gt;move&lt;/strong&gt; or &lt;strong&gt;split&lt;/strong&gt; it but never &lt;strong&gt;copy&lt;/strong&gt; it — exactly the property a budget needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delegation fan-out&lt;/strong&gt; — When an orchestrator hands a task to several sub-agents running in parallel. Each child needs some budget, and the question is who keeps the shared total honest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Static vs adaptive reservation&lt;/strong&gt; — Static reservation grabs a fixed slice up front and &lt;strong&gt;over-provisions 4–6×&lt;/strong&gt;; adaptive reservation re-estimates per call and over-provisions &lt;strong&gt;2.11×&lt;/strong&gt; — fewer wasted tokens, but still a runtime accounting trick.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compile-time vs runtime check&lt;/strong&gt; — A runtime check tests the budget while the agent runs (too late to un-spend); a compile-time check rejects the unsafe program before it ever runs. Affine typing moves the cap into the second category.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cohen's kappa&lt;/strong&gt; — An inter-rater agreement score (1.0 = perfect). The paper's 8-category failure taxonomy reaches &lt;strong&gt;0.837&lt;/strong&gt;, i.e. two independent reviewers classified the incidents almost identically.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On June 2, 2026, the &lt;strong&gt;Token Budgets&lt;/strong&gt; paper landed: an empirical catalog of &lt;strong&gt;63 production cost-overrun incidents&lt;/strong&gt; in LLM-agent systems, pulled from a review of 21 orchestration frameworks spanning 2023–2026 and clustered into an 8-category failure taxonomy (inter-rater Cohen's kappa 0.837). As a mitigation, the authors ship a 1,180-line Rust crate that uses affine-type ownership to turn budget violations into compile-time errors. In controlled tests, single-agent runs never overshot (0/30) while multi-agent asyncio delegation overshot every time (30/30); the mitigated runs then logged 0 cap violations across 160 live-API tests. &lt;a href="https://arxiv.org/abs/2606.04056" rel="noopener noreferrer"&gt;Read the paper →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture the group dinner. There's one prepaid gift card with $1,000 on it, and four friends who all want to order. The cheap, lazy move is for everyone to photocopy the card and assume they each have the full balance — four copies, four people each cheerfully spending $350, and a $1,400 bill arrives against a card that only ever held $1,000. The card was never &lt;em&gt;debited&lt;/em&gt; as people spent, so nothing stopped the overshoot until the bill came. &lt;strong&gt;Affine-typed budget ownership&lt;/strong&gt; is the opposite rule: there is exactly one card, and the only legal operation is to split it into prepaid sub-cards — the money physically moves out of the original, and a photocopy simply isn't allowed.&lt;/p&gt;

&lt;p&gt;In an agent system the "photocopy" bug is a delegation fan-out: an orchestrator spawns parallel sub-agents, and each one reserves a chunk of the token budget against a cap that no single owner is decrementing. The paper's headline number is that this pattern overshot 30 out of 30 runs, while a single agent — which spends against one running total — overshot 0 of 30. The fix is to make the budget an &lt;strong&gt;affine&lt;/strong&gt; value: the Rust compiler tracks it as use-at-most-once, so a code path where two sub-agents could both hold the same budget &lt;strong&gt;fails to type-check&lt;/strong&gt;. The cap is enforced &lt;em&gt;by construction&lt;/em&gt; rather than by an &lt;code&gt;assert&lt;/code&gt; that fires after the tokens are already gone — the same shift from runtime to compile-time that separates a retry loop that quietly re-bills you from one that can't.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where the budget actually goes
&lt;/h3&gt;

&lt;p&gt;A back-of-envelope walk-through &lt;em&gt;(illustrative cap and slice sizes; the overshoot and over-reservation counts are the paper's)&lt;/em&gt;. Say the shared cap is 1,000 tokens and the orchestrator fans out to four sub-agents. Under static reservation each child grabs a fixed 350, and because the reservations are effectively copies, the total claimed is &lt;strong&gt;4 × 350 = 1,400&lt;/strong&gt; — a &lt;strong&gt;400-token (40%) overshoot&lt;/strong&gt; that nothing rejects until the spend lands. Make the budget affine and the same 1,000 is split into owned slices — say &lt;strong&gt;300 + 220 + 260 + 220 = 1,000&lt;/strong&gt; — where the fourth claim can only take what the first three left behind. The sum is bounded to the cap by construction, which is the property the paper's Rust crate enforces: across &lt;strong&gt;160 live-API tests it logged 0 cap violations&lt;/strong&gt;, where unbounded multi-agent delegation had overshot all 30 runs. Static reservation's habit of grabbing 4–6× the budget it needs (adaptive trims that to 2.11×) is the same waste, viewed from the other side.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;When the cap is checked&lt;/th&gt;
&lt;th&gt;Multi-agent overshoot&lt;/th&gt;
&lt;th&gt;Over-reservation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Runtime budget guard&lt;/td&gt;
&lt;td&gt;at spend time — after tokens commit&lt;/td&gt;
&lt;td&gt;possible (the default failure)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Static reservation&lt;/td&gt;
&lt;td&gt;up front, no shared cap&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;30/30 runs&lt;/strong&gt; &lt;em&gt;(Token Budgets paper)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;~4–6× &lt;em&gt;(paper)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adaptive reservation&lt;/td&gt;
&lt;td&gt;re-estimated per call&lt;/td&gt;
&lt;td&gt;not reported &lt;em&gt;(paper)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;~2.11× &lt;em&gt;(paper)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Affine-typed ownership&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;compile time&lt;/strong&gt; — won't type-check&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0 violations / 160 tests&lt;/strong&gt; &lt;em&gt;(paper)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;bounded to the cap&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The catch is that this only buys you safety where you can express ownership in the type system — a Rust crate gets it for free, a Python orchestrator built on &lt;code&gt;asyncio.gather&lt;/code&gt; does not, which is exactly where the paper's 30/30 overshoots came from. But the lesson generalizes past the language: in a multi-agent team the budget is a shared resource, and &lt;em&gt;who is allowed to hold it, and whether they can copy it,&lt;/em&gt; is a design decision — not something to discover when the bill arrives.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Goes deeper in: Agent Engineering → Cost &amp;amp; Latency Engineering → Where the tokens go&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/streamma-streaming-inter-agent-reasoning" rel="noopener noreferrer"&gt;StreamMA — Streaming inter-agent reasoning&lt;/a&gt; — a &lt;em&gt;different&lt;/em&gt; multi-agent cost: wall-clock latency from serial handoffs, cut by pipelining rather than by bounding tokens&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/maestro-rl-orchestrator-frozen-experts" rel="noopener noreferrer"&gt;Maestro — RL orchestrator over frozen experts&lt;/a&gt; — the orchestrator-over-sub-agents topology where this fan-out budget problem lives&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/efc-feedback-quality-scaling-law" rel="noopener noreferrer"&gt;EFC — feedback-quality scaling law&lt;/a&gt; — what actually predicts agent-harness success, the other half of "spend the budget well"&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is affine-typed budget ownership?
&lt;/h3&gt;

&lt;p&gt;It models an agent's token or cost budget as an affine-typed value — one the compiler allows you to use at most once. You can split the budget into smaller owned slices or move it to a sub-agent, but you can't copy it, so two parts of the system can never both spend against the same cap. The Token Budgets paper implements this in a Rust crate and reports 0 cap violations across 160 live-API tests.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why do multi-agent systems overshoot their token budget?
&lt;/h3&gt;

&lt;p&gt;Because delegation fans the work out to parallel sub-agents that each reserve budget against a cap no single owner is decrementing. The reservations behave like copies, so their sum can exceed the real limit. In the paper's controlled tests, multi-agent asyncio delegation overshot 30 of 30 runs while a single agent — spending against one running total — overshot 0 of 30.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is a compile-time budget check different from a runtime guard?
&lt;/h3&gt;

&lt;p&gt;A runtime guard (an assert or limiter) checks the budget while the agent runs, which is too late to un-spend tokens already committed. A compile-time check rejects the unsafe program before it runs: with affine typing, a code path where two sub-agents could hold the same budget simply fails to type-check, so the cap is enforced by construction rather than by hoping the guard fires in time.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/token-budgets-affine-typed-budget-ownership" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Microsoft MAI-Code-1-Flash: Adaptive Solution-Length Control</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Thu, 04 Jun 2026 11:16:35 +0000</pubDate>
      <link>https://dev.to/pueding/microsoft-mai-code-1-flash-adaptive-solution-length-control-2fdp</link>
      <guid>https://dev.to/pueding/microsoft-mai-code-1-flash-adaptive-solution-length-control-2fdp</guid>
      <description>

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; Microsoft's first in-house coding model, &lt;strong&gt;MAI-Code-1-Flash&lt;/strong&gt; (launched at Build 2026 alongside the MAI-Thinking-1 reasoner), ships &lt;strong&gt;adaptive solution-length control&lt;/strong&gt; — the model decides &lt;strong&gt;how many reasoning tokens to spend&lt;/strong&gt; based on how hard the task is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Reasoning tokens are the dominant cost of a thinking model: every token is a &lt;strong&gt;decode step&lt;/strong&gt; you pay for in latency and dollars. Spending the same long chain on a one-line fix as on a cross-file refactor wastes most of that budget — adaptive length spends it only where it buys accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; A &lt;strong&gt;fixed-budget reasoner&lt;/strong&gt; thinks to roughly the same length on every prompt; adaptive control &lt;strong&gt;stops at the point the answer is reached&lt;/strong&gt;, so Microsoft reports the model hitting its scores with &lt;strong&gt;up to 60% fewer tokens&lt;/strong&gt; than a flat budget would burn.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A test-taker who budgets minutes by how hard each question looks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              SAME EXAM: easy · medium · hard
                            │
            FIXED BUDGET    │    ADAPTIVE CONTROL
            same slot/Q     │    slot sized to Q
                            │
      easy   ██████████ 10m │ easy   █ 0.5m
      medium ██████████ 10m │ medium █████ 6m
      hard   ██████████ 10m │ hard   ██████████ 10m
                            │
       ✗ burns minutes on   │  ✓ same score,
         the easy ones      │    far fewer minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;task = one exam question&lt;/li&gt;
&lt;li&gt;reasoning tokens = minutes spent working it out&lt;/li&gt;
&lt;li&gt;fixed budget = giving every question the same long time slot&lt;/li&gt;
&lt;li&gt;adaptive control = sizing the minutes to each question's difficulty&lt;/li&gt;
&lt;li&gt;answer reached = the moment you have it, so you move on&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Reasoning tokens&lt;/strong&gt; — The intermediate "thinking" tokens a model generates one at a time before its final answer (the chain-of-thought). More reasoning tokens = more decode cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution length&lt;/strong&gt; — How long that reasoning chain runs before the model commits to an answer. &lt;strong&gt;Adaptive solution-length control&lt;/strong&gt; lets the model choose this length per task instead of using a fixed cap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test-time compute&lt;/strong&gt; — Compute spent at &lt;strong&gt;inference&lt;/strong&gt; (not training) — chiefly by generating more reasoning tokens. Spending more usually helps hard problems and is wasted on easy ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SWE-Bench Pro / Verified&lt;/strong&gt; — Benchmarks of real GitHub issues a model must resolve with working code. Microsoft reports MAI-Code-1-Flash at &lt;strong&gt;51.2%&lt;/strong&gt; on SWE-Bench Pro vs Claude Haiku 4.5's 35.2%, using &lt;strong&gt;up to 60% fewer tokens&lt;/strong&gt; on SWE-Bench Verified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sparse MoE&lt;/strong&gt; — Mixture-of-Experts: each token is routed through a small subset of "expert" sub-networks. MAI-Thinking-1, the reasoner alongside the coding model, is a sparse MoE with &lt;strong&gt;35B active parameters&lt;/strong&gt; and a 256K context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Underthinking&lt;/strong&gt; — Stopping the chain too early and committing to a wrong answer — the failure mode a fixed-&lt;em&gt;minimum&lt;/em&gt; budget risks, and the reason a good stop signal is the hard part of adaptive length.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On June 2, 2026, at Build 2026, Microsoft introduced its first in-house frontier models — MAI-Thinking-1 (a 35B-active sparse MoE reasoner with a 256K context, which Microsoft says was trained from scratch on licensed data with no distillation from third-party models) and &lt;strong&gt;MAI-Code-1-Flash&lt;/strong&gt;, a small, inference-efficient coding model built end-to-end by Microsoft and rolling out to GitHub Copilot users in VS Code. MAI-Code-1-Flash reportedly leads Claude Haiku 4.5 by 16 points on SWE-Bench Pro (51.2% vs 35.2%) while using &lt;strong&gt;up to 60% fewer tokens&lt;/strong&gt;, which it credits to &lt;em&gt;adaptive solution-length control&lt;/em&gt;. &lt;a href="https://microsoft.ai/news/introducing-mai-thinking-1/" rel="noopener noreferrer"&gt;Read the announcement →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture the test-taker for a second. Two students sit the same exam. The first was told to spend exactly ten minutes per question — so she burns the full ten on "2 + 2," sits there second-guessing a settled answer, and runs short on the proof at the end. The second reads each question, sizes up the effort, and moves on the moment she's sure — thirty seconds on the arithmetic, the full ten on the proof. Same paper, same score, far less time. &lt;strong&gt;Adaptive solution-length control&lt;/strong&gt; is the second student: the model spends its reasoning where difficulty actually demands it, instead of paying a flat tax on every task.&lt;/p&gt;

&lt;p&gt;Under the hood, the "minutes" are &lt;strong&gt;reasoning tokens&lt;/strong&gt;. A thinking model generates its chain-of-thought one token at a time before answering, and every one of those tokens is a decode step you pay for in latency and dollars. A fixed budget sets one length for all prompts; adaptive control instead decides how long to keep thinking and, crucially, when to stop. Microsoft hasn't disclosed the exact controller — whether the length is learned, predicted up front, or a learned stop signal mid-chain — so treat the &lt;em&gt;mechanism&lt;/em&gt; as undisclosed; what's reported is the outcome: the same benchmark scores at a fraction of the tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where the tokens actually go
&lt;/h3&gt;

&lt;p&gt;A back-of-envelope walk-through &lt;em&gt;(illustrative numbers; the 60% figure is Microsoft's)&lt;/em&gt;. Take three Copilot tasks: an easy one-line fix, a medium multi-step bug, and a hard cross-file refactor. A fixed budget of ~2,000 reasoning tokens spends all three the same way → ~6,000 tokens total, even though the easy fix had its answer after ~200. Adaptive control stops each chain at its answer — roughly ~200 + ~650 + ~1,650 ≈ &lt;strong&gt;~2,500 tokens&lt;/strong&gt; — for the &lt;em&gt;same&lt;/em&gt; result. That's &lt;strong&gt;~58% fewer tokens&lt;/strong&gt; in this toy mix, right in line with the up to 60% fewer Microsoft reports. The hard task barely changes; the savings come almost entirely from not over-thinking the easy and medium ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three ways to set the reasoning length
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Easy task&lt;/th&gt;
&lt;th&gt;Hard task&lt;/th&gt;
&lt;th&gt;Main risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fixed-max budget&lt;/td&gt;
&lt;td&gt;thinks far past the answer&lt;/td&gt;
&lt;td&gt;fits — has room&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;over-thinking&lt;/strong&gt;: burns tokens it doesn't need&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fixed-min budget&lt;/td&gt;
&lt;td&gt;fits — short is fine&lt;/td&gt;
&lt;td&gt;cut off too early&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;underthinking&lt;/strong&gt;: commits to wrong answers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Adaptive control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;short chain&lt;/td&gt;
&lt;td&gt;long chain&lt;/td&gt;
&lt;td&gt;needs a reliable &lt;strong&gt;stop signal&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The catch lives in that last cell. A fixed budget is dumb but safe; adaptive length is only as good as its sense of &lt;em&gt;when it's done&lt;/em&gt;. Stop one token too early on a hard task and you get underthinking — a confident wrong answer that's worse than a slow right one. That's why the headline number is a coding model's: in software, a test or verifier can often tell the model whether it's actually done, giving the stop signal something concrete to lean on. The win is real and specific — &lt;strong&gt;fewer reasoning tokens for the same accuracy&lt;/strong&gt; — and it rides entirely on getting that stop right.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Goes deeper in: AI Agents → Planning &amp;amp; Reflection → Reasoning budget&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/compute-where-it-counts-per-token-compute" rel="noopener noreferrer"&gt;Compute Where It Counts — Per-token compute controller&lt;/a&gt; — the &lt;em&gt;other axis&lt;/em&gt; of adaptive compute: how much work each token gets, vs how many tokens the chain runs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/longtracerl-rubric-process-reward" rel="noopener noreferrer"&gt;LongTraceRL — Rubric reward (process supervision)&lt;/a&gt; — how reasoning chains get &lt;em&gt;trained&lt;/em&gt;, where a good stop signal would come from&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/gemini-3-5-flash-agent-first-vs-chat-retrofit" rel="noopener noreferrer"&gt;Gemini 3.5 Flash — Agent-first model design&lt;/a&gt; — a &lt;em&gt;related angle&lt;/em&gt;: building a model for the agent loop rather than retrofitting chat&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is adaptive solution-length control?
&lt;/h3&gt;

&lt;p&gt;It's a model's ability to scale the length of its reasoning chain to the difficulty of the task. Instead of a fixed cap on reasoning tokens for every prompt, the model spends a short chain on easy tasks and a long one only on hard tasks, stopping when it has reached an answer. Microsoft's MAI-Code-1-Flash uses it to hit its benchmark scores with up to 60% fewer tokens than a flat budget would use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does it save so much without losing accuracy?
&lt;/h3&gt;

&lt;p&gt;Because the savings come from tasks that were over-thought, not under-thought. On an easy fix, a long reasoning chain reaches the answer early and then generates tokens past it — those extra tokens cost latency and money but don't change the result. Trimming the chain to the point the answer was reached removes pure waste. Hard tasks, which genuinely need a long chain, are barely affected.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is it different from a per-token compute controller?
&lt;/h3&gt;

&lt;p&gt;They tune different dials. A per-token compute controller (as in the "Compute Where It Counts" paper) changes how much compute each individual token gets — attention sparsity, layer pruning, bit-width. Adaptive solution-length control changes how many reasoning tokens the chain runs in total. One sizes the work per token; the other sizes the number of tokens. They're complementary.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/mai-code-1-flash-adaptive-solution-length" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Harness-1: State-Externalizing Search Harness</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Wed, 03 Jun 2026 11:16:02 +0000</pubDate>
      <link>https://dev.to/pueding/harness-1-state-externalizing-search-harness-2c9b</link>
      <guid>https://dev.to/pueding/harness-1-state-externalizing-search-harness-2c9b</guid>
      <description>

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; The &lt;strong&gt;Harness-1 paper&lt;/strong&gt; introduces a &lt;strong&gt;20B RL-trained search agent that externalizes its working memory into a structured harness&lt;/strong&gt; — candidate pools, evidence links, and verification records — instead of an ever-growing transcript.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; A deep search agent that &lt;strong&gt;replays its whole history every step&lt;/strong&gt; runs the context window dry. Harness-1 makes &lt;strong&gt;context cost stay flat as the search deepens&lt;/strong&gt;, which is the harness-as-state idea the agent-engineering world preaches, made concrete and &lt;strong&gt;RL-trained&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; Earlier search agents &lt;strong&gt;train over a growing transcript&lt;/strong&gt;, so every candidate, observation, and verification lands back in context. Harness-1 trains over an &lt;strong&gt;external workspace&lt;/strong&gt; and renders only a &lt;strong&gt;budget-bounded slice&lt;/strong&gt; — the policy decides what to search and verify; the harness owns the memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A detective's case-board on the wall, briefed by index card.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  THE GROWING CASE
                         │
              ┌──────────┴──────────┐
              │                     │
      ┌───────▼────────┐    ┌───────▼────────┐
      │  HARNESS-1     │    │ GROWING        │
      │  case-board    │    │ TRANSCRIPT     │
      │  on the wall   │    │ lug whole file │
      └───────┬────────┘    └───────┬────────┘
              │                     │
      carry one index card  haul the entire box
      into each interview    into every interview
              │                     │
              ▼                     ▼
      ✓ desk stays clear    ✗ desk overflows
        context stays flat     window overruns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;case-board on the wall = the durable harness workspace (every lead, evidence link, verified fact)&lt;/li&gt;
&lt;li&gt;index-card briefing = the budget-bounded slice rendered into the model's context each step&lt;/li&gt;
&lt;li&gt;lugging the whole case file into every interview = replaying the entire growing transcript&lt;/li&gt;
&lt;li&gt;running out of desk space = overflowing the context window as the search deepens&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Harness&lt;/strong&gt; — The &lt;strong&gt;scaffolding around the model&lt;/strong&gt; that owns tools, state, and exactly what gets shown to the model each step. The model is the brain; the harness is the desk, filing cabinet, and notepad.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context window&lt;/strong&gt; — The &lt;strong&gt;fixed token budget&lt;/strong&gt; the model can read on any single step. Anything outside it is invisible to the model — and tokens are not free, so a full window is both a cost and a hard ceiling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Growing transcript&lt;/strong&gt; — The naïve agent-memory design: &lt;strong&gt;concatenate the full action-and-observation history&lt;/strong&gt; and feed it back every step. It grows without bound, so a long search eventually overruns the context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State externalization&lt;/strong&gt; — Keeping durable working memory &lt;strong&gt;outside the model's context&lt;/strong&gt; — in the harness — so accumulated evidence does not spend context budget. The model reads a rendered view, not the raw store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget-bounded rendering&lt;/strong&gt; — Each step, the harness selects only a &lt;strong&gt;token-budgeted slice&lt;/strong&gt; of the workspace to render into context, so context size is &lt;strong&gt;constant regardless of search depth&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Curated set&lt;/strong&gt; — The agent's running shortlist of &lt;strong&gt;importance-tagged, verified evidence&lt;/strong&gt; — distinct from the raw candidate pool. Harness-1's headline metric is &lt;strong&gt;curated recall&lt;/strong&gt;: how much of the gold evidence lands in this set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Curated recall&lt;/strong&gt; — The fraction of the gold (correct) evidence that ends up in the curated set, averaged across &lt;strong&gt;8 retrieval benchmarks&lt;/strong&gt;. Harness-1 reports &lt;strong&gt;0.730&lt;/strong&gt;, +11.4 points over the next-best open search agent.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On June 1, 2026, &lt;em&gt;Harness-1&lt;/em&gt; (&lt;a href="https://arxiv.org/abs/2606.02373" rel="noopener noreferrer"&gt;arXiv:2606.02373&lt;/a&gt;) introduced a &lt;strong&gt;20B-parameter search agent&lt;/strong&gt; that separates semantic decision-making from state management. The policy decides what to search, inspect, curate, verify, and when to stop; a &lt;strong&gt;state-externalizing harness&lt;/strong&gt; holds the working memory — candidate pools, importance-tagged curated sets, evidence links, verification records, and compressed observations — and renders only a budget-bounded slice into the model's context each step. Rather than training over an ever-growing transcript, the agent is trained with &lt;strong&gt;reinforcement learning over a structured external workspace&lt;/strong&gt;. It reports &lt;strong&gt;0.730 average curated recall across 8 retrieval benchmarks (web, finance, patents, multi-hop QA), +11.4 points&lt;/strong&gt; over the next-strongest open search sub-agent. &lt;a href="https://arxiv.org/abs/2606.02373" rel="noopener noreferrer"&gt;Read the paper →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture a detective working a long case. Every lead, photo, and verified alibi gets pinned to the &lt;strong&gt;case-board on the wall&lt;/strong&gt; and connected with red string — the board is the durable record, and it only ever grows. When the detective walks into an interview, they don't wheel the entire case file into the room; they carry a single &lt;strong&gt;index-card briefing&lt;/strong&gt; with just what this conversation needs. The board stays on the wall; only a briefing walks in. A rookie who instead lugs the whole growing file box into every interview eventually runs out of desk space — that is exactly what happens when a search agent replays its entire transcript into a finite context window.&lt;/p&gt;

&lt;p&gt;That is the move Harness-1 makes concrete. The naïve design treats the agent's memory as a &lt;strong&gt;growing transcript&lt;/strong&gt;: every observation, every candidate document, every verification step is concatenated and fed back to the model on the next step. It works for a few steps, then the transcript balloons and the search has to stop — not because the agent ran out of leads, but because it ran out of room. Harness-1 instead keeps that durable state in the harness — the case-board — and lets the policy decide where the agent's working state lives. Each step, the harness performs &lt;strong&gt;budget-bounded rendering&lt;/strong&gt;: it selects a token-bounded slice of the workspace — the briefing — and shows only that to the model. The board can grow to hundreds of items while the briefing stays the same size, so &lt;strong&gt;context cost stays flat no matter how deep the search goes&lt;/strong&gt;. Crucially, the agent is trained with reinforcement learning &lt;em&gt;over this workspace&lt;/em&gt;, not over transcripts, so the policy learns the harness skills — curate, importance-tag, verify, compress, stop — as first-class actions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Growing transcript vs state-externalizing harness
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Design&lt;/th&gt;
&lt;th&gt;What lives in context&lt;/th&gt;
&lt;th&gt;Context cost as search deepens&lt;/th&gt;
&lt;th&gt;Failure mode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Growing transcript&lt;/td&gt;
&lt;td&gt;The full action + observation history, replayed every step&lt;/td&gt;
&lt;td&gt;Grows with every step&lt;/td&gt;
&lt;td&gt;Overflows the window; the search stalls on length, not leads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State-externalizing harness&lt;/td&gt;
&lt;td&gt;A &lt;a href="https://arxiv.org/abs/2606.02373" rel="noopener noreferrer"&gt;budget-bounded slice&lt;/a&gt; rendered from the workspace&lt;/td&gt;
&lt;td&gt;~Flat, set by a render budget&lt;/td&gt;
&lt;td&gt;A poorly-chosen slice can omit a needed item &lt;em&gt;(mitigated by importance tags + curated recall)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;The two rows describe the contrast Harness-1 draws between transcript-style memory and its externalized workspace; the "budget-bounded slice" claim is from the &lt;a href="https://arxiv.org/abs/2606.02373" rel="noopener noreferrer"&gt;paper&lt;/a&gt;. Token figures in the hero animation are illustrative.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Walk the budget with some round numbers &lt;em&gt;(illustrative)&lt;/em&gt;. Say each search step adds about &lt;strong&gt;2,000 tokens&lt;/strong&gt; of fresh observations. Under the growing-transcript design, those tokens never leave: after 8 steps the model is reading roughly 16,000 tokens of history, after 20 steps about 40,000, and a genuinely deep multi-hop search marches straight past a typical working window. Under the state-externalizing harness, those 2,000-token observations land in the workspace, but the model is only ever shown a fixed ~6,000-token render — step 8 and step 20 cost the &lt;em&gt;same&lt;/em&gt; &lt;strong&gt;6,000 tokens&lt;/strong&gt; in context. The accumulated evidence still exists; it just lives on the case-board instead of in the briefing. That is why Harness-1 can keep curating to &lt;strong&gt;0.730 recall&lt;/strong&gt; across deep benchmarks where a transcript agent would have run out of room — and it's the same lever the agent-engineering track frames as durable state the harness owns, rather than state smeared across a prompt.&lt;/p&gt;

&lt;p&gt;It lands as a sharp companion to the recent push on &lt;em&gt;how&lt;/em&gt; search agents act — GrepSeek learns a better &lt;strong&gt;action space&lt;/strong&gt; (shell commands over a corpus), while Harness-1 learns a better &lt;strong&gt;state substrate&lt;/strong&gt; (an externalized workspace). Same RL-trained-search-agent family, orthogonal levers. As the work frames it, the model should make the semantic calls and the harness should own the memory — a clean division that the standard fixes for an overflowing context have been circling, now learned end-to-end.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Goes deeper in: AI Agents → The Agent Loop &amp;amp; State → The Anatomy of a Harness&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/grepseek-grpo-shell-command-search" rel="noopener noreferrer"&gt;GrepSeek — training a shell-command search agent&lt;/a&gt; — the other lever: learning the search &lt;em&gt;action space&lt;/em&gt; instead of the state substrate.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/grep-vs-vector-agentic-retrieval" rel="noopener noreferrer"&gt;Is Grep All You Need? — grep vs vector retrieval&lt;/a&gt; — empirical evidence that harness design dominates the retrieval algorithm.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/recmem-subconscious-recurrence" rel="noopener noreferrer"&gt;RecMem — subconscious + recurrence-triggered memory&lt;/a&gt; — another way to keep durable agent memory off the live context.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Harness-1?
&lt;/h3&gt;

&lt;p&gt;Harness-1 is a 20B-parameter, RL-trained search agent that separates the model's semantic decisions (what to search, inspect, curate, verify, and when to stop) from state management. A state-externalizing harness holds the durable working memory — candidate pools, importance-tagged curated sets, evidence links, verification records, and compressed observations — and renders only a budget-bounded slice into the model's context each step. It reports 0.730 average curated recall across 8 retrieval benchmarks, +11.4 points over the next-strongest open search sub-agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does externalizing state matter?
&lt;/h3&gt;

&lt;p&gt;A search agent that replays its full transcript into context each step grows that context with every observation, so a deep search eventually overruns the context window and stops on length rather than on evidence. Externalizing state keeps the accumulated evidence in the harness and renders only a fixed-size slice, so context cost stays flat regardless of search depth — letting the agent keep curating across deep, multi-hop benchmarks.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is this different from just a growing transcript?
&lt;/h3&gt;

&lt;p&gt;A growing transcript concatenates the entire action-and-observation history and feeds it back every step, so its size scales with the number of steps. Harness-1 instead stores that history in a structured external workspace and trains the policy with reinforcement learning over that workspace — so the model learns to curate, verify, and compress as explicit actions, and the context the model reads is a budget-bounded rendering of the workspace rather than the raw, unbounded log.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/harness-1-externalized-state" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
    </item>
    <item>
      <title>GrepSeek Trains a Search Agent to Use Shell Commands: GRPO-Trained Shell-Command Search</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Tue, 02 Jun 2026 11:17:41 +0000</pubDate>
      <link>https://dev.to/pueding/grepseek-trains-a-search-agent-to-use-shell-commands-grpo-trained-shell-command-search-a19</link>
      <guid>https://dev.to/pueding/grepseek-trains-a-search-agent-to-use-shell-commands-grpo-trained-shell-command-search-a19</guid>
      <description>

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; &lt;strong&gt;GrepSeek&lt;/strong&gt; (Salemi, Zamani et al.) is a recipe for &lt;strong&gt;training an agent to search a raw text corpus by writing shell commands&lt;/strong&gt; — grep, pipes, and the like — instead of querying a pre-built vector index.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Agentic search usually leans on an embedding model, a vector store, and an ANN index. GrepSeek shows you can instead &lt;strong&gt;learn a policy that searches the raw files directly&lt;/strong&gt;, and it reports the &lt;strong&gt;strongest F1 / Exact Match across 7 open-domain QA benchmarks&lt;/strong&gt; while staying index-free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; The earlier &lt;em&gt;"Is Grep All You Need?"&lt;/em&gt; study just &lt;strong&gt;wired an untrained grep tool into agents and measured it&lt;/strong&gt;; GrepSeek instead &lt;strong&gt;trains the search behaviour&lt;/strong&gt; — a two-stage Tutor/Planner distillation followed by &lt;strong&gt;GRPO&lt;/strong&gt; — so the agent &lt;em&gt;learns&lt;/em&gt; which commands to run rather than grepping by hand-written heuristics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A rookie detective learning to search a case archive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   CASE ARCHIVE: folders of files, no index to query
                         │
                         ▼
        ┌─────────────────────────────────┐
        │  ANSWER-AWARE TUTOR             │
        │  knows the answer; demonstrates │
        │  the drawer-pulls that crack it │
        └────────────────┬────────────────┘
                         │  rookie copies the moves
                         ▼
        ┌─────────────────────────────────┐
        │  ANSWER-BLIND PLANNER           │
        │  practises blind; keep only     │
        │  searches that solved the case  │
        └────────────────┬────────────────┘
                         │  GRPO: solved case = reward
                         ▼
   ✓ trained agent runs ONE targeted search,
     not a random rummage through the files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;raw corpus = a detective's case archive — folders of files, no index&lt;/li&gt;
&lt;li&gt;shell command = pulling a specific drawer or running Ctrl-F on a document&lt;/li&gt;
&lt;li&gt;answer-aware Tutor = a mentor who already knows the answer and demonstrates the efficient search&lt;/li&gt;
&lt;li&gt;answer-blind Planner = the rookie practising the moves without seeing the answer, keeping only searches that crack the case&lt;/li&gt;
&lt;li&gt;GRPO reward = the case getting solved, which reinforces the search habits that worked&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agentic search&lt;/strong&gt; — A retrieval pattern where the LLM &lt;strong&gt;iteratively calls a search tool&lt;/strong&gt; and decides what to look for next, instead of being handed chunks in one shot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector index (ANN)&lt;/strong&gt; — The default for "RAG": embed every chunk, embed the query, return the &lt;strong&gt;top-k nearest&lt;/strong&gt; by an Approximate Nearest Neighbour index. GrepSeek skips this entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GRPO&lt;/strong&gt; — Group Relative Policy Optimization — an RL method that scores a &lt;strong&gt;group of sampled answers against each other&lt;/strong&gt; instead of training a separate value model, then pushes the policy toward the above-average ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tutor / Planner&lt;/strong&gt; — GrepSeek's two trajectory generators: an &lt;strong&gt;answer-aware Tutor&lt;/strong&gt; demonstrates effective shell-search sequences, and an &lt;strong&gt;answer-blind Planner&lt;/strong&gt; mimics them under realistic uncertainty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verified trajectory&lt;/strong&gt; — A recorded sequence of shell commands that is kept for training &lt;strong&gt;only if it actually reached the correct answer&lt;/strong&gt; — the filter that stops the agent learning impressive-looking but useless searches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Byte-exact parallel engine&lt;/strong&gt; — A sharded execution engine that runs the agent's shell commands &lt;strong&gt;concurrently&lt;/strong&gt; yet returns output &lt;strong&gt;identical to a sequential run&lt;/strong&gt; — up to 7.6× faster with no change in results.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On &lt;strong&gt;May 28, 2026&lt;/strong&gt;, &lt;em&gt;GrepSeek: Training Search Agents for Direct Corpus Interaction&lt;/em&gt; (&lt;a href="https://arxiv.org/abs/2605.29307" rel="noopener noreferrer"&gt;arXiv:2605.29307&lt;/a&gt;, Salemi, Zeng, Diaz, Zamani et al.) trained LLM agents to interact with a text corpus through &lt;strong&gt;executable shell commands&lt;/strong&gt; rather than a pre-built dense index. Training is two-stage: an &lt;strong&gt;answer-aware Tutor&lt;/strong&gt; and &lt;strong&gt;answer-blind Planner&lt;/strong&gt; generate verified search trajectories, then the policy is refined with &lt;strong&gt;GRPO&lt;/strong&gt;. The paper reports the strongest token-level &lt;strong&gt;F1 and Exact Match across seven open-domain QA benchmarks&lt;/strong&gt;, with a &lt;strong&gt;byte-exact parallel execution engine&lt;/strong&gt; that speeds shell retrieval up to &lt;strong&gt;7.6×&lt;/strong&gt;. &lt;a href="https://arxiv.org/abs/2605.29307" rel="noopener noreferrer"&gt;Read the paper →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture the rookie detective again. On day one she rummages through the &lt;strong&gt;case archive&lt;/strong&gt; more or less at random — yanks open a drawer labelled "office," dumps a hundred folders on the desk, and can't say which line answers the question. That's an &lt;strong&gt;untrained agent&lt;/strong&gt; firing a broad &lt;code&gt;grep "office"&lt;/code&gt; at the corpus: lots of hits, almost all noise, wrong answer. Crucially, there is &lt;strong&gt;no card catalogue&lt;/strong&gt; — no embeddings, no vector index to ask. The only way through the archive is to run a literal search and read what comes back.&lt;/p&gt;

&lt;p&gt;GrepSeek's move is to &lt;strong&gt;coach the search itself&lt;/strong&gt;. First a mentor who already knows each case's answer — the &lt;strong&gt;answer-aware Tutor&lt;/strong&gt; — demonstrates the efficient sequence of drawer-pulls. Then the rookie, the &lt;strong&gt;answer-blind Planner&lt;/strong&gt;, practises those moves without peeking at the answer, and the team keeps &lt;strong&gt;only the trajectories that actually cracked the case&lt;/strong&gt;. That distilled set seeds a second stage of &lt;a href="https://learnaivisually.com/ai-explained/vpo-vector-reward-vs-grpo" rel="noopener noreferrer"&gt;reinforcement learning&lt;/a&gt;: &lt;strong&gt;GRPO&lt;/strong&gt; samples several command sequences per question, compares them against each other, and nudges the policy toward the ones that landed the right answer. Over training, the same agent stops grepping by habit and starts emitting a &lt;strong&gt;targeted pipeline&lt;/strong&gt; — &lt;code&gt;grep -i paris *.md | grep Q3&lt;/code&gt; — that returns the handful of lines that matter.&lt;/p&gt;

&lt;p&gt;Because the tool the agent calls is a literal shell, the win is two-sided. The retrieval is &lt;strong&gt;index-free&lt;/strong&gt;, so there is no embedding pass or ANN store to build and keep fresh — the agent's tool is just a command line over the raw files. And the searches are &lt;strong&gt;learned end-to-end against the answer&lt;/strong&gt;, so the policy adapts to &lt;em&gt;this&lt;/em&gt; corpus and &lt;em&gt;this&lt;/em&gt; question style rather than trusting a fixed similarity metric. This is the line worth drawing under the older agentic-retrieval results: where &lt;em&gt;"Is Grep All You Need?"&lt;/em&gt; showed an &lt;strong&gt;untrained&lt;/strong&gt; grep tool already competitive, GrepSeek shows what happens when you &lt;strong&gt;train the grep&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it sits among the options
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Retrieval backend&lt;/th&gt;
&lt;th&gt;Training&lt;/th&gt;
&lt;th&gt;Adapts to the corpus?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classic RAG&lt;/td&gt;
&lt;td&gt;embeddings + ANN vector index&lt;/td&gt;
&lt;td&gt;none (frozen retriever)&lt;/td&gt;
&lt;td&gt;only via re-embedding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Is Grep All You Need?" &lt;a href="https://learnaivisually.com/ai-explained/grep-vs-vector-agentic-retrieval" rel="noopener noreferrer"&gt;(explainer)&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;literal grep tool, untrained&lt;/td&gt;
&lt;td&gt;none (hand-wired tool)&lt;/td&gt;
&lt;td&gt;no — fixed heuristics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GrepSeek&lt;/td&gt;
&lt;td&gt;shell commands, no index&lt;/td&gt;
&lt;td&gt;Tutor/Planner distill → GRPO&lt;/td&gt;
&lt;td&gt;yes — learned against the answer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why the parallel engine matters for &lt;em&gt;training&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;The byte-exact engine sounds like a systems footnote until you remember &lt;strong&gt;where RL spends its time&lt;/strong&gt;. Each GRPO update needs &lt;strong&gt;many rollouts per question&lt;/strong&gt;, and every rollout actually runs the agent's shell commands against the corpus. Say a single sequential &lt;code&gt;grep&lt;/code&gt; sweep over the shards takes &lt;strong&gt;760 ms&lt;/strong&gt; &lt;em&gt;(illustrative)&lt;/em&gt; and a training run does &lt;strong&gt;100,000 rollouts&lt;/strong&gt;: that is roughly &lt;strong&gt;21 hours&lt;/strong&gt; of pure retrieval before you count a single gradient step. The &lt;strong&gt;sharded-parallel engine&lt;/strong&gt; runs those shards concurrently for a &lt;strong&gt;byte-exact&lt;/strong&gt; identical result, collapsing &lt;strong&gt;760 ms → ~100 ms&lt;/strong&gt; — the same 100,000 rollouts now cost about &lt;strong&gt;2.8 hours&lt;/strong&gt;. The speedup is the real, reported &lt;strong&gt;7.6×&lt;/strong&gt;; the win compounds precisely because, in RL, you pay the retrieval cost on &lt;strong&gt;every rollout&lt;/strong&gt;, not once.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Goes deeper in: AI Agents → Retrieval &amp;amp; RAG → RAG failure modes&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/grep-vs-vector-agentic-retrieval" rel="noopener noreferrer"&gt;Is Grep All You Need? — grep vs vector retrieval for agentic search&lt;/a&gt; — the untrained-grep study GrepSeek builds on.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/vpo-vector-reward-vs-grpo" rel="noopener noreferrer"&gt;VPO — vector reward vs GRPO&lt;/a&gt; — a closer look at the GRPO objective GrepSeek refines with.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is GrepSeek?
&lt;/h3&gt;

&lt;p&gt;GrepSeek is a method for training an LLM agent to retrieve from a raw text corpus by writing executable shell commands — grep, pipes, and the like — instead of querying a pre-built vector index. It distills verified search trajectories from an answer-aware Tutor and answer-blind Planner, then refines the policy with GRPO.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does it matter?
&lt;/h3&gt;

&lt;p&gt;It shows agentic search can be a learned skill rather than a fixed retrieval stack. By skipping the embedding model, vector store, and ANN index and learning shell-command search end-to-end against the answer, GrepSeek reports the strongest F1 and Exact Match across seven open-domain QA benchmarks while staying index-free.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is it different from the "Is Grep All You Need?" study?
&lt;/h3&gt;

&lt;p&gt;That study wired an untrained grep tool into agents and measured it against vector retrieval; it does no learning. GrepSeek instead trains the search behaviour — a two-stage Tutor/Planner distillation followed by GRPO — so the agent learns which commands to run rather than relying on hand-written heuristics.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/grepseek-grpo-shell-command-search" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>rag</category>
      <category>llm</category>
    </item>
    <item>
      <title>AgentDoG 1.5: Small Inline Guard Models for Agent Actions</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Mon, 01 Jun 2026 11:16:41 +0000</pubDate>
      <link>https://dev.to/pueding/agentdog-15-small-inline-guard-models-for-agent-actions-2mh8</link>
      <guid>https://dev.to/pueding/agentdog-15-small-inline-guard-models-for-agent-actions-2mh8</guid>
      <description>

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; &lt;strong&gt;AgentDoG 1.5&lt;/strong&gt;, an arXiv preprint posted in May 2026, is a family of &lt;strong&gt;small inline guard models&lt;/strong&gt; (0.8B–8B parameters) that sit beside an agent and screen each action — a tool call, a shell command, a code-execution request — as safe or risky before it runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Every production agent needs &lt;strong&gt;something watching its actions&lt;/strong&gt;, and the lethal trifecta means an agent with private data, untrusted input, and a way to act can be steered into harm; a guard model is the screen that catches it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; The usual screen is a &lt;strong&gt;large closed safety model&lt;/strong&gt; (GPT-class) or a heavyweight sandboxed checker run per action; AgentDoG reports matching that catch rate with a model trained on only &lt;strong&gt;~1,000 purified samples&lt;/strong&gt; at roughly &lt;strong&gt;100× less deployment overhead&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A rookie door guard who studied a veteran's casebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              AN AGENT ACTION REACHES THE DOOR
             (tool call / shell / code execution)
                            │
              ┌─────────────┴──────────────┐
              │                            │
     ┌────────▼────────┐          ┌────────▼────────┐
     │   ROOKIE GUARD  │          │  VETERAN CHIEF  │
     │  0.8B–8B model  │          │   closed model  │
     │     (inline)    │          │  (own service)  │
     └────────┬────────┘          └────────┬────────┘
              │                            │
     ~1k-case casebook;          same catch rate, but
     cheap on every action       ~100× deploy overhead
              │                            │
              ▼                            ▼
       ✓ screen them all          ✗ too costly to screen
         affordably                 every single action
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;agent action = a visitor at the door asking to come in (a tool call or shell command)&lt;/li&gt;
&lt;li&gt;guard model = the rookie who clears safe visitors and stops risky ones, in real time&lt;/li&gt;
&lt;li&gt;closed safety model = the veteran chief — just as sharp, but expensive to keep on the payroll&lt;/li&gt;
&lt;li&gt;~1,000 training samples = a thin casebook holding only the most instructive incidents&lt;/li&gt;
&lt;li&gt;influence-function purification = throwing out the case files that taught the rookie nothing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Guard model&lt;/strong&gt; — A &lt;strong&gt;separate, dedicated classifier&lt;/strong&gt; that screens an agent's inputs and actions for risk — distinct from the agent LLM doing the task. It lives inline in the loop and returns an allow / block decision per action. See Agent Engineering → Output Filters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lethal trifecta&lt;/strong&gt; — The three ingredients that make an agent dangerous together: &lt;strong&gt;private data&lt;/strong&gt; access, exposure to &lt;strong&gt;untrusted content&lt;/strong&gt;, and an &lt;strong&gt;exfiltration channel&lt;/strong&gt;. A guard model is one way to break the chain. See AI Agents → The Lethal Trifecta.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Influence functions&lt;/strong&gt; — A method that estimates &lt;strong&gt;how much each training example actually helps&lt;/strong&gt; the model. AgentDoG uses it to purge low-value samples so that roughly &lt;strong&gt;1,000&lt;/strong&gt; high-signal cases remain — the "purification" step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Taxonomy-guided data engine&lt;/strong&gt; — A pipeline that &lt;strong&gt;synthesizes training examples from a structured catalogue of risk categories&lt;/strong&gt;. AgentDoG's taxonomy is updated to explicitly cover &lt;strong&gt;code-execution&lt;/strong&gt; risks, not just text-level prompt attacks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SFT + RL&lt;/strong&gt; — &lt;strong&gt;Supervised fine-tuning&lt;/strong&gt; (learn from labeled examples) followed by &lt;strong&gt;reinforcement learning&lt;/strong&gt; (learn from reward in an environment). AgentDoG trains its guards in an "agentic safety" SFT+RL setup so they see realistic action traces, not isolated prompts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defense-in-depth&lt;/strong&gt; — Layering &lt;strong&gt;independent filters&lt;/strong&gt; so no single failure is fatal — input filters, output filters, policy checks, fail-safe defaults. A cheap guard model makes it affordable to add more layers. See Agent Engineering → Defense-in-Depth.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On May 29, 2026, researchers posted &lt;a href="https://arxiv.org/abs/2605.29801" rel="noopener noreferrer"&gt;AgentDoG 1.5&lt;/a&gt;, a lightweight alignment framework for agent safety. It trains guard models at 0.8B, 2B, 4B, and 8B parameters on roughly 1,000 samples, using a taxonomy-guided data engine (now covering code-execution risk) with influence-function purification. The paper reports performance &lt;strong&gt;comparable to leading closed models such as GPT-5.4&lt;/strong&gt; on agent-risk screening, while cutting Docker-level deployment overhead by about &lt;strong&gt;two orders of magnitude&lt;/strong&gt;. &lt;a href="https://arxiv.org/abs/2605.29801" rel="noopener noreferrer"&gt;Read the preprint →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture a doorway with a guard. Every visitor — a delivery, a contractor, someone who insists they were invited — stops at the door and the guard makes one call: come in, or turn around. That is exactly the job of a &lt;strong&gt;guard model&lt;/strong&gt; in an agent system. The agent emits an action — call this tool, run this shell command, execute this code — and before the action reaches the real world, a small dedicated model screens it. The rookie at the door is not the chief of security; it is a 0.8B-to-8B model whose only skill is clearing safe actions and stopping risky ones, fast. The expensive alternative is to put the veteran chief on the door: a large closed safety model that is just as sharp but costs far more to keep standing by for every action.&lt;/p&gt;

&lt;p&gt;What makes the rookie good is the casebook, not the size of its brain. AgentDoG's guards are not trained on millions of examples; they are trained on roughly &lt;strong&gt;1,000&lt;/strong&gt;. A taxonomy-guided data engine synthesizes candidate cases from a structured list of agent risks — and crucially the taxonomy is extended to cover code execution, the place where a steered agent does the most damage. Then &lt;strong&gt;influence functions&lt;/strong&gt; estimate which of those synthesized cases actually move the model and throw the rest away, the way you would keep the three case files that taught a real lesson and bin the hundred that were routine. The cleaned set trains the guard in an SFT + RL loop that shows it realistic action traces rather than isolated prompts.&lt;/p&gt;

&lt;p&gt;In the layered-defense picture, a guard model is an &lt;strong&gt;input and output filter&lt;/strong&gt; standing in the agent loop. It is one concrete way to cut a leg off the lethal trifecta: even when an agent has private data and reads untrusted content, the guard can refuse the action that would exfiltrate it through a tool call. Because the guard is cheap, you can afford to run it on &lt;em&gt;every&lt;/em&gt; action and still add other layers — which is the whole point of defense-in-depth. The design choice that remains yours is what the guard does when it is unsure: fail-safe (block and ask) is the conservative default, fail-open (allow) trades safety for uptime.&lt;/p&gt;

&lt;h3&gt;
  
  
  How the guard sizes stack up
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Guard variant&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;~4-bit footprint&lt;/th&gt;
&lt;th&gt;Where it fits&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AgentDoG-0.8B&lt;/td&gt;
&lt;td&gt;~0.8B&lt;/td&gt;
&lt;td&gt;~0.4 GB &lt;em&gt;(derived: 0.5 byte/param)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;Sidecar on the same GPU, or even CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AgentDoG-2B&lt;/td&gt;
&lt;td&gt;~2B&lt;/td&gt;
&lt;td&gt;~1 GB &lt;em&gt;(derived)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;Sidecar on the agent's GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AgentDoG-4B&lt;/td&gt;
&lt;td&gt;~4B&lt;/td&gt;
&lt;td&gt;~2 GB &lt;em&gt;(derived)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;Sidecar on the agent's GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AgentDoG-8B&lt;/td&gt;
&lt;td&gt;~8B&lt;/td&gt;
&lt;td&gt;~4 GB &lt;em&gt;(derived)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;Shares one GPU with the agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Closed safety model&lt;/td&gt;
&lt;td&gt;tens of billions &lt;em&gt;(setup-dependent, illustrative)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;~100+ GB &lt;em&gt;(illustrative)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;Its own Docker-sandboxed service&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Where the ~100× actually comes from
&lt;/h3&gt;

&lt;p&gt;Hold the catch rate fixed — the paper's claim is that the small guard &lt;em&gt;matches&lt;/em&gt; the closed model there — and the win is in Docker-level deployment overhead: the standing service and sandbox each screen needs. Walk a back-of-envelope version &lt;em&gt;(illustrative footprints; the preprint reports the overhead ratio, not these absolutes)&lt;/em&gt;. A frontier-scale closed safety model runs to tens of billions of parameters — on the order of ~100 GB in FP16 — and is typically deployed as its own sandboxed service. AgentDoG-8B in 4-bit is about &lt;strong&gt;8B × 0.5 byte ≈ 4 GB&lt;/strong&gt;, and the 0.8B variant is under 0.5 GB — small enough to ride as an in-process sidecar next to the agent rather than as a separate service. That difference — a separate Docker-sandboxed service versus an in-process sidecar — is the &lt;strong&gt;roughly two-orders-of-magnitude (~100×) deployment-overhead cut&lt;/strong&gt; the paper reports, and it is what makes screening &lt;em&gt;every&lt;/em&gt; action affordable instead of sampling a few.&lt;/p&gt;

&lt;p&gt;The catch, and the reason a small guard is not a free win: a model trained on ~1k cases only knows the risks in its taxonomy, so coverage gaps are real, and a determined attacker can probe for the action it does not recognize — the same evasion pressure that the &lt;a href="https://learnaivisually.com/ai-explained/camouflage-injection-detection-gap" rel="noopener noreferrer"&gt;camouflage-injection detection gap&lt;/a&gt; explainer describes. The parity-with-GPT-5.4 number is the &lt;strong&gt;paper's reported result&lt;/strong&gt;, not an independent reproduction, and "comparable on a benchmark taxonomy" is narrower than "as safe in the wild." Treat AgentDoG as a cheap, always-on &lt;em&gt;layer&lt;/em&gt; — not a replacement for capability scoping and a real data-flow review.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Goes deeper in: Agent Engineering → Layered Guardrails → Output Filters&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/camouflage-injection-detection-gap" rel="noopener noreferrer"&gt;Camouflage-injection detection gap&lt;/a&gt; — the attack a guard model has to catch: prompt injection hidden where a screener does not look&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/copilot-cowork-image-url-exfiltration" rel="noopener noreferrer"&gt;Copilot/Cowork image-URL exfiltration&lt;/a&gt; — a concrete exfiltration channel an output filter is meant to block&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/glasswing-detection-saturated-pipeline" rel="noopener noreferrer"&gt;Glasswing — detection in a saturated pipeline&lt;/a&gt; — what changes when the screener itself becomes the bottleneck&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a guard model for agent actions?
&lt;/h3&gt;

&lt;p&gt;A guard model is a small, dedicated classifier that sits inline in an agent's loop and screens each action — a tool call, a shell command, a code-execution request — as safe or risky before it runs. It is separate from the agent LLM doing the work; its only job is the allow/block decision. AgentDoG 1.5 trains such guards at 0.8B, 2B, 4B, and 8B parameters so the screen can run as a cheap sidecar rather than a heavyweight closed safety model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does AgentDoG only need ~1,000 training samples?
&lt;/h3&gt;

&lt;p&gt;It uses a taxonomy-guided data engine to synthesize candidate cases from a structured catalogue of agent risks (including code execution), then applies influence functions to keep only the examples that measurably improve the model and discard the rest. The result is a small, high-signal "casebook" of roughly 1,000 samples instead of millions, trained in an SFT-then-RL agentic-safety setup. Quality and coverage of the taxonomy matter more than raw sample count.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does a guard model relate to the lethal trifecta?
&lt;/h3&gt;

&lt;p&gt;The lethal trifecta is private-data access plus untrusted content plus an exfiltration channel — dangerous only when all three combine. A guard model is one way to cut a leg off that triangle: even when an agent holds private data and reads untrusted input, the guard can refuse the specific action that would leak it. Because a small guard is cheap to run on every action, it slots into a defense-in-depth stack alongside capability scoping and data-flow review rather than replacing them.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/agentdog-1-5-inline-guard-models" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>agents</category>
      <category>llm</category>
    </item>
    <item>
      <title>Claude Opus 4.8: Parallel-Subagent Dynamic Workflows</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Sun, 31 May 2026 11:17:47 +0000</pubDate>
      <link>https://dev.to/pueding/claude-opus-48-parallel-subagent-dynamic-workflows-19f7</link>
      <guid>https://dev.to/pueding/claude-opus-48-parallel-subagent-dynamic-workflows-19f7</guid>
      <description>

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; The &lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; release adds &lt;strong&gt;"dynamic workflows"&lt;/strong&gt; in Claude Code: a lead agent can &lt;strong&gt;fan out parallel subagents&lt;/strong&gt; instead of running subtasks one after another.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Independent subtasks — search the docs, read the code, run the tests — don't need each other's output, so running them &lt;strong&gt;concurrently and merging the results&lt;/strong&gt; finishes in the time of the &lt;strong&gt;slowest&lt;/strong&gt; one and, in the usual subagent design, keeps each subtask's context isolated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; The usual single-agent loop does &lt;strong&gt;one tool call at a time&lt;/strong&gt;, so wall-clock grows with the &lt;strong&gt;sum&lt;/strong&gt; of the subtasks and a single context window has to hold everything at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A pit crew changing four tires at once instead of one mechanic doing all four.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    CAR NEEDS 4 NEW TIRES
                            │
            ┌───────────────┴───────────────┐
            │                               │
   ┌────────▼────────┐             ┌────────▼────────┐
   │   ONE MECHANIC  │             │     PIT CREW    │
   │     (serial)    │             │   (parallel)    │
   └────────┬────────┘             └────────┬────────┘
            │                               │
   tires done one-by-one          4 crew, one tire each
      each in turn                     all at once
            │                               │
            ▼                               ▼
   ✗ time = sum of all 4         ✓ time = the slowest one
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;orchestrator = the crew chief who sends everyone in at once&lt;/li&gt;
&lt;li&gt;subagent = one crew member on one tire, working independently&lt;/li&gt;
&lt;li&gt;parallel run = all four tires changed in the time of the slowest one&lt;/li&gt;
&lt;li&gt;merge = the chief waves the car out once all four are done&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Orchestrator (lead agent)&lt;/strong&gt; — The agent that owns the task, decides how to split it, and dispatches the pieces. It does not do the subtask work itself — it coordinates. See Agent Engineering → Agent Teams → Supervisor/worker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subagent&lt;/strong&gt; — A spawned worker agent that handles &lt;strong&gt;one subtask&lt;/strong&gt; and reports a result back; in the usual design it runs in its own context window. Background: AI Agents → Context Engineering → Subagents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fan-out / fan-in&lt;/strong&gt; — The shape of the workflow: &lt;strong&gt;fan-out&lt;/strong&gt; is the orchestrator launching many subagents at once; &lt;strong&gt;fan-in&lt;/strong&gt; is collecting their results back into one place to merge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context isolation&lt;/strong&gt; — In the usual subagent design, each subagent gets a &lt;strong&gt;fresh, narrow context window&lt;/strong&gt; with only what its subtask needs, so the orchestrator's window doesn't fill with every subtask's raw output. A context-engineering win as much as a speed one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wall-clock latency&lt;/strong&gt; — Real elapsed time the user waits, as opposed to total compute. Parallelism trades more concurrent compute for &lt;strong&gt;less wall-clock&lt;/strong&gt;. See Agent Engineering → Cost &amp;amp; Latency → Parallel tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orchestrator-workers pattern&lt;/strong&gt; — The classic agent design this productizes: a central agent splits a task, hands pieces to worker agents, and synthesizes their outputs. Walkthrough: AI Agents → Workflow Patterns → Orchestrator-workers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On May 28, 2026, Anthropic &lt;a href="https://www.anthropic.com/news/claude-opus-4-8" rel="noopener noreferrer"&gt;released Claude Opus 4.8&lt;/a&gt;. Among the agentic upgrades, the announcement calls out "dynamic workflows" that let Claude Code run parallel subagents. Anthropic framed it as a harness capability rather than a model-internals change — the model got better at the &lt;em&gt;judgment&lt;/em&gt; of when to split work, and the harness got the machinery to run the pieces at once. &lt;a href="https://www.anthropic.com/news/claude-opus-4-8" rel="noopener noreferrer"&gt;Read the release →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Stay with the pit crew for a second. A single mechanic changing all four tires does them in series: jack, loosen, swap, torque, repeat — four times. A pit crew sends one person to each wheel at the same time, so the stop lasts as long as the &lt;em&gt;slowest&lt;/em&gt; corner, not the sum of all four. The crew chief doesn't turn a single wheel; their whole job is to send everyone in together and wave the car out once all four are done. That division of labor is the entire idea behind a parallel-subagent workflow.&lt;/p&gt;

&lt;p&gt;In an agent, the crew chief is the &lt;strong&gt;orchestrator&lt;/strong&gt;. Faced with a task whose parts don't depend on each other — &lt;em&gt;search the docs&lt;/em&gt;, &lt;em&gt;read the code&lt;/em&gt;, &lt;em&gt;run the tests&lt;/em&gt;, &lt;em&gt;draft a summary&lt;/em&gt; — it can fan out a subagent per part instead of doing them back-to-back. In the usual subagent design each one works in its own context window, so the orchestrator's window doesn't bloat with four subtasks' raw output, and the results fan back in to be merged into one answer. The payoff is &lt;strong&gt;wall-clock&lt;/strong&gt;: with &lt;strong&gt;independent&lt;/strong&gt; subtasks, elapsed time tracks the &lt;em&gt;slowest&lt;/em&gt; subagent, not the running total. (Anthropic disclosed that Claude Code can run parallel subagents; the fan-out/isolate/merge shape here is the established orchestrator-workers pattern the release productizes, not a newly published harness internal.)&lt;/p&gt;

&lt;p&gt;The sharp edge is the word &lt;em&gt;independent&lt;/em&gt;. The pit crew only works because the four corners don't wait on each other; if torquing the front-left depended on first finishing the rear-right, you'd be back to serial. The same holds for agents — parallelization is a win for subtasks that can run blind to each other, and a trap for a dependency chain where step two needs step one's output. The orchestrator's real skill is telling those apart, which is why the release pairs the harness machinery with better agentic judgment about &lt;em&gt;when&lt;/em&gt; to split.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serial vs. parallel, and when each wins
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Wall-clock for N subtasks&lt;/th&gt;
&lt;th&gt;Best when&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single agent, serial&lt;/td&gt;
&lt;td&gt;≈ &lt;strong&gt;sum&lt;/strong&gt; of all subtasks&lt;/td&gt;
&lt;td&gt;subtasks form a &lt;strong&gt;dependency chain&lt;/strong&gt; (step 2 needs step 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Parallel subagents&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≈ &lt;strong&gt;slowest&lt;/strong&gt; subtask + merge &lt;em&gt;(coordination overhead, illustrative)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;subtasks are &lt;strong&gt;independent&lt;/strong&gt; and read-mostly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Walk the numbers with four illustrative subtasks taking 5.0s, 6.2s, 6.8s, and 4.4s. Run serially, wall-clock is the sum: 5.0 + 6.2 + 6.8 + 4.4 = &lt;strong&gt;22.4s&lt;/strong&gt; &lt;em&gt;(illustrative)&lt;/em&gt;. Fan them out as parallel subagents and wall-clock collapses to the slowest corner — &lt;strong&gt;6.8s&lt;/strong&gt; — for a 22.4 / 6.8 ≈ &lt;strong&gt;3.3×&lt;/strong&gt; speedup &lt;em&gt;(illustrative)&lt;/em&gt;. Coordination isn't free: the orchestrator spends a little time dispatching and a little merging, so a realistic number is a touch below 3.3×. And the ceiling is fixed by that slowest subtask — adding a fifth fast subagent doesn't help once "run the tests" is already the long pole.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Goes deeper in: Agent Engineering → Agent Teams → Supervisor/worker&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/opus-4-8-cache-preserving-system-messages" rel="noopener noreferrer"&gt;Claude Opus 4.8 — Cache-preserving mid-task system messages&lt;/a&gt; — the other harness-level feature from the same release, on the caching side rather than the orchestration side&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/msr-delegation-fidelity-drift" rel="noopener noreferrer"&gt;MSR delegation study — Cascading fidelity loss over 20 iterations&lt;/a&gt; — the cost of delegation: handing work to subagents is fast, but each handoff can lose fidelity if you aren't careful&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What are parallel-subagent dynamic workflows?
&lt;/h3&gt;

&lt;p&gt;They are a Claude Code capability in Opus 4.8 where a lead agent (the orchestrator) splits a task into independent subtasks and launches a separate subagent for each one to run at the same time, then merges their results. "Dynamic" means the orchestrator decides the split at run time based on the task, rather than following a fixed, pre-wired script. It is the orchestrator-workers pattern made native to the harness.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does running subagents in parallel cut wall-clock time?
&lt;/h3&gt;

&lt;p&gt;Because independent subtasks don't have to wait for each other. Run serially, elapsed time is the sum of every subtask; run in parallel, elapsed time is just the slowest one plus a little coordination overhead. For four subtasks of roughly equal size that is close to a 4× reduction in wall-clock (illustrative) — though the real ceiling is fixed by the single longest subtask, so an uneven split benefits less.&lt;/p&gt;

&lt;h3&gt;
  
  
  When does parallelizing subagents NOT help?
&lt;/h3&gt;

&lt;p&gt;When the subtasks form a dependency chain — if step two needs step one's output, you can't start it early, so parallelism buys nothing and just adds coordination cost. It also adds little when one subtask dominates the others (the slowest one sets the floor), or when the subtasks share so much state that isolating their context windows loses important information. The orchestrator's job is to recognize these cases and keep them serial.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/opus-4-8-parallel-subagent-workflows" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>OmniRetrieval: Source-Native Query Dispatch</title>
      <dc:creator>pueding</dc:creator>
      <pubDate>Sat, 30 May 2026 11:26:26 +0000</pubDate>
      <link>https://dev.to/pueding/omniretrieval-source-native-query-dispatch-3f80</link>
      <guid>https://dev.to/pueding/omniretrieval-source-native-query-dispatch-3f80</guid>
      <description>

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; The &lt;strong&gt;OmniRetrieval paper&lt;/strong&gt; introduces &lt;strong&gt;source-native query dispatch&lt;/strong&gt;: a router sends a natural-language query to whichever knowledge source fits — &lt;strong&gt;text, tables, or graphs&lt;/strong&gt; — and runs each source's own query engine instead of embedding everything into one vector store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Many vector-first RAG stacks flatten every source into a single &lt;strong&gt;embed-then-ANN&lt;/strong&gt; pipeline, which throws away the structure that makes tables and graphs useful. Keeping each source native lets &lt;strong&gt;JOINs and graph edges survive retrieval&lt;/strong&gt;, so the generator sees answers a flat similarity search can't assemble.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs prior:&lt;/strong&gt; The previous default is a &lt;strong&gt;unified vector index&lt;/strong&gt; — chunk, embed, and nearest-neighbour everything together. Its failure mode is &lt;strong&gt;structural collapse&lt;/strong&gt;: a table's column relationships and a graph's edges are averaged into a single vector and can no longer be queried as structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think of it as
&lt;/h2&gt;

&lt;p&gt;A reference desk that routes each question to the right specialist clerk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                THE QUERY
                    │
             router (clerk)
                    │
        ┌───────────┴───────────┐
        │                       │
 ┌──────▼────────┐     ┌────────▼──────┐
 │    FLATTEN    │     │   DISPATCH    │
 │  one vector   │     │  text, SQL,   │
 │   index bin   │     │  graph query  │
 └──────┬────────┘     └────────┬──────┘
        │                       │
  shred docs into        ask each clerk
  one big bin            in its own tongue
        │                       │
        ▼                       ▼
 ✗ JOINs &amp;amp; edges        ✓ JOINs compose,
   averaged away          edges traverse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;query = a question handed to the reference desk&lt;/li&gt;
&lt;li&gt;router = the clerk who decides which specialist to ask&lt;/li&gt;
&lt;li&gt;source-native query = asking each specialist in their own language — ledgers, archives, family trees&lt;/li&gt;
&lt;li&gt;unified vector index = shredding every document into one bin of identical index cards&lt;/li&gt;
&lt;li&gt;preserved structure = the ledger keeps its columns and the family tree keeps its lines&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RAG&lt;/strong&gt; — Retrieval-Augmented Generation — fetch relevant context from a knowledge store, then condition the model on it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector index&lt;/strong&gt; — The default RAG store: every chunk is embedded into a fixed-dimension vector, and retrieval returns the &lt;strong&gt;top-k nearest&lt;/strong&gt; by cosine or dot-product similarity. Structure inside a chunk is not preserved — only its position in embedding space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ANN&lt;/strong&gt; — Approximate Nearest Neighbour — the index family (HNSW, IVF, FAISS) that makes vector search fast by trading exactness for speed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source-native query&lt;/strong&gt; — Running a source in its own engine: &lt;strong&gt;full-text search&lt;/strong&gt; over passages, a &lt;strong&gt;SQL-style query&lt;/strong&gt; (with JOINs) over tables, a &lt;strong&gt;traversal&lt;/strong&gt; over a graph — rather than one similarity lookup over a shared embedding space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Heterogeneous-source retrieval&lt;/strong&gt; — Retrieval across sources of different &lt;em&gt;kinds&lt;/em&gt; — unstructured text, relational tables, and graphs — kept in their native form instead of homogenised into one representation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Knowledge base (KB)&lt;/strong&gt; — A single corpus the system retrieves from. OmniRetrieval reports evaluating across &lt;strong&gt;309 distinct KBs&lt;/strong&gt; spanning 13 datasets.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The news.&lt;/strong&gt; On May 29, 2026, the &lt;strong&gt;OmniRetrieval&lt;/strong&gt; paper (&lt;a href="https://arxiv.org/abs/2605.29250" rel="noopener noreferrer"&gt;arXiv:2605.29250&lt;/a&gt;) proposed a retrieval framework that accepts a natural-language query and &lt;strong&gt;routes it to the appropriate knowledge source&lt;/strong&gt; — unstructured text, relational tables, or graphs — dispatching each source's &lt;strong&gt;native query&lt;/strong&gt; to its own execution engine rather than flattening everything into a single embedding index. The authors report evaluating across &lt;strong&gt;13 datasets and 309 distinct knowledge bases&lt;/strong&gt;, and exceeding single-source retrieval baselines.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Picture the reference desk again. A question comes in — &lt;em&gt;"which suppliers shipped to the Berlin warehouse in Q3, and who introduced them?"&lt;/em&gt; The &lt;strong&gt;clerk&lt;/strong&gt; doesn't translate the question into one bland house dialect and shout it at the whole building. They split it: the &lt;strong&gt;archivist&lt;/strong&gt; searches the prose contracts, the &lt;strong&gt;accountant&lt;/strong&gt; runs the numbers in their ledgers, and the &lt;strong&gt;genealogist&lt;/strong&gt; walks the introductions graph. Each specialist answers &lt;strong&gt;in their own language&lt;/strong&gt;, keeping the structure that makes their corner useful — the ledger's columns, the family tree's lines.&lt;/p&gt;

&lt;p&gt;The animation above is that desk. In the first beat the query flows &lt;strong&gt;query → router → one flat vector index&lt;/strong&gt;: every source is shredded into the same uniform grid of vectors, and the table's &lt;strong&gt;JOIN&lt;/strong&gt; arc and the graph's &lt;strong&gt;edges&lt;/strong&gt; go dashed and grey — flattened into embedding space where structure can't be queried. Then the router flips from &lt;em&gt;flatten&lt;/em&gt; to &lt;em&gt;dispatch&lt;/em&gt;: the same three sources light up in their native forms, and three typed queries fan out — &lt;strong&gt;text search&lt;/strong&gt;, &lt;strong&gt;table query · JOIN&lt;/strong&gt;, &lt;strong&gt;graph traversal&lt;/strong&gt; — with the JOIN arc and graph edges restored in green.&lt;/p&gt;

&lt;p&gt;The mechanism is a &lt;strong&gt;router plus per-source engines&lt;/strong&gt;. Instead of an embed-then-ANN lookup over one homogenised store, OmniRetrieval generates a query &lt;strong&gt;in each source's own language&lt;/strong&gt; and runs it on that source's engine, then unifies the heterogeneous results for the generator. Because the table never left its relational form, a &lt;strong&gt;JOIN&lt;/strong&gt; still composes rows by key; because the graph never left its node-edge form, a &lt;strong&gt;traversal&lt;/strong&gt; still follows edges. Those are exactly the &lt;strong&gt;structural affordances&lt;/strong&gt; a single similarity vector erases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why flattening loses the answer
&lt;/h3&gt;

&lt;p&gt;Hold the Berlin question fixed and walk the two paths. Say the answer needs a &lt;strong&gt;JOIN&lt;/strong&gt; of a &lt;strong&gt;5,000-row suppliers&lt;/strong&gt; table with a &lt;strong&gt;40,000-row shipments&lt;/strong&gt; table on &lt;code&gt;supplier_id&lt;/code&gt;. The relational path evaluates the key match &lt;strong&gt;exactly&lt;/strong&gt; — every shipment resolves to its supplier, and the introductions graph then traverses two hops to the people who introduced them. The flat-index path instead embeds each row into, say, a &lt;strong&gt;768-dimension&lt;/strong&gt; vector and returns, say, the &lt;strong&gt;top-k = 20&lt;/strong&gt; nearest chunks to the query. Two hundred million potential supplier–shipment pairings &lt;strong&gt;(5,000 × 40,000 = 200,000,000, illustrative)&lt;/strong&gt; collapse into &lt;strong&gt;20 fuzzy neighbours&lt;/strong&gt; chosen by cosine distance — and "who introduced them" is an &lt;em&gt;edge&lt;/em&gt; that was never embedded as an edge. &lt;strong&gt;The JOIN and the traversal are not slow in the flat index; they are absent.&lt;/strong&gt; That is the RAG failure mode source-native dispatch is built to remove.&lt;/p&gt;

&lt;h3&gt;
  
  
  Flat vector index vs. source-native dispatch
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Unified vector index&lt;/th&gt;
&lt;th&gt;Source-native dispatch&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Index built upfront&lt;/td&gt;
&lt;td&gt;one embedding pass + ANN index over all sources&lt;/td&gt;
&lt;td&gt;each source keeps its native engine (full-text, SQL, graph)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query form&lt;/td&gt;
&lt;td&gt;one nearest-neighbour lookup over shared space&lt;/td&gt;
&lt;td&gt;a source-native query per source, chosen by the router&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structure preserved&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;no&lt;/strong&gt; — JOINs and edges averaged into a vector&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;yes&lt;/strong&gt; — JOINs compose, edges traverse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best when&lt;/td&gt;
&lt;td&gt;fuzzy topical match over homogeneous prose&lt;/td&gt;
&lt;td&gt;answer spans tables / graphs, needs exact relations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reported scope&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;~13 datasets · 309 KBs, beats single-source baselines &lt;a href="https://arxiv.org/abs/2605.29250" rel="noopener noreferrer"&gt;(OmniRetrieval)&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The table isn't an argument that embeddings are obsolete — fuzzy topical recall over prose is exactly what a vector index is good at. It's that a single representation can't be the right one for text, tables, &lt;strong&gt;and&lt;/strong&gt; graphs at once. Routing to a &lt;strong&gt;source-native query&lt;/strong&gt; lets each source answer with the structure it actually has, and only then unifies — so the generator receives composed relations, not a bag of nearest neighbours.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Goes deeper in: AI Agents → Retrieval &amp;amp; RAG → RAG Failure Modes&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Related explainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/grep-vs-vector-agentic-retrieval" rel="noopener noreferrer"&gt;Is Grep All You Need? — Grep vs vector retrieval for agentic search&lt;/a&gt; — another result that the vector index is one option among many, not the default&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learnaivisually.com/ai-explained/cdd-context-driven-decomposition" rel="noopener noreferrer"&gt;CDD — Context-Driven Decomposition for RAG knowledge conflict&lt;/a&gt; — what to do once retrieval returns sources that disagree&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is source-native query dispatch?
&lt;/h3&gt;

&lt;p&gt;It's a retrieval design where a router sends a natural-language query to whichever knowledge source fits — unstructured text, a relational table, or a graph — and runs that source's own query engine (full-text search, a SQL-style query with JOINs, or a graph traversal) instead of embedding everything into one shared vector store. OmniRetrieval reports doing this across 13 datasets and 309 knowledge bases and exceeding single-source baselines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not just embed tables and graphs into the same vector index?
&lt;/h3&gt;

&lt;p&gt;Because embedding collapses structure. A table's columns and a graph's edges become a single vector positioned by similarity, so a JOIN can no longer compose rows by key and a traversal can no longer follow edges — those relations are averaged away rather than slowed down. Keeping each source native preserves the structural affordances that answer relational and multi-hop questions, which a top-k nearest-neighbour lookup over one space cannot reconstruct.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does this mean vector retrieval is obsolete?
&lt;/h3&gt;

&lt;p&gt;No. A unified vector index is still the right tool for fuzzy, topical recall over homogeneous prose, where semantic similarity is doing the real work. Source-native dispatch matters when an answer spans different kinds of sources or needs exact relations — tables to JOIN, graphs to traverse. The shift is in the default: route to the source's native engine first, and treat the shared embedding space as one source among several rather than the only one.&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://learnaivisually.com/ai-explained/omniretrieval-source-native-dispatch" rel="noopener noreferrer"&gt;Learn AI Visually&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
