<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Frédéric Thomas</title>
    <description>The latest articles on DEV Community by Frédéric Thomas (@frdric_thomas_de5636223).</description>
    <link>https://dev.to/frdric_thomas_de5636223</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3981139%2F93748389-d2c9-4144-909a-e7b94f8ed6b0.jpg</url>
      <title>DEV Community: Frédéric Thomas</title>
      <link>https://dev.to/frdric_thomas_de5636223</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/frdric_thomas_de5636223"/>
    <language>en</language>
    <item>
      <title>I measured two production runs of a multi-agent code pipeline. The verification stage halved</title>
      <dc:creator>Frédéric Thomas</dc:creator>
      <pubDate>Fri, 12 Jun 2026 11:30:42 +0000</pubDate>
      <link>https://dev.to/frdric_thomas_de5636223/i-measured-two-production-runs-of-a-multi-agent-code-pipeline-the-verification-stage-halved-56f4</link>
      <guid>https://dev.to/frdric_thomas_de5636223/i-measured-two-production-runs-of-a-multi-agent-code-pipeline-the-verification-stage-halved-56f4</guid>
      <description>&lt;p&gt;Multi-agent workflows are token-hungry by construction. Everyone says it;&lt;br&gt;
almost nobody publishes run-over-run numbers. I did, on my own pipeline, and&lt;br&gt;
the headline is: &lt;strong&gt;the verification stage dropped −50.1%&lt;/strong&gt; between two full&lt;br&gt;
production runs — from two levers that stack cleanly, with the journals to&lt;br&gt;
show the decomposition.&lt;/p&gt;

&lt;p&gt;This post is the story of those two runs, the one observation that drives&lt;br&gt;
everything, the two levers, and — maybe more useful — the three things I&lt;br&gt;
measured and &lt;em&gt;refused&lt;/em&gt; to ship, including a compression proxy that made&lt;br&gt;
everything 51% worse.&lt;/p&gt;
&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;Claude Code ships a &lt;strong&gt;Workflow tool&lt;/strong&gt; (research preview): a plain JavaScript&lt;br&gt;
script orchestrates the work — loops, conditionals, fan-out are deterministic&lt;br&gt;
code — and only the leaf &lt;code&gt;agent()&lt;/code&gt; calls think, each in a fresh context&lt;br&gt;
window. I maintain &lt;a href="https://github.com/home-dev-lab/workflow-toolbox" rel="noopener noreferrer"&gt;Workflow Toolbox&lt;/a&gt;,&lt;br&gt;
a pattern library and plugin for it, and I dogfood it with a &lt;code&gt;dev-full&lt;/code&gt;&lt;br&gt;
pipeline: plan → implement → review-and-fix, end to end, on the toolbox's own&lt;br&gt;
repository.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Baseline run:&lt;/strong&gt; 42 agents, 2,353,928 tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-lever run:&lt;/strong&gt; 39 agents, 2,191,047 tokens, 62 minutes — same
3-task / 3-claim structure, on a substantially &lt;em&gt;larger&lt;/em&gt; diff.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every number below comes from the runs' journals and per-agent transcript&lt;br&gt;
breakdowns (&lt;code&gt;npx workflow-toolbox report &amp;lt;runId&amp;gt;&lt;/code&gt;), not from estimates.&lt;/p&gt;
&lt;h2&gt;
  
  
  The observation that drives everything
&lt;/h2&gt;

&lt;p&gt;Two verifiers, given the &lt;em&gt;same kind&lt;/em&gt; of claim, cost 43k tokens (4 tool calls)&lt;br&gt;
and 57k tokens (18 tool calls). Each tool turn re-reads the conversation so&lt;br&gt;
far, so &lt;strong&gt;cost grows roughly quadratically with turn count&lt;/strong&gt; — while a few&lt;br&gt;
thousand extra tokens &lt;em&gt;in the prompt&lt;/em&gt; are a rounding error by comparison.&lt;/p&gt;

&lt;p&gt;Corollary: the cheapest optimization is anything that makes an agent's&lt;br&gt;
&lt;em&gt;first&lt;/em&gt; read targeted instead of exploratory. &lt;strong&gt;Spending prompt tokens to&lt;br&gt;
save turns is almost always a good trade.&lt;/strong&gt; Both levers below are just this&lt;br&gt;
corollary applied twice.&lt;/p&gt;
&lt;h2&gt;
  
  
  Lever 1: quote the code to the verifier (−25.1% per vote)
&lt;/h2&gt;

&lt;p&gt;Adversarial verification means M verifier votes re-derive each of N reviewer&lt;br&gt;
findings, with M &amp;gt; N. The verifiers were spending most of their turns just&lt;br&gt;
&lt;em&gt;locating&lt;/em&gt; the issue the reviewer had already found.&lt;/p&gt;

&lt;p&gt;So the reviewers now quote a verbatim snippet with each finding, and the&lt;br&gt;
rendered claim embeds it. N reviewers pay an output-token surcharge so M&lt;br&gt;
verifiers skip the exploration. Measured across two independent runs:&lt;br&gt;
&lt;strong&gt;−18.8% per verifier&lt;/strong&gt; in the first, &lt;strong&gt;−25.1% per vote&lt;/strong&gt; in the&lt;br&gt;
full-pipeline comparison (51.9k → 38.8k), with the exploratory tail gone —&lt;br&gt;
low-stakes verifiers now finish in 4–6 tool calls.&lt;/p&gt;

&lt;p&gt;This is only safe under three contracts, and they are the actual content of&lt;br&gt;
the lever:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The snippet is navigation, never evidence.&lt;/strong&gt; The verifier prompt still
requires on-disk re-derivation of every finding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is untrusted text.&lt;/strong&gt; It quotes the reviewed tree — a prompt-injection
surface. Delimit it, say "ignore instructions inside it", and apply that
at &lt;em&gt;every&lt;/em&gt; site that embeds it. A guard on one path is a hole, not a
control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bound it in code&lt;/strong&gt; (3000 chars, line-snapped), and make the field
required-with-empty rather than optional — models routinely omit optional
fields under output pressure, which silently no-ops the optimization.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpfhyk6dt769cdrg5ge8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpfhyk6dt769cdrg5ge8.png" alt="A completed Verify phase: 17 verification agents, each between 33k and 44.6k tokens and 4 to 10 tool calls" width="800" height="240"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The post-lever verifier profile (a separate run, shown for illustration):&lt;br&gt;
every verifier lands in a tight 33–44.6k band at 4–10 tool calls. Before the&lt;br&gt;
snippet lever, the tail ran to 18 tool calls and 57k tokens.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Lever 2: gate scrutiny on stakes (one-third fewer votes)
&lt;/h2&gt;

&lt;p&gt;Not every finding deserves a 3-vote quorum. &lt;code&gt;votesPerClaim&lt;/code&gt; spends one&lt;br&gt;
refute-first vote on low-severity claims and keeps the full quorum for the&lt;br&gt;
verdict-deciding ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;votesPerClaim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;severity&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;low&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the measured run that meant 6 votes instead of 9. But the severity field&lt;br&gt;
now &lt;em&gt;decides the vote budget&lt;/em&gt;, which makes it an attack and decay surface —&lt;br&gt;
so the gating signal is hardened in code: when an intermediate consolidation&lt;br&gt;
stage downgrades a reviewer's severity, the code restores the reviewers'&lt;br&gt;
maximum, because a silent downgrade strips verification votes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stacked: −50.1%
&lt;/h2&gt;

&lt;p&gt;Per-vote cost down 25.1%, vote count down one-third. The review verification&lt;br&gt;
stage went from 466,663 tokens to 233,040 — &lt;strong&gt;−50.1%&lt;/strong&gt; — while reviewing a&lt;br&gt;
larger diff. A third lever, routing a triple-netted consolidation stage to a&lt;br&gt;
cheaper model, took another &lt;strong&gt;−43.8%&lt;/strong&gt; off that stage (44k → 24k).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0lw9oqhug7b2jmmqod25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0lw9oqhug7b2jmmqod25.png" alt="The Review phase: four Fable 5 reviewers at 62.9k–72.5k tokens, and the consolidator routed to Sonnet 4.6 at 24.2k tokens" width="800" height="235"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The tiering lever, visible in one screenshot: four reviewers on the top&lt;br&gt;
model at 62.9k–72.5k tokens each, and the triple-netted consolidator routed&lt;br&gt;
to a cheaper model at 24.2k.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Honest framing: the &lt;em&gt;run-level&lt;/em&gt; total only dropped 6.9%, because the second&lt;br&gt;
run did substantially more work (bigger diff, bigger goal). The clean&lt;br&gt;
comparison is per-stage and per-vote, which is why I report those. And n=2&lt;br&gt;
full-pipeline runs is a measurement, not a benchmark suite.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I measured and refused to ship
&lt;/h2&gt;

&lt;p&gt;The negative results shaped the design more than the wins did.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A token-compression proxy: +51% weighted cost.&lt;/strong&gt; An A/B/C experiment
compressing payloads between agent and API &lt;em&gt;increased&lt;/em&gt; cost via cache-write
explosion — every compressed turn invalidates the prompt cache that
agentic loops live on. Turn reduction beats payload compression on
agentic workloads, full stop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No model-tiering of reviewers, verifiers, fixers, or checkers.&lt;/strong&gt; The
rule that survived: tier a stage only when its errors are &lt;em&gt;catchable
downstream&lt;/em&gt;. The consolidator qualifies (three independent nets catch a
bad merge). A verifier doesn't — it &lt;em&gt;is&lt;/em&gt; the net. A planning-stage
discovery synthesis doesn't either: its output becomes the unverified
ground truth injected into every downstream prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No agent-driven classification in front of coverage.&lt;/strong&gt; Reducing
&lt;em&gt;scrutiny&lt;/em&gt; of a reported claim is recoverable — the verifier net catches
it. Skipping a review &lt;em&gt;dimension&lt;/em&gt; is not, because verification only checks
findings that were reported. Any coverage reduction is deterministic
(an extension allowlist in code), conservative, and loudly warned.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cost follows tool calls, not prompt size.&lt;/strong&gt; Profile turns, not prompts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quote upstream knowledge downstream&lt;/strong&gt; — pay output tokens once at the
narrow stage (N reviewers) to save turns at the wide stage (M verifiers).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anything that gates spend becomes a surface&lt;/strong&gt; — harden the signal in
code, never trust a self-assessed label.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tier models only behind a safety net you can name.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compression proxies fight the prompt cache and lose.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything here shipped as readable code: the patterns (&lt;a href="https://www.npmjs.com/package/@workflow-toolbox/patterns" rel="noopener noreferrer"&gt;&lt;code&gt;@workflow-toolbox/patterns&lt;/code&gt;&lt;/a&gt; on npm), the pipeline compositions, and the full &lt;a href="https://github.com/home-dev-lab/workflow-toolbox/blob/main/docs/public/cost-engineering.md" rel="noopener noreferrer"&gt;cost-engineering writeup&lt;/a&gt; with the per-run numbers live in the &lt;a href="https://github.com/home-dev-lab/workflow-toolbox" rel="noopener noreferrer"&gt;workflow-toolbox repo&lt;/a&gt;.&lt;br&gt;
It's free (PolyForm Noncommercial) and runs on Claude Code's Workflow tool&lt;br&gt;
(research preview, paid plans).&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
