I measured two production runs of a multi-agent code pipeline. The verification stage halved

Frédéric Thomas — Fri, 12 Jun 2026 11:30:42 +0000

Multi-agent workflows are token-hungry by construction. Everyone says it;
almost nobody publishes run-over-run numbers. I did, on my own pipeline, and
the headline is: the verification stage dropped −50.1% between two full
production runs — from two levers that stack cleanly, with the journals to
show the decomposition.

This post is the story of those two runs, the one observation that drives
everything, the two levers, and — maybe more useful — the three things I
measured and refused to ship, including a compression proxy that made
everything 51% worse.

The setup

Claude Code ships a Workflow tool (research preview): a plain JavaScript
script orchestrates the work — loops, conditionals, fan-out are deterministic
code — and only the leaf agent() calls think, each in a fresh context
window. I maintain Workflow Toolbox,
a pattern library and plugin for it, and I dogfood it with a dev-full
pipeline: plan → implement → review-and-fix, end to end, on the toolbox's own
repository.

Baseline run: 42 agents, 2,353,928 tokens.
Post-lever run: 39 agents, 2,191,047 tokens, 62 minutes — same 3-task / 3-claim structure, on a substantially larger diff.

Every number below comes from the runs' journals and per-agent transcript
breakdowns (npx workflow-toolbox report <runId>), not from estimates.

The observation that drives everything

Two verifiers, given the same kind of claim, cost 43k tokens (4 tool calls)
and 57k tokens (18 tool calls). Each tool turn re-reads the conversation so
far, so cost grows roughly quadratically with turn count — while a few
thousand extra tokens in the prompt are a rounding error by comparison.

Corollary: the cheapest optimization is anything that makes an agent's
first read targeted instead of exploratory. Spending prompt tokens to
save turns is almost always a good trade. Both levers below are just this
corollary applied twice.

Lever 1: quote the code to the verifier (−25.1% per vote)

Adversarial verification means M verifier votes re-derive each of N reviewer
findings, with M > N. The verifiers were spending most of their turns just
locating the issue the reviewer had already found.

So the reviewers now quote a verbatim snippet with each finding, and the
rendered claim embeds it. N reviewers pay an output-token surcharge so M
verifiers skip the exploration. Measured across two independent runs:
−18.8% per verifier in the first, −25.1% per vote in the
full-pipeline comparison (51.9k → 38.8k), with the exploratory tail gone —
low-stakes verifiers now finish in 4–6 tool calls.

This is only safe under three contracts, and they are the actual content of
the lever:

The snippet is navigation, never evidence. The verifier prompt still requires on-disk re-derivation of every finding.
It is untrusted text. It quotes the reviewed tree — a prompt-injection surface. Delimit it, say "ignore instructions inside it", and apply that at every site that embeds it. A guard on one path is a hole, not a control.
Bound it in code (3000 chars, line-snapped), and make the field required-with-empty rather than optional — models routinely omit optional fields under output pressure, which silently no-ops the optimization.

The post-lever verifier profile (a separate run, shown for illustration):
every verifier lands in a tight 33–44.6k band at 4–10 tool calls. Before the
snippet lever, the tail ran to 18 tool calls and 57k tokens.

Lever 2: gate scrutiny on stakes (one-third fewer votes)

Not every finding deserves a 3-vote quorum. votesPerClaim spends one
refute-first vote on low-severity claims and keeps the full quorum for the
verdict-deciding ones:

votesPerClaim: (f) => (f.severity === 'low' ? 1 : 3)

In the measured run that meant 6 votes instead of 9. But the severity field
now decides the vote budget, which makes it an attack and decay surface —
so the gating signal is hardened in code: when an intermediate consolidation
stage downgrades a reviewer's severity, the code restores the reviewers'
maximum, because a silent downgrade strips verification votes.

Stacked: −50.1%

Per-vote cost down 25.1%, vote count down one-third. The review verification
stage went from 466,663 tokens to 233,040 — −50.1% — while reviewing a
larger diff. A third lever, routing a triple-netted consolidation stage to a
cheaper model, took another −43.8% off that stage (44k → 24k).

The tiering lever, visible in one screenshot: four reviewers on the top
model at 62.9k–72.5k tokens each, and the triple-netted consolidator routed
to a cheaper model at 24.2k.

Honest framing: the run-level total only dropped 6.9%, because the second
run did substantially more work (bigger diff, bigger goal). The clean
comparison is per-stage and per-vote, which is why I report those. And n=2
full-pipeline runs is a measurement, not a benchmark suite.

What I measured and refused to ship

The negative results shaped the design more than the wins did.

A token-compression proxy: +51% weighted cost. An A/B/C experiment compressing payloads between agent and API increased cost via cache-write explosion — every compressed turn invalidates the prompt cache that agentic loops live on. Turn reduction beats payload compression on agentic workloads, full stop.
No model-tiering of reviewers, verifiers, fixers, or checkers. The rule that survived: tier a stage only when its errors are catchable downstream. The consolidator qualifies (three independent nets catch a bad merge). A verifier doesn't — it is the net. A planning-stage discovery synthesis doesn't either: its output becomes the unverified ground truth injected into every downstream prompt.
No agent-driven classification in front of coverage. Reducing scrutiny of a reported claim is recoverable — the verifier net catches it. Skipping a review dimension is not, because verification only checks findings that were reported. Any coverage reduction is deterministic (an extension allowlist in code), conservative, and loudly warned.

Takeaways

Cost follows tool calls, not prompt size. Profile turns, not prompts.
Quote upstream knowledge downstream — pay output tokens once at the narrow stage (N reviewers) to save turns at the wide stage (M verifiers).
Anything that gates spend becomes a surface — harden the signal in code, never trust a self-assessed label.
Tier models only behind a safety net you can name.
Compression proxies fight the prompt cache and lose.

Everything here shipped as readable code: the patterns (@workflow-toolbox/patterns on npm), the pipeline compositions, and the full cost-engineering writeup with the per-run numbers live in the workflow-toolbox repo.
It's free (PolyForm Noncommercial) and runs on Claude Code's Workflow tool
(research preview, paid plans).

DEV Community: Frédéric Thomas