DEV Community

Sensei
Sensei

Posted on

Lessons from a 109-agent code audit workflow

I spent 9.3M tokens on a 109-agent code audit so you don't have to

The short version: I pointed a swarm of AI agents at a ~5k-line codebase to hunt for things worth fixing. The pipeline was parallel subsystem mappers → 8 "finder" lenses → dedup → adversarial verification of every finding → a ranking panel → synthesis.

It worked. 32 verified findings, a clean top-10, no complaints about the output. But it cost ~9.3M tokens (call it $46 of API time) and somewhere north of two-thirds of that was me lighting money on fire. Here's the autopsy.

Where the tokens actually went

Stage Agents What went wrong
Verify 86 of 109 (79%!) Killed only 2 of 34 findings — I paid the full cost of re-reading the code 86 times for a 6% hit rate
Map 9 Redundant. The finders re-read the code anyway (correctly), so the map was a tax I paid twice
Find 8 lenses ~30% overlap between them (48 raw findings collapsed to 34 after dedup)
Prompts JSON.stringify(x, null, 2) — yes, the pretty-printing — quietly bloated every downstream prompt by 30-40%

The kicker: cache reads were 77% of all tokens. Every agent spawns with a fresh context and re-reads the same files from scratch. That's just the cost of fanning out wide — which means you'd better fan out where it pays.

The one mistake that mattered

I verified everything before I ranked anything. Adversarial verification is the single most expensive stage per unit of value, and most findings never even make the final cut. So I was paying premium prices to fact-check findings that were destined for the cutting-room floor.

Flip the order:

find → dedup → RANK → verify only the top ~12-15 → synthesize
Enter fullscreen mode Exit fullscreen mode

Same output. ~70% fewer agents. That's the whole lesson, really — the rest is footnotes.

Footnotes (the actual rules, in order of how much they save you)

  1. Rank first, verify after. Only spend adversarial verification on the findings that'll actually appear in the deliverable.
  2. Match the paranoia to the stakes. A wrong finding in an internal audit costs someone a few minutes of reading — one refuter is plenty. Save the full 3-lens panel (code-truth / impact / approach) for finalists or claims that trigger real action.
  3. Batch verification by file. 34 findings lived in ~10 files. One verifier checking every finding against the same file reads it once instead of ten times. Never do per-finding-per-lens fan-out — that's the expensive way to learn this lesson.
  4. Skip the mappers on small repos (<10k LOC). One agent can swallow the whole thing, and the finders re-read it anyway, so a map phase is overhead twice over: once to build it, once to staple it onto every finder prompt.
  5. Six finder lenses, tops — with explicit "you don't cover X, that's the other agent's job" boundaries. Overlap isn't free; every duplicate has to swim through dedup and verification before it dies.
  6. Compact your JSON. JSON.stringify(x), never null, 2. At agent-prompt scale, pretty-printing is just padding nobody reads.
  7. Send the cheap models to do the chores. Dedup, evidence-checking, code-truth verification — none of it needs the frontier model.
  8. Set a token budget up front and have the orchestrator check what's left before each fan-out, scaling the fleet down instead of barreling ahead open-loop.

What I'd keep, no notes

  • The code-truth veto in verification earned its keep — it caught 2 plausible-but-wrong findings that would've embarrassed me in the final report. Adversarial checking is worth every token on the things that survive ranking.
  • Structured output schemas on every agent: zero parse failures across all 109. Boring, reliable, great.
  • Hard caps on finder output (max 6 each) kept the funnel from ballooning.
  • Resumable runs with cached results — I stopped mid-run and picked it back up later for basically nothing.

TL;DR

Fan out to find. Converge before you verify. Breadth is for discovery; rigor is for the survivors. Do it backwards and you'll end up — as I did — with 79% of your agents diligently auditing things no human will ever read.

Top comments (1)

Collapse
 
mehmetcanfarsak profile image
Mehmet Can Farsak

The find → dedup → rank → verify flow is a great optimization. One thing I've seen in multi-agent setups: a lot of tokens get burned because agents don't have a 'thinking mode' vs 'action mode'. They'll start executing (tool calls, code generation) during phases that should be pure analysis or brainstorming.

I put together Brainstorm-Mode (mehmetcanfarsak/Brainstorm-Mode on GitHub) that adds three modes via hooks — divergent, actionable, academic — to keep agents in the right headspace for each pipeline stage. Could be useful for your finder/verifier separation.