yureki_lab

Posted on Jun 14

5 Mistakes I Made Designing My First Claude Code Sub-Agent Pipeline

#ai #claudecode #softwareengineering #programming

TL;DR

I spent a weekend wiring up my first multi-agent pipeline with Claude Code, and almost every design choice I made was wrong. Here are the 5 mistakes — one monolithic prompt, free-text returns, eager barriers, ignored concurrency caps, and no dedup — and how I fixed each one. If you're about to fan out sub-agents for the first time, read this before you ship.

The Problem

I wanted to run a "find bugs in this repo" sweep across a medium-sized codebase. The naive version was easy: one Claude Code session, one big prompt, walk the tree. But it was slow, it ran out of context partway through, and the output was a wall of unstructured prose I had to re-parse by hand.

So I rewrote it as a fan-out: spawn N sub-agents, each focused on a slice, collect their findings, verify, dedupe, report. Classic map-reduce on top of Claude Code sub-agents.

The first version "worked" — it produced output. But the wall-clock was almost as bad as the monolith, half the findings were duplicates, and roughly 1 in 5 sub-agent results couldn't be parsed at all. I rebuilt it three times in a week. These are the mistakes I keep seeing in my own code and in other people's.

Versions: Claude Code v2.x, Node.js 22.x. The patterns are general but the API surface I reference is from late 2025/early 2026.

How I Solved It

I'll walk through the 5 mistakes one by one. Each one has the "before" sketch I actually wrote, then the version I landed on.

Mistake 1 — One monolithic prompt for every sub-agent

My first fan-out gave every sub-agent the same prompt and just varied the input slice:

const findings = await Promise.all(
  slices.map(slice =>
    runAgent(`Find bugs in this code. Look for correctness issues,
              security holes, performance problems, dead code,
              and missing error handling. Be thorough.\n\n${slice}`)
  )
)

This looks tidy. It's also why my outputs were noisy. Every agent tried to be a generalist, every agent re-discovered the same shallow issues (unused variables, missing null checks), and nobody went deep on anything.

The fix was to give each agent a single lens. Same input, different prompts:

const LENSES = [
  { name: 'correctness', prompt: 'Find logic bugs. Ignore style.' },
  { name: 'security',    prompt: 'Find injection, auth, secret-handling bugs.' },
  { name: 'concurrency', prompt: 'Find race conditions and ordering bugs.' },
]
const findings = await Promise.all(
  LENSES.flatMap(lens =>
    slices.map(slice => runAgent(`${lens.prompt}\n\n${slice}`, { label: lens.name }))
  )
)

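Note that the code above runs one agent per lens per slice, so holding the agent count steady means using fewer, larger slices. At the same agent count, the signal was vastly better. Diversity beats redundancy when you're searching for something you can't fully specify upfront.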
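One thing a flat list of lens-by-slice results loses is which lens and which slice each result came from. Here is a minimal sketch of keeping that association. `buildJobs`, `groupByLens`, and the job shape are hypothetical helpers for illustration, not part of any Claude Code API:

```javascript
// Hypothetical helper: expands lenses x slices into labeled jobs.
// The job shape ({ lens, sliceIndex, prompt }) is an assumption for illustration.
function buildJobs(lenses, slices) {
  return lenses.flatMap((lens) =>
    slices.map((slice, sliceIndex) => ({
      lens: lens.name,
      sliceIndex,
      prompt: `${lens.prompt}\n\n${slice}`,
    }))
  )
}

// Groups labeled jobs (or results that carry the same `lens` field) by lens name.
function groupByLens(jobs) {
  const groups = {}
  for (const job of jobs) {
    (groups[job.lens] ??= []).push(job)
  }
  return groups
}

const demoLenses = [
  { name: 'correctness', prompt: 'Find logic bugs. Ignore style.' },
  { name: 'security', prompt: 'Find injection, auth, secret-handling bugs.' },
]
const demoJobs = buildJobs(demoLenses, ['function a() {}', 'function b() {}', 'function c() {}'])
console.log(demoJobs.length) // 6: two lenses times three slices
```

Carrying `lens` and `sliceIndex` through to the results makes it easy to see later which lens produced which finding.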

Mistake 2 — Free-text returns

I let agents return prose, then tried to regex out the findings. Roughly 20% of returns had a header I didn't anticipate, or a numbering scheme that broke my parser, or a "By the way…" tail that polluted the next stage.

The fix: enforce a schema at the tool layer. Most agent frameworks now support forcing the agent to call a structured-output tool. In Claude Code's workflow primitives this looks like:

const FINDING_SCHEMA = {
  type: 'object',
  required: ['findings'],
  properties: {
    findings: {
      type: 'array',
      items: {
        type: 'object',
        required: ['file', 'line', 'severity', 'description'],
        properties: {
          file:        { type: 'string' },
          line:        { type: 'integer' },
          severity:    { enum: ['low', 'medium', 'high'] },
          description: { type: 'string' },
        },
      },
    },
  },
}

const result = await runAgent(prompt, { schema: FINDING_SCHEMA })
// result is already a typed object — no parsing

The agent retries internally on schema mismatch, so by the time I get the object back, it's valid. This single change cut my downstream code in half and eliminated the parse-error tax.
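If the runner you use does not enforce the schema for you, or you want a second check at the boundary, a hand-rolled validator for this one shape is short. This is a sketch only: `validateFindings` is a hypothetical helper that mirrors the finding schema by hand, not a general JSON Schema validator.

```javascript
// Hypothetical hand-rolled check that mirrors the finding schema above.
// Returns a list of human-readable problems; an empty list means the shape is valid.
const SEVERITIES = ['low', 'medium', 'high']

function validateFindings(result) {
  if (result === null || typeof result !== 'object' || !Array.isArray(result.findings)) {
    return ['findings: expected an array']
  }
  const errors = []
  result.findings.forEach((f, i) => {
    if (f === null || typeof f !== 'object') {
      errors.push(`findings[${i}]: expected an object`)
      return
    }
    if (typeof f.file !== 'string') errors.push(`findings[${i}].file: expected a string`)
    if (!Number.isInteger(f.line)) errors.push(`findings[${i}].line: expected an integer`)
    if (!SEVERITIES.includes(f.severity)) errors.push(`findings[${i}].severity: expected one of ${SEVERITIES.join(', ')}`)
    if (typeof f.description !== 'string') errors.push(`findings[${i}].description: expected a string`)
  })
  return errors
}

const good = { findings: [{ file: 'src/a.js', line: 12, severity: 'high', description: 'Off-by-one in loop bound' }] }
const bad = { findings: [{ file: 'src/a.js', line: '12', severity: 'critical' }] }
console.log(validateFindings(good).length) // 0
console.log(validateFindings(bad).length)  // 3: line, severity, description
```

Returning a list of problems rather than throwing makes it easy to log every mismatch for a result before deciding whether to retry or drop it.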

Mistake 3 — Eager barriers between stages

My original pipeline looked like this:

const reviews  = await Promise.all(items.map(reviewAgent))  // BARRIER
const verified = await Promise.all(reviews.map(verifyAgent)) // BARRIER

Looks clean. It also means the verify stage cannot start until every review finishes. If one slow reviewer takes 3x the median, the verifier sits idle for that whole stretch.

The fix is to pipeline: each item flows through all stages independently. Item A can be in verify while item B is still in review.

async function pipeline(items, ...stages) {
  return Promise.all(items.map(async (item) => {
    let cur = item
    for (const stage of stages) cur = await stage(cur)
    return cur
  }))
}

const results = await pipeline(items, reviewAgent, verifyAgent)

Wall-clock dropped from "sum of slowest per stage" to "slowest single-item chain." On a 12-item run that's the difference between ~90s and ~35s for me.
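The two wall-clock formulas can be checked with plain arithmetic. A small sketch, using made-up durations and assuming every item can run concurrently (no concurrency cap); the function names are illustrative:

```javascript
// durations[i][s] is the time item i spends in stage s (any unit).
// Barrier: every stage waits for its slowest item, so the stage maxima add up.
function barrierWallClock(durations) {
  const stageCount = durations[0].length
  let total = 0
  for (let s = 0; s < stageCount; s++) {
    total += Math.max(...durations.map((item) => item[s]))
  }
  return total
}

// Pipeline: each item runs its own chain, so the slowest chain decides.
function pipelineWallClock(durations) {
  return Math.max(...durations.map((item) => item.reduce((a, b) => a + b, 0)))
}

// Made-up numbers: [review, verify] seconds for three items.
const sample = [
  [10, 5],
  [30, 5],
  [10, 20],
]
console.log(barrierWallClock(sample))  // 50 = max(10, 30, 10) + max(5, 5, 20)
console.log(pipelineWallClock(sample)) // 35 = max(15, 35, 30)
```

The gap grows when the slow item differs from stage to stage, because the barrier version pays for the worst case of every stage.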

A barrier is only correct when stage N actually needs all of stage N-1 (dedup across the full set, early-exit if zero findings, cross-item comparison). Otherwise: pipeline.

Mistake 4 — Ignoring the concurrency cap

I gleefully shoved 80 items into Promise.all. The runner happily accepted them, then quietly queued 70 of them while running 10 at a time. My logs showed "80 agents started" — but only 10 were actually doing work, and I had no idea why my wall-clock was so bad.

Two fixes, depending on the situation:

Know your cap. Most agent runners have a concurrency cap (often something like min(16, CPU cores - 2)). Anything above that queues. If you want to reason about wall-clock, treat the cap as your effective batch size.
Right-size the fan-out. I now scale fan-out to the work budget, not "as wide as possible":

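// The runner's cap; it was 10 in the run described above. Check yours.
const MAX_CONCURRENCY = 10
const BATCH = Math.min(items.length, MAX_CONCURRENCY)
console.log(`Running ${BATCH} concurrent; ${items.length - BATCH} queued.`)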

Logging the queued count was the single most useful debug change I made all month. It turned an invisible bottleneck into a number.
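If the runner does not expose its queue, one option is to enforce the cap in your own code so the in-flight count is never a guess. A minimal sketch of a worker-pool limiter follows. `mapWithLimit` and `stubAgent` are hypothetical names, the limit is assumed to be at least 1, and the stub only simulates an agent call with a timer:

```javascript
// Hypothetical helper: runs fn over items with at most `limit` calls in flight.
// Results keep input order. A rejection from fn rejects the returned promise, as with Promise.all.
async function mapWithLimit(items, limit, fn) {
  const results = new Array(items.length)
  let next = 0
  async function worker() {
    while (next < items.length) {
      const i = next++ // claimed synchronously, so no two workers share an index
      results[i] = await fn(items[i], i)
    }
  }
  const workerCount = Math.min(limit, items.length)
  await Promise.all(Array.from({ length: workerCount }, () => worker()))
  return results
}

// Demo with a stub "agent" that tracks how many calls overlap.
let inFlight = 0
let peak = 0
async function stubAgent(item) {
  inFlight++
  peak = Math.max(peak, inFlight)
  await new Promise((resolve) => setTimeout(resolve, 5))
  inFlight--
  return item * 2
}

const demo = mapWithLimit([1, 2, 3, 4, 5, 6, 7], 3, stubAgent).then((out) => {
  console.log(out, `peak concurrency: ${peak}`) // peak stays at 3
  return out
})
```

Unlike fixed-size batches, each worker picks up the next item as soon as it finishes, so one slow item does not hold up a whole batch.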

Mistake 5 — No dedup before verification

Verification is the expensive stage. Each verifier reads files, runs tools, and asks Claude to refute the claim. So when my finders surfaced "this function lacks input validation" from three different lenses, I was paying 3x for the same finding.

The fix is dumb-simple — dedup in plain code between fan-out stages:

const seen = new Set()
const fresh = allFindings.filter(f => {
  const key = `${f.file}:${f.line}:${f.description.slice(0, 60)}`
  if (seen.has(key)) return false
  seen.add(key)
  return true
})
const verified = await Promise.all(fresh.map(verifyAgent))

The temptation is to make the dedup itself an agent ("ask Claude to merge similar findings"). Don't. A Set and a stable key are faster, deterministic, and free. Reach for an agent only when the comparison genuinely needs judgment.
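One caveat with a key that includes the description text: different lenses rarely word the same problem identically, so near-duplicates can slip through. A coarser variant keys on location only and merges what it finds there. This is a sketch under a stated assumption (same file and line means same finding). `mergeFindings` and the `lens` field on each finding are hypothetical additions, not part of the schema shown earlier:

```javascript
// Assumption: two findings at the same file and line describe the same problem.
// That is coarser than a key with description text and can over-merge; adjust to taste.
const SEVERITY_RANK = { low: 0, medium: 1, high: 2 }

function mergeFindings(allFindings) {
  const byLocation = new Map()
  for (const f of allFindings) {
    const key = `${f.file}:${f.line}`
    const kept = byLocation.get(key)
    if (!kept) {
      byLocation.set(key, { ...f, reportedBy: [f.lens] })
      continue
    }
    if (!kept.reportedBy.includes(f.lens)) kept.reportedBy.push(f.lens)
    // Keep the most severe rating and its description.
    if (SEVERITY_RANK[f.severity] > SEVERITY_RANK[kept.severity]) {
      kept.severity = f.severity
      kept.description = f.description
    }
  }
  return [...byLocation.values()]
}

const raw = [
  { file: 'api.js', line: 40, severity: 'medium', description: 'No input validation', lens: 'correctness' },
  { file: 'api.js', line: 40, severity: 'high', description: 'Unvalidated input reaches SQL query', lens: 'security' },
  { file: 'db.js', line: 7, severity: 'low', description: 'Connection never closed', lens: 'correctness' },
]
const merged = mergeFindings(raw)
console.log(merged.length) // 2
```

Recording `reportedBy` keeps a useful signal: a finding that several lenses reached independently is probably worth verifying first.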

The shape I landed on

After all five fixes, the pipeline looks roughly like this:

flowchart LR
    A[Slices] --> B[Lens 1: correctness]
    A --> C[Lens 2: security]
    A --> D[Lens 3: concurrency]
    B --> E[Dedup]
    C --> E
    D --> E
    E --> F[Verify - pipelined]
    F --> G[Report]

Multiple lenses (mistake 1), schema-enforced returns (2), no barriers except the one dedup genuinely needs (3), batch-aware concurrency (4), dedup before the expensive stage (5).

Lessons Learned

Diversity beats redundancy. If you're spawning N agents on the same problem, give them N different angles. N copies of the same prompt is wasted spend.
Schemas are not bureaucracy — they're a parser. Forcing structured output at the tool layer is the single highest-leverage change you can make to a multi-agent system. Stop regexing prose.
Pipeline by default, barrier only when you must. Most "I'll await everything then start the next stage" code is a wall-clock tax for no reason. The barrier is correct only when stage N needs cross-item context from all of stage N-1.
The concurrency cap is real and silent. If your runner queues, you need to know. Log the queued count. Right-size fan-out to the cap, not your ambition.
Dedup with code, not agents. Cheap deterministic operations (filter, group, dedup, sort) belong in your script, not in an LLM call. Reserve the agents for the judgment calls.

What's Next

The version I have running now still has a verification stage that's overly trusting — if a finder is confidently wrong, a single verifier can rubber-stamp it. I'm experimenting with an adversarial panel: three skeptics per finding, each prompted to refute, kill if a majority refute. Early results look promising but the cost goes up linearly, so I want to measure precision/recall properly before I write that one up.
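The vote rule and its cost are simple enough to pin down in code before any agents are involved. A sketch, assuming each skeptic returns a `{ refuted: boolean }` vote; that shape and both function names are hypothetical:

```javascript
// Hypothetical vote rule: a finding is killed when a strict majority of skeptics refute it.
// Each vote is { refuted: boolean }; that shape is an assumption for illustration.
function survivesPanel(votes) {
  const refutations = votes.filter((v) => v.refuted).length
  return refutations * 2 <= votes.length
}

// Verification calls grow linearly with panel size.
function verificationCalls(findingCount, panelSize) {
  return findingCount * panelSize
}

console.log(survivesPanel([{ refuted: true }, { refuted: false }, { refuted: false }])) // true
console.log(survivesPanel([{ refuted: true }, { refuted: true }, { refuted: false }]))  // false
console.log(verificationCalls(40, 3)) // 120, versus 40 with a single verifier
```

With an odd panel size there are no ties. With an even one, this rule lets a tied finding survive, which is a choice worth making deliberately.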

I'm also tracking how much of the pipeline's wall-clock is the slowest single agent in each stage. If it's consistently one outlier, the right move is probably a timeout-and-retry rather than waiting it out.
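A timeout-and-retry wrapper could look roughly like this. It is a sketch, not a tested production helper: `withTimeoutAndRetry` and `flakyAgent` are hypothetical names, the stub simulates an agent with timers, and a timed-out attempt is only abandoned, since real cancellation depends on what the runner supports.

```javascript
// Hypothetical wrapper: gives each attempt `ms` milliseconds, retrying up to `retries` times.
// It retries on any failure, not only timeouts. A timed-out attempt is abandoned, not cancelled.
async function withTimeoutAndRetry(fn, { ms, retries }) {
  let lastError
  for (let attempt = 0; attempt <= retries; attempt++) {
    let timer
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
    })
    try {
      return await Promise.race([fn(attempt), timeout])
    } catch (err) {
      lastError = err
    } finally {
      clearTimeout(timer) // avoid leaving a stray timer behind after each attempt
    }
  }
  throw lastError
}

// Stub agent: hangs on the first attempt, answers quickly on the second.
const flakyAgent = (attempt) =>
  new Promise((resolve) => setTimeout(() => resolve(`ok on attempt ${attempt}`), attempt === 0 ? 200 : 1))

const retryDemo = withTimeoutAndRetry(flakyAgent, { ms: 20, retries: 2 })
retryDemo.then((value) => console.log(value)) // "ok on attempt 1"
```

Because the abandoned attempt keeps running in the background, it may still consume a concurrency slot until it finishes.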

Wrap-up / CTA

If you're building anything with Claude Code sub-agents, try the schema fix first; it'll pay for itself within a day. The pipelining change is a bigger win but more invasive, so do it once your output is reliable enough to trust.

If this was useful:

Follow me on Dev.to — I'm writing up more agent-design war stories as I hit them.
If you haven't tried Claude Code yet, the sub-agent + workflow primitives are what made all of this even possible.
Hit me up in the comments with your own multi-agent mistakes — I want to collect a "things that bit us all" list.

Build in public, break in private. 🛠️

Top comments (1)

Adam Lewis • Jun 14

Useful write-up. The single-verifier problem you flagged for next time is the one that worries me most: a finder that's confidently wrong gets rubber-stamped, and the error surfaces months later as something everyone trusted. Before an adversarial panel, the cheaper step is giving the verifier something deterministic to check, the way your schema fix turned parsing from a judgement into a check. Same instinct as dedup-with-code-not-agents.