
Deva

Posted on • Originally published at arihantdeva.com

Subagent Orchestration: Fanout, Pipeline, and Where Both Break

Fanout vs. Pipeline: Optimizing Parallel Subagent Execution

Most engineering teams start with a single LLM call, then graduate to chains, and eventually hit the same wall: one agent cannot do everything at once. The natural next step is to split work across multiple subagents. The question becomes how to coordinate them. Two patterns dominate: fanout, where a single task fractures into parallel subtasks; and pipeline, where output from one stage feeds sequentially into the next. Both exploit concurrency, but they fail in different ways and at different scales.

Fanout is the simpler mental model. You take a problem, decompose it into independent chunks, and dispatch each to its own subagent. A code review system might spawn ten agents, each analyzing a different module. A research task might split across ten parallel search queries. The parent agent collects results and synthesizes. The appeal is obvious: wall-clock time drops dramatically when work proceeds in parallel rather than serially.

Pipeline, by contrast, structures work as a sequence of transformations. Raw input passes through stage one, whose structured output becomes stage two's input, and so on. Each stage may itself run in parallel internally, but the macro structure is sequential. Pipelines excel when later stages need earlier stages' context to do meaningful work, or when output quality depends on progressive refinement rather than independent aggregation.

The critical insight is that these are not opposing choices but composable primitives. Real systems mix them: a fanout of parallel pipelines, or a pipeline whose middle stage fans out again. The engineering challenge is knowing where to place the boundary. Fanout assumes subtasks are truly independent. Violate that assumption and you pay in coherence; the parent synthesis becomes a forced reconciliation of incompatible partial answers. Pipeline assumes stages are cleanly separable. Blur that boundary and latency compounds; you have built a slower serial process with extra overhead.
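As a rough illustration of how the two primitives compose, here is a minimal Python sketch using asyncio. The `run_subagent` function is a hypothetical stand-in for a real model call, and the function names are assumptions made for this example, not part of any framework mentioned in this article.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Hypothetical stand-in for a real subagent call; a real one would hit an LLM API.
async def run_subagent(task: str) -> str:
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"result({task})"

async def fanout(tasks: Sequence[str]) -> list[str]:
    """Dispatch independent tasks concurrently; results keep input order."""
    return list(await asyncio.gather(*(run_subagent(t) for t in tasks)))

Stage = Callable[[str], Awaitable[str]]

async def pipeline(value: str, stages: Sequence[Stage]) -> str:
    """Feed each stage's output into the next, strictly in sequence."""
    for stage in stages:
        value = await stage(value)
    return value

async def fanout_of_pipelines(inputs: Sequence[str], stages: Sequence[Stage]) -> list[str]:
    """Composition: each input runs its own pipeline, and all pipelines run in parallel."""
    return list(await asyncio.gather(*(pipeline(i, stages) for i in inputs)))
```

The boundary question from the paragraph above shows up directly in the code: anything passed to `fanout` must not depend on a sibling's result, while anything placed in `stages` waits for everything before it.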

68% of developers report hitting scaling limitations when pushing LLM agent fan-out beyond 10 agents, according to the LangChain State of AI Agents 2024 report. This threshold marks where coordination overhead, token costs, and result aggregation complexity overwhelm the linear speedup from adding more parallel workers.

"You already know how to spawn a subagent. You've used delegate_task to research a problem, review some code, or pull context from a distant corner of your codebase," notes the guide Hermes Subagent Orchestration: Map, Reduce, and Fan-Out Patterns That Actually Ship, grounding the pattern in familiar primitives rather than abstract theory.

In practice, fanout works best when decomposition is clean and the synthesis step is well-defined. Leviathan's terminal environment uses fanout for tasks like multi-file analysis, where each subagent receives a scoped file context and returns structured findings. The parent then resolves conflicts and produces a unified report. Pipeline works best when quality depends on staged transformation, such as extracting entities, then reasoning about relationships, then generating a response. The failure mode of fanout is false independence: tasks that secretly share dependencies and produce redundant or contradictory outputs. The failure mode of pipeline is hidden latency: a bottleneck at any stage stalls the entire flow, and retry logic becomes complex when stages are stateful. Choosing between them, or combining them, requires mapping your actual dependency structure rather than defaulting to the pattern that sounds more parallel.

Structured Artifacts Over Prose: Enhancing Output Precision

Subagents that return unstructured prose create a hidden tax on orchestration. Every downstream step must parse intent, handle ambiguity, and recover from misaligned assumptions. The cost compounds: a single malformed paragraph can stall a pipeline, trigger retry loops, or corrupt aggregate results. Structured artifacts, schemas, and typed outputs eliminate this friction by making contracts explicit between producer and consumer.

The mechanism is straightforward. Each subagent emits a validated object, a JSON blob with known fields, rather than free text. The orchestrator consumes these objects directly, routing on field values, merging on keys, or filtering on predicates. No regex. No "extract the third sentence" heuristics. The schema becomes the API. At Leviathan, we have seen pipelines shrink from hundreds of lines of parsing logic to simple json.loads calls with Pydantic validation. The difference is not merely cosmetic. It changes which bugs are even possible.
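To make the "schema becomes the API" idea concrete, here is a small sketch of validating a subagent reply before the orchestrator touches it. It uses only the standard library (a dataclass plus json.loads) so it runs without dependencies; a Pydantic model would play the same role. The `Finding` fields and the severity values are assumptions invented for illustration.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    # Hypothetical artifact shape for a code review subagent.
    file: str
    severity: str
    message: str

ALLOWED_SEVERITIES = {"low", "medium", "high"}

def parse_finding(raw: str) -> Finding:
    """Parse one subagent reply, raising ValueError on any contract violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = {"file", "severity", "message"}
    if set(data) != expected:
        raise ValueError(f"expected fields {sorted(expected)}, got {sorted(data)}")
    if not all(isinstance(data[key], str) for key in expected):
        raise ValueError("all fields must be strings")
    if data["severity"] not in ALLOWED_SEVERITIES:
        raise ValueError(f"unknown severity: {data['severity']!r}")
    return Finding(**data)

def high_severity(findings: list[Finding]) -> list[Finding]:
    """Routing on a field value instead of parsing prose."""
    return [f for f in findings if f.severity == "high"]
```

A reply that fails `parse_finding` can be retried or dropped at the boundary, so downstream code only ever sees well-formed `Finding` objects.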

Average latency drops by a factor of 2.3 when switching from sequential pipeline to fan-out patterns for LLM agent workflows, according to "Benchmarking LLM Agent Orchestration Patterns" on arXiv. This gain is only realizable when every subagent returns a predictable structure that the orchestrator can merge without re-parsing.

The failure mode of prose is insidious. A subagent tasked with "summarize these findings" might return a bullet list, a narrative paragraph, or a numbered sequence depending on sampling temperature and context window pressure. Another subagent, expecting bullets, splits on newlines and receives garbage. The pipeline does not crash; it silently degrades. Structured artifacts turn such mismatches into explicit validation failures instead of silent degradation. They also enable a pre-flight check: you can validate a subagent's output against its schema before any downstream code runs.

"One subagent is a convenience. Ten subagents working in parallel is an architecture." This observation from Hermes Subagent Orchestration: Map, Reduce, and Fan-Out Patterns That Actually Ship captures why ad-hoc prose outputs collapse at scale. Architecture demands contracts, and contracts require structure.

The alternative, post-hoc extraction via secondary LLM calls, is common but expensive. It adds latency, cost, and another failure surface. Structured artifacts frontload this work into the prompt and schema definition. The upfront design cost pays back immediately in reduced operational complexity. Pick this tactic when your orchestration involves more than two subagents, when outputs feed into databases or APIs, or when you need to reason about correctness without reading every output by hand.

Context Isolation Tradeoffs: Balancing Efficiency and State Management

Every subagent orchestration strategy eventually confronts the same tension: how much context each subagent receives, and how much it returns. Give a subagent the full conversation history, and you pay for redundant tokens while risking confusion from irrelevant prior turns. Give it nothing, and you spend on re-derivation or watch it make decisions that conflict with work already done elsewhere. The production systems that survive are the ones that make this tradeoff explicit, not accidental.

Context isolation is the mechanism that enforces boundaries. In a fanout pattern, each subagent typically receives a scoped prompt with only the inputs it needs, plus any shared grounding documents. The parent agent then merges outputs. In a pipeline, each stage receives the prior stage's structured output, ideally with a strict schema rather than raw prose. The efficiency gains are real: reduced token counts, faster completion times, and fewer hallucinations triggered by noisy context. But the state management overhead grows quickly. You need a canonical store for intermediate results, versioned so that retries do not corrupt the chain. You need deterministic keys for correlating subagent outputs with their originating requests. And you need cleanup logic, because orphaned context accumulates.
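The three state management requirements (deterministic keys, versioned results, cleanup) can be sketched in a few lines. This is an in-memory illustration under stated assumptions: `request_key` and `ArtifactStore` are hypothetical names, and a production system would back the store with a database or object storage.

```python
import hashlib
import json
from typing import Optional

def request_key(agent: str, payload: dict) -> str:
    """Deterministic key: the same agent and inputs always hash to the same id."""
    canonical = json.dumps({"agent": agent, "payload": payload}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

class ArtifactStore:
    """In-memory stand-in for a canonical store; each write appends a new version."""

    def __init__(self) -> None:
        self._versions: dict[str, list[dict]] = {}

    def put(self, key: str, artifact: dict) -> int:
        versions = self._versions.setdefault(key, [])
        versions.append(artifact)
        return len(versions)  # 1-based version number

    def get(self, key: str, version: Optional[int] = None) -> dict:
        versions = self._versions[key]
        if version is None:
            return versions[-1]  # latest version
        if not 1 <= version <= len(versions):
            raise KeyError(f"{key} has no version {version}")
        return versions[version - 1]

    def discard(self, key: str) -> None:
        """Cleanup hook for orphaned intermediate results."""
        self._versions.pop(key, None)
```

Because a retry writes a new version instead of overwriting, a downstream stage that pinned version 1 keeps reading the same artifact even if the upstream stage is rerun.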

42% of LLM agent projects encounter deadlock or livelock when combining fan-out and pipeline patterns, according to a systematic study of orchestration failures. This figure underscores why context isolation is not merely a performance optimization but a stability requirement.

The failure modes are specific and observable. Over-isolation produces reconciliation bugs: two subagents modify related state in divergent ways, and the parent lacks the information to detect the conflict. Under-isolation produces attention dilution, where a subagent fixates on an earlier part of the conversation and ignores its actual task. In Leviathan's terminal environment, we have seen this manifest when a code-generation subagent receives the full parent trace including prior failed attempts; the subagent reintroduces the same bug because the failure pattern is salient in its context window.

"The difference between a demo and a production agent system is orchestration: how you split work, run it concurrently, and collapse the results back into something the parent agent can act on without drowning in context," notes the Hermes Subagent Orchestration guide. This framing applies directly to context isolation: the collapse step is where most systems leak efficiency or introduce subtle bugs.

The alternative to disciplined isolation is often implicit context inheritance, where subagents simply see whatever the parent saw. This works for demos and for narrow pipelines with no parallelism. It fails when fanout multiplies redundant context, or when a long pipeline exceeds the effective window of any single model. The cost is not just tokens; it is the cognitive load on each subagent, which degrades output quality in ways that are harder to measure than latency or spend. Choose explicit isolation when your orchestration combines parallel and serial stages, when subagents have distinct responsibilities that should not be contaminated, or when you need to retry individual stages without replaying the entire history. Accept the state management burden as the price of predictable scaling.

The 25+ Concurrent Sonnet Agents Cost Cliff: When Scalability Falters

Parallel execution feels free until the bill arrives. Engineers building with Claude Sonnet quickly discover that fanning out to dozens of subagents simultaneously hits a pricing wall that linear scaling models fail to predict. The problem is not throughput; it is the compounding cost of context windows, retry logic, and the hidden overhead of orchestration itself.

Each Sonnet request carries a base cost in tokens, but the real expense accumulates in ways that are not obvious from the per-request pricing page. When you launch twenty-five concurrent agents, you are not just paying for twenty-five completions. You are paying for twenty-five full context windows, each potentially stuffed with system prompts, tool schemas, and prior conversation history. Anthropic's token pricing for Sonnet applies to every input and output token across every agent, and running requests in parallel earns no discount on its own. Claude 3.5 Sonnet supports a context window of up to 200K tokens, and filling even a fraction of that across many agents multiplies costs fast.

Retry storms amplify the damage. Concurrent subagents often hit rate limits simultaneously. Your orchestrator retries with exponential backoff, but during that backoff, contexts grow. Agents that failed mid-generation may need to restart from scratch, replaying expensive prompt prefixes. A single retry on twenty-five agents is not twenty-five extra requests; it is twenty-five full context replays, sometimes with even longer prompts because your framework appended error context.
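One common way to keep agents that failed together from retrying together is to add random jitter to the backoff. The sketch below shows the "full jitter" variant; it is a general technique offered here as an assumption about what might help, not something this article's frameworks are known to implement, and the default `base` and `cap` values are arbitrary.

```python
import random
from typing import Optional

def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> list[float]:
    """Full-jitter exponential backoff.

    Delay i is drawn uniformly from [0, min(cap, base * 2**i)], so agents that
    were rate-limited at the same moment spread out instead of colliding again.
    """
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** i)) for i in range(attempts)]
```

Jitter only addresses the timing of the storm. It does nothing about the cost of each replayed context, which is why trimming the prompt prefix still matters.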

The 25-agent threshold matters because it is where most naive orchestration frameworks start to strain. Below this count, simple thread pools or async loops behave predictably. Above it, you encounter queuing delays, context management bugs, and cost spikes that look like exponential growth on your dashboard. The Anthropic API does not throttle differently at exactly 25, but your infrastructure does. Connection pools exhaust. Memory pressure from holding twenty-five active contexts forces garbage collection pauses. Your observability pipeline, if you built one, starts dropping spans.

Leviathan's approach to this problem at leviathanterminal.com involves tiered dispatch: a fast classifier routes work to a small hot pool of agents, while a larger cold pool handles overflow. This keeps concurrent Sonnet usage bounded without sacrificing parallelism where it actually helps. The key insight is that work destined for the same kind of agent can often be batched sequentially into fewer contexts, or routed to cheaper models for pre-filtering.

The failure mode is insidious because everything still works. Requests do not error out; they just cost three times what you budgeted. Your latency p99 looks fine. Your error rate is zero. Only the invoice reveals that your "scalable" fanout architecture is actually a cost amplifier dressed in async/await syntax. Engineers notice too late, usually during a monthly review or when a demo account suddenly demands a credit limit increase.

Mitigation requires explicit concurrency caps, aggressive context trimming, and model tiering. Cap concurrent Sonnet agents to a number you can afford to run for an hour straight. Trim system prompts and tool definitions to the minimum viable set; every token you save per agent multiplies across your concurrency. Route preprocessable work to Haiku or to cached results before ever touching Sonnet. Measure cost per task, not just latency or success rate. The metric that matters is dollars per completed workflow, and that number degrades faster than throughput as you add agents.
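An explicit concurrency cap is the easiest of these mitigations to show in code. The following is a minimal sketch using asyncio.Semaphore; `run_capped` and its parameters are hypothetical names, and `worker` stands in for whatever coroutine actually calls the model.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")

async def run_capped(tasks: Sequence[T],
                     worker: Callable[[T], Awaitable[R]],
                     max_concurrent: int) -> list[R]:
    """Run worker(task) for every task with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(task: T) -> R:
        async with semaphore:  # waits here once the cap is reached
            return await worker(task)

    # gather preserves input order even though completion order may differ.
    return list(await asyncio.gather(*(guarded(t) for t in tasks)))
```

The cap turns a burst of requests into a steady stream, which makes worst-case spend per unit time a number you choose instead of one you discover.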

When Fanout and Pipeline Fail: Identifying Breakpoints in Scalability

Both fanout and pipeline patterns hit walls. The failures do not announce themselves with clear errors. They surface as latency spikes, cost explosions, or silent quality degradation. Recognizing the breakpoints before they become outages is the core operational skill.

Fanout breaks first on coordination overhead. Below a threshold of parallel units, adding workers reduces wall-clock time linearly. Past that threshold, the orchestrator itself becomes the bottleneck. It must collect partial results, handle stragglers, and merge outputs. In Leviathan's routing layer, we observed that fanning out to more than twelve subagents for a single user query introduced merge latency that erased the parallel speedup. The breakpoint was not the LLM throughput. It was the serialization and ranking of divergent outputs.

Pipeline breaks on error propagation and state bloat. Each stage depends on the prior stage's output. A malformed intermediate artifact cascades. Recovery requires either expensive backtracking or discarding partial work. More subtly, context windows accumulate. A pipeline with four stages, each appending reasoning to a shared context, can exceed token limits even when each stage individually fits. The breakpoint here is often invisible until the fourth or fifth stage, when a previously reliable pipeline suddenly fails on longer inputs.

The most dangerous failures are the gradual ones. A fanout that degrades gracefully still produces correct output, just slower. A pipeline that degrades still completes, but with accumulated hallucinations as later stages compensate for earlier mistakes. Neither raises an error, so both require active monitoring: track end-to-end latency per parallel width, and track per-stage validation scores in pipelines.

Cost cliffs deserve explicit attention. Parallel execution multiplies base costs. Sequential execution multiplies them over time. Neither scales linearly with value delivered. The breakpoint is often economic rather than technical. A fanout of twenty subagents may complete in seconds, but if seventeen of those outputs are discarded, the effective cost per useful token becomes unsustainable.

The practical response is hybrid. Run a narrow fanout to generate diverse candidates, then pipeline a validation and refinement stage on the top two or three. Stop the fanout early when variance across its outputs is low, since that indicates consensus and further candidates add little. Abort pipelines when stage-level validation drops below a threshold, rather than propagating damage forward. Breakpoints are not fixed numbers. They shift with model version, prompt complexity, and output format. Treat them as measured properties of a specific deployment, not architectural constants.
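The hybrid response can be sketched as a single function: rank the fanout candidates, keep the best few, and drop any candidate whose refinement pipeline fails validation. The scoring scheme, the `min_score` default, and the stage signature are all assumptions for illustration; how scores are produced is left to the deployment.

```python
from typing import Callable, Optional, Sequence

# Each stage returns (new_output, validation_score); the scores are hypothetical.
Stage = Callable[[str], tuple[str, float]]

def refine_top_candidates(candidates: Sequence[tuple[str, float]],
                          stages: Sequence[Stage],
                          top_k: int = 2,
                          min_score: float = 0.7) -> list[str]:
    """Keep the top_k scored candidates from a narrow fanout, then run each
    through the stages, aborting a candidate as soon as validation falls
    below min_score."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
    survivors: list[str] = []
    for text, _ in ranked:
        current: Optional[str] = text
        for stage in stages:
            current, score = stage(current)
            if score < min_score:
                current = None  # abort here instead of propagating damage forward
                break
        if current is not None:
            survivors.append(current)
    return survivors
```

Candidates outside the top `top_k` never enter the pipeline at all, which is where most of the savings come from.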

Beyond Parallelism: Advanced Orchestration Patterns for Complex Workloads

Fanout and pipeline patterns handle many workloads, but complex problems demand more sophisticated orchestration. Advanced patterns combine, nest, and conditionally branch subagent execution to match structure to task complexity without overspending on unnecessary parallelism.

Hierarchical fanout adds a tree structure to flat parallelism. A root agent decomposes a problem into subproblems, then each subproblem fans out to its own subagents. In Leviathan, we use this for multi-file code generation: one agent scopes the change, domain agents draft each file, and a final integration pass resolves conflicts. The key control is depth. Each additional level multiplies context overhead and cost. We cap at two levels for most tasks; three only when file dependencies are genuinely sparse.

Conditional routing sends work to different subagent pools based on intermediate results. A classifier subagent tags incoming tasks by type; each tag triggers a specialized pipeline. This avoids the waste of running all tasks through all stages. The failure mode is classifier drift. If the router degrades, the wrong specialist handles work silently. We monitor routing accuracy with a lightweight validator that samples 10% of decisions and escalates mismatches.
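A sketch of conditional routing with a sampled audit might look like the following. The classifier, the pipelines, and the reference classifier used for auditing are all hypothetical callables supplied by the caller; only the 10% sample rate comes from the text above.

```python
import random
from typing import Callable, Optional, Sequence

def route(task: str,
          classify: Callable[[str], str],
          pipelines: dict[str, Callable[[str], str]]) -> str:
    """Tag the task, then hand it to the pipeline registered for that tag."""
    tag = classify(task)
    if tag not in pipelines:
        raise KeyError(f"no pipeline registered for tag {tag!r}")
    return pipelines[tag](task)

def audit_routing(decisions: Sequence[tuple[str, str]],
                  reference_classify: Callable[[str], str],
                  sample_rate: float = 0.10,
                  rng: Optional[random.Random] = None) -> list[tuple[str, str]]:
    """Re-check a random sample of (task, tag) decisions against a reference
    classifier and return the mismatches for escalation."""
    rng = rng or random.Random()
    mismatches = []
    for task, tag in decisions:
        if rng.random() < sample_rate and reference_classify(task) != tag:
            mismatches.append((task, tag))
    return mismatches
```

Raising on an unknown tag is a deliberate choice in this sketch: a router that silently falls back to a default pipeline hides exactly the drift the audit is meant to catch.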

Iterative refinement loops replace one-shot generation with repeated subagent passes. An agent produces output, a critic subagent flags issues, and the producer revises. This resembles pipeline but with feedback cycles. It excels where quality matters more than latency, such as security-sensitive code review. The loop terminates when the critic returns no findings or hits a maximum iteration count. Without that cap, adversarial dynamics between producer and critic can inflate costs arbitrarily.
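The termination rule described above fits in a short loop. In this sketch, `critic` and `revise` are hypothetical callables wrapping the critic and producer subagents, and the default cap of three iterations is an arbitrary assumption.

```python
from typing import Callable

def refine(draft: str,
           critic: Callable[[str], list[str]],
           revise: Callable[[str, list[str]], str],
           max_iterations: int = 3) -> tuple[str, int]:
    """Loop until the critic has no findings or the iteration cap is hit.

    Returns the final draft and the number of revisions made.
    """
    for revisions in range(max_iterations):
        findings = critic(draft)
        if not findings:
            return draft, revisions  # critic is satisfied
        draft = revise(draft, findings)
    return draft, max_iterations  # cap reached; the draft may still have issues
```

Returning the revision count lets the caller distinguish a clean exit from a capped one, which is useful for spotting producer and critic pairs that never converge.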

Dynamic resource allocation adjusts parallelism based on real-time cost and quality signals. Rather than fixed fanout counts, an orchestrator scales subagent instances up when confidence is low and down when convergence is fast. This requires instrumenting each subagent for latency and output variance, then applying a simple control policy (we use a threshold on inter-agent disagreement).
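A threshold policy on inter-agent disagreement could be as simple as the sketch below. The disagreement measure (share of outputs that differ from the most common answer), the threshold, and the width bounds are all assumptions chosen for illustration; the article does not specify how Leviathan computes them.

```python
from collections import Counter

def disagreement(outputs: list[str]) -> float:
    """Fraction of outputs that differ from the most common answer.

    0.0 means full consensus; values near 1.0 mean almost no two agents agree.
    """
    if not outputs:
        return 0.0
    top_count = Counter(outputs).most_common(1)[0][1]
    return 1.0 - top_count / len(outputs)

def next_width(current: int, outputs: list[str], threshold: float = 0.3,
               min_width: int = 1, max_width: int = 8) -> int:
    """Widen by one agent when disagreement exceeds the threshold,
    narrow by one otherwise, staying within [min_width, max_width]."""
    step = 1 if disagreement(outputs) > threshold else -1
    return max(min_width, min(max_width, current + step))
```

Exact string matching only works when outputs are normalized structured values; free-text outputs would need a similarity measure in its place.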

These patterns share a design principle: orchestration should be explicit, inspectable, and reversible. Implicit coordination through shared state or emergent behavior fails at scale. Every advanced pattern we use includes a rollback mechanism to a known good state, because complexity compounds failure modes in ways that flat parallelism does not.

Navigating the Cost-Performance Tradeoff in Large-Scale Subagent Deployments

At some point, every team building with subagents faces the same spreadsheet: agent count multiplied by tokens per turn, multiplied by turns per task, multiplied by model price per million tokens. The numbers escalate quickly. The question is not whether to optimize, but where to spend your complexity budget first.
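The spreadsheet reduces to one multiplication, shown here as a small function. The price in the usage example is a placeholder chosen for round numbers, not a quoted rate, and the formula follows the text in using a single blended price; real pricing usually distinguishes input from output tokens.

```python
def estimated_cost_usd(agents: int, tokens_per_turn: int, turns_per_task: int,
                       price_per_million_usd: float) -> float:
    """Cost of one task: agents x tokens per turn x turns x price per million tokens."""
    total_tokens = agents * tokens_per_turn * turns_per_task
    return total_tokens * price_per_million_usd / 1_000_000

# Hypothetical example: 25 agents, 8,000 tokens per turn, 4 turns per task,
# at an assumed blended price of 3.00 USD per million tokens.
example = estimated_cost_usd(25, 8_000, 4, 3.00)  # 800,000 tokens in total
```

With those assumed inputs a single task consumes 800,000 tokens, so the per-task figure looks small until it is multiplied by daily task volume and by retries.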

The cost model for subagent systems has three levers: concurrency width, model tier, and context retention. Concurrency width is the number of agents running simultaneously. Model tier is the specific model each agent uses. Context retention is how much history each agent keeps alive across turns. Each lever trades against output quality and latency in predictable ways, but the interactions between them are less obvious.

Running many agents on a cheaper model tier seems appealing until you measure retry rates. A fanout of twenty subagents on a lightweight model may cost less per invocation, but if half the outputs fail validation and must be regenerated, the effective cost converges on the more expensive tier while adding latency. Conversely, running everything on the most capable model flattens your cost curve into a single expensive line that scales linearly with task volume. The useful space is between these extremes: classify the task first, route to the appropriate tier, and reserve the heavy model for subproblems where failure is expensive.

Context retention is the hidden multiplier. Each subagent that carries full conversation history into every turn balloons token usage. In pipeline patterns, where Agent B needs only the structured output of Agent A, passing the entire transcript is pure waste. The fix is narrow: define explicit artifact schemas and pass only what the next stage requires. In fanout patterns, where agents may need shared background, the temptation is to hydrate every context with the same preamble. Deduplicate this. Load shared context once, reference it by identifier, and let each agent instantiate only what it needs.

The breakpoint where cost discipline matters most is the transition from tens to hundreds of concurrent agents. Below this threshold, rough estimates suffice. Above it, small inefficiencies compound into budget overruns and rate-limit collisions. The teams that navigate this well treat subagent orchestration as infrastructure, not prompt engineering: they meter token flow, set per-task budgets, and fail fast when a path exceeds its allocation.
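Metering with a per-task budget and failing fast can be sketched as a small guard object. `TokenBudget` and `BudgetExceeded` are hypothetical names; in practice the token counts passed to `charge` would come from the usage figures the model API reports for each call.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a task spends more tokens than it was allocated."""

class TokenBudget:
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record usage and fail fast once the task passes its allocation."""
        self.used += tokens
        if self.used > self.limit:
            raise BudgetExceeded(f"used {self.used} of {self.limit} tokens")

    @property
    def remaining(self) -> int:
        return max(0, self.limit - self.used)
```

An orchestrator would create one budget per task, charge it after every subagent call, and treat `BudgetExceeded` as a signal to stop scheduling further work on that path.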

Pick this tactic when your subagent count is growing and your current approach is "use the best model everywhere and keep everything." That works for prototypes. It does not survive contact with production scale.
