Three months ago, I had a single agent handling document classification, tagging, and summary generation for a client workflow. It worked fine with 50 documents a day. Then volume hit 500. The agent started taking 40 minutes per batch. Throughput didn't scale — it imploded.
The fix wasn't a bigger model. It was splitting the agent into three specialized roles running in parallel. Batch time dropped from 40 minutes to 4. The model never changed. The architecture did.
This is the part of AI agent development nobody talks about enough: agent orchestration architecture. The gap between "it works" and "it scales" is almost always an architectural problem, not a model problem.
The Sequential Trap
When developers first build agent systems, they almost always start with one agent doing everything in sequence:
# The trap: one agent, everything sequential
class DocumentPipeline:
    def __init__(self, agent):
        # agent: any object with a blocking .run(prompt) method
        self.agent = agent

    def process(self, doc):
        # Three blocking calls, one after another
        summary = self.agent.run(f"summarize: {doc}")
        tags = self.agent.run(f"extract tags: {doc}")
        classification = self.agent.run(f"classify: {doc}")
        return {"summary": summary, "tags": tags, "class": classification}
Each .run() call is a separate LLM call with full context passed in. For one document, fine. For 500, you're making 1,500 LLM calls sequentially. Even at 2 seconds per call, that's 3,000 seconds — 50 minutes.
The model is barely working. Your pipeline spends most of that time blocked, waiting on one network round trip after another.
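The arithmetic above is worth rerunning with your own numbers. Here is a rough sketch, assuming a fixed latency per call and ignoring rate limits and retries. `estimate_batch_seconds` is a hypothetical helper written for this post, not part of any library:

```python
import math

def estimate_batch_seconds(docs: int, calls_per_doc: int,
                           seconds_per_call: float, concurrency: int = 1) -> float:
    """Estimate wall-clock time when at most `concurrency` calls are in flight."""
    total_calls = docs * calls_per_doc
    # Calls complete in waves of size `concurrency`; each wave costs one call latency.
    waves = math.ceil(total_calls / concurrency)
    return waves * seconds_per_call

# 500 documents, 3 calls each, 2 seconds per call
sequential = estimate_batch_seconds(500, 3, 2.0)      # one call at a time
three_way = estimate_batch_seconds(500, 3, 2.0, 3)    # three calls in flight
print(sequential / 60, three_way / 60)                # minutes
```

It is an idealized model, since real call latencies vary, but it shows quickly how much of the batch time is pure waiting.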
The Multi-Agent Pattern That Actually Works
The fix is to split by role and run agents in parallel when they don't depend on each other:
import asyncio
from dataclasses import dataclass
from typing import List

@dataclass
class DocResult:
    summary: str
    tags: List[str]
    classification: str

# Agent stands in for your framework's agent class; it needs a blocking .run(prompt) method
async def run_parallel_pipeline(doc: str) -> DocResult:
    # Each agent is specialized: smaller system prompts, faster inference
    summarizer = Agent(role="summarizer", model="gpt-4o-mini")
    tagger = Agent(role="tagger", model="gpt-4o-mini")
    classifier = Agent(role="classifier", model="gpt-4o-mini")

    # All three run concurrently, so the wait is one round-trip latency, not three
    summary_task = asyncio.to_thread(summarizer.run, f"summarize: {doc}")
    tags_task = asyncio.to_thread(tagger.run, f"extract 5 tags: {doc}")
    class_task = asyncio.to_thread(classifier.run, f"classify type: {doc}")

    summary, tags, classification = await asyncio.gather(
        summary_task, tags_task, class_task
    )
    # Assumes tagger.run returns a list of strings; parse here if yours returns raw text
    return DocResult(summary=summary, tags=tags, classification=classification)
Three agents, one asyncio.gather() call. Each agent gets a smaller, focused system prompt, which also trims per-call latency because the model has fewer input tokens to process.
The bottleneck shifts from "model speed" to "how fast can you dispatch and merge."
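To see the effect without calling a real model, here is a small timing sketch. `StubAgent` is a hypothetical stand-in that sleeps instead of making a network call. The 0.2 second latency is made up, so treat the numbers as illustrative only:

```python
import asyncio
import time

class StubAgent:
    """Hypothetical stand-in for a real agent: blocks the way a network call would."""
    def __init__(self, role: str, latency: float = 0.2):
        self.role = role
        self.latency = latency

    def run(self, prompt: str) -> str:
        time.sleep(self.latency)  # simulate a blocking LLM round trip
        return f"{self.role} result"

async def run_concurrently(agents, doc: str):
    # Each blocking .run() goes to a worker thread so the calls overlap.
    tasks = [asyncio.to_thread(agent.run, doc) for agent in agents]
    return await asyncio.gather(*tasks)

def timed_demo():
    agents = [StubAgent("summarizer"), StubAgent("tagger"), StubAgent("classifier")]

    start = time.perf_counter()
    serial = [agent.run("doc") for agent in agents]
    serial_seconds = time.perf_counter() - start

    start = time.perf_counter()
    parallel = asyncio.run(run_concurrently(agents, "doc"))
    parallel_seconds = time.perf_counter() - start

    return serial, parallel, serial_seconds, parallel_seconds
```

The serial run costs roughly the sum of the three latencies, while the concurrent run costs roughly the slowest one. Note that asyncio.to_thread uses the event loop's default thread pool, which has a capped number of workers, so very wide fan-outs will queue behind that cap.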
When Not to Parallelize
Parallel isn't always better. Here's where I burned time going parallel too early:
When the second task needs the first task's output. If your tagging depends on the summary, you can't parallelize those two. The dependency graph matters.
When the task is too small to justify the overhead. Dispatching an agent has overhead — loading the system prompt, establishing context. For tasks under 100ms of actual LLM time, serial might be faster than the coordination cost.
When retrieval, not the LLM call, is the bottleneck. If your agent spends 90% of its time waiting for a RAG lookup, parallelizing LLM calls won't help. Optimize the retrieval first.
# When you must serialize (dependency exists)
# summarizer and refiner are agent instances created elsewhere
async def run_dependent_pipeline(doc: str) -> str:
    # Step 1 must complete before step 2 starts
    summary = await asyncio.to_thread(summarizer.run, doc)
    # Step 2 needs step 1's output
    refined = await asyncio.to_thread(refiner.run, f"expand: {summary}")
    return refined
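One way to make the dependency graph explicit is the standard library's graphlib.TopologicalSorter, which can tell you which tasks are ready to run at each step. This is a sketch, not what my production code does. The task names and the graph below are illustrative assumptions:

```python
from graphlib import TopologicalSorter

def execution_stages(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into stages; every task in a stage has all its dependencies met."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        stages.append(ready)
        sorter.done(*ready)  # mark the stage finished to unlock dependents
    return stages

# Illustrative graph: tagging and refining need the summary, classification does not.
deps = {
    "summarize": set(),
    "classify": set(),
    "tag": {"summarize"},
    "refine": {"summarize"},
}
stages = execution_stages(deps)
print(stages)
```

Each inner list is a group you could hand to a single asyncio.gather() call, and the stages themselves run one after another.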
The Routing Problem: When to Fan Out, When to Queue
The more agents you have, the more interesting the routing problem becomes. I use a simple dispatcher pattern:
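class AgentDispatcher:
    def __init__(self):
        # Agent classes are defined elsewhere; each exposes a blocking .run(payload)
        self.agents = {
            "summarize": SummarizerAgent(),
            "tag": TaggerAgent(),
            "classify": ClassifierAgent(),
            "extract": ExtractionAgent(),
        }
        self.queue = asyncio.Queue()  # not used by the methods shown here
        self.results = {}  # every result seen so far, keyed by task_id

    async def dispatch(self, task_type: str, payload: str, task_id: str):
        if task_type not in self.agents:
            raise ValueError(f"Unknown task type: {task_type}")
        agent = self.agents[task_type]
        # Run the blocking call in a worker thread so other tasks can proceed
        result = await asyncio.to_thread(agent.run, payload)
        self.results[task_id] = result
        return result

    async def fan_out(self, tasks: List[tuple]) -> dict:
        # tasks: [(task_type, payload, task_id), ...]
        coros = [self.dispatch(t, p, i) for t, p, i in tasks]
        results = await asyncio.gather(*coros)
        # Return only this batch's results, not everything accumulated in self.results
        return {task_id: result for (_, _, task_id), result in zip(tasks, results)}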
This dispatcher handles fan-out for independent tasks and lets me add new agent roles without changing the calling code. The bottleneck now is "how many agents can you run concurrently without hitting rate limits" — which is a different problem, and a good one to have.
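For the rate-limit side of that problem, a common approach is to cap the number of calls in flight with asyncio.Semaphore. Here is a minimal sketch. `gather_limited` is a hypothetical helper name, and the right limit depends on your provider's quotas, which I'm not assuming here:

```python
import asyncio

async def gather_limited(coros, limit: int):
    """Run coroutines concurrently with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with semaphore:  # wait here while `limit` calls are already running
            return await coro

    # gather returns results in the same order as the input coroutines
    return await asyncio.gather(*(guarded(c) for c in coros))

async def demo():
    in_flight = 0
    peak = 0

    async def fake_call(i: int) -> int:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stand-in for an LLM request
        in_flight -= 1
        return i * 2

    results = await gather_limited([fake_call(i) for i in range(10)], limit=3)
    return results, peak
```

With the dispatcher above, you could pass the list of dispatch coroutines to a helper like this instead of calling asyncio.gather() on them directly.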
What I Learned
1. Profile before you parallelize. Measure where time actually goes. Is it LLM inference or retrieval? Parallelizing the wrong thing is a waste of complexity.
2. Specialized agents are faster than generalists. In my experience, a summarizer agent with a tight 3-line system prompt on gpt-4o-mini beats a general agent on gpt-4o for summarization tasks on both latency and cost. Smaller model, focused role, faster and cheaper.
3. The architecture question is a product question. Before going multi-agent, ask: does each role have enough work to stay warm? If you're running 5 agents but each only processes one task every 10 minutes, you're paying for orchestration overhead without the parallelism benefit.
4. Tool use complicates the dependency graph. Once agents start using tools (web search, database queries, file I/O), the dependency graph gets more complex. Map it out explicitly before you code.
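For the first lesson, profiling doesn't need special tooling. Here is a minimal sketch of a per-stage timer. `StageTimer` is a hypothetical helper, and the sleeps stand in for a real retrieval step and a real model call:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StageTimer:
    """Accumulates wall-clock seconds per named stage."""
    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raised
            self.totals[name] += time.perf_counter() - start

    def shares(self) -> dict:
        """Fraction of total measured time spent in each stage."""
        total = sum(self.totals.values())
        if not total:
            return {}
        return {name: secs / total for name, secs in self.totals.items()}

timer = StageTimer()
with timer.stage("retrieval"):
    time.sleep(0.1)    # stand-in for a RAG lookup
with timer.stage("llm"):
    time.sleep(0.01)   # stand-in for the model call
shares = timer.shares()
print(shares)
```

If retrieval dominates the shares, parallelizing the LLM calls is the wrong fix.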
The biggest unlock for me was realizing that "building an AI agent" is two separate problems: what the agent does (prompting, tools, knowledge), and how it integrates with other agents and systems. Most tutorials focus entirely on the first. Production systems live and die by the second.
If you're hitting throughput limits and your first instinct is "try a bigger model," I'd suggest spending an afternoon drawing your system's dependency graph first. You might find the bigger model wasn't the bottleneck at all.