TutorialQ

Posted on • Originally published at tutorialq.com
Multi-Agent Systems: Coordinating AI Agents for Complex Tasks

System Design Deep Dive — #5 of 20 | This is part of a 20-post series covering the most critical system design topics. Follow to get the next one.

ChatGPT can write code. But can it research a problem, write the implementation, review its own code for bugs, write tests, fix what fails, and document the result? Not reliably in one shot. This is why companies like Cognition (Devin), Factory AI, and Microsoft (AutoGen) are building multi-agent systems -- specialized AI agents that collaborate like a human engineering team.


TL;DR: Multi-agent systems divide complex tasks among specialized agents with distinct roles, tools, and evaluation criteria. The orchestration layer (how agents coordinate) determines system effectiveness more than individual agent quality. Start with two agents, add cross-validation, and only scale when you've proven the pattern works.

The Problem

Complex tasks overwhelm single agents. When one agent tries to be researcher, coder, reviewer, and tester simultaneously, it loses focus. The context window gets polluted, reasoning quality drops, and errors compound. Each role requires different expertise, different context, and different evaluation criteria.

Multi-agent systems mirror how human teams operate: specialize roles, coordinate handoffs, and cross-check each other's work.


How Multi-Agent Systems Work

Agent Specialization

Each agent is optimized for a specific role with tailored system prompts, tools, and evaluation criteria:

# "Agent" here is an illustrative wrapper class -- frameworks like
# CrewAI and AutoGen expose similar role/tools/model configuration.
agents = {
    "researcher": Agent(
        role="Research and gather information",
        tools=["web_search", "document_reader"],
        model="gpt-4o",
    ),
    "coder": Agent(
        role="Write and debug code",
        tools=["code_executor", "file_writer"],
        model="gpt-4o",
    ),
    "reviewer": Agent(
        role="Review code for bugs and improvements",
        tools=["code_reader", "linter"],
        model="gpt-4o",
    ),
}

Specialized agents consistently outperform generalist agents because they carry less irrelevant context and their prompts are focused on a single competency.

Orchestration Layer

Someone needs to be the project manager. The orchestrator -- often another agent -- coordinates work:

  • Task decomposition: break a complex task into subtasks with clear inputs/outputs
  • Dependency graph: determine what runs in parallel vs. sequentially
  • Routing: match each subtask to the right specialized agent
  • Conflict resolution: when agents disagree, the orchestrator decides (or escalates)

Here's what a real orchestrator loop looks like:

import asyncio

async def orchestrate(task: str, agents: dict, max_rounds: int = 10):
    """Core orchestration loop — decompose, delegate, validate, repeat."""
    plan = await agents["planner"].run(
        f"Break this into subtasks with dependencies: {task}"
    )
    # plan = [{"id": 1, "agent": "researcher", "input": "...", "depends_on": []},
    #         {"id": 2, "agent": "coder", "input": "...", "depends_on": [1]}, ...]

    results = {}  # task_id -> output
    for round_num in range(max_rounds):
        # Find tasks whose dependencies are all satisfied
        ready = [t for t in plan if t["id"] not in results
                 and all(d in results for d in t["depends_on"])]
        if not ready:
            break  # All tasks complete

        # Run independent tasks in parallel
        parallel_results = await asyncio.gather(*[
            agents[t["agent"]].run(
                t["input"],
                context={did: results[did] for did in t["depends_on"]}
            )
            for t in ready
        ])

        for task_spec, result in zip(ready, parallel_results):
            # Validate output before accepting (assumes the reviewer
            # returns a structured verdict with .approved / .reason)
            validation = await agents["reviewer"].run(
                f"Validate this output for task '{task_spec['input']}': {result}"
            )
            if validation.approved:
                results[task_spec["id"]] = result
            else:
                # Re-run with feedback: update the spec in place so the
                # next round retries it with the reviewer's critique.
                # (Appending a duplicate spec would run the task twice,
                # since the original would still count as "ready".)
                task_spec["input"] = (
                    f"{task_spec['input']}\nFeedback: {validation.reason}"
                )

    return results

The key insight: the orchestrator doesn't do the work -- it manages the dependency graph and validation loop. Notice how failed validations re-enqueue the task with feedback, creating a self-correcting cycle. This pattern (plan → execute → validate → retry) is the backbone of every production multi-agent system I've seen work reliably.

Communication Protocols

Agents need structured ways to exchange information. The protocol choice determines how tightly coupled agents are, how you debug failures, and whether the system can scale beyond 3-4 agents.

| Communication Pattern | Latency | Coupling | Best For |
|---|---|---|---|
| Message passing | Medium | Loose | Async workflows, event-driven |
| Shared memory | Low | Tight | Fast iteration, small teams |
| Blackboard | Medium | Medium | Knowledge accumulation |
| Function calling | Low | Tight | Direct delegation |

Message passing is the most common pattern in production. Each agent sends structured messages with a defined schema:

from dataclasses import dataclass
import time

@dataclass
class AgentMessage:
    sender: str          # "researcher"
    recipient: str       # "coder" or "orchestrator"
    msg_type: str        # "result", "error", "clarification_needed"
    content: dict        # The actual payload
    parent_task_id: str  # Links back to the orchestrator's plan
    timestamp: float

# Researcher sends findings to the orchestrator
msg = AgentMessage(
    sender="researcher",
    recipient="orchestrator",
    msg_type="result",
    content={
        "findings": "Redis supports sorted sets for leaderboards...",
        "confidence": 0.92,
        "sources": ["redis.io/docs/data-types/sorted-sets/"],
    },
    parent_task_id="task-001",
    timestamp=time.time(),
)

Shared memory (also called a "scratchpad") works better for tight iteration loops where agents need to see each other's work in real time -- think of it as a shared Google Doc. AutoGen and CrewAI both support this pattern. The tradeoff: it creates implicit coupling, and debugging becomes harder because any agent can modify the shared state at any time.
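A minimal scratchpad can be a few lines of Python. This is an illustrative sketch, not AutoGen's or CrewAI's actual API -- the class and method names here are assumptions:

```python
import time

class Scratchpad:
    """Shared-memory pattern: every agent reads and writes the same
    store, with a change log for debugging. Illustrative only."""

    def __init__(self):
        self.entries = []  # append-only list of (agent, note, timestamp)

    def write(self, agent: str, note: str):
        self.entries.append((agent, note, time.time()))

    def read_all(self) -> str:
        # Render the full pad as context any agent can be prompted with
        return "\n".join(f"[{agent}] {note}" for agent, note, _ in self.entries)

pad = Scratchpad()
pad.write("researcher", "Redis sorted sets fit the leaderboard use case")
pad.write("coder", "Implemented ZADD-based leaderboard writes")
```

The append-only log is what keeps this debuggable: you can always replay who wrote what, in order, even though every agent shares one store.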

Blackboard architecture is the hybrid -- a central knowledge store that agents read from and write to, but with structured rules about who can update which sections. This is how MetaGPT's SOP-driven approach works: the researcher writes to the "research" section, the coder reads from it and writes to the "code" section, and the reviewer reads both.
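The section-permission idea can be sketched in a few lines. The section names and write rules below are illustrative, not MetaGPT's actual implementation:

```python
class Blackboard:
    """Blackboard pattern: a central store partitioned into sections,
    with explicit rules about which agent may write where."""

    WRITE_RULES = {
        "research": {"researcher"},
        "code": {"coder"},
        "review": {"reviewer"},
    }

    def __init__(self):
        self.sections = {name: [] for name in self.WRITE_RULES}

    def write(self, agent: str, section: str, content: str):
        if agent not in self.WRITE_RULES.get(section, set()):
            raise PermissionError(f"{agent} may not write to '{section}'")
        self.sections[section].append(content)

    def read(self, section: str) -> list:
        return list(self.sections[section])  # reads are unrestricted

bb = Blackboard()
bb.write("researcher", "research", "Use sorted sets for ranking")
# bb.write("coder", "research", "...")  # would raise PermissionError
```

Restricting writes but not reads gets you the shared-context benefit of a scratchpad without the "any agent can clobber any state" debugging problem.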

Consensus and Validation

One of the most powerful patterns in multi-agent systems is cross-validation. Multiple agents check each other's work:

  • Debate: two agents argue opposing positions, and a judge agent decides
  • Voting: multiple agents independently solve the same problem, and the majority answer wins
  • Hierarchical review: a senior agent reviews and approves junior agent output

This significantly reduces errors compared to a single agent working alone.
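Of these, voting is the simplest to implement. A minimal sketch, where the answers would come from parallel calls to independent agents:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Voting-based consensus: several agents answer the same question
    independently, and the most common answer wins."""
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Three agents independently solve the same problem; one hallucinates.
votes = ["O(log n)", "O(log n)", "O(n)"]
print(majority_vote(votes))  # -> "O(log n)"
```

Voting costs N times the tokens of a single agent, so reserve it for high-stakes outputs where an error is more expensive than the extra calls.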

State Management

Tracking the overall state across multiple agents is the hardest operational challenge. You need to know which agent did what, when, and why -- and handle situations where agents produce conflicting results or one agent fails mid-task.

Here's a practical state manager that handles the core problems -- concurrency, conflict detection, and rollback:

import asyncio
import time
from typing import Any

class AgentStateManager:
    def __init__(self):
        self.state = {}          # Current shared state
        self.history = []        # Append-only log of all changes
        self.locks = {}          # Per-key asyncio locks for write safety

    async def update(self, agent_id: str, key: str, value: Any):
        """Write to shared state under a per-key lock, logging every change."""
        async with self.locks.setdefault(key, asyncio.Lock()):
            old_value = self.state.get(key)
            self.history.append({
                "agent": agent_id,
                "key": key,
                "old": old_value,
                "new": value,
                "timestamp": time.time(),
            })
            self.state[key] = value

    def rollback_agent(self, agent_id: str):
        """Undo all changes by a specific agent (reverse order).
        The rollback is logged too, so the history stays append-only."""
        agent_changes = [h for h in self.history if h["agent"] == agent_id]
        for change in reversed(agent_changes):
            self.state[change["key"]] = change["old"]
            self.history.append({
                "agent": "rollback",
                "key": change["key"],
                "old": change["new"],
                "new": change["old"],
                "timestamp": time.time(),
            })

    def get_agent_contributions(self, agent_id: str) -> list:
        """Audit trail: what did this agent change and when?"""
        return [h for h in self.history if h["agent"] == agent_id]

Three things make or break state management in multi-agent systems:

  1. Append-only history -- Never overwrite without logging. When something goes wrong (and it will), the history log is how you debug which agent produced the bad output and what state they saw when they made that decision.

  2. Per-agent rollback -- If the reviewer rejects the coder's output, you need to undo the coder's state changes without affecting the researcher's contributions. This is why the history tracks agent_id per change.

  3. Token budget tracking -- Multi-agent systems can burn through API credits fast. Track cumulative token usage per agent and set hard limits. A runaway researcher agent doing infinite web searches at $0.01 per call adds up when it runs 500 iterations.
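Per-agent token accounting is simple to bolt on. A minimal sketch -- in practice you would feed `charge()` from the `usage` field of each LLM API response rather than calling it by hand:

```python
class TokenBudget:
    """Per-agent token accounting with hard limits (illustrative)."""

    def __init__(self, limits: dict[str, int]):
        self.limits = limits                 # agent -> max tokens
        self.used = {a: 0 for a in limits}   # agent -> tokens spent so far

    def charge(self, agent: str, tokens: int):
        self.used[agent] += tokens
        if self.used[agent] > self.limits[agent]:
            raise RuntimeError(
                f"{agent} exceeded its {self.limits[agent]}-token budget"
            )

budget = TokenBudget({"researcher": 50_000, "coder": 100_000})
budget.charge("researcher", 12_000)    # fine
# budget.charge("researcher", 40_000)  # would raise: 52k > 50k
```

Raising on breach (instead of silently logging) is deliberate: a runaway agent should fail loudly, not quietly drain your account.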

When to Use Multi-Agent Systems

Multi-agent systems add real complexity. Use them when:

  • The task genuinely requires multiple distinct competencies
  • A single agent's context window can't hold all the needed information
  • Cross-validation would meaningfully improve output quality
  • Sub-tasks can be parallelized for speed

Don't use them for tasks a single agent handles well. Start with one agent, identify where it fails, and split only those responsibilities.

5 Hidden Gotchas That Will Bite You in Production


Multi-agent systems are the new frontier — and they multiply every single-agent failure mode by the number of agents. Andrew Ng has noted that agentic workflows are "the most important trend in AI" — but they're also the most operationally complex:

1. Agent Deadlock

Your Researcher agent calls the Coder agent: "Implement the solution from my research." The Coder agent calls back to the Researcher: "I need more details before I can implement." Both agents wait for each other indefinitely, consuming tokens on every "waiting" message. This is the distributed systems deadlock problem — but with LLM calls at $0.01+ each. A 2-hour deadlock loop between two GPT-4 agents costs ~$50 in wasted tokens.

Fix: Implement timeouts on every inter-agent call (30-60 seconds). The orchestrator monitors agent-to-agent call graphs and detects cycles. On timeout, the orchestrator forces resolution: either provides a default response or escalates to a human. Never allow bilateral agent-to-agent calls — route all communication through the orchestrator.
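The timeout wrapper is a few lines with `asyncio.wait_for`. A sketch, where `slow_agent` stands in for a real agent call stuck waiting on a peer:

```python
import asyncio

async def call_with_timeout(agent_coro, timeout_s: float = 45.0, fallback=None):
    """Wrap every inter-agent call in a timeout so two agents waiting
    on each other cannot deadlock forever."""
    try:
        return await asyncio.wait_for(agent_coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback  # orchestrator decides: default answer or human escalation

async def slow_agent():
    await asyncio.sleep(10)  # simulates an agent blocked on another agent
    return "done"

result = asyncio.run(call_with_timeout(slow_agent(), timeout_s=0.1,
                                       fallback="ESCALATE"))
print(result)  # -> "ESCALATE"
```

The fallback value is the key design choice: it turns an invisible hang into an explicit signal the orchestrator can route to a human.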

2. Conflicting Actions

The Researcher agent edits report.md to add findings. The Coder agent simultaneously edits report.md to add code examples. Neither knows about the other's changes. The last write wins — and one agent's work is silently lost. This is the classic concurrent write problem, but with the added complexity that agents can't detect or resolve merge conflicts.

Fix: Resource-level locks: only one agent can hold a write lock on a file at a time. The orchestrator manages the lock table. Or use a turn-based architecture: agents take turns in a defined sequence (Research → Code → Review), passing artifacts forward like a relay race. For shared resources, use append-only semantics — agents add to a shared scratchpad rather than editing each other's work.
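An orchestrator-managed lock table can be this small. A sketch -- a production version would add lock timeouts so a crashed agent cannot hold a file forever:

```python
class LockTable:
    """One writer per resource at a time, managed by the orchestrator."""

    def __init__(self):
        self.holders = {}  # resource -> agent currently holding the write lock

    def acquire(self, agent: str, resource: str) -> bool:
        if self.holders.get(resource) not in (None, agent):
            return False          # someone else holds it; agent must wait
        self.holders[resource] = agent
        return True

    def release(self, agent: str, resource: str):
        if self.holders.get(resource) == agent:
            del self.holders[resource]

locks = LockTable()
assert locks.acquire("researcher", "report.md")  # researcher gets the lock
assert not locks.acquire("coder", "report.md")   # coder is blocked
locks.release("researcher", "report.md")
assert locks.acquire("coder", "report.md")       # now the coder can write
```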

3. Cost Amplification

The orchestrator routes a task to 3 specialist agents. Each agent makes 4 tool calls (each tool call includes the full conversation context). Each tool call triggers a sub-agent for validation. One user request → 3 agents × 4 tool calls × 1 validation sub-agent = 12 delegated LLM calls, on top of the agents' own completions. At 5,000 tokens per call, that's 60,000 tokens for one user request. Now multiply by 10,000 daily users. The cost grows not linearly with agents, but multiplicatively with the depth of agent delegation.

Fix: Budget propagation: the orchestrator allocates a token budget per request. Each agent receives a fraction and must operate within it. Set depth limits (max 2 levels of agent delegation). Cache tool call results so repeated queries don't trigger new LLM calls. Use cheaper models for routine sub-tasks (GPT-4o-mini for validation, GPT-4o for reasoning).
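The depth limit is enforced by threading a counter through every delegation. A minimal sketch where the recursive call stands in for a real agent handing work to a sub-agent:

```python
import asyncio

async def delegate(task: str, depth: int = 0, max_depth: int = 2):
    """Depth-capped delegation: each hop passes depth + 1; past
    max_depth the chain is cut and the current agent must answer
    with what it already has."""
    if depth >= max_depth:
        return f"answered at depth {depth} (no further delegation)"
    # Pretend this agent decides to delegate one level down
    return await delegate(task, depth + 1, max_depth)

result = asyncio.run(delegate("review this PR"))
print(result)  # -> "answered at depth 2 (no further delegation)"
```

Combine this with the token budget: pass each sub-agent a fraction of the parent's remaining budget along with the incremented depth.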

4. Blame Attribution

The multi-agent system produces a bug report with an incorrect root cause analysis. The workflow: Researcher gathered logs → Analyzer identified the wrong component → Coder proposed a fix for the wrong component → Reviewer approved it. Who made the mistake? Without structured per-agent logging, debugging requires reading through 50+ LLM interactions across 4 agents.

Fix: Every agent logs its inputs, reasoning, tool calls, and outputs as structured events with a shared trace_id. Use OpenTelemetry-style spans: each agent call is a span with parent-child relationships. Build a trace viewer (Langfuse, Arize Phoenix, or LangSmith) that shows the full decision tree. When output is wrong, trace backward from the output to the first agent that deviated.
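The span structure itself is simple. A simplified sketch of the record each agent call should emit -- the field names mirror OpenTelemetry conventions, but this is not the OpenTelemetry SDK:

```python
import time
import uuid
from typing import Optional

def log_span(trace_id: str, agent: str, parent_span: Optional[str],
             inputs: str, output: str) -> dict:
    """One span per agent call: what it saw, what it produced, and
    which call spawned it. A viewer like Langfuse or LangSmith would
    render these as a tree."""
    return {
        "trace_id": trace_id,        # shared across the whole request
        "span_id": uuid.uuid4().hex[:8],
        "parent_span": parent_span,  # links child calls to their caller
        "agent": agent,
        "inputs": inputs,
        "output": output,
        "ts": time.time(),
    }

trace = uuid.uuid4().hex
root = log_span(trace, "orchestrator", None, "triage bug report", "plan: 3 subtasks")
child = log_span(trace, "researcher", root["span_id"], "gather logs", "found stack trace")
```

With `trace_id` and `parent_span` on every record, "trace backward from the wrong output" becomes a simple walk up the parent chain.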

5. State Desynchronization

Agent A reads a config file at 10:00:01. Agent B modifies the config file at 10:00:02. Agent A makes a decision based on the old config at 10:00:03. Agent A's decision is logically correct based on what it "saw" — but it's based on stale state. The result is contradictory actions: Agent A acts as if feature X is disabled, Agent B acts as if feature X is enabled.

Fix: Shared state store with version vectors: before acting, each agent reads the latest state version. If the version has changed since the agent last read, it must re-read and re-plan. Use a shared "blackboard" pattern: all agents read and write to a central state store with optimistic concurrency control. The orchestrator validates that agent actions are consistent with the current state before executing them.
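The version check amounts to compare-and-swap on a version number. A minimal sketch of the read-version/write-if-unchanged cycle:

```python
class VersionedStore:
    """Optimistic concurrency: reads return (value, version); a write
    must present the version it read. Stale writes are rejected so the
    agent re-reads and re-plans instead of acting on old state."""

    def __init__(self):
        self.data = {}  # key -> (value, version)

    def read(self, key: str):
        return self.data.get(key, (None, 0))

    def write(self, key: str, value, expected_version: int) -> bool:
        _, current = self.data.get(key, (None, 0))
        if current != expected_version:
            return False                      # stale: caller must re-read
        self.data[key] = (value, current + 1)
        return True

store = VersionedStore()
_, v = store.read("config")
assert store.write("config", {"feature_x": True}, v)       # Agent B updates
assert not store.write("config", {"feature_x": False}, v)  # Agent A is stale
```

The rejected write is the whole point: Agent A is forced to re-read the config and re-plan, instead of silently acting on the state it saw two seconds ago.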

Common Design-Time Mistakes

Those gotchas emerge when agents interact. These design mistakes happen before a single agent call is made — during the architecture phase — and they determine whether your multi-agent system scales or collapses under its own complexity.

Starting with too many agents

The team designs a 6-agent orchestration pipeline before validating that 2 agents outperform 1. Each additional agent adds latency (sequential LLM calls), cost (more tokens), and debugging complexity (more interactions to trace). Start with a single agent. Add a second only when you have evidence (not intuition) that task decomposition improves output quality. Validate with your eval suite before adding agents.

No shared context management

Agent A researches a topic and produces findings. Agent B starts coding — but can't see what Agent A found because there's no shared workspace. Agent B re-researches the same topic, wasting time and tokens. Design a shared scratchpad or context store that all agents can read from and write to. Every agent's output should be immediately visible to every other agent.

No circuit breaker for runaway costs

A multi-agent pipeline processes a user request. Due to a reasoning loop, Agent B calls Agent C 47 times. Total cost for one request: $8. Without a per-request cost ceiling, you discover this from your monthly invoice. Implement hard spending caps per request (max_tokens_per_request), per user (daily_budget_per_user), and per pipeline run. Kill the pipeline and return a graceful fallback when the ceiling is reached.
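The per-request ceiling is a counter plus an exception the pipeline catches at the top. A sketch with made-up limits and a hypothetical `run_pipeline` driver:

```python
class BudgetExceeded(Exception):
    pass

class RequestBudget:
    """Per-request cost ceiling: charge() is called after every LLM
    call with its token count; past the cap, the pipeline is killed."""

    def __init__(self, max_tokens: int = 50_000):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, tokens: int):
        self.spent += tokens
        if self.spent > self.max_tokens:
            raise BudgetExceeded(f"request spent {self.spent} > {self.max_tokens}")

def run_pipeline(call_costs: list[int], budget: RequestBudget) -> str:
    try:
        for tokens in call_costs:     # each entry: one LLM call's token count
            budget.charge(tokens)
        return "full answer"
    except BudgetExceeded:
        # Graceful fallback instead of a surprise on the monthly invoice
        return "partial answer (budget hit), please narrow the request"

print(run_pipeline([20_000, 20_000, 20_000], RequestBudget(50_000)))
```

The same pattern nests: give each user a daily `RequestBudget` and each pipeline run its own, and check both on every charge.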

Evaluating agents individually, not end-to-end

Each agent passes its individual eval: the Researcher finds relevant info 90% of the time, the Coder produces working code 85% of the time, the Reviewer catches 80% of issues. But end-to-end pipeline quality is 90% × 85% × 80% = 61.2%. Individual quality doesn't compound — it degrades multiplicatively. Build end-to-end evals that measure final output quality against human baselines.

Tightly coupled agent interfaces

Agent A's output format changes slightly (adds a new field). Agent B can't parse it. The entire pipeline breaks. Design agent interfaces as explicit contracts (JSON schemas or Protocol Buffers). Version them. Test backward compatibility in CI. Agents should be independently deployable — like microservices, not monolith modules.
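A contract check can start as a required-fields-plus-version test before you reach for jsonschema or Protocol Buffers. A sketch with a hypothetical coder-output schema:

```python
CODER_OUTPUT_SCHEMA_V1 = {"required": ["code", "language"], "version": 1}

def validate_message(payload: dict, schema: dict) -> list[str]:
    """Minimal contract check: list what's wrong instead of letting a
    downstream agent crash on an unexpected shape. Empty list = valid."""
    problems = [f for f in schema["required"] if f not in payload]
    if payload.get("schema_version") != schema["version"]:
        problems.append("schema_version mismatch")
    return problems

msg = {"schema_version": 1, "code": "print('hi')", "language": "python"}
assert validate_message(msg, CODER_OUTPUT_SCHEMA_V1) == []
```

Running checks like this in CI against recorded messages from the previous agent version is the cheapest backward-compatibility test you can buy.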

Multi-Agent Frameworks

| Framework | Architecture | Key Strength | Maturity |
|---|---|---|---|
| LangGraph | Graph-based workflows | Flexible state machines | High |
| AutoGen | Conversational | Multi-turn agent chat | High |
| CrewAI | Role-based teams | Simple mental model | Medium |
| MetaGPT | Software team simulation | SOP-driven coordination | Medium |
| Swarm (OpenAI) | Lightweight handoffs | Minimal orchestration | Experimental/educational |

Key Takeaways

  • Multi-agent systems specialize, coordinate, and cross-check -- just like human teams
  • The orchestration layer determines system effectiveness more than individual agent quality
  • Start with two agents with clear roles before scaling to larger systems
  • Cross-validation (debate, voting, review) can meaningfully reduce errors compared to single agents -- published research on multi-agent LLM debate has shown accuracy improvements on reasoning benchmarks
  • State management across agents is the hardest engineering problem -- invest early
  • Monitor per-agent costs; multi-agent systems can 3-5x your LLM spend if uncontrolled

🎯 Real-World Decision: What Would You Do?

You're building an automated code review system. A PR comes in with 500 lines of changes across 8 files. You need to check for bugs, security vulnerabilities, performance issues, test coverage, and style compliance.

Option A: One agent reviews everything with a comprehensive prompt
Option B: 5 specialized agents (bug hunter, security scanner, perf reviewer, test critic, style checker) running in parallel, orchestrator merges feedback
Option C: 2 agents — one reviews code, the other reviews the first agent's feedback for false positives. Then a human reviews the final output.

Option B sounds impressive but costs 5x more and often produces conflicting feedback. Option C is the sweet spot — cross-validation catches the worst false positives, total cost is 2x not 5x, and the human final review builds trust. Start with 2 agents, prove value, then split roles. What would you build?

Quick Reference Card

Bookmark this — multi-agent system decisions at a glance.

| Component | Start With | Scale To |
|---|---|---|
| Agent count | 2 agents with clear roles | 3-5 after proving 2 > 1 |
| Communication | Shared memory (simple) | Message passing (async) |
| Orchestrator | Hardcoded sequence | LLM-based routing |
| Validation | Agent B reviews Agent A | Debate or voting for critical tasks |
| State | Shared dict/database | Event-driven with conflict resolution |
| Cost tracking | Per-agent token counters | Budget allocation per agent |
| Framework | LangGraph or CrewAI | Custom when framework limits hit |

Warning sign: If your multi-agent system doesn't outperform a single well-prompted agent, you've added complexity without value. Always benchmark.

What's Next?

Multi-agent systems often need access to structured, production-grade data. Feature stores provide the infrastructure to serve consistent, versioned features to both training and inference pipelines — critical for ML-powered agent capabilities.


📚 System Design Deep Dive Series

This is post #5 of 20 in the System Design Deep Dive series.

Previously: AI Agent Architecture ← | Up next: Feature Store Architecture → | Full series index →

If you found this useful, follow and share it with your team. Building these deep dives takes serious effort — your support keeps the series going.
