sly-the-fox
Your Multi-Agent System Has a Routing Problem

Five agents. Twenty possible connections. Ten agents? Ninety. The math is simple and the consequences are brutal.

Most multi-agent systems start with a reasonable architecture. Two or three agents with clear responsibilities. The orchestrator routes work. Everything makes sense. Then you add a fourth agent. A fifth. A specialized summarizer. A governance layer. Suddenly every agent can reach every other agent, and nobody's drawn a map of which paths should actually exist.

This is the N-squared coordination problem. And it's the architectural debt that kills multi-agent systems before they ever reach production.

The group chat anti-pattern

The default in most agent frameworks is full connectivity. Any agent can call any tool, read any state, trigger any other agent. It feels flexible. It's actually fragile.

When Agent A can invoke Agent B, C, D, and E directly, you've created implicit dependencies that aren't visible in your architecture diagram (assuming you have one). When something breaks, the failure could have originated from any of those paths. Debugging becomes combinatorial.

The parallel in traditional software engineering is obvious. We stopped building monoliths where every module calls every other module. We drew service boundaries. We defined interfaces. We made coupling explicit and limited.

Multi-agent systems need the same treatment, but most builders skip it because the framework doesn't enforce it.

Trust boundaries as architecture

A trust boundary is a line you draw between agents that limits what they can access and who they can reach. It's not about security in the traditional sense (though it helps). It's about making the system legible.

Here's what this looks like in practice:

# Without trust boundaries — any agent reaches anything
class AgentOrchestrator:
    def route(self, task):
        # Pick the "best" agent and let it loose
        agent = self.select_agent(task)
        return agent.execute(task, context=self.full_context)

# With trust boundaries — explicit routing and scoped access
class BoundaryViolation(Exception):
    """Raised when an agent tries a path it isn't allowed to use."""

class BoundedOrchestrator:
    def __init__(self):
        self.boundaries = {
            "summarizer": {
                "can_read": ["documents", "notes"],
                "can_reach": ["editor"],
                "cannot_reach": ["database_writer", "auth_manager"]
            },
            "database_writer": {
                "can_read": ["validated_records"],
                "can_reach": ["auditor"],
                "cannot_reach": ["summarizer", "external_api"]
            }
        }

    def route(self, source_agent, target_agent, task):
        # Default deny: an agent with no declared boundary reaches nothing
        rules = self.boundaries.get(source_agent)
        if rules is None or target_agent in rules.get("cannot_reach", []):
            raise BoundaryViolation(
                f"{source_agent} cannot reach {target_agent}"
            )
        # Scoped context — only what this agent is allowed to see
        context = self.scoped_context(source_agent, rules.get("can_read", []))
        return self.agents[target_agent].execute(task, context=context)

The difference isn't complexity. It's clarity. The bounded version makes every routing decision explicit. When something breaks, you know exactly which paths were available and which one failed.

Three patterns that work

After building systems with 30+ agents, three routing patterns consistently hold up:

1. Hub-and-spoke

A central router handles all inter-agent communication. Agents never talk to each other directly. This is the simplest model and works well up to about 15 agents. The router becomes a bottleneck at scale, but the traceability is excellent.

2. Hierarchical routing

Agents are organized into groups (governance, technical, knowledge) with a group coordinator. Agents within a group can communicate freely, but cross-group communication goes through the coordinators. This scales better and naturally creates bounded contexts.

3. Pipeline with side channels

Work flows through a defined sequence (plan, execute, review, document), but specific agents can reach specific others outside the pipeline for scoped queries. The pipeline is the primary path; side channels are explicit exceptions with documented justification.

The worst pattern is the implicit mesh, where any agent can invoke any other agent through shared state or direct calls. It works until it doesn't, and when it breaks, the failure surface is every connection in the system.
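All three patterns build on the same primitive: a routing layer that owns every inter-agent hop. Here's a minimal hub-and-spoke sketch in Python. Every name in it (`Hub`, `Agent`, `register`, `send`) is illustrative, not from any particular framework:

```python
class Agent:
    def __init__(self, name):
        self.name = name

    def handle(self, task):
        # Placeholder behavior: a real agent would call a model or tool here.
        return f"{self.name} handled: {task}"


class Hub:
    """Central router: agents never hold references to each other."""

    def __init__(self):
        self.agents = {}
        self.routes = set()  # allowed (source, target) pairs
        self.trace = []      # every hop is recorded, which is the point

    def register(self, agent, allowed_targets):
        self.agents[agent.name] = agent
        self.routes.update((agent.name, t) for t in allowed_targets)

    def send(self, source, target, task):
        # Default deny: only declared routes are usable
        if (source, target) not in self.routes:
            raise PermissionError(f"{source} -> {target} is not a declared route")
        self.trace.append((source, target, task))
        return self.agents[target].handle(task)
```

When something breaks, `hub.trace` already contains the full sequence of hops, which is the traceability benefit the pattern buys you.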

Scoped context matters as much as scoped access

Trust boundaries aren't just about who can call whom. They're about what each agent can see. A summarizer working on meeting notes doesn't need access to your financial records. A code reviewer doesn't need to see customer PII.

When you scope the context each agent receives, two things improve immediately:

  1. Agents perform better. Less noise in the context means more focused output. An agent that receives only the documents it needs produces better summaries than one drowning in the full system state.
  2. Failures are contained. If an agent hallucinates or makes a bad decision, the blast radius is limited to what it could access. A summarizer with access to everything can corrupt everything. A summarizer scoped to documents can only affect documents.
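Context scoping can be as simple as filtering shared state through a per-agent allowlist before any call is made. A sketch, with made-up state keys and agent names:

```python
# Illustrative shared state — keys are hypothetical
FULL_STATE = {
    "documents": ["meeting_notes.md"],
    "financial_records": ["q3_ledger.csv"],
    "customer_pii": ["emails.csv"],
}

# Per-agent read allowlists
READ_SCOPES = {
    "summarizer": ["documents"],
    "code_reviewer": [],  # sees no shared state at all
}

def scoped_context(agent_name, state=FULL_STATE):
    """Return only the slices of state this agent is allowed to read."""
    allowed = READ_SCOPES.get(agent_name, [])  # unknown agents get nothing
    return {key: state[key] for key in allowed if key in state}
```

The summarizer here receives only `documents`; the financial records and PII simply never enter its context window, so they can neither distract it nor leak through it.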

The routing question

The next time you add an agent to your system, ask three questions before writing any code:

  1. Who can this agent reach? List the specific agents it's allowed to invoke or send data to.
  2. What can this agent see? Define the scoped context it receives, not the full system state.
  3. Who can reach this agent? Inbound access matters as much as outbound. An agent that any other agent can trigger is an implicit dependency for the entire system.

If you can't answer these questions, the agent doesn't have an architecture. It has a hope.
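Answering the three questions can literally be a data structure checked before deployment. A hypothetical sketch (agent names and field names are illustrative): each agent declares outbound reach, readable state, and expected inbound callers, and a validator confirms the declared inbound edges match what everyone else's `can_reach` lists actually produce.

```python
BOUNDARIES = {
    "orchestrator": {"can_reach": ["summarizer", "editor"], "can_read": ["tasks"],
                     "reachable_by": []},
    "summarizer":   {"can_reach": ["editor"], "can_read": ["documents", "notes"],
                     "reachable_by": ["orchestrator"]},
    "editor":       {"can_reach": [], "can_read": ["drafts"],
                     "reachable_by": ["orchestrator", "summarizer"]},
}

def inbound_edges(boundaries):
    """Derive who can actually reach each agent from everyone's can_reach lists."""
    inbound = {name: set() for name in boundaries}
    for source, rules in boundaries.items():
        for target in rules["can_reach"]:
            inbound.setdefault(target, set()).add(source)
    return inbound

def check_declarations(boundaries):
    """Return agents whose declared reachable_by doesn't match reality."""
    actual = inbound_edges(boundaries)
    return [name for name, rules in boundaries.items()
            if set(rules["reachable_by"]) != actual[name]]
```

If `check_declarations` returns a non-empty list, someone added an edge without updating the map: exactly the implicit dependency the third question is designed to catch.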

The pattern is consistent across every well-designed system, from microservices to operating systems to multi-agent AI. Constraints don't limit capability. They make capability legible. And legibility is what lets you debug, scale, and trust the system you're building.


Building trust infrastructure for AI agents. Follow for weekly patterns on agent architecture, governance, and the systems underneath.

Try Sigil: github.com/chaddhq/sigil | Subscribe: The Alignment Layer
