William Baker

Posted on May 10

P2P vs. Broker: The Architecture Decision Defining Multi-Agent Systems

#ai #agents #architecture #devops

Most multi-agent systems are built on a broker. There's a coordinator that receives tasks, dispatches them to worker agents, and collects results. It's a natural architecture. It mirrors how humans organize teams. It's easy to reason about.

It's also a bottleneck that gets worse as your fleet grows.

This post breaks down when broker architectures work, when they fail, and what a peer-to-peer alternative actually looks like in production.

The Broker Model: Strengths and Limits

A broker-based system has real advantages for small fleets:

Simple mental model: one coordinator, many workers. Easy to debug.
Clear ordering: the broker controls task sequencing, so dispatch order is deterministic and races between workers are much easier to avoid.
Auditability: everything flows through a central point. Logs are coherent.
Access control: the broker is the single enforcement point for permissions.

For a team running 10-50 coordinated agents on a bounded set of tasks, this is the right call. The overhead is manageable and the observability is worth it.
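To make the broker model concrete, here is a minimal in-process sketch. The `Broker` class, its method names, and the two toy workers are hypothetical and exist only for illustration; a real broker would sit behind a network boundary and handle failures, but the shape is the same: one queue, one dispatch loop, one log.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Worker = Callable[[str], str]


class Broker:
    """Toy coordinator: receives tasks, dispatches them, collects results."""

    def __init__(self) -> None:
        self.workers: Dict[str, Worker] = {}
        self.queue: Deque[Tuple[str, str]] = deque()
        self.log: List[str] = []  # single, coherent audit trail

    def register(self, name: str, worker: Worker) -> None:
        self.workers[name] = worker

    def submit(self, worker_name: str, payload: str) -> None:
        self.queue.append((worker_name, payload))

    def run(self) -> List[str]:
        results: List[str] = []
        while self.queue:
            # FIFO pop: the broker alone decides task ordering.
            name, payload = self.queue.popleft()
            result = self.workers[name](payload)
            self.log.append(f"{name}:{payload}->{result}")
            results.append(result)
        return results


broker = Broker()
broker.register("upper", str.upper)
broker.register("reverse", lambda s: s[::-1])
broker.submit("upper", "abc")
broker.submit("reverse", "abc")
results = broker.run()
print(results)
```

Every task and every result passes through `run()`, which is exactly why the model is easy to audit and also why it becomes the choke point discussed next.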

The problems emerge at scale.

Broker Failure Modes at Scale

Single point of failure: When the broker goes down, the fleet stops. High availability for the broker requires redundancy that adds operational complexity and latency.

Throughput ceiling: Every message goes through one process. Even a well-engineered broker becomes a bottleneck when ephemeral agents spin up and down at high frequency.

Discovery through the broker: In a brokered system, agents don't know about each other unless the broker tells them. Adding a new capability to the system means registering it with the broker, which in many deployments is a manual step with a human in the loop.

Latency tax: A query that could go agent-to-agent in one hop goes agent-to-broker-to-agent in two, with serialization/deserialization at each step.
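The latency tax can be put into rough numbers with a back-of-the-envelope model. The function name and the 5 ms / 1 ms figures below are illustrative assumptions, not measurements, and the model ignores any queueing or processing time inside the broker itself.

```python
def path_latency_ms(hops: int, network_ms: float, codec_ms: float) -> float:
    """Each hop pays one network transit plus one serialize/deserialize pair."""
    return hops * (network_ms + codec_ms)


# Assumed, illustrative costs per hop.
brokered = path_latency_ms(hops=2, network_ms=5.0, codec_ms=1.0)  # agent -> broker -> agent
direct = path_latency_ms(hops=1, network_ms=5.0, codec_ms=1.0)    # agent -> agent

print(f"brokered: {brokered} ms, direct: {direct} ms")
```

Under these assumptions the brokered path costs twice the direct path, and any time a message waits in the broker's queue is added on top.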

Gartner reported a 1,445% surge in multi-agent system inquiries from Q1 2024 to Q2 2025. Many of the teams now scaling from pilot to production are hitting these limits.

The P2P Alternative

In a peer-to-peer architecture, agents connect directly to each other. Discovery happens at the network layer, not through a central registry. Results can propagate across the mesh without routing through a single coordinator.

The tradeoffs shift:

Property	Broker	P2P
Simplicity at small scale	High	Medium
Throughput at large scale	Limited by broker	Scales with peer count
Failure surface	Single point	Distributed
Discovery	Centralized	Network-layer
Observability	Easy	Requires tooling
Latency	2 hops	1 hop

The missing piece for P2P in practice has always been addressing and discovery. How does an agent find a peer that has the capability it needs? How do they establish a trusted connection without a central authority?

What the Network Layer Provides

Pilot Protocol addresses this by inserting a session layer (L5) between UDP/TCP and your application framework. Each agent gets a stable 48-bit address:

0:A91F.0000.7C2E
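As a rough illustration of the address shape shown above, the sketch below parses it into a prefix and a 48-bit integer. The interpretation is an assumption based only on the example string: a decimal prefix before the colon, followed by three 16-bit hexadecimal groups. The function names are hypothetical and are not part of any Pilot Protocol API.

```python
from typing import Tuple


def parse_address(text: str) -> Tuple[int, int]:
    """Split 'P:HHHH.HHHH.HHHH' into (prefix, 48-bit node value).

    Assumed layout, inferred from the example address only.
    """
    prefix, _, node_part = text.partition(":")
    groups = node_part.split(".")
    if len(groups) != 3 or any(len(g) != 4 for g in groups):
        raise ValueError(f"expected three 4-digit hex groups, got {text!r}")
    node = int("".join(groups), 16)  # 3 groups x 16 bits = 48 bits
    return int(prefix), node


def format_address(prefix: int, node: int) -> str:
    """Inverse of parse_address under the same assumed layout."""
    hex48 = f"{node:012X}"
    return f"{prefix}:{hex48[0:4]}.{hex48[4:8]}.{hex48[8:12]}"


prefix, node = parse_address("0:A91F.0000.7C2E")
print(prefix, hex(node), node.bit_length())
```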

Agents organize into domain-specific groups: travel, finance, security, research. A query for SEC filings routes to agents in the finance group. A query about certificate transparency routes to the security group. The routing is network-level, not application-level. No broker required.
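The group lookup can be modeled in a few lines. This is only an in-memory picture of the semantics (a query tagged with a domain resolves to the members of that group); in the design described above the lookup happens at the network layer, and the `groups` table, the second address, and the `candidates` function here are hypothetical.

```python
from typing import Dict, List

# Hypothetical group membership table: domain -> agent addresses.
groups: Dict[str, List[str]] = {
    "finance": ["0:A91F.0000.7C2E"],
    "security": ["0:0000.0000.0001"],  # made-up address for illustration
}


def candidates(domain: str) -> List[str]:
    """Return the peers that could serve a query tagged with this domain."""
    return list(groups.get(domain, []))


print(candidates("finance"))   # peers for an SEC-filings query
print(candidates("weather"))   # unknown domain: no peers
```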

Encryption is per-tunnel: X25519 for key exchange, AES-256-GCM for the payload, and Ed25519 for identity. NAT traversal is attempted for direct connections, with relay fallback for the cases where direct P2P isn't possible. The agent developer doesn't implement any of this. It happens at the network layer.

Hybrid Architectures

The real-world answer for most production systems isn't "choose one." It's:

Use a broker for: task orchestration within a bounded fleet, audit trails, access control enforcement, sequential workflows with dependencies.

Use P2P for: high-throughput data retrieval, cross-fleet queries, capability discovery, anything where latency matters and the broker is not adding value.

A practical pattern: your orchestrator agent uses a broker to coordinate its internal fleet, but connects to the P2P network at the boundary to retrieve external data. Internal coordination stays brokered and auditable. External queries go direct.
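One way to express this split is a small routing rule in the orchestrator. The sketch below is an assumption about how such a rule might look; the `Task` fields and `choose_route` are hypothetical names, and the criteria simply restate the two lists above.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    BROKER = "broker"
    P2P = "p2p"


@dataclass(frozen=True)
class Task:
    name: str
    internal: bool        # target lives inside the bounded fleet
    needs_ordering: bool  # sequential workflow with dependencies
    needs_audit: bool     # must appear in the central audit trail


def choose_route(task: Task) -> Route:
    """Keep internal, ordered, or audited work on the broker; send the rest direct."""
    if task.internal or task.needs_ordering or task.needs_audit:
        return Route.BROKER
    return Route.P2P


plan_step = Task("approve-refund", internal=True, needs_ordering=True, needs_audit=True)
lookup = Task("fetch-filings", internal=False, needs_ordering=False, needs_audit=False)

print(choose_route(plan_step).value, choose_route(lookup).value)
```

The rule is deliberately conservative: anything that needs ordering or an audit trail stays brokered, so only independent external queries take the direct path.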

This is roughly how Pilot Protocol's "Orgs" feature works: pre-wired multi-agent fleets where agents discover and trust each other on first boot, without requiring a live broker for every interaction.

The Ephemeral Agent Problem

The clearest case for P2P is ephemeral agents. A broker-based system where agents register on startup and deregister on shutdown works fine when agents live for hours. When agents live for seconds or milliseconds, the registration overhead dominates.

A session-layer network where agents get addresses on install and are immediately routable handles ephemerality without registration logic. The agent is online when the daemon is running. It's offline when it's not. No state management in the broker.
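A quick calculation shows why registration overhead dominates for short-lived agents. The 50 ms register and deregister costs are assumed round numbers for illustration, not measurements of any particular broker, and the function name is hypothetical.

```python
def registration_overhead(
    lifetime_s: float, register_s: float = 0.05, deregister_s: float = 0.05
) -> float:
    """Ratio of time spent on broker registration to the agent's lifetime."""
    if lifetime_s <= 0:
        raise ValueError("lifetime_s must be positive")
    return (register_s + deregister_s) / lifetime_s


# An hour-long agent, a one-second agent, and a 200 ms agent.
for lifetime in (3600.0, 1.0, 0.2):
    ratio = registration_overhead(lifetime)
    print(f"{lifetime:>8.1f}s lifetime -> {ratio:.4%} spent registering")
```

With the same fixed cost, the ratio grows as the lifetime shrinks, which is the effect the paragraph above describes.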

The Pilot network currently runs ~176,000 agents with 57% growth in the past 7 days. At that scale, a central broker would be a significant engineering problem. The address-based P2P model is what makes the numbers work.

Implementation Considerations

If you're evaluating this architecture shift:

Observability requires investment: distributed tracing across a P2P network is harder than reading broker logs. Plan for this upfront.

Trust model changes: instead of trusting the broker to enforce access control, you're trusting the network addressing and encryption. Make sure you understand your threat model before deploying a P2P fleet.

Gradual migration is possible: you don't have to rip out your broker. Connect your existing orchestrator to the P2P network for data retrieval first. Measure latency. Expand from there.

Not everything should be P2P: sequential, stateful workflows with strong ordering requirements still benefit from a coordinator. P2P shines for parallel, independent, high-throughput operations.
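For the "measure latency" step in the migration advice above, a small timing harness is usually enough to start. In this sketch the two callables are stand-ins that just sleep; in a real evaluation you would replace them with your own brokered and direct retrieval calls. All names here are hypothetical.

```python
import statistics
import time
from typing import Callable, List


def median_latency_ms(call: Callable[[], object], runs: int = 20) -> float:
    """Time `call` several times and return the median latency in milliseconds."""
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-ins for real retrieval paths (assumed, for illustration only).
def via_broker() -> None:
    time.sleep(0.002)


def via_p2p() -> None:
    time.sleep(0.001)


report = {
    "broker_ms": median_latency_ms(via_broker, runs=5),
    "p2p_ms": median_latency_ms(via_p2p, runs=5),
}
print(report)
```

The median is used instead of the mean so that a single slow outlier does not skew the comparison.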

Where This Is Heading

The agent protocol space is converging on a layered model. MCP handles tool access at L7. A2A handles agent coordination at L7. Neither solves transport. The session layer is the open gap.

An agent fleet that runs on a native L5 network can be faster, more resilient, and more self-organizing than a broker architecture allows. The cost is operational complexity that requires better tooling to manage.

The tooling is catching up.

Further reading: Pilot Protocol docs | Browse pre-wired agent orgs
