William Baker

Posted on May 8

Building a Multi-Agent Fleet with No Central Server

#agents #ai #architecture #distributedsystems

Most multi-agent architectures have the same shape: a coordinator talks to workers through a central hub. The hub is usually a message queue, a shared database, or an orchestration service like Ray or Temporal.

That hub is also the first thing that breaks. It's a single point of failure, a scaling bottleneck, and an operational cost you pay even when the agents aren't working.

Here's how to build a fleet where agents find each other and route tasks without any central intermediary.

The Central Hub Problem

When you're spinning up a 5-agent prototype, a central coordinator makes sense. It's simple, debuggable, and gets out of your way.

At 50 agents it starts to fray. At 500 it becomes your hardest reliability problem.

The hub becomes a global lock. Every message goes through it. Every failure cascades through it. Every scaling decision has to account for it.

The alternative — having agents discover and contact each other directly — sounds appealing but has historically been hard. How does Agent A know Agent B's address? How do you handle NAT traversal? How do you authenticate the connection?

These are solved problems in networking. We just haven't applied the solutions to agents until now.

Peer-to-Peer at the Session Layer

Pilot Protocol operates at OSI Layer 5 — the session layer, the same slot TLS occupies for the web. It gives each agent:

A permanent 48-bit address (0:A91F.0000.7C2E)
Automatic NAT traversal (STUN → hole-punch → relay fallback for symmetric NATs)
End-to-end encrypted tunnels (X25519 key exchange, AES-256-GCM, Ed25519 identity)
A global directory (the backbone) for agent discovery

With Pilot, the hub isn't a server you run. It's the network itself — and the network is maintained by the protocol, not by your ops team.

A Fleet Pattern That Actually Works

Here's a concrete pattern for a research fleet:

Coordinator agent
    ↓ Pilot (P2P, encrypted)
[Specialist A] [Specialist B] [Specialist C]
    ↓                ↓               ↓
  Papers           FX data       News feeds

Each specialist registers its capabilities on the Pilot backbone when it starts. The coordinator queries the backbone — "I need a peer that can resolve academic citations" — and gets back the address of Specialist A. Direct connection from there.

No service registry you maintain. No hardcoded addresses. No configuration file you update when a worker moves.

The Code

Getting an agent online:

curl -fsSL https://pilotprotocol.network/install.sh | sh
pilotctl daemon start --hostname coordinator

That's it. The agent is addressable, authenticated, and reachable from any other Pilot peer — regardless of NAT, firewall, or cloud region.

For the specialists:

# On each worker node
pilotctl daemon start --hostname specialist-papers
pilotctl daemon start --hostname specialist-fx
pilotctl daemon start --hostname specialist-news

Each one joins the backbone automatically. The coordinator can ping them:

pilotctl ping specialist-papers
# ✓ reply from 0:4B2E.0000.1A3D · 22ms

Self-Organization: How Groups Work

Beyond individual peer connections, Pilot has a concept of groups — clusters of agents that self-organize around a shared domain.

A trading fleet might form a TRADING group. A research fleet might join RESEARCH. Agents within a group can broadcast to all members or route to the most relevant peer within the domain.

This is closer to how human organizations actually work: a new employee joins the company and immediately has access to colleagues in their department, not just a single manager they have to route everything through.

The Pilot network status page shows these groups live: BACKBONE, TRAVEL, TRADING, RESEARCH, INSURANCE, and more, with real-time agent counts.

What You Give Up

Centralized orchestration isn't all downside. You give up some things going P2P:

Observability. A central hub is easy to instrument. A P2P mesh requires distributed tracing from day one. Plan for this.

Debuggability. When something goes wrong, "what was the message queue state at time T" is easier to answer than "what was the P2P graph state." Log aggressively at the agent level.

Simplicity. For a 3-agent prototype, a coordinator is simpler. P2P earns its complexity at scale.

When to Switch

The right time to move to a P2P architecture is usually later than you think but earlier than you want. Signals that you're ready:

You're spending meaningful eng time on coordinator reliability
Agents in different cloud regions are paying latency costs to route through a central server
You want agents from different operators to collaborate without giving either access to your infrastructure
Your fleet is growing fast enough that a central bottleneck is becoming a scaling conversation

If two or more of those are true, the session-layer approach is worth the investment.

DEV Community