Artemii Amelin

Posted on May 14

Agent Discovery in 2026: DNS-SD, ACP Registries, and Pilot Protocol's Overlay Directory

#ai #agents #pilotprotocol #go

Every time I spin up a new agent, I hit the same wall: how does it find anything else?

It sounds like a boring infrastructure problem until you realize it shapes everything. Security, latency, whether your agent network actually scales, whether it works at all when half your nodes are sitting behind carrier-grade NAT. The answer you pick in week one tends to calcify fast.

In 2026 there are really three approaches worth knowing about. DNS-SD is the old reliable for local setups. ACP-style centralized registries are what most multi-agent frameworks ship with. And then there's Pilot Protocol, which takes a different path entirely. None of them are universally correct. Here's how I think about the tradeoffs.

DNS-SD: Fine Until You Leave the Building

DNS-Based Service Discovery (RFC 6763) is the discovery layer behind Bonjour, usually paired with multicast DNS (mDNS, RFC 6762). Services announce themselves on the local network with structured DNS records, clients query for them, and everything just works with zero configuration. It's been doing this reliably since the early 2000s.

For local agent deployments, it's genuinely good. Dev clusters, edge device fleets, lab environments. If all your agents live on the same subnet and you want them to find each other without any setup, DNS-SD is the obvious choice and there's no reason to overthink it.

But the moment you try to go beyond local, it falls apart fast. mDNS is link-local by default, so cross-subnet discovery means switching to unicast DNS, and suddenly you have configuration overhead again. There's no real semantic filtering either: beyond service types and flat TXT key/value pairs, you can discover "there is an agent here" but not "find me an agent that does structured financial data." And NAT traversal is completely your problem to solve, which in practice means it doesn't work for reaching cloud-hosted agents, agents behind home routers, or anything running on a phone.
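To make "structured DNS records" a bit more concrete, here's a minimal Go sketch of the two pieces a DNS-SD announcement is built from: the service instance name and the TXT record's length-prefixed key=value strings. The `_agent._tcp` service type and the `cap`/`ver` keys are made up for illustration, and this only builds the data; it doesn't put anything on the network.

```go
package main

import (
	"fmt"
	"sort"
)

// instanceName builds a DNS-SD service instance name of the form
// <Instance>.<Service>.<Domain>. The "_agent._tcp" service type used
// in main is hypothetical, not a registered service type.
func instanceName(instance, service, domain string) string {
	return fmt.Sprintf("%s.%s.%s", instance, service, domain)
}

// encodeTXT packs key=value pairs into TXT record wire format:
// each pair becomes one string prefixed by a single length byte,
// so no pair may exceed 255 bytes.
func encodeTXT(attrs map[string]string) ([]byte, error) {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable order so the output is reproducible

	var out []byte
	for _, k := range keys {
		pair := k + "=" + attrs[k]
		if len(pair) > 255 {
			return nil, fmt.Errorf("TXT entry %q exceeds 255 bytes", k)
		}
		out = append(out, byte(len(pair)))
		out = append(out, pair...)
	}
	return out, nil
}

func main() {
	name := instanceName("weather-bot", "_agent._tcp", "local.")
	txt, err := encodeTXT(map[string]string{"cap": "weather", "ver": "1"})
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
	fmt.Printf("%q\n", txt)
}
```

Note that the capability tag here is just an opaque string in a TXT record. Nothing in the protocol understands it, which is the filtering gap described above.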

DNS-SD is a great fit for a narrow set of use cases. Just don't try to stretch it past them.

Centralized ACP Registries: Works Until It Doesn't

The dominant pattern in multi-agent frameworks right now is centralized registries. Agents register themselves with a directory, the directory serves queries, and sometimes the directory brokers communication too. Google's A2A protocol uses agent cards at well-known URLs. Most cloud-native multi-agent stacks have some version of this.

The benefits are real. You get schema enforcement on capability declarations, access control on discovery itself, usage telemetry, versioning. These are not trivial things. For a lot of production use cases inside a single platform, a centralized registry is the mature, well-supported option.
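Stripped to its core, the registry pattern looks something like the sketch below: one directory that validates records on the way in and answers capability queries on the way out. This is a generic in-memory illustration, not the ACP or A2A API. The `AgentRecord` fields and method names are my own, and a real registry would add auth, persistence, and versioning on top.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// AgentRecord is a hypothetical registry entry, not an ACP or A2A schema.
type AgentRecord struct {
	Name         string
	Endpoint     string
	Capabilities []string
}

// Registry is a single in-process directory guarded by a mutex.
type Registry struct {
	mu     sync.RWMutex
	agents map[string]AgentRecord
}

func NewRegistry() *Registry {
	return &Registry{agents: make(map[string]AgentRecord)}
}

// Register enforces a minimal schema before accepting a record.
func (r *Registry) Register(rec AgentRecord) error {
	if rec.Name == "" || rec.Endpoint == "" {
		return errors.New("name and endpoint are required")
	}
	if len(rec.Capabilities) == 0 {
		return errors.New("at least one capability is required")
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	r.agents[rec.Name] = rec
	return nil
}

// Find returns the sorted names of agents declaring the capability.
func (r *Registry) Find(capability string) []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var names []string
	for name, rec := range r.agents {
		for _, c := range rec.Capabilities {
			if c == capability {
				names = append(names, name)
				break
			}
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	reg := NewRegistry()
	err := reg.Register(AgentRecord{
		Name:         "forecaster",
		Endpoint:     "https://agents.example.com/forecaster",
		Capabilities: []string{"weather"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(reg.Find("weather"))
}
```

Everything routes through that one `Registry` value, which is both why governance is easy to add and why its availability matters so much.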

The problems are also real. The registry is a single point of failure, which is fine until it isn't. Each platform runs its own registry, so agents in different ecosystems can't see each other without custom federation work. Trust is delegated to whatever auth system the platform uses, which means the security of every agent identity is tied to the security of that platform's token system. And NAT traversal is, again, your problem.

If you're building entirely within one vendor's stack and you never need to talk outside it, this is probably fine. But if you want agents that work across providers, or that can reach peers running on home networks and edge devices, you're going to outgrow this pattern.

Pilot Protocol: Discovery as a Network Primitive

Pilot Protocol approaches this differently. Instead of a local broadcast or a centralized registry, it's an overlay network. Every agent gets an Ed25519 keypair that generates a deterministic virtual address. Discovery, routing, NAT traversal, and identity are all handled by the same stack.

The directory is a live, queryable service with over 190,000 registered agents and more than 19.7 billion requests served. You query it through the same pilotctl CLI you use for everything else:

pilotctl handshake list-agents
pilotctl send-message list-agents --data '/data {"search":"weather","limit":5}' --wait

You get back structured JSON with agent hostnames and capability descriptions. No API keys, no HTML to parse, no rate limit to negotiate.

What's structurally different from the other two approaches is where trust lives. In DNS-SD there's no trust layer at all. In centralized registries, trust is delegated to the platform. In Pilot Protocol, it's bilateral and cryptographic. Before two agents can exchange anything, both sides have to complete a handshake signed by their respective Ed25519 keys. If one side hasn't approved the other, no tunnel forms and nothing flows. Discovery tells you a peer exists. The handshake is the actual trust decision, and it's anchored to cryptographic identity rather than a platform token.
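Here's a rough Go sketch of what "bilateral and cryptographic" means in practice. Each agent keeps its own set of approved peer keys, both sides sign the same transcript, and the tunnel only forms if each side both verifies the other's signature and has approved that key. The `Agent` type, the transcript layout, and the function names are hypothetical, and a real handshake would also mix in fresh nonces to stop replay. This is not Pilot Protocol's wire format.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// Agent is a hypothetical peer with a keypair and its own approval list.
type Agent struct {
	pub      ed25519.PublicKey
	priv     ed25519.PrivateKey
	approved map[string]bool // peer public keys this agent has approved
}

func NewAgent() (*Agent, error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return nil, err
	}
	return &Agent{pub: pub, priv: priv, approved: map[string]bool{}}, nil
}

// Approve records that this agent is willing to talk to peer.
func (a *Agent) Approve(peer ed25519.PublicKey) {
	a.approved[string(peer)] = true
}

// transcript binds both identities into the bytes that get signed.
func transcript(x, y ed25519.PublicKey) []byte {
	t := make([]byte, 0, len(x)+len(y))
	t = append(t, x...)
	return append(t, y...)
}

// accepts is true only if the peer is approved AND its signature verifies.
func (a *Agent) accepts(peer ed25519.PublicKey, msg, sig []byte) bool {
	return a.approved[string(peer)] && ed25519.Verify(peer, msg, sig)
}

// Handshake succeeds only when both sides accept each other.
func Handshake(a, b *Agent) bool {
	t := transcript(a.pub, b.pub)
	sigA := ed25519.Sign(a.priv, t)
	sigB := ed25519.Sign(b.priv, t)
	return b.accepts(a.pub, t, sigA) && a.accepts(b.pub, t, sigB)
}

func main() {
	a, err := NewAgent()
	if err != nil {
		panic(err)
	}
	b, err := NewAgent()
	if err != nil {
		panic(err)
	}
	a.Approve(b.pub)
	fmt.Println("one-sided approval:", Handshake(a, b))
	b.Approve(a.pub)
	fmt.Println("mutual approval:   ", Handshake(a, b))
}
```

The one-sided case is the important one: knowing a peer's key from the directory isn't enough until that peer has approved you back.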

The NAT traversal story is also meaningfully different. Pilot Protocol treats it as a first-class problem rather than something operators solve per deployment. Agents connect through relay nodes when direct paths aren't available, with X25519 key exchange and AES-256-GCM encryption handling the transport. From the agent's perspective, every peer just has a virtual address that works. The underlying complexity is handled by the daemon.

That's how the network ends up spanning agents on cloud VMs, home networks, developer laptops, and edge devices at the same time.

The Data Exchange Network

One thing worth knowing about: Pilot Protocol ships with a curated data exchange network called Network 9 that has 50+ specialist agents covering live financial data, weather, transit, sports, academic papers, and a bunch more. You query them exactly the same way you'd query any other peer:

pilotctl handshake <specialist>
pilotctl send-message <specialist> --data '/data {"symbol":"BTC"}' --wait

Each specialist is a typed API served by a dedicated agent. For agents that need live external data as part of their reasoning, this is considerably faster to set up than wiring up individual third-party APIs with their own auth and rate limits.

How They Actually Compare

	DNS-SD	ACP Registry	Pilot Protocol
Scope	Local network	Platform-scoped	Global overlay
Identity	Hostname or none	Platform token	Ed25519 keypair
NAT traversal	No	No	Yes
Trust model	None	Platform-delegated	Bilateral cryptographic
Single point of failure	No	Yes	No
Cross-platform	Limited	No	Yes
Agents reachable	Local only	Registry-scoped	190,000+

Picking One

Use DNS-SD when agents are local, co-located, and you want zero setup. It's the right tool for dev environments, home automation, and edge clusters where everything is on the same network.

Use a centralized registry when you're operating entirely within one platform and you want the governance and access control that comes with it. If you never need to talk outside the vendor's ecosystem, the managed registry is mature and practical.

Use Pilot Protocol when your agents need to reach peers across different platforms, providers, or network topologies, or when you want identity that's cryptographically verifiable rather than tied to a platform's token system. NAT traversal that is handled by default rather than configured per environment, and the bilateral trust model, are the two things that are genuinely hard to replicate elsewhere.

Trying It

The pilotctl CLI sets up quickly, and Network 9 is queryable with no configuration beyond starting the daemon and joining the network:

pilotctl daemon start
pilotctl network join 9
pilotctl handshake list-agents
pilotctl send-message list-agents --data '/data {"search":"","limit":10}' --wait

Any specialist in the directory is two commands away after that.

Full documentation is at pilotprotocol.network.

Discovery is a design decision more than a technical one. The three approaches above cover the realistic options in 2026, and their constraints are different enough that the right answer usually falls out pretty quickly once you know where your agents actually need to run and who they need to trust.

Top comments (8)

Matt McKay • May 14

I've been using DNS-SD for a local agent cluster at work and the "fine until you leave the building" framing is exactly right, we hit that wall the moment we tried to add a cloud node to the mix. Ended up hacking together a VPN just to make discovery work which in hindsight was obviously the wrong call.
Question on the Pilot Protocol trust model though - if both sides have to approve the handshake, how does that work for public-facing agents that want to be discoverable by anyone?

Artemii Amelin • May 14

Yeah the VPN workaround is exactly the trap, works until you have more than like 3 nodes and then you're basically running network infrastructure full time.

On your question: Network 9 service agents have auto-approve on by default, so anything on that network accepts handshakes without you having to manually approve each one. For your own agents that you want publicly accessible you can configure the same policy. The bilateral requirement is really meant for peer-to-peer between agents where you actually care who's talking to you. It's opt-in friction basically, not a wall.

Matt McKay • May 14

That makes sense, opt-in friction is a good way to put it. Honestly that's the thing I keep running into with most agent frameworks, they either trust everything by default which is obviously a mess, or they lock everything down and you spend half your time managing permissions instead of building. Having a sane default with the option to open it up is the right call. Going to try spinning up a node this weekend and see how the Network 9 data stuff compares to what we're doing now. We're currently hitting OpenWeatherMap and a couple finance APIs directly from the agent which works but it's a lot of boilerplate to maintain. If I can replace that with two pilotctl commands I'm sold.

Vic Chen • May 15

Strong framing. The DNS-SD vs registry vs overlay comparison makes the tradeoffs much clearer, especially around where trust actually lives. I also liked the point that NAT traversal becomes an architecture decision, not just a networking footnote. As someone building AI products, that distinction matters a lot once agents leave the lab and have to span laptops, cloud nodes, and edge devices.

Artemii Amelin • May 16

The trust location point is the one that clicked for me pretty late. With a central registry you're trusting the registry to be correct and available. With an overlay and bilateral cryptographic handshakes the trust relationship is between the two agents directly, which means a compromised registry doesn't automatically compromise every connection on the network. That's a meaningful difference once you're running anything sensitive.

The NAT thing I think gets dismissed because in local dev it just works and nobody has to think about it. Then you ship something that spans a cloud VM and a laptop and suddenly half your debugging time is NAT related. At that point it stops feeling like a footnote pretty quickly. Treating it as an architecture decision upfront is annoying because it feels premature, but the alternative is retrofitting it later which is way worse.

The laptop to cloud to edge span is exactly the case that exposes all of this. Each of those environments has different NAT behavior, different address spaces, no shared infrastructure. If the networking layer doesn't handle that automatically you end up building it yourself for every new agent you add. That's the part that doesn't scale.

Vic Chen • May 17

Yeah, that trust boundary is the real architectural fork. Once the registry becomes both naming plane and trust anchor, a partial compromise turns into a network wide blast radius problem. Bilateral auth keeps the failure domain smaller, but then you pay for it in key management, rendezvous, and a more opinionated transport layer.

Same with NAT. It looks like plumbing until you span laptop, cloud, and edge. Then it starts shaping the product itself. Teams that treat connectivity as a later concern usually end up rebuilding it piecemeal in app code, which is the most expensive place to learn the lesson.

Rasmus Ros • May 15

Bundling discovery, identity, and NAT traversal into one overlay is a pretty sensible direction. The part I find most interesting is that it reduces the amount of glue operators have to build themselves, especially once agents need to work across cloud, home networks, and edge devices.

I think the practical questions are mostly around how the directory behaves over time. Things like freshness of records, how search quality holds up as the network grows, and how relay usage is managed under load. But as a model, treating discovery as a network primitive instead of a bolt-on registry makes a lot of sense.

Artemii Amelin • May 16

The glue reduction was exactly what sold me too. After wiring up discovery, auth, and NAT handling separately a few too many times you start paying a lot more attention to anything that just handles all of it.

Your questions about the directory are the right ones to be asking. Freshness is handled through heartbeats so agents that go offline actually drop out rather than leaving stale records, which matters more than it sounds. There's a short propagation window so you can occasionally query something that just went down, but in practice it's been pretty consistent. Search quality at scale is the one I'm watching most closely. Right now it's literal token match which works fine at a few hundred agents but I'm curious how it holds up when the catalogue gets significantly larger and you have a lot of agents with overlapping descriptions.

On relay, it's fallback only for symmetric NATs where hole punching fails. Most connections go direct so relay load doesn't grow linearly with the network the way it would if it were the default path. That's a deliberate architectural choice rather than just an optimization.

The primitive vs bolt-on framing you used is honestly the clearest way I've seen it put. That's the actual operational difference. Every new agent inherits the properties instead of creating a new deployment problem.