Transport SPI: Making Agent Network Infrastructure Pluggable

#java #ai #agents #architecture

When agent ensembles become long-running services that communicate over a network, the communication layer becomes infrastructure. And infrastructure has a property that application code should not: it varies by deployment environment.

Development uses in-process queues. Staging might use Redis. Production runs Kafka. The application code -- the agents, tasks, workflows -- should not change between these environments. The question is where to draw the abstraction line.

The transport problem

An ensemble network needs several communication primitives:

Request queues -- how work requests arrive at an ensemble
Delivery registries -- how responses get routed back to the requester
Capability registries -- how ensembles advertise and discover shared tasks and tools
Capacity tracking -- how ensembles report their current load

Each of these has a natural in-process implementation (maps, queues, lists) and at least one distributed implementation (Kafka topics, Redis streams, service registries). If these are hardcoded to a specific backing store, every deployment environment change requires code changes.

The SPI design

AgentEnsemble defines transport as a set of Java interfaces -- a Service Provider Interface -- with pluggable implementations:

// The transport factory -- entry point for all transport primitives
Transport transport = Transport.websocket("kitchen");

// Or, when delivery tracking is needed, with an explicit delivery registry
Transport transport = Transport.simple("kitchen", deliveryRegistry);

The Transport interface provides access to the individual primitives:

Primitive	Interface	Purpose
Request queue	`RequestQueue`	Inbound work request buffering
Delivery registry	`DeliveryRegistry`	Response routing back to callers
Capability registry	`CapabilityRegistry`	Shared task/tool advertisement

Each interface has a simple contract. RequestQueue, for example:

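import java.time.Duration;
import java.util.Optional;

public interface RequestQueue {
    void enqueue(WorkRequest request);              // buffer an inbound work request
    Optional<WorkRequest> poll(Duration timeout);   // wait up to the timeout for the next request
    int size();                                     // number of requests currently buffered
}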

The in-process implementation uses a LinkedBlockingQueue. The Kafka implementation produces to a topic and consumes with manual offset commits. Same interface, different backing.
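A minimal sketch of the in-process variant might look like the following. It assumes the RequestQueue contract shown above; WorkRequest is a hypothetical stand-in for the library's type, and the class name InProcessRequestQueue is an illustration, not necessarily what AgentEnsemble ships.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the library's WorkRequest type.
final class WorkRequest {
    final String id;
    final String payload;
    WorkRequest(String id, String payload) { this.id = id; this.payload = payload; }
}

// Same contract as the RequestQueue interface in the article.
interface RequestQueue {
    void enqueue(WorkRequest request);
    Optional<WorkRequest> poll(Duration timeout);
    int size();
}

// In-process implementation backed by an unbounded LinkedBlockingQueue.
final class InProcessRequestQueue implements RequestQueue {
    private final BlockingQueue<WorkRequest> queue = new LinkedBlockingQueue<>();

    @Override
    public void enqueue(WorkRequest request) {
        // offer() does not block on an unbounded queue
        queue.offer(request);
    }

    @Override
    public Optional<WorkRequest> poll(Duration timeout) {
        try {
            // Waits up to the timeout; the underlying poll returns null if nothing arrived
            return Optional.ofNullable(queue.poll(timeout.toNanos(), TimeUnit.NANOSECONDS));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return Optional.empty();
        }
    }

    @Override
    public int size() {
        return queue.size();
    }
}

final class InProcessRequestQueueDemo {
    public static void main(String[] args) {
        RequestQueue queue = new InProcessRequestQueue();
        queue.enqueue(new WorkRequest("r1", "prepare-meal"));
        queue.poll(Duration.ofMillis(100))
             .ifPresent(r -> System.out.println(r.id + " -> " + r.payload));
    }
}
```

A Kafka-backed class would implement the same three methods, which is why callers never need to know which one they were handed.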

Simple mode for development

The default transport uses in-process data structures:

Transport transport = Transport.websocket("kitchen");

This gives you a working ensemble network with WebSocket connections between ensembles, backed by in-process queues and maps. It is fast, requires no external infrastructure, and is appropriate for development and testing.

For local development that needs delivery tracking (ensuring responses reach their intended recipients), use the simple transport with a delivery registry:

DeliveryRegistry registry = new InMemoryDeliveryRegistry();
Transport transport = Transport.simple("kitchen", registry);
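To make "delivery tracking" concrete, here is one way an in-memory registry could correlate responses with requests. The article does not show the DeliveryRegistry methods, so expect(), deliver(), and pendingCount() are hypothetical names chosen for illustration, as is the class name; treat this as a sketch of the idea rather than the library's API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: method names are assumptions, not AgentEnsemble's API.
final class SketchDeliveryRegistry {
    private final ConcurrentMap<String, CompletableFuture<String>> pending =
            new ConcurrentHashMap<>();

    // Called by the requester before sending: reserves a slot for the response.
    CompletableFuture<String> expect(String requestId) {
        return pending.computeIfAbsent(requestId, id -> new CompletableFuture<>());
    }

    // Called when a response arrives: completes the matching future, if any.
    // Returns false when no requester is waiting on this request ID.
    boolean deliver(String requestId, String response) {
        CompletableFuture<String> waiter = pending.remove(requestId);
        return waiter != null && waiter.complete(response);
    }

    // Number of requests still waiting for a response.
    int pendingCount() {
        return pending.size();
    }
}

final class DeliveryRegistryDemo {
    public static void main(String[] args) {
        SketchDeliveryRegistry registry = new SketchDeliveryRegistry();
        CompletableFuture<String> reply = registry.expect("req-1");
        registry.deliver("req-1", "meal ready");
        System.out.println(reply.join()); // prints "meal ready"
    }
}
```

The in-memory version is a map lookup. A distributed version has to route the response to whichever node holds the waiting future, which is where the portability concerns discussed under Tradeoffs come from.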

Why this matters for agent systems

The transport SPI is not unusual as an architectural pattern -- it is a standard dependency inversion. What makes it interesting in the agent context is what it enables.

Agent networks are inherently non-deterministic. Agents take variable time, produce variable output, and may fail in unpredictable ways. Adding infrastructure variability on top of that makes the system harder to reason about.

By isolating transport from application logic, you can:

Test with in-process transport -- no containers, no network, deterministic ordering
Develop locally with WebSocket transport -- real network behavior, zero infrastructure setup
Deploy to production with Kafka -- durability, horizontal scaling, replay capability
Switch between environments -- without touching agent code, task definitions, or workflow configuration

The agent code does not know whether its work requests arrive from an in-process queue or a Kafka topic. It processes them the same way.
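The sketch below illustrates that separation under simplifying assumptions: the queue carries plain strings instead of WorkRequest, PlaceholderBrokerQueue is a hypothetical stand-in where a broker-backed class would go, and the ENSEMBLE_ENV variable name is invented for the example. The point is that drain() compiles against the interface only.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Trimmed-down stand-in for the article's RequestQueue (String instead of WorkRequest).
interface RequestQueue {
    void enqueue(String request);
    Optional<String> poll(Duration timeout);
}

// In-process backing, as used in tests.
final class BlockingRequestQueue implements RequestQueue {
    private final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public void enqueue(String request) { queue.offer(request); }

    public Optional<String> poll(Duration timeout) {
        try {
            return Optional.ofNullable(queue.poll(timeout.toMillis(), TimeUnit.MILLISECONDS));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }
}

// Hypothetical placeholder for a broker-backed implementation;
// here it is only a synchronized deque that ignores the timeout.
final class PlaceholderBrokerQueue implements RequestQueue {
    private final Deque<String> log = new ArrayDeque<>();

    public synchronized void enqueue(String request) { log.addLast(request); }

    public synchronized Optional<String> poll(Duration timeout) {
        return Optional.ofNullable(log.pollFirst());
    }
}

final class AgentWorker {
    // "Agent code": depends on the interface, never on a backing store.
    static List<String> drain(RequestQueue queue) {
        List<String> handled = new ArrayList<>();
        Optional<String> next;
        while ((next = queue.poll(Duration.ofMillis(10))).isPresent()) {
            handled.add("handled:" + next.get());
        }
        return handled;
    }

    // The only environment-aware code in the program.
    static RequestQueue queueFor(String environment) {
        return "production".equals(environment)
                ? new PlaceholderBrokerQueue()
                : new BlockingRequestQueue();
    }

    public static void main(String[] args) {
        // ENSEMBLE_ENV is a made-up variable name for this sketch
        String env = System.getenv().getOrDefault("ENSEMBLE_ENV", "test");
        RequestQueue queue = queueFor(env);
        queue.enqueue("prepare-meal");
        queue.enqueue("check-inventory");
        System.out.println(drain(queue));
    }
}
```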

The capability registry

One of the more interesting transport primitives is the capability registry. When an ensemble shares a task or tool on the network, that capability needs to be discoverable by other ensembles.

// Ensemble advertises capabilities
CapabilityRegistry registry = transport.capabilityRegistry();
registry.register("prepare-meal", CapabilityType.TASK, "kitchen");
registry.register("check-inventory", CapabilityType.TOOL, "kitchen");

// Another ensemble discovers capabilities
Optional<String> provider = registry.findProvider("prepare-meal");

In simple mode, this is an in-memory map. In production, it could be backed by a service registry, a shared database, or Kafka's consumer group protocol. The application code that registers and discovers capabilities does not change.
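For the simple-mode case, an in-memory registry could be as small as the sketch below. The register and findProvider signatures follow the usage shown above; the class name, the internal Entry type, and the choice that re-registering a name overwrites the previous provider are assumptions made for illustration.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

enum CapabilityType { TASK, TOOL }

// Hypothetical in-memory registry; only the two method signatures come from the article.
final class InMemoryCapabilityRegistry {

    private static final class Entry {
        final CapabilityType type;
        final String ensemble;
        Entry(CapabilityType type, String ensemble) {
            this.type = type;
            this.ensemble = ensemble;
        }
    }

    private final Map<String, Entry> providers = new ConcurrentHashMap<>();

    // Assumption: a later registration for the same name replaces the earlier one.
    void register(String name, CapabilityType type, String ensemble) {
        providers.put(name, new Entry(type, ensemble));
    }

    // Returns the ensemble that advertised the capability, if any.
    Optional<String> findProvider(String name) {
        return Optional.ofNullable(providers.get(name)).map(e -> e.ensemble);
    }
}

final class CapabilityRegistryDemo {
    public static void main(String[] args) {
        InMemoryCapabilityRegistry registry = new InMemoryCapabilityRegistry();
        registry.register("prepare-meal", CapabilityType.TASK, "kitchen");
        registry.register("check-inventory", CapabilityType.TOOL, "kitchen");
        System.out.println(registry.findProvider("prepare-meal").orElse("none")); // prints "kitchen"
    }
}
```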

Tradeoffs

Abstraction leaks. In-process queues have different ordering and delivery guarantees than Kafka topics. The SPI abstracts the interface but cannot fully abstract the semantics. Kafka preserves order only within a partition, so code that depends on strict FIFO ordering across all requests will see a different order once a topic has more than one partition.
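The ordering difference is easy to reproduce without a broker. The toy simulation below is not Kafka client code: PartitionedLog is a hypothetical class that routes each message to a partition by key and then reads partitions round-robin, which is one of many interleavings a real consumer might see.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of a partitioned topic; an illustration only, not a Kafka API.
final class PartitionedLog {
    private final List<Deque<String>> partitions = new ArrayList<>();

    PartitionedLog(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayDeque<>());
        }
    }

    // Messages with the same key always land in the same partition.
    void send(int key, String message) {
        partitions.get(Math.floorMod(key, partitions.size())).addLast(message);
    }

    // One possible consumer interleaving: one message per partition per pass.
    List<String> consumeRoundRobin() {
        List<String> received = new ArrayList<>();
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (Deque<String> partition : partitions) {
                String message = partition.pollFirst();
                if (message != null) {
                    received.add(message);
                    progressed = true;
                }
            }
        }
        return received;
    }
}

final class PartitionOrderingDemo {
    public static void main(String[] args) {
        PartitionedLog log = new PartitionedLog(2);
        // Send order: a1, a2, b1, a3, b2
        log.send(0, "a1");
        log.send(0, "a2");
        log.send(1, "b1");
        log.send(0, "a3");
        log.send(1, "b2");
        System.out.println(log.consumeRoundRobin()); // prints [a1, b1, a2, b2, a3]
    }
}
```

Order within each key survives (a1, a2, a3 and b1, b2), but the global send order does not. A single in-process queue would have returned the messages exactly as sent.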

Configuration complexity. Each transport implementation has its own configuration (bootstrap servers, consumer groups, topic prefixes). The SPI does not unify configuration -- you still need environment-specific setup for each backing store.

Performance characteristics vary. In-process queue operations typically cost nanoseconds to microseconds. A round trip through Kafka adds latency on the order of milliseconds. If your agent workflow is latency-sensitive, the transport choice matters and the abstraction cannot hide that.

Not all primitives are equally portable. Request queues map cleanly to most message systems. Delivery registries (correlating responses to specific requests) are harder to implement efficiently in some message brokers.

The design principle

The useful insight is that agent network communication has a small number of well-defined primitives (request queuing, response routing, capability registration), and these primitives have natural implementations at every scale (in-process, single-node, distributed).

Instead of building the network layer directly on top of a specific infrastructure choice, defining the primitives as interfaces moves the infrastructure decision from development time to deployment time.

This is standard dependency inversion. It is not novel. But it is the foundation that makes everything else in the ensemble network possible -- durable transport, discovery, federation, and capacity management all build on these same interfaces.

The transport SPI is part of AgentEnsemble. The durable transport guide covers the Kafka implementation in detail.

I'd be interested in what transport backends others are using for agent-to-agent communication, and whether the primitive set (request queue, delivery registry, capability registry) feels complete or whether there are missing pieces.