In-process queues are fine for development. They are fast, deterministic, and require zero infrastructure. But they have a property that becomes a liability in production: when the process dies, the queue contents disappear.
For agent networks that run as long-lived services -- handling work requests over hours or days -- losing queued requests on restart is not acceptable. The transport layer needs durability, and that means moving from in-process data structures to something that survives process failures.
What durability means for agent networks
An agent ensemble network has three communication patterns that need durable backing:
- Work request delivery -- a request from one ensemble to another should not be lost if the receiving ensemble is temporarily unavailable
- Response routing -- when an ensemble completes a request, the response needs to reach the original caller even if the caller restarted
- Capability advertisement -- shared tasks and tools should remain discoverable across process restarts
Each of these has different durability requirements. Work requests are the most critical -- a lost request means lost work. Response routing needs correlation (matching responses to requests). Capability advertisement needs eventual consistency but not strict durability.
Kafka as the transport backing
The agentensemble-transport-kafka module implements the transport SPIs against Apache Kafka. All components share a single configuration:
KafkaTransportConfig config = KafkaTransportConfig.builder()
    .bootstrapServers("kafka:9092")
    .consumerGroupId("kitchen-ensemble")
    .topicPrefix("agentensemble.")
    .build();
Request queues
The KafkaRequestQueue produces work requests to a Kafka topic and consumes them with manual offset commits:
KafkaRequestQueue queue = KafkaRequestQueue.builder()
    .config(config)
    .ensembleName("kitchen")
    .build();
// Enqueue a work request (produces to Kafka)
queue.enqueue(workRequest);
// Poll for requests (consumes from Kafka)
Optional<WorkRequest> request = queue.poll(Duration.ofSeconds(5));
The topic name is derived from the ensemble name and prefix: agentensemble.kitchen.requests. Manual offset commits ensure that a request is acknowledged only after the ensemble has finished processing it. If the ensemble crashes mid-processing, the request is redelivered on restart.
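The commit-after-process contract is what produces that redelivery. The sketch below is purely illustrative -- a tiny in-memory log in place of the real Kafka consumer API, with none of these names taken from the client library -- but it shows why crashing between processing and commit makes the same request arrive twice:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: an in-memory "log" demonstrating why committing the
// offset AFTER processing yields at-least-once delivery with redelivery
// on restart. This is not the Kafka client API.
public class CommitAfterProcess {

    // Simulates one consumer run: process records starting at committedOffset,
    // "crash" right before committing the record at crashAt (-1 = no crash).
    // Returns the committed offset the next run would resume from.
    static int run(List<String> log, int committedOffset, int crashAt,
                   List<String> processed) {
        for (int offset = committedOffset; offset < log.size(); offset++) {
            processed.add(log.get(offset));   // do the work first
            if (offset == crashAt) {
                return committedOffset;       // crash BEFORE commit: offset unchanged
            }
            committedOffset = offset + 1;     // manual commit after processing
        }
        return committedOffset;
    }

    public static void main(String[] args) {
        List<String> log = List.of("req-0", "req-1", "req-2");
        List<String> processed = new ArrayList<>();

        // First run crashes while processing req-1: only req-0 is committed.
        int committed = run(log, 0, 1, processed);

        // Restart resumes from the committed offset: req-1 is redelivered
        // and appears twice in the processed list.
        run(log, committed, -1, processed);
        System.out.println("processed = " + processed);
    }
}
```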
Delivery registry
The KafkaDeliveryRegistry tracks pending deliveries and routes responses back to callers:
KafkaDeliveryRegistry registry = KafkaDeliveryRegistry.builder()
    .config(config)
    .build();
// Register a pending delivery (before sending request)
CompletableFuture<String> future = registry.register(requestId);
// Complete the delivery when response arrives
registry.complete(requestId, responsePayload);
// Caller awaits the result
String response = future.get(30, TimeUnit.SECONDS);
The registry uses a Kafka topic for durability: pending deliveries are produced as records, and completions are produced as tombstones. On restart, the registry rebuilds its state by replaying the topic from the beginning.
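As a rough illustration of that replay (not the module's actual code), pending state can be rebuilt by folding records into a map and treating a null value as a tombstone:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: rebuilding pending-delivery state by replaying a
// topic from the beginning. A null payload marks a completion tombstone.
public class ReplayRebuild {

    // Each record is (requestId, payload); a null payload marks completion.
    static Map<String, String> rebuild(List<Map.Entry<String, String>> records) {
        Map<String, String> pending = new LinkedHashMap<>();
        for (Map.Entry<String, String> rec : records) {
            if (rec.getValue() == null) {
                pending.remove(rec.getKey());             // tombstone: completed
            } else {
                pending.put(rec.getKey(), rec.getValue()); // still pending
            }
        }
        return pending;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> topic = List.of(
            new SimpleEntry<>("req-1", "payload-1"),
            new SimpleEntry<>("req-2", "payload-2"),
            new SimpleEntry<>("req-1", null)              // req-1 completed
        );
        // Only req-2 remains pending after the replay.
        System.out.println(rebuild(topic).keySet());
    }
}
```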
Priority queues with aging
For workloads where some requests are more urgent than others, the PriorityRequestQueue adds priority levels with aging:
PriorityRequestQueue priorityQueue = PriorityRequestQueue.builder()
    .requestQueue(kafkaQueue)
    .levels(3) // 3 priority levels (0 = highest)
    .agingInterval(Duration.ofMinutes(5))
    .build();
// Enqueue with priority
priorityQueue.enqueue(urgentRequest, 0); // highest priority
priorityQueue.enqueue(normalRequest, 1); // normal priority
priorityQueue.enqueue(batchRequest, 2); // lowest priority
Aging prevents starvation: requests that have waited longer than the aging interval are promoted to the next higher priority level. A batch request that has been waiting for 10 minutes (two aging intervals) gets promoted twice, eventually reaching the highest priority.
This is implemented as a layer on top of any RequestQueue implementation, so it works with both in-process and Kafka-backed queues.
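The promotion rule itself fits in a few lines. This is an illustrative model of the semantics described above (one promotion per full aging interval, with level 0 as the floor), not the PriorityRequestQueue implementation:

```java
import java.time.Duration;

// Illustrative sketch of the aging rule: a request's effective priority
// level improves by one for each full aging interval it has waited.
public class AgingSketch {

    // Effective level after `waited` time: one promotion per elapsed
    // interval, never promoted above level 0 (the highest priority).
    static int effectiveLevel(int level, Duration waited, Duration agingInterval) {
        long promotions = waited.toMillis() / agingInterval.toMillis();
        return (int) Math.max(0, level - promotions);
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofMinutes(5);
        // A level-2 batch request that has waited 10 minutes (two intervals)
        // is promoted twice, reaching the highest priority.
        System.out.println(effectiveLevel(2, Duration.ofMinutes(10), interval));
        // A level-1 request that has waited 3 minutes keeps its level.
        System.out.println(effectiveLevel(1, Duration.ofMinutes(3), interval));
    }
}
```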
What changes operationally
Moving from in-process to Kafka transport changes the operational profile of the ensemble network:
Startup behavior changes. With in-process queues, an ensemble starts with an empty queue. With Kafka, it may start with a backlog of unprocessed requests from before the restart. The ensemble needs to handle this gracefully -- processing the backlog before accepting new work, or processing both concurrently.
Failure modes change. In-process queue failures are process-fatal (if the process dies, the queue is gone). Kafka failures are infrastructure-level (broker unavailable, topic not found, authorization errors). The error handling needs to distinguish between transient failures (retry) and permanent failures (alert and skip).
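A sketch of that distinction, using stand-in exception types rather than the real Kafka client exceptions:

```java
import java.net.ConnectException;
import java.util.concurrent.TimeoutException;

// Illustrative sketch: routing transport failures to retry vs alert-and-skip.
// The exception types here are generic stand-ins, not Kafka client types.
public class FailureClassifier {

    enum Action { RETRY, ALERT_AND_SKIP }

    static Action classify(Exception e) {
        // Transient: broker temporarily unreachable, request timed out --
        // these tend to heal on their own, so retry with backoff.
        if (e instanceof ConnectException || e instanceof TimeoutException) {
            return Action.RETRY;
        }
        // Permanent: misconfiguration, missing topics, authorization
        // problems -- retrying will not help, so alert and move on.
        return Action.ALERT_AND_SKIP;
    }

    public static void main(String[] args) {
        System.out.println(classify(new TimeoutException()));
        System.out.println(classify(new SecurityException("not authorized")));
    }
}
```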
Monitoring needs change. With in-process queues, queue depth is a simple counter. With Kafka, you need to monitor consumer lag, topic partition health, and broker connectivity. The ensemble's health check needs to include Kafka reachability.
Ordering semantics change. In-process queues provide strict FIFO. Kafka provides per-partition ordering, which means requests may be processed out of order if the topic has multiple partitions. For most agent workloads, this is fine -- requests are independent. But if your workflow depends on ordering, you need single-partition topics or application-level sequencing.
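A minimal sketch of the principle (Kafka's actual partitioner is more involved than a plain hash): records that share a key land in the same partition, so each caller's requests keep their relative order even though there is no global order across partitions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of per-partition ordering: same key -> same partition,
// so order is preserved per key but not globally.
public class PartitionOrdering {

    static int partitionFor(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    public static void main(String[] args) {
        int partitions = 3;
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        // Two interleaved request streams, keyed by caller name.
        for (String rec : List.of("alice:r1", "bob:r1", "alice:r2", "bob:r2")) {
            String key = rec.split(":")[0];
            byPartition.computeIfAbsent(partitionFor(key, partitions),
                    p -> new ArrayList<>()).add(rec);
        }
        // Within each partition a caller's requests stay in send order;
        // across partitions there is no global order.
        System.out.println(byPartition);
    }
}
```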
The configuration boundary
One design decision worth calling out: the Kafka transport configuration is separate from the ensemble configuration. The ensemble does not know it is using Kafka -- it interacts with the transport SPIs. The Kafka-specific configuration (bootstrap servers, consumer groups, topic prefixes) lives in the infrastructure layer.
// Infrastructure layer: Kafka-specific setup
KafkaTransportConfig kafkaConfig = KafkaTransportConfig.builder()
    .bootstrapServers("kafka:9092")
    .consumerGroupId("kitchen-ensemble")
    .build();
KafkaRequestQueue queue = KafkaRequestQueue.builder()
    .config(kafkaConfig)
    .ensembleName("kitchen")
    .build();
// Application layer: ensemble setup (transport-agnostic)
Ensemble kitchen = Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))
    .requestQueue(queue)
    .build();
This separation means the same ensemble code works in development (with in-process queues) and production (with Kafka) without changes. The transport choice is an infrastructure decision, not an application decision.
Tradeoffs
Operational complexity. Kafka is infrastructure that needs to be provisioned, monitored, and maintained. For small deployments, the operational overhead may not be justified. The in-process transport with periodic state snapshots might be a simpler alternative.
Latency. Kafka adds millisecond-scale latency to every request delivery. For agent workloads where task execution takes seconds or minutes, this is negligible. For sub-second workflows, it may not be.
Topic proliferation. Each ensemble gets its own request topic. A network of 20 ensembles means 20+ Kafka topics. This is manageable but requires topic lifecycle management (creation, cleanup, retention policies).
Exactly-once is hard. The current implementation provides at-least-once delivery. A request may be processed twice if the ensemble crashes after completing the work but before committing the offset. For most agent workloads (which are non-deterministic anyway), this is acceptable. For workloads that require exactly-once, additional deduplication logic is needed.
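A minimal sketch of that deduplication logic, assuming requests carry stable IDs. In practice the processed-ID set would need durable storage; a HashSet stands in for it here:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: application-level deduplication on top of
// at-least-once delivery. A real implementation would persist the
// processed-ID set (and bound its growth) rather than use a HashSet.
public class Deduplicator {

    private final Set<String> processedIds = new HashSet<>();

    // Returns true if the request should be processed, false if it is a
    // redelivered duplicate.
    boolean firstDelivery(String requestId) {
        return processedIds.add(requestId);
    }

    public static void main(String[] args) {
        Deduplicator dedup = new Deduplicator();
        System.out.println(dedup.firstDelivery("req-42")); // process it
        System.out.println(dedup.firstDelivery("req-42")); // duplicate: skip
    }
}
```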
When to use durable transport
The decision is straightforward:
- Development and testing: in-process transport. Zero setup, fast, deterministic.
- Single-node production: in-process transport with periodic state persistence. Simple, no external dependencies.
- Multi-node production: Kafka transport. Durability, horizontal scaling, replay capability.
- Edge or embedded: in-process transport. No infrastructure dependency.
The transport SPI lets you make this decision per-deployment without changing application code.
The Kafka transport module is part of AgentEnsemble. The durable transport guide covers the full configuration and operational details.
I'd be interested in whether others are using Kafka (or similar) for agent-to-agent communication, and what delivery guarantee level they find sufficient in practice.