mgd43b for AgentEnsemble

Posted on May 19 • Originally published at agentensemble.net

Testing Distributed Agent Systems: Stubs, Recordings, and Isolation

#java #ai #agents #architecture

Testing a single agent ensemble is already harder than testing most software: the output is non-deterministic, the execution path depends on LLM responses, and the number of iterations is unpredictable.

Testing a network of agent ensembles adds distributed system concerns on top of that: WebSocket connections between services, shared state across ensembles, capability discovery, and cross-ensemble delegation. If your tests require all of these to be running, your test suite becomes an integration environment rather than a test suite.

The question is how to test an ensemble's network behavior without requiring the rest of the network to be running.

The testing problem

An ensemble that delegates work to other ensembles via NetworkTask or NetworkTool has external dependencies. In production, those dependencies are real ensembles running on real infrastructure. In tests, you need control over what those dependencies return.

// Production code: room service delegates to kitchen
Ensemble roomService = Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.builder()
        .description("Handle room service request")
        .tools(NetworkTask.of("kitchen", "prepare-meal",
            "ws://kitchen:7329/ws"))
        .build())
    .build();

If you test this ensemble without a running kitchen, the NetworkTask fails to connect. If you run a real kitchen, your test depends on the kitchen's LLM, its prompt, its tools -- all of which are outside your control and non-deterministic.

Stubs for predictable behavior

NetworkTask.stub() and NetworkTool.stub() return canned responses without connecting to any real ensemble:

StubNetworkTask mealStub = NetworkTask.stub("kitchen", "prepare-meal",
    "Meal prepared: wagyu steak, medium-rare. Estimated 25 minutes.");

StubNetworkTool inventoryStub = NetworkTool.stub("kitchen", "check-inventory",
    "3 portions available");

Ensemble roomService = Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.builder()
        .description("Handle room service request")
        .tools(mealStub, inventoryStub)
        .build())
    .build();

EnsembleOutput result = roomService.run();

The stub replaces the real network call with a predetermined response. The ensemble under test processes the stub's response exactly as it would process a response from a real kitchen ensemble.

This gives you deterministic network behavior while letting the ensemble's own LLM interactions remain non-deterministic. You are testing how the ensemble uses the network response, not whether the network itself works.
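
To make the mechanics concrete, here is a minimal plain-Java sketch of what a stub amounts to. It is not AgentEnsemble's implementation: `SketchTool` and `CannedTool` are hypothetical names made up for illustration, and the only thing borrowed from the library is the "ensemble.capability" naming format described in this post.

```java
// Hypothetical stand-in for a network-backed tool; not AgentEnsemble's real interface.
interface SketchTool {
    String name();
    String execute(String request);
}

// A stub ignores the request content and always returns the same canned response.
final class CannedTool implements SketchTool {
    private final String name;
    private final String response;

    CannedTool(String ensemble, String capability, String response) {
        // Same "ensemble.capability" format the agent would see for a real tool.
        this.name = ensemble + "." + capability;
        this.response = response;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public String execute(String request) {
        return response; // no connection, no remote LLM, no variation
    }
}
```

Because the response never depends on the request, any variation left in the test comes from the ensemble's own model calls, not from the dependency.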

Recordings for assertion

Sometimes you need to verify not just the output but what the ensemble sent to its dependencies. NetworkTask.recording() captures every request for later assertion:

RecordingNetworkTask recorder = NetworkTask.recording("kitchen", "prepare-meal");

Ensemble roomService = Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.builder()
        .description("Handle room service request for wagyu steak")
        .tools(recorder)
        .build())
    .build();

roomService.run();

// Assert what was sent to the kitchen
assertThat(recorder.callCount()).isEqualTo(1);
assertThat(recorder.lastRequest()).contains("wagyu");
assertThat(recorder.requests()).hasSize(1);

Recordings combine the predictability of stubs (they return a configurable response, defaulting to "recorded") with the observability of a mock. You can verify that the ensemble called the right capability with the right parameters.

This is particularly useful for testing delegation logic: does the ensemble delegate to the right capability? Does it include the right context in the request? Does it handle the response correctly?

Custom responses for recordings

By default, recordings return "recorded". You can provide a custom response:

RecordingNetworkTask recorder = NetworkTask.recording("kitchen", "prepare-meal",
    "Meal prepared in 25 minutes");

Tool naming consistency

Both stubs and recordings use the same naming convention as real network tools: "ensemble.capability". A stub for the kitchen's prepare-meal capability is named "kitchen.prepare-meal" -- the same name the agent sees when using a real NetworkTask.

This means the agent's tool selection logic works identically in tests and production. The agent does not know whether kitchen.prepare-meal is backed by a real WebSocket connection, a stub, or a recording.

Thread safety

All test doubles are thread-safe. RecordingNetworkTask and RecordingNetworkTool use CopyOnWriteArrayList internally, so concurrent calls from parallel tool execution are safely recorded.

This matters because agent ensembles with multiple concurrent tasks may invoke network tools from different threads simultaneously. The test doubles handle this correctly without external synchronization.
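
As a rough illustration of why a copy-on-write list is enough here, the sketch below builds a tiny recorder and hits it from several threads. `SketchRecorder` and `recordConcurrently` are hypothetical names for this example only; the real RecordingNetworkTask may be structured differently.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical recorder for illustration; not AgentEnsemble's RecordingNetworkTask.
final class SketchRecorder {
    private final String response;
    // CopyOnWriteArrayList makes add() safe to call from many threads at once.
    private final List<String> requests = new CopyOnWriteArrayList<>();

    SketchRecorder(String response) {
        this.response = response;
    }

    String execute(String request) {
        requests.add(request);
        return response;
    }

    int callCount() {
        return requests.size();
    }

    List<String> requests() {
        return List.copyOf(requests); // immutable snapshot for assertions
    }

    // Sends `calls` requests from a thread pool and returns how many were captured.
    static int recordConcurrently(int calls) {
        SketchRecorder recorder = new SketchRecorder("recorded");
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < calls; i++) {
            final int id = i;
            pool.execute(() -> recorder.execute("order-" + id));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("calls did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return recorder.callCount();
    }
}
```

With a plain ArrayList in place of the copy-on-write list, concurrent adds could lose entries or throw, which is exactly the failure a test double should not introduce.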

Integration test patterns

Stubs and recordings handle unit-level testing: verifying one ensemble's behavior in isolation. For integration testing -- verifying that two ensembles work together correctly -- you can run both ensembles in the same process with in-process transport:

// In-process: no WebSocket connections needed
Ensemble kitchen = Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))
    .shareTask("prepare-meal", mealTask) // mealTask: a Task defined elsewhere in the test
    .build();

Ensemble roomService = Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.builder()
        .description("Handle room service request")
        .tools(NetworkTask.of("kitchen", "prepare-meal"))
        .build())
    .build();

// Both ensembles run in-process with shared registry

In-process transport eliminates network concerns (connection timeouts, port conflicts, container management) while preserving real cross-ensemble communication. The ensembles interact through in-memory queues and registries, so the behavior is the same as production except for the transport layer.
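
The idea behind an in-process transport can be sketched with a shared in-memory registry. This is a simplified assumption about the mechanism, not AgentEnsemble's actual transport: `SketchRegistry`, `share`, and `call` are hypothetical, and the sketch dispatches with a direct method call where the real implementation reportedly uses in-memory queues.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical in-memory registry; AgentEnsemble's real in-process transport may differ.
final class SketchRegistry {
    private final Map<String, Function<String, String>> handlers = new ConcurrentHashMap<>();

    // Provider side: expose a capability under the "ensemble.capability" name.
    void share(String ensemble, String capability, Function<String, String> handler) {
        handlers.put(ensemble + "." + capability, handler);
    }

    // Consumer side: a lookup plus a method call replaces the WebSocket round trip.
    String call(String ensemble, String capability, String request) {
        String key = ensemble + "." + capability;
        Function<String, String> handler = handlers.get(key);
        if (handler == null) {
            throw new IllegalStateException("No provider registered for " + key);
        }
        return handler.apply(request);
    }
}
```

Unlike a stub, the handler here runs real provider logic, so a request the provider cannot handle fails the test instead of being silently accepted.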

Testing patterns summary

| What to test | Tool | Approach |
| --- | --- | --- |
| Ensemble uses network response correctly | `NetworkTask.stub()` | Canned response, deterministic |
| Ensemble sends correct request to network | `NetworkTask.recording()` | Capture and assert requests |
| Two ensembles work together | In-process transport | Real interaction, no network |
| End-to-end with real infrastructure | WebSocket transport | Full integration test |

Each level adds realism and cost. Start with stubs for fast, focused tests. Use recordings when you need to verify outbound requests. Use in-process transport for integration tests. Reserve full WebSocket tests for the deployment verification layer.

Tradeoffs

Stubs hide integration problems. A stub always returns the same response, regardless of what the ensemble sends. If the ensemble sends a malformed request that a real kitchen would reject, the stub does not catch that. Integration tests with in-process transport or WebSocket transport are needed to verify the contract between ensembles.

LLM non-determinism leaks through. Even with stubbed network dependencies, the ensemble's own LLM calls are non-deterministic, so the same test may pass or fail depending on the model's response. For fully deterministic tests, you need to stub the LLM as well, for example with LangChain4j's test doubles. A local model at temperature 0 narrows the variation but may not remove it entirely.
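
One way to stub the model is a scripted fake that replays fixed responses in order. The sketch below uses a hypothetical `SketchModel` interface rather than LangChain4j's actual model interface, so treat it as the shape of the idea and adapt it to whatever interface your ensemble is built against.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical model interface for illustration; not LangChain4j's API.
interface SketchModel {
    String generate(String prompt);
}

// Replays a fixed script of responses, one per call, so every test run sees the same output.
final class ScriptedModel implements SketchModel {
    private final Deque<String> script;

    ScriptedModel(List<String> responses) {
        this.script = new ArrayDeque<>(responses);
    }

    @Override
    public synchronized String generate(String prompt) {
        String next = script.poll();
        if (next == null) {
            // Failing loudly exposes iteration counts the test did not anticipate.
            throw new IllegalStateException("Script exhausted: unexpected extra model call");
        }
        return next;
    }
}
```

The price is that the test no longer exercises the prompt at all, so this style suits checks on control flow and tool wiring rather than on model behavior.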

Recordings only capture what was sent. They do not verify that the request would be accepted by the real provider. Schema validation or contract testing would be needed to verify compatibility.
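
A lightweight form of contract testing is to run every recorded request through a set of rules that the provider team also agrees to. The sketch below assumes requests are plain strings; `SketchContract` and its rule names are hypothetical and not part of AgentEnsemble.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical contract: named rules that every request to a capability must satisfy.
final class SketchContract {
    private final Map<String, Predicate<String>> rules = new LinkedHashMap<>();

    SketchContract rule(String name, Predicate<String> check) {
        rules.put(name, check);
        return this;
    }

    // Returns one message per (request, broken rule) pair; an empty list means all requests conform.
    List<String> violations(List<String> recordedRequests) {
        List<String> found = new ArrayList<>();
        for (String request : recordedRequests) {
            rules.forEach((name, check) -> {
                if (!check.test(request)) {
                    found.add(name + " failed for: " + request);
                }
            });
        }
        return found;
    }
}
```

In a consumer test you would pass the recorder's captured requests to `violations(...)` and assert that the result is empty. The same rules can be applied on the provider side to keep both ends honest.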

In-process tests share a JVM. Two ensembles running in the same process share class loaders, thread pools, and memory, so tests can hit resource contention that would not occur in production. Conversely, isolation problems that only appear when the ensembles run as separate processes are not caught.

The design principle

The useful insight is that network behavior and business logic are separable concerns. An ensemble's decision to delegate to the kitchen, and how it processes the kitchen's response, is business logic. The WebSocket connection, serialization, and transport are infrastructure.

Test doubles let you test the business logic without the infrastructure. In-process transport lets you test the interaction without the network. Full integration tests verify everything works together.

This layered approach is standard practice for distributed systems. What makes it notable in the agent context is that the business logic is already non-deterministic (LLM-driven), so isolating the network layer from the LLM layer is particularly valuable for test stability.

Network testing tools are part of AgentEnsemble. The network testing guide covers the full API including stubs, recordings, and in-process transport setup.

I'd be interested in how others approach testing multi-agent systems -- especially how you handle the double non-determinism of LLM behavior plus network behavior.
