DEV Community

mgd43b for AgentEnsemble

Posted on • Originally published at agentensemble.net

Debugging Multi-Agent Systems: Traces, Capture Mode, and Live Dashboards

Multi-agent systems are hard to debug.

It's not the same as debugging a web request or a database query. You can't set a breakpoint in the middle of an LLM call. You can't predict what the model will say. When an agent produces bad output, you need to understand the full chain of events: what prompt was sent, what the model returned, which tools were called, what context from previous tasks was injected, and whether the output parsing succeeded.

Traditional debuggers don't help here. You need purpose-built observability.

This post covers the debugging and observability stack in AgentEnsemble: structured traces for post-mortem analysis, capture mode for recording full execution state, and the live dashboard for real-time visibility during development.

The Debugging Challenge

Consider a three-agent pipeline: Researcher, Analyst, Writer. The Writer produces a report that's factually wrong. Where did things go wrong?

  • Did the Researcher find bad information?
  • Did the Analyst misinterpret the research?
  • Did the Writer ignore the analysis and hallucinate?
  • Did a tool call return unexpected results?
  • Was the wrong context passed between tasks?

Without observability, you're guessing. With it, you're reading a log.

Layer 1: Structured Traces

The most broadly useful debugging tool is the structured trace. It records every significant event in an ensemble run as a tree of spans:

EnsembleOutput output = Ensemble.builder()
    .agents(researcher, analyst, writer)
    .tasks(researchTask, analysisTask, writeTask)
    .chatLanguageModel(model)
    .traceExporter(TraceExporter.json(Path.of("traces/")))
    .build()
    .run();

This produces a JSON file in the traces/ directory with a structure like:

Ensemble Run (total: 8,420ms, 5,230 tokens)
 |
 +-- Task: Research emerging trends (3,240ms, 1,847 tokens)
 |    +-- LLM Call #1 (1,900ms, 1,200 tokens)
 |    +-- Tool: WebSearch "emerging tech trends 2024" (890ms)
 |    +-- LLM Call #2 (450ms, 647 tokens)
 |
 +-- Task: Analyze research findings (2,180ms, 1,583 tokens)
 |    +-- LLM Call #1 (2,180ms, 1,583 tokens)
 |
 +-- Task: Write final report (3,000ms, 1,800 tokens)
      +-- LLM Call #1 (2,400ms, 1,400 tokens)
      +-- LLM Call #2 (600ms, 400 tokens)  // output retry

Each span records:

  • Name: the task description or tool call name.
  • Duration: wall-clock time in milliseconds.
  • Token count: input plus output tokens, for LLM calls.
  • Status: success, failure, or retry.
  • Input/Output: what went in and what came out.
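These fields map naturally onto a small record. Here's a local sketch of the shape -- an illustration only, not the framework's actual TraceSpan class:

```java
import java.util.List;

// Illustrative stand-in for the span shape described above;
// the real TraceSpan class lives in the framework.
public record SpanSketch(
        String name,          // task description or tool call name
        long durationMs,      // wall-clock time in milliseconds
        int tokenCount,       // input + output tokens (0 for tool calls)
        String status,        // success, failure, or retry
        String input,
        String output,
        List<SpanSketch> children) {

    // Total tokens for this span and everything beneath it.
    public int totalTokens() {
        return tokenCount
            + children.stream().mapToInt(SpanSketch::totalTokens).sum();
    }
}
```

Summing token counts over a subtree like this is how the per-task totals in the trace tree above relate to the individual LLM calls beneath them.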

Accessing Traces Programmatically

You don't have to read the JSON file. The trace is available on the EnsembleOutput:

ExecutionTrace trace = output.getTrace();

// Walk the span tree
for (TraceSpan span : trace.getSpans()) {
    System.out.printf("[%s] %s -- %dms, %d tokens%n",
        span.getStatus(),
        span.getName(),
        span.getDurationMs(),
        span.getTokenCount());

    for (TraceSpan child : span.getChildren()) {
        System.out.printf("  [%s] %s -- %dms%n",
            child.getStatus(),
            child.getName(),
            child.getDurationMs());
    }
}

This is useful for writing assertions in tests:

@Test
void ensembleShouldCompleteAllTasks() {
    EnsembleOutput output = ensemble.run();

    ExecutionTrace trace = output.getTrace();
    assertThat(trace.getSpans()).hasSize(3);
    assertThat(trace.getSpans())
        .allMatch(span -> span.getStatus() == TraceStatus.SUCCESS);
    assertThat(output.getMetrics().getTotalTokens()).isLessThan(10_000);
}

Trace Export for Analysis Pipelines

The JSON trace format is designed for programmatic consumption. Feed it into your log aggregation system, build custom analysis scripts, or import it into a notebook:

// Export to a specific directory with timestamped filenames
.traceExporter(TraceExporter.json(Path.of("traces/")))

// Or get the raw JSON string
String traceJson = output.getTrace().toJson();
logAggregator.ingest("agent-trace", traceJson);

Layer 2: Capture Mode

Traces tell you what happened. Capture mode tells you exactly what happened -- including the full prompts, raw LLM responses, and tool call payloads.

Three Levels

Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .captureMode(CaptureMode.FULL) // OFF, STANDARD, or FULL
    .build()
    .run();
  • OFF: standard metrics only. Use in production.
  • STANDARD: adds the full LLM message history per iteration and memory operations. Use in staging and initial deployments.
  • FULL: adds tool call I/O payloads, raw LLM responses, and detailed timing. Use in development and debugging.
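A common pattern is to pick the level from the deployment environment so the same code runs everywhere. The CaptureMode enum below is a local stand-in mirroring the framework's three levels, and the APP_ENV variable name is an assumption:

```java
// Local stand-in for the framework's CaptureMode enum, used to
// illustrate environment-driven selection of the capture level.
enum CaptureMode { OFF, STANDARD, FULL }

public class CaptureModeSelector {
    // Map a deployment environment name to a capture level:
    // production gets OFF, staging gets STANDARD, everything else FULL.
    public static CaptureMode forEnvironment(String env) {
        if (env == null) return CaptureMode.FULL; // local development
        return switch (env.toLowerCase()) {
            case "prod", "production" -> CaptureMode.OFF;
            case "staging"            -> CaptureMode.STANDARD;
            default                   -> CaptureMode.FULL;
        };
    }
}
```

Then wire it in with something like .captureMode(CaptureModeSelector.forEnvironment(System.getenv("APP_ENV"))), so a config change -- not a code change -- controls how much is captured.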

What STANDARD Adds

With CaptureMode.STANDARD, each task's execution record includes the full conversation between the framework and the LLM:

Task: Research emerging trends
  Iteration 1:
    System prompt: "You are Senior Research Analyst. Your goal is..."
    User message: "Research emerging trends in AI thoroughly..."
    Assistant response: "I'll search for the latest information..."
    Tool call: WebSearch("emerging AI trends 2024")
  Iteration 2:
    System prompt: [same]
    User message: [previous context + tool result]
    Assistant response: "Based on my research, here are the key..."

This is invaluable for understanding why an agent behaved a certain way. You can see exactly what prompt it received, what context was injected, and how it reasoned through the task.

What FULL Adds

CaptureMode.FULL adds the raw payloads for every interaction:

  • Tool call inputs: The exact arguments passed to each tool.
  • Tool call outputs: The exact response from each tool.
  • Raw LLM responses: The complete response body, including any JSON that was parsed.
  • Timing breakdowns: Per-iteration timing, not just per-task.

This is the level you use when something is wrong and you can't figure out why from the trace alone. It's verbose -- expect significantly more data -- but it gives you full replay capability.

Using Capture Data in Tests

Capture mode is a testing power tool. Record a full execution, then write assertions against the captured data:

@Test
void researcherShouldUseWebSearch() {
    EnsembleOutput output = Ensemble.builder()
        .agents(researcher, writer)
        .tasks(researchTask, writeTask)
        .chatLanguageModel(model)
        .captureMode(CaptureMode.FULL)
        .build()
        .run();

    // Verify the researcher used the web search tool
    ExecutionTrace trace = output.getTrace();
    TraceSpan researchSpan = trace.getSpans().get(0);

    boolean usedWebSearch = researchSpan.getChildren().stream()
        .anyMatch(child -> child.getName().contains("WebSearch"));
    assertThat(usedWebSearch).isTrue();
}

You can also capture a "golden run" and use it as a reference for regression testing -- comparing future runs against the expected execution pattern.
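One way to implement the golden-run idea: reduce each run to an ordered list of "SpanName:STATUS" entries and compare. The helper below works on plain strings so it is framework-independent; in practice you would build both lists from trace.getSpans():

```java
import java.util.List;

public class GoldenRun {
    // Compare a recorded "golden" execution pattern against the current
    // run. Each entry is "SpanName:STATUS". Returns the index of the
    // first divergence, or -1 when the runs match exactly.
    public static int firstDivergence(List<String> golden, List<String> current) {
        int shorter = Math.min(golden.size(), current.size());
        for (int i = 0; i < shorter; i++) {
            if (!golden.get(i).equals(current.get(i))) return i;
        }
        // Same prefix: diverges where the shorter list ends, if at all.
        return golden.size() == current.size() ? -1 : shorter;
    }
}
```

In a regression test, assert that firstDivergence returns -1; when it doesn't, the returned index points you straight at the step where the new run departed from the golden one.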

Layer 3: Event Callbacks

For real-time debugging during development, callbacks give you a live stream of execution events:

Ensemble.builder()
    .agents(researcher, analyst, writer)
    .tasks(researchTask, analysisTask, writeTask)
    .chatLanguageModel(model)
    .listener(event -> {
        switch (event) {
            case TaskStartEvent e ->
                System.out.printf("%n>>> Starting: %s (agent: %s)%n",
                    e.taskDescription(), e.agentRole());

            case TaskCompleteEvent e ->
                System.out.printf("<<< Completed: %s (%dms, %d tokens)%n",
                    e.taskDescription(), e.durationMs(), e.tokenCount());

            case TaskFailedEvent e ->
                System.err.printf("!!! Failed: %s -- %s%n",
                    e.taskDescription(), e.errorMessage());

            case ToolCallEvent e ->
                System.out.printf("    [tool] %s(%s) -> %s%n",
                    e.toolName(),
                    truncate(e.input(), 50),
                    truncate(e.result(), 100));

            case DelegationStartedEvent e ->
                System.out.printf("    [delegate] %s -> %s%n",
                    e.fromAgent(), e.toAgent());

            case TokenEvent e ->
                // Streaming: print tokens as they arrive
                System.out.print(e.token());

            default -> {}
        }
    })
    .build()
    .run();

This gives you a live play-by-play of the ensemble execution in your terminal. You see each task start and complete, each tool call and its result, and each delegation in hierarchical workflows.
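The listener above calls a truncate helper that isn't part of the framework -- a minimal version you'd write yourself:

```java
public class Truncation {
    // Shorten a string to at most max characters, appending "..." when
    // cut, and flatten newlines so large tool payloads stay on one
    // readable log line.
    static String truncate(String s, int max) {
        if (s == null) return "";
        String oneLine = s.replace('\n', ' ');
        return oneLine.length() <= max ? oneLine
                                       : oneLine.substring(0, max) + "...";
    }
}
```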

Combining Callbacks with Logging

For persistent debugging output, route events to your logging framework:

.listener(event -> {
    if (event instanceof TaskCompleteEvent e) {
        log.info("Task completed: task={}, agent={}, duration={}ms, tokens={}",
            e.taskDescription(), e.agentRole(),
            e.durationMs(), e.tokenCount());
    }
    if (event instanceof TaskFailedEvent e) {
        log.error("Task failed: task={}, error={}",
            e.taskDescription(), e.errorMessage());
    }
})

These flow into your existing log aggregation pipeline (ELK, Splunk, CloudWatch Logs) alongside your application's other logs.

Layer 4: The Live Dashboard

For the most visual debugging experience, AgentEnsemble includes a live browser dashboard:

Ensemble.builder()
    .agents(researcher, analyst, writer)
    .tasks(researchTask, analysisTask, writeTask)
    .chatLanguageModel(model)
    .devtools(Devtools.enabled())
    .build()
    .run();

When the ensemble starts, a browser window opens (or a URL is printed to the console) showing a real-time visualization of the execution.

What the Dashboard Shows

  • DAG Visualization: A graph of all tasks and their dependencies. Nodes change color as tasks progress from pending to running to completed.
  • Agent Activity: Which agent is currently active, what it's doing, and how many iterations it's taken.
  • Token Consumption: Real-time token counters per task and for the entire ensemble.
  • Task Output Preview: Click on a completed task to see its output.
  • Timeline: A Gantt-chart-style view of task execution, showing parallelism and bottlenecks.

When to Use It

The live dashboard is a development tool, not a production monitoring dashboard. Use it when:

  • Building a new agent workflow and you want to see the execution flow.
  • Debugging why a specific task takes too long or produces unexpected output.
  • Demonstrating an agent system to stakeholders.
  • Understanding the parallelism in a DAG or MapReduce workflow.

For production monitoring, use the Micrometer metrics integration and your existing Grafana/Prometheus stack.

Debugging Recipes

Here are specific debugging scenarios and how to approach them with the tools above.

"The output is wrong, but I don't know which agent failed"

Use traces. Look at each task's output in the trace tree. Find the first task whose output is incorrect -- that's where things diverged.

.traceExporter(TraceExporter.json(Path.of("debug/")))

Then read the trace JSON, find the task with bad output, and check its input context to see what it received from upstream tasks.
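The "find where things diverged" step can be automated: walk the span tree depth-first and report the first span that isn't successful. The record below is a local stand-in for the trace structure; with the real API you'd walk trace.getSpans() the same way:

```java
import java.util.List;
import java.util.Optional;

public class FirstFailure {
    // Minimal local stand-in for a trace span: name, status, children.
    public record Span(String name, String status, List<Span> children) {}

    // Depth-first search for the first span whose status isn't SUCCESS --
    // the earliest point where the run went off the rails.
    public static Optional<Span> firstNonSuccess(List<Span> spans) {
        for (Span span : spans) {
            if (!"SUCCESS".equals(span.status())) return Optional.of(span);
            Optional<Span> inChild = firstNonSuccess(span.children());
            if (inChild.isPresent()) return inChild;
        }
        return Optional.empty();
    }
}
```

Note this only finds hard failures and retries; a task can succeed mechanically while producing bad content, so you still need to read the offending span's input and output.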

"The agent keeps calling the same tool in a loop"

Use capture mode + callbacks. Enable CaptureMode.FULL and add a callback that logs tool calls:

.captureMode(CaptureMode.FULL)
.listener(event -> {
    if (event instanceof ToolCallEvent e) {
        log.warn("Tool call: {} with input: {}",
            e.toolName(), e.input());
    }
})

Then check the captured LLM conversation to see why the agent keeps making the same call. Usually it's a prompt issue -- the agent doesn't recognize the tool result as sufficient.
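You can also catch the loop as it happens with a small listener-side counter that flags a repeated (tool, input) pair. The detector below is plain Java; wiring it into the ToolCallEvent branch of your listener is left as an assumption about your setup:

```java
import java.util.HashMap;
import java.util.Map;

public class ToolLoopDetector {
    private final Map<String, Integer> counts = new HashMap<>();
    private final int threshold;

    public ToolLoopDetector(int threshold) {
        this.threshold = threshold;
    }

    // Record one tool call; returns true once the identical
    // (toolName, input) pair has been seen `threshold` times or more.
    public boolean record(String toolName, String input) {
        String key = toolName + "|" + input;
        return counts.merge(key, 1, Integer::sum) >= threshold;
    }
}
```

In the listener: if (event instanceof ToolCallEvent e && detector.record(e.toolName(), e.input())) log.warn("possible tool loop: {}", e.toolName()); -- a cheap early warning before the agent burns its iteration budget.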

"The structured output parsing keeps failing"

Use capture mode. Enable CaptureMode.FULL and check the raw LLM response:

.captureMode(CaptureMode.FULL)

The captured data includes the raw response before parsing. Compare it to your record schema. Common issues:

  • The LLM wraps JSON in markdown code blocks.
  • Field names don't match (the LLM uses camelCase, the record uses snake_case).
  • The LLM adds extra fields or comments.

The framework handles most of these, but FULL capture mode shows you exactly what's happening.
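The first failure mode -- JSON wrapped in a markdown code block -- is easy to pre-process yourself if you ever need to. This is a sketch of that kind of cleanup, not the framework's actual parsing code:

```java
public class JsonCleanup {
    // Strip a surrounding markdown code fence (``` or ```json) from an
    // LLM response so the remainder can be handed to a JSON parser.
    public static String stripCodeFence(String response) {
        String s = response.strip();
        if (s.startsWith("```")) {
            int firstNewline = s.indexOf('\n');      // end of ``` or ```json line
            int closingFence = s.lastIndexOf("```"); // the trailing fence
            if (firstNewline >= 0 && closingFence > firstNewline) {
                s = s.substring(firstNewline + 1, closingFence).strip();
            }
        }
        return s;
    }
}
```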

"A parallel workflow is slower than expected"

Use the live dashboard. Enable devtools and look at the timeline view:

.devtools(Devtools.enabled())

You'll see whether tasks are actually running in parallel or if there's an unexpected dependency bottleneck. Common issues:

  • A task accidentally depends on another task via context() when it shouldn't.
  • One task takes much longer than the others, creating a bottleneck for downstream tasks.
  • Rate limiting is causing parallel tasks to serialize.
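If you'd rather check this in a test than eyeball the timeline, you can compute peak concurrency from span timings with a sweep line. The Interval record is a local stand-in -- the hedge here is that your spans expose start and end timestamps in some form:

```java
import java.util.ArrayList;
import java.util.List;

public class Parallelism {
    public record Interval(long startMs, long endMs) {}

    // Sweep-line over start/end events: returns the maximum number of
    // tasks running at the same time. 1 means fully serialized.
    public static int peakConcurrency(List<Interval> intervals) {
        List<long[]> events = new ArrayList<>(); // {timestamp, +1 or -1}
        for (Interval iv : intervals) {
            events.add(new long[]{iv.startMs(), 1});
            events.add(new long[]{iv.endMs(), -1});
        }
        // Sort by time; at equal timestamps, process ends before starts.
        events.sort((a, b) -> a[0] != b[0]
            ? Long.compare(a[0], b[0])
            : Long.compare(a[1], b[1]));
        int running = 0, peak = 0;
        for (long[] e : events) {
            running += e[1];
            peak = Math.max(peak, running);
        }
        return peak;
    }
}
```

Asserting peakConcurrency(...) > 1 in a test turns "I think these tasks run in parallel" into something a CI run can verify.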

"I need to understand the full prompt the agent received"

Use CaptureMode.STANDARD or CaptureMode.FULL. The captured data includes the complete system prompt, user message, and any injected context for each LLM call.

This is the only way to see the actual prompt -- the framework constructs it dynamically from the agent's role/goal/background, the task description, context from previous tasks, and tool results.

Putting It All Together

A typical debugging setup during development:

EnsembleOutput output = Ensemble.builder()
    .agents(researcher, analyst, writer)
    .tasks(researchTask, analysisTask, writeTask)
    .chatLanguageModel(model)
    // Full observability stack
    .captureMode(CaptureMode.FULL)
    .traceExporter(TraceExporter.json(Path.of("traces/")))
    .devtools(Devtools.enabled())
    .listener(event -> {
        if (event instanceof TaskCompleteEvent e) {
            log.info("[DONE] {} -- {}ms", e.taskDescription(), e.durationMs());
        }
        if (event instanceof ToolCallEvent e) {
            log.info("[TOOL] {} -> {}", e.toolName(), e.result());
        }
    })
    .costConfiguration(CostConfiguration.builder()
        .inputTokenCostPer1k(0.01)
        .outputTokenCostPer1k(0.03)
        .build())
    .build()
    .run();

// Post-run analysis
EnsembleMetrics metrics = output.getMetrics();
log.info("Total cost: ${}, tokens: {}, duration: {}ms",
    metrics.getTotalCost(), metrics.getTotalTokens(),
    output.getTotalDuration());

For production, dial it back:

EnsembleOutput output = Ensemble.builder()
    .agents(researcher, analyst, writer)
    .tasks(researchTask, analysisTask, writeTask)
    .chatLanguageModel(model)
    // Production observability
    .captureMode(CaptureMode.OFF)
    .traceExporter(TraceExporter.json(Path.of("/var/log/agent-traces/")))
    .meterRegistry(prometheusMeterRegistry)
    .listener(productionEventHandler)
    .costConfiguration(costConfig)
    .build()
    .run();

The observability stack scales from "show me everything" during development to "show me what matters" in production. Same API, different configuration.

The Core Idea

Multi-agent systems are opaque by nature. An LLM call is a black box -- you send a prompt, you get a response, and the reasoning happens inside the model. The only way to make agent systems debuggable is to capture and structure everything around those black box calls: what went in, what came out, how long it took, and how it fits into the broader execution flow.

That's what traces, capture mode, callbacks, and the live dashboard provide. Not transparency into the model, but transparency around it. And in practice, that's enough to debug anything.


Get started: AgentEnsemble is MIT-licensed and available on GitHub.
