DEV Community

mgd43b for AgentEnsemble

Posted on • Originally published at agentensemble.net

Production-Minded Multi-Agent Orchestration in Java

The demo works. Your two-agent research-writer pipeline produces a decent article. Your hierarchical team generates a plausible report. Someone on the team says "let's ship it."

And then reality sets in.

How many tokens did that run consume? What happens when the LLM returns garbage JSON? Can we rate-limit calls so we don't blow our API budget on day one? What if a task takes 90 seconds and produces hallucinated nonsense -- does a human get to review it before it hits the customer?

These aren't edge cases. They're the difference between a demo and a deployment. This post covers what production multi-agent orchestration actually requires, and how AgentEnsemble handles each concern.

Observability: Know What Your Agents Are Doing

Event Callbacks

Every significant event in an ensemble run fires a callback. You register listeners on the builder:

Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .listener(event -> {
        switch (event) {
            case TaskStartEvent e ->
                logger.info("Starting: {}", e.taskDescription());
            case TaskCompleteEvent e ->
                logger.info("Completed: {} ({}ms, {} tokens)",
                    e.taskDescription(), e.durationMs(), e.tokenCount());
            case TaskFailedEvent e ->
                logger.error("Failed: {} - {}", e.taskDescription(),
                    e.errorMessage());
            case ToolCallEvent e ->
                logger.info("Tool call: {} -> {}", e.toolName(),
                    e.result());
            default -> {} // other events
        }
    })
    .build()
    .run();

Events are typed. You pattern-match on them. No string parsing, no event name constants.

You can register multiple listeners, and they're exception-safe -- if one listener throws, the others still fire and the ensemble keeps running.
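That exception-safe dispatch loop is easy to picture. Here's a minimal standalone sketch of the pattern -- illustrative only, not AgentEnsemble's actual internals:

```java
import java.util.List;
import java.util.function.Consumer;

public class SafeDispatch {
    // Notify every listener; one throwing listener must not
    // prevent the others from firing or stop the run.
    public static <E> int dispatch(List<Consumer<E>> listeners, E event) {
        int delivered = 0;
        for (Consumer<E> listener : listeners) {
            try {
                listener.accept(event);
                delivered++;
            } catch (RuntimeException ex) {
                // Log and swallow so the ensemble keeps running
                System.err.println("listener failed: " + ex.getMessage());
            }
        }
        return delivered;
    }
}
```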

For more structured collection, implement the EnsembleListener interface:

public class MetricsCollector implements EnsembleListener {
    private final MeterRegistry meterRegistry;
    private final Map<String, Long> taskDurations = new ConcurrentHashMap<>();

    public MetricsCollector(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @Override
    public void onTaskComplete(TaskCompleteEvent event) {
        taskDurations.put(event.taskDescription(), event.durationMs());
    }

    @Override
    public void onToolCall(ToolCallEvent event) {
        meterRegistry.counter("agent.tool.calls",
            "tool", event.toolName()).increment();
    }
}

Micrometer Metrics

AgentEnsemble integrates with Micrometer out of the box. Add the metrics module:

implementation("net.agentensemble:agentensemble-metrics-micrometer:2.3.0")

Then wire it up:

MeterRegistry registry = new PrometheusMeterRegistry(
    PrometheusConfig.DEFAULT);

Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .meterRegistry(registry)
    .build()
    .run();

You get counters for token usage, task completions, task failures, and tool calls. Gauges for active tasks. Timers for task and ensemble duration. All tagged with agent role and task description for Grafana/Prometheus filtering.

Structured Traces

For post-mortem analysis, export full execution traces as JSON:

EnsembleOutput output = Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .traceExporter(TraceExporter.json(Path.of("traces/")))
    .build()
    .run();

// Or access the trace programmatically
ExecutionTrace trace = output.getTrace();
trace.getSpans().forEach(span ->
    System.out.printf("%s: %dms, %d tokens%n",
        span.getName(), span.getDurationMs(), span.getTokenCount()));

Each trace includes spans for every task, tool call, and LLM interaction. Duration, token counts, input/output payloads -- all structured, all queryable.

Cost Tracking

Know what you're spending:

EnsembleOutput output = Ensemble.builder()
    // ...
    .costConfiguration(CostConfiguration.builder()
        .inputTokenCostPer1k(0.01)
        .outputTokenCostPer1k(0.03)
        .build())
    .build()
    .run();

EnsembleMetrics metrics = output.getMetrics();
System.out.printf("Total cost: $%.4f%n", metrics.getTotalCost());
System.out.printf("Input tokens: %d, Output tokens: %d%n",
    metrics.getInputTokens(), metrics.getOutputTokens());

You configure per-model pricing, and the framework tracks token consumption and computes costs across all tasks in the run.
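The pricing arithmetic itself is simple -- tokens divided by 1,000, times the per-1k rate, summed over input and output. A standalone sketch:

```java
public class TokenCost {
    // Cost = (tokens / 1000) * price-per-1k, input and output summed.
    public static double cost(long inputTokens, long outputTokens,
                              double inPer1k, double outPer1k) {
        return (inputTokens / 1000.0) * inPer1k
             + (outputTokens / 1000.0) * outPer1k;
    }
}
```

With the example rates above ($0.01 in, $0.03 out), a run with 12,000 input and 4,000 output tokens costs $0.24.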

Error Handling: Fail Gracefully

Max Iterations

Agents can get stuck in loops -- calling the same tool repeatedly, or generating output that fails validation. Cap it:

Agent researcher = Agent.builder()
    .role("Researcher")
    .goal("Find information about {{topic}}")
    .maxIterations(10) // default is 25
    .build();

When the limit is hit, the agent returns its best output so far rather than running forever.
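The cap-and-return-best pattern, sketched outside the framework -- the step and goalReached hooks here are hypothetical stand-ins, purely for illustration:

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

public class IterationCap {
    // Run an agent step up to maxIterations times; if no step
    // satisfies the goal, return the last (best-so-far) output
    // instead of looping forever.
    public static String runCapped(Supplier<String> step,
                                   Predicate<String> goalReached,
                                   int maxIterations) {
        String best = "";
        for (int i = 0; i < maxIterations; i++) {
            best = step.get();
            if (goalReached.test(best)) {
                return best;
            }
        }
        return best; // cap hit: best effort, not an exception
    }
}
```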

Output Retries

When a task has outputType() set and the LLM returns invalid JSON, the framework retries automatically:

Task profileTask = Task.builder()
    .description("Create a competitor profile")
    .expectedOutput("Structured JSON profile")
    .agent(analyst)
    .outputType(CompetitorProfile.class)
    .maxOutputRetries(3) // retry up to 3 times on parse failure
    .build();
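Under the hood this is a generate-parse-retry loop. A generic standalone sketch of that mechanic (not the framework's actual implementation):

```java
import java.util.function.Function;
import java.util.function.Supplier;

public class OutputRetry {
    // Ask the model (generate) and parse its reply; on a parse
    // failure, ask again, up to maxRetries additional attempts.
    public static <T> T parseWithRetry(Supplier<String> generate,
                                       Function<String, T> parse,
                                       int maxRetries) {
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            String raw = generate.get();
            try {
                return parse.apply(raw);
            } catch (RuntimeException ex) {
                last = ex; // invalid output -- retry
            }
        }
        throw new IllegalStateException("still invalid after retries", last);
    }
}
```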

Parallel Error Strategy

In parallel workflows, one failing task shouldn't necessarily kill the entire ensemble:

Ensemble.builder()
    .agents(marketAnalyst, financialAnalyst, strategist)
    .tasks(marketTask, financialTask, summaryTask)
    .chatLanguageModel(model)
    .workflow(Workflow.parallel()
        .errorStrategy(ParallelErrorStrategy.CONTINUE_ON_ERROR)
        .build())
    .build()
    .run();

With CONTINUE_ON_ERROR, successful tasks still complete, and downstream tasks receive whatever results are available. The ensemble output tells you which tasks succeeded and which failed.
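The same continue-on-error behavior can be sketched with plain CompletableFutures -- failures are collected rather than propagated, so the successes survive:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

public class ContinueOnError {
    // Run tasks in parallel; failures are recorded, not fatal,
    // and downstream code receives whatever succeeded.
    public static Map<String, List<String>> runAll(List<Supplier<String>> tasks) {
        List<CompletableFuture<String>> futures = tasks.stream()
            .map(CompletableFuture::supplyAsync)
            .toList();
        List<String> ok = new ArrayList<>();
        List<String> failed = new ArrayList<>();
        for (CompletableFuture<String> f : futures) {
            try {
                ok.add(f.join());
            } catch (Exception ex) {
                // join wraps the task's exception in CompletionException
                Throwable cause = ex.getCause() != null ? ex.getCause() : ex;
                failed.add(cause.getMessage());
            }
        }
        return Map.of("succeeded", ok, "failed", failed);
    }
}
```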

Guardrails

Validate inputs and outputs at the framework level:

Agent agent = Agent.builder()
    .role("Content Writer")
    .goal("Write marketing content")
    .inputGuardrail(input -> {
        if (input.contains("competitor")) {
            return GuardrailResult.reject(
                "Input must not reference competitors");
        }
        return GuardrailResult.accept();
    })
    .outputGuardrail(output -> {
        if (output.length() > 5000) {
            return GuardrailResult.reject("Output exceeds length limit");
        }
        return GuardrailResult.accept();
    })
    .build();

Guardrails run before/after each agent iteration. Rejection stops the agent and surfaces a clear error, not a silent failure.

Cost Control: Don't Blow Your Budget

Rate Limiting

Protect against runaway API costs:

Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .rateLimit(RateLimit.builder()
        .maxRequestsPerMinute(60)
        .build())
    .build()
    .run();

The rate limiter applies across all agents in the ensemble. Requests that exceed the limit are queued, not dropped.
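Queue-don't-drop rate limiting reduces to enforcing a minimum interval between requests. A minimal blocking sketch of that idea (not AgentEnsemble's implementation):

```java
public class MinuteRateLimiter {
    private final long intervalMillis;
    private long nextFreeAt = 0;

    public MinuteRateLimiter(int maxRequestsPerMinute) {
        this.intervalMillis = 60_000L / maxRequestsPerMinute;
    }

    // Block (queue) the caller until a slot is free;
    // requests are delayed, never dropped.
    public synchronized void acquire() {
        long now = System.currentTimeMillis();
        long wait = nextFreeAt - now;
        if (wait > 0) {
            try {
                Thread.sleep(wait);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            now = System.currentTimeMillis();
        }
        nextFreeAt = Math.max(nextFreeAt, now) + intervalMillis;
    }
}
```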

Delegation Depth

In hierarchical workflows, prevent infinite delegation chains:

Ensemble.builder()
    // ...
    .workflow(Workflow.HIERARCHICAL)
    .maxDelegationDepth(3)
    .build()
    .run();

Human-in-the-Loop: Review Before You Ship

The most production-critical feature is often the simplest: letting a human review agent output before it's treated as final.

Scanner scanner = new Scanner(System.in);

Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .reviewHandler(taskOutput -> {
        System.out.println("Agent produced:\n" + taskOutput.getRaw());
        System.out.print("Approve? (y/n/edit): ");

        String response = scanner.nextLine();
        if (response.equals("y")) {
            return Review.approve();
        } else if (response.equals("n")) {
            return Review.reject("Quality below threshold");
        } else {
            System.out.print("Enter corrected output: ");
            return Review.edit(scanner.nextLine());
        }
    })
    .reviewPolicy(ReviewPolicy.REVIEW_ALL)
    .build()
    .run();

Review policies let you control which tasks trigger review:

  • REVIEW_ALL -- every task output goes through review.
  • REVIEW_FAILED -- only tasks that hit errors or retries.
  • FIRST_TASK_ONLY -- review the first output to calibrate, then let the rest run.

You can also add pre-flight validation with beforeReview():

Ensemble.builder()
    // ...
    .beforeReview(taskOutput -> {
        // Automated quality check before human review
        if (taskOutput.getRaw().length() < 100) {
            return Review.reject("Output too short, re-running");
        }
        return Review.skip(); // passes automated check, proceed to human
    })
    .build()
    .run();

Testing: Capture and Replay

Production confidence starts with testability. AgentEnsemble's capture mode records full execution data:

EnsembleOutput output = Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .captureMode(CaptureMode.FULL) // record everything
    .traceExporter(TraceExporter.json(Path.of("test-traces/")))
    .build()
    .run();

CaptureMode.FULL records:

  • Full LLM message history per iteration (system prompt, user messages, assistant responses)
  • Tool call inputs and outputs
  • Token counts per interaction
  • Memory operations
  • Timing data

This gives you deterministic test fixtures. Capture a run once, then write assertions against the trace without making live LLM calls.
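What might those assertions look like? A sketch using a simplified Span record -- the real trace schema is whatever TraceExporter.json emits, so treat these field names as hypothetical:

```java
import java.util.List;

public class TraceAssertions {
    // Simplified span shape for illustration only.
    public record Span(String name, long durationMs, long tokenCount) {}

    public static long totalTokens(List<Span> spans) {
        return spans.stream().mapToLong(Span::tokenCount).sum();
    }

    // Regression check: a captured run must stay within
    // a token budget and a wall-clock budget.
    public static boolean withinBudget(List<Span> spans,
                                       long tokenBudget, long msBudget) {
        return totalTokens(spans) <= tokenBudget
            && spans.stream().mapToLong(Span::durationMs).sum() <= msBudget;
    }
}
```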

The Production Checklist

Here's a quick reference for what to enable before shipping:

  • Logging -- SLF4J is used throughout; configure your logging framework.
  • Event callbacks -- .listener() for real-time monitoring.
  • Metrics -- .meterRegistry() with Micrometer.
  • Traces -- .traceExporter() for structured JSON.
  • Cost tracking -- .costConfiguration() with your model's pricing.
  • Rate limiting -- .rateLimit() to protect API budgets.
  • Error strategy -- ParallelErrorStrategy.CONTINUE_ON_ERROR for resilience.
  • Review gates -- .reviewHandler() + .reviewPolicy() for human oversight.
  • Guardrails -- .inputGuardrail() / .outputGuardrail() on agents.
  • Testing -- .captureMode(CaptureMode.FULL) for trace capture.

None of these require custom infrastructure. They're all builder methods on the same API you already use to define agents and tasks.

Production-grade agent orchestration isn't about adding a monitoring sidecar after the fact. It's about choosing a framework that was built for production from the start.

Get started: AgentEnsemble is MIT-licensed and available on GitHub.
