mgd43b for AgentEnsemble

Posted on May 21 • Originally published at agentensemble.net

Operating Agent Networks: Visual Topology, Drill-Down, and Runtime Visibility

#java #ai #agents #architecture

Building an agent network is one problem. Operating it is a different one. When you have ten ensembles communicating over WebSockets, sharing capabilities via discovery, routing requests across federation boundaries, and managing capacity with priority queues, you need to see what is happening.

The operational question is not "does the code work?" but "what is the system doing right now?" Which ensembles are healthy? Which are overloaded? What capabilities are available? Where are requests being routed? What changed in the last hour?

The visibility gap

Individual ensemble dashboards (the live execution view covered in an earlier post) show what one ensemble is doing: its current task, agent iterations, tool calls, and trace. But they do not show the network -- how ensembles relate to each other, where requests flow, and where bottlenecks form.

The gap is between per-ensemble observability and network-level observability. Both are needed. The per-ensemble view tells you why a specific task took 30 seconds. The network view tells you why the kitchen has a queue depth of 15 while the maintenance team is idle.

The network dashboard

AgentEnsemble's network dashboard provides a topology view of the entire ensemble network. Navigate to:

http://localhost:5173/network?ensembles=kitchen:ws://localhost:7329/ws,maintenance:ws://localhost:7330/ws

The ensembles query parameter accepts a comma-separated list of name:wsUrl pairs. Each ensemble gets its own independent WebSocket connection -- no aggregating proxy or central coordinator needed.
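To make the name:wsUrl format concrete, here is a minimal sketch of how such a parameter value could be parsed. It is written in Java for illustration only; EnsembleParamParser is a hypothetical helper, not part of AgentEnsemble, and the dashboard's actual parsing logic is not shown in this post. The one detail worth noting is that each pair must be split on the first colon only, because the URL itself contains colons.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not an AgentEnsemble API): parses the value of the
// `ensembles` query parameter into an ordered name -> WebSocket URI map.
final class EnsembleParamParser {

    private EnsembleParamParser() {}

    static Map<String, URI> parse(String param) {
        Map<String, URI> result = new LinkedHashMap<>();
        if (param == null || param.isBlank()) {
            return result;
        }
        for (String pair : param.split(",")) {
            String trimmed = pair.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a stray trailing comma
            }
            // Split on the first colon only: "ws://host:port/ws" has colons too.
            int sep = trimmed.indexOf(':');
            if (sep <= 0 || sep == trimmed.length() - 1) {
                throw new IllegalArgumentException("Expected name:wsUrl but got: " + trimmed);
            }
            String name = trimmed.substring(0, sep);
            URI url = URI.create(trimmed.substring(sep + 1));
            String scheme = url.getScheme();
            if (!"ws".equals(scheme) && !"wss".equals(scheme)) {
                throw new IllegalArgumentException("Not a WebSocket URL: " + url);
            }
            result.put(name, url);
        }
        return result;
    }
}
```

Each entry in the resulting map would then back one independent WebSocket connection.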

Topology graph

Ensembles are displayed as nodes in an interactive graph powered by React Flow. Each node shows:

Ensemble name
Lifecycle state (green = READY, yellow = STARTING, red = STOPPED)
Active task count and queue depth
Task progress bar

Connections between ensembles (shared tasks and tools) are displayed as animated edges. The topology is derived from capability discovery -- if the kitchen shares a prepare-meal task and room service uses it, the edge appears automatically.
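One way to picture the edge derivation is as a join between what each ensemble shares and what each ensemble uses. The sketch below assumes discovery data can be reduced to two maps keyed by ensemble name; TopologyEdges, Edge, and both map shapes are hypothetical names for illustration, not the dashboard's actual data model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: derives topology edges from capability discovery data.
final class TopologyEdges {

    // An edge points from the ensemble that uses a capability to the one sharing it.
    record Edge(String consumer, String provider, String capability) {}

    private TopologyEdges() {}

    // shared: ensemble name -> capabilities it shares (assumed shape)
    // used:   ensemble name -> capabilities it consumes (assumed shape)
    static List<Edge> derive(Map<String, Set<String>> shared, Map<String, Set<String>> used) {
        List<Edge> edges = new ArrayList<>();
        for (Map.Entry<String, Set<String>> consumer : used.entrySet()) {
            for (String capability : consumer.getValue()) {
                for (Map.Entry<String, Set<String>> provider : shared.entrySet()) {
                    boolean provides = provider.getValue().contains(capability);
                    boolean selfLoop = provider.getKey().equals(consumer.getKey());
                    if (provides && !selfLoop) {
                        edges.add(new Edge(consumer.getKey(), provider.getKey(), capability));
                    }
                }
            }
        }
        return edges;
    }
}
```

With the kitchen sharing prepare-meal and room service using it, this yields a single edge from room service to the kitchen, with no manual wiring.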

Ensemble detail sidebar

Click an ensemble node to open the sidebar panel showing:

Lifecycle state and connection status
Metrics for active tasks, queue depth, and completed tasks
Shared capabilities (tasks and tools with tags)
WebSocket URL

Drill-down to live execution

Click "Drill Down" in the sidebar to navigate to the live execution dashboard for that specific ensemble. This reuses the existing per-ensemble dashboard infrastructure -- the same trace view, agent iteration timeline, and tool call details.

The flow is: network topology (high-level) -> ensemble detail (mid-level) -> live execution trace (low-level). Each level answers different questions.

Dynamic ensemble addition

Click "Add Ensemble" in the header to connect to a new ensemble by entering its name and WebSocket URL. The dashboard is not static -- you can add ensembles as you discover them or as new ones come online.

Architecture

The network dashboard opens an independent WebSocket connection to each ensemble. There is no central aggregator. Each ensemble already exposes a WebSocket endpoint for the live dashboard; the network dashboard reuses those same endpoints.

Status polling (every 5 seconds) fetches /api/status from each ensemble's HTTP endpoint for queue depth and lifecycle state. The existing HelloMessage with snapshotTrace provides late-join support -- when a new connection opens, it receives the current state immediately rather than waiting for the next update.

This architecture means the dashboard works with any combination of ensembles, including ensembles in different namespaces or clusters (as long as the WebSocket URLs are reachable). It also means the dashboard has no state of its own -- refreshing the page reconnects to all ensembles and rebuilds the view.
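The status polling described above can be sketched as a small scheduled loop. This is an illustration under assumptions, not the dashboard's implementation: StatusPoller is a hypothetical class, the status URLs are supplied by the caller, the response body is kept as a raw string, and the "UNREACHABLE" marker is invented for the example. The fetcher is injectable so the loop can be exercised without a running ensemble.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch: polls each ensemble's status URL on a fixed interval.
final class StatusPoller implements AutoCloseable {
    private final Map<String, String> statusUrls;   // ensemble name -> status URL
    private final Function<String, String> fetcher; // URL -> raw response body
    private final Map<String, String> latest = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "status-poller");
                t.setDaemon(true); // do not keep the JVM alive just for polling
                return t;
            });

    StatusPoller(Map<String, String> statusUrls, Function<String, String> fetcher) {
        this.statusUrls = Map.copyOf(statusUrls);
        this.fetcher = fetcher;
    }

    // A fetcher backed by the JDK HTTP client, with a short per-request timeout.
    static Function<String, String> httpFetcher() {
        HttpClient client = HttpClient.newHttpClient();
        return url -> {
            try {
                HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                        .timeout(Duration.ofSeconds(2))
                        .GET()
                        .build();
                return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        };
    }

    void start(Duration interval) {
        scheduler.scheduleAtFixedRate(this::pollOnce, 0, interval.toMillis(), TimeUnit.MILLISECONDS);
    }

    // One polling round. Failures are recorded per ensemble and never rethrown,
    // because an exception escaping the task would cancel the schedule.
    void pollOnce() {
        statusUrls.forEach((name, url) -> {
            try {
                latest.put(name, fetcher.apply(url));
            } catch (RuntimeException e) {
                latest.put(name, "UNREACHABLE");
            }
        });
    }

    Map<String, String> latest() {
        return Map.copyOf(latest);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

A caller would construct it with StatusPoller.httpFetcher() and call start(Duration.ofSeconds(5)) to match the interval mentioned above. Because one failing ensemble only marks its own entry, the rest of the view keeps updating.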

Audit trail

Beyond the real-time dashboard, the audit trail provides a historical record of network events:

Work requests received, completed, and failed
Capacity changes (profile applications, scaling events)
Discovery events (capabilities registered and deregistered)
Federation events (cross-realm routing decisions)

The audit trail is append-only and backed by the same transport infrastructure as the rest of the network. In development, it is an in-memory log. In production, it can be backed by Kafka for durability and external consumption.

The audit trail answers questions that the real-time dashboard cannot: "When did the kitchen start receiving requests from the airport realm?" "How many requests failed between 2am and 4am?" "When was the weekend profile last applied?"
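To illustrate the development-mode shape of such a log, here is a minimal in-memory, append-only sketch that can answer the "how many requests failed between 2am and 4am" question. InMemoryAuditTrail, AuditEvent, and the Kind values are hypothetical names chosen for this example; the real audit trail API and event schema may differ.

```java
import java.time.Instant;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of an append-only, in-memory audit log.
final class InMemoryAuditTrail {

    // Assumed event kinds, loosely following the categories listed above.
    enum Kind { WORK_RECEIVED, WORK_COMPLETED, WORK_FAILED, CAPACITY_CHANGED, DISCOVERY, FEDERATION }

    record AuditEvent(Instant at, Kind kind, String ensemble, String detail) {}

    // Only append and read operations are exposed; there is no update or delete.
    private final List<AuditEvent> events = new CopyOnWriteArrayList<>();

    void append(AuditEvent event) {
        events.add(event);
    }

    // Counts events of one kind within [fromInclusive, toExclusive).
    long count(Kind kind, Instant fromInclusive, Instant toExclusive) {
        return events.stream()
                .filter(e -> e.kind() == kind)
                .filter(e -> !e.at().isBefore(fromInclusive) && e.at().isBefore(toExclusive))
                .count();
    }

    // Immutable copy for external consumers.
    List<AuditEvent> snapshot() {
        return List.copyOf(events);
    }
}
```

A durable backend such as Kafka would replace the list with a topic, but the append-and-query contract stays the same.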

Operational profiles in the dashboard

When an operational profile is applied, the dashboard receives a ProfileAppliedMessage and updates the topology to reflect the new capacity targets:

{
  "type": "profile_applied",
  "profileName": "sporting-event-weekend",
  "capacities": {
    "front-desk": { "replicas": 4, "maxConcurrent": 50, "dormant": false },
    "kitchen": { "replicas": 3, "maxConcurrent": 100, "dormant": false }
  },
  "appliedAt": "2026-03-28T14:30:00Z"
}

The dashboard can display the active profile, show which ensembles have adjusted their capacity, and highlight ensembles that have not yet reached their target capacity.
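As a rough sketch of the "not yet at target" check, the message above can be modeled with two records and compared against observed replica counts. The record names mirror the JSON fields, but ProfileProgress, belowTarget, and the observedReplicas map are assumptions for illustration; the post does not say where the dashboard gets observed replica counts.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: finds ensembles that have not reached their target capacity.
final class ProfileProgress {

    // Field names follow the profile_applied JSON shown above.
    record Capacity(int replicas, int maxConcurrent, boolean dormant) {}

    record ProfileApplied(String profileName, Map<String, Capacity> capacities, Instant appliedAt) {}

    private ProfileProgress() {}

    // observedReplicas: ensemble name -> replicas currently running (assumed input).
    // Dormant ensembles are skipped; a missing observation counts as zero replicas.
    static List<String> belowTarget(ProfileApplied profile, Map<String, Integer> observedReplicas) {
        return profile.capacities().entrySet().stream()
                .filter(e -> !e.getValue().dormant())
                .filter(e -> observedReplicas.getOrDefault(e.getKey(), 0) < e.getValue().replicas())
                .map(Map.Entry::getKey)
                .sorted()
                .toList();
    }
}
```

The returned names are the nodes a topology view would highlight until their replica counts catch up.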

What the dashboard does not do

The dashboard is read-only. It does not send commands to ensembles, apply profiles, or adjust capacity. It observes and displays.

This is deliberate. Operational actions (scaling, profile application, ensemble restarts) should go through the directive system, the profile scheduler, or your deployment pipeline. The dashboard provides the visibility to make those decisions, not the mechanism to execute them.

The exception is the "Add Ensemble" feature, which adds a display connection -- it does not modify the ensemble's configuration or behavior.

Tradeoffs

No central state. The dashboard has no persistent state. If you close it and reopen, it reconnects and rebuilds. This simplifies the architecture but means there is no historical view in the dashboard itself -- that is what the audit trail is for.

WebSocket reachability. The dashboard needs WebSocket access to every ensemble. In a Kubernetes deployment, this may require ingress configuration or port-forwarding. In development, ensembles run on localhost and are directly reachable.

Polling interval. Status is polled every 5 seconds. Events that happen between polls (a brief spike in queue depth, a transient connection failure) may not be visible. For sub-second operational visibility, you would need to supplement with metrics (Micrometer, OpenTelemetry) exported to a time-series database.

No alerting. The dashboard shows current state only. It does not trigger alerts when thresholds are crossed. Alerting should be handled by your monitoring stack (Grafana, Prometheus, PagerDuty) using the metrics that ensembles already export.

The design principle

The useful insight is that operating an agent network requires three levels of visibility:

Network level -- topology, connections, capacity distribution, routing patterns
Ensemble level -- queue depth, active tasks, shared capabilities, health
Execution level -- individual task traces, agent iterations, tool calls

Each level answers different questions. The network dashboard covers the network and ensemble levels, with drill-down to the execution level via the existing live execution dashboard. The audit trail provides the historical dimension that complements the real-time view.

This layered observability is not unique to agent systems -- it mirrors the service mesh / individual service / request trace pattern in microservices. What is specific to agent systems is the non-determinism: you cannot predict how long a task will take, how many iterations an agent will need, or whether a request will be delegated to another ensemble. The dashboard helps operators reason about a system that is inherently unpredictable.

The network dashboard is part of AgentEnsemble. The network dashboard guide covers setup, and the audit trail guide covers the historical event log.

I'd be interested in what operational tools others are using for multi-agent systems -- and whether the topology-first approach matches how you think about agent network operations.
