DEV Community

bredmond1019

Originally published at learn-agentic-ai.com

Multi-Agent Observability: See Everything Your AI Agents Do

Once I had three agents running in parallel, I lost the thread. I couldn't tell which one was waiting on me, which had stalled on a bad tool call, or why the final output came back missing a piece.

The problem wasn't the agents. It was that I had no visibility into what any of them was actually doing. Each one was a black box unless I stopped everything and read its terminal.

Here's the setup I built to fix that: Claude Code hooks feeding a minimal event server, so you can see what every agent is doing in real time — across 3, 5, 10 instances at once.

The Problem: Too Many Agents, Too Little Visibility

When I'm running the SDLC harness with tasks in parallel, the setup looks something like this:

  • One agent implementing a new module
  • Another reviewing the previous task's output
  • A third running the validation gates
  • Two more doing research on different parts of the codebase

Without observability, you're flying blind. Which agent needs your input? What are they actually doing? When something goes wrong, how do you trace it back?

The Solution: Real-Time Multi-Agent Observability

Here's what we're building:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Claude Agent 1 │     │  Claude Agent 2 │     │  Claude Agent 3 │
│   (App: CRM)    │     │ (App: API Docs) │     │ (App: Testing)  │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         │    Claude Hooks       │    Claude Hooks       │
         └───────────┬───────────┴───────────┬──────────┘
                     │                       │
                     ▼                       ▼
              ┌─────────────────────────────────┐
              │        BUN SERVER               │
              │   • Store to SQLite             │
              │   • Broadcast via WebSocket     │
              └──────────────┬──────────────────┘
                             │
                             ▼
              ┌─────────────────────────────────┐
              │     REAL-TIME DASHBOARD         │
              │   • Live Activity Pulse         │
              │   • Event Stream                │
              │   • AI Summaries                │
              └─────────────────────────────────┘

Key Features:

  • Live Activity Pulse: Visual representation of all agent activities
  • Event Stream: Every tool call, hook, and decision
  • AI-Powered Summaries: Understand at a glance what each agent is doing
  • Session Tracking: Color-coded agents for easy identification

Building the Observability System

Step 1: Enhanced Hook Configuration

First, we upgrade our hooks to send comprehensive event data:

#!/usr/bin/env python3
# ~/.claude/hooks/send-event.py

import sys
import json
import os
from datetime import datetime, timezone

import requests

def summarize_with_ai(event_data, event_type):
    """Use a small, fast model to summarize tool events; return None otherwise"""
    if event_type not in ['pre-tool-use', 'post-tool-use']:
        return None

    try:
        # Use Haiku for ultra-fast summaries.
        # Imported lazily so the hook still runs when the SDK is not installed.
        import anthropic
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

        prompt = f"Summarize in 10 words what this {event_type} event does: {json.dumps(event_data)}"

        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=30,
            messages=[{"role": "user", "content": prompt}]
        )

        return response.content[0].text
    except Exception:
        # A failed summary must never break the hook
        return None

def send_event(app_name, event_type, summarize=True):
    """Read the hook payload from stdin and send it to the observability server"""
    try:
        event = json.loads(sys.stdin.read())
    except json.JSONDecodeError:
        event = {}  # no usable payload; still report that the hook fired

    # Add metadata
    event['app_name'] = app_name
    event['event_type'] = event_type
    # Timezone-aware UTC, so timestamps from different machines compare correctly
    event['timestamp'] = datetime.now(timezone.utc).isoformat()
    # Keep the session_id from the hook payload; fall back to the environment
    event['session_id'] = event.get('session_id') or os.environ.get('CLAUDE_SESSION_ID', 'unknown')

    # Add AI summary if requested
    if summarize:
        event['summary'] = summarize_with_ai(event, event_type)

    # Send to server
    try:
        requests.post('http://localhost:3000/events',
                      json=event,
                      timeout=1)
    except requests.RequestException:
        pass  # Don't block the agent when the server is down

    sys.exit(0)

# Usage: send-event.py <app-name> <event-type> [--summarize]
if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: send-event.py <app-name> <event-type> [--summarize]", file=sys.stderr)
        sys.exit(1)
    app_name = sys.argv[1]
    event_type = sys.argv[2]
    summarize = '--summarize' in sys.argv
    send_event(app_name, event_type, summarize)

Step 2: Configure Hooks for All Events

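Claude Code reads hooks from ~/.claude/settings.json, keyed by hook event name:

{
  "hooks": {
    "UserPromptSubmit": [
      { "hooks": [
        { "type": "command", "command": "~/.claude/hooks/send-event.py my-app pre-prompt" }
      ] }
    ],
    "PreToolUse": [
      { "matcher": "*", "hooks": [
        { "type": "command", "command": "~/.claude/hooks/security-check.py" },
        { "type": "command", "command": "~/.claude/hooks/send-event.py my-app pre-tool-use --summarize" }
      ] }
    ],
    "PostToolUse": [
      { "matcher": "*", "hooks": [
        { "type": "command", "command": "~/.claude/hooks/send-event.py my-app post-tool-use --summarize" }
      ] }
    ],
    "Notification": [
      { "hooks": [
        { "type": "command", "command": "~/.claude/hooks/send-event.py my-app notification" },
        { "type": "command", "command": "~/.claude/hooks/play-sound.py" }
      ] }
    ],
    "Stop": [
      { "hooks": [
        { "type": "command", "command": "~/.claude/hooks/send-event.py my-app stop" },
        { "type": "command", "command": "~/.claude/hooks/voice-summary.py" }
      ] }
    ],
    "PreCompact": [
      { "hooks": [
        { "type": "command", "command": "~/.claude/hooks/send-event.py my-app pre-compact" }
      ] }
    ]
  }
}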

Step 3: Build the Event Server

// server.ts
import { serve, type ServerWebSocket } from 'bun';
import { Database } from 'bun:sqlite';

const db = new Database('events.db');
db.run(`
  CREATE TABLE IF NOT EXISTS events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT,
    app_name TEXT,
    session_id TEXT,
    event_type TEXT,
    data TEXT
  )
`);

// Prepare the insert once and reuse it for every event
const insertEvent = db.prepare(
  'INSERT INTO events (timestamp, app_name, session_id, event_type, data) VALUES (?, ?, ?, ?, ?)'
);

// Bun hands server-side sockets to the handlers as ServerWebSocket
const clients = new Set<ServerWebSocket<unknown>>();

serve({
  port: 3000,

  async fetch(req, server) {
    const url = new URL(req.url);

    // Handle event ingestion
    if (url.pathname === '/events' && req.method === 'POST') {
      let event: any;
      try {
        event = await req.json();
      } catch {
        return new Response('Invalid JSON', { status: 400 });
      }

      // Store in database (the full payload goes into the data column)
      insertEvent.run(
        event.timestamp ?? null,
        event.app_name ?? null,
        event.session_id ?? null,
        event.event_type ?? null,
        JSON.stringify(event)
      );

      // Broadcast to all connected clients
      const message = JSON.stringify({
        type: 'event',
        data: event
      });

      for (const client of clients) {
        if (client.readyState === WebSocket.OPEN) {
          client.send(message);
        }
      }

      return new Response('OK');
    }

    // Upgrade to WebSocket for real-time updates
    if (url.pathname === '/ws') {
      if (server.upgrade(req)) {
        return; // Bun sends the upgrade response itself
      }
      return new Response('WebSocket upgrade failed', { status: 400 });
    }

    return new Response('Not found', { status: 404 });
  },

  websocket: {
    open(ws) {
      clients.add(ws);
      console.log('Client connected');
    },

    close(ws) {
      clients.delete(ws);
      console.log('Client disconnected');
    },

    message(ws, message) {
      // Handle client messages if needed
    }
  }
});

console.log('Observability server running on http://localhost:3000');

Step 4: Create the Real-Time Dashboard

// useWebSocketEvents.ts
import { ref, onMounted, onUnmounted } from 'vue';

export interface AgentEvent {
  timestamp: string;
  app_name: string;
  session_id: string;
  event_type: string;
  summary?: string | null;
  // The remaining fields come straight from the hook payload
  [key: string]: unknown;
}

export function useWebSocketEvents() {
  const events = ref<AgentEvent[]>([]);
  const isConnected = ref(false);
  let ws: WebSocket | null = null;
  let reconnectTimer: ReturnType<typeof setTimeout> | undefined;
  let stopped = false; // set on unmount so a closed socket does not reconnect

  const connect = () => {
    ws = new WebSocket('ws://localhost:3000/ws');

    ws.onopen = () => {
      isConnected.value = true;
      console.log('Connected to observability server');
    };

    ws.onmessage = (event) => {
      const message = JSON.parse(event.data);
      if (message.type === 'event') {
        // Add to events array, newest first (limit to last 1000)
        events.value = [message.data, ...events.value].slice(0, 1000);
      }
    };

    ws.onclose = () => {
      isConnected.value = false;
      // Reconnect after 1 second, unless the component has been unmounted
      if (!stopped) {
        reconnectTimer = setTimeout(connect, 1000);
      }
    };
  };

  onMounted(connect);

  onUnmounted(() => {
    stopped = true;
    clearTimeout(reconnectTimer);
    ws?.close();
  });

  return {
    events,
    isConnected
  };
}

Advanced Observability Features

1. Live Activity Pulse

Visualize agent activity over time with a pulse chart that shows activity intensity and which agents are most active.
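
The data behind a pulse chart is a count of events per time bucket per agent. Here is a minimal sketch of that aggregation, written in Python like the hook script. The function name `activity_pulse` and the 10-second default bucket are assumptions for illustration, and the events are assumed to carry the `timestamp` and `app_name` fields that the hook adds.

```python
from collections import Counter
from datetime import datetime


def activity_pulse(events, bucket_seconds=10):
    """Count events per (bucket start, app_name).

    A bucket is identified by the Unix time at which it starts.
    """
    pulse = Counter()
    for event in events:
        # Parse the ISO 8601 timestamp written by the hook
        ts = datetime.fromisoformat(event["timestamp"]).timestamp()
        bucket_start = int(ts // bucket_seconds) * bucket_seconds
        pulse[(bucket_start, event["app_name"])] += 1
    return pulse
```

Each key maps to one bar (or one dot size) in the chart, and the bucket width controls how smooth the pulse looks.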

2. Smart Event Filtering

Filter events by:

  • Application name
  • Event type (pre-tool-use, post-tool-use, etc.)
  • Session ID
  • Time range
  • Search query
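
These filters combine with AND: an event is shown only if it passes every filter that is set. The sketch below shows one way to express that, assuming events are dictionaries shaped like the ones the hook sends. The names `matches` and `filter_events` are hypothetical, `start` and `end` are `datetime` objects with the same timezone-awareness as the event timestamps, and the search query is a plain case-insensitive substring match over the serialized event.

```python
import json
from datetime import datetime


def matches(event, app_name=None, event_type=None, session_id=None,
            start=None, end=None, query=None):
    """Return True if the event passes every filter that was supplied."""
    if app_name is not None and event.get("app_name") != app_name:
        return False
    if event_type is not None and event.get("event_type") != event_type:
        return False
    if session_id is not None and event.get("session_id") != session_id:
        return False
    if start is not None or end is not None:
        ts = datetime.fromisoformat(event["timestamp"])
        if start is not None and ts < start:
            return False
        if end is not None and ts > end:
            return False
    # Search the whole event, so tool names and summaries are matched too
    if query is not None and query.lower() not in json.dumps(event).lower():
        return False
    return True


def filter_events(events, **filters):
    """Return the events that pass all supplied filters, in original order."""
    return [event for event in events if matches(event, **filters)]
```

Leaving a filter unset disables it, so `filter_events(events)` returns everything.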

3. Session-Based Color Coding

Each agent session gets a unique color based on its session ID, making it easy to track individual agents visually.
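
One simple way to derive the color is to hash the session ID into a hue. This is a sketch under assumptions: the function name `session_color` and the fixed saturation and lightness are illustrative choices. With only 360 hues, two sessions can occasionally land on the same color, so treat it as "stable and usually distinct" rather than strictly unique.

```python
import hashlib


def session_color(session_id):
    """Map a session ID to a stable CSS HSL color string."""
    # hashlib gives the same result on every run, unlike Python's built-in
    # hash(), which is randomized per process for strings
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    hue = int.from_bytes(digest[:2], "big") % 360
    return f"hsl({hue}, 70%, 50%)"
```

Because the color depends only on the ID, the same session keeps its color across page reloads and server restarts.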

Practical Patterns

Pattern 1: Agent Health Monitoring

Detect when agents get stuck or stop responding by tracking the time since their last event.
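
A rough sketch of that check: keep the newest event per session and flag sessions that have been silent for too long. The name `find_stalled_sessions` and the two-minute threshold are assumptions, and the code assumes timezone-aware ISO 8601 timestamps. Sessions whose last event is `stop` are treated as finished rather than stalled.

```python
from datetime import datetime, timedelta, timezone


def find_stalled_sessions(events, now=None, max_idle=timedelta(minutes=2)):
    """Return {session_id: idle time} for sessions silent longer than max_idle."""
    now = now or datetime.now(timezone.utc)
    latest = {}  # session_id -> (timestamp, event_type) of the newest event
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        sid = event["session_id"]
        if sid not in latest or ts > latest[sid][0]:
            latest[sid] = (ts, event.get("event_type"))

    stalled = {}
    for sid, (ts, event_type) in latest.items():
        idle = now - ts
        # A session that ended with "stop" is done, not stuck
        if event_type != "stop" and idle > max_idle:
            stalled[sid] = idle
    return stalled
```

Running this on a timer and surfacing the result in the dashboard is enough to answer "which agent is waiting on me?" without reading terminals.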

Pattern 2: Cross-Agent Coordination

Track when multiple agents are working on the same files to prevent conflicts.
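
As a sketch, conflict detection can be a lookup from file path to the set of sessions that wrote to it. The field names here are assumptions about the hook payload: the code expects a `tool_name` and a `tool_input` object with a `file_path`, and the set of write tools (`WRITE_TOOLS`) is a hypothetical list you would adjust to the tools your agents use.

```python
from collections import defaultdict

# Assumed names of tools that modify files; adjust to your setup
WRITE_TOOLS = {"Write", "Edit", "MultiEdit"}


def find_file_conflicts(events):
    """Return {file_path: session_ids} for files written by 2+ sessions."""
    writers = defaultdict(set)
    for event in events:
        if event.get("tool_name") not in WRITE_TOOLS:
            continue  # reads and other tools cannot cause write conflicts
        path = (event.get("tool_input") or {}).get("file_path")
        if path:
            writers[path].add(event["session_id"])
    return {path: sids for path, sids in writers.items() if len(sids) > 1}
```

Checking this on `pre-tool-use` events gives a warning before the second write lands rather than after.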

Pattern 3: Performance Analytics

Measure agent performance with metrics like:

  • Total events per session
  • Tools used
  • Average response time
  • Error rate
  • AI-generated summaries
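
A few of these metrics can be computed directly from the stored events. The sketch below covers event counts, tools used, and an approximate response time, taken as the gap between a `pre-tool-use` event and the next `post-tool-use` event in the same session. The function name `session_metrics` and the `tool_name` field are assumptions. Error rate is left out because it depends on how your payloads report failures.

```python
from collections import Counter, defaultdict
from datetime import datetime


def session_metrics(events):
    """Summarize each session: event count, tools used, mean tool duration."""
    by_session = defaultdict(list)
    for event in events:
        by_session[event["session_id"]].append(event)

    metrics = {}
    for sid, session_events in by_session.items():
        session_events.sort(key=lambda e: datetime.fromisoformat(e["timestamp"]))
        tools = Counter()
        durations = []
        pending_start = None  # time of the last unmatched pre-tool-use
        for event in session_events:
            ts = datetime.fromisoformat(event["timestamp"])
            if event["event_type"] == "pre-tool-use":
                tools[event.get("tool_name", "unknown")] += 1
                pending_start = ts
            elif event["event_type"] == "post-tool-use" and pending_start is not None:
                durations.append((ts - pending_start).total_seconds())
                pending_start = None
        metrics[sid] = {
            "total_events": len(session_events),
            "tools_used": dict(tools),
            "avg_tool_seconds": sum(durations) / len(durations) if durations else None,
        }
    return metrics
```

The pairing is deliberately simple and assumes one tool call at a time per session; overlapping tool calls would need a call identifier to match on.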

Pro Tip: Use a small, fast model for event summarization, such as the claude-haiku-4-5-20251001 model used in the hook above. Summaries come back quickly, but the hook calls the model synchronously, so pass --summarize only on the events where a summary is worth the added latency.

Scaling Considerations

As you scale from 3 agents to 30:

1. Event Sampling

For high-frequency events, sample rather than log everything:

if (Math.random() < 0.1) { // 10% sampling
  sendEvent(data);
}

2. Batch Processing

Send events in batches to reduce network overhead:

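const eventBatch = [];
const flushBatch = () => {
  if (eventBatch.length > 0) {
    // splice(0) empties the queue and hands sendBatch its own array,
    // so an async send never sees the queue cleared underneath it
    sendBatch(eventBatch.splice(0));
  }
};
setInterval(flushBatch, 1000); // Flush every second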

3. Data Retention

Implement automatic cleanup:

-- datetime() normalizes the stored ISO 8601 timestamps before comparing
DELETE FROM events WHERE datetime(timestamp) < datetime('now', '-7 days');

The Power of Visibility

With multi-agent observability in place, you can:

  1. Scale Confidently: Run 10+ agents without losing track
  2. Debug Quickly: Trace issues back to specific agents and actions
  3. Optimize Workflows: Identify bottlenecks and inefficiencies
  4. Prevent Conflicts: Detect when agents step on each other's toes
  5. Measure Impact: Quantify what your agents actually accomplish


Real Impact: Once observability is in place, you can confidently hand off work to multiple agents in parallel — you trust them because you can see everything they're doing, not because you're hoping for the best.

Getting Started

  1. Start Simple: Begin with basic event logging to a file
  2. Add Real-Time: Implement WebSocket broadcasting
  3. Build the Dashboard: Start with a simple event list, add visualizations
  4. Scale Gradually: Add more agents as your observability improves
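
For the first step, a hook that appends each event to a JSON Lines file is enough to get started. This is a minimal sketch, not the setup described above: the function name `log_event` and the log location `~/.claude/events.jsonl` are hypothetical choices.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical log location; one JSON object per line
LOG_PATH = Path.home() / ".claude" / "events.jsonl"


def log_event(event, event_type, log_path=LOG_PATH):
    """Append one event to a JSON Lines file and return the stored record."""
    record = dict(event)  # copy so the caller's payload is left untouched
    record["event_type"] = event_type
    record["timestamp"] = datetime.now(timezone.utc).isoformat()
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

In a hook script you would call it as `log_event(json.load(sys.stdin), sys.argv[1])`, then watch the file with `tail -f` while the agents run. Once that feels limiting, swap the file write for the HTTP post to the event server.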

Remember: If you don't measure it, you can't improve it. If you don't monitor it, how will you know what's actually happening?

The future of engineering is multi-agent systems. The key to multi-agent systems is observability. Build it once, scale it forever.

Your agents are working hard. It's time you could see everything they do.


If this was useful, I write about building production AI and agentic systems at learn-agentic-ai.com — including hands-on learning paths available in both English and Brazilian Portuguese. Come build something real.
