Once I had three agents running in parallel, I lost the thread. I couldn't tell which one was waiting on me, which had stalled on a bad tool call, or why the final output came back missing a piece.
The problem wasn't the agents — it was that I had no visibility into what any of them were actually doing. Each one was a black box unless I stopped everything and read its terminal.
Here's the setup I built to fix that: Claude Code hooks feeding a minimal event server, so you can see what every agent is doing in real time — across 3, 5, 10 instances at once.
The Problem: Too Many Agents, Too Little Visibility
When I'm running the SDLC harness with tasks in parallel, the setup looks something like this:
- One agent implementing a new module
- Another reviewing the previous task's output
- A third running the validation gates
- Two more doing research on different parts of the codebase
Without observability, you're flying blind. Which agent needs your input? What are they actually doing? When something goes wrong, how do you trace it back?
The Solution: Real-Time Multi-Agent Observability
Here's what we're building:
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Claude Agent 1  │     │ Claude Agent 2  │     │ Claude Agent 3  │
│ (App: CRM)      │     │ (App: API Docs) │     │ (App: Testing)  │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         │ Claude Hooks          │ Claude Hooks          │ Claude Hooks
         └───────────────────────┼───────────────────────┘
                                 │
                                 ▼
                ┌─────────────────────────────────┐
                │           BUN SERVER            │
                │ • Store to SQLite               │
                │ • Broadcast via WebSocket       │
                └────────────────┬────────────────┘
                                 │
                                 ▼
                ┌─────────────────────────────────┐
                │       REAL-TIME DASHBOARD       │
                │ • Live Activity Pulse           │
                │ • Event Stream                  │
                │ • AI Summaries                  │
                └─────────────────────────────────┘
Key Features:
- Live Activity Pulse: Visual representation of all agent activities
- Event Stream: Every tool call, hook, and decision
- AI-Powered Summaries: Understand at a glance what each agent is doing
- Session Tracking: Color-coded agents for easy identification
Building the Observability System
Step 1: Enhanced Hook Configuration
First, we upgrade our hooks to send comprehensive event data:
#!/usr/bin/env python3
# ~/.claude/hooks/send-event.py
import json
import os
import sys
from datetime import datetime, timezone

import requests


def summarize_with_ai(event_data, event_type):
    """Use a small, fast model to summarize the event."""
    if event_type not in ('pre-tool-use', 'post-tool-use'):
        return None
    try:
        # Imported lazily so the hook still runs when the SDK is not installed
        import anthropic
        client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment
        prompt = f"Summarize in 10 words what this {event_type} event does: {json.dumps(event_data)}"
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=30,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
    except Exception:
        return None  # A missing summary must never break the hook


def send_event(app_name, event_type, summarize=False):
    """Send event to observability server."""
    try:
        event = json.loads(sys.stdin.read())
    except json.JSONDecodeError:
        event = {}  # No payload on stdin: still report that the hook fired
    # Add metadata
    event['app_name'] = app_name
    event['event_type'] = event_type
    event['timestamp'] = datetime.now(timezone.utc).isoformat()
    # Keep a session_id already present in the payload; otherwise fall back to the env var
    event.setdefault('session_id', os.environ.get('CLAUDE_SESSION_ID', 'unknown'))
    # Add AI summary if requested
    if summarize:
        event['summary'] = summarize_with_ai(event, event_type)
    # Send to server
    try:
        requests.post('http://localhost:3000/events', json=event, timeout=1)
    except requests.RequestException:
        pass  # Don't block the agent when the server is down
    sys.exit(0)


# Usage: send-event.py <app-name> <event-type> [--summarize]
if __name__ == "__main__":
    if len(sys.argv) < 3:
        sys.exit("Usage: send-event.py <app-name> <event-type> [--summarize]")
    app_name = sys.argv[1]
    event_type = sys.argv[2]
    summarize = '--summarize' in sys.argv
    send_event(app_name, event_type, summarize)
Step 2: Configure Hooks for All Events
# ~/.claude/settings.yaml
# Note: current Claude Code releases read hooks from ~/.claude/settings.json and
# name the events UserPromptSubmit, PreToolUse, PostToolUse, Notification, Stop,
# and PreCompact. The YAML below shows the same mapping in compact form.
hooks:
  pre-prompt:
    - ~/.claude/hooks/send-event.py "my-app" "pre-prompt"
  pre-tool-use:
    - ~/.claude/hooks/security-check.py
    - ~/.claude/hooks/send-event.py "my-app" "pre-tool-use" --summarize
  post-tool-use:
    - ~/.claude/hooks/send-event.py "my-app" "post-tool-use" --summarize
  notification:
    - ~/.claude/hooks/send-event.py "my-app" "notification"
    - ~/.claude/hooks/play-sound.py
  stop:
    - ~/.claude/hooks/send-event.py "my-app" "stop"
    - ~/.claude/hooks/voice-summary.py
  pre-compact:
    - ~/.claude/hooks/send-event.py "my-app" "pre-compact"
Step 3: Build the Event Server
// server.ts
import { serve, type ServerWebSocket } from 'bun';
import { Database } from 'bun:sqlite';

const db = new Database('events.db');
db.run(`
  CREATE TABLE IF NOT EXISTS events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT,
    app_name TEXT,
    session_id TEXT,
    event_type TEXT,
    data TEXT
  )
`);

// Prepare the insert once and reuse it for every incoming event
const insertEvent = db.prepare(
  'INSERT INTO events (timestamp, app_name, session_id, event_type, data) VALUES (?, ?, ?, ?, ?)'
);

// Bun hands the server ServerWebSocket objects, not browser WebSocket instances
const clients = new Set<ServerWebSocket<unknown>>();

serve({
  port: 3000,
  async fetch(req, server) {
    const url = new URL(req.url);

    // Handle event ingestion
    if (url.pathname === '/events' && req.method === 'POST') {
      let event: Record<string, any>;
      try {
        event = await req.json();
      } catch {
        return new Response('Invalid JSON', { status: 400 });
      }

      // Store in database (missing fields are stored as NULL)
      insertEvent.run(
        event.timestamp ?? null,
        event.app_name ?? null,
        event.session_id ?? null,
        event.event_type ?? null,
        JSON.stringify(event)
      );

      // Broadcast to all connected clients
      const message = JSON.stringify({
        type: 'event',
        data: event
      });
      clients.forEach(client => {
        if (client.readyState === WebSocket.OPEN) {
          client.send(message);
        }
      });

      return new Response('OK');
    }

    // Upgrade to WebSocket for real-time updates
    if (url.pathname === '/ws') {
      if (server.upgrade(req)) {
        return; // Bun takes over the connection after a successful upgrade
      }
      return new Response('WebSocket upgrade failed', { status: 400 });
    }

    return new Response('Not found', { status: 404 });
  },
  websocket: {
    open(ws) {
      clients.add(ws);
      console.log('Client connected');
    },
    close(ws) {
      clients.delete(ws);
      console.log('Client disconnected');
    },
    message(ws, message) {
      // Handle client messages if needed
    }
  }
});

console.log('Observability server running on http://localhost:3000');
Step 4: Create the Real-Time Dashboard
// useWebSocketEvents.ts
import { ref, onMounted, onUnmounted } from 'vue';

export interface AgentEvent {
  timestamp: string;
  app_name: string;
  session_id: string;
  event_type: string;
  summary?: string;
  // The server broadcasts the hook payload flat, so its other fields arrive alongside these
  [key: string]: unknown;
}

export function useWebSocketEvents() {
  const events = ref<AgentEvent[]>([]);
  const isConnected = ref(false);
  let ws: WebSocket | null = null;
  let reconnectTimer: ReturnType<typeof setTimeout> | null = null;
  let stopped = false; // Set on unmount so we stop reconnecting

  const connect = () => {
    ws = new WebSocket('ws://localhost:3000/ws');

    ws.onopen = () => {
      isConnected.value = true;
      console.log('Connected to observability server');
    };

    ws.onmessage = (event) => {
      const message = JSON.parse(event.data);
      if (message.type === 'event') {
        // Add to events array (limit to last 1000)
        events.value = [message.data, ...events.value].slice(0, 1000);
      }
    };

    ws.onclose = () => {
      isConnected.value = false;
      // Reconnect after 1 second, unless the component has unmounted
      if (!stopped) {
        reconnectTimer = setTimeout(connect, 1000);
      }
    };
  };

  onMounted(connect);

  onUnmounted(() => {
    stopped = true;
    if (reconnectTimer) clearTimeout(reconnectTimer);
    ws?.close();
  });

  return {
    events,
    isConnected
  };
}
Advanced Observability Features
1. Live Activity Pulse
Visualize agent activity over time with a pulse chart that shows activity intensity and which agents are most active.
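A pulse chart boils down to counting events per time bucket per agent. Here is a minimal sketch of that aggregation, assuming each event carries the `timestamp` and `app_name` fields added by the hook. The function name `pulse_buckets` and the 10-second default are my own choices for illustration, and the same logic would port directly to the dashboard's TypeScript.

```python
from collections import Counter, defaultdict
from datetime import datetime


def pulse_buckets(events, bucket_seconds=10):
    """Count events per (time bucket, app_name).

    Returns {bucket_start_epoch: {app_name: count}}, ordered by time.
    """
    buckets = defaultdict(Counter)
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"]).timestamp()
        # Round down to the start of the bucket this event falls into
        bucket_start = int(ts // bucket_seconds) * bucket_seconds
        buckets[bucket_start][event["app_name"]] += 1
    return {start: dict(counts) for start, counts in sorted(buckets.items())}
```

Each bucket becomes one bar in the chart: the total height shows activity intensity, and the per-app counts show which agents are busiest.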
2. Smart Event Filtering
Filter events by:
- Application name
- Event type (pre-tool-use, post-tool-use, etc.)
- Session ID
- Time range
- Search query
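The filters above compose naturally as optional predicates. This is a rough sketch rather than the dashboard's actual code: `filter_events` is a hypothetical helper, it assumes the event fields set by the hook (`app_name`, `event_type`, `session_id`, `timestamp`), and it treats the search query as a case-insensitive substring match over the serialized event.

```python
import json
from datetime import datetime


def filter_events(events, app_name=None, event_type=None, session_id=None,
                  since=None, until=None, query=None):
    """Return the events that satisfy every filter that was provided."""
    def matches(event):
        if app_name is not None and event.get("app_name") != app_name:
            return False
        if event_type is not None and event.get("event_type") != event_type:
            return False
        if session_id is not None and event.get("session_id") != session_id:
            return False
        if since is not None or until is not None:
            # since/until must match the timestamps' timezone awareness
            ts = datetime.fromisoformat(event["timestamp"])
            if since is not None and ts < since:
                return False
            if until is not None and ts > until:
                return False
        # Free-text search over the whole serialized event
        if query and query.lower() not in json.dumps(event).lower():
            return False
        return True

    return [event for event in events if matches(event)]
```

Calling `filter_events(events, app_name="my-app", query="bash")`, for example, narrows the stream to one application's events that mention Bash anywhere in the payload.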
3. Session-Based Color Coding
Each agent session gets a unique color based on its session ID, making it easy to track individual agents visually.
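One simple way to get a stable color per session is to hash the session ID into a hue. The sketch below is one possible approach, not necessarily what any particular dashboard does; `session_color` and the fixed saturation and lightness values are assumptions for illustration.

```python
import hashlib


def session_color(session_id, saturation=70, lightness=50):
    """Map a session ID to a deterministic CSS hsl() color string."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Use the first two bytes of the hash to pick a hue on the color wheel
    hue = int.from_bytes(digest[:2], "big") % 360
    return f"hsl({hue}, {saturation}%, {lightness}%)"
```

Because the color depends only on the ID, the same agent keeps its color across page reloads and server restarts. Two sessions can still land on similar hues, so pairing the color with a short ID label helps.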
Practical Patterns
Pattern 1: Agent Health Monitoring
Detect when agents get stuck or stop responding by tracking the time since their last event.
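Here is a minimal sketch of that check, assuming events shaped like the ones the hook sends (`session_id`, `event_type`, ISO `timestamp`). The name `find_stalled_sessions` and the two-minute threshold are hypothetical, and a session whose latest event is `stop` is treated as finished rather than stuck.

```python
from datetime import datetime, timedelta


def find_stalled_sessions(events, now, max_idle=timedelta(minutes=2)):
    """Return session IDs whose latest event is older than max_idle.

    Sessions whose most recent event is 'stop' are considered done, not stalled.
    """
    latest = {}  # session_id -> (timestamp, event_type) of the newest event
    for event in events:
        sid = event["session_id"]
        ts = datetime.fromisoformat(event["timestamp"])
        if sid not in latest or ts >= latest[sid][0]:
            latest[sid] = (ts, event.get("event_type"))
    return sorted(
        sid for sid, (ts, event_type) in latest.items()
        if event_type != "stop" and now - ts > max_idle
    )
```

Run it on a timer against recent events and surface the result in the dashboard. Pass a `now` with the same timezone awareness as your stored timestamps.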
Pattern 2: Cross-Agent Coordination
Track when multiple agents are working on the same files to prevent conflicts.
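A rough sketch of conflict detection follows. It assumes the hook payload exposes the tool as `tool_name` and the target path as `tool_input.file_path`, and that file-writing tools are named `Edit`, `Write`, and `MultiEdit`. Treat those field and tool names as assumptions to verify against your own event data, and `find_file_conflicts` as a hypothetical helper.

```python
from collections import defaultdict


def find_file_conflicts(events, write_tools=("Edit", "Write", "MultiEdit")):
    """Return {file_path: [session_ids]} for files written by more than one session."""
    sessions_by_file = defaultdict(set)
    for event in events:
        if event.get("tool_name") not in write_tools:
            continue  # Only file-modifying tool calls can conflict
        path = (event.get("tool_input") or {}).get("file_path")
        if path:
            sessions_by_file[path].add(event["session_id"])
    return {
        path: sorted(sessions)
        for path, sessions in sessions_by_file.items()
        if len(sessions) > 1
    }
```

In practice you would feed it only a recent window of events, since two agents touching the same file an hour apart is rarely a problem.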
Pattern 3: Performance Analytics
Measure agent performance with metrics like:
- Total events per session
- Tools used
- Average response time
- Error rate
- AI-generated summaries
Pro Tip: Use small, fast models like claude-3-haiku-20240307 for event summarization. Summaries come back quickly, but the hook in Step 1 calls the model synchronously, so keep max_tokens small, and move summarization to the server if hook latency starts to slow your agents down.
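Some of these metrics can be computed straight from the `events` table the server creates. The sketch below covers total events, events by type, and tools used. It assumes the `data` column holds the JSON payload with a `tool_name` field on tool events, which is my assumption about the payload shape, and `session_metrics` is a hypothetical helper.

```python
import json
import sqlite3
from collections import Counter


def session_metrics(conn, session_id):
    """Summarize one session from the events table."""
    rows = conn.execute(
        "SELECT event_type, data FROM events WHERE session_id = ?",
        (session_id,),
    ).fetchall()
    by_type = Counter(event_type for event_type, _ in rows)
    tools = Counter()
    for event_type, data in rows:
        # Count each tool call once, using the pre-tool-use event
        if event_type == "pre-tool-use":
            tools[json.loads(data).get("tool_name", "unknown")] += 1
    return {
        "total_events": len(rows),
        "events_by_type": dict(by_type),
        "tools_used": dict(tools),
    }
```

Open the database with `sqlite3.connect("events.db")` and call it per session. Response time and error rate need a pairing of pre and post events plus a definition of "error" for your tools, so they are left out here.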
Scaling Considerations
As you scale from 3 agents to 30:
1. Event Sampling
For high-frequency events, sample rather than log everything:
if (Math.random() < 0.1) { // 10% sampling
  sendEvent(data);
}
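If you sample inside the Python hook instead, you probably want to keep the rare, important events and thin out only the noisy ones. A minimal sketch, assuming the event type names used in the hook configuration; `should_send` and the `ALWAYS_KEEP` set are hypothetical names, and which types count as "always keep" is a judgment call.

```python
import random

# Event types that are rare and important enough to never drop (an assumption)
ALWAYS_KEEP = {"notification", "stop"}


def should_send(event_type, sample_rate=0.1, rng=random.random):
    """Decide whether to forward an event, sampling only high-frequency types."""
    if event_type in ALWAYS_KEEP:
        return True
    # rng() returns a float in [0, 1); forward roughly sample_rate of events
    return rng() < sample_rate
```

The `rng` parameter exists only so the decision can be made deterministic in tests.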
2. Batch Processing
Send events in batches to reduce network overhead:
const eventBatch = [];

const flushBatch = () => {
  if (eventBatch.length > 0) {
    sendBatch(eventBatch);
    eventBatch.length = 0;
  }
};

setInterval(flushBatch, 1000); // Flush every second
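The same idea in Python, as a sketch: a small buffer that flushes when it fills up or when you call `flush()` on a timer. `EventBatcher` and its `send_batch` callback are hypothetical names. Since each hook invocation runs as its own short-lived process, a batcher like this would have to live in a long-running forwarder rather than in the hook script itself.

```python
class EventBatcher:
    """Buffer events and hand them to send_batch in groups."""

    def __init__(self, send_batch, max_size=50):
        self._send_batch = send_batch  # Callable that receives a list of events
        self._max_size = max_size
        self._events = []

    def add(self, event):
        """Queue an event, flushing automatically once the buffer is full."""
        self._events.append(event)
        if len(self._events) >= self._max_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered and return how many events were sent."""
        if not self._events:
            return 0
        # Swap the buffer out first so a failing send cannot resend old events
        batch, self._events = self._events, []
        self._send_batch(batch)
        return len(batch)
```

Call `flush()` from a periodic timer and once more on shutdown so the last partial batch is not lost.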
3. Data Retention
Implement automatic cleanup:
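-- datetime(timestamp) normalizes the ISO 8601 strings stored by the hook before comparing
DELETE FROM events WHERE datetime(timestamp) < datetime('now', '-7 days');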
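To run that cleanup on a schedule, a small Python function against the same SQLite file is enough. This is a sketch under the assumption that the table matches the server's `events` schema; `purge_old_events` is a hypothetical name, and it wraps the stored value in `datetime()` so ISO strings with a `T` separator compare correctly.

```python
import sqlite3


def purge_old_events(conn, days=7):
    """Delete events older than the given number of days; return rows removed."""
    cursor = conn.execute(
        "DELETE FROM events WHERE datetime(timestamp) < datetime('now', ?)",
        (f"-{int(days)} days",),
    )
    conn.commit()
    return cursor.rowcount
```

Something like `purge_old_events(sqlite3.connect("events.db"))` from a daily cron job would keep the database from growing without bound.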
The Power of Visibility
With multi-agent observability in place, you can:
- Scale Confidently: Run 10+ agents without losing track
- Debug Quickly: Trace issues back to specific agents and actions
- Optimize Workflows: Identify bottlenecks and inefficiencies
- Prevent Conflicts: Detect when agents step on each other's toes
- Measure Impact: Quantify what your agents actually accomplish
Real Impact: Once observability is in place, you can confidently hand off work to multiple agents in parallel — you trust them because you can see everything they're doing, not because you're hoping for the best.
Getting Started
- Start Simple: Begin with basic event logging to a file
- Add Real-Time: Implement WebSocket broadcasting
- Build the Dashboard: Start with a simple event list, add visualizations
- Scale Gradually: Add more agents as your observability improves
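For the "Start Simple" step, the hook can be as small as appending each payload to a JSON Lines file. A minimal sketch: `append_event`, the `logged_at` field, and the log path are all illustrative assumptions, not part of the setup above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_event(raw_json, log_path):
    """Append one hook payload to a JSON Lines file and return the stored event."""
    event = json.loads(raw_json)
    event["logged_at"] = datetime.now(timezone.utc).isoformat()
    path = Path(log_path).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)
    # One JSON object per line keeps the file easy to tail and grep
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return event
```

Inside a hook script you would call it as `append_event(sys.stdin.read(), "~/.claude/events.jsonl")` and then watch the file with `tail -f` until you are ready for the server and dashboard.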
Remember: If you don't measure it, you can't improve it. If you don't monitor it, how will you know what's actually happening?
The future of engineering is multi-agent systems. The key to multi-agent systems is observability. Build it once, scale it forever.
Your agents are working hard. It's time you could see everything they do.
If this was useful, I write about building production AI and agentic systems at learn-agentic-ai.com — including hands-on learning paths available in both English and Brazilian Portuguese. Come build something real.