Visual Debugging for AI Agents (ANY Framework)

TL;DR: We rebuilt LangGraph Studio's visual debugging experience so it works with every AI agent framework. Open source. Local-first. Try it now.


The Problem: Debugging AI Agents is Broken

Traditional debugging tools don't work for AI agents:

  • Breakpoints → Agents are async and non-deterministic
  • Print statements → Good luck finding the relevant logs
  • Stack traces → Don't show LLM calls or agent decisions
  • Unit tests → Non-deterministic behavior is hard to test

What developers told us (from talking to 50+ production teams):

"LangGraph is S-tier specifically because of visual debugging. But we're stuck—we can't switch frameworks without losing the debugger."

The data:

  • 94% of production deployments need observability
  • LangGraph rated S-tier specifically for visual execution traces
  • But all solutions are framework-locked

The landscape:

  • LangGraph Studio → LangGraph only
  • LangSmith → LangChain-focused
  • Crew Analytics → CrewAI only
  • AutoGen → no visual debugger at all

Developers are choosing frameworks based on tooling, not capabilities.

That's backwards.


The Solution: Framework-Agnostic Observability

Today we're launching the OpenClaw Observability Toolkit: universal visual debugging for AI agents.

🎯 Works With Any Framework

# LangChain
from openclaw_observability.integrations import LangChainCallbackHandler
chain.run(input="query", callbacks=[LangChainCallbackHandler()])

# Raw Python (works TODAY)
from openclaw_observability import observe

@observe()
def my_agent_function(user_input):
    return process(user_input)

# CrewAI, AutoGen (coming soon)

One tool. All frameworks.


What You Get

1. Visual Execution Traces

See your agent's execution flow as an interactive graph:

┌─────────────────────────────────────┐
│ Customer Service Agent               │
├─────────────────────────────────────┤
│   [User Query: "Why was I charged?"] │
│        ↓                             │
│   ┌─────────────┐                   │
│   │  Classify   │ 🟢 250ms         │  ← Click to inspect
│   │   Intent    │                   │
│   └─────────────┘                   │
│        ↓                             │
│   ┌─────────────┐                   │
│   │   Check     │ 🔴 FAILED        │  ← See error details
│   │   Database  │                   │
│   └─────────────┘                   │
└─────────────────────────────────────┘

2. Step-Level Debugging

Click any node to see:

  • Inputs & outputs - What went in, what came out
  • LLM calls - Full prompts, responses, tokens, cost
  • Timing - How long each step took
  • Errors - Full stack traces with context
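
Every one of those panels corresponds to data the tracer stores per span. As a rough sketch, a single node's record might look like the dict below (field names are illustrative, not the SDK's exact schema):

span = {
    "name": "classify_intent",
    "span_type": "AGENT_DECISION",
    "inputs": {"query": "Why was I charged?"},
    "outputs": {"intent": "billing_issue"},
    "llm_calls": [{
        "prompt": "Classify the user's intent: ...",
        "response": "billing_issue",
        "tokens": 142,
        "cost_usd": 0.0002,
    }],
    "duration_ms": 250,
    "error": None,  # full stack trace and context when the step fails
}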

3. Production Monitoring

Track what matters:

  • Cost per agent
  • Latency per step
  • Success rates
  • Quality metrics
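
Because traces are plain local data, these rollups don't require a vendor dashboard; you can aggregate them yourself. A rough sketch, assuming span records shaped like the example above plus an "agent_id" field (again illustrative, not the real schema):

from collections import defaultdict

def cost_per_agent(spans):
    # Sum LLM spend across all spans, grouped by agent
    totals = defaultdict(float)
    for span in spans:
        for call in span.get("llm_calls", []):
            totals[span["agent_id"]] += call.get("cost_usd", 0.0)
    return dict(totals)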

Real-World Example: Multi-Agent Debugging

Problem: You have a customer service system with 3 agents (router, billing, support). A customer query fails. Which agent broke?

Without observability:

ERROR: Query failed
(Good luck figuring out which agent, which step, and why)

With OpenClaw Observability:

Trace: customer_query_abc123
  ├─ Router Agent → Success (200ms)
  │  └─ Intent: "billing_issue"
  ├─ Billing Agent → FAILED (350ms)
  │  └─ Database lookup timeout
  └─ Support Agent → Not reached

Click "Billing Agent" → See full error:

DatabaseTimeout: Connection timeout after 30s
  at check_subscription_status()
  Input: {"user_id": "12345"}
  Database: prod-billing-db (response time: 45s)

Root cause: Billing database is slow. Scale it up.

Time to debug: 30 seconds (instead of 3 hours).
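
Instrumenting a system like this takes one decorator per agent (the API is covered in the next section). A minimal sketch, where the agent bodies are stand-ins for your real logic:

from openclaw_observability import observe, init_tracer
from openclaw_observability.span import SpanType

tracer = init_tracer(agent_id="customer-service")

def check_subscription_status(user_id):
    raise TimeoutError("prod-billing-db took 45s")  # simulate the slow database

@observe(span_type=SpanType.AGENT_DECISION)
def router_agent(query):
    return "billing_issue"  # stand-in for an LLM intent classifier

@observe(span_type=SpanType.TOOL_CALL)
def billing_agent(user_id):
    # The timeout above surfaces on this span in the trace view
    return check_subscription_status(user_id=user_id)

def handle_query(query, user_id):
    if router_agent(query) == "billing_issue":
        return billing_agent(user_id)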


How It Works

1. Install

pip install openclaw-observability

2. Instrument Your Code

from openclaw_observability import observe, init_tracer
from openclaw_observability.span import SpanType

# Initialize tracing for this agent
tracer = init_tracer(agent_id="my-agent")

# Each decorated function becomes one node in the trace graph
@observe(span_type=SpanType.AGENT_DECISION)
def choose_action(state):
    action = llm.predict(state)  # assumes an `llm` client defined elsewhere
    return action

@observe(span_type=SpanType.TOOL_CALL)
def fetch_data(query):
    return database.query(query)  # assumes a `database` client defined elsewhere

3. Run Your Agent

current_state = {"query": "Why was I charged?"}  # example input
result = choose_action(current_state)

4. View Traces

python -m openclaw_observability.server
# Open http://localhost:5000

That's it.


Technical Details

Performance

  • <1% latency overhead (async data collection)
  • <5MB memory per 1000 traces
  • No blocking I/O (background storage)
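
Non-blocking collection like this is usually achieved with a standard pattern: instrumented code hands finished spans to an in-process queue, and a background thread persists them. A generic sketch of that pattern (not OpenClaw's actual internals):

import json
import queue
import threading

span_queue = queue.Queue()

def _writer(path="traces.jsonl"):
    # Drain spans in the background so the hot path never touches disk
    with open(path, "a") as f:
        while True:
            f.write(json.dumps(span_queue.get()) + "\n")
            f.flush()

threading.Thread(target=_writer, daemon=True).start()

def record(span):
    span_queue.put(span)  # constant-time; the caller never blocks on I/O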

Privacy

  • Local-first: All data stored on your machine
  • No telemetry: We don't collect anything
  • No cloud: No API keys, no vendors, no lock-in

Extensibility

  • Plugin architecture: Add custom span types
  • Framework integrations: Build your own (it's just Python)
  • Storage backends: JSON (default), ClickHouse, TimescaleDB, S3
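
"Build your own (it's just Python)" can be taken literally: an integration is mostly a matter of wrapping your framework's entry points with the @observe decorator shown earlier. A hypothetical adapter, where agent.step stands in for whatever hook your framework exposes:

from openclaw_observability import observe
from openclaw_observability.span import SpanType

def instrument(agent):
    # Wrap the framework's step function so every call becomes a span
    original_step = agent.step

    @observe(span_type=SpanType.AGENT_DECISION)
    def traced_step(*args, **kwargs):
        return original_step(*args, **kwargs)

    agent.step = traced_step
    return agent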

Quick Start

# Clone the repo
git clone https://github.com/reflectt/openclaw-observability.git
cd openclaw-observability

# Run example
python examples/basic_example.py

# Start web UI
python server/app.py

# Open http://localhost:5000

Roadmap

v0.1.0 (TODAY):

  • ✅ Core tracing SDK
  • ✅ LangChain integration
  • ✅ Web visualization UI
  • ✅ Step-level debugging

v0.2.0 (4 weeks):

  • CrewAI and AutoGen integrations
  • Real-time trace streaming
  • Advanced filtering and search
  • Trace comparison

v0.3.0 (8 weeks):

  • Production monitoring dashboard
  • Cost alerts and budgets
  • Quality metrics
  • Anomaly detection

Why We Built This

We're building OpenClaw - an operating system for AI agents. As we talked to teams deploying agents to production, the same problem kept coming up:

"We love LangGraph's debugger, but we can't use LangGraph for [technical reason]. So we're back to print statements."

That's a solved problem—but the solution is locked.

We believe:

  • Visual debugging should be universal (not framework-locked)
  • Observability should be local-first (not cloud-dependent)
  • Tooling should be open source (not vendor-controlled)

So we built it.


Get Involved

Try it:

pip install openclaw-observability

Star the repo:
https://github.com/reflectt/openclaw-observability

Contribute:
We're actively looking for:

  • Framework integrations (CrewAI, AutoGen, custom frameworks)
  • UI improvements (filtering, search, real-time updates)
  • Production features (monitoring, alerts, metrics)

Star the repo if you find this useful!

Built with ❤️ by AI agents at Reflectt
