DEV Community

Raju Shanigarapu

I Built a Debugger for LLM Agents — Here's Why "Observability" Wasn't Enough

Every time I changed a prompt, I was running a hypothesis test.

But I had no debugger. No way to pause execution. No structural comparison between "before" and "after." Just two terminal windows and a vague feeling that maybe it was better now.

I built agent-lens to fix this.


The Problem with "Observability"

Langfuse, LangSmith, Phoenix — these are great tools. They show you what happened. Traces, spans, token counts.

But none of them answer the question I actually had: did this change make it better?

That requires something different:

  • A way to compare two runs structurally
  • A record of why you made the change (the hypothesis)
  • A verdict — not just "here are the numbers," but "this was an improvement"

What agent-lens Does Differently

1. Pause a live agent mid-run

import agent_lens
from openai import OpenAI

agent_lens.install()          # auto-patches OpenAI + Anthropic
agent_lens.dashboard.start()  # localhost:7878

client = OpenAI()

@agent_lens.trace
def my_agent(query: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content

Open the dashboard, click Pause. The agent blocks at the next LLM call.
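One way to picture the pause is a gate that every LLM call has to pass through, which the dashboard can close and reopen. The sketch below uses a `threading.Event` for that gate. `PauseGate` and `fake_llm_call` are hypothetical names made up for illustration, and this is an assumption about the general technique, not necessarily how agent-lens implements it.

```python
import threading
from typing import Optional


class PauseGate:
    """A gate checked before each LLM call; a closed gate means 'paused'."""

    def __init__(self) -> None:
        self._open = threading.Event()
        self._open.set()  # start unpaused

    def pause(self) -> None:
        # Close the gate: the next call to wait_if_paused() will block.
        self._open.clear()

    def resume(self) -> None:
        # Open the gate: any blocked caller continues.
        self._open.set()

    def wait_if_paused(self, timeout: Optional[float] = None) -> bool:
        # Blocks while paused; returns True once the gate is open.
        return self._open.wait(timeout)


gate = PauseGate()


def fake_llm_call(prompt: str) -> str:
    """Stand-in for a patched client call (hypothetical, no network)."""
    gate.wait_if_paused()  # the agent blocks here while paused
    return f"echo: {prompt}"
```

Calling `gate.pause()` from another thread (for example, a dashboard handler) holds the agent at its next `fake_llm_call` until `gate.resume()` runs.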

2. State a hypothesis before you change anything

POST /runs/{run_id}/fork
{
  "span_id": "abc123",
  "edited_messages": [{"role": "system", "content": "Be concise."}],
  "notes": "Hypothesis: shorter system prompt reduces hallucination",
  "expected_output": "concise"
}

The note travels with the run forever. Future you can read your reasoning.
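From Python, the fork request could be assembled with the standard library alone. This is a minimal sketch: `BASE_URL` assumes the API is served on the same port as the dashboard, and `build_fork_request` is a hypothetical helper, not part of agent-lens.

```python
import json
import urllib.request

# Assumption: the HTTP API shares the dashboard's address.
BASE_URL = "http://localhost:7878"


def build_fork_request(run_id, span_id, edited_messages, notes, expected_output):
    """Build (but do not send) a POST /runs/{run_id}/fork request."""
    payload = {
        "span_id": span_id,
        "edited_messages": edited_messages,
        "notes": notes,  # the hypothesis, recorded with the run
        "expected_output": expected_output,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/runs/{run_id}/fork",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_fork_request(
    run_id="run-1",  # placeholder id for illustration
    span_id="abc123",
    edited_messages=[{"role": "system", "content": "Be concise."}],
    notes="Hypothesis: shorter system prompt reduces hallucination",
    expected_output="concise",
)
```

Passing `req` to `urllib.request.urlopen` would send it to a running dashboard.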

3. GET /diff — one call, one verdict

GET /runs/{run_a}/diff/{run_b}
{
  "metrics_delta": {
    "latency_ms":   {"a": 1847, "b": 820,  "pct_change": -55.6},
    "total_tokens": {"a": 453,  "b": 87,   "pct_change": -80.8},
    "cost_usd":     {"a": 0.0045, "b": 0.00087, "pct_change": -80.7}
  },
  "assertion_result": {
    "expected_output": "concise",
    "passed_in_a": false,
    "passed_in_b": true,
    "verdict": "improved"
  }
}

The fork passes the assertion and is faster and cheaper. Hypothesis supported, with numbers.
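The arithmetic behind a diff like this is simple enough to sketch. The functions below recompute `pct_change` relative to run A and derive a verdict from the two assertion results. The function names and the `"regressed"` and `"unchanged"` labels are assumptions for illustration; only `"improved"` appears in the response above.

```python
def pct_change(a: float, b: float) -> float:
    """Percentage change from a to b, rounded to one decimal (a must be non-zero)."""
    return round((b - a) / a * 100, 1)


def metrics_delta(metrics_a: dict, metrics_b: dict) -> dict:
    """Pair up each metric from run A and run B with its percentage change."""
    return {
        name: {
            "a": metrics_a[name],
            "b": metrics_b[name],
            "pct_change": pct_change(metrics_a[name], metrics_b[name]),
        }
        for name in metrics_a
    }


def verdict(passed_in_a: bool, passed_in_b: bool) -> str:
    """Turn two assertion results into a single label."""
    if passed_in_b and not passed_in_a:
        return "improved"
    if passed_in_a and not passed_in_b:
        return "regressed"  # assumed label
    return "unchanged"      # assumed label


run_a = {"latency_ms": 1847, "total_tokens": 453, "cost_usd": 0.0045}
run_b = {"latency_ms": 820, "total_tokens": 87, "cost_usd": 0.00087}
delta = metrics_delta(run_a, run_b)
```

With the numbers from the example response, `delta` reproduces the same `pct_change` values (-55.6, -80.8, -80.7).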


The Full Flow

[Agent running] → Pause → agent blocks at next LLM call
                              ↓
                    [Edit messages in dashboard]
                              ↓
                    Fork → new run diverges
                              ↓
                    Resume → original continues
                              ↓
              [Two runs. GET /diff. Get verdict.]

No restarts. No re-running preceding steps.
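To see why a fork avoids re-running earlier steps, it helps to model a run as an ordered list of spans. In the sketch below, forking copies the spans recorded before the fork point and swaps in the edited messages at that point, leaving the original run untouched. The span layout (`span_id`, `messages`, `output`) and `fork_run` are assumptions for illustration, not agent-lens's actual data model.

```python
from copy import deepcopy


def fork_run(spans: list, span_id: str, edited_messages: list) -> list:
    """Return a new run that diverges from `spans` at `span_id`."""
    for i, span in enumerate(spans):
        if span["span_id"] == span_id:
            # Reuse everything recorded before the fork point as-is.
            forked = deepcopy(spans[:i])
            # Replace the fork-point span; its output is not yet computed.
            forked.append({
                "span_id": span_id,
                "messages": deepcopy(edited_messages),
                "output": None,
            })
            return forked
    raise KeyError(f"span not found: {span_id}")


original = [
    {"span_id": "s1", "messages": [{"role": "user", "content": "plan"}], "output": "step list"},
    {"span_id": "abc123", "messages": [{"role": "system", "content": "Be thorough."}], "output": "long answer"},
]
forked = fork_run(original, "abc123", [{"role": "system", "content": "Be concise."}])
```

Only the fork-point span in `forked` still needs an LLM call; the first span keeps its recorded output.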


Zero Infrastructure

Everything runs locally. SQLite at ~/.agent-lens/runs.db. No Docker. No cloud. No API keys needed to start exploring:

pip install agentlens-tracer
python examples/07_demo_mock.py  # runs a full demo with no API key
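Because the store is a plain SQLite file, it can be inspected with Python's built-in `sqlite3` module. The sketch below only lists table names via `sqlite_master`, since the schema is not described here; `list_tables` is a hypothetical helper, and the database is opened read-only so nothing is modified.

```python
import sqlite3
from pathlib import Path

# Path taken from the article; the file exists only after agent-lens has run.
DB_PATH = Path.home() / ".agent-lens" / "runs.db"


def list_tables(db_path) -> list:
    """Return the names of all tables in a SQLite file, opened read-only."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

Once some runs have been recorded, `list_tables(DB_PATH)` shows what is available to query.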

Works with LangChain and LlamaIndex Too

from agent_lens.integrations.langchain import AgentLensCallbackHandler
from agent_lens.integrations.llamaindex import AgentLensLlamaIndexHandler

Pass either handler as a callback and every LLM call is traced automatically.


Why This Matters

You're not debugging a function. You're debugging a probabilistic system. Every prompt change is a hypothesis test.

Today you run that test by eyeballing outputs. agent-lens makes it structural, repeatable, and recorded.

Vibes-based prompt engineering is debugging without a debugger.
agent-lens is the debugger.


GitHub: https://github.com/RAJUSHANIGARAPU/agent-lens
Install: pip install agentlens-tracer

Would love to hear how you're currently debugging LLM agents — drop a comment below.
