Raju Shanigarapu

Posted on May 18

I Built a Debugger for LLM Agents — Here's Why "Observability" Wasn't Enough

#python #llm #opensource #devtool

Every time I changed a prompt, I was running a hypothesis test.

But I had no debugger. No way to pause execution. No structural comparison between "before" and "after." Just two terminal windows and a vague feeling that maybe it was better now.

I built agent-lens to fix this.

The Problem with "Observability"

Langfuse, LangSmith, Phoenix — these are great tools. They show you what happened. Traces, spans, token counts.

But none of them answer the question I actually had: did this change make it better?

That requires something different:

A way to compare two runs structurally
A record of why you made the change (the hypothesis)
A verdict — not just "here are the numbers," but "this was an improvement"

What agent-lens Does Differently

1. Pause a live agent mid-run

import agent_lens
from openai import OpenAI

agent_lens.install()          # auto-patches OpenAI + Anthropic
agent_lens.dashboard.start()  # localhost:7878

client = OpenAI()

@agent_lens.trace
def my_agent(query: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content

Open the dashboard, click Pause. The agent blocks at the next LLM call.
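To make "blocks at the next LLM call" concrete, here is a minimal sketch of how such a pause gate could work, using only the standard library. This is not agent-lens's actual implementation: `PauseGate`, `wait_if_paused`, and `fake_llm_call` are hypothetical names, and the LLM call is a stand-in that just echoes its prompt.

```python
import threading
import time


class PauseGate:
    """Hypothetical pause switch: callers wait here while the gate is paused."""

    def __init__(self) -> None:
        self._running = threading.Event()
        self._running.set()  # start in the "running" state

    def pause(self) -> None:
        self._running.clear()

    def resume(self) -> None:
        self._running.set()

    def wait_if_paused(self) -> None:
        # Returns immediately when running, blocks while paused.
        self._running.wait()


gate = PauseGate()
calls = []


def fake_llm_call(prompt: str) -> str:
    """Stand-in for a patched LLM client call (assumption, not a real API)."""
    gate.wait_if_paused()  # the agent thread blocks here while paused
    calls.append(prompt)
    return f"echo: {prompt}"


gate.pause()  # what clicking Pause might do
worker = threading.Thread(target=fake_llm_call, args=("hello",), daemon=True)
worker.start()
time.sleep(0.1)  # give the worker time to reach the gate

blocked_while_paused = worker.is_alive() and calls == []

gate.resume()  # what clicking Resume might do
worker.join(timeout=2)
```

The point of gating at the call boundary is that the agent's own code needs no changes: everything before the call has already run, and nothing after it starts until the gate opens.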

2. State a hypothesis before you change anything

POST /runs/{run_id}/fork
{
  "span_id": "abc123",
  "edited_messages": [{"role": "system", "content": "Be concise."}],
  "notes": "Hypothesis: shorter system prompt reduces hallucination",
  "expected_output": "concise"
}

The note travels with the run forever. Future you can read your reasoning.
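The same request can be built from Python with the standard library. This is a sketch under two assumptions that the post does not confirm: that the HTTP API is served on the dashboard port (localhost:7878), and that the body is sent as JSON. The `run_001` ID and the `build_fork_request` helper are hypothetical.

```python
import json
import urllib.request

# Assumption: the API shares the dashboard's host and port.
BASE_URL = "http://localhost:7878"


def build_fork_request(run_id, span_id, edited_messages, notes, expected_output):
    """Build (but do not send) the POST /runs/{run_id}/fork request."""
    payload = {
        "span_id": span_id,
        "edited_messages": edited_messages,
        "notes": notes,
        "expected_output": expected_output,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/runs/{run_id}/fork",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_fork_request(
    run_id="run_001",  # hypothetical run ID
    span_id="abc123",
    edited_messages=[{"role": "system", "content": "Be concise."}],
    notes="Hypothesis: shorter system prompt reduces hallucination",
    expected_output="concise",
)

# To actually send it (requires the dashboard to be running):
#     with urllib.request.urlopen(req) as resp:
#         print(resp.read().decode("utf-8"))
```

Writing the hypothesis into `notes` at fork time, rather than afterwards, is what keeps the reasoning attached to the run it produced.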

3. GET /diff — one call, one verdict

GET /runs/{run_a}/diff/{run_b}

{
  "metrics_delta": {
    "latency_ms":   {"a": 1847, "b": 820,  "pct_change": -55.6},
    "total_tokens": {"a": 453,  "b": 87,   "pct_change": -80.8},
    "cost_usd":     {"a": 0.0045, "b": 0.00087, "pct_change": -80.7}
  },
  "assertion_result": {
    "expected_output": "concise",
    "passed_in_a": false,
    "passed_in_b": true,
    "verdict": "improved"
  }
}

Hypothesis confirmed. With numbers.
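The numbers in that response can be reproduced by hand. The sketch below recomputes each `pct_change` from the `a` and `b` values and derives a verdict from the two assertion results. The formula and the "regressed" and "unchanged" labels are assumptions for illustration; only "improved" appears in the response above.

```python
def pct_change(a: float, b: float) -> float:
    """Percentage change from run A to run B, rounded to one decimal."""
    return round((b - a) / a * 100, 1)


def verdict(passed_in_a: bool, passed_in_b: bool) -> str:
    """Hypothetical verdict rule based on the assertion result in each run."""
    if passed_in_b and not passed_in_a:
        return "improved"
    if passed_in_a and not passed_in_b:
        return "regressed"  # assumed label
    return "unchanged"      # assumed label


metrics = {
    "latency_ms": (1847, 820),
    "total_tokens": (453, 87),
    "cost_usd": (0.0045, 0.00087),
}

deltas = {name: pct_change(a, b) for name, (a, b) in metrics.items()}
print(deltas)
# {'latency_ms': -55.6, 'total_tokens': -80.8, 'cost_usd': -80.7}

print(verdict(passed_in_a=False, passed_in_b=True))
# improved
```

Note that the verdict here depends only on the assertion, not on the metric deltas: a run can be cheaper and faster and still fail the expectation you wrote down.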

The Full Flow

[Agent running] → Pause → agent blocks at next LLM call
                              ↓
                    [Edit messages in dashboard]
                              ↓
                    Fork → new run diverges
                              ↓
                    Resume → original continues
                              ↓
              [Two runs. GET /diff. Get verdict.]

No restarts. No re-running preceding steps.

Zero Infrastructure

Everything runs locally. SQLite at ~/.agent-lens/runs.db. No Docker. No cloud. No API keys needed to start exploring:

pip install agentlens-tracer
# assumes you are in a checkout of the GitHub repo, where examples/ lives
python examples/07_demo_mock.py  # runs a full demo with no API key
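Because the store is a plain SQLite file, you can look inside it with Python's built-in `sqlite3` module. The post does not describe the schema, so this sketch assumes nothing about it: it opens the file read-only and lists whatever tables exist. `list_tables` is a hypothetical helper, not part of agent-lens.

```python
import sqlite3
from pathlib import Path

# Location given in the post.
DB_PATH = Path.home() / ".agent-lens" / "runs.db"


def list_tables(db_path: Path) -> list[str]:
    """Return the table names in a SQLite file, opened read-only."""
    # mode=ro prevents accidental writes to the run history.
    conn = sqlite3.connect(f"{db_path.as_uri()}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if DB_PATH.exists():
    print(list_tables(DB_PATH))
else:
    print(f"No database at {DB_PATH} yet; run an agent first.")
```

From there, `PRAGMA table_info(<table>)` shows the columns of any table the listing turns up.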

Works with LangChain and LlamaIndex Too

from agent_lens.integrations.langchain import AgentLensCallbackHandler
from agent_lens.integrations.llamaindex import AgentLensLlamaIndexHandler

Pass either handler as a callback, and every LLM call is traced automatically.

Why This Matters

You're not debugging a function. You're debugging a probabilistic system. Every prompt change is a hypothesis test.

Today you run that test by eyeballing outputs. agent-lens makes it structural, repeatable, and recorded.

Vibes-based prompt engineering is debugging without a debugger.
agent-lens is the debugger.

GitHub: https://github.com/RAJUSHANIGARAPU/agent-lens
Install: pip install agentlens-tracer

Would love to hear how you're currently debugging LLM agents — drop a comment below.

Top comments (1)

Harjot Singh • May 31

"Observability tells you WHAT, debugging tells you WHY" is the exact distinction the agent tooling space keeps blurring. A trace showing that the agent called tool A and then tool B is necessary, but it answers the wrong question. What you actually need to know at 2am is why it chose B over the obviously correct C, and that decision lived in the prompt and context state at that step, not in the call log.

So the primitive that makes an agent debugger real, like you said, is replay plus step-state inspection: reconstruct the exact context the model saw at step N (the full prompt, the retrieved chunks, the prior tool outputs), because the decision is a function of that input and you can't reason about it from the output alone. That's genuinely harder than a normal debugger, where you inspect variables; here the variable is the entire context window that produced a probabilistic choice.

Two things make replay trustworthy: pinning the nondeterminism (seed, temperature, model version) so the replay doesn't diverge from the original run, and capturing inputs at each step rather than trying to reconstruct them afterward. Seeing the context the model saw, not just the action it took, is the whole game. That reconstruct-the-decision-inputs instinct is core to how I think about agent debugging in Moonshift.

For replay, are you snapshotting the full per-step context or re-deriving it, and how are you handling model nondeterminism so the replay matches?