Every time I changed a prompt, I was running a hypothesis test.
But I had no debugger. No way to pause execution. No structural comparison between "before" and "after." Just two terminal windows and a vague feeling that maybe it was better now.
I built agent-lens to fix this.
The Problem with "Observability"
Langfuse, LangSmith, Phoenix — these are great tools. They show you what happened. Traces, spans, token counts.
But none of them answer the question I actually had: did this change make it better?
That requires something different:
- A way to compare two runs structurally
- A record of why you made the change (the hypothesis)
- A verdict — not just "here are the numbers," but "this was an improvement"
What agent-lens Does Differently
1. Pause a live agent mid-run
import agent_lens
from openai import OpenAI

agent_lens.install()          # auto-patches OpenAI + Anthropic
agent_lens.dashboard.start()  # localhost:7878

client = OpenAI()

@agent_lens.trace
def my_agent(query: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content
Open the dashboard, click Pause. The agent blocks at the next LLM call.
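If you're wondering how "blocks at the next LLM call" can work at all, here is a minimal sketch of the general idea using a `threading.Event` as a gate. This is not agent-lens's actual implementation; `PauseGate` and `gated_llm_call` are hypothetical names I made up for illustration.

```python
import threading
from typing import Any, Callable, Optional


class PauseGate:
    """Hypothetical gate: callers wait while paused and pass through otherwise."""

    def __init__(self) -> None:
        self._open = threading.Event()
        self._open.set()  # start unpaused

    def pause(self) -> None:
        self._open.clear()

    def resume(self) -> None:
        self._open.set()

    def wait_if_paused(self, timeout: Optional[float] = None) -> bool:
        # Returns True once the gate is open, False if the timeout expired first.
        return self._open.wait(timeout)


gate = PauseGate()


def gated_llm_call(call: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Wait at the gate, then run the wrapped LLM call (illustrative only)."""
    gate.wait_if_paused()
    return call(*args, **kwargs)
```

The useful property is that pausing never interrupts a call already in flight. It only holds the agent at the boundary before the next one, which is where you want to inspect or edit the messages.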
2. State a hypothesis before you change anything
POST /runs/{run_id}/fork
{
  "span_id": "abc123",
  "edited_messages": [{"role": "system", "content": "Be concise."}],
  "notes": "Hypothesis: shorter system prompt reduces hallucination",
  "expected_output": "concise"
}
The note travels with the run forever. Future you can read your reasoning.
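Here is a rough sketch of building that fork request from Python with only the standard library. Two assumptions: that the API is served from the same `localhost:7878` address as the dashboard, and that the body is sent as JSON. `build_fork_request` and the `run_42` id are hypothetical, and the response shape isn't shown because I'm not documenting it here.

```python
import json
import urllib.request

# Assumption: the HTTP API shares the dashboard's address.
BASE_URL = "http://localhost:7878"


def build_fork_request(run_id: str, span_id: str, edited_messages: list,
                       notes: str, expected_output: str) -> urllib.request.Request:
    """Build (but do not send) the POST /runs/{run_id}/fork request."""
    payload = {
        "span_id": span_id,
        "edited_messages": edited_messages,
        "notes": notes,
        "expected_output": expected_output,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/runs/{run_id}/fork",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_fork_request(
    run_id="run_42",  # hypothetical run id
    span_id="abc123",
    edited_messages=[{"role": "system", "content": "Be concise."}],
    notes="Hypothesis: shorter system prompt reduces hallucination",
    expected_output="concise",
)

# To send it, a running agent-lens server is required:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Building the request separately from sending it keeps the hypothesis text in your own script, so it can live in version control next to the prompt change it describes.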
3. GET /diff — one call, one verdict
GET /runs/{run_a}/diff/{run_b}
{
  "metrics_delta": {
    "latency_ms": {"a": 1847, "b": 820, "pct_change": -55.6},
    "total_tokens": {"a": 453, "b": 87, "pct_change": -80.8},
    "cost_usd": {"a": 0.0045, "b": 0.00087, "pct_change": -80.7}
  },
  "assertion_result": {
    "expected_output": "concise",
    "passed_in_a": false,
    "passed_in_b": true,
    "verdict": "improved"
  }
}
Hypothesis confirmed. With numbers.
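The arithmetic behind that response is simple enough to check by hand. Below is a minimal sketch that reproduces the `pct_change` values and the verdict from the example above. The rounding to one decimal place and the `"regressed"` / `"unchanged"` labels are my assumptions for illustration; only `"improved"` appears in the sample response.

```python
def pct_change(a: float, b: float) -> float:
    """Percentage change from run A to run B, rounded to one decimal place."""
    if a == 0:
        raise ValueError("percentage change is undefined when the baseline is 0")
    return round((b - a) / a * 100, 1)


def verdict(passed_in_a: bool, passed_in_b: bool) -> str:
    """Turn two assertion results into a single verdict string."""
    if passed_in_b and not passed_in_a:
        return "improved"
    if passed_in_a and not passed_in_b:
        return "regressed"  # assumed label, not shown in the sample response
    return "unchanged"      # assumed label, not shown in the sample response


metrics = {
    "latency_ms": (1847, 820),
    "total_tokens": (453, 87),
    "cost_usd": (0.0045, 0.00087),
}
deltas = {name: pct_change(a, b) for name, (a, b) in metrics.items()}
print(deltas)
print(verdict(passed_in_a=False, passed_in_b=True))
```

Note that the verdict comes from the assertion alone. The metric deltas are supporting evidence, not the pass/fail signal.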
The Full Flow
[Agent running] → Pause → agent blocks at next LLM call
↓
[Edit messages in dashboard]
↓
Fork → new run diverges
↓
Resume → original continues
↓
[Two runs. GET /diff. Get verdict.]
No restarts. No re-running preceding steps.
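To make "no re-running preceding steps" concrete, here is a toy model of what a fork could look like as a data structure: the new run reuses the recorded spans before the fork point and swaps in the edited messages at the fork point. The `Span`, `Run`, and `fork_run` names are hypothetical and say nothing about how agent-lens stores runs internally.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    span_id: str
    messages: list
    output: Optional[str] = None  # None means "not executed yet"


@dataclass
class Run:
    run_id: str
    spans: List[Span] = field(default_factory=list)
    parent_id: Optional[str] = None
    notes: str = ""


def fork_run(original: Run, span_id: str, edited_messages: list,
             notes: str, new_run_id: str) -> Run:
    """Create a new run that diverges from `original` at `span_id`."""
    for idx, span in enumerate(original.spans):
        if span.span_id == span_id:
            break
    else:
        raise KeyError(f"span {span_id!r} not found in run {original.run_id!r}")

    # Spans before the fork point are copied with their recorded outputs,
    # so nothing upstream has to be executed again.
    prefix = copy.deepcopy(original.spans[:idx])
    # The fork point gets the edited messages and no output yet.
    fork_point = Span(span_id=span_id, messages=copy.deepcopy(edited_messages))
    return Run(run_id=new_run_id, spans=prefix + [fork_point],
               parent_id=original.run_id, notes=notes)
```

Deep-copying the prefix keeps the original run untouched, which is what lets it resume independently while the fork diverges.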
Zero Infrastructure
Everything runs locally. SQLite at ~/.agent-lens/runs.db. No Docker. No cloud. No API keys needed to start exploring:
pip install agentlens-tracer
python examples/07_demo_mock.py # runs a full demo with no API key
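Because the store is a plain SQLite file, you can poke at it with nothing but the standard library. The sketch below lists the tables in the database without assuming anything about the schema, and opens the file read-only so inspection can't modify your runs. `list_tables` is a hypothetical helper, not part of agent-lens.

```python
import sqlite3
from pathlib import Path
from typing import List

# Location stated above; the file only exists after agent-lens has recorded a run.
DB_PATH = Path.home() / ".agent-lens" / "runs.db"


def list_tables(db_path) -> List[str]:
    """Return the table names in a SQLite file, opened in read-only mode."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


if DB_PATH.exists():
    print(list_tables(DB_PATH))
else:
    print(f"No database at {DB_PATH} yet")
```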
Works with LangChain and LlamaIndex Too
from agent_lens.integrations.langchain import AgentLensCallbackHandler
from agent_lens.integrations.llamaindex import AgentLensLlamaIndexHandler
Pass the matching handler as a callback to your framework, and every LLM call is traced automatically.
Why This Matters
You're not debugging a function. You're debugging a probabilistic system. Every prompt change is a hypothesis test.
Today you run that test by eyeballing outputs. agent-lens makes it structural, repeatable, and recorded.
Vibes-based prompt engineering is debugging without a debugger.
agent-lens is the debugger.
GitHub: https://github.com/RAJUSHANIGARAPU/agent-lens
Install: pip install agentlens-tracer
Would love to hear how you're currently debugging LLM agents — drop a comment below.