Most AI agents fail in production sooner or later. When they do, you get... nothing. No trace, no replay, no step-by-step view of what went wrong. Debugging an agent is like debugging a black box.
I built llm-lens to fix this.
What is llm-lens?
A single Rust binary that sits between your code and any LLM API, records every call, and lets you replay sessions step-by-step in your terminal.
Your code / agent framework
            |
   http://localhost:4001
            |
      ┌──────────┐
      │ llm-lens │ ← records everything, forwards unchanged
      └────┬─────┘
           |
LLM API (OpenAI, Anthropic, etc.)
Zero code changes. Swap one environment variable:
export OPENAI_BASE_URL=http://localhost:4001/v1
Every LLM call now gets recorded. Your code works exactly the same.
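As a quick sanity check, here is a minimal Python sketch of what that redirect means. It assumes the proxy is running on its default port 4001 and that your client library honors `OPENAI_BASE_URL` (the official OpenAI SDK does); the helper function is hypothetical, just to show the resulting request path:

```python
import os

# Point any OPENAI_BASE_URL-aware client at the llm-lens proxy instead of
# the real API. The proxy forwards requests unchanged, so request paths
# stay identical to the upstream API's.
os.environ["OPENAI_BASE_URL"] = "http://localhost:4001/v1"

def chat_completions_url() -> str:
    # Hypothetical helper: the URL a chat-completion request would now hit.
    return os.environ["OPENAI_BASE_URL"] + "/chat/completions"

print(chat_completions_url())
```

Because the redirect happens at the environment level, nothing in your agent code needs to know the proxy exists.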
Quick Start
git clone https://github.com/LakshmiSravyaVedantham/llm-lens.git
cd llm-lens
cargo build --release
cp config.example.toml config.toml
./target/release/llm-lens start
That is it. Every LLM call through port 4001 is now recorded.
The Killer Feature: Session Replay
Run llm-lens replay --last to step through your most recent agent session:
┌─ Session a3f8 -- 7 calls -- 12.4s total ─────────────┐
│                                                      │
│ Step 3/7 [gpt-4] tokens: 340>120 latency: 1.2s       │
│                                                      │
│ --- REQUEST ---                                      │
│ system: You are a coding assistant                   │
│ user: Fix the bug in auth.py line 42                 │
│                                                      │
│ --- RESPONSE ---                                     │
│ The issue is in the token validation logic.          │
│ Here is the fix: ...                                 │
│                                                      │
│ h: prev | l: next | q: quit | e: export              │
└──────────────────────────────────────────────────────┘
Navigate with h/l (prev/next), q (quit), e (export). See exactly what the agent sent, what it got back, and where it went wrong.
What You Get
| Feature | What it does |
|---|---|
| Session recording | Groups related calls by header or time window |
| Full trace capture | Stores request, response, tokens, latency, model |
| TUI replay | Step through any session call-by-call |
| Failure detection | Auto-flags 5xx errors, empty responses, error fields |
| JSON export | `llm-lens export <id>` for programmatic analysis |
| Markdown export | `llm-lens export <id> --md` for sharing in PRs/docs |
| llmux chaining | Chain behind llmux for caching + tracing together |
Session Grouping
llm-lens automatically groups related calls into sessions:
- Explicit: Set the X-Session-Id header in your agent code
- Time-based: Calls within 60 seconds of each other = same session
- Fallback: Each call is its own session
Most agent frameworks make multiple LLM calls per task. llm-lens groups them so you can see the full chain of reasoning.
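For the explicit route, tagging a call is just one extra header. Here is a hedged stdlib-only Python sketch; the endpoint path and bearer-token placeholder are assumptions about your local setup, while the X-Session-Id header name comes from the grouping rules above:

```python
import json
import urllib.request

def build_tagged_request(session_id: str, payload: dict) -> urllib.request.Request:
    # llm-lens groups calls that share the same X-Session-Id header into one
    # session, so every call within one agent task should reuse the same id.
    return urllib.request.Request(
        "http://localhost:4001/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer YOUR_API_KEY",  # placeholder; forwarded unchanged
            "X-Session-Id": session_id,
        },
        method="POST",
    )

req = build_tagged_request("task-42", {"model": "gpt-4", "messages": []})
```

Frameworks that let you set default headers on their HTTP client can apply the same tag once per task instead of per call.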
Browse All Sessions
$ llm-lens sessions
Session Calls Latency Tokens Errors Last Activity
----------------------------------------------------------------------
a3f8 7 12400ms 2840>710 - 2026-03-06 14:23:01
b2c1 3 4200ms 890>120 YES 2026-03-06 14:15:42
c9d4 12 28100ms 5200>1800 - 2026-03-06 13:50:18
Spot the session with errors. Replay it. Find the exact call that failed.
Export for Sharing
# JSON for programmatic analysis
llm-lens export a3f8 > session.json
# Markdown for PRs and docs
llm-lens export a3f8 --md > session.md
The Markdown export produces a clean document with every step, request, and response — ready to paste into a GitHub issue or PR.
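Once you have the JSON, failure triage can be scripted. Below is a sketch against a hypothetical export shape; the post does not show the real llm-lens schema, so the field names here are assumptions to adapt to the actual output:

```python
import json

# Hypothetical export shape -- inspect your actual `llm-lens export` output
# and adjust the field names accordingly.
sample = json.loads("""
{
  "session": "a3f8",
  "calls": [
    {"step": 1, "model": "gpt-4", "latency_ms": 1200, "error": null},
    {"step": 2, "model": "gpt-4", "latency_ms": 950, "error": "empty response"},
    {"step": 3, "model": "gpt-4", "latency_ms": 1100, "error": null}
  ]
}
""")

def failed_steps(export: dict) -> list:
    # Collect the step numbers of calls flagged with an error.
    return [call["step"] for call in export["calls"] if call.get("error")]

print(failed_steps(sample))
```

A script like this could run in CI to fail a build whenever an agent's recorded session contains a flagged call.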
Chain with llmux
llm-lens pairs with llmux (the LLM gateway I built last week):
Your code -> llm-lens:4001 -> llmux:4000 -> OpenAI/Anthropic
                   |               |
             records traces   caches + failover
Caching, failover, cost tracking, AND full session recording. Two binaries, zero code changes.
Why Rust?
- Sub-millisecond proxy overhead (your agent does not slow down)
- Single binary, no runtime dependencies
- Thread-safe concurrent request handling
- SQLite storage — everything stays on your machine
What is Next
This is the second tool in a trilogy:
- llmux — LLM gateway with failover, caching, cost tracking
- llm-lens (this project) — session recording and trace replay
- llm-guard (coming next) — runtime safety monitor for AI agents
Each tool is standalone. Together they form a complete AI agent infrastructure stack.
Try It
git clone https://github.com/LakshmiSravyaVedantham/llm-lens.git
cd llm-lens && cargo build --release
Star it if useful: github.com/LakshmiSravyaVedantham/llm-lens
llm-lens is MIT licensed and open source. Built with Rust, axum, tokio, ratatui, and rusqlite.
Top comments (1)
The "zero code changes, swap one env var" DX is exactly right — the best observability tools are transparent proxies. Seeing the full session grouped by time window is especially valuable because most agent failures aren't in a single call; they're in the chain of reasoning across multiple calls.
One thing that compounds the debugging problem: even if you can replay what the agent sent, the prompt itself is often the culprit and it's hard to audit. A monolithic string prompt gives you no visibility into which section (the role definition? the constraints? the output format?) drove a bad decision. Structured prompts help here — if you define your prompt as typed blocks that compile to XML, you can correlate a bad agent output back to a specific block.
I've been working on something adjacent — flompt (flompt.dev) is a free open-source visual prompt builder that decomposes prompts into typed blocks and compiles to XML for Claude-style structured prompting. The combination of a tool like llm-lens (to see what was sent) and a structured prompt builder (to control what gets sent) seems like a natural pair for agent debugging workflows.
The trilogy concept is smart — llmux + llm-lens + llm-guard as composable infrastructure. Looking forward to llm-guard.