DEV Community

Lakshmi Sravya Vedantham


I Built a Flight Recorder for AI Agents — Now I Can Replay Every Decision They Made

By some estimates, 90% of AI agents fail in production. When they do, you get... nothing: no trace, no replay, no step-by-step view of what went wrong. Debugging an agent is like debugging a black box.

I built llm-lens to fix this.

What is llm-lens?

A single Rust binary that sits between your code and any LLM API, records every call, and lets you replay sessions step-by-step in your terminal.

Your code / agent framework
        |
   http://localhost:4001
        |
    ┌──────────┐
    │ llm-lens │  ← records everything, forwards unchanged
    └────┬─────┘
         |
    LLM API (OpenAI, Anthropic, etc.)

Zero code changes. Swap one environment variable:

export OPENAI_BASE_URL=http://localhost:4001/v1

Every LLM call now gets recorded. Your code works exactly the same.
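
On the client side, the redirect is invisible. As a minimal sketch (assuming the standard OpenAI Python SDK, which reads `OPENAI_BASE_URL` from the environment; the helper below just mirrors that lookup so the redirect is easy to verify):

```python
import os

# The OpenAI SDK resolves its base URL from OPENAI_BASE_URL when set;
# this helper reproduces that resolution logic for illustration.
def resolve_base_url(default: str = "https://api.openai.com/v1") -> str:
    return os.environ.get("OPENAI_BASE_URL", default)

os.environ["OPENAI_BASE_URL"] = "http://localhost:4001/v1"
print(resolve_base_url())  # → http://localhost:4001/v1 (the llm-lens proxy)
```

Unset the variable and traffic goes straight to the real endpoint again; there is no code edit in either direction.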

Quick Start

git clone https://github.com/LakshmiSravyaVedantham/llm-lens.git
cd llm-lens
cargo build --release
cp config.example.toml config.toml
./target/release/llm-lens start

That is it. Every LLM call through port 4001 is now recorded.

The Killer Feature: Session Replay

Run llm-lens replay --last to step through your most recent agent session:

┌─ Session a3f8 -- 7 calls -- 12.4s total ────────────┐
│                                                     │
│  Step 3/7  [gpt-4]  tokens: 340>120  latency: 1.2s  │
│                                                     │
│  --- REQUEST ---                                    │
│  system: You are a coding assistant                 │
│  user: Fix the bug in auth.py line 42               │
│                                                     │
│  --- RESPONSE ---                                   │
│  The issue is in the token validation logic.        │
│  Here is the fix: ...                               │
│                                                     │
│  h: prev | l: next | q: quit | e: export            │
└─────────────────────────────────────────────────────┘

Navigate with h/l (prev/next), q (quit), e (export). See exactly what the agent sent, what it got back, and where it went wrong.

What You Get

Feature             What it does
----------------------------------------------------------------------
Session recording   Groups related calls by header or time window
Full trace capture  Stores request, response, tokens, latency, model
TUI replay          Step through any session call-by-call
Failure detection   Auto-flags 5xx errors, empty responses, error fields
JSON export         llm-lens export <id> for programmatic analysis
Markdown export     llm-lens export <id> --md for sharing in PRs/docs
llmux chaining      Chain behind llmux for caching + tracing together

Session Grouping

llm-lens automatically groups related calls into sessions:

  1. Explicit: Set X-Session-Id header in your agent code
  2. Time-based: Calls within 60 seconds of each other = same session
  3. Fallback: Each call is its own session

Most agent frameworks make multiple LLM calls per task. llm-lens groups them so you can see the full chain of reasoning.
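
The 60-second rule is easy to picture. Here is a hypothetical re-implementation of the time-window grouping in Python — the real logic lives in the Rust binary, and the exact cutoff and tie-breaking are assumptions based on the description above:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=60)  # calls closer than this share a session

def group_by_window(timestamps: list[datetime]) -> list[list[datetime]]:
    """Split call timestamps into sessions: a gap of 60s or more
    between consecutive calls starts a new session."""
    sessions: list[list[datetime]] = []
    current: list[datetime] = []
    for ts in sorted(timestamps):
        if current and ts - current[-1] >= WINDOW:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

calls = [datetime(2026, 3, 6, 14, 23, s) for s in (1, 5, 30)]
calls.append(datetime(2026, 3, 6, 14, 25, 0))  # 90s gap: new session
print(len(group_by_window(calls)))  # → 2
```

If your agent sets X-Session-Id explicitly, this heuristic never has to guess.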

Browse All Sessions

$ llm-lens sessions

Session      Calls    Latency      Tokens     Errors   Last Activity
----------------------------------------------------------------------
a3f8         7        12400ms      2840>710   -        2026-03-06 14:23:01
b2c1         3        4200ms       890>120    YES      2026-03-06 14:15:42
c9d4         12       28100ms      5200>1800  -        2026-03-06 13:50:18

Spot the session with errors. Replay it. Find the exact call that failed.
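
If you would rather triage from a script, the listing can be scanned for the Errors column. The row format below is copied from the sample output above — it is not a stable machine interface, so prefer the JSON export for anything serious:

```python
# Rows copied verbatim from the sample `llm-lens sessions` output.
listing = """a3f8         7        12400ms      2840>710   -        2026-03-06 14:23:01
b2c1         3        4200ms       890>120    YES      2026-03-06 14:15:42
c9d4         12       28100ms      5200>1800  -        2026-03-06 13:50:18"""

# Field 5 is the Errors flag; collect the session ids marked YES.
failed = [row.split()[0] for row in listing.splitlines()
          if row.split()[4] == "YES"]
print(failed)  # → ['b2c1'], the session worth replaying first
```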

Export for Sharing

# JSON for programmatic analysis
llm-lens export a3f8 > session.json

# Markdown for PRs and docs
llm-lens export a3f8 --md > session.md

The Markdown export produces a clean document with every step, request, and response — ready to paste into a GitHub issue or PR.
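
Once exported, the JSON is easy to mine. The schema below is an assumption for illustration — check the actual output of `llm-lens export <id>` for the real field names:

```python
import json

# Hypothetical export shape: a session id plus one record per call.
raw = """{
  "session": "a3f8",
  "calls": [
    {"model": "gpt-4", "latency_ms": 1200, "tokens_in": 340, "tokens_out": 120},
    {"model": "gpt-4", "latency_ms": 900, "tokens_in": 280, "tokens_out": 95}
  ]
}"""

data = json.loads(raw)
slowest = max(data["calls"], key=lambda c: c["latency_ms"])
total = sum(c["tokens_in"] + c["tokens_out"] for c in data["calls"])
print(slowest["latency_ms"], total)  # → 1200 835
```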

Chain with llmux

llm-lens pairs with llmux (the LLM gateway I built last week):

Your code -> llm-lens:4001 -> llmux:4000 -> OpenAI/Anthropic
                 |                  |
            records traces    caches + failover

Caching, failover, cost tracking, AND full session recording. Two binaries, zero code changes.

Why Rust?

  • Sub-millisecond proxy overhead (your agent does not slow down)
  • Single binary, no runtime dependencies
  • Thread-safe concurrent request handling
  • SQLite storage — everything stays on your machine

What is Next

This is the second tool in a trilogy:

  1. llmux — LLM gateway with failover, caching, cost tracking
  2. llm-lens (this project) — session recording and trace replay
  3. llm-guard (coming next) — runtime safety monitor for AI agents

Each tool is standalone. Together they form a complete AI agent infrastructure stack.

Try It

git clone https://github.com/LakshmiSravyaVedantham/llm-lens.git
cd llm-lens && cargo build --release

Star it if useful: github.com/LakshmiSravyaVedantham/llm-lens


llm-lens is MIT licensed and open source. Built with Rust, axum, tokio, ratatui, and rusqlite.

Top comments (1)

Hamza KONTE

The "zero code changes, swap one env var" DX is exactly right — the best observability tools are transparent proxies. Seeing the full session grouped by time window is especially valuable because most agent failures aren't in a single call; they're in the chain of reasoning across multiple calls.

One thing that compounds the debugging problem: even if you can replay what the agent sent, the prompt itself is often the culprit and it's hard to audit. A monolithic string prompt gives you no visibility into which section (the role definition? the constraints? the output format?) drove a bad decision. Structured prompts help here — if you define your prompt as typed blocks that compile to XML, you can correlate a bad agent output back to a specific block.

I've been working on something adjacent — flompt (flompt.dev) is a free open-source visual prompt builder that decomposes prompts into typed blocks and compiles to XML for Claude-style structured prompting. The combination of a tool like llm-lens (to see what was sent) and a structured prompt builder (to control what gets sent) seems like a natural pair for agent debugging workflows.

The trilogy concept is smart — llmux + llm-lens + llm-guard as composable infrastructure. Looking forward to llm-guard.