Rishabh Jain
I use Langfuse for tracing. Here's why I added Rewind for debugging.

Last week my research agent failed at step 15 of a 30-step run. Langfuse showed me exactly where it broke. The writer sub-agent hallucinated, citing a stale 2019 population figure as current fact. Clean trace, obvious failure.

Now what?

I changed the system prompt. Re-ran the agent. $1.20 in tokens. 3 minutes of wall time. Different answer, still wrong, different hallucination. Re-ran again. $1.20 more. Another answer. By the fifth attempt I'd spent $6 and 15 minutes, and I still wasn't sure the fix was right because every run gave a different output.

Langfuse is great at showing you what happened. It can't let you change what happened and observe a different outcome.

So I built a tool that does.

The gap between tracing and fixing

Most LLM observability tools (Langfuse, LangSmith, Helicone) solve the same problem: "What did my agent do?" They capture traces, show you token counts, latencies, and the content of each step. That's valuable.

But when something breaks at step 15 of a 30-step agent, you're stuck:

  • You can't isolate the failure. To test a fix, you re-run all 30 steps. Steps 1-14 were fine. You're paying for them again.
  • You can't reproduce it. LLMs are non-deterministic. Re-run the same agent and you get a different result. The bug might not even appear.
  • You can't prove your fix works. You changed the prompt. Did it actually fix the hallucination, or just shift the problem to a different step?

I needed something that lets me fork at step 14, replay only the broken part, and prove the fix works. So I built Rewind, an open-source time-travel debugger for AI agents.

How I debug Langfuse traces now

1. Import the trace

I see the broken trace in Langfuse's UI. Copy the trace ID:

pip install rewind-agent

export LANGFUSE_PUBLIC_KEY=pk-lf-...
export LANGFUSE_SECRET_KEY=sk-lf-...
rewind import from-langfuse --trace abc123

Rewind calls the Langfuse REST API, fetches the trace, and gives me a browsable session with the full span tree: agent boundaries, tool calls, handoffs, token counts.
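Under the hood, the import step boils down to one authenticated request against Langfuse's public REST API (`GET /api/public/traces/{id}`, Basic auth with the key pair). A minimal stdlib-only sketch of that call - my illustration of the API interaction, not Rewind's actual code:

```python
# Hedged sketch of what `rewind import from-langfuse` amounts to:
# fetch one trace from the Langfuse public REST API, authenticating
# with HTTP Basic auth (public key as username, secret key as password).
import base64
import json
import os
import urllib.request

def basic_auth_header(public_key: str, secret_key: str) -> str:
    # Langfuse's public API expects Basic auth built from the key pair.
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return f"Basic {token}"

def fetch_trace(trace_id: str, host: str = "https://cloud.langfuse.com") -> dict:
    req = urllib.request.Request(
        f"{host}/api/public/traces/{trace_id}",
        headers={
            "Authorization": basic_auth_header(
                os.environ["LANGFUSE_PUBLIC_KEY"],
                os.environ["LANGFUSE_SECRET_KEY"],
            )
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (requires LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY in the env):
#   trace = fetch_trace("abc123")   # same trace ID as the CLI example
```

The response carries the span tree and observations, which is everything needed to rebuild the session locally.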

Same data as Langfuse, but now it's in a system that can act on it. Everything stays on your machine - Rewind is a single binary that stores traces locally in SQLite. No cloud account, no data leaving your environment.

2. Fork at the failure, replay with the fix

I fix my code (add a date cross-referencing instruction to the system prompt), then:

rewind replay latest --from 14

Steps 1-14 are served from cache. Zero tokens, zero API calls, no side effects retriggered - cached steps return stored responses without hitting upstream. Only step 15 onward re-runs live. If the fix isn't right, I replay again. Each time I only pay for the steps after the fork point.
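The fork-and-replay idea fits in a few lines. This toy sketch (my illustration, not Rewind's internals) shows why only the steps after the fork point cost anything:

```python
# Toy fork-and-replay: steps up to the fork point are served from the
# recorded trace; steps after it re-execute against the live model.
from typing import Callable

def replay(trace: list[str], steps: list[Callable[[], str]],
           cached_through: int) -> tuple[list[str], int]:
    """Serve steps 1..cached_through from the recorded trace, run the
    rest live. Returns the new timeline and the number of paid calls."""
    timeline, live_calls = [], 0
    for i, step in enumerate(steps, start=1):
        if i <= cached_through:
            timeline.append(trace[i - 1])  # cached: zero tokens, no side effects
        else:
            timeline.append(step())        # live: re-runs with the fix applied
            live_calls += 1
    return timeline, live_calls

# A 30-step run forked after step 14: only 16 steps re-execute.
cached = [f"step-{i} (recorded)" for i in range(1, 31)]
steps = [lambda i=i: f"step-{i} (live)" for i in range(1, 31)]
timeline, paid = replay(cached, steps, cached_through=14)
print(paid)  # 16
```

The cached branch is also why side effects don't retrigger: nothing upstream is called for those steps.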

3. Prove the fix with LLM-as-judge

Instead of eyeballing the output, score both timelines with LLM-as-judge:

# One-time setup: create an LLM-as-judge evaluator (requires OPENAI_API_KEY)
rewind eval evaluator create correctness -t llm_judge -c '{"criteria": "correctness"}'

# Score both timelines
rewind eval score latest -e correctness --compare-timelines

Original timeline: 0.2 on correctness. Fixed timeline: 0.95. Not me guessing. An evaluator comparing the output against expected results.
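Conceptually, an LLM-as-judge evaluator builds a grading prompt, asks a model for a structured score, and parses the result. A hedged sketch - the prompt wording and JSON shape here are my assumptions, not Rewind's actual evaluator:

```python
# Minimal LLM-as-judge shape: grading prompt in, clamped 0-1 score out.
import json

def build_judge_prompt(criteria: str, output: str, reference: str) -> str:
    return (
        f"You are an evaluator. Criteria: {criteria}.\n"
        f"Reference answer:\n{reference}\n\n"
        f"Candidate answer:\n{output}\n\n"
        'Reply with JSON only: {"score": <float 0-1>, "reason": "<one sentence>"}'
    )

def parse_score(raw: str) -> float:
    score = float(json.loads(raw)["score"])
    return min(max(score, 0.0), 1.0)  # clamp defensively: judges drift

# The model call itself is left abstract; with the OpenAI SDK it would
# look roughly like:
#   resp = client.chat.completions.create(
#       model="gpt-4o", messages=[{"role": "user", "content": prompt}])
print(parse_score('{"score": 0.95, "reason": "dates cross-checked"}'))  # 0.95
```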

4. Share the proof

rewind share latest --include-content -o debug-session.html

Self-contained HTML file. Open in any browser, no install needed. The full trace, both timelines, the diff, the scores. Drop it in Slack. Anyone can see what broke and the proof that it's fixed.

5. Export back to Langfuse

rewind export otel latest \
  --endpoint https://cloud.langfuse.com/api/public/otel \
  --header "Authorization=Basic $(echo -n $LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY | base64)"

The debugged session goes back to Langfuse for the team dashboard.

The cost difference

          Before (re-run)       After (Rewind)
Attempts  5 full re-runs        2 targeted replays
Tokens    1,370,000             311,000
Cost      $6.00                 $1.36
Time      15 minutes            3 minutes
Proof     "Looks right to me"   Correctness: 0.95

GPT-4o pricing ($2.50/1M input, $10/1M output). Each run: ~274K tokens. Cached steps use 0 tokens. Savings scale with failure position - failing later in the chain saves more.
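The arithmetic checks out at that pricing. The per-run input/output split below is my assumption (the numbers above only give totals), chosen to match the reported ~$1.20 per full run:

```python
# Reproducing the cost table at GPT-4o pricing. The ~206K-input /
# ~68K-output split of the ~274K tokens per run is an assumption.
IN_PRICE, OUT_PRICE = 2.50, 10.00  # $ per 1M tokens

def run_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * IN_PRICE / 1e6 + output_tokens * OUT_PRICE / 1e6

full_run = run_cost(206_000, 68_000)   # ~$1.20 per full 30-step run
print(round(5 * full_run, 2))          # five re-runs: 5.98 (~$6.00)

# Two targeted replays totaled 311K tokens; assume the same in/out ratio.
ratio_in = 206 / 274
replay_cost = run_cost(int(311_000 * ratio_in), int(311_000 * (1 - ratio_in)))
print(round(replay_cost, 2))           # 1.36, matching the table
```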

For longer agents (50, 100 steps), the savings compound.

The workflow

  1. Langfuse monitors production
  2. Something breaks. Import the trace: rewind import from-langfuse --trace <id>
  3. Fork and replay: rewind replay latest --from 14
  4. Prove it: rewind eval score latest -e correctness --compare-timelines
  5. Share: rewind share latest
  6. Export back: rewind export otel latest --endpoint <langfuse-otel-url>

Or skip steps 2-4 entirely: rewind fix latest --apply diagnoses the failure, forks, and replays with a fix - one command.

Langfuse is my production backbone. Rewind is what I reach for when something breaks.

Try it

pip install rewind-agent
rewind demo && rewind show latest

No API keys needed, no cloud account, nothing to configure. pip install sets up the CLI - on first run it downloads the native binary (~30MB), then you're running. The demo seeds a 5-step research agent with a hallucination at step 5, so you can try the full fork/replay/score workflow without connecting to Langfuse.

Want the one-command version? rewind fix latest diagnoses the failure with AI and suggests the fix. rewind fix latest --apply automates the entire fork/replay loop.

For the Langfuse import workflow: integration guide

Having trouble with a specific agent failure? Open a discussion and paste the trace. I'll walk through debugging it with you.
