<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rishabh Jain</title>
    <description>The latest articles on DEV Community by Rishabh Jain (@risjai).</description>
    <link>https://dev.to/risjai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3877294%2F050cca51-7953-45f8-bcf6-5e635d0eab81.jpg</url>
      <title>DEV Community: Rishabh Jain</title>
      <link>https://dev.to/risjai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/risjai"/>
    <language>en</language>
    <item>
      <title>I use Langfuse for tracing. Here's why I added Rewind for debugging.</title>
      <dc:creator>Rishabh Jain</dc:creator>
      <pubDate>Tue, 14 Apr 2026 12:31:15 +0000</pubDate>
      <link>https://dev.to/risjai/i-use-langfuse-for-tracing-heres-why-i-added-rewind-for-debugging-357p</link>
      <guid>https://dev.to/risjai/i-use-langfuse-for-tracing-heres-why-i-added-rewind-for-debugging-357p</guid>
      <description>&lt;p&gt;Last week my research agent failed at step 15 of a 30-step run. Langfuse showed me exactly where it broke. The writer sub-agent hallucinated, citing a stale 2019 population figure as current fact. Clean trace, obvious failure.&lt;/p&gt;

&lt;p&gt;Now what?&lt;/p&gt;

&lt;p&gt;I changed the system prompt. Re-ran the agent. $1.20 in tokens. 3 minutes of wall time. Different answer, still wrong, different hallucination. Re-ran again. $1.20 more. Another answer. By the fifth attempt I'd spent $6 and 15 minutes, and I still wasn't sure the fix was right because every run gave a different output.&lt;/p&gt;

&lt;p&gt;Langfuse is great at showing you what happened. It can't let you change what happened and observe a different outcome.&lt;/p&gt;

&lt;p&gt;So I built a tool that does.&lt;/p&gt;

&lt;h2&gt;The gap between tracing and fixing&lt;/h2&gt;

&lt;p&gt;Most LLM observability tools (Langfuse, LangSmith, Helicone) solve the same problem: "What did my agent do?" They capture traces, show you token counts, latencies, and the content of each step. That's valuable.&lt;/p&gt;

&lt;p&gt;But when something breaks at step 15 of a 30-step agent, you're stuck:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You can't isolate the failure.&lt;/strong&gt; To test a fix, you re-run all 30 steps. Steps 1-14 were fine. You're paying for them again.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You can't reproduce it.&lt;/strong&gt; LLMs are non-deterministic. Re-run the same agent and you get a different result. The bug might not even appear.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You can't prove your fix works.&lt;/strong&gt; You changed the prompt. Did it actually fix the hallucination, or just shift the problem to a different step?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I needed something that lets me fork at step 14, replay only the broken part, and prove the fix works. So I built &lt;a href="https://github.com/agentoptics/rewind" rel="noopener noreferrer"&gt;Rewind&lt;/a&gt;, an open-source time-travel debugger for AI agents.&lt;/p&gt;

&lt;h2&gt;How I debug Langfuse traces now&lt;/h2&gt;

&lt;h3&gt;1. Import the trace&lt;/h3&gt;

&lt;p&gt;I see the broken trace in Langfuse's UI. Copy the trace ID:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;rewind-agent

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;LANGFUSE_PUBLIC_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pk-lf-...
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;LANGFUSE_SECRET_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-lf-...
rewind import from-langfuse &lt;span class="nt"&gt;--trace&lt;/span&gt; abc123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rewind calls the Langfuse REST API, fetches the trace, and gives me a browsable session with the full span tree: agent boundaries, tool calls, handoffs, token counts.&lt;/p&gt;

&lt;p&gt;Same data as Langfuse, but now it's in a system that can act on it. Everything stays on your machine - Rewind is a single binary that stores traces locally in SQLite. No cloud account, no data leaving your environment.&lt;/p&gt;
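&lt;p&gt;Under the hood, the import is a call to Langfuse's public REST API (&lt;code&gt;GET /api/public/traces/{id}&lt;/code&gt;, basic auth with the public key as username and the secret key as password). A minimal Python sketch of the same fetch - the endpoint is Langfuse's, the function names are mine:&lt;/p&gt;

```python
import base64
import json
import os
import urllib.request


def basic_auth(public_key: str, secret_key: str) -> str:
    """Langfuse basic auth: base64("public_key:secret_key")."""
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return f"Basic {token}"


def fetch_trace(trace_id: str, host: str = "https://cloud.langfuse.com") -> dict:
    """Fetch one trace from the Langfuse public API."""
    req = urllib.request.Request(
        f"{host}/api/public/traces/{trace_id}",
        headers={
            "Authorization": basic_auth(
                os.environ["LANGFUSE_PUBLIC_KEY"],
                os.environ["LANGFUSE_SECRET_KEY"],
            )
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```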

&lt;h3&gt;2. Fork at the failure, replay with the fix&lt;/h3&gt;

&lt;p&gt;I fix my code (in this case, adding a date cross-referencing instruction to the system prompt), then:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rewind replay latest &lt;span class="nt"&gt;--from&lt;/span&gt; 14
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Steps 1-14 are served from cache. Zero tokens, zero API calls, no side effects retriggered - cached steps return stored responses without hitting upstream. Only step 15 onward re-runs live. If the fix isn't right, I replay again. Each time I only pay for the steps after the fork point.&lt;/p&gt;
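&lt;p&gt;The caching mechanics are simple to picture. Here's a toy sketch of the fork/replay idea (all names are mine - this illustrates the concept, not Rewind's internals): every step before the fork point returns its recorded response, and only later steps call the model live.&lt;/p&gt;

```python
def replay(steps, recorded, first_live_step, call_llm):
    """Replay a trace: steps before `first_live_step` are served from the
    recorded responses (zero tokens); the rest run live with the fix."""
    outputs = []
    for i, step in enumerate(steps, start=1):
        if i < first_live_step:
            outputs.append(recorded[i])     # cached: no API call, no side effects
        else:
            outputs.append(call_llm(step))  # live: only these steps cost tokens
    return outputs


# Toy run: 5 recorded steps, forked so that steps 4-5 re-run live.
recorded = {1: "a", 2: "b", 3: "c", 4: "bad", 5: "worse"}
live = []


def fake_llm(step):
    live.append(step)
    return f"fixed-{step}"


out = replay(["s1", "s2", "s3", "s4", "s5"], recorded, 4, fake_llm)
# out == ["a", "b", "c", "fixed-s4", "fixed-s5"]; only 2 live calls were made
```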

&lt;h3&gt;3. Prove the fix with LLM-as-judge&lt;/h3&gt;

&lt;p&gt;Instead of eyeballing the output, score both timelines with LLM-as-judge:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# One-time setup: create an LLM-as-judge evaluator (requires OPENAI_API_KEY)&lt;/span&gt;
rewind &lt;span class="nb"&gt;eval &lt;/span&gt;evaluator create correctness &lt;span class="nt"&gt;-t&lt;/span&gt; llm_judge &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'{"criteria": "correctness"}'&lt;/span&gt;

&lt;span class="c"&gt;# Score both timelines&lt;/span&gt;
rewind &lt;span class="nb"&gt;eval &lt;/span&gt;score latest &lt;span class="nt"&gt;-e&lt;/span&gt; correctness &lt;span class="nt"&gt;--compare-timelines&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Original timeline: 0.2 on correctness. Fixed timeline: 0.95. Not me guessing. An evaluator comparing the output against expected results.&lt;/p&gt;
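&lt;p&gt;"LLM-as-judge" here just means a second model grades the outputs. A sketch of what such a grading prompt can look like (my wording, not Rewind's actual evaluator):&lt;/p&gt;

```python
def judge_prompt(criteria: str, original: str, fixed: str) -> str:
    """Build a grading prompt that scores both timelines on one criterion."""
    return (
        f"Score each answer from 0.0 to 1.0 on {criteria}.\n\n"
        f"Answer A (original timeline):\n{original}\n\n"
        f"Answer B (fixed timeline):\n{fixed}\n\n"
        'Reply as JSON: {"a": score, "b": score, "reason": "one sentence"}'
    )


prompt = judge_prompt(
    "correctness",
    "The population is 8.4M (2019 figure cited as current).",
    "The population is 8.3M as of the latest census.",
)
```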

&lt;h3&gt;4. Share the proof&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rewind share latest &lt;span class="nt"&gt;--include-content&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; debug-session.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Self-contained HTML file. Open in any browser, no install needed. The full trace, both timelines, the diff, the scores. Drop it in Slack. Anyone can see what broke and the proof that it's fixed.&lt;/p&gt;

&lt;h3&gt;5. Export back to Langfuse&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rewind &lt;span class="nb"&gt;export &lt;/span&gt;otel latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; https://cloud.langfuse.com/api/public/otel &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"Authorization=Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nv"&gt;$LANGFUSE_PUBLIC_KEY&lt;/span&gt;:&lt;span class="nv"&gt;$LANGFUSE_SECRET_KEY&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The debugged session goes back to Langfuse for the team dashboard.&lt;/p&gt;
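&lt;p&gt;One detail worth calling out in that command: the &lt;code&gt;-n&lt;/code&gt; on &lt;code&gt;echo&lt;/code&gt; is load-bearing. Without it, a trailing newline gets base64-encoded into the credentials and the Authorization header silently fails. The same encoding in Python, with placeholder keys:&lt;/p&gt;

```python
import base64

pk, sk = "pk-lf-example", "sk-lf-example"  # placeholder keys

good = base64.b64encode(f"{pk}:{sk}".encode()).decode()
bad = base64.b64encode(f"{pk}:{sk}\n".encode()).decode()  # echo without -n

# The two tokens differ; only the first decodes to the exact "pk:sk" pair.
```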

&lt;h2&gt;The cost difference&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Before (re-run)&lt;/th&gt;
&lt;th&gt;After (Rewind)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Attempts&lt;/td&gt;
&lt;td&gt;5 full re-runs&lt;/td&gt;
&lt;td&gt;2 targeted replays&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokens&lt;/td&gt;
&lt;td&gt;1,370,000&lt;/td&gt;
&lt;td&gt;311,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$6.00&lt;/td&gt;
&lt;td&gt;$1.36&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;td&gt;3 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Proof&lt;/td&gt;
&lt;td&gt;"Looks right to me"&lt;/td&gt;
&lt;td&gt;Correctness: 0.95&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;GPT-4o pricing ($2.50/1M input, $10/1M output). Each run: ~274K tokens. Cached steps use 0 tokens. Savings scale with failure position - failing later in the chain saves more.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For longer agents (50, 100 steps), the savings compound.&lt;/p&gt;
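&lt;p&gt;The "savings scale with failure position" claim is plain arithmetic. A sketch, assuming roughly uniform cost per step (the ~$1.20-per-run figure is from the runs above; the function name is mine):&lt;/p&gt;

```python
def replay_cost(full_run_cost: float, total_steps: int, first_live_step: int) -> float:
    """Approximate cost of one targeted replay: only steps from the fork
    point onward hit the API; everything earlier is served from cache."""
    live_steps = total_steps - first_live_step + 1
    return full_run_cost * live_steps / total_steps


# Failure at step 15 of 30: 16 steps re-run live.
mid_fail = replay_cost(1.20, 30, 15)   # 1.20 * 16/30 = 0.64
# Failure at step 25 of 30: only 6 steps re-run live.
late_fail = replay_cost(1.20, 30, 25)  # 1.20 * 6/30 = 0.24
```

&lt;p&gt;Two replays from step 15 come to roughly $1.28 of live calls under this approximation, which is in the same ballpark as the table's $1.36 once evaluator calls are included.&lt;/p&gt;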

&lt;h2&gt;The workflow&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Langfuse&lt;/strong&gt; monitors production&lt;/li&gt;
&lt;li&gt;Something breaks. Import the trace: &lt;code&gt;rewind import from-langfuse --trace &amp;lt;id&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Fork and replay: &lt;code&gt;rewind replay latest --from 14&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Prove it: &lt;code&gt;rewind eval score latest -e correctness --compare-timelines&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Share: &lt;code&gt;rewind share latest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Export back: &lt;code&gt;rewind export otel latest --endpoint &amp;lt;langfuse-otel-url&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Or skip steps 2-4 entirely: &lt;code&gt;rewind fix latest --apply&lt;/code&gt; diagnoses the failure, forks, and replays with a fix - one command.&lt;/p&gt;

&lt;p&gt;Langfuse is my production backbone. Rewind is what I reach for when something breaks.&lt;/p&gt;

&lt;h2&gt;Try it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;rewind-agent
rewind demo &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; rewind show latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No API keys needed, no cloud account, nothing to configure. &lt;code&gt;pip install&lt;/code&gt; sets up the CLI; on first run it downloads the native binary (~30MB), and you're up and running. The demo seeds a 5-step research agent with a hallucination at step 5, so you can try the full fork/replay/score workflow without connecting to Langfuse.&lt;/p&gt;

&lt;p&gt;Want the one-command version? &lt;code&gt;rewind fix latest&lt;/code&gt; diagnoses the failure with AI and suggests the fix. &lt;code&gt;rewind fix latest --apply&lt;/code&gt; automates the entire fork/replay loop.&lt;/p&gt;

&lt;p&gt;For the Langfuse import workflow: &lt;a href="https://agentoptics.dev/blog/langfuse-debugging" rel="noopener noreferrer"&gt;integration guide&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://agentoptics.dev" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/agentoptics/rewind" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; (MIT licensed)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/rewind-agent/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Having trouble with a specific agent failure? &lt;a href="https://github.com/agentoptics/rewind/discussions" rel="noopener noreferrer"&gt;Open a discussion&lt;/a&gt; and paste the trace. I'll walk through debugging it with you.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
