Why You Can't Reproduce AI Agent Failures (And Why That's a Huge Problem)
If you've used Claude Code, Cursor, or any AI coding agent for more than a week, you've probably experienced this:
The agent does something wrong. Maybe it deletes a file it shouldn't have. Maybe it rewrites your auth module and breaks everything. Maybe it makes a chain of 15 edits and somewhere in the middle, something goes sideways.
So you try to figure out what happened. You look at the conversation. You stare at the diffs. You try to piece together the sequence of events. And then you think "let me just re-run it and watch more carefully this time."
And it does something completely different.
The Nondeterminism Problem
This isn't a bug. It's fundamental to how LLMs work.
Every time an LLM generates a response, it's sampling from a probability distribution over possible next tokens. Temperature, top_p, and the inherent randomness in the sampling process mean that the same prompt can produce meaningfully different outputs every single time.
In traditional software, debugging works because code is deterministic. Same input, same output. You reproduce the bug, step through it, find the problem. That mental model completely breaks down with AI agents.
When an AI agent makes a bad decision, that specific chain of reasoning is gone the moment it happens. You can never get it back by re-running. The exact sequence of token probabilities that led to the failure will never repeat.
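To make that concrete, here's a toy sketch of temperature sampling — a simplified softmax sampler, not any real model's decoder. Same logits, repeated draws, different tokens; and note that fixing the seed is what makes a draw repeatable:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, seed=None):
    """Sample a token index from raw logits, roughly as an LLM decoder does.

    Higher temperature flattens the distribution (more randomness);
    a fixed seed makes the draw deterministic.
    """
    rng = random.Random(seed)
    scaled = [l / temperature for l in logits]
    # Softmax with max-subtraction for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw from the cumulative distribution.
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(probs) - 1

# Two tokens with nearly equal probability: across 50 unseeded draws,
# the "same prompt" picks different next tokens.
logits = [2.0, 1.9, 0.5]
samples = {sample_next_token(logits) for _ in range(50)}
```

Production APIs don't expose the sampler's seed at this level, which is exactly why the losing branch of a decision is unrecoverable after the fact.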
Why Logging Doesn't Fix This
Tools like Braintrust, Langfuse, and LangSmith are great at what they do. They log your prompts, responses, token counts, and latency. You can see traces of what happened.
But there's a critical difference between logging and recording.
A log tells you what the agent said. A recording captures enough context to make the agent say it again.
Think about it like this. A security camera shows you that someone walked into a room and took a file from the drawer. That's logging. Useful for knowing what happened.
But what if you could recreate the exact room, with the exact drawer contents, at the exact moment, put the person back in, and watch them do the exact same thing? That's recording. And what if you could then change one variable — "what if the drawer was locked?" — and see if the outcome changes?
That second thing is what's missing from the AI agent toolchain right now.
What Deterministic Replay Actually Means
The concept is simple. Record not just the prompts and responses, but the full execution context of every LLM call: the exact model version, sampling parameters, tool definitions, system prompt, the complete message history, and the full response object including tool calls and stop reasons. With all of that captured, you can replay the session by intercepting subsequent LLM calls and returning the recorded responses instead of calling the real API.
The agent code doesn't know the difference. It receives the exact same response it got during the original run. It makes the same decisions. Calls the same tools. Produces the same output.
Zero API cost. Identical behavior. The exact same failure, reproduced on demand.
This feels obvious in retrospect, but it's architecturally impossible if you only store the conversation text. You need the full request and response objects, linked together in a causal chain with timestamps and parent-child relationships.
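As a rough illustration of the interception idea (this is not Culpa's actual internals — `real_call` stands in for whatever SDK method hits the API), a record/replay wrapper might look like:

```python
import hashlib
import json

class Recorder:
    """Record mode: call the real API and store each (request, response) pair.
    Replay mode: return the stored response for an identical request,
    never touching the network."""

    def __init__(self, real_call, mode="record"):
        self.real_call = real_call
        self.mode = mode
        self.tape = {}   # request fingerprint -> recorded response
        self.order = []  # causal order of calls during the recorded run

    def _key(self, request):
        # Fingerprint the *full* request: model, sampling params,
        # tool definitions, complete message history.
        blob = json.dumps(request, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def __call__(self, request):
        key = self._key(request)
        if self.mode == "replay":
            return self.tape[key]  # zero API cost, identical response
        response = self.real_call(request)
        self.tape[key] = response
        self.order.append(key)
        return response
```

The agent code calls the wrapper exactly as it would call the SDK, so in replay mode it receives byte-identical responses and walks the same decision path.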
Counterfactual Debugging
Once you have deterministic replay, something even more interesting becomes possible.
You can pause the replay at any decision point and ask: "What if the agent had chosen differently here?"
Replace the recorded LLM response with an alternative. The replay engine uses the recorded data up to that point (free), injects your alternative at the fork point, then continues execution from there with live LLM calls in a sandboxed environment.
Now you have two timelines. What actually happened, and what would have happened. Side by side.
Instead of guessing whether changing a prompt would fix the issue, you can literally test it against the recorded reality. "The agent decided to rewrite auth.py at step 7. What if it had added a migration script instead?" Fork at step 7, inject that alternative, run it forward, compare.
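A stripped-down sketch of that fork logic (hypothetical names throughout; a real engine would also let the agent generate new requests after the fork, since everything downstream depends on the injected response):

```python
import hashlib
import json

def fingerprint(request):
    """Stable hash of the full request object (model, params, messages)."""
    return hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()

def replay_with_fork(tape, live_call, requests, fork_index, alternative):
    """Serve recorded responses before the fork, inject the counterfactual
    at the fork, then go live for everything after it."""
    responses = []
    for i, request in enumerate(requests):
        if i < fork_index:
            responses.append(tape[fingerprint(request)])  # free, from the recording
        elif i == fork_index:
            responses.append(alternative)                 # the injected "what if"
        else:
            responses.append(live_call(request))          # live calls from here on
    return responses
```

Everything before the fork costs nothing and is guaranteed identical to the original run, so any divergence you observe afterward is attributable to the one response you changed.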
This turns debugging from detective work into experimentation.
Why This Matters Now
AI agents are becoming the way software gets built. Claude Code is at $2.5B ARR. Cursor is at $500M. Millions of developers are using agents that make real changes to real codebases every day.
And 88% of organizations reported AI agent security or data privacy incidents in the last 12 months. Gartner predicts 40% of agentic AI projects will fail by 2027.
The incidents are happening now. The agents are getting more autonomous. And the debugging tools haven't caught up.
Observability tools tell you what happened. What's been missing is a tool that tells you why, and lets you prove your fix works before shipping it.
What I Built
I spent the past month building an open-source tool called Culpa that does exactly this. It records every LLM call, tool invocation, and file change with full context. It replays sessions deterministically. And it lets you fork at any decision point to test alternatives.
It works with Claude Code and Cursor via a local proxy. You just point them at it and everything gets recorded transparently:
culpa proxy start --name "debugging auth" --watch .
ANTHROPIC_BASE_URL=http://localhost:4560 claude
It also works with any Python script that uses the Anthropic or OpenAI SDK:
import culpa
culpa.init()
# your agent code here, everything gets recorded
MIT licensed, 91 tests, fully open source: github.com/AnshKanyadi/culpa
I'm a CS freshman at UMass. If you've dealt with the same frustration of debugging AI agent failures, I'd love to hear how you're currently handling it and what you'd want from a tool like this.