A note on building reliability infrastructure for AI agents — and why post-incident debugging matters more than pre-flight validation.
A few weeks ago I started building SafeRun — inline reliability infrastructure for AI agents in production. The temptation, when you're building something in the agent reliability space, is to lead with validation. Block the bad action before it happens. Stop the runaway loop. Enforce the policy.
These are real features. SafeRun ships all of them. But they're not the first thing we built. The first thing we built was Replay.
Here's why.
The failure mode no one talks about
Most teams shipping AI agents into production discover the same problem after their first bad incident. The agent did something it shouldn't have. They go to investigate. And they find that they can't reproduce what happened.
The traces are flat. The logs don't show the model's reasoning between tool calls. The arguments to the failed call aren't fully captured. The retrieved context that informed the decision is missing. The agent's plan, if it had one, isn't anywhere.
So the engineer does what engineers do. They start rerunning the agent, trying to recreate the conditions that led to the failure. The agent is non-deterministic. The conditions change. They spend a weekend trying to reproduce one bad action.
This is the universal pain. I've talked to maybe twenty engineers shipping agents in production, and every single one of them has lived this. Not "heard about it." Lived it.
Why observability tools don't solve this
LangSmith, Langfuse, Helicone, Arize, and the broader observability category do something genuinely useful: they tell you what happened. But "what happened" is a description, not a reproduction. You can read a trace. You can't re-execute it.
Replay is different. Replay means capturing the complete state of an agent run with enough fidelity to step through it frame by frame after the fact: the exact arguments to each tool call, the model's reasoning between calls, the retrieved context at each decision point, the policy that evaluated each action, and the decision that was returned.
This is a different engineering problem from logging. It requires deterministic state capture. It requires snapshotting decision-time context separately from outcome context. It requires versioning every policy, every rule, and every classifier that participated in a decision. We built this first because everything else depends on it.
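To make "decision-time snapshot" concrete, here is a minimal sketch in Python. The `DecisionSnapshot` class, its field names, and the `fingerprint` method are illustrative assumptions, not SafeRun's actual schema. The idea is only that everything the evaluator saw is frozen into one versioned record that can be hashed and compared later.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class DecisionSnapshot:
    """Hypothetical record of what was known when a decision was made."""

    tool_name: str
    arguments: dict[str, Any]
    model_reasoning: str
    retrieved_context: list[str]
    policy_version: str
    evaluator_version: str
    decision: str  # e.g. "allow" or "block"

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) so that identical
        # state always produces the identical hash.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


snapshot = DecisionSnapshot(
    tool_name="stripe.refund",
    arguments={"amount": 4500, "is_refund": True},
    model_reasoning="(captured reasoning text)",
    retrieved_context=["(captured context chunk)"],
    policy_version="policy-v1",
    evaluator_version="evaluator-v1",
    decision="allow",
)
print(snapshot.fingerprint())
```

Because the policy and evaluator versions are part of the hashed record, two snapshots that differ only in which rule set evaluated them get different fingerprints.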
The four-step loop, and why Replay is the foundation
SafeRun's product loop is Replay → Understand → Create Rule → Prevent.
You can't understand a failure you can't reproduce.
You can't create a rule to prevent a failure you don't understand.
You can't prevent a category of failure if your rule was created against an incomplete picture of what happened.
The order matters. Build Replay first, and everything else compounds. Build prevention first, and your rules will be flat patches against failures you don't fully see.
The Stripe boolean problem
Here's the failure that taught me Replay matters more than any other layer.
An agent issues a Stripe refund instead of a charge because a single boolean flipped somewhere between the agent's planning step and the tool call. The call shape is correct. The schema passes. Type-checking passes. Most observability tools log a successful refund and move on.
The engineer notices the next morning when the customer complains. They go to investigate. They have a trace. The trace tells them "Stripe refund issued, amount $4,500, customer cus_9281." That's true. It tells them nothing about why.
With Replay, they can step back through the agent's decision frame by frame. See that the user's request was actually for a charge. See that the agent's planning step had is_refund: false. See that somewhere between the plan and the tool call, the boolean flipped. See whether it was a model hallucination, a prompt injection, a code bug, or a retrieved-context misinterpretation.
Now they know what to do. They can write a prevention rule. They can fix the upstream cause. They can ship a fix that actually prevents recurrence, instead of patching the symptom.
This is what Replay enables. None of the rest of the product matters without it.
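The frame-by-frame walk described above can be sketched in a few lines. This is an illustration, not SafeRun's replay format: the frame layout (a `step` name plus a `state` dict), the `first_divergence` helper, and the sample data are all assumptions made for the example.

```python
from typing import Any, Optional

Frame = dict[str, Any]


def first_divergence(frames: list[Frame], key: str) -> Optional[tuple[int, Any, Any]]:
    """Return (frame index, old value, new value) where `key` first changes."""
    previous = None
    seen = False
    for index, frame in enumerate(frames):
        state = frame.get("state", {})
        if key not in state:
            continue  # this frame says nothing about the key
        value = state[key]
        if seen and value != previous:
            return index, previous, value
        previous, seen = value, True
    return None


# Made-up replay of the Stripe boolean problem.
frames = [
    {"step": "user_request", "state": {"intent": "charge"}},
    {"step": "plan", "state": {"is_refund": False, "amount": 4500}},
    {"step": "tool_call", "state": {"is_refund": True, "amount": 4500}},
]

hit = first_divergence(frames, "is_refund")
print(hit)  # (2, False, True): the flip happened at the tool call
```

A flat trace only holds the last frame. The comparison is possible only because the plan frame was captured too.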
What we shipped, in order:
Phase 0: Working prototype with six failure simulations, including the Stripe boolean problem.
Phase 1: Persistent backend on Supabase. Replays survive page reload, browser close, account switch.
Phase 2: POST /v1/check-action API with sub-50ms p95 latency. Decision-time context snapshotting (inputs, retrieved context, external state, policy version, evaluator model version) captured synchronously, persisted asynchronously. The replay is built from the decision, not assembled after.
Phase 3: Python and TypeScript SDKs. Three-line install. @guard decorator wraps any tool call.
Phase 4: Intent Guard — catches valid-shape, wrong-intent tool calls. The Stripe boolean problem from above. Visible confidence scores, threshold calibration as a product surface, feedback loop closes back into recalibration.
Phase 5: Multi-tenant, project-scoped API keys, environment separation (dev logs, staging warns, production blocks), replay redaction, audit log, rule versioning.
Phase 6: Design partner onboarding, Prevention Impact Dashboard.
Phase 7: Self-hosted/VPC, SSO/SAML, audit log export, SOC 2 readiness, SafeRun as an MCP-callable tool.
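Phase 2's "captured synchronously, persisted asynchronously" is a common pattern, and a minimal sketch of it looks like this. Everything here is an assumption for illustration: the `check_action` function, the in-memory `persisted` list standing in for a database, and the placeholder decision. It is not the real /v1/check-action handler.

```python
import copy
import queue
import threading
import time
from typing import Any

_pending: queue.Queue = queue.Queue()
persisted: list[dict[str, Any]] = []  # stand-in for a real datastore


def _writer() -> None:
    # Background worker: drains the queue so storage latency never
    # sits on the decision path.
    while True:
        snapshot = _pending.get()
        persisted.append(snapshot)
        _pending.task_done()


threading.Thread(target=_writer, daemon=True).start()


def check_action(tool_name: str, arguments: dict[str, Any], policy_version: str) -> str:
    # Synchronous capture: deep-copy the inputs now, before the caller
    # can mutate them, so the snapshot reflects decision time.
    snapshot = {
        "tool_name": tool_name,
        "arguments": copy.deepcopy(arguments),
        "policy_version": policy_version,
        "captured_at": time.time(),
    }
    snapshot["decision"] = "allow"  # placeholder for a real policy evaluation
    _pending.put(snapshot)  # asynchronous persistence: enqueue and return
    return snapshot["decision"]


args = {"amount": 4500, "is_refund": True}
decision = check_action("stripe.refund", args, policy_version="policy-v1")
args["is_refund"] = False  # a later mutation must not leak into the snapshot
_pending.join()  # demo only: wait for the background write to finish
print(decision, persisted[0]["arguments"])
```

The deep copy is what makes the snapshot a record of the decision rather than of whatever the arguments look like by the time the write lands.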
The whole roadmap exists in service of the Replay layer. Every phase compounds on the previous one. Every feature ladders to Replay → Understand → Create Rule → Prevent.
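For a feel of how a guard decorator (Phase 3) and environment separation (Phase 5) could fit together, here is a hedged sketch. The `guard` signature, the `ActionBlocked` exception, the `no_refunds` check, and the `issue_payment` function are hypothetical names invented for this example. The real SDK's interface may look quite different.

```python
import functools
import logging
from typing import Any, Callable

logger = logging.getLogger("guard_sketch")


class ActionBlocked(Exception):
    """Raised when a guarded call fails its check in a blocking environment."""


def guard(check: Callable[[dict[str, Any]], bool], environment: str = "dev"):
    """Wrap a tool call; `check` returns True when the keyword arguments look safe."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(**kwargs):
            if not check(kwargs):
                message = f"{func.__name__} failed check with {kwargs}"
                if environment == "production":
                    raise ActionBlocked(message)  # production blocks
                if environment == "staging":
                    logger.warning(message)  # staging warns
                else:
                    logger.info(message)  # dev logs
            return func(**kwargs)

        return wrapper

    return decorator


def no_refunds(kwargs: dict[str, Any]) -> bool:
    # Toy rule: treat any refund as suspicious.
    return not kwargs.get("is_refund", False)


@guard(no_refunds, environment="production")
def issue_payment(*, amount: int, is_refund: bool) -> str:
    # Stand-in for a real payment call; nothing is sent anywhere.
    return "refund" if is_refund else "charge"


print(issue_payment(amount=4500, is_refund=False))  # charge
```

The same rule runs in every environment. Only the consequence changes, which lets a rule be observed in dev and staging before it is allowed to block anything.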
What's next
We're onboarding the first design partners now. Engineering teams shipping AI agents into production — agents that move real money, modify real customer data, talk to real customers. Free during the partnership in exchange for honest feedback.
If you're shipping agents and want to be one of the first teams running SafeRun in production, get in touch at saferun.dev.
If you're shipping agents and don't want to be a design partner but want to try the SDK, it's pip install saferun and three lines.
Either way, the bet is this: replay the failure, prevent the next one. The first one always happens. The second one is the company's choice.