What's the worst thing your AI agent has done in production?

SafeRun — Thu, 14 May 2026 19:04:01 +0000

I'm building reliability infrastructure for AI agents, and I'm collecting failure stories from engineers who've shipped agents to production.

If you've shipped an AI agent and watched it do something nobody could explain — this post is for you.

Why I'm asking

For the past few weeks I've been talking to engineers running AI agents in production. The same pattern keeps showing up.

Their agent did something they couldn't predict. The damage was already done by the time they noticed. The tools they had only logged the failure after the fact.

One engineer told me he spent a whole weekend rerunning an agent trying to reproduce one failure.

Another watched his sales agent email the same lead twelve times in five minutes before anyone caught it.

A third saw his agent issue a $4,500 refund because the customer asked nicely and the agent didn't think to check.

These aren't edge cases. This is what production AI agents do when they're given real tools and real money, and the current generation of observability tools only tells you about it after the fact.

I'm building SafeRun to close that gap.

What SafeRun does

SafeRun sits inline between AI agents and the tools they call.

Validates every tool call against your policies before execution
Blocks unsafe operations and runaway loops in real time
Escalates ambiguous actions to a human approval queue
Replays every agent decision frame by frame when something goes wrong
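To make the first three points concrete, here is a minimal sketch of an inline check that runs before a tool call executes. It is not SafeRun's API: `ToolCall`, `make_call`, `evaluate`, the `issue_refund` tool name, and both thresholds are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: tuple  # sorted (key, value) pairs, so identical calls compare equal


def make_call(tool, **args):
    """Build a ToolCall from keyword arguments (hypothetical helper)."""
    return ToolCall(tool, tuple(sorted(args.items())))


MAX_REPEATS = 3            # assumed loop threshold
REFUND_REVIEW_LIMIT = 500  # assumed: refunds above this need a human


def evaluate(call, history):
    """Decide what to do with `call`, given the calls already executed."""
    # Runaway loop: the same tool with identical arguments, too many times.
    if history.count(call) >= MAX_REPEATS:
        return Verdict.BLOCK
    args = dict(call.args)
    # Ambiguous action: large refunds go to a human approval queue.
    if call.tool == "issue_refund" and args.get("amount", 0) > REFUND_REVIEW_LIMIT:
        return Verdict.ESCALATE
    return Verdict.ALLOW
```

Under these assumed rules, the twelve-emails incident would be blocked on the fourth identical call, and the $4,500 refund would wait for a human instead of going out.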

The killer feature, based on what engineers keep telling me, is Replay. You can step through every input, model reasoning step, tool argument, policy result, latency, and cost for every decision the agent made, then rerun from any step with modified inputs.

It's a flight recorder for AI agents.
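As a rough illustration of the flight-recorder idea, the sketch below records each tool call as a frame and re-executes from a chosen step with modified arguments. `Frame`, `Recorder`, and `replay_from` are hypothetical names, not SafeRun's API, and the sketch only replays recorded tool calls. A real replay would also need to rerun the model so later steps can react to the changed result.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    step: int
    tool: str
    args: dict
    result: object


@dataclass
class Recorder:
    frames: list = field(default_factory=list)

    def record(self, tool, args, result):
        """Append one executed tool call as the next frame."""
        self.frames.append(Frame(len(self.frames), tool, dict(args), result))

    def replay_from(self, step, tools, override_args=None):
        """Re-execute recorded calls from `step` onward.

        `tools` maps tool names to callables. `override_args` is merged into
        the arguments of the first replayed frame only.
        """
        results = []
        for frame in self.frames[step:]:
            args = dict(frame.args)  # copy, so the recording stays untouched
            if frame.step == step and override_args:
                args.update(override_args)
            results.append(tools[frame.tool](**args))
        return results
```

Copying the arguments before applying overrides keeps the original recording intact, so the same incident can be replayed many times with different inputs.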

What I'm asking for

Two things.

1. If you're shipping AI agents to production, join the waitlist. Early access is opening soon. We're onboarding the first batch of teams over the coming weeks.

→ saferun.dev

2. Tell me your worst agent failure story. Drop it in the comments below, or DM me. I'm collecting them — anonymized — to make sure SafeRun actually solves the real problems engineers have.

The weirder, the better. Hallucinated tool args. Runaway loops. Unauthorized actions. Cost spirals. Customer-facing incidents you can't talk about publicly. All of it.

The pattern across these stories is what shapes what gets built first.

What's coming

The early SDK ships as a Python decorator first, then TypeScript. Native integrations are planned for LangGraph, OpenAI Agents SDK, Anthropic Claude Agent SDK, Vercel AI SDK, CrewAI, and Mastra, along with an MCP-layer proxy for framework-agnostic coverage.
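For readers wondering what a decorator-based guard might look like, here is a minimal sketch. The SDK has not been published, so nothing here reflects its real interface: `guarded`, `PolicyViolation`, `max_refund`, and `issue_refund` are all hypothetical, and the wrapper accepts keyword arguments only, to keep the example short.

```python
import functools


class PolicyViolation(Exception):
    """Raised when a guarded tool call is rejected before it runs."""


def guarded(policy):
    """Wrap a tool so `policy(name, kwargs)` runs before every call.

    The policy returns None to allow the call, or a reason string to reject it.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            reason = policy(fn.__name__, kwargs)
            if reason is not None:
                raise PolicyViolation(reason)
            return fn(**kwargs)
        return wrapper
    return decorator


def max_refund(limit):
    """Hypothetical policy: reject any call whose `amount` exceeds `limit`."""
    def policy(name, kwargs):
        if kwargs.get("amount", 0) > limit:
            return f"{name}: amount {kwargs['amount']} exceeds limit {limit}"
        return None
    return policy


@guarded(max_refund(500))
def issue_refund(customer_id, amount):
    return f"refunded {amount} to {customer_id}"
```

Calling `issue_refund(customer_id="c1", amount=4500)` raises `PolicyViolation` before the function body runs, which is the difference between blocking an action and logging it afterwards.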

If you want to be in the first batch, the waitlist is at saferun.dev.

And if you've lived through an agent failure that still haunts you — please, tell me about it. I'd genuinely rather build the right thing than the impressive thing.

— Tidiane
Founder, SafeRun
x.com/saferunai

DEV Community: SafeRun
