DEV Community

TaimoorKhan10

I built an open source SDK to catch AI agent regressions before they ship.

You fix a bug in your agent. A week later you change the prompt or swap the model. The same bug comes back. Nobody notices until a user does.

Regular software has regression tests for this. AI agents mostly do not. So I built replayd.

When your agent fails, you capture that run and save it as a test. Before you ship a new version, you replay the saved failures against it. If the same failure comes back, you catch it before your users do.
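The capture-and-replay loop described above can be sketched in a few lines of plain Python. This is a minimal illustration only, not replayd's actual API: `FailureCase`, `save_case`, `load_cases`, `replay`, the JSON file layout, and the refund scenario are all hypothetical names and assumptions made for the example.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable


@dataclass
class FailureCase:
    """A captured failing run (hypothetical schema, not replayd's)."""
    case_id: str
    user_input: str
    failure: str  # short label for what went wrong


def save_case(case: FailureCase, directory: Path) -> Path:
    """Persist a failing run as a JSON file so it can be replayed later."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{case.case_id}.json"
    path.write_text(json.dumps(asdict(case)))
    return path


def load_cases(directory: Path) -> list[FailureCase]:
    """Load every saved failure case from the directory."""
    return [
        FailureCase(**json.loads(p.read_text()))
        for p in sorted(directory.glob("*.json"))
    ]


def replay(
    cases: list[FailureCase],
    agent: Callable[[str], dict],
    came_back: Callable[[FailureCase, dict], bool],
) -> list[str]:
    """Run each saved input through the agent; return ids whose failure reappears."""
    return [c.case_id for c in cases if came_back(c, agent(c.user_input))]


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    save_case(FailureCase("refund-001", "Refund order 42", "double_refund"), store)

    def new_agent(user_input: str) -> dict:
        # Stand-in for the new agent version under test (assumed output shape).
        return {"tool_calls": ["lookup_order", "issue_refund"]}

    def double_refund(case: FailureCase, run: dict) -> bool:
        # The saved failure: the agent issued the refund more than once.
        return run["tool_calls"].count("issue_refund") > 1

    regressions = replay(load_cases(store), new_agent, double_refund)
```

In a CI job, a non-empty `regressions` list would fail the build, which is the "catch it before your users do" step.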

pip install replayd

The interesting part was grading. You cannot use exact output matching because LLMs are nondeterministic, so replayd does not check the text. It checks whether the specific failure came back. Structural failures get deterministic assertions. Semantic ones get an LLM as a judge. You assert on what the agent did, not what it said.
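To make the two kinds of grading concrete, here is a rough sketch under stated assumptions. It does not use replayd's API: the run record shape (`tool_calls` and `text`), the function names, and the `judge` callable are hypothetical. In real use `judge` would wrap a call to an LLM; here it is just any function from a prompt string to an answer string.

```python
from typing import Callable

# Assumed run record: {"tool_calls": [{"name": ...}, ...], "text": "..."}
Run = dict


def called_tool_twice(run: Run, tool: str) -> bool:
    """Structural check: deterministic, looks only at what the agent did."""
    names = [call["name"] for call in run["tool_calls"]]
    return names.count(tool) > 1


def semantic_failure(run: Run, question: str, judge: Callable[[str], str]) -> bool:
    """Semantic check: ask a judge a yes/no question about the agent's output."""
    prompt = (
        f"{question}\n\n"
        f"Agent output:\n{run['text']}\n\n"
        "Answer YES or NO."
    )
    return judge(prompt).strip().upper().startswith("YES")


# Two runs with different wording; only the first repeats the refund call.
run_a = {
    "tool_calls": [{"name": "issue_refund"}, {"name": "issue_refund"}],
    "text": "All done, your refund is on its way!",
}
run_b = {
    "tool_calls": [{"name": "lookup_order"}, {"name": "issue_refund"}],
    "text": "I have refunded order 42.",
}

structural_a = called_tool_twice(run_a, "issue_refund")
structural_b = called_tool_twice(run_b, "issue_refund")
```

The structural check never reads `text`, so rewording the reply cannot hide or trigger the failure. The judge path is reserved for failures that only show up in meaning, such as the agent promising something it should not.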

It is v0.1.1: early, with rough edges, but the core loop works. The core has zero runtime dependencies and is framework agnostic.

GitHub: github.com/TaimoorKhan10/replayd

If you are running agents in production I would love your feedback on the grading approach. What are you catching manually right now that you wish was automated?

Top comments (1)

Harjot Singh

Catching agent regressions before they ship is the missing CI layer for AI. Deterministic code has tests; agents mostly have vibes and a prayer. The hard part you're solving: output is non-deterministic, so "did it regress" needs semantic assertions or a judge, not string equality. Pinning behavior with replayable fixtures plus tolerance on the fuzzy parts is the move. I built exactly this kind of verify-before-ship gate into Moonshift: output has to clear checks before it counts as done. What's your assertion model: golden outputs, a judge model, or property-based checks?