TaimoorKhan10

I built an open source SDK to catch AI agent regressions before they ship.

You fix a bug in your agent. A week later you change the prompt or swap the model. The same bug comes back. Nobody notices until a user does.

Regular software has regression tests for this. AI agents mostly do not. So I built replayd.

When your agent fails, you capture that run and save it as a test. Before you ship a new version, you replay the saved failures against it. If the same failure comes back, you catch it before your users do.
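To make the capture-and-replay loop concrete, here is a minimal sketch in plain Python. It is not replayd's API: the function names `capture_failure` and `replay_failures`, the JSON file layout, and the `failed` callback are all assumptions made up for illustration.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def capture_failure(store: Path, case_id: str, agent_input: str, note: str) -> Path:
    """Save one failed run as a JSON test case (hypothetical file format)."""
    store.mkdir(parents=True, exist_ok=True)
    path = store / f"{case_id}.json"
    path.write_text(json.dumps({"id": case_id, "input": agent_input, "failure": note}))
    return path


def replay_failures(
    store: Path,
    agent: Callable[[str], Dict],
    failed: Callable[[Dict, Dict], bool],
) -> List[str]:
    """Re-run every saved case against `agent`.

    Returns the ids of cases whose original failure came back,
    as decided by the caller-supplied `failed(case, result)` check.
    """
    regressions = []
    for path in sorted(store.glob("*.json")):
        case = json.loads(path.read_text())
        result = agent(case["input"])
        if failed(case, result):
            regressions.append(case["id"])
    return regressions
```

A release check would then call `replay_failures` with the new agent version and block the release if the returned list is not empty.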

pip install replayd

The interesting part was grading. You cannot use exact output matching because LLMs are nondeterministic. So replayd does not check the text. It checks whether the specific failure came back. Structural failures get deterministic assertions. Semantic ones get an LLM as a judge. You assert on what the agent did, not what it said.
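The two kinds of check could look roughly like the sketch below. This is an illustration under stated assumptions, not replayd's implementation: the trace format (a list of dicts with a `"tool"` key), the function names, and the YES/NO judge protocol are all hypothetical, and `judge` is any callable you supply that sends a prompt to a model and returns its reply.

```python
from typing import Callable, Dict, List

ToolCall = Dict[str, object]


def called_tool_too_often(trace: List[ToolCall], tool: str, limit: int) -> bool:
    """Structural check: True if `tool` appears in the trace more than `limit` times.

    Deterministic, because it looks at what the agent did, not at its wording.
    """
    return sum(1 for call in trace if call["tool"] == tool) > limit


def semantic_failure_returned(
    answer: str,
    failure_description: str,
    judge: Callable[[str], str],
) -> bool:
    """Semantic check: ask a judge model whether a described failure is present.

    `judge` is a hypothetical hook; it must return text starting with YES or NO.
    """
    prompt = (
        "Does the following answer exhibit this failure?\n"
        f"Failure: {failure_description}\n"
        f"Answer: {answer}\n"
        "Reply with YES or NO."
    )
    verdict = judge(prompt)
    return verdict.strip().upper().startswith("YES")
```

Neither check compares the output text against a fixed expected string, which is why the same saved case can still be graded after a prompt or model change.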

It is at v0.1.1, so it is early and has rough edges, but the core loop works. The core has zero runtime dependencies and is framework-agnostic.

GitHub: github.com/TaimoorKhan10/replayd

If you are running agents in production, I would love your feedback on the grading approach. What are you catching manually right now that you wish were automated?
