<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: TaimoorKhan10</title>
    <description>The latest articles on DEV Community by TaimoorKhan10 (@taimoorkhan10).</description>
    <link>https://dev.to/taimoorkhan10</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1701821%2Fac7491bc-9599-4cde-981e-0ce47302f6b7.jpeg</url>
      <title>DEV Community: TaimoorKhan10</title>
      <link>https://dev.to/taimoorkhan10</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/taimoorkhan10"/>
    <language>en</language>
    <item>
      <title>I built an open source SDK to catch AI agent regressions before they ship.</title>
      <dc:creator>TaimoorKhan10</dc:creator>
      <pubDate>Sat, 30 May 2026 21:47:34 +0000</pubDate>
      <link>https://dev.to/taimoorkhan10/i-built-an-open-source-sdk-to-catch-ai-agent-regressions-before-they-ship-40ba</link>
      <guid>https://dev.to/taimoorkhan10/i-built-an-open-source-sdk-to-catch-ai-agent-regressions-before-they-ship-40ba</guid>
      <description>&lt;p&gt;I built an open source SDK to catch AI agent regressions before they ship&lt;/p&gt;

&lt;p&gt;You fix a bug in your agent. A week later you change the prompt or swap the model. The same bug comes back. Nobody notices until a user does.&lt;/p&gt;

&lt;p&gt;Regular software has regression tests for this. AI agents mostly do not. So I built replayd.&lt;/p&gt;

&lt;p&gt;When your agent fails, you capture that run and save it as a test. Before you ship a new version, you replay the saved failures against it. If the same failure comes back, you catch it before your users do.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;replayd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interesting part was grading. You cannot use exact output matching because LLMs are non deterministic. So replayd does not check the text. It checks whether the specific failure came back. Structural failures get deterministic assertions. Semantic ones get an LLM as judge. You assert on what the agent did, not what it said.&lt;/p&gt;

&lt;p&gt;It is v0.1.1, early, rough edges, but the core loop works. Zero runtime dependencies in the core. Framework agnostic.&lt;/p&gt;

&lt;p&gt;GitHub: github.com/TaimoorKhan10/replayd&lt;/p&gt;

&lt;p&gt;If you are running agents in production I would love your feedback on the grading approach. What are you catching manually right now that you wish was automated?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
