<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eugene Dayne Mawuli </title>
    <description>The latest articles on DEV Community by Eugene Dayne Mawuli  (@eugene001dayne).</description>
    <link>https://dev.to/eugene001dayne</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3836065%2F4fcce7d1-814a-47e8-9ba6-01d456db8def.jpeg</url>
      <title>DEV Community: Eugene Dayne Mawuli </title>
      <link>https://dev.to/eugene001dayne</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eugene001dayne"/>
    <language>en</language>
    <item>
      <title>Built an open-source reliability layer for AI agents: three tools, all live, zero infrastructure cost</title>
      <dc:creator>Eugene Dayne Mawuli </dc:creator>
      <pubDate>Sun, 29 Mar 2026 01:33:44 +0000</pubDate>
      <link>https://dev.to/eugene001dayne/built-an-open-source-reliability-layer-for-ai-agents-three-tools-all-live-zero-infrastructure-40ha</link>
      <guid>https://dev.to/eugene001dayne/built-an-open-source-reliability-layer-for-ai-agents-three-tools-all-live-zero-infrastructure-40ha</guid>
      <description>&lt;p&gt;Over the last few months I identified three problems that every developer building AI agents hits in production — and built a standalone open-source tool for each one.&lt;/p&gt;

&lt;p&gt;Together they form the &lt;strong&gt;Thread Suite.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem Space&lt;/strong&gt;&lt;br&gt;
When you deploy an AI agent to production, you face three specific failure modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Mode 1 — Structural corruption&lt;/strong&gt;&lt;br&gt;
Your agent returns conversational text instead of JSON. Or missing fields. Or wrong types. Your database gets dirty data. Your pipeline crashes silently.&lt;/p&gt;
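
&lt;p&gt;A toy illustration of that first failure, assuming a naive pipeline that feeds model text straight into &lt;code&gt;json.loads&lt;/code&gt; (the strings here are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# A typical structural-corruption failure: the model wraps the JSON payload
# in conversational text, so a naive parse blows up downstream.
raw = 'Sure! Here is the record you asked for: {"name": "Ada", "age": 36}'

try:
    record = json.loads(raw)
except json.JSONDecodeError as err:
    # Without a validation layer, this is where pipelines break silently.
    record = None
    print(f"parse failed: {err}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;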

&lt;p&gt;&lt;strong&gt;Failure Mode 2 — Behavior drift&lt;/strong&gt;&lt;br&gt;
Your agent starts behaving differently across runs. Hallucinating. Refusing. Formatting incorrectly. You find out when a user complains — not before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Mode 3 — Prompt degradation&lt;/strong&gt;&lt;br&gt;
You change a prompt and have no idea if performance improved or degraded. There's no version history. No metrics. No rollback.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Tools
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Iron-Thread&lt;/strong&gt;&lt;br&gt;
Middleware that sits between your AI model and your database. It validates output structure against a defined schema, blocks malformed payloads before they reach storage, and auto-corrects them using AI when an API key is available.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install iron-thread&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Live API: &lt;a href="https://iron-thread-production.up.railway.app/docs" rel="noopener noreferrer"&gt;https://iron-thread-production.up.railway.app/docs&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/eugene001dayne/iron-thread" rel="noopener noreferrer"&gt;https://github.com/eugene001dayne/iron-thread&lt;/a&gt;&lt;/p&gt;
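
&lt;p&gt;Iron-Thread's real API lives in the repo; the core idea (validate structure before the write) can be sketched in plain Python. All names below are hypothetical, not the library's interface:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of schema-gating an agent's output before a DB write.
# This is NOT Iron-Thread's actual API, just the core idea in plain Python.
SCHEMA = {"name": str, "age": int, "email": str}

def validate(payload, schema):
    """Return a list of problems; an empty list means the payload is safe to store."""
    problems = []
    for key, expected_type in schema.items():
        if key not in payload:
            problems.append(f"missing field: {key}")
        elif not isinstance(payload[key], expected_type):
            problems.append(f"wrong type for {key}: {type(payload[key]).__name__}")
    return problems

good = {"name": "Ada", "age": 36, "email": "ada@example.com"}
bad = {"name": "Ada", "age": "thirty-six"}

assert validate(good, SCHEMA) == []
assert len(validate(bad, SCHEMA)) == 2   # wrong type for age, missing email
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;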

&lt;p&gt;&lt;strong&gt;TestThread&lt;/strong&gt;&lt;br&gt;
pytest for AI agents. Define expected behavior, run tests, get pass/fail results with AI-powered diagnosis.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install testthread&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Live API: &lt;a href="https://test-thread-production.up.railway.app/docs" rel="noopener noreferrer"&gt;https://test-thread-production.up.railway.app/docs&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/eugene001dayne/test-thread" rel="noopener noreferrer"&gt;https://github.com/eugene001dayne/test-thread&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PromptThread&lt;/strong&gt;&lt;br&gt;
Git for prompts — with performance data attached. Version control, A/B testing, regression alerts that fire automatically when pass rate drops or latency spikes, and golden set testing that runs your critical cases against every new version.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install promptthread&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Live API: &lt;a href="https://prompt-thread.onrender.com/docs" rel="noopener noreferrer"&gt;https://prompt-thread.onrender.com/docs&lt;/a&gt;&lt;br&gt;
Dashboard: &lt;a href="https://prompt-thread-dashboard.lovable.app" rel="noopener noreferrer"&gt;https://prompt-thread-dashboard.lovable.app&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/eugene001dayne/prompt-thread" rel="noopener noreferrer"&gt;https://github.com/eugene001dayne/prompt-thread&lt;/a&gt;&lt;/p&gt;
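
&lt;p&gt;"Git for prompts" conceptually means each saved version carries its metrics, so a rollback target is just the best-scoring entry in the history. A minimal sketch with hypothetical field names, not PromptThread's schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of versioned prompts with performance data attached.
versions = [
    {"version": 1, "prompt": "Summarize: {text}", "pass_rate": 0.91},
    {"version": 2, "prompt": "Summarize briefly: {text}", "pass_rate": 0.84},
]

def best_version(history):
    """Pick the rollback target: highest pass rate, newest wins ties."""
    return max(history, key=lambda v: (v["pass_rate"], v["version"]))

assert best_version(versions)["version"] == 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;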

&lt;h2&gt;
  
  
  How They Connect
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Iron-Thread  → Did the AI return the right structure?
TestThread   → Did the agent do the right thing?
PromptThread → Is my prompt the best version of itself?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each tool works standalone. Together they form a complete reliability pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Build Stats&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One person&lt;/li&gt;
&lt;li&gt;Celeron processor, 4GB RAM, Windows, VS Code&lt;/li&gt;
&lt;li&gt;Stack: FastAPI, Supabase, Railway/Render, Lovable&lt;/li&gt;
&lt;li&gt;Infrastructure cost: $0, plus some help from Claude&lt;/li&gt;
&lt;li&gt;Time: a few weeks of focused building&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three tools are MIT licensed, open source, and free to self-host.&lt;/p&gt;

&lt;p&gt;What reliability problems are you hitting with your agents? Happy to answer any questions.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>Why I built a testing framework for AI agents (and how to use it)</title>
      <dc:creator>Eugene Dayne Mawuli </dc:creator>
      <pubDate>Fri, 20 Mar 2026 20:16:14 +0000</pubDate>
      <link>https://dev.to/eugene001dayne/why-i-built-a-testing-framework-for-ai-agents-and-how-to-use-it-a6c</link>
      <guid>https://dev.to/eugene001dayne/why-i-built-a-testing-framework-for-ai-agents-and-how-to-use-it-a6c</guid>
      <description>&lt;h2&gt;
  
  
  The problem nobody talks about
&lt;/h2&gt;

&lt;p&gt;Everyone is building AI agents right now. But here's what happens after you ship one:&lt;/p&gt;

&lt;p&gt;It breaks silently.&lt;/p&gt;

&lt;p&gt;Wrong output formats. Hallucinations. Failed tool calls. You find out when something downstream crashes — not before. By then it's already affected real users.&lt;/p&gt;

&lt;p&gt;I kept running into this and couldn't find a clean solution. So I built one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing TestThread
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;pytest for AI agents.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TestThread lets you define exactly what your agent should do, run it against your live endpoint, and get clear pass/fail results — with AI diagnosis explaining why something failed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;testthread
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;testthread&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TestThread&lt;/span&gt;

&lt;span class="n"&gt;tt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TestThread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gemini_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;suite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My Agent Tests&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-agent.com/run&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;suite_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Basic check&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 2 + 2?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;match_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contains&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Passed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What makes it different
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Semantic matching&lt;/strong&gt; — instead of checking if output contains an exact string, AI judges whether the &lt;em&gt;meaning&lt;/em&gt; matches. Your agent can say "The answer is four" and still pass a test expecting "4".&lt;/p&gt;
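
&lt;p&gt;The real judge is an LLM call; this toy stand-in only normalizes spelled-out digits, but it shows why exact string matching is too brittle:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy stand-in for semantic matching. The real judge is a model call, but
# even this tiny normalizer shows the gap exact matching leaves.
WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
         "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def normalize(text):
    tokens = text.lower().replace(".", "").split()
    return " ".join(WORDS.get(tok, tok) for tok in tokens)

def semantic_contains(output, expected):
    return normalize(expected) in normalize(output)

# Exact matching fails; the meaning-aware check passes.
assert "4" not in "The answer is four."
assert semantic_contains("The answer is four.", "4")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;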

&lt;p&gt;&lt;strong&gt;AI diagnosis&lt;/strong&gt; — when a test fails, Gemini explains exactly why and suggests a fix. Not just "failed" — actual actionable feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regression detection&lt;/strong&gt; — every run is compared against the previous one. If pass rate drops, you get flagged immediately.&lt;/p&gt;
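
&lt;p&gt;The check itself is easy to reason about: compare this run's pass rate with the previous run's. The tolerance below is an assumption for illustration, not TestThread's actual default:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the regression check: flag a run whose pass rate dropped.
def regressed(previous_rate, current_rate, tolerance=0.0):
    """True when the pass rate fell by more than the tolerance."""
    return (previous_rate - current_rate) &gt; tolerance

assert regressed(0.95, 0.80) is True    # pass rate fell: flag it
assert regressed(0.90, 0.92) is False   # improved: no flag
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;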

&lt;p&gt;&lt;strong&gt;PII detection&lt;/strong&gt; — automatically scans every agent output for emails, phone numbers, API keys, credit cards, SSNs. Auto-fails the test if found. Critical for production agents.&lt;/p&gt;
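
&lt;p&gt;A rough sketch of the scanning idea with illustrative regexes (TestThread's real patterns are more thorough than these):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Illustrative-only patterns; a production scanner needs far more coverage.
PII_PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "credit_card": r"\b(?:\d[ -]?){13,16}\b",
}

def scan_for_pii(text):
    """Return the names of every PII pattern found in an agent's output."""
    return [name for name, pattern in PII_PATTERNS.items()
            if re.search(pattern, text)]

leak = "Contact me at jane.doe@example.com, SSN 123-45-6789."
assert scan_for_pii(leak) == ["email", "ssn"]
assert scan_for_pii("All clear, nothing sensitive here.") == []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;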

&lt;p&gt;&lt;strong&gt;Trajectory assertions&lt;/strong&gt; — test not just what your agent returned, but how it got there. Did it call the right tools? Did it complete in under 5 steps? Did it avoid calling delete_user?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Set trajectory assertions
&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BASE&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/suites/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;suite_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/cases/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;case_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/assertions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_called&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_not_called&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delete_user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
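
&lt;p&gt;The call above registers assertions server-side; evaluating them against a recorded run boils down to something like this sketch (not TestThread's internals):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical evaluator: assertion shapes mirror the API call above.
def check_trajectory(assertions, tool_calls):
    failures = []
    for a in assertions:
        kind, value = a["type"], a["value"]
        if kind == "tool_called" and value not in tool_calls:
            failures.append(f"expected a call to {value}")
        elif kind == "tool_not_called" and value in tool_calls:
            failures.append(f"forbidden call to {value}")
        elif kind == "max_steps" and len(tool_calls) &gt; value:
            failures.append(f"took {len(tool_calls)} steps, limit {value}")
    return failures

assertions = [
    {"type": "tool_called", "value": "search"},
    {"type": "tool_not_called", "value": "delete_user"},
    {"type": "max_steps", "value": 5},
]
assert check_trajectory(assertions, ["search", "summarize"]) == []
assert check_trajectory(assertions, ["delete_user"]) == [
    "expected a call to search", "forbidden call to delete_user"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;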



&lt;p&gt;&lt;strong&gt;CI/CD integration&lt;/strong&gt; — one file in your repo and TestThread runs on every push. Fails the build if tests regress.&lt;/p&gt;
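
&lt;p&gt;Conceptually that CI step is "run the suite, exit nonzero on failures". A sketch with &lt;code&gt;run_suite&lt;/code&gt; stubbed out; in a real pipeline it would be the client call from the example above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Conceptual CI gate. `run_suite` is a stand-in for the real client call.
def run_suite():
    return {"passed": 11, "failed": 1}  # stubbed result for illustration

def ci_gate(result):
    """Return the process exit code CI should see: 0 on green, 1 otherwise."""
    if result["failed"] == 0:
        return 0
    print(f"{result['failed']} test(s) failed; failing the build.")
    return 1

exit_code = ci_gate(run_suite())
# In a real CI step: raise SystemExit(exit_code)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;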

&lt;p&gt;&lt;strong&gt;Scheduled runs&lt;/strong&gt; — run your test suite hourly, daily, or weekly automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  The live dashboard
&lt;/h2&gt;

&lt;p&gt;Everything is visible at &lt;a href="https://test-thread.lovable.app" rel="noopener noreferrer"&gt;test-thread.lovable.app&lt;/a&gt; — pass rates, regression flags, PII alerts, trajectory timelines, cost per run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part of a bigger suite
&lt;/h2&gt;

&lt;p&gt;TestThread is part of the Thread Suite — open source reliability tools for AI agents.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Iron-Thread&lt;/strong&gt; — validates AI output structure before it hits your database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TestThread&lt;/strong&gt; — tests whether your agent behaves correctly across runs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PromptThread&lt;/strong&gt; — versions and tracks prompt performance &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;testthread
&lt;span class="c"&gt;# or&lt;/span&gt;
npm &lt;span class="nb"&gt;install &lt;/span&gt;testthread
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub: &lt;a href="https://github.com/eugene001dayne/test-thread" rel="noopener noreferrer"&gt;github.com/eugene001dayne/test-thread&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Live API: &lt;a href="https://test-thread-production.up.railway.app" rel="noopener noreferrer"&gt;test-thread-production.up.railway.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Would love feedback from anyone building agents. What testing problems are you running into that TestThread doesn't solve yet?&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>showdev</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
