<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eugene Dayne Mawuli </title>
    <description>The latest articles on DEV Community by Eugene Dayne Mawuli  (@eugene001dayne).</description>
    <link>https://dev.to/eugene001dayne</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3836065%2F4fcce7d1-814a-47e8-9ba6-01d456db8def.jpeg</url>
      <title>DEV Community: Eugene Dayne Mawuli </title>
      <link>https://dev.to/eugene001dayne</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eugene001dayne"/>
    <language>en</language>
    <item>
      <title>Built an open-source reliability layer for AI agents: three tools, all live, zero infrastructure cost</title>
      <dc:creator>Eugene Dayne Mawuli </dc:creator>
      <pubDate>Sun, 29 Mar 2026 01:33:44 +0000</pubDate>
      <link>https://dev.to/eugene001dayne/built-an-open-source-reliability-layer-for-ai-agents-three-tools-all-live-zero-infrastructure-40ha</link>
      <guid>https://dev.to/eugene001dayne/built-an-open-source-reliability-layer-for-ai-agents-three-tools-all-live-zero-infrastructure-40ha</guid>
      <description>&lt;p&gt;Over the last few months I identified three problems that every developer building AI agents hits in production — and built a standalone open-source tool for each one.&lt;/p&gt;

&lt;p&gt;Together they form the &lt;strong&gt;Thread Suite.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem Space&lt;/strong&gt;&lt;br&gt;
When you deploy an AI agent to production, you face three specific failure modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Mode 1 — Structural corruption&lt;/strong&gt;&lt;br&gt;
Your agent returns conversational text instead of JSON. Or missing fields. Or wrong types. Your database gets dirty data. Your pipeline crashes silently.&lt;/p&gt;
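
&lt;p&gt;A toy illustration of that first failure, assuming a naive pipeline that feeds model text straight into &lt;code&gt;json.loads&lt;/code&gt; (the strings here are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# A typical structural-corruption failure: the model wraps the JSON payload
# in conversational text, so a naive parse blows up downstream.
raw = 'Sure! Here is the record you asked for: {"name": "Ada", "age": 36}'

try:
    record = json.loads(raw)
except json.JSONDecodeError as err:
    # Without a validation layer, this is where pipelines break silently.
    record = None
    print(f"parse failed: {err}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;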

&lt;p&gt;&lt;strong&gt;Failure Mode 2 — Behavior drift&lt;/strong&gt;&lt;br&gt;
Your agent starts behaving differently across runs. Hallucinating. Refusing. Formatting incorrectly. You find out when a user complains — not before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Mode 3 — Prompt degradation&lt;/strong&gt;&lt;br&gt;
You change a prompt and have no idea if performance improved or degraded. There's no version history. No metrics. No rollback.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Tools
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Iron-Thread&lt;/strong&gt;&lt;br&gt;
Middleware that sits between your AI model and your database. It validates output structure against a defined schema, blocks malformed payloads before they reach storage, and auto-corrects them using AI when an API key is available.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install iron-thread&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Live API: &lt;a href="https://iron-thread-production.up.railway.app/docs" rel="noopener noreferrer"&gt;https://iron-thread-production.up.railway.app/docs&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/eugene001dayne/iron-thread" rel="noopener noreferrer"&gt;https://github.com/eugene001dayne/iron-thread&lt;/a&gt;&lt;/p&gt;
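
&lt;p&gt;Iron-Thread's real API lives in the repo; the core idea (validate structure before the write) can be sketched in plain Python. All names below are hypothetical, not the library's interface:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of schema-gating an agent's output before a DB write.
# This is NOT Iron-Thread's actual API, just the core idea in plain Python.
SCHEMA = {"name": str, "age": int, "email": str}

def validate(payload, schema):
    """Return a list of problems; an empty list means the payload is safe to store."""
    problems = []
    for key, expected_type in schema.items():
        if key not in payload:
            problems.append(f"missing field: {key}")
        elif not isinstance(payload[key], expected_type):
            problems.append(f"wrong type for {key}: {type(payload[key]).__name__}")
    return problems

good = {"name": "Ada", "age": 36, "email": "ada@example.com"}
bad = {"name": "Ada", "age": "thirty-six"}

assert validate(good, SCHEMA) == []
assert len(validate(bad, SCHEMA)) == 2   # wrong type for age, missing email
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;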

&lt;p&gt;&lt;strong&gt;TestThread&lt;/strong&gt;&lt;br&gt;
pytest for AI agents. Define expected behavior, run tests, get pass/fail results with AI-powered diagnosis.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install testthread&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Live API: &lt;a href="https://test-thread-production.up.railway.app/docs" rel="noopener noreferrer"&gt;https://test-thread-production.up.railway.app/docs&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/eugene001dayne/test-thread" rel="noopener noreferrer"&gt;https://github.com/eugene001dayne/test-thread&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PromptThread&lt;/strong&gt;&lt;br&gt;
Git for prompts — with performance data attached. Version control, A/B testing, regression alerts that fire automatically when pass rate drops or latency spikes, and golden set testing that runs your critical cases against every new version.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install promptthread&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Live API: &lt;a href="https://prompt-thread.onrender.com/docs" rel="noopener noreferrer"&gt;https://prompt-thread.onrender.com/docs&lt;/a&gt;&lt;br&gt;
Dashboard: &lt;a href="https://prompt-thread-dashboard.lovable.app" rel="noopener noreferrer"&gt;https://prompt-thread-dashboard.lovable.app&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/eugene001dayne/prompt-thread" rel="noopener noreferrer"&gt;https://github.com/eugene001dayne/prompt-thread&lt;/a&gt;&lt;/p&gt;
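
&lt;p&gt;"Git for prompts" conceptually means each saved version carries its metrics, so a rollback target is just the best-scoring entry in the history. A minimal sketch with hypothetical field names, not PromptThread's schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of versioned prompts with performance data attached.
versions = [
    {"version": 1, "prompt": "Summarize: {text}", "pass_rate": 0.91},
    {"version": 2, "prompt": "Summarize briefly: {text}", "pass_rate": 0.84},
]

def best_version(history):
    """Pick the rollback target: highest pass rate, newest wins ties."""
    return max(history, key=lambda v: (v["pass_rate"], v["version"]))

assert best_version(versions)["version"] == 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;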

&lt;h2&gt;
  
  
  How They Connect
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Iron-Thread  → Did the AI return the right structure?
TestThread   → Did the agent do the right thing?
PromptThread → Is my prompt the best version of itself?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each tool works standalone. Together they form a complete reliability pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Build Stats&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One person&lt;/li&gt;
&lt;li&gt;Celeron processor, 4GB RAM, Windows, VS Code&lt;/li&gt;
&lt;li&gt;Stack: FastAPI, Supabase, Railway/Render, Lovable&lt;/li&gt;
&lt;li&gt;Infrastructure cost: $0, plus some help from Claude&lt;/li&gt;
&lt;li&gt;Time: a few weeks of focused building&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three tools are MIT licensed, open source, and free to self-host.&lt;/p&gt;

&lt;p&gt;What reliability problems are you hitting with your agents? Happy to answer any questions.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>Why I built a testing framework for AI agents (and how to use it)</title>
      <dc:creator>Eugene Dayne Mawuli </dc:creator>
      <pubDate>Fri, 20 Mar 2026 20:16:14 +0000</pubDate>
      <link>https://dev.to/eugene001dayne/why-i-built-a-testing-framework-for-ai-agents-and-how-to-use-it-a6c</link>
      <guid>https://dev.to/eugene001dayne/why-i-built-a-testing-framework-for-ai-agents-and-how-to-use-it-a6c</guid>
      <description>&lt;h2&gt;
  
  
  The problem nobody talks about
&lt;/h2&gt;

&lt;p&gt;Everyone is building AI agents right now. But here's what happens after you ship one:&lt;/p&gt;

&lt;p&gt;It breaks silently.&lt;/p&gt;

&lt;p&gt;Wrong output formats. Hallucinations. Failed tool calls. You find out when something downstream crashes — not before. By then it's already affected real users.&lt;/p&gt;

&lt;p&gt;I kept running into this and couldn't find a clean solution. So I built one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing TestThread
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;pytest for AI agents.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TestThread lets you define exactly what your agent should do, run it against your live endpoint, and get clear pass/fail results — with AI diagnosis explaining why something failed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;testthread
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;testthread&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TestThread&lt;/span&gt;

&lt;span class="n"&gt;tt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TestThread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gemini_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;suite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My Agent Tests&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-agent.com/run&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;suite_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Basic check&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 2 + 2?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;match_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contains&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Passed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What makes it different
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Semantic matching&lt;/strong&gt; — instead of checking if output contains an exact string, AI judges whether the &lt;em&gt;meaning&lt;/em&gt; matches. Your agent can say "The answer is four" and still pass a test expecting "4".&lt;/p&gt;
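
&lt;p&gt;The real judge is an LLM call; this toy stand-in only normalizes spelled-out digits, but it shows why exact string matching is too brittle:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy stand-in for semantic matching. The real judge is a model call, but
# even this tiny normalizer shows the gap exact matching leaves.
WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
         "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def normalize(text):
    tokens = text.lower().replace(".", "").split()
    return " ".join(WORDS.get(tok, tok) for tok in tokens)

def semantic_contains(output, expected):
    return normalize(expected) in normalize(output)

# Exact matching fails; the meaning-aware check passes.
assert "4" not in "The answer is four."
assert semantic_contains("The answer is four.", "4")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;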

&lt;p&gt;&lt;strong&gt;AI diagnosis&lt;/strong&gt; — when a test fails, Gemini explains exactly why and suggests a fix. Not just "failed" — actual actionable feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regression detection&lt;/strong&gt; — every run is compared against the previous one. If pass rate drops, you get flagged immediately.&lt;/p&gt;
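
&lt;p&gt;The check itself is easy to reason about: compare this run's pass rate with the previous run's. The tolerance below is an assumption for illustration, not TestThread's actual default:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the regression check: flag a run whose pass rate dropped.
def regressed(previous_rate, current_rate, tolerance=0.0):
    """True when the pass rate fell by more than the tolerance."""
    return (previous_rate - current_rate) &gt; tolerance

assert regressed(0.95, 0.80) is True    # pass rate fell: flag it
assert regressed(0.90, 0.92) is False   # improved: no flag
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;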

&lt;p&gt;&lt;strong&gt;PII detection&lt;/strong&gt; — automatically scans every agent output for emails, phone numbers, API keys, credit cards, SSNs. Auto-fails the test if found. Critical for production agents.&lt;/p&gt;
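
&lt;p&gt;A rough sketch of the scanning idea with illustrative regexes (TestThread's real patterns are more thorough than these):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Illustrative-only patterns; a production scanner needs far more coverage.
PII_PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "credit_card": r"\b(?:\d[ -]?){13,16}\b",
}

def scan_for_pii(text):
    """Return the names of every PII pattern found in an agent's output."""
    return [name for name, pattern in PII_PATTERNS.items()
            if re.search(pattern, text)]

leak = "Contact me at jane.doe@example.com, SSN 123-45-6789."
assert scan_for_pii(leak) == ["email", "ssn"]
assert scan_for_pii("All clear, nothing sensitive here.") == []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;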

&lt;p&gt;&lt;strong&gt;Trajectory assertions&lt;/strong&gt; — test not just what your agent returned, but how it got there. Did it call the right tools? Did it complete in under 5 steps? Did it avoid calling delete_user?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Set trajectory assertions
&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BASE&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/suites/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;suite_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/cases/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;case_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/assertions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_called&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_not_called&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delete_user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
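
&lt;p&gt;The call above registers assertions server-side; evaluating them against a recorded run boils down to something like this sketch (not TestThread's internals):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical evaluator: assertion shapes mirror the API call above.
def check_trajectory(assertions, tool_calls):
    failures = []
    for a in assertions:
        kind, value = a["type"], a["value"]
        if kind == "tool_called" and value not in tool_calls:
            failures.append(f"expected a call to {value}")
        elif kind == "tool_not_called" and value in tool_calls:
            failures.append(f"forbidden call to {value}")
        elif kind == "max_steps" and len(tool_calls) &gt; value:
            failures.append(f"took {len(tool_calls)} steps, limit {value}")
    return failures

assertions = [
    {"type": "tool_called", "value": "search"},
    {"type": "tool_not_called", "value": "delete_user"},
    {"type": "max_steps", "value": 5},
]
assert check_trajectory(assertions, ["search", "summarize"]) == []
assert check_trajectory(assertions, ["delete_user"]) == [
    "expected a call to search", "forbidden call to delete_user"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;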



&lt;p&gt;&lt;strong&gt;CI/CD integration&lt;/strong&gt; — one file in your repo and TestThread runs on every push. Fails the build if tests regress.&lt;/p&gt;
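
&lt;p&gt;Conceptually that CI step is "run the suite, exit nonzero on failures". A sketch with &lt;code&gt;run_suite&lt;/code&gt; stubbed out; in a real pipeline it would be the client call from the example above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Conceptual CI gate. `run_suite` is a stand-in for the real client call.
def run_suite():
    return {"passed": 11, "failed": 1}  # stubbed result for illustration

def ci_gate(result):
    """Return the process exit code CI should see: 0 on green, 1 otherwise."""
    if result["failed"] == 0:
        return 0
    print(f"{result['failed']} test(s) failed; failing the build.")
    return 1

exit_code = ci_gate(run_suite())
# In a real CI step: raise SystemExit(exit_code)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;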

&lt;p&gt;&lt;strong&gt;Scheduled runs&lt;/strong&gt; — run your test suite hourly, daily, or weekly automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  The live dashboard
&lt;/h2&gt;

&lt;p&gt;Everything is visible at &lt;a href="https://test-thread.lovable.app" rel="noopener noreferrer"&gt;test-thread.lovable.app&lt;/a&gt; — pass rates, regression flags, PII alerts, trajectory timelines, cost per run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part of a bigger suite
&lt;/h2&gt;

&lt;p&gt;TestThread is part of the Thread Suite — open source reliability tools for AI agents.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Iron-Thread&lt;/strong&gt; — validates AI output structure before it hits your database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TestThread&lt;/strong&gt; — tests whether your agent behaves correctly across runs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PromptThread&lt;/strong&gt; — versions and tracks prompt performance &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;testthread
&lt;span class="c"&gt;# or&lt;/span&gt;
npm &lt;span class="nb"&gt;install &lt;/span&gt;testthread
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub: &lt;a href="https://github.com/eugene001dayne/test-thread" rel="noopener noreferrer"&gt;github.com/eugene001dayne/test-thread&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Live API: &lt;a href="https://test-thread-production.up.railway.app" rel="noopener noreferrer"&gt;test-thread-production.up.railway.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Would love feedback from anyone building agents. What testing problems are you running into that TestThread doesn't solve yet?&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>showdev</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
