Eugene Dayne Mawuli

Why I built a testing framework for AI agents (and how to use it)

The problem nobody talks about

Everyone is building AI agents right now. But here's what happens after you ship one:

It breaks silently.

Wrong output formats. Hallucinations. Failed tool calls. You find out when something downstream crashes — not before. By then it's already affected real users.

I kept running into this and couldn't find a clean solution. So I built one.

Introducing TestThread

pytest for AI agents.

TestThread lets you define exactly what your agent should do, run it against your live endpoint, and get clear pass/fail results — with AI diagnosis explaining why something failed.

```shell
pip install testthread
```
```python
from testthread import TestThread

tt = TestThread(gemini_key="your-key")

suite = tt.create_suite(
    name="My Agent Tests",
    agent_endpoint="https://your-agent.com/run"
)

tt.add_case(
    suite_id=suite["id"],
    name="Basic check",
    input="What is 2 + 2?",
    expected_output="4",
    match_type="contains"
)

result = tt.run_suite(suite["id"])
print(f"Passed: {result['passed']} | Failed: {result['failed']}")
```

What makes it different

Semantic matching — instead of checking if output contains an exact string, AI judges whether the meaning matches. Your agent can say "The answer is four" and still pass a test expecting "4".
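To make the idea concrete, here is a toy illustration of why an exact `contains` check is too strict for the "4" example. This is not how TestThread works internally (the real judge is Gemini); it is only a naive local approximation for intuition.

```python
import re

# Toy illustration only: TestThread delegates this judgment to Gemini.
# A naive local approximation might normalize number words to digits
# before comparing.
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3",
                "four": "4", "five": "5"}

def naive_semantic_match(expected: str, output: str) -> bool:
    """Check whether `expected` appears in `output` after normalizing
    spelled-out numbers to digits."""
    normalized = output.lower()
    for word, digit in NUMBER_WORDS.items():
        normalized = re.sub(rf"\b{word}\b", digit, normalized)
    return expected.lower() in normalized

print(naive_semantic_match("4", "The answer is four"))  # True
```

A raw `contains` check on "4" fails against "The answer is four"; the AI judge handles far more variation than this normalization trick ever could.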

AI diagnosis — when a test fails, Gemini explains exactly why and suggests a fix. Not just "failed" — actual actionable feedback.

Regression detection — every run is compared against the previous one. If pass rate drops, you get flagged immediately.

PII detection — automatically scans every agent output for emails, phone numbers, API keys, credit cards, SSNs. Auto-fails the test if found. Critical for production agents.
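For intuition, a stripped-down version of that scan might look like the sketch below. The patterns here are illustrative assumptions, not TestThread's actual regexes, which presumably cover many more formats and edge cases.

```python
import re

# Illustrative patterns only: TestThread's real scanner is not shown
# here and likely covers more PII formats (API keys, credit cards, etc.).
PII_PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "phone": r"\b\d{3}[-.]\d{3}[-.]\d{4}\b",
}

def scan_for_pii(text: str) -> list[str]:
    """Return the names of PII types detected in an agent output."""
    return [name for name, pattern in PII_PATTERNS.items()
            if re.search(pattern, text)]

print(scan_for_pii("Reach me at alice@example.com"))  # ['email']
```

In TestThread, a non-empty result like this auto-fails the test case, so a leaky agent never slips into production silently.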

Trajectory assertions — test not just what your agent returned, but how it got there. Did it call the right tools? Did it complete in under 5 steps? Did it avoid calling delete_user?

```python
import requests

# Set trajectory assertions.
# BASE is your TestThread API base URL; suite_id and case_id come from
# the suite and case created earlier.
requests.post(f"{BASE}/suites/{suite_id}/cases/{case_id}/assertions", json=[
    {"type": "tool_called", "value": "search"},
    {"type": "tool_not_called", "value": "delete_user"},
    {"type": "max_steps", "value": 5}
])
```

CI/CD integration — one file in your repo and TestThread runs on every push. Fails the build if tests regress.
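Failing the build ultimately comes down to exit codes. A CI step could apply gating logic like this sketch, assuming the `{"passed": ..., "failed": ...}` result shape from the quick-start example; the actual TestThread CI file may do more, such as the regression comparison described above.

```python
import sys

def ci_gate(result: dict) -> int:
    """Map a suite result to a CI exit code.

    Assumes the {"passed": ..., "failed": ...} dict returned by
    run_suite in the quick-start example.
    """
    if result["failed"] > 0:
        print(f"{result['failed']} test(s) failed, failing the build")
        return 1
    print(f"All {result['passed']} tests passed")
    return 0

# In a CI step, something like:
# sys.exit(ci_gate(tt.run_suite(suite["id"])))
```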

Scheduled runs — run your test suite hourly, daily, or weekly automatically.

The live dashboard

Everything is visible at test-thread.lovable.app — pass rates, regression flags, PII alerts, trajectory timelines, cost per run.

Part of a bigger suite

TestThread is part of the Thread Suite — open source reliability tools for AI agents.

  • Iron-Thread — validates AI output structure before it hits your database
  • TestThread — tests whether your agent behaves correctly across runs
  • PromptThread — versions and tracks prompt performance (coming soon)

Try it

```shell
pip install testthread
# or
npm install testthread
```

GitHub: github.com/eugene001dayne/test-thread

Live API: test-thread-production.up.railway.app

Would love feedback from anyone building agents. What testing problems are you running into that TestThread doesn't solve yet?
