The problem nobody talks about
Everyone is building AI agents right now. But here's what happens after you ship one:
It breaks silently.
Wrong output formats. Hallucinations. Failed tool calls. You find out when something downstream crashes — not before. By then it's already affected real users.
I kept running into this and couldn't find a clean solution. So I built one.
Introducing TestThread
pytest for AI agents.
TestThread lets you define exactly what your agent should do, run it against your live endpoint, and get clear pass/fail results — with AI diagnosis explaining why something failed.
```bash
pip install testthread
```

```python
from testthread import TestThread

tt = TestThread(gemini_key="your-key")

suite = tt.create_suite(
    name="My Agent Tests",
    agent_endpoint="https://your-agent.com/run"
)

tt.add_case(
    suite_id=suite["id"],
    name="Basic check",
    input="What is 2 + 2?",
    expected_output="4",
    match_type="contains"
)

result = tt.run_suite(suite["id"])
print(f"Passed: {result['passed']} | Failed: {result['failed']}")
```
What makes it different
Semantic matching — instead of checking if output contains an exact string, AI judges whether the meaning matches. Your agent can say "The answer is four" and still pass a test expecting "4".
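To see why plain substring matching falls short, here's a deliberately naive sketch. The real judge is an LLM; this toy version only normalizes spelled-out English numbers, purely to illustrate the gap between literal and semantic matching (`naive_contains` and `semantic_contains` are illustrative names, not TestThread API):

```python
# Toy illustration: literal "contains" vs. a meaning-aware check.
# TestThread's actual semantic matching uses an LLM judge; this sketch
# only handles spelled-out digits, to show why literal matching fails.

NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3",
                "four": "4", "five": "5", "six": "6", "seven": "7",
                "eight": "8", "nine": "9"}

def naive_contains(output: str, expected: str) -> bool:
    # Literal substring check: brittle against paraphrased answers.
    return expected in output

def semantic_contains(output: str, expected: str) -> bool:
    # Normalize spelled-out numbers to digits before matching.
    words = [NUMBER_WORDS.get(w.strip(".,!?").lower(), w) for w in output.split()]
    return expected in " ".join(words)

print(naive_contains("The answer is four", "4"))     # False
print(semantic_contains("The answer is four", "4"))  # True
```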
AI diagnosis — when a test fails, Gemini explains exactly why and suggests a fix. Not just "failed" — actual actionable feedback.
Regression detection — every run is compared against the previous one. If pass rate drops, you get flagged immediately.
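The comparison itself is simple; a minimal sketch of the flagging logic might look like this (function names are illustrative, not part of the TestThread API):

```python
# Hypothetical sketch of regression flagging: compare the latest run's
# pass rate against the previous run's and flag any drop.

def pass_rate(result: dict) -> float:
    total = result["passed"] + result["failed"]
    return result["passed"] / total if total else 0.0

def detect_regression(previous: dict, current: dict, tolerance: float = 0.0) -> bool:
    """True if the current pass rate dropped below the previous one."""
    return pass_rate(current) < pass_rate(previous) - tolerance

prev = {"passed": 18, "failed": 2}   # 90% pass rate
curr = {"passed": 15, "failed": 5}   # 75% pass rate
print(detect_regression(prev, curr))  # True: pass rate dropped
```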
PII detection — automatically scans every agent output for emails, phone numbers, API keys, credit cards, SSNs. Auto-fails the test if found. Critical for production agents.
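As a rough idea of what a scan like this does, here's a simplified regex-based sketch. The patterns below are deliberately minimal and are not TestThread's actual rules:

```python
import re

# Simplified sketch of regex-based PII scanning. Real-world patterns
# (and TestThread's own) are stricter; these are illustrative only.

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_for_pii(output: str) -> list[str]:
    """Return the names of PII categories found in an agent's output."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(output)]

print(scan_for_pii("Contact me at jane@example.com"))  # ['email']
print(scan_for_pii("The answer is 4"))                 # []
```

A test runner would auto-fail any case whose output yields a non-empty list.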
Trajectory assertions — test not just what your agent returned, but how it got there. Did it call the right tools? Did it complete in under 5 steps? Did it avoid calling delete_user?
```python
import requests

# Set trajectory assertions on an existing test case.
# BASE, suite_id, and case_id come from earlier setup.
requests.post(f"{BASE}/suites/{suite_id}/cases/{case_id}/assertions", json=[
    {"type": "tool_called", "value": "search"},
    {"type": "tool_not_called", "value": "delete_user"},
    {"type": "max_steps", "value": 5}
])
```
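Conceptually, each assertion is a predicate over the recorded trajectory. Here's a hypothetical sketch of how the three assertion types above could be evaluated against a list of tool calls; TestThread's server-side logic may differ:

```python
# Hypothetical evaluator for trajectory assertions, checked against a
# recorded list of the tools an agent actually called.

def evaluate_assertion(assertion: dict, tool_calls: list[str]) -> bool:
    kind, value = assertion["type"], assertion["value"]
    if kind == "tool_called":
        return value in tool_calls
    if kind == "tool_not_called":
        return value not in tool_calls
    if kind == "max_steps":
        return len(tool_calls) <= value
    raise ValueError(f"unknown assertion type: {kind}")

trajectory = ["search", "summarize"]  # tools the agent actually called
assertions = [
    {"type": "tool_called", "value": "search"},
    {"type": "tool_not_called", "value": "delete_user"},
    {"type": "max_steps", "value": 5},
]
print(all(evaluate_assertion(a, trajectory) for a in assertions))  # True
```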
CI/CD integration — one file in your repo and TestThread runs on every push. Fails the build if tests regress.
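The mechanism that fails the build is just the exit code. A minimal sketch of that gate, with the suite result stubbed out (in a real pipeline the dict would come from `tt.run_suite(suite["id"])`, and `build_exit_code` is an illustrative name, not TestThread API):

```python
# Hypothetical CI gate: a non-zero exit code is what turns the build red.

def build_exit_code(result: dict) -> int:
    """Return 0 when every test passed, 1 otherwise."""
    return 0 if result["failed"] == 0 else 1

# A run with one failure should fail the build:
print(build_exit_code({"passed": 11, "failed": 1}))  # 1
# A clean run lets the pipeline continue:
print(build_exit_code({"passed": 12, "failed": 0}))  # 0
# In CI you'd finish with: sys.exit(build_exit_code(result))
```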
Scheduled runs — run your test suite hourly, daily, or weekly automatically.
The live dashboard
Everything is visible at test-thread.lovable.app — pass rates, regression flags, PII alerts, trajectory timelines, cost per run.
Part of a bigger suite
TestThread is part of the Thread Suite — open source reliability tools for AI agents.
- Iron-Thread — validates AI output structure before it hits your database
- TestThread — tests whether your agent behaves correctly across runs
- PromptThread — versions and tracks prompt performance (coming soon)
Try it
```bash
pip install testthread
# or
npm install testthread
```
GitHub: github.com/eugene001dayne/test-thread
Live API: test-thread-production.up.railway.app
Would love feedback from anyone building agents. What testing problems are you running into that TestThread doesn't solve yet?