I Built pytest for AI Agents — Here's What I Learned

#ai #python #opensource #testing

Every AI agent developer knows this pain:

You run your agent. It works. You run it again. It doesn't. You have no idea why. You check the logs — there are no logs. You check the cost — $4.20 for a single run. You cry.

I got tired of this, so I built AgentProbe — an open-source testing framework for AI agents. Think pytest, but for LLM-powered agents.

What it does

Record — Capture every LLM call, tool call, and decision your agent makes.

Test — Run 35+ built-in assertions: cost limits, quality checks, safety validation, PII detection.

Replay — Swap models and compare results. "What happens if I switch from GPT-4o to Claude?"

Fuzz — 55 prompt injection attacks built-in. Find vulnerabilities before your users do.

The features nobody asked for (but everyone loves)

Agent Roast — Run agentprobe roast and get a brutally honest (and funny) analysis of your agent:

"Your agent spends money like a drunk sailor at a token store. Cost grade: D"

X-Ray Mode — Visualize exactly how your agent thinks, step by step, with costs per step.

Cost Calculator — Find out what your agent REALLY costs per month. Spoiler: more than you think.

Health Check — Get a 0-100 health score across 5 dimensions: reliability, speed, cost, security, quality.

Injection Playground — 55 built-in prompt injection attacks across 5 categories. Test your agent's defenses.

Quick start

Install:

pip install agentprobe

Record your agent:

agentprobe record my_agent.py

Run tests:

agentprobe test

Get roasted:

agentprobe roast my_agent.py --level savage

What I learned building this

AI agents are black boxes by default. Nobody builds logging into their agents. When something breaks, you're guessing.
Cost is the silent killer. Most developers have no idea what their agents cost per run. I've seen agents that cost $5 per query because of redundant LLM calls.
Security is an afterthought. Most agents are vulnerable to basic prompt injection. A simple "ignore previous instructions" breaks 60% of agents I've tested.
Testing AI is different from testing code. You can't just assert output == expected. You need statistical assertions — "quality above 80% over 100 runs."