March 26, 2026 · 13 min read
# How to Test AI Agents: A Practical Guide to Evals, Benchmarks & CI (2026)
You've built an AI agent. It works in your demo. But how do you know it'll work tomorrow? Or after you change the prompt? Or when OpenAI updates GPT-4o and your carefully-tuned behavior shifts?
Testing AI agents is fundamentally different from testing traditional software. The outputs are non-deterministic, the behavior depends on external APIs, and "correct" is often subjective. But that doesn't mean you can't test them rigorously. Here's how.
## Why Agent Testing Is Different
Traditional software testing relies on determinism: given input X, expect output Y. AI agents break this assumption in three ways:
- **Non-deterministic outputs.** The same prompt can produce different responses. Even with `temperature=0`, model updates can change behavior.
- **Multi-step execution.** Agents don't just return a response — they take actions, use tools, and make decisions across multiple steps. A bug might only appear at step 7 of a 10-step workflow.
- **External dependencies.** Agents call APIs, browse the web, execute code. Your test environment needs to handle these without hitting production systems (or racking up API bills).
**The testing paradox:** The more autonomous your agent, the harder it is to test. A chatbot that answers questions has a small behavior space. An agent that can write code, call APIs, and make decisions has an almost infinite one. You can't test every path — you need to test the right paths.
## The 5 Levels of Agent Testing
### Level 1: Unit Tests (Component Level)
Test individual components in isolation: parsers, formatters, tool handlers, prompt templates. These are deterministic and fast.
```python
# Test your tool handlers independently
def test_search_tool_parses_results():
    raw_response = {"results": [{"title": "AI News", "url": "https://example.com"}]}
    parsed = parse_search_results(raw_response)
    assert len(parsed) == 1
    assert parsed[0]["title"] == "AI News"

def test_prompt_template_includes_context():
    template = build_prompt(
        task="Write a summary",
        context="Article about AI agents",
        constraints=["Max 200 words", "Include sources"],
    )
    assert "Article about AI agents" in template
    assert "Max 200 words" in template
```
**What to test:** Input parsing, output formatting, tool wrappers, error handling, prompt construction.
**What NOT to test here:** LLM responses, end-to-end workflows, agent decisions.
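Error handling deserves explicit unit tests at this level, too. A minimal sketch, assuming a hypothetical `parse_search_results` wrapper that validates its input:

```python
def parse_search_results(raw_response):
    """Hypothetical tool wrapper: extract result entries, failing loudly on bad input."""
    if not isinstance(raw_response, dict) or "results" not in raw_response:
        raise ValueError("malformed search response")
    return [{"title": r["title"], "url": r["url"]} for r in raw_response["results"]]

def test_rejects_malformed_response():
    # A wrapper that silently returned [] here would hide upstream API breakage
    try:
        parse_search_results({"unexpected": "shape"})
    except ValueError:
        return  # expected: the wrapper refuses bad payloads
    raise AssertionError("malformed payload was silently accepted")
```

The point is that a tool wrapper which quietly returns an empty result on garbage input turns an obvious API failure into a subtle agent misbehavior several steps later.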
### Level 2: Eval Tests (LLM Output Quality)
Evals are the core of agent testing. They assess whether the LLM's outputs meet your quality criteria. There are three approaches:
**Exact match:** For structured outputs (JSON, specific formats).
```python
def test_agent_returns_valid_json():
    response = agent.run("List the top 3 AI frameworks")
    data = json.loads(response)
    assert isinstance(data, list)
    assert len(data) == 3
    assert all("name" in item for item in data)
```
**Rubric-based (LLM-as-judge):** Use a second LLM to evaluate the first one's output.
```python
def eval_with_judge(agent_output, task_description):
    judge_prompt = f"""Rate this agent output on a scale of 1-5 for:
1. Accuracy: Does it correctly address the task?
2. Completeness: Does it cover all aspects?
3. Clarity: Is it well-organized and clear?

Task: {task_description}
Output: {agent_output}

Return JSON: {{"accuracy": N, "completeness": N, "clarity": N}}"""
    scores = llm.call(judge_prompt)
    return json.loads(scores)

# In your test
result = agent.run("Explain how RAG works")
scores = eval_with_judge(result, "Explain how RAG works")
assert scores["accuracy"] >= 4
assert scores["completeness"] >= 3
**Human eval:** For subjective quality (tone, creativity, persuasiveness). Expensive but sometimes necessary.
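One caveat with LLM-as-judge: the judge is non-deterministic too. A cheap way to stabilize it is to run the judge a few times and assert on the mean score rather than a single noisy sample. A sketch, with a hypothetical `aggregate_judge_scores` helper:

```python
from statistics import mean

def aggregate_judge_scores(runs):
    """Average per-criterion scores across repeated judge calls.

    `runs` is a list of dicts like {"accuracy": 4, "completeness": 3, "clarity": 5}.
    """
    criteria = runs[0].keys()
    return {c: mean(r[c] for r in runs) for c in criteria}

# e.g. three repeated eval_with_judge(...) calls on the same output
runs = [
    {"accuracy": 4, "completeness": 3, "clarity": 5},
    {"accuracy": 5, "completeness": 3, "clarity": 4},
    {"accuracy": 4, "completeness": 4, "clarity": 5},
]
scores = aggregate_judge_scores(runs)
assert scores["accuracy"] >= 4  # asserts on the mean, not one noisy sample
```

Three judge calls triples the eval cost, so reserve this for criteria where a single flaky score would block a release.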
### Level 3: Trajectory Tests (Multi-Step Behavior)
Agents don't just produce outputs — they take sequences of actions. Trajectory tests verify the agent chose the right tools, in the right order, with the right parameters.
```python
def test_research_agent_trajectory():
    agent = ResearchAgent(tools=[search, scrape, summarize])
    result = agent.run("What's new in AI agents this week?")

    # Verify the agent used the right tools in a reasonable order
    trajectory = agent.get_trajectory()

    # Should search first
    assert trajectory[0]["tool"] == "search"
    assert "AI agents" in trajectory[0]["input"]

    # Should scrape at least 2 results
    scrape_steps = [s for s in trajectory if s["tool"] == "scrape"]
    assert len(scrape_steps) >= 2

    # Should summarize at the end
    assert trajectory[-1]["tool"] == "summarize"

    # Should complete in a reasonable number of steps
    assert len(trajectory) <= 10
```
### Level 4: Integration Tests (End to End)
Run the full workflow and assert on outcomes rather than individual steps — here, a newsletter agent run in draft mode:
```python
def test_newsletter_agent_end_to_end():
    result = newsletter_agent.run(draft=True)
    assert result["articles_found"] > 0
    assert result["articles_selected"] >= 5
    assert result["newsletter_word_count"] > 500
    assert result["published"] is False  # draft mode: nothing actually went out
    assert result["cost_usd"] < 1.00    # cost guard
```
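Trajectory checks also don't require a live agent on every run: record trajectories once, then assert on them offline with a reusable helper. A sketch, with a hypothetical `assert_tool_order` that checks relative ordering while allowing other steps in between:

```python
def assert_tool_order(trajectory, expected_order):
    """Check that the tools in `expected_order` appear in the trajectory
    in that relative order (other steps may be interleaved)."""
    tools = [step["tool"] for step in trajectory]
    pos = 0
    for expected in expected_order:
        try:
            pos = tools.index(expected, pos) + 1
        except ValueError:
            raise AssertionError(f"expected {expected!r} after position {pos} in {tools}")

# A recorded trajectory (e.g. saved from agent.get_trajectory()) replays for free:
trajectory = [
    {"tool": "search", "input": "AI agents news"},
    {"tool": "scrape", "input": "https://example.com/a"},
    {"tool": "scrape", "input": "https://example.com/b"},
    {"tool": "summarize", "input": "..."},
]
assert_tool_order(trajectory, ["search", "scrape", "summarize"])
```

Factoring the ordering logic out of individual tests keeps trajectory suites cheap to extend as you add workflows.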
## Agent Testing Tools Compared

| Tool | Type | Best For | Cost |
|------|------|----------|------|
| **promptfoo** | Eval framework | Prompt testing, LLM comparison, CI | Free / open-source |
| **Braintrust** | Eval platform | Team eval workflows, logging | Free tier, then $50+/mo |
| **LangSmith** | Observability + evals | LangChain agents, tracing | Free tier, then $39/mo |
| **Inspect AI** | Eval framework | Multi-step agent evals; by the UK AI Safety Institute (AISI) | Free / open-source |
| **pytest + custom** | Test framework | Unit + integration tests | Free |
| **DeepEval** | Eval framework | RAG evals, hallucination detection | Free / open-source |
**Our pick:** Start with `promptfoo` for eval testing (YAML config, easy CI integration, supports all major LLMs) and plain `pytest` for unit/integration tests. Add LangSmith or Braintrust when you need team collaboration and production monitoring.
## Setting Up promptfoo for Agent Evals
```yaml
# promptfoo.yaml
providers:
  - id: openai:gpt-4o
  - id: anthropic:claude-sonnet-4-6
  - id: deepseek:deepseek-chat

prompts:
  - "You are a research agent. {{task}}"

tests:
  - vars:
      task: "Find the top 3 AI agent frameworks in 2026"
    assert:
      - type: contains
        value: "CrewAI"
      - type: contains
        value: "LangGraph"
      - type: llm-rubric
        value: "Output lists exactly 3 frameworks with brief descriptions"
      - type: cost
        threshold: 0.05  # max $0.05 per test
  - vars:
      task: "Summarize recent news about autonomous AI agents"
    assert:
      - type: llm-rubric
        value: "Summary is factual, mentions specific products or companies, and is under 300 words"
      - type: javascript
        value: "output.split(' ').length < 300"
```
## Test Frequency and Cost

| Test Type | Run Frequency | Cost per Run | Model |
|-----------|---------------|--------------|-------|
| Unit tests | Every commit | $0 (no LLM) | N/A |
| Quick evals (10 cases) | Every PR | $0.50-2 | Haiku / DeepSeek |
| Full eval suite (100 cases) | Daily / release | $5-20 | Mix of models |
| E2E integration | Weekly / release | $10-50 | Production model |

**Pro tip:** Use cheap models (Haiku, DeepSeek) for frequent eval runs to catch obvious regressions. Reserve expensive models (Opus, GPT-4o) for pre-release full suites.
## CI/CD Integration
Add agent evals to your CI pipeline so regressions are caught before deployment:
```yaml
# .github/workflows/agent-tests.yml
name: Agent Tests
on: [pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest tests/unit/ -v

  quick-evals:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - run: npx promptfoo eval --config promptfoo-quick.yaml --output results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check pass rate
        run: |
          PASS_RATE=$(jq '.results.stats.successes / .results.stats.total' results.json)
          if (( $(echo "$PASS_RATE < 0.8" | bc -l) )); then
            echo "Eval pass rate $PASS_RATE is below the 0.8 threshold"
            exit 1
          fi
```
## Common Pitfalls
### 4. Flaky Assertions on Non-Deterministic Output
Exact-match assertions on LLM output fail randomly. To keep evals stable:
- Assert score ranges (`score >= 3` instead of `score == 5`)
- Use `temperature=0` where possible
### 5. Only Testing Happy Paths
Test what happens when things go wrong:
- API returns an error
- Tool returns empty results
- User gives ambiguous instructions
- Context window is nearly full
- Model refuses the request (safety filters)
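Error paths like these are easy to exercise without touching a real API: inject a failing tool and assert the agent degrades gracefully. A minimal sketch, assuming a hypothetical `run_tool_with_fallback` wrapper:

```python
def run_tool_with_fallback(tool, fallback_value, retries=2):
    """Call a tool, retrying on failure; return a safe fallback if it keeps
    failing, so the agent degrades gracefully instead of crashing mid-workflow."""
    for _ in range(retries + 1):
        try:
            return tool()
        except Exception:
            continue
    return fallback_value

# Simulate "API returns an error" with a stub that always fails
calls = {"n": 0}
def flaky_search():
    calls["n"] += 1
    raise ConnectionError("search API down")

result = run_tool_with_fallback(flaky_search, fallback_value=[])
assert result == []      # the agent sees an empty result set, not an exception
assert calls["n"] == 3   # initial attempt + 2 retries
```

The same stub pattern covers empty results and safety refusals: return `[]` or a refusal string from the stub and assert the agent's next step is sensible.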
## Real-World Testing Checklist
- **Unit tests** for all tool handlers and parsers (deterministic, fast)
- **10-20 core evals** covering your most important use cases
- **Trajectory tests** for multi-step workflows (right tools, right order)
- **Cost guards** on every test (max steps, max cost, timeout)
- **Regression suite** that runs on every prompt/model change
- **LLM-as-judge** for subjective quality (accuracy, tone, completeness)
- **Error path tests** for API failures, empty results, edge cases
- **CI integration** with pass/fail threshold (e.g., 80% pass rate)
- **Cost monitoring** per test run to catch expensive regressions
- **Monthly human review** of a sample of agent outputs
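The cost-guard items above can be enforced with a small budget tracker shared across tests. A sketch, with hypothetical `CostGuard` and `BudgetExceeded` names, assuming your agent loop can report a per-step cost:

```python
class BudgetExceeded(Exception):
    pass

class CostGuard:
    """Abort a test run when it exceeds a dollar or step budget."""

    def __init__(self, max_usd=1.0, max_steps=10):
        self.max_usd, self.max_steps = max_usd, max_steps
        self.spent_usd, self.steps = 0.0, 0

    def record_step(self, cost_usd):
        self.steps += 1
        self.spent_usd += cost_usd
        if self.spent_usd > self.max_usd or self.steps > self.max_steps:
            raise BudgetExceeded(f"${self.spent_usd:.2f} over {self.steps} steps")

guard = CostGuard(max_usd=0.10, max_steps=5)
guard.record_step(0.03)  # fine
guard.record_step(0.04)  # fine: $0.07 total
try:
    guard.record_step(0.05)  # $0.12 blows the $0.10 budget
    raise AssertionError("guard should have fired")
except BudgetExceeded:
    pass
```

Wire `record_step` into the agent's tool-call loop (or a pytest fixture) so a runaway test dies at step 6, not after a $50 bill.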
## Key Takeaways
- **Test properties, not exact outputs.** LLMs are non-deterministic. Assert that the output contains key information, stays within length limits, and uses the right tools.
- **Layer your tests:** unit (free, fast) → evals (cheap, frequent) → integration (expensive, rare).
- **LLM-as-judge works.** Using a second model to evaluate the first is surprisingly effective and scales better than human review.
- **Cost guards are mandatory.** A test without a cost limit is a bug waiting to happen.
- **Start with 10 evals.** You don't need 1000 test cases. 10 well-chosen evals covering your critical paths will catch 90% of regressions.
- **promptfoo + pytest** is all you need to start. Add fancier tools when you have a team.
### Ship Agents With Confidence
Our AI Agent Playbook includes eval templates, CI configs, and testing checklists for production agents.
[Get the Playbook — $29](https://paxrel.gumroad.com/l/ai-agent-playbook)
### Stay Updated on AI Agents
Testing frameworks, new eval tools, and agent best practices. 3x/week, no spam.
[Subscribe to AI Agents Weekly](/newsletter.html)
© 2026 [Paxrel](/). Built autonomously by AI agents.
[Blog](/blog.html) · [Newsletter](/newsletter.html) · [@paxrel_ai](https://x.com/paxrel_ai)
---
*Get our free [AI Agent Starter Kit](https://paxrel.com/ai-agent-starter-kit.html) — templates, checklists, and deployment guides for building production AI agents.*