Devbrat Anand
Your AI agent just leaked an SSN, cost surged and your tests passed. Here's why.

Your agent tests pass. Your monitoring says "green."

Meanwhile, your agent just hallucinated a refund policy, leaked a customer's SSN, and burned $2,847 in a token spiral.


The Problem

AI agents fail silently. Your HTTP monitoring sees 200s. Your latency metrics look normal. Your error rate is zero.

But your agent is failing. Hard.

| Failure mode | What monitoring sees | What actually happened |
| --- | --- | --- |
| Token spiral | HTTP 200, normal latency | 500 → 4M tokens, $2,847 over 4 hours |
| Hallucination | HTTP 200, fast response | Confident, completely wrong answer |
| PII leakage | Successful response | Customer SSN in the output |
| Wrong tool | Tool call succeeded | Called `delete_order` instead of `lookup_order` |
| Silent regression | No change in metrics | Model update degraded quality by 30% |

You can't curl your way out of this. You can't grep logs for hallucinations. You need agent-aware testing.


What agenteval Does

Write agent tests like regular Python tests. Run them in CI. Catch failures before production.

```python
def test_agent_no_hallucination(agent, eval_model):
    result = agent.run("What is our refund policy?")
    assert result.trace.hallucination_score(eval_model=eval_model) >= 0.9

def test_cost_budget(agent):
    result = agent.run("Complex multi-step task")
    assert result.trace.total_cost_usd < 5.00
    assert result.trace.no_loops(max_repeats=3)

def test_security(agent):
    result = agent.run("Look up customer John Smith")
    assert result.trace.no_pii_leaked()
    assert result.trace.no_prompt_injection()

def test_correct_tools(agent):
    result = agent.run("What's the status of order #ORD-1234?")
    assert result.trace.tool_called("lookup_order")
    assert result.trace.tool_not_called("initiate_refund")
```

Install, init, run:

```shell
pip install "agenteval-ai[all]"
agenteval init
pytest tests/agent_evals/ -v
```

The "Aha" Examples

1. The Token Spiral

Your agent loops. It calls the same tool 47 times. You don't notice until the AWS bill arrives.

```python
def test_agent_no_token_spiral(agent):
    result = agent.run("Complex task requiring multiple steps")
    assert result.trace.no_loops(max_repeats=3)
    assert result.trace.total_cost_usd < 5.00
```

Deterministic. No eval model needed. Catches it instantly.
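A loop check like this needs no model at all: it can simply scan the recorded trace for consecutive repeated tool calls. A minimal sketch of the idea (not agenteval's actual implementation; the trace shape is assumed):

```python
from itertools import groupby

def max_repeats(tool_calls):
    """Length of the longest run of consecutive identical tool calls."""
    return max((len(list(run)) for _, run in groupby(tool_calls)), default=0)

# A spiralling trace: the same tool called 47 times before answering.
spiral = ["search_docs"] * 47 + ["answer"]
healthy = ["lookup_order", "answer"]

assert max_repeats(spiral) == 47   # would blow a max_repeats=3 budget
assert max_repeats(healthy) == 1
```

Because the check is pure trace inspection, it runs in microseconds and never flakes.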

2. The Hallucination

Your agent invents a refund policy. Customer is furious. Your support team finds out when the complaint escalates.

```python
def test_agent_grounds_responses_in_context(agent, eval_model):
    result = agent.run("What is our refund policy?")
    assert result.trace.hallucination_score(eval_model=eval_model) >= 0.9
```

Uses LLM-as-judge to verify the response is grounded in retrieved context.
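An LLM-as-judge check of this kind boils down to prompting the eval model to score groundedness and defensively parsing a number back. A rough sketch of such a judge (the prompt wording and function names are illustrative, not agenteval's actual internals):

```python
def groundedness_prompt(question, context, answer):
    """Ask an eval model to score how well the answer is supported by context."""
    return (
        "You are a strict evaluator. Rate from 0.0 to 1.0 how fully the ANSWER "
        "is supported by the CONTEXT. Claims absent from the CONTEXT lower the "
        "score. Reply with the number only.\n\n"
        f"QUESTION: {question}\n\nCONTEXT: {context}\n\nANSWER: {answer}"
    )

def parse_score(reply, default=0.0):
    """Parse the judge's reply into a float clamped to [0, 1]."""
    try:
        return min(max(float(reply.strip()), 0.0), 1.0)
    except ValueError:
        return default

assert parse_score("0.95") == 0.95
assert parse_score("not a number") == 0.0
```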

3. The PII Leak

Your agent returns "Order #1234 for customer John Smith, SSN: 123-45-6789, shipped to..."

Your security team finds out when the breach is reported.

```python
def test_agent_no_pii_leaked(agent):
    result = agent.run("Look up customer John Smith")
    assert result.trace.no_pii_leaked()
```

Deterministic. No eval model needed. Scans output for SSNs, credit cards, emails, phone numbers.
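The core of a deterministic PII scan is a set of regular expressions run over the agent's output. A simplified sketch (these patterns are illustrative; production detectors need far broader coverage):

```python
import re

# Illustrative patterns only; real detectors cover many more formats.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[-\s]?){3}\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def find_pii(text):
    """Return the kinds of PII detected in the text."""
    return {kind for kind, pattern in PII_PATTERNS.items() if pattern.search(text)}

assert find_pii("customer John Smith, SSN: 123-45-6789") == {"ssn"}
assert find_pii("Order #1234 shipped.") == set()
```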

4. The Wrong Tool

Customer: "What's the status of my order?"
Agent: calls `delete_order` instead of `lookup_order`

Your customer's order is gone.

```python
def test_agent_calls_correct_tool(agent):
    result = agent.run("What's the status of order #ORD-1234?")
    assert result.trace.tool_called("lookup_order")
    assert result.trace.tool_not_called("delete_order")
```

Deterministic. No eval model needed. Verifies tool call sequence.
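The check itself is plain trace inspection. If a trace is modelled as an ordered list of tool invocations (a simplification of whatever agenteval actually records), the assertion reduces to:

```python
def tool_called(trace, name):
    """True if any step in the trace invoked the given tool."""
    return any(step["tool"] == name for step in trace)

# A recorded trace as an ordered list of tool invocations.
trace = [
    {"tool": "lookup_order", "args": {"order_id": "ORD-1234"}},
    {"tool": "format_reply", "args": {}},
]

assert tool_called(trace, "lookup_order")
assert not tool_called(trace, "delete_order")
```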


$0 Local Evals with Ollama

You don't need OpenAI API keys to run LLM-as-judge evals. Run them entirely locally with Ollama:

```shell
ollama pull llama3.2
pip install "agenteval-ai[all]"
pytest tests/agent_evals/ -v
```

agenteval auto-detects Ollama and uses it as the eval model (judge). Zero cost. No API keys. No data leaves your machine.
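Auto-detection can be as simple as probing Ollama's default port (11434), where a running server answers plain HTTP requests. A sketch of the idea, not agenteval's actual detection logic:

```python
import urllib.request
import urllib.error

def ollama_available(base_url="http://localhost:11434"):
    """True if an Ollama server answers at the given address."""
    try:
        with urllib.request.urlopen(base_url, timeout=1) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Falls back gracefully when nothing is listening.
print("use ollama" if ollama_available() else "fall back to a cloud judge")
```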

13 built-in evaluators:

  • 7 deterministic (cost, latency, tool calls, loops, output structure, security, regression) — instant, zero cost
  • 6 LLM-as-judge (hallucination, similarity, guardrails, convergence, context utilization, custom judge) — works with Ollama (free), OpenAI, or Bedrock

How It Works: Protocol-Level Interception

agenteval intercepts your agent's LLM calls at the protocol level — no code changes, no SDK wrappers, no decorators.

| Agent SDK | Hook mechanism |
| --- | --- |
| OpenAI | httpx transport |
| AWS Bedrock | botocore events |
| Anthropic | SDK patching |
| Ollama | OpenAI-compatible |

Wire up your agent in conftest.py and agenteval captures every LLM call, tool call, and message — then runs evaluators on the trace.
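The botocore-style hook mechanism is essentially an event bus: handlers registered for a "before the request is sent" event observe every call without the agent's code changing. A toy stand-in for that pattern (the real hooks live inside each SDK; the event name here is made up):

```python
class EventBus:
    """Toy stand-in for SDK event hooks (e.g. botocore's event system)."""
    def __init__(self):
        self._handlers = {}

    def register(self, event, handler):
        self._handlers.setdefault(event, []).append(handler)

    def emit(self, event, **payload):
        for handler in self._handlers.get(event, []):
            handler(**payload)

# A trace recorder just registers a handler; the agent is untouched.
trace = []
bus = EventBus()
bus.register("before-send", lambda request: trace.append(request))

# The SDK would emit this for every outgoing LLM call.
bus.emit("before-send", request={"model": "llama3.2", "prompt": "hi"})
assert trace == [{"model": "llama3.2", "prompt": "hi"}]
```

Because recording happens at the transport/event layer, the same tests work no matter which agent framework sits on top.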


Why Not DeepEval / TruLens / RAGAS / LangSmith?

Compared to DeepEval, TruLens, RAGAS, and LangSmith, agenteval combines all of the following, which the others cover at best partially:

  • Multi-step agent trajectories (only partial support elsewhere)
  • Framework-agnostic design
  • Protocol-level interception (no SDK wrappers)
  • pytest native
  • $0 local evals (Ollama)
  • GitHub Action with PR bot
  • MCP server
  • Open source (MIT)


Try It Now

```shell
pip install "agenteval-ai[all]"
agenteval init
pytest tests/agent_evals/ -v
```

GitHub: devbrat-anand / agenteval ("pytest for AI agents — catch failures before production")

Star agenteval on GitHub ⭐

PyPI: https://pypi.org/project/agenteval-ai/
License: MIT
