klement Gunndu

Stop Guessing If Your AI Agent Works: 3 Eval Patterns That Catch Failures First

72% of LLM-based agents exhibit non-deterministic behavior in production. You can't unit test randomness the same way you test a REST endpoint. But you can evaluate it.

We run 82 AI agents in production. Every one of them passes an evaluation suite before it touches real data. Not because we're cautious — because we shipped without evals once, and an agent published fabricated product features under our founder's name. That mistake cost us credibility and 3 days of cleanup.

Here are the 3 evaluation patterns we use to prevent that from happening again.

## What Makes AI Agent Testing Different

Traditional testing asserts exact outputs: `assert response == expected` works when your function is deterministic.

AI agents aren't deterministic. The same input produces different outputs across runs. Worse, the outputs look correct even when they're wrong. A hallucinated API endpoint reads just as confidently as a real one.

This means you need a different testing strategy:

  1. Evaluate properties, not exact values — Is the output relevant? Is it faithful to the context? Did the agent use the right tools?
  2. Use LLM-as-judge for semantic checks — A second model scores the first model's output against defined criteria.
  3. Run evals in CI, not just notebooks — Evaluations that don't run automatically don't catch regressions.
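
To make the contrast with exact-match assertions concrete, here's a minimal, illustrative property check in plain Python. This is not DeepEval's API — just a sketch of the idea: assert on properties of the output (required facts present, known fabrications absent) rather than an exact string.

```python
def check_properties(response: str, required: list[str], forbidden: list[str]) -> bool:
    """Pass if the output mentions every required fact and none of the
    known fabrications. Purely illustrative -- a real eval uses an LLM
    judge for semantic comparison, not substring matching."""
    text = response.lower()
    return (all(fact.lower() in text for fact in required)
            and not any(claim.lower() in text for claim in forbidden))


# An exact-match assertion fails on any rephrasing; a property check survives it.
ok = check_properties(
    "DeepEval supports over 50 metrics and integrates with pytest.",
    required=["pytest", "50"],
    forbidden=["gpu cluster"],
)
```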

## Pattern 1: Hallucination Detection with Context Grounding

The most dangerous failure mode for AI agents is hallucination — generating confident, plausible text that isn't grounded in the provided context. This is the failure that burned us.

DeepEval's `HallucinationMetric` (as of v2.x, 2026) scores how much of the agent's output contradicts the given context or fabricates claims beyond it. Here's how to wire it into pytest:

```python
# test_hallucination.py
import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase


CONTEXT = [
    "DeepEval supports 50+ evaluation metrics.",
    "DeepEval integrates with pytest for CI/CD pipelines.",
    "The HallucinationMetric requires context and actual_output fields.",
]


def get_agent_response(question: str) -> str:
    """Replace with your actual agent call."""
    # your_agent.invoke(question)
    return "DeepEval supports over 50 metrics and integrates with pytest."


@pytest.mark.parametrize("question,context", [
    ("What does DeepEval support?", CONTEXT),
    ("How does DeepEval integrate with testing?", CONTEXT),
])
def test_agent_does_not_hallucinate(question, context):
    response = get_agent_response(question)

    test_case = LLMTestCase(
        input=question,
        actual_output=response,
        context=context,
    )

    metric = HallucinationMetric(
        threshold=0.5,  # Score must be <= 0.5 (lower = less hallucination)
        include_reason=True,
    )

    assert_test(test_case, [metric])
```

Run it:

```shell
pip install -U deepeval
deepeval test run test_hallucination.py
```

**What this catches:** An agent that says "DeepEval has 200 metrics and requires a GPU cluster" would fail — neither claim is in the context. The metric uses an LLM judge to compare each claim in `actual_output` against the context list.

**Key detail:** The `context` parameter takes a list of strings, not a single string. Each string is a separate fact the agent should be grounded in. This matters because the metric evaluates claim-by-claim.

## Pattern 2: Faithfulness Scoring for RAG Agents

If your agent retrieves documents before generating answers (RAG), hallucination detection alone isn't enough. You need to verify that the agent's answer is faithful to what it retrieved — not just non-contradictory, but actually supported by the retrieval context.

DeepEval's `FaithfulnessMetric` (as of v2.x, 2026) does exactly this. It extracts claims from the output, then checks each claim against `retrieval_context`:

```python
# test_faithfulness.py
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase


def test_rag_agent_stays_faithful():
    # Simulate what your retriever pulled from the vector store
    retrieval_context = [
        "Our return policy allows full refunds within 30 days of purchase.",
        "Refund requests must be submitted through the customer portal.",
        "Items must be in original packaging to qualify for a refund.",
    ]

    # Simulate what your agent generated from that context
    agent_output = (
        "You can get a full refund within 30 days. "
        "Submit your request through the customer portal. "
        "The item must be in its original packaging."
    )

    test_case = LLMTestCase(
        input="What is your return policy?",
        actual_output=agent_output,
        retrieval_context=retrieval_context,
    )

    metric = FaithfulnessMetric(
        threshold=0.7,
        include_reason=True,
    )

    assert_test(test_case, [metric])
```

**What this catches:** If the agent adds "We also offer exchanges for store credit" — a claim not in `retrieval_context` — the faithfulness score drops. The metric extracts truths from the retrieval context, then checks whether each claim in the output is supported.

**Hallucination vs. Faithfulness — when to use which:**

| Metric | Use when | Checks against |
| --- | --- | --- |
| `HallucinationMetric` | Agent has predefined context (docs, rules) | `context` field |
| `FaithfulnessMetric` | Agent retrieves context dynamically (RAG) | `retrieval_context` field |

Use both if your agent has both static knowledge and dynamic retrieval.

## Pattern 3: Tool Correctness for Agentic Workflows

Agents that call tools (APIs, databases, search) introduce a different failure mode: calling the wrong tool, calling tools in the wrong order, or skipping a tool entirely.

DeepEval's `ToolCorrectnessMetric` (as of v2.x, 2026) compares the tools your agent actually called against the tools it should have called:

```python
# test_tool_correctness.py
from deepeval import assert_test
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall


def test_agent_uses_correct_tools():
    test_case = LLMTestCase(
        input="Find the current weather in Berlin and send a summary email.",
        actual_output="The weather in Berlin is 5C and cloudy. Email sent.",
        tools_called=[
            ToolCall(name="weather_api"),
            ToolCall(name="send_email"),
        ],
        expected_tools=[
            ToolCall(name="weather_api"),
            ToolCall(name="send_email"),
        ],
    )

    metric = ToolCorrectnessMetric(threshold=0.5)
    assert_test(test_case, [metric])


def test_agent_does_not_skip_tools():
    """Agent should search before answering, not guess."""
    test_case = LLMTestCase(
        input="What are the top Python packages for data validation?",
        actual_output="Pydantic, Marshmallow, and Cerberus.",
        tools_called=[],  # Agent answered from memory, no search
        expected_tools=[
            ToolCall(name="web_search"),
        ],
    )

    metric = ToolCorrectnessMetric(threshold=0.5)
    assert_test(test_case, [metric])  # This FAILS — search was expected
```

**What this catches:** An agent that skips the search tool and answers from its training data. An agent that calls `delete_user` when it should have called `get_user`. The metric generates a deterministic score by comparing actual vs. expected tool lists.

**Production tip:** Log every tool call your agent makes. Use those logs to build your `expected_tools` ground truth. Start with 10-20 representative queries, then expand as you find edge cases.
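
One lightweight way to capture that ground truth, assuming nothing about your agent framework: wrap each tool in a decorator that appends to a log, then read the logged names back as candidate `expected_tools`. The tools below (`weather_api`, `send_email`) are hypothetical stand-ins.

```python
import functools

TOOL_LOG: list[dict] = []


def logged_tool(fn):
    """Record the name and arguments of every tool invocation."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        TOOL_LOG.append({"name": fn.__name__, "args": args, "kwargs": kwargs})
        return fn(*args, **kwargs)
    return wrapper


@logged_tool
def weather_api(city: str) -> str:  # hypothetical tool
    return f"5C and cloudy in {city}"


@logged_tool
def send_email(body: str) -> bool:  # hypothetical tool
    return True


weather_api("Berlin")
send_email("Weather summary: 5C and cloudy.")

# The logged order becomes the expected_tools ground truth for this query.
called = [entry["name"] for entry in TOOL_LOG]
```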

## Putting It All Together: CI Pipeline

These patterns work best when they run automatically. Here's a minimal setup:

```yaml
# .github/workflows/eval.yml
name: Agent Evaluation
on:
  pull_request:
    paths:
      - "agents/**"
      - "prompts/**"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -U deepeval pytest
      - run: deepeval test run tests/eval/ --verbose
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Every PR that touches agent code or prompts triggers the evaluation suite. If faithfulness drops or hallucination spikes, the PR is blocked.

**Cost consideration:** DeepEval metrics that use LLM-as-judge (`HallucinationMetric`, `FaithfulnessMetric`) require API calls to an evaluation model. `ToolCorrectnessMetric` is deterministic and costs nothing. Budget roughly $0.01-0.05 per evaluation test case for LLM-judged metrics, depending on context size and model choice.
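
At those per-case figures, suite cost is simple arithmetic. A quick sketch — the $0.01-0.05 range is the article's estimate; the 200-case suite size is an assumption for illustration:

```python
def suite_cost_range(judged_cases: int, low: float = 0.01, high: float = 0.05):
    """Estimated cost band (USD) for one CI run of LLM-judged eval cases."""
    return judged_cases * low, judged_cases * high


# Assume 200 LLM-judged test cases run on every PR.
low, high = suite_cost_range(200)
```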

## What We Learned Running Evals on 82 Agents

After adding evaluation suites to all our agents, three things became clear:

  1. Hallucination correlates with context length. Agents with more than 8,000 tokens of context hallucinate more. We split long contexts into focused chunks and test each chunk separately.

  2. Tool correctness catches 40% of bugs that unit tests miss. An agent can produce a correct-looking output while using the wrong data source. The output passes a string check; the tool check catches the real problem.

  3. Faithfulness thresholds need tuning per agent. A customer support agent needs 0.9+ faithfulness. A creative writing agent might work fine at 0.6. Don't use one threshold for everything.
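
A minimal sketch of the context-splitting step from point 1, assuming a rough 4-characters-per-token estimate (your tokenizer will differ):

```python
def chunk_context(facts: list[str], max_tokens: int = 8000) -> list[list[str]]:
    """Greedily pack context facts into chunks under a token budget,
    so each chunk can be evaluated as its own grounding context."""
    chunks: list[list[str]] = []
    current: list[str] = []
    budget = 0
    for fact in facts:
        estimate = len(fact) // 4 + 1  # crude chars-to-tokens heuristic
        if current and budget + estimate > max_tokens:
            chunks.append(current)
            current, budget = [], 0
        current.append(fact)
        budget += estimate
    if current:
        chunks.append(current)
    return chunks


# 100 facts of ~400 tokens each get split into several chunks under 8k tokens.
parts = chunk_context(["x" * 1600] * 100)
```

Each chunk then becomes the `context` for its own hallucination test case, instead of one oversized context that invites drift.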

## Start Here

Pick one agent. Write 5 test cases. Run them.

If you have a RAG agent: start with `FaithfulnessMetric`. If you have a tool-calling agent: start with `ToolCorrectnessMetric`. If you have any agent at all: start with `HallucinationMetric`.

The goal isn't 100% coverage on day one. The goal is catching the failure that costs you credibility — before a user finds it.


Follow @klement_gunndu for more AI engineering content. We're building in public.

Top comments (2)

Matthew Hou

The "agent published fabricated product features under our founder's name" story is exactly the kind of incident that should be on every AI team's risk register. The eval patterns you describe are solid — I want to add one nuance from what I've seen running agents in production:

The hardest evals to write aren't for hallucinations (those are actually detectable). The hardest are for subtle correctness issues where the output looks plausible and passes schema validation but is semantically wrong. Like an agent that generates a SQL query that's valid, returns results, but uses the wrong JOIN condition.

For that class of problems, I've found that golden dataset evaluation — comparing agent output against known-correct human outputs for the same inputs — catches things that metric-based evals miss. It's more work to maintain, but it's the closest thing to "does this actually work the way a human would expect."

klement Gunndu

Golden datasets catch that, but you need stratified sampling by failure mode — group your known-goods by query complexity, data cardinality, and schema depth so you're not just validating happy paths.