klement Gunndu
Test Your AI Agent Like a Senior Engineer: 4 Patterns That Work

Your AI agent passes every unit test. Then it hallucinates a database schema in production, invents an API endpoint that doesn't exist, and confidently returns a JSON response missing three required fields.

Unit tests prove your functions run. They don't prove your agent works. The difference costs you production incidents, user trust, and the 3 AM pages that make you question your career choices.

Here are 4 testing patterns that senior engineers use to catch these failures before deployment — with working Python code for each.

Pattern 1: Schema Contract Tests

The first thing that breaks in an AI agent is the output format. You ask for structured data, the LLM returns something close but not quite right. A missing field. A string where you expected an integer. A nested object with an unexpected key.

Schema contract tests enforce that every agent output matches an exact Pydantic model — and they do it without calling the real LLM.

from pydantic import BaseModel, Field
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

class AnalysisResult(BaseModel):
    summary: str = Field(min_length=10)
    confidence: float = Field(ge=0.0, le=1.0)
    categories: list[str] = Field(min_length=1)
    source_count: int = Field(ge=1)

analysis_agent = Agent(
    "openai:gpt-4o",
    output_type=AnalysisResult,
    system_prompt="Analyze the given text and return structured results.",
)

async def test_output_schema_enforced():
    """Every response must match AnalysisResult exactly."""
    with analysis_agent.override(model=TestModel()):
        result = await analysis_agent.run("Analyze this quarterly report.")
        # TestModel returns default-valid data matching AnalysisResult
        assert isinstance(result.output, AnalysisResult)
        assert 0.0 <= result.output.confidence <= 1.0
        assert len(result.output.categories) >= 1
        assert result.output.source_count >= 1

The key detail: TestModel from Pydantic AI generates deterministic responses that satisfy your output type's schema. No API calls. No flaky tests. No cost. The test runs in milliseconds and proves your schema constraints are enforced at the framework level.

For stricter validation — catching coercion issues where "3" silently becomes 3 — use Pydantic's strict mode:

import pytest
from pydantic import ValidationError

async def test_strict_schema_validation():
    """Catch silent type coercion before production does."""
    raw_output = {
        "summary": "Market analysis complete",
        "confidence": "0.95",  # String, not float — should fail strict
        "categories": ["finance"],
        "source_count": 3,
    }
    with pytest.raises(ValidationError):
        # Strict mode rejects string-to-float coercion outright
        AnalysisResult.model_validate(raw_output, strict=True)

This pattern catches the entire class of "the LLM returned something that looks right but isn't" failures.
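To cover more of that class in one place, the strict-mode check can be parametrized over several known-bad payloads. A sketch using pytest and the same model (repeated here so the snippet is self-contained; the specific bad values are illustrative):

```python
import pytest
from pydantic import BaseModel, Field, ValidationError

class AnalysisResult(BaseModel):
    summary: str = Field(min_length=10)
    confidence: float = Field(ge=0.0, le=1.0)
    categories: list[str] = Field(min_length=1)
    source_count: int = Field(ge=1)

# A known-good payload; each test case corrupts exactly one field.
VALID = {
    "summary": "Market analysis complete",
    "confidence": 0.95,
    "categories": ["finance"],
    "source_count": 3,
}

@pytest.mark.parametrize("field_name,bad_value", [
    ("confidence", "0.95"),     # string, not float
    ("source_count", "3"),      # string, not int
    ("categories", "finance"),  # bare string, not list
    ("summary", "too short"),   # violates min_length=10
])
def test_strict_rejects(field_name, bad_value):
    payload = {**VALID, field_name: bad_value}
    with pytest.raises(ValidationError):
        AnalysisResult.model_validate(payload, strict=True)
```

Each new coercion bug you hit in production becomes one more tuple in the parametrize list.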

Pattern 2: Deterministic Tool Call Tests

Your agent calls tools — search APIs, databases, calculators. The tool call sequence is where most production bugs hide. The agent calls the wrong tool, passes wrong arguments, or skips a tool it should have used.

FunctionModel from Pydantic AI lets you script the exact sequence of LLM responses, so you can assert on every tool call:

from pydantic_ai.models.function import FunctionModel, AgentInfo
from pydantic_ai.messages import (
    ModelMessage, ModelResponse, ToolCallPart, ToolReturnPart, TextPart
)
from pydantic_ai import Agent, capture_run_messages

def controlled_tool_sequence(
    messages: list[ModelMessage], info: AgentInfo
) -> ModelResponse:
    """Simulate an LLM that calls tools in a specific order."""
    # Count completed tool round-trips via ToolReturnPart. Counting every part
    # with a tool_name would double-count, because tool calls and tool returns
    # both carry one.
    tool_returns = sum(
        1 for m in messages
        for p in getattr(m, "parts", [])
        if isinstance(p, ToolReturnPart)
    )
    if tool_returns == 0:
        # First call: agent should search
        return ModelResponse(
            parts=[ToolCallPart("search_docs", {"query": "revenue Q3"})]
        )
    elif tool_returns == 1:
        # Second call: agent should calculate
        return ModelResponse(
            parts=[ToolCallPart("calculate", {"expression": "150000 * 1.12"})]
        )
    else:
        # Final: return answer
        return ModelResponse(
            parts=[TextPart("Q3 revenue grew 12% to $168,000.")]
        )

research_agent = Agent("openai:gpt-4o", system_prompt="Research financial data.")

@research_agent.tool_plain
def search_docs(query: str) -> str:
    return "Q3 revenue: $150,000"

@research_agent.tool_plain
def calculate(expression: str) -> str:
    return str(eval(expression))  # Simplified for example

async def test_tool_call_sequence():
    """Agent must call search_docs before calculate."""
    with capture_run_messages() as messages:
        with research_agent.override(model=FunctionModel(controlled_tool_sequence)):
            result = await research_agent.run("What was Q3 revenue growth?")

    tool_calls = [
        part.tool_name
        for msg in messages
        for part in getattr(msg, "parts", [])
        # ToolCallPart only — ToolReturnPart also carries a tool_name and
        # would double-count every call.
        if isinstance(part, ToolCallPart)
    ]
    assert tool_calls == ["search_docs", "calculate"], (
        f"Expected search then calculate, got {tool_calls}"
    )

This test doesn't check if the LLM "feels right." It asserts on the exact behavioral contract: search first, then calculate. If a prompt change causes the agent to skip the search step, this test fails immediately.
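Ordering is only half the contract: an agent that searches with the wrong query passes the sequence check and still fails in production. A small, library-agnostic helper for asserting on (tool_name, args) pairs is one way to tighten it — a sketch, assuming you extract the pairs from each captured ToolCallPart yourself:

```python
def assert_tool_trace(
    trace: list[tuple[str, dict]],
    expected: list[tuple[str, dict]],
) -> None:
    """Compare captured (tool_name, args) pairs against a behavioral contract."""
    names = [name for name, _ in trace]
    want_names = [name for name, _ in expected]
    assert names == want_names, f"Expected {want_names}, got {names}"
    for (name, args), (_, want_args) in zip(trace, expected):
        # Assert only on the keys the contract cares about; extra args pass.
        for key, value in want_args.items():
            assert args.get(key) == value, (
                f"{name}: expected {key}={value!r}, got {args.get(key)!r}"
            )

# Usage with pairs pulled from the captured messages:
# assert_tool_trace(
#     captured_pairs,
#     [("search_docs", {"query": "revenue Q3"}), ("calculate", {})],
# )
```

Passing an empty dict for a step means "any arguments are fine here," which keeps the contract from being brittle where you don't care.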

Pattern 3: Eval Datasets With Scoring Rubrics

Unit tests check individual behaviors. Eval datasets measure whether your agent is getting better or worse across dozens of scenarios simultaneously.

Pydantic Evals (part of the Pydantic AI ecosystem, installed via pip install pydantic-evals) provides a framework for this:

from dataclasses import dataclass, field

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

@dataclass
class ContainsExpectedInfo(Evaluator[str, str]):
    """Check if the output contains the expected key information."""

    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
        if ctx.expected_output and ctx.expected_output.lower() in ctx.output.lower():
            return 1.0
        return 0.0

@dataclass
class NoHallucination(Evaluator[str, str]):
    """Flag outputs that mention topics not in the input."""

    forbidden_terms: list[str] = field(default_factory=list)

    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
        for term in self.forbidden_terms:
            if term.lower() in ctx.output.lower():
                return 0.0
        return 1.0

dataset = Dataset(
    cases=[
        Case(
            name="capital_of_france",
            inputs="What is the capital of France?",
            expected_output="Paris",
        ),
        Case(
            name="python_creator",
            inputs="Who created Python?",
            expected_output="Guido van Rossum",
        ),
        Case(
            name="no_hallucinate_dates",
            inputs="What is Python?",
            expected_output="programming language",
            evaluators=[
                NoHallucination(forbidden_terms=["founded in 2025", "version 4.0"]),
            ],
        ),
    ],
    evaluators=[ContainsExpectedInfo()],
)

Run it against your agent function:

async def my_agent(question: str) -> str:
    """Your actual agent call goes here."""
    # Replace with your agent.run() call
    result = await analysis_agent.run(question)
    return result.output.summary

report = dataset.evaluate_sync(my_agent)
report.print(include_input=True, include_output=True)

The output is a table showing pass/fail per case, scores per evaluator, and aggregate metrics. Run this in CI. Track scores across commits. When a prompt change drops your hallucination score from 1.0 to 0.7, you catch it before deployment — not after a user files a bug report.

The real power: you build this dataset incrementally. Every production failure becomes a new test case. After 6 months, you have 200 cases that encode every failure mode your agent has ever hit.
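One way to make that loop concrete, sketched with only the standard library (the file name and record shape are illustrative, not part of Pydantic Evals): every triaged incident gets appended to a JSON file that your eval run later loads as cases.

```python
import json
from pathlib import Path

CASES_FILE = Path("eval_cases.json")  # illustrative location for the regression set

def record_failure_case(name: str, inputs: str, expected_output: str) -> None:
    """Append a production failure as an eval case, skipping duplicates by name."""
    cases = json.loads(CASES_FILE.read_text()) if CASES_FILE.exists() else []
    if any(c["name"] == name for c in cases):
        return  # this failure mode is already encoded
    cases.append(
        {"name": name, "inputs": inputs, "expected_output": expected_output}
    )
    CASES_FILE.write_text(json.dumps(cases, indent=2))

# At test time the stored records become Pydantic Evals cases:
# cases = [Case(**c) for c in json.loads(CASES_FILE.read_text())]
# dataset = Dataset(cases=cases, evaluators=[ContainsExpectedInfo()])
```

The dedupe-by-name check matters more than it looks: without it, recurring incidents inflate the dataset with redundant cases instead of telling you the fix didn't hold.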

Pattern 4: Failure Injection Tests

Your agent works when everything works. The question is: what happens when things break?

Failure injection tests simulate the real failure modes — API timeouts, malformed LLM responses, tool exceptions — and verify your agent degrades gracefully instead of crashing or hallucinating.

import asyncio
import pytest
from pydantic_ai.models.function import FunctionModel, AgentInfo
from pydantic_ai.messages import ModelMessage, ModelResponse, TextPart

def timeout_model(
    messages: list[ModelMessage], info: AgentInfo
) -> ModelResponse:
    """Simulate an LLM that takes too long to respond."""
    raise TimeoutError("Model API timed out after 30s")

def garbage_model(
    messages: list[ModelMessage], info: AgentInfo
) -> ModelResponse:
    """Simulate an LLM that returns unparseable output."""
    return ModelResponse(parts=[TextPart("sure here is {broken json")])

async def test_agent_handles_timeout():
    """Agent must not crash on LLM timeout."""
    with research_agent.override(model=FunctionModel(timeout_model)):
        with pytest.raises(TimeoutError):
            await research_agent.run("Analyze this data.")
    # If your agent has retry logic, test that instead:
    # assert result.output contains fallback message

async def test_agent_handles_malformed_response():
    """Agent must handle garbage LLM output without crashing."""
    with research_agent.override(model=FunctionModel(garbage_model)):
        # research_agent returns plain text, so the garbage string comes back
        # as-is. Point the same FunctionModel at a structured-output agent and
        # Pydantic AI's validation/retry machinery kicks in instead.
        try:
            await research_agent.run("Analyze this data.")
        except Exception as e:
            # Agent should raise a clear error, not silently corrupt data
            assert "validation" in str(e).lower() or "parse" in str(e).lower()

For tool failures, inject exceptions at the tool level:

# Create a separate agent with a tool that always fails
failing_agent = Agent("openai:gpt-4o", system_prompt="Research financial data.")

@failing_agent.tool_plain
def search_docs(query: str) -> str:
    raise ConnectionError("Search API is down")

@failing_agent.tool_plain
def calculate(expression: str) -> str:
    return str(eval(expression))

async def test_agent_survives_tool_failure():
    """Agent must handle a tool that throws an exception."""
    with failing_agent.override(model=TestModel()):
        # The agent should either retry, use a fallback,
        # or return a clear error — not hallucinate an answer
        try:
            await failing_agent.run("Find Q3 revenue data.")
        except Exception as e:
            # Verify it fails clearly, not silently
            assert "ConnectionError" in type(e).__name__ or "tool" in str(e).lower()

This pattern catches the most dangerous failure mode in production AI systems: the agent encounters an error, silently ignores it, and returns a confident-sounding response based on nothing.
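A related gate worth encoding: if the agent returned an answer without calling the tools its task requires, treat that as a hard failure, not a degraded success. A library-agnostic sketch (the helper name is mine; in practice, feed it the tool_calls list extracted from captured messages, as in Pattern 2):

```python
def assert_required_tools_called(called: list[str], required: set[str]) -> None:
    """Fail loudly when an agent answered without its required tool calls."""
    missing = required - set(called)
    if missing:
        raise AssertionError(
            f"Output produced without required tool calls: {sorted(missing)}"
        )

# Usage with the captured trace from Pattern 2:
# assert_required_tools_called(tool_calls, {"search_docs", "calculate"})
```

An empty tool-call trace paired with a complete-looking answer is exactly the "confident response based on nothing" failure, caught mechanically.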

Putting It All Together: The Test Pyramid for AI Agents

Traditional software has a test pyramid: unit tests at the bottom, integration tests in the middle, end-to-end tests at the top.

AI agents need a different pyramid:

        /  E2E  \          Multi-turn conversations (expensive, slow)
       /  Evals  \         Dataset scoring across scenarios (medium)
      / Contracts \        Schema + tool call sequence (fast, cheap)
     /  Failure   \        Timeout, malformed, tool crash (fast)
    / Unit (tools) \       Individual tool functions (fastest)

Bottom layer (unit tests for tools): Test each tool function in isolation. Pure functions, no LLM involved. Run on every commit.

Second layer (failure injection): Inject timeouts, bad responses, tool crashes. Run on every commit. Catches resilience regressions.

Third layer (schema contracts): Use TestModel and FunctionModel to verify output schemas and tool call sequences. Run on every commit. Catches behavioral regressions.

Fourth layer (eval datasets): Run your growing dataset against the actual agent (with real LLM calls). Run in CI on PRs. Catches quality regressions.

Top layer (end-to-end): Multi-turn conversations with real tools and real LLMs. Run nightly or before releases. Catches integration issues.

The bottom three layers run in milliseconds, cost nothing, and catch 80% of production failures. The top two layers catch the remaining 20% but require API calls and take minutes.

What Changes After You Adopt These Patterns

Before: you ship prompt changes and hope nothing breaks. A user reports that the agent started returning incomplete data. You dig through logs for 2 hours trying to reproduce the issue.

After: a CI check fails the moment a prompt change breaks the output schema contract. The eval dataset shows a 15% drop in accuracy on financial queries. The failure injection test reveals the agent no longer retries on timeouts. You fix all three before merging.

The shift isn't just technical — it's cultural. When you have tests that encode what "working" means, prompt engineering stops being guesswork and becomes engineering.


Follow @klement_gunndu for more AI engineering content. We're building in public.

Top comments (10)

Apex Stack

The eval dataset pattern (Pattern 3) is the one that changed everything for us. We run AI agents that generate content across 100k+ pages in 12 languages, and the "every production failure becomes a new test case" approach is exactly right. After 3 months we had ~60 cases encoding failure modes we never would have anticipated upfront — wrong currency symbols for specific locales, hallucinated financial metrics that looked plausible, schema fields that passed validation but contained semantically wrong data.

Your schema contract tests also hit close to home. When you're generating structured data at scale, a single malformed output isn't one bug — it's one bug multiplied by thousands of pages. We learned this the hard way when a subtle type coercion issue (numbers rendered as strings in JSON-LD) silently broke structured data markup across an entire language variant. Pydantic strict mode would have caught it instantly.

The failure injection pattern is the one most teams skip and most need. The most dangerous failure mode we've seen is exactly what you described: the agent encounters a data source timeout, silently falls back to its training data, and returns a confident-sounding analysis based on nothing. The output looks right. The schema validates. But the numbers are fabricated. We now treat "agent produced output without calling expected tools" as a hard failure, not a graceful degradation.

One thing I'd add to the pyramid: at the eval dataset layer, we've found it valuable to score not just correctness but consistency. Run the same input 5 times and measure variance. For structured outputs especially, high variance on the same input is a leading indicator that your prompt is underspecified — even if every individual run scores well.

How are you handling eval dataset maintenance at scale? We've found the dataset grows fast but keeping it curated (removing redundant cases, updating expected outputs when requirements change) becomes its own engineering problem.

klement Gunndu

@apex_stack This is one of the most useful comments I've gotten -- the specifics make it real.

Your type coercion issue in JSON-LD across a language variant is a textbook case for why schema contract tests need strict mode by default. The failure looks correct to every check except the downstream consumer. Pydantic strict mode catches that at the boundary before it multiplies.

The consistency scoring idea -- running the same input 5x and measuring variance -- deserves its own pattern. High variance on identical inputs is a signal that the prompt relies on stochastic behavior instead of constraining the output space. We've seen this manifest as "works in testing, flaky in production" and never connected it to prompt underspecification until we started measuring it.
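A minimal sketch of that check, assuming you already have a list of per-run scores for one input (the 5-run count and threshold are illustrative):

```python
import statistics

def consistency(scores: list[float]) -> tuple[float, float]:
    """Mean and population std-dev across repeated runs of the same input."""
    return statistics.fmean(scores), statistics.pstdev(scores)

# Score the same input 5 times; a high std-dev flags an underspecified prompt
# even when every individual run looks fine.
mean, spread = consistency([0.9, 0.92, 0.88, 0.91, 0.9])
assert spread < 0.05  # tune the threshold to your own tolerance
```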

On eval dataset maintenance at scale, three things that have helped:

  1. Tag cases by failure category, not just pass/fail. When you can query "show me all currency-formatting failures" you spot redundancy fast. 8 cases testing the same root cause? Keep the 2 most representative, archive the rest.

  2. Version expected outputs alongside requirement changes. When the spec changes, a script diffs affected cases and flags which expected outputs need updating. Without this, stale expectations become false negatives that erode trust in the suite.

  3. Decay scoring. Cases that haven't caught a regression in 90+ days get flagged for review -- either the failure mode is truly fixed, or the case isn't testing what you think.

The dataset does grow fast. The discipline is treating it like production code: it gets refactored, not just appended to.

Your "agent produced output without calling expected tools = hard failure" rule is the single best heuristic I've seen for catching confident hallucination in production.

Apex Stack

Really appreciate the depth here -- especially the three-part eval maintenance framework. The decay scoring idea (flagging cases that haven't caught anything in 90+ days) solves a problem we've been ignoring: eval suites that grow but never get pruned. We've got cases from early multilingual rollouts that are probably testing failure modes we fixed months ago.

The Pydantic strict mode point is spot on. We were running JSON Schema validation on our structured data output, which happily passes strings where numbers should be. The downstream consumer (Google's Rich Results validator) is stricter than our own pipeline was. That's backwards -- your boundary validation principle would have caught it before we deployed to 8,000 pages.

On the "expected tools = hard failure" heuristic -- we use a variant of this for our content generation agents. If the agent is supposed to pull live financial data from yfinance but produces an analysis without making the API call, that's a hallucination regardless of how plausible the numbers look. We flag those as "confident fabrication" which is the most dangerous failure category because it passes every quality check except ground truth.

Your point about treating eval datasets like production code resonates. We've been append-only with ours. Time to refactor.

klement Gunndu

The "confident fabrication" category you describe is exactly the right framing — and it's the hardest failure mode to catch because every downstream quality check passes. The numbers look reasonable, the format is correct, the analysis reads well. The only thing wrong is that none of it is grounded in reality.

Your yfinance heuristic is a clean implementation of the "expected tools = hard failure" pattern: if the agent's job requires an API call and the trace shows no API call was made, the output is fabricated by definition. No need to evaluate the content itself.

On eval pruning — the decay scoring works best when you track "last triggered" date per case, not just "exists." Cases that haven't caught a failure in 90+ days are either (a) testing a fixed bug, or (b) testing a scenario the agent never encounters anymore. Either way, they're adding noise to your pass rate without adding signal. Retiring them into an archive (not deleting) keeps the history while cleaning the active suite.

klement Gunndu

60 cases from 3 months of production failures across 12 languages — that's exactly the kind of eval dataset you can't design upfront. The multilingual angle probably surfaced edge cases no synthetic dataset would catch.

Edwin Hung

This is a very insightful article on testing AI agents like a senior engineer. The four patterns highlighted provide a practical approach to building reliable and scalable AI systems. Many developers focus heavily on building features, but proper testing is equally important to ensure consistent performance and reduce unexpected errors. These patterns help identify hidden issues early in the development process and encourage a more structured way of validating AI behavior. I especially appreciate how the article explains real-world testing strategies that can improve the stability and trustworthiness of AI agents. Overall, it is a valuable guide for developers looking to create stronger, more dependable AI solutions.

klement Gunndu

You nailed it — catching issues early through structured testing saves massive debugging sessions later. The boundary pattern especially pays off because LLM outputs drift in ways traditional software never does. Appreciate you reading!

klement Gunndu

Exactly right — the build-to-test ratio is way off in most AI projects. I've seen teams ship agents that pass demo day but break on the first unexpected LLM response format in production.