klement Gunndu

Test Your AI Agent Like a Senior Engineer: 4 Patterns That Work

Your AI agent passes every unit test. Then it hallucinates a database schema in production, invents an API endpoint that doesn't exist, and confidently returns a JSON response missing three required fields.

Unit tests prove your functions run. They don't prove your agent works. The difference costs you production incidents, user trust, and the 3 AM pages that make you question your career choices.

Here are 4 testing patterns that senior engineers use to catch these failures before deployment — with working Python code for each.

Pattern 1: Schema Contract Tests

The first thing that breaks in an AI agent is the output format. You ask for structured data, the LLM returns something close but not quite right. A missing field. A string where you expected an integer. A nested object with an unexpected key.

Schema contract tests enforce that every agent output matches an exact Pydantic model — and they do it without calling the real LLM.

from pydantic import BaseModel, Field
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

class AnalysisResult(BaseModel):
    summary: str = Field(min_length=10)
    confidence: float = Field(ge=0.0, le=1.0)
    categories: list[str] = Field(min_length=1)
    source_count: int = Field(ge=1)

analysis_agent = Agent(
    "openai:gpt-4o",
    output_type=AnalysisResult,
    system_prompt="Analyze the given text and return structured results.",
)

async def test_output_schema_enforced():
    """Every response must match AnalysisResult exactly."""
    with analysis_agent.override(model=TestModel()):
        result = await analysis_agent.run("Analyze this quarterly report.")
        # TestModel returns default-valid data matching AnalysisResult
        assert isinstance(result.output, AnalysisResult)
        assert 0.0 <= result.output.confidence <= 1.0
        assert len(result.output.categories) >= 1
        assert result.output.source_count >= 1

The key detail: TestModel from Pydantic AI generates deterministic responses that satisfy your output type's schema. No API calls. No flaky tests. No cost. The test runs in milliseconds and proves your schema constraints are enforced at the framework level.

For stricter validation — catching coercion issues where "3" silently becomes 3 — use Pydantic's strict mode:

import pytest
from pydantic import ValidationError

def test_strict_schema_validation():
    """Catch silent type coercion before production does."""
    raw_output = {
        "summary": "Market analysis complete",
        "confidence": "0.95",  # A string, not a float; strict mode must reject it
        "categories": ["finance"],
        "source_count": 3,
    }
    with pytest.raises(ValidationError):
        AnalysisResult.model_validate(raw_output, strict=True)

This pattern catches the entire class of "the LLM returned something that looks right but isn't" failures.
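The same idea works without any agent framework at all: validate recorded production outputs (golden fixtures) against the schema on every commit. A minimal sketch using plain Pydantic v2, with the fixtures inlined for illustration (in practice you would load them from files):

```python
from pydantic import BaseModel, Field, ValidationError

class AnalysisResult(BaseModel):
    summary: str = Field(min_length=10)
    confidence: float = Field(ge=0.0, le=1.0)
    categories: list[str] = Field(min_length=1)
    source_count: int = Field(ge=1)

# Recorded agent outputs; in a real suite, loaded from fixture files.
golden_outputs = [
    {"summary": "Revenue grew 12% quarter over quarter.",
     "confidence": 0.9, "categories": ["finance"], "source_count": 2},
    {"summary": "short",  # violates min_length=10
     "confidence": 0.9, "categories": ["finance"], "source_count": 2},
]

def validate_fixtures(fixtures: list[dict]) -> list[int]:
    """Return the indices of fixtures that fail schema validation."""
    bad = []
    for i, raw in enumerate(fixtures):
        try:
            AnalysisResult.model_validate(raw)
        except ValidationError:
            bad.append(i)
    return bad

print(validate_fixtures(golden_outputs))  # [1]
```

If your schema tightens over time, this catches old outputs that would no longer validate, which is exactly the regression you want to know about before deploying.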

Pattern 2: Deterministic Tool Call Tests

Your agent calls tools — search APIs, databases, calculators. The tool call sequence is where most production bugs hide. The agent calls the wrong tool, passes wrong arguments, or skips a tool it should have used.

FunctionModel from Pydantic AI lets you script the exact sequence of LLM responses, so you can assert on every tool call:

from pydantic_ai.models.function import FunctionModel, AgentInfo
from pydantic_ai.messages import (
    ModelMessage,
    ModelResponse,
    TextPart,
    ToolCallPart,
    ToolReturnPart,
)
from pydantic_ai import Agent, capture_run_messages

def controlled_tool_sequence(
    messages: list[ModelMessage], info: AgentInfo
) -> ModelResponse:
    """Simulate an LLM that calls tools in a specific order."""
    # Count completed tool rounds by counting tool *results*. Matching on
    # hasattr(p, "tool_name") would double-count, because tool calls and
    # tool returns both carry a tool_name attribute.
    call_count = sum(
        1
        for m in messages
        for p in getattr(m, "parts", [])
        if isinstance(p, ToolReturnPart)
    )
    if call_count == 0:
        # First round: agent should search
        return ModelResponse(
            parts=[ToolCallPart("search_docs", {"query": "revenue Q3"})]
        )
    elif call_count == 1:
        # Second round: agent should calculate
        return ModelResponse(
            parts=[ToolCallPart("calculate", {"expression": "150000 * 1.12"})]
        )
    # Final round: return the answer
    return ModelResponse(
        parts=[TextPart("Q3 revenue grew 12% to $168,000.")]
    )

research_agent = Agent("openai:gpt-4o", system_prompt="Research financial data.")

@research_agent.tool_plain
def search_docs(query: str) -> str:
    return "Q3 revenue: $150,000"

@research_agent.tool_plain
def calculate(expression: str) -> str:
    # eval() is acceptable in a test fixture; never use it on real user input.
    return str(eval(expression))

async def test_tool_call_sequence():
    """Agent must call search_docs before calculate."""
    with capture_run_messages() as messages:
        with research_agent.override(model=FunctionModel(controlled_tool_sequence)):
            result = await research_agent.run("What was Q3 revenue growth?")

    tool_calls = [
        part.tool_name
        for msg in messages
        for part in getattr(msg, "parts", [])
        if isinstance(part, ToolCallPart)  # calls only, not tool returns
    ]
    assert tool_calls == ["search_docs", "calculate"], (
        f"Expected search then calculate, got {tool_calls}"
    )

This test doesn't check if the LLM "feels right." It asserts on the exact behavioral contract: search first, then calculate. If a prompt change causes the agent to skip the search step, this test fails immediately.
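If several tests assert ordering constraints, it helps to express the contract once as a helper. The sketch below is a plain-Python assertion of our own, not a Pydantic AI API; it checks that the expected tool names appear in order, allowing unrelated calls in between:

```python
def assert_called_in_order(tool_calls: list[str], *expected: str) -> None:
    """Assert that `expected` tool names appear in `tool_calls` in the
    given order; other calls may be interleaved between them."""
    remaining = iter(tool_calls)
    for name in expected:
        # Advance the iterator until we find the next expected name.
        if not any(call == name for call in remaining):
            raise AssertionError(
                f"expected {list(expected)} in order, got {tool_calls}"
            )

# Passes: search happens before calculate; extra calls are tolerated.
assert_called_in_order(
    ["search_docs", "fetch_cache", "calculate"],
    "search_docs", "calculate",
)
```

Using a subsequence check rather than strict equality keeps tests stable when you add an innocuous tool to the agent, while still failing if the ordering contract itself is violated.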

Pattern 3: Eval Datasets With Scoring Rubrics

Unit tests check individual behaviors. Eval datasets measure whether your agent is getting better or worse across dozens of scenarios simultaneously.

Pydantic Evals (part of the Pydantic AI ecosystem, installed via pip install pydantic-evals) provides a framework for this:

from dataclasses import dataclass, field

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

@dataclass
class ContainsExpectedInfo(Evaluator[str, str]):
    """Check if the output contains the expected key information."""

    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
        if ctx.expected_output and ctx.expected_output.lower() in ctx.output.lower():
            return 1.0
        return 0.0

@dataclass
class NoHallucination(Evaluator[str, str]):
    """Flag outputs that mention topics not in the input."""

    # Evaluators are dataclasses, so a mutable default needs default_factory.
    forbidden_terms: list[str] = field(default_factory=list)

    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
        for term in self.forbidden_terms:
            if term.lower() in ctx.output.lower():
                return 0.0
        return 1.0

dataset = Dataset(
    cases=[
        Case(
            name="capital_of_france",
            inputs="What is the capital of France?",
            expected_output="Paris",
        ),
        Case(
            name="python_creator",
            inputs="Who created Python?",
            expected_output="Guido van Rossum",
        ),
        Case(
            name="no_hallucinate_dates",
            inputs="What is Python?",
            expected_output="programming language",
            evaluators=[
                NoHallucination(forbidden_terms=["founded in 2025", "version 4.0"]),
            ],
        ),
    ],
    evaluators=[ContainsExpectedInfo()],
)

Run it against your agent function:

async def my_agent(question: str) -> str:
    """Your actual agent call goes here."""
    # Replace with your agent.run() call
    result = await analysis_agent.run(question)
    return result.output.summary

report = dataset.evaluate_sync(my_agent)
report.print(include_input=True, include_output=True)

The output is a table showing pass/fail per case, scores per evaluator, and aggregate metrics. Run this in CI. Track scores across commits. When a prompt change drops your hallucination score from 1.0 to 0.7, you catch it before deployment — not after a user files a bug report.
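To turn those scores into a hard CI gate, compare per-evaluator averages against a floor and fail the build on any regression. The sketch below assumes you have already extracted the averages into a plain dict; how you pull them out of the report object depends on your Pydantic Evals version:

```python
def check_eval_scores(
    scores: dict[str, float], thresholds: dict[str, float]
) -> list[str]:
    """Return a human-readable failure message for every score below
    its required threshold; an empty list means the gate passes."""
    failures = []
    for name, floor in thresholds.items():
        score = scores.get(name, 0.0)  # a missing evaluator counts as 0
        if score < floor:
            failures.append(f"{name}: {score:.2f} < required {floor:.2f}")
    return failures

failures = check_eval_scores(
    scores={"ContainsExpectedInfo": 0.95, "NoHallucination": 0.70},
    thresholds={"ContainsExpectedInfo": 0.90, "NoHallucination": 0.95},
)
if failures:
    print("Eval regression:", "; ".join(failures))
```

In CI, a non-empty failure list becomes a non-zero exit code, so a prompt change that drops the hallucination score can never merge quietly.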

The real power: you build this dataset incrementally. Every production failure becomes a new test case. After 6 months, you have 200 cases that encode every failure mode your agent has ever hit.
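One way to make that incremental growth concrete is to append each triaged incident to a JSON file that your test suite later loads into `Case` objects. The file format and helper below are our own convention, not part of Pydantic Evals:

```python
import json
import tempfile
from pathlib import Path

def record_failure(path: Path, name: str, inputs: str, expected: str) -> None:
    """Append a production failure to the regression-case file."""
    cases = json.loads(path.read_text()) if path.exists() else []
    cases.append({"name": name, "inputs": inputs, "expected_output": expected})
    path.write_text(json.dumps(cases, indent=2))

# After triaging two incidents, capture each one once:
with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "regression_cases.json"
    record_failure(path, "missing_source_count",
                   "Summarize the Q3 earnings call.", "source_count")
    record_failure(path, "wrong_currency",
                   "Report revenue in USD.", "USD")
    names = [c["name"] for c in json.loads(path.read_text())]
    print(names)  # ['missing_source_count', 'wrong_currency']
```

At test time, each dict maps directly onto a `Case(name=..., inputs=..., expected_output=...)`, so the dataset grows without touching test code.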

Pattern 4: Failure Injection Tests

Your agent works when everything works. The question is: what happens when things break?

Failure injection tests simulate the real failure modes — API timeouts, malformed LLM responses, tool exceptions — and verify your agent degrades gracefully instead of crashing or hallucinating.

import pytest
from pydantic_ai.models.function import FunctionModel, AgentInfo
from pydantic_ai.messages import ModelMessage, ModelResponse, TextPart

def timeout_model(
    messages: list[ModelMessage], info: AgentInfo
) -> ModelResponse:
    """Simulate an LLM that takes too long to respond."""
    raise TimeoutError("Model API timed out after 30s")

def garbage_model(
    messages: list[ModelMessage], info: AgentInfo
) -> ModelResponse:
    """Simulate an LLM that returns unparseable output."""
    return ModelResponse(parts=[TextPart("sure here is {broken json")])

async def test_agent_handles_timeout():
    """Agent must not crash on LLM timeout."""
    with research_agent.override(model=FunctionModel(timeout_model)):
        with pytest.raises(TimeoutError):
            await research_agent.run("Analyze this data.")
    # If your agent has retry logic, test that instead:
    # assert result.output contains fallback message

async def test_agent_handles_malformed_response():
    """Agent must surface garbage LLM output as a clear error."""
    # Use the structured-output agent here: a plain-text agent would accept
    # this string as a valid answer, so the test could never fail.
    with analysis_agent.override(model=FunctionModel(garbage_model)):
        with pytest.raises(Exception) as exc_info:
            # Pydantic AI retries validation, then raises once retries are
            # exhausted; it must not silently corrupt data.
            await analysis_agent.run("Analyze this data.")
        msg = str(exc_info.value).lower()
        assert "retr" in msg or "valid" in msg or "parse" in msg

For tool failures, inject exceptions at the tool level:

# Create a separate agent with a tool that always fails
failing_agent = Agent("openai:gpt-4o", system_prompt="Research financial data.")

@failing_agent.tool_plain
def search_docs(query: str) -> str:
    raise ConnectionError("Search API is down")

@failing_agent.tool_plain
def calculate(expression: str) -> str:
    return str(eval(expression))

async def test_agent_survives_tool_failure():
    """Agent must fail loudly when a tool throws, not hallucinate an answer."""
    with failing_agent.override(model=TestModel()):
        # TestModel calls the registered tools, so the broken search tool
        # is exercised; its exception should surface clearly rather than
        # being swallowed and papered over with a made-up answer.
        with pytest.raises(ConnectionError):
            await failing_agent.run("Find Q3 revenue data.")

This pattern catches the most dangerous failure mode in production AI systems: the agent encounters an error, silently ignores it, and returns a confident-sounding response based on nothing.
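Once your tests pin down how failures should surface, production code can enforce the same "fail loudly, never confidently" policy with a thin wrapper that converts any exception into an explicit error reply. The names below are ours, not a Pydantic AI API:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class AgentReply:
    ok: bool
    text: str

async def run_with_fallback(agent_call, prompt: str) -> AgentReply:
    """Run the agent; on any failure, return an explicit error reply
    instead of letting a partial or invented answer leak through."""
    try:
        return AgentReply(ok=True, text=await agent_call(prompt))
    except Exception as e:
        return AgentReply(ok=False, text=f"Agent unavailable: {type(e).__name__}")

# A stand-in agent that always times out, to show the failure path:
async def flaky_agent(prompt: str) -> str:
    raise TimeoutError("Model API timed out")

reply = asyncio.run(run_with_fallback(flaky_agent, "Analyze this data."))
print(reply)  # AgentReply(ok=False, text='Agent unavailable: TimeoutError')
```

Callers branch on `ok` instead of parsing prose, so a downstream service can never mistake an error message for an answer.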

Putting It All Together: The Test Pyramid for AI Agents

Traditional software has a test pyramid: unit tests at the bottom, integration tests in the middle, end-to-end tests at the top.

AI agents need a different pyramid:

        /  E2E  \          Multi-turn conversations (expensive, slow)
       /  Evals  \         Dataset scoring across scenarios (medium)
      / Contracts \        Schema + tool call sequence (fast, cheap)
     /  Failure   \        Timeout, malformed, tool crash (fast)
    / Unit (tools) \       Individual tool functions (fastest)

Bottom layer (unit tests for tools): Test each tool function in isolation. Pure functions, no LLM involved. Run on every commit.

Second layer (failure injection): Inject timeouts, bad responses, tool crashes. Run on every commit. Catches resilience regressions.

Third layer (schema contracts): Use TestModel and FunctionModel to verify output schemas and tool call sequences. Run on every commit. Catches behavioral regressions.

Fourth layer (eval datasets): Run your growing dataset against the actual agent (with real LLM calls). Run in CI on PRs. Catches quality regressions.

Top layer (end-to-end): Multi-turn conversations with real tools and real LLMs. Run nightly or before releases. Catches integration issues.

The bottom three layers run in milliseconds, cost nothing, and catch 80% of production failures. The top two layers catch the remaining 20% but require API calls and take minutes.

What Changes After You Adopt These Patterns

Before: you ship prompt changes and hope nothing breaks. A user reports that the agent started returning incomplete data. You dig through logs for 2 hours trying to reproduce the issue.

After: a CI check fails the moment a prompt change breaks the output schema contract. The eval dataset shows a 15% drop in accuracy on financial queries. The failure injection test reveals the agent no longer retries on timeouts. You fix all three before merging.

The shift isn't just technical — it's cultural. When you have tests that encode what "working" means, prompt engineering stops being guesswork and becomes engineering.


Follow @klement_gunndu for more AI engineering content. We're building in public.
