How to Test LLM Agents Without Calling the Real API

#hermeschallenge #ai #python #agents

Testing an LLM agent in CI is annoying. The API costs money on every run. Rate limits bite you at the worst time. And the outputs are nondeterministic, so a naive assertion will flake 20% of the time regardless of whether your code is correct.

Most teams solve this by not testing. They run the agent manually, confirm it looks right, and push. That works until it doesn't. A changed prompt, a new tool, a slightly different system message, and you get a regression you only find in production.

There is a better approach. You do not need the real API in CI. You need three patterns: a FakeProvider for unit tests, agentsnap for regression snapshots, and module-level mocks for integration tests. Each has a job. None of them replaces the others.

The FakeProvider Pattern

A FakeProvider is a drop-in replacement for your LLM client that returns canned responses. No HTTP. No token spend. Deterministic output every time.

# fake_provider.py
from typing import Iterator

class FakeMessage:
    def __init__(self, content: str):
        self.content = content
        self.tool_calls = []

class FakeChoice:
    def __init__(self, content: str):
        self.message = FakeMessage(content)

class FakeCompletion:
    def __init__(self, content: str):
        self.choices = [FakeChoice(content)]

class FakeProvider:
    """Drop-in stub for an OpenAI-compatible client."""

    def __init__(self, responses: list[str]):
        self._responses = list(responses)
        self._call_count = 0
        self.calls: list[dict] = []

    @property
    def chat(self):
        return self

    @property
    def completions(self):
        return self

    def create(self, **kwargs) -> FakeCompletion:
        self.calls.append(kwargs)
        if not self._responses:
            raise ValueError("FakeProvider ran out of responses")
        response = self._responses[self._call_count % len(self._responses)]
        self._call_count += 1
        return FakeCompletion(response)


# tests/test_summarizer.py
from fake_provider import FakeProvider
from myagent.summarizer import summarize_document

def test_summarizer_trims_long_output():
    provider = FakeProvider([
        "Here is the summary: " + "word " * 300,  # 300-word stub response
    ])
    result = summarize_document("some long text", client=provider, max_words=100)
    assert len(result.split()) <= 100
    assert len(provider.calls) == 1

def test_summarizer_retries_on_empty_response():
    provider = FakeProvider(["", "Here is the summary: short and done."])
    result = summarize_document("some text", client=provider)
    assert "short and done" in result
    assert len(provider.calls) == 2

Your agent function needs to accept client as a parameter. If it creates the client internally, you cannot swap it. Dependency injection is not optional here.

Snapshot Testing with agentsnap

The FakeProvider tests behavior. agentsnap tests regression: you run the real agent once, capture the full trace of tool calls and outputs, then replay that trace forever without hitting the API again.

# pip install agentsnap
from agentsnap import Snap, record_session, replay_session

# Step 1: record once (costs API tokens, run locally)
# AGENTSNAP_RECORD=1 pytest tests/snapshots/test_research_agent.py

# Step 2: replay forever in CI (free, deterministic)

# tests/snapshots/test_research_agent.py
import pytest
from agentsnap import Snap
from myagent.research import run_research_agent

SNAP = Snap("snapshots/research_agent_v1.json")

@SNAP.test
def test_research_agent_finds_sources(snap_client):
    """
    In record mode: runs the real agent, saves trace.
    In replay mode: feeds saved tool outputs back, checks final output matches.
    """
    result = run_research_agent(
        query="What is the capital of France?",
        client=snap_client,
    )
    assert "Paris" in result.answer
    assert len(result.sources) >= 1

# The snapshot file stores:
# - model inputs (messages, tools, temperature)
# - tool call sequences
# - tool outputs
# - final model response
# On replay, snap_client intercepts tool calls and feeds back the recorded outputs.

The snapshot file goes in version control. When you update the agent, you regenerate the snapshot on your machine and commit it. CI uses the committed snapshot. You spend tokens once, not on every push.

Retry Logic Testing with FakeProvider + llm-retry-py

Testing retry logic against a real API is painful. The rate limit has to actually fire. With FakeProvider, you control exactly when it fails.

from fake_provider import FakeProvider
from llm_retry import with_retry, RateLimitError
from myagent.caller import call_with_retry

def test_retries_on_rate_limit():
    def raise_rate_limit(**kwargs):
        raise RateLimitError("rate limited")

    provider = FakeProvider([])
    provider.completions.create = raise_rate_limit  # first call fails

    call_count = 0
    real_responses = ["Success after retry"]

    def side_effect(**kwargs):
        nonlocal call_count
        call_count += 1
        if call_count == 1:
            raise RateLimitError("rate limited")
        return FakeCompletion(real_responses[0])

    provider.completions.create = side_effect
    result = call_with_retry(client=provider, prompt="hello")
    assert result == "Success after retry"
    assert call_count == 2

What These Patterns Do NOT Do

FakeProvider and agentsnap cannot test reasoning quality. If your agent produces bad output because of a subtle prompt issue, no amount of unit testing will catch it. They also cannot catch emergent failures: behaviors that only appear when the model sees a specific combination of context.

Snapshot tests go stale. When you change tools or update the system prompt, old snapshots no longer reflect real behavior. You need to regenerate them. That is not a bug, that is the workflow.

Module-level mocks (httpretty, responses) work for tools that make HTTP calls directly. They do not work well for structured tool-use flows where the model decides which tool to call.

Structuring the Test Suite

Three tiers:

Smoke tests run on every commit. Use FakeProvider. Fast, no API tokens, covers happy paths and error branches.

Regression tests run on every PR. Use agentsnap. Replay real traces, confirm the agent still produces the same output structure. Regenerate snapshots when behavior intentionally changes.

Integration tests run before release or on a schedule. Use the real API with a dedicated low-quota key. Cover end-to-end flows that snapshots cannot capture.

Keep the tiers separate. Do not let integration tests sneak into the smoke tier. Mark them explicitly.

# pyproject.toml
[tool.pytest.ini_options]
markers = [
    "smoke: fast, no API calls",
    "regression: agentsnap replay, no API calls",
    "integration: real API, slow, run before release",
]

# Run smoke only in CI:
# pytest -m smoke

# Run smoke + regression locally:
# pytest -m "smoke or regression"

# Run all before release:
# pytest -m "smoke or regression or integration"

Quick-Start Snippet

pip install agentsnap llm-retry-py

# Create your FakeProvider (copy the pattern above, it has no dependencies)
# Write smoke tests with FakeProvider first

# When smoke tests pass, record a snapshot:
AGENTSNAP_RECORD=1 pytest tests/snapshots/ -m regression

# Commit the snapshot file
git add snapshots/
git commit -m "add agentsnap regression snapshot for research agent"

# CI runs replay automatically (no AGENTSNAP_RECORD set)
pytest -m "smoke or regression"

Related Libraries

Library	What It Does
agentsnap	Record and replay tool-call traces for regression tests
agenttrace	Structured trace export (cost, latency, tool calls per run)
agentvet	Static checks for agent configuration before deploy
llm-retry-py	Exponential backoff retry for LLM calls
prompt-eval-rubric	0.0-1.0 scoring rubrics for output quality checks
llm-output-validator	Rule-based string validation for model outputs

What's Next

Once you have smoke and regression tests running in CI, the next gap is output quality. Snapshot tests tell you the structure did not change. They do not tell you the content is good.

That is where prompt-eval-rubric and llm-output-validator fill in. You define a rubric (relevance, completeness, no hallucinated citations) and score outputs against it. You can run rubric checks against your agentsnap replays without touching the API.

The pattern after that is A/B prompt testing: keep the snapshot, change the prompt template, regenerate the snapshot, diff the outputs. That workflow is covered in post 114 on prompt engineering hygiene.