VCR-Style Record/Replay for LLM Tests: Make Your Agent Tests Deterministic and Free

#hermeschallenge #ai #python #agents

You are writing tests for your agent. Your tests call the real LLM API. Every test run costs money. The tests are flaky because the LLM response is non-deterministic. Your CI pipeline fails 10% of the time due to API rate limits.

VCR-style testing is the standard solution for HTTP-dependent code. Record the real API responses once. Replay them in subsequent test runs without hitting the network. llm-fixture-replay brings this pattern to LLM calls.

The Shape of the Fix

from llm_fixture_replay import LLMFixture

# In record mode: calls the real API and saves responses
# In replay mode: returns saved responses without hitting the API

fixture = LLMFixture(
    path="./fixtures/customer-support.jsonl",
    mode="auto",  # "record" | "replay" | "auto"
)

def call_llm(messages: list[dict]) -> dict:
    return fixture.call(
        fn=anthropic_client.messages.create,
        model="claude-sonnet-4-6",
        messages=messages,
        max_tokens=1024,
    )

First run with an empty fixture file: calls the real API, saves the request/response pair to the JSONL file. Subsequent runs with the same inputs: returns the saved response without any network call.

What It Does NOT Do

llm-fixture-replay does not handle partial matches. The replay looks for an exact match of the request in the fixture file. If you change the messages, model, or any other parameter, it is a miss and the real API is called (in auto mode) or an error is raised (in replay mode).

It does not modify the fixture automatically. Updating a test's expected response means deleting the old fixture and recording a new one. There is no "re-record this one entry" operation; the fixture is a complete JSONL file.

It does not intercept at the HTTP layer. It wraps the fn callable you provide. If your code calls the API through multiple layers, you need to wrap the outermost callable that your agent code calls.

Inside the Library

Each fixture entry is one request/response pair in JSONL:

{"request": {"model": "claude-sonnet-4-6", "messages": [...], "max_tokens": 1024}, "response": {"content": [...], "usage": {...}}, "recorded_at": 1748107200}

The implementation:

class LLMFixture:
    def __init__(self, path: str, mode: str = "auto"):
        self._path = Path(path)
        self._mode = mode
        self._entries: list[dict] = []
        self._load()

    def _load(self) -> None:
        if self._path.exists():
            for line in self._path.read_text().splitlines():
                if line.strip():
                    self._entries.append(json.loads(line))

    def _key(self, kwargs: dict) -> str:
        canonical = json.dumps(kwargs, sort_keys=True, default=str)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def call(self, fn: Callable, **kwargs) -> Any:
        key = self._key(kwargs)

        # Look for match
        for entry in self._entries:
            if entry["key"] == key:
                return entry["response"]

        if self._mode == "replay":
            raise FixtureMiss(key=key, kwargs=kwargs)

        # Record mode: call real API
        response = fn(**kwargs)

        entry = {
            "key": key,
            "request": kwargs,
            "response": response,
            "recorded_at": time.time(),
        }

        self._entries.append(entry)
        with self._path.open("a") as f:
            f.write(json.dumps(entry, default=str) + "\n")

        return response

The mode="auto" behavior: try to replay from fixture; if not found, call real API and record. This means a test run with new inputs will automatically extend the fixture file. When you commit the updated fixture to git, subsequent test runs are fully offline.

When to Use It

Use it for integration tests that need realistic LLM responses. The first run produces real responses. Subsequent runs are deterministic and free.

Use it for regression testing when you change prompts. Record the baseline responses with the old prompt. Update the prompt. Run in auto mode. New requests (from the changed prompt) hit the real API and get recorded. You can then diff old vs new responses to evaluate the change.

Use it when you are implementing a feature and want predictable test behavior without mocking every API detail manually. Record real responses once, replay forever.

Skip it for tests that specifically need to test API error handling. For those, use a mock that explicitly raises the errors you want to test.

Install

pip install git+https://github.com/MukundaKatta/llm-fixture-replay

# Or from PyPI
pip install llm-fixture-replay

from llm_fixture_replay import LLMFixture
import pytest

@pytest.fixture
def llm_fixture(tmp_path):
    # In CI: use committed fixtures (replay mode)
    # Locally: record new fixtures as needed (auto mode)
    mode = "replay" if os.environ.get("CI") else "auto"
    return LLMFixture(
        path="./tests/fixtures/agent-responses.jsonl",
        mode=mode,
    )

def test_customer_support_agent(llm_fixture):
    def mock_call(**kwargs):
        return llm_fixture.call(fn=real_anthropic_client.messages.create, **kwargs)

    agent = CustomerSupportAgent(llm_fn=mock_call)
    response = agent.handle("What is your return policy?")

    assert "30 days" in response
    assert "receipt" in response.lower()

Sibling Libraries

Library	What it solves
`prompt-replay`	Cross-provider replay for prompt regression testing
`agent-debug-replay`	Step-through navigation of recorded agent runs
`agent-state-checkpoint`	Checkpoint state to resume recorded runs
`llm-multi-vote`	Run the same prompt multiple times and compare
`agent-shadow-mode`	Record-not-execute for pre-deployment validation

The testing stack: llm-fixture-replay for request-level VCR, agent-debug-replay for step-level replay, prompt-replay for cross-model comparison, agent-shadow-mode for production pre-validation.

What's Next

Match modes: beyond exact matching, support approximate matching where the model and max_tokens must match but message content can vary within a fuzzy threshold. Useful when prompts change slightly between test runs but you want the same fixture to apply.

Fixture compression: the JSONL file grows as you add more test cases. Optional gzip compression would reduce storage by 60-80% for typical JSON response fixtures, making fixture files practical to commit to git even for large test suites.

Fixture expiry: LLMFixture(path=..., max_age_days=30) that treats fixture entries older than 30 days as misses. After expiry, the real API is called and a fresh response is recorded. Prevents tests from using months-old responses that may no longer reflect current model behavior.

Built as part of the agent-stack family: composable Python primitives for production LLM agents.