How to mock OpenAI and Anthropic API calls in pytest (without monkeypatching)

#python #testing #ai #pytest

If you're building with LLM APIs, you've probably hit this problem: your test suite makes real API calls.

That means:

Every CI run costs money
Tests are slow (1–3 seconds per call)
Tests are flaky (LLM outputs aren't deterministic)
Tests fail without an API key

The usual fix is monkeypatching — replacing the client with a fake. But that means maintaining fake responses by hand, and your tests stop reflecting what the real API actually returns.
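For contrast, a hand-rolled fake might look roughly like the sketch below. The `client` stand-in and `fake_create` are hypothetical and only mimic the shape of an SDK client, so the example runs without any SDK installed.

```python
from types import SimpleNamespace
from unittest.mock import patch

# Hypothetical stand-in for a module-level SDK client (no real SDK involved).
client = SimpleNamespace(messages=SimpleNamespace(create=None))


def summarize(text: str) -> str:
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=100,
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return message.content[0].text


def fake_create(**kwargs):
    # Hand-written response: nothing checks that it matches the real API's shape.
    return SimpleNamespace(content=[SimpleNamespace(text="A summary about climate.")])


def test_summarize_with_fake():
    # Swap the real call for the fake only while the test body runs.
    with patch.object(client.messages, "create", fake_create):
        assert "climate" in summarize("Long article about climate change...")
```

The test passes, but only because `fake_create` returns whatever its author typed. If the real response format changes, this test keeps passing anyway.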

There's a better approach: record real responses once, replay them forever.

The record/replay pattern

Instead of faking the API, you intercept it:

Record mode — runs against the real API once, saves the response to a JSON file
Replay mode — returns the saved response instantly, no network call

The fixture file gets committed to git. CI runs offline. Tests are deterministic and free.
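The pattern itself is small enough to sketch with the standard library. This is not how llm-mock is implemented; `record_or_replay` and its parameters are illustrative names, and `call_api` stands for whatever function would hit the real API.

```python
import json
from pathlib import Path
from typing import Callable


def record_or_replay(fixture: Path, mode: str, call_api: Callable[[], dict]) -> dict:
    """Replay a saved response, or call the API once and save what it returned."""
    if mode == "replay":
        # No network: read back what was recorded earlier.
        return json.loads(fixture.read_text())
    if mode == "record":
        response = call_api()  # the only place the real API would be hit
        fixture.parent.mkdir(parents=True, exist_ok=True)
        fixture.write_text(json.dumps(response, indent=2))
        return response
    raise ValueError(f"unknown mode: {mode!r}")
```

In replay mode `call_api` is never invoked, which is why replayed tests need neither a network connection nor an API key.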

llm-mock

llm-mock (https://github.com/autopost/llm-mock) is a pytest plugin that does exactly this for the Anthropic and OpenAI SDKs. It intercepts at the HTTP layer, so your production code is never touched.

pip install llm-mock

Example: testing an Anthropic pipeline

Say you have this production code:

# my_app/pipeline.py
import anthropic

# Reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

def summarize(text: str) -> str:
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=100,
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    # The first content block holds the generated text.
    return message.content[0].text

Step 1 — Record once

Run this locally with your API key:

# record_fixtures.py
from llm_mock import llm_mock
from my_app.pipeline import summarize

with llm_mock(mode="record", fixture="tests/fixtures/summarize"):
    result = summarize("Long article about climate change...")
    print(result)

ANTHROPIC_API_KEY=sk-... python record_fixtures.py

This creates tests/fixtures/summarize.json. Commit it to git.

Step 2 — Replay in tests

# tests/test_pipeline.py
import pytest
from my_app.pipeline import summarize

@pytest.mark.llm_replay(fixture="summarize")
def test_summarize():
    result = summarize("Long article about climate change...")
    assert "climate" in result

pytest # no API key needed, instant, deterministic

That's it. The decorator intercepts the httpx call the Anthropic SDK makes internally and returns the saved response.

Auto mode — the easiest workflow

If you don't want to think about record vs replay at all, use mode="auto":

@pytest.mark.llm_replay(fixture="summarize", mode="auto")
def test_summarize():
    result = summarize("Long article about climate change...")
    assert "climate" in result

Fixture exists → replays it
Fixture missing → records it automatically

First run needs an API key. Every run after that is free and offline.
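The decision auto mode makes can be sketched in a few lines. `resolve_mode` is a hypothetical helper written for illustration, not part of llm-mock's API.

```python
from pathlib import Path


def resolve_mode(fixture: Path, mode: str = "auto") -> str:
    """Turn "auto" into a concrete mode based on whether the fixture file exists."""
    if mode != "auto":
        return mode  # an explicit "record" or "replay" is left alone
    return "replay" if fixture.exists() else "record"
```

Under this logic, a fixture that was never committed makes CI fall through to record mode, which needs an API key. Checking that fixture files are in git before pushing avoids that surprise.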

Refreshing fixtures

When you change a prompt or update a model, refresh fixtures without touching test code:

LLM_MOCK_DISABLED=1 ANTHROPIC_API_KEY=sk-... pytest

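LLM_MOCK_DISABLED=1 bypasses llm-mock entirely, so every test hits the real API. A plugin that is fully bypassed cannot also write fixtures, so if fresh responses are not saved on your version, delete the stale fixture files and rerun with mode="auto" to re-record them.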

What the fixture looks like

Plain JSON, human-readable, diff-friendly in PRs:

{
  "version": "1.0",
  "provider": "anthropic",
  "interactions": [
    {
      "hash": "a3f2c1...",
      "request": {
        "model": "claude-sonnet-4-6",
        "messages": [{"role": "user", "content": "Summarize: Long article..."}],
        "max_tokens": 100
      },
      "response": {
        "content": [{"type": "text", "text": "The article covers..."}],
        "stop_reason": "end_turn"
      },
      "recorded_at": "2026-06-01T10:00:00+00:00"
    }
  ]
}

Requests are matched by the SHA-256 hash of (model, messages, temperature), so the same request always hits the same fixture entry.
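A matching key of that kind could be computed along these lines. The canonical JSON encoding and the `request_hash` name are assumptions made for this sketch; llm-mock's actual serialization may differ, so the digests will not necessarily match the ones in a real fixture file.

```python
import hashlib
import json
from typing import Optional


def request_hash(model: str, messages: list, temperature: Optional[float] = None) -> str:
    """SHA-256 over a canonical JSON encoding of (model, messages, temperature)."""
    canonical = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,          # key order inside dicts must not change the hash
        separators=(",", ":"),   # no whitespace differences either
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

If the key really is limited to those three fields, changing max_tokens would still match the old fixture entry, while any change to the prompt text would miss it and require a fresh recording.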

Works with OpenAI too

@pytest.mark.llm_replay(fixture="gpt_summary", mode="auto")
def test_openai_summary():
    result = my_openai_pipeline("Summarize this...")
    assert len(result) > 0

Both providers can share one fixture file if you use provider="all" (the default).

vs other approaches

Approach	Problem
unittest.mock / monkeypatch	Fake responses drift from real API behavior
VCR.py	Records raw HTTP — doesn't understand LLM request semantics
Always hit real API	Expensive, slow, flaky, needs credentials in CI
llm-mock	Record once, replay forever, fixtures in git