I was building an AI project and ran into something that kept bothering me.
Every test that touched my LLM code was making a real API call. To OpenAI. Every single time.
Tests were slow — 3 to 5 seconds each just waiting for a response. Every CI run cost real money in tokens, for code that hadn't even shipped yet. And the tests were flaky: same code, same input, different output. Language models are non-deterministic, so I'd get a passing run, then a failing run, with no way to tell if my code was broken or if the model just felt like responding differently.
The existing options aren't great
- Mock the OpenAI Python client — then you're not testing the HTTP layer at all. The real SDK does a lot between your code and the wire.
- VCR-style cassettes — brittle. They break whenever the SDK updates its request format, which happens constantly.
- Ollama / local model — needs a GPU, still non-deterministic, and slow to start.
None of these gave me what I actually wanted: fast, deterministic, zero-cost tests that behave like the real API.
What I built
stubllm is a local HTTP server that impersonates OpenAI, Anthropic, or Gemini. You define fixtures in YAML — pattern to match, response to return — and point your code at localhost instead of the real API. Your code doesn't know the difference.
fixtures:
  - name: "greeting"
    match:
      provider: openai
      messages:
        - role: user
          content:
            contains: "hello"
    response:
      content: "Hello! How can I help you today?"
Any test that sends a message containing "hello" gets that response back. Every time. In under 1ms.
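Outside of a test runner, using it is just the normal SDK pointed at localhost. A minimal sketch; the URL below is a placeholder, since the actual host and port depend on how you start the server:

import openai

# Placeholder address: assumes a stubllm server is already running locally
# with the fixture above loaded. Check the README for the real defaults.
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="test-key",  # any string works; the stub doesn't validate keys
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "well, hello there"}],
)
print(response.choices[0].message.content)
# -> "Hello! How can I help you today?"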
pytest plugin
There's a built-in pytest plugin. You get a stubllm_server fixture that starts the server, loads your fixtures, and resets state between tests automatically.
import openai

from stubllm.pytest_plugin import use_fixtures

@use_fixtures("fixtures/chat.yaml")
def test_greeting(stubllm_server):
    client = openai.OpenAI(
        base_url=stubllm_server.openai_url,
        api_key="test-key",
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "hello"}],
    )
    assert "Hello" in response.choices[0].message.content
    assert stubllm_server.call_count == 1
Error injection
You can also simulate errors — rate limits, 500s, auth failures — to test your retry logic without ever triggering a real API error.
fixtures:
  - name: "rate_limit"
    match:
      provider: openai
    response:
      http_status: 429
      error_message: "Rate limit exceeded."
Each provider gets its native error format automatically. OpenAI gets the OpenAI error envelope, Anthropic gets the Anthropic one. You don't have to think about it.
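Here's what a retry-logic test could look like against that fixture. A sketch with two assumptions: the YAML above is saved at fixtures/errors.yaml (the path is yours to choose), and the OpenAI v1 SDK maps the 429 to its usual openai.RateLimitError:

import openai
import pytest

from stubllm.pytest_plugin import use_fixtures

@use_fixtures("fixtures/errors.yaml")  # assumed path for the YAML above
def test_rate_limit_is_surfaced(stubllm_server):
    client = openai.OpenAI(
        base_url=stubllm_server.openai_url,
        api_key="test-key",
    )
    # The stub answers with a 429 wrapped in OpenAI's error envelope,
    # so the SDK raises the same exception it would against the real API.
    with pytest.raises(openai.RateLimitError):
        client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": "anything"}],
        )

From there you can point your own retry wrapper at the stub and assert it backs off the way you expect, instead of hoping a real rate limit shows up on cue.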
Streaming
Streaming works out of the box too. Whether your code requests stream=True or not, stubllm handles it — the fixture stays the same either way.
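For example, a streaming version of the earlier greeting test; this sketch assumes the stub slices the fixture content into SSE chunks the way the real API would:

import openai

from stubllm.pytest_plugin import use_fixtures

@use_fixtures("fixtures/chat.yaml")
def test_greeting_streamed(stubllm_server):
    client = openai.OpenAI(
        base_url=stubllm_server.openai_url,
        api_key="test-key",
    )
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "hello"}],
        stream=True,  # same fixture as before, now served chunk by chunk
    )
    # Reassemble the deltas exactly as you would with the real API.
    text = "".join(chunk.choices[0].delta.content or "" for chunk in stream)
    assert "Hello" in text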
Try it
pip install stubllm
It's open source, MIT licensed, and supports OpenAI, Anthropic, and Gemini. Just released v0.1.2.
GitHub: https://github.com/airupt/stubllm
Happy to answer questions in the comments.