<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: airupt</title>
    <description>The latest articles on DEV Community by airupt (@airupt).</description>
    <link>https://dev.to/airupt</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3857558%2Fbc3bb76e-0180-455c-8ebf-656d2255fd8b.jpg</url>
      <title>DEV Community: airupt</title>
      <link>https://dev.to/airupt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/airupt"/>
    <language>en</language>
    <item>
      <title>How I stopped paying OpenAI to run my test suite</title>
      <dc:creator>airupt</dc:creator>
      <pubDate>Thu, 02 Apr 2026 12:07:01 +0000</pubDate>
      <link>https://dev.to/airupt/how-i-stopped-paying-openai-to-run-my-test-suite-1i7e</link>
      <guid>https://dev.to/airupt/how-i-stopped-paying-openai-to-run-my-test-suite-1i7e</guid>
      <description>&lt;p&gt;I was building an AI project and ran into something that kept bothering me.&lt;/p&gt;

&lt;p&gt;Every test that touched my LLM code was making a real API call. To OpenAI. Every single time.&lt;/p&gt;

&lt;p&gt;Tests were slow — 3 to 5 seconds each just waiting for a response. Every CI run cost real money in tokens, for code that hadn't even shipped yet. And the tests were &lt;strong&gt;flaky&lt;/strong&gt;: same code, same input, different output. Language models are non-deterministic, so I'd get a passing run, then a failing run, with no way to tell if my code was broken or if the model just felt like responding differently.&lt;/p&gt;

&lt;h2&gt;The existing options aren't great&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mock the OpenAI Python client&lt;/strong&gt; — then you're not testing the HTTP layer at all. The real SDK does a lot between your code and the wire.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VCR-style cassettes&lt;/strong&gt; — brittle. They break whenever the SDK updates its request format, which happens constantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ollama / local model&lt;/strong&gt; — needs a GPU to be usably fast, is still non-deterministic, and is slow to start.&lt;/p&gt;

&lt;p&gt;None of these gave me what I actually wanted: fast, deterministic, zero-cost tests that behave like the real API.&lt;/p&gt;

&lt;h2&gt;What I built&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/airupt/stubllm" rel="noopener noreferrer"&gt;stubllm&lt;/a&gt; is a local HTTP server that impersonates OpenAI, Anthropic, or Gemini. You define fixtures in YAML — pattern to match, response to return — and point your code at &lt;code&gt;localhost&lt;/code&gt; instead of the real API. Your code doesn't know the difference.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;fixtures&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;greeting"&lt;/span&gt;
    &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
      &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user&lt;/span&gt;
          &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello"&lt;/span&gt;
    &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello!&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;How&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;can&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;help&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;you&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;today?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any test that sends a message containing &lt;code&gt;"hello"&lt;/code&gt; gets that response back. Every time. In under 1ms.&lt;/p&gt;

&lt;h2&gt;pytest plugin&lt;/h2&gt;

&lt;p&gt;There's a built-in pytest plugin. You get a &lt;code&gt;stubllm_server&lt;/code&gt; fixture that starts the server, loads your fixtures, and resets state between tests automatically.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;stubllm.pytest_plugin&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;use_fixtures&lt;/span&gt;

&lt;span class="nd"&gt;@use_fixtures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fixtures/chat.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_greeting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stubllm_server&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stubllm_server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;openai_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;stubllm_server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call_count&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Error injection&lt;/h2&gt;

&lt;p&gt;You can also simulate errors — rate limits, 500s, auth failures — to test your retry logic without ever triggering a real API error.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;fixtures&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rate_limit"&lt;/span&gt;
    &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
    &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;http_status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;429&lt;/span&gt;
      &lt;span class="na"&gt;error_message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;exceeded."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each provider gets its native error format automatically. OpenAI gets the OpenAI error envelope, Anthropic gets the Anthropic one. You don't have to think about it.&lt;/p&gt;
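
&lt;p&gt;Here's a minimal sketch of a retry-logic test against the fixture above. The &lt;code&gt;fixtures/errors.yaml&lt;/code&gt; path is a placeholder of mine; &lt;code&gt;openai.RateLimitError&lt;/code&gt; and &lt;code&gt;max_retries&lt;/code&gt; are the OpenAI SDK's real 429 exception and retry setting.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import openai
import pytest

from stubllm.pytest_plugin import use_fixtures

@use_fixtures("fixtures/errors.yaml")  # placeholder path for the fixture above
def test_rate_limit_is_surfaced(stubllm_server):
    client = openai.OpenAI(
        base_url=stubllm_server.openai_url,
        api_key="test-key",
        max_retries=0,  # fail fast instead of letting the SDK retry the 429
    )
    # The stub answers with a 429 in OpenAI's error envelope, so the SDK
    # raises its normal rate-limit exception.
    with pytest.raises(openai.RateLimitError):
        client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": "hello"}],
        )
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;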

&lt;h2&gt;Streaming&lt;/h2&gt;

&lt;p&gt;Streaming works out of the box too. Whether your code requests &lt;code&gt;stream=True&lt;/code&gt; or not, stubllm handles it — the fixture stays the same either way.&lt;/p&gt;
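
&lt;p&gt;As a sketch, here's the greeting test again with &lt;code&gt;stream=True&lt;/code&gt;, assuming the stub streams the fixture content back in standard OpenAI chunks:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import openai

from stubllm.pytest_plugin import use_fixtures

@use_fixtures("fixtures/chat.yaml")
def test_greeting_streams(stubllm_server):
    client = openai.OpenAI(
        base_url=stubllm_server.openai_url,
        api_key="test-key",
    )
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "hello"}],
        stream=True,
    )
    # Stitch the delta chunks back together into the full fixture response.
    text = "".join(
        chunk.choices[0].delta.content or ""
        for chunk in stream
        if chunk.choices
    )
    assert text == "Hello! How can I help you today?"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;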

&lt;h2&gt;Try it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;stubllm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's open source, MIT licensed, and supports OpenAI, Anthropic, and Gemini. I just released v0.1.2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/airupt/stubllm" rel="noopener noreferrer"&gt;https://github.com/airupt/stubllm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy to answer questions in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>testing</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
