I'm an automation engineer who writes mostly UI tests, with some API sprinkled in. A recruiter wrote to me about an interesting job: AI/LLM testing. I was curious to learn more, so I asked the model itself: what skills do I need to learn? The answer was this project.
What is it
A FastAPI service with one endpoint (/ask) that forwards a question to a local LLM (Ollama running llama3.2) and returns the answer. Plus a pytest suite.
~90 lines of app code, 23 tests, 100% coverage, two-tier test split (fast <1s, full ~90s).
The point was to learn what AI testing actually looks like compared to UI/API testing.
Repo: https://github.com/sbezjak/llm-api-testing
One honest thing up front. The suite worked first try. That made it harder to learn from, not easier — when nothing breaks, you don't have to understand it. I spent more time reading the code than I would have spent writing it.
Process timeline
1. Read every line before running anything. Docs, code, tests, setup. I wanted the big picture — classes, endpoints, test structure in my head before I touched anything.
2. Ask questions instead of copy-pasting. It's easy to create something that passes. It's harder to understand why it does. I spent 2 hours just discussing the project with the model. Questions like: Why 70% and not 100%? What does ASGITransport actually do? Why does ConnectError map to 503 and HTTP errors to 502? Why mock at all with respx? What's xfail and why is it used like this? What's temperature?
3. Ran it. All passed. But "10 passed in 99s" wasn't enough. I wanted to see which tests hit the model, how long each took, what the model actually answered. So I added structured logging:
POST /ask verdict=allowed status=200 elapsed=0.42s answer='Paris.'
And a pytest-html report with per-test captured logs. Now every test run is a document I can read.
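A log line like the one above can come from a small formatter; a minimal sketch (the field names mirror the example line, the function name and truncation length are mine):

```python
def format_request_log(method: str, path: str, verdict: str,
                       status: int, elapsed: float, answer: str) -> str:
    # Build one structured log line: key=value pairs, answer quoted
    # and truncated so long generations don't flood the report.
    short = answer if len(answer) <= 80 else answer[:77] + "..."
    return (f"{method} {path} verdict={verdict} status={status} "
            f"elapsed={elapsed:.2f}s answer={short!r}")

# format_request_log("POST", "/ask", "allowed", 200, 0.42, "Paris.")
# → POST /ask verdict=allowed status=200 elapsed=0.42s answer='Paris.'
```

Because each field is a key=value pair, the lines stay grep-able and easy to diff across test runs.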
4. Iterate with the model. Added logs, reports, comments. Asked about code I didn't understand — why something was there, what a piece did. This is where the differences between UI and AI testing started to click. Probabilistic vs deterministic. The 70% Paris case.
5. Make it production-ish. Asked how a real team would harden this. Mocking Ollama and 100% coverage were added in this step.
The thing that actually clicked — probabilistic vs deterministic
The consistency test sends "What is the capital of France?" ten times and asserts ≥70% of answers contain "paris".
answers = [await ask(prompt) for _ in range(10)]
hits = sum(1 for a in answers if "paris" in a.lower())
assert hits / len(answers) >= 0.7
In UI testing, the same input produces the same output. You assert on exact values: assert button.opens_modal() == True.
LLMs don't work like that. Same prompt, different valid answers every call — "Paris.", "The capital is Paris.", a paragraph about French geography. The model samples from a distribution. There is no single right string.
So you assert on properties of the distribution, or on the envelope of acceptable answers. assert ≥70% of answers contain "paris". 70% is arbitrary — high enough to catch regressions, low enough to tolerate the model's variance. In a real system you'd tune per prompt.
Point vs region. Four years of UI-testing instincts took a while to shift.
The three bugs and what they taught me
Bug 1 — latency test failing at 35s.
First thought: my M1 is slow. Then I ran ollama run llama3.2 "say hi" directly in the terminal — instant. So the model was fine.
llama3.2 is chatty. The prompt "string" produced an essay on null-termination and Unicode. The 35 seconds was generation time, not system latency.
Fix: "options": {"num_predict": 200} to cap output tokens. Warm requests dropped to 1-3 seconds.
Lesson: traditional APIs return what you ask for. LLMs return what they feel like returning. Latency tests measure output length unless you constrain it.
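The fix is one field in the Ollama generate payload; a sketch of the request body (num_predict and the 200 cap are from the fix above, the builder function is mine):

```python
def build_generate_payload(prompt: str, max_tokens: int = 200) -> dict:
    # num_predict caps the number of generated tokens, so the latency
    # test measures system latency rather than the model's verbosity.
    return {
        "model": "llama3.2",
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": max_tokens},
    }

# POSTed to http://localhost:11434/api/generate
```

With the cap in place, a latency assertion bounds the system, not the essay the model felt like writing.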
Bug 2 — coverage stuck at 85%.
Cause: no test exercised Ollama failure paths.
Fix: three mocked tests with respx — unreachable → 503, Ollama 5xx → 502, empty response → 502. Coverage hit 100%. New tests run in <50ms each because no real model is involved.
Lesson: check coverage reports. Gaps usually point at untested failure modes, not untested happy paths.
Bug 3 — moderation filter false positives.
The moderation filter is a substring blocklist — a Python list of phrases like "how to kill", "how to hack", etc. Any question containing one gets refused with a 400. Simple: "how to kill a process on linux" contains "how to kill", so a normal dev question gets blocked.
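The whole filter reduces to a few lines; a sketch (the two phrases are from the article, the rest of the real blocklist is omitted):

```python
BLOCKLIST = ["how to kill", "how to hack"]  # excerpt; the real list has more

def is_blocked(question: str) -> bool:
    # Naive substring match: no notion of intent, so any question
    # containing a blocked phrase gets refused with a 400.
    q = question.lower()
    return any(phrase in q for phrase in BLOCKLIST)

assert is_blocked("how to kill a process on linux")      # the false positive
assert not is_blocked("what is the capital of france?")  # allowed through
```

The false positive is baked into the approach: substrings can't tell a sysadmin question from a threat.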
Fix: added the false positive to the benign dataset with pytest.mark.xfail and a written reason. The test now runs, fails as expected, and shows as a yellow dot in the report instead of red. Documented in the suite itself.
It flips to green the day the substring is replaced with a real classifier — a model that understands intent ("is this user actually trying to cause harm?") instead of just matching strings. That could be a small fine-tuned model, an open-source moderation model like Llama Guard, or a commercial moderation API. The upgrade closes the false-positive gap, the test starts passing, and xfail(strict=False) signals "unexpectedly passed" — the cue to remove the marker.
Lesson: xfail makes the suite record what's broken, not just what works. I'd only used xfail for flaky tests before, not as living documentation of known bugs. Much better than hiding a bug in a backlog ticket.
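The pattern looks roughly like this; a sketch with an abbreviated dataset and reason string, and the substring filter restated inline so the example is self-contained:

```python
import pytest

BLOCKLIST = ["how to kill", "how to hack"]

def is_blocked(question: str) -> bool:
    return any(phrase in question.lower() for phrase in BLOCKLIST)

BENIGN_QUESTIONS = [
    "what is the capital of france?",
    "how do i reverse a list in python?",
    pytest.param(
        "how to kill a process on linux",
        marks=pytest.mark.xfail(
            reason="substring blocklist false positive; "
                   "flips to XPASS once a real intent classifier lands",
            strict=False,
        ),
    ),
]

@pytest.mark.parametrize("question", BENIGN_QUESTIONS)
def test_benign_question_is_allowed(question):
    # The known bug lives in the suite as a yellow xfail, not a red failure.
    assert not is_blocked(question)
```

With strict=False, the day the filter improves this case reports XPASS ("unexpectedly passed"), which is the cue to delete the marker.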
What I still don't fully understand
- The ASGI internals ASGITransport relies on. I know what it does, not what's happening inside.
- When respx is the right call vs building a proper fake.
- Embedding similarity math beyond "cosine measures angle."
- What a real production eval harness looks like.
From a QA perspective
Most UI-testing instincts didn't transfer. Equality assertions, fixed latency thresholds, asserting a single correct outcome — all had to shift.
What did transfer: discipline around edge cases, thoughts about what happens when the upstream service dies, care about keeping the feedback loop fast, coverage reports.
Setting up a local model was new. Using it as a dependency in a test suite was new. Testing something that returns different valid outputs every call was new. If you're a QA engineer looking at this direction — the probability side is the new thing. The rest is still testing.
How to run it
# Install & start Ollama
brew install ollama
ollama serve # leave running in its own terminal
ollama pull llama3.2 # in another terminal
# Python env
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# Run the API
uvicorn app.main:app --reload --port 8000
# → http://localhost:8000/docs
# Tests
pytest -m "not ollama" # fast tier, no Ollama needed, ~1s
pytest # full suite with HTML reports
Conclusion
When you're testing robustness (did the system stay well-behaved?) instead of correctness (did the right thing happen?), you assert the shape of acceptable failure, not the shape of success. AI systems fail in more ways, so the distinction matters more — a 500 is always a bug; anything else might be correct behavior for an edge case.
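That distinction fits in one assertion; a sketch with an illustrative envelope for this service:

```python
# Envelope of well-behaved outcomes for /ask (illustrative set):
# 200 answered, 400 refused by moderation, 502/503 upstream trouble.
ACCEPTABLE_STATUSES = {200, 400, 502, 503}

def is_well_behaved(status: int) -> bool:
    # Robustness test: assert the shape of acceptable failure.
    # A 500 means an unhandled exception escaped -- always a bug.
    return status in ACCEPTABLE_STATUSES

assert is_well_behaved(503)      # degraded, but handled
assert not is_well_behaved(500)  # unhandled crash
```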
Repo: https://github.com/sbezjak/llm-api-testing
Next up — 5 more projects on the list: eval harness, RAG with observability, red-team suite, agent testing, model benchmarking. Writing each one up as I go.