DEV Community: Sara Bezjak

Five ways to test an LLM's answer and what each one misses

Sara Bezjak — Tue, 19 May 2026 09:25:24 +0000

I'm a regular automation engineer. My usual job is checking that an app does the same thing every time. AI testing is the opposite: the same question can give a different answer each run.

A learning project, written up for anyone trying to get into AI testing.

Repo: https://github.com/sbezjak/llm-eval-harness

What I built

A pytest project. 10 questions, a hand-written expected answer for each, a local model (llama3.2) answering the questions, and the model's responses saved to a file. Then I read every response myself and wrote PASS or FAIL like a human grader. The rest of the project is about getting code to agree with that human verdict.

I built five scorers and ran all five against the saved responses:

Scorer	What it checks
Exact match	`output == expected`
BLEU	shared word sequences (from machine translation)
ROUGE	overlap on longest common subsequence (from summarization)
Semantic similarity	angle between sentence embeddings (does the meaning roughly match)
LLM-as-judge	second model call with a correctness + relevance rubric

None of them was right on its own. The interesting bit is how each one was wrong.

The main finding: the judge passes its own hallucinations, deterministically

LLM-as-judge means a second LLM grades the first one's answer against a rubric. It's the only scorer here that can actually read meaning, which is also why it fails in ways the others don't.

The only wrong answer in my set was on a pytest question. The model invented a command-line flag (--junit-xml-filter). I had set up the LLM judge specifically to catch this kind of factual error.

The judge gave it correctness 8/10, relevance 6/10. Combined: 0.700. The pass threshold I had set: 0.700. So it passed, exactly on the line.
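To make the arithmetic concrete, here is a minimal sketch. It assumes the combined score is the plain mean of the two rubric scores on a 0..1 scale and that the threshold check is inclusive; both assumptions match the numbers above, but the function name and constants are illustrative rather than taken from the repo.

```python
def combined_score(correctness: int, relevance: int, scale: int = 10) -> float:
    """Mean of two rubric scores, normalised to the 0..1 range (assumed formula)."""
    return (correctness + relevance) / (2 * scale)


PASS_THRESHOLD = 0.7  # assumed to be an inclusive (>=) threshold

score = combined_score(correctness=8, relevance=6)
passed = score >= PASS_THRESHOLD
print(f"score={score:.3f} passed={passed}")  # score=0.700 passed=True
```

With an inclusive check, a score that lands exactly on the threshold passes. One point less on either rubric item would have failed the row.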

LLM outputs aren't deterministic, so I expected the score to wobble around 0.7 and the verdict to flip between runs. I ran the judge five times against the same frozen response:

run 1: score=0.700 passed=True
run 2: score=0.700 passed=True
run 3: score=0.700 passed=True
run 4: score=0.700 passed=True
run 5: score=0.700 passed=True

Identical every run. The judge isn't changing its mind, it's stuck on the threshold. Worse than a flaky test: a flaky test eventually flips red and someone investigates. A deterministic wrong-pass looks green in CI and ships the bug.
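A variance check like this can be sketched in a few lines. The judge here is a stand-in function that always returns 0.7, not the real model call, and the helper name is hypothetical.

```python
import statistics
from typing import Callable


def judge_spread(
    judge: Callable[[str], float], response: str, runs: int = 5
) -> tuple[list[float], float]:
    """Score the same frozen response several times and report the spread."""
    scores = [judge(response) for _ in range(runs)]
    return scores, statistics.pstdev(scores)


def stuck_judge(response: str) -> float:
    # Stand-in for the real judge call; it returns the same score every time.
    return 0.7


scores, spread = judge_spread(stuck_judge, "frozen response text")
print(scores, spread)
```

A spread of zero on a score that sits exactly on the threshold is the signal to look at: repeating the run will never change the verdict.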

The mechanism is self-grading bias: the judge is the same llama3.2 that wrote the bad answer, so the hallucinated flag doesn't look wrong to either of them. Averaging more runs helps when a score is noisy. It does nothing here. The fix is a different, stronger judge model.

The other story: 4 of 5 scorers reject a correct answer because of its shape

Question: "How many planets are in our solar system?" Expected: "8". The model returned a bulleted list of all eight planets with their names, plus a section about Pluto. A human reads that and says PASS.

Scorer	Verdict	Score
Exact match	FAIL	n/a
BLEU	FAIL	~0
ROUGE	FAIL	~0
Semantic similarity	FAIL	0.194
LLM-as-judge	PASS	1.000

You think you are testing whether the model got the answer right. You are actually testing whether your scorer can recognize the right answer when the model gives it in a different shape than the reference. Four out of five could not.

A note on how semantic similarity works: each sentence becomes a vector (a list of numbers encoding meaning), and the score is the cosine of the angle between two vectors. Small angle, score near 1, similar meaning. But "similar meaning" isn't "correct". A wrong answer about planets sits in the same semantic neighborhood as a right one.
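For reference, the cosine calculation itself is small. This is a sketch on made-up two-dimensional vectors; real sentence embeddings come from an embedding model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, chosen only to show the range of the score.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1.0
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])    # 0.0
print(same_direction, perpendicular)
```

The score only says how closely two vectors point in the same direction. Nothing in the formula knows which of the two texts is factually right.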

The naive fix is to lower the cosine threshold until the planets row passes. It does not work. The lowest right-answer score and the only wrong-answer score in the set sit 0.004 apart. Any threshold that admits the right one also admits the wrong one. Semantic similarity is measuring textual proximity, not correctness.

A closer look at BLEU and ROUGE

Two of the four scorers that failed the planets case were BLEU and ROUGE. It's worth slowing down on these, because the usual one-line explanation ("BLEU and ROUGE are bad at prose") turned out to be the wrong story.

Both metrics measure word overlap. They look at the model's answer, look at your reference answer, and count how many words or word sequences appear in both. More overlap, higher score.

BLEU (from machine translation, 2002) counts shared runs of words. If the reference is "the cat sat on the mat" and the model says "the cat sat on a mat," BLEU sees five shared single words and three shared two-word sequences ("the cat", "cat sat", "sat on") and gives a high score.
ROUGE (from summarization, 2004) counts the longest sequence of words that appears in both texts in the same order, even if other words are sprinkled between them. Same idea, slightly different bookkeeping.
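The counting behind both descriptions can be sketched directly. This is a simplified illustration of the building blocks (clipped n-gram matches for BLEU, longest common subsequence for ROUGE-L), not a full implementation of either metric; the function names are my own.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_overlap(reference: str, candidate: str, n: int) -> int:
    """Count candidate n-grams that also occur in the reference (BLEU-style clipping)."""
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    return sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())


def lcs_length(reference: str, candidate: str) -> int:
    """Length of the longest common subsequence of words (the core of ROUGE-L)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    prev = [0] * (len(cand) + 1)
    for r in ref:
        curr = [0]
        for j, c in enumerate(cand, start=1):
            curr.append(prev[j - 1] + 1 if r == c else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


reference = "the cat sat on the mat"
candidate = "the cat sat on a mat"
print(clipped_overlap(reference, candidate, 1))  # 5 shared single words
print(clipped_overlap(reference, candidate, 2))  # 3 shared two-word sequences
print(lcs_length(reference, candidate))          # 5: "the cat sat on mat"
```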

The important part is the denominator. Both metrics divide "shared words" by "how long the texts are." That ratio is what breaks on the planets row.

Reference: "8". One token. The model's answer: a paragraph naming all eight planets. The word "8" appears in the paragraph, so the numerator is 1. The denominator is "length of the model's answer," around 30 words. The score is 1/30, basically zero. BLEU also needs the texts to share two-word and three-word sequences, and the reference has none of those (it only has one word), so part of BLEU's math is forced to zero before anything else happens. Final score: zero. Same answer, different reference, completely different result. If the reference had been "There are 8 planets in our solar system," BLEU and ROUGE would both score the same model answer highly, because now there are sequences to overlap with.

So the rule isn't "BLEU and ROUGE are bad at prose." They were built for prose. The rule is: they only work when the reference and the model's answer are similar in shape and length. Short reference plus long answer collapses the score. Long reference plus short answer collapses it too.

This is what the xfail tests surfaced. I had marked the BLEU and ROUGE rows as "expected to fail on prose," and five of them passed unexpectedly. The ones that passed were the rows where the reference happened to be a full sentence, not a single token. The shape matched, the score worked, the test that was "supposed" to fail didn't. That mismatch is what pushed the finding from "BLEU is bad" to "BLEU needs matching reference shape."

The practical version: if you want a meaningful BLEU or ROUGE score, write reference answers that look roughly like the outputs you expect. A one-word gold answer is fine for exact match but wastes these metrics. For short answers, use exact match or an LLM judge instead. Production setups also support multiple reference answers per question and take the best match, which is another way to cover the shape problem.
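A rough sketch of the shape problem and the multiple-reference workaround, using plain unigram precision as a stand-in for the real metrics. The model answer below is a hypothetical shortened one, not the actual response from my run, and the helper names are illustrative.

```python
from collections import Counter


def unigram_precision(reference: str, candidate: str) -> float:
    """Share of candidate words that also appear in the reference (clipped)."""
    remaining = Counter(reference.lower().split())
    cand_tokens = candidate.lower().split()
    matches = 0
    for token in cand_tokens:
        if remaining[token] > 0:
            matches += 1
            remaining[token] -= 1
    return matches / len(cand_tokens) if cand_tokens else 0.0


def best_reference_score(references: list[str], candidate: str) -> float:
    """Score against every reference and keep the best match."""
    return max(unigram_precision(ref, candidate) for ref in references)


# Hypothetical answer: 16 words, one of which is "8".
answer = (
    "there are 8 planets in our solar system "
    "mercury venus earth mars jupiter saturn uranus neptune"
)
short_ref = "8"
sentence_ref = "There are 8 planets in our solar system"

print(unigram_precision(short_ref, answer))                    # 0.0625
print(unigram_precision(sentence_ref, answer))                 # 0.5
print(best_reference_score([short_ref, sentence_ref], answer)) # 0.5
```

Same answer, same metric; only the shape of the reference changed.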

Smaller findings, briefly

A bias-swap test caught one drifting pair. Same prompt, only the name changes (David vs Priya). Three of four pairs gave similar responses. One question about career advice drifted noticeably. One drifting pair isn't proof of bias, but it's the kind of drift a real bias suite would flag for review.
Length bias, null result here. LLM judges often score longer answers higher. I expected this and tested three short-vs-long pairs of correct answers. The judge gave both the same score every time. Not proof there's no bias, just no bias on this model and rubric.
Trust the judge's score, not its reasoning. The judge sometimes wrote explanations that contradicted the number it gave. The number was closer to right. Treat the prose as a debugging hint, not evidence.
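The bias-swap idea from the first item can be sketched as follows. Everything here is an assumption for illustration: the prompt template, the stub model, and the use of difflib's character-level ratio as the drift measure. A real suite would more likely compare responses with embeddings or a judge.

```python
from difflib import SequenceMatcher
from typing import Callable

TEMPLATE = "{name} asks: what should I consider before changing careers?"  # hypothetical prompt


def drift(response_a: str, response_b: str) -> float:
    """0.0 means identical text, 1.0 means nothing in common (character-level)."""
    return 1.0 - SequenceMatcher(None, response_a, response_b).ratio()


def swap_drift(model: Callable[[str], str], name_a: str, name_b: str) -> float:
    """Ask the same question with only the name changed and measure the difference."""
    return drift(model(TEMPLATE.format(name=name_a)), model(TEMPLATE.format(name=name_b)))


def stub_model(prompt: str) -> str:
    # Stand-in for the real model call: ignores the name entirely.
    return "Consider your savings, your skills, and the job market."


print(swap_drift(stub_model, "David", "Priya"))  # 0.0 for this stub
```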

The thing I'd take back into a Playwright suite tomorrow

pytest.mark.xfail(strict=True) with a reason. The test is supposed to fail for a written-down reason, and if it ever starts passing, the build breaks on purpose so somebody investigates. I marked every "scorer disagrees with the human" case that way. The test file became the project's spec for what each scorer is known to get wrong.
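A minimal sketch of the pattern, using the marker form pytest.mark.xfail, which is the form that accepts strict and reason together. The scorer and the test name are illustrative, not copied from the repo.

```python
import pytest


def exact_match(output: str, expected: str) -> bool:
    """Toy exact-match scorer: whitespace-trimmed string equality."""
    return output.strip() == expected.strip()


@pytest.mark.xfail(
    strict=True,
    reason="exact match rejects a correct answer that arrives as a sentence instead of '8'",
)
def test_exact_match_accepts_planet_sentence():
    # A human would pass this answer; the scorer is known not to.
    output = "There are 8 planets in our solar system."
    assert exact_match(output, "8")
```

With strict=True, a run where this test unexpectedly passes is reported as a failure, so a change in scorer behavior cannot slip through silently.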

It paid for itself twice. I expected the judge-variance test to show noise; stdev came back 0.000, which surfaced the "stuck on the threshold" finding. I expected BLEU and ROUGE to fail on prose like exact match does; five XPASSes forced the reference-shape finding instead. Both times the suite caught me being wrong before I published.

This is not AI-specific. It works on any flaky integration where the failure mode is understood.

A note on the numbers

The set is 10 items. That is too small to calibrate a real application, but the patterns are reproducible. Production calibration uses hundreds of items with multiple human raters. This project is an introduction to eval harnesses - the same patterns scale up.

How to run it

brew install ollama
ollama serve
ollama pull llama3.2

uv sync
uv run pytest -m "not ollama"   # fast tier, mocked, ~10s
uv run pytest                   # full suite, ~7 min

Conclusion

When the answer is a paragraph and not a value, no single scorer is enough. You run a panel of imperfect scorers, write down where each one is wrong, and let the disagreements be the actual test.

Repo: https://github.com/sbezjak/llm-eval-harness

Project 1 of a five-project series on testing AI systems. Project 2 is retrieval-augmented generation.

A QA engineer's first AI testing project - FastAPI + local LLM + pytest

Sara Bezjak — Fri, 24 Apr 2026 11:15:15 +0000

I'm an automation engineer who writes mostly UI tests with some API testing sprinkled in. A recruiter wrote to me about an interesting job - AI/LLM testing. I was curious to learn more so I asked the model itself: what skills do I need to learn? The answer was this project.

What is it

A FastAPI service with one endpoint (/ask) that forwards a question to a local LLM (Ollama running llama3.2) and returns the answer. Plus a pytest suite.

~90 lines of app code, 23 tests, 100% coverage, two-tier test split (fast <1s, full ~90s).

The point was to learn what AI testing actually looks like compared to UI/API testing.

Repo: https://github.com/sbezjak/llm-api-testing

One honest thing up front. The suite worked first try. That made it harder to learn from, not easier - when nothing breaks, you don't have to understand it. I spent more time reading the code than I would have spent writing it.

Process timeline

1. Read every line before running anything. Docs, code, tests, setup. I wanted the big picture - classes, endpoints, test structure in my head before I touched anything.

2. Ask questions instead of copy-pasting. It's easy to create something that passes. It's harder to understand why it does. I spent 2 hours just discussing the project with the model. Questions like: Why 70% and not 100%? What does ASGITransport actually do? Why does ConnectError map to 503 and HTTP errors to 502? Why mock at all with respx? What's xfail and why is it used like this? What's temperature?

3. Ran it. All passed. But "10 passed in 99s" wasn't enough. I wanted to see which tests hit the model, how long each took, what the model actually answered. So I added structured logging:

POST /ask verdict=allowed status=200 elapsed=0.42s answer='Paris.'

And a pytest-html report with per-test captured logs. Now every test run is a document I can read.
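A log line in that shape can be produced with the standard logging module. This is a sketch; the logger name and helper function are assumptions, not the code in the repo.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("ask")  # hypothetical logger name


def format_ask_log(verdict: str, status: int, elapsed: float, answer: str) -> str:
    """Build one key=value line per request so it is easy to grep and to read in reports."""
    return (
        f"POST /ask verdict={verdict} status={status} "
        f"elapsed={elapsed:.2f}s answer={answer!r}"
    )


line = format_ask_log("allowed", 200, 0.42, "Paris.")
logger.info(line)
```

pytest captures log output per test, which is what lets an HTML report show these lines next to each result.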

4. Iterate with the model. Added logs, reports, comments. Asked about code I didn't understand - why something was there, what a piece did. This is where the differences between UI and AI testing started to click. Probabilistic vs deterministic. The 70% Paris case.

5. Make it production-ish. Asked how a real team would harden this. Mocking Ollama and 100% coverage were added in this step.

The thing that actually clicked - probabilistic vs deterministic

The consistency test sends "What is the capital of France?" ten times and asserts ≥70% of answers contain "paris".

answers = [await ask(prompt) for _ in range(10)]
hits = sum(1 for a in answers if "paris" in a.lower())
assert hits / len(answers) >= 0.7

In UI testing, same input produces the same output. You assert on exact values. assert button.opens_modal() == True.

LLMs don't work like that. Same prompt, different valid answers every call - "Paris.", "The capital is Paris.", a paragraph about French geography. The model samples from a distribution. There is no single right string.

So you assert on properties of the distribution, or on the envelope of acceptable answers. assert ≥70% of answers contain "paris". 70% is arbitrary - high enough to catch regressions, low enough to tolerate the model's variance. In a real system you'd tune per prompt.
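Here is a self-contained version of the same idea that runs without a model. The ask function is a stub that cycles through canned answers, one of them wrong on purpose; in the real suite it would call the /ask endpoint.

```python
import asyncio
import itertools

# Canned answers standing in for a sampling model; every fourth one is wrong.
_CANNED = itertools.cycle(
    ["Paris.", "The capital is Paris.", "France's capital city is Paris.", "It is Lyon."]
)


async def ask(prompt: str) -> str:
    # Stub: the real helper would send the prompt to the API.
    return next(_CANNED)


async def hit_rate(prompt: str, keyword: str, runs: int = 10) -> float:
    """Fraction of answers that contain the keyword, case-insensitively."""
    answers = [await ask(prompt) for _ in range(runs)]
    hits = sum(keyword in a.lower() for a in answers)
    return hits / len(answers)


rate = asyncio.run(hit_rate("What is the capital of France?", "paris"))
print(rate)  # 0.8 with the canned answers above
assert rate >= 0.7
```

The assertion tolerates two wrong answers out of ten and fails on the fourth, which is the "region, not point" idea in code.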

Point vs region. Four years of UI-testing instincts took a while to shift.

The three bugs and what they taught me

Bug 1 - latency test failing at 35s.

First thought: my M1 is slow. Then I ran ollama run llama3.2 "say hi" directly in the terminal - instant. So the model was fine.

llama3.2 is chatty. Asking "string" produced an essay on null-termination and Unicode. The 35 seconds was generation time, not system latency.

Fix: "options": {"num_predict": 200} to cap output tokens. Warm requests dropped to 1-3 seconds.

Lesson: traditional APIs return what you ask for. LLMs return what they feel like returning. Latency tests measure output length unless you constrain it.
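For context, this is roughly what a capped request body looks like. The sketch assumes Ollama's /api/generate endpoint with a non-streaming response; the helper name is made up, and the app in the repo may build its request differently.

```python
import json


def build_generate_payload(prompt: str, max_tokens: int = 200, model: str = "llama3.2") -> dict:
    """Request body for Ollama's generate endpoint with an output-token cap."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                          # one JSON response instead of a stream
        "options": {"num_predict": max_tokens},   # upper bound on generated tokens
    }


payload = build_generate_payload("string")
print(json.dumps(payload, indent=2))
```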

Bug 2 - coverage stuck at 85%.

Cause: no test exercised Ollama failure paths.

Fix: three mocked tests with respx — unreachable → 503, Ollama 5xx → 502, empty response → 502. Coverage hit 100%. New tests run in <50ms each because no real model is involved.

Lesson: check coverage reports. Gaps usually point at untested failure modes, not untested happy paths.
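The three failure paths can be sketched without any HTTP library. The exception classes below are stand-ins for what the HTTP client raises (for example httpx.ConnectError and httpx.HTTPStatusError), and the function is an illustration of the mapping, not the app's actual handler.

```python
from typing import Callable


class UpstreamUnreachable(Exception):
    """Stand-in for a connection error: the model server is not running."""


class UpstreamHTTPError(Exception):
    """Stand-in for an HTTP error status returned by the model server."""


def answer_or_status(call_model: Callable[[], str]) -> tuple[int, str]:
    """Map upstream outcomes to the status code the API should return."""
    try:
        answer = call_model()
    except UpstreamUnreachable:
        return 503, "model backend unreachable"
    except UpstreamHTTPError:
        return 502, "model backend returned an error"
    if not answer.strip():
        return 502, "model backend returned an empty response"
    return 200, answer


print(answer_or_status(lambda: "Paris."))  # (200, 'Paris.')
print(answer_or_status(lambda: "   "))     # (502, 'model backend returned an empty response')
```

Each branch is one mocked test: force the outcome, assert the status code.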

Bug 3 - moderation filter false positives.

The moderation filter is a substring blocklist - a Python list of phrases like "how to kill", "how to hack", etc. Any question containing one gets refused with a 400. Simple: "how to kill a process on linux" contains "how to kill", so a normal dev question gets blocked.
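The false positive is easy to reproduce in isolation. This sketch uses the two phrases quoted above; the real list is longer, and the function name is illustrative.

```python
BLOCKLIST = ["how to kill", "how to hack"]  # the two phrases quoted above; the real list is longer


def is_blocked(question: str) -> bool:
    """Substring match against the blocklist, case-insensitively. No notion of intent."""
    lowered = question.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)


print(is_blocked("How to kill a process on Linux"))   # True: the false positive
print(is_blocked("What is the capital of France?"))   # False
```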

Fix: added the false positive to the benign dataset with pytest.mark.xfail and a written reason. The test now runs, fails as expected, and shows as a yellow dot in the report instead of red. Documented in the suite itself.

It flips to green the day the substring blocklist is replaced with a real classifier - a model that understands intent ("is this user actually trying to cause harm?") instead of just matching strings. That could be a small fine-tuned model, an open-source moderation model like Llama Guard, or a commercial moderation API. The upgrade closes the false-positive gap, the test starts passing, and the non-strict marker (strict=False) reports it as XPASS, "unexpectedly passed", without breaking the build - the cue to remove the marker.

Lesson: xfail makes the suite record what's broken, not just what works. I'd only used xfail for flaky tests before, not as living documentation of known bugs. Much better than hiding a bug in a backlog ticket.

What I still don't fully understand

The ASGI internals ASGITransport relies on. I know what it does, not what's happening inside.
When respx is the right call vs building a proper fake.
Embedding similarity math beyond "cosine measures angle."
What a real production eval harness looks like.

From a QA perspective

Most UI-testing instincts didn't transfer. Equality assertions, fixed latency thresholds, asserting a single correct outcome - all had to shift.

What did transfer: discipline around edge cases, thoughts about what happens when the upstream service dies, care about keeping the feedback loop fast, coverage reports.

Setting up a local model was new. Using it as a dependency in a test suite was new. Testing something that returns different valid outputs every call was new. If you're a QA engineer looking at this direction - the probability side is the new thing. The rest is still testing.

How to run it

# Install & start Ollama
brew install ollama
ollama serve             # leave running in its own terminal
ollama pull llama3.2     # in another terminal

# Python env
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Run the API
uvicorn app.main:app --reload --port 8000
# → http://localhost:8000/docs

# Tests
pytest -m "not ollama"   # fast tier, no Ollama needed, ~1s
pytest                   # full suite with HTML reports

Conclusion

When you're testing robustness (did the system stay well-behaved?) instead of correctness (did the right thing happen?), you assert the shape of acceptable failure, not the shape of success. AI systems fail in more ways, so the distinction matters more - a 500 is always a bug; anything else might be correct behavior for an edge case.

Repo: https://github.com/sbezjak/llm-api-testing

Next up - 5 more projects on the list: eval harness, RAG with observability, red-team suite, agent testing, model benchmarking. Writing each one up as I go.

AI Tools for Existing Playwright + Pytest Frameworks: What Actually Works

Sara Bezjak — Thu, 26 Mar 2026 16:02:24 +0000

Purpose

Research and evaluate AI-powered tools and workflows to improve test automation efficiency, specifically for test creation speed and reducing maintenance time when UI or business flows change. Focus on tools compatible with an existing Playwright + pytest (Python) stack and IntelliJ IDE.

Current Workflow & Pain Points

The two primary pain points in test automation are:

Creating new tests: Requires manually assembling context (page objects, fixture patterns, example tests) and writing tests that match existing conventions. The copy-paste workflow works but is slow and repetitive.

Updating tests when UI or flows change: When the product changes, tests break. Diagnosing which tests are affected, understanding what changed, and fixing them to match the new behavior consumes significant time.

Tools Evaluated

Claude Code (Anthropic) — Recommended

Claude Code is a terminal-based AI coding assistant that works with your entire codebase as context. It integrates with IntelliJ via a plugin (currently in beta) and can read, generate, and modify files directly in the project.

Key advantages:

Works in IntelliJ via plugin or integrated terminal. No IDE switch required.
Reads the full repository (page objects, fixtures, test files), so generated code matches existing patterns and conventions.
Supports a CLAUDE.md configuration file in the project root which contains definitions of framework conventions, naming patterns, fixture usage, and domain context. This ensures output is framework-specific and not generic.
Suggests changes via IntelliJ's native diff viewer, making review and approval straightforward.
Shares IDE diagnostics (lint errors, syntax issues) automatically.
Available on Pro plan ($20/month), which is sufficient for regular usage.

Used for: generating a test for the change-billing flow with full project context. The output followed existing page object patterns, used the correct fixtures, and required minimal manual adjustment.

Playwright MCP (Model Context Protocol)

Playwright MCP is a server that gives AI tools live browser access. Instead of manually inspecting the DOM for selectors or using codegen tools, Claude Code can navigate the application, interact with elements, and read the actual page structure.

Useful for: Discovering selectors on new or changed pages without manually opening DevTools / Codegen. Especially valuable when new UI elements are added as part of feature changes. Requires guidance on which flow to walk through (natural language instructions).

Playwright Agents (Planner / Generator / Healer) — Not Compatible Yet

Playwright v1.56 introduced three AI agents that can generate test plans, create test code, and automatically fix broken tests. The Healer agent is particularly interesting for maintenance. It replays failing tests, inspects the live UI, and patches selectors or waits.

However, these agents currently only support TypeScript/JavaScript. There is an open feature request for Python support but no timeline.

Cursor — Viable Alternative

Cursor is an AI-powered IDE (VS Code-based) that provides full codebase context and inline AI editing. Comparable to Claude Code in capabilities for test generation.

Disadvantage: Requires switching from IntelliJ to a VS Code-based editor, which means losing existing IDE configuration, shortcuts, and debugging setup. The functionality overlap with Claude Code did not justify the migration cost.

Platform-Based Tools (Testim, Mabl, Katalon, ContextQA)

These are full test automation platforms with AI features including self-healing selectors, test generation from natural language, and visual test builders.

Not recommended because:

They require adopting their platform and abandoning your existing framework.
Generated test code is generic and does not match existing page object structure, fixture patterns, or naming conventions.
You lose domain-specific knowledge already embedded in your current test suite.
Migrating away from a platform later is expensive.

Qase Aiden

Evaluated previously, including a live demo call. It generates test code, but the code is generic and does not adapt to codebase patterns. Same limitation as the platform tools above.

Implementation

Completed:

Installed Claude Code CLI
Set up Playwright MCP server for live browser access during test creation
Created CLAUDE.md in project root with framework conventions, project structure, page object patterns, fixture descriptions, test naming conventions, and domain context
Successfully generated a test using Claude Code with full project context — output matched existing framework patterns

Next steps:

Continue using Claude Code for upcoming test generation, and compare the results on simple vs complex tests
Use Claude Code for upcoming test maintenance and updates to measure time savings vs manual approach
Continue monitoring Playwright Agents for Python support
Research and write about the JavaScript Healer agent

Takeaway

After evaluating the available tools, the best results came from bringing AI into the existing codebase rather than switching to a new platform. The CLAUDE.md file made the biggest difference: once the framework conventions were clearly described, the generated code matched existing patterns consistently. There's a clear improvement in speed for both test creation and maintenance, but it still requires human guidance, architectural thinking, and review. It's a powerful assistant, not a replacement, but one wonders what else it will be capable of in the future.

I'm a solo QA automation engineer and founder based in Slovenia. I build test frameworks, evaluate tooling, and write about what actually works in QA. Find me on LinkedIn.