Five ways to test an LLM's answer and what each one misses

#ai #testing #python #qa

I'm a regular automation engineer. My usual job is checking that an app does the same thing every time. AI testing is the opposite: the same question can give a different answer each run.

This is a learning project, written up for anyone trying to get into AI testing.

Repo: https://github.com/sbezjak/llm-eval-harness

What I built

A pytest project. 10 questions, a hand-written expected answer for each, a local model (llama3.2) answering the questions, and the model's responses saved to a file. Then I read every response myself and wrote PASS or FAIL like a human grader. The rest of the project is about getting code to agree with that human verdict.

I built five scorers and ran all five against the saved responses:

Scorer	What it checks
Exact match	`output == expected`
BLEU	shared word sequences (from machine translation)
ROUGE	overlap on longest common subsequence (from summarization)
Semantic similarity	angle between sentence embeddings (does the meaning roughly match)
LLM-as-judge	second model call with a correctness + relevance rubric

None of them was right on its own. The interesting bit is how each one was wrong.

The main finding: the judge passes its own hallucinations, deterministically

LLM-as-judge means a second LLM grades the first one's answer against a rubric. It's the only scorer here that can actually read meaning, which is also why it fails in ways the others don't.

The only wrong answer in my set was on a pytest question. The model invented a command-line flag (`--junit-xml-filter`). I had set up the LLM judge specifically to catch this kind of factual error.

The judge gave it correctness 8/10 and relevance 6/10, which combine to 0.700. I had set the pass threshold at 0.700. So it passed, exactly on the line.
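The arithmetic behind that verdict, as a minimal sketch. It assumes the two rubric scores are averaged with equal weight and scaled to 0-1 (which is consistent with 8 and 6 giving 0.700) and that the pass check is inclusive. The function and constant names are made up for illustration and are not taken from the repo.

```python
PASS_THRESHOLD = 0.700  # threshold value from the write-up


def combined_score(correctness: int, relevance: int) -> float:
    """Average two 0-10 rubric scores and scale to 0-1 (assumed formula)."""
    return round((correctness + relevance) / 2 / 10, 3)


def passed(score: float, threshold: float = PASS_THRESHOLD) -> bool:
    # Inclusive comparison (assumed): a score exactly on the line passes.
    return score >= threshold


score = combined_score(correctness=8, relevance=6)
print(f"score={score:.3f} passed={passed(score)}")  # score=0.700 passed=True
```

Under these assumptions the verdict hinges on the comparison operator: with a strict `>` the same answer would have failed.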

LLM outputs aren't deterministic, so I expected the score to wobble around 0.7 and the verdict to flip between runs. I ran the judge five times against the same frozen response:

run 1: score=0.700 passed=True
run 2: score=0.700 passed=True
run 3: score=0.700 passed=True
run 4: score=0.700 passed=True
run 5: score=0.700 passed=True

Identical every run. The judge isn't changing its mind; it's stuck on the threshold. That is worse than a flaky test: a flaky test eventually flips red and someone investigates. A deterministic wrong-pass looks green in CI and ships the bug.
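A variance check along these lines could look roughly like the sketch below. `stub_judge` is a hypothetical stand-in for the real model call, hard-coded to return 0.700 so the example runs offline; the point is the shape of the check, not the project's actual code.

```python
from statistics import pstdev
from typing import Callable


def judge_spread(judge: Callable[[str], float], response: str, runs: int = 5) -> tuple[list[float], float]:
    """Score the same frozen response several times and report the spread."""
    scores = [judge(response) for _ in range(runs)]
    return scores, pstdev(scores)


def stub_judge(response: str) -> float:
    # Hypothetical stand-in for the LLM judge call.
    return 0.700


scores, spread = judge_spread(stub_judge, "frozen model response")
for i, s in enumerate(scores, start=1):
    print(f"run {i}: score={s:.3f}")
print(f"stdev={spread:.3f}")
```

A spread of zero only says the judge is consistent. It says nothing about whether the consistent verdict is right.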

The likely mechanism is self-grading bias: the judge is the same llama3.2 that wrote the bad answer, so the hallucinated flag doesn't look wrong to it in either role. Averaging more runs helps when a score is noisy. It does nothing here. The fix is a different, stronger judge model.

The other story: 4 of 5 scorers reject a correct answer because of its shape

Question: "How many planets are in our solar system?" Expected: "8". The model returned a bulleted list of all eight planets with their names, plus a section about Pluto. A human reads that and says PASS.

Scorer	Verdict	Score
Exact match	FAIL	n/a
BLEU	FAIL	~0
ROUGE	FAIL	~0
Semantic similarity	FAIL	0.194
LLM-as-judge	PASS	1.000

You think you are testing whether the model got the answer right. You are actually testing whether your scorer can recognize the right answer when the model gives it in a different shape than the reference. Four out of five could not.

A note on how semantic similarity works: each sentence becomes a vector (a list of numbers encoding meaning), and the score is the cosine of the angle between the two vectors. Small angle, score near 1, similar meaning. But "similar meaning" isn't "correct". A wrong answer about planets sits in the same semantic neighborhood as a right one.
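The scoring step itself is small. Here is a sketch in plain Python on toy three-dimensional vectors invented for illustration; real sentence embeddings come from an embedding model and have hundreds of dimensions, but the math is the same.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 same direction, 0.0 orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for the example (not real embeddings).
reference = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]
orthogonal = [0.0, 1.0, 0.0]

print(cosine_similarity(reference, same_direction))  # 1.0
print(cosine_similarity(reference, orthogonal))      # 0.0
```

Nothing in this calculation knows what the reference claims, only which direction its vector points in.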

The naive fix is to lower the cosine threshold until the planets row passes. It does not work. The lowest right-answer score and the only wrong-answer score in the set sit 0.004 apart, so no threshold separates them with any safety margin: one that admits the right answer effectively admits the wrong one too. Semantic similarity measures how close two texts are in topic, not whether one of them is correct.

A closer look at BLEU and ROUGE

Two of the four scorers that failed the planets case were BLEU and ROUGE. It's worth slowing down on these, because the usual one-line explanation ("BLEU and ROUGE are bad at prose") turned out to be the wrong story.

Both metrics measure word overlap. They look at the model's answer, look at your reference answer, and count how many words or word sequences appear in both. More overlap, higher score.

BLEU (from machine translation, 2002) counts shared runs of words. If the reference is "the cat sat on the mat" and the model says "the cat sat on a mat," BLEU sees five shared single words and three shared two-word sequences ("the cat", "cat sat", "sat on") and gives a high score.
ROUGE (from summarization, 2004), in the longest-common-subsequence variant used here (ROUGE-L), finds the longest sequence of words that appears in both texts in the same order, even if other words are sprinkled between them. Same idea, slightly different bookkeeping.
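The BLEU counts in the cat example can be reproduced with a few lines. This sketch covers only the n-gram counting step (with repeats clipped to the reference count), not the full metric, which also combines several n-gram sizes and applies a brevity penalty. The helper names are mine.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(reference: str, candidate: str, n: int) -> tuple[int, int]:
    """Return (matching n-grams, total candidate n-grams), clipping repeats to the reference count."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched, sum(cand.values())


reference = "the cat sat on the mat"
candidate = "the cat sat on a mat"

print(clipped_matches(reference, candidate, 1))  # (5, 6)
print(clipped_matches(reference, candidate, 2))  # (3, 5)
```

Five of six single words and three of five two-word sequences match, which is why this pair scores well.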

The important part is the denominator. Both metrics divide "shared words" by "how long the texts are." That ratio is what breaks on the planets row.

Reference: "8". One token. The model's answer: a long list naming all eight planets. The word "8" appears in the answer, so the numerator is 1. The denominator is the length of the model's answer, around 30 words. The score is 1/30, basically zero. Standard BLEU also multiplies in the overlap on two-, three-, and four-word sequences, and a one-word reference has none of those, so that part of BLEU's math is forced to zero before anything else happens. Final score: zero.

Same answer, different reference, completely different result. If the reference had been "There are 8 planets in our solar system," BLEU and ROUGE would both score the same model answer much higher, because now there are sequences to overlap with.
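To make the reference-shape effect concrete, here is a hand-rolled ROUGE-L F1 (a common formulation, not the project's scorer and not a library call). The model answer below is an invented stand-in of 26 words, not the real saved response.

```python
import re


def tokens(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    ref, cand = tokens(reference), tokens(candidate)
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented stand-in for the model's answer.
answer = (
    "There are 8 planets in our solar system: Mercury, Venus, Earth, Mars, "
    "Jupiter, Saturn, Uranus and Neptune. Pluto was reclassified as a dwarf planet in 2006."
)

print(f"{rouge_l_f1('8', answer):.3f}")  # 0.074
print(f"{rouge_l_f1('There are 8 planets in our solar system', answer):.3f}")  # 0.471
```

The answer never changes. Only the reference does, and the score moves by a factor of about six.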

So the rule isn't "BLEU and ROUGE are bad at prose." They were built for prose. The rule is: they only work when the reference and the model's answer are similar in shape and length. Short reference plus long answer collapses the score. Long reference plus short answer collapses it too.

This is what the xfail tests surfaced. I had marked the BLEU and ROUGE rows as "expected to fail on prose," and five of them passed unexpectedly. The ones that passed were the rows where the reference happened to be a full sentence, not a single token. The shape matched, the score worked, the test that was "supposed" to fail didn't. That mismatch is what pushed the finding from "BLEU is bad" to "BLEU needs matching reference shape."

The practical version: if you want a meaningful BLEU or ROUGE score, write reference answers that look roughly like the outputs you expect. A one-word gold answer is fine for exact match but wastes these metrics. For short answers, use exact match or an LLM judge instead. Production setups also support multiple reference answers per question and take the best match, which is another way to cover the shape problem.
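The multiple-reference idea is easy to sketch: score the output against each reference and keep the best. The `unigram_f1` scorer here is a toy stand-in for BLEU or ROUGE, and the references and candidate are invented for the example.

```python
from collections import Counter
from typing import Callable


def unigram_f1(reference: str, candidate: str) -> float:
    """Toy word-overlap scorer, a stand-in for BLEU/ROUGE."""
    ref, cand = Counter(reference.lower().split()), Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_reference_score(
    references: list[str],
    candidate: str,
    scorer: Callable[[str, str], float] = unigram_f1,
) -> float:
    """Score the candidate against every reference and keep the best match."""
    return max(scorer(ref, candidate) for ref in references)


references = ["8", "There are 8 planets in our solar system"]
candidate = "There are 8 planets in our solar system including Earth"

print(f"{best_reference_score(references[:1], candidate):.3f}")  # one-token reference only
print(f"{best_reference_score(references, candidate):.3f}")      # both reference shapes
```

Taking the maximum means a short-form and a long-form reference can sit side by side, and whichever shape the model happens to produce gets a fair comparison.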

Smaller findings, briefly

A bias-swap test caught one drifting pair. Same prompt, only the name changes (David vs Priya). Three of four pairs gave similar responses. One question about career advice drifted noticeably. One drifting pair isn't proof of bias, but it's the kind of drift a real bias suite would flag for review.
Length bias, null result here. LLM judges often score longer answers higher. I expected this and tested three short-vs-long pairs of correct answers. The judge gave both answers in each pair the same score every time. That is not proof there's no bias, just no measurable bias on three pairs with this model and rubric.
Trust the judge's score, not its reasoning. The judge sometimes wrote explanations that contradicted the number it gave. The number was closer to right. Treat the prose as a debugging hint, not evidence.

The thing I'd take back into a Playwright suite tomorrow

`@pytest.mark.xfail(strict=True, reason="...")`. The test is supposed to fail for a written-down reason, and if it ever starts passing, the build breaks on purpose so somebody investigates. I marked every "scorer disagrees with the human" case that way. The test file became the project's spec for what each scorer is known to get wrong.
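A minimal sketch of what one such test might look like, using pytest's `xfail` marker with `strict=True` and `reason`. The scorer function is a hypothetical stand-in hard-coded to its known verdict, and the test name and reason text are invented rather than copied from the repo.

```python
import pytest


def semantic_similarity_passes(expected: str, output: str) -> bool:
    # Hypothetical stand-in for the real scorer, hard-coded to its known verdict.
    return False


@pytest.mark.xfail(
    strict=True,
    reason="semantic similarity rejects a correct long-form answer to a one-token reference",
)
def test_semantic_similarity_on_planets_row():
    # The human verdict is PASS, so this assertion is expected to fail today.
    # If the scorer ever starts passing, strict=True reports the XPASS as a failure.
    assert semantic_similarity_passes("8", "a list naming all eight planets")
```

Run under pytest, this reports XFAIL and the build stays green. Strictness can also be switched on for a whole project with the `xfail_strict = true` ini option instead of repeating it per test.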

It paid for itself twice. I expected the judge-variance test to show noise; stdev came back 0.000, which surfaced the "stuck on the threshold" finding. I expected BLEU and ROUGE to fail on prose like exact match does; five XPASSes forced the reference-shape finding instead. Both times the suite caught me being wrong before I published.

This is not AI-specific. It works on any known-broken integration where the failure mode is understood.

A note on the numbers

The set is 10 items. That is too small to calibrate thresholds for a real application; production calibration uses hundreds of items and multiple human raters. The patterns, though, are reproducible. This project is an introduction to eval harnesses, and the same patterns scale up.

How to run it

brew install ollama
ollama serve
ollama pull llama3.2

uv sync
uv run pytest -m "not ollama"   # fast tier, mocked, ~10s
uv run pytest                   # full suite, ~7 min

Conclusion

When the answer is a paragraph and not a value, no single scorer is enough. You run a panel of imperfect scorers, write down where each one is wrong, and let the disagreements be the actual test.

Repo: https://github.com/sbezjak/llm-eval-harness

Project 1 of a five-project series on testing AI systems. Project 2 is retrieval-augmented generation.