How to Evaluate LLM Output Quality Programmatically

#ai #llm #python #tutorial

When you ship an LLM-powered feature, "does it work?" is not a binary question. An answer can be grammatically correct, topically on-point, factually wrong, and subtly biased — all at the same time. Without a systematic way to measure output quality, regressions silently creep in with every model update, prompt tweak, or temperature change. By the time users complain, you've already lost trust.

This article walks through a practical evaluation framework: what to measure, how to automate it, and how to integrate quality gates into your CI/CD pipeline.

Why Manual Testing Doesn't Scale

The typical early-stage approach is to write 10–20 example prompts, run them by hand, and eyeball the outputs. This works until you change the prompt — then you need to re-run everything. It breaks entirely when you swap models or add a retrieval layer.

The problems compound quickly:

Inconsistency: Different reviewers apply different quality bars. Even the same person reviews differently on Monday morning versus Friday afternoon.
Coverage: You test the happy path. Edge cases only surface in production, where they cause visible failures.
Speed: Manual review doesn't fit in a CI pipeline. You can't block a merge because someone needs to manually evaluate 40 outputs.

The solution is a repeatable test suite that runs automatically and produces a numeric score you can trend over time.

What to Actually Measure

Not every quality dimension matters equally. Start with the metrics that map directly to user pain:

Factual correctness — did the model get the facts right? This requires ground-truth answers and is the hardest to automate, but it's the most important for high-stakes applications.

Relevance — is the output actually responsive to the input? A perfectly written paragraph that ignores the actual question is useless, regardless of how fluent it sounds.

Format compliance — if you asked for JSON, did you get valid JSON? If you asked for a numbered list, did you get prose instead? Format failures break downstream parsing and are trivially easy to detect.

Verbosity — outputs that are 10x longer than needed cost tokens, increase latency, and frustrate users. Outputs that are too short are often missing substance.

Groundedness (for RAG pipelines) — does the answer reference only facts present in the retrieved context? Answers that go beyond the source documents are fabrications, even when they sound plausible.
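As a rough illustration of a groundedness check, the sketch below flags answer sentences whose content words mostly do not appear in the retrieved context. It is a crude lexical heuristic, not a real entailment check: paraphrases will be flagged as false positives, and a sentence that reuses context words to state something false will pass. The function names, the stopword list, and the 0.6 threshold are all assumptions made for this example.

```python
import re

# Small, hypothetical stopword list; extend it for real use.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "are", "was",
             "and", "or", "it", "for", "on"}

def content_words(text: str) -> set[str]:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def ungrounded_sentences(answer: str, context: str, min_overlap: float = 0.6) -> list[str]:
    """Return answer sentences whose content words are mostly absent from the context."""
    context_vocab = content_words(context)
    flagged = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sentence)
        if not words:
            continue
        overlap = len(words & context_vocab) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged
```

A check like this is best used to route suspicious answers to a stronger judge or a human reviewer rather than to fail a build on its own.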

For a first pass, format compliance and verbosity are the easiest to automate with zero ground-truth data. Start there before tackling the harder problems.
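Once you do have ground-truth answers, factual correctness for short answers can be approximated with normalized exact match and token-level F1, in the style commonly used for question-answering benchmarks. The sketch below is one possible version; the normalization rules (lowercasing, dropping punctuation and English articles) are assumptions that you should adapt to your domain, and these scores say little about long free-form answers.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def exact_match(prediction: str, reference: str) -> bool:
    """True when both strings are identical after normalization."""
    return normalize(prediction) == normalize(reference)

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Either function can be wrapped in a lambda and passed as a test case's validator, for example by requiring token_f1 to exceed a cutoff you choose.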

Building an Evaluation Harness in Python

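Here's a minimal but useful evaluation harness. It defines test cases with expected format and length constraints, runs them against your model, and computes a per-case score as the fraction of checks that passed. The type hints use the X | None syntax, so it needs Python 3.10 or newer: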

import json
import re
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TestCase:
    name: str
    prompt: str
    expected_format: str  # "json", "list", "prose"
    max_words: int = 300
    min_words: int = 10
    validator: Callable[[str], bool] | None = None

@dataclass
class EvalResult:
    case_name: str
    output: str
    format_ok: bool
    length_ok: bool
    custom_ok: bool | None
    score: float = field(init=False)

    def __post_init__(self):
        checks = [self.format_ok, self.length_ok]
        if self.custom_ok is not None:
            checks.append(self.custom_ok)
        self.score = sum(checks) / len(checks)

def check_format(output: str, expected: str) -> bool:
    if expected == "json":
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False
    elif expected == "list":
        # Match "-", "*" or "1." list markers, allowing leading indentation.
        return bool(re.search(r"^[ \t]*(?:[-*]|\d+\.)\s+", output, re.MULTILINE))
    return True  # prose format has no structural constraint

def evaluate(case: TestCase, output: str) -> EvalResult:
    word_count = len(output.split())
    return EvalResult(
        case_name=case.name,
        output=output,
        format_ok=check_format(output, case.expected_format),
        length_ok=case.min_words <= word_count <= case.max_words,
        custom_ok=case.validator(output) if case.validator else None,
    )

def run_suite(cases: list[TestCase], model_fn: Callable[[str], str]) -> list[EvalResult]:
    results = []
    for case in cases:
        output = model_fn(case.prompt)
        results.append(evaluate(case, output))
    # Guard against an empty suite, which would otherwise raise ZeroDivisionError.
    avg = sum(r.score for r in results) / len(results) if results else 0.0
    print(f"Suite score: {avg:.2%} ({len(results)} cases)")
    return results

Wire model_fn to your actual model API call. Define test cases that cover your application's key behaviors, and you have a repeatable baseline. Run it in CI: if the average score drops below a threshold, fail the build.
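The validator field accepts any function that takes the output string and returns a bool. As one possible example, the sketch below builds a validator that checks for required JSON keys. The names require_json_keys and fake_model are made up for this illustration; fake_model is a stand-in for your real API call.

```python
import json
from typing import Callable

def require_json_keys(*keys: str) -> Callable[[str], bool]:
    """Build a validator that passes only if the output is a JSON object with every key."""
    def validator(output: str) -> bool:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and all(k in data for k in keys)
    return validator

def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always returns the same JSON."""
    return '{"city": "Paris", "country": "France"}'

# With the harness above, a test case could then look like this:
#
# case = TestCase(
#     name="capital_lookup",
#     prompt="Return the capital of France as JSON with keys city and country.",
#     expected_format="json",
#     min_words=1,  # the default of 10 would fail a short JSON payload
#     validator=require_json_keys("city", "country"),
# )
# run_suite([case], fake_model)
```

Note the min_words override: the length check counts whitespace-separated tokens, so compact JSON outputs fall below the default lower bound of 10.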

Using a Language Model as a Judge

Format checks are cheap and useful, but they don't catch subtler quality problems: hallucinated facts, unhelpful vagueness, or off-tone responses.

One effective pattern is LLM-as-judge: you pass the original prompt and the model's output to a second language model and ask it to score specific dimensions. This sounds circular, but it works well in practice when:

The judge model is different from (or larger than) the model under test.
The scoring rubric is explicit and narrow.
You're measuring relative quality (A/B comparisons) rather than absolute truth.

import json
import urllib.request

JUDGE_PROMPT = (
    "You are an evaluator. Score the following LLM response on two dimensions.\n\n"
    "Original prompt: {prompt}\n"
    "Response to evaluate: {response}\n\n"
    "Rate each dimension from 1 to 5:\n"
    "- relevance: Does the response actually answer the original prompt?\n"
    "- clarity: Is the response well-structured and easy to understand?\n\n"
    'Return valid JSON only, no explanation:\n'
    '{{"relevance": <int>, "clarity": <int>}}'
)

def judge_output(prompt: str, response: str, api_key: str, model: str) -> dict:
    payload = json.dumps({
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(prompt=prompt, response=response),
            }
        ],
        "max_tokens": 100,
        "temperature": 0,
    }).encode()
    req = urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.loads(resp.read())
    return json.loads(data["choices"][0]["message"]["content"])

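Keep the judge prompt narrow. Asking a model to simultaneously rate relevance, tone, factual accuracy, and helpfulness in one call produces noisy, inconsistent scores. Stick to one or two dimensions per call, and set temperature to 0 to reduce run-to-run variance. Hosted APIs generally do not guarantee identical outputs even at temperature 0, so treat small score differences between runs as noise.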

Integrating Quality Gates into CI

Once you have scores, you need to act on them. The goal is not to block every deployment when a single test case regresses — that's too brittle. The goal is to catch systematic degradation before it reaches users.

A practical threshold setup:

Suite average score below 0.90 → fail the build
Any individual test case score below 0.60 → add a warning comment to the PR
LLM-judge average relevance below 3.5 → flag for human review before merging
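The three rules above could be encoded roughly as follows. This is a sketch under assumptions: check_gates is a hypothetical name, case scores are passed as a name-to-score mapping, and an empty suite is treated as a failure. How warnings become PR comments depends on your CI system and is left out.

```python
from statistics import mean

def check_gates(
    case_scores: dict[str, float],
    judge_relevance: list[float],
    suite_min: float = 0.90,
    case_min: float = 0.60,
    relevance_min: float = 3.5,
) -> tuple[bool, list[str]]:
    """Return (build_ok, messages) for the three thresholds described above."""
    messages = []
    # An empty suite scores 0.0 so that a misconfigured run fails loudly.
    suite_avg = mean(case_scores.values()) if case_scores else 0.0
    build_ok = suite_avg >= suite_min
    if not build_ok:
        messages.append(f"FAIL: suite average {suite_avg:.2f} is below {suite_min:.2f}")
    for name, score in sorted(case_scores.items()):
        if score < case_min:
            messages.append(f"WARN: case '{name}' scored {score:.2f}")
    if judge_relevance and mean(judge_relevance) < relevance_min:
        messages.append(
            f"REVIEW: judge relevance {mean(judge_relevance):.2f} is below {relevance_min:.2f}"
        )
    return build_ok, messages
```

In a CI script you would print the messages and exit with a non-zero status when build_ok is False, so that only the suite average blocks the merge while the other two rules stay advisory.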

Store scores per git commit to a lightweight JSON file or a time-series database. A score that drifts from 0.97 to 0.89 over 10 commits is actionable even if no individual commit crossed the hard threshold. Trend visibility is what separates monitoring from a one-off check.
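For the lightweight-file option, one simple layout is a JSON Lines file with one record per evaluated commit, plus a drift check over the most recent entries. The sketch below assumes that layout; the function names, the window of 10, and the 0.05 maximum drop are illustrative choices, not recommendations from any standard.

```python
import json
from pathlib import Path

def record_score(path: Path, commit: str, score: float) -> None:
    """Append one JSON line per evaluated commit."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"commit": commit, "score": score}) + "\n")

def load_scores(path: Path) -> list[float]:
    """Read scores back in the order they were recorded."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line)["score"] for line in fh if line.strip()]

def has_drifted(scores: list[float], window: int = 10, max_drop: float = 0.05) -> bool:
    """True if the score fell by more than max_drop across the last `window` entries."""
    recent = scores[-window:]
    if len(recent) < 2:
        return False
    return recent[0] - recent[-1] > max_drop
```

With these defaults, the 0.97 to 0.89 slide described above would be flagged even though each individual step is small. Comparing only the first and last entries in the window is deliberately simple; a moving average would be less sensitive to a single noisy run.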

One often-overlooked concern: the test inputs and outputs flowing through your evaluation pipeline may contain sensitive data — user queries, PII, internal business logic. The evaluation infrastructure itself becomes an attack surface. Teams formalizing their AI deployment process will find our free security hardening checklists useful — they cover LLM pipeline security alongside traditional application hardening.
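One small mitigation is to redact obvious identifiers before prompts and outputs are written to logs or score files. The sketch below masks email addresses and long digit runs with two regular expressions. Both patterns and the redact function are assumptions for illustration; pattern matching like this catches only the most obvious cases and is no substitute for a proper data-handling policy.

```python
import re

# Deliberately simple patterns; they will miss many kinds of PII.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS_RE = re.compile(r"\b\d{9,}\b")  # e.g. phone or account numbers

def redact(text: str) -> str:
    """Mask email addresses and runs of nine or more digits."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return LONG_DIGITS_RE.sub("[NUMBER]", text)
```

Apply the redaction to what you store, not to what you send to the model under test, otherwise you change the behavior you are trying to measure.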

The Takeaway

LLM evaluation doesn't need to be complicated to be useful. Start with the cheapest checks: format compliance and length bounds. Add an LLM-as-judge layer for subjective quality dimensions. Integrate both into your CI pipeline with explicit pass/fail thresholds.

The metric you don't measure is the one that bites you in production. A failing format check caught in a 10-second CI run is far less painful than a broken JSON response cascading through a production system at 2 a.m.

I run AYI NEDJIMI Consultants, a cybersecurity consulting firm. We publish free security hardening checklists — PDF and Excel.