How to Evaluate LLM Output Quality Programmatically

#ai #llm #python #tutorial

Shipping a language model integration without automated evaluation is flying blind. Manual review does not scale, and eyeballing a handful of outputs in staging misses the regressions that appear after model version bumps or prompt rewrites. This article walks through a practical, layered evaluation framework you can wire into CI.

What "Quality" Means in Practice

Evaluation is context-dependent. For a classification task, quality means accuracy. For a summarizer, it means coverage and faithfulness to the source. For a code generator, it means the output compiles and passes the test suite. Before writing a single line of evaluation code, define your quality dimensions:

Correctness: Does the output contain the expected information?
Format compliance: Is the structure valid JSON, Markdown, or whatever your downstream expects?
Safety: Does the output avoid hallucinated facts, PII leakage, or policy violations?
Consistency: Does the same input produce semantically equivalent outputs across runs?

Pick 2-3 dimensions that matter for your specific use case and build metrics around those. Trying to measure everything at once usually means measuring nothing well.
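
Consistency is the one dimension in the list above that needs several outputs for the same input. As a rough sketch, it can be scored as the mean pairwise similarity across repeated runs. The names `token_jaccard` and `consistency_score` are hypothetical, and the token-overlap metric is only a crude lexical stand-in; an embedding-based similarity would be the more faithful choice for "semantically equivalent".

```python
from itertools import combinations
from typing import Callable


def token_jaccard(a: str, b: str) -> float:
    """Crude lexical overlap between two strings (assumed stand-in metric)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty outputs are trivially identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def consistency_score(
    outputs: list[str],
    similarity: Callable[[str, str], float] = token_jaccard,
) -> float:
    """Mean pairwise similarity across repeated runs of the same input."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("Need at least two outputs to measure consistency")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)
```

Passing a different `similarity` callable swaps the metric without touching the aggregation.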

Layer 1: Deterministic Checks

The cheapest checks are fully deterministic -- no model call required. They catch obvious failures in milliseconds.

import json
import re
from dataclasses import dataclass

@dataclass
class EvalResult:
    passed: bool
    score: float
    reason: str

def check_json_valid(output: str) -> EvalResult:
    try:
        json.loads(output)
        return EvalResult(passed=True, score=1.0, reason="Valid JSON")
    except json.JSONDecodeError as e:
        return EvalResult(passed=False, score=0.0, reason=f"Invalid JSON: {e}")

def check_required_fields(output: str, required: list[str]) -> EvalResult:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return EvalResult(passed=False, score=0.0, reason="Not JSON")
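    if not isinstance(data, dict):
        return EvalResult(passed=False, score=0.0, reason="Top-level JSON is not an object")
    missing = [f for f in required if f not in data]
    # An empty `required` list would otherwise divide by zero
    score = 1.0 - len(missing) / len(required) if required else 1.0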
    return EvalResult(
        passed=len(missing) == 0,
        score=score,
        reason=f"Missing fields: {missing}" if missing else "All fields present",
    )

def check_no_pii(output: str) -> EvalResult:
    patterns = [
        r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}',    # email
        r'\d{3}[-.\s]?\d{2}[-.\s]?\d{4}',                     # SSN-like
    ]
    for pattern in patterns:
        if re.search(pattern, output):
            return EvalResult(passed=False, score=0.0, reason="PII pattern detected")
    return EvalResult(passed=True, score=1.0, reason="No PII found")

A format check that eliminates 15-20% of bad outputs before you reach heavier evaluation stages pays for itself immediately. Make these the first gate in every pipeline.
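
One way to make the deterministic checks a "first gate" is to run them in order and stop at the first failure, so heavier stages never see an output that is already known to be bad. The sketch below is self-contained for illustration: `run_gates`, `check_non_empty`, and `check_max_length` are hypothetical names, and the 2000-character limit is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    passed: bool
    score: float
    reason: str


Check = Callable[[str], EvalResult]


def check_non_empty(output: str) -> EvalResult:
    # Hypothetical example gate: reject blank outputs
    ok = bool(output.strip())
    return EvalResult(ok, float(ok), "Non-empty" if ok else "Empty output")


def check_max_length(output: str, limit: int = 2000) -> EvalResult:
    # Hypothetical example gate: the limit is an assumed value
    ok = len(output) <= limit
    reason = "Within limit" if ok else f"Length {len(output)} exceeds {limit}"
    return EvalResult(ok, float(ok), reason)


def run_gates(output: str, gates: list[tuple[str, Check]]) -> tuple[bool, str]:
    """Run checks in order and stop at the first failure."""
    for name, check in gates:
        result = check(output)
        if not result.passed:
            return False, f"{name}: {result.reason}"
    return True, "all gates passed"


GATES: list[tuple[str, Check]] = [
    ("non_empty", check_non_empty),
    ("max_length", check_max_length),
]
```

Ordering the list from cheapest to most expensive check keeps the average cost per output low.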

Layer 2: Semantic Similarity

When exact match is not possible -- which is most of the time for free-form text -- semantic similarity gives you a continuous quality score. The sentence-transformers library produces embeddings that work well for most comparison tasks without requiring a live model API call at evaluation time.

from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(
    output: str, reference: str, threshold: float = 0.75
) -> EvalResult:
    embeddings = _model.encode([output, reference], convert_to_tensor=True)
    score = float(util.cos_sim(embeddings[0], embeddings[1]))
    return EvalResult(
        passed=score >= threshold,
        score=score,
        reason=f"Cosine similarity: {score:.3f}",
    )

def evaluate_dataset(predictions: list[str], references: list[str]) -> dict:
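    # zip() silently truncates mismatched lists, and an empty list would divide by zero
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and of equal length")
    results = [semantic_similarity(p, r) for p, r in zip(predictions, references)]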
    scores = [r.score for r in results]
    return {
        "mean_score": round(sum(scores) / len(scores), 4),
        "pass_rate": round(sum(1 for r in results if r.passed) / len(results), 4),
        "failure_indices": [i for i, r in enumerate(results) if not r.passed],
    }

Build a golden dataset of 50-200 input/output pairs from your best human-reviewed examples. Run evaluate_dataset against every new prompt version before deploying. Because mean_score is a cosine similarity on a 0-1 scale, a drop of more than 0.03 to 0.05 (3-5 points when expressed as a percentage) is a strong signal to investigate before the change goes to production.
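
The comparison against a previous run can be sketched as a small helper that takes two summaries shaped like the dict returned by evaluate_dataset. The name `compare_to_baseline` is hypothetical, and the default `max_drop` of 0.03 is an assumption that reads the lower end of the range above on the 0-1 cosine scale.

```python
def compare_to_baseline(
    baseline: dict, candidate: dict, max_drop: float = 0.03
) -> tuple[bool, float]:
    """Return (ok, drop), where drop is baseline mean_score minus candidate mean_score.

    `ok` is False when the candidate's mean_score fell by more than `max_drop`.
    A negative drop means the candidate improved on the baseline.
    """
    drop = baseline["mean_score"] - candidate["mean_score"]
    return drop <= max_drop, round(drop, 4)


# Example with made-up summaries; real values would come from evaluate_dataset
ok, drop = compare_to_baseline({"mean_score": 0.82}, {"mean_score": 0.70})
if not ok:
    print(f"mean_score dropped by {drop}; investigate before deploying")
```

Persisting the baseline summary next to the golden dataset keeps the comparison reproducible across branches.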

Layer 3: LLM-as-Judge

Some quality dimensions -- coherence, helpfulness, instruction-following -- are genuinely hard to capture without a model. The LLM-as-judge pattern routes a second model call to score the output against a rubric.

import requests as req

JUDGE_PROMPT = (
    "You are an evaluation assistant. Score the response on a scale of 1 to 5.\n\n"
    "Rubric:\n"
    "- 5: Fully answers the question, factually accurate, concise\n"
    "- 3: Partial answer, minor inaccuracies or unnecessary content\n"
    "- 1: Off-topic, factually wrong, or harmful\n\n"
    "Question: {question}\nResponse: {response}\n\n"
    'Return only JSON: {{"score": <int>, "reason": "<one sentence>"}}'
)

def llm_judge(
    question: str, response: str, api_url: str, api_key: str, model: str
) -> EvalResult:
    prompt = JUDGE_PROMPT.format(question=question, response=response)
    r = req.post(
        api_url,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=30,
    )
    r.raise_for_status()
    verdict = json.loads(r.json()["choices"][0]["message"]["content"])
    return EvalResult(
        passed=verdict["score"] >= 4,
        score=(verdict["score"] - 1) / 4,  # normalize to 0-1
        reason=verdict["reason"],
    )
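
The json.loads call in llm_judge assumes the judge replies with bare JSON. A judge model may instead wrap the object in extra prose or return a score outside the rubric, so a more defensive parser can help. The sketch below is one possible approach; `parse_verdict` is a hypothetical helper, and the 1-5 range check simply mirrors the rubric in JUDGE_PROMPT.

```python
import json
import re


def parse_verdict(raw: str) -> tuple[int, str]:
    """Extract (score, reason) from a judge reply, tolerating surrounding text."""
    # Greedy match from the first "{" to the last "}" in the reply
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in judge reply")
    verdict = json.loads(match.group(0))  # raises ValueError on malformed JSON
    score = int(verdict["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"Score out of range: {score}")
    return score, str(verdict.get("reason", ""))
```

Treating a parse failure as a failed evaluation, rather than crashing the run, is a design choice worth making explicitly in the calling code.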

Use LLM-as-judge sparingly -- it adds latency and cost. Reserve it for the dimensions that matter most and that neither deterministic nor embedding checks can cover. Also validate the judge itself on a small labeled set where you already know the correct answer, and track its agreement rate with human labels over time.
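
Tracking agreement with human labels can start as simply as the fraction of cases where the judge and the human assign the same rubric score. This is a minimal sketch: `agreement_rate` and its `tolerance` parameter are hypothetical, and exact-match agreement is only one of several possible agreement measures.

```python
def agreement_rate(
    judge_scores: list[int], human_scores: list[int], tolerance: int = 0
) -> float:
    """Fraction of cases where judge and human scores differ by at most `tolerance`."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("Score lists must be non-empty and of equal length")
    matches = sum(
        1 for j, h in zip(judge_scores, human_scores) if abs(j - h) <= tolerance
    )
    return matches / len(judge_scores)


# Made-up labels for illustration only
judge = [5, 3, 1, 4]
human = [5, 4, 1, 2]
exact = agreement_rate(judge, human)                 # 2 of 4 match exactly
within_one = agreement_rate(judge, human, tolerance=1)  # 3 of 4 within one point
```

Logging this number for each judge model or judge prompt version makes drift in the judge itself visible.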

Assembling a Quality Gate for CI

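Combine the layers into a harness that runs against your golden dataset on every prompt change and fails the deploy if the pass rate drops below your threshold. The harness below wires in the deterministic and similarity checks; an llm_judge call can be added as one more entry in checks for the cases that need it.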

def run_eval_suite(test_cases: list[dict]) -> None:
    results = []
    for case in test_cases:
        output = case["model_output"]
        checks = {
            "json_valid": check_json_valid(output),
            "required_fields": check_required_fields(
                output, case.get("required_fields", [])
            ),
            "no_pii": check_no_pii(output),
            "semantic": semantic_similarity(output, case["reference"]),
        }
        overall = all(c.passed for c in checks.values())
        results.append({"id": case["id"], "passed": overall, "checks": checks})

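    total = len(results)
    if total == 0:
        # An empty dataset would otherwise divide by zero below
        raise SystemExit("Quality gate failed: no test cases supplied")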
    n_passed = sum(1 for r in results if r["passed"])
    pass_rate = n_passed / total
    print(f"Pass rate: {pass_rate:.1%}  ({n_passed}/{total})")

    for r in results:
        if not r["passed"]:
            for name, check in r["checks"].items():
                if not check.passed:
                    print(f"  FAIL [{r['id']}] {name}: {check.reason}")

    if pass_rate < 0.90:
        raise SystemExit(f"Quality gate failed: {pass_rate:.1%} < 90%")

Store your golden dataset in version control alongside your prompts. When the gate fails, the diff between the old and new failure lists tells you exactly which cases regressed. For security-sensitive applications, extend the dataset with adversarial inputs -- prompt injections, data-exfiltration attempts, boundary cases. A structured set of LLM threat categories to test against is part of the free security hardening checklists we publish.
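
The diff between old and new failure lists mentioned above can be sketched with plain set operations on case IDs. `diff_failures` is a hypothetical helper; it assumes each run's failing IDs have been saved, for example from the "id" field of the test cases.

```python
def diff_failures(
    old_failures: set[str], new_failures: set[str]
) -> dict[str, list[str]]:
    """Classify case IDs by how their status changed between two runs."""
    return {
        "regressed": sorted(new_failures - old_failures),      # newly failing
        "fixed": sorted(old_failures - new_failures),          # no longer failing
        "still_failing": sorted(old_failures & new_failures),  # failing in both
    }


# Made-up IDs for illustration only
report = diff_failures({"case-02", "case-07"}, {"case-07", "case-11"})
for case_id in report["regressed"]:
    print(f"REGRESSED: {case_id}")
```

Only the "regressed" bucket needs to block a deploy; the other two are useful context for the reviewer.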

The Takeaway

Programmatic evaluation is not optional for production language model systems -- it is what prevents silent regressions after every model update or prompt tweak. Start with deterministic checks (fast and free), add semantic similarity for content quality, and layer in LLM-as-judge only where it is genuinely irreplaceable. Keep a versioned golden dataset alongside your prompts and run the full suite on every change before deployment. That is the minimum viable evaluation stack.

I run AYI NEDJIMI Consultants, a cybersecurity consulting firm. We publish free security hardening checklists -- PDF and Excel.