How to Stop Evaluating LLM Outputs by Gut Feel

#ai #llm #opensource #testing

The standard workflow for evaluating LLM output quality goes something like this: someone reads Response A, reads Response B, and says "I think A is better." Everyone nods. The prompt ships.

This is a problem for three reasons:

It doesn't scale. You can't manually review 500 eval pairs after every prompt change.
It's inconsistent. The same person evaluating the same pair on different days can reach different conclusions.
It doesn't tell you why. "Response A is better" doesn't tell you what to fix when Response B becomes the baseline.

I built LLM Eval Suite to replace gut feel with structured, evidence-backed scoring — for any task type, with CI integration.

→ Full tool page

The Core Insight: Evidence, Not Opinion

Every score in LLM Eval Suite is accompanied by a verbatim quote from the response being evaluated. Not "this response has poor faithfulness" — but:

Faithfulness: 1.0/10
Quote: "30-day return policy, no questions asked"
Reasoning: "Source document specifies 14 days. This is a clear hallucination, not an interpretation."

This changes what you can do with the output. You can show it to a stakeholder. You can track it over time. You can build a regression test from it. You can tell the model what specifically went wrong.
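One property worth checking mechanically is that the quote really is verbatim. Here is a minimal Python sketch of that check. The `DimensionScore` record and `quote_is_verbatim` helper are hypothetical names for illustration; the fields mirror the example above but are not LLM Eval Suite's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScore:
    """Hypothetical record for one scored dimension (not the tool's real schema)."""
    dimension: str
    score: float      # 0-10 scale, as in the example above
    quote: str        # evidence copied from the response
    reasoning: str


def quote_is_verbatim(score: DimensionScore, response_text: str) -> bool:
    """True only if the quoted evidence appears character-for-character in the response."""
    return bool(score.quote) and score.quote in response_text


faithfulness = DimensionScore(
    dimension="Faithfulness",
    score=1.0,
    quote="30-day return policy, no questions asked",
    reasoning="Source document specifies 14 days.",
)
# Illustrative response text, made up for this sketch.
response = "We offer a 30-day return policy, no questions asked."
print(quote_is_verbatim(faithfulness, response))
```

A check like this is cheap to run on every report, and it catches the case where a judge paraphrases the response instead of quoting it.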

Six Evaluation Capabilities

Multi-Dimensional Scoring

Ten task presets — QA, summarisation, RAG, code generation, creative writing, classification, translation, and more. Each preset activates the dimensions that matter for that task:

Task Type	Key Dimensions
`qa`	Faithfulness, Completeness, Conciseness, Relevance
`summarisation`	Coverage, Compression, Accuracy, Readability
`rag`	Faithfulness, Answer Relevancy, Context Precision, Context Recall
`code`	Correctness, Efficiency, Readability, Security

Every dimension score comes with verbatim evidence from the response text.

docker-compose run cli eval \
  --file examples/eval_qa.json \
  --mode compare \
  --format markdown

Regression Testing

Save any eval report as a named baseline:

docker-compose run cli regression save results.json --id prod-baseline

Run future evals against it:

docker-compose run cli regression run results.json --id prod-baseline --format markdown

Per-dimension deltas are compared against configurable thresholds, and the command exits with code 1 when any dimension drops beyond its threshold. That exit code is what makes the tool useful in CI.
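To make the threshold logic concrete, here is a rough Python sketch of a per-dimension regression check. It is not the tool's implementation: the function name, the dictionary layout, the 0.5 default, and all scores are assumptions invented for illustration.

```python
def regression_failures(baseline, current, max_drop):
    """List (dimension, delta) pairs where the score fell by more than the allowed drop.

    All three arguments are plain dicts keyed by dimension name (an assumed layout).
    A dimension missing from `current` is treated as a score of 0.0, so it fails.
    """
    failures = []
    for dimension, base_score in baseline.items():
        delta = current.get(dimension, 0.0) - base_score
        allowed = max_drop.get(dimension, 0.5)  # 0.5 is an arbitrary illustrative default
        if delta < -allowed:
            failures.append((dimension, round(delta, 2)))
    return failures


# Made-up scores on a 0-10 scale.
baseline = {"faithfulness": 9.0, "completeness": 8.0, "conciseness": 7.5}
current = {"faithfulness": 7.8, "completeness": 8.1, "conciseness": 7.2}
max_drop = {"faithfulness": 0.5, "completeness": 1.0, "conciseness": 1.0}

failures = regression_failures(baseline, current, max_drop)
for dimension, delta in failures:
    print(f"REGRESSION {dimension}: {delta:+.2f}")

# A CI wrapper would pass this value to sys.exit().
exit_code = 1 if failures else 0
```

In this made-up run only faithfulness fails: it drops by 1.2 against an allowed 0.5, while the 0.3 drop in conciseness stays inside its tolerance.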

GitHub Actions Integration

- name: Run LLM eval
  run: |
    docker-compose run cli eval \
      --file evals/suite.json \
      --mode rank \
      --format junit \
      --output results.xml

- uses: mikepenz/action-junit-report@v3
  with:
    report_paths: results.xml

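- name: Regression check
  run: |
    # results.json must be a JSON eval report. The eval step above only
    # writes JUnit XML, so produce it in a separate eval run (not shown).
    docker-compose run cli regression run \
      results.json --id prod-baseline
    # exits 1 if any dimension drops beyond threshold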

This gates model upgrades, prompt changes, and fine-tune releases automatically. The JUnit XML output integrates with any CI system that understands test reports.

Hallucination Detection

Claim-level analysis against a source document. Each claim in the response is classified as supported or unsupported — binary, not "mostly faithful."

docker-compose run cli hallucination \
  --response output.txt \
  --source source.txt \
  --format markdown

Risk levels: none / low / moderate / high / critical, with a safe_to_use boolean for downstream gating. This is what you run before using LLM output in a production pipeline where accuracy matters.

Example output:

hallucination_risk: high
safe_to_use: false

Claim: "30-day return policy"
  status: unsupported
  evidence: "Source specifies 14 days"
  severity: critical

Claim: "no questions asked"
  status: unsupported
  evidence: "Source makes no mention of return conditions"
  severity: high
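A downstream step might gate on a report like this one. The Python sketch below assumes the report has already been parsed into a dict. The dict layout (including the `claims` key) and both function names are hypothetical; only the field values come from the example above.

```python
# Risk levels in ascending order, as listed above.
RISK_ORDER = ["none", "low", "moderate", "high", "critical"]


def should_block(report, max_risk="low"):
    """Block when the report is marked unsafe or its risk exceeds the tolerated level."""
    too_risky = RISK_ORDER.index(report["hallucination_risk"]) > RISK_ORDER.index(max_risk)
    return (not report["safe_to_use"]) or too_risky


def unsupported_claims(report):
    """Collect the text of every claim the report marks as unsupported."""
    return [c["claim"] for c in report["claims"] if c["status"] == "unsupported"]


# Assumed parsed form of the example output above.
report = {
    "hallucination_risk": "high",
    "safe_to_use": False,
    "claims": [
        {"claim": "30-day return policy", "status": "unsupported", "severity": "critical"},
        {"claim": "no questions asked", "status": "unsupported", "severity": "high"},
    ],
}

if should_block(report):
    print("Blocked. Unsupported claims:", unsupported_claims(report))
```

Checking both `safe_to_use` and the risk level is a deliberate belt-and-braces choice in this sketch: a pipeline can tighten `max_risk` without depending on how the boolean is derived.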

Prompt Sensitivity Analysis

Test 2–5 prompt variants against a fixed response. Per-dimension variance tells you which dimensions are fragile across phrasings and which are stable.

docker-compose run cli sensitivity \
  --file examples/prompt_variants.json \
  --format markdown

Know which prompt phrasings shift your scores before you deploy. High-variance dimensions across prompts signal that your evaluation isn't measuring the response — it's measuring the prompt wording.
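As a rough illustration of the per-dimension variance idea, the Python sketch below ranks dimensions by how much their scores move across prompt variants. The scores are made up, and the function name and data layout are assumptions rather than the tool's output format.

```python
from statistics import pvariance


def variance_by_dimension(runs):
    """Population variance of each dimension's score across prompt variants.

    `runs` is a list of dicts, one per prompt variant, all sharing the same keys.
    """
    dimensions = runs[0].keys()
    return {d: pvariance([run[d] for run in runs]) for d in dimensions}


# Made-up scores for three prompt variants evaluating the same response.
variant_scores = [
    {"faithfulness": 9.0, "conciseness": 6.0},
    {"faithfulness": 9.0, "conciseness": 8.0},
    {"faithfulness": 9.0, "conciseness": 4.0},
]

spread = variance_by_dimension(variant_scores)
# Most fragile dimension first.
most_fragile_first = sorted(spread, key=spread.get, reverse=True)
print(most_fragile_first)
```

Here faithfulness does not move at all, while conciseness swings with the wording, so conciseness is the dimension to distrust until the prompt is pinned down.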

Panel Evaluation

Run N independent judge passes on the same evaluation. Mean and variance per dimension expose where judges agree and where they disagree.

docker-compose run cli panel \
  --file examples/eval_qa.json \
  --judges 5 \
  --format markdown

High-variance dimensions are flagged for human review automatically. The panel mode is the right choice when you're evaluating subjective tasks like creative writing where a single judge's opinion is insufficient signal.
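The aggregation behind a panel can be sketched in a few lines of Python. This is an assumption-laden illustration, not the tool's code: the `review_threshold` value, the `creativity` dimension name, and the scores are all invented.

```python
from statistics import mean, variance


def summarize_panel(judge_scores, review_threshold=1.0):
    """Mean and sample variance per dimension, flagging high-disagreement dimensions.

    `judge_scores` holds one dict per judge pass. At least two passes are needed
    because sample variance is undefined for a single value.
    """
    summary = {}
    for dimension in judge_scores[0]:
        values = [scores[dimension] for scores in judge_scores]
        var = variance(values)
        summary[dimension] = {
            "mean": mean(values),
            "variance": var,
            "needs_review": var > review_threshold,
        }
    return summary


# Five made-up judge passes over the same response.
judge_scores = [
    {"faithfulness": 9.0, "creativity": 4.0},
    {"faithfulness": 9.0, "creativity": 8.0},
    {"faithfulness": 9.0, "creativity": 6.0},
    {"faithfulness": 9.0, "creativity": 7.0},
    {"faithfulness": 9.0, "creativity": 5.0},
]

summary = summarize_panel(judge_scores)
flagged = [d for d, stats in summary.items() if stats["needs_review"]]
print(flagged)
```

With these numbers the judges agree completely on faithfulness and disagree on creativity (sample variance 2.5), so only creativity is flagged for a human to look at.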

RAGAS-Compatible RAG Preset

The `rag` task type exposes the four RAGAS metrics (faithfulness, answer relevancy, context precision, context recall) as first-class evaluation dimensions with equal weighting. The output is compatible with RAGAS reporting conventions, so you can integrate this into existing RAGAS workflows or use it as a drop-in alternative.
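If an overall score is derived from the four dimensions, equal weighting reduces to a plain average. A minimal Python sketch follows, with hypothetical key names and made-up scores. Whether the tool reports such an aggregate is an assumption here.

```python
# Hypothetical key names for the four RAGAS-style dimensions.
RAG_DIMENSIONS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


def rag_overall(scores):
    """Equal-weight average of the four RAG dimensions; refuses partial input."""
    missing = [d for d in RAG_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] for d in RAG_DIMENSIONS) / len(RAG_DIMENSIONS)


# Made-up scores on a 0-10 scale.
scores = {
    "faithfulness": 8.0,
    "answer_relevancy": 9.0,
    "context_precision": 6.0,
    "context_recall": 7.0,
}
print(rag_overall(scores))  # 7.5
```

Raising on a missing dimension, instead of averaging whatever is present, keeps a partially failed eval from producing a score that looks comparable to a complete one.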

Example: Two Responses In, Clear Winner Out

Input:

{
  "task_type": "qa",
  "eval_mode": "compare",
  "source": "Refunds are accepted within 14 days if the item is unused.",
  "responses": [
    {
      "label": "Response A",
      "text": "You can get a refund within 14 days if the item hasn't been used."
    },
    {
      "label": "Response B",
      "text": "We offer a 30-day return policy, no questions asked."
    }
  ]
}

Output:

winner: Response A
margin: clear

Response B — Faithfulness
  score: 1.0/10
  quote: "30-day return policy, no questions asked"
  reasoning: "Source specifies 14 days. 'No questions asked' is not in the source.
              Two distinct hallucinations in one sentence."

Response A — Faithfulness
  score: 9.5/10
  quote: "within 14 days if the item hasn't been used"
  reasoning: "Accurately paraphrases the source with no additions."

Why This Matters in Production

LLM evaluation is usually treated as a one-time concern — you evaluate before you ship. But models change, prompts drift, data distributions shift, and retrieval quality fluctuates. A system that was 90% faithful in January may be 75% faithful in April because the upstream data changed.

The regression testing and CI integration in LLM Eval Suite are designed for this reality. You run evals continuously, not just at release time. The baseline is the floor — if you drop below it, the pipeline stops.

→ View the full tool page, docs, and GitHub repo