The standard workflow for evaluating LLM output quality goes something like this: someone reads Response A, reads Response B, and says "I think A is better." Everyone nods. The prompt ships.
This is a problem for three reasons:
- It doesn't scale. You can't manually review 500 eval pairs after every prompt change.
- It's inconsistent. The same person evaluating the same pair on different days can produce different results.
- It doesn't tell you why. "Response A is better" doesn't tell you what to fix when Response B becomes the baseline.
I built LLM Eval Suite to replace gut feel with structured, evidence-backed scoring — for any task type, with CI integration.
The Core Insight: Evidence, Not Opinion
Every score in LLM Eval Suite is accompanied by a verbatim quote from the response being evaluated. Not "this response has poor faithfulness" — but:
Faithfulness: 1.0/10
Quote: "30-day return policy means no questions asked"
Reasoning: "Source document specifies 14 days. This is a clear hallucination, not an interpretation."
This changes what you can do with the output. You can show it to a stakeholder. You can track it over time. You can build a regression test from it. You can tell the model what specifically went wrong.
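To make the "evidence, not opinion" idea concrete, here is a minimal sketch of what a score record with a verbatim-quote check could look like. The `DimensionScore` class and `is_verbatim` helper are hypothetical names for illustration, not LLM Eval Suite's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScore:
    """One scored dimension plus its evidence (hypothetical schema)."""
    dimension: str
    score: float    # 0-10 scale, matching the example above
    quote: str      # should appear verbatim in the evaluated response
    reasoning: str


def is_verbatim(score: DimensionScore, response_text: str) -> bool:
    """True only if the quoted evidence is a non-empty exact substring of the response."""
    return bool(score.quote) and score.quote in response_text


response = "Our 30-day return policy means no questions asked."

good = DimensionScore("faithfulness", 1.0, "30-day return policy",
                      "Source document specifies 14 days.")
bad = DimensionScore("faithfulness", 1.0, "30 day returns",
                     "Paraphrased, so it cannot be traced back to the response.")

print(is_verbatim(good, response))  # True
print(is_verbatim(bad, response))   # False
```

A check like this is cheap to run on every judge output, and it rejects "evidence" that the judge paraphrased or invented.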
Six Evaluation Capabilities
Multi-Dimensional Scoring
Ten task presets — QA, summarisation, RAG, code generation, creative writing, classification, translation, and more. Each preset activates the dimensions that matter for that task:
| Task Type | Key Dimensions |
|---|---|
| qa | Faithfulness, Completeness, Conciseness, Relevance |
| summarisation | Coverage, Compression, Accuracy, Readability |
| rag | Faithfulness, Answer Relevancy, Context Precision, Context Recall |
| code | Correctness, Efficiency, Readability, Security |
Every dimension score comes with verbatim evidence from the response text.
docker-compose run cli eval \
--file examples/eval_qa.json \
--mode compare \
--format markdown
Regression Testing
Save any eval report as a named baseline:
docker-compose run cli regression save results.json --id prod-baseline
Run future evals against it:
docker-compose run cli regression run results.json --id prod-baseline --format markdown
Per-dimension deltas are compared against configurable thresholds. The command exits with code 1 when any dimension drops beyond its threshold, which is what makes the tool useful in CI.
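The threshold comparison can be sketched roughly as follows. This is an illustration of the idea, not the tool's implementation: the function name, the flat `{dimension: score}` shape, and the default allowed drop of 0.5 are all assumptions.

```python
def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     max_drop: dict[str, float],
                     default_drop: float = 0.5) -> dict[str, float]:
    """Return {dimension: delta} for every dimension that dropped beyond its threshold.

    A dimension missing from `current` is treated as a score of 0.0 (an assumption),
    so it will normally be reported as a regression.
    """
    regressions = {}
    for dim, base_score in baseline.items():
        delta = current.get(dim, 0.0) - base_score
        if delta < -max_drop.get(dim, default_drop):
            regressions[dim] = round(delta, 2)
    return regressions


baseline = {"faithfulness": 9.0, "completeness": 8.0, "conciseness": 7.5}
current = {"faithfulness": 7.8, "completeness": 8.1, "conciseness": 7.2}
max_drop = {"faithfulness": 0.5, "completeness": 1.0, "conciseness": 1.0}

regressions = find_regressions(baseline, current, max_drop)
exit_code = 1 if regressions else 0  # a CI wrapper would pass this to sys.exit()

print(regressions)  # {'faithfulness': -1.2}
print(exit_code)    # 1
```

Keeping thresholds per dimension lets you be strict about faithfulness while tolerating small swings in something like conciseness.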
GitHub Actions Integration
- name: Run LLM eval
  run: |
    docker-compose run cli eval \
      --file evals/suite.json \
      --mode rank \
      --format junit \
      --output results.xml
- uses: mikepenz/action-junit-report@v3
  with:
    report_paths: results.xml
- name: Regression check
  # Note: this step reads results.json, which the eval step above does not
  # write (it emits JUnit XML only); produce it in an earlier step.
  run: |
    docker-compose run cli regression run \
      results.json --id prod-baseline
    # exits 1 if any dimension drops beyond threshold
This gates model upgrades, prompt changes, and fine-tune releases automatically. The JUnit XML output integrates with any CI system that understands test reports.
Hallucination Detection
Claim-level analysis against a source document. Each claim in the response is classified as supported or unsupported — binary, not "mostly faithful."
docker-compose run cli hallucination \
--response output.txt \
--source source.txt \
--format markdown
Risk levels: none / low / moderate / high / critical, with a safe_to_use boolean for downstream gating. This is what you run before using LLM output in a production pipeline where accuracy matters.
Example output:
hallucination_risk: high
safe_to_use: false
Claim: "30-day return policy"
status: unsupported
evidence: "Source specifies 14 days"
severity: critical
Claim: "no questions asked"
status: unsupported
evidence: "Source makes no mention of return conditions"
severity: high
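For downstream gating, a small sketch of how a pipeline might consume this report. It assumes the report is available as a dictionary with a `claims` list carrying the fields shown above; that shape and the `blocking_claims` helper are hypothetical, not the tool's documented output format.

```python
def blocking_claims(report: dict) -> list[dict]:
    """Return the unsupported claims when the report is not safe to use, else []."""
    if report["safe_to_use"]:
        return []
    return [c for c in report["claims"] if c["status"] == "unsupported"]


# Report mirroring the example output above (dictionary shape is an assumption).
report = {
    "hallucination_risk": "high",
    "safe_to_use": False,
    "claims": [
        {"claim": "30-day return policy", "status": "unsupported",
         "evidence": "Source specifies 14 days", "severity": "critical"},
        {"claim": "no questions asked", "status": "unsupported",
         "evidence": "Source makes no mention of return conditions",
         "severity": "high"},
    ],
}

blocked = blocking_claims(report)
for claim in blocked:
    print(f"BLOCKED [{claim['severity']}]: {claim['claim']!r} ({claim['evidence']})")
```

Returning the offending claims rather than a bare boolean keeps the reason for the block visible in logs.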
Prompt Sensitivity Analysis
Test 2–5 prompt variants against a fixed response. Per-dimension variance tells you which dimensions are fragile across phrasings and which are stable.
docker-compose run cli sensitivity \
--file examples/prompt_variants.json \
--format markdown
Know which prompt phrasings shift your scores before you deploy. High-variance dimensions across prompts signal that your evaluation isn't measuring the response — it's measuring the prompt wording.
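As a rough illustration of per-dimension variance across prompt variants, assuming each variant yields a flat `{dimension: score}` mapping (the data and function name below are made up for the example):

```python
from statistics import pvariance


def dimension_variance(scores_by_variant: list[dict[str, float]]) -> dict[str, float]:
    """Population variance of each dimension's score across prompt variants."""
    dims = scores_by_variant[0].keys()
    return {d: pvariance([v[d] for v in scores_by_variant]) for d in dims}


# Scores for the same response under three prompt phrasings (illustrative numbers).
variants = [
    {"faithfulness": 9.0, "conciseness": 8.0},
    {"faithfulness": 9.0, "conciseness": 5.0},
    {"faithfulness": 9.0, "conciseness": 6.5},
]

variance = dimension_variance(variants)
most_fragile = max(variance, key=variance.get)

print(variance)      # {'faithfulness': 0.0, 'conciseness': 1.5}
print(most_fragile)  # conciseness
```

Here faithfulness is stable across phrasings, while conciseness moves with the wording and deserves a closer look before it is trusted.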
Panel Evaluation
Run N independent judge passes on the same evaluation. Mean and variance per dimension expose where judges agree and where they disagree.
docker-compose run cli panel \
--file examples/eval_qa.json \
--judges 5 \
--format markdown
High-variance dimensions are flagged for human review automatically. The panel mode is the right choice when you're evaluating subjective tasks like creative writing where a single judge's opinion is insufficient signal.
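The aggregation step can be sketched like this. The variance threshold of 1.0, the field names, and the use of sample variance are assumptions for illustration; the tool's actual flagging rule may differ.

```python
from statistics import mean, variance


def panel_summary(judge_scores: list[dict[str, float]],
                  variance_threshold: float = 1.0) -> dict[str, dict]:
    """Mean and sample variance per dimension, with a flag for judge disagreement."""
    summary = {}
    for dim in judge_scores[0]:
        values = [judge[dim] for judge in judge_scores]
        var = variance(values)  # sample variance; requires at least 2 judges
        summary[dim] = {
            "mean": mean(values),
            "variance": var,
            "needs_human_review": var > variance_threshold,
        }
    return summary


# Five independent judge passes over the same response (illustrative numbers).
judges = [
    {"creativity": 8, "grammar": 9},
    {"creativity": 4, "grammar": 8},
    {"creativity": 9, "grammar": 9},
    {"creativity": 3, "grammar": 8},
    {"creativity": 6, "grammar": 9},
]

summary = panel_summary(judges)
flagged = [dim for dim, stats in summary.items() if stats["needs_human_review"]]
print(flagged)  # ['creativity']
```

With these numbers the judges roughly agree on grammar but split on creativity, so only creativity is routed to a human.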
RAGAS-Compatible RAG Preset
The rag task type exposes the four RAGAS metrics (faithfulness, answer relevancy, context precision, context recall) as first-class evaluation dimensions with equal weighting. The output is compatible with RAGAS reporting conventions, so you can integrate this into existing RAGAS workflows or use it as a drop-in alternative.
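Equal weighting means each of the four dimensions contributes a quarter of the overall score. A minimal sketch, assuming snake_case dimension keys and the 0-10 scale used elsewhere in this post (both are assumptions, as is the `rag_overall` helper):

```python
RAG_DIMENSIONS = (
    "faithfulness",
    "answer_relevancy",
    "context_precision",
    "context_recall",
)


def rag_overall(scores: dict[str, float]) -> float:
    """Equal-weight average of the four RAG dimensions."""
    missing = [d for d in RAG_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    weight = 1 / len(RAG_DIMENSIONS)  # 0.25 each
    return sum(scores[d] * weight for d in RAG_DIMENSIONS)


overall = rag_overall({
    "faithfulness": 8.0,
    "answer_relevancy": 9.0,
    "context_precision": 7.0,
    "context_recall": 6.0,
})
print(overall)  # 7.5
```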
Example: Two Responses In, Clear Winner Out
Input:
{
"task_type": "qa",
"eval_mode": "compare",
"source": "Refunds are accepted within 14 days if the item is unused.",
"responses": [
{
"label": "Response A",
"text": "You can get a refund within 14 days if the item hasn't been used."
},
{
"label": "Response B",
"text": "Our 30-day return policy means no questions asked."
}
]
}
Output:
winner: Response A
margin: clear
Response B — Faithfulness
score: 1.0/10
quote: "30-day return policy means no questions asked"
reasoning: "Source specifies 14 days. 'No questions asked' is not in the source.
Two distinct hallucinations in one sentence."
Response A — Faithfulness
score: 9.5/10
quote: "within 14 days if the item hasn't been used"
reasoning: "Accurately paraphrases the source with no additions."
Why This Matters in Production
LLM evaluation is usually treated as a one-time concern — you evaluate before you ship. But models change, prompts drift, data distributions shift, and retrieval quality fluctuates. A system that was 90% faithful in January may be 75% faithful in April because the upstream data changed.
The regression testing and CI integration in LLM Eval Suite are designed for this reality. You run evals continuously, not just at release time. The baseline is the floor — if you drop below it, the pipeline stops.