# Zero-Dependency LLM Judge: Benchmarking Faithfulness and Pairwise Evaluation
**TL;DR:** We built agent-eval-lite, a zero-dependency Python framework for LLM-as-judge evaluation. It achieves κ=0.68 on FaithBench (faithfulness) and PCAcc=91-100% on JudgeBench (pairwise comparison) — competitive with heavy frameworks that require 40+ dependencies.
## The Problem
You've built an AI agent. It answers 10,000 questions a day. How do you know it's not hallucinating?
Manual review doesn't scale. LLM-as-judge — using one LLM to evaluate another — is the practical answer. But existing frameworks (DeepEval, Ragas) drag in torch, transformers, langchain, and dozens of transitive dependencies.
agent-eval-lite does the same job with zero external dependencies. Just urllib from Python's stdlib.
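For the curious: a call to an OpenAI-compatible endpoint really does fit in a few lines of stdlib code. The sketch below is illustrative, not agent-eval-lite's actual internals — the `build_chat_request` helper and the endpoint URL are ours:

```python
import json
import urllib.request

def build_chat_request(base_url: str, api_key: str, model: str, prompt: str):
    """Build a POST request for an OpenAI-compatible /chat/completions endpoint."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_chat_request("https://api.example.com/v1", "your-key", "gpt-4o",
                         "Is this summary faithful to the source?")
# Sending it is one more stdlib call:
# with urllib.request.urlopen(req) as resp:
#     verdict = json.load(resp)
```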
## What's New in v0.5

### 1. Multi-Model Jury Voting
Different models have different biases. GPT-5.2 is lenient (high false positive rate), while Grok is too strict (high false negative rate). Claude Sonnet 4.6 is the most balanced.
Jury mode exploits this:
```python
from agent_eval import JudgeJury, JudgeProvider, judge_faithfulness

jury = JudgeJury([
    JudgeProvider(api_key="k1", base_url="...", model="claude-sonnet-4-6"),
    JudgeProvider(api_key="k2", base_url="...", model="grok-4.1-fast"),
    JudgeProvider(api_key="k3", base_url="...", model="gpt-5.2"),
])

verdict = jury.judge(judge_faithfulness, context="...", output="...")
print(verdict.passed)           # Majority vote
print(verdict.agreement_ratio)  # Fraction of judges that agree
```
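Majority voting over boolean verdicts is simple to reason about. A minimal sketch (the `majority_vote` helper is illustrative, not the library's actual internals):

```python
from collections import Counter

def majority_vote(verdicts: list[bool]) -> tuple[bool, float]:
    """Aggregate per-judge pass/fail verdicts into (passed, agreement_ratio)."""
    counts = Counter(verdicts)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(verdicts)

passed, agreement = majority_vote([True, True, False])
# passed is True, agreement is 2/3
```

With an odd number of judges there is always a strict majority, which is one reason three-judge juries are a natural default.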
### 2. Pairwise Comparison with Position-Consistency
Comparing two responses? LLMs have position bias — they tend to prefer whichever response appears first. Our pairwise judge runs the evaluation twice with A/B swapped:
```python
from agent_eval import judge_pairwise

result = judge_pairwise(provider,
    prompt="Explain recursion",
    response_a="Short answer...",
    response_b="Detailed answer with examples...",
    swap=True,  # Run twice with A/B swapped, check consistency
)
# result.passed = True means A is better
# result.passed = None means position bias detected
```
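The swap logic reduces to comparing the two verdicts. A sketch with a hypothetical `combine_swapped` helper, where each pick names the *original* (unswapped) response the judge preferred:

```python
def combine_swapped(first_pick: str, swapped_pick: str):
    """Merge verdicts from the original run and the A/B-swapped run.

    Each pick names the original response the judge preferred
    ("A" or "B"). Agreement yields a verdict; disagreement means
    the judge followed position, not quality.
    """
    if first_pick == swapped_pick:
        return first_pick == "A"  # True: A is better; False: B is better
    return None                   # Verdicts flipped: position bias detected

print(combine_swapped("A", "A"))  # True
print(combine_swapped("A", "B"))  # None
```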
### 3. Multi-Step Faithfulness Pipeline
For cases where you need detailed per-claim analysis:
```python
result = judge_faithfulness(provider,
    context="Source text...",
    output="Agent response...",
    mode="thorough",  # 3-step: extract claims → verify each → aggregate
)
# Each claim classified as: supported / contradicted / fabricated / idk
```
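The final aggregation step can be sketched as follows. `aggregate_claims` is illustrative, and treating `idk` as a pass is one possible policy choice, not necessarily the library's:

```python
def aggregate_claims(labels: list[str]) -> bool:
    """Step 3 of the pipeline: pass only if no claim is contradicted or fabricated.

    Labels come from step 2, one per extracted claim:
    supported / contradicted / fabricated / idk.
    """
    return not any(label in ("contradicted", "fabricated") for label in labels)

print(aggregate_claims(["supported", "supported", "idk"]))  # True
print(aggregate_claims(["supported", "fabricated"]))        # False
```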
## Benchmark Results
We tested against two standard academic benchmarks:
### FaithBench (Hallucination Detection)
FaithBench (NAACL 2025) contains 750 human-annotated summarization hallucinations — deliberately hard cases where existing detectors disagree.
| Judge Model | Accuracy | Cohen's κ |
|---|---|---|
| Claude Sonnet 4.6 | 83% | 0.68 |
| GPT-5.2 | 77% | 0.55 |
| Grok 4.1 Fast | 70% | 0.29 |
| DeepSeek v3.2 | 70% | 0.31 |
A κ of 0.68 falls in the "substantial agreement" band (0.61–0.80 on the Landis–Koch scale) with human annotators.
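Cohen's κ discounts the agreement you would expect by chance, which is why it sits below raw accuracy. A worked example with hypothetical counts (not the actual FaithBench confusion matrix):

```python
def cohens_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa from a 2x2 agreement table between judge and human."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n           # Observed agreement
    judge_pos = (tp + fp) / n     # Judge's positive-label rate
    human_pos = (tp + fn) / n     # Human's positive-label rate
    # Chance agreement: both positive, or both negative, independently
    p_e = judge_pos * human_pos + (1 - judge_pos) * (1 - human_pos)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical counts: 85% raw agreement reduces to kappa = 0.70
print(round(cohens_kappa(tp=40, fp=10, fn=5, tn=45), 2))
```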
### JudgeBench (Pairwise Comparison)
JudgeBench (ICLR 2025) tests pairwise judgment on objectively verifiable tasks. We report position-consistent accuracy — correct in both A/B orderings.
| Judge Model | PC Accuracy | Consistency Rate |
|---|---|---|
| GPT-5.2 | 100% | 89% |
| Claude Sonnet 4.6 | 91% | 77% |
| Grok 4.1 Fast | 80% | 17% |
GPT-5.2 answered every position-consistent judgment correctly, while Grok shows severe position bias: 83% of its verdicts flip when the responses are swapped.
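One plausible way to compute these two columns from swapped trials (the `pc_metrics` helper and its record format are illustrative, not JudgeBench's actual harness):

```python
def pc_metrics(trials):
    """trials: (pick_ab, pick_ba, truth) tuples, where each pick names the
    preferred original response in that ordering and truth is the
    objectively better response ("A" or "B")."""
    consistent = [(p1, truth) for p1, p2, truth in trials if p1 == p2]
    consistency_rate = len(consistent) / len(trials)
    correct = sum(1 for pick, truth in consistent if pick == truth)
    pc_accuracy = correct / len(consistent) if consistent else 0.0
    return pc_accuracy, consistency_rate

# Toy run: 3 of 4 trials are consistent, and all consistent ones are correct
acc, rate = pc_metrics([("A", "A", "A"), ("B", "B", "B"),
                        ("A", "A", "A"), ("A", "B", "A")])
# acc == 1.0, rate == 0.75
```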
## Why Zero Dependencies Matters
| | DeepEval | Ragas | agent-eval-lite |
|---|---|---|---|
| Dependencies | 40+ | langchain ecosystem | 0 |
| Install time | Minutes | Minutes | Seconds |
| CI/CD footprint | Heavy | Heavy | Lightweight |
| Judge cost tracking | No | No | Yes |
| Multi-model jury | No | No | Yes |
For CI pipelines, Docker images, and edge deployments, zero dependencies means faster builds, fewer conflicts, and smaller attack surface.
## Get Started

```bash
pip install agent-eval-lite
```
```python
from agent_eval import JudgeProvider, judge_faithfulness

provider = JudgeProvider(
    api_key="your-key",
    model="gpt-4o",  # Any OpenAI-compatible API
)

result = judge_faithfulness(provider,
    context="The API returned: temp=72°F, condition=sunny",
    output="It's 72°F and sunny, with heavy rain expected.",
)
print(result.passed)              # False (fabricated rain)
print(result.unsupported_claims)  # ["heavy rain expected"]
```
GitHub: xiaona-ai/agent-eval
PyPI: agent-eval-lite
183 tests. Zero dependencies. Paper-level benchmarks.