DEV Community

Xiaona (小娜)


Zero-Dependency LLM Judge: Benchmarking Faithfulness and Pairwise Evaluation


TL;DR: We built agent-eval-lite, a zero-dependency Python framework for LLM-as-judge evaluation. It achieves κ=0.68 on FaithBench (faithfulness) and PCAcc=91-100% on JudgeBench (pairwise comparison) — competitive with heavy frameworks that require 40+ dependencies.

The Problem

You've built an AI agent. It answers 10,000 questions a day. How do you know it's not hallucinating?

Manual review doesn't scale. LLM-as-judge — using one LLM to evaluate another — is the practical answer. But existing frameworks (DeepEval, Ragas) drag in torch, transformers, langchain, and dozens of transitive dependencies.

agent-eval-lite does the same job with zero external dependencies. Just urllib from Python's stdlib.

What's New in v0.5

1. Multi-Model Jury Voting

Different models have different biases. GPT-5.2 is lenient (high false positive rate), while Grok is too strict (high false negative rate). Claude Sonnet 4.6 is the most balanced.

Jury mode exploits this:

```python
from agent_eval import JudgeJury, JudgeProvider, judge_faithfulness

jury = JudgeJury([
    JudgeProvider(api_key="k1", base_url="...", model="claude-sonnet-4-6"),
    JudgeProvider(api_key="k2", base_url="...", model="grok-4.1-fast"),
    JudgeProvider(api_key="k3", base_url="...", model="gpt-5.2"),
])

verdict = jury.judge(judge_faithfulness, context="...", output="...")
print(verdict.passed)           # Majority vote
print(verdict.agreement_ratio)  # How much judges agree
```
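For intuition, the aggregation step can be sketched like this. This is an illustrative re-implementation, not the library's internals; the `Verdict` dataclass and its fields simply mirror the attributes shown above:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    agreement_ratio: float

def majority_vote(votes: list[bool]) -> Verdict:
    """Aggregate individual judge verdicts into a jury verdict."""
    passes = sum(votes)
    passed = passes > len(votes) / 2
    # Agreement = fraction of judges siding with the majority
    majority = max(passes, len(votes) - passes)
    return Verdict(passed=passed, agreement_ratio=majority / len(votes))

print(majority_vote([True, True, False]))
# Verdict(passed=True, agreement_ratio=0.6666666666666666)
```

With an odd number of judges there are no ties, which is one reason a three-model jury is a natural default.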

2. Pairwise Comparison with Position-Consistency

Comparing two responses? LLMs have position bias — they tend to prefer whichever response appears first. Our pairwise judge runs the evaluation twice with A/B swapped:

```python
from agent_eval import judge_pairwise

result = judge_pairwise(provider,
    prompt="Explain recursion",
    response_a="Short answer...",
    response_b="Detailed answer with examples...",
    swap=True,  # Run twice, check consistency
)
# result.passed = True means A is better
# result.passed = None means position bias detected
```
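The swap-consistency idea itself is simple. Here is a minimal sketch (my own illustration of the technique, not the library's code), using a toy judge function in place of an LLM call:

```python
def pairwise_with_swap(judge, prompt, response_a, response_b):
    """Run the judge twice with positions swapped; only trust
    verdicts that survive the swap (position-consistent)."""
    first = judge(prompt, response_a, response_b)   # True if first-listed wins
    second = judge(prompt, response_b, response_a)  # True if first-listed (now B) wins
    if first == second:
        return None  # verdict flipped with position -> position bias detected
    return first     # consistent: True means A is better

# Toy judge with no position bias: always prefers the longer response
length_judge = lambda p, a, b: len(a) > len(b)
print(pairwise_with_swap(length_judge, "q", "short", "much longer answer"))  # False (B better)

# Toy judge with total position bias: always prefers whatever comes first
biased_judge = lambda p, a, b: True
print(pairwise_with_swap(biased_judge, "q", "short", "much longer answer"))  # None
```

A position-biased judge picks the first slot in both runs, so its two verdicts agree with each other and the pair is discarded rather than scored.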

3. Multi-Step Faithfulness Pipeline

For cases where you need detailed per-claim analysis:

```python
result = judge_faithfulness(provider,
    context="Source text...",
    output="Agent response...",
    mode="thorough",  # 3-step: extract claims → verify each → aggregate
)
# Each claim classified as: supported / contradicted / fabricated / idk
```
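As a sketch of the final aggregation step, here is one way per-claim labels could roll up into a pass/fail verdict. This is my illustration under an assumed policy (a response passes only when no claim is contradicted or fabricated), not the library's actual logic:

```python
from collections import Counter

def aggregate_claims(labels: list[str], idk_counts_against: bool = False) -> bool:
    """Roll per-claim labels up into an overall pass/fail.
    Labels: supported / contradicted / fabricated / idk."""
    counts = Counter(labels)
    bad = counts["contradicted"] + counts["fabricated"]
    if idk_counts_against:
        bad += counts["idk"]  # strict mode: unverifiable claims also fail
    return bad == 0

print(aggregate_claims(["supported", "supported", "idk"]))  # True
print(aggregate_claims(["supported", "fabricated"]))        # False
```

Whether `idk` claims count against the response is a policy choice: strict for safety-critical use, lenient for chatty assistants.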

Benchmark Results

We tested against two standard academic benchmarks:

FaithBench (Hallucination Detection)

FaithBench (NAACL 2025) contains 750 human-annotated summarization hallucinations — deliberately hard cases where existing detectors disagree.

| Judge Model | Accuracy | Cohen's κ |
| --- | --- | --- |
| Claude Sonnet 4.6 | 83% | 0.68 |
| GPT-5.2 | 77% | 0.55 |
| Grok 4.1 Fast | 70% | 0.29 |
| DeepSeek v3.2 | 70% | 0.31 |

κ=0.68 falls in the "substantial agreement" band with human annotators.
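For intuition, Cohen's κ corrects raw accuracy for agreement that would happen by chance. A quick worked computation (the confusion matrix below is made up for illustration, not the actual FaithBench numbers):

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a binary judge scored against human labels."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n  # observed agreement (= raw accuracy)
    # Chance agreement: product of the two raters' marginal frequencies
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    p_e = p_yes + p_no
    return (p_o - p_e) / (1 - p_e)

# Hypothetical matrix with 83% raw agreement
print(round(cohens_kappa(tp=42, fp=8, fn=9, tn=41), 2))  # → 0.66
```

Note how 83% raw accuracy shrinks to κ≈0.66 once chance agreement is removed; that gap is why κ, not accuracy, is the headline metric on FaithBench.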

JudgeBench (Pairwise Comparison)

JudgeBench (ICLR 2025) tests pairwise judgment on objectively verifiable tasks. We report position-consistent accuracy — correct in both A/B orderings.

| Judge Model | PC Accuracy | Consistency Rate |
| --- | --- | --- |
| GPT-5.2 | 100% | 89% |
| Claude Sonnet 4.6 | 91% | 77% |
| Grok 4.1 Fast | 80% | 17% |

GPT-5.2 got every position-consistent judgment correct. Grok shows severe position bias (83% inconsistent).
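Concretely, the two metrics in that table can be computed from paired (original + swapped) runs like this. The tuple layout is my own illustration of the metric definitions, not the benchmark harness code:

```python
def pc_metrics(runs):
    """runs: list of (verdict_ab, verdict_ba, gold) tuples, where each
    verdict is True if the first-listed response was preferred, and
    gold is True if A is objectively better.
    A pair is consistent when the two verdicts disagree after the swap."""
    consistent = [(ab, gold) for ab, ba, gold in runs if ab != ba]
    consistency_rate = len(consistent) / len(runs)
    # Position-consistent accuracy is scored over consistent pairs only
    pc_accuracy = sum(ab == gold for ab, gold in consistent) / len(consistent)
    return pc_accuracy, consistency_rate

runs = [
    (True, False, True),   # consistent, correct
    (True, False, False),  # consistent, wrong
    (True, True, True),    # inconsistent: picked the first slot both times
]
print(pc_metrics(runs))  # → (0.5, 0.6666666666666666)
```

This is why a judge can post a high PC accuracy while still being unreliable: Grok's 80% is computed over only the 17% of pairs where it didn't flip with position.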

Why Zero Dependencies Matters

| | DeepEval | Ragas | agent-eval-lite |
| --- | --- | --- | --- |
| Dependencies | 40+ | langchain ecosystem | 0 |
| Install time | Minutes | Minutes | Seconds |
| CI/CD friendly | Heavy | Heavy | Lightweight |
| Judge cost tracking | No | No | Yes |
| Multi-model jury | No | No | Yes |

For CI pipelines, Docker images, and edge deployments, zero dependencies means faster builds, fewer conflicts, and smaller attack surface.

Get Started

```shell
pip install agent-eval-lite
```
```python
from agent_eval import JudgeProvider, judge_faithfulness

provider = JudgeProvider(
    api_key="your-key",
    model="gpt-4o",  # Any OpenAI-compatible API
)

result = judge_faithfulness(provider,
    context="The API returned: temp=72°F, condition=sunny",
    output="It's 72°F and sunny, with heavy rain expected.",
)
print(result.passed)              # False (fabricated rain)
print(result.unsupported_claims)  # ["heavy rain expected"]
```

GitHub: xiaona-ai/agent-eval
PyPI: agent-eval-lite

183 tests. Zero dependencies. Paper-level benchmarks.
