Human evaluation is the gold standard for LLM output quality. It is also the bottleneck that kills every scaling plan.
One human reviewer processes 50-100 examples per hour. A single model comparison across 1,000 test cases takes 10-20 hours of human labor. Run that across 5 metrics and 3 model candidates, and you are looking at weeks of work before you ship anything.
LLM-as-a-Judge solves this. You use a capable model to evaluate the outputs of another model — scoring relevance, faithfulness, coherence, or any custom criteria you define. Research shows well-configured LLM judges achieve roughly 85% agreement with human reviewers — higher than the typical 81% agreement rate between two human raters on the same task. Not perfect. But 1,000x faster and consistent enough to catch regressions before humans need to look.
Here are 3 patterns for implementing LLM-as-a-Judge in Python, from raw API calls to production-grade frameworks.
## Pattern 1: Raw LLM-as-a-Judge With the OpenAI SDK
Before reaching for a framework, understand the core mechanism. LLM-as-a-Judge is a structured prompt that asks one model to score another model's output.
```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class EvalResult(BaseModel):
    score: int
    reasoning: str


def judge_output(
    question: str,
    answer: str,
    criteria: str = "relevance and accuracy",
) -> EvalResult:
    """Use an LLM to evaluate another LLM's output."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        response_format=EvalResult,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert evaluator. Score the answer "
                    "on a scale of 1-10 based on the given criteria. "
                    "Provide chain-of-thought reasoning before scoring."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Criteria: {criteria}\n\n"
                    f"Question: {question}\n\n"
                    f"Answer: {answer}\n\n"
                    "Evaluate this answer. Return your reasoning "
                    "and a score from 1-10."
                ),
            },
        ],
    )
    return response.choices[0].message.parsed
```
Use it like this:
```python
result = judge_output(
    question="What causes a Python deadlock?",
    answer="A deadlock occurs when two threads each hold a lock the other needs.",
    criteria="technical accuracy and completeness",
)

print(f"Score: {result.score}/10")
print(f"Reasoning: {result.reasoning}")
```
This is the foundation. Every framework builds on this exact pattern: structured prompt, scoring rubric, chain-of-thought reasoning.
Three things make this raw approach work:
- Structured output — Pydantic enforces the response schema. No regex parsing.
- Chain-of-thought — The judge reasons before scoring. This reduces score variance by forcing the model to justify its decision.
- Explicit criteria — The rubric tells the judge what to measure. Vague criteria produce vague scores.
The limitation: you build everything yourself. Threshold logic, test orchestration, batch evaluation, metric aggregation — all manual. That is where frameworks help.
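As an illustration of the kind of glue you end up writing, here is a minimal sketch of the batch-aggregation and threshold layer. The `summarize_scores` helper and `BatchSummary` type are hypothetical names invented for this example, not part of any library:

```python
from dataclasses import dataclass


@dataclass
class BatchSummary:
    mean_score: float
    pass_rate: float
    failures: list[int]  # indices of test cases below threshold


def summarize_scores(scores: list[int], threshold: int = 7) -> BatchSummary:
    """Aggregate judge scores (1-10) into pass/fail stats for a regression gate."""
    failures = [i for i, s in enumerate(scores) if s < threshold]
    return BatchSummary(
        mean_score=sum(scores) / len(scores),
        pass_rate=1 - len(failures) / len(scores),
        failures=failures,
    )
```

You would feed this the `score` fields from a list of `judge_output` results, then fail the build when `pass_rate` drops below your target.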
## Pattern 2: DeepEval's GEval for Custom Metrics
DeepEval (v3.8+, as of March 2026) implements LLM-as-a-Judge through GEval — a metric class that generates evaluation steps from natural language criteria, then scores outputs using chain-of-thought.
Install it:
```bash
pip install -U deepeval
```
Set your API key (DeepEval uses OpenAI models as the default judge):
```bash
export OPENAI_API_KEY="your_api_key"
```
Build a custom coherence metric:
```python
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

coherence_metric = GEval(
    name="Coherence",
    criteria=(
        "Coherence - the collective quality of all sentences "
        "in the actual output. Sentences should flow logically, "
        "maintain consistent terminology, and build on each other."
    ),
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

test_case = LLMTestCase(
    input="Explain gradient descent in simple terms.",
    actual_output=(
        "Gradient descent is an optimization algorithm. "
        "It finds the minimum of a function by iteratively "
        "moving in the direction of steepest descent. "
        "Think of it as a ball rolling downhill — it naturally "
        "settles at the lowest point."
    ),
)

coherence_metric.measure(test_case)
print(f"Score: {coherence_metric.score}")
print(f"Reason: {coherence_metric.reason}")
```
GEval does three things behind the scenes:
- Converts your `criteria` string into numbered evaluation steps using chain-of-thought prompting.
- Runs those steps against the test case.
- Returns a normalized score (0-1) and a natural language reason.
The `threshold` parameter sets the minimum passing score: any test case scoring below 0.7 fails. This is useful for CI pipelines where you want hard pass/fail gates.
### Combining Multiple Metrics
Real evaluation needs multiple dimensions. Score relevance, faithfulness, and coherence together:
```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.7)

test_case = LLMTestCase(
    input="What are the side effects of gradient clipping?",
    actual_output=(
        "Gradient clipping prevents exploding gradients by capping "
        "the gradient norm. Side effects include slower convergence "
        "when the clip threshold is too aggressive, and potential "
        "loss of gradient direction information."
    ),
    retrieval_context=[
        "Gradient clipping caps gradient norms to prevent exploding "
        "gradients. Setting the threshold too low can slow convergence. "
        "Clipping by norm preserves direction better than clipping by value."
    ],
)

results = evaluate(
    test_cases=[test_case],
    metrics=[relevancy, faithfulness, coherence_metric],
)
```
`AnswerRelevancyMetric` checks whether the output actually answers the question. It needs `input` and `actual_output` in the test case.
`FaithfulnessMetric` checks whether the output is grounded in the provided context — critical for RAG systems. It requires `retrieval_context` as a list of strings.
The `evaluate()` function runs all metrics against all test cases and returns a structured results object. Run this in CI with `deepeval test run test_eval.py` and you get pass/fail status on every commit.
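The hard-gate semantics are simple enough to sketch without the framework. This hypothetical `ci_gate` helper (invented for illustration, not a DeepEval API) shows the logic a per-metric threshold gate implements:

```python
def ci_gate(
    metric_scores: dict[str, float],
    thresholds: dict[str, float],
    default_threshold: float = 0.7,
) -> tuple[bool, list[str]]:
    """Return (overall_pass, names_of_failing_metrics) for a set of metric scores."""
    failing = [
        name
        for name, score in metric_scores.items()
        if score < thresholds.get(name, default_threshold)
    ]
    return (len(failing) == 0, failing)
```

In a real pipeline you would raise or exit nonzero when the first element is `False`, which is exactly what a framework's pass/fail assertion does for you.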
## Pattern 3: Pairwise Comparison — Which Output Is Better?
Single-score evaluation has a known weakness: score drift. A judge model might score "7/10" differently across runs. Pairwise comparison eliminates this by asking a simpler question — "Which output is better?"
```python
from openai import OpenAI
from pydantic import BaseModel, Field
from enum import Enum

client = OpenAI()


class Winner(str, Enum):
    A = "A"
    B = "B"
    TIE = "TIE"


class PairwiseResult(BaseModel):
    winner: Winner
    reasoning: str
    confidence: float = Field(ge=0.0, le=1.0)


def compare_outputs(
    question: str,
    output_a: str,
    output_b: str,
    criteria: str = "accuracy, completeness, and clarity",
) -> PairwiseResult:
    """Compare two LLM outputs and pick the better one."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        response_format=PairwiseResult,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert evaluator comparing two answers. "
                    "Evaluate based on the given criteria. Be specific "
                    "about WHY one answer is better. If both are equally "
                    "good, say TIE."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Criteria: {criteria}\n\n"
                    f"Question: {question}\n\n"
                    f"Answer A: {output_a}\n\n"
                    f"Answer B: {output_b}\n\n"
                    "Which answer is better? Return the winner, "
                    "reasoning, and your confidence level (0-1)."
                ),
            },
        ],
    )
    return response.choices[0].message.parsed
```
Use pairwise comparison to evaluate model upgrades:
```python
result = compare_outputs(
    question="How does backpropagation work?",
    output_a="Backpropagation computes gradients using the chain rule.",
    output_b=(
        "Backpropagation computes gradients of the loss function "
        "with respect to each weight by applying the chain rule "
        "backwards through the network layers. Each layer's gradient "
        "depends on the gradient of the layer above it, propagated "
        "through the activation function's derivative."
    ),
    criteria="technical depth and educational value",
)

print(f"Winner: {result.winner}")
print(f"Confidence: {result.confidence}")
print(f"Why: {result.reasoning}")
```
Pairwise comparison is how model leaderboards work. Chatbot Arena uses this exact approach with human judges. Replacing humans with LLM judges gives you the same ranking signal at a fraction of the cost.
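To turn pairwise winners into a ranking, you can apply the textbook Elo update (the same family of rating system leaderboards use; this is the standard formula, not any leaderboard's exact implementation):

```python
def elo_update(
    rating_a: float,
    rating_b: float,
    score_a: float,
    k: float = 32.0,
) -> tuple[float, float]:
    """One Elo update. score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    # Expected win probability for A given the current rating gap
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Run every judged comparison through this update and the final ratings give you a leaderboard over your model candidates.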
### Mitigating Position Bias
LLM judges tend to prefer the first answer they see. This is called position bias. Fix it by running each comparison twice with swapped positions:
```python
def compare_with_debiasing(
    question: str,
    output_a: str,
    output_b: str,
    criteria: str = "accuracy, completeness, and clarity",
) -> PairwiseResult:
    """Run pairwise comparison twice with swapped order."""
    result_ab = compare_outputs(question, output_a, output_b, criteria)
    result_ba = compare_outputs(question, output_b, output_a, criteria)

    # If both orderings agree on the same winner, the result is reliable
    if result_ab.winner == Winner.A and result_ba.winner == Winner.B:
        return result_ab  # Both say output_a is better
    if result_ab.winner == Winner.B and result_ba.winner == Winner.A:
        return result_ab  # Both say output_b is better

    # Disagreement — call it a tie
    return PairwiseResult(
        winner=Winner.TIE,
        reasoning="Position bias detected: results flipped with order.",
        confidence=0.5,
    )
When the judge picks A in one ordering and B in the other, the comparison is unreliable. Defaulting to TIE prevents position bias from contaminating your results. This adds one extra API call per comparison — a small cost for eliminating a systematic error.
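Across a full test set, the debiased winners aggregate into per-model win rates. A minimal sketch, assuming you collect each comparison's winner as a string (`"A"`, `"B"`, or `"TIE"`, with ties split evenly):

```python
from collections import Counter


def win_rates(winners: list[str]) -> dict[str, float]:
    """Fraction of comparisons won by each side; ties count half for each."""
    counts = Counter(winners)
    n = len(winners)
    ties = counts.get("TIE", 0)
    return {
        "A": (counts.get("A", 0) + ties / 2) / n,
        "B": (counts.get("B", 0) + ties / 2) / n,
    }
```

A win rate meaningfully above 0.5 for the candidate model is the signal you are looking for when deciding on an upgrade.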
## When to Use Each Pattern
| Pattern | Best For | Trade-Off |
|---|---|---|
| Raw LLM-as-a-Judge | Quick prototypes, custom criteria | You build the infrastructure |
| DeepEval GEval | CI pipelines, regression testing | Requires OpenAI API key for the judge |
| Pairwise comparison | Model selection, A/B testing | 2x API cost (debiasing), no absolute score |
The three-layer stack that works in production:
- DeepEval in CI — Run `AnswerRelevancyMetric` and `FaithfulnessMetric` on every commit. Catch regressions automatically.
- Pairwise comparison for model upgrades — When evaluating a new model, run debiased pairwise comparison against your current model on 200-500 representative examples.
- Human review for edge cases — Sample 5-10% of LLM-judged results for human validation. Track judge-human agreement over time. If agreement drops below 75%, recalibrate your rubrics.
LLM-as-a-Judge does not replace human evaluation. It replaces the 90% of human evaluation that is repetitive scoring against known rubrics. The remaining 10% — ambiguous cases, novel failure modes, ethical edge cases — still needs a human.
## Key Takeaways
LLM-as-a-Judge works because classifying content is simpler than generating it. A model that struggles to write a perfect explanation can still tell you which of two explanations is better.
Start with Pattern 1 to understand the mechanics. Move to Pattern 2 when you need CI integration. Use Pattern 3 when comparing models or prompts.
The metric that matters most: judge-human agreement rate. Measure it. If your LLM judge agrees with human reviewers less than 75% of the time on your specific task, your rubric needs work — not your judge model.
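Measuring it is a one-liner once you have paired labels. A sketch, assuming you store the judge's and the human's verdict for each sampled example:

```python
def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of sampled examples where the LLM judge and human reviewer agree."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

Track this number over time; a sustained drop below your target (75% here) is the trigger to revisit the rubric.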
Follow @klement_gunndu for more machine learning content. We're building in public.