klement Gunndu

LLM-as-a-Judge: Evaluate Your Models Without Human Reviewers

Human evaluation is the gold standard for LLM output quality. It is also the bottleneck that kills every scaling plan.

One human reviewer processes 50-100 examples per hour. A single model comparison across 1,000 test cases takes 10-20 hours of human labor. Run that across 5 metrics and 3 model candidates, and you are looking at weeks of work before you ship anything.
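The back-of-envelope math is worth making explicit:

```python
# Reviewer throughput and workload from the numbers above
cases = 1000                    # test cases per model comparison
rate_low, rate_high = 50, 100   # examples one reviewer scores per hour
metrics, models = 5, 3

evaluations = cases * metrics * models   # total scores needed
hours_best = evaluations / rate_high
hours_worst = evaluations / rate_low
print(f"{evaluations} evaluations = {hours_best:.0f}-{hours_worst:.0f} reviewer-hours")
```

At 40 reviewer-hours per week, that is roughly four to seven weeks of full-time scoring.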

LLM-as-a-Judge solves this. You use a capable model to evaluate the outputs of another model — scoring relevance, faithfulness, coherence, or any custom criteria you define. Research on MT-Bench and Chatbot Arena found that well-configured LLM judges reach roughly 85% agreement with human reviewers — higher than the typical 81% agreement between two human raters on the same task. Not perfect. But 1,000x faster and consistent enough to catch regressions before humans need to look.

Here are 3 patterns for implementing LLM-as-a-Judge in Python, from raw API calls to production-grade frameworks.

Pattern 1: Raw LLM-as-a-Judge With the OpenAI SDK

Before reaching for a framework, understand the core mechanism. LLM-as-a-Judge is a structured prompt that asks one model to score another model's output.

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class EvalResult(BaseModel):
    score: int
    reasoning: str

def judge_output(
    question: str,
    answer: str,
    criteria: str = "relevance and accuracy",
) -> EvalResult:
    """Use an LLM to evaluate another LLM's output."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        response_format=EvalResult,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert evaluator. Score the answer "
                    "on a scale of 1-10 based on the given criteria. "
                    "Provide chain-of-thought reasoning before scoring."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Criteria: {criteria}\n\n"
                    f"Question: {question}\n\n"
                    f"Answer: {answer}\n\n"
                    "Evaluate this answer. Return your reasoning "
                    "and a score from 1-10."
                ),
            },
        ],
    )
    return response.choices[0].message.parsed

Use it like this:

result = judge_output(
    question="What causes a Python deadlock?",
    answer="A deadlock occurs when two threads each hold a lock the other needs.",
    criteria="technical accuracy and completeness",
)
print(f"Score: {result.score}/10")
print(f"Reasoning: {result.reasoning}")

This is the foundation. Every framework builds on this exact pattern: structured prompt, scoring rubric, chain-of-thought reasoning.

Three things make this raw approach work:

  1. Structured output — Pydantic enforces the response schema. No regex parsing.
  2. Chain-of-thought — The judge reasons before scoring. This reduces score variance by forcing the model to justify its decision.
  3. Explicit criteria — The rubric tells the judge what to measure. Vague criteria produce vague scores.

The limitation: you build everything yourself. Threshold logic, test orchestration, batch evaluation, metric aggregation — all manual. That is where frameworks help.
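As a sketch of what "build everything yourself" means, here is a minimal batch runner with a pass/fail threshold. The names run_eval_suite and the plain EvalResult dataclass are hypothetical, not part of any library; in practice the judge argument would wrap judge_output from above.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    score: int       # 1-10, as returned by the judge
    reasoning: str

def run_eval_suite(
    cases: list[tuple[str, str]],             # (question, answer) pairs
    judge: Callable[[str, str], EvalResult],  # e.g. a wrapper around judge_output
    threshold: int = 7,                       # minimum passing score
) -> dict:
    """Score every case, aggregate, and gate pass/fail on a threshold."""
    results = [judge(question, answer) for question, answer in cases]
    failing = [r for r in results if r.score < threshold]
    return {
        "mean_score": sum(r.score for r in results) / len(results),
        "pass_rate": 1 - len(failing) / len(results),
        "passed": not failing,
        "failures": failing,
    }
```

Even this toy version has to make decisions a framework makes for you: how to aggregate, what counts as passing, and what to report on failure.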

Pattern 2: DeepEval's GEval for Custom Metrics

DeepEval (v3.8+, as of March 2026) implements LLM-as-a-Judge through GEval — a metric class that generates evaluation steps from natural language criteria, then scores outputs using chain-of-thought.

Install it:

pip install -U deepeval

Set your API key (DeepEval uses OpenAI models as the default judge):

export OPENAI_API_KEY="your_api_key"

Build a custom coherence metric:

from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

coherence_metric = GEval(
    name="Coherence",
    criteria=(
        "Coherence - the collective quality of all sentences "
        "in the actual output. Sentences should flow logically, "
        "maintain consistent terminology, and build on each other."
    ),
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

test_case = LLMTestCase(
    input="Explain gradient descent in simple terms.",
    actual_output=(
        "Gradient descent is an optimization algorithm. "
        "It finds the minimum of a function by iteratively "
        "moving in the direction of steepest descent. "
        "Think of it as a ball rolling downhill — it naturally "
        "settles at the lowest point."
    ),
)

coherence_metric.measure(test_case)
print(f"Score: {coherence_metric.score}")
print(f"Reason: {coherence_metric.reason}")

GEval does three things behind the scenes:

  1. Converts your criteria string into numbered evaluation steps using chain-of-thought prompting.
  2. Runs those steps against the test case.
  3. Returns a normalized score (0-1) and a natural language reason.

The threshold parameter sets the minimum passing score. Below 0.7 and the test case fails — useful for CI pipelines where you want hard pass/fail gates.

Combining Multiple Metrics

Real evaluation needs multiple dimensions. Score relevance, faithfulness, and coherence together:

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.7)

test_case = LLMTestCase(
    input="What are the side effects of gradient clipping?",
    actual_output=(
        "Gradient clipping prevents exploding gradients by capping "
        "the gradient norm. Side effects include slower convergence "
        "when the clip threshold is too aggressive, and potential "
        "loss of gradient direction information."
    ),
    retrieval_context=[
        "Gradient clipping caps gradient norms to prevent exploding "
        "gradients. Setting the threshold too low can slow convergence. "
        "Clipping by norm preserves direction better than clipping by value."
    ],
)

results = evaluate(
    test_cases=[test_case],
    metrics=[relevancy, faithfulness, coherence_metric],
)

AnswerRelevancyMetric checks whether the output actually answers the question. It needs input and actual_output in the test case.

FaithfulnessMetric checks whether the output is grounded in the provided context — critical for RAG systems. It requires retrieval_context as a list of strings.

The evaluate() function runs all metrics against all test cases and returns a structured results object. Run this in CI with deepeval test run test_eval.py and you get pass/fail status on every commit.

Pattern 3: Pairwise Comparison — Which Output Is Better?

Single-score evaluation has a known weakness: score drift. A judge model might score "7/10" differently across runs. Pairwise comparison eliminates this by asking a simpler question — "Which output is better?"

from openai import OpenAI
from pydantic import BaseModel, Field
from enum import Enum

client = OpenAI()

class Winner(str, Enum):
    A = "A"
    B = "B"
    TIE = "TIE"

class PairwiseResult(BaseModel):
    winner: Winner
    reasoning: str
    confidence: float = Field(ge=0.0, le=1.0)

def compare_outputs(
    question: str,
    output_a: str,
    output_b: str,
    criteria: str = "accuracy, completeness, and clarity",
) -> PairwiseResult:
    """Compare two LLM outputs and pick the better one."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        response_format=PairwiseResult,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert evaluator comparing two answers. "
                    "Evaluate based on the given criteria. Be specific "
                    "about WHY one answer is better. If both are equally "
                    "good, say TIE."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Criteria: {criteria}\n\n"
                    f"Question: {question}\n\n"
                    f"Answer A: {output_a}\n\n"
                    f"Answer B: {output_b}\n\n"
                    "Which answer is better? Return the winner, "
                    "reasoning, and your confidence level (0-1)."
                ),
            },
        ],
    )
    return response.choices[0].message.parsed

Use pairwise comparison to evaluate model upgrades:

result = compare_outputs(
    question="How does backpropagation work?",
    output_a="Backpropagation computes gradients using the chain rule.",
    output_b=(
        "Backpropagation computes gradients of the loss function "
        "with respect to each weight by applying the chain rule "
        "backwards through the network layers. Each layer's gradient "
        "depends on the gradient of the layer above it, propagated "
        "through the activation function's derivative."
    ),
    criteria="technical depth and educational value",
)
print(f"Winner: {result.winner}")
print(f"Confidence: {result.confidence}")
print(f"Why: {result.reasoning}")

Pairwise comparison is how model leaderboards work. Chatbot Arena uses this exact approach with human judges. Replacing humans with LLM judges gives you the same ranking signal at a fraction of the cost.
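Leaderboards turn those pairwise verdicts into ratings. A minimal sketch of the standard Elo update rule (Chatbot Arena has since moved to a Bradley-Terry fit, but the idea is the same):

```python
def elo_update(
    rating_a: float,
    rating_b: float,
    score_a: float,   # 1.0 = A wins, 0.0 = B wins, 0.5 = tie
    k: float = 32.0,  # update step size
) -> tuple[float, float]:
    """Apply one Elo rating update from a single pairwise verdict."""
    # Expected score of A given the current rating gap
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```

Feed each comparison result in as 1.0, 0.0, or 0.5 and the ratings converge toward a stable ranking after a few hundred comparisons.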

Mitigating Position Bias

LLM judges tend to prefer the first answer they see. This is called position bias. Fix it by running each comparison twice with swapped positions:

def compare_with_debiasing(
    question: str,
    output_a: str,
    output_b: str,
    criteria: str = "accuracy, completeness, and clarity",
) -> PairwiseResult:
    """Run pairwise comparison twice with swapped order."""
    result_ab = compare_outputs(question, output_a, output_b, criteria)
    result_ba = compare_outputs(question, output_b, output_a, criteria)

    # If both agree on the same winner, the result is reliable
    if result_ab.winner == Winner.A and result_ba.winner == Winner.B:
        return result_ab  # Both say output_a is better
    if result_ab.winner == Winner.B and result_ba.winner == Winner.A:
        return result_ab  # Both say output_b is better

    # Disagreement — call it a tie
    return PairwiseResult(
        winner=Winner.TIE,
        reasoning="Position bias detected: results flipped with order.",
        confidence=0.5,
    )

When the judge picks A in one ordering and B in the other, the comparison is unreliable. Defaulting to TIE prevents position bias from contaminating your results. This adds one extra API call per comparison — a small cost for eliminating a systematic error.
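Across a full test set, you then reduce the debiased verdicts to win rates. A hypothetical aggregator, counting each tie as half a win for both sides:

```python
from collections import Counter

def aggregate_wins(winners: list[str]) -> dict[str, float]:
    """Reduce debiased verdicts ('A', 'B', 'TIE') to win rates, splitting ties."""
    tally = Counter(winners)
    total = len(winners)
    return {
        "A": (tally["A"] + tally["TIE"] / 2) / total,
        "B": (tally["B"] + tally["TIE"] / 2) / total,
    }
```

A win rate meaningfully above 0.5 on a few hundred representative examples is the signal that the candidate model is actually better.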

When to Use Each Pattern

| Pattern | Best For | Trade-Off |
| --- | --- | --- |
| Raw LLM-as-a-Judge | Quick prototypes, custom criteria | You build the infrastructure |
| DeepEval GEval | CI pipelines, regression testing | Requires an OpenAI API key for the judge |
| Pairwise comparison | Model selection, A/B testing | 2x API cost (debiasing), no absolute score |

The three-layer stack that works in production:

  1. DeepEval in CI — Run AnswerRelevancyMetric and FaithfulnessMetric on every commit. Catch regressions automatically.
  2. Pairwise comparison for model upgrades — When evaluating a new model, run debiased pairwise comparison against your current model on 200-500 representative examples.
  3. Human review for edge cases — Sample 5-10% of LLM-judged results for human validation. Track judge-human agreement over time. If agreement drops below 75%, recalibrate your rubrics.

LLM-as-a-Judge does not replace human evaluation. It replaces the 90% of human evaluation that is repetitive scoring against known rubrics. The remaining 10% — ambiguous cases, novel failure modes, ethical edge cases — still needs a human.

Key Takeaways

LLM-as-a-Judge works because classifying content is simpler than generating it. A model that struggles to write a perfect explanation can still tell you which of two explanations is better.

Start with Pattern 1 to understand the mechanics. Move to Pattern 2 when you need CI integration. Use Pattern 3 when comparing models or prompts.

The metric that matters most: judge-human agreement rate. Measure it. If your LLM judge agrees with human reviewers less than 75% of the time on your specific task, your rubric needs work — not your judge model.


Follow @klement_gunndu for more machine learning content. We're building in public.
