
Roman Belov

Posted on • Originally published at futurecraft.pro

LLM-as-Judge: Automated Quality Gate for LLM Outputs in Production

LLM-as-Judge is a pattern where one language model evaluates another model's outputs against defined criteria. It acts as an automated quality gate: every response is checked before it reaches the user, or afterwards for monitoring. Standard production monitoring metrics (200 OK, latency 340 ms, rate limits within bounds) are useless for assessing quality: the model can hallucinate in 15% of responses while HTTP status codes tell you nothing about it.

Manual review doesn't scale. One person can handle 100 requests a day. At 10,000, nobody can. And quality degradation usually hits at scale: after a prompt update, a model swap, or a silent change on the provider side.

This article covers how LLM-as-Judge works, which metrics to evaluate, and how to plug it into a production pipeline.

How LLM-as-Judge works and its limitations

The judge model receives a prompt with instructions plus the text being evaluated, then returns a score: a number, a category, or structured JSON. The judge doesn't generate content. It classifies and scores. Models handle this more consistently than generation.

```
User: "Recommend cafes in downtown Moscow"
          |
          v
+--------------------+
|   LLM Generator    | -> "Here are 5 cafes: Coffeemania near Patriarshiye..."
|   (GPT-4o-mini)    |
+--------------------+
          |
          v
+--------------------+
|   LLM Judge        | -> { relevance: 0.9, factuality: 0.7,
|   (Claude Sonnet)  |     toxicity: 0.0, completeness: 0.8 }
+--------------------+
          |
          v
   Score < threshold? -> Alert / Block / Log
```

Research by Zheng et al. (2023, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena") showed that GPT-4 as a judge agreed with human ratings in 80%+ of cases. Two human annotators agreed with each other about 81% of the time. The gap between an LLM judge and a human is roughly the same as the gap between two humans.

LLM output quality metrics: what to evaluate

Metric choice depends on the task. Main categories below.

Metrics for RAG systems

| Metric | What it checks | When you need it |
|---|---|---|
| Faithfulness | Response is grounded in context, no fabricated facts | Always for RAG |
| Answer Relevance | Response matches the question | Always |
| Context Relevance | Retriever returned relevant documents | Debugging retrieval |

Metrics for generative tasks

| Metric | What it checks | When you need it |
|---|---|---|
| Correctness | Factual accuracy | When a reference answer exists |
| Completeness | Response covers all aspects of the query | Complex queries |
| Toxicity | No insults, harmful content | User-facing products |
| Hallucination | Model doesn't fabricate facts | Always |

Metrics for agent pipelines

| Metric | What it checks | When you need it |
|---|---|---|
| Tool Use Correctness | Right tool with right arguments | Agent pipelines |
| Task Completion | End result solves the task | Always for agents |

In practice, start with two or three metrics. For RAG: faithfulness + answer relevance. For a chatbot: relevance + toxicity. For an agent: task completion. Add more as you find specific problems.

Judge prompt structure for evaluation

Evaluation quality comes down to the prompt. A working template for faithfulness:

```python
FAITHFULNESS_JUDGE_PROMPT = """You are an impartial judge evaluating the faithfulness
of an AI assistant's response.

Faithfulness means: every claim in the response is supported by the provided context.
Claims not found in context = unfaithful.

## Input
**User Question:** {question}
**Retrieved Context:** {context}
**AI Response:** {response}

## Task
1. Extract each factual claim from the AI Response
2. For each claim, check if it is supported by the Retrieved Context
3. A claim is SUPPORTED if the context contains evidence for it
4. A claim is UNSUPPORTED if the context does not mention it or contradicts it

## Output (JSON only)
{{
  "claims": [
    {{"claim": "...", "supported": true/false, "evidence": "..."}}
  ],
  "score": <float 0.0-1.0, ratio of supported claims to total claims>,
  "reasoning": "<one sentence summary>"
}}"""
```

What makes this work:

Specific criteria. "Rate the response quality" doesn't work. "Check that every fact is backed by context" works. The more specific the instruction, the more stable the scores.

Chain-of-thought. The model first extracts claims, checks each one, then assigns a score. Without intermediate steps, scores are unstable.

Structured output. JSON with a fixed schema, score from 0 to 1, reasoning in one sentence. This makes parsing and aggregation straightforward.
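A fixed schema also makes the judge output checkable before you trust it. A minimal stdlib sketch (the `parse_judge_output` helper is illustrative, not part of any framework): parse the JSON, verify the required keys and the score range, and reject anything else before it enters your metrics.

```python
import json

def parse_judge_output(raw: str) -> dict:
    """Parse judge JSON and validate it against the prompt's fixed schema.

    Raises ValueError on any schema violation, so a malformed judge
    response never silently pollutes aggregated scores.
    """
    result = json.loads(raw)
    # Required top-level keys from the prompt's output schema
    for key in ("claims", "score", "reasoning"):
        if key not in result:
            raise ValueError(f"missing key: {key}")
    if not 0.0 <= result["score"] <= 1.0:
        raise ValueError(f"score out of range: {result['score']}")
    for claim in result["claims"]:
        if not isinstance(claim.get("supported"), bool):
            raise ValueError("claim missing boolean 'supported' field")
    return result

raw = '{"claims": [{"claim": "X", "supported": true, "evidence": "..."}], "score": 1.0, "reasoning": "ok"}'
parsed = parse_judge_output(raw)  # score: 1.0
```

Rejecting invalid output loudly, instead of defaulting to some score, matters: a judge that starts returning malformed JSON is itself a signal that something changed.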

Implementing LLM-as-Judge: three approaches

1. Python + LLM API

Minimal implementation, no frameworks:

```python
import json
from litellm import completion

def evaluate_faithfulness(question: str, context: str, response: str) -> dict:
    judge_response = completion(
        model="anthropic/claude-sonnet-4-20250514",
        messages=[{
            "role": "user",
            "content": FAITHFULNESS_JUDGE_PROMPT.format(
                question=question,
                context=context,
                response=response,
            )
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )

    result = json.loads(judge_response.choices[0].message.content)
    return result

eval_result = evaluate_faithfulness(
    question="Which cafes are in downtown Moscow?",
    context="Coffeemania: Patriarshiye Prudy. Syostry: Pokrovka 6.",
    response="I recommend Coffeemania at Patriarshiye and Pushkin on Tverskoy Boulevard.",
)
# score: 0.5 (Coffeemania confirmed, Pushkin is not)
```

Pros: full control, minimal dependencies. Cons: you write every metric yourself, no batch processing. If you work with multiple LLM providers, litellm lets you switch between them through a single interface — more on this in the article about multi-provider LLM architecture.

2. DeepEval

Open-source framework with built-in metrics. Works like pytest for LLM outputs.

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    HallucinationMetric,
)

faithfulness = FaithfulnessMetric(threshold=0.7, model="gpt-4o")
relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o")
hallucination = HallucinationMetric(threshold=0.5, model="gpt-4o")

test_case = LLMTestCase(
    input="Which cafes are in downtown Moscow?",
    actual_output="I recommend Coffeemania at Patriarshiye...",
    retrieval_context=["Coffeemania: Patriarshiye Prudy. Syostry: Pokrovka 6."],
)

results = evaluate([test_case], [faithfulness, relevancy, hallucination])
```

14+ built-in metrics, pytest integration. LLM quality tests run alongside unit tests:

```python
# test_llm_quality.py
from deepeval import assert_test
from deepeval.test_case import LLMTestCase

def test_travel_recommendations():
    test_case = LLMTestCase(
        input="Cafes in Moscow",
        actual_output=run_my_pipeline("Cafes in Moscow"),
        retrieval_context=get_retrieved_docs("Cafes in Moscow"),
    )
    assert_test(test_case, [faithfulness, relevancy])
```

3. Langfuse Evaluations

If you already use Langfuse for tracing, evaluations plug in on top. The judge model runs against each trace and attaches a score to it. Scores can be attached to an entire trace or to individual observations. If you haven't set up an observability stack yet, start with the practical guide to LLM observability with Langfuse.

```python
langfuse.score(
    trace_id="trace-abc-123",
    name="faithfulness",
    value=0.85,
    comment="1 of 7 claims not supported by context",
)
```

For production monitoring, Langfuse fits better than DeepEval: scores are tied to real traces, visible in the dashboard, with day-over-day quality degradation charts.

Integrating LLM-as-Judge into CI/CD and production pipelines

Pre-deploy: prompt regression testing

Prompt changed? Run a dataset through the judge model before deploying. If the score falls below the threshold, the deploy is blocked.

```yaml
# .github/workflows/llm-quality.yml
name: LLM Quality Gate
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install deepeval
      - run: deepeval test run test_llm_quality.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Runtime: gate before the response

For high-stakes tasks, evaluate before sending the response:

```python
async def generate_with_quality_gate(question: str) -> str:
    # generate_response is assumed to return both the answer
    # and the context it retrieved for it
    response, retrieved_context = await generate_response(question)

    eval_result = evaluate_faithfulness(
        question=question,
        context=retrieved_context,
        response=response,
    )

    if eval_result["score"] < 0.7:
        return ("Sorry, I'm not confident in the accuracy of this answer. "
                "Try rephrasing your question.")

    return response
```

An extra LLM call per request. GPT-4o-mini as a judge costs $0.15 per million input tokens. At 10,000 requests per day with a ~500-token prompt: about $0.75/day.
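The arithmetic behind that estimate, as a quick sketch (the helper name is made up; the prices are the assumptions stated above, input tokens only):

```python
def daily_judge_cost(requests_per_day: int, tokens_per_eval: int,
                     price_per_million_tokens: float) -> float:
    """Estimated daily cost of judge calls, counting input tokens only."""
    total_tokens = requests_per_day * tokens_per_eval
    return total_tokens / 1_000_000 * price_per_million_tokens

# 10,000 requests/day, ~500-token judge prompt, GPT-4o-mini at $0.15/M input tokens
cost = daily_judge_cost(10_000, 500, 0.15)
# → 0.75 (dollars per day)
```

Output tokens and the judge's reasoning text add to this, so treat it as a lower bound.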

Post-hoc: sample-based monitoring

The most common scenario. Evaluation runs asynchronously:

```python
import random

# Pull a window of recent traces, then evaluate a random sample of them
traces = langfuse.fetch_traces(limit=1000)
sample = random.sample(traces.data, min(100, len(traces.data)))

scores = []
for trace in sample:
    result = evaluate_faithfulness(
        question=trace.input,
        context=trace.metadata.get("context", ""),
        response=trace.output,
    )
    scores.append(result["score"])
    langfuse.score(trace_id=trace.id, name="faithfulness", value=result["score"])

avg_score = sum(scores) / len(scores)
if avg_score < 0.75:
    send_alert(f"Faithfulness degraded: {avg_score:.2f}")
```

Cheaper than a runtime gate, and it still catches trends. If average faithfulness drops from 0.88 to 0.71 over a week, something broke: the prompt, the retriever, or a model update on the provider side.

LLM-as-Judge pitfalls and biases

Position bias

Judge models systematically prefer whichever answer appears first in pairwise comparisons. Zheng et al. (2023) measured a shift of up to 10-15%. Fix: run the evaluation twice with swapped order and average the results. Or use pointwise scoring instead of pairwise.
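The swap-and-average fix can be sketched as follows; `judge_pairwise` is a hypothetical callable that returns a 0-1 score for whichever answer is presented first:

```python
def debiased_pairwise_score(judge_pairwise, answer_a: str, answer_b: str) -> float:
    """Score answer_a against answer_b, averaging over both orderings
    so a constant first-position bonus cancels out."""
    score_ab = judge_pairwise(answer_a, answer_b)        # A shown first
    score_ba = 1.0 - judge_pairwise(answer_b, answer_a)  # B shown first; invert
    return (score_ab + score_ba) / 2

# A fake judge simulating pure position bias: a true 0.5 tie plus a +0.1
# bonus for whichever answer it sees first
fake_judge = lambda first, second: 0.5 + 0.1
score = debiased_pairwise_score(fake_judge, "a", "b")
# → 0.5, the bias cancels
```

Two judge calls instead of one, so this doubles evaluation cost; pointwise scoring avoids the problem entirely at the price of a less discriminative signal.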

Verbosity bias

Longer answers get higher scores, even when a shorter answer is more accurate. In the judge prompt, explicitly state "response length does not affect the score" and include an example where a short answer receives the highest mark.
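A sketch of what that instruction block might look like, appended to the judge prompt (the wording is illustrative, not a proven formula; calibrate it on your golden dataset):

```python
# Illustrative anti-verbosity addendum for the judge prompt.
VERBOSITY_GUARD = """
## Scoring rules
- Response length does NOT affect the score. Judge only accuracy and grounding.
- Example: the one-sentence answer "Coffeemania, Patriarshiye Prudy." is fully
  supported by the context and receives the highest score, 1.0.
"""
```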

Self-enhancement bias

GPT-4 gives higher scores to GPT-4 outputs. Claude prefers Claude outputs. Fix: use a judge model from a different provider than the generator. Generate with GPT-4o, evaluate with Claude Sonnet. Or the other way around. The broader question of trusting LLM outputs is a topic of its own — more on this in TruthGuard: when AI agents lie.

Cost

Every evaluation is an LLM call. A runtime gate on 10,000 requests/day means 10,000 extra calls. Options: a cheap model as judge (GPT-4o-mini, Claude Haiku), sample-based evaluation, caching scores for similar pairs.

The judge hallucinates too

A judge model can give a high score to a response full of fabricated facts, if the hallucination sounds plausible. Partial fix: chain-of-thought + structured output. There is no complete fix. This is a fundamental limitation of the approach.

Choosing the right judge model for each scenario

| Scenario | Judge model | Why |
|---|---|---|
| Pre-deploy tests | GPT-4o or Claude Sonnet 4 | Accuracy matters more than speed |
| Runtime gate | GPT-4o-mini or Claude Haiku | Cheap and fast |
| Post-hoc monitoring | GPT-4o-mini | Bulk processing |

Rule of thumb: the judge model should be at least as capable as the generator. GPT-4o-mini judging GPT-4o-mini works. GPT-4o-mini judging Claude Opus is unreliable.

temperature=0 for judge calls is mandatory.

Tools for LLM evaluation

| Tool | Focus | LLM-as-Judge | Self-hosted | Price |
|---|---|---|---|---|
| DeepEval | Testing | 14+ metrics | Yes (OSS) | Free |
| Ragas | RAG evaluation | Faithfulness, relevance | Yes (OSS) | Free |
| Langfuse | Observability + evals | Evaluator templates | Yes (OSS) | Free (self-hosted) |
| Phoenix (Arize) | Observability + evals | Hallucination, QA | Yes (OSS) | Free |
| Braintrust | Evals + logging | Custom scorers | Cloud | Free tier |

For a startup: DeepEval for pre-deploy tests + Langfuse for production monitoring. Two open-source tools cover the entire cycle.

Production setup for LLM-as-Judge

```
+----------------------------------------------------+
|                  CI/CD Pipeline                    |
|                                                    |
|  PR with prompt change                             |
|      |                                             |
|      v                                             |
|  DeepEval: dataset x new prompt -> scores          |
|      |                                             |
|      v                                             |
|  Score < threshold? -> Block merge                 |
+----------------------------------------------------+

+----------------------------------------------------+
|                    Production                      |
|                                                    |
|  User request -> LLM -> Response -> User           |
|                   |                                |
|                   v (async)                        |
|              Langfuse trace                        |
|                   |                                |
|                   v (cron, hourly)                 |
|         Judge evaluation (sample)                  |
|                   |                                |
|                   v                                |
|         Score dashboard + alerts                   |
+----------------------------------------------------+
```

Where to start: step-by-step plan

  1. Pick one metric. For RAG: faithfulness. For a chatbot: answer relevance.
  2. Collect 20-30 examples by hand: questions, answers, ratings (good/bad). A golden dataset for calibration.
  3. Write a judge prompt, run it against the golden dataset. Agreement with human ratings below 70%? Revise the prompt.
  4. Add DeepEval to CI for tests on prompt changes.
  5. Set up Langfuse evaluations for production monitoring.

From zero to a working quality gate: two to three days. Golden dataset + judge prompt: a couple of hours.
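The agreement check in step 3 is a simple ratio. A sketch, assuming binary good/bad human labels in the golden dataset and the judge score thresholded at 0.7:

```python
def judge_human_agreement(judge_scores: list[float],
                          human_labels: list[bool],
                          threshold: float = 0.7) -> float:
    """Fraction of examples where the thresholded judge score
    matches the human good/bad label."""
    matches = sum(
        (score >= threshold) == label
        for score, label in zip(judge_scores, human_labels)
    )
    return matches / len(judge_scores)

agreement = judge_human_agreement(
    judge_scores=[0.9, 0.8, 0.3, 0.6],
    human_labels=[True, True, False, True],
)
# 3 of 4 match → 0.75; anything below 0.70 means: revise the judge prompt
```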
