Your RAG pipeline returns an answer. It sounds confident. But is it actually correct? Turns out 'vibes-based evaluation' doesn't scale. Learn the metrics and frameworks that actually tell you if your LLM is hallucinating, missing context, or nailing it.
The Classic Problem
You've built a RAG pipeline. Your knowledge base is solid. Your retriever works fine. You run a test query, and the LLM spits out an answer that sounds completely confident. Grammar? Perfect. Structure? Coherent. Tone? Professional.
You copy it to your Slack channel: "It works!"
But then someone asks a follow-up question, and the answer contradicts itself. Or they check a fact and it's subtly wrong. Or they ask "where did you get that?" and you realize the LLM just... made it up.
That's not an error in your pipeline. That's an error in how you evaluated it.
If you're relying on "it looks right to me," you're in dangerous territory. The problem scales immediately: when you have 100 queries, 1000 queries, or a production system running 24/7, you can't manually inspect every output.
You need metrics.
Why This Matters
Vibes-based evaluation breaks at scale. Human inspection is slow, inconsistent, and subjective. One person reads an answer and thinks "solid." Another reads the same answer and spots a hallucination. You're shipping an LLM system that nobody actually understands, and nobody can debug when it fails.
But here's the thing: traditional ML evaluation metrics don't work for language models. In classification, you have clear right/wrong answers. In RAG, there's no single "ground truth." The same query might have 10 correct answers depending on how you interpret it. And hallucinations are genuinely hard to spot automatically because the LLM is confident and grammatically flawless.
So we need new frameworks. We need to measure different dimensions separately, and we need tools that don't require a human to read every single output.
The Core Challenge: Why LLM Eval Is Different
Traditional ML evaluation assumes:
- There's one right answer (binary classification, exact match, etc.)
- Metrics are purely numerical (accuracy, precision, recall)
- No middle ground between right and wrong
Language generation throws all of that out:
- Multiple correct answers exist (paraphrasing, different phrasings, different correct facts)
- Quality is multidimensional (you need to measure faithfulness and relevance, not just "accuracy")
- Hallucinations look like correct answers—that's the whole problem
That's why LLM evaluation has evolved into a multi-metric framework where you evaluate different dimensions separately.
The RAGAS Framework: Your Evaluation Toolkit
RAGAS (Retrieval Augmented Generation Assessment) is the most popular open-source framework for evaluating RAG systems. It provides a suite of metrics that work without requiring ground truth labels.
The Core Metrics
Faithfulness — Does the answer contain hallucinations?
The metric works like this: an LLM extracts all claims made in the answer. Then it checks each claim against the retrieved context. If a claim isn't supported by the context, it's hallucinated.
Score: 0–1, where 1 means "everything is supported by the retrieved context."
Why it matters: This catches the sneaky case where your LLM generates grammatically perfect nonsense.
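To make the mechanics concrete, here is a toy sketch of the faithfulness calculation. In real RAGAS an LLM both extracts the claims from the answer and verifies each against the context; here, hand-written claims and a naive substring check stand in for both LLM steps.

```python
def faithfulness_score(claims: list[str], context: str) -> float:
    """Fraction of claims supported by the retrieved context.

    A substring match stands in for the LLM verification step.
    """
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if claim.lower() in context.lower())
    return supported / len(claims)

context = "RAGAS was released in 2023. It evaluates RAG pipelines."
claims = [
    "RAGAS was released in 2023",  # supported by the context
    "RAGAS was written in Rust",   # hallucinated: nowhere in the context
]
print(faithfulness_score(claims, context))  # 0.5
```

One unsupported claim out of two gives 0.5, which is exactly the signal you want: the score drops the moment the answer says something the context never did.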
Answer Relevancy — Is the answer actually relevant to the user's question?
Instead of asking a human "is this relevant?", RAGAS generates multiple synthetic questions from the answer using the LLM, then measures the embedding similarity between those questions and the original query.
Score: 0–1, where 1 means "the answer is directly answering what was asked."
Why it matters: An answer can be faithful to the context and completely miss what the user wanted.
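A minimal sketch of that idea, using plain token-count cosine similarity in place of the learned embeddings RAGAS actually uses, and hand-written "synthetic questions" in place of LLM-generated ones:

```python
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    """Cosine similarity over token counts (a stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def answer_relevancy(query: str, synthetic_questions: list[str]) -> float:
    """Mean similarity between the original query and questions
    an LLM would generate from the answer."""
    return sum(cosine(query, q) for q in synthetic_questions) / len(synthetic_questions)

query = "what is context recall in ragas"
generated = [  # assumed LLM output, hard-coded for the example
    "what is context recall in ragas",
    "how does ragas measure context recall",
]
score = answer_relevancy(query, generated)
```

If the answer is on-topic, questions regenerated from it should land close to the original query; an off-topic answer produces questions about something else, and the score collapses.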
Context Precision — Are the most useful chunks ranked first?
When you retrieve 10 documents, is the most relevant one at position 1? Or buried at position 7? This metric measures whether the retriever ranked things in the right order.
Score: 0–1, where 1 means "the relevant chunks are all ranked at the top."
Why it matters: If your LLM has to wade through junk to find useful context, it'll get confused or hallucinate.
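The rank-sensitivity is easy to see in code. This is a sketch of the RAGAS-style formula: average precision@k over the positions that hold a relevant chunk, given a hand-labeled relevance list.

```python
def context_precision(relevance: list[bool]) -> float:
    """Mean precision@k over the positions holding a relevant chunk,
    so relevant chunks ranked early score higher than relevant chunks
    buried late."""
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0

# Same single relevant chunk, different positions:
print(context_precision([True, False, False]))  # 1.0  (ranked first)
print(context_precision([False, False, True]))  # ~0.33 (buried at position 3)
```

Identical retrieval sets, very different scores: the metric punishes the retriever for making the LLM dig.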
Context Recall — Did you retrieve everything needed?
This asks: given the correct answer, how much of the supporting context did the retriever actually find?
Score: 0–1, where 1 means "you got all the context needed to answer correctly."
Why it matters: If key context is missing, your LLM can't answer well, no matter how good it is.
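Sketched as code, with a substring check standing in for the LLM attribution step RAGAS performs:

```python
def context_recall(ground_truth_sentences: list[str], contexts: list[str]) -> float:
    """Fraction of ground-truth sentences attributable to some retrieved
    chunk. A substring check stands in for LLM-based attribution."""
    joined = " ".join(contexts).lower()
    hits = sum(1 for s in ground_truth_sentences if s.lower() in joined)
    return hits / len(ground_truth_sentences)

truth = ["the capital of france is paris", "it has 2.1 million residents"]
retrieved = ["The capital of France is Paris."]
print(context_recall(truth, retrieved))  # 0.5: one supporting fact was never retrieved
```

Note this metric needs a reference answer, which is why context recall is the one RAGAS metric that typically requires ground-truth labels.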
Putting It Together
You're not checking one metric. You're checking four dimensions of a single evaluation:
- Faithfulness measures hallucinations → low faithfulness = your LLM is making things up
- Answer relevancy measures understanding the question → low relevancy = wrong answer, right format
- Context precision measures retriever ranking → low precision = retriever is mixing junk with gold
- Context recall measures completeness → low recall = retriever missed important context
A healthy RAG system has all four scores high. If faithfulness is low, you have a hallucination problem. If recall is low, your retriever is weak.
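That score-to-diagnosis mapping can be wired into your evaluation loop directly. A minimal sketch, where the 0.7 threshold and the remediation strings are illustrative choices, not RAGAS defaults:

```python
def diagnose(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Map low metric scores to the component worth debugging first.
    Threshold and messages are illustrative, not standard values."""
    causes = {
        "faithfulness": "LLM is hallucinating: tighten the prompt or check context quality",
        "answer_relevancy": "answer misses the question: revisit the generation prompt",
        "context_precision": "retriever ranking is noisy: tune reranking or chunking",
        "context_recall": "retriever misses context: improve indexing or top-k",
    }
    return [causes[m] for m, s in scores.items() if s < threshold and m in causes]

print(diagnose({"faithfulness": 0.92, "answer_relevancy": 0.88,
                "context_precision": 0.81, "context_recall": 0.45}))
# ['retriever misses context: improve indexing or top-k']
```

The point is not the specific strings but the habit: each metric points at a different component, so a low score is a routing decision, not just a grade.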
Beyond RAGAS: LLM-as-Judge
RAGAS is great for RAG systems specifically. But what if your system is more general? What if you're not using retrieval?
That's where LLM-as-Judge comes in.
The idea is simple: use a powerful LLM (like GPT-4), prompted to assign scores on dimensions like helpfulness, correctness, faithfulness, or safety, to evaluate the outputs of another LLM.
Judge prompt (simplified):
"You are an expert evaluator. The user asked: [QUERY]
The system responded: [ANSWER]
Rate the response from 1-10 on correctness, helpfulness, and truthfulness.
Explain your reasoning."
Pros:
- No ground truth needed
- Works for any task (not just RAG)
- Can evaluate complex, nuanced quality
- Aligns with human judgment (85%+ agreement with human raters on GPT-4 judgments)
Cons:
- Costs money (you're calling an LLM to evaluate another LLM)
- Can inherit biases from the judge (GPT-4 has position bias, verbosity bias, self-enhancement bias)
- Prompt wording matters a lot—small changes in phrasing can shift scores by 10-15%
Pro tip: Use Chain-of-Thought prompting with your judge. Ask it to explain its reasoning step-by-step before assigning a score. This improves reliability by 10-15% and gives you a debuggable reasoning trail.
Hallucination Detection: The Hard Problem
Here's the truth: hallucinations are hard to detect automatically.
Your LLM generates a paragraph that sounds completely plausible. It cites no sources (because it made it up). It's grammatically perfect. How do you know it's wrong without checking every fact manually?
Recent approaches:
Self-Consistency Methods (SelfCheckGPT):
- Generate the same answer multiple times with different random seeds
- If the answer is consistent across generations, it's probably faithful
- If it varies wildly each time, it's probably hallucinated
- This works because factual claims are stable; hallucinations drift
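The self-consistency idea can be sketched in a few lines. SelfCheckGPT itself uses stronger agreement checks (NLI, question answering); plain token overlap across samples is the simplest stand-in for "do repeated generations agree?", and the sampled answers here are hard-coded rather than drawn from a model.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two answers."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def consistency(samples: list[str]) -> float:
    """Mean pairwise overlap across sampled answers to the same query."""
    pairs = [(i, j) for i in range(len(samples)) for j in range(i + 1, len(samples))]
    return sum(jaccard(samples[i], samples[j]) for i, j in pairs) / len(pairs)

stable = ["paris is the capital of france"] * 3
drifting = ["paris is the capital", "lyon is the capital", "the capital is marseille"]
print(consistency(stable))    # 1.0: identical answers every time
print(consistency(drifting))  # lower: the "fact" changes between samples
```

Stable facts reproduce across samples; a hallucination tends to come out different each time, and the consistency score exposes that drift.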
Token Probability Methods:
- Look at the model's internal confidence scores during generation
- If the model assigns low probability to its own words, something's off
- This doesn't always work—some hallucinations are high-confidence
Supervised Detection:
- Train a detector on labeled hallucination data
- Feed it hidden state representations from the LLM
- Let it predict whether a claim is hallucinated
- Works well in-domain; requires new training for new domains
The Honest Answer: There's no silver bullet. You need multiple approaches:
- Faithfulness metric to catch unsupported claims
- Self-consistency checks for flagrant hallucinations
- Human spot-checking on high-stakes domains
- Reference-based metrics (comparing output to ground truth) when you have labels
Common Evaluation Mistakes
Mistake 1: Relying on a Single Metric
"Our faithfulness score is 0.92—we're good!"
No. Faithfulness only tells you about hallucinations. Your answer could be:
- Faithful but irrelevant (addresses the wrong question)
- Faithful and relevant but missing half the context
Evaluate all dimensions. If any dimension is weak, you have a problem.
Mistake 2: Gaming the Metrics
You optimize for high RAGAS scores, so you make your retrieved context smaller (fewer chunks = easier for the LLM to be faithful). Now your scores are great, but your answers miss important details.
Or you use a judge that's biased toward verbose, confident-sounding answers, so your system generates fluff.
The trap: High metrics ≠ good product. You still need human evaluation on real user queries.
Mistake 3: Forgetting About Domain Shift
You evaluate your system on one domain (e.g., Python tutorials) and get great scores. You ship it to production for a different domain (e.g., medical advice). Suddenly users report hallucinations.
This happens because:
- Your training data was skewed toward one domain
- Your evaluation framework was calibrated on one domain
- The LLM's behavior changes in new domains
Always evaluate on representative samples from your actual use case.
Mistake 4: Ignoring the Prompt
LLM judges are incredibly sensitive to how you phrase the evaluation prompt.
"Is this answer correct?" gets different results than "Is this answer helpful and accurate?"
"Rate 1-10" gets different results than "Rate excellent/good/fair/poor"
Test different prompt wordings and see which ones align with your actual needs.
Putting It Into Practice
Here's a lightweight evaluation workflow for your RAG system:
1. Collect 50-100 real test queries from your users or domain experts
2. Generate answers using your system
3. Run RAGAS metrics on all of them
   - Calculate mean faithfulness, relevancy, precision, recall
   - Flag any queries with low scores
4. Spot-check the flagged queries manually
   - Read the answer and context
   - Verify that the metric agrees with your judgment
5. Iterate: improve your retriever, prompt, or model based on what you find
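Steps 3 and 4 of that workflow reduce to a simple filter. A sketch, assuming per-query scores have already been computed (here hard-coded) and using an illustrative 0.7 threshold:

```python
def flag_low_scores(results: list[dict], threshold: float = 0.7) -> list[dict]:
    """Keep only the queries where any metric dips below the threshold,
    so you spot-check those instead of reading every output."""
    metrics = ("faithfulness", "answer_relevancy",
               "context_precision", "context_recall")
    return [r for r in results if any(r[m] < threshold for m in metrics)]

results = [  # illustrative scores, as if produced by a RAGAS run
    {"query": "q1", "faithfulness": 0.95, "answer_relevancy": 0.90,
     "context_precision": 0.85, "context_recall": 0.90},
    {"query": "q2", "faithfulness": 0.55, "answer_relevancy": 0.90,
     "context_precision": 0.80, "context_recall": 0.88},
]
print([r["query"] for r in flag_low_scores(results)])  # ['q2']
```

With 100 test queries, this typically leaves a handful to read by hand, which is exactly the scale where human judgment still works.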
Tools to use:
- RAGAS (open source, free, works with any LLM via API)
- DeepEval (Python library, supports RAGAS + custom metrics)
- Langfuse (LLM observability platform with built-in LLM-as-judge)
- Confident AI (commercial, but focuses on evaluation workflows)
The Real Win: Debugging
Here's the secret nobody tells you: the real value of metrics isn't the score. It's the debugging information.
When your faithfulness score is 0.65, you know where to look: the answer contains unsupported claims. Start examining those claims.
When your context recall is 0.4, you know your retriever is missing stuff. Debug the retriever, not the LLM.
When answer relevancy is low but everything else is high, you know your prompt is asking the wrong question.
Metrics are a map. They point you toward the problem. But you still have to solve it.
Next Steps
- Pick one metric that matters to your system (probably faithfulness for RAG)
- Set a threshold (e.g., "we want 0.8+ on all metrics")
- Evaluate your current system
- When scores are low, debug instead of tweaking prompts blindly
- Repeat
Start small. Don't build a 500-metric evaluation dashboard on day one. Evaluate the dimensions that matter most to your users, and add more metrics as you grow.
And yes, you still need humans. Metrics catch patterns and point you toward problems. But someone has to verify that the metrics are actually measuring what you care about.
Because in the end, "looks good to me" scales to maybe 100 queries before it breaks. Metrics scale to 100,000. And human judgment backed by metrics? That actually works in production.
Your turn: what does your LLM system get wrong most often? Is it hallucinations, missing context, or something else? Metrics can help you find out.
Author: Shibin