Vaibhav Doddihal

Posted on Jun 18 • Originally published at blocksimplified.com

Evaluating LLM Systems: Metrics, Methods, and Scorecards

#ai #llm #evaluation #llmevaluation

This post is part of the AI Fluency series, where I document my learnings around applied AI concepts. The goal is to help you build practical skills you can apply in real projects.

Here is the hard truth about LLM development: most teams ship without proper evaluation. They run a few manual tests, the outputs "look good," and they call it done. Then users start complaining about weird responses, and suddenly nobody knows if the problem is the prompt, the model, or the retrieval pipeline.

I have been there. Early in my LLM projects, I would tweak a prompt, eyeball a few outputs, and deploy. It felt productive. But when something broke in production, I had no baseline to compare against. Did the new prompt actually help? Was the model always this bad at edge cases? No idea.

Evaluation & Testing is not just about catching bugs. It is your compass for improvement. Without systematic evaluation, you are navigating by feel in a space where intuition often fails.

Why Evaluation is Hard (And Why Most Teams Skip It)

LLMs are not like traditional software. When you test a function that adds two numbers, the expected output is clear. With LLMs, the "correct" answer is subjective, context-dependent, and often impossible to define precisely.

Consider this: you ask an LLM to summarize an article. There are dozens of valid summaries. Some focus on the main argument, others on supporting details. Some are formal, others conversational. How do you score that?

This ambiguity leads teams to skip evaluation entirely. It feels like too much work for uncertain benefit. But skipping evaluation means you are:

Flying blind when making prompt changes
Unable to compare models objectively
Missing regressions that hurt users
Building on a foundation you cannot trust

The good news: evaluation does not have to be perfect to be useful. Even rough metrics beat no metrics. Let me show you how to build a practical evaluation system.

The Evaluation Pyramid: Three Levels of Rigor

I think about LLM evaluation as a pyramid with three levels. Each level trades off between accuracy and scalability.

Level 1: Human Evaluation (The Gold Standard)

Human Evaluation is the most accurate but least scalable. Real people assess real outputs against criteria like helpfulness, accuracy, and tone.

When to use it:

Validating that your automated metrics correlate with actual quality
Evaluating subjective criteria like "does this sound natural?"
High-stakes applications where errors are costly
Creating the initial labels for your golden dataset

How to do it well:

Define clear criteria. Vague instructions like "rate quality" lead to inconsistent scores. Instead, specify: "Rate helpfulness from 1-5, where 1 means the response does not address the question at all, and 5 means it fully answers with actionable details."
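Use multiple annotators. At minimum, have 3 people rate each response. Calculate inter-rater agreement. Cohen's Kappa is defined for exactly two raters, so with three or more use Fleiss' Kappa or Krippendorff's Alpha (or average the pairwise Cohen's Kappa values). If agreement is low (below 0.6), your guidelines need work.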
Include calibration examples. Show annotators examples of responses at each score level before they start. This anchors their judgments.
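
As a rough sketch of the agreement check: Cohen's Kappa covers two raters, and Fleiss' Kappa is a common generalization for three or more. The plain-Python function below assumes a hypothetical `ratings` layout (one list of labels per response, same number of raters for each). It treats scores as unordered labels, so for a 1-5 scale a weighted or ordinal statistic may suit you better.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for items that were each rated by the same number of raters.

    ratings: e.g. [[4, 4, 5], [2, 3, 2]] means two responses, three raters each.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    per_item_counts = [Counter(item) for item in ratings]

    # Observed agreement: for each item, the share of rater pairs that agree.
    pairs = n_raters * (n_raters - 1)
    p_observed = sum(
        (sum(c * c for c in counts.values()) - n_raters) / pairs
        for counts in per_item_counts
    ) / n_items

    # Chance agreement: based on how often each label is used overall.
    totals = Counter()
    for counts in per_item_counts:
        totals.update(counts)
    total_ratings = n_items * n_raters
    p_chance = sum((t / total_ratings) ** 2 for t in totals.values())

    if p_chance == 1.0:  # every rater used the same single label everywhere
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)
```

Values near 1 mean raters agree far more often than chance would predict; values near 0 mean the guidelines are not anchoring their judgments.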

The practical reality: Human evaluation is expensive. You cannot have humans review every response in production. That is why we need automated methods.

Level 2: LLM-as-a-Judge (Scalable Quality Assessment)

LLM-as-a-Judge uses a capable model to evaluate outputs from your system. It is faster and cheaper than humans while being more nuanced than simple metrics.

The basic pattern:

judge_prompt = """
You are an expert evaluator. Rate the following response on a scale of 1-5
for HELPFULNESS.

Scoring rubric:
1 - Does not address the question
2 - Partially addresses but missing key information
3 - Addresses the question but could be clearer
4 - Good answer with minor room for improvement
5 - Excellent, comprehensive answer

User question: {question}
Response to evaluate: {response}

Provide your score and a brief justification.
"""

Key considerations:

Use a stronger model as judge. If you are evaluating GPT-3.5 outputs, use GPT-4 as the judge. The judge should be at least as capable as the model being evaluated.
Validate against human labels. Run your judge on a set where you have human scores. If the correlation between judge and human scores (for example, Spearman rank correlation) is below 0.7, refine your rubric.
Watch for biases. LLM judges tend to prefer verbose responses and may favor outputs similar to their training data. Check for these patterns in your evaluations.
Use reference-guided judging when possible. Providing the judge with a reference answer improves consistency.
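
One way to wire the judge pattern into code is sketched below. `call_model` is a hypothetical function you supply that sends a prompt to whichever judge model you use and returns its text reply; the shortened template and the "Score: N" reply format are assumptions for illustration, chosen so the score is easy to parse.

```python
import re
from typing import Callable, Optional

# Shortened version of the judge prompt; asks for a parseable "Score: N" line.
JUDGE_TEMPLATE = (
    "You are an expert evaluator. Rate the following response on a scale of 1-5\n"
    "for HELPFULNESS.\n\n"
    "User question: {question}\n"
    "Response to evaluate: {response}\n\n"
    "Reply in the form 'Score: N' followed by a brief justification."
)

def parse_score(judge_output: str) -> Optional[int]:
    """Extract a 1-5 score that follows the word 'score'; None if absent."""
    match = re.search(r"score\D{0,10}([1-5])\b", judge_output, re.IGNORECASE)
    return int(match.group(1)) if match else None

def judge(question: str, response: str,
          call_model: Callable[[str], str]) -> Optional[int]:
    """Build the judge prompt, call the (user-supplied) model, parse the score."""
    prompt = JUDGE_TEMPLATE.format(question=question, response=response)
    return parse_score(call_model(prompt))
```

Returning `None` instead of guessing keeps unparseable judge replies visible, so you can count them rather than silently scoring them.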

Level 3: Automated Metrics (Fast but Limited)

Automated Evaluation Metrics are the fastest and cheapest option. They compute scores algorithmically without any LLM calls.

Traditional NLP metrics:

Metric	What it measures	Good for
BLEU	N-gram precision against a reference	Translation
ROUGE	Recall of reference n-grams	Summarization
BERTScore	Semantic similarity via embeddings	General text comparison
Exact Match	String equality	Factoid QA with single correct answer

The problem: These metrics measure surface-level similarity, not actual quality. A response could be helpful and accurate but score poorly because it uses different words than the reference.

When automated metrics work:

Tasks with clearly correct answers (math, coding with test cases)
Detecting obvious failures (empty responses, errors)
Tracking trends over time (even imperfect metrics show direction)
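
To make the surface-similarity problem concrete, here is a minimal sketch of two such metrics: a normalized exact match and a ROUGE-1-style unigram recall. The function names are my own, and real ROUGE implementations add stemming and other details this version skips.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase and keep only alphanumeric tokens so trivial differences do not count."""
    return re.findall(r"[a-z0-9]+", text.lower())

def exact_match(candidate, reference):
    """True when both strings contain the same tokens in the same order."""
    return normalize(candidate) == normalize(reference)

def unigram_recall(candidate, reference):
    """ROUGE-1-style recall: share of reference tokens also found in the candidate."""
    ref = Counter(normalize(reference))
    cand = Counter(normalize(candidate))
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / sum(ref.values())
```

A paraphrase such as "a feline rested on the rug" scored against the reference "the cat sat on the mat" shares only "the" and "on", so it recalls 2 of 6 reference tokens even though a human would call it a fair restatement.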

RAG-Specific Evaluation: The Metrics That Matter

If you are building a Retrieval-Augmented Generation system, generic evaluation is not enough. You need metrics that assess both retrieval quality and generation quality.

Retrieval Metrics

Before the LLM even sees the context, did you retrieve the right documents?

Context Precision: Of the documents retrieved, how many were actually relevant?
Context Recall: Of all relevant documents in your corpus, how many did you retrieve?
Recall@K / Precision@K: Versions of above limited to top K results

Why this matters: If retrieval fails, even the best LLM cannot give good answers. Always evaluate retrieval independently.
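
The top-K variants are simple to compute once you have relevance labels. A minimal sketch, assuming `retrieved` is a ranked list of document IDs and `relevant` is the set of IDs a human marked as relevant for that query:

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)

def recall_at_k(retrieved, relevant, k):
    """Share of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)
```

Averaging these over every query in your test set gives one retrieval score you can track independently of the generator.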

Generation Metrics (RAG-specific)

These metrics assess the LLM's output given the retrieved context:

Faithfulness: Does the response stick to what is in the context? This catches hallucinations where the model makes up facts not supported by the retrieved documents.

Answer Relevance: Does the response actually answer the user's question? A response could be faithful to the context but still miss what the user asked.

Answer Correctness: Is the response factually correct? This compares against a ground truth answer if available.
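
Faithfulness is often scored as the fraction of claims in the response that the context supports. The sketch below assumes the response has already been split into claims, and takes the support check as a pluggable function. `naive_support_check` is a toy stand-in (case-insensitive substring match) only there to make the example runnable; in practice that role is played by an LLM judge or an entailment model.

```python
from typing import Callable, Optional, Sequence

def faithfulness_score(
    claims: Sequence[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> Optional[float]:
    """Fraction of claims supported by the context; None when there are no claims."""
    if not claims:
        return None
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)

def naive_support_check(claim: str, context: str) -> bool:
    """Toy check: the claim must appear verbatim (ignoring case) in the context."""
    return claim.lower() in context.lower()
```

With the context "The Eiffel Tower is in Paris. It opened in 1889." and three claims of which one ("it is 500 meters tall") has no support, the score is 2/3, which flags the unsupported claim as a likely hallucination.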

RAGAS Framework

The RAGAS (Retrieval Augmented Generation Assessment) framework provides a structured approach to these metrics. It uses LLM-as-a-Judge internally to score each dimension.

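# Conceptual example: the RAGAS API and column names vary by version (these follow ragas 0.1.x)
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Placeholder rows: replace with your own questions, answers, and retrieved contexts
your_test_data = Dataset.from_dict({
    "question": ["..."], "answer": ["..."],
    "contexts": [["..."]], "ground_truth": ["..."],
})

results = evaluate(
    dataset=your_test_data,
    metrics=[faithfulness, answer_relevancy, context_precision],
)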

Agent Evaluation: Beyond Single-Turn Responses

Evaluating AI Agents is a different challenge. Agents take multiple steps, use tools, and their success depends on achieving a goal, not just producing good text.

Goal Completion Rate

Did the agent accomplish what the user asked? For a travel planning agent, did it actually book the flight? For a research agent, did it find the information?

For each task this is a binary outcome (success/failure), and the rate is the fraction of tasks that succeed. It is simple but incredibly important. An agent that produces fluent text but fails to complete tasks is useless.

Tool-Use Accuracy

When the agent decides to use a tool, does it:

Choose the right tool for the situation?
Provide correct parameters?
Use tool results appropriately?

Track each of these separately. You might find your agent is good at choosing tools but bad at formatting parameters.
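
A minimal sketch of tracking the three questions separately. The `ToolCall` record and its field names are hypothetical; the idea is that each logged call is paired with the expected tool and arguments from your test case, plus a reviewer's (or judge's) verdict on whether the result was used well.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    expected_tool: str
    actual_tool: str
    expected_args: dict = field(default_factory=dict)
    actual_args: dict = field(default_factory=dict)
    result_used_correctly: bool = True  # set by human or LLM-judge review

def tool_use_report(calls):
    """Return the three accuracies, each conditioned on the previous step being right."""
    if not calls:
        return {}
    right_tool = [c for c in calls if c.actual_tool == c.expected_tool]
    right_args = [c for c in right_tool if c.actual_args == c.expected_args]
    used_well = [c for c in right_args if c.result_used_correctly]
    return {
        "tool_selection_accuracy": len(right_tool) / len(calls),
        "parameter_accuracy": len(right_args) / len(right_tool) if right_tool else 0.0,
        "result_usage_accuracy": len(used_well) / len(right_args) if right_args else 0.0,
    }
```

Conditioning each number on the previous step keeps a wrong tool choice from also being counted as a parameter error, which makes it easier to see where the agent actually struggles.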

Trajectory Analysis

For multi-step tasks, examine the full trajectory:

How many steps did it take? (Efficiency)
Did it recover from errors? (Robustness)
Did it take unnecessary detours? (Planning quality)
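
These three questions can be approximated from a step log. The sketch below assumes a hypothetical log format, a list of dicts with `action`, `args`, and `ok` keys, and uses repeated identical calls as a crude proxy for detours.

```python
def analyze_trajectory(steps):
    """Summarize one agent run.

    steps: list of dicts like {"action": "search", "args": "flights NYC", "ok": True}
    """
    seen = set()
    repeated_calls = 0
    for step in steps:
        key = (step["action"], repr(step["args"]))  # repr() so dict args are hashable
        if key in seen:
            repeated_calls += 1
        seen.add(key)

    errors = sum(1 for step in steps if not step["ok"])
    # "Recovered" here means: at least one step failed, yet the final step succeeded.
    recovered = steps[-1]["ok"] if errors else None

    return {
        "num_steps": len(steps),            # efficiency
        "errors": errors,
        "recovered_from_errors": recovered,  # robustness (None if nothing failed)
        "repeated_calls": repeated_calls,    # rough signal for planning quality
    }
```

Comparing `num_steps` against the shortest known path for each test task gives a more meaningful efficiency number than the raw count.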

Safety Violation Rate

Especially important for agents with real-world actions. Did the agent ever:

Attempt unauthorized actions?
Leak sensitive information?
Violate explicit constraints?

Even a 0.1% violation rate is too high for production agents with meaningful capabilities.
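
A practical consequence of such a low target is that a handful of clean test runs proves very little. The sketch below estimates, under the assumption that runs are independent, how many violation-free runs you would need before you could claim the true rate is below a target with a given confidence. The function names are mine.

```python
import math

def max_rate_given_clean_runs(n_runs, confidence=0.95):
    """Upper bound on the true violation rate after n_runs with zero violations."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    return 1 - (1 - confidence) ** (1 / n_runs)

def clean_runs_needed(target_rate, confidence=0.95):
    """Violation-free runs needed to bound the rate below target_rate."""
    if not 0 < target_rate < 1:
        raise ValueError("target_rate must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - target_rate))
```

For a 0.1% target at 95% confidence this comes out to roughly 3,000 clean runs, which is why safety test suites for agents need to be much larger than a typical golden dataset.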

Building Your Evaluation Scorecard

A scorecard brings all your metrics together in one view. It tells you at a glance whether your system is healthy.

What to Include

Core metrics (track always):

Overall quality score (LLM-as-Judge, 1-5)
Faithfulness (for RAG)
Goal completion rate (for agents)
Safety violation rate

Diagnostic metrics (dig in when core metrics drop):

Context precision/recall (RAG retrieval health)
Tool-use accuracy (agent capability)
Latency and token usage (operational health)
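
A scorecard can start as a small data structure with a threshold per metric. This is a sketch with illustrative names and thresholds, not recommended values; the `higher_is_better` flag handles metrics like violation rate or latency where lower is better.

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    value: float
    threshold: float
    higher_is_better: bool = True

    def healthy(self) -> bool:
        """A metric is healthy when it sits on the right side of its threshold."""
        if self.higher_is_better:
            return self.value >= self.threshold
        return self.value <= self.threshold

def failing_metrics(scorecard):
    """Names of the metrics that currently miss their threshold."""
    return [metric.name for metric in scorecard if not metric.healthy()]
```

For example, a scorecard holding `Metric("faithfulness", 0.81, 0.85)` alongside healthy quality and safety metrics would report only `faithfulness` as failing, which tells you to open the diagnostic metrics for the RAG pipeline.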

The Golden Dataset

A Golden Dataset is your foundation for reliable evaluation. It is a curated set of inputs with verified expected outputs.

How to build one:

Start with real queries. Pull from production logs (anonymized). These represent actual user needs.
Include edge cases. Add queries that have caused failures. These are your regression tests.
Get expert verification. Have domain experts validate or write reference answers.
Keep it manageable. 50-100 high-quality examples beat 1,000 sloppy ones. Quality over quantity.

How to use it:

Run your golden dataset after any change:

New prompt version? Run the golden dataset.
Model upgrade? Run the golden dataset.
Retrieval pipeline change? Run the golden dataset.

Compare scores to your baseline. If quality drops, investigate before deploying.
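
The baseline comparison can be a few lines of code. This sketch assumes both runs are stored as dicts of metric name to score, that higher is better for every metric, and that a small `tolerance` absorbs run-to-run noise from the judge; the default of 0.02 is arbitrary.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return {metric: (baseline_value, current_value)} for metrics that got worse.

    A metric missing from the current run is reported with a current value of None.
    """
    regressions = {}
    for metric, base_value in baseline.items():
        new_value = current.get(metric)
        if new_value is None or new_value < base_value - tolerance:
            regressions[metric] = (base_value, new_value)
    return regressions
```

An empty result means the change is safe to ship as far as the golden dataset can tell; anything else is the list of metrics to investigate first.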

Automated Regression Testing

Integrate golden dataset evaluation into your CI/CD pipeline:

# Conceptual GitHub Actions workflow
name: LLM Evaluation
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run golden dataset evaluation
        run: python evaluate.py --dataset golden_set.json

      - name: Check quality thresholds
        run: |
          # jq -e exits non-zero when the comparison is false; the shell's [ ] cannot compare floats
          if ! jq -e '.faithfulness >= 0.85' results.json > /dev/null; then
            echo "Faithfulness dropped below threshold"
            exit 1
          fi
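
The workflow expects an evaluation script that reads the golden dataset and writes `results.json`. A minimal sketch of the core of such a script follows. The dataset layout (a JSON list of objects with a `question` field), and the `answer_fn` and `score_fn` callables standing in for your system and your faithfulness scorer, are all assumptions.

```python
import json
from statistics import mean

def run_evaluation(dataset_path, results_path, answer_fn, score_fn):
    """Score every golden example and write aggregate results as JSON.

    answer_fn(question) -> your system's response (hypothetical hook)
    score_fn(example, response) -> float score for that example (hypothetical hook)
    """
    with open(dataset_path, encoding="utf-8") as f:
        examples = json.load(f)
    if not examples:
        raise ValueError("golden dataset is empty")

    scores = [score_fn(example, answer_fn(example["question"])) for example in examples]

    results = {"faithfulness": mean(scores), "num_examples": len(scores)}
    with open(results_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    return results
```

A command-line wrapper that parses `--dataset` and calls `run_evaluation` would complete the script the workflow invokes.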

Practical Implementation: Where to Start

If you are just getting started with LLM evaluation, here is my recommended sequence:

Week 1: Create Your Golden Dataset

Pull 30-50 representative queries from production or brainstorming
Write reference answers for each
Include 10+ edge cases or known failure scenarios

Week 2: Set Up LLM-as-a-Judge

Create a judge prompt for your primary quality criterion
Run it on your golden dataset
Manually review judge outputs to check reasonableness

Week 3: Validate and Iterate

Have humans rate a subset of the same responses
Compare human scores to judge scores
Refine your judge prompt until correlation is decent (aim for 0.7+)
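
For the comparison step, a rank correlation such as Spearman's is a reasonable choice because 1-5 ratings are ordinal. Here is a dependency-free sketch (if you already use SciPy, `scipy.stats.spearmanr` does the same job); it assumes two equal-length lists of scores for the same responses.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(human_scores, judge_scores):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(human_scores) != len(judge_scores) or len(human_scores) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    rx, ry = average_ranks(human_scores), average_ranks(judge_scores)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side gave every response the same score
    return cov / (var_x * var_y) ** 0.5
```

With human scores `[1, 2, 3, 4, 5]` and judge scores `[2, 1, 4, 3, 5]` the result is 0.8: the judge swaps two neighboring pairs but preserves the overall ordering.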

Week 4: Automate

Integrate evaluation into your deployment process
Set quality thresholds that block bad deploys
Create a dashboard to track metrics over time

Ongoing: Expand and Maintain

Add new examples to golden dataset as you find failures
Add metrics for new dimensions (safety, latency, etc.)
Review and update quarterly

Common Mistakes to Avoid

Optimizing for the metric, not the goal. Your metric is a proxy for quality, not quality itself. If you tune prompts to maximize your judge scores, you might overfit to the judge's preferences rather than actual user needs.

Too few examples in golden dataset. You need coverage of your use cases. Fifty examples is a minimum; one hundred is better. But focus on quality and diversity, not raw quantity.

Not validating your judge. An LLM judge can have systematic biases. Always check correlation with human judgment before trusting it.

Evaluating in isolation. A component might score well individually but fail in the full pipeline. Test end-to-end, not just pieces.

Static evaluation sets. Your application evolves. Your evaluation set should too. Review and update regularly.

Key Takeaways

Key Concepts

Evaluation & Testing
Evaluation Metrics
LLM-as-a-Judge
Human Evaluation
Golden Dataset
Retrieval-Augmented Generation
AI Agents
Hallucinations
