Evaluating LLM Systems: Metrics, Methods, and Scorecards
Originally published on BlockSimplified
This post is part of the AI Fluency series, where I document my learnings around applied AI concepts. The goal is to help you build practical skills you can apply in real projects.
Here is the hard truth about LLM development: most teams ship without proper evaluation. They run a few manual tests, the outputs "look good," and they call it done. Then users start complaining about weird responses, and suddenly nobody knows if the problem is the prompt, the model, or the retrieval pipeline.
I have been there. Early in my LLM projects, I would tweak a prompt, eyeball a few outputs, and deploy. It felt productive. But when something broke in production, I had no baseline to compare against. Did the new prompt actually help? Was the model always this bad at edge cases? No idea.
Evaluation & Testing is not just about catching bugs. It is your compass for improvement. Without systematic evaluation, you are navigating by feel in a space where intuition often fails.
Why Evaluation is Hard (And Why Most Teams Skip It)
LLMs are not like traditional software. When you test a function that adds two numbers, the expected output is clear. With LLMs, the "correct" answer is subjective, context-dependent, and often impossible to define precisely.
Consider this: you ask an LLM to summarize an article. There are dozens of valid summaries. Some focus on the main argument, others on supporting details. Some are formal, others conversational. How do you score that?
This ambiguity leads teams to skip evaluation entirely. It feels like too much work for uncertain benefit. But skipping evaluation means you are:
- Flying blind when making prompt changes
- Unable to compare models objectively
- Missing regressions that hurt users
- Building on a foundation you cannot trust
The good news: evaluation does not have to be perfect to be useful. Even rough metrics beat no metrics. Let me show you how to build a practical evaluation system.
The Evaluation Pyramid: Three Levels of Rigor
I think about LLM evaluation as a pyramid with three levels. Each level trades off between accuracy and scalability.
Level 1: Human Evaluation (The Gold Standard)
Human Evaluation is the most accurate but least scalable. Real people assess real outputs against criteria like helpfulness, accuracy, and tone.
When to use it:
- Validating that your automated metrics correlate with actual quality
- Evaluating subjective criteria like "does this sound natural?"
- High-stakes applications where errors are costly
- Creating the initial labels for your golden dataset
How to do it well:
Define clear criteria. Vague instructions like "rate quality" lead to inconsistent scores. Instead, specify: "Rate helpfulness from 1-5, where 1 means the response does not address the question at all, and 5 means it fully answers with actionable details."
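Use multiple annotators. At minimum, have 3 people rate each response. Measure inter-rater agreement with a statistic that fits your setup: Cohen's kappa covers exactly two raters, so with three or more use Fleiss' kappa or Krippendorff's alpha. If agreement is low (below 0.6), your guidelines need work.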
Include calibration examples. Show annotators examples of responses at each score level before they start. This anchors their judgments.
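To make the agreement check concrete, here is a minimal sketch of Fleiss' kappa, a common agreement statistic when three or more annotators rate every item (Cohen's kappa is defined for exactly two raters). The function name and the input layout are my own assumptions, and the 0.6 cut-off is the rule of thumb from above rather than a hard standard.

```python
from collections import Counter


def fleiss_kappa(ratings: list[list[int]]) -> float:
    """Fleiss' kappa for items that were each rated by the same number of annotators.

    ratings[i] holds the scores every annotator gave to item i.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    category_totals: Counter = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of annotator pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    observed = sum(per_item_agreement) / n_items
    total_ratings = n_items * n_raters
    # Agreement expected by chance, from the overall share of each score.
    expected = sum((c / total_ratings) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return 1.0  # every rating is identical, so agreement is trivially perfect
    return (observed - expected) / (1 - expected)


# Three annotators, four responses, helpfulness scores from 1 to 5 (made-up data).
scores = [[5, 5, 4], [2, 2, 2], [3, 4, 3], [1, 1, 2]]
print(f"Fleiss' kappa: {fleiss_kappa(scores):.2f}")
```

If the value comes out low, I would look at the items with the most disagreement first; they usually point at the ambiguous part of the rubric.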
The practical reality: Human evaluation is expensive. You cannot have humans review every response in production. That is why we need automated methods.
Level 2: LLM-as-a-Judge (Scalable Quality Assessment)
LLM-as-a-Judge uses a capable model to evaluate outputs from your system. It is faster and cheaper than humans while being more nuanced than simple metrics.
The basic pattern:
judge_prompt = """
You are an expert evaluator. Rate the following response on a scale of 1-5
for HELPFULNESS.
Scoring rubric:
1 - Does not address the question
2 - Partially addresses but missing key information
3 - Addresses the question but could be clearer
4 - Good answer with minor room for improvement
5 - Excellent, comprehensive answer
User question: {question}
Response to evaluate: {response}
Provide your score and a brief justification.
"""
Key considerations:
Use a stronger model as judge. If you are evaluating GPT-3.5 outputs, use GPT-4 as the judge. The judge should be at least as capable as the model being evaluated.
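Validate against human labels. Run your judge on a set where you have human scores and measure how well the two agree, for example with a rank correlation such as Spearman's. If correlation is below 0.7, refine your rubric.
Watch for biases. LLM judges tend to prefer longer, more verbose responses, can favor outputs that resemble their own writing style, and in pairwise comparisons may be swayed by the order in which candidates appear. Check for these patterns in your evaluations.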
Use reference-guided judging when possible. Providing the judge with a reference answer improves consistency.
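As a rough sketch of the validation step: parse the score out of each judge reply, then compare judge scores with human scores. I am assuming the judge is asked to answer in the form "Score: N" (an extra instruction you would add to the prompt) and using Spearman rank correlation as the agreement measure. Neither choice is prescribed above; swap in whatever fits your setup.

```python
import re
from statistics import mean
from typing import Optional

# Assumes the judge was asked to reply with a line such as "Score: 4".
SCORE_PATTERN = re.compile(r"score\s*[:=]\s*([1-5])\b", re.IGNORECASE)


def parse_score(judge_reply: str) -> Optional[int]:
    """Return the 1-5 score found in the judge's reply, or None if absent."""
    match = SCORE_PATTERN.search(judge_reply)
    return int(match.group(1)) if match else None


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / (var_x * var_y) ** 0.5


# Made-up judge replies and human labels for the same five responses.
replies = ["Score: 5\nThorough.", "Score: 2\nMisses the point.", "Score: 4",
           "Score: 3\nUnclear.", "Score: 1"]
judge_scores = [parse_score(reply) for reply in replies]
human_scores = [5, 1, 4, 4, 2]
print(f"Spearman correlation: {spearman(judge_scores, human_scores):.2f}")
```

Replies where `parse_score` returns `None` are worth logging separately; a judge that often ignores the output format is its own signal that the prompt needs work.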
Level 3: Automated Metrics (Fast but Limited)
Automated Evaluation Metrics are the fastest and cheapest option. They compute scores algorithmically without any LLM calls.
Traditional NLP metrics:
| Metric | What it measures | Good for |
|---|---|---|
| BLEU | N-gram overlap with reference | Translation |
| ROUGE | Recall of reference n-grams | Summarization |
| BERTScore | Semantic similarity via embeddings | General text comparison |
| Exact Match | String equality | Factoid QA with single correct answer |
The problem: These metrics measure surface-level similarity, not actual quality. A response could be helpful and accurate but score poorly because it uses different words than the reference.
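Here is a small illustration of that weakness. The two functions are simplified stand-ins I wrote for this post (a normalized exact match and a unigram recall in the spirit of ROUGE-1, without stemming), not the official metric implementations.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenization; real metrics normalize more carefully."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when the strings are equal after trimming and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


def unigram_recall(prediction: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the prediction."""
    ref_counts = Counter(tokenize(reference))
    pred_counts = Counter(tokenize(prediction))
    # Clip each token's credit to how often it occurs in the prediction.
    overlap = sum(min(count, pred_counts[token]) for token, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


reference = "restart the router to fix the connection"
paraphrase = "power cycling your modem should resolve it"

# A reasonable paraphrase shares no tokens with the reference,
# so a surface-level metric treats it as a complete miss.
print(exact_match(paraphrase, reference))
print(unigram_recall(paraphrase, reference))
```

Embedding-based metrics such as BERTScore soften this problem, but they still compare against a reference rather than judging whether the answer actually helps.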
When automated metrics work:
- Tasks with clearly correct answers (math, coding with test cases)
- Detecting obvious failures (empty responses, errors)
- Tracking trends over time (even imperfect metrics show direction)
RAG-Specific Evaluation: The Metrics That Matter
If you are building a Retrieval-Augmented Generation system, generic evaluation is not enough. You need metrics that assess both retrieval quality and generation quality.
Retrieval Metrics
Before the LLM even sees the context, did you retrieve the right documents?
- Context Precision: Of the documents retrieved, what fraction were actually relevant?
- Context Recall: Of all relevant documents in your corpus, what fraction did you retrieve?
- Recall@K / Precision@K: The same two measures computed over only the top K results
Why this matters: If retrieval fails, even the best LLM cannot give good answers. Always evaluate retrieval independently.
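A minimal sketch of the top-K versions, assuming you have labeled which document IDs are relevant for each query. The function names and the example IDs are made up for illustration.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    # Divides by the number of results actually returned, which can be below k.
    return sum(doc in relevant for doc in top_k) / len(top_k)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


retrieved = ["d1", "d4", "d2", "d9"]   # ranked retriever output
relevant = {"d1", "d2", "d3"}          # human-labeled relevant documents

for k in (2, 4):
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

Raising K usually lifts recall and lowers precision, so it helps to report both at the K your prompt actually uses.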
Generation Metrics (RAG-specific)
These metrics assess the LLM's output given the retrieved context:
Faithfulness: Does the response stick to what is in the context? This catches hallucinations where the model makes up facts not supported by the retrieved documents.
Answer Relevance: Does the response actually answer the user's question? A response could be faithful to the context but still miss what the user asked.
Answer Correctness: Is the response factually correct? This compares against a ground truth answer if available.
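One common way to turn faithfulness into a number is to split the response into individual claims and report the share that the context supports. The sketch below assumes the claims have already been extracted, and `naive_support_check` is a deliberately crude stand-in I made up; in practice that check would be an LLM judge or an entailment model.

```python
from typing import Callable


def faithfulness_score(
    claims: list[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Share of claims that the supplied checker marks as supported by the context."""
    if not claims:
        raise ValueError("need at least one claim to score")
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_support_check(claim: str, context: str) -> bool:
    """Toy checker: a claim counts as supported only if it appears verbatim."""
    return claim.lower() in context.lower()


context = "The warranty lasts two years. Returns are accepted within 30 days."
claims = [
    "the warranty lasts two years",
    "returns are accepted within 30 days",
    "shipping is free",  # not in the context, so it counts as unsupported
]
print(f"faithfulness: {faithfulness_score(claims, context, naive_support_check):.2f}")
```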
RAGAS Framework
The RAGAS (Retrieval Augmented Generation Assessment) framework provides a structured approach to these metrics. It uses LLM-as-a-Judge internally to score each dimension.
# Conceptual example - actual RAGAS API may differ
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

results = evaluate(
    dataset=your_test_data,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
Agent Evaluation: Beyond Single-Turn Responses
Evaluating AI Agents is a different challenge. Agents take multiple steps, use tools, and their success depends on achieving a goal, not just producing good text.
Goal Completion Rate
Did the agent accomplish what the user asked? For a travel planning agent, did it actually book the flight? For a research agent, did it find the information?
Per task, this is a binary outcome (success or failure); the completion rate is the share of tasks that succeed. It is simple but incredibly important. An agent that produces fluent text but fails to complete tasks is useless.
Tool-Use Accuracy
When the agent decides to use a tool, does it:
- Choose the right tool for the situation?
- Provide correct parameters?
- Use tool results appropriately?
Track each of these separately. You might find your agent is good at choosing tools but bad at formatting parameters.
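Tracking the three questions separately could look something like this. The `ToolCallRecord` fields are hypothetical; they assume each logged tool call has been labeled with the expected tool and arguments, plus a human or judge verdict on whether the result was used correctly.

```python
from dataclasses import dataclass


@dataclass
class ToolCallRecord:
    expected_tool: str
    actual_tool: str
    expected_args: dict
    actual_args: dict
    result_used_correctly: bool  # labeled by a human reviewer or an LLM judge


def tool_use_accuracy(records: list[ToolCallRecord]) -> dict[str, float]:
    """Report tool selection, parameter, and result-usage accuracy separately."""
    if not records:
        raise ValueError("need at least one tool call record")
    right_tool = [r for r in records if r.actual_tool == r.expected_tool]
    right_args = [r for r in right_tool if r.actual_args == r.expected_args]
    return {
        "tool_selection": len(right_tool) / len(records),
        # Parameters are only judged on calls where the right tool was chosen.
        "parameter_accuracy": len(right_args) / len(right_tool) if right_tool else 0.0,
        "result_usage": sum(r.result_used_correctly for r in records) / len(records),
    }


records = [
    ToolCallRecord("search", "search", {"q": "flights"}, {"q": "flights"}, True),
    ToolCallRecord("search", "search", {"q": "hotels"}, {"query": "hotels"}, True),
    ToolCallRecord("book", "book", {"id": 7}, {"id": "7"}, False),
    ToolCallRecord("book", "search", {"id": 7}, {"q": "book 7"}, False),
]
print(tool_use_accuracy(records))
```

In this made-up sample the agent picks the right tool most of the time but often gets the arguments wrong, which is exactly the pattern a single combined score would hide.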
Trajectory Analysis
For multi-step tasks, examine the full trajectory:
- How many steps did it take? (Efficiency)
- Did it recover from errors? (Robustness)
- Did it take unnecessary detours? (Planning quality)
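These questions can be approximated from a step log. The sketch below uses simple heuristics that are my own assumptions: an error counts as recovered if any later step succeeds, and a repeated successful action is treated as a possible detour. The `Step` type and `optimal_steps` (the shortest known path for the task) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    failed: bool = False


def trajectory_stats(steps: list[Step], optimal_steps: int) -> dict[str, int]:
    """Rough efficiency, robustness, and planning signals for one trajectory."""
    error_positions = [i for i, step in enumerate(steps) if step.failed]
    # Heuristic: an error is "recovered" if some later step succeeded.
    recovered = sum(
        1 for i in error_positions if any(not s.failed for s in steps[i + 1:])
    )
    seen: set[str] = set()
    repeated = 0
    for step in steps:
        if step.failed:
            continue  # a retry after a failure is not counted as a detour
        if step.action in seen:
            repeated += 1
        seen.add(step.action)
    return {
        "steps": len(steps),
        "extra_steps": max(0, len(steps) - optimal_steps),
        "errors": len(error_positions),
        "recovered_errors": recovered,
        "repeated_actions": repeated,
    }


run = [Step("search"), Step("open", failed=True), Step("open"),
       Step("search"), Step("book")]
print(trajectory_stats(run, optimal_steps=3))
```

Numbers like these are most useful in aggregate; for individual failures, reading the full trajectory is still faster than any metric.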
Safety Violation Rate
Especially important for agents with real-world actions. Did the agent ever:
- Attempt unauthorized actions?
- Leak sensitive information?
- Violate explicit constraints?
Even a 0.1% violation rate is too high for production agents with meaningful capabilities.
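A rate that small also takes a lot of test runs to verify. As a back-of-the-envelope sketch, assuming independent runs and zero observed violations, the one-sided upper confidence bound on the true rate is 1 - (1 - confidence)^(1/n), which is roughly 3/n at 95% confidence (the "rule of three"). The function names below are mine.

```python
import math


def zero_violation_upper_bound(n_runs: int, confidence: float = 0.95) -> float:
    """Upper bound on the true violation rate after n_runs with zero violations."""
    return 1 - (1 - confidence) ** (1 / n_runs)


def runs_needed(max_rate: float, confidence: float = 0.95) -> int:
    """Clean runs required before the upper bound drops below max_rate."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_rate))


print(f"100 clean runs bound the rate at {zero_violation_upper_bound(100):.2%}")
print(f"Clean runs needed to claim a rate below 0.1%: {runs_needed(0.001)}")
```

In other words, a hundred clean test runs only tells you the violation rate is probably under about 3%, and supporting a claim of under 0.1% takes on the order of three thousand clean runs.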
Building Your Evaluation Scorecard
A scorecard brings all your metrics together in one view. It tells you at a glance whether your system is healthy.
What to Include
Core metrics (track always):
- Overall quality score (LLM-as-Judge, 1-5)
- Faithfulness (for RAG)
- Goal completion rate (for agents)
- Safety violation rate
Diagnostic metrics (dig in when core metrics drop):
- Context precision/recall (RAG retrieval health)
- Tool-use accuracy (agent capability)
- Latency and token usage (operational health)
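A scorecard can start as a dictionary of metrics checked against thresholds. In this sketch the metric names mirror the core list above, and the threshold values are placeholders I picked for illustration (only the 0.85 faithfulness floor appears elsewhere in this post); set your own from your baseline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    limit: float
    higher_is_better: bool = True

    def passes(self, value: float) -> bool:
        return value >= self.limit if self.higher_is_better else value <= self.limit


# Placeholder thresholds; tune them to your own system.
CORE_THRESHOLDS = {
    "quality_score": Threshold(4.0),
    "faithfulness": Threshold(0.85),
    "goal_completion_rate": Threshold(0.90),
    "safety_violation_rate": Threshold(0.0, higher_is_better=False),
}


def scorecard(
    metrics: dict[str, float], thresholds: dict[str, Threshold]
) -> dict[str, tuple[Optional[float], str]]:
    """Map each tracked metric to (value, status), flagging missing metrics."""
    rows = {}
    for name, threshold in thresholds.items():
        value = metrics.get(name)
        if value is None:
            status = "missing"
        else:
            status = "ok" if threshold.passes(value) else "fail"
        rows[name] = (value, status)
    return rows


latest = {"quality_score": 4.3, "faithfulness": 0.81, "safety_violation_rate": 0.0}
for name, (value, status) in scorecard(latest, CORE_THRESHOLDS).items():
    print(f"{name:<24} {value!s:<6} {status}")
```

Flagging missing metrics explicitly matters: a metric that silently stops being reported looks the same as a healthy one on most dashboards.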
The Golden Dataset
A golden dataset is your foundation for reliable evaluation. It is a curated set of inputs with verified expected outputs.
How to build one:
Start with real queries. Pull from production logs (anonymized). These represent actual user needs.
Include edge cases. Add queries that have caused failures. These are your regression tests.
Get expert verification. Have domain experts validate or write reference answers.
Keep it manageable. 50-100 high-quality examples beat 1,000 sloppy ones. Quality over quantity.
How to use it:
Run your golden dataset after any change:
- New prompt version? Run the golden dataset.
- Model upgrade? Run the golden dataset.
- Retrieval pipeline change? Run the golden dataset.
Compare scores to your baseline. If quality drops, investigate before deploying.
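The baseline comparison can be as simple as the sketch below. It assumes every metric is "higher is better" and uses a made-up tolerance of 0.02 to absorb run-to-run noise; both are assumptions to adjust for your own metrics.

```python
from typing import Optional


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, tuple[float, Optional[float]]]:
    """Return metrics that dropped by more than tolerance, as (baseline, current)."""
    regressions = {}
    for metric, base_value in baseline.items():
        new_value = current.get(metric)
        if new_value is None:
            # A metric that vanished from the run is treated as a regression.
            regressions[metric] = (base_value, None)
        elif base_value - new_value > tolerance:
            regressions[metric] = (base_value, new_value)
    return regressions


baseline = {"faithfulness": 0.90, "quality_score": 4.2}
current = {"faithfulness": 0.86, "quality_score": 4.21}

for metric, (old, new) in find_regressions(baseline, current).items():
    print(f"{metric} regressed: {old} -> {new}")
```

LLM outputs vary between runs, so it is worth measuring that noise once (run the same configuration a few times) before settling on a tolerance.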
Automated Regression Testing
Integrate golden dataset evaluation into your CI/CD pipeline:
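# Conceptual GitHub Actions workflow
name: LLM Evaluation
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run golden dataset evaluation
        # evaluate.py is assumed to write its scores to results.json
        run: python evaluate.py --dataset golden_set.json
      - name: Check quality thresholds
        run: |
          # The shell cannot compare floats with [ ... < ... ], so let jq do it.
          # jq -e exits non-zero when the expression evaluates to false or null.
          if ! jq -e '.faithfulness >= 0.85' results.json > /dev/null; then
            echo "Faithfulness dropped below threshold"
            exit 1
          fi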
Practical Implementation: Where to Start
If you are just getting started with LLM evaluation, here is my recommended sequence:
Week 1: Create Your Golden Dataset
- Pull 30-50 representative queries from production or brainstorming
- Write reference answers for each
- Include 10+ edge cases or known failure scenarios
Week 2: Set Up LLM-as-a-Judge
- Create a judge prompt for your primary quality criterion
- Run it on your golden dataset
- Manually review judge outputs to check reasonableness
Week 3: Validate and Iterate
- Have humans rate a subset of the same responses
- Compare human scores to judge scores
- Refine your judge prompt until correlation is decent (aim for 0.7+)
Week 4: Automate
- Integrate evaluation into your deployment process
- Set quality thresholds that block bad deploys
- Create a dashboard to track metrics over time
Ongoing: Expand and Maintain
- Add new examples to golden dataset as you find failures
- Add metrics for new dimensions (safety, latency, etc.)
- Review and update quarterly
Common Mistakes to Avoid
Optimizing for the metric, not the goal. Your metric is a proxy for quality, not quality itself. If you tune prompts to maximize your judge scores, you might overfit to the judge's preferences rather than actual user needs.
Too few examples in golden dataset. You need coverage of your use cases. Thirty to fifty examples is enough to get started, but aim for fifty to one hundred once the set is in regular use. Focus on quality and diversity, not raw quantity.
Not validating your judge. An LLM judge can have systematic biases. Always check correlation with human judgment before trusting it.
Evaluating in isolation. A component might score well individually but fail in the full pipeline. Test end-to-end, not just pieces.
Static evaluation sets. Your application evolves. Your evaluation set should too. Review and update regularly.
Key Takeaways
Key Concepts
- Evaluation & Testing
- Evaluation Metrics
- LLM-as-a-Judge
- Human Evaluation
- Golden Dataset
- Retrieval-Augmented Generation
- AI Agents
- Hallucinations