Shubham Thakur

My First LLM Evaluation Pipeline

The Context: Why I Started This Journey

After 7+ years as a Software Testing Lead, I've spent countless hours ensuring code quality, writing test cases, and building robust testing frameworks. But as AI systems started becoming ubiquitous in production environments, I found myself asking: "How do we test AI models with the same rigor we test traditional software?"
That question led me down the rabbit hole of AI Quality Engineering and LLM Evaluation. This blog documents my first hands-on experience with DeepEval, an open-source Python framework that makes testing Large Language Models as intuitive as writing unit tests.
Spoiler alert: It's both humbling and exciting! 🚀


What I Built: Two Versions, Two Approaches

I approached this learning exercise by building two versions of LLM evaluation pipelines, each teaching me different aspects of the evaluation process.

Version 1: The Physics Quiz (Reality Check Edition)

  • Dataset: 50 physics questions in .jsonl format
  • LLM Outputs: 50 hardcoded responses (simulating pre-generated outputs)
  • Evaluation Model: Azure OpenAI
  • Metric Used: Answer Relevancy
  • Results: 28/50 passed (56% pass rate) ⚠️

Version 2: The Olympics Quiz (Real-Time Edition)

  • Dataset: 5 Olympics trivia questions
  • LLM: DeepSeek-R1 8B (running locally via Ollama)
  • Evaluation Model: Azure OpenAI
  • Metric Used: Answer Relevancy
  • Results: 5/5 passed (100% pass rate) ✅

You can view my test runs at the DeepEval dashboard.

A snippet of the DeepEval dashboard


The Technical Deep Dive

Setting Up the Foundation

Here's what my Version 2 implementation looks like (the active version in my code):

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.evaluate import evaluate
from deepeval.dataset import EvaluationDataset, Golden
from langchain_ollama import ChatOllama

The beauty of DeepEval is its simplicity. You need just four key components (a minimal unit-test-style sketch follows this list):

  1. Test Cases: Input-output pairs
  2. Metrics: What you're measuring (relevancy, correctness, toxicity, etc.)
  3. Dataset: Organized collection of test cases
  4. Evaluation: The execution engine
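
To make the "as intuitive as unit tests" claim concrete, here's a minimal pytest-style sketch. This is purely illustrative and not part of my pipeline: the file name, question, and hardcoded answer are my own examples, and it assumes your evaluation model is already configured for DeepEval.

# test_single_case.py -- illustrative pytest-style sketch
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_olympics_host():
    test_case = LLMTestCase(
        input="Which city hosted the 2016 Summer Olympics?",
        actual_output="The 2016 Summer Olympics were held in Rio de Janeiro, Brazil.",
    )
    # Fails like any other unit test if the relevancy score drops below the threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])

You can run this with plain pytest or with deepeval test run test_single_case.py; either way it behaves like an ordinary unit test.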

Connecting to a Local LLM

One of the exciting parts was running DeepSeek-R1 8B locally using Ollama:

chat = ChatOllama(
    base_url="http://localhost:11434",  # local Ollama server
    model="deepseek-r1:8b",
    temperature=0.5,
    num_predict=200,  # cap the number of tokens generated per response
)

This gave me complete control over the model, with no API costs or rate limits to worry about during experimentation. Perfect for learning!
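
Before wiring the model into an evaluation loop, a quick smoke test confirms that the Ollama server is up and the model responds. A tiny sketch reusing the chat object from above (the prompt is just an example, and it assumes the model has already been pulled in Ollama):

# Sanity check: the local model should answer a trivial question
reply = chat.invoke("In one sentence: which city hosted the 2016 Summer Olympics?")
print(reply.content)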

Building the Dataset

I created a simple but effective dataset structure:

test_data = [
  {
    "input": "Which country topped the medal table at the Tokyo 2020 Olympics?",
    "expected_output": "The United States topped the medal table with 113 total medals."
  },
  # ... 4 more Olympics questions
]

The pattern is straightforward: each test case has an input (the question) and an expected_output (the ground truth). DeepEval calls these "Golden" examples:

goldens = []
for data in test_data:
    golden = Golden(
        input=data['input'],
        expected_output=data['expected_output'],
    )
    goldens.append(golden)

new_dataset = EvaluationDataset(goldens=goldens)
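
The goldens don't have to be hardcoded in the script, either. Version 1's physics questions lived in a .jsonl file, and a loader for that format is only a few lines. A sketch; the file name and key names here are assumptions based on my own layout:

import json
from deepeval.dataset import EvaluationDataset, Golden

# Hypothetical loader: one JSON object per line with "input" and "expected_output" keys
goldens = []
with open("physics_questions.jsonl", "r") as f:
    for line in f:
        record = json.loads(line)
        goldens.append(Golden(input=record["input"],
                              expected_output=record["expected_output"]))

physics_dataset = EvaluationDataset(goldens=goldens)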

Generating Real-Time Outputs

Here's where Version 2 differs from Version 1. Instead of using hardcoded outputs, I invoked the LLM for each test case:

for golden in new_dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        expected_output=golden.expected_output,
        actual_output=chat.invoke(golden.input).content  # Real-time LLM call!
    )
    new_dataset.add_test_case(test_case)

This approach simulates a real production scenario where you're continuously evaluating live model outputs.
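
One thing worth adding for a genuinely production-like setup is basic error handling, so a single failed model call doesn't abort the whole run. A hedged variation of the same loop (the fallback string is just my own convention):

for golden in new_dataset.goldens:
    try:
        actual = chat.invoke(golden.input).content
    except Exception as exc:  # e.g. Ollama not running, or a timeout
        actual = f"[generation failed: {exc}]"  # placeholder so the case is still scored
    new_dataset.add_test_case(
        LLMTestCase(
            input=golden.input,
            expected_output=golden.expected_output,
            actual_output=actual,
        )
    )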

Running the Evaluation

The final step is beautifully simple:

evaluate(
    test_cases=new_dataset.test_cases, 
    metrics=[AnswerRelevancyMetric()]
)

That's it! DeepEval handles the rest: scoring each test case, calculating relevancy scores, and generating a comprehensive report.
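
The call above uses the metric's defaults. In practice you'll often tune the pass/fail threshold and ask the evaluation model to explain its scores. A small sketch (the 0.7 value is just an example):

metric = AnswerRelevancyMetric(
    threshold=0.7,        # scores below this mark the test case as failed
    include_reason=True,  # have the evaluation model explain each score
)

evaluate(test_cases=new_dataset.test_cases, metrics=[metric])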


Key Learnings & Insights

1. The 56% Pass Rate Taught Me More Than the 100%

Version 1's 28/50 pass rate was initially discouraging, but it revealed something crucial: LLM evaluation is hard, and that's the point. Those failures highlighted:

  • Ambiguous question phrasing
  • Incorrect expected outputs in my ground truth
  • The importance of clear, specific prompts
  • How models interpret questions differently than humans

2. Dataset Quality > Dataset Quantity

Version 2's perfect score with just 5 questions wasn't luck; it was intentional design:

  • Questions were clear and unambiguous
  • Expected outputs were concise ("Answer in one sentence")
  • The domain (Olympics facts) had verifiable ground truth
  • Token limits prevented verbose, meandering responses

Lesson: Start small, get it right, then scale.

3. Local LLMs Are Game-Changers for Learning

Running DeepSeek-R1 8B via Ollama gave me:

  • Freedom to experiment without API costs
  • Fast iteration cycles (no network latency)
  • Privacy for sensitive test data
  • Understanding of model behavior at different temperatures
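
That last point is easy to explore when the model runs locally. A rough sketch of the kind of side-by-side temperature comparison I mean (illustrative only; the prompt is arbitrary):

from langchain_ollama import ChatOllama

prompt = "In one sentence, explain why the sky is blue."

# Same model, two sampling temperatures: lower values tend to be more deterministic
for temp in (0.0, 1.0):
    model = ChatOllama(base_url="http://localhost:11434",
                       model="deepseek-r1:8b", temperature=temp)
    print(f"temperature={temp}:\n{model.invoke(prompt).content}\n")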

4. Evaluation Metrics Are Not One-Size-Fits-All

I started with AnswerRelevancyMetric, which measures whether the output addresses the input question. But DeepEval offers 14+ metrics, including:

  • Correctness: Factual accuracy
  • Hallucination: Detection of made-up information
  • Toxicity: Safety and appropriateness
  • Bias: Fairness across demographics
  • Latency: Response time
  • Context Relevancy: For RAG applications

Choosing the right metric depends entirely on your use case.
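
Several metrics can also run in the same evaluate() call, each producing its own score and pass/fail verdict. A hedged sketch stacking a few built-in ones, reusing the dataset and evaluate import from earlier (note that some metrics, such as hallucination, additionally require a context field on the test case):

from deepeval.metrics import AnswerRelevancyMetric, ToxicityMetric, BiasMetric

evaluate(
    test_cases=new_dataset.test_cases,
    metrics=[
        AnswerRelevancyMetric(threshold=0.7),
        ToxicityMetric(threshold=0.5),  # lower toxicity scores are better
        BiasMetric(threshold=0.5),      # lower bias scores are better
    ],
)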

What Version 1 Looked Like (The Evolution)

For context, here's how Version 1 worked with hardcoded outputs:

import json

# Version 1: Read from pre-generated outputs file
with open('llmoutputs.jsonl', 'r') as f:
    for idx, line in enumerate(f):
        llm_output = json.loads(line)
        test_case = LLMTestCase(
            input=new_dataset.goldens[idx].input,
            expected_output=new_dataset.goldens[idx].expected_output,
            actual_output=llm_output['actual_output']  # Hardcoded!
        )
        new_dataset.add_test_case(test_case)

This approach is useful when:

  • You're testing historical model outputs
  • You want reproducible benchmarks
  • You're comparing multiple model versions
  • You're working with expensive API calls
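
For completeness, here's how such a pre-generated outputs file might be produced in the first place, so the exact same answers can be replayed in later benchmark runs. A sketch reusing the goldens and chat objects from above:

import json

# One-off generation pass: store each model answer as one JSON object per line
with open('llmoutputs.jsonl', 'w') as f:
    for golden in new_dataset.goldens:
        record = {"actual_output": chat.invoke(golden.input).content}
        f.write(json.dumps(record) + "\n")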

What's Next in My Learning Journey

This is just the beginning! Here's what I'm exploring next:

Immediate Next Steps:

  1. Expand Metrics: Test Hallucination, Toxicity, and Bias metrics
  2. RAG Evaluation: Build a retrieval-augmented generation system and evaluate context relevancy
  3. Automated Regression Testing: Integrate DeepEval into CI/CD pipelines
  4. Comparative Analysis: Evaluate multiple models (GPT-4, Claude, Llama) on the same dataset

Longer-Term Goals:

  • Component-level testing for LLM applications
  • End-to-end testing for multi-agent systems
  • Building custom evaluation metrics for domain-specific use cases (see the sketch after this list)
  • Exploring LLM-as-a-Judge paradigms
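
On the custom-metrics goal: DeepEval supports this by subclassing its BaseMetric class. Here is a very rough sketch of the shape that takes, under my reading of the docs; the metric itself is a toy example, and you should double-check the current interface before relying on it.

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class ContainsExpectedAnswerMetric(BaseMetric):
    """Toy domain-specific metric: passes if the expected answer appears verbatim."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        hit = test_case.expected_output.lower() in test_case.actual_output.lower()
        self.score = 1.0 if hit else 0.0
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # DeepEval may call the async variant; delegate to the sync logic
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Contains Expected Answer"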

Practical Takeaways for Testing Professionals

If you're coming from a traditional testing background like mine, here's what translates well (Traditional Testing --> LLM Evaluation):

  • Unit tests --> Component-level metrics (relevancy, correctness)
  • Integration tests --> End-to-end conversation flows
  • Test data management --> Golden datasets & versioning
  • Assertions --> Metric thresholds & scoring
  • Regression testing --> Continuous evaluation in CI/CD
  • Test coverage --> Metric coverage across dimensions

The mindset is the same: systematic verification, reproducible results, and continuous improvement.


Final Thoughts: It's a Great Time to Be Curious

Seven years in testing taught me that quality isn't accidental; it's engineered. The same principle applies to AI systems, but the tools and techniques are still evolving.

What excites me most about LLM evaluation is that we're building the testing discipline in real-time. There's no decades-old playbook to follow. We're figuring out what "good" looks like, what metrics matter, and how to balance automation with human judgment.

The 56% pass rate in Version 1 didn't discourage me; it energized me. It meant there's so much to learn, so many problems to solve, and so many opportunities to make AI systems more reliable, safe, and trustworthy.

If you're a testing professional curious about AI Quality Engineering, my advice is simple: start small, break things, learn fast. Build a simple evaluation pipeline like I did. You'll be surprised how quickly the concepts click once you get hands-on.

Let's Connect!

I'm documenting my entire learning journey in AI Quality Engineering and would love to connect with others on similar paths:

🔗 Connect with me on LinkedIn

What aspects of LLM evaluation are you most curious about? What challenges are you facing? Let's learn together! 🚀


This blog is part of my ongoing series on AI Quality Engineering. Stay tuned for more hands-on experiments, lessons learned, and practical guides!
