Shubham Thakur

My First LLM Evaluation Pipeline

The Context: Why I Started This Journey

After 7+ years as a Software Testing Lead, I've spent countless hours ensuring code quality, writing test cases, and building robust testing frameworks. But as AI systems started becoming ubiquitous in production environments, I found myself asking: "How do we test AI models with the same rigor we test traditional software?"
That question led me down the rabbit hole of AI Quality Engineering and LLM Evaluation. This blog documents my first hands-on experience with DeepEval, an open-source Python framework that makes testing Large Language Models as intuitive as writing unit tests.
Spoiler alert: It's both humbling and exciting! 🚀


What I Built: Two Versions, Two Approaches

I approached this learning exercise by building two versions of LLM evaluation pipelines, each teaching me different aspects of the evaluation process.

Version 1: The Physics Quiz (Reality Check Edition)

  • Dataset: 50 physics questions in .jsonl format
  • LLM Outputs: 50 hardcoded responses (simulating pre-generated outputs)
  • Evaluation Model: Azure OpenAI
  • Metric Used: Answer Relevancy
  • Results: 28/50 passed (56% pass rate) ⚠️

Version 2: The Olympics Quiz (Real-Time Edition)

  • Dataset: 5 Olympics trivia questions
  • LLM: DeepSeek-R1 8B (running locally via Ollama)
  • Evaluation Model: Azure OpenAI
  • Metric Used: Answer Relevancy
  • Results: 5/5 passed (100% pass rate) ✅

You can view my test runs at the DeepEval dashboard.

A snippet of the DeepEval dashboard


The Technical Deep Dive

Setting Up the Foundation

Here's what my Version 2 implementation looks like (the active version in my code):

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.evaluate import evaluate
from deepeval.dataset import EvaluationDataset, Golden
from langchain_ollama import ChatOllama

The beauty of DeepEval is its simplicity. You need just four key components (a minimal unit-test-style sketch follows this list):

  1. Test Cases: Input-output pairs
  2. Metrics: What you're measuring (relevancy, correctness, toxicity, etc.)
  3. Dataset: Organized collection of test cases
  4. Evaluation: The execution engine
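
To make the "as intuitive as unit tests" claim concrete, here's a minimal pytest-style sketch. This is purely illustrative and not part of my pipeline: the file name, question, and hardcoded answer are my own examples, and it assumes your evaluation model is already configured for DeepEval.

# test_single_case.py -- illustrative pytest-style sketch
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_olympics_host():
    test_case = LLMTestCase(
        input="Which city hosted the 2016 Summer Olympics?",
        actual_output="The 2016 Summer Olympics were held in Rio de Janeiro, Brazil.",
    )
    # Fails like any other unit test if the relevancy score drops below the threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])

You can run this with plain pytest or with deepeval test run test_single_case.py; either way it behaves like an ordinary unit test.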

Connecting to a Local LLM

One of the exciting parts was running DeepSeek-R1 8B locally using Ollama:

chat = ChatOllama(
    base_url="http://localhost:11434",  # local Ollama server
    model="deepseek-r1:8b",
    temperature=0.5,
    num_predict=200,  # cap the number of tokens generated per response
)

This gave me complete control over the model, with no API costs or rate limits to worry about during experimentation. Perfect for learning!
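
Before wiring the model into an evaluation loop, a quick smoke test confirms that the Ollama server is up and the model responds. A tiny sketch reusing the chat object from above (the prompt is just an example, and it assumes the model has already been pulled in Ollama):

# Sanity check: the local model should answer a trivial question
reply = chat.invoke("In one sentence: which city hosted the 2016 Summer Olympics?")
print(reply.content)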

Building the Dataset

I created a simple but effective dataset structure:

test_data = [
  {
    "input": "Which country topped the medal table at the Tokyo 2020 Olympics?",
    "expected_output": "The United States topped the medal table with 113 total medals."
  },
  # ... 4 more Olympics questions
]

The pattern is straightforward: each test case has an input (the question) and an expected_output (the ground truth). DeepEval calls these "Golden" examples:

goldens = []
for data in test_data:
    golden = Golden(
        input=data['input'],
        expected_output=data['expected_output'],
    )
    goldens.append(golden)

new_dataset = EvaluationDataset(goldens=goldens)
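
The goldens don't have to be hardcoded in the script, either. Version 1's physics questions lived in a .jsonl file, and a loader for that format is only a few lines. A sketch; the file name and key names here are assumptions based on my own layout:

import json
from deepeval.dataset import EvaluationDataset, Golden

# Hypothetical loader: one JSON object per line with "input" and "expected_output" keys
goldens = []
with open("physics_questions.jsonl", "r") as f:
    for line in f:
        record = json.loads(line)
        goldens.append(Golden(input=record["input"],
                              expected_output=record["expected_output"]))

physics_dataset = EvaluationDataset(goldens=goldens)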

Generating Real-Time Outputs

Here's where Version 2 differs from Version 1. Instead of using hardcoded outputs, I invoked the LLM for each test case:

for golden in new_dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        expected_output=golden.expected_output,
        actual_output=chat.invoke(golden.input).content  # Real-time LLM call!
    )
    new_dataset.add_test_case(test_case)

This approach simulates a real production scenario where you're continuously evaluating live model outputs.
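
One thing worth adding for a genuinely production-like setup is basic error handling, so a single failed model call doesn't abort the whole run. A hedged variation of the same loop (the fallback string is just my own convention):

for golden in new_dataset.goldens:
    try:
        actual = chat.invoke(golden.input).content
    except Exception as exc:  # e.g. Ollama not running, or a timeout
        actual = f"[generation failed: {exc}]"  # placeholder so the case is still scored
    new_dataset.add_test_case(
        LLMTestCase(
            input=golden.input,
            expected_output=golden.expected_output,
            actual_output=actual,
        )
    )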

Running the Evaluation

The final step is beautifully simple:

evaluate(
    test_cases=new_dataset.test_cases, 
    metrics=[AnswerRelevancyMetric()]
)

That's it! DeepEval handles the rest: scoring each test case, calculating relevancy scores, and generating a comprehensive report.
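
The call above uses the metric's defaults. In practice you'll often tune the pass/fail threshold and ask the evaluation model to explain its scores. A small sketch (the 0.7 value is just an example):

metric = AnswerRelevancyMetric(
    threshold=0.7,        # scores below this mark the test case as failed
    include_reason=True,  # have the evaluation model explain each score
)

evaluate(test_cases=new_dataset.test_cases, metrics=[metric])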


Key Learnings & Insights

1. The 56% Pass Rate Taught Me More Than the 100%

Version 1's 28/50 pass rate was initially discouraging, but it revealed something crucial: LLM evaluation is hard, and that's the point. Those failures highlighted:

  • Ambiguous question phrasing
  • Incorrect expected outputs in my ground truth
  • The importance of clear, specific prompts
  • How models interpret questions differently than humans

2. Dataset Quality > Dataset Quantity

Version 2's perfect score with just 5 questions wasn't luck; it was intentional design:

  • Questions were clear and unambiguous
  • Expected outputs were concise ("Answer in one sentence")
  • The domain (Olympics facts) had verifiable ground truth
  • Token limits prevented verbose, meandering responses

Lesson: Start small, get it right, then scale.

3. Local LLMs Are Game-Changers for Learning

Running DeepSeek-R1 8B via Ollama gave me:

  • Freedom to experiment without API costs
  • Fast iteration cycles (no network latency)
  • Privacy for sensitive test data
  • Understanding of model behavior at different temperatures
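
That last point is easy to explore when the model runs locally. A rough sketch of the kind of side-by-side temperature comparison I mean (illustrative only; the prompt is arbitrary):

from langchain_ollama import ChatOllama

prompt = "In one sentence, explain why the sky is blue."

# Same model, two sampling temperatures: lower values tend to be more deterministic
for temp in (0.0, 1.0):
    model = ChatOllama(base_url="http://localhost:11434",
                       model="deepseek-r1:8b", temperature=temp)
    print(f"temperature={temp}:\n{model.invoke(prompt).content}\n")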

4. Evaluation Metrics Are Not One-Size-Fits-All

I started with AnswerRelevancyMetric, which measures whether the output addresses the input question. But DeepEval offers 14+ metrics, including:

  • Correctness: Factual accuracy
  • Hallucination: Detection of made-up information
  • Toxicity: Safety and appropriateness
  • Bias: Fairness across demographics
  • Latency: Response time
  • Context Relevancy: For RAG applications

Choosing the right metric depends entirely on your use case.
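
Several metrics can also run in the same evaluate() call, each producing its own score and pass/fail verdict. A hedged sketch stacking a few built-in ones, reusing the dataset and evaluate import from earlier (note that some metrics, such as hallucination, additionally require a context field on the test case):

from deepeval.metrics import AnswerRelevancyMetric, ToxicityMetric, BiasMetric

evaluate(
    test_cases=new_dataset.test_cases,
    metrics=[
        AnswerRelevancyMetric(threshold=0.7),
        ToxicityMetric(threshold=0.5),  # lower toxicity scores are better
        BiasMetric(threshold=0.5),      # lower bias scores are better
    ],
)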

What Version 1 Looked Like (The Evolution)

For context, here's how Version 1 worked with hardcoded outputs:

import json

# Version 1: Read from pre-generated outputs file
with open('llmoutputs.jsonl', 'r') as f:
    for idx, line in enumerate(f):
        llm_output = json.loads(line)
        test_case = LLMTestCase(
            input=new_dataset.goldens[idx].input,
            expected_output=new_dataset.goldens[idx].expected_output,
            actual_output=llm_output['actual_output']  # Hardcoded!
        )
        new_dataset.add_test_case(test_case)

This approach is useful when:

  • You're testing historical model outputs
  • You want reproducible benchmarks
  • You're comparing multiple model versions
  • You're working with expensive API calls
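
For completeness, here's how such a pre-generated outputs file might be produced in the first place, so the exact same answers can be replayed in later benchmark runs. A sketch reusing the goldens and chat objects from above:

import json

# One-off generation pass: store each model answer as one JSON object per line
with open('llmoutputs.jsonl', 'w') as f:
    for golden in new_dataset.goldens:
        record = {"actual_output": chat.invoke(golden.input).content}
        f.write(json.dumps(record) + "\n")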

What's Next in My Learning Journey

This is just the beginning! Here's what I'm exploring next:

Immediate Next Steps:

  1. Expand Metrics: Test Hallucination, Toxicity, and Bias metrics
  2. RAG Evaluation: Build a retrieval-augmented generation system and evaluate context relevancy
  3. Automated Regression Testing: Integrate DeepEval into CI/CD pipelines
  4. Comparative Analysis: Evaluate multiple models (GPT-4, Claude, Llama) on the same dataset

Longer-Term Goals:

  • Component-level testing for LLM applications
  • End-to-end testing for multi-agent systems
  • Building custom evaluation metrics for domain-specific use cases (see the sketch after this list)
  • Exploring LLM-as-a-Judge paradigms
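
On the custom-metrics goal: DeepEval supports this by subclassing its BaseMetric class. Here is a very rough sketch of the shape that takes, under my reading of the docs; the metric itself is a toy example, and you should double-check the current interface before relying on it.

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class ContainsExpectedAnswerMetric(BaseMetric):
    """Toy domain-specific metric: passes if the expected answer appears verbatim."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        hit = test_case.expected_output.lower() in test_case.actual_output.lower()
        self.score = 1.0 if hit else 0.0
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # DeepEval may call the async variant; delegate to the sync logic
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Contains Expected Answer"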

Practical Takeaways for Testing Professionals

If you're coming from a traditional testing background like mine, here's what translates well (Traditional Testing --> LLM Evaluation):

  • Unit tests --> Component-level metrics (relevancy, correctness)
  • Integration tests --> End-to-end conversation flows
  • Test data management --> Golden datasets & versioning
  • Assertions --> Metric thresholds & scoring
  • Regression testing --> Continuous evaluation in CI/CD
  • Test coverage --> Metric coverage across dimensions

The mindset is the same: systematic verification, reproducible results, and continuous improvement.


Final Thoughts: It's a Great Time to Be Curious

Seven years in testing taught me that quality isn't accidental; it's engineered. The same principle applies to AI systems, but the tools and techniques are still evolving.

What excites me most about LLM evaluation is that we're building the testing discipline in real-time. There's no decades-old playbook to follow. We're figuring out what "good" looks like, what metrics matter, and how to balance automation with human judgment.

The 56% pass rate in Version 1 didn't discourage me; it energized me. It meant there's so much to learn, so many problems to solve, and so many opportunities to make AI systems more reliable, safe, and trustworthy.

If you're a testing professional curious about AI Quality Engineering, my advice is simple: start small, break things, learn fast. Build a simple evaluation pipeline like I did. You'll be surprised how quickly the concepts click once you get hands-on.

Let's Connect!

I'm documenting my entire learning journey in AI Quality Engineering and would love to connect with others on similar paths:

🔗 Connect with me on LinkedIn

What aspects of LLM evaluation are you most curious about? What challenges are you facing? Let's learn together! 🚀


This blog is part of my ongoing series on AI Quality Engineering. Stay tuned for more hands-on experiments, lessons learned, and practical guides!
