Kshitij Gupta

Posted on May 26

I Built an Automated LLM Evaluation Pipeline From Scratch — Here's Everything I Learned

#llm #python #devops #machinelearning

How I went from zero LLM eval experience to shipping a production-grade RAG evaluation harness using only free-tier tools — and what every design decision taught me about building AI systems that can be trusted.

The Problem: Everyone Wants Eval Experience, Nobody Teaches It

So I decided to build the infrastructure myself.

The result is llm-eval-harness — a fully automated evaluation pipeline for RAG and agentic LLM systems. It runs test cases against multiple providers, scores responses with an LLM judge, tracks regressions over time, and blocks deployment when quality drops. It's the kind of tooling that would be at home in a real AI startup's internal infrastructure.

And it cost me nothing to build, because I used Ollama locally and Groq's free tier for everything.

GitHub: kshitijqwerty/llm-eval-harness

What the System Actually Does — End to End

Before diving into the code, let me walk through the full lifecycle of an eval run so the pieces make sense together.

Step 1: Load tasks. The runner reads YAML files from evals/tasks/. Each file contains a list of eval cases — a query, an optional source document, the eval mode (direct or rag), and metadata like expected topics. There are currently three task files: sample_rag.yaml (2 cases), rag_support.yaml (4 cases, customer support domain), and medical_faq.yaml (5 cases, medical FAQ domain).

Step 2: Set up RAG context (if needed). For rag mode cases, the runner calls harness/rag.py to ingest the source document into ChromaDB, chunk it, embed it with all-MiniLM-L6-v2, and build a retrieval pipeline. For direct mode, this step is skipped and the query goes straight to the model.

Step 3: Run cases across providers. The runner iterates over every combination of eval case × provider. For each pair, it calls the appropriate adapter (OllamaAdapter or GroqAdapter), gets a response, and records the latency.

Step 4: Score with LLM-as-judge. Each (query, response, source document) triple is sent to the judge model — Groq's llama3-8b-8192 — which returns scores for faithfulness, relevance, and hallucination on a 0–1 scale.

Step 5: Persist to Postgres. Each EvalResult is written to the database as an EvalResultRow, grouped under an EvalRun record. This is what enables regression detection.

Step 6: Regression check. The runner calls harness/regression.py to compare this run's scores against the previous run for the same task. If any metric drops by more than the CI threshold (0.25), the regression report overrides the CI gate and main.py exits with code 1.

Step 7: Generate report. harness/reporter.py renders the Jinja2 HTML template with score bars, per-model tables, and a CI gate banner showing pass/fail status.

Step 8: Expose through API. All of this is accessible through the FastAPI dashboard — you can trigger runs, watch live logs, compare runs, and download reports through a browser.

Here's the full architecture:

┌─────────────────────────────────────────────────┐
│                  YAML Task Files                │
│ sample_rag.yaml / rag_support.yaml / medical_faq│
└────────────────────┬────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────┐
│               harness/runner.py                 │
│  Orchestrates: cases × providers                │
│  Modes: direct / rag                            │
└──────┬──────────────────────┬───────────────────┘
       │                      │
       ▼                      ▼
┌──────────────┐    ┌─────────────────────────────┐
│harness/rag.py│    │      harness/models.py      │
│ChromaDB      │    │  BaseLLMAdapter             │
│Embeddings    │    │  OllamaAdapter / GroqAdapter│
│Retrieval     │    │  OpenAIAdapter              │
└─────┬────────┘    └─────────────┬───────────────┘
      └──────────┬────────────────┘
                 ▼
     ┌────────────────────────┐
     │    harness/metrics.py  │
     │  LLM-as-judge scoring  │
     │  faithfulness          │
     │  relevance             │
     │  hallucination         │
     │  latency               │
     │  EvalResult dataclass  │
     └────────────┬───────────┘
                  │
        ┌─────────┴──────────┐
        ▼                    ▼
┌───────────────┐   ┌─────────────────────┐
│ harness/db.py │   │harness/regression.py│
│ EvalRun       │   │MetricDiff           │
│ EvalResultRow │   │RegressionReport     │
│ Postgres      │   │compute_diff()       │
└──────┬────────┘   └──────────┬──────────┘
       │                       │
       ▼                       ▼
┌────────────────────────────────────────┐
│          harness/reporter.py           │
│     Jinja2 HTML report generation      │
│     Score bars, CI gate banner         │
└──────────────────┬─────────────────────┘
                   │
                   ▼
┌────────────────────────────────────────┐
│           api/main.py (FastAPI)        │
│  /runs  /runs/{id}/diff  /trigger      │
│  /runs/{id}/report  /jobs/{id}/logs    │
└────────────────────────────────────────┘

Deep Dive: Every File and What It Does

`harness/models.py` — The Adapter Layer

This is the most architecturally important file in the project. The adapter pattern solves a real problem: eval logic shouldn't care which model it's talking to. Different providers have different SDKs, different auth mechanisms, different response formats. Without abstraction, you'd have if provider == "groq": ... elif provider == "ollama": ... scattered throughout the runner, and adding a new provider would mean touching every file.

Instead, BaseLLMAdapter defines a single contract:

from abc import ABC, abstractmethod

class BaseLLMAdapter(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Send a prompt, get a string response."""
        ...

    @abstractmethod
    def model_name(self) -> str:
        """Return the model identifier for logging."""
        ...

Each concrete adapter implements this interface. OllamaAdapter hits the local Ollama HTTP API. GroqAdapter uses the groq Python client with an API key from the environment. OpenAIAdapter is there for when you eventually want to benchmark against GPT-4o — same interface, no runner changes needed.

The registry function get_adapter(name: str) maps string names to adapter instances, so the YAML task files and CLI flags can specify providers as strings without importing anything directly.
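A minimal sketch of what such a registry could look like. The `EchoAdapter` class and the `_REGISTRY` dict are hypothetical stand-ins so the example runs offline; the real adapters would wrap the Ollama HTTP API and the groq client instead.

```python
from abc import ABC, abstractmethod


class BaseLLMAdapter(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

    @abstractmethod
    def model_name(self) -> str: ...


class EchoAdapter(BaseLLMAdapter):
    """Hypothetical offline adapter, used only for illustration."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

    def model_name(self) -> str:
        return "echo-v0"


# Maps provider names (as they might appear in YAML or CLI flags) to adapter classes.
_REGISTRY: dict[str, type[BaseLLMAdapter]] = {"echo": EchoAdapter}


def get_adapter(name: str) -> BaseLLMAdapter:
    """Look up an adapter class by name and return a fresh instance."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"unknown provider {name!r}; known providers: {known}") from None


adapter = get_adapter("echo")
print(adapter.model_name(), "->", adapter.complete("hello"))
```

With this shape, registering a new provider is one new class plus one dict entry, which is the property the runner relies on.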

This is the same shape of abstraction you'd expect in any production setup that benchmarks across multiple LLM providers.

`harness/metrics.py` — The Scoring Engine

This is where the real evaluation work happens. The scoring pipeline uses the LLM-as-judge paradigm, which has become a widely used approach for evaluating generative models because it handles semantic equivalence in ways string matching cannot.

The judge prompt is carefully structured to elicit consistent numeric scores:

JUDGE_PROMPT = """
You are an expert evaluator for language model outputs.

Given:
- Query: {query}
- Source Document: {context}
- Model Response: {response}

Score the response on each dimension from 0.0 to 1.0:

1. FAITHFULNESS: Does the response accurately reflect the source document?
   (1.0 = fully grounded, 0.0 = contradicts the source)

2. RELEVANCE: Does the response actually answer the query?
   (1.0 = directly and completely answers, 0.0 = completely off-topic)

3. HALLUCINATION: Does the response introduce facts not in the source?
   (1.0 = no hallucination, 0.0 = heavily hallucinated)

Respond with JSON only:
{{"faithfulness": float, "relevance": float, "hallucination": float}}
"""

The EvalResult dataclass captures everything about a single evaluation:

from dataclasses import dataclass
from datetime import datetime


@dataclass
class EvalResult:
    task_id: str
    provider: str
    model: str
    query: str
    response: str
    faithfulness: float
    relevance: float
    hallucination: float
    latency_ms: float
    passed: bool       # True if all scores above threshold
    run_id: str
    timestamp: datetime

The passed flag is computed by comparing each score against the CI threshold (0.25 by default). A case fails if any metric drops below this threshold. Latency is measured as wall-clock time around the adapter.complete() call and stored in milliseconds.
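The two rules in that paragraph can be sketched as follows. The helper names `timed_complete` and `case_passed` are illustrative assumptions, not the project's actual functions; only the 0.25 default and the "any metric below threshold fails" rule come from the description above.

```python
import time
from typing import Callable

CI_THRESHOLD = 0.25  # default value quoted in the article


def timed_complete(complete: Callable[[str], str], prompt: str) -> tuple[str, float]:
    """Call the adapter and return (response, wall-clock latency in milliseconds)."""
    start = time.perf_counter()
    response = complete(prompt)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return response, latency_ms


def case_passed(scores: dict[str, float], threshold: float = CI_THRESHOLD) -> bool:
    """A case passes only if every metric is at or above the threshold."""
    return all(value >= threshold for value in scores.values())


# str.upper stands in for adapter.complete so the example runs offline.
response, latency_ms = timed_complete(str.upper, "hello")
print(response, f"{latency_ms:.3f} ms")
print(case_passed({"faithfulness": 0.9, "relevance": 0.2, "hallucination": 0.8}))
```

`time.perf_counter()` is used here because it is monotonic, so the measured latency cannot go negative if the system clock is adjusted mid-call.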

One critical design note: the judge model and the evaluated model are intentionally different. Using the same model to judge itself would introduce self-serving bias. Groq's llama3-8b-8192 serves as judge regardless of which model is being evaluated, creating a consistent external standard.
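Judge models do not always return clean JSON, even when asked to. Here is a hedged sketch of how the judge reply could be parsed defensively; `parse_judge_scores` is a hypothetical helper, and clamping out-of-range values to [0, 1] is an assumption rather than something the project is known to do.

```python
import json
import re

METRICS = ("faithfulness", "relevance", "hallucination")


def parse_judge_scores(raw: str) -> dict[str, float]:
    """Parse the judge reply into a score dict, clamping each score to [0, 1]."""
    # Grab everything from the first '{' to the last '}' to skip any chatter
    # the judge adds around the JSON object.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("judge reply contained no JSON object")
    data = json.loads(match.group(0))
    scores = {}
    for metric in METRICS:
        value = float(data[metric])  # KeyError if the judge skipped a metric
        scores[metric] = min(1.0, max(0.0, value))
    return scores


reply = 'Sure! {"faithfulness": 0.9, "relevance": 1.2, "hallucination": 0.8}'
print(parse_judge_scores(reply))
```

Raising on a missing metric, rather than defaulting it to zero, keeps a malformed judge reply from silently looking like a quality regression.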

`harness/rag.py` — The Retrieval Pipeline

RAG evaluation is fundamentally different from direct prompt evaluation. For direct mode, you just send a prompt and score the answer. For RAG mode, you need to first build a retrieval system from a source document, retrieve relevant chunks, augment the prompt, then score the answer — and your scoring has to account for whether the model stayed faithful to what was retrieved, not just what was in the original document.

harness/rag.py handles the full pipeline:

Ingestion: Source documents (.txt files in docs/) are loaded, split into chunks using LangChain's RecursiveCharacterTextSplitter, and embedded with sentence-transformers/all-MiniLM-L6-v2. The choice of all-MiniLM-L6-v2 was deliberate — it's small (22M parameters), fast, runs locally with no API calls, and performs well on semantic similarity tasks. For a portfolio project, there's no reason to burn API budget on embeddings.

Storage: Chunks and their embeddings are stored in ChromaDB, using a file-backed persistent collection. Each collection is keyed by document name so the same document isn't re-ingested across runs.

Retrieval: At query time, the query is embedded with the same model and ChromaDB returns the top-k most similar chunks (default k=3). These chunks become the context window for the model prompt.

Augmented prompt: The final prompt follows the standard RAG template:

Use the following context to answer the question.
If the answer is not in the context, say "I don't know."

Context:
{retrieved_chunks}

Question: {query}

Answer:

The "I don't know" instruction is important — it's what makes hallucination detection meaningful. A model that says "I don't know" when the answer isn't in the context is behaving correctly. A model that confabulates an answer when the context is insufficient is hallucinating, and the judge will catch it.
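To make the retrieval and prompt-augmentation steps concrete, here is a small self-contained sketch. It swaps ChromaDB and the real embedding model for a plain-Python cosine-similarity search over made-up 2-d vectors, so `top_k`, `build_prompt`, and the sample chunks are all illustrative assumptions; only the prompt template and the default k=3 come from the description above.

```python
import math

RAG_TEMPLATE = (
    "Use the following context to answer the question.\n"
    'If the answer is not in the context, say "I don\'t know."\n\n'
    "Context:\n{retrieved_chunks}\n\n"
    "Question: {query}\n\n"
    "Answer:"
)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the text of the k chunks most similar to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    return RAG_TEMPLATE.format(retrieved_chunks="\n\n".join(chunks), query=query)


# Toy 2-d "embeddings" and invented chunk text; a real run would use
# all-MiniLM-L6-v2 vectors (384 dimensions) stored in ChromaDB.
index = [
    ("Refunds are accepted within 30 days.", [1.0, 0.0]),
    ("Shipping takes 5 business days.", [0.0, 1.0]),
    ("Refunds go to the original payment method.", [0.9, 0.1]),
]
chunks = top_k([1.0, 0.05], index, k=2)
print(build_prompt("What is the refund window?", chunks))
```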

`harness/runner.py` — The Orchestration Core

The runner is the central coordinator. It's responsible for:

Loading YAML task files and deserializing them into eval case objects
Iterating over the Cartesian product of cases × providers
Dispatching to harness/rag.py or calling the adapter directly depending on mode
Collecting EvalResult objects
Writing results to Postgres via harness/db.py
Calling the regression check
Triggering report generation

The runner supports two modes specified per-case in the YAML:

direct mode: Query goes straight to the model. Used for general instruction-following tasks where there's no external knowledge base.

rag mode: Query goes through the full RAG pipeline — ingest document, retrieve chunks, augment prompt, then evaluate. Used for knowledge-grounded tasks like customer support or medical FAQ.

The design keeps these modes fully separate. There's no hybrid mode, no magic inference about which to use. The task YAML is explicit, which means the eval results are unambiguous.
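The core loop can be sketched in a few lines. This is an assumption about the general shape, not the actual runner: `run_suite` is a hypothetical name, cases are plain dicts, and providers are bare callables standing in for adapters.

```python
from itertools import product
from typing import Callable


def run_suite(
    cases: list[dict],
    providers: dict[str, Callable[[str], str]],
    build_rag_prompt: Callable[[dict], str],
) -> list[dict]:
    """Run every case against every provider, dispatching on the case's mode."""
    results = []
    for case, (provider, complete) in product(cases, providers.items()):
        if case["mode"] == "rag":
            prompt = build_rag_prompt(case)
        elif case["mode"] == "direct":
            prompt = case["query"]
        else:
            # No hybrid mode and no guessing: an unknown mode is an error.
            raise ValueError(f"unknown mode {case['mode']!r} in case {case['id']!r}")
        results.append(
            {"task_id": case["id"], "provider": provider, "response": complete(prompt)}
        )
    return results


cases = [
    {"id": "greeting", "mode": "direct", "query": "Say hello."},
    {"id": "refund_window", "mode": "rag", "query": "What is the refund window?"},
]
providers = {"upper": str.upper, "lower": str.lower}  # stand-ins for real adapters
results = run_suite(cases, providers, lambda case: f"Context: ...\nQuestion: {case['query']}")
for row in results:
    print(row["task_id"], row["provider"])
```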

`harness/regression.py` — Catching Silent Degradations

This is the file most portfolio eval projects skip entirely, and it's the one I'm most glad I built.

The regression problem in ML is real and insidious: you update a model, run your eval suite, all cases pass the absolute threshold, and you ship. Three months later someone notices the product is worse. What happened? Every individual run "passed," but performance gradually drifted downward over dozens of releases.

harness/regression.py addresses this by comparing the current run against the previous run for the same task, not just against an absolute threshold.

from dataclasses import dataclass


@dataclass
class MetricDiff:
    metric: str
    previous: float
    current: float
    delta: float        # current - previous; negative = regression

@dataclass
class RegressionReport:
    task_id: str
    provider: str
    diffs: list[MetricDiff]
    has_regression: bool  # True if any delta < -CI_THRESHOLD
    regression_metrics: list[str]

compute_diff() queries Postgres for the most recent previous run for this task+provider combination, computes the delta for each metric, and returns a RegressionReport. If has_regression is True, the runner overrides the normal CI gate — even if the current run's absolute scores are above threshold, a significant drop from the previous run is a regression and must block deployment.
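Stripped of the database lookup, the delta logic is small. The sketch below is a simplified, hypothetical version that takes the previous and current scores as plain dicts; `diff_scores` is an illustrative name, and the score values are made up.

```python
from dataclasses import dataclass

CI_THRESHOLD = 0.25  # default value quoted in the article


@dataclass
class MetricDiff:
    metric: str
    previous: float
    current: float
    delta: float  # current - previous; negative = regression


def diff_scores(
    previous: dict[str, float],
    current: dict[str, float],
    threshold: float = CI_THRESHOLD,
) -> tuple[list[MetricDiff], list[str]]:
    """Return per-metric diffs and the names of metrics that dropped by more than threshold."""
    diffs = [
        MetricDiff(metric, previous[metric], current[metric], current[metric] - previous[metric])
        for metric in current
        if metric in previous
    ]
    regressed = [d.metric for d in diffs if d.delta < -threshold]
    return diffs, regressed


previous = {"faithfulness": 0.90, "relevance": 0.80, "hallucination": 0.85}
current = {"faithfulness": 0.60, "relevance": 0.75, "hallucination": 0.72}
diffs, regressed = diff_scores(previous, current)
print(regressed)
```

In this made-up example faithfulness drops by 0.30 and trips the gate, while the 0.13 drop in hallucination stays inside the 0.25 margin.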

This is exactly the logic you'd implement at a real company where model quality is a shipping requirement.

`harness/db.py` — Persistent Run History

SQLAlchemy models for two tables:

EvalRun — one record per eval run:

class EvalRun(Base):
    __tablename__ = "eval_runs"
    id: str            # UUID
    task_file: str     # which YAML was used
    started_at: datetime
    completed_at: datetime
    status: str        # "running" | "passed" | "failed"
    ci_passed: bool

EvalResultRow — one record per case × provider:

class EvalResultRow(Base):
    __tablename__ = "eval_results"
    id: str
    run_id: str        # FK to EvalRun
    task_id: str
    provider: str
    model: str
    faithfulness: float
    relevance: float
    hallucination: float
    latency_ms: float
    passed: bool

init_db() creates both tables on startup. get_db() is a FastAPI dependency that yields a SQLAlchemy session and closes it when the request completes — standard SQLAlchemy session management.

Storing results in Postgres rather than flat files is what enables regression detection. Without a queryable run history, you can't compute deltas.
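The "previous run" lookup is the query that flat files make awkward. Here is a rough sketch using the standard library's sqlite3 in place of Postgres and SQLAlchemy so it runs anywhere; the trimmed-down schema, the `previous_result` helper, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE eval_runs (id TEXT PRIMARY KEY, started_at TEXT);
    CREATE TABLE eval_results (
        run_id TEXT REFERENCES eval_runs(id),
        task_id TEXT, provider TEXT, faithfulness REAL
    );
    """
)
conn.executemany(
    "INSERT INTO eval_runs VALUES (?, ?)",
    [
        ("run-1", "2024-01-01T10:00:00"),
        ("run-2", "2024-01-02T10:00:00"),
        ("run-3", "2024-01-03T10:00:00"),
    ],
)
conn.executemany(
    "INSERT INTO eval_results VALUES (?, ?, ?, ?)",
    [
        ("run-1", "refund_window", "groq", 0.90),
        ("run-2", "refund_window", "groq", 0.85),
        ("run-3", "refund_window", "groq", 0.72),
    ],
)


def previous_result(conn, task_id: str, provider: str, current_run_id: str):
    """Return (run_id, faithfulness) from the latest run before the current one, or None."""
    return conn.execute(
        """
        SELECT r.run_id, r.faithfulness
        FROM eval_results AS r
        JOIN eval_runs AS e ON e.id = r.run_id
        WHERE r.task_id = ? AND r.provider = ?
          AND e.started_at < (SELECT started_at FROM eval_runs WHERE id = ?)
        ORDER BY e.started_at DESC
        LIMIT 1
        """,
        (task_id, provider, current_run_id),
    ).fetchone()


print(previous_result(conn, "refund_window", "groq", "run-3"))
```

The first run for a task has no baseline, so the helper returns `None` and the caller has to treat that as "nothing to compare against" rather than as a regression.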

`harness/reporter.py` + `report_template.html` — The HTML Report

The report is generated by rendering a Jinja2 template with the eval results. The template produces a self-contained HTML file with:

A header showing run metadata (timestamp, task file, total cases, pass rate)
A CI gate banner: green "PASSED" or red "FAILED" with the blocking metric and delta if it's a regression
Per-provider summary tables with average scores across all cases
Per-case score bars — horizontal bars scaled 0–1, color-coded (green above 0.7, yellow 0.4–0.7, red below 0.4)
Regression indicators where the current score dropped significantly from the previous run
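The score-bar rules are simple enough to sketch directly. `bar_color` and `bar_width_pct` are hypothetical helper names; treating exactly 0.7 and exactly 0.4 as yellow is my reading of "green above 0.7, yellow 0.4–0.7, red below 0.4".

```python
def bar_color(score: float) -> str:
    """Map a 0-1 score to the report's traffic-light color."""
    if score > 0.7:
        return "green"
    if score >= 0.4:
        return "yellow"
    return "red"


def bar_width_pct(score: float) -> int:
    """Width of the horizontal bar as a percentage, clamped to 0-100."""
    return round(min(1.0, max(0.0, score)) * 100)


for score in (0.92, 0.7, 0.4, 0.15):
    print(f"{score:.2f} {bar_color(score):<6} {bar_width_pct(score)}%")
```

Helpers like these could be registered as Jinja2 filters or called before rendering, so the template itself stays free of threshold logic.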

The HTML report is uploaded as a GitHub Actions artifact on every run, including failures. This means you always have a browsable record of what happened, even if the run blocked CI.

What a Real Report Looks Like

Here's an actual eval report generated by the harness — this is the real output from running sample_rag_eval against two providers: ollama/llama3.2 and groq/llama-3.1-8b-instant.

`api/main.py` + `api/jobs.py` — The FastAPI Layer

The API layer wraps everything in a RESTful interface with a browser-accessible dashboard. The full route list:

GET  /runs                  → list all EvalRun records, newest first
GET  /runs/{id}             → single run detail
GET  /runs/{id}/results     → all EvalResultRow records for a run
GET  /runs/{id}/diff        → regression comparison vs previous run
GET  /runs/{id}/report      → serve the HTML report for a run
POST /trigger               → kick off a new eval run as a background job
GET  /jobs                  → list all in-flight and recent jobs
GET  /jobs/{id}/logs        → stream captured stdout from a job
GET  /dashboard             → the Jinja2 HTML dashboard

The /trigger endpoint is the most interesting. It launches the eval runner as a subprocess from a background thread, captures stdout line by line, and stores it in the in-memory job tracker in api/jobs.py. Each job's log is capped at 200 lines to prevent unbounded memory growth.
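A minimal sketch of that pattern, assuming a plain dict as the job store and a `deque(maxlen=200)` for the cap. The names `jobs`, `start_job`, and `_pump` are hypothetical, and keeping the most recent 200 lines (rather than the first 200) is an assumption about how the cap behaves.

```python
import subprocess
import sys
import threading
from collections import deque

MAX_LOG_LINES = 200
jobs: dict[str, dict] = {}  # in-memory job store; resets on restart


def _pump(job_id: str, cmd: list[str]) -> None:
    """Run cmd, copying its stdout into the job's bounded log buffer."""
    job = jobs[job_id]
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    for line in proc.stdout:
        job["logs"].append(line.rstrip("\n"))  # deque drops the oldest line when full
    job["exit_code"] = proc.wait()
    job["status"] = "completed" if job["exit_code"] == 0 else "failed"


def start_job(job_id: str, cmd: list[str]) -> threading.Thread:
    jobs[job_id] = {"status": "running", "logs": deque(maxlen=MAX_LOG_LINES), "exit_code": None}
    thread = threading.Thread(target=_pump, args=(job_id, cmd), daemon=True)
    thread.start()
    return thread


# Demo: a child process that prints 300 lines; only the last 200 are kept.
worker = start_job("demo", [sys.executable, "-c", "for i in range(300): print(i)"])
worker.join()
print(jobs["demo"]["status"], len(jobs["demo"]["logs"]))
```

In the API, the POST handler would return the job ID immediately and the dashboard's polling would read `jobs[job_id]` while the thread is still running.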

The dashboard (api/templates/dashboard.html) polls /jobs every 2 seconds via fetch(). While a job is running, it shows a pulsing dot indicator next to the job ID. When the job completes, it auto-refreshes the runs table so the new results appear without a manual reload. The trigger button disables itself while any job is running — a detail that matters when you're demoing, since double-triggering would corrupt the regression baseline.

The in-memory job store is an intentional tradeoff. Using threading and a plain Python dict means the job history resets on server restart. A production system would use a persistent queue (Celery + Redis, or Postgres-backed). But for a portfolio project, the in-memory approach is honest about what it is, easy to understand, and sufficient for the use case. The comment in the code says exactly this.

The Eval Task Files

Three YAML files define the evaluation suite:

evals/tasks/sample_rag.yaml (2 cases, direct mode) — basic sanity checks for direct prompt answering. Used during development to verify the pipeline works before adding RAG complexity.

evals/tasks/rag_support.yaml (4 cases, RAG mode) — customer support domain. Source document is docs/support_policy.txt, a synthetic policy document covering refunds, shipping, account management, and product returns. Cases test whether models can retrieve and accurately answer policy questions without hallucinating policy terms.

evals/tasks/medical_faq.yaml (5 cases, RAG mode): medical FAQ domain. Source document is docs/medical_faq.txt, a synthetic FAQ covering common medical questions. This domain is intentionally higher-stakes: hallucination in medical contexts has real consequences, so the hallucination score is the one to watch most closely here, even though the current judge prompt scores every domain with the same generic rubric.

The YAML schema is straightforward:

- id: refund_window         # unique identifier for regression tracking
  mode: rag                 # "rag" or "direct"
  query: "What is the refund window for purchases?"
  document: docs/support_policy.txt
  expected_topics:          # soft hints for the judge, not hard assertions
    - "30 days"
    - "original payment method"
  tags:                     # for filtering and grouping in reports
    - policy
    - refunds

The expected_topics field is passed to the judge prompt as context about what a correct answer should contain. It's not a hard assertion — the judge uses its own reasoning — but it helps calibrate the scoring for domain-specific terminology.

The CI/CD Pipeline in Detail

The GitHub Actions workflow is where the project moves from "toy" to "something I'd actually use."

name: LLM Eval CI

on: [push, pull_request]

jobs:
  eval:                        # job name is illustrative
    runs-on: ubuntu-latest

    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: eval_db
        ports:
          - 5432:5432          # make the service reachable on localhost
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    env:
      DATABASE_URL: postgresql://postgres:postgres@localhost:5432/eval_db
      GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: python main.py
      # upload-artifact v3 is deprecated; v4 is the supported release
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-report-${{ github.sha }}
          path: reports/

A few things worth unpacking:

The Postgres sidecar uses --health-cmd pg_isready with retry logic. Without this, the workflow would try to connect to Postgres before it's ready and fail with a connection error. The health check ensures the DB is accepting connections before python main.py runs.

DATABASE_URL is set inline as an environment variable rather than coming from a .env file, because there's no .env file in CI. The GROQ_API_KEY comes from GitHub Secrets, which is the correct pattern for sensitive credentials.

if: always() on the artifact upload means the HTML report is uploaded whether the run passed or failed. This is critical: if a regression blocks CI, you need to be able to see why it failed. Without always(), a failed run produces no artifact and you're debugging blind.

sys.exit(1) in main.py is what makes this a real CI gate. If the regression check returns has_regression: True, main.py exits with a non-zero code, the GitHub Actions step fails, and the PR is blocked. This is exactly how you'd enforce model quality as a deployment requirement.
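The gate itself reduces to one decision. Below is a hedged sketch of that decision as a pure function; `ci_exit_code` is a hypothetical name, and the real main.py may structure this differently.

```python
def ci_exit_code(all_cases_passed: bool, has_regression: bool) -> int:
    """Return the process exit code: 0 lets CI continue, 1 blocks the PR.

    A regression blocks the run even when every absolute score is above
    the floor, mirroring the override described in the article.
    """
    return 0 if all_cases_passed and not has_regression else 1


# In main.py, a value like this would be handed to sys.exit(...).
for passed, regressed in [(True, False), (True, True), (False, False)]:
    print(f"passed={passed} regression={regressed} -> exit {ci_exit_code(passed, regressed)}")
```

Keeping the decision in a pure function makes the gate unit-testable without actually terminating the interpreter.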

What I Actually Learned Building This

LLM-as-judge is nondeterministic and you have to design around it

The same model response will score differently on different runs. I saw variance of ±0.05 to ±0.08 on a single case across repeated runs with identical inputs. This is because LLM sampling is stochastic by default.

The CI threshold of 0.25 exists partly because of this variance — a threshold too tight would cause random failures unrelated to actual quality changes. A production system would run each case multiple times and use the average, or use a confidence interval around the score. For a portfolio project, a coarse threshold is the right tradeoff.
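One way the multi-run idea could look, as a rough sketch. The `clearly_lower` check is a crude heuristic (a drop must exceed k standard deviations of the noisier run), not a proper significance test, and the sample scores are made up for illustration.

```python
from statistics import mean, stdev


def summarize(samples: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation); std is 0.0 for a single sample."""
    return mean(samples), (stdev(samples) if len(samples) > 1 else 0.0)


def clearly_lower(previous: list[float], current: list[float], k: float = 2.0) -> bool:
    """Flag a drop only when it exceeds k standard deviations of judge noise."""
    prev_mean, prev_std = summarize(previous)
    cur_mean, cur_std = summarize(current)
    return (prev_mean - cur_mean) > k * max(prev_std, cur_std)


baseline = [0.85, 0.80, 0.90]       # same case judged three times
noisy_rerun = [0.82, 0.78, 0.86]    # small dip, within judge variance
real_drop = [0.50, 0.55, 0.45]      # large dip, well outside it

print(clearly_lower(baseline, noisy_rerun))
print(clearly_lower(baseline, real_drop))
```

The cost is N times as many judge calls per case, which matters on a rate-limited free tier.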

YAML task files are better than I expected at scaling

My initial instinct was that YAML would feel limiting — that I'd quickly want to write Python test functions with proper assertions. That never happened. The YAML format turned out to be expressive enough for every case I needed to write, and dramatically faster to author than Python test functions. Five medical FAQ cases took about 15 minutes. The equivalent in pytest would have been 45 minutes and harder to read.

The non-engineer accessibility point is also real. I had a friend who isn't a programmer read through the rag_support.yaml and immediately understood what it was testing. She suggested two new cases that I added. That wouldn't have happened with Python test code.

The adapter pattern is worth the upfront cost

Writing the adapter abstraction felt like over-engineering when I started. By the time I was running cases against both Ollama and Groq simultaneously, it was obviously the right call. The runner code has zero provider-specific logic. Adding OpenAIAdapter took 20 minutes and zero changes to any other file. The abstraction paid for itself immediately.

Regression detection changes how you think about evaluation

Before building harness/regression.py, I was thinking about evaluation as "does this run pass?" After building it, I started thinking about evaluation as "is this run better than the last run?" That's a fundamentally different mental model, and it's the right one for systems that evolve over time.

A run that scores 0.72 on faithfulness is fine in isolation. But if the previous run scored 0.85, that 0.13 drop is a regression worth investigating, even though it is well inside the default 0.25 gate and would not block CI on its own. The absolute score tells you if you're above the floor. The delta tells you if you're moving in the right direction.

ChromaDB genuinely has zero friction

I expected setting up a vector store to be a whole thing. It was not a whole thing. chromadb.PersistentClient(path=".chroma") and you're done. The persistence just works. The API is clean. For anything at portfolio scale, it's the right tool.

Tradeoffs I Made Consciously

Every project involves tradeoffs. Here are the ones I made deliberately and why:

In-memory job store instead of Redis/Celery: Sufficient for a portfolio demo. Honest about the limitation. Adds zero operational complexity (no Redis container to manage). The comment in api/jobs.py explains this explicitly.

Single judge model instead of an ensemble: Production systems often use multiple judge models and take the average to reduce variance. One judge is simpler, good enough for demonstration purposes, and consistent with how most real eval pipelines start.

Groq as judge model instead of a larger model: Groq's llama3-8b-8192 is fast and free. GPT-4o would be a better judge but costs money. For a portfolio project, the tradeoff is obvious.

Synchronous runner instead of async: Parallel provider calls would be faster. The synchronous version is easier to reason about and debug, which matters when you're building something new. Async is a straightforward upgrade path if the runner ever needs to scale.

Fixed CI threshold instead of statistical significance testing: Proper threshold calibration requires a held-out validation set and statistical testing. A fixed 0.25 threshold is a reasonable approximation for a project at this scale.

What I'd Build Next

If I were productionizing this or extending it as a portfolio project:

Persistent job store: Replace the in-memory dict in api/jobs.py with a Postgres-backed job table. Add job status transitions (queued → running → completed/failed), retry logic, and job history that survives server restarts.

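Async eval runner: Use asyncio and httpx for concurrent provider calls. The current runner processes each case × provider pair sequentially; issuing the provider calls for a case concurrently would save roughly (number of providers - 1) × (average provider latency) per case, assuming the providers have similar latency and the judge calls are not the bottleneck.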

Eval case versioning: Track which git commit's task files produced which results. Right now, changing a YAML task file breaks regression baselines silently.

Custom judge prompts per domain: The current judge prompt is generic. Medical FAQ evaluation should weight hallucination more heavily and use domain-specific rubrics. Let task authors define judge configuration in their YAML files.

Dashboard authentication: The current dashboard has no auth. Fine for local use, problematic if you ever expose it to the internet.

Multi-run averaging: Run each case N times and report mean ± std. This would make CI thresholds much more reliable by smoothing out judge variance.

Streaming responses: Add streaming support to the adapters for faster perceived latency and to enable partial scoring during generation.
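As a taste of the async runner idea from the list above, here is a small sketch using only asyncio. The `fake_provider` coroutine simulates network latency with `asyncio.sleep`; a real version would await an HTTP client call instead, and all names here are hypothetical.

```python
import asyncio
import time


async def fake_provider(name: str, prompt: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for the provider's network latency
    return f"{name}: answer to {prompt!r}"


async def run_case_concurrently(prompt: str, providers: dict[str, float]) -> list[str]:
    """Send one prompt to every provider at once; results keep the providers' order."""
    calls = [fake_provider(name, prompt, delay) for name, delay in providers.items()]
    return await asyncio.gather(*calls)


providers = {"ollama": 0.05, "groq": 0.05}  # name -> simulated latency in seconds
start = time.perf_counter()
answers = asyncio.run(run_case_concurrently("refund window?", providers))
elapsed = time.perf_counter() - start
print(answers)
print(f"elapsed: {elapsed:.2f}s for {len(providers)} providers")
```

Because the two simulated calls overlap, the elapsed time should be close to one provider's latency rather than the sum of both.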

Running It Yourself

Prerequisites: Python 3.11+, Docker (for Postgres), Ollama installed locally, a free Groq account.

# Clone the repo
git clone https://github.com/kshitijqwerty/llm-eval-harness
cd llm-eval-harness

# Set up environment
cp .env.example .env
# Edit .env: add your GROQ_API_KEY and DATABASE_URL

# Install dependencies
pip install -r requirements.txt

# Pull a local model via Ollama
ollama pull llama3

# Start Postgres (if using Docker)
docker run -d \
  --name eval-postgres \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=eval_db \
  -p 5432:5432 \
  postgres:15

# Run the full eval suite
python main.py

# Or start the FastAPI dashboard
uvicorn api.main:app --reload
# Open http://localhost:8000/dashboard

The .env.example documents all four required variables: GROQ_API_KEY, DATABASE_URL, OLLAMA_BASE_URL (defaults to http://localhost:11434), and JUDGE_MODEL (defaults to llama3-8b-8192).

Closing Thoughts

LLM evaluation is infrastructure work. It's not glamorous, it doesn't produce a flashy demo, and it requires you to think carefully about what "good" means for a given task — which is harder than it sounds.

But it's also exactly the kind of work that separates teams shipping reliable AI products from teams shipping demos. The eval harness is what lets you answer "is the new model better?" with evidence instead of intuition. It's what lets you catch regressions before users do. It's what makes continuous deployment of LLM-powered features possible without flying blind.

Building this project taught me more about production AI systems than any course I've taken. Not because the code is complex — it isn't — but because the problems you have to think through (what do you measure? how do you measure it? what counts as a regression? how do you prevent false positives in CI?) are the same problems real teams are solving.

If you're in a similar position — trying to build LLM eval experience without access to a company's internal infrastructure — I hope this project gives you a concrete starting point. Fork it, extend it, break it, and rebuild it better.

Feedback, issues, and PRs welcome at github.com/kshitijqwerty/llm-eval-harness.

Top comments (1)

Harjot Singh • May 31

An eval pipeline is the unglamorous thing that separates people shipping AI on vibes from people shipping it with confidence - because without evals, "did my prompt change make it better or worse" is just a feeling. The hard-won lessons here are usually: building the eval SET is harder than the harness (garbage eval data = meaningless scores), LLM-as-judge needs its own calibration or it drifts, and you have to eval the thing you actually care about (task success), not a proxy that's easy to measure.

The payoff that makes it worth the effort: once you have a trustworthy eval, every other decision gets cheaper - you can swap models, cut costs, trim prompts, and KNOW whether quality held instead of hoping. Evals are what turn "scary change" into "measured change." That's why I treat verification/eval as core infra in Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - you can't safely route to cheaper models or change anything without a way to confirm output quality didn't slip. Excellent, genuinely useful writeup. What was the hardest part - building a representative eval set, or getting LLM-as-judge to be consistent? Those two seem to be where everyone struggles.