<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kshitij Gupta</title>
    <description>The latest articles on DEV Community by Kshitij Gupta (@kgup).</description>
    <link>https://dev.to/kgup</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3952161%2F25eae158-a618-4e3f-8b32-fb91364299a5.jpeg</url>
      <title>DEV Community: Kshitij Gupta</title>
      <link>https://dev.to/kgup</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kgup"/>
    <language>en</language>
    <item>
      <title>I Built an Automated LLM Evaluation Pipeline From Scratch — Here's Everything I Learned</title>
      <dc:creator>Kshitij Gupta</dc:creator>
      <pubDate>Tue, 26 May 2026 09:56:01 +0000</pubDate>
      <link>https://dev.to/kgup/i-built-an-automated-llm-evaluation-pipeline-from-scratch-heres-everything-i-learned-enh</link>
      <guid>https://dev.to/kgup/i-built-an-automated-llm-evaluation-pipeline-from-scratch-heres-everything-i-learned-enh</guid>
      <description>&lt;p&gt;&lt;em&gt;How I went from zero LLM eval experience to shipping a production-grade RAG evaluation harness using only free-tier tools — and what every design decision taught me about building AI systems that can be trusted.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Everyone Wants Eval Experience, Nobody Teaches It
&lt;/h2&gt;

&lt;p&gt;So I decided to build the infrastructure myself.&lt;/p&gt;

&lt;p&gt;The result is &lt;code&gt;llm-eval-harness&lt;/code&gt; — a fully automated evaluation pipeline for RAG and agentic LLM systems. It runs test cases against multiple providers, scores responses with an LLM judge, tracks regressions over time, and blocks deployment when quality drops. It's the kind of tooling that would be at home in a real AI startup's internal infrastructure.&lt;/p&gt;

&lt;p&gt;And it cost me nothing to build, because I used Ollama locally and Groq's free tier for everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/kshitijqwerty/llm-eval-harness" rel="noopener noreferrer"&gt;kshitijqwerty/llm-eval-harness&lt;/a&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  What the System Actually Does — End to End
&lt;/h2&gt;

&lt;p&gt;Before diving into the code, let me walk through the full lifecycle of an eval run so the pieces make sense together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Load tasks.&lt;/strong&gt; The runner reads YAML files from &lt;code&gt;evals/tasks/&lt;/code&gt;. Each file contains a list of eval cases — a query, an optional source document, the eval mode (&lt;code&gt;direct&lt;/code&gt; or &lt;code&gt;rag&lt;/code&gt;), and metadata like expected topics. There are currently three task files: &lt;code&gt;sample_rag.yaml&lt;/code&gt; (2 cases), &lt;code&gt;rag_support.yaml&lt;/code&gt; (4 cases, customer support domain), and &lt;code&gt;medical_faq.yaml&lt;/code&gt; (5 cases, medical FAQ domain).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Set up RAG context (if needed).&lt;/strong&gt; For &lt;code&gt;rag&lt;/code&gt; mode cases, the runner calls &lt;code&gt;harness/rag.py&lt;/code&gt; to ingest the source document into ChromaDB, chunk it, embed it with &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;, and build a retrieval pipeline. For &lt;code&gt;direct&lt;/code&gt; mode, this step is skipped and the query goes straight to the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Run cases across providers.&lt;/strong&gt; The runner iterates over every combination of eval case × provider. For each pair, it calls the appropriate adapter (&lt;code&gt;OllamaAdapter&lt;/code&gt; or &lt;code&gt;GroqAdapter&lt;/code&gt;), gets a response, and records the latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Score with LLM-as-judge.&lt;/strong&gt; Each (query, response, source document) triple is sent to the judge model, Groq's &lt;code&gt;llama3-8b-8192&lt;/code&gt;, which returns scores for faithfulness, relevance, and hallucination on a 0–1 scale. Higher is better for all three: a hallucination score of 1.0 means no hallucination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Persist to Postgres.&lt;/strong&gt; Each &lt;code&gt;EvalResult&lt;/code&gt; is written to the database as an &lt;code&gt;EvalResultRow&lt;/code&gt;, grouped under an &lt;code&gt;EvalRun&lt;/code&gt; record. This is what enables regression detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Regression check.&lt;/strong&gt; The runner calls &lt;code&gt;harness/regression.py&lt;/code&gt; to compare this run's scores against the previous run for the same task. If any metric drops by more than the CI threshold (0.25), the regression report overrides the CI gate and &lt;code&gt;main.py&lt;/code&gt; exits with code 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Generate report.&lt;/strong&gt; &lt;code&gt;harness/reporter.py&lt;/code&gt; renders the Jinja2 HTML template with score bars, per-model tables, and a CI gate banner showing pass/fail status.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 8: Expose through API.&lt;/strong&gt; All of this is accessible through the FastAPI dashboard — you can trigger runs, watch live logs, compare runs, and download reports through a browser.&lt;/p&gt;

&lt;p&gt;Here's the full architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│                  YAML Task Files                │
│ sample_rag.yaml / rag_support.yaml / medical_faq│
└────────────────────┬────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────┐
│               harness/runner.py                 │
│  Orchestrates: cases × providers                │
│  Modes: direct / rag                            │
└──────┬──────────────────────┬───────────────────┘
       │                      │
       ▼                      ▼
┌──────────────┐    ┌─────────────────────────────┐
│harness/rag.py│    │      harness/models.py      │
│ChromaDB      │    │  BaseLLMAdapter             │
│Embeddings    │    │  OllamaAdapter / GroqAdapter│
│Retrieval     │    │  OpenAIAdapter              │
└─────┬────────┘    └─────────────┬───────────────┘
      └──────────┬────────────────┘
                 ▼
     ┌────────────────────────┐
     │    harness/metrics.py  │
     │  LLM-as-judge scoring  │
     │  faithfulness          │
     │  relevance             │
     │  hallucination         │
     │  latency               │
     │  EvalResult dataclass  │
     └────────────┬───────────┘
                  │
        ┌─────────┴──────────┐
        ▼                    ▼
┌───────────────┐   ┌─────────────────────┐
│ harness/db.py │   │harness/regression.py│
│ EvalRun       │   │MetricDiff           │
│ EvalResultRow │   │RegressionReport     │
│ Postgres      │   │compute_diff()       │
└──────┬────────┘   └──────────┬──────────┘
       │                       │
       ▼                       ▼
┌────────────────────────────────────────┐
│          harness/reporter.py           │
│     Jinja2 HTML report generation      │
│     Score bars, CI gate banner         │
└──────────────────┬─────────────────────┘
                   │
                   ▼
┌────────────────────────────────────────┐
│           api/main.py (FastAPI)        │
│  /runs  /runs/{id}/diff  /trigger      │
│  /runs/{id}/report  /jobs/{id}/logs    │
└────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Deep Dive: Every File and What It Does
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;harness/models.py&lt;/code&gt; — The Adapter Layer
&lt;/h3&gt;

&lt;p&gt;This is the most architecturally important file in the project. The adapter pattern solves a real problem: eval logic shouldn't care which model it's talking to. Different providers have different SDKs, different auth mechanisms, different response formats. Without abstraction, you'd have &lt;code&gt;if provider == "groq": ... elif provider == "ollama": ...&lt;/code&gt; scattered throughout the runner, and adding a new provider would mean touching every file.&lt;/p&gt;

&lt;p&gt;Instead, &lt;code&gt;BaseLLMAdapter&lt;/code&gt; defines a single contract:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;abstractmethod&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BaseLLMAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Send a prompt, get a string response.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return the model identifier for logging.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each concrete adapter implements this interface. &lt;code&gt;OllamaAdapter&lt;/code&gt; hits the local Ollama HTTP API. &lt;code&gt;GroqAdapter&lt;/code&gt; uses the &lt;code&gt;groq&lt;/code&gt; Python client with an API key from the environment. &lt;code&gt;OpenAIAdapter&lt;/code&gt; is there for when you eventually want to benchmark against GPT-4o — same interface, no runner changes needed.&lt;/p&gt;

&lt;p&gt;The registry function &lt;code&gt;get_adapter(name: str)&lt;/code&gt; maps string names to adapter instances, so the YAML task files and CLI flags can specify providers as strings without importing anything directly.&lt;/p&gt;
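&lt;p&gt;A minimal sketch of how such a registry could look. The &lt;code&gt;EchoAdapter&lt;/code&gt; class and the &lt;code&gt;_ADAPTERS&lt;/code&gt; dict are hypothetical stand-ins that keep the example runnable offline; the real project would register its Ollama, Groq, and OpenAI adapters instead, and its implementation may differ.&lt;/p&gt;

```python
from abc import ABC, abstractmethod


class BaseLLMAdapter(ABC):
    """Same two-method contract as shown above."""

    @abstractmethod
    def complete(self, prompt: str) -> str: ...

    @abstractmethod
    def model_name(self) -> str: ...


class EchoAdapter(BaseLLMAdapter):
    """Hypothetical offline adapter, used only to keep the sketch runnable."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

    def model_name(self) -> str:
        return "echo-v0"


# Hypothetical registry: provider name -> adapter class.
# The real project would map "ollama", "groq", and "openai" here.
_ADAPTERS: dict[str, type[BaseLLMAdapter]] = {"echo": EchoAdapter}


def get_adapter(name: str) -> BaseLLMAdapter:
    """Return a fresh adapter instance for a provider name."""
    try:
        return _ADAPTERS[name]()
    except KeyError:
        known = ", ".join(sorted(_ADAPTERS))
        raise ValueError(f"unknown provider {name!r}; known: {known}") from None
```

&lt;p&gt;With this shape, adding a provider means writing one class and adding one dict entry; the runner only ever calls &lt;code&gt;get_adapter(name).complete(prompt)&lt;/code&gt;.&lt;/p&gt;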

&lt;p&gt;This is the same adapter pattern commonly used in multi-provider LLM tooling: the benchmark code depends on one interface, and each provider is a plug-in behind it.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;code&gt;harness/metrics.py&lt;/code&gt; — The Scoring Engine
&lt;/h3&gt;

&lt;p&gt;This is where the real evaluation work happens. The scoring pipeline uses the LLM-as-judge paradigm, which is now widely used for evaluating generative models because it handles semantic equivalence in ways string matching cannot.&lt;/p&gt;

&lt;p&gt;The judge prompt is carefully structured to elicit consistent numeric scores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;JUDGE_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are an expert evaluator for language model outputs.

Given:
- Query: {query}
- Source Document: {context}
- Model Response: {response}

Score the response on each dimension from 0.0 to 1.0:

1. FAITHFULNESS: Does the response accurately reflect the source document?
   (1.0 = fully grounded, 0.0 = contradicts the source)

2. RELEVANCE: Does the response actually answer the query?
   (1.0 = directly and completely answers, 0.0 = completely off-topic)

3. HALLUCINATION: Does the response introduce facts not in the source?
   (1.0 = no hallucination, 0.0 = heavily hallucinated)

Respond with JSON only:
{{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;faithfulness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: float, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;relevance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: float, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: float}}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
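&lt;p&gt;The judge is asked for JSON only, but models do not always comply, so the reply needs defensive parsing. The sketch below is one way to do that; &lt;code&gt;parse_judge_scores&lt;/code&gt; is a hypothetical helper name, and clamping out-of-range values to [0, 1] is an assumption rather than something the harness is confirmed to do.&lt;/p&gt;

```python
import json

METRICS = ("faithfulness", "relevance", "hallucination")


def parse_judge_scores(raw: str) -> dict[str, float]:
    """Extract the three scores from a judge reply and clamp them to [0, 1]."""
    # Judges sometimes wrap the JSON in prose, so take the outermost
    # {...} span instead of parsing the whole reply.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(raw[start : end + 1])
    scores = {}
    for name in METRICS:
        value = float(data[name])  # KeyError if the judge omitted a metric
        scores[name] = min(1.0, max(0.0, value))
    return scores
```

&lt;p&gt;A reply that cannot be parsed raises an exception here, which lets the caller decide whether to retry the judge call or record the case as failed.&lt;/p&gt;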



&lt;p&gt;The &lt;code&gt;EvalResult&lt;/code&gt; dataclass captures everything about a single evaluation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
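&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;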
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EvalResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;faithfulness&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;relevance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;hallucination&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;       &lt;span class="c1"&gt;# True if all scores above threshold
&lt;/span&gt;    &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;passed&lt;/code&gt; flag is computed by comparing each score against the CI threshold (0.25 by default): a case fails if &lt;em&gt;any&lt;/em&gt; of the three scores falls below it. Latency is measured as wall-clock time around the &lt;code&gt;adapter.complete()&lt;/code&gt; call and stored in milliseconds.&lt;/p&gt;
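&lt;p&gt;Both calculations are small enough to sketch. The helper names &lt;code&gt;timed_complete&lt;/code&gt; and &lt;code&gt;case_passed&lt;/code&gt; are hypothetical; the sketch assumes a score exactly equal to the threshold counts as a pass.&lt;/p&gt;

```python
import time

CI_THRESHOLD = 0.25  # default stated in the article


def timed_complete(complete, prompt: str) -> tuple[str, float]:
    """Call complete(prompt) and return (response, latency in milliseconds)."""
    start = time.perf_counter()
    response = complete(prompt)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return response, latency_ms


def case_passed(scores: dict[str, float], threshold: float = CI_THRESHOLD) -> bool:
    """A case passes only if every score is at or above the threshold."""
    return all(value >= threshold for value in scores.values())
```

&lt;p&gt;&lt;code&gt;time.perf_counter()&lt;/code&gt; is used rather than &lt;code&gt;time.time()&lt;/code&gt; because it is monotonic, so a system clock adjustment during a call cannot produce a negative latency.&lt;/p&gt;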

&lt;p&gt;One critical design note: the judge model and the evaluated model are intentionally different. Using the same model to judge itself risks self-preference bias, where a judge favors its own outputs. Groq's &lt;code&gt;llama3-8b-8192&lt;/code&gt; serves as judge regardless of which model is being evaluated, creating a consistent external standard.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;code&gt;harness/rag.py&lt;/code&gt; — The Retrieval Pipeline
&lt;/h3&gt;

&lt;p&gt;RAG evaluation is fundamentally different from direct prompt evaluation. For direct mode, you just send a prompt and score the answer. For RAG mode, you need to first build a retrieval system from a source document, retrieve relevant chunks, augment the prompt, &lt;em&gt;then&lt;/em&gt; score the answer — and your scoring has to account for whether the model stayed faithful to what was retrieved, not just what was in the original document.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;harness/rag.py&lt;/code&gt; handles the full pipeline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingestion:&lt;/strong&gt; Source documents (&lt;code&gt;.txt&lt;/code&gt; files in &lt;code&gt;docs/&lt;/code&gt;) are loaded, split into chunks using LangChain's &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt;, and embedded with &lt;code&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/code&gt;. The choice of &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; was deliberate — it's small (22M parameters), fast, runs locally with no API calls, and performs well on semantic similarity tasks. For a portfolio project, there's no reason to burn API budget on embeddings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage:&lt;/strong&gt; Chunks and their embeddings are stored in ChromaDB, using a file-backed persistent collection. Each collection is keyed by document name so the same document isn't re-ingested across runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval:&lt;/strong&gt; At query time, the query is embedded with the same model and ChromaDB returns the top-k most similar chunks (default k=3). These chunks become the context section of the model prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Augmented prompt:&lt;/strong&gt; The final prompt follows the standard RAG template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use the following context to answer the question.
If the answer is not in the context, say "I don't know."

Context:
{retrieved_chunks}

Question: {query}

Answer:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "I don't know" instruction is important — it's what makes hallucination detection meaningful. A model that says "I don't know" when the answer isn't in the context is behaving correctly. A model that confabulates an answer when the context is insufficient is hallucinating, and the judge will catch it.&lt;/p&gt;
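&lt;p&gt;Filling the template above is plain string formatting. In this sketch, &lt;code&gt;build_rag_prompt&lt;/code&gt; is a hypothetical helper name, and joining chunks with a blank line is an assumption about how the retrieved text is laid out.&lt;/p&gt;

```python
RAG_TEMPLATE = (
    "Use the following context to answer the question.\n"
    'If the answer is not in the context, say "I don\'t know."\n'
    "\n"
    "Context:\n"
    "{retrieved_chunks}\n"
    "\n"
    "Question: {query}\n"
    "\n"
    "Answer:"
)


def build_rag_prompt(chunks: list[str], query: str) -> str:
    """Join retrieved chunks and fill the template shown above."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return RAG_TEMPLATE.format(retrieved_chunks=context, query=query)
```

&lt;p&gt;Because the judge scores faithfulness against the context, it helps to store the exact string returned here alongside the response.&lt;/p&gt;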




&lt;h3&gt;
  
  
  &lt;code&gt;harness/runner.py&lt;/code&gt; — The Orchestration Core
&lt;/h3&gt;

&lt;p&gt;The runner is the central coordinator. It's responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loading YAML task files and deserializing them into eval case objects&lt;/li&gt;
&lt;li&gt;Iterating over the Cartesian product of cases × providers&lt;/li&gt;
&lt;li&gt;Dispatching to &lt;code&gt;harness/rag.py&lt;/code&gt; or calling the adapter directly depending on mode&lt;/li&gt;
&lt;li&gt;Collecting &lt;code&gt;EvalResult&lt;/code&gt; objects&lt;/li&gt;
&lt;li&gt;Writing results to Postgres via &lt;code&gt;harness/db.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Calling the regression check&lt;/li&gt;
&lt;li&gt;Triggering report generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The runner supports two modes specified per-case in the YAML:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;direct&lt;/code&gt; mode:&lt;/strong&gt; Query goes straight to the model. Used for general instruction-following tasks where there's no external knowledge base.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;rag&lt;/code&gt; mode:&lt;/strong&gt; Query goes through the full RAG pipeline — ingest document, retrieve chunks, augment prompt, then evaluate. Used for knowledge-grounded tasks like customer support or medical FAQ.&lt;/p&gt;

&lt;p&gt;The design keeps these modes fully separate. There's no hybrid mode, no magic inference about which to use. The task YAML is explicit, which means the eval results are unambiguous.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;code&gt;harness/regression.py&lt;/code&gt; — Catching Silent Degradations
&lt;/h3&gt;

&lt;p&gt;This is the file most portfolio eval projects skip entirely, and it's the one I'm most pleased to have built.&lt;/p&gt;

&lt;p&gt;The regression problem in ML is real and insidious: you update a model, run your eval suite, all cases pass the absolute threshold, and you ship. Three months later someone notices the product is worse. What happened? Every individual run "passed," but performance gradually drifted downward over dozens of releases.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;harness/regression.py&lt;/code&gt; addresses this by comparing the current run against the previous run for the same task, not just against an absolute threshold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
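&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;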
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MetricDiff&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;previous&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;        &lt;span class="c1"&gt;# current - previous; negative = regression
&lt;/span&gt;
&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RegressionReport&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;diffs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MetricDiff&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;has_regression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;  &lt;span class="c1"&gt;# True if any delta &amp;lt; -CI_THRESHOLD
&lt;/span&gt;    &lt;span class="n"&gt;regression_metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;compute_diff()&lt;/code&gt; queries Postgres for the most recent &lt;em&gt;previous&lt;/em&gt; run for this task+provider combination, computes the delta for each metric, and returns a &lt;code&gt;RegressionReport&lt;/code&gt;. If &lt;code&gt;has_regression&lt;/code&gt; is True, the runner overrides the normal CI gate — even if the current run's absolute scores are above threshold, a significant drop from the previous run is a regression and must block deployment.&lt;/p&gt;
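&lt;p&gt;The delta logic itself can be sketched without a database. Here &lt;code&gt;diff_scores&lt;/code&gt; is a hypothetical pure function that takes the previous and current scores as dicts; the real &lt;code&gt;compute_diff()&lt;/code&gt; reads the previous scores from Postgres and may be structured differently.&lt;/p&gt;

```python
from dataclasses import dataclass, field

CI_THRESHOLD = 0.25  # default stated in the article


@dataclass
class MetricDiff:
    metric: str
    previous: float
    current: float
    delta: float  # current - previous; negative = regression


@dataclass
class RegressionReport:
    task_id: str
    provider: str
    diffs: list[MetricDiff] = field(default_factory=list)
    has_regression: bool = False
    regression_metrics: list[str] = field(default_factory=list)


def diff_scores(task_id, provider, previous, current, threshold=CI_THRESHOLD):
    """Compare two score dicts; flag metrics that dropped by more than threshold."""
    report = RegressionReport(task_id=task_id, provider=provider)
    # Only metrics present in both runs can be compared.
    for metric in sorted(m for m in current if m in previous):
        delta = current[metric] - previous[metric]
        report.diffs.append(MetricDiff(metric, previous[metric], current[metric], delta))
        if -delta > threshold:
            report.regression_metrics.append(metric)
    report.has_regression = bool(report.regression_metrics)
    return report
```

&lt;p&gt;For example, faithfulness going from 0.9 to 0.6 still clears the absolute threshold of 0.25, but the drop of 0.3 exceeds the allowed delta, so the report flags it.&lt;/p&gt;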

&lt;p&gt;This is exactly the logic you'd implement at a real company where model quality is a shipping requirement.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;code&gt;harness/db.py&lt;/code&gt; — Persistent Run History
&lt;/h3&gt;

&lt;p&gt;SQLAlchemy models for two tables:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;EvalRun&lt;/code&gt;&lt;/strong&gt; — one record per eval run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EvalRun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;__tablename__&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_runs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;            &lt;span class="c1"&gt;# UUID
&lt;/span&gt;    &lt;span class="n"&gt;task_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;     &lt;span class="c1"&gt;# which YAML was used
&lt;/span&gt;    &lt;span class="n"&gt;started_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
    &lt;span class="n"&gt;completed_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;        &lt;span class="c1"&gt;# "running" | "passed" | "failed"
&lt;/span&gt;    &lt;span class="n"&gt;ci_passed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;EvalResultRow&lt;/code&gt;&lt;/strong&gt; — one record per case × provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EvalResultRow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;__tablename__&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;        &lt;span class="c1"&gt;# FK to EvalRun
&lt;/span&gt;    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;faithfulness&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;relevance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;hallucination&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;init_db()&lt;/code&gt; creates both tables on startup. &lt;code&gt;get_db()&lt;/code&gt; is a FastAPI dependency that yields a SQLAlchemy session and closes it when the request completes — standard SQLAlchemy session management.&lt;/p&gt;

&lt;p&gt;Storing results in Postgres rather than flat files is what enables regression detection. Without a queryable run history, you can't compute deltas.&lt;/p&gt;
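&lt;p&gt;To make the "queryable run history" point concrete, here is a sketch of a previous-run lookup. It uses an in-memory SQLite database purely as a stand-in for Postgres so the example runs anywhere; the trimmed-down schema, the sample rows, and the &lt;code&gt;previous_faithfulness&lt;/code&gt; helper are all made up for illustration.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for Postgres, illustration only
conn.executescript(
    """
    CREATE TABLE eval_runs (id TEXT PRIMARY KEY, started_at TEXT);
    CREATE TABLE eval_results (
        run_id TEXT REFERENCES eval_runs(id),
        task_id TEXT, provider TEXT, faithfulness REAL
    );
    """
)
# Made-up sample data: two runs of the same case against the same provider.
conn.executemany(
    "INSERT INTO eval_runs VALUES (?, ?)",
    [("run-1", "2026-05-01T10:00:00"), ("run-2", "2026-05-02T10:00:00")],
)
conn.executemany(
    "INSERT INTO eval_results VALUES (?, ?, ?, ?)",
    [("run-1", "refund_window", "groq", 0.9), ("run-2", "refund_window", "groq", 0.6)],
)


def previous_faithfulness(conn, task_id, provider, current_run_id):
    """Return the faithfulness score from the latest run other than the current one."""
    row = conn.execute(
        """
        SELECT r.faithfulness
        FROM eval_results AS r
        JOIN eval_runs AS e ON e.id = r.run_id
        WHERE r.task_id = ? AND r.provider = ? AND e.id != ?
        ORDER BY e.started_at DESC
        LIMIT 1
        """,
        (task_id, provider, current_run_id),
    ).fetchone()
    return None if row is None else row[0]
```

&lt;p&gt;A &lt;code&gt;None&lt;/code&gt; result means there is no baseline yet, which is the natural case for the first run of a new task.&lt;/p&gt;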




&lt;h3&gt;
  
  
  &lt;code&gt;harness/reporter.py&lt;/code&gt; + &lt;code&gt;report_template.html&lt;/code&gt; — The HTML Report
&lt;/h3&gt;

&lt;p&gt;The report is generated by rendering a Jinja2 template with the eval results. The template produces a self-contained HTML file with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A header showing run metadata (timestamp, task file, total cases, pass rate)&lt;/li&gt;
&lt;li&gt;A CI gate banner: green "PASSED" or red "FAILED" with the blocking metric and delta if it's a regression&lt;/li&gt;
&lt;li&gt;Per-provider summary tables with average scores across all cases&lt;/li&gt;
&lt;li&gt;Per-case score bars — horizontal bars scaled 0–1, color-coded (green above 0.7, yellow 0.4–0.7, red below 0.4)&lt;/li&gt;
&lt;li&gt;Regression indicators where the current score dropped significantly from the previous run&lt;/li&gt;
&lt;/ul&gt;
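&lt;p&gt;The color bands for the score bars translate into a small helper that a template could call. Both function names below are hypothetical, and treating a score of exactly 0.7 as yellow is an assumption, since the description above does not say which band the boundary belongs to.&lt;/p&gt;

```python
def score_color(score: float) -> str:
    """Map a 0-1 score to the bar color described above."""
    if score > 0.7:
        return "green"
    if score >= 0.4:
        return "yellow"
    return "red"


def bar_width_percent(score: float) -> int:
    """Clamp a score to [0, 1] and convert it to a CSS width percentage."""
    return round(min(1.0, max(0.0, score)) * 100)
```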

&lt;p&gt;The HTML report is uploaded as a GitHub Actions artifact on every run, including failures. This means you always have a browsable record of what happened, even if the run blocked CI.&lt;/p&gt;




&lt;h3&gt;
  
  
  What a Real Report Looks Like
&lt;/h3&gt;

&lt;p&gt;Here's an actual eval report generated by the harness — this is the real output from running &lt;code&gt;sample_rag_eval&lt;/code&gt; against two providers: &lt;code&gt;ollama/llama3.2&lt;/code&gt; and &lt;code&gt;groq/llama-3.1-8b-instant&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlcnnt74lq3477gyf5ut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlcnnt74lq3477gyf5ut.png" alt="HTML eval report for the sample_rag_eval run, showing score bars for faithfulness, relevance, and hallucination across both providers" width="800" height="785"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;code&gt;api/main.py&lt;/code&gt; + &lt;code&gt;api/jobs.py&lt;/code&gt; — The FastAPI Layer
&lt;/h3&gt;

&lt;p&gt;The API layer wraps everything in a RESTful interface with a browser-accessible dashboard. The full route list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET  /runs                  → list all EvalRun records, newest first
GET  /runs/{id}             → single run detail
GET  /runs/{id}/results     → all EvalResultRow records for a run
GET  /runs/{id}/diff        → regression comparison vs previous run
GET  /runs/{id}/report      → serve the HTML report for a run
POST /trigger               → kick off a new eval run as a background job
GET  /jobs                  → list all in-flight and recent jobs
GET  /jobs/{id}/logs        → stream captured stdout from a job
GET  /dashboard             → the Jinja2 HTML dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;/trigger&lt;/code&gt; endpoint is the most interesting. It launches the eval runner as a subprocess from a background thread (via &lt;code&gt;threading&lt;/code&gt;), captures its stdout line by line, and stores the lines in the in-memory job tracker in &lt;code&gt;api/jobs.py&lt;/code&gt;. Each job's log is capped at 200 lines to prevent unbounded memory growth.&lt;/p&gt;
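&lt;p&gt;A rough sketch of that mechanism using only the standard library. The &lt;code&gt;JOBS&lt;/code&gt; dict, the &lt;code&gt;trigger&lt;/code&gt; and &lt;code&gt;_run_job&lt;/code&gt; names, and the choice of a &lt;code&gt;deque&lt;/code&gt; that keeps the most recent 200 lines are assumptions; the actual &lt;code&gt;api/jobs.py&lt;/code&gt; may enforce the cap differently.&lt;/p&gt;

```python
import subprocess
import sys
import threading
import uuid
from collections import deque

MAX_LOG_LINES = 200  # cap stated in the article

# Hypothetical in-memory job store: job_id -> {"status": ..., "logs": deque}
JOBS: dict[str, dict] = {}


def _run_job(job_id: str, cmd: list[str]) -> None:
    """Run cmd, appending each stdout line to the job's bounded log."""
    job = JOBS[job_id]
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    for line in proc.stdout:
        job["logs"].append(line.rstrip("\n"))  # deque drops the oldest line when full
    job["status"] = "passed" if proc.wait() == 0 else "failed"


def trigger(cmd: list[str]) -> tuple[str, threading.Thread]:
    """Register a job and start it on a background thread."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "running", "logs": deque(maxlen=MAX_LOG_LINES)}
    thread = threading.Thread(target=_run_job, args=(job_id, cmd), daemon=True)
    thread.start()
    return job_id, thread
```

&lt;p&gt;The HTTP handler can return the job ID immediately while the thread keeps reading output, which is what lets a dashboard poll for logs during the run. Mapping the exit code to "passed" or "failed" matches the runner exiting with code 1 when the CI gate blocks.&lt;/p&gt;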

&lt;p&gt;The dashboard (&lt;code&gt;api/templates/dashboard.html&lt;/code&gt;) polls &lt;code&gt;/jobs&lt;/code&gt; every 2 seconds via &lt;code&gt;fetch()&lt;/code&gt;. While a job is running, it shows a pulsing dot indicator next to the job ID. When the job completes, it auto-refreshes the runs table so the new results appear without a manual reload. The trigger button disables itself while any job is running — a detail that matters when you're demoing, since double-triggering would corrupt the regression baseline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The in-memory job store is an intentional tradeoff.&lt;/strong&gt; Using &lt;code&gt;threading&lt;/code&gt; and a plain Python dict means the job history resets on server restart. A production system would use a persistent queue (Celery + Redis, or Postgres-backed). But for a portfolio project, the in-memory approach is honest about what it is, easy to understand, and sufficient for the use case. The comment in the code says exactly this.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Eval Task Files
&lt;/h2&gt;

&lt;p&gt;Three YAML files define the evaluation suite:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;evals/tasks/sample_rag.yaml&lt;/code&gt;&lt;/strong&gt; (2 cases, direct mode) — basic sanity checks for direct prompt answering. Used during development to verify the pipeline works before adding RAG complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;evals/tasks/rag_support.yaml&lt;/code&gt;&lt;/strong&gt; (4 cases, RAG mode) — customer support domain. Source document is &lt;code&gt;docs/support_policy.txt&lt;/code&gt;, a synthetic policy document covering refunds, shipping, account management, and product returns. Cases test whether models can retrieve and accurately answer policy questions without hallucinating policy terms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;evals/tasks/medical_faq.yaml&lt;/code&gt;&lt;/strong&gt; (5 cases, RAG mode) — medical FAQ domain. Source document is &lt;code&gt;docs/medical_faq.txt&lt;/code&gt;, a synthetic FAQ covering common medical questions. This domain is intentionally higher-stakes — hallucination in medical contexts has real consequences, and the scoring weights reflect that.&lt;/p&gt;

&lt;p&gt;The YAML schema is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;refund_window&lt;/span&gt;         &lt;span class="c1"&gt;# unique identifier for regression tracking&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rag&lt;/span&gt;                 &lt;span class="c1"&gt;# "rag" or "direct"&lt;/span&gt;
  &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;refund&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;window&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;purchases?"&lt;/span&gt;
  &lt;span class="na"&gt;document&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docs/support_policy.txt&lt;/span&gt;
  &lt;span class="na"&gt;expected_topics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="c1"&gt;# soft hints for the judge, not hard assertions&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;30&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;days"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;original&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;method"&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                     &lt;span class="c1"&gt;# for filtering and grouping in reports&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;policy&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;refunds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;expected_topics&lt;/code&gt; field is passed to the judge prompt as context about what a correct answer should contain. It's not a hard assertion — the judge uses its own reasoning — but it helps calibrate the scoring for domain-specific terminology.&lt;/p&gt;
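
&lt;p&gt;One way this could look, as a sketch only: the function name &lt;code&gt;build_judge_prompt&lt;/code&gt; and the prompt wording below are assumptions for illustration, not the project's actual judge prompt. The case dict mirrors the YAML example above.&lt;/p&gt;

```python
def build_judge_prompt(case, response):
    """Render a judge prompt that includes expected_topics as soft hints."""
    topics = case.get("expected_topics", [])
    lines = [
        "You are grading an answer to a user question.",
        f"Question: {case['query']}",
        f"Answer to grade: {response}",
    ]
    if topics:
        # Hints only: the judge is told they are not strict requirements.
        lines.append("A correct answer would typically mention (hints, not strict requirements):")
        lines.extend(f"- {topic}" for topic in topics)
    lines.append("Use your own judgment and return a score between 0 and 1.")
    return "\n".join(lines)


# Case dict with the same fields as the YAML task entry.
case = {
    "id": "refund_window",
    "mode": "rag",
    "query": "What is the refund window for purchases?",
    "expected_topics": ["30 days", "original payment method"],
}

prompt = build_judge_prompt(case, "Refunds are accepted within 30 days.")
print(prompt)
```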




&lt;h2&gt;
  
  
  The CI/CD Pipeline in Detail
&lt;/h2&gt;

&lt;p&gt;The GitHub Actions workflow is where the project moves from "toy" to "something I'd actually use."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLM Eval CI&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;eval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                        &lt;span class="c1"&gt;# job id and runner are assumptions&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eval_db&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                 &lt;span class="c1"&gt;# expose the service on localhost:5432&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;5432:5432&lt;/span&gt;
        &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;-&lt;/span&gt;
          &lt;span class="s"&gt;--health-cmd pg_isready&lt;/span&gt;
          &lt;span class="s"&gt;--health-interval 10s&lt;/span&gt;
          &lt;span class="s"&gt;--health-timeout 5s&lt;/span&gt;
          &lt;span class="s"&gt;--health-retries 5&lt;/span&gt;

    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql://postgres:postgres@localhost:5432/eval_db&lt;/span&gt;
      &lt;span class="na"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GROQ_API_KEY }}&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.11'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install -r requirements.txt&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python main.py&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/upload-artifact@v4&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always()&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eval-report-${{ github.sha }}&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;reports/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth unpacking:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Postgres sidecar&lt;/strong&gt; uses &lt;code&gt;--health-cmd pg_isready&lt;/code&gt; with retry logic. Without this, the workflow would try to connect to Postgres before it's ready and fail with a connection error. The health check ensures the DB is accepting connections before &lt;code&gt;python main.py&lt;/code&gt; runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;DATABASE_URL&lt;/code&gt; is set inline&lt;/strong&gt; as an environment variable rather than coming from a &lt;code&gt;.env&lt;/code&gt; file, because there's no &lt;code&gt;.env&lt;/code&gt; file in CI. The &lt;code&gt;GROQ_API_KEY&lt;/code&gt; comes from GitHub Secrets, which is the correct pattern for sensitive credentials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;if: always()&lt;/code&gt; on the artifact upload&lt;/strong&gt; means the HTML report is uploaded whether the run passed or failed. This is critical: if a regression blocks CI, you need to be able to see &lt;em&gt;why&lt;/em&gt; it failed. Without &lt;code&gt;always()&lt;/code&gt;, a failed run produces no artifact and you're debugging blind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;sys.exit(1)&lt;/code&gt; in &lt;code&gt;main.py&lt;/code&gt;&lt;/strong&gt; is what makes this a real CI gate. If the regression check returns &lt;code&gt;has_regression: True&lt;/code&gt;, &lt;code&gt;main.py&lt;/code&gt; exits with a non-zero code and the GitHub Actions step fails. With the check marked as required in branch protection, the PR is blocked. This is exactly how you'd enforce model quality as a deployment requirement.&lt;/p&gt;
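
&lt;p&gt;A rough sketch of such a gate, assuming per-case scores are kept as plain dicts and the threshold applies to the drop on each case. The function names &lt;code&gt;check_regression&lt;/code&gt; and &lt;code&gt;gate&lt;/code&gt;, the &lt;code&gt;drops&lt;/code&gt; field, and the sample scores are hypothetical. Only the &lt;code&gt;has_regression&lt;/code&gt; flag and the 0.25 threshold come from the article.&lt;/p&gt;

```python
THRESHOLD = 0.25  # the fixed CI threshold mentioned in the article


def check_regression(previous, current, threshold=THRESHOLD):
    """Compare per-case scores and flag any drop larger than threshold.

    previous and current map case id to score. Cases missing from the
    current run are skipped rather than treated as failures.
    """
    drops = {}
    for case_id, old_score in previous.items():
        if case_id not in current:
            continue
        delta = round(current[case_id] - old_score, 4)
        if -delta > threshold:
            drops[case_id] = delta
    return {"has_regression": bool(drops), "drops": drops}


def gate(previous, current):
    """Return the process exit code: 1 blocks CI, 0 lets it pass."""
    result = check_regression(previous, current)
    for case_id, delta in result["drops"].items():
        print(f"REGRESSION {case_id}: {delta:+.2f}")
    return 1 if result["has_regression"] else 0


# Made-up scores for illustration.
previous = {"refund_window": 0.85, "shipping_time": 0.90}
current = {"refund_window": 0.55, "shipping_time": 0.88}
exit_code = gate(previous, current)
print("exit code:", exit_code)
# A CI entry point would finish with: sys.exit(exit_code)
```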




&lt;h2&gt;
  
  
  What I Actually Learned Building This
&lt;/h2&gt;

&lt;h3&gt;
  
  
  LLM-as-judge is nondeterministic and you have to design around it
&lt;/h3&gt;

&lt;p&gt;The same model response will score differently on different runs. I saw scores swing by ±0.05 to ±0.08 on a single case across repeated runs with identical inputs. The judge samples its output stochastically by default, so its scores drift from call to call.&lt;/p&gt;

&lt;p&gt;The CI threshold of 0.25 exists partly because of this variance — a threshold too tight would cause random failures unrelated to actual quality changes. A production system would run each case multiple times and use the average, or use a confidence interval around the score. For a portfolio project, a coarse threshold is the right tradeoff.&lt;/p&gt;
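
&lt;p&gt;The multi-run idea can be sketched as follows. This is an illustration under assumptions: &lt;code&gt;noisy_judge&lt;/code&gt; is a stand-in that adds uniform noise to a fixed score, not a real LLM call, and &lt;code&gt;score_with_repeats&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
import random
import statistics

_RNG = random.Random(0)  # seeded so the demo is repeatable


def noisy_judge(response, true_score=0.80, noise=0.08):
    """Stand-in for an LLM judge: a fixed score plus uniform noise."""
    return true_score + _RNG.uniform(-noise, noise)


def score_with_repeats(judge, response, n=5):
    """Call judge n times on the same response and summarize the scores."""
    scores = [judge(response) for _ in range(n)]
    return {
        "mean": statistics.mean(scores),
        # stdev needs at least two samples
        "stdev": statistics.stdev(scores) if n > 1 else 0.0,
        "scores": scores,
    }


summary = score_with_repeats(noisy_judge, "some model answer", n=10)
print(f"mean={summary['mean']:.3f} stdev={summary['stdev']:.3f}")
```

&lt;p&gt;Comparing means instead of single scores would let the CI threshold be tightened, at the cost of N times as many judge calls.&lt;/p&gt;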

&lt;h3&gt;
  
  
  YAML task files scale better than I expected
&lt;/h3&gt;

&lt;p&gt;My initial instinct was that YAML would feel limiting — that I'd quickly want to write Python test functions with proper assertions. That never happened. The YAML format turned out to be expressive enough for every case I needed to write, and dramatically faster to author than Python test functions. Five medical FAQ cases took about 15 minutes. The equivalent in pytest would have been 45 minutes and harder to read.&lt;/p&gt;

&lt;p&gt;The non-engineer accessibility point is also real. I had a friend who isn't a programmer read through &lt;code&gt;rag_support.yaml&lt;/code&gt;, and she immediately understood what it was testing. She suggested two new cases that I added. That wouldn't have happened with Python test code.&lt;/p&gt;

&lt;h3&gt;
  
  
  The adapter pattern is worth the upfront cost
&lt;/h3&gt;

&lt;p&gt;Writing the adapter abstraction felt like over-engineering when I started. By the time I was running cases against both Ollama and Groq simultaneously, it was obviously the right call. The runner code has zero provider-specific logic. Adding &lt;code&gt;OpenAIAdapter&lt;/code&gt; took 20 minutes and zero changes to any other file. The abstraction paid for itself immediately.&lt;/p&gt;
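
&lt;p&gt;To make the shape concrete, here is a small sketch of the pattern. The interface name &lt;code&gt;ModelAdapter&lt;/code&gt;, the &lt;code&gt;generate&lt;/code&gt; method, and the two fake adapters are assumptions for illustration. They return canned strings instead of calling Ollama or Groq, and the project's real adapter interface may differ.&lt;/p&gt;

```python
from typing import Protocol


class ModelAdapter(Protocol):
    """Interface the runner depends on; names here are illustrative."""

    name: str

    def generate(self, prompt: str) -> str:
        ...


class FakeOllamaAdapter:
    """Stand-in for a local-model adapter (no network calls)."""

    name = "ollama"

    def generate(self, prompt: str) -> str:
        return f"[ollama] answer to: {prompt}"


class FakeGroqAdapter:
    """Stand-in for a hosted-model adapter (no network calls)."""

    name = "groq"

    def generate(self, prompt: str) -> str:
        return f"[groq] answer to: {prompt}"


def run_case(query: str, adapters: list[ModelAdapter]) -> dict[str, str]:
    """Send one query to every adapter; no provider-specific branches."""
    return {adapter.name: adapter.generate(query) for adapter in adapters}


results = run_case("What is the refund window?", [FakeOllamaAdapter(), FakeGroqAdapter()])
for provider, answer in results.items():
    print(provider, "->", answer)
```

&lt;p&gt;Adding a provider means writing one more class with a &lt;code&gt;name&lt;/code&gt; and a &lt;code&gt;generate&lt;/code&gt; method; &lt;code&gt;run_case&lt;/code&gt; stays untouched.&lt;/p&gt;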

&lt;h3&gt;
  
  
  Regression detection changes how you think about evaluation
&lt;/h3&gt;

&lt;p&gt;Before building &lt;code&gt;harness/regression.py&lt;/code&gt;, I was thinking about evaluation as "does this run pass?" After building it, I started thinking about evaluation as "is this run &lt;em&gt;better than&lt;/em&gt; the last run?" That's a fundamentally different mental model, and it's the right one for systems that evolve over time.&lt;/p&gt;

&lt;p&gt;A run that scores 0.72 on faithfulness is fine in isolation. But if the previous run scored 0.85, that 0.13 drop is a regression worth investigating. The absolute score tells you if you're above the floor. The delta tells you if you're moving in the right direction.&lt;/p&gt;

&lt;h3&gt;
  
  
  ChromaDB genuinely has zero friction
&lt;/h3&gt;

&lt;p&gt;I expected setting up a vector store to be a whole thing. It was not a whole thing. &lt;code&gt;chromadb.PersistentClient(path=".chroma")&lt;/code&gt; and you're done. The persistence just works. The API is clean. For anything at portfolio scale, it's the right tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tradeoffs I Made Consciously
&lt;/h2&gt;

&lt;p&gt;Every project involves tradeoffs. Here are the ones I made deliberately and why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In-memory job store instead of Redis/Celery:&lt;/strong&gt; Sufficient for a portfolio demo. Honest about the limitation. Adds zero operational complexity (no Redis container to manage). The comment in &lt;code&gt;api/jobs.py&lt;/code&gt; explains this explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single judge model instead of an ensemble:&lt;/strong&gt; Production systems often use multiple judge models and take the average to reduce variance. One judge is simpler, good enough for demonstration purposes, and consistent with how most real eval pipelines start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A small Groq-hosted judge instead of a larger model:&lt;/strong&gt; &lt;code&gt;llama3-8b-8192&lt;/code&gt; on Groq is fast and free. GPT-4o would likely be a better judge but costs money. For a portfolio project, the tradeoff is obvious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Synchronous runner instead of async:&lt;/strong&gt; Parallel provider calls would be faster. The synchronous version is easier to reason about and debug, which matters when you're building something new. Async is a straightforward upgrade path if the runner ever needs to scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fixed CI threshold instead of statistical significance testing:&lt;/strong&gt; Proper threshold calibration requires a held-out validation set and statistical testing. A fixed 0.25 threshold is a reasonable approximation for a project at this scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Build Next
&lt;/h2&gt;

&lt;p&gt;If I were productionizing this or extending it as a portfolio project:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persistent job store:&lt;/strong&gt; Replace the in-memory dict in &lt;code&gt;api/jobs.py&lt;/code&gt; with a Postgres-backed job table. Add job status transitions (queued → running → completed/failed), retry logic, and job history that survives server restarts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Async eval runner:&lt;/strong&gt; Use &lt;code&gt;asyncio&lt;/code&gt; and &lt;code&gt;httpx&lt;/code&gt; for concurrent provider calls. The current runner makes its calls sequentially; calling all providers for a case concurrently would save roughly (number of providers - 1) × (average latency per call) on each case, and that saving multiplies across every case in the run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eval case versioning:&lt;/strong&gt; Track which git commit's task files produced which results. Right now, changing a YAML task file breaks regression baselines silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom judge prompts per domain:&lt;/strong&gt; The current judge prompt is generic. Medical FAQ evaluation should weight hallucination more heavily and use domain-specific rubrics. Let task authors define judge configuration in their YAML files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dashboard authentication:&lt;/strong&gt; The current dashboard has no auth. Fine for local use, problematic if you ever expose it to the internet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-run averaging:&lt;/strong&gt; Run each case N times and report mean ± std. This would make CI thresholds much more reliable by smoothing out judge variance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming responses:&lt;/strong&gt; Add streaming support to the adapters for faster perceived latency and to enable partial scoring during generation.&lt;/p&gt;
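
&lt;p&gt;As an illustration of the async runner idea from the list above, the sketch below compares sequential and concurrent provider calls for a single case. It is a toy under stated assumptions: &lt;code&gt;call_provider&lt;/code&gt; sleeps instead of making an HTTP request (so &lt;code&gt;httpx&lt;/code&gt; is not needed here), and the provider names and 0.2 second latencies are made up. With three providers at equal latency, the concurrent version should take roughly one call's worth of time instead of three.&lt;/p&gt;

```python
import asyncio
import time


async def call_provider(name, latency):
    """Stand-in for an HTTP call to a provider; sleeps instead of using the network."""
    await asyncio.sleep(latency)
    return f"{name}: done"


async def run_case_sequential(providers):
    """Await each provider in turn, as a synchronous runner effectively does."""
    return [await call_provider(name, delay) for name, delay in providers]


async def run_case_concurrent(providers):
    """Start all provider calls at once; gather keeps results in input order."""
    return await asyncio.gather(*(call_provider(name, delay) for name, delay in providers))


def timed(coro_fn, providers):
    """Run one of the coroutines above and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(providers))
    return result, time.perf_counter() - start


providers = [("ollama", 0.2), ("groq", 0.2), ("openai", 0.2)]
seq_result, seq_time = timed(run_case_sequential, providers)
con_result, con_time = timed(run_case_concurrent, providers)
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```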




&lt;h2&gt;
  
  
  Running It Yourself
&lt;/h2&gt;

&lt;p&gt;Prerequisites: Python 3.11+, Docker (for Postgres), Ollama installed locally, a free Groq account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repo&lt;/span&gt;
git clone https://github.com/kshitijqwerty/llm-eval-harness
&lt;span class="nb"&gt;cd &lt;/span&gt;llm-eval-harness

&lt;span class="c"&gt;# Set up environment&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env: add your GROQ_API_KEY and DATABASE_URL&lt;/span&gt;

&lt;span class="c"&gt;# Install dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Pull a local model via Ollama&lt;/span&gt;
ollama pull llama3

&lt;span class="c"&gt;# Start Postgres (if using Docker)&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; eval-postgres &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgres &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;eval_db &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 5432:5432 &lt;span class="se"&gt;\&lt;/span&gt;
  postgres:15

&lt;span class="c"&gt;# Run the full eval suite&lt;/span&gt;
python main.py

&lt;span class="c"&gt;# Or start the FastAPI dashboard&lt;/span&gt;
uvicorn api.main:app &lt;span class="nt"&gt;--reload&lt;/span&gt;
&lt;span class="c"&gt;# Open http://localhost:8000/dashboard&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;.env.example&lt;/code&gt; documents all four variables: &lt;code&gt;GROQ_API_KEY&lt;/code&gt;, &lt;code&gt;DATABASE_URL&lt;/code&gt;, &lt;code&gt;OLLAMA_BASE_URL&lt;/code&gt; (defaults to &lt;code&gt;http://localhost:11434&lt;/code&gt;), and &lt;code&gt;JUDGE_MODEL&lt;/code&gt; (defaults to &lt;code&gt;llama3-8b-8192&lt;/code&gt;).&lt;/p&gt;
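
&lt;p&gt;A sketch of how these four variables might be read at startup. The helper &lt;code&gt;load_settings&lt;/code&gt; is hypothetical, and treating the two variables without a documented default as required is an assumption. The example API key is a placeholder, not a real credential.&lt;/p&gt;

```python
import os

# Defaults taken from the article; variables without one are treated as required.
DEFAULTS = {
    "OLLAMA_BASE_URL": "http://localhost:11434",
    "JUDGE_MODEL": "llama3-8b-8192",
}
REQUIRED = ("GROQ_API_KEY", "DATABASE_URL")


def load_settings(env=None):
    """Return the four settings as a dict, raising if a required one is missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    settings = {key: env[key] for key in REQUIRED}
    for key, default in DEFAULTS.items():
        settings[key] = env.get(key, default)
    return settings


# Passing a dict instead of os.environ keeps the example self-contained.
example_env = {
    "GROQ_API_KEY": "gsk_example",
    "DATABASE_URL": "postgresql://postgres:postgres@localhost:5432/eval_db",
}
print(load_settings(example_env))
```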




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;LLM evaluation is infrastructure work. It's not glamorous, it doesn't produce a flashy demo, and it requires you to think carefully about what "good" means for a given task — which is harder than it sounds.&lt;/p&gt;

&lt;p&gt;But it's also exactly the kind of work that separates teams shipping reliable AI products from teams shipping demos. The eval harness is what lets you answer "is the new model better?" with evidence instead of intuition. It's what lets you catch regressions before users do. It's what makes continuous deployment of LLM-powered features possible without flying blind.&lt;/p&gt;

&lt;p&gt;Building this project taught me more about production AI systems than any course I've taken. Not because the code is complex — it isn't — but because the &lt;em&gt;problems&lt;/em&gt; you have to think through (what do you measure? how do you measure it? what counts as a regression? how do you prevent false positives in CI?) are the same problems real teams are solving.&lt;/p&gt;

&lt;p&gt;If you're in a similar position — trying to build LLM eval experience without access to a company's internal infrastructure — I hope this project gives you a concrete starting point. Fork it, extend it, break it, and rebuild it better.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Feedback, issues, and PRs welcome at &lt;a href="https://github.com/kshitijqwerty/llm-eval-harness" rel="noopener noreferrer"&gt;github.com/kshitijqwerty/llm-eval-harness&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>python</category>
      <category>devops</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
