<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: archminor</title>
    <description>The latest articles on DEV Community by archminor (@archminor).</description>
    <link>https://dev.to/archminor</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3863161%2F0aa2c8eb-6fd5-4973-a957-d160aaf8fce3.jpeg</url>
      <title>DEV Community: archminor</title>
      <link>https://dev.to/archminor</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/archminor"/>
    <language>en</language>
    <item>
      <title>I needed to know if the cheaper model was good enough. So I built an LLM-as-a-Judge pipeline</title>
      <dc:creator>archminor</dc:creator>
      <pubDate>Mon, 06 Apr 2026 05:21:53 +0000</pubDate>
      <link>https://dev.to/archminor/i-needed-to-know-if-the-cheaper-model-was-good-enough-so-i-built-an-llm-as-a-judge-pipeline-4ll</link>
      <guid>https://dev.to/archminor/i-needed-to-know-if-the-cheaper-model-was-good-enough-so-i-built-an-llm-as-a-judge-pipeline-4ll</guid>
      <description>&lt;p&gt;Benchmarks are useful, but they don't really tell me whether a prompt change or cheaper model is good enough for my own workflow.&lt;/p&gt;

&lt;p&gt;I kept running into that, so I ended up building a config-driven eval pipeline: run test cases, check format/schema, use a separate LLM as judge, then generate comparison reports.&lt;/p&gt;

&lt;h2&gt;What it does&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;3-stage pipeline&lt;/strong&gt; (rough code sketch below):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inference&lt;/strong&gt; — Run your test cases against candidate models (format and schema validation runs automatically)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Judge&lt;/strong&gt; — A separate LLM scores outputs on 9 metrics (accuracy, faithfulness, completeness, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare&lt;/strong&gt; — Aggregate scores into a comparison report (JSON + Markdown)&lt;/li&gt;
&lt;/ol&gt;
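
&lt;p&gt;In code, the three stages boil down to something like this (a self-contained toy sketch with stubbed bodies, not the repo's actual implementation):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy sketch of the 3-stage flow: inference, judge, compare.
# Stubbed bodies only; illustrative, not the actual pipeline code.
import json
from statistics import mean

def run_inference(candidates, cases):
    # Stage 1: call each candidate model on every test case (stubbed here).
    return {m: [f"{m} answer to: {c}" for c in cases] for m in candidates}

def judge(outputs):
    # Stage 2: a separate LLM would score each output against the rubric (stubbed as a flat 3).
    return {m: [3 for _ in outs] for m, outs in outputs.items()}

def compare(scores):
    # Stage 3: aggregate per-model scores into a comparison report.
    return json.dumps({m: {"mean_score": mean(s)} for m, s in scores.items()}, indent=2)

cases = ["What does HTTP 404 mean?"]
print(compare(judge(run_inference(["model-a", "model-b"], cases))))
&lt;/code&gt;&lt;/pre&gt;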

&lt;p&gt;&lt;strong&gt;Key design choices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3-layer judge architecture&lt;/strong&gt; — Format, content, and expression are evaluated in separate LLM calls with no shared context. This prevents a formatting issue from biasing content scores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pairwise + absolute + hybrid modes&lt;/strong&gt; — Compare two models head-to-head, score them independently, or both.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Majority vote aggregation&lt;/strong&gt; — Run the judge multiple times and take the majority to reduce noise (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blinding&lt;/strong&gt; — Candidate labels are randomized to prevent position bias.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency mode&lt;/strong&gt; — Set &lt;code&gt;inference_repeats &amp;gt;= 2&lt;/code&gt; and the pipeline automatically switches to measuring output stability instead of quality.&lt;/li&gt;
&lt;/ul&gt;
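
&lt;p&gt;Blinding and majority vote are the easiest of these to show; a minimal sketch (my own simplification, not the repo's judge code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch of blinded pairwise judging with majority-vote aggregation.
# A simplified stand-in, not the actual judge implementation.
import random
from collections import Counter

def blind_pair(output_a, output_b):
    """Randomize which candidate the judge sees as 'A' to avoid position bias."""
    if random.random() &amp;lt; 0.5:
        return {"A": output_a, "B": output_b}, {"A": "model_a", "B": "model_b"}
    return {"A": output_b, "B": output_a}, {"A": "model_b", "B": "model_a"}

def majority_vote(verdicts):
    """Aggregate repeated judge verdicts; an exact tie stays a tie."""
    counts = Counter(verdicts).most_common()
    if len(counts) &amp;gt; 1 and counts[0][1] == counts[1][1]:
        return "tie"
    return counts[0][0]

blinded, labels = blind_pair("answer from model_a", "answer from model_b")
raw_picks = ["A", "A", "B"]                 # pretend verdicts from three judge calls
votes = [labels[p] for p in raw_picks]      # un-blind: map "A"/"B" back to real models
print(majority_vote(votes))                 # model_a
&lt;/code&gt;&lt;/pre&gt;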

&lt;p&gt;&lt;strong&gt;Multi-vendor support:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI, Azure OpenAI, Gemini (native REST), and any OpenAI-compatible endpoint (LM Studio, vLLM, etc.)&lt;/li&gt;
&lt;li&gt;Mix and match — e.g., judge with GPT, candidates on local models (example below)&lt;/li&gt;
&lt;/ul&gt;
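
&lt;p&gt;For the OpenAI-compatible case, the usual pattern is pointing an OpenAI-style client at a local base URL. Something like this (placeholder URL and model names, not the repo's client code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A candidate served locally (LM Studio, vLLM, ...) via an OpenAI-compatible API,
# while the judge uses a hosted OpenAI model. URLs and model names are placeholders.
from openai import OpenAI

candidate_client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
judge_client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = candidate_client.chat.completions.create(
    model="local-model-name",
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;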

&lt;h2&gt;What the output looks like&lt;/h2&gt;

&lt;p&gt;You get a &lt;code&gt;comparison-report.json&lt;/code&gt; with win rates, per-metric mean scores, confidence intervals, and critical issue counts. Plus a Markdown report for quick reading.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0l79qylidsmd5ge4bsja.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0l79qylidsmd5ge4bsja.jpg" alt="Evaluation Results — Sample Run" width="684" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The rubric is a standalone Markdown file with score anchors (1/3/5), bias guards, and critical issue rules. You can customize evaluation criteria by editing the rubric alone — no code changes needed.&lt;/p&gt;

&lt;h2&gt;What it's NOT&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Not a benchmark suite — you bring your own test cases&lt;/li&gt;
&lt;li&gt;Not a model training tool — it evaluates outputs, not weights&lt;/li&gt;
&lt;li&gt;Not an agent framework — it's a batch evaluation pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Tech stack&lt;/h2&gt;

&lt;p&gt;Python &amp;gt;= 3.11, Pydantic, Typer CLI. Getting started is three steps: &lt;code&gt;uv sync&lt;/code&gt;, configure &lt;code&gt;.env&lt;/code&gt;, then &lt;code&gt;uv run llm-judge run-all&lt;/code&gt;.&lt;/p&gt;
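
&lt;p&gt;Since the pipeline is config-driven and built on Pydantic, the config roughly boils down to candidates, a judge, test cases, and repeat counts. A guessed shape (field names are mine, not the actual schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Guessed shape of a run config using Pydantic; field names are illustrative,
# not the repo's actual schema.
from pydantic import BaseModel

class ModelConfig(BaseModel):
    name: str
    provider: str                  # "openai", "azure", "gemini", or "openai-compatible"
    base_url: str | None = None    # only needed for local / compatible endpoints

class EvalConfig(BaseModel):
    candidates: list[ModelConfig]
    judge: ModelConfig
    test_cases_path: str
    judge_repeats: int = 3         # majority vote across repeated judge runs
    inference_repeats: int = 1     # 2 or more switches to consistency mode
&lt;/code&gt;&lt;/pre&gt;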

&lt;p&gt;Repo: &lt;a href="https://github.com/archminor/llm-as-a-judge" rel="noopener noreferrer"&gt;archminor/llm-as-a-judge&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Curious to hear how other people are handling production LLM evals.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
