Michael Rakutko

Posted on Jul 5

Bulletproof design for a local LLM-as-a-judge

#ai #python #llm #softwareengineering

I build analytics for a living, so I have a reflex: don't trust a number you can't defend.

Right now, everyone is trying to build "evals" (LLM-as-a-judge). If you want to analyze complex unstructured data — whether it's a medical compliance check or raw terminal traces from a Claude Code session — you need an LLM to score it.

Teams building large frontier models need these evals to measure model quality. But if you want to keep your data private and your costs low, you run small, offline, local models. And here is the reality: small models hallucinate even more. They need an ironclad harness.

The task: score a conversation transcript against a quality rubric — a fixed checklist a domain expert wrote down beforehand, where each item is a concrete yes/no criterion ("did the speaker confirm X", "was step Y covered"). In my case that rubric has 36 items, so a perfect transcript scores 36 out of 36. That number just reflects how long the human's checklist is; the model only fills in each item's status.

When I put a local LLM in charge of grading against that fixed rubric, my reflex went off immediately. Ask a language model to grade, and it bluffs. It drops the criteria it's unsure about, marks things "done" with zero evidence, and gives a different answer every single run.

So I built a pipeline where the model never actually scores anything. It only answers small, checkable questions; the code does the judging. And it records enough telemetry to prove, on every single run, that it behaved. Here is how it works.

TL;DR: The Tech Stack & Approach

The Context: Building an LLM-as-a-judge pipeline to evaluate complex transcripts (medical, coding, etc.) locally.
The Problem: Small, offline models are unpredictable. They game the count, invent IDs, and drift.
The Solution: A deterministic harness around the LLM. About 22 small calls (the votes run in parallel), grammar constraints, code-driven counting, and CI-enforced invariants.

Why you can't just ask "score this"

The obvious version is one prompt: paste the transcript, paste the 36-item rubric, ask for a single score out of 36. It looks like it works. But under the hood, a language model left to its own devices will:

Game the count. It quietly drops criteria it's unsure about, so the denominator shifts under you and the percentage looks better than it actually is.
Claim without evidence. It marks something "done" with no quote to back it — a confident guess dressed up as a fact.
Invent structure. It returns a criterion ID or a rule that doesn't even exist in your rubric.
Drift run to run. Same transcript, different score. You built a random number generator and called it a measurement.

In this build, the model never gets to be the scorer. Everything that produces a number lives strictly in code.

The shape of the pipeline

One run is a chain of small LLM calls with deterministic code between them — about 22 calls total. Only two stages can change the final score; the rest is read-only:

Extract — pull header facts and rate data quality.
Judge ×3 — score each of the 6 sections, running three independent votes each (scoring stage).
Merge + repair — combine the votes, then verify every "done" (scoring stage).
Synthesize — generate summaries, coaching, and narratives (cosmetic stage — strictly read-only for the score).

Here's where the ~22 calls actually go:

The six ideas that make it trustworthy

1. Model judges, code counts

The model returns only a status per criterion. All arithmetic happens in code, over a fixed denominator of 36. If the pipeline skips a criterion, it counts as a 0 and stays in the denominator. The count cannot be gamed.
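A minimal sketch of the counting step, assuming statuses arrive as a dict from criterion ID to status string. The ID format, the `score` function, and the rule that only "done" earns a point are my assumptions for illustration; how "partial" is weighted is not stated above.

```python
# Hypothetical rubric IDs; the real rubric has 36 expert-written items.
RUBRIC_IDS = [f"C{i:02d}" for i in range(1, 37)]

def score(statuses: dict[str, str]) -> tuple[int, int]:
    """Count points in code over the fixed rubric, never over what the model returned."""
    unknown = set(statuses) - set(RUBRIC_IDS)
    if unknown:
        raise ValueError(f"IDs not in rubric: {sorted(unknown)}")
    # A criterion the model skipped is simply absent, so .get() scores it 0
    # while it still counts toward the denominator.
    earned = sum(1 for cid in RUBRIC_IDS if statuses.get(cid) == "done")
    return earned, len(RUBRIC_IDS)

# The model answered only 30 of the 36 criteria, all "done".
partial_run = {cid: "done" for cid in RUBRIC_IDS[:30]}
print(score(partial_run))  # (30, 36)
```

The denominator comes from `RUBRIC_IDS`, not from the model's answer, so dropping criteria can only lower the score.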

2. Vote three times

Each criterion is judged three times in parallel. A "done" status only survives if the votes agree and at least one carries a real quote. A lone confident vote loses.
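One way the merge could look, as a sketch under assumptions: I read "agree" as unanimous, use a three-value status vocabulary ("not_done", "partial", "done"), and fall back to the most conservative vote on disagreement. The `Vote` class and `merge_votes` function are hypothetical names.

```python
from dataclasses import dataclass

RANK = {"not_done": 0, "partial": 1, "done": 2}  # assumed status vocabulary

@dataclass
class Vote:
    status: str
    quote: str = ""

def merge_votes(votes: list[Vote]) -> str:
    """Merge independent votes for one criterion, conservatively."""
    # Taking the lowest-ranked status means "done" needs every vote to agree.
    merged = min((v.status for v in votes), key=RANK.__getitem__)
    if merged == "done" and not any(v.quote.strip() for v in votes):
        merged = "partial"  # agreement without a quote is still not evidence
    return merged

print(merge_votes([Vote("done", "I confirmed the dosage"), Vote("done"), Vote("done")]))  # done
print(merge_votes([Vote("done", "some quote"), Vote("partial"), Vote("not_done")]))       # not_done
print(merge_votes([Vote("done"), Vote("done"), Vote("done")]))                            # partial
```

A majority rule would be a reasonable alternative; the point is that the rule is written down in code and applied the same way on every run.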

3. Ground-or-demote

If the model says "done" but the code cannot find its exact quote in the transcript, the code downgrades it to "partial" automatically. The code enforces the evidence itself, where the model can't argue with it.

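Here is the logic in a few lines:

# The model said "done"; don't believe it until a verbatim quote is found.
def ground_or_demote(criteria, transcript, ask_model_for_verbatim_quote):
    for c in criteria:
        if c.status != "done" or c.evidence:
            continue                    # only an ungrounded "done" needs repair
        quote = ask_model_for_verbatim_quote(c)
        if quote and quote in transcript:
            c.evidence = quote          # Grounded → keep "done"
        else:
            c.status = "partial"        # No proof → demote automatically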

4. Constrain the grammar

The model's output is constrained by a grammar, so a criterion ID from the wrong section, or a rule that does not exist, is impossible for the model to emit. Hallucinated IDs cannot happen; code-level validation is just a backup safety belt.
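A sketch of what such a constraint could look like when expressed as a JSON Schema, which several local inference servers can use for constrained decoding. The section names, IDs, status vocabulary, and both function names are hypothetical, and how the schema is handed to the server depends on the server, so that part is left out.

```python
import json

# Hypothetical section layout; real IDs come from the expert's rubric.
SECTIONS = {
    "opening": ["O1", "O2", "O3"],
    "closing": ["K1", "K2"],
}
STATUSES = ["done", "partial", "not_done"]  # assumed status vocabulary

def section_schema(section: str) -> dict:
    """JSON Schema allowing only this section's IDs and the known statuses."""
    return {
        "type": "object",
        "properties": {
            "results": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "id": {"enum": SECTIONS[section]},
                        "status": {"enum": STATUSES},
                        "quote": {"type": "string"},
                    },
                    "required": ["id", "status"],
                    "additionalProperties": False,
                },
            }
        },
        "required": ["results"],
        "additionalProperties": False,
    }

def validate(section: str, raw: str) -> list[dict]:
    """Backup check in code, in case the constraint was not applied."""
    results = json.loads(raw)["results"]
    for r in results:
        if r["id"] not in SECTIONS[section] or r["status"] not in STATUSES:
            raise ValueError(f"out-of-rubric output: {r}")
    return results
```

Building one schema per section is what rules out an ID from the wrong section: the `enum` for a "closing" call never contains an "opening" ID.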

5. Score-sensitive vs cosmetic layers

Every layer is strictly labeled. Only the two scoring layers may change the number. Summaries and narratives run afterward and are explicitly blocked from touching a status.
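One possible way to enforce that block in Python, as a sketch: hand cosmetic stages a read-only view of the statuses, so an accidental write raises instead of silently changing the score. The stage functions and `run_cosmetic` are hypothetical names, not the pipeline's actual code.

```python
from types import MappingProxyType

def summarize(statuses) -> str:
    """A cosmetic stage: it may read statuses but has no way to write them."""
    done = sum(1 for s in statuses.values() if s == "done")
    return f"{done} of {len(statuses)} criteria met"

def buggy_narrative(statuses) -> str:
    statuses["C01"] = "done"  # a bug: a cosmetic stage trying to change a status
    return "narrative"

def run_cosmetic(stage, statuses: dict) -> str:
    # Hand cosmetic stages a read-only view, so a write fails loudly.
    return stage(MappingProxyType(statuses))

final = {"C01": "partial", "C02": "done"}
print(run_cosmetic(summarize, final))  # 1 of 2 criteria met
try:
    run_cosmetic(buggy_narrative, final)
except TypeError as exc:
    print("blocked:", exc)
```

`MappingProxyType` only protects the mapping itself; if the values were mutable objects, they would need to be frozen or copied as well.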

6. Reuse the prompt prefix

The system prompt and the transcript go first and are byte-identical across all ~22 calls. This allows the inference server to cache them once. If the per-call instructions come first instead, the prefix differs on every call and you pay to process that large context about 22 times.
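A small sketch of the ordering, with a check that two prompts really share the expensive part. The prompt text, the `<transcript>` wrapper, and the function names are placeholders I made up; whether the shared prefix is actually cached depends on the inference server.

```python
import os

SYSTEM_PROMPT = "You check one rubric section at a time."  # placeholder text

def shared_prefix(transcript: str) -> str:
    """The part every call starts with: system prompt, then transcript."""
    return f"{SYSTEM_PROMPT}\n\n<transcript>\n{transcript}\n</transcript>\n\n"

def build_prompt(transcript: str, task: str) -> str:
    # Per-call instructions go last so the cacheable prefix never changes.
    return shared_prefix(transcript) + task

transcript = "Agent: hello\nCaller: hi"
prompts = [build_prompt(transcript, f"Judge section: {s}") for s in ("opening", "closing")]
common = os.path.commonprefix(prompts)
# The whole system prompt and transcript fall inside the common prefix.
assert common.startswith(shared_prefix(transcript))
```

A check like this is cheap to keep in the test suite, since a stray timestamp or call ID in the prefix would break caching without changing any output.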

The flight recorder — and the one invariant

Every run writes a trace next to the result: each of the ~22 calls with its output, tokens, latency, and the hash of its prompt and schema. Plus, we capture the full 36-criterion status vector at four distinct checkpoints as it moves through the pipeline:

after votes → after repair → pre-validation → validated
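A sketch of what one trace record could contain, assuming the model client is a plain callable. `traced_call`, `sha256_of`, and the record's field names are hypothetical; token counts would come from the inference server's response, so they are left out here.

```python
import hashlib
import json
import time

def sha256_of(obj) -> str:
    """Stable hash: JSON with sorted keys, so dict ordering cannot change it."""
    blob = json.dumps(obj, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def traced_call(trace: list, name: str, prompt: str, schema: dict, call_model):
    """Run one model call and append a trace record for it."""
    start = time.perf_counter()
    output = call_model(prompt, schema)  # call_model is a stand-in for your client
    trace.append({
        "call": name,
        "prompt_sha256": sha256_of(prompt),
        "schema_sha256": sha256_of(schema),
        "latency_s": round(time.perf_counter() - start, 4),
        "output": output,
    })
    return output

trace: list = []
fake_model = lambda prompt, schema: {"results": []}  # no real inference here
traced_call(trace, "judge/opening/vote1", "prompt text", {"type": "object"}, fake_model)
print(json.dumps(trace[0], indent=2))
```

Hashing the prompt and schema instead of storing them keeps the trace small while still showing when either one changed between two runs.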

Then, one invariant holds it all together:

assert status_validated == status_after_repair

The architecture claims: after the repair stage, the score is final — everything downstream is purely cosmetic. Instead of trusting that claim in a comment, the pipeline checks it on every single run.

If a downstream "cosmetic" layer silently bugs out and moves a score, the pipeline fails on that very run, while the change is still fresh in your head. Don't trust your own "this layer doesn't touch the score" code comments. Turn them into tests.
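The invariant can also report which criteria moved, which makes the failure easier to act on. A sketch, assuming each checkpoint is stored as a dict from criterion ID to status; `assert_score_frozen` is a hypothetical name.

```python
def assert_score_frozen(after_repair: dict[str, str], validated: dict[str, str]) -> None:
    """Fail the run if any status changed after the repair stage."""
    ids = sorted(set(after_repair) | set(validated))
    moved = {
        cid: (after_repair.get(cid), validated.get(cid))
        for cid in ids
        if after_repair.get(cid) != validated.get(cid)
    }
    if moved:
        raise AssertionError(f"status changed after repair: {moved}")

after_repair = {"C01": "done", "C02": "partial"}
assert_score_frozen(after_repair, dict(after_repair))  # passes silently
try:
    assert_score_frozen(after_repair, {"C01": "done", "C02": "done"})
except AssertionError as exc:
    print(exc)  # status changed after repair: {'C02': ('partial', 'done')}
```

Raising explicitly instead of using a bare `assert` keeps the check alive when Python runs with `-O`, which strips assert statements.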

The takeaway

The real fix here is the harness. Same model, same prompts; the scaffolding around them is what makes the scores trustworthy.

Let the model answer small checkable questions, do the arithmetic in code, constrain the grammar, isolate the scoring layers, and log enough to prove the system behaved. That's how an LLM stops being a black box and becomes a reliable system component you can measure, regression-test, and defend.

When something breaks, you find exactly which node failed and you patch there, instead of rewriting the entire system prompt.
