LLM-as-a-Judge: I Built One From Scratch, Then Checked It Against Humans

#ai #machinelearning #python #llm

In Part 1 the model's job was to pick one of 77 labels, so I could check it with ==. But most real LLM output isn't like that — it's a paragraph, a summary, a support reply. There's no label to compare against.

So people reach for the obvious move: use an LLM to grade the LLM. Show it a question and an answer, ask "how good is this, 1–10?", trust the number. It works shockingly well... right up until it doesn't, in ways that don't show up unless you go looking.

I built that judge from scratch and checked it against a dataset that comes with real human votes: the LMSYS Chatbot Arena conversations (via the ungated mirror agie-ai/lmsys-chatbot_arena_conversations, so the notebook runs on Kaggle from a cold start with no access request). Each row is a real user prompt, two chatbot answers, and a human verdict on which was better.

The judge is one prompt and a regex

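import re

# The whole grading instruction: one naive rubric plus a fixed reply format.
JUDGE_RUBRIC = (
    'You are grading the quality of an answer to a question. '
    'Score from 1 (terrible) to 10 (excellent) based on correctness and helpfulness. '
    'Respond in EXACTLY this format:\nSCORE: <number>\nREASON: <one short sentence>'
)

def judge(question, answer, temperature=0.0):
    """Score one answer in isolation; returns (score or NaN, raw reply)."""
    prompt = f'{JUDGE_RUBRIC}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}\n\nYour grade:'
    # generate() is the notebook's text-generation helper (not shown here).
    reply = generate(prompt, max_new_tokens=64, temperature=temperature)
    # Take the first number after "SCORE:"; NaN flags a reply that ignored the format.
    m = re.search(r'SCORE:\s*([0-9]+(?:\.[0-9]+)?)', reply)
    return (float(m.group(1)) if m else float('nan')), reply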

That's it — Qwen2.5-1.5B-Instruct reading one answer and emitting a number. The rest of the notebook is about not trusting it blindly. Note the rubric is deliberately naive ("correctness and helpfulness, 1–10") — it's the lazy version most people actually write, which is the point.
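One thing the regex does not do is check that the number is actually on the 1–10 scale, so a reply like "SCORE: 11" would pass straight through. A minimal sketch of a stricter parser follows; `parse_score` and its `lo`/`hi` parameters are my own illustrative names, not part of the notebook.

```python
import math
import re

# Same pattern as the judge above: the first number after "SCORE:".
SCORE_RE = re.compile(r'SCORE:\s*([0-9]+(?:\.[0-9]+)?)')

def parse_score(reply, lo=1.0, hi=10.0):
    """Return the score found in a judge reply, or NaN if it is missing
    or falls outside the rubric's range. (Hypothetical helper.)"""
    m = SCORE_RE.search(reply)
    if m is None:
        return float('nan')
    score = float(m.group(1))
    return score if lo <= score <= hi else float('nan')

ok = parse_score('SCORE: 8\nREASON: clear and correct')
out_of_range = parse_score('SCORE: 11\nREASON: amazing')
missing = parse_score('I think this answer is great.')
```

Treating out-of-range values as NaN keeps them visible as parse failures instead of letting them skew an average.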

Failure #1 — It barely used the scale

I had the judge score one unchanged answer eight times at a realistic temperature:

scores = [judge(sample_q, sample_a, temperature=0.7)[0] for _ in range(8)]
# [8.0, 7.0, 8.0, 7.0, 8.0, 7.0, 8.0, 8.0]
# range: 7-8 | stdev: 0.48

Two problems. First, the score isn't stable — same answer, different numbers. Second, and worse: a "1–10" judge that only ever emits 7 or 8 isn't really using a 10-point scale. It has almost no resolution to separate "good" from "great." So when you A/B-test two prompts and one scores 7.6 vs 7.9, that gap is noise dressed up as a decimal.
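To put a rough number on "noise", here is a small sketch that summarizes the eight scores above and then asks how big a 0.3 gap is relative to the run-to-run spread. It assumes, purely for illustration, that both means in the A/B comparison come from eight repeats with the same spread as this sample; `summarize` is a hypothetical helper.

```python
import math
import statistics

# The eight scores reported above for one unchanged answer.
scores = [8.0, 7.0, 8.0, 7.0, 8.0, 7.0, 8.0, 8.0]

def summarize(values):
    """Spread statistics for repeated judge scores. (Hypothetical helper.)"""
    return {
        'mean': statistics.fmean(values),
        'pstdev': statistics.pstdev(values),   # population stdev, as quoted above
        'distinct': sorted(set(values)),       # how much of the scale is used
        # standard error of the mean, from the sample stdev
        'sem': statistics.stdev(values) / math.sqrt(len(values)),
    }

summary = summarize(scores)

# Assumption: two means (7.6 and 7.9), each from 8 repeats with this spread.
se_of_difference = summary['sem'] * math.sqrt(2)
gap_in_se = (7.9 - 7.6) / se_of_difference

print(summary)
print(f'gap is about {gap_in_se:.2f} standard errors')
```

Under those assumptions the gap comes out at a little over one standard error, well short of the usual rule of thumb of about two.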

Failure #2 — It didn't agree with humans

For each pair, I scored answer A and answer B independently (the judge never sees both at once, so the order of the two answers can't bias it), took the higher score as the judge's pick, and compared that pick with the human winner:

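for p in pairs[:60]:
    # Score each answer on its own; the judge never sees the pair side by side.
    s_a, _ = judge(p['question'], p['ans_a'])
    s_b, _ = judge(p['question'], p['ans_b'])
    # Caveat: if either score failed to parse (NaN), both comparisons are False
    # and the pick falls through to 'model_b'.
    judge_pick = 'tie' if s_a == s_b else ('model_a' if s_a > s_b else 'model_b')
    ...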
# Pairs scored: 60 (judge gave equal scores on 20 of them)
# On the 40 it scored decisively, it agreed with the HUMAN winner: 26/40 = 65%

Read those two numbers together:

20 of 60 were ties — on a third of the pairs, the judge gave both answers the same score even though a human saw a clear winner. (Remember that 7–8 band? When everything scores 7 or 8, lots of things tie.) It was blind to a difference real people could see.
65% agreement on decisive calls — better than a coin flip, but it disagreed with humans on more than 1 in 3 of its confident calls.

Count the ties as misses and the judge lined up with human judgment on just 26/60 = 43% of all pairs.
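The arithmetic above can be sketched as a small tally. The helper names (`pick`, `agreement`, `wilson_interval`) are illustrative, not the notebook's, and the demo lists are synthetic: they only reproduce the reported counts (26 hits, 14 misses, 20 ties), not the real rows. The `pick` helper also keeps an unparsed (NaN) score from silently counting as a vote for one side, and the Wilson interval is my addition to show how wide the uncertainty on 26/40 is.

```python
import math

def pick(s_a, s_b):
    """Turn two pointwise scores into a verdict. (Hypothetical helper.)"""
    if math.isnan(s_a) or math.isnan(s_b):
        return 'unparsed'          # never let a parse failure look like a decision
    if s_a == s_b:
        return 'tie'
    return 'model_a' if s_a > s_b else 'model_b'

def agreement(picks, human_winners):
    """Agreement on decisive calls and on all pairs (non-decisive = miss)."""
    decisive = [(p, h) for p, h in zip(picks, human_winners)
                if p in ('model_a', 'model_b')]
    hits = sum(p == h for p, h in decisive)
    return {
        'decisive': len(decisive),
        'hits': hits,
        'decisive_rate': hits / len(decisive) if decisive else float('nan'),
        'overall_rate': hits / len(picks) if picks else float('nan'),
    }

def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = hits / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# Synthetic lists that only reproduce the reported counts.
picks = ['model_a'] * 26 + ['model_b'] * 14 + ['tie'] * 20
humans = ['model_a'] * 60

stats = agreement(picks, humans)
low, high = wilson_interval(stats['hits'], stats['decisive'])
print(stats)
print(f'95% interval on decisive agreement: {low:.2f} to {high:.2f}')
```

With only 40 decisive pairs, the interval on that 65% runs from roughly 50% to 78%, so even the "better than a coin flip" reading deserves some caution.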

The receipt that stung

The disagreement cases tell you why it fails. My favorite:

Q : When is it today?
  judge scores -> answer_a: 3, answer_b: 10  => judge picked model_b
  but HUMANS preferred: model_a

The model has no idea what day it is, so a confident date is the wrong answer. The human caught that. The judge gave the confident-wrong answer a 10 and the honest hedge a 3. It wasn't grading correctness — it was grading confidence. (Other receipts showed the same thing: 1-vs-2 and 8-vs-7 "decisions" that were really just noise around a tie.)
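One way to pull receipts like these out of a run is to sort the disagreements by score gap, so confident misses surface first and near-tie noise sinks to the bottom. A minimal sketch, assuming each scored pair is stored as a dict with hypothetical keys `question`, `s_a`, `s_b`, and `human`; the scores and verdict in the first row are the ones shown above, and the other two rows use placeholder questions.

```python
def disagreements(records, min_gap=0.0):
    """Return pairs where the judge's pick differs from the human winner,
    largest score gap first. (Hypothetical helper and record layout.)"""
    out = []
    for r in records:
        gap = abs(r['s_a'] - r['s_b'])
        if gap == 0:
            continue  # ties are not picks
        judge_pick = 'model_a' if r['s_a'] > r['s_b'] else 'model_b'
        if judge_pick != r['human'] and gap >= min_gap:
            out.append({**r, 'judge_pick': judge_pick, 'gap': gap})
    return sorted(out, key=lambda r: r['gap'], reverse=True)

records = [
    {'question': '(placeholder question)', 's_a': 8.0, 's_b': 7.0, 'human': 'model_b'},
    {'question': 'When is it today?', 's_a': 3.0, 's_b': 10.0, 'human': 'model_a'},
    {'question': '(placeholder question)', 's_a': 1.0, 's_b': 2.0, 'human': 'model_a'},
]

worst_first = disagreements(records)
confident_misses = disagreements(records, min_gap=2.0)

for r in worst_first:
    print(r['gap'], r['question'], '-> judge:', r['judge_pick'], '| human:', r['human'])
```

Setting `min_gap` filters out the one-point "decisions" so only the confident misses are left to read by hand.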

The point

This notebook scored nothing new about a model. It audited the judge, the thing handing out the scores, and found two failures hiding behind a clean-looking number: it disagreed with itself from run to run, and once ties counted as misses it matched the human verdict on only 43% of pairs.

The fix isn't "don't use judges." It's evaluate your evaluator: grade with repeats and report the spread, not a single number, and calibrate against human labels before you trust the judge on data nobody has labeled.
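"Grade with repeats and report the spread" could look something like the sketch below. It wraps any function with the same call shape as `judge` above, returning `(score, reply)`. The wrapper name, its defaults, and `fake_judge` are all hypothetical; `fake_judge` replays canned scores so the example runs without a model.

```python
import itertools
import math
import statistics

def judge_repeated(judge_fn, question, answer, n=8, temperature=0.7):
    """Call judge_fn n times and summarize the scores instead of trusting one.
    (Hypothetical wrapper; judge_fn must return (score, reply).)"""
    scores = [judge_fn(question, answer, temperature=temperature)[0] for _ in range(n)]
    valid = [s for s in scores if not math.isnan(s)]
    return {
        'n_valid': len(valid),
        'n_unparsed': n - len(valid),
        'mean': statistics.fmean(valid) if valid else float('nan'),
        'stdev': statistics.stdev(valid) if len(valid) > 1 else float('nan'),
        'min': min(valid) if valid else float('nan'),
        'max': max(valid) if valid else float('nan'),
    }

# Stand-in judge for demonstration only: replays canned scores, calls no model.
_canned = itertools.cycle([8.0, 7.0, float('nan'), 8.0])

def fake_judge(question, answer, temperature=0.0):
    score = next(_canned)
    return score, f'SCORE: {score}'

report = judge_repeated(fake_judge, 'any question', 'any answer', n=4)
print(report)
```

Reporting `n_unparsed` alongside the mean keeps format failures from disappearing into the average.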

What's next

Part 3: two obvious ways to fix a bad judge — a bigger model and a better rubric. I run all four combinations against the same human votes and measure how much each lever actually moves the needle. (The cheaper fix does more of the work than you'd expect.)

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/ep02-llm-as-a-judge

Built with PyTorch + Hugging Face Transformers. Data: LMSYS Chatbot Arena (ungated mirror). Questions or corrections welcome in the comments.