A Better LLM Judge? The Rubric Made My Small Model Worse

#ai #machinelearning #python #llm

In Part 2 I built the laziest possible LLM judge — a tiny model (Qwen2.5-1.5B) and a one-line rubric — and it agreed with human votes only ~43% of the time, crammed every score into a 7–8 band, and tied a third of the comparisons humans had no trouble separating.

Two things were wrong with that judge, and people usually fix only one:

The model was too small.
The rubric told it almost nothing.

I fixed each independently and measured the effect. The result wasn't the tidy "write a better rubric, it's free" story I expected — it was more interesting than that.

The big judge runs on an API (and why)

A genuinely large judge doesn't fit a free Kaggle GPU, and fighting transformers versions / OOM / sharding is exactly the yak-shaving real teams skip by calling a hosted endpoint. So the big judge runs on OpenRouter — one OpenAI-compatible API across many models, so swapping the judge is a one-line BIG_ID change. The small baseline still runs locally (no reason to spend API calls on a 1.5B model).

Two things keep the calls cheap and short: cap the output (max_tokens=160) and turn reasoning off (these models reason by default, which bloats output). Plus a small retry on the occasional 429:

BIG_ID = 'deepseek/deepseek-v4-pro'   # one-line swap; also ran qwen/qwen3-32b

def big_judge(question, answer, rubric, max_tokens=160, retries=4):
    kw = dict(model=BIG_ID, messages=build_messages(question, answer, rubric),
              temperature=0, max_tokens=max_tokens)
    for attempt in range(retries):
        try:
            try:   # disable reasoning (OpenRouter-specific); fall back if rejected
                resp = or_client.chat.completions.create(
                    extra_body={'reasoning': {'enabled': False}}, **kw)
            except Exception as inner:
                if 'reasoning' in str(inner).lower():
                    resp = or_client.chat.completions.create(**kw)
                else:
                    raise
            return parse_score(resp.choices[0].message.content or ''), None
        except Exception as e:
            if ('rate' in str(e).lower() or '429' in str(e)) and attempt < retries - 1:
                time.sleep(2 * (attempt + 1)); continue
            return float('nan'), None

Since the API calls are network-bound, the 2x2 runner fans them out across a thread pool (ThreadPoolExecutor), so each big-judge condition finishes in a fraction of the sequential time. (Lesson learned the hard way on an earlier provider: with max_tokens=512 and no reasoning cap, a reasoning model spent ~4.5K tokens thinking per call and blew straight through that provider's rate limit. Capping output is the biggest lever.)

The two rubrics — the actual variable

The naive rubric is what most people write and stop at:

NAIVE_RUBRIC = (
    'Score from 1 (terrible) to 10 (excellent) based on correctness and helpfulness. '
    'Respond EXACTLY as:\nSCORE: <number>'
)

The good rubric names explicit criteria, anchors the scale (what a 2/5/8/10 mean), and demands reasoning before the score:

GOOD_RUBRIC = (
    'You are an expert evaluator. Judge the answer on CORRECTNESS, COMPLETENESS, and '
    'INSTRUCTION-FOLLOWING. Use the FULL 1-10 scale, anchored:\n'
    '  1-2 = wrong/irrelevant.  3-4 = major errors.  5-6 = partial.\n'
    '  7-8 = correct, minor issues.  9-10 = fully correct and on-task.\n'
    'A confident, fluent answer that is factually WRONG must score 1-2, not high. '
    'First one sentence of reasoning, then:\nREASON: <one sentence>\nSCORE: <number>'
)

The 2x2 (run twice, two different big judges)

Same human-voted Chatbot Arena pairs as Part 2 (N=30), same independent single-answer scoring. The only things that change are model and rubric. To make sure the effect wasn't a quirk of one model, I ran the big judge twice — deepseek/deepseek-v4-pro and qwen/qwen3-32b — via OpenRouter. The small baseline is the same local Qwen2.5-1.5B in both.

Big judge = DeepSeek:

Condition	Agreement (decisive)	Agreement (overall)	Ties	Scale
small + naive	67%	47%	9/30	2–10
small + good rubric	54% ⬇	43%	6/30	1–10
big + naive	65%	37%	10/30	1–10
big + good rubric	79% ⬆	50%	7/30	1–10

Big judge = Qwen 32B (same pattern, milder):

Condition	Agreement (decisive)	Ties
small + naive	67%	9/30
small + good rubric	54% ⬇	6/30
big + naive	70%	7/30
big + good rubric	71% ⬆	4/30

Read the rubric column carefully, on both. The good rubric hurt the small model (67%→54% — same on both runs) but helped the big one (DeepSeek: 65%→79%, a +14pt jump; Qwen: 70%→71% but with far fewer ties). The detailed, multi-criteria instructions that sharpened a capable model just confused the 1.5B.

One more thing the DeepSeek run exposes: big + naive landed at 65% decisive / 37% overall — no better than the small model, and its worst tie count. A bigger, pricier judge with a lazy rubric bought nothing. The leap to 79% only came when the big model and a real rubric were used together.

The point

I expected "a better rubric is the cheap win." The data said something more useful: a good rubric is an instruction, and the model has to be capable enough to follow it.

A bigger model only helped when paired with a real rubric. With the lazy rubric, the big model was no better than the small one (DeepSeek big+naive actually landed at 67%/65% — flat — with its worst tie count).
A better rubric only paid off on the capable model. On the small model, the careful rubric was worse than the lazy one-liner — on both big-judge runs.

So the two fixes aren't independent levers you can add up. Hand a precise rubric to a weak model and you can make your eval worse than doing nothing; pay for a big model and skip the rubric and you've bought nothing. The best judge was the combination — big model and real rubric (DeepSeek hit 79%) — but the instructive results are the two traps on either side of it.

An LLM judge is an instrument: the model is the sensor, the rubric is the calibration. A precise calibration on a cheap sensor can read worse than no calibration at all. Specify both, and always check against human labels — because intuition (mine included) gets this wrong.

That wraps the series

Three episodes, one thread: a metric is only as honest as the conditions you measured it under.

Ep 1 — accuracy hid the classes a model silently abandoned.
Ep 2 — an LLM judge's confident score hid that it disagreed with humans.
Ep 3 — a "better" rubric helped a strong model and hurt a weak one; the headline hid that.

Evaluation isn't a box you tick once and quote forever — it's an instrument you specify, calibrate, and keep checking against ground truth, because the convenient number will always flatter you. Thanks for following along.

📓 Full runnable notebook on Kaggle: [https://www.kaggle.com/code/sumannath88/ep03-better-judge-model-and-rubric]

Built with Hugging Face Transformers (small judge, local) + OpenRouter (big judges: deepseek-v4-pro and qwen3-32b). Data: LMSYS Chatbot Arena. Questions or corrections welcome in the comments.