Repo + raw results: https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/model-selection-v1
Every number below is from a live run on an RTX 5090 (32 GB), Ollama, four models. The v1→v2 history is in the commit log — you can verify the failure.
I set out to answer a narrow, practical question: which local LLM should a Japanese company actually run on-prem? Four candidates, a fixed independent judge, three dimensions — quality, latency, VRAM.
The first run came back with every model scoring near-perfect. Faithfulness 1.0000 for three of the four models and 0.9792 for the fourth. Hit rate 0.90–1.00. Judge-agreement κ = 1.0.
It looked like a clean result. It was actually a broken benchmark — and figuring out why taught me more than any leaderboard number would have.
The number that's too good to be true
Here's v1, 20 questions, four models:
| Model | Params | Faithfulness | Hit rate |
|---|---|---|---|
| elyza-jp-8b | 8.0B | 1.0000 | 0.90 |
| gemma4-31b | 31.3B | 1.0000 | 0.95 |
| nemotron-nano-9b-jp | 8.9B | 0.9792 | 1.00 |
| swallow-8b | 8.0B | 1.0000 | 1.00 |
An 8B model and a 31B model scoring identically should set off an alarm. Model capacity that different, collapsing to the same score, almost always means the test isn't resolving the difference — not that the difference is gone.
The discriminability breakdown made it undeniable: 90% of questions were answered correctly by every model, 10% by some, and 0% by none. A benchmark where nine out of ten questions can't tell your candidates apart isn't measuring the candidates. It's measuring whether the questions are easy. They were.
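The breakdown is cheap to compute once you have a per-question correctness flag for each model. A minimal sketch, assuming results are stored as one list of booleans per model. The function names are illustrative, and the placement of the misses in the toy data is arbitrary: it is chosen only to reproduce the v1 hit rates in the table and is not the raw per-question output.

```python
from collections import Counter


def discriminability(results):
    """Share of questions answered correctly by all, some, or no models.

    `results` maps a model name to a list of booleans, one per question
    (an assumed layout, not necessarily the repo's actual format).
    """
    n_models = len(results)
    n_questions = len(next(iter(results.values())))
    counts = Counter()
    for q in range(n_questions):
        hits = sum(1 for answers in results.values() if answers[q])
        if hits == n_models:
            counts["all"] += 1
        elif hits == 0:
            counts["none"] += 1
        else:
            counts["some"] += 1
    return {k: counts[k] / n_questions for k in ("all", "some", "none")}


def hit_rate_spread(results):
    """Gap between the best and worst per-model hit rate."""
    rates = [sum(answers) / len(answers) for answers in results.values()]
    return max(rates) - min(rates)


# Toy data shaped like v1: 20 questions, misses placed arbitrarily.
toy_v1 = {
    "elyza-jp-8b": [False, False] + [True] * 18,  # hit rate 0.90
    "gemma4-31b": [False] + [True] * 19,          # hit rate 0.95
    "nemotron-nano-9b-jp": [True] * 20,           # hit rate 1.00
    "swallow-8b": [True] * 20,                    # hit rate 1.00
}

dist = discriminability(toy_v1)
spread = hit_rate_spread(toy_v1)
print(dist, round(spread, 2))
```

Running this check before looking at any quality score is the point: if the "all" share dominates, the scores that follow carry little information about the models.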
And the κ = 1.0 that looked like perfect, reassuring judge agreement? When two judges both assign full marks to nearly everything, perfect agreement is the trivial outcome, not a strong signal. With almost no variance in the labels, chance agreement is already close to 1 and the statistic rests on a handful of items. A κ of 1.0 here wasn't "the judges agree"; it was "there was nothing to disagree about."
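The fragility follows from the definition. Cohen's κ is (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each judge's label frequencies. A minimal sketch with made-up binary labels (these are not the run's judge outputs) shows how little the number rests on when almost every label is a pass:

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two raters; None when it is undefined (p_e == 1)."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum((freq_a[k] / n) * (freq_b[k] / n) for k in set(a) | set(b))
    if p_e == 1.0:
        return None  # both raters used one identical label throughout: 0/0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels: 1 = full marks, 0 = anything less.
nineteen_pass = [1] * 19 + [0]
all_pass = [1] * 20

k_identical = cohen_kappa(nineteen_pass, nineteen_pass)  # agree on all 20
k_one_flip = cohen_kappa(nineteen_pass, all_pass)        # disagree on 1 item
k_constant = cohen_kappa(all_pass, all_pass)             # no variance at all
print(k_identical, k_one_flip, k_constant)
```

With these labels κ is 1.0 when the two judges match on all 20 items, 0.0 after a single flipped item, and undefined when both give every answer full marks. A statistic that swings from 1 to 0 on one label is not telling you much about judge reliability.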
A benchmark that gives everyone full marks is informationally equivalent to no benchmark at all. You can't make a selection decision on it, because it contains no signal about which model to select.
Why this is the interesting part, not the embarrassing part
The temptation is to quietly fix the questions and publish only the clean v2 table. I'm doing the opposite — keeping v1 in the repo, with an ADR documenting the failure — because the failure is the methodology content.
Anyone can run four models through a question set and print a table. The thing that's actually hard, and actually rare, is recognizing that your own measurement is broken when the numbers look great. Many published "local LLM comparisons" never report discriminability at all. They show you a table of high scores and call it a benchmark. If every model on that table scores 90%+, you're looking at the easiness of the questions, not the quality of the models.
So the real deliverable here isn't "model X won." It's a protocol: a model-selection benchmark is only valid if it can resolve the models it compares — and you have to test that explicitly before you trust a single score.
Building discriminability back in
v2 replaced the question set with 45 items deliberately designed to be hard enough to separate the field:
- multi-step reasoning rather than single-fact lookup
- Japanese nuance — keigo (honorifics), specialized terminology, deliberately ambiguous phrasing
- boundary facts — specific dates and figures that are easy to hallucinate
The target wasn't "make it hard for its own sake." It was a specific distribution: the strongest model should still miss some. v1 was the wrong shape (90/10/0). v2 landed at 29% answered by all, 51% by some, 20% by none. That 20% nobody gets is the part that gives the benchmark resolution at the top end.
Hit-rate spread went from 0.10 (v1) to 0.22 (v2). The models actually separate now. Judge agreement changed as well: κ dropped from 1.0 to 0.920.
That drop is an improvement. v1's κ = 1.0 was a zero-variance artifact. v2's κ = 0.920 is a real agreement number computed over real disagreement — it's the first version where the judge-reliability statistic actually means anything. If you see a benchmark reporting perfect judge agreement, ask whether there was any variance for the judges to agree about.
The finding worth flagging (with the caveat attached)
The thing that made me look twice: nemotron-nano-9b-jp (8.9B) tied gemma4-31b (31.3B) on hit rate, 0.622 each, while using roughly half the VRAM (~11 GB vs ~20 GB) and running about 2.7× faster (190 vs 71 tokens/s, warm).
If that holds, it's the whole point of doing selection by constraint instead of by raw capability. The biggest model is not automatically the right deployment choice. Under a VRAM ceiling, a latency target, or a throughput requirement, a 9B Japanese-sovereign model that matches a 31B on the task is the better call — and you'd never see that from a "which model is strongest" framing.
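Selection by constraint can be sketched as a filter followed by a ranking. The figures below are the v2 numbers quoted above for the two models this write-up gives them for. The `Candidate` type, the `select` function, the tie-break order, and the two deployment budgets are all hypothetical, for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    hit_rate: float
    vram_gb: float
    tokens_per_s: float


def select(candidates, max_vram_gb, min_tokens_per_s):
    """Drop candidates that break a constraint, then rank the rest."""
    feasible = [
        c for c in candidates
        if c.vram_gb <= max_vram_gb and c.tokens_per_s >= min_tokens_per_s
    ]
    # Rank by quality; break ties by smaller footprint, then higher throughput.
    return sorted(feasible, key=lambda c: (-c.hit_rate, c.vram_gb, -c.tokens_per_s))


# v2 figures quoted above (VRAM and throughput are approximate).
candidates = [
    Candidate("gemma4-31b", hit_rate=0.622, vram_gb=20, tokens_per_s=71),
    Candidate("nemotron-nano-9b-jp", hit_rate=0.622, vram_gb=11, tokens_per_s=190),
]

# Hypothetical deployment budgets, for illustration only.
tight = select(candidates, max_vram_gb=16, min_tokens_per_s=100)
loose = select(candidates, max_vram_gb=32, min_tokens_per_s=50)
print([c.name for c in tight], [c.name for c in loose])
```

Under the tight budget only the 9B model is feasible. Under the loose one both pass, and the tie on hit rate is broken by footprint, which is a design choice you would set per deployment.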
The honest caveat, up front: this is 45 questions. The nemotron-vs-gemma4 tie is an observation on this set, not a settled result. It needs a larger sample to confirm, and I'm reporting it as a lead to chase, not a conclusion to act on. The point of the protocol is precisely that you don't get to claim a result the sample can't support.
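To put a rough size on that caveat: a hit rate of 0.622 on 45 questions is 28 correct answers, and a binomial interval around it is wide. A minimal sketch using the Wilson score interval, a method chosen here for illustration and not something the v2 report computes. It also assumes the questions behave like independent draws.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 0.622 on 45 questions corresponds to 28 correct answers.
low, high = wilson_interval(28, 45)
print(f"{low:.2f} to {high:.2f}")
```

On these inputs the interval comes out at roughly 0.48 to 0.75, far wider than any gap a 45-question set could use to rank two models that both land at 0.622.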
Judge setup, for the skeptics
Because the first question a careful reader asks is "who graded this, and did anything grade itself":
- Primary judge: qwen3:32b — and it is not a contestant. It's a Chinese model; by my own deployment/content separation rule it doesn't belong in the Japanese on-prem default lineup, so it sits out the race and judges instead. That sidesteps self-preference bias: no contestant grades its own homework or a same-family sibling's.
- Cross-validation: gemma4:31b re-judged a 20-question subset to check the primary judge's reliability (the κ = 0.920 above). gemma4 is a contestant, so it's used only to validate the judge protocol — never to score itself.
The judge and a contestant can't co-reside on 32 GB (qwen3:32b ~29 GB, gemma4:31b ~19 GB), so the whole thing runs two-pass: generate all answers, evict, load judge, score all answers. Resumable, cached per model.
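The two-pass, resumable shape can be sketched independently of any model server. In the sketch below, `generate`, `judge`, and `unload` are caller-supplied placeholders and the JSON cache layout is an assumption, not the repo's actual code. With Ollama, `unload` could for instance be a request that sets `keep_alive` to 0, but that wiring is left out here and the demo runs on stubs.

```python
import json
import tempfile
from pathlib import Path


def _slug(model):
    """File-system-safe cache key for a model tag such as 'qwen3:32b'."""
    return model.replace(":", "_").replace("/", "_")


def two_pass(models, questions, generate, judge, unload, cache_dir,
             judge_model="qwen3:32b"):
    """Generate every answer first, then score everything with the judge.

    `generate(model, question)`, `judge(judge_model, question, answer)` and
    `unload(model)` are caller-supplied placeholders, not real client calls.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)

    # Pass 1: one contestant in memory at a time; skip models already cached.
    for model in models:
        path = cache / f"{_slug(model)}.answers.json"
        if path.exists():
            continue
        answers = [generate(model, q) for q in questions]
        path.write_text(json.dumps(answers, ensure_ascii=False), encoding="utf-8")
        unload(model)  # evict before the next model loads

    # Pass 2: only the judge is loaded; scores are cached per model too.
    scores = {}
    for model in models:
        path = cache / f"{_slug(model)}.scores.json"
        if not path.exists():
            answers = json.loads(
                (cache / f"{_slug(model)}.answers.json").read_text(encoding="utf-8"))
            judged = [judge(judge_model, q, a) for q, a in zip(questions, answers)]
            path.write_text(json.dumps(judged), encoding="utf-8")
        scores[model] = json.loads(path.read_text(encoding="utf-8"))
    unload(judge_model)
    return scores


# Stub demo: no model is called; the second run is served from the cache.
calls = []


def fake_generate(model, q):
    calls.append((model, q))
    return f"{model} answers {q}"


def fake_judge(judge_model, q, a):
    return 1.0


with tempfile.TemporaryDirectory() as tmp:
    args = (["elyza-jp-8b", "swallow-8b"], ["q1", "q2"],
            fake_generate, fake_judge, lambda m: None, tmp)
    first = two_pass(*args)
    second = two_pass(*args)
```

Caching answers and scores separately means an interrupted run resumes at the first model without a cache file, and the judge never has to share memory with a contestant.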
What I'd hand to anyone benchmarking models for selection
- Check discriminability before you trust any score. If most questions are answered correctly by every candidate, your benchmark is measuring question difficulty, not model quality.
- A perfect score is a red flag, not a green one. Especially when models of very different size tie.
- Perfect judge agreement (κ = 1.0) on a low-variance set is meaningless. A slightly lower κ over real disagreement is worth more.
- Select by constraint, not by raw capability. "Strongest" and "right for this deployment" are different questions.
- Keep the failed version. The path from a broken benchmark to a working one is the part nobody can fake.
Full protocol, the v1 failure, the v2 fix, and every raw judged output:
https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/model-selection-v1
The companion tooling — a zero-dependency library that audits whether a retrieval metric can be trusted in the first place — is at eval-sanity.


