DEV Community: elvisyao007

A Chinese 8B model beat the Western 8B models at Japanese RAG. I still wouldn't put it in the default deployment — and that distinction is the point.

elvisyao007 — Sun, 14 Jun 2026 06:39:00 +0000

Extends an earlier model-selection benchmark to three model families (Japanese / Western / Chinese) on a Japanese RAG task.
Repo + raw results: https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/model-selection-v2

An earlier post benchmarked local models for a Japanese RAG task and settled on selecting by constraint rather than raw capability. This post widens the field to three families — Japanese-tuned, Western open, and Chinese — and the result forces a distinction that matters more than any single score: model capability and deployment eligibility are two different questions, and conflating them is how people get model selection wrong.

Same Japanese RAG task, same judge protocol, same discriminating golden set (oracle 87.5%, only 11% of questions answered by all models — it actually separates the field). hit@5, 8B class unless noted:

Model	Family	hit@5
Swallow-8B	Japanese-tuned	~0.53
Nemotron-9B-JP	Japanese-tuned	~0.62
ELYZA-JP-8B	Japanese-tuned	~0.40
deepseek-r1-8b	Chinese	~0.51
Llama-3.1-8B	Western	~0.22
Mistral-7B	Western	~0.18
gemma4-31b	Western (31B)	~0.62

Three things fall out of this, and they don't all point the same direction.

1. At 8B, Japanese fine-tuning is decisive — and generic Western models just aren't competitive

The Western 8B models cratered: Llama-3.1-8B at 0.22, Mistral-7B at 0.18, against a Japanese-tuned average around 0.52. That's not a small gap; it's the difference between usable and not.

This answers a question people sometimes ask skeptically — why do Japanese-specific models exist when Llama is right there? At the 8B scale, on a Japanese retrieval-grounded task, a generic Western model without Japanese fine-tuning is not in the running. The Japanese tuning is doing decisive work.

One honest qualifier on the table: gemma4-31b (0.62) is the one Western model that holds up — but it's 31B, not 8B. It earns its score with 4× the parameters, not with Japanese optimization. So read the table in two tiers: within the 8B class, Japanese-tuned wins clearly; across sizes, you can buy Western competitiveness with a much bigger model. Don't read "gemma is strong" as "Western 8B is fine" — the 8B Western models specifically failed.

2. The Chinese model was capable — genuinely competitive

deepseek-r1-8b scored 0.51 — above the Western 8B models by a wide margin, and right in the range of the Japanese-tuned models. On capability alone, measured on this task, it's a real contender.

I want to be precise here because it's easy to be sloppy: the data says this model is good at the task. That's a measurement, and I'm reporting it straight.

3. ...and I still wouldn't put it in the default deployment stack — for reasons that have nothing to do with capability

For Japanese enterprise deployment, my default model lineup excludes Chinese models. Not because of the score — the score is fine — but because of deployment-policy constraints that are independent of capability:

Data sovereignty posture. Japanese enterprises, particularly in regulated or security-sensitive contexts, have specific concerns about model provenance in on-prem and data-handling decisions. A solutions engineer deploying into that environment inherits those constraints whether or not they're technically about the model's quality.
Procurement and compliance review. Model provenance is a line item in enterprise procurement and security review. A model that's excellent but doesn't clear that review is, operationally, not deployable for that client.

So the model goes in my content/research layer — where I'll benchmark it, learn from it, report its numbers honestly (as I just did) — but not in the deployment default I'd recommend to a Japanese enterprise client. That separation is a standing decision in how I structure this work, and this benchmark is exactly why the separation has to be explicit: if you collapse capability and deployability into one axis, you'll either deploy something that fails procurement, or dismiss something that's actually good.

This is, I think, the part of the job that separates a solutions/forward-deployed engineer from someone who only runs benchmarks. The benchmark tells you what's capable. The deployment decision is a different function — it takes in the score and the client's compliance reality, the procurement constraints, the data-handling posture — and those are not the model's fault or merit, they're the deployment context. Keeping the two reasoning steps separate is the skill.

The caveats

n = 45 questions. Scores carry roughly ±5–8% uncertainty. The direction (Western 8B weak, Japanese-tuned and the Chinese model strong) is clear; treat exact values as approximate.
32GB single-GPU constraint. I did not evaluate 70B-class models (Llama-70B, Mistral-Large) — they don't fit. So "Western 8B is weak here" is a statement about the 8B class on one GPU, not about Western models in general. A 70B might change the picture; I can't test it on this hardware.
Judge independence. The judge is a non-contestant model; cross-validation on a 25-question subset gave 96% hit agreement, κ = 0.920 — real agreement over real variance, not a zero-variance artifact.
One task, one embedder. Japanese RAG with a Japanese embedder. Different task, different story possible.

The takeaway

Selecting a model for deployment is not "pick the highest score." It's a two-step function: measure capability honestly, then filter by the deployment context — size constraints, latency, language fit, and procurement/compliance reality. The Chinese model passed step one and is filtered at step two for reasons that aren't about its quality. The Western 8B models failed step one outright. The Japanese-tuned models pass both for this client profile.

Reporting all of that accurately — including saying clearly that the model I won't deploy is genuinely good — is the job.

Raw numbers, judge protocol, the discriminating golden set:
https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/model-selection-v2

Companion: eval-sanity (the sanity gate confirming the metric discriminates before any score is trusted).

Which Chinese open-source parser is better for Japanese RAG? It's a crossover — BM25 says DeepDoc, dense says MinerU

elvisyao007 — Sat, 13 Jun 2026 14:29:01 +0000

Final part of a series measuring Chinese open-source document parsing on Japanese documents.
Repo + raw 3×2 results: https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/deepdoc-eval-v2

Two posts ago I measured RAGFlow's DeepDoc against plain text extraction on Japanese PDFs and found its layout-aware parsing helped retrieval. Last post I found that help was bigger under dense retrieval than BM25 — structured parsing matters more the more your retriever depends on chunk coherence.

This post adds the obvious next question: is DeepDoc actually the best Chinese open-source parser for this, or just the one I tested first? So I added MinerU (another major Chinese parser) to the comparison. (I excluded PaddleOCR — it depends on Baidu's PaddlePaddle framework rather than PyTorch, with known CUDA/cudnn conflicts; the environment isolation cost wasn't worth it for this comparison.)

The answer isn't "one of them wins." It's a clean crossover: the better parser depends on your retriever.

The 3×2

Same 3 Japanese government PDFs, same 32-question golden set (oracle ceiling 87.5%), same Japanese embedder (ruri-v3). hit@5:

Pipeline	BM25	Dense
plain text (pdfplumber)	56.2%	40.6%
DeepDoc	68.8%	65.6%
MinerU	62.5%	71.9%

Read the two structured parsers against each other:

On BM25, DeepDoc wins (68.8% vs 62.5%)
On dense, MinerU wins (71.9% vs 65.6%)

A crossover. And the single best cell in the entire table is MinerU × dense at 71.9% — so if you're building Japanese dense RAG, MinerU is the current pick. If you're on a lexical/BM25 system, DeepDoc.

This is the answer to "how do you choose a parser" that's actually useful: not "the strong one," but "match the parser to your retriever." MinerU's chunking suits dense embedding; DeepDoc's suits keyword matching. Neither is universally better.

A confirmation worth noting: the era-name failure is ecosystem-wide, not DeepDoc-specific

In the first post, DeepDoc's OCR fallback corrupted Japanese era names (令→今) on form/scanned PDFs, while its native-text path was clean. MinerU showed zero era-name errors on these documents — because, like DeepDoc, it routes text-layer PDFs through direct extraction (PyMuPDF) rather than OCR.

That's the useful confirmation: the clean-vs-corrupt split is a property of the font path (text-layer extraction vs OCR fallback), not of any one parser. Both Chinese parsers handle embedded-font government PDFs cleanly. The era-name risk lives in the OCR fallback that scanned/form documents force — and that's where you'd need to test either tool carefully for a Japanese deployment.

The caveats — three, and they're the price of trusting the numbers

1. The speed comparison is NOT apples-to-apples. MinerU ran on GPU (PyTorch cu128 on the RTX 5090); the DeepDoc numbers from the earlier phase were CPU. So while MinerU parsed the three documents in 34–46s each, I'm not reporting a clean "MinerU is N× faster than DeepDoc" — that would compare GPU against CPU and mislead. Aligning both on the same hardware is left for a follow-up. Take the speed dimension as unmeasured here, not as a MinerU win.

2. Three documents, 32 questions. The crossover is an observation on this set, not a settled result. At this sample size, a ±6% gap is a handful of questions — it could move with more documents.

3. The symmetry is suspicious. DeepDoc leads BM25 by +6.2% and MinerU leads dense by +6.2% — the exact same margin both directions. That clean symmetry is more likely a quantization artifact of a 32-question set (each question ≈ 3.1%) than a real law of nature. I'm reporting the direction of the crossover with confidence and the precise magnitude with none.

The honest version: on this testbed, with a Japanese embedder, the direction is a crossover — DeepDoc better for lexical, MinerU better for dense. The exact numbers need a bigger set.

Where the series lands

Three posts, one arc:

DeepDoc on Japanese — found a font-path-specific OCR failure (era names), and that layout parsing helps retrieval (+12.5% on BM25).
Dense vs BM25 — that help doubles under dense retrieval, because dense punishes incoherent chunks harder. A statement about RAG architecture, not one tool.
DeepDoc vs MinerU (this post) — the best parser is a crossover on your retriever; MinerU × dense is the strongest combination measured; the era-name failure is an OCR-path property shared across parsers.

The throughline isn't "Chinese parsers are good/bad." It's that parser choice is a constrained decision — your retriever, your document font path, your language — and the only way to get the answer is to measure your own stack, traced through to retrieval, on a test set hard enough to actually separate the options.

That's the part most tooling comparisons skip, and it's the part that's worth doing.

Raw 3×2 numbers, the parser-comparison breakdown, reproducible scripts:
https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/deepdoc-eval-v2

Companion: eval-sanity (the sanity gate that confirmed the metric before each delta was trusted).

Structured parsing helps dense retrieval more than it helps BM25 — measured on Japanese docs, and the gap doubled

elvisyao007 — Sat, 13 Jun 2026 07:17:12 +0000

Phase 3 of a series measuring Chinese open-source parsing (RAGFlow's DeepDoc) on Japanese documents. This tightens two limits I flagged in the earlier post.
Repo + raw 2×2 results: https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/deepdoc-eval-v2

In an earlier post I measured DeepDoc (RAGFlow's document parser) against plain text extraction on Japanese PDFs and found a +12.5% hit@5 advantage from its layout-aware chunking — with two caveats I wrote down explicitly: it was BM25-only, and the golden set had a 100% oracle ceiling (likely too easy, which could amplify the gap).

This post closes both. I added a dense-retrieval dimension and built a harder golden set (oracle 87.5%). I expected DeepDoc's advantage to shrink under dense retrieval — my reasoning was that embeddings might be less sensitive to chunk boundaries than lexical matching.

It did the opposite. The advantage doubled.

The 2×2

Same documents, same questions, four pipeline/retriever combinations. hit@5:

Pipeline	BM25	Dense
A — plain text (pdfplumber)	56.2%	40.6%
B — DeepDoc structured	68.8%	65.6%
Delta (B − A)	+12.5%	+25.0%

DeepDoc's edge over plain text is +12.5% on BM25 but +25.0% on dense. The structured parse helps dense retrieval roughly twice as much as it helps lexical.

And look at what falls apart: plain text + dense is the worst cell in the table at 40.6% — well below plain text + BM25. Switching plain-text extraction from lexical to dense retrieval made it worse.

Why: chunk quality matters more to dense than to BM25

The mechanism is in the chunk counts. Plain-text extraction produced 2,934 sliding-window chunks; DeepDoc's structured parse produced 630. That's a 4.6× difference, and it cuts differently for each retriever:

BM25 matches keywords. A fragmented chunk still contains its keywords, so lexical matching mostly survives fragmentation. Plain text holds up okay (56.2%).
Dense embeds meaning. A context-stripped sliding-window fragment produces a low-quality vector — there isn't enough coherent context to embed well. So fragmentation hurts dense retrieval badly (40.6%).

DeepDoc's layout separation yields larger, semantically coherent chunks — which is exactly what a dense embedder needs to produce a good vector. So the value of structured parsing isn't constant across retrievers: it's worth more the more your retriever depends on chunk coherence, and dense retrieval depends on it most.

That generalizes beyond this one parser and this one language. If you're running dense RAG, your chunking strategy is doing more work than you might think — and a parser that respects document structure is buying you more than the same parser would on a BM25 system.

The honest caveats (three of them, all load-bearing)

This is an initial signal, and the limits matter as much as the headline.

1. This is an end-to-end comparison, not an isolated chunk-strategy test. Pipeline B is "DeepDoc's parse plus the chunking it naturally yields." Pipeline A is "plain text extraction plus sliding-window chunking." I did not hold chunking constant and swap only the parser. So strictly, the 2×2 compares two whole pipelines — which is what an enterprise actually deploys — but it does not by itself prove the gain comes from parsing rather than from chunk strategy. Isolating that (same chunker, swap only the parse layer) is the next step. I'm not claiming the layout model alone causes the lift; I'm claiming the end-to-end DeepDoc pipeline retrieves better, and the dense-vs-BM25 split points strongly at chunk coherence as the lever.

2. The embedding is Japanese-specific (ruri-v3-310m). Dense retrieval quality on Japanese is sensitive to the embedder. I used cl-nagoya's ruri-v3, a Japanese-first model, not a multilingual general one. The "+25% on dense" result holds for this embedder. A different embedding model could shift the numbers — the conclusion is conditioned on a Japanese-tuned embedder.

3. Three documents, 32 questions. I narrowed to the three documents that work as a retrieval testbed (the form-type PDFs that DeepDoc fails to parse — covered in the previous post — aren't usable as a clean retrieval corpus). The golden set is harder than v1 (oracle 87.5% vs 100%: four questions ask for arithmetic-derived values that don't appear verbatim in the corpus, so even a perfect retriever can't surface them). But it's still a small sample. Signal, not verdict.

What the harder golden set bought

The v1 golden set had a 100% oracle ceiling — every relevant doc was reachable in the top-5, meaning the questions were easy enough that the retriever was never really stressed. v2's ceiling is 87.5%: four of the 32 questions (asking for figures like 664,957億円, computed totals not present verbatim) can't be answered by any retriever from this corpus.

That matters because the +12.5%/+25.0% deltas are now measured under genuine difficulty, not on a set so easy the gap could be an artifact of headroom. Tightening the test is what turns "an interesting number" into "a number I'd defend."

Where this lands

The previous post's framing was "does a Chinese parser work on Japanese docs" — useful, but niche. This result is broader: structured parsing pays off more under dense retrieval than under lexical, because dense embeddings punish incoherent chunks harder. That's a statement about RAG architecture, not about one parser or one language — and if you're building dense RAG on any messy document corpus, it's a reason to take your parse-and-chunk layer more seriously than the embedding model choice you probably agonized over instead.

Next in the series: isolate the chunk-strategy variable (same chunker, swap only the parser), and — environment permitting — the same Japanese documents through MinerU and PaddleOCR, to see whether the structured-parse advantage is DeepDoc-specific or holds across the Chinese parser ecosystem.

Raw 2×2 numbers, the harder golden set, the reproducible script:
https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/deepdoc-eval-v2

Companion tooling: eval-sanity (the sanity gate that confirmed the metric before I trusted the delta).

Half of agent evaluation needs no LLM judge — and it's the half that catches the failures that actually hurt

elvisyao007 — Fri, 12 Jun 2026 17:17:45 +0000

Part of an eval-first series. The trajectory evaluator described here shipped as eval-sanity v0.3 (zero dependencies, deterministic).
Repo: https://github.com/elvisyao007/eval-sanity · Agent + traces: https://github.com/elvisyao007/onprem-llm-stack/tree/main/payloads/invoice-agent

By 2026 the agent-evaluation problem is no longer hypothetical. LangChain's State of AI Agents report puts 57% of organizations with agents in production and names quality as the top deployment barrier. The standard answer to "how do you evaluate an agent" has become: capture the trajectory, then have an LLM judge it.

LLM-as-judge is real and necessary — for the parts that need it. But a large fraction of agent evaluation is deterministic, needs no judge at all, and happens to catch the failures that hurt most in an enterprise setting: the agent calling the wrong tool, skipping a required check, or writing bad data into a system of record. I built a small deterministic trajectory evaluator to make exactly that point, and ran it against a real invoice-processing agent.

Here's the case for doing the cheap, deterministic layer first — and doing it well.

The agent: invoice extraction with a refusal condition

The test subject is deliberately boring: a Japanese invoice (請求書) agent running on a self-hosted stack (LiteLLM gateway → local model). Three tools, no framework — just native function calling, because the agent is the thing being evaluated, not the thing being engineered.

extract_fields(pdf) — pull structured fields from the invoice
validate(fields) — check the consumption-tax math and that line items sum to the stated total
write_back(fields) — commit to a (mock) accounting system

The interesting behavior isn't extraction — OCR can extract. It's the refusal: when validate fails, the agent must not write back. An agent that dutifully commits a invoice whose tax is miscalculated is worse than no agent, because it launders a bad number into your books with an audit trail that says "automated."

I seeded five invoices: three clean, two with planted arithmetic errors (wrong 消費税 on one, wrong total on another). The good agent extracts all three clean ones and writes them back; on the two broken ones, it flags and refuses. That refusal is the whole value proposition.

What you can check without a judge

Here's the part the "just use an LLM judge" framing underrates. For an agent like this, most of what you care about is decidable by assertion, not by opinion:

Tool-call correctness — did the expected tools get called, with valid arguments?
Order constraints — was write_back always preceded by a passing validate? This is a pure structural property of the trace.
Step efficiency — how many steps, and were there redundant or repeated calls?
Task completion — against ground truth, did the right thing happen (write-back for clean, refusal for broken)?

None of these need a model to grade them. They're exact, reproducible, and fast enough to run as a CI gate on every prompt change, tool addition, or model swap. The 2026 consensus is converging on exactly this ordering — cheap deterministic checks first, escalate to an LLM judge only for what rules genuinely can't reach (was the agent's prose helpful, was its reasoning sound). I'm not arguing against LLM judges. I'm arguing that skipping straight to them skips the layer that catches the operationally worst failures.

Proving the evaluator actually discriminates

A familiar trap — one I walked into on an earlier model-selection benchmark — is an evaluator that passes everything. An evaluator that rubber-stamps good traces tells you nothing; you have to show it fails the bad ones.

So I didn't only run it on the five real (passing) traces. I constructed deliberately broken trajectories and confirmed each one gets caught, deterministically:

Constructed failure	Caught?	How
`write_back` without calling `validate`	✅	missing required tool + order violation: "step 2 write_back: no preceding validate"
`write_back` after a failing validate	✅	order violation: "'passed' never True"
Redundant / unexpected extra tool calls	✅	surfaced as diagnostics (redundant count, unexpected tool list)
`write_back` on an invoice that should be refused	✅	forbidden-tool violation on the refusal spec

On the real traces: the three clean invoices pass with 3 steps each, zero violations; the two broken invoices correctly show 2 steps, no write_back, status flagged. The evaluator distinguishes "did the right thing" from "did the wrong thing" — which is the only property that makes an evaluator worth running.

Silent trajectory regression

There's a sneakier failure than an outright wrong answer: the agent still completes the task, but its path quietly degrades — more steps, an occasional skipped check, a creeping violation rate — after a prompt tweak or model swap. Outcome-only evaluation misses this completely, because the outcome still looks fine.

The evaluator reuses a paired-bootstrap regression check (carried over from the retrieval-metric version of this tool) at the trajectory level: compare a baseline set of traces against a candidate set and alarm when completion stays flat but violation rate or step efficiency degrades significantly. In testing, a baseline of 8 good traces against a candidate of 4-bad-plus-4-good fired the alarm (completion −0.50, violations +0.50); two identical runs produced zero movement and correctly stayed silent.

When this is the wrong tool

The honest boundary, because it matters: if your agent always runs the same fixed sequence — retrieve, generate, format, every time — scoring the path buys you little; the output already tells you what you need. And if you're still in early prototyping, figuring out what the agent should do, formalizing trajectory specs is premature.

Trajectory evaluation earns its place when the path has constraints that can be violated — like "never write back without a passing validate." My invoice agent has exactly that property, which is why structural checking is worth it here. A different agent might not need it. Knowing which case you're in is part of the judgment.

The design choices worth stealing

Two decisions did more work than the metrics themselves:

The order constraint is enforced in code, not just evaluated. The agent's write_back has a Python-side guard that refuses to commit unless validate passed — independent of whether the LLM "decided" to follow instructions. You cannot trust an agent to always honor step ordering from a prompt; the load-bearing constraint belongs in the code, and the evaluator then confirms the trace respects it. Defense in depth, not prompt faith.

The eval is configurable, not absolute. Calling an unexpected tool doesn't auto-fail — it surfaces as a diagnostic unless you explicitly add the tool to a forbidden list. Different tasks tolerate different slack. The strictness is a property of the spec you write, not baked into the evaluator. That's a feature: it forces you to state what "correct" means for this task.

Deterministic agent evaluation isn't the whole story — the LLM-judge layer above it is real, for the dimensions rules can't reach. But it's the cheaper layer, it's the CI-gateable layer, and for enterprise agents that touch systems of record, it's the layer that catches the failures you can least afford. Do it first, and do it well.

Evaluator (zero deps, deterministic): eval-sanity v0.3
The invoice agent and its traces: onprem-llm-stack/payloads/invoice-agent

Does a Chinese document parser actually work on Japanese PDFs? I measured it — and the answer is 'it depends on the font path'

elvisyao007 — Fri, 12 Jun 2026 11:16:43 +0000

Part 1 of a series measuring Chinese open-source AI tooling on Japanese documents.
Repo + raw results: https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/deepdoc-eval-v1
Every number below is from a live run on an RTX 5090 / RTX-class workstation. Sample sizes are small and stated explicitly — treat this as an initial signal, not a verdict.

RAGFlow's DeepDoc is one of the better-known open-source document parsers to come out of the Chinese AI ecosystem. It does OCR, table-structure recognition (TSR), and document-layout recognition (DLR), and it's RAGFlow's default PDF parser. The English and Japanese dev communities mostly haven't measured it on Japanese documents — which is exactly the gap I sit in: I can read the Chinese tooling, and I can test it on the Japanese enterprise document types that actually matter here.

So I ran it. The interesting part isn't a thumbs-up or thumbs-down. It's that the answer splits cleanly by which internal path a given PDF takes — and one of those paths systematically corrupts Japanese era names.

Here's the honest version, limits attached.

The trap I almost fell into

My first-pass observation was alarming and simple: DeepDoc was misreading 令 as 今. In Japanese that's not a random glyph error — 令 is the first character of 令和 (Reiwa), the current imperial era. A parser that turns 令 into 今 corrupts the date on every government report, invoice, and contract that uses the era-name calendar. That's a potential dealbreaker for Japanese enterprise documents, where dates carry legal weight.

If I'd stopped there, I'd have published "Chinese parser breaks Japanese era names" — and I'd have been wrong, or at least sloppy. Because when I quantified it, the error wasn't a property of DeepDoc on Japanese text. It was a property of one code path.

The actual finding: it's a font-path problem, not a language problem

DeepDoc (via the PdfParser API) routes text two ways:

Embedded-font PDFs — most government reports, anything exported from Word/LaTeX — go through native text extraction (pdfplumber under the hood). On these, the 令→今 error rate was 0%. The text is read, not recognized.
Form-font / scanned PDFs — where there's no extractable text layer — fall back to the OCR path. On these, the era-name corruption ran to roughly 100% on the affected pages.

So the precise claim is: DeepDoc's OCR fallback systematically misreads 令 as 今 on Japanese form-font and scanned pages; its native-text path does not. "It breaks Japanese" was too broad. "Its OCR path breaks era names, and lots of real enterprise documents hit the OCR path" is the true, and more useful, statement.

This distinction matters operationally. If your corpus is clean digital government PDFs, this specific failure won't touch you. If your corpus is scanned invoices and tax forms — which is a huge fraction of real Japanese back-office documents — you're on the path that fails.

But does any of this reach the thing you actually care about — retrieval?

Parsing quality is a means. What an enterprise RAG system cares about is whether the right chunk comes back. So I didn't stop at "the parse looks good/bad." I measured the downstream delta: same documents, same questions, two ingestion pipelines.

Pipeline A (baseline): plain text extraction (pdfplumber) → chunk → retrieve
Pipeline B (DeepDoc): DeepDoc structured parse → chunk → retrieve

On a 20-question Japanese golden set built from the sample documents:

Pipeline	hit@5
A — plain text	75%
B — DeepDoc	90%

DeepDoc's layout understanding netted +15% hit@5. The layout separation produces cleaner chunks, and on these documents that win outweighed the OCR errors. The net effect was positive.

Now the limits, because they're load-bearing:

This is BM25 (lexical) retrieval. The dense-retrieval comparison is Phase 3, not done yet. Do not read "+15%" as "DeepDoc improves retrieval in general" — read it as "on lexical retrieval, on these docs, layout-aware chunking helped."
The golden set has a 100% oracle ceiling. Every relevant doc is reachable in the top-5 for all 20 questions — meaning the set may be on the easy side, and the 75-vs-90 gap could be amplified or distorted by that. A harder set (oracle < 100%) is Phase 3.
20 questions, 5 documents. This is a signal, not a settled number.

I'm reporting the +15% with all three caveats firmly attached. The honest takeaway is directional: layout-aware parsing tends to help retrieval enough to matter, and the precise magnitude needs a harder test.

Where DeepDoc is genuinely weak: tables and forms

The flip side, and the part DeepDoc's marketing wouldn't lead with. TSR — table-structure recognition — is one of its headline features. On a Japanese tax form's table, exact-match cell accuracy was 30% (20 cells checked). That's low, on the feature it's supposed to be best at.

And form PDFs were worse. On the e-Tax-style form sample, DeepDoc extracted essentially one chunk from the whole document — the structure collapsed.

Put together, the weak spot is specific and it's exactly the wrong one for this market: form-type documents — invoices, 請求書, tax filings — are the bulk of Japanese back-office paperwork, and that's where DeepDoc struggles most (both the OCR era-name corruption and the table/form collapse live here).

The answer, as a matrix instead of a verdict

"Should a Japanese company use DeepDoc?" has no yes/no answer. It has a font-path-and-doctype answer:

Document type	DeepDoc behavior	Evidence
Embedded-font PDF (gov reports, Word exports)	OCR error 0%; +15% hit@5 from layout	native-text path
Form-font / scanned PDF	令→今 era-name corruption ~100%	OCR fallback path
Table-heavy documents	TSR exact-match only ~30%	headline feature underperforms
Form documents (e-Tax style)	near-total failure, ~1 chunk extracted	structure collapses

That matrix is the deliverable. Not "good" or "bad" — good here, broken there, and here's the line.

Why I bothered, and what's next

Most tooling reviews test on clean English PDFs and report a single score. The failure modes that actually bite enterprises live in the specifics: a particular font path, a particular document type, a particular language's calendar. You only see them if you run the real tool on the real document types in the real language — and then trace the error all the way to retrieval, where it either matters or doesn't.

Phase 3 (next in the series): dense-retrieval comparison, a harder golden set with oracle < 100% to pressure-test the +15%, and — the natural next question — the same Japanese documents through MinerU and PaddleOCR, the other major Chinese parsers, to see whether the font-path failure is DeepDoc-specific or ecosystem-wide.

Raw parses, the golden set, the OCR error counts, and the retrieval results are all in the repo:
https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/deepdoc-eval-v1

Companion tooling: eval-sanity (the sanity gate that confirmed the retrieval metric was trustworthy before I reported the delta) and eval-driven-llm (the eval harness this runs on).

My local-LLM benchmark gave every model a perfect score. That was the most useful failure of the project.

elvisyao007 — Thu, 11 Jun 2026 13:08:11 +0000

Repo + raw results: https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/model-selection-v1
Every number below is from a live run on an RTX 5090 (32 GB), Ollama, four models. The v1→v2 history is in the commit log — you can verify the failure.

I set out to answer a narrow, practical question: which local LLM should a Japanese company actually run on-prem? Four candidates, a fixed independent judge, three dimensions — quality, latency, VRAM.

The first run came back with every model scoring near-perfect. Faithfulness 1.0000 across the board. Hit rate 0.90–1.00. Judge-agreement κ = 1.0.

It looked like a clean result. It was actually a broken benchmark — and figuring out why taught me more than any leaderboard number would have.

The number that's too good to be true

Here's v1, 20 questions, four models:

Model	Params	Faithfulness	Hit rate
elyza-jp-8b	8.0B	1.0000	0.90
gemma4-31b	31.3B	1.0000	0.95
nemotron-nano-9b-jp	8.9B	0.9792	1.00
swallow-8b	8.0B	1.0000	1.00

An 8B model and a 31B model scoring identically should set off an alarm. Model capacity that different, collapsing to the same score, almost always means the test isn't resolving the difference — not that the difference is gone.

The discriminability breakdown made it undeniable: 90% of questions were answered correctly by every model, 10% by some, and 0% by none. A benchmark where nine out of ten questions can't tell your candidates apart isn't measuring the candidates. It's measuring whether the questions are easy. They were.

And the κ = 1.0 that looked like a perfect, reassuring judge agreement? When two judges both assign full marks to nearly everything, perfect agreement is the trivial solution, not a strong signal. Zero variance makes the statistic meaningless. A κ of 1.0 here wasn't "the judges agree" — it was "there was nothing to disagree about."

A benchmark that gives everyone full marks is informationally equivalent to no benchmark at all. You can't make a selection decision on it, because it contains no signal about which model to select.

Why this is the interesting part, not the embarrassing part

The temptation is to quietly fix the questions and publish only the clean v2 table. I'm doing the opposite — keeping v1 in the repo, with an ADR documenting the failure — because the failure is the methodology content.

Anyone can run four models through a question set and print a table. The thing that's actually hard, and actually rare, is recognizing that your own measurement is broken when the numbers look great. Most published "local LLM comparisons" never check discriminability at all. They show you a table of high scores and call it a benchmark. If every model on that table scores 90%+, you're looking at the easiness of the questions, not the quality of the models.

So the real deliverable here isn't "model X won." It's a protocol: a model-selection benchmark is only valid if it can resolve the models it compares — and you have to test that explicitly before you trust a single score.

Building discriminability back in

v2 replaced the question set with 45 items deliberately designed to be hard enough to separate the field:

multi-step reasoning rather than single-fact lookup
Japanese nuance — keigo (honorifics), specialized terminology, deliberately ambiguous phrasing
boundary facts — specific dates and figures that are easy to hallucinate

The target wasn't "make it hard for its own sake." It was a specific distribution: the strongest model should still miss some. v1 was the wrong shape (90/10/0). v2 landed at 29% answered by all, 51% by some, 20% by none. That 20% nobody gets is the part that gives the benchmark resolution at the top end.

Here's v2:

Hit-rate spread went from 0.10 (v1) to 0.22 (v2). The models actually separate now. And the judge agreement that mattered: κ dropped from 1.0 to 0.920.

That drop is an improvement. v1's κ = 1.0 was a zero-variance artifact. v2's κ = 0.920 is a real agreement number computed over real disagreement — it's the first version where the judge-reliability statistic actually means anything. If you see a benchmark reporting perfect judge agreement, ask whether there was any variance for the judges to agree about.

The finding worth flagging (with the caveat attached)

The thing that made me look twice: nemotron-nano-9b-jp (8.9B) tied gemma4-31b (31.3B) on hit rate — 0.622 each — while using roughly half the VRAM (~11 GB vs ~20 GB) and running about 2.6× faster (190 vs 71 tokens/s, warm).

If that holds, it's the whole point of doing selection by constraint instead of by raw capability. The biggest model is not automatically the right deployment choice. Under a VRAM ceiling, a latency target, or a throughput requirement, a 9B Japanese-sovereign model that matches a 31B on the task is the better call — and you'd never see that from a "which model is strongest" framing.

The honest caveat, up front: this is 45 questions. The nemotron-vs-gemma4 tie is an observation on this set, not a settled result. It needs a larger sample to confirm, and I'm reporting it as a lead to chase, not a conclusion to act on. The point of the protocol is precisely that you don't get to claim a result the sample can't support.

Judge setup, for the skeptics

Because the first question a careful reader asks is "who graded this, and did anything grade itself":

Primary judge: qwen3:32b — and it is not a contestant. It's a Chinese model; by my own deployment/content separation rule it doesn't belong in the Japanese on-prem default lineup, so it sits out the race and judges instead. That sidesteps self-preference bias: no contestant grades its own homework or a same-family sibling's.
Cross-validation: gemma4:31b re-judged a 20-question subset to check the primary judge's reliability (the κ = 0.920 above). gemma4 is a contestant, so it's used only to validate the judge protocol — never to score itself.

Two models can't co-reside on 32 GB (qwen3:32b ~29 GB, gemma4:31b ~19 GB), so the whole thing runs two-pass: generate all answers, evict, load judge, score all answers. Resumable, cached per model.

What I'd hand to anyone benchmarking models for selection

Check discriminability before you trust any score. If most questions are answered correctly by every candidate, your benchmark is measuring question difficulty, not model quality.
A perfect score is a red flag, not a green one. Especially when models of very different size tie.
Perfect judge agreement (κ=1.0) on a low-variance set is meaningless. A slightly lower κ over real disagreement is worth more.
Select by constraint, not by raw capability. "Strongest" and "right for this deployment" are different questions.
Keep the failed version. The path from a broken benchmark to a working one is the part nobody can fake.

Full protocol, the v1 failure, the v2 fix, and every raw judged output:
https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/model-selection-v1

The companion tooling — a zero-dependency library that audits whether a retrieval metric can be trusted in the first place — is at eval-sanity.

I built a self-hosted LLM stack that grades itself — audit trail, per-user auth, and a built-in acceptance test

elvisyao007 — Thu, 11 Jun 2026 06:39:24 +0000

canonical_url: https://dev.to/elvisyao007/REPLACE-AFTER-PUBLISH

Repo: https://github.com/elvisyao007/onprem-llm-stack (Apache-2.0)
Runs fully on-prem. No data — including the audit log — leaves the box.

Most "deploy your own ChatGPT" tutorials stop at the moment the container answers a question in a browser. That's the easy 20%. The hard 80% is everything an enterprise actually asks before it puts the thing in front of users: Who can call it? What did they ask? And how do I know it's good enough to ship — objectively, not by vibes?

The reason this matters isn't theoretical. Across 2026 enterprise surveys, roughly 88% of AI pilots never reach production, and the most-cited blocker isn't model quality — it's the absence of an evaluation/acceptance bar and the governance around access and audit. A demo that runs is not a production signal.

So I built a stack where the demo isn't the deliverable. The deliverable is three things a tutorial skips:

Data never leaves the box — including the audit log.
Per-user access control with attributable audit — you can answer "who tried to call a model they weren't allowed to."
A built-in acceptance test — one command, and the stack grades itself with an independent judge and gives you a PASS/FAIL.

The boring part (compose + a gateway + a web UI) is the part everyone already has. This post is about the three parts they don't, and the three bugs I only found because I actually tried to run them.

The shape of it

Nothing exotic in the wiring:

Inference: Ollama in the dev profile (the host already runs it), vLLM in the prod profile. The gateway hides which one is behind it.
Gateway: LiteLLM — one place for keys, budgets, model routing, and audit callbacks.
UI: Open WebUI.
Two compose profiles, every image pinned to an exact version (no latest — air-gapped reproducibility is a precondition, not a nice-to-have).

The interesting design choice is what sits at the center: the evaluation methodology is the backbone; the retriever, the model, the framework are all swappable payload. Everything in the stack is config except the question "is this good enough," which is the one thing you can't outsource to a model version bump.

Bug #1: the access check that exists but never runs

The first real feature is per-user virtual keys: alice may call qwen3-32b, bob may call gemma4-31b, and crossing that line should return a 403.

LiteLLM (v1.88.1) ships a function called can_key_call_model. It does exactly what the name says. The problem: on the custom-auth path, it's never invoked from common_checks. So with a custom authenticator wired in, a key authorized for any model could call every model. The guardrail was in the codebase and silently bypassed.

The fix wasn't to monkey-patch the routing layer. It was to enforce access at the only point where I had both the authenticated identity and the requested model in hand: read the model out of the raw request body inside the auth hook, check it against the key's allow-list, and raise 403 before returning the auth object.

alice → qwen3-32b   → 200 OK
alice → gemma4-31b  → 403 model_access_denied
bob   → gemma4-31b  → 200 OK
bob   → qwen3-32b   → 403 model_access_denied

The lesson I'd hand to anyone wiring custom auth into a gateway: a function existing in the library is not the same as that function running on your code path. Verify the denial, don't assume the helper fires.

Bug #2: "someone broke the rules, but we don't know who"

With denials working, I checked the audit log. The successful requests were fine — user, model, token counts, latency. The denied requests were recorded as user_id='unknown'.

That's the worst possible failure for a security audit. "An unauthorized attempt happened and we can't attribute it" is exactly the line you don't want in front of an enterprise security reviewer. And it's backwards from how audit value actually works: who tried to cross a boundary is more important to log than who used the system normally.

The root cause was a sequencing problem. LiteLLM calls the failure callback after the custom auth raises — and by then the request metadata is empty, so the callback has no identity to attribute the row to.

The fix mirrors Bug #1: write the audit record at the one moment the context exists — inside the auth hook, before raising the 403 — with the correct user, key label, model, and denial reason. Then I tag the exception so the downstream failure callback sees it's already been logged and skips it, instead of writing a second unknown row.

One detail I left deliberately: genuinely invalid keys (keys that don't exist in the system at all) still log as unknown. That's honest — there's no identity to attribute. The audit distinguishes "a known user attempted something they weren't allowed to" from "an unidentifiable caller hit the door." Those are different events and the log should say so.

The actual differentiator: the stack grades itself

Here's the part no compose tutorial has. After you bring the stack up, you run:

make smoke-eval

and it runs a small, fully offline acceptance test: ~15 neutral factual questions, asked through the gateway to the model under test, then scored by a different model acting as judge — and prints a PASS/FAIL against a threshold.

Two principles, both non-negotiable:

The judge is never the generator. Default generator is qwen3-32b; default judge is gemma4-31b — different model families, so nothing grades its own homework. The summary JSON literally carries judge_independent: true, and the report states it in plain text. A self-graded eval is worth nothing; if you remember one thing from this section, make it that.

The golden set contains zero real data. It's neutral technical/general-knowledge questions. An acceptance test that ships with customer data would contradict the entire "nothing leaves the box" premise — including, especially, the test itself.

My run:

smoke-eval  →  PASS  11/15 (73.3%)   threshold 70%
generator: qwen3-32b   judge: gemma4-31b (independent)

73.3%, not 100% — and that's the point. An acceptance test that returns a perfect score on first run is a test that isn't testing anything: either the questions are trivial or the judge is lenient. Four failing questions means the bar has resolution. The number you can trust is the one that can come back red.

This is the line between a demo and a production system. The enterprise blocker isn't "can it answer" — it's "by what objective standard is it good enough to ship." The stack answers that in the first minute, on the customer's own hardware, with a judge that never phones home.

Bug #3: two big models, one 32 GB GPU

Running the eval surfaced the hardware reality. qwen3:32b needs ~29 GB; gemma4:31b needs ~19 GB. Generator + judge = 48 GB on a card that holds 32. They cannot co-reside.

The fix is a two-pass design: generate all answers first, evict the generator (keep_alive=0), then load the judge and score all answers in a second pass. The naive structure — generate one, judge one — would thrash the GPU, swapping a 20–30 GB model in and out on every single question. Batching the passes turns dozens of model loads into exactly two.

There was a second, subtler trap. Both models are "thinking" models — they spend output tokens on chain-of-thought before the answer. With a tight judge token budget, the CoT exhausts the allowance and you get empty content back, which the judge then can't score. The fix was to pass think:false through the gateway and raise the judge's token ceiling. You don't see this one unless you actually run the loop end to end on real hardware; it never shows up in a notebook.

What I deliberately left out

v0.1 ships auth, audit, and the acceptance test. It does not ship PII content guardrails, SSO/LDAP, Langfuse-style observability, Kubernetes, or multi-GPU serving. Those are on the roadmap, written down as roadmap — not quietly absent. Scope discipline is the whole game for a solo build: a small thing that actually survives enterprise reality beats a big thing that's 80% stubs.

Why this exists

The three bugs above share a shape: the capability looked present, and only running it proved whether it was real. The access check existed but didn't fire. The audit logged, but not the events that mattered. The eval would run, but thrash the GPU and feed the judge empty answers. None of these are visible from the README of the tool you're integrating — they're visible from the failing run.

That's the difference I'm trying to build into everything here: not a system that runs, but a system that survives dirty data, multiple users, data that can't leave the building, and an objective, repeatable definition of "good enough to ship."

The deployment stack is one repo. The full evaluation methodology lives in two companions:

eval-driven-llm — the eval-first reference system (frozen golden sets, pinned independent judge, deterministic retrieval metrics).
eval-sanity — a zero-dependency tool that audits whether your retrieval metric can be trusted in the first place.

Stack: onprem-llm-stack. Clone it, bring it up, run make smoke-eval, and watch it grade itself.

Your RAG dashboard can hide a failing retriever: detecting silent regression

elvisyao007 — Mon, 08 Jun 2026 16:44:58 +0000

This is a follow-up to an earlier post where I found that my context-recall
metric over-reported retrieval failure (it flagged 33/100 answers that were
actually fine). This post is about the opposite and more dangerous failure: a
metric that under-reports. Retrieval quietly gets worse, your generation
metrics stay green, and the dashboard shows nothing. I packaged the detector
into eval-sanity v0.2.

The failure mode

Here is a pattern that shows up repeatedly in production RAG postmortems. A
system ships with a healthy offline eval — say faithfulness around 0.9. Weeks
later, users start reporting that some fraction of answers miss a key fact. The
team checks the dashboard: faithfulness is still ~0.9. Nothing looks wrong.

What actually happened: the retriever degraded — a re-index, an embedding model
swap, a chunking change — and started missing relevant documents on a subset of
queries. But the generator kept producing fluent, internally-consistent answers
from whatever partial context it received. Faithfulness measures "is the answer
grounded in the retrieved context," not "was the right context retrieved." So
faithfulness stayed high while retrieval silently fell off.

If your dashboard tracks only generation-stage metrics, this regression is
invisible. That's the trap: the two metrics move independently, and the healthy
one masks the broken one.

Why a single eval run can't catch it

You can't see this in one snapshot. A faithfulness of 0.9 looks fine in
isolation. The signal only exists in the comparison between two runs — a
baseline and a current — where you ask: did retrieval drop while generation
held steady? That specific divergence is the fingerprint of a silent
regression.

But naive comparison creates a different problem: noise. Eval scores wobble
between runs from judge variance and sampling. If you alarm on "current is
lower than baseline," you'll fire constantly on noise, and an alarm that cries
wolf gets ignored by the second week. So the detection has to separate real
movement from jitter.

What eval-sanity v0.2 does

pip install eval-sanity

detect_regression takes two eval runs — the retrieved/relevant doc IDs you
already have, plus your generation scores (faithfulness or similar) passed in —
and reports which of four states you're in:

silent regression (alarm): retrieval dropped significantly, generation did not move
visible regression: both dropped — your dashboard already shows this, no alarm needed
generation-only: generation moved, retrieval held
stable / noise: nothing moved beyond jitter

The "significantly" is the important part. Every delta goes through a paired
bootstrap (10k resamples, fixed seed) and a 95% confidence interval. A
change counts only if its CI excludes zero. This is what keeps it from firing
on noise. It runs in a fraction of a second, with zero dependencies and no model
calls — it's pure deterministic math on the IDs and scores you already have.

A worked example

Here's a synthetic case that makes the divergence concrete (numbers from the
package's demo, not a real client system):

A baseline run with recall@5 = 0.95 and faithfulness = 0.90. A current run where
retrieval has degraded to recall@5 = 0.667, but faithfulness is unchanged at
0.90. The detector reports:

recall@5     0.95 → 0.667   CI [-0.417, -0.150]   significant drop
faithfulness 0.90 → 0.90    CI [-0.005, +0.005]   unchanged

*** ALARM *** SILENT REGRESSION
Retrieval dropped while generation held steady — your dashboard won't show this.

And the control case — a current run where only 2 of 60 queries flip, pure
jitter:

recall@5     CI [-0.083, +0.000]   includes zero → flat
No significant change; within noise.

Same machinery, no false alarm. That control case matters more than the alarm
case: a regression detector you can't trust to stay quiet is worse than no
detector at all.

How to wire it in

The point isn't to run this once. It's to run it on every meaningful change —
a re-index, an embedder swap, a chunking tweak — comparing against your last
known-good baseline, before the change reaches users. It's a regression gate,
the retrieval-stage equivalent of a test that fails CI when generation metrics
alone would have stayed green.

A complete RAG eval program needs at least one retrieval-stage signal alongside
the generation-stage ones, precisely so the healthy metric can't hide the
broken one. eval-sanity is a small, dependency-free way to make that
retrieval-stage check a regression gate rather than a number nobody compares
across runs.

→ github.com/elvisyao007/eval-sanity

The detection logic, the bootstrap implementation, and the full test suite
(covering each of the four states plus the noise-rejection case) are in the
repo. All example numbers are from the package's own deterministic demo.

I built a tiny tool to catch the metric trap from my last post

elvisyao007 — Mon, 08 Jun 2026 15:16:40 +0000

In my last post I found that 33/100 "grounded-but-wrong" answers in my RAG
eval were a measurement artifact — not real failures. The culprit: proportion
recall with a relevant-doc-count denominator silently breaks on multi-answer
datasets when k is small.

So I packaged the diagnostic into a standalone tool: eval-sanity.

pip install eval-sanity

It takes the retrieved and relevant doc IDs you already have and tells you
whether your recall metric is structurally capable of saying what you think
it says — before you trust the number on your dashboard.

What it checks:

oracle ceiling: the best any retriever could score at your k
threshold reachability: how many queries can never clear your threshold, regardless of retrieval quality
hit@k vs proportion divergence: where the two metrics disagree

Zero dependencies. No models. No judge calls. Pure deterministic math.

→ github.com/elvisyao007/eval-sanity

The motivation story is in the blog post that found the artifact.

The 33 'grounded-but-wrong' answers were a metric artifact: how ID-based context recall lies on multi-answer datasets

elvisyao007 — Mon, 08 Jun 2026 11:46:41 +0000

Correction note: This post corrects a claim I made in two earlier posts. I previously reported "33/100 grounded-but-wrong" answers in my JQaRA RAG eval and framed them as a retrieval/generation failure worth fixing with hybrid search. After decomposing the numbers, zero of those 33 were real failures — all 33 are an artifact of how I measured context recall. This post shows exactly how the metric misled me, because the failure mode is one a lot of people are exposed to without knowing it.

TL;DR

My pipeline used an ID-based context recall: |retrieved ∩ relevant_doc_ids| / |relevant_doc_ids|. This is a real, widely-used variant (it matches RAGAS's NonLLMContextRecall / IDBasedContextRecall).
I flagged answers as grounded-but-wrong when faithfulness ≥ 0.8 AND context_recall < 0.5. 33/100 queries got flagged.
When I checked hit@5 (did at least one relevant doc make it into the top-5 context?), it was 98/100. Retrieval was not failing.
The 33 flagged queries had a mean of 16 relevant documents each; 28 of 33 had more than 10.

With k=5, the maximum possible ID-based recall is 5/16 ≈ 0.31 — below the 0.5 threshold even for a perfect retriever. The threshold was unreachable by construction.

The only 2 genuine retrieval misses (hit@5 = 0) scored faithfulness = 0.0 and were correctly not flagged as grounded-but-wrong. The pipeline worked; the metric definition didn't fit the dataset.

The lesson isn't "RAGAS is broken." It's that a recall metric whose denominator is the relevant-document count silently breaks when the dataset has many relevant docs per query and your k is small — and that combination is easy to walk into.

What I claimed earlier

In two earlier posts I reported a JQaRA evaluation of a local RAG stack (ruri-v3 retriever, qwen3:32b generator, gemma4:31b judge). One headline number was 33/100 grounded-but-wrong: answers the judge rated highly faithful to their retrieved context, yet whose retrieved context appeared to be missing the relevant material. I read that as "the model is confidently using incomplete context," and I lined up a hybrid (BM25 + dense) experiment to fix the retrieval side.

That story was wrong. Here's how I found out.

The gate that saved the experiment

Before running hybrid, I computed a ceiling: on a fixed 100-candidate reranking dataset like JQaRA, context recall can't exceed "how often the relevant docs are even in the candidate set." The gap looked large (+0.20), so the gate said "continue."

But the rank distribution was suspicious. Among queries where a relevant doc was in the candidate set, the dense retriever already ranked it at p50 = 1, p90 = 2. If relevant docs are almost always at the very top, where is a +0.20 recall gap coming from?

So instead of running hybrid, I decomposed the gap.

The decomposition

Two numbers ended the experiment before it started.

hit@5 = 98/100. For 98 of 100 queries, at least one relevant document was in the top-5 context handed to the generator. Retrieval was essentially doing its job.

Mean relevant docs among the 33 flagged queries = 16.0, with 28 of 33 above 10 relevant docs.

Now the metric definition collides with the dataset. ID-based context recall is:

context_recall = |retrieved_doc_ids ∩ relevant_doc_ids| / |relevant_doc_ids|

With k=5 and 16 relevant docs, the best achievable value is 5/16 ≈ 0.31. The grounded-but-wrong flag fires when context_recall < 0.5. A perfect retriever scores 0.31 here and gets flagged anyway. The 0.5 threshold isn't measuring retrieval quality on these queries — it's measuring "does this query have more than ~10 relevant docs," which on JQaRA it usually does.

Swap in hit@5 (≥1 relevant doc retrieved) as the recall signal and grounded-but-wrong drops from 33 to 0. The 2 queries that genuinely retrieved nothing relevant scored faithfulness 0.0 — the judge caught them, and they were never in the 33. The pipeline was working the whole time.

Why this is easy to walk into

This isn't a RAGAS bug. ID-based / non-LLM context recall is a legitimate, documented metric, and on a single-answer dataset (one gold doc per query) the denominator is 1 and none of this happens. The trap is the interaction:

Denominator = relevant-doc count (not "claims in the reference answer," which is RAGAS's default LLM-based variant)
Many relevant docs per query (JQaRA averages well above 10)
Small k (I used 5)
A fixed threshold (0.5) applied uniformly across queries with wildly different denominators

Each choice is individually reasonable. Together they manufacture a "failure" rate that tracks dataset structure, not system quality. If you picked an ID-based recall because it's deterministic and cheap (I did — no judge calls, fully reproducible), this is exactly the blind spot you inherit.

What I'd actually do

Match the metric to the dataset's answer multiplicity. For multi-answer sets, a denominator that can exceed k makes proportion-style recall uninterpretable. Use hit@k for "did we get anything relevant," and reserve proportion recall for when k ≥ typical relevant-doc count.
Make thresholds relative, not absolute. context_recall < 0.5 means something different when the ceiling is 1.0 vs. 0.31. Normalize against the achievable ceiling, or threshold on hit@k instead.
Sanity-check any "failure" cohort against an oracle. If a perfect retriever would also be flagged, the flag is about your metric, not your system. This single check would have caught it before I wrote the first post.

The correction

I've added update notes to the two earlier posts pointing here. To be precise about what changed:

The hybrid experiment is archived, not run — its motivation no longer exists. I'd rather publish that than run an experiment to make a flawed number look better.

All numbers are recomputed directly from the eval output JSON; the analysis script and decision log are in the repo.

faithfulness spread = 0.000: what self-grading RAG eval actually looks like

elvisyao007 — Sun, 07 Jun 2026 18:22:53 +0000

Update (2026-06): The grounded-but-wrong counts in this post (48/100
self-eval, 33/100 independent judge) are affected by a metric-definition
issue I found later — see blog-03 for the full analysis.
Short version: the 0.5 threshold on ID-based context recall is structurally
unreachable on multi-answer queries with k=5, so those absolute counts reflect
dataset structure more than system quality. The self-eval vs independent-judge
methodology point still stands; only the absolute numbers need this caveat.
Original text unchanged below.

description: "I ran my RAG eval twice — once with the same model grading itself, once with an independent judge from a different family. Here's what changed, and why spread = 0.000 is the tell."

Last post I claimed something specific: faithfulness scored 0.67, but an independent judge found 33 of 100 answers were grounded in context and still factually wrong.

A fair question: why trust that judge?

I have a concrete answer, because I ran the eval twice. The first run used the same model for both generation and judging — self-grading. The second run used a completely different model family as the judge. Here are the numbers from both.

The before and after

Metric	Self-judge (qwenj, same model)	Independent judge (gemma4:31b)
faithfulness mean	0.7751	0.6662
faithfulness spread	0.0000	0.0500
grounded-but-wrong	48 / 100	33 / 100

Read the spread row. The self-judge returned a spread of exactly 0.0000 — not "near zero," literally zero. Every query returned an identical faithfulness distribution. The judge was not reading the answers. It was rubber-stamping.

The independent judge returned a spread of 0.05. Small, but non-zero: the judge was actually discriminating between better and worse answers.

Everything else follows from that single difference.

Why spread = 0.000 is the tell

A judge that is genuinely evaluating will find some answers more faithful than others — it will disagree with itself across queries. A judge that has collapsed into rubber-stamping gives the same score to everything, because it has stopped reading. The variance goes flat.

Non-zero spread is necessary but not sufficient evidence of a good judge. A random judge also has spread. The spread check rules out the worst case — the complete collapse of judgment — not all cases. The gold standard is still human-label agreement on a sampled subset. But zero spread is an immediate red flag that something is wrong.

The self-judge gave faithfulness 0.7751. That number is almost certainly inflated. When the same model generates an answer and then evaluates it, it tends to recognize its own phrasing and reward it. The technical term is self-enhancement bias — a documented effect that scales with model capability and persists even when authorship is hidden.

What inflated faithfulness does downstream

Faithfulness inflation doesn't just change one number. It cascades.

The self-judge scored more answers as "faithful" (inflated 0.7751 vs 0.6662). A larger faithful pool means more opportunities to be grounded-but-wrong. That's why the self-judge found 48 grounded-but-wrong answers while the independent judge found 33: the self-judge was counting answers as "grounded" that the independent judge correctly did not. False positives in faithfulness create false positives in grounded-but-wrong.

The independent judge, being more accurate about faithfulness, shrank both numbers toward reality.

How I built the independent judge

Three things that matter:

Cross-family split. My generator is qwen3:32b (Qwen, Alibaba). My judge is gemma4:31b (Gemma, Google). Different model, different family, different training lineage. Self-preference bias leaks across a model family, not just an exact checkpoint — using a different Qwen checkpoint as the judge would still be suspect. The key is the family boundary.

Ground-truth anchor. Self-preference bites hardest on subjective tasks where there's no right answer to compare against. JQaRA ships gold answers. My correctness check asks the judge to compare the model's answer against the gold answer — not to issue a free-floating opinion. Anchoring on a reference shrinks the surface where bias can hide.

The on-prem cost. On a single RTX 5090 with 32 GB VRAM, qwen3:32b (20 GB) and gemma4:31b (19 GB) can't both be resident at the same time. I had to build a two-pass architecture: all generation first, then explicit VRAM unload, then all judging. This also required routing around the OpenAI-compat endpoint — thinking-capable models exhaust max_tokens with reasoning tokens before emitting content, so I used Ollama's native /api/chat with think=false. None of this is hard, but it's the operational reality of doing this properly on-prem, and it's the kind of friction that makes most people default to self-judging in a single pass.

Being honest about the limits

Non-zero spread rules out rubber-stamping. It doesn't prove the judge is calibrated. For that, you need to hand-label a sample — grade 30–50 answers yourself and measure how often the judge agrees. I haven't published that calibration for this run yet. The spread check is a fast sanity gate, not the finish line.

What to gate RAG eval on

An independent judge — different family, not just different checkpoint. Self-judging numbers are theater.
Ground truth where it exists. A reference answer reduces the bias surface more than any prompting trick.
Spread as a sanity check. Report it alongside the mean. Zero spread = stop, something is wrong.
Human-label calibration on a sample before you trust the judge in production.

The self-judging run gave a clean-looking 0.77 faithfulness with zero spread. The independent run gave 0.67 with 0.05 spread, and found 15 fewer grounded-but-wrong answers. The real system was worse than the self-judge claimed and better-characterized than the inflated number suggested. The 0.67 is more credible precisely because it's lower.

The full run — both phases, infrastructure fixes, raw scores — is here: github.com/elvisyao007/eval-driven-llm. Next I'm going after context_recall = 0.41 with hybrid retrieval, judged by the same independent setup. Following the build in public.

My RAG's faithfulness was 0.67. 1 in 3 answers were still wrong.

elvisyao007 — Sun, 07 Jun 2026 17:02:51 +0000

Update (2026-06): A later analysis showed the "33/100 grounded-but-wrong"
figure in this post is a metric artifact, not a real failure. My ID-based
context recall used the relevant-document count as its denominator; on JQaRA
(~16 relevant docs/query average) with k=5 and a 0.5 threshold, even a perfect
retriever scores below 0.5 and gets flagged. hit@5 was actually 98/100.
Full breakdown: What "grounded-but-wrong" actually meant — and why I was
measuring it wrong. Original text below is unchanged.

description: "An on-prem JQaRA eval. Reranking nudged P@1 but the system was still wrong a third of the time. Why faithfulness alone is a trap, and what to gate on instead."

I built a small Japanese RAG system, ran it entirely on my own hardware (RTX 5090, Ollama), and evaluated it with an independent judge model instead of letting the generator grade its own homework.

Two things surprised me, and they're connected:

Adding a reranker — the move everyone reaches for first — barely moved the needle.
My faithfulness score looked acceptable (0.67), yet 33 out of 100 answers were grounded in the retrieved context and still factually wrong.

This post is about why those two facts are the same story, and why a faithfulness gate alone would have shipped a system that's wrong a third of the time without ever flagging it.

TL;DR

Reranking improved P@1 by +1.3 points but lowered Recall@10. It reorders what retrieval already found; it can't retrieve what retrieval missed.
The real bottleneck was recall (context_recall = 0.41): the evidence needed to answer often wasn't retrieved at all.
faithfulness = 0.67 is a trap. Faithfulness measures whether an answer is consistent with the retrieved context — not whether it's correct. An answer grounded in wrong-but-retrieved context scores as faithful.
An independent correctness judge found 33/100 "grounded-but-wrong" answers — confidently wrong, fully grounded, invisible to faithfulness.
Lesson: faithfulness is necessary, not sufficient. Gate on answer-correctness + context_recall, and stop reaching for a reranker when recall is your problem.

The setup (so you can trust the numbers)

Component	Choice
Benchmark	JQaRA (じゃくら) — Japanese QA-for-retrieval, built on the JAQKET quiz set
Retrieval eval	1,667 queries, deterministic
Generation eval	100 queries
Generator	`qwen3:32b`
Judge	`gemma4:31b` — a different model from the generator
Hardware	single RTX 5090, on-prem, Ollama

The judge being a different model matters, and I'll come back to why.

Act 1: the obvious move — add a reranker

The standard RAG upgrade path: dense retrieval is your first stage, a cross-encoder reranker is your second. So I added one and re-ran retrieval.

Metric	Dense	Dense + rerank	Δ
P@1	0.8308	0.8440	+0.0132
Recall@10	0.5738	0.5634	−0.0104

Read that carefully. The reranker did exactly what a reranker does: it sharpened the top of the list (P@1 up — the single best document lands at rank 1 more often) while slightly demoting some relevant docs out of the top 10 (Recall@10 down). That's a precision-for-recall trade, not a free win.

And here's the thing that should give you pause: if your generator reads more than the top result — top-5, top-10 — that recall drop can hurt downstream answers even as P@1 improves. The metric you celebrate isn't the metric that feeds your generator.

The deeper problem: a reranker reorders the candidate set. It cannot conjure a document that dense retrieval never surfaced. Which brings us to the number that actually mattered.

Act 2: the metric I trusted too much

I moved to generation eval expecting faithfulness to be the headline. It came back at 0.6662. Mediocre, but the kind of number you squint at and think "okay-ish, ship the next iteration."

That instinct is the trap.

Metric	Value	What it actually tells you
faithfulness	0.6662	"Looks okay" — and is dangerously incomplete
faithfulness spread	0.0500	Non-zero → the judge is discriminating, not rubber-stamping
context_recall	0.4062	The real bottleneck — evidence often wasn't retrieved
grounded-but-wrong	33 / 100	The failures faithfulness structurally cannot see

Faithfulness measures consistency with the retrieved context, not correctness against ground truth. An answer that faithfully reports a wrong-but-retrieved passage is, by definition, faithful. So a grounded-but-wrong answer doesn't lower your faithfulness score — it sits in the "good" portion of it. Optimize for faithfulness and you are partly optimizing toward confident, well-grounded, wrong answers.

To catch this you need a separate question: is the answer actually correct? I ran that as an independent correctness check against JQaRA's gold answers. The essence:

# Not "is the answer supported by the context?" (faithfulness)
# But "is the answer correct vs the gold answer?" (correctness)

judge(question, model_answer, gold_answer) -> {correct | incorrect}
grounded_but_wrong = faithful(answer) AND NOT correct(answer)

Result: 33 of 100 answers were faithful and wrong at the same time. A faithfulness gate would have waved every one of them through.

Why this happened: recall was the leak

The three numbers line up into one causal chain:

context_recall = 0.41 → for most queries, the passage that actually answers the question wasn't in the retrieved context.
The generator answers anyway, grounding itself in whatever was retrieved — confidently, fluently.
That answer is faithful (grounded in retrieved text) and wrong (the retrieved text didn't contain the answer). → grounded-but-wrong.

So context_recall is the leading indicator, grounded-but-wrong is the lagging confirmation, and faithfulness is the misleading number in the middle that papers over both.

And now Act 1 and Act 2 close into the same loop: I reached for a reranker, but reranking optimizes the wrong stage when recall is your bottleneck. No amount of reordering fixes a document that was never retrieved. The right lever was upstream — chunking, embedding model, hybrid (lexical + dense) retrieval, query expansion — not a cross-encoder polishing a list that's missing the answer.

A note on judge independence (why the spread matters)

If you let a model grade its own outputs, it tends to like them — LLM-as-judge has a well-documented self-preference bias, and a self-judging setup often produces near-1.0 scores with almost no variance. That near-zero spread is the tell.

My judge (gemma4:31b) is a different model from the generator (qwen3:32b), and the faithfulness spread came back at 0.05 — non-zero. Small, but it's the proof that the judge is actually discriminating between good and bad answers rather than rubber-stamping. If you take one process habit from this post, take this one: never let the model that wrote the answer be the model that scores it.

What I'd actually gate a production RAG on

Most "RAG eval" stops at faithfulness because it's the easiest to compute. That's exactly why it's the wrong place to stop. The gate I'd ship behind:

Answer-correctness vs ground truth — the metric that actually catches grounded-but-wrong. Non-negotiable.
context_recall — your leading indicator. If this is low, fix retrieval before you touch the generator or reach for a reranker.
faithfulness — keep it, but only as a hallucination guard on top of correctness, never as a stand-in for it.
An independent judge — different model, and watch the score variance to confirm it isn't rubber-stamping.

A demo proves the happy path works. A system you'd put in front of a business has to know — and prove with numbers — how often it's confidently wrong. The gap between those two is exactly this eval discipline.

Code, the eval harness, and the raw run are here: github.com/elvisyao007/eval-driven-llm. Next I'm going after that context_recall = 0.41 — hybrid retrieval and chunking experiments, measured the same way. Following the build in public.

If you run RAG eval and only look at faithfulness, go check your grounded-but-wrong rate. I'd bet it's not zero.