Structured parsing helps dense retrieval more than it helps BM25 — measured on Japanese docs, and the gap doubled

#rag #llm #japan #machinelearning

Phase 3 of a series measuring Chinese open-source parsing (RAGFlow's DeepDoc) on Japanese documents. This tightens two limits I flagged in the earlier post.
Repo + raw 2×2 results: https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/deepdoc-eval-v2

In an earlier post I measured DeepDoc (RAGFlow's document parser) against plain text extraction on Japanese PDFs and found a +12.5 percentage-point hit@5 advantage from its layout-aware chunking. I wrote down two caveats explicitly: the test was BM25-only, and the golden set had a 100% oracle ceiling (likely too easy, which could amplify the gap).

This post closes both. I added a dense-retrieval dimension and built a harder golden set (oracle 87.5%). I expected DeepDoc's advantage to shrink under dense retrieval — my reasoning was that embeddings might be less sensitive to chunk boundaries than lexical matching.

It did the opposite. The advantage doubled.

The 2×2

Same documents, same questions, four pipeline/retriever combinations. hit@5:

Pipeline	BM25	Dense
A — plain text (pdfplumber)	56.2%	40.6%
B — DeepDoc structured	68.8%	65.6%
Delta (B − A), percentage points	+12.5	+25.0

DeepDoc's edge over plain text is +12.5 percentage points on BM25 but +25.0 points on dense. The structured parse helps dense retrieval roughly twice as much as it helps lexical.

And look at what falls apart: plain text + dense is the worst cell in the table at 40.6%, 15.6 points below plain text + BM25. Switching the plain-text pipeline from lexical to dense retrieval made it worse.
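For reference, here is a minimal sketch of how hit@5 and the deltas in the table can be computed. The hit counts are not taken from the repo: they are back-derived from the reported percentages, assuming all 32 questions sit in the denominator (18/32 = 56.25%, and so on). The function names are illustrative.

```python
from typing import Sequence


def hit_at_k(ranked_ids: Sequence[str], relevant_ids: set[str], k: int = 5) -> bool:
    """True if any relevant chunk id appears among the top-k retrieved ids."""
    return any(cid in relevant_ids for cid in ranked_ids[:k])


def hit_rate(hits: int, total: int) -> float:
    """Hit rate as a percentage."""
    return 100.0 * hits / total


N_QUESTIONS = 32

# ASSUMPTION: hit counts back-derived from the reported percentages,
# not read from the raw results.
HITS = {
    ("plain", "bm25"): 18,     # 56.25%, reported as 56.2%
    ("plain", "dense"): 13,    # 40.625%, reported as 40.6%
    ("deepdoc", "bm25"): 22,   # 68.75%, reported as 68.8%
    ("deepdoc", "dense"): 21,  # 65.625%, reported as 65.6%
}


def delta_points(retriever: str) -> float:
    """DeepDoc minus plain text, in percentage points, for one retriever."""
    return (hit_rate(HITS[("deepdoc", retriever)], N_QUESTIONS)
            - hit_rate(HITS[("plain", retriever)], N_QUESTIONS))


for retriever in ("bm25", "dense"):
    print(f"{retriever}: {delta_points(retriever):+.1f} points")
```

Under those assumed counts, the dense delta corresponds to eight more questions answered out of 32, against four more for BM25.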

Why: chunk quality matters more to dense than to BM25

The likely mechanism shows up in the chunk counts. Plain-text extraction produced 2,934 sliding-window chunks; DeepDoc's structured parse produced 630. That's roughly a 4.7× difference (2,934 / 630 ≈ 4.66), and it cuts differently for each retriever:

BM25 matches keywords. A fragmented chunk still contains its keywords, so lexical matching mostly survives fragmentation. Plain text holds up okay (56.2%).
Dense embeds meaning. A context-stripped sliding-window fragment produces a low-quality vector — there isn't enough coherent context to embed well. So fragmentation hurts dense retrieval badly (40.6%).

DeepDoc's layout separation yields larger, semantically coherent chunks — which is exactly what a dense embedder needs to produce a good vector. So the value of structured parsing isn't constant across retrievers: it's worth more the more your retriever depends on chunk coherence, and dense retrieval depends on it most.
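To make the fragmentation point concrete, here is a toy sketch contrasting fixed-size sliding windows with one-chunk-per-section splitting. It is not DeepDoc's algorithm, and the window size, stride, and sample text are invented for illustration; the post does not state the settings used in Pipeline A.

```python
def sliding_window_chunks(text: str, size: int = 40, stride: int = 20) -> list[str]:
    """Fixed-size character windows; boundaries ignore document structure."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def section_chunks(sections: list[tuple[str, str]]) -> list[str]:
    """One chunk per (heading, body) section, heading kept with its body."""
    return [f"{heading}\n{body}" for heading, body in sections]


# Invented toy document: two short sections.
SECTIONS = [
    ("Revenue", "Total revenue rose compared with the previous year."),
    ("Expenditure", "Social security accounted for the largest share of spending."),
]

structured = section_chunks(SECTIONS)
flat_text = "\n".join(structured)
windows = sliding_window_chunks(flat_text)

print(len(structured), "section chunks vs", len(windows), "window chunks")
for w in windows:
    print(repr(w))
```

Some of the printed windows start mid-sentence or straddle the boundary between the two sections. A keyword matcher can still find "Expenditure" in such a window, but the window no longer expresses a single coherent idea for an embedder to encode.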

If the pattern holds, it generalizes beyond this one parser and this one language. If you're running dense RAG, your chunking strategy is doing more work than you might think, and a parser that respects document structure is buying you more than the same parser would on a BM25 system.

The honest caveats (three of them, all load-bearing)

This is an initial signal, and the limits matter as much as the headline.

1. This is an end-to-end comparison, not an isolated chunk-strategy test. Pipeline B is "DeepDoc's parse plus the chunking it naturally yields." Pipeline A is "plain text extraction plus sliding-window chunking." I did not hold chunking constant and swap only the parser. So strictly, the 2×2 compares two whole pipelines — which is what an enterprise actually deploys — but it does not by itself prove the gain comes from parsing rather than from chunk strategy. Isolating that (same chunker, swap only the parse layer) is the next step. I'm not claiming the layout model alone causes the lift; I'm claiming the end-to-end DeepDoc pipeline retrieves better, and the dense-vs-BM25 split points strongly at chunk coherence as the lever.

2. The embedding is Japanese-specific (ruri-v3-310m). Dense retrieval quality on Japanese is sensitive to the embedder. I used cl-nagoya's ruri-v3, a Japanese-first model, not a general multilingual one. The +25.0-point dense result holds for this embedder. A different embedding model could shift the numbers, so the conclusion is conditioned on a Japanese-tuned embedder.

3. Three documents, 32 questions. I narrowed to the three documents that work as a retrieval testbed (the form-type PDFs that DeepDoc fails to parse — covered in the previous post — aren't usable as a clean retrieval corpus). The golden set is harder than v1 (oracle 87.5% vs 100%: four questions ask for arithmetic-derived values that don't appear verbatim in the corpus, so even a perfect retriever can't surface them). But it's still a small sample. Signal, not verdict.
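The isolation step from caveat 1 can be written down as a third pipeline plus a simple decomposition. This is a sketch of the experiment design only: pipeline C has not been run, the labels are descriptive strings rather than real API names, and the additive split assumes the same questions and retriever across all three pipelines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    name: str
    parser: str
    chunker: str


A = Pipeline("A", parser="pdfplumber", chunker="sliding_window")
B = Pipeline("B", parser="deepdoc", chunker="deepdoc_native")
# HYPOTHETICAL: not measured in this post. Same chunker as A, parser swapped.
C = Pipeline("C", parser="deepdoc", chunker="sliding_window")


def changed_factors(x: Pipeline, y: Pipeline) -> list[str]:
    """Which of (parser, chunker) differ between two pipelines."""
    return [f for f in ("parser", "chunker") if getattr(x, f) != getattr(y, f)]


def decompose(hit_a: float, hit_b: float, hit_c: float) -> dict[str, float]:
    """Split the end-to-end delta (B - A) into a parser part and a chunking part."""
    return {
        "parser_effect": hit_c - hit_a,    # A -> C changes only the parser
        "chunking_effect": hit_b - hit_c,  # C -> B changes only the chunker
        "total": hit_b - hit_a,
    }


print("A vs B differ in:", changed_factors(A, B))
print("A vs C differ in:", changed_factors(A, C))
print("C vs B differ in:", changed_factors(C, B))
```

A vs B changes two factors at once, which is why the 2×2 cannot attribute the lift. Once C is measured, `decompose` with the three hit@5 values for a given retriever shows how much of the delta survives when chunking is held constant.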

What the harder golden set bought

The v1 golden set had a 100% oracle ceiling: every question's answer was present in the corpus, so a perfect retriever could have scored 100%, and the questions were likely easy enough that the retriever was never really stressed. v2's ceiling is 87.5%: four of the 32 questions (asking for figures like 664,957億円, computed totals not present verbatim) can't be answered by any retriever from this corpus.
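A minimal sketch of an oracle-ceiling check, assuming the ceiling is defined as the share of questions whose gold answer string appears verbatim in at least one chunk. The repo's actual check may differ, and the toy data below is invented.

```python
def oracle_ceiling(gold_answers: list[str], chunks: list[str]) -> float:
    """Percentage of gold answers found verbatim in at least one chunk."""
    if not gold_answers:
        raise ValueError("gold_answers must not be empty")
    reachable = sum(
        any(answer in chunk for chunk in chunks) for answer in gold_answers
    )
    return 100.0 * reachable / len(gold_answers)


# Invented toy data: the last answer is a computed total that no chunk contains.
chunks = [
    "Item X: 120 units. Item Y: 80 units.",
    "Item Z: 50 units.",
]
gold = ["120 units", "80 units", "50 units", "250 units"]

print(oracle_ceiling(gold, chunks))  # 3 of 4 reachable -> 75.0
```

With the counts in this post, 28 reachable questions out of 32 gives 87.5%. Note that a substring check like this misses answers split across a chunk boundary, so it is chunking-dependent.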

That matters because the +12.5- and +25.0-point deltas are now measured under genuine difficulty, not on a set so easy the gap could be an artifact of headroom. Tightening the test is what turns "an interesting number" into "a number I'd defend."

Where this lands

The previous post's framing was "does a Chinese parser work on Japanese docs" — useful, but niche. This result is broader: structured parsing pays off more under dense retrieval than under lexical, because dense embeddings punish incoherent chunks harder. That's a statement about RAG architecture, not about one parser or one language — and if you're building dense RAG on any messy document corpus, it's a reason to take your parse-and-chunk layer more seriously than the embedding model choice you probably agonized over instead.

Next in the series: isolate the chunk-strategy variable (same chunker, swap only the parser), and — environment permitting — the same Japanese documents through MinerU and PaddleOCR, to see whether the structured-parse advantage is DeepDoc-specific or holds across the Chinese parser ecosystem.

Raw 2×2 numbers, the harder golden set, the reproducible script:
https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/deepdoc-eval-v2

Companion tooling: eval-sanity (the sanity gate that confirmed the metric before I trusted the delta).