DEV Community

elvisyao007

Which Chinese open-source parser is better for Japanese RAG? It's a crossover — BM25 says DeepDoc, dense says MinerU

Final part of a series measuring Chinese open-source document parsing on Japanese documents.
Repo + raw 3×2 results: https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/deepdoc-eval-v2

Two posts ago I measured RAGFlow's DeepDoc against plain text extraction on Japanese PDFs and found its layout-aware parsing helped retrieval. Last post I found that help was bigger under dense retrieval than BM25 — structured parsing matters more the more your retriever depends on chunk coherence.

This post asks the obvious next question: is DeepDoc actually the best Chinese open-source parser for this, or just the one I tested first? So I added MinerU (another major Chinese parser) to the comparison. (I excluded PaddleOCR: it depends on Baidu's PaddlePaddle framework rather than PyTorch, with known CUDA/cuDNN conflicts, and the environment isolation cost wasn't worth it for this comparison.)

The answer isn't "one of them wins." It's a clean crossover: the better parser depends on your retriever.


The 3×2

Same 3 Japanese government PDFs, same 32-question golden set (oracle ceiling 87.5%), same Japanese embedder (ruri-v3). hit@5:

Pipeline                   BM25     Dense
plain text (pdfplumber)    56.2%    40.6%
DeepDoc                    68.8%    65.6%
MinerU                     62.5%    71.9%
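
For reference, hit@5 can be read as the share of questions for which at least one relevant chunk lands in the top five retrieved results. A minimal sketch of that metric follows. The function names, the chunk ids, and the id-based matching rule are assumptions for illustration, not the repo's actual scoring code.

```python
from typing import Sequence, Set, Tuple

def hit_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int = 5) -> bool:
    """True if any gold chunk id appears among the top-k retrieved ids."""
    return any(chunk_id in gold_ids for chunk_id in ranked_ids[:k])

def hit_rate(runs: Sequence[Tuple[Sequence[str], Set[str]]], k: int = 5) -> float:
    """Fraction of questions with at least one gold chunk in the top k.

    runs holds one (ranked_ids, gold_ids) pair per question.
    """
    hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in runs)
    return hits / len(runs)

# Toy data (hypothetical chunk ids), four questions.
runs = [
    (["c3", "c9", "c1"], {"c1"}),                    # hit: gold chunk at rank 3
    (["c4", "c2", "c8", "c5", "c6", "c7"], {"c7"}),  # miss: gold chunk at rank 6
    (["c0"], {"c0", "c5"}),                          # hit: one of two gold chunks found
    (["c2", "c3"], {"c8"}),                          # miss: gold chunk never retrieved
]

print(f"hit@5 = {hit_rate(runs):.1%}")
```

Matching on chunk ids is only one option. A pipeline that re-chunks per parser may need to match on answer text instead, since chunk boundaries differ between parsers.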

Read the two structured parsers against each other:

  • On BM25, DeepDoc wins (68.8% vs 62.5%)
  • On dense, MinerU wins (71.9% vs 65.6%)

A crossover. And the single best cell in the entire table is MinerU × dense at 71.9% — so if you're building Japanese dense RAG, MinerU is the current pick. If you're on a lexical/BM25 system, DeepDoc.

This is the answer to "how do you choose a parser" that's actually useful: not "the strong one," but "match the parser to your retriever." On this testbed, MinerU's chunks suit dense embedding better and DeepDoc's suit keyword matching better. Neither is universally better.
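
The "match the parser to your retriever" rule amounts to taking the best row per column of the table rather than the best row overall. A small sketch using the hit@5 figures reported above (the dictionary layout and function name are illustrative assumptions):

```python
# hit@5 percentages copied from the 3x2 table in this post.
HIT_AT_5 = {
    "plain text (pdfplumber)": {"BM25": 56.2, "Dense": 40.6},
    "DeepDoc": {"BM25": 68.8, "Dense": 65.6},
    "MinerU": {"BM25": 62.5, "Dense": 71.9},
}

def best_parser(retriever: str) -> str:
    """Return the parser with the highest hit@5 for the given retriever column."""
    return max(HIT_AT_5, key=lambda parser: HIT_AT_5[parser][retriever])

for retriever in ("BM25", "Dense"):
    print(retriever, "->", best_parser(retriever))
```

The winner changes between the two columns, which is the crossover.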


A confirmation worth noting: the era-name failure follows the OCR path, not the parser

In the first post, DeepDoc's OCR fallback corrupted Japanese era names (令→今) on form/scanned PDFs, while its native-text path was clean. MinerU showed zero era-name errors on these documents — because, like DeepDoc, it routes text-layer PDFs through direct extraction (PyMuPDF) rather than OCR.

That's the useful confirmation: the clean-vs-corrupt split is a property of the font path (text-layer extraction vs OCR fallback), not of any one parser. Both Chinese parsers handle embedded-font government PDFs cleanly. The era-name risk lives in the OCR fallback that scanned/form documents force — and that's where you'd need to test either tool carefully for a Japanese deployment.
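
One cheap way to test for this on your own documents is to scan parser output for 今和 in positions where 令和 (Reiwa) is almost certainly meant. The sketch below is a heuristic under stated assumptions: it only looks for the 令→今 substitution, and it only flags 今和 when a year expression follows. The pattern and function name are illustrative and are not taken from either parser or from the repo.

```python
import re

# Assumption: the only corruption checked is 令 misread as 今 inside 令和.
# "今和" followed by a year (digits, including full-width digits, or 元 for
# the first year) and 年 is treated as a likely corrupted era name.
# Kanji-numeral years such as 六年 are not covered by this pattern.
SUSPECT_REIWA = re.compile(r"今和(?=\s*(?:\d+|元)\s*年)")

def find_era_corruption(text: str) -> list[int]:
    """Return character offsets where 今和 looks like a corrupted 令和."""
    return [match.start() for match in SUSPECT_REIWA.finditer(text)]

clean = "令和6年4月1日施行"
corrupt = "今和6年4月1日施行、今和元年度予算"

print(find_era_corruption(clean))    # no suspicious offsets expected
print(find_era_corruption(corrupt))  # offsets of both suspicious 今和
```

Running a check like this over each parser's output, split by text-layer and scanned inputs, gives a per-path error count instead of an impression.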


The caveats — three, and they're the price of trusting the numbers

1. The speed comparison is NOT apples-to-apples. MinerU ran on GPU (PyTorch cu128 on the RTX 5090); the DeepDoc numbers from the earlier phase were CPU. So while MinerU parsed the three documents in 34–46s each, I'm not reporting a clean "MinerU is N× faster than DeepDoc" — that would compare GPU against CPU and mislead. Aligning both on the same hardware is left for a follow-up. Take the speed dimension as unmeasured here, not as a MinerU win.

2. Three documents, 32 questions. The crossover is an observation on this set, not a settled result. At this sample size, each 6.25-point gap is just two questions, so it could move or flip with more documents.

3. The symmetry is suspicious. DeepDoc leads BM25 by 6.25 percentage points and MinerU leads dense by 6.25 percentage points: the exact same margin in both directions, which is two questions each way. That clean symmetry is more likely a quantization artifact of a 32-question set (each question ≈ 3.1 points) than a real law of nature. I'm reporting the direction of the crossover with confidence and the precise magnitude with none.
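
The quantization point can be checked with plain arithmetic. The sketch below converts the reported percentages back to whole-question counts. The counts are inferred from the rounded table values, not read from the raw results, so treat them as an assumption.

```python
N_QUESTIONS = 32

# hit@5 percentages as reported in the 3x2 table.
REPORTED = {
    ("DeepDoc", "BM25"): 68.8,
    ("MinerU", "BM25"): 62.5,
    ("DeepDoc", "Dense"): 65.6,
    ("MinerU", "Dense"): 71.9,
}

def to_hits(pct: float, n: int = N_QUESTIONS) -> int:
    """Nearest whole number of questions that a rounded percentage implies."""
    return round(pct / 100 * n)

hits = {cell: to_hits(pct) for cell, pct in REPORTED.items()}

bm25_margin = hits[("DeepDoc", "BM25")] - hits[("MinerU", "BM25")]
dense_margin = hits[("MinerU", "Dense")] - hits[("DeepDoc", "Dense")]

print(hits)
print("BM25 margin: ", bm25_margin, "questions =", 100 * bm25_margin / N_QUESTIONS, "points")
print("Dense margin:", dense_margin, "questions =", 100 * dense_margin / N_QUESTIONS, "points")
```

With only 33 possible scores on a 32-question set, two cells that differ by the same number of questions will always show an identical percentage gap.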

The honest version: on this testbed, with a Japanese embedder, the direction is a crossover — DeepDoc better for lexical, MinerU better for dense. The exact numbers need a bigger set.
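
To put a rough number on "needs a bigger set", here is a sketch of 95% Wilson score intervals for the two dense cells. The hit counts (23 and 21 of 32) are inferred from the reported percentages. The intervals treat each cell as an independent proportion, which ignores that both parsers were scored on the same questions. A paired test on per-question outcomes would be the better tool, but it needs the raw results.

```python
import math
from typing import Tuple

def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

N_QUESTIONS = 32
mineru_dense = wilson_interval(23, N_QUESTIONS)   # 71.9% reported
deepdoc_dense = wilson_interval(21, N_QUESTIONS)  # 65.6% reported

for name, (low, high) in (("MinerU x dense", mineru_dense), ("DeepDoc x dense", deepdoc_dense)):
    print(f"{name}: {low:.1%} to {high:.1%}")

# The intervals overlap heavily if MinerU's lower bound sits below DeepDoc's upper bound.
print("overlap:", mineru_dense[0] < deepdoc_dense[1])
```

Each interval spans roughly 30 percentage points at n = 32, which dwarfs the 6.25-point gap between the two parsers.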


Where the series lands

Three posts, one arc:

  1. DeepDoc on Japanese: found a font-path-specific OCR failure (era names), and that layout parsing helps retrieval (+12.5 points hit@5 on BM25).
  2. Dense vs BM25: that help doubles under dense retrieval (+25.0 points), because dense punishes incoherent chunks harder. A statement about RAG architecture, not one tool.
  3. DeepDoc vs MinerU (this post): the best parser is a crossover on your retriever; MinerU × dense is the strongest combination measured; the era-name failure tracks the OCR path rather than any one parser.

The throughline isn't "Chinese parsers are good/bad." It's that parser choice is a constrained decision — your retriever, your document font path, your language — and the only way to get the answer is to measure your own stack, traced through to retrieval, on a test set hard enough to actually separate the options.

That's the part most tooling comparisons skip, and it's the part that's worth doing.

Raw 3×2 numbers, the parser-comparison breakdown, reproducible scripts:
https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/deepdoc-eval-v2

Companion: eval-sanity (the sanity gate that confirmed the metric before each delta was trusted).
