Final part of a series measuring Chinese open-source document parsing on Japanese documents.
Repo + raw 3×2 results: https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/deepdoc-eval-v2
Two posts ago I measured RAGFlow's DeepDoc against plain text extraction on Japanese PDFs and found its layout-aware parsing helped retrieval. Last post I found that help was bigger under dense retrieval than BM25 — structured parsing matters more the more your retriever depends on chunk coherence.
This post asks the obvious next question: is DeepDoc actually the best Chinese open-source parser for this, or just the one I tested first? So I added MinerU (another major Chinese parser) to the comparison. (I excluded PaddleOCR: it depends on Baidu's PaddlePaddle framework rather than PyTorch, with known CUDA/cuDNN conflicts, and the environment isolation cost wasn't worth it for this comparison.)
The answer isn't "one of them wins." It's a clean crossover: the better parser depends on your retriever.
The 3×2
Same 3 Japanese government PDFs, same 32-question golden set (oracle ceiling 87.5%), same Japanese embedder (ruri-v3). hit@5:
| Pipeline | BM25 | Dense |
|---|---|---|
| plain text (pdfplumber) | 56.2% | 40.6% |
| DeepDoc | 68.8% | 65.6% |
| MinerU | 62.5% | 71.9% |
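For anyone new to the metric, here is a minimal sketch of how a hit@5 cell like the ones above can be computed. It assumes the usual definition (a question counts as a hit if any gold chunk appears in the top 5 retrieved chunks); the function names and chunk ids are hypothetical, and the scripts in the repo remain the authority on what was actually run.

```python
from typing import List, Sequence, Set, Tuple


def hit_at_k(ranked_chunk_ids: Sequence[str], gold_chunk_ids: Set[str], k: int = 5) -> bool:
    """True if any gold chunk appears among the top-k retrieved chunks."""
    return any(cid in gold_chunk_ids for cid in ranked_chunk_ids[:k])


def hit_rate(runs: List[Tuple[Sequence[str], Set[str]]], k: int = 5) -> float:
    """Fraction of questions with at least one gold chunk in the top k.

    `runs` holds one (ranked_chunk_ids, gold_chunk_ids) pair per question.
    """
    if not runs:
        raise ValueError("need at least one question")
    hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in runs)
    return hits / len(runs)


# Toy data with made-up chunk ids (not taken from the golden set).
toy_runs = [
    (["c3", "c9", "c1"], {"c1"}),                    # gold at rank 3: hit
    (["c4", "c5", "c6", "c7", "c8", "c2"], {"c2"}),  # gold at rank 6: miss at k=5
    (["c0"], {"c0", "c5"}),                          # gold at rank 1: hit
    (["c7", "c8"], {"c9"}),                          # gold never retrieved: miss
]

print(hit_rate(toy_runs, k=5))  # 0.5
```

With 32 questions, every cell in the table is some whole number of hits divided by 32, which matters for the caveats further down.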
Read the two structured parsers against each other:
- On BM25, DeepDoc wins (68.8% vs 62.5%)
- On dense, MinerU wins (71.9% vs 65.6%)
A crossover. And the single best cell in the entire table is MinerU × dense at 71.9% — so if you're building Japanese dense RAG, MinerU is the current pick. If you're on a lexical/BM25 system, DeepDoc.
This is the answer to "how do you choose a parser" that's actually useful: not "the strong one," but "match the parser to your retriever." On this test set, MinerU's chunks suit dense embedding better and DeepDoc's suit keyword matching better. Neither is universally better.
A confirmation worth noting: the era-name failure is an OCR-path problem, not a DeepDoc problem
In the first post, DeepDoc's OCR fallback corrupted Japanese era names (令→今) on form/scanned PDFs, while its native-text path was clean. MinerU showed zero era-name errors on these documents — because, like DeepDoc, it routes text-layer PDFs through direct extraction (PyMuPDF) rather than OCR.
That's the useful confirmation: the clean-vs-corrupt split is a property of the font path (text-layer extraction vs OCR fallback), not of any one parser. Both Chinese parsers handle embedded-font government PDFs cleanly. The era-name risk lives in the OCR fallback that scanned/form documents force — and that's where you'd need to test either tool carefully for a Japanese deployment.
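If you want a cheap first-pass check on your own parser output, the 令→今 corruption can be caught with a pattern match. This is a minimal sketch, not the check used in this series: it assumes the corruption shows up as 今和 directly followed by a year, and the sample sentences are invented for illustration.

```python
import re
from typing import List

# 令和 misread as 今和, followed by a year written as 元, ASCII digits,
# full-width digits, or kanji numerals. Assumption: a year follows the era name.
SUSPECT_ERA = re.compile(r"今和\s*(?:元|[0-9０-９一二三四五六七八九十]+)\s*年")


def find_era_corruption(text: str) -> List[str]:
    """Return every substring that looks like a corrupted 令和 date."""
    return [m.group(0) for m in SUSPECT_ERA.finditer(text)]


# Hypothetical sample text, not taken from the test documents.
clean = "令和6年4月1日から施行する。"
corrupt = "今和6年4月1日から施行する。今和元年度の実績。"

print(find_era_corruption(clean))    # []
print(find_era_corruption(corrupt))  # ['今和6年', '今和元年']
```

A pattern like this only catches this one known substitution. It says nothing about other OCR errors, so treat a clean result as "no 今和 dates found," not as "OCR output is correct."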
The caveats — three, and they're the price of trusting the numbers
1. The speed comparison is NOT apples-to-apples. MinerU ran on GPU (PyTorch cu128 on the RTX 5090); the DeepDoc numbers from the earlier phase were CPU. So while MinerU parsed the three documents in 34–46s each, I'm not reporting a clean "MinerU is N× faster than DeepDoc" — that would compare GPU against CPU and mislead. Aligning both on the same hardware is left for a follow-up. Take the speed dimension as unmeasured here, not as a MinerU win.
2. Three documents, 32 questions. The crossover is an observation on this set, not a settled result. At this sample size, a 6-point gap is exactly two questions, so it could move or flip with more documents.
3. The symmetry is suspicious. DeepDoc leads BM25 by 6.25 points and MinerU leads dense by 6.25 points: two questions each way, the exact same margin in both directions. That clean symmetry is more likely a quantization artifact of a 32-question set (each question ≈ 3.1 points) than a real law of nature. I'm reporting the direction of the crossover with confidence and the precise magnitude with none.
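The arithmetic behind caveats 2 and 3 can be sketched in a few lines. The percentages are copied from the table and the set size is 32; everything else is my own illustration. The Wilson interval is a rough stand-in: it treats each cell as an independent proportion, while a paired test on per-question results would be the better tool.

```python
import math
from typing import Tuple

N_QUESTIONS = 32  # size of the golden set, from the post

# hit@5 percentages exactly as reported in the 3x2 table
REPORTED = {
    ("plain", "bm25"): 56.2,
    ("plain", "dense"): 40.6,
    ("deepdoc", "bm25"): 68.8,
    ("deepdoc", "dense"): 65.6,
    ("mineru", "bm25"): 62.5,
    ("mineru", "dense"): 71.9,
}


def to_hits(pct: float, n: int = N_QUESTIONS) -> int:
    """Recover the whole-number hit count behind a rounded percentage."""
    return round(pct / 100 * n)


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion hits/n."""
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


HITS = {cell: to_hits(pct) for cell, pct in REPORTED.items()}

# Margins between the two structured parsers, counted in questions.
bm25_margin = HITS[("deepdoc", "bm25")] - HITS[("mineru", "bm25")]
dense_margin = HITS[("mineru", "dense")] - HITS[("deepdoc", "dense")]
print(bm25_margin, dense_margin)  # 2 2

# Interval width for the two dense cells.
for cell in [("deepdoc", "dense"), ("mineru", "dense")]:
    low, high = wilson_interval(HITS[cell], N_QUESTIONS)
    print(cell, f"{HITS[cell]}/{N_QUESTIONS}", f"{low:.2f} to {high:.2f}")
```

Both margins come out as two questions, and the two dense intervals overlap over most of their width. That is the quantitative reason to trust the direction only loosely and the magnitude not at all.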
The honest version: on this testbed, with a Japanese embedder, the direction is a crossover — DeepDoc better for lexical, MinerU better for dense. The exact numbers need a bigger set.
Where the series lands
Three posts, one arc:
- DeepDoc on Japanese: found a font-path-specific OCR failure (era names), and that layout parsing helps retrieval (+12.5 points on BM25).
- Dense vs BM25: that help doubles under dense retrieval, because dense punishes incoherent chunks harder. A statement about RAG architecture, not one tool.
- DeepDoc vs MinerU (this post): the best parser depends on your retriever, with a crossover between BM25 and dense; MinerU × dense is the strongest combination measured; the era-name failure is an OCR-path property shared across parsers.
The throughline isn't "Chinese parsers are good/bad." It's that parser choice is a constrained decision — your retriever, your document font path, your language — and the only way to get the answer is to measure your own stack, traced through to retrieval, on a test set hard enough to actually separate the options.
That's the part most tooling comparisons skip, and it's the part that's worth doing.
Raw 3×2 numbers, the parser-comparison breakdown, reproducible scripts:
https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/deepdoc-eval-v2
Companion: eval-sanity (the sanity gate that confirmed the metric before each delta was trusted).
