Does a Chinese document parser actually work on Japanese PDFs? I measured it — and the answer is 'it depends on the font path'

#rag #llm #japan #machinelearning

Part 1 of a series measuring Chinese open-source AI tooling on Japanese documents.
Repo + raw results: https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/deepdoc-eval-v1
Every number below is from a live run on an RTX 5090 / RTX-class workstation. Sample sizes are small and stated explicitly — treat this as an initial signal, not a verdict.

RAGFlow's DeepDoc is one of the better-known open-source document parsers to come out of the Chinese AI ecosystem. It does OCR, table-structure recognition (TSR), and document-layout recognition (DLR), and it's RAGFlow's default PDF parser. The English and Japanese dev communities mostly haven't measured it on Japanese documents — which is exactly the gap I sit in: I can read the Chinese tooling, and I can test it on the Japanese enterprise document types that actually matter here.

So I ran it. The interesting part isn't a thumbs-up or thumbs-down. It's that the answer splits cleanly by which internal path a given PDF takes — and one of those paths systematically corrupts Japanese era names.

Here's the honest version, limits attached.

The trap I almost fell into

My first-pass observation was alarming and simple: DeepDoc was misreading 令 as 今. In Japanese that's not a random glyph error — 令 is the first character of 令和 (Reiwa), the current imperial era. A parser that turns 令 into 今 corrupts the date on every government report, invoice, and contract that uses the era-name calendar. That's a potential dealbreaker for Japanese enterprise documents, where dates carry legal weight.

If I'd stopped there, I'd have published "Chinese parser breaks Japanese era names" — and I'd have been wrong, or at least sloppy. Because when I quantified it, the error wasn't a property of DeepDoc on Japanese text. It was a property of one code path.

The actual finding: it's a font-path problem, not a language problem

DeepDoc (via the PdfParser API) routes text two ways:

Embedded-font PDFs — most government reports, anything exported from Word/LaTeX — go through native text extraction (pdfplumber under the hood). On these, the 令→今 error rate was 0%. The text is read, not recognized.
Form-font / scanned PDFs — where there's no extractable text layer — fall back to the OCR path. On these, the era-name corruption ran to roughly 100% on the affected pages.

So the precise claim is: DeepDoc's OCR fallback systematically misreads 令 as 今 on Japanese form-font and scanned pages; its native-text path does not. "It breaks Japanese" was too broad. "Its OCR path breaks era names, and lots of real enterprise documents hit the OCR path" is the true, and more useful, statement.
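One way to put a number on that claim is to count era-plus-year strings in a trusted reference text and in the parser output. The sketch below is not the script behind the figures above. The function name `era_corruption_rate` and the regex are illustrative assumptions, and the pattern only covers Arabic-numeral years and 元年, so kanji-numeral years such as 令和六年 are not counted.

```python
import re
from typing import Optional

# Era name followed by a year, e.g. "令和6年" or "令和元年".
# The leading character class also catches the misread form "今和6年".
# Requiring a year after "和" avoids false hits on text like "昨今和食が".
ERA_YEAR = re.compile(r"([令今])和\s*(?:元|\d+)\s*年")


def era_corruption_rate(reference_text: str, parsed_text: str) -> Optional[float]:
    """Share of reference era dates that appear as 今和 in the parse.

    Returns None when the reference contains no era date at all.
    """
    expected = sum(
        1 for m in ERA_YEAR.finditer(reference_text) if m.group(1) == "令"
    )
    if expected == 0:
        return None
    corrupted = sum(
        1 for m in ERA_YEAR.finditer(parsed_text) if m.group(1) == "今"
    )
    # Cap at 1.0 in case the parse duplicates a line.
    return min(corrupted, expected) / expected
```

Running it per page, rather than per document, is what separates the native-text pages from the OCR pages.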

This distinction matters operationally. If your corpus is clean digital government PDFs, this specific failure won't touch you. If your corpus is scanned invoices and tax forms — which is a huge fraction of real Japanese back-office documents — you're on the path that fails.
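A rough way to find out which side of that line a corpus sits on is to check each page for an extractable text layer before ingestion. The sketch below is a hypothetical pre-check, not DeepDoc's own routing logic. The name `pages_without_text_layer` and the `min_chars` threshold are assumptions to tune on your own documents.

```python
from typing import Iterable, List, Optional


def pages_without_text_layer(
    page_texts: Iterable[Optional[str]], min_chars: int = 20
) -> List[int]:
    """Return 1-based page numbers whose extracted text is (nearly) empty.

    page_texts holds one string per page from a plain text extractor;
    None is treated the same as an empty string.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        visible = "".join((text or "").split())  # drop all whitespace
        if len(visible) < min_chars:
            flagged.append(number)
    return flagged
```

The per-page strings can come from any extractor. With pdfplumber, for instance, `[page.extract_text() for page in pdf.pages]` would produce them. Pages flagged here are the ones likely to need OCR, so they are the ones to spot-check for 令→今. A scanned PDF that already carries an embedded OCR text layer would pass this check, so it is a triage step rather than a guarantee.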

But does any of this reach the thing you actually care about — retrieval?

Parsing quality is a means. What an enterprise RAG system cares about is whether the right chunk comes back. So I didn't stop at "the parse looks good/bad." I measured the downstream delta: same documents, same questions, two ingestion pipelines.

Pipeline A (baseline): plain text extraction (pdfplumber) → chunk → retrieve
Pipeline B (DeepDoc): DeepDoc structured parse → chunk → retrieve

On a 20-question Japanese golden set built from the sample documents:

Pipeline	hit@5
A — plain text	75%
B — DeepDoc	90%

DeepDoc's layout understanding netted +15 percentage points of hit@5 (75% to 90%, which on 20 questions is 15 hits versus 18). The layout separation produces cleaner chunks, and on these documents that win outweighed the OCR errors. The net effect was positive.
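For readers who want the metric pinned down, here is a minimal sketch of hit@k as it is commonly defined: a question counts as a hit if at least one relevant id appears in its top k results. The function name and the dictionary layout are assumptions for illustration, and the harness in the repo may define relevance at a different granularity (chunk versus document).

```python
from typing import Dict, List, Set


def hit_at_k(
    ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 5
) -> float:
    """Fraction of questions with at least one relevant id in the top k.

    ranked:   question id -> retrieved ids, best first
    relevant: question id -> ids accepted as correct for that question
    """
    if not relevant:
        raise ValueError("golden set is empty")
    hits = 0
    for qid, gold in relevant.items():
        top_k = ranked.get(qid, [])[:k]  # a missing question counts as a miss
        if gold.intersection(top_k):
            hits += 1
    return hits / len(relevant)
```

Both pipelines are scored against the same `relevant` mapping; only the `ranked` lists differ.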

Now the limits, because they're load-bearing:

This is BM25 (lexical) retrieval. The dense-retrieval comparison is Phase 3, not done yet. Do not read "+15 points" as "DeepDoc improves retrieval in general" — read it as "on lexical retrieval, on these docs, layout-aware chunking helped."
The golden set has a 100% oracle ceiling. Every relevant doc is reachable in the top-5 for all 20 questions — meaning the set may be on the easy side, and the 75-vs-90 gap could be amplified or distorted by that. A harder set (oracle < 100%) is Phase 3.
20 questions, 5 documents. This is a signal, not a settled number.
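To make the small-sample caveat concrete, a Wilson score interval shows how wide the uncertainty is at n = 20. This is an illustrative addition, not something computed in the original run, and it treats each pipeline's score as an independent proportion even though both were measured on the same questions, so read it as a rough guide only.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is about 95%)."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 75% and 90% of 20 questions are 15 and 18 hits.
baseline_low, baseline_high = wilson_interval(15, 20)
deepdoc_low, deepdoc_high = wilson_interval(18, 20)
```

At this sample size the two intervals overlap (roughly 53% to 89% for the baseline and 70% to 97% for DeepDoc), which is the numerical form of "a signal, not a settled number."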

I'm reporting the +15 points with all three caveats firmly attached. The honest takeaway is directional: on these documents, layout-aware parsing helped lexical retrieval enough to matter, and the precise magnitude needs a harder test.

Where DeepDoc is genuinely weak: tables and forms

The flip side, and the part DeepDoc's marketing wouldn't lead with. TSR (table-structure recognition) is one of its headline features. On a Japanese tax form's table, exact-match cell accuracy was 30%: 6 of the 20 cells checked. That's low for a headline feature.
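Exact-match cell accuracy can be sketched as a cell-by-cell comparison between a hand-checked grid and the parser's grid. The version below is an assumption about how such a check might look, not the scoring code in the repo. In particular, the NFKC normalization (which folds full-width digits and letters to half-width) is a choice made here, and a stricter check would compare raw strings.

```python
import unicodedata
from typing import Sequence


def normalize_cell(text: str) -> str:
    """Fold full-width forms with NFKC, then drop all whitespace."""
    return "".join(unicodedata.normalize("NFKC", text).split())


def cell_exact_match(
    expected: Sequence[Sequence[str]], predicted: Sequence[Sequence[str]]
) -> float:
    """Share of expected cells reproduced exactly at the same row and column."""
    total = 0
    correct = 0
    for r, row in enumerate(expected):
        for c, gold in enumerate(row):
            total += 1
            try:
                guess = predicted[r][c]
            except IndexError:
                continue  # a missing row or cell counts as wrong
            if normalize_cell(guess) == normalize_cell(gold):
                correct += 1
    if total == 0:
        raise ValueError("expected table has no cells")
    return correct / total
```

Because cells are matched by position, a single merged or dropped column shifts every cell after it and drags the score down quickly, which is one reason structure errors hurt more than isolated character errors.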

And form PDFs were worse. On the e-Tax-style form sample, DeepDoc extracted essentially one chunk from the whole document — the structure collapsed.

Put together, the weak spot is specific and it's exactly the wrong one for this market: form-type documents such as invoices (請求書) and tax filings are the bulk of Japanese back-office paperwork, and that's where DeepDoc struggles most (both the OCR era-name corruption and the table/form collapse live here).

The answer, as a matrix instead of a verdict

"Should a Japanese company use DeepDoc?" has no yes/no answer. It has a font-path-and-doctype answer:

Document type	DeepDoc behavior	Evidence
Embedded-font PDF (gov reports, Word exports)	令→今 error 0%; +15 pts hit@5 from layout	native-text path
Form-font / scanned PDF	令→今 era-name corruption ~100%	OCR fallback path
Table-heavy documents	TSR exact-match only ~30%	headline feature underperforms
Form documents (e-Tax style)	near-total failure, ~1 chunk extracted	structure collapses

That matrix is the deliverable. Not "good" or "bad" — good here, broken there, and here's the line.

Why I bothered, and what's next

Most tooling reviews test on clean English PDFs and report a single score. The failure modes that actually bite enterprises live in the specifics: a particular font path, a particular document type, a particular language's calendar. You only see them if you run the real tool on the real document types in the real language — and then trace the error all the way to retrieval, where it either matters or doesn't.

Phase 3 (next in the series): dense-retrieval comparison, a harder golden set with oracle < 100% to pressure-test the +15 points, and, as the natural next question, the same Japanese documents through MinerU and PaddleOCR, the other major Chinese parsers, to see whether the font-path failure is DeepDoc-specific or ecosystem-wide.

Raw parses, the golden set, the OCR error counts, and the retrieval results are all in the repo:
https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/deepdoc-eval-v1

Companion tooling: eval-sanity (the sanity gate that confirmed the retrieval metric was trustworthy before I reported the delta) and eval-driven-llm (the eval harness this runs on).