Julia

OpenDataLoader PDF is the fastest non-VLM parser that preserves document structure (tables, headings, lists).

🚀 We found room to improve on latency, so we profiled. We initially expected the sorting algorithm (XY-Cut++) to be the bottleneck, but it turned out to account for less than **1%** of the total time. The real cost was hiding in content filtering (55%) and preprocessing (25%).
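
To see why the sort was never worth optimizing, the profile shares above can be plugged into Amdahl's law. This is a minimal sketch that uses only the percentages quoted in this post; the function and variable names are illustrative, and the 1% for sorting is taken as an upper bound.

```python
def max_speedup(fraction: float, local_speedup: float = float("inf")) -> float:
    """Amdahl's law: overall speedup when `fraction` of the runtime
    becomes `local_speedup` times faster (infinite = cost removed entirely)."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Shares of total runtime reported by the profile (sorting is "< 1%").
profile = {
    "sorting (XY-Cut++)": 0.01,
    "content filtering": 0.55,
    "preprocessing": 0.25,
}

for stage, share in profile.items():
    print(f"{stage}: at most {max_speedup(share):.2f}x overall")
```

Even a free sort would buy roughly 1% overall, while the ceiling for content filtering alone is a bit over 2x, which is why the fixes target filtering and preprocessing.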

Benchmarks
🖇️ 3 fixes applied
💥 Page-level parallel processing
💥 Hidden text detection → opt-in
💥 Text-only fast path
💢 Output is byte-for-byte identical before and after optimization. Only the speed changed; the results stay the same.
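
The three fixes can be illustrated with a small, self-contained sketch. It is not OpenDataLoader PDF's code or API: `Page`, `extract_page`, `full_layout`, and `extract_document` are hypothetical names, and the page model is a toy. It only shows the general pattern: an expensive check behind an opt-in flag, a shortcut for text-only pages, and per-page parallelism that keeps results in page order so the output matches a sequential run.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

PAGE_BREAK = "\f"


@dataclass
class Page:
    """Hypothetical stand-in for a parsed PDF page (not the library's model)."""
    number: int
    text_runs: list[str]
    hidden_runs: list[str] = field(default_factory=list)
    has_graphics: bool = False


def full_layout(page: Page, runs: list[str]) -> str:
    # Placeholder for the expensive structure analysis (tables, images, ...).
    lines = list(runs)
    if page.has_graphics:
        lines.append(f"[graphics on page {page.number}]")
    return "\n".join(lines)


def extract_page(page: Page, detect_hidden_text: bool = False) -> str:
    runs = page.text_runs
    if detect_hidden_text:
        # Opt-in: the costly hidden-text check runs only when requested.
        hidden = set(page.hidden_runs)
        runs = [r for r in runs if r not in hidden]
    if not page.has_graphics:
        # Text-only fast path: same result as full_layout for such pages.
        return "\n".join(runs)
    return full_layout(page, runs)


def extract_document(pages: list[Page], detect_hidden_text: bool = False,
                     workers: int = 4) -> str:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so joining them
        # reproduces the sequential output exactly.
        parts = pool.map(lambda p: extract_page(p, detect_hidden_text), pages)
        return PAGE_BREAK.join(parts)
```

In CPython, threads will not speed up CPU-bound parsing because of the GIL; the sketch is about ordering and identical output, not about measured gains.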

🖇️ OpenDataLoader PDF highlights
🚀 #1 in latency 🥇 (585 pages in 1.10s)
🗃️ #1 in memory efficiency 🥇 (7.4MB)
💢 Java · Python · Node.js SDK
💢 Multiple output formats (text, markdown, HTML, JSON, PDF)
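
For a sense of scale, the latency figure above can be converted into per-page numbers. This is plain arithmetic on the quoted 585 pages and 1.10 s; the variable names are illustrative.

```python
pages = 585
seconds = 1.10

pages_per_second = pages / seconds
ms_per_page = seconds / pages * 1000

print(f"{pages_per_second:.0f} pages/s, {ms_per_page:.2f} ms/page")
```

That comes to roughly 532 pages per second, or just under 2 ms per page.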

Check out the benchmark below for latency and memory usage results. See the PR for full details on what changed and how we got here. We'd love your feedback if you try it out!


GitHub: http://github.com/opendataloader-project/opendataloader-pdf?utm_source=x&utm_medium=social&utm_campaign=perf_update
Benchmark: http://github.com/opendataloader-project/opendataloader-bench?utm_source=x&utm_medium=social&utm_campaign=perf_update
PR: https://github.com/opendataloader-project/opendataloader-pdf/pull/362?utm_source=x&utm_medium=social&utm_campaign=perf_update
