The OpenDataLoader team has published the full benchmark results at http://opendataloader.org
Transparent methodology, 200 real-world PDFs, all scores reproducible.
OpenDataLoader PDF offers two modes!
⚙️ Rule-based mode
No AI model. Runs locally, no GPU required. 0.015s/page — the fastest in benchmarks.
🧠 Hybrid mode
Rule-based engine + AI model combined. Significant quality improvements in tables, reading order, and image recognition.
Hybrid mode results
📊 Overall: 0.907 (#1)
📖 Reading Order: 0.934 (#1)
📋 Table Extraction: 0.928 (#1)
⚡ Speed (rule-based mode): 0.015s/page (#1)
🏷️ Heading Detection: 0.821 (#2)
Key highlights
📋 Table extraction #1 (0.928) — 0.041 gap over 2nd place.
Table structure drives answer quality in RAG pipelines. This gap matters.
📖 Reading order #1 (0.934).
Multi-column layouts are extracted in the order humans actually read.
⚡ Speed and quality at the same time.
Rule-based mode for speed, hybrid mode for accuracy.
Choose based on your use case.
Compared against 12 parsers, including docling, marker, unstructured, mineru, and pymupdf4llm.
All results are per-document means: no cherry-picking, no synthetic data.
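For readers unfamiliar with the term, here is a minimal sketch of what a per-document mean looks like: each PDF contributes one score and every PDF counts equally. The function name `per_document_mean` and the example scores are hypothetical, made up for illustration. They are not taken from the benchmark repo or its results.

```python
from statistics import mean


def per_document_mean(scores_by_doc):
    """Average one score per document, so every PDF carries equal weight."""
    if not scores_by_doc:
        raise ValueError("no documents to aggregate")
    return mean(scores_by_doc.values())


# Hypothetical per-document scores, invented for illustration only.
example_scores = {"doc_a.pdf": 1.0, "doc_b.pdf": 0.5, "doc_c.pdf": 0.75}

overall = per_document_mean(example_scores)
print(overall)
```

Averaging per document means a long PDF with many tables does not outweigh a short one, which is one common reason to report scores this way.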
The benchmark repo is open.
Run it yourself, add your own parser.
🔗 Benchmark → https://opendataloader.org/?utm_source=x&utm_medium=social&utm_campaign=benchmark_release
📂 Methodology → https://github.com/opendataloader-project/opendataloader-bench?utm_source=x&utm_medium=social&utm_campaign=benchmark_release
