DEV Community

Julia

OpenDataLoader PDF: one tool and so many options!

TL;DR: OpenDataLoader PDF is the first open-source tool to auto-tag untagged PDFs into screen-reader-ready Tagged PDFs and the most performant open-source PDF parser for RAG pipelines. It offers many options because not all PDFs are the same. The heuristic engine processes 60+ pages per second on CPU with 0.91 reading order accuracy; hybrid AI mode raises that to 0.934 for complex documents. Outputs include JSON with bounding boxes for RAG pipelines and Markdown for human readability. Auto-tagging is free (Apache 2.0); full PDF/UA-1 and PDF/UA-2 export is an enterprise add-on. You choose what fits your documents, compliance needs, and infrastructure.

Core technical options & their meanings
OpenDataLoader PDF offers many choices, not to complicate things, but because different use cases and document types need different approaches. Here's what each option does and why it matters.

Output Format: JSON, Markdown, HTML, Annotated PDF, Text

When you run OpenDataLoader, you choose between these output formats.

JSON gives structured, machine-readable data. Every element (heading, paragraph, table, list, caption) is tagged with a semantic type and a bounding box, so users get exact coordinates for every piece of content. This is the foundation for RAG pipelines, because users can map extracted text back to its exact location on the page.
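To illustrate why bounding boxes matter for RAG, here is a minimal sketch that turns extracted elements into chunks that keep their page location. The field names (`type`, `page`, `bbox`, `text`) and the sample data are assumptions made for this example; check the actual JSON output for the real schema.

```python
import json

# Hypothetical sample; the real output schema may use different field names.
sample = json.loads("""
[
  {"type": "heading",   "page": 1, "bbox": [72, 700, 540, 730], "text": "Quarterly Report"},
  {"type": "paragraph", "page": 1, "bbox": [72, 600, 540, 690], "text": "Revenue grew 12%."},
  {"type": "table",     "page": 2, "bbox": [72, 300, 540, 500], "text": "Region | Revenue"}
]
""")


def to_chunks(elements):
    """Turn extracted elements into RAG chunks that keep their page location."""
    return [
        {
            "text": el["text"],
            "source": {"page": el["page"], "bbox": el["bbox"], "type": el["type"]},
        }
        for el in elements
        if el["text"].strip()  # skip elements with no text
    ]


chunks = to_chunks(sample)
for chunk in chunks:
    print(chunk["source"]["page"], chunk["source"]["bbox"], chunk["text"])
```

Storing the page and bounding box next to each chunk lets a retrieval result point back to the exact region of the source PDF.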

Markdown offers human-readable text. It's cleaner, simpler, and works well when you just need to read or preview the extracted content.

Advice: Choose JSON when you need precision and structure. Choose Markdown when you need readability.

HTML output transforms your PDF content into a styled, web-ready document. The structure is preserved: headings, paragraphs, lists, and tables are rendered with appropriate HTML tags and inline styling.

Annotated PDF output generates a visual overlay on the original document. Every detected element (heading, paragraph, table, list, image) is highlighted with a colored bounding box and labeled with its semantic type.

Annotated PDF lets users verify the extraction visually and instantly, without reading a single line of raw JSON.

Text output format strips away everything except the raw text content. No bounding boxes. No semantic types. No formatting. Just the extracted text in the correct reading order.

[Figure: Comparison of output formats]

Layout Analysis: The XY-Cut++ Algorithm

Reading order is one of the hardest problems in PDF extraction. A page may look perfect to a human, but a machine can easily mistake a multi-column layout for a table or mix up footnotes with body text.

OpenDataLoader solves this with the XY-Cut++ algorithm. It analyzes the page geometry, finds the gaps between columns and blocks, and recursively splits the page until every element is in the right order. The result is a logical reading order that mimics how a human would read the page.
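The recursive splitting idea can be sketched with the classic XY-cut, the simpler predecessor that XY-Cut++ builds on. This is not OpenDataLoader's implementation; the box format, the `min_gap` threshold, and the "columns first, then rows" preference are assumptions made for the example.

```python
from typing import List, Optional, Tuple

# (x0, y0, x1, y1) with the origin at the top-left and y growing downward.
Box = Tuple[float, float, float, float]


def _widest_gap(boxes: List[Box], axis: int, min_gap: float) -> Optional[float]:
    """Return the midpoint of the widest empty gap along an axis, or None."""
    intervals = sorted((b[axis], b[axis + 2]) for b in boxes)
    best_width, best_cut = 0.0, None
    reach = intervals[0][1]  # furthest coordinate covered so far
    for start, end in intervals[1:]:
        gap = start - reach
        if gap >= min_gap and gap > best_width:
            best_width, best_cut = gap, (reach + start) / 2
        reach = max(reach, end)
    return best_cut


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> List[Box]:
    """Order boxes by recursively splitting the region at whitespace gaps."""
    if len(boxes) <= 1:
        return list(boxes)
    for axis in (0, 1):  # try a column split (x) first, then a row split (y)
        cut = _widest_gap(boxes, axis, min_gap)
        if cut is not None:
            first = [b for b in boxes if b[axis + 2] <= cut]
            second = [b for b in boxes if b[axis + 2] > cut]
            return xy_cut(first, min_gap) + xy_cut(second, min_gap)
    # No gap on either axis: fall back to top-to-bottom, left-to-right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))


heading = (50, 40, 550, 70)
left_1, left_2 = (50, 100, 290, 300), (50, 320, 290, 500)
right_1, right_2 = (310, 100, 550, 250), (310, 270, 550, 500)

# Boxes arrive in arbitrary order, as they might in a PDF content stream.
ordered = xy_cut([right_2, left_1, heading, right_1, left_2])
```

On this two-column page with a full-width heading, the heading comes first, then the whole left column, then the right column. The classic version struggles with layouts the whitespace gaps do not cleanly separate, which is the kind of case improved variants aim to handle.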

This matters because incorrect reading order breaks information retrieval. If the RAG pipeline gets the order wrong, the answers it generates will be wrong too.

In OpenDataLoader this algorithm is enabled by default, and there is an option to disable it.

Processing engine: Heuristic vs. Hybrid

OpenDataLoader's default engine is heuristic: a fast, deterministic, rule-based system that runs entirely on CPU. It processes 60+ pages per second, requires no GPU, and is 100% local. No data ever leaves your machine.

The heuristic engine is ideal for most text-based PDFs. It's private, fast, and predictable.

For complex documents (scanned pages, borderless tables, mathematical formulas, charts), OpenDataLoader offers a hybrid AI mode. This routes difficult pages to a local AI backend that handles what the heuristic engine cannot.

The result: table accuracy jumps from 0.49 to 0.93, and reading order accuracy improves from 0.91 to 0.934.
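The routing idea can be pictured as a per-page triage step. The sketch below is purely illustrative: the `PageInfo` flags and the decision rule are hypothetical and only mirror the document types listed above, not the criteria OpenDataLoader actually applies.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PageInfo:
    """Hypothetical per-page features; not part of any real API."""
    number: int
    has_text_layer: bool  # False suggests a scanned page
    has_borderless_table: bool = False
    has_formula: bool = False
    has_chart: bool = False


def needs_ai_backend(page: PageInfo) -> bool:
    """Assumed rule: any 'difficult' feature sends the page to the AI backend."""
    return (
        not page.has_text_layer
        or page.has_borderless_table
        or page.has_formula
        or page.has_chart
    )


def route(pages: List[PageInfo]) -> Dict[str, List[int]]:
    """Split page numbers between the fast path and the hybrid path."""
    plan: Dict[str, List[int]] = {"heuristic": [], "hybrid": []}
    for page in pages:
        plan["hybrid" if needs_ai_backend(page) else "heuristic"].append(page.number)
    return plan


pages = [
    PageInfo(1, has_text_layer=True),
    PageInfo(2, has_text_layer=False),                            # scanned
    PageInfo(3, has_text_layer=True, has_borderless_table=True),  # hard table
]
plan = route(pages)
```

Keeping simple pages on the fast path is what lets a hybrid setup spend the slower AI processing only where it pays off.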

Users choose the engine based on their documents and their performance needs. The options are designed to balance speed (CPU-only, 60+ pages/sec), privacy (100% local), and accuracy (bounding boxes, correct reading order). You select the output format and rely on the engine's built-in layout and structure analysis, which makes OpenDataLoader a strong fit for high-throughput, local RAG pipelines.

Two algorithms for table detection: border and cluster

For table extraction in heuristic mode, OpenDataLoader provides two detection methods. By default, only the 'border' algorithm is used, which relies solely on table borders. Users can also enable a second algorithm, 'cluster', which groups content into clusters to identify tables, including tables without borders.
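To give a feel for how clustering can reveal a table that has no borders, here is a deliberately simple sketch that checks whether word positions line up in a grid. It is not the 'cluster' algorithm used by OpenDataLoader; the word format `(x_left, y_baseline, text)` and the thresholds are assumptions for the example.

```python
def cluster_positions(values, tolerance=5.0):
    """Group 1-D positions into clusters whose neighbours are within tolerance."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= tolerance:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def looks_like_table(words, min_rows=2, min_cols=2, tolerance=5.0):
    """words: list of (x_left, y_baseline, text). True if they line up in a grid."""
    row_clusters = cluster_positions([y for _, y, _ in words], tolerance)
    col_clusters = cluster_positions([x for x, _, _ in words], tolerance)
    # Count only left edges that recur at least min_rows times.
    aligned_cols = [c for c in col_clusters if len(c) >= min_rows]
    return len(row_clusters) >= min_rows and len(aligned_cols) >= min_cols


borderless_table = [
    (50, 100, "Region"), (200, 100, "Q1"), (300, 100, "Q2"),
    (50, 120, "North"),  (201, 120, "10"), (299, 120, "12"),
    (50, 140, "South"),  (200, 140, "8"),  (300, 140, "9"),
]
paragraph = [
    (50, 100, "Reading"), (90, 100, "order"), (140, 100, "is"), (200, 100, "hard"),
    (50, 115, "for"), (110, 115, "machines"), (150, 115, "too"),
]

table_found = looks_like_table(borderless_table)
paragraph_found = looks_like_table(paragraph)
```

A real detector would need to cope with coincidental alignment in long paragraphs, merged cells, and right-aligned numbers, so treat this only as a picture of the idea.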

Noise filtering in OpenDataLoader

PDFs are full of small text, invisible text, hidden layers, and text outside the page. If users pass all of this to their LLM, they pollute the context with irrelevant information.

OpenDataLoader automatically filters out small text, invisible text, hidden layers, and text outside the page. Only the main body content is extracted and passed to the user’s pipeline. Cleaner input means better outputs.

Filters are also customizable. By default, all of them are enabled, removing small text, invisible text, hidden layers, and text outside the page. However, the user can disable these filters in any combination.
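The "all on by default, disable in any combination" behaviour can be sketched as a set of independent switches. The `TextSpan` model, the flag names, and the 4 pt size threshold below are assumptions for illustration, not OpenDataLoader's actual options.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextSpan:
    """Hypothetical text span model used only for this example."""
    text: str
    font_size: float
    visible: bool
    on_hidden_layer: bool
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def filter_noise(
    spans: List[TextSpan],
    page_width: float,
    page_height: float,
    *,
    drop_small: bool = True,
    drop_invisible: bool = True,
    drop_hidden_layers: bool = True,
    drop_off_page: bool = True,
    min_font_size: float = 4.0,  # assumed threshold
) -> List[TextSpan]:
    """Keep only spans that pass every enabled filter."""
    kept = []
    for s in spans:
        x0, y0, x1, y1 = s.bbox
        off_page = x1 < 0 or y1 < 0 or x0 > page_width or y0 > page_height
        if drop_small and s.font_size < min_font_size:
            continue
        if drop_invisible and not s.visible:
            continue
        if drop_hidden_layers and s.on_hidden_layer:
            continue
        if drop_off_page and off_page:
            continue
        kept.append(s)
    return kept


spans = [
    TextSpan("Body text", 11.0, True, False, (72, 100, 540, 120)),
    TextSpan("tiny watermark", 1.0, True, False, (72, 780, 200, 782)),
    TextSpan("invisible prompt", 11.0, False, False, (72, 200, 540, 220)),
    TextSpan("draft layer note", 11.0, True, True, (72, 300, 540, 320)),
    TextSpan("off-page leftover", 11.0, True, False, (700, 100, 900, 120)),
]

clean = filter_noise(spans, page_width=612, page_height=792)
with_small = filter_noise(spans, 612, 792, drop_small=False)
```

With every filter on, only the body text survives; turning one switch off lets that category back through while the others still apply.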

Tagged PDF Support: using native structure

When a PDF is "Tagged", it already contains native structural information: headings, paragraphs, lists, reading order. This is often the case with accessible PDFs that comply with PDF/UA or WCAG standards.

OpenDataLoader can use the existing document structure instead of re-analyzing the layout. This is faster and more accurate, as it relies on the document's existing tags. We recommend using this option only if the PDF is properly tagged.
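Before relying on native structure, it can help to check whether a file claims to be tagged at all. In the PDF format, a Tagged PDF's document catalog carries a `/MarkInfo` dictionary with `/Marked` set to true, plus a `/StructTreeRoot` entry. The sketch below assumes the catalog has already been loaded into a plain dict-like object; how you obtain it depends on your PDF library.

```python
def is_tagged(catalog) -> bool:
    """Return True if a PDF document catalog declares itself as Tagged PDF.

    `catalog` is assumed to be a dict-like view of the document catalog
    with name keys such as "/MarkInfo" and "/StructTreeRoot".
    """
    mark_info = catalog.get("/MarkInfo") or {}
    return bool(mark_info.get("/Marked")) and "/StructTreeRoot" in catalog


tagged_catalog = {
    "/Type": "/Catalog",
    "/MarkInfo": {"/Marked": True},
    "/StructTreeRoot": {"/Type": "/StructTreeRoot"},
}
untagged_catalog = {"/Type": "/Catalog"}

print(is_tagged(tagged_catalog), is_tagged(untagged_catalog))
```

Passing this check only means the file declares tags; it says nothing about their quality, which is why the recommendation above applies to properly tagged PDFs.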

OpenDataLoader is one tool. Multiple workflows. You decide.

#hancom #opendataloader #pdf

Website: https://opendataloader.org/

GitHub: https://github.com/opendataloader-project/opendataloader-pdf
