Julia

OpenDataLoader: Safe, Open, High-Performance — PDF for AI

Why is PDF so important?

Why should we focus on PDF?
The Importance of PDF in the AX Era
So, why should you consider OpenDataLoader PDF?

Difficulties with text extraction which OpenDataLoader PDF solves
Installation: Getting started quickly with OpenDataLoader PDF

Why should we focus on PDF?

So, among the numerous electronic document formats, why should we focus on PDF?

The reason is that PDF is not only the global standard for digital documents, but also the starting point of the data ecosystem that AI learns from.

According to statistics from smallpdf.com, by 2025, there will be a massive 2.5 trillion PDF documents stored worldwide, and 290 billion new PDF documents will be created every year.

Furthermore, 98% of global companies have adopted PDF as their standard for distributed documents.

78% of digital contracts worldwide are in PDF format, 90% of official government documents are distributed in PDF format, and 88% of healthcare records are stored in PDF format.

This means PDF is the most extensive and reliable data source for AI training. This is why we at Hancom focus on PDF data.

The Importance of PDF in the AX Era

Why is PDF data so important in the AX era?

It’s because high-quality PDF datasets are a crucial data source for maximizing AI performance. Recently, Hugging Face released a massive PDF-based AI training dataset, ‘FinePDF’, comprising about 3 trillion tokens (3.65 TB) across 475 million documents in 1,733 languages. The research results based on this dataset clearly demonstrate how important PDF is as a data resource for AI training.

Two research findings are noteworthy.

First, PDF data contains “long, high-quality information” essential for AI training, making it much more valuable than simple web data.

The chart on the left compares sentence length and sentence count across various datasets. It shows that the FinePDF dataset, which is built from PDF documents, contains longer sentences on average, and more of them, than the other datasets.

Secondly, the most notable aspect is the performance indicator for LLM training.

As the chart on the right shows, when this high-quality PDF dataset was mixed with other datasets so that it made up 25% of the LLM training data, the resulting model performed best.

This is a significant finding: well-refined PDF documents are a key factor in the performance of AI models because of their quality, not just their quantity.

So, why should you consider OpenDataLoader PDF?
Today, both open-source and commercial markets offer a wide range of tools designed to extract text from PDF documents. Developers and ordinary users have many options to choose from.

This article explores what makes OpenDataLoader PDF stand out among other text extraction solutions. Unlike many alternatives, it combines data accuracy, security, and AI safety: it safely and accurately converts PDFs to JSON, Markdown, or HTML, which you can then feed into AI stacks such as LLMs, vector search, and RAG.

In the following sections, we’ll explain the key points of OpenDataLoader PDF, demonstrate its performance advantages, and walk you through the initial setup and basic steps to start using it effectively in your projects.

  • A PDF data extraction engine built on Hancom’s long-standing expertise in document processing. It is open source, available under the Mozilla Public License 2.0.

  • Hosted on GitHub and installable from PyPI.

  • Reconstructs document layout: headings, paragraphs, lists, and tables are restored from the original document, so the content is easier to chunk, index, and query.

  • Transforms unstructured PDFs into clean, structured data, making it a solid foundation for AI-driven workflows.

  • Provides AI safety: proactively identifies and neutralizes potentially malicious content before it can affect the integrity and security of downstream AI workflows.

  • The SDK is designed to solve real-world problems and foster developer innovation across a wide range of industries, including financial services, legal and compliance, research and academia, and enterprise document automation.
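
To make the "easier to chunk, index, and query" point concrete, here is a minimal sketch of how Markdown output could be split into heading-based chunks before indexing. It is not part of OpenDataLoader PDF: the function name `chunk_markdown_by_heading` and the chunk shape (`title`, `text`) are assumptions made for illustration, and the sketch does not special-case `#` lines inside code blocks.

```python
import re
from typing import Dict, List

# Matches Markdown ATX headings such as "# Title" or "### Subsection".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def _append_chunk(chunks: List[Dict[str, str]], title: str, body: List[str]) -> None:
    """Store a chunk unless it has neither a title nor any text."""
    text = "\n".join(body).strip()
    if title or text:
        chunks.append({"title": title, "text": text})


def chunk_markdown_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per heading section (hypothetical helper)."""
    chunks: List[Dict[str, str]] = []
    title = ""  # text before the first heading gets an empty title
    body: List[str] = []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # A new heading closes the previous section.
            _append_chunk(chunks, title, body)
            title = match.group(2).strip()
            body = []
        else:
            body.append(line)
    _append_chunk(chunks, title, body)  # close the final section
    return chunks
```

Each chunk can then be embedded and stored with its heading as metadata, which tends to give retrieval results that are easier to trace back to the source document.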

Difficulties with text extraction which OpenDataLoader PDF solves:
Text extraction from PDF files can be challenging due to the way PDFs are structured. Unlike plain text files, PDFs are primarily designed for visual presentation, not for storing text in a logical or linear order. The following issues commonly occur during text extraction:

Lack of Text Order:
PDFs often store text as individual positioned elements rather than continuous lines or paragraphs. As a result, extracted text may appear out of sequence or fragmented.
Missing or Incorrect Spacing:
Since spaces are not always explicitly stored in a PDF, extraction tools must infer them based on character positioning, which can lead to missing or excessive spaces.
Encoding Problems:
Some PDFs use custom or non-standard fonts where character codes do not directly map to Unicode values. This can produce garbled or unreadable text during extraction.
Scanned Documents (Image-based PDFs):
PDFs created from scans contain only images, not actual text. Optical Character Recognition (OCR) is required to extract readable text, and accuracy depends on image quality and OCR performance.
Complex Layouts:
Multi-column formats, tables, or embedded graphics can confuse extraction algorithms, resulting in incorrect text flow or mixed content.
Embedded or Encrypted Content:
Some PDFs restrict copying or contain encrypted text streams, preventing extraction tools from accessing the data directly.
Hidden or Layered Text:
PDFs may include invisible layers, annotations, or overlapping text objects that interfere with accurate extraction.

OpenDataLoader PDF addresses these issues by providing access to layout, positioning, and font information, allowing developers to reconstruct the text structure more accurately.
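
To illustrate why position information matters, the following sketch rebuilds text lines from positioned fragments: it groups fragments by baseline, sorts them left to right, and inserts a space wherever the horizontal gap is wide enough. This is a generic single-column heuristic, not OpenDataLoader PDF's actual algorithm; the `Glyph` structure and the `line_tol` and `space_gap` thresholds are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """One positioned text fragment (hypothetical structure for this sketch)."""
    text: str
    x: float      # left edge
    y: float      # baseline; larger values are lower on the page here
    width: float  # horizontal extent of the fragment


def rebuild_lines(glyphs: List[Glyph], line_tol: float = 2.0,
                  space_gap: float = 1.5) -> List[str]:
    """Group fragments into lines by baseline, then add spaces at wide gaps."""
    lines: List[List[Glyph]] = []
    for glyph in sorted(glyphs, key=lambda g: (g.y, g.x)):
        # Fragments whose baselines are within line_tol share a line.
        if lines and abs(lines[-1][0].y - glyph.y) <= line_tol:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])

    result: List[str] = []
    for line in lines:
        line.sort(key=lambda g: g.x)  # left-to-right reading order
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > space_gap:
                parts.append(" ")  # spaces are inferred, not stored in the PDF
            parts.append(cur.text)
        result.append("".join(parts))
    return result
```

Multi-column pages, tables, and rotated text need more than this, which is where layout reconstruction comes in.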

The architecture of OpenDataLoader PDF is a streamlined workflow designed for clarity and efficiency. The diagram below illustrates how we process documents from a raw PDF to a structured JSON output, with a focus on our core capabilities.

Installation: Getting started quickly with OpenDataLoader PDF
Try OpenDataLoader PDF with a single command, with no complex installation process.

Install
pip install -U opendataloader-pdf

Convert PDFs from Python

import opendataloader_pdf

# input_path may mix individual files and directories.
opendataloader_pdf.convert(
    input_path=["path/to/document.pdf", "path/to/folder"],
    output_dir="path/to/output",
    format=["json", "html", "pdf", "markdown"],
)
input_path can mix file and directory paths.
output_dir defaults to the input folder when omitted.
format accepts any subset of json, text, html, pdf, markdown, markdown-with-html, markdown-with-images.

CLI usage
Use the same installation to drive conversions from the terminal:

opendataloader-pdf path/to/document.pdf \
  -o path/to/output \
  -f json html pdf markdown

Add flags like --content-safety-off hidden-text or --keep-line-breaks as needed. See the CLI reference for every option.
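
If you drive the CLI from a script, it can help to build the argument list in one place. The helper below is a small sketch that assembles a command using only the flags shown above; the function name `build_command` and its parameters are hypothetical, and the exact argument handling should be checked against the CLI reference.

```python
from typing import List, Optional, Sequence


def build_command(inputs: Sequence[str], output_dir: str, formats: Sequence[str],
                  keep_line_breaks: bool = False,
                  content_safety_off: Optional[str] = None) -> List[str]:
    """Assemble an opendataloader-pdf command line (hypothetical helper)."""
    cmd = ["opendataloader-pdf", *inputs, "-o", output_dir, "-f", *formats]
    if keep_line_breaks:
        cmd.append("--keep-line-breaks")
    if content_safety_off:
        # For example "hidden-text", as in the flag shown above.
        cmd += ["--content-safety-off", content_safety_off]
    return cmd


# To run it: subprocess.run(build_command(...), check=True)
```

Passing the list to subprocess.run avoids shell quoting problems with paths that contain spaces.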

Start with Node.js
The TypeScript package mirrors the Python API and exposes both a programmatic helper and a CLI:

npx @opendataloader/pdf

Verify Java once before installing:

java -version

Convert from TypeScript

import { convert } from "@opendataloader/pdf";
async function main() { 
  await convert(["path/to/document.pdf", "path/to/folder"], {
    outputDir: "path/to/output",  
    format: ["json", "html", "pdf", "markdown"],
  });
}

main().catch((error) => { 
  console.error("Error processing PDF:", error);
});

CLI usage

npx @opendataloader/pdf path/to/document.pdf path/to/folder \
  -o path/to/output \
  -f json html pdf markdown

or install globally for repeated use:

npm install -g @opendataloader/pdf
opendataloader-pdf path/to/document.pdf -o path/to/output

Java
For example templates, including Gradle and Maven, please refer to our examples on GitHub.

Convert a PDF to JSON

This example demonstrates how to convert a PDF document into structured JSON format, providing a clean, hierarchical representation of text blocks, tables, and other key elements.
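
Once you have the JSON output, a quick way to get a feel for its structure is to walk the tree and tally element kinds. The sketch below is schema-agnostic: it recurses through any nested JSON and counts the string values stored under a given key. The default key name `type` and the sample field names are assumptions for illustration, so check the actual output for the real field names.

```python
from collections import Counter
from typing import Any


def count_values(node: Any, key: str = "type") -> Counter:
    """Recursively count string values stored under `key` in nested JSON."""
    counts: Counter = Counter()
    if isinstance(node, dict):
        value = node.get(key)
        if isinstance(value, str):
            counts[value] += 1
        for child in node.values():
            counts.update(count_values(child, key))
    elif isinstance(node, list):
        for child in node:
            counts.update(count_values(child, key))
    return counts


# Hypothetical sample; real output may use different field names.
sample = {
    "type": "document",
    "kids": [
        {"type": "heading", "content": "Title"},
        {"type": "paragraph", "content": "Some text."},
        {"type": "table", "rows": [{"type": "paragraph", "content": "Cell"}]},
    ],
}
summary = count_values(sample)
```

For a real file, load it with json.load and pass the result to count_values.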

PDF to JSON Sample

PDF to HTML

If you’re looking for an efficient and user-friendly solution for document data extraction, OpenDataLoader PDF stands out as one of the most accessible and powerful open-source tools available.

Contact Us

Your interest and feedback are invaluable to us. Please explore our code, go over open issues, and become a part of our growing community.

Website: opendataloader.org

GitHub: https://github.com/opendataloader-project/opendataloader-pdf

E-mail: open.dataloader@hancom.com
