Njeri Kimaru

Using the Docling CLI to parse PDFs and export them to multiple formats

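What is Docling?

Docling is an open-source document processing library that converts various document formats into structured outputs.
In a RAG (retrieval-augmented generation) pipeline, Docling typically handles the ingestion step: it turns raw documents into structured text that can then be chunked and indexed.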

I'll be taking you through the process of parsing PDFs into structured formats.

Step 1: Set up

  • Create the project structure in your terminal:
mkdir docling_cli
cd docling_cli
  • Create your virtual environment and activate it. [Screenshots: Fedora, Windows]
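
A minimal sketch of this step, assuming Python 3 with the built-in venv module and a virtual environment directory named .venv (the name is my choice, not something the post specifies):

```shell
# Create a virtual environment in .venv (hypothetical directory name).
python3 -m venv .venv

# Activate it (Fedora and other Linux shells).
. .venv/bin/activate

# On Windows the equivalents would be:
#   python -m venv .venv
#   .venv\Scripts\activate        (Command Prompt)
#   .venv\Scripts\Activate.ps1    (PowerShell)
```

Once the environment is active, anything installed with pip stays inside .venv instead of going into the system Python.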

Step 2: Installing Docling

pip install docling
docling --version

Check Docling's version to confirm the installation. [Screenshots: Fedora, Windows]

Step 3: Creating input and output folders

  • Create a folder called data where you will store the PDFs you want to convert.
  • Create a new folder named outputs, then inside it create three folders called markdown_outputs, html_outputs and json_outputs.
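
The same folder layout can be created from the terminal. This is a small sketch that assumes you are inside docling_cli and that the output folders use underscore names, matching the html_outputs path used in Step 4:

```shell
# Input folder for the PDFs you want to convert.
mkdir -p data

# One output folder per format; -p also creates the parent outputs folder.
mkdir -p outputs/markdown_outputs outputs/html_outputs outputs/json_outputs
```

The -p flag makes the commands safe to re-run, since mkdir will not complain if a folder already exists.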

Step 4: Converting the PDFs to HTML

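Run this from the folder that holds your PDFs, because *.pdf only matches files in the current directory:
docling --to html *.pdf --output ~/Documents/docling_cli/outputs/html_outputs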

Step 5: Converting the PDFs to other formats

1. Markdown
2. JSON
3. Plain text
4. YAML
5. html_split_page
6. DocTags
7. VTT
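
The commands for these formats follow the same pattern as the HTML command in Step 4, with only the --to value changing. The sketch below rests on a few assumptions: it is run from the project root (docling_cli) with the PDFs in data/, the --to values shown (md, json, text, yaml, html_split_page, doctags) are the ones your installed version accepts (docling --help lists them), and the extra output folder names are my own hypothetical choices.

```shell
# One output folder per format (folder names are assumptions).
mkdir -p outputs/markdown_outputs outputs/json_outputs outputs/text_outputs \
         outputs/yaml_outputs outputs/html_split_page_outputs outputs/doctags_outputs

# Only run the conversions if the docling command is available.
if command -v docling >/dev/null 2>&1; then
  docling --to md              data/*.pdf --output outputs/markdown_outputs
  docling --to json            data/*.pdf --output outputs/json_outputs
  docling --to text            data/*.pdf --output outputs/text_outputs
  docling --to yaml            data/*.pdf --output outputs/yaml_outputs
  docling --to html_split_page data/*.pdf --output outputs/html_split_page_outputs
  docling --to doctags         data/*.pdf --output outputs/doctags_outputs
else
  echo "docling not found; activate the virtual environment first" >&2
fi
```

VTT should work the same way if your version lists vtt among the --to values.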

Step 6: Analyzing the results

I used three types of PDFs: one with tables, one with text and images, and one with tables and paragraphs. Here are my key findings:

1. PDF with tables

  • In HTML, the rows and columns came out cleaner than in the original PDF.
  • Markdown output was good too: the tables were written in Markdown table syntax without losing any content.
  • JSON broke everything down into nested objects.
  • Plain text was good too, but not as good as Markdown.
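
To look at the nested structure of a JSON output yourself, a small helper like the one below can list the top-level keys of a file. This is only a sketch: json_keys is a hypothetical helper name (not part of Docling), it assumes python3 is on your PATH, and the example file name is a placeholder for one of your own outputs.

```shell
# Print the top-level keys of a JSON file, one per line, sorted.
# json_keys is a hypothetical helper, not a Docling command.
json_keys() {
  python3 -c 'import json, sys
with open(sys.argv[1], encoding="utf-8") as f:
    data = json.load(f)
for key in sorted(data):
    print(key)' "$1"
}

# Example (placeholder file name):
# json_keys outputs/json_outputs/report.json
```

Each key it prints is one branch of the nested objects, so this is a quick way to see how the document was split up before opening the full file.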

2. PDF with text and images

  • HTML lost the color of the images.

3. PDF with tables and paragraphs

  • Paragraphs came out cleanly as text in all formats.
