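What is Docling?
Docling is an open-source document processing library that converts various document formats, such as PDF, into structured outputs such as Markdown, HTML and JSON.
It is commonly used in the ingestion stage of a retrieval-augmented generation (RAG) pipeline, where documents have to be turned into clean, structured text before they can be chunked and indexed.
In this post, I'll take you through parsing PDFs into structured formats with the Docling command-line tool.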
Step 1: Set up
- Create the project structure in your terminal:
mkdir docling_cli
cd docling_cli
- Create your virtual environment and activate it.
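The commands for this step are not shown above, so here is a minimal sketch. It assumes Python 3 with the standard venv module is installed; the folder name .venv is a common convention and an assumption here, not something from the original steps.

```shell
# Create a virtual environment in a folder named .venv (the name is an assumption)
python3 -m venv .venv
# Activate it on Fedora or any other Linux shell
. .venv/bin/activate
# On Windows (Command Prompt) the equivalent would be: .venv\Scripts\activate
```

Once the environment is active, anything installed with pip stays inside the project folder instead of the system Python.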
Step 2: Installing Docling
- Install Docling with pip:
pip install docling
- Check the installed Docling version:
docling --version
The same two commands apply on both Fedora and Windows.
Step 3: Creating input and output folders
- Create a folder called data where you will store the PDFs you want to convert.
- Create a folder called outputs, then inside it create three folders: markdown_outputs, html_outputs and json_outputs.
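The folder layout from this step can also be created in one command. This is a small sketch that assumes you are inside the docling_cli project folder and that the folder names use underscores, matching the html_outputs path used in Step 4.

```shell
# Create the input folder and the three output folders in one go.
# -p creates the parent folder (outputs) and does not fail if a folder already exists.
mkdir -p data outputs/markdown_outputs outputs/html_outputs outputs/json_outputs
```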
Step 4: Converting the PDFs into HTML format
docling --to html *.pdf --output ~/Documents/docling_cli/outputs/html_outputs
Step 5: Converting the PDFs into other formats
The same command works for the other formats; only the --to value and the output folder change.
1. Markdown
2. JSON
3. Plain text
4. YAML
5. html_split_page
6. DocTags
7. VTT
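Rather than typing one command per format, the conversions can be scripted. The sketch below repeats the Step 4 command for each format in the list. Several things here are assumptions rather than part of the original steps: the PDFs are read from ./data, every format gets its own folder named after it under outputs, the --to values are my best understanding of the CLI and should be confirmed with `docling --help`, and DRY_RUN is a hypothetical switch that only prints the commands by default.

```shell
# Sketch: run the Step 4 command once per output format.
# Assumptions: PDFs are in ./data, each format gets its own <name>_outputs
# folder, and DRY_RUN is a hypothetical switch (default 1) that prints the
# commands instead of running them.
# The --to values are assumed; confirm them with `docling --help`.
DRY_RUN="${DRY_RUN:-1}"

# Each entry is <--to value>:<folder name prefix>
for pair in md:markdown json:json text:text yaml:yaml \
            html_split_page:html_split_page doctags:doctags vtt:vtt; do
  fmt="${pair%%:*}"    # part before the colon
  name="${pair#*:}"    # part after the colon
  out_dir="outputs/${name}_outputs"
  mkdir -p "$out_dir"
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command so it can be reviewed first
    echo "docling --to $fmt data/*.pdf --output $out_dir"
  else
    docling --to "$fmt" data/*.pdf --output "$out_dir"
  fi
done
```

Run it once as is to review the commands, then again with DRY_RUN=0 to perform the conversions.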
Step 6: Analyzing the results
I used three types of PDFs:
one with tables, one with text and images, and one with tables and paragraphs. Here are my key findings:
1. PDF with tables
- In HTML, the rows and columns came out cleaner than they were in the original PDF.
- The Markdown output was good too, as it wrote the tables in Markdown table format without losing anything.
- JSON broke everything down into nested objects.
- Plain text was good too, but not as good as Markdown.
2. PDF with text and images
- The HTML output lost the color of the images.
3. PDF with tables and paragraphs
- Paragraphs came out cleanly as text in all formats.













