Multiple document conversion using Docling and a GUI

#docling #llm #rag #genai

Introduction

In a previous post I described my very first hands-on steps with Docling (My first hands-on experience with Docling). In that first step I used the Tkinter framework to add a GUI so I could choose a file and convert it to various formats using Docling.

Natively, Docling provides batch conversion from the command line, as in the following examples:

# Convert a single file to Markdown (default)
docling myfile.pdf

# Convert a single file to Markdown and JSON, without OCR
docling myfile.pdf --to json --to md --no-ocr

# Convert PDF files in input directory to Markdown (default)
docling ./input/dir --from pdf

# Convert PDF and Word files in input directory to Markdown and JSON
docling ./input/dir --from pdf --from docx --to md --to json --output ./scratch

# Convert all supported files in input directory to Markdown, but abort on first error
docling ./input/dir --output ./scratch --abort-on-error
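
The flags above compose predictably, so the same invocations can also be assembled from Python, for example before handing them to `subprocess.run`. The sketch below is only an illustration: `build_docling_command` is a hypothetical helper, not part of Docling, and it relies solely on the flags shown in the examples above.

```python
from typing import List, Optional, Sequence


def build_docling_command(
    source: str,
    from_formats: Sequence[str] = (),
    to_formats: Sequence[str] = (),
    output: Optional[str] = None,
    ocr: bool = True,
    abort_on_error: bool = False,
) -> List[str]:
    """Build an argv list for the `docling` CLI (hypothetical helper)."""
    cmd = ["docling", source]
    # Each input or output format gets its own repeated flag, as in the examples.
    for fmt in from_formats:
        cmd += ["--from", fmt]
    for fmt in to_formats:
        cmd += ["--to", fmt]
    if output is not None:
        cmd += ["--output", output]
    if not ocr:
        cmd.append("--no-ocr")
    if abort_on_error:
        cmd.append("--abort-on-error")
    return cmd
```

Calling `build_docling_command("./input/dir", from_formats=["pdf", "docx"], to_formats=["md", "json"], output="./scratch")` yields the argument list for the fourth example above.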

I wanted to do almost the same thing with a GUI. So this is my second attempt at building my own tooling with Docling, this time to convert multiple documents of different types in one go.

Of course, I am still using Tkinter 🤭

Implementation

I started from my previous code and changed the file selection dialog box so that it accepts multiple files.

# open-file dialog
root = tk.Tk()

filenames = tk.filedialog.askopenfilenames(
    title='Select files (pdf, pptx, docx, md, img)..',
    filetypes=filetypes,
)    
if filenames:
    print("Selected files:")
    for filename in filenames:
        print(filename)
else:
    print("No files selected.") 
    quit()


root.destroy()
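
`askopenfilenames` returns the chosen paths as a sequence of strings, and depending on the platform the dialog's filter may not prevent a user from picking a file of another type. A minimal sketch of a pre-check, assuming the extension list mirrors the `filetypes` tuple used in this post; `split_supported` and `SUPPORTED_SUFFIXES` are illustrative names, not part of tkinter or Docling.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

# Assumption: mirrors the extensions offered in this post's file dialog.
SUPPORTED_SUFFIXES = {
    ".pdf", ".docx", ".pptx", ".html", ".md",
    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff",
}


def split_supported(filenames: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split paths into (supported, skipped) using a case-insensitive suffix check."""
    supported: List[str] = []
    skipped: List[str] = []
    for name in filenames:
        if Path(name).suffix.lower() in SUPPORTED_SUFFIXES:
            supported.append(name)
        else:
            skipped.append(name)
    return supported, skipped
```

The `supported` list could then be passed to the conversion loop, while `skipped` is printed so the user knows which selections were ignored.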

So the code becomes the following.

import json
import logging
import time
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Only needed if the commented-out format options further down are enabled.
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.document_converter import WordFormatOption
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline


## GUI for file selection with tkinter
import tkinter as tk
from tkinter import filedialog
## filetypes for the tkinter dialog box
filetypes = (
    ('PDF files', '*.PDF'),
    ('Word file', '*.DOCX'),
    ('Powerpoint file', '*.PPTX'),
    ('HTML file', '*.HTML'),
    ('IMAGE file', '*.PNG'),
    ('IMAGE file', '*.JPG'),
    ('IMAGE file', '*.JPEG'),
    ('IMAGE file', '*.GIF'),
    ('IMAGE file', '*.BMP'),
    ('IMAGE file', '*.TIFF'),
    ('MD file', '*.MD'),
)

# open-file dialog
root = tk.Tk()

filenames = tk.filedialog.askopenfilenames(
    title='Select files (pdf, pptx, docx, md, img)..',
    filetypes=filetypes,
)    
if filenames:
    print("Selected files:")
    for filename in filenames:
        print(filename)
else:
    print("No files selected.") 
    quit()


root.destroy()

_log = logging.getLogger(__name__)


def main():
    # Configure logging before converting, so the INFO messages below are shown.
    logging.basicConfig(level=logging.INFO)

    # Docling Parse with EasyOCR
    # ----------------------
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True

    doc_converter = DocumentConverter(  # all of the below is optional, has internal defaults.
        allowed_formats=[
            InputFormat.PDF,
            InputFormat.IMAGE,
            InputFormat.DOCX,
            InputFormat.HTML,
            InputFormat.PPTX,
            InputFormat.ASCIIDOC,
            InputFormat.MD,
        ],  # whitelist formats, non-matching files are ignored.
        format_options={
            #InputFormat.PDF: PdfFormatOption(
            #    pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend
            #),
            #InputFormat.DOCX: WordFormatOption(
            #    pipeline_cls=SimplePipeline  # , backend=MsWordDocumentBackend
            #),
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
        },
    )

    # All exports go to the same directory, so create it once.
    output_dir = Path("scratch")
    output_dir.mkdir(parents=True, exist_ok=True)

    # `filenames` is the selection returned by the file dialog above.
    for filename in filenames:
        start_time = time.time()
        conv_result = doc_converter.convert(filename)
        elapsed = time.time() - start_time

        _log.info(f"Document converted in {elapsed:.2f} seconds.")

        ## Export results
        doc_filename = conv_result.input.file.stem

        # Export Deep Search document JSON format:
        with (output_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
            fp.write(json.dumps(conv_result.document.export_to_dict()))

        # Export Text format:
        with (output_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
            fp.write(conv_result.document.export_to_text())

        # Export Markdown format:
        with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
            fp.write(conv_result.document.export_to_markdown())

        # Export Document Tags format:
        with (output_dir / f"{doc_filename}.doctags").open("w", encoding="utf-8") as fp:
            fp.write(conv_result.document.export_to_document_tokens())


if __name__ == "__main__":
    main()
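
The loop in this script stops at the first file that fails to convert, which is roughly what `--abort-on-error` does on the command line. To get the CLI's default "keep going" behaviour in the GUI version, the loop could be wrapped as in the sketch below. This is an assumption about how one might structure it, not the author's code: `convert_batch` is a hypothetical helper, and the `convert` argument stands for any callable that takes a path, such as `doc_converter.convert`.

```python
import logging
import time
from typing import Callable, Dict, Iterable, List, Tuple

_log = logging.getLogger(__name__)


def convert_batch(
    filenames: Iterable[str],
    convert: Callable[[str], object],
    abort_on_error: bool = False,
) -> Tuple[Dict[str, object], List[Tuple[str, Exception]]]:
    """Convert each file, collecting results and failures separately."""
    results: Dict[str, object] = {}
    failures: List[Tuple[str, Exception]] = []
    for filename in filenames:
        start = time.time()
        try:
            results[filename] = convert(filename)
        except Exception as exc:  # deliberately broad: any failure skips this file
            if abort_on_error:
                raise
            _log.warning("Skipping %s: %s", filename, exc)
            failures.append((filename, exc))
            continue
        _log.info("%s converted in %.2f seconds.", filename, time.time() - start)
    return results, failures
```

With the script above, usage would look like `results, failures = convert_batch(filenames, doc_converter.convert)`, followed by the export step for each entry in `results`.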

The GUI:

The terminal output:

2024-12-19 13:22:50.682 Python[62628:4807197] +[IMKClient subclass]: chose IMKClient_Modern
2024-12-19 13:22:51.214 Python[62628:4807197] The class 'NSOpenPanel' overrides the method identifier.  This method is implemented by class 'NSWindow'
Selected files:
/Users/alainairom/Docling_test/file-selection.png
/Users/alainairom/Docling_test/file-selection2.png
/Users/alainairom/Docling_test/file-selection3.png
/Users/alainairom/Docling_test/mobicheckin_server_event_guest_category_66968af1fc394000725041be_badge_template_66a37cdae369f921e572e4fa_1732039981_NZYRUCY.pdf
/Users/alainairom/Docling_test/scratch.png
/Users/alainairom/Docling_test/Screenshot at Dec 02 08-11-28.png

And the actual converted docs…
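
Because the outputs are named after the input file's stem, two inputs that share a stem (say `report.pdf` and `report.docx`) would write to the same files in `scratch`. A minimal sketch of one way around this, assuming it is acceptable to fold the original extension into the output name; `output_stem` is a hypothetical helper, not something Docling provides.

```python
from pathlib import Path


def output_stem(input_path: str) -> str:
    """Return a stem that keeps the source extension, e.g. 'report_pdf'."""
    path = Path(input_path)
    # Lower-case the extension and drop the leading dot before appending it.
    suffix = path.suffix.lower().lstrip(".")
    return f"{path.stem}_{suffix}" if suffix else path.stem
```

In the export step, `output_stem(filename)` could replace `conv_result.input.file.stem` when building the output file names.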

Conclusion

This post shows how Docling can convert several documents of different types in a single run, with the files chosen through a simple Tkinter GUI.

I still need to test it against more document types, but for now it works!

I plan to experiment in more depth, so stay tuned, and thanks for reading 🤗

Useful links

Docling repository: https://github.com/DS4SD/docling
Docling documentation: https://ds4sd.github.io/docling/
