Docling simplifies document processing, parsing diverse formats and integrating smoothly with generative AI pipelines.
This guide summarizes a workflow for extracting high-fidelity text from documents that mix languages (Dutch, French, English), while working around hardware restrictions and avoiding linguistic bias.
1. Environment Setup & Hardware Constraints
When working in restricted Linux environments (like Fedora with user quotas), avoid full VLM downloads (e.g., IBM Granite) if you have less than 5GB of free home directory space.
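Before pulling any model weights, it is worth checking how much quota you actually have. A quick check (assuming a Linux shell and the default Hugging Face cache location):

```shell
# Free space in the home directory (VLM weights can easily exceed 5 GB)
df -h "$HOME"

# Size of the default Hugging Face model cache, if one exists yet
du -sh "$HOME/.cache/huggingface" 2>/dev/null || echo "no Hugging Face cache yet"
```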
For a more stable experience with diverse OCR engines, Windows 10/11 provides easier management of system-level language packs.
2. Solving the Python 3.13 Compatibility Gap
If you are using the latest Python 3.13, legacy Tesseract wrappers may hang or crash due to outdated C extensions.
To resolve this, use the RapidOCR ONNX engine instead of the Tesseract wrapper.
RapidOCR runs on the ONNX Runtime, which is platform-independent and works smoothly with Python 3.13. It provides the speed of modern AI with the stability newer environments require.
3. Step-by-Step Implementation Workflow
Install Tesseract with Multi-Language Support.
During the Windows installation, manually check the boxes for Dutch, French, and English in the additional language data components.
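Assuming `tesseract` ends up on your PATH, you can confirm the packs were installed:

```shell
# Confirm which language packs Tesseract can actually see.
# On a correct install, nld, fra and eng all appear in the list.
if command -v tesseract >/dev/null; then
    tesseract --list-langs
else
    echo "tesseract not on PATH"
fi
```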
Use Docling as your primary framework, but configure it to use the RapidOCR ONNX engine so that it doesn't crash on Python 3.13.
Bypass Language Tagging
Instead of injecting a single priority language code in your script (which biases recognition toward that language), rely on Tesseract's system-level language data so the engine can recognize the document's languages natively.
4. Export to Markdown
Direct your output to .md files. This preserves structural elements such as tables and headers better than plain .txt.
5. Using EasyOCR
One straightforward way of converting documents to Markdown is to run JPG images through EasyOCR. The code is simple, and the library supports over 80 languages. The only downside is speed: it can take a couple of minutes to extract text from images. This is because EasyOCR is built on PyTorch, which makes it highly accurate but slow, especially when you can't use a dedicated GPU.
Key Findings
Identify Tagging Differences
Different engines use different codes for the same language (e.g., Tesseract expects "nld" for Dutch while EasyOCR expects "nl"). If you pass language codes explicitly, make sure each code matches the engine you are using to avoid errors.
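One way to avoid mixing the codes up is a small lookup table. A hypothetical helper covering the three languages in this guide:

```python
# Engine-specific codes for the languages in this guide:
# Tesseract uses ISO 639-2 codes, EasyOCR uses two-letter codes.
LANG_CODES = {
    "dutch":   {"tesseract": "nld", "easyocr": "nl"},
    "french":  {"tesseract": "fra", "easyocr": "fr"},
    "english": {"tesseract": "eng", "easyocr": "en"},
}

def code_for(language: str, engine: str) -> str:
    """Return the engine-specific code; raises KeyError on an unknown pair."""
    return LANG_CODES[language.lower()][engine]

print(code_for("Dutch", "easyocr"))    # → nl
print(code_for("Dutch", "tesseract"))  # → nld
```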
Injecting too many language codes into a small model often leads to guessing. System-level integration is the most reliable way to maintain linguistic integrity when a single document contains multiple languages.
A lightweight setup like RapidOCR or EasyOCR proves far superior in accuracy, disk space, RAM usage, and processing time compared to a full VLM download.
Let me know if you have experimented with Docling before. Which OCR engine did you prefer? Comment below!