Alain Airom

Behind the scenes of Docling PDF Parsing

Docling PDF Parsing with “heron”

Introduction

While recently discussing PDF extraction capabilities with a customer, I revisited the Docling documentation and spotted a significant update I had missed: ‘📑 New layout model (Heron) by default, for faster PDF parsing.’ Although this update has been out for a few months, it prompted me to dig into what exactly ‘Heron’ is and why the team chose it as the new cornerstone of Docling’s document understanding.

Why “Heron” is the New Default

A closer look reveals that Heron isn’t just a minor patch; it’s a major architectural step forward for Docling. Developed by IBM Research, Heron is based on the RT-DETRv2 (Real-Time DEtection TRansformer, version 2) architecture.

The “Why” Behind the Choice:

  • Massive Accuracy Gains: Heron provides a 23.5% gain in mAP (mean Average Precision) over the old Docling baseline, so it is significantly better at telling a title, a paragraph, and a picture apart.
  • Speed (Real-Time Performance): Heron is optimized for speed. On an NVIDIA A100 GPU, the “Heron-101” variant can process a page in just 28 ms, which places it among the faster open-source layout models.
  • Superior Table & Multi-Column Handling: The primary reason Docling chose Heron is its ability to maintain the “logical reading order” in complex layouts. It avoids the “word salad” effect common in traditional PDF parsers by detecting visual bounding boxes first and then mapping the text onto them.
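
To make the reading-order point concrete, here is a minimal sketch of how detected bounding boxes can drive text order on a two-column page. It is a naive heuristic written for illustration, not Docling's actual algorithm; the `Region` class, the `reading_order` function, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A detected layout region; pixel coordinates, origin at the top-left."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def reading_order(regions, page_width):
    """Order regions column by column, top to bottom (naive two-column heuristic)."""
    mid = page_width / 2

    def column(r):
        # Regions spanning most of the page width (e.g. a title) are read first.
        # Simplification: this ignores full-width elements lower on the page.
        if (r.x1 - r.x0) > 0.6 * page_width:
            return 0
        # Otherwise assign the region to the half that contains its center.
        return 1 if (r.x0 + r.x1) / 2 < mid else 2

    return sorted(regions, key=lambda r: (column(r), r.y0))


# Hypothetical detections on a 620-pixel-wide, two-column page.
regions = [
    Region("Text", 320, 100, 580, 300),   # right column
    Region("Title", 40, 20, 580, 60),     # spans both columns
    Region("Text", 40, 100, 300, 300),    # left column, top
    Region("Table", 40, 320, 300, 500),   # left column, bottom
]

ordered = reading_order(regions, page_width=620)
for r in ordered:
    print(r.label, r.x0, r.y0)
```

A plain top-to-bottom sort on `y0` would place the right-column text between the two left-column regions, which is the kind of interleaving that produces “word salad”.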

Comparison at a Glance:


| Metric               | Old Docling Model | New Heron Model (Default)        |
| -------------------- | ----------------- | -------------------------------- |
| **Architecture**     | RT-DETRv1         | **RT-DETRv2**                    |
| **mAP (Accuracy)**   | ~54%              | **~77%**                         |
| **Primary Strength** | Basic detection   | **Complex layout & table logic** |
| **Inference Time**   | Moderate          | **Ultra-Fast (Real-time)**       |
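
A quick arithmetic check shows how the rounded table values relate to the 23.5% figure quoted earlier. The inputs are the approximate mAP values from the table; reading the gain as absolute percentage points rather than a relative improvement is my interpretation, not something the source states.

```python
old_map = 54.0  # approximate mAP (%) of the old model, from the table
new_map = 77.0  # approximate mAP (%) of Heron, from the table

absolute_gain = new_map - old_map              # in percentage points
relative_gain = 100 * absolute_gain / old_map  # as a percentage of the old score

print(f"absolute gain: {absolute_gain:.1f} points")
print(f"relative gain: {relative_gain:.1f}%")
```

With these rounded inputs the absolute difference is about 23 points, which lines up with the quoted gain, while the relative improvement is considerably larger.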

Usage and Implementation

Integrating Docling into a Python project takes little setup. After running pip install docling, the Heron layout model is available out of the box, so you can convert large volumes of complex PDFs into structured, machine-readable data with minimal scaffolding. Even intricate multi-column reports and dense technical papers parse with high fidelity from the first run. The example below goes one step further and explicitly enables OCR and table-structure recovery; it uses the Tesseract CLI engine, which must be installed separately.

from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableStructureOptions,
    TesseractCliOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption


def main():
    data_folder = Path(__file__).parent / "../../tests/data"
    input_doc_path = data_folder / "pdf/2206.01062.pdf"

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options = TableStructureOptions(
        do_cell_matching=True
    )

    # Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions,
    # TesseractCliOcrOptions, OcrMacOptions (macOS only), RapidOcrOptions
    # ocr_options = EasyOcrOptions(force_full_page_ocr=True)
    # ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
    # ocr_options = OcrMacOptions(force_full_page_ocr=True)
    # ocr_options = RapidOcrOptions(force_full_page_ocr=True)
    ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
    pipeline_options.ocr_options = ocr_options

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            )
        }
    )

    doc = converter.convert(input_doc_path).document
    md = doc.export_to_markdown()
    print(md)


if __name__ == "__main__":
    main()
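
The example above configures OCR and table options explicitly. For comparison, here is a minimal sketch that relies entirely on Docling's defaults, which, per the release note quoted in the introduction, use the Heron layout model in recent versions. The helper name `pdf_to_markdown` is my own; the `DocumentConverter` calls are the same ones used above.

```python
import sys


def pdf_to_markdown(source):
    """Convert a PDF (local path or URL) to Markdown using Docling's defaults."""
    # Imported lazily so this module still loads where docling is not installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # no format_options: default PDF pipeline
    result = converter.convert(source)
    return result.document.export_to_markdown()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py path/to/file.pdf
    print(pdf_to_markdown(sys.argv[1]))
```

Starting from the defaults and adding options such as forced full-page OCR only when a document needs them keeps the configuration small.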

The Heron layout model is documented on its Hugging Face model page, quoted here:

Document Layout Analysis “heron”
🚀 heron is the default layout analysis model of the Docling project, designed for robust and high-quality document layout understanding.
📄 For an in-depth description of the model architecture, training datasets, and evaluation methodology, please refer to our technical report: “Advanced Layout Analysis Models for Docling”, Nikolaos Livathinos et al., 🔗 https://arxiv.org/abs/2509.11720
Inference code example
Prerequisites:

pip install transformers Pillow torch requests

Prediction:

import requests
from transformers import RTDetrV2ForObjectDetection, RTDetrImageProcessor
import torch
from PIL import Image


classes_map = {
    0: "Caption",
    1: "Footnote",
    2: "Formula",
    3: "List-item",
    4: "Page-footer",
    5: "Page-header",
    6: "Picture",
    7: "Section-header",
    8: "Table",
    9: "Text",
    10: "Title",
    11: "Document Index",
    12: "Code",
    13: "Checkbox-Selected",
    14: "Checkbox-Unselected",
    15: "Form",
    16: "Key-Value Region",
}
image_url = "https://huggingface.co/spaces/ds4sd/SmolDocling-256M-Demo/resolve/main/example_images/annual_rep_14.png"
model_name = "ds4sd/docling-layout-heron"
threshold = 0.6


# Download the image
image = Image.open(requests.get(image_url, stream=True).raw)
image = image.convert("RGB")

# Initialize the model
image_processor = RTDetrImageProcessor.from_pretrained(model_name)
model = RTDetrV2ForObjectDetection.from_pretrained(model_name)

# Run the prediction pipeline
inputs = image_processor(images=[image], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
results = image_processor.post_process_object_detection(
    outputs,
    target_sizes=torch.tensor([image.size[::-1]]),
    threshold=threshold,
)

# Get the results
for result in results:
    for score, label_id, box in zip(
        result["scores"], result["labels"], result["boxes"]
    ):
        score = round(score.item(), 2)
        label = classes_map[label_id.item()]
        box = [round(i, 2) for i in box.tolist()]
        print(f"{label}:{score} {box}")
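
The script above prints one line per detection. As a possible next step, the sketch below groups detections by label and applies a stricter score cutoff, for example to collect only confident table regions for cropping. The `group_by_label` helper and the sample detections are hypothetical; they only mirror the `(label, score, box)` values the loop above produces.

```python
from collections import defaultdict

# Hypothetical detections in the (label, score, [x0, y0, x1, y1]) shape
# produced by the loop above; the values are made up for illustration.
detections = [
    ("Title", 0.97, [50.0, 40.0, 550.0, 80.0]),
    ("Text", 0.91, [50.0, 100.0, 550.0, 300.0]),
    ("Table", 0.88, [50.0, 320.0, 550.0, 600.0]),
    ("Text", 0.65, [50.0, 620.0, 550.0, 640.0]),
]


def group_by_label(dets, min_score):
    """Keep detections scoring at least min_score and group their boxes by label."""
    grouped = defaultdict(list)
    for label, score, box in dets:
        if score >= min_score:
            grouped[label].append(box)
    return dict(grouped)


# A stricter cutoff than the 0.6 threshold used during post-processing.
by_label = group_by_label(detections, min_score=0.8)
print(sorted(by_label))
print(len(by_label["Text"]))
```

Grouping by label makes it easy to hand each region type to a dedicated downstream step, such as table-structure recovery for "Table" boxes.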

Conclusion: Why Docling Sets a New Gold Standard

By adopting the RT-DETRv2 architecture, Docling pairs high-speed inference with strong structural understanding: a 23.5% gain in mAP at real-time processing speeds. Where many open-source tools produce “word salad” on complex multi-column layouts, Heron’s vision-first approach keeps the logical reading order and table hierarchies intact. Combined with IBM’s research backing, a “plug-and-play” Python implementation, and LLM-ready Markdown output, this makes Docling more than just a parser; it is a strong foundation for Retrieval-Augmented Generation (RAG) and document intelligence pipelines.
