Another exciting feature with Docling!
Introduction
Converting complex document images into structured, editable text is a long-standing challenge in document processing. Granite Docling addresses it head-on: it is a multimodal Image-Text-to-Text model engineered for efficient document conversion, designed to preserve the structural and content features of the Docling standard while integrating seamlessly with DoclingDocuments for full compatibility across the ecosystem. This makes it an ideal tool for accurately digitizing and structuring document layouts.
Granite-Docling is purpose-built for accurate and efficient document conversion, unlike most VLM-based approaches to optical character recognition (OCR) that aim to adapt large, general-purpose models to the task. Even at an ultra-compact 258M parameters, Granite-Docling’s capabilities rival those of systems several times its size, making it extremely cost-effective. The model goes well beyond mere text extraction: it handles both inline and floating math and code, excels at recognizing table structure and preserves the layout and structure of the original document. Whereas conventional OCR models convert documents directly to Markdown and lose connection to the source content, Granite-Docling’s unique method of faithfully translating complex structural elements makes its output ideal for downstream RAG applications.
Model Summary (excerpt from the model's Hugging Face page)
Granite Docling 258M builds upon the IDEFICS3 architecture, but introduces two key modifications: it replaces the vision encoder with siglip2-base-patch16-512 and substitutes the language model with a Granite 165M LLM. Try out the Granite-Docling-258M demo today. (A hedged loading sketch follows the list below.)
- Developed by: IBM Research
- Model type: Multi-modal model (image+text-to-text)
- Language(s): English (NLP)
- License: Apache 2.0
- Release Date: September 17, 2025
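Because the checkpoint is published as a regular Hugging Face model, it can also be loaded with plain transformers on non-Apple hardware. The sketch below is only illustrative and follows typical Idefics3-style usage; the exact classes and prompt handling are my assumptions, so check the model card for the officially supported snippet. The walkthrough in this article uses the MLX build instead.

# Hedged sketch: plain-transformers loading (Idefics3-style usage assumed, not the official snippet)
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

model_id = "ibm-granite/granite-docling-258M"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = load_image("https://ibm.biz/docling-page-with-table")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Convert this page to docling."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Generate DocTags and decode only the newly generated tokens
generated = model.generate(**inputs, max_new_tokens=4096)
print(processor.batch_decode(generated[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=False)[0])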
Granite-Docling-258M is fully integrated into the Docling pipelines, carrying over the existing feature set while introducing a number of powerful new capabilities, including:
- 🔢 Enhanced Equation Recognition: more accurate detection and formatting of mathematical formulas
- 🧩 Flexible Inference Modes: choose between full-page inference and bbox-guided region inference
- 🧘 Improved Stability: tends to avoid infinite loops more effectively
- 🧮 Enhanced Inline Equations: better inline math recognition
- 🧾 Document Element QA: answer questions about a document's structure, such as the presence and order of document elements (see the prompt sketch after this list)
- 🌍 Japanese, Arabic and Chinese support (experimental)
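Two of the items above (flexible inference modes and document-element QA) essentially come down to the instruction you send along with the page image. In the scripts later in this article that instruction is the PROMPT constant; only the default string below is taken from the sample code, the alternative is a hypothetical phrasing to illustrate the idea, not an official prompt list.

# Illustrative only: the alternative prompt wording is an assumption, not documented syntax.
PROMPT = "Convert this page to docling."          # full-page conversion (used in this article)
# PROMPT = "How many tables are on this page?"    # document-element QA (hypothetical phrasing)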
Test & Implementation
The model's Hugging Face page provides the following sample code:
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "docling-core",
# "mlx-vlm",
# "pillow",
# "transformers",
# ]
# ///
import webbrowser
from pathlib import Path
from docling_core.types.doc import ImageRefMode
from docling_core.types.doc.document import DocTagsDocument, DoclingDocument
from mlx_vlm import load, stream_generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
from transformers.image_utils import load_image
# Configuration
MODEL_PATH = "ibm-granite/granite-docling-258M-mlx"
PROMPT = "Convert this page to docling."
SHOW_IN_BROWSER = True
# Sample images (pick one...)
# SAMPLE_IMAGE = "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png"
# SAMPLE_IMAGE = "https://ibm.biz/docling-page-with-list"
SAMPLE_IMAGE = "https://ibm.biz/docling-page-with-table"
# Load model and processor
print("Loading model...")
model, processor = load(MODEL_PATH)
config = load_config(MODEL_PATH)
# Prepare input image and prompt
print("Preparing input...")
pil_image = load_image(SAMPLE_IMAGE)
formatted_prompt = apply_chat_template(processor, config, PROMPT, num_images=1)
# Generate DocTags output
print("Generating DocTags...\n")
output = ""
for token in stream_generate(
    model, processor, formatted_prompt, [pil_image], max_tokens=4096, verbose=False
):
    output += token.text
    print(token.text, end="")
    if "</doctag>" in token.text:
        break
print("\n\nProcessing output...")
# Create DoclingDocument from generated DocTags
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([output], [pil_image])
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Sample Document")
# Export to different formats
print("\nMarkdown output:\n")
print(doc.export_to_markdown())
# Save as HTML with embedded images
output_path = Path("./output.html")
doc.save_as_html(output_path, image_mode=ImageRefMode.EMBEDDED)
print(f"\nHTML saved to: {output_path}")
# Open in browser
if SHOW_IN_BROWSER:
    webbrowser.open(f"file:///{str(output_path.resolve())}")
I modified the provided sample code very slightly, as follows, to fit my usual test structure ⬇️
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "docling-core",
# "mlx-vlm",
# "pillow",
# "transformers",
# ]
# ///
import webbrowser
from pathlib import Path
import os
import glob
from docling_core.types.doc import ImageRefMode
from docling_core.types.doc.document import DocTagsDocument, DoclingDocument
from mlx_vlm import load, stream_generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
from transformers.image_utils import load_image
# -----------------------------------------------------------------------------
# Configuration
# -----------------------------------------------------------------------------
MODEL_PATH = "ibm-granite/granite-docling-258M-mlx"
PROMPT = "Convert this page to docling."
SHOW_IN_BROWSER = True
INPUT_DIR = Path("./input")
OUTPUT_DIR = Path("./output")
# -----------------------------------------------------------------------------
# Main Application Logic
# -----------------------------------------------------------------------------
def main():
    # Ensure input and output directories exist
    if not INPUT_DIR.exists():
        print(f"Error: Input directory '{INPUT_DIR}' not found. Please create it and add an image.")
        return
    OUTPUT_DIR.mkdir(exist_ok=True)
    # Get the first image from the input directory
    image_files = glob.glob(str(INPUT_DIR / "*"))
    if not image_files:
        print(f"Error: No image files found in '{INPUT_DIR}'. Please add an image to process.")
        return
    sample_image_path = image_files[0]
    print(f"Processing image: {sample_image_path}")
    # Load model and processor
    print("Loading model...")
    model, processor = load(MODEL_PATH)
    config = load_config(MODEL_PATH)
    # Prepare input image and prompt
    print("Preparing input...")
    pil_image = load_image(sample_image_path)
    formatted_prompt = apply_chat_template(processor, config, PROMPT, num_images=1)
    # Generate DocTags output
    print("Generating DocTags...\n")
    output = ""
    for token in stream_generate(
        model, processor, formatted_prompt, [pil_image], max_tokens=4096, verbose=False
    ):
        output += token.text
        print(token.text, end="")
        if "</doctag>" in token.text:
            break
    print("\n\nProcessing output...")
    # Create DoclingDocument from generated DocTags
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([output], [pil_image])
    doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Sample Document")
    # Define output file paths
    base_name = Path(sample_image_path).stem
    markdown_path = OUTPUT_DIR / f"{base_name}.md"
    html_path = OUTPUT_DIR / f"{base_name}.html"
    # Export to different formats
    print("\nMarkdown output:\n")
    markdown_output = doc.export_to_markdown()
    print(markdown_output)
    with open(markdown_path, "w", encoding="utf-8") as f:
        f.write(markdown_output)
    print(f"\nMarkdown saved to: {markdown_path}")
    # Save as HTML with embedded images
    doc.save_as_html(html_path, image_mode=ImageRefMode.EMBEDDED)
    print(f"HTML saved to: {html_path}")
    # Open in browser
    if SHOW_IN_BROWSER:
        webbrowser.open(f"file:///{str(html_path.resolve())}")

if __name__ == "__main__":
    main()
Beforehand, create a virtual environment (you really should 😉) and install the required packages.
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install docling-core
pip install mlx-vlm
pip install pillow
pip install transformers
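Alternatively, because the script starts with a PEP 723 inline metadata block (the # /// script header), a runner such as uv can resolve those dependencies on the fly, e.g. uv run app.py (assuming uv is installed). The manual venv route above works just as well.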
The console output you can expect is shown below.
python app.py
Processing image: input/page_with_table (1).png
Loading model...
chat_template.jinja: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 588/588 [00:00<00:00, 9.83MB/s]
added_tokens.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35.0/35.0 [00:00<00:00, 604kB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 169/169 [00:00<00:00, 3.25MB/s]
preprocessor_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 486/486 [00:00<00:00, 1.22MB/s]
config.json: 6.87kB [00:00, 19.3MB/s] | 0.00/169 [00:00<?, ?B/s]
model.safetensors.index.json: 38.1kB [00:00, 47.6MB/s] | 0.00/486 [00:00<?, ?B/s]
merges.txt: 917kB [00:00, 43.2MB/s]████████████████████▌ | 3/13 [00:00<00:02, 3.53it/s]
special_tokens_map.json: 1.08kB [00:00, 18.5MB/s]
processor_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 68.0/68.0 [00:00<00:00, 1.58MB/s]
tokenizer_config.json: 18.0kB [00:00, 43.7MB/s] | 0.00/68.0 [00:00<?, ?B/s]
vocab.json: 1.61MB [00:00, 43.9MB/s] | 0.00/631M [00:00<?, ?B/s]
tokenizer.json: 7.15MB [00:00, 138MB/s]
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 631M/631M [00:12<00:00, 49.1MB/s]
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:13<00:00, 1.07s/it]
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 259647.39it/s]
Preparing input...
Generating DocTags...
<doctag><page_header><loc_126><loc_26><loc_420><loc_33>Optimized Table Tokenization for Table Structure Recognition</page_header>
<page_header><loc_452><loc_27><loc_458><loc_33>9</page_header>
<text><loc_56><loc_46><loc_459><loc_72>order to compute the TED score. Inference timing results for all experiments were obtained from the same machine on a single core with AMD EPYC 7763 CPU @2.45 GHz.</text>
<section_header_level_1><loc_57><loc_84><loc_270><loc_92>5.1 Hyper Parameter Optimization</section_header_level_1>
<text><loc_56><loc_97><loc_459><loc_150>We have chosen the PubTabNet data set to perform HPO, since it includes a highly diverse set of tables. Also we report TED scores separately for simple and complex tables (tables with cell spans). Results are presented in Table. 1. It is evident that with OTSL, our model achieves the same TED score and slightly better mAP scores in comparison to HTML. However OTSL yields a 2x speed up in the inference runtime over HTML.</text>
<otsl><loc_64><loc_213><loc_452><loc_314><fcel>#<fcel>#<fcel>Language<fcel>TEDs<lcel><lcel><fcel>mAP<fcel>Inference<nl><fcel>enc-layers<fcel>dec-layers<fcel>simple<fcel>complex<fcel>all<fcel>(0.75)<fcel>time (secs)<ucel><nl><fcel>6<fcel>6<fcel>OTSL<fcel>0.965<fcel>0.934<fcel>0.955<fcel>0.88<fcel>2.73<nl><ucel><ucel><fcel>HTML<fcel>0.969<fcel>0.927<fcel>0.955<fcel>0.857<ucel><nl><fcel>4<fcel>4<fcel>OTSL<fcel>0.938<fcel>0.904<fcel>0.927<fcel>0.853<fcel>1.97<nl><ucel><ucel><fcel>HTML<fcel>0.952<fcel>0.909<fcel>0.938<fcel>0.843<ucel><nl><fcel>2<fcel>4<fcel>OTSL<fcel>0.923<fcel>0.897<fcel>0.915<fcel>0.859<fcel>1.91<nl><ucel><ucel><fcel>HTML<fcel>0.945<fcel>0.901<fcel>0.931<fcel>0.834<ucel><nl><fcel>4<fcel>2<fcel>OTSL<fcel>0.952<fcel>0.92<fcel>0.942<fcel>0.857<fcel>1.22<nl><ucel><ucel><fcel>HTML<fcel>0.944<fcel>0.903<fcel>0.931<fcel>0.824<ucel><nl><caption><loc_57><loc_165><loc_458><loc_206>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption></otsl>
<section_header_level_1><loc_57><loc_343><loc_208><loc_351>5.2 Quantitative Results</section_header_level_1>
<text><loc_56><loc_355><loc_458><loc_427>We picked the model parameter configuration that produced the best prediction quality (enc=6, dec=6, heads=8) with PubTabNet alone, then independently trained and evaluated it on three publicly available data sets: PubTabNet (395k samples), FinTabNet (113k samples) and PubTables-1M (about 1M samples). Performance results are presented in Table. 2. It is clearly evident that the model trained on OTSL outperforms HTML across the board, keeping high TEDs and mAP scores even on difficult financial tables (FinTabNet) that contain sparse and large tables.</text>
<text><loc_56><loc_429><loc_458><loc_464>Additionally, the results show that OTSL has an advantage over HTML when applied on a bigger data set like PubTables-1M and achieves significantly improved scores. Finally, OTSL achieves faster inference due to fewer decoding steps which is a result of the reduced sequence representation.</text>
</doctag>
Processing output...
Markdown output:
order to compute the TED score. Inference timing results for all experiments were obtained from the same machine on a single core with AMD EPYC 7763 CPU @2.45 GHz.
## 5.1 Hyper Parameter Optimization
We have chosen the PubTabNet data set to perform HPO, since it includes a highly diverse set of tables. Also we report TED scores separately for simple and complex tables (tables with cell spans). Results are presented in Table. 1. It is evident that with OTSL, our model achieves the same TED score and slightly better mAP scores in comparison to HTML. However OTSL yields a 2x speed up in the inference runtime over HTML.
Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.
| # | # | Language | TEDs | TEDs | TEDs | mAP | Inference |
|------------|------------|------------|---------|--------|--------|-------------|-------------|
| enc-layers | dec-layers | simple | complex | all | (0.75) | time (secs) | Inference |
| 6 | 6 | OTSL | 0.965 | 0.934 | 0.955 | 0.88 | 2.73 |
| 6 | 6 | HTML | 0.969 | 0.927 | 0.955 | 0.857 | 2.73 |
| 4 | 4 | OTSL | 0.938 | 0.904 | 0.927 | 0.853 | 1.97 |
| 4 | 4 | HTML | 0.952 | 0.909 | 0.938 | 0.843 | 1.97 |
| 2 | 4 | OTSL | 0.923 | 0.897 | 0.915 | 0.859 | 1.91 |
| 2 | 4 | HTML | 0.945 | 0.901 | 0.931 | 0.834 | 1.91 |
| 4 | 2 | OTSL | 0.952 | 0.92 | 0.942 | 0.857 | 1.22 |
| 4 | 2 | HTML | 0.944 | 0.903 | 0.931 | 0.824 | 1.22 |
## 5.2 Quantitative Results
We picked the model parameter configuration that produced the best prediction quality (enc=6, dec=6, heads=8) with PubTabNet alone, then independently trained and evaluated it on three publicly available data sets: PubTabNet (395k samples), FinTabNet (113k samples) and PubTables-1M (about 1M samples). Performance results are presented in Table. 2. It is clearly evident that the model trained on OTSL outperforms HTML across the board, keeping high TEDs and mAP scores even on difficult financial tables (FinTabNet) that contain sparse and large tables.
Additionally, the results show that OTSL has an advantage over HTML when applied on a bigger data set like PubTables-1M and achieves significantly improved scores. Finally, OTSL achieves faster inference due to fewer decoding steps which is a result of the reduced sequence representation.
Markdown saved to: output/page_with_table (1).md
HTML saved to: output/page_with_table (1).html
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
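Notice that the raw model output is DocTags: each element is wrapped in a tag such as <text>, <otsl> or <section_header_level_1>, preceded by <loc_…> tokens that encode the element's position on the page. Before handing the string to DoclingDocument, you can poke at it directly; the helper below is just a quick, hypothetical sanity check built on the standard library, not part of the Docling API.

import re

def list_doctag_elements(doctags: str) -> list[tuple[str, str]]:
    """Return (tag, text) pairs for a few DocTags element types, with loc tokens stripped."""
    pairs = []
    for tag in ("section_header_level_1", "text", "page_header"):
        for body in re.findall(rf"<{tag}>(.*?)</{tag}>", doctags, flags=re.DOTALL):
            clean = re.sub(r"<loc_\d+>", "", body).strip()
            pairs.append((tag, clean))
    return pairs

# Example with the `output` string produced by the script above:
# for tag, text in list_doctag_elements(output):
#     print(f"[{tag}] {text[:80]}")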
A Markdown file is also generated; its contents match the Markdown output shown in the console log above.
The resulting HTML file is remarkably clean, well-structured, and easy to read.
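If you also want a lossless representation for downstream indexing, rather than Markdown or HTML, the DoclingDocument can be serialized to its native JSON form. The fragment below would slot into main() right after the HTML export; save_as_json and export_to_dict are assumed docling-core APIs, so check the docling-core documentation for your installed version.

import json

# Hypothetical addition to main(): persist the document in its native JSON form (assumed API)
json_path = OUTPUT_DIR / f"{base_name}.json"
doc.save_as_json(json_path)                  # lossless DoclingDocument JSON
print(f"JSON saved to: {json_path}")
doc_dict = doc.export_to_dict()              # same content as a plain Python dict
print(json.dumps(doc_dict, indent=2)[:500])  # preview the first 500 characters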
➡️ You can also try it live on Hugging Face: https://huggingface.co/spaces/ibm-granite/granite-docling-258m-demo
or…
➡️ Use it, at your convenience, as a command-line tool:
# Convert to HTML and Markdown:
docling --to html --to md --pipeline vlm --vlm-model granite_docling "https://arxiv.org/pdf/2501.17887" # accepts files, urls or directories
# Convert to HTML including layout visualization:
docling --to html_split_page --show-layout --pipeline vlm --vlm-model granite_docling "https://arxiv.org/pdf/2501.17887"
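The same VLM pipeline is also available from Python if you prefer to stay inside a script rather than the CLI. The sketch below follows the documented DocumentConverter / VlmPipeline pattern; in recent Docling releases granite-docling is expected to be the default model for this pipeline, but exact defaults and option names may vary by version, so treat it as a starting point rather than a definitive recipe.

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Route PDFs through the VLM pipeline (granite-docling assumed as the default model here)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=VlmPipelineOptions(),
        )
    }
)

result = converter.convert("https://arxiv.org/pdf/2501.17887")
print(result.document.export_to_markdown())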
Conclusion
The Granite-Docling-258M model establishes a new benchmark for document conversion by offering a purpose-built, efficient, and highly accurate alternative to adapted general-purpose VLM solutions. Its primary advantage lies in its ultra-compact size (258M parameters), which allows it to rival the performance of much larger systems and deliver exceptional cost-effectiveness without compromising quality. Crucially, Granite-Docling goes far beyond simple OCR: it excels at deep structural understanding, accurately recognizing and preserving layout, table structure, and complex elements like inline and floating math and code. By faithfully translating these structural elements into a coherent output, it overcomes the limitations of conventional OCR models that discard source structure, making the resulting documents ideal for seamless integration into downstream Retrieval-Augmented Generation (RAG) applications. Granite-Docling-258M is thus a superior choice for organizations that need both high fidelity and efficient processing for their document intelligence workloads.
Links
- Description page: https://www.ibm.com/new/announcements/granite-docling-end-to-end-document-conversion
- ibm-granite/granite-docling-258M: https://huggingface.co/ibm-granite/granite-docling-258M
- Docling project: https://github.com/docling-project
- Docling documentation: https://docling-project.github.io/docling/