Another exciting feature with Docling!
Introduction
Converting complex document images into structured, editable text is a long-standing challenge in document processing. Granite Docling addresses it head-on: it is a multimodal Image-Text-to-Text model engineered for efficient document conversion, designed to preserve the structural and content features of the Docling standard while integrating seamlessly with DoclingDocuments for full compatibility across the ecosystem. This makes it an ideal tool for accurately digitizing and structuring document layouts.
Granite-Docling is purpose-built for accurate and efficient document conversion, unlike most VLM-based approaches to optical character recognition (OCR) that aim to adapt large, general-purpose models to the task. Even at an ultra-compact 258M parameters, Granite-Docling’s capabilities rival those of systems several times its size, making it extremely cost-effective. The model goes well beyond mere text extraction: it handles both inline and floating math and code, excels at recognizing table structure and preserves the layout and structure of the original document. Whereas conventional OCR models convert documents directly to Markdown and lose connection to the source content, Granite-Docling’s unique method of faithfully translating complex structural elements makes its output ideal for downstream RAG applications.
Model Summary (excerpt from the model's Hugging Face page)
Granite Docling 258M builds upon the IDEFICS3 architecture, but introduces two key modifications: it replaces the vision encoder with siglip2-base-patch16-512 and substitutes the language model with a Granite 165M LLM. Try out the Granite-Docling-258M demo today. (A hedged loading sketch follows the list below.)
- Developed by: IBM Research
- Model type: Multi-modal model (image+text-to-text)
- Language(s): English (NLP)
- License: Apache 2.0
- Release Date: September 17, 2025
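Because the checkpoint is published as a regular Hugging Face model, it can also be loaded with plain transformers on non-Apple hardware. The sketch below is only illustrative and follows typical Idefics3-style usage; the exact classes and prompt handling are my assumptions, so check the model card for the officially supported snippet. The walkthrough in this article uses the MLX build instead.

# Hedged sketch: plain-transformers loading (Idefics3-style usage assumed, not the official snippet)
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

model_id = "ibm-granite/granite-docling-258M"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = load_image("https://ibm.biz/docling-page-with-table")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Convert this page to docling."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Generate DocTags and decode only the newly generated tokens
generated = model.generate(**inputs, max_new_tokens=4096)
print(processor.batch_decode(generated[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=False)[0])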
Granite-Docling-258M is fully integrated into the Docling pipelines, carrying over the existing feature set while introducing a number of powerful new capabilities, including:
- 🔢 Enhanced Equation Recognition: more accurate detection and formatting of mathematical formulas
- 🧩 Flexible Inference Modes: choose between full-page inference and bbox-guided region inference
- 🧘 Improved Stability: tends to avoid infinite loops more effectively
- 🧮 Enhanced Inline Equations: better inline math recognition
- 🧾 Document Element QA: answer questions about a document's structure, such as the presence and order of document elements (see the prompt sketch after this list)
- 🌍 Japanese, Arabic and Chinese support (experimental)
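Two of the items above (flexible inference modes and document-element QA) essentially come down to the instruction you send along with the page image. In the scripts later in this article that instruction is the PROMPT constant; only the default string below is taken from the sample code, the alternative is a hypothetical phrasing to illustrate the idea, not an official prompt list.

# Illustrative only: the alternative prompt wording is an assumption, not documented syntax.
PROMPT = "Convert this page to docling."          # full-page conversion (used in this article)
# PROMPT = "How many tables are on this page?"    # document-element QA (hypothetical phrasing)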
Test & Implementation
The model's Hugging Face page provides the following sample code:
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "docling-core",
# "mlx-vlm",
# "pillow",
# "transformers",
# ]
# ///
import webbrowser
from pathlib import Path
from docling_core.types.doc import ImageRefMode
from docling_core.types.doc.document import DocTagsDocument, DoclingDocument
from mlx_vlm import load, stream_generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
from transformers.image_utils import load_image
# Configuration
MODEL_PATH = "ibm-granite/granite-docling-258M-mlx"
PROMPT = "Convert this page to docling."
SHOW_IN_BROWSER = True
# Sample images (pick one...)
# SAMPLE_IMAGE = "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png"
# SAMPLE_IMAGE = "https://ibm.biz/docling-page-with-list"
SAMPLE_IMAGE = "https://ibm.biz/docling-page-with-table"
# Load model and processor
print("Loading model...")
model, processor = load(MODEL_PATH)
config = load_config(MODEL_PATH)
# Prepare input image and prompt
print("Preparing input...")
pil_image = load_image(SAMPLE_IMAGE)
formatted_prompt = apply_chat_template(processor, config, PROMPT, num_images=1)
# Generate DocTags output
print("Generating DocTags...\n")
output = ""
for token in stream_generate(
    model, processor, formatted_prompt, [pil_image], max_tokens=4096, verbose=False
):
    output += token.text
    print(token.text, end="")
    if "</doctag>" in token.text:
        break
print("\n\nProcessing output...")
# Create DoclingDocument from generated DocTags
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([output], [pil_image])
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Sample Document")
# Export to different formats
print("\nMarkdown output:\n")
print(doc.export_to_markdown())
# Save as HTML with embedded images
output_path = Path("./output.html")
doc.save_as_html(output_path, image_mode=ImageRefMode.EMBEDDED)
print(f"\nHTML saved to: {output_path}")
# Open in browser
if SHOW_IN_BROWSER:
    webbrowser.open(f"file:///{str(output_path.resolve())}")
I modified the provided sample code very slightly, as follows, to fit my usual test structure ⬇️
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "docling-core",
# "mlx-vlm",
# "pillow",
# "transformers",
# ]
# ///
import webbrowser
from pathlib import Path
import os
import glob
from docling_core.types.doc import ImageRefMode
from docling_core.types.doc.document import DocTagsDocument, DoclingDocument
from mlx_vlm import load, stream_generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
from transformers.image_utils import load_image
# -----------------------------------------------------------------------------
# Configuration
# -----------------------------------------------------------------------------
MODEL_PATH = "ibm-granite/granite-docling-258M-mlx"
PROMPT = "Convert this page to docling."
SHOW_IN_BROWSER = True
INPUT_DIR = Path("./input")
OUTPUT_DIR = Path("./output")
# -----------------------------------------------------------------------------
# Main Application Logic
# -----------------------------------------------------------------------------
def main():
    # Ensure input and output directories exist
    if not INPUT_DIR.exists():
        print(f"Error: Input directory '{INPUT_DIR}' not found. Please create it and add an image.")
        return
    OUTPUT_DIR.mkdir(exist_ok=True)
    # Get the first image from the input directory
    image_files = glob.glob(str(INPUT_DIR / "*"))
    if not image_files:
        print(f"Error: No image files found in '{INPUT_DIR}'. Please add an image to process.")
        return
    sample_image_path = image_files[0]
    print(f"Processing image: {sample_image_path}")
    # Load model and processor
    print("Loading model...")
    model, processor = load(MODEL_PATH)
    config = load_config(MODEL_PATH)
    # Prepare input image and prompt
    print("Preparing input...")
    pil_image = load_image(sample_image_path)
    formatted_prompt = apply_chat_template(processor, config, PROMPT, num_images=1)
    # Generate DocTags output
    print("Generating DocTags...\n")
    output = ""
    for token in stream_generate(
        model, processor, formatted_prompt, [pil_image], max_tokens=4096, verbose=False
    ):
        output += token.text
        print(token.text, end="")
        if "</doctag>" in token.text:
            break
    print("\n\nProcessing output...")
    # Create DoclingDocument from generated DocTags
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([output], [pil_image])
    doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Sample Document")
    # Define output file paths
    base_name = Path(sample_image_path).stem
    markdown_path = OUTPUT_DIR / f"{base_name}.md"
    html_path = OUTPUT_DIR / f"{base_name}.html"
    # Export to different formats
    print("\nMarkdown output:\n")
    markdown_output = doc.export_to_markdown()
    print(markdown_output)
    with open(markdown_path, "w", encoding="utf-8") as f:
        f.write(markdown_output)
    print(f"\nMarkdown saved to: {markdown_path}")
    # Save as HTML with embedded images
    doc.save_as_html(html_path, image_mode=ImageRefMode.EMBEDDED)
    print(f"HTML saved to: {html_path}")
    # Open in browser
    if SHOW_IN_BROWSER:
        webbrowser.open(f"file:///{str(html_path.resolve())}")

if __name__ == "__main__":
    main()
Beforehand, create a virtual environment (you really should 😉) and install the required packages.
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install docling-core
pip install mlx-vlm
pip install pillow
pip install transformers
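Alternatively, because the script starts with a PEP 723 inline metadata block (the # /// script header), a runner such as uv can resolve those dependencies on the fly, e.g. uv run app.py (assuming uv is installed). The manual venv route above works just as well.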
The console output you can expect is shown below.
python app.py
Processing image: input/page_with_table (1).png
Loading model...
chat_template.jinja: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 588/588 [00:00<00:00, 9.83MB/s]
added_tokens.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35.0/35.0 [00:00<00:00, 604kB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 169/169 [00:00<00:00, 3.25MB/s]
preprocessor_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 486/486 [00:00<00:00, 1.22MB/s]
config.json: 6.87kB [00:00, 19.3MB/s] | 0.00/169 [00:00<?, ?B/s]
model.safetensors.index.json: 38.1kB [00:00, 47.6MB/s] | 0.00/486 [00:00<?, ?B/s]
merges.txt: 917kB [00:00, 43.2MB/s]████████████████████▌ | 3/13 [00:00<00:02, 3.53it/s]
special_tokens_map.json: 1.08kB [00:00, 18.5MB/s]
processor_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 68.0/68.0 [00:00<00:00, 1.58MB/s]
tokenizer_config.json: 18.0kB [00:00, 43.7MB/s] | 0.00/68.0 [00:00<?, ?B/s]
vocab.json: 1.61MB [00:00, 43.9MB/s] | 0.00/631M [00:00<?, ?B/s]
tokenizer.json: 7.15MB [00:00, 138MB/s]
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 631M/631M [00:12<00:00, 49.1MB/s]
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:13<00:00, 1.07s/it]
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 259647.39it/s]
Preparing input...
Generating DocTags...
<doctag><page_header><loc_126><loc_26><loc_420><loc_33>Optimized Table Tokenization for Table Structure Recognition</page_header>
<page_header><loc_452><loc_27><loc_458><loc_33>9</page_header>
<text><loc_56><loc_46><loc_459><loc_72>order to compute the TED score. Inference timing results for all experiments were obtained from the same machine on a single core with AMD EPYC 7763 CPU @2.45 GHz.</text>
<section_header_level_1><loc_57><loc_84><loc_270><loc_92>5.1 Hyper Parameter Optimization</section_header_level_1>
<text><loc_56><loc_97><loc_459><loc_150>We have chosen the PubTabNet data set to perform HPO, since it includes a highly diverse set of tables. Also we report TED scores separately for simple and complex tables (tables with cell spans). Results are presented in Table. 1. It is evident that with OTSL, our model achieves the same TED score and slightly better mAP scores in comparison to HTML. However OTSL yields a 2x speed up in the inference runtime over HTML.</text>
<otsl><loc_64><loc_213><loc_452><loc_314><fcel>#<fcel>#<fcel>Language<fcel>TEDs<lcel><lcel><fcel>mAP<fcel>Inference<nl><fcel>enc-layers<fcel>dec-layers<fcel>simple<fcel>complex<fcel>all<fcel>(0.75)<fcel>time (secs)<ucel><nl><fcel>6<fcel>6<fcel>OTSL<fcel>0.965<fcel>0.934<fcel>0.955<fcel>0.88<fcel>2.73<nl><ucel><ucel><fcel>HTML<fcel>0.969<fcel>0.927<fcel>0.955<fcel>0.857<ucel><nl><fcel>4<fcel>4<fcel>OTSL<fcel>0.938<fcel>0.904<fcel>0.927<fcel>0.853<fcel>1.97<nl><ucel><ucel><fcel>HTML<fcel>0.952<fcel>0.909<fcel>0.938<fcel>0.843<ucel><nl><fcel>2<fcel>4<fcel>OTSL<fcel>0.923<fcel>0.897<fcel>0.915<fcel>0.859<fcel>1.91<nl><ucel><ucel><fcel>HTML<fcel>0.945<fcel>0.901<fcel>0.931<fcel>0.834<ucel><nl><fcel>4<fcel>2<fcel>OTSL<fcel>0.952<fcel>0.92<fcel>0.942<fcel>0.857<fcel>1.22<nl><ucel><ucel><fcel>HTML<fcel>0.944<fcel>0.903<fcel>0.931<fcel>0.824<ucel><nl><caption><loc_57><loc_165><loc_458><loc_206>Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.</caption></otsl>
<section_header_level_1><loc_57><loc_343><loc_208><loc_351>5.2 Quantitative Results</section_header_level_1>
<text><loc_56><loc_355><loc_458><loc_427>We picked the model parameter configuration that produced the best prediction quality (enc=6, dec=6, heads=8) with PubTabNet alone, then independently trained and evaluated it on three publicly available data sets: PubTabNet (395k samples), FinTabNet (113k samples) and PubTables-1M (about 1M samples). Performance results are presented in Table. 2. It is clearly evident that the model trained on OTSL outperforms HTML across the board, keeping high TEDs and mAP scores even on difficult financial tables (FinTabNet) that contain sparse and large tables.</text>
<text><loc_56><loc_429><loc_458><loc_464>Additionally, the results show that OTSL has an advantage over HTML when applied on a bigger data set like PubTables-1M and achieves significantly improved scores. Finally, OTSL achieves faster inference due to fewer decoding steps which is a result of the reduced sequence representation.</text>
</doctag>
Processing output...
Markdown output:
order to compute the TED score. Inference timing results for all experiments were obtained from the same machine on a single core with AMD EPYC 7763 CPU @2.45 GHz.
## 5.1 Hyper Parameter Optimization
We have chosen the PubTabNet data set to perform HPO, since it includes a highly diverse set of tables. Also we report TED scores separately for simple and complex tables (tables with cell spans). Results are presented in Table. 1. It is evident that with OTSL, our model achieves the same TED score and slightly better mAP scores in comparison to HTML. However OTSL yields a 2x speed up in the inference runtime over HTML.
Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.
| # | # | Language | TEDs | TEDs | TEDs | mAP | Inference |
|------------|------------|------------|---------|--------|--------|-------------|-------------|
| enc-layers | dec-layers | simple | complex | all | (0.75) | time (secs) | Inference |
| 6 | 6 | OTSL | 0.965 | 0.934 | 0.955 | 0.88 | 2.73 |
| 6 | 6 | HTML | 0.969 | 0.927 | 0.955 | 0.857 | 2.73 |
| 4 | 4 | OTSL | 0.938 | 0.904 | 0.927 | 0.853 | 1.97 |
| 4 | 4 | HTML | 0.952 | 0.909 | 0.938 | 0.843 | 1.97 |
| 2 | 4 | OTSL | 0.923 | 0.897 | 0.915 | 0.859 | 1.91 |
| 2 | 4 | HTML | 0.945 | 0.901 | 0.931 | 0.834 | 1.91 |
| 4 | 2 | OTSL | 0.952 | 0.92 | 0.942 | 0.857 | 1.22 |
| 4 | 2 | HTML | 0.944 | 0.903 | 0.931 | 0.824 | 1.22 |
## 5.2 Quantitative Results
We picked the model parameter configuration that produced the best prediction quality (enc=6, dec=6, heads=8) with PubTabNet alone, then independently trained and evaluated it on three publicly available data sets: PubTabNet (395k samples), FinTabNet (113k samples) and PubTables-1M (about 1M samples). Performance results are presented in Table. 2. It is clearly evident that the model trained on OTSL outperforms HTML across the board, keeping high TEDs and mAP scores even on difficult financial tables (FinTabNet) that contain sparse and large tables.
Additionally, the results show that OTSL has an advantage over HTML when applied on a bigger data set like PubTables-1M and achieves significantly improved scores. Finally, OTSL achieves faster inference due to fewer decoding steps which is a result of the reduced sequence representation.
Markdown saved to: output/page_with_table (1).md
HTML saved to: output/page_with_table (1).html
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
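Notice that the raw model output is DocTags: each element is wrapped in a tag such as <text>, <otsl> or <section_header_level_1>, preceded by <loc_…> tokens that encode the element's position on the page. Before handing the string to DoclingDocument, you can poke at it directly; the helper below is just a quick, hypothetical sanity check built on the standard library, not part of the Docling API.

import re

def list_doctag_elements(doctags: str) -> list[tuple[str, str]]:
    """Return (tag, text) pairs for a few DocTags element types, with loc tokens stripped."""
    pairs = []
    for tag in ("section_header_level_1", "text", "page_header"):
        for body in re.findall(rf"<{tag}>(.*?)</{tag}>", doctags, flags=re.DOTALL):
            clean = re.sub(r"<loc_\d+>", "", body).strip()
            pairs.append((tag, clean))
    return pairs

# Example with the `output` string produced by the script above:
# for tag, text in list_doctag_elements(output):
#     print(f"[{tag}] {text[:80]}")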
A Markdown file is also generated; its contents match the Markdown output shown in the console log above.
The resulting HTML file is remarkably clean, well-structured, and easy to read.
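If you also want a lossless representation for downstream indexing, rather than Markdown or HTML, the DoclingDocument can be serialized to its native JSON form. The fragment below would slot into main() right after the HTML export; save_as_json and export_to_dict are assumed docling-core APIs, so check the docling-core documentation for your installed version.

import json

# Hypothetical addition to main(): persist the document in its native JSON form (assumed API)
json_path = OUTPUT_DIR / f"{base_name}.json"
doc.save_as_json(json_path)                  # lossless DoclingDocument JSON
print(f"JSON saved to: {json_path}")
doc_dict = doc.export_to_dict()              # same content as a plain Python dict
print(json.dumps(doc_dict, indent=2)[:500])  # preview the first 500 characters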
➡️ You can also try it live on Hugging Face: https://huggingface.co/spaces/ibm-granite/granite-docling-258m-demo
or…
➡️ Use it, at your convenience, as a command-line tool:
# Convert to HTML and Markdown:
docling --to html --to md --pipeline vlm --vlm-model granite_docling "https://arxiv.org/pdf/2501.17887" # accepts files, urls or directories
# Convert to HTML including layout visualization:
docling --to html_split_page --show-layout --pipeline vlm --vlm-model granite_docling "https://arxiv.org/pdf/2501.17887"
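The same VLM pipeline is also available from Python if you prefer to stay inside a script rather than the CLI. The sketch below follows the documented DocumentConverter / VlmPipeline pattern; in recent Docling releases granite-docling is expected to be the default model for this pipeline, but exact defaults and option names may vary by version, so treat it as a starting point rather than a definitive recipe.

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Route PDFs through the VLM pipeline (granite-docling assumed as the default model here)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=VlmPipelineOptions(),
        )
    }
)

result = converter.convert("https://arxiv.org/pdf/2501.17887")
print(result.document.export_to_markdown())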
Conclusion
The Granite-Docling-258M model establishes a new benchmark for document conversion by offering a purpose-built, efficient, and highly accurate alternative to adapted general-purpose VLM solutions. Its primary advantage lies in its ultra-compact size (258M parameters), which allows it to rival the performance of much larger systems and deliver exceptional cost-effectiveness without compromising quality. Crucially, Granite-Docling goes far beyond simple OCR: it excels at deep structural understanding, accurately recognizing and preserving layout, table structure, and complex elements like inline and floating math and code. By faithfully translating these structural elements into a coherent output, it overcomes the limitations of conventional OCR models that discard source structure, making the resulting documents ideal for seamless integration into downstream Retrieval-Augmented Generation (RAG) applications. Granite-Docling-258M is thus a superior choice for organizations that need both high fidelity and efficient processing for their document intelligence workloads.
Links
- Description page: https://www.ibm.com/new/announcements/granite-docling-end-to-end-document-conversion
- ibm-granite/granite-docling-258M: https://huggingface.co/ibm-granite/granite-docling-258M
- Docling project: https://github.com/docling-project
- Docling documentation: https://docling-project.github.io/docling/