Docling Chart Extraction is out! Powered by Granite Vision for Superior Accuracy!
Introduction
For too long, complex charts in PDFs have been the "black boxes" of document processing: visible to humans but invisible to machines. When your RAG system hits a financial report or a scientific paper, it usually sees a jumbled mess of text or skips the visual data entirely. That ends today. With the latest update to Docling, powered by the ultra-efficient Granite Vision model, we can finally bridge the gap between pixels and spreadsheets. Whether it's a quarterly revenue bar chart or a complex distribution line graph, Docling doesn't just see the image; it understands the data behind it.
Capabilities Demonstrated by the Provided Sample
The Docling GitHub repository provides a sample application that you can test out of the box, with the following features:
# %% [markdown]
# Extract chart data from a PDF and export the result as split-page HTML with layout.
#
# What this example does
# - Converts a PDF with chart extraction enrichment enabled.
# - Iterates detected pictures and prints extracted chart data as CSV to stdout.
# - Saves the converted document as split-page HTML with layout to `scratch/`.
#
# Prerequisites
# - Install Docling with the `granite_vision` extra (for chart extraction model).
# - Install `pandas`.
#
# How to run
# - From the repo root: `python docs/examples/chart_extraction.py`.
# - Outputs are written to `scratch/`.
#
# Input document
# - Defaults to `docs/examples/data/chart_document.pdf`. Change `input_doc_path`
# as needed.
#
# Notes
# - Enabling `do_chart_extraction` automatically enables picture classification.
# - Supported chart types: bar chart, pie chart, line chart.
# %%
import logging
import time
from pathlib import Path

import pandas as pd
from docling_core.transforms.serializer.html import (
    HTMLDocSerializer,
    HTMLOutputStyle,
    HTMLParams,
)
from docling_core.transforms.visualizer.layout_visualizer import LayoutVisualizer
from docling_core.types.doc import ImageRefMode, PictureItem

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

_log = logging.getLogger(__name__)


def main():
    logging.basicConfig(level=logging.INFO)

    input_doc_path = Path(__file__).parent / "data/chart_document.pdf"
    output_dir = Path("scratch")
    output_dir.mkdir(parents=True, exist_ok=True)

    # Configure the PDF pipeline with chart extraction enabled.
    # This automatically enables picture classification as well.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_chart_extraction = True
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    start_time = time.time()
    conv_res = doc_converter.convert(input_doc_path)
    doc_filename = conv_res.input.file.stem

    # Iterate over document items and print extracted chart data.
    for item, _level in conv_res.document.iterate_items():
        if not isinstance(item, PictureItem):
            continue
        if item.meta is None:
            continue

        # Check if the picture was classified as a chart.
        if item.meta.classification is not None:
            chart_type = item.meta.classification.get_main_prediction().class_name
        else:
            continue

        # Check if chart data was extracted.
        if item.meta.tabular_chart is None:
            continue
        table_data = item.meta.tabular_chart.chart_data

        print(f"## Chart type: {chart_type}")
        print(f"   Size: {table_data.num_rows} rows x {table_data.num_cols} cols")

        # Build a DataFrame from the extracted table cells for display.
        grid: list[list[str]] = [
            [""] * table_data.num_cols for _ in range(table_data.num_rows)
        ]
        for cell in table_data.table_cells:
            grid[cell.start_row_offset_idx][cell.start_col_offset_idx] = cell.text
        chart_df = pd.DataFrame(grid)
        print(chart_df.to_csv(index=False, header=False))

    # Export the full document as split-page HTML with layout.
    html_filename = output_dir / f"{doc_filename}.html"
    ser = HTMLDocSerializer(
        doc=conv_res.document,
        params=HTMLParams(
            image_mode=ImageRefMode.EMBEDDED,
            output_style=HTMLOutputStyle.SPLIT_PAGE,
        ),
    )
    visualizer = LayoutVisualizer()
    visualizer.params.show_label = False
    ser_res = ser.serialize(
        visualizer=visualizer,
    )
    with open(html_filename, "w") as fw:
        fw.write(ser_res.text)
    _log.info(f"Saved split-page HTML to {html_filename}")

    elapsed = time.time() - start_time
    _log.info(f"Document converted and exported in {elapsed:.2f} seconds.")


if __name__ == "__main__":
    main()
Key Features You Need to Know:
- Granite Vision Integration: Leverages IBM's lightweight, state-of-the-art vision-language model to accurately classify and parse document figures.
- Automatic Data Reconstruction: Converts bar, pie, and line charts directly into structured DataFrames (CSV/JSON), ready for your analysis or LLM context.
- Visual Layout Preservation: Export your documents as high-fidelity, split-page HTML that keeps the original structure intact while making the underlying data interactive.
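For instance, the DataFrame built in the sample above can be persisted straight to disk for downstream tools. This is only a minimal sketch: chart_df, chart_type and output_dir come from the example, while the counter-based file names are my own assumption for illustration.

# Minimal sketch: persist each extracted chart as CSV and JSON.
# `chart_df`, `chart_type` and `output_dir` are the names used in the example above;
# `chart_idx` and the file naming scheme are assumptions for illustration.
chart_idx = 0  # increment this counter inside the picture loop
csv_path = output_dir / f"chart_{chart_idx}_{chart_type}.csv"
json_path = output_dir / f"chart_{chart_idx}_{chart_type}.json"
chart_df.to_csv(csv_path, index=False, header=False)
chart_df.to_json(json_path, orient="values")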
My personal touch on the sample!
As usual, to deliver exactly what I need, I rebuilt the provided script to include a professional Gradio interface, recursive file handling using pathlib, and a robust timestamped output system. As a bonus, it also produces an Excel file containing the extracted charts.
- Prepare your environment first:
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
- Install the requirements:
docling>=2.0.0
docling-core
pandas
gradio
docling[granite_vision]
docling[ocr]
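Assuming you save the list above as requirements.txt, everything can be installed in one go:

pip install -r requirements.txt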
- And here is the sample application:
# app.py
import logging
import time
import zipfile
from datetime import datetime
from pathlib import Path

import gradio as gr
import pandas as pd
from docling_core.transforms.serializer.html import (
    HTMLDocSerializer,
    HTMLOutputStyle,
    HTMLParams,
)
from docling_core.transforms.visualizer.layout_visualizer import LayoutVisualizer
from docling_core.types.doc import ImageRefMode, PictureItem

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

logging.basicConfig(level=logging.INFO)
_log = logging.getLogger(__name__)


def process_folder():
    input_dir = Path("./input")
    output_base = Path("./output")
    input_dir.mkdir(exist_ok=True)
    output_base.mkdir(exist_ok=True)

    # Timestamped output directory for this run.
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_output_dir = output_base / f"run_{timestamp}"
    run_output_dir.mkdir(parents=True, exist_ok=True)

    # Enable chart extraction (this also enables picture classification).
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_chart_extraction = True
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    log_messages = []
    all_charts_data = []  # For Excel export

    files = list(input_dir.rglob("*.pdf"))
    if not files:
        return "No PDF files found.", None, None

    for file_path in files:
        try:
            _log.info(f"Processing: {file_path}")
            conv_res = doc_converter.convert(file_path)

            # Chart extraction logic
            chart_count = 0
            for item, _level in conv_res.document.iterate_items():
                if isinstance(item, PictureItem) and item.meta and item.meta.tabular_chart:
                    chart_count += 1
                    table_data = item.meta.tabular_chart.chart_data
                    grid = [
                        [""] * table_data.num_cols for _ in range(table_data.num_rows)
                    ]
                    for cell in table_data.table_cells:
                        grid[cell.start_row_offset_idx][cell.start_col_offset_idx] = cell.text
                    df = pd.DataFrame(grid)
                    sheet_name = f"{file_path.stem[:20]}_C{chart_count}"
                    all_charts_data.append((sheet_name, df))

            # HTML export
            html_filename = run_output_dir / f"{file_path.stem}.html"
            ser = HTMLDocSerializer(
                doc=conv_res.document,
                params=HTMLParams(
                    image_mode=ImageRefMode.EMBEDDED,
                    output_style=HTMLOutputStyle.SPLIT_PAGE,
                ),
            )
            ser_res = ser.serialize(visualizer=LayoutVisualizer())
            with open(html_filename, "w", encoding="utf-8") as fw:
                fw.write(ser_res.text)

            log_messages.append(f"✅ {file_path.name}: {chart_count} charts found.")
        except Exception as e:
            log_messages.append(f"❌ {file_path.name}: {str(e)}")

    # Excel export (one sheet per extracted chart)
    excel_path = run_output_dir / "master_chart_export.xlsx"
    if all_charts_data:
        with pd.ExcelWriter(excel_path) as writer:
            for sheet_name, df in all_charts_data:
                df.to_excel(writer, sheet_name=sheet_name, index=False, header=False)

    # ZIP for download
    zip_path = output_base / f"results_{timestamp}.zip"
    with zipfile.ZipFile(zip_path, "w") as zipf:
        for f in run_output_dir.rglob("*"):
            zipf.write(f, f.relative_to(run_output_dir))

    return (
        "\n".join(log_messages),
        str(excel_path) if all_charts_data else None,
        str(zip_path),
    )


# --- Gradio UI ---
with gr.Blocks(title="Docling Enterprise") as demo:
    gr.Markdown("# Docling Chart Intelligence Hub")
    with gr.Row():
        run_btn = gr.Button("Process ./input Folder", variant="primary")
    with gr.Row():
        status = gr.Textbox(label="Processing Log", lines=8)
    with gr.Row():
        excel_out = gr.File(label="Download Master Excel")
        zip_out = gr.File(label="Download All Results (HTML + Data)")

    run_btn.click(fn=process_folder, outputs=[status, excel_out, zip_out])

demo.launch()
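By default the app is only reachable on localhost. If you want to expose it on your network or create a temporary public link, the standard Gradio launch() parameters apply; the values below are purely illustrative and not specific to this sample.

# Illustrative only: standard gradio launch() options.
# share=True creates a temporary public gradio.live URL.
demo.launch(server_name="0.0.0.0", server_port=7860, share=False)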
- Below is a screen capture of the input document (the link is provided in the "Links" section);
- The application's UI;
- Output from the processing run on the console;
python app.py
* Running on local URL: http://127.0.0.1:7860
INFO:httpx:HTTP Request: GET http://127.0.0.1:7860/gradio_api/startup-events "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: HEAD http://127.0.0.1:7860/ "HTTP/1.1 200 OK"
* To create a public link, set `share=True` in `launch()`.
INFO:httpx:HTTP Request: GET https://api.gradio.app/pkg-version "HTTP/1.1 200 OK"
INFO:__main__:Processing: input/chart_document.pdf
INFO:docling.datamodel.document:detected formats: [<InputFormat.PDF: 'pdf'>]
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.document_converter:Initializing pipeline for StandardPdfPipeline with options hash 73684e8f84d58523e34f7afaeac3a9d6
INFO:docling.models.factories.base_factory:Loading plugin 'docling_defaults'
INFO:docling.models.factories:Registered picture descriptions: ['vlm', 'api']
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.utils.accelerator_utils:Removing MPS from available devices because it is not in supported_devices=[<AcceleratorDevice.CPU: 'cpu'>, <AcceleratorDevice.CUDA: 'cuda'>]
INFO:docling.utils.accelerator_utils:Accelerator device: 'cpu'
Loading checkpoint shards: 100%|██████████| 2/2 [00:09<00:00,  4.53s/it]
INFO:docling.models.factories.base_factory:Loading plugin 'docling_defaults'
INFO:docling.models.factories:Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
INFO:docling.models.stages.ocr.auto_ocr_model:Auto OCR model selected ocrmac.
INFO:docling.models.factories.base_factory:Loading plugin 'docling_defaults'
INFO:docling.models.factories:Registered layout engines: ['docling_layout_default', 'docling_experimental_table_crops_layout']
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.models.factories.base_factory:Loading plugin 'docling_defaults'
INFO:docling.models.factories:Registered table structure engines: ['docling_tableformer']
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.pipeline.base_pipeline:Processing document chart_document.pdf
INFO:docling.document_converter:Finished converting document chart_document.pdf in 671.11 sec.
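Note that the log above shows the chart extraction stage dropping MPS and falling back to CPU, which explains the long conversion time on my machine. If you have CUDA hardware, you can steer the device and thread count explicitly. The sketch below uses Docling's standard AcceleratorOptions; treat it as a starting point and check the options available in your installed version.

# Sketch: pin the pipeline to a specific device and thread count.
# AcceleratorOptions/AcceleratorDevice are standard Docling pipeline options,
# but individual models (like chart extraction) may still restrict the
# devices they support, as the log above shows.
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)

pipeline_options = PdfPipelineOptions()
pipeline_options.do_chart_extraction = True
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=8,
    device=AcceleratorDevice.AUTO,  # or AcceleratorDevice.CUDA / AcceleratorDevice.CPU
)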
- And a screen capture of the resulting HTML file;
> A basic Excel output which could be enhanced!
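One easy enhancement is to swap the plain pd.ExcelWriter call in app.py for the xlsxwriter engine and auto-size the columns. A minimal sketch, assuming the optional xlsxwriter package is installed (it is not in the requirements list above); all_charts_data and excel_path are the variables already used in process_folder().

# Sketch: replace the Excel block in app.py with auto-sized columns.
# Assumes the optional `xlsxwriter` package is installed.
with pd.ExcelWriter(excel_path, engine="xlsxwriter") as writer:
    for sheet_name, df in all_charts_data:
        df.to_excel(writer, sheet_name=sheet_name, index=False, header=False)
        worksheet = writer.sheets[sheet_name]
        for col_idx in range(df.shape[1]):
            # Width based on the longest cell in the column, with a small margin.
            width = int(df.iloc[:, col_idx].astype(str).map(len).max())
            worksheet.set_column(col_idx, col_idx, max(width + 2, 12))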
Conclusion: The Future of Document Intelligence
The release of Docling's chart extraction marks a paradigm shift in how we handle unstructured data. By moving beyond simple text scraping and leveraging the Granite Vision model, Docling transforms complex visual data, once the "dark matter" of PDFs, into structured, actionable insights with surgical precision. This capability doesn't just "see" a chart; it reconstructs the underlying logic of bar, pie, and line graphs into high-fidelity data tables.
This sample application serves as your high-speed starter kit for this new era. By providing a standardized "input/output" workflow, recursive processing, and a clean Gradio UI, it bridges the gap between a powerful library and a production-ready tool. You can use this template as a foundational blueprint: whether you are building a financial analysis engine, a scientific research aggregator, or a specialized RAG pipeline, this setup provides the scaffolding you need to scale from a local script to a robust, monitored, and automated document processing ecosystem.
>>> Thanks for reading <<<
Links
- Docling Project: https://github.com/docling-project/docling
- Original Code Sample for Chart Extraction: https://github.com/docling-project/docling/blob/main/docs/examples/chart_extraction.py
- Document used for test: https://github.com/docling-project/docling/blob/main/docs/examples/data/chart_document.pdf
- Granite Vision Chart2CSV Preview on Hugging Face: https://huggingface.co/ibm-granite/granite-vision-3.3-2b-chart2csv-preview