Awa Destiny Aghangu

Posted on Mar 26

Getting Started with Docling: PDF to Structured Data

#python #ai #tutorial #opensource

Docling is an open-source document conversion tool from IBM Research. It takes PDFs and converts them into clean, structured output like Markdown, HTML, JSON, or plain text. It handles layout analysis, table extraction, image embedding, OCR, and even a vision-based pipeline for complex documents.

This guide walks through installation, the core conversion options, and the advanced flags worth knowing.

Installation

Use a virtual environment:

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install docling

Verify:

docling --version
# Should output: Docling version: 2.xx.x

Basic Conversion

Docling accepts both local file paths and remote URLs:

docling https://example.com/document.pdf
docling ./my-report.pdf

Default output is Markdown, written to your current directory. A typical document takes around two minutes and uses few resources, though both vary with page count and hardware.
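For more than a handful of files, the same command can be driven from a script. The sketch below is one way to do that, assuming the `docling` executable is on your PATH. The helper names `build_command` and `convert_all` are hypothetical, and only the `--to` and `--output` flags used elsewhere in this guide appear.

```python
import subprocess
from pathlib import Path


def build_command(source: str, fmt: str = "md", output_dir: str = "out") -> list[str]:
    """Build the argv list for one docling CLI call (hypothetical helper)."""
    return ["docling", source, "--to", fmt, "--output", output_dir]


def convert_all(folder: str, fmt: str = "md", output_dir: str = "out") -> list[str]:
    """Run docling on every PDF in `folder` and return the sources that failed."""
    failed = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # A non-zero exit code is treated as a failed conversion.
        result = subprocess.run(build_command(str(pdf), fmt, output_dir))
        if result.returncode != 0:
            failed.append(str(pdf))
    return failed
```

Calling `convert_all("./pdfs")` would convert each PDF in a local `pdfs` folder one at a time and hand back the paths that need a second look.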

Output Formats

Markdown (default)

docling file.pdf
# or explicitly
docling file.pdf --to md

Text, headings, tables, and images are all preserved. Images are embedded as base64 data URIs. Because it keeps document structure while staying plain text, Markdown suits many data pipelines.

HTML

docling file.pdf --to html

The same extracted content wrapped in HTML with basic browser styling. Useful for human-readable web viewing. The underlying extraction is identical to the Markdown output; only the presentation layer changes.

JSON

docling file.pdf --to json

Every element (heading, paragraph, table, image) becomes a structured node with semantic metadata. Use this when you need programmatic access to document structure, not just raw text.
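As a rough illustration of that programmatic access, the sketch below tallies element types in a JSON export. It assumes the file has top-level `texts`, `tables`, and `pictures` arrays and that each text item carries a `label` field. Those key names are assumptions about the export schema, so check them against a file produced by your installed version. `count_elements` and `count_elements_in_file` are hypothetical helpers.

```python
import json
from collections import Counter


def count_elements(doc: dict) -> Counter:
    """Tally element types in a parsed Docling JSON export.

    Assumed layout (verify against your own output): top-level "texts",
    "tables", and "pictures" lists, with a "label" on each text item.
    """
    counts = Counter()
    for item in doc.get("texts", []):
        counts[item.get("label", "unknown")] += 1
    counts["table"] += len(doc.get("tables", []))
    counts["picture"] += len(doc.get("pictures", []))
    return counts


def count_elements_in_file(path: str) -> Counter:
    """Load a JSON export from disk and tally its elements."""
    with open(path, encoding="utf-8") as fh:
        return count_elements(json.load(fh))
```

A tally like this is a cheap sanity check after conversion: a report you know contains five tables should not come back with zero.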

Plain Text

docling file.pdf --to text

All structure stripped. Images become placeholders. Useful only when you need raw text and nothing else.

Advanced Options

VLM Pipeline

docling --pipeline vlm --vlm-model granite_docling file.pdf --output vlm/

The standard pipeline reads the text layer of the PDF. The VLM (Vision Language Model) pipeline processes the document visually, the way a human would read it. This matters in a few specific situations:

Image-based pages: Cover pages or sections built entirely from images have no text layer for the standard pipeline to read. The VLM pipeline recovers them.
Hidden text artifacts: Old revisions sometimes leave hidden text beneath visible content. The standard pipeline surfaces both strings. The VLM pipeline reads what's visually rendered, so the artifact doesn't appear.
Complex layouts: Overall structure and layout understanding are noticeably better.

The trade-offs are real though. The VLM pipeline takes significantly more time and is far more resource-intensive (CPU, GPU, RAM) than the standard pipeline. It also has its own failure modes: some Unicode symbols like ✔ that the standard pipeline captures correctly may be replaced with approximate text like (in-place), and some passages may repeat in the output.

Use the VLM pipeline when accuracy matters more than speed. For bulk processing, stick with the standard pipeline unless you have the resources to run the VLM pipeline at scale.
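Because repeated passages are one of the failure modes noted above, a quick post-check on VLM output can be worthwhile. The sketch below flags paragraphs that occur more than once in a Markdown file. It is a simple exact-match heuristic, not a Docling feature, and `find_repeated_paragraphs` and its `min_length` threshold are hypothetical.

```python
import re
from collections import Counter


def find_repeated_paragraphs(markdown: str, min_length: int = 40) -> list[str]:
    """Return paragraphs that appear more than once, in first-seen order.

    Paragraphs are split on blank lines and whitespace is normalised before
    comparison. Blocks shorter than min_length characters are ignored so
    that short headings and labels do not trigger false positives.
    """
    blocks = re.split(r"\n\s*\n", markdown)
    paragraphs = [" ".join(block.split()) for block in blocks]
    counts = Counter(p for p in paragraphs if len(p) >= min_length)
    seen = set()
    repeated = []
    for p in paragraphs:
        if counts.get(p, 0) > 1 and p not in seen:
            seen.add(p)
            repeated.append(p)
    return repeated
```

Exact matching will miss repeats that differ by a word or two, so treat an empty result as "nothing obvious" and not as proof the output is clean.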

Disabling OCR

docling file.pdf --no-ocr

For PDFs with a proper text layer (digitally created documents), disabling OCR has no effect on output quality and shaves off a little processing time. For scanned documents, disabling OCR means text in images won't be extracted at all.

Referenced Image Export

docling file.pdf --image-export-mode referenced --output out/

By default, images are embedded as base64 in the output file, which keeps everything self-contained but produces large files. With referenced, images are written as separate files and the Markdown links to them by path. Use this when images need to be processed independently or when a smaller output file is preferred.
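To judge whether referenced export is worth it for a given document, you can measure how much of a Markdown file is taken up by embedded images. The sketch below assumes embedded images appear as standard Markdown image links with `data:image/...;base64,` targets. `embedded_image_stats` is a hypothetical helper.

```python
import re

# Matches Markdown images whose target is a base64 data URI, e.g.
# ![Image](data:image/png;base64,iVBORw0...)
DATA_URI_IMAGE = re.compile(
    r"!\[[^\]]*\]\((data:image/[\w.+-]+;base64,[A-Za-z0-9+/=]+)\)"
)


def embedded_image_stats(markdown: str) -> tuple[int, int]:
    """Return (image_count, characters_used_by_data_uris)."""
    uris = DATA_URI_IMAGE.findall(markdown)
    return len(uris), sum(len(uri) for uri in uris)
```

If the second number accounts for most of the file's length, switching to referenced mode should shrink the Markdown considerably.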

Disabling Table Structure Recovery

docling file.pdf --no-tables

Table content is still extracted, but instead of a proper Markdown table with rows and columns, everything collapses into a single cell. Useful if you are processing in bulk and handling table structure downstream.

DEV Community