Ashok Nagaraj

Posted on Nov 7

From PDFs to Markdown

#rag #parser #etl

Introduction

Retrieval-Augmented Generation (RAG) pipelines rely heavily on accurate, structured document parsing. This post compares open-source frameworks that parse complex documents (PDF, DOCX, PPTX, XLSX) into structured markdown while preserving layout, content, and metadata. The focus is on tools that can be installed locally, run in air-gapped environments, and emit markdown.

Premise and Requirements

Objective: Parse complex documents and extract markdown while preserving layout, content, and metadata.
Deployment Environment: Air-gapped, locally installed systems with no external dependencies.
Supported Input Formats: PDF, DOCX, PPTX, XLSX.
Output Format: Markdown with layout and metadata preservation.
Tool Requirements:
- Open-source license
- Local installation (no cloud dependencies)
- CPU-only compatibility
- Fast parsing speed
- OCR capabilities for scanned documents
- CLI support for automation
- Ease of use and documentation
- Local models for layout and structure analysis
- GitHub popularity (stars)
- Hybrid chunking support for RAG pipelines

Evaluation Criteria

The frameworks are evaluated based on the following criteria:

License: Open-source license, including any copyleft terms that affect commercial use.
Input Formats: Supported document types (PDF, DOCX, PPTX, XLSX, HTML, Images).
Output Formats: Markdown, HTML, JSON, or raw text.
OCR Support: Ability to extract text from scanned documents using OCR engines.
CLI Availability: Command-line interface for automation and scripting.
Local Models: Support for locally installed models without cloud dependencies.
Markdown Output: Capability to generate markdown preserving layout and structure.
Hybrid Chunking: Support for layout-aware and semantic chunking.
Speed: Relative parsing and conversion speed (a qualitative rating, not a benchmark result).
GitHub Stars: Community adoption and popularity.
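
To make the "hybrid chunking" criterion concrete, here is a minimal sketch of the general idea in plain Python: split the markdown at headings first (the layout-aware part), then pack the paragraphs under each heading into chunks that stay within a size budget. It is an illustration only and not the chunker of any tool listed here; `Chunk`, `chunk_markdown`, and `max_words` are hypothetical names, and a whitespace word count stands in for a real tokenizer.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str  # nearest markdown heading above the text ("" if none)
    text: str     # one or more paragraphs joined by blank lines


def chunk_markdown(markdown: str, max_words: int = 50) -> list[Chunk]:
    """Split at headings, then pack paragraphs up to max_words per chunk."""
    chunks: list[Chunk] = []
    heading = ""
    buffer: list[str] = []
    count = 0

    def flush() -> None:
        # Emit the buffered paragraphs under the heading that was active
        # when they were collected, then reset the buffer.
        nonlocal buffer, count
        if buffer:
            chunks.append(Chunk(heading, "\n\n".join(buffer)))
        buffer, count = [], 0

    for block in re.split(r"\n\s*\n", markdown.strip()):
        block = block.strip()
        if not block:
            continue
        first, _, rest = block.partition("\n")
        if re.match(r"#{1,6}\s", first):
            flush()  # a heading always closes the current chunk
            heading = first.lstrip("#").strip()
            block = rest.strip()
            if not block:
                continue
        words = len(block.split())
        if buffer and count + words > max_words:
            flush()  # size budget reached: start a new chunk, same heading
        buffer.append(block)
        count += words
    flush()
    return chunks
```

A paragraph longer than `max_words` is kept whole in this sketch; a production chunker would typically split it further and count tokens with the embedding model's own tokenizer.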

Comparison Table

Tool	License	Input Formats	Output Formats	OCR Support	CLI	Local Models	Markdown Output	Hybrid Chunking	Speed	GitHub Stars
Docling	MIT	PDF, DOCX, PPTX, XLSX, HTML, Images	Markdown, HTML, JSON	Yes (Tesseract, EasyOCR, RapidOCR)	Yes	Yes	Yes	Yes	Fast	42.7k
Marker	GPL-3.0	PDF, DOCX, PPTX, XLSX, HTML, EPUB, Images	Markdown, HTML, JSON	Yes (Surya OCR)	Yes	Yes	Yes	Yes	Very Fast	~2k
MinerU	AGPL-3.0	PDF	Markdown, JSON	Yes (PaddleOCR)	Yes	Yes	Yes	Yes	Medium	~1k
PyMuPDF	AGPL-3.0	PDF, EPUB, XPS	Raw text, JSON	No	Yes	No	No	No	Fast	7.4k
PyMuPDF4LLM	AGPL-3.0	PDF	Markdown	No	Yes	No	Yes	No	Medium	1.1k
PyPDF2	BSD	PDF	Text	No	Yes	No	No	No	Slow	6.3k
Markitdown	MIT	PDF	Markdown	No	Yes	No	Yes	No	Unknown	<500
Dolphin	Unknown	PDF	Markdown	No	No	No	Yes	No	Unknown	<500
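
One way to read the table against the requirements is to encode the rows and filter them. The sketch below does that with the feature columns copied from the table as given; the `Tool` dataclass, its field names, and `meets_requirements` are hypothetical helpers introduced for illustration, and speed, license, and stars are left out because they are not hard yes/no requirements.

```python
from dataclasses import dataclass

# Input formats every candidate must accept, per the requirements list.
REQUIRED_INPUTS = frozenset({"PDF", "DOCX", "PPTX", "XLSX"})


@dataclass(frozen=True)
class Tool:
    name: str
    inputs: frozenset
    ocr: bool
    cli: bool
    local_models: bool
    markdown: bool
    hybrid_chunking: bool


OFFICE = {"PDF", "DOCX", "PPTX", "XLSX", "HTML", "Images"}

# Rows transcribed from the comparison table.
TOOLS = [
    Tool("Docling", frozenset(OFFICE), True, True, True, True, True),
    Tool("Marker", frozenset(OFFICE | {"EPUB"}), True, True, True, True, True),
    Tool("MinerU", frozenset({"PDF"}), True, True, True, True, True),
    Tool("PyMuPDF", frozenset({"PDF", "EPUB", "XPS"}), False, True, False, False, False),
    Tool("PyMuPDF4LLM", frozenset({"PDF"}), False, True, False, True, False),
    Tool("PyPDF2", frozenset({"PDF"}), False, True, False, False, False),
    Tool("Markitdown", frozenset({"PDF"}), False, True, False, True, False),
    Tool("Dolphin", frozenset({"PDF"}), False, False, False, True, False),
]


def meets_requirements(tool: Tool) -> bool:
    """True if the tool covers all required inputs and every yes/no feature."""
    return (
        REQUIRED_INPUTS <= tool.inputs
        and tool.ocr
        and tool.cli
        and tool.local_models
        and tool.markdown
        and tool.hybrid_chunking
    )


shortlist = [t.name for t in TOOLS if meets_requirements(t)]
print(shortlist)
```

With the table as given, only Docling and Marker pass every check; MinerU drops out solely because it accepts PDF input only.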

Framework Descriptions

Docling

Docling is a comprehensive document parsing framework developed by IBM Research and hosted by the LF AI & Data Foundation. It supports multiple input formats, with layout analysis by a model trained on the DocLayNet dataset and table structure extraction by TableFormer. OCR is available through Tesseract, EasyOCR, and RapidOCR. Docling is well suited to enterprise-grade RAG pipelines in air-gapped environments.
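
As a rough illustration of what a Docling conversion looks like from Python, the sketch below wraps the library's `DocumentConverter` in a small helper. The helper name `docling_to_markdown` and the up-front file check are my own additions; default pipeline options are assumed, and in an air-gapped setup the model files would have to be available locally before the first run.

```python
from pathlib import Path


def docling_to_markdown(source: str) -> str:
    """Convert one local document to markdown with Docling's Python API."""
    path = Path(source)
    if not path.is_file():
        raise FileNotFoundError(f"No such document: {path}")

    # Imported lazily so this module still loads where Docling is absent.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # default pipeline options (assumption)
    result = converter.convert(path)
    return result.document.export_to_markdown()
```

Calling `docling_to_markdown("report.pdf")` would return the markdown as a string, where `report.pdf` is a placeholder file name.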

Marker

Marker is a fast and flexible parser that uses Surya OCR for multilingual text extraction. It supports a wide range of input formats and outputs structured markdown. Marker is optimized for speed and runs on GPU, CPU, or Apple MPS. It is suitable for lightweight deployments and multilingual document processing.
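
For comparison, a Marker conversion from Python might look like the sketch below, based on the usage pattern in Marker's README. The wrapper `marker_to_markdown` and its `device` parameter are hypothetical; defaulting the `TORCH_DEVICE` environment variable to `cpu` reflects the CPU-only requirement of this post and is an assumption about how you want to run it.

```python
import os
from pathlib import Path


def marker_to_markdown(source: str, device: str = "cpu") -> str:
    """Convert one local PDF to markdown with Marker's Python API."""
    path = Path(source)
    if not path.is_file():
        raise FileNotFoundError(f"No such document: {path}")

    # Set the torch device before Marker is imported, and only if the
    # caller has not already chosen one in the environment.
    os.environ.setdefault("TORCH_DEVICE", device)

    # Imported lazily so this module still loads where Marker is absent.
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict
    from marker.output import text_from_rendered

    converter = PdfConverter(artifact_dict=create_model_dict())
    rendered = converter(str(path))
    text, _, _images = text_from_rendered(rendered)
    return text
```

The extracted images returned alongside the text are ignored here; a real pipeline would probably save them next to the markdown file.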

MinerU

MinerU specializes in parsing Chinese, scientific, and financial documents. It uses PaddleOCR and hybrid rule-based models for accurate layout and table extraction. MinerU is effective in handling rotated tables and preserving document structure in markdown.

PyMuPDF

PyMuPDF is a low-level PDF parsing library that provides fast text extraction but lacks OCR and layout understanding. It is suitable for simple text extraction tasks.

PyMuPDF4LLM

An extension of PyMuPDF, PyMuPDF4LLM adds markdown output capabilities but does not include advanced layout features or OCR support.
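
A minimal sketch of the PyMuPDF4LLM route, assuming the package's `to_markdown` function and a text-based (not scanned) PDF. The helper names `markdown_path` and `pdf_to_markdown_file` and the output layout are hypothetical.

```python
from pathlib import Path


def markdown_path(pdf_path: str, out_dir: str) -> Path:
    """Destination for the converted file: <out_dir>/<pdf stem>.md."""
    return Path(out_dir) / (Path(pdf_path).stem + ".md")


def pdf_to_markdown_file(pdf_path: str, out_dir: str) -> Path:
    """Convert a PDF with PyMuPDF4LLM and write the markdown to out_dir."""
    # Imported lazily so this module still loads where the package is absent.
    import pymupdf4llm

    md_text = pymupdf4llm.to_markdown(pdf_path)
    target = markdown_path(pdf_path, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(md_text, encoding="utf-8")
    return target
```

For example, `pdf_to_markdown_file("docs/report.pdf", "out")` would write `out/report.md`; both paths are placeholders.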

PyPDF2

PyPDF2 is a basic PDF reader and writer library. It supports text extraction but lacks layout analysis, OCR, and markdown output. The PyPDF2 package name is deprecated; development continues under the name pypdf.
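
Because PyPDF2 has no OCR, scanned pages tend to come back with little or no text. The sketch below extracts text per page with `PdfReader` and flags pages that probably need an OCR-capable tool instead. The helper names and the `min_chars` threshold are assumptions chosen for illustration, not a PyPDF2 feature.

```python
def extract_page_texts(pdf_path: str) -> list[str]:
    """Return the extracted text of each page ("" when nothing is found)."""
    # Imported lazily so this module still loads where PyPDF2 is absent.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def pages_needing_ocr(page_texts: list[str], min_chars: int = 10) -> list[int]:
    """Zero-based indices of pages whose text layer is (nearly) empty."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]
```

Feeding the output of `extract_page_texts` into `pages_needing_ocr` gives a quick way to route scanned pages to a tool with OCR support.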

Markitdown

Markitdown is a lightweight tool for converting PDFs to markdown. It does not support OCR or advanced layout parsing.

Dolphin

Dolphin is a minimalistic tool for markdown extraction from PDFs. It lacks CLI, OCR, and layout analysis features.

References

Docling: https://github.com/docling-project/docling
Marker: https://github.com/datalab-to/marker
MinerU: https://github.com/opendatalab/MinerU
PyMuPDF: https://github.com/pymupdf/PyMuPDF
PyMuPDF4LLM: https://github.com/pymupdf/RAG
PyPDF2: https://github.com/py-pdf/PyPDF2
Markitdown: https://github.com/microsoft/markitdown
Dolphin: https://github.com/bytedance/Dolphin

Docling

Strengths:

Supports multiple input formats including images and HTML.
Layout analysis with a model trained on the DocLayNet dataset.
Table extraction using TableFormer.
Multilingual OCR support via Tesseract, EasyOCR, RapidOCR.
Markdown output with layout and metadata preservation.
CLI and Python API available.
Highly modular and extensible.

Weaknesses:

Requires setup of multiple dependencies.
May be overkill for simple documents.

Marker

Strengths:

Very fast parsing and markdown generation.
Surya OCR supports 90+ languages.
Runs on GPU, CPU, or Apple MPS.
CLI and Python API available.
Markdown output with reading order and layout preservation.

Weaknesses:

Less documentation than Docling.
Table structure analysis is more limited than Docling's TableFormer.
Relatively new tool with a smaller community.

MinerU

Strengths:

Strong performance on Chinese, financial, and scientific documents.
Hybrid rule-based and model-based parsing.
Good rotated table detection and header/footer removal.
Markdown output supported.
CLI available for automation.

Weaknesses:

Focused on specific domains; may not generalize well.
Limited input format support (PDF only).
Smaller community and fewer GitHub stars.

PyMuPDF / PyMuPDF4LLM

Strengths:

Fast and lightweight PDF parsing.
Markdown output supported in PyMuPDF4LLM.
Good for raw text extraction and simple documents.

Weaknesses:

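No OCR or layout understanding out of the box.
Limited input formats (PDF, EPUB, and XPS for PyMuPDF; PDF only for PyMuPDF4LLM).
No hybrid chunking or metadata preservation.
AGPL-3.0 license may be restrictive for some commercial use cases.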

PyPDF2

Strengths:

Simple and lightweight.
Good for basic text extraction from PDFs.
BSD license allows flexible use.

Weaknesses:

No OCR, layout analysis, or markdown output.
Slow performance on large documents.
Limited to PDF format.

Markitdown

Strengths:

Markdown output supported.
Open-source and lightweight.

Weaknesses:

Limited documentation and community support.
No OCR or layout analysis.
Limited input format support.

Dolphin

Strengths:

Markdown output supported.
Simple interface.

Weaknesses:

Unknown license and community size.
No OCR or layout analysis.
Limited input format support.


tldr recommendation

(as of Oct 2025)
