Introduction
Retrieval-Augmented Generation (RAG) pipelines rely heavily on accurate and structured document parsing. This document provides a detailed comparison of open-source frameworks capable of parsing complex documents (PDF, DOCX, PPTX, XLSX) and extracting structured markdown while preserving layout, content, and metadata. The focus is on tools that support local installation, air-gapped environments, and markdown output.
Premise and Requirements
- Objective: Parse complex documents and extract markdown while preserving layout, content, and metadata.
- Deployment Environment: Air-gapped, locally installed systems with no network or cloud dependencies at runtime.
- Supported Input Formats: PDF, DOCX, PPTX, XLSX.
- Output Format: Markdown with layout and metadata preservation.
- Tool Requirements:
  - Open-source license
  - Local installation (no cloud dependencies)
  - CPU-only compatibility
  - Fast parsing speed
  - OCR capabilities for scanned documents
  - CLI support for automation
  - Ease of use and documentation
  - Local models for layout and structure analysis
  - GitHub popularity (stars)
  - Hybrid chunking support for RAG pipelines
Evaluation Criteria
The frameworks are evaluated based on the following criteria:
- License: Open-source license, including any copyleft terms (for example AGPL) that may constrain commercial use.
- Input Formats: Supported document types (PDF, DOCX, PPTX, XLSX, HTML, Images).
- Output Formats: Markdown, HTML, JSON, or raw text.
- OCR Support: Ability to extract text from scanned documents using OCR engines.
- CLI Availability: Command-line interface for automation and scripting.
- Local Models: Support for locally installed models without cloud dependencies.
- Markdown Output: Capability to generate markdown preserving layout and structure.
- Hybrid Chunking: Support for layout-aware and semantic chunking.
- Speed: Relative performance in parsing and conversion.
- GitHub Stars: Community adoption and popularity.
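To make the hybrid chunking criterion concrete, here is a minimal sketch of the idea: split the markdown along its heading structure first, then cap each chunk at a size budget. It does not reproduce the chunker of any tool compared here. The function names are hypothetical, and whitespace-separated words stand in for real tokenizer counts.

```python
import re


def split_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Split markdown into (heading, body) sections at ATX headings."""
    sections = []
    heading, lines = "", []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):
            # A new heading closes the section collected so far.
            if heading or lines:
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = line.strip(), []
        else:
            lines.append(line)
    if heading or lines:
        sections.append((heading, "\n".join(lines).strip()))
    return sections


def hybrid_chunks(markdown: str, max_words: int = 50) -> list[str]:
    """Structure-first chunking with a size cap.

    Assumption: words approximate tokens. Bodies are re-joined with single
    spaces, so line breaks inside a section (for example table rows) are
    not preserved by this sketch.
    """
    chunks = []
    for heading, body in split_by_headings(markdown):
        words = body.split()
        for start in range(0, len(words), max_words):
            piece = " ".join(words[start:start + max_words])
            # Repeat the heading on every piece so each chunk keeps its context.
            chunks.append(f"{heading}\n{piece}" if heading else piece)
    return chunks
```

Repeating the heading on each piece is one simple way to keep retrieval context when a long section is split; a production chunker would also merge undersized neighbours and count tokens with the embedding model's tokenizer.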
Comparison Table
| Tool | License | Input Formats | Output Formats | OCR Support | CLI | Local Models | Markdown Output | Hybrid Chunking | Speed | GitHub Stars |
|---|---|---|---|---|---|---|---|---|---|---|
| Docling | MIT | PDF, DOCX, PPTX, XLSX, HTML, Images | Markdown, HTML, JSON | Yes (Tesseract, EasyOCR, RapidOCR) | Yes | Yes | Yes | Yes | Fast | 42.7k |
| Marker | GPL-3.0 | PDF, DOCX, PPTX, XLSX, HTML, EPUB, Images | Markdown, HTML, JSON | Yes (Surya OCR) | Yes | Yes | Yes | Yes | Very Fast | ~2k |
| MinerU | AGPL-3.0 | PDF | Markdown, JSON | Yes (PaddleOCR) | Yes | Yes | Yes | Yes | Medium | ~1k |
| PyMuPDF | AGPL-3.0 | PDF, EPUB, XPS | Raw text, JSON | No | Yes | No | No | No | Fast | 7.4k |
| PyMuPDF4LLM | AGPL-3.0 | PDF | Markdown | No | Yes | No | Yes | No | Medium | 1.1k |
| PyPDF2 | BSD | PDF | Text | No | Yes | No | No | No | Slow | 6.3k |
| Markitdown | MIT | PDF | Markdown | No | Yes | No | Yes | No | Unknown | <500 |
| Dolphin | Unknown | PDF | Markdown | No | No | No | Yes | No | Unknown | <500 |
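The table can also be read as a filter. The sketch below encodes only the yes/no capability columns from the table above and shortlists the tools that satisfy every hard requirement. The `Tool` class, field names, and `shortlist` helper are illustrative, not part of any of the compared libraries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    ocr: bool
    cli: bool
    local_models: bool
    markdown: bool
    hybrid_chunking: bool


# Values copied from the OCR, CLI, Local Models, Markdown Output,
# and Hybrid Chunking columns of the comparison table.
TOOLS = [
    Tool("Docling", True, True, True, True, True),
    Tool("Marker", True, True, True, True, True),
    Tool("MinerU", True, True, True, True, True),
    Tool("PyMuPDF", False, True, False, False, False),
    Tool("PyMuPDF4LLM", False, True, False, True, False),
    Tool("PyPDF2", False, True, False, False, False),
    Tool("Markitdown", False, True, False, True, False),
    Tool("Dolphin", False, False, False, True, False),
]

REQUIRED = ("ocr", "cli", "local_models", "markdown", "hybrid_chunking")


def shortlist(tools=TOOLS, required=REQUIRED) -> list[str]:
    """Return the names of tools that have every required capability."""
    return [t.name for t in tools if all(getattr(t, f) for f in required)]


print(shortlist())  # ['Docling', 'Marker', 'MinerU']
```

Relaxing `required` (for example to `("markdown",)`) shows which lighter tools remain usable when OCR and layout models are not needed.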
Framework Descriptions
Docling
Docling is a comprehensive document parsing framework developed by IBM Research and hosted by the LF AI & Data Foundation. It supports multiple input formats and uses a layout analysis model trained on the DocLayNet dataset, together with TableFormer for table structure extraction. It includes OCR support via Tesseract, EasyOCR, and RapidOCR. Its MIT license, locally runnable models, and broad format coverage make it a strong fit for RAG pipelines in air-gapped environments.
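As a rough illustration of local use, the sketch below wraps Docling's Python API in a helper. The helper name `convert_to_markdown` is hypothetical. The `artifacts_path` option is used on the assumption that the model files were downloaded in advance and copied to the offline machine. Check the calls against the Docling version you install.

```python
from typing import Optional


def convert_to_markdown(source: str, artifacts_path: Optional[str] = None) -> str:
    """Convert a document to markdown with Docling.

    artifacts_path is assumed to point at a local directory of
    pre-downloaded models, so no download is attempted at runtime.
    """
    # Imported lazily so this module loads even where Docling is not installed.
    from docling.document_converter import DocumentConverter

    if artifacts_path is None:
        converter = DocumentConverter()
    else:
        from docling.datamodel.base_models import InputFormat
        from docling.datamodel.pipeline_options import PdfPipelineOptions
        from docling.document_converter import PdfFormatOption

        # Point the PDF pipeline at the local model directory.
        options = PdfPipelineOptions(artifacts_path=artifacts_path)
        converter = DocumentConverter(
            format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
        )

    result = converter.convert(source)
    return result.document.export_to_markdown()
```

A call such as `convert_to_markdown("report.pdf", artifacts_path="/opt/docling-models")` would then return the markdown string; both paths are placeholders.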
Marker
Marker is a fast and flexible parser that uses Surya OCR for multilingual text extraction. It supports a wide range of input formats and outputs structured markdown. Marker is optimized for speed and supports GPU, CPU, and Apple MPS acceleration. It is suitable for lightweight deployments and multilingual document processing.
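For comparison, a minimal sketch of Marker's Python entry point follows, based on the usage pattern in its README. The wrapper name `marker_pdf_to_markdown` is hypothetical, and the import paths may differ between Marker releases, so treat this as an assumption to verify locally.

```python
def marker_pdf_to_markdown(pdf_path: str) -> str:
    """Convert a PDF to markdown with Marker and return the text."""
    # Imported lazily so this module loads even where Marker is not installed.
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict
    from marker.output import text_from_rendered

    # create_model_dict() loads the layout and OCR models; in an air-gapped
    # setup they must already be present in the local model cache.
    converter = PdfConverter(artifact_dict=create_model_dict())
    rendered = converter(pdf_path)

    # text_from_rendered returns the text plus metadata and extracted images;
    # only the markdown text is kept here.
    text, _, _images = text_from_rendered(rendered)
    return text
```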
MinerU
MinerU specializes in parsing Chinese, scientific, and financial documents. It uses PaddleOCR and hybrid rule-based models for accurate layout and table extraction. MinerU is effective in handling rotated tables and preserving document structure in markdown.
PyMuPDF
PyMuPDF is a low-level PDF parsing library that provides fast text extraction but lacks OCR and layout understanding. It is suitable for simple text extraction tasks.
PyMuPDF4LLM
An extension of PyMuPDF, PyMuPDF4LLM adds markdown output capabilities but does not include advanced layout features or OCR support.
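For born-digital PDFs where OCR is not needed, the PyMuPDF4LLM path is short. The sketch below uses `pymupdf4llm.to_markdown`; the helper names and file paths are illustrative.

```python
from pathlib import Path


def pdf_to_markdown(pdf_path: str) -> str:
    """Extract markdown from a PDF using PyMuPDF4LLM (no OCR involved)."""
    # Imported lazily so this module loads even where the package is missing.
    import pymupdf4llm

    return pymupdf4llm.to_markdown(pdf_path)


def write_markdown(markdown_text: str, out_path: str) -> Path:
    """Write markdown text to disk as UTF-8 and return the output path."""
    path = Path(out_path)
    path.write_text(markdown_text, encoding="utf-8")
    return path
```

A typical call would be `write_markdown(pdf_to_markdown("input.pdf"), "input.md")`.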
PyPDF2
PyPDF2 is a basic PDF reader and writer library. It supports text extraction but lacks layout analysis, OCR, and markdown output. The PyPDF2 name is deprecated; development continues under the name pypdf.
Markitdown
Markitdown is a lightweight tool for converting PDFs to markdown. It does not support OCR or advanced layout parsing.
Dolphin
Dolphin is a minimalistic tool for markdown extraction from PDFs. It lacks CLI, OCR, and layout analysis features.
References
- Docling: https://github.com/docling-project/docling
- Marker: https://github.com/datalab-to/marker
- MinerU: https://github.com/opendatalab/MinerU
- PyMuPDF: https://github.com/pymupdf/PyMuPDF
- PyMuPDF4LLM: https://github.com/pymupdf/RAG
- PyPDF2: https://github.com/py-pdf/PyPDF2
- Markitdown: https://github.com/microsoft/markitdown
- Dolphin: https://github.com/bytedance/Dolphin
Strengths and Weaknesses
Docling
Strengths:
- Supports multiple input formats including images and HTML.
- Advanced layout analysis using a model trained on DocLayNet.
- Table extraction using TableFormer.
- Multilingual OCR support via Tesseract, EasyOCR, RapidOCR.
- Markdown output with layout and metadata preservation.
- CLI and Python API available.
- Highly modular and extensible.
Weaknesses:
- Requires setup of multiple dependencies.
- May be overkill for simple documents.
Marker
Strengths:
- Very fast parsing and markdown generation.
- Surya OCR supports 90+ languages.
- Supports GPU, CPU, and Apple MPS acceleration.
- CLI and Python API available.
- Markdown output with reading order and layout preservation.
Weaknesses:
- Less documentation compared to Docling.
- Limited table structure analysis compared to TableFormer.
- Relatively newer tool with smaller community.
MinerU
Strengths:
- Strong performance on Chinese, financial, and scientific documents.
- Hybrid rule-based and model-based parsing.
- Good rotated table detection and header/footer removal.
- Markdown output supported.
- CLI available for automation.
Weaknesses:
- Focused on specific domains; may not generalize well.
- Limited input format support (PDF only).
- Smaller community and fewer GitHub stars.
PyMuPDF / PyMuPDF4LLM
Strengths:
- Fast and lightweight PDF parsing.
- Markdown output supported in PyMuPDF4LLM.
- Good for raw text extraction and simple documents.
Weaknesses:
- No OCR or layout understanding out-of-the-box.
- Limited to PDF format.
- No hybrid chunking or metadata preservation.
- AGPL-3.0 license may be restrictive for some commercial use cases.
PyPDF2
Strengths:
- Simple and lightweight.
- Good for basic text extraction from PDFs.
- BSD license allows flexible use.
Weaknesses:
- No OCR, layout analysis, or markdown output.
- Slow performance on large documents.
- Limited to PDF format.
Markitdown
Strengths:
- Markdown output supported.
- Open-source and lightweight.
Weaknesses:
- Limited documentation and community support.
- No OCR or layout analysis.
- Limited input format support.
Dolphin
Strengths:
- Markdown output supported.
- Simple interface.
Weaknesses:
- Unknown license and community size.
- No OCR or layout analysis.
- Limited input format support.
