Ashok Nagaraj

From PDFs to Markdown

Introduction

Retrieval-Augmented Generation (RAG) pipelines rely heavily on accurate and structured document parsing. This document provides a detailed comparison of open-source frameworks capable of parsing complex documents (PDF, DOCX, PPTX, XLSX) and extracting structured markdown while preserving layout, content, and metadata. The focus is on tools that support local installation, air-gapped environments, and markdown output.

Premise and Requirements

  • Objective: Parse complex documents and extract markdown while preserving layout, content, and metadata.
  • Deployment Environment: Air-gapped, locally installed systems with no network access or cloud service dependencies at runtime.
  • Supported Input Formats: PDF, DOCX, PPTX, XLSX.
  • Output Format: Markdown with layout and metadata preservation.
  • Tool Requirements:
    • Open-source license
    • Local installation (no cloud dependencies)
    • CPU-only compatibility
    • Fast parsing speed
    • OCR capabilities for scanned documents
    • CLI support for automation
    • Ease of use and documentation
    • Local models for layout and structure analysis
    • GitHub popularity (stars)
    • Hybrid chunking support for RAG pipelines

Evaluation Criteria

The frameworks are evaluated based on the following criteria:

  • License: Open-source license, including any copyleft or commercial-use restrictions it carries.
  • Input Formats: Supported document types (PDF, DOCX, PPTX, XLSX, HTML, Images).
  • Output Formats: Markdown, HTML, JSON, or raw text.
  • OCR Support: Ability to extract text from scanned documents using OCR engines.
  • CLI Availability: Command-line interface for automation and scripting.
  • Local Models: Support for locally installed models without cloud dependencies.
  • Markdown Output: Capability to generate markdown preserving layout and structure.
  • Hybrid Chunking: Support for layout-aware and semantic chunking.
  • Speed: Relative performance in parsing and conversion.
  • GitHub Stars: Community adoption and popularity.
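
To make the Hybrid Chunking criterion concrete, here is a minimal sketch of the idea, not the implementation used by any tool in this comparison. It assumes the parser has already produced markdown with `#` headings, splits on those headings first (the layout-aware pass), and then caps each section at a size budget (the size pass). The names `Chunk`, `hybrid_chunks`, and the `max_words` budget are hypothetical, and a word count stands in for a real tokenizer.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    heading: str  # nearest heading above the text ("" if none)
    text: str


def split_by_headings(markdown: str) -> list[Chunk]:
    """Structure pass: one chunk per heading-delimited section."""
    chunks, heading, lines = [], "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section if it holds any non-blank text.
            if any(l.strip() for l in lines):
                chunks.append(Chunk(heading, "\n".join(lines).strip()))
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        chunks.append(Chunk(heading, "\n".join(lines).strip()))
    return chunks


def cap_size(chunk: Chunk, max_words: int) -> list[Chunk]:
    """Size pass: split an oversized section into windows of max_words.

    Oversized sections are re-joined with single spaces, so their
    original line breaks are lost; sections that fit are kept verbatim.
    """
    words = chunk.text.split()
    if len(words) <= max_words:
        return [chunk]
    return [
        Chunk(chunk.heading, " ".join(words[i:i + max_words]))
        for i in range(0, len(words), max_words)
    ]


def hybrid_chunks(markdown: str, max_words: int = 200) -> list[Chunk]:
    """Headings first, then a size cap; every piece keeps its heading."""
    return [
        piece
        for section in split_by_headings(markdown)
        for piece in cap_size(section, max_words)
    ]
```

Carrying the heading along with every piece is the point of the exercise: a retriever that returns a fragment from the middle of a long section still knows which section it came from.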

Comparison Table

| Tool | License | Input Formats | Output Formats | OCR Support | CLI | Local Models | Markdown Output | Hybrid Chunking | Speed | GitHub Stars |
|---|---|---|---|---|---|---|---|---|---|---|
| Docling | MIT | PDF, DOCX, PPTX, XLSX, HTML, Images | Markdown, HTML, JSON | Yes (Tesseract, EasyOCR, RapidOCR) | Yes | Yes | Yes | Yes | Fast | 42.7k |
| Marker | Apache | PDF, DOCX, PPTX, XLSX, HTML, EPUB, Images | Markdown, HTML, JSON | Yes (Surya OCR) | Yes | Yes | Yes | Yes | Very Fast | ~2k |
| MinerU | Apache | PDF | Markdown, JSON | Yes (PaddleOCR) | Yes | Yes | Yes | Yes | Medium | ~1k |
| PyMuPDF | AGPL-3.0 | PDF, EPUB, XPS | Raw text, JSON | No | Yes | No | No | No | Fast | 7.4k |
| PyMuPDF4LLM | AGPL-3.0 | PDF | Markdown | No | Yes | No | Yes | No | Medium | 1.1k |
| PyPDF2 | BSD | PDF | Text | No | Yes | No | No | No | Slow | 6.3k |
| Markitdown | MIT | PDF | Markdown | No | Yes | No | Yes | No | Unknown | <500 |
| Dolphin | Unknown | PDF | Markdown | No | No | No | Yes | No | Unknown | <500 |
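
The comparison table can also be read as a filter. The sketch below transcribes the relevant columns of the table as data and keeps only the tools that meet the hard requirements from the premise (all four input formats, OCR, CLI, local models, markdown output, hybrid chunking). The `Tool` class and `shortlist` function are illustrative names, and the values are only as accurate as the table they are copied from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    inputs: frozenset
    ocr: bool
    cli: bool
    local_models: bool
    markdown: bool
    hybrid_chunking: bool


def tool(name, inputs, ocr, cli, local_models, markdown, hybrid):
    """Build a Tool from one table row; inputs is the comma-separated cell."""
    return Tool(name, frozenset(inputs.split(", ")),
                ocr, cli, local_models, markdown, hybrid)


# Columns: inputs, OCR, CLI, local models, markdown output, hybrid chunking.
TOOLS = [
    tool("Docling", "PDF, DOCX, PPTX, XLSX, HTML, Images", True, True, True, True, True),
    tool("Marker", "PDF, DOCX, PPTX, XLSX, HTML, EPUB, Images", True, True, True, True, True),
    tool("MinerU", "PDF", True, True, True, True, True),
    tool("PyMuPDF", "PDF, EPUB, XPS", False, True, False, False, False),
    tool("PyMuPDF4LLM", "PDF", False, True, False, True, False),
    tool("PyPDF2", "PDF", False, True, False, False, False),
    tool("Markitdown", "PDF", False, True, False, True, False),
    tool("Dolphin", "PDF", False, False, False, True, False),
]

REQUIRED_INPUTS = frozenset({"PDF", "DOCX", "PPTX", "XLSX"})


def shortlist(tools, required_inputs=REQUIRED_INPUTS):
    """Names of tools that satisfy every hard requirement."""
    return [
        t.name
        for t in tools
        if required_inputs <= t.inputs
        and t.ocr and t.cli and t.local_models
        and t.markdown and t.hybrid_chunking
    ]
```

With the full set of required formats, `shortlist(TOOLS)` leaves Docling and Marker; relaxing the requirement to PDF only, via `shortlist(TOOLS, frozenset({"PDF"}))`, also admits MinerU.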

Framework Descriptions

Docling

Docling is a comprehensive document parsing framework developed by IBM Research and hosted by the LF AI & Data Foundation. It supports multiple input formats and uses a layout analysis model trained on the DocLayNet dataset, plus TableFormer for table structure extraction. It includes OCR support via Tesseract, EasyOCR, and RapidOCR. Docling is well suited to enterprise-grade RAG pipelines in air-gapped environments.
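
As a rough illustration of the Python side, the sketch below converts one file to markdown with Docling's `DocumentConverter` and writes the result next to a chosen output directory. The helper names `markdown_path_for` and `convert_to_markdown` are hypothetical, the converter is used with its default options, and in an air-gapped setup the model weights are assumed to be available locally already.

```python
from pathlib import Path


def markdown_path_for(source: Path, out_dir: Path) -> Path:
    """Map e.g. reports/q3.pdf to <out_dir>/q3.md."""
    return out_dir / (source.stem + ".md")


def convert_to_markdown(source: Path, out_dir: Path) -> Path:
    """Convert one document with Docling and write the markdown to disk."""
    # Imported lazily so the path helper works without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # default pipeline options
    result = converter.convert(str(source))

    target = markdown_path_for(source, out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target.write_text(result.document.export_to_markdown(), encoding="utf-8")
    return target
```

A call such as `convert_to_markdown(Path("reports/q3.pdf"), Path("out"))` would then produce `out/q3.md`, provided Docling and its models are installed.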

Marker

Marker is a fast and flexible parser that uses Surya OCR for multilingual text extraction. It supports a wide range of input formats and outputs structured markdown. Marker is optimized for speed and supports GPU, CPU, and Apple MPS acceleration. It is suitable for lightweight deployments and multilingual document processing.

MinerU

MinerU specializes in parsing Chinese, scientific, and financial documents. It uses PaddleOCR and hybrid rule-based models for accurate layout and table extraction. MinerU is effective in handling rotated tables and preserving document structure in markdown.

PyMuPDF

PyMuPDF is a low-level PDF parsing library that provides fast text extraction but lacks OCR and layout understanding. It is suitable for simple text extraction tasks.
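
One way to decide whether a PDF counts as "simple" is to look at how much text plain extraction returns per page: scanned pages typically yield little or none. The sketch below is a heuristic under that assumption, not a feature of PyMuPDF itself. The function names and both thresholds (`min_chars_per_page`, `max_sparse_ratio`) are hypothetical and would need tuning on real documents.

```python
def needs_ocr(page_texts, min_chars_per_page=50, max_sparse_ratio=0.5):
    """Return True when most pages have too little extractable text.

    page_texts is a list with one extracted-text string per page.
    A document with no pages is reported as not needing OCR.
    """
    if not page_texts:
        return False
    sparse = sum(1 for text in page_texts
                 if len(text.strip()) < min_chars_per_page)
    return sparse / len(page_texts) > max_sparse_ratio


def extract_page_texts(pdf_path):
    """Plain text per page via PyMuPDF (no OCR, no layout analysis)."""
    # Imported lazily so needs_ocr can be used without PyMuPDF installed.
    import fitz  # PyMuPDF's import name

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

Documents for which `needs_ocr(extract_page_texts(path))` is false can stay on the fast PyMuPDF path, while the rest are routed to one of the OCR-capable tools.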

PyMuPDF4LLM

An extension of PyMuPDF, PyMuPDF4LLM adds markdown output capabilities but does not include advanced layout features or OCR support.

PyPDF2

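PyPDF2 is a basic PDF reader and writer library. It supports text extraction but lacks layout analysis, OCR, and markdown output. The project has since been folded back into pypdf, so new work should start there rather than on PyPDF2.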

Markitdown

Markitdown is a lightweight tool for converting PDFs to markdown. It does not support OCR or advanced layout parsing.

Dolphin

Dolphin is a minimalistic tool for markdown extraction from PDFs. It lacks CLI, OCR, and layout analysis features.


Strengths and Weaknesses


Docling

Strengths:

  • Supports multiple input formats including images and HTML.
  • Advanced layout analysis using a model trained on DocLayNet.
  • Table extraction using TableFormer.
  • Multilingual OCR support via Tesseract, EasyOCR, RapidOCR.
  • Markdown output with layout and metadata preservation.
  • CLI and Python API available.
  • Highly modular and extensible.
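
For the CLI route, batch conversion mostly comes down to building one command per file. The sketch below only constructs the argument lists and leaves execution to the caller. The `--to`, `--output`, and `--artifacts-path` flags reflect my understanding of the Docling CLI and should be checked against `docling --help` for the installed version; `docling_command`, `batch_commands`, and the `SUPPORTED` set are hypothetical names.

```python
from pathlib import Path
from typing import Optional

# Input formats from the requirements list above.
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx"}


def docling_command(source: Path, out_dir: Path,
                    artifacts_path: Optional[Path] = None) -> list:
    """Argument list for converting one file to markdown.

    artifacts_path points at pre-downloaded models (assumed flag name),
    which is what an air-gapped host would need.
    """
    cmd = ["docling", str(source), "--to", "md", "--output", str(out_dir)]
    if artifacts_path is not None:
        cmd += ["--artifacts-path", str(artifacts_path)]
    return cmd


def batch_commands(paths, out_dir: Path,
                   artifacts_path: Optional[Path] = None) -> list:
    """One command per supported file, in sorted order; others are skipped."""
    return [
        docling_command(path, out_dir, artifacts_path)
        for path in sorted(paths)
        if path.suffix.lower() in SUPPORTED
    ]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)`, which keeps the conversion step easy to log, retry, or parallelize from a scheduler.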

Weaknesses:

  • Requires setup of multiple dependencies.
  • May be overkill for simple documents.

Marker

Strengths:

  • Very fast parsing and markdown generation.
  • Surya OCR supports 90+ languages.
  • Supports GPU, CPU, and Apple MPS acceleration.
  • CLI and Python API available.
  • Markdown output with reading order and layout preservation.

Weaknesses:

  • Less documentation compared to Docling.
  • Limited table structure analysis compared to TableFormer.
  • Relatively new tool with a smaller community.

MinerU

Strengths:

  • Strong performance on Chinese, financial, and scientific documents.
  • Hybrid rule-based and model-based parsing.
  • Good rotated table detection and header/footer removal.
  • Markdown output supported.
  • CLI available for automation.

Weaknesses:

  • Focused on specific domains; may not generalize well.
  • Limited input format support (PDF only).
  • Smaller community and fewer GitHub stars.

PyMuPDF / PyMuPDF4LLM

Strengths:

  • Fast and lightweight PDF parsing.
  • Markdown output supported in PyMuPDF4LLM.
  • Good for raw text extraction and simple documents.

Weaknesses:

  • No OCR or layout understanding out of the box.
  • Limited to PDF format.
  • No hybrid chunking or metadata preservation.
  • AGPL-3.0 license may be restrictive for some commercial use cases.

PyPDF2

Strengths:

  • Simple and lightweight.
  • Good for basic text extraction from PDFs.
  • BSD license allows flexible use.

Weaknesses:

  • No OCR, layout analysis, or markdown output.
  • Slow performance on large documents.
  • Limited to PDF format.

Markitdown

Strengths:

  • Markdown output supported.
  • Open-source and lightweight.

Weaknesses:

  • Limited documentation and community support.
  • No OCR or layout analysis.
  • Limited input format support.

Dolphin

Strengths:

  • Markdown output supported.
  • Simple interface.

Weaknesses:

  • Unknown license and community size.
  • No OCR or layout analysis.
  • Limited input format support.

TL;DR Recommendation

(as of Oct 2025)

[Flowchart: tool recommendation]
