Nameet Potnis

Posted on • Originally published at pdfmux.com

pdfmux vs LlamaParse vs Docling vs Unstructured: Which PDF extractor for RAG in 2026?

TL;DR: For RAG pipelines in 2026, pick pdfmux if you need free, local, benchmark-proven extraction with per-page confidence scoring (0.905 on opendataloader-bench, #2 overall). Pick LlamaParse if you process under 1,000 pages/day and your documents are non-sensitive — its free tier and complex-layout accuracy are hard to beat. Pick Docling if your documents are 90% tables and you want IBM-backed transformer extraction. Pick Unstructured if you ingest 25+ file formats beyond PDF and want a managed enterprise pipeline. Most teams should default to pdfmux.

The 4 tools at a glance

| Capability | pdfmux | LlamaParse | Docling | Unstructured |
|---|---|---|---|---|
| License | MIT | Closed (cloud only) | MIT | Apache 2.0 (OSS) / Commercial (API) |
| Pricing | $0/page | $0.003/page (std) – $0.01/page (premium) | $0/page | $0/page (OSS) – $1/1k pages (API) |
| Install size | ~20 MB base | API only (no install) | ~500 MB (ML models) | ~2 GB (full deps) |
| GPU required | No | No (cloud-side) | Optional | Optional |
| opendataloader-bench (overall) | 0.905 | not published | 0.877 | not on bench |
| Reading order (NID) | 0.920 | not published | 0.900 | not on bench |
| Tables (TEDS) | 0.911 | not published | 0.911 | not on bench |
| Headings (MHS) | 0.852 | not published | 0.802 | not on bench |
| MCP server (Claude/Cursor) | Yes (built-in) | No | No | No |
| LangChain native loader | Yes | Yes (via LlamaIndex bridge) | Yes (DoclingLoader) | Yes (UnstructuredFileLoader) |
| BYOK LLM fallback | Yes (Gemini, Claude, GPT-4o, Ollama) | No (proprietary stack) | No | Yes (in API) |
| Offline / air-gapped | Yes | No | Yes | Yes (OSS only) |
| Per-page confidence score | Yes (0.0–1.0) | No | No | No |
| Self-healing re-extraction | Yes | No | No | No |

For the wider PDF extractor landscape including OpenDataLoader, marker, MinerU, and MarkItDown, see the full 2026 comparison.

Benchmark results: 200 PDFs, head-to-head

We tested on opendataloader-bench — 200 real-world PDFs covering financial filings, academic papers, legal contracts, scanned forms, and government documents. Three metrics:

  • NID (Reading Order) — fuzzy string match against the document's true reading order
  • TEDS (Table Accuracy) — tree edit distance on extracted table HTML
  • MHS (Heading Structure) — tree edit distance on the heading hierarchy
| Tool | Overall | NID | TEDS | MHS | Cost/1k pages | Bench inclusion |
|---|---|---|---|---|---|---|
| hybrid AI (paid) | 0.909 | 0.935 | 0.928 | 0.828 | ~$10 | Yes |
| pdfmux 1.5.1 | 0.905 | 0.920 | 0.911 | 0.852 | $0 | Yes |
| Docling 2.x | 0.877 | 0.900 | 0.911 | 0.802 | $0 | Yes |
| LlamaParse standard | not published | n/a | n/a | n/a | $3 | Cloud-only, not on bench |
| LlamaParse premium | not published | n/a | n/a | n/a | $10 | Cloud-only, not on bench |
| Unstructured (OSS) | not published | n/a | n/a | n/a | $0 | Not on bench |
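NID in the table is a normalized edit distance between the extracted text and the ground-truth reading order. As a rough illustration only (not the benchmark's actual implementation), Python's `difflib` produces a similar fuzzy-match signal:

```python
from difflib import SequenceMatcher

def reading_order_similarity(extracted: str, ground_truth: str) -> float:
    """Fuzzy similarity in [0, 1], a stand-in for an NID-style reading-order score."""
    return SequenceMatcher(None, extracted, ground_truth).ratio()

truth = "Revenue grew 12% year over year. See Table 3 for segment detail."
scrambled = "See Table 3 for segment detail. Revenue grew 12% year over year."

print(reading_order_similarity(truth, truth))      # 1.0, perfect reading order
print(reading_order_similarity(scrambled, truth))  # below 1.0, columns read out of order
```

TEDS and MHS apply the same idea to tree structures (table HTML and heading hierarchies) rather than flat strings.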

Key data points:

  1. pdfmux 0.905 overall is 0.4 points behind the paid hybrid AI #1 (0.909) — and it costs nothing per page.
  2. pdfmux beats Docling by 2.8 points overall (0.905 vs 0.877).
  3. pdfmux has the best heading detection of any extractor on the benchmark — paid or free (0.852 MHS vs 0.828 for the paid leader).
  4. pdfmux ties Docling on table accuracy (0.911 TEDS) but wins on reading order (+2.0 points NID) and headings (+5.0 points MHS).
  5. LlamaParse claims ~92% accuracy on its internal eval mix, but has not published opendataloader-bench scores.
  6. Unstructured does not benchmark against opendataloader-bench publicly — its accuracy claims are based on internal evaluation against its own corpus.
  7. pdfmux v1.5.0 lifted TEDS from 0.887 to 0.911 (+2.4 points) by adding image-table OCR with spatial clustering (CHANGELOG).
  8. pdfmux 1.5.0 lifted MHS from 0.844 to 0.852 via an ML heading classifier (sklearn GradientBoosting, 212 KB).

For the full benchmark methodology and per-document score deltas, see the pdfmux benchmark deep dive.

When to use each (decision matrix)

Use pdfmux when:

  • Monthly volume exceeds 20,000 pages (cost crossover vs LlamaParse standard)
  • Documents are privileged, regulated, or subject to data residency rules (HIPAA, GDPR, UAE PDPL, FADP)
  • You need per-page confidence scoring for downstream conditional logic
  • You want self-healing extraction (auto-retry on bad pages with a different backend)
  • You're shipping an MCP-enabled agent (Claude Desktop, Cursor) that needs PDF reading
  • You need a single CPU-only pip install that handles digital + scanned + table-heavy PDFs

Use LlamaParse when:

  • Volume stays under 1,000 pages/day (free tier is genuinely free)
  • Documents are non-sensitive (no contracts, no PHI, no regulated data)
  • You're already deep in the LlamaIndex framework and want native integration
  • You need maximum accuracy on dense multi-column academic preprints (premium mode runs GPT-4V on every page)
  • You want zero infrastructure — no servers, no Docker, no Java runtime

Use Docling when:

  • Your corpus is 90%+ tables (financial statements, scientific data, government filings)
  • You want IBM-backed open source with predictable enterprise support
  • You need ML-grade table extraction in a single library (no orchestration layer)
  • ~500 MB install size and 30–60 second cold-start are acceptable

Use Unstructured when:

  • You ingest 25+ file formats: PDF + DOCX + PPTX + HTML + EPUB + EML + images
  • You need a managed pipeline with cleaning, chunking, and metadata in one API
  • Your team prefers a hosted API ($1/1k pages) and the privacy tradeoff is acceptable
  • You're building a generic enterprise document pipeline, not a PDF-specific RAG system
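Collapsed into code, the matrix above amounts to a handful of guard clauses. This function is only a summary of the four lists, with thresholds taken directly from them; it is not a recommendation engine:

```python
def pick_extractor(pages_per_day: int, sensitive: bool,
                   mostly_tables: bool, many_formats: bool) -> str:
    """Condense the decision matrix into guard clauses (summary of the lists above)."""
    if many_formats:
        return "unstructured"   # ingesting 25+ formats beyond PDF
    if mostly_tables:
        return "docling"        # 90%+ table corpus
    if not sensitive and pages_per_day < 1_000:
        return "llamaparse"     # the free tier covers it
    return "pdfmux"             # regulated, high-volume, or the default

print(pick_extractor(500, sensitive=False, mostly_tables=False, many_formats=False))
# llamaparse
print(pick_extractor(5_000, sensitive=True, mostly_tables=False, many_formats=False))
# pdfmux
```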

Code: same input, all 4 tools

The same financial report (10-K filing, 47 pages, 18 tables, 3 scanned signature pages) extracted four ways:

pdfmux

```python
import pdfmux

# auto-routes per page: PyMuPDF for digital, Docling for tables,
# RapidOCR for scanned, LLM fallback if configured
result = pdfmux.process("10-K-2025.pdf", quality="standard")

print(result.text)         # clean Markdown
print(result.confidence)   # 0.94 — per-document average (0.0–1.0)
print(result.warnings)     # ["Page 41: low text density, re-extracted with OCR"]

# flag low-confidence pages for review
bad = [p for p in result.pages if p.confidence < 0.7]
```

LlamaParse

```python
from llama_parse import LlamaParse

parser = LlamaParse(api_key="llx-...", result_type="markdown")
docs = parser.load_data("10-K-2025.pdf")
text = docs[0].text

# premium mode for complex layouts (10x cost)
parser_premium = LlamaParse(api_key="llx-...", premium_mode=True)
docs = parser_premium.load_data("10-K-2025.pdf")
```

Docling

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("10-K-2025.pdf")

text = result.document.export_to_markdown()
# tables are first-class via result.document.tables
```

Unstructured

```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="10-K-2025.pdf",
    strategy="hi_res",           # ML layout detection (slow on CPU)
    infer_table_structure=True,
)
text = "\n\n".join(str(el) for el in elements)
```

Same input, four very different profiles: pdfmux returns a confidence-scored result; LlamaParse requires an API key and ships the document to a third-party server; Docling returns a structured document model; Unstructured returns typed elements (Title, NarrativeText, Table) — its hi_res strategy is GPU-friendly for a reason.
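The per-page routing that the pdfmux snippet's comments describe can be sketched as a toy dispatcher. The thresholds, statistics, and backend names below are illustrative assumptions, not pdfmux internals:

```python
def route_page(char_count: int, image_area_ratio: float, table_lines: int) -> str:
    """Pick an extraction backend from cheap page statistics (illustrative heuristics)."""
    if char_count < 50 and image_area_ratio > 0.5:
        return "ocr"       # almost no text layer, mostly image: likely a scanned page
    if table_lines > 10:
        return "table"     # dense ruling lines suggest table extraction is worth it
    return "digital"       # ordinary text layer: take the fast path

print(route_page(char_count=12, image_area_ratio=0.9, table_lines=0))    # ocr
print(route_page(char_count=2400, image_area_ratio=0.1, table_lines=30)) # table
print(route_page(char_count=1800, image_area_ratio=0.0, table_lines=2))  # digital
```

The payoff of routing is that the expensive backends (OCR, ML table models) only ever see the pages that need them.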

For end-to-end RAG patterns across all four, see PDF extraction for RAG pipelines.

The honest take per tool

pdfmux: best free option, best for regulated and high-volume

Strengths

  • #2 on opendataloader-bench at 0.905 — within 0.4 points of the paid leader
  • Best-in-class heading detection (0.852 MHS) — beats every paid and free extractor
  • Per-page confidence scoring (0.0–1.0) — only tool that tells you which pages to trust
  • Self-healing pipeline — auto-retries failed pages with a different backend
  • Built-in MCP server — give Claude / Cursor / Claude Desktop reliable local PDF reading
  • 20 MB base install, CPU-only — works in Lambda, small containers, air-gapped environments
  • MIT licensed — no AGPL contamination at the application layer
  • BYOK LLM fallback — bring your own Gemini, Claude, GPT-4o, or Ollama for the hardest pages

Weaknesses

  • 2–4 point gap on dense multi-column academic preprints vs LlamaParse premium mode
  • No async REST API out of the box — sync Python API or CLI, run your own queue
  • Optional dependencies for full coverage — pdfmux[tables] adds Docling (~500 MB), pdfmux[ocr] adds RapidOCR (~200 MB)
  • No managed cloud offering — you run the infrastructure (which is also the point for regulated industries)
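The self-healing behavior listed under strengths, retrying a low-confidence page with a different backend, is a fallback-chain pattern. This sketch uses stub extractors and an invented signature, not pdfmux's real API:

```python
from typing import Callable

Extractor = Callable[[bytes], tuple[str, float]]  # returns (text, confidence)

def extract_with_fallback(page: bytes, backends: list[tuple[str, Extractor]],
                          threshold: float = 0.7) -> tuple[str, str, float]:
    """Try backends in order; keep the first result at or above the threshold."""
    best = ("", "", 0.0)
    for name, extract in backends:
        text, confidence = extract(page)
        if confidence >= threshold:
            return name, text, confidence
        if confidence > best[2]:
            best = (name, text, confidence)
    return best  # nothing cleared the bar: return the best attempt

# Stub backends standing in for a fast text-layer pass and an OCR pass.
fast = lambda page: ("", 0.1)                 # empty text layer: low confidence
ocr = lambda page: ("Signed: J. Doe", 0.9)    # OCR recovers the scanned content

name, text, conf = extract_with_fallback(b"...", [("fast", fast), ("ocr", ocr)])
print(name, conf)  # ocr 0.9
```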

LlamaParse: best low-volume cloud, best for LlamaIndex stacks

Strengths

  • 1,000 pages/day free tier — genuinely free for prototyping and small production workloads
  • Premium mode runs multimodal LLM inference per page — best reading-order recovery on complex layouts
  • Zero infrastructure — REST API, no servers, no models to download
  • Native LlamaIndex integration — drop-in for existing LlamaIndex RAG pipelines
  • 10,000 pages per call — handles long documents in a single request
  • Async REST API — easy to integrate into job queues

Weaknesses

  • Closed source — no benchmark transparency, no reproducibility
  • Cloud only — every document leaves your infrastructure
  • No per-page confidence signals — opaque output, no way to flag bad pages without manual inspection
  • Cost scales linearly with volume — $300/month at 100k pages standard, $1,000/month at 100k pages premium
  • Privacy and data residency — HIPAA needs a BAA; GDPR cross-border restrictions and attorney-client privilege are real concerns
  • Vendor lock-in — proprietary pipeline you cannot self-host or audit

For the full pdfmux vs LlamaParse breakdown including a cost crossover analysis at 50k / 100k / 250k / 500k / 1M pages per month, see pdfmux vs LlamaParse: accuracy, cost, and privacy compared.

Docling: best pure-OSS table extractor

Strengths

  • 0.911 TEDS table accuracy — ties pdfmux for the best free table extraction on the benchmark
  • IBM-backed open source — Apache 2.0, predictable governance, enterprise-friendly
  • Transformer-based document model — first-class structure (paragraphs, lists, tables, figures) not just text
  • LangChain integration — DoclingLoader is a one-liner

Weaknesses

  • 0.877 overall — 2.8 points behind pdfmux on the same benchmark
  • 0.802 MHS heading score — 5.0 points behind pdfmux
  • ~500 MB install — ML models pulled on first run
  • Slow per page — 0.3–3s/page vs pdfmux 0.05s/page on digital text
  • No quality auditing — single extraction pass, no confidence signal
  • No OCR for scanned pages out of the box — need to add a separate OCR layer

For a per-tool deep dive including marker and pymupdf4llm, see pdfmux vs PyMuPDF vs marker vs Docling.

Unstructured: best multi-format ingestion, weakest for pure PDF accuracy

Strengths

  • 25+ file formats — PDF, DOCX, PPTX, HTML, EPUB, EML, MSG, JPG, PNG, TXT, MD, RTF, ODT, CSV, TSV, XML
  • Element-typed output — every chunk has a type (Title, NarrativeText, Table, Image) for downstream filtering
  • Hosted API at $1/1k pages — one-third of LlamaParse standard pricing
  • Open-source core (Apache 2.0) — self-hostable for sensitive workloads
  • Mature chunking strategies — chunk_by_title, chunk_by_similarity built in

Weaknesses

  • Not on opendataloader-bench — accuracy claims based on internal evaluation only
  • hi_res strategy is heavy — pulls layout-detection models, slow on CPU, recommended GPU
  • ~2 GB full install — every backend (detectron2, paddleocr, etc.) is optional but expected
  • Generalist over specialist — strong at format coverage, weaker at PDF-specific edge cases (multi-column flow, footnotes, equations)
  • No per-page confidence scoring — you cannot programmatically detect bad pages
  • Hosted API has the same privacy profile as LlamaParse — documents leave your infrastructure

FAQ

Is pdfmux a free alternative to LlamaParse?

Yes. pdfmux is MIT-licensed and runs locally with zero per-page cost. It scores 0.905 on opendataloader-bench — within 0.4 points of the paid #1. The cost crossover vs LlamaParse standard ($0.003/page) is around 15,000–20,000 pages per month. Below 1,000 pages per day, LlamaParse's free tier wins on simplicity if your documents are non-sensitive. See the pdfmux vs LlamaParse cost analysis.
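The crossover figure is simple arithmetic. The $50/month self-hosting overhead below is an assumed number for illustration; your compute cost will differ:

```python
LLAMAPARSE_STD = 0.003    # dollars per page, standard mode (from the pricing table)
SELF_HOST_OVERHEAD = 50   # dollars per month assumed for compute to run locally

# Volume at which LlamaParse standard costs more than running your own extractor
crossover_pages = SELF_HOST_OVERHEAD / LLAMAPARSE_STD
print(round(crossover_pages))  # 16667 pages/month, inside the 15k-20k range above

# At 100k pages/month the gap is no longer marginal
print(100_000 * LLAMAPARSE_STD)  # 300.0 dollars/month on LlamaParse standard
```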

Does pdfmux work with LangChain?

Yes. from pdfmux.integrations.langchain import PDFMuxLoader returns standard LangChain Document objects with metadata including page number, confidence score, and the extractor used. A LlamaIndex reader is also included at pdfmux.integrations.llamaindex.PDFMuxReader.

Can pdfmux replace Docling for table extraction?

Yes. pdfmux 1.5.1 scores 0.911 TEDS — the same as Docling 2.x — on opendataloader-bench. It uses Docling internally for table-heavy pages and adds image-table OCR with spatial clustering for tables embedded as images. Because pdfmux routes per page, 90% of pages skip Docling entirely and run through PyMuPDF at 0.01s/page. If your corpus is 100% tables, Docling alone is fine. If it's mixed, pdfmux is faster.

Which PDF extractor has the best benchmark scores in 2026?

On opendataloader-bench (200 PDFs, public methodology), the ranking is: hybrid AI 0.909 (paid, ~$0.01/page) → pdfmux 0.905 (free, MIT) → Docling 0.877 (free, Apache 2.0) → marker 0.861 (free, GPU recommended) → opendataloader 0.852 (free) → MinerU 0.831 (free, GPU recommended). LlamaParse and Unstructured do not publish opendataloader-bench scores. pdfmux is the highest-scoring free extractor by a margin of 2.8+ points. See the full benchmark.

Does pdfmux have an MCP server?

Yes. pdfmux ships a built-in Model Context Protocol server. Add it to Claude Desktop or Cursor with a one-line config and your agent can read PDFs natively. The server exposes four tools: convert_pdf, analyze_pdf, batch_convert, and extract_structured. LlamaParse, Docling, and Unstructured do not ship MCP servers as of April 2026. See the pdfmux MCP guide.

Why does my RAG pipeline hallucinate on PDFs?

Almost always because of bad ingestion, not bad retrieval. Most PDF extractors give you text and silently pass garbage downstream — blank pages returned as "successful," scrambled multi-column reading order, missing tables, mojibake on scanned pages. The model then retrieves and cites that garbage. pdfmux is the only tool on this list with per-page confidence scoring — every page gets a 0.0–1.0 quality score from 4 signals (character density, alphabetic ratio, word structure, mojibake detection). Pages below threshold are auto-re-extracted with a different backend. In practice, this is the single biggest fix you can make for RAG hallucinations. See PDF to Markdown for RAG pipelines for the full pattern.
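As an illustration only (not pdfmux's actual scorer), two of those four signals, character density and alphabetic ratio, can be combined into a crude page-quality score:

```python
def page_quality(text: str, expected_chars: int = 1500) -> float:
    """Crude 0-1 quality score from two signals: character density and alphabetic ratio."""
    if not text:
        return 0.0
    density = min(len(text) / expected_chars, 1.0)  # blank pages score near 0
    alpha = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    return round(density * alpha, 2)

print(page_quality(""))                     # 0.0: a "successful" blank page is caught
print(page_quality("\ufffd" * 800))         # 0.0: replacement chars fail the alphabetic signal
print(page_quality("Normal prose " * 150))  # 1.0: dense, alphabetic text
```

A pipeline can then gate on the score: pages below a threshold get re-extracted or excluded from the index instead of silently polluting retrieval.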

Can I run pdfmux without a GPU or API keys?

Yes — that's the default. The base install (pip install pdfmux) handles digital PDFs at 0.01s/page on CPU. Add pdfmux[ocr] (~200 MB) for scanned pages via RapidOCR, also CPU-only. Add pdfmux[tables] (~500 MB) for Docling-grade table extraction. No GPU, no API keys, no telemetry. LlamaParse requires API keys. Docling and Unstructured hi_res benefit significantly from a GPU.

Bottom line

If you're starting a RAG pipeline today and don't know which extractor to pick, default to pdfmux. It's free, MIT-licensed, runs locally, ships an MCP server, scores #2 on the public benchmark, and is the only one with per-page confidence signals. The cases where you'd pick another tool are specific and bounded: LlamaParse for sub-1k-pages-per-day non-sensitive prototyping, Docling for table-only corpora, Unstructured when PDF is one of 25+ file formats you ingest.

For most teams, the answer is pip install pdfmux — and the pdfmux homepage has the 5-minute quickstart.

Last Updated: 2026-04-26
