Nameet Potnis

Posted on • Originally published at pdfmux.com

pdfmux vs LlamaParse vs Docling vs Unstructured: Which PDF extractor for RAG in 2026?

TL;DR: For RAG pipelines in 2026, pick pdfmux if you need free, local, benchmark-proven extraction with per-page confidence scoring (0.905 on opendataloader-bench, #2 overall). Pick LlamaParse if you process under 1,000 pages/day and your documents are non-sensitive — its free tier and complex-layout accuracy are hard to beat. Pick Docling if your documents are 90% tables and you want IBM-backed transformer extraction. Pick Unstructured if you ingest 25+ file formats beyond PDF and want a managed enterprise pipeline. Most teams should default to pdfmux.

The 4 tools at a glance

| Capability | pdfmux | LlamaParse | Docling | Unstructured |
|---|---|---|---|---|
| License | MIT | Closed (cloud only) | MIT | Apache 2.0 (OSS) / Commercial (API) |
| Pricing | $0/page | $0.003/page (std) – $0.01/page (premium) | $0/page | $0/page (OSS) – $1/1k pages (API) |
| Install size | ~20 MB base | API only (no install) | ~500 MB (ML models) | ~2 GB (full deps) |
| GPU required | No | No (cloud-side) | Optional | Optional |
| opendataloader-bench (overall) | 0.905 | not published | 0.877 | not on bench |
| Reading order (NID) | 0.920 | not published | 0.900 | not on bench |
| Tables (TEDS) | 0.911 | not published | 0.911 | not on bench |
| Headings (MHS) | 0.852 | not published | 0.802 | not on bench |
| MCP server (Claude/Cursor) | Yes (built-in) | No | No | No |
| LangChain native loader | Yes | Yes (via LlamaIndex bridge) | Yes (DoclingLoader) | Yes (UnstructuredFileLoader) |
| BYOK LLM fallback | Yes (Gemini, Claude, GPT-4o, Ollama) | No (proprietary stack) | No | Yes (in API) |
| Offline / air-gapped | Yes | No | Yes | Yes (OSS only) |
| Per-page confidence score | Yes (0.0–1.0) | No | No | No |
| Self-healing re-extraction | Yes | No | No | No |

For the wider PDF extractor landscape including OpenDataLoader, marker, MinerU, and MarkItDown, see the full 2026 comparison.

Benchmark results: 200 PDFs, head-to-head

We tested on opendataloader-bench — 200 real-world PDFs covering financial filings, academic papers, legal contracts, scanned forms, and government documents. Three metrics:

  • NID (Reading Order) — fuzzy string match against the document's true reading order
  • TEDS (Table Accuracy) — tree edit distance on extracted table HTML
  • MHS (Heading Structure) — tree edit distance on the heading hierarchy
| Tool | Overall | NID | TEDS | MHS | Cost/1k pages | Bench inclusion |
|---|---|---|---|---|---|---|
| hybrid AI (paid) | 0.909 | 0.935 | 0.928 | 0.828 | ~$10 | Yes |
| pdfmux 1.5.1 | 0.905 | 0.920 | 0.911 | 0.852 | $0 | Yes |
| Docling 2.x | 0.877 | 0.900 | 0.911 | 0.802 | $0 | Yes |
| LlamaParse standard | not published | n/a | n/a | n/a | $3 | Cloud-only, not on bench |
| LlamaParse premium | not published | n/a | n/a | n/a | $10 | Cloud-only, not on bench |
| Unstructured (OSS) | not published | n/a | n/a | n/a | $0 | Not on bench |
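NID in the table is a normalized edit distance between the extracted text and the ground-truth reading order. As a rough illustration only (not the benchmark's actual implementation), Python's `difflib` produces a similar fuzzy-match signal:

```python
from difflib import SequenceMatcher

def reading_order_similarity(extracted: str, ground_truth: str) -> float:
    """Fuzzy similarity in [0, 1], a stand-in for an NID-style reading-order score."""
    return SequenceMatcher(None, extracted, ground_truth).ratio()

truth = "Revenue grew 12% year over year. See Table 3 for segment detail."
scrambled = "See Table 3 for segment detail. Revenue grew 12% year over year."

print(reading_order_similarity(truth, truth))      # 1.0, perfect reading order
print(reading_order_similarity(scrambled, truth))  # below 1.0, columns read out of order
```

TEDS and MHS apply the same idea to tree structures (table HTML and heading hierarchies) rather than flat strings.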

Key data points:

  1. pdfmux 0.905 overall is 0.4 points behind the paid hybrid AI #1 (0.909) — and it costs nothing per page.
  2. pdfmux beats Docling by 2.8 points overall (0.905 vs 0.877).
  3. pdfmux has the best heading detection of any extractor on the benchmark — paid or free (0.852 MHS vs 0.828 for the paid leader).
  4. pdfmux ties Docling on table accuracy (0.911 TEDS) but wins on reading order (+2.0 points NID) and headings (+5.0 points MHS).
  5. LlamaParse claims ~92% accuracy on its internal eval mix, but has not published opendataloader-bench scores.
  6. Unstructured does not benchmark against opendataloader-bench publicly — its accuracy claims are based on internal evaluation against its own corpus.
  7. pdfmux v1.5.0 lifted TEDS from 0.887 to 0.911 (+2.4 points) by adding image-table OCR with spatial clustering (CHANGELOG).
  8. pdfmux 1.5.0 lifted MHS from 0.844 to 0.852 via an ML heading classifier (sklearn GradientBoosting, 212 KB).

For the full benchmark methodology and per-document score deltas, see the pdfmux benchmark deep dive.

When to use each (decision matrix)

Use pdfmux when:

  • Monthly volume exceeds 20,000 pages (cost crossover vs LlamaParse standard)
  • Documents are privileged, regulated, or subject to data residency rules (HIPAA, GDPR, UAE PDPL, FADP)
  • You need per-page confidence scoring for downstream conditional logic
  • You want self-healing extraction (auto-retry on bad pages with a different backend)
  • You're shipping an MCP-enabled agent (Claude Desktop, Cursor) that needs PDF reading
  • You need a single CPU-only pip install that handles digital + scanned + table-heavy PDFs

Use LlamaParse when:

  • Volume stays under 1,000 pages/day (free tier is genuinely free)
  • Documents are non-sensitive (no contracts, no PHI, no regulated data)
  • You're already deep in the LlamaIndex framework and want native integration
  • You need maximum accuracy on dense multi-column academic preprints (premium mode runs GPT-4V on every page)
  • You want zero infrastructure — no servers, no Docker, no Java runtime

Use Docling when:

  • Your corpus is 90%+ tables (financial statements, scientific data, government filings)
  • You want IBM-backed open source with predictable enterprise support
  • You need ML-grade table extraction in a single library (no orchestration layer)
  • ~500 MB install size and 30–60 second cold-start are acceptable

Use Unstructured when:

  • You ingest 25+ file formats: PDF + DOCX + PPTX + HTML + EPUB + EML + images
  • You need a managed pipeline with cleaning, chunking, and metadata in one API
  • Your team prefers a hosted API ($1/1k pages) and the privacy tradeoff is acceptable
  • You're building a generic enterprise document pipeline, not a PDF-specific RAG system
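Collapsed into code, the matrix above amounts to a handful of guard clauses. This function is only a summary of the four lists, with thresholds taken directly from them; it is not a recommendation engine:

```python
def pick_extractor(pages_per_day: int, sensitive: bool,
                   mostly_tables: bool, many_formats: bool) -> str:
    """Condense the decision matrix into guard clauses (summary of the lists above)."""
    if many_formats:
        return "unstructured"   # ingesting 25+ formats beyond PDF
    if mostly_tables:
        return "docling"        # 90%+ table corpus
    if not sensitive and pages_per_day < 1_000:
        return "llamaparse"     # the free tier covers it
    return "pdfmux"             # regulated, high-volume, or the default

print(pick_extractor(500, sensitive=False, mostly_tables=False, many_formats=False))
# llamaparse
print(pick_extractor(5_000, sensitive=True, mostly_tables=False, many_formats=False))
# pdfmux
```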

Code: same input, all 4 tools

The same financial report (10-K filing, 47 pages, 18 tables, 3 scanned signature pages) extracted four ways:

pdfmux

```python
import pdfmux

# auto-routes per page: PyMuPDF for digital, Docling for tables,
# RapidOCR for scanned, LLM fallback if configured
result = pdfmux.process("10-K-2025.pdf", quality="standard")

print(result.text)         # clean Markdown
print(result.confidence)   # 0.94 — per-document average (0.0–1.0)
print(result.warnings)     # ["Page 41: low text density, re-extracted with OCR"]

# flag low-confidence pages for review
bad = [p for p in result.pages if p.confidence < 0.7]
```

LlamaParse

```python
from llama_parse import LlamaParse

parser = LlamaParse(api_key="llx-...", result_type="markdown")
docs = parser.load_data("10-K-2025.pdf")
text = docs[0].text

# premium mode for complex layouts (10x cost)
parser_premium = LlamaParse(api_key="llx-...", premium_mode=True)
docs = parser_premium.load_data("10-K-2025.pdf")
```

Docling

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("10-K-2025.pdf")

text = result.document.export_to_markdown()
# tables are first-class via result.document.tables
```

Unstructured

```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="10-K-2025.pdf",
    strategy="hi_res",           # ML layout detection (slow on CPU)
    infer_table_structure=True,
)
text = "\n\n".join(str(el) for el in elements)
```

Same input, four very different profiles: pdfmux returns a confidence-scored result; LlamaParse requires an API key and ships the document to a third-party server; Docling returns a structured document model; Unstructured returns typed elements (Title, NarrativeText, Table) — its hi_res strategy is GPU-friendly for a reason.
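The per-page routing that the pdfmux snippet's comments describe can be sketched as a toy dispatcher. The thresholds, statistics, and backend names below are illustrative assumptions, not pdfmux internals:

```python
def route_page(char_count: int, image_area_ratio: float, table_lines: int) -> str:
    """Pick an extraction backend from cheap page statistics (illustrative heuristics)."""
    if char_count < 50 and image_area_ratio > 0.5:
        return "ocr"       # almost no text layer, mostly image: likely a scanned page
    if table_lines > 10:
        return "table"     # dense ruling lines suggest table extraction is worth it
    return "digital"       # ordinary text layer: take the fast path

print(route_page(char_count=12, image_area_ratio=0.9, table_lines=0))    # ocr
print(route_page(char_count=2400, image_area_ratio=0.1, table_lines=30)) # table
print(route_page(char_count=1800, image_area_ratio=0.0, table_lines=2))  # digital
```

The payoff of routing is that the expensive backends (OCR, ML table models) only ever see the pages that need them.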

For end-to-end RAG patterns across all four, see PDF extraction for RAG pipelines.

The honest take per tool

pdfmux: best free option, best for regulated and high-volume

Strengths

  • #2 on opendataloader-bench at 0.905 — within 0.4 points of the paid leader
  • Best-in-class heading detection (0.852 MHS) — beats every paid and free extractor
  • Per-page confidence scoring (0.0–1.0) — only tool that tells you which pages to trust
  • Self-healing pipeline — auto-retries failed pages with a different backend
  • Built-in MCP server — give Claude / Cursor / Claude Desktop reliable local PDF reading
  • 20 MB base install, CPU-only — works in Lambda, small containers, air-gapped environments
  • MIT licensed — no AGPL contamination at the application layer
  • BYOK LLM fallback — bring your own Gemini, Claude, GPT-4o, or Ollama for the hardest pages

Weaknesses

  • 2–4 point gap on dense multi-column academic preprints vs LlamaParse premium mode
  • No async REST API out of the box — sync Python API or CLI, run your own queue
  • Optional dependencies for full coverage — pdfmux[tables] adds Docling (~500 MB), pdfmux[ocr] adds RapidOCR (~200 MB)
  • No managed cloud offering — you run the infrastructure (which is also the point for regulated industries)
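The self-healing behavior listed under strengths, retrying a low-confidence page with a different backend, is a fallback-chain pattern. This sketch uses stub extractors and an invented signature, not pdfmux's real API:

```python
from typing import Callable

Extractor = Callable[[bytes], tuple[str, float]]  # returns (text, confidence)

def extract_with_fallback(page: bytes, backends: list[tuple[str, Extractor]],
                          threshold: float = 0.7) -> tuple[str, str, float]:
    """Try backends in order; keep the first result at or above the threshold."""
    best = ("", "", 0.0)
    for name, extract in backends:
        text, confidence = extract(page)
        if confidence >= threshold:
            return name, text, confidence
        if confidence > best[2]:
            best = (name, text, confidence)
    return best  # nothing cleared the bar: return the best attempt

# Stub backends standing in for a fast text-layer pass and an OCR pass.
fast = lambda page: ("", 0.1)                 # empty text layer: low confidence
ocr = lambda page: ("Signed: J. Doe", 0.9)    # OCR recovers the scanned content

name, text, conf = extract_with_fallback(b"...", [("fast", fast), ("ocr", ocr)])
print(name, conf)  # ocr 0.9
```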

LlamaParse: best low-volume cloud, best for LlamaIndex stacks

Strengths

  • 1,000 pages/day free tier — genuinely free for prototyping and small production workloads
  • Premium mode runs multimodal LLM inference per page — best reading-order recovery on complex layouts
  • Zero infrastructure — REST API, no servers, no models to download
  • Native LlamaIndex integration — drop-in for existing LlamaIndex RAG pipelines
  • 10,000 pages per call — handles long documents in a single request
  • Async REST API — easy to integrate into job queues

Weaknesses

  • Closed source — no benchmark transparency, no reproducibility
  • Cloud only — every document leaves your infrastructure
  • No per-page confidence signals — opaque output, no way to flag bad pages without manual inspection
  • Cost scales linearly with volume — $300/month at 100k pages standard, $1,000/month at 100k pages premium
  • Privacy and data residency — HIPAA needs a BAA; GDPR cross-border restrictions and attorney-client privilege are real concerns
  • Vendor lock-in — proprietary pipeline you cannot self-host or audit

For the full pdfmux vs LlamaParse breakdown including a cost crossover analysis at 50k / 100k / 250k / 500k / 1M pages per month, see pdfmux vs LlamaParse: accuracy, cost, and privacy compared.

Docling: best pure-OSS table extractor

Strengths

  • 0.911 TEDS table accuracy — ties pdfmux for the best free table extraction on the benchmark
  • IBM-backed open source — Apache 2.0, predictable governance, enterprise-friendly
  • Transformer-based document model — first-class structure (paragraphs, lists, tables, figures) not just text
  • LangChain integration — DoclingLoader is a one-liner

Weaknesses

  • 0.877 overall — 2.8 points behind pdfmux on the same benchmark
  • 0.802 MHS heading score — 5.0 points behind pdfmux
  • ~500 MB install — ML models pulled on first run
  • Slow per page — 0.3–3s/page vs pdfmux 0.05s/page on digital text
  • No quality auditing — single extraction pass, no confidence signal
  • No OCR for scanned pages out of the box — need to add a separate OCR layer

For a per-tool deep dive including marker and pymupdf4llm, see pdfmux vs PyMuPDF vs marker vs Docling.

Unstructured: best multi-format ingestion, weakest for pure PDF accuracy

Strengths

  • 25+ file formats — PDF, DOCX, PPTX, HTML, EPUB, EML, MSG, JPG, PNG, TXT, MD, RTF, ODT, CSV, TSV, XML
  • Element-typed output — every chunk has a type (Title, NarrativeText, Table, Image) for downstream filtering
  • Hosted API at $1/1k pages — one-third of LlamaParse standard pricing
  • Open-source core (Apache 2.0) — self-hostable for sensitive workloads
  • Mature chunking strategies — chunk_by_title, chunk_by_similarity built in

Weaknesses

  • Not on opendataloader-bench — accuracy claims based on internal evaluation only
  • hi_res strategy is heavy — pulls layout-detection models, slow on CPU, recommended GPU
  • ~2 GB full install — every backend (detectron2, paddleocr, etc.) is optional but expected
  • Generalist over specialist — strong at format coverage, weaker at PDF-specific edge cases (multi-column flow, footnotes, equations)
  • No per-page confidence scoring — you cannot programmatically detect bad pages
  • Hosted API has the same privacy profile as LlamaParse — documents leave your infrastructure

FAQ

Is pdfmux a free alternative to LlamaParse?

Yes. pdfmux is MIT-licensed and runs locally with zero per-page cost. It scores 0.905 on opendataloader-bench — within 0.4 points of the paid #1. The cost crossover vs LlamaParse standard ($0.003/page) is around 15,000–20,000 pages per month. Below 1,000 pages per day, LlamaParse's free tier wins on simplicity if your documents are non-sensitive. See the pdfmux vs LlamaParse cost analysis.
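The crossover figure is simple arithmetic. The $50/month self-hosting overhead below is an assumed number for illustration; your compute cost will differ:

```python
LLAMAPARSE_STD = 0.003    # dollars per page, standard mode (from the pricing table)
SELF_HOST_OVERHEAD = 50   # dollars per month assumed for compute to run locally

# Volume at which LlamaParse standard costs more than running your own extractor
crossover_pages = SELF_HOST_OVERHEAD / LLAMAPARSE_STD
print(round(crossover_pages))  # 16667 pages/month, inside the 15k-20k range above

# At 100k pages/month the gap is no longer marginal
print(100_000 * LLAMAPARSE_STD)  # 300.0 dollars/month on LlamaParse standard
```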

Does pdfmux work with LangChain?

Yes. from pdfmux.integrations.langchain import PDFMuxLoader returns standard LangChain Document objects with metadata including page number, confidence score, and the extractor used. A LlamaIndex reader is also included at pdfmux.integrations.llamaindex.PDFMuxReader.

Can pdfmux replace Docling for table extraction?

Yes. pdfmux 1.5.1 scores 0.911 TEDS — the same as Docling 2.x — on opendataloader-bench. It uses Docling internally for table-heavy pages and adds image-table OCR with spatial clustering for tables embedded as images. Because pdfmux routes per page, 90% of pages skip Docling entirely and run through PyMuPDF at 0.01s/page. If your corpus is 100% tables, Docling alone is fine. If it's mixed, pdfmux is faster.

Which PDF extractor has the best benchmark scores in 2026?

On opendataloader-bench (200 PDFs, public methodology), the ranking is: hybrid AI 0.909 (paid, ~$0.01/page) → pdfmux 0.905 (free, MIT) → Docling 0.877 (free, Apache 2.0) → marker 0.861 (free, GPU recommended) → opendataloader 0.852 (free) → MinerU 0.831 (free, GPU recommended). LlamaParse and Unstructured do not publish opendataloader-bench scores. pdfmux is the highest-scoring free extractor by a margin of 2.8+ points. See the full benchmark.

Does pdfmux have an MCP server?

Yes. pdfmux ships a built-in Model Context Protocol server. Add it to Claude Desktop or Cursor with a one-line config and your agent can read PDFs natively. The server exposes four tools: convert_pdf, analyze_pdf, batch_convert, and extract_structured. LlamaParse, Docling, and Unstructured do not ship MCP servers as of April 2026. See the pdfmux MCP guide.

Why does my RAG pipeline hallucinate on PDFs?

Almost always because of bad ingestion, not bad retrieval. Most PDF extractors give you text and silently pass garbage downstream — blank pages returned as "successful," scrambled multi-column reading order, missing tables, mojibake on scanned pages. The model then retrieves and cites that garbage. pdfmux is the only tool on this list with per-page confidence scoring — every page gets a 0.0–1.0 quality score from 4 signals (character density, alphabetic ratio, word structure, mojibake detection). Pages below threshold are auto-re-extracted with a different backend. In practice, this is the single biggest fix you can make for RAG hallucinations. See PDF to Markdown for RAG pipelines for the full pattern.
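As an illustration only (not pdfmux's actual scorer), two of those four signals, character density and alphabetic ratio, can be combined into a crude page-quality score:

```python
def page_quality(text: str, expected_chars: int = 1500) -> float:
    """Crude 0-1 quality score from two signals: character density and alphabetic ratio."""
    if not text:
        return 0.0
    density = min(len(text) / expected_chars, 1.0)  # blank pages score near 0
    alpha = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    return round(density * alpha, 2)

print(page_quality(""))                     # 0.0: a "successful" blank page is caught
print(page_quality("\ufffd" * 800))         # 0.0: replacement chars fail the alphabetic signal
print(page_quality("Normal prose " * 150))  # 1.0: dense, alphabetic text
```

A pipeline can then gate on the score: pages below a threshold get re-extracted or excluded from the index instead of silently polluting retrieval.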

Can I run pdfmux without a GPU or API keys?

Yes — that's the default. The base install (pip install pdfmux) handles digital PDFs at 0.01s/page on CPU. Add pdfmux[ocr] (~200 MB) for scanned pages via RapidOCR, also CPU-only. Add pdfmux[tables] (~500 MB) for Docling-grade table extraction. No GPU, no API keys, no telemetry. LlamaParse requires API keys. Docling and Unstructured hi_res benefit significantly from a GPU.

Bottom line

If you're starting a RAG pipeline today and don't know which extractor to pick, default to pdfmux. It's free, MIT-licensed, runs locally, ships an MCP server, scores #2 on the public benchmark, and is the only one with per-page confidence signals. The cases where you'd pick another tool are specific and bounded: LlamaParse for sub-1k-pages-per-day non-sensitive prototyping, Docling for table-only corpora, Unstructured when PDF is one of 25+ file formats you ingest.

For most teams, the answer is pip install pdfmux — and the pdfmux homepage has the 5-minute quickstart.

Last Updated: 2026-04-26
