<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nameet Potnis</title>
    <description>The latest articles on DEV Community by Nameet Potnis (@nameetp).</description>
    <link>https://dev.to/nameetp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3903907%2F7f735e5c-f893-45a3-875c-95dd410ff210.jpeg</url>
      <title>DEV Community: Nameet Potnis</title>
      <link>https://dev.to/nameetp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nameetp"/>
    <language>en</language>
    <item>
      <title>pdfmux vs LlamaParse vs Docling vs Unstructured: Which PDF extractor for RAG in 2026?</title>
      <dc:creator>Nameet Potnis</dc:creator>
      <pubDate>Wed, 29 Apr 2026 09:52:45 +0000</pubDate>
      <link>https://dev.to/nameetp/pdfmux-vs-llamaparse-vs-docling-vs-unstructured-which-pdf-extractor-for-rag-in-2026-3agj</link>
      <guid>https://dev.to/nameetp/pdfmux-vs-llamaparse-vs-docling-vs-unstructured-which-pdf-extractor-for-rag-in-2026-3agj</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: For RAG pipelines in 2026, pick &lt;strong&gt;pdfmux&lt;/strong&gt; if you need free, local, benchmark-proven extraction with per-page confidence scoring (0.905 on opendataloader-bench, #2 overall). Pick &lt;strong&gt;LlamaParse&lt;/strong&gt; if you process under 1,000 pages/day and your documents are non-sensitive — its free tier and complex-layout accuracy are hard to beat. Pick &lt;strong&gt;Docling&lt;/strong&gt; if your documents are 90% tables and you want IBM-backed transformer extraction. Pick &lt;strong&gt;Unstructured&lt;/strong&gt; if you ingest 25+ file formats beyond PDF and want a managed enterprise pipeline. Most teams should default to pdfmux.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4 tools at a glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;pdfmux&lt;/th&gt;
&lt;th&gt;LlamaParse&lt;/th&gt;
&lt;th&gt;Docling&lt;/th&gt;
&lt;th&gt;Unstructured&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;Closed (cloud only)&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;Apache 2.0 (OSS) / Commercial (API)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pricing&lt;/td&gt;
&lt;td&gt;$0/page&lt;/td&gt;
&lt;td&gt;$0.003/page (std) – $0.01/page (premium)&lt;/td&gt;
&lt;td&gt;$0/page&lt;/td&gt;
&lt;td&gt;$0/page (OSS) – $1/1k pages (API)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Install size&lt;/td&gt;
&lt;td&gt;~20 MB base&lt;/td&gt;
&lt;td&gt;API only (no install)&lt;/td&gt;
&lt;td&gt;~500 MB (ML models)&lt;/td&gt;
&lt;td&gt;~2 GB (full deps)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU required&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No (cloud-side)&lt;/td&gt;
&lt;td&gt;Optional&lt;/td&gt;
&lt;td&gt;Optional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;opendataloader-bench (overall)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.905&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;not published&lt;/td&gt;
&lt;td&gt;0.877&lt;/td&gt;
&lt;td&gt;not on bench&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reading order (NID)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.920&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;not published&lt;/td&gt;
&lt;td&gt;0.900&lt;/td&gt;
&lt;td&gt;not on bench&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tables (TEDS)&lt;/td&gt;
&lt;td&gt;0.911&lt;/td&gt;
&lt;td&gt;not published&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.911&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;not on bench&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Headings (MHS)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.852&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;not published&lt;/td&gt;
&lt;td&gt;0.802&lt;/td&gt;
&lt;td&gt;not on bench&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP server (Claude/Cursor)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes (built-in)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangChain native loader&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (via LlamaIndex bridge)&lt;/td&gt;
&lt;td&gt;Yes (&lt;code&gt;DoclingLoader&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Yes (&lt;code&gt;UnstructuredFileLoader&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BYOK LLM fallback&lt;/td&gt;
&lt;td&gt;Yes (Gemini, Claude, GPT-4o, Ollama)&lt;/td&gt;
&lt;td&gt;No (proprietary stack)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (in API)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Offline / air-gapped&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (OSS only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-page confidence score&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes (0.0–1.0)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-healing re-extraction&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For the wider PDF extractor landscape including OpenDataLoader, marker, MinerU, and MarkItDown, see the &lt;a href="https://pdfmux.com/blog/pdf-extractor-comparison-2026/" rel="noopener noreferrer"&gt;full 2026 comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark results: 200 PDFs, head-to-head
&lt;/h2&gt;

&lt;p&gt;We tested on &lt;a href="https://github.com/opendataloader-project/opendataloader-bench" rel="noopener noreferrer"&gt;opendataloader-bench&lt;/a&gt; — 200 real-world PDFs covering financial filings, academic papers, legal contracts, scanned forms, and government documents. Three metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NID (Reading Order)&lt;/strong&gt; — fuzzy string match against the document's true reading order&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TEDS (Table Accuracy)&lt;/strong&gt; — tree edit distance on extracted table HTML&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MHS (Heading Structure)&lt;/strong&gt; — tree edit distance on the heading hierarchy&lt;/li&gt;
&lt;/ul&gt;
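&lt;p&gt;All three metrics land on a 0.0–1.0 scale where 1.0 is a perfect match. As rough intuition for how a similarity score like NID behaves, here is a stdlib-only sketch; &lt;code&gt;difflib&lt;/code&gt; is a stand-in for the idea, and the benchmark's actual metric implementations live in the opendataloader-bench repo:&lt;/p&gt;

```python
# Rough intuition only: a 0.0-1.0 similarity score, like the metrics
# above. The benchmark's real NID is an edit-distance measure over the
# extracted text; difflib's ratio() is a stdlib stand-in for the idea.
from difflib import SequenceMatcher

def similarity(extracted: str, ground_truth: str) -> float:
    """1.0 means a perfect match with the true reading order."""
    return SequenceMatcher(None, extracted, ground_truth).ratio()

truth = "Revenue grew 12% year over year. See Table 3 for segment detail."
scrambled = "See Table 3 for segment detail. Revenue grew 12% year over year."

print(similarity(truth, truth))      # 1.0
print(similarity(scrambled, truth))  # below 1.0: order wrong, text intact
```

&lt;p&gt;Scrambled reading order drops the score even though every character survives, which is the failure mode NID measures.&lt;/p&gt;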

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Overall&lt;/th&gt;
&lt;th&gt;NID&lt;/th&gt;
&lt;th&gt;TEDS&lt;/th&gt;
&lt;th&gt;MHS&lt;/th&gt;
&lt;th&gt;Cost/1k pages&lt;/th&gt;
&lt;th&gt;Bench inclusion&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;hybrid AI (paid)&lt;/td&gt;
&lt;td&gt;0.909&lt;/td&gt;
&lt;td&gt;0.935&lt;/td&gt;
&lt;td&gt;0.928&lt;/td&gt;
&lt;td&gt;0.828&lt;/td&gt;
&lt;td&gt;~$10&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;pdfmux 1.5.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.905&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.920&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.911&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.852&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docling 2.x&lt;/td&gt;
&lt;td&gt;0.877&lt;/td&gt;
&lt;td&gt;0.900&lt;/td&gt;
&lt;td&gt;0.911&lt;/td&gt;
&lt;td&gt;0.802&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LlamaParse standard&lt;/td&gt;
&lt;td&gt;not published&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;$3&lt;/td&gt;
&lt;td&gt;Cloud-only, not on bench&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LlamaParse premium&lt;/td&gt;
&lt;td&gt;not published&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;$10&lt;/td&gt;
&lt;td&gt;Cloud-only, not on bench&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unstructured (OSS)&lt;/td&gt;
&lt;td&gt;not published&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;Not on bench&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Key data points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;pdfmux 0.905 overall is &lt;strong&gt;0.4 points behind&lt;/strong&gt; the paid hybrid AI #1 (0.909) — and it costs nothing per page.&lt;/li&gt;
&lt;li&gt;pdfmux beats Docling by &lt;strong&gt;2.8 points&lt;/strong&gt; overall (0.905 vs 0.877).&lt;/li&gt;
&lt;li&gt;pdfmux has the &lt;strong&gt;best heading detection of any extractor on the benchmark — paid or free&lt;/strong&gt; (0.852 MHS vs 0.828 for the paid leader).&lt;/li&gt;
&lt;li&gt;pdfmux ties Docling on table accuracy (0.911 TEDS) but wins on reading order (+2.0 points NID) and headings (+5.0 points MHS).&lt;/li&gt;
&lt;li&gt;LlamaParse claims ~92% accuracy on its internal eval mix, but has not published opendataloader-bench scores.&lt;/li&gt;
&lt;li&gt;Unstructured does not benchmark against opendataloader-bench publicly — its accuracy claims are based on internal evaluation against its own corpus.&lt;/li&gt;
&lt;li&gt;pdfmux v1.5.0 lifted TEDS from 0.887 to &lt;strong&gt;0.911 (+2.4 points)&lt;/strong&gt; by adding image-table OCR with spatial clustering (&lt;a href="https://github.com/NameetP/pdfmux/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;CHANGELOG&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;pdfmux 1.5.0 lifted MHS from 0.844 to &lt;strong&gt;0.852&lt;/strong&gt; via an ML heading classifier (sklearn GradientBoosting, 212 KB).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the full benchmark methodology and per-document score deltas, see the &lt;a href="https://pdfmux.com/blog/benchmarking-pdf-extractors/" rel="noopener noreferrer"&gt;pdfmux benchmark deep dive&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use each (decision matrix)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use pdfmux when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Monthly volume exceeds 20,000 pages (cost crossover vs LlamaParse standard)&lt;/li&gt;
&lt;li&gt;Documents are privileged, regulated, or subject to data residency rules (HIPAA, GDPR, UAE PDPL, FADP)&lt;/li&gt;
&lt;li&gt;You need per-page confidence scoring for downstream conditional logic&lt;/li&gt;
&lt;li&gt;You want self-healing extraction (auto-retry on bad pages with a different backend)&lt;/li&gt;
&lt;li&gt;You're shipping an MCP-enabled agent (Claude Desktop, Cursor) that needs PDF reading&lt;/li&gt;
&lt;li&gt;You need a single CPU-only &lt;code&gt;pip install&lt;/code&gt; that handles digital + scanned + table-heavy PDFs&lt;/li&gt;
&lt;/ul&gt;
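&lt;p&gt;The 20,000-page crossover in the first bullet is simple arithmetic. A sketch, assuming a hypothetical $60/month of self-hosting cost; substitute your own infrastructure number:&lt;/p&gt;

```python
# Cost crossover vs LlamaParse standard at $0.003/page. The $60/month
# self-hosting figure is a placeholder assumption (roughly one small
# CPU container); plug in your own infrastructure cost.
LLAMAPARSE_STD_PER_PAGE = 0.003
SELF_HOST_MONTHLY = 60.0  # hypothetical fixed monthly infra cost

def llamaparse_monthly(pages: int) -> float:
    return pages * LLAMAPARSE_STD_PER_PAGE

crossover = round(SELF_HOST_MONTHLY / LLAMAPARSE_STD_PER_PAGE)
print(crossover)                            # 20000 pages/month
print(round(llamaparse_monthly(100_000)))   # 300 dollars/month at 100k pages
```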

&lt;h3&gt;
  
  
  Use LlamaParse when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Volume stays under 1,000 pages/day (free tier is genuinely free)&lt;/li&gt;
&lt;li&gt;Documents are non-sensitive (no contracts, no PHI, no regulated data)&lt;/li&gt;
&lt;li&gt;You're already deep in the LlamaIndex framework and want native integration&lt;/li&gt;
&lt;li&gt;You need maximum accuracy on dense multi-column academic preprints (premium mode runs GPT-4V on every page)&lt;/li&gt;
&lt;li&gt;You want zero infrastructure — no servers, no Docker, no Java runtime&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use Docling when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Your corpus is 90%+ tables (financial statements, scientific data, government filings)&lt;/li&gt;
&lt;li&gt;You want IBM-backed open source with predictable enterprise support&lt;/li&gt;
&lt;li&gt;You need ML-grade table extraction in a single library (no orchestration layer)&lt;/li&gt;
&lt;li&gt;~500 MB install size and 30–60 second cold-start are acceptable&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use Unstructured when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You ingest 25+ file formats: PDF + DOCX + PPTX + HTML + EPUB + EML + images&lt;/li&gt;
&lt;li&gt;You need a managed pipeline with cleaning, chunking, and metadata in one API&lt;/li&gt;
&lt;li&gt;Your team prefers a hosted API ($1/1k pages) and the privacy tradeoff is acceptable&lt;/li&gt;
&lt;li&gt;You're building a generic enterprise document pipeline, not a PDF-specific RAG system&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Code: same input, all 4 tools
&lt;/h2&gt;

&lt;p&gt;The same financial report (10-K filing, 47 pages, 18 tables, 3 scanned signature pages) extracted four ways:&lt;/p&gt;

&lt;h3&gt;
  
  
  pdfmux
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pdfmux&lt;/span&gt;

&lt;span class="c1"&gt;# auto-routes per page: PyMuPDF for digital, Docling for tables,
# RapidOCR for scanned, LLM fallback if configured
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pdfmux&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10-K-2025.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# clean Markdown
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# 0.94 — per-document average (0.0–1.0)
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warnings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# ["Page 41: low text density, re-extracted with OCR"]
&lt;/span&gt;
&lt;span class="c1"&gt;# flag low-confidence pages for review
&lt;/span&gt;&lt;span class="n"&gt;bad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  LlamaParse
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LlamaParse&lt;/span&gt;

&lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlamaParse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llx-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10-K-2025.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="c1"&gt;# premium mode for complex layouts (10x cost)
&lt;/span&gt;&lt;span class="n"&gt;parser_premium&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlamaParse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llx-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;premium_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser_premium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10-K-2025.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Docling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docling.document_converter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DocumentConverter&lt;/span&gt;

&lt;span class="n"&gt;converter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DocumentConverter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10-K-2025.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export_to_markdown&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# tables are first-class via result.document.tables
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Unstructured
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;unstructured.partition.pdf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;partition_pdf&lt;/span&gt;

&lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;partition_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10-K-2025.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hi_res&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# ML layout detection (slow on CPU)
&lt;/span&gt;    &lt;span class="n"&gt;infer_table_structure&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same input, four very different profiles: pdfmux returns a confidence-scored result; LlamaParse requires an API key and ships the document to a third-party server; Docling returns a structured document model; Unstructured returns typed elements (&lt;code&gt;Title&lt;/code&gt;, &lt;code&gt;NarrativeText&lt;/code&gt;, &lt;code&gt;Table&lt;/code&gt;) — its &lt;code&gt;hi_res&lt;/code&gt; strategy wants a GPU for a reason.&lt;/p&gt;

&lt;p&gt;For end-to-end RAG patterns across all four, see &lt;a href="https://pdfmux.com/blog/pdf-extraction-for-rag-pipeline/" rel="noopener noreferrer"&gt;PDF extraction for RAG pipelines&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest take per tool
&lt;/h2&gt;

&lt;h3&gt;
  
  
  pdfmux: best free option, best for regulated and high-volume
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;#2 on opendataloader-bench at 0.905&lt;/strong&gt; — within 0.4 points of the paid leader&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best-in-class heading detection (0.852 MHS)&lt;/strong&gt; — beats every paid and free extractor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-page confidence scoring (0.0–1.0)&lt;/strong&gt; — only tool that tells you which pages to trust&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-healing pipeline&lt;/strong&gt; — auto-retries failed pages with a different backend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in MCP server&lt;/strong&gt; — give Claude / Cursor / Claude Desktop reliable local PDF reading&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;20 MB base install, CPU-only&lt;/strong&gt; — works in Lambda, small containers, air-gapped environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIT licensed&lt;/strong&gt; — no AGPL contamination at the application layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BYOK LLM fallback&lt;/strong&gt; — bring your own Gemini, Claude, GPT-4o, or Ollama for the hardest pages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2–4 point gap on dense multi-column academic preprints&lt;/strong&gt; vs LlamaParse premium mode&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No async REST API out of the box&lt;/strong&gt; — sync Python API or CLI, run your own queue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optional dependencies for full coverage&lt;/strong&gt; — &lt;code&gt;pdfmux[tables]&lt;/code&gt; adds Docling (~500 MB), &lt;code&gt;pdfmux[ocr]&lt;/code&gt; adds RapidOCR (~200 MB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No managed cloud offering&lt;/strong&gt; — you run the infrastructure (which is also the point for regulated industries)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  LlamaParse: best low-volume cloud, best for LlamaIndex stacks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1,000 pages/day free tier&lt;/strong&gt; — genuinely free for prototyping and small production workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Premium mode runs multimodal LLM inference per page&lt;/strong&gt; — best reading-order recovery on complex layouts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero infrastructure&lt;/strong&gt; — REST API, no servers, no models to download&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native LlamaIndex integration&lt;/strong&gt; — drop-in for existing LlamaIndex RAG pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10,000 pages per call&lt;/strong&gt; — handles long documents in a single request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async REST API&lt;/strong&gt; — easy to integrate into job queues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Closed source&lt;/strong&gt; — no benchmark transparency, no reproducibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud only&lt;/strong&gt; — every document leaves your infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No per-page confidence signals&lt;/strong&gt; — opaque output, no way to flag bad pages without manual inspection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost scales linearly with volume&lt;/strong&gt; — $300/month at 100k pages standard, $1,000/month at 100k pages premium&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy and data residency&lt;/strong&gt; — HIPAA needs a BAA; GDPR cross-border restrictions and attorney-client privilege are real concerns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor lock-in&lt;/strong&gt; — proprietary pipeline you cannot self-host or audit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the full pdfmux vs LlamaParse breakdown including a cost crossover analysis at 50k / 100k / 250k / 500k / 1M pages per month, see &lt;a href="https://pdfmux.com/blog/pdfmux-vs-llamaparse/" rel="noopener noreferrer"&gt;pdfmux vs LlamaParse: accuracy, cost, and privacy compared&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docling: best pure-OSS table extractor
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;0.911 TEDS table accuracy&lt;/strong&gt; — ties pdfmux for the best free table extraction on the benchmark&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IBM-backed open source&lt;/strong&gt; — Apache 2.0, predictable governance, enterprise-friendly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transformer-based document model&lt;/strong&gt; — first-class structure (paragraphs, lists, tables, figures) not just text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangChain integration&lt;/strong&gt; — &lt;code&gt;DoclingLoader&lt;/code&gt; is a one-liner&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;0.877 overall — 2.8 points behind pdfmux&lt;/strong&gt; on the same benchmark&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.802 MHS heading score&lt;/strong&gt; — 5.0 points behind pdfmux&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~500 MB install&lt;/strong&gt; — ML models pulled on first run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow per page&lt;/strong&gt; — 0.3–3s/page vs pdfmux 0.01s/page on digital text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No quality auditing&lt;/strong&gt; — single extraction pass, no confidence signal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No OCR for scanned pages out of the box&lt;/strong&gt; — need to add a separate OCR layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a per-tool deep dive including marker and pymupdf4llm, see &lt;a href="https://pdfmux.com/blog/pdfmux-vs-pymupdf-vs-marker-vs-docling/" rel="noopener noreferrer"&gt;pdfmux vs PyMuPDF vs marker vs Docling&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unstructured: best multi-format ingestion, weakest for pure PDF accuracy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;25+ file formats&lt;/strong&gt; — PDF, DOCX, PPTX, HTML, EPUB, EML, MSG, JPG, PNG, TXT, MD, RTF, ODT, CSV, TSV, XML&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Element-typed output&lt;/strong&gt; — every chunk has a type (&lt;code&gt;Title&lt;/code&gt;, &lt;code&gt;NarrativeText&lt;/code&gt;, &lt;code&gt;Table&lt;/code&gt;, &lt;code&gt;Image&lt;/code&gt;) for downstream filtering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosted API at $1/1k pages&lt;/strong&gt; — one-third of LlamaParse standard pricing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source core (Apache 2.0)&lt;/strong&gt; — self-hostable for sensitive workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mature chunking strategies&lt;/strong&gt; — &lt;code&gt;chunk_by_title&lt;/code&gt;, &lt;code&gt;chunk_by_similarity&lt;/code&gt; built in&lt;/li&gt;
&lt;/ul&gt;
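&lt;p&gt;To make &lt;code&gt;chunk_by_title&lt;/code&gt; concrete, here is a stdlib sketch of the grouping idea, with &lt;code&gt;(type, text)&lt;/code&gt; tuples standing in for Unstructured's element objects. This illustrates the concept, not the library's implementation:&lt;/p&gt;

```python
# Sketch of the chunk_by_title idea: group typed elements under the
# nearest preceding Title. Not Unstructured's implementation -- just
# the grouping logic, with (type, text) tuples standing in for elements.
def chunk_by_title_sketch(elements):
    chunks, current = [], None
    for el_type, text in elements:
        if el_type == "Title":
            if current:
                chunks.append(current)
            current = {"title": text, "body": []}
        elif current:
            current["body"].append(text)
    if current:
        chunks.append(current)
    return chunks

elements = [
    ("Title", "Risk Factors"),
    ("NarrativeText", "Competition may reduce margins."),
    ("Title", "Liquidity"),
    ("NarrativeText", "Cash on hand covers 18 months."),
]
print(len(chunk_by_title_sketch(elements)))  # 2
```

&lt;p&gt;Each chunk carries its heading, which is what makes title-based chunks retrievable as coherent units in a RAG index.&lt;/p&gt;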

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not on opendataloader-bench&lt;/strong&gt; — accuracy claims based on internal evaluation only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;hi_res&lt;/code&gt; strategy is heavy&lt;/strong&gt; — pulls layout-detection models and is slow on CPU; a GPU is recommended&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~2 GB full install&lt;/strong&gt; — every backend (detectron2, paddleocr, etc.) is optional in theory but expected in practice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generalist over specialist&lt;/strong&gt; — strong at format coverage, weaker at PDF-specific edge cases (multi-column flow, footnotes, equations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No per-page confidence scoring&lt;/strong&gt; — you cannot programmatically detect bad pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosted API has the same privacy profile as LlamaParse&lt;/strong&gt; — documents leave your infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is pdfmux a free alternative to LlamaParse?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. pdfmux is MIT-licensed and runs locally with zero per-page cost. It scores 0.905 on opendataloader-bench — within 0.4 points of the paid #1. The cost crossover vs LlamaParse standard ($0.003/page) is around 15,000–20,000 pages per month. Below 1,000 pages per day, LlamaParse's free tier wins on simplicity if your documents are non-sensitive. See the &lt;a href="https://pdfmux.com/blog/pdfmux-vs-llamaparse/" rel="noopener noreferrer"&gt;pdfmux vs LlamaParse cost analysis&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does pdfmux work with LangChain?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. &lt;code&gt;from pdfmux.integrations.langchain import PDFMuxLoader&lt;/code&gt; returns standard LangChain &lt;code&gt;Document&lt;/code&gt; objects with metadata including page number, confidence score, and the extractor used. A LlamaIndex reader is also included at &lt;code&gt;pdfmux.integrations.llamaindex.PDFMuxReader&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can pdfmux replace Docling for table extraction?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. pdfmux 1.5.1 scores &lt;strong&gt;0.911 TEDS — the same as Docling 2.x — on opendataloader-bench&lt;/strong&gt;. It uses Docling internally for table-heavy pages and adds image-table OCR with spatial clustering for tables embedded as images. Because pdfmux routes per page, 90% of pages skip Docling entirely and run through PyMuPDF at 0.01s/page. If your corpus is 100% tables, Docling alone is fine. If it's mixed, pdfmux is faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which PDF extractor has the best benchmark scores in 2026?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On opendataloader-bench (200 PDFs, public methodology), the ranking is: hybrid AI 0.909 (paid, ~$0.01/page) → pdfmux 0.905 (free, MIT) → Docling 0.877 (free, Apache 2.0) → marker 0.861 (free, GPU recommended) → opendataloader 0.852 (free) → MinerU 0.831 (free, GPU recommended). LlamaParse and Unstructured do not publish opendataloader-bench scores. &lt;strong&gt;pdfmux is the highest-scoring free extractor by a margin of 2.8+ points.&lt;/strong&gt; See the &lt;a href="https://pdfmux.com/blog/benchmarking-pdf-extractors/" rel="noopener noreferrer"&gt;full benchmark&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does pdfmux have an MCP server?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. pdfmux ships a built-in Model Context Protocol server. Add it to Claude Desktop or Cursor with a one-line config and your agent can read PDFs natively. The server exposes four tools: &lt;code&gt;convert_pdf&lt;/code&gt;, &lt;code&gt;analyze_pdf&lt;/code&gt;, &lt;code&gt;batch_convert&lt;/code&gt;, and &lt;code&gt;extract_structured&lt;/code&gt;. LlamaParse, Docling, and Unstructured do not ship MCP servers as of April 2026. See &lt;a href="https://pdfmux.com/blog/mcp-server-pdf-ai-agent/" rel="noopener noreferrer"&gt;the pdfmux MCP guide&lt;/a&gt;.&lt;/p&gt;
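&lt;p&gt;For reference, a Claude Desktop entry follows the standard &lt;code&gt;mcpServers&lt;/code&gt; config shape. The command and args below are assumptions for illustration only; the pdfmux MCP guide has the exact invocation:&lt;/p&gt;

```json
{
  "mcpServers": {
    "pdfmux": {
      "command": "python",
      "args": ["-m", "pdfmux.mcp"]
    }
  }
}
```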

&lt;p&gt;&lt;strong&gt;Why does my RAG pipeline hallucinate on PDFs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Almost always because of bad ingestion, not bad retrieval. Most PDF extractors give you text and silently pass garbage downstream — blank pages returned as "successful," scrambled multi-column reading order, missing tables, mojibake on scanned pages. The model then retrieves and cites that garbage. pdfmux is the only tool on this list with &lt;strong&gt;per-page confidence scoring&lt;/strong&gt; — every page gets a 0.0–1.0 quality score from 4 signals (character density, alphabetic ratio, word structure, mojibake detection). Pages below threshold are auto-re-extracted with a different backend. In practice, this is the single biggest fix you can make for RAG hallucinations. See &lt;a href="https://pdfmux.com/blog/pdf-to-markdown-for-rag/" rel="noopener noreferrer"&gt;PDF to Markdown for RAG pipelines&lt;/a&gt; for the full pattern.&lt;/p&gt;
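&lt;p&gt;As a sketch of what page-level quality signals look like in practice, the function below scores a page from alphabetic ratio, word structure, and mojibake detection (character density needs page geometry and is omitted). The weights and thresholds are invented for illustration and are not pdfmux's actual formula:&lt;/p&gt;

```python
# Sketch of a page-quality score in the spirit of pdfmux's signals
# (alphabetic ratio, word structure, mojibake detection). Weights and
# thresholds are invented for illustration, not pdfmux's real formula.
def page_quality(text: str) -> float:
    if not text.strip():
        return 0.0  # blank page: never treat as a successful extraction
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    words = text.split()
    avg_len = sum(len(w) for w in words) / max(len(words), 1)
    word_ok = 1.0 if 12.0 >= avg_len >= 2.0 else 0.3
    mojibake = text.count("\ufffd") / len(text)
    score = 0.6 * alpha_ratio + 0.3 * word_ok - 10.0 * mojibake
    return round(max(0.0, min(score, 1.0)), 2)

print(page_quality("Revenue grew twelve percent year over year."))  # high
print(page_quality("\ufffd\ufffd\ufffd##\ufffd\ufffd"))             # 0.0
```

&lt;p&gt;In a pipeline, pages scoring below a threshold (say 0.7, matching the earlier pdfmux example) would be routed to a different backend instead of being passed silently downstream.&lt;/p&gt;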

&lt;p&gt;&lt;strong&gt;Can I run pdfmux without a GPU or API keys?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes — that's the default. The base install (&lt;code&gt;pip install pdfmux&lt;/code&gt;) handles digital PDFs at 0.01s/page on CPU. Add &lt;code&gt;pdfmux[ocr]&lt;/code&gt; (~200 MB) for scanned pages via RapidOCR, also CPU-only. Add &lt;code&gt;pdfmux[tables]&lt;/code&gt; (~500 MB) for Docling-grade table extraction. No GPU, no API keys, no telemetry. LlamaParse requires API keys. Docling and Unstructured &lt;code&gt;hi_res&lt;/code&gt; benefit significantly from a GPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;If you're starting a RAG pipeline today and don't know which extractor to pick, &lt;strong&gt;default to pdfmux&lt;/strong&gt;. It's free, MIT-licensed, runs locally, ships an MCP server, scores #2 on the public benchmark, and is the only one with per-page confidence signals. The cases where you'd pick another tool are specific and bounded: LlamaParse for sub-1k-pages-per-day non-sensitive prototyping, Docling for table-only corpora, Unstructured when PDF is one of 25+ file formats you ingest.&lt;/p&gt;

&lt;p&gt;For most teams, the answer is &lt;code&gt;pip install pdfmux&lt;/code&gt; — and the &lt;a href="https://pdfmux.com/" rel="noopener noreferrer"&gt;pdfmux homepage&lt;/a&gt; has the 5-minute quickstart.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Last Updated: 2026-04-26&lt;/em&gt;&lt;/p&gt;

</description>
      <category>pdf</category>
      <category>rag</category>
      <category>python</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
