Aman

Posted on • Originally published at onlyoneaman.Medium

I Tested 7 Python PDF Extractors So You Don’t Have To (2025 Edition)

Why This Even Matters

PDF extraction sounds boring until you need it. Then it becomes the bottleneck in everything you’re trying to build.

Maybe you’re building a document search system and need clean text for indexing. Maybe you’re creating embeddings for a RAG pipeline, and garbage text means garbage vectors. Maybe you’re processing invoices, analysing research papers, or just trying to get data out of that quarterly report someone sent you.

For small PDFs? Sure, you can often just pass the whole thing to Claude or GPT-4. But when you’re dealing with hundreds of documents, building search systems, or need structured data for processing, that’s when extraction quality actually matters.

So I decided to test the most popular Python libraries the way most developers would actually use them: minimal setup, basic extraction, real-world document.

What I Actually Tested

The Document: A typical business PDF — one page with headers, body text, a six-column table, and an image. The kind of thing that shows up in email attachments daily. You can find it here.

The Environment:

  • MacBook M2 Pro, 16GB RAM, macOS 15
  • Fresh Python 3.11 virtual environment
  • Clean pip installs, no optimisations

You can find the code for this test here.

Testing Approach: I used the simplest possible implementation for each library — the approach you’d try first when you’re just getting started. Most of these packages have advanced configuration options, specialised table extraction methods, and layout analysis features that could dramatically change results. But I wanted to see what you get with minimal effort.

What I Measured:

  • Speed: How fast can it process a single page?
  • Text Quality: Is the output readable and properly formatted?
  • Table Handling: Do tables survive extraction in usable form?
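One way to reproduce this kind of speed measurement is a small `perf_counter` harness; a minimal, library-agnostic sketch (the lambda extractor here is a placeholder, not the author's actual test code — swap in any of the library calls shown below):

```python
import time

def benchmark(extract, path, runs=3):
    """Time an extraction callable over several runs; keep the best time."""
    best, text = float("inf"), ""
    for _ in range(runs):
        start = time.perf_counter()
        text = extract(path)
        best = min(best, time.perf_counter() - start)
    return best, text

# Placeholder extractor; replace with e.g. a pypdf or pdfplumber call.
elapsed, text = benchmark(lambda path: f"text from {path}", "doc.pdf")
```

Taking the best of several runs smooths over one-off warm-up costs such as model loading or import time.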

The Libraries (Quick Tests & Honest First Impressions)

pypdfium2 — The Speed Champion

# pip install pypdfium2
import pypdfium2 as pdfium
text = "\n".join(
    p.get_textpage().get_text_range() 
    for p in pdfium.PdfDocument("doc.pdf")
)

What I got: Clean, readable text in 0.004 seconds. No formatting, no table structure — just fast, basic extraction.

Good for: High-volume processing, simple content indexing, when speed matters more than structure.

Consider if: You need formatting preservation or structured data extraction.

pypdf — The Reliable Default

# pip install pypdf
from pypdf import PdfReader
reader = PdfReader("doc.pdf")
text = "\n".join(p.extract_text() for p in reader.pages)

What I got: Solid text extraction with occasional spacing quirks. Works everywhere, no C dependencies.

Good for: Lambda functions, containerised apps, environments where you can’t compile extensions.

Consider if: Text fidelity is critical for your downstream processing.

pdfplumber — The Data Extraction Tool

# pip install pdfplumber
import pdfplumber
with pdfplumber.open("doc.pdf") as pdf:
    first_page = pdf.pages[0]
    text = first_page.extract_text()
    table = first_page.extract_table()

What I got: Basic text had some concatenation issues, but the table extraction worked well. This library has extensive options for fine-tuning that I didn’t explore.

Good for: When you specifically need tabular data, coordinate-based extraction, or detailed layout control.

Consider if: You just need clean text without heavy configuration.
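For downstream use, `extract_table` returns a list of rows, where each row is a list of cell strings (with `None` for empty cells). That shape often wants reshaping into records; a small pure-Python helper, shown here on a hard-coded sample rather than a real pdfplumber extraction:

```python
def rows_to_records(table):
    """Convert extract_table()-style output (first row = headers, cells may
    be None) into a list of dicts keyed by the header row."""
    header, *rows = table
    header = [(h or "").strip() for h in header]
    return [
        {key: (cell or "").strip() for key, cell in zip(header, row)}
        for row in rows
    ]

# Sample shaped like pdfplumber's output for a small table:
sample = [["Item", "Qty"], ["Widget", "3"], [None, "5"]]
records = rows_to_records(sample)
# records[0] == {"Item": "Widget", "Qty": "3"}
```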

pymupdf4llm — The Markdown Generator

# pip install pymupdf4llm
import pymupdf4llm
markdown = pymupdf4llm.to_markdown("doc.pdf")

What I got: Clean markdown output in 0.14 seconds with proper headings and table formatting. Surprisingly good results.

Good for: Content systems, documentation processing, when you need structured text that preserves hierarchy.

Consider if: You’re dealing with complex multi-column layouts that might get scrambled.
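Because the output is plain markdown, downstream structuring needs no extra library. As an example, a sketch that splits markdown into (heading, body) sections for a content system (operating on a hard-coded string, not actual pymupdf4llm output):

```python
def split_by_heading(markdown):
    """Split markdown into (heading, body) pairs at lines starting with '#'."""
    sections, heading, body = [], None, []
    for line in markdown.splitlines():
        if line.startswith("#"):
            if heading is not None or body:
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("# ").strip(), []
        else:
            body.append(line)
    sections.append((heading, "\n".join(body).strip()))
    return sections

md = "# Report\nIntro text.\n## Results\nTable here."
sections = split_by_heading(md)
# → [("Report", "Intro text."), ("Results", "Table here.")]
```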

unstructured — The Semantic Chunker

# pip install "unstructured[all-docs]"
from unstructured.partition.auto import partition
blocks = partition(filename="doc.pdf")
for block in blocks:
    print(f"{block.category}: {block.text}")

What I got: Semantically labelled chunks (Title, NarrativeText, etc.) in 1.11 seconds. Perfect for downstream processing.

Good for: RAG systems, document analysis, when you need meaningful content boundaries for embeddings.

Consider if: You just need raw text content without semantic analysis.
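unstructured also ships its own chunking helpers (e.g. `chunk_by_title` in `unstructured.chunking.title`); the core idea can be sketched in plain Python, with (category, text) tuples standing in for its element objects:

```python
def chunk_by_title(blocks):
    """Group (category, text) pairs into chunks, starting a new chunk at
    each Title block — roughly what title-based chunking does for RAG."""
    chunks, current = [], []
    for category, text in blocks:
        if category == "Title" and current:
            chunks.append("\n".join(current))
            current = []
        current.append(text)
    if current:
        chunks.append("\n".join(current))
    return chunks

blocks = [
    ("Title", "Q3 Results"),
    ("NarrativeText", "Revenue grew 12%."),
    ("Title", "Outlook"),
    ("NarrativeText", "Guidance unchanged."),
]
chunks = chunk_by_title(blocks)
# → ["Q3 Results\nRevenue grew 12%.", "Outlook\nGuidance unchanged."]
```

Each chunk then maps cleanly onto one embedding, with the title kept alongside its body text.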

marker-pdf — The Layout Perfectionist

# pip install marker-pdf
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
text, _, _ = text_from_rendered(
    PdfConverter(create_model_dict())("doc.pdf")
)

What I got: Stunning layout-perfect markdown with inline images. Takes 12 seconds and downloads a 1GB model on first run.

Good for: Layout-critical extraction, vision model inputs, high-quality document conversion.

Consider if: You’re processing documents in real-time or have resource constraints.

textract — The Universal Handler

# pip install textract  # Requires Tesseract
import textract
text = textract.process("doc.pdf").decode()

What I got: Fast extraction (0.05s) with automatic OCR fallback capability. Handles many file formats beyond PDF.

Good for: Mixed document types, pipelines where some files may be scanned, and robust document processing in general.

Consider if: You only handle digital PDFs and want to avoid additional dependencies.

Real-World Performance Results

Here’s what actually happened with my test document:

marker-pdf (11.3s): Perfect structure preservation, ideal for high-quality conversions, though by far the slowest

pymupdf4llm (0.12s): Excellent markdown output, great balance of speed and quality

unstructured (1.29s): Clean semantic chunks, perfect for RAG workflows

textract (0.21s): Fast with OCR capabilities, minor formatting variations

pypdfium2 (0.003s): Blazing speed, clean basic text, no structure

pypdf (0.024s): Reliable extraction, occasional spacing artifacts

pdfplumber (0.10s): Good for tables, text extraction needs configuration

Important caveat: These results reflect basic usage with minimal configuration. Each library has advanced features that could significantly change performance for specific use cases. You can find the link to all results in the references.

Takeaways

Context matters more than raw performance. The “best” extractor depends entirely on what you’re building and how you’ll use the extracted text.

Simple often wins. For many use cases, basic text extraction is perfectly adequate. Don’t over-engineer unless you actually need the advanced features.

Test with your data. PDF structures vary wildly. What works great on my test document might fail on your quarterly reports.

Have a fallback plan. For production systems, consider hybrid approaches: fast extraction first, and more sophisticated methods for edge cases.

Advanced features exist. This comparison only scratched the surface. Most libraries have configuration options that could completely change the results.
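The fallback idea can be sketched with stub extractors standing in for real libraries: try the fast path first and escalate only when it returns suspiciously little text, which is a common sign of a scanned PDF. The callables below are placeholders, not real library calls:

```python
def extract_with_fallback(path, fast, thorough, min_chars=50):
    """Try the fast extractor first; fall back when it yields too little text."""
    text = fast(path)
    if len(text.strip()) >= min_chars:
        return text, "fast"
    return thorough(path), "fallback"

# Stub extractors for illustration; swap in real library calls.
fast = lambda path: ""                    # e.g. pypdfium2 on a scanned PDF
thorough = lambda path: "OCR text " * 20  # e.g. an OCR-based extractor

text, used = extract_with_fallback("doc.pdf", fast, thorough)
# used == "fallback"
```

The `min_chars` threshold is a rough heuristic; tune it to your documents, since some valid PDFs (title pages, diagrams) legitimately contain little text.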

Next Steps

The scope of this experiment was limited, but it should cover most basic extraction use cases. Next, I plan to tackle problems like those listed below. If there is anything else I should add, please let me know.

  • More document types (DOC, DOCX, …): these are also very common formats; we either need to convert them to a shared intermediate format or check each package for direct compatibility.
  • Handling Password-Protected PDFs
  • Dealing with OCR Text: PDF files may contain scanned images of text, which cannot be extracted using standard methods. To handle OCR (Optical Character Recognition) text, specialised libraries like pytesseract (a wrapper for Google’s Tesseract OCR engine) can be used to extract text from the images.
  • More edge cases within PDFs and other documents: embedded images, vector drawings, and more.
  • Forms, especially with checkboxes
  • Rotated PDFs
  • Right-to-left scripts (Arabic/Hebrew).
  • DOCX containing elements like an embedded Excel chart or a floating text box.

Bottom Line

Pick the tool that fits your actual requirements, not the one with the highest benchmark scores.

For most document processing needs, pymupdf4llm hits the sweet spot of speed and quality. For RAG systems, unstructured gives you better semantic chunks. For pure speed, pypdfium2 is hard to beat.

But honestly? The extraction is usually the easy part. The real work happens in how you process, chunk, and use that text afterwards.

Found different results with your documents or discovered better approaches? I’d love to hear about it — this space moves fast, and real-world feedback keeps comparisons honest.

I hope you found this article useful 😄. Thank you for reading.

Follow me on Medium for more articles. Also, let’s connect on Twitter and LinkedIn. ☕️

References:

  • Test PDF: https://bella.amankumar.ai/examples/pdf/11-google-doc-document/google-doc-document.pdf
  • Test code (gist): https://gist.github.com/onlyoneaman/a479b3875524d39cee234f013866015e
  • Final results ZIP: http://bella.amankumar.ai/examples/pdf/1.zip
  • GitHub - py-pdf/benchmarks
  • pymupdf.readthedocs.io
  • marker-pdf
  • aiwand - pypi.org
  • amankumar.ai
