Rost
Posted on • Originally published at glukhov.org

Extract Text from PDFs with PDFMiner in Python

PDFMiner.six is a powerful Python library for extracting text, metadata, and layout information from PDF documents.

Unlike simple PDF readers, it provides deep analysis of PDF structure and handles complex layouts effectively.

What is PDFMiner and Why Use It?

PDFMiner is a pure-Python library designed to extract and analyze text from PDF documents. The .six version is the actively maintained fork that supports Python 3.x, while the original PDFMiner project is no longer updated.

Key Features:

  • Pure Python implementation (no external dependencies)
  • Detailed layout analysis and text positioning
  • Font and character encoding detection
  • Support for encrypted PDFs
  • Command-line tools included
  • Extensible architecture for custom processing

PDFMiner is particularly useful when you need precise control over text extraction, need to preserve layout information, or work with complex multi-column documents. While it may be slower than some alternatives, its accuracy and detailed analysis capabilities make it the preferred choice for document processing pipelines. For the reverse workflow, you might also be interested in generating PDFs programmatically in Python.

Installation and Setup

Install PDFMiner.six using pip:

pip install pdfminer.six

For virtual environments (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install pdfminer.six

If you're new to Python package management, check out our Python Cheatsheet for more details on pip and virtual environments.

Verify the installation:

pdf2txt.py --version

The library includes two command-line tools:

  • pdf2txt.py - Extract text from PDFs
  • dumppdf.py - Dump PDF internal structure

(The latin2ascii.py tool shipped with the original PDFMiner but is not part of pdfminer.six.)

These tools complement other PDF manipulation utilities like Poppler that provide additional functionality such as page extraction and format conversion.

Basic Text Extraction

Simple Text Extraction

The most straightforward way to extract text from a PDF:

from pdfminer.high_level import extract_text

# Extract all text from a PDF
text = extract_text('document.pdf')
print(text)

This high-level API handles most common use cases and returns the entire document as a single string.

Extract Text from Specific Pages

To extract text from specific pages:

from pdfminer.high_level import extract_text

# Extract text from pages 2-5 (0-indexed)
text = extract_text('document.pdf', page_numbers=[1, 2, 3, 4])
print(text)

This is particularly useful for large documents where you only need certain sections, significantly improving performance.

Extract Text with Page Iteration

For processing pages individually:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

for page_layout in extract_pages('document.pdf'):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())

This approach gives you more control over how each page is processed, useful when working with documents where page structure varies.

Advanced Layout Analysis

Understanding LAParams

LAParams (Layout Analysis Parameters) control how PDFMiner interprets document layout. Understanding the difference between PDFMiner and simpler libraries is crucial here - PDFMiner actually analyzes the spatial relationships between text elements.

from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

# Create custom LAParams
laparams = LAParams(
    line_overlap=0.5,      # Min overlap for text lines
    char_margin=2.0,       # Character margin
    line_margin=0.5,       # Line margin
    word_margin=0.1,       # Word spacing
    boxes_flow=0.5,        # Box flow threshold
    detect_vertical=True,  # Detect vertical text
    all_texts=False        # Skip layout analysis of text inside figures
)

text = extract_text('document.pdf', laparams=laparams)

Parameter Explanation:

  • line_overlap: How much lines must overlap vertically to be considered the same line (0.0-1.0)
  • char_margin: Maximum spacing between characters in the same word (as multiple of character width)
  • line_margin: Maximum spacing between lines in the same paragraph
  • word_margin: Spacing threshold to separate words
  • boxes_flow: Threshold for text box flow direction
  • detect_vertical: Enable detection of vertical text (common in Asian languages)

Extracting Layout Information

Get detailed position and font information:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox, LTTextLine, LTChar

for page_layout in extract_pages('document.pdf'):
    for element in page_layout:
        if isinstance(element, LTTextBox):
            # Get bounding box coordinates
            x0, y0, x1, y1 = element.bbox
            print(f"Text at ({x0}, {y0}): {element.get_text()}")

            # Iterate through lines
            for text_line in element:
                if isinstance(text_line, LTTextLine):
                    # Get character-level details
                    for char in text_line:
                        if isinstance(char, LTChar):
                            print(f"Char: {char.get_text()}, "
                                  f"Font: {char.fontname}, "
                                  f"Size: {char.size}")

This level of detail is invaluable for document analysis, form extraction, or when you need to understand document structure programmatically.

Handling Different PDF Types

Encrypted PDFs

PDFMiner can handle password-protected PDFs:

from pdfminer.high_level import extract_text

# Extract from password-protected PDF
text = extract_text('encrypted.pdf', password='your_password')

Note that PDFMiner can only extract text from PDFs - it cannot bypass security restrictions that prevent text extraction at the PDF level.

Multi-Column Documents

For documents with multiple columns, tune LAParams:

from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

# Optimize for multi-column layouts
laparams = LAParams(
    detect_vertical=False,
    line_margin=0.3,
    word_margin=0.1,
    boxes_flow=0.3  # Lower value for better column detection
)

text = extract_text('multi_column.pdf', laparams=laparams)

The boxes_flow parameter is particularly important for multi-column documents - lower values help PDFMiner distinguish between separate columns.

Non-English and Unicode Text

PDFMiner handles Unicode well, but ensure proper encoding:

from pdfminer.high_level import extract_text

# extract_text returns a Unicode str, so no codec argument is needed
# (the codec parameter is deprecated in recent pdfminer.six releases)
text = extract_text('multilingual.pdf')

# Encoding matters when writing the result out - be explicit
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)

Working with Scanned PDFs

PDFMiner cannot extract text from scanned PDFs (images) directly. These require OCR (Optical Character Recognition). However, you can integrate PDFMiner with OCR tools.

Here's how to detect if a PDF is scanned and needs OCR:

from pdfminer.high_level import extract_text, extract_pages
from pdfminer.layout import LTFigure, LTImage

def is_scanned_pdf(pdf_path):
    """Check if PDF appears to be scanned (mostly images)"""
    text_count = 0
    image_count = 0

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if isinstance(element, (LTFigure, LTImage)):
                image_count += 1
            elif hasattr(element, 'get_text'):
                if element.get_text().strip():
                    text_count += 1

    # If mostly images and little text, likely scanned
    return image_count > text_count * 2

if is_scanned_pdf('document.pdf'):
    print("This PDF appears to be scanned - use OCR")
else:
    text = extract_text('document.pdf')
    print(text)

For scanned PDFs, consider integrating with Tesseract OCR or using tools to extract images from PDFs first, then applying OCR to those images.
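One way to wire that integration together is sketched below. It assumes the third-party pdf2image and pytesseract packages, which in turn need the Poppler and Tesseract binaries installed; the imports are deferred so the rest of a pipeline still works without them.

```python
def ocr_pdf(pdf_path, dpi=300):
    """Render each page to an image, then OCR it with Tesseract.

    Requires the third-party pdf2image and pytesseract packages
    (plus the Poppler and Tesseract system binaries). Imports are
    deferred so this optional dependency is only needed when called.
    """
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return '\n'.join(pytesseract.image_to_string(page) for page in pages)
```

Combined with the is_scanned_pdf check above, this gives a pipeline that only pays the OCR cost when native text extraction would come up empty.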

Command-Line Usage

PDFMiner includes powerful command-line tools:

Extract Text with Command-Line Tools

# Extract text to stdout
pdf2txt.py document.pdf

# Save to file
pdf2txt.py -o output.txt document.pdf

# Extract specific pages
pdf2txt.py -p 1,2,3 document.pdf

# Extract as HTML
pdf2txt.py -t html -o output.html document.pdf

Advanced Options

# Custom layout parameters (-L line margin, -W word margin)
pdf2txt.py -L 0.3 -W 0.1 document.pdf

# Extract with detailed layout (XML)
pdf2txt.py -t xml -o layout.xml document.pdf

# Set password for encrypted PDF
pdf2txt.py -P mypassword encrypted.pdf

These command-line tools are excellent for quick testing, shell scripts, and integration into automated workflows.

Performance Optimization

Processing Large PDFs

For large documents, consider these optimization strategies:

from pdfminer.high_level import extract_pages, extract_text

# Process only needed pages
def extract_page_range(pdf_path, start_page, end_page):
    """Return text from pages start_page..end_page-1 (0-indexed)."""
    pages_text = []
    for i, page_layout in enumerate(extract_pages(pdf_path)):
        if i < start_page:
            continue
        if i >= end_page:
            break
        pages_text.append(''.join(
            element.get_text()
            for element in page_layout
            if hasattr(element, 'get_text')))
    return '\n'.join(pages_text)

# Note: passing laparams=None to extract_text does NOT disable layout
# analysis - the function falls back to default LAParams internally.
# To limit work on large files, cap the page count instead:
text = extract_text('large.pdf', maxpages=10)  # Stop after the first 10 pages

Batch Processing

For processing multiple PDFs efficiently:

from multiprocessing import Pool
from pdfminer.high_level import extract_text
import os

def process_pdf(pdf_path):
    """Process single PDF file"""
    try:
        text = extract_text(pdf_path)
        output_path = pdf_path.replace('.pdf', '.txt')
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(text)
        return f"Processed: {pdf_path}"
    except Exception as e:
        return f"Error processing {pdf_path}: {str(e)}"

# Process PDFs in parallel
def batch_process_pdfs(pdf_directory, num_workers=4):
    pdf_files = [os.path.join(pdf_directory, f) 
                 for f in os.listdir(pdf_directory) 
                 if f.endswith('.pdf')]

    with Pool(num_workers) as pool:
        results = pool.map(process_pdf, pdf_files)

    for result in results:
        print(result)

# Usage
batch_process_pdfs('/path/to/pdfs', num_workers=4)

Common Issues and Solutions

Issue: Incorrect Text Order

Problem: Extracted text appears jumbled or out of order.

Solution: Adjust LAParams, especially boxes_flow:

from pdfminer.layout import LAParams
laparams = LAParams(boxes_flow=0.3)  # Try different values
text = extract_text('document.pdf', laparams=laparams)

Issue: Missing Spaces Between Words

Problem: Words run together without spaces.

Solution: Increase word_margin:

laparams = LAParams(word_margin=0.2)  # Increase from default 0.1
text = extract_text('document.pdf', laparams=laparams)

Issue: Encoding Errors

Problem: Strange characters or encoding errors.

Solution: extract_text returns a Unicode str, so most "encoding errors" actually occur when writing the output. Open output files with an explicit encoding:

with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(extract_text('document.pdf'))

Issue: Memory Errors with Large PDFs

Problem: Out of memory errors with large files.

Solution: Process page by page:

from pdfminer.high_level import extract_pages

def extract_text_chunked(pdf_path, chunk_size=10):
    """Generator: yield extracted text in chunks to reduce memory usage"""
    all_text = []
    page_count = 0

    for page_layout in extract_pages(pdf_path):
        page_text = []
        for element in page_layout:
            if hasattr(element, 'get_text'):
                page_text.append(element.get_text())

        all_text.append(''.join(page_text))
        page_count += 1

        # Process in chunks
        if page_count % chunk_size == 0:
            yield ''.join(all_text)
            all_text = []

    # Yield remaining text
    if all_text:
        yield ''.join(all_text)

Comparing PDFMiner with Alternatives

Understanding when to use PDFMiner versus other libraries is important:

PDFMiner vs PyPDF2

PyPDF2 is simpler and faster but less accurate:

  • Use PyPDF2 for: Simple PDFs, quick extraction, merging/splitting PDFs
  • Use PDFMiner for: Complex layouts, accurate text positioning, detailed analysis
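For contrast, here is what the quick route looks like with pypdf, the maintained successor to PyPDF2 (a sketch with the import deferred, since it is a separate dependency):

```python
def extract_with_pypdf(pdf_path):
    """Fast, simple extraction; fine for single-column PDFs,
    but without PDFMiner's layout analysis."""
    from pypdf import PdfReader  # deferred: separate dependency

    reader = PdfReader(pdf_path)
    # extract_text() is per-page; join pages with newlines
    return '\n'.join(page.extract_text() or '' for page in reader.pages)
```

When this produces acceptable output for your documents, it is usually noticeably faster than a full PDFMiner layout analysis.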

PDFMiner vs pdfplumber

pdfplumber builds on PDFMiner with a higher-level API:

  • Use pdfplumber for: Table extraction, simpler API, quick prototyping
  • Use PDFMiner for: Maximum control, custom processing, production systems

PDFMiner vs PyMuPDF (fitz)

PyMuPDF is significantly faster but has C dependencies:

  • Use PyMuPDF for: Performance-critical applications, large-scale processing
  • Use PDFMiner for: Pure Python requirement, detailed layout analysis

Practical Example: Extract and Analyze Document

Here's a complete example that extracts text and provides document statistics:

from pdfminer.high_level import extract_pages, extract_text
from pdfminer.layout import LTTextBox, LTChar
from collections import Counter

def analyze_pdf(pdf_path):
    """Extract text and provide document analysis"""

    # Extract full text
    full_text = extract_text(pdf_path)

    # Statistics
    stats = {
        'total_chars': len(full_text),
        'total_words': len(full_text.split()),
        'total_lines': full_text.count('\n'),
        'fonts': Counter(),
        'font_sizes': Counter(),
        'pages': 0
    }

    # Detailed analysis
    for page_layout in extract_pages(pdf_path):
        stats['pages'] += 1

        for element in page_layout:
            if isinstance(element, LTTextBox):
                for line in element:
                    for char in line:
                        if isinstance(char, LTChar):
                            stats['fonts'][char.fontname] += 1
                            stats['font_sizes'][round(char.size, 1)] += 1

    return {
        'text': full_text,
        'stats': stats,
        'most_common_font': stats['fonts'].most_common(1)[0] if stats['fonts'] else None,
        'most_common_size': stats['font_sizes'].most_common(1)[0] if stats['font_sizes'] else None
    }

# Usage
result = analyze_pdf('document.pdf')
print(f"Pages: {result['stats']['pages']}")
print(f"Words: {result['stats']['total_words']}")
print(f"Main font: {result['most_common_font']}")
print(f"Main size: {result['most_common_size']}")

Integration with Document Processing Pipelines

PDFMiner works well in larger document processing workflows. For example, when building RAG (Retrieval-Augmented Generation) systems or document management solutions, you might combine it with other Python tools for a complete pipeline.

Once you've extracted text from PDFs, you often need to convert it to other formats. You can convert HTML content to Markdown using Python libraries or even leverage LLM-powered conversion with Ollama for intelligent document transformation. These techniques are particularly useful when PDF extraction produces HTML-like structured text that needs cleaning and reformatting.

For comprehensive document conversion pipelines, you might also need to handle Word document to Markdown conversion, creating a unified workflow that processes multiple document formats into a common output format.

Best Practices

  1. Always use LAParams for complex documents - The default settings work for simple documents, but tuning LAParams significantly improves results for complex layouts.

  2. Test with sample pages first - Before processing large batches, test your extraction settings on representative samples.

  3. Handle exceptions gracefully - PDF files can be corrupted or malformed. Always wrap extraction code in try-except blocks.

  4. Cache extracted text - For repeated processing, cache extracted text to avoid re-processing.

  5. Validate extracted text - Implement checks to verify extraction quality (e.g., minimum text length, expected keywords).

  6. Consider alternatives for specific use cases - While PDFMiner is powerful, sometimes specialized tools (like tabula-py for tables) are more appropriate.

  7. Keep PDFMiner updated - The .six fork is actively maintained. Keep it updated for bug fixes and improvements.

  8. Document your code properly - When sharing PDF extraction scripts, use proper Markdown code blocks with syntax highlighting for better readability.

Conclusion

PDFMiner.six is an essential tool for Python developers working with PDF documents. Its pure-Python implementation, detailed layout analysis, and extensible architecture make it ideal for production document processing systems. While it may have a steeper learning curve than simpler libraries, the precision and control it offers are unmatched for complex PDF extraction tasks.

Whether you're building a document management system, analyzing scientific papers, or extracting data for machine learning pipelines, PDFMiner provides the foundation for reliable PDF text extraction in Python.
