Rost
Posted on • Originally published at glukhov.org

Extract Text from PDFs with PDFMiner in Python

PDFMiner.six is a powerful Python library for extracting text, metadata, and layout information from PDF documents.

Unlike simple PDF readers, it provides deep analysis of PDF structure and handles complex layouts effectively.

What is PDFMiner and Why Use It?

PDFMiner is a pure-Python library designed to extract and analyze text from PDF documents. The .six version is the actively maintained fork that supports Python 3.x, while the original PDFMiner project is no longer updated.

Key Features:

  • Pure Python implementation (no external dependencies)
  • Detailed layout analysis and text positioning
  • Font and character encoding detection
  • Support for encrypted PDFs
  • Command-line tools included
  • Extensible architecture for custom processing

PDFMiner is particularly useful when you need precise control over text extraction, need to preserve layout information, or work with complex multi-column documents. While it may be slower than some alternatives, its accuracy and detailed analysis capabilities make it the preferred choice for document processing pipelines. For the reverse workflow, you might also be interested in generating PDFs programmatically in Python.

Installation and Setup

Install PDFMiner.six using pip:

pip install pdfminer.six

For virtual environments (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install pdfminer.six

If you're new to Python package management, check out our Python Cheatsheet for more details on pip and virtual environments.

Verify the installation:

pdf2txt.py --version

The library includes two command-line tools:

  • pdf2txt.py - Extract text from PDFs
  • dumppdf.py - Dump PDF internal structure

(The latin2ascii.py tool shipped with the original PDFMiner but is not part of pdfminer.six.)

These tools complement other PDF manipulation utilities like Poppler that provide additional functionality such as page extraction and format conversion.

Basic Text Extraction

Simple Text Extraction

The most straightforward way to extract text from a PDF:

from pdfminer.high_level import extract_text

# Extract all text from a PDF
text = extract_text('document.pdf')
print(text)

This high-level API handles most common use cases and returns the entire document as a single string.

Extract Text from Specific Pages

To extract text from specific pages:

from pdfminer.high_level import extract_text

# Extract text from pages 2-5 (0-indexed)
text = extract_text('document.pdf', page_numbers=[1, 2, 3, 4])
print(text)

This is particularly useful for large documents where you only need certain sections, significantly improving performance.

Extract Text with Page Iteration

For processing pages individually:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

for page_layout in extract_pages('document.pdf'):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())

This approach gives you more control over how each page is processed, useful when working with documents where page structure varies.

Advanced Layout Analysis

Understanding LAParams

LAParams (Layout Analysis Parameters) control how PDFMiner interprets document layout. Understanding the difference between PDFMiner and simpler libraries is crucial here - PDFMiner actually analyzes the spatial relationships between text elements.

from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

# Create custom LAParams
laparams = LAParams(
    line_overlap=0.5,      # Min overlap for text lines
    char_margin=2.0,       # Character margin
    line_margin=0.5,       # Line margin
    word_margin=0.1,       # Word spacing
    boxes_flow=0.5,        # Box flow threshold
    detect_vertical=True,  # Detect vertical text
    all_texts=False        # Skip layout analysis of text inside figures
)

text = extract_text('document.pdf', laparams=laparams)

Parameter Explanation:

  • line_overlap: How much lines must overlap vertically to be considered the same line (0.0-1.0)
  • char_margin: Maximum spacing between characters in the same word (as multiple of character width)
  • line_margin: Maximum spacing between lines in the same paragraph
  • word_margin: Spacing threshold to separate words
  • boxes_flow: Threshold for text box flow direction
  • detect_vertical: Enable detection of vertical text (common in Asian languages)

Extracting Layout Information

Get detailed position and font information:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox, LTTextLine, LTChar

for page_layout in extract_pages('document.pdf'):
    for element in page_layout:
        if isinstance(element, LTTextBox):
            # Get bounding box coordinates
            x0, y0, x1, y1 = element.bbox
            print(f"Text at ({x0}, {y0}): {element.get_text()}")

            # Iterate through lines
            for text_line in element:
                if isinstance(text_line, LTTextLine):
                    # Get character-level details
                    for char in text_line:
                        if isinstance(char, LTChar):
                            print(f"Char: {char.get_text()}, "
                                  f"Font: {char.fontname}, "
                                  f"Size: {char.size}")

This level of detail is invaluable for document analysis, form extraction, or when you need to understand document structure programmatically.

Handling Different PDF Types

Encrypted PDFs

PDFMiner can handle password-protected PDFs:

from pdfminer.high_level import extract_text

# Extract from password-protected PDF
text = extract_text('encrypted.pdf', password='your_password')

Note that PDFMiner can only extract text from PDFs - it cannot bypass security restrictions that prevent text extraction at the PDF level.

Multi-Column Documents

For documents with multiple columns, tune LAParams:

from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

# Optimize for multi-column layouts
laparams = LAParams(
    detect_vertical=False,
    line_margin=0.3,
    word_margin=0.1,
    boxes_flow=0.3  # Lower value for better column detection
)

text = extract_text('multi_column.pdf', laparams=laparams)

The boxes_flow parameter is particularly important for multi-column documents - lower values help PDFMiner distinguish between separate columns.

Non-English and Unicode Text

PDFMiner handles Unicode well, but ensure proper encoding:

from pdfminer.high_level import extract_text

# extract_text returns a Unicode str, so no codec argument is needed
# (the codec parameter is deprecated in recent pdfminer.six releases)
text = extract_text('multilingual.pdf')

# Encoding matters when writing the result out - be explicit
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)

Working with Scanned PDFs

PDFMiner cannot extract text from scanned PDFs (images) directly. These require OCR (Optical Character Recognition). However, you can integrate PDFMiner with OCR tools.

Here's how to detect if a PDF is scanned and needs OCR:

from pdfminer.high_level import extract_text, extract_pages
from pdfminer.layout import LTFigure, LTImage

def is_scanned_pdf(pdf_path):
    """Check if PDF appears to be scanned (mostly images)"""
    text_count = 0
    image_count = 0

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if isinstance(element, (LTFigure, LTImage)):
                image_count += 1
            elif hasattr(element, 'get_text'):
                if element.get_text().strip():
                    text_count += 1

    # If mostly images and little text, likely scanned
    return image_count > text_count * 2

if is_scanned_pdf('document.pdf'):
    print("This PDF appears to be scanned - use OCR")
else:
    text = extract_text('document.pdf')
    print(text)

For scanned PDFs, consider integrating with Tesseract OCR or using tools to extract images from PDFs first, then applying OCR to those images.
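One way to wire that integration together is sketched below. It assumes the third-party pdf2image and pytesseract packages, which in turn need the Poppler and Tesseract binaries installed; the imports are deferred so the rest of a pipeline still works without them.

```python
def ocr_pdf(pdf_path, dpi=300):
    """Render each page to an image, then OCR it with Tesseract.

    Requires the third-party pdf2image and pytesseract packages
    (plus the Poppler and Tesseract system binaries). Imports are
    deferred so this optional dependency is only needed when called.
    """
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return '\n'.join(pytesseract.image_to_string(page) for page in pages)
```

Combined with the is_scanned_pdf check above, this gives a pipeline that only pays the OCR cost when native text extraction would come up empty.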

Command-Line Usage

PDFMiner includes powerful command-line tools:

Extract Text with Command-Line Tools

# Extract text to stdout
pdf2txt.py document.pdf

# Save to file
pdf2txt.py -o output.txt document.pdf

# Extract specific pages
pdf2txt.py -p 1,2,3 document.pdf

# Extract as HTML
pdf2txt.py -t html -o output.html document.pdf

Advanced Options

# Custom layout parameters (-L line margin, -W word margin)
pdf2txt.py -L 0.3 -W 0.1 document.pdf

# Extract with detailed layout (XML)
pdf2txt.py -t xml -o layout.xml document.pdf

# Set password for encrypted PDF
pdf2txt.py -P mypassword encrypted.pdf

These command-line tools are excellent for quick testing, shell scripts, and integration into automated workflows.

Performance Optimization

Processing Large PDFs

For large documents, consider these optimization strategies:

from pdfminer.high_level import extract_pages, extract_text

# Process only needed pages
def extract_page_range(pdf_path, start_page, end_page):
    """Return text from pages start_page..end_page-1 (0-indexed)."""
    pages_text = []
    for i, page_layout in enumerate(extract_pages(pdf_path)):
        if i < start_page:
            continue
        if i >= end_page:
            break
        pages_text.append(''.join(
            element.get_text()
            for element in page_layout
            if hasattr(element, 'get_text')))
    return '\n'.join(pages_text)

# Note: passing laparams=None to extract_text does NOT disable layout
# analysis - the function falls back to default LAParams internally.
# To limit work on large files, cap the page count instead:
text = extract_text('large.pdf', maxpages=10)  # Stop after the first 10 pages

Batch Processing

For processing multiple PDFs efficiently:

from multiprocessing import Pool
from pdfminer.high_level import extract_text
import os

def process_pdf(pdf_path):
    """Process single PDF file"""
    try:
        text = extract_text(pdf_path)
        output_path = pdf_path.replace('.pdf', '.txt')
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(text)
        return f"Processed: {pdf_path}"
    except Exception as e:
        return f"Error processing {pdf_path}: {str(e)}"

# Process PDFs in parallel
def batch_process_pdfs(pdf_directory, num_workers=4):
    pdf_files = [os.path.join(pdf_directory, f) 
                 for f in os.listdir(pdf_directory) 
                 if f.endswith('.pdf')]

    with Pool(num_workers) as pool:
        results = pool.map(process_pdf, pdf_files)

    for result in results:
        print(result)

# Usage
batch_process_pdfs('/path/to/pdfs', num_workers=4)

Common Issues and Solutions

Issue: Incorrect Text Order

Problem: Extracted text appears jumbled or out of order.

Solution: Adjust LAParams, especially boxes_flow:

from pdfminer.layout import LAParams
laparams = LAParams(boxes_flow=0.3)  # Try different values
text = extract_text('document.pdf', laparams=laparams)

Issue: Missing Spaces Between Words

Problem: Words run together without spaces.

Solution: Increase word_margin:

laparams = LAParams(word_margin=0.2)  # Increase from default 0.1
text = extract_text('document.pdf', laparams=laparams)

Issue: Encoding Errors

Problem: Strange characters or encoding errors.

Solution: extract_text returns a Unicode str, so most "encoding errors" actually occur when writing the output. Open output files with an explicit encoding:

with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(extract_text('document.pdf'))

Issue: Memory Errors with Large PDFs

Problem: Out of memory errors with large files.

Solution: Process page by page:

from pdfminer.high_level import extract_pages

def extract_text_chunked(pdf_path, chunk_size=10):
    """Generator: yield extracted text in chunks to reduce memory usage"""
    all_text = []
    page_count = 0

    for page_layout in extract_pages(pdf_path):
        page_text = []
        for element in page_layout:
            if hasattr(element, 'get_text'):
                page_text.append(element.get_text())

        all_text.append(''.join(page_text))
        page_count += 1

        # Process in chunks
        if page_count % chunk_size == 0:
            yield ''.join(all_text)
            all_text = []

    # Yield remaining text
    if all_text:
        yield ''.join(all_text)

Comparing PDFMiner with Alternatives

Understanding when to use PDFMiner versus other libraries is important:

PDFMiner vs PyPDF2

PyPDF2 is simpler and faster but less accurate:

  • Use PyPDF2 for: Simple PDFs, quick extraction, merging/splitting PDFs
  • Use PDFMiner for: Complex layouts, accurate text positioning, detailed analysis
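For contrast, here is what the quick route looks like with pypdf, the maintained successor to PyPDF2 (a sketch with the import deferred, since it is a separate dependency):

```python
def extract_with_pypdf(pdf_path):
    """Fast, simple extraction; fine for single-column PDFs,
    but without PDFMiner's layout analysis."""
    from pypdf import PdfReader  # deferred: separate dependency

    reader = PdfReader(pdf_path)
    # extract_text() is per-page; join pages with newlines
    return '\n'.join(page.extract_text() or '' for page in reader.pages)
```

When this produces acceptable output for your documents, it is usually noticeably faster than a full PDFMiner layout analysis.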

PDFMiner vs pdfplumber

pdfplumber builds on PDFMiner with a higher-level API:

  • Use pdfplumber for: Table extraction, simpler API, quick prototyping
  • Use PDFMiner for: Maximum control, custom processing, production systems

PDFMiner vs PyMuPDF (fitz)

PyMuPDF is significantly faster but has C dependencies:

  • Use PyMuPDF for: Performance-critical applications, large-scale processing
  • Use PDFMiner for: Pure Python requirement, detailed layout analysis

Practical Example: Extract and Analyze Document

Here's a complete example that extracts text and provides document statistics:

from pdfminer.high_level import extract_pages, extract_text
from pdfminer.layout import LTTextBox, LTChar
from collections import Counter

def analyze_pdf(pdf_path):
    """Extract text and provide document analysis"""

    # Extract full text
    full_text = extract_text(pdf_path)

    # Statistics
    stats = {
        'total_chars': len(full_text),
        'total_words': len(full_text.split()),
        'total_lines': full_text.count('\n'),
        'fonts': Counter(),
        'font_sizes': Counter(),
        'pages': 0
    }

    # Detailed analysis
    for page_layout in extract_pages(pdf_path):
        stats['pages'] += 1

        for element in page_layout:
            if isinstance(element, LTTextBox):
                for line in element:
                    for char in line:
                        if isinstance(char, LTChar):
                            stats['fonts'][char.fontname] += 1
                            stats['font_sizes'][round(char.size, 1)] += 1

    return {
        'text': full_text,
        'stats': stats,
        'most_common_font': stats['fonts'].most_common(1)[0] if stats['fonts'] else None,
        'most_common_size': stats['font_sizes'].most_common(1)[0] if stats['font_sizes'] else None
    }

# Usage
result = analyze_pdf('document.pdf')
print(f"Pages: {result['stats']['pages']}")
print(f"Words: {result['stats']['total_words']}")
print(f"Main font: {result['most_common_font']}")
print(f"Main size: {result['most_common_size']}")

Integration with Document Processing Pipelines

PDFMiner works well in larger document processing workflows. For example, when building RAG (Retrieval-Augmented Generation) systems or document management solutions, you might combine it with other Python tools for a complete pipeline.

Once you've extracted text from PDFs, you often need to convert it to other formats. You can convert HTML content to Markdown using Python libraries or even leverage LLM-powered conversion with Ollama for intelligent document transformation. These techniques are particularly useful when PDF extraction produces HTML-like structured text that needs cleaning and reformatting.

For comprehensive document conversion pipelines, you might also need to handle Word document to Markdown conversion, creating a unified workflow that processes multiple document formats into a common output format.

Best Practices

  1. Always use LAParams for complex documents - The default settings work for simple documents, but tuning LAParams significantly improves results for complex layouts.

  2. Test with sample pages first - Before processing large batches, test your extraction settings on representative samples.

  3. Handle exceptions gracefully - PDF files can be corrupted or malformed. Always wrap extraction code in try-except blocks.

  4. Cache extracted text - For repeated processing, cache extracted text to avoid re-processing.

  5. Validate extracted text - Implement checks to verify extraction quality (e.g., minimum text length, expected keywords).

  6. Consider alternatives for specific use cases - While PDFMiner is powerful, sometimes specialized tools (like tabula-py for tables) are more appropriate.

  7. Keep PDFMiner updated - The .six fork is actively maintained. Keep it updated for bug fixes and improvements.

  8. Document your code properly - When sharing PDF extraction scripts, use proper Markdown code blocks with syntax highlighting for better readability.

Conclusion

PDFMiner.six is an essential tool for Python developers working with PDF documents. Its pure-Python implementation, detailed layout analysis, and extensible architecture make it ideal for production document processing systems. While it may have a steeper learning curve than simpler libraries, the precision and control it offers are unmatched for complex PDF extraction tasks.

Whether you're building a document management system, analyzing scientific papers, or extracting data for machine learning pipelines, PDFMiner provides the foundation for reliable PDF text extraction in Python.
