DEV Community

jelizaveta


Read PDF Documents with Python (Extract Text, Images, and Tables)

PDF has become a mainstream format for exchanging and storing information thanks to its cross-platform compatibility and fixed layout. However, extracting specific data such as text, images, or tables from large numbers of PDF documents by hand is inefficient and error-prone. Python, with its strong data processing ecosystem, is well suited to automating this work.

This article explores how to use Spire.PDF for Python, a professional library, to efficiently and accurately extract various types of content from PDF documents.


Introduction and Installation of Spire.PDF for Python

Spire.PDF for Python is a powerful and easy-to-use PDF processing library designed specifically for Python developers. It not only handles complex PDF structures and supports extracting content from encrypted PDFs but also accurately parses text, images, and tables, greatly simplifying the workflow of PDF automation.

Main Advantages:

  • Comprehensive Feature Support: Covers PDF creation, editing, conversion, content extraction, and more.
  • High-Precision Content Extraction: Accurately identifies and extracts text, images, vector graphics, table data, etc.
  • Ease of Use: Provides an intuitive API that lowers development complexity.
  • Excellent Performance: Performs well when handling large or complex PDF documents.

Installation:

You can easily install Spire.PDF for Python using pip:

pip install Spire.PDF

After installation, you can import and use the Spire.PDF library in your Python project.
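
Before running the examples, it can help to confirm that the package is visible to the interpreter you are using. The sketch below relies only on the standard library; `is_installed` is an illustrative helper name, not part of Spire.PDF, and it assumes the top-level package is named `spire`, as the imports in this article suggest.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the named top-level module can be located."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "spire" is the top-level package behind `from spire.pdf import *`.
    if is_installed("spire"):
        print("Spire.PDF appears to be installed.")
    else:
        print("Spire.PDF not found; run: pip install Spire.PDF")
```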


Extracting Text from PDF

Extracting text from PDFs is one of the most common requirements. Spire.PDF for Python provides flexible methods to extract text from the entire document, specific pages, or even specific regions.

Example: Extract text from a page

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument instance
doc = PdfDocument()
# Load PDF file
doc.LoadFromFile("sample.pdf")  # Replace with your PDF file path

# Get the first page
page = doc.Pages[0]

# Create a PdfTextExtractor instance
textExtractor = PdfTextExtractor(page)
option = PdfTextExtractOptions()
# Extract text
text = textExtractor.ExtractText(option)

print("Extracted Text:\n", text)

# Close the document
doc.Close()

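To cover a whole document rather than one page, the same extractor can be applied page by page. The following is a minimal sketch, assuming `PdfDocument`, `PdfTextExtractor`, and `PdfTextExtractOptions` behave as in the single-page example and can be imported by name from `spire.pdf`. The helper names `format_pages` and `extract_all_pages` are illustrative only.

```python
def format_pages(page_texts):
    """Join per-page text into one string with a header before each page."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(parts)


def extract_all_pages(pdf_path):
    """Return a list holding the extracted text of every page."""
    # Imported here so format_pages can be reused without Spire.PDF installed.
    from spire.pdf import PdfDocument, PdfTextExtractor, PdfTextExtractOptions

    doc = PdfDocument()
    doc.LoadFromFile(pdf_path)
    try:
        options = PdfTextExtractOptions()
        texts = []
        for i in range(doc.Pages.Count):
            extractor = PdfTextExtractor(doc.Pages[i])
            texts.append(extractor.ExtractText(options))
        return texts
    finally:
        # Release the document even if extraction fails part-way.
        doc.Close()
```

With Spire.PDF installed, `print(format_pages(extract_all_pages("sample.pdf")))` would print the document's text with a header line before each page.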

Example: Extract text from a specific area

# Assume 'page' is already defined
pdfTextExtractor = PdfTextExtractor(page)
pdfTextExtractOptions = PdfTextExtractOptions()
# Define extract area (X, Y, width, height)
pdfTextExtractOptions.ExtractArea = RectangleF(80.0, 180.0, 500.0, 200.0)
text = pdfTextExtractor.ExtractText(pdfTextExtractOptions)

print("Extracted text from the specified area:\n", text)
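
The four numbers passed to `RectangleF` can be easier to choose in physical units. The sketch below converts millimetres to PDF points, assuming the extraction area is measured in points (1 point = 1/72 inch), which is the usual default unit in PDF libraries; check the Spire.PDF documentation for your version. `mm_to_pt` and `area_from_mm` are hypothetical helpers, not library functions.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0


def mm_to_pt(mm: float) -> float:
    """Convert millimetres to PDF points."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def area_from_mm(x_mm, y_mm, width_mm, height_mm):
    """Return (x, y, width, height) in points for a region given in mm."""
    return tuple(mm_to_pt(v) for v in (x_mm, y_mm, width_mm, height_mm))


# A 150 mm x 50 mm region offset 20 mm horizontally and 60 mm vertically.
x, y, width, height = area_from_mm(20, 60, 150, 50)
print(x, y, width, height)
# With Spire.PDF loaded, these values would be passed on unchanged:
# pdfTextExtractOptions.ExtractArea = RectangleF(x, y, width, height)
```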

Extracting Images from PDF

PDF documents often contain important image data. Spire.PDF for Python can help identify and extract these images, saving them in common formats.

from spire.pdf.common import *
from spire.pdf import *
import os

# Create a PdfDocument instance
doc = PdfDocument()
# Load PDF file
doc.LoadFromFile("sample.pdf")  # Replace with your PDF file path

# Create a directory to save extracted images
output_dir = "extracted_images"
os.makedirs(output_dir, exist_ok=True)

# Create one PdfImageHelper and reuse it for every page
imageHelper = PdfImageHelper()

# Traverse all pages
for i in range(doc.Pages.Count):
    page = doc.Pages.get_Item(i)

    # Get information about the images on this page
    imageInfo = imageHelper.GetImagesInfo(page)

    for j, info in enumerate(imageInfo):
        image = info.Image
        file_name = os.path.join(output_dir, f"page_{i+1}_image_{j+1}.png")
        image.Save(file_name, ImageFormat.get_Png())  # Or ImageFormat.get_Jpeg()
        print(f"Image saved at: {file_name}")

# Close the document
doc.Close()

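A logo or icon that repeats on several pages may be saved once per page by a loop like this one. As a possible follow-up step, the sketch below groups the saved files by content hash so that byte-identical copies can be spotted. It uses only the standard library and assumes the `extracted_images` directory from the example; `find_duplicates` is an illustrative helper.

```python
import hashlib
from pathlib import Path


def find_duplicates(directory):
    """Map each content hash to the files sharing it (groups of 2+ only)."""
    groups = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups.setdefault(digest, []).append(path.name)
    return {h: names for h, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    out_dir = Path("extracted_images")
    if out_dir.is_dir():
        for names in find_duplicates(out_dir).values():
            print("Identical files:", ", ".join(names))
```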

Extracting Tables from PDF

Extracting tables is more challenging because PDFs usually don’t contain “table” structures; a table is just text and lines positioned to look like one. Spire.PDF for Python detects these layouts and returns the cell contents as structured data.

from spire.pdf import PdfDocument, PdfTableExtractor

# Load PDF file
doc = PdfDocument()
doc.LoadFromFile("sample.pdf")  # Replace with your PDF file path

# Create a PdfTableExtractor instance
table_extractor = PdfTableExtractor(doc)
# Extract tables from the first page (page index 0)
tables = table_extractor.ExtractTable(0)
# Guard against an empty result in case no table is detected on the page
for table in tables or []:
    row_count = table.GetRowCount()
    column_count = table.GetColumnCount()
    for i in range(row_count):
        # Collect the text of every cell in row i
        table_row = [table.GetText(i, j) for j in range(column_count)]
        print(table_row)

# Close the document
doc.Close()

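Because each row is collected as a list of cell strings, writing a table to CSV is a small extra step. A minimal sketch with the standard `csv` module follows. The `rows` literal is made-up sample data standing in for the `table_row` lists built above, and `clean_cell` and `write_rows` are illustrative names.

```python
import csv
import io


def clean_cell(text):
    """Collapse line breaks and repeated whitespace inside a cell."""
    return " ".join((text or "").split())


def write_rows(rows, stream):
    """Write rows (lists of cell strings) to a text stream as CSV."""
    writer = csv.writer(stream, lineterminator="\n")
    for row in rows:
        writer.writerow([clean_cell(cell) for cell in row])


# Made-up sample data in the shape produced by the extraction loop.
rows = [["Name", "Qty"], ["Widget\nA", "3"], ["Gadget, large", "12"]]

buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("table.csv", "w", newline="", encoding="utf-8")` as the stream.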

Conclusion and Outlook

This article has shown the powerful features and ease of use of Spire.PDF for Python in extracting content from PDF documents. Whether for simple text extraction or complex image and table parsing, Spire.PDF offers efficient and reliable solutions.

It eases the work Python developers face in automation, data analysis, and document digitization.

Spire.PDF for Python is not just a content extraction tool. It also supports creating, editing, converting, merging, splitting, encrypting, and decrypting PDFs, making it a complete solution for PDF workflows.

We encourage you to try these examples yourself and explore more advanced features to take your PDF processing tasks to the next level.
