Read PDFs in Python: Extract Text and Images

In daily work and study, we often need to batch-extract text or images from PDF files: for example, pulling the clauses out of a contract, or collecting all the images from a product manual.

Dealing with PDFs used to be a headache, but the right library makes it much simpler. This article shows how to use Spire.PDF for Python, a library that can extract text and images from PDFs in just a few lines of code.

Before you start, make sure you have installed the Spire.PDF library:

pip install Spire.PDF

1. Load the PDF Document

Before doing anything else, we need to load the PDF file into our code. Spire.PDF supports loading from a file path as well as from a data stream (Stream).

Method 1: Load from a file

This is the most direct approach for fixed files on your local disk.

from spire.pdf import PdfDocument

# Create a PdfDocument instance
pdf = PdfDocument()
# Load a local PDF document
pdf.LoadFromFile("sample.pdf")

Method 2: Load from a data stream

If your PDF data is received from a network interface or generated in memory as byte data, this method is very useful.

from spire.pdf import PdfDocument, Stream

# Read the file as a byte array (demo: read from file; it can also come from a network)
with open("sample.pdf", "rb") as f:
    byte_data = f.read()

# Create a stream object
pdfStream = Stream(byte_data)
# Load the PDF from the stream
pdf = PdfDocument(pdfStream)
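
When the bytes come from a network call rather than a local file, it can help to check that the payload really is a PDF before handing it to the library. The following is a minimal sketch using only the standard library; the helper names `looks_like_pdf` and `require_pdf` are hypothetical and not part of Spire.PDF. It relies on the fact that PDF files begin with the marker `%PDF-`, and it uses a strict check at offset 0, which may reject files that carry stray bytes before the header.

```python
PDF_MAGIC = b"%PDF-"  # every PDF file header starts with this marker


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header marker."""
    return data.startswith(PDF_MAGIC)


def require_pdf(data: bytes) -> bytes:
    """Return the payload unchanged, or raise if it is not a PDF."""
    if not looks_like_pdf(data):
        raise ValueError("payload does not start with the %PDF- header")
    return data


if __name__ == "__main__":
    # Demo with in-memory bytes; real data might come from an HTTP response.
    print(looks_like_pdf(b"%PDF-1.7\n..."))        # a PDF-style header
    print(looks_like_pdf(b"<html>error</html>"))   # e.g. an error page
```

In the stream example above, this check would sit between reading `byte_data` and creating the `Stream`, so that an HTML error page or an empty response fails early with a clear message.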

2. Extract Text

Text extraction is one of the most common tasks when processing documents. The following code demonstrates how to iterate through all pages in a PDF and concatenate the text from each page.

It mainly uses two helper classes: PdfTextExtractor and PdfTextExtractOptions. Setting IsExtractAllText = True helps ensure that most visible text on the page is extracted.

from spire.pdf import PdfTextExtractor, PdfTextExtractOptions

# Assume the pdf object has already been loaded using one of the methods above
all_text = ""

# Loop through each page
for pageIndex in range(pdf.Pages.Count):
    # Get the current page by index
    page = pdf.Pages.get_Item(pageIndex)

    # Create a text extractor
    text_extractor = PdfTextExtractor(page)

    # Configure extraction options
    options = PdfTextExtractOptions()
    options.IsExtractAllText = True
    options.IsSimpleExtraction = True

    # Extract and accumulate
    all_text += text_extractor.ExtractText(options)

# Print the result
print(all_text)
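
If you would rather keep each page's text separate instead of building one long string, one option is to collect the pages in a list and join them at the end. The sketch below is library-agnostic: `extract_page` is a hypothetical callable that you would implement by wrapping the `PdfTextExtractor` calls shown above, and the demo uses fake page data so it runs without any PDF.

```python
from typing import Callable, List


def collect_page_texts(page_count: int,
                       extract_page: Callable[[int], str]) -> List[str]:
    """Call extract_page for every zero-based page index and keep the results."""
    return [extract_page(i) for i in range(page_count)]


def join_pages(texts: List[str], separator: str = "\n") -> str:
    """Join per-page texts into one string with a separator between pages."""
    return separator.join(texts)


if __name__ == "__main__":
    # Stand-in data; with Spire.PDF, extract_page would run the extractor
    # on pdf.Pages.get_Item(i) as in the loop above.
    fake_pages = ["First page text", "Second page text"]
    texts = collect_page_texts(len(fake_pages), lambda i: fake_pages[i])
    print(join_pages(texts))
```

Keeping a list also makes it easy to inspect or post-process a single page, for instance `texts[0]` for the cover page.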

3. Extract Images

In many cases, key information in a PDF is actually hidden in illustrations or charts. Spire.PDF also provides a very convenient image extraction solution.

Using the PdfImageHelper helper class, we can directly get image information from a page, and then save each image as an image file (such as .png).

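import os

from spire.pdf import PdfImageHelper

# Assume the pdf object has already been loaded using one of the methods above
# Get the first page (index is 0)
page = pdf.Pages.get_Item(0)

# Create an image helper object
image_helper = PdfImageHelper()
# Get all image information on the page
images_info = image_helper.GetImagesInfo(page)

# Make sure the output folder exists before saving
os.makedirs("output/Images", exist_ok=True)

# Loop through and save each image
for i in range(len(images_info)):
    # Save as PNG format
    images_info[i].Image.Save(f"output/Images/image_{i}.png")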

print(f"Successfully extracted {len(images_info)} images")

Note: If the PDF is scanned (image-based), what you extract is essentially the full-page scan. If the PDF was generated electronically, the embedded icons and photos are extracted as separate images.
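
The example saves images from the first page only. If you loop over several pages, the file names need to include the page number so that images do not overwrite each other. Here is a small sketch of one possible naming scheme; `image_path` is a hypothetical helper built on the standard library, not something Spire.PDF provides.

```python
from pathlib import Path


def image_path(out_dir: str, page_index: int, image_index: int,
               ext: str = "png") -> Path:
    """Build a unique output path for one image and create the folder if needed.

    Indices are zero-based on input and one-based in the file name.
    """
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"page{page_index + 1:03d}_image{image_index + 1:02d}.{ext}"


if __name__ == "__main__":
    # Prints a path ending in page001_image01.png under output/Images
    print(image_path("output/Images", 0, 0))
```

Inside the save loop above, the result could be passed on as `images_info[i].Image.Save(str(image_path("output/Images", pageIndex, i)))`, assuming `pageIndex` is the current page index of an outer loop.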

4. Advanced Tips

Although the code above covers the basics, there are a few things worth paying attention to in real applications:

Page handling: The example extracts the text of every page for demonstration purposes. To process only some pages, restrict the values that pageIndex takes in the loop.
Chinese support: The library handles Chinese text well. When extracting from Chinese PDFs, make sure your environment (terminal and output files) uses UTF-8 encoding.
Free edition limitations: The free version of Spire.PDF usually limits the number of pages it can process (for example, only the first 10 pages). If you need to handle longer documents, you may need to evaluate the commercial version.
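
The tips above can be sketched with two small standard-library helpers: one that decides which page indices to process (optionally capped, for example at 10 pages), and one that writes the extracted text as UTF-8 so Chinese characters survive. The names `pages_to_process` and `save_text` are hypothetical, and the cap is simply a parameter you choose; it is not read from the library.

```python
from pathlib import Path
from typing import List, Optional


def pages_to_process(page_count: int, start: int = 0,
                     stop: Optional[int] = None,
                     max_pages: Optional[int] = None) -> List[int]:
    """Return zero-based page indices in [start, stop), clamped to the document.

    max_pages optionally caps how many indices are returned.
    """
    stop = page_count if stop is None else min(stop, page_count)
    indices = range(max(start, 0), stop)
    if max_pages is not None:
        indices = indices[:max_pages]  # slicing a range keeps the first N
    return list(indices)


def save_text(text: str, path: str) -> None:
    """Write text to a file using UTF-8, regardless of the system default."""
    Path(path).write_text(text, encoding="utf-8")


if __name__ == "__main__":
    # A 25-page document, capped at the first 10 pages.
    print(pages_to_process(25, max_pages=10))
```

In the text extraction loop, `for pageIndex in pages_to_process(pdf.Pages.Count, max_pages=10):` would replace the plain `range(...)` call.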

Summary

With Spire.PDF for Python, processing PDF files becomes much easier. Whether you are reading a file, extracting text page by page, or saving illustrations, a few lines of code are enough. That leaves you more time for the next steps, such as data analysis or business logic.

Try it now and let code free your hands!

DEV Community