Paulo GP

Posted on Feb 25 • Updated on Feb 28

Working with PDF files using PyMuPDF

#python #programming

Introduction

PyMuPDF is a Python library for reading and modifying PDF documents. It can extract text and images and change existing files, for example by adding annotations or merging documents. In this chapter, we explore these capabilities and demonstrate them through practical examples.

Topics

Installation and setup of PyMuPDF
Text extraction from PDF documents
Image extraction from PDF documents
PDF manipulation and modification

Installation and Setup of PyMuPDF

To use PyMuPDF, you first need to install the library. You can install it from PyPI with pip:

pip install PyMuPDF

Once installed, you can import the library into your Python scripts. Note that the module name is fitz, not PyMuPDF (recent releases can also be imported as pymupdf):

import fitz

Text Extraction from PDF Documents

PyMuPDF extracts text one page at a time through the page's get_text() method. Here's a simple example that concatenates the text of every page:

PDF file (example.pdf):

Test document PDF
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla est purus, ultrices in porttitor in, accumsan non quam. Nam consectetur porttitor rhoncus. Curabitur eu est et leo feugiat auctor vel quis lorem. Ut et ligula dolor, sit amet consequat lorem. Aliquam porta eros sed velit imperdiet egestas. Maecenas tempus eros ut diam ullamcorper id dictum libero tempor. Donec quis augue quis magna condimentum lobortis.

import fitz


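def extract_text_from_pdf(filename: str) -> str:
    # The context manager closes the document once the text has been read
    with fitz.open(filename=filename) as doc:
        # Concatenate the plain text of every page, in page order
        return "".join(page.get_text() for page in doc)


extracted_text = extract_text_from_pdf(filename="example.pdf")
print(extracted_text)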

Output:

Test document PDF
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla est purus, ultrices in porttitor in, accumsan non quam. Nam consectetur porttitor rhoncus. Curabitur eu est et leo feugiat auctor vel quis lorem. Ut et ligula dolor, sit amet consequat lorem. Aliquam porta eros sed velit imperdiet egestas. Maecenas tempus eros ut diam ullamcorper id dictum libero tempor. Donec quis augue quis magna condimentum lobortis.

Image Extraction from PDF Documents

In addition to text, PyMuPDF enables you to extract images from PDF documents:

import fitz


def extract_images_from_pdf(filename: str) -> list:
    images = []
    with fitz.open(filename=filename) as doc:
        for page in doc:
            # Each entry describes one image used on the page; item 0 is its xref
            for img in page.get_images():
                xref = img[0]
                # extract_image returns a dict; "image" holds the raw image bytes
                base_image = doc.extract_image(xref)
                images.append(base_image["image"])
    return images


extracted_images = extract_images_from_pdf(filename="example.pdf")
print("Number of images extracted:", len(extracted_images))

Output:

Number of images extracted: 1

PDF Manipulation and Modification

PyMuPDF facilitates various manipulations and modifications of PDF documents, such as adding annotations, merging documents, and more:

import fitz


def add_annotation_to_pdf(in_filename: str, annotation: str, out_filename: str) -> None:
    with fitz.open(filename=in_filename) as doc:
        page = doc[0]  # Add annotation to the first page
        annot = page.add_text_annot(point=(100, 100), text=annotation)
        annot.set_colors(stroke=(1, 0, 0))  # Set annotation color to red
        annot.update()  # Regenerate the appearance so the new color takes effect
        doc.save(out_filename)


in_filename = "example.pdf"
annotation = "This is an annotation added using PyMuPDF."
out_filename = "example2.pdf"
add_annotation_to_pdf(in_filename=in_filename, annotation=annotation, out_filename=out_filename)

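This code adds a text annotation with the content "This is an annotation added using PyMuPDF." at point (100, 100) on the first page and saves the result as example2.pdf, leaving example.pdf unchanged.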

Conclusion

PyMuPDF is a practical choice for Python developers who work with PDF documents. It covers text extraction, image extraction, and document modification such as annotations and merging, all through a single library.

DEV Community
