Paulo GP

Working with PDF files using PyMuPDF

Introduction

PyMuPDF is a Python library for reading and modifying PDF documents. It can extract text and images, add annotations, and merge or rewrite files programmatically. This article walks through these capabilities with practical examples.

Topics

  • Installation and setup of PyMuPDF
  • Text extraction from PDF documents
  • Image extraction from PDF documents
  • PDF manipulation and modification

Installation and Setup of PyMuPDF

To get started, install PyMuPDF with pip:

pip install PyMuPDF

Once installed, import the library into your Python scripts. The module's import name is fitz (recent releases can also be imported as pymupdf):

import fitz
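
Before running the examples, it can be useful to confirm that PyMuPDF is actually installed and which version pip resolved. The sketch below uses only the standard library's importlib.metadata; the helper name installed_version is a hypothetical one chosen for this illustration, not part of PyMuPDF.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "PyMuPDF") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("PyMuPDF is not installed; run: pip install PyMuPDF")
else:
    print("PyMuPDF version:", version)
```

Because the check asks pip's metadata rather than importing the module, it works the same way whichever import name your release uses.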

Text Extraction from PDF Documents

PyMuPDF allows you to extract text from PDF documents with ease. Here's a simple example:

PDF file:

Test document PDF
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla est purus, ultrices in porttitor in, accumsan non quam. Nam consectetur porttitor rhoncus. Curabitur eu est et leo feugiat auctor vel quis lorem. Ut et ligula dolor, sit amet consequat lorem. Aliquam porta eros sed velit imperdiet egestas. Maecenas tempus eros ut diam ullamcorper id dictum libero tempor. Donec quis augue quis magna condimentum lobortis.

Code:

import fitz


def extract_text_from_pdf(filename: str) -> str:
    # Open the document and close it automatically when done
    with fitz.open(filename) as doc:
        # get_text() returns the plain text of a single page
        return "".join(page.get_text() for page in doc)


extracted_text = extract_text_from_pdf(filename="example.pdf")
print(extracted_text)

Output:

Test document PDF
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla est purus, ultrices in porttitor in, accumsan non quam. Nam consectetur porttitor rhoncus. Curabitur eu est et leo feugiat auctor vel quis lorem. Ut et ligula dolor, sit amet consequat lorem. Aliquam porta eros sed velit imperdiet egestas. Maecenas tempus eros ut diam ullamcorper id dictum libero tempor. Donec quis augue quis magna condimentum lobortis.
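
Concatenating all pages into one string loses the page boundaries. If you need to know which page a passage came from, one option is to keep the text per page. The following is a minimal sketch, assuming the same example.pdf as above; text_by_page and text_by_page_from_file are hypothetical helper names, and the only PyMuPDF features they rely on are page.number (0-based) and page.get_text().

```python
from typing import Dict, Iterable


def text_by_page(pages: Iterable) -> Dict[int, str]:
    """Map 1-based page numbers to the plain text of each page.

    Only relies on page.number (0-based) and page.get_text().
    """
    return {page.number + 1: page.get_text() for page in pages}


def text_by_page_from_file(filename: str) -> Dict[int, str]:
    # Imported here so text_by_page() can be tried with stand-in objects
    # even where PyMuPDF is not installed.
    import fitz

    with fitz.open(filename) as doc:
        return text_by_page(doc)


# Usage (assumes example.pdf exists next to the script):
#   for number, text in text_by_page_from_file("example.pdf").items():
#       print(f"--- page {number} ---")
#       print(text)
```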

Image Extraction from PDF Documents

In addition to text, PyMuPDF can extract the images embedded in a PDF document. The following function collects the raw bytes of every image referenced on each page:

import fitz


def extract_images_from_pdf(filename: str) -> list:
    images = []
    # Open the document and close it automatically when done
    with fitz.open(filename) as doc:
        for page in doc:
            for img in page.get_images():
                xref = img[0]  # First item is the image's cross-reference number
                base_image = doc.extract_image(xref)
                images.append(base_image["image"])  # Raw image bytes
    return images


extracted_images = extract_images_from_pdf(filename="example.pdf")
print("Number of images extracted:", len(extracted_images))

Output:

Number of images extracted: 1
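
The function above keeps the images in memory. To write them to disk instead, the dictionary returned by doc.extract_image also carries an "ext" entry with a suggested file extension. The following is a sketch under stated assumptions: save_images, save_images_from_file, and the image_<xref> naming scheme are hypothetical, and an image shared by several pages is written only once by remembering its xref.

```python
from pathlib import Path
from typing import List


def save_images(doc, out_dir: str) -> List[Path]:
    """Write each distinct image of an open document to out_dir.

    Relies only on iterating pages, page.get_images() and doc.extract_image().
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    seen = set()  # xrefs already written, since pages can share one image
    paths = []
    for page in doc:
        for img in page.get_images():
            xref = img[0]
            if xref in seen:
                continue
            seen.add(xref)
            info = doc.extract_image(xref)
            if not info:  # Nothing extractable for this xref
                continue
            path = target / f"image_{xref}.{info['ext']}"
            path.write_bytes(info["image"])
            paths.append(path)
    return paths


def save_images_from_file(filename: str, out_dir: str) -> List[Path]:
    # Imported here so save_images() can be tried with stand-in objects
    # even where PyMuPDF is not installed.
    import fitz

    with fitz.open(filename) as doc:
        return save_images(doc, out_dir)
```

Calling save_images_from_file("example.pdf", "images") would then leave one file per distinct image in an images folder, assuming example.pdf contains at least one image.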

PDF Manipulation and Modification

PyMuPDF can also modify PDF documents, for example by adding annotations or merging documents. The following example adds a text annotation to the first page:

import fitz


def add_annotation_to_pdf(in_filename: str, annotation: str, out_filename: str) -> None:
    with fitz.open(in_filename) as doc:
        page = doc[0]  # Add annotation to the first page
        annot = page.add_text_annot(point=(100, 100), text=annotation)
        # Colors go in via stroke/fill (or a dict); a bare tuple passed as colors is ignored
        annot.set_colors(stroke=(1, 0, 0))  # Set annotation color to red
        annot.update()  # Apply the color change to the annotation
        doc.save(out_filename)


in_filename = "example.pdf"
annotation = "This is an annotation added using PyMuPDF."
out_filename = "example2.pdf"
add_annotation_to_pdf(in_filename=in_filename, annotation=annotation, out_filename=out_filename)

This code adds a text annotation containing "This is an annotation added using PyMuPDF." at position (100, 100) on the first page of example.pdf and saves the result as example2.pdf.
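
Merging, mentioned at the start of this section, can be sketched with Document.insert_pdf, which appends the pages of one document to another. The helper below is illustrative only: merge_pdfs and its open_pdf parameter are hypothetical names, and open_pdf exists purely so the function can be exercised with stand-in objects; by default it falls back to fitz.open.

```python
from typing import Callable, Optional, Sequence


def merge_pdfs(
    filenames: Sequence[str],
    out_filename: str,
    open_pdf: Optional[Callable] = None,
) -> None:
    """Append the pages of each input PDF, in order, and save the result."""
    if not filenames:
        raise ValueError("need at least one input file")
    if open_pdf is None:
        # Default to PyMuPDF; imported here so a stand-in opener can be
        # passed in where the library is not installed.
        import fitz

        open_pdf = fitz.open

    merged = open_pdf()  # Called without arguments, fitz.open() creates a new, empty PDF
    try:
        for name in filenames:
            with open_pdf(name) as source:
                merged.insert_pdf(source)  # Append all pages of source
        merged.save(out_filename)
    finally:
        merged.close()
```

With the two files from the annotation example, merge_pdfs(["example.pdf", "example2.pdf"], "merged.pdf") would write a document containing the pages of both, in that order.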

Conclusion

PyMuPDF covers the most common PDF tasks a Python developer is likely to face: extracting text and images, adding annotations, and modifying or merging documents, all from a single library.
