DEV Community

cloudytech147
cloudytech147

Posted on

A Step-by-Step guide to extract images from PDF in Python

In this Python tutorial, we will walk you through the Python code that can extract images from PDF files and save them in the same directory as that of the code file.

But before we go into the blog on extract images from PDF from Python, we need to install certain Python libraries.

Required Libraries

Here, we will be using three Python libraries, namely pillow , io, and PyMuPDF. Among these three libraries, io is a part of the Python standard library, whereas pillow and PyMuPDF are open-source third-party libraries.

To install pillow and PyMuPDF libraries for your Python environment, you need to run the following pip install commands on the command prompt or terminal app on your system:

pillow: Pillow is a popular Python image handling library.

pip install Pillow
Enter fullscreen mode Exit fullscreen mode

PyMuPDF: PyMuPDF library is used to access PDF, XPS, OpenXPS, epub, comic, and fiction book format files.

pip install PyMuPDF
Enter fullscreen mode Exit fullscreen mode

io: io library is used to deal with various I/O streams.

Implementation

Let’s start with importing the required module.

import fitz #the PyMuPDF module
from PIL import Image
import io
Enter fullscreen mode Exit fullscreen mode

Now, open the pdf file my_file.pdf with fitz.open() method, loop through every page, and extract images from every page and save them locally.

filename = "my_file.pdf"
# open file
with fitz.open(filename) as my_pdf_file:

    #loop through every page
    for page_number in range(1, len(my_pdf_file)+1):

        # acess individual page
        page = my_pdf_file[page_number-1]

        # accesses all images of the page
        images = page.getImageList()

        # check if images are there
        if images:
            print(f"There are {len(images)} image/s on page number {page_number}[+]")
        else:
            print(f"There are No image/s on page number {page_number}[!]")

        # loop through all images present in the page 
        for image_number, image in enumerate(page.getImageList(), start=1):

            #access image xerf
            xref_value = image[0]

            #extract image information
            base_image = my_pdf_file.extractImage(xref_value)

            # access the image itself
            image_bytes = base_image["image"]

            #get image extension
            ext = base_image["ext"]

            #load image
            image = Image.open(io.BytesIO(image_bytes))

            #save image locally
            image.save(open(f"Page{page_number}Image{image_number}.{ext}", "wb"))
Enter fullscreen mode Exit fullscreen mode

Here’s a brief overview of the functions and methods mentioned in the above code:

fitz.open(filename) as my_pdf_file statement open the PDF file.

page.getImageList() return a list of all images present on the single page.

my_pdf_file.extractImage(xref_value) method returns all the information about the image, including its byte code and image extension.

io.BytesIO(image_bytes) changes the image bytes-like object to proper bytes object.

Image.open(io.BytesIO(image_bytes)) method opens the image byte object.

image.save(open(f"Page{page_number}Image{image_number}.{ext}", "wb")) method saves the image locally.

Now put all the code together and execute.

Python Program to Extract Images from the PDF File

import fitz # PyMuPDF
import io
from PIL import Image

#filename
filename = "my_file.pdf"

# open file
with fitz.open(filename) as my_pdf_file:

    #loop through every page
    for page_number in range(1, len(my_pdf_file)+1):

        # acess individual page
        page = my_pdf_file[page_number-1]

        # accesses all images of the page
        images = page.getImageList()

        # check if images are there
        if images:
            print(f"There are {len(images)} image/s on page number {page_number}[+]")
        else:
            print(f"There are No image/s on page number {page_number}[!]")

        # loop through all images present in the page
        for image_number, image in enumerate(page.getImageList(), start=1):

            #access image xerf
            xref_value = image[0]

            #extract image information
            base_image = my_pdf_file.extractImage(xref_value)

            # access the image itself
            image_bytes = base_image["image"]

            #get image extension
            ext = base_image["ext"]

            #load image
            image = Image.open(io.BytesIO(image_bytes))

            #save image locally
            image.save(open(f"Page{page_number}Image{image_number}.{ext}", "wb"))
Enter fullscreen mode Exit fullscreen mode

Output:

When you execute the above program, you will see an output similar to the one as follows (output depends on the images in the PDF file that you have chosen):

There are 2 image/s on page number 1[+]
There are 2 image/s on page number 2[+]
There are 2 image/s on page number 3[+]
There are 2 image/s on page number 4[+]
There are 2 image/s on page number 5[+]
There are 2 image/s on page number 6[+]
There are 2 image/s on page number 7[+]
There are 2 image/s on page number 8[+]
There are 2 image/s on page number 9[+]
There are 2 image/s on page number 10[+]
There are 2 image/s on page number 11[+]
There are 2 image/s on page number 12[+]
There are 2 image/s on page number 13[+]
There are 2 image/s on page number 14[+]
There are 2 image/s on page number 15[+]
There are 2 image/s on page number 16[+]
There are 2 image/s on page number 17[+]
There are 2 image/s on page number 18[+]
There are 2 image/s on page number 19[+]
There are 2 image/s on page number 20[+]
There are 2 image/s on page number 21[+]
There are 2 image/s on page number 22[+]
There are 2 image/s on page number 23[+]
There are 2 image/s on page number 24[+]
There are 2 image/s on page number 25[+]
There are 2 image/s on page number 26[+]
There are 2 image/s on page number 27[+]
There are 2 image/s on page number 28[+]
There are 2 image/s on page number 29[+]
There are 2 image/s on page number 30[+]
Enter fullscreen mode Exit fullscreen mode

If you check the directory where your Python script is present, you will see that all the images have been saved there.

Extract Images from PDF in Python

Conclusion

In this Python instructional exercise, we figured out how to get to every one of the pictures in a PDF document utilizing the PyMuPDF library and save them locally utilizing the Python Pillow library. You can essentially reorder the previously mentioned Python program and supplant the my_file.pdf file name with your own PDF document name and concentrate every one of the pictures present in it.

Top comments (1)

Collapse
 
akaybk_14 profile image
Youngbin Kwon

Curious why you used PyMuPDF instead of PyPDF2 in particular?