Simple Steps Extracting Images from PDFs with Python

#python #programming #webdev #tutorial

In today's digital landscape, where information is abundant and diverse, extracting valuable visuals from PDF documents has become an essential skill for many tasks, from data analysis to content creation. Python, with its versatility and rich ecosystem of libraries, offers efficient solutions for this task. In this guide, we'll explore two methods for extracting images from PDFs using Python, catering to both simplicity and flexibility.

1.Using PyMuPDF

PyMuPDF stands out as a robust library proficient in managing a wide array of PDF tasks, among them, image extraction.

pip install fitz
Below, you'll find the code snippet:

import fitz  

def extract_images_pymupdf(pdf_path, output_folder):
    """Extracts images from a PDF using PyMuPDF.

    Args:
        pdf_path (str): Path to the PDF file.
        output_folder (str): Path to the folder where extracted images will be saved.
    """

    try:
        pdf_doc = fitz.open(pdf_path)

        if not os.path.exists(output_folder):
            os.makedirs(output_folder)  # Create output folder if it doesn't exist

        for page_num in range(len(pdf_doc)):
            page = pdf_doc[page_num]
            images = page.get_page_images(xref=True)  # Get images with XREFs (cross-references)

            for xref, image in images:
                # Extract image data and determine filename based on XREF or page/image number
                if xref:
                    filename = f"{os.path.splitext(pdf_path)[0]}_{page_num + 1}_{xref}"
                else:
                    filename = f"{os.path.splitext(pdf_path)[0]}_{page_num + 1}_{image['ext']}"

                with open(os.path.join(output_folder, filename), "wb") as f:
                    f.write(image["data"])

        print(f"Images extracted from '{pdf_path}' to '{output_folder}'.")

    except Exception as e:
        print(f"Error extracting images: {e}")


# Example usage
pdf_path = "your_pdf_file.pdf"
output_folder = "extracted_images"
extract_images_pymupdf(pdf_path, output_folder)

Importing PyMuPDF with import fitz enables efficient handling of PDF tasks, including image extraction. The extract_images_pymupdf function, taking pdf_path and output_folder, opens the PDF and creates the output folder if absent. It then iterates through pages, extracting images with XREFs and determining filenames. Image data is written to files in binary mode, followed by a success message. Exception handling ensures smooth execution. For usage, update pdf_path and output_folder, and call extract_images_pymupdf.

2.Using External Tools (Optional)

If you prefer not to install Python libraries, you can use external tools like pdfimages from the poppler package. Here's a simple approach:

Ensure poppler is installed on your system.
Open your terminal or command prompt.
Navigate to the directory containing your PDF file.
Execute the following command, replacing your_pdf_file.pdf with your actual file name:

pdfimages your_pdf_file.pdf extracted_images
PyMuPDF is generally preferred for its efficiency, control over image extraction, and cross-platform compatibility.

External tools can be a simpler option if you only need basic image extraction and don't want to install libraries.

I hope this guide helps you effectively extract images from your PDFs in Python!

DEV Community

Simple Steps Extracting Images from PDFs with Python

1.Using PyMuPDF

2.Using External Tools (Optional)

Top comments (0)

Read next

Managing versions of OpenTofu, Terraform and TerraGrunt with tenv and DOP

How to create a pricing slider with Tailwind CSS and Alpinejs

Path To A Clean(er) React Architecture - API Layer & Fetch Functions

ES6 Spread Operator: Unleashing the Power of Modern JavaScript