Python Tutorial: Extracting Images and Text from PPT

When you need to pull images and text out of a PowerPoint presentation, copying and pasting them one by one is time-consuming, and it is easy to miss an item or make a mistake. Today, I'll share a simple way to batch extract images and text from a PPT file using Python.

Preparation

First, you need to install Spire.Presentation for Python. You can install it via the pip command:

pip install Spire.Presentation

Once the installation is complete, you can start writing the code.
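
Before moving on, it can help to confirm that the package is importable from the interpreter you plan to use. The sketch below is only an illustration: is_importable is a hypothetical helper name, and it assumes the package exposes the spire.presentation module that the import statements in this tutorial rely on.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if module_name can be located by the current interpreter."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (for example "spire") is missing
        return False


if __name__ == "__main__":
    print("spire.presentation importable:", is_importable("spire.presentation"))
```

If this prints False, the package was most likely installed into a different Python environment than the one running the script.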

Extracting Images from PPT

Often, the images in a PPT are the materials we need. The following code demonstrates how to batch extract all images from a PPT and save them locally:

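import os

from spire.presentation.common import *
from spire.presentation import *

# Create the output folder if it does not exist yet
os.makedirs("ExtractImage", exist_ok=True)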

# Create a Presentation instance
ppt = Presentation()

# Load the PowerPoint document
ppt.LoadFromFile("sample.pptx")

# Iterate through all images in the document
for i, image in enumerate(ppt.Images):
    # Build the output path and save the image
    image_name = "ExtractImage/Images_" + str(i) + ".png"
    image.Image.Save(image_name)

ppt.Dispose()

How it works:

Presentation(): Creates a PPT document object
LoadFromFile(): Loads the PPT file to be processed
ppt.Images: Gets the collection of all images in the document
image.Image.Save(): Saves the image in PNG format

After running, all images will be saved sequentially to the ExtractImage folder, named Images_0.png, Images_1.png, and so on.

Extracting Text from PPT

Besides images, extracting text content is also a common requirement. The following code iterates through each slide and extracts the text from every auto shape it contains:

import os

from spire.presentation import *
from spire.presentation.common import *

# Create the output folder if it does not exist yet
os.makedirs("output", exist_ok=True)

# Create a Presentation object
pres = Presentation()

# Load the PowerPoint presentation
pres.LoadFromFile("sample.pptx")

text = []
# Iterate through each slide
for slide in pres.Slides:
    # Iterate through each shape
    for shape in slide.Shapes:
        # Check if the shape is of IAutoShape type (can contain text)
        if isinstance(shape, IAutoShape):
            # Extract text from the shape
            for paragraph in shape.TextFrame.Paragraphs:
                text.append(paragraph.Text)

# Write the extracted text to a file
with open("output/SlideText.txt", "w", encoding='utf-8') as f:
    for s in text:
        f.write(s + "\n")

pres.Dispose()

How it works:

pres.Slides: Gets the collection of all slides
slide.Shapes: Gets all shapes in each slide
IAutoShape: Represents the auto-shape type that can contain text
shape.TextFrame.Paragraphs: Gets the collection of paragraphs in the shape
Finally, all text is written to the SlideText.txt file, with one paragraph per line
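
The collected list may contain empty strings, for instance from placeholders that have no content, and these would become blank lines in the output file. As an optional refinement, here is a minimal sketch that trims each paragraph and drops the empty ones before writing. clean_paragraphs and write_lines are hypothetical helper names, not part of Spire.Presentation, and they only use the Python standard library.

```python
from pathlib import Path


def clean_paragraphs(paragraphs):
    """Strip surrounding whitespace and drop paragraphs that end up empty."""
    stripped = (p.strip() for p in paragraphs)
    return [p for p in stripped if p]


def write_lines(lines, output_path):
    """Write one entry per line as UTF-8, creating the parent folder if needed."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
```

With the text list from the script above, the final write step would become write_lines(clean_paragraphs(text), "output/SlideText.txt").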

Important Notes

Resource Release: After using the Presentation object, be sure to call the Dispose() method to release resources and avoid memory leaks.
File Paths: Ensure the PPT file path is correct. The directories for saving images and text must exist before the files are written, so create them in advance or in code, for example with os.makedirs(path, exist_ok=True).
Text Encoding: Use utf-8 encoding when writing to text files to properly handle non-English characters such as Chinese.
Image Format: The Save() method saves images in PNG format by default. Refer to the official documentation if you need other formats.
Shape Types: The text extraction only handles the IAutoShape type. If text is located in other shape types like tables or charts, additional processing is required.
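
For the resource release note, one way to make sure Dispose() runs even when an exception interrupts processing is to wrap the object in a small context manager. This is a sketch, not part of the library: disposing is a hypothetical helper, and FakePresentation is a stand-in used only so the example runs without a PPT file. The only assumption is that the wrapped object has a Dispose() method, as Presentation does in the code above.

```python
from contextlib import contextmanager


@contextmanager
def disposing(resource):
    """Yield resource and call its Dispose() method on exit, even after an error."""
    try:
        yield resource
    finally:
        resource.Dispose()


class FakePresentation:
    """Stand-in for Presentation, used only to demonstrate the pattern."""

    def __init__(self):
        self.disposed = False

    def Dispose(self):
        self.disposed = True


if __name__ == "__main__":
    fake = FakePresentation()
    try:
        with disposing(fake) as pres:
            raise RuntimeError("simulated failure while processing slides")
    except RuntimeError:
        pass
    print(fake.disposed)  # True: Dispose() ran despite the exception
```

With the real library, the same helper would be used as "with disposing(Presentation()) as pres:", followed by the loading and extraction steps.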

Summary

With Spire.Presentation for Python, you can batch extract images and text from PPT with just a dozen lines of code. This library is powerful and easy to use, making it ideal for office automation scenarios. I hope this article helps you improve your work efficiency!

If you have further PPT automation needs, such as creating presentations, modifying content, or adding charts, Spire.Presentation offers many more features to explore.

DEV Community