When we need to grab materials like images and text from a PowerPoint presentation, copying and pasting them one by one is time-consuming, and it is easy to miss items or make mistakes. Today, I'll share a simple way to batch extract images and text from PPT files using Python.
Preparation
First, you need to install Spire.Presentation for Python. You can install it via the pip command:
pip install Spire.Presentation
Once the installation is complete, you can start writing the code.
Extracting Images from PPT
Often, the images in a PPT are the materials we need. The following code demonstrates how to batch extract all images from a PPT and save them locally:
from spire.presentation.common import *
from spire.presentation import *
# Create a Presentation instance
ppt = Presentation()
# Load the PowerPoint document
ppt.LoadFromFile("sample.pptx")
# Iterate through all images in the document
for i, image in enumerate(ppt.Images):
    # Extract and save the image
    ImageName = "ExtractImage/Images_" + str(i) + ".png"
    image.Image.Save(ImageName)
ppt.Dispose()
How it works:
- Presentation(): Creates a PPT document object
- LoadFromFile(): Loads the PPT file to be processed
- ppt.Images: Gets the collection of all images in the document
- image.Image.Save(): Saves the image in PNG format
After the script runs, all images are saved in order to the ExtractImage folder, named Images_0.png, Images_1.png, and so on.
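The save step expects the ExtractImage folder to exist already. A minimal sketch of creating it from code, using only the standard library, might look like the following. The helper names prepare_output_dir and image_path are hypothetical and are not part of Spire.Presentation; the file name pattern is the one used in the loop above.

```python
from pathlib import Path


def prepare_output_dir(folder: str) -> Path:
    """Create the folder (and any missing parents) if it does not exist yet."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


def image_path(out_dir: Path, index: int) -> str:
    """Build a file name following the Images_<index>.png pattern."""
    return str(out_dir / f"Images_{index}.png")
```

One way to use it is to call prepare_output_dir("ExtractImage") once before the loop, then pass image_path(out_dir, i) to image.Image.Save() in place of the hand-built string.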
Extracting Text from PPT
Besides images, extracting text content is also a common requirement. The following code iterates through each slide and extracts text from all shapes:
from spire.presentation import *
from spire.presentation.common import *
# Create a Presentation object
pres = Presentation()
# Load the PowerPoint presentation
pres.LoadFromFile("Sample.pptx")
text = []
# Iterate through each slide
for slide in pres.Slides:
    # Iterate through each shape
    for shape in slide.Shapes:
        # Check if the shape is of IAutoShape type (can contain text)
        if isinstance(shape, IAutoShape):
            # Extract text from the shape
            for paragraph in shape.TextFrame.Paragraphs:
                text.append(paragraph.Text)
# Write the extracted text to a file
with open("output/SlideText.txt", "w", encoding='utf-8') as f:
    for s in text:
        f.write(s + "\n")
pres.Dispose()
How it works:
- pres.Slides: Gets the collection of all slides
- slide.Shapes: Gets all shapes in each slide
- IAutoShape: Represents the auto-shape type that can contain text
- shape.TextFrame.Paragraphs: Gets the collection of paragraphs in the shape
- Finally, all text is written to the SlideText.txt file, with one paragraph per line
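The writing step can also be pulled into a small helper that creates the output folder first. The sketch below is one possible version using only the standard library. The function name save_paragraphs is hypothetical, and skipping blank paragraphs is an optional choice that the original code does not make.

```python
from pathlib import Path


def save_paragraphs(paragraphs, target):
    """Write one paragraph per line as UTF-8 and return how many were written."""
    path = Path(target)
    # Create the output folder if it is missing
    path.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: paragraphs that are empty or whitespace-only are not wanted
    kept = [p for p in paragraphs if p.strip()]
    with path.open("w", encoding="utf-8") as f:
        for p in kept:
            f.write(p + "\n")
    return len(kept)
```

With this in place, the with open(...) block in the script could be replaced by save_paragraphs(text, "output/SlideText.txt").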
Important Notes
- Resource Release: After using the Presentation object, be sure to call the Dispose() method to release resources and avoid memory leaks.
- File Paths: Ensure the PPT file path is correct. The directories for saving images and text need to be created in advance or created automatically using code.
- Text Encoding: Use utf-8 encoding when writing to text files to properly handle non-English characters such as Chinese.
- Image Format: The Save() method saves images in PNG format by default. Refer to the official documentation if you need other formats.
- Shape Types: The text extraction only handles the IAutoShape type. If text is located in other shape types like tables or charts, additional processing is required.
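For the resource release note, one way to make sure Dispose() runs even when an error occurs partway through is to wrap the object in a small context manager. This is a sketch under the assumption that the object exposes a Dispose() method, as Presentation does in the examples above. The name disposing is hypothetical and is not provided by Spire.Presentation.

```python
from contextlib import contextmanager


@contextmanager
def disposing(resource):
    """Yield the resource, then call its Dispose() method on exit.

    The finally clause runs whether the with-block finishes normally
    or raises, so the resource is always released.
    """
    try:
        yield resource
    finally:
        resource.Dispose()
```

A script could then write with disposing(Presentation()) as pres: and put LoadFromFile() and the extraction loop inside the block, dropping the explicit Dispose() call at the end.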
Summary
With Spire.Presentation for Python, you can batch extract images and text from PPT with just a dozen lines of code. This library is powerful and easy to use, making it ideal for office automation scenarios. I hope this article helps you improve your work efficiency!
If you need more PPT automation, such as creating presentations, modifying content, or adding charts, Spire.Presentation offers many more features to explore.