A Simple Guide to Extracting Text and Image Coordinates from PDF with Python

In data processing tasks, extracting text and image coordinates from PDF documents is a common requirement. This article will demonstrate how to use the Spire.PDF for Python library to achieve this functionality, providing simple code examples to help you get started quickly.

Introduction to Spire.PDF

Spire.PDF for Python is a powerful PDF processing library that allows developers to manipulate PDF files programmatically. It supports extracting text, images, metadata, and more. This library is particularly convenient when we need to obtain the coordinates of specific text or images.

Installation command:

pip install spire-pdf

Coordinate System Setup

In Spire.PDF, understanding the coordinate system is crucial:

The origin (0, 0) is located at the top left corner of the page.
The X-axis extends to the right, while the Y-axis extends downwards.

Understanding this helps us better position elements within the PDF.

Getting Text Coordinates

Here are the steps to extract specific text coordinates using Spire.PDF:

Create a PdfDocument object.
Load the PDF document.
Access the specific page.
Create a PdfTextFinder object and set the search options.
Search for the text and retrieve its coordinates.

Below is an example code to obtain text coordinates:

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load the PDF document
doc.LoadFromFile("Input.pdf")

# Access the specific page
page = doc.Pages.get_Item(0)

# Create a PdfTextFinder object
textFinder = PdfTextFinder(page)

# Specify search options
findOptions = PdfTextFindOptions()
findOptions.Parameter = TextFindParameter.WholeWord
textFinder.Options = findOptions

# Search for the string "Privacy Policy" in the page
findResults = textFinder.Find("Privacy Policy")

# Get the first instance of the search results
result = findResults[0]

# Get the X/Y coordinates of the found text
x = int(result.Positions[0].X)
y = int(result.Positions[0].Y)
print("The coordinates of the first instance of the found text are:", (x, y))

# Release resources
doc.Dispose()

Code Explanation

The PdfDocument object is used to open the existing PDF file.
The PdfTextFinder makes it easy to locate specified text, and the search options allow for case insensitivity and ensure whole word matches.
Finally, the text coordinates are obtained through result.Positions, where (0, 0) indicates the top-left corner of the page.

Getting Image Coordinates

The process of obtaining image coordinates is similar to text extraction but utilizes the PdfImageHelper to handle image information. Here’s an example code:

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load the PDF document
doc.LoadFromFile("Input.pdf")

# Access the specific page
page = doc.Pages.get_Item(0)

# Create a PdfImageHelper object
imageHelper = PdfImageHelper()

# Get image information from the page
imageInformation = imageHelper.GetImagesInfo(page)

# Get the X/Y coordinates of the specified image
x = int(imageInformation[0].Bounds.X)
y = int(imageInformation[0].Bounds.Y)
print("The coordinates of the specified image are:", (x, y))

# Release resources
doc.Dispose()

Code Explanation

The PdfImageHelper class is used to obtain all image information on the specific page.
The imageInformation object provides the boundary coordinates (X, Y) of the image for further processing.

Conclusion

This article introduced how to extract text and image coordinates from PDFs using Spire.PDF for Python, accompanied by relevant example code. Mastering these techniques can greatly enhance your efficiency in information extraction, data analysis, and document processing. I hope this guide helps you quickly get started with PDF coordinate extraction!