DEV Community

jelizaveta

Unlocking PDF Content: How to Quickly Extract Text from PDFs Using Python

PDF is a common format for contracts, reports, and eBooks, so important information often ends up stored in PDF files and needs to be extracted as text. This article shows how to do that with Spire.PDF for Python, focusing on two cases: extracting text from a specific page, and extracting text from a designated rectangular area of a page.

1. Environment Setup

First, make sure Python is installed. You can then install Spire.PDF for Python with the following command:

pip install Spire.PDF
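
To confirm that the package can be found before running the examples, a quick check along the following lines may help. This is only a sketch: the helper name `is_importable` is hypothetical, and it assumes the package exposes the `spire.pdf` module used in the imports later in this article.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the named module (hypothetical helper)."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing parent package (e.g. "spire") raises ModuleNotFoundError,
        # which is a subclass of ImportError
        return False


if __name__ == "__main__":
    if is_importable("spire.pdf"):
        print("spire.pdf is available")
    else:
        print("spire.pdf not found; try: pip install Spire.PDF")
```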

2. Extracting Text from a Specific Page

2.1 Code Example

The following code demonstrates how to extract text from a specific page of a PDF document (for example, page 2):

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load the PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Terms of service.pdf')

# Create a PdfTextExtractOptions object and enable full text extraction
extractOptions = PdfTextExtractOptions()
# Extract all text, including spaces
extractOptions.IsExtractAllText = True

# Get the specific page (e.g., page 2)
page = doc.Pages.get_Item(1)

# Create a PdfTextExtractor object
textExtractor = PdfTextExtractor(page)

# Extract text from the page
text = textExtractor.ExtractText(extractOptions)

# Write the extracted text to a file using UTF-8 encoding
with open('output/TextOfPage.txt', 'w', encoding='utf-8') as file:
    file.write(text)

2.2 Code Explanation

  1. Create a PdfDocument object : Instantiate PdfDocument, the object that represents the PDF file.
  2. Load the PDF document : Load the PDF file using the specified path.
  3. Configure extraction options : Setting IsExtractAllText to True ensures that all text, including spaces, is extracted.
  4. Get the specific page : doc.Pages.get_Item(1) fetches the second page of the PDF (indexing starts at 0).
  5. Create the text extractor and extract text : Use the PdfTextExtractor object to extract text.
  6. Save the extracted text to a file : Finally, save the extracted content to a specified path.
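
The steps above could be wrapped into a reusable function that accepts a human-friendly, 1-based page number. The following is a minimal sketch, not the library's recommended pattern: `to_page_index` and `extract_page_text` are hypothetical helper names, and the code assumes that `doc.Pages.Count` and `doc.Close()` are available in your installed version of Spire.PDF for Python. The Spire import sits inside the function so the index helper can be used on its own.

```python
from pathlib import Path


def to_page_index(page_number: int, page_count: int) -> int:
    """Convert a 1-based page number to the 0-based index get_Item expects."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page {page_number} is outside 1..{page_count}")
    return page_number - 1


def extract_page_text(pdf_path: str, page_number: int, out_path: str) -> str:
    """Extract one page's text and save it as UTF-8 (hypothetical helper)."""
    # Imported here so to_page_index stays usable without Spire.PDF installed
    from spire.pdf import PdfDocument, PdfTextExtractOptions, PdfTextExtractor

    doc = PdfDocument()
    doc.LoadFromFile(pdf_path)
    try:
        # Assumption: Pages.Count reports the number of pages in the document
        index = to_page_index(page_number, doc.Pages.Count)
        options = PdfTextExtractOptions()
        options.IsExtractAllText = True
        extractor = PdfTextExtractor(doc.Pages.get_Item(index))
        text = extractor.ExtractText(options)
    finally:
        # Assumption: Close() releases the resources held by the document
        doc.Close()

    target = Path(out_path)
    # Create the output folder if it does not exist yet
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return text
```

For the document used in this article, the call would look like `extract_page_text('C:/Users/Administrator/Desktop/Terms of service.pdf', 2, 'output/TextOfPage.txt')`. Validating the page number up front gives a clear error message instead of a failure inside the library.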

3. Extracting Text from a Specific Area

Sometimes you only need the text in one region of a page rather than the whole page. You can target that region by defining a rectangular area.

3.1 Code Example

The following code shows how to extract text from a specified area of a PDF:

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load the PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Terms of service.pdf')

# Get the specific page (e.g., page 2)
page = doc.Pages.get_Item(1)

# Create a PdfTextExtractor object
textExtractor = PdfTextExtractor(page)

# Create a PdfTextExtractOptions object
extractOptions = PdfTextExtractOptions()

# Define the rectangular area for extraction
# RectangleF(left, top, width, height)
extractOptions.ExtractArea = RectangleF(0.0, 100.0, 890.0, 80.0)

# Extract text from the specified area
text = textExtractor.ExtractText(extractOptions)

# Write the extracted text to a file using UTF-8 encoding
with open('output/TextOfRectangle.txt', 'w', encoding='utf-8') as file:
    file.write(text)

3.2 Code Explanation

  1. Load the PDF file : Similar to before, first load the PDF document.
  2. Get the specific page : Again, use doc.Pages.get_Item(1) to get the second page.
  3. Define the extraction area : Use the RectangleF class to define a rectangular area, where the top-left corner is at (0, 100), with a width of 890 and a height of 80.
  4. Execute text extraction : Use the ExtractText method to extract text from the specified area.
  5. Save the text : Finally, save the extracted text as a UTF-8 encoded file.
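
When a region is known by two opposite corners rather than by its width and height, the values for RectangleF can be derived with a small helper. This is a sketch under stated assumptions: `rect_from_corners` is a hypothetical name, and it assumes the coordinate convention described above, where the rectangle is anchored at its top-left corner and the y value grows down the page.

```python
def rect_from_corners(x1, y1, x2, y2):
    """Return (left, top, width, height) for a box given two opposite corners."""
    # Sort each axis so the corners can be passed in either order
    left, right = sorted((float(x1), float(x2)))
    # Assumption: y grows downward, so the smaller y value is the top edge
    top, bottom = sorted((float(y1), float(y2)))
    width, height = right - left, bottom - top
    if width == 0 or height == 0:
        raise ValueError("the extraction area must have a non-zero size")
    return left, top, width, height


# The area used in this article spans from (0, 100) to (890, 180)
print(rect_from_corners(0, 100, 890, 180))

# With Spire.PDF imported as in the example above, the result could be used as:
# extractOptions.ExtractArea = RectangleF(*rect_from_corners(0, 100, 890, 180))
```

The helper returns floats because the example above passes float values to RectangleF.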

Conclusion

With the methods above, you can extract the text you need from a PDF document, either from a whole page or from a rectangular area within it. Both cases use the same PdfTextExtractor and PdfTextExtractOptions classes in Spire.PDF for Python, so the approach is easy to reuse when you handle a large number of PDF files.

I hope this blog helps you better understand how to extract PDF text using Python, making your work easier and more efficient!
