Extract Text from PDF Files Using the Aspose.PDF Library in Python

#python #development #dotnet #pdf

Portable Document Format (PDF) files have become an integral part of our digital landscape, offering a reliable and standardized means of sharing and storing information across multiple platforms. Despite this convenience, users run into a range of challenges when working with these documents day to day, from the mundane to the complex.

One of the most common tasks is extracting and editing text. PDF files are often chosen for their static nature, but extracting text for reuse or modification is a frequent requirement.

Several Python libraries can solve this problem (for example, PDFMiner). In this post I want to talk about working with the Aspose.PDF for Python via .NET library. As the name suggests, the library is based on .NET code, but it is completely self-contained: all the components needed to run it are already included in the package. You can install it using the following command:

pip install aspose-pdf

Note! Without a license the library works in evaluation mode, which has limitations. Get a temporary license to try working with text without them.
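Before running the examples, it can be handy to confirm that the package was actually installed. The sketch below uses only the standard library's importlib.metadata; the helper name installed_version is made up for this post, and the distribution name is the one from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "aspose-pdf" is the distribution name used in the pip command above.
found = installed_version("aspose-pdf")
print(found if found else "aspose-pdf is not installed")
```

If the last line reports that the package is missing, check that pip and the interpreter you run belong to the same environment.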

Extract Text Content from PDF Document

Okay, let's kick things off with the basics. This Python snippet shows how to grab and print text from a PDF using the Aspose.PDF library. Here's a quick rundown of what's happening in the code:

Import the necessary module from the Aspose.PDF library.
Load the PDF file (let's say "input.pdf") using the Document class and store it in the pdfDocument variable.
Create a TextAbsorber object to extract text from the PDF document.
Pass a page to the absorber with textAbsorber.visit(pdfDocument.pages[1]). Page numbering starts at 1, so in our case the absorber parses the first page and extracts its text content.
Finally, print the extracted text content or do something else with it.

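import aspose.pdf as pdf

# Load the source document
pdfDocument = pdf.Document("input.pdf")
# Create the absorber and let it process the first page (pages are numbered from 1)
textAbsorber = pdf.text.TextAbsorber()
textAbsorber.visit(pdfDocument.pages[1])
# Print the extracted text
print(textAbsorber.text)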
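As one example of "doing something else" with the result, the extracted string can be tidied up before further processing. This is only a sketch: it works on any string, the sample text stands in for textAbsorber.text and is not real library output, and the helper name clean_lines is hypothetical.

```python
def clean_lines(raw_text):
    """Split extracted text into lines, trimming whitespace and dropping blanks."""
    lines = []
    for line in raw_text.splitlines():
        stripped = line.strip()
        if stripped:
            lines.append(stripped)
    return lines


# Sample string standing in for textAbsorber.text (not real library output).
sample = "Invoice  \r\n\r\n   Total: 42 USD\r\nThank you\r\n"
for line in clean_lines(sample):
    print(line)
```

In a real script you would call clean_lines(textAbsorber.text) instead of using the sample string.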

As you can see, TextAbsorber allows us to extract all the text from a page, but what if we need a more detailed analysis? Let’s try other tools.

Extract Text Fragments from PDF Document

The TextFragmentAbsorber allows you to extract small pieces of text called text fragments. You can loop through all the fragments and read their properties, such as Text and Position (XIndent, YIndent).

The steps to perform an extraction are basically the same:

Import the necessary module.
Load the PDF document.
Initialize a TextFragmentAbsorber.
Process a specific page: The textFragmentAbsorber.visit(pdfDocument.pages[1]) method processes the text on the first page of the loaded PDF document.
Iterate through text fragments: The for loop iterates through each text fragment detected by the TextFragmentAbsorber on the processed page.
Print text fragments: print(textFragment.text) prints the text content of each extracted fragment. You can perform another action here instead.

import aspose.pdf as pdf

# Load the source document
pdfDocument = pdf.Document("input.pdf")
# Collect the text fragments of the first page
textFragmentAbsorber = pdf.text.TextFragmentAbsorber()
textFragmentAbsorber.visit(pdfDocument.pages[1])
# Print the text of every fragment that was found
for textFragment in textFragmentAbsorber.text_fragments:
    print(textFragment.text)
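Since each fragment carries a position, one thing you might want is to put fragments into reading order. The sketch below assumes that the Python objects expose the position as position.x_indent and position.y_indent (the snake_case form of XIndent and YIndent); to keep it runnable without a PDF, it uses simple stand-in objects, and the helper names in_reading_order and make_fragment are made up for this post.

```python
from types import SimpleNamespace


def in_reading_order(fragments):
    """Sort fragments top-to-bottom, then left-to-right.

    PDF coordinates start at the bottom-left corner, so a larger
    y_indent means the fragment sits higher on the page.
    """
    return sorted(
        fragments,
        key=lambda f: (-f.position.y_indent, f.position.x_indent),
    )


def make_fragment(text, x, y):
    # Stand-in for a real text fragment; it has only the attributes used here.
    return SimpleNamespace(text=text, position=SimpleNamespace(x_indent=x, y_indent=y))


fragments = [
    make_fragment("world", 120.0, 700.0),
    make_fragment("Footer", 50.0, 40.0),
    make_fragment("Hello", 50.0, 700.0),
]
for fragment in in_reading_order(fragments):
    print(fragment.text, fragment.position.x_indent, fragment.position.y_indent)
```

With real documents, fragments on the same visual line can have slightly different vertical positions, so you may need to round y_indent before sorting.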

So, this piece of code does something pretty cool: it helps you pluck text from a particular page in a PDF using the Aspose.PDF library. It's a nifty way to snag text for later analysis or tinkering around in Python.

Extract Paragraphs of Text from PDF Document

Yet another tool helps us handle text as paragraphs. ParagraphAbsorber works similarly to the previous tool, but it has its own collection for paragraphs.

Since the purpose of this post is not a detailed examination of the ParagraphAbsorber, I will limit myself to only a short example and a brief description:

Import the necessary module and load the PDF document.
Initialize a ParagraphAbsorber and process a specific page. This step gives us a collection of text sections (exposed as page_markups).
Iterate through text sections. The outer for loop iterates through each section of text detected by the ParagraphAbsorber on the processed page. The inner for loop concatenates the text fragments of a section, one per line, to build up its text.
Do something (like print paragraphs).

import aspose.pdf as pdf

# Load the document and let the absorber process the first page
pdfDocument = pdf.Document("input.pdf")
paragraphAbsorber = pdf.text.ParagraphAbsorber()
paragraphAbsorber.visit(pdfDocument.pages[1])
for section in paragraphAbsorber.page_markups:
    paragraphText = ""
    # Append each fragment of the section on its own line
    for textFragment in section.text_fragments:
        paragraphText += textFragment.text + "\r\n"
    print(paragraphText)
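The inner loop above builds the string by repeated concatenation. A slightly more idiomatic alternative in Python is str.join, sketched below with stand-in objects so it runs without a PDF; the helper name join_fragment_text is hypothetical. Note one small difference: join puts the separator only between fragments, so there is no trailing line break.

```python
from types import SimpleNamespace


def join_fragment_text(fragments, separator="\r\n"):
    """Join the text of each fragment with the given separator."""
    return separator.join(fragment.text for fragment in fragments)


# Stand-ins for the fragments of one section (not real library objects).
fragments = [SimpleNamespace(text="First line"), SimpleNamespace(text="Second line")]
print(join_fragment_text(fragments))
```

In the example above you would call join_fragment_text(section.text_fragments) inside the outer loop.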

Basically, this bit of code helps you pull out paragraphs from a chosen page in a PDF. It's a handy way to grab text for all sorts of things like analysis or other processing you might want to do.