Portable Document Format (PDF) files have become an integral part of our digital landscape, offering a reliable and standardized means of sharing and storing information across multiple platforms. Despite this convenience, users run into a range of everyday tasks when working with these documents, from the mundane to the complex, and these tasks shape the experience of navigating and managing PDF files.
One of the main tasks that PDF users face is extracting and editing text. PDF files are often used due to their static nature, but extracting text for reuse or modification is a common requirement.
There are several Python libraries that solve this problem (for example, PDFMiner). In this post I want to talk about working with the Aspose.PDF for Python via .NET library. As the name suggests, the library is based on .NET code, but it is completely self-contained: all the components needed at runtime are already included in the package. You can install the library using the following command:
pip install aspose-pdf
Note! Get a temporary license to try working with text without any limitations.
Extract Text Content from PDF Document
Okay, let's kick things off with the basics. This Python snippet shows how to grab and print text from a PDF using the Aspose.PDF library. Here's a quick rundown of what's happening in the code:
- Import the necessary module from the Aspose.PDF library.
- Load the PDF file (let's say "input.pdf") using the Document class and store it in the pdfDocument variable.
- Create a TextAbsorber object to extract text from the PDF document.
- Use textAbsorber to visit a page of the PDF by calling textAbsorber.visit(pdfDocument.pages[1]). Page indexes start at 1, so in our case we visit the first page, and the absorber parses and extracts its text content.
- Finally, print the extracted text content or do something else with it.
import aspose.pdf as pdf
pdfDocument = pdf.Document("input.pdf")
textAbsorber = pdf.text.TextAbsorber()
textAbsorber.visit(pdfDocument.pages[1])
print(textAbsorber.text)
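The same pattern extends to a whole document. Below is a minimal sketch, not taken from the Aspose documentation: extract_text_by_page is a hypothetical helper, and it assumes that pdfDocument.pages can be iterated with a for loop. The absorber class is passed in as an argument, which also lets you try the function with stand-in objects before pointing it at a real file.

```python
def extract_text_by_page(document, make_absorber):
    """Return a list holding the extracted text of each page, in page order.

    Hypothetical helper: `document` is anything with an iterable `pages`
    attribute (assumed to hold for pdf.Document), and `make_absorber` is a
    callable returning an object with `visit(page)` and a `text` attribute,
    such as pdf.text.TextAbsorber.
    """
    texts = []
    for page in document.pages:
        absorber = make_absorber()  # fresh absorber, so each entry holds one page
        absorber.visit(page)
        texts.append(absorber.text)
    return texts


# Intended use with the real library (requires aspose-pdf and an input file):
#   import aspose.pdf as pdf
#   page_texts = extract_text_by_page(pdf.Document("input.pdf"), pdf.text.TextAbsorber)
```

Creating a new absorber for every page is a deliberate choice here: it keeps each list entry limited to a single page without relying on how one absorber behaves across several visits.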
As you can see, TextAbsorber allows us to extract all the text from a page, but what if we need a more detailed analysis? Let’s try other tools.
Extract Text Fragments from PDF Document
The TextFragmentAbsorber allows you to extract small pieces of text called text fragments. You can loop through all the fragments and read their properties, such as Text and Position (XIndent, YIndent).
The steps to perform an extraction are basically the same:
- Import the necessary module.
- Load the PDF document.
- Initialize a TextFragmentAbsorber.
- Process a specific page: the textFragmentAbsorber.visit(pdfDocument.pages[1]) call processes the text on the first page of the loaded PDF document.
- Iterate through text fragments: the for loop iterates through each text fragment detected by the TextFragmentAbsorber on the processed page.
- Print text fragments: print(textFragment.text) prints the text content of each extracted fragment; you can perform another action here instead.
import aspose.pdf as pdf
pdfDocument = pdf.Document("input.pdf")
textFragmentAbsorber = pdf.text.TextFragmentAbsorber()
textFragmentAbsorber.visit(pdfDocument.pages[1])
for textFragment in textFragmentAbsorber.text_fragments:
    print(textFragment.text)
So, this piece of code does something pretty cool: it helps you pluck text from a particular page in a PDF using the Aspose.PDF library. It's a nifty way to snag text for later analysis or tinkering around in Python.
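Fragment positions are handy for that kind of analysis. As a rough sketch, the function below groups fragments into lines using their coordinates. It works on plain (text, x, y) tuples, so it does not touch the library itself. The names fragments_to_lines and y_tolerance are illustrative, and the attribute names in the closing comment (position.x_indent, position.y_indent) are my assumption about how XIndent and YIndent are spelled in Python. It also assumes the usual PDF coordinate system, where y grows upward from the bottom of the page.

```python
def fragments_to_lines(fragments, y_tolerance=2.0):
    """Group (text, x, y) tuples into lines, top to bottom, left to right.

    Fragments whose y value lies within `y_tolerance` of the first fragment
    of a line are treated as part of that line.
    """
    lines = []  # each entry: [line_y, [(x, text), ...]]
    # Larger y means higher on the page, so sort by descending y.
    for text, x, y in sorted(fragments, key=lambda f: -f[2]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within a line, order the fragments by their x coordinate.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


# Building the input from an absorber (attribute names are an assumption):
#   fragments = [(f.text, f.position.x_indent, f.position.y_indent)
#                for f in textFragmentAbsorber.text_fragments]
```

The tolerance is there because fragments on the same visual line do not always share exactly the same baseline value; tune it to the documents you work with.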
Extract Paragraphs of Text from PDF Document
Yet another tool helps us handle text as paragraphs. ParagraphAbsorber works similarly to the previous tool, but it has its own collection for paragraphs.
Since the purpose of this post is not a detailed examination of the ParagraphAbsorber, I will limit myself to a short example and a brief description:
- Import the necessary module and load the PDF document.
- Initialize a ParagraphAbsorber and process a specific page. After this step, we have a collection of the text sections.
- Iterate through the text sections. The outer for loop iterates through each section of text detected by the ParagraphAbsorber on the processed page. Within the inner for loop, each text fragment within a section is concatenated to form a complete paragraph of text.
- Do something with the result (like printing the paragraphs).
import aspose.pdf as pdf
pdfDocument = pdf.Document("input.pdf")
paragraphAbsorber = pdf.text.ParagraphAbsorber()
paragraphAbsorber.visit(pdfDocument.pages[1])
for section in paragraphAbsorber.page_markups:
    paragraphText=""
    for textFragment in section.text_fragments:
        paragraphText=paragraphText+textFragment.text
    paragraphText=paragraphText+"\r\n"
    print(paragraphText)
Basically, this bit of code helps you pull out paragraphs from a chosen page in a PDF. It's a handy way to grab text for all sorts of things like analysis or other processing you might want to do.
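If you want the paragraphs as data rather than printed output, the concatenation step can be pulled into a small function. This is only a sketch on plain strings: build_paragraphs is a hypothetical helper, and it expects each section to be already reduced to a list of fragment texts.

```python
def build_paragraphs(sections):
    """Join each section's fragment texts into one paragraph string.

    `sections` is an iterable of lists of strings, one list per section.
    Sections that contain only whitespace are skipped.
    """
    paragraphs = []
    for fragment_texts in sections:
        paragraph = "".join(fragment_texts).strip()
        if paragraph:
            paragraphs.append(paragraph)
    return paragraphs


# Feeding it from the absorber used in the example above:
#   sections = [[f.text for f in section.text_fragments]
#               for section in paragraphAbsorber.page_markups]
#   paragraphs = build_paragraphs(sections)
```

Using "".join on a list avoids building the string one concatenation at a time, and returning a list leaves it up to you whether to print, save, or analyze the paragraphs.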