DEV Community

Andriy Andruhovski for Aspose.PDF

Extract text data from a PDF file using Aspose.PDF for .NET

When working with PDF files, you sometimes need to extract their text.
Aspose.PDF provides several classes to extract the data: TextAbsorber, TextFragmentAbsorber, and ParagraphAbsorber.

The easiest way to extract the data from a PDF is to use TextAbsorber with the default options:

// Requires: using System.IO; using Aspose.Pdf; using Aspose.Pdf.Text;
private static void ExtractData1()
{
    var pdfDocument = new Document(@"C:\aspose\demodata.pdf");
    TextAbsorber textAbsorber = new TextAbsorber();
    // Run the absorber over every page of the document
    pdfDocument.Pages.Accept(textAbsorber);
    // Text now holds the extracted text of all pages
    string extractedText = textAbsorber.Text;
    File.WriteAllText(@"C:\aspose\demodata.txt", extractedText);
}

TextAbsorber performs the text extraction and exposes the result through its Text property. In this case, we get all the text data as one single string.
To extract text from one page only, call the Accept method on that page of the Document object instead. The index is the number of the page to extract text from; page numbers start at 1.

private static void ExtractData2()
{
    var pdfDocument = new Document(@"C:\aspose\demodata.pdf");
    TextAbsorber textAbsorber = new TextAbsorber();
    // Run the absorber over the first page only
    pdfDocument.Pages[1].Accept(textAbsorber);
    string extractedText = textAbsorber.Text;
    File.WriteAllText(@"C:\aspose\demodata.txt", extractedText);
}
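If the text of each page is needed separately, the per-page call can be placed in a loop. The following is only a minimal sketch under stated assumptions: the class name, the ExtractPerPage method, and the output file naming are hypothetical and not part of the original example. It creates a fresh TextAbsorber for each page so that every result stays independent of the others.

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

internal static class PerPageExtraction
{
    // Hypothetical helper (not from the article): writes one text file per page.
    public static void ExtractPerPage(string pdfPath, string outputFolder)
    {
        Directory.CreateDirectory(outputFolder);

        using (var pdfDocument = new Document(pdfPath))
        {
            foreach (Page page in pdfDocument.Pages)
            {
                // A new absorber per page keeps each page's result separate.
                var textAbsorber = new TextAbsorber();
                page.Accept(textAbsorber);

                // Page.Number is the 1-based page index used by Pages[...].
                string target = System.IO.Path.Combine(
                    outputFolder, $"page_{page.Number}.txt");
                File.WriteAllText(target, textAbsorber.Text);
            }
        }
    }
}
```

A call such as PerPageExtraction.ExtractPerPage(@"C:\aspose\demodata.pdf", @"C:\aspose\pages") would then produce page_1.txt, page_2.txt, and so on.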

Sometimes we need to extract text from a particular area of the page (e.g. the upper-left corner). TextAbsorber can do this too; we need to configure its TextSearchOptions property. In the following example, we set two of its properties: LimitToPageBounds and Rectangle. Rectangle takes a Rectangle object that specifies the region of the page to extract text from. Here, LimitToPageBounds = true restricts the search to text within the page bounds, and Rectangle selects the upper half of the page. PDF coordinates start at the lower-left corner, so the upper half runs from Height / 2 to Height.

private static void ExtractData3()
{
    var pdfDocument = new Document(@"C:\aspose\demodata.pdf");
    TextAbsorber textAbsorber = new TextAbsorber
    {
        TextSearchOptions =
        {
            LimitToPageBounds = true,
            // Rectangle(llx, lly, urx, ury): lower-left and upper-right corners
            Rectangle = new Rectangle(
                0,
                pdfDocument.PageInfo.Height / 2,
                pdfDocument.PageInfo.Width,
                pdfDocument.PageInfo.Height)
        }
    };
    pdfDocument.Pages.Accept(textAbsorber);
    string extractedText = textAbsorber.Text;
    File.WriteAllText(@"C:\aspose\demodata.txt", extractedText);
}
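The same options can target the upper-left corner mentioned above. This is a hedged sketch rather than the author's code: the class and the ExtractUpperLeftQuarter helper are hypothetical, and it assumes that Page.Rect reports the rectangle of the page being processed, which lets the region be derived from the actual page size.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

internal static class RegionExtraction
{
    // Hypothetical helper: returns the text found in the upper-left
    // quarter of the given page (pageNumber is 1-based).
    public static string ExtractUpperLeftQuarter(Document pdfDocument, int pageNumber)
    {
        Page page = pdfDocument.Pages[pageNumber];
        Rectangle pageRect = page.Rect; // assumption: the page's own rectangle

        // PDF coordinates grow rightwards and upwards from the lower-left
        // corner, so this region is the left half in X and the top half in Y.
        var region = new Rectangle(
            pageRect.LLX,
            pageRect.LLY + pageRect.Height / 2,
            pageRect.LLX + pageRect.Width / 2,
            pageRect.URY);

        var textAbsorber = new TextAbsorber();
        textAbsorber.TextSearchOptions.LimitToPageBounds = true;
        textAbsorber.TextSearchOptions.Rectangle = region;

        page.Accept(textAbsorber);
        return textAbsorber.Text;
    }
}
```

Deriving the rectangle from the page instead of hard-coding numbers keeps the region proportional when pages differ in size.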

The TextFragmentAbsorber class is mainly used in text search scenarios. When the search completes, the occurrences are available as a collection of text fragments. Each TextFragment object provides access to the matched text and its properties, and lets you edit the text and change its text state (font, font size, color, etc.).

private static void ExtractData4()
{
    var pdfDocument = new Document(@"C:\aspose\demodata.pdf");
    // Find the font that will be applied to the changed text
    Aspose.Pdf.Text.Font font = FontRepository.FindFont("Arial");
    // Create a TextFragmentAbsorber that finds all "hello world" occurrences
    TextFragmentAbsorber absorber = new TextFragmentAbsorber("hello world");
    // Accept the absorber for the first page
    pdfDocument.Pages[1].Accept(absorber);
    // Change the text and font of the first occurrence, if there is one
    // (the TextFragments collection is indexed from 1)
    if (absorber.TextFragments.Count > 0)
    {
        absorber.TextFragments[1].Text = "hi world";
        absorber.TextFragments[1].TextState.Font = font;
    }
    // Save the document
    pdfDocument.Save(@"C:\aspose\demodata_new.pdf");
}
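When the goal is to inspect the occurrences instead of editing them, the fragment collection can simply be enumerated. The sketch below is illustrative only: the class and the ListOccurrences helper are hypothetical names, and the properties it prints (page number, position, font name, font size) are just one possible selection.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

internal static class FragmentListing
{
    // Hypothetical helper: prints every occurrence of searchPhrase
    // found anywhere in the document.
    public static void ListOccurrences(string pdfPath, string searchPhrase)
    {
        using (var pdfDocument = new Document(pdfPath))
        {
            var absorber = new TextFragmentAbsorber(searchPhrase);
            // Search all pages rather than a single one
            pdfDocument.Pages.Accept(absorber);

            foreach (TextFragment fragment in absorber.TextFragments)
            {
                Console.WriteLine(
                    "Page {0}: \"{1}\" at ({2}, {3}), font {4}, size {5}",
                    fragment.Page.Number,
                    fragment.Text,
                    fragment.Position.XIndent,
                    fragment.Position.YIndent,
                    fragment.TextState.Font.FontName,
                    fragment.TextState.FontSize);
            }
        }
    }
}
```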

The ParagraphAbsorber class searches for sections and paragraphs of text and provides access to the rectangles and polygons that describe them in text coordinate space.

private static void ExtractData5()
{
    var doc = new Document(@"C:\aspose\demodata.pdf");
    Page page = doc.Pages[1];
    ParagraphAbsorber absorber = new ParagraphAbsorber();
    absorber.Visit(page);
    PageMarkup markup = absorber.PageMarkups[0];
    foreach (MarkupSection section in markup.Sections)
    {
        DrawRectangleOnPage(section.Rectangle, page);
        foreach (MarkupParagraph paragraph in section.Paragraphs)
        {
            DrawPolygonOnPage(paragraph.Points, page);
        }
    }
    doc.Save("C:\\aspose\\sections_paragraphs.pdf");
}

private static void DrawRectangleOnPage(Rectangle rectangle, Page page)
{
    page.Contents.Add(new Operator.GSave());
    page.Contents.Add(new Operator.ConcatenateMatrix(1, 0, 0, 1, 0, 0));
    page.Contents.Add(new Operator.SetRGBColorStroke(0, 1, 0));
    page.Contents.Add(new Operator.SetLineWidth(2));
    page.Contents.Add(
        new Operator.Re(rectangle.LLX,
            rectangle.LLY,
            rectangle.Width,
            rectangle.Height));
    page.Contents.Add(new Operator.ClosePathStroke());
    page.Contents.Add(new Operator.GRestore());
}

private static void DrawPolygonOnPage(IReadOnlyList<Point> polygon, Page page)
{
    page.Contents.Add(new Operator.GSave());
    page.Contents.Add(new Operator.ConcatenateMatrix(1, 0, 0, 1, 0, 0));
    page.Contents.Add(new Operator.SetRGBColorStroke(0, 0, 1));
    page.Contents.Add(new Operator.SetLineWidth(1));
    page.Contents.Add(new Operator.MoveTo(polygon[0].X, polygon[0].Y));
    for (var i = 1; i < polygon.Count; i++)
    {
        page.Contents.Add(new Operator.LineTo(polygon[i].X, polygon[i].Y));
    }
    page.Contents.Add(new Operator.LineTo(polygon[0].X, polygon[0].Y));
    page.Contents.Add(new Operator.ClosePathStroke());
    page.Contents.Add(new Operator.GRestore());
}
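Besides the geometry, the text of each paragraph can be collected as well. The following is a sketch under assumptions, not the author's code: the class and the ExtractParagraphs helper are hypothetical, and it assumes a library version in which MarkupParagraph exposes its content through a Lines collection of text fragments.

```csharp
using System.Collections.Generic;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

internal static class ParagraphText
{
    // Hypothetical helper: returns the text of every paragraph on the page,
    // one string per paragraph, with line breaks between its lines.
    public static List<string> ExtractParagraphs(Page page)
    {
        var result = new List<string>();
        var absorber = new ParagraphAbsorber();
        absorber.Visit(page);

        foreach (PageMarkup markup in absorber.PageMarkups)
        {
            foreach (MarkupSection section in markup.Sections)
            {
                foreach (MarkupParagraph paragraph in section.Paragraphs)
                {
                    var text = new StringBuilder();
                    // Assumption: Lines holds one fragment list per text line
                    foreach (var line in paragraph.Lines)
                    {
                        foreach (TextFragment fragment in line)
                        {
                            text.Append(fragment.Text);
                        }
                        text.AppendLine();
                    }
                    result.Add(text.ToString());
                }
            }
        }
        return result;
    }
}
```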

Top comments

mamtacd

Hi Team,
I need to extract content from a PDF by giving a paragraph heading or some phrase.
How can I achieve this? ParagraphAbsorber does get all the text. However, I need only the text from a particular paragraph, or a particular portion of a paragraph, not the complete page.
Regards,
Mamtha.A.C.D.

Andriy Andruhovski

Thanks for your interest!
Currently, you can use TextFragmentAbsorber with a regular expression as an input parameter.

    // Create TextFragmentAbsorber object that searches all words starting 'h' 
    // and ending 'o' using regular expression.
    TextFragmentAbsorber absorber = new TextFragmentAbsorber(@"h\w*?o", 
         new TextSearchOptions(true));

Unfortunately, ParagraphAbsorber doesn't support searching by regular expression, so you need to analyze the paragraphs extracted with this tool manually.
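One way to do that manual analysis is to filter the paragraph strings with the standard .NET Regex class once they have been extracted. This is only a sketch: ParagraphFilter and Matching are hypothetical names, and it assumes the paragraphs are already available as plain strings (it does not call Aspose.PDF itself).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ParagraphFilter
{
    // Returns the paragraphs whose text matches the given regular
    // expression, ignoring case. The input is plain strings, one per
    // paragraph, however they were extracted.
    public static List<string> Matching(IEnumerable<string> paragraphs, string pattern)
    {
        var regex = new Regex(pattern, RegexOptions.IgnoreCase);
        return paragraphs.Where(p => regex.IsMatch(p)).ToList();
    }
}
```

For example, ParagraphFilter.Matching(paragraphs, @"^payment terms") keeps only the paragraphs that start with that heading.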
