How to Extract Text from PDF in Java

#java #pdf #extracttext #text

PDF is one of the most widely used digital document formats, but the text in a PDF is difficult to edit. Extracting text from a PDF document is often required to analyze its content or to retrieve specific information from it. This article demonstrates how to extract text from PDF documents programmatically in Java with the help of Spire.PDF for Java, covering the following four scenarios.

Extract Text from PDF using Java
Extract Text from Specific Page in PDF
Extract Text from Specific Area in PDF
Extract Highlighted Text from PDF Document

Install Spire.PDF for Java

First, add the Spire.Pdf.jar file as a dependency in your Java project. The JAR file can be downloaded from this link. If you use Maven, you can import the library by adding the following configuration to your project's pom.xml file.



<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>8.7.0</version>
    </dependency>
</dependencies>

Extract Text from PDF using Java

Spire.PDF offers the PdfPageBase.extractText() method to extract the text from a PDF page. Calling it on every page extracts the text of the whole document. Here are the steps to extract text from all pages of a PDF document.

Create a PdfDocument instance.
Load a sample PDF file using the PdfDocument.loadFromFile() method.
Create a StringBuilder object.
Loop through all the pages of the PDF, extract the text of each page using the PdfPageBase.extractText() method, and append it to the StringBuilder instance using the StringBuilder.append() method.
Write the extracted text to a txt document using the FileWriter.write() method.



import com.spire.pdf.*;
import java.io.*;

public class ExtractTextFromPdf {
    public static void main(String[] args) throws Exception {

        //Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();

        //Load the file from disk
        pdf.loadFromFile("PDFSample.pdf");

        //Create a StringBuilder instance
        StringBuilder sb = new StringBuilder();

        PdfPageBase page;
        //Traverse all the pages in the document.
        for (int i = 0; i < pdf.getPages().getCount(); i++) {
            page = pdf.getPages().get(i);
            //Extract the text from the pdf pages
            sb.append(page.extractText(true));
        }
        //Write the extracted text to a new txt file; try-with-resources closes the writer
        try (FileWriter writer = new FileWriter("ExtractText.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
        pdf.close();
    }
}
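
The final step above writes the collected text to disk. Here is a minimal sketch of that step in isolation, using only the JDK. The names ExtractedTextWriter and saveText are hypothetical and are not part of Spire.PDF. The sketch assumes UTF-8 output is wanted, which avoids relying on the platform default charset that FileWriter uses on older JDKs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Spire.PDF: saves extracted text as UTF-8.
final class ExtractedTextWriter {

    // Writes the text to the given file, replacing any existing content.
    static Path saveText(String text, String fileName) throws IOException {
        Path target = Paths.get(fileName);
        // Files.write opens, writes, and closes the file in one call
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        return target;
    }

    public static void main(String[] args) throws IOException {
        // In the example above, the text would come from sb.toString()
        Path saved = saveText("Sample extracted text", "ExtractText.txt");
        System.out.println("Saved to " + saved.toAbsolutePath());
    }
}
```

Because the file is opened and closed inside a single call, there is no writer left open if an exception occurs.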

Extract Text from Specific Page in PDF

Here are the steps to extract the text from a specific page of a PDF.

Create a PdfDocument instance and load a sample PDF file using the PdfDocument.loadFromFile() method.
Create a StringBuilder object.
Get the first page of the PDF using PdfDocument.getPages().get(0). Page indexes start at 0.
Use the page.extractText() method to extract text from the first page, then append it to the StringBuilder instance using the StringBuilder.append() method.
Write the extracted text to a txt document using the FileWriter.write() method.



import com.spire.pdf.*;
import java.io.*;


public class ExtractTextFromParticularPage {
    public static void main(String[] args) throws Exception {

        //Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();

        //Load the file from disk
        pdf.loadFromFile("PDFSample.pdf");

        //Create a StringBuilder instance
        StringBuilder sb = new StringBuilder();

        //Get the first page
        PdfPageBase page = pdf.getPages().get(0);

        //Extract the text and keep white space
        sb.append(page.extractText(true));

        //Write the extracted text to a new txt file; try-with-resources closes the writer
        try (FileWriter writer = new FileWriter("extractTextFromParticularPage.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
        pdf.close();
    }
}
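
The example above hard-codes index 0 for the first page, while readers usually think in page numbers that start at 1. A small sketch of that conversion with a bounds check follows. PageIndex and toIndex are hypothetical names, not part of Spire.PDF, and pageCount stands in for the value returned by pdf.getPages().getCount().

```java
// Hypothetical helper, not part of Spire.PDF:
// maps a 1-based page number to a 0-based page index.
final class PageIndex {

    static int toIndex(int pageNumber, int pageCount) {
        // Reject page numbers that do not exist in the document
        if (pageNumber < 1 || pageNumber > pageCount) {
            throw new IllegalArgumentException(
                "Page " + pageNumber + " is outside 1.." + pageCount);
        }
        return pageNumber - 1;
    }

    public static void main(String[] args) {
        int pageCount = 5; // stand-in for pdf.getPages().getCount()
        System.out.println(toIndex(1, pageCount)); // index of the first page
        System.out.println(toIndex(5, pageCount)); // index of the last page
    }
}
```

The returned index could then be passed to pdf.getPages().get(...) in place of the literal 0.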

Extract Text from Specific Area in PDF

Here are the steps to extract the text from a specific rectangular area on a PDF page.

Create a PdfDocument instance and load a sample PDF file using the PdfDocument.loadFromFile() method.
Create a StringBuilder object.
Get the first page of the PDF using PdfDocument.getPages().get(0).
Use the page.extractText(new Rectangle2D.Float(x, y, width, height)) method to extract the text from the specified rectangular area, then append it to the StringBuilder instance using the StringBuilder.append() method.
Write the extracted text to a txt document using the FileWriter.write() method.



import com.spire.pdf.*;
import java.io.*;
import java.awt.geom.Rectangle2D;

public class ExtractTextFromSpecificArea {
    public static void main(String[] args) throws Exception {

        //Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();

        //Load the file from disk
        pdf.loadFromFile("PDFSample.pdf");

        //Create a StringBuilder instance
        StringBuilder sb = new StringBuilder();

        //Get the first page
        PdfPageBase page = pdf.getPages().get(0);

        //Extract text from a specific rectangular area within the page
        sb.append(page.extractText(new Rectangle2D.Float(60, 120, 500, 220)));

        //Write the extracted text to a new txt file; try-with-resources closes the writer
        try (FileWriter writer = new FileWriter("extractTextFromParticularArea.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
        pdf.close();
    }
}
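
The rectangle in the example above is given as raw numbers. Assuming the coordinates are in points (in PDF, one default user-space unit is 1/72 inch) and assuming the library measures the rectangle from the top-left corner of the page (both worth checking against the Spire.PDF documentation), a region measured in inches could be converted as sketched here. RegionBuilder and fromInches are hypothetical names, not part of Spire.PDF.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of Spire.PDF:
// builds an extraction rectangle from measurements in inches.
final class RegionBuilder {

    // One inch equals 72 points in default PDF user space
    static final float POINTS_PER_INCH = 72f;

    static Rectangle2D.Float fromInches(float x, float y, float width, float height) {
        if (width <= 0 || height <= 0) {
            throw new IllegalArgumentException("Width and height must be positive");
        }
        return new Rectangle2D.Float(
            x * POINTS_PER_INCH,
            y * POINTS_PER_INCH,
            width * POINTS_PER_INCH,
            height * POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        // A region 1 inch from the left, 2 inches from the top, 6 x 3 inches in size
        Rectangle2D.Float area = fromInches(1f, 2f, 6f, 3f);
        System.out.println(area);
    }
}
```

The resulting rectangle could be passed to page.extractText(...) in place of the hard-coded one.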

Extract Highlighted Text from PDF Document

Spire.PDF also supports extracting highlighted text from a PDF document. Here are the steps.

Create a PdfDocument instance and load a sample PDF file using the PdfDocument.loadFromFile() method.
Create a StringBuilder object.
Get the first page of the PDF using PdfDocument.getPages().get(0).
Get the annotation collection of the first page using page.getAnnotationsWidget().
Loop through the annotations. For each annotation that is a PdfTextMarkupAnnotationWidget, extract the text within its bounds using page.extractText(textMarkupAnnotation.getBounds()), then append it to the StringBuilder instance using the StringBuilder.append() method.
Write the extracted text to a txt document using the FileWriter.write() method.



import com.spire.pdf.*;
import com.spire.pdf.annotations.PdfTextMarkupAnnotationWidget;

import java.io.*;

public class ExtractHighlightedText {
    public static void main(String[] args) throws Exception {

        //Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();

        //Load the file from disk
        pdf.loadFromFile("PDFSample0.pdf");

        //Create a StringBuilder instance
        StringBuilder sb = new StringBuilder();

        //Get the first page
        PdfPageBase page = pdf.getPages().get(0);

        //Loop through the annotations and extract the text under each text markup (highlight) annotation
        for (int i = 0; i < page.getAnnotationsWidget().getCount(); i++) {
            if (page.getAnnotationsWidget().get(i) instanceof PdfTextMarkupAnnotationWidget) {
                PdfTextMarkupAnnotationWidget textMarkupAnnotation = (PdfTextMarkupAnnotationWidget) page.getAnnotationsWidget().get(i);
                sb.append(page.extractText(textMarkupAnnotation.getBounds()));
            }
        }
        //Write the extracted text to a new txt file; try-with-resources closes the writer
        try (FileWriter writer = new FileWriter("extractHighlightedText.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
        pdf.close();
    }
}
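
The loop above appends highlighted text in the order in which the annotations are stored, which may not match the order in which they appear on the page. If reading order matters, one option is to sort the annotation bounds before extracting. The sketch below uses plain Rectangle2D values in place of textMarkupAnnotation.getBounds() and assumes a top-left origin, so a smaller y means higher on the page. HighlightOrder and sortReadingOrder are hypothetical names, not part of Spire.PDF.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Spire.PDF:
// orders highlight rectangles top-to-bottom, then left-to-right.
final class HighlightOrder {

    static List<Rectangle2D> sortReadingOrder(List<Rectangle2D> bounds) {
        // Copy first so the caller's list is left unchanged
        List<Rectangle2D> sorted = new ArrayList<>(bounds);
        sorted.sort(Comparator.<Rectangle2D>comparingDouble(Rectangle2D::getY)
                .thenComparingDouble(Rectangle2D::getX));
        return sorted;
    }

    public static void main(String[] args) {
        List<Rectangle2D> bounds = new ArrayList<>();
        bounds.add(new Rectangle2D.Float(50, 300, 120, 14));  // lower on the page
        bounds.add(new Rectangle2D.Float(200, 100, 80, 14));  // top line, right
        bounds.add(new Rectangle2D.Float(50, 100, 100, 14));  // top line, left

        for (Rectangle2D r : sortReadingOrder(bounds)) {
            System.out.println(r.getX() + ", " + r.getY());
        }
    }
}
```

This simple ordering treats two highlights as being on the same line only when their y values are exactly equal; real documents may need a small tolerance.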

Conclusion

In this article, we have demonstrated how to extract text from PDF documents in Java. With Spire.PDF for Java, you can extract text in several scenarios: all the text in a PDF, the text of a specific page, the text within a specific area of a page, or only the highlighted text. You can check the PDF forum for more features for working with PDF files.

Top comments (1)

Sergei • Dec 18 '24 • Edited

Good to know: Spire.PDF is not open source, and you need a license to use it. An open-source alternative would be Apache PDFBox.