How to Extract Text from PDF Files Effortlessly

#programming #java #pdf #txt

Extracting text from a PDF file means converting the readable content of the PDF into plain text. A text extraction tool or library parses the file and returns its text, which can then be searched, analyzed, or processed further. Extraction typically discards non-text elements such as formatting, layout, and images, leaving only the textual information. This is useful for tasks such as text analysis, data mining, and automated processing.

Step 1, Download Free Spire.PDF for Java and unzip it.

This is the community version of Spire.PDF for Java, and it places some limitations on pages during processing. If you want to get rid of the limitations, please download the commercial product from this link.
The library supports creating and editing PDF files independently in Java applications. You can also use it to convert PDF to Word, merge PDF files, and print PDF files.

Step 2, Create a new Java project.

Step 3, Import “Spire.Pdf.jar” into your project.

The following uses IntelliJ IDEA 2018 (JDK 1.8.0) as an example.

Click “File” > “Project Structure” > “Modules” > “Dependencies” in turn.
Choose “JARs or Directories” under the green plus icon.
Find “Spire.Pdf.jar” in the lib folder of the unzipped package and import it into the project.

Step 4, Write code

You can refer to the sample code below:

import com.spire.pdf.*;
import java.io.*;

public class ExtractTextFromPage {
    public static void main(String[] args) throws Exception{
        //Create a PdfDocument instance
        PdfDocument doc = new PdfDocument();

        //Load the file from disk
        doc.loadFromFile("sample.pdf");

        //Create a new txt file to save the extracted text,
        //deleting any result left over from a previous run
        String result = "extractTextFromPage.txt";
        File file=new File(result);
        if(file.exists()){
            file.delete();
        }
        file.createNewFile();
        FileWriter fw=new FileWriter(file,true);
        BufferedWriter bw=new BufferedWriter(fw);

        //Get the first page
        PdfPageBase page = doc.getPages().get(0);

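        //Extract text from the page, keeping white space
        String text = page.extractText(true);

        //Write the text to the file
        bw.write(text);

        //Flush and close the writers, then release the document
        bw.flush();
        bw.close();
        fw.close();
        doc.close();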
    }
}

The code first creates a PdfDocument object and loads a PDF file. It then creates a new text file named "extractTextFromPage.txt" to hold the extracted text. The getPages().get(0) call retrieves the first page of the document (page indexes start at 0), and calling extractText(true) on the resulting PdfPageBase object extracts the page's text while retaining white space. Finally, the write() method of the BufferedWriter object writes the extracted text to "extractTextFromPage.txt".
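
The file-writing step can also be sketched with try-with-resources, which closes the writer even if the write throws an exception. The block below is only an illustration that uses the Java standard library and no Spire.PDF calls: the class name TextFileSaver and the method save are hypothetical, and the placeholder string stands in for the value returned by extractText(true).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Spire.PDF.
public class TextFileSaver {

    // Writes the given text to the target file as UTF-8.
    // With no open options given, Files.newBufferedWriter creates the
    // file if it is missing and truncates it if it already exists.
    public static void save(String text, Path target) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        } // the writer is flushed and closed here, even on failure
    }

    public static void main(String[] args) throws IOException {
        // Placeholder for the string returned by page.extractText(true)
        String text = "First line\nSecond line";
        save(text, Paths.get("extractTextFromPage.txt"));
    }
}
```

Because the file is truncated on each run, there is no need to delete an old result file first, and naming the charset explicitly keeps the output independent of the platform default.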