Extract data from PDF into Excel

#pdf #excel #java #free

When you work in an office, you may be handed many PDF forms full of names and numbers and asked to gather all the data into an Excel spreadsheet. You could copy and paste the data into Excel by hand, but that is tedious and can take hours. Here I would like to recommend Spire.PDF for Java, which lets you extract data from PDF forms into Excel worksheets in a few lines of code.

Spire.PDF for Java is a PDF API that enables Java applications to read, write, save and print PDF documents without using Adobe Acrobat. Using this Java PDF component, developers can create PDF files from scratch or process existing PDF files. This article shows how to extract data from PDF files and store it in Excel worksheets in the following two ways:

Convert PDF to Excel directly

Export table data from PDF to Excel

Install Spire.PDF for Java

First, add the Spire.Pdf.jar file as a dependency in your Java program. The JAR file can be downloaded from this link. If you use Maven, you can import the JAR file into your application by adding the following code to your project's pom.xml file.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>9.7.0</version>
    </dependency>
</dependencies>

Convert PDF to Excel

The following are the steps to convert a PDF document to Excel:

Initialize an instance of PdfDocument class.
Load the PDF document using PdfDocument.loadFromFile(String) method.
Save the document to Excel using PdfDocument.saveToFile(String, FileFormat) method.

import com.spire.pdf.FileFormat;
import com.spire.pdf.PdfDocument;

public class PdftoExcel {
    public static void main(String[] args) throws Exception {

        //Initialize an instance of PdfDocument class
        PdfDocument pdf = new PdfDocument();
        //Load the PDF document
        pdf.loadFromFile("Sample.pdf");

        //Save the PDF document to XLSX
        pdf.saveToFile("PdfToExcel.xlsx", FileFormat.XLSX);
    }
}
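
When many PDF files need converting, as in the office scenario at the start, it can help to derive each output name from its input name instead of hard-coding it. The following is a minimal sketch that uses only the JDK. `OutputNames` and `xlsxFor` are hypothetical names chosen for illustration and are not part of Spire.PDF, and the sketch assumes the path points to a file, not a file-system root.

```java
import java.nio.file.Path;

// Hypothetical helper, not part of Spire.PDF: derives an .xlsx path from a .pdf path.
final class OutputNames {

    private OutputNames() {
    }

    // Returns a path in the same folder with the same base name and an .xlsx extension.
    // Assumes pdfPath names a file (getFileName() would be null for a root path).
    static Path xlsxFor(Path pdfPath) {
        String fileName = pdfPath.getFileName().toString();
        int dot = fileName.lastIndexOf('.');
        // Keep the whole name when there is no extension or only a leading dot.
        String baseName = dot > 0 ? fileName.substring(0, dot) : fileName;
        return pdfPath.resolveSibling(baseName + ".xlsx");
    }
}
```

The resulting path could then be passed to `saveToFile` as a string, for example `pdf.saveToFile(OutputNames.xlsxFor(input).toString(), FileFormat.XLSX)`, where `input` is the `Path` of the PDF being converted.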

Export table data from PDF to Excel

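When you convert a whole PDF file to Excel, you may find that the borders disappear and that the result includes other data you don't want. If you want a cleaner worksheet, you can extract only the data in the tables on a PDF page and export each table to its own Excel worksheet. Note that the example below imports classes from the com.spire.xls package, which is provided by Spire.XLS for Java, so that library must be on the classpath as well.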

import com.spire.pdf.PdfDocument;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;
import com.spire.xls.ExcelVersion;
import com.spire.xls.Workbook;
import com.spire.xls.Worksheet;

public class ExtractTableDataAndSaveInExcel {
    public static void main(String[] args) throws Exception {

        //Load a sample PDF document
        PdfDocument pdf = new PdfDocument("Sample1.pdf");

        //Create a PdfTableExtractor instance
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);

        //Extract tables from the first page
        PdfTable[] pdfTables = extractor.extractTable(0);

        //Create a Workbook object
        Workbook wb = new Workbook();

        //Remove default worksheets
        wb.getWorksheets().clear();

        //If any tables are found
        if (pdfTables != null && pdfTables.length > 0) {

            //Loop through the tables
            for (int tableNum = 0; tableNum < pdfTables.length; tableNum++) {

                //Add a worksheet to workbook
                String sheetName = String.format("Table - %d", tableNum + 1);
                Worksheet sheet = wb.getWorksheets().add(sheetName);

                //Loop through the rows in the current table
                for (int rowNum = 0; rowNum < pdfTables[tableNum].getRowCount(); rowNum++) {

                    //Loop through the columns in the current table
                    for (int colNum = 0; colNum < pdfTables[tableNum].getColumnCount(); colNum++) {

                        //Extract data from the current table cell
                        String text = pdfTables[tableNum].getText(rowNum, colNum);

                        //Insert data into a specific cell
                        sheet.get(rowNum + 1, colNum + 1).setText(text);

                    }
                }

                //Auto fit column width
                for (int sheetColNum = 0; sheetColNum < sheet.getColumns().length; sheetColNum++) {
                    sheet.autoFitColumn(sheetColNum + 1);
                }
            }
        }

        //Save the workbook to an Excel file
        wb.saveToFile("ExportTableToExcel1.xlsx", ExcelVersion.Version2016);
    }
}
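
Text pulled out of table cells may contain line breaks or repeated spaces where the original cell wrapped. This is an assumption about typical PDF input, not something stated about Spire.PDF's output. If that turns out to be the case, a small clean-up step could be applied to `text` before it is written to the worksheet. The sketch below uses only the JDK; `CellText` and `normalize` are hypothetical names, not part of Spire.PDF or Spire.XLS.

```java
// Hypothetical helper, not part of Spire.PDF or Spire.XLS.
final class CellText {

    private CellText() {
    }

    // Trims the ends and collapses every run of whitespace (including line breaks)
    // into a single space; null becomes an empty string.
    static String normalize(String raw) {
        if (raw == null) {
            return "";
        }
        return raw.trim().replaceAll("\\s+", " ");
    }
}
```

In the loop above, this would mean calling `setText(CellText.normalize(text))` instead of `setText(text)`. Whether collapsing line breaks is desirable depends on the data, so treat it as optional.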

Conclusion

In this article, we have demonstrated how to export the data in a PDF table and store it in Excel using Java. With Spire.PDF for Java, you can also extract all the text and images from a PDF file for other scenarios. You can check the PDF forum for more features for working with PDF files.