Shahzad Ashraf

How to Extract Metadata from PDF Documents in Java

Extracting metadata from PDF files is a common requirement in application development, particularly for document management, digital archiving, and automated content handling. Metadata can include the document's author, creation and modification dates, the software used to generate the file, and user-defined fields. With the GroupDocs.Metadata Cloud Java SDK, developers can add this capability to their Java applications without having to work through the details of the file format specification themselves.

The Java REST API provided by GroupDocs Cloud offers a straightforward way to access and manage PDF metadata. Rather than writing low-level parsing code, developers make a few API calls to read properties from PDFs stored either locally or in the cloud. The SDK integrates with popular storage services, so PDFs can be retrieved from multiple locations and processed in real time. As a result, the same extraction code can serve a range of settings, from internal corporate systems to large web applications.

The Cloud Java SDK is designed to keep metadata extraction dependable, secure, and scalable. The API does the heavy processing on the server side, which frees your application's resources and spares you from maintaining complex parsing libraries. Whether you are auditing documents, improving search, or managing large document collections, extracting PDF metadata in Java with the REST API is a simple, forward-compatible option that lets you provide document intelligence features without sacrificing performance or development speed. For further details on this functionality, please consult our step-by-step article.

Below is a practical code example for incorporating this feature into your Java applications:

package com.groupdocs;
import com.groupdocs.cloud.metadata.client.*;
import com.groupdocs.cloud.metadata.api.*;
import com.groupdocs.cloud.metadata.model.*;
import com.groupdocs.cloud.metadata.model.requests.*;

public class ExtractPDFMetadata {

    public static void main(String[] args) {

        // Step 1: Configure your API credentials
        // (the SDK's Configuration constructor takes the App SID first, then the App Key)
        String myAppSid = "your-app-sid";
        String myAppKey = "your-app-key";
        Configuration configuration = new Configuration(myAppSid, myAppKey);

        // Step 2: Initialize the Metadata API
        MetadataApi metadataApi = new MetadataApi(configuration);

        try {
            // Step 3: Add source file from the cloud storage
            FileInfo fileInfo = new FileInfo();
            fileInfo.setFilePath("SampleFiles/source.pdf"); 

            // Step 4: Apply extraction options
            ExtractOptions options = new ExtractOptions();
            options.setFileInfo(fileInfo);

            // Step 5: Perform metadata extraction
            ExtractRequest request = new ExtractRequest(options);
            ExtractResult result = metadataApi.extract(request);

            // Step 6: Print simplified metadata tree
            System.out.println("PDF Metadata Properties:");
            if (result.getMetadataTree() != null &&
                result.getMetadataTree().getInnerPackages() != null) {

                result.getMetadataTree().getInnerPackages().forEach(pkg -> {
                    // Skip packages that carry no properties of their own
                    if (pkg.getPackageProperties() == null) {
                        return;
                    }
                    pkg.getPackageProperties().forEach(prop -> {
                        System.out.println("- " + prop.getName() + ": " + prop.getValue());

                        // List any tags attached to the property
                        if (prop.getTags() != null && !prop.getTags().isEmpty()) {
                            prop.getTags().forEach(tag -> System.out.println(
                                "  . Tag: " + tag.getName() +
                                " (" + tag.getCategory() + ")"
                            ));
                        }
                    });
                });

            } else {

                System.out.println("No metadata found in the PDF file.");
            }

        } catch (Exception e) {
            System.err.println("Error extracting metadata: " + e.getMessage());
        }
    }
}
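
The example above prints only the properties of the first level of inner packages. If the metadata tree nests packages more deeply (an assumption here, not something the sample confirms), a recursive walk is one way to collect every property. The sketch below shows the idea on its own, without the SDK: `MetaProperty`, `MetaPackage`, `flatten`, and `sampleTree` are hypothetical stand-ins invented for illustration, not SDK classes, and the sample values are made up.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetadataTreeFlattener {

    /** Hypothetical stand-in for a metadata property (a name/value pair). */
    public static final class MetaProperty {
        final String name;
        final String value;

        public MetaProperty(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Hypothetical stand-in for a metadata package that may hold nested packages. */
    public static final class MetaPackage {
        final String name;
        final List<MetaProperty> properties;
        final List<MetaPackage> innerPackages;

        public MetaPackage(String name, List<MetaProperty> properties,
                           List<MetaPackage> innerPackages) {
            this.name = name;
            this.properties = properties;
            this.innerPackages = innerPackages;
        }
    }

    /** Flattens the tree into "package/path/property" -> value, in visiting order. */
    public static Map<String, String> flatten(MetaPackage root) {
        Map<String, String> out = new LinkedHashMap<>();
        collect(root, "", out);
        return out;
    }

    private static void collect(MetaPackage pkg, String prefix, Map<String, String> out) {
        if (pkg == null) {
            return;
        }
        String path = prefix.isEmpty() ? pkg.name : prefix + "/" + pkg.name;

        // Record this package's own properties first
        if (pkg.properties != null) {
            for (MetaProperty prop : pkg.properties) {
                out.put(path + "/" + prop.name, prop.value);
            }
        }
        // Then descend into every nested package
        if (pkg.innerPackages != null) {
            for (MetaPackage inner : pkg.innerPackages) {
                collect(inner, path, out);
            }
        }
    }

    /** Builds a small made-up tree for demonstration purposes. */
    public static MetaPackage sampleTree() {
        MetaPackage xmp = new MetaPackage("Xmp",
            Collections.singletonList(new MetaProperty("CreatorTool", "Sample Writer")),
            Collections.<MetaPackage>emptyList());
        MetaPackage doc = new MetaPackage("DocumentProperties",
            Arrays.asList(new MetaProperty("Author", "Jane Doe"),
                          new MetaProperty("Title", "Sample")),
            Collections.singletonList(xmp));
        return new MetaPackage("Pdf",
            Collections.<MetaProperty>emptyList(),
            Collections.singletonList(doc));
    }

    public static void main(String[] args) {
        flatten(sampleTree()).forEach((path, value) ->
            System.out.println(path + " = " + value));
    }
}
```

With the real SDK, the same recursion could be applied to the result of `getMetadataTree()`, reading properties through `getPackageProperties()` and children through `getInnerPackages()` as in the example above. Whether nested packages expose those same getters is an assumption worth checking against the SDK's model classes.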