Pilalo

Posted on Apr 28

Combining PDF Documents with Java: A Hands-On Guide

#java #productivity #automation #programming

Merging PDF documents is a fundamental operation in many document management workflows. Whether you're aggregating monthly reports into a single file, combining contract sections from different sources, or building an automated archiving system, the ability to programmatically join PDFs is a valuable tool in a Java developer's toolkit. In this article, I'll walk through several approaches to merging PDF files using Spire.PDF for Java, covering full document merges, selective page combining, and stream-based operations.

Common Use Cases for PDF Merging

Before diving into the code, let's consider the scenarios where PDF merging comes into play:

Financial Report Compilation: Banks and accounting firms combining quarterly statements into annual reports.
Legal Document Assembly: Law firms merging contract sections, exhibits, and appendices from various sources.
Invoice Consolidation: E-commerce platforms grouping multiple invoices for batch processing or customer delivery.
Digital Archiving: Organizations merging scanned documents into unified archival records.
Print Job Preparation: Combining multiple documents for efficient batch printing.

Each of these scenarios presents slightly different requirements—sometimes you need the entire document, other times only specific pages, and occasionally the source files aren't even stored on disk.

Setting Up the Library

To include Spire.PDF for Java in your project, add the following dependency to your Maven pom.xml:

<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>12.4.4</version>
    </dependency>
</dependencies>

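Spire.PDF is distributed through the vendor's own Maven repository rather than Maven Central, so the dependency will typically resolve only after you add the matching <repositories> entry from the vendor's installation instructions. If you're using Gradle, declare the same repository and coordinates there; if you manage JARs manually, download the library from the vendor's site.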

Basic Merge: Combining Complete PDF Files

The simplest scenario is merging entire PDF documents end-to-end. The PdfDocument.mergeFiles() method handles this in a single operation:

import com.spire.pdf.FileFormat;
import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfDocumentBase;

public class MergePdfs {
    public static void main(String[] args) {
        // Define the paths of the PDF files to be merged
        String[] files = new String[] {
            "Sample1.pdf", 
            "Sample2.pdf", 
            "Sample3.pdf"
        };

        // Merge the files into a single PdfDocumentBase object
        PdfDocumentBase pdf = PdfDocument.mergeFiles(files);

        // Save the merged result, then release the document's resources
        pdf.save("MergedDocument.pdf", FileFormat.PDF);
        pdf.close();
    }
}

How it works: The mergeFiles() method takes an array of file paths and processes them in array order: the pages of the first file come first, followed by the pages of the second, and so on. It returns a PdfDocumentBase object representing the combined result, which you can then save or manipulate further.

This approach works well when you have a relatively small number of files and want all pages from each source document included in the output.
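Before handing paths to mergeFiles(), it can help to fail fast on missing or non-PDF inputs. The following is a minimal sketch that uses only the JDK (Java 11 or later for readNBytes). The names PdfInputCheck, looksLikePdf, and requirePdfs are hypothetical helpers, not part of Spire.PDF, and the check is deliberately strict: it only accepts files whose first bytes are the %PDF- header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper (not part of Spire.PDF): cheap sanity checks before merging.
final class PdfInputCheck {
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfInputCheck() {}

    // Returns true if the path is a readable regular file that starts with "%PDF-".
    static boolean looksLikePdf(Path path) throws IOException {
        if (!Files.isRegularFile(path) || !Files.isReadable(path)) {
            return false;
        }
        try (InputStream in = Files.newInputStream(path)) {
            byte[] head = in.readNBytes(MAGIC.length);
            return Arrays.equals(head, MAGIC);
        }
    }

    // Throws on the first path that fails the check; otherwise returns
    // the paths as strings, in the same order they were given.
    static String[] requirePdfs(Path... paths) throws IOException {
        String[] result = new String[paths.length];
        for (int i = 0; i < paths.length; i++) {
            if (!looksLikePdf(paths[i])) {
                throw new IllegalArgumentException("Not a readable PDF: " + paths[i]);
            }
            result[i] = paths[i].toString();
        }
        return result;
    }
}
```

The array returned by requirePdfs() keeps the input order, so it can be passed to mergeFiles() unchanged.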

Selective Page Merging: Combining Specific Pages Only

In many real-world scenarios, you don't need every page from every document. You might only want the cover page from one file, the body content from another, and the appendix from a third. The library supports this through the insertPage() and insertPageRange() methods:

import com.spire.pdf.PdfDocument;

public class MergeSelectedPages {
    public static void main(String[] args) {
        // Define source files
        String[] files = new String[] {
            "Sample1.pdf", 
            "Sample2.pdf", 
            "Sample3.pdf"
        };

        // Load each source document
        PdfDocument[] pdfs = new PdfDocument[files.length];
        for (int i = 0; i < files.length; i++) {
            pdfs[i] = new PdfDocument(files[i]);
        }

        // Create a new empty PDF as the merge target
        PdfDocument pdf = new PdfDocument();

        // Insert specific pages from each source
        pdf.insertPage(pdfs[0], 0);            // Page 1 from first file
        pdf.insertPageRange(pdfs[1], 1, 3);    // Pages 2-4 from second file
        pdf.insertPage(pdfs[2], 0);            // Page 1 from third file

        // Save the selectively merged document
        pdf.saveToFile("SelectivelyMerged.pdf");

        // Close source documents
        for (PdfDocument source : pdfs) {
            source.close();
        }
        pdf.close();
    }
}

Page indexing note: The insertPage() and insertPageRange() methods use zero-based indexing. So insertPage(pdfs[0], 0) inserts the first page, and insertPageRange(pdfs[1], 1, 3) inserts pages 2 through 4 (indices 1, 2, and 3) from the second document.
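Because people usually talk about pages in 1-based terms ("pages 2 to 4"), an off-by-one slip is easy here. As a rough sketch, a small value class can do the conversion and bounds check in one place. PageRange and ofPages are hypothetical names introduced for illustration, not library API, and pageCount is assumed to be the number of pages in the source document.

```java
// Hypothetical helper (not part of Spire.PDF): converts 1-based page numbers
// into the zero-based, inclusive indices described in the note above.
final class PageRange {
    final int startIndex; // zero-based, inclusive
    final int endIndex;   // zero-based, inclusive

    private PageRange(int startIndex, int endIndex) {
        this.startIndex = startIndex;
        this.endIndex = endIndex;
    }

    // firstPage and lastPage are 1-based and inclusive; pageCount is the
    // total number of pages in the source document.
    static PageRange ofPages(int firstPage, int lastPage, int pageCount) {
        if (firstPage < 1 || lastPage < firstPage || lastPage > pageCount) {
            throw new IllegalArgumentException(
                "Invalid page range " + firstPage + "-" + lastPage
                    + " for a document with " + pageCount + " pages");
        }
        return new PageRange(firstPage - 1, lastPage - 1);
    }
}
```

With a loaded source document, PageRange.ofPages(2, 4, pageCount) yields indices 1 and 3, which could then be passed along as insertPageRange(source, range.startIndex, range.endIndex). The page count itself would come from the source document, for instance via getPages().getCount() if your version of the library exposes it.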

This approach gives you precise control over the composition of the final document, which is particularly useful for generating report summaries, extracting specific chapters, or removing unnecessary pages like blank separators.

Stream-Based Merging: Working Without Local Files

Not all PDFs exist as physical files on a disk. In web applications, microservices, or message-driven systems, documents might arrive as input streams from network downloads, database BLOBs, or in-memory generation. The mergeFiles() method also accepts InputStream arrays:

import com.spire.pdf.*;
import java.io.*;

public class MergePdfsByStream {
    public static void main(String[] args) throws IOException {
        // Create FileInputStream objects for each PDF
        FileInputStream stream1 = new FileInputStream(new File("Template_1.pdf"));
        FileInputStream stream2 = new FileInputStream(new File("Template_2.pdf"));
        FileInputStream stream3 = new FileInputStream(new File("Template_3.pdf"));

        // Combine streams into an array
        InputStream[] streams = new InputStream[]{
            stream1, stream2, stream3
        };

        // Merge the streams
        PdfDocumentBase pdf = PdfDocument.mergeFiles(streams);

        // Save the result
        pdf.save("MergedByStream.pdf", FileFormat.PDF);

        // Release resources
        pdf.close();
        pdf.dispose();
        stream1.close();
        stream2.close();
        stream3.close();
    }
}

This approach is useful when source files come from non-file sources or when you're working in environments where direct file system access is limited.
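The example above still reads from disk, so here is a small sketch of the in-memory case. It assumes each document is already available as a byte array (for example, the contents of a database BLOB or an HTTP response body). InMemorySources and toStreams are hypothetical names; only JDK classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.List;

// Hypothetical helper (not part of Spire.PDF): adapts in-memory documents
// to the InputStream[] shape used by the stream-based merge above.
final class InMemorySources {
    private InMemorySources() {}

    // Wraps each byte array in a ByteArrayInputStream, preserving list order.
    static InputStream[] toStreams(List<byte[]> documents) {
        InputStream[] streams = new InputStream[documents.size()];
        for (int i = 0; i < streams.length; i++) {
            streams[i] = new ByteArrayInputStream(documents.get(i));
        }
        return streams;
    }
}
```

The resulting array can be handed to mergeFiles() in the same way as the FileInputStream array. ByteArrayInputStream holds no system resources, so closing these streams is harmless but not strictly required.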

Batch Processing: Handling Large Numbers of Files

When dealing with a large collection of PDFs—say, hundreds of files—loading all of them into memory simultaneously can be impractical. A chunked processing strategy can help manage memory usage:

import com.spire.pdf.*;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class BatchPdfMerger {
    private static final int CHUNK_SIZE = 20;

    public static void main(String[] args) {
        File inputDir = new File("input-pdfs");
        File[] pdfFiles = inputDir.listFiles(
            (dir, name) -> name.toLowerCase().endsWith(".pdf")
        );

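        if (pdfFiles == null || pdfFiles.length == 0) {
            System.out.println("No PDF files found.");
            return;
        }

        // listFiles() makes no ordering guarantee; sort by path so the
        // merge order is predictable across platforms and runs
        java.util.Arrays.sort(pdfFiles);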

        List<String> tempFiles = new ArrayList<>();

        // Process in chunks
        for (int i = 0; i < pdfFiles.length; i += CHUNK_SIZE) {
            int end = Math.min(i + CHUNK_SIZE, pdfFiles.length);
            String[] chunk = new String[end - i];

            for (int j = i; j < end; j++) {
                chunk[j - i] = pdfFiles[j].getAbsolutePath();
            }

            // Merge the current chunk
            PdfDocumentBase merged = PdfDocument.mergeFiles(chunk);
            String tempPath = "temp_chunk_" + (i / CHUNK_SIZE) + ".pdf";
            merged.save(tempPath, FileFormat.PDF);
            merged.close();
            tempFiles.add(tempPath);
        }

        // Merge all temp chunks into final output
        if (tempFiles.size() == 1) {
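            File tempFile = new File(tempFiles.get(0));
            // renameTo() reports failure only through its return value
            if (!tempFile.renameTo(new File("FinalMerged.pdf"))) {
                System.err.println("Could not rename " + tempFile + " to FinalMerged.pdf");
            }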
        } else {
            PdfDocumentBase finalPdf = PdfDocument.mergeFiles(
                tempFiles.toArray(new String[0])
            );
            finalPdf.save("FinalMerged.pdf", FileFormat.PDF);
            finalPdf.close();

            // Clean up temp files
            for (String temp : tempFiles) {
                new File(temp).delete();
            }
        }
    }
}

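This strategy first merges documents in manageable chunks, writes each chunk to a temporary file, then merges those intermediate files into the final output. It adds disk I/O, but it limits how many source documents are open at once, which can lower peak memory use. The final pass still has to process every page, so the actual saving depends on how the library handles large inputs; measure with your own data before settling on a chunk size.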
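The index arithmetic in the chunking loop is the part most likely to go wrong, so it may be worth isolating. The sketch below is one way to do that with a generic helper; Chunker and partition are hypothetical names, and the code depends only on the JDK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a list into consecutive chunks of at most
// chunkSize elements, preserving the original order.
final class Chunker {
    private Chunker() {}

    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be at least 1");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, items.size());
            // Copy the view so each chunk is independent of the source list
            chunks.add(new ArrayList<>(items.subList(i, end)));
        }
        return chunks;
    }
}
```

Each chunk of paths can then be converted with toArray(new String[0]) and merged as in the batch example, which keeps the merge loop free of offset calculations.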

Important Implementation Considerations

Memory Management: PDF merging can be memory-intensive because source documents are generally loaded and parsed in memory. For large documents or high-volume batch jobs, processing in chunks and closing each source document as soon as it is no longer needed helps avoid OutOfMemoryError.

Page Indexing: Remember that page indices are zero-based. The first page is index 0. When using insertPageRange(), both start and end indices are inclusive.

Resource Cleanup: Always close PdfDocument objects and input streams when you're done with them; close() and dispose() release the resources the document holds. In production code, use try-with-resources for streams and try-finally blocks for document objects so that cleanup happens even when an exception is thrown.
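When several streams and documents must be closed together, one failing close() should not prevent the others from running. A minimal sketch of that idea follows. Resources and closeAll are hypothetical names, and the helper works with anything that can be expressed as an AutoCloseable.

```java
// Hypothetical helper: closes every resource even if some close() calls fail.
final class Resources {
    private Resources() {}

    // Attempts to close all non-null resources in order. The first failure
    // is rethrown at the end; later failures are attached as suppressed.
    static void closeAll(AutoCloseable... resources) throws Exception {
        Exception first = null;
        for (AutoCloseable resource : resources) {
            if (resource == null) {
                continue;
            }
            try {
                resource.close();
            } catch (Exception e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }
}
```

Called from a finally block, this covers the streams directly. If a document class does not implement AutoCloseable, it can still be passed as a method reference to its close() method, since AutoCloseable has a single abstract method.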

File Order: The mergeFiles() method maintains the order of the input array. The first file in the array becomes the beginning of the merged document, and the last file becomes the end.
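Since the output order follows the array order, it is worth sorting file names deliberately before merging. Plain string sorting places "report10.pdf" before "report2.pdf", which is rarely what is wanted. The comparator below is a sketch of a "natural" ordering that compares digit runs by numeric value and everything else case-insensitively. NaturalFileNameOrder is a hypothetical name, and the code uses only the JDK.

```java
import java.math.BigInteger;
import java.util.Comparator;

// Hypothetical comparator: orders names so that embedded numbers sort by
// value (report2 before report10); other characters compare case-insensitively.
final class NaturalFileNameOrder implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            char ca = a.charAt(i);
            char cb = b.charAt(j);
            if (Character.isDigit(ca) && Character.isDigit(cb)) {
                int startA = i;
                int startB = j;
                while (i < a.length() && Character.isDigit(a.charAt(i))) i++;
                while (j < b.length() && Character.isDigit(b.charAt(j))) j++;
                // BigInteger avoids overflow on very long digit runs
                int cmp = new BigInteger(a.substring(startA, i))
                    .compareTo(new BigInteger(b.substring(startB, j)));
                if (cmp != 0) return cmp;
            } else {
                int cmp = Character.compare(
                    Character.toLowerCase(ca), Character.toLowerCase(cb));
                if (cmp != 0) return cmp;
                i++;
                j++;
            }
        }
        // If one name is a prefix of the other, the shorter one sorts first
        return (a.length() - i) - (b.length() - j);
    }
}
```

Sorting the path array with Arrays.sort(files, new NaturalFileNameOrder()) before calling mergeFiles() gives a predictable page order. Note that names differing only in case or leading zeros compare as equal here, so their relative order is left as it was.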

Licensing: The library requires a valid license for production use. Without one, the output will contain an evaluation watermark. For development and testing, this is expected behavior.

Alternative Approaches

While this article focuses on Spire.PDF for Java, other libraries in the Java ecosystem offer PDF merging capabilities:

Apache PDFBox: Open-source library (Apache License 2.0) for low-level PDF manipulation. Its PDFMergerUtility class handles whole-document merges in a few lines, while page-level selection takes more manual code.
iText: Widely used library with both open-source (AGPL) and commercial licensing. Provides a PdfMerger class for combining documents.
PDFsam: Open-source tool with both GUI and command-line interfaces. The underlying library can be integrated programmatically.

Each library has different trade-offs in terms of ease of use, performance characteristics, and licensing requirements.

Wrapping Up

Merging PDF files in Java is a well-supported operation with the right library. Whether you need simple full-document concatenation, precise page-level selection, or stream-based processing for non-file sources, the API provides consistent methods to accomplish these tasks.

The examples in this guide cover the most common scenarios. As your requirements grow, you might also explore related capabilities like splitting PDFs, adding watermarks during merge, or handling encrypted source documents—features available through the same library.

DEV Community