How to Programmatically Split Word Documents in Java using Spire.Doc

#java #tooling #tutorial

Large or complex Word documents are awkward to handle as a single file. Extracting specific chapters, processing individual sections, or distributing part of a report all require splitting the document programmatically. This article shows how to do that with Spire.Doc for Java, first by page breaks and then by section breaks, with a step-by-step walkthrough of each approach.

Introducing Spire.Doc for Java and Installation

Spire.Doc for Java is a professional Java library for creating, editing, converting, and printing Word documents without requiring Microsoft Office to be installed on the system. It supports Word document formats including DOC, DOCX, RTF, and XML. Its API covers document-level operations such as splitting documents, merging files, and extracting content, which makes it a practical choice for document automation in Java.

To add Spire.Doc for Java to a Maven project, declare the e-iceblue repository and the library dependency in your pom.xml.

Maven Dependency:

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.doc</artifactId>
        <version>13.11.2</version>
    </dependency>
</dependencies>

After adding the dependency, reload the Maven project so that the library is downloaded.

Splitting Word Documents by Page Breaks

Splitting a Word document by page breaks is ideal when your document is structured such that each logical unit (e.g., a chapter, a distinct report section) begins on a new page. This method is straightforward and effective for documents where visual page separation aligns with content separation.

Process for Splitting by Page Breaks:

1. Create a Document instance and load the sample Word file using Document.loadFromFile().
2. Create a new Word document and add a section to it.
3. Loop through all body child objects in each section of the original document and check whether each object is a paragraph or a table.
   - If it is a table, add a copy of it directly to the new document's section.
   - If it is a paragraph, add a copy of it to the new section, then check all its child objects for page breaks.
4. When a page break is found, get its index and remove the break from the copied paragraph.
5. Save the new Word document, start another new document, and repeat the process for the remaining content.
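
The control flow behind these steps can be sketched without the library. The following is a simplified model, not Spire.Doc code: paragraphs are plain strings, and the hypothetical PAGE_BREAK sentinel (a form feed character) stands in for a page break object inside a paragraph. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Library-free model of page-break splitting; no Spire.Doc types are used. */
final class PageBreakSplitModel {
    // Hypothetical sentinel standing in for a page break inside a paragraph.
    static final String PAGE_BREAK = "\f";

    /** Groups paragraphs into parts, closing the current part at each page break. */
    static List<List<String>> split(List<String> paragraphs) {
        List<List<String>> parts = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String paragraph : paragraphs) {
            int at = paragraph.indexOf(PAGE_BREAK);
            if (at < 0) {
                // No break: the paragraph belongs to the current part as-is.
                current.add(paragraph);
                continue;
            }
            String rest = paragraph;
            while (at >= 0) {
                // Text before the break stays in the current part.
                String before = rest.substring(0, at);
                if (!before.isEmpty()) {
                    current.add(before);
                }
                // "Save" the current part and start a new one.
                parts.add(current);
                current = new ArrayList<>();
                // Text after the break is carried over to the next part.
                rest = rest.substring(at + PAGE_BREAK.length());
                at = rest.indexOf(PAGE_BREAK);
            }
            if (!rest.isEmpty()) {
                current.add(rest);
            }
        }
        // The last part is saved after the loop, like the final save in a real run.
        parts.add(current);
        return parts;
    }

    public static void main(String[] args) {
        List<List<String>> parts = split(Arrays.asList(
                "Intro", "End of page 1" + PAGE_BREAK + "Start of page 2", "More text"));
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("result" + i + ".docx <- " + parts.get(i));
        }
    }
}
```

In this model, text before a break stays in the current part and text after it moves to the next part. A real implementation has to make the same decision for the child objects of a paragraph that contains a break.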

Here's a Java code example demonstrating how to split a Word document into multiple files based on page breaks using Spire.Doc:

import com.spire.doc.*;
import com.spire.doc.documents.*;

public class SplitDocByPageBreak {
    public static void main(String[] args) throws Exception {
        // Create a Document instance
        Document original = new Document();

        // Load a sample Word document
        original.loadFromFile("E:\\Files\\SplitByPageBreak.docx");

        // Create a new Word document and add a section to it
        Document newWord = new Document();
        Section section = newWord.addSection();
        int index = 0;

        //Traverse through all sections of original document
        for (int s = 0; s < original.getSections().getCount(); s++) {
            Section sec = original.getSections().get(s);

            //Traverse through all body child objects of each section.
            for (int c = 0; c < sec.getBody().getChildObjects().getCount(); c++) {
                DocumentObject obj = sec.getBody().getChildObjects().get(c);
                if (obj instanceof Paragraph) {
                    Paragraph para = (Paragraph) obj;
                    sec.cloneSectionPropertiesTo(section);

                    //Add paragraph object in original section into section of new document
                    section.getBody().getChildObjects().add(para.deepClone());
                    for (int i = 0; i < para.getChildObjects().getCount(); i++) {
                        DocumentObject parobj = para.getChildObjects().get(i);
                        if (parobj instanceof Break) {
                            Break break1 = (Break) parobj;
                            if (break1.getBreakType().equals(BreakType.Page_Break)) {

                                //Get the index of page break in paragraph
                                int indexId = para.getChildObjects().indexOf(parobj);

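                                //Remove the page break and anything after it from the copied paragraph,
                                //so content that follows the break is not duplicated in this part
                                Paragraph newPara = (Paragraph) section.getBody().getLastParagraph();
                                while (newPara.getChildObjects().getCount() > indexId) {
                                    newPara.getChildObjects().removeAt(indexId);
                                }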

                                //Save the new Word document
                                newWord.saveToFile("output/result"+index+".docx", FileFormat.Docx);
                                index++;

                                //Create a new document and add a section
                                newWord = new Document();
                                section = newWord.addSection();

                                //Add paragraph object in original section into section of new document
                                section.getBody().getChildObjects().add(para.deepClone());
                                if (section.getParagraphs().get(0).getChildObjects().getCount() == 0) {

                                    //Remove the first blank paragraph
                                    section.getBody().getChildObjects().removeAt(0);
                                } else {

                                    //Remove the child objects before the page break
                                    while (indexId >= 0) {
                                        section.getParagraphs().get(0).getChildObjects().removeAt(indexId);
                                        indexId--;
                                    }
                                }
                            }
                        }
                    }
                }
                if (obj instanceof Table) {
                    //Add table object in original section into section of new document
                    section.getBody().getChildObjects().add(obj.deepClone());
                }
            }
        }

        //Save to file
        newWord.saveToFile("output/result"+index+".docx", FileFormat.Docx);
    }
}
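
The listing above writes its parts to a relative output/ folder. The article does not say whether saveToFile() creates missing parent folders, so a cautious sketch is to create the folder first with the standard library and build each file name from it. The OutputPaths class and its partPath method are hypothetical helpers, not part of Spire.Doc.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that prepares output locations for the split parts. */
final class OutputPaths {

    /** Creates the folder if it does not exist and returns folder/result<index>.docx. */
    static Path partPath(String folder, int index) throws IOException {
        // createDirectories is a no-op when the directory already exists.
        Path dir = Files.createDirectories(Paths.get(folder));
        return dir.resolve("result" + index + ".docx");
    }

    public static void main(String[] args) throws IOException {
        // Prints the path of the first part; the separator depends on the platform.
        System.out.println(partPath("output", 0));
    }
}
```

With this in place, the save calls could pass partPath("output", index).toString() as the file name instead of concatenating the string by hand.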

Splitting Word Documents by Section Breaks

Splitting by section breaks suits documents with complex layouts, varying headers and footers, or different page orientations within the same file. A section break marks a logical division in a Word document and lets each section carry its own formatting properties. Use this method when the logical divisions of the document do not simply follow page boundaries.

Process for Splitting by Section Breaks:

1. Create a Document instance and load the sample Word file using Document.loadFromFile().
2. Loop through all sections in the original document.
3. For each section, create a new Word document.
4. Clone the section using Section.deepClone() and add the clone to the new document with Document.getSections().add().
5. Save each new document using Document.saveToFile().

Here's a code example demonstrating how to split a Word document based on its section breaks:

import com.spire.doc.*;

public class SplitDocBySectionBreak {
    public static void main(String[] args) throws Exception {
        //Create Document instance
        Document document = new Document();

        //Load a sample Word document
        document.loadFromFile("E:\\Files\\SplitBySectionBreak.docx");

        //Define a new Word document object
        Document newWord;

        //Traverse through all sections of the original Word document
        for (int i = 0; i < document.getSections().getCount(); i++){
            newWord = new Document();

            //Clone each section of the original document and add it to the new document as new section
            newWord.getSections().add(document.getSections().get(i).deepClone());

            //Save the result document 
            newWord.saveToFile("Result/result"+i+".docx");
        }
    }
}

Each output Word document contains one section of the original document. This is particularly useful for documents such as legal contracts, academic papers, or reports, where different sections may have their own page numbering, headers, or footers.

Conclusion

Spire.Doc for Java offers two ways to split a Word document: by page breaks, for simple sequential splitting, and by section breaks, for divisions that follow the document's own structure and formatting. Both approaches come down to copying body objects or whole sections into new documents and saving each one, so they are straightforward to integrate into a Java application that needs to process or distribute parts of a larger file.