Extracting Text and Images from Word Documents in Java

Word documents are ubiquitous in professional and personal contexts and often contain critical data. Accessing that content programmatically, however, is a common challenge for developers. This tutorial shows how to extract both text and images from Word documents in Java using the Spire.Doc for Java library, with practical examples you can adapt to your own projects.

Introduction to Spire.Doc for Java and Setup

Spire.Doc for Java is a comprehensive API designed for creating, writing, editing, converting, and printing Word documents within Java applications. It supports a wide range of Word features, making it an excellent choice for complex document manipulation tasks. To begin, you'll need to add the library to your project.

For Maven users, add the following repository and dependency to your pom.xml file:

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.doc</artifactId>
        <version>13.12.2</version>
    </dependency>
</dependencies>

If you are not using a build tool, download the JAR file from the official E-iceblue website and add it to your project's build path.

Extracting Text from Word Documents

Extracting text is often the first step in processing Word documents. Spire.Doc for Java makes this straightforward, allowing you to retrieve the entire textual content or specific parts. The library handles various Word document formats, including .doc and .docx.

Here’s a code example demonstrating how to extract all text from a Word document:

import com.spire.doc.Document;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractText {

    public static void main(String[] args) throws IOException {

        //Create a Document object and load a Word document
        Document document = new Document();
        document.loadFromFile("sample1.docx");

        //Get text from document as string
        String text = document.getText();

        //Write string to a .txt file
        writeStringToTxt(text, "ExtractedText.txt");
    }

    //Write the content to a text file; the second FileWriter argument (true) appends if the file already exists
    public static void writeStringToTxt(String content, String txtFileName) throws IOException {
        //try-with-resources flushes and closes the writer even if write() fails
        try (FileWriter fWriter = new FileWriter(txtFileName, true)) {
            fWriter.write(content);
        }
    }
}

Explanation:

Document document = new Document();: Initializes a new Document object.
document.loadFromFile("sample1.docx");: Loads your Word document. Replace "sample1.docx" with the actual path to your file.
document.getText(): Retrieves the plain-text content of the entire document as a single string.
writeStringToTxt: Our helper method, which writes the extracted string to the .txt file named in its second argument. Because the FileWriter is opened in append mode, running the program again adds to the existing file rather than overwriting it.
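
FileWriter encodes text with the platform default charset, which on Java versions before 18 is not necessarily UTF-8, so non-ASCII characters in the extracted text can end up garbled on some systems. As a minimal sketch of one alternative, the helper below writes the string explicitly as UTF-8 using java.nio. The class name Utf8TextWriter and the sample string are assumptions made for illustration; they are not part of Spire.Doc, and the string stands in for the value returned by document.getText().

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class for illustration; not part of Spire.Doc
class Utf8TextWriter {

    // Writes content to the given path as UTF-8, replacing any existing file
    static void write(String content, Path target) throws IOException {
        Files.write(target, content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the string returned by document.getText()
        String text = "R\u00e9sum\u00e9 for the caf\u00e9";
        write(text, Paths.get("ExtractedText.txt"));
    }
}
```

Unlike the append-mode FileWriter in the main example, Files.write with its default options truncates an existing file, so repeated runs do not accumulate duplicate text.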

Extracting Images from Word Documents

Images embedded within Word documents often hold crucial visual information. Spire.Doc for Java allows you to systematically extract these images and save them to a specified location. It can handle various image formats present in Word documents.

Here's how to extract images using Spire.Doc for Java:

import com.spire.doc.*;
import com.spire.doc.documents.*;
import com.spire.doc.fields.*;
import com.spire.doc.interfaces.*;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.*;
import java.util.*;

public class ExtractImage {
    public static void main(String[] args) throws IOException {

        //Create a Document object and load a Word document
        Document document = new Document();
        document.loadFromFile("sample2.docx");

        //Create a queue and add the root document element to it
        Queue<ICompositeObject> nodes = new LinkedList<>();
        nodes.add(document);

        //Create an ArrayList to store the extracted images
        List<BufferedImage> images = new ArrayList<>();

        //Traverse the document tree breadth-first
        while (!nodes.isEmpty()) {
            ICompositeObject node = nodes.poll();
            for (int i = 0; i < node.getChildObjects().getCount(); i++)
            {
                IDocumentObject child = node.getChildObjects().get(i);
                //Test for pictures first so they are collected even if the picture type is also a composite object
                if (child.getDocumentObjectType() == DocumentObjectType.Picture)
                {
                    DocPicture picture = (DocPicture) child;
                    images.add(picture.getImage());
                }
                else if (child instanceof ICompositeObject)
                {
                    nodes.add((ICompositeObject) child);
                }
            }
        }

        //Save images to the output folder, creating it first if it does not exist
        File outputDir = new File("output");
        outputDir.mkdirs();
        for (int i = 0; i < images.size(); i++) {
            File file = new File(outputDir, String.format("extractImage-%d.png", i));
            ImageIO.write(images.get(i), "PNG", file);
        }
    }
}

Explanation:

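Queue<ICompositeObject> nodes: Holds the composite nodes still to be visited, starting with the document itself. Working through the queue walks the whole document tree breadth-first.
node.getChildObjects(): Returns the child objects of the current node, which we examine one by one.
child instanceof ICompositeObject: A child that can contain other objects is added to the queue so that its own children are visited later.
DocumentObjectType.Picture: This enum value identifies an image object.
DocPicture picture = (DocPicture) child;: Casts the generic IDocumentObject to a DocPicture to access image-specific properties.
picture.getImage(): Retrieves the image data as a BufferedImage.
ImageIO.write(images.get(i), "PNG", file);: Uses Java's ImageIO class to save each BufferedImage to a file. You can change "PNG" to other formats such as "JPG" if needed, although images with transparency may need converting to an opaque RGB image first.
Error Handling: The example does not catch exceptions. main declares throws IOException, so a failure while saving an image stops the program. Wrap the ImageIO.write call in a try-catch block if you would rather log the error and continue with the remaining images.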
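
The saving step can be made more forgiving by creating the output folder up front and handling failures per image. The following is a minimal sketch under stated assumptions: the class ImageSaver and its saveAll method are hypothetical names introduced here, not part of Spire.Doc, and the BufferedImage created in main stands in for the images collected from the document.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; not part of Spire.Doc
class ImageSaver {

    // Saves each image as PNG into outputDir and returns the files that were written
    static List<File> saveAll(List<BufferedImage> images, File outputDir) {
        List<File> written = new ArrayList<>();
        // Create the folder (and any missing parents) before writing
        if (!outputDir.isDirectory() && !outputDir.mkdirs()) {
            System.err.println("Could not create folder: " + outputDir);
            return written;
        }
        for (int i = 0; i < images.size(); i++) {
            File file = new File(outputDir, String.format("extractImage-%d.png", i));
            try {
                // ImageIO.write returns false when no suitable writer is found
                if (ImageIO.write(images.get(i), "PNG", file)) {
                    written.add(file);
                } else {
                    System.err.println("No PNG writer available for image " + i);
                }
            } catch (IOException ex) {
                // Report the failure and continue with the remaining images
                System.err.println("Failed to save " + file + ": " + ex.getMessage());
            }
        }
        return written;
    }

    public static void main(String[] args) {
        // Stand-in for the images collected from the document
        List<BufferedImage> images = new ArrayList<>();
        images.add(new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB));
        System.out.println(saveAll(images, new File("output")).size() + " image(s) saved");
    }
}
```

Returning the list of written files lets the caller see how many images were actually saved, rather than assuming every write succeeded.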

This approach provides a robust way to extract images, allowing you to process them further as required by your application.
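
One common follow-up step is saving the extracted images as JPEG instead of PNG. Java's built-in JPEG writer does not cope well with images that carry an alpha channel, and depending on the JDK version the write can fail or produce distorted colors. The sketch below, which assumes a white background is acceptable, paints the image onto an opaque RGB canvas before writing. JpegConverter and toOpaqueRgb are hypothetical names for illustration, not part of Spire.Doc.

```java
import javax.imageio.ImageIO;
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

// Hypothetical helper class for illustration; not part of Spire.Doc
class JpegConverter {

    // Returns an opaque RGB copy of the image, painted over a white background
    static BufferedImage toOpaqueRgb(BufferedImage source) {
        BufferedImage rgb = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, rgb.getWidth(), rgb.getHeight());
            g.drawImage(source, 0, 0, null);
        } finally {
            // Release the graphics context once drawing is done
            g.dispose();
        }
        return rgb;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an extracted image that carries an alpha channel
        BufferedImage extracted = new BufferedImage(8, 8, BufferedImage.TYPE_INT_ARGB);
        boolean saved = ImageIO.write(toOpaqueRgb(extracted), "jpg", new File("extractImage-0.jpg"));
        System.out.println("JPEG written: " + saved);
    }
}
```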

Conclusion

This tutorial has demonstrated how to effectively extract both text and images from Word documents using Java, leveraging the capabilities of the Spire.Doc for Java library. By following these practical steps, developers can programmatically access and process the content within Word files, opening up possibilities for data extraction, content analysis, and automated document workflows. We encourage you to explore the extensive documentation of Spire.Doc for Java to uncover its full potential for your document processing needs.