lu liu

Posted on Mar 10

Converting Word Documents to PDF Using Python

In document management and distribution workflows, converting Word documents to PDF is a routine operation. PDF offers consistent rendering across platforms, resistance to casual editing, and broad compatibility, which makes it a common choice for document archiving, report distribution, and formal file exchange. This article shows how to convert Word documents to PDF with Python and how to control the main conversion parameters.

Why Convert Word to PDF

While Word documents are convenient for editing, they have several limitations when it comes to distribution and presentation:

Format Consistency: PDFs keep the same layout across devices and operating systems
Tamper Resistance: PDFs are harder to modify, accidentally or intentionally, which suits formal documents
Universal Compatibility: Nearly all devices can view PDFs without requiring Microsoft Office
File Optimization: PDFs can compress images and embed only the font subsets a document uses, which helps limit file size
Security Features: Support for password protection, digital signatures, and other security measures

Automating this conversion process with Python enables batch processing, scheduled conversions, and integration into larger document management workflows.

Environment Setup

Before starting, install a Python library that can read Word documents and export them to PDF. Spire.Doc for Python provides APIs for handling Word documents (DOCX, DOC, and templates), including PDF conversion.

pip install Spire.Doc

Once installed, import the relevant modules in your Python script:

from spire.doc import *
from spire.doc.common import *
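
Before running the examples, it can help to confirm that the package is importable from the interpreter you are actually using. The sketch below relies only on the standard library; the function name spire_doc_available is made up for this illustration.

```python
import importlib.util


def spire_doc_available() -> bool:
    """Return True if the spire.doc package can be found in this environment."""
    try:
        # find_spec imports the parent package ("spire") first and raises
        # ModuleNotFoundError when that parent is not installed.
        return importlib.util.find_spec("spire.doc") is not None
    except ModuleNotFoundError:
        return False


if spire_doc_available():
    print("spire.doc is importable")
else:
    print("spire.doc not found; try: pip install Spire.Doc")
```

Running this with the same interpreter as your conversion script helps catch the common case where pip installed the package into a different environment.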

Basic Conversion Process

The core steps for converting a Word document to PDF are straightforward: load the document, call the save method, and close the document. Here's a minimal working example:

from spire.doc import *
from spire.doc.common import *

# Define input and output paths
inputFile = "document.docx"
outputFile = "output.pdf"

# Create a Word document object
document = Document()

# Load the Word file
document.LoadFromFile(inputFile)

# Save as PDF format
document.SaveToFile(outputFile, FileFormat.PDF)

# Close the document to release resources
document.Close()

This code demonstrates the most basic conversion flow. The Document object handles loading and managing the Word document, while the second parameter FileFormat.PDF in the SaveToFile() method specifies PDF as the output format. This approach works well for quick conversions using default parameters.
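
The minimal example leaves the document open if loading or saving raises an exception. One possible way to harden it is to check the input path first and close the document in a finally block. This is a sketch under the assumption that Document and FileFormat can be imported by name from spire.doc (the star import above suggests so); the helper name convert_to_pdf is hypothetical.

```python
import os


def convert_to_pdf(input_file: str, output_file: str) -> str:
    """Convert one Word file to PDF, closing the document even on failure."""
    # Fail early with a clear message instead of a library-level error.
    if not os.path.isfile(input_file):
        raise FileNotFoundError("Input file not found: {0}".format(input_file))

    # Assumption: these names are importable directly from spire.doc.
    # Imported here so the path check above works without the library.
    from spire.doc import Document, FileFormat

    document = Document()
    try:
        document.LoadFromFile(input_file)
        document.SaveToFile(output_file, FileFormat.PDF)
    finally:
        # Runs whether or not loading or saving raised.
        document.Close()
    return output_file
```

A call such as convert_to_pdf("document.docx", "output.pdf") then behaves like the example above, but a missing input file is reported before any conversion work starts.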

Using Conversion Parameter Objects

For scenarios requiring more control, you can use the ToPdfParameterList object to configure conversion options:

from spire.doc import *
from spire.doc.common import *

inputFile = "report.docx"
outputFile = "report_with_bookmarks.pdf"

document = Document()
document.LoadFromFile(inputFile)

# Create PDF conversion parameters object
params = ToPdfParameterList()

# Set whether to create Word bookmarks
params.CreateWordBookmarks = True

# Save as PDF with custom parameters
document.SaveToFile(outputFile, params)
document.Close()

The ToPdfParameterList object encapsulates all available PDF conversion options. By configuring this object, you can precisely control the conversion behavior and output characteristics.

Creating PDF Bookmarks

Bookmarks are essential navigation elements in PDF documents, helping readers quickly locate specific sections. When generating a PDF from Word, you can create bookmarks either from the bookmarks already defined in the Word document or from its heading styles:

from spire.doc import *
from spire.doc.common import *

inputFile = "manual.docx"
outputFile = "manual_with_bookmarks.pdf"

document = Document()
document.LoadFromFile(inputFile)

params = ToPdfParameterList()

# Enable bookmark creation
params.CreateWordBookmarks = True

# Configure bookmark creation mode
# False: create based on Word bookmarks
# True: create based on heading styles
params.CreateWordBookmarksUsingHeadings = False

document.SaveToFile(outputFile, params)
document.Close()

There are two modes for bookmark creation:

Based on Word Bookmarks (CreateWordBookmarksUsingHeadings = False): Generates PDF bookmarks from bookmarks already defined in the Word document
Based on Heading Styles (CreateWordBookmarksUsingHeadings = True): Automatically recognizes heading styles (Heading 1, Heading 2, etc.) in Word to generate a bookmark hierarchy

The appropriate mode depends on how your Word document is organized. For structured technical documents, using heading styles typically produces a more complete bookmark structure.

Embedding Fonts for Consistency

Font embedding is crucial for ensuring PDFs display consistently across different systems. If a PDF viewer lacks the fonts used in the document, it may substitute them, causing layout changes:

from spire.doc import *
from spire.doc.common import *

inputFile = "formatted_document.docx"
outputFile = "embedded_fonts.pdf"

document = Document()
document.LoadFromFile(inputFile)

params = ToPdfParameterList()

# Embed all fonts used in the document
params.IsEmbeddedAllFonts = True

document.SaveToFile(outputFile, params)
document.Close()

The IsEmbeddedAllFonts parameter controls font embedding behavior:

Set to True: Embeds complete glyph sets for all fonts used in the document, ensuring correct display on any device
Set to False: Embeds only font subsets or none at all, resulting in smaller files but potentially relying on system fonts

For documents containing special fonts, decorative typography, or requiring print-quality output, enabling complete font embedding is recommended.
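
To see what full embedding costs for a particular document, one option is to convert it twice, once with IsEmbeddedAllFonts set to True and once with False, and compare the resulting file sizes. The helper below is a small standard-library sketch; size_ratio is an illustrative name, not part of Spire.Doc, and the file names in the comment are placeholders.

```python
import os


def size_ratio(embedded_pdf: str, plain_pdf: str) -> float:
    """Return how many times larger the fully embedded PDF is than the other."""
    embedded_size = os.path.getsize(embedded_pdf)
    plain_size = os.path.getsize(plain_pdf)
    if plain_size == 0:
        raise ValueError("Comparison file is empty: {0}".format(plain_pdf))
    return embedded_size / plain_size


# Example call with placeholder file names:
# ratio = size_ratio("embedded_fonts.pdf", "no_embedded_fonts.pdf")
# print("Full embedding is {0:.1f}x larger".format(ratio))
```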

Combining Multiple Conversion Options

In practical applications, you typically need to configure multiple options simultaneously for optimal results:

from spire.doc import *
from spire.doc.common import *

inputFile = "corporate_report.docx"
outputFile = "final_report.pdf"

document = Document()
document.LoadFromFile(inputFile)

params = ToPdfParameterList()

# Create bookmarks for navigation
params.CreateWordBookmarks = True
params.CreateWordBookmarksUsingHeadings = True  # Based on heading styles

# Embed all fonts for consistency
params.IsEmbeddedAllFonts = True

# Save as high-quality PDF
document.SaveToFile(outputFile, params)
document.Close()

This configuration suits formal business documents, technical manuals, and academic papers, providing both visual consistency and easy navigation.
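
If the same combination of options is reused across scripts, it may be convenient to keep it in one small object. The sketch below is one possible design, not part of Spire.Doc: the PdfOptions class and its field names are hypothetical, and only the three property names it sets (CreateWordBookmarks, CreateWordBookmarksUsingHeadings, IsEmbeddedAllFonts) come from the examples in this article. A SimpleNamespace stands in for ToPdfParameterList so the snippet runs without the library.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class PdfOptions:
    """Hypothetical bundle of the conversion options used in this article."""
    bookmarks: str = "headings"  # "headings", "word", or "none"
    embed_all_fonts: bool = True

    def apply(self, params):
        """Copy the options onto a parameter object and return it."""
        if self.bookmarks not in ("headings", "word", "none"):
            raise ValueError("Unknown bookmark mode: {0}".format(self.bookmarks))
        params.CreateWordBookmarks = self.bookmarks != "none"
        params.CreateWordBookmarksUsingHeadings = self.bookmarks == "headings"
        params.IsEmbeddedAllFonts = self.embed_all_fonts
        return params


# Stand-in object for demonstration; in real use pass ToPdfParameterList().
demo_params = PdfOptions(bookmarks="word", embed_all_fonts=False).apply(SimpleNamespace())
print(demo_params)
```

In a real script, params = PdfOptions().apply(ToPdfParameterList()) would produce the same settings as the example above.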

Batch Converting Multiple Documents

When processing large numbers of Word documents, a batch conversion script significantly improves efficiency:

import os
from spire.doc import *
from spire.doc.common import *

def batch_convert_word_to_pdf(input_folder, output_folder, embed_fonts=True):
    """Batch convert all Word documents in a folder to PDF"""

    # Ensure output directory exists
    os.makedirs(output_folder, exist_ok=True)

    # Supported Word formats (a tuple, so it can be passed to str.endswith)
    word_extensions = ('.docx', '.doc', '.dot', '.dotx')

    # Iterate through all Word files
    for filename in os.listdir(input_folder):
        if filename.lower().endswith(word_extensions):
            input_path = os.path.join(input_folder, filename)
            base_name = os.path.splitext(filename)[0]
            output_path = os.path.join(output_folder, base_name + '.pdf')

            # Convert current document
            document = Document()
            try:
                document.LoadFromFile(input_path)

                params = ToPdfParameterList()
                params.IsEmbeddedAllFonts = embed_fonts

                document.SaveToFile(output_path, params)
                print("Converted: {0} -> {1}".format(filename, base_name + '.pdf'))
            except Exception as e:
                # Report the failure and continue with the next file
                print("Failed to convert {0}: {1}".format(filename, e))
            finally:
                # Release resources even if the conversion failed
                document.Close()

# Usage example
batch_convert_word_to_pdf("input_docs", "output_pdfs", embed_fonts=True)

This batch conversion function provides:

Automatic output directory creation
Support for multiple Word formats (DOCX, DOC, DOT, etc.)
Configurable font embedding option
Progress reporting
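
A folder that is in active use often contains entries the extension check alone will not filter out: Word keeps a lock file whose name starts with ~$ next to each open document, and a directory can happen to have a name ending in .docx. The sketch below shows one way to collect only real Word files using pathlib; find_word_files is a hypothetical helper name.

```python
from pathlib import Path

WORD_EXTENSIONS = {".docx", ".doc", ".dot", ".dotx"}


def find_word_files(folder):
    """Return the Word files directly inside folder, sorted by path."""
    results = []
    for path in Path(folder).iterdir():
        if not path.is_file():
            continue  # skip subdirectories, whatever their name
        if path.name.startswith("~$"):
            continue  # skip lock files left by an open Word session
        if path.suffix.lower() in WORD_EXTENSIONS:
            results.append(path)
    return sorted(results)
```

The returned paths can be passed to LoadFromFile() after converting them with str(path).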

Converting Different Word Document Versions

Spire.Doc supports converting various Word document formats:

from spire.doc import *
from spire.doc.common import *

# Convert DOCX (Word 2007+)
document = Document()
document.LoadFromFile("document.docx")
document.SaveToFile("document.pdf", FileFormat.PDF)
document.Close()

# Convert DOC (Word 97-2003)
document = Document()
document.LoadFromFile("legacy_document.doc")
document.SaveToFile("legacy_document.pdf", FileFormat.PDF)
document.Close()

# Convert DOTX template
document = Document()
document.LoadFromFile("template.dotx")
document.SaveToFile("template.pdf", FileFormat.PDF)
document.Close()

The same SaveToFile() call handles each input format, so the conversion code does not change with the file type.
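
When files in different formats share a base name, for example report.doc and report.docx, naming each PDF after the base name alone makes the second conversion overwrite the first. A possible way to avoid this, sketched below with the hypothetical helper plan_output_names, is to keep the original extension in the PDF name only where a clash would occur.

```python
from pathlib import Path


def plan_output_names(input_files):
    """Map each input path to a PDF file name that is unique within the batch."""
    # Group inputs by case-insensitive base name to detect clashes.
    by_stem = {}
    for file_path in input_files:
        by_stem.setdefault(Path(file_path).stem.lower(), []).append(file_path)

    plan = {}
    for file_path in input_files:
        path = Path(file_path)
        if len(by_stem[path.stem.lower()]) > 1:
            # Clash: report.doc becomes report_doc.pdf
            extension = path.suffix.lstrip(".").lower()
            plan[str(file_path)] = "{0}_{1}.pdf".format(path.stem, extension)
        else:
            plan[str(file_path)] = path.stem + ".pdf"
    return plan


print(plan_output_names(["report.doc", "report.docx", "notes.docx"]))
```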

Practical Example: Document Archiving System

Combining these techniques, you can build a simple document archiving and conversion system:

import os
from datetime import datetime
from spire.doc import *
from spire.doc.common import *

class DocumentArchiver:
    def __init__(self, archive_root="archive"):
        self.archive_root = archive_root
        if not os.path.exists(archive_root):
            os.makedirs(archive_root)

    def archive_document(self, word_file, category="general"):
        """Archive a Word document as PDF"""

        # Create category directory
        category_dir = os.path.join(self.archive_root, category)
        if not os.path.exists(category_dir):
            os.makedirs(category_dir)

        # Generate timestamped filename
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        base_name = os.path.splitext(os.path.basename(word_file))[0]
        pdf_filename = "{0}_{1}.pdf".format(base_name, timestamp)
        pdf_path = os.path.join(category_dir, pdf_filename)

        # Perform conversion
        document = Document()
        document.LoadFromFile(word_file)

        params = ToPdfParameterList()
        params.CreateWordBookmarks = True
        params.CreateWordBookmarksUsingHeadings = True
        params.IsEmbeddedAllFonts = True

        document.SaveToFile(pdf_path, params)
        document.Close()

        return pdf_path

    def batch_archive(self, file_list, category):
        """Batch archive documents"""
        archived_files = []
        for file_path in file_list:
            try:
                pdf_path = self.archive_document(file_path, category)
                archived_files.append(pdf_path)
                print("Archived: {0}".format(pdf_path))
            except Exception as e:
                print("Failed to archive {0}: {1}".format(file_path, str(e)))
        return archived_files

# Usage example
archiver = DocumentArchiver("document_archive")
archived_pdf = archiver.archive_document("quarterly_report.docx", category="reports")
print("Archived to: {0}".format(archived_pdf))

This archiving system provides:

Category-based organization
Timestamped filenames (to the second) that make filename conflicts unlikely
Batch archiving support
Per-file error handling with console messages in batch_archive()
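
Because the timestamp has one-second resolution, two archives of the same document within the same second would still map to the same path. One simple safeguard, sketched here with the hypothetical helper unique_path, is to append a counter when the target already exists. The check is not atomic, so it is only suitable when a single process writes to the archive.

```python
import os


def unique_path(path):
    """Return path unchanged if free, else add _1, _2, ... before the extension."""
    if not os.path.exists(path):
        return path
    root, ext = os.path.splitext(path)
    counter = 1
    while os.path.exists("{0}_{1}{2}".format(root, counter, ext)):
        counter += 1
    return "{0}_{1}{2}".format(root, counter, ext)
```

In archive_document(), wrapping the computed path as pdf_path = unique_path(pdf_path) before saving would apply this check.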

Common Issues and Solutions

Issue 1: Chinese Characters Display Incorrectly After Conversion

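Ensure font embedding is enabled, and check that the fonts the document uses are installed on the machine running the conversion, since a font that is not available there cannot be embedded: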

params.IsEmbeddedAllFonts = True

Issue 2: PDF File Size Too Large

If complete font embedding isn't necessary, disable it:

params.IsEmbeddedAllFonts = False

Alternatively, preprocess and compress images before conversion.

Issue 3: Bookmark Hierarchy Incorrect

Verify that heading styles are correctly applied in the Word document and ensure you're using the appropriate bookmark creation mode:

params.CreateWordBookmarksUsingHeadings = True  # Based on heading styles

Summary

Converting Word documents to PDF is a core skill in document automation workflows. Through this article, we've learned:

Loading and converting Word documents using the Document object
Configuring conversion parameters via ToPdfParameterList
Creating PDF bookmarks to enhance document navigability
Embedding fonts to ensure cross-platform display consistency
Building batch conversion and document archiving systems

These techniques apply directly to enterprise document management, automated report generation, digital archive systems, and other practical scenarios. After mastering the basic conversion methods, you can explore advanced features such as PDF encryption, digital signatures, and form creation to build more comprehensive document processing workflows.

DEV Community