In daily development, you may often encounter scenarios where you need to save HTML content from a webpage as a Word document for further editing, or export it directly as a PDF for sharing and archiving. Manual copy-pasting can easily lead to formatting loss and compatibility issues, making programmatic conversion from HTML to Word and PDF a practical solution.
In Python, this is usually done with a third-party library that parses the HTML and handles document layout, so a conversion takes only a few API calls instead of hand-written formatting code.
Why Convert HTML to Word and PDF
- Content Editing: Word files are more suitable for further formatting, proofreading, and revision.
- Archiving and Sharing: PDF files preserve layout consistency across different devices, making them ideal for storage and distribution.
- Automation: Python scripts can batch process a large number of HTML files, significantly improving efficiency.
Using Third-Party Libraries to Convert HTML to Word and PDF
Converting HTML to Word or PDF in Python requires a library that handles document formatting and layout. Spire.Doc for Python is one such option: it reads and converts HTML, Word, and PDF files through a small API, which makes it a reasonable choice for adding document processing to a project quickly.
Installing Spire.Doc for Python
- Install via pip (recommended):
pip install spire.doc
- Install a specific version (if needed):
pip install spire.doc==13.8.0
- Verify installation:
python -c "from spire.doc import Document; print('Spire.Doc import OK')"
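The same check can be done from inside a script, which is handy when a conversion job should fail early with a clear message. The following is a minimal sketch using only the standard library; the helper name is_importable is made up for this example and is not part of Spire.Doc.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located on the current Python path."""
    try:
        # For dotted names, find_spec imports the parent package first.
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (for example `spire`) is missing.
        return False


if __name__ == "__main__":
    if is_importable("spire.doc"):
        print("Spire.Doc is available")
    else:
        print("Spire.Doc is not installed; run: pip install spire.doc")
```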
Converting HTML to Word with Python
The core idea of converting HTML to Word is to load the HTML file (or string) and then save it as a Word document. Both .doc and .docx formats are supported, so developers can choose based on project requirements.
1. Convert HTML File to Word
from spire.doc import *
# Create a Document object
document = Document()
# Load content from HTML file
document.LoadFromFile("input.html", FileFormat.Html, XHTMLValidationType.None_)
# Save as Word 2003 format (.doc)
document.SaveToFile("output.doc", FileFormat.Doc)
# Save as Word 2013 format (.docx)
document.SaveToFile("output.docx", FileFormat.Docx2013)
# Close the document
document.Close()
Code Explanation:
- Document(): Initializes a new document object.
- LoadFromFile("input.html", FileFormat.Html, XHTMLValidationType.None_): Loads and parses a local HTML file.
- SaveToFile("output.doc", FileFormat.Doc): Saves as Word 2003 format.
- SaveToFile("output.docx", FileFormat.Docx2013): Saves as Word 2013 format, suitable for modern Office versions.
- Close(): Releases resources and properly closes the document.
This approach works well for directly converting existing HTML files.
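If the target format should follow the output file name, the steps above can be wrapped in a small helper. This is only a sketch: FORMAT_BY_EXTENSION, format_name_for, and convert_html_file are hypothetical names introduced here, and the explicit import assumes spire.doc exposes Document, FileFormat, and XHTMLValidationType at the top level, as the wildcard import in the example implies.

```python
from pathlib import Path

# Hypothetical lookup table: output extension -> FileFormat member name
# (only the three members used in this article).
FORMAT_BY_EXTENSION = {
    ".doc": "Doc",
    ".docx": "Docx2013",
    ".pdf": "PDF",
}


def format_name_for(output_path: str) -> str:
    """Return the FileFormat member name matching the output file extension."""
    suffix = Path(output_path).suffix.lower()
    try:
        return FORMAT_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported output extension: {suffix!r}") from None


def convert_html_file(input_path: str, output_path: str) -> None:
    """Convert an HTML file to .doc, .docx, or .pdf based on the output name."""
    # Imported lazily so the lookup above works without Spire.Doc installed.
    from spire.doc import Document, FileFormat, XHTMLValidationType

    target_format = getattr(FileFormat, format_name_for(output_path))
    document = Document()
    try:
        document.LoadFromFile(input_path, FileFormat.Html, XHTMLValidationType.None_)
        document.SaveToFile(output_path, target_format)
    finally:
        # Release resources even if loading or saving fails.
        document.Close()
```

With this in place, convert_html_file("input.html", "output.docx") would produce the same .docx output as the example above.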
2. Convert HTML String to Word
from spire.doc import *
# Create a Document object
document = Document()
# Add a section and paragraph
section = document.AddSection()
paragraph = section.AddParagraph()
# Define an HTML string
html_string = """
<h1>Python HTML to Word Example</h1>
<p>This is a <strong>bold</strong> text and a <a href='https://example.com'>hyperlink</a>.</p>
"""
# Insert HTML into the paragraph
paragraph.AppendHTML(html_string)
# Save as Word 2003 format (.doc)
document.SaveToFile("string_output.doc", FileFormat.Doc)
# Save as Word 2013 format (.docx)
document.SaveToFile("string_output.docx", FileFormat.Docx2013)
document.Close()
Code Explanation:
- AddSection(): Adds a new section to the document.
- AddParagraph(): Creates a paragraph in the section, which serves as the insertion point for HTML.
- AppendHTML(html_string): Inserts an HTML string into the paragraph, automatically rendering headings, bold text, and links.
- SaveToFile(..., FileFormat.Doc): Saves as .doc.
- SaveToFile(..., FileFormat.Docx2013): Saves as .docx.
This method is more flexible for dynamically generated HTML content.
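When the HTML is assembled from data at runtime, text values should be escaped before they are placed inside tags, otherwise characters such as < or & can break the markup. Below is a minimal sketch using the standard library's html.escape; the function build_report_html and the sample values are invented for illustration.

```python
from html import escape


def build_report_html(title, rows):
    """Build a small HTML fragment from plain-text values.

    `rows` is an iterable of (label, value) string pairs.
    """
    parts = [f"<h1>{escape(title)}</h1>", "<ul>"]
    for label, value in rows:
        # Escape every value so it is treated as text, not markup.
        parts.append(f"<li><strong>{escape(label)}:</strong> {escape(value)}</li>")
    parts.append("</ul>")
    return "\n".join(parts)


html_string = build_report_html(
    "Q3 <Summary>",
    [("Owner", "R&D"), ("Status", "On track")],
)
print(html_string)
```

The resulting html_string can then be passed to paragraph.AppendHTML(html_string) exactly as in the example above.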
Converting HTML to PDF with Python
In some scenarios, converting HTML directly to PDF is more appropriate, such as generating reports, contracts, or webpage snapshots. The process is similar to saving Word files, just with a different output format.
1. Convert HTML File to PDF
from spire.doc import *
# Create a Document object
document = Document()
# Load content from HTML file
document.LoadFromFile("input.html", FileFormat.Html, XHTMLValidationType.None_)
# Save as PDF
document.SaveToFile("output.pdf", FileFormat.PDF)
# Close the document
document.Close()
Code Explanation:
- Load HTML with LoadFromFile.
- Use SaveToFile("output.pdf", FileFormat.PDF) to export to PDF.
- The PDF preserves the original HTML layout and hyperlinks.
This is ideal for archiving existing HTML pages as PDFs.
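Archiving usually involves more than one page, so the single-file conversion can be driven by a loop over a folder. The sketch below is one possible arrangement, not a Spire.Doc feature: convert_directory and spire_html_to_pdf are hypothetical helpers, and the converter is passed in as a callback so the loop itself does not depend on any particular library.

```python
from pathlib import Path
from typing import Callable, List


def convert_directory(
    source_dir: str,
    output_dir: str,
    convert: Callable[[str, str], None],
    extension: str = ".pdf",
) -> List[Path]:
    """Call convert(input_path, output_path) for every .html file in source_dir."""
    out_root = Path(output_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    written = []
    # Sort for a predictable processing order.
    for html_file in sorted(Path(source_dir).glob("*.html")):
        target = out_root / (html_file.stem + extension)
        convert(str(html_file), str(target))
        written.append(target)
    return written


def spire_html_to_pdf(input_path: str, output_path: str) -> None:
    """Converter callback built from the Spire.Doc calls shown in this article."""
    # Assumes these names are exported at the top level of spire.doc.
    from spire.doc import Document, FileFormat, XHTMLValidationType

    document = Document()
    try:
        document.LoadFromFile(input_path, FileFormat.Html, XHTMLValidationType.None_)
        document.SaveToFile(output_path, FileFormat.PDF)
    finally:
        document.Close()
```

A call such as convert_directory("pages", "archive", spire_html_to_pdf) would then write one PDF per HTML file; the folder names here are placeholders.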
2. Convert HTML String to PDF
from spire.doc import *
document = Document()
section = document.AddSection()
paragraph = section.AddParagraph()
html_string = """
<h2>HTML to PDF Example</h2>
<p>This HTML content contains <em>italic</em>, <strong>bold</strong>, and a <a href='https://example.com'>link</a>.</p>
"""
# Insert HTML
paragraph.AppendHTML(html_string)
# Save as PDF
document.SaveToFile("string_output.pdf", FileFormat.PDF)
document.Close()
Code Explanation:
- AppendHTML(html_string): Inserts HTML content and renders it visually.
- SaveToFile("string_output.pdf", FileFormat.PDF): Exports the document to PDF.
- The exported PDF preserves styles, fonts, and hyperlinks.
This approach works well for dynamically generating PDFs at runtime, such as for API-based reports.
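In a long-running service, it matters that Close() is called even when a conversion raises an exception. Because contextlib.closing looks for a lowercase close() method, a small custom context manager is one way to handle this. The sketch below assumes the same Spire.Doc calls as the example above; closing_document and html_string_to_pdf are illustrative names, not library functions.

```python
from contextlib import contextmanager


@contextmanager
def closing_document(document):
    """Yield the document and call its Close() method on exit, even after an error."""
    try:
        yield document
    finally:
        document.Close()


def html_string_to_pdf(html_string: str, output_path: str) -> None:
    """Render an HTML string to a PDF file using the calls from this article."""
    # Imported lazily; assumes these names are exported by spire.doc.
    from spire.doc import Document, FileFormat

    with closing_document(Document()) as document:
        paragraph = document.AddSection().AddParagraph()
        paragraph.AppendHTML(html_string)
        document.SaveToFile(output_path, FileFormat.PDF)
```

A request handler could then call html_string_to_pdf(html_string, "report.pdf") without repeating the cleanup logic in every code path.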
Conclusion
With Python, converting HTML to Word and PDF is straightforward. Whether reading from a file or using an HTML string, Spire.Doc for Python provides a clean and efficient API, supporting multiple output formats (.doc, .docx, .pdf) with minimal effort.