lu liu

Posted on Mar 12

Converting HTML to Excel Using Python

#python #webscraping #tutorial #html

In data extraction and web scraping workflows, converting HTML tables to Excel spreadsheets is a valuable skill. Web pages often contain structured data in HTML table format that needs to be analyzed, manipulated, or archived in Excel. This article explores how to efficiently convert HTML content to Excel format using Python while preserving table structure and formatting.

Why Convert HTML to Excel

HTML tables are commonly used to display data on websites, but they lack the analytical capabilities of spreadsheet software:

Data Analysis: Excel provides powerful tools for sorting, filtering, and calculating
Data Manipulation: Spreadsheets allow easy editing and restructuring of data
Offline Access: Excel files can be worked with without an internet connection
Integration: Excel data can be easily imported into databases and business intelligence tools
Reporting: Professional reports and charts can be generated from extracted data

Automating HTML to Excel conversion with Python enables batch processing of web data, automated report generation, and seamless integration into data pipelines.

Environment Setup

Before starting, you need to install a Python library that supports both HTML parsing and Excel operations. Spire.XLS for Python provides comprehensive APIs for handling HTML to XLSX conversion.

pip install Spire.XLS

Once installed, import the relevant modules in your Python script:

from spire.xls import *
from spire.xls.common import *

Basic HTML to Excel Conversion

The most straightforward conversion scenario involves loading an HTML file and saving it as an Excel workbook. This approach works well when you have complete HTML files saved locally.

The following code demonstrates how to load an HTML file containing table data and convert it directly to Excel format with minimal configuration:

from spire.xls import *
from spire.xls.common import *

# Define input and output paths
inputFile = "web_table.html"
outputFile = "extracted_data.xlsx"

# Create a workbook object
workbook = Workbook()

# Load HTML content from file
workbook.LoadFromHtml(inputFile)

# Save as Excel 2013 format
workbook.SaveToFile(outputFile, ExcelVersion.Version2013)

This basic conversion flow automatically parses HTML tables and creates corresponding worksheets. The LoadFromHtml() method handles the complexity of parsing HTML structure and mapping table elements to Excel cells. Each HTML table becomes a separate worksheet in the resulting workbook.

Converting HTML String Directly

In many real-world scenarios, you'll work with HTML content obtained from web requests rather than saved files. The library supports loading HTML directly from strings, which is perfect for web scraping applications.

When you fetch HTML content from a website using libraries like requests, you can pass the HTML string directly to the workbook object without saving it to a temporary file:

from spire.xls import *
import io

# Assume you have HTML string from web scraping
html_content = """
<html>
<table>
    <tr><th>Name</th><th>Value</th></tr>
    <tr><td>Item A</td><td>100</td></tr>
    <tr><td>Item B</td><td>200</td></tr>
</table>
</html>
"""

# Create workbook
workbook = Workbook()

# Load from HTML string using stream
stream = io.BytesIO(html_content.encode('utf-8'))
workbook.LoadFromHtml(stream)

# Save to Excel
workbook.SaveToFile("web_data.xlsx", ExcelVersion.Version2013)
workbook.Dispose()

This approach eliminates the need for temporary files and integrates seamlessly with web scraping workflows. You can combine this with HTTP request libraries to build automated data extraction pipelines that fetch HTML from URLs and immediately convert to Excel format.

Handling Multiple Tables

Web pages often contain multiple HTML tables that should be preserved as separate worksheets. The conversion process automatically detects and handles multiple tables within a single HTML document.

When an HTML file contains several table elements, each one is mapped to a distinct worksheet in the Excel workbook, maintaining the original order and structure:

from spire.xls import *

inputFile = "multi_table_report.html"
outputFile = "report_worksheets.xlsx"

workbook = Workbook()
workbook.LoadFromHtml(inputFile)

# Access individual worksheets created from HTML tables
for i in range(workbook.Worksheets.Count):
    sheet = workbook.Worksheets[i]
    print("Worksheet {0}: {1}".format(i, sheet.Name))

# Save the complete workbook
workbook.SaveToFile(outputFile, ExcelVersion.Version2013)
workbook.Dispose()

This automatic table-to-worksheet mapping is particularly useful for:

Financial reports with multiple data sections
Comparison tables split across different categories
Statistical data organized by time periods or regions
Product catalogs with separate category tables

Preserving Formatting and Styles

HTML tables often include styling information such as colors, borders, and font attributes. Maintaining these visual elements during conversion helps preserve the original presentation and makes data easier to interpret.

The conversion process attempts to map CSS styles to Excel formatting properties, ensuring that important visual cues are retained in the spreadsheet:

from spire.xls import *

inputFile = "styled_table.html"
outputFile = "formatted_spreadsheet.xlsx"

workbook = Workbook()
workbook.LoadFromHtml(inputFile)

# Access the first worksheet to verify formatting
sheet = workbook.Worksheets[0]

# You can further enhance formatting after conversion
# For example, auto-fit columns for better readability
sheet.AllocatedRange.AutoFitColumns()
sheet.AllocatedRange.AutoFitRows()

# Save with preserved styles
workbook.SaveToFile(outputFile, ExcelVersion.Version2013)
workbook.Dispose()

The conversion preserves common HTML styling including:

Background colors for cells and rows
Font properties (bold, italic, size)
Text alignment (left, center, right)
Border styles and colors
Column widths based on content

Working with Complex HTML Structures

Real-world HTML often contains nested tables, merged cells, and complex layouts that require special handling during conversion. Understanding how these structures are mapped to Excel helps ensure accurate results.

HTML elements like colspan and rowspan attributes are automatically converted to merged cells in Excel, maintaining the visual layout of the original table:

from spire.xls import *

inputFile = "complex_layout.html"
outputFile = "converted_layout.xlsx"

workbook = Workbook()
workbook.LoadFromHtml(inputFile)

# Iterate through worksheets to inspect converted content
for sheet in workbook.Worksheets:
    # Check for merged regions created from colspan/rowspan
    if sheet.MergedCells.Count > 0:
        print("Sheet '{0}' has {1} merged regions".format(
            sheet.Name, 
            sheet.MergedCells.Count
        ))

workbook.SaveToFile(outputFile, ExcelVersion.Version2013)
workbook.Dispose()

Common complex structures handled during conversion include:

Nested tables become separate worksheets
Merged cells from colspan and rowspan attributes
Header rows repeated across pages
Images embedded within table cells

Batch Converting Multiple HTML Files

In production environments, you often need to process many HTML files simultaneously. A batch conversion function streamlines this workflow and ensures consistent output quality.

The following utility function demonstrates how to process an entire directory of HTML files, creating a corresponding set of Excel workbooks with proper error handling:

import os
from spire.xls import *

def batch_convert_html_to_excel(input_folder, output_folder):
    """Convert all HTML files in a folder to Excel"""

    # Ensure output directory exists
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    # Process all HTML files
    html_extensions = ['.html', '.htm']

    for filename in os.listdir(input_folder):
        if any(filename.lower().endswith(ext) for ext in html_extensions):
            input_path = os.path.join(input_folder, filename)
            base_name = os.path.splitext(filename)[0]
            output_path = os.path.join(output_folder, base_name + '.xlsx')

            try:
                # Convert current file
                workbook = Workbook()
                workbook.LoadFromHtml(input_path)

                # Auto-fit columns for better readability
                for sheet in workbook.Worksheets:
                    sheet.AllocatedRange.AutoFitColumns()

                workbook.SaveToFile(output_path, ExcelVersion.Version2013)
                workbook.Dispose()

                print("Converted: {0}".format(filename))

            except Exception as e:
                print("Failed to convert {0}: {1}".format(filename, str(e)))

# Usage example
batch_convert_html_to_excel("html_reports", "excel_output")

This batch conversion utility provides:

Automatic output directory creation
Support for both .html and .htm extensions
Error handling for malformed HTML files
Progress reporting for monitoring large batches
Automatic column width optimization

Practical Example: Web Data Extraction Pipeline

Combining HTML to Excel conversion with web scraping creates a powerful data extraction pipeline. This practical example shows how to fetch tables from a website and save them directly to Excel format.

You can integrate HTTP requests with the conversion process to build end-to-end automation that goes from live web data to analysis-ready spreadsheets:

import requests
from spire.xls import *
import io

def extract_web_table_to_excel(url, output_file):
    """Fetch HTML table from URL and save as Excel"""

    try:
        # Fetch HTML content from URL
        response = requests.get(url)
        response.raise_for_status()
        html_content = response.text

        # Create workbook and load HTML
        workbook = Workbook()
        stream = io.BytesIO(html_content.encode('utf-8'))
        workbook.LoadFromHtml(stream)

        # Apply formatting enhancements
        for sheet in workbook.Worksheets:
            sheet.AllocatedRange.AutoFitColumns()
            sheet.AllocatedRange.AutoFitRows()

        # Save to Excel
        workbook.SaveToFile(output_file, ExcelVersion.Version2013)
        workbook.Dispose()

        print("Successfully extracted data to: {0}".format(output_file))
        return True

    except Exception as e:
        print("Extraction failed: {0}".format(str(e)))
        return False

# Usage example
extract_web_table_to_excel(
    "https://example.com/data-table.html", 
    "extracted_data.xlsx"
)

This pipeline approach enables:

Automated daily data extraction from reporting websites
Real-time conversion of financial dashboards to analyzable formats
Archival of time-sensitive web content in structured format
Integration with downstream data processing workflows

Common Issues and Solutions

Issue 1: Missing Data in Converted File

Ensure the HTML contains proper table tags (<table>, <tr>, <td>). Data outside table structures may not be captured:

# Verify HTML structure before conversion
if "<table>" in html_content:
    workbook.LoadFromHtml(stream)

Issue 2: Poor Column Widths

Apply auto-fit after conversion to optimize column widths:

for sheet in workbook.Worksheets:
    sheet.AllocatedRange.AutoFitColumns()

Issue 3: Encoding Problems with Special Characters

Specify UTF-8 encoding when working with international characters:

stream = io.BytesIO(html_content.encode('utf-8'))

Issue 4: Large Files Causing Memory Issues

Process large HTML files in chunks or use streaming approaches for very large datasets. Consider filtering to extract only necessary tables.

Summary

Converting HTML to Excel is an essential skill for web data extraction and automated reporting. Through this article, we've learned:

How to load and convert HTML files using the Workbook object
Processing HTML strings directly from web scraping operations
Handling multiple tables as separate worksheets automatically
Preserving formatting and styles during conversion
Managing complex HTML structures with merged cells
Building batch conversion systems for production use
Creating end-to-end web data extraction pipelines

These techniques apply directly to competitive intelligence gathering, financial data collection, market research automation, and regulatory compliance reporting. After mastering basic HTML to Excel conversion, you can explore advanced topics like scheduled data refresh, multi-source data consolidation, and integration with business intelligence platforms.

DEV Community