DEV Community

jelizaveta

A Comprehensive Guide to Extracting Word Tables to TXT with Python

In everyday office work and automation scenarios, Word documents (DOC/DOCX) remain one of the most common data carriers. Business data, statistical reports, contract clauses, and configuration details are often stored as tables inside them. When that table data needs further processing (importing into a database, converting to Excel, generating reports, or running analysis), manual copy-pasting is slow and error-prone.

With Python and a dedicated document processing library, table extraction from Word can be automated and the results saved as structured text files or other formats. This article shows how to use Spire.Doc for Python to extract each table from a Word document and export its content as a separate text file.

Why Choose Spire.Doc for Python?

Among various Python document processing options, Spire.Doc for Python stands out as a professional Word document processing library for developers, with the following advantages:

  • No Dependency on Microsoft Word: Runs without an Office installation, which suits server and automation environments
  • Supports Complete Word Structure: Access to paragraphs, tables, headers/footers, styles, etc.
  • Clear API Design: Object-oriented, with an object model that closely mirrors the Word document structure
  • Stable and Reliable: Suitable for batch processing and enterprise-level application scenarios

When extracting tables, which involves traversing the document hierarchy, Spire.Doc provides a very intuitive object model, making the code both clear and maintainable.

Overview of the Implementation Approach

Extracting tables from Word essentially involves layer-by-layer traversal of the Word document structure. The overall process is as follows:

  1. Load the Word document
  2. Traverse all Sections in the document
  3. Retrieve all tables in each Section
  4. Traverse the rows and cells within the tables
  5. Read the paragraph text within the cells
  6. Concatenate the table data in a row-column structure
  7. Save each table as an independent text file

This approach keeps each table's row-and-column layout in the output and makes it easy to extend the logic later to CSV, Excel, or database import.

Preparation

Before starting, ensure your environment is ready:

  • Python 3.x
  • Spire.Doc for Python, installed via pip

Installation example:

pip install spire-doc

Once installed, you can directly reference the relevant modules in your Python project.
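To confirm the package is visible to the interpreter you are actually running, a quick check like the one below can help. It is only a sketch: `installed_version` is a hypothetical helper, and the distribution name `Spire.Doc` is an assumption based on the pip command above, so adjust it if your environment lists the package under a different name.

```python
from importlib import metadata


def installed_version(dist_name="Spire.Doc"):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "Spire.Doc" is an assumed distribution name; change it if pip reports another one
version = installed_version()
print(f"Spire.Doc version: {version}" if version else "Spire.Doc is not installed")
```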

Example Code: Extracting Tables from Word and Saving as Text Files

Here is the complete example code for extracting all tables from a Word document and saving each table as a .txt file.

import os

from spire.doc import *
from spire.doc.common import *

# Create Document instance
doc = Document()

# Load Word document
doc.LoadFromFile("Input.docx")

# Make sure the output folder exists; open() does not create missing directories
os.makedirs('output/Tables', exist_ok=True)

# Traverse all Sections in the document
for s in range(doc.Sections.Count):
    # Get the current Section
    section = doc.Sections.get_Item(s)
    # Retrieve all tables in the current Section
    tables = section.Tables
    # Traverse tables in the current Section
    for i in range(tables.Count):
        # Get table object
        table = tables.get_Item(i)
        # String to store current table data
        tableData = ''
        # Traverse all rows in the table
        for j in range(table.Rows.Count):
            # Traverse all cells in the current row
            for k in range(table.Rows.get_Item(j).Cells.Count):
                # Get cell object
                cell = table.Rows.get_Item(j).Cells.get_Item(k)
                # String to store text content in the cell
                cellText = ''
                # Traverse all paragraphs in the cell
                for para in range(cell.Paragraphs.Count):
                    paragraphText = cell.Paragraphs.get_Item(para).Text
                    cellText += (paragraphText + ' ')
                # Append cell text to table data string
                tableData += cellText
                # If not the last cell, add a tab as column separator
                if k < table.Rows.get_Item(j).Cells.Count - 1:
                    tableData += '\t'
            # After the current row, add a newline character
            tableData += '\n'

        # Save table data as a text file
        with open(f'output/Tables/WordTable_{s+1}_{i+1}.txt', 'w', encoding='utf-8') as f:
            f.write(tableData)

# Close document and release resources
doc.Close()

Code Explanation

Let’s break down the core logic of the code to help you better understand how it works.

1. Load the Word Document

doc = Document()
doc.LoadFromFile("Input.docx")

Here, we create a Document instance and load the specified Word file. Document is the core object in Spire.Doc that represents the entire Word document.

2. Traverse Sections in the Document

for s in range(doc.Sections.Count):
    section = doc.Sections.get_Item(s)

In Word, a document may contain multiple Sections (for example, to apply different page layouts or header/footer settings to different parts of the document). To ensure we don’t miss any tables, we need to traverse all Sections.

3. Retrieve and Traverse Tables

tables = section.Tables
for i in range(tables.Count):
    table = tables.get_Item(i)

Each Section may contain multiple tables. By using section.Tables, we can directly access all table objects in that section.

4. Traverse Rows and Cells

for j in range(table.Rows.Count):
    for k in range(table.Rows.get_Item(j).Cells.Count):
        cell = table.Rows.get_Item(j).Cells.get_Item(k)

Tables consist of rows and cells. Here we use a double loop to ensure we read the data in a "row → column" order, thus maintaining the original table structure.
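The Section, Table, Row, and Cell loops all follow the same pattern: read `Count`, then call `get_Item(index)`. As a sketch of how that repetition could be folded into generators, the hypothetical helpers below (`iter_items` and `iter_cells` are not part of Spire.Doc) rely only on the `Count` and `get_Item` members used in this article, so they work with any object that exposes them.

```python
def iter_items(collection):
    """Yield every element of a collection exposing Count and get_Item(index)."""
    for index in range(collection.Count):
        yield collection.get_Item(index)


def iter_cells(doc):
    """Yield (section_no, table_no, row_no, col_no, cell), all numbered from 1."""
    for s, section in enumerate(iter_items(doc.Sections), start=1):
        for t, table in enumerate(iter_items(section.Tables), start=1):
            for r, row in enumerate(iter_items(table.Rows), start=1):
                for c, cell in enumerate(iter_items(row.Cells), start=1):
                    yield s, t, r, c, cell
```

With a loaded document, `for s, t, r, c, cell in iter_cells(doc):` visits the cells in the same row-then-column order as the nested loops, and each row object is fetched only once instead of once per cell.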

5. Read Paragraph Text from Cells

for para in range(cell.Paragraphs.Count):
    paragraphText = cell.Paragraphs.get_Item(para).Text
    cellText += (paragraphText + ' ')

A cell may contain multiple paragraphs (e.g., when Enter is pressed inside the cell). Therefore, we need to traverse cell.Paragraphs and concatenate all paragraph texts to ensure complete content.
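One side effect of appending a space after every paragraph is that each cell's text ends with a trailing space. If that matters downstream, the paragraph strings can be collected first and joined afterwards. The helper below is a hypothetical sketch that works on plain strings (for example, the `.Text` values read in the loop above); skipping empty paragraphs is a design choice of this sketch, not something the original code does.

```python
def join_paragraphs(paragraph_texts, separator=" "):
    """Join paragraph strings with a separator, skipping empty paragraphs."""
    cleaned = (text.strip() for text in paragraph_texts)
    return separator.join(text for text in cleaned if text)


print(repr(join_paragraphs(["Unit price", "", " (USD) "])))  # 'Unit price (USD)'
```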

6. Concatenate Table Data

tableData += cellText
tableData += '\t'
tableData += '\n'

  • Tab characters (\t) separate columns
  • Newline characters (\n) separate rows

This tab-separated layout converts readily to CSV or Excel and can be loaded by most database import tools.
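The tab-and-newline layout only stays unambiguous if the cell text itself contains no tabs or line breaks, which Word content can include. A minimal sketch of a safer serialization step, assuming the table has already been collected as a list of rows of strings (`clean_cell` and `rows_to_tsv` are hypothetical names):

```python
def clean_cell(text):
    """Collapse tabs, newlines, and repeated spaces so one cell stays in one field."""
    return " ".join(text.split())


def rows_to_tsv(rows):
    """Serialize a list of rows (lists of cell strings) as tab-separated text."""
    return "".join("\t".join(clean_cell(cell) for cell in row) + "\n" for row in rows)


rows = [["Name", "Qty"], ["Widget\tA", "3 "]]
print(rows_to_tsv(rows), end="")
```

Here the embedded tab in "Widget\tA" is replaced by a single space, so the second row still has exactly two columns.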

7. Save as Text Files

with open(f'output/Tables/WordTable_{s+1}_{i+1}.txt', 'w', encoding='utf-8') as f:
    f.write(tableData)

Each table is saved as a separate text file, with the filename containing the Section and Table indexes for easy identification.
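Because open() does not create missing directories, the write step can be wrapped in a small helper that prepares the folder before writing. The sketch below uses pathlib; `save_table_text` is a hypothetical name, and the default folder simply mirrors the path used in this article.

```python
from pathlib import Path


def save_table_text(table_text, section_no, table_no, out_dir="output/Tables"):
    """Write one table's text to WordTable_<section>_<table>.txt and return the path."""
    folder = Path(out_dir)
    # Create the folder (and any missing parents) before writing
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"WordTable_{section_no}_{table_no}.txt"
    path.write_text(table_text, encoding="utf-8")
    return path
```

Inside the table loop this would be called as `save_table_text(tableData, s + 1, i + 1)`.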

Extending to Other Scenarios

Based on this example code, you can easily extend it to more practical applications, such as:

  • Converting the extracted table data to CSV or Excel
  • Automatically parsing Word reports and importing into a database system
  • Batch processing tables in contracts or business documents
  • Integrating with data analysis or BI tools

The rich API provided by Spire.Doc for Python makes these extensions very natural.
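As an illustration of the first item, the tab-separated text produced earlier can be turned into CSV with Python's standard csv module, which handles quoting for cells that contain commas. This is a sketch under the assumption that each line of the text is one table row; `tsv_to_csv` is a hypothetical helper, and it strips the trailing space that the extraction loop leaves in each cell.

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Convert tab-separated text (one row per line) to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in tsv_text.splitlines():
        # Strip the padding around each cell before writing it as a CSV field
        writer.writerow(cell.strip() for cell in line.split("\t"))
    return out.getvalue()


print(tsv_to_csv("Name \tNote \nWidget \tsmall, blue \n"), end="")
```

The second row is written as `Widget,"small, blue"`, so the comma inside the cell does not create an extra column.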

Conclusion

This article has described how to extract table content from Word documents automatically with Spire.Doc for Python and save it as text files. By traversing the document structure layer by layer (Section, Table, Row, Cell, Paragraph), we can collect the text of every table cell, which lays a solid foundation for subsequent data processing and automation workflows.

If you are looking for a stable, efficient solution for Word table extraction that does not rely on an Office environment, Spire.Doc for Python is worth considering.
