jelizaveta

Extract Text and Tables from Word Documents Accurately Using Python

Extracting structured content from Word documents, especially text and tables, is a common requirement in data processing workflows. While Python offers several libraries such as <span>python-docx</span>, handling complex document layouts or extracting both body text and tables reliably can be challenging. In these scenarios, Free Spire.Doc for Python is an alternative worth considering: it reads both <span>.doc</span> and <span>.docx</span> files and exposes body text and tables through a single object model.

In this article, you'll learn how to extract text from Word documents and save it to TXT files, as well as how to automatically export table data for further processing.

Prerequisites

Free Spire.Doc for Python is a powerful Word processing library that supports multiple document formats, including <span>.doc</span> and <span>.docx</span>. You can install it easily using pip:

pip install spire.doc.free

By default, the library runs in free mode. The free edition can process documents containing up to 500 paragraphs and 25 tables, which is sufficient for most small documents and testing scenarios.

Extract All Text and Save It to a TXT File

The built-in <span>GetText()</span> method can retrieve all text from a Word document. In practical applications, you'll often want to save the extracted content rather than simply displaying it. The following example reads all text from a Word document and writes it to a <span>.txt</span> file:

from spire.doc import *
from spire.doc.common import *

# Create a Document object and load a Word file
doc = Document()
try:
    doc.LoadFromFile("input.docx")

    # Extract all plain text from the document
    full_text = doc.GetText()
finally:
    # Release resources even if loading or extraction fails
    doc.Close()

# Write the text to a TXT file
with open("output.txt", "w", encoding="utf-8") as file:
    file.write(full_text)

print("Text extraction completed. Saved to output.txt")

Key Points

  • <span>GetText()</span> extracts all textual content in reading order, including paragraphs, headings, headers, and footers, while ignoring non-text elements such as images and shapes.
  • The file is saved using UTF-8 encoding to ensure proper handling of non-English characters.
  • Always call <span>doc.Close()</span> after processing to release system resources.

Extract and Export All Tables Accurately

Tables in Word documents often contain important business data such as reports, inventories, and records. Spire.Doc provides a clear object hierarchy:

Document → Section → Table → Row → Cell → Paragraph

The following code traverses every table in each section and exports each table as an individual <span>.txt</span> file. Tab characters are used as column separators, making the output easy to import into Excel or other data-processing tools.

from spire.doc import *
from spire.doc.common import *
import os

# Create the output directory
output_dir = "output/Tables"
os.makedirs(output_dir, exist_ok=True)

# Load the Word document
doc = Document()
doc.LoadFromFile("Sample.docx")

# Traverse all sections
for section_idx in range(doc.Sections.Count):
    section = doc.Sections.get_Item(section_idx)
    tables = section.Tables

    for table_idx in range(tables.Count):
        table = tables.get_Item(table_idx)
        table_data = ""

        # Traverse all rows and cells
        for row_idx in range(table.Rows.Count):
            row = table.Rows.get_Item(row_idx)

            for col_idx in range(row.Cells.Count):
                cell = row.Cells.get_Item(col_idx)

                # Collect text from all paragraphs within the cell
                cell_text = ""
                for para_idx in range(cell.Paragraphs.Count):
                    cell_text += cell.Paragraphs.get_Item(para_idx).Text + " "

                table_data += cell_text.strip()

                # Separate columns with tabs
                if col_idx < row.Cells.Count - 1:
                    table_data += "\t"

            table_data += "\n"  # End of row

        # Save the current table
        output_path = f"{output_dir}/WordTable_{section_idx+1}_{table_idx+1}.txt"

        with open(output_path, "w", encoding="utf-8") as f:
            f.write(table_data)

        print(f"Saved: {output_path}")

doc.Close()

Code Explanation

  • The nested loops ensure that every table in every section is processed.
  • Cell content is extracted by iterating through the cell's <span>Paragraphs</span> collection, preventing the loss of text separated by line breaks or formatting.
  • Output files are named using the pattern <span>WordTable_{section}_{table}.txt</span>, where both indices start at 1, making it easy to identify the source of each exported table.

Note: This example processes only top-level tables. If your document contains nested tables inside cells, you can extend the logic recursively to handle deeper table structures.
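Because the export relies on tab characters as column separators, a cell whose own text contains a tab or a line break would shift columns when the file is read back. The following is a minimal sketch of one way to guard against that, using only Python's standard <span>csv</span> module. The helper names <span>rows_to_tsv</span> and <span>tsv_to_rows</span> are illustrative and not part of Spire.Doc, and the sample rows stand in for cell text collected from a table.

```python
import csv
import io


def rows_to_tsv(rows):
    """Serialize a list of rows (lists of strings) as tab-separated text.

    csv.writer quotes any field that contains a tab, a quote character,
    or a line break, so such cells survive a round trip.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def tsv_to_rows(text):
    """Parse tab-separated text produced by rows_to_tsv back into rows."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))


# Hypothetical sample data standing in for text collected from table cells
sample = [["Item", "Note"], ["Bolt", "M4\tzinc"], ["Nut", "two\nlines"]]

tsv_text = rows_to_tsv(sample)
restored = tsv_to_rows(tsv_text)
```

To apply this to the export loop, you would collect each row as a list of cell strings and pass the full list to <span>rows_to_tsv</span> instead of concatenating the cells by hand.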

Best Practices and Considerations

1. Performance and Memory Usage

For large documents containing hundreds of pages, process only the content you need. If you're interested solely in tables, skip text extraction, and vice versa. Also, make sure <span>doc.Close()</span> is always executed to avoid resource leaks.
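One way to make sure <span>Close()</span> runs even when an error occurs mid-processing is a small context manager. This is a sketch under assumptions: <span>open_document</span> is a hypothetical helper, not a Spire.Doc function, and it only assumes the object exposes <span>LoadFromFile()</span> and <span>Close()</span> as used earlier in this article. A stub class stands in for the real <span>Document</span> so the example runs without the library installed.

```python
from contextlib import contextmanager


@contextmanager
def open_document(factory, path):
    """Load a document and guarantee Close() runs, even on errors.

    factory: a zero-argument callable returning an object that has
    LoadFromFile(path) and Close() methods (assumed here to match the
    Document class used in this article).
    """
    doc = factory()
    try:
        doc.LoadFromFile(path)
        yield doc
    finally:
        doc.Close()


class StubDocument:
    """Stand-in used only to demonstrate the helper without Spire.Doc."""

    def __init__(self):
        self.closed = False
        self.path = None

    def LoadFromFile(self, path):
        self.path = path

    def GetText(self):
        return "stub text"

    def Close(self):
        self.closed = True


# With the real library you would pass Document instead of StubDocument
with open_document(StubDocument, "input.docx") as stub:
    extracted = stub.GetText()
```

The same pattern works for the table export: wrap the traversal in the <span>with</span> block so an exception while writing a file does not leave the document open.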

2. Handling Merged Cells

When tables contain merged rows or columns, the code above still extracts the text from each cell correctly. However, the exported plain-text output will not preserve merge relationships.

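If maintaining the original table structure is important, you can read each cell's merge information and build a matrix that represents merged cells. In Spire.Doc this information is typically exposed through the cell format, for example <span>CellFormat.HorizontalMerge</span> and <span>CellFormat.VerticalMerge</span>; confirm the exact property names against the API reference for your version.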
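The matrix idea can be sketched without the library. The example below assumes the merge information has already been converted into plain <span>(text, column span, row span)</span> tuples, which is a hypothetical intermediate format rather than a Spire.Doc structure; the function name <span>expand_merged_cells</span> is likewise illustrative. Each grid position covered by a merged cell repeats that cell's text, so the result is rectangular.

```python
def expand_merged_cells(rows):
    """Expand cells with spans into a rectangular grid.

    rows: list of rows, each a list of (text, col_span, row_span) tuples.
    Returns a list of lists in which every position covered by a merged
    cell holds that cell's text; uncovered positions hold "".
    """
    grid = {}  # maps (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, col_span, row_span in row:
            # Skip positions already claimed by a cell merged from above
            while (r, c) in grid:
                c += 1
            # Fill every position the current cell covers
            for dr in range(row_span):
                for dc in range(col_span):
                    grid[(r + dr, c + dc)] = text
            c += col_span

    height = max((pos[0] for pos in grid), default=-1) + 1
    width = max((pos[1] for pos in grid), default=-1) + 1
    return [[grid.get((r, c), "") for c in range(width)] for r in range(height)]


# Hypothetical table: "Region" spans two rows, "Sales" spans two columns
table = [
    [("Region", 1, 2), ("Sales", 2, 1)],
    [("Q1", 1, 1), ("Q2", 1, 1)],
    [("North", 1, 1), ("10", 1, 1), ("12", 1, 1)],
]

matrix = expand_merged_cells(table)
```

Repeating the text in every covered position is a design choice that suits downstream tools expecting a full grid; leaving the covered positions empty is an equally valid alternative.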

3. Free Edition Limitations

The free edition of Spire.Doc can process documents containing up to 500 paragraphs and 25 tables, which is generally sufficient for everyday document-processing tasks and evaluation purposes.

Conclusion

With Free Spire.Doc for Python, you can extract text and tables from Word documents using only a few dozen lines of code. The two techniques demonstrated in this article, saving document text to TXT files and exporting tables individually, can be integrated directly into your data-processing workflows.

Combined with Python's file-handling capabilities and downstream tools such as pandas, these methods make it easy to build automated document-parsing solutions.

If you need to work with more complex layouts or preserve advanced table structures, Spire.Doc can also save a document in other formats through <span>SaveToFile()</span>, for example HTML, which retains more of the original table layout than plain text.
