DEV Community

jelizaveta

A Quick Guide to Extracting PDF Tables with Python

In data analysis, we often need to extract tabular data from PDF files. Copying tables directly out of a PDF, however, tends to scramble the formatting and misalign the data. This article walks step by step through using the Spire.PDF for Python library to identify and extract tables from a PDF, then save the data in common formats such as CSV and Excel.

1. Preparation: Installing Required Libraries

First, you need to install the Spire.PDF library. Open the terminal or command line and execute the following command:

pip install Spire.PDF

If you plan to export the extracted data to Excel format, it is recommended to also install pandas and openpyxl:

pip install pandas openpyxl

2. Core Code: Extracting Tables from PDF

The following code demonstrates how to extract tables from the first page of a PDF and print the cell contents row by row:

from spire.pdf import PdfDocument, PdfTableExtractor

# 1. Load PDF file
pdf = PdfDocument()
pdf.LoadFromFile("sample.pdf")

# 2. Create table extractor
table_extractor = PdfTableExtractor(pdf)

# 3. Extract all tables from the first page (page indexes start at 0).
#    Fall back to an empty list in case the page yields no tables.
tables = table_extractor.ExtractTable(0) or []

# 4. Iterate through each table
for table in tables:
    row_count = table.GetRowCount()
    column_count = table.GetColumnCount()

    # Extract cell contents row by row
    for i in range(row_count):
        row_data = []
        for j in range(column_count):
            cell_text = table.GetText(i, j)
            row_data.append(cell_text)
        print(row_data)

# 5. Release the document when finished
pdf.Close()

Code Explanation

Method                              Purpose
LoadFromFile()                      Loads a PDF file from the specified path
PdfTableExtractor()                 Creates a table extractor instance
ExtractTable(page index)            Extracts all tables from the specified page; page indexes start at 0
GetRowCount() / GetColumnCount()    Gets the number of rows and columns in the table
GetText(row, column)                Gets the text content of the specified cell
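
The three table methods listed above combine naturally into one small helper. The sketch below is a hypothetical function, `table_to_rows`, which is not part of Spire.PDF; it assumes only that the table object exposes `GetRowCount()`, `GetColumnCount()` and `GetText(row, column)` as described in this section. The `FakeTable` class is a stand-in invented for illustration so the example runs without a PDF file.

```python
def table_to_rows(table):
    """Return the table's cell text as a list of rows (a list of lists)."""
    row_count = table.GetRowCount()
    column_count = table.GetColumnCount()
    return [
        [table.GetText(i, j) for j in range(column_count)]
        for i in range(row_count)
    ]


class FakeTable:
    """Hypothetical stand-in that mimics the three table methods used above."""

    def __init__(self, cells):
        self._cells = cells

    def GetRowCount(self):
        return len(self._cells)

    def GetColumnCount(self):
        return len(self._cells[0]) if self._cells else 0

    def GetText(self, row, column):
        return self._cells[row][column]


demo = FakeTable([["Name", "Qty"], ["Apple", "3"]])
print(table_to_rows(demo))
```

With a real document, each table returned by `ExtractTable()` could be passed to the helper in place of `FakeTable`.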

3. Advanced Processing: Batch Extraction from Multi-page PDF

If the PDF contains multiple pages, you can use a loop to batch extract all tables:

from spire.pdf import PdfDocument, PdfTableExtractor

pdf = PdfDocument()
pdf.LoadFromFile("multi_page_report.pdf")

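# Create the extractor once and reuse it for every page
extractor = PdfTableExtractor(pdf)

# Iterate through all pages
for page_index in range(pdf.Pages.Count):
    # Fall back to an empty list in case the page yields no tables
    tables = extractor.ExtractTable(page_index) or []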

    print(f"\n=== Page {page_index + 1} found {len(tables)} table(s) ===")

    for t, table in enumerate(tables):
        print(f"--- Table {t+1} ---")
        rows = table.GetRowCount()
        cols = table.GetColumnCount()

        for i in range(rows):
            row = [table.GetText(i, j) for j in range(cols)]
            print(row)

4. Exporting Data: Saving as CSV or Excel Files

The extracted table data can be easily converted to other formats. The following example saves the data as a CSV file:

import csv
from spire.pdf import PdfDocument, PdfTableExtractor

pdf = PdfDocument()
pdf.LoadFromFile("sample.pdf")

extractor = PdfTableExtractor(pdf)
tables = extractor.ExtractTable(0)

if tables:
    table = tables[0]
    rows = table.GetRowCount()
    cols = table.GetColumnCount()

    # Collect all data
    data = []
    for i in range(rows):
        row_data = [table.GetText(i, j) for j in range(cols)]
        data.append(row_data)

    # Write to CSV file
    with open("output.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerows(data)

    print(f"Successfully exported {rows} rows × {cols} columns of data to output.csv")
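
Text pulled from PDF cells often carries line breaks and stray whitespace, which can make the resulting CSV harder to read. Below is a minimal sketch of a cleanup step that could run before writing. `clean_cell` and `clean_rows` are hypothetical helper names, and collapsing all whitespace into single spaces is an assumption that may not suit every table.

```python
import csv
import io


def clean_cell(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    if text is None:
        return ""
    return " ".join(str(text).split())


def clean_rows(rows):
    """Apply clean_cell to every cell of a 2D list."""
    return [[clean_cell(cell) for cell in row] for row in rows]


# Sample data standing in for rows collected with GetText()
raw = [["Product\nName", " Price "], ["Green  tea", None]]
cleaned = clean_rows(raw)

# Write to an in-memory buffer here; a real script would open a file instead
buffer = io.StringIO()
csv.writer(buffer).writerows(cleaned)
print(buffer.getvalue())
```

In the CSV example above, this would amount to calling `writer.writerows(clean_rows(data))` instead of writing `data` directly.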

To export to an Excel file, you can use pandas:

import pandas as pd

# Assume data is the 2D list extracted above
df = pd.DataFrame(data[1:], columns=data[0])  # First row as column headers
df.to_excel("output.xlsx", index=False)
print("Data saved as output.xlsx")
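
Using the first row as column headers works well when every header cell is filled and unique. Extracted tables sometimes have blank or repeated header cells, which makes DataFrame columns awkward to select by name. The sketch below shows one possible way to normalize headers first. `unique_headers` is a hypothetical helper, and the `column_N` and `_2` naming scheme is an arbitrary choice made for illustration.

```python
def unique_headers(headers):
    """Return headers with blanks filled in and repeated names suffixed."""
    seen = {}
    result = []
    for position, header in enumerate(headers, start=1):
        # Replace a missing or blank header with a positional placeholder
        name = (header or "").strip() or f"column_{position}"
        seen[name] = seen.get(name, 0) + 1
        # Second and later occurrences get a numeric suffix
        if seen[name] > 1:
            name = f"{name}_{seen[name]}"
        result.append(name)
    return result


print(unique_headers(["Item", "", "Price", "Price"]))
# → ['Item', 'column_2', 'Price', 'Price_2']
```

The result could then be passed as `columns=unique_headers(data[0])` in the `pd.DataFrame` call. Note that this simple scheme does not check whether a suffixed name collides with a header that already exists.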

5. Common Issues and Tips

  1. Incomplete table recognition? Check whether the table in the PDF has clear borders. Scanned documents or image-based PDFs require OCR technology; Spire.PDF is mainly suitable for text-based PDFs.
  2. Handling merged cells: Spire.PDF automatically handles merged cells. GetText() returns the content of the cell in the upper-left corner of the merged area, and returns an empty string for other positions.
  3. Performance optimization: When processing large PDFs, it is recommended to extract and save page by page to avoid loading all tables into memory at once.
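
The page-by-page advice in item 3 can be sketched as follows. `save_tables_page_by_page` is a hypothetical helper, the output file naming is an arbitrary choice, and the two `_Fake` classes are stand-ins invented so the example runs without a PDF. The sketch assumes the extractor and table objects expose the methods used earlier in this article, and that a page without tables may yield `None` or an empty list.

```python
import csv
import os
import tempfile


def save_tables_page_by_page(extractor, page_count, out_dir):
    """Write each table to its own CSV as soon as its page is processed."""
    written = []
    for page_index in range(page_count):
        # Assumption: a page without tables may yield None or an empty list
        tables = extractor.ExtractTable(page_index) or []
        for table_no, table in enumerate(tables, start=1):
            name = f"page{page_index + 1}_table{table_no}.csv"
            path = os.path.join(out_dir, name)
            with open(path, "w", newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                # Rows are written one at a time, so no full table is kept in a list
                for i in range(table.GetRowCount()):
                    writer.writerow(
                        [table.GetText(i, j) for j in range(table.GetColumnCount())]
                    )
            written.append(path)
    return written


class _FakeTable:
    """Hypothetical stand-in for an extracted table."""

    def __init__(self, cells):
        self._cells = cells

    def GetRowCount(self):
        return len(self._cells)

    def GetColumnCount(self):
        return len(self._cells[0]) if self._cells else 0

    def GetText(self, row, column):
        return self._cells[row][column]


class _FakeExtractor:
    """Hypothetical stand-in for a table extractor."""

    def __init__(self, pages):
        self._pages = pages

    def ExtractTable(self, page_index):
        return self._pages[page_index]


# Page 1 has one table, page 2 has none
demo_extractor = _FakeExtractor([[_FakeTable([["A", "B"], ["1", "2"]])], None])
out_dir = tempfile.mkdtemp()
paths = save_tables_page_by_page(demo_extractor, 2, out_dir)
print([os.path.basename(p) for p in paths])
```

With Spire.PDF, the call would presumably take `PdfTableExtractor(pdf)` and `pdf.Pages.Count` in place of the stand-in extractor and the literal page count.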

With the steps above, you can load a PDF, extract its tables page by page, and export the results to CSV or Excel using Python. The same code can be integrated into automated data processing pipelines to replace manual copying.
