In data analysis, we often need to extract tabular data from PDF files. Copying a table straight out of a PDF, however, usually scrambles the formatting and misaligns the data. This article walks step by step through using the Spire.PDF for Python library to identify and extract tables from a PDF, and to save the data in common formats such as CSV and Excel.
1. Preparation: Installing Required Libraries
First, you need to install the Spire.PDF library. Open the terminal or command line and execute the following command:
pip install Spire.PDF
If you plan to export the extracted data to Excel format, it is recommended to also install pandas and openpyxl:
pip install pandas openpyxl
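Before moving on, it can help to confirm that the packages are importable from the interpreter you intend to use. The following is a minimal sketch, not part of Spire.PDF itself: the helper name `is_importable` is hypothetical, and the module name `spire.pdf` is taken from the import statements used later in this article.

```python
import importlib


def is_importable(module_name: str) -> bool:
    """Return True if the module can be imported in this environment."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


# Report the status of each package this article relies on.
for name in ("spire.pdf", "pandas", "openpyxl"):
    status = "OK" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```

If a package is reported as missing, the most common cause is that pip installed it into a different Python environment than the one running the script.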
2. Core Code: Extracting Tables from PDF
The following code demonstrates how to extract tables from the first page of a PDF and print the cell contents row by row:
from spire.pdf import PdfDocument, PdfTableExtractor

# 1. Load PDF file
pdf = PdfDocument()
pdf.LoadFromFile("sample.pdf")

# 2. Create table extractor
table_extractor = PdfTableExtractor(pdf)

# 3. Extract all tables from the first page (page indexes start at 0);
#    fall back to an empty list in case nothing is returned
tables = table_extractor.ExtractTable(0) or []

# 4. Iterate through each table
for table in tables:
    row_count = table.GetRowCount()
    column_count = table.GetColumnCount()
    # Extract cell contents row by row
    for i in range(row_count):
        row_data = []
        for j in range(column_count):
            cell_text = table.GetText(i, j)
            row_data.append(cell_text)
        print(row_data)
Code Explanation
| Method | Purpose |
|---|---|
| LoadFromFile() | Loads a PDF file from the specified path |
| PdfTableExtractor() | Creates a table extractor instance |
| ExtractTable(page number) | Extracts all tables from the specified page; page numbers start from 0 |
| GetRowCount() / GetColumnCount() | Gets the number of rows and columns of the table |
| GetText(row, column) | Gets the text content of the specified cell |
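The nested loop over GetRowCount(), GetColumnCount() and GetText() can be wrapped in a small reusable helper. The sketch below is one way to do that. `table_to_rows` and `FakeTable` are hypothetical names introduced here for illustration; `FakeTable` only imitates the three methods listed above so the example runs without a PDF. The whitespace normalization is an added assumption about what is convenient for export, not something Spire.PDF does on its own.

```python
class FakeTable:
    """Hypothetical stand-in exposing the three table methods used in this article."""

    def __init__(self, cells):
        self._cells = cells

    def GetRowCount(self):
        return len(self._cells)

    def GetColumnCount(self):
        return len(self._cells[0]) if self._cells else 0

    def GetText(self, row, column):
        return self._cells[row][column]


def table_to_rows(table):
    """Return the table as a list of rows, collapsing whitespace inside each cell."""
    rows = []
    for i in range(table.GetRowCount()):
        row = []
        for j in range(table.GetColumnCount()):
            text = table.GetText(i, j) or ""
            # Line breaks and repeated spaces inside a cell become single spaces.
            row.append(" ".join(text.split()))
        rows.append(row)
    return rows


demo = FakeTable([["Name", "Unit\nPrice"], ["Widget ", "9.50"]])
print(table_to_rows(demo))  # [['Name', 'Unit Price'], ['Widget', '9.50']]
```

Because the helper only relies on those three methods, a table object returned by ExtractTable() should be usable in place of `FakeTable`.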
3. Advanced Processing: Batch Extraction from Multi-page PDF
If the PDF contains multiple pages, you can use a loop to batch extract all tables:
from spire.pdf import PdfDocument, PdfTableExtractor

pdf = PdfDocument()
pdf.LoadFromFile("multi_page_report.pdf")

# One extractor can be reused for every page
extractor = PdfTableExtractor(pdf)

# Iterate through all pages
for page_index in range(pdf.Pages.Count):
    # Fall back to an empty list in case nothing is returned for this page
    tables = extractor.ExtractTable(page_index) or []
    print(f"\n=== Page {page_index + 1} found {len(tables)} table(s) ===")
    for t, table in enumerate(tables):
        print(f"--- Table {t+1} ---")
        rows = table.GetRowCount()
        cols = table.GetColumnCount()
        for i in range(rows):
            row = [table.GetText(i, j) for j in range(cols)]
            print(row)
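When the results are needed later rather than just printed, the same page loop can collect tables into a dictionary keyed by page number. This is a minimal sketch under stated assumptions: `collect_tables` is a hypothetical helper, `extract_page` is any callable that takes a 0-based page index, and the guard against `None` is a precaution in case a page yields no result. The demo uses plain nested lists instead of a real PDF.

```python
def collect_tables(extract_page, page_count):
    """Map 1-based page numbers to the tables found on that page.

    extract_page is any callable taking a 0-based page index and returning
    a list of tables (or None); pages without tables are left out.
    """
    found = {}
    for page_index in range(page_count):
        tables = extract_page(page_index) or []
        if tables:
            found[page_index + 1] = list(tables)
    return found


# Hypothetical stand-in for a three-page document: page 2 has no tables.
fake_pages = {
    0: [[["Name", "Qty"], ["Widget", "3"]]],
    1: None,
    2: [[["Region", "Total"]], [["Year", "Sales"]]],
}
result = collect_tables(fake_pages.get, page_count=3)
for page_number, tables in result.items():
    print(f"Page {page_number}: {len(tables)} table(s)")
```

With Spire.PDF, `extract_page` could be the extractor's ExtractTable method and `page_count` could be pdf.Pages.Count, as in the loop above.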
4. Exporting Data: Saving as CSV or Excel Files
The extracted table data can be easily converted to other formats. The following example saves the data as a CSV file:
import csv
from spire.pdf import PdfDocument, PdfTableExtractor

pdf = PdfDocument()
pdf.LoadFromFile("sample.pdf")
extractor = PdfTableExtractor(pdf)
tables = extractor.ExtractTable(0)

if tables:
    table = tables[0]
    rows = table.GetRowCount()
    cols = table.GetColumnCount()
    # Collect all data
    data = []
    for i in range(rows):
        row_data = [table.GetText(i, j) for j in range(cols)]
        data.append(row_data)
    # Write to CSV file
    with open("output.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerows(data)
    print(f"Successfully exported {rows} rows × {cols} columns of data to output.csv")
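The example above writes only the first table. If a page or document yields several tables, one option is to write each of them to its own CSV file. The sketch below assumes each table has already been converted to a 2D list, like `data` above. The function name `save_tables_as_csv` and the `table_N.csv` naming scheme are illustrative choices, not part of any library.

```python
import csv
import tempfile
from pathlib import Path


def save_tables_as_csv(tables, out_dir, prefix="table"):
    """Write each table (a list of rows) to its own CSV file and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for number, rows in enumerate(tables, start=1):
        target = out_path / f"{prefix}_{number}.csv"
        with open(target, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
        written.append(target)
    return written


# Demo data standing in for two extracted tables.
demo_tables = [
    [["Name", "Qty"], ["Widget", "3"]],
    [["Region", "Total"], ["North", "120"]],
]
demo_dir = tempfile.mkdtemp()
paths = save_tables_as_csv(demo_tables, demo_dir)
print([p.name for p in paths])  # ['table_1.csv', 'table_2.csv']
```

Writing one file per table keeps tables with different column layouts from being mixed in a single CSV.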
To export to an Excel file, you can use pandas:
import pandas as pd
# Assume data is the 2D list extracted above
df = pd.DataFrame(data[1:], columns=data[0]) # First row as column headers
df.to_excel("output.xlsx", index=False)
print("Data saved as output.xlsx")
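Using `data[0]` as column headers works best when every header cell is non-empty and unique. Extracted tables do not always satisfy that, and blank or repeated names make columns awkward to select later. The helper below is a small sketch for cleaning the header row first. `make_unique_headers` is a hypothetical name, and the `column_N` and `_N` suffix conventions are arbitrary choices made for this example.

```python
def make_unique_headers(headers):
    """Replace empty header cells and add a numeric suffix to repeated names."""
    seen = {}
    result = []
    for position, raw in enumerate(headers, start=1):
        # Empty or whitespace-only headers get a positional placeholder name.
        name = (raw or "").strip() or f"column_{position}"
        count = seen.get(name, 0) + 1
        seen[name] = count
        result.append(name if count == 1 else f"{name}_{count}")
    return result


print(make_unique_headers(["Item", "", "Price", "Price"]))
# ['Item', 'column_2', 'Price', 'Price_2']
```

The cleaned list can then be passed as `columns=make_unique_headers(data[0])` when building the DataFrame. This sketch does not handle the rare case where a generated name collides with a header that already exists.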
5. Common Issues and Tips
- Incomplete table recognition? Check whether the table in the PDF has clear borders. Scanned documents or image-based PDFs require OCR technology; Spire.PDF is mainly suitable for text-based PDFs.
- Handling merged cells: Spire.PDF automatically handles merged cells. GetText() returns the content of the cell in the upper-left corner of the merged area, and returns an empty string for other positions.
- Performance optimization: When processing large PDFs, it is recommended to extract and save page by page to avoid loading all tables into memory at once.
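Because the other positions of a merged area come back as empty strings, vertically merged label columns often show up as blanks in the exported data. One possible post-processing step is to copy the value from the row above into those blanks. This is a sketch under an explicit assumption: every empty cell in the chosen columns belongs to a vertical merge, which is not true for cells that were empty in the original table. `fill_down` is a hypothetical helper working on the 2D list form of a table.

```python
def fill_down(rows, columns, header_rows=1):
    """Copy the value from the row above into empty cells of the chosen columns."""
    filled = [list(row) for row in rows]  # work on a copy
    # Start one row below the first data row so headers are never copied down.
    for r in range(header_rows + 1, len(filled)):
        for c in columns:
            if c < len(filled[r]) and c < len(filled[r - 1]) and filled[r][c] == "":
                filled[r][c] = filled[r - 1][c]
    return filled


rows = [
    ["Region", "City", "Sales"],
    ["North", "Oslo", "10"],
    ["", "Bergen", "7"],
    ["South", "Rome", ""],
]
for row in fill_down(rows, columns=[0]):
    print(row)
```

Only column 0 is filled here, so the empty Sales value in the last row is left alone. Restricting the fill to known label columns reduces the risk of inventing data.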
With the steps above, you have covered the complete process of extracting PDF tables with Python: installing the library, reading tables page by page, and exporting the results to CSV or Excel. The same code can be integrated into automated data processing pipelines to replace manual copying.