Chloe

Posted on Jun 22 • Edited on Jun 29

Extract Tables from PDF in Python: A Practical Guide

#python #automation #datascience #pdf

If you've ever worked with reports, invoices, or financial statements, you've probably run into PDF files that contain tables you need to analyze.

The problem is that PDF documents are designed for viewing, not data processing. Copying table data manually into Excel or another system quickly becomes tedious, especially when you're dealing with multiple files.

In many automation workflows, developers need to extract tables from PDF files and convert them into structured data for analysis, reporting, or database import.

In this article, you'll learn how to extract tables from a PDF in Python using Spire.PDF for Python. We'll cover installation, basic usage, and a practical example that extracts and filters the tables in a sample PDF.

Spire.PDF for Python

Extracting tables from PDFs can be surprisingly challenging. Depending on the document structure, you may need to deal with page layouts, text positioning, or even OCR for scanned files.

To simplify the process, this tutorial uses Spire.PDF for Python. The library includes a built-in table extraction feature that can identify tables on PDF pages and retrieve their cell data with just a few lines of code.

Beyond table extraction, Spire.PDF also supports common PDF automation tasks such as text extraction, document conversion, form processing, and PDF splitting or merging.

Note: Spire.PDF works with text-based PDFs. Scanned PDFs require OCR before table extraction can be performed.

How to Extract Tables from PDF in Python: Step-by-Step

Let's walk through extracting table data from a PDF file, step by step.

Step 1: Create a Python Project

Before we start extracting data from a PDF, let's create a simple Python project.

If you're using Visual Studio 2022:

Open Visual Studio 2022 → Click Create a new project → Select Python Application → Enter a project name and choose a location → Click Create.

Once the project is ready, we can install the required library.

Step 2: Install Spire.PDF for Python

Open a terminal or command prompt and run:

pip install spire.pdf
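
To confirm the install worked before moving on, you can ask Python for the installed version. This is a minimal sketch using the standard library's importlib.metadata. The helper name installed_version is hypothetical, and the sketch assumes the distribution is registered under the same name you passed to pip (name matching may be stricter on older Python versions).

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the distribution name matches the one used in the pip command
version = installed_version("spire.pdf")
if version is None:
    print("spire.pdf does not appear to be installed in this environment")
else:
    print(f"spire.pdf {version} is installed")
```

If the message says the package is missing, check that pip and your project are using the same Python interpreter.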

Now that the library is installed, let's load the PDF document and prepare it for table extraction.

Step 3: Extract Tables from the PDF

We'll use a sample sales report that contains the following table:

The following code loads the PDF document, creates a table extractor, and reads all tables from every page in the file.

from spire.pdf import PdfDocument, PdfTableExtractor

# Load PDF document
pdf = PdfDocument()
pdf.LoadFromFile("sales-report.pdf")

# Create a PdfTableExtractor object
table_extractor = PdfTableExtractor(pdf)

# Extract tables from each page
for i in range(pdf.Pages.Count):
    tables = table_extractor.ExtractTable(i)

    if tables is None:
        continue

    for table_index, table in enumerate(tables):
        print(f"Table {table_index + 1} on page {i + 1}:")
        for row in range(table.GetRowCount()):
            row_data = []
            for col in range(table.GetColumnCount()):
                text = table.GetText(row, col).replace("\n", " ")
                row_data.append(text.strip())
            print("\t".join(row_data))

pdf.Close()

This code starts by importing the required classes from Spire.PDF. It then loads the PDF file into a PdfDocument object and creates a PdfTableExtractor instance.

The extractor scans each page of the document and detects any tables that are present. For every table found, the code loops through all rows and columns, retrieves the cell values using the GetText() method, and prints the extracted data to the console.
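
If you'd rather work with the data than print it, the same loops can collect each table into a list of rows. The sketch below uses a hypothetical helper, table_to_rows, that relies only on the three methods shown above (GetRowCount, GetColumnCount, and GetText). The FakeTable class is a stand-in invented for illustration so the snippet runs without a PDF; in a real script you would pass in a table returned by ExtractTable().

```python
class FakeTable:
    """Hypothetical stand-in that mimics the three table methods used above."""

    def __init__(self, cells):
        self._cells = cells

    def GetRowCount(self):
        return len(self._cells)

    def GetColumnCount(self):
        return len(self._cells[0]) if self._cells else 0

    def GetText(self, row, col):
        return self._cells[row][col]


def table_to_rows(table):
    """Collect every cell of a table into a list of rows (lists of strings)."""
    rows = []
    for row in range(table.GetRowCount()):
        cells = []
        for col in range(table.GetColumnCount()):
            # Same cleanup as the extraction loop: flatten line breaks, trim spaces
            text = table.GetText(row, col).replace("\n", " ").strip()
            cells.append(text)
        rows.append(cells)
    return rows


# Sample values are made up for this demo
demo = FakeTable([["Product", "Units\nSold"], [" Sample A ", "120"]])
print(table_to_rows(demo))
```

Keeping the rows in a plain list makes later steps such as filtering or exporting independent of the PDF library.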

After running the script, you'll see output similar to the following, with each detected table printed row by row:

Step 4: Filter Rows — Annual Target Completion ≥ 50%

In many real-world scenarios, extracting the table is only the first step. Once the data is available, you'll often need to filter, validate, or transform it before exporting it to another system.

As an example, let's keep only the products whose Annual Target Completion rate is at least 50%.

The filter logic:

Skip the header row (row index 0)
For each data row, read the last column (col index 4)
Strip the % character and convert to float
Keep the row only if the value is >= 50

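# Step 4: Filter rows where Annual Target Completion >= 50%
from spire.pdf import PdfDocument, PdfTableExtractor

COMPLETION_COL = 4      # Column index for "Annual Target Completion"
THRESHOLD = 50.0
HEADER_ROW = 0

# Reload the document: the Step 3 script closed its copy with pdf.Close()
pdf = PdfDocument()
pdf.LoadFromFile("sales-report.pdf")
table_extractor = PdfTableExtractor(pdf)

for page_index in range(pdf.Pages.Count):
    tables = table_extractor.ExtractTable(page_index)

    if tables is None:
        continue

    for table_index, table in enumerate(tables):
        column_count = table.GetColumnCount()

        # Skip tables that are too narrow to contain the completion column
        if column_count <= COMPLETION_COL:
            continue

        print(f"\n--- Page {page_index + 1}, Table {table_index + 1} ---")

        # Print header
        header = [table.GetText(HEADER_ROW, col).strip() for col in range(column_count)]
        print(header)

        # Filter and print data rows (everything after the header row)
        for row in range(HEADER_ROW + 1, table.GetRowCount()):
            row_data = [table.GetText(row, col).strip() for col in range(column_count)]
            completion_str = row_data[COMPLETION_COL].replace("%", "").strip()
            try:
                completion = float(completion_str)
            except ValueError:
                # Not a number (unexpected text or an empty cell): skip this row
                continue
            if completion >= THRESHOLD:
                print(row_data)

pdf.Close()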

Console output:

Tablet Mini (48.9%) is excluded because it falls below the threshold. The three remaining products have all crossed 50%. The try/except ValueError block makes the loop skip a row, rather than crash, when the completion cell contains unexpected text or an empty string.

This approach can easily be adapted to other filtering scenarios. For example, you might want to find products with a completion rate above 80%, identify underperforming products below 40%, or apply multiple conditions based on different columns in the table.
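
One way to make the rule swappable is to pass the condition in as a function. The sketch below works on plain lists of strings, so it runs without a PDF. The names parse_percent and filter_rows are hypothetical helpers, and the sample rows are made up for illustration (only the Tablet Mini figure comes from the example above).

```python
def parse_percent(text):
    """Convert a string such as '48.9%' to a float, or return None if it is not numeric."""
    try:
        return float(text.replace("%", "").strip())
    except ValueError:
        return None


def filter_rows(rows, column, keep):
    """Return the rows whose percentage in `column` satisfies the `keep` predicate."""
    kept = []
    for row in rows:
        if column >= len(row):
            continue  # Row is shorter than expected, so there is nothing to test
        value = parse_percent(row[column])
        if value is not None and keep(value):
            kept.append(row)
    return kept


# Made-up sample rows: [product, completion rate]
rows = [
    ["Tablet Mini", "48.9%"],
    ["Sample A", "85.0%"],
    ["Sample B", "35.5 %"],
    ["Sample C", "n/a"],
]

above_80 = filter_rows(rows, 1, lambda v: v > 80)
below_40 = filter_rows(rows, 1, lambda v: v < 40)
mid_range = filter_rows(rows, 1, lambda v: 40 <= v <= 80)

print(above_80)
print(below_40)
print(mid_range)
```

Because the condition is an ordinary function, combining several checks only means writing a different lambda; the parsing and error handling stay in one place.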

You can also export the extracted rows to CSV, Excel, or a database once the filtering step is complete.
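
For CSV, Python's built-in csv module is enough. The following is a minimal sketch rather than part of the Spire.PDF API: rows_to_csv is a hypothetical helper, the sample rows are made up, and the output goes to an in-memory buffer so the snippet runs anywhere.

```python
import csv
import io


def rows_to_csv(rows, stream):
    """Write rows (lists of strings) to any text stream in CSV format."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerows(rows)


# Made-up sample data: a header row followed by filtered rows
rows = [
    ["Product", "Annual Target Completion"],
    ["Sample A", "85.0%"],
    ["Sample, B", "62.5%"],  # The comma forces the csv module to quote this cell
]

buffer = io.StringIO()
rows_to_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# To write a real file instead (the file name is an example):
# with open("filtered.csv", "w", newline="", encoding="utf-8") as f:
#     rows_to_csv(rows, f)
```

Letting the csv module handle quoting avoids broken columns when a cell contains a comma or a quotation mark.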

What Else Can You Do with Spire.PDF?

Table extraction is only one part of a typical PDF automation workflow. You can also extract plain text from any page, merge or split PDF files, add watermarks and annotations, and convert PDFs to Word documents or images — all without leaving Python. See the Spire.PDF for Python documentation for the full API reference.

Conclusion

In this tutorial, we used Spire.PDF for Python to extract tables from a PDF document and filter the results based on a simple business rule.

While the example focused on a sales report, the same approach works for invoices, financial statements, and many other structured PDF documents. Once the data is extracted, you can further analyze it, export it to Excel, store it in a database, or integrate it into automated workflows.

With just a few lines of code, Python can turn static PDF tables into usable data.

DEV Community