Burhanettin Nacar

Posted on Jan 13 • Originally published at bnacar.dev

How to Extract Tables from PDFs Using Python (Without Losing Your Mind)

#data #python #tutorial

If you've ever tried to extract data from a PDF, you know the pain. What looks like a simple table on screen is actually a chaotic mess of positioned text elements in the file.

A friend of mine was struggling with extracting data from PDFs for his project, so I built a PDF extraction API to help him. Along the way, I learned way more about PDF internals than I expected. Here's a practical breakdown.

The Problem: PDFs Don't Have "Tables"

Open any PDF with tabular data. It looks organized, right? Rows, columns, headers.

Now look at what's actually in the file:

draw "Product" at position (50, 100)
draw "Price" at position (200, 100)
draw "Widget" at position (50, 120)
draw "$99" at position (200, 120)

There's no table structure. No rows. No columns. Just text floating at coordinates.

Your job is to reconstruct the logical structure from spatial positions.

Approach 1: PyMuPDF (Basic Text Extraction)

For simple text extraction, PyMuPDF (also called fitz) is fast and reliable:

import fitz

def extract_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

Pros: Fast, handles most PDFs

Cons: Tables come out as jumbled text

Output from a table:

Product Price Quantity
Widget $99 10
Gadget $149 5

Not useful if you need structured data.

Approach 2: pdfplumber (Table Detection)

pdfplumber is specifically designed for table extraction:

import pdfplumber

def extract_tables(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            tables.extend(page_tables)
    return tables

Pros: Detects table boundaries automatically

Cons: Struggles with complex layouts, merged cells

Output:

[
    [['Product', 'Price', 'Quantity'],
     ['Widget', '$99', '10'],
     ['Gadget', '$149', '5']]
]

Much better! But still needs post-processing.

Approach 3: Combining Both

The best results come from combining approaches:

import fitz
import pdfplumber

def smart_extract(pdf_path):
    # First, check if PDF has selectable text
    doc = fitz.open(pdf_path)
    first_page_text = doc[0].get_text().strip()

    if len(first_page_text) < 50:
        # Likely a scanned PDF - needs OCR
        return {"error": "Scanned PDF detected, OCR required"}

    # Extract tables with pdfplumber
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            for table in page.extract_tables():
                if table and len(table) > 1:
                    headers = table[0]
                    rows = table[1:]
                    tables.append({
                        "page": i + 1,
                        "headers": headers,
                        "rows": rows
                    })

    # Extract remaining text with PyMuPDF
    full_text = ""
    for page in doc:
        full_text += page.get_text()

    return {
        "tables": tables,
        "text": full_text,
        "page_count": len(doc)
    }

The Hard Parts Nobody Tells You About

1. Table boundaries are ambiguous

Is this one table or two?

Name     | Email
---------|------------------
John     | john@example.com

Department | Budget
-----------|--------
Sales      | $50,000

Humans see two tables. Algorithms often merge them.

2. Headers aren't always on top

Some invoices put totals at the bottom. Some have headers on the left side. Some have no headers at all.

3. Multi-page tables

When a table spans pages, you need to:

Detect it's a continuation (no headers on page 2)
Merge rows correctly
Handle page breaks mid-row

4. Currency and number parsing

"$1,234.56" vs "1.234,56 EUR" vs "JPY 1234"

Different locales, different formats. Don't assume.

A Better Way: Use an API

After building this for my friend, I packaged it into an API so others don't have to:

curl -X POST "https://pdfpull-895295000838.europe-west1.run.app/api/v1/extract/tables" \
  -H "X-API-Key: sk_demo_123456789" \
  -F "file=@invoice.pdf"

Response:

{
  "tables": [
    {
      "page_number": 1,
      "headers": ["Product", "Price", "Qty"],
      "rows": [
        ["Widget", "$99", "10"],
        ["Gadget", "$149", "5"]
      ]
    }
  ],
  "table_count": 1
}

It also has smart parsers for invoices and resumes that extract specific fields:

curl -X POST "https://pdfpull-895295000838.europe-west1.run.app/api/v1/parse/invoice" \
  -H "X-API-Key: sk_demo_123456789" \
  -F "file=@invoice.pdf"

{
  "vendor_name": "ACME Corporation",
  "invoice_number": "INV-2024-0042",
  "invoice_date": "January 15, 2024",
  "total_amount": 1250.00,
  "currency": "USD",
  "line_items": [
    {"description": "Widget", "quantity": 10, "amount": 990.00},
    {"description": "Gadget", "quantity": 5, "amount": 260.00}
  ],
  "confidence": 0.91
}

Conclusion

PDF extraction is harder than it looks. If you're building something that only occasionally needs PDF parsing, use a library. If you're doing it at scale, consider an API that handles the edge cases for you.

If you need consistent results across lots of PDFs, an API can save you time. For one-off jobs, a library is usually enough.

Building this in public. Follow along on Twitter: @uppnrise

DEV Community