Approach
If you have ever extracted tables from PDFs in production, you know the pain:
- It works on one statement
- Breaks on the next vendor
- Fails when spacing, borders, or merged cells change
For many teams, the workflow becomes: "try Tabula/Camelot, patch for edge cases, repeat forever."
In this post, I'll walk through a simpler, schema-driven approach to extracting table data from PDFs, including bank statements and multi-page tables.
The table extraction problem
PDFs are designed for visual rendering, not structured data exchange.
That means table boundaries are often implied by layout, not explicit data structures.
Common issues:
- Inconsistent row spacing
- Missing or broken cell borders
- Wrapped text in description columns
- Header rows repeated across pages
- Totals and footers mixed into table body
Traditional line- and coordinate-based extraction becomes fragile quickly.
Why Tabula and Camelot break on complex layouts
Tabula and Camelot are useful tools, especially for clean, machine-generated PDFs with predictable geometry.
But they often struggle when:
- Tables are borderless
- Columns drift slightly page-to-page
- Text wraps across lines
- The PDF is scanned or low quality
- Multiple table styles appear in one file
You then end up writing post-processing logic:
- Manual column repair
- Row stitching
- Heuristic cleanup for bad splits
At scale, the maintenance cost grows fast.
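To make that concrete, here is a minimal sketch of the kind of row-stitching heuristic teams end up maintaining: continuation rows (no date) produced by wrapped description text get merged back into the previous row. The `(date, description, amount)` tuple format is a hypothetical stand-in for whatever your extractor emits:

```python
def stitch_rows(raw_rows):
    """Merge continuation rows (empty date) into the previous row's description."""
    stitched = []
    for date, description, amount in raw_rows:
        if not date and stitched:
            # Wrapped text: append to the previous row's description.
            prev_date, prev_desc, prev_amount = stitched[-1]
            stitched[-1] = (prev_date, prev_desc + " " + description, prev_amount)
        else:
            stitched.append((date, description, amount))
    return stitched

raw = [
    ("2024-01-03", "ACH PAYMENT TO", "45.10"),
    ("", "ACME UTILITIES JAN", None),  # wrapped second line of the description
    ("2024-01-05", "PAYROLL DEPOSIT", "2100.00"),
]
print(stitch_rows(raw))
```

Every new vendor layout tends to need its own variant of this logic, which is exactly the maintenance burden schema-driven extraction tries to avoid.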
Schema-driven table extraction
A schema-driven approach flips the problem:
Instead of trying to reconstruct a perfect grid from geometry, you declare the output structure you want, and parse the document into that structure.
For example, for a bank statement:
- account metadata
- statement period
- transactions array with typed fields
This is much more robust for real-world variations across issuers and templates.
Tutorial: define a table schema and parse a bank statement
1) Install dependencies
```shell
pip install oxpdf
```
2) Define a schema for statement transactions
```python
BANK_STATEMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "bank_name": {"type": "string"},
        "account_last4": {"type": "string"},
        "statement_start_date": {"type": "string"},
        "statement_end_date": {"type": "string"},
        "transactions": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "description": {"type": "string"},
                    "debit": {"type": "number"},
                    "credit": {"type": "number"},
                    "balance": {"type": "number"},
                },
                "required": ["date", "description"],
            },
        },
    },
    "required": ["transactions"],
}
```
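For reference, a parse result conforming to this schema would be shaped like the following (all values here are hypothetical):

```python
# A hypothetical result shaped by the schema above.
sample = {
    "bank_name": "Example Bank",
    "account_last4": "4821",
    "statement_start_date": "2024-01-01",
    "statement_end_date": "2024-01-31",
    "transactions": [
        {"date": "2024-01-03", "description": "ACH PAYMENT", "debit": 45.10, "balance": 1954.90},
        {"date": "2024-01-05", "description": "PAYROLL DEPOSIT", "credit": 2100.00, "balance": 4054.90},
    ],
}

# Every transaction carries the schema's required fields.
assert all("date" in t and "description" in t for t in sample["transactions"])
```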
3) Parse a sample statement
```python
import os

from oxpdf import Oxpdf

client = Oxpdf(api_key=os.environ["OXPDF_API_KEY"])

with open("sample-bank-statement.pdf", "rb") as f:
    result = client.pdf.parse(
        file=f,
        schema=BANK_STATEMENT_SCHEMA,
        use_ocr=True,  # keep True if scans/photos are common
    )
```
4) Access table rows (transactions)
```python
data = result["data"]
rows = data.get("transactions", [])

print("Transactions:", len(rows))
for row in rows[:5]:
    print(row.get("date"), row.get("description"),
          row.get("debit"), row.get("credit"), row.get("balance"))
```
This gives you row-wise JSON ready for analytics, reconciliation, and storage.
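Because each row is already a flat dict, exporting to CSV for downstream tools takes only the standard library. A sketch, with field names assuming the schema above:

```python
import csv
import io

FIELDS = ["date", "description", "debit", "credit", "balance"]

def rows_to_csv(rows):
    """Serialize transaction dicts to CSV text; missing fields become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

sample_rows = [{"date": "2024-01-03", "description": "ACH PAYMENT", "debit": 45.10}]
print(rows_to_csv(sample_rows))
```

`restval=""` keeps the output rectangular even when a row lacks a debit, credit, or balance.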
Handling multi-page tables
Multi-page statements are where many extraction pipelines fail.
With schema-driven extraction, the expected output is already a single `transactions[]` array, so rows can be normalized across pages.
Recommended safeguards:
- Ensure date + description are required fields per row
- Filter known footer/header artifacts post-parse
- Add validation checks (e.g., row count, running balance sanity)
- Keep raw parse payload for audit/debugging
A simple validation helper:
```python
def valid_transaction(row: dict) -> bool:
    if not row.get("date") or not row.get("description"):
        return False
    if all(row.get(k) is None for k in ("debit", "credit", "balance")):
        return False
    return True

clean_rows = [r for r in rows if valid_transaction(r)]
```
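The running-balance sanity check from the safeguards above can also be expressed directly: each row's balance should equal the previous balance plus credits minus debits, within a small tolerance for float rounding. A sketch over transaction dicts shaped like the schema's rows (the sample data is hypothetical):

```python
def balances_consistent(rows, tolerance=0.01):
    """Verify each balance follows from the previous balance and the row's movement."""
    for prev, curr in zip(rows, rows[1:]):
        if prev.get("balance") is None or curr.get("balance") is None:
            continue  # cannot check rows without a reported balance
        expected = prev["balance"] + (curr.get("credit") or 0) - (curr.get("debit") or 0)
        if abs(expected - curr["balance"]) > tolerance:
            return False
    return True

txns = [
    {"date": "2024-01-03", "description": "OPENING", "balance": 2000.00},
    {"date": "2024-01-04", "description": "ACH PAYMENT", "debit": 45.10, "balance": 1954.90},
    {"date": "2024-01-05", "description": "PAYROLL", "credit": 2100.00, "balance": 4054.90},
]
print(balances_consistent(txns))  # → True
```

A `False` here usually means a dropped row, a stray footer line parsed as a transaction, or a page boundary issue worth inspecting in the raw payload.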
Comparison with traditional tools
Traditional coordinate/grid extraction
Pros:
- Fast on clean, consistent layouts
- Good for fixed internal templates
Cons:
- Fragile on real-world variation
- Heavy post-processing burden
- Harder to maintain across multiple vendors
Schema-driven extraction
Pros:
- Output shaped for your app from the start
- More resilient to layout changes
- Easier to maintain as document variety grows
Cons:
- Requires clear schema design upfront
- May still need light normalization for edge cases
Final thoughts
If your table extraction keeps breaking on new PDF templates, the issue is often not your regex; it is the extraction strategy.
For production pipelines, schema-driven parsing is usually the better long-term bet:
- more stable
- easier to reason about
- lower maintenance overhead
If you want to test this with your own statements/invoices, start with a narrow schema and expand field-by-field as needed.