risha-max

Extract Tables from PDFs Without Tabula -- A Simpler Approach


If you have ever extracted tables from PDFs in production, you know the pain:

  • It works on one statement
  • Breaks on the next vendor
  • Fails when spacing, borders, or merged cells change

For many teams, the workflow becomes: "try Tabula/Camelot, patch for edge cases, repeat forever."

In this post, I will show a simpler schema-driven approach to extract table data from PDFs, including bank statements and multi-page tables.


The table extraction problem

PDFs are designed for visual rendering, not structured data exchange.

That means table boundaries are often implied by layout, not explicit data structures.

Common issues:

  • Inconsistent row spacing
  • Missing or broken cell borders
  • Wrapped text in description columns
  • Header rows repeated across pages
  • Totals and footers mixed into table body

Traditional line- and coordinate-based extraction becomes fragile quickly.


Why Tabula and Camelot break on complex layouts

Tabula and Camelot are useful tools, especially for clean, machine-generated PDFs with predictable geometry.

But they often struggle when:

  • Tables are borderless
  • Columns drift slightly page-to-page
  • Text wraps across lines
  • The PDF is scanned or low quality
  • Multiple table styles appear in one file

You then end up writing post-processing logic:

  • manual column repair
  • row stitching
  • heuristic cleanup for bad splits

At scale, maintenance cost grows fast.
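The "row stitching" patch above is a typical example. A common heuristic is to treat raw lines whose first column is empty as wrapped description text and merge them into the previous row. A hypothetical sketch (`stitch_rows` and the sample data are illustrative, not from any specific library):

```python
def stitch_rows(raw_rows):
    # Heuristic row stitching: a line with an empty first column is assumed
    # to be a wrapped continuation of the previous row's description.
    stitched = []
    for cells in raw_rows:
        if stitched and not cells[0].strip():
            # Continuation line: append its non-empty text to the
            # previous row's description column.
            stitched[-1][1] += " " + " ".join(c for c in cells[1:] if c.strip())
        else:
            stitched.append(list(cells))
    return stitched

raw = [
    ["01/02", "ACH PAYMENT", "", "120.00"],
    ["", "ACME CORP PAYROLL", "", ""],   # wrapped description line
    ["01/03", "CARD PURCHASE", "45.10", ""],
]
print(stitch_rows(raw))
```

Every new vendor layout tends to need another heuristic like this, which is exactly where the maintenance cost comes from.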


Schema-driven table extraction

A schema-driven approach flips the problem:

Instead of trying to reconstruct a perfect grid from geometry, you declare the output structure you want, and parse the document into that structure.

For example, for a bank statement:

  • account metadata
  • statement period
  • transactions array with typed fields

This is much more robust for real-world variations across issuers and templates.


Tutorial: define a table schema and parse a bank statement

1) Install dependencies

pip install oxpdf

2) Define a schema for statement transactions

BANK_STATEMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "bank_name": {"type": "string"},
        "account_last4": {"type": "string"},
        "statement_start_date": {"type": "string"},
        "statement_end_date": {"type": "string"},
        "transactions": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "description": {"type": "string"},
                    "debit": {"type": "number"},
                    "credit": {"type": "number"},
                    "balance": {"type": "number"}
                },
                "required": ["date", "description"]
            }
        }
    },
    "required": ["transactions"]
}
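For a statement parsed against this schema, the returned data would be shaped roughly like this (all values here are made up for illustration):

```python
# Illustrative output shape for the schema above; values are invented.
sample_data = {
    "bank_name": "Example Bank",
    "account_last4": "1234",
    "statement_start_date": "2024-01-01",
    "statement_end_date": "2024-01-31",
    "transactions": [
        {"date": "2024-01-02", "description": "ACH PAYMENT",
         "debit": 120.0, "balance": 880.0},
        {"date": "2024-01-05", "description": "PAYROLL",
         "credit": 2500.0, "balance": 3380.0},
    ],
}
```

Note that optional fields (like `credit` on a debit row) can simply be absent, which is why the per-row `required` list only pins down `date` and `description`.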

3) Parse a sample statement

import os
from oxpdf import Oxpdf

client = Oxpdf(api_key=os.environ["OXPDF_API_KEY"])

with open("sample-bank-statement.pdf", "rb") as f:
    result = client.pdf.parse(
        file=f,
        schema=BANK_STATEMENT_SCHEMA,
        use_ocr=True  # keep true if scans/photos are common
    )

4) Access table rows (transactions)

data = result["data"]
rows = data.get("transactions", [])

print("Transactions:", len(rows))
for row in rows[:5]:
    print(row.get("date"), row.get("description"), row.get("debit"), row.get("credit"), row.get("balance"))

This gives you row-wise JSON ready for analytics, reconciliation, and storage.
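For example, persisting the rows for downstream analytics takes only the standard library. A minimal sketch using `csv.DictWriter`, with field names matching the schema above (missing keys become empty cells):

```python
import csv

FIELDS = ["date", "description", "debit", "credit", "balance"]

def rows_to_csv(rows, path):
    # Write transaction dicts to CSV; keys absent from a row are left
    # as empty cells, extra keys are ignored.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)

rows_to_csv(
    [{"date": "01/02", "description": "ACH PAYMENT", "debit": 120.0}],
    "transactions.csv",
)
```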


Handling multi-page tables

Multi-page statements are where many extraction pipelines fail.

With schema-driven extraction, the expected output is already a single transactions[] array, so rows can be normalized across pages.

Recommended safeguards:

  • Ensure date + description are required fields per row
  • Filter known footer/header artifacts post-parse
  • Add validation checks (e.g., row count, running balance sanity)
  • Keep raw parse payload for audit/debugging

A simple validation helper:

def valid_transaction(row: dict) -> bool:
    if not row.get("date") or not row.get("description"):
        return False
    if all(row.get(k) is None for k in ("debit", "credit", "balance")):
        return False
    return True

clean_rows = [r for r in rows if valid_transaction(r)]
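The running-balance sanity check mentioned in the safeguards can be sketched like this, assuming each row's balance reflects the account after that row, with debits subtracted and credits added:

```python
def balances_consistent(rows, tolerance=0.01):
    # Verify each row's balance equals the previous balance minus the
    # debit plus the credit, within a small rounding tolerance.
    prev = None
    for row in rows:
        bal = row.get("balance")
        if bal is None:
            prev = None  # can't chain the check across a missing balance
            continue
        if prev is not None:
            expected = prev - (row.get("debit") or 0) + (row.get("credit") or 0)
            if abs(expected - bal) > tolerance:
                return False
        prev = bal
    return True
```

If this returns False, it usually means a row was dropped, duplicated, or mis-split, so it is a cheap way to catch extraction errors before the data reaches reconciliation.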

Comparison with traditional tools

Traditional coordinate/grid extraction

Pros:

  • Fast on clean, consistent layouts
  • Good for fixed internal templates

Cons:

  • Fragile on real-world variation
  • Heavy post-processing burden
  • Harder to maintain across multiple vendors

Schema-driven extraction

Pros:

  • Output shaped for your app from the start
  • More resilient to layout changes
  • Easier to maintain as document variety grows

Cons:

  • Requires clear schema design upfront
  • May still need light normalization for edge cases

Final thoughts

If your table extraction keeps breaking on new PDF templates, the issue is often not your regex -- it is the extraction strategy.

For production pipelines, schema-driven parsing is usually the better long-term bet:

  • more stable
  • easier to reason about
  • lower maintenance overhead

If you want to test this with your own statements/invoices, start with a narrow schema and expand field-by-field as needed.
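A narrow starter schema might ask only for dates and descriptions, adding amount fields once those parse reliably across your templates (illustrative only):

```python
# A deliberately narrow starter schema: only the fields you trust first.
NARROW_SCHEMA = {
    "type": "object",
    "properties": {
        "transactions": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "description": {"type": "string"},
                },
                "required": ["date", "description"],
            },
        }
    },
    "required": ["transactions"],
}
```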

