Approach
If you have ever extracted tables from PDFs in production, you know the pain:
- It works on one statement
- Breaks on the next vendor
- Fails when spacing, borders, or merged cells change
For many teams, the workflow becomes: "try Tabula/Camelot, patch for edge cases, repeat forever."
In this post, I'll walk through a simpler, schema-driven approach to extracting table data from PDFs, including bank statements and multi-page tables.
The table extraction problem
PDFs are designed for visual rendering, not structured data exchange.
That means table boundaries are often implied by layout, not explicit data structures.
Common issues:
- Inconsistent row spacing
- Missing or broken cell borders
- Wrapped text in description columns
- Header rows repeated across pages
- Totals and footers mixed into table body
Traditional line- and coordinate-based extraction becomes fragile quickly.
Why Tabula and Camelot break on complex layouts
Tabula and Camelot are useful tools, especially for clean, machine-generated PDFs with predictable geometry.
But they often struggle when:
- Tables are borderless
- Columns drift slightly page-to-page
- Text wraps across lines
- The PDF is scanned or low quality
- Multiple table styles appear in one file
You then end up writing post-processing logic:
- Manual column repair
- Row stitching
- Heuristic cleanup for bad splits
At scale, the maintenance cost grows fast.
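To make that concrete, here is a minimal sketch of the kind of row-stitching heuristic teams end up maintaining: continuation rows (no date) produced by wrapped description text get merged back into the previous row. The `(date, description, amount)` tuple format is a hypothetical stand-in for whatever your extractor emits:

```python
def stitch_rows(raw_rows):
    """Merge continuation rows (empty date) into the previous row's description."""
    stitched = []
    for date, description, amount in raw_rows:
        if not date and stitched:
            # Wrapped text: append to the previous row's description.
            prev_date, prev_desc, prev_amount = stitched[-1]
            stitched[-1] = (prev_date, prev_desc + " " + description, prev_amount)
        else:
            stitched.append((date, description, amount))
    return stitched

raw = [
    ("2024-01-03", "ACH PAYMENT TO", "45.10"),
    ("", "ACME UTILITIES JAN", None),  # wrapped second line of the description
    ("2024-01-05", "PAYROLL DEPOSIT", "2100.00"),
]
print(stitch_rows(raw))
```

Every new vendor layout tends to need its own variant of this logic, which is exactly the maintenance burden schema-driven extraction tries to avoid.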
Schema-driven table extraction
A schema-driven approach flips the problem:
Instead of trying to reconstruct a perfect grid from geometry, you declare the output structure you want, and parse the document into that structure.
For example, for a bank statement:
- account metadata
- statement period
- transactions array with typed fields
This is much more robust for real-world variations across issuers and templates.
Tutorial: define a table schema and parse a bank statement
1) Install dependencies
```shell
pip install oxpdf
```
2) Define a schema for statement transactions
```python
BANK_STATEMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "bank_name": {"type": "string"},
        "account_last4": {"type": "string"},
        "statement_start_date": {"type": "string"},
        "statement_end_date": {"type": "string"},
        "transactions": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "description": {"type": "string"},
                    "debit": {"type": "number"},
                    "credit": {"type": "number"},
                    "balance": {"type": "number"},
                },
                "required": ["date", "description"],
            },
        },
    },
    "required": ["transactions"],
}
```
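For reference, a parse result conforming to this schema would be shaped like the following (all values here are hypothetical):

```python
# A hypothetical result shaped by the schema above.
sample = {
    "bank_name": "Example Bank",
    "account_last4": "4821",
    "statement_start_date": "2024-01-01",
    "statement_end_date": "2024-01-31",
    "transactions": [
        {"date": "2024-01-03", "description": "ACH PAYMENT", "debit": 45.10, "balance": 1954.90},
        {"date": "2024-01-05", "description": "PAYROLL DEPOSIT", "credit": 2100.00, "balance": 4054.90},
    ],
}

# Every transaction carries the schema's required fields.
assert all("date" in t and "description" in t for t in sample["transactions"])
```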
3) Parse a sample statement
```python
import os

from oxpdf import Oxpdf

client = Oxpdf(api_key=os.environ["OXPDF_API_KEY"])

with open("sample-bank-statement.pdf", "rb") as f:
    result = client.pdf.parse(
        file=f,
        schema=BANK_STATEMENT_SCHEMA,
        use_ocr=True,  # keep True if scans/photos are common
    )
```
4) Access table rows (transactions)
```python
data = result["data"]
rows = data.get("transactions", [])

print("Transactions:", len(rows))
for row in rows[:5]:
    print(row.get("date"), row.get("description"),
          row.get("debit"), row.get("credit"), row.get("balance"))
```
This gives you row-wise JSON ready for analytics, reconciliation, and storage.
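Because each row is already a flat dict, exporting to CSV for downstream tools takes only the standard library. A sketch, with field names assuming the schema above:

```python
import csv
import io

FIELDS = ["date", "description", "debit", "credit", "balance"]

def rows_to_csv(rows):
    """Serialize transaction dicts to CSV text; missing fields become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

sample_rows = [{"date": "2024-01-03", "description": "ACH PAYMENT", "debit": 45.10}]
print(rows_to_csv(sample_rows))
```

`restval=""` keeps the output rectangular even when a row lacks a debit, credit, or balance.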
Handling multi-page tables
Multi-page statements are where many extraction pipelines fail.
With schema-driven extraction, the expected output is already a single `transactions[]` array, so rows can be normalized across pages.
Recommended safeguards:
- Ensure date + description are required fields per row
- Filter known footer/header artifacts post-parse
- Add validation checks (e.g., row count, running balance sanity)
- Keep raw parse payload for audit/debugging
A simple validation helper:
```python
def valid_transaction(row: dict) -> bool:
    if not row.get("date") or not row.get("description"):
        return False
    if all(row.get(k) is None for k in ("debit", "credit", "balance")):
        return False
    return True

clean_rows = [r for r in rows if valid_transaction(r)]
```
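The running-balance sanity check from the safeguards above can also be expressed directly: each row's balance should equal the previous balance plus credits minus debits, within a small tolerance for float rounding. A sketch over transaction dicts shaped like the schema's rows (the sample data is hypothetical):

```python
def balances_consistent(rows, tolerance=0.01):
    """Verify each balance follows from the previous balance and the row's movement."""
    for prev, curr in zip(rows, rows[1:]):
        if prev.get("balance") is None or curr.get("balance") is None:
            continue  # cannot check rows without a reported balance
        expected = prev["balance"] + (curr.get("credit") or 0) - (curr.get("debit") or 0)
        if abs(expected - curr["balance"]) > tolerance:
            return False
    return True

txns = [
    {"date": "2024-01-03", "description": "OPENING", "balance": 2000.00},
    {"date": "2024-01-04", "description": "ACH PAYMENT", "debit": 45.10, "balance": 1954.90},
    {"date": "2024-01-05", "description": "PAYROLL", "credit": 2100.00, "balance": 4054.90},
]
print(balances_consistent(txns))  # → True
```

A `False` here usually means a dropped row, a stray footer line parsed as a transaction, or a page boundary issue worth inspecting in the raw payload.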
Comparison with traditional tools
Traditional coordinate/grid extraction
Pros:
- Fast on clean, consistent layouts
- Good for fixed internal templates
Cons:
- Fragile on real-world variation
- Heavy post-processing burden
- Harder to maintain across multiple vendors
Schema-driven extraction
Pros:
- Output shaped for your app from the start
- More resilient to layout changes
- Easier to maintain as document variety grows
Cons:
- Requires clear schema design upfront
- May still need light normalization for edge cases
Final thoughts
If your table extraction keeps breaking on new PDF templates, the issue is often not your regex; it is the extraction strategy.
For production pipelines, schema-driven parsing is usually the better long-term bet:
- more stable
- easier to reason about
- lower maintenance overhead
If you want to test this with your own statements/invoices, start with a narrow schema and expand field-by-field as needed.