Parsing invoices sounds easy until you do it at scale.
One supplier sends a clean text PDF. Another sends a scanned image. A third changes layout every month. Suddenly your extraction logic breaks and you are patching regex rules at midnight.
In this post, I will show a schema-first approach with 0xPdf that turns invoice PDFs into structured JSON in Python -- without brittle regex parsing.
The problem: why invoice PDF parsing is hard
Invoices are not standardized documents. Even when they contain the same fields, they vary by:
- Layout and visual hierarchy
- Label naming ("Invoice #", "Invoice No.", "Ref")
- OCR quality (for scanned PDFs)
- Currency/date formatting
- Multi-line items and totals sections
Most failures happen when extraction code assumes one fixed layout.
Traditional approaches (and where they break)
pdfplumber (or similar text extractors)
Good for extracting text blocks, but you still need to map unstructured text into business fields yourself.
That usually means:
- Regex for each field
- Manual heuristics per vendor
- Ongoing maintenance when templates change
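To see how quickly this breaks, here is a minimal sketch of a single-field regex run against the three label variants mentioned earlier. The pattern and the helper name `extract_invoice_number` are illustrative assumptions, not code from any particular project.

```python
import re

# Hypothetical pattern written against one vendor's "Invoice #" label.
INVOICE_NUMBER_RE = re.compile(r"Invoice\s*#\s*:?\s*(\S+)")


def extract_invoice_number(text):
    """Return the invoice number, or None when the label does not match."""
    match = INVOICE_NUMBER_RE.search(text)
    return match.group(1) if match else None


samples = [
    "Invoice #: INV-2039",   # the layout the pattern was written for
    "Invoice No. INV-2039",  # a second vendor's label
    "Ref INV-2039",          # a third vendor's label
]

for text in samples:
    print(repr(text), "->", extract_invoice_number(text))
```

Only the first sample matches. Each new label needs another pattern or another alternation, and that is per field, per vendor.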
AWS Textract / raw OCR APIs
Useful for OCR and document signals, but output is still mostly low-level structure. You still need significant post-processing to get final business JSON.
In other words: you pay for extraction and still write parsing logic.
The schema-first approach with 0xPdf
With 0xPdf, you define the shape of the data you want up front (schema), then parse the document directly into that structure.
Instead of:
PDF -> text -> regex -> custom parser -> JSON
you do:
PDF + schema -> structured JSON
This is much easier to maintain across different invoice formats.
Step-by-step Python tutorial
1) Install the SDK
pip install oxpdf
2) Define an invoice schema
Create a schema for the fields your app actually needs:
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string"},
        "due_date": {"type": "string"},
        "vendor_name": {"type": "string"},
        "vendor_address": {"type": "string"},
        "currency": {"type": "string"},
        "subtotal": {"type": "number"},
        "tax": {"type": "number"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                    "amount": {"type": "number"}
                }
            }
        }
    },
    "required": ["invoice_number", "invoice_date", "vendor_name", "total"]
}
3) Parse a sample invoice
import os

from oxpdf import Oxpdf

# Read the API key from the environment instead of hard-coding it
client = Oxpdf(api_key=os.environ["OXPDF_API_KEY"])

with open("sample-invoice.pdf", "rb") as f:
    result = client.pdf.parse(
        file=f,
        schema=INVOICE_SCHEMA,
        use_ocr=True,  # enable when scanned/image-based invoices are common
    )
4) Handle the JSON response
data = result["data"]

print("Invoice #:", data.get("invoice_number"))
print("Vendor:", data.get("vendor_name"))
print("Total:", data.get("total"))

for item in data.get("line_items", []):
    print("-", item.get("description"), item.get("amount"))
This gives you clean, application-ready JSON you can store directly in your DB or pass to downstream systems.
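Before storing the result, it can be worth confirming that the required fields actually came back. This post does not say whether the API enforces `required` on its own, so treat the following as a defensive sketch: `missing_required` and the sample `data` payload are hypothetical, and the schema is trimmed to its `required` list so the snippet runs on its own.

```python
def missing_required(data, schema):
    """Return the required field names that are absent or None in data."""
    return [
        name
        for name in schema.get("required", [])
        if data.get(name) is None
    ]


# Trimmed copy of the schema's "required" list from step 2.
schema = {"required": ["invoice_number", "invoice_date", "vendor_name", "total"]}

# Hypothetical parsed payload with one required field missing.
data = {
    "invoice_number": "INV-2039",
    "vendor_name": "Acme Supply Co.",
    "total": 1284.44,
}

missing = missing_required(data, schema)
if missing:
    print("Needs review, missing:", missing)
else:
    print("OK to store")
```

Routing incomplete documents to a review queue instead of writing them to the database keeps bad extractions from spreading downstream.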
Results comparison: Textract-style raw output vs 0xPdf structured output
Typical raw OCR output (simplified)
{
    "blocks": [
        {"type": "LINE", "text": "Invoice No: INV-2039"},
        {"type": "LINE", "text": "Total Due USD 1,284.44"},
        {"type": "LINE", "text": "Widget A 2 120.00 240.00"}
    ]
}
You still need custom parsing code for every field.
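As a rough illustration of that extra work, the sketch below pulls the total and the line item out of the simplified blocks above. The patterns and the `parse_blocks` helper are assumptions tailored to these three lines only; a different layout would need different rules.

```python
import re
from decimal import Decimal

blocks = [
    {"type": "LINE", "text": "Invoice No: INV-2039"},
    {"type": "LINE", "text": "Total Due USD 1,284.44"},
    {"type": "LINE", "text": "Widget A 2 120.00 240.00"},
]

# Currency code followed by an amount with optional thousands separators.
TOTAL_RE = re.compile(r"Total Due\s+([A-Z]{3})\s+([\d,]+\.\d{2})")
# Description, integer quantity, unit price, amount, separated by whitespace.
LINE_ITEM_RE = re.compile(
    r"^(.+?)\s+(\d+)\s+([\d,]+\.\d{2})\s+([\d,]+\.\d{2})$"
)


def to_decimal(text):
    """Convert '1,284.44' style strings to Decimal."""
    return Decimal(text.replace(",", ""))


def parse_blocks(blocks):
    """Map raw text lines to a few business fields (layout-specific)."""
    result = {"line_items": []}
    for block in blocks:
        text = block["text"]
        total = TOTAL_RE.search(text)
        if total:
            result["currency"] = total.group(1)
            result["total"] = to_decimal(total.group(2))
            continue
        item = LINE_ITEM_RE.match(text)
        if item:
            result["line_items"].append({
                "description": item.group(1),
                "quantity": int(item.group(2)),
                "unit_price": to_decimal(item.group(3)),
                "amount": to_decimal(item.group(4)),
            })
    return result


parsed = parse_blocks(blocks)
print(parsed)
```

Even this small version hard-codes the "Total Due" label, the column order, and the number format.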
0xPdf schema-first output
{
    "invoice_number": "INV-2039",
    "invoice_date": "2026-02-10",
    "due_date": "2026-03-12",
    "vendor_name": "Acme Supply Co.",
    "currency": "USD",
    "subtotal": 1150.0,
    "tax": 134.44,
    "total": 1284.44,
    "line_items": [
        {
            "description": "Widget A",
            "quantity": 2,
            "unit_price": 120.0,
            "amount": 240.0
        }
    ]
}
This is immediately usable by your billing, reconciliation, or ERP workflows.
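Structured output also makes simple sanity checks cheap. The sketch below is one possible check, not part of 0xPdf: it verifies that subtotal plus tax equals the total and that each line item's quantity times unit price equals its amount, using the sample output above. `to_money` and `check_totals` are hypothetical helper names, and `Decimal` is used to avoid float rounding surprises.

```python
from decimal import Decimal


def to_money(value):
    """Convert a JSON number to a Decimal rounded to two places."""
    # Going through str() turns the float 134.44 into Decimal("134.44").
    return Decimal(str(value)).quantize(Decimal("0.01"))


def check_totals(invoice):
    """Return a list of human-readable arithmetic problems (empty if none)."""
    problems = []
    subtotal = to_money(invoice["subtotal"])
    tax = to_money(invoice["tax"])
    if subtotal + tax != to_money(invoice["total"]):
        problems.append("subtotal + tax != total")
    for index, item in enumerate(invoice.get("line_items", [])):
        expected = to_money(
            Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"]))
        )
        if expected != to_money(item["amount"]):
            problems.append(f"line item {index}: quantity * unit_price != amount")
    return problems


invoice = {
    "subtotal": 1150.0,
    "tax": 134.44,
    "total": 1284.44,
    "line_items": [
        {"description": "Widget A", "quantity": 2, "unit_price": 120.0, "amount": 240.0}
    ],
}

print(check_totals(invoice))
```

An invoice that fails these checks is a good candidate for manual review instead of automatic posting.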
When to use OCR vs text extraction
Use text extraction when:
- PDFs are digitally generated (machine text selectable)
- Quality is consistent
- You want faster and cheaper processing
Use OCR when:
- PDFs are scanned images
- Mixed quality and skewed scans are common
- You need robust parsing across messy inputs
Practical rule: default to text extraction, auto-fallback to OCR for low-quality or image-only files.
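One way to apply that rule is to look at how much text a plain extractor recovers per page and switch to OCR when it is close to nothing. This is a minimal sketch under stated assumptions: `needs_ocr` is a hypothetical helper, the 20-character threshold is arbitrary, and `page_texts` stands for whatever per-page strings your text extractor returns.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Return True when any page has too little extractable text.

    page_texts: list of per-page strings (or None) from a text extractor.
    min_chars_per_page: assumed threshold; tune it on your own documents.
    """
    if not page_texts:
        return True  # nothing extracted at all, likely image-only
    for text in page_texts:
        # Count only non-whitespace characters.
        visible = "".join((text or "").split())
        if len(visible) < min_chars_per_page:
            return True
    return False


digital = ["Invoice No: INV-2039\nTotal Due USD 1,284.44"]
scanned = ["", None]

print(needs_ocr(digital))
print(needs_ocr(scanned))
```

The result could then be passed as `use_ocr=needs_ocr(page_texts)` in the parse call from step 3, so OCR only runs on files that appear to need it.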
Conclusion
Regex-based invoice parsing does not scale well across vendors and layout changes.
A schema-first workflow gives you cleaner, more stable output and reduces the custom parsing code you have to maintain.
If you want to try it, start with the 0xPdf free tier and parse a few real invoices from your workflow.
Tags: python, pdf, api, tutorial