risha-max
# How to Parse Invoices into JSON with Python (No Regex)

Parsing invoices sounds easy until you do it at scale.

One supplier sends a clean text PDF. Another sends a scanned image. A third changes layout every month. Suddenly your extraction logic breaks and you are patching regex rules at midnight.

In this post, I will show a schema-first approach with 0xPdf that turns invoice PDFs into structured JSON in Python -- without brittle regex parsing.


The problem: why invoice PDF parsing is hard

Invoices are not standardized documents. Even when they contain the same fields, they vary by:

  • Layout and visual hierarchy
  • Label naming ("Invoice #", "Invoice No.", "Ref")
  • OCR quality (for scanned PDFs)
  • Currency/date formatting
  • Multi-line items and totals sections

Most failures happen when extraction code assumes one fixed layout.
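To make one of these variations concrete, here is a minimal sketch of normalizing amounts and dates by hand. The helper names and the list of date formats are illustrative assumptions, not part of any library, and the amount logic is deliberately naive: a string like "1,284" is ambiguous without knowing the vendor's locale.

```python
from datetime import date, datetime
from decimal import Decimal

# Hypothetical set of date formats seen across vendors; real data needs more.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def parse_amount(raw: str) -> Decimal:
    """Normalize '1,284.44' (US style) and '1.284,44' (EU style) to a Decimal."""
    cleaned = raw.strip().replace(" ", "")
    if cleaned.rfind(",") > cleaned.rfind("."):
        # The last separator is a comma, so treat it as the decimal mark.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # The last separator is a dot (or there is none); commas are grouping.
        cleaned = cleaned.replace(",", "")
    return Decimal(cleaned)


def parse_date(raw: str) -> date:
    """Try each known format in order and fail loudly on anything else."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")
```

Every new vendor convention means another branch or another entry in the format list, which is the maintenance cost the rest of this post tries to avoid.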


Traditional approaches (and where they break)

pdfplumber (or similar text extractors)

Good for extracting text blocks, but you still need to map unstructured text into business fields yourself.

That usually means:

  • Regex for each field
  • Manual heuristics per vendor
  • Ongoing maintenance when templates change
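A rough sketch of what that looks like for a single field, working on text that has already been extracted. The pattern list and function name are hypothetical; the point is that each new label variant needs another hand-written rule.

```python
import re
from typing import Optional

# Hypothetical patterns for one field. Each vendor template tends to add a row.
INVOICE_NUMBER_PATTERNS = [
    re.compile(r"Invoice\s*#\s*:?\s*(\S+)"),
    re.compile(r"Invoice\s+No\.?\s*:?\s*(\S+)"),
]


def find_invoice_number(text: str) -> Optional[str]:
    """Return the first invoice number matched by any known pattern."""
    for pattern in INVOICE_NUMBER_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    # No rule matched: a label this code has never seen before.
    return None
```

With these two rules, "Invoice # INV-2039" and "Invoice No: INV-2039" both resolve, but a vendor that prints "Ref INV-2039" returns None until someone adds a third pattern.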

AWS Textract / raw OCR APIs

Useful for OCR and layout detection, but the output is still mostly low-level structure, such as lines of text grouped into blocks. You still need significant post-processing to get final business JSON.

In other words: you pay for extraction and still write parsing logic.


The schema-first approach with 0xPdf

With 0xPdf, you define the shape of the data you want up front (schema), then parse the document directly into that structure.

Instead of:

PDF -> text -> regex -> custom parser -> JSON

you do:

PDF + schema -> structured JSON

This is much easier to maintain across different invoice formats, because the schema defines the output rather than layout-specific parsing code.


Step-by-step Python tutorial

1) Install the SDK

pip install oxpdf

2) Define an invoice schema

Create a schema for the fields your app actually needs:

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string"},
        "due_date": {"type": "string"},
        "vendor_name": {"type": "string"},
        "vendor_address": {"type": "string"},
        "currency": {"type": "string"},
        "subtotal": {"type": "number"},
        "tax": {"type": "number"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                    "amount": {"type": "number"}
                }
            }
        }
    },
    "required": ["invoice_number", "invoice_date", "vendor_name", "total"]
}

3) Parse a sample invoice

import os
from oxpdf import Oxpdf

client = Oxpdf(api_key=os.environ["OXPDF_API_KEY"])

with open("sample-invoice.pdf", "rb") as f:
    result = client.pdf.parse(
        file=f,
        schema=INVOICE_SCHEMA,
        use_ocr=True  # enable OCR when scanned or image-based invoices are common
    )

4) Handle the JSON response

data = result["data"]

print("Invoice #:", data.get("invoice_number"))
print("Vendor:", data.get("vendor_name"))
print("Total:", data.get("total"))

for item in data.get("line_items", []):
    print("-", item.get("description"), item.get("amount"))

This gives you clean, application-ready JSON you can store directly in your DB or pass to downstream systems.
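Before storing the result, it can still be worth running a quick sanity check on it. The sketch below is one possible way to do that with the standard library only; `validate_invoice` is a hypothetical helper, and the field lists simply mirror the `required` and numeric fields from the schema above.

```python
# Mirrors the "required" list and the numeric fields of the invoice schema.
REQUIRED_FIELDS = ("invoice_number", "invoice_date", "vendor_name", "total")
NUMERIC_FIELDS = ("subtotal", "tax", "total")


def validate_invoice(data: dict) -> list:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    for field in REQUIRED_FIELDS:
        if data.get(field) in (None, ""):
            problems.append(f"missing required field: {field}")
    for field in NUMERIC_FIELDS:
        value = data.get(field)
        if value is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(
                f"{field} should be a number, got {type(value).__name__}"
            )
    if not isinstance(data.get("line_items", []), list):
        problems.append("line_items should be a list")
    return problems
```

An empty list means the record has the fields your app depends on; anything else can be routed to manual review instead of being written to the database.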


Results comparison: Textract-style raw output vs 0xPdf structured output

Typical raw OCR output (simplified)

{
  "blocks": [
    {"type": "LINE", "text": "Invoice No: INV-2039"},
    {"type": "LINE", "text": "Total Due USD 1,284.44"},
    {"type": "LINE", "text": "Widget A   2   120.00   240.00"}
  ]
}

You still need custom parsing code for every field.
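As an illustration, here is a minimal sketch of recovering just the line item from the simplified blocks above. It assumes columns are separated by runs of two or more spaces, which holds for this sample but not for real OCR output in general; `parse_line_item` is a hypothetical helper.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Same simplified blocks as in the raw OCR example above.
blocks = [
    {"type": "LINE", "text": "Invoice No: INV-2039"},
    {"type": "LINE", "text": "Total Due USD 1,284.44"},
    {"type": "LINE", "text": "Widget A   2   120.00   240.00"},
]


def parse_line_item(text: str) -> Optional[dict]:
    """Try to read 'description  qty  unit_price  amount' from one OCR line."""
    # Assumption: columns are separated by two or more whitespace characters.
    parts = re.split(r"\s{2,}", text.strip())
    if len(parts) != 4:
        return None
    description, quantity, unit_price, amount = parts
    try:
        return {
            "description": description,
            "quantity": Decimal(quantity.replace(",", "")),
            "unit_price": Decimal(unit_price.replace(",", "")),
            "amount": Decimal(amount.replace(",", "")),
        }
    except InvalidOperation:
        # One of the numeric columns was not a number after all.
        return None


# Keep only the lines that parsed as a four-column item row.
line_items = [item for block in blocks if (item := parse_line_item(block["text"]))]
```

This handles one row shape for one layout. Descriptions that wrap onto a second line, or OCR that collapses the column gaps to single spaces, would each need their own handling.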

0xPdf schema-first output

{
  "invoice_number": "INV-2039",
  "invoice_date": "2026-02-10",
  "due_date": "2026-03-12",
  "vendor_name": "Acme Supply Co.",
  "currency": "USD",
  "subtotal": 1150.0,
  "tax": 134.44,
  "total": 1284.44,
  "line_items": [
    {
      "description": "Widget A",
      "quantity": 2,
      "unit_price": 120.0,
      "amount": 240.0
    }
  ]
}

This is immediately usable by your billing, reconciliation, or ERP workflows.
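For reconciliation in particular, structured numbers make arithmetic checks straightforward. The following is a small sketch, not part of the 0xPdf SDK: `check_totals` is a hypothetical helper that verifies each line item and the grand total within a one-cent tolerance.

```python
from decimal import Decimal


def to_decimal(value) -> Decimal:
    # Go through str() so a float like 134.44 keeps its printed value.
    return Decimal(str(value))


def check_totals(data: dict, tolerance: Decimal = Decimal("0.01")) -> list:
    """Return a list of arithmetic inconsistencies found in a parsed invoice."""
    issues = []
    for index, item in enumerate(data.get("line_items", [])):
        keys = ("quantity", "unit_price", "amount")
        if any(item.get(key) is None for key in keys):
            continue  # skip rows that are missing a number
        expected = to_decimal(item["quantity"]) * to_decimal(item["unit_price"])
        if abs(expected - to_decimal(item["amount"])) > tolerance:
            issues.append(f"line item {index}: quantity * unit_price != amount")
    if all(data.get(key) is not None for key in ("subtotal", "tax", "total")):
        expected_total = to_decimal(data["subtotal"]) + to_decimal(data["tax"])
        if abs(expected_total - to_decimal(data["total"])) > tolerance:
            issues.append("subtotal + tax != total")
    return issues
```

For the sample output above, both checks pass: 2 × 120.0 equals 240.0, and 1150.0 + 134.44 equals 1284.44.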


When to use OCR vs text extraction

Use text extraction when:

  • PDFs are digitally generated (the text is selectable)
  • Quality is consistent
  • You want faster and cheaper processing

Use OCR when:

  • PDFs are scanned images
  • Scan quality is mixed or pages are often skewed
  • You need robust parsing across messy inputs

Practical rule: default to text extraction, and fall back to OCR automatically for low-quality or image-only files.
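One way to sketch that rule is shown below. It assumes you wrap your parsing call in a function that takes a `use_ocr` flag and returns the parsed data as a dict, and it treats "required fields are missing" as the signal that text extraction was not enough. Both the wrapper shape and that signal are assumptions for illustration, not behavior documented by the SDK.

```python
from typing import Callable, Sequence

REQUIRED_FIELDS = ("invoice_number", "invoice_date", "vendor_name", "total")


def parse_with_fallback(
    parse: Callable[[bool], dict],
    required: Sequence[str] = REQUIRED_FIELDS,
) -> dict:
    """Try the cheaper text path first; retry with OCR if fields are missing.

    `parse` is a caller-supplied function: parse(use_ocr) -> parsed data dict.
    """
    data = parse(False)
    if all(data.get(field) not in (None, "") for field in required):
        return data
    # Text extraction came back incomplete, so pay for the OCR pass.
    return parse(True)
```

In the tutorial's terms, `parse` would be a small function that opens the PDF and calls the client with `use_ocr` set to the value it receives.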


Conclusion

Regex-based invoice parsing does not scale well across vendors and layout changes.

A schema-first workflow gives you cleaner, more stable output and cuts down the per-vendor parsing code you have to maintain.

If you want to try it, start with the 0xPdf free tier and parse a few real invoices from your workflow:

https://0xpdf.io/pricing


Tags: python, pdf, api, tutorial
