risha-max

Posted on Feb 24

# How to Parse Invoices into JSON with Python (No Regex)

#python #api #pdf #tutorial

Parsing invoices sounds easy until you do it at scale.

One supplier sends a clean text PDF. Another sends a scanned image. A third changes layout every month. Suddenly your extraction logic breaks and you are patching regex rules at midnight.

In this post, I will show a schema-first approach with 0xPdf that turns invoice PDFs into structured JSON in Python, without brittle regex parsing.

The problem: why invoice PDF parsing is hard

Invoices are not standardized documents. Even when they contain the same fields, they vary by:

- Layout and visual hierarchy
- Label naming ("Invoice #", "Invoice No.", "Ref")
- OCR quality (for scanned PDFs)
- Currency/date formatting
- Multi-line items and totals sections

Most failures happen when extraction code assumes one fixed layout.

Traditional approaches (and where they break)

`pdfplumber` (or similar text extractors)

Good for extracting text blocks, but you still need to map unstructured text into business fields yourself.

That usually means:

- Regex for each field
- Manual heuristics per vendor
- Ongoing maintenance when templates change
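To make the maintenance problem concrete, here is a minimal sketch of the regex-per-field style. The pattern, the function name `extract_invoice_number`, and the two sample strings are made up for illustration; the point is only that a pattern tuned to one vendor's label silently returns nothing for another vendor's wording.

```python
import re
from typing import Optional

# One pattern per field, tuned to a single vendor's label wording (illustrative).
INVOICE_NUMBER_RE = re.compile(r"Invoice #:\s*(\S+)")


def extract_invoice_number(text: str) -> Optional[str]:
    """Return the invoice number, or None when the label does not match."""
    match = INVOICE_NUMBER_RE.search(text)
    return match.group(1) if match else None


# Hypothetical text extracted from two different vendors' PDFs.
vendor_a = "Invoice #: INV-2039\nTotal: 1,284.44"
vendor_b = "Invoice No. INV-2039\nTotal Due USD 1,284.44"

print(extract_invoice_number(vendor_a))  # INV-2039
print(extract_invoice_number(vendor_b))  # None
```

Every new label variant means another pattern or another branch, and that is where the midnight patching comes from.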

AWS Textract / raw OCR APIs

Useful for OCR and layout signals, but the output is still mostly low-level structure: blocks, lines, words, and their positions. You still need significant post-processing to get final business JSON.

In other words: you pay for extraction and still write parsing logic.

The schema-first approach with 0xPdf

With 0xPdf, you define the shape of the data you want up front (schema), then parse the document directly into that structure.

Instead of:

PDF -> text -> regex -> custom parser -> JSON

you do:

PDF + schema -> structured JSON

This is much easier to maintain across different invoice formats, because there is no per-vendor parsing code to update when a layout changes.

Step-by-step Python tutorial

1) Install the SDK

pip install oxpdf

2) Define an invoice schema

Create a schema for the fields your app actually needs:

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string"},
        "due_date": {"type": "string"},
        "vendor_name": {"type": "string"},
        "vendor_address": {"type": "string"},
        "currency": {"type": "string"},
        "subtotal": {"type": "number"},
        "tax": {"type": "number"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                    "amount": {"type": "number"}
                }
            }
        }
    },
    "required": ["invoice_number", "invoice_date", "vendor_name", "total"]
}
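A typo in `required` is easy to miss in a schema like this. As a small sanity check, here is a sketch that lists required names that are not declared under `properties`. The helper name `undeclared_required` and the trimmed example schema are my own, and how the API reacts to such a mismatch is not something I am asserting here.

```python
def undeclared_required(schema: dict) -> list:
    """Return required field names that are missing from schema['properties']."""
    declared = set(schema.get("properties", {}))
    return [name for name in schema.get("required", []) if name not in declared]


# Trimmed example schema with a deliberate typo in "required".
example_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number", "totl"],
}

print(undeclared_required(example_schema))  # ['totl']
```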

3) Parse a sample invoice

import os
from oxpdf import Oxpdf

client = Oxpdf(api_key=os.environ["OXPDF_API_KEY"])

with open("sample-invoice.pdf", "rb") as f:
    result = client.pdf.parse(
        file=f,
        schema=INVOICE_SCHEMA,
        use_ocr=True  # enable OCR when scanned/image-based invoices are common
    )

4) Handle the JSON response

data = result["data"]

print("Invoice #:", data.get("invoice_number"))
print("Vendor:", data.get("vendor_name"))
print("Total:", data.get("total"))

for item in data.get("line_items", []):
    print("-", item.get("description"), item.get("amount"))

This gives you clean, application-ready JSON you can store directly in your DB or pass to downstream systems.
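Before writing a parsed invoice to your DB, a light sanity check can catch bad extractions early. The sketch below is plain Python and does not depend on the SDK. The function name `check_invoice`, the one-cent tolerance, and the rule that `subtotal + tax` should equal `total` are my assumptions, so adjust them to your own invoices (discounts and shipping would break that rule).

```python
from decimal import Decimal

# Mirrors the "required" list from the schema above.
REQUIRED_FIELDS = ("invoice_number", "invoice_date", "vendor_name", "total")


def check_invoice(data: dict) -> list:
    """Return a list of problems; an empty list means the invoice passed."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if data.get(name) in (None, "")
    ]

    subtotal, tax, total = (data.get(k) for k in ("subtotal", "tax", "total"))
    if None not in (subtotal, tax, total):
        # Go through str() so float noise does not leak into the comparison.
        expected = Decimal(str(subtotal)) + Decimal(str(tax))
        if abs(expected - Decimal(str(total))) > Decimal("0.01"):
            problems.append(f"subtotal + tax = {expected}, but total = {total}")
    return problems


good = {
    "invoice_number": "INV-2039",
    "invoice_date": "2026-02-10",
    "vendor_name": "Acme Supply Co.",
    "subtotal": 1150.0,
    "tax": 134.44,
    "total": 1284.44,
}
print(check_invoice(good))  # []
```

Invoices that come back with a non-empty list can go to a manual review queue instead of straight into downstream systems.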

Results comparison: Textract-style raw output vs 0xPdf structured output

Typical raw OCR output (simplified)

{
  "blocks": [
    {"type": "LINE", "text": "Invoice No: INV-2039"},
    {"type": "LINE", "text": "Total Due USD 1,284.44"},
    {"type": "LINE", "text": "Widget A   2   120.00   240.00"}
  ]
}

You still need custom parsing code for every field.
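Here is roughly what that custom parsing looks like for the three sample lines above. This is a sketch: the regexes and the `parse_blocks` helper are illustrative and only fit this exact wording and column spacing, which is exactly the problem.

```python
import re
from decimal import Decimal

raw = {
    "blocks": [
        {"type": "LINE", "text": "Invoice No: INV-2039"},
        {"type": "LINE", "text": "Total Due USD 1,284.44"},
        {"type": "LINE", "text": "Widget A   2   120.00   240.00"},
    ]
}

INVOICE_NO_RE = re.compile(r"Invoice No:\s*(\S+)")
TOTAL_RE = re.compile(r"Total Due\s+([A-Z]{3})\s+([\d,]+\.\d{2})")
# description, quantity, unit price, amount, separated by runs of 2+ spaces
LINE_ITEM_RE = re.compile(
    r"^(.+?)\s{2,}(\d+)\s{2,}([\d,]+\.\d{2})\s{2,}([\d,]+\.\d{2})$"
)


def to_decimal(text: str) -> Decimal:
    """Parse an amount like '1,284.44' (assumes ',' thousands and '.' decimals)."""
    return Decimal(text.replace(",", ""))


def parse_blocks(raw: dict) -> dict:
    invoice = {"line_items": []}
    for block in raw["blocks"]:
        if block["type"] != "LINE":
            continue
        text = block["text"]
        number = INVOICE_NO_RE.search(text)
        total = TOTAL_RE.search(text)
        item = LINE_ITEM_RE.match(text)
        if number:
            invoice["invoice_number"] = number.group(1)
        elif total:
            invoice["currency"] = total.group(1)
            invoice["total"] = to_decimal(total.group(2))
        elif item:
            invoice["line_items"].append({
                "description": item.group(1),
                "quantity": int(item.group(2)),
                "unit_price": to_decimal(item.group(3)),
                "amount": to_decimal(item.group(4)),
            })
    return invoice


parsed = parse_blocks(raw)
print(parsed["invoice_number"], parsed["currency"], parsed["total"])
```

That is three regexes for three lines from one vendor. A second vendor that writes "Amount Payable" or uses a European number format needs its own set.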

0xPdf schema-first output

{
  "invoice_number": "INV-2039",
  "invoice_date": "2026-02-10",
  "due_date": "2026-03-12",
  "vendor_name": "Acme Supply Co.",
  "currency": "USD",
  "subtotal": 1150.0,
  "tax": 134.44,
  "total": 1284.44,
  "line_items": [
    {
      "description": "Widget A",
      "quantity": 2,
      "unit_price": 120.0,
      "amount": 240.0
    }
  ]
}

This is immediately usable by your billing, reconciliation, or ERP workflows.
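As one example of downstream use, the sketch below loads the header fields into a typed record. The `InvoiceSummary` dataclass and `to_summary` helper are hypothetical names, and the code assumes dates come back as ISO 8601 strings like in the sample above; the schema itself only says `"type": "string"`, so real output may need a more forgiving date parser.

```python
import json
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class InvoiceSummary:
    invoice_number: str
    vendor_name: str
    invoice_date: date
    due_date: date
    currency: str
    total: Decimal


def to_summary(data: dict) -> InvoiceSummary:
    """Convert parsed invoice JSON into a typed record (assumes ISO dates)."""
    return InvoiceSummary(
        invoice_number=data["invoice_number"],
        vendor_name=data["vendor_name"],
        invoice_date=date.fromisoformat(data["invoice_date"]),
        due_date=date.fromisoformat(data["due_date"]),
        currency=data["currency"],
        # Money as Decimal, built from the string form to avoid float noise.
        total=Decimal(str(data["total"])),
    )


payload = json.loads("""
{
  "invoice_number": "INV-2039",
  "invoice_date": "2026-02-10",
  "due_date": "2026-03-12",
  "vendor_name": "Acme Supply Co.",
  "currency": "USD",
  "total": 1284.44
}
""")

summary = to_summary(payload)
print((summary.due_date - summary.invoice_date).days)  # 30
```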

When to use OCR vs text extraction

Use text extraction when:

- PDFs are digitally generated (the text is machine-selectable)
- Quality is consistent
- You want faster and cheaper processing

Use OCR when:

- PDFs are scanned images
- Scan quality is mixed or pages are often skewed
- You need robust parsing across messy inputs

Practical rule: default to text extraction, auto-fallback to OCR for low-quality or image-only files.
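One way to sketch that fallback rule: run a cheap text extraction first, then decide from how much text came back. The helper `needs_ocr` and both thresholds are assumptions of mine, not part of any SDK. It takes a list of per-page strings, so it works with whatever text extractor you already use.

```python
def needs_ocr(page_texts, min_chars_per_page=50, max_sparse_ratio=0.5):
    """Guess whether a PDF needs OCR from its extracted per-page text.

    page_texts: one string per page from any text extractor.
    A page counts as "sparse" when it yields fewer than min_chars_per_page
    non-whitespace-trimmed characters. OCR is suggested when more than
    max_sparse_ratio of the pages are sparse, or when there are no pages.
    """
    if not page_texts:
        return True
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return sparse / len(page_texts) > max_sparse_ratio


digital = ["Invoice No: INV-2039 Total Due USD 1,284.44 " * 5]
scanned = ["", "  \n"]

print(needs_ocr(digital))  # False
print(needs_ocr(scanned))  # True
```

The result could then feed the `use_ocr` argument from step 3 instead of hard-coding it. Tune the thresholds on a sample of your own invoices, since a mostly empty cover page or a low-quality text layer can fool a simple character count.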

Conclusion

Regex-based invoice parsing does not scale well across vendors and layout changes.

A schema-first workflow gives you cleaner, more stable output and removes most of the per-vendor parsing code you would otherwise maintain.

If you want to try it, start with the 0xPdf free tier and parse a few real invoices from your workflow:

https://0xpdf.io/pricing


DEV Community
