DevToolsmith

Stop Writing Regex for Invoices: Turn Any PDF Into Structured JSON With One API Call

If you have ever been handed a folder of invoices and asked to "just get the totals into a spreadsheet," you already know the trap. It sounds like a one-afternoon script. Three weeks later you are maintaining a regex zoo, a per-vendor template system, and a Slack channel full of people asking why last Tuesday's batch came back empty.

This article walks through why document parsing is harder than it looks, then shows a simpler pattern for getting clean, structured data out of PDFs without building and babysitting your own extraction stack.

Why "just parse the PDF" goes wrong

PDFs are a presentation format, not a data format. A number that looks like a total to a human is, under the hood, a glyph positioned at some coordinate with no semantic label. So most teams reach for the same stack:

  1. OCR to pull raw text.
  2. Regex and string heuristics to find fields.
  3. A template per document layout to map positions to fields.

This works in a controlled demo. In production it degrades for predictable reasons:

  • Layouts drift. A vendor moves the invoice number, and your positional template silently returns the wrong cell.
  • Variety explodes. Ten suppliers become two hundred. You cannot hand-author a template for each.
  • Edge cases are the norm. Multi-page invoices, line items that wrap, scanned receipts, mixed currencies. Each one is a new patch.

The maintenance never ends because the input never stabilizes.
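
To make the layout-drift failure concrete, here is a minimal sketch of the regex approach. The pattern and both sample strings are made up for illustration and do not come from any real vendor; the point is only that a pattern tuned to one layout returns nothing, without raising an error, when the label changes.

```python
import re

# Hypothetical pattern tuned to one vendor's layout, e.g. "Total: $1,284.50"
TOTAL_RE = re.compile(r"Total:\s*\$([\d,]+\.\d{2})")

def find_total(text: str):
    """Return the invoice total as a float, or None if the pattern misses."""
    match = TOTAL_RE.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

# The same invoice before and after the vendor renames the label (invented text).
original_layout = "Invoice INV-20418\nTotal: $1,284.50"
drifted_layout = "Invoice INV-20418\nAmount due (USD) 1,284.50"

print(find_total(original_layout))  # 1284.5
print(find_total(drifted_layout))   # None
```

Nothing crashes in the second case. You just get an empty field, which is exactly how last Tuesday's batch comes back empty.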

The pattern: treat extraction as a single API call

Instead of owning OCR, heuristics, and templates, you can treat extraction the way you treat geocoding or email validation: a service that takes a messy input and returns a typed result. You send a document, you get back JSON with named fields. Your job shrinks to handling the JSON.

Here is what that looks like in practice with ParseFlow:

curl -X POST https://api.parseflow.dev/v1/extract \
  -H "Authorization: Bearer $PARSEFLOW_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "type=invoice"

And a representative response:

{
  "type": "invoice",
  "vendor": "Northwind Supplies",
  "invoice_number": "INV-20418",
  "issue_date": "2026-05-30",
  "currency": "USD",
  "total": 284.40,
  "line_items": [
    { "description": "Thermal paper rolls", "qty": 12, "unit_price": 4.20, "amount": 50.40 },
    { "description": "Label printer ink", "qty": 3, "unit_price": 78.00, "amount": 234.00 }
  ]
}

From there it is ordinary code. In Python:

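import os

import requests

def extract_invoice(path: str) -> dict:
    """Upload a PDF and return the parsed JSON response as a dict."""
    with open(path, "rb") as f:
        resp = requests.post(
            "https://api.parseflow.dev/v1/extract",
            headers={"Authorization": f"Bearer {os.environ['PARSEFLOW_API_KEY']}"},
            files={"file": f},
            data={"type": "invoice"},
            timeout=60,  # requests has no default timeout; fail instead of hanging
        )
    resp.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
    return resp.json()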

data = extract_invoice("invoice.pdf")
print(data["vendor"], data["total"])
for item in data["line_items"]:
    print(item["description"], item["amount"])

No OCR step you maintain. No template to author for the next vendor. You get a typed object and move on to the part that is actually your business logic — writing to a database, kicking off an approval, reconciling against a PO.
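
If you want an actual typed object rather than a plain dict, one option is to convert the response into dataclasses at the boundary. This is a sketch, not part of ParseFlow: the `Invoice` and `LineItem` classes and the `to_invoice` helper are names invented here, and it assumes every field from the representative response above is present. Money goes through `Decimal` so later arithmetic does not pick up float rounding.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

@dataclass(frozen=True)
class LineItem:
    description: str
    qty: Decimal
    unit_price: Decimal
    amount: Decimal

@dataclass(frozen=True)
class Invoice:
    vendor: str
    invoice_number: str
    issue_date: date
    currency: str
    total: Decimal
    line_items: list[LineItem]

def to_invoice(data: dict) -> Invoice:
    """Convert the raw response dict into typed objects.

    Raises KeyError if a field is missing and ValueError if the date
    is not in ISO format (YYYY-MM-DD), so bad records fail loudly here.
    """
    items = [
        LineItem(
            description=item["description"],
            qty=Decimal(str(item["qty"])),
            unit_price=Decimal(str(item["unit_price"])),
            amount=Decimal(str(item["amount"])),
        )
        for item in data["line_items"]
    ]
    return Invoice(
        vendor=data["vendor"],
        invoice_number=data["invoice_number"],
        issue_date=date.fromisoformat(data["issue_date"]),
        currency=data["currency"],
        total=Decimal(str(data["total"])),
        line_items=items,
    )
```

Calling `to_invoice(extract_invoice("invoice.pdf"))` would then give you attribute access (`invoice.total`, `invoice.line_items[0].amount`) and an early failure if the shape is not what you expected.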

Where this fits in a workflow

The single-call shape makes it easy to drop into existing automation:

  • Inbox to database: a new invoice email arrives, your function extracts it and inserts a row.
  • Upload form to review queue: a user uploads a receipt, you store structured fields and only flag the ones below a confidence threshold.
  • Batch backfill: loop over an archive of historical PDFs and normalize them into one schema.

Because it is just an HTTP call returning JSON, it slots into Python scripts, serverless functions, and no-code automation tools alike.
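
As one example, the batch backfill can be sketched as a loop that keeps going when a single file fails. The `backfill` function and the `source_file` key are hypothetical names for this illustration. The extraction function is passed in as a parameter (in practice something like `extract_invoice` from earlier), so the sketch makes no assumptions about the API beyond "path in, dict out".

```python
from pathlib import Path
from typing import Callable

def backfill(
    archive_dir: str,
    extract: Callable[[str], dict],
) -> tuple[list[dict], list[tuple[str, str]]]:
    """Run `extract` over every PDF in archive_dir.

    Returns (results, failures): the extracted records, and a list of
    (filename, error message) pairs for files that could not be processed.
    """
    results: list[dict] = []
    failures: list[tuple[str, str]] = []
    for pdf in sorted(Path(archive_dir).glob("*.pdf")):
        try:
            record = extract(str(pdf))
        except Exception as exc:
            # One bad file should not stop the whole batch; record it and move on.
            failures.append((pdf.name, str(exc)))
            continue
        record["source_file"] = pdf.name  # remember where each record came from
        results.append(record)
    return results, failures
```

Keeping failures in a separate list means you can retry or hand-review just those files instead of rerunning the whole archive.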

What to keep in your own hands

A service handles extraction, but you still own the parts that depend on your domain: validation rules (does this total match the line items?), idempotency (don't double-import the same invoice), and human review for low-confidence cases. The goal is not to remove judgment from the loop. It is to delete the brittle, undifferentiated OCR-and-template layer that was never your real work.
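
Here is a rough sketch of what those three responsibilities can look like in code. Everything in it is an assumption for illustration: the function names, the one-cent tolerance, and the choice of vendor plus invoice number as the idempotency key. Whether a total should equal the sum of line items also depends on your domain (tax, shipping, and discounts may sit outside the line items), so treat the check as a placeholder for your own rule. Failed checks go to a review queue rather than being dropped.

```python
from decimal import Decimal

def line_items_match_total(invoice: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """True if the sum of line item amounts is within `tolerance` of the total."""
    line_sum = sum(
        (Decimal(str(item["amount"])) for item in invoice["line_items"]),
        Decimal("0"),
    )
    return abs(line_sum - Decimal(str(invoice["total"]))) <= tolerance

def dedupe_key(invoice: dict) -> tuple:
    """Hypothetical idempotency key: same vendor and number means same invoice."""
    return (invoice["vendor"].strip().lower(), invoice["invoice_number"].strip())

def import_invoice(invoice: dict, seen: set, review_queue: list, store: list) -> str:
    """Route one extracted invoice; returns which path it took."""
    key = dedupe_key(invoice)
    if key in seen:
        return "skipped_duplicate"  # idempotency: never import the same invoice twice
    seen.add(key)
    if not line_items_match_total(invoice):
        review_queue.append(invoice)  # a human decides what to do with mismatches
        return "needs_review"
    store.append(invoice)
    return "imported"
```

In a real system `seen` and `store` would be a unique index and a table in your database rather than in-memory collections, but the routing logic stays the same.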

Closing

If your document pipeline is one vendor layout change away from a 2am page, it is worth trying the single-call approach before you build yet another template engine. ParseFlow takes a PDF — invoice, receipt, contract, or ID — and hands back structured JSON you can use immediately. It is free to try, so the fastest way to evaluate it is to send it the exact document that keeps breaking your current setup and see what comes back: https://parseflow.dev



Full disclosure: I build ParseFlow, a document extraction API that turns PDFs into structured JSON. It is free to try at https://parseflow.dev.