Francesco Ira

Posted on May 6

How I built an invoice extraction API that works on any PDF layout

#webdev #python #automation #ai

I kept running into the same problem on client projects: invoice processing.

Every solution required templates: one per supplier, one per layout. Every time a vendor updated their invoice design, something broke. I was maintaining 40+ templates across different projects and it was a nightmare.

So I built Parzo, an API that takes any invoice PDF and returns structured JSON. No templates, no configuration, no training data.

How it works

The flow is simple:

POST the PDF to the endpoint
The API extracts text (or falls back to OCR for scanned documents)
An AI model reads the invoice and extracts every field
A validation layer checks the arithmetic, VAT numbers, and date coherence
You get back clean JSON with a confidence score

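import time

import requests

BASE_URL = "https://api.parzo.dev/v1"


def extract_invoice(pdf_path: str, api_key: str, timeout_s: float = 60.0) -> dict:
    headers = {"X-API-Key": api_key}

    # Upload the PDF; the API answers right away with a job ID.
    with open(pdf_path, "rb") as f:
        response = requests.post(
            f"{BASE_URL}/extract/invoice",
            headers=headers,
            files={"file": f},
            timeout=30,
        )
    response.raise_for_status()
    job_id = response.json()["job_id"]

    # Poll the job until it completes, giving up after timeout_s seconds.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        poll = requests.get(f"{BASE_URL}/jobs/{job_id}", headers=headers, timeout=30)
        poll.raise_for_status()
        result = poll.json()

        if result["status"] == "completed":
            return result["result"]

        time.sleep(2)

    raise TimeoutError(f"Job {job_id} did not complete within {timeout_s} seconds")

# Usage
invoice_data = extract_invoice("invoice.pdf", "inv_your_key")
print(invoice_data["financials"]["total"])  # 1220.0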

The output schema

The JSON structure is consistent regardless of the original invoice layout:

{
  "vendor": {
    "name": "Acme Srl",
    "vat_number": "IT12345678903",
    "address": "Via Roma 1, Milano"
  },
  "buyer": {
    "name": "Client SpA",
    "vat_number": "IT98765432109"
  },
  "invoice": {
    "number": "FT-2026-001",
    "date": "2026-04-01",
    "due_date": "2026-05-01",
    "currency": "EUR"
  },
  "financials": {
    "subtotal": 1000.00,
    "tax_rate": 22,
    "tax_amount": 220.00,
    "total": 1220.00
  },
  "line_items": [
    {
      "description": "Consulting services",
      "quantity": 10,
      "unit_price": 100.00,
      "total": 1000.00
    }
  ],
  "validation": {
    "confidence": 0.95,
    "flags": []
  }
}

The validation layer

This is the part I'm most proud of. After extraction, a second pass checks:

Arithmetic — does subtotal + tax = total?
VAT number — for Italian invoices, validates the P.IVA checksum (DPR 633/1972 algorithm)
VAT rates — are the rates valid for the detected country?
Date coherence — is the issue date before the due date?
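
To make the VAT number check concrete, here is a minimal sketch of the commonly documented Luhn-style control digit check for 11-digit Italian VAT numbers. It is not Parzo's actual code, and the function name is hypothetical.

```python
def is_valid_partita_iva(vat_number: str) -> bool:
    """Check the control digit of an 11-digit Italian VAT number.

    Hypothetical helper: accepts the number with or without the "IT" prefix.
    """
    digits = vat_number.strip().upper()
    if digits.startswith("IT"):
        digits = digits[2:]
    if len(digits) != 11 or not (digits.isascii() and digits.isdigit()):
        return False

    total = 0
    for index, char in enumerate(digits[:10]):
        value = int(char)
        if index % 2 == 1:  # 2nd, 4th, ... digit counting from 1
            value *= 2
            if value > 9:
                value -= 9
        total += value

    # The 11th digit must bring the sum up to a multiple of 10.
    expected_check_digit = (10 - total % 10) % 10
    return expected_check_digit == int(digits[10])
```

The vendor number from the sample output, IT12345678903, passes this check. A number that fails would be a natural candidate for an entry in the flags array.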

Any anomaly gets added to the flags array with a description. The confidence score drops accordingly. This catches errors before they hit your accounting system.
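
As an illustration of how such checks can turn into flags, here is a rough sketch of the arithmetic and date checks, run against the output schema shown earlier. The function name, the flag wording, and the 0.01 rounding tolerance are assumptions, not the real implementation.

```python
from datetime import date


def collect_flags(invoice_data: dict, tolerance: float = 0.01) -> list:
    """Return human-readable flags for arithmetic and date anomalies."""
    flags = []

    # Arithmetic: subtotal + tax should equal the total, within rounding.
    financials = invoice_data["financials"]
    expected_total = financials["subtotal"] + financials["tax_amount"]
    if abs(expected_total - financials["total"]) > tolerance:
        flags.append(
            f"total {financials['total']} does not match "
            f"subtotal + tax ({expected_total})"
        )

    # Date coherence: the due date should not precede the issue date.
    # Assumption: a due date equal to the issue date is acceptable.
    issued = date.fromisoformat(invoice_data["invoice"]["date"])
    due = date.fromisoformat(invoice_data["invoice"]["due_date"])
    if issued > due:
        flags.append(f"due date {due} is before issue date {issued}")

    return flags
```

With the sample invoice above (1000.00 + 220.00 = 1220.00, issued 2026-04-01, due 2026-05-01) this returns an empty list, which matches the empty flags array in the example output.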

The tech stack

Runtime: Bun
Framework: Hono
Queue: BullMQ with 4 separate queues by plan tier
Database: PostgreSQL via Drizzle ORM
Storage: Cloudflare R2 (EU region, 24h deletion)
AI routing: smaller model for text PDFs, larger model for scanned documents

The async queue design means the API responds immediately with a job ID and processes in the background. For most text-based PDFs the result is ready in under a second.

GDPR by design

Since this handles financial documents, I made some deliberate choices:

EU-only hosting (Hetzner Frankfurt)
PDFs deleted automatically after 24 hours
No document content in logs
No data transfer outside EU

What I learned building this

Template-free extraction is harder than it sounds. The first version had terrible consistency: the same invoice processed twice would return slightly different field names. The solution was a strict JSON schema enforced at the prompt level, not at the parsing level.
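
One way to spot this kind of field-name drift is to compare the key paths of two extractions of the same invoice. The sketch below is a hypothetical test helper for detecting the symptom, not the prompt-level fix itself.

```python
def key_paths(value, prefix: str = "") -> set:
    """Collect dotted key paths, e.g. 'vendor.vat_number' or 'line_items[].total'."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(child, path)
    elif isinstance(value, list):
        for child in value:
            paths |= key_paths(child, f"{prefix}[]")
    return paths


def schema_drift(first: dict, second: dict) -> set:
    """Key paths present in one extraction but not in the other."""
    return key_paths(first) ^ key_paths(second)
```

An empty set means both runs produced the same structure. A result containing both vendor.vat_number and vendor.vatNumber would point at exactly the inconsistency described above.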

Confidence scores are more useful than binary success/failure. A result with confidence 0.6 and two flags beats a failed extraction, because you know what to check manually.
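
On the consuming side, this can be as simple as routing each result by its validation block. A minimal sketch, where the function name, the return labels, and the 0.9 threshold are all assumptions you would tune for your own workflow:

```python
def triage(invoice_data: dict, auto_accept_threshold: float = 0.9) -> str:
    """Decide whether a result can be booked directly or needs a human look."""
    validation = invoice_data["validation"]

    # Any flag means the validation pass found a concrete anomaly.
    if validation["flags"]:
        return "manual_review"

    # No flags, but the extraction itself was uncertain.
    if validation["confidence"] < auto_accept_threshold:
        return "manual_review"

    return "auto_accept"
```

The sample output earlier (confidence 0.95, no flags) would be accepted automatically, while a 0.6 result with two flags goes to a person along with the flag descriptions.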

Queue isolation by plan matters. A free tier user processing a 50MB scanned PDF shouldn't block a paying customer's 10KB text invoice. Separate queues with separate concurrency limits solved this.
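
The production setup uses BullMQ queues, but the idea carries over to any language. Here is a small asyncio sketch of per-tier concurrency limits. The tier names and limits are hypothetical, and two tiers stand in for the four queues mentioned above.

```python
import asyncio

# Hypothetical limits: each tier gets its own independent concurrency budget.
TIER_CONCURRENCY = {"free": 1, "paid": 4}


async def run_jobs(jobs, handler):
    """Run (tier, payload) jobs so that one tier can never starve another."""
    semaphores = {
        tier: asyncio.Semaphore(limit) for tier, limit in TIER_CONCURRENCY.items()
    }

    async def run_one(tier, payload):
        # A job only waits for slots in its own tier.
        async with semaphores[tier]:
            return await handler(tier, payload)

    # Results come back in the same order as the input jobs.
    return await asyncio.gather(*(run_one(tier, payload) for tier, payload in jobs))
```

With these limits, two slow free-tier jobs run one after the other, while a paid job submitted at the same time starts immediately instead of queuing behind them.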

Try it

The free tier covers 100 documents/month, no credit card required.

Happy to answer questions about the architecture or the extraction approach in the comments.
