DEV Community

Francesco Ira

How I built an invoice extraction API that works on any PDF layout

I kept running into the same problem on client projects: invoice processing.

Every solution required templates: one per supplier, one per layout. Every time a vendor updated their invoice design, something broke. I was maintaining 40+ templates across different projects and it was a nightmare.

So I built Parzo, an API that takes any invoice PDF and returns structured JSON. No templates, no configuration, no training data.

How it works

The flow is simple:

  1. POST the PDF to the endpoint
  2. The API extracts text (or falls back to OCR for scanned documents)
  3. An AI model reads the invoice and extracts every field
  4. A validation layer checks the arithmetic, VAT numbers, and date coherence
  5. You get back clean JSON with a confidence score
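
import time

import requests

API_BASE = "https://api.parzo.dev/v1"

def extract_invoice(pdf_path: str, api_key: str, timeout_s: float = 60.0) -> dict:
    headers = {"X-API-Key": api_key}

    # Submit the PDF; the API answers immediately with a job ID.
    with open(pdf_path, "rb") as f:
        response = requests.post(
            f"{API_BASE}/extract/invoice",
            headers=headers,
            files={"file": f},
            timeout=30,
        )
    response.raise_for_status()  # surface auth or upload errors early
    job_id = response.json()["job_id"]

    # Poll until the job completes, giving up after timeout_s seconds
    # so a stuck or failed job cannot loop forever.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        poll = requests.get(f"{API_BASE}/jobs/{job_id}", headers=headers, timeout=30)
        poll.raise_for_status()
        result = poll.json()

        if result["status"] == "completed":
            return result["result"]

        time.sleep(2)

    raise TimeoutError(f"Job {job_id} did not complete within {timeout_s} seconds")

# Usage
invoice_data = extract_invoice("invoice.pdf", "inv_your_key")
print(invoice_data["financials"]["total"])  # 1220.0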

The output schema

The JSON structure is consistent regardless of the original invoice layout:

{
  "vendor": {
    "name": "Acme Srl",
    "vat_number": "IT12345678903",
    "address": "Via Roma 1, Milano"
  },
  "buyer": {
    "name": "Client SpA",
    "vat_number": "IT98765432109"
  },
  "invoice": {
    "number": "FT-2026-001",
    "date": "2026-04-01",
    "due_date": "2026-05-01",
    "currency": "EUR"
  },
  "financials": {
    "subtotal": 1000.00,
    "tax_rate": 22,
    "tax_amount": 220.00,
    "total": 1220.00
  },
  "line_items": [
    {
      "description": "Consulting services",
      "quantity": 10,
      "unit_price": 100.00,
      "total": 1000.00
    }
  ],
  "validation": {
    "confidence": 0.95,
    "flags": []
  }
}
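
Because the shape is stable, client code can describe it once with types. The sketch below mirrors the sample response above using `TypedDict`; the class names and the `line_items_total` helper are hypothetical, and treating each flag as a plain string is an assumption (the post only says flags carry a description).

```python
from typing import TypedDict


class Party(TypedDict, total=False):
    name: str
    vat_number: str
    address: str  # present for the vendor in the sample, absent for the buyer


class InvoiceMeta(TypedDict):
    number: str
    date: str      # ISO 8601, e.g. "2026-04-01"
    due_date: str
    currency: str


class Financials(TypedDict):
    subtotal: float
    tax_rate: float
    tax_amount: float
    total: float


class LineItem(TypedDict):
    description: str
    quantity: float
    unit_price: float
    total: float


class Validation(TypedDict):
    confidence: float
    flags: list[str]  # assumption: each flag is a description string


class InvoiceResult(TypedDict):
    vendor: Party
    buyer: Party
    invoice: InvoiceMeta
    financials: Financials
    line_items: list[LineItem]
    validation: Validation


def line_items_total(result: InvoiceResult) -> float:
    """Sum the per-line totals so they can be compared with the subtotal."""
    return sum(item["total"] for item in result["line_items"])
```

With these in place, a type checker can catch a misspelled key such as `result["financial"]` before the code ever runs.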

The validation layer

This is the part I'm most proud of. After extraction, a second pass checks:

  • Arithmetic — does subtotal + tax = total?
  • VAT number — for Italian invoices, validates the P.IVA checksum (DPR 633/1972 algorithm)
  • VAT rates — are the rates valid for the detected country?
  • Date coherence — is the issue date before the due date?
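
To make the VAT number check concrete, here is a minimal sketch of a Partita IVA check-digit test. It uses the Luhn-style rule commonly documented for the 11-digit Italian VAT number; the function name is hypothetical, and whether Parzo implements the check exactly this way is an assumption.

```python
def is_valid_partita_iva(vat_number: str) -> bool:
    """Check the check digit of an 11-digit Italian VAT number (Partita IVA)."""
    # Accept an optional "IT" country prefix, as in the sample JSON.
    digits = vat_number.strip().upper().removeprefix("IT")
    if len(digits) != 11 or not (digits.isascii() and digits.isdigit()):
        return False

    total = 0
    for index, char in enumerate(digits[:10]):
        value = int(char)
        if index % 2 == 1:  # 2nd, 4th, ... digit counting from 1
            value *= 2
            if value > 9:
                value -= 9  # same as summing the two digits of the product
        total += value

    expected_check_digit = (10 - total % 10) % 10
    return expected_check_digit == int(digits[10])
```

For example, `is_valid_partita_iva("IT12345678903")` passes the check, while changing the last digit makes it fail.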

Any anomaly gets added to the flags array with a description. The confidence score drops accordingly. This catches errors before they hit your accounting system.
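
The flag-and-score idea can be sketched in a few lines. This version covers only the arithmetic and date checks; the flag wording, the rounding tolerance, and the fixed penalty of 0.2 per flag are all hypothetical, since the post does not say how the real score is computed.

```python
from datetime import date


def validate_invoice(invoice: dict, tolerance: float = 0.01) -> dict:
    """Return a validation block with flags and a naive confidence score."""
    flags = []
    fin = invoice["financials"]

    # Arithmetic: subtotal + tax should equal total, within rounding tolerance.
    if abs(fin["subtotal"] + fin["tax_amount"] - fin["total"]) > tolerance:
        flags.append("subtotal + tax_amount does not equal total")

    # The tax amount should match subtotal * rate (rate is a percentage, e.g. 22).
    expected_tax = fin["subtotal"] * fin["tax_rate"] / 100
    if abs(expected_tax - fin["tax_amount"]) > tolerance:
        flags.append("tax_amount does not match subtotal * tax_rate")

    # Date coherence: the issue date must not be after the due date.
    meta = invoice["invoice"]
    if date.fromisoformat(meta["date"]) > date.fromisoformat(meta["due_date"]):
        flags.append("issue date is after due date")

    # Hypothetical scoring: start at 1.0 and subtract a fixed penalty per flag.
    confidence = max(0.0, round(1.0 - 0.2 * len(flags), 2))
    return {"confidence": confidence, "flags": flags}
```

Run against the sample response above, this returns no flags; changing the total to 1300.00 produces one arithmetic flag and a lower score.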

The tech stack

  • Runtime: Bun
  • Framework: Hono
  • Queue: BullMQ with 4 separate queues by plan tier
  • Database: PostgreSQL via Drizzle ORM
  • Storage: Cloudflare R2 (EU region, 24h deletion)
  • AI routing: smaller model for text PDFs, larger model for scanned documents

The async queue design means the API responds immediately with a job ID and processes in the background. For most text-based PDFs the result is ready in under a second.
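
The text-versus-scan routing can be pictured as a single decision made after text extraction. In this sketch the 50-character threshold, the function name, and the `"small"`/`"large"` labels are hypothetical stand-ins; the post does not describe how the real router decides.

```python
def choose_pipeline(extracted_text: str, min_chars: int = 50) -> dict:
    """Pick OCR and model size from how much text the PDF's text layer yielded."""
    has_text_layer = len(extracted_text.strip()) >= min_chars
    if has_text_layer:
        # Text PDF: skip OCR and use the smaller model.
        return {"needs_ocr": False, "model": "small"}
    # Little or no text: treat it as a scan, run OCR, use the larger model.
    return {"needs_ocr": True, "model": "large"}
```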

GDPR by design

Since this handles financial documents, I made some deliberate choices:

  • EU-only hosting (Hetzner Frankfurt)
  • PDFs deleted automatically after 24 hours
  • No document content in logs
  • No data transfer outside EU

What I learned building this

Template-free extraction is harder than it sounds. The first version had terrible consistency: the same invoice processed twice would return slightly different field names. The solution was a strict JSON schema enforced at the prompt level, not at the parsing level.
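
As a rough illustration of prompt-level enforcement, the sketch below embeds the exact key names in the prompt and adds a regression check that two runs return the same field names. The trimmed schema, the prompt wording, and every function name here are hypothetical, not Parzo's actual prompt.

```python
import json

# Hypothetical, trimmed-down schema; a real one would cover every field.
INVOICE_SCHEMA = {
    "vendor": {"name": "string", "vat_number": "string"},
    "invoice": {"number": "string", "date": "YYYY-MM-DD"},
    "financials": {"subtotal": "number", "tax_amount": "number", "total": "number"},
}


def build_prompt(invoice_text: str) -> str:
    """Embed the exact output schema in the prompt so field names do not drift."""
    return (
        "Extract the invoice below into JSON.\n"
        "Use exactly these keys and no others; use null for missing values:\n"
        f"{json.dumps(INVOICE_SCHEMA, indent=2)}\n\n"
        f"Invoice text:\n{invoice_text}"
    )


def key_paths(obj: dict, prefix: str = "") -> set[str]:
    """Flatten nested keys into dotted paths, e.g. 'financials.total'."""
    paths = set()
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths |= key_paths(value, f"{path}.")
        else:
            paths.add(path)
    return paths


def has_schema_keys(model_output: dict) -> bool:
    """Test-time check: did the model return exactly the schema's field names?"""
    return key_paths(model_output) == key_paths(INVOICE_SCHEMA)
```

The key check is meant for a test suite that replays the same invoice several times, not as a runtime parser that silently repairs bad output.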

Confidence scores are more useful than binary success/failure. A result with confidence 0.6 and two flags beats a failed extraction, because it tells you exactly what to check manually.
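
On the consuming side, that usually turns into a small triage step. This is a minimal sketch; the 0.9 threshold and the action names are hypothetical and should be tuned to your own tolerance for manual review.

```python
def triage(validation: dict, auto_accept_at: float = 0.9) -> dict:
    """Map a validation block to a next step instead of a pass/fail boolean."""
    if validation["confidence"] >= auto_accept_at and not validation["flags"]:
        return {"action": "auto_accept", "check": []}
    # Anything flagged or low-confidence goes to a human, with the flags
    # attached so the reviewer knows where to look first.
    return {"action": "manual_review", "check": list(validation["flags"])}
```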

Queue isolation by plan matters. A free tier user processing a 50MB scanned PDF shouldn't block a paying customer's 10KB text invoice. Separate queues with separate concurrency limits solved this.
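
The production queues run on BullMQ, but the isolation idea is easy to show in Python with `asyncio`. In this sketch each tier gets its own queue and its own number of workers; the tier names and limits are hypothetical (the post only says there are four queues by plan tier), and `handler` stands in for the real extraction job.

```python
import asyncio

# Hypothetical tiers and limits, chosen only for illustration.
CONCURRENCY_BY_TIER = {"free": 1, "paid": 4}


async def run_workers(jobs_by_tier: dict, handler) -> list:
    """Drain one queue per tier, each with its own concurrency limit.

    Returns (tier, result) pairs in completion order.
    """
    completed = []

    async def worker(tier: str, queue: asyncio.Queue) -> None:
        while True:
            try:
                job = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # this tier's queue is drained
            completed.append((tier, await handler(job)))

    tasks = []
    for tier, jobs in jobs_by_tier.items():
        queue = asyncio.Queue()
        for job in jobs:
            queue.put_nowait(job)
        # One worker task per allowed concurrent job on this tier.
        for _ in range(CONCURRENCY_BY_TIER[tier]):
            tasks.append(asyncio.create_task(worker(tier, queue)))

    await asyncio.gather(*tasks)
    return completed
```

Because a slow job only occupies a worker on its own tier's queue, a large free-tier scan never holds up a small paid job, even if the scan was submitted first.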

Try it

Free tier is 100 documents/month, no credit card required.

Happy to answer questions about the architecture or the extraction approach in the comments.