How I built an invoice extraction API that works on any PDF layout
I kept running into the same problem on client projects: invoice processing.
Every solution required templates: one per supplier, one per layout. Every time a vendor updated their invoice design, something broke. I was maintaining 40+ templates across different projects and it was a nightmare.
So I built Parzo, an API that takes any invoice PDF and returns structured JSON. No templates, no configuration, no training data.
How it works
The flow is simple:
- POST the PDF to the endpoint
- The API extracts text (or falls back to OCR for scanned documents)
- An AI model reads the invoice and extracts every field
- A validation layer checks the arithmetic, VAT numbers, and date coherence
- You get back clean JSON with a confidence score
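import time

import requests

API_BASE = "https://api.parzo.dev/v1"


def extract_invoice(pdf_path: str, api_key: str, timeout_s: float = 60.0) -> dict:
    headers = {"X-API-Key": api_key}

    # Upload the PDF; the API answers immediately with a job ID.
    with open(pdf_path, "rb") as f:
        response = requests.post(
            f"{API_BASE}/extract/invoice",
            headers=headers,
            files={"file": f},
            timeout=30,
        )
    response.raise_for_status()
    job_id = response.json()["job_id"]

    # Poll every 2 seconds, giving up after timeout_s instead of looping forever.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        poll = requests.get(f"{API_BASE}/jobs/{job_id}", headers=headers, timeout=30)
        poll.raise_for_status()
        result = poll.json()
        if result["status"] == "completed":
            return result["result"]
        time.sleep(2)
    raise TimeoutError(f"Job {job_id} did not complete within {timeout_s}s")


# Usage
invoice_data = extract_invoice("invoice.pdf", "inv_your_key")
print(invoice_data["financials"]["total"])  # 1220.0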
The output schema
The JSON structure is consistent regardless of the original invoice layout:
{
  "vendor": {
    "name": "Acme Srl",
    "vat_number": "IT12345678903",
    "address": "Via Roma 1, Milano"
  },
  "buyer": {
    "name": "Client SpA",
    "vat_number": "IT98765432109"
  },
  "invoice": {
    "number": "FT-2026-001",
    "date": "2026-04-01",
    "due_date": "2026-05-01",
    "currency": "EUR"
  },
  "financials": {
    "subtotal": 1000.00,
    "tax_rate": 22,
    "tax_amount": 220.00,
    "total": 1220.00
  },
  "line_items": [
    {
      "description": "Consulting services",
      "quantity": 10,
      "unit_price": 100.00,
      "total": 1000.00
    }
  ],
  "validation": {
    "confidence": 0.95,
    "flags": []
  }
}
The validation layer
This is the part I'm most proud of. After extraction, a second pass checks:
- Arithmetic: does subtotal + tax equal the total?
- VAT number: for Italian invoices, does the P.IVA checksum validate (DPR 633/1972 algorithm)?
- VAT rates: are the rates valid for the detected country?
- Date coherence: is the issue date before the due date?
Any anomaly gets added to the flags array with a description. The confidence score drops accordingly. This catches errors before they hit your accounting system.
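Here is a minimal sketch of what such a second pass could look like, assuming the output schema shown earlier. The names `piva_checksum_ok` and `validate`, the 0.01 rounding tolerance, and the flag wording are illustrative assumptions, not Parzo's actual implementation. The checksum is the usual Luhn-style control digit for 11-digit Italian VAT numbers. The VAT-rate check is left out because it needs a per-country rate table.

```python
from datetime import date


def piva_checksum_ok(piva: str) -> bool:
    """Check the control digit of an 11-digit Italian P.IVA (no "IT" prefix)."""
    if len(piva) != 11 or not piva.isdigit():
        return False
    total = 0
    for i, ch in enumerate(piva[:10]):
        d = int(ch)
        if i % 2 == 1:  # 2nd, 4th, ... digit: double it, fold back to one digit
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10 == int(piva[10])


def validate(invoice: dict, tolerance: float = 0.01) -> list:
    """Return a list of flag descriptions; an empty list means no anomaly."""
    flags = []

    fin = invoice["financials"]
    if abs(fin["subtotal"] + fin["tax_amount"] - fin["total"]) > tolerance:
        flags.append("arithmetic: subtotal + tax_amount does not match total")

    vat = invoice["vendor"]["vat_number"]
    if vat.startswith("IT") and not piva_checksum_ok(vat[2:]):
        flags.append("vat_number: vendor P.IVA checksum failed")

    inv = invoice["invoice"]
    if date.fromisoformat(inv["date"]) > date.fromisoformat(inv["due_date"]):
        flags.append("dates: issue date is after due date")

    return flags
```

Run against the sample output above, `validate` returns an empty list. Changing `total` to `1230.00` would add the arithmetic flag.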
The tech stack
- Runtime: Bun
- Framework: Hono
- Queue: BullMQ with 4 separate queues by plan tier
- Database: PostgreSQL via Drizzle ORM
- Storage: Cloudflare R2 (EU region, 24h deletion)
- AI routing: smaller model for text PDFs, larger model for scanned documents
The async queue design means the API responds immediately with a job ID and processes in the background. For most text-based PDFs the result is ready in under a second.
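The routing between a smaller and a larger model can be pictured with a simple heuristic: if the PDF's text layer yields enough characters per page, treat it as a text PDF, otherwise send it down the OCR path. The sketch below is only an illustration of that idea. The function name `choose_route`, the 50-characters-per-page threshold, and the model labels are all assumptions; the real service may decide differently.

```python
def choose_route(extracted_text: str, page_count: int,
                 min_chars_per_page: int = 50) -> str:
    """Return "text" if the PDF has a usable text layer, else "ocr"."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count non-whitespace characters so blank pages don't look like text.
    chars = sum(1 for c in extracted_text if not c.isspace())
    return "text" if chars / page_count >= min_chars_per_page else "ocr"


# Hypothetical model labels; the post only says "smaller" and "larger".
MODEL_FOR_ROUTE = {"text": "small-model", "ocr": "large-model"}
```

A scanned invoice typically produces little or no extractable text, so it would fall through to `"ocr"` and the larger model.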
GDPR by design
Since this handles financial documents, I made some deliberate choices:
- EU-only hosting (Hetzner Frankfurt)
- PDFs deleted automatically after 24 hours
- No document content in logs
- No data transfer outside EU
What I learned building this
Template-free extraction is harder than it sounds. The first version had terrible consistency: the same invoice processed twice would return slightly different field names. The solution was a strict JSON schema enforced at the prompt level, not at the parsing level.
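To make "enforced at the prompt level" concrete, here is one way it could be done: embed the fixed schema in every prompt so the model is told exactly which keys to emit. This is a sketch under assumptions. The prompt wording, the `SCHEMA` type hints, and the `build_prompt` helper are hypothetical, and the `validation` block is left out because it is produced by the second pass rather than by the model.

```python
import json

# Field names copied from the output schema; values describe the expected type.
SCHEMA = {
    "vendor": {"name": "string", "vat_number": "string", "address": "string"},
    "buyer": {"name": "string", "vat_number": "string"},
    "invoice": {"number": "string", "date": "YYYY-MM-DD",
                "due_date": "YYYY-MM-DD", "currency": "ISO 4217 code"},
    "financials": {"subtotal": "number", "tax_rate": "number",
                   "tax_amount": "number", "total": "number"},
    "line_items": [{"description": "string", "quantity": "number",
                    "unit_price": "number", "total": "number"}],
}


def build_prompt(invoice_text: str) -> str:
    """Embed the fixed schema in the prompt to discourage field-name drift."""
    return (
        "Extract the invoice below into JSON.\n"
        "Use exactly these keys, no others; use null for missing values:\n"
        f"{json.dumps(SCHEMA, indent=2)}\n\n"
        f"Invoice text:\n{invoice_text}"
    )
```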
Confidence scores are more useful than binary success/failure. A result with confidence 0.6 and two flags beats a failed extraction, because you know what to check manually.
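On the client side, that translates into a small triage step. The sketch below assumes the `validation` block from the output schema; the `triage` name and the 0.9 auto-accept threshold are illustrative choices, not values recommended by the API.

```python
def triage(result: dict, auto_accept_at: float = 0.9) -> tuple:
    """Return ("accept" or "review", flags) for one extraction result."""
    validation = result["validation"]
    flags = list(validation["flags"])
    # Accept automatically only when nothing was flagged and confidence is high.
    if not flags and validation["confidence"] >= auto_accept_at:
        return "accept", flags
    return "review", flags
```

Anything routed to `"review"` arrives with its flags, so whoever checks it knows where to look first.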
Queue isolation by plan matters. A free-tier user processing a 50 MB scanned PDF shouldn't block a paying customer's 10 KB text invoice. Separate queues with separate concurrency limits solved this.
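The production setup uses BullMQ, but the principle can be shown in a few lines of plain `asyncio`: each tier gets its own concurrency limit, so slow jobs in one tier never hold a slot that another tier needs. This is an in-process analogy, not the actual queue code. The tier names, limits, and job durations are made up for the demo.

```python
import asyncio

# Hypothetical per-tier concurrency limits.
LIMITS = {"free": 1, "pro": 4}


async def run_tier(name: str, jobs: list, limit: int, log: list) -> None:
    """Run (job_id, seconds) pairs with at most `limit` jobs in flight."""
    sem = asyncio.Semaphore(limit)  # one semaphore per tier = isolated capacity

    async def run(job_id: str, seconds: float) -> None:
        async with sem:
            await asyncio.sleep(seconds)  # stand-in for the real extraction work
            log.append((name, job_id))

    await asyncio.gather(*(run(job_id, seconds) for job_id, seconds in jobs))


async def main() -> list:
    log = []
    await asyncio.gather(
        run_tier("free", [("big-scan-1", 0.2), ("big-scan-2", 0.2)], LIMITS["free"], log),
        run_tier("pro", [("small-text", 0.01)], LIMITS["pro"], log),
    )
    return log  # completion order
```

Calling `asyncio.run(main())` shows the small paid job finishing first even though two slow free-tier jobs were submitted ahead of it.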
Try it
Free tier is 100 documents/month, no credit card required.
Happy to answer questions about the architecture or the extraction approach in the comments.