Invoice processing is one of those problems that looks simple until you actually try to build it. Reading data from a PDF invoice should be straightforward — but the moment you encounter 50 different vendor layouts, foreign languages, scanned images, and multi-page documents, your initial approach falls apart. Here's an honest comparison of the three main approaches.
Approach 1: Regex and String Parsing
For a single, controlled invoice format, regex works fine:
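```javascript
// Extract three fields from the plain text of a single, known invoice layout.
function extractInvoiceData(text) {
  const invoiceNumber = text.match(/Invoice\s*#?\s*([A-Z0-9-]+)/i)?.[1];
  // \b keeps "Subtotal" from matching; ":" and "$" may both precede the amount.
  const total = text.match(/\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})/i)?.[1];
  const date = text.match(/(\d{1,2}[\/.-]\d{1,2}[\/.-]\d{2,4})/)?.[1];
  return { invoiceNumber, total, date };
}
```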
When it works: Internal documents with consistent formatting, fixed-template invoices from a single vendor, structured data exports.
When it breaks: Any layout change. "Invoice No:" vs "Invoice Number:" vs "Ref:" vs just printing the number without a label. International date formats. Currency symbols in different positions. Thousands separators (1.234,56 vs 1,234.56).
Reality: regex-based extraction needs constant maintenance. Every new vendor format requires code changes.
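To make the thousands-separator problem concrete, here is a minimal sketch of the kind of normalization code that regex pipelines tend to accumulate. `parseAmount` is a hypothetical helper, not part of any library, and it assumes that a final separator followed by one or two digits is the decimal mark while every other separator is grouping. That assumption holds for typical currency amounts but not for values printed with three decimals.

```javascript
// Hypothetical helper: turn "1.234,56" or "1,234.56" into the number 1234.56.
// Assumption: a trailing separator followed by 1-2 digits is the decimal mark;
// every other "." or "," is treated as a thousands separator.
function parseAmount(raw) {
  // Drop currency symbols, spaces, and anything else that is not part of the number.
  const cleaned = raw.replace(/[^\d.,-]/g, '');
  const m = cleaned.match(/^(-?)([\d.,]*?)(?:[.,](\d{1,2}))?$/);
  if (!m) return null;
  const integerPart = m[2].replace(/[.,]/g, '');
  if (integerPart === '') return null;
  const value = Number(`${m[1]}${integerPart}.${m[3] ?? '0'}`);
  return Number.isFinite(value) ? value : null;
}

console.log(parseAmount('1.234,56')); // → 1234.56
console.log(parseAmount('$1,234.56')); // → 1234.56
console.log(parseAmount('1,234')); // → 1234
```

Even this small helper has to guess, which is the point: each new vendor convention adds another branch to maintain.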
Approach 2: Template Matching
Template matching defines anchor points in a document layout (coordinates or text markers) and extracts data from fixed positions relative to those anchors.
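```python
# Example with a hypothetical template engine; extract_with_template and
# pdf_path are placeholders, not a real library API.
template = {
    # Text marker: the value is located relative to a label
    'invoice_number': {'after': 'Invoice Number:', 'line': 0, 'field': 0},
    # Coordinates: the value is located at an offset from the anchor text
    'total': {'anchor': 'TOTAL DUE', 'offset_y': 0, 'offset_x': 150},
}
result = extract_with_template(pdf_path, template)
```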
When it works: High-volume, single-vendor processing (e.g., processing all invoices from one supplier). Government forms with fixed layouts.
When it breaks: You need one template per vendor, so a 200-vendor AP operation needs 200 templates. Fixed offsets also fail on PDFs with dynamic layouts (the total row moves with the number of line items) and on scanned documents with slight rotation or skew.
Template matching is maintenance-intensive at scale.
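For a feel of what a text-marker template looks like in practice, here is a minimal sketch that works on already-extracted plain text. Everything in it is an assumption for illustration: `extractWithTemplate`, the `after` rule, and the `acmeTemplate` labels are hypothetical, and a real engine would also handle coordinates and multi-line fields.

```javascript
// Hypothetical per-vendor template: each field names the label that precedes its value.
const acmeTemplate = {
  invoiceNumber: { after: 'Invoice Number:' },
  total: { after: 'TOTAL DUE' },
};

// Return the text that follows each label on its line, or null when the label is missing.
function extractWithTemplate(text, template) {
  const lines = text.split(/\r?\n/);
  const result = {};
  for (const [field, rule] of Object.entries(template)) {
    const line = lines.find((l) => l.includes(rule.after));
    result[field] = line
      ? line.slice(line.indexOf(rule.after) + rule.after.length).trim() || null
      : null;
  }
  return result;
}

const sample = 'Acme Corporation\nInvoice Number: INV-2026-0142\nTOTAL DUE   8,250.00';
console.log(extractWithTemplate(sample, acmeTemplate));
// → { invoiceNumber: 'INV-2026-0142', total: '8,250.00' }
```

A vendor that prints "Ref:" instead of "Invoice Number:" gets `null` for that field until someone writes a second template, which is where the per-vendor maintenance cost comes from.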
Approach 3: AI-Powered Extraction
Modern document AI models, trained on large document collections, infer fields from a document's structure and context instead of relying on fixed patterns or positions:
```javascript
// Using an AI document API
const response = await fetch('https://parseflow.dev/api/extract', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.PARSEFLOW_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    document_url: 'https://your-bucket.s3.amazonaws.com/invoice.pdf',
    fields: ['invoice_number', 'date', 'vendor_name', 'line_items', 'subtotal', 'tax', 'total'],
  }),
});
// fetch only rejects on network errors, so check the HTTP status explicitly
if (!response.ok) {
  throw new Error(`Extraction failed: HTTP ${response.status}`);
}
const data = await response.json();
// Returns structured JSON regardless of layout
```
Output:
```json
{
  "invoice_number": "INV-2026-0142",
  "date": "2026-04-01",
  "vendor_name": "Acme Corporation",
  "line_items": [
    { "description": "SaaS Platform License", "quantity": 1, "unit_price": 5000.00, "total": 5000.00 },
    { "description": "Implementation Services", "quantity": 10, "unit_price": 250.00, "total": 2500.00 }
  ],
  "subtotal": 7500.00,
  "tax": 750.00,
  "total": 8250.00
}
```
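Whatever produces the JSON, it is worth checking that the numbers add up before posting them to a ledger. The sketch below is one way to do that, assuming the response has the shape shown above; `checkInvoiceTotals` and the half-cent tolerance are hypothetical choices, not part of any API.

```javascript
// Hypothetical post-extraction check: returns a list of arithmetic inconsistencies.
function checkInvoiceTotals(invoice, tolerance = 0.005) {
  const close = (a, b) => Math.abs(a - b) <= tolerance;
  const problems = [];
  for (const item of invoice.line_items) {
    if (!close(item.quantity * item.unit_price, item.total)) {
      problems.push(`line item "${item.description}" does not equal quantity x unit_price`);
    }
  }
  const lineSum = invoice.line_items.reduce((sum, item) => sum + item.total, 0);
  if (!close(lineSum, invoice.subtotal)) problems.push('line items do not sum to subtotal');
  if (!close(invoice.subtotal + invoice.tax, invoice.total)) {
    problems.push('subtotal + tax does not equal total');
  }
  return problems;
}

const extracted = {
  line_items: [
    { description: 'SaaS Platform License', quantity: 1, unit_price: 5000.0, total: 5000.0 },
    { description: 'Implementation Services', quantity: 10, unit_price: 250.0, total: 2500.0 },
  ],
  subtotal: 7500.0,
  tax: 750.0,
  total: 8250.0,
};
console.log(checkInvoiceTotals(extracted)); // → []
```

An empty list means the extracted figures are internally consistent; anything else is a good candidate for manual review.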
When it works: Variable layouts, multiple vendors, multi-language documents, scanned PDFs, Word/Excel files.
Limitations: Higher cost per document than local processing. Privacy considerations for sensitive documents (check your provider's data handling).
Honest Comparison
| Factor | Regex | Templates | AI API |
|---|---|---|---|
| Setup time | Low | Medium | Very low |
| Accuracy (known format) | High | High | High |
| Accuracy (variable formats) | Low | Medium | High |
| Maintenance burden | High | High | Low |
| Cost per document | Negligible (local compute) | Negligible (local compute) | Small fee |
| Work to add a new vendor | Code changes | New template | None |
The Practical Choice
For most accounts payable automation projects:
- Fewer than 5 vendors, stable formats: Regex is fine
- 5 to 50 vendors, some variation: Templates + regex hybrid
- More than 50 vendors, or unknown formats: AI extraction is the only practical option
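If you want that rule of thumb in code, for example to route documents in a mixed pipeline, it could look like the sketch below. `chooseApproach` and its option names are hypothetical, and treating exactly 50 vendors as still within the hybrid range is an assumption.

```javascript
// Hypothetical helper encoding the rule of thumb above.
// Assumption: exactly 50 vendors still counts as the hybrid range.
function chooseApproach({ vendorCount, formatsKnown = true }) {
  if (!formatsKnown || vendorCount > 50) return 'ai';
  if (vendorCount >= 5) return 'templates+regex';
  return 'regex';
}

console.log(chooseApproach({ vendorCount: 3 })); // → regex
console.log(chooseApproach({ vendorCount: 20 })); // → templates+regex
console.log(chooseApproach({ vendorCount: 200 })); // → ai
```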
ParseFlow uses the AI approach, handling PDF, Word, and Excel with a single API endpoint. The free tier covers 100 documents/month — enough to validate the approach before committing.