DEV Community

DevToolsmith

Structured Data Extraction from PDFs: Regex vs Template Matching vs AI

Invoice processing is one of those problems that looks simple until you actually try to build it. Reading data from a PDF invoice should be straightforward — but the moment you encounter 50 different vendor layouts, foreign languages, scanned images, and multi-page documents, your initial approach falls apart. Here's an honest comparison of the three main approaches.

Approach 1: Regex and String Parsing

For a single, controlled invoice format, regex works fine:

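function extractInvoiceData(text) {
  // Require a digit in the value so "Invoice Date" or "Invoice Total" is not captured as the number
  const invoiceNumber = text.match(/Invoice\s*(?:#|No\.?|Number)?\s*:?\s*([A-Z0-9-]*\d[A-Z0-9-]*)/i)?.[1];
  // \b stops "Subtotal" from matching first; the colon and the $ are each optional
  const total = text.match(/\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})/i)?.[1];
  // Numeric dates only; day/month order is not resolved here
  const date  = text.match(/(\d{1,2}[\/.-]\d{1,2}[\/.-]\d{2,4})/)?.[1];

  return { invoiceNumber, total, date };
}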

When it works: Internal documents with consistent formatting, fixed-template invoices from a single vendor, structured data exports.

When it breaks: Any layout change. "Invoice No:" vs "Invoice Number:" vs "Ref:" vs just printing the number without a label. International date formats. Currency symbols in different positions. Thousands separators (1.234,56 vs 1,234.56).
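The separator problem is a good illustration of how quickly the edge cases pile up. Here is a minimal sketch of an amount normalizer, assuming the input is just the captured number (for example the `total` string from the function above). `parseAmount` is a hypothetical helper, and the heuristics in it are assumptions rather than a standard: the last separator is treated as the decimal mark, and only if one or two digits follow it.

```javascript
// Hypothetical helper (not from any library): turn "1,234.56" or "1.234,56"
// into a JavaScript number.
function parseAmount(raw) {
  // Keep only digits and the two candidate separator characters.
  const cleaned = String(raw).replace(/[^\d.,]/g, '');
  // Assumption: whichever separator appears last is the decimal mark.
  const decimalPos = Math.max(cleaned.lastIndexOf('.'), cleaned.lastIndexOf(','));

  // Assumption: a decimal part has one or two digits, so "1,234" is read
  // as a thousands group, not as 1.234.
  const digitsAfter = decimalPos === -1 ? 0 : cleaned.length - decimalPos - 1;
  const hasDecimals = digitsAfter >= 1 && digitsAfter <= 2;

  if (!hasDecimals) {
    const digits = cleaned.replace(/[.,]/g, '');
    return digits === '' ? NaN : Number(digits);
  }
  const intPart = cleaned.slice(0, decimalPos).replace(/[.,]/g, '');
  const fracPart = cleaned.slice(decimalPos + 1);
  return Number(`${intPart}.${fracPart}`);
}

console.log(parseAmount('1,234.56')); // 1234.56
console.log(parseAmount('1.234,56')); // 1234.56
```

Even this small helper guesses wrong on inputs like a three-decimal currency, which is exactly the kind of special case that keeps accumulating with regex-based extraction.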

Reality: regex-based extraction needs constant maintenance. Every new vendor format requires code changes.

Approach 2: Template Matching

Template matching defines anchor points in a document layout (coordinates or text markers) and extracts data from fixed positions relative to those anchors.

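# Example with a hypothetical template engine: extract_with_template() and
# pdf_path are placeholders, not a real library API
template = {
  # Text marker: take the first field after the label, on the same line
  'invoice_number': { 'after': 'Invoice Number:', 'line': 0, 'field': 0 },
  # Coordinate anchor: read the value at a fixed horizontal offset from the label
  'total': { 'anchor': 'TOTAL DUE', 'offset_y': 0, 'offset_x': 150 },
}
result = extract_with_template(pdf_path, template)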

When it works: High-volume, single-vendor processing (e.g., processing all invoices from one supplier). Government forms with fixed layouts.

When it breaks: You need one template per vendor, so a 200-vendor AP operation needs 200 templates. Fixed offsets also fail on PDFs with dynamic layouts (the total row moves with the number of line items) and on scanned documents with slight rotation or skew.

Template matching is maintenance-intensive at scale.
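To make the idea concrete, here is a minimal sketch of the text-marker flavor of template matching, working on already-extracted text rather than on PDF coordinates. Everything in it is hypothetical: the template format, the `extractWithTemplate` function, and the sample invoice text are made up for illustration and do not correspond to any real template engine.

```javascript
// Hypothetical template: each field names a text anchor and how many lines
// below the anchor the value sits (0 = same line, after the anchor).
const acmeTemplate = {
  invoice_number: { anchor: 'Invoice Number:', lineOffset: 0 },
  total: { anchor: 'TOTAL DUE', lineOffset: 0 },
  due_date: { anchor: 'Payment due', lineOffset: 1 },
};

function extractWithTemplate(text, template) {
  const lines = text.split(/\r?\n/);
  const result = {};
  for (const [field, rule] of Object.entries(template)) {
    const anchorLine = lines.findIndex((line) => line.includes(rule.anchor));
    if (anchorLine === -1) {
      result[field] = null; // anchor missing: this layout does not fit the template
      continue;
    }
    const target = lines[anchorLine + rule.lineOffset] ?? '';
    // On the anchor's own line keep what follows the anchor; otherwise the whole line.
    const value = rule.lineOffset === 0
      ? target.slice(target.indexOf(rule.anchor) + rule.anchor.length)
      : target;
    result[field] = value.trim() || null;
  }
  return result;
}

// Made-up sample text for one vendor layout.
const sample = [
  'Acme Corporation',
  'Invoice Number: INV-2026-0142',
  'Payment due',
  '2026-05-01',
  'TOTAL DUE    8,250.00',
].join('\n');

console.log(extractWithTemplate(sample, acmeTemplate));
```

Because the anchors are text rather than absolute positions, this version survives the total row moving up or down. It still returns `null` for every field as soon as a vendor writes "Amount payable" instead of "TOTAL DUE", which is why each vendor ends up needing its own template.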

Approach 3: AI-Powered Extraction

Modern document AI models are trained on large document corpora and locate fields by meaning and context rather than by fixed patterns or positions, so the same request works across layouts:

// Using an AI document API
const response = await fetch('https://parseflow.dev/api/extract', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.PARSEFLOW_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    document_url: 'https://your-bucket.s3.amazonaws.com/invoice.pdf',
    fields: ['invoice_number', 'date', 'vendor_name', 'line_items', 'subtotal', 'tax', 'total'],
  }),
});

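// Fail loudly on HTTP errors instead of parsing an error body as invoice data
if (!response.ok) {
  throw new Error(`Extraction failed: ${response.status} ${response.statusText}`);
}

const data = await response.json();
// Returns structured JSON regardless of layout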

Output:

{
  "invoice_number": "INV-2026-0142",
  "date": "2026-04-01",
  "vendor_name": "Acme Corporation",
  "line_items": [
    { "description": "SaaS Platform License", "quantity": 1, "unit_price": 5000.00, "total": 5000.00 },
    { "description": "Implementation Services", "quantity": 10, "unit_price": 250.00, "total": 2500.00 }
  ],
  "subtotal": 7500.00,
  "tax": 750.00,
  "total": 8250.00
}

When it works: Variable layouts, multiple vendors, multi-language documents, scanned PDFs, Word/Excel files.

Limitations: Higher cost per document than local processing. Privacy considerations for sensitive documents (check your provider's data handling).
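Whatever produces the JSON, the numbers in an invoice can be cross-checked against each other before they reach your accounting system. The sketch below assumes the field names from the sample output above. `checkInvoiceTotals` and the 0.01 tolerance are hypothetical choices for illustration, not part of any provider's API.

```javascript
// Hypothetical post-processing check on extracted invoice data.
function checkInvoiceTotals(invoice, tolerance = 0.01) {
  const problems = [];
  const near = (a, b) => Math.abs(a - b) <= tolerance;

  // Each line item: quantity x unit price should equal the line total.
  for (const [i, item] of invoice.line_items.entries()) {
    if (!near(item.quantity * item.unit_price, item.total)) {
      problems.push(`line ${i + 1}: quantity x unit_price != total`);
    }
  }
  // Line totals should add up to the subtotal, and subtotal + tax to the total.
  const lineSum = invoice.line_items.reduce((sum, item) => sum + item.total, 0);
  if (!near(lineSum, invoice.subtotal)) problems.push('line items do not add up to subtotal');
  if (!near(invoice.subtotal + invoice.tax, invoice.total)) problems.push('subtotal + tax != total');

  return problems; // an empty array means the numbers are internally consistent
}

// The sample output from above.
const extracted = {
  line_items: [
    { description: 'SaaS Platform License', quantity: 1, unit_price: 5000.0, total: 5000.0 },
    { description: 'Implementation Services', quantity: 10, unit_price: 250.0, total: 2500.0 },
  ],
  subtotal: 7500.0,
  tax: 750.0,
  total: 8250.0,
};

console.log(checkInvoiceTotals(extracted)); // []
```

A non-empty result is a cheap signal to route the document to manual review instead of trusting the extraction.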

Honest Comparison

| Factor | Regex | Templates | AI API |
| --- | --- | --- | --- |
| Setup time | Low | Medium | Very low |
| Accuracy (known format) | High | High | High |
| Accuracy (variable formats) | Low | Medium | High |
| Maintenance burden | High | High | Low |
| Cost per document | Zero | Zero | Small fee |
| Work to add a new vendor | Code changes | New template | None |

The Practical Choice

For most accounts payable automation projects:

  • Under 5 vendors, stable formats: Regex is fine
  • 5-50 vendors, some variation: Templates + regex hybrid
  • 50+ vendors or unknown formats: AI extraction is the only practical option
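If you want this rule of thumb in code, for example to decide how to route a new customer's documents, a sketch could look like the following. `chooseApproach` is a hypothetical helper. The thresholds come from the list above, and since the list leaves exactly 50 vendors ambiguous, this sketch assumes that case goes to AI extraction.

```javascript
// Hypothetical helper encoding the rule of thumb above.
// vendorCount: number of distinct vendor layouts you expect to process.
// formatsKnown: false when you cannot inspect the layouts in advance.
function chooseApproach(vendorCount, formatsKnown = true) {
  if (!formatsKnown || vendorCount >= 50) return 'ai';
  if (vendorCount < 5) return 'regex';
  return 'templates+regex';
}

console.log(chooseApproach(3));        // regex
console.log(chooseApproach(20));       // templates+regex
console.log(chooseApproach(3, false)); // ai
```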

ParseFlow uses the AI approach, handling PDF, Word, and Excel with a single API endpoint. The free tier covers 100 documents/month — enough to validate the approach before committing.
