zhongqiyue

Posted on Jun 9

Why regex wasn't enough for data extraction (and what I used instead)

#python #ai #api #tutorial

I spent three weeks trying to extract invoice data from a pile of PDFs sent by different vendors. Each one had a different layout, different labels, and occasionally different languages. My first instinct—regex—burned me badly. Let me tell you what I tried, what failed, and what finally worked.

The mess I walked into

A client asked me to parse ~2,000 invoices and dump the fields (invoice number, date, total, vendor name) into a database. Sounded simple. I opened a random PDF and saw:

Invoice # 2024-0031
Date: 15.03.2024
Total: €1,250.00
Vendor: Acme GmbH

Okay, easy. I wrote a few patterns:

import re

pattern_invoice = r'Invoice # (\d+-\d+)'
pattern_date = r'Date: (\d{2}\.\d{2}\.\d{4})'
pattern_total = r'Total: €([\d,]+\.[\d]{2})'
pattern_vendor = r'Vendor: (.+)'

It worked on the first ten files. Then came the second vendor: "Invoice- 2024-031" (a dash instead of the "#"), "Date  :" (extra spaces before the colon), and "Total incl. VAT: 1.250,00 €" (a different label, a decimal comma, and a trailing currency symbol). My heart sank.
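To make the failure concrete, here is a small check that runs the four patterns against both layouts. The second sample is the raw text shown later in this post; the PATTERNS dict and the match_fields helper are names introduced only for this illustration.

```python
import re

# The four patterns from above, keyed by field name (the dict is illustrative).
PATTERNS = {
    "invoice_number": r"Invoice # (\d+-\d+)",
    "date": r"Date: (\d{2}\.\d{2}\.\d{4})",
    "total": r"Total: €([\d,]+\.[\d]{2})",
    "vendor": r"Vendor: (.+)",
}

VENDOR_A = "Invoice # 2024-0031\nDate: 15.03.2024\nTotal: €1,250.00\nVendor: Acme GmbH"
VENDOR_B = "Invoice- 2024-031\nDate  : 15 Mar 2024\nTotal incl. VAT: 1.250,00 €\nVendor: Acme GmbH"


def match_fields(text: str) -> dict:
    """Return the captured value per field, or None where the pattern misses."""
    results = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        results[field] = match.group(1) if match else None
    return results


print(match_fields(VENDOR_A))  # every field is captured
print(match_fields(VENDOR_B))  # only "vendor" is captured
```

Three of the four fields silently come back as None on the second layout, which is exactly the kind of quiet failure that makes pattern-per-vendor parsing hard to trust.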

What I tried that didn't work

More regex: I ended up with a 200-line monster of alternations and optional groups. It broke on the next vendor.
Template matching: Tried to locate known fields by coordinates. Worked for one template, failed on others because some PDFs were scanned images (no text layer).
OCR + heuristics: Used Tesseract, then tried to find “invoice” or “total” near numbers. Too many false positives.
ML classifier: Trained a small NER model with spaCy. Required hundreds of annotated examples. Client didn't have the budget.

I was stuck. Every new vendor required a custom parser. I needed something that understood context, not just patterns.

The approach that finally worked

I decided to use an LLM from the GPT-4 family (gpt-4o-mini in the code below) with function calling. Instead of writing brittle regex, I defined a schema for the structured output and let the model extract the fields. The key insight: treat extraction as a translation problem, turning unstructured text into a structured JSON object.

Here’s the core idea:

Extract raw text from PDF (using pdfplumber or OCR if needed).
Send that text to an LLM with a clear instruction and a function schema.
Parse the returned JSON.

No more pattern maintenance. No more vendor-specific logic.
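The first step can be sketched roughly like this, assuming pdfplumber for PDFs with a text layer and pytesseract as the OCR fallback for scanned pages. The function names and the 20-character threshold are assumptions made for this example, not something taken from the original project.

```python
def has_text_layer(text, min_chars: int = 20) -> bool:
    """Treat a page as scanned when almost no text can be extracted from it."""
    return text is not None and len(text.strip()) >= min_chars


def pdf_to_text(path: str) -> str:
    """Extract text page by page, falling back to OCR for pages without a text layer."""
    # Third-party imports live inside the function so the helper above
    # can be used (and tested) without them installed.
    import pdfplumber
    import pytesseract

    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if not has_text_layer(text):
                # Render the page to a PIL image and run OCR on it.
                image = page.to_image(resolution=300).original
                text = pytesseract.image_to_string(image)
            pages.append(text)
    return "\n".join(pages)
```

Deciding per page rather than per file matters because some PDFs mix real text pages with scanned attachments.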

Code example (Python)

import json
from openai import OpenAI

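# With no arguments, the client reads the key from the OPENAI_API_KEY environment variable,
# which keeps the secret out of source control.
client = OpenAI()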

def extract_invoice(text: str) -> dict:
    function_schema = {
        "name": "extract_invoice",
        "description": "Extract invoice fields from plain text",
        "parameters": {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string", "description": "Invoice number, e.g. 2024-0031"},
                "date": {"type": "string", "description": "Invoice date in YYYY-MM-DD format"},
                "total": {"type": "number", "description": "Total amount as a number"},
                "vendor": {"type": "string"}
            },
            "required": ["invoice_number", "date", "total", "vendor"]
        }
    }

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract invoice data from the provided text. Output JSON matching the function schema."},
            {"role": "user", "content": text}
        ],
        tools=[{"type": "function", "function": function_schema}],
        tool_choice={"type": "function", "function": {"name": "extract_invoice"}}
    )

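    # tool_choice forces a call to extract_invoice, so the fields arrive
    # as a JSON string in the first tool call's arguments.
    tool_call = response.choices[0].message.tool_calls[0]
    return json.loads(tool_call.function.arguments)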

I fed the model raw text like:

Invoice- 2024-031
Date  : 15 Mar 2024
Total incl. VAT: 1.250,00 €
Vendor: Acme GmbH

It returned:

{
  "invoice_number": "2024-031",
  "date": "2024-03-15",
  "total": 1250.0,
  "vendor": "Acme GmbH"
}

Worked first try.

Handling variations

To improve accuracy, I added few-shot examples to the prompt (one is shown below) and asked the model to return the raw text when it couldn’t find a field. I also validated the returned JSON against the schema and retried with a stronger instruction if parsing or validation failed.

messages = [
    {"role": "system", "content": "You are a data extraction assistant. Extract fields exactly as specified."},
    {"role": "user", "content": "Invoice # 2024-0031\nDate: 15.03.2024\nTotal: €1,250.00\nVendor: Acme GmbH"},
    {"role": "assistant", "content": '{"invoice_number": "2024-0031", "date": "2024-03-15", "total": 1250.0, "vendor": "Acme GmbH"}'},
    {"role": "user", "content": text}
]
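The validate-and-retry step can be sketched as follows. This is a minimal version under a few assumptions: call_model is a hypothetical wrapper that takes a message list and returns the model's reply as a JSON string (for example tool_call.function.arguments), and the wording of the stronger instruction is made up for this example.

```python
import json

# Expected type per required field, mirroring the function schema.
REQUIRED_FIELDS = {
    "invoice_number": str,
    "date": str,
    "total": (int, float),
    "vendor": str,
}

STRONGER_INSTRUCTION = (
    "Your previous reply was not valid. Return only a JSON object with the keys "
    "invoice_number, date, total and vendor."
)


def schema_errors(raw: str) -> list:
    """Return a list of problems with the model's reply (empty means valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["reply is not valid JSON"]
    if not isinstance(data, dict):
        return ["reply is not a JSON object"]
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif isinstance(data[field], bool) or not isinstance(data[field], expected):
            errors.append(f"wrong type for field: {field}")
    return errors


def extract_with_retry(messages: list, call_model, max_attempts: int = 2) -> dict:
    """Call the model, validate the reply, and retry with a firmer prompt on failure."""
    attempt_messages = list(messages)
    errors = []
    for _ in range(max_attempts):
        raw = call_model(attempt_messages)
        errors = schema_errors(raw)
        if not errors:
            return json.loads(raw)
        # Feed the failed reply back in and tighten the instruction.
        attempt_messages = attempt_messages + [
            {"role": "assistant", "content": raw},
            {"role": "user", "content": STRONGER_INSTRUCTION},
        ]
    raise ValueError(f"extraction failed after {max_attempts} attempts: {errors}")
```

Passing the model call in as a function keeps the retry logic independent of any one provider, and makes it easy to exercise with canned replies.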

Lessons learned / trade-offs

Cost: GPT-4o mini is cheap (about $0.15 per thousand pages), but at high volume it adds up. Consider smaller models or a dedicated extraction API. (I later found a service, https://ai.interwestinfo.com/, that offers a similar pipeline, but building my own taught me more.)
Latency: Each call takes 1-3 seconds. Run one after another, 2,000 invoices take anywhere from about half an hour to over an hour and a half. Parallel processing helped.
Hallucinations: The model sometimes invents a total or normalizes a date incorrectly. I added post-processing checks: the total must be a number, the date must be parseable, and so on. If confidence is low, the record is flagged for human review.
PII & security: Sending sensitive invoice text to an external API may violate data policies. For some clients, I ran a local model (Llama 3.2) via Ollama—slower, but private.
When NOT to use this: If your data is highly structured and consistent, regex is faster. If you need real-time extraction on a low-power device, LLMs are overkill. Use the right tool for the job.
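On the latency point: at 1-3 seconds per call, 2,000 sequential calls come to 2,000-6,000 seconds. Because the calls are I/O-bound, a thread pool is one simple way to overlap them. The sketch below assumes an extract function like extract_invoice above; extract_all and the default of 8 workers are choices made for this example, and a real run would also need to respect the provider's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_all(texts: list, extract, max_workers: int = 8) -> list:
    """Run `extract` over every text concurrently, keeping results in input order.

    Failures are captured per document so one bad invoice does not stop the batch.
    """
    def safe_extract(text):
        try:
            return {"ok": True, "data": extract(text)}
        except Exception as exc:  # record the failure and keep going
            return {"ok": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(safe_extract, texts))
```

With 8 workers each handling about 250 calls, the ideal runtime drops to roughly 4-13 minutes, before accounting for rate limiting or retries.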

What I’d do differently next time

Start with a small sample of varied documents. Don’t assume uniformity.
Build a validation layer first: extract, validate, retry, flag.
Consider using a purpose-built extraction API for the 80% case, then fall back to LLM for hard cases.
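The "flag" stage of that validation layer could look something like the sketch below. It covers the checks mentioned earlier (the total is a number, the date is parseable) plus two extra rules, negative totals and future dates, which are assumptions added for illustration and may not suit every dataset (credit notes, for instance, can legitimately be negative). The review_reasons name is hypothetical.

```python
from datetime import date


def review_reasons(record: dict) -> list:
    """Return reasons a record needs human review (empty list means it passed)."""
    reasons = []

    total = record.get("total")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        reasons.append("total is not a number")
    elif total < 0:
        reasons.append("total is negative")  # illustrative rule

    try:
        parsed = date.fromisoformat(str(record.get("date")))
    except ValueError:
        reasons.append("date is not a valid ISO date")
    else:
        if parsed > date.today():
            reasons.append("date is in the future")  # illustrative rule

    for field in ("invoice_number", "vendor"):
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            reasons.append(f"{field} is empty")

    return reasons
```

Records with an empty list go straight to the database; anything else goes to a review queue along with its reasons, so a human sees why it was flagged.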

Final thoughts

Regex taught me discipline, but LLM function calling taught me humility. Sometimes the smartest solution is to stop fighting the chaos and let a model that understands language do the heavy lifting. That doesn't mean trusting it blindly—validation is still king—but the time I saved rewriting patterns is enormous.

Now I'm curious: what's your go-to approach when traditional parsing falls short? Do you use LLMs, or have you found another trick?

DEV Community