DEV Community

zhongqiyue

Regex Hell to LLM Function Calling: My Data Extraction Journey

A few months ago, I had a problem that made me question my career choices.

I was staring down a folder with 500+ PDF invoices. Each one had the same fields - invoice number, date, line items, totals - but the formatting was different on every single one. Some had tables, some had columns, some were just text blocks with a dash separator. The client wanted all of it in a clean JSON array.

At first, I thought: "Regex. I'll write a few patterns, test them, and be done in a day." That was naive.

What I tried that didn't work

1. Regex-based extraction

I spent two days writing regexes that worked for 80% of the documents. The remaining 20% had edge cases - missing fields, extra whitespace, merged cells - that broke everything. Every fix for one edge case broke another. I ended up with a sprawling Python script full of conditional logic that was impossible to maintain.
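To make the failure mode concrete, here is a small sketch. The pattern and both sample strings are invented for illustration, not taken from the actual invoices; the point is only that a pattern tuned to one layout silently misses the same data in another.

```python
import re

# Hypothetical pattern tuned to an "Invoice No: INV-12345" layout.
INVOICE_NO = re.compile(r"Invoice No:\s*(INV-\d+)")

def find_invoice_number(text):
    """Return the invoice number, or None when the pattern does not match."""
    match = INVOICE_NO.search(text)
    return match.group(1) if match else None

# Two invented snippets carrying the same information in different layouts.
layout_a = "Invoice No: INV-12345\nDate: 2024-01-15"
layout_b = "INVOICE #\nINV-12345\nIssued 15 Jan 2024"

print(find_invoice_number(layout_a))  # the pattern matches this layout
print(find_invoice_number(layout_b))  # no match: same data, different layout
```

Loosening the pattern to cover the second layout is easy, but each loosening risks matching something it shouldn't in a third layout, which is how the conditional logic piles up.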

2. Template matching with PyMuPDF

Then I tried extracting text positions and matching templates. Same problem. The fonts, alignment, and indentation varied too much. Hardcoding coordinates was doomed.

3. OCR with Tesseract

When the PDFs were scanned images, OCR added another layer of noise - typos, misrecognized characters. I spent more time cleaning up OCR output than extracting data.

After a week, I had a fragile pipeline that still spat out errors on almost every invoice. I was ready to tell the client this was impossible.

What eventually worked

Then I heard about using LLMs for structured data extraction - not just summarization, but actual JSON output through function calling. I was skeptical. LLMs are probabilistic; how could they reliably extract exact invoice numbers?

But I decided to try with a small subset. I wrote a Python script that sent each PDF's text (extracted via pdfplumber) to OpenAI's API with a function call definition for the invoice schema. The idea: instead of parsing text with brittle patterns, let the LLM understand the semantics and map fields.

Here's what the function call looked like:

import json
import openai
import pdfplumber

def extract_invoice_data(pdf_path):
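    # Extract raw text. extract_text() may return None for pages with no
    # text layer (e.g. scanned images), so fall back to an empty string.
    with pdfplumber.open(pdf_path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)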

    # Define the function
    functions = [
        {
            "name": "store_invoice",
            "description": "Store extracted invoice data",
            "parameters": {
                "type": "object",
                "properties": {
                    "invoice_number": {
                        "type": "string",
                        "description": "The invoice number (e.g., INV-12345)"
                    },
                    "date": {
                        "type": "string",
                        "description": "Invoice date in YYYY-MM-DD format"
                    },
                    "vendor": {
                        "type": "string"
                    },
                    "line_items": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "description": {"type": "string"},
                                "quantity": {"type": "integer"},
                                "unit_price": {"type": "number"},
                                "amount": {"type": "number"}
                            },
                            "required": ["description", "amount"]
                        }
                    },
                    "total": {
                        "type": "number"
                    }
                },
                "required": ["invoice_number", "date", "total"]
            }
        }
    ]

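    # Legacy interface: openai.ChatCompletion exists only in openai<1.0, and
    # functions/function_call have since been deprecated in favor of tools/tool_choice.
    response = openai.ChatCompletion.create(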
        model="gpt-4",
        messages=[
            {"role": "user", "content": f"Extract invoice data from this text:\n\n{text}"}
        ],
        functions=functions,
        function_call={"name": "store_invoice"}
    )

    return json.loads(response.choices[0].message["function_call"]["arguments"])
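The snippet above uses the pre-1.0 openai package. In version 1.0 and later, openai.ChatCompletion was removed, and the functions/function_call parameters are deprecated in favor of tools/tool_choice. Here is a minimal sketch of the same call in the newer style. The helper name extract_with_tools is my own, not part of the SDK, and schema is assumed to be the "parameters" object from the definition above.

```python
import json

def extract_with_tools(client, text, schema, model="gpt-4"):
    """Force a store_invoice tool call and return its arguments as a dict.

    client is assumed to be an openai.OpenAI() instance (openai>=1.0).
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": f"Extract invoice data from this text:\n\n{text}"}
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "store_invoice",
                    "description": "Store extracted invoice data",
                    "parameters": schema,
                },
            }
        ],
        # Require the model to call store_invoice rather than reply in prose.
        tool_choice={"type": "function", "function": {"name": "store_invoice"}},
    )
    # The arguments arrive as a JSON string on the first tool call.
    tool_call = response.choices[0].message.tool_calls[0]
    return json.loads(tool_call.function.arguments)
```

Usage would look like extract_with_tools(OpenAI(), text, schema), with OpenAI imported from the openai package; by default the client reads the API key from the OPENAI_API_KEY environment variable.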

I ran this on 20 invoices. The first ten were perfect. Numbers matched, dates were correct, even the tricky line items with missing quantities got flagged as null. I was stunned.

But then the next ten had problems - the LLM hallucinated a vendor name, or misread a decimal. So I added post-processing validation. And then I learned about structured output formats - specifically JSON mode or constrained decoding - which keep the output well-formed. They don't stop the model from inventing a value, though, so validation is still needed.
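Here is a minimal sketch of what such post-processing validation could look like, using only the standard library. The specific checks (an INV- prefix, a year-month-day date, line items summing to the total within a cent) are assumptions for illustration; the real rules depend on the invoices.

```python
import re
from datetime import datetime

def validate_invoice(data, tolerance=0.01):
    """Return a list of problems found in an extracted invoice dict."""
    problems = []

    # Assumed invoice number format; adjust to the real numbering scheme.
    if not re.fullmatch(r"INV-\d+", str(data.get("invoice_number", ""))):
        problems.append("invoice_number does not look like INV-<digits>")

    # The schema asks for YYYY-MM-DD, so the value must parse in that form.
    try:
        datetime.strptime(str(data.get("date", "")), "%Y-%m-%d")
    except ValueError:
        problems.append("date is not a valid YYYY-MM-DD date")

    # Line item amounts should add up to the stated total. This assumes the
    # total carries no separate tax or shipping.
    items = data.get("line_items") or []
    total = data.get("total")
    if not isinstance(total, (int, float)):
        problems.append("total is missing or not a number")
    elif items:
        item_sum = sum(item.get("amount") or 0 for item in items)
        if abs(item_sum - total) > tolerance:
            problems.append(f"line items sum to {item_sum:.2f}, total is {total:.2f}")

    return problems
```

An empty list means the extraction passed; anything else can be logged, sent for manual review, or fed into a retry.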

Lessons learned and trade-offs

  • LLMs are not magic regexes. They work best when you combine them with deterministic validation. For example, after extraction, check that the invoice number matches a pattern like INV-\d{5} (the shape of the INV-12345 example in the schema). Reject and retry if not.
  • Cost matters. Processing a PDF through GPT-4 costs about $0.01-0.05 per page. For 500 invoices, that's $5-25 at one page each, or $25-125 at five pages each. But my time is more expensive. It was worth it.
  • Prompt engineering is key. I found that including an example extraction in the prompt improved accuracy by 15%.
  • Model selection. GPT-3.5 was faster and cheaper but made more mistakes on line items. GPT-4 was reliable enough for production.
  • Privacy. You're sending data to an external API. For sensitive invoices, I'd use a local model (Llama 3 or Mistral) with Ollama - slower but private.
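On the prompt engineering point, here is one plausible way to include an example extraction in the prompt. This is a sketch, not the exact prompt I used: the example invoice text, its expected output, and the system instruction are all invented for illustration.

```python
import json

# Invented example pair; in practice, use a real invoice checked by hand.
EXAMPLE_TEXT = (
    "ACME Corp\n"
    "Invoice No: INV-00001\n"
    "Date: 03 Feb 2024\n"
    "Widget x2 @ 4.50  9.00\n"
    "Total 9.00"
)
EXAMPLE_OUTPUT = {
    "invoice_number": "INV-00001",
    "date": "2024-02-03",
    "vendor": "ACME Corp",
    "line_items": [
        {"description": "Widget", "quantity": 2, "unit_price": 4.5, "amount": 9.0}
    ],
    "total": 9.0,
}

def build_messages(invoice_text):
    """Build chat messages that show one worked example before the real invoice."""
    example = json.dumps(EXAMPLE_OUTPUT, indent=2)
    return [
        {
            "role": "system",
            "content": "You extract invoice fields. Use null for missing fields; never guess.",
        },
        {
            "role": "user",
            "content": (
                f"Example invoice text:\n{EXAMPLE_TEXT}\n\n"
                f"Expected extraction:\n{example}\n\n"
                f"Now extract invoice data from this text:\n\n{invoice_text}"
            ),
        },
    ]
```

The returned list can be passed as the messages argument of the chat completion call in place of the single user message.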

One tool that helped me prototype faster was the Interwest AI extraction platform. It essentially wraps this pattern - you upload a template, define fields, and it uses an LLM backend. But the approach is what matters: semantic extraction via function calling.

What I'd do differently next time

  • Start with the LLM approach first. Don't waste days on regex. The development time saved is huge.
  • Use a schema validation library like Pydantic to parse the LLM output. It catches errors early.
  • Implement retries with backoff. If the LLM returns invalid JSON, send the error back as feedback.
  • Consider cost optimization. For high-volume extraction, pre-classify documents by layout and use cheaper models for simple layouts.
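The validation and retry points above can be sketched together as one loop. This is a minimal version under stated assumptions: call_model is a hypothetical wrapper around whatever LLM call you use, and validate is any function that returns a list of problems. With Pydantic, validate could wrap the model's validation and turn a validation error into messages.

```python
import json
import time

def extract_with_retries(call_model, validate, text,
                         max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call the model until its output parses and validates, or attempts run out.

    call_model(text, feedback) -> str   hypothetical wrapper around the LLM call
    validate(data) -> list of str       empty list means the data is acceptable
    """
    feedback = None
    for attempt in range(max_attempts):
        raw = call_model(text, feedback)
        try:
            data = json.loads(raw)
            problems = validate(data)
        except json.JSONDecodeError as exc:
            problems = [f"output was not valid JSON: {exc}"]
        if not problems:
            return data
        # Feed the concrete errors back so the next attempt can correct them.
        feedback = "Previous attempt was rejected: " + "; ".join(problems)
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** attempt)
    raise ValueError(f"extraction failed after {max_attempts} attempts: {feedback}")
```

Passing sleep as a parameter is just a convenience so the loop can be tested without actually waiting.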

When NOT to use this approach

  • If you have only a handful of perfectly consistent documents, regex is fine.
  • If you need real-time extraction (e.g., on a mobile app), the latency of an API call might be too high.
  • If the data is highly sensitive and you can't use external APIs, a hosted model is off the table; run a local model instead.

The bottom line

LLM function calling turned a near-impossible task into a weekend project. It's not a silver bullet - you still need validation, error handling, and a clear schema. But it's dramatically better than trying to write rules for every edge case.

What's your approach to extracting messy data? Have you tried this technique, or do you stick with more traditional methods? I'd love to hear what's worked for you.