zhongqiyue

Posted on Jun 2

How I Stopped Writing Fragile Regex and Let AI Parse My Messy Data

#ai #python #webdev #tutorial

A few months ago, I needed to extract structured data from a hundred PDFs. Each PDF had the same kind of information (dates, names, amounts), but the layout was slightly different across every single document. My first instinct? Regex. My second? BeautifulSoup. Both failed spectacularly.

The problem that made me rage-quit

I was working on a client project to ingest invoices from various suppliers. They all looked similar to a human, but to a parser they were a nightmare.

Some tables had borders, others didn't.
Some placed "Total" at the bottom right, others bottom left.
A few PDFs were just scanned images with no text layer.

I tried:

PyPDF2 / pdfplumber: Worked on clean PDFs, but failed on scans and inconsistent spacing.
Regex: I spent days building patterns that broke the moment a new supplier came in.
PaddleOCR: Got text out, but then I had to parse that text anyway.

I was about to tell the client it was impossible at this scale.

The approach I should have started with

After reading about how people use large language models (LLMs) for extraction tasks, I decided to try it. The idea is simple: instead of writing rules to find where the data lives, describe what the data is in plain English, and let an AI extract it.

This is not a new idea, but it felt like cheating. And honestly, it kind of is – in a good way.

Here's the core technique:

Convert each document to plain text (or markdown if structure matters).
Build a prompt that asks for specific fields in a JSON format.
Send the text + prompt to an LLM (like GPT-4 or a local model).
Parse the JSON response.

No more regex maintenance. No more layout assumptions.

Code you can copy-paste (almost)

Below is a Python function I wrote. It takes raw text (from any source – PDF, HTML, image via OCR) and returns a structured dictionary.

import json
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def extract_invoice_data(raw_text: str) -> dict:
    """Use LLM to extract invoice fields from messy text."""
    prompt = f"""You are an expert data extractor.
From the following invoice text, extract these fields as JSON:
- invoice_number
- date (YYYY-MM-DD)
- total_amount (numeric, no currency symbols)
- vendor_name
- line_items (list of items with description and amount)

Return ONLY valid JSON, no other text.

Invoice text:
{raw_text}
"""

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )

    content = response.choices[0].message.content.strip()
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        # Fallback: the model sometimes wraps the JSON in a Markdown code fence.
        fence = "`" * 3
        if fence not in content:
            raise
        # Keep what sits after the first fence and drop the optional "json" tag.
        content = content.split(fence)[1].removeprefix("json").strip()
        return json.loads(content)

Usage example:


text = "Invoice #INV-2024-001 Date: 15-Jan-2024 Vendor: Office Supplies Co. Items: Paper x5 $50, Pens $10 Total: $60"
data = extract_invoice_data(text)
print(data)
# {'invoice_number': 'INV-2024-001', 'date': '2024-01-15', 'total_amount': 60.0, 'vendor_name': 'Office Supplies Co.', 'line_items': [{'description': 'Paper x5', 'amount': 50.0}, {'description': 'Pens', 'amount': 10.0}]}

I also experimented with a self-hosted option using Ollama + Llama 3 for privacy-sensitive data. It worked, but it was slower and sometimes hallucinated fields. For many production use cases, I find GPT-4's reliability worth the per-call cost, especially when batch processing with caching.

Lessons learned (the hard way)

What worked:

Prompt engineering matters a lot. Be explicit about output format. Use examples (few-shot) for tricky fields.
Temperature = 0 makes the output far more repeatable. It is not a hard guarantee of determinism, but it removes most run-to-run variation.
For long documents (10k+ tokens), chunk and extract per section, then merge.
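
A minimal sketch of the chunk-and-merge idea might look like this. It assumes chunks are split on blank lines by a character budget (a rough stand-in for tokens), and that `extract` is any function that maps text to a dict, such as `extract_invoice_data`. The names `chunk_text`, `merge_results`, and `extract_long_document` are hypothetical, and the merge rule (first non-empty value wins, line items are concatenated) is just one reasonable choice.

```python
from typing import Callable


def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Split text on blank lines into chunks of roughly max_chars or less."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow, so close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def merge_results(results: list[dict]) -> dict:
    """Keep the first non-empty value per field; concatenate line_items."""
    merged: dict = {"line_items": []}
    for result in results:
        for key, value in result.items():
            if key == "line_items":
                merged["line_items"].extend(value or [])
            elif value not in (None, "") and key not in merged:
                merged[key] = value
    return merged


def extract_long_document(
    text: str, extract: Callable[[str], dict], max_chars: int = 8000
) -> dict:
    """Run the extractor on each chunk, then merge the partial results."""
    return merge_results([extract(chunk) for chunk in chunk_text(text, max_chars)])
```

A single paragraph longer than `max_chars` still ends up as one oversized chunk here, so very dense documents would need a finer split.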

What didn't work:

Asking for everything at once when the document is large – the model loses context.
Using base models without fine-tuning – they don’t “know” invoice format unless you prime them hard.
Relying on regex for post-processing AI output – just trust the JSON parse and add a retry loop.
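
The retry loop can be sketched roughly like this. It assumes `call_model` is a zero-argument function that sends the prompt and returns the raw reply string (for example, a small wrapper around the `client.chat.completions.create` call); `extract_with_retry` and its parameters are hypothetical names, not part of any library.

```python
import json
import time
from typing import Callable


def extract_with_retry(
    call_model: Callable[[], str],
    max_attempts: int = 3,
    delay_seconds: float = 1.0,
) -> dict:
    """Call the model until its reply parses as a JSON object, or give up."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        reply = call_model()
        try:
            data = json.loads(reply)
            if isinstance(data, dict):
                return data
            last_error = ValueError("reply was valid JSON but not an object")
        except json.JSONDecodeError as exc:
            last_error = exc
        if attempt < max_attempts:
            # Short pause before asking again.
            time.sleep(delay_seconds)
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts") from last_error
```

Each retry is another paid API call, so it is worth keeping `max_attempts` small and logging the replies that failed to parse.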

Trade-offs:

Cost: OpenAI calls add up. For a few hundred docs per month, it's cheap. For thousands, consider a local model or cheaper API (Claude Haiku, Gemini Flash).
Latency: Each call takes 1-3 seconds. Batch asynchronously if you need speed.
Hallucinations: Occasionally the AI invents an invoice number. Mitigate by validating with simple checks (e.g., the date format, or whether the line items sum to the total).
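
Those simple checks could look something like the sketch below. It assumes the field names from the prompt earlier in the post; `validate_invoice`, the `tolerance` value, and the optional `source_text` check (does the extracted invoice number literally appear in the input?) are illustrative additions of mine, not a standard recipe.

```python
from datetime import datetime
from typing import Optional


def validate_invoice(
    data: dict, source_text: Optional[str] = None, tolerance: float = 0.01
) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems: list[str] = []

    # An invented invoice number usually does not appear in the source text.
    number = data.get("invoice_number")
    if not number:
        problems.append("missing invoice_number")
    elif source_text is not None and str(number) not in source_text:
        problems.append(f"invoice_number {number!r} does not appear in the source text")

    # The prompt asks for YYYY-MM-DD, so the date should parse in that order.
    try:
        datetime.strptime(str(data.get("date")), "%Y-%m-%d")
    except ValueError:
        problems.append(f"date is not YYYY-MM-DD: {data.get('date')!r}")

    # Line items should add up to the total, within a small rounding tolerance.
    try:
        total = float(data["total_amount"])
        items_sum = sum(float(item["amount"]) for item in data.get("line_items") or [])
    except (KeyError, TypeError, ValueError):
        problems.append("total_amount or a line item amount is missing or not numeric")
    else:
        if abs(total - items_sum) > tolerance:
            problems.append(
                f"line items sum to {items_sum:.2f}, but total_amount is {total:.2f}"
            )

    return problems
```

Anything that comes back with a non-empty list can be routed to manual review instead of being written straight to the database. Invoices with tax or shipping lines will trip the sum check unless those are extracted as line items too.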

What I'd do differently next time

If I had to start over, I would:

Try the AI approach much earlier, instead of spending weeks on regex.
Build a validation layer that cross-checks extracted numbers (e.g., line items sum to total) and flags mismatches for manual review.
Use a tool like InterwestInfo's AI extractor (if it fits the use case) to avoid managing API keys myself. But honestly, rolling my own gave me full control and zero vendor lock-in.

Oh, and I'd document the prompt templates as code, so they're version-controlled and testable.
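
As a rough sketch of what "prompts as code" could mean: keep the template in a module, give it a version in its name, and write a plain test against it. The names `INVOICE_FIELDS`, `INVOICE_PROMPT_V1`, and `build_invoice_prompt` are hypothetical. I'm assuming `string.Template` here rather than an f-string, so that curly braces in any JSON examples added later don't need escaping.

```python
from string import Template

# Field names the prompt must ask for; the test below checks each one.
INVOICE_FIELDS = (
    "invoice_number",
    "date",
    "total_amount",
    "vendor_name",
    "line_items",
)

# Versioned template: bump the suffix when the wording changes.
INVOICE_PROMPT_V1 = Template(
    """You are an expert data extractor.
From the following invoice text, extract these fields as JSON:
- invoice_number
- date (YYYY-MM-DD)
- total_amount (numeric, no currency symbols)
- vendor_name
- line_items (list of items with description and amount)

Return ONLY valid JSON, no other text.

Invoice text:
$invoice_text
"""
)


def build_invoice_prompt(raw_text: str) -> str:
    """Fill the template with the document text."""
    return INVOICE_PROMPT_V1.substitute(invoice_text=raw_text.strip())


def test_prompt_mentions_every_field() -> None:
    """Guard against a field being dropped during a prompt edit."""
    prompt = build_invoice_prompt("Total: $60")
    for field in INVOICE_FIELDS:
        assert field in prompt
    assert "Total: $60" in prompt
```

With the template in version control, a prompt change shows up in a diff and a test run like any other code change.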

Final thoughts

AI extraction isn't magic – it's a trade-off between upfront engineering and ongoing reliability. For messy, variable data, it beat the alternative by a mile. But I still clean my input text (remove headers/footers, normalize whitespace) and cache results so I don't hit the API for unchanged documents.
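
Here is a minimal sketch of the whitespace cleanup and caching, assuming results are cached as JSON files keyed by a SHA-256 hash of the cleaned text. `clean_text`, `cached_extract`, and the cache layout are hypothetical; header and footer removal is left out because it depends on the documents.

```python
import hashlib
import json
import re
from pathlib import Path
from typing import Callable


def clean_text(text: str) -> str:
    """Normalize line endings, collapse spaces and tabs, squeeze blank lines."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()


def cached_extract(text: str, extract: Callable[[str], dict], cache_dir: Path) -> dict:
    """Return the cached result for this text, calling extract only on a miss."""
    cleaned = clean_text(text)
    key = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = extract(cleaned)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Hashing the cleaned text means two copies of a document that differ only in whitespace share one cache entry. If the prompt or model changes, the cache needs clearing, or the key should include a prompt version.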

Now I'm curious: What's your go-to trick for extracting data from a pile of unstructured crap? Do you use AI, or have you found a smarter heuristic?

This article is based on my personal experience. No tool is a silver bullet – always evaluate for your specific data and budget.
