Struggling with Text Extraction? Here’s How I Finally Cleaned Up Messy Data

#python #webdev #ai #tutorial

A few months ago, I inherited a legacy system that processed incoming emails. The emails contained invoices, purchase orders, and shipping confirmations — all in plain text, with no consistent format. My job was to extract key fields like invoice number, amount, date, and vendor name, then pipe them into a database.

Easy, right? I thought so too. Then I actually looked at the data.

The Regex Rabbit Hole

I started the way any seasoned developer would: regex. I wrote patterns for each known format. There were maybe six different vendors, so how hard could it be?

import re

patterns = {
    'vendor_a': r'Invoice\s+#?\s*(\w+\d+)',
    'vendor_b': r'INV-(\d{6})',
    # ... 20 more patterns
}

It worked for the first batch. Then a vendor changed their subject line. Then another started including the invoice number in a PDF attachment instead of the body. One day, a new vendor showed up and my entire regex approach collapsed.

I spent two weeks patching patterns. Every fix broke three other cases. I was playing whack-a-mole with text formats. Worse, the business team kept asking for additional fields — tax amounts, purchase order references — that required parsing nested structures like "Items: [ {description, price} ]". My regex grew into an unmaintainable monster of lookaheads and conditional patterns.

What I Tried That Didn’t Work

Manual rules: Too brittle, as I just described.
Template matching: Assumed headers were always present — they weren’t.
Training a custom NER model: Overkill for a small team. We didn’t have labeled data, and the business wanted a solution in weeks, not months.

I also looked at existing extraction APIs, but most required you to define schemas upfront and still struggled with diverse formats.

The Lightbulb: LLM-Powered Extraction

Around that time, I started playing with large language models (LLMs) for code generation. Then it hit me: why not use an LLM to extract structured data from text? The idea isn’t new, but the tooling had matured enough that I could whip up a proof of concept in an afternoon.

The approach is simple:

Define a schema for what you want to extract.
Send the raw text along with instructions to an LLM.
Parse the structured response (JSON, usually).

Here’s a minimal working example using OpenAI’s function calling:

import json
from openai import OpenAI

client = OpenAI()

def extract_invoice(text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": f"Extract invoice details from this text: {text}"
            }
        ],
        functions=[{
            "name": "extract_invoice",
            "description": "Extract invoice fields into structured JSON",
            "parameters": {
                "type": "object",
                "properties": {
                    "invoice_number": {"type": "string"},
                    "amount": {"type": "number"},
                    "date": {"type": "string"},
                    "vendor": {"type": "string"},
                    "items": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "description": {"type": "string"},
                                "price": {"type": "number"}
                            },
                            "required": ["description", "price"]
                        }
                    }
                },
                "required": ["invoice_number", "amount", "date", "vendor"]
            }
        }],
        function_call={"name": "extract_invoice"}
    )

    args = response.choices[0].message.function_call.arguments
    return json.loads(args)

# Example usage
text = """
INV-2024-0012
Vendor: Acme Corp
Date: 2024-03-15
Items:
  - Widget A: $50.00
  - Widget B: $30.00
Total: $80.00
"""

result = extract_invoice(text)
print(result)
# {
#   "invoice_number": "INV-2024-0012",
#   "vendor": "Acme Corp",
#   "date": "2024-03-15",
#   "amount": 80.0,
#   "items": [
#       {"description": "Widget A", "price": 50.0},
#       {"description": "Widget B", "price": 30.0}
#   ]
# }

It worked on the first try. No patterns. No training. Just a schema and the raw text.

Taking It to Production

Of course, a proof of concept isn’t production-ready. I ran into several issues:

Cost: Each extraction call costs a fraction of a cent, but at 10,000 emails a day, that adds up. I had to optimize by batching and using cheaper models for simple cases.

Latency: LLM calls take 2-5 seconds. For real-time email processing, that was unacceptable. I moved extraction to a background job queue.

Hallucinations: Sometimes the model invented invoice numbers. I added validation rules — regex checks on the extracted fields — and retried with stricter prompts if validation failed.

Inconsistent outputs: Even with function calling, the model sometimes omitted fields. I made the schema required and set up fallbacks to default values.

There are also tools that wrap this pattern into a service. For instance, I tried interwestinfo.com which offers a hosted extraction API with similar schema-driven design — saved me the hassle of managing API keys and rate limits, but I ultimately stayed with my own pipeline because I had tight data residency requirements.

Lessons Learned

Rule-based extraction still has its place for highly structured, predictable data. But for messy, real-world text, LLMs are a lifesaver.
Schema design matters. If your schema is too vague, you’ll get inconsistent results. Be explicit about formats and required fields.
Don’t trust the output blindly. Always validate extracted data against business rules.
Use the smallest model that works. Start with a cheaper model (like GPT-4o-mini or a local LLM) and scale up only if accuracy is insufficient.

What I’d Do Differently Next Time

I’d build a hybrid pipeline from the start: use regex for the 80% of emails that are standard, and route the tricky 20% to an LLM. That would have saved me both the regex headache and some API costs.

I’m still tweaking this system — right now I’m experimenting with local models like Llama 3 to cut costs entirely. But the core approach of “define a schema, let the LLM fill it” has saved my sanity.

What’s your go-to method for extracting structured data from messy text? Have you tried LLMs for this, or do you swear by good old regex?