DEV Community

zhongqiyue
zhongqiyue

Posted on

Struggling with Text Extraction? Here’s How I Finally Cleaned Up Messy Data

A few months ago, I inherited a legacy system that processed incoming emails. The emails contained invoices, purchase orders, and shipping confirmations — all in plain text, with no consistent format. My job was to extract key fields like invoice number, amount, date, and vendor name, then pipe them into a database.

Easy, right? I thought so too. Then I actually looked at the data.

The Regex Rabbit Hole

I started the way any seasoned developer would: regex. I wrote patterns for each known format. There were maybe six different vendors, so how hard could it be?

import re

patterns = {
    'vendor_a': r'Invoice\s+#?\s*(\w+\d+)',
    'vendor_b': r'INV-(\d{6})',
    # ... 20 more patterns
}
Enter fullscreen mode Exit fullscreen mode

It worked for the first batch. Then a vendor changed their subject line. Then another started including the invoice number in a PDF attachment instead of the body. One day, a new vendor showed up and my entire regex approach collapsed.

I spent two weeks patching patterns. Every fix broke three other cases. I was playing whack-a-mole with text formats. Worse, the business team kept asking for additional fields — tax amounts, purchase order references — that required parsing nested structures like "Items: [ {description, price} ]". My regex grew into an unmaintainable monster of lookaheads and conditional patterns.

What I Tried That Didn’t Work

  • Manual rules: Too brittle, as I just described.
  • Template matching: Assumed headers were always present — they weren’t.
  • Training a custom NER model: Overkill for a small team. We didn’t have labeled data, and the business wanted a solution in weeks, not months.

I also looked at existing extraction APIs, but most required you to define schemas upfront and still struggled with diverse formats.

The Lightbulb: LLM-Powered Extraction

Around that time, I started playing with large language models (LLMs) for code generation. Then it hit me: why not use an LLM to extract structured data from text? The idea isn’t new, but the tooling had matured enough that I could whip up a proof of concept in an afternoon.

The approach is simple:

  1. Define a schema for what you want to extract.
  2. Send the raw text along with instructions to an LLM.
  3. Parse the structured response (JSON, usually).

Here’s a minimal working example using OpenAI’s function calling:

import json
from openai import OpenAI

client = OpenAI()

def extract_invoice(text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": f"Extract invoice details from this text: {text}"
            }
        ],
        functions=[{
            "name": "extract_invoice",
            "description": "Extract invoice fields into structured JSON",
            "parameters": {
                "type": "object",
                "properties": {
                    "invoice_number": {"type": "string"},
                    "amount": {"type": "number"},
                    "date": {"type": "string"},
                    "vendor": {"type": "string"},
                    "items": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "description": {"type": "string"},
                                "price": {"type": "number"}
                            },
                            "required": ["description", "price"]
                        }
                    }
                },
                "required": ["invoice_number", "amount", "date", "vendor"]
            }
        }],
        function_call={"name": "extract_invoice"}
    )

    args = response.choices[0].message.function_call.arguments
    return json.loads(args)

# Example usage
text = """
INV-2024-0012
Vendor: Acme Corp
Date: 2024-03-15
Items:
  - Widget A: $50.00
  - Widget B: $30.00
Total: $80.00
"""

result = extract_invoice(text)
print(result)
# {
#   "invoice_number": "INV-2024-0012",
#   "vendor": "Acme Corp",
#   "date": "2024-03-15",
#   "amount": 80.0,
#   "items": [
#       {"description": "Widget A", "price": 50.0},
#       {"description": "Widget B", "price": 30.0}
#   ]
# }
Enter fullscreen mode Exit fullscreen mode

It worked on the first try. No patterns. No training. Just a schema and the raw text.

Taking It to Production

Of course, a proof of concept isn’t production-ready. I ran into several issues:

Cost: Each extraction call costs a fraction of a cent, but at 10,000 emails a day, that adds up. I had to optimize by batching and using cheaper models for simple cases.

Latency: LLM calls take 2-5 seconds. For real-time email processing, that was unacceptable. I moved extraction to a background job queue.

Hallucinations: Sometimes the model invented invoice numbers. I added validation rules — regex checks on the extracted fields — and retried with stricter prompts if validation failed.

Inconsistent outputs: Even with function calling, the model sometimes omitted fields. I made the schema required and set up fallbacks to default values.

There are also tools that wrap this pattern into a service. For instance, I tried interwestinfo.com which offers a hosted extraction API with similar schema-driven design — saved me the hassle of managing API keys and rate limits, but I ultimately stayed with my own pipeline because I had tight data residency requirements.

Lessons Learned

  • Rule-based extraction still has its place for highly structured, predictable data. But for messy, real-world text, LLMs are a lifesaver.
  • Schema design matters. If your schema is too vague, you’ll get inconsistent results. Be explicit about formats and required fields.
  • Don’t trust the output blindly. Always validate extracted data against business rules.
  • Use the smallest model that works. Start with a cheaper model (like GPT-4o-mini or a local LLM) and scale up only if accuracy is insufficient.

What I’d Do Differently Next Time

I’d build a hybrid pipeline from the start: use regex for the 80% of emails that are standard, and route the tricky 20% to an LLM. That would have saved me both the regex headache and some API costs.


I’m still tweaking this system — right now I’m experimenting with local models like Llama 3 to cut costs entirely. But the core approach of “define a schema, let the LLM fill it” has saved my sanity.

What’s your go-to method for extracting structured data from messy text? Have you tried LLMs for this, or do you swear by good old regex?

Top comments (0)