A few months ago, I inherited a legacy system that processed incoming emails. The emails contained invoices, purchase orders, and shipping confirmations — all in plain text, with no consistent format. My job was to extract key fields like invoice number, amount, date, and vendor name, then pipe them into a database.
Easy, right? I thought so too. Then I actually looked at the data.
The Regex Rabbit Hole
I started the way any seasoned developer would: regex. I wrote patterns for each known format. There were maybe six different vendors, so how hard could it be?
import re

patterns = {
    'vendor_a': r'Invoice\s+#?\s*(\w+\d+)',
    'vendor_b': r'INV-(\d{6})',
    # ... 20 more patterns
}
It worked for the first batch. Then a vendor changed their subject line. Then another started including the invoice number in a PDF attachment instead of the body. One day, a new vendor showed up and my entire regex approach collapsed.
I spent two weeks patching patterns. Every fix broke three other cases. I was playing whack-a-mole with text formats. Worse, the business team kept asking for additional fields — tax amounts, purchase order references — that required parsing nested structures like "Items: [ {description, price} ]". My regex grew into an unmaintainable monster of lookaheads and conditional patterns.
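To make the brittleness concrete, here is a minimal sketch of how a pattern table like this tends to get applied. The function name `extract_invoice_number` and the sample strings are hypothetical; only the two patterns come from the snippet above.

```python
import re

# Two patterns from the table above; the remaining ones are omitted.
PATTERNS = {
    "vendor_a": r"Invoice\s+#?\s*(\w+\d+)",
    "vendor_b": r"INV-(\d{6})",
}


def extract_invoice_number(body):
    """Return (vendor, invoice_number) for the first matching pattern, else None."""
    for vendor, pattern in PATTERNS.items():
        match = re.search(pattern, body)
        if match:
            return vendor, match.group(1)
    return None


print(extract_invoice_number("Invoice #A123 attached"))  # handled by vendor_a
print(extract_invoice_number("Ref: INV-004512"))         # handled by vendor_b
print(extract_invoice_number("INV-2024-0012"))           # None: hyphenated year format is not covered
```

The last call is the whole problem in miniature: a small change in numbering scheme silently falls through every pattern, and the only fix is yet another entry in the table.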
What I Tried That Didn’t Work
- Manual rules: Too brittle, as I just described.
- Template matching: Assumed headers were always present — they weren’t.
- Training a custom NER model: Overkill for a small team. We didn’t have labeled data, and the business wanted a solution in weeks, not months.
I also looked at existing extraction APIs, but most required you to define schemas upfront and still struggled with diverse formats.
The Lightbulb: LLM-Powered Extraction
Around that time, I started playing with large language models (LLMs) for code generation. Then it hit me: why not use an LLM to extract structured data from text? The idea isn’t new, but the tooling had matured enough that I could whip up a proof of concept in an afternoon.
The approach is simple:
- Define a schema for what you want to extract.
- Send the raw text along with instructions to an LLM.
- Parse the structured response (JSON, usually).
Here’s a minimal working example using OpenAI’s function calling:
import json
from openai import OpenAI

client = OpenAI()

def extract_invoice(text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": f"Extract invoice details from this text: {text}"
            }
        ],
        functions=[{
            "name": "extract_invoice",
            "description": "Extract invoice fields into structured JSON",
            "parameters": {
                "type": "object",
                "properties": {
                    "invoice_number": {"type": "string"},
                    "amount": {"type": "number"},
                    "date": {"type": "string"},
                    "vendor": {"type": "string"},
                    "items": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "description": {"type": "string"},
                                "price": {"type": "number"}
                            },
                            "required": ["description", "price"]
                        }
                    }
                },
                "required": ["invoice_number", "amount", "date", "vendor"]
            }
        }],
        function_call={"name": "extract_invoice"}
    )
    args = response.choices[0].message.function_call.arguments
    return json.loads(args)

# Example usage
text = """
INV-2024-0012
Vendor: Acme Corp
Date: 2024-03-15
Items:
- Widget A: $50.00
- Widget B: $30.00
Total: $80.00
"""
result = extract_invoice(text)
print(result)
# {
#   "invoice_number": "INV-2024-0012",
#   "vendor": "Acme Corp",
#   "date": "2024-03-15",
#   "amount": 80.0,
#   "items": [
#     {"description": "Widget A", "price": 50.0},
#     {"description": "Widget B", "price": 30.0}
#   ]
# }
It worked on the first try. No patterns. No training. Just a schema and the raw text.
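One caveat on the example: OpenAI has deprecated the `functions` and `function_call` parameters of the Chat Completions API in favor of `tools` and `tool_choice`. As a rough sketch, the same request could be expressed as below. The helper names `build_request` and `parse_response` are hypothetical, and the schema is trimmed to the top-level fields for brevity.

```python
import json

# Same JSON Schema as the example above, without the nested "items" array.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "amount": {"type": "number"},
        "date": {"type": "string"},
        "vendor": {"type": "string"},
    },
    "required": ["invoice_number", "amount", "date", "vendor"],
}


def build_request(text, model="gpt-4o-mini"):
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": f"Extract invoice details from this text: {text}"}
        ],
        "tools": [{
            "type": "function",
            "function": {
                "name": "extract_invoice",
                "description": "Extract invoice fields into structured JSON",
                "parameters": INVOICE_SCHEMA,
            },
        }],
        # Force the model to call this specific tool.
        "tool_choice": {"type": "function", "function": {"name": "extract_invoice"}},
    }


def parse_response(response):
    """Read the forced tool call's JSON arguments from a chat completion."""
    tool_call = response.choices[0].message.tool_calls[0]
    return json.loads(tool_call.function.arguments)


# Usage (needs the openai package and an API key, so it is left commented out):
# from openai import OpenAI
# client = OpenAI()
# result = parse_response(client.chat.completions.create(**build_request(text)))
```

Splitting request building from response parsing also makes both halves testable without a network call.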
Taking It to Production
Of course, a proof of concept isn’t production-ready. I ran into several issues:
Cost: Each extraction call costs a fraction of a cent, but at 10,000 emails a day, that adds up. I had to optimize by batching and using cheaper models for simple cases.
Latency: LLM calls take 2-5 seconds. For real-time email processing, that was unacceptable. I moved extraction to a background job queue.
Hallucinations: Sometimes the model invented invoice numbers. I added validation rules — regex checks on the extracted fields — and retried with stricter prompts if validation failed.
Inconsistent outputs: Even with function calling, the model sometimes omitted fields. I marked the key fields as required in the schema and fell back to default values for anything that was still missing.
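The validate-and-retry loop and the default fallbacks can be sketched roughly as follows. Everything here is an assumption for illustration: the invoice number format, the `extract(text, hint)` callable that wraps the LLM call, and the choice of defaults will all differ per system.

```python
import re
from datetime import datetime

# Hypothetical business rule; real formats depend on your vendors.
INVOICE_NUMBER_RE = re.compile(r"^[A-Z]{2,5}-\d{4}-\d{4}$")

# Hypothetical defaults for optional fields the model may omit.
DEFAULTS = {"items": []}


def validate(fields, source_text):
    """Return a list of problems found in the extracted fields."""
    problems = []
    number = str(fields.get("invoice_number") or "")
    if not INVOICE_NUMBER_RE.match(number):
        problems.append("invoice_number has an unexpected format")
    elif number not in source_text:
        # A value that never appears in the email was likely invented.
        problems.append("invoice_number does not appear in the source text")
    try:
        datetime.strptime(str(fields.get("date") or ""), "%Y-%m-%d")
    except ValueError:
        problems.append("date is not in YYYY-MM-DD format")
    amount = fields.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        problems.append("amount is missing or negative")
    return problems


def extract_with_retry(text, extract, max_attempts=2):
    """Call extract(text, hint) until validation passes or attempts run out."""
    hint = ""
    problems = []
    for _ in range(max_attempts):
        fields = extract(text, hint)
        problems = validate(fields, text)
        if not problems:
            # Fill in optional fields the model left out.
            return {**DEFAULTS, **fields}
        # Feed the problems back so the next prompt can be stricter.
        hint = "Fix these issues: " + "; ".join(problems)
    raise ValueError(f"extraction failed validation: {problems}")
```

Checking that the invoice number literally appears in the source text is a cheap guard against invented values, though it will reject cases where the model legitimately normalizes formatting.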
There are also tools that wrap this pattern into a service. For instance, I tried interwestinfo.com, which offers a hosted extraction API with a similar schema-driven design. It saved me the hassle of managing API keys and rate limits, but I ultimately stayed with my own pipeline because I had tight data residency requirements.
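As a rough illustration of moving extraction off the request path (the Latency point above), here is a sketch that uses Python's standard-library `queue.Queue` and a worker thread as a stand-in for a real job queue. The `start_worker` helper and the stand-in extractor are hypothetical; a production setup would more likely use a persistent queue so jobs survive restarts.

```python
import queue
import threading


def start_worker(jobs, extract, results):
    """Start a daemon thread that runs extract() on each queued email body."""

    def run():
        while True:
            email_id, body = jobs.get()
            try:
                results[email_id] = extract(body)
            except Exception as exc:  # keep the worker alive on bad input
                results[email_id] = {"error": str(exc)}
            finally:
                jobs.task_done()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


# Usage with a stand-in extractor; a real one would call the LLM.
jobs = queue.Queue()
results = {}
start_worker(jobs, lambda body: {"length": len(body)}, results)

jobs.put(("msg-1", "INV-2024-0012 ..."))  # the mail handler returns immediately
jobs.join()                               # only needed here so the demo can print
print(results)
```

The mail handler only pays for the `put()` call, so a 2-5 second LLM round trip no longer blocks incoming email.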
Lessons Learned
- Rule-based extraction still has its place for highly structured, predictable data. But for messy, real-world text, LLMs are a lifesaver.
- Schema design matters. If your schema is too vague, you’ll get inconsistent results. Be explicit about formats and required fields.
- Don’t trust the output blindly. Always validate extracted data against business rules.
- Use the smallest model that works. Start with a cheaper model (like GPT-4o-mini or a local LLM) and scale up only if accuracy is insufficient.
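To illustrate the schema design point, here is one way a more explicit schema might look. The descriptions are illustrative assumptions rather than the constraints used in this pipeline, and `missing_required` is a hypothetical helper. How strictly a given model honors descriptions varies, so this complements validation and does not replace it.

```python
# A more explicit variant of the invoice schema: each field says what
# format is expected instead of leaving the model to guess.
STRICT_INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "Invoice identifier exactly as written, e.g. INV-2024-0012",
        },
        "amount": {
            "type": "number",
            "description": "Invoice total as a plain number, without currency symbols",
        },
        "date": {
            "type": "string",
            "description": "Invoice date in ISO 8601 format (YYYY-MM-DD)",
        },
        "vendor": {
            "type": "string",
            "description": "Name of the vendor issuing the invoice",
        },
    },
    "required": ["invoice_number", "amount", "date", "vendor"],
    "additionalProperties": False,
}


def missing_required(fields, schema=STRICT_INVOICE_SCHEMA):
    """List required keys that are absent from an extraction result."""
    return [key for key in schema["required"] if key not in fields]


print(missing_required({"invoice_number": "INV-2024-0012", "vendor": "Acme Corp"}))
```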
What I’d Do Differently Next Time
I’d build a hybrid pipeline from the start: use regex for the 80% of emails that are standard, and route the tricky 20% to an LLM. That would have saved me both the regex headache and some API costs.
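A hybrid router along those lines might look like the sketch below. The patterns describe one hypothetical "standard" layout (modeled on the sample invoice earlier), and `llm_extract` stands for whatever function wraps the LLM call; neither is taken from a real deployment.

```python
import re

# Hypothetical patterns for one well-behaved email layout.
STANDARD = {
    "invoice_number": re.compile(r"^(INV-\d{4}-\d{4})$", re.MULTILINE),
    "vendor": re.compile(r"^Vendor:\s*(.+)$", re.MULTILINE),
    "date": re.compile(r"^Date:\s*(\d{4}-\d{2}-\d{2})$", re.MULTILINE),
    "amount": re.compile(r"^Total:\s*\$([\d,]+\.\d{2})$", re.MULTILINE),
}


def extract_with_regex(text):
    """Return all fields if every pattern matches, otherwise None."""
    fields = {}
    for name, pattern in STANDARD.items():
        match = pattern.search(text)
        if match is None:
            return None
        fields[name] = match.group(1).strip()
    fields["amount"] = float(fields["amount"].replace(",", ""))
    return fields


def extract_hybrid(text, llm_extract):
    """Try the cheap regex path first and fall back to the LLM."""
    fields = extract_with_regex(text)
    if fields is not None:
        return fields, "regex"
    return llm_extract(text), "llm"
```

Returning the route alongside the fields makes it easy to track what share of traffic actually reaches the LLM, which is the number that drives cost.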
I’m still tweaking this system. Right now I’m experimenting with local models like Llama 3 to eliminate per-call API costs. But the core approach of “define a schema, let the LLM fill it” has saved my sanity.
What’s your go-to method for extracting structured data from messy text? Have you tried LLMs for this, or do you swear by good old regex?