Extracting structured data from messy text: what worked for me

#ai #python #nlp #webdev

I spent a good two weeks last quarter building an invoice extraction pipeline for our accounting team. The emails came in all shapes: some with PDF attachments, others with plain text tables, a few with scanned images that had been OCR'd into garbled nonsense. My job was to pull out vendor name, invoice number, date, and total amount.

At first I thought, "Regex, obviously." I wrote patterns for date formats, dollar amounts, and common invoice prefixes. It worked on the first ten samples. Then the real data came. One vendor sent invoices with "Invoice #" and another used "Ref:". Dates were mm/dd/yyyy, dd.mm.yyyy, or even "March 5, 2023". Regex broke fast.
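To make the failure mode concrete, here is a minimal sketch of that style of date handling. It is illustrative rather than my original code: `DATE_PATTERNS` and `normalise_date` are made-up names, and the month-name pattern assumes an English locale.

```python
import re
from datetime import datetime

# Illustrative patterns only; every new vendor format needs another entry.
DATE_PATTERNS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "%m/%d/%Y"),        # mm/dd/yyyy
    (re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"), "%d.%m.%Y"),      # dd.mm.yyyy
    (re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"), "%B %d, %Y"),  # March 5, 2023
]

def normalise_date(text):
    """Return the first recognised date as YYYY-MM-DD, or None."""
    for pattern, fmt in DATE_PATTERNS:
        match = pattern.search(text)
        if match is None:
            continue
        try:
            parsed = datetime.strptime(match.group(0), fmt)
        except ValueError:
            continue  # looked like a date but did not parse (e.g. month 13)
        return parsed.strftime("%Y-%m-%d")
    return None

print(normalise_date("Invoice date: 03/05/2023"))
print(normalise_date("Issued 5 March 2023"))  # no pattern for this layout yet
```

The second call returns None because nothing in the list covers "5 March 2023". The slash pattern also silently assumes month-first order, so a day-first vendor using slashes would be misread or rejected.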

I tried spaCy next. Training a custom NER model for four fields seemed reasonable. I manually labelled 200 invoices using Prodigy (the team had a license). The model got to ~85% F1, but then a new vendor showed up with a different layout and accuracy dropped to 60%. Retraining every week wasn't sustainable.

The approach that finally stuck: few-shot LLM extraction

I realised I didn't need a full-fledged model. I just needed something that could read instructions and follow examples. LLMs (even small ones) are surprisingly good at this when you provide a clear system prompt and a handful of examples.

I built a simple pipeline in Python using LangChain with OpenAI's gpt-3.5-turbo (later I switched to a local Llama 3 model via Ollama to cut costs). The core is a chain that takes the raw invoice text, applies a prompt describing the fields I want, and returns JSON.

The key is the prompt design. Here's what I settled on:

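from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Reads OPENAI_API_KEY from the environment
llm = ChatOpenAI(model="gpt-3.5-turbo")

system = """You are a data extraction assistant. Extract the following fields from the invoice text:
- vendor_name: the company name that issued the invoice
- invoice_number: the unique identifier for the invoice
- date: the invoice date in YYYY-MM-DD format
- total: the total amount due as a number (no currency symbol)

If a field cannot be found, use null. Return only valid JSON, no extra text."""

prompt = ChatPromptTemplate.from_messages([
    ("system", system),
    ("human", "Text: {text}")
])

# prompt -> model -> parser that turns the JSON reply into a Python dict
chain = prompt | llm | JsonOutputParser()

# Placeholder sample; in the pipeline this is the text of one invoice email
raw_invoice_text = "ACME Corp\nInvoice # 1042\nMarch 5, 2023\nTotal: $1,250.00"
result = chain.invoke({"text": raw_invoice_text})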

I also added few-shot examples inside the prompt for edge cases (e.g., when the total is split across lines). Instead of hardcoding every pattern, I let the LLM figure out the variations.
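A rough sketch of how those examples can be laid out as alternating human/ai turns ahead of the real input. The two sample invoices, `FEW_SHOT_EXAMPLES`, and `build_messages` are invented for illustration. The function builds final message content directly rather than a template, because the literal braces in the JSON answers would need doubling ({{ and }}) if they were passed through ChatPromptTemplate.

```python
import json

# Hypothetical worked examples; real ones should come from reviewed invoices.
FEW_SHOT_EXAMPLES = [
    {
        # Edge case: the total is split across two lines
        "text": "ACME Corp\nRef: A-1042\nDate: 05.03.2023\nTotal due:\n  1,250.00 USD",
        "output": {
            "vendor_name": "ACME Corp",
            "invoice_number": "A-1042",
            "date": "2023-03-05",
            "total": 1250.00,
        },
    },
    {
        # Edge case: most fields are missing, so the answer uses null
        "text": "Invoice # 778\nThanks for your business!",
        "output": {
            "vendor_name": None,
            "invoice_number": "778",
            "date": None,
            "total": None,
        },
    },
]

def build_messages(system, examples, text):
    """Return (role, content) tuples: system, example turns, then the real input."""
    messages = [("system", system)]
    for example in examples:
        messages.append(("human", "Text: " + example["text"]))
        messages.append(("ai", json.dumps(example["output"])))
    messages.append(("human", "Text: " + text))
    return messages

for role, content in build_messages("You are a data extraction assistant.",
                                    FEW_SHOT_EXAMPLES, "Invoice # 9001 ..."):
    print(role, "|", content[:60].replace("\n", " "))
```

Including one example that answers with null is a deliberate choice: it shows the model that leaving a field empty is acceptable, which should reduce the temptation to invent a value.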

Trade-offs I hit

Cost: GPT-3.5-turbo worked out to around $0.002 per call for invoices of our size (pricing is per token, so longer documents cost more). For 10,000 invoices/month that's about $20, which was fine for us. But if you're doing millions, it adds up. I tested a local Llama 3 8B quantized model, which had no per-call cost but was slower (about 5–10 seconds per invoice vs 1–2 seconds for GPT).
Latency: Real-time? Not great. We batch processed invoices nightly, so latency was fine. For a live form, you'd want to cache or use a smaller model.
Accuracy: The LLM approach got ~95% on our test set. But it occasionally hallucinated values (e.g., making up a vendor name from a footnote). I added a validation step: if the total doesn't match the pattern \d+\.\d{2}, flag the invoice for human review.
Prompt injection: Invoice text is untrusted input, so a malicious email could contain instructions that trick the LLM. We sanitise inputs and limit max tokens.
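A minimal sketch of that kind of validation step, not our production code. The total check uses the pattern mentioned above for string values and an equivalent "at most two decimals" test for numbers, since a JSON number like 1250.00 parses to 1250.0. The date check and the "invoice number must appear verbatim in the source" check are extra assumptions added for illustration, and `review_reasons` is a made-up name.

```python
import re
from datetime import datetime

TOTAL_RE = re.compile(r"\d+\.\d{2}")
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def total_ok(total):
    """Accept strings like '1250.00' or non-negative numbers with <= 2 decimals."""
    if isinstance(total, bool):
        return False
    if isinstance(total, (int, float)):
        return total >= 0 and round(total, 2) == total
    if isinstance(total, str):
        return TOTAL_RE.fullmatch(total) is not None
    return False

def date_ok(value):
    """True if value is a YYYY-MM-DD string that names a real calendar date."""
    if not isinstance(value, str) or not DATE_RE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True

def review_reasons(record, source_text):
    """Return the names of fields that should send this record to human review."""
    reasons = []
    if not total_ok(record.get("total")):
        reasons.append("total")
    if not date_ok(record.get("date")):
        reasons.append("date")
    number = record.get("invoice_number")
    # Assumption: an extracted identifier should appear verbatim in the source text
    if not isinstance(number, str) or number not in source_text:
        reasons.append("invoice_number")
    return reasons

record = {"vendor_name": "ACME", "invoice_number": "X-1",
          "date": "05/03/2023", "total": "12.5"}
print(review_reasons(record, "Ref: A-1042"))
```

Null fields are flagged too, which errs on the side of sending more invoices to a human. A verbatim check would not catch the footnote case above, because that vendor name really did appear in the text; it only catches values that appear nowhere in the source.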

When NOT to use this

If you have a small, consistent set of formats, regex or a well-trained spaCy model will be faster and cheaper. LLM extraction shines when you can't control the input variability and don't have the time/resources to train custom models.

Also, if your text is extremely long (like a 50-page document), LLM context windows become a problem. Chunking strategies add complexity. For invoices, most fit within 4k tokens, so it was fine.
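For a sense of what chunking involves, here is a rough sketch that splits text on line boundaries with a small overlap. The 4-characters-per-token estimate is a rule of thumb for English, not a real tokenizer, and the function names and default budget are made up for illustration.

```python
def estimate_tokens(text, chars_per_token=4):
    """Crude token estimate using ceiling division; not a real tokenizer."""
    return -(-len(text) // chars_per_token)

def chunk_lines(text, max_tokens=3000, overlap_lines=2):
    """Split text into line-aligned chunks that aim to fit the token budget.

    The last `overlap_lines` lines of each chunk are repeated at the start of
    the next one, so a field split across a boundary is still seen whole.
    A single line longer than the budget still becomes its own oversized chunk.
    """
    chunks, current = [], []
    for line in text.splitlines():
        candidate = current + [line]
        if current and estimate_tokens("\n".join(candidate)) > max_tokens:
            chunks.append("\n".join(current))
            current = current[-overlap_lines:] if overlap_lines else []
            candidate = current + [line]
        current = candidate
    if current:
        chunks.append("\n".join(current))
    return chunks

sample = "\n".join("line %d" % i for i in range(100))
pieces = chunk_lines(sample, max_tokens=50, overlap_lines=1)
print(len(pieces), "chunks")
```

The extra complexity comes after this step: each chunk yields its own JSON, and the results have to be merged (for example, by taking the first non-null value per field) and checked for conflicts.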

One tool that simplified my life

In production, we ended up using a service that wraps this exact approach with ready-made connectors for email and PDF parsing. The setup is just a config file pointing to https://ai.interwestinfo.com/ and mapping fields – but the underlying technique is the same few-shot extraction I described. You can absolutely build it yourself.

Lessons learned

Start with the simplest thing that works (regex for your golden path, LLM for the long tail).
Validate outputs aggressively. LLMs are not databases.
Monitor field-level accuracy over time. New vendor layouts can drift performance.
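For the monitoring point, a minimal sketch of per-field exact-match accuracy against a small hand-checked sample. `field_accuracy` is a hypothetical helper, and exact match is an assumption: it is strict for vendor names, where "ACME Corp" and "ACME Corp." count as different.

```python
FIELDS = ("vendor_name", "invoice_number", "date", "total")

def field_accuracy(predictions, gold):
    """Per-field exact-match accuracy over paired lists of dicts."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = {field: 0 for field in FIELDS}
    for pred, truth in zip(predictions, gold):
        for field in FIELDS:
            if pred.get(field) == truth.get(field):
                correct[field] += 1
    n = len(gold)
    return {field: (correct[field] / n if n else 0.0) for field in FIELDS}

predictions = [
    {"vendor_name": "A", "invoice_number": "1", "date": "2023-01-01", "total": 10.0},
    {"vendor_name": "B", "invoice_number": "2", "date": None, "total": 5.0},
]
gold = [
    {"vendor_name": "A", "invoice_number": "1", "date": "2023-01-01", "total": 10.0},
    {"vendor_name": "C", "invoice_number": "2", "date": "2023-02-01", "total": 5.0},
]
print(field_accuracy(predictions, gold))
```

Running this per week, or per vendor, on a sample of reviewed invoices makes a drop on a single field visible before it shows up in the overall number.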

What's your go-to method for pulling structured data out of messy text? I'm curious if anyone has had success with smaller local models or non-LLM approaches like classification + regex.