zhongqiyue

Extracting structured data from messy text: what worked for me

I spent a good two weeks last quarter building an invoice extraction pipeline for our accounting team. The emails came in all shapes: some with PDF attachments, others with plain text tables, a few with scanned images that had been OCR'd into garbled nonsense. My job was to pull out vendor name, invoice number, date, and total amount.

At first I thought, "Regex, obviously." I wrote patterns for date formats, dollar amounts, and common invoice prefixes. It worked on the first ten samples. Then the real data came. One vendor sent invoices with "Invoice #" and another used "Ref:". Dates were mm/dd/yyyy, dd.mm.yyyy, or even "March 5, 2023". Regex broke fast.
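
To make the regex attempt concrete, here is a minimal sketch of a first pass over the formats mentioned above. The patterns and function names are illustrative assumptions, not the ones from the actual pipeline.

```python
import re
from datetime import datetime

# Hypothetical patterns for illustration; the original patterns are not shown in the post.
INVOICE_NO = re.compile(r"(?:Invoice\s*#|Ref:)\s*([A-Za-z0-9-]+)", re.IGNORECASE)

# Each date pattern is paired with the strptime format used to parse its match.
DATE_FORMATS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "%m/%d/%Y"),       # mm/dd/yyyy
    (re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"), "%d.%m.%Y"),     # dd.mm.yyyy
    (re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"), "%B %d, %Y"),  # March 5, 2023
]


def extract_invoice_number(text):
    """Return the identifier after 'Invoice #' or 'Ref:', or None."""
    match = INVOICE_NO.search(text)
    return match.group(1) if match else None


def extract_date(text):
    """Return the first recognised date as YYYY-MM-DD, or None."""
    for pattern, fmt in DATE_FORMATS:
        match = pattern.search(text)
        if match:
            try:
                return datetime.strptime(match.group(0), fmt).date().isoformat()
            except ValueError:
                continue  # looked like a date but is not valid in this format
    return None
```

Every new label or date style means another alternation or another entry in the list, and a slash date written day-first (13/05/2023) simply fails to parse because the pattern cannot tell which convention a vendor uses.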

I tried spaCy next. Training a custom NER model for four fields seemed reasonable. I manually labelled 200 invoices using Prodigy (the team had a license). The model got to ~85% F1, but then a new vendor showed up with a different layout and accuracy dropped to 60%. Retraining every week wasn't sustainable.

The approach that finally stuck: few-shot LLM extraction

I realised I didn't need a full-fledged model. I just needed something that could read instructions and follow examples. LLMs (even small ones) are surprisingly good at this when you provide a clear system prompt and a handful of examples.

I built a simple pipeline in Python using LangChain with OpenAI's gpt-3.5-turbo (later I switched to a local Llama 3 model via Ollama to cut costs). The core is a chain that takes the raw text, applies the schema described in the system prompt, and returns JSON.

The key is the prompt design. Here's what I settled on:

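from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI  # assumes the langchain-openai package is installed

system = """You are a data extraction assistant. Extract the following fields from the invoice text:
- vendor_name: the company name that issued the invoice
- invoice_number: the unique identifier for the invoice
- date: the invoice date in YYYY-MM-DD format
- total: the total amount due as a number (no currency symbol)

If a field cannot be found, use null. Return only valid JSON, no extra text."""

prompt = ChatPromptTemplate.from_messages([
    ("system", system),
    ("human", "Text: {text}")
])

# The model setup is an assumption; reads OPENAI_API_KEY from the environment.
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Prompt -> model -> parser that turns the JSON reply into a Python dict.
chain = prompt | llm | JsonOutputParser()

# raw_invoice_text is the plain-text body of one invoice, loaded elsewhere.
result = chain.invoke({"text": raw_invoice_text})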

I also added few-shot examples inside the prompt for edge cases (e.g., when the total is split across lines). Instead of hardcoding every pattern, I let the LLM figure out the variations.
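
A minimal sketch of how few-shot examples could be added as extra message turns. The example invoice, `FEW_SHOT_EXAMPLES`, `escape_braces`, and `build_messages` are all hypothetical names and data for illustration. One detail worth knowing: prompt templates treat curly braces as variables, so literal JSON in an example has to have its braces doubled.

```python
import json

# Hypothetical worked example; real ones would come from your own invoices.
# Here the total sits on the line after its label, the edge case mentioned above.
FEW_SHOT_EXAMPLES = [
    {
        "text": "ACME Corp\nRef: A-17\nDate: 05.03.2023\nTotal due:\n1,250.00 USD",
        "output": {
            "vendor_name": "ACME Corp",
            "invoice_number": "A-17",
            "date": "2023-03-05",
            "total": 1250.00,
        },
    },
]


def escape_braces(s):
    """Double curly braces so a format-style template treats them as literals."""
    return s.replace("{", "{{").replace("}", "}}")


def build_messages(system_prompt, examples):
    """Return (role, template) pairs: system, example turns, then the real input."""
    messages = [("system", system_prompt)]
    for ex in examples:
        messages.append(("human", "Text: " + escape_braces(ex["text"])))
        messages.append(("ai", escape_braces(json.dumps(ex["output"]))))
    # The only unescaped variable left is {text}, filled in at invoke time.
    messages.append(("human", "Text: {text}"))
    return messages
```

The resulting list has the same shape as the one passed to `ChatPromptTemplate.from_messages`, so it can replace the two-message list without touching the rest of the chain.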

Trade-offs I hit

  • Cost: GPT-3.5-turbo costs around $0.002 per call. For 10,000 invoices/month that's $20 – fine for us. But if you're doing millions, it adds up. I tested a local Llama 3 8B quantized model, which was free but slower (about 5–10 seconds per invoice vs 1–2 seconds for GPT).
  • Latency: Real-time? Not great. We batch processed invoices nightly, so latency was fine. For a live form, you'd want to cache or use a smaller model.
  • Accuracy: The LLM approach got ~95% on our test set. But it occasionally hallucinated values (e.g., making up a vendor name from a footnote). I added a validation step: if the total doesn't match the regex \d+\.\d{2}, it gets flagged for human review.
  • Prompt injection: Malicious text could trick the LLM. We sanitise inputs and limit max tokens.
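
A minimal sketch of the validation step, under a few assumptions. The total check uses the pattern from the accuracy point above. The date check and the "vendor name must appear in the source text" check are additions for illustration, and `total_looks_valid` and `review_reasons` are hypothetical names.

```python
import re
from datetime import date

TOTAL_RE = re.compile(r"\d+\.\d{2}")
ISO_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def total_looks_valid(total):
    """Apply the two-decimal amount pattern to a string or numeric total."""
    if total is None or isinstance(total, bool):
        return False
    if isinstance(total, (int, float)):
        # JSON numbers drop trailing zeros (1250.5), so format before matching.
        # Negative, NaN, and infinite values still fail the pattern.
        total = f"{total:.2f}"
    return isinstance(total, str) and TOTAL_RE.fullmatch(total) is not None


def is_real_iso_date(value):
    """True only for strings shaped YYYY-MM-DD that name an actual calendar day."""
    if not isinstance(value, str) or not ISO_DATE_RE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def review_reasons(record, source_text):
    """Return the fields that should send this record to human review."""
    reasons = []
    if not total_looks_valid(record.get("total")):
        reasons.append("total")
    if not is_real_iso_date(record.get("date")):
        reasons.append("date")
    vendor = record.get("vendor_name")
    # Assumed grounding check: a vendor name that never occurs in the text is suspect.
    if not (isinstance(vendor, str) and vendor.strip()
            and vendor.lower() in source_text.lower()):
        reasons.append("vendor_name")
    return reasons
```

An empty list means the record passed. Null fields are flagged too, which seems a reasonable default when a person is going to look at the flagged invoices anyway.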

When NOT to use this

If you have a small, consistent set of formats, regex or a well-trained spaCy model will be faster and cheaper. LLM extraction shines when you can't control the input variability and don't have the time/resources to train custom models.

Also, if your text is extremely long (like a 50-page document), LLM context windows become a problem. Chunking strategies add complexity. For invoices, most fit within 4k tokens, so it was fine.
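
For a sense of the added complexity, here is a rough sketch of one chunking strategy: overlapping word windows, with the per-chunk results merged afterwards. Word count is only a crude stand-in for tokens, and `chunk_words` and `merge_records` are hypothetical helpers, not something the pipeline above needed.

```python
FIELDS = ("vendor_name", "invoice_number", "date", "total")


def chunk_words(text, max_words=1500, overlap=200):
    """Split text into windows of at most max_words words that share `overlap` words.

    Note: splitting on whitespace discards line breaks, which may matter for tables.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


def merge_records(records, fields=FIELDS):
    """Keep the first non-null value seen for each field across chunk results."""
    merged = {f: None for f in fields}
    for rec in records:
        for f in fields:
            if merged[f] is None and rec.get(f) is not None:
                merged[f] = rec[f]
    return merged
```

"First non-null wins" is the simplest merge rule. It says nothing about what to do when two chunks disagree, which is exactly the kind of decision that makes chunking more work than it first appears.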

One tool that simplified my life

In production, we ended up using a service that wraps this exact approach with ready-made connectors for email and PDF parsing. The setup is just a config file pointing to https://ai.interwestinfo.com/ and mapping fields – but the underlying technique is the same few-shot extraction I described. You can absolutely build it yourself.

Lessons learned

  • Start with the simplest thing that works (regex for your golden path, LLM for the long tail).
  • Validate outputs aggressively. LLMs are not databases.
  • Monitor field-level accuracy over time. New vendor layouts can cause performance to drift.
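
A minimal sketch of field-level monitoring, assuming you keep a small set of human-checked invoices to compare against. `field_accuracy` and `drifting_fields` are hypothetical helper names, and the tolerance value is arbitrary.

```python
from collections import defaultdict

FIELDS = ("vendor_name", "invoice_number", "date", "total")


def field_accuracy(pairs, fields=FIELDS):
    """pairs: iterable of (predicted, expected) dicts. Returns {field: fraction correct}."""
    correct = defaultdict(int)
    count = 0
    for predicted, expected in pairs:
        count += 1
        for f in fields:
            if predicted.get(f) == expected.get(f):
                correct[f] += 1
    if count == 0:
        return {}
    return {f: correct[f] / count for f in fields}


def drifting_fields(current, baseline, tolerance=0.05):
    """Return fields whose accuracy fell more than `tolerance` below the baseline."""
    return [f for f, base in baseline.items()
            if base - current.get(f, 0.0) > tolerance]
```

Running this per vendor, rather than over the whole batch, would make a single new layout stand out instead of being averaged away.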

What's your go-to method for pulling structured data out of messy text? I'm curious if anyone has had success with smaller local models or non-LLM approaches like classification + regex.
