I spent a good two weeks last quarter building an invoice extraction pipeline for our accounting team. The emails came in all shapes: some with PDF attachments, others with plain text tables, a few with scanned images that had been OCR'd into garbled nonsense. My job was to pull out vendor name, invoice number, date, and total amount.
At first I thought, "Regex, obviously." I wrote patterns for date formats, dollar amounts, and common invoice prefixes. It worked on the first ten samples. Then the real data came. One vendor sent invoices with "Invoice #" and another used "Ref:". Dates were mm/dd/yyyy, dd.mm.yyyy, or even "March 5, 2023". Regex broke fast.
I tried spaCy next. Training a custom NER model for four fields seemed reasonable. I manually labelled 200 invoices using Prodigy (the team had a license). The model got to ~85% F1, but then a new vendor showed up with a different layout and accuracy dropped to 60%. Retraining every week wasn't sustainable.
The approach that finally stuck: few-shot LLM extraction
I realised I didn't need a full-fledged model. I just needed something that could read instructions and follow examples. LLMs (even small ones) are surprisingly good at this when you provide a clear system prompt and a handful of examples.
I built a simple pipeline in Python using langchain with OpenAI's gpt-3.5-turbo (later I switched to a local Llama 3 model via Ollama to cut costs). The core is a chain that takes the raw text and a schema, and returns JSON.
The key is the prompt design. Here's what I settled on:
from langchain_core.prompts import ChatPromptTemplate
system = """You are a data extraction assistant. Extract the following fields from the invoice text:
- vendor_name: the company name that issued the invoice
- invoice_number: the unique identifier for the invoice
- date: the invoice date in YYYY-MM-DD format
- total: the total amount due as a number (no currency symbol)
If a field cannot be found, use null. Return only valid JSON, no extra text."""
prompt = ChatPromptTemplate.from_messages([
("system", system),
("human", "Text: {text}")
])
chain = prompt | llm | JsonOutputParser()
result = chain.invoke({"text": raw_invoice_text})
I also added few-shot examples inside the prompt for edge cases (e.g., when the total is split across lines). Instead of hardcoding every pattern, I let the LLM figure out the variations.
Trade-offs I hit
- Cost: GPT-3.5-turbo costs around $0.002 per call. For 10,000 invoices/month that's $20 – fine for us. But if you're doing millions, it adds up. I tested a local Llama 3 8B quantized model, which was free but slower (about 5–10 seconds per invoice vs 1–2 seconds for GPT).
- Latency: Real-time? Not great. We batch processed invoices nightly, so latency was fine. For a live form, you'd want to cache or use a smaller model.
-
Accuracy: The LLM approach got ~95% on our test set. But it occasionally hallucinated values (e.g., making up a vendor name from a footnote). I added a validation step: if the total doesn't match a regex
\d+\.\d{2}pattern, flag it for human review. - Prompt injection: Malicious text could trick the LLM. We sanitise inputs and limit max tokens.
When NOT to use this
If you have a small, consistent set of formats, regex or a well-trained spaCy model will be faster and cheaper. LLM extraction shines when you can't control the input variability and don't have the time/resources to train custom models.
Also, if your text is extremely long (like a 50-page document), LLM context windows become a problem. Chunking strategies add complexity. For invoices, most fit within 4k tokens, so it was fine.
One tool that simplified my life
In production, we ended up using a service that wraps this exact approach with ready-made connectors for email and PDF parsing. The setup is just a config file pointing to https://ai.interwestinfo.com/ and mapping fields – but the underlying technique is the same few-shot extraction I described. You can absolutely build it yourself.
Lessons learned
- Start with the simplest thing that works (regex for your golden path, LLM for the long tail).
- Validate outputs aggressively. LLMs are not databases.
- Monitor field-level accuracy over time. New vendor layouts can drift performance.
What's your go-to method for pulling structured data out of messy text? I'm curious if anyone has had success with smaller local models or non-LLM approaches like classification + regex.
Top comments (0)