I spent three weekends building a regex pipeline for invoice data extraction. By the end of it, I had 63% accuracy on a test set of 100 PDFs. My co-founder looked at me and said, "Is this production ready?"
No. No it wasn't.
This is the story of how I stopped trying to outsmart every edge case and started treating the problem as what it really is: a language understanding task. And no, the solution wasn't fine-tuning a model on my 100 invoices. The real trick was way simpler.
The problem: every invoice is a snowflake
We were building a small expense management tool. Users upload PDF invoices, and we need to extract vendor name, date, total amount, and line items. Simple, right?
Except invoices come from different countries, in different languages and layouts. Some are scanned images that have to go through OCR first. Some are HTML exports that look like tables but aren't. Some have discounts, tax columns, shipping charges. One vendor puts the total in the top-right corner. Another puts it at the bottom-left after a paragraph of legal text.
I started with PyTesseract and a forest of regex patterns. For each vendor I'd add a new pattern. It got unwieldy fast. After another vendor sent a PDF that was actually an embedded image of a handwritten receipt, I knew I needed a different approach.
What I tried that didn't work
First, I tried OCR + rules. Tesseract is decent, but layout preservation is terrible. A table with merged cells becomes a soup of text. Regex on that is guesswork.
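Here is a small sketch of that failure mode. The OCR snippet and both patterns are made up for illustration, not taken from the real pipeline. The naive pattern grabs the subtotal because "Subtotal" contains "total"; adding a word boundary fixes this one case, and the next layout breaks it again.

```python
import re
from typing import Optional

# Hypothetical OCR output, for illustration only.
OCR_TEXT = "Subtotal: $90.00\nTax: $10.00\nTotal: $100.00"

# Naive pattern: matches "total" anywhere, including inside "Subtotal".
NAIVE_TOTAL = re.compile(r"total\s*:?\s*\$?([\d,]+\.\d{2})", re.IGNORECASE)

# Patched pattern: \b requires "total" to start a word.
PATCHED_TOTAL = re.compile(r"\btotal\s*:?\s*\$?([\d,]+\.\d{2})", re.IGNORECASE)


def find_total(pattern: "re.Pattern", text: str) -> Optional[str]:
    """Return the first captured amount, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


naive = find_total(NAIVE_TOTAL, OCR_TEXT)      # picks up the subtotal
patched = find_total(PATCHED_TOTAL, OCR_TEXT)  # picks up the real total
```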
Next, I tried layout parsers like Camelot or Tabula. They work well for digital PDFs with clean tables, but fail on scanned documents or invoices without explicit table borders.
Then I considered fine-tuning a small BERT model for named entity recognition. I labeled 300 invoices by hand. After a week of training, I got 74% F1 on validation — but production accuracy was worse because the training data didn't cover enough variance.
And fine-tuning a 7B parameter LLM? That means paying for GPU hours and retraining every time a new invoice format appears. Not practical for a side project.
What eventually worked: structured output from a general-purpose LLM
I knew LLMs could extract information from text, but the hard part is getting reliable structured JSON back. The breakthrough was using function calling (also called tool use) to enforce a schema. Instead of asking "Give me the invoice total", you define a function with typed parameters:
{
  "name": "extract_invoice",
  "parameters": {
    "type": "object",
    "properties": {
      "vendor_name": {"type": "string"},
      "date": {"type": "string", "format": "date"},
      "total": {"type": "number"},
      "line_items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "description": {"type": "string"},
            "quantity": {"type": "integer"},
            "unit_price": {"type": "number"},
            "amount": {"type": "number"}
          },
          "required": ["description", "amount"]
        }
      }
    },
    "required": ["vendor_name", "date", "total"]
  }
}
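The LLM then outputs a function call with those parameters filled in. No markdown, no free text. In practice you get a parseable JSON object almost every time. Unless the provider offers a strict schema mode, though, the arguments are not guaranteed to be valid JSON or to match the schema exactly, and a forced function call can still come back with empty or made-up values when the data isn't in the text.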
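Even so, it costs little to parse the arguments defensively before storing anything. Below is a minimal sketch using only the standard library. The helper name `parse_arguments` and the `REQUIRED_FIELDS` tuple are illustrative assumptions that mirror the schema's "required" list; this is not a full JSON Schema validator.

```python
import json

# Assumption: mirrors the "required" list of the extract_invoice schema.
REQUIRED_FIELDS = ("vendor_name", "date", "total")


def parse_arguments(raw_arguments: str) -> dict:
    """Parse a function call's argument string and check the basics.

    Raises ValueError if the string is not a JSON object or if a
    required key is missing.
    """
    try:
        data = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        raise ValueError(f"arguments are not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("arguments must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return data
```

A caller could catch the `ValueError` and either retry the request or send the invoice straight to manual review.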
I started with OpenAI's API, but then discovered that many providers now support the same format. For example, the endpoints at https://ai.interwestinfo.com/ also accept function schemas, which helped when I wanted to avoid API lock-in.
The code: a minimal extraction pipeline
Here's the core function I use. It takes OCR text (or raw PDF text) and returns structured data.
import json

from openai import OpenAI

client = OpenAI(api_key="sk-...")  # placeholder key; load the real one from the environment

# Function schema the model has to fill in.
EXTRACTION_SCHEMA = {
    "name": "extract_invoice",
    "parameters": {
        "type": "object",
        "properties": {
            "vendor_name": {"type": "string"},
            "date": {"type": "string"},
            "total": {"type": "number"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "integer"},
                        "unit_price": {"type": "number"},
                        "amount": {"type": "number"},
                    },
                    "required": ["description", "amount"],
                },
            },
        },
        "required": ["vendor_name", "date", "total"],
    },
}


def extract_invoice_from_text(text: str) -> dict:
    """Send invoice text to the model and return the extracted fields as a dict."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # fast and cheap
        messages=[
            {"role": "system", "content": "You are an invoice extraction assistant. Extract the requested fields from the invoice text provided by the user. If a value is missing, set it to null."},
            {"role": "user", "content": text},
        ],
        tools=[{"type": "function", "function": EXTRACTION_SCHEMA}],
        # Force a call to extract_invoice instead of a free-text reply.
        tool_choice={"type": "function", "function": {"name": "extract_invoice"}},
    )
    # The arguments arrive as a JSON string; parse them into a dict.
    tool_calls = response.choices[0].message.tool_calls
    if not tool_calls:
        raise ValueError("model response contained no function call")
    return json.loads(tool_calls[0].function.arguments)
That's it. One API call per invoice. I then run this in a batch job on upload, and store the JSON. If the LLM returns null for a critical field, I flag it for manual review.
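The review flag can be as simple as the sketch below. Which fields count as critical is an assumption here (`CRITICAL_FIELDS` is a hypothetical name), and blank strings are treated the same as null.

```python
# Assumption: the fields that must be present before an invoice is auto-accepted.
CRITICAL_FIELDS = ("vendor_name", "date", "total")


def missing_critical_fields(extraction: dict) -> list:
    """Return the names of critical fields that are absent, null, or blank."""
    flagged = []
    for name in CRITICAL_FIELDS:
        value = extraction.get(name)
        if value is None or (isinstance(value, str) and not value.strip()):
            flagged.append(name)
    return flagged
```

An empty list means nothing is missing; anything else goes into the manual review queue along with the list of field names. Note that a total of 0 is not flagged, since only null and blank values count as missing.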
Lessons learned and trade-offs
- Cost: Each extraction costs about $0.002–0.01 depending on model and text length. For 1,000 invoices a month, that's $2–10. Cheaper than a human data entry clerk.
- Latency: It takes 2–5 seconds per invoice. Fine for background processing, not for real-time.
- Accuracy: I get about 92% on my test set now. The errors are usually when the invoice is extremely messy or missing data. The function calling model is surprisingly good at ignoring irrelevant numbers.
- The model matters: GPT-4o-mini works great. Older models like GPT-3.5-turbo sometimes hallucinate field names or produce malformed JSON. Use the latest instruction-following models.
- Schema design: Make fields optional liberally. If you require every line item, the model might fabricate one. Better to accept nulls and handle them downstream.
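One way to apply the schema design point is to let every non-required property accept null explicitly. The helper below is a sketch under that assumption (`make_optional_fields_nullable` is a hypothetical name). A list of types is standard JSON Schema, but whether a given provider honors it is something to check against its documentation.

```python
import copy


def make_optional_fields_nullable(parameters: dict) -> dict:
    """Return a copy of a JSON Schema object in which every property
    not listed in "required" also accepts null."""
    result = copy.deepcopy(parameters)
    required = set(result.get("required", []))
    for name, spec in result.get("properties", {}).items():
        if name in required:
            continue
        declared = spec.get("type")
        # Only rewrite simple string types; leave lists and missing types alone.
        if isinstance(declared, str) and declared != "null":
            spec["type"] = [declared, "null"]
    return result


# The line item schema from this post, used as sample input.
LINE_ITEM = {
    "type": "object",
    "properties": {
        "description": {"type": "string"},
        "quantity": {"type": "integer"},
        "unit_price": {"type": "number"},
        "amount": {"type": "number"},
    },
    "required": ["description", "amount"],
}

relaxed = make_optional_fields_nullable(LINE_ITEM)
```

After this, `quantity` and `unit_price` accept null while `description` and `amount` stay strict, and the original `LINE_ITEM` dict is left untouched.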
When NOT to use this approach
- If your invoices are highly standardized (e.g., all from one vendor), a rule-based parser will be cheaper and faster.
- If you need real-time extraction (e.g., during payment processing), the latency may be too high.
- If you handle sensitive data, sending it to an external API may violate compliance (though some providers offer on-premise LLMs now).
What I'd do differently next time
I'd start with LLM function calling from day one instead of fighting regex. I'd also add a validation step: parse the total as a number and cross-check against line item sum — if mismatch, flag for review. That catches about half the remaining errors.
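A minimal sketch of that cross-check, assuming the extraction dict has the shape of the schema above. The function name and the one-cent default tolerance are illustrative choices. Amounts go through `Decimal` so float rounding noise doesn't trigger false alarms.

```python
from decimal import Decimal


def to_decimal(value) -> Decimal:
    """Convert a JSON number to Decimal via str() to avoid float noise."""
    return Decimal(str(value))


def total_mismatch(extraction: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Return True when the line item amounts do not add up to the total.

    Returns False when there is nothing reliable to compare: no total,
    no line items, or a line item without an amount.
    """
    total = extraction.get("total")
    items = extraction.get("line_items") or []
    amounts = [item.get("amount") for item in items]
    if total is None or not amounts or any(a is None for a in amounts):
        return False
    line_sum = sum((to_decimal(a) for a in amounts), Decimal("0"))
    return abs(to_decimal(total) - line_sum) > tolerance
```

Invoices where tax or shipping is listed outside the line items will show up as mismatches too, so this works best as a flag for review rather than a hard rejection.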
And I'd invest time in building a small feedback loop: when a user corrects an extraction, send the correction back as a training signal (for fine-tuning or few-shot prompts). But even without that, the baseline is solid.
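For the few-shot variant, one simple option is to replay recent corrections as example turns ahead of the new invoice. This is a sketch under assumptions: `build_messages` is a hypothetical helper, corrections are stored as (invoice text, corrected extraction) pairs, and the examples are encoded as plain JSON in assistant messages, which is only one of several possible encodings.

```python
import json

# Shortened placeholder; in practice, reuse the real system prompt.
SYSTEM_PROMPT = "You are an invoice extraction assistant."


def build_messages(invoice_text: str, corrections: list, max_examples: int = 3) -> list:
    """Build a chat message list with recent corrections as few-shot examples.

    Each correction is an (invoice_text, corrected_extraction) pair.
    Only the most recent max_examples corrections are included.
    """
    recent = corrections[-max_examples:] if max_examples > 0 else []
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for example_text, corrected in recent:
        messages.append({"role": "user", "content": example_text})
        messages.append({"role": "assistant", "content": json.dumps(corrected)})
    messages.append({"role": "user", "content": invoice_text})
    return messages
```

Capping the number of examples keeps the prompt, and therefore the cost per extraction, from growing with every correction.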
So after three weekends of regex hell, I finally have something that works. The code is barely 30 lines. The hard part wasn't the code — it was realizing that my problem was language, not layout.
What's your horror story with parsing messy documents? I'd love to hear how you solved it — or what you still struggle with.