Parsing invoices with an LLM sounds simple until you actually do it. Invoices from different vendors have different layouts, use different field names, and encode amounts in different formats. A senior accountant handles this through pattern recognition built over years. I had a weekend.
Here's what I learned building the extraction layer for Finley, an AI accounts payable agent that processes invoices through an 8-step pipeline and uses Hindsight to remember what it learns.
## The Extraction Problem Is Harder Than It Looks
The naive approach — ask the LLM "extract the invoice number, vendor name, total amount, and payment terms" — works on clean PDFs from large vendors. It breaks on:
- Scanned invoices with skewed text
- Invoices that embed total amounts inside paragraph descriptions
- Payment terms written as "30 days from receipt" vs. "Net 30" vs. "NET-30"
- GST/VAT breakdowns where the "total" field is ambiguous
The real challenge isn't the extraction itself. It's getting structured, typed output that downstream code can depend on.
## Structured Output Is the Only Viable Approach
The extraction service in Finley sends the invoice content to Claude and explicitly requests a JSON schema in the response. The schema pins the field names, types, and acceptable values so the downstream analyzer doesn't have to guess.
The extracted payload looks like:
```javascript
{
  vendorName: "Prakash Office Supplies Pvt. Ltd.",
  invoiceId: "INV-2025-0009",
  invoiceDate: "2025-01-15",
  totalAmount: 47500,
  paymentTerms: "Net-30",
  lineItems: [
    { description: "A4 Copy Paper (500 sheets)", quantity: 10, unitPrice: 450, total: 4500 }
    // ...
  ],
  currency: "INR"
}
```
`totalAmount` is always a number. `paymentTerms` is always a string in a normalized format. The LLM does the translation — "thirty days" → "Net-30" — so the analyzer never has to.
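Pinned types are only trustworthy if something checks them at runtime. Here's a minimal sketch of the kind of guard downstream code can rely on; `validateExtracted` is a hypothetical helper, not Finley's actual validation code, and the field names mirror the example payload above:

```javascript
// Validate the extracted payload before it reaches the analyzer.
// Returns a list of error strings; an empty list means the payload is usable.
function validateExtracted(payload) {
  const errors = [];
  if (typeof payload.totalAmount !== "number" || Number.isNaN(payload.totalAmount)) {
    errors.push("totalAmount must be a number");
  }
  if (typeof payload.paymentTerms !== "string" && payload.paymentTerms !== null) {
    errors.push("paymentTerms must be a string or null");
  }
  if (!Array.isArray(payload.lineItems)) {
    errors.push("lineItems must be an array");
  }
  // Cross-check line item math against the stated total (tolerate rounding).
  if (Array.isArray(payload.lineItems) && typeof payload.totalAmount === "number") {
    const sum = payload.lineItems.reduce((acc, li) => acc + (li.total ?? 0), 0);
    if (Math.abs(sum - payload.totalAmount) > 0.01) {
      errors.push(`line items sum to ${sum}, totalAmount is ${payload.totalAmount}`);
    }
  }
  return errors;
}
```

The line-item cross-check is also where math discrepancies (discussed below) surface as explicit errors rather than silent bad data.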
## The Prompt Carries the Business Logic
The extraction prompt does more than "extract fields." It encodes accounting-domain knowledge: what constitutes a valid invoice total, how to handle line-item math discrepancies, which fields are optional vs. required. That domain context is what makes Claude more useful here than a simpler regex approach.
One thing that helped: including explicit examples of edge cases in the extraction prompt. Not just "extract payment terms" but "payment terms may appear as 'Net 30', 'NET-30', '30 days from invoice date', or 'payment due in 30 days' — normalize to 'Net-30' format." The LLM handles variation better when you name the variations.
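In code, naming the variations can look like the sketch below. `buildExtractionPrompt` is a hypothetical helper and the wording is illustrative, not Finley's actual prompt:

```javascript
// Build an extraction prompt that enumerates the edge-case variations
// explicitly instead of hoping the model infers them.
function buildExtractionPrompt(invoiceText) {
  return [
    "Extract the following fields from the invoice as JSON:",
    "- vendorName (string)",
    "- invoiceId (string)",
    "- totalAmount (number; the final payable amount, not a subtotal)",
    "- paymentTerms (string or null). Payment terms may appear as 'Net 30',",
    "  'NET-30', '30 days from invoice date', or 'payment due in 30 days';",
    "  normalize all of these to 'Net-30' format.",
    "Return null for optional fields that are absent. Do not invent values.",
    "",
    "Invoice text:",
    invoiceText,
  ].join("\n");
}
```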
## Memory Changes What Extraction Can Do
Here's where it gets interesting. Finley uses Hindsight agent memory — and memory affects the extraction step in a way I didn't anticipate when I started.
After a few invoices from the same vendor, Finley has stored observations like "this vendor's invoices use non-standard payment terms that should be corrected to Net-45 per contract." On the next invoice from that vendor, that memory feeds into the analysis step, not the extraction step — but the effect is the same as if extraction had been smarter. The extracted `paymentTerms` field comes back as "Net-30" (what the invoice says), and the analyzer then flags it using memory: "contract terms are Net-45, vendor consistently invoices Net-30, user has corrected this 3 times."
This separation matters architecturally. Extraction reports facts. Analysis applies context. Memory is where the context lives.
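The split can be sketched as a pure function: extraction's output and remembered vendor context go in, flags come out. The memory shape here is illustrative, not Hindsight's actual API:

```javascript
// Analysis step: compare what the invoice says against remembered context.
// Extraction never sees vendorMemory; it only reports facts.
function analyzePaymentTerms(extracted, vendorMemory) {
  const flags = [];
  const contractTerms = vendorMemory?.contractPaymentTerms ?? null;
  if (contractTerms && extracted.paymentTerms !== contractTerms) {
    flags.push({
      field: "paymentTerms",
      invoiceSays: extracted.paymentTerms,
      contractSays: contractTerms,
      note: `Vendor invoiced ${extracted.paymentTerms}; contract specifies ${contractTerms}.`,
    });
  }
  return flags;
}
```

Because the function never mutates the extracted payload, the audit trail always shows both what the invoice said and why it was flagged.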
## The Manual Data Path Is Underrated
Finley has a second input path alongside file upload: a manual JSON field that lets you pass pre-structured invoice data directly to the pipeline. We used this for demos, but it has a real production use case: if you already have invoice data from an ERP or structured email, you can skip LLM extraction entirely and still get memory retrieval, analysis, and decision.
```javascript
// If pre-structured data was supplied, skip LLM extraction entirely.
const manualData = req.body.manual ? JSON.parse(req.body.manual) : null;
const extracted = manualData || await extractInvoiceData(fileBuffer, req.file?.mimetype);
```
That two-line fallback means the extraction layer is optional without changing the pipeline contract. Worth designing in from the start.
## What Fails in Practice
Three things go wrong most often in invoice extraction:
Multi-page totals. If the summary total is on page 1 and the line items are on pages 2-3, a naive extraction might grab a subtotal instead of the final total. The prompt needs to explicitly instruct the LLM to find the final payable amount, not the first number labeled "total."
Currency ambiguity. "1,00,000" is a valid Indian number format (one lakh). To US-trained models, it looks like "100,000" with a weird comma. Explicitly calling out the currency in the prompt — and including examples — reduces this error.
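A cheap post-extraction guard: strip digit-group separators before trusting an amount, so "1,00,000" (Indian lakh grouping) and "100,000" (Western grouping) parse to the same value, and the extracted `currency` field tells downstream code how to re-format for display. `parseAmount` is a hypothetical helper:

```javascript
// Normalize a raw amount string to a number, tolerating both Indian
// (1,00,000) and Western (100,000) digit grouping. Throws on garbage
// rather than silently returning NaN.
function parseAmount(raw) {
  const cleaned = String(raw).replace(/,/g, "").trim();
  const value = Number(cleaned);
  if (Number.isNaN(value)) {
    throw new Error(`Unparseable amount: ${raw}`);
  }
  return value;
}
```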
Missing fields vs. inapplicable fields. An invoice might not have payment terms because it's a cash sale. The LLM should return null for missing optional fields, not omit them or invent plausible values. Prompt explicitly for null vs. omission behavior.
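One way to enforce the null-vs-omission contract defensively on the receiving side: normalize the payload so every optional field is explicitly present, null when absent. The field list below is illustrative:

```javascript
// Optional fields that must always exist on the payload, even as null,
// so downstream code never distinguishes "missing key" from "no value".
const OPTIONAL_FIELDS = ["paymentTerms", "invoiceDate", "currency"];

function normalizeOptionals(payload) {
  const out = { ...payload };
  for (const field of OPTIONAL_FIELDS) {
    if (!(field in out)) out[field] = null;
  }
  return out;
}
```

This belt-and-suspenders step costs nothing and means a prompt regression (the model omitting a field again) degrades to an explicit null instead of an undefined crash.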
## The Takeaway
LLM invoice extraction is genuinely useful, but it requires treating the prompt as a schema specification, not a question. Define the output format precisely, encode domain edge cases explicitly, and separate extraction (facts) from analysis (context). Memory — through Hindsight in Finley's case — sits in the analysis layer, not the extraction layer, and that separation keeps both cleaner.
You can see Finley running at finley-rho.vercel.app.