Vaibhav Rathi

How We Built a 99% Accurate Invoice Processing System Using OCR and LLMs

We had a working RAG solution at 91% accuracy. Here's why we rebuilt it with fine-tuning and what we learned along the way.

Our client was spending eight minutes per invoice on manual data entry. At 10,000 invoices a month, that's a full team doing nothing but copying numbers from PDFs into a database.

We were building an invoice processing system for a US healthcare client. The goal was straightforward - extract line items, medical codes, and billing information from unstructured invoice images. Straightforward until you open the first PDF.

Healthcare invoices are chaotic. Every vendor has a different format. Some use tables, some use free-form layouts, some mix typed text with handwritten notes. And the terminology is brutal - drug names that look like keyboard smashes, procedure codes that differ by a single digit, billing amounts scattered across the page with no consistent positioning.

They wanted automation. They also wanted accuracy - in healthcare, a wrong code means denied claims, compliance issues, or payment disputes.

We actually had a working solution already. Azure OpenAI with RAG, vendor templates in a knowledge base, 91% field-level accuracy. Not bad.

But three problems pushed us to reconsider:

  • Healthcare data flowing through external APIs
  • Per-request costs adding up at scale
  • That stubborn 9% error rate that we couldn't push lower no matter how we tuned the prompts

Six months later, we had a fine-tuned system running at 99% field-level accuracy, fully on-premises, processing invoices in under two minutes each. But getting there taught us more about production ML than any course or paper ever did.

Why We Moved Away from RAG

Our RAG-based system worked. 91% accuracy is respectable. But we hit a ceiling.

The problem with RAG for document extraction is that you're asking the model to generalize from templates. "Here's what Vendor A's invoice looks like, here's where they put the total, now extract it." Works great when the invoice matches the template. Falls apart on edge cases - slightly different layouts, unexpected fields, formatting variations within the same vendor.

We tried everything. Better templates. More examples in the knowledge base. Prompt engineering. We squeezed out maybe 2% more accuracy over three months. Still stuck at 93%.

Meanwhile, three other issues were building pressure:

Data privacy. Healthcare invoices contain patient-adjacent information. Every API call to Azure meant data leaving our controlled environment. The client's compliance team was getting nervous, and rightfully so.

Cost at scale. At our volume, API costs were adding up. Not catastrophic, but enough that "what if we self-hosted?" became a recurring question in planning meetings.

The accuracy ceiling. 93% sounds good until you do the math. At 10,000 invoices per month, that's 700 invoices with errors. 700 invoices needing manual review and correction. The automation wasn't as automatic as the numbers suggested.

We made the call to rebuild with fine-tuning. Self-hosted LLaMA 3.2, trained on our actual invoice data. More work upfront, but it addressed all three problems.

The Architecture We Landed On

The final pipeline has four components, each chosen after some painful lessons.

PaddleOCR for Text Extraction

We benchmarked against Tesseract and EasyOCR on a corpus of 500 invoices. PaddleOCR won by 12%, and the gap was even wider on multi-column layouts and tables - exactly the formats healthcare invoices love.

Tesseract kept merging columns or missing table boundaries entirely. EasyOCR was slower without being more accurate. PaddleOCR had a steeper learning curve and less community support for edge cases, but accuracy won that trade-off.
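
For anyone running the same comparison, the basic call looks roughly like this. It's a minimal sketch using the classic `PaddleOCR.ocr()` interface; flags and the result format can shift between releases:

```python
from paddleocr import PaddleOCR

# Load models once and reuse the instance - initialization is the slow part.
# use_angle_cls handles text the scanner rotated by 90/180 degrees.
ocr = PaddleOCR(use_angle_cls=True, lang="en")

def extract_lines(image_path: str) -> list[dict]:
    """Return recognized text lines with their boxes and confidence scores."""
    result = ocr.ocr(image_path, cls=True)
    lines = []
    for page in result:                # one entry per page/image
        if not page:                   # pages with no detections come back empty
            continue
        for box, (text, confidence) in page:   # each detection: [box, (text, score)]
            lines.append({"text": text, "box": box, "confidence": confidence})
    return lines
```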

BioBERT for Medical Entity Recognition

This was non-negotiable. We tried general-purpose NER models first, and they were hopeless on medical terminology. "Metformin 500mg" parsed as a single blob. CPT codes got missed entirely. Drug names with unusual spellings - and there are many in healthcare - caused constant errors.

BioBERT, pre-trained on PubMed abstracts, understood medical terminology out of the box. It recognized drug names, procedure codes, and clinical terms that general models consistently missed. This wasn't a marginal improvement. This was the difference between the system working and not working.
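
Wiring something similar up is a standard token-classification pipeline. The model identifier below is a placeholder, not our actual checkpoint - any BioBERT model fine-tuned for biomedical NER will slot in:

```python
from transformers import pipeline

# Placeholder identifier - substitute a BioBERT checkpoint fine-tuned for
# token classification on biomedical entities (drugs, procedures, etc.).
MEDICAL_NER_MODEL = "your-org/biobert-medical-ner"

ner = pipeline(
    "token-classification",
    model=MEDICAL_NER_MODEL,
    aggregation_strategy="simple",  # merge word-piece tokens into whole entities
)

def tag_entities(text: str) -> list[tuple[str, str, float]]:
    """Return (entity, label, score) tuples for a chunk of OCR text."""
    return [(e["word"], e["entity_group"], float(e["score"])) for e in ner(text)]

# tag_entities("Metformin 500mg tablet, CPT 99213 office visit")
```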

Fine-tuned LLaMA 3.2 for Structured Extraction

This is where we diverged from our RAG approach. Instead of giving the model templates and asking it to generalize, we trained it on actual invoice-to-output pairs. The model learned the extraction task directly, not through in-context examples.

Full fine-tuning would have required A100 GPUs we didn't have budget for. QLoRA let us fine-tune on T4s - roughly 7x cheaper.

We tested LoRA ranks from 4 to 32. Rank 8 was the sweet spot. Rank 4 underfitted on complex line items like bundled services. Rank 16 and above showed diminishing returns on accuracy while increasing inference time. Rank 32 actually started overfitting - validation loss ticked back up.
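
The setup, roughly, on the Hugging Face transformers + peft + bitsandbytes stack. Hyperparameters other than the rank are illustrative here, not our exact training config:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"  # pick the size your GPU can hold

# 4-bit NF4 quantization is the "Q" in QLoRA - it's what makes T4s viable.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Rank 8 was the sweet spot: 4 underfit complex line items,
# 16+ added inference cost without moving accuracy.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```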

Three-Tier Validation Layer

  • Tier one: Format validation - regex checks for invoice numbers, dates, monetary values
  • Tier two: Confidence scoring - anything below 0.95 gets routed to human review
  • Tier three: Business rules - does the total equal the sum of line items? Are the procedure codes valid for this provider type?

The combination of fine-tuning plus validation got us from 91% to 99%. The model was more accurate, and the validation layer caught most of what it missed.
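
The validation layer is deliberately boring code. A stripped-down sketch of the three tiers - the regexes and threshold are illustrative, and the real version also checks procedure-code validity against a provider-type lookup:

```python
import re
from decimal import Decimal, InvalidOperation

CONFIDENCE_THRESHOLD = 0.95
INVOICE_NO = re.compile(r"^[A-Z0-9\-]{4,20}$")   # illustrative formats
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
MONEY = re.compile(r"^\d+\.\d{2}$")

def validate(extraction: dict) -> tuple[bool, list[str]]:
    """Return (passed, reasons); any failure routes the invoice to human review."""
    reasons = []

    # Tier 1: format checks on the extracted fields.
    if not INVOICE_NO.match(extraction.get("invoice_number", "")):
        reasons.append("invoice number format")
    if not ISO_DATE.match(extraction.get("invoice_date", "")):
        reasons.append("date format")
    if not MONEY.match(str(extraction.get("total", ""))):
        reasons.append("total format")

    # Tier 2: model confidence.
    if extraction.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        reasons.append("low confidence")

    # Tier 3: business rules - line items must sum to the stated total.
    try:
        line_sum = sum(Decimal(str(i["amount"])) for i in extraction.get("line_items", []))
        if line_sum != Decimal(str(extraction["total"])):
            reasons.append("line items do not sum to total")
    except (KeyError, TypeError, InvalidOperation):
        reasons.append("business rules could not run")

    return len(reasons) == 0, reasons
```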

The Problems We Didn't Anticipate

The architecture above sounds clean. The path to it was not.

Week 2: OCR Quality Was Worse Than We Expected

Two weeks in, 30% of our invoices were producing garbage output. Not because PaddleOCR was bad - because the input images were bad. Scanned at odd angles. Low resolution. Fax artifacts. Handwritten annotations overlapping printed text.

We hadn't budgeted time for preprocessing. We should have.

We added rotation correction, adaptive thresholding for low-contrast scans, and region detection to separate printed text from handwritten notes. This took a week we hadn't planned for. It also improved downstream accuracy by 15%.
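
The preprocessing itself is plain OpenCV. A rough sketch of the deskew and thresholding steps - the angle convention returned by minAreaRect changed across OpenCV versions, so treat the sign handling as a starting point, and the printed/handwritten region separation is omitted here:

```python
import cv2
import numpy as np

def preprocess_scan(image_path: str) -> np.ndarray:
    """Deskew and binarize a scanned invoice before it hits OCR."""
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)

    # Rotation correction: estimate page skew from the minimum-area rectangle
    # around dark (text) pixels, then rotate the page back to level.
    coords = np.column_stack(np.where(gray < 200)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:          # minAreaRect's angle convention varies by version
        angle -= 90
    h, w = gray.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    deskewed = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)

    # Adaptive thresholding copes with low-contrast scans and fax artifacts
    # far better than a single global threshold.
    return cv2.adaptiveThreshold(deskewed, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 31, 15)
```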

Lesson learned early: OCR is half the problem. Never assume clean input.

Week 4: The Model Had Memorized Vendor Formats

Our test accuracy looked great - 96%. We were ready to deploy.

Then someone suggested we test on invoices from vendors that weren't in the training set. Held-out vendors, not just held-out invoices.

The result: 81% accuracy. A 15-point gap.

The model had learned that Vendor A always puts the total in the bottom-right corner. Vendor B uses a specific date format. Vendor C has a particular way of listing line items. When it saw a new vendor's format, it fell apart.

We fixed this with vendor-agnostic preprocessing, diverse training batches, and - most importantly - evaluation on truly held-out vendors. The gap dropped to 4 points. Not perfect, but good enough for production where we'd add new vendors to training over time.

This one stung. Standard train/val splits had hidden a serious generalization problem. We'd have caught it in week one if we'd thought to test on held-out vendors from the start.
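
On the evaluation side, the fix is nearly a one-liner if you use grouped splits - something along these lines with scikit-learn (a sketch, not our exact harness):

```python
from sklearn.model_selection import GroupShuffleSplit

def vendor_holdout_split(invoices, vendor_ids, test_size=0.2, seed=42):
    """Split so that no vendor in the test set ever appears in training.

    Random invoice-level splits let the model coast on memorized vendor
    layouts; grouping by vendor measures generalization to unseen formats.
    """
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(invoices, groups=vendor_ids))
    return train_idx, test_idx
```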

Week 6: Rare Line Item Types Were Getting Crushed

Most invoices contain standard stuff - common medications, routine procedures, straightforward billing codes. These made up 88% of our training data.

The remaining 12% was split between compound medications, durable medical equipment (DME) rentals, and bundled service codes. These are trickier to extract - compound medications have unusual naming conventions, DME rentals span multiple line items, and bundled codes require understanding which services are grouped together.

The model's F1 on these minority classes was 0.68. Overall accuracy looked fine because the majority classes dominated the average. But an F1 of 0.68 on compound medications meant roughly one in three was wrong. Not acceptable for production.

We tried weighted cross-entropy loss, with weights inversely proportional to class frequency. Compound medications got a weight of 8.2 versus 1.0 for standard line items. We combined this with stratified splits to ensure rare classes appeared in every validation fold - otherwise, we couldn't reliably measure whether our fixes were working.
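
In PyTorch terms, the weighting itself is a few lines. A sketch with a toy label distribution, not our real counts:

```python
from collections import Counter

import torch
import torch.nn as nn

def inverse_frequency_weights(labels: list[int], num_classes: int) -> torch.Tensor:
    """Weights inversely proportional to class frequency, scaled so the most
    common class sits at 1.0 and rare classes get pushed up."""
    counts = Counter(labels)
    most_common = max(counts.values())
    return torch.tensor(
        [most_common / counts.get(c, 1) for c in range(num_classes)],
        dtype=torch.float32,
    )

# Toy distribution: standard items dominate, compound meds are rare.
train_labels = [0] * 880 + [1] * 40 + [2] * 45 + [3] * 35
weights = inverse_frequency_weights(train_labels, num_classes=4)
criterion = nn.CrossEntropyLoss(weight=weights)
```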

Minority class F1 went from 0.68 to 0.81. Not perfect, but combined with human review on low-confidence predictions, it was good enough. The remaining errors got caught by our validation layer and routed to the review queue.

The Production Incident

Three months into production, our daily accuracy report showed a drop from 99% to 91% on one vendor's invoices.

This wasn't a gradual decline. The previous day's batch had processed normally. Something had changed overnight.

Four hours of investigation later, we found the culprit: a major vendor had updated their invoice template. Field positions had shifted. The date format changed from MM/DD/YYYY to YYYY-MM-DD. The model had been trained on the old format and was now extracting the wrong fields.

This is the trade-off with fine-tuning versus RAG. With RAG, we could have updated a template in 30 minutes. With fine-tuning, we needed new training data and a retraining cycle.

Here's how we handled it: we immediately routed that vendor's invoices to the human review queue. Over the next few days, the review team processed invoices manually while we collected corrected examples. Once we had enough samples in the new format, we retrained and redeployed. Total time to full recovery: about a week.

For smaller vendors, the calculus is different. If a low-volume vendor changes their format, we keep them in the human review queue until we've accumulated enough examples to justify a retraining cycle. Sometimes that's a few weeks. The volume doesn't warrant faster turnaround.

The incident taught us that fine-tuning comes with operational overhead that RAG doesn't have. We accepted that trade-off for the accuracy and privacy gains, but it's a real trade-off. We now monitor vendor-level accuracy daily and have a documented process for handling template changes: immediate queue routing, sample collection, scheduled retraining.
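
The monitoring side doesn't need to be fancy. A sketch of the daily check - the column names and the 5-point threshold are illustrative:

```python
import pandas as pd

DROP_THRESHOLD = 0.05  # flag any vendor that falls 5+ points day-over-day

def flag_template_drift(daily: pd.DataFrame) -> pd.DataFrame:
    """Expects columns: vendor, date, field_accuracy (one row per vendor per day).

    A sharp single-day drop for one vendor is the signature of a template
    change; gradual drift across all vendors points at something else.
    """
    daily = daily.sort_values(["vendor", "date"]).copy()
    daily["prev_accuracy"] = daily.groupby("vendor")["field_accuracy"].shift(1)
    drop = daily["prev_accuracy"] - daily["field_accuracy"]
    return daily.loc[drop >= DROP_THRESHOLD,
                     ["vendor", "date", "prev_accuracy", "field_accuracy"]]

# Flagged vendors go straight to the human review queue while corrected
# samples accumulate for the next retraining cycle.
```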

Production ML isn't about building a model. It's about building a system that stays accurate when the world changes around it. And the world always changes.

What Actually Mattered

Six months in, I can point to five decisions that made the difference between a demo and a system that actually works.

Moving from RAG to fine-tuning. This was the big one. RAG hit a ceiling at 93%. Fine-tuning broke through to 99%. The accuracy gain justified the added operational complexity.

Using BioBERT for medical NER. General models couldn't handle the terminology. Domain-specific pre-training wasn't a nice-to-have; it was essential.

Cross-vendor validation from day one. Standard train/val splits hid our generalization problem. Testing on held-out vendors would have caught it immediately.

Weighted loss for class imbalance. Simple technique, big impact. Rare line item types went from unusable to acceptable.

Monitoring vendor-level accuracy. Aggregate metrics hide localized problems. When one vendor's template changed, we caught it within 24 hours instead of letting errors accumulate.

The Results

After six months in production:

  • 99% field-level accuracy across all invoice types (up from 91% with RAG)
  • 73% reduction in processing time - from 8+ minutes to under 2 minutes per invoice
  • 77% of invoices fully automated - no human touch required
  • Minority class F1 improved from 0.68 to 0.81
  • Zero data leaving our environment - full healthcare compliance

The business impact: estimated $340K annual savings in processing costs, and same-day invoice processing where there used to be a 3-5 day backlog.

What I'd Do Differently

If I were starting this project again, three things would change.

Budget more time for preprocessing. We treated OCR as a solved problem. It isn't. Image quality issues ate a week we hadn't planned for, and preprocessing improvements drove more accuracy gains than model tuning.

Test on held-out groups immediately. Not just held-out samples - held-out groups. Vendors, time periods, whatever natural segmentation exists in your data. This catches generalization failures that random splits miss.

Plan for template changes from day one. With fine-tuning, vendor template changes require retraining. We built our monitoring and retraining pipeline after the first incident. It should have been part of the initial design.

Final Thoughts

The model was maybe 30% of this project. The rest was preprocessing, validation logic, monitoring, and maintenance.

If you're building document processing systems, the unsexy work - rotation correction, class-level metrics, vendor-specific monitoring, retraining pipelines - is what separates demos from production. The ML community focuses on model architecture. Production focuses on everything else.

Fine-tuning gave us better accuracy than RAG ever could. It also gave us operational overhead that RAG didn't have. Every template change means collecting new data and retraining. For us, the trade-off was worth it - 99% accuracy versus 91%, plus data privacy and lower per-inference costs. But it's not a free lunch.

The system's been running for six months now. Accuracy has held steady. The monitoring catches template drift quickly. The human review loop handles edge cases while we collect data for the next retraining cycle.

It works. And that's the whole point.


I'm a Senior Data Scientist at Fractal Analytics, where I build LLM-powered automation systems for enterprise clients. Previously led ML teams at R Systems working on healthcare and content applications. Always happy to talk shop - connect with me on LinkedIn.
