<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vaibhav Rathi</title>
    <description>The latest articles on DEV Community by Vaibhav Rathi (@vrathi_8).</description>
    <link>https://dev.to/vrathi_8</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3752855%2Fa3de2aa2-e420-4a3c-9e34-f749c6e6fc7a.png</url>
      <title>DEV Community: Vaibhav Rathi</title>
      <link>https://dev.to/vrathi_8</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vrathi_8"/>
    <language>en</language>
    <item>
      <title>How We Achieved 30% Conversion Lift by Moving from GPT-4 to LoRA Adapters</title>
      <dc:creator>Vaibhav Rathi</dc:creator>
      <pubDate>Mon, 09 Feb 2026 15:00:00 +0000</pubDate>
      <link>https://dev.to/vrathi_8/how-we-achieved-30-conversion-lift-by-moving-from-gpt-4-to-lora-adapters-35j4</link>
      <guid>https://dev.to/vrathi_8/how-we-achieved-30-conversion-lift-by-moving-from-gpt-4-to-lora-adapters-35j4</guid>
      <description>&lt;h2&gt;Scaling content optimization to 75+ clients with LLaMA 3 fine-tuning&lt;/h2&gt;

&lt;p&gt;In April 2024, I joined a project to build a content optimization system for a US-based analytics startup. The brief was simple to state and hard to deliver: improve blog conversion rates across 75+ clients, each with its own distinct brand voice.&lt;/p&gt;

&lt;p&gt;We started with GPT-4 Turbo and few-shot prompting. It worked, sort of. We could maintain voice consistency about 62% of the time. But we were spending $0.13 to $0.26 per request just on the few-shot examples, before even processing content.&lt;/p&gt;

&lt;p&gt;Four months later, we'd moved to fine-tuned LLaMA 3 adapters. Voice consistency jumped to 88%. More importantly, conversion rates improved 30%, from 2.0% to 2.6% CTR in A/B tests.&lt;/p&gt;

&lt;p&gt;Here's why few-shot prompting hit a ceiling, how LoRA adapters solved it, and what we learned building this at scale.&lt;/p&gt;

&lt;h2&gt;The Problem We Were Solving&lt;/h2&gt;

&lt;p&gt;The startup ran a content analytics platform serving 75+ clients across different industries. Each client published blog posts on their own websites with CTAs embedded throughout. The goal was simple: optimize existing blog content to increase CTA click-through rates.&lt;/p&gt;

&lt;p&gt;The catch was that every client had a distinct brand voice. A B2B SaaS company writing about enterprise software needed formal, technical language. An ecommerce fashion brand needed casual, emotional copy. A healthcare provider needed professional yet empathetic tone. Generic rewrites wouldn't work. The content had to sound like it came from that specific brand.&lt;/p&gt;

&lt;p&gt;Manual optimization wasn't scaling. Content teams could only rewrite so many posts per week, and quality varied depending on who did the work. We needed automation that could maintain each client's voice while improving conversion.&lt;/p&gt;

&lt;h2&gt;Starting with GPT-4: The POC Phase&lt;/h2&gt;

&lt;p&gt;We started with 10 pilot clients in April 2024. Two large enterprise clients, three mid-market companies, five smaller brands. The fastest path to validate the concept was GPT-4 Turbo with few-shot prompting.&lt;/p&gt;

&lt;p&gt;The approach was straightforward. For each client, we built a custom prompt that included their style guide, 5 to 10 examples of their high-performing blog posts (originals paired with optimized versions), and clear instructions about what to preserve and what to improve.&lt;/p&gt;

&lt;p&gt;The first couple weeks looked great. We were seeing 85% approval rates from human reviewers. Posts sounded like the brand. CTAs felt natural. We thought we'd solved it.&lt;/p&gt;

&lt;p&gt;By week four, approval rates had dropped to 70%. By week eight, we were stuck at 62%.&lt;/p&gt;

&lt;p&gt;Same prompts, same examples, inconsistent quality. One post would nail the formal B2B tone. The next would come out too casual. A third would be somewhere in between. The model couldn't maintain consistent voice across rewrites, even with the same client and same prompt.&lt;/p&gt;

&lt;p&gt;Our multi-dimensional evaluation tracked four things: voice consistency, content quality, CTA integration, and overall polish. Each scored on a 1 to 5 scale by human reviewers.&lt;/p&gt;

&lt;p&gt;GPT-4's results after eight weeks looked like this. Voice consistency scored 3.1 out of 5. That was the primary problem. Content quality was actually good at 4.1, but CTA integration was weak at 2.9. Overall approval rate sat at 62%. Posts needed an average of 1.7 revision rounds before publishing.&lt;/p&gt;

&lt;p&gt;Client feedback was consistent. "This doesn't sound like our brand." "Where's our unique voice?" "This could be any B2B company."&lt;/p&gt;

&lt;p&gt;We tried adding more examples. We tried different prompt structures. We tried breaking down style guidelines into more specific rules. Nothing pushed us past 70% voice consistency.&lt;/p&gt;

&lt;h2&gt;The Three Problems That Forced Our Hand&lt;/h2&gt;

&lt;p&gt;The voice inconsistency was frustrating, but there were two other problems making it worse.&lt;/p&gt;

&lt;h3&gt;The Token Economics Problem&lt;/h3&gt;

&lt;p&gt;To maintain even 62% voice consistency, we needed 5 to 10 few-shot examples per client. Each example included an original post (about 1,300 tokens) and its rewritten version (another 1,300 tokens). That's roughly 2,600 tokens per example.&lt;/p&gt;

&lt;p&gt;Five examples minimum meant 13,000 tokens just for examples. Ten examples for slightly better quality meant 26,000 tokens.&lt;/p&gt;

&lt;p&gt;Our full prompt structure with GPT-4 Turbo looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System prompt: 500 tokens&lt;/li&gt;
&lt;li&gt;Few-shot examples: 13,000 to 26,000 tokens&lt;/li&gt;
&lt;li&gt;Input post: 1,300 tokens&lt;/li&gt;
&lt;li&gt;Total input per request: 14,800 to 27,800 tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GPT-4 Turbo can handle it. It has a 128K context window. But the cost adds up fast.&lt;/p&gt;

&lt;p&gt;Input cost was 13,000 tokens times $0.01 per thousand, which is $0.13 per request for 5 examples. With 10 examples, that jumped to $0.26. Output cost added another $0.039. Total cost per rewrite ranged from $0.17 to $0.30.&lt;/p&gt;

&lt;p&gt;At 75 clients times 50 posts per month, that's 3,750 rewrites monthly. Few-shot overhead alone cost $487 to $975 per month. Annually, that's $5,844 to $11,700 just for examples that never changed. We were paying repeatedly for static data.&lt;/p&gt;
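&lt;p&gt;That arithmetic is easy to check. A minimal sketch, using the example sizes and the $0.01-per-1K-input-token GPT-4 Turbo rate quoted above:&lt;/p&gt;

```python
# Few-shot overhead math from the figures above.
TOKENS_PER_EXAMPLE = 2_600   # ~1,300-token original + ~1,300-token rewrite
INPUT_RATE = 0.01 / 1_000    # GPT-4 Turbo: $0.01 per 1K input tokens

def few_shot_overhead(n_examples: int) -> float:
    """Dollar cost of the few-shot examples alone, per request."""
    return n_examples * TOKENS_PER_EXAMPLE * INPUT_RATE

monthly_rewrites = 75 * 50   # 75 clients x 50 posts/month

print(round(few_shot_overhead(5), 2))                      # 0.13
print(round(few_shot_overhead(10), 2))                     # 0.26
print(round(few_shot_overhead(5) * monthly_rewrites, 2))   # 487.5
print(round(few_shot_overhead(10) * monthly_rewrites, 2))  # 975.0
```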

&lt;p&gt;The quality-cost trap was real. More examples meant better quality but $0.26 per request. Fewer examples meant lower cost but stuck at 62% approval. We couldn't win with prompting alone.&lt;/p&gt;

&lt;p&gt;And even with 10 examples, quality topped out around 70% voice consistency. We'd hit the ceiling.&lt;/p&gt;

&lt;h3&gt;The Long-Form Content Issue&lt;/h3&gt;

&lt;p&gt;About 25 to 30% of posts were long-form content, over 2,500 words. We experimented with chunking these into smaller pieces, processing each separately, then stitching them back together. It didn't work well. Narrative flow got lost across chunks. Tone would shift between sections. We dropped the chunking approach after testing it on maybe 3 to 5 articles.&lt;/p&gt;

&lt;p&gt;This wasn't a major driver for switching, but it was another pain point.&lt;/p&gt;

&lt;h2&gt;Why LoRA Adapters Made Sense&lt;/h2&gt;

&lt;p&gt;The hypothesis was simple. What if we encoded brand voice in model weights instead of sending it as few-shot examples every time?&lt;/p&gt;

&lt;p&gt;The LoRA approach meant training one adapter per client on their historical content. Brand voice would be encoded in the adapter weights. No few-shot examples needed in prompts anymore.&lt;/p&gt;

&lt;p&gt;We expected this to work for several reasons.&lt;/p&gt;

&lt;p&gt;First, it eliminates recurring token cost. Few-shot examples were costing us $487 to $975 monthly. With adapters, that drops to zero. Voice is in the weights, not in every prompt.&lt;/p&gt;

&lt;p&gt;Second, it should improve voice consistency. The model learns patterns rather than just mimicking examples. That might break through the 70% ceiling.&lt;/p&gt;

&lt;p&gt;Third, prompts become simpler. Just "rewrite this post" plus the content. No prompt engineering needed per client.&lt;/p&gt;

&lt;p&gt;Fourth, it scales better. Adding a new client means training a new adapter once, not tuning prompts on an ongoing basis.&lt;/p&gt;

&lt;p&gt;The trade-offs were clear. We'd need to train adapters instead of using instant API access. We'd need GPU infrastructure instead of serverless. There would be upfront effort per client instead of plug-and-play.&lt;/p&gt;

&lt;p&gt;But we bet on it anyway. Quality ceiling was the real blocker, not speed. With 75+ clients, we needed consistent quality. Few-shot token costs didn't scale. And training cost is one-time while token cost is recurring.&lt;/p&gt;

&lt;h2&gt;Building the System&lt;/h2&gt;

&lt;p&gt;We used LLaMA 3-8B as the base model. It had just been released in April 2024, which lined up perfectly with our project timeline. We chose the 8B variant over the 70B because the cost-benefit analysis didn't justify the larger model: for content rewriting, 8B was sufficient, and the 70B would have meant 8 to 10 times higher compute cost without meaningful quality improvement.&lt;/p&gt;

&lt;p&gt;We chose self-hosted over GPT-4 for the economics at scale. With 75+ clients and high volume, the GPU infrastructure costs would be lower than ongoing API fees.&lt;/p&gt;

&lt;p&gt;Our LoRA configuration settled on rank 16 after experimenting. Rank 8 underfitted on complex content. Rank 32 showed diminishing returns. Alpha was set to 32, which is standard at twice the rank. Dropout started at 0.05, but we increased it to 0.1 for clients with fewer than 20 training examples.&lt;/p&gt;

&lt;p&gt;We targeted attention layers specifically, using q_proj and v_proj modules. This meant about 0.1% of the base model parameters were trainable. Training time ran 2 to 4 hours on a single GPU per adapter. We used 3 to 5 epochs depending on dataset size.&lt;/p&gt;
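&lt;p&gt;As a sketch, those settings map onto a Hugging Face PEFT-style config, and the 0.1% figure roughly checks out with back-of-envelope LLaMA 3-8B shapes (32 layers, hidden size 4096, grouped-query key/value dim 1024 - the shapes are our assumption, not something stated above):&lt;/p&gt;

```python
# Field names follow peft.LoraConfig; this is a sketch of the settings
# described above, not our exact training script.
lora_config = {
    "r": 16,                                 # rank 8 underfit, 32 gave diminishing returns
    "lora_alpha": 32,                        # standard: twice the rank
    "lora_dropout": 0.05,                    # raised to 0.1 for <20-example clients
    "target_modules": ["q_proj", "v_proj"],  # attention layers only
    "task_type": "CAUSAL_LM",
}

# LoRA adds A (d_in x r) and B (r x d_out) per adapted matrix.
def lora_params(rank, d_in, d_out):
    return rank * (d_in + d_out)

LAYERS, HIDDEN, KV_DIM = 32, 4096, 1024  # assumed LLaMA 3-8B shapes
per_layer = lora_params(16, HIDDEN, HIDDEN) + lora_params(16, HIDDEN, KV_DIM)
trainable = per_layer * LAYERS
print(f"{trainable:,} trainable ({trainable / 8e9:.2%} of 8B)")  # 6,815,744 trainable (0.09% of 8B)
```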

&lt;p&gt;Training data per client ranged from 10 to 100 examples. The ideal was 30 to 50 examples. We used a mix of high-performing and low-performing posts. The format was original post paired with optimized version, where the optimized versions came from their high-conversion examples.&lt;/p&gt;

&lt;p&gt;Our onboarding pipeline for new clients took 3 to 5 days. First, collect existing blog posts. Second, identify high-conversion examples. Third, prepare the training dataset with original to optimized pairs. Fourth, train the LoRA adapter, which took 2 to 4 hours. Fifth, validate on a held-out set. Sixth, deploy to production.&lt;/p&gt;
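&lt;p&gt;The training pairs themselves were just original/optimized text pairs. One plausible serialization, in the common instruction-tuning layout (the field names are our illustration, not the client's actual schema):&lt;/p&gt;

```python
import json

# Hypothetical shape of one training pair; the post only says
# "original post paired with optimized version".
pair = {
    "instruction": "Rewrite this blog post in the client's brand voice, "
                   "preserving facts and improving CTAs.",
    "input": "ORIGINAL POST TEXT",
    "output": "HIGH-CONVERSION OPTIMIZED VERSION",
}
line = json.dumps(pair)  # one JSON object per line -> one training example
print(json.loads(line)["output"])  # HIGH-CONVERSION OPTIMIZED VERSION
```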

&lt;p&gt;It wasn't instant like GPT-4, but once deployed, it worked consistently.&lt;/p&gt;

&lt;h2&gt;The Problems We Actually Hit&lt;/h2&gt;

&lt;h3&gt;Overfitting on Small Datasets&lt;/h3&gt;

&lt;p&gt;Five of our 10 pilot clients had only 10 to 15 training posts available. Not enough data to learn generalizable patterns.&lt;/p&gt;

&lt;p&gt;What happened was the LoRA adapters memorized exact phrases from training data. Generated content felt like copy-paste jobs. We'd lost creativity. The output was too rigid.&lt;/p&gt;

&lt;p&gt;For example, if a training post said "Our platform helps teams collaborate better," the generation would reproduce that phrase nearly verbatim.&lt;/p&gt;

&lt;p&gt;We fixed this with four changes. First, we increased the minimum training data requirement to 20 posts. Second, we added higher dropout at 0.1 instead of 0.05 for small datasets. Third, we lowered the rank from 16 to 8 to reduce model capacity. Fourth, we used data augmentation by paraphrasing training examples.&lt;/p&gt;
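&lt;p&gt;Folded into one helper, the sizing rules look something like this. The 20-post floor, ranks, and dropout values come from the changes above; the exact cutoff separating "small" from "normal" datasets (30 here) is a guess for illustration:&lt;/p&gt;

```python
# Per-client adapter sizing from the fixes above. The 30-example
# cutoff is our assumption, not a number from the post.
def adapter_hparams(n_examples: int) -> dict:
    if n_examples < 20:
        raise ValueError("need at least 20 training posts per client")
    if n_examples < 30:
        # small dataset: less capacity, more regularization
        return {"r": 8, "lora_dropout": 0.1}
    return {"r": 16, "lora_dropout": 0.05}

print(adapter_hparams(25))  # {'r': 8, 'lora_dropout': 0.1}
print(adapter_hparams(50))  # {'r': 16, 'lora_dropout': 0.05}
```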

&lt;p&gt;Overfitting dropped noticeably. Outputs felt more natural while still maintaining the brand voice.&lt;/p&gt;

&lt;h3&gt;CTA Placement Was the Hardest Challenge&lt;/h3&gt;

&lt;p&gt;By week 4 or 5, our results looked like this. Content quality: 8 out of 10. Voice matching: 8.5 out of 10. CTA integration: 5 out of 10. That gap was the problem.&lt;/p&gt;

&lt;p&gt;CTAs were appearing in random locations. Sometimes they'd be missing entirely. Sometimes the model used generic "Learn More" instead of client-specific copy. Sometimes multiple CTAs would cluster too close together.&lt;/p&gt;

&lt;p&gt;The reason CTA placement was hard is that it's strategic, not just stylistic. It depends on content flow. You need to build problem, then solution, then CTA. Training data had varying approaches to this. The model couldn't infer optimal placement rules from examples alone.&lt;/p&gt;

&lt;p&gt;Our solution was to use a structured output format with explicit CTA sections. We added CTA placement rules to the training data labels. Human reviewers would approve or reject CTA suggestions. Some clients needed multiple iterations to get it right.&lt;/p&gt;
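&lt;p&gt;One way to picture that structured format, with section and field names invented for this sketch (the real schema isn't published above), plus the kind of mechanical check a reviewer tool could run:&lt;/p&gt;

```python
# Illustrative structured output with explicit CTA sections.
rewrite = {
    "sections": [
        {"type": "body", "text": "Problem framing..."},
        {"type": "body", "text": "Solution..."},
        {"type": "cta", "copy": "Start your free trial", "placement": "after_solution"},
    ]
}

def cta_checks(doc, min_gap=1):
    """Reject drafts with no CTA or CTAs clustered back-to-back."""
    idx = [i for i, s in enumerate(doc["sections"]) if s["type"] == "cta"]
    if not idx:
        return "reject: missing CTA"
    if any(b - a <= min_gap for a, b in zip(idx, idx[1:])):
        return "reject: clustered CTAs"
    return "ok"

print(cta_checks(rewrite))  # ok
```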

&lt;p&gt;CTA quality improved from 5 out of 10 to 8 out of 10. It still required human oversight, but that was fine. Much better than random placement.&lt;/p&gt;

&lt;h3&gt;Quality Gap Between GPT-4 and Initial LoRA&lt;/h3&gt;

&lt;p&gt;Week 6 brought a reality check. We had human evaluators score both approaches.&lt;/p&gt;

&lt;p&gt;GPT-4 quality: 8 out of 10. Initial LoRA quality: 6.5 out of 10. Same evaluators, same criteria.&lt;/p&gt;

&lt;p&gt;The gap analysis showed that GPT-4 was better at creative, compelling writing. LoRA preserved voice but the writing felt flat. It was missing the punch that drives conversions.&lt;/p&gt;

&lt;p&gt;Our hypothesis was that GPT-4, a much larger model than LLaMA 3-8B (OpenAI hasn't disclosed its parameter count), simply has more language modeling capacity. Small training datasets of 10 to 50 examples might not be enough to match that sophistication.&lt;/p&gt;

&lt;p&gt;We tried three things to close the gap. First, we increased rank from 8 to 16. Quality improved from 6.5 to 7.2. Second, we collected more training data, going from 20 to 50 examples per client. Quality improved to 7.8. Third, we curated training data more carefully, focusing on the highest-quality rewrites. Quality reached 8.2.&lt;/p&gt;

&lt;p&gt;For premium clients who needed the absolute best quality, we used a hybrid approach. GPT-4 would generate the initial draft, then the LoRA adapter would refine it to match brand voice. Best of both worlds, though more complex.&lt;/p&gt;

&lt;p&gt;For standard clients, LoRA alone at 8 out of 10 quality was acceptable. The trade-off was slight quality reduction for consistency and cost savings.&lt;/p&gt;

&lt;h2&gt;The Results&lt;/h2&gt;

&lt;p&gt;Our multi-dimensional evaluation after deploying LoRA adapters to all 10 pilot clients showed clear improvement.&lt;/p&gt;

&lt;p&gt;Voice consistency jumped from 3.1 to 4.4 out of 5. That's a 42% improvement. Content quality stayed steady at 4.2, barely changed from GPT-4. CTA integration improved from 2.9 to 4.0, a 38% gain. Overall approval rate went from 62% to 88%, a 42% improvement. Revision rounds dropped from 1.7 to 1.1 average, which is 35% fewer iterations.&lt;/p&gt;

&lt;p&gt;The business impact mattered most. We ran A/B tests over one month across the pilot clients. Baseline conversion rate on CTAs was 2.0%. After LoRA optimization, conversion rate hit 2.6%. That's a 30% relative improvement. Direct revenue impact from better content.&lt;/p&gt;

&lt;p&gt;We scaled from 10 pilot clients to 75+ in production by October 2024. The LoRA automation is what made that 7.5x client growth possible. The operational model shifted from high-touch with GPT-4 to low-touch with adapters.&lt;/p&gt;

&lt;p&gt;Token economics shifted dramatically. With GPT-4, few-shot overhead was 13,000 tokens per request and total prompt size was around 15,000 tokens, putting the cost per rewrite at $0.17 to $0.30. With LoRA, there's no few-shot overhead. Zero tokens for examples. Prompt size dropped to around 2,000 tokens with just instructions and content. GPU infrastructure cost about $500 monthly for all 75 clients combined.&lt;/p&gt;

&lt;p&gt;The context benefit was real too. We could handle posts over 5,000 words easily. No chunking needed. Full context preserved throughout.&lt;/p&gt;

&lt;h2&gt;What We Learned&lt;/h2&gt;

&lt;p&gt;Few-shot prompting hit a quality ceiling around 70% voice consistency for us. Fine-tuning broke through to 88%. When quality is the blocker and you need consistency at scale, invest in fine-tuning.&lt;/p&gt;

&lt;p&gt;Token overhead is real and matters. Those 13,000 tokens for few-shot examples created a major constraint. It limited context, increased cost, and ultimately reduced quality because we couldn't include enough examples without breaking the budget. LoRA encodes voice in weights and eliminates that overhead completely.&lt;/p&gt;

&lt;p&gt;Not all clients need the same approach. Small datasets under 20 examples needed higher dropout and lower rank to avoid overfitting. B2B technical content needed rank 16 for nuance. B2C casual content worked fine with rank 8. One size doesn't fit all.&lt;/p&gt;

&lt;p&gt;CTA placement can't be left implicit. The model won't infer strategic placement from examples alone. We needed an explicit output format with CTA suggestions. Human oversight remained valuable here.&lt;/p&gt;

&lt;p&gt;For premium clients where quality matters most, a hybrid approach works. GPT-4 for creativity plus LoRA for consistency. Best of both worlds, though operationally more complex.&lt;/p&gt;

&lt;p&gt;The real metric is conversion, not proxy metrics. Voice consistency and approval rates are useful proxies, but the 30% conversion lift validated the entire approach. That's what the business cared about.&lt;/p&gt;

&lt;h2&gt;When to Use Each Approach&lt;/h2&gt;

&lt;p&gt;Use the GPT-4 API when you have fewer than 10 clients, need fast experimentation, can live with a flexible quality bar, have budget for API costs, and don't have ML infrastructure.&lt;/p&gt;

&lt;p&gt;Use LoRA adapters when you have 20+ clients where scale matters, voice consistency is critical, you've hit a quality ceiling with prompting, high volume makes token costs add up, and you have ML infrastructure or are willing to build it.&lt;/p&gt;

&lt;p&gt;Our recommendation is to start with GPT-4 for proof of concept. It's the fastest way to validate whether the approach works at all. Switch to LoRA when you're scaling to 20+ clients and quality consistency becomes the bottleneck. Use hybrid for premium clients who need the absolute best quality.&lt;/p&gt;

&lt;h2&gt;Looking Back&lt;/h2&gt;

&lt;p&gt;By October 2024, four months after we started building the LoRA system, we were serving 75+ clients with individualized brand voices. Each adapter took 3 to 5 days to train, but once deployed, it worked consistently.&lt;/p&gt;

&lt;p&gt;Voice consistency improved from 62% to 88%. Revision cycles dropped from 1.7 to 1.1 rounds average. More importantly, conversion rates improved 30%. That was the metric that mattered.&lt;/p&gt;

&lt;p&gt;The lesson is that when you're scaling personalized LLM systems, encoding domain knowledge in model weights beats few-shot prompting. Not because prompting can't work. It can. But because it's wasteful at scale. You're paying repeatedly for static data, in this case brand voice, that never changes.&lt;/p&gt;

&lt;p&gt;With LoRA, you pay once to encode it, and after that the marginal cost is just inference. Better quality, lower cost, simpler operations.&lt;/p&gt;

&lt;p&gt;If you're building content systems at scale, don't underestimate the few-shot token overhead. It's not just a cost issue. It's an architectural constraint that limits what you can build.&lt;/p&gt;

&lt;p&gt;I'm Vaibhav Rathi, a Senior Data Scientist at Fractal Analytics with 8+ years building production ML and LLM systems. This work was done during my time at R Systems in 2024. I'm currently building LLM-powered automation for enterprise clients. Connect with me on LinkedIn at linkedin.com/in/vaibhav-rathi-ai.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>ai</category>
      <category>datascience</category>
    </item>
    <item>
      <title>How We Built a 99% Accurate Invoice Processing System Using OCR and LLMs</title>
      <dc:creator>Vaibhav Rathi</dc:creator>
      <pubDate>Thu, 05 Feb 2026 16:35:30 +0000</pubDate>
      <link>https://dev.to/vrathi_8/how-we-built-a-99-accurate-invoice-processing-system-using-ocr-and-llms-4png</link>
      <guid>https://dev.to/vrathi_8/how-we-built-a-99-accurate-invoice-processing-system-using-ocr-and-llms-4png</guid>
      <description>&lt;p&gt;We had a working RAG solution at 91% accuracy. Here's why we rebuilt it with fine-tuning and what we learned along the way.&lt;/p&gt;

&lt;p&gt;Our client was spending eight minutes per invoice on manual data entry. At 10,000 invoices a month, that's a full team doing nothing but copying numbers from PDFs into a database.&lt;/p&gt;

&lt;p&gt;We were building an invoice processing system for a US healthcare client. The goal was straightforward - extract line items, medical codes, and billing information from unstructured invoice images. Straightforward until you open the first PDF.&lt;/p&gt;

&lt;p&gt;Healthcare invoices are chaotic. Every vendor has a different format. Some use tables, some use free-form layouts, some mix typed text with handwritten notes. And the terminology is brutal - drug names that look like keyboard smashes, procedure codes that differ by a single digit, billing amounts scattered across the page with no consistent positioning.&lt;/p&gt;

&lt;p&gt;They wanted automation. They also wanted accuracy - in healthcare, a wrong code means denied claims, compliance issues, or payment disputes.&lt;/p&gt;

&lt;p&gt;We actually had a working solution already. Azure OpenAI with RAG, vendor templates in a knowledge base, 91% field-level accuracy. Not bad.&lt;/p&gt;

&lt;p&gt;But three problems pushed us to reconsider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Healthcare data flowing through external APIs&lt;/li&gt;
&lt;li&gt;Per-request costs adding up at scale&lt;/li&gt;
&lt;li&gt;That stubborn 9% error rate that we couldn't push lower no matter how we tuned the prompts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Six months later, we had a fine-tuned system running at 99% field-level accuracy, fully on-premises, processing invoices in under two minutes each. But getting there taught us more about production ML than any course or paper ever did.&lt;/p&gt;

&lt;h2&gt;Why We Moved Away from RAG&lt;/h2&gt;

&lt;p&gt;Our RAG-based system worked. 91% accuracy is respectable. But we hit a ceiling.&lt;/p&gt;

&lt;p&gt;The problem with RAG for document extraction is that you're asking the model to generalize from templates. "Here's what Vendor A's invoice looks like, here's where they put the total, now extract it." Works great when the invoice matches the template. Falls apart on edge cases - slightly different layouts, unexpected fields, formatting variations within the same vendor.&lt;/p&gt;

&lt;p&gt;We tried everything. Better templates. More examples in the knowledge base. Prompt engineering. Three months of effort squeezed out maybe 2% more accuracy, which left us stuck at 93%.&lt;/p&gt;

&lt;p&gt;Meanwhile, three other issues were building pressure:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data privacy.&lt;/strong&gt; Healthcare invoices contain patient-adjacent information. Every API call to Azure meant data leaving our controlled environment. The client's compliance team was getting nervous, and rightfully so.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost at scale.&lt;/strong&gt; At our volume, API costs were adding up. Not catastrophic, but enough that "what if we self-hosted?" became a recurring question in planning meetings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The accuracy ceiling.&lt;/strong&gt; 93% sounds good until you do the math. At 10,000 invoices per month, that's 700 invoices with errors. 700 invoices needing manual review and correction. The automation wasn't as automatic as the numbers suggested.&lt;/p&gt;

&lt;p&gt;We made the call to rebuild with fine-tuning. Self-hosted LLaMA 3.2, trained on our actual invoice data. More work upfront, but it addressed all three problems.&lt;/p&gt;

&lt;h2&gt;The Architecture We Landed On&lt;/h2&gt;

&lt;p&gt;The final pipeline has four components, each chosen after some painful lessons.&lt;/p&gt;

&lt;h3&gt;PaddleOCR for Text Extraction&lt;/h3&gt;

&lt;p&gt;We benchmarked against Tesseract and EasyOCR on a corpus of 500 invoices. PaddleOCR won by 12%, and the gap was even wider on multi-column layouts and tables - exactly the formats healthcare invoices love. &lt;/p&gt;

&lt;p&gt;Tesseract kept merging columns or missing table boundaries entirely. EasyOCR was slower without being more accurate. PaddleOCR had a steeper learning curve and less community support for edge cases, but accuracy won that trade-off.&lt;/p&gt;
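&lt;p&gt;A toy version of the column problem makes the failure mode concrete: given word boxes with coordinates, you have to group by row before reading left to right, or "Item / Qty" columns interleave. (The flat (x, y, text) box format is a simplification; PaddleOCR actually returns detection polygons with text and confidence.)&lt;/p&gt;

```python
# Group OCR word boxes (x, y, text) into rows by y, then order each
# row by x - the step Tesseract was effectively getting wrong for us.
def rows_from_boxes(boxes, y_tol=5):
    rows = {}
    for x, y, text in sorted(boxes, key=lambda b: (b[1], b[0])):
        key = min(rows, key=lambda k: abs(k - y), default=None)
        if key is None or abs(key - y) > y_tol:
            key = y  # start a new row
        rows.setdefault(key, []).append((x, text))
    return [" ".join(t for _, t in sorted(r)) for _, r in sorted(rows.items())]

boxes = [(200, 10, "Qty"), (10, 10, "Item"), (10, 30, "Metformin"), (200, 31, "2")]
print(rows_from_boxes(boxes))  # ['Item Qty', 'Metformin 2']
```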

&lt;h3&gt;BioBERT for Medical Entity Recognition&lt;/h3&gt;

&lt;p&gt;This was non-negotiable. We tried general-purpose NER models first, and they were hopeless on medical terminology. "Metformin 500mg" parsed as a single blob. CPT codes got missed entirely. Drug names with unusual spellings - and there are many in healthcare - caused constant errors.&lt;/p&gt;

&lt;p&gt;BioBERT, pre-trained on PubMed abstracts, understood medical terminology out of the box. It recognized drug names, procedure codes, and clinical terms that general models consistently missed. This wasn't a marginal improvement. This was the difference between the system working and not working.&lt;/p&gt;

&lt;h3&gt;Fine-tuned LLaMA 3.2 for Structured Extraction&lt;/h3&gt;

&lt;p&gt;This is where we diverged from our RAG approach. Instead of giving the model templates and asking it to generalize, we trained it on actual invoice-to-output pairs. The model learned the extraction task directly, not through in-context examples.&lt;/p&gt;

&lt;p&gt;Full fine-tuning would have required A100 GPUs we didn't have budget for. QLoRA let us fine-tune on T4s - roughly 7x cheaper. &lt;/p&gt;

&lt;p&gt;We tested LoRA ranks from 4 to 32. &lt;strong&gt;Rank 8 was the sweet spot.&lt;/strong&gt; Rank 4 underfitted on complex line items like bundled services. Rank 16 and above showed diminishing returns on accuracy while increasing inference time. Rank 32 actually started overfitting - validation loss ticked back up.&lt;/p&gt;
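&lt;p&gt;In bitsandbytes/PEFT terms, the setup looks roughly like this. Field names follow those libraries' configs; the NF4 and double-quantization choices are standard QLoRA defaults we're assuming, and lora_alpha at twice the rank is likewise an assumption:&lt;/p&gt;

```python
# QLoRA = 4-bit quantized base model + LoRA adapters trained on top.
quant_config = {
    "load_in_4bit": True,              # what makes T4s viable instead of A100s
    "bnb_4bit_quant_type": "nf4",      # assumed: the standard QLoRA choice
    "bnb_4bit_use_double_quant": True, # assumed default
}
lora_config = {
    "r": 8,             # our sweet spot: 4 underfit, 16+ gave diminishing returns
    "lora_alpha": 16,   # assumed: twice the rank
    "task_type": "CAUSAL_LM",
}
```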

&lt;h3&gt;Three-Tier Validation Layer&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tier one:&lt;/strong&gt; Format validation - regex checks for invoice numbers, dates, monetary values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier two:&lt;/strong&gt; Confidence scoring - below 0.95, the invoice routed to human review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier three:&lt;/strong&gt; Business rules - does the total equal the sum of line items? Are the procedure codes valid for this provider type?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The combination of fine-tuning plus validation got us from 91% to 99%. The model was more accurate, and the validation layer caught most of what it missed.&lt;/p&gt;
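&lt;p&gt;A minimal sketch of the three tiers in order - the 0.95 threshold and the sum rule are from the description above, while the regexes and field names are illustrative:&lt;/p&gt;

```python
import re

# Illustrative format patterns (tier 1); real ones covered more formats.
DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{4}$")
MONEY_RE = re.compile(r"^\d+\.\d{2}$")

def validate(extraction: dict) -> str:
    # Tier 1: format validation
    if not DATE_RE.match(extraction["date"]) or not MONEY_RE.match(extraction["total"]):
        return "review"
    # Tier 2: confidence scoring - below 0.95 goes to human review
    if extraction["confidence"] < 0.95:
        return "review"
    # Tier 3: business rules - total must equal the sum of line items
    cents = lambda s: round(float(s) * 100)  # compare in cents, not floats
    if cents(extraction["total"]) != sum(cents(li) for li in extraction["line_items"]):
        return "review"
    return "accept"

inv = {"date": "03/14/2025", "total": "150.00",
       "confidence": 0.97, "line_items": ["100.00", "50.00"]}
print(validate(inv))  # accept
```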

&lt;h2&gt;The Problems We Didn't Anticipate&lt;/h2&gt;

&lt;p&gt;The architecture above sounds clean. The path to it was not.&lt;/p&gt;

&lt;h3&gt;Week 2: OCR Quality Was Worse Than We Expected&lt;/h3&gt;

&lt;p&gt;Two weeks in, 30% of our invoices were producing garbage output. Not because PaddleOCR was bad - because the input images were bad. Scanned at odd angles. Low resolution. Fax artifacts. Handwritten annotations overlapping printed text.&lt;/p&gt;

&lt;p&gt;We hadn't budgeted time for preprocessing. We should have.&lt;/p&gt;

&lt;p&gt;We added rotation correction, adaptive thresholding for low-contrast scans, and region detection to separate printed text from handwritten notes. This took a week we hadn't planned for. It also improved downstream accuracy by 15%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson learned early:&lt;/strong&gt; OCR is half the problem. Never assume clean input.&lt;/p&gt;

&lt;h3&gt;Week 4: The Model Had Memorized Vendor Formats&lt;/h3&gt;

&lt;p&gt;Our test accuracy looked great - 96%. We were ready to deploy.&lt;/p&gt;

&lt;p&gt;Then someone suggested we test on invoices from vendors that weren't in the training set. Held-out vendors, not just held-out invoices.&lt;/p&gt;

&lt;p&gt;The result: &lt;strong&gt;81% accuracy.&lt;/strong&gt; A 15-point gap.&lt;/p&gt;

&lt;p&gt;The model had learned that Vendor A always puts the total in the bottom-right corner. Vendor B uses a specific date format. Vendor C has a particular way of listing line items. When it saw a new vendor's format, it fell apart.&lt;/p&gt;

&lt;p&gt;We fixed this with vendor-agnostic preprocessing, diverse training batches, and - most importantly - evaluation on truly held-out vendors. The gap dropped to 4 points. Not perfect, but good enough for production where we'd add new vendors to training over time.&lt;/p&gt;

&lt;p&gt;This one stung. Standard train/val splits had hidden a serious generalization problem. We'd have caught it in week one if we'd thought to test on held-out vendors from the start.&lt;/p&gt;
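&lt;p&gt;The fix is cheap to implement: split on vendor IDs, not invoice IDs, so every test invoice comes from a vendor the model has never seen. This is what scikit-learn's GroupShuffleSplit does; a stdlib sketch:&lt;/p&gt;

```python
import random

# Hold out whole vendors, not just invoices (stdlib stand-in for
# scikit-learn's GroupShuffleSplit).
def split_by_vendor(invoices, holdout_frac=0.2, seed=0):
    vendors = sorted({inv["vendor"] for inv in invoices})
    rng = random.Random(seed)
    rng.shuffle(vendors)
    n_hold = max(1, int(len(vendors) * holdout_frac))
    held = set(vendors[:n_hold])
    train = [inv for inv in invoices if inv["vendor"] not in held]
    test = [inv for inv in invoices if inv["vendor"] in held]
    return train, test

invoices = [{"id": i, "vendor": v} for i, v in enumerate("AABBCCDDEE")]
train, test = split_by_vendor(invoices)
# No vendor appears on both sides of the split:
assert not ({inv["vendor"] for inv in train} & {inv["vendor"] for inv in test})
```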

&lt;h3&gt;Week 6: Rare Line Item Types Were Getting Crushed&lt;/h3&gt;

&lt;p&gt;Most invoices contain standard stuff - common medications, routine procedures, straightforward billing codes. These made up 88% of our training data.&lt;/p&gt;

&lt;p&gt;The remaining 12% was split between compound medications, DME rentals (durable medical equipment), and bundled service codes. These are trickier to extract - compound medications have unusual naming conventions, DME rentals span multiple line items, bundled codes require understanding which services are grouped together.&lt;/p&gt;

&lt;p&gt;The model's F1 on these minority classes was 0.68. Overall accuracy looked fine because the majority classes dominated the average. But 0.68 F1 on compound medications meant roughly one in three extractions was wrong. Not acceptable for production.&lt;/p&gt;

&lt;p&gt;We tried &lt;strong&gt;weighted cross-entropy loss&lt;/strong&gt;, with weights inversely proportional to class frequency. Compound medications got a weight of 8.2 versus 1.0 for standard line items. We combined this with stratified splits to ensure rare classes appeared in every validation fold - otherwise, we couldn't reliably measure whether our fixes were working.&lt;/p&gt;
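&lt;p&gt;Inverse-frequency weighting is a one-liner: weight each class by (majority count / class count), so the majority class sits at 1.0. The counts below are made up to reproduce the 8.2 figure, not our real distribution:&lt;/p&gt;

```python
from collections import Counter

# Class weights inversely proportional to frequency, normalized so the
# majority class gets weight 1.0 (these feed the cross-entropy loss).
def class_weights(labels):
    counts = Counter(labels)
    max_count = max(counts.values())
    return {c: max_count / n for c, n in counts.items()}

labels = ["standard"] * 820 + ["compound_med"] * 100  # invented counts
w = class_weights(labels)
print(w)  # {'standard': 1.0, 'compound_med': 8.2}
```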

&lt;p&gt;Minority class F1 went from 0.68 to 0.81. Not perfect, but combined with human review on low-confidence predictions, it was good enough. The remaining errors got caught by our validation layer and routed to the review queue.&lt;/p&gt;

&lt;h2&gt;The Production Incident&lt;/h2&gt;

&lt;p&gt;Three months into production, our daily accuracy report showed a drop from 99% to 91% on one vendor's invoices.&lt;/p&gt;

&lt;p&gt;This wasn't a gradual decline. The previous day's batch had processed normally. Something had changed overnight.&lt;/p&gt;

&lt;p&gt;Four hours of investigation later, we found the culprit: a major vendor had updated their invoice template. Field positions had shifted. The date format changed from MM/DD/YYYY to YYYY-MM-DD. The model had been trained on the old format and was now extracting the wrong fields.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the trade-off with fine-tuning versus RAG.&lt;/strong&gt; With RAG, we could have updated a template in 30 minutes. With fine-tuning, we needed new training data and a retraining cycle.&lt;/p&gt;

&lt;p&gt;Here's how we handled it: we immediately routed that vendor's invoices to the human review queue. Over the next few days, the review team processed invoices manually while we collected corrected examples. Once we had enough samples in the new format, we retrained and redeployed. Total time to full recovery: about a week.&lt;/p&gt;

&lt;p&gt;For smaller vendors, the calculus is different. If a low-volume vendor changes their format, we keep them in the human review queue until we've accumulated enough examples to justify a retraining cycle. Sometimes that's a few weeks. The volume doesn't warrant faster turnaround.&lt;/p&gt;

&lt;p&gt;The incident taught us that fine-tuning comes with operational overhead that RAG doesn't have. We accepted that trade-off for the accuracy and privacy gains, but it's a real trade-off. We now monitor vendor-level accuracy daily and have a documented process for handling template changes: immediate queue routing, sample collection, scheduled retraining.&lt;/p&gt;

&lt;p&gt;Production ML isn't about building a model. It's about building a system that stays accurate when the world changes around it. And the world always changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Mattered
&lt;/h2&gt;

&lt;p&gt;Six months in, I can point to five decisions that made the difference between a demo and a system that actually works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Moving from RAG to fine-tuning.&lt;/strong&gt; This was the big one. RAG hit a ceiling at 93%. Fine-tuning broke through to 99%. The accuracy gain justified the added operational complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using BioBERT for medical NER.&lt;/strong&gt; General models couldn't handle the terminology. Domain-specific pre-training wasn't a nice-to-have; it was essential.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-vendor validation from day one.&lt;/strong&gt; Standard train/val splits hid our generalization problem. Testing on held-out vendors would have caught it immediately.&lt;/p&gt;
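&lt;p&gt;scikit-learn makes this split a one-liner. A minimal sketch with hypothetical vendor IDs as the grouping key:&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical: 12 invoices from 4 vendors; the vendor ID is the group key.
X = np.arange(12).reshape(-1, 1)
y = np.zeros(12)
vendors = np.array(list("AAABBBCCCDDD"))

# Each fold validates on vendors the model never saw during training.
for train_idx, val_idx in GroupKFold(n_splits=4).split(X, y, groups=vendors):
    assert set(vendors[train_idx]).isdisjoint(vendors[val_idx])
```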

&lt;p&gt;&lt;strong&gt;Weighted loss for class imbalance.&lt;/strong&gt; Simple technique, big impact. Rare line item types went from unusable to acceptable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring vendor-level accuracy.&lt;/strong&gt; Aggregate metrics hide localized problems. When one vendor's template changed, we caught it within 24 hours instead of letting errors accumulate.&lt;/p&gt;
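&lt;p&gt;The per-vendor rollup is a few lines. A sketch with made-up vendors and an illustrative alert threshold:&lt;/p&gt;

```python
from collections import defaultdict

def vendor_accuracy(results):
    """results: iterable of (vendor, field_correct: bool) records for one day."""
    hits, totals = defaultdict(int), defaultdict(int)
    for vendor, correct in results:
        totals[vendor] += 1
        hits[vendor] += correct
    return {v: hits[v] / totals[v] for v in totals}

# One vendor degrades while the aggregate still looks acceptable.
daily = [("acme", True)] * 99 + [("acme", False)] \
      + [("medco", True)] * 91 + [("medco", False)] * 9
scores = vendor_accuracy(daily)
alerts = [v for v, acc in scores.items() if acc < 0.95]  # illustrative threshold
print(scores, alerts)  # medco dips to 0.91; aggregate across both is still 0.95
```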

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;After six months in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;99% field-level accuracy&lt;/strong&gt; across all invoice types (up from 91% with RAG)&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;73% reduction in processing time&lt;/strong&gt; - from 8+ minutes to under 2 minutes per invoice&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;77% of invoices fully automated&lt;/strong&gt; - no human touch required&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Minority class F1 improved from 0.68 to 0.81&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Zero data leaving our environment&lt;/strong&gt; - full healthcare compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The business impact: estimated &lt;strong&gt;$340K annual savings&lt;/strong&gt; in processing costs, and same-day invoice processing where there used to be a 3-5 day backlog.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;If I were starting this project again, three things would change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget more time for preprocessing.&lt;/strong&gt; We treated OCR as a solved problem. It isn't. Image quality issues ate a week we hadn't planned for, and preprocessing improvements drove more accuracy gains than model tuning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test on held-out groups immediately.&lt;/strong&gt; Not just held-out samples - held-out &lt;em&gt;groups&lt;/em&gt;. Vendors, time periods, whatever natural segmentation exists in your data. This catches generalization failures that random splits miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plan for template changes from day one.&lt;/strong&gt; With fine-tuning, vendor template changes require retraining. We built our monitoring and retraining pipeline after the first incident. It should have been part of the initial design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The model was maybe 30% of this project. The rest was preprocessing, validation logic, monitoring, and maintenance.&lt;/p&gt;

&lt;p&gt;If you're building document processing systems, the unsexy work - rotation correction, class-level metrics, vendor-specific monitoring, retraining pipelines - is what separates demos from production. The ML community focuses on model architecture. Production focuses on everything else.&lt;/p&gt;

&lt;p&gt;Fine-tuning gave us better accuracy than RAG ever could. It also gave us operational overhead that RAG didn't have. Every template change means collecting new data and retraining. For us, the trade-off was worth it - 99% accuracy versus 91%, plus data privacy and lower per-inference costs. But it's not a free lunch.&lt;/p&gt;

&lt;p&gt;The system's been running for six months now. Accuracy has held steady. The monitoring catches template drift quickly. The human review loop handles edge cases while we collect data for the next retraining cycle.&lt;/p&gt;

&lt;p&gt;It works. And that's the whole point.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm a Senior Data Scientist at Fractal Analytics, where I build LLM-powered automation systems for enterprise clients. Previously led ML teams at R Systems working on healthcare and content applications. Always happy to talk shop - connect with me on &lt;a href="https://linkedin.com/in/vaibhav-rathi-ai" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>python</category>
      <category>healthcare</category>
    </item>
  </channel>
</rss>
