Rayne Robinson
My AI Read a Receipt Wrong. It Didn't Misread It — It Made One Up.

I pointed my phone at a grocery receipt. The AI returned a store name, item list, and total.

None of it was real.

The store was wrong. The items were fabricated. The total didn't match anything on the paper. The model didn't misread the receipt — it hallucinated an entirely fictional one.

Same image, different model, five seconds later: every item correct, store name right, total accurate to the penny.

This is a story about vision models, receipts, and why I stopped paying for OCR APIs.

The Problem

I built an expense tracker that lets you scan receipts with your phone camera. Take a photo, the AI reads it, items get logged automatically. No typing.

The OCR industry wants you to pay for this. Google Cloud Vision: $1.50 per 1,000 pages. AWS Textract: $1.50 per 1,000 pages. Azure Document Intelligence: $1 per 1,000 pages. For a personal expense tracker, that's a small number — but it's not zero, it's not local, and your receipt data leaves your machine.

I wanted zero cost, zero cloud, and zero fabrication. I got two out of three on the first try.

The Hallucination

The first vision model I tested was minicpm-v (8B parameters, open source, runs on consumer hardware). I fed it a grocery receipt. Clear photo, good lighting, standard layout.

It returned:

  • A store name that wasn't on the receipt
  • Items I didn't buy
  • Prices that didn't exist on the paper

This wasn't a misread. The model didn't confuse a "7" for a "1" or merge two line items. It generated a plausible-looking receipt from scratch. If I hadn't been holding the original, I might not have caught it.

This is the failure mode nobody warns you about with vision models. Everyone talks about OCR accuracy — character error rates, line detection, skew correction. Nobody talks about the model skipping the image entirely and writing fiction.

The Fix

I swapped to qwen3-vl (8B parameters, same size, same hardware requirements). Same receipt, same prompt:

"List every item and price on this receipt. One per line, format: ITEM - $X.XX. End with subtotal, tax, and total. Also state the store name, date, and payment method."

Every item correct. Store name right. Date right. Total matched. The difference wasn't the prompt, the image quality, or the hardware. It was the model's actual ability to read what's in front of it versus inventing what should be there.

Both models fit in ~6GB of VRAM. Both are open source. Both run locally. The gap between them is a canyon.

The Pipeline

Here's what runs when you point your phone at a receipt:

Stage 1 — Image prep. The photo gets auto-rotated (EXIF data), resized to max 1280px portrait, compressed to JPEG 85%. Phone cameras over-deliver on resolution. The model doesn't need 4000x3000 pixels to read a receipt.
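The prep stage can be sketched with Pillow (a minimal sketch: the 1280px cap and JPEG 85% come from the description above; the function name and everything else is an assumption):

```python
from io import BytesIO

from PIL import Image, ImageOps


def prep_receipt_image(raw: bytes, max_side: int = 1280, quality: int = 85) -> bytes:
    """Auto-rotate via EXIF, cap the longest side, re-encode as JPEG."""
    img = Image.open(BytesIO(raw))
    img = ImageOps.exif_transpose(img)   # honor the camera's orientation tag
    img.thumbnail((max_side, max_side))  # shrink in place, aspect ratio preserved
    out = BytesIO()
    img.convert("RGB").save(out, format="JPEG", quality=quality)
    return out.getvalue()
```

`thumbnail` never upscales, so a photo already under 1280px passes through at its original size.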

Stage 2 — Vision model reads the receipt. qwen3-vl:8b extracts store name, date, items with prices, subtotal, tax, total, and payment method. One API call. Structured text output.
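A minimal sketch of that call against Ollama's default local endpoint. The prompt is the one quoted above; `build_payload` and `read_receipt` are hypothetical names, not the project's actual code:

```python
import base64
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

PROMPT = (
    "List every item and price on this receipt. One per line, format: "
    "ITEM - $X.XX. End with subtotal, tax, and total. Also state the "
    "store name, date, and payment method."
)


def build_payload(image_bytes: bytes, model: str = "qwen3-vl:8b") -> dict:
    """Ollama's generate endpoint takes base64-encoded images with the prompt."""
    return {
        "model": model,
        "prompt": PROMPT,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def read_receipt(image_bytes: bytes) -> str:
    """One API call in, structured text out."""
    req = request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(image_bytes)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```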

Stage 3 — Item parsing. Python regex extracts individual items from the raw text. Receipt metadata (subtotal lines, tax lines, coupon lines) gets filtered. Each item gets a name and a price.
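One way to sketch the parsing stage. The `ITEM - $X.XX` line format comes from the prompt; the exact regex and the metadata keyword list are assumptions:

```python
import re

# Matches lines like "MILK 2% - $3.49"
ITEM_RE = re.compile(r"^(?P<name>.+?)\s*-\s*\$(?P<price>\d+\.\d{2})\s*$")

# Lines that are receipt metadata, not purchased items
METADATA = ("subtotal", "tax", "total", "coupon", "change", "cash", "card")


def parse_items(raw: str) -> list[tuple[str, float]]:
    """Extract (name, price) pairs, filtering out subtotal/tax/coupon lines."""
    items = []
    for line in raw.splitlines():
        m = ITEM_RE.match(line.strip())
        if not m:
            continue
        name = m.group("name")
        if name.lower().startswith(METADATA):
            continue  # metadata line, skip it
        items.append((name, float(m.group("price"))))
    return items
```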

Stage 4 — Confidence scoring. The system checks: Did we get items? A total? A store name? A date? Do the items sum to roughly the total? If everything reconciles, auto-save. If something's missing or the math doesn't add up, flag it for human review.

Stage 5 — Category inference. A text model (qwen3:8b) categorizes each item — groceries, household, personal care — based on the item name and the store context. A bottle of wine from Safeway is "Groceries." A bottle of wine from Total Wine is "Alcohol."
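A sketch of how the categorization prompt might carry store context alongside the item name, which is what lets the same item land in different categories. The category list and wording are assumptions:

```python
def category_prompt(item: str, store: str) -> str:
    """Build a categorization prompt for the text model. Including the
    store disambiguates: wine from a grocery store vs. a liquor store."""
    return (
        "Categorize this purchase as one of: Groceries, Household, "
        "Personal Care, Alcohol, Other.\n"
        f"Store: {store}\n"
        f"Item: {item}\n"
        "Reply with the category name only."
    )
```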

No cloud API. No subscription. No receipt data leaving the machine.

The Fallback Chain

Local AI isn't always available. The laptop sleeps. The GPU is running something else. So the system has a fallback chain:

  1. Local GPU (primary) — qwen3-vl:8b on the laptop's RTX 5080. Fastest option. Free. ~5 seconds per receipt.
  2. Claude Haiku API (fallback) — Cloud vision when local isn't reachable. ~$0.01 per receipt. 10-30 seconds.

The key: the fallback is a safety net, not the primary path. 95% of receipts hit the local GPU. The API exists for the 5% when I'm scanning from my phone and the laptop is off.

This is the same dual-model pattern from Part 1 of this series — local handles the volume, cloud handles the exceptions.
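The chain can be sketched as a try-local-then-cloud dispatcher. `local_read` and `cloud_read` are hypothetical callables standing in for the two paths; injecting them keeps either one swappable:

```python
def read_with_fallback(image_bytes: bytes, local_read, cloud_read) -> tuple[str, str]:
    """Try the local GPU first; fall back to the cloud API only when the
    local endpoint is unreachable. Returns (text, path_used)."""
    try:
        return local_read(image_bytes), "local"
    except (ConnectionError, TimeoutError, OSError):
        # laptop asleep, GPU busy, or the local server isn't running
        return cloud_read(image_bytes), "cloud"
```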

The Date Guard

One more thing the hallucination problem taught me: don't trust any single field from OCR output.

The system runs a 90-day sanity check on dates. If the vision model returns a date more than 90 days from today, it's probably wrong — either hallucinated or misread. The system falls back to today's date and flags it.

This sounds paranoid. It caught three bad dates in the first week.

The Numbers

| | Cloud OCR API | Local Vision |
|---|---|---|
| Cost per receipt | $0.001-0.002 | $0.00 |
| Cost per 1,000 receipts | $1.00-2.00 | $0.00 |
| Data leaves your machine | Yes | No |
| Requires internet | Yes | No |
| Speed | 2-5 seconds | ~5 seconds |
| Works offline | No | Yes |
| VRAM required | None (cloud) | ~6GB |

The cost difference is small in absolute terms. The privacy difference is not. Every receipt you scan through a cloud API is a data point about where you shop, what you buy, and how much you spend. For a personal finance tool, that's the wrong tradeoff.

What I Learned

Vision model selection matters more than prompt engineering. I used the same prompt for both models. One fabricated a receipt. One read it perfectly. No amount of prompt optimization fixes a model that invents data.

Confidence scoring is mandatory, not optional. Auto-saving OCR output without verification is asking for garbage in your database. The reconciliation check (do items sum to total?) catches errors the model doesn't flag.

The two-stage pattern works for vision too. Vision model reads. Text model categorizes. Same separation of concerns from the dual-model architecture, applied to a different domain.

Hallucination in vision models is qualitatively different from text hallucination. A text model that hallucinates gives you a wrong answer to a real question. A vision model that hallucinates gives you a confident answer to an image it didn't read. The second is harder to detect because the output looks right.


This is Part 4 of my Local AI Architecture series. Part 1 covered dual-model orchestration. Part 2 covered cognitive memory. Part 3 covered context architecture. Next up: what happens when you give a local AI access to push notifications.

I build zero-cost AI tools on consumer hardware. The factory runs on Docker, Ollama, and one GPU. The tools it produces run on nothing.
