Alessandro Binda
AI-Powered Document OCR for Business: Moving Beyond Simple Text Extraction

OCR (Optical Character Recognition) has been a solved problem for simple printed text since the 1990s. Tesseract can handle clean, high-contrast typed documents reliably. The interesting engineering challenges now are in the hard cases: handwritten documents, degraded historical records, complex multi-column layouts, tables within scanned PDFs, and mixed-format documents that combine printed and handwritten text.

This post covers the technical architecture we use in production for business document processing, the tradeoffs we made, and what we learned from real-world deployment.

The Business Context

Before diving into the technical stack, the business context matters because it shapes the technical requirements.

In legal and notarial workflows in Italy, documents include:

  • Rogiti (notarial deeds): Often handwritten or typed on old typewriters, with dense legal language, abbreviations, and format conventions specific to the period and the notary
  • Catastali (land registry documents): Scanned forms with structured data fields, stamps, and handwritten annotations
  • Contracts and leases: Modern typed documents, often scanned rather than digitally created
  • Invoices and financial documents: Varying formats across different periods and issuers

The accuracy requirement is high — for legal and financial documents, extraction errors can have real consequences. And the volume can be significant — a property transaction generates dozens of documents.

The Three-Tier Processing Chain

We settled on a three-tier approach based on document type classification:

Tier 1: Modern Typed Documents → Tesseract

For clean, modern typed documents (invoices, contracts generated in the last 20 years, forms), Tesseract 5.x with appropriate preprocessing handles the job adequately. Preprocessing matters enormously:

import cv2
import numpy as np

def preprocess_for_tesseract(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Deskew: estimate the skew angle from the ink pixels.
    # Invert-threshold first so only text pixels are nonzero; on the raw
    # grayscale image nearly every background pixel is > 0, which makes
    # the minAreaRect angle estimate meaningless.
    inverted = cv2.threshold(gray, 0, 255,
                             cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(inverted > 0))
    angle = cv2.minAreaRect(coords)[-1]
    # Normalize to the smallest correcting rotation; note that OpenCV's
    # angle convention changed in 4.5 ([-90, 0) before, [0, 90) after)
    if angle < -45:
        angle = -(90 + angle)
    elif angle > 45:
        angle = 90 - angle
    else:
        angle = -angle
    (h, w) = gray.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    gray = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)

    # Denoise and threshold
    gray = cv2.GaussianBlur(gray, (3, 3), 0)
    _, thresh = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    return thresh

Deskewing, denoising, and thresholding together take Tesseract accuracy from 80-85% to 92-96% on typical business documents.
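The routing logic later in this post keys off a page-level Tesseract confidence score. Tesseract only reports per-word confidences (for example via `pytesseract.image_to_data`, which emits TSV), so that score has to be derived. A minimal sketch, using only the standard library to parse the TSV text — the simple unweighted mean here is our choice, not anything Tesseract prescribes:

```python
import csv
import io

def page_confidence(tsv_text: str) -> float:
    """Mean word confidence (0.0-1.0) from Tesseract TSV output.

    Tesseract reports per-word confidence in the `conf` column of
    `pytesseract.image_to_data(img)`; rows with conf == -1 are layout
    elements (pages, blocks, lines), not words, and are skipped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    confs = [float(row["conf"]) for row in reader
             if row.get("conf") not in (None, "", "-1")]
    if not confs:
        return 0.0
    return sum(confs) / len(confs) / 100.0

# Example with a trimmed TSV (real output has more columns)
sample = "level\tconf\ttext\n5\t96\tFattura\n5\t91\tn.\n2\t-1\t\n"
print(round(page_confidence(sample), 3))  # 0.935
```

A length-weighted mean (long words with high confidence count for more) is a reasonable refinement once you have ground-truth data to calibrate against.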

Tier 2: Handwritten and Degraded Documents → Mistral Pixtral

For handwritten text and degraded historical documents, Tesseract fails badly. Character recognition on cursive handwriting requires a vision model that understands contextual letter patterns, not pixel-level character matching.

We use Mistral's Pixtral model (vision-capable) for this tier. The prompt is structured carefully:

const prompt = `
Extract all text from this document image.
- Preserve the original formatting and structure
- Mark uncertain readings with [?]
- For digits, pay special attention to commonly confused pairs: 1 vs 7, 0 vs 6, 5 vs 3
- Output as structured JSON with: full_text, tables (if any), key_values (detected form fields)
- Language: Italian legal document (may contain Latin phrases)
`;

// base64Image must be a data URI, e.g. "data:image/jpeg;base64,..."
const response = await mistral.chat.complete({
  model: "pixtral-12b-2409",
  messages: [{
    role: "user",
    content: [
      { type: "image_url", imageUrl: { url: base64Image } },
      { type: "text", text: prompt }
    ]
  }]
});

Pixtral handles Italian handwriting with accuracy in the 85-90% range on our test set — significantly better than any traditional OCR approach, and good enough for the majority of documents.
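Unlike Tesseract, a vision model does not return per-word confidences, so the `pixtralResult.confidence` used in the routing code below has to be synthesized. One heuristic we can sketch (this proxy is our construction, not part of Mistral's API): since the prompt asks the model to flag uncertain readings with `[?]`, count those flags relative to the word count:

```python
import re

def vision_ocr_confidence(full_text: str) -> float:
    """Heuristic confidence for vision-model OCR output.

    The extraction prompt asks the model to mark uncertain readings
    with [?]; the more markers per word, the lower the score. This is
    a rough proxy, not a calibrated probability.
    """
    words = full_text.split()
    if not words:
        return 0.0
    uncertain = len(re.findall(r"\[\?\]", full_text))
    return max(0.0, 1.0 - uncertain / len(words))

text = "Il sottoscritto [?] dichiara di aver ricevuto la somma di lire 1.500[?]"
print(round(vision_ocr_confidence(text), 2))  # 0.83
```

The obvious failure mode is a model that is confidently wrong and emits no markers at all, which is why human spot-checks remain in the loop.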

Tier 3: Fallback → Gemini Vision

For documents that fail quality checks after Pixtral processing (confidence below threshold, or manual review flagged), we fall back to Gemini Vision as an emergency option. This is rarely used but ensures no document is completely unprocessable.

The Fallback Decision Logic

async function processDocument(imagePath, documentType) {
  // Route by document type
  if (documentType === 'modern_typed') {
    const result = await runTesseract(imagePath);
    if (result.confidence > 0.90) return result;
    // Fall through to vision model if quality is low
  }

  // Vision model path (handwritten or low-confidence typed)
  const pixtralResult = await runPixtral(imagePath);
  if (pixtralResult.confidence > 0.75) return pixtralResult;

  // Emergency fallback
  console.warn('Low confidence — escalating to Gemini');
  return await runGemini(imagePath);
}

Structured Extraction vs. Raw Text

Raw text extraction is not the end goal for business documents — structured data extraction is. After OCR, we run a second LLM pass specifically for field extraction:

For an invoice: extract invoice_number, date, vendor_name, vendor_vat, line_items[], total, currency.
For a land deed: extract property_cadastral_ref, parties[], notary_name, date, transaction_type, consideration_amount.

This two-stage approach (OCR → raw text, then LLM → structured data) is more reliable than trying to do structured extraction directly from the image in one pass. The OCR stage normalizes the text; the extraction stage applies business logic.
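Whatever model performs the second-stage extraction, its JSON should be validated before it reaches downstream systems. A minimal sketch for the invoice fields listed above — the field names come from this post, but the specific checks (and the `validate_invoice` helper itself) are illustrative:

```python
import json
import re

REQUIRED_INVOICE_FIELDS = {"invoice_number", "date", "vendor_name",
                           "vendor_vat", "line_items", "total", "currency"}

def validate_invoice(raw_json: str) -> dict:
    """Validate the extraction LLM's output for an invoice.

    Raises ValueError on missing fields or an implausible VAT number,
    so bad extractions get flagged for human review instead of
    silently entering downstream automation.
    """
    data = json.loads(raw_json)
    missing = REQUIRED_INVOICE_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Italian VAT numbers (partita IVA) are 11 digits
    if not re.fullmatch(r"\d{11}", str(data["vendor_vat"])):
        raise ValueError(f"implausible VAT number: {data['vendor_vat']}")
    data["total"] = float(data["total"])
    return data

doc = json.dumps({
    "invoice_number": "2024/0133", "date": "2024-03-15",
    "vendor_name": "Studio Rossi", "vendor_vat": "01234567890",
    "line_items": [], "total": "1220.00", "currency": "EUR",
})
print(validate_invoice(doc)["total"])  # 1220.0
```

In production a schema library (Pydantic, jsonschema) does this more maintainably, but the principle is the same: treat LLM output as untrusted input.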

Production Deployment Considerations

Queue management: Vision model calls are expensive in both latency and cost. We use a priority queue with configurable concurrency limits — in our implementation, a maximum of 3 concurrent OCR jobs to avoid overwhelming the API and triggering rate limits.

Cost management: Pixtral on Mistral's free tier covers a limited number of pages per day. Monitor usage carefully and implement caching for documents that have already been processed.
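Caching pays off because the same scan is frequently re-uploaded under a different filename. Keying on a hash of the file bytes catches those duplicates; a minimal in-memory sketch (production would back `_cache` with Redis or disk):

```python
import hashlib
import tempfile
from pathlib import Path

_cache: dict = {}  # content hash -> OCR result

def cached_ocr(path: Path, run_ocr) -> str:
    """Skip the paid OCR call when the exact same file was seen before.

    Keyed on a SHA-256 of the file bytes, so renamed or re-uploaded
    copies of an already-processed document hit the cache.
    """
    key = hashlib.sha256(path.read_bytes()).hexdigest()
    if key not in _cache:
        _cache[key] = run_ocr(path)
    return _cache[key]

# Demo with a fake OCR backend that counts its invocations
calls = []
def fake_ocr(p: Path) -> str:
    calls.append(p)
    return "extracted text"

with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as f:
    f.write(b"fake scan bytes")
scan = Path(f.name)

print(cached_ocr(scan, fake_ocr), len(calls))  # extracted text 1
print(cached_ocr(scan, fake_ocr), len(calls))  # extracted text 1
```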

Accuracy monitoring: Build a feedback loop. When a human corrects an extraction result, capture that correction. Over time, you accumulate a dataset for fine-tuning or prompt improvement.

Output formats: Export processed documents both as PDF (for human review) and as structured JSON (for downstream automation). The S.C.A.L.A. platform handles both formats in the OCR module, with a download interface for processed documents.

What We Would Do Differently

The main thing we underinvested in early on was the document classification step — correctly routing documents to the right processing tier before attempting OCR. A modern typed invoice processed through the vision model pipeline is slower and more expensive than necessary. A handwritten deed pushed through Tesseract produces garbage.

Invest in the classifier early. Even a simple classifier trained on a few hundred labeled examples will save significant processing cost and improve overall accuracy.
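Even before you have labeled data, a cheap heuristic router gets you most of the way: run a fast, low-resolution Tesseract probe and use its word confidence as a typed-vs-handwritten signal. A minimal sketch of that routing decision — the thresholds and tier names here are illustrative, not tuned values:

```python
def classify_document(mean_word_conf: float, word_count: int) -> str:
    """Route a document to a processing tier from a cheap Tesseract probe.

    A fast low-resolution Tesseract pass is nearly free; if it reads
    the page confidently, the page is almost certainly modern typed
    text. Thresholds are illustrative and should be tuned on real data.
    """
    if word_count < 5:
        return "vision_model"            # blank-ish or unreadable page
    if mean_word_conf >= 0.85:
        return "tesseract"               # Tier 1: modern typed
    if mean_word_conf >= 0.50:
        return "tesseract_then_vision"   # ambiguous: try Tier 1, fall through
    return "vision_model"                # Tier 2: handwritten or degraded

print(classify_document(0.93, 420))  # tesseract
print(classify_document(0.21, 180))  # vision_model
```

Once a few hundred labeled examples have accumulated, replacing this with a small trained classifier over richer features (layout, ink density, stamp detection) is a natural upgrade.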


The S.C.A.L.A. platform includes AI-powered document OCR for business documents as part of the core toolset. Details at get-scala.com.
