DEV Community

monkeymore studio
Building a Browser-Based AI OCR Tool with Multiple Engines

Introduction

In this article, we'll explore how to implement a browser-based OCR (Optical Character Recognition) tool that supports multiple OCR engines. The tool extracts text from images entirely in the browser, handles both English and Chinese text, and offers two engine options: the lightweight Tesseract.js and the Chinese-optimized deep-learning models of PP-OCRv5.

Why Browser-Based OCR?

1. Privacy Protection

When users process OCR in the browser, their images never leave their device. This is essential for:

  • Business documents containing sensitive information
  • Personal photos with private text
  • Medical records or legal documents

2. Zero Server Costs

Running OCR in the browser eliminates the need for:

  • GPU servers for deep learning inference
  • Bandwidth for uploading/downloading images
  • API costs for third-party OCR services

3. Offline Capability

Once the models are loaded, users can process images without an internet connection: Tesseract.js caches its language data after the first run, and PP-OCRv5 works offline once its ONNX models are cached.

Technical Architecture

Core Implementation

1. Data Structures

interface OCRResult {
  text: string;
  confidence: number;
  box?: number[][];  // Bounding box coordinates
}

interface ImageFile {
  id: string;
  file: File;
  previewUrl: string;
  results?: OCRResult[];
  processing?: boolean;
  error?: string;
}

type OCREngine = "tesseract" | "ppocrv5";

// PP-OCRv5 model URLs from GitHub
const GITHUB_BASE_URL = "https://raw.githubusercontent.com/linmingren/openmodels/main/models/ppocrv5";
const MODEL_URLS = {
  det: `${GITHUB_BASE_URL}/ch_PP-OCRv5_mobile_det.onnx`,
  rec: `${GITHUB_BASE_URL}/ch_PP-OCRv5_rec_mobile_infer.onnx`,
};

2. Engine Selection & Loading

The tool supports two OCR engines:

const [engine, setEngine] = useState<OCREngine>("tesseract");
const [modelsLoaded, setModelsLoaded] = useState(false);
const [charDict, setCharDict] = useState<string[]>([]);

const sessionsRef = useRef<{
  det: ort.InferenceSession | null;
  rec: ort.InferenceSession | null;
}>({ det: null, rec: null });

3. Loading Tesseract.js Engine

Tesseract.js is lightweight and works out of the box:

// Tesseract.js loads automatically when selected
useEffect(() => {
  if (engine === "tesseract") {
    setTesseractLoaded(true);
    setLoadingModels(false);
  }
}, [engine]);

// Using Tesseract.js for OCR
const performOCRTesseract = async (imageFile: ImageFile): Promise<OCRResult[]> => {
  const result = await Tesseract.recognize(
    imageFile.file,
    'chi_sim+eng',  // Chinese Simplified + English
    {
      logger: (m) => {
        console.log(`[Tesseract] ${m.status}: ${Math.round(m.progress * 100)}%`);
      }
    }
  );

  const results: OCRResult[] = [];

  if (result.data.text) {
    results.push({
      text: result.data.text.trim(),
      confidence: (result.data.confidence || 90) / 100,
    });
  }

  return results;
};

4. Loading PP-OCRv5 Models

PP-OCRv5 is a more powerful Chinese-optimized OCR system:

// Load character dictionary
useEffect(() => {
  async function loadDict() {
    let dict: string[] = [];

    // Try local first, then remote
    try {
      const response = await fetch(LOCAL_DICT_PATH);
      if (response.ok) {
        const text = await response.text();
        dict = text.split('\n')
          .map(line => line.replace(/\r$/, ''))
          .filter(line => line.length > 0);
        dict = ['blank', ...dict];  // Add blank for CTC
      }
    } catch (err) {
      console.log('Trying remote dictionary...');
    }

    // Remote fallback
    if (dict.length === 0) {
      const response = await fetch(DICT_URL);
      const text = await response.text();
      dict = text.split('\n')
        .map(line => line.replace(/\r$/, ''))
        .filter(line => line.length > 0);
      dict = ['blank', ...dict];
    }

    setCharDict(dict);
  }
  loadDict();
}, [engine]);

// Load ONNX models
async function loadModels() {
  // Ensure ONNX Runtime is loaded
  if (!window.ort) {
    const script = document.createElement("script");
    script.src = "https://cdn.jsdelivr.net/npm/onnxruntime-web@1.20.1/dist/ort.min.js";
    document.head.appendChild(script);
    await new Promise((resolve) => { script.onload = resolve; });
  }

  // Configure for optimal performance
  ort.env.wasm.numThreads = 4;
  ort.env.wasm.simd = true;

  // Load detection model
  const detResponse = await fetch(MODEL_URLS.det);
  const detBuffer = await detResponse.arrayBuffer();
  const detSession = await ort.InferenceSession.create(new Uint8Array(detBuffer), {
    executionProviders: ['wasm'],
    graphOptimizationLevel: 'all'
  });
  sessionsRef.current.det = detSession;

  // Load recognition model
  const recResponse = await fetch(MODEL_URLS.rec);
  const recBuffer = await recResponse.arrayBuffer();
  const recSession = await ort.InferenceSession.create(new Uint8Array(recBuffer), {
    executionProviders: ['wasm'],
    graphOptimizationLevel: 'all'
  });
  sessionsRef.current.rec = recSession;

  setModelsLoaded(true);
}

5. Image Preprocessing for PP-OCRv5

// Normalize image for model input
const normalizeImage = (imageData: ImageData, mean: number[], std: number[]): Float32Array => {
  const { data, width, height } = imageData;
  const floatData = new Float32Array(3 * width * height);

  // ImageNet normalization
  for (let i = 0; i < height * width; i++) {
    const r = data[i * 4] / 255.0;
    const g = data[i * 4 + 1] / 255.0;
    const b = data[i * 4 + 2] / 255.0;

    floatData[i] = (r - mean[0]) / std[0];
    floatData[i + height * width] = (g - mean[1]) / std[1];
    floatData[i + 2 * height * width] = (b - mean[2]) / std[2];
  }

  return floatData;
};
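To see the planar CHW layout the model expects, here is the normalization run on a tiny two-pixel image. This is a sketch for illustration: `ImageData` is replaced with a plain object (our `PixelData` type) so it runs outside the browser.

```typescript
// normalizeImage restated so the example is self-contained;
// ImageData is mocked with a plain object for use outside the browser.
type PixelData = { data: Uint8ClampedArray; width: number; height: number };

const normalizeImage = (imageData: PixelData, mean: number[], std: number[]): Float32Array => {
  const { data, width, height } = imageData;
  const floatData = new Float32Array(3 * width * height);

  for (let i = 0; i < height * width; i++) {
    const r = data[i * 4] / 255.0;
    const g = data[i * 4 + 1] / 255.0;
    const b = data[i * 4 + 2] / 255.0;

    // Planar CHW layout: all R values first, then all G, then all B
    floatData[i] = (r - mean[0]) / std[0];
    floatData[i + height * width] = (g - mean[1]) / std[1];
    floatData[i + 2 * height * width] = (b - mean[2]) / std[2];
  }

  return floatData;
};

// A 2x1 "image": one red pixel, one green pixel (RGBA bytes)
const tiny: PixelData = {
  data: new Uint8ClampedArray([255, 0, 0, 255, 0, 255, 0, 255]),
  width: 2,
  height: 1,
};

// With mean = std = 0.5, every channel maps into [-1, 1]
const out = normalizeImage(tiny, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]);
console.log(Array.from(out));  // [1, -1, -1, 1, -1, -1]
```

Note how the interleaved RGBA input becomes three contiguous channel planes, which is what the `[1, 3, height, width]` tensor shape used below requires.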

6. Text Detection (DBNet)

PP-OCRv5 uses DBNet for text detection:

// DBNet post-processing
const postprocessDetection = (output: ort.Tensor, threshold = 0.3): number[][][] => {
  const data = output.data as Float32Array;
  const dims = output.dims;
  const height = dims[2] as number;
  const width = dims[3] as number;

  // Probability map
  const probMap = new Float32Array(height * width);
  for (let i = 0; i < height * width; i++) {
    probMap[i] = data[i];
  }

  // Binary map
  const binaryMap = new Uint8Array(height * width);
  for (let i = 0; i < height * width; i++) {
    binaryMap[i] = probMap[i] > threshold ? 1 : 0;
  }

  // Connected component analysis
  const boxes: number[][][] = [];
  const visited = new Uint8Array(height * width);

  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const idx = y * width + x;
      if (binaryMap[idx] === 1 && !visited[idx]) {
        const points: number[][] = [];
        const queue: number[][] = [[x, y]];
        visited[idx] = 1;

        // BFS for connected region
        while (queue.length > 0) {
          const [cx, cy] = queue.shift()!;
          points.push([cx, cy]);

          // Check 8 directions
          for (const [dx, dy] of [[-1,0],[1,0],[0,-1],[0,1],[-1,-1],[-1,1],[1,-1],[1,1]]) {
            const nx = cx + dx;
            const ny = cy + dy;

            // Bounds check before computing the flat index
            if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
              const nIdx = ny * width + nx;
              if (binaryMap[nIdx] === 1 && !visited[nIdx]) {
                visited[nIdx] = 1;
                queue.push([nx, ny]);
              }
            }
          }
        }

        // Filter small regions and create bounding box
        if (points.length > 10) {
          const minX = Math.min(...points.map(p => p[0]));
          const maxX = Math.max(...points.map(p => p[0]));
          const minY = Math.min(...points.map(p => p[1]));
          const maxY = Math.max(...points.map(p => p[1]));

          boxes.push([
            [minX, minY], [maxX, minY], [maxX, maxY], [minX, maxY]
          ]);
        }
      }
    }
  }

  return boxes;
};

7. Text Recognition with CTC Decoding

// CTC (Connectionist Temporal Classification) decoding
const ctcDecodeIndices = (indices: number[], charDict: string[]): string => {
  const blankIdx = 0;
  let result = '';
  let lastIdx = -1;

  for (let i = 0; i < indices.length; i++) {
    const idx = indices[i];

    // Blank (CTC blank token): reset lastIdx so a character that is
    // legitimately repeated across a blank is not collapsed, then skip
    if (idx === blankIdx) {
      lastIdx = -1;
      continue;
    }

    // Skip consecutive duplicates
    if (idx === lastIdx) continue;

    // Map index to character
    if (idx >= 1 && idx < charDict.length) {
      result += charDict[idx];
    }

    lastIdx = idx;
  }

  return result;
};

// Complete PP-OCRv5 recognition
const performOCRPPOCR = async (imageFile: ImageFile): Promise<OCRResult[]> => {
  // 1. Text Detection
  const img = await createImageBitmap(imageFile.file);
  const canvas = document.createElement('canvas');
  canvas.width = img.width;
  canvas.height = img.height;
  canvas.getContext('2d')!.drawImage(img, 0, 0);

  // Resize for detection model (multiple of 32)
  const detInputSize = 960;
  const scale = Math.min(detInputSize / img.width, detInputSize / img.height);
  let newWidth = Math.ceil(Math.round(img.width * scale) / 32) * 32;
  let newHeight = Math.ceil(Math.round(img.height * scale) / 32) * 32;

  // Run detection. detCtx is the 2D context of a second canvas that the
  // source image has been drawn into at newWidth x newHeight (resize code
  // elided here); detInputName / detOutputName come from the session's
  // inputNames / outputNames.
  const detImageData = detCtx.getImageData(0, 0, newWidth, newHeight);
  const detInput = normalizeImage(detImageData, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]);
  const detTensor = new ort.Tensor('float32', detInput, [1, 3, newHeight, newWidth]);
  const detResults = await sessionsRef.current.det!.run({ [detInputName]: detTensor });
  const boxes = postprocessDetection(detResults[detOutputName]);

  // 2. Text Recognition for each box
  const results: OCRResult[] = [];

  for (const box of boxes.slice(0, 20)) {  // Limit to 20 boxes
    // Crop text region
    const minX = Math.min(...box.map(p => p[0]));
    const maxX = Math.max(...box.map(p => p[0]));
    const minY = Math.min(...box.map(p => p[1]));
    const maxY = Math.max(...box.map(p => p[1]));

    // Resize to 48px height (multiple of 8)
    let recHeight = 48;
    let recWidth = Math.min(320, Math.round((maxX - minX) * recHeight / (maxY - minY)));
    recWidth = Math.ceil(recWidth / 8) * 8;
    recHeight = Math.ceil(recHeight / 8) * 8;

    // Run recognition. recCtx is the 2D context of a canvas holding the
    // cropped region resized to recWidth x recHeight (crop code elided).
    const recImageData = recCtx.getImageData(0, 0, recWidth, recHeight);
    const recInput = normalizeImage(recImageData, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]);
    const recTensor = new ort.Tensor('float32', recInput, [1, 3, recHeight, recWidth]);
    const recResults = await sessionsRef.current.rec!.run({ [recInputName]: recTensor });

    // Take the argmax over the character axis at each time step, then decode
    const indices = getMaxIndices(recResults[recOutputName]);
    const text = ctcDecodeIndices(indices, charDict);

    if (text.trim()) {
      results.push({
        text,
        confidence: 0.95,
        box: box.map(p => [p[0] / scale, p[1] / scale])
      });
    }
  }

  // Sort by vertical position
  return results.sort((a, b) => {
    const ay = Math.min(...(a.box || [[0,0]]).map(p => p[1]));
    const by = Math.min(...(b.box || [[0,0]]).map(p => p[1]));
    return ay - by;
  });
};
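Both models constrain their input dimensions: the detector expects multiples of 32, the recognizer multiples of 8. The inline `Math.ceil(... / stride) * stride` expressions above can be factored into a small helper (the helper name is ours, not from the original code):

```typescript
// Hypothetical helper: round a length up to the model's stride
// (32 for the DBNet detector, 8 for the recognizer), matching the
// inline Math.ceil(value / stride) * stride expressions in the article.
const roundUpToStride = (value: number, stride: number): number =>
  Math.ceil(value / stride) * stride;

console.log(roundUpToStride(950, 32));  // 960
console.log(roundUpToStride(48, 8));    // 48
```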

8. Engine Selection UI

<div className="space-y-3">
  <label className="flex items-start gap-3 p-3 border rounded-lg cursor-pointer hover:bg-gray-50">
    <input
      type="radio"
      name="engine"
      value="tesseract"
      checked={engine === "tesseract"}
      onChange={(e) => setEngine(e.target.value as OCREngine)}
    />
    <div className="flex-1">
      <span className="font-medium block">
        {t.tesseractDefault || "Tesseract.js (Default)"}
      </span>
      <span className="text-sm text-gray-500">
        {t.tesseractDesc || "Fast and lightweight, good for general text recognition"}
      </span>
    </div>
  </label>
  <label className="flex items-start gap-3 p-3 border rounded-lg cursor-pointer hover:bg-gray-50">
    <input
      type="radio"
      name="engine"
      value="ppocrv5"
      checked={engine === "ppocrv5"}
      onChange={(e) => setEngine(e.target.value as OCREngine)}
    />
    <div className="flex-1">
      <span className="font-medium block">
        {t.ppocrv5Option || "PP-OCRv5 (Chinese optimized)"}
      </span>
      <span className="text-sm text-gray-500">
        {t.ppocrv5Desc || "Better for Chinese text recognition, requires downloading large models"}
      </span>
    </div>
  </label>
</div>

Processing Flow
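The end-to-end flow ties the pieces together: for each queued image, dispatch to the selected engine's handler and collect results or errors per image. A minimal sketch of that loop, with the handlers passed in so the flow can be shown in isolation (the `runOCR` name and generic signature are our illustration; the handlers mirror `performOCRTesseract` / `performOCRPPOCR` above):

```typescript
type OCRResult = { text: string; confidence: number; box?: number[][] };
type OCREngine = "tesseract" | "ppocrv5";
type EngineHandler<T> = (image: T) => Promise<OCRResult[]>;

// Process a batch of images with the currently selected engine.
async function runOCR<T>(
  images: T[],
  engine: OCREngine,
  handlers: Record<OCREngine, EngineHandler<T>>
): Promise<{ image: T; results?: OCRResult[]; error?: string }[]> {
  const out: { image: T; results?: OCRResult[]; error?: string }[] = [];

  for (const image of images) {
    try {
      // Dispatch to the selected engine's handler
      const results = await handlers[engine](image);
      out.push({ image, results });
    } catch (err) {
      // A failure on one image should not abort the whole batch
      out.push({ image, error: String(err) });
    }
  }

  return out;
}

// Usage with stub handlers standing in for the real engines:
const handlers = {
  tesseract: async (name: string) => [{ text: `tess:${name}`, confidence: 0.9 }],
  ppocrv5: async (name: string) => [{ text: `ppocr:${name}`, confidence: 0.95 }],
};

runOCR(["a.png", "b.png"], "tesseract", handlers).then((r) =>
  console.log(r.map((x) => x.results?.[0].text))  // ["tess:a.png", "tess:b.png"]
);
```

Keeping the loop engine-agnostic like this is what lets the radio buttons in the UI switch engines without touching the batch-processing code.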

OCR Engine Comparison

| Feature         | Tesseract.js          | PP-OCRv5              |
| --------------- | --------------------- | --------------------- |
| Model Size      | ~20MB (auto-download) | ~20MB (det + rec)     |
| Chinese Support | Good                  | Excellent             |
| Speed           | Fast                  | Medium                |
| Offline         | Yes (after cache)     | Requires model cache  |
| Languages       | 100+                  | Chinese + English     |
| Accuracy        | Good for clear text   | Excellent for Chinese |

Key Technologies Used

| Technology       | Purpose                            |
| ---------------- | ---------------------------------- |
| Tesseract.js     | JavaScript OCR engine              |
| ONNX Runtime Web | Run ONNX models in the browser     |
| PP-OCRv5         | Baidu's Chinese OCR models         |
| DBNet            | Text detection neural network      |
| CRNN             | Text recognition (CNN + RNN + CTC) |
| CTC Decoding     | Convert model output to text       |

Character Dictionary

PP-OCRv5 uses a character dictionary for recognition:

// Dictionary format (one character per line):
// 一
// 二
// 三
// ...
// A
// B
// C
// ...

// With blank token for CTC:
// blank
// 一
// 二
// 三
// ...

The dictionary contains 6000+ Chinese characters plus letters, numbers, and symbols.
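A worked example shows how the dictionary and CTC decoding fit together. With a toy four-entry dictionary (index 0 is the CTC blank), decoding collapses consecutive duplicates but keeps a character repeated across a blank. The decoder is restated here so the example is self-contained:

```typescript
// ctcDecodeIndices restated for a self-contained example. The blank
// token resets lastIdx so a character legitimately repeated across a
// blank (e.g. "一", blank, "一") survives decoding.
const ctcDecodeIndices = (indices: number[], charDict: string[]): string => {
  const blankIdx = 0;
  let result = '';
  let lastIdx = -1;

  for (const idx of indices) {
    if (idx === blankIdx) { lastIdx = -1; continue; }  // blank: reset
    if (idx === lastIdx) continue;                     // collapse repeats
    if (idx >= 1 && idx < charDict.length) result += charDict[idx];
    lastIdx = idx;
  }
  return result;
};

// Toy dictionary: index 0 is the CTC blank
const dict = ['blank', '一', '二', '三'];

// Raw per-timestep argmax: 一 一 blank 一 二 二 三
console.log(ctcDecodeIndices([1, 1, 0, 1, 2, 2, 3], dict));  // "一一二三"
```

The repeated `1, 1` collapses to a single 一, while the `1` after the blank produces a second 一; the repeated `2, 2` collapses to one 二.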

Performance Characteristics

  1. Tesseract.js:

    • Auto-downloads traineddata (~20MB)
    • Processing: 2-10 seconds per image
    • Works offline after first run
  2. PP-OCRv5:

    • Detection model: ~10MB
    • Recognition model: ~10MB
    • Dictionary: ~50KB
    • Processing: 3-15 seconds per image
    • Better for Chinese text

Use Cases

  1. Document digitization - Convert paper documents to text
  2. Screenshot extraction - Extract text from screen captures
  3. Invoice processing - Extract information from receipts
  4. Book scanning - Digitize printed books
  5. Image translation - Prepare text for translation apps

Conclusion

Browser-based OCR with multiple engine support provides flexibility for different use cases. The implementation uses:

  • Tesseract.js for quick, general-purpose OCR with language support
  • PP-OCRv5 for specialized Chinese text recognition using deep learning
  • ONNX Runtime Web for running ONNX models in the browser
  • CTC Decoding for converting neural network outputs to readable text

Users can choose between lightweight Tesseract.js or powerful PP-OCRv5 based on their needs, all without sending images to any server.


Try it yourself at Free Image Tools

Experience the power of browser-based OCR. No upload required - your images stay on your device!
