DEV Community

monkeymore studio
Building a Browser-Based AI OCR Tool with Multiple Engines

Introduction

In this article, we'll explore how to implement a browser-based OCR (Optical Character Recognition) tool that supports multiple OCR engines. The tool extracts text from images entirely in the browser, handles both English and Chinese text, and offers two engine options: the lightweight Tesseract.js and the Chinese-optimized deep-learning models of PP-OCRv5.

Why Browser-Based OCR?

1. Privacy Protection

When users process OCR in the browser, their images never leave their device. This is essential for:

  • Business documents containing sensitive information
  • Personal photos with private text
  • Medical records or legal documents

2. Zero Server Costs

Running OCR in the browser eliminates the need for:

  • GPU servers for deep learning inference
  • Bandwidth for uploading/downloading images
  • API costs for third-party OCR services

3. Offline Capability

Once the models are loaded, users can process images without an internet connection: Tesseract.js caches its language data after the first run, and PP-OCRv5 works offline once its ONNX models are cached.

Technical Architecture

Core Implementation

1. Data Structures

interface OCRResult {
  text: string;
  confidence: number;
  box?: number[][];  // Bounding box coordinates
}

interface ImageFile {
  id: string;
  file: File;
  previewUrl: string;
  results?: OCRResult[];
  processing?: boolean;
  error?: string;
}

type OCREngine = "tesseract" | "ppocrv5";

// PP-OCRv5 model URLs from GitHub
const GITHUB_BASE_URL = "https://raw.githubusercontent.com/linmingren/openmodels/main/models/ppocrv5";
const MODEL_URLS = {
  det: `${GITHUB_BASE_URL}/ch_PP-OCRv5_mobile_det.onnx`,
  rec: `${GITHUB_BASE_URL}/ch_PP-OCRv5_rec_mobile_infer.onnx`,
};

2. Engine Selection & Loading

The tool supports two OCR engines:

const [engine, setEngine] = useState<OCREngine>("tesseract");
const [modelsLoaded, setModelsLoaded] = useState(false);
const [charDict, setCharDict] = useState<string[]>([]);

const sessionsRef = useRef<{
  det: ort.InferenceSession | null;
  rec: ort.InferenceSession | null;
}>({ det: null, rec: null });

3. Loading Tesseract.js Engine

Tesseract.js is lightweight and works out of the box:

// Tesseract.js loads automatically when selected
useEffect(() => {
  if (engine === "tesseract") {
    setTesseractLoaded(true);
    setLoadingModels(false);
  }
}, [engine]);

// Using Tesseract.js for OCR
const performOCRTesseract = async (imageFile: ImageFile): Promise<OCRResult[]> => {
  const result = await Tesseract.recognize(
    imageFile.file,
    'chi_sim+eng',  // Chinese Simplified + English
    {
      logger: (m) => {
        console.log(`[Tesseract] ${m.status}: ${Math.round(m.progress * 100)}%`);
      }
    }
  );

  const results: OCRResult[] = [];

  if (result.data.text) {
    results.push({
      text: result.data.text.trim(),
      confidence: (result.data.confidence || 90) / 100,
    });
  }

  return results;
};

4. Loading PP-OCRv5 Models

PP-OCRv5 is a more powerful Chinese-optimized OCR system:

// Load character dictionary
useEffect(() => {
  async function loadDict() {
    let dict: string[] = [];

    // Try local first, then remote
    try {
      const response = await fetch(LOCAL_DICT_PATH);
      if (response.ok) {
        const text = await response.text();
        dict = text.split('\n')
          .map(line => line.replace(/\r$/, ''))
          .filter(line => line.length > 0);
        dict = ['blank', ...dict];  // Add blank for CTC
      }
    } catch (err) {
      console.log('Trying remote dictionary...');
    }

    // Remote fallback
    if (dict.length === 0) {
      const response = await fetch(DICT_URL);
      const text = await response.text();
      dict = text.split('\n')
        .map(line => line.replace(/\r$/, ''))
        .filter(line => line.length > 0);
      dict = ['blank', ...dict];
    }

    setCharDict(dict);
  }
  loadDict();
}, [engine]);

// Load ONNX models
async function loadModels() {
  // Ensure ONNX Runtime is loaded
  if (!window.ort) {
    const script = document.createElement("script");
    script.src = "https://cdn.jsdelivr.net/npm/onnxruntime-web@1.20.1/dist/ort.min.js";
    document.head.appendChild(script);
    await new Promise((resolve) => { script.onload = resolve; });
  }

  // Configure for optimal performance
  ort.env.wasm.numThreads = 4;
  ort.env.wasm.simd = true;

  // Load detection model
  const detResponse = await fetch(MODEL_URLS.det);
  const detBuffer = await detResponse.arrayBuffer();
  const detSession = await ort.InferenceSession.create(new Uint8Array(detBuffer), {
    executionProviders: ['wasm'],
    graphOptimizationLevel: 'all'
  });
  sessionsRef.current.det = detSession;

  // Load recognition model
  const recResponse = await fetch(MODEL_URLS.rec);
  const recBuffer = await recResponse.arrayBuffer();
  const recSession = await ort.InferenceSession.create(new Uint8Array(recBuffer), {
    executionProviders: ['wasm'],
    graphOptimizationLevel: 'all'
  });
  sessionsRef.current.rec = recSession;

  setModelsLoaded(true);
}

5. Image Preprocessing for PP-OCRv5

// Normalize image for model input
const normalizeImage = (imageData: ImageData, mean: number[], std: number[]): Float32Array => {
  const { data, width, height } = imageData;
  const floatData = new Float32Array(3 * width * height);

  // ImageNet normalization
  for (let i = 0; i < height * width; i++) {
    const r = data[i * 4] / 255.0;
    const g = data[i * 4 + 1] / 255.0;
    const b = data[i * 4 + 2] / 255.0;

    floatData[i] = (r - mean[0]) / std[0];
    floatData[i + height * width] = (g - mean[1]) / std[1];
    floatData[i + 2 * height * width] = (b - mean[2]) / std[2];
  }

  return floatData;
};
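To see the planar CHW layout the model expects, here is the normalization run on a tiny two-pixel image. This is a sketch for illustration: `ImageData` is replaced with a plain object (our `PixelData` type) so it runs outside the browser.

```typescript
// normalizeImage restated so the example is self-contained;
// ImageData is mocked with a plain object for use outside the browser.
type PixelData = { data: Uint8ClampedArray; width: number; height: number };

const normalizeImage = (imageData: PixelData, mean: number[], std: number[]): Float32Array => {
  const { data, width, height } = imageData;
  const floatData = new Float32Array(3 * width * height);

  for (let i = 0; i < height * width; i++) {
    const r = data[i * 4] / 255.0;
    const g = data[i * 4 + 1] / 255.0;
    const b = data[i * 4 + 2] / 255.0;

    // Planar CHW layout: all R values first, then all G, then all B
    floatData[i] = (r - mean[0]) / std[0];
    floatData[i + height * width] = (g - mean[1]) / std[1];
    floatData[i + 2 * height * width] = (b - mean[2]) / std[2];
  }

  return floatData;
};

// A 2x1 "image": one red pixel, one green pixel (RGBA bytes)
const tiny: PixelData = {
  data: new Uint8ClampedArray([255, 0, 0, 255, 0, 255, 0, 255]),
  width: 2,
  height: 1,
};

// With mean = std = 0.5, every channel maps into [-1, 1]
const out = normalizeImage(tiny, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]);
console.log(Array.from(out));  // [1, -1, -1, 1, -1, -1]
```

Note how the interleaved RGBA input becomes three contiguous channel planes, which is what the `[1, 3, height, width]` tensor shape used below requires.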

6. Text Detection (DBNet)

PP-OCRv5 uses DBNet for text detection:

// DBNet post-processing
const postprocessDetection = (output: ort.Tensor, threshold = 0.3): number[][][] => {
  const data = output.data as Float32Array;
  const dims = output.dims;
  const height = dims[2] as number;
  const width = dims[3] as number;

  // Probability map
  const probMap = new Float32Array(height * width);
  for (let i = 0; i < height * width; i++) {
    probMap[i] = data[i];
  }

  // Binary map
  const binaryMap = new Uint8Array(height * width);
  for (let i = 0; i < height * width; i++) {
    binaryMap[i] = probMap[i] > threshold ? 1 : 0;
  }

  // Connected component analysis
  const boxes: number[][][] = [];
  const visited = new Uint8Array(height * width);

  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const idx = y * width + x;
      if (binaryMap[idx] === 1 && !visited[idx]) {
        const points: number[][] = [];
        const queue: number[][] = [[x, y]];
        visited[idx] = 1;

        // BFS for connected region
        while (queue.length > 0) {
          const [cx, cy] = queue.shift()!;
          points.push([cx, cy]);

          // Check 8 directions
          for (const [dx, dy] of [[-1,0],[1,0],[0,-1],[0,1],[-1,-1],[-1,1],[1,-1],[1,1]]) {
            const nx = cx + dx;
            const ny = cy + dy;

            // Bounds check before computing the flat index
            if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
              const nIdx = ny * width + nx;
              if (binaryMap[nIdx] === 1 && !visited[nIdx]) {
                visited[nIdx] = 1;
                queue.push([nx, ny]);
              }
            }
          }
        }

        // Filter small regions and create bounding box
        if (points.length > 10) {
          const minX = Math.min(...points.map(p => p[0]));
          const maxX = Math.max(...points.map(p => p[0]));
          const minY = Math.min(...points.map(p => p[1]));
          const maxY = Math.max(...points.map(p => p[1]));

          boxes.push([
            [minX, minY], [maxX, minY], [maxX, maxY], [minX, maxY]
          ]);
        }
      }
    }
  }

  return boxes;
};

7. Text Recognition with CTC Decoding

// CTC (Connectionist Temporal Classification) decoding
const ctcDecodeIndices = (indices: number[], charDict: string[]): string => {
  const blankIdx = 0;
  let result = '';
  let lastIdx = -1;

  for (let i = 0; i < indices.length; i++) {
    const idx = indices[i];

    // Blank (CTC blank token): reset lastIdx so a character that is
    // legitimately repeated across a blank is not collapsed, then skip
    if (idx === blankIdx) {
      lastIdx = -1;
      continue;
    }

    // Skip consecutive duplicates
    if (idx === lastIdx) continue;

    // Map index to character
    if (idx >= 1 && idx < charDict.length) {
      result += charDict[idx];
    }

    lastIdx = idx;
  }

  return result;
};

// Complete PP-OCRv5 recognition
const performOCRPPOCR = async (imageFile: ImageFile): Promise<OCRResult[]> => {
  // 1. Text Detection
  const img = await createImageBitmap(imageFile.file);
  const canvas = document.createElement('canvas');
  canvas.width = img.width;
  canvas.height = img.height;
  canvas.getContext('2d')!.drawImage(img, 0, 0);

  // Resize for detection model (multiple of 32)
  const detInputSize = 960;
  const scale = Math.min(detInputSize / img.width, detInputSize / img.height);
  let newWidth = Math.ceil(Math.round(img.width * scale) / 32) * 32;
  let newHeight = Math.ceil(Math.round(img.height * scale) / 32) * 32;

  // Run detection. detCtx is the 2D context of a second canvas that the
  // source image has been drawn into at newWidth x newHeight (resize code
  // elided here); detInputName / detOutputName come from the session's
  // inputNames / outputNames.
  const detImageData = detCtx.getImageData(0, 0, newWidth, newHeight);
  const detInput = normalizeImage(detImageData, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]);
  const detTensor = new ort.Tensor('float32', detInput, [1, 3, newHeight, newWidth]);
  const detResults = await sessionsRef.current.det!.run({ [detInputName]: detTensor });
  const boxes = postprocessDetection(detResults[detOutputName]);

  // 2. Text Recognition for each box
  const results: OCRResult[] = [];

  for (const box of boxes.slice(0, 20)) {  // Limit to 20 boxes
    // Crop text region
    const minX = Math.min(...box.map(p => p[0]));
    const maxX = Math.max(...box.map(p => p[0]));
    const minY = Math.min(...box.map(p => p[1]));
    const maxY = Math.max(...box.map(p => p[1]));

    // Resize to 48px height (multiple of 8)
    let recHeight = 48;
    let recWidth = Math.min(320, Math.round((maxX - minX) * recHeight / (maxY - minY)));
    recWidth = Math.ceil(recWidth / 8) * 8;
    recHeight = Math.ceil(recHeight / 8) * 8;

    // Run recognition. recCtx is the 2D context of a canvas holding the
    // cropped region resized to recWidth x recHeight (crop code elided).
    const recImageData = recCtx.getImageData(0, 0, recWidth, recHeight);
    const recInput = normalizeImage(recImageData, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]);
    const recTensor = new ort.Tensor('float32', recInput, [1, 3, recHeight, recWidth]);
    const recResults = await sessionsRef.current.rec!.run({ [recInputName]: recTensor });

    // Take the argmax over the character axis at each time step, then decode
    const indices = getMaxIndices(recResults[recOutputName]);
    const text = ctcDecodeIndices(indices, charDict);

    if (text.trim()) {
      results.push({
        text,
        confidence: 0.95,
        box: box.map(p => [p[0] / scale, p[1] / scale])
      });
    }
  }

  // Sort by vertical position
  return results.sort((a, b) => {
    const ay = Math.min(...(a.box || [[0,0]]).map(p => p[1]));
    const by = Math.min(...(b.box || [[0,0]]).map(p => p[1]));
    return ay - by;
  });
};
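Both models constrain their input dimensions: the detector expects multiples of 32, the recognizer multiples of 8. The inline `Math.ceil(... / stride) * stride` expressions above can be factored into a small helper (the helper name is ours, not from the original code):

```typescript
// Hypothetical helper: round a length up to the model's stride
// (32 for the DBNet detector, 8 for the recognizer), matching the
// inline Math.ceil(value / stride) * stride expressions in the article.
const roundUpToStride = (value: number, stride: number): number =>
  Math.ceil(value / stride) * stride;

console.log(roundUpToStride(950, 32));  // 960
console.log(roundUpToStride(48, 8));    // 48
```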

8. Engine Selection UI

<div className="space-y-3">
  <label className="flex items-start gap-3 p-3 border rounded-lg cursor-pointer hover:bg-gray-50">
    <input
      type="radio"
      name="engine"
      value="tesseract"
      checked={engine === "tesseract"}
      onChange={(e) => setEngine(e.target.value as OCREngine)}
    />
    <div className="flex-1">
      <span className="font-medium block">
        {t.tesseractDefault || "Tesseract.js (Default)"}
      </span>
      <span className="text-sm text-gray-500">
        {t.tesseractDesc || "Fast and lightweight, good for general text recognition"}
      </span>
    </div>
  </label>
  <label className="flex items-start gap-3 p-3 border rounded-lg cursor-pointer hover:bg-gray-50">
    <input
      type="radio"
      name="engine"
      value="ppocrv5"
      checked={engine === "ppocrv5"}
      onChange={(e) => setEngine(e.target.value as OCREngine)}
    />
    <div className="flex-1">
      <span className="font-medium block">
        {t.ppocrv5Option || "PP-OCRv5 (Chinese optimized)"}
      </span>
      <span className="text-sm text-gray-500">
        {t.ppocrv5Desc || "Better for Chinese text recognition, requires downloading large models"}
      </span>
    </div>
  </label>
</div>

Processing Flow
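The end-to-end flow ties the pieces together: for each queued image, dispatch to the selected engine's handler and collect results or errors per image. A minimal sketch of that loop, with the handlers passed in so the flow can be shown in isolation (the `runOCR` name and generic signature are our illustration; the handlers mirror `performOCRTesseract` / `performOCRPPOCR` above):

```typescript
type OCRResult = { text: string; confidence: number; box?: number[][] };
type OCREngine = "tesseract" | "ppocrv5";
type EngineHandler<T> = (image: T) => Promise<OCRResult[]>;

// Process a batch of images with the currently selected engine.
async function runOCR<T>(
  images: T[],
  engine: OCREngine,
  handlers: Record<OCREngine, EngineHandler<T>>
): Promise<{ image: T; results?: OCRResult[]; error?: string }[]> {
  const out: { image: T; results?: OCRResult[]; error?: string }[] = [];

  for (const image of images) {
    try {
      // Dispatch to the selected engine's handler
      const results = await handlers[engine](image);
      out.push({ image, results });
    } catch (err) {
      // A failure on one image should not abort the whole batch
      out.push({ image, error: String(err) });
    }
  }

  return out;
}

// Usage with stub handlers standing in for the real engines:
const handlers = {
  tesseract: async (name: string) => [{ text: `tess:${name}`, confidence: 0.9 }],
  ppocrv5: async (name: string) => [{ text: `ppocr:${name}`, confidence: 0.95 }],
};

runOCR(["a.png", "b.png"], "tesseract", handlers).then((r) =>
  console.log(r.map((x) => x.results?.[0].text))  // ["tess:a.png", "tess:b.png"]
);
```

Keeping the loop engine-agnostic like this is what lets the radio buttons in the UI switch engines without touching the batch-processing code.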

OCR Engine Comparison

| Feature         | Tesseract.js          | PP-OCRv5              |
| --------------- | --------------------- | --------------------- |
| Model Size      | ~20MB (auto-download) | ~20MB (det + rec)     |
| Chinese Support | Good                  | Excellent             |
| Speed           | Fast                  | Medium                |
| Offline         | Yes (after cache)     | Requires model cache  |
| Languages       | 100+                  | Chinese + English     |
| Accuracy        | Good for clear text   | Excellent for Chinese |

Key Technologies Used

| Technology       | Purpose                            |
| ---------------- | ---------------------------------- |
| Tesseract.js     | JavaScript OCR engine              |
| ONNX Runtime Web | Run ONNX models in the browser     |
| PP-OCRv5         | Baidu's Chinese OCR models         |
| DBNet            | Text detection neural network      |
| CRNN             | Text recognition (CNN + RNN + CTC) |
| CTC Decoding     | Convert model output to text       |

Character Dictionary

PP-OCRv5 uses a character dictionary for recognition:

// Dictionary format (one character per line):
// 一
// 二
// 三
// ...
// A
// B
// C
// ...

// With blank token for CTC:
// blank
// 一
// 二
// 三
// ...

The dictionary contains 6000+ Chinese characters plus letters, numbers, and symbols.
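A worked example shows how the dictionary and CTC decoding fit together. With a toy four-entry dictionary (index 0 is the CTC blank), decoding collapses consecutive duplicates but keeps a character repeated across a blank. The decoder is restated here so the example is self-contained:

```typescript
// ctcDecodeIndices restated for a self-contained example. The blank
// token resets lastIdx so a character legitimately repeated across a
// blank (e.g. "一", blank, "一") survives decoding.
const ctcDecodeIndices = (indices: number[], charDict: string[]): string => {
  const blankIdx = 0;
  let result = '';
  let lastIdx = -1;

  for (const idx of indices) {
    if (idx === blankIdx) { lastIdx = -1; continue; }  // blank: reset
    if (idx === lastIdx) continue;                     // collapse repeats
    if (idx >= 1 && idx < charDict.length) result += charDict[idx];
    lastIdx = idx;
  }
  return result;
};

// Toy dictionary: index 0 is the CTC blank
const dict = ['blank', '一', '二', '三'];

// Raw per-timestep argmax: 一 一 blank 一 二 二 三
console.log(ctcDecodeIndices([1, 1, 0, 1, 2, 2, 3], dict));  // "一一二三"
```

The repeated `1, 1` collapses to a single 一, while the `1` after the blank produces a second 一; the repeated `2, 2` collapses to one 二.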

Performance Characteristics

  1. Tesseract.js:

    • Auto-downloads traineddata (~20MB)
    • Processing: 2-10 seconds per image
    • Works offline after first run
  2. PP-OCRv5:

    • Detection model: ~10MB
    • Recognition model: ~10MB
    • Dictionary: ~50KB
    • Processing: 3-15 seconds per image
    • Better for Chinese text

Use Cases

  1. Document digitization - Convert paper documents to text
  2. Screenshot extraction - Extract text from screen captures
  3. Invoice processing - Extract information from receipts
  4. Book scanning - Digitize printed books
  5. Image translation - Prepare text for translation apps

Conclusion

Browser-based OCR with multiple engine support provides flexibility for different use cases. The implementation uses:

  • Tesseract.js for quick, general-purpose OCR with language support
  • PP-OCRv5 for specialized Chinese text recognition using deep learning
  • ONNX Runtime Web for running ONNX models in the browser
  • CTC Decoding for converting neural network outputs to readable text

Users can choose between lightweight Tesseract.js or powerful PP-OCRv5 based on their needs, all without sending images to any server.


Try it yourself at Free Image Tools

Experience the power of browser-based OCR. No upload required - your images stay on your device!
