## Introduction
In this article, we'll explore how to implement a browser-based OCR (Optical Character Recognition) tool that supports multiple engines. The tool extracts text from images entirely in the browser, recognizing both English and Chinese, with two engine options: Tesseract.js (lightweight) and PP-OCRv5 (a Chinese-optimized deep-learning pipeline).
## Why Browser-Based OCR?

### 1. Privacy Protection
When users process OCR in the browser, their images never leave their device. This is essential for:
- Business documents containing sensitive information
- Personal photos with private text
- Medical records or legal documents
### 2. Zero Server Costs
Running OCR in the browser eliminates the need for:
- GPU servers for deep learning inference
- Bandwidth for uploading/downloading images
- API costs for third-party OCR services
### 3. Offline Capability

Once the language data (Tesseract.js) or the ONNX models (PP-OCRv5) have been downloaded and cached, users can process images without an internet connection.
## Technical Architecture

### Core Implementation

#### 1. Data Structures
```typescript
interface OCRResult {
  text: string;
  confidence: number;
  box?: number[][]; // Bounding box coordinates
}

interface ImageFile {
  id: string;
  file: File;
  previewUrl: string;
  results?: OCRResult[];
  processing?: boolean;
  error?: string;
}

type OCREngine = "tesseract" | "ppocrv5";

// PP-OCRv5 model URLs from GitHub
const GITHUB_BASE_URL = "https://raw.githubusercontent.com/linmingren/openmodels/main/models/ppocrv5";
const MODEL_URLS = {
  det: `${GITHUB_BASE_URL}/ch_PP-OCRv5_mobile_det.onnx`,
  rec: `${GITHUB_BASE_URL}/ch_PP-OCRv5_rec_mobile_infer.onnx`,
};
```
#### 2. Engine Selection & Loading
The tool supports two OCR engines:
```typescript
const [engine, setEngine] = useState<OCREngine>("tesseract");
const [modelsLoaded, setModelsLoaded] = useState(false);
const [charDict, setCharDict] = useState<string[]>([]);

const sessionsRef = useRef<{
  det: ort.InferenceSession | null;
  rec: ort.InferenceSession | null;
}>({ det: null, rec: null });
```
#### 3. Loading Tesseract.js Engine
Tesseract.js is lightweight and works out of the box:
```typescript
// Tesseract.js loads automatically when selected
useEffect(() => {
  if (engine === "tesseract") {
    setTesseractLoaded(true);
    setLoadingModels(false);
  }
}, [engine]);

// Using Tesseract.js for OCR
const performOCRTesseract = async (imageFile: ImageFile): Promise<OCRResult[]> => {
  const result = await Tesseract.recognize(
    imageFile.file,
    'chi_sim+eng', // Chinese Simplified + English
    {
      logger: (m) => {
        console.log(`[Tesseract] ${m.status}: ${Math.round(m.progress * 100)}%`);
      }
    }
  );

  const results: OCRResult[] = [];
  if (result.data.text) {
    results.push({
      text: result.data.text.trim(),
      confidence: (result.data.confidence || 90) / 100,
    });
  }
  return results;
};
```
#### 4. Loading PP-OCRv5 Models
PP-OCRv5 is a more powerful Chinese-optimized OCR system:
```typescript
// Load character dictionary
useEffect(() => {
  async function loadDict() {
    let dict: string[] = [];

    // Try local first, then remote
    try {
      const response = await fetch(LOCAL_DICT_PATH);
      if (response.ok) {
        const text = await response.text();
        dict = text.split('\n')
          .map(line => line.replace(/\r$/, ''))
          .filter(line => line.length > 0);
        dict = ['blank', ...dict]; // Add blank for CTC
      }
    } catch (err) {
      console.log('Trying remote dictionary...');
    }

    // Remote fallback
    if (dict.length === 0) {
      const response = await fetch(DICT_URL);
      const text = await response.text();
      dict = text.split('\n')
        .map(line => line.replace(/\r$/, ''))
        .filter(line => line.length > 0);
      dict = ['blank', ...dict];
    }

    setCharDict(dict);
  }
  loadDict();
}, [engine]);
```
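The parsing logic appears twice above (local and remote paths), so it is worth checking in isolation. The pure helper below restates it; `parseDict` is a name introduced here for illustration:

```typescript
// Pure version of the dictionary parsing above: strip a trailing CR from
// each line, drop empty lines, and prepend the CTC blank token at index 0.
const parseDict = (raw: string): string[] =>
  ['blank', ...raw.split('\n')
    .map(line => line.replace(/\r$/, ''))
    .filter(line => line.length > 0)];

console.log(parseDict('一\r\n二\n\n三\n')); // returns ['blank', '一', '二', '三']
```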
```typescript
// Load ONNX models
async function loadModels() {
  // Ensure ONNX Runtime is loaded
  if (!window.ort) {
    const script = document.createElement("script");
    script.src = "https://cdn.jsdelivr.net/npm/onnxruntime-web@1.20.1/dist/ort.min.js";
    document.head.appendChild(script);
    await new Promise((resolve) => { script.onload = resolve; });
  }

  // Configure for optimal performance
  ort.env.wasm.numThreads = 4;
  ort.env.wasm.simd = true;

  // Load detection model
  const detResponse = await fetch(MODEL_URLS.det);
  const detBuffer = await detResponse.arrayBuffer();
  const detSession = await ort.InferenceSession.create(new Uint8Array(detBuffer), {
    executionProviders: ['wasm'],
    graphOptimizationLevel: 'all'
  });
  sessionsRef.current.det = detSession;

  // Load recognition model
  const recResponse = await fetch(MODEL_URLS.rec);
  const recBuffer = await recResponse.arrayBuffer();
  const recSession = await ort.InferenceSession.create(new Uint8Array(recBuffer), {
    executionProviders: ['wasm'],
    graphOptimizationLevel: 'all'
  });
  sessionsRef.current.rec = recSession;

  setModelsLoaded(true);
}
```
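One caveat worth flagging: `numThreads = 4` only takes effect when the page is cross-origin isolated (COOP/COEP response headers set), because multithreaded WASM depends on `SharedArrayBuffer`; otherwise ONNX Runtime falls back to a single thread. A small pure helper (introduced here for illustration) can pick a safe thread count:

```typescript
// Hypothetical helper: choose a WASM thread count that degrades
// gracefully when the page is not cross-origin isolated.
const pickThreadCount = (isolated: boolean, cores: number): number =>
  isolated ? Math.max(1, Math.min(4, cores)) : 1;

// In the browser one would call it roughly like:
//   ort.env.wasm.numThreads =
//     pickThreadCount(self.crossOriginIsolated === true,
//                     navigator.hardwareConcurrency || 1);
console.log(pickThreadCount(true, 8));  // 4
console.log(pickThreadCount(false, 8)); // 1
```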
#### 5. Image Preprocessing for PP-OCRv5
```typescript
// Normalize image for model input
const normalizeImage = (imageData: ImageData, mean: number[], std: number[]): Float32Array => {
  const { data, width, height } = imageData;
  const floatData = new Float32Array(3 * width * height);

  // ImageNet-style normalization into planar CHW layout
  for (let i = 0; i < height * width; i++) {
    const r = data[i * 4] / 255.0;
    const g = data[i * 4 + 1] / 255.0;
    const b = data[i * 4 + 2] / 255.0;
    floatData[i] = (r - mean[0]) / std[0];
    floatData[i + height * width] = (g - mean[1]) / std[1];
    floatData[i + 2 * height * width] = (b - mean[2]) / std[2];
  }
  return floatData;
};
```
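The channel-planar (NCHW) layout can be verified outside the browser by working on a raw RGBA buffer directly, which is all `ImageData.data` holds. This standalone variant is a sketch for checking the math, not the tool's code:

```typescript
// Standalone sketch of the normalization above: take a raw RGBA buffer
// (what ImageData.data contains) and emit a planar CHW Float32Array.
const normalizeRGBA = (
  data: Uint8ClampedArray, width: number, height: number,
  mean: number[], std: number[],
): Float32Array => {
  const plane = width * height;
  const out = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    out[i]             = (data[i * 4]     / 255 - mean[0]) / std[0]; // R plane
    out[i + plane]     = (data[i * 4 + 1] / 255 - mean[1]) / std[1]; // G plane
    out[i + 2 * plane] = (data[i * 4 + 2] / 255 - mean[2]) / std[2]; // B plane
  }
  return out;
};

// A single white pixel with mean 0.5 / std 0.5 maps every channel to 1.0:
const white = new Uint8ClampedArray([255, 255, 255, 255]);
console.log(normalizeRGBA(white, 1, 1, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]));
```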
#### 6. Text Detection (DBNet)
PP-OCRv5 uses DBNet for text detection:
```typescript
// DBNet post-processing
const postprocessDetection = (output: ort.Tensor, threshold = 0.3): number[][][] => {
  const data = output.data as Float32Array;
  const dims = output.dims; // [1, 1, H, W]
  const height = dims[2] as number;
  const width = dims[3] as number;

  // Probability map
  const probMap = new Float32Array(height * width);
  for (let i = 0; i < height * width; i++) {
    probMap[i] = data[i];
  }

  // Binary map
  const binaryMap = new Uint8Array(height * width);
  for (let i = 0; i < height * width; i++) {
    binaryMap[i] = probMap[i] > threshold ? 1 : 0;
  }

  // Connected component analysis
  const boxes: number[][][] = [];
  const visited = new Uint8Array(height * width);

  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const idx = y * width + x;
      if (binaryMap[idx] === 1 && !visited[idx]) {
        const points: number[][] = [];
        const queue: number[][] = [[x, y]];
        visited[idx] = 1;

        // BFS over the connected region
        while (queue.length > 0) {
          const [cx, cy] = queue.shift()!;
          points.push([cx, cy]);

          // Check 8 neighbors
          for (const [dx, dy] of [[-1,0],[1,0],[0,-1],[0,1],[-1,-1],[-1,1],[1,-1],[1,1]]) {
            const nx = cx + dx;
            const ny = cy + dy;
            const nIdx = ny * width + nx;
            if (nx >= 0 && nx < width && ny >= 0 && ny < height &&
                binaryMap[nIdx] === 1 && !visited[nIdx]) {
              visited[nIdx] = 1;
              queue.push([nx, ny]);
            }
          }
        }

        // Filter small regions and create an axis-aligned bounding box
        if (points.length > 10) {
          const minX = Math.min(...points.map(p => p[0]));
          const maxX = Math.max(...points.map(p => p[0]));
          const minY = Math.min(...points.map(p => p[1]));
          const maxY = Math.max(...points.map(p => p[1]));
          boxes.push([
            [minX, minY], [maxX, minY], [maxX, maxY], [minX, maxY]
          ]);
        }
      }
    }
  }
  return boxes;
};
```
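The connected-component pass is easiest to sanity-check on a tiny hand-written binary map. The standalone variant below is a sketch (the tensor unwrapping is dropped and the size filter is parameterized); two separated blobs should yield exactly two boxes:

```typescript
// Standalone connected-component labeling over a plain binary map,
// mirroring the BFS above: one axis-aligned box per 8-connected blob.
const componentBoxes = (bin: Uint8Array, width: number, height: number, minPts = 1): number[][][] => {
  const visited = new Uint8Array(width * height);
  const boxes: number[][][] = [];
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const idx = y * width + x;
      if (bin[idx] !== 1 || visited[idx]) continue;
      visited[idx] = 1;
      const queue: number[][] = [[x, y]];
      let minX = x, maxX = x, minY = y, maxY = y, count = 0;
      while (queue.length > 0) {
        const [cx, cy] = queue.shift()!;
        count++;
        minX = Math.min(minX, cx); maxX = Math.max(maxX, cx);
        minY = Math.min(minY, cy); maxY = Math.max(maxY, cy);
        for (const [dx, dy] of [[-1,0],[1,0],[0,-1],[0,1],[-1,-1],[-1,1],[1,-1],[1,1]]) {
          const nx = cx + dx, ny = cy + dy, nIdx = ny * width + nx;
          if (nx >= 0 && nx < width && ny >= 0 && ny < height &&
              bin[nIdx] === 1 && !visited[nIdx]) {
            visited[nIdx] = 1;
            queue.push([nx, ny]);
          }
        }
      }
      if (count >= minPts) {
        boxes.push([[minX, minY], [maxX, minY], [maxX, maxY], [minX, maxY]]);
      }
    }
  }
  return boxes;
};

// 5x3 map with two vertical 2-pixel blobs in the left and right columns:
const bin = new Uint8Array([
  1, 0, 0, 0, 1,
  1, 0, 0, 0, 1,
  0, 0, 0, 0, 0,
]);
console.log(componentBoxes(bin, 5, 3).length); // 2
```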
#### 7. Text Recognition with CTC Decoding
```typescript
// CTC (Connectionist Temporal Classification) decoding
const ctcDecodeIndices = (indices: number[], charDict: string[]): string => {
  const blankIdx = 0;
  let result = '';
  let lastIdx = -1;

  for (const idx of indices) {
    // Collapse consecutive duplicates of the raw path first, then drop
    // blanks. lastIdx must track blanks too, so that [a, blank, a]
    // correctly decodes to "aa" while [a, a] collapses to "a".
    if (idx !== lastIdx && idx !== blankIdx && idx >= 1 && idx < charDict.length) {
      result += charDict[idx];
    }
    lastIdx = idx;
  }
  return result;
};
```
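A quick worked example makes the decoding rule concrete. The compact decoder below restates the same rule as `ctcDecodeIndices` with a toy dictionary: collapse consecutive repeats of the raw path, then drop blanks, so the blank separating two identical indices preserves the doubled character:

```typescript
// Compact restatement of the CTC rule for a quick check:
// collapse raw-path repeats, then drop the blank token (index 0).
const decode = (path: number[], dict: string[]): string => {
  let out = '';
  let last = -1;
  for (const idx of path) {
    if (idx !== last && idx !== 0 && idx < dict.length) out += dict[idx];
    last = idx;
  }
  return out;
};

const dict = ['blank', 'a', 'b', 'c'];
console.log(decode([0, 1, 1, 0, 1, 2], dict)); // "aab" — the blank keeps the repeated 'a'
console.log(decode([1, 1, 1], dict));          // "a"  — repeats without a blank collapse
```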
```typescript
// Complete PP-OCRv5 recognition
const performOCRPPOCR = async (imageFile: ImageFile): Promise<OCRResult[]> => {
  // 1. Text detection
  const img = await createImageBitmap(imageFile.file);
  const canvas = document.createElement('canvas');
  canvas.width = img.width;
  canvas.height = img.height;
  canvas.getContext('2d')!.drawImage(img, 0, 0);

  // Resize for the detection model (dimensions must be multiples of 32)
  const detInputSize = 960;
  const scale = Math.min(detInputSize / img.width, detInputSize / img.height);
  const newWidth = Math.ceil(Math.round(img.width * scale) / 32) * 32;
  const newHeight = Math.ceil(Math.round(img.height * scale) / 32) * 32;

  // Run detection (drawing the resized image into detCtx's canvas is
  // elided in this excerpt)
  const detImageData = detCtx.getImageData(0, 0, newWidth, newHeight);
  const detInput = normalizeImage(detImageData, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]);
  const detTensor = new ort.Tensor('float32', detInput, [1, 3, newHeight, newWidth]);
  const detResults = await sessionsRef.current.det!.run({ [detInputName]: detTensor });
  const boxes = postprocessDetection(detResults[detOutputName]);

  // 2. Text recognition for each detected box
  const results: OCRResult[] = [];
  for (const box of boxes.slice(0, 20)) { // Limit to 20 boxes
    // Crop the text region
    const minX = Math.min(...box.map(p => p[0]));
    const maxX = Math.max(...box.map(p => p[0]));
    const minY = Math.min(...box.map(p => p[1]));
    const maxY = Math.max(...box.map(p => p[1]));

    // Resize to 48px height, width rounded up to a multiple of 8
    const recHeight = 48;
    let recWidth = Math.min(320, Math.round((maxX - minX) * recHeight / (maxY - minY)));
    recWidth = Math.ceil(recWidth / 8) * 8;

    // Run recognition (drawing the cropped region into recCtx's canvas is
    // elided in this excerpt)
    const recImageData = recCtx.getImageData(0, 0, recWidth, recHeight);
    const recInput = normalizeImage(recImageData, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]);
    const recTensor = new ort.Tensor('float32', recInput, [1, 3, recHeight, recWidth]);
    const recResults = await sessionsRef.current.rec!.run({ [recInputName]: recTensor });

    // Take the argmax at each timestep and decode with CTC
    const indices = getMaxIndices(recResults[recOutputName]);
    const text = ctcDecodeIndices(indices, charDict);

    if (text.trim()) {
      results.push({
        text,
        confidence: 0.95,
        box: box.map(p => [p[0] / scale, p[1] / scale])
      });
    }
  }

  // Sort results top-to-bottom by vertical position
  return results.sort((a, b) => {
    const ay = Math.min(...(a.box || [[0, 0]]).map(p => p[1]));
    const by = Math.min(...(b.box || [[0, 0]]).map(p => p[1]));
    return ay - by;
  });
};
```
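`getMaxIndices` is referenced above but not shown in the excerpt. A plausible sketch, assuming the recognition output is laid out as `[1, seqLen, numClasses]` (per-timestep argmax over the class axis):

```typescript
// Assumed layout [1, seqLen, numClasses]: take the argmax class index
// at each timestep, producing the raw path fed to CTC decoding.
const getMaxIndices = (output: { data: Float32Array; dims: readonly number[] }): number[] => {
  const [, seqLen, numClasses] = output.dims;
  const indices: number[] = [];
  for (let t = 0; t < seqLen; t++) {
    let best = 0;
    let bestScore = -Infinity;
    for (let c = 0; c < numClasses; c++) {
      const score = output.data[t * numClasses + c];
      if (score > bestScore) { bestScore = score; best = c; }
    }
    indices.push(best);
  }
  return indices;
};

// Two timesteps over three classes: argmax picks class 1, then class 2.
const demo = {
  data: new Float32Array([0.1, 0.8, 0.1, 0.2, 0.1, 0.7]),
  dims: [1, 2, 3],
};
console.log(getMaxIndices(demo)); // [ 1, 2 ]
```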
#### 8. Engine Selection UI
```tsx
<div className="space-y-3">
  <label className="flex items-start gap-3 p-3 border rounded-lg cursor-pointer hover:bg-gray-50">
    <input
      type="radio"
      name="engine"
      value="tesseract"
      checked={engine === "tesseract"}
      onChange={(e) => setEngine(e.target.value as OCREngine)}
    />
    <div className="flex-1">
      <span className="font-medium block">
        {t.tesseractDefault || "Tesseract.js (Default)"}
      </span>
      <span className="text-sm text-gray-500">
        {t.tesseractDesc || "Fast and lightweight, good for general text recognition"}
      </span>
    </div>
  </label>

  <label className="flex items-start gap-3 p-3 border rounded-lg cursor-pointer hover:bg-gray-50">
    <input
      type="radio"
      name="engine"
      value="ppocrv5"
      checked={engine === "ppocrv5"}
      onChange={(e) => setEngine(e.target.value as OCREngine)}
    />
    <div className="flex-1">
      <span className="font-medium block">
        {t.ppocrv5Option || "PP-OCRv5 (Chinese optimized)"}
      </span>
      <span className="text-sm text-gray-500">
        {t.ppocrv5Desc || "Better for Chinese text recognition, requires downloading large models"}
      </span>
    </div>
  </label>
</div>
```
## OCR Engine Comparison
| Feature | Tesseract.js | PP-OCRv5 |
|---|---|---|
| Model Size | ~20MB (auto-download) | ~20MB (det + rec) |
| Chinese Support | Good | Excellent |
| Speed | Fast | Medium |
| Offline | Yes (after cache) | Requires model cache |
| Languages | 100+ | Chinese + English |
| Accuracy | Good for clear text | Excellent for Chinese |
## Key Technologies Used
| Technology | Purpose |
|---|---|
| Tesseract.js | JavaScript OCR engine |
| ONNX Runtime Web | Run ONNX models in browser |
| PP-OCRv5 | Baidu's Chinese OCR models |
| DBNet | Text detection neural network |
| CRNN | Text recognition (CNN + RNN + CTC) |
| CTC Decoding | Convert model output to text |
## Character Dictionary
PP-OCRv5 uses a character dictionary for recognition:
```text
Dictionary file format (one character per line):

一
二
三
...
A
B
C
...

After loading, with the blank token prepended for CTC:

blank
一
二
三
...
```
The dictionary contains 6000+ Chinese characters plus letters, numbers, and symbols.
## Performance Characteristics
- **Tesseract.js**:
  - Auto-downloads traineddata (~20MB)
  - Processing: 2-10 seconds per image
  - Works offline after first run
- **PP-OCRv5**:
  - Detection model: ~10MB
  - Recognition model: ~10MB
  - Dictionary: ~50KB
  - Processing: 3-15 seconds per image
  - Better for Chinese text
## Use Cases
- Document digitization - Convert paper documents to text
- Screenshot extraction - Extract text from screen captures
- Invoice processing - Extract information from receipts
- Book scanning - Digitize printed books
- Image translation - Prepare text for translation apps
## Conclusion
Browser-based OCR with multiple engine support provides flexibility for different use cases. The implementation uses:
- Tesseract.js for quick, general-purpose OCR with language support
- PP-OCRv5 for specialized Chinese text recognition using deep learning
- ONNX Runtime Web for running ONNX models in the browser
- CTC Decoding for converting neural network outputs to readable text
Users can choose between lightweight Tesseract.js or powerful PP-OCRv5 based on their needs, all without sending images to any server.
Try it yourself at Free Image Tools
Experience the power of browser-based OCR. No upload required - your images stay on your device!

