OCR on mobile needs to be fast. Users expect results in under 2 seconds. When I started building Screen Translator, our initial OCR pipeline took 4-5 seconds per screen capture. That's an eternity when you're trying to read a game menu or translate a chat message in real time.
Here's how we got it down to under 1 second on modern devices.
The Bottlenecks
Before optimizing, we profiled the pipeline:
- Screen capture: ~200ms (MediaProjection API)
- Image preprocessing: ~800ms 😱
- OCR inference: ~2500ms 😱😱
- Translation API call: ~500ms
- UI rendering: ~100ms
Total: ~4100ms. Steps 2 and 3 were the obvious targets.
Optimization 1: Smart Image Downscaling
The biggest win came from not feeding full-resolution screenshots to the OCR engine.
```kotlin
fun optimizeForOCR(bitmap: Bitmap): Bitmap {
    val maxDimension = 1280 // Sweet spot for accuracy vs speed
    val scale = minOf(
        maxDimension.toFloat() / bitmap.width,
        maxDimension.toFloat() / bitmap.height,
        1f // Don't upscale
    )
    if (scale >= 1f) return bitmap
    return Bitmap.createScaledBitmap(
        bitmap,
        (bitmap.width * scale).toInt(),
        (bitmap.height * scale).toInt(),
        true // Bilinear filtering
    )
}
```
A 2400x1080 screenshot scaled to 1280x576 has roughly 3.5x fewer pixels and processes about 3x faster, with negligible accuracy loss for screen text.
Result: Image preprocessing dropped from 800ms to 250ms.
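The scale math is worth sanity-checking on its own. Here it is isolated from Android's Bitmap API as a plain function (`scaledDimensions` is a hypothetical helper for illustration, not part of the app's code):

```kotlin
import kotlin.math.roundToInt

// Compute the target dimensions for OCR downscaling: cap the longest
// side at maxDimension, preserve aspect ratio, and never upscale.
fun scaledDimensions(width: Int, height: Int, maxDimension: Int = 1280): Pair<Int, Int> {
    val scale = minOf(
        maxDimension.toDouble() / width,
        maxDimension.toDouble() / height,
        1.0 // don't upscale
    )
    return Pair((width * scale).roundToInt(), (height * scale).roundToInt())
}
```

For a 2400x1080 capture this yields 1280x576, matching the numbers above; an 800x600 capture passes through untouched.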
Optimization 2: Region of Interest (ROI) Detection
Why OCR the entire screen when the user only cares about a specific area?
```kotlin
fun detectTextRegions(bitmap: Bitmap): List<Rect> {
    // Convert to grayscale
    val gray = toGrayscale(bitmap)
    // Apply adaptive threshold
    val binary = adaptiveThreshold(gray)
    // Find contours and merge nearby text blocks
    val contours = findContours(binary)
    return mergeNearbyContours(contours, mergeDistance = 20)
}
```
By detecting text regions first (which is fast — ~50ms), we only run the expensive OCR on areas that actually contain text. For a typical app screen, this means processing 30-40% of the image instead of 100%.
Result: OCR inference dropped from 2500ms to ~800ms.
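`mergeNearbyContours` is elided above; one way it could work is a greedy pass that unions any two bounding boxes whose gap is within `mergeDistance`. This is a hypothetical sketch over a plain `Rect` data class (a stand-in for `android.graphics.Rect`), not the app's actual implementation:

```kotlin
// Stand-in for android.graphics.Rect so the sketch runs anywhere.
data class Rect(val left: Int, val top: Int, val right: Int, val bottom: Int)

// True if the boxes overlap or are within d pixels of each other.
fun near(a: Rect, b: Rect, d: Int): Boolean =
    a.left <= b.right + d && b.left <= a.right + d &&
    a.top <= b.bottom + d && b.top <= a.bottom + d

// Smallest box containing both inputs.
fun union(a: Rect, b: Rect) = Rect(
    minOf(a.left, b.left), minOf(a.top, b.top),
    maxOf(a.right, b.right), maxOf(a.bottom, b.bottom)
)

// Greedily merge until no two boxes are within mergeDistance.
fun mergeNearbyContours(boxes: List<Rect>, mergeDistance: Int = 20): List<Rect> {
    val merged = boxes.toMutableList()
    var changed = true
    while (changed) {
        changed = false
        outer@ for (i in merged.indices) {
            for (j in i + 1 until merged.size) {
                if (near(merged[i], merged[j], mergeDistance)) {
                    merged[i] = union(merged[i], merged[j])
                    merged.removeAt(j)
                    changed = true
                    break@outer
                }
            }
        }
    }
    return merged
}
```

The quadratic loop is fine here because a screen rarely produces more than a few dozen text contours.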
Optimization 3: ML Kit On-Device vs Cloud
We use Google ML Kit's on-device text recognition as the default. It's free, fast, and works offline. For CJK languages (Chinese, Japanese, Korean), we use the V2 API which has significantly better accuracy.
```kotlin
val recognizer = TextRecognition.getClient(
    when (scriptType) {
        ScriptType.LATIN -> TextRecognizerOptions.DEFAULT_OPTIONS
        ScriptType.CJK -> ChineseTextRecognizerOptions.Builder().build()
        ScriptType.KOREAN -> KoreanTextRecognizerOptions.Builder().build()
        ScriptType.JAPANESE -> JapaneseTextRecognizerOptions.Builder().build()
        ScriptType.DEVANAGARI -> DevanagariTextRecognizerOptions.Builder().build()
    }
)
```
The key insight: choose the right recognizer upfront. Running the Latin recognizer on Japanese text wastes time and gives garbage results. We detect the likely script from user settings and previous results.
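One cheap way to guess the script from a previous OCR result is to count code points per Unicode block. This is a hypothetical heuristic, not an ML Kit API; `guessScript` and this local `ScriptType` enum are illustrative only:

```kotlin
enum class ScriptType { LATIN, CJK, KOREAN, JAPANESE, DEVANAGARI }

// Classify a text sample by which Unicode blocks its characters fall in.
fun guessScript(sample: String): ScriptType {
    var han = 0; var kana = 0; var hangul = 0; var devanagari = 0
    for (ch in sample) {
        when (ch.code) {
            in 0x4E00..0x9FFF -> han++        // CJK Unified Ideographs
            in 0x3040..0x30FF -> kana++       // Hiragana + Katakana
            in 0xAC00..0xD7AF -> hangul++     // Hangul Syllables
            in 0x0900..0x097F -> devanagari++ // Devanagari
        }
    }
    return when {
        kana > 0 -> ScriptType.JAPANESE       // kana implies Japanese even alongside Han
        hangul > 0 -> ScriptType.KOREAN
        han > 0 -> ScriptType.CJK
        devanagari > 0 -> ScriptType.DEVANAGARI
        else -> ScriptType.LATIN
    }
}
```

Checking kana before Han matters: Japanese text mixes kanji with hiragana/katakana, so any kana at all is a strong Japanese signal.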
Optimization 4: Background Threading with Coroutines
Never block the main thread. We use Kotlin coroutines with a dedicated dispatcher:
```kotlin
private val ocrDispatcher = Dispatchers.Default.limitedParallelism(2)

suspend fun processScreen(): TranslationResult = withContext(ocrDispatcher) {
    val capture = captureScreen()              // ~200ms
    val optimized = optimizeForOCR(capture)    // ~50ms
    val regions = detectTextRegions(optimized) // ~50ms

    // Process regions in parallel
    val results = regions.map { region ->
        async {
            val cropped = cropRegion(optimized, region)
            recognizeText(cropped)
        }
    }.awaitAll()

    // Translate in batch
    translateBatch(results) // ~400ms
}
```
Processing multiple text regions in parallel on multi-core devices gives us another 20-30% speedup.
Optimization 5: Caching
If the screen hasn't changed much, don't re-OCR everything.
```kotlin
class OCRCache(private val maxSize: Int = 50) {
    private val cache = LruCache<Long, OCRResult>(maxSize)

    fun getOrProcess(bitmap: Bitmap, process: () -> OCRResult): OCRResult {
        val hash = computePerceptualHash(bitmap)
        cache.get(hash)?.let { return it }
        return process().also { cache.put(hash, it) }
    }

    private fun computePerceptualHash(bitmap: Bitmap): Long {
        // Downscale to 8x8, convert to grayscale, compute average
        // Compare each pixel to average -> 64-bit hash
        val small = Bitmap.createScaledBitmap(bitmap, 8, 8, true)
        // ... hash computation
    }
}
```
Perceptual hashing means slightly different screenshots (e.g., a blinking cursor) still hit the cache.
Result: Repeated translations are instant (~10ms).
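The hash computation elided above is a standard average hash (aHash). Here's a minimal sketch over a plain 64-entry grayscale array rather than a Bitmap, with a Hamming-distance check for "close enough" screens; `averageHash` and `hammingDistance` are illustrative names, not the app's actual code:

```kotlin
// aHash over an 8x8 grayscale grid (64 values in 0..255): set bit i
// when pixel i is at or above the grid's average brightness.
fun averageHash(pixels: IntArray): Long {
    require(pixels.size == 64) { "expected an 8x8 grid" }
    val avg = pixels.sum() / 64.0
    var hash = 0L
    for (i in 0 until 64) {
        if (pixels[i] >= avg) hash = hash or (1L shl i)
    }
    return hash
}

// Number of differing bits; a small distance means the screens are
// near-identical and can be treated as a cache hit.
fun hammingDistance(a: Long, b: Long): Int = (a xor b).countOneBits()
```

A one-pixel change (the blinking cursor case) flips at most a bit or two, so comparing hashes with a small distance threshold, rather than requiring exact equality, keeps those frames in the cache.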
Final Numbers
After all optimizations on a mid-range device (Snapdragon 695):
| Step | Before | After |
|---|---|---|
| Screen capture | 200ms | 200ms |
| Image preprocessing | 800ms | 50ms |
| ROI detection | N/A | 50ms |
| OCR inference | 2500ms | 400ms |
| Translation | 500ms | 400ms |
| UI rendering | 100ms | 50ms |
| Total | 4100ms | ~800ms |
On flagship devices (Snapdragon 8 Gen 3), we're seeing 400-500ms total.
Key Takeaways
- Profile first — don't guess where the bottleneck is
- Downscale aggressively — screen text is high contrast, OCR handles lower resolution well
- ROI detection is cheap and saves massive OCR time
- Choose the right ML model for the script type
- Cache everything — screens don't change that often
- Parallelize where possible with coroutines
These techniques aren't specific to our app. If you're building anything with on-device OCR, these patterns will help.
If you want to see these optimizations in action, check out Screen Translator on Google Play.
What OCR performance challenges have you faced on mobile? Drop your experiences in the comments.