OCR on mobile needs to be fast. Users expect results in under 2 seconds. When I started building Screen Translator, our initial OCR pipeline took 4-5 seconds per screen capture. That's an eternity when you're trying to read a game menu or translate a chat message in real time.
Here's how we got it down to under 1 second on modern devices.
The Bottlenecks
Before optimizing, we profiled the pipeline:
- Screen capture: ~200ms (MediaProjection API)
- Image preprocessing: ~800ms 😱
- OCR inference: ~2500ms 😱😱
- Translation API call: ~500ms
- UI rendering: ~100ms
Total: ~4100ms. Steps 2 and 3 were the obvious targets.
Optimization 1: Smart Image Downscaling
The biggest win came from not feeding full-resolution screenshots to the OCR engine.
```kotlin
fun optimizeForOCR(bitmap: Bitmap): Bitmap {
    val maxDimension = 1280 // Sweet spot for accuracy vs speed
    val scale = minOf(
        maxDimension.toFloat() / bitmap.width,
        maxDimension.toFloat() / bitmap.height,
        1f // Don't upscale
    )
    if (scale >= 1f) return bitmap
    return Bitmap.createScaledBitmap(
        bitmap,
        (bitmap.width * scale).toInt(),
        (bitmap.height * scale).toInt(),
        true // Bilinear filtering
    )
}
```
A 2400x1080 screenshot scaled to 1280x576 has roughly 3.5x fewer pixels and processes about 3x faster, with negligible accuracy loss for screen text.
Result: Image preprocessing dropped from 800ms to 250ms.
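The scale math is worth sanity-checking on its own. Here it is isolated from Android's Bitmap API as a plain function (`scaledDimensions` is a hypothetical helper for illustration, not part of the app's code):

```kotlin
import kotlin.math.roundToInt

// Compute the target dimensions for OCR downscaling: cap the longest
// side at maxDimension, preserve aspect ratio, and never upscale.
fun scaledDimensions(width: Int, height: Int, maxDimension: Int = 1280): Pair<Int, Int> {
    val scale = minOf(
        maxDimension.toDouble() / width,
        maxDimension.toDouble() / height,
        1.0 // don't upscale
    )
    return Pair((width * scale).roundToInt(), (height * scale).roundToInt())
}
```

For a 2400x1080 capture this yields 1280x576, matching the numbers above; an 800x600 capture passes through untouched.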
Optimization 2: Region of Interest (ROI) Detection
Why OCR the entire screen when the user only cares about a specific area?
```kotlin
fun detectTextRegions(bitmap: Bitmap): List<Rect> {
    // Convert to grayscale
    val gray = toGrayscale(bitmap)
    // Apply adaptive threshold
    val binary = adaptiveThreshold(gray)
    // Find contours and merge nearby text blocks
    val contours = findContours(binary)
    return mergeNearbyContours(contours, mergeDistance = 20)
}
```
By detecting text regions first (which is fast — ~50ms), we only run the expensive OCR on areas that actually contain text. For a typical app screen, this means processing 30-40% of the image instead of 100%.
Result: OCR inference dropped from 2500ms to ~800ms.
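`mergeNearbyContours` is elided above; one way it could work is a greedy pass that unions any two bounding boxes whose gap is within `mergeDistance`. This is a hypothetical sketch over a plain `Rect` data class (a stand-in for `android.graphics.Rect`), not the app's actual implementation:

```kotlin
// Stand-in for android.graphics.Rect so the sketch runs anywhere.
data class Rect(val left: Int, val top: Int, val right: Int, val bottom: Int)

// True if the boxes overlap or are within d pixels of each other.
fun near(a: Rect, b: Rect, d: Int): Boolean =
    a.left <= b.right + d && b.left <= a.right + d &&
    a.top <= b.bottom + d && b.top <= a.bottom + d

// Smallest box containing both inputs.
fun union(a: Rect, b: Rect) = Rect(
    minOf(a.left, b.left), minOf(a.top, b.top),
    maxOf(a.right, b.right), maxOf(a.bottom, b.bottom)
)

// Greedily merge until no two boxes are within mergeDistance.
fun mergeNearbyContours(boxes: List<Rect>, mergeDistance: Int = 20): List<Rect> {
    val merged = boxes.toMutableList()
    var changed = true
    while (changed) {
        changed = false
        outer@ for (i in merged.indices) {
            for (j in i + 1 until merged.size) {
                if (near(merged[i], merged[j], mergeDistance)) {
                    merged[i] = union(merged[i], merged[j])
                    merged.removeAt(j)
                    changed = true
                    break@outer
                }
            }
        }
    }
    return merged
}
```

The quadratic loop is fine here because a screen rarely produces more than a few dozen text contours.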
Optimization 3: ML Kit On-Device vs Cloud
We use Google ML Kit's on-device text recognition as the default. It's free, fast, and works offline. For CJK languages (Chinese, Japanese, Korean), we use the V2 API which has significantly better accuracy.
```kotlin
val recognizer = TextRecognition.getClient(
    when (scriptType) {
        ScriptType.LATIN -> TextRecognizerOptions.DEFAULT_OPTIONS
        ScriptType.CJK -> ChineseTextRecognizerOptions.Builder().build()
        ScriptType.KOREAN -> KoreanTextRecognizerOptions.Builder().build()
        ScriptType.JAPANESE -> JapaneseTextRecognizerOptions.Builder().build()
        ScriptType.DEVANAGARI -> DevanagariTextRecognizerOptions.Builder().build()
    }
)
```
The key insight: choose the right recognizer upfront. Running the Latin recognizer on Japanese text wastes time and gives garbage results. We detect the likely script from user settings and previous results.
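One cheap way to guess the script from a previous OCR result is to count code points per Unicode block. This is a hypothetical heuristic, not an ML Kit API; `guessScript` and this local `ScriptType` enum are illustrative only:

```kotlin
enum class ScriptType { LATIN, CJK, KOREAN, JAPANESE, DEVANAGARI }

// Classify a text sample by which Unicode blocks its characters fall in.
fun guessScript(sample: String): ScriptType {
    var han = 0; var kana = 0; var hangul = 0; var devanagari = 0
    for (ch in sample) {
        when (ch.code) {
            in 0x4E00..0x9FFF -> han++        // CJK Unified Ideographs
            in 0x3040..0x30FF -> kana++       // Hiragana + Katakana
            in 0xAC00..0xD7AF -> hangul++     // Hangul Syllables
            in 0x0900..0x097F -> devanagari++ // Devanagari
        }
    }
    return when {
        kana > 0 -> ScriptType.JAPANESE       // kana implies Japanese even alongside Han
        hangul > 0 -> ScriptType.KOREAN
        han > 0 -> ScriptType.CJK
        devanagari > 0 -> ScriptType.DEVANAGARI
        else -> ScriptType.LATIN
    }
}
```

Checking kana before Han matters: Japanese text mixes kanji with hiragana/katakana, so any kana at all is a strong Japanese signal.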
Optimization 4: Background Threading with Coroutines
Never block the main thread. We use Kotlin coroutines with a dedicated dispatcher:
```kotlin
private val ocrDispatcher = Dispatchers.Default.limitedParallelism(2)

suspend fun processScreen(): TranslationResult = withContext(ocrDispatcher) {
    val capture = captureScreen()              // ~200ms
    val optimized = optimizeForOCR(capture)    // ~50ms
    val regions = detectTextRegions(optimized) // ~50ms

    // Process regions in parallel
    val results = regions.map { region ->
        async {
            val cropped = cropRegion(optimized, region)
            recognizeText(cropped)
        }
    }.awaitAll()

    // Translate in batch
    translateBatch(results) // ~400ms
}
```
Processing multiple text regions in parallel on multi-core devices gives us another 20-30% speedup.
Optimization 5: Caching
If the screen hasn't changed much, don't re-OCR everything.
```kotlin
class OCRCache(private val maxSize: Int = 50) {
    private val cache = LruCache<Long, OCRResult>(maxSize)

    fun getOrProcess(bitmap: Bitmap, process: () -> OCRResult): OCRResult {
        val hash = computePerceptualHash(bitmap)
        cache.get(hash)?.let { return it }
        return process().also { cache.put(hash, it) }
    }

    private fun computePerceptualHash(bitmap: Bitmap): Long {
        // Downscale to 8x8, convert to grayscale, compute average
        // Compare each pixel to average -> 64-bit hash
        val small = Bitmap.createScaledBitmap(bitmap, 8, 8, true)
        // ... hash computation
    }
}
```
Perceptual hashing means slightly different screenshots (e.g., a blinking cursor) still hit the cache.
Result: Repeated translations are instant (~10ms).
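The hash computation elided above is a standard average hash (aHash). Here's a minimal sketch over a plain 64-entry grayscale array rather than a Bitmap, with a Hamming-distance check for "close enough" screens; `averageHash` and `hammingDistance` are illustrative names, not the app's actual code:

```kotlin
// aHash over an 8x8 grayscale grid (64 values in 0..255): set bit i
// when pixel i is at or above the grid's average brightness.
fun averageHash(pixels: IntArray): Long {
    require(pixels.size == 64) { "expected an 8x8 grid" }
    val avg = pixels.sum() / 64.0
    var hash = 0L
    for (i in 0 until 64) {
        if (pixels[i] >= avg) hash = hash or (1L shl i)
    }
    return hash
}

// Number of differing bits; a small distance means the screens are
// near-identical and can be treated as a cache hit.
fun hammingDistance(a: Long, b: Long): Int = (a xor b).countOneBits()
```

A one-pixel change (the blinking cursor case) flips at most a bit or two, so comparing hashes with a small distance threshold, rather than requiring exact equality, keeps those frames in the cache.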
Final Numbers
After all optimizations on a mid-range device (Snapdragon 695):
| Step | Before | After |
|---|---|---|
| Screen capture | 200ms | 200ms |
| Image preprocessing | 800ms | 50ms |
| ROI detection | N/A | 50ms |
| OCR inference | 2500ms | 400ms |
| Translation | 500ms | 400ms |
| UI rendering | 100ms | 50ms |
| Total | 4100ms | ~800ms |
On flagship devices (Snapdragon 8 Gen 3), we're seeing 400-500ms total.
Key Takeaways
- Profile first — don't guess where the bottleneck is
- Downscale aggressively — screen text is high contrast, OCR handles lower resolution well
- ROI detection is cheap and saves massive OCR time
- Choose the right ML model for the script type
- Cache everything — screens don't change that often
- Parallelize where possible with coroutines
These techniques aren't specific to our app. If you're building anything with on-device OCR, these patterns will help.
If you want to see these optimizations in action, check out Screen Translator on Google Play.
What OCR performance challenges have you faced on mobile? Drop your experiences in the comments.