Project Log #6: I Fixed the OCR Problem That Was Killing My AI Agent

#ai #automation #programming #webdev

Day 6. Swapped Tesseract for ML Kit. Screen scans are now about 5x faster, and the agent handles the most common interruptions.

Last post was about walls. OCR was slow. Interruptions broke everything. The phone overheated.

Today is about solutions. Two of those three walls just came down.

The OCR Problem

Tesseract was taking 8-12 seconds per screen scan. For a single action, that's fine. For a 5-step task, that's 40 to 60 seconds of waiting for text recognition alone. On a device that's already fighting thermal throttling, that's unacceptable.

I spent the weekend researching alternatives. Considered a custom-trained model (too heavy for a phone). Considered cloud APIs (defeats the whole "offline agent" purpose). Landed on something better.

Enter ML Kit

Google's ML Kit offers on-device text recognition that runs without internet. It's optimized for mobile CPUs. It's free. And it's fast.

I swapped Tesseract for ML Kit's text recognition API. The result:

Metric	Tesseract (Old)	ML Kit (New)
Screen scan time	8-12 seconds	1.5-2 seconds
Accuracy on clean text	~85%	~95%
Accuracy on noisy backgrounds	~60%	~85%
RAM usage	High (loads language model)	Low (native optimization)

The agent now scans a screen in 2 seconds or less. A full 5-step task that used to take 60+ seconds now completes in 20-25 seconds. That's still not instant, but it's usable. OCR is no longer the bottleneck.
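To see where the time goes, here is a quick back-of-the-envelope calculation using the per-scan numbers from the table. It assumes exactly one OCR scan per step, which is a simplification (the real agent may scan more often).

```python
STEPS = 5  # the 5-step task from the post


def ocr_seconds(per_scan, steps=STEPS):
    """Total OCR time range for a task, assuming one scan per step."""
    low, high = per_scan
    return (low * steps, high * steps)


tesseract_total = ocr_seconds((8, 12))   # (40, 60)
mlkit_total = ocr_seconds((1.5, 2.0))    # (7.5, 10.0)

print("Tesseract OCR per task:", tesseract_total)
print("ML Kit OCR per task:", mlkit_total)
```

Under that assumption, OCR would account for roughly 7.5 to 10 of the 20-25 seconds, which suggests the rest is everything else the agent does. That would explain why a 5x faster scan translates into something closer to a 3x faster task.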

The Implementation

ML Kit runs as a separate service in Termux. The Python agent sends an HTTP request to localhost:8080 with a screenshot, and ML Kit returns structured text with bounding boxes. The vision.py module now has an mlkit_extract() function alongside the old Tesseract one, with a fallback: if ML Kit fails, the agent reverts to Tesseract.
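A minimal sketch of what the client side of that could look like, using only the standard library. The `/ocr` path, the `image/png` upload, and the JSON response shape are assumptions for illustration; the post only specifies `localhost:8080`. The `extract_with_fallback()` helper is also hypothetical and takes both engines as plain callables.

```python
import json
import urllib.request

# Assumed endpoint: only "localhost:8080" comes from the post; "/ocr" is hypothetical.
MLKIT_URL = "http://localhost:8080/ocr"


def mlkit_extract(png_bytes, url=MLKIT_URL, timeout=5.0):
    """POST a screenshot to the local OCR service and return its text blocks.

    Assumes (hypothetically) a JSON reply shaped like
    {"blocks": [{"text": "Send", "box": [left, top, right, bottom]}, ...]}.
    """
    request = urllib.request.Request(
        url,
        data=png_bytes,
        headers={"Content-Type": "image/png"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["blocks"]


def extract_with_fallback(png_bytes, primary, fallback):
    """Try the primary OCR engine; if it fails, use the fallback engine.

    Returns (blocks, engine_name) so the caller can log which engine ran.
    """
    try:
        return primary(png_bytes), "primary"
    except (OSError, ValueError, KeyError):
        # OSError: connection refused or timeout (URLError subclasses it).
        # ValueError: malformed JSON. KeyError: reply has no "blocks" field.
        return fallback(png_bytes), "fallback"
```

In use, the call would be something like `extract_with_fallback(screenshot, mlkit_extract, tesseract_extract)`, where `tesseract_extract` stands in for the existing Tesseract function. Returning the engine name makes it easy to count how often the fallback actually fires.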

The Interruption Handler (First Pass)

Last week, a WhatsApp call during a task broke everything. The agent tapped the notification instead of the target button. Then it got lost.

I wrote a simple pre-action check:

Before every tap, the agent scans the screen for known interruption patterns: incoming call UI, notification headers, "Update Available" dialogs.
If it finds one, it dismisses it first (back button via ADB) before continuing the task.
If the screen is locked, it sends a wake command and waits for unlock.
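The check above can be sketched roughly like this. The pattern strings are hypothetical and would need tuning per phone, language, and app; notification headers are left out because they vary too much to guess. The key events go through `adb shell input keyevent`, and the sender is injectable so the logic can be exercised without a device.

```python
import re
import subprocess

# Hypothetical patterns; real strings depend on the phone's language and apps.
INTERRUPTION_PATTERNS = {
    "incoming_call": re.compile(r"incoming (voice |video )?call", re.IGNORECASE),
    "update_dialog": re.compile(r"update available", re.IGNORECASE),
}


def detect_interruption(screen_text):
    """Return the name of the first matching interruption pattern, or None."""
    for name, pattern in INTERRUPTION_PATTERNS.items():
        if pattern.search(screen_text):
            return name
    return None


def adb_keyevent(keycode):
    """Send an Android key event through ADB (requires adb on PATH)."""
    subprocess.run(["adb", "shell", "input", "keyevent", keycode], check=True)


def pre_action_check(screen_text, screen_locked, send_key=adb_keyevent):
    """Run before every tap; return True only when the tap can proceed."""
    if screen_locked:
        send_key("KEYCODE_WAKEUP")
        return False  # caller waits for unlock, then scans again
    if detect_interruption(screen_text) is not None:
        send_key("KEYCODE_BACK")
        return False  # caller scans again before tapping
    return True
```

Returning `False` instead of tapping right away forces a fresh scan after every dismissal, so the agent never taps based on a screen that no longer exists.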

It's not perfect. A truly random interruption can still break the chain. But the most common ones—calls, notifications, system dialogs—are now handled. The agent recovers instead of crashing.

What's Still Broken

Thermal throttling remains unsolved. After 15 minutes of continuous use, the phone slows down. I'm experimenting with adding 30-second cooldown pauses between tasks, but that hurts the user experience.
Image-only UI elements (icons without text) are still invisible. ML Kit helps with text, but a camera icon or a send button with no label is just a shape to the agent. I'm exploring template matching next.
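For the icon problem, here is a toy illustration of the idea behind template matching: slide a small reference image over the screen and keep the position where the pixels differ least. This is a pure-Python sketch on tiny grayscale grids, not what runs on the phone; a real version would more likely use a library routine such as OpenCV's `cv2.matchTemplate`. The function name and data are made up for the example.

```python
def match_template(image, template):
    """Slide template over image; return (row, col, score) of the best match.

    Both arguments are 2D lists of grayscale values (0-255). The score is the
    sum of absolute differences, so 0 means a pixel-perfect match.
    Returns None if the template is larger than the image.
    """
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(template), len(template[0])
    best = None
    for row in range(img_h - tpl_h + 1):
        for col in range(img_w - tpl_w + 1):
            score = sum(
                abs(image[row + r][col + c] - template[r][c])
                for r in range(tpl_h)
                for c in range(tpl_w)
            )
            if best is None or score < best[2]:
                best = (row, col, score)
    return best


# Toy 4x5 "screen" with a 2x2 "icon" whose top-left corner is at row 1, col 2.
screen = [
    [0, 0, 0, 0, 0],
    [0, 0, 255, 200, 0],
    [0, 0, 200, 255, 0],
    [0, 0, 0, 0, 0],
]
icon = [
    [255, 200],
    [200, 255],
]
print(match_template(screen, icon))  # (1, 2, 0)
```

On real screenshots the best score will rarely be 0, so a threshold would be needed to decide whether the icon is actually present.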

What's Next (Day 7)