Day 6. Swapped Tesseract for ML Kit. The agent is now 5x faster and handles interruptions.
Last post was about walls. OCR was slow. Interruptions broke everything. The phone overheated.
Today is about solutions. Two of those three walls just came down.
The OCR Problem
Tesseract was taking 8-12 seconds per screen scan. For a single action, that's fine. For a 5-step task, that's 40-60 seconds of waiting for text recognition alone. On a device that's already fighting thermal throttling, that's unacceptable.
I spent the weekend researching alternatives. Considered a custom-trained model (too heavy for a phone). Considered cloud APIs (defeats the whole "offline agent" purpose). Landed on something better.
Enter ML Kit
Google's ML Kit offers on-device text recognition that runs without internet. It's optimized for mobile CPUs. It's free. And it's fast.
I swapped Tesseract for ML Kit's text recognition API. The result:
| Metric | Tesseract (Old) | ML Kit (New) |
|---|---|---|
| Screen scan time | 8-12 seconds | 1.5-2 seconds |
| Accuracy on clean text | ~85% | ~95% |
| Accuracy on noisy backgrounds | ~60% | ~85% |
| RAM usage | High (loads language model) | Low (native optimization) |
The agent now scans a screen in 2 seconds or less. A full 5-step task that used to take 60+ seconds now completes in 20-25 seconds. That's still not instant, but it's usable. OCR is no longer the bottleneck.
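A rough sketch of the arithmetic behind those numbers. It assumes exactly one screen scan per step, which is a simplification and not something measured here; `ocr_seconds` is a hypothetical helper, not part of the repo.

```python
def ocr_seconds(steps, per_scan_range):
    """Total OCR time for a task, assuming exactly one screen scan per step."""
    low, high = per_scan_range
    return steps * low, steps * high


STEPS = 5  # the 5-step task used as the example above

# Per-scan ranges taken from the table: 8-12 s (Tesseract), 1.5-2 s (ML Kit).
tesseract_total = ocr_seconds(STEPS, (8.0, 12.0))
mlkit_total = ocr_seconds(STEPS, (1.5, 2.0))

print(f"Tesseract OCR per task: {tesseract_total[0]}-{tesseract_total[1]} s")
print(f"ML Kit OCR per task:    {mlkit_total[0]}-{mlkit_total[1]} s")
```

Under that assumption, OCR drops from 40-60 seconds to 7.5-10 seconds per task, and the rest of the 20-25 second total is everything else the agent does between scans.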
The Implementation
ML Kit runs as a separate service in Termux. The Python agent sends an HTTP request to localhost:8080 with a screenshot, and ML Kit returns structured text with bounding boxes. The vision.py module now has an mlkit_extract() function alongside the old Tesseract one, with a fallback: if ML Kit fails, the agent reverts to Tesseract.
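A minimal sketch of what that request-plus-fallback flow could look like. The `/ocr` path, the `image/png` body, and the JSON response shape are all assumptions for illustration (only the host and port are given above), and `extract_text` is a hypothetical wrapper name. The actual code in vision.py may differ.

```python
import json
import urllib.request

# Hypothetical endpoint path: only localhost:8080 is stated in the post.
MLKIT_URL = "http://localhost:8080/ocr"


def mlkit_extract(png_bytes, url=MLKIT_URL, timeout=5.0):
    """POST a screenshot to the local OCR service and return its text blocks."""
    request = urllib.request.Request(
        url,
        data=png_bytes,
        headers={"Content-Type": "image/png"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # Assumed response shape: {"blocks": [{"text": "...", "box": [l, t, r, b]}]}
    return payload["blocks"]


def extract_text(png_bytes, primary=mlkit_extract, fallback=None):
    """Try the primary OCR backend; on failure, use the fallback if one is given."""
    try:
        return primary(png_bytes)
    except (OSError, ValueError, KeyError):
        # OSError covers connection errors and timeouts, ValueError covers
        # malformed JSON, KeyError covers a response without "blocks".
        if fallback is None:
            raise
        return fallback(png_bytes)
```

Passing the backends in as arguments keeps the fallback logic testable without a running service: the Tesseract function would go in as `fallback`.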
The Interruption Handler (First Pass)
Last week, a WhatsApp call during a task broke everything. The agent tapped the notification instead of the target button. Then it got lost.
I wrote a simple pre-action check:
- Before every tap, the agent scans the screen for known interruption patterns: incoming call UI, notification headers, "Update Available" dialogs.
- If it finds one, it dismisses it first (back button via ADB) before continuing the task.
- If the screen is locked, it sends a wake command and waits for unlock.
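The first two steps above could be sketched roughly like this. The trigger phrases, the retry limit, and the function names are assumptions for illustration, not the actual contents of agent.py. Notification headers and the lock-screen case would need their own checks and are left out here.

```python
import subprocess

# Assumed trigger phrases, matched against OCR text from the current screen.
INTERRUPTION_PATTERNS = {
    "incoming_call": ("incoming call", "incoming voice call"),
    "update_dialog": ("update available",),
}


def detect_interruption(screen_text):
    """Return the name of the first matching interruption pattern, or None."""
    lowered = screen_text.lower()
    for name, phrases in INTERRUPTION_PATTERNS.items():
        if any(phrase in lowered for phrase in phrases):
            return name
    return None


def press_back():
    """Send the Android back key through ADB."""
    subprocess.run(
        ["adb", "shell", "input", "keyevent", "KEYCODE_BACK"], check=True
    )


def pre_action_check(read_screen, dismiss=press_back, max_attempts=3):
    """Dismiss known interruptions before a tap.

    read_screen is a callable returning the current screen's OCR text.
    Returns True once the screen is clear, False if it never clears.
    """
    for _ in range(max_attempts):
        if detect_interruption(read_screen()) is None:
            return True
        dismiss()
    return False
```

Re-scanning after each dismissal (rather than assuming one back press was enough) is a design choice in this sketch. It costs an extra OCR pass but avoids tapping into a dialog that is still on screen.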
It's not perfect. A truly random interruption can still break the chain. But the most common ones (calls, notifications, system dialogs) are now handled. The agent recovers instead of crashing.
What's Still Broken
- Thermal throttling remains unsolved. After 15 minutes of continuous use, the phone slows down. I'm experimenting with adding 30-second cooldown pauses between tasks, but that hurts the user experience.
- Image-only UI elements (icons without text) are still invisible. ML Kit helps with text, but a camera icon or a send button with no label is just a shape to the agent. I'm exploring template matching next.
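For the cooldown experiment in the first item, a minimal sketch might look like the following. `run_with_cooldown` is a hypothetical helper and the fixed 30-second pause is the only policy shown. It does not read the device temperature.

```python
import time


def run_with_cooldown(tasks, cooldown_seconds=30.0, sleep=time.sleep):
    """Run tasks in order, pausing between them (not before the first)."""
    results = []
    for index, task in enumerate(tasks):
        if index > 0:
            sleep(cooldown_seconds)  # give the phone time to shed heat
        results.append(task())
    return results
```

Each task is a zero-argument callable. Three tasks would incur two pauses, so about a minute of added waiting at the default setting, which is the user-experience cost mentioned above.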
What's Next (Day 7)
- Add template matching for icon-based buttons
- Test the agent on a second device to see if performance varies
- Record and upload the first full demo video
The Repo
👉 github.com/Dexter2344/phone-agent
vision.py now supports both Tesseract and ML Kit. agent.py includes the interruption handler. README updated with Day 6 status.
This is Day 6. The walls are coming down. The build continues.