DEV Community

Okeke Chukwudubem

Project Log #3: The AI Phone Agent Can Now See

Day 3. Screen text detection is working. The agent is starting to understand what's on the screen.

The first two days were about the foundation—getting Gemma 4 to talk to ADB, parsing commands, and creating the repo.

Today was about solving the hardest problem: how does the agent know what's on the screen?

The Problem

An AI that can't see is useless. You can tell it "open WhatsApp and message Mom," but once WhatsApp opens, the agent is blind. It doesn't know where the search bar is. It can't find Mom's name. It can't locate the message box.

I needed a way for the agent to read the screen—offline, on a phone, without cloud APIs.

The Repo

👉 github.com/Dexter2344/phone-agent

The new vision.py module is live. It handles screenshot capture, OCR text extraction, and text-to-coordinate mapping.

Today's Progress

Task | Status
Installed Tesseract OCR in Termux | ✅ Done
Captured a screenshot via ADB | ✅ Working
Extracted text from the screenshot | ✅ Working
Mapped text positions to screen coordinates | 🔧 In progress
Tested on WhatsApp contact list | ✅ First successful read

How It Works

The pipeline is simple:

  1. Capture: ADB takes a screenshot and saves it locally.
  2. Extract: Tesseract OCR scans the image and returns all text it finds, along with bounding box coordinates.
  3. Map: A Python script matches the extracted text to known UI elements. If the agent needs to find "Mom," it scans the OCR output for the word "Mom," grabs the coordinates, and tells ADB to tap there.
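
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of vision.py: the function names (capture_screenshot, run_ocr, parse_words, find_tap_point, tap) and the file name screen.png are all hypothetical. It assumes the adb and tesseract command-line tools are on the PATH and that Tesseract is asked for its TSV output, which lists each word with left, top, width, and height columns.

```python
import csv
import io
import subprocess


def capture_screenshot(path="screen.png"):
    """Step 1 (Capture): save a PNG screenshot of the device via ADB."""
    with open(path, "wb") as f:
        subprocess.run(["adb", "exec-out", "screencap", "-p"], stdout=f, check=True)
    return path


def run_ocr(path):
    """Step 2 (Extract): run the Tesseract CLI and return its TSV output."""
    result = subprocess.run(
        ["tesseract", path, "stdout", "tsv"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_words(tsv_text):
    """Turn Tesseract TSV into a list of (text, left, top, width, height)."""
    words = []
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        text = (row.get("text") or "").strip()
        if not text:
            continue  # page, block, and line rows carry no text
        words.append(
            (text, int(row["left"]), int(row["top"]),
             int(row["width"]), int(row["height"]))
        )
    return words


def find_tap_point(words, target):
    """Step 3 (Map): centre of the first word equal to target, else None."""
    for text, left, top, width, height in words:
        if text.lower() == target.lower():
            return (left + width // 2, top + height // 2)
    return None


def tap(x, y):
    """Ask ADB to tap the given screen coordinates."""
    subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)
```

A full pass would chain these: parse_words(run_ocr(capture_screenshot())), then find_tap_point(words, "Mom"), then tap(x, y) if a point came back. Tapping the centre of the bounding box rather than its corner leaves some margin if the box is slightly off.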

All of this runs inside Termux. No internet. No cloud vision API. Just open-source OCR on a phone CPU.

What's Working

I tested it on WhatsApp. The agent captured the screen, extracted the contact list, found a specific name, and returned the coordinates. It's slow—about 8-12 seconds per screen scan on my device—but it works.

What's Broken

  • OCR misreads some text. "Mom" becomes "Morn" or "M0m." I'm working on fuzzy matching.
  • Non-text UI elements (icons, emojis) are invisible to OCR. I'll need image-based detection for those.
  • Tesseract is heavy. It takes 3-4 seconds just to load before scanning. I'm looking at lighter alternatives.
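
For the first item, one possible approach to fuzzy matching is sketched below using Python's standard difflib module. This is an assumption about how it could be done, not the project's actual code: the names OCR_CONFUSIONS, normalize, and best_match are hypothetical, the 0.8 threshold is arbitrary, and the confusion table is an illustrative guess based on the "Morn" and "M0m" misreads mentioned above.

```python
import difflib

# Characters OCR tends to confuse. Illustrative guess, not measured data.
OCR_CONFUSIONS = str.maketrans({"0": "o", "1": "l", "5": "s"})


def normalize(text):
    """Lower-case the text and undo a few common OCR substitutions."""
    return text.lower().translate(OCR_CONFUSIONS).replace("rn", "m")


def best_match(target, candidates, threshold=0.8):
    """Return the candidate most similar to target, or None if none is close."""
    want = normalize(target)
    best, best_score = None, 0.0
    for cand in candidates:
        score = difflib.SequenceMatcher(None, want, normalize(cand)).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None
```

With this sketch, best_match("Mom", ["Morn", "David"]) returns "Morn", because both strings normalize to "mom". The trade-off is that normalization can also make genuinely different words collide (any real "rn" is rewritten to "m"), so the threshold and the table would need tuning against real screens.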

What's Next (Day 4)

  • Add fuzzy text matching to handle OCR errors
  • Write the verification layer: after each action, check if the expected result happened
  • Test a full 3-step task: open WhatsApp → find contact → send a message
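
The verification layer is not written yet, so the following is only a guess at its shape: after an action, re-read the screen and check that some expected text has appeared, retrying a few times before giving up. The function name verify and its parameters are hypothetical, and read_screen stands in for whatever callable returns the list of words currently on screen.

```python
import time


def verify(read_screen, expected, retries=3, delay=1.0, sleep=time.sleep):
    """Return True once `expected` shows up in the words from read_screen().

    read_screen is any callable returning a list of words on the screen.
    sleep is injectable so the loop can be exercised without real waiting.
    """
    for attempt in range(retries):
        words = read_screen()
        if any(expected.lower() == w.lower() for w in words):
            return True
        if attempt < retries - 1:
            sleep(delay)  # give the UI time to settle before re-reading
    return False
```

At 8-12 seconds per screen scan, each retry is expensive, so a small retry count is probably the practical choice until the OCR step gets faster.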

If you're building something similar or just curious about offline AI agents, follow along. This is Day 3 of the build.
