Project Log #8: The AI Phone Agent Can Now See Icons

#ai #webdev #software #buildinpublic

Day 8. Template matching is wired into the agent. The send button is no longer invisible.

Seven days ago, the agent was blind to icons. It could read text, find contacts, and type messages, but it couldn't press send.

Today, that changed.

What Got Built

Template matching is now integrated into the agent's decision pipeline. Here's the new flow when the agent needs to tap something:

1. Try OCR first. If the target is text (like a contact name), find it via ML Kit or Tesseract.
2. If OCR fails, fall back to template matching. The agent searches its icon library for a matching reference image.
3. If template matching succeeds above the confidence threshold (80%), tap the matched coordinates.
4. If both fail, report the failure and stop. No guessing. No hardcoded coordinates.
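The flow above can be sketched roughly as follows. This is a minimal illustration, not the code in `agent.py`: `locate_target`, `find_text`, and `find_icon` are hypothetical names, the real signature of `match_template()` isn't shown in this log, and treating a score exactly at the threshold as a pass is an assumption.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]
IconMatch = Tuple[float, Point]  # (confidence in 0..1, tap coordinates)

CONFIDENCE_THRESHOLD = 0.80  # the 80% threshold from the flow above


def locate_target(
    target: str,
    find_text: Callable[[str], Optional[Point]],      # hypothetical OCR lookup
    find_icon: Callable[[str], Optional[IconMatch]],  # hypothetical template matcher
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Optional[Point]:
    """Return tap coordinates for `target`, or None if nothing reliable was found."""
    # Step 1: OCR first, since text targets are the cheap case.
    point = find_text(target)
    if point is not None:
        return point

    # Step 2: fall back to template matching against the icon library.
    match = find_icon(target)
    if match is not None:
        score, point = match
        # Step 3: only accept matches at or above the threshold (assumed inclusive).
        if score >= threshold:
            return point

    # Step 4: both failed. The caller reports the failure; no guessing.
    return None
```

Passing the two lookups in as functions keeps the fallback order testable without a device attached.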

Today's Progress

| Task | Status |
| --- | --- |
| Updated `agent.py` to call `match_template()` when OCR fails | ✅ Done |
| Tested on WhatsApp send button | ✅ Success |
| Tested on WhatsApp back button | ✅ Success |
| Full pipeline test: type message → detect send icon → tap → verify sent | ✅ Passed |
| Added 3 reference icons to the library | ✅ Done |

The Full Pipeline Test

I gave the agent a command: "Send a WhatsApp message to Mom saying I'll call later."

Here's what happened:

1. Agent opened WhatsApp via ADB. ✅
2. Agent searched for "Mom" using OCR + fuzzy matching. Found her. ✅
3. Agent tapped the contact. Chat opened. ✅
4. Agent typed "I'll call later" into the message box. ✅
5. Agent looked for the send button. OCR didn't find it (no text). ❌
6. Agent switched to template matching. ✅
7. Agent matched `send_button.png` with 94% confidence. ✅
8. Agent tapped the coordinates. ✅
9. Agent verified the message appeared in the chat. ✅
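For the "OCR + fuzzy matching" step, here is one way the lookup could work using Python's standard `difflib`. This log doesn't say which fuzzy matcher the agent uses, so treat `find_label` and the 0.6 cutoff as illustrative assumptions rather than the repo's implementation.

```python
import difflib
from typing import List, Optional


def find_label(target: str, ocr_labels: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Pick the OCR label closest to `target`, tolerating small OCR misreads."""
    # Compare case-insensitively, but return the label as OCR produced it.
    lowered = {label.lower(): label for label in ocr_labels}
    hits = difflib.get_close_matches(target.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None
```

With this, a misread like "M0m" still resolves to the "Mom" target, while unrelated labels fall below the cutoff and return `None`.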

Task complete. No hardcoded coordinates. No guessing. The agent found the icon by seeing it.
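The tap itself can be sketched like this. `adb shell input tap X Y` is the standard ADB command for a tap. The rest is assumption: that the matcher reports the top-left corner of the matched region (as OpenCV's `matchTemplate` does) and that the agent taps the center of it. `tap_command` is a hypothetical helper name.

```python
from typing import List, Tuple


def tap_command(top_left: Tuple[int, int], template_size: Tuple[int, int]) -> List[str]:
    """Build the adb command that taps the center of a matched icon."""
    x, y = top_left
    width, height = template_size
    # Aim for the middle of the match so the tap lands inside the icon.
    center_x = x + width // 2
    center_y = y + height // 2
    return ["adb", "shell", "input", "tap", str(center_x), str(center_y)]
```

The returned list can be handed to `subprocess.run(...)` on a machine with a device connected.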

What's in the Icon Library Now

| Icon | File | Status |
| --- | --- | --- |
| Send button (WhatsApp) | `send_button.png` | ✅ Working |
| Back button (WhatsApp) | `back_button.png` | ✅ Working |
| Search button (WhatsApp) | `search_button.png` | 🔧 Testing |

What's Still Hard

Template matching is slower than OCR. Each match takes 2-4 seconds on my device. For a single icon that's fine. For a task that needs to find three different icons, the delays add up.
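To put a number on "the delays add up", here is a back-of-the-envelope estimate. It assumes the matches run one after another and that each icon is matched exactly once; `total_match_time` is just an illustrative helper.

```python
from typing import Tuple


def total_match_time(
    n_icons: int, per_match_seconds: Tuple[float, float] = (2.0, 4.0)
) -> Tuple[float, float]:
    """Return the (best case, worst case) total seconds for sequential matches."""
    low, high = per_match_seconds
    return n_icons * low, n_icons * high


print(total_match_time(3))  # (6.0, 12.0): three icons cost 6 to 12 seconds
```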

The simple NumPy fallback is about 3x slower than OpenCV. On a device without OpenCV installed, template matching becomes the new bottleneck.
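For readers wondering what a NumPy-only matcher can look like, below is a small sketch using zero-mean normalized cross-correlation over grayscale arrays. It is not the code in `vision.py`: the function name `match_template_numpy` is hypothetical, and the choice of correlation measure is an assumption. It needs NumPy 1.20 or newer for `sliding_window_view`.

```python
import numpy as np
from typing import Tuple


def match_template_numpy(
    screen: np.ndarray, template: np.ndarray
) -> Tuple[float, Tuple[int, int]]:
    """Return (score, (x, y)) for the best match; (x, y) is the top-left corner.

    Both inputs are 2-D grayscale arrays. The score is in [-1, 1].
    """
    th, tw = template.shape
    t = template.astype(np.float64)
    t -= t.mean()
    t_norm = np.sqrt((t * t).sum())

    # Every (th, tw) window of the screen, as a view (no copy made here).
    windows = np.lib.stride_tricks.sliding_window_view(
        screen.astype(np.float64), (th, tw)
    )
    # Subtracting the per-window mean materializes a full copy, which is the
    # main reason this approach is memory-hungry and slow on large screenshots.
    w = windows - windows.mean(axis=(2, 3), keepdims=True)

    numerator = (w * t).sum(axis=(2, 3))
    denominator = np.sqrt((w * w).sum(axis=(2, 3))) * t_norm

    # Flat windows (or a flat template) have zero variance: score them as 0.
    valid = denominator > 0
    scores = np.where(valid, numerator / np.where(valid, denominator, 1.0), 0.0)

    y, x = np.unravel_index(np.argmax(scores), scores.shape)
    return float(scores[y, x]), (int(x), int(y))
```

A score of 0.80 or higher here would correspond to the 80% threshold, assuming the agent reads the correlation score directly as confidence.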

And icons that change appearance based on theme (dark mode vs light mode) need separate reference images. One icon, two variants. The library will grow.
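One way to handle theme variants is to store several reference images per icon and keep whichever scores best. The sketch below assumes that approach. The file names in the usage example (such as `send_button_dark.png`) and the `match_one` callback are hypothetical and not files or functions from the repo.

```python
from typing import Callable, Iterable, Optional, Tuple

Point = Tuple[int, int]
MatchResult = Tuple[float, Point]  # (confidence in 0..1, coordinates)


def best_variant(
    variants: Iterable[str],
    match_one: Callable[[str], Optional[MatchResult]],  # hypothetical matcher
    threshold: float = 0.80,
) -> Optional[Tuple[float, Point, str]]:
    """Try every reference image for one icon and return the best passing match."""
    best = None
    for path in variants:
        result = match_one(path)
        if result is None:
            continue
        score, point = result
        if best is None or score > best[0]:
            best = (score, point, path)
    # Apply the threshold once, to the strongest candidate only.
    if best is not None and best[0] >= threshold:
        return best
    return None
```

Usage would look like `best_variant(["send_button_light.png", "send_button_dark.png"], matcher)`. The cost is one extra match per variant, which matters given the per-match delay described above.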

What's Next (Day 9)

- Add more reference icons: attach, camera, emoji, more options
- Test on a second device to confirm the icon matching works across different screen sizes
- Record a full demo video of the pipeline in action

The Repo

👉 github.com/Dexter2344/phone-agent

`agent.py` now calls `match_template()` as a fallback when OCR can't find a target. `vision.py` handles the matching, with OpenCV as the primary backend and NumPy as the fallback. The icon library is growing.

This is Day 8. The agent can finally see what it's doing.
