Day 8. Template matching is wired into the agent. The send button is no longer invisible.
Seven days ago, the agent was blind to icons. It could read text, find contacts, and type messages—but it couldn't press send.
Today, that changed.
What Got Built
Template matching is now integrated into the agent's decision pipeline. Here's the new flow when the agent needs to tap something:
- Try OCR first. If the target is text (like a contact name), find it via ML Kit or Tesseract.
- If OCR fails, fall back to template matching. The agent searches its icon library for a matching reference image.
- If the best template match scores above the confidence threshold (80%), tap the matched coordinates.
- If both fail, report the failure and stop. No guessing. No hardcoded coordinates.
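Here's a minimal sketch of that flow. It is not the actual code in agent.py: `locate_target`, `find_text`, `find_icon`, and the `Match` type are hypothetical names I'm using for illustration, and the two finders are passed in as plain callables so the fallback logic can be read on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]  # (x, y) screen coordinates


@dataclass
class Match:
    """Result of a template match (hypothetical shape, for illustration)."""
    point: Point
    confidence: float  # 0.0 to 1.0


CONFIDENCE_THRESHOLD = 0.80  # the 80% threshold from the flow above


def locate_target(
    target: str,
    find_text: Callable[[str], Optional[Point]],
    find_icon: Callable[[str], Optional[Match]],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Optional[Point]:
    """Return tap coordinates for target, or None if nothing was found."""
    # Step 1: try OCR first.
    point = find_text(target)
    if point is not None:
        return point

    # Step 2: OCR failed, so fall back to template matching.
    match = find_icon(target)

    # Step 3: accept the match only if it clears the threshold.
    # (Treating exactly 80% as a pass is an assumption of this sketch.)
    if match is not None and match.confidence >= threshold:
        return match.point

    # Step 4: both failed. Report it by returning None. No guessing.
    return None
```

The caller decides what "report the failure and stop" means. The sketch just refuses to return coordinates it doesn't trust.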
Today's Progress
| Task | Status |
|---|---|
| Updated agent.py to call match_template() when OCR fails | ✅ Done |
| Tested on WhatsApp send button | ✅ Success |
| Tested on WhatsApp back button | ✅ Success |
| Full pipeline test: type message → detect send icon → tap → verify sent | ✅ Passed |
| Added 3 reference icons to the library | ✅ Done |
The Full Pipeline Test
I gave the agent a command: "Send a WhatsApp message to Mom saying I'll call later."
Here's what happened:
- Agent opened WhatsApp via ADB. ✅
- Agent searched for "Mom" using OCR + fuzzy matching. Found her. ✅
- Agent tapped the contact. Chat opened. ✅
- Agent typed "I'll call later" into the message box. ✅
- Agent looked for the send button. OCR didn't find it (no text). ❌
- Agent switched to template matching. ✅
- Agent matched send_button.png with 94% confidence. ✅
- Agent tapped the coordinates. ✅
- Agent verified the message appeared in the chat. ✅
Task complete. No hardcoded coordinates. No guessing. The agent found the icon by seeing it.
What's in the Icon Library Now
| Icon | File | Status |
|---|---|---|
| Send button (WhatsApp) | send_button.png | ✅ Working |
| Back button (WhatsApp) | back_button.png | ✅ Working |
| Search button (WhatsApp) | search_button.png | 🔧 Testing |
What's Still Hard
Template matching is slower than OCR. Each match takes 2-4 seconds on my device. For a single icon that's fine. For a task that needs to find three different icons, the delays add up.
The simple NumPy fallback is about 3x slower than OpenCV. On a device without OpenCV installed, template matching becomes the new bottleneck.
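To show what a pure-NumPy matcher has to do, here's a rough sketch of normalized cross-correlation on grayscale arrays. This is not the code in vision.py: the function name `match_template_numpy` and its return shape are my assumptions for illustration. It computes the zero-mean normalized score, the same kind of score OpenCV exposes as `cv2.TM_CCOEFF_NORMED`, but it does so by brute force over every window position.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def match_template_numpy(screen, template):
    """Find template in screen (both 2D grayscale arrays).

    Returns (x, y, score): the top-left corner of the best match and
    its zero-mean normalized cross-correlation score in [-1, 1].
    """
    screen = np.asarray(screen, dtype=np.float64)
    template = np.asarray(template, dtype=np.float64)
    th, tw = template.shape
    n = th * tw

    # Zero-mean template and its norm.
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())

    # Views over every (th, tw) window; no pixel data is copied here.
    win = sliding_window_view(screen, (th, tw))
    win_sq = sliding_window_view(screen ** 2, (th, tw))

    # Because t has zero mean, sum((w - mean_w) * t) equals sum(w * t).
    numerator = np.einsum("ijkl,kl->ij", win, t)

    # Per-window energy around the window mean: sum(w^2) - sum(w)^2 / n.
    win_sum = win.sum(axis=(2, 3))
    win_energy = win_sq.sum(axis=(2, 3)) - win_sum ** 2 / n
    denominator = np.sqrt(np.clip(win_energy, 0.0, None)) * t_norm

    # Flat windows (or a flat template) get a score of 0, not a divide error.
    scores = np.zeros_like(numerator)
    np.divide(numerator, denominator, out=scores, where=denominator > 1e-9)

    y, x = np.unravel_index(np.argmax(scores), scores.shape)
    return int(x), int(y), float(scores[y, x])
```

The work grows with screen area times template area, which is why a plain NumPy matcher falls behind OpenCV's optimized one as images get larger.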
And icons that change appearance based on theme (dark mode vs light mode) need separate reference images. One icon, two variants. The library will grow.
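One way to handle variants is to try every reference image for an icon and keep the best score. The sketch below assumes a naming scheme (`send_button_light.png`, `send_button_dark.png`) and a `match_file` callable that returns `(x, y, score)` or `None`. Both are hypothetical, not how the library is organized today.

```python
from typing import Callable, Dict, List, Optional, Tuple

Result = Tuple[int, int, float]  # (x, y, score)

# Hypothetical layout: one icon name, several themed reference images.
ICON_VARIANTS: Dict[str, List[str]] = {
    "send_button": ["send_button_light.png", "send_button_dark.png"],
}


def best_variant_match(
    icon: str,
    match_file: Callable[[str], Optional[Result]],
    variants: Dict[str, List[str]] = ICON_VARIANTS,
    threshold: float = 0.80,
) -> Optional[Result]:
    """Match every variant of icon and return the best result above threshold."""
    best: Optional[Result] = None
    for filename in variants.get(icon, []):
        result = match_file(filename)
        # Keep whichever variant scored highest so far.
        if result is not None and (best is None or result[2] > best[2]):
            best = result
    if best is not None and best[2] >= threshold:
        return best
    return None
```

The catch is cost: every variant is another full matching pass, so two themes at 2-4 seconds each means up to 4-8 seconds for a single icon.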
What's Next (Day 9)
- Add more reference icons: attach, camera, emoji, more options
- Test on a second device to confirm the icon matching works across different screen sizes
- Record a full demo video of the pipeline in action
The Repo
👉 github.com/Dexter2344/phone-agent
agent.py now calls match_template() as a fallback when OCR can't find a target. vision.py handles the matching with OpenCV primary and NumPy fallback. The icon library is growing.
This is Day 8. The agent can finally see what it's doing.