Project Log #11: UI Tree vs Screenshots — The Real Performance Test

#ai #programming #webdev #productivity

Day 11. I ran the old vision system and the new one side by side. The results surprised me.

Yesterday I rewrote the vision system. Screenshots and template matching got demoted. Android's UI hierarchy tree became the primary.

Today I ran both on the same tasks to find out whether the rewrite was actually worth it.

The Test

I set up three tasks and ran each one twice: once with the old screenshot-based system, once with the new UI tree system. Same phone. Same apps. Same commands.

Task	Old System (Screenshots)	New System (UI Tree)
Open WhatsApp	3.2s	2.1s
Find "Mom" in contacts	2.8s (OCR)	0.7s (UI tree text match)
Tap send button	4.1s (template matching)	0.6s (content-desc match)
Total	10.1s	3.4s

The new system is nearly 3x faster on this simple three-step task (10.1s vs 3.4s). I expect the gap to widen as tasks get more complex, since every extra lookup pays the OCR or template-matching cost again.
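The totals and the "nearly 3x" figure can be rechecked from the table with a few lines of Python. The numbers are copied from the table above; the dictionary and variable names are only illustrative.

```python
# Timings in seconds, copied from the table: (old system, new system).
timings = {
    "Open WhatsApp": (3.2, 2.1),
    'Find "Mom" in contacts': (2.8, 0.7),
    "Tap send button": (4.1, 0.6),
}

old_total = sum(old for old, _ in timings.values())
new_total = sum(new for _, new in timings.values())

# Per-step speedup shows where the time is actually saved.
for task, (old, new) in timings.items():
    print(f"{task}: {old / new:.1f}x")

print(f"Total: {old_total:.1f}s vs {new_total:.1f}s ({old_total / new_total:.2f}x)")
```

The per-step ratios are more telling than the total: opening the app is only about 1.5x faster, while the two lookup steps are roughly 4x and 7x faster.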

Why the UI Tree Wins

The old system had to:

Take a screenshot (1-2 seconds)
Run OCR or template matching (2-4 seconds)
Parse the results
Calculate coordinates

The new system:

Runs one ADB command that dumps an XML file
Searches the XML for a matching element
Reads the exact coordinates from the element's bounds
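Those three steps can be sketched roughly as follows. This is a minimal illustration, not the code in vision.py: the function names, the remote file path, and the sample XML are all assumptions. It relies on `adb shell uiautomator dump` and `adb pull`, and on the `bounds="[left,top][right,bottom]"` attribute that the dump writes for each node.

```python
import re
import subprocess
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# uiautomator writes bounds as "[left,top][right,bottom]".
BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def dump_ui_tree(remote_path="/sdcard/window_dump.xml"):
    """Dump the current screen on the device, pull the file, return its bytes."""
    subprocess.run(["adb", "shell", "uiautomator", "dump", remote_path],
                   check=True, capture_output=True)
    with tempfile.TemporaryDirectory() as tmp:
        local = Path(tmp) / "window_dump.xml"
        subprocess.run(["adb", "pull", remote_path, str(local)],
                       check=True, capture_output=True)
        return local.read_bytes()


def find_center(xml_data, label):
    """Return the (x, y) center of the first node whose text or content-desc
    equals label, or None if no node matches."""
    root = ET.fromstring(xml_data)
    for node in root.iter("node"):
        if label in (node.get("text"), node.get("content-desc")):
            match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
            if match:
                left, top, right, bottom = map(int, match.groups())
                return (left + right) // 2, (top + bottom) // 2
    return None


# Made-up sample in the shape of a uiautomator dump (values are illustrative).
SAMPLE_XML = """<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" text="" content-desc="" bounds="[0,0][1080,2340]">
    <node class="android.widget.TextView" text="Mom" content-desc="" bounds="[180,400][900,480]" />
    <node class="android.widget.ImageButton" text="" content-desc="Send message" bounds="[948,2130][1056,2238]" />
  </node>
</hierarchy>"""

print(find_center(SAMPLE_XML, "Send message"))
```

On a connected device, the result of `find_center(dump_ui_tree(), "Send message")` could then be passed to `adb shell input tap x y`.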

No image processing. No OCR loading time. No template matching loops. Just a text search on a structured document.

What I Got Wrong Yesterday

I said the UI tree takes 0.5-1 second. In practice, it's faster—around 0.3-0.7 seconds for the dump and pull combined. The XML file is small (usually under 100KB). The parsing is instant.

But there's something I didn't anticipate: how detailed the UI tree is. A single screen can have 200+ elements. Searching all of them by content-desc or text takes a few milliseconds of Python. That's negligible, but it's a reminder that XML parsing on a phone CPU isn't free.

What's Still Hard

The UI tree approach has a blind spot: not all apps label their elements well.

WhatsApp is great. The send button has content-desc="Send message". The back button has content-desc="Back". The search button has content-desc="Search". These are accessibility labels that screen readers use, and WhatsApp's developers actually invested in them.

But I tested a local banking app. The login button? content-desc="". The password field? content-desc="". Nothing. No labels. No text. Just generic class names like android.widget.Button with no identifying information.
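One rough way to put a number on this is to count how many clickable elements carry any text or content-desc at all. The sketch below is an assumption about how such a check could look, not something from the repo; the function name and the sample XML are invented, and the sample is not a dump from any real banking app.

```python
import xml.etree.ElementTree as ET


def label_coverage(xml_data):
    """Fraction of clickable nodes that have a non-empty text or content-desc.
    Returns None when the screen has no clickable nodes."""
    root = ET.fromstring(xml_data)
    clickable = [n for n in root.iter("node") if n.get("clickable") == "true"]
    if not clickable:
        return None
    labeled = [
        n for n in clickable
        if (n.get("text") or "").strip() or (n.get("content-desc") or "").strip()
    ]
    return len(labeled) / len(clickable)


# Invented login screen: four clickable elements, only one of them labeled.
UNLABELED_SAMPLE = """<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" clickable="false" text="" content-desc="" bounds="[0,0][1080,2340]">
    <node class="android.widget.EditText" clickable="true" text="" content-desc="" bounds="[60,470][1020,590]" />
    <node class="android.widget.EditText" clickable="true" text="" content-desc="" bounds="[60,870][1020,990]" />
    <node class="android.widget.Button" clickable="true" text="" content-desc="" bounds="[60,1100][1020,1220]" />
    <node class="android.widget.TextView" clickable="true" text="Forgot password?" content-desc="" bounds="[60,1300][500,1360]" />
  </node>
</hierarchy>"""

print(label_coverage(UNLABELED_SAMPLE))
```

A low ratio on an app's main screens is a hint that text and content-desc matching will mostly miss there.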

For apps like that, matching on text or content-desc finds nothing, so I still need OCR as a fallback. The unified find_target() function handles this gracefully: it tries the UI tree first, and if nothing is found, it falls back to OCR and template matching.
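The fallback order can be illustrated with a small chain of strategies. This is a hedged sketch of the pattern only; it is not the actual find_target() from the repo, and `find_with_fallback` and the stand-in finders below are hypothetical names with hard-coded results.

```python
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[int, int]
Finder = Callable[[str], Optional[Point]]


def find_with_fallback(
    label: str, finders: Sequence[Tuple[str, Finder]]
) -> Optional[Tuple[str, Point]]:
    """Try each finder in order and return (strategy name, point) for the
    first one that finds the target, or None if every strategy misses."""
    for name, finder in finders:
        point = finder(label)
        if point is not None:
            return name, point
    return None


# Stand-in finders with hard-coded results, purely for illustration.
def ui_tree_finder(label: str) -> Optional[Point]:
    known = {"Send message": (1002, 2184)}
    return known.get(label)


def ocr_finder(label: str) -> Optional[Point]:
    known = {"Login": (540, 1160)}
    return known.get(label)


def template_finder(label: str) -> Optional[Point]:
    return None


# Cheapest strategy first, slowest last.
CHAIN = [("ui_tree", ui_tree_finder), ("ocr", ocr_finder), ("template", template_finder)]

print(find_with_fallback("Send message", CHAIN))
print(find_with_fallback("Login", CHAIN))
```

Returning the strategy name alongside the point makes it easy to log how often each app forces the slower paths.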

The Accessibility Divide

Here's something I didn't expect to learn from this project: well-funded apps (WhatsApp, Google apps, Slack) have excellent accessibility labels. They invested in making their apps usable by screen readers, and that investment accidentally makes them automatable by my agent.

Smaller apps, local apps, banking apps—they often have zero accessibility labels. My agent's ability to automate an app depends on whether that app's developers cared about disabled users.

This is a sobering realization. The same neglect that locks out blind users also locks out AI agents. Accessibility isn't just about inclusion—it's about making software machine-readable. And when developers skip it, both humans and AI suffer.

What's Next (Day 12)

Build a fallback mapping for unlabeled elements: if content-desc is empty, try to identify the element by its class + position + surrounding labeled elements
Test on more apps beyond WhatsApp—banking apps, settings, Chrome
Start logging which apps are "agent-friendly" and which aren't
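The first item on that list could start from something like the sketch below: pick the unlabeled element of a given class that sits closest to a labeled neighbor. Everything here is an assumption for illustration (the function names, the nearest-center heuristic, and the sample XML); nothing like this exists in the repo yet.

```python
import math
import re
import xml.etree.ElementTree as ET

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def center(node):
    """Center point of a node's bounds, or None if bounds are missing."""
    match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
    if not match:
        return None
    left, top, right, bottom = map(int, match.groups())
    return (left + right) // 2, (top + bottom) // 2


def label_of(node):
    """The node's text, else its content-desc, else an empty string."""
    return (node.get("text") or node.get("content-desc") or "").strip()


def find_unlabeled_near(xml_data, class_name, anchor_label):
    """Center of the unlabeled node of class_name nearest to the node
    labeled anchor_label, or None if either side cannot be found."""
    nodes = list(ET.fromstring(xml_data).iter("node"))
    anchor = next((n for n in nodes if label_of(n) == anchor_label), None)
    anchor_center = center(anchor) if anchor is not None else None
    if anchor_center is None:
        return None
    candidates = [
        center(n) for n in nodes
        if n.get("class") == class_name and not label_of(n)
    ]
    candidates = [c for c in candidates if c is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda c: math.dist(c, anchor_center))


# Invented login form: labeled TextViews next to unlabeled inputs.
FORM_XML = """<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" text="" content-desc="" bounds="[0,0][1080,2340]">
    <node class="android.widget.TextView" text="Username" content-desc="" bounds="[60,400][300,460]" />
    <node class="android.widget.EditText" text="" content-desc="" bounds="[60,470][1020,590]" />
    <node class="android.widget.TextView" text="Password" content-desc="" bounds="[60,800][300,860]" />
    <node class="android.widget.EditText" text="" content-desc="" bounds="[60,870][1020,990]" />
    <node class="android.widget.Button" text="" content-desc="" bounds="[60,1100][1020,1220]" />
  </node>
</hierarchy>"""

print(find_unlabeled_near(FORM_XML, "android.widget.EditText", "Password"))
```

Nearest-center distance is a crude heuristic. It will pick the wrong field on dense layouts, so a real version would likely need to weigh direction (below or to the right of the label) as well.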

The Repo

👉 github.com/Dexter2344/phone-agent

vision.py v4 is live. agent.py v10 is live. The unified target finder handles UI tree, OCR, and template matching in one call.

This is Day 11. The rewrite was worth it. The agent is faster, more reliable, and no longer tied to one device. But I've discovered a new problem: the agent can only automate apps that were built to be accessible. That's the next wall to climb.

DEV Community
