Yesterday I posted this on X: https://x.com/22Gstudios/status/2051377769414791582
LetItDo is a voice agent for Android that actually finishes tasks. Built solo in two and a half days, on top of an existing Auto.js fork called AutoX, with Charm's Crush as the agent runtime. The architecture ended up being closer to what production agents like Perplexity Comet use than I expected, and the bugs that bit me were not the ones I planned for.
I wanted my phone to do the boring stuff. Send a WhatsApp message to a contact. Open Spotify and play a song. Scroll Instagram and like a few posts. Stuff Siri and Google Assistant pretend to do but don't actually finish.
Here's the honest writeup.
Why this could even run on a phone
Most agent runtimes are Python. Python is hostile to Android: no good way to ship the interpreter in your APK without dragging in 50MB+ of CPython and fighting NDK quirks. I needed the agent to run on the user's phone, not on some server they'd have to host.
Charm's Crush is written in Go. Go cross-compiles to Android arm64 in one command. The whole runtime fits in a libcrush.so native library that I bundle into the APK alongside AutoX. The agent runs entirely on-device. The only network call is to the user's chosen LLM API.
Untethered, no laptop, no ADB at runtime
LetItDo runs untethered. No USB cable, no ADB connection at runtime, no laptop hosting the agent. The closest research projects (AppAgent from Tencent, Mobile-Agent from Alibaba, DroidBot-GPT) all require the agent to live on a laptop and control the phone via ADB. Their architecture works fine for research demos but breaks the moment the user isn't sitting at a desk with a USB-C cable.
LetItDo is a regular Android app the user installs once and uses with their voice. The only one-time setup is adb shell pm grant com.letitdo.v7 android.permission.WRITE_SECURE_SETTINGS for the OEM-survival trick I'll cover below. After that, no ADB. The phone is the whole stack.
The first wrong intuition: vision is the answer
I copied the architecture from browser-harness, which is a small Python harness that connects an LLM to your real Chrome via CDP. It works because the LLM has vision. The agent calls capture_screenshot, the host renders the PNG to the model, the model picks pixel coordinates, the harness calls click_at_xy. There is one click primitive. No selectors. The whole loop is built around the assumption that the model can see.
I tried to translate this to Android. Took screenshots. Sent them to qwen-plus. The model replied "I cannot see images." Because qwen-plus is text only.
Production agents have already chosen the answer. Comet calls Accessibility.getFullAXTree first, screenshots only as fallback. OpenAI Operator uses a hybrid (AX tree primary, vision for charts and captchas). Browser-harness leans vision because their LLM has eyes. I copied the wrong template. The right one is whatever Comet does, even if you have a vision model, because cheap-first cascade beats one-shot vision in cost, latency, and reliability.
So LetItDo's interaction layer became a cascade, cheapest first:
- exact a11y text/desc/id (free, instant)
- substring a11y (textContains/descContains)
- fuzzy a11y tree (Levenshtein on dumped nodes, catches STT typos like Shawn → Shaun)
- OCR (Paddle, on-device, ~2s, fallback for Canvas/WebView)
- vision (multimodal LLM, opt-in, only when 1-4 fail)
Each layer earns its slot by catching maybe 1% of the cases the layer above it misses. The cascade exists because no single layer is right for every surface.
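Condensed, the dispatch loop looks something like this in AutoX's Rhino JS. fuzzyFind, ocrFind, and visionFind are hypothetical stand-ins for layers 3-5, not LetItDo's actual helpers; the selector calls are real AutoX APIs:

// Cheapest-first cascade: try each locator, return the first hit.
function locate(query) {
    // Layer 1: exact accessibility match (free, instant).
    var n = text(query).findOnce() || desc(query).findOnce() || id(query).findOnce();
    if (n) return n;
    // Layer 2: substring accessibility match.
    n = textContains(query).findOnce() || descContains(query).findOnce();
    if (n) return n;
    // Layer 3 (hypothetical helper): fuzzy Levenshtein match over a dumped tree
    // (catches STT typos like Shawn -> Shaun).
    n = fuzzyFind(query, 2);   // max edit distance 2, illustrative
    if (n) return n;
    // Layer 4 (hypothetical helper): on-device Paddle OCR, ~2s, for Canvas/WebView.
    n = ocrFind(query);
    if (n) return n;
    // Layer 5 (hypothetical helper): multimodal LLM, opt-in, only when 1-4 missed.
    return visionFind(query);
}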
The bug that took 90 seconds and 11 round-trips to find
I told the agent: "send hello to Shaun on WhatsApp." It opened WhatsApp. Then it tapped what looked like the right element. The chat did not open. It tapped again. It scrolled. It went into Settings somehow. After 11 round-trips and 90 seconds it gave up.
The actual problem: WhatsApp's chat list has the contact name "Shaun" rendered as a TextView that is clickable=false. The clickable element is a parent LinearLayout four levels up the tree. The avatar to the left of the name has content-description "Shaun picture" and IS clickable, but tapping it opens the profile preview, not the chat.
When the agent fuzzy-matched "Shawn" (STT typo of Shaun) against the screen, OCR found the text glyph. The agent clicked at the glyph's bounding box center. Android's hit testing routed that to whichever clickable ancestor wanted it, and on Vivo's WhatsApp build that turned out to be the avatar's tap zone, not the row's. So we tapped the profile icon and opened a contact preview instead of the chat.
The fix was a five-line walk:
function tap_text(query) {
    var node = text(query).findOne(2000);   // find the text leaf, 2s timeout
    if (!node) return false;
    var cur = node;
    // Walk up the parent chain to the first clickable ancestor.
    while (cur && !cur.clickable()) cur = cur.parent();
    return cur ? cur.click() : false;       // click the container, not the glyph
}
Find the text node. Walk up the parent chain. Stop at the first clickable ancestor. Click that, not the leaf. AccessibilityService.performAction(ACTION_CLICK) fires on the row container. Chat opens. 12 seconds.
This is the exact pattern Comet uses on the web. Their accessibility tree parser walks up from text nodes to clickable ancestors before reporting click targets to the model. I had to rediscover it for Android because I started from the wrong template.
The other bug: bounds were always null
The structured tree dump I had been shipping for two days was returning nodes without coordinates. Every "smoke test" I had run actually used a different code path (AutoX's UiObject, which has working .bounds()) instead of raw AccessibilityNodeInfo (which doesn't). The function name is the same. The return shape is different.
// Wrong:
var bounds = n.getBoundsInScreen() // returns void, not Rect
// Right:
var rect = new android.graphics.Rect()
n.getBoundsInScreen(rect)
getBoundsInScreen takes an out-parameter. Calling it bare returns nothing. Every node in my tree dump had cx and cy as null. None of my "tests" caught it because they asserted on other fields entirely. The second I actually filtered for cx and got back zero results out of 220 nodes, the bug was obvious.
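For completeness, the corrected center computation, using Rect's own centerX/centerY (nodeCenter is just an illustrative name):

// getBoundsInScreen mutates the Rect you pass in; the return value is void.
function nodeCenter(n) {
    var rect = new android.graphics.Rect();
    n.getBoundsInScreen(rect);
    return { cx: rect.centerX(), cy: rect.centerY() };
}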
This is a personal lesson, not a technical one. Smoke-test every helper on the device the day you write it, before you build anything on top of it.
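The test that would have caught it is three lines. dump_tree here is a hypothetical stand-in for my dump helper:

// On-device smoke test: assert the helper's output actually has coordinates.
var nodes = dump_tree();   // hypothetical: returns [{text, cx, cy, ...}, ...]
var centered = nodes.filter(function (n) { return n.cx != null; });
toastLog(centered.length + "/" + nodes.length + " nodes have centers");
// The broken build prints 0/220. Two minutes, not two days.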
The OEM problem nobody talks about
Vivo, Xiaomi, Oppo, OnePlus, and Huawei phones aggressively kill background services to save battery. Android's accessibility service is one of the services they kill. So even when the user grants accessibility access to your app, the OEM's battery manager turns it off later. The app keeps running. Its permissions look fine in Settings. But auto.service is null. Every script throws "Accessibility service is not started."
This is also what kills Panda (an open-source Android voice agent in this space). Their issue tracker has #275 about Xiaomi/Huawei battery management as an unresolved roadmap item, plus a Reddit complaint that Android revokes Panda's permissions every few hours with no recovery.
The fix is mildly nuclear:
- Request WRITE_SECURE_SETTINGS via ADB once at install (adb shell pm grant com.letitdo.v7 android.permission.WRITE_SECURE_SETTINGS).
- Watchdog: a WorkManager job fires every 15 minutes and reads the secure setting enabled_accessibility_services. If our component isn't in the list, write it back.
- Pre-flight check before each agent run. If the service isn't bound (verified via a local TCP ping to our bridge), call heal(), which writes the setting and waits up to 5s for the system to rebind. A sketch of heal() follows the list.
- Mid-flight retry. If the agent's run_script call fails AND auto.service is null when the call returned, heal once and retry the same script.
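Here's roughly what heal() does, assuming the ADB grant above already happened. The setting keys are real Android secure settings; the component string is illustrative, not LetItDo's actual one:

importClass(android.provider.Settings);

var SERVICE = "com.letitdo.v7/com.example.A11yService";   // illustrative component name

function heal() {
    var resolver = context.getContentResolver();
    var key = "enabled_accessibility_services";
    var enabled = Settings.Secure.getString(resolver, key) || "";
    if (enabled.indexOf(SERVICE) < 0) {
        // The OEM battery manager stripped us from the list: write ourselves back.
        Settings.Secure.putString(resolver, key,
            enabled ? enabled + ":" + SERVICE : SERVICE);
        Settings.Secure.putString(resolver, "accessibility_enabled", "1");
    }
    // Wait up to 5s for the system to rebind the service.
    var deadline = Date.now() + 5000;
    while (Date.now() < deadline && auto.service == null) sleep(200);
    return auto.service != null;
}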
In practice users see nothing. The accessibility service stays bound across OEM kill cycles. They speak a command, the agent runs, no setup ceremony.
Is this hostile to Google's design? A little. Google banned Sova (another Android voice agent) from the Play Store specifically because it uses the accessibility API for "universal automation." LetItDo will probably never reach the Play Store either. Both apps have to live as sideloads. Sova self-hosts the APK. I'll do the same when I open early access.
Skill capture: the agent writes its own playbooks
This is the part I'm most happy with.
The first time the agent solves "turn on flashlight," it flails. It tries AutoX's device.flash which doesn't exist on this device. It tries opening the quick settings panel and tapping the torch tile. It tries hardware key shortcuts. After about ten attempts it lands on android.hardware.camera2.CameraManager.setTorchMode(cameraId, true) and the flashlight turns on.
Crush has a built-in write tool. The system prompt tells the agent: after a successful task, write a SKILL.md to the skills directory describing what worked. The agent does this on its own, with no prompting beyond that system message:
---
name: flashlight
description: "Turn on/off the device flashlight using CameraManager.setTorchMode."
Use when user asks to turn on/off flashlight or torch.
---
Turn on:
importClass(android.hardware.camera2.CameraManager);
importClass(android.content.Context);
var cm = context.getSystemService(Context.CAMERA_SERVICE);
var cameraId = cm.getCameraIdList()[0];
cm.setTorchMode(cameraId, true);
Gotcha: device.flash may not exist on all devices.
Use CameraManager.setTorchMode instead.
Crush's progressive disclosure injects all skill metadata into the system prompt at session start. When a relevant skill matches, the body gets loaded. Verified in the logs:
INFO Skill turn summary component=skills
prompt_len=24 active_total=7
loaded_total=1 loaded_this_turn=[flashlight]
Next "turn on flashlight" command: 14 seconds total, single round-trip, exact recipe replay. From 90s to 14s the second time.
First time the skill loop closed end-to-end I sat there for a minute. Ninety seconds of flailing, a self-written SKILL.md once CameraManager.setTorchMode finally worked, then a 14-second verbatim replay on the next prompt. From the outside it looks like nothing. But that's the thing improving itself, on a phone, without me touching it. After that I knew this was real.
AutoX is the other half of the stack
If Crush is the agent brain, AutoX is the body. It's an Auto.js fork that's been quietly maintained for years. Out of the box it gave me:
- A bound AccessibilityService running in a separate :script process. This is what lets us read and tap the UI tree.
- A Rhino JavaScript engine with full access to Android's Java APIs via importClass. The agent writes JS that calls android.hardware.camera2.CameraManager directly. No native bridge to maintain.
- A scripting surface (text("Send").findOne(), click(x, y), app.launch("com.spotify.music"), device.width, setClip("hi"), http.get(url)) that already covers most automation primitives. A toy example combining these follows the list.
- Screenshots without MediaProjection. This is the big one. The standard Android way to grab the screen is MediaProjection, which pops up a "Start recording or casting?" dialog on every single capture. That kills any voice-agent UX. AutoX's auto.takeScreenshot() uses an accessibility-API path that doesn't trigger the prompt. The user grants accessibility once at install; nothing else interrupts them. Vision flows just work.
- Bundled Paddle OCR. ~2s per screen, on-device, no network. We use it as layer 4 of the cascade.
- A :script process boundary that keeps accessibility crashes from killing the main app.
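That toy example, using the primitives above plus the tap_text helper from earlier (the package name is WhatsApp's real id; delays and selectors are illustrative):

// Open WhatsApp and message Shaun, using only the surface AutoX ships.
app.launch("com.whatsapp");   // launch by package name
sleep(2500);                  // crude wait for the chat list to render
tap_text("Shaun");            // walk-up-to-clickable helper from earlier
sleep(1000);
setText("hello");             // fills the first editable field on screen
sleep(300);
tap_text("Send");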
Without AutoX I'd have written all of this myself: the accessibility service binding, the JS-to-Android bridge, the screenshot capture without MediaProjection (which is its own sharp-edged research project), the gesture dispatcher. Probably two more weekends of pure scaffolding work.
What I had to build vs what I got from AutoX and Crush
LetItDo is mostly the glue between two existing projects.
Crush gives the agent: a working LLM loop, OpenAI-compatible multi-provider support (OpenAI, Anthropic, Google, DashScope, Groq, Cerebras, OpenRouter, local Ollama), the Agent Skills standard with progressive disclosure, conversation compaction so long sessions don't blow up the context window, sub-agent spawning, and the MCP tool calling protocol.
AutoX gives the phone: a bound AccessibilityService, a Rhino JS engine with full access to Android's Java APIs, screenshots without MediaProjection prompts, on-device Paddle OCR, gesture dispatch, and a scripting surface that already covers most of what an automation agent needs.
What I actually built: the bridge that lets Crush's MCP tools call into AutoX's accessibility surface. The JS helpers the agent uses to discover and tap UI elements (read_screen, tap_text with walk-up-to-clickable, the fuzzy cascade). The OEM survival mechanism. The voice frontend, the result UI, the skill seeding, the on-device service watchdog. Two and a half days of glue and one critical insight (a11y tree first, vision second).
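The bridge's run_script end is small enough to sketch. This is not LetItDo's actual wire format, just the shape of the idea: a loopback socket inside AutoX's :script process that evals whatever the Crush side sends. One script per line and port 7777 are purely for illustration:

importClass(java.net.ServerSocket);
importClass(java.net.InetAddress);
importClass(java.io.BufferedReader);
importClass(java.io.InputStreamReader);
importClass(java.io.PrintWriter);

// Bind to loopback only; nothing off-device can reach the bridge.
var server = new ServerSocket(7777, 1, InetAddress.getByName("127.0.0.1"));
threads.start(function () {
    while (true) {
        var sock = server.accept();
        var reader = new BufferedReader(new InputStreamReader(sock.getInputStream()));
        var writer = new PrintWriter(sock.getOutputStream(), true);
        var src = reader.readLine();   // one script per line, illustrative
        try {
            var result = eval(src);    // run in this Rhino engine, full AutoX API access
            writer.println(JSON.stringify({ ok: true, result: String(result) }));
        } catch (e) {
            writer.println(JSON.stringify({ ok: false, error: String(e) }));
        }
        sock.close();
    }
});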
If LetItDo is interesting, AutoX and Crush deserve most of the credit. I'm being explicit about this because it's the truth and because it tells you what's actually novel here: not the agent, not the phone control, but the combination plus the OEM trick.
Honest speed numbers
Single-action tasks like "turn on flashlight" floor at 12-18s. Two LLM round-trips per task (decide → run_script → summarize), each ~5-7s on Qwen DashScope. The visible action itself is 30-50ms. The structural floor is the round-trip count. A persistent daemon between prompts saves ~2s of cold-start. Switching to Groq or Cerebras for sub-1s inference saves another ~8s. Neither has shipped yet.
Multi-action tasks like Instagram engagement feel faster than single-action because the agent batches: one run_script with a for-loop over 5 reels = 1 LLM round-trip for 5 actions. Visible activity hides the LLM wait.
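What a batched run_script looks like in practice. Coordinates, delays, and the double-tap-to-like gesture are illustrative:

// One LLM round-trip, five actions: like and advance through 5 reels.
var w = device.width, h = device.height;
for (var i = 0; i < 5; i++) {
    // Double-tap center to like the current reel.
    click(w / 2, h / 2); sleep(120); click(w / 2, h / 2);
    sleep(800);
    // Swipe up to the next reel: swipe(x1, y1, x2, y2, durationMs).
    swipe(w / 2, h * 0.8, w / 2, h * 0.2, 300);
    sleep(1500);
}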
What hasn't shipped
- Persistent Crush daemon between prompts. Right now every voice command spawns a fresh process. Could be ~0s cold-start with a long-running daemon listening on stdin or a socket.
- Vision pipeline with cost meter. Vision works (qwen3-vl-plus, gpt-4o, gemini-flash all forward image content correctly) but burns tokens. There's no usage display in the UI yet.
- Cross-prompt memory. Each Crush invocation is a fresh session. Saying "send hi to Shaun" then "make it three exclamation marks" doesn't work; the second prompt has no idea what "it" refers to.
- Play Store distribution. Same accessibility-policy reason Sova got banned will likely catch LetItDo. Sideload only.
- iOS. iOS has no equivalent of AccessibilityService for third-party apps. The whole architecture is non-portable.
What's next
If you want early access, the waitlist is here: https://tally.so/r/jaGvx9
Open questions I'd like feedback on:
Is "untethered Android voice agent" a category, or just a feature Google will eventually ship in Gemini? (Their Android AppFunctions API in Feb 2026 suggests yes.)
Should LetItDo's skill library be private to the user (current state), community-shared (PR-style like browser-harness), or auto-synced via cloud for everyone's benefit (network effects but governance nightmares)?
Vision is gated behind model selection. qwen3-vl-plus works, qwen-plus doesn't. Is the right answer to require vision for everyone, or is the a11y-tree-first design good enough that vision stays optional?
If you've got opinions, the comments or my DMs are open.
The closing lesson if you're starting something similar: smoke-test every helper on the device the day you write it. The bug that wasted my Day 3 was a function that had been silently returning null bounds for 48 hours. None of my "tests" caught it because they all hit a different code path. Two days of debugging that should have been two minutes if I'd checked output once. Speed of iteration on a real device is the whole game.