Abishek Muthian
My Wife Is Losing the Ability to Use Her Phone. So I Built an AI to Use It for Her

I created this content to enter the Gemini Live Agent Challenge hackathon.

My partner lives with a rare disease called GNE Myopathy, which causes progressive weakening of the muscles. At advanced stages, even basic tasks like using a smartphone become an uphill battle. To solve that, I'm building an accessibility-first, voice-driven AI phone navigator.

So I built Access Agent.

What is Access Agent?

Access Agent is a voice-driven AI phone navigator. You speak a goal — "Message Sarah I'll be 10 minutes late", "Play relaxing jazz on YouTube", "What's on my screen?" — and the agent sees the current screen, reasons about what steps are needed, and executes them autonomously.

It is not a voice macro system. It doesn't have pre-programmed action maps per app. It reads the live screen state at every step and figures out the sequence on its own. It works on any app, any screen, without training or configuration.

Check out my demo video:

All it requires is an Android phone, a USB cable, and a Chromium-based browser.

The Architecture

Before diving into each layer, here's the full picture:

Architecture diagram of Access Agent

Three zones:

  • Browser (left) — captures microphone audio via AudioWorklet, renders an orb visualiser, and runs the Tango WebUSB ADB client that talks directly to the phone
  • Server (centre) — FastAPI + WebSocket server hosting the ADK Live Agent and DroidRun phone agent
  • Phone (right) — DroidRun Portal APK providing the accessibility tree + screenshot API over HTTP on port 8080

The browser proxies all phone ADB commands over the same WebSocket to the server, so the server can run anywhere — including Google Cloud Run.
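
A minimal sketch of what that proxying can look like on the server side — the class name and message shapes here are illustrative assumptions, not the actual protocol. Each outgoing `adb_request` carries an id, and a future keyed by that id resolves when the browser posts back the matching `adb_response`:

```python
import asyncio
import json
import uuid

class AdbBridge:
    """Sketch of the server side of the ADB-over-WebSocket proxy.
    Names and message shapes are illustrative, not Access Agent's exact protocol."""

    def __init__(self, send_json):
        self._send_json = send_json  # coroutine that writes JSON to the WebSocket
        self._pending = {}           # request id -> Future awaiting the browser's reply

    async def shell(self, cmd, timeout=30):
        req_id = uuid.uuid4().hex
        fut = asyncio.get_running_loop().create_future()
        self._pending[req_id] = fut
        await self._send_json({"adb_request": {"id": req_id, "method": "shell", "cmd": cmd}})
        try:
            # The browser executes the command via Tango and replies over the same socket
            return await asyncio.wait_for(fut, timeout)
        finally:
            self._pending.pop(req_id, None)

    def on_message(self, raw):
        """Called for every incoming WebSocket message; resolves matching requests."""
        msg = json.loads(raw)
        resp = msg.get("adb_response")
        if resp:
            fut = self._pending.get(resp["id"])
            if fut and not fut.done():
                fut.set_result(resp.get("data"))
```

The same request/response matching works for all three bridge methods; only the payload changes.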

The Voice Layer: Gemini Live API

The most important architectural choice was using the Gemini Live API for the voice layer, specifically gemini-live-2.5-flash-native-audio on Vertex AI.

Here's why every word in that model name matters:

  • Live — real-time bidirectional audio streaming. There is no "record then send" round-trip. Audio flows continuously in both directions over a persistent connection. The agent can interrupt you mid-sentence and you can interrupt it mid-sentence.
  • native-audio — the model produces speech directly as PCM audio output. No separate TTS step, no added latency, no robotic synthesis voice. The same model that understands your intent also speaks the response.
  • Server-side VAD — Gemini detects when you've finished speaking on the server. No client-side silence detection, no hardcoded 1.5s timeouts, no "please hold for silence" bugs.

I'm using Google ADK v1.17+ which wraps all of this cleanly:

```python
# The entire Live session lifecycle in ~10 lines
# (imports shown for context; module paths may vary across ADK versions)
from google.adk.agents import LiveRequestQueue
from google.adk.runners import Runner
from google.genai.types import Blob

runner = Runner(agent=live_agent, app_name="access-agent", session_service=session_svc)
queue = LiveRequestQueue()

async for event in runner.run_live(session=session, live_request_queue=queue):
    if event.content and event.content.parts:
        for part in event.content.parts:
            if part.inline_data:
                # Raw PCM16 @ 24kHz — send to browser
                await ws.send_bytes(audio_response_json(part.inline_data.data))

# Feed audio from browser mic
await queue.send_realtime(Blob(data=pcm_bytes, mime_type="audio/pcm;rate=16000"))
```

Two production concerns I had to solve:

Session resumption. Gemini Live sessions have a ~10-minute WebSocket connection limit. With SessionResumptionConfig() in the ADK RunConfig, ADK automatically reconnects and restores the full conversation context when the limit is hit — the user never notices.

Context window compression. Long phone-control sessions accumulate tokens quickly (screenshots are expensive). With ContextWindowCompressionConfig(trigger_tokens=100000, target_tokens=80000), ADK compresses the context before it overflows. Without this, sessions degrade noticeably after 15-20 minutes.
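
Both knobs live on the ADK `RunConfig`. A sketch, assuming the current `google-adk` / `google-genai` type layouts — in the genai types the compression target sits on a `SlidingWindow`, so check your version's field names before copying this:

```python
# Sketch of the RunConfig wiring described above; field names are from the
# google-genai types and may differ across ADK versions.
from google.adk.agents.run_config import RunConfig
from google.genai.types import (
    ContextWindowCompressionConfig,
    SessionResumptionConfig,
    SlidingWindow,
)

run_config = RunConfig(
    response_modalities=["AUDIO"],
    # Reconnect and restore conversation context when the ~10-minute limit is hit
    session_resumption=SessionResumptionConfig(),
    # Compress before screenshot-heavy sessions overflow the context
    context_window_compression=ContextWindowCompressionConfig(
        trigger_tokens=100000,
        sliding_window=SlidingWindow(target_tokens=80000),
    ),
)
```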

The Audio Pipeline: Why AudioWorklet Matters for Accessibility

The browser audio pipeline deserves its own section because getting it wrong would make the product unusable for people with motor impairments.

The old Web Audio API approach used ScriptProcessorNode, which runs audio processing on the main JavaScript thread. Under any CPU load — a slow render, a garbage collection pause — frames get dropped. For a user who can barely lift a finger, a dropped mic frame that causes the agent to mishear a command is not a minor annoyance, it's a failure.

Access Agent uses AudioWorklet instead:

  • pcm-recorder-processor.js runs on a dedicated audio thread at 16 kHz, capturing mic input as Float32 frames, converting to PCM16, and posting them to the main thread for WebSocket transmission. Frames are never dropped regardless of main thread load.
  • pcm-player-processor.js maintains a 180-second ring buffer for agent speech playback at 24 kHz. Audio is enqueued as chunks arrive and plays back gaplessly.
  • Barge-in is instant: when the user speaks while the agent is talking, the frontend sends { command: "endOfAudio" } to the worklet, which clears the ring buffer immediately. The agent stops mid-sentence.

Audio travels over the WebSocket as raw binary frames, not base64-encoded JSON. This eliminates the 33% size overhead and the encoding/decoding CPU cost on every frame.
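
The 33% figure is plain base64 arithmetic: every 3 bytes of PCM become 4 ASCII characters, plus padding. A quick check on one hypothetical 20 ms mic frame:

```python
import base64

# One 20 ms mic frame at 16 kHz mono PCM16: 16000 * 0.02 samples * 2 bytes
frame = bytes(640)

encoded = base64.b64encode(frame)
overhead = len(encoded) / len(frame)
print(overhead)  # 1.3375 — the ~33% size penalty that raw binary frames avoid
```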

The Phone Agent: Autonomous Reasoning with DroidRun

The Live Agent handles voice I/O. When the user asks to do something on the phone, the Live Agent calls perform_phone_action(goal) — a single tool that delegates to a separate DroidRun DroidAgent.

DroidRun uses a CodeAct workflow: at each step, gemini-2.5-flash takes a screenshot and the accessibility tree, then generates Python code calling atomic action functions:

```python
# What the model generates internally — not what you write
screenshot = take_screenshot()
elements = get_ui_state()

# Model reasons: I see WhatsApp is not open. Launch it.
start_app("com.whatsapp")

# Next step: Model sees WhatsApp home. Find Sarah.
tap_by_index(3)  # index 3 = Sarah's chat from accessibility tree

# Next step: Model sees the chat. Compose message.
tap_by_index(12)  # compose field
type("I'll be 10 minutes late")
tap_by_index(14)  # send button
complete(success=True, reason="Message sent to Sarah")
```

The model writes the code. AdbTools executes it. The loop continues until complete() is called.

gemini-2.5-flash has a 1 million token context window. This is non-negotiable — a single screenshot at full device resolution plus the accessibility tree can be 50-100KB of tokens per step. With the previous architecture (ComputerUse model at 131k context), multi-step tasks overflowed the context after 2-3 iterations.

Zero-Install UX: The Accessibility Differentiator

The single feature I'm most proud of is one most developers might not even notice: the user never installs anything on their phone.

Access Agent requires a DroidRun Portal APK on the phone for accessibility tree access and reliable screenshots. Instead of sending the user to an app store or asking them to sideload an APK — both of which require dexterity and technical knowledge — the agent handles everything:

  1. The server downloads the Portal APK from GitHub releases
  2. Tango (WebUSB ADB client running in the browser) pushes the APK directly to the phone's temp storage via adb sync
  3. The server issues pm install via ADB shell to install it silently
  4. The Live Agent then guides the user through enabling the Accessibility Service — entirely by voice
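
As plain `adb` invocations, steps 2 and 3 roughly correspond to the sketch below. The temp path, package-manager flags, and function name are assumptions, and the real flow issues these through Tango's WebUSB ADB client in the browser rather than a local `adb` binary:

```python
# Illustrative sketch of the silent-install sequence; paths and flags are
# assumptions, not Access Agent's exact commands.
def portal_install_commands(apk_path: str, tmp_path: str = "/data/local/tmp/portal.apk"):
    return [
        # Step 2: push the downloaded APK to the phone's temp storage
        ["adb", "push", apk_path, tmp_path],
        # Step 3: install it silently; -r reinstalls, -g grants runtime permissions
        ["adb", "shell", "pm", "install", "-r", "-g", tmp_path],
    ]
```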

The full experience: open the URL, plug in the cable, press the mic button, and speak. The agent says "I see your phone is connected. I'm installing the required software — this will take about 30 seconds." Then: "Now I need you to enable the Accessibility Service. I'll guide you step by step."

No app store. No sideloading. No technical knowledge required.

WebADB: ADB Over WebSocket

Here's the core infrastructure challenge: the server runs on Google Cloud Run. The user's phone is on their desk. How does the server control the phone?

The answer is WebADB (ya-webadb / Tango). Tango is a full ADB client implementation in JavaScript that runs in the browser over WebUSB. The browser talks to the phone directly via USB. The server sends ADB commands as JSON messages over the existing audio WebSocket, and the browser executes them via Tango and returns results.

```
Server                         Browser                    Phone
  |                              |                          |
  |-- { adb_request:             |                          |
  |     method: "shell",         |                          |
  |     cmd: "screencap -p" } -->|                          |
  |                              |-- ADB shell command ---->|
  |                              |<-- PNG bytes ------------|
  |<-- { adb_response:           |                          |
  |      data: <png bytes> } ----|                          |
```

Three adb_request methods: shell (arbitrary ADB shell commands), portal_http (HTTP requests to the DroidRun Portal on port 8080), and screencap (binary screenshot capture via screencap -p).

Two engineering challenges I hit:

createSocket unreliable across Android versions. Tango's adb.createSocket("tcp:8080") works on Android 14+ but fails silently on Android 11 (OnePlus devices in particular). The fix: a two-tier strategy that tries createSocket first, and falls back to sending the HTTP request via echo '<base64_request>' | base64 -d | toybox nc 127.0.0.1 8080 through an ADB shell. The fallback uses Content-Length-aware chunked reading to avoid truncation — toybox nc exits when stdin closes, so we keep stdin open with a sleep 30 pipeline and read until Content-Length bytes are received.
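
Stripped of the ADB plumbing, the Content-Length-aware read looks like this. `read_chunk` is a hypothetical stand-in for reading a chunk of bytes from the shell stream (returning `b''` at end of stream):

```python
def read_http_response(read_chunk):
    """Read one HTTP response from a chunked byte source, stopping only when
    Content-Length bytes of body have arrived. Sketch, not the actual code."""
    buf = b""
    while b"\r\n\r\n" not in buf:            # accumulate until headers are complete
        chunk = read_chunk()
        if not chunk:
            raise IOError("stream closed before headers completed")
        buf += chunk
    headers, _, body = buf.partition(b"\r\n\r\n")
    length = 0
    for line in headers.split(b"\r\n"):
        if line.lower().startswith(b"content-length:"):
            length = int(line.split(b":", 1)[1].strip().decode())
    while len(body) < length:                # nc may deliver the body in pieces
        chunk = read_chunk()
        if not chunk:
            raise IOError("stream truncated mid-body")
        body += chunk
    return headers, body[:length]
```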

ADB transport staleness. After ~8 minutes of idle (no phone actions), the Tango WebUSB transport goes stale and stops responding. A keepalive_task on the server sends an echo 1 shell command via ADB bridge every 2 minutes to keep the transport warm. This saved my partner's session from silently dying mid-use.
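
The keepalive loop itself is a few lines of asyncio. A sketch, where `run_shell` is a hypothetical stand-in for the ADB-bridge shell call:

```python
import asyncio

# Minimal sketch of the keepalive loop; interval and command are from the text,
# the run_shell callable is an assumption standing in for the ADB bridge.
async def keepalive_task(run_shell, interval: float = 120.0):
    while True:
        await asyncio.sleep(interval)
        try:
            await run_shell("echo 1")  # any cheap command keeps the transport warm
        except Exception:
            pass                       # a failed ping should not kill the session
```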

Vertex AI: No API Key Required

The deployed instance runs on Vertex AI instead of the AI Studio Gemini API. This is critical for accessibility: users with motor impairments should not have to navigate a settings modal to paste an API key.

```
# .env on Cloud Run
GOOGLE_GENAI_USE_VERTEXAI=TRUE
PLATFORM=webadb
```

With Vertex AI:

  • Auth is via ADC (Application Default Credentials) — the Cloud Run service account has roles/aiplatform.user
  • The frontend fetches /health on load; if auth_mode: "vertex_ai" is returned, the API key modal is skipped entirely
  • google-adk, google-genai, and DroidRun's GoogleGenAI provider all respect GOOGLE_GENAI_USE_VERTEXAI transparently — no code changes between modes, just one env var
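
The server's side of that `/health` handshake can be read straight from the environment. A sketch with an illustrative function name:

```python
import os

def auth_mode(env=os.environ) -> str:
    """Mirror of the one-env-var switch: Vertex AI uses ADC, otherwise the
    frontend must collect an API key. Function name is illustrative."""
    if env.get("GOOGLE_GENAI_USE_VERTEXAI", "").upper() in ("TRUE", "1"):
        return "vertex_ai"  # frontend skips the API key modal
    return "api_key"
```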

For developers running locally: switch to AI Studio mode with bash scripts/toggle_vertex.sh off and provide your own API key via the frontend modal.

Google Cloud Run: Built for Long-Lived WebSockets

Most Cloud Run deployments are stateless HTTP services. Access Agent is different — each session is a persistent WebSocket connection that can stay alive for up to an hour. This required non-default configuration:

| Setting | Value | Why |
| --- | --- | --- |
| Timeout | 3600s | 1-hour voice sessions |
| Concurrency | 10 | Each session holds ~200MB RAM + a live Gemini session |
| Session affinity | Enabled | A WebSocket can't reconnect to a different instance mid-session |
| Min instances | 0 | Scales to zero — zero cost when idle |
| Memory | 2Gi | DroidRun + ADK + buffered audio |

One subtle issue: the websockets library's default ping_timeout is 20 seconds. The Vertex AI server doesn't respond to WebSocket ping frames while a DroidRun tool call is in progress (which can take 30-120 seconds). This caused spurious disconnects. Fix: monkey-patch the default at startup:

```python
import websockets.asyncio.client

websockets.asyncio.client.connect.__init__.__kwdefaults__["ping_timeout"] = 300
```

A Few Engineering Details Worth Sharing

Real-time speech interruption. When the user speaks during a multi-step phone action, the running DroidRun agent is cancelled immediately. Detection: each audio frame's PCM16 bytes are converted to signed shorts, RMS is computed, and if rms > 1500 while a tool call is running, handler.cancel_run() is called. A background asyncio task watches for this continuously — even when stream_events() is blocked on an in-flight LLM call. A 5-second cooldown prevents the same utterance from triggering 3-4 consecutive cancels.
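
The detection math is tiny. A sketch of the RMS gate, with the threshold from the text and hypothetical function names:

```python
import array
import math

BARGE_IN_RMS = 1500  # threshold from the text; tune per microphone

def pcm16_rms(pcm_bytes: bytes) -> float:
    """RMS amplitude of PCM16 audio (assumes native little-endian samples)."""
    samples = array.array("h")
    samples.frombytes(pcm_bytes)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def should_cancel_run(pcm_bytes: bytes, tool_running: bool) -> bool:
    """True when the user is audibly speaking while a phone action is in flight."""
    return tool_running and pcm16_rms(pcm_bytes) > BARGE_IN_RMS
```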

Blind operation detection. If DroidRun's portal HTTP times out silently, the CodeAct agent continues without any screen state — it hallucinates a response. Detection: a real step with screenshot + accessibility tree uses 1500+ prompt tokens; a blind step uses under 800. One-step "successes" with fewer than 800 tokens are overridden to an error: "I couldn't see the phone screen."
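
A sketch of that override, with illustrative names and the threshold from the text: a one-step "success" that consumed too few prompt tokens never actually saw the screen.

```python
# Heuristic from the text: real steps carry screenshot + accessibility tree
# (1500+ prompt tokens); blind steps come in under 800. Names are illustrative.
BLIND_STEP_TOKEN_THRESHOLD = 800

def check_result(success: bool, steps: int, prompt_tokens: int):
    """Override suspiciously cheap one-step successes with an honest error."""
    if success and steps == 1 and prompt_tokens < BLIND_STEP_TOKEN_THRESHOLD:
        return (False, "I couldn't see the phone screen.")
    return (success, None)
```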

Anti-fabrication rules. gemini-live-2.5-flash-native-audio has a tendency to fabricate [SYSTEM] Setup complete messages from garbled ambient audio — the model pattern-matches something that sounds like the onboarding sequence and runs with it. The system instruction includes four explicit rules: never say [SYSTEM] aloud, never fabricate [SYSTEM] messages, always rephrase real ones naturally, and never say "Setup complete" without a real server message.

What's Next

  • iOS support is the obvious next step. The architecture already isolates the phone control layer behind a PlatformService interface — adding a remote iOS controller would slot in without touching the voice or audio layers.
  • I'm also exploring haptic feedback for confirmation (phone vibrates when an action completes) and a "describe what's on the screen" shortcut for situations where my partner needs a quick visual read without a full task.
  • I'm going to take Access Agent to the people who need it and get real feedback from them. If you're one of them, please get in touch.
