<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Abishek Muthian</title>
    <description>The latest articles on DEV Community by Abishek Muthian (@abishek_muthian).</description>
    <link>https://dev.to/abishek_muthian</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3603408%2F2893688d-2a8d-4f5e-bcf4-981e42832988.png</url>
      <title>DEV Community: Abishek Muthian</title>
      <link>https://dev.to/abishek_muthian</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/abishek_muthian"/>
    <language>en</language>
    <item>
      <title>My Wife Is Losing the Ability to Use Her Phone. So I Built an AI to Use It for Her</title>
      <dc:creator>Abishek Muthian</dc:creator>
      <pubDate>Mon, 16 Mar 2026 16:51:53 +0000</pubDate>
      <link>https://dev.to/abishek_muthian/my-wife-is-losing-the-ability-to-use-her-phone-so-i-built-an-ai-to-use-it-for-her-448m</link>
      <guid>https://dev.to/abishek_muthian/my-wife-is-losing-the-ability-to-use-her-phone-so-i-built-an-ai-to-use-it-for-her-448m</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;I created this content to enter the Gemini Live Agent Challenge hackathon.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My partner lives with a rare disease called &lt;a href="https://www.gne-myopathy.org/" rel="noopener noreferrer"&gt;GNE Myopathy&lt;/a&gt;, which causes &lt;strong&gt;progressive weakening of the muscles&lt;/strong&gt;. At advanced stages, even basic tasks like using a smartphone become an uphill battle.&lt;/p&gt;

&lt;p&gt;So I built Access Agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Access Agent?
&lt;/h2&gt;

&lt;p&gt;Access Agent is a voice-driven AI phone navigator. You speak a goal — "Message Sarah I'll be 10 minutes late", "Play relaxing jazz on YouTube", "What's on my screen?" — and the agent sees the current screen, reasons about what steps are needed, and executes them autonomously.&lt;/p&gt;

&lt;p&gt;It is not a voice macro system. It doesn't have pre-programmed action maps per app. It reads the live screen state at every step and figures out the sequence on its own. It works on any app, any screen, without training or configuration.&lt;/p&gt;

&lt;p&gt;Check out my demo video:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/1drH0zeBBx8"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;All it requires is an Android phone, a USB cable, and a Chromium-based browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Before diving into each layer, here's the full picture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm2ippgp50p8iz3z09xs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm2ippgp50p8iz3z09xs.png" alt="Architecture diagram of Access Agent"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three zones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Browser (left)&lt;/strong&gt; — captures microphone audio via AudioWorklet, renders an orb visualiser, and runs the Tango WebUSB ADB client that talks directly to the phone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server (centre)&lt;/strong&gt; — FastAPI + WebSocket server hosting the ADK Live Agent and DroidRun phone agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phone (right)&lt;/strong&gt; — DroidRun Portal APK providing the accessibility tree + screenshot API over HTTP on port 8080&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The browser proxies all phone ADB commands over the same WebSocket to the server, so the server can run anywhere — including Google Cloud Run.&lt;/p&gt;
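&lt;p&gt;To make the proxying concrete, here is a minimal sketch of how the server side could frame those commands. The &lt;code&gt;adb_request&lt;/code&gt; and &lt;code&gt;adb_response&lt;/code&gt; field names follow this post, but the helpers themselves are illustrative, not the actual code:&lt;/p&gt;

```python
# Illustrative sketch: the server never talks to the phone directly.
# It serialises each ADB command as JSON, sends it down the audio
# WebSocket, and waits for the browser's matching reply.
import itertools
import json

_request_ids = itertools.count(1)

def build_adb_request(method: str, cmd: str) -> tuple[int, str]:
    """Build one adb_request frame; the id lets us match the response."""
    request_id = next(_request_ids)
    frame = json.dumps({
        "adb_request": {"id": request_id, "method": method, "cmd": cmd}
    })
    return request_id, frame

def parse_adb_response(raw: str, expected_id: int) -> dict:
    """Return the response payload if it matches the request id."""
    msg = json.loads(raw)
    resp = msg.get("adb_response", {})
    if resp.get("id") != expected_id:
        raise ValueError("response id does not match request id")
    return resp
```

&lt;p&gt;Because the request carries its own id, several commands can be in flight over the one WebSocket without their responses getting crossed.&lt;/p&gt;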

&lt;h2&gt;
  
  
  The Voice Layer: Gemini Live API
&lt;/h2&gt;

&lt;p&gt;The most important architectural choice was using the &lt;strong&gt;Gemini Live API&lt;/strong&gt; for the voice layer, specifically &lt;code&gt;gemini-live-2.5-flash-native-audio&lt;/code&gt; on Vertex AI.&lt;/p&gt;

&lt;p&gt;Here's why every word in that model name matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live&lt;/strong&gt; — real-time bidirectional audio streaming. There is no "record then send" round-trip. Audio flows continuously in both directions over a persistent connection. The agent can interrupt you mid-sentence and you can interrupt it mid-sentence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;native-audio&lt;/strong&gt; — the model produces speech directly as PCM audio output. No separate TTS step, no added latency, no robotic synthesis voice. The same model that understands your intent also speaks the response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server-side VAD&lt;/strong&gt; — Gemini detects when you've finished speaking on the server. No client-side silence detection, no hardcoded 1.5s timeouts, no "please hold for silence" bugs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm using Google ADK v1.17+, which wraps all of this cleanly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The entire Live session lifecycle in ~10 lines
&lt;/span&gt;&lt;span class="n"&gt;runner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;live_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_svc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LiveRequestQueue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_live&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;live_request_queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# Raw PCM16 @ 24kHz — send to browser
&lt;/span&gt;                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;audio_response_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Feed audio from browser mic
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_realtime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pcm_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/pcm;rate=16000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two production concerns I had to solve:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session resumption.&lt;/strong&gt; Gemini Live sessions have a ~10-minute WebSocket connection limit. With &lt;code&gt;SessionResumptionConfig()&lt;/code&gt; in the ADK &lt;code&gt;RunConfig&lt;/code&gt;, ADK automatically reconnects and restores the full conversation context when the limit is hit — the user never notices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context window compression.&lt;/strong&gt; Long phone-control sessions accumulate tokens quickly (screenshots are expensive). With &lt;code&gt;ContextWindowCompressionConfig(trigger_tokens=100000, target_tokens=80000)&lt;/code&gt;, ADK compresses the context before it overflows. Without this, sessions degrade noticeably after 15-20 minutes.&lt;/p&gt;
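&lt;p&gt;Put together, the &lt;code&gt;RunConfig&lt;/code&gt; looks roughly like this. Treat it as a sketch rather than copy-paste: the exact field names and the shape of &lt;code&gt;ContextWindowCompressionConfig&lt;/code&gt; vary between ADK and google-genai versions:&lt;/p&gt;

```python
# Sketch only: field names may differ across ADK / google-genai versions.
from google.adk.agents.run_config import RunConfig, StreamingMode
from google.genai import types

run_config = RunConfig(
    streaming_mode=StreamingMode.BIDI,
    response_modalities=["AUDIO"],
    # Reconnect and restore context when the ~10-minute Live limit hits
    session_resumption=types.SessionResumptionConfig(),
    # Compress before overflow, using the thresholds discussed above
    context_window_compression=types.ContextWindowCompressionConfig(
        trigger_tokens=100_000,
        sliding_window=types.SlidingWindow(target_tokens=80_000),
    ),
)
```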

&lt;h2&gt;
  
  
  The Audio Pipeline: Why AudioWorklet Matters for Accessibility
&lt;/h2&gt;

&lt;p&gt;The browser audio pipeline deserves its own section because getting it wrong would make the product unusable for people with motor impairments.&lt;/p&gt;

&lt;p&gt;The old Web Audio API approach used &lt;code&gt;ScriptProcessorNode&lt;/code&gt;, which runs audio processing on the main JavaScript thread. Under any CPU load — a slow render, a garbage collection pause — frames get dropped. For a user who can barely lift a finger, a dropped mic frame that causes the agent to mishear a command is not a minor annoyance, it's a failure.&lt;/p&gt;

&lt;p&gt;Access Agent uses &lt;strong&gt;AudioWorklet&lt;/strong&gt; instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pcm-recorder-processor.js&lt;/code&gt; runs on a dedicated audio thread at 16 kHz, capturing mic input as Float32 frames, converting to PCM16, and posting them to the main thread for WebSocket transmission. Frames are never dropped regardless of main thread load.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pcm-player-processor.js&lt;/code&gt; maintains a 180-second ring buffer for agent speech playback at 24 kHz. Audio is enqueued as chunks arrive and plays back gaplessly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Barge-in&lt;/strong&gt; is instant: when the user speaks while the agent is talking, the frontend sends &lt;code&gt;{ command: "endOfAudio" }&lt;/code&gt; to the worklet, which clears the ring buffer immediately. The agent stops mid-sentence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Audio travels over the WebSocket as &lt;strong&gt;raw binary frames&lt;/strong&gt;, not base64-encoded JSON. This eliminates the 33% size overhead and the encoding/decoding CPU cost on every frame.&lt;/p&gt;
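&lt;p&gt;A quick back-of-the-envelope check of both claims, as a sketch (the real conversion happens in the worklet's JavaScript; this just mirrors the arithmetic in Python):&lt;/p&gt;

```python
# Float32 mic samples become PCM16 (half the bytes), and skipping base64
# avoids the roughly 33% inflation JSON-wrapped audio pays per frame.
import array
import base64

def float32_to_pcm16(samples: list[float]) -> bytes:
    """Clamp [-1.0, 1.0] floats and pack as int16 (little-endian on typical hosts)."""
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    return array.array("h", ints).tobytes()

frame = float32_to_pcm16([0.0, 0.5, -0.5, 1.0] * 40)   # 160 samples = 10 ms at 16 kHz
print(len(frame))                                      # 320 raw bytes
print(len(base64.b64encode(frame)))                    # 428 bytes after base64, about 34% more
```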

&lt;h2&gt;
  
  
  The Phone Agent: Autonomous Reasoning with DroidRun
&lt;/h2&gt;

&lt;p&gt;The Live Agent handles voice I/O. When the user asks to do something on the phone, the Live Agent calls &lt;code&gt;perform_phone_action(goal)&lt;/code&gt; — a single tool that delegates to a separate &lt;strong&gt;DroidRun DroidAgent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;DroidRun uses a &lt;strong&gt;CodeAct workflow&lt;/strong&gt;: at each step, &lt;code&gt;gemini-2.5-flash&lt;/code&gt; takes a screenshot and the accessibility tree, then generates Python code calling atomic action functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What the model generates internally — not what you write
&lt;/span&gt;&lt;span class="n"&gt;screenshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;take_screenshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_ui_state&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Model reasons: I see WhatsApp is not open. Launch it.
&lt;/span&gt;&lt;span class="nf"&gt;start_app&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;com.whatsapp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Next step: Model sees WhatsApp home. Find Sarah.
&lt;/span&gt;&lt;span class="nf"&gt;tap_by_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# index 3 = Sarah's chat from accessibility tree
&lt;/span&gt;
&lt;span class="c1"&gt;# Next step: Model sees the chat. Compose message.
&lt;/span&gt;&lt;span class="nf"&gt;tap_by_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# compose field
&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ll be 10 minutes late&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;tap_by_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# send button
&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Message sent to Sarah&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model writes the code. AdbTools executes it. The loop continues until &lt;code&gt;complete()&lt;/code&gt; is called.&lt;/p&gt;
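&lt;p&gt;Stripped to its skeleton, that loop is easy to picture. Everything below is a hypothetical sketch; &lt;code&gt;generate_step&lt;/code&gt; and the tool names stand in for DroidRun's real machinery:&lt;/p&gt;

```python
# Hypothetical CodeAct-style control loop: each iteration asks the model
# for Python code, executes it against the tool functions, and stops when
# the generated code calls complete().
class Done(Exception):
    pass

def complete(success: bool, reason: str):
    raise Done(reason)

def run_codeact(generate_step, tools: dict, max_steps: int = 15) -> str:
    tools = {**tools, "complete": complete}
    for _ in range(max_steps):
        code = generate_step()     # model writes Python for this step
        try:
            exec(code, tools)      # AdbTools-style functions are the globals
        except Done as done:
            return str(done)
    return "step budget exhausted"
```

&lt;p&gt;The step budget matters in practice: without it, a confused model can tap the same dead button forever.&lt;/p&gt;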

&lt;p&gt;&lt;code&gt;gemini-2.5-flash&lt;/code&gt; has a &lt;strong&gt;1 million token context window&lt;/strong&gt;. This is non-negotiable: a single screenshot at full device resolution plus the accessibility tree can add 50-100KB of input per step, which tokenises into a substantial prompt. With the previous architecture (ComputerUse model at 131k context), multi-step tasks overflowed the context after 2-3 iterations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Zero-Install UX: The Accessibility Differentiator
&lt;/h2&gt;

&lt;p&gt;The single feature I'm most proud of is one most developers might not even notice: &lt;strong&gt;the user never installs anything on their phone&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Access Agent requires a DroidRun Portal APK on the phone for accessibility tree access and reliable screenshots. Instead of sending the user to an app store or asking them to sideload an APK — both of which require dexterity and technical knowledge — the agent handles everything:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The server downloads the Portal APK from GitHub releases&lt;/li&gt;
&lt;li&gt;Tango (WebUSB ADB client running in the browser) pushes the APK directly to the phone's temp storage via &lt;code&gt;adb sync&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The server issues &lt;code&gt;pm install&lt;/code&gt; via ADB shell to install it silently&lt;/li&gt;
&lt;li&gt;The Live Agent then guides the user through enabling the Accessibility Service — entirely by voice&lt;/li&gt;
&lt;/ol&gt;
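&lt;p&gt;The sequence is simple enough to sketch. &lt;code&gt;download&lt;/code&gt;, &lt;code&gt;adb_push&lt;/code&gt; and &lt;code&gt;adb_shell&lt;/code&gt; here are hypothetical stand-ins for the real WebSocket-proxied ADB bridge:&lt;/p&gt;

```python
# Hypothetical sketch of the zero-install flow described above.
async def install_portal(download, adb_push, adb_shell) -> None:
    apk = download()                                             # 1. server fetches the Portal APK
    await adb_push(apk, "/data/local/tmp/portal.apk")            # 2. Tango pushes it over WebUSB
    await adb_shell("pm install -r /data/local/tmp/portal.apk")  # 3. silent install via ADB shell
    # 4. the Live Agent then talks the user through enabling the
    #    Accessibility Service entirely by voice
```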

&lt;p&gt;The full experience: open the URL, plug in the cable, press the mic button, and speak. The agent says "I see your phone is connected. I'm installing the required software — this will take about 30 seconds." Then: "Now I need you to enable the Accessibility Service. I'll guide you step by step."&lt;/p&gt;

&lt;p&gt;No app store. No sideloading. No technical knowledge required.&lt;/p&gt;

&lt;h2&gt;
  
  
  WebADB: ADB Over WebSocket
&lt;/h2&gt;

&lt;p&gt;Here's the core infrastructure challenge: the server runs on Google Cloud Run. The user's phone is on their desk. How does the server control the phone?&lt;/p&gt;

&lt;p&gt;The answer is &lt;strong&gt;WebADB&lt;/strong&gt; (ya-webadb / Tango). Tango is a full ADB client implementation in JavaScript that runs in the browser over WebUSB. The browser talks to the phone directly via USB. The server sends ADB commands as JSON messages over the existing audio WebSocket, and the browser executes them via Tango and returns results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Server                         Browser                    Phone
  |                              |                          |
  |-- { adb_request:            |                          |
  |     method: "shell",        |                          |
  |     cmd: "screencap -p" } -&amp;gt;|                          |
  |                             |-- ADB shell command ----&amp;gt;|
  |                             |&amp;lt;-- PNG bytes ------------|
  |&amp;lt;-- { adb_response:          |                          |
  |      data: &amp;lt;png bytes&amp;gt; } ---|                          |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three &lt;code&gt;adb_request&lt;/code&gt; methods: &lt;code&gt;shell&lt;/code&gt; (arbitrary ADB shell commands), &lt;code&gt;portal_http&lt;/code&gt; (HTTP requests to the DroidRun Portal on port 8080), and &lt;code&gt;screencap&lt;/code&gt; (binary screenshot capture via &lt;code&gt;screencap -p&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Two engineering challenges I hit:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;createSocket&lt;/code&gt; unreliable across Android versions.&lt;/strong&gt; Tango's &lt;code&gt;adb.createSocket("tcp:8080")&lt;/code&gt; works on Android 14+ but fails silently on Android 11 (OnePlus devices in particular). The fix: a two-tier strategy that tries &lt;code&gt;createSocket&lt;/code&gt; first, and falls back to sending the HTTP request via &lt;code&gt;echo '&amp;lt;base64_request&amp;gt;' | base64 -d | toybox nc 127.0.0.1 8080&lt;/code&gt; through an ADB shell. The fallback uses Content-Length-aware chunked reading to avoid truncation — &lt;code&gt;toybox nc&lt;/code&gt; exits when stdin closes, so we keep stdin open with a &lt;code&gt;sleep 30&lt;/code&gt; pipeline and read until &lt;code&gt;Content-Length&lt;/code&gt; bytes are received.&lt;/p&gt;
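&lt;p&gt;Building that fallback command is the fiddly part. A sketch of its shape (the quoting and exact pipeline are illustrative, and the real client also does the Content-Length-aware read on the response):&lt;/p&gt;

```python
# Wrap the raw HTTP request in base64 and replay it against the Portal's
# local port with toybox nc; the sleep keeps stdin open so nc does not
# exit before the response has streamed back.
import base64

def portal_http_fallback_cmd(path: str, port: int = 8080) -> str:
    request = f"GET {path} HTTP/1.1\r\nHost: 127.0.0.1\r\nConnection: close\r\n\r\n"
    b64 = base64.b64encode(request.encode()).decode()
    return f"(echo '{b64}' | base64 -d; sleep 30) | toybox nc 127.0.0.1 {port}"
```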

&lt;p&gt;&lt;strong&gt;ADB transport staleness.&lt;/strong&gt; After ~8 minutes of idle (no phone actions), the Tango WebUSB transport goes stale and stops responding. A &lt;code&gt;keepalive_task&lt;/code&gt; on the server sends an &lt;code&gt;echo 1&lt;/code&gt; shell command via ADB bridge every 2 minutes to keep the transport warm. This saved my partner's session from silently dying mid-use.&lt;/p&gt;
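&lt;p&gt;The keepalive itself is tiny. A sketch, with &lt;code&gt;adb_shell&lt;/code&gt; again standing in for the WebSocket ADB bridge:&lt;/p&gt;

```python
# Ping the transport every couple of minutes so it never sits idle long
# enough to go stale. A failed ping is swallowed deliberately: it must
# never take the whole session down.
import asyncio

async def keepalive_task(adb_shell, interval_s: float = 120.0) -> None:
    while True:
        try:
            await adb_shell("echo 1")   # any cheap command keeps the transport warm
        except Exception:
            pass
        await asyncio.sleep(interval_s)
```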

&lt;h2&gt;
  
  
  Vertex AI: No API Key Required
&lt;/h2&gt;

&lt;p&gt;The deployed instance runs on &lt;strong&gt;Vertex AI&lt;/strong&gt; instead of the AI Studio Gemini API. This is critical for accessibility: users with motor impairments should not have to navigate a settings modal to paste an API key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# .env on Cloud Run&lt;/span&gt;
&lt;span class="nv"&gt;GOOGLE_GENAI_USE_VERTEXAI&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;TRUE
&lt;span class="nv"&gt;PLATFORM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;webadb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Vertex AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auth is via ADC (Application Default Credentials) — the Cloud Run service account has &lt;code&gt;roles/aiplatform.user&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The frontend fetches &lt;code&gt;/health&lt;/code&gt; on load; if &lt;code&gt;auth_mode: "vertex_ai"&lt;/code&gt; is returned, the API key modal is skipped entirely&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;google-adk&lt;/code&gt;, &lt;code&gt;google-genai&lt;/code&gt;, and DroidRun's &lt;code&gt;GoogleGenAI&lt;/code&gt; provider all respect &lt;code&gt;GOOGLE_GENAI_USE_VERTEXAI&lt;/code&gt; transparently — no code changes between modes, just one env var&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For developers running locally: switch to AI Studio mode with &lt;code&gt;bash scripts/toggle_vertex.sh off&lt;/code&gt; and provide your own API key via the frontend modal.&lt;/p&gt;
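&lt;p&gt;The handshake that makes the modal-skipping work is small. A sketch of the server side (the &lt;code&gt;auth_mode&lt;/code&gt; field name follows this post, while the &lt;code&gt;"api_key"&lt;/code&gt; value for AI Studio mode is my own placeholder):&lt;/p&gt;

```python
# The frontend fetches /health on load; "vertex_ai" means skip the
# API-key modal entirely.
import os

def health_payload() -> dict:
    use_vertex = os.environ.get("GOOGLE_GENAI_USE_VERTEXAI", "").upper() == "TRUE"
    return {"status": "ok", "auth_mode": "vertex_ai" if use_vertex else "api_key"}
```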

&lt;h2&gt;
  
  
  Google Cloud Run: Built for Long-Lived WebSockets
&lt;/h2&gt;

&lt;p&gt;Most Cloud Run deployments are stateless HTTP services. Access Agent is different — each session is a persistent WebSocket connection that can stay alive for up to an hour. This required non-default configuration:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Timeout&lt;/td&gt;
&lt;td&gt;3600s&lt;/td&gt;
&lt;td&gt;1-hour voice sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concurrency&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Each session holds ~200MB RAM + live Gemini session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session affinity&lt;/td&gt;
&lt;td&gt;enabled&lt;/td&gt;
&lt;td&gt;A WebSocket can't reconnect to a different instance mid-session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min instances&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Scales to zero — zero cost when idle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;2Gi&lt;/td&gt;
&lt;td&gt;DroidRun + ADK + buffered audio&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One subtle issue: the &lt;code&gt;websockets&lt;/code&gt; library's default &lt;code&gt;ping_timeout&lt;/code&gt; is 20 seconds. The Vertex AI server doesn't respond to WebSocket ping frames while a DroidRun tool call is in progress (which can take 30-120 seconds). This caused spurious disconnects. Fix: monkey-patch the default at startup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;websockets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__kwdefaults__&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ping_timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  A Few Engineering Details Worth Sharing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Real-time speech interruption.&lt;/strong&gt; When the user speaks during a multi-step phone action, the running DroidRun agent is cancelled immediately. Detection: each audio frame's PCM16 bytes are converted to signed shorts, RMS is computed, and if &lt;code&gt;rms &amp;gt; 1500&lt;/code&gt; while a tool call is running, &lt;code&gt;handler.cancel_run()&lt;/code&gt; is called. A background asyncio task watches for this continuously — even when &lt;code&gt;stream_events()&lt;/code&gt; is blocked on an in-flight LLM call. A 5-second cooldown prevents the same utterance from triggering 3-4 consecutive cancels.&lt;/p&gt;
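&lt;p&gt;The detection itself fits in a few lines. A sketch of just the RMS check, using the threshold above (the real version also handles the cooldown and the cancel plumbing):&lt;/p&gt;

```python
# Convert a PCM16 frame to signed shorts, compute RMS, and flag anything
# above the threshold as the user speaking over a running tool call.
import array
import math

def is_barge_in(pcm16: bytes, threshold: float = 1500.0) -> bool:
    usable = len(pcm16) // 2 * 2        # drop a trailing odd byte, if any
    if usable == 0:
        return False
    samples = array.array("h")
    samples.frombytes(pcm16[:usable])
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > threshold
```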

&lt;p&gt;&lt;strong&gt;Blind operation detection.&lt;/strong&gt; If DroidRun's portal HTTP times out silently, the CodeAct agent continues without any screen state — it hallucinates a response. Detection: a real step with screenshot + accessibility tree uses 1500+ prompt tokens; a blind step uses under 800. One-step "successes" with fewer than 800 tokens are overridden to an error: "I couldn't see the phone screen."&lt;/p&gt;
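&lt;p&gt;That heuristic reduces to a guard like this (a sketch; the numbers are the ones discussed above, the function name is mine):&lt;/p&gt;

```python
# A real step that saw a screenshot plus the accessibility tree costs
# 1500+ prompt tokens; a one-step "success" well under that ran blind.
BLIND_STEP_TOKEN_CEILING = 800

def validate_result(steps: int, prompt_tokens: int, success: bool) -> tuple[bool, str]:
    blind = steps == 1 and BLIND_STEP_TOKEN_CEILING > prompt_tokens
    if success and blind:
        return False, "I couldn't see the phone screen."
    return success, ""
```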

&lt;p&gt;&lt;strong&gt;Anti-fabrication rules.&lt;/strong&gt; &lt;code&gt;gemini-live-2.5-flash-native-audio&lt;/code&gt; has a tendency to fabricate &lt;code&gt;[SYSTEM] Setup complete&lt;/code&gt; messages from garbled ambient audio — the model pattern-matches something that sounds like the onboarding sequence and runs with it. The system instruction includes four explicit rules: never say &lt;code&gt;[SYSTEM]&lt;/code&gt; aloud, never fabricate &lt;code&gt;[SYSTEM]&lt;/code&gt; messages, always rephrase real ones naturally, and never say "Setup complete" without a real server message.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;iOS support is the obvious next step. The architecture already isolates the phone control layer behind a &lt;code&gt;PlatformService&lt;/code&gt; interface — adding a remote iOS controller would slot in without touching the voice or audio layers.&lt;/li&gt;
&lt;li&gt;I'm also exploring haptic feedback for confirmation (phone vibrates when an action completes) and a "describe what's on the screen" shortcut for situations where my partner needs a quick visual read without a full task.&lt;/li&gt;
&lt;li&gt;I'm going to take Access Agent to the people who need it and get real feedback from them. If you're one of them, please get in touch.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gemini</category>
      <category>googleadk</category>
      <category>vertexai</category>
      <category>geminiliveagentchallenge</category>
    </item>
  </channel>
</rss>
