<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Elliot Gao</title>
    <description>The latest articles on DEV Community by Elliot Gao (@elliotgao2).</description>
    <link>https://dev.to/elliotgao2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3949216%2F30aefaac-3307-4d31-8c63-c21343f6e9b3.png</url>
      <title>DEV Community: Elliot Gao</title>
      <link>https://dev.to/elliotgao2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/elliotgao2"/>
    <language>en</language>
    <item>
      <title>Stop Wasting Tokens on Android Automation</title>
      <dc:creator>Elliot Gao</dc:creator>
      <pubDate>Sun, 24 May 2026 15:19:41 +0000</pubDate>
      <link>https://dev.to/elliotgao2/stop-wasting-tokens-on-android-automation-1mep</link>
      <guid>https://dev.to/elliotgao2/stop-wasting-tokens-on-android-automation-1mep</guid>
      <description>&lt;h2&gt;
  
  
  Stop Wasting Tokens on Android Automation
&lt;/h2&gt;

&lt;p&gt;Most LLM-driven Android automation starts by showing the model a screen.&lt;/p&gt;

&lt;p&gt;That sounds reasonable. A human looks at the phone, decides what to tap, and taps it. Give the model the same view.&lt;/p&gt;

&lt;p&gt;The problem is that "the same view" is expensive.&lt;/p&gt;

&lt;p&gt;A full screenshot is expensive. A raw Android UI XML dump is also expensive, just in a quieter way. The model reads thousands of tokens of layout machinery before it reaches the handful of labels that matter:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Email
Password
Continue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For one step, that waste is easy to ignore. For a 50-step mobile agent trajectory, it becomes the bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  The loop
&lt;/h2&gt;

&lt;p&gt;An Android agent usually does this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the current screen.&lt;/li&gt;
&lt;li&gt;Decide what to do.&lt;/li&gt;
&lt;li&gt;Tap, type, or swipe.&lt;/li&gt;
&lt;li&gt;Wait for the next screen.&lt;/li&gt;
&lt;li&gt;Repeat.&lt;/li&gt;
&lt;/ol&gt;
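
&lt;p&gt;The loop above can be sketched in a few lines. This is a minimal illustration, not any tool's real API: &lt;code&gt;read_screen&lt;/code&gt;, &lt;code&gt;decide&lt;/code&gt;, and &lt;code&gt;act&lt;/code&gt; are hypothetical callables supplied by the caller, and the convention that &lt;code&gt;decide&lt;/code&gt; returns &lt;code&gt;None&lt;/code&gt; when the task is finished is an assumption made for the example.&lt;/p&gt;

```python
# Hypothetical agent loop. read_screen, decide, and act are placeholders
# supplied by the caller; none of them is a Handsets or UIAutomator API.

def run_agent(read_screen, decide, act, max_steps=50):
    """Run the read/decide/act cycle until decide() returns None.

    Returns every screen state that was sent to the model, so the caller
    can measure how much screen text the whole trajectory cost.
    """
    sent = []
    for _ in range(max_steps):
        screen = read_screen()   # step 1: read the current screen
        sent.append(screen)
        action = decide(screen)  # step 2: the model picks an action
        if action is None:       # assumed convention: None means "done"
            break
        act(action)              # steps 3-4: perform it, then wait
    return sent


# Usage with scripted stand-ins for the device and the model.
screens = iter(['tap Button "Sign in"', 'tap Button "Continue"', 'text "Dashboard"'])
replies = iter(['tap "Sign in"', 'tap "Continue"', None])
performed = []
history = run_agent(lambda: next(screens), lambda s: next(replies), performed.append)
print(len(history), performed)
```

&lt;p&gt;Every entry in &lt;code&gt;history&lt;/code&gt; is text the model had to read, which is why the size of one screen read is multiplied by the length of the trajectory.&lt;/p&gt;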

&lt;p&gt;The first step is where the token leak begins.&lt;/p&gt;

&lt;p&gt;If you use &lt;code&gt;uiautomator dump&lt;/code&gt;, the model gets XML like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;node&lt;/span&gt; &lt;span class="na"&gt;index=&lt;/span&gt;&lt;span class="s"&gt;"0"&lt;/span&gt; &lt;span class="na"&gt;text=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="na"&gt;resource-id=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;
      &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"android.widget.FrameLayout"&lt;/span&gt;
      &lt;span class="na"&gt;package=&lt;/span&gt;&lt;span class="s"&gt;"com.google.android.apps.nexuslauncher"&lt;/span&gt;
      &lt;span class="na"&gt;content-desc=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;
      &lt;span class="na"&gt;checkable=&lt;/span&gt;&lt;span class="s"&gt;"false"&lt;/span&gt; &lt;span class="na"&gt;checked=&lt;/span&gt;&lt;span class="s"&gt;"false"&lt;/span&gt;
      &lt;span class="na"&gt;clickable=&lt;/span&gt;&lt;span class="s"&gt;"false"&lt;/span&gt; &lt;span class="na"&gt;enabled=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;
      &lt;span class="na"&gt;focusable=&lt;/span&gt;&lt;span class="s"&gt;"false"&lt;/span&gt; &lt;span class="na"&gt;focused=&lt;/span&gt;&lt;span class="s"&gt;"false"&lt;/span&gt;
      &lt;span class="na"&gt;scrollable=&lt;/span&gt;&lt;span class="s"&gt;"false"&lt;/span&gt; &lt;span class="na"&gt;long-clickable=&lt;/span&gt;&lt;span class="s"&gt;"false"&lt;/span&gt;
      &lt;span class="na"&gt;password=&lt;/span&gt;&lt;span class="s"&gt;"false"&lt;/span&gt; &lt;span class="na"&gt;selected=&lt;/span&gt;&lt;span class="s"&gt;"false"&lt;/span&gt;
      &lt;span class="na"&gt;bounds=&lt;/span&gt;&lt;span class="s"&gt;"[0,0][1440,3120]"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is one layout node. It says almost nothing an agent can act on.&lt;/p&gt;

&lt;p&gt;That is not a bug in UIAutomator. The XML dump is a faithful serialization of the accessibility tree. Faithful is not the same as useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;On a few ordinary Android screens, the difference looks like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;screen&lt;/th&gt;
&lt;th&gt;UIAutomator XML&lt;/th&gt;
&lt;th&gt;Handsets &lt;code&gt;hs ui -i&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Launcher home&lt;/td&gt;
&lt;td&gt;3,153 tokens&lt;/td&gt;
&lt;td&gt;246 tokens&lt;/td&gt;
&lt;td&gt;12.8x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Settings home&lt;/td&gt;
&lt;td&gt;5,762 tokens&lt;/td&gt;
&lt;td&gt;729 tokens&lt;/td&gt;
&lt;td&gt;7.9x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Settings -&amp;gt; Apps&lt;/td&gt;
&lt;td&gt;4,050 tokens&lt;/td&gt;
&lt;td&gt;320 tokens&lt;/td&gt;
&lt;td&gt;12.7x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Token counts are from &lt;code&gt;tiktoken&lt;/code&gt; with the GPT-4 encoding. The deeper write-up is &lt;a href="//2026-05-22-android-ui-dump-for-llms.md"&gt;An Android UI Dump for LLMs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The short version: a typical screen that costs 3,000-6,000 tokens as XML can often be represented in a few hundred tokens as an action table.&lt;/p&gt;

&lt;p&gt;Across 50 steps, at the per-screen sizes measured above, that is the difference between sending roughly 160k-290k tokens of screen state and sending roughly 12k-36k.&lt;/p&gt;

&lt;p&gt;The agent usually makes the same decision either way.&lt;/p&gt;
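
&lt;p&gt;The 50-step arithmetic can be reproduced from the table. A minimal sketch, assuming every step sends one screen of the same size (a real trajectory mixes screens, so treat the totals as bounds rather than a forecast):&lt;/p&gt;

```python
# Token counts copied from the table above: (UIAutomator XML, hs ui -i).
SCREENS = {
    "Launcher home": (3153, 246),
    "Settings home": (5762, 729),
    "Settings: Apps": (4050, 320),
}

STEPS = 50  # trajectory length used in the text


def trajectory_tokens(per_screen, steps=STEPS):
    """Total screen-state tokens if every step sends a screen this size."""
    return per_screen * steps


for name, (xml_tokens, table_tokens) in SCREENS.items():
    ratio = xml_tokens / table_tokens
    print(f"{name}: {ratio:.1f}x per screen, "
          f"{trajectory_tokens(xml_tokens):,} vs "
          f"{trajectory_tokens(table_tokens):,} tokens over {STEPS} steps")
```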

&lt;h2&gt;
  
  
  What the model actually needs
&lt;/h2&gt;

&lt;p&gt;For UI automation, the model does not need a DOM-shaped tree.&lt;/p&gt;

&lt;p&gt;It needs a list of things it can act on:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fill  EditText  "Email"     #email     540,540
fill  EditText  "Password"  #password  540,640  [password]
tap   Button    "Continue"  #continue  540,860
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That table gives the model the useful facts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What action is available.&lt;/li&gt;
&lt;li&gt;What label a human sees.&lt;/li&gt;
&lt;li&gt;What type of control it is.&lt;/li&gt;
&lt;li&gt;Where the tool will tap or type.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model can now answer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tap "Continue"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It does not have to parse layout ancestors, negative booleans, fully-qualified class names, or four-number bounds rectangles.&lt;/p&gt;
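
&lt;p&gt;On the tool side, a label-based answer still has to be mapped back to a tap point. The sketch below shows one way to do that. It assumes the column order seen in the example table (action, class, label, id, point, optional flags); &lt;code&gt;parse_rows&lt;/code&gt; and &lt;code&gt;resolve&lt;/code&gt; are hypothetical helpers, not part of Handsets.&lt;/p&gt;

```python
import shlex

UI_TABLE = '''\
fill  EditText  "Email"     #email     540,540
fill  EditText  "Password"  #password  540,640  [password]
tap   Button    "Continue"  #continue  540,860
'''


def parse_rows(table):
    """Parse action-table lines into dicts.

    Assumed column order: action, class, label, id, x,y, then flags.
    """
    rows = []
    for line in table.splitlines():
        if not line.strip():
            continue
        # shlex keeps a quoted label together as one field
        action, cls, label, node_id, point, *flags = shlex.split(line)
        x, y = (int(n) for n in point.split(","))
        rows.append({"action": action, "class": cls, "label": label,
                     "id": node_id, "point": (x, y), "flags": flags})
    return rows


def resolve(reply, rows):
    """Map a model reply (action, label, optional text) to its row."""
    action, label, *args = shlex.split(reply)
    for row in rows:
        if row["action"] == action and row["label"] == label:
            return row, args
    raise LookupError(f"no {action} target labelled {label!r}")


row, args = resolve('tap "Continue"', parse_rows(UI_TABLE))
print(row["point"])
```

&lt;p&gt;Raising on an unknown label, rather than guessing the nearest match, keeps a misread screen from turning into a wrong tap.&lt;/p&gt;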

&lt;h2&gt;
  
  
  The rule
&lt;/h2&gt;

&lt;p&gt;For LLM tool output, the optimization rule is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Do not serialize facts the model cannot use in its next action.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Android XML violates that rule constantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;clickable="false"&lt;/code&gt; on nodes the agent will never click.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;enabled="true"&lt;/code&gt; repeated on almost every node.&lt;/li&gt;
&lt;li&gt;Empty &lt;code&gt;FrameLayout&lt;/code&gt; and &lt;code&gt;LinearLayout&lt;/code&gt; containers.&lt;/li&gt;
&lt;li&gt;Full class names like &lt;code&gt;android.widget.TextView&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Bounds rectangles when the agent only needs a tap point.&lt;/li&gt;
&lt;li&gt;Key names repeated on every node, JSON-style, when the reader is a language model, not a parser.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Handsets drops the defaults, shortens the names, computes the center point, and keeps the labels.&lt;/p&gt;
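
&lt;p&gt;Those four steps can be illustrated on a single node. This is a simplified sketch of the idea, not Handsets' actual implementation: &lt;code&gt;to_row&lt;/code&gt; is a hypothetical helper, the rule "clickable with a label" is an assumed filter, and the &lt;code&gt;com.example&lt;/code&gt; resource id is made up for the example. Attributes are passed as a plain dict, as they would appear after parsing a &lt;code&gt;uiautomator dump&lt;/code&gt; node.&lt;/p&gt;

```python
import re

# bounds look like "[x1,y1][x2,y2]" in a UIAutomator dump
BOUNDS = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def to_row(attrs):
    """Turn one node's attributes into an action row, or None when the
    node offers nothing to act on (assumed rule: clickable with a label)."""
    label = attrs.get("text") or attrs.get("content-desc")
    if not label or attrs.get("clickable") != "true":
        return None                              # drop dead layout nodes
    x1, y1, x2, y2 = map(int, BOUNDS.fullmatch(attrs["bounds"]).groups())
    cls = attrs["class"].rsplit(".", 1)[-1]      # shorten the class name
    action = "fill" if cls == "EditText" else "tap"
    rid = attrs.get("resource-id", "").rsplit("/", 1)[-1]
    flags = " [password]" if attrs.get("password") == "true" else ""
    # the center of the bounds rectangle is the tap point
    return f'{action}  {cls}  "{label}"  #{rid}  {(x1 + x2) // 2},{(y1 + y2) // 2}{flags}'


button = {
    "text": "Continue", "resource-id": "com.example:id/continue",
    "class": "android.widget.Button", "clickable": "true",
    "password": "false", "bounds": "[340,800][740,920]",
}
frame = {
    "text": "", "resource-id": "", "class": "android.widget.FrameLayout",
    "content-desc": "", "clickable": "false", "bounds": "[0,0][1440,3120]",
}
print(to_row(button))
print(to_row(frame))
```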

&lt;p&gt;The result is not a smaller XML file. It is a different interface:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hs ui
hs tap &lt;span class="s2"&gt;"Continue"&lt;/span&gt;
hs &lt;span class="nb"&gt;wait&lt;/span&gt; &lt;span class="s2"&gt;"Dashboard"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Screenshots are still useful
&lt;/h2&gt;

&lt;p&gt;This is not an argument against screenshots.&lt;/p&gt;

&lt;p&gt;Screenshots are useful when layout matters, when visual state matters, or when an app renders important information without accessible labels.&lt;/p&gt;

&lt;p&gt;But screenshots are a poor default for every step. They are large, slow to move, and often force the model to do OCR-like work for text that Android already exposes.&lt;/p&gt;

&lt;p&gt;A better loop is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hs ui &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/screen.txt
hs see &lt;span class="nt"&gt;--size&lt;/span&gt; 768 /tmp/screen.jpg   &lt;span class="c"&gt;# only when visual context matters&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Give the model the text UI first. Add the image when the text is not enough.&lt;/p&gt;

&lt;p&gt;That usually saves tokens and makes the action easier to audit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters more for agents than tests
&lt;/h2&gt;

&lt;p&gt;Traditional mobile tests do not care much about token count. A test runner is not paying to read XML.&lt;/p&gt;

&lt;p&gt;LLM agents are different. Every loop step has a context budget and a cost. If half the prompt is a UI tree full of dead layout nodes, the model is spending attention on junk.&lt;/p&gt;

&lt;p&gt;This shows up in three places:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; repeated screen state dominates long trajectories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency:&lt;/strong&gt; large prompts take longer to send and process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; shorter action-oriented context leaves less room for the model to latch onto irrelevant structure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best tool output for an agent is not the most complete representation of the system. It is the smallest representation that preserves the next correct action.&lt;/p&gt;

&lt;h2&gt;
  
  
  The practical pattern
&lt;/h2&gt;

&lt;p&gt;For Android, the pattern looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hs use
hs ui
hs tap &lt;span class="s2"&gt;"Sign in"&lt;/span&gt;
hs fill &lt;span class="s2"&gt;"Email"&lt;/span&gt; &lt;span class="s2"&gt;"you@example.com"&lt;/span&gt;
hs fill &lt;span class="s2"&gt;"Password"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PASSWORD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
hs tap &lt;span class="s2"&gt;"Continue"&lt;/span&gt;
hs &lt;span class="nb"&gt;wait&lt;/span&gt; &lt;span class="s2"&gt;"Dashboard"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For an LLM, the important handoff is even smaller:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Here is the current Android UI. Pick the next action by label.

fill  EditText  "Email"     #email     540,540
fill  EditText  "Password"  #password  540,640  [password]
tap   Button    "Continue"  #continue  540,860
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model does not need to know that these nodes live inside three nested &lt;code&gt;FrameLayout&lt;/code&gt;s. It needs to know that "Continue" is a button.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related guides
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="//2026-05-24-uiautomator2-alternative-for-android-automation.md"&gt;uiautomator2 Alternative for Android Automation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//2026-05-22-android-ui-dump-for-llms.md"&gt;An Android UI Dump for LLMs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//2026-05-23-tapping-android-in-5ms-vs-appium-uiautomator2.md"&gt;Tapping Android in 5 ms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//2026-05-24-how-to-automate-android-apps-without-root.md"&gt;How to Automate Android Apps Without Root&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://handsets.dev/blog/stop-wasting-tokens-on-android-automation/" rel="noopener noreferrer"&gt;https://handsets.dev/blog/stop-wasting-tokens-on-android-automation/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>automation</category>
      <category>android</category>
    </item>
  </channel>
</rss>
