<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sharmin Sirajudeen</title>
    <description>The latest articles on DEV Community by Sharmin Sirajudeen (@sharminsirajudeen).</description>
    <link>https://dev.to/sharminsirajudeen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3861545%2Feed19cc1-58d6-44b1-9667-eebf139712d5.jpg</url>
      <title>DEV Community: Sharmin Sirajudeen</title>
      <link>https://dev.to/sharminsirajudeen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sharminsirajudeen"/>
    <language>en</language>
    <item>
      <title>From Intent Classification to Open-Ended Action Spaces: Why Mobile Testing Needed a New Paradigm</title>
      <dc:creator>Sharmin Sirajudeen</dc:creator>
      <pubDate>Mon, 06 Apr 2026 01:50:48 +0000</pubDate>
      <link>https://dev.to/sharminsirajudeen/from-intent-classification-to-open-ended-action-spaces-why-mobile-testing-needed-a-new-paradigm-2lpb</link>
      <guid>https://dev.to/sharminsirajudeen/from-intent-classification-to-open-ended-action-spaces-why-mobile-testing-needed-a-new-paradigm-2lpb</guid>
      <description>&lt;h1&gt;
  
  
  From Intent Classification to Open-Ended Action Spaces: Why Mobile Testing Needed a New Paradigm
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;I'm the creator of &lt;a href="https://drengr.dev" rel="noopener noreferrer"&gt;Drengr&lt;/a&gt;, an MCP server that gives AI agents eyes and hands on mobile devices. I started this blog to share the engineering behind it. No pretending to be a neutral observer writing a think piece — I built this, and I'm here to talk about it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Google recently shipped &lt;a href="https://github.com/google-ai-edge/gallery" rel="noopener noreferrer"&gt;AI Edge Gallery&lt;/a&gt; — an on-device AI sandbox app with a feature called "Mobile Actions" that lets you control your phone with natural language. Say "turn on the flashlight," and a 270M parameter model called FunctionGemma figures out the intent, extracts the parameters, and dispatches the right function call. It runs entirely offline. It clocks 1,916 tokens/sec prefill on a Pixel 7 Pro. And it's impressive.&lt;/p&gt;

&lt;p&gt;But it also reveals a ceiling.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Closed-World Assumption
&lt;/h2&gt;

&lt;p&gt;FunctionGemma is, at its core, a tiny NLU engine performing intent classification and slot filling. You speak. It classifies your sentence into one of a fixed set of intents — &lt;code&gt;turnOnFlashlight&lt;/code&gt;, &lt;code&gt;createCalendarEvent&lt;/code&gt;, &lt;code&gt;showLocationOnMap&lt;/code&gt; — and extracts the relevant slots: a time, a location, a contact name. The native app code then dispatches the structured output to the corresponding platform API.&lt;/p&gt;

&lt;p&gt;This is a &lt;strong&gt;closed-world system&lt;/strong&gt;. Every possible action is known at compile time. Every function is pre-registered. Every slot is pre-defined. The model's job is pattern matching over a bounded action space — the same fundamental design that Dialogflow, Alexa Skills, and SiriKit Intents have used for years, now running on-device at remarkable speed. These platforms have evolved over time — Apple's App Intents, Alexa's generative AI features — but the underlying intent-schema architecture remains fundamentally closed-world by design.&lt;/p&gt;

&lt;p&gt;It works beautifully for what it is. But it cannot do what it has never been told exists.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Open-World Problem
&lt;/h2&gt;

&lt;p&gt;Now consider a different scenario. You're a QA engineer. You need to verify that a flower delivery app correctly applies a promo code at checkout, that the cart total updates, and that the confirmation screen renders the right order summary. The app was built by your team. No one pre-registered its UI elements as callable functions. No one fine-tuned a model on its screen taxonomy.&lt;/p&gt;

&lt;p&gt;This is an &lt;strong&gt;open-world problem&lt;/strong&gt;. The action space is unbounded. The UI is arbitrary. The screens have never been seen by the testing agent before.&lt;/p&gt;

&lt;p&gt;This is the problem &lt;a href="https://www.npmjs.com/package/drengr" rel="noopener noreferrer"&gt;Drengr&lt;/a&gt; solves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Text-First Perception, Schema-Never
&lt;/h2&gt;

&lt;p&gt;Drengr is a server for MCP (the Model Context Protocol), the open protocol that connects AI models to external tools and data sources in the same way LSP (the Language Server Protocol) connects editors to language servers. It is purpose-built for mobile UI interaction. It doesn't require your app to expose an API. It doesn't need accessibility labels (though it uses them when available). It doesn't ask you to define intents or register functions.&lt;/p&gt;

&lt;p&gt;Instead, it operates through three primitives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;drengr_look&lt;/code&gt;&lt;/strong&gt; — Captures the current screen state as a compact text description (~300 tokens per screen) or as an annotated image with numbered elements. It's text-first by default, escalating to vision only when fewer than 60% of elements have labels; that's roughly 10x cheaper than sending a screenshot at every step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;drengr_do&lt;/code&gt;&lt;/strong&gt; — Performs 13 actions on the device: tap, type, swipe, long press, back, home, launch, wait, key press, install, clear and type, scroll to top, scroll to bottom. Each action returns a situation report — a structured diff of what changed on screen (new elements, disappeared elements, crash detection, stuck detection).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;drengr_query&lt;/code&gt;&lt;/strong&gt; — Structured queries about device and app state: list connected devices, check current activity, detect crashes, find elements by text, explore app navigation, read network calls, check keyboard state, dump the raw UI tree, and more.&lt;/li&gt;
&lt;/ul&gt;
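&lt;p&gt;To make that loop concrete, here's a sketch of a single look-then-act step against the flower delivery app above. The formatting below is illustrative, not Drengr's exact wire format:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;drengr_look →
[1] "Promo code" (EditText)
[2] "Apply" (Button)
[3] "Total: $24.99" (TextView)

drengr_do {action: "type", element: 1, text: "FLOWERS10"} → screen_changed: true
drengr_do {action: "tap", element: 2} → new: ["Discount applied", "Total: $22.49"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;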

&lt;p&gt;The AI client — Claude Desktop, Cursor, Windsurf, VS Code, any MCP-compatible host — acts as the brain. Drengr provides the eyes and hands. The agent looks at a screen it has never seen, understands what's there, decides what to do, and does it. No pre-training on your app. No test script maintenance. No brittle XPath selectors that break every sprint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Distinction Matters
&lt;/h2&gt;

&lt;p&gt;The difference between closed-world function dispatch and open-world UI interaction is not incremental. It is architectural.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Closed-World (FunctionGemma)&lt;/th&gt;
&lt;th&gt;Open-World (Drengr)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Action space&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fixed, pre-defined functions&lt;/td&gt;
&lt;td&gt;Arbitrary, discovered at runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UI knowledge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compiled into the model&lt;/td&gt;
&lt;td&gt;Observed per-screen via text scenes + vision fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;New app support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires fine-tuning or function registration&lt;/td&gt;
&lt;td&gt;Works immediately against any app&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"I don't have a function for that"&lt;/td&gt;
&lt;td&gt;"I can see the screen — let me figure it out"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NLU → function dispatch&lt;/td&gt;
&lt;td&gt;Perception → reasoning → action&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;FunctionGemma is a classifier. Drengr is an agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The MCP Advantage
&lt;/h2&gt;

&lt;p&gt;Drengr is built as an MCP server — the same architectural pattern that made LSP the backbone of every modern code editor. Anthropic itself draws this parallel in the MCP specification: both protocols solve the M×N integration problem. LSP connects M editors to N language servers; MCP connects M AI clients to N tool servers. Both exchange messages as JSON-RPC 2.0.&lt;/p&gt;
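&lt;p&gt;For a flavor of what this looks like on the wire: an MCP tool invocation is an ordinary JSON-RPC 2.0 request using the protocol's &lt;code&gt;tools/call&lt;/code&gt; method. A client asking Drengr to tap an element might send something roughly like this (the argument names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "tools/call",
  "params": {
    "name": "drengr_do",
    "arguments": { "action": "tap", "element": 3 }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;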

&lt;p&gt;This means Drengr isn't married to a single LLM. Today, a developer can wire up Claude Code, Cursor, or Windsurf as the reasoning layer, and Drengr handles the device interaction. Tomorrow, when a better model drops, you swap the brain without touching the tools.&lt;/p&gt;

&lt;p&gt;This separation of concerns — &lt;strong&gt;the model thinks, the server acts&lt;/strong&gt; — is what makes the architecture durable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;QA engineers&lt;/strong&gt; tired of maintaining Appium scripts that break every release cycle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mobile developers&lt;/strong&gt; who want to validate user flows without writing test code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering leads&lt;/strong&gt; exploring agentic testing as a force multiplier for small teams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI tooling teams&lt;/strong&gt; evaluating MCP-compatible infrastructure for mobile automation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Testing Problem, Reframed
&lt;/h2&gt;

&lt;p&gt;Traditional mobile test automation asks: &lt;em&gt;"How do I script a robot to press the right buttons?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Drengr asks: &lt;em&gt;"What if the robot could just look at the screen and figure it out?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That reframing — from scripted automation to perceptual agency — is the paradigm shift. It's the difference between giving someone a map with every turn pre-marked, and giving them eyes and the ability to navigate.&lt;/p&gt;

&lt;p&gt;Google proved that on-device NLU can dispatch to a handful of OS functions at blazing speed. Drengr proves that an LLM with the right tools can operate across any app, any screen, any flow — without ever being told what to expect.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Drengr is free to use and available on &lt;a href="https://www.npmjs.com/package/drengr" rel="noopener noreferrer"&gt;npm&lt;/a&gt;. It supports Android (physical devices, emulators), iOS simulators (full gesture support), and cloud device farms (BrowserStack, SauceLabs, AWS Device Farm, LambdaTest, Perfecto, Kobiton). Built in Rust. Single binary. No runtime dependencies.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>appium</category>
      <category>mobiledev</category>
      <category>testing</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Connecting Claude to a Real Phone via MCP</title>
      <dc:creator>Sharmin Sirajudeen</dc:creator>
      <pubDate>Sun, 05 Apr 2026 03:01:57 +0000</pubDate>
      <link>https://dev.to/sharminsirajudeen/connecting-claude-to-a-real-phone-via-mcp-dfj</link>
      <guid>https://dev.to/sharminsirajudeen/connecting-claude-to-a-real-phone-via-mcp-dfj</guid>
      <description>&lt;h1&gt;
  
  
  I Gave Claude My Phone and It Tested My App
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;I'm the creator of &lt;a href="https://drengr.dev" rel="noopener noreferrer"&gt;Drengr&lt;/a&gt;, an MCP server that gives AI agents eyes and hands on mobile devices. I started this blog to share the engineering behind it. No pretending to be a neutral observer writing a think piece — I built this, and I'm here to talk about it.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup: 90 seconds
&lt;/h2&gt;

&lt;p&gt;I plugged an Android phone into my MacBook. Opened Claude Desktop. Added one line to the MCP config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"drengr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"drengr"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the entire setup. No Appium. No Selenium grid. No environment variables pointing to Java homes and Android SDK paths. Just &lt;code&gt;npm install -g drengr&lt;/code&gt;, plug in the phone, and tell Claude what to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Open YouTube and find a video about MCP servers"
&lt;/h2&gt;

&lt;p&gt;I typed that into Claude. Here's what happened over the next 40 seconds:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Claude called &lt;code&gt;drengr_look&lt;/code&gt; — got a text description of the home screen&lt;/li&gt;
&lt;li&gt;It saw YouTube in the app list and called &lt;code&gt;drengr_do&lt;/code&gt; to launch it&lt;/li&gt;
&lt;li&gt;YouTube opened. Claude called &lt;code&gt;drengr_look&lt;/code&gt; again — got the YouTube home feed as a list of labeled elements&lt;/li&gt;
&lt;li&gt;It tapped the search bar, typed "MCP servers," and hit search&lt;/li&gt;
&lt;li&gt;Results appeared. Claude read the titles and tapped the most relevant video&lt;/li&gt;
&lt;li&gt;The video started playing. Claude confirmed: "Found and playing 'MCP Server Explained' by IBM Technology"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Six actions. No scripts. No selectors. Claude read the screen, made decisions, and executed actions — exactly like a human would, except it took 40 seconds instead of 2 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The moment it got interesting
&lt;/h2&gt;

&lt;p&gt;I then asked: "Now go to Shorts and swipe through a few."&lt;/p&gt;

&lt;p&gt;Claude navigated to the Shorts tab, swiped up three times, read the titles of each short, and told me what it saw. It handled the vertical scroll, the full-screen video player, the overlay buttons — all without any special configuration.&lt;/p&gt;

&lt;p&gt;This is the kind of interaction that breaks traditional test frameworks. Shorts uses a custom renderer, the UI tree is minimal, the scroll behavior is non-standard. A selector-based test would need a custom handler for every quirk. Claude just... used the app.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's actually happening under the hood
&lt;/h2&gt;

&lt;p&gt;Claude doesn't see the phone directly. Drengr sits in between:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; → calls MCP tools → &lt;strong&gt;Drengr&lt;/strong&gt; → talks to the device → &lt;strong&gt;Phone&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Drengr handles the messy parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capturing the screen and parsing the UI tree into a format the AI can read&lt;/li&gt;
&lt;li&gt;Translating "tap element 3" into the right platform command&lt;/li&gt;
&lt;li&gt;Reporting back what changed after every action (the situation report)&lt;/li&gt;
&lt;li&gt;Detecting if the app crashed or the UI got stuck&lt;/li&gt;
&lt;/ul&gt;
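&lt;p&gt;As a sketch, a situation report after a successful tap might look something like this, with elements referenced by their numbers. The field names are illustrative; the exact schema may differ:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "screen_changed": true,
  "new_elements": [4, 7, 9],
  "disappeared_elements": [2],
  "crash_detected": false,
  "stuck": false
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;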

&lt;p&gt;Claude handles the smart parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Looking at the screen description and deciding what to do&lt;/li&gt;
&lt;li&gt;Adapting when something unexpected happens&lt;/li&gt;
&lt;li&gt;Knowing when the task is complete&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separation is deliberate. The AI is the brain, Drengr is the hands. When better AI models come out, Drengr doesn't need to change — the hands stay the same, the brain gets smarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The things that surprised me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;It recovers from mistakes.&lt;/strong&gt; At one point Claude tapped the wrong video. It noticed the title didn't match what it expected, pressed back, and picked the right one. No retry logic, no error handling code — the AI just adapted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It works across apps.&lt;/strong&gt; I asked Claude to "check my notifications" after the YouTube test. It pressed home, pulled down the notification shade, read the notifications, and summarized them. No app-specific setup needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text mode is almost always enough.&lt;/strong&gt; Out of ~30 actions across the session, Claude only needed the annotated screenshot twice — both times on screens with custom-rendered content. The rest worked with the ~300 token text description. That's 10x cheaper than sending images every step.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it can't do (yet)
&lt;/h2&gt;

&lt;p&gt;I'm not going to pretend this replaces manual testing today. Some limits are real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt; — Each step takes 2-3 seconds (LLM round-trip). A human tester can tap faster. But the human can't run 50 test flows in parallel on a device farm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual verification&lt;/strong&gt; — Claude can tell if an element exists, but not if it "looks right." Color, alignment, spacing — these need human eyes or a visual regression tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex gestures&lt;/strong&gt; — Standard taps, swipes, long presses, and pinch zooms work. But game-specific multi-touch patterns aren't there yet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The sweet spot today is regression testing: "does the checkout flow still work after this deploy?" That's the 80% of QA time that's spent running the same flows every sprint. Let the AI handle that, and let humans focus on exploratory testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; drengr
drengr doctor          &lt;span class="c"&gt;# check your setup&lt;/span&gt;
drengr setup &lt;span class="nt"&gt;--client&lt;/span&gt; claude-desktop  &lt;span class="c"&gt;# generate MCP config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connect a device, open your AI client, and tell it what to test. The first time an AI agent navigates your app without a single line of test code, you'll understand why I built this.&lt;/p&gt;

</description>
      <category>appium</category>
      <category>mobiledev</category>
      <category>testing</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Your Mobile QA Team Is Still Writing XPath. In 2026.</title>
      <dc:creator>Sharmin Sirajudeen</dc:creator>
      <pubDate>Sun, 05 Apr 2026 02:54:42 +0000</pubDate>
      <link>https://dev.to/sharminsirajudeen/your-mobile-qa-team-is-still-writing-xpath-in-2026-104g</link>
      <guid>https://dev.to/sharminsirajudeen/your-mobile-qa-team-is-still-writing-xpath-in-2026-104g</guid>
      <description>&lt;h1&gt;
  
  
  Your Mobile QA Team Is Still Writing XPath. In 2026.
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;I'm the creator of &lt;a href="https://drengr.dev" rel="noopener noreferrer"&gt;Drengr&lt;/a&gt;, an MCP server that gives AI agents eyes and hands on mobile devices. I started this blog to share the engineering behind it. No pretending to be a neutral observer writing a think piece — I built this, and I'm here to talk about it.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The test that breaks every sprint
&lt;/h2&gt;

&lt;p&gt;You know the drill. Your QA engineer writes a beautiful test suite. Login, browse catalog, add to cart, checkout. Fifty selectors, careful waits, retry logic for flaky network calls. It passes on Monday.&lt;/p&gt;

&lt;p&gt;Tuesday, the design team moves the checkout button. Three selectors break. The test fails. The QA engineer spends half a day updating locators. The test passes again.&lt;/p&gt;

&lt;p&gt;Wednesday, a new feature adds a bottom sheet that overlaps the cart icon. The tap lands on the sheet instead of the cart. The test fails. Another half day.&lt;/p&gt;

&lt;p&gt;This cycle repeats every sprint, in every mobile team, everywhere. The test suite doesn't test the app anymore — it tests whether the selectors still match the UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The root cause: selectors were never the right abstraction
&lt;/h2&gt;

&lt;p&gt;XPath, resource IDs, accessibility identifiers — they're all addresses. "Tap the element at this path in the view hierarchy." The moment the hierarchy changes, the address is wrong.&lt;/p&gt;

&lt;p&gt;Humans don't navigate apps by address. They look at the screen, see "Checkout," and tap it. They don't care that the button moved from &lt;code&gt;//android.widget.Button[@resource-id='checkout_btn']&lt;/code&gt; to &lt;code&gt;//android.widget.FrameLayout/android.widget.Button[2]&lt;/code&gt;. They just see the button and tap it.&lt;/p&gt;

&lt;p&gt;AI agents can do the same thing — if you give them the screen, not a selector tree.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "giving them the screen" looks like
&lt;/h2&gt;

&lt;p&gt;When an AI agent connects to Drengr, it asks: "What's on screen?" Drengr responds with either a compact text description:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] "Checkout" (Button)
[2] "Your Cart: 3 items" (TextView)
[3] "Remove" (Button)
[4] "Continue Shopping" (Button)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or an annotated image with numbered elements. The AI reads this, decides "tap element 1," and calls &lt;code&gt;drengr_do&lt;/code&gt;. After the action, it gets a situation report telling it what changed.&lt;/p&gt;
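&lt;p&gt;That round trip, sketched with illustrative field names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;drengr_do {action: "tap", element: 1}
→ screen_changed: true
→ new: ["Shipping address", "Payment method", "Place order"]
→ gone: ["Your Cart: 3 items"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;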

&lt;p&gt;No selectors. No XPath. No element IDs to maintain. The AI sees the screen the way a human does — by what's visible, not by where it lives in the view tree.&lt;/p&gt;

&lt;h2&gt;
  
  
  "But what about reliability?"
&lt;/h2&gt;

&lt;p&gt;Fair question. If the AI is interpreting the screen every time, doesn't that introduce non-determinism?&lt;/p&gt;

&lt;p&gt;Yes. And that's the point. A deterministic test that breaks when the UI changes isn't reliable — it's rigid. An AI agent that adapts to UI changes is more reliable in practice because it handles the variations that break selector-based tests.&lt;/p&gt;

&lt;p&gt;Drengr adds guardrails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stuck detection&lt;/strong&gt; — if the screen doesn't change after an action, the agent knows to try something else&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Crash detection&lt;/strong&gt; — if the app dies, the agent knows immediately and can restart&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Situation reports&lt;/strong&gt; — after every action, the agent gets a structured diff of what changed, so it stays oriented&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI doesn't just blindly tap. It observes, acts, and adapts. That's more robust than a fixed script that works exactly one way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost argument
&lt;/h2&gt;

&lt;p&gt;"AI calls are expensive." Sure, if you're sending screenshots to GPT-4o on every step.&lt;/p&gt;

&lt;p&gt;Drengr's text-only mode compresses a screen to ~300 tokens. A 15-step test flow costs about $0.05 on GPT-4o pricing. The same flow with screenshots costs $0.45.&lt;/p&gt;

&lt;p&gt;But here's the real cost comparison: how much does your QA team spend maintaining selectors? If one engineer spends two hours a week updating broken tests, that's on the order of $1,000 a month in fully loaded salary going to XPath maintenance, and for many teams it's far more. The AI API costs are rounding errors next to that.&lt;/p&gt;

&lt;h2&gt;
  
  
  A test suite that survives redesigns
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;com.example.shop&lt;/span&gt;
&lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;browse&lt;/span&gt;
    &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;wireless&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;earbuds&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;first&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;result"&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;purchase&lt;/span&gt;
    &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cart&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;complete&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;checkout&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;card"&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;90s&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;verify&lt;/span&gt;
    &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Go&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;order&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;history&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;verify&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;order&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;appears"&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;45s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This YAML survived 3 redesigns of our test app. The checkout flow moved from a separate page to a bottom sheet to a full-screen modal. The YAML didn't change. The AI adapted every time because it reads the screen, not the selectors.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;drengr test tests.yml&lt;/code&gt; runs it. JUnit XML output plugs into any CI pipeline. No Appium server to maintain, no Selenium grid, no element locator spreadsheet.&lt;/p&gt;
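&lt;p&gt;In CI, that's a single step. Here's a sketch of a GitHub Actions job, assuming a runner with a device or emulator already available; the Node setup and step layout are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;jobs:
  mobile-regression:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g drengr
      - run: drengr doctor          # sanity-check device + toolchain
      - run: drengr test tests.yml  # JUnit XML output feeds the CI reporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;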

&lt;h2&gt;
  
  
  This isn't theoretical
&lt;/h2&gt;

&lt;p&gt;Drengr runs on real Android phones, iOS simulators, and cloud device farms (BrowserStack, SauceLabs, AWS Device Farm, LambdaTest, Perfecto, Kobiton). It connects to any MCP-compatible AI client — Claude Desktop, Cursor, Windsurf, VS Code.&lt;/p&gt;

&lt;p&gt;One binary. One install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; drengr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your QA team can stop writing XPath. The AI can read the screen.&lt;/p&gt;

</description>
      <category>appium</category>
      <category>mobiledev</category>
      <category>testing</category>
      <category>mcp</category>
    </item>
    <item>
      <title>AI Can Browse the Web. Why Can't It Tap a Phone?</title>
      <dc:creator>Sharmin Sirajudeen</dc:creator>
      <pubDate>Sun, 05 Apr 2026 02:54:30 +0000</pubDate>
      <link>https://dev.to/sharminsirajudeen/ai-can-browse-the-web-why-cant-it-tap-a-phone-ndk</link>
      <guid>https://dev.to/sharminsirajudeen/ai-can-browse-the-web-why-cant-it-tap-a-phone-ndk</guid>
      <description>&lt;h1&gt;
  
  
  AI Can Browse the Web. Why Can't It Tap a Phone?
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;I'm the creator of &lt;a href="https://drengr.dev" rel="noopener noreferrer"&gt;Drengr&lt;/a&gt;, an MCP server that gives AI agents eyes and hands on mobile devices. I started this blog to share the engineering behind it. No pretending to be a neutral observer writing a think piece — I built this, and I'm here to talk about it.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The gap nobody talks about
&lt;/h2&gt;

&lt;p&gt;Every week there's a new "Show HN" for AI-powered browser testing. Playwright agents, Puppeteer bots, Chrome extensions that turn the DOM into JSON for LLMs. The web automation space is overflowing with AI-native tools.&lt;/p&gt;

&lt;p&gt;Then someone asks: "How do I do this on a phone?"&lt;/p&gt;

&lt;p&gt;Silence.&lt;/p&gt;

&lt;p&gt;The best answer the industry has is Appium — a tool from 2013 that requires you to set up a Selenium grid, write XPath selectors, and maintain brittle element locators that break every time a designer moves a button. Or Espresso/XCTest, which require you to embed test code inside the app itself.&lt;/p&gt;

&lt;p&gt;None of these are AI-native. They were built for humans to write scripts, not for LLMs to reason about screens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why mobile is harder than web
&lt;/h2&gt;

&lt;p&gt;The web has one universal API: the DOM. Every browser exposes the same tree of elements with the same attributes. Playwright reads the DOM, the AI decides what to click, done.&lt;/p&gt;

&lt;p&gt;Mobile doesn't have that. Android has &lt;code&gt;uiautomator&lt;/code&gt;. iOS has its own accessibility framework. They return different structures, different attributes, different coordinate systems. Cloud device farms add another layer — now you're talking to a device over Appium WebDriver, which adds its own abstraction on top.&lt;/p&gt;

&lt;p&gt;The result: every mobile testing tool is platform-specific, setup-heavy, and hostile to AI agents that just want to know "what's on screen?" and "tap that button."&lt;/p&gt;

&lt;h2&gt;
  
  
  What I built instead
&lt;/h2&gt;

&lt;p&gt;Drengr is a single Rust binary that sits between the AI and the device. It exposes exactly 3 tools over the Model Context Protocol (MCP):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;drengr_look&lt;/code&gt;&lt;/strong&gt; — tells the AI what's on screen, either as an annotated image or a compact text description (~300 tokens instead of a 200KB screenshot)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;drengr_do&lt;/code&gt;&lt;/strong&gt; — executes an action (tap, type, swipe, long press, scroll, launch app, etc.) and reports back what changed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;drengr_query&lt;/code&gt;&lt;/strong&gt; — answers questions without touching the screen (is the app crashed? what HTTP calls happened? what's the current activity?)&lt;/li&gt;
&lt;/ul&gt;
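&lt;p&gt;Under the hood, MCP is JSON-RPC 2.0, so a client invokes any of these with a &lt;code&gt;tools/call&lt;/code&gt; request. Here's a minimal Python sketch of building one; the argument names for &lt;code&gt;drengr_do&lt;/code&gt; are my illustration, not the actual schema, which a real client would discover via &lt;code&gt;tools/list&lt;/code&gt;:&lt;/p&gt;

```python
import json

def mcp_call(tool: str, arguments: dict, msg_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 `tools/call` request, the MCP message an
    AI client sends to invoke a server-side tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": msg_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# Ask Drengr to tap an element. The argument shape ("action",
# "element") is illustrative -- check the tool's real schema
# via a `tools/list` request.
request = mcp_call("drengr_do", {"action": "tap", "element": 12})
print(request)
```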

&lt;p&gt;The AI client — Claude Desktop, Cursor, VS Code, whatever — is the brain. It decides strategy. Drengr is the hands. It handles the platform mess so the AI never has to think about ADB vs simctl vs Appium.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing that makes it work: situation reports
&lt;/h2&gt;

&lt;p&gt;After every action, Drengr doesn't just say "ok, done." It tells the AI what changed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"screen_changed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"new_elements"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"disappeared_elements"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"activity_changed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"crash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stuck"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI reads this and immediately knows: the screen updated, two new elements appeared, one vanished, we navigated somewhere new, and the app is still alive. No need to take another screenshot and visually diff it.&lt;/p&gt;

&lt;p&gt;Browser testing tools get this for free — the DOM emits change events. On mobile, you have to build this layer yourself. I spent months on it so you don't have to.&lt;/p&gt;
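&lt;p&gt;To make that concrete, here's a sketch of how an agent might branch on a situation report. The field names match the example above; the decision policy is mine, not Drengr's:&lt;/p&gt;

```python
def next_step(report: dict) -> str:
    """Decide what an agent should do after reading a situation
    report. An illustrative policy, not Drengr's own logic."""
    if report.get("crash"):
        return "collect logs and fail the task"
    if report.get("stuck"):
        return "try a different element or scroll"
    if not report.get("screen_changed"):
        return "retry the action or wait"
    # Screen changed: inspect only the new elements instead of
    # re-reading the whole screen.
    new = report.get("new_elements", [])
    return f"inspect elements {new} and continue"

report = {"screen_changed": True, "new_elements": [12, 15],
          "disappeared_elements": [7], "activity_changed": True,
          "crash": False, "stuck": False}
print(next_step(report))  # → inspect elements [12, 15] and continue
```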

&lt;h2&gt;
  
  
  A real test looks like this
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;com.example.app&lt;/span&gt;
&lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;login&lt;/span&gt;
    &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Log&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;user@test.com&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;password123"&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;checkout&lt;/span&gt;
    &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;headphones&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cart&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;complete&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;purchase"&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;90s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No selectors. No XPath. No element IDs to maintain. The AI reads the screen, decides what to do, and Drengr executes it. When the UI changes, the YAML doesn't break — because there's nothing brittle in it.&lt;/p&gt;

&lt;p&gt;Run with &lt;code&gt;drengr test tests.yml&lt;/code&gt; and get human-readable output, JSON, or JUnit XML for your CI pipeline.&lt;/p&gt;
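&lt;p&gt;Conceptually, each task is just a look → decide → do loop that runs until the model declares it done or the timeout fires. A sketch of that control flow, with stand-in functions for the MCP tool calls and the AI client (this is the shape of the loop, not Drengr's internals):&lt;/p&gt;

```python
import time

def run_task(task: str, look, do, llm_decide, timeout_s: float = 60.0) -> bool:
    """Drive one natural-language task. `look`, `do`, and `llm_decide`
    are stand-ins for the MCP tool calls and the AI client."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        screen = look()                    # drengr_look: what's on screen?
        action = llm_decide(task, screen)  # the AI picks the next action
        if action == "done":
            return True
        report = do(action)                # drengr_do: execute + sitrep
        if report.get("crash"):
            return False
    return False                           # timed out

# Toy run: a fake screen and a scripted "model" that finishes in two steps.
script = iter(["tap login", "done"])
ok = run_task("Log in", look=lambda: "login screen",
              do=lambda a: {"screen_changed": True, "crash": False},
              llm_decide=lambda t, s: next(script))
print(ok)  # → True
```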

&lt;h2&gt;
  
  
  Why I think this gap exists
&lt;/h2&gt;

&lt;p&gt;Browser testing got AI-native tools early because the DOM is an open, text-friendly format that LLMs can reason about directly. Mobile UIs are visual, proprietary, and locked behind platform-specific APIs that nobody unified.&lt;/p&gt;

&lt;p&gt;MCP changes this. It gives AI agents a standard way to connect to tools — and Drengr is the tool that bridges MCP to mobile devices. Android, iOS, simulators, cloud farms — one interface, one binary, one install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; drengr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
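&lt;p&gt;Wiring it into an MCP client is a few lines of config. For Claude Desktop, that's an entry in &lt;code&gt;claude_desktop_config.json&lt;/code&gt;; the command name here assumes the npm install put a &lt;code&gt;drengr&lt;/code&gt; binary on your PATH:&lt;/p&gt;

```json
{
  "mcpServers": {
    "drengr": {
      "command": "drengr"
    }
  }
}
```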



&lt;p&gt;The web got its AI testing moment. Mobile's turn is now.&lt;/p&gt;

</description>
      <category>appium</category>
      <category>mobiledev</category>
      <category>testing</category>
      <category>mcp</category>
    </item>
  </channel>
</rss>
