
Sharmin Sirajudeen

Posted on • Originally published at drengr.dev

From Intent Classification to Open-Ended Action Spaces: Why Mobile Testing Needed a New Paradigm


I'm the creator of Drengr, an MCP server that gives AI agents eyes and hands on mobile devices. I started this blog to share the engineering behind it. No pretending to be a neutral observer writing a think piece — I built this, and I'm here to talk about it.

Google recently shipped AI Edge Gallery — an on-device AI sandbox app with a feature called "Mobile Actions" that lets you control your phone with natural language. Say "turn on the flashlight," and a 270M parameter model called FunctionGemma figures out the intent, extracts the parameters, and dispatches the right function call. It runs entirely offline. It clocks 1,916 tokens/sec prefill on a Pixel 7 Pro. And it's impressive.

But it also reveals a ceiling.

The Closed-World Assumption

FunctionGemma is, at its core, a tiny NLU engine performing intent classification and slot filling. You speak. It classifies your sentence into one of a fixed set of intents — turnOnFlashlight, createCalendarEvent, showLocationOnMap — and extracts the relevant slots: a time, a location, a contact name. The native app code then dispatches the structured output to the corresponding platform API.
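The closed-world dispatch pattern can be sketched in a few lines. This is a hypothetical illustration of the architecture, not FunctionGemma's actual API: the model is assumed to emit an intent name plus extracted slots, and the app maps that to a handler registered at build time.

```python
# Minimal sketch of closed-world intent dispatch (illustrative only).
# Every callable action must be registered before the app ships.
REGISTERED_INTENTS = {
    "turnOnFlashlight": lambda slots: "flashlight on",
    "createCalendarEvent": lambda slots: f"event '{slots['title']}' at {slots['time']}",
}

def dispatch(model_output: dict) -> str:
    intent = model_output["intent"]
    handler = REGISTERED_INTENTS.get(intent)
    if handler is None:
        # The closed-world failure mode: an unregistered intent is a dead end.
        return f"no function registered for {intent!r}"
    return handler(model_output.get("slots", {}))

print(dispatch({"intent": "turnOnFlashlight"}))
print(dispatch({"intent": "applyPromoCodeAtCheckout"}))
```

The second call is the ceiling in miniature: no amount of NLU accuracy helps when the action was never in the registry.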

This is a closed-world system. Every possible action is known at compile time. Every function is pre-registered. Every slot is pre-defined. The model's job is pattern matching over a bounded action space — the same fundamental design that Dialogflow, Alexa Skills, and SiriKit Intents have used for years, now running on-device at remarkable speed. These platforms have evolved over time — Apple's App Intents, Alexa's generative AI features — but the underlying intent-schema architecture remains fundamentally closed-world by design.

It works beautifully for what it is. But it cannot do what it has never been told exists.

The Open-World Problem

Now consider a different scenario. You're a QA engineer. You need to verify that a flower delivery app correctly applies a promo code at checkout, that the cart total updates, and that the confirmation screen renders the right order summary. The app was built by your team. No one pre-registered its UI elements as callable functions. No one fine-tuned a model on its screen taxonomy.

This is an open-world problem. The action space is unbounded. The UI is arbitrary. The screens have never been seen by the testing agent before.

This is the problem Drengr solves.

Text-First Perception, Schema-Never

Drengr is a server for MCP (Model Context Protocol), the open protocol that connects AI models to external tools and data sources, much as LSP (Language Server Protocol) connects editors to language servers. Drengr is purpose-built for mobile UI interaction. It doesn't require your app to expose an API. It doesn't need accessibility labels (though it uses them when available). It doesn't ask you to define intents or register functions.

Instead, it operates through three primitives:

  • drengr_look — Captures the current screen state as a compact text description (~300 tokens per screen) or an annotated image with numbered elements. Text-first by default; vision kicks in only when fewer than 60% of elements have labels. That makes it roughly 100× cheaper than sending a screenshot at every step.
  • drengr_do — Performs 13 actions on the device: tap, type, swipe, long press, back, home, launch, wait, key press, install, clear and type, scroll to top, scroll to bottom. Each action returns a situation report — a structured diff of what changed on screen (new elements, disappeared elements, crash detection, stuck detection).
  • drengr_query — Structured queries about device and app state: list connected devices, check current activity, detect crashes, find elements by text, explore app navigation, read network calls, check keyboard state, dump the raw UI tree, and more.
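The three primitives compose into a perceive → reason → act loop. Here is a runnable sketch of one step of that loop; `call_tool` is a stand-in for an MCP tools/call request with stubbed device responses, and the argument and result shapes are illustrative, not Drengr's documented schema.

```python
def call_tool(name: str, args: dict) -> dict:
    # Stubbed responses so the loop runs in isolation; a real client
    # would send these as MCP tools/call requests to the Drengr server.
    fake = {
        "drengr_look": {"elements": [{"id": 3, "text": "Apply promo code"}]},
        "drengr_do": {"new_elements": ["Promo applied"], "crash": False},
        "drengr_query": {"crashed": False},
    }
    return fake[name]

def decide(goal: str, scene: dict) -> dict:
    # Placeholder for the LLM: pick the element whose text matches the goal.
    target = next(e for e in scene["elements"] if goal.lower() in e["text"].lower())
    return {"action": "tap", "element": target["id"]}

def run_step(goal: str) -> dict:
    scene = call_tool("drengr_look", {"mode": "text"})   # perceive
    action = decide(goal, scene)                         # reason
    report = call_tool("drengr_do", action)              # act
    if call_tool("drengr_query", {"kind": "crash"})["crashed"]:
        raise RuntimeError("app crashed during step")
    return report                                        # situation report (screen diff)

print(run_step("promo code"))
```

Nothing in the loop assumes the screen was ever seen before; the action space is whatever `drengr_look` reports this time around.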

The AI client — Claude Desktop, Cursor, Windsurf, VS Code, any MCP-compatible host — acts as the brain. Drengr provides the eyes and hands. The agent looks at a screen it has never seen, understands what's there, decides what to do, and does it. No pre-training on your app. No test script maintenance. No brittle XPath selectors that break every sprint.

Why This Distinction Matters

The difference between closed-world function dispatch and open-world UI interaction is not incremental. It is architectural.

| | Closed-World (FunctionGemma) | Open-World (Drengr) |
|---|---|---|
| Action space | Fixed, pre-defined functions | Arbitrary, discovered at runtime |
| UI knowledge | Compiled into the model | Observed per-screen via text scenes + vision fallback |
| New app support | Requires fine-tuning or function registration | Works immediately against any app |
| Failure mode | "I don't have a function for that" | "I can see the screen — let me figure it out" |
| Architecture | NLU → function dispatch | Perception → reasoning → action |

FunctionGemma is a classifier. Drengr is an agent.

The MCP Advantage

Drengr is built as an MCP server — the same architectural pattern that made LSP the backbone of every modern code editor. Anthropic themselves draw this parallel in the MCP specification: both protocols solve the M×N integration problem. LSP connects M editors to N language servers. MCP connects M AI clients to N tool servers. Both use JSON-RPC 2.0 transport.
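Concretely, every tool invocation travels as a JSON-RPC 2.0 request. The `tools/call` method is defined by the MCP specification; the tool name and arguments below are illustrative.

```python
import json

# An MCP client invoking a tool over its JSON-RPC 2.0 transport.
# "tools/call" comes from the MCP spec; the params are example values.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "drengr_look",
        "arguments": {"mode": "text"},
    },
}
print(json.dumps(request, indent=2))
```

The same envelope works for any MCP server, which is exactly why any MCP-compatible client can drive Drengr without bespoke integration code.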

This means Drengr isn't married to a single LLM. Today, a developer can wire up Claude Code, Cursor, or Windsurf as the reasoning layer, and Drengr handles the device interaction. Tomorrow, when a better model drops, you swap the brain without touching the tools.
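Wiring it up is a config entry, not an integration project. A typical MCP client configuration (e.g. Claude Desktop's `claude_desktop_config.json`) looks like the sketch below; the package name `drengr` is an assumption here, so check the project's docs for the actual npm name.

```json
{
  "mcpServers": {
    "drengr": {
      "command": "npx",
      "args": ["drengr"]
    }
  }
}
```

Swapping the reasoning layer means pointing a different MCP client at the same server entry; the tool side doesn't change.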

This separation of concerns — the model thinks, the server acts — is what makes the architecture durable.

Who This Is For

  • QA engineers tired of maintaining Appium scripts that break every release cycle.
  • Mobile developers who want to validate user flows without writing test code.
  • Engineering leads exploring agentic testing as a force multiplier for small teams.
  • AI tooling teams evaluating MCP-compatible infrastructure for mobile automation.

The Testing Problem, Reframed

Traditional mobile test automation asks: "How do I script a robot to press the right buttons?"

Drengr asks: "What if the robot could just look at the screen and figure it out?"

That reframing — from scripted automation to perceptual agency — is the paradigm shift. It's the difference between giving someone a map with every turn pre-marked, and giving them eyes and the ability to navigate.

Google proved that on-device NLU can dispatch to a handful of OS functions at blazing speed. Drengr proves that an LLM with the right tools can operate across any app, any screen, any flow — without ever being told what to expect.


Drengr is free to use and available on npm. It supports Android (physical devices, emulators), iOS simulators (full gesture support), and cloud device farms (BrowserStack, SauceLabs, AWS Device Farm, LambdaTest, Perfecto, Kobiton). Built in Rust. Single binary. No runtime dependencies.
