The Missing Control Plane for Local AI Agents
I sat with my Pixel for 20 minutes trying to get Claude Desktop to dictate a Slack message via accessibility. It was miserable. The model was capable. The transport wasn't.
That gap — between an AI that can reason and an AI that can actually do — is what I've been working on with Drengr. This post is the version of the argument I'd give to anyone building local AI agents today.
What a control plane actually means here
When people talk about "AI agents," they usually focus on the model: which one, how big, how cheap to run, what context window. Those are real questions, but they all assume the agent has a way to act on the world. On mobile, it mostly doesn't. iOS sandboxing prevents one app from touching another. Android Accessibility Services exist but are heavy to set up, scary to permission, and limited in what they can synthesize.
The result: you can ship a brilliant Gemini Nano running on a Pixel, and it still can't open Maps and start navigation for you. The model has no hands.
A control plane fills that gap. It's not the model. It's the layer underneath that:
- observes the device (screen state, UI tree, current activity, foreground app)
- executes discrete actions (tap, type, swipe, draw, key event, app launch)
- reports what changed after each action so the agent can adjust
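If you squint, that's just an interface with three capabilities. Here's a rough sketch of the shape I mean; the names and types are illustrative, not Drengr's actual API:

```typescript
// Illustrative sketch of a control-plane interface; names and shapes
// are hypothetical, not Drengr's actual API.

interface Observation {
  foregroundApp: string;   // e.g. "com.whatsapp"
  activity: string;        // current activity / view controller
  uiTree: string;          // compact text description of the UI
}

type Action =
  | { kind: "tap"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "swipe"; fromX: number; fromY: number; toX: number; toY: number }
  | { kind: "key"; code: string }
  | { kind: "launch"; app: string };

interface ControlPlane {
  observe(): Promise<Observation>;                 // what's on screen right now
  act(action: Action): Promise<void>;              // perform one discrete action
  report(previous: Observation): Promise<string>;  // what changed since `previous`
}
```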
Drengr is one implementation of this control plane. It exposes three MCP tools to any AI client that supports the protocol — Claude Desktop, Cursor, Windsurf today; more soon:
- drengr_look: observe the current screen + UI tree
- drengr_do: execute a tap / type / swipe / etc.
- drengr_query: read structured data (devices, activity, crashes)
That's the whole surface. Three verbs, no XPath, no fragile selectors, no Appium daemon to keep alive.
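To make that concrete, here's roughly what one observe-and-act step looks like from the client side. The callTool function stands in for whatever MCP client you're using, and the argument shapes are my guesses, not Drengr's documented schema:

```typescript
// Hypothetical client-side sketch. `callTool` stands in for your MCP
// client's tool-invocation method; argument fields are illustrative,
// not Drengr's documented schema.
type ToolCall = (name: string, args: Record<string, unknown>) => Promise<string>;

async function oneStep(callTool: ToolCall) {
  const screen = await callTool("drengr_look", {});                        // current screen + UI tree
  await callTool("drengr_do", { action: "tap", target: "Send" });          // one discrete action
  const devices = await callTool("drengr_query", { resource: "devices" }); // structured read
  return { screen, devices };
}
```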
Observe → decide → act, on real devices
Drengr's runtime is a single Rust binary that drives the device through its native channels (ADB on Android, WDA on iOS simulators). Each iteration of the agent loop looks like this:
- The model calls drengr_look. Drengr captures a screenshot, dumps the UI tree, and builds a compact text description (~300 tokens vs ~100KB for an image — see why text-first matters in the field-notes post).
- The model decides what to do and returns a JSON envelope with the action.
- The model calls drengr_do. Drengr executes the action against the device, then runs a situation report — diffed against the previous state — and feeds it back so the next decision starts grounded.
The situation report is the part most agent frameworks miss. Without it, the model is blind between observations and tends to over-act (tapping the same dead button five times because nothing visibly changed). With it, the loop becomes self-correcting.
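Here's one way to sketch that loop. I'm assuming drengr_do hands back the situation report as its result, which matches the flow described above but isn't a documented contract; the decide function is whatever model call you make:

```typescript
// Sketch of the observe → decide → act loop. Tool names match the post;
// the return shapes and the `decide` model call are assumptions.
type ToolCall = (name: string, args: Record<string, unknown>) => Promise<string>;
type Decide = (
  screen: string,
  lastReport: string | null,
) => Promise<Record<string, unknown> | null>;

async function runTask(callTool: ToolCall, decide: Decide, maxSteps = 20) {
  let lastReport: string | null = null;

  for (let step = 0; step < maxSteps; step++) {
    // Observe: compact text description of the current screen
    const screen = await callTool("drengr_look", {});

    // Decide: the model returns the next action, or null when the task is done.
    // The previous situation report keeps it grounded in what actually changed.
    const action = await decide(screen, lastReport);
    if (action === null) return;

    // Act: execute, then carry the diffed situation report into the next turn
    lastReport = await callTool("drengr_do", action);
  }
}
```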
Why this needs to be local
Cloud-only AI assistants are dead for anything physical. The moment a model has to decide whether to tap "Confirm" on your banking app, three things matter that round-trips can't deliver:
- Latency. A two-second cloud round trip feels broken when you're holding the phone in your hand.
- Privacy. Banking apps, health apps, messages — none of that should leave the device for a UI inference.
- Network independence. Subway, airplane, bad hotel wifi.
Once Gemini Nano (Android) and Apple Intelligence (iOS) are widespread, the bottleneck shifts entirely. The model is local. The control plane has to be local too. Drengr's runtime is a single static binary; that's not a coincidence.
Beyond mobile QA: where this actually goes
The obvious early audience for a mobile control plane is QA — automate the tedious test flows that break every sprint. That market is real but small. The much bigger one is AI-agent builders making on-device personal assistants.
Concretely, with the same three tools shown above, an agent on the user's machine can:
- Open the Photos app, find pictures from last weekend, attach them to a message in WhatsApp
- Watch a flight app for a price drop and rebook automatically
- Operate a banking app inside a screen-sharing session for a low-vision user
- Run the long-tail of "things you'd ask a human assistant to do on your phone if you had one"
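To pick on the first one: stripped down, it's the same loop pointed at a specific goal. This is a hand-written approximation rather than output from a real run, and the app names and argument fields are placeholders:

```typescript
// Hand-written approximation of the Photos → WhatsApp flow. A real agent
// would pick each action from drengr_look output instead of scripting it;
// the app names and argument fields are placeholders, not Drengr's schema.
type ToolCall = (name: string, args: Record<string, unknown>) => Promise<string>;

async function sharePhotos(callTool: ToolCall) {
  await callTool("drengr_do", { action: "launch", app: "Photos" });
  await callTool("drengr_look", {});                                        // locate last weekend's photos
  await callTool("drengr_do", { action: "tap", target: "Select" });
  await callTool("drengr_do", { action: "tap", target: "Share" });
  await callTool("drengr_do", { action: "tap", target: "WhatsApp" });
  await callTool("drengr_look", {});                                        // confirm the compose screen
  await callTool("drengr_do", { action: "type", text: "Photos from the weekend" });
  await callTool("drengr_do", { action: "tap", target: "Send" });
}
```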
None of those need new model capability. They need a working hands-and-eyes layer that the model can call. That's exactly the gap I wrote about in "AI Can Browse the Web. Why Can't It Tap a Phone?"
Where to start
If you're building anything that wants an AI to control a real mobile device — whether your goal is QA, an on-device assistant, an accessibility tool, or something I haven't thought of — the control plane is the part you don't want to build from scratch. WDA, ADB, the screen-capture pipeline, the situation diffing, the cross-platform abstraction — they're all unglamorous infrastructure that's already done.
Drengr is free to use. One command to install it via Claude Code, one to verify it works:
```sh
claude mcp add drengr -- npx -y drengr mcp
drengr doctor
```
Then point your agent at it and see what your model can actually do when it has hands. (The Rust choice was deliberate too — that's a separate post.)