DEV Community: Sharmin Sirajudeen

The HiPPO is always right

Sharmin Sirajudeen — Tue, 30 Jun 2026 21:00:12 +0000

The HiPPO is always right

I've spent years inside companies serving millions of people. Different industries, different teams — and the same quiet thing in every one, the thing nobody says out loud:

The people running the company can't actually see it.

Not the founder. Not the CTO. Not the investor. Not the VP, not the director. Every one of them gets the same thing — the version from the data team. A report, written for them, by someone else, about what's happening inside their own company. The entire leadership's picture of reality, capped by one team in the middle. A sharp founder with an average data team gets an average, maybe-wrong picture of their own company — and no way to know it's wrong.

And sometimes there isn't even a report. I've watched this exact thing play out more than once: someone senior declares that a screen "has a problem." No data — a feeling, delivered with authority. The room agrees instantly, because when the most senior person names the cause, agreement isn't evidence, it's gravity. There's even a name for it: the HiPPO — the Highest Paid Person's Opinion. And that's where it ends. No one ever establishes whether the screen was actually the problem. The claim is never proven true or false — it's just said, and then it's real, because of who said it. Years later, you still don't know if it ever was.

That's the fast way a wrong conclusion enters a company. The slow way wears the costume of rigor.

Watch how a "data-driven" decision actually gets made. A PM has a hunch. It becomes a PRD. Engineering builds it, ships it, everyone waits. Weeks pass. The PM comes back with a list of things to track — events they guessed would matter, from a discovery that may have been wrong from the start. The data team wires it up. BI builds the dashboard. A number appears.

So you decide. On a number, produced by a hundred-person chain, measuring a hypothesis someone guessed, instrumented in a way you can't check, interpreted by a team whose ceiling you can't exceed. Sometimes you find out you were wrong. More often — the part that should keep you up at night — you don't. Nobody verified it.

Here's the most rigorous costume of all: the A/B test. The "gold standard." The thing we point to and say we're data-driven. Ask four honest questions about your last one:

Did the tracking even fire correctly for both variants?
Did your metric measure what you care about — or a proxy you hoped correlated?
Did the winning variant quietly hurt three things you never thought to measure?
Was the statistics even valid — no peeking, no sample mismatch, no confound?

You shipped the winner. You verified none of it. And that's the rigorous end of how companies decide.

The better-funded the company, the better it hides all this. The biggest players can afford world-class PMs and data scientists who guess better and rebuild the apparatus faster. So clarity becomes a luxury good — rationed by how much talent and money you can throw at the middle layer. Everyone else flies blind and calls it data-driven.

The problem isn't the people. Every PM and analyst I've worked alongside was sharp and trying hard. The problem is the method — a whole discipline built on a quiet lie: that you can understand what's happening by having a human guess what to measure, instrument the guess, wait, and trust the interpretation. Whether the conclusion falls from a HiPPO's mouth or crawls out of a hundred-person pipeline, it arrives the same way: unverified.

So here's the question I can't put down: what if you never had to guess?

What if you could just see what actually happened — and check it yourself — instead of trusting a chain of people to hand you a version of it? The teams that get there first won't win because they hired better analysts. They'll win because they stopped being blind.

That world is closer than it looks. I'm building toward it.

— Sharmin Sirajudeen. I'm building toward a world where you can just see what's real. drengr.dev

Claude + Mobile via MCP: Giving the Model Hands on a Real Phone

Sharmin Sirajudeen — Sun, 03 May 2026 19:24:44 +0000

Claude + Mobile via MCP: Giving the Model Hands on a Real Phone

I plugged in a Pixel two months ago, ran one command in Claude Desktop, and watched it open Maps and start navigation to my home address from a single sentence prompt. It was the first time I'd ever seen a language model physically operate a phone. Latency was about two seconds per action; the part that surprised me was the third step, where Claude noticed it had landed on the wrong screen and self-corrected without being asked.

That experience is what this post is about — what "Claude on mobile via MCP" actually means today, what's required underneath to make it work, and why the bottleneck for AI agents has stopped being model size and started being whether the model has hands.

What "Claude mobile MCP" actually refers to

There isn't a product called Claude Mobile MCP. What people are searching for is a way to let Claude — running in Claude Desktop, Claude Code, Cursor, or any other MCP-aware client — control a real Android phone or iOS simulator. MCP (Model Context Protocol) is Anthropic's open standard for exposing tools and data to language models. It's the cleanest way to plug a new capability into Claude without rebuilding the client.

The missing piece has been a server on the MCP side that knows how to drive a phone. That's what Drengr does. It's a single Rust binary that exposes three tools to any MCP client:

drengr_look      observe the current screen + UI tree
drengr_do        execute a tap / type / swipe / draw / key event
drengr_query     read structured data (devices, activity, crashes)

Three verbs. No XPath, no Appium daemon, no fragile selectors. Claude calls drengr_look, gets back a compact text description of what's on screen with numbered elements, decides what to do, and calls drengr_do with the action and target element. Drengr executes against the device through its native channel (ADB on Android, WDA on iOS simulators) and returns a situation report — what changed, what appeared, whether the app crashed — so the next decision starts grounded.

The bottleneck stopped being the model

A common assumption is that mobile AI assistants are blocked on model capability. They aren't. Current Claude models reason about a screenshot better than most humans can describe one. Open Photos, find a specific picture, attach it to a WhatsApp message — Claude has the visual reasoning to do all of that. What it doesn't have, by default, is a way to actually touch the screen.

That gap is purely transport. iOS sandboxing prevents one app from reaching into another. Android Accessibility Services exist but are heavy to set up, scary to permission, and limited in what they can synthesize. Cloud-only assistants are dead for anything physical because:

Latency. A two-second cloud round trip per tap feels broken when you're holding the phone.
Privacy. Banking apps, health apps, messages — none of that should leave the device for a UI inference.
Network independence. Subway, airplane, bad hotel wifi.

Once Gemini Nano on Android and Apple Intelligence on iOS are widespread, the model is local. The control plane has to be local too. Drengr is a single static binary; that's not a coincidence. (More on why local matters in The Missing Control Plane for Local AI Agents.)

How to actually try it

Two commands, assuming you have Claude Code installed:

claude mcp add drengr -- npx -y drengr mcp
drengr doctor

The first registers Drengr as an MCP server in Claude Code. The second checks that ADB and Xcode are reachable on your machine and prints any device it can see. Then plug in an Android phone (with USB debugging enabled) or boot an iOS simulator, and ask Claude to do something on it.

A few prompts that have worked for me:

"Open YouTube and search for m3 ultra benchmarks."
"Open Slack, find the channel called engineering, post the message merging at 3pm."
"Open Maps and start navigation to home."

Each one becomes an observe-decide-act loop under the hood. Claude calls drengr_look, reasons over the text scene Drengr returns (~300 tokens vs ~100KB for a raw screenshot — see why text-first matters here), decides what to do, calls drengr_do, gets a situation diff back, and repeats until the task is done or it gets stuck.

Beyond mobile QA

The obvious early audience for a mobile MCP server is QA — automate the test flows that break every sprint. That market is real but small. The much bigger one is everything else you can build once Claude can touch a phone:

RPA on mobile. UiPath / Automation Anywhere shops have spent a decade automating desktop workflows. Mobile has been a gap because the existing tooling assumed a developer sat behind it. With an LLM in the loop, the tooling assumption changes.
Accessibility tools. A low-vision user can ask Claude to operate a banking app on their behalf, with the app running on their own device.
On-device personal assistants. "Find pictures from last weekend in Photos and attach them to a WhatsApp message to my mom" is a sentence Claude can already plan. The hands-and-eyes layer was missing.

None of those need new model capability. They need a working transport. (That argument in full: AI Can Browse the Web. Why Can't It Tap a Phone?)

Where to start

If you want Claude to control a phone, install Drengr and ask it to do something. The control plane is the part you don't want to build yourself — WDA, ADB, the screen-capture pipeline, the situation diffing, the cross-platform abstraction, the MCP wiring — they're all unglamorous infrastructure that's already done.

claude mcp add drengr -- npx -y drengr mcp
drengr doctor

Then point Claude at any Android device or iOS simulator and watch what happens when the model has hands.

AI-Native Mobile Testing: What It Actually Means in 2026

Sharmin Sirajudeen — Sun, 03 May 2026 19:24:01 +0000

AI-Native Mobile Testing: What It Actually Means in 2026

The phrase "AI-native" has been thrown around in the testing space since 2019. Almost every tool calling itself that just bolts a language model on top of Appium and ships the same brittle XPath selectors with a new label.

That's not AI-native testing. That's Appium with a chatbot.

This post is about what AI-native actually has to mean to be worth anything — and what it changes about how mobile teams ship.

The pattern that broke testing for the last decade

Traditional mobile UI testing has a well-known failure mode. You write hundreds of tests with hardcoded element IDs or XPath selectors. Then half of them break each release for reasons that have nothing to do with bugs — a designer moved a button into a BottomSheet, a resource ID got renamed, an animation timing changed. Forty tests turn red overnight.

Engineering teams burn entire sprint days fixing tests for code that works fine. The economics are absurd: you spend more on test maintenance than the tests prevent in production bugs. (I covered the deeper history of why this happens in Evolution of Mobile Automation.)

The dominant fix from the last decade was "better selectors" — accessibility IDs, more stable resource paths, page-object patterns. None of it fixed the fundamental problem: a test that says "tap the element with id btn_login" couples the test to an implementation detail that has no business being part of the test contract.

What AI-native actually means

A real AI-native testing tool does one structural thing differently: the AI is the orchestrator, not a polish layer.

Concretely, that means the test framework hands the AI:

the current screen (screenshot + parsed UI tree, in plain text)
a description of what the test is trying to accomplish ("log in with these credentials")
the list of available actions (tap, type, swipe, etc.)

…and the AI decides what to do, step by step. No selectors. No hardcoded IDs. No XPath. The AI looks at the screen, notices a button labeled "Login" near a username field, taps it. If next release the button gets a new background and moves down 80 pixels, the AI doesn't care. It still sees a button labeled "Login," it still taps it. The test passes.

Drengr is one implementation of this pattern. It's a single Rust binary that exposes three tools to any MCP-aware AI client (Claude Desktop, Cursor, Windsurf today):

drengr_look      observe the current screen + UI tree
drengr_do        execute a tap / type / swipe / draw / key event
drengr_query     read structured data (devices, activity, crashes)

That's the whole surface. Three verbs, no XPath, no Appium daemon to keep alive.

What it changes economically

Once the AI is the orchestrator, two things shift:

1. Test maintenance approaches zero. The test prompt ("log in with test@example.com / pw123 and verify you land on the dashboard") rarely changes between releases. The implementation underneath (button labels, layout, animations) can change freely without breaking the test. Maintenance cost was the entire reason mid-size companies stopped writing UI tests at all. Removing it puts UI testing back on the table for teams that wrote it off.

2. The audience widens. Indie developers and small teams who could never afford a dedicated QA engineer are now in scope. The cost of one full-time Espresso maintainer was what kept many small apps shipping untested. An AI-native testing layer effectively replaces that role for the cost of a Claude or Cursor subscription. (How drengr's architecture lines up with what academic researchers have found, in Field Notes.)

Beyond testing: the same primitives, different audience

The interesting thing once you have an AI controlling a real device is that the testing use case stops being the most important one.

The same drengr_look / drengr_do / drengr_query primitives let an AI agent on the user's machine:

Open Maps and start navigation when the user says "drive me home"
Scroll through Instagram and report sponsored posts
Pay a bill in a banking app on behalf of someone with a motor disability
Run any of the long-tail "things you'd ask a human assistant to do on your phone" tasks

Mobile QA was the first audience because the pain is most acute there. The AI-agent-builder audience is much larger. If you're building anything that wants an AI to control a real mobile device — testing, accessibility, personal assistance, something I haven't thought of — the control plane is the part you don't want to write from scratch. (That argument in full: AI Can Browse the Web. Why Can't It Tap a Phone?)

Get started

Drengr is free. Install through Claude Code with one command, verify with a second:

claude mcp add drengr -- npx -y drengr mcp
drengr doctor

Then point your agent at any Android device or iOS simulator and watch what happens when the model has hands. No XPath. No selectors. No find_element(By.ID, "btn_login") to maintain across forty test files for the rest of your career.

That's what AI-native is supposed to mean.

The Missing Control Plane for Local AI Agents

Sharmin Sirajudeen — Sun, 03 May 2026 19:23:55 +0000

The Missing Control Plane for Local AI Agents

I sat with my Pixel for 20 minutes trying to get Claude Desktop to dictate a Slack message via accessibility. It was miserable. The model was capable. The transport wasn't.

That gap — between an AI that can reason and an AI that can actually do — is what I've been working on with Drengr. This post is the version of the argument I'd give to anyone building local AI agents today.

What a control plane actually means here

When people talk about "AI agents," they usually focus on the model: which one, how big, how cheap to run, what context window. Those are real questions, but they all assume the agent has a way to act on the world. On mobile, it mostly doesn't. iOS sandboxing prevents one app from touching another. Android Accessibility Services exist but are heavy to set up, scary to permission, and limited in what they can synthesize.

The result: you can ship a brilliant Gemini Nano running on a Pixel, and it still can't open Maps and start navigation for you. The model has no hands.

A control plane fills that gap. It's not the model. It's the layer underneath that:

observes the device (screen state, UI tree, current activity, foreground app)
executes discrete actions (tap, type, swipe, draw, key event, app launch)
reports what changed after each action so the agent can adjust

Drengr is one implementation of this control plane. It exposes three MCP tools to any AI client that supports the protocol — Claude Desktop, Cursor, Windsurf today; more soon:

drengr_look      observe the current screen + UI tree
drengr_do        execute a tap / type / swipe / etc
drengr_query     read structured data (devices, activity, crashes)

That's the whole surface. Three verbs, no XPath, no fragile selectors, no Appium daemon to keep alive.

Observe → decide → act, on real devices

Drengr's runtime is a single Rust binary that drives the device through its native channels (ADB on Android, WDA on iOS simulators). The agent loop looks like this each step:

The model calls drengr_look. Drengr captures a screenshot, dumps the UI tree, builds a compact text description (~300 tokens vs ~100KB for an image — see why text-first matters in the field-notes post).
The model decides what to do, returns a JSON envelope with the action.
The model calls drengr_do. Drengr executes against the device, then runs a situation report — diffed against the previous state — and feeds it back so the next decision starts grounded.

The situation report is the part most agent frameworks miss. Without it, the model is blind between observations and tends to over-act (tapping the same dead button five times because nothing visibly changed). With it, the loop becomes self-correcting.

Why this needs to be local

Cloud-only AI assistants are dead for anything physical. The moment a model has to decide whether to tap "Confirm" on your banking app, three things matter that round-trips can't deliver:

Latency. A two-second cloud round trip feels broken when you're holding the phone in your hand.
Privacy. Banking apps, health apps, messages — none of that should leave the device for a UI inference.
Network independence. Subway, airplane, bad hotel wifi.

Once Gemini Nano (Android) and Apple Intelligence (iOS) are widespread, the bottleneck shifts entirely. The model is local. The control plane has to be local too. Drengr's runtime is a single static binary; that's not a coincidence.

Beyond mobile QA: where this actually goes

The obvious early audience for a mobile control plane is QA — automate the tedious test flows that break every sprint. That market is real but small. The much bigger one is AI-agent builders making on-device personal assistants.

Concretely, with the same three tools shown above, an agent on the user's machine can:

Open the Photos app, find pictures from last weekend, attach them to a message in WhatsApp
Watch a flight app for a price drop and rebook automatically
Operate a banking app inside a screen-sharing session for a low-vision user
Run the long-tail of "things you'd ask a human assistant to do on your phone if you had one"

None of those need new model capability. They need a working hands-and-eyes layer that the model can call. That's exactly the gap I wrote about in AI Can Browse the Web. Why Can't It Tap a Phone?

Where to start

If you're building anything that wants an AI to control a real mobile device — whether your goal is QA, an on-device assistant, an accessibility tool, or something I haven't thought of — the control plane is the part you don't want to build from scratch. WDA, ADB, the screen-capture pipeline, the situation diffing, the cross-platform abstraction — they're all unglamorous infrastructure that's already done.

Drengr is free to use. One command to install it via Claude Code, one to verify it works:

claude mcp add drengr -- npx -y drengr mcp
drengr doctor

Then point your agent at it and see what your model can actually do when it has hands. (The Rust choice was deliberate too — that's a separate post.)

How a Web Worker Fixed My Dying-Battery Audio (And What I Learned About PWAs the Hard Way)

Sharmin Sirajudeen — Mon, 13 Apr 2026 11:54:40 +0000

I spent the last week modifying an open-source NES emulator to run in the browser as a PWA. I'm an Android developer by trade — Kotlin, Jetpack Compose, Flutter when the project calls for it. This was my first real dive into Web Workers, SharedArrayBuffer, and turning a browser tab into something that feels like a native app.

Here's what I learned. Some of it was obvious in hindsight. Most of it wasn't.

The Problem That Started Everything

I wanted to add real-time game modification sliders to a browser-based NES emulator. Speed multiplier, firepower boost, infinite lives — the kind of thing that's trivial if you have access to the game's memory. The emulator (JSNES, open-source) gives you direct access to the NES CPU's RAM via JavaScript. Writing a slider that tweaks cpu.mem[0x0487] every frame is maybe 10 lines of code.

I set up a GitHub Codespace, got the emulator running, and tested it in the browser. Everything worked beautifully. Then I opened the same URL on an older Android phone sitting on my desk.

The game visuals were smooth enough. But the audio — the iconic 8-bit music — sounded like a toy running out of battery. Slow, dragging, painful. Like someone was holding the NES's APU underwater.

Why Single-Threaded Was the Root Cause

Here's what was happening. The NES generates audio samples at 44,100 Hz, tied directly to CPU emulation. Each frame of emulation produces ~735 audio samples. The browser's Web Audio API expects those samples delivered at a consistent rate.

On a decent machine, the main thread easily ran the emulator at 60fps + rendered the canvas + fed audio samples. No contention. On the slow Android phone, canvas rendering was choking the main thread. Frames dropped to 30fps. Half the audio samples were generated per second. The Web Audio API played them at the expected rate but ran out halfway — producing that dying-battery sound.

I tried every hack I could think of:

Adaptive sample dropping — monitored FPS and dropped audio samples when the device struggled. Result: choppy audio instead of slow audio. Not better.
Dynamic Rate Control — stretched available samples via interpolation (the algorithm RetroArch uses). Result: alien communication sounds. The pitch was wrong because you can't stretch 22K samples to fill 44K slots without changing the fundamental frequency.
Multi-frame catch-up — ran 2 NES frames per requestAnimationFrame when the device fell behind. Result: even slower, because the device couldn't handle 2 frames if it was already struggling with 1.

None of it worked because I was treating the symptom, not the disease. The disease was: audio generation and canvas rendering were fighting for the same thread.

The Fix: Web Workers

The solution was architecturally simple. Move the NES emulation (CPU + audio generation) to a Web Worker. The main thread only handles canvas rendering, user input, and UI.

Worker Thread (setInterval @ 60fps)
├── JSNES emulation (CPU, PPU, APU)
├── Audio sample generation → SharedArrayBuffer
├── Game mod logic (speed, firepower, lives)
└── Frame pixel conversion → postMessage (Transferable)

Main Thread (requestAnimationFrame)
├── Canvas rendering (receives pixels from Worker)
├── Audio playback (reads from SharedArrayBuffer)
├── Keyboard/touch input → postMessage to Worker
└── UI (sliders, toggles, save/load, fullscreen)

The key insight: setInterval in a Web Worker is not throttled when the tab is backgrounded. requestAnimationFrame on the main thread is. This means the Worker keeps generating audio at a consistent rate regardless of what the renderer is doing. The audio buffer never starves.

SharedArrayBuffer: The Zero-Copy Audio Bridge

This was the part I found most interesting, coming from a mobile background where inter-thread communication usually means Handler.post() or Kotlin coroutine channels.

The Worker generates ~735 audio samples per frame. Those samples need to reach the main thread's ScriptProcessorNode with minimal latency. postMessage adds serialization overhead and scheduling jitter — fine for input events, not great for 44,100 samples per second.

SharedArrayBuffer gives both threads access to the same memory. The Worker writes audio samples into a ring buffer. The main thread's audio processor reads from the same buffer. Zero copy, zero serialization, microsecond access.

The layout is simple:

SharedArrayBuffer:
[0-3]   Int32: write index (Worker writes via Atomics.store)
[4-7]   Int32: read index (Main reads via Atomics.store)
[8+]    Float32[]: interleaved L/R audio samples

The Worker writes samples after each nes.frame() call. The ScriptProcessorNode on the main thread reads them in its onaudioprocess callback. The Atomics operations provide memory ordering guarantees — no locks needed for a single-producer, single-consumer ring buffer.

One gotcha that cost me an hour: interleaved audio samples must always be written in pairs (left + right). If the available buffer space is odd, you write one L sample without its R, and every subsequent read is shifted by one channel. The fix is one line: samplesToWrite = available & ~1 — force even.

SharedArrayBuffer requires specific HTTP headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

Without these, typeof SharedArrayBuffer === 'undefined' in every browser. I built a fallback path using postMessage with Transferable Float32Array for environments where the headers can't be set.

Frame Transfer: Transferable Objects

The NES outputs 256×240 pixels per frame. That's ~245KB of pixel data at 60fps. Copying it via postMessage would be expensive. Transferable objects solve this — the ArrayBuffer is moved between threads, not copied. The sending thread loses access to it (it gets "neutered"), but the transfer is essentially free.

// Worker: convert pixels and transfer
const pixels = new Uint32Array(61440);
// ... fill pixels from JSNES frameBuffer ...
postMessage({ type: 'frame', pixels }, [pixels.buffer]);
// pixels.buffer is now neutered — length 0 in Worker

I used double-buffering: two pixel arrays in the Worker, alternating which one gets filled and transferred. In practice, I found that just reallocating a new Uint32Array(61440) after each transfer was simpler and fast enough — 245KB allocation at 60fps is well within V8's comfort zone.

The PWA Part

Turning this into a Progressive Web App was its own education. A few things I learned:

iOS Safari has no Fullscreen API. Not requestFullscreen, not webkitRequestFullscreen, not any variant. I discovered this when the fullscreen button simply did nothing on an iPhone. The only way to get "fullscreen" on iPhone is display: standalone in your web manifest + adding to home screen. Even then, the status bar stays — Apple never lets you hide it.

I ended up building a CSS-simulated fullscreen: toggling a body class that hides everything except the game canvas and touch controls. But then the exit button didn't work. Turns out, on iOS, a position: fixed button placed outside the main touch-responsive container silently fails to receive touch events. The button renders, you can see it, but tapping does nothing. I had to move the exit control inside the same overlay that handles game input. That one cost me a few hours of confused debugging.

PWA icons on iOS must be PNG, not SVG, and RGB not RGBA. Safari ignores SVG apple-touch-icon links entirely. And if your PNG has an alpha channel, iOS sometimes renders a blank or uses its default icon. My custom pixel-art icon only appeared after I converted it from RGBA to RGB using Pillow.

Service Worker caching is aggressive and separate from Safari's cache. Deleting Safari data doesn't clear a PWA's cache. You have to delete the home screen app icon first, then clear Safari data, then re-add. Learned this the hard way when testers kept seeing old versions.

The viewport-fit: cover meta tag is what lets your app extend under the iPhone notch. Without it, you get black bars.

Bonus: Background Execution Control

One thing I didn't expect — the Worker architecture gives you easy control over background behavior. Since the emulation loop runs on setInterval inside a Web Worker (which browsers don't throttle in background tabs), the game keeps running even when the user switches apps or tabs. That's great for audio continuity, but terrible for battery life.

The fix is trivial: listen for visibilitychange on the main thread and send a pause/resume message to the Worker. The emulation stops completely when the app is backgrounded and picks up exactly where it left off when the user returns. No state loss, no audio glitch on resume. If you ever need background execution (say, for a music player or a long-running computation), just don't send the pause — the Worker keeps ticking regardless of what the main thread is doing. Having that as a conscious choice rather than a browser-imposed limitation is a nice side effect of the architecture.

The Result

On the same slow Android phone that produced dying-battery audio with the single-threaded architecture: smooth, consistent, correct-speed audio. The Web Worker generates samples at a steady 60fps via setInterval, completely independent of the main thread's rendering frame rate. The SharedArrayBuffer bridge adds effectively zero latency.

The visual frames might drop to 30fps on a slow device — the game looks a bit less smooth — but the audio is untouched. That's the right tradeoff. Humans tolerate choppy video far better than choppy audio.

Takeaways

Thread architecture is a day-one decision, not an optimization. I built the single-threaded version first because it was faster to prototype. Then I spent more time patching audio hacks than the Worker migration ultimately took. If your app does real-time audio/video processing, put the producer on a separate thread from the start.

SharedArrayBuffer is the right tool for high-frequency inter-thread data. For audio at 44,100 samples/second, postMessage adds too much jitter. For input events at 10-30/second, postMessage is perfectly fine. Match the tool to the frequency.

Transferable objects are free. If you're passing large ArrayBuffers between threads via postMessage, mark them as transferable. Zero copy, zero overhead. Just remember the sender loses access.

PWAs on iOS are a different platform entirely. Don't assume web APIs work the same. The Fullscreen API doesn't exist. Touch events behave differently for fixed-position elements. Icons need specific formats. Test on an actual iPhone, not just Chrome DevTools mobile emulation.

Test on the slowest device first. If I'd tested on the old Android phone on day one, I would have designed for Workers from the start. Testing only on fast hardware hides architectural problems that become very expensive to fix later.

I'm a mobile developer (Android/Kotlin, Flutter) exploring the browser as a platform for real-time applications. If you've dealt with Web Workers, SharedArrayBuffer, or PWA quirks on iOS, I'd love to hear about your experiences in the comments.

Building in Public: The Architecture of a Solo Rust Project

Sharmin Sirajudeen — Mon, 06 Apr 2026 08:18:27 +0000

Building in Public: The Architecture of a Solo Rust Project

I'm the creator of Drengr, an MCP server that gives AI agents eyes and hands on mobile devices. I started this blog to share the engineering behind it. No pretending to be a neutral observer writing a think piece — I built this, and I'm here to talk about it.

I'm a solo developer building a Rust project, and I want to talk about what that actually looks like. Not the polished "launched on Product Hunt and got 500 stars" version, but the real one — the architecture decisions made at midnight, the bugs that took days, and the strange irony of using AI to build AI tooling.

Drengr started as a research question: can I give an AI agent a phone? No venture capital, no team, no timeline pressure. Just curiosity and a problem that felt important enough to spend months on. Building in public means sharing the journey honestly, including the parts that don't look impressive.

Why I'm Building This Alone

The honest answer is that this project started before I knew it was a project. After ten years of writing Android apps and watching every UI test suite I touched decay faster than we could maintain it, I started experimenting with whether AI could do better. I hacked together a Python script that captured screenshots and sent them to an LLM API with action instructions. It worked badly, but it worked. That script became a prototype, the prototype became an architecture, and the architecture demanded Rust.

At no point did I sit down and say "I'm going to build a product." I kept solving the next problem. The next problem kept being interesting. Drengr is my first real tool — the first thing I've built that isn't an internal script or a weekend experiment. Six months later, I have about 6,300 lines of Rust, a working MCP server, and the beginning of something I think could matter.

Solo development has real trade-offs. I don't have anyone to review my code. I don't have anyone to challenge my architectural decisions. When I make a mistake, there's no one to catch it until a user reports a bug. The upside is speed — I can refactor the entire transport layer on a Saturday without scheduling a meeting.

The Architecture

Drengr's architecture is built around one core abstraction: the transport layer.

Transport Trait

A single Rust trait defines what it means to "talk to a device":

trait Transport {
    fn capture_screen(&self) -> Result&lt;Screenshot&gt;;
    fn get_ui_tree(&self) -> Result&lt;Vec&lt;UiElement&gt;&gt;;
    fn execute_action(&self, action: Action) -> Result&lt;()&gt;;
    fn query_state(&self, query: Query) -> Result&lt;StateResponse&gt;;
}

Three implementations exist: ADB for Android devices and emulators, simctl for iOS simulators, and Appium for cloud device farms. Each speaks a completely different protocol. ADB uses shell commands and binary protocols. Simctl uses Apple's command-line tools. Appium uses HTTP/WebDriver.

The rest of the codebase doesn't know or care which one is active. The MCP handler, the OODA loop, the screen annotation system — they all work through the trait. Adding a new platform means implementing four methods.

MCP Handler

The MCP server reads JSON-RPC from stdin and writes responses to stdout. This sounds simple until you realize that the device interactions also write to stdout (ADB commands, for instance, produce output). One of my earliest architectural decisions was redirecting child process I/O to avoid polluting the MCP channel.

The handler routes incoming tool calls to one of three paths: drengr_look triggers a screen capture and UI tree extraction, drengr_do dispatches an action to the transport layer, and drengr_query reads state without side effects.

Screen Annotation

When the agent calls drengr_look, it doesn't just get a raw screenshot. Drengr extracts the UI hierarchy, identifies interactive elements, assigns each a number, and returns both the annotated information and the element metadata. The agent can then say "tap element 7" instead of "tap at coordinates (342, 891)."

This annotation system is more important than it might seem. It bridges the gap between how the AI perceives the screen (as a visual field) and how the device accepts input (as structured commands). Without it, every interaction requires the agent to estimate pixel coordinates from visual inspection, which is unreliable.

What 6,300 Lines of Rust Taught Me

The compiler is your strictest code reviewer. I've lost count of the number of times the borrow checker rejected code that I was confident was correct, only to realize on reflection that it was catching a real problem. Not always a bug — sometimes a design issue. "You can't hold a mutable reference to the transport while also iterating over its UI tree results" is the compiler's way of saying "your data flow is tangled."

If it compiles, it probably works. This cliche has limits — logic errors still exist, integration tests still matter — but the density of runtime bugs per line of code is lower than anything I've experienced in other languages. When I do hit a bug, it's almost always in my logic, not in my memory management, not in my error handling, and not in my concurrency model.

Ownership semantics forced better architecture. In Python or JavaScript, I'd have passed the transport connection around freely, probably storing references in three different places. Rust forced me to think about who owns the connection and who borrows it. That constraint produced a cleaner architecture than I would have designed voluntarily.

The Hardest Bug

MCP over stdio means Drengr reads JSON-RPC requests from stdin and writes responses to stdout. Simple enough — until you spawn an ADB shell command that also writes to stdout.

The first time this happened, the MCP client received a response that started with a valid JSON-RPC frame, continued with "List of devices attached," and then had another JSON-RPC frame. The client understandably choked.

The fix required redirecting all child process stdout to /dev/null or to a captured buffer, using os::unix::io and dup2 to manage file descriptors at the system call level. It's about 30 lines of code. It took me two full days to debug, because the symptoms were intermittent — ADB only writes to stdout under certain conditions, so the MCP corruption was sporadic.

This is the kind of bug that doesn't exist in simpler architectures. If Drengr were an HTTP server instead of a stdio server, the problem would never have arisen. But MCP over stdio is the standard for local tool servers, and for good reason — it's simpler for the client, requires no port management, and works inside sandboxed environments. The complexity is justified; the bug was the price of admission.

The Irony of Using AI to Build AI Tooling

I use Claude Code daily to work on Drengr. Claude helps me write the code that teaches Claude to use phones. The recursion is not lost on me.

It's genuinely productive. Claude is good at Rust — it understands ownership patterns, suggests idiomatic approaches, and catches issues I miss. When I was implementing the situation engine, Claude helped me think through the state comparison logic. When I was wrestling with async trait objects, Claude explained the Pin<Box<dyn Future>> pattern in a way that finally clicked.

The irony runs deeper, though. Every improvement I make to Drengr makes Claude slightly better at interacting with mobile devices. A better screen annotation system means Claude gets better information. A better situation engine means Claude makes fewer mistakes. I'm building a tool that improves the capability of the AI that helps me build the tool.

I don't think this is unique to my project. Every developer using AI to build AI tools is in this feedback loop. But working on it daily makes the loop very tangible.

What's on the Roadmap

Three things I'm actively working on:

Dashboard. A web interface for visualizing test runs, reviewing agent decisions, and correlating UI actions with network traffic. The technical spec is written; implementation is next.
Real-time steering. The ability to watch an agent run and redirect it mid-session. "Stop exploring settings, go test the checkout flow instead." This requires a WebSocket connection between the dashboard and the running Drengr process.
Network monitoring. An SDK that apps can integrate to capture network traffic during Drengr sessions. This lets the dashboard show what API calls happened alongside each UI action — invaluable for debugging integration issues.

How to Get Involved

Drengr is proprietary, but the community is open. I've set up GitHub Discussions for questions, feedback, and feature requests. The areas where I'd most appreciate input:

Testing on diverse devices. I develop on a limited set of emulator configurations. Reports of how Drengr behaves on different Android versions, screen sizes, and manufacturer overlays are extremely valuable.
Prompt engineering for test scenarios. The quality of Drengr's autonomous testing depends heavily on how the goal is expressed. I'm collecting effective prompts and would love contributions.
Bug reports and feature ideas. The best way to shape Drengr's direction is to use it and tell me what's missing.

Or just try it. curl -fsSL https://drengr.dev/install.sh | bash. Connect a device. Point Claude at it. Tell me what happens.

The best feedback isn't "great project." It's "I tried this and it broke." That's how the tool gets better.

Building in public means accepting that people will see the rough edges. I'm okay with that. The rough edges are where the interesting problems live.

Drengr is free to use and available on npm. It supports Android (physical devices, emulators), iOS simulators (full gesture support), and cloud device farms (BrowserStack, SauceLabs, AWS Device Farm, LambdaTest, Perfecto, Kobiton). Built in Rust. Single binary. No runtime dependencies.

Why I Chose Rust Over C and C++ for Drengr

Sharmin Sirajudeen — Mon, 06 Apr 2026 08:14:23 +0000

Why I Chose Rust Over C and C++ for Drengr

When I tell people I built a mobile automation tool in Rust, the first question is always "why not Python?" I've written about that in a separate post. But the question that actually kept me up at night during the early architecture phase was different: why not C or C++?

Drengr is a CLI tool that talks to Android devices over ADB, iOS simulators over simctl, and cloud devices over Appium WebDriver. It parses UI trees, captures screenshots, manages concurrent device sessions, and serves as an MCP server over stdio. This is systems programming territory. C and C++ have owned this space for decades. So why Rust?

This isn't a "Rust vs C++" holy war post. I've worked with C and C++ in different contexts over the years — JNI bridges and NDK modules at work when Java wasn't fast enough for real-time audio processing or custom camera pipelines, a raytracer in C++ during university that taught me more about segfaults than about light, and the usual Arduino/embedded experiments that every CS student does at some point. Enough to know what these languages are good at and where they hurt. This is an honest account of a specific decision for a specific project, with the trade-offs I actually faced.

The Case for C

C was tempting. ADB itself is written in C++. The Android debug bridge protocol is well-documented at the C level. I could have called into ADB's libraries directly, skipping the subprocess overhead entirely. A C binary would be tiny — potentially under 1MB with static linking and aggressive stripping.

I seriously considered it. For about two days.

The problem crystallized when I started sketching the MCP server. MCP is JSON-RPC 2.0 over stdio. That means parsing JSON, routing method calls, managing request/response correlation, handling concurrent tool invocations. In C, I'd need a JSON parser (jansson? cJSON? write my own?), string handling that doesn't segfault, and manual memory management for every request/response lifecycle.

I've seen enough C codebases to know what this looks like. It looks like 60% of your code being memory management boilerplate, and the remaining 40% being the actual logic you care about. For a research project where I need to iterate fast and try experimental approaches to screen parsing and AI agent loops, that ratio is fatal.

The Case for C++

C++ was a stronger contender. Modern C++ (17/20) has smart pointers, string_view, std::optional, std::variant — many of the ergonomic features that make Rust pleasant to write. The ADB ecosystem is native C++. I could use nlohmann/json for parsing. The standard library has threads, mutexes, condition variables.

Three things killed it for me:

1. The Build System Problem

I wanted a single static binary that anyone could curl and run. No shared library dependencies, no runtime requirements, no "install libfoo-dev first." In Rust, this is cargo build --release --target x86_64-unknown-linux-musl. Done.

In C++, static linking is a odyssey. CMake or Meson? Which standard library — libstdc++ or libc++? Static linking glibc is technically possible but discouraged and produces larger binaries with potential compatibility issues. Musl works but you need a separate toolchain. Cross-compilation for Apple Silicon from Linux? I'd need a cross-compiler toolchain per target triple.

Cargo handles all of this. I add a target, run the build, get a binary. The CI matrix in my GitHub Actions workflow is 20 lines. The equivalent CMake + cross-compilation setup would be 200+.

2. Concurrency Without Fear

Drengr manages multiple concurrent operations: the MCP server handles requests while the SDK server listens for in-app network events, the OODA loop runs autonomous agent sessions, and the explore mode does BFS traversal with concurrent screen captures. These all share state — the current device transport, the screen annotation cache, the situation engine.

In C++, shared mutable state across threads means choosing between:

Raw mutexes with manual lock/unlock discipline (and hoping you never forget)
Atomic operations for primitives (and hoping your lock-free algorithm is actually correct)
Higher-level abstractions like folly::Synchronized (and adding Facebook's folly as a dependency)

Data races in C++ are undefined behavior. Not "your program crashes." Undefined behavior. The compiler is allowed to do literally anything. Time travel. Nasal demons. In practice, it means subtle corruption that shows up three hours into a test session as a garbled screenshot or a silently wrong element count.

In Rust, the type system prevents data races at compile time. If I try to share a mutable reference across threads without proper synchronization, it doesn't compile. Period. The compiler forces me to use Arc<Mutex<T>> or channels or atomics explicitly. I can't accidentally share a raw pointer to a screen buffer across two async tasks.

For a tool that manages real device sessions — where a bug could mean sending the wrong tap to the wrong device — this isn't a nice-to-have. It's a requirement.

3. The Dependency Story

Drengr depends on reqwest (HTTP client), tokio (async runtime), serde (serialization), image (screenshot processing), and about 30 other crates. Adding a dependency in Rust is one line in Cargo.toml. Cargo downloads, compiles, and statically links it. Version resolution is automatic. Security advisories are tracked by cargo audit.

In C++, every dependency is a project. Do they use CMake? Meson? Autotools? Their own bespoke build system? Do they support static linking? Are their transitive dependencies compatible with mine? The Conan and vcpkg package managers have improved this, but they're still far from Cargo's "it just works" experience.

I estimated that managing C++ dependencies alone would cost me 2-3 weeks of the early development timeline. In a solo project where every week counts, that's not acceptable.

What I Miss From C/C++

Honesty requires admitting what Rust costs me.

Compile Times

A clean build of Drengr takes about 90 seconds. An incremental build after touching one file takes 8-12 seconds. The equivalent C project would compile in under 5 seconds clean, under 1 second incremental. When I'm iterating on screen parsing logic and want to test against a real device, those seconds add up.

I've mitigated this with cargo watch and by structuring the crate to minimize recompilation, but it's a real cost.

The Learning Curve

I knew C and C++ before I knew Rust. The borrow checker's mental model — ownership, borrowing, lifetimes — took weeks to internalize. There were days early in the project where I spent more time fighting the compiler than writing features. Async Rust made it worse: pinning, Send/Sync bounds, the colored function problem.

If I'd written Drengr in C++, the first prototype would have been done a week earlier. No question. But I believe the Rust version has fewer bugs, and I spend almost zero time debugging memory issues. That trade-off has compounded in my favor over the months since.

FFI Friction

ADB is a C++ tool. Some interactions would be more natural in C++ — direct FFI into ADB's libraries, for example. Instead, I shell out to the adb binary as a subprocess. It works, but it adds latency (spawning a process per command) and complexity (parsing stdout). A C++ implementation could potentially link against libadb directly.

In practice, the subprocess approach has been fine. ADB commands complete in 10-50ms typically, and the parsing is straightforward. But it's an architectural compromise I wouldn't need in C++.

The Numbers

After six months of development:

~6,300 lines of Rust — this includes the MCP server, three device transports (ADB, simctl, Appium), the OODA loop, the explore mode, the test runner, the SDK server, screen annotation, and the situation engine
Zero memory-related bugs in production. Not one use-after-free, double-free, buffer overflow, or data race
189 tests, all passing. The test suite runs in under 3 seconds
Binary size: ~15MB stripped, with LTO fat optimization. A C equivalent might be 3-5MB, but 15MB for a tool that includes an HTTP client, JSON parser, image processing, and async runtime is reasonable
Cold start: ~15ms to first MCP response. This matters when AI agents are waiting

What I'd Do Differently

If I started over tomorrow, I'd still choose Rust. But I'd do a few things differently:

Start with synchronous code, add async later. I went async-first with tokio, which complicated the early prototyping phase. Many of the ADB interactions don't benefit from async — they're sequential command-response pairs. I could have started synchronous and migrated the concurrent parts later.
Use fewer abstractions early. I over-engineered the transport trait in the first version. Three concrete implementations of a simple interface would have been clearer than a trait with twelve methods and two associated types.
Accept more unsafe. I avoided unsafe entirely for the first four months. Some of the ADB binary protocol parsing would have been cleaner with unsafe pointer arithmetic in a well-tested, isolated module. Rust's unsafe isn't C — it's a clearly bounded region where you tell the compiler "I've verified this manually." I was too cautious.

The Real Answer

The real reason I chose Rust over C and C++ isn't any single technical argument. It's this: Rust lets me write systems-level code at the speed I think, with the confidence that the compiler has caught the classes of bugs that would otherwise cost me debugging days.

For a solo developer building a research project that interacts with real hardware, manages concurrent sessions, and serves as infrastructure for AI agents — that confidence isn't a luxury. It's the difference between shipping and not shipping.

I don't have a team to review my pointer arithmetic. I don't have a QA department to catch my data races. I have the Rust compiler. And it's the most reliable colleague I've ever worked with.

C and C++ are extraordinary languages. They power the systems Drengr sits on top of — the operating systems, the ADB daemon, the simctl infrastructure. I have deep respect for them. But for this project, at this scale, as a solo developer? Rust was the right call.

The binary works. The code is correct. And I sleep well at night knowing the compiler has my back.

From ADB Shell to AI Agent: The Quiet Revolution in Mobile Automation

Sharmin Sirajudeen — Mon, 06 Apr 2026 08:12:22 +0000

From ADB Shell to AI Agent: The Quiet Revolution in Mobile Automation

Mobile test automation has a longer history than most developers realize, and the AI-driven approach I'm exploring with Drengr sits at the end of a progression that started with raw ADB shell commands in 2009. Understanding that progression matters — not because history is inherently interesting (though I think it is), but because each generation solved real problems while creating new ones. Every mobile automation tool, including mine, is a response to the limitations of what came before. Knowing those limitations helps evaluate what's genuinely new and what's just repackaging.

The ADB Era (2009-2012)

Android Debug Bridge shipped with the Android SDK, and it included a deceptively simple capability: adb shell input. You could inject taps, swipes, and key events from a terminal.

adb shell input tap 500 300
adb shell input text "hello"
adb shell input swipe 500 1500 500 500 300

Developers wrote bash scripts that chained these commands together. Open the app, wait 2 seconds, tap the login button at coordinates (340, 780), type the username, tap the next field at (340, 860), type the password.

The problems were immediate and severe. Coordinates were absolute pixels. A script written for a 1080p phone broke on a 720p phone. A script written for one app version broke when the developer moved a button 50 pixels down. There was no way to query the UI state — you sent commands blind and hoped for the best.

But ADB shell automation proved something important: developers wanted to automate mobile testing, even with terrible tools. The demand was real.

UIAutomator and Espresso (2012-2015)

Google responded with proper frameworks. UIAutomator provided black-box testing — you could find elements by resource ID, text, or description, rather than coordinates. Espresso provided white-box testing for Android — fast, deterministic tests that ran inside the app process.

These were real, production-quality tools. Espresso, in particular, is excellent. Its automatic synchronization with the UI thread eliminates an entire category of flaky tests. If you're doing Android-only testing with access to the source code, Espresso remains hard to beat in 2026.

The limitations: both are Android-only, language-locked to Java or Kotlin, and require compilation against the app. You can't use Espresso to test someone else's app. You can't use UIAutomator for iOS. And for teams building cross-platform products, maintaining separate test suites for each platform is expensive.

Appium's Universal Vision (2013-2020)

Appium had an ambitious idea: apply the WebDriver protocol — the same standard that powered Selenium for web testing — to mobile devices. Write tests in any language. Run them against any platform. One API to rule them all.

The vision was compelling, and Appium built a real foundation. It proved that cross-platform mobile testing was possible. Major companies adopted it. A huge ecosystem of plugins, drivers, and integrations grew around it.

But the architecture carried inherent weight. Appium runs a Node.js server that translates WebDriver commands into platform-specific actions through a chain of drivers. Setting up Appium meant installing Node.js, Java (for the Android driver), the appropriate SDKs, and getting all the versions to align. Session management was fragile. Tests that passed on one Appium version broke on the next. "Flaky tests" became almost synonymous with mobile automation in many teams.

Appium built the foundation. I want to be clear about that — a lot of what exists today in mobile automation stands on Appium's groundwork.

Maestro's Simplification (2022-2024)

Maestro, from mobile.dev, asked a sharp question: what if mobile testing was actually simple? Their answer was YAML-based test flows that you could write in minutes.

appId: com.example.app
---
- launchApp
- tapOn: "Sign In"
- inputText:
    text: "user@example.com"
    id: "email_field"
- tapOn: "Continue"
- assertVisible: "Welcome back"

Five-minute setup. No WebDriver server. No driver management. Just a CLI that talked directly to the device. Maestro proved that developer UX matters in testing tools — that a tool people actually enjoy using gets adopted, even if it has fewer features than the heavyweight alternative.

What Maestro didn't change: you still wrote every test manually. Every flow, every assertion, every edge case had to be authored by a human who understood the app. The tool was simpler, but the work was the same.

The AI Shift (2024-2026)

Two things happened in 2024-2025 that opened a genuinely new direction for mobile automation.

First, multimodal LLMs became good enough to reliably interpret screenshots. Not perfectly — I've written about the limitations — but well enough to identify buttons, text fields, navigation elements, and app state from a screenshot alone. The agent could see.

Second, Anthropic published the Model Context Protocol. MCP gave those capable-but-isolated LLMs a standard way to discover and invoke external tools. An AI model could now say "I want to tap element 5 on this screen" and have that intention reliably translated into a device action through a well-defined protocol.

These two ingredients — vision and tool use — are what make AI-driven mobile testing possible. Not just theoretically possible, but practically achievable by a solo developer building in Rust on weekends.

Where I Think This Is Heading

The progression I see is from imperative to declarative to goal-oriented.

ADB was imperative: tap here, swipe there, type this. Espresso was declarative: find this element, verify this state. Maestro was declarative with better DX.

Drengr is my attempt at goal-oriented: "verify that a user can sign up, log in, and post a message." The agent figures out the how. It adapts to the specific app. It handles UI variations and unexpected states. You describe what should work, not how to test it.

I'm not claiming this is solved. The previous sections of this blog document the limitations in detail. But I do believe the direction is correct: AI agents that explore apps like humans do, finding bugs through curiosity rather than scripts.

Comparison

This is my honest assessment of the current landscape. I've tried to be fair — every tool on this list solves real problems for real teams.

Feature
Appium
Maestro
Detox
Espresso / XCUITest
Drengr

Setup complexity
High
Low
Medium
Medium
Minimal

Cross-platform
Yes
Yes
React Native only
No (platform-specific)
Yes

AI-driven
No
No
No
No
Yes

Script-free testing
No
No
No
No
Yes

Single binary install
No
Yes
No
N/A (built-in)
Yes

MCP support
No
No
No
No
Native

Deterministic results
Mostly
Yes
Yes
Yes
No (AI-dependent)

Test authoring effort
High
Low
Medium
Medium-High
Minimal

Maturity
Very mature
Mature
Mature
Very mature
Early

I want to call attention to the "Deterministic results" row. Drengr is the only "No" in that column, and that matters. When you run an Espresso test, you get the same result every time. When you run a Drengr exploration, you might get different paths, different findings, different coverage. That's a feature for exploratory testing and a limitation for regression testing. Both are valid use cases; the right tool depends on what you need.

Appium built the foundation. Maestro proved that developer UX matters. I built Drengr because I saw a gap: what if the test itself was intelligent?

Whether that intelligence proves more valuable than determinism in practice is still an open question. I have early evidence that it is, for certain types of testing. But I'd rather present the question honestly than claim to have answered it definitively.

Giving Claude a Phone: How I Built an MCP Server for Mobile Devices

Sharmin Sirajudeen — Mon, 06 Apr 2026 08:08:19 +0000

Giving Claude a Phone: How I Built an MCP Server for Mobile Devices

After a decade of writing Android apps, I'd accepted a certain rhythm: write the code, build, deploy to a device, tap around manually, find the bug, go back to the IDE, fix it, repeat. When AI coding assistants arrived, they changed most of that loop — Claude could write a RecyclerView adapter with DiffUtil callbacks and proper coroutine scoping faster than I could type the class name. But it couldn't tap a single button on the emulator sitting right next to it. The code was flawless. The app was running. And the AI that wrote it had absolutely no way to verify its own work.

That disconnect — combined with years of watching Espresso and Appium test suites rot faster than teams could maintain them — made me think there had to be a better way. What if the AI could see the screen, understand what it's looking at, and interact with the app directly? Not through brittle element IDs, but through actual comprehension. This is the story of how I built Drengr — my first real tool, an MCP server that gives AI agents eyes and hands on mobile devices.

The Frustration That Started It

If you've used Claude or any capable LLM for mobile development, you've hit this wall. The AI helps you write code, debug layouts, even architect entire features. But the moment you need to verify something on an actual device, you're on your own. Copy the code, build, deploy, tap around, find the bug, go back to the AI, describe what you saw in words.

It's 2026, and the feedback loop between AI and mobile devices is still mediated entirely by human hands and human descriptions. As someone who spent ten years in that loop, it felt wrong to me. Not because automation is always better, but because the information loss is enormous. I can describe a broken layout to Claude, but Claude seeing the broken layout is fundamentally different.

The Insight: MCP as the Bridge

Anthropic's Model Context Protocol gave me the architecture I needed. MCP defines a standard way for AI models to discover and invoke tools — a JSON-RPC protocol over stdio or HTTP. Instead of building a bespoke integration, I could build an MCP server that exposes mobile device capabilities as tools that any MCP-compatible client can call.

The key insight was constraint. I didn't need to expose every possible device operation. I needed exactly three tools that would give an AI agent enough capability to understand and interact with any mobile app.

Three Tools, Three Verbs

Drengr exposes exactly three MCP tools:

drengr_look — Observes the current screen. Captures a screenshot, extracts the UI hierarchy, and returns an annotated view where every interactive element is numbered. The agent sees what a user would see, but with machine-readable structure.
drengr_do — Executes an action. Tap element 3, type "hello world", swipe up, press back. These are the hands.
drengr_query — Reads device state without side effects. Check if an element exists, read text content, get the current activity name. This is the quiet observer — it never changes anything.

That's it. Three tools. Every mobile interaction I've needed — from opening apps to navigating complex flows to filling forms — reduces to sequences of look, do, and query.

What Claude Actually Does With a Phone

Let me describe a real session. I asked Claude, through Drengr, to "open YouTube and find a video about the Model Context Protocol."

Claude called drengr_look first. It received back an annotated screenshot showing the home screen with numbered elements — the app drawer, status bar icons, and the YouTube icon labeled as element 14. Claude called drengr_do with {"action": "tap", "element": 14}.

YouTube opened. Claude called drengr_look again. Now it could see the YouTube home feed with a search icon at element 2. It tapped that, got a keyboard and search field, typed "Model Context Protocol MCP", and hit enter. Results appeared. Claude called drengr_look one more time, identified the first relevant result, and tapped it.

Total time: about 40 seconds. Total human intervention: zero. Claude navigated an app it had never been configured to use, adapting to whatever UI state it encountered.

Setting It Up

The MCP configuration is minimal. Here's what goes in your claude_desktop_config.json:

{
  "mcpServers": {
    "drengr": {
      "command": "drengr",
      "args": ["mcp"],
      "env": {
        "DRENGR_PLATFORM": "android"
      }
    }
  }
}

That's the entire integration. Drengr ships as a single binary — no Python virtualenv, no npm dependencies, no Docker container. You install it, point your MCP client at it, and Claude gains the ability to interact with whatever device is connected.

Honest Limitations

I want to be transparent about where this breaks down, because it does break down.

Vision isn't perfect. The UI hierarchy doesn't always capture everything visible on screen. Custom-drawn views, game canvases, and some Flutter widgets can appear as opaque rectangles. The agent can see the screenshot, but without structured element data, it's guessing at tap coordinates.

Some gestures are hard to express. A simple tap or swipe works reliably. But complex gestures — pinch to zoom, long-press-then-drag, multi-finger interactions — are difficult to represent in a tool call. I've implemented the common ones, but there's a long tail of interactions that don't map cleanly.

Latency adds up. Each look-do cycle involves capturing a screenshot, extracting the UI tree, sending it to the AI, waiting for a decision, and executing the action. On a fast local setup, each cycle takes 3-5 seconds. Over a network to a cloud device, it can be 8-12 seconds. For a 20-step flow, that's minutes of wall time.

Token costs are real. Screenshots and UI trees are not small. A single drengr_look response can be several thousand tokens. A complex navigation flow might consume 50,000-100,000 tokens. This isn't free, and it's something I think about when designing how much context to include in each response.

What This Changes

The immediate application is testing — give Claude a goal, let it explore the app, report what it finds. But I think the more interesting implication is broader. MCP mobile support means AI agents can participate in workflows that were previously human-only. Filing bug reports with actual screenshots. Verifying that a deployment worked on a real device. Walking through a user flow to understand it before writing code.

The gap between "AI that understands code" and "AI that understands the product" has always been the device. Drengr is my attempt to close that gap.

What's Next

I'm working on a dashboard for visualizing test runs, real-time network monitoring so the agent can correlate UI actions with API calls, and a steering system that lets you redirect the agent mid-run. The core — three tools, one binary, MCP-native — won't change. Everything else is about making that core more useful.

If you want to try it: curl -fsSL https://drengr.dev/install.sh | bash. It takes about 10 seconds. I'd genuinely appreciate feedback on what works, what doesn't, and what you'd want it to do that it can't yet.

Field Notes: How Drengr's Architecture Aligns with (and Diverges from) Current Research

Sharmin Sirajudeen — Mon, 06 Apr 2026 08:06:18 +0000

Field Notes: How Drengr's Architecture Aligns with (and Diverges from) Current Research

I've been an Android engineer for about ten years. I still remember the first time I discovered Espresso. I was genuinely thrilled — here was a framework from Google, deeply integrated with the Android SDK, that could simulate real user behavior and verify UI state. I dove in headfirst. Wrote hundreds of tests. Felt like I was doing engineering the right way.

Then reality set in. Tests that passed locally failed on CI because of animation timing. Tests that worked on a Pixel broke on a Samsung because of slightly different view hierarchies. A designer moved a button into a BottomSheet and forty tests turned red overnight — none of them testing anything related to that button. I spent more time maintaining the test suite than it saved me in bug prevention. And this was Google's own tool, built by one of the most capable engineering organizations on the planet. If Espresso was the best we had, the problem wasn't implementation — it was the entire approach.

I moved through Appium, UIAutomator, tried Maestro when it came out. Each one was a refinement of the same fundamental idea: match elements by ID or XPath, perform actions, assert state. And each one broke the same way — the moment the UI evolved, the tests fossilized. I've sat in sprint retrospectives where someone says "the UI tests are red again" and everyone nods like it's weather. And the part that quietly frustrated me most: the decision-makers above me — experienced, respected leaders who'd built careers on shipping great mobile products — had accepted these tools as the ceiling. Not out of laziness, but out of familiarity. When every conference talk, every "best practices" blog post, and every Google I/O session tells you Espresso is the answer, questioning it feels like questioning gravity. So the test suites stayed brittle, the teams stayed frustrated, and the leadership stayed confident they were using the best tools available. After a while, you start to wonder whether brittle UI tests are almost as good as not having tests at all.

That frustration is where Drengr started. Not from a paper. Not from a hackathon. From years of watching test suites rot faster than we could maintain them, and a quiet conviction that AI could do something fundamentally better — tests that understand what they're looking at instead of matching on fragile element IDs.

I started prototyping in late 2024. A simple idea: what if an AI agent could look at a screen, understand what it sees, and interact with the app the way a human would? No hardcoded selectors. No XPath expressions that shatter on the next release. Just "navigate to the settings page and verify the toggle works." If the UI changes, the agent adapts. Self-evolving tests.

Drengr is still early. I'm still figuring things out, still iterating, still learning what works and what doesn't. But recently I took some time to look at what the academic research community has been publishing — and I was surprised to find that researchers at Google, Meta, Microsoft, Tencent, and Princeton have been circling the same problems from different angles. Some of their solutions look like mine. Some are fundamentally different. A few of their insights are already changing how I think about what I'm building.

This post is my attempt to map the territory honestly — where Drengr's early architecture converges with published research, where it diverges, and what I've learned from reading the papers after building the first version of the system.

The Observe-Act Loop: Independent Convergence

Drengr's core architecture is three MCP tools: drengr_look (observe the screen), drengr_do (execute an action), and drengr_query (read structured data). An AI agent calls these in a loop — look at the screen, decide what to do, do it, look again.

In late 2023, Zhang et al. at Tencent published AppAgent: Multimodal Agents as Smartphone Users (arXiv:2312.13771). Their system does the same thing — observe the screen, decide, act — but as a Python agent framework. What struck me was their screen annotation approach: they number interactive elements on the screenshot so the LLM can reference them by ID. I'd independently arrived at the same design for Drengr's element numbering system. When two teams solve the same problem the same way without talking to each other, it usually means the solution is natural to the problem space.

A month later, Wang et al. published Mobile-Agent (arXiv:2401.16158), taking a purely vision-centric approach — no XML dumps, no accessibility tree, just screenshots plus detection and OCR models. Their finding that you don't need system metadata to navigate apps effectively was an important validation. Drengr deliberately uses both screenshots and the accessibility tree — the tree is faster to parse, costs almost nothing in tokens, and gives precise element bounds that vision models still struggle with. But Mobile-Agent's results are a useful signal: as vision models improve, the tree may become optional, and Drengr's architecture is designed to make that transition seamless when the time is right.

The key difference between Drengr and these systems: they're agent frameworks. Drengr is infrastructure. AppAgent and Mobile-Agent are Python applications that contain both the perception logic and the decision-making. Drengr separates these entirely — it handles perception and action, and delegates all decision-making to whatever LLM is on the other end of the MCP connection. This is a fundamentally different deployment model, and it's what lets Drengr work with Claude Desktop, Cursor, Windsurf, or any other MCP client without modification.

The OODA Loop: Military Theory Meets AI Agents

When I implemented drengr run — the autonomous agent mode — I structured it as an OODA loop: Observe (capture screen), Orient (situation engine analyzes what changed), Decide (LLM picks an action), Act (execute it). I chose OODA because it maps cleanly to the problem. The alternatives — simple while loops, state machines, behavior trees — all felt either too rigid or too unstructured.

I was genuinely surprised to find that Schneier and Raghavan published Agentic AI's OODA Loop Problem in IEEE Security & Privacy in 2025, analyzing exactly this pattern from a security perspective. Their key insight is that every stage of the OODA loop is a distinct attack surface. Prompt injection corrupts the Observe phase. Data poisoning corrupts Orient. Probabilistic decision-making without output verification corrupts Act. They specifically mention MCP and tool-calling systems as creating compounded vulnerabilities.

Reading this paper directly influenced Drengr's security model. The drengr_look observation phase cross-references the visual screenshot against the accessibility tree — if the two disagree (an element is visible but not in the tree, or vice versa), that inconsistency is surfaced in the situation report. It's not full tamper-evidence yet, but the dual-source design gives Drengr a foundation that purely vision-based systems don't have. Schneier and Raghavan's framing helped me see that as a security property, not just an implementation detail.

More recently, Yasuno published RAPTOR-AI for Disaster OODA Loop (arXiv:2602.00030) in early 2026, applying the OODA pattern to disaster response with entropy-aware strategy selection. The concept of adjusting confidence thresholds based on situational entropy maps directly to what Drengr's situation engine does — detecting when the screen hasn't changed (stuck detection), when the app has crashed, or when the agent is in an unfamiliar state.

BFS App Exploration: An Old Idea, Reimagined

Drengr's drengr explore mode does BFS traversal of an app — systematically tapping every interactive element, recording the resulting screens, and building a navigation graph. I built this because I needed a way to map unfamiliar apps before writing test suites for them.

The academic lineage here goes back to DroidBot by Li et al. (IEEE/ACM ICSE-C 2017), which built state transition models from live UI interactions. DroidBot used hard-coded heuristics to decide what to tap next. Drengr replaces those heuristics with an LLM decision layer — the agent can reason about whether a button is likely to navigate somewhere useful or just dismiss a dialog.

Wen et al. at Microsoft Research took this further with AutoDroid (ACM MobiCom 2024), combining LLM-driven exploration with a reusable knowledge graph. Their publication at MobiCom — a top-tier systems conference — establishes this as a recognized systems contribution, not just an ML exercise. Drengr's approach is architecturally simpler — a single Rust binary versus a Python/LLM stack — but the core insight is the same: BFS exploration is dramatically more effective when guided by a language model than by heuristics.

ReAct and Tool Use: The Conceptual Foundation

Two papers form the conceptual bedrock of what Drengr enables, even though I hadn't read either when I started building.

Yao et al. at Princeton published ReAct: Synergizing Reasoning and Acting in Language Models (ICLR 2023, arXiv:2210.03629). ReAct interleaves chain-of-thought reasoning with executable actions — the model reasons about what to do, issues an action, observes the result, reasons again. Every time Claude calls drengr_look, reasons about what to tap, then calls drengr_do, it's executing a ReAct loop. Drengr is, architecturally, a ReAct-compatible tool suite for mobile devices.

Schick et al. at Meta published Toolformer (NeurIPS 2023, arXiv:2302.04761), demonstrating that LLMs can learn when and how to call external tools. Toolformer's tools were information retrieval APIs — calculators, search engines, QA systems. Drengr's tools have physical side effects. When drengr_do taps a button, a real device changes state. That distinction matters — the consequences of a wrong action are much more significant than a wrong search query.

Screen Understanding: Where the Field Is Heading

Two papers from Google Research point to where Drengr's perception layer might evolve.

ScreenAI (Baechler et al., IJCAI 2024, arXiv:2402.04615) is a 4.6B-parameter vision-language model fine-tuned specifically for UI understanding. It identifies UI elements — buttons, text fields, images — at the pixel level from raw screenshots. Currently, Drengr uses the Android accessibility tree alongside screenshots for element identification. ScreenAI suggests that the screenshot alone might eventually be sufficient, which would make Drengr's perception layer identical across Android, iOS, and any other platform with a display.

Spotlight (Li and Li, arXiv:2209.14927, 2023) goes even further — a vision-only model for mobile UI understanding that outperforms methods using both screenshots and view hierarchies. This directly challenges Drengr's current design of using the accessibility tree as a primary data source. If vision-only models can outperform metadata-enhanced models, then Drengr's drengr_query tool (which reads the UI tree) might eventually become redundant — replaced by richer visual understanding from the LLM itself.

For now, the accessibility tree remains the right default — it's reliable, fast, and doesn't require a specialized vision model. But Drengr's perception layer is designed as a swappable trait, so when vision-only models reach the point where they consistently outperform metadata-enhanced approaches across device types and screen densities, the switch is an implementation change, not an architectural one.

The Survey: Situating Drengr in the Field

Wang et al. published GUI Agents with Foundation Models: A Comprehensive Survey (arXiv:2411.04890, 2024) — a systematic review of 100+ papers on LLM-based GUI agents across web, desktop, and mobile. Reading this survey was like looking at a map after you've already hiked the trail. I recognized the landmarks.

Drengr's three-tool architecture fits cleanly into the survey's taxonomy of perception-grounding-action pipelines. What the survey made clear is that most systems in this space are tightly coupled — the perception, grounding, and action components are part of the same codebase, usually Python. Drengr's contribution is decoupling these: it handles perception and action, and lets any MCP-compatible LLM handle grounding and reasoning. This is a systems architecture choice, not an ML innovation — but it's one that the survey suggests is underexplored.

Where Drengr Diverges

After reading all of this, here's what I think Drengr is doing differently — or at least trying to:

Infrastructure, not framework. Almost every paper describes an end-to-end agent. Drengr is deliberately not an agent — it's the hands and eyes that agents use. This separation came from ten years of watching tightly-coupled testing tools become unmaintainable. It's a pattern I saw repeated across every mobile organization I worked in — teams with brilliant leadership, seasoned VPs, directors who'd shipped apps to hundreds of millions of users — and yet the testing infrastructure always calcified the same way. The tooling forced coupling, but I also think there was a deeper issue — the mental model at the top often stopped at "we need more test coverage" without questioning whether the testing paradigm itself was the bottleneck. When you've shipped successful apps for years using a certain approach, it takes a particular kind of intellectual honesty to ask whether that approach has a ceiling. Most organizations optimized within the paradigm rather than questioning it. When your test framework is also your test runner is also your assertion library is also your device manager, everything breaks together. I think the industry internalized that pain as normal. Drengr's hypothesis is that it doesn't have to be. Separate the perception and action layer from the intelligence layer, and each can evolve independently. The agent will change. The tools should remain.
Rust, not Python. Every system cited above is Python. Drengr is a single static Rust binary. As an Android engineer, I know what it's like to ask a team to install a tool with twelve dependencies. I wanted curl | bash and done. That choice has trade-offs — I wrote about them in a separate post.
MCP first — by design. People ask why I released Drengr as an MCP server before building a standalone CLI agent. The answer comes from watching this industry long enough to know what survives and what doesn't. AI models improve every few months. The agent that's state-of-the-art today will be obsolete by next year. But the ability to observe a screen, tap a button, and read a UI tree? That's stable. That's the invariant. By releasing the tool layer first — as an MCP server that any AI client can consume — I'm building on the part that lasts. Claude Desktop uses it today. Cursor uses it today. Whatever comes next year will use it too, because the interface is standardized. If I'd built a monolithic agent instead, I'd be rewriting it every time a better model dropped. The Model Context Protocol didn't exist when most of these papers were written. Drengr's bet is that a standard protocol between AI agents and tools is more valuable than another custom agent framework. I might be wrong. But ten years of watching tightly-coupled tools age badly makes me think this bet is right.
Born from the field, not the lab. This isn't a research project with a team, compute budget, and publication timeline. It's one Android engineer who got tired of writing tests that broke every sprint and decided to try a different approach. The architecture reflects that — pragmatic, incremental, shaped by what I actually needed rather than what's theoretically optimal.

What I Learned

Reading these papers after building the first version taught me something I didn't expect: the problems I was solving alone — in my apartment, after work, on weekends — are the same problems that well-funded research teams at Google and Microsoft are working on. That's both humbling and encouraging.

The convergence gives me confidence that I'm not building something crazy. The divergence — particularly Drengr's choice to be infrastructure rather than an agent, and to use the accessibility tree alongside vision rather than vision alone — reflects deliberate trade-offs, not gaps. Where the academic work explores what's theoretically optimal, Drengr is built around what's practically reliable today while keeping the architecture open to what's coming.

I'm not an academic. I don't have a lab or a publication record. I'm an Android engineer with a decade of scar tissue from brittle test suites, building a tool shaped by what I actually needed in the field. The researchers cited here are formalizing the theory behind problems I've been solving through iteration and observation. We're approaching the same territory from different directions — and I think both directions produce insights the other can't.

If you're working in this area — whether you're writing papers or building tools or just frustrated with your own test suite — I'd love to hear from you. This space is wide open, and I think we're all just getting started.

References

Zhang et al. "AppAgent: Multimodal Agents as Smartphone Users." arXiv:2312.13771, 2023.
Wang et al. "Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception." arXiv:2401.16158, 2024.
Wen et al. "AutoDroid: LLM-powered Task Automation in Android." ACM MobiCom, 2024.
Schneier & Raghavan. "Agentic AI's OODA Loop Problem." IEEE Security & Privacy, 2025.
Yasuno. "RAPTOR-AI for Disaster OODA Loop." arXiv:2602.00030, 2026.
Li et al. "DroidBot: A Lightweight UI-Guided Test Input Generator for Android." IEEE/ACM ICSE-C, 2017.
Wang et al. "GUI Agents with Foundation Models: A Comprehensive Survey." arXiv:2411.04890, 2024.
Schick et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." NeurIPS, 2023.
Yao et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR, 2023.
Baechler et al. "ScreenAI: A Vision-Language Model for UI and Infographics Understanding." IJCAI, 2024.
Li & Li. "Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus." arXiv:2209.14927, 2023.

What Happens When You Let AI Test Your App for a Week

Sharmin Sirajudeen — Mon, 06 Apr 2026 07:51:06 +0000

What Happens When You Let AI Test Your App for a Week

AI mobile testing is either the future of QA or an expensive way to generate false bug reports, depending on who you ask. I decided to find out for myself. I pointed Drengr's OODA-loop agent at three different apps — a calculator, a weather app, and a social media client — and let it run autonomously for a week. Here's what happened, including the parts that didn't work.

This wasn't a controlled experiment in any scientific sense. The sample size is tiny, the apps are specific, and the results may not generalize. I'm sharing this as a data point, not a proof. Autonomous mobile testing is genuinely new territory and I think honest reporting matters more than impressive claims.

The Setup

Each app got the same treatment:

10 autonomous exploration runs per day, each with a different high-level goal prompt
Each run capped at 50 actions to limit token costs
Goals ranged from specific ("calculate 15% tip on $47.50") to open-ended ("explore the app and report anything that seems broken")
Claude Sonnet 4 as the decision-making model, chosen for the balance of capability and cost
Android emulator, Pixel 7 image, API 34

Total over the week: 210 runs across the three apps, approximately 2.1 million tokens consumed, about $14 in API costs.

The Prompts

I learned quickly that prompt design matters enormously. "Test the calculator" produced aimless tapping. "Verify that the calculator handles edge cases in arithmetic operations, including negative numbers, decimal precision, division by zero, and very large numbers" produced useful, targeted exploration.

The sweet spot was specific enough to guide the agent but open enough to let it discover things I hadn't anticipated.

What It Found

App 1: Calculator — The Negative Number Bug

The calculator app was a personal project, something I'd built and considered "done" for months. The agent found a bug on the second day that I'd never noticed: entering a negative number, then pressing the percent button, then pressing equals produced NaN instead of a numeric result.

I'd never tested that sequence manually. Why would I? Negative percent of a number isn't a common operation. But the agent, exploring combinations I wouldn't think to try, stumbled into it. The underlying issue was a missing absolute value check in the percentage calculation path.

That alone made the experiment worthwhile for me. It's a trivial bug, but it had shipped. A real user could have hit it.

App 2: Weather App — The Broken Deep Link

The weather app supported deep links for sharing forecast URLs. The agent, when given the goal "navigate to the settings page using every available path," discovered that the deep link weather://settings/notifications crashed the app. The crash was caught by Drengr's logcat monitoring before the agent even had to report it — the situation engine flagged a fatal exception.

The root cause was a missing null check on a fragment argument. The deep link handler assumed a bundle parameter would always be present, but the notifications settings fragment expected it to be passed by the parent activity, not by a deep link.

App 3: Social Media Client — The Accessibility Issue

This was the most interesting finding. The social media client had several icon buttons — like, share, bookmark — that had no content descriptions. The agent reported them as "unlabeled interactive elements" because the UI hierarchy showed clickable views with no text and no accessibility labels.

The agent wasn't doing accessibility testing on purpose. It was trying to describe what it saw, and it couldn't identify those buttons. The same problem that confused the AI would confuse a screen reader. Inaccessible UI is ambiguous UI, and ambiguity hurts both automated agents and human users who rely on assistive technology.

What It Missed

Equally important is what the agent did not catch.

A timing-sensitive race condition. The weather app had a bug where rapidly switching between cities while forecasts were loading could display the wrong city's data. This required specific timing — switching during the 200-400ms window between the API response arriving and the UI updating. The agent's action cycle was too slow (3-5 seconds between actions) to ever trigger this window.

Visual alignment issues. The social media client had a layout bug where long usernames caused text to overlap with the timestamp on certain screen widths. The UI hierarchy reported correct element bounds — the overlap was a rendering issue, not a layout issue. The elements were "correctly positioned" according to the layout engine but visually overlapping. The agent, which relies on the UI tree more than pixel-level analysis, didn't notice.

Subtle UX problems. The calculator's history feature was confusing — it showed results in reverse chronological order with no clear timestamps, and old entries looked identical to new ones. A human tester would flag this as a usability issue. The agent, which has no concept of "confusing," saw a functioning list and moved on.

False Positives

The agent reported 23 "issues" across the week. After manual review, 14 were genuine findings and 9 were false positives. That's a 39% false positive rate — high enough to require human review of every report.

The most common false positive: interpreting slow loads as crashes. The agent would tap a button, wait for the screen to change, and if nothing happened within its patience window (about 8 seconds), report a failure. Several of these were just slow network responses on the emulator.

The second most common: misinterpreting intentional UI states as errors. A dismissed bottom sheet was reported as "content disappeared unexpectedly." An empty search results page was reported as "app failed to load content." These are correct observations — the content did disappear, the page is empty — but the agent's interpretation was wrong.

Cost Analysis

Across 210 runs:

Total tokens: ~2.1 million (input + output)
Total API cost: ~$14
Average per run: ~10,000 tokens, ~$0.07
Average run time: 3-4 minutes
Human review time: ~5 hours total to evaluate all reports

For comparison, manual QA testing of those three apps at a similar depth would have taken me roughly 15-20 hours. The AI testing took about 5 hours of my time (setup, prompt design, and report review) plus $14 in API costs.

That's a meaningful efficiency gain, but it's not zero-effort. The human is still in the loop, reviewing reports and separating signal from noise.

My Honest Take

AI QA testing is not a replacement for human QA. It finds different kinds of bugs through different kinds of exploration. A human tester applies domain knowledge, aesthetic judgment, and intuition about what "feels wrong." An AI agent applies exhaustive combinatorial exploration, patience for repetitive tasks, and zero assumptions about how the app "should" work.

The most valuable bugs the agent found were the ones I'd never have thought to test for. The most valuable bugs it missed were the ones that required human judgment to even recognize as bugs.

The two approaches are complementary. The agent explores the spaces I wouldn't think to explore. I evaluate the findings with context the agent doesn't have. Together, that coverage is better than either alone.

I plan to keep running these experiments with more apps and more sophisticated prompting strategies. The 39% false positive rate is the number I most want to bring down — that's where the agent goes from "interesting research tool" to "practical QA assistant."

Why Not Python? The Language Everyone Expected Me to Use for Drengr

Sharmin Sirajudeen — Mon, 06 Apr 2026 07:50:59 +0000

Why Not Python? The Language Everyone Expected Me to Use for Drengr

When I started building Drengr — a tool that gives AI agents eyes and hands on mobile devices — the default choice was obvious. Every AI agent project in 2025-2026 is Python. LangChain is Python. CrewAI is Python. AutoGen is Python. Most MCP server implementations are Python. The ecosystem, the tutorials, the community, the hiring market — all Python.

I chose Rust instead. This is the honest explanation of why, what it cost me, and whether I'd make the same choice again.

The Distribution Problem

The single biggest reason I didn't use Python is distribution. Drengr is a developer tool that other people need to install on machines I'll never see. The install experience is the first impression. And with Python, that first impression is often painful.

Consider what a Python-based Drengr install looks like:

pip install drengr
# ERROR: requires Python 3.11+, you have 3.9
# or: conflicts with existing package versions
# or: needs a virtual environment
# or: pip install fails because of a C extension dependency

Every Android engineer has been on the receiving end of this. You follow the install instructions for some Python-based tooling — a test runner, a code generator, a device farm client — and you're greeted with a ModuleNotFoundError or a version conflict with something else in your environment. I've lost count of the hours I've spent debugging other people's dependency trees instead of doing my actual work.

The Rust alternative:

curl -fsSL https://drengr.dev/install.sh | bash
drengr doctor

One binary. No runtime. No dependencies. No virtual environment. No version conflicts. It either works or it doesn't, and if it doesn't, it's a bug I can actually reproduce and fix — because the binary is the same on every machine.

This isn't a theoretical concern. Drengr interacts with ADB, simctl, and Appium — tools that already have their own dependency and version requirements. Adding Python's dependency management on top of that would create a combinatorial explosion of "works on my machine" problems.

Cold Start Matters for MCP

Drengr runs as an MCP server. When Claude Desktop or Cursor connects to it, the server needs to start and respond to the first tool call. The user is waiting. The AI agent is waiting. Every millisecond of startup time is friction.

Drengr's cold start to first MCP response: ~15ms.

A Python MCP server with typical imports (json, asyncio, an HTTP client, a CLI framework) starts in 200-500ms. Add heavier libraries — image processing, XML parsing, the Anthropic SDK — and you're looking at 1-2 seconds.

For a one-off script, nobody cares. For a tool that an AI agent might start and stop multiple times during a session, or that needs to respond to tool calls in real-time during an autonomous OODA loop, the difference is significant. The agent's thinking time is already the bottleneck — the tool layer shouldn't add to it.

Memory and Reliability

Drengr manages long-running device sessions. An autonomous test run might interact with a device for 30 minutes or more, capturing hundreds of screenshots, parsing hundreds of UI trees, maintaining situation engine state. This is the kind of workload where Python's memory management gets interesting.

Python's garbage collector is good enough for most applications. But "good enough" means occasional GC pauses. It means memory growing over time as objects are allocated and collected. It means that a screenshot buffer you thought was freed is actually being held by a reference cycle until the GC gets around to collecting it.

Rust's ownership model means memory is freed deterministically — at the exact point where the owning variable goes out of scope. No GC pauses. No reference cycles. No "why is my process using 2GB after running for an hour?" investigations. Drengr's memory usage is flat and predictable regardless of session length.

What Python Would Have Given Me

I want to be fair. Python would have given me real advantages.

Prototyping speed. The first working version of Drengr took me about three weeks in Rust. In Python, I estimate it would have taken one week. The borrow checker adds friction during exploration — when I'm trying three different approaches to screen parsing, Rust demands I think through ownership at each step. Python lets me hack first and clean up later.

The AI ecosystem. When I needed to add LLM integration for the OODA loop, the Python path was obvious: pip install anthropic, call the API, get structured responses. In Rust, I made raw HTTP calls to the API and wrote my own response parsing. It works fine, but it was more work than it needed to be.

Community and contributors. More developers know Python than Rust. If Drengr were Python, more people could read the code, understand it, and potentially contribute. Rust's learning curve is a barrier to contribution.

Faster iteration on AI-adjacent features. Some of Drengr's planned features — smarter situation analysis, better stuck detection, screen diffing — would benefit from rapid experimentation. Python is better for that kind of exploratory work.

What Python Would Have Cost Me

But the costs are real too.

Every user becomes a debugger. With a Python tool, a meaningful percentage of support interactions would be "it doesn't install on my machine" or "it crashes with this import error." I've been that user enough times with other people's Python tools to know exactly how it goes. The first time someone opens an issue about a dependency conflict, I'd spend a day I could have spent building features.

The packaging problem is unsolved. PyInstaller, Nuitka, cx_Freeze, Briefcase — Python has many tools for creating standalone executables, and all of them have sharp edges. Platform-specific behavior, missing dependencies at runtime, binary size inflation. Rust's cargo build --release produces a binary that just works.

Concurrency complexity. Drengr's MCP server, SDK event listener, and OODA loop can all be active simultaneously, sharing state about the current device session. In Python, this means threading (with the GIL), multiprocessing (with serialization overhead), or asyncio (with the colored function problem). In Rust, the type system enforces safe concurrency. The compiler catches data races. I don't have to choose between correctness and performance.

Process weight. A Python process carries the interpreter, the standard library, and all imported modules. Drengr running as a Rust binary uses about 8MB of resident memory. An equivalent Python process would use 40-80MB. For a background tool running alongside an IDE, a browser, and whatever else the developer has open, this matters.

The Decision Framework

Here's how I think about it now, after having shipped Drengr in Rust:

Use Python when: You're building an AI application where the AI logic is the product. When you need rapid experimentation with LLM APIs, prompt engineering, and agent orchestration. When your users are data scientists or ML engineers who already have Python installed and configured. When distribution is pip or Docker.

Use Rust when: You're building infrastructure that AI applications consume. When the tool needs to install in one command on any machine. When cold start time matters. When the tool runs for long periods and memory predictability matters. When you're a solo developer and can't afford to spend time debugging environment issues on machines you've never seen.

Drengr is infrastructure, not an application. It doesn't contain AI logic — it provides tools that AI logic consumes. That distinction made Rust the right choice.

Would I Do It Again?

Yes. Without hesitation.

The three weeks of additional development time cost me once. The zero-friction install experience pays off every time someone tries Drengr. The 15ms cold start pays off on every MCP tool call. The predictable memory usage pays off on every long-running test session.

Python is the right language for AI applications. Rust is the right language for AI infrastructure. Drengr is infrastructure.

If you want to read about why I chose Rust over C and C++ — the other systems languages I seriously considered — I've written about that here.