AI controls your phone. And it never leaves your phone.
Everyone else: Phone → Internet → Cloud API → Internet → Phone
💳 API key required. Monthly bill attached.
PokeClaw: Phone → LLM → Phone
That's it. No internet. No API key. No bill.
Every phone automation tool I found does the same thing: your phone calls a cloud API, gets instructions back, executes. There's always an internet connection in the pipeline and a credit card attached to it.
PokeClaw is the only one where the entire pipeline is a closed loop inside your phone. The LLM reads the screen, decides what to tap, and executes. Nothing leaves the device.
That didn't exist before. So I built it.
PokeClaw is, as far as I can tell, the first working app that runs Gemma 4 on-device to do fully autonomous phone control through the Android Accessibility API. Not a research demo. Not a cloud wrapper. A real APK you install and use. The model reads your screen, picks a tool, executes it, reads the new state, and loops until the task is done. The entire agentic pipeline runs on your phone's CPU.
This is new. Gemma 4's native tool calling on LiteRT-LM v0.10.0 only became available recently, and nobody had shipped a working phone automation app on top of it. PokeClaw is that app.
⭐ If this sounds interesting, star the repo first so you don't lose it. Then keep reading.
The Architecture That Makes It Work
The hard part of on-device phone control isn't the model. It's the session management.
LiteRT-LM only allows one active Conversation per Engine at a time. PokeClaw has two modes that both need the LLM: Chat (for talking) and Task (for controlling the phone). When you send a task, the app has to close the chat conversation, hand the engine to the task agent, let it run through its tool-calling loop, then reclaim the engine for chat when the task finishes. Get the handoff wrong and you either crash or OOM.
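The handoff described above can be sketched as a lock-guarded arbiter. Everything here is illustrative: `Engine`, `Conversation`, and `EngineArbiter` are stand-ins for the idea, not LiteRT-LM's or PokeClaw's actual API.

```java
import java.util.concurrent.locks.ReentrantLock;

class Conversation {
    final String mode;
    Conversation(String mode) { this.mode = mode; }
}

class Engine {
    private Conversation active; // LiteRT-LM allows one live Conversation per Engine

    Conversation open(String mode) {
        if (active != null) throw new IllegalStateException("one Conversation per Engine");
        active = new Conversation(mode);
        return active;
    }

    void close(Conversation c) { if (active == c) active = null; }
}

/** Serializes chat/task access so the handoff can never double-open a session. */
class EngineArbiter {
    private final Engine engine = new Engine();
    private final ReentrantLock lock = new ReentrantLock();
    private Conversation chat;

    void startChat() {
        lock.lock();
        try { if (chat == null) chat = engine.open("chat"); }
        finally { lock.unlock(); }
    }

    String runTask(String task) {
        lock.lock();
        try {
            if (chat != null) { engine.close(chat); chat = null; } // hand off: close chat first
            Conversation taskConv = engine.open("task");
            String result = "done: " + task;                      // tool-calling loop runs here
            engine.close(taskConv);
            chat = engine.open("chat");                           // reclaim the engine for chat
            return result;
        } finally {
            lock.unlock();
        }
    }
}
```

The point of the lock is ordering: `open("task")` throws unless chat's conversation was closed first, which is exactly the crash-or-OOM failure mode a botched handoff produces.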
The same constraint hits auto-reply. PokeClaw can monitor a contact's messages and reply automatically using the LLM. But if chat mode holds the session, auto-reply can't generate. So the auto-reply manager force-closes and recreates the engine each time it needs to respond. Ugly, but it works without leaking memory.
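The force-close-and-recreate pattern looks roughly like this. `ReplyEngine` and its methods are assumptions for the sketch, not the real LiteRT-LM surface; the point is the lifecycle, not the names.

```java
class ReplyEngine {
    private boolean alive = true;

    String generate(String prompt) {
        if (!alive) throw new IllegalStateException("engine closed");
        return "auto-reply to: " + prompt; // stand-in for actual LLM generation
    }

    void shutdown() { alive = false; }     // frees the single session slot
}

class AutoReplyManager {
    private ReplyEngine engine;

    // Tear down whatever engine exists, build a fresh one, generate, release.
    String replyTo(String incomingMessage) {
        if (engine != null) engine.shutdown(); // force-close any held session
        engine = new ReplyEngine();            // recreate: new engine, new session
        String reply = engine.generate(incomingMessage);
        engine.shutdown();                     // release so chat mode can rebuild
        engine = null;
        return reply;
    }
}
```

Wasteful, since every reply pays the engine construction cost, but each cycle ends with nothing held, which is what keeps it from leaking.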
Gemma 4 on LiteRT-LM v0.10.0 supports native tool calling. The model outputs structured tool calls directly, not text you have to parse. That's what makes the whole agentic loop clean: model receives accessibility state, returns a tool call like tap(x=540, y=1200), the app executes it, captures new state, feeds it back. No regex. No string matching. Just structured I/O between a model and a phone.
The model is a function. Input: what's on screen. Output: what to do next. The phone is the runtime.
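That loop can be written in a few lines. Here the "model" is a stub that maps screen state to a structured tool call; the `ToolCall` shape and the planner logic are illustrative assumptions, not PokeClaw's code.

```java
import java.util.ArrayList;
import java.util.List;

// Mirrors structured tool-calling output like tap(x=540, y=1200): a name plus arguments.
class ToolCall {
    final String tool; final int x; final int y;
    ToolCall(String tool, int x, int y) { this.tool = tool; this.x = x; this.y = y; }
}

class AgentLoop {
    // Stand-in for the LLM: maps the current screen state to the next tool call.
    static ToolCall decide(String screenState) {
        return screenState.contains("send_button")
                ? new ToolCall("tap", 540, 1200)
                : new ToolCall("done", 0, 0);
    }

    // Observe -> decide -> act, until the model says it is done.
    static List<String> run(String screenState) {
        List<String> executed = new ArrayList<>();
        while (true) {
            ToolCall call = decide(screenState);
            if (call.tool.equals("done")) break;
            executed.add(call.tool + "(x=" + call.x + ", y=" + call.y + ")");
            screenState = "message sent"; // executing the tap updates the screen
        }
        return executed;
    }
}
```

Because the model emits a typed `ToolCall` instead of free text, the executor never parses anything; it just dispatches on the tool name.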
How It Actually Sees Your Phone
Most phone automation tools work on screenshots. They feed pixel arrays to a vision model and hope it figures out where the button is. That's like reading a book by photographing each page and running OCR.
PokeClaw reads the Android Accessibility tree. That's the actual data structure behind your screen: real element IDs, text content, scroll positions, clickable states. The model doesn't guess where "Send" is by looking at pixels. It knows exactly which element is the send button, what its coordinates are, and whether it's enabled.
The model gets this tree as context, plus a set of tools: tap, swipe, type, open_app, send_message, auto_reply. It picks one, the phone executes it, the tree updates, and the model picks the next action. Same agentic loop as AI coding tools, except the environment is your phone.
It's not looking at your screen. It's reading the source code of your screen.
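Flattening that tree into model context might look like the sketch below. `UiNode` is a stand-in for Android's `AccessibilityNodeInfo` (real text, clickable state, coordinates); the field names and output format are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for an accessibility tree node.
class UiNode {
    String text; boolean clickable; int x, y;
    List<UiNode> children = new ArrayList<>();
    UiNode(String text, boolean clickable, int x, int y) {
        this.text = text; this.clickable = clickable; this.x = x; this.y = y;
    }
}

class TreeSerializer {
    // Depth-first flatten into the compact textual context the model reads:
    // one line per element, with clickability and exact coordinates.
    static void serialize(UiNode node, List<String> out) {
        out.add((node.clickable ? "[clickable] " : "") + node.text
                + " @(" + node.x + "," + node.y + ")");
        for (UiNode child : node.children) serialize(child, out);
    }
}
```

A "Send" button serializes to something like `[clickable] Send @(540,1200)`, so the model can emit `tap(x=540, y=1200)` without ever guessing from pixels.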
Performance
First launch downloads the model: 2.6GB, one time only. After that it's cached on your phone.
I tested on a budget phone with CPU-only inference. No GPU, no NPU. On that hardware, warmup takes about 45 seconds before the first action. That's the worst case.
If your phone has a dedicated ML accelerator, it's a different story. Phones with these chips run PokeClaw significantly faster:
- Google Tensor G3/G4 (Pixel 8, Pixel 9 series)
- Snapdragon 8 Gen 2/3 (Galaxy S24, OnePlus 12, etc.)
- Dimensity 9200/9300 (recent MediaTek flagships)
- Snapdragon 7+ Gen 2 and above (mid-range with GPU acceleration)
On these devices, warmup drops to seconds, and the whole agentic loop feels smooth. The model is the same; the hardware does the heavy lifting.
Accuracy: straightforward tasks work reliably. Multi-step reasoning across apps is where a 2.3B model starts hitting its limits. It's not trying to be GPT-4. It's trying to be the first AI that controls your phone without phoning home.
The slower your phone, the longer the warmup. But it works on every arm64 Android 9+ device, and your data never leaves.
What Else Exists
A desktop app with a few hundred GitHub stars does something similar but requires a cloud LLM backend. Defeats the purpose if you care about privacy.
A couple of research projects use screenshot-based approaches with on-device models. They work in a demo. They break on real phones where notifications, overlays, and system dialogs pop up mid-task.
Google and Samsung have on-device AI baked into their hardware. Polished but closed. You can't see the tool calling logic, can't swap the model, can't add tools. If they decide tomorrow that their assistant won't open a competitor's app, you have no recourse.
I searched for months before building this. Every open source phone automation project I found either needs a cloud LLM, uses screenshot-based vision (fragile), or doesn't actually ship a working APK. PokeClaw is, to my knowledge, the first open source app that does fully local LLM phone control with native tool calling against the live accessibility tree.
If I'm wrong about that, genuinely tell me. I'd love to not be the only person maintaining this.
What's Next
Things I'm building:
- Better feedback while the model is working
- Per-app permissions so you can restrict what PokeClaw touches
- Custom tool definitions
- Smaller model variants for older phones
If you want something specific, open an issue.
PokeClaw
monitor-demo.mp4
hi-demo.mp4
Why is the "hi" demo slow? It was recorded on a budget Android phone (I'm literally too broke to buy a proper one; I got this just to demo the app, lol) with CPU-only inference: no GPU, no NPU. Running Gemma 4 E2B on pure CPU takes about 45 seconds to warm up. It started at several minutes, and we optimized the engine initialization and session handoff to squeeze it down this far. If your phone actually has a decent chip, it's way faster:
- Google Tensor G3/G4 (Pixel 8, Pixel 9)
- Snapdragon 8 Gen 2/3 (Galaxy S24, OnePlus 12)
- Dimensity 9200/9300 (recent MediaTek flagships)
- Snapdragon 7+ Gen 2+ (mid-range with GPU)
On these devices, warmup drops to seconds. Same model, better hardware.
That said, the fact that a 2.3B model can autonomously control a phone running purely on CPU is already pretty impressive. GPU just makes it faster.
The Story
I'm a solo developer. CS dropout. When Gemma 4 dropped on April 2nd with native tool calling on LiteRT-LM, I pulled two all-nighters and built this from scratch with zero Android development experience.
It's completely free. No API keys that bill you every month. No subscription. No usage limits. The model runs on your hardware and costs you nothing.
We're living through a historic shift. Local LLMs are now smart enough to actually do useful work on a phone. That wasn't true 6 months ago. As on-device models keep getting smarter, PokeClaw is ready.
This project has a lot of issues. That's expected for something built in two nights on a model that dropped days ago. But the fact that it works at all is already pretty exciting.
⭐ Star the repo if you think local AI phone control matters. Every star helps more people find it.
About Me
CS dropout. Founder of @mcpware. github.com/ithiria894
Top comments (4)
It's great to see that you're working on such an awesome idea. I'll definitely try it. But I'm not sure whether it will run on a lower-end Snapdragon Samsung Galaxy Note 10 Plus. If it does, that would be really great for me. I also have an iPhone 13.
The Note 10 Plus has 12GB of RAM, so the model will fit fine. The Snapdragon 855 doesn't have a dedicated ML accelerator, so warmup will be slower, but it should work. Try it and let me know how it goes!
iPhone 13 unfortunately not possible. Apple doesn't allow any app to read the screen or control other apps. On iOS you're stuck with Siri and there's no way around that. Hopefully Apple opens this up someday but I wouldn't count on it.
github.com/agents-io/PokeClaw
This is absolutely cool. Will you be able to do iOS?
No, iOS is a no-go. Apple doesn't let any third-party app control other apps through Accessibility. On iPhone your only option is Siri and you're stuck waiting for Apple to make it smarter. Android is the only platform where this approach is even possible.
github.com/agents-io/PokeClaw