OpenGUI is a project that lets AI operate a real Android phone.
Repository: https://github.com/Core-Mate/open-gui
OpenClaw connects AI to a desktop environment. OpenGUI brings a similar execution layer to Android. It is aimed at tasks inside mobile apps: tapping, typing, taking screenshots, reading screens, moving through flows, and returning results.
A lot of work already happens on phones: X, Reddit, Hacker News, Telegram, WeChat, Xiaohongshu, and plenty of business flows that only really exist inside apps. Web automation does not reach those surfaces.
Basic architecture
OpenGUI has two main parts: a backend and an Android client.
The backend understands the task, creates a plan, supervises execution, and summarizes the result. The Android client connects to the backend and performs GUI actions on a real device. Beyond tapping the screen, it also has to handle task state, device state, and recovery after failures.
You can see a few pieces in the repo:
- task planning, Executor Graph, review, and summarization on the backend
- AccessibilityService-based action execution on Android (see the sketch after this list)
- WebSocket connections for keeping devices online
- remote entry points through Feishu/Lark, Telegram, and REST API
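The AccessibilityService piece is what actually touches the screen. As a rough illustration of the mechanism only (class and method names here are mine, not OpenGUI's actual code), an accessibility service can inject a tap at given coordinates like this:

```kotlin
// Minimal sketch of tap injection from an AccessibilityService.
// Class and method names are illustrative, not taken from the repo.
import android.accessibilityservice.AccessibilityService
import android.accessibilityservice.GestureDescription
import android.graphics.Path
import android.view.accessibility.AccessibilityEvent

class AgentAccessibilityService : AccessibilityService() {

    // Dispatch a short tap gesture at screen coordinates (x, y).
    fun tap(x: Float, y: Float) {
        val path = Path().apply { moveTo(x, y) }
        val gesture = GestureDescription.Builder()
            .addStroke(GestureDescription.StrokeDescription(path, 0, 50))
            .build()
        dispatchGesture(gesture, null, null)
    }

    override fun onAccessibilityEvent(event: AccessibilityEvent?) {
        // Screen content changes would be observed here and reported upstream.
    }

    override fun onInterrupt() {}
}
```

The same service can also read the view hierarchy, which is how "reading screens" works without screenshots alone.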
Once it is running, the phone can stay on standby, receive a task like a remote worker, execute it, and send results back.
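The "standby" part is essentially a long-lived WebSocket connection. A minimal sketch of what the client side could look like, using OkHttp as an example library (the endpoint URL, message shape, and reconnect policy are assumptions, not the project's actual protocol):

```kotlin
// Sketch of a device staying online over WebSocket and receiving tasks.
// The URL and JSON shapes are assumptions for illustration only.
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.Response
import okhttp3.WebSocket
import okhttp3.WebSocketListener
import java.util.concurrent.TimeUnit

class DeviceLink(private val serverUrl: String) : WebSocketListener() {
    private val client = OkHttpClient.Builder()
        .pingInterval(30, TimeUnit.SECONDS) // keep the idle connection alive
        .build()

    fun connect() {
        val request = Request.Builder().url(serverUrl).build()
        client.newWebSocket(request, this)
    }

    override fun onMessage(webSocket: WebSocket, text: String) {
        // A task or step arrives as a message; hand it to the executor
        // and send the observed result back on the same socket.
        val result = executeStep(text)
        webSocket.send(result)
    }

    override fun onFailure(webSocket: WebSocket, t: Throwable, response: Response?) {
        // Reconnect after a short delay so the device comes back on standby.
        Thread.sleep(5_000)
        connect()
    }

    private fun executeStep(stepJson: String): String {
        // Placeholder: a real client would call into the AccessibilityService
        // to perform the action and then observe the resulting screen.
        return """{"status":"ok"}"""
    }
}
```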
Local setup
You need an Android development environment and a connected Android device.
Start the backend:
```bash
cd server
./start.sh
```
Start the Android client:
```bash
cd client
./start.sh
```
The backend script prepares the services, database, and API. The client script builds the APK, installs it on the connected Android device, and launches the app.
Some setup steps still need manual attention: enabling USB debugging, turning on the Accessibility Service, granting the overlay permission, and supplying model API keys or bot credentials. Keeping those steps explicit makes sense.
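The Accessibility Service and overlay permission cannot be granted programmatically, so the app has to detect the missing state and send the user to the right settings screen. A rough sketch of those checks (function names are mine, not OpenGUI's):

```kotlin
// Sketch of checking the manual, user-granted prerequisites.
import android.content.Context
import android.content.Intent
import android.net.Uri
import android.provider.Settings

// True if the "draw over other apps" (overlay) permission is granted.
fun hasOverlayPermission(context: Context): Boolean =
    Settings.canDrawOverlays(context)

// Send the user to the system screen where the overlay permission is granted.
fun requestOverlayPermission(context: Context) {
    val intent = Intent(
        Settings.ACTION_MANAGE_OVERLAY_PERMISSION,
        Uri.parse("package:${context.packageName}")
    ).addFlags(Intent.FLAG_ACTIVITY_NEW_TASK)
    context.startActivity(intent)
}

// Rough check for whether a given accessibility service is enabled:
// the system keeps a colon-separated list of enabled services.
fun isAccessibilityServiceEnabled(context: Context, serviceClassName: String): Boolean {
    val enabled = Settings.Secure.getString(
        context.contentResolver,
        Settings.Secure.ENABLED_ACCESSIBILITY_SERVICES
    ) ?: return false
    return enabled.split(':').any { it.endsWith(serviceClassName) }
}
```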
Where it gets hard
The hard part of a phone agent usually starts after the first tap.
Take a simple task:
> Open X, search for recent discussions about mobile AI agents, collect the main points, and summarize what people care about.
That sounds small, but the phone can be in many different states. The app may open on an old page. The search box may not receive focus. Results may load slowly. A login prompt, permission prompt, or follow recommendation can appear in the middle.
So a mobile agent cannot just look at a screenshot and tap once. It has to know where the task is, whether the current screen matches expectations, how to recover after a bad tap, when to retry after no visible change, and how to collect the final result.
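In code terms, every step ends up wrapped in some kind of act-observe-retry loop. A simplified sketch of that shape on the client side (names, thresholds, and the callbacks are illustrative, not taken from the project):

```kotlin
// Sketch of an act-observe-retry loop around a single step.
// `perform` does the GUI action; `screenChanged` checks whether the
// expected effect showed up. Both are placeholders for illustration.
fun runStep(
    perform: () -> Unit,
    screenChanged: () -> Boolean,
    maxRetries: Int = 3,
    settleMillis: Long = 1_500
): Boolean {
    repeat(maxRetries) {
        perform()
        Thread.sleep(settleMillis)        // let the UI settle / content load
        if (screenChanged()) return true  // expected state reached, step done
        // No visible change: try again instead of plowing on blindly.
    }
    return false // give the backend the chance to re-plan or abort
}
```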
I ran OpenGUI and also spent some time reading the source. The approach is pretty good: the backend graph manages task state and plans, the Executor Graph sends concrete steps to the phone, the Android side performs actions through AccessibilityService, and the WebSocket connection carries device state and execution results back.
This puts the phone inside the execution loop. The backend decides whether to continue, retry, or finish; the phone reports what actually happened on screen.
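What travels back in that loop is small: which step ran, whether it succeeded, and what the screen looked like afterwards. A hypothetical report payload, just to make the shape concrete (field names are mine, not the project's actual schema):

```kotlin
// Hypothetical shape of what the phone reports after each step;
// field names are illustrative, not the project's actual schema.
data class StepReport(
    val taskId: String,        // which task this step belongs to
    val stepIndex: Int,        // position in the backend's plan
    val success: Boolean,      // did the expected screen state appear
    val screenSummary: String, // text read from the current screen
    val error: String? = null  // failure detail when success is false
)
```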
This is much more practical than a plain script. The phone can stand by, execute, and report back. It starts to look like a mobile worker.
The first use cases I can imagine are community research, mobile flow testing, ops tasks, and app-only workflows that web automation cannot touch.