The Challenge
Most GUI Agent demos show the same thing: open a browser, fill out a form, click "Submit." It works, it's useful, but it doesn't really stress-test what these agents can do.
We wanted to find out: what happens when you throw a GUI Agent into a completely unfamiliar, non-standard interface?
So we picked Mahjong — a Chinese tile game with complex rules, dense visual information, and a UI that has nothing in common with a typical web app.
Here's Mano-P playing (raw video):
https://github.com/user-attachments/assets/397a0552-9611-4d74-9f24-99544da272b6
What is Mano-P?
Mano-P (GUI-VLA Agent for Edge Devices) is an open-source project from Mininglamp Technology. The name comes from "Mano" (Spanish for "hand") and "P" for Person + Party.
The key differentiator: Mano-P is purely vision-driven. It doesn't parse DOM trees, doesn't use accessibility APIs, doesn't rely on OCR as a preprocessing step. It takes a screenshot, understands what's on screen, and outputs mouse/keyboard actions.
Think of it as an AI that operates a computer the same way you do — by looking at the screen.
- 🔗 Repo: github.com/Mininglamp-AI/Mano-P
- 📄 License: Apache 2.0
Why Mahjong is a Brutal Test Case
If you've played Mahjong, you know it's no joke. But even if you haven't, here's why it's an excellent stress test for a GUI Agent:
1. Dense, Visually Similar Elements
A full Mahjong set has 136 tiles, and your hand holds 13 at a time (14 after drawing). The tiles are small, visually similar (slight variations in dots, characters, and bamboo patterns), and tightly packed. The agent needs pixel-level precision to distinguish a "3 of Dots" from a "5 of Dots."
2. Zero Structured Data
There's no HTML, no DOM, no accessibility tree. The game UI is rendered by a game engine — it's all pixels. This means any approach that relies on parsing page structure is out. Only pure vision works.
3. Strategic Reasoning Required
This isn't "see button, click button." The agent needs to:
- Recognize all tiles in its hand
- Evaluate possible winning combinations
- Decide which tile to discard
- React to other players' moves (pass, claim, or declare)
4. Asynchronous Multi-Player Flow
Mahjong is turn-based with 4 players. The agent has to wait for others, recognize when its turn arrives, handle variable timing, and respond to unexpected events (another player declaring a win, for instance).
How It Works: Think-Act-Verify
Mano-P doesn't just look once and act. It runs a continuous reasoning loop:
┌─────────┐      ┌─────────┐      ┌─────────┐
│  Think  │ ──▶  │   Act   │ ──▶  │ Verify  │
│ (analyze│      │ (execute│      │ (confirm│
│ screen) │      │ action) │      │ result) │
└─────────┘      └─────────┘      └─────────┘
     ▲                                 │
     └────────── loop back ◀ ──────────┘
- Think: Capture a screenshot, analyze the current game state. What tiles do I have? What's on the table? Is it my turn?
- Act: Decide and execute an action — click a tile to discard, click "Pass," click "Claim."
- Verify: Take another screenshot. Did my action register? Did the game state change as expected? If not, go back to Think.
This loop is critical for games, where animations, delays, and other players' actions create a constantly shifting interface.
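The loop above can be sketched in a few lines. Note that `model`, `capture_screen`, and `execute` are hypothetical stand-ins for the agent's VLM, screenshot capture, and input injection; this is a sketch of the pattern, not Mano-P's actual API.

```python
def think_act_verify(model, capture_screen, execute, max_steps=100):
    """Minimal sketch of a Think-Act-Verify loop.
    All three collaborators are illustrative interfaces, not real APIs.
    """
    executed = []
    for _ in range(max_steps):
        before = capture_screen()        # Think: look at the screen
        action = model.decide(before)
        if action is None:               # nothing to do, e.g. not our turn
            break
        execute(action)                  # Act: click / type
        after = capture_screen()         # Verify: did the state change?
        if model.changed(before, after):
            executed.append(action)
        # otherwise the action didn't register; loop back to Think and retry
    return executed
```

The Verify step is what distinguishes this from fire-and-forget automation: if an animation swallowed the click, the agent notices and re-plans instead of drifting out of sync with the game.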
Training Pipeline
Mano-P uses a three-stage training approach:
| Stage | Method | Purpose |
|---|---|---|
| 1 | SFT (Supervised Fine-Tuning) | Learn basic GUI recognition and operation |
| 2 | Offline RL (Reinforcement Learning) | Optimize action policies from recorded trajectories |
| 3 | Online RL | Interactive learning in real environments |
This progression moves from "imitate human actions" to "discover optimal strategies through exploration" — a pattern that's proven effective across many RL domains.
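Stage 2 is worth a closer look. Here's a minimal sketch of the idea behind learning from recorded trajectories, assuming they come as `(state, action, reward)` tuples; advantage-weighted imitation stands in here for whatever offline RL objective Mano-P actually uses.

```python
import math
from collections import defaultdict

def offline_rl_policy(trajectories, temperature=1.0):
    """Toy sketch of stage 2: re-weight recorded actions by how much
    better than average they scored (advantage-weighted imitation).
    `trajectories` is a list of (state, action, reward) tuples;
    all names and the objective itself are illustrative assumptions.
    """
    # Baseline: mean reward observed in each state.
    returns = defaultdict(list)
    for s, a, r in trajectories:
        returns[s].append(r)
    baseline = {s: sum(rs) / len(rs) for s, rs in returns.items()}

    # Weight each recorded action by exp(advantage / temperature).
    weights = defaultdict(float)
    for s, a, r in trajectories:
        weights[(s, a)] += math.exp((r - baseline[s]) / temperature)

    # Greedy policy: pick the highest-weighted recorded action per state.
    policy = {}
    for (s, a), w in weights.items():
        if s not in policy or w > weights[(s, policy[s])]:
            policy[s] = a
    return policy
```

The key property: unlike stage 1, which copies every demonstrated action equally, this up-weights actions that led to better-than-average outcomes, without ever touching a live environment. Stage 3 then adds exploration on top.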
Benchmark Results
Numbers matter. Here's where Mano-P stands:
OSWorld (Desktop App Automation)
| Model | Score |
|---|---|
| Mano-P 72B | 58.2% (Rank #1 among specialized models) |
| opencua-72b | 45.0% |
WebRetriever Protocol I (Web Interaction)
| Model | Score |
|---|---|
| Mano-P | 41.7 |
| Gemini 2.5 Pro | 40.9 |
| Claude 4.5 | 31.3 |
Edge Inference (4B Quantized, w4a16)
Running on Apple M4 + 32GB RAM:
| Metric | Value |
|---|---|
| Prefill throughput | 476 tok/s |
| Decode throughput | 76 tok/s |
| Peak memory | 4.3 GB |
That's fast enough for real-time GUI interaction on a local device. No cloud API calls, no data leaving your machine.
Hardware note: The 4B model currently requires Apple M4 + 32GB RAM. Not all Macs can run it — be aware of this before trying.
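To put those throughput numbers in perspective, here's a back-of-envelope per-action latency estimate. The prompt and output token counts are illustrative assumptions, not measured values from the project.

```python
# Throughput figures from the table above (Apple M4, 4B w4a16).
PREFILL_TPS = 476  # tokens/s while ingesting the prompt
DECODE_TPS = 76    # tokens/s while generating output

def action_latency(prompt_tokens, output_tokens):
    """Rough per-action latency: prefill time plus decode time."""
    return prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS

# Illustrative sizes: a ~2,000-token screenshot prompt, ~50-token action plan.
print(f"{action_latency(2000, 50):.1f} s per action")  # → about 4.9 s
```

A few seconds per decision is comfortably inside the rhythm of a turn-based game like Mahjong, which is what "real-time GUI interaction" means in practice here.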
What This Actually Means
The Mahjong demo is fun, but the real takeaway is about generalization.
Most GUI automation tools are brittle. Traditional RPA breaks when a button moves. DOM-based agents break when there's no DOM. Screen-scraping breaks when the UI updates.
A purely vision-driven agent doesn't have these dependencies. If a human can operate the application by looking at the screen, Mano-P can too — at least in principle. The Mahjong demo shows this isn't just theory:
- ✅ Non-standard UI? Handled.
- ✅ Visually dense interface? Handled.
- ✅ Strategic reasoning? Handled.
- ✅ Async multi-player flow? Handled.
The same architecture that plays Mahjong can automate legacy enterprise systems, operate desktop applications, or handle any GUI that doesn't expose a programmatic interface.
Open Source Roadmap
Mano-P is released under Apache 2.0 with a three-phase open-source plan:
| Phase | Content | Status |
|---|---|---|
| Phase 1 | Skills (core capabilities) | ✅ Released |
| Phase 2 | Local models + SDK | Coming soon |
| Phase 3 | Training methodology | Planned |
Try It Out
The project is live on GitHub. Whether you're interested in GUI automation, VLA research, or just want to see an AI play Mahjong, check it out:
👉 github.com/Mininglamp-AI/Mano-P
If you're working on GUI agents or have thoughts on vision-driven automation, I'd love to hear from you in the comments. What would you test a GUI Agent on?
Built by the open-source team at Mininglamp Technology. Mano-P is Apache 2.0 licensed — contributions welcome.