Mano-P: Open-Sourcing the #1 GUI Agent on OSWorld — Pure Vision, On-Device, Apache 2.0
Today we're open-sourcing Mano-P under Apache 2.0.
GitHub: https://github.com/Mininglamp-AI/Mano-P
What is Mano-P?
Mano-P is a pure-vision GUI agent. It looks at your screen — literally a screenshot — understands the UI elements, and performs actions. No Chrome DevTools Protocol. No HTML parsing. No accessibility APIs. Just vision.
The name: Mano is Spanish for "hand." Current AI agents can interact with computers, but most of them operate like a lobster's claw — functional, but clumsy. Mano-P is designed to give agents a proper hand.
The P stands for four things:
- Power — 13 multimodal benchmark SOTAs, #1 among proprietary models on OSWorld
- Private — Runs on-device, your data never leaves your machine
- Public — Apache 2.0, fully open source starting today
- Personal — A foundation for building your own personalized AI
Why Does This Matter?
There are broadly four approaches to GUI automation today:
| Approach | Examples | Limitation |
|---|---|---|
| Traditional RPA | UiPath | Coordinate-based, breaks when UI changes |
| Browser CUA | CDP-based agents | Limited to browsers |
| Cloud Computer Use | Claude CU, etc. | Requires uploading screen data to the cloud |
| Pure Vision GUI | Mano-P | Sees the screen directly, works everywhere, runs locally |
Most current GUI agents either depend on browser protocols (limiting them to Chrome) or run in the cloud (requiring you to stream your screen to a remote server). Mano-P takes a different approach: it processes raw screen captures through a vision model to understand and interact with any GUI — desktop apps, browsers, 3D tools, professional software, anything with a graphical interface.
Benchmarks
Numbers first.
OSWorld: #1 Among Proprietary Models
- Mano-P 72B: 58.2% success rate
- Second place (opencua-72b): 45.0%
- Gap: +13.2 percentage points
- Ranks 5th overall (the four models above it are 100B+ general-purpose LLMs like GPT, Claude, and Gemini)
13 Multimodal Benchmark SOTAs
Selected results:
| Benchmark | Score |
|---|---|
| ScreenSpot-V2 | 93.5 |
| MMBench | 87.5 |
| UI-Vision | 46.6 |
Web Navigation
On WebRetriever Protocol I NavEval:
- Mano-P: 41.7
- Gemini 2.5 Pro CU: 40.9
- Claude 4.5 CU: 31.3
On-Device Performance
The 4B quantized model (w4a16) running on Apple M4 Pro:
| Metric | Value |
|---|---|
| Prefill | 476 tokens/s |
| Decode | 76 tokens/s |
| Peak memory | 4.3 GB |
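To make these numbers concrete, here is a back-of-envelope latency estimate for a single agent step using the measured throughput above. The token counts (prompt size after pruning, action length) are illustrative assumptions, not figures from the post.

```python
# Rough per-step latency on Apple M4 Pro using the measured throughput.
# Token counts below are assumed for illustration, not measured values.
PREFILL_TPS = 476   # tokens/s (measured, prefill)
DECODE_TPS = 76     # tokens/s (measured, decode)

prompt_tokens = 1500   # assumed: pruned screenshot tokens + instruction
action_tokens = 40     # assumed: one structured action

prefill_s = prompt_tokens / PREFILL_TPS
decode_s = action_tokens / DECODE_TPS
total_s = prefill_s + decode_s

print(f"prefill ≈ {prefill_s:.2f}s, decode ≈ {decode_s:.2f}s, total ≈ {total_s:.2f}s")
```

Under these assumptions a full perceive-and-act step lands in the low single-digit seconds, which is what makes a local think-act-verify loop usable in practice.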
Your screenshots, your instructions, your data — all processed locally. Nothing uploaded to any cloud server.
For anyone dealing with sensitive environments (enterprise systems, personal data, financial interfaces), this isn't a nice-to-have. It's a requirement.
Key Technical Highlights
Pure-Vision GUI Understanding: No CDP, no DOM parsing. The model directly processes screen captures to identify UI elements, understand layout, and locate interaction targets.
Mano-Action Bidirectional Self-Enhancement: Inspired by cycle consistency in GANs. The training loop runs in both directions — Text→Action and Action→Text — enforcing consistency between them. This significantly improves generalization to unseen UIs.
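To make the bidirectional objective concrete, here is a toy sketch of what a cycle-consistent loss could look like. Every function and the loss weighting here are hypothetical stand-ins; Mano-P's actual training code is not released yet (see Phase 3 of the roadmap).

```python
# Hypothetical sketch of a bidirectional (cycle-consistent) objective.
# Instructions are represented as plain embedding vectors for simplicity.

def forward_loss(text, action, model_t2a):
    """Text -> Action: penalty for mispredicting the action."""
    return model_t2a(text, action)

def backward_loss(action, text, model_a2t):
    """Action -> Text: penalty for misdescribing the action."""
    return model_a2t(action, text)

def cycle_loss(text, gen_t2a, gen_a2t):
    """Text -> Action -> Text should round-trip to the original intent."""
    action = gen_t2a(text)
    reconstructed = gen_a2t(action)
    # squared distance between original and reconstructed embeddings
    return sum((a - b) ** 2 for a, b in zip(text, reconstructed))

def total_loss(text, action, models, lam=0.5):
    """Combine both directions plus a cycle-consistency term."""
    t2a, a2t, gen_t2a, gen_a2t = models
    return (forward_loss(text, action, t2a)
            + backward_loss(action, text, a2t)
            + lam * cycle_loss(text, gen_t2a, gen_a2t))
```

The point of the cycle term is that an action predicted from an instruction should still be describable as that instruction, which is the consistency pressure the post credits for better generalization to unseen UIs.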
Three-Stage Training Pipeline:
- SFT (Supervised Fine-Tuning): Learn basic screen-to-action mapping from human demonstrations
- Offline RL: Learn from collected trajectories (both successes and failures) without live interaction
- Online RL: Direct interaction with real GUI environments for continued policy improvement
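The control flow of the three stages can be sketched as below. The stage functions are trivial hypothetical stand-ins, shown only to make the ordering concrete; the real training code is slated for Phase 3 of the roadmap.

```python
# Sketch of the three-stage pipeline order. Stage bodies are
# hypothetical placeholders, not Mano-P's training implementation.

def supervised_finetune(model, demos):
    return model + ["sft"]          # imitate human demonstrations

def offline_rl(model, trajectories):
    return model + ["offline_rl"]   # learn from logged successes AND failures

def online_rl(model, env):
    return model + ["online_rl"]    # live interaction with real GUI envs

def train(model, demos, trajectories, env):
    model = supervised_finetune(model, demos)  # 1) screen -> action mapping
    model = offline_rl(model, trajectories)    # 2) no live interaction needed
    model = online_rl(model, env)              # 3) continued policy improvement
    return model

print(train([], None, None, None))
```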
GSPruning (Guided Structural Pruning): Retains only roughly 25% of visual tokens, boosting throughput 2-3× with minimal accuracy loss. This is what makes on-device inference practical.
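The core idea of score-guided token pruning can be sketched as follows. GSPruning's actual selection criterion is not yet public, so the per-token importance scores here (e.g. attention mass) are an assumption; this only illustrates keeping the top ~25% of tokens while preserving their order.

```python
# Minimal sketch of score-guided visual token pruning. The scoring
# signal is assumed, not GSPruning's actual (unreleased) criterion.

def prune_tokens(tokens, scores, keep_ratio=0.25):
    """Keep the highest-scoring fraction of tokens, preserving order."""
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

tokens = ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7"]
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.7, 0.15]
print(prune_tokens(tokens, scores))  # keeps 2 of 8 tokens -> ['t1', 't3']
```

Dropping three quarters of the visual tokens shrinks the prefill roughly proportionally, which is where the 2-3× throughput gain comes from.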
Think-Act-Verify Loop: Instead of generating a full action sequence upfront, Mano-P executes one step at a time: analyze the screen → perform an action → verify the result → plan the next step. Far more robust than one-shot planning.
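The loop structure can be sketched as below. All helper functions (screen capture, analysis, action execution, verification) are hypothetical stand-ins for what the model does internally.

```python
# Sketch of a think-act-verify loop. The helpers passed in are
# hypothetical stand-ins for the model's perception and actuation.

def run_task(goal, screenshot_fn, analyze, act, verify, max_steps=20):
    for _ in range(max_steps):
        screen = screenshot_fn()       # capture the current screen
        plan = analyze(screen, goal)   # think: decide the next action
        if plan is None:               # nothing left to do: task done
            return True
        act(plan)                      # act: click / type / scroll
        if not verify(screenshot_fn(), plan):
            continue                   # verify failed: re-analyze and retry
    return False                       # gave up after max_steps
```

Because each step re-observes the screen before planning, a mis-click or an unexpected dialog just triggers another analyze-act-verify cycle instead of derailing a precomputed plan.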
Open-Source Roadmap
We're releasing Mano-P in three stages:
| Phase | What | Status |
|---|---|---|
| Phase 1 | Mano-CUA Skill — ready to use | ✅ Today |
| Phase 2 | Local model + SDK — zero cloud dependency | 🔜 Coming soon |
| Phase 3 | Training methods + GSPruning + quantization | 📋 Planned |
The end goal: the entire stack — training, pruning, quantization, deployment — fully open to the community.
Get Started
Option 1: CLI (mano-cua)
brew tap HanningWang/tap && brew install mano-cua
Then run:
mano-cua run "Open WeChat and tell FTY the meeting is postponed"
Option 2: Agent Skill (mano-skill)
clawhub install mano-cua
Once installed, your agent can autonomously invoke Mano-P for GUI operations — no manual triggering needed. The agent decides when GUI interaction is required and calls Mano-P automatically.
Option 3: Python SDK (mano-client)
Coming soon. Watch the repo for updates.
What Can You Do With It?
Mano-afk: Fully Autonomous App Building
Give it a natural language requirement. Mano-P autonomously handles the entire pipeline: requirements analysis → architecture → code → test → fix → verify. No human in the loop.
Personal AI: Learning Your Habits
Mano-P can learn and adapt to your personal operational style. A fun example: it can play mahjong following your specific strategy — not the optimal strategy, but your style. That's what Personal AI means in practice.
Why "Personal AI"?
When a capable GUI agent runs on your local device, handles your data without uploading anything, and operates all the software you use daily — it stops being a "tool" and starts becoming a truly personal AI.
With Mano-P open-sourced under Apache 2.0, every developer, every team, every organization can build their own version. This is the beginning.
⭐ Star, Clone, Try It
https://github.com/Mininglamp-AI/Mano-P
Apache 2.0. Issues, PRs, and contributions welcome.