Open-source GUI agents have been gaining traction, but most still rely on cloud inference, DOM parsing, or CLI hooks. Mano-P takes a different approach: pure vision-driven GUI automation that runs entirely on edge devices. And the benchmark results back it up — #1 on OSWorld among specialized models.
This article breaks down the architecture, benchmark data, and edge deployment performance, all from the project's public README and technical report.
## The Numbers

### OSWorld: #1 Among Specialized Models
OSWorld is the standard benchmark for GUI agent evaluation. Mano-P 1.0-72B achieves a 58.2% success rate, ranking first among all specialized GUI agent models. For context:
| Model | OSWorld Success Rate |
|---|---|
| Mano-P 1.0-72B | 58.2% |
| opencua-72b | 45.0% |
| Gap | +13.2 percentage points |
That's not a marginal improvement — it's a 29% relative gain over the second-place model.
### WebRetriever: Beating Cloud Giants
On the WebRetriever Protocol I benchmark, Mano-P scores 41.7 NavEval, which puts it ahead of:
| Model | NavEval Score |
|---|---|
| Mano-P 1.0 | 41.7 |
| Gemini 2.5 Pro Computer Use | 40.9 |
| Claude 4.5 Computer Use | 31.3 |
Worth noting: Gemini and Claude are cloud-based services with massive compute budgets. Mano-P achieves comparable or better results while running on local hardware.
### 13 Benchmarks, SOTA Across the Board
Beyond OSWorld and WebRetriever, Mano-P holds SOTA positions across 13 benchmarks spanning GUI grounding, perception & cognition, context learning, and pruning efficiency. The full benchmark data is available in the README.
## The Architecture: Why Pure Vision?
Most GUI agents fall into one of these categories:
| Approach | How It Works | Limitation |
|---|---|---|
| DOM/HTML Parsing | Read page structure directly | Web-only, breaks on native apps |
| CDP + CLI | Chrome DevTools Protocol + shell commands | Browser-dependent, fragile |
| Cloud Computer Use | Send screenshots to cloud API | Privacy concerns, latency, API costs |
| Pure Vision (Mano-P) | See the screen, understand it, act | Requires capable on-device model |
Mano-P chose pure vision. No DOM access, no browser hooks, no platform-specific APIs. The model looks at the screen — the same pixels a human sees — and decides what to click, type, or scroll.
This is harder to build, but the payoff is generality: the same model works across any GUI application, any platform, without integration work per app.
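To make the control flow concrete, here is a minimal sketch of such a pure-vision loop. Every name in it (`Action`, `model.predict`, the `execute` callback) is a hypothetical interface for illustration, not Mano-P's actual API:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str           # "click", "type", "scroll", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def run_task(model, capture_screen, execute, task, max_steps=20):
    """Pure-vision loop: the model's only input is pixels, never the DOM."""
    for _ in range(max_steps):
        pixels = capture_screen()             # raw screenshot, no page structure
        action = model.predict(pixels, task)  # the VLM chooses the next step
        if action.kind == "done":
            return True
        execute(action)                       # inject click/keystroke at pixel coords
    return False
```

The key property is the interface: because the model sees only pixels, the same loop works unchanged for a browser, a native app, or a terminal emulator.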
## Training Methodology: Mano-Action
The technical backbone is Mano-Action, a bidirectional self-reinforcement learning framework. The training follows three stages:
### Stage 1: Supervised Fine-Tuning (SFT)

Starting from a base vision-language model, fine-tune on curated GUI interaction datasets.

### Stage 2: Offline Reinforcement Learning

Learn from recorded interaction trajectories, optimizing action quality without live environment access.

### Stage 3: Online Reinforcement Learning

The model interacts with real GUI environments, receiving feedback and iterating. This is where the "think-act-verify" reasoning loop comes in: the model plans an action, executes it, verifies the result, and adjusts.
The bidirectional aspect means Text→Action and Action→Text consistency are both optimized, creating a tighter loop between understanding and execution.
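The think-act-verify loop from Stage 3 can be sketched as follows; again, `think`, `verify`, and `adjust` are assumed names for illustration, not the framework's real methods:

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str
    expected: str    # what the screen should show if the action worked
    done: bool = False

def think_act_verify(model, env, task, max_steps=15):
    obs = env.screenshot()
    for _ in range(max_steps):
        step = model.think(obs, task)              # THINK: plan the next action
        env.act(step.action)                       # ACT: execute it on the GUI
        obs = env.screenshot()
        if not model.verify(obs, step.expected):   # VERIFY: did the screen change as planned?
            model.adjust(step)                     # a mismatch feeds back into replanning
            continue
        if step.done:
            return True
    return False
```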
## Edge Optimization: Running on Apple M4
The 72B model delivers SOTA benchmarks, but the edge story is equally important. Through mixed-precision quantization and a novel visual token pruning technique called GSPruning, Mano-P achieves practical performance on consumer hardware.
### GSPruning: Preserving What Matters
GSPruning (Global Spatial Pruning) is designed specifically for vision-language models processing high-resolution interfaces. It:
- Preserves global spatial structure through anchor points
- Identifies semantic outliers for critical UI elements
- Achieves 2-3× throughput speedup with minimal performance loss
On the Online-Mind2Web benchmark, GSPruning at 25% token retention achieves a 0.400 success rate on Qwen3VL-4B, versus 0.425 with the full token set: a roughly 6% relative drop, in exchange for significantly faster inference.
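As a rough illustration of the idea (a simplified sketch under stated assumptions, not the released GSPruning algorithm): keep a uniform grid of anchor tokens to preserve global layout, then spend the rest of the retention budget on the highest-saliency tokens:

```python
import numpy as np

def prune_tokens(tokens, saliency, keep_ratio=0.25, anchor_every=8):
    """Keep a uniform grid of spatial anchors, then fill the remaining
    budget with the highest-saliency 'outlier' tokens (e.g. buttons, text)."""
    n = len(tokens)
    budget = max(1, int(n * keep_ratio))
    anchors = set(range(0, n, anchor_every))       # preserve global spatial structure
    outliers = [i for i in np.argsort(-saliency) if i not in anchors]
    keep = sorted(anchors | set(outliers[:max(0, budget - len(anchors))]))
    return tokens[keep], keep
```

With 64 visual tokens and a 25% budget, this keeps 8 evenly spaced anchors plus the 8 most salient remaining tokens, so both the overall layout and the critical UI elements survive pruning.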
### M4 Pro Performance
The 4B quantized model (w4a16) on Apple M4 Pro with 64GB RAM:
| Metric | Value |
|---|---|
| Prefill Speed | 476 tokens/s |
| Decode Speed | 76 tokens/s |
| Peak Memory | 4.3 GB |
| Prefill Time (4K context) | 8.6s |
4.3 GB peak memory means this runs comfortably alongside other applications. No dedicated GPU server required.
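A quick back-of-the-envelope check shows these numbers are self-consistent (the byte math below is an assumption-laden estimate, not a measurement):

```python
# Sanity check on the table above.
params = 4e9                        # 4B-parameter model
weights_gib = params * 0.5 / 2**30  # w4a16: 4-bit weights = 0.5 byte/param
prefill_s = 4096 / 476              # 4K-token context at 476 tokens/s prefill

print(f"weights ≈ {weights_gib:.1f} GiB")   # ≈ 1.9 GiB
print(f"prefill ≈ {prefill_s:.1f} s")       # ≈ 8.6 s, matching the table
# The gap between ~1.9 GiB of weights and the 4.3 GB peak plausibly covers
# fp16 activations, the KV cache for the 4K context, and runtime overhead.
```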
### Hardware Requirements
Two deployment options:
- Direct: Mac mini or MacBook with Apple M4 chip, 32GB+ RAM
- Computing Stick: Any Mac + Mano-P computing stick via USB 4.0+
## Data Privacy: The Edge Advantage
In Local Mode, all processing happens on-device:
- ✅ Screenshots never leave the device
- ✅ Task descriptions stay local
- ✅ No cloud API calls
- ✅ Full source code is open for audit
Cloud Mode is available as a fallback (screenshots sent to mano.mininglamp.com), but the local-first architecture means sensitive workflows can run with zero data exposure.
## Getting Started
Three usage forms are available:
CLI (for terminal users):

```shell
brew tap HanningWang/tap
brew install mano-cua
mano-cua run "Open WeChat and tell FTY the meeting is postponed"
```
Python SDK (planned):

```python
from mano_client import ManoClient

client = ManoClient()
client.run("Search for AI news on Xiaohongshu")
```
ClawHub Skill (for AI agents):

```shell
clawhub install mano-cua
```
The Skill form is designed for AI agents like Claude Code or OpenClaw — the agent automatically invokes Mano-P when GUI operations are needed.
## What's Next
Mano-P is being released in three phases:
- Phase 1 (now): Mano-CUA Skills — for agent enthusiasts to build CUA task workflows
- Phase 2 (coming): Local models + SDK — for developers with high security requirements
- Phase 3 (planned): Training methods + pruning techniques — for researchers who want to train their own GUI-VLA models
The project is Apache 2.0 licensed. Full source, benchmarks, and documentation: github.com/Mininglamp-AI/Mano-P