Open-source GUI agents have been gaining traction, but most still rely on cloud inference, DOM parsing, or CLI hooks. Mano-P takes a different approach: pure vision-driven GUI automation that runs entirely on edge devices. And the benchmark results back it up — #1 on OSWorld among specialized models.
This article breaks down the architecture, benchmark data, and edge deployment performance, all from the project's public README and technical report.
## The Numbers

### OSWorld: #1 Among Specialized Models
OSWorld is the standard benchmark for GUI agent evaluation. Mano-P 1.0-72B achieves a 58.2% success rate, ranking first among all specialized GUI agent models. For context:
| Model | OSWorld Success Rate |
|---|---|
| Mano-P 1.0-72B | 58.2% |
| opencua-72b | 45.0% |
| Gap | +13.2 percentage points |
That's not a marginal improvement — it's a 29% relative gain over the second-place model.
### WebRetriever: Beating Cloud Giants
On the WebRetriever Protocol I benchmark, Mano-P scores 41.7 NavEval, which puts it ahead of:
| Model | NavEval Score |
|---|---|
| Mano-P 1.0 | 41.7 |
| Gemini 2.5 Pro Computer Use | 40.9 |
| Claude 4.5 Computer Use | 31.3 |
Worth noting: Gemini and Claude are cloud-based services with massive compute budgets. Mano-P achieves comparable or better results while running on local hardware.
### 13 Benchmarks, SOTA Across the Board
Beyond OSWorld and WebRetriever, Mano-P holds SOTA positions across 13 benchmarks spanning GUI grounding, perception & cognition, context learning, and pruning efficiency. The full benchmark data is available in the README.
## The Architecture: Why Pure Vision?
Most GUI agents fall into one of these categories:
| Approach | How It Works | Limitation |
|---|---|---|
| DOM/HTML Parsing | Read page structure directly | Web-only, breaks on native apps |
| CDP + CLI | Chrome DevTools Protocol + shell commands | Browser-dependent, fragile |
| Cloud Computer Use | Send screenshots to cloud API | Privacy concerns, latency, API costs |
| Pure Vision (Mano-P) | See the screen, understand it, act | Requires capable on-device model |
Mano-P chose pure vision. No DOM access, no browser hooks, no platform-specific APIs. The model looks at the screen — the same pixels a human sees — and decides what to click, type, or scroll.
This is harder to build, but the payoff is generality: the same model works across any GUI application, any platform, without integration work per app.
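To make the control flow concrete, here is a minimal sketch of such a pure-vision loop. Every name in it (`Action`, `model.predict`, the `execute` callback) is a hypothetical interface for illustration, not Mano-P's actual API:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str           # "click", "type", "scroll", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def run_task(model, capture_screen, execute, task, max_steps=20):
    """Pure-vision loop: the model's only input is pixels, never the DOM."""
    for _ in range(max_steps):
        pixels = capture_screen()             # raw screenshot, no page structure
        action = model.predict(pixels, task)  # the VLM chooses the next step
        if action.kind == "done":
            return True
        execute(action)                       # inject click/keystroke at pixel coords
    return False
```

The key property is the interface: because the model sees only pixels, the same loop works unchanged for a browser, a native app, or a terminal emulator.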
## Training Methodology: Mano-Action
The technical backbone is Mano-Action, a bidirectional self-reinforcement learning framework. The training follows three stages:
### Stage 1: Supervised Fine-Tuning (SFT)

Starting from a base vision-language model, fine-tune on curated GUI interaction datasets.

### Stage 2: Offline Reinforcement Learning

Learn from recorded interaction trajectories, optimizing action quality without live environment access.

### Stage 3: Online Reinforcement Learning

The model interacts with real GUI environments, receiving feedback and iterating. This is where the "think-act-verify" reasoning loop comes in: the model plans an action, executes it, verifies the result, and adjusts.
The bidirectional aspect means Text→Action and Action→Text consistency are both optimized, creating a tighter loop between understanding and execution.
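The think-act-verify loop from Stage 3 can be sketched as follows; again, `think`, `verify`, and `adjust` are assumed names for illustration, not the framework's real methods:

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str
    expected: str    # what the screen should show if the action worked
    done: bool = False

def think_act_verify(model, env, task, max_steps=15):
    obs = env.screenshot()
    for _ in range(max_steps):
        step = model.think(obs, task)              # THINK: plan the next action
        env.act(step.action)                       # ACT: execute it on the GUI
        obs = env.screenshot()
        if not model.verify(obs, step.expected):   # VERIFY: did the screen change as planned?
            model.adjust(step)                     # a mismatch feeds back into replanning
            continue
        if step.done:
            return True
    return False
```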
## Edge Optimization: Running on Apple M4
The 72B model delivers SOTA benchmarks, but the edge story is equally important. Through mixed-precision quantization and a novel visual token pruning technique called GSPruning, Mano-P achieves practical performance on consumer hardware.
### GSPruning: Preserving What Matters
GSPruning (Global Spatial Pruning) is designed specifically for vision-language models processing high-resolution interfaces. It:
- Preserves global spatial structure through anchor points
- Identifies semantic outliers for critical UI elements
- Achieves 2-3× throughput speedup with minimal performance loss
On the Online-Mind2Web benchmark, GSPruning at 25% token retention achieves a 0.400 success rate on Qwen3VL-4B, versus 0.425 with the full token set: a roughly 6% relative drop, in exchange for significantly faster inference.
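As a rough illustration of the idea (a simplified sketch under stated assumptions, not the released GSPruning algorithm): keep a uniform grid of anchor tokens to preserve global layout, then spend the rest of the retention budget on the highest-saliency tokens:

```python
import numpy as np

def prune_tokens(tokens, saliency, keep_ratio=0.25, anchor_every=8):
    """Keep a uniform grid of spatial anchors, then fill the remaining
    budget with the highest-saliency 'outlier' tokens (e.g. buttons, text)."""
    n = len(tokens)
    budget = max(1, int(n * keep_ratio))
    anchors = set(range(0, n, anchor_every))       # preserve global spatial structure
    outliers = [i for i in np.argsort(-saliency) if i not in anchors]
    keep = sorted(anchors | set(outliers[:max(0, budget - len(anchors))]))
    return tokens[keep], keep
```

With 64 visual tokens and a 25% budget, this keeps 8 evenly spaced anchors plus the 8 most salient remaining tokens, so both the overall layout and the critical UI elements survive pruning.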
### M4 Pro Performance
The 4B quantized model (w4a16) on Apple M4 Pro with 64GB RAM:
| Metric | Value |
|---|---|
| Prefill Speed | 476 tokens/s |
| Decode Speed | 76 tokens/s |
| Peak Memory | 4.3 GB |
| Prefill Time (4K context) | 8.6s |
4.3 GB peak memory means this runs comfortably alongside other applications. No dedicated GPU server required.
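A quick back-of-the-envelope check shows these numbers are self-consistent (the byte math below is an assumption-laden estimate, not a measurement):

```python
# Sanity check on the table above.
params = 4e9                        # 4B-parameter model
weights_gib = params * 0.5 / 2**30  # w4a16: 4-bit weights = 0.5 byte/param
prefill_s = 4096 / 476              # 4K-token context at 476 tokens/s prefill

print(f"weights ≈ {weights_gib:.1f} GiB")   # ≈ 1.9 GiB
print(f"prefill ≈ {prefill_s:.1f} s")       # ≈ 8.6 s, matching the table
# The gap between ~1.9 GiB of weights and the 4.3 GB peak plausibly covers
# fp16 activations, the KV cache for the 4K context, and runtime overhead.
```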
### Hardware Requirements
Two deployment options:
- Direct: Mac mini or MacBook with Apple M4 chip, 32GB+ RAM
- Computing Stick: Any Mac + Mano-P computing stick via USB 4.0+
## Data Privacy: The Edge Advantage
In Local Mode, all processing happens on-device:
- ✅ Screenshots never leave the device
- ✅ Task descriptions stay local
- ✅ No cloud API calls
- ✅ Full source code is open for audit
Cloud Mode is available as a fallback (screenshots sent to mano.mininglamp.com), but the local-first architecture means sensitive workflows can run with zero data exposure.
## Getting Started
Three usage forms are available:
CLI (for terminal users):

```shell
brew tap HanningWang/tap
brew install mano-cua
mano-cua run "Open WeChat and tell FTY the meeting is postponed"
```
Python SDK (planned):

```python
from mano_client import ManoClient

client = ManoClient()
client.run("Search for AI news on Xiaohongshu")
```
ClawHub Skill (for AI agents):

```shell
clawhub install mano-cua
```
The Skill form is designed for AI agents like Claude Code or OpenClaw — the agent automatically invokes Mano-P when GUI operations are needed.
## What's Next
Mano-P is being released in three phases:
- Phase 1 (now): Mano-CUA Skills — for agent enthusiasts to build CUA task workflows
- Phase 2 (coming): Local models + SDK — for developers with high security requirements
- Phase 3 (planned): Training methods + pruning techniques — for researchers who want to train their own GUI-VLA models
The project is Apache 2.0 licensed. Full source, benchmarks, and documentation: github.com/Mininglamp-AI/Mano-P