DEV Community

Mininglamp


How Mano-P Achieves #1 on OSWorld: Architecture, Benchmarks, and Edge Deployment

Open-source GUI agents have been gaining traction, but most still rely on cloud inference, DOM parsing, or CLI hooks. Mano-P takes a different approach: pure vision-driven GUI automation that runs entirely on edge devices. And the benchmark results back it up — #1 on OSWorld among specialized models.

This article breaks down the architecture, benchmark data, and edge deployment performance, all from the project's public README and technical report.

The Numbers

OSWorld: #1 Among Specialized Models

OSWorld is the standard benchmark for GUI agent evaluation. Mano-P 1.0-72B achieves a 58.2% success rate, ranking first among all specialized GUI agent models. For context:

| Model | OSWorld Success Rate |
| --- | --- |
| Mano-P 1.0-72B | 58.2% |
| opencua-72b | 45.0% |
| Gap | +13.2 percentage points |

That's not a marginal improvement — it's a 29% relative gain over the second-place model.

WebRetriever: Beating Cloud Giants

On the WebRetriever Protocol I benchmark, Mano-P scores 41.7 NavEval, which puts it ahead of:

| Model | NavEval Score |
| --- | --- |
| Mano-P 1.0 | 41.7 |
| Gemini 2.5 Pro Computer Use | 40.9 |
| Claude 4.5 Computer Use | 31.3 |

Worth noting: Gemini and Claude are cloud-based services with massive compute budgets. Mano-P achieves comparable or better results while running on local hardware.

13 Benchmarks, SOTA Across the Board

Beyond OSWorld and WebRetriever, Mano-P holds SOTA positions across 13 benchmarks spanning GUI grounding, perception & cognition, context learning, and pruning efficiency. The full benchmark data is available in the README.

The Architecture: Why Pure Vision?

Most GUI agents fall into one of these categories:

| Approach | How It Works | Limitation |
| --- | --- | --- |
| DOM/HTML Parsing | Read page structure directly | Web-only, breaks on native apps |
| CDP + CLI | Chrome DevTools Protocol + shell commands | Browser-dependent, fragile |
| Cloud Computer Use | Send screenshots to cloud API | Privacy concerns, latency, API costs |
| Pure Vision (Mano-P) | See the screen, understand it, act | Requires capable on-device model |

Mano-P chose pure vision. No DOM access, no browser hooks, no platform-specific APIs. The model looks at the screen — the same pixels a human sees — and decides what to click, type, or scroll.

This is harder to build, but the payoff is generality: the same model works across any GUI application, any platform, without integration work per app.
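The pure-vision contract is simple to state in code: one screenshot and one instruction in, one UI action out. The sketch below is illustrative only — `StubVisionModel`, `agent_step`, and the action schema are hypothetical names, not Mano-P's actual API.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str           # "click", "type", or "scroll"
    x: int = 0
    y: int = 0
    text: str = ""

class StubVisionModel:
    """Stand-in for the on-device VLM; a real model would ground the
    instruction in the screenshot pixels before choosing an action."""
    def predict(self, image: bytes, instruction: str) -> dict:
        # Fixed answer, purely for illustration.
        return {"kind": "click", "x": 640, "y": 360}

def agent_step(model, screenshot: bytes, task: str) -> Action:
    """One perception-to-action step: raw pixels in, one UI action out.
    No DOM access, no browser hooks -- just the image and the task text."""
    pred = model.predict(image=screenshot, instruction=task)
    return Action(kind=pred["kind"], x=pred.get("x", 0),
                  y=pred.get("y", 0), text=pred.get("text", ""))
```

The key property is the interface itself: because the input is only pixels plus text, nothing in the loop is specific to web pages, browsers, or any one OS.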

Training Methodology: Mano-Action

The technical backbone is Mano-Action, a bidirectional self-reinforcement learning framework. The training follows three stages:

Stage 1: Supervised Fine-Tuning (SFT)
Starting from a base vision-language model, fine-tune on curated GUI interaction datasets.

Stage 2: Offline Reinforcement Learning
Learn from recorded interaction trajectories, optimizing action quality without live environment access.

Stage 3: Online Reinforcement Learning
The model interacts with real GUI environments, receiving feedback and iterating. This is where the "think-act-verify" reasoning loop comes in — the model plans an action, executes it, verifies the result, and adjusts.

The bidirectional aspect means Text→Action and Action→Text consistency are both optimized, creating a tighter loop between understanding and execution.
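The think-act-verify cycle described above can be sketched as a minimal control loop. Everything here is hypothetical scaffolding (`StubEnv`, `StubAgent`, string-valued actions); it shows the shape of the loop, not Mano-P's implementation.

```python
class StubEnv:
    """Toy GUI environment: one button to click before the task is done."""
    def __init__(self):
        self.submitted = False
    def screenshot(self) -> str:
        return "confirmation page" if self.submitted else "form page"
    def execute(self, action: str) -> None:
        if action == "click_submit":
            self.submitted = True

class StubAgent:
    """Toy policy standing in for the VLM's plan/verify calls."""
    def plan(self, screen: str, task: str) -> str:
        return "click_submit"                 # think: choose the next action
    def verify(self, screen: str, task: str) -> bool:
        return screen == "confirmation page"  # verify: did the screen change as expected?

def think_act_verify(agent, env, task: str, max_steps: int = 10) -> bool:
    """Run the loop until the agent confirms success or the step budget runs out."""
    for _ in range(max_steps):
        action = agent.plan(env.screenshot(), task)  # think
        env.execute(action)                          # act
        if agent.verify(env.screenshot(), task):     # verify
            return True
    return False
```

The verify step is what makes online RL feedback possible: the agent compares the post-action screen against its expectation instead of assuming the action succeeded.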

Edge Optimization: Running on Apple M4

The 72B model delivers SOTA benchmarks, but the edge story is equally important. Through mixed-precision quantization and a novel visual token pruning technique called GSPruning, Mano-P achieves practical performance on consumer hardware.

GSPruning: Preserving What Matters

GSPruning (Global Spatial Pruning) is designed specifically for vision-language models processing high-resolution interfaces. It:

  • Preserves global spatial structure through anchor points
  • Identifies semantic outliers for critical UI elements
  • Achieves 2-3× throughput speedup with minimal performance loss

On the Online-Mind2Web benchmark, GSPruning at 25% token retention achieves a success rate of 0.400 on Qwen3VL-4B, compared to 0.425 at full tokens — only a 6% drop while running significantly faster.
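A minimal sketch of the anchor-plus-outlier idea, assuming visual tokens laid out on a regular grid with a per-token importance score. The function name, stride, and scoring here are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def gspruning_sketch(tokens, scores, grid, keep_ratio=0.25, anchor_stride=4):
    """tokens: (N, D) visual tokens in row-major grid=(rows, cols) order;
    scores: (N,) per-token importance. Keeps a sparse spatial lattice
    (anchors) plus the top-scoring tokens (semantic outliers), up to
    roughly keep_ratio * N tokens."""
    rows, cols = grid
    n = rows * cols
    keep = np.zeros(n, dtype=bool)
    # 1) Spatial anchors: a regular lattice preserves the global layout
    #    of the interface even after most tokens are dropped.
    for r in range(0, rows, anchor_stride):
        for c in range(0, cols, anchor_stride):
            keep[r * cols + c] = True
    # 2) Semantic outliers: spend the remaining budget on the
    #    highest-importance tokens (e.g. buttons, text fields).
    budget = int(n * keep_ratio)
    for idx in np.argsort(scores)[::-1]:
        if keep.sum() >= budget:
            break
        keep[idx] = True
    return tokens[keep]
```

At 25% retention the downstream transformer processes a quarter of the visual tokens, which is where the 2-3× throughput gain comes from.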

M4 Pro Performance

The 4B quantized model (w4a16) on Apple M4 Pro with 64GB RAM:

| Metric | Value |
| --- | --- |
| Prefill Speed | 476 tokens/s |
| Decode Speed | 76 tokens/s |
| Peak Memory | 4.3 GB |
| Prefill Time (4K context) | 8.6 s |

4.3 GB peak memory means this runs comfortably alongside other applications. No dedicated GPU server required.
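The prefill figures are internally consistent, which takes one line of arithmetic to confirm:

```python
# Prefilling a 4K (4096-token) context at the reported 476 tokens/s
# should take roughly 8.6 seconds, matching the table.
prefill_tokens = 4096
prefill_speed_tps = 476
prefill_seconds = prefill_tokens / prefill_speed_tps
print(round(prefill_seconds, 1))  # 8.6
```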

Hardware Requirements

Two deployment options:

  1. Direct: Mac mini or MacBook with Apple M4 chip, 32GB+ RAM
  2. Computing Stick: Any Mac + Mano-P computing stick via USB 4.0+

Data Privacy: The Edge Advantage

In Local Mode, all processing happens on-device:

  • ✅ Screenshots never leave the device
  • ✅ Task descriptions stay local
  • ✅ No cloud API calls
  • ✅ Full source code is open for audit

Cloud Mode is available as a fallback (screenshots sent to mano.mininglamp.com), but the local-first architecture means sensitive workflows can run with zero data exposure.

Getting Started

Three usage forms are available:

CLI (for terminal users):

```shell
brew tap HanningWang/tap
brew install mano-cua

mano-cua run "Open WeChat and tell FTY the meeting is postponed"
```

Python SDK (planned):

```python
from mano_client import ManoClient

client = ManoClient()
client.run("Search for AI news on Xiaohongshu")
```

ClawHub Skill (for AI agents):

```shell
clawhub install mano-cua
```

The Skill form is designed for AI agents like Claude Code or OpenClaw — the agent automatically invokes Mano-P when GUI operations are needed.

What's Next

Mano-P is being released in three phases:

  • Phase 1 (now): Mano-CUA Skills — for agent enthusiasts to build CUA task workflows
  • Phase 2 (coming): Local models + SDK — for developers with high security requirements
  • Phase 3 (planned): Training methods + pruning techniques — for researchers who want to train their own GUI-VLA models

The project is Apache 2.0 licensed. Full source, benchmarks, and documentation: github.com/Mininglamp-AI/Mano-P
