DEV Community

Mininglamp

Posted on

Open-Sourcing Mano-P Today: Pure Vision GUI Agent, OSWorld #1, Apache 2.0


Today we're open-sourcing Mano-P under Apache 2.0.

GitHub: https://github.com/Mininglamp-AI/Mano-P

What is Mano-P?

Mano-P is a pure-vision GUI agent. It looks at your screen — literally a screenshot — understands the UI elements, and performs actions. No Chrome DevTools Protocol (CDP). No HTML parsing. No accessibility APIs. Just vision.

The name: Mano is Spanish for "hand." Current AI agents can interact with computers, but most of them operate like a lobster's claw — functional, but clumsy. Mano-P is designed to give agents a proper hand.

The P stands for four things:

  • Power — 13 multimodal benchmark SOTAs, #1 among open-source models on OSWorld
  • Private — Runs on-device, your data never leaves your machine
  • Public — Apache 2.0, fully open source starting today
  • Personal — A foundation for building your own personalized AI

Why Does This Matter?

There are broadly four approaches to GUI automation today:

| Approach | Examples | Limitation |
| --- | --- | --- |
| Traditional RPA | UiPath | Coordinate-based; breaks when the UI changes |
| Browser CUA | CDP-based agents | Limited to browsers |
| Cloud Computer Use | Claude CU, etc. | Requires uploading screen data to the cloud |
| Pure Vision GUI | Mano-P | Sees the screen directly; works everywhere; runs locally |

Most current GUI agents either depend on browser protocols (limiting them to Chrome) or run in the cloud (requiring you to stream your screen to a remote server). Mano-P takes a different approach: it processes raw screen captures through a vision model to understand and interact with any GUI — desktop apps, browsers, 3D tools, professional software, anything with a graphical interface.

Benchmarks

Numbers first.

OSWorld: #1 Among Open-Source Models

  • Mano-P 72B: 58.2% success rate
  • Second place (OpenCUA-72B): 45.0%
  • Gap: +13.2 percentage points
  • Ranks 5th overall (the four models above it are 100B+ general-purpose LLMs like GPT, Claude, and Gemini)

13 Multimodal Benchmark SOTAs

Selected results:

| Benchmark | Score |
| --- | --- |
| ScreenSpot-V2 | 93.5 |
| MMBench | 87.5 |
| UI-Vision | 46.6 |

Web Navigation

On WebRetriever Protocol I NavEval:

  • Mano-P: 41.7
  • Gemini 2.5 Pro CU: 40.9
  • Claude 4.5 CU: 31.3

On-Device Performance

The 4B quantized model (w4a16) running on Apple M4 Pro:

| Metric | Value |
| --- | --- |
| Prefill | 476 tokens/s |
| Decode | 76 tokens/s |
| Peak memory | 4.3 GB |
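
To put those throughput figures in context, per-step latency is just prompt length over prefill speed plus output length over decode speed. A quick sketch, using the published M4 Pro numbers (the prompt and output sizes below are illustrative, not from the post):

```python
# Back-of-the-envelope latency estimate from the published M4 Pro numbers
# (476 tok/s prefill, 76 tok/s decode). Token counts are illustrative.
PREFILL_TPS = 476.0
DECODE_TPS = 76.0

def step_latency(prompt_tokens: int, output_tokens: int) -> float:
    """Seconds for one screenshot-to-action turn at the quoted throughputs."""
    return prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS

# e.g. a ~1500-token screenshot encoding plus a short action plan
latency = step_latency(1500, 60)  # roughly 3.9 s per step
```

At these rates a single think-act step on a laptop lands in the low single-digit seconds — slow for a human, but fast enough for unattended automation.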

Your screenshots, your instructions, your data — all processed locally. Nothing uploaded to any cloud server.

For anyone dealing with sensitive environments (enterprise systems, personal data, financial interfaces), this isn't a nice-to-have. It's a requirement.

Key Technical Highlights

Pure-Vision GUI Understanding: No CDP, no DOM parsing. The model directly processes screen captures to identify UI elements, understand layout, and locate interaction targets.

Mano-Action Bidirectional Self-Enhancement: Inspired by cycle consistency in GANs. The training loop runs in both directions — Text→Action and Action→Text — enforcing consistency between them. This significantly improves generalization to unseen UIs.
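
The core idea of cycle consistency is that a round trip — text to action and back to text — should land where it started. The actual Mano-Action loss isn't specified in this post, so here is only a toy numerical sketch of the round-trip penalty, with `F` and `G` standing in for the two directions:

```python
import numpy as np

# Toy cycle-consistency sketch: F maps text embeddings to action
# embeddings, G maps back. Penalizing round-trip error is the core idea;
# the real Mano-Action objective is not detailed in this post.
rng = np.random.default_rng(0)
F = rng.normal(size=(8, 8)) * 0.1 + np.eye(8)  # text -> action (toy linear map)
G = np.linalg.inv(F)                           # action -> text (exact inverse here)

def cycle_loss(text_emb: np.ndarray) -> float:
    """Mean squared round-trip error: text -> action -> text."""
    action = text_emb @ F
    reconstructed = action @ G
    return float(np.mean((reconstructed - text_emb) ** 2))

x = rng.normal(size=(4, 8))
loss = cycle_loss(x)  # ~0, since G inverts F exactly in this toy setup
```

In training, neither direction is a perfect inverse, so this loss stays nonzero and pushes the two mappings to agree — which is what drives the generalization gain on unseen UIs.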

Three-Stage Training Pipeline:

  1. SFT (Supervised Fine-Tuning): Learn basic screen-to-action mapping from human demonstrations
  2. Offline RL: Learn from collected trajectories (both successes and failures) without live interaction
  3. Online RL: Direct interaction with real GUI environments for continued policy improvement

GSPruning (Guided Structural Pruning): Compresses visual token retention to about 25%, boosting throughput 2-3× with minimal accuracy loss. This is what makes on-device inference practical.
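
The retention arithmetic is straightforward to illustrate: score each visual token, keep the top ~25%, and preserve their original order. GSPruning's actual guided structural criterion is not described in this post, so the scoring below (random stand-in for attention mass) is purely illustrative:

```python
import numpy as np

# Illustrative visual-token pruning: keep the top ~25% of tokens by an
# importance score. GSPruning's real selection criterion is not detailed
# in this post; this only demonstrates the retention arithmetic.
def prune_tokens(tokens: np.ndarray, scores: np.ndarray, keep_ratio: float = 0.25):
    """Return the highest-scoring tokens, preserving their original order."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # top-k indices, re-sorted by position
    return tokens[keep]

rng = np.random.default_rng(0)
vis = rng.normal(size=(1024, 64))  # 1024 visual tokens, dim 64
attn = rng.random(1024)            # stand-in importance scores
kept = prune_tokens(vis, attn)     # 256 tokens survive at keep_ratio=0.25
```

Dropping three quarters of the visual tokens shrinks the attention cost quadratically, which is where the 2-3× throughput gain comes from.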

Think-Act-Verify Loop: Instead of generating a full action sequence upfront, Mano-P executes one step at a time: analyze the screen → perform an action → verify the result → plan the next step. Far more robust than one-shot planning.
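
The loop above can be sketched in a few lines. Note that `capture_screen`, `model_plan`, and `execute` are placeholders for whatever screenshot, inference, and input-injection backends you wire in — this is not Mano-P's actual API:

```python
# Minimal think-act-verify loop sketch. The three callables are
# hypothetical stand-ins, not Mano-P's real interface.
def run_task(goal, capture_screen, model_plan, execute, max_steps=20):
    for _ in range(max_steps):
        screen = capture_screen()          # analyze: grab a fresh screenshot
        action = model_plan(goal, screen)  # think: plan one step, not a full script
        if action["type"] == "done":
            return True                    # verify step confirmed completion
        execute(action)                    # act: click / type / scroll
    return False                           # step budget exhausted
```

Because every iteration replans from a fresh screenshot, a popup or layout change invalidates at most one step — the next observation simply reflects the new state.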

Open-Source Roadmap

We're releasing Mano-P in three stages:

| Phase | What | Status |
| --- | --- | --- |
| Phase 1 | Mano-CUA Skill — ready to use | Today |
| Phase 2 | Local model + SDK — zero cloud dependency | 🔜 Coming soon |
| Phase 3 | Training methods + GSPruning + quantization | 📋 Planned |

The end goal: the entire stack — training, pruning, quantization, deployment — fully open to the community.

Get Started

Option 1: CLI (mano-cua)

```shell
brew tap HanningWang/tap && brew install mano-cua
```

Then run:

```shell
mano-cua run "Open WeChat and tell FTY the meeting is postponed"
```

Option 2: Agent Skill (mano-skill)

```shell
clawhub install mano-cua
```

Once installed, your agent can autonomously invoke Mano-P for GUI operations — no manual triggering needed. The agent decides when GUI interaction is required and calls Mano-P automatically.

Option 3: Python SDK (mano-client)

Coming soon. Watch the repo for updates.

What Can You Do With It?

Mano-afk: Fully Autonomous App Building

Give it a natural language requirement. Mano-P autonomously handles the entire pipeline: requirements analysis → architecture → code → test → fix → verify. No human in the loop.

Personal AI: Learning Your Habits

Mano-P can learn and adapt to your personal operational style. A fun example: it can play mahjong following your specific strategy — not the optimal strategy, but your style. That's what Personal AI means in practice.

Why "Personal AI"?

When a capable GUI agent runs on your local device, handles your data without uploading anything, and operates all the software you use daily — it stops being a "tool" and starts becoming a truly personal AI.

With Mano-P open-sourced under Apache 2.0, every developer, every team, every organization can build their own version. This is the beginning.


⭐ Star, Clone, Try It

https://github.com/Mininglamp-AI/Mano-P

Apache 2.0. Issues, PRs, and contributions welcome.

