Mano-P: Open-Sourcing the #1 GUI Agent on OSWorld — Pure Vision, On-Device, Apache 2.0
Today we're open-sourcing Mano-P under Apache 2.0.
GitHub: https://github.com/Mininglamp-AI/Mano-P
What is Mano-P?
Mano-P is a pure-vision GUI agent. It looks at your screen — literally a screenshot — understands the UI elements, and performs actions. No Chrome DevTools Protocol. No HTML parsing. No accessibility APIs. Just vision.
The name: Mano is Spanish for "hand." Current AI agents can interact with computers, but most of them operate like a lobster's claw — functional, but clumsy. Mano-P is designed to give agents a proper hand.
The P stands for four things:
- Power — 13 multimodal benchmark SOTAs, #1 among proprietary models on OSWorld
- Private — Runs on-device, your data never leaves your machine
- Public — Apache 2.0, fully open source starting today
- Personal — A foundation for building your own personalized AI
Why Does This Matter?
There are broadly four approaches to GUI automation today:
| Approach | Examples | Limitation |
|---|---|---|
| Traditional RPA | UiPath | Coordinate-based, breaks when UI changes |
| Browser CUA | CDP-based agents | Limited to browsers |
| Cloud Computer Use | Claude CU, etc. | Requires uploading screen data to the cloud |
| Pure Vision GUI | Mano-P | Sees the screen directly, works everywhere, runs locally |
Most current GUI agents either depend on browser protocols (limiting them to Chrome) or run in the cloud (requiring you to stream your screen to a remote server). Mano-P takes a different approach: it processes raw screen captures through a vision model to understand and interact with any GUI — desktop apps, browsers, 3D tools, professional software, anything with a graphical interface.
Benchmarks
Numbers first.
OSWorld: #1 Among Proprietary Models
- Mano-P 72B: 58.2% success rate
- Second place (opencua-72b): 45.0%
- Gap: +13.2 percentage points
- Ranks 5th overall (the four models above it are 100B+ general-purpose LLMs like GPT, Claude, and Gemini)
13 Multimodal Benchmark SOTAs
Selected results:
| Benchmark | Score |
|---|---|
| ScreenSpot-V2 | 93.5 |
| MMBench | 87.5 |
| UI-Vision | 46.6 |
Web Navigation
On WebRetriever Protocol I NavEval:
- Mano-P: 41.7
- Gemini 2.5 Pro CU: 40.9
- Claude 4.5 CU: 31.3
On-Device Performance
The 4B quantized model (w4a16) running on Apple M4 Pro:
| Metric | Value |
|---|---|
| Prefill | 476 tokens/s |
| Decode | 76 tokens/s |
| Peak memory | 4.3 GB |
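To make these numbers concrete, here is a back-of-envelope latency estimate for a single agent step using the measured throughput above. The token counts (prompt size after pruning, action length) are illustrative assumptions, not figures from the post.

```python
# Rough per-step latency on Apple M4 Pro using the measured throughput.
# Token counts below are assumed for illustration, not measured values.
PREFILL_TPS = 476   # tokens/s (measured, prefill)
DECODE_TPS = 76     # tokens/s (measured, decode)

prompt_tokens = 1500   # assumed: pruned screenshot tokens + instruction
action_tokens = 40     # assumed: one structured action

prefill_s = prompt_tokens / PREFILL_TPS
decode_s = action_tokens / DECODE_TPS
total_s = prefill_s + decode_s

print(f"prefill ≈ {prefill_s:.2f}s, decode ≈ {decode_s:.2f}s, total ≈ {total_s:.2f}s")
```

Under these assumptions a full perceive-and-act step lands in the low single-digit seconds, which is what makes a local think-act-verify loop usable in practice.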
Your screenshots, your instructions, your data — all processed locally. Nothing uploaded to any cloud server.
For anyone dealing with sensitive environments (enterprise systems, personal data, financial interfaces), this isn't a nice-to-have. It's a requirement.
Key Technical Highlights
Pure-Vision GUI Understanding: No CDP, no DOM parsing. The model directly processes screen captures to identify UI elements, understand layout, and locate interaction targets.
Mano-Action Bidirectional Self-Enhancement: Inspired by cycle consistency in GANs. The training loop runs in both directions — Text→Action and Action→Text — enforcing consistency between them. This significantly improves generalization to unseen UIs.
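To make the bidirectional objective concrete, here is a toy sketch of what a cycle-consistent loss could look like. Every function and the loss weighting here are hypothetical stand-ins; Mano-P's actual training code is not released yet (see Phase 3 of the roadmap).

```python
# Hypothetical sketch of a bidirectional (cycle-consistent) objective.
# Instructions are represented as plain embedding vectors for simplicity.

def forward_loss(text, action, model_t2a):
    """Text -> Action: penalty for mispredicting the action."""
    return model_t2a(text, action)

def backward_loss(action, text, model_a2t):
    """Action -> Text: penalty for misdescribing the action."""
    return model_a2t(action, text)

def cycle_loss(text, gen_t2a, gen_a2t):
    """Text -> Action -> Text should round-trip to the original intent."""
    action = gen_t2a(text)
    reconstructed = gen_a2t(action)
    # squared distance between original and reconstructed embeddings
    return sum((a - b) ** 2 for a, b in zip(text, reconstructed))

def total_loss(text, action, models, lam=0.5):
    """Combine both directions plus a cycle-consistency term."""
    t2a, a2t, gen_t2a, gen_a2t = models
    return (forward_loss(text, action, t2a)
            + backward_loss(action, text, a2t)
            + lam * cycle_loss(text, gen_t2a, gen_a2t))
```

The point of the cycle term is that an action predicted from an instruction should still be describable as that instruction, which is the consistency pressure the post credits for better generalization to unseen UIs.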
Three-Stage Training Pipeline:
- SFT (Supervised Fine-Tuning): Learn basic screen-to-action mapping from human demonstrations
- Offline RL: Learn from collected trajectories (both successes and failures) without live interaction
- Online RL: Direct interaction with real GUI environments for continued policy improvement
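The control flow of the three stages can be sketched as below. The stage functions are trivial hypothetical stand-ins, shown only to make the ordering concrete; the real training code is slated for Phase 3 of the roadmap.

```python
# Sketch of the three-stage pipeline order. Stage bodies are
# hypothetical placeholders, not Mano-P's training implementation.

def supervised_finetune(model, demos):
    return model + ["sft"]          # imitate human demonstrations

def offline_rl(model, trajectories):
    return model + ["offline_rl"]   # learn from logged successes AND failures

def online_rl(model, env):
    return model + ["online_rl"]    # live interaction with real GUI envs

def train(model, demos, trajectories, env):
    model = supervised_finetune(model, demos)  # 1) screen -> action mapping
    model = offline_rl(model, trajectories)    # 2) no live interaction needed
    model = online_rl(model, env)              # 3) continued policy improvement
    return model

print(train([], None, None, None))
```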
GSPruning (Guided Structural Pruning): Retains only roughly 25% of visual tokens, boosting throughput 2-3× with minimal accuracy loss. This is what makes on-device inference practical.
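The core idea of score-guided token pruning can be sketched as follows. GSPruning's actual selection criterion is not yet public, so the per-token importance scores here (e.g. attention mass) are an assumption; this only illustrates keeping the top ~25% of tokens while preserving their order.

```python
# Minimal sketch of score-guided visual token pruning. The scoring
# signal is assumed, not GSPruning's actual (unreleased) criterion.

def prune_tokens(tokens, scores, keep_ratio=0.25):
    """Keep the highest-scoring fraction of tokens, preserving order."""
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

tokens = ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7"]
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.7, 0.15]
print(prune_tokens(tokens, scores))  # keeps 2 of 8 tokens -> ['t1', 't3']
```

Dropping three quarters of the visual tokens shrinks the prefill roughly proportionally, which is where the 2-3× throughput gain comes from.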
Think-Act-Verify Loop: Instead of generating a full action sequence upfront, Mano-P executes one step at a time: analyze the screen → perform an action → verify the result → plan the next step. Far more robust than one-shot planning.
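The loop structure can be sketched as below. All helper functions (screen capture, analysis, action execution, verification) are hypothetical stand-ins for what the model does internally.

```python
# Sketch of a think-act-verify loop. The helpers passed in are
# hypothetical stand-ins for the model's perception and actuation.

def run_task(goal, screenshot_fn, analyze, act, verify, max_steps=20):
    for _ in range(max_steps):
        screen = screenshot_fn()       # capture the current screen
        plan = analyze(screen, goal)   # think: decide the next action
        if plan is None:               # nothing left to do: task done
            return True
        act(plan)                      # act: click / type / scroll
        if not verify(screenshot_fn(), plan):
            continue                   # verify failed: re-analyze and retry
    return False                       # gave up after max_steps
```

Because each step re-observes the screen before planning, a mis-click or an unexpected dialog just triggers another analyze-act-verify cycle instead of derailing a precomputed plan.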
Open-Source Roadmap
We're releasing Mano-P in three stages:
| Phase | What | Status |
|---|---|---|
| Phase 1 | Mano-CUA Skill — ready to use | ✅ Today |
| Phase 2 | Local model + SDK — zero cloud dependency | 🔜 Coming soon |
| Phase 3 | Training methods + GSPruning + quantization | 📋 Planned |
The end goal: the entire stack — training, pruning, quantization, deployment — fully open to the community.
Get Started
Option 1: CLI (mano-cua)
brew tap HanningWang/tap && brew install mano-cua
Then run:
mano-cua run "Open WeChat and tell FTY the meeting is postponed"
Option 2: Agent Skill (mano-skill)
clawhub install mano-cua
Once installed, your agent can autonomously invoke Mano-P for GUI operations — no manual triggering needed. The agent decides when GUI interaction is required and calls Mano-P automatically.
Option 3: Python SDK (mano-client)
Coming soon. Watch the repo for updates.
What Can You Do With It?
Mano-afk: Fully Autonomous App Building
Give it a natural language requirement. Mano-P autonomously handles the entire pipeline: requirements analysis → architecture → code → test → fix → verify. No human in the loop.
Personal AI: Learning Your Habits
Mano-P can learn and adapt to your personal operational style. A fun example: it can play mahjong following your specific strategy — not the optimal strategy, but your style. That's what Personal AI means in practice.
Why "Personal AI"?
When a capable GUI agent runs on your local device, handles your data without uploading anything, and operates all the software you use daily — it stops being a "tool" and starts becoming a truly personal AI.
With Mano-P open-sourced under Apache 2.0, every developer, every team, every organization can build their own version. This is the beginning.
⭐ Star, Clone, Try It
https://github.com/Mininglamp-AI/Mano-P
Apache 2.0. Issues, PRs, and contributions welcome.