Mininglamp

Originally published at mp.weixin.qq.com

SOTA on 13 Benchmarks: Mininglamp Open-Sources GUI-VLA Model Mano-P 1.0

Original article: WeChat (Chinese)
GitHub: Mano-P

Mininglamp Technology has officially open-sourced Mano-P 1.0, an in-house GUI-aware agent model. Mano-P handles GUI perception, understanding, planning, action, and verification — all through pure vision. It can directly understand and operate desktop applications, web interfaces, and complex graphical workflows, and it runs locally on Apple M4 devices.

Mano-P moves AI beyond "look but don't touch." It executes complex tasks across platforms directly in real graphical interfaces. The project is released under Apache 2.0 with full source code available for audit, commercial use, and modification.

By combining pure visual understanding with local execution, Mano-P enables individual developers and organizations to build personalized AI at low cost while maintaining full data sovereignty.

Vision-Only: Solving the Last Mile for Complex Workflows

Most automation today depends on underlying API calls, the Chrome DevTools Protocol (CDP), or HTML parsing. These approaches fall short when dealing with non-standard applications or cross-system workflows. Mano-P takes a fundamentally different approach: pure visual understanding as the core paradigm. It requires no external APIs or protocols, and can directly understand and operate desktop software, 3D applications, and specialized professional tools, breaking free from the browser-centric ecosystem.

Mano-P also serves as an execution backbone for existing agent ecosystems. It integrates seamlessly as a skill into AI agents like OpenClaw. With this integration, agents can navigate across multiple windows and cross-application workflows, performing clicks, text input, window switching, and visual verification in a closed loop.
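
As a rough sketch of what that closed loop looks like in practice, the snippet below captures a screenshot, asks a locally served model for the next action, executes it, and repeats until the model reports completion. The endpoint, model name, and JSON action schema are assumptions for illustration, not the released Mano-P API; pyautogui stands in for the executor.

```python
# Minimal vision-only perceive-act-verify loop (illustrative sketch).
# The endpoint, model name, and action schema below are assumptions,
# not the documented Mano-P interface.
import base64
import io
import json
import time

import pyautogui
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # hypothetical local server


def screenshot_b64() -> str:
    """Capture the screen as a base64 PNG; pixels are the model's only input."""
    buf = io.BytesIO()
    pyautogui.screenshot().save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()


def next_action(instruction: str) -> dict:
    """Ask the model for the next GUI action given the current screenshot."""
    resp = requests.post(ENDPOINT, json={
        "model": "mano-p-4b",
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": f"Task: {instruction}. Reply with one JSON action."},
            {"type": "image_url",
             "image_url": {"url": "data:image/png;base64," + screenshot_b64()}},
        ]}],
    })
    return json.loads(resp.json()["choices"][0]["message"]["content"])


def run(instruction: str, max_steps: int = 30) -> bool:
    for _ in range(max_steps):
        action = next_action(instruction)  # e.g. {"type": "click", "x": 412, "y": 88}
        if action["type"] == "done":
            return True                    # model judged the task visually complete
        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])
        elif action["type"] == "type":
            pyautogui.write(action["text"], interval=0.02)
        time.sleep(0.5)                    # let the UI settle before the next screenshot
    return False
```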

This addresses a long-standing bottleneck in agent workflows: the need for human intervention. Mano-P enables not just automated build-and-test pipelines, but autonomous execution of complex business scenarios end-to-end.

SOTA on 13 Benchmarks: Raising the Bar for GUI-Specific Models

Mano-P ships in two versions: a full 72B model that pushes the performance ceiling, and a 4B quantized model (w4a16, i.e., 4-bit weights with 16-bit activations) optimized for on-device deployment.

The 72B version achieves SOTA results across 13 authoritative multimodal benchmarks, covering GUI Grounding, CUA (Computer Use Agent), multimodal perception and cognition, video understanding, and long-context learning. It sets a new performance standard for on-device GUI agents.

On the OSWorld proprietary-model benchmark, Mano-P 72B reaches a 58.2% task success rate, ranking first globally and leading the second-place OpenCUA-72B (45.0%) by 13.2 percentage points. It also tops ScreenSpot-V2, MMBench, UI-Vision, and other evaluation suites.

These results are driven by architectural innovation. Mano-P uses a three-stage progressive training pipeline: SFT (supervised fine-tuning), offline reinforcement learning, and online reinforcement learning. Combined with a proprietary GSPruning visual token pruning technique, this delivers a significant leap in on-device inference efficiency.
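
GSPruning itself has not been published, but the general idea behind visual token pruning is straightforward: score every visual patch token, keep only the most informative fraction, and let the language model attend over a much shorter sequence per screenshot. Here is a minimal PyTorch illustration of that general idea (using a stand-in saliency score, not the actual GSPruning criterion):

```python
# Generic visual token pruning sketch (NOT the GSPruning algorithm, whose details
# are not public): keep the top-k patch tokens by a stand-in saliency score.
import torch


def prune_visual_tokens(patch_tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """patch_tokens: (num_patches, hidden_dim) output of the vision encoder."""
    saliency = patch_tokens.norm(dim=-1)            # placeholder score; real methods often use attention
    k = max(1, int(keep_ratio * patch_tokens.shape[0]))
    keep = saliency.topk(k).indices.sort().values   # keep the original spatial order
    return patch_tokens[keep]


tokens = torch.randn(1024, 3584)                    # e.g. a 32x32 patch grid from one screenshot
print(prune_visual_tokens(tokens).shape)            # torch.Size([256, 3584]) -> 75% fewer visual tokens
```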

On Apple M4 Pro hardware, the 4B quantized model achieves 476 tokens/s prefill speed and 76 tokens/s decode speed, with peak memory usage of just 4.3 GB — well within the constraints of mainstream edge devices.
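
To put those figures in wall-clock terms, here is a back-of-the-envelope estimate for a single agent step, assuming an illustrative ~1,500-token screenshot prompt and a ~60-token action response (both counts are assumptions, not published measurements):

```python
# Rough per-step latency derived from the quoted M4 Pro throughput figures.
# The prompt/response token counts are illustrative assumptions.
prefill_tps, decode_tps = 476, 76
prompt_tokens, response_tokens = 1500, 60   # hypothetical screenshot prompt + short action output
latency = prompt_tokens / prefill_tps + response_tokens / decode_tps
print(f"~{latency:.1f} s per agent step")   # ~3.9 s
```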

On-Device Deployment: Air-Gapped Data Protection

As AI moves deeper into core business processes, data privacy and compliance become critical. Mano-P supports fully local on-device deployment with zero cloud data transmission. Its "pure vision + local execution" architecture enables physical isolation between data processing and external networks.

In local mode, the model runs directly on Mac mini or MacBook (M4 chip or later, 32 GB+ RAM), or via a Mano-P compute stick connected over USB 4.0. Screenshots, business data, and task instructions all stay in a local closed loop, eliminating cloud transmission risk at the source.

Mano-P also handles long-running tasks autonomously offline. Even without network access, it can independently drive complex business workflows, including mid-process decision-making and error correction.

Full Open-Source Strategy: Accelerating the Personalized AI Ecosystem

Mano-P is released under Apache 2.0 with complete client source code open for audit, commercial use, and derivative work.

To lower the barrier to entry, Mano-P offers three ready-to-use deployment modes covering different tech stacks. No complex API key setup required — users can build high-performance GUI agents with minimal configuration.

As a first step, Mininglamp is open-sourcing the Mano-CUA core skill. Users can configure it into OpenClaw or Claude Code to build smarter CUA task workflows and eliminate human-intervention bottlenecks.

The Mano-CUA local model and SDK components are expected to be open-sourced within the month, targeting developers with high-security requirements. Users will be able to call locally deployed GUI-VLA models to build custom skills and tools, with all CUA operations executing on local Mac hardware — nothing uploaded to external servers.
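
Once those components are out, a custom tool built on top of them could be as small as a single grounding call: describe a UI element in natural language and get back pixel coordinates from the locally served model. Below is a sketch under the same assumptions as the loop above (hypothetical endpoint, model name, and response format, since the SDK is not yet released):

```python
# Hypothetical GUI-grounding helper against a locally served model. The endpoint,
# model name, and response schema are assumptions; the real SDK is not yet released.
import base64
import json

import requests


def locate(element: str, screenshot_png: bytes,
           endpoint: str = "http://localhost:8000/v1/chat/completions") -> tuple[int, int]:
    """Map a natural-language element description to (x, y) pixel coordinates, vision-only."""
    image_b64 = base64.b64encode(screenshot_png).decode()
    resp = requests.post(endpoint, json={
        "model": "mano-p-4b",
        "messages": [{"role": "user", "content": [
            {"type": "text",
             "text": f"Locate '{element}' on screen. Reply as JSON: {{\"x\": int, \"y\": int}}"},
            {"type": "image_url",
             "image_url": {"url": "data:image/png;base64," + image_b64}},
        ]}],
    })
    point = json.loads(resp.json()["choices"][0]["message"]["content"])
    return point["x"], point["y"]
```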

Looking ahead, Mininglamp plans to fully open-source the underlying training methods, token pruning techniques, and mixed-precision quantization schemes behind Mano-P, enabling developers to build custom local GUI-VLA models tailored to their own use cases.

From technical breakthroughs to ecosystem building, Mano-P tightly integrates GUI perception, visual operation, local execution, and open-source collaboration. It establishes a solid technical foundation for on-device agents and charts a concrete path toward Personalized AI.


GitHub: https://github.com/Mininglamp-AI/Mano-P

Original article (Chinese): https://mp.weixin.qq.com/s/eWnQTvY0OiuHzujJE32kPw
