Yesterday, Apple announced a landmark succession: Tim Cook steps down as CEO to become Executive Chairman, with John Ternus taking over on September 1. Counting from Jobs's return in 1997, Apple has had just three CEOs: Jobs, Cook, Ternus.
Three people. Nearly three decades. Each transition spaced over a decade apart.
Now consider the AI Agent space: one year ago, most people were still debating whether AI could operate a computer at all. Today, there are open-source projects delivering usable on-device solutions.
This article breaks down the technical evolution of GUI Agents — using Mano-P, our open-source project, as a concrete example of what it takes to go from training to on-device deployment.
## What Is a GUI Agent?
A GUI Agent's core mission: let AI operate a computer's graphical interface the way a human does — recognizing screen elements, understanding task intent, and executing clicks, typing, and drag-and-drop operations.
There are currently two main technical approaches:
| Approach | Mechanism | Strength | Limitation |
|---|---|---|---|
| API/DOM-driven | Reads interface structure via accessibility APIs or DOM trees | Precise element targeting | Depends on app-specific interfaces |
| Pure vision | Understands UI from screenshots alone | Works across any application | Higher demand on visual comprehension |
Mano-P takes the pure vision route. Designed for Mac, it's an on-device GUI Agent: "Mano" means "hand" in Spanish, and the "P" stands for Personal, as in AI for the person. It runs entirely locally; no data leaves the device.
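To make the pure-vision contract concrete, here is a minimal sketch of the input path. It uses macOS's built-in `screencapture` tool; `query_vlm` is a hypothetical placeholder for whatever on-device model backend serves the agent, not Mano-P's actual API.

```python
# Sketch of the pure-vision input path: the agent sees only pixels.
import os
import subprocess
import tempfile

def capture_screen() -> bytes:
    """Grab the current screen as PNG bytes via macOS's screencapture tool."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shot.png")
        subprocess.run(["screencapture", "-x", path], check=True)  # -x: silent
        with open(path, "rb") as f:
            return f.read()

def query_vlm(image: bytes, prompt: str) -> str:
    """Hypothetical placeholder for the on-device VLM call."""
    raise NotImplementedError

def next_action(task: str) -> str:
    # No DOM, no accessibility tree: raw pixels plus the task description go in,
    # and a grounded action string such as "click(412, 305)" comes out.
    return query_vlm(image=capture_screen(), prompt=f"Task: {task}\nNext action?")
```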
## Training: Bidirectional Self-Reinforcement Learning
The training pipeline follows a three-stage progressive framework:
```
Stage 1: SFT (Supervised Fine-Tuning)
    ↓ Build foundational capabilities
Stage 2: Offline Reinforcement Learning
    ↓ Learn policy optimization from historical data
Stage 3: Online Reinforcement Learning
    ↓ Continuously improve through real-environment interaction
```
Stage 1 — SFT: Supervised fine-tuning on high-quality GUI operation datasets. The model learns basic interface understanding and action mapping; this stage builds capability from ground-truth demonstrations.
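For intuition, here is what this stage reduces to mechanically. This is a generic sketch, not Mano-P's released training code, and it assumes a Hugging Face-style vision-language model whose forward pass returns a cross-entropy loss when given labels:

```python
# Generic SFT step for a GUI agent (illustrative, not Mano-P's actual code).
# The loss covers only the action tokens; instruction/context positions are
# masked out with the conventional -100 label value.
import torch

def sft_step(model, batch, optimizer):
    outputs = model(
        pixel_values=batch["pixel_values"],  # screenshot tensor
        input_ids=batch["input_ids"],        # instruction + action tokens
        labels=batch["labels"],              # action tokens; context = -100
    )
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```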
Stage 2 — Offline RL: Uses collected interaction trajectories to optimize policies via reinforcement learning. Extracts success/failure signals from historical operations without requiring live environment interaction, keeping training costs manageable.
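The post doesn't spell out the offline objective; one common recipe that matches this description is advantage-weighted imitation: clone the logged actions, but upweight steps from trajectories that succeeded. A simplified sketch under the same assumed model interface as above:

```python
# Advantage-weighted imitation from logged trajectories (an illustrative
# offline-RL recipe, not necessarily the one Mano-P uses).
import torch

def offline_rl_step(model, batch, optimizer, beta=1.0):
    # reward: 1.0 for steps from successful trajectories, 0.0 otherwise;
    # baseline: e.g. the mean success rate across the logged dataset.
    advantage = batch["reward"] - batch["baseline"]
    weight = torch.exp(advantage / beta).clamp(max=20.0).mean()  # stabilize

    # Mean negative log-likelihood of the logged actions (HF-style .loss).
    nll = model(
        pixel_values=batch["pixel_values"],
        input_ids=batch["input_ids"],
        labels=batch["labels"],
    ).loss

    loss = weight * nll  # batch-level weighting; per-sample omitted for brevity
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```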
Stage 3 — Online RL: Interacts with real GUI environments, adjusting its policy based on live feedback. The key challenge here is balancing exploration (trying new operation paths) with exploitation (reinforcing proven strategies), as sketched below.
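One standard way to implement that balance is to anneal the sampling temperature over training while keeping a small epsilon of pure exploration. Again a sketch under assumed interfaces; `model.sample_action` is hypothetical, not Mano-P's API:

```python
# Exploration vs. exploitation in the online stage (illustrative sketch).
import random

def choose_action(model, screenshot, context, step, total_steps, eps=0.05):
    if random.random() < eps:
        # Explore: high-temperature sampling tries new operation paths.
        return model.sample_action(screenshot, context, temperature=1.0)
    # Exploit: anneal toward near-greedy decoding to reinforce proven strategies.
    temperature = max(0.2, 1.0 - 0.8 * (step / total_steps))
    return model.sample_action(screenshot, context, temperature=temperature)
```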
## Inference: Think-Act-Verify Loop
The inference mechanism uses a think-act-verify cycle:
```python
while task_not_complete:
    # Think: analyze current screen, plan next action
    screenshot = capture_screen()
    thought = model.think(screenshot, task_context)

    # Act: execute GUI operation (click, type, scroll)
    action = model.act(thought)
    execute(action)

    # Verify: capture new screenshot, check result
    new_screenshot = capture_screen()
    verified = model.verify(new_screenshot, expected_state)
    if not verified:
        task_context.update(error_info)  # back to Think
```
This gives the agent a self-correction capability. In real desktop environments, unexpected popups, loading delays, and dynamic element repositioning are common; the verify step catches them before errors cascade.
Core capabilities span four areas: complex GUI automation, cross-system data integration, long-task planning and execution, and intelligent report generation.
## Benchmark Performance
OSWorld: Mano-P's 72B model achieves a 58.2% success rate, ranking #1 among specialized GUI agent models; second place scores 45.0%. OSWorld simulates real OS environments with cross-application tasks spanning file operations, browser interactions, and office software workflows.
WebRetriever Protocol I: Scores 41.7 on NavEval, surpassing Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3). This benchmark focuses on web information retrieval and interaction.
## Edge Deployment: 4B Model Running On-Device
On-device deployment is a core feature of Mano-P. Here's how the quantized 4B model (w4a16) performs on an M4 Pro:
| Metric | Value |
|---|---|
| Prefill Speed | 476 tokens/s |
| Decode Speed | 76 tokens/s |
| Peak Memory | 4.3 GB |
The w4a16 quantization scheme — 4-bit weights with 16-bit activations — strikes a practical balance: 4-bit weights dramatically reduce memory footprint while 16-bit activations preserve numerical precision during inference.
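For intuition, here is a toy version of the scheme: symmetric per-group int4 quantization of the weights, with higher-precision activations at matmul time. This is illustrative only; real kernels pack two 4-bit codes per byte and fuse dequantization into the matmul.

```python
# Toy w4a16: 4-bit weight codes + per-group fp16 scales. Activations keep
# 16-bit precision on device (fp32 below so the demo runs anywhere).
import torch

def quantize_w4(weight: torch.Tensor, group_size: int = 128):
    w = weight.reshape(-1, group_size)
    scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp_min(1e-8)
    codes = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)  # int4 range
    return codes, scale.half()

def dequantize_w4(codes, scale, shape):
    return (codes.float() * scale.float()).reshape(shape)

W = torch.randn(4096, 4096)
codes, scales = quantize_w4(W)         # ~4 bits/weight plus small per-group scales
x = torch.randn(1, 4096)               # activation (fp16 on the actual device)
y = x @ dequantize_w4(codes, scales, W.shape)  # dequantize-then-matmul
```

As rough back-of-envelope arithmetic, 4 billion weights at 4 bits each is about 2 GB, which is consistent with the reported 4.3 GB peak once activations, scales, and the KV cache are added.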
Hardware requirement: Apple M4 chip + 32 GB RAM. Fully local execution — your screen data never leaves your device.
## Getting Started
Open-sourced under the Apache 2.0 license:
```bash
# Install
brew tap HanningWang/tap && brew install mano-cua
```
GitHub: https://github.com/Mininglamp-AI/Mano-P
## Wrapping Up
From the three-stage progressive training framework, to think-act-verify inference, to w4a16 quantization enabling edge deployment — the path from "concept" to "locally usable" GUI Agents is becoming clear.
Apple took nearly three decades and three leaders. The GUI Agent space went from academic papers to open-source tools in roughly one year. These are two fundamentally different timescales.
For developers, Mano-P — Apache 2.0 licensed, runnable on a local Mac — is already a starting point for exploration and experimentation.

