Mininglamp
Apple Took 50 Years for 3 CEOs — GUI Agents Went from Paper to Production in One

Yesterday, Apple announced a landmark succession: Tim Cook steps down as CEO to become Executive Chairman, with John Ternus taking over on September 1. In its 50-year history, Apple has had just three CEOs: Jobs, Cook, Ternus.

Three people. Fifty years. Each transition more than a decade apart.

Now consider the AI Agent space: one year ago, most people were still debating whether AI could operate a computer at all. Today, there are open-source projects delivering usable on-device solutions.

This article breaks down the technical evolution of GUI Agents — using Mano-P, our open-source project, as a concrete example of what it takes to go from training to on-device deployment.

What Is a GUI Agent?

A GUI Agent's core mission: let AI operate a computer's graphical interface the way a human does — recognizing screen elements, understanding task intent, and executing clicks, typing, and drag-and-drop operations.

There are currently two main technical approaches:

| Approach | Mechanism | Strength | Limitation |
| --- | --- | --- | --- |
| API/DOM-driven | Reads interface structure via accessibility APIs or DOM trees | Precise element targeting | Depends on app-specific interfaces |
| Pure vision | Understands the UI from screenshots alone | Works across any application | Higher demand on visual comprehension |

Mano-P takes the pure-vision route. Designed for the Mac, it's an on-device GUI Agent: "Mano" is Spanish for "hand," and the "P" stands for Personal. It runs entirely locally; no data leaves the device.
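Because a pure-vision agent sees only pixels, its actions are naturally expressed in screen coordinates rather than the element IDs an API/DOM-driven agent would use. A hypothetical action schema (illustrative only, not Mano-P's actual interface) might look like:

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class GUIAction:
    """One low-level GUI operation, addressed purely by screen position."""
    kind: Literal["click", "type", "scroll", "drag"]
    x: int                       # target screen coordinate (pixels)
    y: int
    text: Optional[str] = None   # payload for "type" actions
    dx: int = 0                  # delta for "scroll"/"drag" actions
    dy: int = 0

# A click needs only coordinates; a type action carries its text payload.
click = GUIAction(kind="click", x=412, y=96)
typing = GUIAction(kind="type", x=412, y=96, text="hello")
```

The point of the sketch: with no accessibility tree to lean on, the model itself must translate "the Submit button" into `(x, y)` from the screenshot, which is exactly where the higher demand on visual comprehension comes from.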

Mano-P Open Source Architecture

Training: Bidirectional Self-Reinforcement Learning

The training pipeline follows a three-stage progressive framework:

Stage 1: SFT (Supervised Fine-Tuning)
    ↓  Build foundational capabilities
Stage 2: Offline Reinforcement Learning
    ↓  Learn strategy optimization from historical data
Stage 3: Online Reinforcement Learning
    ↓  Continuously improve through real-environment interaction

Stage 1 — SFT: Supervised fine-tuning on high-quality GUI operation datasets. The model learns basic interface understanding and action mapping from ground-truth demonstrations.

Stage 2 — Offline RL: Uses collected interaction trajectories to optimize policies via reinforcement learning. Extracts success/failure signals from historical operations without requiring live environment interaction, keeping training costs manageable.
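Extracting success/failure signals from logged trajectories can be sketched with advantage-weighted sample weights, a generic offline-RL pattern (the post doesn't name Mano-P's actual algorithm; all trajectories and numbers below are made up):

```python
import numpy as np

# Hypothetical logged trajectories: (state_id, action_id, episode_return)
trajectories = [
    (0, 1, 1.0),   # action sequence that completed the task
    (0, 2, 0.0),   # failed attempt from the same state
    (1, 0, 1.0),
]

def awr_weights(returns, baseline, beta=0.5):
    """Advantage-weighted weights: exp((R - V(s)) / beta), clipped for stability.

    Transitions that beat the baseline get up-weighted; failures get
    down-weighted, so a supervised update on the logged actions
    emphasizes behavior that historically led to success.
    """
    adv = np.asarray(returns, dtype=np.float64) - baseline
    return np.minimum(np.exp(adv / beta), 20.0)

returns = [r for _, _, r in trajectories]
baseline = float(np.mean(returns))   # crude value estimate over the batch
w = awr_weights(returns, baseline)
```

No live environment is touched: everything comes from the recorded returns, which is what keeps the training cost of this stage manageable.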

Stage 3 — Online RL: Interacts with real GUI environments, adjusting strategy based on live feedback. The key challenge here is balancing exploration (trying new operation paths) with exploitation (reinforcing proven strategies).
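The exploration/exploitation balance mentioned above is classically handled with an epsilon-greedy policy — a minimal illustration, assuming the agent keeps per-action value estimates (the post doesn't specify Mano-P's actual strategy):

```python
import random

def choose_action(q_values, epsilon=0.1, rng=random):
    """Epsilon-greedy action selection.

    With probability epsilon, explore: try a random operation path.
    Otherwise, exploit: take the action with the highest estimated value.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

q = [0.2, 0.9, 0.5]                      # hypothetical per-action success estimates
greedy = choose_action(q, epsilon=0.0)   # epsilon=0 always exploits: index 1
```

In practice epsilon is decayed over training: explore broadly early, then increasingly reinforce the operation paths that have proven reliable.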

Inference: Think-Act-Verify Loop

The inference mechanism uses a think-act-verify cycle:

while not task_complete:
    # Think: capture and analyze the current screen, plan the next action
    screenshot = capture_screen()
    thought = model.think(screenshot, task_context)

    # Act: execute a GUI operation (click, type, scroll)
    action = model.act(thought)
    execute(action)

    # Verify: capture a new screenshot, check the result
    new_screenshot = capture_screen()
    verified = model.verify(new_screenshot, expected_state)

    if not verified:
        task_context.update(error_info)  # feed the error back into Think

This gives the Agent self-correction capability. In real desktop environments, unexpected popups, loading delays, and dynamic element repositioning are common — the verify step catches these before errors cascade.
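For transient states specifically, a bounded retry before declaring failure is a common pattern: a loading spinner often just needs a second capture. `verify_with_retry` below is a hypothetical helper sketching the idea, not part of the Mano-P API:

```python
import time

def verify_with_retry(verify_fn, capture_fn, retries=3, delay=1.0):
    """Re-check the screen up to `retries` times before declaring failure.

    Transient states (loading spinners, popups mid-dismissal) can make a
    single verification misfire; waiting and re-capturing filters those
    out before the error-correction path is triggered.
    """
    for _ in range(retries):
        if verify_fn(capture_fn()):
            return True
        time.sleep(delay)   # give the UI time to settle
    return False

# Simulated flaky screen: the expected state only appears on the third capture.
captures = iter([False, False, True])
ok = verify_with_retry(lambda state: state, lambda: next(captures),
                       retries=3, delay=0.0)
```

The retry bound matters: a genuinely failed action should still fall through to the Think step quickly rather than stall the whole task.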

Core capabilities span four areas: complex GUI automation, cross-system data integration, long-task planning and execution, and intelligent report generation.

Benchmark Performance

OSWorld: Mano-P's 72B model achieves a 58.2% success rate, ranking #1 among specialized GUI-agent models; second place scores 45.0%. OSWorld simulates real OS environments with cross-application tasks including file operations, browser interactions, and office-software workflows.

WebRetriever Protocol I: Scores 41.7 NavEval, surpassing Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3). This benchmark focuses on web information retrieval and interaction.

Mano-P Benchmark Overview

Edge Deployment: 4B Model Running On-Device

On-device deployment is a core feature of Mano-P. Here's how the 4B quantized model (w4a16) performs on an M4 Pro:

| Metric | Value |
| --- | --- |
| Prefill speed | 476 tokens/s |
| Decode speed | 76 tokens/s |
| Peak memory | 4.3 GB |

The w4a16 quantization scheme — 4-bit weights with 16-bit activations — strikes a practical balance: 4-bit weights dramatically reduce memory footprint while 16-bit activations preserve numerical precision during inference.
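A minimal sketch of what w4a16 means in practice, assuming symmetric per-tensor scaling (production schemes typically use finer per-group scales): weights are stored as 4-bit integers, while activations and arithmetic stay in 16-bit floats.

```python
import numpy as np

def quantize_w4(w):
    """Symmetric 4-bit quantization: map the largest weight to +/-7."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4-bit range
    return q, scale

def dequantize_a16(q, scale):
    """Reconstruct weights in float16, the activation precision."""
    return (q.astype(np.float16) * scale).astype(np.float16)

w = np.array([0.12, -0.5, 0.33, 0.07], dtype=np.float32)
q, s = quantize_w4(w)
w_hat = dequantize_a16(q, s)

# Storage drops from 32 to 4 bits per weight (~8x smaller),
# at the cost of a small, bounded reconstruction error.
err = np.abs(w - w_hat.astype(np.float32)).max()
```

Back-of-envelope, this is why the numbers above are plausible: at 4 bits per weight, a 4B-parameter model needs roughly 4e9 × 0.5 B ≈ 2 GB for weights alone, consistent with a 4.3 GB peak once 16-bit activations and runtime buffers are included.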

Hardware requirement: Apple M4 chip + 32 GB RAM. Fully local execution — your screen data never leaves your device.

Getting Started

Open-sourced under the Apache 2.0 license:

# Install
brew tap HanningWang/tap && brew install mano-cua

GitHub: https://github.com/Mininglamp-AI/Mano-P

Wrapping Up

From the three-stage progressive training framework, to think-act-verify inference, to w4a16 quantization enabling edge deployment — the path from "concept" to "locally usable" GUI Agents is becoming clear.

Apple took 50 years and three leaders. The GUI Agent space went from academic papers to open-source tools in roughly one year. These are two fundamentally different timescales.

For developers, Mano-P — Apache 2.0 licensed, runnable on a local Mac — is already a starting point for exploration and experimentation.
