In 2020, if you wanted to automate a desktop app, you'd write an RPA script — record mouse movements, hardcode coordinates, and pray the UI never changed.
In 2024, if you wanted an AI to operate a browser, you'd use a CDP-based agent — one that reads the DOM, parses HTML, and executes tasks inside Chrome.
In 2026, there's a model that looks at a screenshot, understands the interface, and clicks, types, and switches windows like a human — no API needed, no HTML parsing, no knowledge of the underlying tech stack.
These three stages represent three paradigm shifts in GUI automation over the past few years.
Let's break down how we got here.
## Generation 1: RPA — Record and Replay
Traditional RPA (UiPath, Blue Prism, Automation Anywhere) boils down to one idea: record what a human does, then replay it.
Under the hood, it simulates mouse and keyboard events at the OS level. Early versions used coordinate-based targeting — change the resolution and everything breaks. Later iterations added control-tree recognition (Windows UI Automation, macOS Accessibility API) and image matching.
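The record-and-replay model can be sketched in a few lines. This is a toy illustration (real tools inject OS-level events, the way libraries like pyautogui do); the point is that the "script" is nothing but coordinates and keystrokes, with no understanding attached:

```python
from dataclasses import dataclass

# A recorded RPA script is essentially a list of coordinate-bound input
# events. Real tools emit OS-level mouse/keyboard events; this toy
# version just shows the data model and why it is brittle.
@dataclass
class Event:
    kind: str        # "click" or "type"
    x: int = 0       # absolute screen coordinates -- resolution-dependent
    y: int = 0
    text: str = ""

recorded = [
    Event("click", x=412, y=305),   # "Login" button, recorded at 1920x1080
    Event("type", text="alice"),
    Event("click", x=412, y=360),
]

def replay(events, scale=1.0):
    """Replay events, optionally rescaling coordinates for a new display.

    Rescaling is a crude fix: it breaks the moment the layout (not just
    the resolution) changes, which is the core fragility of Gen-1 RPA.
    """
    actions = []
    for e in events:
        if e.kind == "click":
            actions.append(("click", round(e.x * scale), round(e.y * scale)))
        else:
            actions.append(("type", e.text))
    return actions

# Same script on a 2560x1440 display: every coordinate shifts.
print(replay(recorded, scale=1440 / 1080))
```

Note that nothing in the data model says *what* the click at (412, 305) means — move the login button and the script clicks empty space without noticing.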
RPA still powers automation at banks, insurance companies, and government systems today. But for developers, it has structural problems:
- Brittle: Change one pixel in the UI and the script breaks
- Zero understanding: It doesn't know what it's doing — just mechanically repeating
- High maintenance: Every UI change requires re-recording
- Limited scope: Cross-application, cross-platform workflows are painful
RPA was always "automation for non-technical users," not something that excited developers.
## Generation 2: Browser CUA — DOM-Based Agents
In 2024–2025, LLMs got good enough to understand web pages. A new class of solutions emerged:
- Use Chrome DevTools Protocol (CDP) to grab the page DOM
- Feed DOM/HTML fragments to an LLM for comprehension
- LLM outputs action instructions (click element X, fill form Y)
- Execute via CDP
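The four steps above can be sketched as a loop. Here `get_dom` and `llm` are stubs standing in for the real pieces (a CDP websocket call like `DOM.getDocument` and a hosted model, respectively) — the stubs exist so the control flow is visible and runnable:

```python
import json

def get_dom():
    # Stub: a real agent fetches this over the Chrome DevTools websocket.
    return '<form><input id="q" placeholder="Search"><button id="go">Go</button></form>'

def llm(prompt):
    # Stub: a real agent ships the DOM (and whatever data it contains)
    # to a cloud LLM and gets back a structured action plan.
    return json.dumps([
        {"action": "fill", "selector": "#q", "value": "on-device agents"},
        {"action": "click", "selector": "#go"},
    ])

def run_task(goal):
    dom = get_dom()
    plan = json.loads(llm(f"Goal: {goal}\nDOM: {dom}\nReturn JSON actions."))
    for step in plan:
        # A real executor maps each step back to a CDP command, e.g.
        # Runtime.evaluate on the selector or Input.dispatchMouseEvent.
        print(f"execute via CDP: {step}")
    return plan

plan = run_task("search for on-device agents")
```

Notice the two structural dependencies baked into the loop: the agent only works where a DOM exists, and the DOM has to leave the machine to be understood.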
The improvement was real: LLMs brought understanding instead of mechanical replay. But the limitations were equally clear:
- Locked inside the browser: CDP is a Chrome protocol. Desktop apps, native apps, games, 3D tools — none of them work
- Depends on HTML structure: Complex or dynamically rendered pages produce massive, unreliable DOM trees
- Data security: DOM content (including your login state and sensitive data) gets sent to a cloud LLM
For developers, this solved "browser automation" but not "general GUI automation."
## Generation 3: Pure-Vision GUI Agents — See the Screen, Not the Code
Starting in late 2025, a fundamentally different approach matured: models that take a screenshot as input and output actions like "click at (x, y)" or "type 'hello world'."
The key difference from everything before: no dependency on any underlying protocol or interface. No CDP, no Accessibility API, no need to know what framework the app was built with. Input is a screenshot. Output is an action.
Coverage is theoretically unlimited — any application with a graphical interface can be operated. Desktop software, browsers, games, 3D modeling tools, even apps inside a remote desktop session.
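Stripped to its shape, the pure-vision loop looks like this. All names here are illustrative (this is not Mano-P's actual API); the stubs stand in for a screen-capture call and a VLA model:

```python
def capture_screen():
    # Stub: a real agent grabs raw pixels via the OS screenshot API.
    return b"\x89PNG...fake bytes"

def vla_model(image, goal, history):
    # Stub: a real VLA model grounds UI elements in the pixels and emits
    # one action per step. Nothing here touches DOM, CDP, or any SDK --
    # the only inputs are pixels, the goal, and prior actions.
    if not history:
        return {"action": "click", "x": 512, "y": 384}
    return {"action": "type", "text": "hello world", "done": True}

def run(goal, max_steps=10):
    history = []
    for _ in range(max_steps):
        act = vla_model(capture_screen(), goal, history)
        history.append(act)  # memory, needed for multi-step planning
        # An executor would inject the event at OS level (click/type/scroll),
        # then the next iteration re-observes the screen to verify the result.
        if act.get("done"):
            break
    return history

trace = run("write a greeting")
```

The observe-act-reobserve cycle is also where error recovery lives: because every step starts from a fresh screenshot, the model can notice that the screen doesn't look the way its plan expected.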
The technical challenges are significant:
- GUI Grounding: The model needs to precisely locate and understand interface elements
- Multi-step planning: Complex tasks require sequences of actions with memory
- Error recovery: When something goes wrong, the model needs to detect the anomaly and self-correct
This approach splits into two paths — cloud (screenshots sent to remote servers) and on-device (inference runs locally). Same technique, completely different data flow.
## On-Device Pure-Vision: Where It Gets Interesting
Let me use a concrete example to show where on-device GUI agents stand today.
Mano-P 1.0 is a GUI-VLA (Vision-Language-Action) agent model purpose-built for on-device deployment. Pure vision, no CDP, no HTML parsing.
### Benchmark results
On OSWorld — the academic community's standard benchmark for desktop GUI agents — the Mano-P 72B model achieved a 58.2% success rate, ranking #1 among proprietary models globally.
For context: the other four models in the top 5 are all 100B+ general-purpose models. That a 72B model purpose-built for GUI scenarios beats them says something about the efficiency of specialization versus brute-force scaling.
Across a broader evaluation, Mano-P hit SOTA on 13 benchmark leaderboards spanning GUI grounding, perception, video understanding, and in-context learning.
### On-device performance
The 4B quantized model (w4a16) runs at 476 tokens/s prefill, 76 tokens/s decode on Apple M4 Pro, with peak memory of just 4.3GB.
That means on an M4 Mac mini or MacBook with 32GB RAM, you can run an OSWorld-champion-level GUI agent entirely on-device. No data ever leaves your machine.
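Those throughput figures translate into per-step latency you can estimate directly. The token counts below are assumptions for illustration (the published numbers are only the two throughput rates):

```python
# Back-of-envelope step latency from the published throughput numbers.
prefill_tps = 476    # tokens/s prefill on Apple M4 Pro (published)
decode_tps = 76      # tokens/s decode on Apple M4 Pro (published)

prompt_tokens = 1500  # assumed: pruned screenshot tokens + instruction
action_tokens = 40    # assumed: one short JSON action

latency_s = prompt_tokens / prefill_tps + action_tokens / decode_tps
print(f"~{latency_s:.1f}s per agent step")  # -> ~3.7s per agent step
```

A few seconds per observe-act step is slower than a human click, but well inside usable territory for unattended workflows.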
One command to install:

```shell
brew install mano-cua
```
No API key. No cloud config. No worrying about where your screenshots end up.
## The Comparison Table Developers Actually Want
| Dimension | Traditional RPA | Browser CUA | Cloud Computer Use | On-Device GUI Agent (Mano-P) |
|---|---|---|---|---|
| Perception | Coordinates / control tree / image matching | DOM / HTML parsing | Cloud screenshot + vision model | Local screenshot + vision model |
| Coverage | Single app | Browser only | Theoretically all platforms | All platforms |
| Understanding | None | Yes (HTML-based) | Yes (vision-based) | Yes (vision-based) |
| Data flow | Local | DOM sent to cloud | Screenshots uploaded to cloud | Data never leaves device |
| Robustness | Low (breaks on UI change) | Medium (depends on DOM stability) | High | High |
| Deployment | Local RPA engine | Browser + API | Cloud API + network | Local device (e.g., M4 Mac + 32GB) |
There's a frequently overlooked distinction here: cloud Computer Use and on-device GUI agents use the same technique (pure vision), but the data flow is completely different.
Cloud solutions send your screenshots — everything on your screen, including code, emails, and credentials — to a remote server. For many developers, that's a non-starter.
On-device solutions run inference locally. Screenshots processed locally. Actions executed locally. This isn't "we added encryption" level security — it's physically eliminating the possibility of data leakage.
## Why On-Device Only Became Possible Now
Two changes made this viable:
Hardware: Apple's M4 unified memory architecture gave consumer devices the foundation to run medium-scale models. M4 + 32GB unified memory + high-bandwidth memory bus — this was workstation-grade hardware two years ago.
Model compression: Mano-P's GSPruning visual token pruning + w4a16 quantization keeps the 4B model at 4.3GB peak memory with 476 tokens/s throughput. That's a fully usable inference speed.
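A quick sanity check shows why w4a16 is what makes the 4.3GB figure plausible. The overhead breakdown here is an assumption for illustration, not Mano-P's actual memory accounting:

```python
# w4a16 = 4-bit weights, 16-bit activations. For a 4B-parameter model,
# the weights alone come to:
params = 4e9
weight_bytes = params * 4 / 8   # 4 bits per parameter
weights_gb = weight_bytes / 1e9
print(f"weights alone: ~{weights_gb:.1f} GB of the 4.3 GB peak")

# The remaining ~2.3 GB of headroom covers fp16 activations, KV cache,
# and runtime buffers (assumed split). At fp16 the weights alone would
# be ~8 GB -- already crowding a 16 GB machine before inference starts.
print(f"same model at fp16 weights: ~{params * 2 / 1e9:.0f} GB")
```

In other words, quantization is not an optimization on top of on-device deployment — it's the thing that makes consumer-hardware deployment arithmetic work at all.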
## What's the Endgame?
When an AI agent can see any screen, understand intent, and operate any graphical interface, it has the same software-usage capability as a human user. It doesn't need APIs, doesn't wait for integrations, doesn't learn each tool's SDK.
The implications:
- Long-tail software gets activated: Millions of professional tools with no API can suddenly be operated by agents
- Cross-application workflows become possible: Design in Figma, compile in Terminal, deploy in browser — all via GUI
- The walls between software break down: No data export/import needed — the agent just operates at the interface level
With benchmark scores above 50% on complex desktop tasks, we're watching GUI agents cross from "lab demo" to "developer-usable."
## Try It
Mano-P 1.0 is open source under Apache 2.0.
```shell
brew install mano-cua
```
👉 GitHub: github.com/Mininglamp-AI/Mano-P
What's your take — is on-device the right path for GUI agents, or is cloud compute still the pragmatic choice? Drop your thoughts below.