In 2020, if you wanted to automate a desktop app, you'd write an RPA script — record mouse movements, hardcode coordinates, and pray the UI never changed.
In 2024, if you wanted an AI to operate a browser, you'd use a CDP-based agent — one that reads the DOM, parses HTML, and executes tasks inside Chrome.
In 2026, there's a model that looks at a screenshot, understands the interface, and clicks, types, and switches windows like a human — no API needed, no HTML parsing, no knowledge of the underlying tech stack.
These three stages represent three paradigm shifts in GUI automation over the past few years.
Let's break down how we got here.
## Generation 1: RPA — Record and Replay
Traditional RPA (UiPath, Blue Prism, Automation Anywhere) boils down to one idea: record what a human does, then replay it.
Under the hood, it simulates mouse and keyboard events at the OS level. Early versions used coordinate-based targeting — change the resolution and everything breaks. Later iterations added control-tree recognition (Windows UI Automation, macOS Accessibility API) and image matching.
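The record-and-replay model can be sketched in a few lines. This is a toy illustration (real tools inject OS-level events, the way libraries like pyautogui do); the point is that the "script" is nothing but coordinates and keystrokes, with no understanding attached:

```python
from dataclasses import dataclass

# A recorded RPA script is essentially a list of coordinate-bound input
# events. Real tools emit OS-level mouse/keyboard events; this toy
# version just shows the data model and why it is brittle.
@dataclass
class Event:
    kind: str        # "click" or "type"
    x: int = 0       # absolute screen coordinates -- resolution-dependent
    y: int = 0
    text: str = ""

recorded = [
    Event("click", x=412, y=305),   # "Login" button, recorded at 1920x1080
    Event("type", text="alice"),
    Event("click", x=412, y=360),
]

def replay(events, scale=1.0):
    """Replay events, optionally rescaling coordinates for a new display.

    Rescaling is a crude fix: it breaks the moment the layout (not just
    the resolution) changes, which is the core fragility of Gen-1 RPA.
    """
    actions = []
    for e in events:
        if e.kind == "click":
            actions.append(("click", round(e.x * scale), round(e.y * scale)))
        else:
            actions.append(("type", e.text))
    return actions

# Same script on a 2560x1440 display: every coordinate shifts.
print(replay(recorded, scale=1440 / 1080))
```

Note that nothing in the data model says *what* the click at (412, 305) means — move the login button and the script clicks empty space without noticing.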
RPA still powers automation at banks, insurance companies, and government systems today. But for developers, it has structural problems:
- Brittle: Change one pixel in the UI and the script breaks
- Zero understanding: It doesn't know what it's doing — just mechanically repeating
- High maintenance: Every UI change requires re-recording
- Limited scope: Cross-application, cross-platform workflows are painful
RPA was always "automation for non-technical users," not something that excited developers.
## Generation 2: Browser CUA — DOM-Based Agents
In 2024–2025, LLMs got good enough to understand web pages. A new class of solutions emerged:
- Use Chrome DevTools Protocol (CDP) to grab the page DOM
- Feed DOM/HTML fragments to an LLM for comprehension
- LLM outputs action instructions (click element X, fill form Y)
- Execute via CDP
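The four steps above can be sketched as a loop. Here `get_dom` and `llm` are stubs standing in for the real pieces (a CDP websocket call like `DOM.getDocument` and a hosted model, respectively) — the stubs exist so the control flow is visible and runnable:

```python
import json

def get_dom():
    # Stub: a real agent fetches this over the Chrome DevTools websocket.
    return '<form><input id="q" placeholder="Search"><button id="go">Go</button></form>'

def llm(prompt):
    # Stub: a real agent ships the DOM (and whatever data it contains)
    # to a cloud LLM and gets back a structured action plan.
    return json.dumps([
        {"action": "fill", "selector": "#q", "value": "on-device agents"},
        {"action": "click", "selector": "#go"},
    ])

def run_task(goal):
    dom = get_dom()
    plan = json.loads(llm(f"Goal: {goal}\nDOM: {dom}\nReturn JSON actions."))
    for step in plan:
        # A real executor maps each step back to a CDP command, e.g.
        # Runtime.evaluate on the selector or Input.dispatchMouseEvent.
        print(f"execute via CDP: {step}")
    return plan

plan = run_task("search for on-device agents")
```

Notice the two structural dependencies baked into the loop: the agent only works where a DOM exists, and the DOM has to leave the machine to be understood.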
The improvement was real: LLMs brought understanding instead of mechanical replay. But the limitations were equally clear:
- Locked inside the browser: CDP is a Chrome protocol. Desktop apps, native apps, games, 3D tools — none of them work
- Depends on HTML structure: Complex or dynamically rendered pages produce massive, unreliable DOM trees
- Data security: DOM content (including your login state and sensitive data) gets sent to a cloud LLM
For developers, this solved "browser automation" but not "general GUI automation."
## Generation 3: Pure-Vision GUI Agents — See the Screen, Not the Code
Starting in late 2025, a fundamentally different approach matured: models that take a screenshot as input and output actions like "click at (x, y)" or "type 'hello world'."
The key difference from everything before: no dependency on any underlying protocol or interface. No CDP, no Accessibility API, no need to know what framework the app was built with. Input is a screenshot. Output is an action.
Coverage is theoretically unlimited — any application with a graphical interface can be operated. Desktop software, browsers, games, 3D modeling tools, even apps inside a remote desktop session.
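Stripped to its shape, the pure-vision loop looks like this. All names here are illustrative (this is not Mano-P's actual API); the stubs stand in for a screen-capture call and a VLA model:

```python
def capture_screen():
    # Stub: a real agent grabs raw pixels via the OS screenshot API.
    return b"\x89PNG...fake bytes"

def vla_model(image, goal, history):
    # Stub: a real VLA model grounds UI elements in the pixels and emits
    # one action per step. Nothing here touches DOM, CDP, or any SDK --
    # the only inputs are pixels, the goal, and prior actions.
    if not history:
        return {"action": "click", "x": 512, "y": 384}
    return {"action": "type", "text": "hello world", "done": True}

def run(goal, max_steps=10):
    history = []
    for _ in range(max_steps):
        act = vla_model(capture_screen(), goal, history)
        history.append(act)  # memory, needed for multi-step planning
        # An executor would inject the event at OS level (click/type/scroll),
        # then the next iteration re-observes the screen to verify the result.
        if act.get("done"):
            break
    return history

trace = run("write a greeting")
```

The observe-act-reobserve cycle is also where error recovery lives: because every step starts from a fresh screenshot, the model can notice that the screen doesn't look the way its plan expected.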
The technical challenges are significant:
- GUI Grounding: The model needs to precisely locate and understand interface elements
- Multi-step planning: Complex tasks require sequences of actions with memory
- Error recovery: When something goes wrong, the model needs to detect the anomaly and self-correct
This approach splits into two paths — cloud (screenshots sent to remote servers) and on-device (inference runs locally). Same technique, completely different data flow.
## On-Device Pure-Vision: Where It Gets Interesting
Let me use a concrete example to show where on-device GUI agents stand today.
Mano-P 1.0 is a GUI-VLA (Vision-Language-Action) agent model purpose-built for on-device deployment. Pure vision, no CDP, no HTML parsing.
### Benchmark results
On OSWorld — the academic community's standard benchmark for desktop GUI agents — the Mano-P 72B model achieved a 58.2% success rate, ranking #1 among proprietary models globally.
For context: the other four models in the top 5 are all 100B+ general-purpose models. That a 72B model purpose-built for GUI scenarios beats them says something about the efficiency of specialization versus brute-force scaling.
Across a broader evaluation, Mano-P hit SOTA on 13 benchmark leaderboards spanning GUI grounding, perception, video understanding, and in-context learning.
### On-device performance
The 4B quantized model (w4a16) runs at 476 tokens/s prefill, 76 tokens/s decode on Apple M4 Pro, with peak memory of just 4.3GB.
That means on an M4 Mac mini or MacBook with 32GB RAM, you can run an OSWorld-champion-level GUI agent entirely on-device. No data ever leaves your machine.
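Those throughput figures translate into per-step latency you can estimate directly. The token counts below are assumptions for illustration (the published numbers are only the two throughput rates):

```python
# Back-of-envelope step latency from the published throughput numbers.
prefill_tps = 476    # tokens/s prefill on Apple M4 Pro (published)
decode_tps = 76      # tokens/s decode on Apple M4 Pro (published)

prompt_tokens = 1500  # assumed: pruned screenshot tokens + instruction
action_tokens = 40    # assumed: one short JSON action

latency_s = prompt_tokens / prefill_tps + action_tokens / decode_tps
print(f"~{latency_s:.1f}s per agent step")  # -> ~3.7s per agent step
```

A few seconds per observe-act step is slower than a human click, but well inside usable territory for unattended workflows.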
One command to install:

```shell
brew install mano-cua
```
No API key. No cloud config. No worrying about where your screenshots end up.
## The Comparison Table Developers Actually Want
| Dimension | Traditional RPA | Browser CUA | Cloud Computer Use | On-Device GUI Agent (Mano-P) |
|---|---|---|---|---|
| Perception | Coordinates / control tree / image matching | DOM / HTML parsing | Cloud screenshot + vision model | Local screenshot + vision model |
| Coverage | Single app | Browser only | Theoretically all platforms | All platforms |
| Understanding | None | Yes (HTML-based) | Yes (vision-based) | Yes (vision-based) |
| Data flow | Local | DOM sent to cloud | Screenshots uploaded to cloud | Data never leaves device |
| Robustness | Low (breaks on UI change) | Medium (depends on DOM stability) | High | High |
| Deployment | Local RPA engine | Browser + API | Cloud API + network | Local device (e.g., M4 Mac + 32GB) |
There's a frequently overlooked distinction here: cloud Computer Use and on-device GUI agents use the same technique (pure vision), but the data flow is completely different.
Cloud solutions send your screenshots — everything on your screen, including code, emails, and credentials — to a remote server. For many developers, that's a non-starter.
On-device solutions run inference locally. Screenshots processed locally. Actions executed locally. This isn't "we added encryption" level security — it's physically eliminating the possibility of data leakage.
## Why On-Device Only Became Possible Now
Two changes made this viable:
Hardware: Apple's M4 unified memory architecture gave consumer devices the foundation to run medium-scale models. M4 + 32GB unified memory + high-bandwidth memory bus — this was workstation-grade hardware two years ago.
Model compression: Mano-P's GSPruning visual token pruning + w4a16 quantization keeps the 4B model at 4.3GB peak memory with 476 tokens/s throughput. That's a fully usable inference speed.
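A quick sanity check shows why w4a16 is what makes the 4.3GB figure plausible. The overhead breakdown here is an assumption for illustration, not Mano-P's actual memory accounting:

```python
# w4a16 = 4-bit weights, 16-bit activations. For a 4B-parameter model,
# the weights alone come to:
params = 4e9
weight_bytes = params * 4 / 8   # 4 bits per parameter
weights_gb = weight_bytes / 1e9
print(f"weights alone: ~{weights_gb:.1f} GB of the 4.3 GB peak")

# The remaining ~2.3 GB of headroom covers fp16 activations, KV cache,
# and runtime buffers (assumed split). At fp16 the weights alone would
# be ~8 GB -- already crowding a 16 GB machine before inference starts.
print(f"same model at fp16 weights: ~{params * 2 / 1e9:.0f} GB")
```

In other words, quantization is not an optimization on top of on-device deployment — it's the thing that makes consumer-hardware deployment arithmetic work at all.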
## What's the Endgame?
When an AI agent can see any screen, understand intent, and operate any graphical interface, it has the same software-usage capability as a human user. It doesn't need APIs, doesn't wait for integrations, doesn't learn each tool's SDK.
The implications:
- Long-tail software gets activated: Millions of professional tools with no API can suddenly be operated by agents
- Cross-application workflows become possible: Design in Figma, compile in Terminal, deploy in browser — all via GUI
- The walls between software break down: No data export/import needed — the agent just operates at the interface level
With benchmark scores above 50% on complex desktop tasks, we're watching GUI agents cross from "lab demo" to "developer-usable."
## Try It
Mano-P 1.0 is open source under Apache 2.0.
```shell
brew install mano-cua
```
👉 GitHub: github.com/Mininglamp-AI/Mano-P
What's your take — is on-device the right path for GUI agents, or is cloud compute still the pragmatic choice? Drop your thoughts below.