<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mininglamp</title>
    <description>The latest articles on DEV Community by Mininglamp (@mininglamp).</description>
    <link>https://dev.to/mininglamp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3846168%2F6a138840-d665-4ba6-aedf-1b5c492035c4.png</url>
      <title>DEV Community: Mininglamp</title>
      <link>https://dev.to/mininglamp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mininglamp"/>
    <language>en</language>
    <item>
      <title>The Evolution of GUI Agents: From RPA Scripts to AI That Sees Your Screen</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Thu, 09 Apr 2026 07:12:14 +0000</pubDate>
      <link>https://dev.to/mininglamp/the-evolution-of-gui-agents-from-rpa-scripts-to-ai-that-sees-your-screen-4mkc</link>
      <guid>https://dev.to/mininglamp/the-evolution-of-gui-agents-from-rpa-scripts-to-ai-that-sees-your-screen-4mkc</guid>
      <description>&lt;p&gt;In 2020, if you wanted to automate a desktop app, you'd write an RPA script — record mouse movements, hardcode coordinates, and pray the UI never changed.&lt;/p&gt;

&lt;p&gt;In 2024, if you wanted an AI to operate a browser, you'd use a CDP-based agent — one that reads the DOM, parses HTML, and executes tasks inside Chrome.&lt;/p&gt;

&lt;p&gt;In 2026, there's a model that looks at a screenshot, understands the interface, and clicks, types, and switches windows like a human — no API needed, no HTML parsing, no knowledge of the underlying tech stack.&lt;/p&gt;

&lt;p&gt;These three stages represent three paradigm shifts in GUI automation over the past few years.&lt;/p&gt;

&lt;p&gt;Let's break down how we got here.&lt;/p&gt;

&lt;h2&gt;Generation 1: RPA — Record and Replay&lt;/h2&gt;

&lt;p&gt;Traditional RPA (UiPath, Blue Prism, Automation Anywhere) boils down to one idea: record what a human does, then replay it.&lt;/p&gt;

&lt;p&gt;Under the hood, it's simulating mouse and keyboard events at the OS level. Early versions used coordinate-based targeting — change the resolution and everything breaks. Later iterations added control tree recognition (Windows UI Automation, macOS Accessibility API) and image matching.&lt;/p&gt;

&lt;p&gt;RPA still powers automation at banks, insurance companies, and government systems today. But for developers, it has structural problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Brittle&lt;/strong&gt;: Change one pixel in the UI and the script breaks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero understanding&lt;/strong&gt;: It doesn't know &lt;em&gt;what&lt;/em&gt; it's doing — just mechanically repeating&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High maintenance&lt;/strong&gt;: Every UI change requires re-recording&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited scope&lt;/strong&gt;: Cross-application, cross-platform workflows are painful&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RPA was always "automation for non-technical users," not something that excited developers.&lt;/p&gt;

&lt;h2&gt;Generation 2: Browser CUA — DOM-Based Agents&lt;/h2&gt;

&lt;p&gt;In 2024–2025, LLMs got good enough to understand web pages. A new class of solutions emerged:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use Chrome DevTools Protocol (CDP) to grab the page DOM&lt;/li&gt;
&lt;li&gt;Feed DOM/HTML fragments to an LLM for comprehension&lt;/li&gt;
&lt;li&gt;LLM outputs action instructions (click element X, fill form Y)&lt;/li&gt;
&lt;li&gt;Execute via CDP&lt;/li&gt;
&lt;/ol&gt;
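
&lt;p&gt;The four-step loop above can be sketched in a few lines of Python. Everything here is a stub (&lt;code&gt;fake_cdp_dom&lt;/code&gt;, &lt;code&gt;fake_llm&lt;/code&gt;, and &lt;code&gt;fake_execute&lt;/code&gt; are hypothetical stand-ins for a real CDP client and a model API); the control flow is the part that matters:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # "click" | "fill" | "done"
    selector: str = ""
    value: str = ""

def fake_cdp_dom() -> str:
    """Stand-in for step 1: a real agent calls DOM.getDocument over CDP."""
    return '<html><body><input name="q"></body></html>'

def fake_llm(dom: str, goal: str) -> Action:
    """Stand-in for steps 2-3: the LLM reads the DOM and proposes an action."""
    if "value=" in dom:              # form already filled, so the task is complete
        return Action("done")
    return Action("fill", 'input[name="q"]', goal)

def fake_execute(dom: str, action: Action) -> str:
    """Stand-in for step 4: a real agent dispatches CDP Input events."""
    return dom.replace('name="q"', f'name="q" value="{action.value}"')

def run_agent(goal: str, max_steps: int = 5) -> list:
    dom = fake_cdp_dom()
    trace = []
    for _ in range(max_steps):
        action = fake_llm(dom, goal)
        trace.append(action)
        if action.kind == "done":
            break
        dom = fake_execute(dom, action)
    return trace
```

&lt;p&gt;The loop terminates when the model decides the goal is reached, which is also where Gen-2's weakness shows: the whole decision rests on how faithfully the DOM snapshot reflects the page.&lt;/p&gt;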

&lt;p&gt;The improvement was real: LLMs brought &lt;em&gt;understanding&lt;/em&gt; instead of mechanical replay. But the limitations were equally clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Locked inside the browser&lt;/strong&gt;: CDP is a Chrome protocol. Desktop apps, native apps, games, 3D tools — none of them work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Depends on HTML structure&lt;/strong&gt;: Complex or dynamically rendered pages produce massive, unreliable DOM trees&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data security&lt;/strong&gt;: DOM content (including your login state and sensitive data) gets sent to a cloud LLM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For developers, this solved "browser automation" but not "general GUI automation."&lt;/p&gt;

&lt;h2&gt;Generation 3: Pure-Vision GUI Agents — See the Screen, Not the Code&lt;/h2&gt;

&lt;p&gt;Starting in late 2025, a fundamentally different approach matured: models that take a screenshot as input and output actions like "click at (x, y)" or "type 'hello world'."&lt;/p&gt;

&lt;p&gt;The key difference from everything before: &lt;strong&gt;no dependency on any underlying protocol or interface.&lt;/strong&gt; No CDP, no Accessibility API, no need to know what framework the app was built with. Input is a screenshot. Output is an action.&lt;/p&gt;

&lt;p&gt;Coverage is theoretically unlimited — any application with a graphical interface can be operated. Desktop software, browsers, games, 3D modeling tools, even apps inside a remote desktop session.&lt;/p&gt;

&lt;p&gt;The technical challenges are significant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GUI Grounding&lt;/strong&gt;: The model needs to precisely locate and understand interface elements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-step planning&lt;/strong&gt;: Complex tasks require sequences of actions with memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error recovery&lt;/strong&gt;: When something goes wrong, the model needs to detect the anomaly and self-correct&lt;/li&gt;
&lt;/ul&gt;
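
&lt;p&gt;A toy sketch of that screenshot-in, action-out loop, with the three challenges folded in at minimal scale: grounding is reduced to a name-to-coordinates lookup, planning keeps a step trace, and recovery is a retry after a simulated mis-click. All names here are illustrative inventions, not Mano-P's API.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str
    target: tuple
    ok: bool

class FlakyScreen:
    """Toy GUI whose first click misses, to exercise error recovery."""
    def __init__(self):
        self.clicks = 0
        self.submitted = False
    def screenshot(self) -> dict:
        # Real grounding works on raw pixels; here it is a name -> (x, y) map.
        return {"submit_button": (420, 310)}
    def click(self, xy) -> bool:
        self.clicks += 1
        if self.clicks == 1:
            return False             # simulated transient mis-click
        self.submitted = True
        return True

def run(screen: FlakyScreen, goal: str = "submit", max_steps: int = 6) -> list:
    memory = []                      # multi-step planning keeps a trace of actions
    for _ in range(max_steps):
        if screen.submitted:         # success condition observed on screen
            break
        elements = screen.screenshot()          # 1. perceive
        xy = elements[f"{goal}_button"]         # 2. ground the target element
        ok = screen.click(xy)                   # 3. act
        memory.append(Step("click", xy, ok))    # 4. remember; the loop retries failures
    return memory
```

&lt;p&gt;Even at this scale, notice that the agent never inspects any DOM or API: it only looks, acts, and checks whether the screen changed as expected.&lt;/p&gt;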

&lt;p&gt;This approach splits into two paths — &lt;strong&gt;cloud&lt;/strong&gt; (screenshots sent to remote servers) and &lt;strong&gt;on-device&lt;/strong&gt; (inference runs locally). Same technique, completely different data flow.&lt;/p&gt;

&lt;h2&gt;On-Device Pure-Vision: Where It Gets Interesting&lt;/h2&gt;

&lt;p&gt;Let me use a concrete example to show where on-device GUI agents stand today.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;Mano-P 1.0&lt;/a&gt; is a GUI-VLA (Vision-Language-Action) agent model purpose-built for on-device deployment. Pure vision, no CDP, no HTML parsing.&lt;/p&gt;

&lt;h3&gt;Benchmark results&lt;/h3&gt;

&lt;p&gt;On &lt;a href="https://os-world.github.io/" rel="noopener noreferrer"&gt;OSWorld&lt;/a&gt; — the academic community's standard benchmark for desktop GUI agents — the Mano-P 72B model achieved &lt;strong&gt;58.2% success rate&lt;/strong&gt;, ranking &lt;strong&gt;#1 among proprietary models globally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For context: the other four models in the top 5 are all 100B+ general-purpose models. That a 72B model purpose-built for GUI scenarios outranks them says something about the efficiency of specialization versus brute-force scale.&lt;/p&gt;

&lt;p&gt;Across a broader evaluation, Mano-P hit SOTA on &lt;strong&gt;13 benchmark leaderboards&lt;/strong&gt; spanning GUI grounding, perception, video understanding, and in-context learning.&lt;/p&gt;

&lt;h3&gt;On-device performance&lt;/h3&gt;

&lt;p&gt;The 4B quantized model (w4a16) runs at &lt;strong&gt;476 tokens/s prefill, 76 tokens/s decode&lt;/strong&gt; on Apple M4 Pro, with peak memory of just &lt;strong&gt;4.3GB&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That means on an M4 Mac mini or MacBook with 32GB RAM, you can run an OSWorld-champion-level GUI agent &lt;strong&gt;entirely on-device&lt;/strong&gt;. No data ever leaves your machine.&lt;/p&gt;

&lt;p&gt;One command to install:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;mano-cua
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No API key. No cloud config. No worrying about where your screenshots end up.&lt;/p&gt;

&lt;h2&gt;The Comparison Table Developers Actually Want&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Traditional RPA&lt;/th&gt;
&lt;th&gt;Browser CUA&lt;/th&gt;
&lt;th&gt;Cloud Computer Use&lt;/th&gt;
&lt;th&gt;On-Device GUI Agent (Mano-P)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Perception&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Coordinates / control tree / image matching&lt;/td&gt;
&lt;td&gt;DOM / HTML parsing&lt;/td&gt;
&lt;td&gt;Cloud screenshot + vision model&lt;/td&gt;
&lt;td&gt;Local screenshot + vision model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Coverage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single app&lt;/td&gt;
&lt;td&gt;Browser only&lt;/td&gt;
&lt;td&gt;Theoretically all platforms&lt;/td&gt;
&lt;td&gt;All platforms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Understanding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Yes (HTML-based)&lt;/td&gt;
&lt;td&gt;Yes (vision-based)&lt;/td&gt;
&lt;td&gt;Yes (vision-based)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data flow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local&lt;/td&gt;
&lt;td&gt;DOM sent to cloud&lt;/td&gt;
&lt;td&gt;Screenshots uploaded to cloud&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Data never leaves device&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Robustness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (breaks on UI change)&lt;/td&gt;
&lt;td&gt;Medium (depends on DOM stability)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local RPA engine&lt;/td&gt;
&lt;td&gt;Browser + API&lt;/td&gt;
&lt;td&gt;Cloud API + network&lt;/td&gt;
&lt;td&gt;Local device (e.g., M4 Mac + 32GB)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;There's a frequently overlooked distinction here: cloud Computer Use and on-device GUI agents use the same technique (pure vision), but the data flow is completely different.&lt;/p&gt;

&lt;p&gt;Cloud solutions send your screenshots — everything on your screen, including code, emails, and credentials — to a remote server. For many developers, that's a non-starter.&lt;/p&gt;

&lt;p&gt;On-device solutions run inference locally. Screenshots processed locally. Actions executed locally. This isn't "we added encryption" level security — it's &lt;strong&gt;physically eliminating the possibility of data leakage&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;Why On-Device Only Became Possible Now&lt;/h2&gt;

&lt;p&gt;Two changes made this viable:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware&lt;/strong&gt;: Apple's M4 unified memory architecture gave consumer devices the foundation to run medium-scale models. M4 + 32GB unified memory + high-bandwidth memory bus — this was workstation-grade hardware two years ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model compression&lt;/strong&gt;: Mano-P's GSPruning visual token pruning + w4a16 quantization keeps the 4B model at 4.3GB peak memory with 476 tokens/s throughput. That's a fully usable inference speed.&lt;/p&gt;
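
&lt;p&gt;Some back-of-envelope arithmetic on why 4.3GB is plausible: 4 billion parameters at 4 bits each is about 2GB of weights, leaving roughly 2.3GB for activations, KV cache, and runtime buffers at peak. The overhead split is my assumption; only the 4.3GB total is published.&lt;/p&gt;

```python
# Rough memory estimate for a 4B model at w4a16. The 2.3 GB overhead figure
# is an assumption (fp16 activations, KV cache, runtime buffers), not a
# number from the Mano-P release; only the 4.3 GB peak is published.
params = 4e9
weights_gb = params * 0.5 / 1e9      # 4-bit weights: 0.5 bytes per parameter
overhead_gb = 2.3                    # assumed non-weight memory at peak
peak_gb = weights_gb + overhead_gb
print(f"weights ~{weights_gb:.1f} GB, peak ~{peak_gb:.1f} GB")
```

&lt;p&gt;The same model in fp16 would need ~8GB for weights alone, which is why 4-bit quantization is what moves this class of model from "server" to "laptop."&lt;/p&gt;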

&lt;h2&gt;What's the Endgame?&lt;/h2&gt;

&lt;p&gt;When an AI agent can see any screen, understand intent, and operate any graphical interface, it has &lt;strong&gt;the same software-usage capability as a human user&lt;/strong&gt;. It doesn't need APIs, doesn't wait for integrations, doesn't learn each tool's SDK.&lt;/p&gt;

&lt;p&gt;The implications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Long-tail software gets activated&lt;/strong&gt;: Millions of professional tools with no API can suddenly be operated by agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-application workflows become possible&lt;/strong&gt;: Design in Figma, compile in Terminal, deploy in browser — all via GUI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The walls between software break down&lt;/strong&gt;: No data export/import needed — the agent just operates at the interface level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With benchmark scores above 50% on complex desktop tasks, we're watching GUI agents cross from "lab demo" to "developer-usable."&lt;/p&gt;

&lt;h2&gt;Try It&lt;/h2&gt;

&lt;p&gt;Mano-P 1.0 is open source under Apache 2.0.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;mano-cua
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;github.com/Mininglamp-AI/Mano-P&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your take — is on-device the right path for GUI agents, or is cloud compute still the pragmatic choice? Drop your thoughts below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>automation</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
