<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mininglamp</title>
    <description>The latest articles on DEV Community by Mininglamp (@mininglamp).</description>
    <link>https://dev.to/mininglamp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3846168%2F6a138840-d665-4ba6-aedf-1b5c492035c4.png</url>
      <title>DEV Community: Mininglamp</title>
      <link>https://dev.to/mininglamp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mininglamp"/>
    <language>en</language>
    <item>
      <title>Complex UIs, Cross-App Workflows, Long Tasks: What GUI Agents Actually Unlock</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Wed, 29 Apr 2026 09:44:16 +0000</pubDate>
      <link>https://dev.to/mininglamp/complex-uis-cross-app-workflows-long-tasks-what-gui-agents-actually-unlock-40ad</link>
      <guid>https://dev.to/mininglamp/complex-uis-cross-app-workflows-long-tasks-what-gui-agents-actually-unlock-40ad</guid>
      <description>&lt;p&gt;AI agents have gotten remarkably good at text-based tasks. Platforms like OpenClaw and Claude Code can write code, manage files, search the web, analyze data, and orchestrate multi-step workflows. If the task lives in a terminal, an editor, or an API — agents handle it well.&lt;/p&gt;

&lt;p&gt;But ask an agent to fill out a form in your CRM, adjust parameters in a design tool, or navigate a multi-step workflow in an enterprise system — and you'll hit a wall.&lt;/p&gt;

&lt;p&gt;The problem isn't intelligence. It's that &lt;strong&gt;agents can't see your screen&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The GUI Gap in Agent Capabilities
&lt;/h2&gt;

&lt;p&gt;Most agent platforms interact with computers through three channels: command-line interfaces (CLI), browser developer protocols (CDP), and APIs. These work well for code execution, web scraping, and cloud service calls. But they share a fundamental limitation: &lt;strong&gt;they only work with software that exposes a programmatic interface&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In practice, a large portion of the software people use daily has no API:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enterprise systems (ERP, CRM, internal tools) often lack external interfaces&lt;/li&gt;
&lt;li&gt;Desktop applications (office suites, design tools, specialized software) rely on mouse and keyboard interaction&lt;/li&gt;
&lt;li&gt;Many web applications involve complex dynamic UIs that resist simple scripting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a structural gap in the agent technology stack. Agents have the "brain" to plan and reason, but they lack the "eyes" to see the screen and the "hands" to operate the interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why GUI Vision Is the Missing Piece
&lt;/h2&gt;

&lt;p&gt;Humans interact with computers through a visual feedback loop: observe the screen → understand the interface → locate the target element → perform an action → check the result → proceed. This process doesn't depend on any underlying API. It works through seeing and doing.&lt;/p&gt;

&lt;p&gt;Traditional RPA (Robotic Process Automation) attempted to automate GUI interactions, but relied on hardcoded coordinates, element paths, and pixel matching. When the UI changes — which happens constantly in modern software — scripts break and need manual updates.&lt;/p&gt;

&lt;p&gt;A more robust approach is &lt;strong&gt;GUI-VLA (Vision-Language-Action) models&lt;/strong&gt;: architectures that unify visual perception (seeing the screen), language understanding (interpreting instructions), and action execution (clicking, typing, navigating) into a single framework. Instead of depending on fixed UI structures, the agent understands the interface through visual comprehension and acts accordingly.&lt;/p&gt;

&lt;p&gt;The implication: &lt;strong&gt;if a piece of software has a graphical interface, an agent can potentially operate it&lt;/strong&gt;.&lt;/p&gt;
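
&lt;p&gt;As a rough illustration (the model call below is a hypothetical placeholder, not an API published by Mano-P or any other project), a single perception-to-action step can be sketched like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pyautogui  # cross-platform screenshot plus mouse/keyboard control

def gui_step(model, instruction):
    # See: grab the current screen as raw pixels
    screenshot = pyautogui.screenshot()

    # Understand and decide: a GUI-VLA model maps pixels plus the
    # instruction to one concrete action. `model.predict` is hypothetical.
    action = model.predict(screenshot, instruction)

    # Act: replay the decision through OS-level input events
    if action["type"] == "click":
        pyautogui.click(action["x"], action["y"])
    elif action["type"] == "type":
        pyautogui.write(action["text"], interval=0.03)
    elif action["type"] == "scroll":
        pyautogui.scroll(action["amount"])

    return action
&lt;/code&gt;&lt;/pre&gt;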

&lt;h2&gt;
  
  
  From Theory to Working System
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqp7qjsx7ax67qnagr0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqp7qjsx7ax67qnagr0g.png" alt="Mano-P Architecture" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;Mano-P&lt;/a&gt; is an open-source GUI-VLA agent model built for edge devices, released by Mininglamp Technology under the Apache 2.0 license. Its core approach: &lt;strong&gt;pure vision-driven GUI interaction&lt;/strong&gt; — no DOM parsing, no system APIs, just screen understanding and action execution from screenshots.&lt;/p&gt;

&lt;p&gt;The technical design involves three key mechanisms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three-stage progressive training.&lt;/strong&gt; The model goes through supervised fine-tuning (SFT), offline reinforcement learning, and online reinforcement learning. Each stage builds on the previous one, progressively improving action accuracy and environmental robustness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Think-act-verify reasoning loop.&lt;/strong&gt; Before each action, the agent plans its intent. After execution, it verifies whether the result matches expectations. If the outcome deviates, the system automatically corrects course. This significantly reduces error accumulation in multi-step tasks.&lt;/p&gt;
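
&lt;p&gt;A hedged sketch of how such a loop can be structured, with every helper name an illustrative stand-in rather than the Mano-P interface:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def run_task(agent, goal, max_steps=30):
    """Think-act-verify loop: plan, execute, then check the result
    before moving on. All helpers below are illustrative stand-ins."""
    for _ in range(max_steps):
        screenshot = agent.capture_screen()

        # Think: plan the next action and state the expected outcome
        plan = agent.think(screenshot, goal)
        if plan.is_done:
            return True

        # Act: perform the planned UI operation
        agent.act(plan.action)

        # Verify: compare the new screen state against expectations;
        # on a mismatch, record the deviation so the next think step
        # can correct course instead of compounding the error
        after = agent.capture_screen()
        if not agent.verify(after, plan.expected_outcome):
            agent.note_deviation(plan, after)

    return False  # step budget exhausted without completing the goal
&lt;/code&gt;&lt;/pre&gt;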

&lt;p&gt;&lt;strong&gt;Edge-optimized deployment.&lt;/strong&gt; Through mixed-precision quantization and visual token pruning (GS-Pruning), the model runs locally on Apple M4 devices with 32GB RAM. All screenshots and task data stay on-device — no cloud calls required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmark Results
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FBenchmark_Overview.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FBenchmark_Overview.png" alt="Benchmark Overview" width="800" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OSWorld benchmark&lt;/strong&gt;: Mano-P 1.0-72B achieves a 58.2% success rate, ranking #1 among specialized GUI agent models — 13.2 percentage points ahead of the second-place OpenCUA-72B (45.0%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebRetriever Protocol I&lt;/strong&gt;: Mano-P 1.0 scores 41.7 NavEval, surpassing Gemini 2.5 Pro Computer Use (40.9) and Claude 4.5 Computer Use (31.3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-device inference&lt;/strong&gt;: The 4B quantized model (w4a16) achieves 476 tokens/s prefill and 76 tokens/s decode on Apple M4 Pro, with only 4.3GB peak memory&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What GUI Agents Actually Unlock
&lt;/h2&gt;

&lt;p&gt;Once agents gain the ability to see and operate graphical interfaces, several previously impossible workflows become practical. Here are four scenarios demonstrated in the Mano-P project:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Fully Automated Application Building
&lt;/h3&gt;

&lt;p&gt;The agent receives natural language requirements and autonomously completes the entire pipeline: requirement clarification → architecture design → code generation → local deployment → multi-level testing (API tests, LLM-based visual page inspection, and end-to-end GUI automation testing driven by VLA models). When tests fail, the system automatically diagnoses root causes, fixes code, redeploys, and retests — iterating until all test cases pass. No human intervention required. The final deliverable is a running application with complete documentation.&lt;/p&gt;
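
&lt;p&gt;In pseudocode terms, the build-test-fix cycle described above reduces to a loop like the following (the helper functions are illustrative, not the project's actual interface):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def build_until_green(agent, requirements, max_iterations=10):
    """Generate, deploy, and test an application, iterating on
    failures until every test passes. Illustrative sketch only."""
    app = agent.generate_application(requirements)
    agent.deploy(app)

    for _ in range(max_iterations):
        results = agent.run_tests(app)   # API, visual, and GUI test suites
        if results.all_passed:
            return app

        diagnosis = agent.diagnose(results.failures)
        agent.apply_fix(app, diagnosis)
        agent.redeploy(app)

    raise RuntimeError("test suite still failing after max iterations")
&lt;/code&gt;&lt;/pre&gt;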

&lt;h3&gt;
  
  
  2. Commercial Video Production Pipeline
&lt;/h3&gt;

&lt;p&gt;Starting from a user command, the system handles video generation, uploading, analysis, editing, and secondary evaluation. The agent independently operates web interfaces and editing software, performing file management, subtitle modifications, and other fine-grained GUI operations. It then generates analysis reports with both subjective assessments and objective metrics. This kind of cross-application, multi-step workflow is exactly what GUI agents enable.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Local On-Device Task Execution
&lt;/h3&gt;

&lt;p&gt;The model runs inference directly on Mac devices (M4 chip + 32GB RAM required), breaking through the bottleneck where agent workflows previously had to pause and wait for human GUI interaction. The agent handles the entire flow autonomously, including steps that require screen-based operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Beyond Work: General-Purpose Visual Understanding
&lt;/h3&gt;

&lt;p&gt;GUI vision capabilities extend beyond productivity scenarios. Through pure visual understanding of a game interface, the agent can perform tile recognition, analysis, and decision-making in Mahjong. This demonstrates the generality of the GUI-VLA approach — the same model framework applies across structured business processes and unstructured interactive environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Developers
&lt;/h2&gt;

&lt;p&gt;The agent ecosystem has been expanding steadily — from chat to code generation, from file management to data analysis. But the jump from "text-based assistant" to "desktop-native operator" requires a fundamentally new capability: &lt;strong&gt;visual understanding of graphical interfaces&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;With GUI vision in place, agents are no longer limited to software that provides APIs or CLI access. Any application with a screen becomes a potential workspace.&lt;/p&gt;

&lt;p&gt;For developers building agent-powered automation, this opens up scenarios that were previously out of reach: enterprise systems without APIs, cross-application data workflows, long-running business processes that span multiple desktop tools, and tasks that previously required a human sitting in front of a screen.&lt;/p&gt;

&lt;p&gt;The desktop was the last frontier agents couldn't reach. That's changing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
      <category>opensource</category>
    </item>
    <item>
      <title>1.6 Trillion Parameters Just Went Open Source. What About the Other Direction?</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Tue, 28 Apr 2026 10:54:16 +0000</pubDate>
      <link>https://dev.to/mininglamp/16-trillion-parameters-just-went-open-source-what-about-the-other-direction-3dkl</link>
      <guid>https://dev.to/mininglamp/16-trillion-parameters-just-went-open-source-what-about-the-other-direction-3dkl</guid>
      <description>&lt;p&gt;On April 27, DeepSeek released its V4 model family and open-sourced the weights. The flagship V4-Pro Base has 1.6 trillion parameters (862B active), while V4-Flash comes in at 158B (Base 292B). Both use a Mixture of Experts (MoE) architecture. Within 48 hours of landing on HuggingFace, V4-Pro had already racked up 3,000+ likes and 174K downloads.&lt;/p&gt;

&lt;p&gt;It's an impressive milestone for open-source AI. But it also crystallizes a question that's been brewing for a while: &lt;strong&gt;Is "bigger" the only direction AI models can go?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The case for Scaling Up
&lt;/h2&gt;

&lt;p&gt;Let's be clear — Scaling Up works, and DeepSeek V4 is the latest proof.&lt;/p&gt;

&lt;p&gt;The logic behind bigger models traces back to the Scaling Laws paper (Kaplan et al., 2020): model performance scales predictably with parameter count, dataset size, and compute. From GPT-3 (175B) to DeepSeek V3 to V4 (1.6T), each generation has pushed the ceiling higher on general reasoning, code generation, and mathematical problem-solving.&lt;/p&gt;

&lt;p&gt;The engineering has matured too. MoE architecture is key — V4-Pro's 1.6T total parameters don't all activate at once. A routing mechanism selects which expert networks fire for each input, keeping per-inference compute manageable while retaining the knowledge capacity of a massive model. Combined with distributed inference, mixed precision, and optimized serving stacks (V4-Pro is already available on Together, Novita, Fireworks, and others), trillion-parameter models are becoming practically accessible.&lt;/p&gt;
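
&lt;p&gt;The gist of MoE routing fits in a few lines. A minimal sketch of simplified top-k gating (not DeepSeek's actual implementation) looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def moe_forward(x, experts, router_weights, k=2):
    """Route one token through k of the available experts.

    x: input activation vector
    experts: list of callables, each a small feed-forward network
    router_weights: matrix mapping x to one score per expert
    Only the k selected experts run, so per-token compute stays far
    below what activating every parameter would cost.
    """
    scores = router_weights @ x                 # one score per expert
    top_k = np.argsort(scores)[-k:]             # indices of the best experts
    gates = np.exp(scores[top_k])
    gates = gates / gates.sum()                 # normalize to mixture weights

    # Weighted sum of the selected experts' outputs
    return sum(g * experts[i](x) for g, i in zip(gates, top_k))
&lt;/code&gt;&lt;/pre&gt;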

&lt;p&gt;None of this is hype. The results are real. For general-purpose tasks — open-ended reasoning, multilingual generation, complex code synthesis — larger models consistently outperform smaller ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  But not every problem needs a trillion parameters
&lt;/h2&gt;

&lt;p&gt;Here's where the story gets more interesting.&lt;/p&gt;

&lt;p&gt;Running V4-Pro requires a multi-GPU cluster. Even using it through an inference API costs money per call. For high-frequency use cases — real-time interaction, continuous agent workflows, batch processing — that cost adds up fast. And for individual developers or small teams, the economics don't always work.&lt;/p&gt;

&lt;p&gt;There are also structural constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data privacy.&lt;/strong&gt; Cloud inference means your input data leaves your machine. For AI agent scenarios where the model needs to see your entire screen — emails, chat messages, bank statements — that's a non-trivial compliance issue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency.&lt;/strong&gt; Network round trips add delay. For agent workflows involving dozens of sequential steps (screenshot → understand → act → repeat), every millisecond of latency compounds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability.&lt;/strong&gt; No internet, no AI. But real-world use cases on airplanes, in secure facilities, or on unstable connections require AI that works offline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't criticisms of Scaling Up. They're boundary conditions that define where a different approach makes more sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  The other direction: Scaling Out
&lt;/h2&gt;

&lt;p&gt;If Scaling Up means making one model as large as possible, &lt;strong&gt;Scaling Out means distributing multiple smaller, specialized AI models closer to where they're actually needed — and having them collaborate.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't a theoretical alternative. Several converging technical trends make it practical:&lt;/p&gt;

&lt;h3&gt;
  
  
  Model compression is real
&lt;/h3&gt;

&lt;p&gt;Techniques like mixed-precision quantization (e.g., w4a16), visual token pruning, and knowledge distillation can shrink billion-parameter models to run on consumer hardware. On an Apple M4 chip, a 4B-parameter quantized model achieves 476 tokens/s prefill and 76 tokens/s decode, with a peak memory footprint of just 4.3GB.&lt;/p&gt;
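
&lt;p&gt;The weight math behind that footprint is easy to check. A back-of-the-envelope calculation, ignoring the KV cache and activation buffers that make up the rest of the peak:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;params = 4e9                          # 4B parameters

fp16_weights_gb = params * 2 / 1e9    # 2 bytes per weight   = 8.0 GB
w4_weights_gb   = params * 0.5 / 1e9  # 4-bit weights        = 2.0 GB

print(fp16_weights_gb, w4_weights_gb)
# 8.0 2.0   (quantized weights fit comfortably inside the reported
#            4.3 GB peak, leaving room for activations and KV cache)
&lt;/code&gt;&lt;/pre&gt;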

&lt;h3&gt;
  
  
  Specialized models can beat general ones — in their domain
&lt;/h3&gt;

&lt;p&gt;A general-purpose trillion-parameter model spreads its capacity across every conceivable task. A specialized model focuses all its parameters on one domain. In GUI automation specifically, a specialized model family built for this task has achieved #1 scores on domain benchmarks, outperforming general-purpose models many times its size on the same tests.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data sovereignty matters
&lt;/h3&gt;

&lt;p&gt;When the model runs on the user's device, the data never leaves. No cloud upload, no network transmission, no third-party processing. For enterprise compliance, personal privacy, and regulated industries, this is a structural advantage that cloud-only models can't match.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-agent collaboration
&lt;/h3&gt;

&lt;p&gt;Instead of one giant model doing everything, multiple specialized agents can divide work — each running on different devices or nodes, communicating through standardized protocols. This architecture naturally fits the Scaling Out paradigm.&lt;/p&gt;

&lt;h2&gt;
  
  
  A concrete example: GUI agents on the edge
&lt;/h2&gt;

&lt;p&gt;Let's make this concrete with a specific domain: GUI automation.&lt;/p&gt;

&lt;p&gt;The task is straightforward in concept but demanding in practice: an AI agent looks at a screen, understands the interface elements, and performs operations — clicking buttons, filling forms, navigating menus — just like a human user would.&lt;/p&gt;

&lt;p&gt;This is a natural fit for Scaling Out because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Screen captures contain sensitive personal data — better processed locally&lt;/li&gt;
&lt;li&gt;GUI tasks involve many sequential steps — latency accumulates&lt;/li&gt;
&lt;li&gt;The task requires precise visual grounding and action planning, not broad general knowledge&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn84c2hme34qvu6ji68fk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn84c2hme34qvu6ji68fk.png" alt="Mano-P OSWorld Benchmark" width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mano-P&lt;/strong&gt; is an open-source project (Apache 2.0) by Mininglamp Technology that takes this approach. It's a GUI-VLA (Vision-Language-Action) agent designed for edge devices — specifically, it runs entirely on a Mac, with all data staying on the local machine.&lt;/p&gt;

&lt;p&gt;The architecture integrates visual understanding, language reasoning, and action generation in a single end-to-end model, trained through a three-stage pipeline (SFT → offline RL → online RL) with a think-act-verify inference loop and GS-Pruning for visual token efficiency.&lt;/p&gt;

&lt;p&gt;Published benchmark results (with evaluation framework and model specification noted):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OSWorld&lt;/strong&gt; (72B model): 58.2% accuracy — ranked #1 (2nd place: 45.0%, a 13.2 percentage point gap)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebRetriever Protocol I&lt;/strong&gt; (72B model): 41.7 NavEval — ranked #1 (Gemini 2.5 Pro: 40.9, Claude 4.5: 31.3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge deployment&lt;/strong&gt; (4B quantized, w4a16): 476 tokens/s prefill, 76 tokens/s decode, 4.3GB peak memory on Apple M4&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hardware requirement: Mac with Apple M4 chip + 32GB RAM (or Mano-P Compute Stick via USB 4.0+).&lt;/p&gt;

&lt;p&gt;The takeaway: &lt;strong&gt;a specialized model family, with a 4B variant that runs locally on a Mac, can achieve state-of-the-art results in its domain.&lt;/strong&gt; Not because small models are universally better, but because the right model for the right task, deployed in the right place, can outperform a general-purpose giant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two tracks, one ecosystem
&lt;/h2&gt;

&lt;p&gt;DeepSeek V4 pushing to 1.6 trillion parameters and a 4B model hitting #1 on GUI benchmarks are not contradictory developments. They're two sides of the same evolution in AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scaling Up&lt;/strong&gt; provides the general intelligence foundation — broad reasoning, complex generation, cross-domain capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling Out&lt;/strong&gt; provides the execution layer — privacy-preserving, low-latency, offline-capable, specialized for specific tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The two can work together: edge models handle local tasks, and when something exceeds their scope, they call out to cloud models. This layered architecture may be closer to how AI actually gets deployed in the real world than any single-model paradigm.&lt;/p&gt;
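
&lt;p&gt;A hedged sketch of that layering in an agent runtime (the attributes and calls are illustrative placeholders, not a published interface):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def answer(task, local_model, cloud_model):
    """Local-first routing: try the on-device specialist first and
    escalate to a cloud generalist only when the task exceeds its
    scope. All attributes here are illustrative placeholders."""
    result = local_model.run(task)

    if result.is_confident:
        return result          # handled entirely on-device

    # Escalation: send only the task description upstream, never the
    # raw screenshots or local files the edge model already processed
    return cloud_model.run(task.description)
&lt;/code&gt;&lt;/pre&gt;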

&lt;p&gt;For developers choosing a direction: it's not about picking the model with the most parameters. It's about picking the model that fits your constraints — compute budget, latency requirements, data sensitivity, deployment environment.&lt;/p&gt;

&lt;p&gt;The trillion-parameter era is here. And so is the era of AI that runs on your machine.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mano-P (Apache 2.0): &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;github.com/Mininglamp-AI/Mano-P&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>deepseek</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Happy 45th Birthday, GUI. Meet Your New Power User.</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Mon, 27 Apr 2026 11:54:15 +0000</pubDate>
      <link>https://dev.to/mininglamp/happy-45th-birthday-gui-meet-your-new-power-user-pdi</link>
      <guid>https://dev.to/mininglamp/happy-45th-birthday-gui-meet-your-new-power-user-pdi</guid>
      <description>&lt;p&gt;On April 27, 1981, Xerox introduced the Star 8010 Information System — the first commercial computer with a graphical user interface.&lt;/p&gt;

&lt;p&gt;Bitmapped display, desktop metaphor, icons, windows, mouse, WYSIWYG. Everything we take for granted about modern computing started with a $16,595 workstation that most people never used.&lt;/p&gt;

&lt;p&gt;Today marks the 45th anniversary of that moment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five Milestones in 45 Years
&lt;/h2&gt;

&lt;p&gt;The GUI's history can be traced through a handful of defining moments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1981 · Xerox Star&lt;/strong&gt;: GUI is born. The desktop metaphor becomes the foundational paradigm for human-computer interaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1984 · Macintosh&lt;/strong&gt;: Apple brings GUI to the consumer market. Computing becomes visual for everyone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1995 · Windows 95&lt;/strong&gt;: The Start menu and taskbar. GUI becomes the global default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2007 · iPhone&lt;/strong&gt;: Multi-touch replaces the mouse. GUI extends from desktops to pockets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2025–2026 · GUI Agents&lt;/strong&gt;: AI learns to "see" screens and operate them autonomously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first four milestones share one constant: &lt;strong&gt;the user is always a human&lt;/strong&gt;. Interface design revolves around human visual cognition — icons should be intuitive, layouts should follow natural eye movement, interactions should provide instant feedback.&lt;/p&gt;

&lt;p&gt;The fifth milestone introduces a fundamental shift: &lt;strong&gt;the "user" can be an AI&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  When AI Becomes the GUI Operator
&lt;/h2&gt;

&lt;p&gt;Over the past two years, GUI Agents have emerged as a distinct technical direction. The core idea: train AI models to operate computers the way humans do — by looking at the screen and performing mouse/keyboard actions.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from traditional automation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dependency&lt;/th&gt;
&lt;th&gt;Coverage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API/CLI&lt;/td&gt;
&lt;td&gt;Target system must expose an API&lt;/td&gt;
&lt;td&gt;Only apps with APIs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DOM/CDP parsing&lt;/td&gt;
&lt;td&gt;Requires browser internals or accessible widget trees&lt;/td&gt;
&lt;td&gt;Primarily web apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure vision&lt;/td&gt;
&lt;td&gt;None — works with any GUI&lt;/td&gt;
&lt;td&gt;Any application with a visual interface&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The vision-based approach inherits the exact principle that Xerox Star's designers articulated 45 years ago: &lt;strong&gt;a GUI should be self-explanatory — you should be able to understand how to use it just by looking at it&lt;/strong&gt;. Back then, that capability belonged to humans. Now AI is developing it too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mano-P: A Vision-Only GUI Agent for Edge Devices
&lt;/h2&gt;

&lt;p&gt;Mininglamp Technology open-sourced &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;Mano-P&lt;/a&gt; under the Apache 2.0 license, taking a vision-only approach to GUI automation. Mano-P uses a GUI-VLA (Vision-Language-Action) architecture that integrates visual understanding, language reasoning, and action generation in a single end-to-end model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmark Results
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn84c2hme34qvu6ji68fk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn84c2hme34qvu6ji68fk.png" alt="OSWorld Benchmark" width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OSWorld&lt;/strong&gt; (verified, specialized model): Mano-P 72B achieves &lt;strong&gt;58.2% accuracy&lt;/strong&gt;, ranking #1 (runner-up: 45.0%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebRetriever Protocol I&lt;/strong&gt;: &lt;strong&gt;41.7 NavEval&lt;/strong&gt; (ranked #1), surpassing Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  On-Device Performance
&lt;/h3&gt;

&lt;p&gt;The 4B quantized model (w4a16) runs locally on Apple M4 Macs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefill: 476 tokens/s&lt;/li&gt;
&lt;li&gt;Decode: 76 tokens/s&lt;/li&gt;
&lt;li&gt;Peak memory: 4.3 GB&lt;/li&gt;
&lt;li&gt;Fully local execution — screen captures and task data never leave the device&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hardware requirements: Mac with Apple M4 chip + 32GB RAM, or any Mac with a Mano-P Compute Stick (USB 4.0).&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Approach
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Bidirectional self-reinforcement learning (Text ↔ Action cyclic consistency)&lt;/li&gt;
&lt;li&gt;Three-stage training: SFT → Offline RL → Online RL&lt;/li&gt;
&lt;li&gt;Think-act-verify reasoning loop&lt;/li&gt;
&lt;li&gt;GS-Pruning for visual token reduction, optimizing edge inference&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Full Circle
&lt;/h2&gt;

&lt;p&gt;Forty-five years ago, the Xerox Star taught humans to interact with computers through visual interfaces. Today, AI agents are learning to do the same thing — looking at pixels, understanding layouts, clicking buttons.&lt;/p&gt;

&lt;p&gt;The Xerox Star was a commercial failure but a technical triumph. Its design DNA — bitmapped displays, the desktop metaphor, WYSIWYG — lives on in every Mac, PC, phone, and tablet. GUI Agents are the next chapter: &lt;strong&gt;the interface designed for human eyes turns out to work for AI eyes too&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The GUI hasn't changed. What changed is who's looking at the screen.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;github.com/Mininglamp-AI/Mano-P&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Technical Report&lt;/strong&gt;: &lt;a href="https://arxiv.org/abs/2509.17336" rel="noopener noreferrer"&gt;arXiv:2509.17336&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mano-P is developed by Mininglamp Technology and released under the Apache 2.0 license.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>gui</category>
    </item>
    <item>
      <title>AI Got Hands: Breaking the Human Bottleneck in Agent Workflows</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Fri, 24 Apr 2026 07:35:34 +0000</pubDate>
      <link>https://dev.to/mininglamp/ai-got-hands-breaking-the-human-bottleneck-in-agent-workflows-2b5o</link>
      <guid>https://dev.to/mininglamp/ai-got-hands-breaking-the-human-bottleneck-in-agent-workflows-2b5o</guid>
      <description>&lt;p&gt;Most AI agent frameworks can browse the web. Open a URL, read some HTML, click a button, fill a form. This works because browsers expose their internals through well-defined protocols — Chrome DevTools Protocol (CDP), DOM APIs, JavaScript injection.&lt;/p&gt;

&lt;p&gt;But here's the problem: the majority of professional work doesn't happen in a browser.&lt;/p&gt;

&lt;p&gt;CAD engineers work in SolidWorks. Video editors work in DaVinci Resolve. Data analysts switch between Excel, custom BI dashboards, and terminal sessions. System administrators navigate native configuration panels. Designers use Figma's desktop app, Photoshop, Blender.&lt;/p&gt;

&lt;p&gt;None of these expose a DOM. None of them speak CDP. And most of the "AI automation" ecosystem simply cannot reach them.&lt;/p&gt;

&lt;p&gt;This article examines the three main technical approaches to GUI automation, explains why the vision-only approach matters for breaking the browser boundary, and looks at measured results on cross-application benchmarks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Approaches to GUI Automation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Approach 1: CDP and HTML Parsing
&lt;/h3&gt;

&lt;p&gt;The Chrome DevTools Protocol gives programmatic access to Chromium-based browsers. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query the DOM tree&lt;/li&gt;
&lt;li&gt;Execute JavaScript in page context&lt;/li&gt;
&lt;li&gt;Intercept network requests&lt;/li&gt;
&lt;li&gt;Simulate clicks and keyboard input at the DOM element level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Frameworks like Playwright, Puppeteer, and most browser-based AI agents use this approach. It's precise, fast, and reliable — within its domain.&lt;/p&gt;
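
&lt;p&gt;A typical Playwright snippet shows the pattern: every step resolves a CSS selector against the page's DOM (the URL and selectors here are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/login")   # placeholder URL

    # Every step targets a DOM node via a selector. If the markup
    # changes, these selectors silently stop matching.
    page.fill("#username", "demo-user")
    page.fill("#password", "demo-pass")
    page.click("button[type=submit]")

    page.wait_for_selector(".dashboard")     # wait for the SPA to render
    browser.close()
&lt;/code&gt;&lt;/pre&gt;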

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Precise element targeting via CSS selectors&lt;/li&gt;
&lt;li&gt;Access to hidden elements, shadow DOM, iframe contents&lt;/li&gt;
&lt;li&gt;Can read and modify page state programmatically&lt;/li&gt;
&lt;li&gt;Low latency (no screen capture needed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Browser-only.&lt;/strong&gt; CDP doesn't exist outside Chromium. Firefox has a partial equivalent; Safari's is limited. Native desktop apps, mobile apps, and OS-level UI are completely out of scope.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Site-specific fragility.&lt;/strong&gt; CSS selectors break when websites update their markup. A class name change, a restructured component tree, or a switch from server-rendered to client-rendered content can silently break automation scripts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SPA complexity.&lt;/strong&gt; Modern single-page applications with dynamic rendering, lazy loading, and virtual scrolling create timing dependencies that are hard to handle reliably.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-automation measures.&lt;/strong&gt; Many sites actively detect and block CDP-based automation through bot detection, CAPTCHAs, and behavioral analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For browser-based tasks, CDP is the right tool. But framing "AI automation" as "browser automation" leaves most of the desktop untouched.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 2: Accessibility APIs
&lt;/h3&gt;

&lt;p&gt;Operating systems provide accessibility APIs (UI Automation on Windows, Accessibility API on macOS, AT-SPI on Linux) that expose a tree of UI elements with their roles, labels, and states. Screen readers use these APIs. So can automation frameworks.&lt;/p&gt;
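
&lt;p&gt;On Windows, for instance, the pywinauto library rides on UI Automation. A sketch, assuming Notepad and English control labels (which vary by OS version and locale):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pywinauto import Application

# Attach through the UI Automation backend (Windows-specific)
app = Application(backend="uia").start("notepad.exe")
dlg = app.window(title_re=".*Notepad")

# The accessibility tree exposes controls by role and label, so the
# script can address elements semantically instead of by pixels
dlg.type_keys("Hello from the accessibility tree", with_spaces=True)
dlg.child_window(title="Close", control_type="Button").click_input()

# Inspect what the application actually exposes; poorly instrumented
# apps show up here as flat, unlabeled hierarchies
dlg.print_control_identifiers()
&lt;/code&gt;&lt;/pre&gt;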

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works across native applications, not just browsers&lt;/li&gt;
&lt;li&gt;Semantic information (button labels, text field values, checkbox states)&lt;/li&gt;
&lt;li&gt;Standardized per-OS (once you handle the platform API, it works across apps)&lt;/li&gt;
&lt;li&gt;Doesn't require visual rendering — works even on headless systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inconsistent implementation.&lt;/strong&gt; Application developers implement accessibility support to varying degrees. A well-built macOS app might expose a complete accessibility tree. A cross-platform Electron app might expose a flat, unlabeled hierarchy. A legacy Qt application might expose nothing useful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom controls are invisible.&lt;/strong&gt; Rendered canvases (games, CAD viewports, video timelines, terminal emulators with custom rendering) don't have accessibility tree entries for their internal elements. A 3D modeling tool's viewport is a single opaque rectangle to the accessibility API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform fragmentation.&lt;/strong&gt; Each OS has its own API, data model, and quirks. Code written for macOS accessibility doesn't transfer to Windows or Linux.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance overhead.&lt;/strong&gt; Querying the full accessibility tree of a complex application can be slow — hundreds of milliseconds for apps with deep hierarchies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Accessibility APIs are genuinely useful and underappreciated in the automation space. But they have a fundamental coverage gap: they can only see what developers explicitly expose, and many interfaces — especially professional tools with custom rendering — aren't fully accessible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 3: Vision-Only Understanding
&lt;/h3&gt;

&lt;p&gt;The third approach skips the application's internal representation entirely. Instead of querying DOM trees or accessibility APIs, the agent looks at what's on screen — raw pixels — and reasons about what it sees.&lt;/p&gt;

&lt;p&gt;This is how humans interact with computers. We don't parse HTML to find the "Submit" button. We see a rectangle that looks like a button, read its label, and click it.&lt;/p&gt;
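
&lt;p&gt;The primitives are correspondingly simple: capture pixels, decide, synthesize input events. A sketch using pyautogui, where the grounding model is a hypothetical stand-in:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pyautogui

def click_described_element(grounding_model, description):
    """Find a UI element from its visual description and click it.
    `grounding_model.locate` is hypothetical: it maps a screenshot
    plus a phrase like "the blue Submit button" to coordinates."""
    screenshot = pyautogui.screenshot()          # raw pixels, any app
    x, y = grounding_model.locate(screenshot, description)
    pyautogui.click(x, y)                        # OS-level mouse event

# The same primitives cover typing and scrolling as well, e.g.
# pyautogui.write("quarterly report") and pyautogui.scroll(-500)
&lt;/code&gt;&lt;/pre&gt;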

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Universal coverage.&lt;/strong&gt; If a human can see it on screen, the agent can see it. Native apps, web apps, terminals, games, remote desktops, virtual machines — all the same to a screenshot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No application cooperation required.&lt;/strong&gt; The agent doesn't need hooks, APIs, or special access. Screen capture is a standard OS capability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resilient to UI changes.&lt;/strong&gt; A button that moves from the left sidebar to the top toolbar still looks like a button. Visual understanding is inherently more robust to layout changes than coordinate-based or selector-based targeting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-platform by default.&lt;/strong&gt; Screenshots are screenshots, regardless of OS. The same model that automates macOS can automate Windows or Linux without platform-specific code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Requires capable vision models.&lt;/strong&gt; The agent needs to accurately parse dense UIs, read small text, distinguish between similar-looking elements, and understand spatial relationships. This is a hard computer vision problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher computational cost.&lt;/strong&gt; Processing a full screenshot through a vision model is more expensive than querying a DOM tree. This is where model optimization and edge deployment become critical.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Occlusion and overlaps.&lt;/strong&gt; Dropdown menus, tooltips, and modal dialogs can cover important UI elements. The agent needs to handle these states.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No hidden state access.&lt;/strong&gt; The agent can't see what's behind a collapsed menu or in an unscrolled region. It has to navigate to make information visible, just like a human would.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off is clear: vision-only gives you universal reach at the cost of requiring a strong vision model. The question is whether today's models are good enough to make that trade worthwhile.&lt;/p&gt;

&lt;h2&gt;
  
  
  Breaking the Browser Boundary
&lt;/h2&gt;

&lt;p&gt;Let's make this concrete. Consider a workflow that's common in any organization:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Pull Q1 sales data from the CRM, cross-reference it with the finance spreadsheet on the shared drive, and create a summary slide deck for the Monday meeting."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A browser-based agent can maybe handle the CRM part (if it's a web app). But the finance spreadsheet might be in a native Excel window. The slide deck is in PowerPoint or Keynote. The shared drive might be mounted as a local folder or accessed through a native file manager.&lt;/p&gt;

&lt;p&gt;This is one task that touches three or four applications. A CDP-based agent taps out after step one. An accessibility-based agent might handle most of the steps but would struggle with Excel's complex grid rendering. A vision-based agent can navigate all of them — it sees what you see, clicks where you'd click, types what you'd type.&lt;/p&gt;

&lt;p&gt;The same principle applies to more specialized work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DevOps:&lt;/strong&gt; Switching between a terminal, a monitoring dashboard (Grafana), a cloud console (AWS), and a ticket system (Jira) — mixing web and native UIs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design:&lt;/strong&gt; Moving assets between Figma, Photoshop, and a file manager, with each tool having its own UI paradigms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data science:&lt;/strong&gt; Interacting with Jupyter notebooks, database GUIs, Excel, and custom visualization tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System administration:&lt;/strong&gt; Navigating OS settings panels, network configuration tools, and hardware management interfaces that have no web equivalent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't edge cases. They're the normal workday for millions of professionals. The browser boundary isn't a minor limitation — it's a wall that separates "AI demo" from "AI tool."&lt;/p&gt;

&lt;h2&gt;
  
  
  Measured Results on Cross-Application Benchmarks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;Mano-P&lt;/a&gt; (GUI-Aware Agent Model for Edge Devices, open-source under Apache 2.0) uses the vision-only approach. The name stands for "Mano" (Spanish for "hand") and "P" (Person &amp;amp; Party).&lt;/p&gt;

&lt;p&gt;Here's the architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqp7qjsx7ax67qnagr0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqp7qjsx7ax67qnagr0g.png" alt="Mano-P Architecture" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model takes screenshots as input and outputs action sequences — click coordinates, keystrokes, scroll directions, and multi-step plans. No DOM parsing. No accessibility tree queries. Just pixels in, actions out.&lt;/p&gt;

&lt;p&gt;On OSWorld — a benchmark specifically designed to test agents on real desktop environments across different operating systems and applications — the results look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn84c2hme34qvu6ji68fk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn84c2hme34qvu6ji68fk.png" alt="Mano-P OSWorld Results" width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mano-P achieves a &lt;strong&gt;58.2% success rate&lt;/strong&gt; on OSWorld, compared to 45.0% for the second-place model. This benchmark includes tasks spanning file management, office applications, web browsing, system configuration, and multi-application workflows — exactly the kind of cross-boundary work where vision-only approaches should theoretically shine.&lt;/p&gt;

&lt;p&gt;On web-specific benchmarks, the vision-only approach remains competitive. On WebRetriever Protocol I, Mano-P scores &lt;strong&gt;41.7 NavEval&lt;/strong&gt;, ahead of Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3). This is notable because web benchmarks should favor approaches that can access the DOM directly — yet the vision-only model still leads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Vision-Only Can Win on the Web Too
&lt;/h2&gt;

&lt;p&gt;This counterintuitive result — a vision model beating DOM-aware models on web tasks — has a plausible explanation.&lt;/p&gt;

&lt;p&gt;Modern web pages are designed for human eyes, not for programmatic parsing. A typical SaaS dashboard might have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamically loaded content with JavaScript-rendered elements&lt;/li&gt;
&lt;li&gt;Canvas-based charts and visualizations&lt;/li&gt;
&lt;li&gt;Complex CSS layouts where the visual hierarchy doesn't match the DOM hierarchy&lt;/li&gt;
&lt;li&gt;Shadow DOM components that hide internal structure&lt;/li&gt;
&lt;li&gt;Iframes embedding third-party content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A DOM parser sees the structural complexity. A vision model sees the rendered result — the same clean layout the designer intended for human users. In many cases, the rendered output is actually easier to reason about than the underlying markup.&lt;/p&gt;

&lt;p&gt;This doesn't mean vision-only is universally better for web tasks. DOM access provides exact text content (no OCR errors), hidden metadata, and element state information. But for navigation and interaction tasks — "find the settings button and change this option" — visual understanding can be more robust than structural parsing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running on the Edge
&lt;/h2&gt;

&lt;p&gt;A vision-based agent is computationally demanding. Processing high-resolution screenshots through a vision-language model requires significant inference capacity. This is where model design and hardware optimization become critical.&lt;/p&gt;

&lt;p&gt;Mano-P uses a 4B parameter model with w4a16 quantization (4-bit weights, 16-bit activations). On an Apple M4 Pro with 32GB RAM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prefill:&lt;/strong&gt; 476 tokens/s (ingesting the screenshot and context)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decode:&lt;/strong&gt; 76 tokens/s (generating the action sequence)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peak memory:&lt;/strong&gt; 4.3 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These numbers mean the full perception-reasoning-action loop completes in under a second for typical interactions. The 4.3 GB memory footprint leaves plenty of room for the applications being automated to run alongside the agent.&lt;/p&gt;

&lt;p&gt;Running locally also eliminates the latency of uploading screenshots to a cloud API. A screenshot from a 4K display can be several megabytes — sending that to a remote server for every action step adds meaningful delay, especially on typical upload speeds.&lt;/p&gt;

&lt;p&gt;The local execution model also means screenshots and task data never leave the device. For workflows involving sensitive information — financial data, medical records, proprietary designs — this is often a hard requirement, not a nice-to-have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Training Challenge: Teaching a Model to See and Act
&lt;/h2&gt;

&lt;p&gt;Building a vision-only agent that works across diverse applications requires solving several interconnected problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visual grounding:&lt;/strong&gt; The model must map regions of a screenshot to semantic UI elements. "The blue button in the top-right corner that says 'Save'" needs to become a precise coordinate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action planning:&lt;/strong&gt; Given a goal ("rename this file to quarterly-report-v2.pdf"), the model must generate a sequence of actions: right-click the file → click "Rename" → select all text → type the new name → press Enter.&lt;/p&gt;
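
&lt;p&gt;Written out as data, that plan is just an ordered list of grounded actions. The schema below is illustrative; the project's actual action space is not spelled out in this article:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical action schema for the rename example above
rename_plan = [
    {"type": "right_click", "target": "report file icon"},
    {"type": "click",       "target": "Rename menu item"},
    {"type": "hotkey",      "keys": ["ctrl", "a"]},   # select the old name
    {"type": "type",        "text": "quarterly-report-v2.pdf"},
    {"type": "press",       "key": "enter"},
]
&lt;/code&gt;&lt;/pre&gt;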

&lt;p&gt;&lt;strong&gt;Error recovery:&lt;/strong&gt; UI automation in real environments is noisy. Menus take time to open. Dialog boxes appear unexpectedly. Actions sometimes fail. The model needs to verify outcomes and adapt.&lt;/p&gt;

&lt;p&gt;Mano-P addresses these through a three-stage training pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Supervised Fine-Tuning (SFT)&lt;/strong&gt; on curated GUI interaction datasets builds foundational visual understanding and action generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline Reinforcement Learning&lt;/strong&gt; on collected trajectories teaches multi-step planning from both successful and failed interactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Online Reinforcement Learning&lt;/strong&gt; with a think-act-verify loop develops robustness — the model learns to check its work and recover from failures in live environments.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A technique called &lt;strong&gt;GS-Pruning&lt;/strong&gt; then prunes redundant visual tokens at inference time, cutting per-screenshot compute so the compact 4B model stays responsive on edge hardware without proportional capability loss.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implications for Agent Architecture
&lt;/h2&gt;

&lt;p&gt;The vision-only approach has second-order effects on how agent systems are designed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simpler integration.&lt;/strong&gt; Adding a new application to the agent's capabilities doesn't require building an adapter, writing selectors, or mapping accessibility trees. If the app has a GUI, the agent can use it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-system workflows.&lt;/strong&gt; Tasks that span multiple applications — copying data from a web CRM into a native spreadsheet, then attaching it to an email — don't require different automation strategies for each app. The agent uses the same perception-action loop throughout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-task planning.&lt;/strong&gt; Because the agent perceives the full screen state at each step, it can maintain context across complex, multi-step workflows. The think-act-verify training means it checks whether each step succeeded before proceeding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reduced maintenance burden.&lt;/strong&gt; Selector-based automation scripts break when UIs update. Vision-based automation is inherently more resilient because it relies on visual patterns rather than structural identifiers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current Limitations and Honest Assessment
&lt;/h2&gt;

&lt;p&gt;Vision-only GUI automation is not a solved problem. Current limitations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small text and dense UIs.&lt;/strong&gt; Spreadsheets with tiny fonts, code editors with many similar-looking lines, and dashboards with packed metrics are still challenging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed-sensitive interactions.&lt;/strong&gt; Drag-and-drop, real-time canvas manipulation, and rapid sequential inputs are harder than discrete click-and-type actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification ambiguity.&lt;/strong&gt; Sometimes it's hard to tell from a screenshot alone whether an action succeeded (e.g., a background save operation with no visual confirmation).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training data coverage.&lt;/strong&gt; The model performs best on application types well-represented in training data. Niche or custom enterprise software may require fine-tuning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are active research areas, not fundamental barriers. As vision models improve in resolution handling, temporal reasoning, and few-shot adaptation, the coverage gap will narrow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Mano-P is open-source under Apache 2.0 with a three-phase release plan:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1 (released):&lt;/strong&gt; Skills — task-specific capability modules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2:&lt;/strong&gt; Local models and SDK — the inference runtime and integration tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 3:&lt;/strong&gt; Training methods — the full pipeline for community extension&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code and documentation are at &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;github.com/Mininglamp-AI/Mano-P&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you're building agent workflows that stop at the browser boundary, it might be time to give your AI hands that can reach the rest of the desktop.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>automation</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>AI for Personal: How Edge-Native Agents Bring Data Sovereignty Back to Your Device</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Fri, 24 Apr 2026 07:34:50 +0000</pubDate>
      <link>https://dev.to/mininglamp/ai-for-personal-how-edge-native-agents-bring-data-sovereignty-back-to-your-device-5882</link>
      <guid>https://dev.to/mininglamp/ai-for-personal-how-edge-native-agents-bring-data-sovereignty-back-to-your-device-5882</guid>
      <description>&lt;p&gt;When you ask a cloud-based AI agent to "summarize my last 20 emails" or "fill out this expense report from my receipts," you're making an implicit trade: convenience for control. Your screenshots, your documents, your workflow patterns — all uploaded to someone else's infrastructure, processed on someone else's GPUs, stored under someone else's data retention policy.&lt;/p&gt;

&lt;p&gt;For many developers and enterprise users, that trade is becoming harder to justify.&lt;/p&gt;

&lt;p&gt;This article explores the technical architecture behind running AI agents entirely on local hardware — no cloud round-trips, no data exfiltration, no API keys required — and how a 4B-parameter model running on Apple Silicon can match or exceed cloud-hosted alternatives on GUI automation benchmarks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cloud Dependency Problem
&lt;/h2&gt;

&lt;p&gt;Most AI agent frameworks today follow a predictable pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Capture screen state (screenshot, DOM, accessibility tree)&lt;/li&gt;
&lt;li&gt;Send it to a cloud API (OpenAI, Anthropic, Google)&lt;/li&gt;
&lt;li&gt;Receive action instructions&lt;/li&gt;
&lt;li&gt;Execute locally&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;
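
&lt;p&gt;In code, that pattern looks roughly like this (the endpoint and payload shape are generic placeholders, not any particular vendor's API):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import base64
import requests

API_URL = "https://api.example.com/v1/agent"   # placeholder endpoint

def cloud_agent_step(capture_screen, execute, instruction):
    # 1. Capture screen state locally
    png_bytes = capture_screen()

    # 2. Ship the entire screenshot to someone else's infrastructure
    payload = {
        "instruction": instruction,
        "screenshot_b64": base64.b64encode(png_bytes).decode("ascii"),
    }
    resp = requests.post(API_URL, json=payload, timeout=30)

    # 3. Receive action instructions, 4. execute locally, 5. repeat
    action = resp.json()["action"]
    execute(action)
    return action
&lt;/code&gt;&lt;/pre&gt;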

&lt;p&gt;This works. But it has structural problems that no amount of prompt engineering can fix:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency compounds.&lt;/strong&gt; Each action in a multi-step workflow requires a round-trip. A 10-step task that takes 500ms per API call adds 5 seconds of pure network overhead — before you account for token generation time on the server side.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data leaves the device by design.&lt;/strong&gt; Screenshots contain everything visible on screen: open tabs, notification previews, partial passwords in terminal windows, private messages, financial data. The agent doesn't selectively capture — it sees what you see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost scales with usage.&lt;/strong&gt; Vision API calls with screenshot inputs are expensive. A power user running an agent for 8 hours might generate hundreds of screenshots, each consuming thousands of tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Availability depends on infrastructure you don't control.&lt;/strong&gt; API rate limits, outages, region restrictions, and policy changes can break your workflow without warning.&lt;/p&gt;

&lt;p&gt;None of these are hypothetical. They're the everyday reality of cloud-dependent agent architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "Edge-Native" Actually Means
&lt;/h2&gt;

&lt;p&gt;Edge-native AI isn't just "smaller model on a laptop." It's a fundamentally different architecture where the entire inference loop — perception, reasoning, and action — runs on the device where the work happens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;Mano-P&lt;/a&gt; (GUI-Aware Agent Model for Edge Devices, open-source under Apache 2.0) is built around this principle. The name comes from "Mano" (Spanish for "hand") and "P" (Person &amp;amp; Party) — an agent that works with its hands, for its person.&lt;/p&gt;

&lt;p&gt;Here's the architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqp7qjsx7ax67qnagr0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqp7qjsx7ax67qnagr0g.png" alt="Mano-P Architecture" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key design decision: Mano-P uses &lt;strong&gt;vision-only understanding&lt;/strong&gt;. It looks at screenshots — raw pixels — rather than parsing HTML, querying accessibility APIs, or injecting JavaScript into the DOM. This matters for edge deployment because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No application-specific adapters.&lt;/strong&gt; The same model works on browsers, native apps, terminal windows, and 3D tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No privilege escalation required.&lt;/strong&gt; Screen capture is a standard OS capability. DOM injection and accessibility API access often require elevated permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced attack surface.&lt;/strong&gt; The agent reads pixels. It doesn't hook into application internals.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In local mode, screenshots and task data never leave the device. There's no telemetry endpoint, no "anonymous usage data" upload, no cloud fallback. The inference happens on your hardware, and the data stays on your hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running a 4B Model on Apple Silicon
&lt;/h2&gt;

&lt;p&gt;The practical question is: can edge hardware actually run a capable agent model at interactive speeds?&lt;/p&gt;

&lt;p&gt;Here are measured numbers on an Apple M4 Pro with 32GB unified memory:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model size&lt;/td&gt;
&lt;td&gt;4B parameters (w4a16 quantization)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefill throughput&lt;/td&gt;
&lt;td&gt;476 tokens/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decode throughput&lt;/td&gt;
&lt;td&gt;76 tokens/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak memory&lt;/td&gt;
&lt;td&gt;4.3 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Let's break down why these numbers matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;476 tokens/s prefill&lt;/strong&gt; means the model can ingest a screenshot (encoded as visual tokens) and the task context in well under a second. This is the "reading" phase — where the model processes what it sees on screen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;76 tokens/s decode&lt;/strong&gt; means action generation (the "writing" phase — outputting what to click, type, or scroll) takes roughly 100-300ms for a typical action sequence. This is fast enough for real-time interaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4.3 GB peak memory&lt;/strong&gt; means the model fits comfortably alongside your normal workload. On a 32GB machine, you have ~28GB left for browsers, IDEs, design tools — whatever the agent is supposed to be automating.&lt;/p&gt;

&lt;p&gt;The w4a16 quantization scheme (4-bit weights, 16-bit activations) is the key enabler here. It reduces the model's memory footprint by roughly 4x compared to fp16, while preserving activation precision where it matters most — in the attention and reasoning layers.&lt;/p&gt;
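
&lt;p&gt;A quick back-of-envelope check on those figures (the parameter count and the 20-token action length are assumptions for illustration, not official numbers):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough estimates only; parameter count and action length are assumed.
params = 4e9

fp16_weights_gb = params * 2 / 1e9      # 2 bytes per weight: about 8.0 GB
w4_weights_gb = params * 0.5 / 1e9      # 4 bits per weight: about 2.0 GB
print(fp16_weights_gb / w4_weights_gb)  # 4.0, matching the "roughly 4x" reduction

decode_tps = 76
action_tokens = 20                      # assumed length of a short click/type command
print(action_tokens / decode_tps)       # ~0.26 s to emit one action
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;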

&lt;p&gt;Apple Silicon's unified memory architecture is particularly well-suited for this workload. There's no PCIe bottleneck between CPU and GPU memory; the model weights, the screenshot tensor, and the action output all live in the same memory space. The Neural Engine and GPU cores can be dispatched to different parts of the inference pipeline without data copies.&lt;/p&gt;

&lt;p&gt;For machines without sufficient local compute, Mano-P also supports offloading to a compute stick connected via USB 4.0 — effectively adding a dedicated inference accelerator without changing the data sovereignty model (the stick is still physically local).&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Performance: Does Local Mean Worse?
&lt;/h2&gt;

&lt;p&gt;The assumption that smaller, local models must sacrifice capability is worth testing empirically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FBenchmark_Overview.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FBenchmark_Overview.png" alt="Mano-P Benchmark Results" width="800" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On OSWorld — a benchmark that tests agents on real desktop environments across operating systems — Mano-P achieves a &lt;strong&gt;58.2% success rate&lt;/strong&gt;, compared to 45.0% for the second-place model. This isn't a narrow domain-specific benchmark; OSWorld tests general GUI automation across diverse applications and multi-step workflows.&lt;/p&gt;

&lt;p&gt;On WebRetriever Protocol I, Mano-P scores &lt;strong&gt;41.7 NavEval&lt;/strong&gt;, ahead of Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3).&lt;/p&gt;

&lt;p&gt;These results suggest that the "edge tax" — the performance cost of running locally instead of in the cloud — can be zero or negative when the model architecture is specifically designed for the task. A 4B model trained and optimized for GUI understanding can outperform much larger general-purpose models that treat GUI automation as one capability among many.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Training Pipeline: How a Small Model Gets Good
&lt;/h2&gt;

&lt;p&gt;Model size alone doesn't explain the benchmark results. The training methodology matters more at this scale because every parameter has to earn its keep.&lt;/p&gt;

&lt;p&gt;Mano-P's training follows a three-stage progression:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Supervised Fine-Tuning (SFT).&lt;/strong&gt; The base model is trained on curated GUI interaction datasets — screenshots paired with correct action sequences. This gives the model foundational competence in visual grounding (mapping screen regions to semantic elements) and action generation.&lt;/p&gt;
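
&lt;p&gt;For intuition, one SFT example in this setting pairs a screenshot with the action a human took on it. The schema below is purely illustrative — it is not the project's actual data format:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class SftExample:
    """Illustrative shape of a supervised GUI-interaction example."""
    screenshot_png: bytes   # raw pixels of the screen at decision time
    instruction: str        # natural-language task description
    target_action: dict     # ground-truth action the model should output

example = SftExample(
    screenshot_png=b"...",
    instruction="archive the open email",
    target_action={"type": "click", "x": 812, "y": 96, "element": "Archive button"},
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;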

&lt;p&gt;&lt;strong&gt;Stage 2: Offline Reinforcement Learning.&lt;/strong&gt; Using collected interaction trajectories, the model learns from both successful and failed attempts. This stage improves multi-step planning — the ability to reason about sequences of actions rather than reacting to each screenshot independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3: Online Reinforcement Learning.&lt;/strong&gt; The model interacts with live environments and learns from real outcomes. A think-act-verify loop ensures the model checks whether its actions achieved the intended result before proceeding. This is where the model develops robustness — learning to recover from unexpected states, handle loading delays, and adapt to UI variations.&lt;/p&gt;

&lt;p&gt;An additional technique, &lt;strong&gt;GS-Pruning&lt;/strong&gt;, prunes redundant visual tokens before they reach the language model, cutting the compute required per screenshot without proportional capability loss. Combined with the training pipeline above, this is how a 4B model punches above its weight class.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Enables
&lt;/h2&gt;

&lt;p&gt;When an AI agent runs entirely on your device with no cloud dependency, certain use cases become possible that were previously impractical or unacceptable:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sensitive workflow automation.&lt;/strong&gt; Automating tasks that involve medical records, legal documents, financial data, or classified information — where uploading screenshots to a third-party API would violate compliance requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Air-gapped environments.&lt;/strong&gt; Research labs, government facilities, and financial trading floors often operate without internet access. A local agent works regardless of network state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistent performance.&lt;/strong&gt; No API rate limits, no cold starts, no "the service is experiencing high demand" degradation. The model runs at the same speed whether it's Monday morning or Friday night.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost predictability.&lt;/strong&gt; The hardware is a one-time cost. There's no per-token billing, no surprise invoices, no pricing changes.&lt;/p&gt;

&lt;p&gt;Beyond single-device automation, the core capabilities extend to cross-system data integration (working across multiple apps to consolidate information), long-task planning (breaking complex goals into executable sequences), and intelligent report generation (synthesizing information from multiple sources into structured output).&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Source Roadmap
&lt;/h2&gt;

&lt;p&gt;Mano-P is released under Apache 2.0 with a three-phase open-source plan:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1 (released):&lt;/strong&gt; Skills — the agent's capability modules for specific task domains&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2:&lt;/strong&gt; Local models and SDK — the inference runtime and developer integration tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 3:&lt;/strong&gt; Training methods — the full pipeline so others can train specialized models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The phased approach is deliberate. Phase 1 lets developers use and evaluate the agent immediately. Phase 2 gives them the tools to integrate it into their own products. Phase 3 enables the community to extend the model to new domains and hardware platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;The shift from cloud-dependent to edge-native AI agents isn't primarily a technical argument. It's an architectural one.&lt;/p&gt;

&lt;p&gt;Cloud APIs are shared infrastructure. They're powerful, convenient, and constantly improving. But they come with structural constraints — latency, cost, data exposure, availability — that are inherent to the architecture, not bugs to be fixed.&lt;/p&gt;

&lt;p&gt;Edge-native agents trade cloud-scale compute for data sovereignty, predictable performance, and zero marginal cost. For many workflows — especially those involving sensitive data or requiring low-latency interaction — that's a trade worth making.&lt;/p&gt;

&lt;p&gt;The benchmark results suggest it doesn't have to be a trade at all. A well-designed, well-trained 4B model running on consumer hardware can match or exceed cloud-hosted alternatives on practical GUI automation tasks.&lt;/p&gt;

&lt;p&gt;The code is on GitHub: &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;github.com/Mininglamp-AI/Mano-P&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If your data matters enough to keep it on your device, your AI agent should be able to stay there too.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>privacy</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Apple Took 50 Years for 3 CEOs — GUI Agents Went from Paper to Production in One</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Wed, 22 Apr 2026 04:56:34 +0000</pubDate>
      <link>https://dev.to/mininglamp/apple-took-50-years-for-3-ceos-gui-agents-went-from-paper-to-production-in-one-33eb</link>
      <guid>https://dev.to/mininglamp/apple-took-50-years-for-3-ceos-gui-agents-went-from-paper-to-production-in-one-33eb</guid>
      <description>&lt;p&gt;Yesterday, Apple announced a landmark succession: Tim Cook steps down as CEO to become Executive Chairman, with John Ternus taking over on September 1. In its 50-year history, Apple has had just three CEOs: Jobs, Cook, Ternus.&lt;/p&gt;

&lt;p&gt;Three people. Fifty years. Each transition spaced over a decade apart.&lt;/p&gt;

&lt;p&gt;Now consider the AI Agent space: one year ago, most people were still debating whether AI could operate a computer at all. Today, there are open-source projects delivering usable on-device solutions.&lt;/p&gt;

&lt;p&gt;This article breaks down the technical evolution of GUI Agents — using &lt;strong&gt;&lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;Mano-P&lt;/a&gt;&lt;/strong&gt;, our open-source project, as a concrete example of what it takes to go from training to on-device deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is a GUI Agent?
&lt;/h2&gt;

&lt;p&gt;A GUI Agent's core mission: let AI operate a computer's graphical interface the way a human does — recognizing screen elements, understanding task intent, and executing clicks, typing, and drag-and-drop operations.&lt;/p&gt;

&lt;p&gt;There are currently two main technical approaches:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Strength&lt;/th&gt;
&lt;th&gt;Limitation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API/DOM-driven&lt;/td&gt;
&lt;td&gt;Reads interface structure via accessibility APIs or DOM trees&lt;/td&gt;
&lt;td&gt;Precise element targeting&lt;/td&gt;
&lt;td&gt;Depends on app-specific interfaces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure vision&lt;/td&gt;
&lt;td&gt;Understands UI from screenshots alone&lt;/td&gt;
&lt;td&gt;Works across any application&lt;/td&gt;
&lt;td&gt;Higher demand on visual comprehension&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Mano-P takes the pure vision route. Designed for Mac, it's an on-device GUI Agent — "Mano" means "hand" in Spanish, "P" stands for Person. AI for Personal. It runs entirely locally; no data leaves the device.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqp7qjsx7ax67qnagr0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqp7qjsx7ax67qnagr0g.png" alt="Mano-P Open Source Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Training: Bidirectional Self-Reinforcement Learning
&lt;/h2&gt;

&lt;p&gt;The training pipeline follows a three-stage progressive framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stage 1: SFT (Supervised Fine-Tuning)
    ↓  Build foundational capabilities
Stage 2: Offline Reinforcement Learning
    ↓  Learn strategy optimization from historical data
Stage 3: Online Reinforcement Learning
    ↓  Continuously improve through real-environment interaction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 1 — SFT&lt;/strong&gt;: Supervised fine-tuning on high-quality GUI operation datasets. The model learns basic interface understanding and action mapping — ground-truth capability building.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2 — Offline RL&lt;/strong&gt;: Uses collected interaction trajectories to optimize policies via reinforcement learning. Extracts success/failure signals from historical operations without requiring live environment interaction, keeping training costs manageable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3 — Online RL&lt;/strong&gt;: Interacts with real GUI environments, adjusting strategy based on live feedback. The key challenge here is balancing exploration (trying new operation paths) with exploitation (reinforcing proven strategies).&lt;/p&gt;

&lt;h2&gt;
  
  
  Inference: Think-Act-Verify Loop
&lt;/h2&gt;

&lt;p&gt;The inference mechanism uses a think-act-verify cycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;task_not_complete&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Think: analyze current screen, plan next action
&lt;/span&gt;    &lt;span class="n"&gt;thought&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;think&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;screenshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Act: execute GUI operation (click, type, scroll)
&lt;/span&gt;    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;act&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thought&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Verify: capture new screenshot, check result
&lt;/span&gt;    &lt;span class="n"&gt;new_screenshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;capture_screen&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;verified&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_screenshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;verified&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;task_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# back to Think
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives the Agent self-correction capability. In real desktop environments, unexpected popups, loading delays, and dynamic element repositioning are common — the verify step catches these before errors cascade.&lt;/p&gt;

&lt;p&gt;Core capabilities span four areas: complex GUI automation, cross-system data integration, long-task planning and execution, and intelligent report generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Performance
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OSWorld&lt;/strong&gt;: Mano-P's 72B model achieves &lt;strong&gt;58.2% success rate&lt;/strong&gt;, ranking #1 among specialized GUI agent models. Second place scores 45.0%. OSWorld simulates real OS environments with cross-application tasks including file operations, browser interactions, and office software workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebRetriever Protocol I&lt;/strong&gt;: Scores &lt;strong&gt;41.7 NavEval&lt;/strong&gt;, surpassing Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3). This benchmark focuses on web information retrieval and interaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FBenchmark_Overview.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FBenchmark_Overview.png" alt="Mano-P Benchmark Overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Edge Deployment: 4B Model Running On-Device
&lt;/h2&gt;

&lt;p&gt;On-device deployment is a core feature of Mano-P. Here's the &lt;strong&gt;4B quantized model (w4a16)&lt;/strong&gt; performance on M4 Pro:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prefill Speed&lt;/td&gt;
&lt;td&gt;476 tokens/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decode Speed&lt;/td&gt;
&lt;td&gt;76 tokens/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak Memory&lt;/td&gt;
&lt;td&gt;4.3 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The w4a16 quantization scheme — 4-bit weights with 16-bit activations — strikes a practical balance: 4-bit weights dramatically reduce memory footprint while 16-bit activations preserve numerical precision during inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware requirement&lt;/strong&gt;: Apple M4 chip + 32 GB RAM. Fully local execution — your screen data never leaves your device.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Open-sourced under the Apache 2.0 license:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
brew tap HanningWang/tap &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; brew &lt;span class="nb"&gt;install &lt;/span&gt;mano-cua
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub: &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;https://github.com/Mininglamp-AI/Mano-P&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;From the three-stage progressive training framework, to think-act-verify inference, to w4a16 quantization enabling edge deployment — the path from "concept" to "locally usable" GUI Agents is becoming clear.&lt;/p&gt;

&lt;p&gt;Apple took 50 years and three leaders. The GUI Agent space went from academic papers to open-source tools in roughly one year. These are two fundamentally different timescales.&lt;/p&gt;

&lt;p&gt;For developers, Mano-P — Apache 2.0 licensed, runnable on a local Mac — is already a starting point for exploration and experimentation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>apple</category>
    </item>
    <item>
      <title>Tim Cook Steps Down — Is the Mac Becoming the Next AI Agent Platform?</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Wed, 22 Apr 2026 04:55:56 +0000</pubDate>
      <link>https://dev.to/mininglamp/tim-cook-steps-down-is-the-mac-becoming-the-next-ai-agent-platform-4cml</link>
      <guid>https://dev.to/mininglamp/tim-cook-steps-down-is-the-mac-becoming-the-next-ai-agent-platform-4cml</guid>
      <description>&lt;p&gt;On April 20, Apple dropped a bombshell.&lt;/p&gt;

&lt;p&gt;The company announced that Tim Cook will transition from CEO to Executive Chairman, with hardware engineering SVP John Ternus taking over on September 1. In its 50-year history, Apple has now had just three CEOs.&lt;/p&gt;

&lt;p&gt;Cook's 14-year tenure defined two eras: making Apple the world's most valuable company, and driving the historic transition from Intel to Apple Silicon. Ternus's background is telling — he's not from the software or services side. He's Apple's hardware engineering chief, the person who shipped Apple Silicon. Choosing a hardware engineer as CEO is Apple signaling that hardware innovation remains the priority for the next decade.&lt;/p&gt;

&lt;p&gt;This signal is especially interesting in the context of AI. For the past few years, AI development and deployment has been virtually synonymous with "NVIDIA GPUs + Windows/Linux." The Mac has been a non-factor in the AI ecosystem. But Apple Silicon is changing that — more and more developers are running AI workloads on Mac, and it's no longer just experimentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Mac Couldn't Do AI Before
&lt;/h2&gt;

&lt;p&gt;The answer is straightforward: &lt;strong&gt;the CUDA ecosystem&lt;/strong&gt;. NVIDIA GPUs + CUDA have effectively monopolized AI training and inference infrastructure. Apple and NVIDIA parted ways after 2016 — Macs haven't shipped with NVIDIA GPUs since. Without CUDA, major deep learning frameworks (PyTorch, TensorFlow) treated Mac as a second-class citizen — technically supported, but performance-limited.&lt;/p&gt;

&lt;p&gt;AI practitioners defaulted to Windows desktops or Linux servers. Mac was fine for writing code, but running models meant SSH-ing into a remote machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Apple Silicon Changed
&lt;/h2&gt;

&lt;p&gt;The M1 chip in 2020 was the inflection point. Apple Silicon's Unified Memory Architecture broke the traditional CPU-GPU separation — CPU and GPU share a single memory pool, eliminating the need to shuttle data between them. This design has natural advantages for AI inference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No VRAM bottleneck&lt;/strong&gt;: 32 GB or more of unified memory is directly available for model inference, unlike traditional GPUs constrained by dedicated VRAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Superior power efficiency&lt;/strong&gt;: Lower power consumption at equivalent compute, enabling MacBooks to run models on battery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Growing ecosystem&lt;/strong&gt;: Apple launched MLX, a machine learning framework optimized for Apple Silicon; PyTorch now officially supports the MPS backend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From M1 through M4, each generation has delivered meaningful improvements in AI inference performance. With M4 and 32 GB RAM, Macs can now smoothly run models that previously required dedicated GPU servers.&lt;/p&gt;
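
&lt;p&gt;If you want to sanity-check the MPS support mentioned above on your own machine, a few lines of PyTorch will do it (requires a recent PyTorch build; the matrix size is arbitrary):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

# Check whether the Metal Performance Shaders backend is available on this Mac
if torch.backends.mps.is_available():
    device = torch.device("mps")
    x = torch.randn(1024, 1024, device=device)
    y = x @ x                       # the matmul runs on the Apple Silicon GPU
    print(y.shape, y.device)
else:
    print("MPS backend not available on this machine")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;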

&lt;h2&gt;
  
  
  A Real-World Example: GUI Agents on Mac
&lt;/h2&gt;

&lt;p&gt;To make this concrete, consider GUI Agents — a fast-growing area in AI where models directly observe the screen, understand interface elements, and operate mouse and keyboard to complete complex computer tasks. These applications demand real-time local responsiveness, making them a natural fit for Mac deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;Mano-P&lt;/a&gt;&lt;/strong&gt; is our open-source GUI Agent built specifically for Mac. "Mano" comes from the Spanish word for "hand," "P" stands for Person — AI for Personal. It uses pure vision — no accessibility APIs, no DOM parsing, just screenshot understanding. Everything runs locally on Mac; no data leaves the device.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqp7qjsx7ax67qnagr0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqp7qjsx7ax67qnagr0g.png" alt="Mano-P Open Source Architecture" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does It Perform on Apple Silicon?
&lt;/h2&gt;

&lt;p&gt;The question everyone cares about: is Apple Silicon actually fast enough for AI Agents?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OSWorld Benchmark&lt;/strong&gt; (the standard end-to-end evaluation for GUI Agents): Mano-P's 72B model achieves &lt;strong&gt;58.2% success rate&lt;/strong&gt;, ranking #1. Second place scores 45.0% — a gap of over 13 percentage points.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebRetriever Protocol I&lt;/strong&gt;: Mano-P scores &lt;strong&gt;41.7 NavEval&lt;/strong&gt;, surpassing Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FBenchmark_Overview.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FBenchmark_Overview.png" alt="Mano-P Benchmark Overview" width="800" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Local inference performance — Mano-P's 4B quantized model (w4a16) on M4 Pro:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prefill Speed&lt;/td&gt;
&lt;td&gt;476 tokens/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decode Speed&lt;/td&gt;
&lt;td&gt;76 tokens/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak Memory&lt;/td&gt;
&lt;td&gt;4.3 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 4.3 GB peak memory on a 32 GB Mac, you can run the Agent alongside your IDE, browser, Slack, and everything else without breaking a sweat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware requirement&lt;/strong&gt;: Apple M4 chip + 32 GB RAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Overview
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Training&lt;/strong&gt;: Bidirectional self-reinforcement learning with three progressive stages — SFT → Offline RL → Online RL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inference&lt;/strong&gt;: Think-act-verify loop. Analyze the screen state, execute an action, verify the result. If something unexpected happens (popup, loading delay), the system self-corrects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core capabilities&lt;/strong&gt;: Complex GUI automation, cross-system data integration, long-task planning and execution, intelligent report generation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew tap HanningWang/tap &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; brew &lt;span class="nb"&gt;install &lt;/span&gt;mano-cua
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open-sourced under Apache 2.0: &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;https://github.com/Mininglamp-AI/Mano-P&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mac AI Ecosystem Is Taking Shape
&lt;/h2&gt;

&lt;p&gt;Mano-P is our contribution, but it's one data point. The bigger picture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MLX&lt;/strong&gt; gives developers an efficient way to run models on Apple Silicon&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama and LM Studio&lt;/strong&gt; make running open-source LLMs on Mac as easy as installing an app&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core ML&lt;/strong&gt; continues to improve, with Apple investing in on-device AI infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The old consensus — "doing AI means Windows/Linux + NVIDIA" — is loosening. Not because the Mac is replacing GPU servers for large-scale training, but because for &lt;strong&gt;inference, personal development, and on-device applications&lt;/strong&gt;, the Mac is becoming a genuinely viable platform.&lt;/p&gt;

&lt;p&gt;Apple just chose a hardware engineer as CEO. The Mac's AI capabilities are only going up from here. We've experienced this trend firsthand building GUI Agents on Mac, and we're excited to see more developers explore this direction.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>apple</category>
    </item>
    <item>
      <title>Google Released Android CLI Agent — Want to See an On-Device Agent on Mac?</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Mon, 20 Apr 2026 11:40:06 +0000</pubDate>
      <link>https://dev.to/mininglamp/google-released-android-cli-agent-want-to-see-an-on-device-agent-on-mac-41ei</link>
      <guid>https://dev.to/mininglamp/google-released-android-cli-agent-want-to-see-an-on-device-agent-on-mac-41ei</guid>
      <description>&lt;p&gt;Google's Android team recently released a new CLI toolchain built for AI agents — packaging SDK management, project creation, and device debugging into streamlined commands with standardized Skills and a searchable Knowledge Base. It's a clear sign: &lt;strong&gt;on-device agents are moving from concept to production&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;On the Mac, our open-source project Mano-P has been working on the same frontier — enabling AI agents to run locally and operate real GUI applications on your own machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Mano-P?
&lt;/h2&gt;

&lt;p&gt;Mano-P is an open-source, on-device GUI agent built by &lt;a href="https://github.com/Mininglamp-AI" rel="noopener noreferrer"&gt;Mininglamp Technology&lt;/a&gt;, designed for macOS. It's based on a VLA (Vision-Language-Action) architecture.&lt;/p&gt;

&lt;p&gt;"Mano" means "hand" in Spanish, and "P" stands for Person &amp;amp; Party — our vision is that every individual and organization can create their own personalized AI, running on their own hardware.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqp7qjsx7ax67qnagr0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqp7qjsx7ax67qnagr0g.png" alt="Mano-P Open Source Architecture" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mano-P is purely vision-driven: it understands screen content through visual models, plans action sequences, and executes operations via native OS input (mouse clicks, keyboard strokes). No system APIs or CLI access required — it can theoretically operate any GUI application on Mac.&lt;/p&gt;
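
&lt;p&gt;To make native OS input concrete: on macOS, a synthetic click can be posted through the Quartz event APIs (via the &lt;code&gt;pyobjc&lt;/code&gt; bindings). This is a generic illustration of the mechanism, not Mano-P's actual executor, and the calling process needs Accessibility permission to post events:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative only: synthesize a left click at (x, y) with macOS Quartz events.
# Requires the pyobjc Quartz bindings and Accessibility permission.
from Quartz import (
    CGEventCreateMouseEvent, CGEventPost,
    kCGEventLeftMouseDown, kCGEventLeftMouseUp,
    kCGHIDEventTap, kCGMouseButtonLeft,
)

def click(x: float, y: float) -&gt; None:
    for event_type in (kCGEventLeftMouseDown, kCGEventLeftMouseUp):
        event = CGEventCreateMouseEvent(None, event_type, (x, y), kCGMouseButtonLeft)
        CGEventPost(kCGHIDEventTap, event)

click(512, 300)  # press and release at screen coordinates (512, 300)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;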

&lt;h2&gt;
  
  
  Key Technical Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Think-Act-Verify Loop&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mano-P doesn't do single-pass inference. It uses a cyclical reasoning mechanism: observe → think → act → verify → repeat. After each action, the model re-examines the screen state to confirm success before deciding the next step. This enables complex workflows spanning dozens to hundreds of steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edge Optimization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To run efficiently on consumer hardware, Mano-P uses mixed-precision quantization and the GS-Pruning algorithm for visual token compression. Performance of the 4B quantized model (w4a16) on Apple M4 Pro:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefill: 476 tokens/s&lt;/li&gt;
&lt;li&gt;Decode: 76 tokens/s&lt;/li&gt;
&lt;li&gt;Peak memory: 4.3 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A Mac mini or MacBook with an M4 chip and 32GB RAM can run Mano-P locally, with no cloud dependency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three-Stage Training&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SFT → Offline RL → Online RL, with a bidirectional self-reinforcement framework (Text↔Action cycle consistency learning).&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn84c2hme34qvu6ji68fk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn84c2hme34qvu6ji68fk.png" alt="Mano-P OSWorld Results" width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mano-P has achieved competitive results on public benchmarks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mano-P (72B)&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OSWorld (Specialized)&lt;/td&gt;
&lt;td&gt;58.2%&lt;/td&gt;
&lt;td&gt;#1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WebRetriever Protocol I&lt;/td&gt;
&lt;td&gt;41.7 NavEval&lt;/td&gt;
&lt;td&gt;#1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FWebRetriever.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FWebRetriever.png" alt="Mano-P WebRetriever Results" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Privacy
&lt;/h2&gt;

&lt;p&gt;In local mode, Mano-P keeps all screenshots and task descriptions on-device. Nothing leaves your machine. The full client code is open-source and auditable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three-Phase Open Source Plan
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1&lt;/strong&gt; (released): Mano-CUA Skills for agent enthusiasts and Claude Code/OpenClaw users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2&lt;/strong&gt; (coming soon): Local model + SDK for developers with strict data security requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 3&lt;/strong&gt; (planned): Training methods + pruning/quantization techniques&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;p&gt;Install via Homebrew:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew tap HanningWang/tap &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; brew &lt;span class="nb"&gt;install &lt;/span&gt;mano-cua
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub: &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;github.com/Mininglamp-AI/Mano-P&lt;/a&gt; (Apache 2.0)&lt;/p&gt;




&lt;p&gt;The era of on-device agents is accelerating. Google took a step on Android; we've been building for Mac. If you're interested in local AI agents, give Mano-P a try and let us know what you think in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Let an Open-Source GUI Agent Play Mahjong. Here's What Happened.</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Fri, 17 Apr 2026 10:57:52 +0000</pubDate>
      <link>https://dev.to/mininglamp/i-let-an-open-source-gui-agent-play-mahjong-heres-what-happened-1333</link>
      <guid>https://dev.to/mininglamp/i-let-an-open-source-gui-agent-play-mahjong-heres-what-happened-1333</guid>
      <description>&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;Most GUI Agent demos show the same thing: open a browser, fill out a form, click "Submit." It works, it's useful, but it doesn't really stress-test what these agents can do.&lt;/p&gt;

&lt;p&gt;We wanted to find out: &lt;strong&gt;what happens when you throw a GUI Agent into a completely unfamiliar, non-standard interface?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So we picked Mahjong — a Chinese tile game with complex rules, dense visual information, and a UI that has nothing in common with a typical web app.&lt;/p&gt;

&lt;p&gt;Here's the raw video of Mano-P playing:&lt;/p&gt;


&lt;p&gt;&lt;a href="https://github.com/user-attachments/assets/397a0552-9611-4d74-9f24-99544da272b6" rel="noopener noreferrer"&gt;https://github.com/user-attachments/assets/397a0552-9611-4d74-9f24-99544da272b6&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Mano-P?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mano-P&lt;/strong&gt; (GUI-VLA Agent for Edge Devices) is an open-source project from &lt;a href="https://github.com/Mininglamp-AI" rel="noopener noreferrer"&gt;Mininglamp Technology&lt;/a&gt;. The name comes from "Mano" (Spanish for "hand") and "P" for Person + Party.&lt;/p&gt;

&lt;p&gt;The key differentiator: &lt;strong&gt;Mano-P is purely vision-driven.&lt;/strong&gt; It doesn't parse DOM trees, doesn't use accessibility APIs, doesn't rely on OCR as a preprocessing step. It takes a screenshot, understands what's on screen, and outputs mouse/keyboard actions.&lt;/p&gt;

&lt;p&gt;Think of it as an AI that operates a computer the same way you do — by looking at the screen.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔗 &lt;strong&gt;Repo&lt;/strong&gt;: &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;github.com/Mininglamp-AI/Mano-P&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📄 &lt;strong&gt;License&lt;/strong&gt;: Apache 2.0&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Mahjong is a Brutal Test Case
&lt;/h2&gt;

&lt;p&gt;If you've played Mahjong, you know it's no joke. But even if you haven't, here's why it's an excellent stress test for a GUI Agent:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Dense, Visually Similar Elements
&lt;/h3&gt;

&lt;p&gt;A Mahjong board has 136 tiles. Your hand has 13 tiles at a time. The tiles are small, visually similar (slight variations in dots, characters, bamboo patterns), and tightly packed. The agent needs pixel-level precision to distinguish a "3 of Dots" from a "5 of Dots."&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Zero Structured Data
&lt;/h3&gt;

&lt;p&gt;There's no HTML, no DOM, no accessibility tree. The game UI is rendered by a game engine — it's all pixels. This means any approach that relies on parsing page structure is out. Only pure vision works.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Strategic Reasoning Required
&lt;/h3&gt;

&lt;p&gt;This isn't "see button, click button." The agent needs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recognize all tiles in its hand&lt;/li&gt;
&lt;li&gt;Evaluate possible winning combinations&lt;/li&gt;
&lt;li&gt;Decide which tile to discard&lt;/li&gt;
&lt;li&gt;React to other players' moves (pass, claim, or declare)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Asynchronous Multi-Player Flow
&lt;/h3&gt;

&lt;p&gt;Mahjong is turn-based with 4 players. The agent has to wait for others, recognize when its turn comes, handle variable timing, and respond to unexpected events (another player declares a win, for instance).&lt;/p&gt;
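
&lt;p&gt;Handling that flow largely comes down to polling the screen until the state changes. A simplified sketch — the helper names are illustrative, not Mano-P's API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

def wait_for_my_turn(model, capture_screen, timeout_s=60, poll_interval_s=1.0):
    """Poll screenshots until the model reports it's our turn, or we time out."""
    deadline = time.time() + timeout_s
    while time.time() &lt; deadline:
        screenshot = capture_screen()
        if model.is_my_turn(screenshot):   # vision-only check of the game state
            return True
        time.sleep(poll_interval_s)        # other players are still acting
    return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;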

&lt;h2&gt;
  
  
  How It Works: Think-Act-Verify
&lt;/h2&gt;

&lt;p&gt;Mano-P doesn't just look once and act. It runs a continuous reasoning loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────┐     ┌─────────┐     ┌──────────┐
│  Think  │ ──▶ │   Act   │ ──▶ │  Verify  │
│ (analyze│     │(execute │     │(confirm  │
│ screen) │     │ action) │     │ result)  │
└─────────┘     └─────────┘     └──────────┘
      ▲                              │
      └──────── loop back ◀──────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Think&lt;/strong&gt;: Capture a screenshot, analyze the current game state. What tiles do I have? What's on the table? Is it my turn?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Act&lt;/strong&gt;: Decide and execute an action — click a tile to discard, click "Pass," click "Claim."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify&lt;/strong&gt;: Take another screenshot. Did my action register? Did the game state change as expected? If not, go back to Think.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This loop is critical for games, where animations, delays, and other players' actions create a constantly shifting interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training Pipeline
&lt;/h2&gt;

&lt;p&gt;Mano-P uses a three-stage training approach:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;SFT (Supervised Fine-Tuning)&lt;/td&gt;
&lt;td&gt;Learn basic GUI recognition and operation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Offline RL (Reinforcement Learning)&lt;/td&gt;
&lt;td&gt;Optimize action policies from recorded trajectories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Online RL&lt;/td&gt;
&lt;td&gt;Interactive learning in real environments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This progression moves from "imitate human actions" to "discover optimal strategies through exploration" — a pattern that's proven effective across many RL domains.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;p&gt;Numbers matter. Here's where Mano-P stands:&lt;/p&gt;

&lt;h3&gt;
  
  
  OSWorld (Desktop App Automation)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mano-P 72B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;58.2%&lt;/strong&gt; (Rank #1 among specialized models)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;opencua-72b&lt;/td&gt;
&lt;td&gt;45.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  WebRetriever Protocol I (Web Interaction)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mano-P&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;41.7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;40.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 4.5&lt;/td&gt;
&lt;td&gt;31.3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Edge Inference (4B Quantized, w4a16)
&lt;/h3&gt;

&lt;p&gt;Running on Apple M4 + 32GB RAM:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prefill throughput&lt;/td&gt;
&lt;td&gt;476 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decode throughput&lt;/td&gt;
&lt;td&gt;76 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak memory&lt;/td&gt;
&lt;td&gt;4.3 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's fast enough for real-time GUI interaction on a local device. No cloud API calls, no data leaving your machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware note&lt;/strong&gt;: The 4B model currently requires Apple M4 + 32GB RAM. Not all Macs can run it — be aware of this before trying.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Actually Means
&lt;/h2&gt;

&lt;p&gt;The Mahjong demo is fun, but the real takeaway is about &lt;strong&gt;generalization&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Most GUI automation tools are brittle. Traditional RPA breaks when a button moves. DOM-based agents break when there's no DOM. Screen-scraping breaks when the UI updates.&lt;/p&gt;

&lt;p&gt;A purely vision-driven agent doesn't have these dependencies. If a human can operate the application by looking at the screen, Mano-P can too — at least in principle. The Mahjong demo shows this isn't just theory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Non-standard UI? Handled.&lt;/li&gt;
&lt;li&gt;✅ Visually dense interface? Handled.&lt;/li&gt;
&lt;li&gt;✅ Strategic reasoning? Handled.&lt;/li&gt;
&lt;li&gt;✅ Async multi-player flow? Handled.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The same architecture that plays Mahjong can automate legacy enterprise systems, operate desktop applications, or handle any GUI that doesn't expose a programmatic interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Source Roadmap
&lt;/h2&gt;

&lt;p&gt;Mano-P is released under Apache 2.0 with a three-phase open-source plan:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Phase 1&lt;/td&gt;
&lt;td&gt;Skills (core capabilities)&lt;/td&gt;
&lt;td&gt;✅ Released&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phase 2&lt;/td&gt;
&lt;td&gt;Local models + SDK&lt;/td&gt;
&lt;td&gt;Coming soon&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phase 3&lt;/td&gt;
&lt;td&gt;Training methodology&lt;/td&gt;
&lt;td&gt;Planned&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try It Out
&lt;/h2&gt;

&lt;p&gt;The project is live on GitHub. Whether you're interested in GUI automation, VLA research, or just want to see an AI play Mahjong, check it out:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;github.com/Mininglamp-AI/Mano-P&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're working on GUI agents or have thoughts on vision-driven automation, I'd love to hear from you in the comments. What would &lt;strong&gt;you&lt;/strong&gt; test a GUI Agent on?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by the open-source team at Mininglamp Technology. Mano-P is Apache 2.0 licensed — contributions welcome.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>fun</category>
    </item>
    <item>
      <title>On-Device AI Agents vs Cloud AI Agents: Which Path Are You Betting On?</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Thu, 16 Apr 2026 11:22:29 +0000</pubDate>
      <link>https://dev.to/mininglamp/on-device-ai-agents-vs-cloud-ai-agents-which-path-are-you-betting-on-53dp</link>
      <guid>https://dev.to/mininglamp/on-device-ai-agents-vs-cloud-ai-agents-which-path-are-you-betting-on-53dp</guid>
      <description>&lt;p&gt;Let me start with a question that's been bugging me lately:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Would you let an AI agent continuously stream your entire screen — emails, Slack DMs, browser tabs, documents — to a remote server?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you hesitated, you've already identified the core tension in the AI Agent space right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Paths, One Goal
&lt;/h2&gt;

&lt;p&gt;In 2026, the AI Agent world has split into two distinct camps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Camp Cloud&lt;/strong&gt; says: Throw the biggest models at the problem. 100B+ parameters, GPU clusters, infinite context windows. The raw intelligence approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Camp On-Device&lt;/strong&gt; says: Run the model locally. Your data never leaves your machine. Trade some model size for privacy, speed, and zero marginal cost.&lt;/p&gt;

&lt;p&gt;Both camps want the same thing — an AI that can actually &lt;em&gt;use&lt;/em&gt; your computer for you. Open apps, fill forms, click buttons, extract data, automate workflows. The disagreement is about &lt;em&gt;where&lt;/em&gt; the brain should live.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cloud Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Cloud-based GUI agents work like this: screenshot your screen → upload to cloud → model processes → send back instructions → repeat.&lt;/p&gt;

&lt;p&gt;For a simple demo, this is fine. For daily use? Let's talk about the three elephants in the room.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Privacy Is Not a Feature — It's a Prerequisite
&lt;/h3&gt;

&lt;p&gt;GUI agents need to &lt;em&gt;see&lt;/em&gt; your screen. Everything on it. Your email drafts, your Slack conversations, your financial spreadsheets, your browser history. All of it gets uploaded to someone else's server for processing.&lt;/p&gt;

&lt;p&gt;For individual developers? Maybe you're okay with that. For enterprise deployments? Compliance teams will shut this down before you finish the proposal deck.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Latency Compounds
&lt;/h3&gt;

&lt;p&gt;A single cloud roundtrip might take 500ms. Sounds fast. But agents aren't single-shot — they're multi-step. A 10-step task means 10 roundtrips, and suddenly you're looking at 5+ seconds of cumulative network delay on top of inference time. That's the difference between "this feels instant" and "I could've done this faster myself."&lt;/p&gt;
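
&lt;p&gt;The arithmetic is easy to check (the per-step inference time is an assumption; the 500ms roundtrip is the figure above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative latency totals for a 10-step agent task.
steps = 10
cloud_roundtrip_s = 0.5    # network + queuing per step
inference_s = 0.3          # assumed model inference time per step, same for both paths

cloud_total = steps * (cloud_roundtrip_s + inference_s)   # 8.0 s end to end
local_total = steps * inference_s                          # 3.0 s end to end
print(cloud_total, local_total)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;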

&lt;h3&gt;
  
  
  3. Cost Scales Linearly (Your Patience Doesn't)
&lt;/h3&gt;

&lt;p&gt;Vision model inference isn't cheap, especially with high-resolution screenshots. Every step costs tokens. Every retry costs tokens. Every mistake-and-recover costs tokens. Developers who prototyped with cloud APIs and then tried to run agents continuously were often surprised by the monthly bill.&lt;/p&gt;
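
&lt;p&gt;A sketch of how that adds up — every number below is a placeholder chosen for round math, not real API pricing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical numbers to show the linear scaling, not actual pricing.
image_tokens_per_step = 1500       # high-resolution screenshot, assumed
text_tokens_per_step = 500         # instructions plus action output, assumed
steps_per_task = 20
tasks_per_day = 50
usd_per_million_tokens = 3.00      # placeholder rate

daily_tokens = (image_tokens_per_step + text_tokens_per_step) * steps_per_task * tasks_per_day
monthly_usd = daily_tokens * 30 / 1e6 * usd_per_million_tokens
print(round(monthly_usd, 2))       # 180.0 at these placeholder rates, before any retries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;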

&lt;h2&gt;
  
  
  The On-Device Bet
&lt;/h2&gt;

&lt;p&gt;The on-device approach flips these trade-offs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privacy&lt;/strong&gt;: Your screen data never leaves your machine. Period.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: Local inference, no network roundtrip.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: One-time setup, zero marginal cost per operation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The catch? You need to fit a capable model into consumer hardware. And that's where things get technically interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Do You Fit an Agent Into a Laptop?
&lt;/h2&gt;

&lt;p&gt;Three key techniques make on-device agents viable in 2026:&lt;/p&gt;

&lt;h3&gt;
  
  
  Quantization (W4A16)
&lt;/h3&gt;

&lt;p&gt;Compress model weights from FP16 to 4-bit integers while keeping activations at FP16 precision. This cuts model size to roughly 1/4 while preserving most of the model's capability.&lt;/p&gt;

&lt;p&gt;Real-world numbers on a &lt;strong&gt;4B quantized model running on Apple M4 + 32GB RAM&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prefill speed&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;476 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decode speed&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;76 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak memory&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.3 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Let that sink in. 4.3GB peak memory means your agent runs alongside your normal apps without breaking a sweat. 76 tok/s decode means action instructions are generated faster than you can read them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visual Token Pruning (GSPruning)
&lt;/h3&gt;

&lt;p&gt;GUI screenshots are full of visual redundancy — blank areas, solid backgrounds, decorative elements. GSPruning identifies and removes low-information visual tokens before they hit the language model. The result: 30-50% fewer tokens to process, with minimal impact on task accuracy.&lt;/p&gt;
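
&lt;p&gt;The general idea can be sketched in a few lines. This is a generic low-information-token filter for intuition only — it is not the actual GSPruning algorithm:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def prune_visual_tokens(tokens: np.ndarray, keep_ratio: float = 0.6) -&gt; np.ndarray:
    """Keep the highest-variance patch embeddings; drop near-uniform ones.

    tokens: (num_tokens, dim) array. A flat, featureless patch (blank
    background) has low variance and gets dropped.
    """
    scores = tokens.var(axis=1)               # crude per-token information score
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]            # indices of the top-k tokens
    return tokens[np.sort(keep)]              # preserve original ordering

patches = np.random.randn(1024, 256)          # e.g. 1024 patch embeddings
pruned = prune_visual_tokens(patches)         # about 40% fewer tokens to process
print(patches.shape, pruned.shape)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;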

&lt;h3&gt;
  
  
  Pure Vision (No DOM, No API)
&lt;/h3&gt;

&lt;p&gt;This is a deliberate architectural choice. Instead of parsing DOM trees or hooking into application APIs, on-device agents understand the screen purely through vision — the same way a human would.&lt;/p&gt;

&lt;p&gt;Why? Because DOM parsing only works for web apps. Desktop applications, system dialogs, proprietary software — none of these expose DOM trees. A pure-vision agent can work with &lt;em&gt;any&lt;/em&gt; interface a human can see and interact with.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncchcu8a59yayqdyimux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncchcu8a59yayqdyimux.png" alt="Architecture Overview" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  "But Can Small Models Actually Do This?"
&lt;/h2&gt;

&lt;p&gt;Fair question. Here's what the benchmarks say.&lt;/p&gt;

&lt;h3&gt;
  
  
  OSWorld
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;72B model (pure vision)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;58.2%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runner-up&lt;/td&gt;
&lt;td&gt;45.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's a 13.2 percentage point gap on one of the most rigorous GUI agent benchmarks available.&lt;/p&gt;

&lt;h3&gt;
  
  
  WebRetriever
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;72B model&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;41.7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;40.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;31.3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FBenchmark_Overview.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FBenchmark_Overview.png" alt="Benchmark Overview" width="800" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, to be clear: the 72B model isn't what you'd run on your laptop. The deployment path is: 72B validates the architecture → knowledge distillation transfers capability to a 4B model → quantization makes the 4B model run on consumer hardware.&lt;/p&gt;

&lt;p&gt;The 72B benchmarks prove the ceiling. The 4B quantized model is what you actually use.&lt;/p&gt;
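
&lt;p&gt;The distillation step in that path is a standard technique. A minimal sketch of the usual soft-label loss — generic, not the project's training code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitude stays comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

student_logits = torch.randn(8, 32000)   # batch of 8, 32k vocab (illustrative sizes)
teacher_logits = torch.randn(8, 32000)
print(distillation_loss(student_logits, teacher_logits))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;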

&lt;h2&gt;
  
  
  The Trade-Off Matrix
&lt;/h2&gt;

&lt;p&gt;Let me lay it out honestly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Cloud Agent&lt;/th&gt;
&lt;th&gt;On-Device Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Raw capability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher (bigger models)&lt;/td&gt;
&lt;td&gt;Lower (but closing the gap)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Privacy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your data on their servers&lt;/td&gt;
&lt;td&gt;Your data stays local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Network-dependent&lt;/td&gt;
&lt;td&gt;Near-instant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost per use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pay per token&lt;/td&gt;
&lt;td&gt;Zero after setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Offline support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cross-app support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Varies (DOM/API dependent)&lt;/td&gt;
&lt;td&gt;Any visible interface (pure vision)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hardware requirement&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any device with internet&lt;/td&gt;
&lt;td&gt;M4 + 32GB or equivalent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API key&lt;/td&gt;
&lt;td&gt;Local model deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Neither column is strictly better. It depends on what you're optimizing for.&lt;/p&gt;

&lt;h2&gt;
  
  
  An Open-Source Reference: Mano-P
&lt;/h2&gt;

&lt;p&gt;Our team has been working on this problem, and we've open-sourced our on-device agent implementation as &lt;strong&gt;Mano-P&lt;/strong&gt; under the Apache 2.0 license.&lt;/p&gt;

&lt;p&gt;Key technical choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pure vision approach (no DOM/API dependency)&lt;/li&gt;
&lt;li&gt;W4A16 quantization for edge deployment&lt;/li&gt;
&lt;li&gt;GSPruning for visual token efficiency&lt;/li&gt;
&lt;li&gt;SFT + RL training pipeline&lt;/li&gt;
&lt;li&gt;Native Apple Silicon support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We chose Apache 2.0 because we believe this space needs open collaboration. Restrictive licenses would only slow things down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;https://github.com/Mininglamp-AI/Mano-P&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Think (And Where I Might Be Wrong)
&lt;/h2&gt;

&lt;p&gt;My current mental model: the future isn't purely cloud &lt;em&gt;or&lt;/em&gt; purely on-device. It's a hybrid.&lt;/p&gt;

&lt;p&gt;On-device handles the privacy-sensitive stuff — screen understanding, action execution, anything involving your personal data. Cloud provides optional capability boosts — complex multi-step reasoning, large-scale knowledge retrieval, tasks that genuinely need 100B+ parameters.&lt;/p&gt;
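
&lt;p&gt;To make that hybrid split concrete, here's a hypothetical sketch of a routing policy. Every name and threshold below is illustrative; nothing here is part of Mano-P.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of a hybrid routing policy: keep anything that touches screen
# content on-device, escalate only heavy reasoning to the cloud. All names and
# thresholds here are illustrative, not part of Mano-P.
from dataclasses import dataclass

@dataclass
class Task:
    instruction: str
    touches_screen_content: bool      # screenshots, form data, personal files
    estimated_reasoning_steps: int

def route(task, allow_cloud=True):
    if task.touches_screen_content:
        return "on-device"            # privacy-sensitive work never leaves the machine
    if allow_cloud and task.estimated_reasoning_steps &gt; 20:
        return "cloud"                # long-horizon planning, large-scale retrieval
    return "on-device"

print(route(Task("Fill out the CRM form", True, 5)))                  # on-device
print(route(Task("Plan a 30-step research workflow", False, 30)))     # cloud
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;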

&lt;p&gt;But I might be wrong. Maybe cloud providers will solve the privacy problem with confidential computing. Maybe on-device models will get good enough that cloud augmentation becomes unnecessary. Maybe a third path emerges that we haven't thought of yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Discussion Part
&lt;/h2&gt;

&lt;p&gt;I'm genuinely curious about how other developers and teams are thinking about this. A few questions I'd love to hear perspectives on:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Where's your privacy line?&lt;/strong&gt; Would you use a cloud-based GUI agent for work? Personal use? Both? Neither?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Is the performance gap closing fast enough?&lt;/strong&gt; 4B quantized models are usable today, but they're not as capable as cloud giants. Do you think the gap will close in 12 months? 24? Never?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. What's your hardware reality?&lt;/strong&gt; The benchmarks above use M4 + 32GB. That's not a budget machine. Is the hardware bar too high for on-device to go mainstream?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Pure vision vs DOM/API — which bet would you make?&lt;/strong&gt; Pure vision is more general but harder. DOM/API is more reliable but limited in scope. Where do you land?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Does the license matter?&lt;/strong&gt; Apache 2.0 vs GPL vs proprietary — does the licensing model affect whether you'd actually adopt an on-device agent?&lt;/p&gt;

&lt;p&gt;Drop your thoughts in the comments. I'll be reading every one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Disclosure: I work on the Mano-P project at Mininglamp Technology. All benchmark data cited is from public evaluations. This post represents our team's perspective, not an objective industry assessment.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Eyes and Hands for GUI Agents: How VLA Models Enable End-to-End Desktop Automation</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Wed, 15 Apr 2026 11:43:48 +0000</pubDate>
      <link>https://dev.to/mininglamp/eyes-and-hands-for-gui-agents-how-vla-models-enable-end-to-end-desktop-automation-nk7</link>
      <guid>https://dev.to/mininglamp/eyes-and-hands-for-gui-agents-how-vla-models-enable-end-to-end-desktop-automation-nk7</guid>
      <description>&lt;h1&gt;
  
  
  Eyes and Hands for GUI Agents: How VLA Models Enable End-to-End Desktop Automation
&lt;/h1&gt;

&lt;p&gt;Most GUI automation today works by reading the app's internals — parsing HTML, querying the DOM, hooking into accessibility APIs. It works well... until you hit a native desktop app with no exposed interface.&lt;/p&gt;

&lt;p&gt;At Mininglamp, we asked a different question: &lt;strong&gt;what if the model just looked at the screen, the way a human does?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the premise behind GUI-VLA (Vision-Language-Action), and we open-sourced our implementation as &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;Mano-P&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is GUI-VLA?
&lt;/h2&gt;

&lt;p&gt;VLA comes from robotics — a robot sees the world through cameras, understands a spoken command, and moves its arms to act. GUI-VLA applies the same idea to screen automation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vision&lt;/strong&gt;: the input is a raw screenshot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language&lt;/strong&gt;: the model understands your natural language instruction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action&lt;/strong&gt;: the output is a concrete GUI operation — click at (x, y), type text, scroll, drag&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pipeline is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Screenshot → Visual Encoding → Language Understanding → Action Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No CDP protocol. No HTML parsing. No accessibility tree. Just pixels in, actions out.&lt;/p&gt;

&lt;p&gt;This means it can operate &lt;strong&gt;any application with a graphical interface&lt;/strong&gt; — including native macOS apps, legacy desktop software, and cross-application workflows that no single API can bridge.&lt;/p&gt;
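
&lt;p&gt;Conceptually, the control loop looks something like this. The &lt;code&gt;model.predict_action&lt;/code&gt; call is a hypothetical placeholder rather than the actual Mano-P API, and &lt;code&gt;pyautogui&lt;/code&gt; is used here only to show what OS-level dispatch of the predicted action could look like.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the pixels-in, actions-out loop. `model.predict_action` is a hypothetical
# placeholder for the VLA model call; pyautogui illustrates OS-level dispatch.
import pyautogui

def execute(action):
    """Dispatch a predicted action as mouse/keyboard events."""
    if action["type"] == "click":
        pyautogui.click(action["x"], action["y"])
    elif action["type"] == "type":
        pyautogui.write(action["text"])
    elif action["type"] == "scroll":
        pyautogui.scroll(action["amount"])

def run_step(model, instruction):
    screenshot = pyautogui.screenshot()                       # raw pixels in
    action = model.predict_action(screenshot, instruction)    # e.g. {"type": "click", "x": 412, "y": 230}
    execute(action)                                           # concrete GUI operation out
    return action
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;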

&lt;h2&gt;
  
  
  How We Train It: Three Stages
&lt;/h2&gt;

&lt;p&gt;Getting a model to reliably operate GUIs from screenshots is hard. A single wrong click can cascade into a completely wrong state. Here's our three-stage training approach:&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 1: Supervised Fine-Tuning (SFT)
&lt;/h3&gt;

&lt;p&gt;We start with large-scale supervised learning on (screenshot, instruction, correct action) triplets. This teaches the model the basics — what buttons look like, where text fields are, how menus work.&lt;/p&gt;
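
&lt;p&gt;Roughly, one training example looks like this; the field names are assumptions for illustration, not the actual dataset schema.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative shape of one SFT example: (screenshot, instruction, correct action).
# Field names are assumptions, not the actual Mano-P dataset schema.
from dataclasses import dataclass

@dataclass
class GuiTriplet:
    screenshot_path: str   # raw screen capture, no DOM or accessibility data
    instruction: str       # natural-language task description
    action: dict           # ground-truth operation to imitate

examples = [
    GuiTriplet("frames/0001.png", "Open the File menu", {"type": "click", "x": 32, "y": 12}),
    GuiTriplet("frames/0002.png", "Name the project", {"type": "type", "text": "Q3 report"}),
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;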

&lt;h3&gt;
  
  
  Stage 2: Offline Reinforcement Learning
&lt;/h3&gt;

&lt;p&gt;SFT only shows the model correct actions. Offline RL introduces negative examples and reward signals, teaching the model to distinguish good actions from bad ones without live interaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 3: Online Reinforcement Learning (Mano-Action)
&lt;/h3&gt;

&lt;p&gt;The final stage. The model interacts with real environments, receives actual feedback, and refines its policy. We call this the &lt;strong&gt;Mano-Action method&lt;/strong&gt;. This is where the model develops genuine error recovery skills.&lt;/p&gt;

&lt;p&gt;Why three stages? Because GUI operations have &lt;strong&gt;cascading errors&lt;/strong&gt;. Click the wrong button, and every subsequent step happens in the wrong context. SFT alone can't handle this. RL builds the judgment to recover from mistakes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think → Act → Verify
&lt;/h2&gt;

&lt;p&gt;For complex, multi-step tasks, we use a &lt;strong&gt;Think-Act-Verify&lt;/strong&gt; reasoning loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Think&lt;/strong&gt;: Analyze the current screenshot, understand the state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Act&lt;/strong&gt;: Execute the next operation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify&lt;/strong&gt;: Check if the result matches expectations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If verification fails, the model loops back to Think and replans instead of blindly continuing. This is critical for long task chains like "gather data from three apps and compile a report."&lt;/p&gt;
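
&lt;p&gt;Here's a minimal sketch of that loop, reusing the &lt;code&gt;execute&lt;/code&gt; helper from the earlier sketch; &lt;code&gt;model.think&lt;/code&gt; and &lt;code&gt;model.verify&lt;/code&gt; are hypothetical placeholders for the model calls.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a Think-Act-Verify loop. `model.think` and `model.verify` are
# hypothetical placeholders; `execute` is the dispatch helper from the earlier sketch.
import pyautogui

def run_task(model, goal, max_steps=30, max_retries=3):
    failures = 0
    for _ in range(max_steps):
        plan = model.think(pyautogui.screenshot(), goal)     # Think: read the current state
        if plan.done:
            return True
        execute(plan.action)                                 # Act: perform the next operation
        if model.verify(pyautogui.screenshot(), plan.expectation):   # Verify the outcome
            failures = 0
        else:
            failures += 1          # the next Think replans from the actual screen state
            if failures &gt;= max_retries:
                return False       # give up instead of compounding a wrong context
    return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;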

&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;p&gt;Let's talk numbers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FBenchmark_Overview.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FBenchmark_Overview.png" alt="Benchmark Overview" width="800" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  OSWorld (Specialized Models)
&lt;/h3&gt;

&lt;p&gt;OSWorld evaluates desktop OS-level GUI automation. Among specialized models:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mano-P 72B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;58.2%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;opencua-72b&lt;/td&gt;
&lt;td&gt;45.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn84c2hme34qvu6ji68fk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn84c2hme34qvu6ji68fk.png" alt="OSWorld Ranking" width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's a 13.2-point gap over the second-place model.&lt;/p&gt;

&lt;h3&gt;
  
  
  WebRetriever Protocol I
&lt;/h3&gt;

&lt;p&gt;On web information retrieval:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mano-P&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;41.7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;40.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 4.5&lt;/td&gt;
&lt;td&gt;31.3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note that Gemini 2.5 Pro and Claude 4.5 are flagship general-purpose models. A specialized GUI-VLA model outperforming them on this task suggests that purpose-built architectures still have an edge in vertical scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FGUI_Agent_Grounding_Benchmark.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FMininglamp-AI%2FMano-P%2Fmain%2Fpics%2FGUI_Agent_Grounding_Benchmark.png" alt="GUI Grounding Benchmark" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Running It on Your Mac
&lt;/h2&gt;

&lt;p&gt;One of the things we're most excited about: the 4B quantized model runs locally on Apple Silicon.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prefill speed&lt;/td&gt;
&lt;td&gt;476 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decode speed&lt;/td&gt;
&lt;td&gt;76 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak memory&lt;/td&gt;
&lt;td&gt;4.3 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware&lt;/td&gt;
&lt;td&gt;M4 chip + 32GB RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;4.3 GB peak memory means you can run it as a background service without impacting your daily workflow. Your data stays on your machine — no cloud uploads required.&lt;/p&gt;

&lt;p&gt;The secret sauce here is &lt;strong&gt;GSPruning&lt;/strong&gt; — visual token pruning that removes tokens corresponding to unimportant screen regions (blank backgrounds, decorative elements). This gives us a &lt;strong&gt;2-3x speedup&lt;/strong&gt; without meaningful accuracy loss.&lt;/p&gt;
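
&lt;p&gt;As a rough illustration of the idea (not the actual GSPruning algorithm), pruning amounts to scoring visual tokens and keeping only the most informative fraction:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough illustration of visual-token pruning: score tokens, keep the top fraction.
# This shows the general idea only; it is not the actual GSPruning algorithm.
import numpy as np

def prune_visual_tokens(tokens, importance, keep_ratio=0.25):
    """tokens: (N, D) visual embeddings; importance: (N,) per-token scores."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.sort(np.argsort(importance)[-k:])   # drop low-scoring tokens (blank regions)
    return tokens[keep_idx], keep_idx                  # keep the rest in original spatial order

tokens = np.random.randn(1024, 256).astype(np.float32)   # e.g. a screenshot encoded as 1024 tokens
importance = np.random.rand(1024)                         # stand-in for a learned importance score
pruned, kept = prune_visual_tokens(tokens, importance)
print(pruned.shape)   # (256, 256) at 25% retention
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;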

&lt;h2&gt;
  
  
  What Can It Actually Do?
&lt;/h2&gt;

&lt;p&gt;With its current capabilities, Mano-P handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complex GUI Automation&lt;/strong&gt;: Multi-step interface operations across applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-System Data Integration&lt;/strong&gt;: Moving and combining data between different apps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-Task Planning&lt;/strong&gt;: Workflows that require multi-step reasoning and planning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Report Generation&lt;/strong&gt;: Extracting information from interfaces and producing structured outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Honest Limitations
&lt;/h2&gt;

&lt;p&gt;We believe in transparent communication about what works and what doesn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In highly standardized web scenarios, API-based approaches (CDP + DOM) can still be more reliable than pure vision&lt;/li&gt;
&lt;li&gt;Screenshot resolution and interface complexity affect recognition accuracy&lt;/li&gt;
&lt;li&gt;There's a capability gap between the 4B edge model and the 72B cloud model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GUI-VLA isn't here to replace API-based agents. It's here to handle everything those agents can't reach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Mano-P is fully open source under Apache 2.0:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;github.com/Mininglamp-AI/Mano-P&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paper&lt;/strong&gt;: &lt;a href="https://arxiv.org/abs/2509.17336" rel="noopener noreferrer"&gt;arxiv.org/abs/2509.17336&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'd genuinely love feedback — issues, PRs, or just telling us where it falls short.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What do you think about vision-only GUI agents?&lt;/strong&gt; Is the pure-vision approach the future, or will API-based methods always win in structured environments? Drop your thoughts in the comments — we're especially curious to hear from anyone who's tried building GUI agents in production.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>automation</category>
    </item>
    <item>
      <title>How Mano-P Achieves #1 on OSWorld: Architecture, Benchmarks, and Edge Deployment</title>
      <dc:creator>Mininglamp</dc:creator>
      <pubDate>Tue, 14 Apr 2026 12:15:51 +0000</pubDate>
      <link>https://dev.to/mininglamp/how-mano-p-achieves-1-on-osworld-architecture-benchmarks-and-edge-deployment-4p81</link>
      <guid>https://dev.to/mininglamp/how-mano-p-achieves-1-on-osworld-architecture-benchmarks-and-edge-deployment-4p81</guid>
      <description>&lt;p&gt;Open-source GUI agents have been gaining traction, but most still rely on cloud inference, DOM parsing, or CLI hooks. Mano-P takes a different approach: pure vision-driven GUI automation that runs entirely on edge devices. And the benchmark results back it up — #1 on OSWorld among specialized models.&lt;/p&gt;

&lt;p&gt;This article breaks down the architecture, benchmark data, and edge deployment performance, all from the project's public README and technical report.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  OSWorld: #1 Among Specialized Models
&lt;/h3&gt;

&lt;p&gt;OSWorld is the standard benchmark for GUI agent evaluation. Mano-P 1.0-72B achieves a &lt;strong&gt;58.2% success rate&lt;/strong&gt;, ranking first among all specialized GUI agent models. For context:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;OSWorld Success Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mano-P 1.0-72B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;58.2%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;opencua-72b&lt;/td&gt;
&lt;td&gt;45.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gap&lt;/td&gt;
&lt;td&gt;+13.2 percentage points&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's not a marginal improvement — it's a 29% relative gain over the second-place model.&lt;/p&gt;

&lt;h3&gt;
  
  
  WebRetriever: Beating Cloud Giants
&lt;/h3&gt;

&lt;p&gt;On the WebRetriever Protocol I benchmark, Mano-P scores &lt;strong&gt;41.7 NavEval&lt;/strong&gt;, which puts it ahead of:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;NavEval Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mano-P 1.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;41.7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro Computer Use&lt;/td&gt;
&lt;td&gt;40.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 4.5 Computer Use&lt;/td&gt;
&lt;td&gt;31.3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Worth noting: Gemini and Claude are cloud-based services with massive compute budgets. Mano-P achieves comparable or better results while running on local hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  13 Benchmarks, SOTA Across the Board
&lt;/h3&gt;

&lt;p&gt;Beyond OSWorld and WebRetriever, Mano-P holds SOTA positions across 13 benchmarks spanning GUI grounding, perception &amp;amp; cognition, context learning, and pruning efficiency. The full benchmark data is available in the &lt;a href="https://github.com/Mininglamp-AI/Mano-P#benchmark-performance" rel="noopener noreferrer"&gt;README&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture: Why Pure Vision?
&lt;/h2&gt;

&lt;p&gt;Most GUI agents fall into one of these categories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;How It Works&lt;/th&gt;
&lt;th&gt;Limitation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DOM/HTML Parsing&lt;/td&gt;
&lt;td&gt;Read page structure directly&lt;/td&gt;
&lt;td&gt;Web-only, breaks on native apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CDP + CLI&lt;/td&gt;
&lt;td&gt;Chrome DevTools Protocol + shell commands&lt;/td&gt;
&lt;td&gt;Browser-dependent, fragile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud Computer Use&lt;/td&gt;
&lt;td&gt;Send screenshots to cloud API&lt;/td&gt;
&lt;td&gt;Privacy concerns, latency, API costs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pure Vision (Mano-P)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;See the screen, understand it, act&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Requires capable on-device model&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Mano-P chose pure vision. No DOM access, no browser hooks, no platform-specific APIs. The model looks at the screen — the same pixels a human sees — and decides what to click, type, or scroll.&lt;/p&gt;

&lt;p&gt;This is harder to build, but the payoff is generality: the same model works across any GUI application, any platform, without integration work per app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training Methodology: Mano-Action
&lt;/h2&gt;

&lt;p&gt;The technical backbone is &lt;strong&gt;Mano-Action&lt;/strong&gt;, a bidirectional self-reinforcement learning framework. The training follows three stages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Supervised Fine-Tuning (SFT)&lt;/strong&gt;&lt;br&gt;
Starting from a base vision-language model, fine-tune on curated GUI interaction datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: Offline Reinforcement Learning&lt;/strong&gt;&lt;br&gt;
Learn from recorded interaction trajectories, optimizing action quality without live environment access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3: Online Reinforcement Learning&lt;/strong&gt;&lt;br&gt;
The model interacts with real GUI environments, receiving feedback and iterating. This is where the "think-act-verify" loop reasoning mechanism comes in — the model plans an action, executes it, verifies the result, and adjusts.&lt;/p&gt;

&lt;p&gt;The bidirectional aspect means Text→Action and Action→Text consistency are both optimized, creating a tighter loop between understanding and execution.&lt;/p&gt;
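
&lt;p&gt;Conceptually, the combined objective looks something like the sketch below. This is an illustration of the idea, not the actual Mano-Action training code; the method names are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Conceptual sketch of the bidirectional objective: the same model is optimized to
# map instructions to actions and actions back to descriptions. Illustration only;
# the loss terms and method names are assumptions, not the published framework.
def training_step(model, batch):
    # Text -&gt; Action: given screenshot + instruction, predict the ground-truth action.
    loss_t2a = model.action_loss(batch.screenshot, batch.instruction, batch.action)
    # Action -&gt; Text: given screenshot + action, reconstruct the matching description.
    loss_a2t = model.caption_loss(batch.screenshot, batch.action, batch.instruction)
    # Optimizing both keeps understanding and execution consistent with each other.
    return loss_t2a + loss_a2t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
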
&lt;h2&gt;
  
  
  Edge Optimization: Running on Apple M4
&lt;/h2&gt;

&lt;p&gt;The 72B model delivers SOTA benchmarks, but the edge story is equally important. Through &lt;strong&gt;mixed-precision quantization&lt;/strong&gt; and a novel visual token pruning technique called &lt;strong&gt;GSPruning&lt;/strong&gt;, Mano-P achieves practical performance on consumer hardware.&lt;/p&gt;
&lt;h3&gt;
  
  
  GSPruning: Preserving What Matters
&lt;/h3&gt;

&lt;p&gt;GSPruning (Global Spatial Pruning) is designed specifically for vision-language models processing high-resolution interfaces. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Preserves global spatial structure through anchor points&lt;/li&gt;
&lt;li&gt;Identifies semantic outliers for critical UI elements&lt;/li&gt;
&lt;li&gt;Achieves 2-3× throughput speedup with minimal performance loss&lt;/li&gt;
&lt;/ul&gt;
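
&lt;p&gt;A rough sketch of those two ideas, spatial anchors plus semantic outliers, is below. The heuristics and parameters are assumptions for illustration, not the published algorithm.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch of the two GSPruning ideas above: a coarse grid of spatial
# anchors plus semantically unusual tokens. Assumptions only, not the real algorithm.
import numpy as np

def gspruning_sketch(tokens, grid=32, anchor_stride=4, outlier_ratio=0.15):
    """tokens: (grid*grid, D) visual embeddings laid out row-major over the screenshot."""
    n = tokens.shape[0]
    idx = np.arange(n).reshape(grid, grid)
    anchors = idx[::anchor_stride, ::anchor_stride].ravel()   # coarse grid preserves global layout

    center = tokens.mean(axis=0)
    dist = np.linalg.norm(tokens - center, axis=1)            # unusual tokens are likely UI elements
    outliers = np.argsort(dist)[-int(n * outlier_ratio):]

    keep = np.union1d(anchors, outliers)                      # union, back in spatial order
    return tokens[keep], keep

tokens = np.random.randn(1024, 256).astype(np.float32)
pruned, keep = gspruning_sketch(tokens)
print(len(keep), "of", len(tokens), "tokens kept")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;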

&lt;p&gt;On the Online-Mind2Web benchmark, GSPruning at 25% token retention achieves a success rate of 0.400 on Qwen3VL-4B, compared to 0.425 at full tokens — only a 6% drop while running significantly faster.&lt;/p&gt;
&lt;h3&gt;
  
  
  M4 Pro Performance
&lt;/h3&gt;

&lt;p&gt;The 4B quantized model (W4A16) on an Apple M4 Pro with 64GB RAM:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prefill Speed&lt;/td&gt;
&lt;td&gt;476 tokens/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decode Speed&lt;/td&gt;
&lt;td&gt;76 tokens/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak Memory&lt;/td&gt;
&lt;td&gt;4.3 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefill Time (4K context)&lt;/td&gt;
&lt;td&gt;8.6s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;4.3 GB peak memory means this runs comfortably alongside other applications. No dedicated GPU server required.&lt;/p&gt;
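
&lt;p&gt;Those numbers are internally consistent: prefilling a 4K context (roughly 4096 tokens) at 476 tokens/s works out to about 4096 ÷ 476 ≈ 8.6 s.&lt;/p&gt;
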
&lt;h3&gt;
  
  
  Hardware Requirements
&lt;/h3&gt;

&lt;p&gt;Two deployment options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Direct&lt;/strong&gt;: Mac mini or MacBook with Apple M4 chip, 32GB+ RAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computing Stick&lt;/strong&gt;: Any Mac + Mano-P computing stick via USB 4.0+&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Data Privacy: The Edge Advantage
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;Local Mode&lt;/strong&gt;, all processing happens on-device:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Screenshots never leave the device&lt;/li&gt;
&lt;li&gt;✅ Task descriptions stay local&lt;/li&gt;
&lt;li&gt;✅ No cloud API calls&lt;/li&gt;
&lt;li&gt;✅ Full source code is open for audit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cloud Mode&lt;/strong&gt; is available as a fallback (screenshots sent to &lt;code&gt;mano.mininglamp.com&lt;/code&gt;), but the local-first architecture means sensitive workflows can run with zero data exposure.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Mano-P is available in three usage forms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLI (for terminal users):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew tap HanningWang/tap
brew &lt;span class="nb"&gt;install &lt;/span&gt;mano-cua

mano-cua run &lt;span class="s2"&gt;"Open WeChat and tell FTY the meeting is postponed"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python SDK (planned):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mano_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ManoClient&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ManoClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search for AI news on Xiaohongshu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ClawHub Skill (for AI agents):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;clawhub &lt;span class="nb"&gt;install &lt;/span&gt;mano-cua
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Skill form is designed for AI agents like Claude Code or OpenClaw — the agent automatically invokes Mano-P when GUI operations are needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Mano-P is being released in three phases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1&lt;/strong&gt; (now): Mano-CUA Skills — for agent enthusiasts to build CUA task workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2&lt;/strong&gt; (coming): Local models + SDK — for developers with high security requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 3&lt;/strong&gt; (planned): Training methods + pruning techniques — for researchers who want to train their own GUI-VLA models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project is Apache 2.0 licensed. Full source, benchmarks, and documentation: &lt;a href="https://github.com/Mininglamp-AI/Mano-P" rel="noopener noreferrer"&gt;github.com/Mininglamp-AI/Mano-P&lt;/a&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agents</category>
      <category>benchmark</category>
    </item>
  </channel>
</rss>
