Mininglamp
Eyes and Hands for GUI Agents: How VLA Models Enable End-to-End Desktop Automation

Most GUI automation today works by reading the app's internals — parsing HTML, querying the DOM, hooking into accessibility APIs. It works well... until you hit a native desktop app with no exposed interface.

At Mininglamp, we asked a different question: what if the model just looked at the screen, the way a human does?

That's the premise behind GUI-VLA (Vision-Language-Action), and we open-sourced our implementation as Mano-P.

What is GUI-VLA?

VLA comes from robotics — a robot sees the world through cameras, understands a spoken command, and moves its arms to act. GUI-VLA applies the same idea to screen automation:

  • Vision: the input is a raw screenshot
  • Language: the model understands your natural language instruction
  • Action: the output is a concrete GUI operation — click at (x, y), type text, scroll, drag

The pipeline is straightforward:

Screenshot → Visual Encoding → Language Understanding → Action Output

No Chrome DevTools Protocol (CDP). No HTML parsing. No accessibility tree. Just pixels in, actions out.

This means it can operate any application with a graphical interface — including native macOS apps, legacy desktop software, and cross-application workflows that no single API can bridge.
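The pixels-in, actions-out pipeline above can be sketched as a single perception-to-action step. Everything here is illustrative: `Action`, `vla_step`, and the hardcoded output are stand-ins for the model interface, not Mano-P's actual API.

```python
# Minimal sketch of the Screenshot -> Encoding -> Understanding -> Action
# pipeline. The names and the returned coordinates are invented for
# illustration; a real model would decode the action from the image.

from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", "scroll", or "drag"
    x: int = 0
    y: int = 0
    text: str = ""

def vla_step(screenshot: bytes, instruction: str) -> Action:
    """One forward pass: raw screenshot + instruction -> concrete GUI action."""
    # A real VLA model encodes the image, conditions on the instruction,
    # and decodes an action. Here we only show the interface shape.
    return Action(kind="click", x=640, y=360)

action = vla_step(b"\x89PNG...", "Open the Settings menu")
```

The key design point is the output type: the model never emits selectors or DOM paths, only screen-level operations any GUI can receive.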

How We Train It: Three Stages

Getting a model to reliably operate GUIs from screenshots is hard. A single wrong click can cascade into a completely wrong state. Here's our three-stage training approach:

Stage 1: Supervised Fine-Tuning (SFT)

We start with large-scale supervised learning on (screenshot, instruction, correct action) triplets. This teaches the model the basics — what buttons look like, where text fields are, how menus work.
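One way to picture the SFT stage is how a single triplet becomes a training example: the screenshot and instruction form the input, and the correct action is serialized into the target the model must emit. The field names and action-string format below are our assumption, not Mano-P's actual data schema.

```python
# Hedged sketch: turning one (screenshot, instruction, action) triplet into
# a supervised example. Schema and serialization format are illustrative.

def to_sft_example(screenshot_path: str, instruction: str, action: dict) -> dict:
    """Serialize one triplet; the action string is the supervised label."""
    target = f"{action['kind']}({action.get('x', 0)}, {action.get('y', 0)})"
    return {
        "image": screenshot_path,   # model input: raw screenshot
        "prompt": instruction,      # model input: natural-language task
        "target": target,           # label: the correct action to emit
    }

ex = to_sft_example("shot_001.png", "Click the Save button",
                    {"kind": "click", "x": 412, "y": 88})
# ex["target"] == "click(412, 88)"
```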

Stage 2: Offline Reinforcement Learning

SFT only shows the model correct actions. Offline RL introduces negative examples and reward signals, teaching the model to distinguish good actions from bad ones without live interaction.
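One common way to use reward signals offline is advantage-style weighting: logged actions with high reward get large training weight, and failures fade or act as negatives. The exponential weighting below illustrates that principle; it is not necessarily the exact method used here.

```python
import math

# Sketch of offline-RL-style weighting over logged trajectories. Each logged
# action carries a reward; exponential weighting makes good actions dominate
# training without any live interaction. Temperature and reward values are
# illustrative.

def sample_weight(reward: float, temperature: float = 1.0) -> float:
    """Exponential weighting: high-reward actions count more, failures less."""
    return math.exp(reward / temperature)

logged = [
    {"action": "click(412, 88)", "reward": 1.0},   # task progressed
    {"action": "click(10, 10)",  "reward": -1.0},  # clicked the wrong element
]
weights = [sample_weight(t["reward"]) for t in logged]
```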

Stage 3: Online Reinforcement Learning (Mano-Action)

The final stage. The model interacts with real environments, receives actual feedback, and refines its policy. We call this the Mano-Action method. This is where the model develops genuine error recovery skills.
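The online stage can be pictured as a tiny act-observe-update loop: the agent takes an action in a live environment, receives a reward, and nudges its policy. The toy environment and update rule below are invented for illustration; the post does not publish Mano-Action's actual algorithm.

```python
import random

# Toy online-RL loop: act in a live environment, observe reward, adjust the
# policy. Environment, preference scores, and update rule are illustrative.

random.seed(0)

def environment_step(action: int) -> float:
    """Hypothetical environment: action 1 is correct, action 0 fails."""
    return 1.0 if action == 1 else -0.2

policy = {0: 0.5, 1: 0.5}   # preference scores over two candidate actions
lr = 0.1
for _ in range(50):
    # Sample an action in proportion to its current preference.
    a = 1 if random.random() < policy[1] / (policy[0] + policy[1]) else 0
    r = environment_step(a)
    policy[a] = max(0.01, policy[a] + lr * r)   # reinforce or penalize

# After training, the correct action is clearly preferred.
```

The point of the stage is the feedback source: unlike SFT or offline RL, the reward comes from what actually happened on screen, which is what teaches recovery behavior.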

Why three stages? Because GUI operations have cascading errors. Click the wrong button, and every subsequent step happens in the wrong context. SFT alone can't handle this. RL builds the judgment to recover from mistakes.

Think → Act → Verify

For complex, multi-step tasks, we use a Think-Act-Verify reasoning loop:

  1. Think: Analyze the current screenshot, understand the state
  2. Act: Execute the next operation
  3. Verify: Check if the result matches expectations

If verification fails, the model loops back to Think and replans instead of blindly continuing. This is critical for long task chains like "gather data from three apps and compile a report."
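The loop above can be sketched as a small state machine. The states, action names, and `EXPECTED` outcomes are invented for illustration; in the real agent, each stub would be a model call or a screen capture.

```python
# Minimal sketch of the Think-Act-Verify loop. All names are hypothetical.

EXPECTED = {"open_app": "app_open", "export_report": "report_saved"}

def think(state: str) -> str:
    """Pick the next action from the current state (stub for the model)."""
    return {"start": "open_app", "app_open": "export_report"}.get(state, "done")

def act(action: str, state: str) -> str:
    """Execute the action and return the resulting state (stub environment)."""
    return {"open_app": "app_open", "export_report": "report_saved"}.get(action, state)

def run_task(max_steps: int = 10) -> str:
    state = "start"
    for _ in range(max_steps):
        action = think(state)            # Think: analyze state, choose action
        if action == "done":
            break
        new_state = act(action, state)   # Act: execute the operation
        if new_state != EXPECTED[action]:
            continue                     # Verify failed: loop back to Think
        state = new_state                # Verify passed: commit and continue
    return state

final = run_task()
```

The `continue` branch is the whole trick: a failed verification re-enters Think with the unchanged state instead of compounding the error.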

Benchmark Results

Let's talk numbers.

Benchmark Overview

OSWorld (Specialized Models)

OSWorld evaluates desktop OS-level GUI automation. Among specialized models:

| Model | Score |
| --- | --- |
| Mano-P 72B | 58.2% |
| opencua-72b | 45.0% |

That's a 13.2-point gap over the second-place model.

WebRetriever Protocol I

On web information retrieval:

| Model | Score |
| --- | --- |
| Mano-P 41.7 | 41.7 |
| Gemini 2.5 Pro | 40.9 |
| Claude 4.5 | 31.3 |

Note that Gemini 2.5 Pro and Claude 4.5 are flagship general-purpose models. A specialized GUI-VLA model outperforming them on this task suggests that purpose-built architectures still have an edge in vertical scenarios.

Running It on Your Mac

One of the things we're most excited about: the 4B quantized model runs locally on Apple Silicon.

| Metric | Value |
| --- | --- |
| Prefill speed | 476 tok/s |
| Decode speed | 76 tok/s |
| Peak memory | 4.3 GB |
| Hardware | M4 chip + 32 GB RAM |

4.3 GB peak memory means you can run it as a background service without impacting your daily workflow. Your data stays on your machine — no cloud uploads required.

The secret sauce here is GSPruning — visual token pruning that removes tokens corresponding to unimportant screen regions (blank backgrounds, decorative elements). This gives us a 2-3x speedup without meaningful accuracy loss.
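A simple way to illustrate the pruning idea: score each visual token patch and keep only the informative ones. The per-patch variance score below is our assumption for illustration, not GSPruning's published criterion.

```python
# Illustrative token pruning in the spirit of GSPruning: drop visual tokens
# for low-information regions (blank backgrounds, flat fills). The variance
# scoring rule is an assumption, not the actual method.

def patch_variance(patch: list[float]) -> float:
    """Pixel variance within one patch; flat regions score zero."""
    mean = sum(patch) / len(patch)
    return sum((v - mean) ** 2 for v in patch) / len(patch)

def prune_tokens(patches: list[list[float]], keep_ratio: float = 0.5) -> list[int]:
    """Return indices of the highest-variance patches, in original order."""
    k = max(1, int(len(patches) * keep_ratio))
    ranked = sorted(range(len(patches)),
                    key=lambda i: patch_variance(patches[i]), reverse=True)
    return sorted(ranked[:k])

patches = [
    [0.0, 0.0, 0.0, 0.0],   # blank background -> variance 0, pruned
    [0.1, 0.9, 0.2, 0.8],   # button edge -> high variance, kept
    [0.5, 0.5, 0.5, 0.5],   # flat fill -> variance 0, pruned
    [0.0, 1.0, 0.0, 1.0],   # text region -> high variance, kept
]
kept = prune_tokens(patches, keep_ratio=0.5)
# kept == [1, 3]
```

Halving the visual token count roughly halves prefill work, which is where the reported 2-3x speedup on edge hardware would come from.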

What Can It Actually Do?

Based on its current capabilities:

  • Complex GUI Automation: Multi-step interface operations across applications
  • Cross-System Data Integration: Moving and combining data between different apps
  • Long-Task Planning: Workflows that require multi-step reasoning and planning
  • Report Generation: Extracting information from interfaces and producing structured outputs

The Honest Limitations

We believe in transparent communication about what works and what doesn't:

  • In highly standardized web scenarios, API-based approaches (CDP + DOM) can still be more reliable than pure vision
  • Screenshot resolution and interface complexity affect recognition accuracy
  • There's a capability gap between the 4B edge model and the 72B cloud model

GUI-VLA isn't here to replace API-based agents. It's here to handle everything those agents can't reach.

Try It Yourself

Mano-P is fully open source under Apache 2.0:

We'd genuinely love feedback — issues, PRs, or just telling us where it falls short.


What do you think about vision-only GUI agents? Is the pure-vision approach the future, or will API-based methods always win in structured environments? Drop your thoughts in the comments — we're especially curious to hear from anyone who's tried building GUI agents in production.
