Mininglamp
Eyes and Hands for GUI Agents: How VLA Models Enable End-to-End Desktop Automation

Most GUI automation today works by reading the app's internals — parsing HTML, querying the DOM, hooking into accessibility APIs. It works well... until you hit a native desktop app with no exposed interface.

At Mininglamp, we asked a different question: what if the model just looked at the screen, the way a human does?

That's the premise behind GUI-VLA (Vision-Language-Action), and we open-sourced our implementation as Mano-P.

What is GUI-VLA?

VLA comes from robotics — a robot sees the world through cameras, understands a spoken command, and moves its arms to act. GUI-VLA applies the same idea to screen automation:

  • Vision: the input is a raw screenshot
  • Language: the model understands your natural language instruction
  • Action: the output is a concrete GUI operation — click at (x, y), type text, scroll, drag

The pipeline is straightforward:

Screenshot → Visual Encoding → Language Understanding → Action Output

No Chrome DevTools Protocol (CDP). No HTML parsing. No accessibility tree. Just pixels in, actions out.

This means it can operate any application with a graphical interface — including native macOS apps, legacy desktop software, and cross-application workflows that no single API can bridge.
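The pixels-in, actions-out pipeline above can be sketched as a single perception-to-action step. Everything here is illustrative: `Action`, `vla_step`, and the hardcoded output are stand-ins for the model interface, not Mano-P's actual API.

```python
# Minimal sketch of the Screenshot -> Encoding -> Understanding -> Action
# pipeline. The names and the returned coordinates are invented for
# illustration; a real model would decode the action from the image.

from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", "scroll", or "drag"
    x: int = 0
    y: int = 0
    text: str = ""

def vla_step(screenshot: bytes, instruction: str) -> Action:
    """One forward pass: raw screenshot + instruction -> concrete GUI action."""
    # A real VLA model encodes the image, conditions on the instruction,
    # and decodes an action. Here we only show the interface shape.
    return Action(kind="click", x=640, y=360)

action = vla_step(b"\x89PNG...", "Open the Settings menu")
```

The key design point is the output type: the model never emits selectors or DOM paths, only screen-level operations any GUI can receive.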

How We Train It: Three Stages

Getting a model to reliably operate GUIs from screenshots is hard. A single wrong click can cascade into a completely wrong state. Here's our three-stage training approach:

Stage 1: Supervised Fine-Tuning (SFT)

We start with large-scale supervised learning on (screenshot, instruction, correct action) triplets. This teaches the model the basics — what buttons look like, where text fields are, how menus work.
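One way to picture the SFT stage is how a single triplet becomes a training example: the screenshot and instruction form the input, and the correct action is serialized into the target the model must emit. The field names and action-string format below are our assumption, not Mano-P's actual data schema.

```python
# Hedged sketch: turning one (screenshot, instruction, action) triplet into
# a supervised example. Schema and serialization format are illustrative.

def to_sft_example(screenshot_path: str, instruction: str, action: dict) -> dict:
    """Serialize one triplet; the action string is the supervised label."""
    target = f"{action['kind']}({action.get('x', 0)}, {action.get('y', 0)})"
    return {
        "image": screenshot_path,   # model input: raw screenshot
        "prompt": instruction,      # model input: natural-language task
        "target": target,           # label: the correct action to emit
    }

ex = to_sft_example("shot_001.png", "Click the Save button",
                    {"kind": "click", "x": 412, "y": 88})
# ex["target"] == "click(412, 88)"
```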

Stage 2: Offline Reinforcement Learning

SFT only shows the model correct actions. Offline RL introduces negative examples and reward signals, teaching the model to distinguish good actions from bad ones without live interaction.
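One common way to use reward signals offline is advantage-style weighting: logged actions with high reward get large training weight, and failures fade or act as negatives. The exponential weighting below illustrates that principle; it is not necessarily the exact method used here.

```python
import math

# Sketch of offline-RL-style weighting over logged trajectories. Each logged
# action carries a reward; exponential weighting makes good actions dominate
# training without any live interaction. Temperature and reward values are
# illustrative.

def sample_weight(reward: float, temperature: float = 1.0) -> float:
    """Exponential weighting: high-reward actions count more, failures less."""
    return math.exp(reward / temperature)

logged = [
    {"action": "click(412, 88)", "reward": 1.0},   # task progressed
    {"action": "click(10, 10)",  "reward": -1.0},  # clicked the wrong element
]
weights = [sample_weight(t["reward"]) for t in logged]
```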

Stage 3: Online Reinforcement Learning (Mano-Action)

The final stage. The model interacts with real environments, receives actual feedback, and refines its policy. We call this the Mano-Action method. This is where the model develops genuine error recovery skills.
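The online stage can be pictured as a tiny act-observe-update loop: the agent takes an action in a live environment, receives a reward, and nudges its policy. The toy environment and update rule below are invented for illustration; the post does not publish Mano-Action's actual algorithm.

```python
import random

# Toy online-RL loop: act in a live environment, observe reward, adjust the
# policy. Environment, preference scores, and update rule are illustrative.

random.seed(0)

def environment_step(action: int) -> float:
    """Hypothetical environment: action 1 is correct, action 0 fails."""
    return 1.0 if action == 1 else -0.2

policy = {0: 0.5, 1: 0.5}   # preference scores over two candidate actions
lr = 0.1
for _ in range(50):
    # Sample an action in proportion to its current preference.
    a = 1 if random.random() < policy[1] / (policy[0] + policy[1]) else 0
    r = environment_step(a)
    policy[a] = max(0.01, policy[a] + lr * r)   # reinforce or penalize

# After training, the correct action is clearly preferred.
```

The point of the stage is the feedback source: unlike SFT or offline RL, the reward comes from what actually happened on screen, which is what teaches recovery behavior.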

Why three stages? Because GUI operations have cascading errors. Click the wrong button, and every subsequent step happens in the wrong context. SFT alone can't handle this. RL builds the judgment to recover from mistakes.

Think → Act → Verify

For complex, multi-step tasks, we use a Think-Act-Verify reasoning loop:

  1. Think: Analyze the current screenshot, understand the state
  2. Act: Execute the next operation
  3. Verify: Check if the result matches expectations

If verification fails, the model loops back to Think and replans instead of blindly continuing. This is critical for long task chains like "gather data from three apps and compile a report."
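The loop above can be sketched as a small state machine. The states, action names, and `EXPECTED` outcomes are invented for illustration; in the real agent, each stub would be a model call or a screen capture.

```python
# Minimal sketch of the Think-Act-Verify loop. All names are hypothetical.

EXPECTED = {"open_app": "app_open", "export_report": "report_saved"}

def think(state: str) -> str:
    """Pick the next action from the current state (stub for the model)."""
    return {"start": "open_app", "app_open": "export_report"}.get(state, "done")

def act(action: str, state: str) -> str:
    """Execute the action and return the resulting state (stub environment)."""
    return {"open_app": "app_open", "export_report": "report_saved"}.get(action, state)

def run_task(max_steps: int = 10) -> str:
    state = "start"
    for _ in range(max_steps):
        action = think(state)            # Think: analyze state, choose action
        if action == "done":
            break
        new_state = act(action, state)   # Act: execute the operation
        if new_state != EXPECTED[action]:
            continue                     # Verify failed: loop back to Think
        state = new_state                # Verify passed: commit and continue
    return state

final = run_task()
```

The `continue` branch is the whole trick: a failed verification re-enters Think with the unchanged state instead of compounding the error.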

Benchmark Results

Let's talk numbers.

Benchmark Overview

OSWorld (Specialized Models)

OSWorld evaluates desktop OS-level GUI automation. Among specialized models:

| Model | Score |
| --- | --- |
| Mano-P 72B | 58.2% |
| opencua-72b | 45.0% |

That's a 13.2-point gap over the second-place model.

WebRetriever Protocol I

On web information retrieval:

| Model | Score |
| --- | --- |
| Mano-P 41.7 | 41.7 |
| Gemini 2.5 Pro | 40.9 |
| Claude 4.5 | 31.3 |

Note that Gemini 2.5 Pro and Claude 4.5 are flagship general-purpose models. A specialized GUI-VLA model outperforming them on this task suggests that purpose-built architectures still have an edge in vertical scenarios.

Running It on Your Mac

One of the things we're most excited about: the 4B quantized model runs locally on Apple Silicon.

| Metric | Value |
| --- | --- |
| Prefill speed | 476 tok/s |
| Decode speed | 76 tok/s |
| Peak memory | 4.3 GB |
| Hardware | M4 chip + 32 GB RAM |

4.3 GB peak memory means you can run it as a background service without impacting your daily workflow. Your data stays on your machine — no cloud uploads required.

The secret sauce here is GSPruning — visual token pruning that removes tokens corresponding to unimportant screen regions (blank backgrounds, decorative elements). This gives us a 2-3x speedup without meaningful accuracy loss.
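A simple way to illustrate the pruning idea: score each visual token patch and keep only the informative ones. The per-patch variance score below is our assumption for illustration, not GSPruning's published criterion.

```python
# Illustrative token pruning in the spirit of GSPruning: drop visual tokens
# for low-information regions (blank backgrounds, flat fills). The variance
# scoring rule is an assumption, not the actual method.

def patch_variance(patch: list[float]) -> float:
    """Pixel variance within one patch; flat regions score zero."""
    mean = sum(patch) / len(patch)
    return sum((v - mean) ** 2 for v in patch) / len(patch)

def prune_tokens(patches: list[list[float]], keep_ratio: float = 0.5) -> list[int]:
    """Return indices of the highest-variance patches, in original order."""
    k = max(1, int(len(patches) * keep_ratio))
    ranked = sorted(range(len(patches)),
                    key=lambda i: patch_variance(patches[i]), reverse=True)
    return sorted(ranked[:k])

patches = [
    [0.0, 0.0, 0.0, 0.0],   # blank background -> variance 0, pruned
    [0.1, 0.9, 0.2, 0.8],   # button edge -> high variance, kept
    [0.5, 0.5, 0.5, 0.5],   # flat fill -> variance 0, pruned
    [0.0, 1.0, 0.0, 1.0],   # text region -> high variance, kept
]
kept = prune_tokens(patches, keep_ratio=0.5)
# kept == [1, 3]
```

Halving the visual token count roughly halves prefill work, which is where the reported 2-3x speedup on edge hardware would come from.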

What Can It Actually Do?

Based on its current capabilities:

  • Complex GUI Automation: Multi-step interface operations across applications
  • Cross-System Data Integration: Moving and combining data between different apps
  • Long-Task Planning: Workflows that require multi-step reasoning and planning
  • Report Generation: Extracting information from interfaces and producing structured outputs

The Honest Limitations

We believe in transparent communication about what works and what doesn't:

  • In highly standardized web scenarios, API-based approaches (CDP + DOM) can still be more reliable than pure vision
  • Screenshot resolution and interface complexity affect recognition accuracy
  • There's a capability gap between the 4B edge model and the 72B cloud model

GUI-VLA isn't here to replace API-based agents. It's here to handle everything those agents can't reach.

Try It Yourself

Mano-P is fully open source under Apache 2.0:

We'd genuinely love feedback — issues, PRs, or just telling us where it falls short.


What do you think about vision-only GUI agents? Is the pure-vision approach the future, or will API-based methods always win in structured environments? Drop your thoughts in the comments — we're especially curious to hear from anyone who's tried building GUI agents in production.
