TL;DR: We've open-sourced GACUA, a free, out-of-the-box computer use agent built on the Gemini CLI. You can start it with a single command. GACUA boosts Gemini's grounding accuracy with a special "Image Slicing & Two-Step Grounding" method and gives you transparent, human-in-the-loop control over complex tasks.
Hey everyone! Have you played with the idea of an AI agent that can actually use your computer? Not just write code, but click buttons, install software, or even help you grind through daily check-ins?
We have, and we ran into a wall. So, we built a tool to knock it down. Today, we're open-sourcing it: MuleRun GACUA (Gemini CLI as Computer Use Agent).
MuleRun GACUA is a free, open-source agent built on top of Google's Gemini CLI, designed to be the most accessible and transparent way to get started with computer automation.
(Video demo of GACUA in action)
What Makes MuleRun GACUA Different?
MuleRun GACUA isn't just another wrapper. It extends the core of Gemini CLI to create a robust agentic experience that's both powerful and developer-friendly.
- 💻 Truly Out-of-the-Box: Get started with a single command. No complex setup, no expensive API keys for proprietary models. Just a free, immediate way to experience computer use.
- 🎯 High-Accuracy Grounding: We'll get into the technical details below, but GACUA uses a unique "Image Slicing + Two-Step Grounding" method to dramatically improve Gemini 2.5 Pro's ability to accurately click on UI elements.
- 🔬 Full Observability & Control: Sick of "black box" agents? GACUA provides a transparent, step-by-step execution flow via a web UI. You can review, accept, or reject each action before it happens. You're always in control.
- 🌐 Remote Operation: Run the agent in its own environment and access it from another device. No more fighting with the AI for your mouse and keyboard.
The Technical Challenge: Making Gemini "See" the Screen
Our initial idea was simple: connect a Computer Use MCP (Model Context Protocol) to the Gemini CLI. Easy, right?
Not quite. We quickly discovered that Gemini 2.5 Pro's grounding capabilities—its ability to translate a description like "click the Chrome icon" into precise screen coordinates—were surprisingly limited.
For example, when we asked it to locate the Chrome icon, the bounding box it generated was often inaccurate. Clicking the center of that box would be a miss.
We tried everything: prompt tuning, scaling screenshot resolutions, you name it. Nothing worked reliably.
The Open-Source Solution: Image Slicing & Two-Step Grounding
After a lot of experimentation, we found a combination of techniques that made a huge difference. As an open-source project, we want to be completely transparent about how it works.
1. Image Slicing
By default, Gemini tiles images into 768x768 chunks. This isn't always ideal for common screen resolutions. We bypass this by applying our own slicing logic.
For a 16:9 screen, we slice it into three overlapping vertical tiles, making sure adjacent tiles overlap by more than 50% of their width. Since each tile is a square whose side equals the screen height, this guarantees that any UI element up to 50% of the screen's height in size will appear fully intact in at least one tile (a code sketch of this slicing follows the figure below).
(Figure: the three overlapping tiles cut from a single 16:9 screenshot)
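To make the geometry concrete, here is a minimal sketch of what such slicing could look like in Python with Pillow, assuming a 1920x1080 screenshot. The tile count, sizes, and file names are illustrative, not GACUA's actual implementation. With these numbers, adjacent tiles overlap by 660 px, roughly 61% of each 1080 px tile, so the >50% guarantee holds.

```python
# Minimal sketch (not GACUA's actual code): slice a 16:9 screenshot into
# three overlapping square tiles, then downscale each to 768x768 for the model.
from PIL import Image

def slice_screenshot(path: str, num_tiles: int = 3, model_size: int = 768):
    img = Image.open(path)
    width, height = img.size                      # e.g. 1920 x 1080 for 16:9
    tile_side = height                            # square tiles spanning full height
    step = (width - tile_side) / (num_tiles - 1)  # horizontal offset between tiles
    tiles = []
    for i in range(num_tiles):
        left = round(i * step)
        tile = img.crop((left, 0, left + tile_side, height))
        tiles.append(tile.resize((model_size, model_size)))
    return tiles

tiles = slice_screenshot("screenshot.png")
for i, tile in enumerate(tiles):
    tile.save(f"tile_{i}.png")  # image_id 0, 1, 2 in the grounding step below
```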
2. Two-Step Grounding
For any operation needing precise coordinates, we use a two-step model call: Plan and Ground.
- Plan: The model receives all three 768x768 tiles and identifies which tile contains the target object. It outputs an `image_id` (e.g., `0`, `1`, or `2`) and an `element_description` (e.g., "Google Chrome icon").
- Ground: The selected tile and the `element_description` are passed to the model again. This time, its only job is to generate a precise `box_2d` (bounding box) for that specific element.
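To illustrate the flow, here is a rough sketch of the two calls using the google-genai Python SDK. GACUA itself drives Gemini through the Gemini CLI, so the SDK choice, prompts, and JSON schemas below are assumptions for illustration, not GACUA's actual code.

```python
# Illustrative Plan -> Ground sketch; prompts and schemas are simplified assumptions.
import json
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

JSON_CONFIG = types.GenerateContentConfig(response_mime_type="application/json")

def locate(target: str, tiles: list) -> dict:
    """Two-step grounding: Plan over all tiles, then Ground on the chosen one."""
    # Step 1 - Plan: all three tiles go in; the model picks one and names the element.
    plan = json.loads(
        client.models.generate_content(
            model="gemini-2.5-pro",
            contents=[
                f"Which image (image_id 0..{len(tiles) - 1}) contains: {target}? "
                'Answer as {"image_id": <int>, "element_description": <string>}.',
                *tiles,  # PIL images, e.g. from slice_screenshot() above
            ],
            config=JSON_CONFIG,
        ).text
    )

    # Step 2 - Ground: only the chosen tile goes in; the model's sole job is the box.
    ground = json.loads(
        client.models.generate_content(
            model="gemini-2.5-pro",
            contents=[
                f'Return the bounding box of "{plan["element_description"]}" '
                'as {"box_2d": [ymin, xmin, ymax, xmax]}.',
                tiles[plan["image_id"]],
            ],
            config=JSON_CONFIG,
        ).text
    )
    return {"image_id": plan["image_id"], "box_2d": ground["box_2d"]}
```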
The result is a dramatically more accurate grounding process. You can find the reproduction script for this demo in our GitHub repo.
This method forces a slower, more deliberate reasoning process (similar to Chain-of-Thought) and makes the agent's decisions much more explainable. If a command fails, you can easily see if the agent misunderstood the description or failed to find the coordinates.
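One practical detail the pipeline implies: the `box_2d` comes back in tile-local coordinates (Gemini typically normalizes boxes to 0-1000), so it has to be mapped back to absolute screen pixels before the agent can click the center. A minimal sketch, reusing the tile geometry assumed in the slicing example above and pyautogui for the click; GACUA's actual action layer may differ.

```python
# Sketch: map a 0-1000-normalized, tile-local box back to screen pixels and click it.
import pyautogui

def to_screen_click(box_2d, image_id, screen_w=1920, screen_h=1080, num_tiles=3):
    ymin, xmin, ymax, xmax = box_2d
    tile_side = screen_h                              # square tiles, full screen height
    step = (screen_w - tile_side) / (num_tiles - 1)   # same layout as the slicing sketch
    left = round(image_id * step)                     # x-offset of the chosen tile
    cx = left + (xmin + xmax) / 2 / 1000 * tile_side  # box center, back in screen pixels
    cy = (ymin + ymax) / 2 / 1000 * tile_side
    return round(cx), round(cy)

# Example values; in practice these would come from locate() above.
x, y = to_screen_click([430, 120, 520, 210], image_id=1)
pyautogui.click(x, y)
```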
Why We Built GACUA in the Open
When we talked to other developers, two pain points with existing computer use agents kept coming up: the high cost of entry and their "black box" nature. GACUA was designed to solve these, and open-source is core to that philosophy.
- Low Barrier to Entry: Many powerful agents rely on expensive proprietary models (like Claude) or require specialized, locally-run models with high-end GPUs. GACUA offers an accessible alternative. It's built on the free Gemini CLI and uses our engineering methods to achieve high-quality grounding, allowing any developer to experience computer use for free.
- Transparent Execution: We believe you should be able to understand and trust the tools you use. GACUA's web UI gives you full observability into the agent's Planning and Grounding steps. The Human-in-the-Loop (HITL) control—letting you "accept" or "reject" each action—is a direct result of this open, transparent approach.
Our Thoughts on the Future of Computer Use
Building GACUA shaped our perspective on where this technology is headed. We see two major scenarios where it shines:
- Tasks with a "Knowledge Gap": Operations that are simple to execute once you know how, but you don't (e.g., "adjust the row height in this Excel sheet").
- Repetitive Manual Labor: High-frequency, low-value tasks perfect for automation (e.g., processing unread emails, monitoring product prices).
There's a growing sentiment that vision-based computer use is an inefficient "robot-pulling-a-cart" approach and that a fully API-driven world is superior. While API-based agents have their strengths, we believe a purely API-driven view misses two fundamental points:
- The GUI is Already a Universal API: The dream of a fully API-driven world clashes with the reality of inconsistent standards. The GUI, however, has evolved into a de facto universal standard for interaction. Teaching an agent to master this "visual language" is a path worth exploring.
- It's a Necessary Step Towards World Models: Our ultimate vision is agents that can interact with the physical world. Vision-based perception and action are indispensable for that future. The computer screen is the most effective "training ground" we have today to teach an agent how to "see and interact."
We see GACUA not just as a practical tool, but as a pragmatic step toward that grander vision.
What's Next? It's Open-Source.
The future of GACUA is open-source, and that means it's up to you.
You can start it with a single command. We encourage you to star the repo, fork it, and submit pull requests. Found a bug? Open an issue. Have a wild idea for a new feature? Let's discuss it.
GACUA is an open-source project from MuleRun, a team building the world's first AI Agent Marketplace. This is our way of sharing the insights we've gained while exploring how to build reliable agents in the open.