<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: MuleRun</title>
    <description>The latest articles on DEV Community by MuleRun (@mulerun).</description>
    <link>https://dev.to/mulerun</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3588864%2F330f8af3-d423-49ee-b663-a0a33be23bf7.png</url>
      <title>DEV Community: MuleRun</title>
      <link>https://dev.to/mulerun</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mulerun"/>
    <language>en</language>
    <item>
      <title>MuleRun GACUA: An Open-Source Computer Use Agent That Actually Works</title>
      <dc:creator>MuleRun</dc:creator>
      <pubDate>Thu, 30 Oct 2025 07:26:23 +0000</pubDate>
      <link>https://dev.to/mulerun/mulerun-gacua-an-open-source-computer-use-agent-that-actually-works-28b7</link>
      <guid>https://dev.to/mulerun/mulerun-gacua-an-open-source-computer-use-agent-that-actually-works-28b7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; We've open-sourced &lt;strong&gt;&lt;a href="https://github.com/openmule/gacua" rel="noopener noreferrer"&gt;GACUA&lt;/a&gt;&lt;/strong&gt;, a free, out-of-the-box computer use agent built on the Gemini CLI. You can start it with a single command. GACUA boosts Gemini's grounding accuracy with a special "Image Slicing &amp;amp; Two-Step Grounding" method and gives you transparent, human-in-the-loop control over complex tasks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Hey everyone! Have you played with the idea of an AI agent that can actually &lt;em&gt;use&lt;/em&gt; your computer? Not just write code, but click buttons, install software, or even help you grind through daily check-ins?&lt;/p&gt;

&lt;p&gt;We have, and we ran into a wall. So, we built a tool to knock it down. Today, we're open-sourcing it: &lt;strong&gt;&lt;a href="https://mulerun.com/" rel="noopener noreferrer"&gt;MuleRun&lt;/a&gt; GACUA (Gemini CLI as Computer Use Agent)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://mulerun.com/" rel="noopener noreferrer"&gt;MuleRun&lt;/a&gt; GACUA is a free, open-source agent built on top of Google's Gemini CLI, designed to be the most accessible and transparent way to get started with computer automation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Video demo of GACUA in action)&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;What Makes MuleRun GACUA Different?&lt;/h3&gt;

&lt;p&gt;MuleRun GACUA isn't just another wrapper. It extends the core of Gemini CLI to create a robust agentic experience that's both powerful and developer-friendly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;💻 &lt;strong&gt;Truly Out-of-the-Box&lt;/strong&gt;: Get started with a single command. No complex setup, no expensive API keys for proprietary models. Just a free, immediate way to experience computer use.&lt;/li&gt;
&lt;li&gt;🎯 &lt;strong&gt;High-Accuracy Grounding&lt;/strong&gt;: We'll get into the technical details below, but GACUA uses a unique "Image Slicing + Two-Step Grounding" method to dramatically improve Gemini 2.5 Pro's ability to accurately click on UI elements.&lt;/li&gt;
&lt;li&gt;🔬 &lt;strong&gt;Full Observability &amp;amp; Control&lt;/strong&gt;: Sick of "black box" agents? GACUA provides a transparent, step-by-step execution flow via a web UI. You can review, accept, or reject each action before it happens. You're always in control.&lt;/li&gt;
&lt;li&gt;🌐 &lt;strong&gt;Remote Operation&lt;/strong&gt;: Run the agent in its own environment and access it from another device. No more fighting with the AI for your mouse and keyboard.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;The Technical Challenge: Making Gemini "See" the Screen&lt;/h2&gt;

&lt;p&gt;Our initial idea was simple: connect a Computer Use MCP (Model Context Protocol) to the Gemini CLI. Easy, right?&lt;/p&gt;

&lt;p&gt;Not quite. We quickly discovered that Gemini 2.5 Pro's grounding capabilities—its ability to translate a description like "click the Chrome icon" into precise screen coordinates—were surprisingly limited.&lt;/p&gt;

&lt;p&gt;For example, when we asked it to locate the Chrome icon, the bounding box it generated was often inaccurate. Clicking the center of that box would be a miss.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ljr7fmlnq751kca37jp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ljr7fmlnq751kca37jp.png" alt="Detect Chrome in the Image Using Gemini 2.5 Pro" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;
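&lt;p&gt;To make "the center of that box would be a miss" concrete, here is a minimal sketch of how a box-center click point is usually derived. It assumes Gemini's documented &lt;code&gt;box_2d&lt;/code&gt; convention of &lt;code&gt;[y_min, x_min, y_max, x_max]&lt;/code&gt; normalized to a 0-1000 grid; the example numbers are hypothetical, not from a real model response:&lt;/p&gt;

```python
def box_center_px(box_2d, width, height):
    """Convert a Gemini box_2d ([y_min, x_min, y_max, x_max], normalized
    to a 0-1000 grid) into a pixel click point at the box center."""
    y0, x0, y1, x1 = box_2d
    cx = (x0 + x1) / 2 / 1000 * width
    cy = (y0 + y1) / 2 / 1000 * height
    return round(cx), round(cy)

# A hypothetical box around a taskbar icon on a 1920x1080 screen:
print(box_center_px([900, 100, 980, 150], 1920, 1080))  # (240, 1015)
```

&lt;p&gt;If the model's box is even slightly off-target, this center point lands outside the real icon, which is exactly the failure mode described above.&lt;/p&gt;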

&lt;p&gt;We tried everything: prompt tuning, scaling screenshot resolutions, you name it. Nothing worked reliably.&lt;/p&gt;

&lt;h3&gt;The Open-Source Solution: Image Slicing &amp;amp; Two-Step Grounding&lt;/h3&gt;

&lt;p&gt;After a lot of experimentation, we found a combination of techniques that made a huge difference. As an open-source project, we want to be completely transparent about how it works.&lt;/p&gt;

&lt;h4&gt;1. Image Slicing&lt;/h4&gt;

&lt;p&gt;By default, Gemini tiles images into 768x768 chunks. This isn't always ideal for common screen resolutions. We bypass this by applying our own slicing logic.&lt;/p&gt;

&lt;p&gt;For a 16:9 screen, we slice it into three overlapping vertical tiles, ensuring the overlap between adjacent tiles is more than 50% of their width. This guarantees that any UI element up to 50% of the screen's height will appear fully intact in at least one tile.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.mulerun.com%2Fimg%2Fgacua_2.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.mulerun.com%2Fimg%2Fgacua_2.svg" alt="Three Vertically Cropped Images with Sufficient Overlays" width="841" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bjvs4dj1chxpll7999t.png" alt="Square 0" width="768" height="768"&gt;&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo79zp58g8vu27tttkl9l.png" alt="Square 1" width="768" height="768"&gt;&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuw4owjvm3kzrdhn4ow8e.png" alt="Square 2" width="768" height="768"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
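&lt;p&gt;The slicing geometry can be sketched in a few lines. The 768-px square tiles and the three-tile count come from the article; the 1365-px scaled width (a 1920x1080 screenshot resized so its height is 768 px) and the function name are illustrative assumptions:&lt;/p&gt;

```python
def tile_offsets(scaled_width, tile=768, n_tiles=3):
    """Left x-offsets of n_tiles square crops spread evenly across a
    screenshot strip whose height already equals `tile` pixels."""
    step = (scaled_width - tile) / (n_tiles - 1)
    return [round(i * step) for i in range(n_tiles)]

# A 1920x1080 (16:9) screenshot scaled to 768 px tall is ~1365 px wide.
offsets = tile_offsets(1365)
print(offsets)  # [0, 298, 597]

# Adjacent tiles share 768 - 298 = 470 px, i.e. about 61% of the tile
# width, which satisfies the more-than-50% overlap requirement.
overlap = tile_offsets(1365)[0] + 768 - tile_offsets(1365)[1]
```

&lt;p&gt;Because the overlap exceeds half a tile's width, any element narrower than half a tile is guaranteed to fall entirely inside at least one crop.&lt;/p&gt;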

&lt;h4&gt;2. Two-Step Grounding&lt;/h4&gt;

&lt;p&gt;For any operation needing precise coordinates, we use a two-step model call: &lt;strong&gt;Plan&lt;/strong&gt; and &lt;strong&gt;Ground&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Plan&lt;/strong&gt;: The model receives all three 768x768 tiles and identifies which tile contains the target object. It outputs an &lt;code&gt;image_id&lt;/code&gt; (e.g., &lt;code&gt;0&lt;/code&gt;, &lt;code&gt;1&lt;/code&gt;, or &lt;code&gt;2&lt;/code&gt;) and an &lt;code&gt;element_description&lt;/code&gt; (e.g., "Google Chrome icon").&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Ground&lt;/strong&gt;: The selected tile and the &lt;code&gt;element_description&lt;/code&gt; are passed to the model again. This time, its only job is to generate a precise &lt;code&gt;box_2d&lt;/code&gt; (bounding box) for that specific element.&lt;/li&gt;
&lt;/ol&gt;
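&lt;p&gt;The Plan and Ground handoff can be sketched as below. This is an illustrative outline, not GACUA's actual implementation: &lt;code&gt;model_call&lt;/code&gt;, &lt;code&gt;locate&lt;/code&gt;, and the response shapes are hypothetical stand-ins for the real Gemini calls, and the returned box is assumed to already be in tile-local pixels:&lt;/p&gt;

```python
def plan(model_call, tiles, instruction):
    # Step 1 (Plan): show all three tiles; the model picks the tile that
    # contains the target and describes the element, e.g.
    # {"image_id": 2, "element_description": "Google Chrome icon"}.
    return model_call(images=tiles, prompt="Which tile contains: " + instruction)

def ground(model_call, tile, element_description):
    # Step 2 (Ground): pass only the chosen tile; the model's sole job
    # is a precise bounding box for that one element.
    return model_call(images=[tile], prompt="Return box_2d for: " + element_description)

def locate(model_call, tiles, offsets, instruction):
    p = plan(model_call, tiles, instruction)
    y0, x0, y1, x1 = ground(model_call, tiles[p["image_id"]], p["element_description"])
    dx = offsets[p["image_id"]]  # shift tile-local x back to full-screen coords
    return [y0, x0 + dx, y1, x1 + dx]
```

&lt;p&gt;Splitting the job this way also makes failures easy to diagnose: a wrong &lt;code&gt;image_id&lt;/code&gt; or &lt;code&gt;element_description&lt;/code&gt; points to a planning error, while a bad box points to a grounding error.&lt;/p&gt;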

&lt;p&gt;The result is a dramatically more accurate grounding process. You can find the &lt;a href="https://github.com/openmule/gacua/tree/main/examples/gemini-grounding-demo" rel="noopener noreferrer"&gt;reproduction script for this demo in our GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fk5591brnghihwn4h8u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fk5591brnghihwn4h8u.png" alt="Result of Grounding Process" width="768" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This method forces a slower, more deliberate reasoning process (similar to Chain-of-Thought) and makes the agent's decisions much more explainable. If a command fails, you can easily see if the agent misunderstood the description or failed to find the coordinates.&lt;/p&gt;




&lt;h2&gt;Why We Built GACUA in the Open&lt;/h2&gt;

&lt;p&gt;When we talked to other developers, two pain points with existing computer use agents kept coming up: the high cost of entry and their "black box" nature. GACUA was designed to solve these, and &lt;strong&gt;open-source&lt;/strong&gt; is core to that philosophy.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low Barrier to Entry&lt;/strong&gt;: Many powerful agents rely on expensive proprietary models (like Claude) or require specialized, locally-run models with high-end GPUs. GACUA offers an accessible alternative. It's built on the free Gemini CLI and uses our engineering methods to achieve high-quality grounding, allowing any developer to experience computer use for free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparent Execution&lt;/strong&gt;: We believe you should be able to understand and trust the tools you use. GACUA's web UI gives you full observability into the agent's Planning and Grounding steps. The Human-in-the-Loop (HITL) control—letting you "accept" or "reject" each action—is a direct result of this open, transparent approach.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Our Thoughts on the Future of Computer Use&lt;/h2&gt;

&lt;p&gt;Building GACUA shaped our perspective on where this technology is headed. We see two major scenarios where it shines:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Tasks with a "Knowledge Gap"&lt;/strong&gt;: Operations that are simple to execute but that you don't know how to do (e.g., "adjust the row height in this Excel sheet").&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Repetitive Manual Labor&lt;/strong&gt;: High-frequency, low-value tasks perfect for automation (e.g., processing unread emails, monitoring product prices).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There's a growing sentiment that vision-based computer use is an inefficient "robot-pulling-a-cart" approach and that a fully API-driven world is superior. While API-based agents have their strengths, we believe a purely API-driven view misses two fundamental points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The GUI is Already a Universal API&lt;/strong&gt;: The dream of a fully API-driven world clashes with the reality of inconsistent standards. The GUI, however, has evolved into a de facto universal standard for interaction. Teaching an agent to master this "visual language" is a path worth exploring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It's a Necessary Step Towards World Models&lt;/strong&gt;: Our ultimate vision is agents that can interact with the physical world. Vision-based perception and action are indispensable for that future. The computer screen is the most effective "training ground" we have today to teach an agent how to "see and interact."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We see GACUA not just as a practical tool, but as a pragmatic step toward that grander vision.&lt;/p&gt;




&lt;h2&gt;What's Next? It's Open-Source.&lt;/h2&gt;

&lt;p&gt;The future of GACUA is open-source, and that means it's up to you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/openmule/gacua" rel="noopener noreferrer"&gt;Give GACUA a try on GitHub!&lt;/a&gt;&lt;/strong&gt; 🙌&lt;/p&gt;

&lt;p&gt;You can start it with a single command. We encourage you to &lt;code&gt;star&lt;/code&gt; the repo, &lt;code&gt;fork&lt;/code&gt; it, and submit pull requests. Found a bug? Open an &lt;code&gt;issue&lt;/code&gt;. Have a wild idea for a new feature? Let's discuss it.&lt;/p&gt;

&lt;p&gt;GACUA is an open-source project from &lt;a href="https://mulerun.com/" rel="noopener noreferrer"&gt;MuleRun&lt;/a&gt;, a team building the world's first AI Agent Marketplace. This is our way of sharing the insights we've gained while exploring how to build reliable agents in the open.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>opensource</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
