Leooo
Why AI Agents shouldn't rely on screenshots: Building a cross-platform alternative to Anthropic's Computer Use

Anthropic recently released their Computer Use feature for macOS. It is a big step forward for AI agents, allowing models to interact with local software. However, this release also highlights a major technical bottleneck in how we are building GUI agents today. The current approach relies heavily on taking continuous screenshots and using large vision models to figure out where to click. This method is slow, expensive, and currently leaves Windows users out of the loop.

When an agent uses screenshots, it essentially treats the operating system like a flat picture. It takes an image, sends it to the cloud, waits for the vision model to calculate pixel coordinates, and then finally moves the mouse. If a UI element shifts by a few pixels or the network is delayed, the action easily fails. Clicking a single button can take several seconds and consume a lot of tokens.

We need a more efficient way for agents to interact with software. Human developers use APIs to talk to applications, and AI agents should have a similar structural interface. This is why I built the Agent-Computer Interface (ACI).

ACI is an open-source protocol that turns any application into a structured JSON tree. It allows large language models to read and operate software interfaces directly through text, completely bypassing the need for pixels and screenshots for standard tasks.

The core idea behind ACI is unification. Currently, web browsers use the DOM to organize elements, while Windows desktop applications use UI Automation (UIA). ACI reads these completely different underlying structures and converts them into one universal node tree.
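To make the unification concrete, here is a minimal sketch of what such a universal node tree could look like. The field names (`id`, `role`, `name`, `children`) are illustrative assumptions, not the actual ACI schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a universal node tree: field names are
# illustrative, not the real ACI protocol schema.
@dataclass
class UINode:
    id: str                  # unique ID the agent can target
    role: str                # normalized role: "button", "textbox", "link", ...
    name: str                # accessible label shown to the model
    children: list["UINode"] = field(default_factory=list)

# A DOM <a> element and a UIA Button normalize into the same shape,
# so the agent never needs to know which backend produced them:
dom_link = UINode(id="n1", role="link", name="Hacker News")
uia_button = UINode(id="n2", role="button", name="Save")
root = UINode(id="n0", role="window", name="App", children=[dom_link, uia_button])
```

Once both backends emit this shape, everything downstream (serialization to JSON, element lookup, action dispatch) can be written once.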

Because the data format is standardized, the agent only needs to learn two basic commands: perceive and act.

When the agent calls the perceive command, it receives a clean, structured list of all interactive elements currently on the screen. Each element, whether it is a web link or a local desktop button, gets a unique ID. If the agent wants to click the search bar, it simply calls the act command and targets that specific ID. The interaction logic is exactly the same whether the agent is browsing Hacker News or operating a local Windows notepad. This drops the action latency from several seconds down to milliseconds.

Of course, purely reading UI structures cannot solve every edge case. Some applications use custom rendering engines or canvas elements whose contents are invisible to accessibility APIs. ACI handles this pragmatically: it defaults to lightning-fast structural parsing for the vast majority of tasks, and when it encounters a structurally blind area like a canvas, it automatically falls back to a vision model and captures a visual reference image for that specific region only.

This hybrid routing approach ensures that the agent stays extremely fast most of the time without losing the ability to handle complex graphical interfaces.
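The routing decision itself is simple. This is a hypothetical sketch (function and field names are mine, not ACI's): if a region exposes structural nodes, take the fast path; only a node-less region triggers the vision fallback, and only for that region's bounds rather than the whole screen:

```python
# Hypothetical hybrid router: names are illustrative, not ACI's API.
def perceive_region(region: dict) -> dict:
    nodes = region.get("nodes", [])
    if nodes:
        # Fast path: structural data is available, no pixels needed.
        return {"mode": "structural", "elements": nodes}
    # Blind area (e.g. a canvas): fall back to a screenshot of just
    # this region's bounds for a vision model to interpret.
    return {"mode": "vision", "screenshot_of": region["bounds"]}

toolbar = {"bounds": (0, 0, 800, 40),
           "nodes": [{"id": "t1", "role": "button", "name": "Zoom"}]}
canvas = {"bounds": (0, 40, 800, 600), "nodes": []}

toolbar_view = perceive_region(toolbar)   # structural fast path
canvas_view = perceive_region(canvas)     # vision fallback
```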

I believe treating the UI as an API rather than an image is the right path forward for agent automation. The ACI framework is fully open-source and ready to be tested. If you are building AI agents or working on desktop automation, I would love to hear your feedback or see your contributions.

GitHub Repository: https://github.com/Leoooooli/ACI
