Leooo

Posted on Mar 23

I Built ACI: The Open Standard That Lets AI Agents Operate Any Software Without Screenshots

#ai #automation #opensource #python

Every AI agent framework today solves half the problem. Browser Use handles web pages but can't touch desktop apps. UFO controls Windows apps but can't operate browsers. Screenshot-based approaches (Computer Use, Operator) work everywhere but are slow (3-10s per action), expensive, and fragile.

The missing piece: a single standard interface that works for both web and desktop, structured and fast.

That's what I built. ACI — the Agent-Computer Interface.

The Core Idea: APIs for Developers, ACI for Agents

Just as APIs became the universal interface between developers and software, ACI is the universal interface between AI agents and software.

How It Works: Two Operations

The entire protocol is two operations:

1. See what's on screen

Returns a structured, UID-referenced element tree, not a screenshot and not raw HTML.

The same protocol applies to desktop apps.

2. Do something

That's it. Any agent that can make HTTP calls can operate any software.
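
To make the two operations concrete, here is a minimal Python sketch of the client side. It assumes the daemon listens on localhost port 11434 (the port listed in the stack section). The endpoint path passed to `send`, the JSON field names (`uid`, `role`, `name`, `children`, `action`, `text`), and the sample tree are all hypothetical placeholders, not ACI's documented schema.

```python
import json
import urllib.request
from typing import Iterator, Optional

# Assumption: daemon address taken from the stack section of this post.
BASE_URL = "http://localhost:11434"


def iter_elements(node: dict) -> Iterator[dict]:
    """Yield every element in a nested element tree, depth first."""
    yield node
    for child in node.get("children", []):
        yield from iter_elements(child)


def find_uid(tree: dict, role: str, name: str) -> Optional[str]:
    """Return the UID of the first element matching role and name, else None."""
    for element in iter_elements(tree):
        if element.get("role") == role and element.get("name") == name:
            return element.get("uid")
    return None


def build_action(uid: str, action: str, text: Optional[str] = None) -> bytes:
    """Serialize an action request that targets an element by UID."""
    payload = {"uid": uid, "action": action}
    if text is not None:
        payload["text"] = text
    return json.dumps(payload).encode("utf-8")


def send(path: str, body: bytes) -> dict:
    """POST a JSON body to the daemon. The path is a hypothetical endpoint."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())


# A hand-written tree standing in for what the "see" operation might return.
sample_tree = {
    "uid": "e1", "role": "window", "name": "Editor",
    "children": [
        {"uid": "e2", "role": "textbox", "name": "Title"},
        {"uid": "e3", "role": "button", "name": "Publish"},
    ],
}

title_uid = find_uid(sample_tree, "textbox", "Title")
request_body = build_action(title_uid, "type", text="Hello from an agent")
```

The point of the sketch is that the agent never deals with pixels or selectors: it looks up a UID in the tree, then sends an action that names that UID. `send` is defined but not called here, since it needs a running daemon.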

Three Technical Innovations

1. Tiered Structured Extraction

Instead of screenshots, ACI uses fast structured methods first and falls back to vision only when necessary:

Web: CDP Accessibility Tree (5-50ms) -> DOM Supplement (20-100ms) -> Vision Fallback (1-2s, only if needed)

Desktop: UIA Control Tree (50-300ms) -> Cursor Probing (~350ms) -> OCR (~200ms) -> VLM (1-5s, last resort)

Most interactions complete in 50-300ms instead of 3-10 seconds.
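
The fallback chain can be sketched in a few lines of Python. This is an illustration of the general pattern, not ACI's implementation: the tier names, the stub extractors, and the rule "move to the next tier when a tier returns nothing or raises" are assumptions, since the post does not say how ACI decides that a tier was insufficient.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# An extractor returns a list of elements, or None/empty if it found nothing.
Extractor = Callable[[], Optional[List[dict]]]


def extract_tiered(tiers: Sequence[Tuple[str, Extractor]]) -> Tuple[str, List[dict]]:
    """Try each tier in order and return (tier_name, elements) for the first
    tier that produces a non-empty result. Slower tiers are never called
    once a faster one succeeds."""
    for name, extractor in tiers:
        try:
            elements = extractor()
        except Exception:
            # A failing tier should not abort the chain; fall through.
            continue
        if elements:
            return name, elements
    return "none", []


# Stub extractors standing in for the real tiers (all hypothetical).
def accessibility_tree() -> List[dict]:
    return []  # pretend the fast tier found nothing usable


def dom_supplement() -> List[dict]:
    return [{"uid": "e1", "role": "button", "name": "OK"}]


def vision_fallback() -> List[dict]:
    raise RuntimeError("not reached in this example")


tier_used, elements = extract_tiered([
    ("accessibility", accessibility_tree),
    ("dom", dom_supplement),
    ("vision", vision_fallback),
])
```

In this run the first tier comes back empty, the second succeeds, and the expensive vision tier is never invoked, which is where the latency savings come from.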

2. Community Knowledge Base (YAML App Profiles)

Every app can have a YAML profile that teaches agents shortcuts and UI patterns.

Add a YAML file and push it to the repo, and every agent can use that app's profile with zero code changes. ACI currently ships with 12 app profiles (Chrome, VS Code, Discord, Slack, Notion, Telegram, and more).
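
As a rough illustration of how an agent might consume such a profile, the Python sketch below uses a plain dict in place of the parsed YAML. The profile keys (`shortcuts`, `hints`), the intent names, and the `lookup_shortcut` helper are hypothetical and do not reflect ACI's actual profile schema. The two VS Code shortcuts are the editor's defaults on Windows and Linux.

```python
from typing import Optional

# Hypothetical profile contents, written as the dict a YAML parser would
# produce. Every key here is an illustrative assumption.
PROFILES = {
    "vscode": {
        "shortcuts": {
            "command_palette": "ctrl+shift+p",
            "quick_open": "ctrl+p",
        },
        "hints": ["The command palette can reach most editor commands."],
    },
}


def lookup_shortcut(app: str, intent: str) -> Optional[str]:
    """Return the shortcut for an intent, or None if the app or intent
    has no profile entry."""
    profile = PROFILES.get(app, {})
    return profile.get("shortcuts", {}).get(intent)


shortcut = lookup_shortcut("vscode", "command_palette")
```

When a lookup returns None, the agent would fall back to operating the UI through the element tree, so a missing profile costs speed rather than correctness.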

3. Interrupt-Aware Execution (MutationShield)

Real software throws popups, cookie banners, and auth dialogs. ACI's MutationShield detects and reports these as structured events instead of crashing.
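
One plausible way to detect such interruptions is to compare the element tree before and after an action and report newly appeared dialogs. The Python sketch below shows that idea only; it is not MutationShield's actual algorithm, and the element fields and event shape are assumptions. The `dialog` and `alertdialog` values are standard ARIA roles.

```python
from typing import Dict, Iterator, List

# ARIA roles treated as interruptions in this sketch.
INTERRUPT_ROLES = {"dialog", "alertdialog"}


def iter_elements(node: dict) -> Iterator[dict]:
    """Yield every element in a nested element tree, depth first."""
    yield node
    for child in node.get("children", []):
        yield from iter_elements(child)


def detect_interrupts(before: dict, after: dict) -> List[Dict[str, str]]:
    """Return one structured event per dialog-like element that exists in
    `after` but did not exist in `before`."""
    known_uids = {element["uid"] for element in iter_elements(before)}
    events = []
    for element in iter_elements(after):
        is_new = element["uid"] not in known_uids
        if is_new and element.get("role") in INTERRUPT_ROLES:
            events.append({
                "event": "interrupt",
                "uid": element["uid"],
                "name": element.get("name", ""),
            })
    return events


before_tree = {
    "uid": "e1", "role": "window", "name": "Shop",
    "children": [{"uid": "e2", "role": "button", "name": "Buy"}],
}
after_tree = {
    "uid": "e1", "role": "window", "name": "Shop",
    "children": [
        {"uid": "e2", "role": "button", "name": "Buy"},
        {"uid": "e9", "role": "dialog", "name": "Cookie consent",
         "children": [{"uid": "e10", "role": "button", "name": "Accept"}]},
    ],
}

events = detect_interrupts(before_tree, after_tree)
```

Here the cookie dialog is reported as a single event, and its Accept button is not reported separately because it is not itself a dialog. The agent can then decide whether to dismiss the dialog or hand control back to the user.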

The Stack

Python + FastAPI daemon (port 11434)
Playwright for web automation (cross-platform)
Windows UIA for desktop automation (macOS/Linux on roadmap)
Works with any LLM — tested with Claude and GPT
Apache 2.0 licensed

GitHub: https://github.com/Leoooooli/ACI

Plot Twist

This article was written and published entirely by an AI agent (Claude) using only the ACI protocol.

The agent connected to this browser through ACI, perceived the Dev.to editor (finding the title field, tag inputs, and content textarea by their UIDs), typed this article, and clicked Publish. No Selenium scripts. No hardcoded CSS selectors. Just the same two-operation protocol described above.

If an agent can write and publish a blog post using your framework, it probably works.

DEV Community