Kacper Gadomski
We Built an Open Protocol So AI Agents Can Actually See Your Screen

You know what's wild? Every major AI lab is building "computer use" agents right now. Models that can look at your screen, understand what they see, and click buttons on your behalf. Anthropic has Claude Computer Use. OpenAI shipped CUA. Microsoft built UFO2.

And every single one of them is independently solving the same problem: how do you describe a UI to an AI?

We thought that was broken, so we built Computer Use Protocol (CUP), an open specification that gives AI agents a universal way to perceive and interact with any desktop UI. One format. Every platform. MIT licensed.

GitHub: computeruseprotocol/computeruseprotocol
Website: computeruseprotocol.com

The Problem: Six Platforms, Six Different Languages

Here's the fragmentation that every computer-use agent has to deal with today:

| Platform | Accessibility API | Role Count | IPC Mechanism |
| --- | --- | --- | --- |
| Windows | UIA (COM) | ~40 ControlTypes | COM |
| macOS | AXUIElement | AXRole + AXSubrole | XPC / Mach |
| Linux | AT-SPI2 | ~100+ AtspiRole values | D-Bus |
| Web | ARIA | ~80 ARIA roles | In-process / CDP |
| Android | AccessibilityNodeInfo | Java class names | Binder |
| iOS | UIAccessibility | ~15 trait flags | In-process |

That's roughly 300+ combined role types across platforms, each with different naming, different semantics, and different ways to query them. If you're building an agent that needs to work on more than one OS, you're writing a lot of glue code.

The Solution: One Schema, One Vocabulary

CUP collapses all of that into a single, ARIA-derived schema:

  • 59 universal roles, the subset that maps cleanly across all 6 platforms
  • 16 state flags, only truthy/active states listed (absence = default)
  • 15 action verbs, a canonical vocabulary for what agents can do with elements
  • Platform escape hatch, raw native properties preserved for advanced use

Here's what a CUP snapshot looks like in JSON:

```json
{
  "version": "0.1.0",
  "platform": "windows",
  "timestamp": 1740067200000,
  "screen": { "w": 2560, "h": 1440, "scale": 1.0 },
  "app": { "name": "Spotify", "pid": 1234 },
  "tree": [
    {
      "id": "e0",
      "role": "window",
      "name": "Spotify",
      "bounds": { "x": 120, "y": 40, "w": 1680, "h": 1020 },
      "states": ["focused"],
      "actions": ["click"],
      "children": ["..."]
    }
  ]
}
```

Whether that UI was captured on Windows via UIA, macOS via AXUIElement, or Linux via AT-SPI2, it comes out looking exactly the same.
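For readers who want the shape as types, here's a minimal TypeScript sketch inferred from the snapshot JSON. The field names mirror the example above; the `native` property name for the platform escape hatch is illustrative, not necessarily what the spec uses:

```typescript
// Illustrative types inferred from the snapshot JSON above; not the
// normative spec. The `native` field name is an assumption standing in
// for the "platform escape hatch" described earlier.
interface CupBounds { x: number; y: number; w: number; h: number }

interface CupNode {
  id: string;                        // stable element id, e.g. "e0"
  role: string;                      // one of the 59 universal roles
  name?: string;                     // accessible name
  bounds: CupBounds;                 // screen-space rectangle
  states?: string[];                 // only truthy states, e.g. ["focused"]
  actions?: string[];                // canonical action verbs
  children?: CupNode[];              // nested subtree
  native?: Record<string, unknown>;  // raw platform properties (escape hatch)
}

// A sample button node, typed against the sketch above
const back: CupNode = {
  id: "e2",
  role: "button",
  name: "Back",
  bounds: { x: 132, y: 52, w: 32, h: 32 },
  actions: ["click"],
};
```

Because absence means default, a node with no `states` array is simply enabled, visible, and unfocused.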

The Compact Format (97% Fewer Tokens)

Sending JSON trees to an LLM burns context fast. CUP defines a compact text format that's optimized for token efficiency:

```
# CUP 0.1.0 | windows | 2560x1440
# app: Spotify
# 63 nodes (280 before pruning)

[e0] window "Spotify" @120,40 1680x1020
  [e1] document "Spotify" @120,40 1680x1020
    [e2] button "Back" @132,52 32x32 [click]
    [e3] button "Forward" @170,52 32x32 {disabled} [click]
    [e7] navigation "Main" @120,88 240x972
      [e8] link "Home" @132,100 216x40 {selected} [click]
```

Each line follows the pattern `[id] role "name" @x,y wxh {states} [actions]`.

Same information. A fraction of the tokens. Your agent sees more UI in less context.
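To make the mapping concrete, here's a hedged sketch (not the SDK's actual emitter, whose ordering and escaping rules may differ) of serializing one node into a compact line:

```typescript
// Sketch of one-node serialization into the compact line format:
// [id] role "name" @x,y wxh {states} [actions]
interface CompactNode {
  id: string; role: string; name: string;
  x: number; y: number; w: number; h: number;
  states?: string[]; actions?: string[];
}

function toCompactLine(n: CompactNode, depth = 0): string {
  const parts = [
    `[${n.id}]`,
    n.role,
    `"${n.name}"`,
    `@${n.x},${n.y}`,
    `${n.w}x${n.h}`,
  ];
  // States and actions are only emitted when present (absence = default)
  if (n.states?.length) parts.push(`{${n.states.join(",")}}`);
  if (n.actions?.length) parts.push(`[${n.actions.join(",")}]`);
  return "  ".repeat(depth) + parts.join(" ");
}

// Reproduces the "Forward" button line from the tree above (depth 2)
const line = toCompactLine(
  { id: "e3", role: "button", name: "Forward", x: 170, y: 52, w: 32, h: 32,
    states: ["disabled"], actions: ["click"] },
  2,
);
// → '    [e3] button "Forward" @170,52 32x32 {disabled} [click]'
```

The savings come from dropping JSON's punctuation and repeated key names: every field is positional, so the keys never need to be sent at all.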

Architecture: Protocol First

CUP is intentionally layered. The protocol is the foundation, and everything else is optional.

| Layer | What It Does |
| --- | --- |
| Protocol (core repo) | Defines the universal tree format: roles, states, actions, schema |
| SDKs | Capture native accessibility trees, normalize to CUP, execute actions |
| MCP Servers | Expose CUP as tools for AI agents (Claude Code, Cursor, Copilot, etc.) |

You can adopt just the schema. Or use the SDKs. Or go all the way to MCP integration. Each layer is independent.

Getting Started in 30 Seconds

Install the SDK:

```shell
# TypeScript
npm install computeruseprotocol

# Python
pip install computeruseprotocol
```

Capture a UI tree and interact with it:

```typescript
import { snapshot, action } from 'computeruseprotocol'

// Capture the active window's UI tree
const tree = await snapshot()

// Click a button
await action('click', 'e14')

// Type into a search box
await action('type', 'e9', { value: 'hello world' })

// Send a keyboard shortcut
await action('press', { keys: 'ctrl+s' })
```

That's it. The SDK auto-detects your OS and loads the right platform adapter.
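The calls above assume you already know an element's id. A small hand-rolled helper (not part of the SDK) shows one way an agent might locate an element in a captured tree first; the node shape follows the JSON snapshot format:

```typescript
// Hypothetical helper, not an SDK API: depth-first search of a CUP tree
// for the first element matching a role and accessible name.
interface TreeNode {
  id: string;
  role: string;
  name?: string;
  children?: TreeNode[];
}

function findByName(
  nodes: TreeNode[],
  role: string,
  name: string,
): TreeNode | undefined {
  for (const n of nodes) {
    if (n.role === role && n.name === name) return n;
    const hit = findByName(n.children ?? [], role, name);
    if (hit) return hit;
  }
  return undefined;
}

// With the real SDK you would search the result of `await snapshot()`;
// here a tiny hard-coded tree stands in for it.
const tree: TreeNode[] = [
  {
    id: "e0", role: "window", name: "Spotify",
    children: [{ id: "e2", role: "button", name: "Back" }],
  },
];

const backBtn = findByName(tree, "button", "Back");
// backBtn?.id is "e2" — the id you'd pass to action('click', backBtn.id)
```

Because every platform normalizes to the same tree, a helper like this works identically whether the snapshot came from UIA, AXUIElement, or AT-SPI2.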

MCP Integration

CUP ships a built-in MCP server. Add it to your claude_desktop_config.json or equivalent and your agent can start controlling desktop UIs immediately:

```json
{
  "mcpServers": {
    "cup": {
      "command": "cup-mcp"
    }
  }
}
```

This exposes tools like snapshot (capture window tree), action (interact with elements), overview (list all open windows), and find (search elements in the last tree). It works with Claude Code, Cursor, OpenClaw, Codex, and anything MCP-compatible.
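Under MCP, a client invokes these as standard `tools/call` JSON-RPC requests. As a sketch, calling the `snapshot` tool might look like this on the wire (the exact argument names depend on the server):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "snapshot",
    "arguments": {}
  }
}
```

The server replies with the compact-format tree as tool output, which the agent reads as ordinary context.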

Why This Matters

Computer-use agents are evolving fast, but the infrastructure layer is still ad-hoc. Every team building an agent that needs to "see" a desktop is solving the same problems from scratch: how to capture UI state, how to represent it for an LLM, how to execute actions reliably across platforms.

CUP standardizes that layer so teams can focus on what makes their agent unique (the reasoning, the planning, the task execution) instead of reimplementing platform-specific UI perception.

Think of it like this: HTTP didn't make web browsers smart, but it gave them a common language. CUP aims to do the same for computer-use agents.

Current Status & Contributing

CUP is at v0.1.0, early but functional. The spec covers 59 roles mapped across 6 platforms, with SDKs for Python and TypeScript.

Contributions are very welcome, especially around new role/action proposals with cross-platform mapping rationale, platform mapping improvements, and SDK contributions like new platform adapters or bug fixes.

Check out the repo: github.com/computeruseprotocol/computeruseprotocol.

If you're building computer-use agents, cross-platform UI testing, or accessibility tooling, we'd love to hear from you. Open an issue, submit a PR, or just star the repo if you think this problem is worth solving.


CUP is MIT licensed and community-driven. The protocol belongs to everyone building in this space.