DEV Community

CodeKing

"My AI Agent Kept Missing Buttons, So I Used Windows UI Automation"

The first time you let an AI agent control a desktop, it feels impressive.

Then it misses a button by 40 pixels.

Or it clicks the window behind the window. Or it types into the wrong field because a notification stole focus. Or it spends ten seconds looking at a screenshot just to decide where a textbox probably is.

That was the part of desktop automation that bothered me. The model was not really failing at reasoning. It was being forced to reverse-engineer an application from pixels.

Screenshot-first is the wrong default

The common loop looks like this:

screenshot -> model guesses UI -> model guesses coordinates -> click -> screenshot again

It is useful as a fallback, but it is a bad default for normal desktop apps.

Most buttons, inputs, tabs, menus, and labels already exist in a semantic tree before they become pixels on the screen. The operating system exposes that tree for accessibility tools: screen readers, magnifiers, automation software, and anything else that needs to understand the UI without guessing from a bitmap.

On Windows, that layer is UI Automation.

So I stopped treating screenshots as the first source of truth.

The loop I wanted

For CliGate's desktop agent, I wanted a local loop that worked more like this:

list windows -> focus app -> find control -> set value -> send key -> read text

That is a different kind of automation.

Instead of "click around until something happens," the agent can ask Windows for visible windows, focus the right one, find an Edit control, set its value through ValuePattern, invoke a button through UIA, or read text from a matching control.
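To make the "find control, set value" steps concrete, here is a minimal sketch that runs against an in-memory stand-in for the accessibility tree. The `Control` class, `walk`, and `find_control` are hypothetical names for illustration. They are not CliGate's code and not a real UI Automation binding; the assignment to `value` only stands in for what ValuePattern's SetValue would do on a real element.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Control:
    """Hypothetical stand-in for a UIA element (not a real UIA binding)."""
    control_type: str               # e.g. "Window", "Edit", "Button"
    name: str = ""
    value: str = ""
    children: List["Control"] = field(default_factory=list)


def walk(root: Control) -> Iterator[Control]:
    """Yield the root and every descendant, depth-first."""
    yield root
    for child in root.children:
        yield from walk(child)


def find_control(root: Control, control_type: str,
                 name: Optional[str] = None) -> Optional[Control]:
    """Return the first control matching type (and name, if given), else None."""
    for control in walk(root):
        if control.control_type == control_type and (name is None or control.name == name):
            return control
    return None


# A tiny fake window: one text input and one button.
window = Control("Window", "Chat", children=[
    Control("Edit", "Message"),
    Control("Button", "Send"),
])

edit = find_control(window, "Edit", "Message")
if edit is not None:
    edit.value = "hello"  # stands in for setting the value through ValuePattern
```

The point of the sketch is that the lookup is by type and name, so nothing in it depends on where the control happens to be drawn.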

Coordinates still exist, but they become a fallback. If the app exposes accessibility metadata, the assistant should use the semantic route first. If the app is custom-rendered, Canvas-heavy, or not exposing useful controls, then it can capture a screenshot and fall back to visual inspection.

The important rule is:

Observe semantically first. Use pixels only when the semantic layer is missing.
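One way to encode that rule is a small helper that only runs the visual path when the semantic lookup comes back empty. This is a sketch under assumptions: `locate`, `semantic_lookup`, and `visual_lookup` are hypothetical names, and the two lookups are passed in as plain callables so the ordering logic can be shown without any Windows dependency.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]


def locate(semantic_lookup: Callable[[], Optional[Point]],
           visual_lookup: Callable[[], Optional[Point]]) -> Tuple[str, Optional[Point]]:
    """Try the accessibility tree first; fall back to pixels only on a miss."""
    point = semantic_lookup()
    if point is not None:
        return "semantic", point
    # The semantic layer had nothing useful, so pay for the screenshot path.
    return "pixels", visual_lookup()
```

Because the visual lookup is a callable, the expensive screenshot step is never started when the semantic route succeeds.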

How I wired it into CliGate

CliGate is already a local gateway for Claude Code, Codex CLI, Gemini CLI, OpenClaw, account pools, API keys, runtime sessions, and channel workflows.

The desktop part adds a small local companion service that runs in the user's interactive desktop session. The assistant talks to it through local tools:

  • list visible windows
  • launch or focus an app
  • find one control or all matching controls
  • click a control through UIA
  • set an input value without relying on clipboard focus
  • send control keys like Enter
  • read visible text or values
  • capture a screenshot when UIA is not enough
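A list of tools like this maps naturally onto a name-to-handler registry. The sketch below is illustrative only: the tool names (`list_windows`, `set_value`), the JSON request shape, and the canned return values are all assumptions, not CliGate's actual interface.

```python
import json
from typing import Any, Callable, Dict

# Registry of hypothetical tool names -> handler functions.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that registers a handler under a tool name."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return register


@tool("list_windows")
def list_windows() -> list:
    return ["Notepad", "Terminal"]  # canned data; a real handler would query the OS


@tool("set_value")
def set_value(control: str, value: str) -> dict:
    return {"control": control, "value": value}  # echo only; no real UI is touched


def handle(request_json: str) -> str:
    """Dispatch a request such as {"tool": "set_value", "args": {...}}."""
    request = json.loads(request_json)
    handler = TOOLS.get(request.get("tool"))
    if handler is None:
        return json.dumps({"ok": False, "error": "unknown tool"})
    try:
        result = handler(**request.get("args", {}))
    except TypeError as exc:  # wrong or missing arguments
        return json.dumps({"ok": False, "error": str(exc)})
    return json.dumps({"ok": True, "result": result})
```

For example, `handle('{"tool": "list_windows"}')` returns a JSON string with `ok` set to true and the canned window list as the result.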

That means the agent can stay inside one local control plane. It can work in a repo, continue a Codex or Claude Code runtime session, and still operate the desktop app that owns the last step of the workflow.

Why local matters

I do not want desktop control to depend on a hosted relay.

The desktop is full of sensitive state: open apps, browser sessions, account consoles, chat windows, clipboard contents, and local files. Keeping the control service on localhost keeps the automation close to the machine that owns that state.
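As a rough illustration of what "on localhost" means in practice, the sketch below binds a tiny HTTP service to `127.0.0.1` only, so it accepts connections from the same machine and nothing else. The `/health` route, `start_local_service`, and `ping` are hypothetical names used for the example; this is not how CliGate's companion service is implemented.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args) -> None:
        pass  # keep the example quiet


def start_local_service(port: int = 0) -> HTTPServer:
    """Bind to the loopback interface only; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def ping(port: int) -> dict:
    """Call the health route over loopback and return the decoded JSON."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/health")
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()
```

Binding to `127.0.0.1` rather than `0.0.0.0` is the design choice that matters here: other hosts on the network cannot reach the socket at all.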

It also makes the loop simpler. The assistant can inspect, act, and verify against the actual window in front of the user without shipping a stream of screenshots to a separate service.

This fits the rest of CliGate's design: local server, local dashboard, local runtime orchestration, local desktop bridge.

What changed

Speed improved, because a UIA call skips the screenshot and the model round trip that every screenshot-model-click step needs. But speed was not the biggest change.

Reliability was.

A textbox is no longer "some rectangle near the bottom of the screenshot." It is an Edit control. A button is no longer "probably the blue thing." It is a control with a name, bounding box, state, and supported patterns.
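Once a control carries a name, bounds, state, and supported patterns, the agent can decide how to act on it from data instead of from a guess. Here is a minimal sketch, assuming a hypothetical `ControlInfo` record and `plan_click` helper; the pattern name "Invoke" mirrors UIA's Invoke pattern, but the record itself is an illustration, not a real UIA type.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class ControlInfo:
    """Hypothetical snapshot of the metadata a control exposes."""
    name: str
    control_type: str
    bounds: Tuple[int, int, int, int]   # left, top, right, bottom in screen pixels
    enabled: bool
    patterns: FrozenSet[str]            # e.g. {"Invoke"} or {"Value"}


def plan_click(control: ControlInfo) -> Tuple[str, Optional[Point]]:
    """Prefer a semantic invoke; use the bounding-box center only as a fallback."""
    if not control.enabled:
        return "skip", None
    if "Invoke" in control.patterns:
        return "invoke", None
    left, top, right, bottom = control.bounds
    return "click_at", ((left + right) // 2, (top + bottom) // 2)
```

Even the coordinate fallback is better grounded this way, since the point comes from the control's reported bounds and not from a model reading a bitmap.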

Screenshots are still useful. Some apps do not expose good accessibility metadata. Some content is graphical by nature. But for normal desktop and browser workflows, UIA-first makes the assistant feel less like a demo and more like an operator.

The project is open source here: CliGate on GitHub

If you are building desktop-capable agents, are you starting with screenshots, accessibility trees, browser automation, or something else?
