CodeKing

Posted on May 27

"My AI Agent Kept Missing Buttons, So I Used Windows UI Automation"

#ai #webdev #node #tutorial

The first time you let an AI agent control a desktop, it feels impressive.

Then it misses a button by 40 pixels.

Or it clicks the window behind the window. Or it types into the wrong field because a notification stole focus. Or it spends ten seconds looking at a screenshot just to decide where a textbox probably is.

That was the part of desktop automation that bothered me. The model was not really failing at reasoning. It was being forced to reverse-engineer an application from pixels.

Screenshot-first is the wrong default

The common loop looks like this:

screenshot -> model guesses UI -> model guesses coordinates -> click -> screenshot again

It is useful as a fallback, but it is a bad default for normal desktop apps.

Most buttons, inputs, tabs, menus, and labels already exist in a semantic tree before they become pixels on the screen. The operating system exposes that tree for accessibility tools: screen readers, magnifiers, automation software, and anything else that needs to understand the UI without guessing from a bitmap.

On Windows, that layer is UI Automation.
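To make the "semantic tree" idea concrete, here is a minimal sketch of what an agent sees when it reads structure instead of pixels. The `UiNode` shape and the `findNode` helper are hypothetical simplifications for illustration; real UI Automation elements expose many more properties (Name and ControlType are two of them).

```typescript
// Hypothetical, simplified stand-in for a UI Automation element.
interface UiNode {
  controlType: string; // e.g. "Window", "Edit", "Button"
  name: string;        // the accessible name a screen reader would announce
  children: UiNode[];
}

// Depth-first search for the first node with the given control type and name.
function findNode(root: UiNode, controlType: string, name: string): UiNode | undefined {
  if (root.controlType === controlType && root.name === name) {
    return root;
  }
  for (const child of root.children) {
    const hit = findNode(child, controlType, name);
    if (hit !== undefined) {
      return hit;
    }
  }
  return undefined;
}

// A tiny made-up tree for a sign-in window.
const loginWindow: UiNode = {
  controlType: "Window",
  name: "Sign in",
  children: [
    { controlType: "Edit", name: "Email", children: [] },
    { controlType: "Button", name: "Submit", children: [] },
  ],
};

// The lookup is by role and name, not by coordinates.
const submitButton = findNode(loginWindow, "Button", "Submit");
```

Nothing in this lookup depends on where the button is drawn, which is why a layout shift or a DPI change does not break it.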

So I stopped treating screenshots as the first source of truth.

The loop I wanted

For CliGate's desktop agent, I wanted a local loop that worked more like this:

list windows -> focus app -> find control -> set value -> send key -> read text

That is a different kind of automation.

Instead of "click around until something happens," the agent can ask Windows for visible windows, focus the right one, find an Edit control, set its value through ValuePattern, invoke a button through InvokePattern, or read text from a matching control.
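The loop above can be sketched as one function over a small bridge interface. Everything here is an assumption made for illustration: `DesktopBridge`, its method names, and the in-memory `FakeBridge` are hypothetical and are not CliGate's actual API. A real bridge would almost certainly be asynchronous; the sketch is synchronous to keep the control flow easy to read.

```typescript
// Hypothetical bridge interface; method names are illustrative only.
interface DesktopBridge {
  listWindows(): string[];
  focusWindow(title: string): boolean;
  // Returns an opaque control id, or undefined when nothing matches.
  findControl(controlType: string, name: string): string | undefined;
  setValue(controlId: string, value: string): void;
  sendKey(key: string): void;
  readText(controlId: string): string;
}

// One pass of: list windows -> focus app -> find control -> set value -> send key -> read text.
// Returns the text read back, or undefined as soon as a step cannot proceed.
function runSearch(bridge: DesktopBridge, windowTitle: string, query: string): string | undefined {
  if (bridge.listWindows().indexOf(windowTitle) === -1) {
    return undefined;
  }
  if (!bridge.focusWindow(windowTitle)) {
    return undefined;
  }
  const input = bridge.findControl("Edit", "Search");
  if (input === undefined) {
    return undefined;
  }
  bridge.setValue(input, query);
  bridge.sendKey("Enter");
  return bridge.readText(input);
}

// In-memory fake so the sketch runs without a desktop session.
class FakeBridge implements DesktopBridge {
  readonly keysSent: string[] = [];
  private values: { [id: string]: string | undefined } = {};

  listWindows(): string[] {
    return ["Notes"];
  }
  focusWindow(title: string): boolean {
    return title === "Notes";
  }
  findControl(controlType: string, name: string): string | undefined {
    return controlType === "Edit" && name === "Search" ? "edit-1" : undefined;
  }
  setValue(controlId: string, value: string): void {
    this.values[controlId] = value;
  }
  sendKey(key: string): void {
    this.keysSent.push(key);
  }
  readText(controlId: string): string {
    const value = this.values[controlId];
    return value === undefined ? "" : value;
  }
}

const fakeBridge = new FakeBridge();
const readBack = runSearch(fakeBridge, "Notes", "meeting notes");
const notFound = runSearch(fakeBridge, "Calculator", "1+1");
```

Each step either succeeds or stops the pass early, so the agent never types into a window it failed to find or focus.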

Coordinates still exist, but they become a fallback. If the app exposes accessibility metadata, the assistant should use the semantic route first. If the app is custom-rendered, Canvas-heavy, or not exposing useful controls, then it can capture a screenshot and fall back to visual inspection.

The important rule is:

Observe semantically first. Use pixels only when the semantic layer is missing.
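As a minimal sketch of that rule, assume two interchangeable locators: one backed by the accessibility tree and one backed by a screenshot and a vision model. The names `LocatedTarget`, `TargetLocator`, and `locateTarget`, and both fake locators, are hypothetical and exist only to show the ordering.

```typescript
// Where a target was found, and which layer found it (hypothetical shape).
interface LocatedTarget {
  source: "uia" | "vision";
  x: number;
  y: number;
}

// A locator returns a target, or undefined when it cannot find one.
type TargetLocator = (label: string) => LocatedTarget | undefined;

// Ask the semantic layer first; call the vision locator only when it finds nothing.
function locateTarget(
  label: string,
  semantic: TargetLocator,
  vision: TargetLocator
): LocatedTarget | undefined {
  const hit = semantic(label);
  if (hit !== undefined) {
    return hit;
  }
  return vision(label);
}

// Fake locators so the ordering can be observed without a desktop.
let visionCalls = 0;

const semanticLocator: TargetLocator = (label) =>
  label === "Submit" ? { source: "uia", x: 120, y: 48 } : undefined;

const visionLocator: TargetLocator = () => {
  visionCalls += 1; // stands in for the expensive screenshot + model round trip
  return { source: "vision", x: 300, y: 200 };
};
```

With these fakes, `locateTarget("Submit", semanticLocator, visionLocator)` never touches the vision path, while a label the tree does not know falls through to it.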

How I wired it into CliGate

CliGate is already a local gateway for Claude Code, Codex CLI, Gemini CLI, and OpenClaw, and it already manages account pools, API keys, runtime sessions, and channel workflows.

The desktop part adds a small local companion service that runs in the user's interactive desktop session. The assistant talks to it through local tools:

list visible windows
launch or focus an app
find one control or all matching controls
click a control through UIA
set an input value without relying on clipboard focus
send control keys like Enter
read visible text or values
capture a screenshot when UIA is not enough
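One way to expose a tool list like the one above is a typed dispatch table. This is a sketch under stated assumptions: the tool names, the `ToolCall` and `ToolResult` shapes, and the stub handlers are all hypothetical and are not CliGate's actual identifiers.

```typescript
// Hypothetical tool names mirroring the list above.
type DesktopToolName =
  | "list_windows"
  | "focus_app"
  | "find_control"
  | "click_control"
  | "set_value"
  | "send_key"
  | "read_text"
  | "capture_screenshot";

type ToolArgs = { [key: string]: string | undefined };

interface ToolCall {
  tool: DesktopToolName;
  args: ToolArgs;
}

interface ToolResult {
  ok: boolean;
  detail: string;
}

type ToolHandler = (args: ToolArgs) => ToolResult;

// Look up the handler for a call and run it; unregistered tools fail loudly
// with a result the model can read, instead of throwing.
function dispatchTool(
  handlers: Partial<Record<DesktopToolName, ToolHandler>>,
  call: ToolCall
): ToolResult {
  const handler = handlers[call.tool];
  if (handler === undefined) {
    return { ok: false, detail: "no handler registered for " + call.tool };
  }
  return handler(call.args);
}

// Stub handlers standing in for real UIA-backed implementations.
const stubHandlers: Partial<Record<DesktopToolName, ToolHandler>> = {
  list_windows: () => ({ ok: true, detail: "Notes, Terminal" }),
  set_value: (args) => {
    const control = args["control"];
    const value = args["value"];
    if (control === undefined || value === undefined) {
      return { ok: false, detail: "set_value needs control and value" };
    }
    return { ok: true, detail: "set " + control + " to " + value };
  },
};
```

Keeping the tool names in a union type means a misspelled tool is a compile-time error rather than a silent no-op at run time.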

That means the agent can stay inside one local control plane. It can work in a repo, continue a Codex or Claude Code runtime session, and still operate the desktop app that owns the last step of the workflow.

Why local matters

I do not want desktop control to depend on a hosted relay.

The desktop is full of sensitive state: open apps, browser sessions, account consoles, chat windows, clipboard contents, and local files. Keeping the control service on localhost keeps the automation close to the machine that owns that state.
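A small guard can make the localhost-only rule explicit in configuration. The sketch below is an assumption about how such a check might look, not CliGate's code: `isLoopbackHost` is a hypothetical helper, and it is deliberately conservative (it rejects shorthand forms like `127.1` rather than trying to parse them).

```typescript
// Return true only for hosts that keep traffic on the local machine.
// Assumes "localhost" resolves to a loopback address, which is the usual default.
function isLoopbackHost(host: string): boolean {
  const h = host.trim().toLowerCase();
  if (h === "localhost" || h === "::1" || h === "[::1]") {
    return true;
  }
  // IPv4 loopback is the whole 127.0.0.0/8 block: 127.x.y.z
  const parts = h.split(".");
  if (parts.length !== 4 || parts[0] !== "127") {
    return false;
  }
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part) || Number(part) > 255) {
      return false;
    }
  }
  return true;
}
```

A companion service could call this on its configured bind address at startup and refuse to start on anything else, such as `0.0.0.0`.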

It also makes the loop simpler. The assistant can inspect, act, and verify against the actual window in front of the user without shipping a stream of screenshots to a separate service.

This fits the rest of CliGate's design: local server, local dashboard, local runtime orchestration, local desktop bridge.

What changed

The big change was not speed, although a UIA call skips the screenshot and the model round trip, so it is much faster than a screenshot-model-click loop.

The bigger change was reliability.

A textbox is no longer "some rectangle near the bottom of the screenshot." It is an Edit control. A button is no longer "probably the blue thing." It is a control with a name, bounding box, state, and supported patterns.
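Because a control carries its own state, the agent can check it before acting. The sketch below assumes a hypothetical `ControlSnapshot` shape and `planClick` helper; the field names loosely mirror UI Automation properties (Name, ControlType, BoundingRectangle, IsEnabled, IsOffscreen), but this is not a real binding.

```typescript
// Hypothetical snapshot of one control, as the agent might receive it.
interface ControlSnapshot {
  controlType: string;
  name: string;
  bounds: { left: number; top: number; width: number; height: number };
  isEnabled: boolean;
  isOffscreen: boolean;
  patterns: string[]; // supported patterns, e.g. "Invoke", "Value"
}

type ClickPlan =
  | { kind: "invoke" }
  | { kind: "click"; x: number; y: number }
  | { kind: "skip"; reason: string };

// Decide how to activate a control before touching it:
// never act on a disabled control, prefer the Invoke pattern,
// and use the center of the bounding box only as a fallback.
function planClick(control: ControlSnapshot): ClickPlan {
  if (!control.isEnabled) {
    return { kind: "skip", reason: "control is disabled" };
  }
  if (control.patterns.indexOf("Invoke") !== -1) {
    return { kind: "invoke" };
  }
  const b = control.bounds;
  if (control.isOffscreen || b.width <= 0 || b.height <= 0) {
    return { kind: "skip", reason: "no visible area to click" };
  }
  return { kind: "click", x: b.left + b.width / 2, y: b.top + b.height / 2 };
}

const saveButton: ControlSnapshot = {
  controlType: "Button",
  name: "Save",
  bounds: { left: 100, top: 200, width: 80, height: 24 },
  isEnabled: true,
  isOffscreen: false,
  patterns: ["Invoke"],
};

const savePlan = planClick(saveButton);
```

Even the coordinate fallback here comes from the control's reported bounding box rather than from a guess about a screenshot.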

Screenshots are still useful. Some apps do not expose good accessibility metadata. Some content is graphical by nature. But for normal desktop and browser workflows, UIA-first makes the assistant feel less like a demo and more like an operator.

The project is open source here: CliGate on GitHub

If you are building desktop-capable agents, are you starting with screenshots, accessibility trees, browser automation, or something else?

Top comments (1)

Harjot Singh • May 31

Switching from pixel/vision clicking to Windows UI Automation is exactly the right move, and the why is the interesting part. A vision model locating a button by pixels is guessing at coordinates from an image, so it misses on DPI scaling, slight layout shifts, or a control that moved 4px, and you get the confident-click-on-nothing failure. UI Automation hands the agent the actual accessibility tree, real elements with real names and roles, so instead of where-do-I-think-the-button-is it's invoke-the-element-named-Submit. That's the same lesson as structured-tools-over-screen-scraping everywhere: when a typed, declared interface exists, use it, and reserve vision for when nothing structured is available. The bonus is that reliability becomes verifiable: you can confirm the element exists and is enabled before acting, instead of clicking and hoping. The failure mode you escaped (agent acts on a misperceived screen) is one of the biggest sources of silent desktop-automation breakage, and grounding in the accessibility layer kills most of it. That prefer-the-structured-interface-over-pixels instinct is core to how I think about agent automation in Moonshift. Do you fall back to vision when an app exposes no accessibility tree (Electron/custom-rendered UIs are notorious), or skip those entirely?