"My AI Assistant Could Code, But It Couldn't Operate My Desktop"

#webdev #javascript #ai #opensource

My assistant could already read files, run shell commands, and delegate coding work to Claude Code or Codex.

But the moment a workflow hit a real desktop app, the illusion broke.

A browser needed a click. A page needed a scroll. A field needed real text input. A task could finish the hard part and still get stuck on the last two seconds of UI.

That felt like a fake kind of automation.

The problem wasn't coding

The hard part here wasn't generating code. It was crossing the gap between "I know what should happen next" and "I can actually operate the window in front of me."

In practice, that gap showed up in small but annoying ways:

a browser tab needed Ctrl+L and a URL paste
a page exposed no reliable accessibility selector, so a screenshot was needed first
a long form needed scrolling inside the right pane, not the whole desktop
a final publish step still depended on one visible button

So the assistant didn't need another coding loop. It needed a safe desktop-control layer.

The local control loop I added

I added a small set of desktop tools around a companion agent running on the same machine.

The assistant can now do things like:

list windows
focus a specific app
find accessible controls when UI Automation is available
set input values directly
send hotkeys like Ctrl+L
capture screenshots before pixel-based actions
click, move, and scroll with explicit coordinates only after visual confirmation
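
The "only after visual confirmation" part of that last item can be enforced in code rather than left to convention. Here is a minimal sketch, assuming a hypothetical `PixelActionGuard` that tracks when each window was last captured. The class, its method names, and the five-second freshness window are illustrative assumptions, not CliGate's actual API.

```javascript
// Hypothetical guard for pixel-based actions; names are illustrative only.
class PixelActionGuard {
  constructor({ maxAgeMs = 5000, now = () => Date.now() } = {}) {
    this.maxAgeMs = maxAgeMs;      // assumed freshness window for a screenshot
    this.now = now;                // injectable clock, useful in tests
    this.lastCapture = new Map();  // windowId -> time of the last screenshot
  }

  // Call this whenever a screenshot of the window has been taken.
  recordCapture(windowId) {
    this.lastCapture.set(windowId, this.now());
  }

  // Throws unless the window was captured recently enough to trust coordinates.
  assertCanAct(windowId) {
    const capturedAt = this.lastCapture.get(windowId);
    if (capturedAt === undefined) {
      throw new Error(`No screenshot of window ${windowId} yet`);
    }
    if (this.now() - capturedAt > this.maxAgeMs) {
      throw new Error(`Screenshot of window ${windowId} is stale`);
    }
  }

  // After a click or scroll the view may have changed, so force a new capture.
  invalidate(windowId) {
    this.lastCapture.delete(windowId);
  }
}
```

A click tool would call `assertCanAct(windowId)` before sending coordinates and `invalidate(windowId)` afterwards, so every pixel action is paired with a fresh look at the window.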

The key constraint is simple: observe first, then act.

If selectors are available, use them. If they are not, capture the window, inspect what is actually visible, and only then click. That rule matters more than any single tool because it keeps desktop automation from turning into random coordinate guessing.
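
That decision can be written as a small pure function. The sketch below is one way to express it, assuming a hypothetical `planAction` helper and a made-up result shape. Treating an ambiguous selector match (more than one hit) as "not available" is an assumption of this example, not something the rule above states.

```javascript
// Hypothetical planner for "observe first, then act"; names are illustrative.
// selectorMatches: controls found via accessibility selectors (may be empty).
// screenshot: the latest capture of the window, or null if none was taken.
function planAction({ selectorMatches, screenshot }) {
  // Exactly one match: the selector is reliable, so act on it directly.
  if (selectorMatches.length === 1) {
    return { kind: 'selector', target: selectorMatches[0] };
  }
  // No usable selector and nothing observed yet: capture before anything else.
  if (!screenshot) {
    return { kind: 'capture' };
  }
  // A screenshot exists: inspect it to derive coordinates, then click.
  return { kind: 'inspect', screenshot };
}
```

Note that no branch returns a click without either a selector or a screenshot, which is what rules out coordinate guessing.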

What changed in the workflow

Before this, the assistant could help me prepare a task but not finish anything that crossed into a real app.

Now the same local loop can cover more of the actual workflow:

inspect window → focus app → locate control or capture screenshot → act → verify
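
As a rough sketch, one pass through that loop might look like the following. The `desktop` object and its methods (`listWindows`, `focus`, `findControl`, `setValue`, `readValue`, `capture`) are hypothetical stand-ins for whatever the companion agent exposes, and the `step` shape is invented for this example.

```javascript
// Hypothetical single step of the loop; the desktop backend is injected.
async function runStep(desktop, step) {
  // Inspect: find the target window by title.
  const windows = await desktop.listWindows();
  const win = windows.find((w) => w.title.includes(step.windowTitle));
  if (!win) return { ok: false, reason: 'window-not-found' };

  // Focus the app before touching anything in it.
  await desktop.focus(win.id);

  // Locate: prefer an accessible control, otherwise fall back to a screenshot.
  const control = await desktop.findControl(win.id, step.selector);
  if (!control) {
    const screenshot = await desktop.capture(win.id);
    return { ok: false, reason: 'needs-visual-inspection', screenshot };
  }

  // Act, then verify by reading the value back instead of assuming success.
  await desktop.setValue(control, step.value);
  const actual = await desktop.readValue(control);
  return actual === step.value
    ? { ok: true }
    : { ok: false, reason: 'verify-failed' };
}
```

Returning a reason instead of throwing lets the caller decide what to do next, for example inspecting the screenshot and retrying with coordinates.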

That sounds small, but it changes what "assistant" means in practice.

It is no longer limited to code and terminal state. It can handle the messy last mile where real work often stalls.

Why I kept it local

I did not want this running through a hosted browser service or a remote desktop relay.

Desktop control touches exactly the kind of things that should stay on the machine that owns them: open apps, visible windows, clipboard state, local sessions, and personal accounts.

Keeping it local also makes the loop faster. The assistant can inspect, act, and verify against the current desktop state without shipping screenshots or UI events to another service first.

That local-first constraint fits the rest of CliGate, the project this assistant is part of. The gateway, the assistant, the runtimes, and now the desktop-control layer all live on the same box.

What I learned

The interesting lesson was that "assistant capability" is not just about better reasoning or better code generation.

A lot of workflows fail because the assistant cannot cross boundaries between tools.

Terminal-only automation is useful. But if the real workflow ends in a browser, a settings window, a login dialog, or a web app form, then desktop control becomes part of the product surface whether you planned for it or not.

So this update was less about making the assistant smarter and more about making it less incomplete.

If you're building local AI tooling, where does your automation still stop: at the terminal, at the API, or at the desktop?

Repo: https://github.com/codeking-ai/cligate