"My AI Assistant Could Code, But It Couldn't Operate My Desktop"

#webdev #javascript #ai #opensource

Most AI coding agents are good until the task leaves the terminal.

They can edit files. They can run tests. They can explain a diff. Then the work hits a desktop app, an OAuth approval screen, a native settings window, or a web UI that was not designed for API access. Suddenly the agent is not stuck on intelligence. It is stuck on reach.

That was the gap I kept running into while building my local AI setup. I had Claude Code, Codex CLI, Gemini CLI, local models, provider keys, and account pools. The missing piece was not another model.

It was an operator.

The Problem Was The Boundary

My old workflow had two separate worlds.

In one world, coding agents lived inside terminals and repos. They could reason about code, run commands, and keep a session alive.

In the other world, real work still happened through desktop apps, dashboards, browser windows, chat clients, and provider consoles. A human could jump between those worlds without thinking. An agent could not.

That made the assistant feel smaller than it should have:

It could fix a bug, but not always finish the setup.
It could tell me where to click, but not click safely.
It could generate a workflow, but not reliably drive the app that owned the workflow.
It could reuse project knowledge, but only if I remembered to paste it in.

So I changed how I think about CliGate.

CliGate is no longer just a local API gateway for AI tools. It is becoming a local control plane for agent work.

What CliGate Does Now

CliGate still starts as one localhost service for AI coding tools.

You can point Claude Code, Codex CLI, Gemini CLI, and OpenClaw at the same local server, then manage provider keys, account pools, routing, usage, logs, and local runtimes from one dashboard.
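To make "one server, many tools" concrete, here is a minimal sketch of how a local gateway might map a request path to a provider and rotate through an account pool. This is not CliGate's actual code: `ROUTES`, `createPool`, `createRouter`, the path prefixes, and the key names are all hypothetical, chosen only for illustration.

```javascript
// Hypothetical sketch: none of these names come from CliGate's codebase.
// The path prefixes below are illustrative routing rules, nothing more.
const ROUTES = [
  { prefix: "/v1/messages", provider: "anthropic" },
  { prefix: "/v1/responses", provider: "openai" },
  { prefix: "/v1beta/models", provider: "gemini" },
];

// An account pool that hands out accounts in round-robin order,
// so no single key absorbs all of the traffic.
function createPool(accounts) {
  let next = 0;
  return {
    pick() {
      if (accounts.length === 0) throw new Error("account pool is empty");
      const account = accounts[next % accounts.length];
      next += 1;
      return account;
    },
  };
}

// Build a route(path) function. It returns null when no rule matches
// or when the matched provider has no configured pool.
function createRouter(routes, pools) {
  return function route(path) {
    const match = routes.find((r) => path.startsWith(r.prefix));
    if (!match) return null;
    const pool = pools[match.provider];
    if (!pool) return null;
    return { provider: match.provider, account: pool.pick() };
  };
}

const route = createRouter(ROUTES, {
  anthropic: createPool(["anthropic-key-a", "anthropic-key-b"]),
  openai: createPool(["openai-key-a"]),
});

console.log(route("/v1/messages")); // first account in the pool
console.log(route("/v1/messages")); // second account in the pool
console.log(route("/unknown"));     // null: no matching rule
```

A real gateway would also need health checks, rate-limit awareness, and usage logging, but the core idea is the same: the tools only see one local address, and the pool decides which credentials do the work.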

But the newer assistant layer sits above that.

It has two modes:

Direct runtime: keep talking to the current Codex or Claude Code session.
Assistant collaboration: ask CliGate Assistant to inspect state, choose a runtime, continue a task, handle a blocked run, or summarize what happened.

That split matters. I do not want every normal message to be intercepted by a clever supervisor. Sometimes I just want to continue the current runtime session. Other times I want an assistant that can see the bigger picture.

The assistant is not trying to replace Codex or Claude Code. It coordinates them.
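One way to picture the two modes is as an explicit opt-in: messages go straight to the current runtime session unless the user asks for the assistant. The sketch below assumes a hypothetical `/assistant` prefix and made-up handler names. The post does not say how CliGate actually switches modes, so treat this purely as an illustration of the split.

```javascript
// Hypothetical sketch: the "/assistant" prefix and the handler names are
// invented for illustration and are not CliGate's actual interface.
const ASSISTANT_PREFIX = "/assistant";

// Decide which mode a message belongs to. Anything without the explicit
// prefix is left untouched and treated as a direct runtime message.
function classifyMessage(text) {
  const trimmed = text.trim();
  const isAssistant =
    trimmed === ASSISTANT_PREFIX || trimmed.startsWith(ASSISTANT_PREFIX + " ");
  if (isAssistant) {
    return { mode: "assistant", body: trimmed.slice(ASSISTANT_PREFIX.length).trim() };
  }
  return { mode: "direct", body: text };
}

// Forward the message to the handler that matches its mode.
function dispatch(text, handlers) {
  const { mode, body } = classifyMessage(text);
  return mode === "assistant" ? handlers.assistant(body) : handlers.runtime(body);
}

// Stand-in handlers. Real ones would talk to a runtime session or to the assistant.
const handlers = {
  runtime: (body) => `runtime session received: ${body}`,
  assistant: (body) => `assistant received: ${body}`,
};

console.log(dispatch("rerun the failing test", handlers));
console.log(dispatch("/assistant summarize the blocked run", handlers));
```

The design choice worth noticing is the default: normal messages are never intercepted, and the supervising layer only steps in when asked.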

Skills Made It Less Generic

The second piece is skills.

A skill is a local package of instructions, scripts, templates, and references. The assistant does not need every detail in context all the time. It can see a short description first, then read the full SKILL.md only when the task matches.

For example:

skills/
  devto-publisher/
    SKILL.md
    publish.js
    templates/

That turns the assistant from "a general chat box with tools" into something closer to a teammate with reusable procedures.

One skill can know how to publish a Dev.to article. Another can know how to build a spreadsheet. Another can know the conventions of a local repo. The key is that these are local, inspectable, and executable through the same permission system as the rest of the agent.

It is not magic. It is just a better way to keep operational knowledge out of one giant prompt.
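The "short description first, full SKILL.md later" idea can be sketched in a few lines. Everything here is an assumption made for illustration: the `description:` front-matter line, the keyword-overlap matcher, and the function names are not taken from CliGate, and an in-memory object stands in for reading files from disk. In a real assistant the model itself would likely decide which description matches the task.

```javascript
// Hypothetical sketch: the front-matter format, the keyword matching, and
// all function names are assumptions, not CliGate's actual skill loader.

// In-memory stand-in for the SKILL.md files on disk.
const skillFiles = {
  "devto-publisher":
    "---\ndescription: Publish a markdown article to Dev.to\n---\nRun publish.js with the draft path.",
  "spreadsheet-builder":
    "---\ndescription: Build a spreadsheet from tabular data\n---\nCollect rows, then write the sheet.",
};

// Pull only the `description:` line out of a SKILL.md file.
function readDescription(skillText) {
  const match = skillText.match(/^description:\s*(.+)$/m);
  return match ? match[1].trim() : "";
}

// Build a lightweight index: name plus one-line description, no bodies.
function buildIndex(files) {
  return Object.entries(files).map(([name, text]) => ({
    name,
    description: readDescription(text),
  }));
}

// Lowercased words longer than three characters, to skip most filler words.
function words(text) {
  return (text.toLowerCase().match(/[a-z0-9]+/g) || []).filter((w) => w.length > 3);
}

// Naive matcher: pick the skill whose description shares the most words with the task.
function selectSkill(index, task) {
  const taskWords = new Set(words(task));
  let best = null;
  let bestScore = 0;
  for (const entry of index) {
    const score = words(entry.description).filter((w) => taskWords.has(w)).length;
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best;
}

// Only now is the full skill body loaded into context.
function loadSkill(files, index, task) {
  const entry = selectSkill(index, task);
  return entry ? { name: entry.name, body: files[entry.name] } : null;
}

const index = buildIndex(skillFiles);
console.log(index);
console.log(loadSkill(skillFiles, index, "publish my article"));
```

The index is the only thing that has to stay in context all the time. The full instructions are paid for only when a task actually needs them.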

The Desktop Part Is The Big Unlock

The part I am most excited about is desktop control.

The first naive version of desktop automation is usually visual: take a screenshot, ask the model where to click, move the mouse, repeat. That works for demos, but it is fragile. Small buttons, focus changes, DPI scaling, popups, and animations can break it.

CliGate's desktop agent takes a different default path on Windows: UI Automation first, screenshots second.

Instead of guessing pixels, the assistant can ask the operating system for the UI tree:

list windows -> focus app -> find input -> set value -> send Enter -> read text

That means it can find a textbox by control type, set its value through the accessibility API, invoke a button, read visible text, and only fall back to screenshots when the app does not expose useful accessibility metadata.

This is the bridge I wanted: a coding assistant that can work in repos, but also operate the desktop applications that surround the repo.
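The "UI Automation first, screenshots second" order can be sketched as a simple fallback. The `uia` and `vision` driver objects below are hypothetical interfaces invented for this example, and the fake drivers only record what they were asked to do. Real drivers would wrap the Windows UI Automation API and a screenshot-based locator, and CliGate's actual implementation may look quite different.

```javascript
// Hypothetical sketch: `uia` and `vision` are invented driver interfaces,
// not CliGate's real API. Only the ordering of the two paths is the point.
async function setTextField(drivers, windowTitle, value) {
  const { uia, vision } = drivers;

  // Preferred path: ask the accessibility tree for an editable control.
  const element = await uia.findElement(windowTitle, { controlType: "Edit" });
  if (element) {
    await uia.setValue(element, value);
    return { via: "uia" };
  }

  // Fallback: no useful accessibility metadata, so locate the field visually.
  const point = await vision.locate(windowTitle, "text input");
  if (!point) throw new Error(`no input found in "${windowTitle}"`);
  await vision.clickAndType(point, value);
  return { via: "screenshot" };
}

// Fake drivers for the demo: only windows listed in `accessibleWindows`
// expose an accessibility tree. Every action is appended to `log`.
function makeFakeDrivers(accessibleWindows) {
  const log = [];
  return {
    log,
    uia: {
      async findElement(title, query) {
        return accessibleWindows.includes(title) ? { title, ...query } : null;
      },
      async setValue(element, value) {
        log.push(`uia:setValue:${value}`);
      },
    },
    vision: {
      async locate(title, hint) {
        return { x: 100, y: 200 };
      },
      async clickAndType(point, value) {
        log.push(`vision:type:${value}`);
      },
    },
  };
}

async function demo() {
  const drivers = makeFakeDrivers(["Settings"]);
  console.log(await setTextField(drivers, "Settings", "hello"));
  console.log(await setTextField(drivers, "Legacy App", "hello"));
  console.log(drivers.log);
}

demo().catch(console.error);
```

The caller never has to know which path was taken, but returning `via` makes it easy to log how often an app forces the fragile screenshot route.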

Where This Is Going

The current shape is:

CliGate routes AI coding tools through one local server.
Runtime sessions keep Codex and Claude Code work alive.
The assistant watches, coordinates, and summarizes.
Skills give it reusable procedures.
Desktop control gives it a path into native apps and GUI workflows.

That combination changes the product from "proxy for AI tools" into "local operator for developer workflows."

I think the desktop-control layer deserves its own post, because "AI can operate apps through the OS accessibility tree, with screenshots as a fallback" is a deeper topic than I can fit here.

The project is open source here: CliGate on GitHub

How are you handling the boundary between coding agents and the desktop apps they still need to interact with?