Stephanie Dover

Posted on Jun 8

How I made one desktop app drive four AI coding agent CLIs

#ai #automation #productivity #showdev

TL;DR

I built Klaussy, a desktop app that runs AI coding-agent CLIs in parallel across git worktrees and pairs them with a GitHub PR review surface. The v0.3.0 release (out June 5) replaced its hard dependency on Claude Code with a provider registry, so it now drives Claude Code, OpenAI Codex, Google Gemini, or GitHub Copilot — your pick per task. This post covers how the registry works, the side-by-side terminal model, and where the deeper AI features are still uneven across agents. It's closed-source and in beta. Site: klaussy.com.

The problem

Until a few weeks ago, Klaussy had one ugly constraint baked into its core: it only worked if you used Claude Code. The early-access docs said it outright — if you didn't run Claude Code, there was nothing in the app for you.

That was fine for the first users, who mostly did run Claude Code. But it ruled out a large chunk of the people the app is actually for: engineers who already use an agent CLI daily, have two or three tasks in flight at once, and want a structured way to run them without juggling branches in a single clone. A lot of those people had standardized on Codex, or Gemini, or Copilot — often for cost, procurement, or data-handling reasons that had nothing to do with the tool's quality. For them, Klaussy was "the Claude thing," and that was the end of the conversation.

The pain Klaussy targets isn't agent-specific. It's the workflow around the agent: starting task B while the agent grinds on task A, hopping between the terminal (where the agent lives) and the browser (where PR review lives), and triaging a CI failure across GitHub, the terminal, and the editor. None of that cares which agent you run. So hard-wiring one agent into the foundation was a self-inflicted limit. v0.3.0 is the release that removed it.

Why a single-agent design was the wrong foundation

The original architecture made the easy assumption: there's one agent CLI, so call it directly. Spawn claude, parse its output, wire its session resume into the terminal manager. Every AI surface in the app — the interactive terminal, the PR-review actions, the CI-failure debugger — reached for Claude by name.

That works right up until you want a second agent. Then every one of those call sites is a place that knows too much. The four agent CLIs differ in obvious ways (@anthropic-ai/claude-code vs @openai/codex vs @google/gemini-cli vs @github/copilot) and in annoying ones: different model-selection flags, different session-resume mechanics, different output streams to parse, different auth quirks. You can't paper over that with a single if agent == "codex" branch sprinkled everywhere — you end up with the same conditional copied across a dozen files, and adding a fifth agent later means finding all of them again.

The terminal multiplexers people would otherwise reach for (tmux, zellij) don't help here either. They'll give you N shells in one window, but they don't know what a git worktree is, what an agent session is, or what state a PR review is in. Klaussy ties each terminal to a specific agent instance, a branch, and a worktree — so the abstraction it needed was a clean seam between "which agent" and "what we're asking the agent to do."

The approach

The fix was a provider registry: one module that owns everything agent-specific, and a rule that nothing outside it hard-codes an agent. Each provider declares its npm package, how to launch it, which models it exposes, and how its output should be parsed. The rest of the app asks the registry for "the current agent" and works against that interface.

On top of the registry sits a small amount of UI: a global default agent you set once, plus a per-action picker so you can override it for a single task. Set Gemini as your default and every agent action follows it, persisting across restarts; reach for Codex on one specific worktree without changing the global setting. The orchestration layer above — parallel worktrees, one task per terminal, the PR review surface — didn't change. It just stopped caring which agent was underneath.
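
The default-plus-override behavior described above can be sketched in a few lines. This is a minimal illustration, not Klaussy's code: `createAgentSettings`, `setDefault`, and `resolve` are hypothetical names, the initial default of "claude" is an assumption, and persistence to disk is left out.

```javascript
// Hypothetical names throughout; only the behavior (a global default that
// sticks, plus a one-off per-action override) comes from the post.
const KNOWN_AGENTS = ["claude", "codex", "gemini", "copilot"];

function createAgentSettings(initialDefault = "claude") {
  let globalDefault = initialDefault;

  function assertKnown(agent) {
    if (!KNOWN_AGENTS.includes(agent)) {
      throw new Error(`Unknown agent: ${agent}`);
    }
  }

  return {
    // Set once. A real app would also write this to disk so it survives restarts.
    setDefault(agent) {
      assertKnown(agent);
      globalDefault = agent;
    },
    // A per-action override wins for this call only and never changes the default.
    resolve(override) {
      if (override === undefined) return globalDefault;
      assertKnown(override);
      return override;
    },
  };
}

const settings = createAgentSettings();
settings.setDefault("gemini");
console.log(settings.resolve());        // gemini
console.log(settings.resolve("codex")); // codex
console.log(settings.resolve());        // gemini
```

The override is passed as an argument rather than stored, which is what keeps a one-off Codex run from leaking into the global setting.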

The honest version of this story is that the registry isn't uniformly deep yet. Phase 1 nailed the parts every agent shares — launching, switching, resuming, running two side by side. The parts that require parsing each agent's particular output stream are mature on Claude and still being verified on the other three. More on that below, because it's the most important caveat in the release.

How it works

The provider registry

Every AI surface in the app now routes through the registry instead of calling an agent by name. A provider entry knows its npm package, its launch command, and its model list. Adding agent number five means adding one entry, not editing a dozen call sites.

// Illustrative — confirm exact API in docs.
// Conceptual shape of the provider registry (main/state/ai-providers.js).
const PROVIDERS = {
  claude:  { pkg: "@anthropic-ai/claude-code", models: ["opus", "sonnet", "haiku"] },
  codex:   { pkg: "@openai/codex",             models: ["gpt-5.5", "gpt-5.4-mini"] },
  gemini:  { pkg: "@google/gemini-cli",        models: ["2.5-flash", "3-pro"] },
  copilot: { pkg: "@github/copilot",           models: ["default"] },
};

Model selection is verified for three of the four. Claude takes --model aliases (opus/sonnet/haiku), Codex exposes gpt-5.5 and gpt-5.4-mini, and Gemini offers its 2.5/3 flash and pro tiers. Copilot is Default-model-only in v0.3.0 — its model slugs aren't verified yet, so the picker doesn't pretend to offer a choice it can't honor. That's a deliberate "show what's real" call rather than a missing feature dressed up as one.
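
To make the "show what's real" rule concrete, here is a sketch of how a picker could decide whether to offer a choice at all. It reuses the illustrative registry shape from this post; `modelChoices` and `isValidModel` are hypothetical helpers, not the app's API.

```javascript
// Same illustrative registry shape as the sketch earlier in this post.
const PROVIDERS = {
  claude:  { pkg: "@anthropic-ai/claude-code", models: ["opus", "sonnet", "haiku"] },
  codex:   { pkg: "@openai/codex",             models: ["gpt-5.5", "gpt-5.4-mini"] },
  gemini:  { pkg: "@google/gemini-cli",        models: ["2.5-flash", "3-pro"] },
  copilot: { pkg: "@github/copilot",           models: ["default"] },
};

// Hypothetical helper: what the model picker should show for an agent.
function modelChoices(agent) {
  if (!Object.hasOwn(PROVIDERS, agent)) {
    throw new Error(`Unknown agent: ${agent}`);
  }
  const { models } = PROVIDERS[agent];
  // A lone "default" entry means there is nothing verified to choose between.
  const defaultOnly = models.length === 1 && models[0] === "default";
  return { models, selectable: !defaultOnly };
}

// Hypothetical helper: true only for models the registry lists for that agent.
function isValidModel(agent, model) {
  return modelChoices(agent).models.includes(model);
}

console.log(modelChoices("claude").selectable);  // true
console.log(modelChoices("copilot").selectable); // false
console.log(isValidModel("codex", "opus"));      // false
```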

Parallel worktrees, one window

Each task runs in its own git worktree, with its own pseudo-terminal via node-pty, surfaced in the same window as columns, a grid, or a single pane. This is the part that replaces the git worktree + tmux + a handful of gh aliases that an engineer would otherwise script themselves. The agent for each terminal is whatever the registry hands back, so a column running Claude and a column running Gemini coexist without either knowing about the other.
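
As a rough sketch of the one-worktree-one-terminal pairing, the function below only plans the two steps and runs nothing. The directory layout and the `planTask` name are assumptions for illustration. `git worktree add -b <branch> <path>` is standard git, and the `terminal` object mirrors the `(file, args, { cwd })` shape that node-pty's `spawn` accepts.

```javascript
// Hypothetical planner: returns what to run, without running it.
function planTask({ repoDir, worktreesRoot, branch, agentCommand }) {
  // Assumed layout: one directory per branch, with unsafe characters flattened.
  const dirName = branch.replace(/[^A-Za-z0-9._-]+/g, "-");
  const worktreePath = `${worktreesRoot}/${dirName}`;

  return {
    // Creates the branch and checks it out in its own directory.
    createWorktree: {
      file: "git",
      args: ["worktree", "add", "-b", branch, worktreePath],
      cwd: repoDir,
    },
    // What a PTY library such as node-pty would be given to start the agent
    // inside that worktree.
    terminal: { file: agentCommand, args: [], cwd: worktreePath },
  };
}

const plan = planTask({
  repoDir: "/code/app",
  worktreesRoot: "/code/app-worktrees",
  branch: "fix/ci-flake",
  agentCommand: "claude",
});
console.log(plan.createWorktree.args.join(" "));
// worktree add -b fix/ci-flake /code/app-worktrees/fix-ci-flake
console.log(plan.terminal.cwd);
// /code/app-worktrees/fix-ci-flake
```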

Running the same work in two agents at once

Because the registry decouples agent from task, the worktree Actions dropdown can spawn a sibling task in the same worktree on a different agent. You can hand the same change to two agents side by side and compare what they do — useful when you're still forming an opinion about which agent is better at a given kind of work.

One sharp edge worth naming: running two Codex sessions concurrently can invalidate each other's rotating OAuth tokens. Klaussy warns you before it starts a second concurrent Codex session rather than letting it silently break. That's Codex's auth model, not a Klaussy choice, but it's the kind of thing the orchestration layer has to know about, which is exactly why the registry exists.
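
A pre-launch check like that warning might look roughly as follows. The names (`SINGLE_SESSION_AUTH`, `concurrencyWarning`) and the session shape are hypothetical; only the rule itself, warn before a second concurrent Codex session, comes from the behavior described above.

```javascript
// Hypothetical: agents whose auth is known to break under concurrent sessions.
const SINGLE_SESSION_AUTH = new Set(["codex"]);

// Returns a warning string to show before launch, or null if there is nothing to warn about.
function concurrencyWarning(agent, runningSessions) {
  if (!SINGLE_SESSION_AUTH.has(agent)) return null;
  const alreadyRunning = runningSessions.filter((s) => s.agent === agent).length;
  if (alreadyRunning === 0) return null;
  return `A ${agent} session is already running; a second one may invalidate its auth tokens.`;
}

const running = [{ agent: "codex" }, { agent: "claude" }];
console.log(concurrencyWarning("claude", running) === null); // true
console.log(concurrencyWarning("codex", running) !== null);  // true
console.log(concurrencyWarning("codex", []) === null);       // true
```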

The PR review surface

Separately from the terminals, Klaussy renders a GitHub PR without a local checkout: Files, Conversation, Checks, and AI Review tabs. The inline review composer batches comments and submits them in one round trip, and per-finding state (Ignore / Add to PR / Implement / Investigate / Ask) persists across sessions. One click materializes a PR into a worktree plus a task when you do want it locally. There's a built-in Monaco editor with LSP diagnostics so you can edit and commit straight from the diff.
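
For readers curious what "batch comments and submit in one round trip" can look like, here is a sketch that folds draft comments into a single review payload. The field names follow GitHub's REST "create a review for a pull request" endpoint (`POST /repos/{owner}/{repo}/pulls/{pull_number}/reviews`). Whether Klaussy uses that endpoint is an assumption, and `buildReviewPayload` and the draft shape are hypothetical.

```javascript
// Hypothetical helper: turn locally drafted inline comments into one review body.
function buildReviewPayload(drafts, summary = "") {
  const comments = drafts
    .filter((d) => d.body.trim() !== "") // skip drafts left empty
    .map((d) => ({
      path: d.path,        // file path relative to the repo root
      line: d.line,        // line number in the diff's new version
      side: "RIGHT",       // comment on the new side of the diff
      body: d.body.trim(),
    }));

  // event "COMMENT" submits the review without approving or requesting changes.
  return { event: "COMMENT", body: summary, comments };
}

const payload = buildReviewPayload(
  [
    { path: "src/app.js", line: 12, body: "Possible null dereference here." },
    { path: "src/app.js", line: 40, body: "   " },
  ],
  "Two small things."
);
console.log(payload.comments.length); // 1
```

Sending every comment in one request is what keeps the reviewer from generating a notification per comment.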

This is where the maturity gap matters most, so I'll be plain about it: the review actions, the CI-failure auto-debug, Implement, and Ask are most battle-tested on Claude. The non-Claude output parsers for those headless surfaces are documented but still being verified in the shipped code. The interactive terminal, agent switching, resume, and side-by-side work across all four agents today. The deep AI surfaces on Codex, Gemini, and Copilot are the path I trust least, and I'd rather you know that going in than discover it on a real PR.

Optional on-device autocomplete

There's also inline tab-autocomplete that runs entirely on your machine via local Ollama, using qwen2.5-coder:1.5b at roughly 100ms latency. Nothing leaves the laptop per keystroke. It's opt-in and costs about a 2 GB download (the Ollama runtime plus the model weights); without it you get a free word-based completer. This is the one piece of the app that does its own inference instead of delegating to your agent CLI.
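
A minimal sketch of a local completion request, assuming Ollama's default port and its `/api/generate` endpoint. The model tag is the one named above; the token limit, temperature, and function names are assumptions for illustration, and this is not how the app is confirmed to call Ollama.

```javascript
// Ollama's default local endpoint; nothing here leaves the machine.
const OLLAMA_URL = "http://localhost:11434/api/generate";

// Builds a fill-in-the-middle request: text before the cursor is the prompt,
// text after the cursor is the suffix.
function buildCompletionRequest(prefix, suffix) {
  return {
    model: "qwen2.5-coder:1.5b",
    prompt: prefix,
    suffix,
    stream: false, // one JSON response instead of a token stream
    options: { num_predict: 32, temperature: 0 }, // assumed values: short, deterministic
  };
}

// Uses the global fetch available in Node 18+. Requires a running Ollama
// with the model already pulled.
async function completeInline(prefix, suffix) {
  const res = await fetch(OLLAMA_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildCompletionRequest(prefix, suffix)),
  });
  if (!res.ok) throw new Error(`Ollama returned ${res.status}`);
  const data = await res.json();
  return data.response; // the generated text
}

console.log(buildCompletionRequest("function add(a, b) {\n  return ", "\n}").model);
// qwen2.5-coder:1.5b
```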

A quick demo

The setup before first launch is real and worth stating plainly. You need Node.js 18+, the GitHub CLI authenticated, and at least one of the four supported agent CLIs installed and authenticated:

# Illustrative — Klaussy surfaces install commands in a setup dialog;
# it can detect missing CLIs but cannot bootstrap your auth.
npm i -g @anthropic-ai/claude-code   # or @openai/codex,
                                     # @google/gemini-cli, @github/copilot
gh auth login
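
The detection half of that setup dialog could be sketched like this. The executable names (`claude`, `codex`, `gemini`, `copilot`) are assumptions about what each package installs, `checkSetup` is a hypothetical helper, and the PATH lookup is injected so the logic stays testable. As noted above, nothing here can verify that a CLI is authenticated.

```javascript
// Package names are from this post; the bin names are assumptions.
const AGENT_CLIS = [
  { id: "claude",  bin: "claude",  pkg: "@anthropic-ai/claude-code" },
  { id: "codex",   bin: "codex",   pkg: "@openai/codex" },
  { id: "gemini",  bin: "gemini",  pkg: "@google/gemini-cli" },
  { id: "copilot", bin: "copilot", pkg: "@github/copilot" },
];

// isOnPath(name) -> boolean is supplied by the caller (for example, a PATH scan).
function checkSetup(isOnPath) {
  const installed = AGENT_CLIS.filter((a) => isOnPath(a.bin)).map((a) => a.id);
  const problems = [];

  if (!isOnPath("node")) problems.push("Install Node.js 18+");
  if (!isOnPath("gh")) problems.push("Install the GitHub CLI, then run: gh auth login");
  if (installed.length === 0) {
    const options = AGENT_CLIS.map((a) => `npm i -g ${a.pkg}`).join("  |  ");
    problems.push(`Install at least one agent CLI: ${options}`);
  }

  return { ready: problems.length === 0, installed, problems };
}

// Example: a machine with Node, gh, and only the Gemini CLI.
const present = new Set(["node", "gh", "gemini"]);
const result = checkSetup((name) => present.has(name));
console.log(result.ready);     // true
console.log(result.installed); // [ 'gemini' ]
```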

Once a worktree is open, picking an agent and spawning a sibling task on a second agent both happen from the worktree's Actions dropdown, with no config files and no per-project agent setup. The global default you set sticks across restarts until you change it; a per-action pick overrides it for that one task only.

What's next

A few things are honestly incomplete:

Multi-agent maturity is uneven. Running, switching, resuming, and side-by-side sessions all work on all four agents. The deeper AI surfaces (PR review, Implement, CI-debug, Ask) are most proven on Claude and still being verified on Codex, Gemini, and Copilot. Treat Claude as the battle-tested path today.
The built-in flow prompts are still Claude-flavored. The Plan/Debug/Review slash-command bodies were written for Claude. They run on the other agents but aren't tuned to them yet. Per-agent tuning is a follow-up.
It doesn't replace your agent. Klaussy does not bundle and mark up agent access; you use the account you already have with one of the supported agents. Klaussy is a developer productivity app.

There's no Klaussy server in any of this. Data flows from your own agent CLI to that agent's provider, from your own gh to GitHub, and optionally to local Ollama. Pricing is a one-time $39 founder license (rising later), or $349 / $599 for 5 / 10 seats.

Try it

Site and downloads: klaussy.com
Early-access discussion: Discord

If you've tried wiring multiple agent CLIs into one workflow yourself, I'd genuinely like to know which agent you trust for which job — drop a comment.

DEV Community