DEV Community: Stephanie Dover

Decoupling a High-Throughput Engagement Service from a Monetization System

Stephanie Dover — Tue, 23 Jun 2026 17:17:14 +0000

At large-scale consumer platforms, product reuse can quietly turn into technical debt.

In this case, a non-monetized engagement points feature shared a backend with a wallet-based monetization service.

The engagement system handled over a million transactions per second, far beyond what the transactional payments backend was built for.

What began as a pragmatic shortcut led to rising latency, ballooning storage costs, and operational strain on a system optimized for financial correctness rather than lightweight interactivity.

The only sustainable fix was to decouple them completely.

Problem Context

The monetization system prioritized:

Durable writes
Idempotency
Strong transactional guarantees

The engagement system prioritized:

Speed
Throughput
Low operational cost

Because the two shared a backend, the engagement workload forced the payments infrastructure to scale inefficiently, adding expensive transactional overhead to a workload that didn’t need it.

Scaling further wasn’t the answer. Isolation was.

Designing the New Backend

The decoupling required two major components:

A dedicated API and data model built for high-TPS, low-latency operations.
A live migration pipeline capable of achieving full data parity with zero downtime.

Infrastructure Overview

Component	Purpose
Go microservice	Core logic and API layer
Protobuf + gRPC	Internal RPC communication
AWS DynamoDB	Primary datastore: high throughput, flexible schema
AWS Kinesis Streams	Real-time change-data capture
AWS Lambda functions	Stream processors handling event ordering and writes
Redis cache	Idempotency layer to prevent duplicate writes during dual-write and stream replay
Terraform	Infrastructure-as-code provisioning
CloudWatch metrics	Observability for throughput, lag, and latency
Feature flag service	Safe rollout and traffic control

The new backend used a lightweight schema aligned to engagement interactions: simpler, cheaper, and better suited to massive write volume.

Architecture Diagrams

Diagram 1: System Overview

flowchart TD
    A[Engagement API] --> B[Old DynamoDB Table]
    B --> C[AWS Kinesis Stream]
    C --> D[AWS Lambda Functions]
    D --> E[New DynamoDB Table]
    D --> F["Redis Cache<br/>(idempotency)"]
    F --> E
    E --> G[Feature-flag Dual-Write]

Diagram 2: Migration Lifecycle

flowchart TD
    A[1. Export snapshot → S3 → Import new table] --> B[2. Sync updates via Kinesis + Lambda]
    B --> C[3. Redis ensures idempotency<br/>for dual-writes & replayed events]
    C --> D[4. Feature flag directs dual-writes]
    D --> E[5. Cutover & validation]

Migration Strategy

Building the API was straightforward.

Migrating live data at 1M+ TPS without downtime was the challenge.

1. Snapshot Bootstrap

AWS provides a built-in mechanism to export a DynamoDB table snapshot to S3, which can then be imported into a new table.

This seeded the new database with a point-in-time baseline, with no long-running scans or Glue jobs required.

2. Real-Time Sync via Kinesis + Lambda

Once the snapshot was imported, Kinesis Streams captured every subsequent change (insert, update, delete) from the source table.

Each event was processed by an AWS Lambda consumer that replayed the change into the new DynamoDB table.

Maintaining transaction order was critical: out-of-sequence events could cause corruption or lost updates.
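
One common way to make a replay tolerant of late or repeated events is to compare a per-record version before writing. The sketch below is illustrative only: the `version` field, the `ChangeEvent` shape, and the in-memory `table` dict are assumptions standing in for whatever ordering key and datastore the real pipeline used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    key: str      # record identifier (hypothetical field name)
    version: int  # assumed to increase monotonically per key
    balance: int  # payload being replicated


def apply_in_order(table: dict, event: ChangeEvent) -> bool:
    """Apply the event only if it is newer than what the table holds."""
    current = table.get(event.key)
    if current is not None and current.version >= event.version:
        return False  # stale or duplicate event: skip it
    table[event.key] = event
    return True


table = {}
events = [
    ChangeEvent("user-1", 1, 10),
    ChangeEvent("user-1", 3, 30),
    ChangeEvent("user-1", 2, 20),  # arrives late, must not overwrite version 3
]
results = [apply_in_order(table, e) for e in events]
```

Against DynamoDB, the same comparison could be expressed as a conditional write, so the check and the write happen in one request.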

To handle retries and potential duplicate delivery, I introduced a Redis-based idempotency layer.

Each event carried a unique transaction ID. Before processing, Lambda performed a fast Redis lookup to check whether that ID had already been written.

If found, the event was skipped, eliminating double writes both from Kinesis replays and from the feature-flagged dual-write traffic hitting the same endpoint.

This lightweight Redis layer made the migration safe, giving effectively exactly-once writes without compromising throughput.
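
A minimal sketch of that check, under stated assumptions: `FakeIdempotencyStore` is an in-memory stand-in for Redis so the example runs on its own, and the event fields (`txn_id`, `key`, `points`) are hypothetical names. It also swaps the lookup-then-write described above for a single claim step, because Redis can do the check and the record atomically with `SET` plus the `NX` and `EX` options.

```python
import time


class FakeIdempotencyStore:
    """In-memory stand-in for Redis; mimics set-if-absent with a TTL."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}
        self._clock = clock

    def claim(self, txn_id: str, ttl_seconds: float) -> bool:
        """Return True if this caller is the first to claim txn_id."""
        now = self._clock()
        expires_at = self._expiry.get(txn_id)
        if expires_at is not None and expires_at > now:
            return False  # already processed and not yet expired
        self._expiry[txn_id] = now + ttl_seconds
        return True


def handle_event(store, table: dict, event: dict, ttl_seconds: float = 86400) -> bool:
    """Write the event once; skip it when the transaction ID was seen before."""
    if not store.claim(event["txn_id"], ttl_seconds):
        return False
    table[event["key"]] = table.get(event["key"], 0) + event["points"]
    return True


store = FakeIdempotencyStore()
table = {}
event = {"txn_id": "t-42", "key": "user-1", "points": 5}
first = handle_event(store, table, event)
replayed = handle_event(store, table, event)  # same ID delivered again
```

The second delivery is skipped, so the points are counted once whether the duplicate came from a stream retry or from dual-write traffic.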

Monitoring IteratorAge and Duration metrics in CloudWatch remained critical.

If IteratorAge rose, the stream was falling behind, meaning either smaller batches or more concurrency were needed.

With tuning and caching in place, the pipeline kept pace with over a million updates per second.

The full migration completed within hours, not days.

Cutover with Feature Flags

After the real-time sync stabilized, I rolled out the new backend via a feature-flagged dual-write:

Dual-write requests to both APIs.
Use Redis for idempotency checks to prevent duplicate writes.
Validate data parity.
Monitor Kinesis lag until zero.
Cut traffic to the old API.

Once validation passed, the engagement service ran entirely on its new infrastructure.

The monetization system was finally free of the extra load, and both systems could scale independently.
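
The flag gate in the first step can be sketched as below. This is an illustration, not the service's code: `flag_enabled`, `record_points`, and the dict-backed "backends" are hypothetical, and a real feature flag service would supply the rollout decision.

```python
import zlib


def flag_enabled(user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into 0-99 and compare to the rollout."""
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100
    return bucket < rollout_percent


def record_points(user_id, points, old_backend, new_backend, rollout_percent):
    """Always write to the old backend; mirror to the new one when flagged."""
    old_backend[user_id] = old_backend.get(user_id, 0) + points
    if flag_enabled(user_id, rollout_percent):
        new_backend[user_id] = new_backend.get(user_id, 0) + points


old, new = {}, {}
record_points("user-1", 5, old, new, rollout_percent=0)    # flag off
record_points("user-1", 5, old, new, rollout_percent=100)  # flag fully on
```

Hashing the user ID keeps each user in the same bucket across requests, so raising the percentage only ever adds users to the dual-write path.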

Safety and Verification

Extended Kinesis retention to 24+ hours during migration.
Kept the source table intact until post-cutover validation completed.
Used Redis TTLs to automatically expire processed transaction keys, keeping cache cost minimal.
Continuously compared record counts and hash digests between tables during dual-write.

These guardrails ensured recovery options, consistency, and full traceability throughout the migration.
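
For the count-and-digest comparison, one possible shape is an order-independent digest, so two tables scanned in different orders still compare equal. This is a sketch under assumptions: rows are plain dicts, and `table_digest` is a hypothetical helper, not the tooling used in the migration.

```python
import hashlib
import json


def table_digest(rows):
    """Return (count, digest) where the digest ignores row order."""
    combined = 0
    count = 0
    for row in rows:
        canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
        row_hash = hashlib.sha256(canonical.encode("utf-8")).digest()
        combined ^= int.from_bytes(row_hash, "big")  # XOR is order-independent
        count += 1
    return count, f"{combined:064x}"


old_rows = [{"id": "a", "points": 1}, {"id": "b", "points": 2}]
new_rows = [{"points": 2, "id": "b"}, {"id": "a", "points": 1}]  # different order
drifted = [{"id": "a", "points": 1}, {"id": "b", "points": 3}]

in_sync = table_digest(old_rows) == table_digest(new_rows)
mismatch = table_digest(old_rows) != table_digest(drifted)
```

XOR alone would let a pair of identical duplicate rows cancel out, which is why the row count travels alongside the digest.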

Results and Takeaways

Kinesis Streams + Lambda enable live, high-TPS migrations when tuned for throughput and ordering.
Redis caching ensures idempotency and prevents double writes under dual-write conditions.
Feature-flag rollouts provide control and observability for safe cutovers.
Decoupling mismatched systems is often cheaper and safer than scaling them together.

“Scaling isn’t always about adding resources, sometimes it’s about removing coupling.”

After the final cutover, metrics flatlined exactly where they should, and I finally took that long-delayed vacation.

Technologies Referenced

AWS Lambda — Serverless compute service for running backend logic without managing servers.

Amazon DynamoDB — Fully managed NoSQL database optimized for high-throughput workloads.

Amazon Kinesis Data Streams — Real-time event streaming service used for data replication and ingestion.

AWS CloudWatch — Metrics and observability platform for monitoring throughput, latency, and iterator age.

Amazon S3 — Object storage service used for snapshot exports and imports.

Redis — In-memory cache used here for idempotency checks during dual-writes and stream replay.

Terraform — Infrastructure-as-code tool for provisioning and managing AWS resources.

gRPC — High-performance RPC framework for service-to-service communication.

Protocol Buffers (Protobuf) — Serialization format used with gRPC to define and enforce API contracts.

Feature Toggles / Flags — Technique for gradual rollouts and safe cutovers.

Go Concurrency: Goroutines — Lightweight thread mechanism for concurrent workloads.

Go Channels — Synchronization and communication primitive used for concurrent fan-out/fan-in patterns.

Written by Stephanie Dover, Software Engineer 10+ YOE, ex GitHub, Twitch, Microsoft. Creator of Klaussy.

LinkedIn · GitHub · Klaussy Desktop · Klaussy Agents

Why I scrub AI prose with regex, not a second LLM

Stephanie Dover — Sat, 20 Jun 2026 00:11:10 +0000

TL;DR

klaussy-agents is a free, MIT-licensed CLI (pip install klaussy-agents) that makes the prose an AI coding agent writes (PR comments, review notes, commit messages) read like a person wrote them. It works in two layers: a humanization spec baked into the agent's skills so it writes clean prose up front, and a deterministic klaussy humanize pass that scrubs the output afterward. The scrubber is rule-based regex, not an LLM, and it never touches code. There's also a part I didn't expect going in: once the AI tells are gone, what's left can read curt and run long, so the spec also handles tone (don't be rude) and length (one sentence for a reply, one to five for a review comment). Repo: github.com/steph-dove/klaussy-agents.

The problem

You can spot AI-written text now. Everyone can. And the place it grates most is a code review comment or a commit message, where the prose sits next to your name in a thread your teammates read.

The tells are consistent. The em-dash is the biggest one. Right behind it: filler openers like "It's worth noting that…" and "I wanted to point out that…", chatbot scaffolding like "Hope this helps!" and "Let me know if you have questions!", and stacked hedges like could potentially. An agent that leaves those in your PR reads like a bot, and people notice.

The obvious fix is to tell the model not to do it. Add "don't sound like AI" to the prompt and move on. That helps, inconsistently, and it regresses silently the moment you change the model or the prompt drifts. Editing every comment by hand works too, but it defeats the point of having an agent write them. I wanted something I could trust without rereading.

Why "just tell the model" wasn't enough

The honest answer to "why not just prompt for it" is: a prompt asks, it doesn't enforce. The model tries to comply. Sometimes it complies fully, sometimes it slips an em-dash back in on a longer comment, and you don't find out until the tell is already in the thread. Prompt compliance is soft by nature: you can make it better, never guaranteed.

So I stopped treating the prompt as the whole answer and treated it as the first of two layers.

The first layer is prompt-side. There's a single shared humanization block, internally HUMANIZE_BLOCK, that gets substituted into every prose-output skill: review, pr, commit, explain, across all five of the agents klaussy-agents ships. The rules in it are plain: no em or en dashes, no filler openers, no chatbot scaffolding, tighten hedges, no emoji, no "Certainly", vary sentence shape, and never reword code. One spec, applied everywhere, so the agent isn't given conflicting instructions in different places.

The second layer is the part that actually enforces. After the agent writes, the text goes through klaussy humanize, a deterministic pass that makes the high-confidence edits regardless of how well the model followed instructions. The prompt asks; the scrubber enforces. Neither layer alone is enough, which is the whole reason there are two.

Why deterministic, and not a second model

The tempting design is to run the agent's prose through another LLM with a "rewrite this to sound human" instruction. That's how most "AI humanizer" tools on the market work. I deliberately didn't.

A rewrite model can fix awkward phrasing a regex never will. But it can also paraphrase something into being subtly wrong, and over technical prose that's a real risk: it can rewrite an identifier, mangle a command, or "improve" an example until it no longer runs. For text that ends up in a PR comment a teammate will act on, I didn't want a process that could introduce a new mistake while removing an em-dash.

So the scrubber is rule-based regex. The consequences of that choice are all upsides for this job: it's fast, it's free, it runs offline with no network call, and because it only ever does a fixed set of high-confidence substitutions, it can't introduce a new error. It will not restructure a sentence or paraphrase a paragraph. It does a small, reliable set of edits and stops.

How it works

The edits it makes

The scrubber does a short list of conservative transforms. Each one is a tell with a known, safe fix:

Dashes. Em-dashes (—) and en-dashes (–) in prose become commas or hyphens.
Filler openers. Sentence-initial filler ("It's worth noting that", "I noticed that", "Please note that", and similar) is stripped, and the next word is re-capitalized so the sentence still reads correctly.
Chatbot scaffolding. Trailing lines like "Let me know if…", "Hope this helps", "Feel free to…" are dropped.
Verbose phrasings. A few get tightened: in order to becomes to, could potentially becomes could.

Here's the before and after on the exact cases the test suite covers:

Input	Output
`Leaks a connection — wrap it.`	`Leaks a connection, wrap it.`
`range 1–5 here`	`range 1 - 5 here`
`It's worth noting that the handler swallows the error.`	`The handler swallows the error.`
`This races on startup.` `Let me know if you have questions!`	`This races on startup.`
`Refactor in order to avoid the N+1.`	`Refactor to avoid the N+1.`
`This could potentially deadlock.`	`This could deadlock.`

Every one of those is a rule a human editor would apply without thinking. None of them changes what the comment means. That's the bar for inclusion: if an edit could change meaning, it doesn't make the list.
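
To make the shape of such a pass concrete, here is a small sketch that reproduces the six table rows with the standard library. It is not klaussy's actual rule set: the pattern names, the three-item opener list, and the function `scrub_prose` are assumptions chosen to match the examples above.

```python
import re

EM_DASH = re.compile(r"\s*\u2014\s*")  # em-dash plus surrounding spaces
EN_DASH = re.compile(r"\s*\u2013\s*")  # en-dash plus surrounding spaces
OPENER = re.compile(
    r"(?:^|(?<=[.!?] ))"
    r"(?:It's worth noting that|I noticed that|Please note that) +(\w)",
    re.MULTILINE,
)
SCAFFOLDING = re.compile(
    r"\s*(?:Let me know if|Hope this helps|Feel free to)[^.!?\n]*[.!?]?\s*$"
)
TIGHTEN = [
    (re.compile(r"\bin order to\b"), "to"),
    (re.compile(r"\bcould potentially\b"), "could"),
]


def scrub_prose(text: str) -> str:
    """Apply a fixed list of substitutions; never restructure a sentence."""
    text = EM_DASH.sub(", ", text)
    text = EN_DASH.sub(" - ", text)
    # Strip the opener and re-capitalize the word that follows it.
    text = OPENER.sub(lambda m: m.group(1).upper(), text)
    # Drop a trailing scaffolding sentence, if any.
    text = SCAFFOLDING.sub("", text)
    for pattern, replacement in TIGHTEN:
        text = pattern.sub(replacement, text)
    return text


cleaned = scrub_prose("It's worth noting that the handler swallows the error.")
```

Every rule is a plain substitution with a fixed replacement, which is what keeps the pass predictable.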

Never touching code

The single biggest risk in running any text transform over developer prose is that it reaches into a code example and breaks it. A "humanizer" that turns a dash inside a shell command into a comma has made your example wrong.

The scrubber avoids this structurally. It splits the input on fenced blocks and on inline code, scrubs only the prose segments, and leaves every code segment byte-for-byte untouched. The dashes in a command stay dashes. The identifiers stay identifiers.

Concretely, given a string that mixes prose and code like Use `a — b` then:, followed by a fenced block containing x — y, the dashes inside the inline span and inside the fence are kept exactly as written. Only a dash out in the prose would be normalized. Code in, code out, unchanged.
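
The split-then-scrub structure can be sketched in a few lines. The regex, the function name `scrub_outside_code`, and the single dash rule are assumptions for illustration; the real scrubber's parsing may differ.

```python
import re

# A fenced block (three backticks ... three backticks) or an inline code span.
CODE_SPAN = re.compile(r"(`{3}.*?`{3}|`[^`\n]*`)", re.DOTALL)


def scrub_outside_code(text: str, scrub) -> str:
    """Apply scrub() to prose segments only; code segments pass through."""
    parts = CODE_SPAN.split(text)
    # With one capturing group, re.split alternates prose (even) and code (odd).
    return "".join(
        part if index % 2 else scrub(part) for index, part in enumerate(parts)
    )


def dash_to_comma(prose: str) -> str:
    return re.sub(r"\s*\u2014\s*", ", ", prose)


fence = "`" * 3
sample = f"Use `a \u2014 b` then \u2014 run:\n{fence}\nx \u2014 y\n{fence}\n"
result = scrub_outside_code(sample, dash_to_comma)
```

Only the dash between "then" and "run" is rewritten. The one inside the inline span and the one inside the fence come back unchanged.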

The interface

There are two ways to use it. As a library:

from klaussy.humanize import humanize

clean = humanize("It's worth noting that this races, fix it.")
# -> "This races, fix it."

humanize(text: str) -> str. Non-string input passes through unchanged.

As a CLI, it's built to drop into a pipe or a CI gate:

# stdin to stdout
printf '%s' "$comment" | klaussy humanize

# rewrite a file in place
klaussy humanize NOTES.md --write

# CI gate: exit 1 if the file would change
klaussy humanize NOTES.md --check

The --check mode is the one I lean on most. It turns "did an AI tell slip into this doc" into a check that fails a pull request instead of a thing someone has to catch by eye.

One spec, shared across products

The rules in both layers, the prompt-side block and the CLI, are a faithful port of one source: humanize-comment.js from the klaussy desktop codebase. (That desktop app is a separate product; this is the open-source klaussy-agents package. That's the only time I'll mention it.) Porting from one canonical implementation means the prompt rules and the scrubber rules don't drift apart over time, and any pipeline, CI, the desktop app, or your own scripts, can pipe through the same behavior.

Comment hygiene, not just prose

The same instinct shows up one layer down, in code review. Beyond the prose tells, the generated review skill flags excessive or narrating comments in code, the kind that restate what the line does or read like a changelog, and the commit guard blocks committed commented-out code via ruff --select ERA. The judgment-heavy part lives in the skill where a model can weigh context; the deterministic part lives in the hook. Same division of labor as the two prose layers: ask where you need judgment, enforce where you can be certain.

Clean isn't the same as kind, or short

Removing the tells exposed a second problem I didn't plan for: stripping the filler also strips the softening. Scrub "It's worth noting that this could potentially swallow the error, you may want to wrap it" down to its substance and you get "This swallows the error. Wrap it." That's clean, and on a PR thread it's also a little cold, and cold reads as curt. Review comments are the worst case, because they land on a person's work. A real one, lightly defanged: "Personally I don't find these unit tests useful, because you are mocking everything." The tells aren't the problem there; the framing is.

So humanizing can't be purely subtractive. After removing what makes prose sound like a machine, the spec adds back what makes it sound like a considerate human, split the same two ways as everything else.

Prompt-side, a civility floor: critique the work, never the person; prefer a question over a flat verdict; keep it a light touch, not filler praise. It's a floor, not forced warmth, so a review you asked to be blunt stays blunt, it just can't tip into insulting. A second rule covers replies: read the comment you're answering for substance but not temperature, and neutralize its rudeness before drafting, so a hostile thread doesn't prime a hostile reply. And a brevity rule with actual numbers instead of "be concise": a thread reply aims for one sentence, a single review comment for one to five, anything longer gets cut, not summarized.

Deterministic-side, the scrubber's opener list grew to catch the editorializing lead-ins that prime a dismissive read, "Personally," "Honestly," "Frankly," "IMO," "In my opinion," "If you ask me," so Personally I don't find these useful. becomes I don't find these useful. with the same guarantee as the rest, and still never inside code.

The honest caveat: tone and brevity are mostly judgment, so most of it lives in the soft prompt layer; only the openers are guaranteed.

A quick demo

The clearest way to see it is to pipe a comment that has several tells stacked up:

printf '%s' "It's worth noting that this handler swallows the error — wrap it. Hope this helps!" \
 | klaussy humanize

The opener gets stripped, the next word re-capitalized, the em-dash becomes a comma, and the trailing scaffolding line is dropped. What comes out reads like a note an engineer left, because the things that made it read like a bot are gone and nothing else was touched.

The whole pass is pure standard library, no network, no model. The test suite covers every transform above plus the code-preservation cases, ported from the desktop test suite; 137 tests pass overall.

What's next, and where the line is

The deliberate limit is the headline tradeoff: deterministic means limited. A regex scrubber catches the reliable tells, but it will not rewrite genuinely awkward phrasing the way an LLM could. That's the trade I chose on purpose: the same property that makes it safe, fast, and incapable of introducing an error is the property that keeps it from doing deeper rewrites. If you want paraphrasing, this isn't that tool, and it's not trying to be.

A couple of other honest limits:

The prompt layer is soft. The {{HUMANIZE}} block depends on the model following instructions. Two layers because neither alone is enough.
It's opinionated. Some people like em-dashes. The scrubber normalizes them, and that's a stance. It's also open code, so if you disagree, the rules are right there to edit.
Scope is prose, not code. By design it won't touch anything inside a code block. The flip side is that it won't fix a tell living inside a code comment unless you wire it to do so.

None of these are things I'm hiding. They're the consequences of picking "safe and predictable" over "clever and risky" for text that goes in front of your team.

Try it

pip install klaussy-agents
printf '%s' "It's worth noting that this races, fix it. Hope this helps!" | klaussy humanize

Repo and docs: github.com/steph-dove/klaussy-agents

If you've got an AI tell that drives you up the wall and the scrubber doesn't catch it yet, open an issue with the before/after. That's exactly the kind of case I want to see.

Building repo conventions aware coding agents

Stephanie Dover — Thu, 18 Jun 2026 01:51:18 +0000

TL;DR

I built klaussy-agents, an open-source CLI (pip install klaussy-agents) that reads your repo's conventions once and scaffolds repo-aware skills for five coding agents: Claude Code, Gemini CLI, Cursor, Codex, and GitHub Copilot. The interesting part is the adaptation layer: one Claude-authored SKILL.md gets rewritten into each agent's native form, including how sub-agent and plan-mode wording maps to each agent's own primitive. It's a scaffolder, not a runtime, it's at v0.3.2, and the per-agent translation has real seams. Repo: github.com/steph-dove/klaussy-agents.

The problem

Each AI coding agent carries repo context its own way. Claude Code reads CLAUDE.md, Gemini CLI reads GEMINI.md, Codex reads AGENTS.md, Cursor reads .cursor/rules, and Copilot reads .github/copilot-instructions.md. On top of that, each one has its own folder of reusable "skills" or "commands." So the conventions you write for one agent, and any review or plan workflow you tune in it, don't transfer to the others. Point a second agent at the same repo and it starts from zero, because it's reading a file the first agent never wrote.

You can keep all five context files and skill folders in sync by hand, but that's busywork that rots the first time one gets edited and the rest don't. The practical outcome is that whichever agent you open is the one that knows the least about your repo, unless you've redone the setup five times.

What makes a real fix possible is that all five agents now read the same open Agent Skills SKILL.md format. One folder format, genuinely portable. So a tool can discover the repo's conventions once and emit skills, adapted, into every agent's native layout. This post is about the adaptation step, because that's where it stops being a clean copy-paste.

Why hand-rolling five configs didn't scale for me

To be clear about what the alternatives actually are: hand-maintaining per-agent config works fine when you use one agent. A CLAUDE.md plus a couple of .claude/skills/ is a perfectly good setup, and if your whole team is on one tool, you don't need any of this.

Generic scaffolders (cookiecutter and friends) are great for stamping out project structure, but they don't know what a sub-agent is or how Cursor scopes a rule versus how Copilot does. The gap is specifically the multi-agent one: the same review workflow, expressed five different ways, kept in sync forever. This is a translation problem, not a templating problem. The SKILL.md spec provides a shared source format; the work klaussy does is the discovery, the per-agent adaptation, and the native wiring so I'm not hand-maintaining five copies of the same skill.

The honest framing is: the spec made portability possible. klaussy does the boring, error-prone part on top of it.

The approach

klaussy init does the whole thing in one pass and defaults to all five agents. Under the hood it runs a repo-conventions discovery step once, produces a project-wide CLAUDE.md plus path-scoped .claude/rules/*.md, then translates that into each agent's native form. Path-scoped rules map onto each agent's own scoping mechanism: Copilot's .github/instructions/*.instructions.md with applyTo: frontmatter, Cursor's .cursor/rules/*.mdc with globs: frontmatter, and so on.
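
As a rough picture of that mapping, the sketch below renders one path-scoped rule into the two layouts named above. The function `render_scoped_rule`, the `name` argument, and the exact frontmatter quoting are assumptions; only the file locations and the `applyTo:` / `globs:` keys come from the text.

```python
FRONTMATTER_KEY = {"copilot": "applyTo", "cursor": "globs"}
PATH_TEMPLATE = {
    "copilot": ".github/instructions/{name}.instructions.md",
    "cursor": ".cursor/rules/{name}.mdc",
}


def render_scoped_rule(agent, name, glob, body):
    """Return (path, contents) for one path-scoped rule in the agent's layout."""
    key = FRONTMATTER_KEY[agent]
    path = PATH_TEMPLATE[agent].format(name=name)
    # Quoting style is a guess; check the emitted files for the real format.
    contents = f'---\n{key}: "{glob}"\n---\n\n{body}\n'
    return path, contents


copilot_path, copilot_rule = render_scoped_rule(
    "copilot", "api", "src/api/**", "Handlers return typed errors."
)
cursor_path, cursor_rule = render_scoped_rule(
    "cursor", "api", "src/api/**", "Handlers return typed errors."
)
```

The rule body is identical in both outputs; only the path and the frontmatter key change per agent.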

The same one-discovery-then-translate shape applies to skills. klaussy ships 11 namespaced workflow skills, written once in Claude Code's syntax, and writes them into each agent's dedicated skills directory. The skills are namespaced <repo>-<skill> so they don't collide across repos when an agent has several checked out.

pip install klaussy-agents
klaussy init                     # prompts for base branch, scaffolds all five agents
klaussy init --agents claude,cursor   # narrow to a subset

You can also run individual steps (klaussy skills, settings, hooks, github, and so on) if you only want part of it. The rest of this post zooms in on the skills step, because the adaptation it does is the part I'd actually want to read about.

How it works

One authored skill, five target dialects

The skills are authored in Claude Code's syntax. That means they lean on a few Claude-specific constructs: `!` dynamic-shell blocks that run a command and inline its output, parallel sub-agents invoked through the Agent tool with a subagent_type, and ExitPlanMode for plan mode. None of those tokens mean anything to Gemini or Codex verbatim.

So klaussy doesn't copy the body across. It rewrites each body to capture the same intent in the target agent's terms. Three concrete rewrites happen:

Dynamic shell blocks become explicit instructions. A `!` block that silently runs `git diff` and inlines the result is rewritten into a plain "run this command and use its output" instruction the other agent will actually follow.
Path references get retargeted. A skill that points at .claude/skills/... is rewritten to the target agent's own skills directory, so cross-skill references don't dangle.
Sub-agent and plan-mode orchestration gets an adaptation note. This is the subtle one, and it's the next section.

These directories are the targets, exactly:

.claude/skills/     # Claude Code
.gemini/skills/     # Gemini CLI
.cursor/skills/     # Cursor
.agents/skills/     # Codex (neutral .agents/ path)
.github/skills/     # GitHub Copilot
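
The path retargeting and the `<repo>-<skill>` namespacing can be sketched as below. The directory values come from the list above; the dictionary keys for Codex and Copilot, and the helper names `namespaced` and `retarget`, are assumptions.

```python
SKILLS_DIR = {
    "claude": ".claude/skills/",
    "gemini": ".gemini/skills/",
    "cursor": ".cursor/skills/",
    "codex": ".agents/skills/",
    "copilot": ".github/skills/",
}


def namespaced(repo: str, skill: str) -> str:
    """Prefix the skill with the repo name so skills don't collide across repos."""
    return f"{repo}-{skill}"


def retarget(body: str, agent: str) -> str:
    """Point .claude/skills/ references at the target agent's directory."""
    return body.replace(SKILLS_DIR["claude"], SKILLS_DIR[agent])


plan_skill = namespaced("myrepo", "plan")
body = f"See .claude/skills/{plan_skill}/SKILL.md before reviewing."
codex_body = retarget(body, "codex")
```

A plain string replace is enough for this sketch; a real pass would also need to avoid rewriting paths that appear inside example output.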

Sub-agents and plan mode

Several of the skills orchestrate parallel sub-agents. The review skill, for instance, fans out separate lenses (correctness, architecture, security, scope) and runs them concurrently. In Claude Code that's the Agent tool with a subagent_type.

Agents expose parallel sub-agents differently. Most have their own model-invocable sub-agent tool under a different name, and a given agent or setup might not expose one at all:

Claude Code: Agent / subagent_type
Cursor: Task (GA)
Codex: spawn_agent (GA)
Gemini CLI: subagents (default-on, toggled under experimental)
GitHub Copilot: task / read_agent (plus an experimental context: fork)

So instead of hardcoding sequential, the adaptation note tells the target agent to map Claude's sub-agent wording onto its own equivalent primitive, and to fall back to sequential execution only if it genuinely has none. The note carries intent ("these lenses are independent, run them in parallel"), not Claude-specific tool names. The same goes for ExitPlanMode: the note describes the plan-then-confirm intent rather than naming a tool that only Claude has.

# Illustrative — exact wording is generated per skill, confirm in the emitted SKILL.md
# Adaptation note (appended when a skill orchestrates sub-agents):
# This skill runs independent lenses in parallel. Use your own
# parallel sub-agent tool (e.g. Task / spawn_agent / subagents / task).
# Only run them sequentially if you have no sub-agent tool.

This is the part I find genuinely teachable: portability across agents isn't "find and replace the tool name." It's separating intent from primitive, then letting each agent rebind the intent to whatever primitive it owns. The SKILL.md spec gives you a shared container; it doesn't give you shared tools, so the body has to be written to degrade gracefully.
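
A toy version of the idea: detect that a skill body uses Claude's sub-agent wording, then append a note that states the intent without naming a tool. The detection rule (looking for `subagent_type`) and both function names are assumptions, not how klaussy decides.

```python
def adaptation_note(lenses):
    """Describe the intent (independent, parallel) without naming a tool."""
    names = ", ".join(lenses)
    return (
        f"Adaptation note: the lenses ({names}) are independent. "
        "Run them in parallel with your own sub-agent tool. "
        "Only run them sequentially if you have no sub-agent tool."
    )


def adapt(body: str, lenses) -> str:
    """Append the note only when the body orchestrates sub-agents."""
    if "subagent_type" not in body:
        return body
    return body + "\n\n" + adaptation_note(lenses)


review = "Launch each lens via the Agent tool with a subagent_type."
adapted = adapt(review, ["correctness", "security"])
untouched = adapt("Summarize the diff.", ["correctness"])
```

Because the note names no Claude-specific tool, each target agent is free to bind it to whatever primitive it has.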

A real skill, so this isn't abstract

To make "what a generated skill actually does" concrete: the review skill triages by diff size, then runs parallel lenses. Beyond the four standard lenses it adds an Agentic/Evals lens when the change touches AI code, and an Architecture-Decision/Design-Doc lens when the PR contains an ADR, RFC, or design doc. It's precision-biased: an empty review (nothing worth flagging) is a valid outcome. Every finding has to name a concrete trigger, and a final validation phase self-refutes false positives before anything is reported.

That whole workflow is authored once and adapted into all five agents' skills folders. The lens fan-out is exactly the piece that needs the sub-agent translation above; the precision-bias and validation phases are plain prose and carry across unchanged.

Permissions and hooks ride along

The skills step is the headline, but klaussy init also writes native permission allow-lists per agent (Claude settings.json allow/deny, Gemini settings.json tools.allowed, Cursor permissions.json terminalAllowlist, Codex config.toml approval/sandbox) and tries to keep secrets like .env, *.pem, and credentials* out of each agent's reach using that agent's own mechanism.

It also wires two cross-agent hooks: a git-commit guard that runs your detected format and lint before a commit, and a read-injection guard that scans file and fetch content for prompt-injection markers. The guard scripts pull the command or path out of whatever agent's hook payload they're handed and block with exit 2 plus stderr, which every supported agent honors. They're pure-stdlib and hardened so any parse error falls back to allow rather than crashing. The current suite is 130 tests passing, ruff clean.
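
The commit guard's decision logic might look roughly like this. The payload keys (`tool_input`, `command`) and the function `decide` are assumptions about the hook payload shape; the exit code 2 and the allow-on-parse-error behavior are the parts taken from the description above.

```python
import json


def decide(raw_payload: str, checks_pass):
    """Return (exit_code, stderr_message) for a hook payload."""
    try:
        payload = json.loads(raw_payload)
        tool_input = payload.get("tool_input") or {}
        command = tool_input.get("command") or payload.get("command") or ""
    except (ValueError, AttributeError, TypeError):
        return 0, ""  # unparseable payload: fall back to allow
    if "git commit" not in command:
        return 0, ""  # not a commit, nothing to guard
    if checks_pass():
        return 0, ""
    return 2, "commit blocked: format or lint failed"


blocked = decide('{"tool_input": {"command": "git commit -m wip"}}', lambda: False)
allowed = decide("not json", lambda: False)
# A real hook script would write the message to stderr and exit with the code.
```

Returning the code and message instead of exiting keeps the logic testable without spawning a process.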

A quick demo

Point it at a repo and run it:

pip install klaussy-agents
klaussy init

It prompts for the base branch, then scaffolds conventions files, skills, permissions, and hooks for all five agents. If you only want a subset, name them:

klaussy init --agents claude,cursor

After that, opening any of the configured agents in the repo means it already has the conventions file it looks for and the namespaced skills in its own directory. You're still installing and running the agents yourself; klaussy generates the files they read.

What's next, and where the seams are

A deep-dive owes you the honest version, so here's what doesn't fully reach:

The skills are Claude-authored then adapted, not hand-tuned per agent. The adaptation captures intent, and the sub-agent note tells each agent to use its own primitive, but it's a translation, not five bespoke skill sets written from scratch for each agent's quirks. If you want a skill perfectly idiomatic to Codex, you may still tweak it by hand.
Hook coverage is uneven, by design. The read-injection guard is wired for Claude, Gemini, and Cursor only. Codex exposes no pre-file-read hook event, and Copilot's preToolUse is fail-closed with unconfirmed read-tool args, so those two get the commit guard only. klaussy logs that rather than pretending the guard is everywhere.
Secret exclusion isn't universally possible. Codex's sandbox governs writes and network, not reads, so there's no read-exclusion there; Copilot content-exclusion is a GitHub setting, not a committed file. klaussy says so instead of faking a .ignore that wouldn't do anything.
It leans on the SKILL.md spec being honored. Portability is real today, but the spec and the agents are young. If an agent changes how it reads skills, klaussy has to track it.

None of this means the agents are "equal." They're not. The whole point of the adaptation layer is that the agents differ and the skill set has to bend to each one. klaussy is a scaffolder: it generates files the agents read, it does not run the agents or change their model quality.

(One disambiguation: there's a separate, paid klaussy desktop app. Different product, same developer. This post is only about the open-source klaussy-agents CLI.)

Try it

Repo: github.com/steph-dove/klaussy-agents
Install: pip install klaussy-agents


If you run more than one coding agent in the same repo, I'd like to know which skill survived the translation cleanly and which one came out awkward in your agent of choice. Open an issue or drop a comment with what broke.

How I made one desktop app drive four AI coding agent CLIs

Stephanie Dover — Mon, 08 Jun 2026 02:45:26 +0000

TL;DR

I built Klaussy, a desktop app that runs AI coding-agent CLIs in parallel across git worktrees and pairs them with a GitHub PR review surface. The v0.3.0 release (out June 5) replaced its hard dependency on Claude Code with a provider registry, so it now drives Claude Code, OpenAI Codex, Google Gemini, or GitHub Copilot — your pick per task. This post covers how the registry works, the side-by-side terminal model, and where the deeper AI features are still uneven across agents. It's closed-source and in beta. Site: klaussy.com.

The problem

Until a few weeks ago, Klaussy had one ugly constraint baked into its core: it only worked if you used Claude Code. The early-access docs said it outright — if you didn't run Claude Code, there was nothing in the app for you.

That was fine for the first users, who mostly did run Claude Code. But it ruled out a large chunk of the people the app is actually for: engineers who already use an agent CLI daily, have two or three tasks in flight at once, and want a structured way to run them without juggling branches in a single clone. A lot of those people had standardized on Codex, or Gemini, or Copilot — often for cost, procurement, or data-handling reasons that had nothing to do with the tool's quality. For them, Klaussy was "the Claude thing," and that was the end of the conversation.

The pain Klaussy targets isn't agent-specific. It's the workflow around the agent: starting task B while the agent grinds on task A, hopping between the terminal (where the agent lives) and the browser (where PR review lives), and triaging a CI failure across GitHub, the terminal, and the editor. None of that cares which agent you run. So hard-wiring one agent into the foundation was a self-inflicted limit. v0.3.0 is the release that removed it.

Why a single-agent design was the wrong foundation

The original architecture made the easy assumption: there's one agent CLI, so call it directly. Spawn claude, parse its output, wire its session resume into the terminal manager. Every AI surface in the app — the interactive terminal, the PR-review actions, the CI-failure debugger — reached for Claude by name.

That works right up until you want a second agent. Then every one of those call sites is a place that knows too much. The four agent CLIs differ in obvious ways (@anthropic-ai/claude-code vs @openai/codex vs @google/gemini-cli vs @github/copilot) and in annoying ones: different model-selection flags, different session-resume mechanics, different output streams to parse, different auth quirks. You can't paper over that with a single if agent == "codex" branch sprinkled everywhere — you end up with the same conditional copied across a dozen files, and adding a fifth agent later means finding all of them again.

The terminal multiplexers people would otherwise reach for (tmux, zellij) don't help here either. They'll give you N shells in one window, but they don't know what a git worktree is, what an agent session is, or what state a PR review is in. Klaussy ties each terminal to a specific agent instance, a branch, and a worktree — so the abstraction it needed was a clean seam between "which agent" and "what we're asking the agent to do."

The approach

The fix was a provider registry: one module that owns everything agent-specific, and a rule that nothing outside it hard-codes an agent. Each provider declares its npm package, how to launch it, which models it exposes, and how its output should be parsed. The rest of the app asks the registry for "the current agent" and works against that interface.

On top of the registry sits a small amount of UI: a global default agent you set once, plus a per-action picker so you can override it for a single task. Set Gemini as your default and every agent action follows it, persisting across restarts; reach for Codex on one specific worktree without changing the global setting. The orchestration layer above — parallel worktrees, one task per terminal, the PR review surface — didn't change. It just stopped caring which agent was underneath.
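The default-plus-override rule can be sketched in a few lines. This is a minimal illustration, not Klaussy's actual code: `createAgentSettings`, `setDefault`, and `resolve` are hypothetical names, the plain `store` object stands in for whatever persistence the app uses, and falling back to Claude when no default is set is an assumption.

```javascript
// Hypothetical sketch: names, storage, and fallback are assumptions.
const KNOWN_AGENTS = ["claude", "codex", "gemini", "copilot"];

function assertKnown(agent) {
  if (!KNOWN_AGENTS.includes(agent)) {
    throw new Error(`unknown agent: ${agent}`);
  }
}

function createAgentSettings(store = {}) {
  return {
    // Set the global default once. Writing it to the store is what would
    // let it survive a restart; here the store is just an object.
    setDefault(agent) {
      assertKnown(agent);
      store.defaultAgent = agent;
    },
    // A per-action override wins for that one call and never touches
    // the stored default.
    resolve(override) {
      if (override !== undefined) {
        assertKnown(override);
        return override;
      }
      return store.defaultAgent ?? "claude"; // assumed fallback
    },
  };
}

const settings = createAgentSettings();
settings.setDefault("gemini");
console.log(settings.resolve());        // uses the global default
console.log(settings.resolve("codex")); // one-off override
console.log(settings.resolve());        // default is unchanged
```

The point of keeping `resolve` free of side effects is that an override on one worktree cannot leak into the next action.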

The honest version of this story is that the registry isn't uniformly deep yet. Phase 1 nailed the parts every agent shares — launching, switching, resuming, running two side by side. The parts that require parsing each agent's particular output stream are mature on Claude and still being verified on the other three. More on that below, because it's the most important caveat in the release.

How it works

The provider registry

Every AI surface in the app now routes through the registry instead of calling an agent by name. A provider entry knows its npm package, its launch command, and its model list. Adding agent number five means adding one entry, not editing a dozen call sites.

// Illustrative — confirm exact API in docs.
// Conceptual shape of the provider registry (main/state/ai-providers.js).
const PROVIDERS = {
  claude:  { pkg: "@anthropic-ai/claude-code", models: ["opus", "sonnet", "haiku"] },
  codex:   { pkg: "@openai/codex",             models: ["gpt-5.5", "gpt-5.4-mini"] },
  gemini:  { pkg: "@google/gemini-cli",        models: ["2.5-flash", "3-pro"] },
  copilot: { pkg: "@github/copilot",           models: ["default"] },
};

Model selection is verified for three of the four. Claude takes --model aliases (opus/sonnet/haiku), Codex exposes gpt-5.5 and gpt-5.4-mini, and Gemini offers its 2.5/3 flash and pro tiers. Copilot is Default-model-only in v0.3.0 — its model slugs aren't verified yet, so the picker doesn't pretend to offer a choice it can't honor. That's a deliberate "show what's real" call rather than a missing feature dressed up as one.
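Here is one way the "show what's real" rule could look against the illustrative registry shape above. The helper names `modelChoices` and `showsModelPicker` are hypothetical, and treating a lone "default" entry as "no real choice" is an assumption about how the picker decides.

```javascript
// Same illustrative registry shape as in the post; not the shipped module.
const PROVIDERS = {
  claude:  { pkg: "@anthropic-ai/claude-code", models: ["opus", "sonnet", "haiku"] },
  codex:   { pkg: "@openai/codex",             models: ["gpt-5.5", "gpt-5.4-mini"] },
  gemini:  { pkg: "@google/gemini-cli",        models: ["2.5-flash", "3-pro"] },
  copilot: { pkg: "@github/copilot",           models: ["default"] },
};

// Returns only the models the registry lists for an agent.
function modelChoices(agent) {
  const provider = PROVIDERS[agent];
  if (!provider) throw new Error(`unknown agent: ${agent}`);
  return provider.models;
}

// Assumption: a picker is shown only when there is a real choice to make.
function showsModelPicker(agent) {
  const models = modelChoices(agent);
  return !(models.length === 1 && models[0] === "default");
}

for (const agent of Object.keys(PROVIDERS)) {
  console.log(agent, showsModelPicker(agent) ? modelChoices(agent).join(", ") : "default only");
}
```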

Parallel worktrees, one window

Each task runs in its own git worktree, with its own pseudo-terminal via node-pty, surfaced in the same window as columns, a grid, or a single pane. This is the part that replaces the git worktree + tmux + a handful of gh aliases that an engineer would otherwise script themselves. The agent for each terminal is whatever the registry hands back, so a column running Claude and a column running Gemini coexist without either knowing about the other.
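To make the one-task-per-worktree idea concrete, here is a small sketch that plans the `git worktree add` call for each task. `planWorktree`, the task fields, and the sibling-directory naming are hypothetical. Only the git invocation (`git worktree add -b <branch> <path> <base>`) is standard. The terminal step is left as a comment because the post doesn't show how Klaussy calls node-pty.

```javascript
// Hypothetical helper: turns a task into the git arguments that would
// create its worktree. Directory naming is an assumption for illustration.
function planWorktree(repoRoot, task) {
  const dir = `${repoRoot}-${task.slug}`; // sibling directory next to the clone
  return {
    agent: task.agent,
    cwd: dir,
    // Creates a new branch from `base`, checked out in its own directory.
    gitArgs: ["worktree", "add", "-b", task.branch, dir, task.base],
  };
}

const tasks = [
  { slug: "fix-login", branch: "fix/login", base: "main", agent: "claude" },
  { slug: "add-cache", branch: "feat/cache", base: "main", agent: "gemini" },
];

for (const task of tasks) {
  const plan = planWorktree("/work/app", task);
  console.log(`git ${plan.gitArgs.join(" ")}  (agent: ${plan.agent})`);
  // In the app, a pseudo-terminal would then be started with its working
  // directory set to plan.cwd, running whichever agent the registry returns.
}
```

Because each plan carries its own directory and agent, two columns never share a working tree or need to know about each other.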

Running the same work in two agents at once

Because the registry decouples agent from task, the worktree Actions dropdown can spawn a sibling task in the same worktree on a different agent. You can hand the same change to two agents side by side and compare what they do — useful when you're still forming an opinion about which agent is better at a given kind of work.

One sharp edge worth naming: running two Codex sessions concurrently can invalidate each other's rotating OAuth tokens. Klaussy warns you before it starts a second concurrent Codex session rather than letting it silently break. That comes from Codex's auth model, not from a Klaussy choice, but it's the kind of thing the orchestration layer has to know about, which is exactly why the registry exists.
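A check like that could live next to the provider data. The following is a rough sketch under assumptions: `CONCURRENCY_NOTES` and `warningBeforeSpawn` are hypothetical names, and the session objects are reduced to an `agent` field.

```javascript
// Assumption: per-agent notes for agents whose auth breaks under
// concurrent sessions. Only Codex is called out in the post.
const CONCURRENCY_NOTES = {
  codex: "Concurrent Codex sessions can invalidate each other's rotating OAuth tokens.",
};

// Returns a warning string if starting another session for `agent` is
// risky given what is already running, or null if there is nothing to say.
function warningBeforeSpawn(agent, activeSessions) {
  const note = CONCURRENCY_NOTES[agent];
  if (!note) return null;
  const alreadyRunning = activeSessions.some((s) => s.agent === agent);
  return alreadyRunning ? note : null;
}

const active = [{ agent: "codex" }, { agent: "claude" }];
console.log(warningBeforeSpawn("codex", active));  // warning text
console.log(warningBeforeSpawn("claude", active)); // null
console.log(warningBeforeSpawn("codex", []));      // null, first session is fine
```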

The PR review surface

Separately from the terminals, Klaussy renders a GitHub PR without a local checkout: Files, Conversation, Checks, and AI Review tabs. The inline review composer batches comments and submits them in one round trip, and per-finding state (Ignore / Add to PR / Implement / Investigate / Ask) persists across sessions. One click materializes a PR into a worktree plus a task when you do want it locally. There's a built-in Monaco editor with LSP diagnostics so you can edit and commit straight from the diff.
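As a rough illustration of batching, the sketch below keeps a state per finding and folds every "Add to PR" finding into a single review payload. The payload fields (`event`, `body`, and `comments` with `path`, `line`, `side`, `body`) follow GitHub's REST "create a review for a pull request" endpoint. Whether Klaussy uses that endpoint is an assumption, as are the function names and the shape of a finding.

```javascript
// The five per-finding states named in the post.
const FINDING_STATES = ["Ignore", "Add to PR", "Implement", "Investigate", "Ask"];

// Hypothetical helper: returns a copy of the finding with a validated state.
function setFindingState(finding, state) {
  if (!FINDING_STATES.includes(state)) {
    throw new Error(`unknown finding state: ${state}`);
  }
  return { ...finding, state };
}

// Collects every "Add to PR" finding into one review body so the whole
// batch can be submitted in a single request.
function buildReviewPayload(findings, summary) {
  const comments = findings
    .filter((f) => f.state === "Add to PR")
    .map((f) => ({ path: f.path, line: f.line, side: "RIGHT", body: f.body }));
  return { event: "COMMENT", body: summary, comments };
}

const findings = [
  setFindingState({ path: "src/auth.js", line: 42, body: "Token is never refreshed." }, "Add to PR"),
  setFindingState({ path: "src/cache.js", line: 7, body: "Possible stale read." }, "Investigate"),
];

console.log(JSON.stringify(buildReviewPayload(findings, "Automated review notes"), null, 2));
```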

This is where the maturity gap matters most, so I'll be plain about it: the review actions, the CI-failure auto-debug, Implement, and Ask are most battle-tested on Claude. The non-Claude output parsers for those headless surfaces are documented but still being verified in the shipped code. The interactive terminal, agent switching, resume, and side-by-side runs all work across all four agents today. The deep AI surfaces on Codex, Gemini, and Copilot are the path I trust least, and I'd rather you know that going in than discover it on a real PR.
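That split can be written down as data. This is a hypothetical sketch: the surface keys, the `maturity` helper, and the two labels are invented for illustration, and the table only restates what this post claims for v0.3.0.

```javascript
const ALL_AGENTS = ["claude", "codex", "gemini", "copilot"];

// Which agents each surface is described as solid on in this post.
const PROVEN_ON = {
  terminal: ALL_AGENTS,
  switching: ALL_AGENTS,
  resume: ALL_AGENTS,
  sideBySide: ALL_AGENTS,
  prReview: ["claude"],
  ciDebug: ["claude"],
  implement: ["claude"],
  ask: ["claude"],
};

// Returns "proven" or "being-verified" for a surface/agent pair.
function maturity(surface, agent) {
  const agents = PROVEN_ON[surface];
  if (!agents) throw new Error(`unknown surface: ${surface}`);
  if (!ALL_AGENTS.includes(agent)) throw new Error(`unknown agent: ${agent}`);
  return agents.includes(agent) ? "proven" : "being-verified";
}

console.log(maturity("resume", "gemini"));   // proven
console.log(maturity("prReview", "codex"));  // being-verified
console.log(maturity("prReview", "claude")); // proven
```

A table like this would let the UI label an action honestly instead of hiding the gap.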

Optional on-device autocomplete

There's also inline tab-autocomplete that runs entirely on your machine via local Ollama, using qwen2.5-coder:1.5b at roughly 100ms latency. Nothing leaves the laptop per keystroke. It's opt-in and costs about a 2 GB download (the Ollama runtime plus the model weights); without it you get a free word-based completer. This is the one piece of the app that does its own inference instead of delegating to your agent CLI.
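To show the two completion paths side by side, here is a hedged sketch. `completeWord` is one guess at what a "word-based completer" might do (suggest the most frequent word in the buffer that extends the typed prefix). `buildOllamaRequest` only assembles a request for Ollama's local `/api/generate` endpoint on its default port. The function names, the use of `suffix` for fill-in-the-middle, and the fallback behavior are assumptions, not Klaussy's implementation.

```javascript
// Assumed fallback: suggest the most frequent identifier in the buffer
// that starts with (and is longer than) the typed prefix.
function completeWord(prefix, bufferText) {
  if (!prefix) return null;
  const words = bufferText.match(/[A-Za-z_][A-Za-z0-9_]*/g) || [];
  const counts = new Map();
  for (const w of words) {
    if (w.length > prefix.length && w.startsWith(prefix)) {
      counts.set(w, (counts.get(w) || 0) + 1);
    }
  }
  let best = null;
  for (const [word, n] of counts) {
    if (best === null || n > counts.get(best)) best = word;
  }
  return best;
}

// Builds (but does not send) a local Ollama completion request.
// `before` and `after` are the text around the cursor.
function buildOllamaRequest(before, after) {
  return {
    url: "http://localhost:11434/api/generate",
    body: {
      model: "qwen2.5-coder:1.5b",
      prompt: before,
      suffix: after,
      stream: false,
    },
  };
}

const buffer = "const registry = loadRegistry(); register(registry);";
console.log(completeWord("reg", buffer));
console.log(buildOllamaRequest("function add(a, b) {\n  return ", "\n}").body.model);
```

Nothing in the request leaves localhost, which matches the claim that no keystrokes leave the machine.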

A quick demo

The setup before first launch is real and worth stating plainly. You need Node.js 18+, the GitHub CLI authenticated, and at least one of the four supported agent CLIs installed and authenticated:

# Illustrative — Klaussy surfaces install commands in a setup dialog;
# it can detect missing CLIs but cannot bootstrap your auth.
npm i -g @anthropic-ai/claude-code   # or @openai/codex,
                                     # @google/gemini-cli, @github/copilot
gh auth login

Once a worktree is open, picking an agent and spawning a sibling task on a second agent both happen from the worktree's Actions dropdown — no config files, no per-project agent setup. The agent you choose becomes the global default and sticks across restarts until you change it.

What's next

A few things are honestly incomplete:

Multi-agent maturity is uneven. Running, switching, resuming, and side-by-side runs all work on all four agents. The deeper AI surfaces (PR review, Implement, CI-debug, Ask) are most proven on Claude and still being verified on Codex, Gemini, and Copilot. Treat Claude as the battle-tested path today.
The built-in flow prompts are still Claude-flavored. The Plan/Debug/Review slash-command bodies were written for Claude. They run on the other agents but aren't tuned to them yet. Per-agent tuning is a follow-up.
It doesn't replace your agent. Klaussy does not bundle and mark up agent access; you use the account you already have with one of the supported agents. Klaussy is a developer productivity app.

There's no Klaussy server in any of this. Data flows from your own agent CLI to that agent's provider, from your own gh to GitHub, and optionally to local Ollama. Pricing is a one-time $39 founder license (rising later), or $349 / $599 for 5 / 10 seats.

Try it

Site and downloads: klaussy.com
Early-access discussion: Discord

If you've tried wiring multiple agent CLIs into one workflow yourself, I'd genuinely like to know which agent you trust for which job — drop a comment.