CodeKing
"I Stopped Building a Coding Agent and Built a Supervisor for Codex and Claude Code Instead"

A couple of weeks ago I was about to do what everyone on my timeline was doing: build another coding agent. Read files, run commands, plan steps, loop until done.

Then I asked myself the uncomfortable question.

Why am I building a worse version of Claude Code and Codex, when both of them are already installed on my machine and work better than anything I can ship this month?

So I stopped. And I built the opposite of a coding agent instead.

The part I was getting wrong

I kept describing the problem as "I want an agent." But when I wrote down what I actually needed it to do, almost none of it was coding:

  • pick whether this request should go to Codex or Claude Code
  • decide whether it belongs in the current runtime session or a new one
  • remember what task the user was iterating on
  • surface approval prompts that are hiding in logs
  • summarize when a run finishes
  • handle "retry that last one" without a human translating

None of those are coding tasks. They are dispatch, supervision, and memory.

The executors (Codex, Claude Code) are the muscle. What I was missing wasn't more muscle. It was a nervous system.

Control plane vs execution plane

Once I framed it that way, the architecture fell out naturally. I now split the system into two planes:

  • Execution plane — Codex, Claude Code, and any future runtime that can actually write files and run commands. These are providers. They are not the agent.
  • Control plane — the supervisor agent. It reasons about what to do, chooses an executor, dispatches, observes, and reports back.

The rule I gave myself: the control plane never writes code. If it ever finds itself wanting to, that's a signal that I'm collapsing the two planes and I need to stop and route the work to an executor instead.

This is the opposite of the current trend, where everyone is trying to pack more executor capability into a single agent loop. I went the other way on purpose.
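The split can be sketched in a few lines. This is a minimal illustration, not cligate's actual API; the names `RuntimeProvider` and `ControlPlane` are mine:

```typescript
// Executors (Codex, Claude Code) sit behind one provider interface.
// The control plane picks one and dispatches; it never edits files itself.
interface RuntimeProvider {
  name: string;
  // Launch a coding task in a fresh runtime session; returns a session id.
  startTask(prompt: string, workingDir: string): string;
}

class ControlPlane {
  constructor(private providers: Map<string, RuntimeProvider>) {}

  dispatch(provider: string, prompt: string, workingDir: string): string {
    const p = this.providers.get(provider);
    if (!p) throw new Error(`unknown provider: ${provider}`);
    // All actual work happens inside the executor's own loop.
    return p.startTask(prompt, workingDir);
  }
}
```

The point of the interface is that "not the agent" becomes enforceable: the control plane physically has no file or shell methods to call.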

What the supervisor actually does

The supervisor runs its own ReAct loop — but the tools aren't read_file and run_command. They're dispatch and observation tools:

  • start_runtime_task(provider, prompt, working_dir)
  • continue_runtime_task(session_id, message)
  • get_runtime_status(session_id)
  • list_active_sessions(conversation_id)
  • approve_pending_question(session_id, answer)
  • recall_memory(scope, key)
  • write_memory(scope, key, value)
  • summarize_task(session_id)

That's it. That's the tool catalog for the agent itself. The coding tools live inside Codex and Claude Code, where they already work.
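One way to keep that catalog honest is to type it as a closed union, so the supervisor can only ever emit one of these shapes and nothing file-shaped. A hypothetical sketch (field names are my guesses from the list above):

```typescript
// The supervisor's entire tool surface as a discriminated union.
type SupervisorToolCall =
  | { tool: "start_runtime_task"; provider: "codex" | "claude_code"; prompt: string; working_dir: string }
  | { tool: "continue_runtime_task"; session_id: string; message: string }
  | { tool: "get_runtime_status"; session_id: string }
  | { tool: "list_active_sessions"; conversation_id: string }
  | { tool: "approve_pending_question"; session_id: string; answer: string }
  | { tool: "recall_memory"; scope: string; key: string }
  | { tool: "write_memory"; scope: string; key: string; value: string }
  | { tool: "summarize_task"; session_id: string };

// Runtime guard for tool names that come back from the model as plain JSON.
const SUPERVISOR_TOOLS = new Set<string>([
  "start_runtime_task", "continue_runtime_task", "get_runtime_status",
  "list_active_sessions", "approve_pending_question",
  "recall_memory", "write_memory", "summarize_task",
]);

function isSupervisorTool(name: string): boolean {
  return SUPERVISOR_TOOLS.has(name);
}
```

If the model hallucinates a `read_file` call, the guard rejects it instead of the loop quietly growing an execution plane.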

Observation First — the rule that saved me

The biggest failure mode I expected was the supervisor getting poisoned by the raw text streams from the executors. Dozens of megabytes of stdout, tool output, and chain-of-thought per session. If I pump that into the supervisor's context, it becomes a bloated, expensive, unreliable mess in about fifteen minutes.

So I adopted one principle and protected it fiercely:

The supervisor consumes structured observations, not raw logs.

When Codex emits an event — a turn starts, a tool is invoked, a question is asked, a task completes, a failure occurs — that event gets normalized into a small structured observation. The supervisor sees things like:

{
  "kind": "awaiting_approval",
  "session_id": "sess_83",
  "tool": "shell",
  "summary": "Wants to run: npm install",
  "risk": "medium"
}

Not:

[2026-04-22T14:03:18Z][codex][turn=4][tool_call] shell {...2300 more chars...}

The full log is still archived for audit. The supervisor just doesn't read it by default. This is the single architectural decision with the biggest impact on latency, cost, and correctness.
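The normalization step itself is small. A sketch of the idea, assuming hypothetical event shapes (the real cligate event schema may differ):

```typescript
// Raw executor event: possibly megabytes of payload.
interface RawEvent {
  timestamp: string;
  source: string;       // "codex" | "claude_code"
  type: string;         // e.g. "tool_call", "question", "task_complete"
  tool?: string;
  payload: string;
}

// What the supervisor actually sees: small and bounded.
interface Observation {
  kind: string;
  session_id: string;
  tool?: string;
  summary: string;
  risk?: "low" | "medium" | "high";
}

const SUMMARY_LIMIT = 200;

function normalize(sessionId: string, ev: RawEvent): Observation {
  // Hard cap the text that can reach the supervisor's context.
  const summary = ev.payload.length > SUMMARY_LIMIT
    ? ev.payload.slice(0, SUMMARY_LIMIT) + "…"
    : ev.payload;
  if (ev.type === "question") {
    return { kind: "awaiting_approval", session_id: sessionId, tool: ev.tool, summary, risk: "medium" };
  }
  return { kind: ev.type, session_id: sessionId, tool: ev.tool, summary };
}
```

The full `payload` goes to the audit archive; only the capped observation enters the loop.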

Memory needs scope, not just storage

The other thing I got wrong in my first draft was memory. I had two levels — "session" and "global" — and within a week they were both the wrong size for every real use case.

What I have now is four scopes:

  1. global user — preferences that cross every project ("I prefer TypeScript over JavaScript")
  2. workspace / project — conventions for this codebase ("tests live under tests/unit/")
  3. conversation — the current chat thread ("we're iterating on the auth middleware")
  4. runtime session — the specific Codex or Claude Code run ("already approved npm install in this session")

Each memory write has to declare its scope. Each read filters by scope. A preference written at conversation scope in a Telegram chat doesn't leak into a totally unrelated Feishu conversation, even though they share the same user.

This sounds obvious written down. It was not obvious when I started.
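Mechanically, the scoping rule is just "every entry carries a scope and a scope id, and reads must match both." A minimal sketch, with invented names:

```typescript
type Scope = "global_user" | "workspace" | "conversation" | "runtime_session";

interface MemoryEntry {
  scope: Scope;
  scopeId: string;   // e.g. user id, repo path, chat id, session id
  key: string;
  value: string;
}

class ScopedMemory {
  private entries: MemoryEntry[] = [];

  write(scope: Scope, scopeId: string, key: string, value: string): void {
    this.entries.push({ scope, scopeId, key, value });
  }

  // A read only sees entries whose scope AND scope id match, so a
  // conversation-scoped preference in one chat never leaks into another.
  recall(scope: Scope, scopeId: string, key: string): string | undefined {
    return this.entries.find(
      e => e.scope === scope && e.scopeId === scopeId && e.key === key,
    )?.value;
  }
}
```

The Telegram-vs-Feishu example above falls out of the `scopeId` match: same user, different conversation ids, no leak.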

Direct runtime vs assistant — don't hijack the default

The other thing I was careful about: not making every message go through the supervisor.

If the user is mid-flow with Codex, they don't want a chatty middleman interrupting every turn with observations and summaries. So the default behavior for plain messages is still the direct runtime path: the message goes straight to the current session, and the supervisor does not intervene.

The supervisor only takes over when the user explicitly invokes it, either with /cligate do X or through a dedicated assistant chat tab. Low-latency, low-noise, predictable.

The result is that you get two modes in one product:

  • Direct Runtime — fast, predictable, feels like talking to Codex or Claude Code
  • Assistant Collaboration — explicit, structured, feels like talking to a supervisor who then delegates

Users can tell the difference instantly, because one is immediate and the other shows a planning step.
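The routing rule is deliberately dumb. A sketch of the idea (the `/cligate` prefix is from the post; everything else here is illustrative):

```typescript
type Route =
  | { mode: "direct_runtime" }                      // default: straight to the session
  | { mode: "assistant"; command: string };         // explicit supervisor invocation

function route(message: string): Route {
  const trimmed = message.trim();
  if (trimmed.startsWith("/cligate ")) {
    return { mode: "assistant", command: trimmed.slice("/cligate ".length) };
  }
  return { mode: "direct_runtime" };
}
```

Keeping this a string-prefix check rather than an LLM classification is what makes the direct path predictable: no model call ever sits between a plain message and the executor.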

What this freed me from

The moment I committed to this split, a long list of problems disappeared:

  • I no longer needed to reinvent tool-use primitives for file editing and shell commands
  • I no longer had to ship security sandboxing for the agent itself — the executors already have it
  • I no longer had to match Claude Code or Codex on coding quality
  • I could ship a useful supervisor in a week, not a quarter

The supervisor's job is narrow enough to be finishable. The coding agent's job is not.

The local-first part matters here

All of this runs on localhost. The supervisor, the executors, the memory store, the channel providers — none of it phones home. That's important to me because a supervisor that manages my credentials, remembers my preferences, and dispatches to my coding tools is exactly the kind of component I do not want living on someone else's server.

Local-first also means the supervisor can observe the executors directly, without routing through anyone's cloud. No round trips, no rate limits on the control plane itself.

Quick start

npx cligate@latest start

Then open http://localhost:8081. Normal messages still go to Codex / Claude Code directly. Invoke the supervisor explicitly when you want dispatch and memory behavior.

Repo: https://github.com/codeking-ai/cligate

The question I keep asking myself

Everyone is building agents that can do more. I spent the last two weeks building one that does less — on purpose — because the thing it does less of is already done better by two other tools I have open in the next terminal tab.

Is "supervisor over existing executors" a more honest shape for an agent than "re-implement everything inside a single loop"?

I genuinely don't know the answer across the industry. But for my setup, it's already a clear yes. I'd like to hear how you draw the line — are you putting everything inside one agent, or are you also splitting control plane from execution plane? And if you're splitting, where does your line fall?
