Aleksi Tuominen


The Adventuring Party: From Sub-Agents to Multi-Agent Orchestration with tmux

1. Introduction

A little bit of background before we get into the meat of things:

I came back from parental leave in the middle of January. I had not been following the news or Slack at all, so you can imagine my shock when I saw how much coding had changed. Before my leave, I used Cursor and sometimes had it write tests or one-off fixes, but now no one was writing code by hand anymore. It was all AI, skills this and that...
So I took it as a chance: I decided to dive straight into the deep end and bar myself from coding like I used to.
I took up Claude Code, and even though I had barely used the terminal before, I forced myself to work in that environment.

At first, it was rough. Not really knowing how to prompt the agent well, having to double-check and ask Claude to go back and fix things multiple times...
And the most grueling part: coding was not fun anymore. I used to love tackling hard bugs and coming up with elegant ways to write things, but I quickly realized those days were gone.
That's why I redirected my tinkering needs to the agents and the harness (I know, a buzzword). I wanted to build a system that produces the best quality code it can without me having to babysit the agent all the time.
Having played through Baldur's Gate 3 recently, I also wanted to give the agent character, and chose to call my Claude The Paladin.

I started with skills, as most do, and created skills for reviewing, writing tests, and coding with React and Go.
I also delved into Claude's sub-agents and created specific agents for running tests, lint and reviewing from different angles.
This is all good old vanilla Claude, and it worked, kind of, but I still felt that I had to review the code myself before submitting. The quality just was not there.
Then I stumbled upon an article about using different models to review each other's code. At the time, Codex 5.3 had just come out, and luckily I had access to our company's subscription, so I started working on integrating Codex into my Claude setup.

So from here on, the Paladin was joined by the Wizard, and together we formed the Adventuring Party!

2. Sub-Agents (Jan 2026)

The first solution I came across was one defined in this blog post by MKJ about orchestrating Claude Code with sub-agents.

Using Claude's sub-agents as messengers between Claude and Codex (and Gemini, which I quickly abandoned since I barely used it), I was able to dispatch messages to Codex's CLI and have the sub-agents return the verdicts to Claude.

The main use case I had was to dispatch Codex sub-agents for review, as a part of my review cascade that consisted of the other sub-agents mentioned above.

Since most of the settings were just Claude's skills and sub-agents, configuring and troubleshooting was quite easy.
The problem was that using sub-agents (Haiku at first, Sonnet afterwards) added a layer to the relay between Codex and the main conversation. This was especially painful when the sub-agents sometimes simplified or straight up omitted parts of the review.

On top of that, every Codex call started a fresh session, meaning Codex had no memory of the previously fixed parts.
The main agent was wise enough to pass those along in its prompts to the sub-agents, but the extra relay layer still made the whole thing unreliable.

At this point I asked myself:

"Why not just remove the middle man and call Codex's CLI directly?"

3. Direct CLI (Feb 2026)

So I decided to ditch the sub-agent messengers and have the main Claude call Codex directly through Bash.

This eliminated the black box of sub-agents and meant there was one less layer between the agents.
With no more dilution, GPT5.3's strengths as a deep thinker could fully manifest, and the Wizard earned its title.
There were, however, still problems: Bash calls are synchronous, and with Codex being quite thorough and slow, Claude was now stuck waiting during the review phase, whereas previously it could work on other things while the sub-agents churned away.

Beyond that, the context-persistence issue from the sub-agent pattern remained. With the CLI, the only way Codex knew what had been fixed and what it had already checked was through Claude's prompts.
There were also small problems with Bash calls lingering in the background and finishing even after the review had been received.
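To make the blocking problem concrete, here is a minimal Go sketch of a synchronous CLI review call. The `codex exec` invocation in the comment is an assumption about the Codex CLI; the runnable demonstration uses `echo` as a harmless stand-in.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runReview shells out to a CLI and blocks until the command finishes,
// exactly like a synchronous Bash call from the main agent: nothing else
// happens in the caller while the reviewer is thinking.
func runReview(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).CombinedOutput()
	return string(out), err
}

func main() {
	// In the real setup this would be something like:
	//   runReview("codex", "exec", "Review the diff in review.md")
	// (the exact codex CLI subcommand and flags are an assumption).
	out, err := runReview("echo", "LGTM, no blocking findings")
	if err != nil {
		panic(err)
	}
	fmt.Print(out) // the caller only reaches this line after the reviewer exits
}
```

The point of the sketch is the shape of the call, not the command: `CombinedOutput` does not return until the child process exits, which is why Claude sat idle through every review.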

At this time I was talking with a coworker on Slack, and he told me that he was using tmux to have Claude talk to Codex. I had never heard of tmux (I know, typical frontend dev), so I tasked my party to research tmux as an alternative to the current setup.
The research yielded a resounding "Let's migrate!" from both agents, and after a while of pondering on my own, I decided to give it a go.

4. tmux as Transport (Feb-Mar 2026)

At this point, I was starting to get quite familiar with the terminal as a development environment, and after installing tmux, I was immediately hooked.

I booted Claude and Codex in neighboring panes, told Claude to send something to Codex in the next pane, and boom, communication! Both agents understand tmux quite well out of the box, so creating skills for them to talk to each other was actually quite straightforward.
Not only could Claude talk to Codex, Codex could now also respond. I created corresponding skills for both, and watched them debate their favorite D&D god (just a small test).
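Under the hood, a message from one agent to another boils down to a single `tmux send-keys` call. A minimal sketch in Go (the pane target `party:0.1` and the message are illustrative, not the actual skill's format):

```go
package main

import (
	"fmt"
	"strings"
)

// sendKeysArgs builds the tmux argv for delivering a message to a pane.
// tmux types the message into the target pane, and the trailing "Enter"
// submits it — that is the entire agent-to-agent transport.
func sendKeysArgs(target, message string) []string {
	return []string{"send-keys", "-t", target, message, "Enter"}
}

func main() {
	// Hypothetical target: session "party", window 0, pane 1.
	args := sendKeysArgs("party:0.1", "Wizard, please review the diff in review.md")
	fmt.Println("tmux " + strings.Join(args, " "))
	// In the live setup these args would be passed to
	// exec.Command("tmux", args...), which requires a running tmux server.
}
```

Because the transport is just keystrokes into a pane, the receiving agent keeps its own conversation history, which is what made the context-persistence problem disappear.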

The problems from the previous setup were all but gone. Claude could work while Codex was thinking. Codex having a conversation of its own meant that the previous changes and comments stayed in memory. And with no more Bash calls, there was no cleanup and no lingering finished calls to deal with.
Claude would do its thing, reach the review phase, run the native sub-agents and dispatch a review request to Codex. Codex would then review while Claude addressed any comments the critics had, and when done, Codex would send a message to Claude's pane telling it to take a look at the file where it wrote its findings.

The core was working quite well, but I wanted to go further. I wanted to lean more into the fantasy and also create an easy-to-launch setup with everything ready to go.
So I came up with the party session: a tmux session with three panes side by side, Codex - Claude - Shell.
Paired with some visual magic, named panes, and a custom status bar, this got me something that vaguely resembled an agent dashboard.
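For reference, the three-pane layout boils down to a handful of tmux commands. A sketch of the sequence, here as a Go helper returning the argv list (session name, pane order, and the agent launch commands are my own illustration; the real script also names panes and sets up the status bar):

```go
package main

import "fmt"

// partyLayoutCmds returns the tmux commands that build a detached
// three-pane session laid out side by side: Codex | Claude | Shell.
func partyLayoutCmds(session string) [][]string {
	return [][]string{
		{"tmux", "new-session", "-d", "-s", session, "codex"},          // left pane runs Codex
		{"tmux", "split-window", "-h", "-t", session + ":0", "claude"}, // middle pane runs Claude
		{"tmux", "split-window", "-h", "-t", session + ":0"},           // right pane: plain shell
		{"tmux", "select-layout", "-t", session + ":0", "even-horizontal"},
	}
}

func main() {
	for _, cmd := range partyLayoutCmds("party") {
		fmt.Println(cmd)
	}
}
```

Running those four commands (or the equivalent shell script) gets a fresh party up in one go, ready to attach to.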

This is my current setup, in which I replaced the Codex pane with another custom TUI, and moved Codex into another tmux window.

5. The Party System (Mar 2026)

The Orchestrator

Since I was satisfied with the core flow, I started to tinker with the harness around the agents a bit more.
At work I usually had multiple terminal tabs open, all running a party session, doing parallel work. This was fine and dandy, but there were still times when I wanted to quickly dispatch several of those sessions at once, for example to fix multiple bugs.

That's where I took a page from Claude's agent teams feature, and decided to create my own orchestrator.
The master session, as I called it, was just another party, but without Codex. Workers were, once again, just party sessions, linked to a master session through a JSON file.
The master could spawn workers, monitor them, and relay messages to them, and the workers could relay info back.
This enabled me to just pass a bunch of bug tickets to the master session, and watch it summon worker parties for each task, monitor their work and give me status updates.

From Bash to Go

The system started out as pure Bash: 700+ line scripts that parsed JSON with jq, cleaned up the session data...
While it mostly worked as intended, testability and stability started to become an issue as I kept adding functionality (like an fzf picker for moving between sessions), so I decided to take a leap and migrate the scripts to Go.
The Bash scripts became thin wrappers around the party-cli Go binary, and the system became testable and type-safe.

I was also able to make the setup actually look cool with a custom TUI tracker for the master sessions.
The tracker is a Go TUI built with Bubble Tea, and it tracks the workers' progress, displaying snippets of their Claude panes. It also lets me send messages (to individual workers or to all of them), jump to a worker's tmux session, and kill sessions if need be.

Here I tasked the master to dispatch 3 workers and have them come up with a joke.

I must confess that all this is a tad on the over-engineered side of things, but one must redirect the will to tinker toward something, right?

6. The Quality Harness

The Pipeline

As for the actual workflows for the agents, I wanted to make sure that Claude could not just declare "This works!" and move on, so I created workflow skills for specific scenarios and implemented gates to make sure Claude followed the guardrails.

When working on a new project, we usually start by creating spec, plan, and task markdown files with completion criteria to steer the agents. The task files are then fed to Claude, which executes them, creating one PR per task.
For this, I created a task-workflow skill that details the steps Claude must take when executing a task.

In short, it goes like this:

  1. Scope — Check the task file and do basic research to confirm the scope
  2. RED/GREEN — Start by writing tests, a RED phase where the tests fail on purpose, followed by the implementation to make the tests pass (GREEN phase)
  3. Critics — Once the implementation is done, the review phase begins, first with the sub-agents: one for code correctness and one for bloat checking
  4. Deep review — The second pass consists of Codex and an Opus sub-agent, both inspecting the whole diff objectively.
  5. PR gate — Once all checks pass, a PR is created. Before that, a gate verifies that evidence for each step is logged in a JSON file — if anything is missing or unapproved, the PR is rejected.
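The gate in step 5 reduces to a simple check: every required step must have an approved evidence entry in the log. A minimal sketch (the step names and the JSONL entry shape are illustrative, not the actual schema):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// Evidence is one line of the JSONL evidence log.
type Evidence struct {
	Step     string `json:"step"`
	Approved bool   `json:"approved"`
}

// gatePasses returns true only if every required step has an approved
// evidence entry; anything missing or unapproved blocks the PR.
func gatePasses(jsonl string, required []string) bool {
	approved := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(jsonl))
	for sc.Scan() {
		var e Evidence
		if json.Unmarshal(sc.Bytes(), &e) == nil && e.Approved {
			approved[e.Step] = true
		}
	}
	for _, step := range required {
		if !approved[step] {
			return false
		}
	}
	return true
}

func main() {
	log := `{"step":"scope","approved":true}
{"step":"red-green","approved":true}
{"step":"critics","approved":true}
{"step":"deep-review","approved":false}`
	required := []string{"scope", "red-green", "critics", "deep-review"}
	fmt.Println(gatePasses(log, required)) // prints "false": deep review not approved
}
```

The key design point is that the gate only reads the log; it never asks the implementing agent whether the work is done.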

The main point behind all this iteration is to let as few bugs and as little bad code slip past as possible, and having multiple sub-agents and different models working on that has proven quite effective.
There are still however times when stuff does slip through, and Codex especially tends to flag things beyond the task's boundaries, resulting in Claude having to dispute the findings.

The Evidence System

As stated above, the evidence, namely a record of which checks and reviews have been run, is what keeps Claude from just deciding on its own that things work.
Every approval the sub-agents or Codex makes is tied to a diff-hash, meaning if Claude edits something after approvals, the evidence becomes stale and the gate forces Claude to re-run the reviews.
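Pinning approvals to a diff hash can be sketched like this: digest the current diff, and an approval is only valid while its stored hash still matches. The hashing scheme here is my assumption; any stable digest of the diff would do.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// diffHash digests the current diff so approvals can be pinned to it.
func diffHash(diff string) string {
	sum := sha256.Sum256([]byte(diff))
	return hex.EncodeToString(sum[:])
}

// isStale reports whether an approval no longer matches the working tree:
// any edit after the approval changes the diff, which changes the hash,
// which forces the reviews to be re-run.
func isStale(approvedHash, currentDiff string) bool {
	return approvedHash != diffHash(currentDiff)
}

func main() {
	diff := "+ fmt.Println(\"hello\")"
	approval := diffHash(diff) // the reviewer approves this exact diff

	fmt.Println(isStale(approval, diff))            // prints "false": evidence still valid
	fmt.Println(isStale(approval, diff+"\n+ more")) // prints "true": edited after approval
}
```

The staleness check is what closes the loophole of "fix one more thing" after the reviews have already signed off.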

As for storage, the evidence is logged in a JSONL file. Before that, the system created marker files, but having multiple files instead of one source of truth was making the cleanup a mess.

The sub-agents and especially Codex sometimes disagree with Claude, and if this happens, Claude is able to debate the review through the skill. If 3 rounds of reviews do not result in approval, Claude enters the dispute mode and tries to resolve the situation. In most cases, this is due to Codex adamantly flagging out-of-scope or extremely specific edge-cases.

With this, Claude has to prove that it did the reviews and fixed the flagged things. A Ralph Wiggum-esque loop that results in better code than just trusting Claude on its own.
The best part is that I no longer have to babysit the system: even when the review phase keeps looping, the dispute functionality resolves things automatically, and it usually only pops up when Codex is being snarky.

7. Lessons Learned

What did I learn then?

When it comes to the setup around the agents, I wish I'd taken more time to research the terminal environment before jumping in blindly.
Had I known how well both Claude and Codex understand tmux out-of-the-box, I would have started to build my setup on top of that from the beginning. Even in the event of the agents not loading the skills, they are able to send messages to each other after a couple of tries.

And as for the agents themselves, leaning into the multi-model approach is in my opinion still the best way to code with agents.
The number of times Codex actually catches bugs and humbles Claude also genuinely surprised me. I had read that GPT5.3 (GPT5.4 at the time of writing) excelled in deep thinking, which made it the perfect reviewer, but even then, I have yet to see a session where Codex did not grill Claude on bugs.
This to me just proves the importance of having agents review each other's work. The implementor will always try to finish the task it's given even at the expense of code quality.
However, just having reviews still does not cut it. It's easy for the implementor to just self-approve everything after one run, so forcing it to prove it passed the reviews adds one more layer of safety.

Even if you don't use this exact setup, the core patterns transfer: separate your implementor from your reviewer, make agents prove their work with evidence, and never let anything self-approve.


8. Closing

So, if you've read this far, thank you! I truly appreciate it. This was my first blog post, so I hope I managed to convey even a fraction of how fun this has been to build. I said earlier that coding stopped being fun — building the harness brought it back, just in a different form.
I will most likely keep on tinkering and building needlessly complicated, super specific additions.

The whole setup is open source — check it out at github.com/alexivison/ai-config if you want to dig into the details.

Oh and if you've built something similar, please leave a comment! I'd love to hear about it.
