2025 was the "year of agents".
We focused on making LLMs more autonomous and able to coherently work on tasks for longer and longer time horizons.
Claude Code made working with LLMs akin to pair programming with a very skilled but inexperienced junior developer.
Sometime in December 2025, with the release of Opus 4.5, a step-wise increase in capability became noticeable.
Claude was able to work and verify work by itself for hours at a time.
One obvious way to parallelize work here is to run many Claude Code instances in tmux and have them work on separate issues in different parts of the codebase.
This became so common that Anthropic and TPOT refer to this as "multi-Clauding".
In Steve Yegge's parlance, this is level 6/7 of agentic coding.
tmux is great, but you quickly run into the natural limits of your own cognitive overhead.
Even the most brilliant (i.e. ADHD) developers become overwhelmed when trying to steer 8+ different Claudes at once.
It is usually just too hard to figure out how to compose several tracks of work in a way that you can meaningfully parallelize them.
You rapidly have to move back and forth between idea generation, steering, and work review.
The context switching gets to be too much.
It's possible to manage a handful of Claudes.
And there's this suspicion that, well, surely I should be able to scale this 10x, right?
Surely I can figure out a way to work on 10 things at a time, and if I can figure out how to work on 10 things at a time maybe I can scale it further to 100?
The solutions in this space are, putting it kindly, immature.
They're hacked together and sometimes very broken, but also show sparks of promise and excitement.
openclaw/openclaw
Incredible slop.
The viral marketing tactics are great, and the skills/plugin harness is genuinely useful.
There are two things that really make me hesitant to draw any real lessons from its codebase.
First, the security breaches of users' wallets and data are not inspiring.
Second, the state of the repo speaks to the level of care and thought put into it: PRs are accepted so liberally that there are too many to review, and the maintainers have decided to YOLO-merge a lot of them.
steveyegge/gastown
It's the project that inspired hive.
It has abundant design docs and shows Steve's excited iteration and tinkering with the ideas around multi-agent coordination.
My only complaint is that it's too complex!
It's trying to solve this issue of coordinating hundreds or thousands of agents at scale.
Me, I'd like to just have 20 working together.
randomlabs/slate
I haven't poked at it too much, and it's not open source, so to really understand it deeply I'd need to reverse engineer its binary.
That being said, the technical blog is quite good and demonstrates that the team behind it is really thoughtful and making reasonable tradeoffs in this space.
They're one of the few players who are actively innovating AND sharing about their innovation, which I appreciate.
There are really only three core ideas you need to know to understand hive (and its sister solutions):
- Rather than interact with an agent as if it's a pair programmer, you treat it like a project manager
- You maintain some kind of task board/external TODO list with the ability to note what tasks depend on each other
- Agents that implement the work have the ability to gather context themselves, and guardrails to keep them on track (e.g. tests, other models that review work, etc)
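To make idea 2 concrete, here's a minimal sketch of an external task board with dependency tracking. The names (`Task`, `Board`) are illustrative, not hive's actual API; the point is just that a task becomes "ready" for a fresh agent only once everything it depends on is done.

```python
# Hypothetical sketch of an external task board with dependencies.
from dataclasses import dataclass, field


@dataclass
class Task:
    id: str
    description: str
    deps: set[str] = field(default_factory=set)
    done: bool = False


class Board:
    def __init__(self):
        self.tasks: dict[str, Task] = {}

    def add(self, task: Task):
        self.tasks[task.id] = task

    def ready(self) -> list[Task]:
        # A task can be handed to a fresh agent once all its deps are done.
        return [
            t for t in self.tasks.values()
            if not t.done and all(self.tasks[d].done for d in t.deps)
        ]

    def complete(self, task_id: str):
        self.tasks[task_id].done = True
```

At any moment, every task in `ready()` can be dispatched to its own agent in parallel, which is what makes the dependency notes on the board the thing that unlocks parallelism.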
1 helps reduce the context switching you do between idea generation and steering models.
Rather than having to keep your attention focused on a single instance, following along and steering its implementation, you work ahead of time to clarify intent and plan the implementation out.
Models are strong enough that even with a rough sketch (and clear intent) they can fill in the rest.
2 is necessary as a way to "externalize" project memory and context.
Every issue can be generated by a fresh agent, which can do the heavy lifting of figuring out all the files that need to be touched, tests that need to be created, etc.
This information can then be handed off to a model with fresh context, which increases the chance of the model one-shotting the feature.
3 helps with the context switch to reviewing output: because you build thorough testing systems and have a model that can competently use that information, you end up doing far less review yourself.
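The guardrail loop in idea 3 can be sketched in a few lines. This is an assumption about the general shape, not hive's implementation; `run_agent` and `run_tests` stand in for your harness and your test suite.

```python
# Hypothetical guardrail loop: an external signal (tests) steers the
# implementing agent, rather than the agent grading its own work.
def implement_with_guardrails(task, run_agent, run_tests, max_attempts=3):
    feedback = None
    for _attempt in range(max_attempts):
        patch = run_agent(task, feedback)   # agent gathers its own context
        ok, report = run_tests(patch)       # external verification signal
        if ok:
            return patch                    # human review drops to spot checks
        feedback = report                   # failures steer the next attempt
    return None                             # out of attempts: escalate to a human
```

The key design choice is that the loop terminates on an external check, so a human only looks at work that has already passed the guardrails (or been escalated).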
A lot of libraries and implementations have converged on ideas in this proximity, like Anthropic's agent teams or hive or gastown.
hive has a few features that I think make it special:
- Simplicity and hackability as a core principle. The codebase is ~10k LOC, designed to run locally with minimal resources, built around as simple a state machine as possible, so you can reason about it, rip it out, and change it to your needs.
- Model/harness agnostic. Right now it supports using either `codex` or `claude`, but there's no reason why you can't bring your own harness (and thus any other model) as well.
- Multi-project/headless delegation of issues. Meaning, you can script and create a meta workflow to work on many projects in parallel. (This is the end goal of `gastown`, but at a much larger scale than `hive`.)
- Auditability/logging. `hive` tracks events, issues, success rates, etc. Meaning that, if you wanted to, you could easily experiment with multiple models and start to figure out which models in which harnesses perform best for your various problems.
1 is useful because I'm not quite sure how multi-agent orchestrators will evolve as the base models get stronger and stronger, and it's convenient to be able to rapidly and cheaply test out ideas.
For example, I've thought about having multiple models and runners attempt to implement the same feature.
You can then let an LLM grade and choose one to merge, and collect stats so you can gather data on per-model, per-runner success rates.
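Collecting those stats is straightforward once events are logged. Here's a hedged sketch of mining hive-style event logs for per-model, per-runner success rates; the record fields (`model`, `runner`, `success`) are assumptions about what such a log might contain.

```python
# Illustrative aggregation of event logs into per-(model, runner)
# success rates. Field names are hypothetical.
from collections import defaultdict


def success_rates(events):
    counts = defaultdict(lambda: [0, 0])  # (model, runner) -> [wins, total]
    for e in events:
        key = (e["model"], e["runner"])
        counts[key][1] += 1
        if e["success"]:
            counts[key][0] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}
```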
2 is nice to have because the SOTA frontier models are still rotating through the first-place podium spot every few months.
It's also unclear how different frontier models interact with various harnesses (e.g. I hear GPT 5.4 in Claude Code is quite good).
Being agnostic to both model and harness means you don't have to rewrite the core orchestration code as everyone is rapidly improving and iterating on these other core pieces.
3+4 together are helpful for debugging and verifying the system is working smoothly.
It's what's enabled me to start using hive in multiple projects at once with confidence.
Future multi-agent systems are going to be far more ergonomic, and enable power users to manage 100s of agents in parallel.
What exactly that looks like, and the kinds of problems that need 100s of agents working together, I don't know yet.
Agents aren't quite substitutable for humans.
If you get 100 humans together, you can build a 100M dollar company or create some new hit movie or start a revolution.
If you get 100 agents together, you get slop.
Models are not great at generating out-of-distribution, interesting ideas.
Repeatedly chaining models together with no external signal or steering, you end up getting a very "collapsed" output.
This is definitely solvable though.
It's so exciting to be given the privilege to help figure it out.
