DEV Community: nwyin

Hive: A Lightweight Multi-Agent Orchestrator

nwyin — Mon, 23 Mar 2026 09:25:11 +0000

2025 was the "year of agents".
We focused on making LLMs more autonomous and able to coherently work on tasks for longer and longer time horizons.

Claude Code made working with LLMs akin to pair programming with a very skilled but inexperienced junior developer.
Some time in December 2025, with the release of Opus 4.5, a step-wise increase in capability became noticeable.
Claude was able to work and verify work by itself for hours at a time.

One obvious way to parallelize work here is to tmux many Claude Code instances and have them work on separate issues in different parts of the codebase.
This became so common that Anthropic and TPOT refer to this as "multi-Clauding".
In Steve Yegge's parlance, this is level 6/7 of agentic coding.

tmux is great, but you run into natural limitations in cognitive overhead.
Even the most brilliant (i.e. ADHD) developers become overwhelmed when trying to steer 8+ different Claudes at once.
It is usually just too hard to figure out how to compose several tracks of work in a way that you can meaningfully parallelize them.
You rapidly have to move back and forth between idea generation, steering, and work review.
The context switching gets to be too much.

It's possible to manage a handful of Claudes.
And there's this suspicion that, well, surely I should be able to scale this 10x, right?
Surely I can figure out a way to work on 10 things at a time, and if I can figure out how to work on 10 things at a time maybe I can scale it further to 100?

The solutions in this space are, putting it kindly, immature.
They're hacked together and sometimes very broken, but also show sparks of promise and excitement.

openclaw/openclaw

Incredibly slop.
Great viral marketing tactics and the skills/plugin harness is genuinely useful.
There are two things that really make me hesitant to draw any real lessons from its codebase.
First, the security breaches of users' wallets and data are not inspiring.
Second, the state of the repo speaks to the level of care and thought put into it: PRs are accepted very liberally, and there are so many to review that the maintainers have decided to YOLO-merge a lot of them.

steveyegge/gastown

It's the project that inspired hive.
It has abundant design docs and shows Steve's excited iteration and tinkering with the ideas around multi-agent coordination.
My only complaint is that it's too complex!
It's trying to solve this issue of coordinating hundreds or thousands of agents at scale.
Me, I'd like to just have 20 working together.

randomlabs/slate

I haven't poked at it too much, and it's not open source, so to really understand it deeply I'd need to reverse engineer its binary.
That being said, the technical blog is quite good and demonstrates that the team behind it is really thoughtful and making reasonable tradeoffs in this space.
They're one of the few players who are actively innovating AND sharing about their innovation, which I appreciate.

There are really only 3 core ideas you need to know to understand hive (and its sister solutions):

1. Rather than interacting with an agent as if it's a pair programmer, you treat it like a project manager.
2. You maintain some kind of task board/external TODO list with the ability to note which tasks depend on each other.
3. Agents that implement the work have the ability to gather context themselves, and guardrails to keep them on track (e.g. tests, other models that review work, etc).

1 helps alleviate the amount of context switching you do between idea generation and steering models.
Rather than having to keep your attention focused on a single instance, following along and steering its implementation, you work ahead of time to clarify intent and plan the implementation out.
Models are strong enough that even with a rough sketch (and clear intent) they can fill in the rest.

2 is necessary as a way to "externalize" project memory and context.
Every issue can be generated by a fresh agent, which can do the heavy lifting of figuring out all the files that need to be touched, tests that need to be created, etc.
This information can then be handed off to a model with fresh context, which increases the chance of the model one-shotting the feature.
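
As a rough illustration of idea 2, a task board with dependency tracking can be very small. This is a sketch, not hive's actual data model: the `Task` class, its fields, and `ready_tasks` are names made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One item on the board. Field names are illustrative, not hive's schema."""
    id: str
    title: str
    status: str = "open"  # "open" -> "done"
    depends_on: list[str] = field(default_factory=list)


def ready_tasks(board: dict[str, Task]) -> list[Task]:
    """Return open tasks whose dependencies are all done."""
    return [
        task
        for task in board.values()
        if task.status == "open"
        and all(board[dep].status == "done" for dep in task.depends_on)
    ]


board = {
    "1": Task("1", "Add schema migration"),
    "2": Task("2", "Update API handlers", depends_on=["1"]),
    "3": Task("3", "Write docs"),
}

# Tasks 1 and 3 have no unmet dependencies, so two agents can start right away.
first_wave = [t.id for t in ready_tasks(board)]

# Once task 1 is done, task 2 becomes available.
board["1"].status = "done"
second_wave = [t.id for t in ready_tasks(board)]
```

Anything returned by `ready_tasks` can be handed to a fresh agent without that agent needing to know about the rest of the board.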

3 helps with the context switch to reviewing output (because you build thorough testing systems and have a model to competently use that information, you end up doing far less review yourself).

A lot of libraries and implementations have converged on ideas in this proximity, like Anthropic's agent teams or hive or gastown.

hive has a few features that I think make it special:

1. Simplicity and hackability as a core principle. The codebase is ~10k LOC and designed to run locally with minimal resources, with as simple a state machine as possible, so you can reason about it, rip it out, and change it to fit your needs.
2. Model/harness agnostic. Right now it supports using either codex or claude, but there's no reason why you can't bring your own harness (and thus any other model) as well.
3. Multi-project/headless delegation of issues. You can script and create a meta workflow to work on many projects in parallel. (This is the end goal of gastown, but at a much larger scale than hive.)
4. Auditability/logging. hive tracks events, issues, success rates, etc. If you wanted to, you could easily experiment with multiple models and start to figure out which models in which harnesses perform best for your various problems.

1 is useful because I'm not quite sure how multi-agent orchestrators will evolve as the base models get stronger and stronger, and it's convenient to be able to rapidly and cheaply test out ideas.
For example, I've thought about having multiple models and runners attempt to implement the same feature.
You can then let an LLM grade and choose one to merge, and collect stats so you can gather data on per-model, per-runner success rates.
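
To make the stats idea concrete, here is a minimal sketch of the aggregation step. The attempt log and its `(model, runner, merged)` shape are invented for the example; hive's real event format may look nothing like this.

```python
from collections import defaultdict

# Hypothetical attempt log: (model, runner, was the attempt merged?).
attempts = [
    ("model-a", "claude", True),
    ("model-a", "claude", False),
    ("model-a", "codex", True),
    ("model-b", "codex", True),
    ("model-b", "codex", True),
]


def success_rates(rows):
    """Map (model, runner) -> fraction of attempts that were merged."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for model, runner, merged in rows:
        totals[(model, runner)] += 1
        wins[(model, runner)] += int(merged)
    return {key: wins[key] / totals[key] for key in totals}


rates = success_rates(attempts)
```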

2 is nice to have because the SOTA, frontier models are still playing the game of rotating their first place podium spot every few months.
It's also unclear how different frontier models interact with various harnesses (e.g. I hear GPT 5.4 in Claude Code is quite good).
Being agnostic to both model and harness means you don't have to rewrite the core orchestration code as everyone is rapidly improving and iterating on these other core pieces.

3+4 together are helpful for debugging and verifying the system is working smoothly.
It's what's enabled me to start using hive in multiple projects at once with confidence.

Future multi-agent systems are going to be far more ergonomic, and enable power users to manage 100s of agents in parallel.
What exactly that looks like, and the kinds of problems that need 100s of agents working together, I don't know yet.

Agents aren't quite substitutable for humans.
If you get 100 humans together, you can build a 100M dollar company or create some new hit movie or start a revolution.
If you get 100 agents together, you get slop.
Models are not great at generating out-of-distribution, interesting ideas.
Repeatedly chaining models together with no external signal or steering, you end up getting a very "collapsed" output.

This is definitely solvable though.
It's so exciting to be given the privilege to help figure it out.

On Static Analysis + LLM

nwyin — Mon, 23 Mar 2026 09:23:23 +0000

Static analysis is understanding your code before running it.

def add(a, b):
  return a + b

def main():
  return add(1, 3)

Above is a trivial program.
At a glance, you can tell that calling main() will return 4.

Here are some questions to ponder:

What happens if you call add with non-numeric types? i.e. what does add("foo", "bar") return?
How do you know about the above?

Static analysis is about answering these questions without having to run the code.
As human programmers, we develop familiarity with the language and runtime.
This lets us answer such questions easily.
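
For reference, this is what the runtime actually does with those inputs. A static analyzer has to reach the same conclusions from the source alone.

```python
def add(a, b):
    return a + b


# `+` is defined for two strings (concatenation) ...
result = add("foo", "bar")

# ... but not for a string and an int, which only fails when the call runs.
try:
    add("foo", 1)
    error = None
except TypeError as exc:
    error = type(exc).__name__
```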

But how does a machine figure out the answer?

Programs have many representations (the human-friendly syntax being one of many).
Here are some other ways to represent the above Python program:

AST

Module(
  body=[
    FunctionDef(
      name='add',
      args=arguments(
        args=[
          arg(arg='a'),
          arg(arg='b')]),
      body=[
        Return(
          value=BinOp(
            left=Name(id='a'),
            op=Add(),
            right=Name(id='b')))]),
    FunctionDef(
      name='main',
      args=arguments(),
      body=[
        Return(
          value=Call(
            func=Name(id='add'),
            args=[
              Constant(value=1),
              Constant(value=3)]))])])
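
A dump like the one above can be produced with the standard library's `ast` module. The exact output varies by Python version (older versions print more fields, such as `ctx=Load()`).

```python
import ast

source = """
def add(a, b):
  return a + b

def main():
  return add(1, 3)
"""

tree = ast.parse(source)

# indent=2 pretty-prints the tree instead of emitting one long line.
dump = ast.dump(tree, indent=2)
print(dump)

# The tree can also be inspected directly instead of printed.
function_names = [node.name for node in tree.body]
```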

Bytecode

Disassembly of <add>:
  1  RESUME              0
  2  LOAD_FAST           0 (a)
     LOAD_FAST           1 (b)
     BINARY_OP           0 (+)
     RETURN_VALUE

Disassembly of <main>:
  4  RESUME              0
  5  LOAD_GLOBAL         1 (add)
     LOAD_SMALL_INT      1
     LOAD_SMALL_INT      3
     CALL                2
     RETURN_VALUE
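
The disassembly comes from the standard library's `dis` module. Opcode names and line numbering differ between CPython versions, so your output may not match the listing above exactly.

```python
import dis


def add(a, b):
    return a + b


def main():
    return add(1, 3)


# Print a human-readable listing for each function.
dis.dis(add)
dis.dis(main)

# The same information is available programmatically.
add_ops = [ins.opname for ins in dis.get_instructions(add)]
```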

These representations are "closer to the machine".
They're the same program, but with details that are relevant to a compiler or interpreter trying to execute the program.

Most human programmers never bother with these details.
They're frankly too low-level.
Not much work warrants looking at the AST or bytecode or compiled assembly of a program.

But!
There are some other representations of programs that perhaps all programmers should be aware of.
For example, here's a call-graph of our sample program:

example.py
  [D] add
  [D] main
main
  [U] add

[D] means defined in, and [U] means uses.
You read the above as add and main being defined in example.py, and add being used in main.
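
A toy version of this analysis fits in a few lines on top of `ast`. This is a simplified sketch, not the tool that produced the listing above: it only tracks direct calls to plain names inside top-level functions, and `call_graph` is a name chosen for the example.

```python
import ast

source = """
def add(a, b):
  return a + b

def main():
  return add(1, 3)
"""


def call_graph(code: str) -> dict[str, list[str]]:
    """Map each top-level function to the plain names it calls.

    Only direct calls like `add(...)` are tracked; methods, aliases and
    dynamic dispatch are ignored, which a real tool would have to handle.
    """
    graph: dict[str, list[str]] = {}
    for node in ast.parse(code).body:
        if isinstance(node, ast.FunctionDef):
            calls = {
                inner.func.id
                for inner in ast.walk(node)
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name)
            }
            graph[node.name] = sorted(calls)
    return graph


graph = call_graph(source)
```

The keys of `graph` correspond to the [D] entries and the values to the [U] entries.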

Call graphs are very useful for understanding the chain of dependencies and dataflow of your program.
In systems that grow to hundreds of thousands of lines, with data passed across several thousand functions and several hundred files, call graphs end up being quite useful!
Without actually running or profiling the code, you can answer: what are all the functions and files this data passes through?
If I had to refactor or rethink the approach of how this data flows, what are all the files and functions I'd need to touch?
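
Once you have the graph, "what would I need to touch?" becomes a graph traversal. A minimal sketch, using made-up function names and a hypothetical caller-to-callees mapping:

```python
from collections import deque

# Hypothetical caller -> callees edges, in the shape a call-graph tool might emit.
calls = {
    "handler": ["parse", "save"],
    "parse": ["validate"],
    "save": ["serialize"],
    "validate": [],
    "serialize": [],
}


def affected_by(target: str, graph: dict[str, list[str]]) -> set[str]:
    """Return every function that directly or transitively calls `target`."""
    # Invert the edges so we can walk from a callee up to its callers.
    callers: dict[str, set[str]] = {name: set() for name in graph}
    for caller, callees in graph.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    seen: set[str] = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


touched = affected_by("validate", calls)
untouched_root = affected_by("handler", calls)
```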

Static analysis is a powerful tool for people who work on very complicated systems and don't have the privilege of "just" rewriting them from scratch.
Many systems grow by standing on past foundations -- even if the foundations are quite shaky.

In the year of 2026, though, we have LLMs.
Can't we just use LLMs for our refactors and rewrites now?

There is a sense that you can't really do "serious" engineering with LLMs.
Their context windows are too small, their tendency to produce slop too high.
I disagree with these sentiments entirely.

I truly believe in this idea of a "capability overhang" in models.
That is, naive prompting and context management nerfs a model's ability to perform quite dramatically.
It's like taking a competent engineer and saying: you only get to look at the code for 1 second, and then you must come up with solutions to the problem on demand, with no thinking or external tools.
Clearly insane and naive!

Harnesses like Claude Code/Codex/Opencode make the experience better, giving models hands and environments to test their code and iterate, but they are still rather restricted by tool permissions or the patterns we encode in our prompts.
In a sense, a model+harness's results are limited by your own competency as an engineer.
How will you know to drive the model to better and better engineering practices if you don't know about them yourself?

Great engineers are not wizards.
Rather, they're good engineers who use great tools.

There's a variety of analysis you can do with a program without running or profiling it at all.
Call-graph analysis and control-flow analysis are two of them.
There's "low hanging fruit" in giving models access to these tools that senior+ engineers use to make changes in larger and more complex systems.
In my experience, giving models the ability to analyze codebases via static analysis enables them to more competently plan and execute larger scale refactors.
They're able to reason about the codebase the way a senior+ engineer would.

What I'm experimenting with right now is building better static analysis tools for Python.
Due to the dynamic nature of the language (and the fact that many Python users are not programmers by trade), the state of these tools is a bit immature compared to any C/LLVM-based language.
But it's great that LLMs are turbocharging the development here.

Already, I've found that LLMs have been able to refactor and come up with reasonable code improvements to a 10k LOC Python project I maintain, just by nature of being able to look at call graphs and control-flow graphs.

It seems to me that frontier models are probably as competent as a senior or staff level engineer, if given the right prompt and tools and ability to reason for long enough.
Getting this right will "just" be a matter of building tools and formats and representations that are very amenable to LLM reasoning.

Hashline vs Replace: Does the Edit Format Matter?

nwyin — Mon, 23 Mar 2026 09:22:58 +0000

Can Bölük's The Harness Problem showed hashline-style edits (line-number anchored, like 4#WB) outperforming traditional replace-mode edits (old_string/new_string matching) for coding agents.
I've been experimenting with building my own harness (tau), and wanted to verify this result and see if I should consider using hashline as the default edit strategy there.
So I built edit-bench to test this myself across multiple languages and models.
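
For readers who haven't seen the two formats, here is a minimal sketch of how each might be applied. The hashing scheme (a 2-character SHA-1 prefix) and the "must match exactly once" rule are assumptions made for the example; they are not taken from The Harness Problem, tau, or edit-bench.

```python
import hashlib


def tag(line_no: int, text: str) -> str:
    """Build an anchor like '4#ab': line number plus a short content hash.

    The 2-hex-character SHA-1 prefix is an arbitrary choice for this sketch.
    """
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:2]
    return f"{line_no}#{digest}"


def apply_hashline_edit(lines: list[str], anchor: str, new_text: str) -> list[str]:
    """Replace the anchored line, refusing if the hash no longer matches."""
    line_no = int(anchor.split("#", 1)[0])
    if not 1 <= line_no <= len(lines):
        raise ValueError(f"line {line_no} out of range")
    if tag(line_no, lines[line_no - 1]) != anchor:
        raise ValueError(f"stale anchor {anchor}")
    return lines[: line_no - 1] + [new_text] + lines[line_no:]


def apply_replace_edit(source: str, old_string: str, new_string: str) -> str:
    """Replace-mode edit: old_string must occur exactly once (assumed rule)."""
    if source.count(old_string) != 1:
        raise ValueError("old_string must match exactly once")
    return source.replace(old_string, new_string)


lines = ["def add(a, b):", "    return a - b"]
fixed_hashline = apply_hashline_edit(lines, tag(2, lines[1]), "    return a + b")
fixed_replace = apply_replace_edit("\n".join(lines), "a - b", "a + b")
```

The trade-off is visible even at this size: hashline needs the model to copy a short anchor correctly, while replace needs it to reproduce source text verbatim.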

Setup

edit-bench generates mutation-based tests from existing codebases.
You point a script at a directory, and it generates mutations like deleting a statement, flipping a boolean, swapping args, etc.
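
As an illustration of one such mutation, a boolean flip can be written as an `ast.NodeTransformer`. This is a sketch of the general technique, not edit-bench's code; `FlipBooleans` is a name invented here, and it flips every boolean literal rather than picking a single site.

```python
import ast


class FlipBooleans(ast.NodeTransformer):
    """Flip every True/False literal in the tree."""

    def visit_Constant(self, node: ast.Constant) -> ast.AST:
        if isinstance(node.value, bool):
            return ast.copy_location(ast.Constant(value=not node.value), node)
        return node


original = "def is_ready(x):\n    return x > 0 and True\n"
tree = FlipBooleans().visit(ast.parse(original))

# ast.unparse regenerates source text; comments and original formatting are lost.
mutated = ast.unparse(ast.fix_missing_locations(tree))
```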

Languages: Python (from hive), TypeScript (from oh-my-pi), Rust (from irradiate)
Models: gpt-4.1-mini, google/gemini-3-flash-preview, qwen/qwen3.5-397b-a17b
Edit modes: replace (old_string/new_string) vs hashline (line-number anchored)
20 tasks per language, single-attempt oneshot runs
I also recently added fuzzy matching to tau (trim cascade: trim_end → trim_both → unicode normalization) and wanted to see if this helps
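
The trim cascade can be sketched roughly as follows. This is a guess at the shape of the idea, not tau's implementation: matching is done on whole lines, NFKC is an assumed choice for the unicode step, and `find_block` is a name invented for the example.

```python
import unicodedata

# Stages are tried in order; names follow the cascade described above.
STAGES = [
    ("exact", lambda s: s),
    ("trim_end", lambda s: s.rstrip()),
    ("trim_both", lambda s: s.strip()),
    ("unicode", lambda s: unicodedata.normalize("NFKC", s.strip())),
]


def find_block(source: str, old_string: str):
    """Return (stage, start_line, end_line) of the first match, or None.

    A stage matches when every line of old_string equals the corresponding
    source line after that stage's normalization. Line indices are 0-based
    and end_line is exclusive.
    """
    src = source.splitlines()
    old = old_string.splitlines()
    if not old:
        return None
    for stage, norm in STAGES:
        wanted = [norm(line) for line in old]
        for start in range(len(src) - len(old) + 1):
            window = [norm(line) for line in src[start : start + len(old)]]
            if window == wanted:
                return stage, start, start + len(old)
    return None


source = "def f():\n    return 1   \n"
exact = find_block(source, "    return 1   ")
trimmed = find_block(source, "    return 1")
indent_lost = find_block(source, "return 1")
missing = find_block(source, "return 2")
```

Returning the stage name alongside the match is what makes it possible to count how often each fallback actually fires.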

Results

Replace mode:

Model	Python	TypeScript	Rust
gemini-3-flash	95%	80%	95%
qwen3.5-397b	90%	85%	85%
gpt-4.1-mini	65%	75%	45%

Hashline mode (from earlier runs):

Model	Python	TypeScript	Rust
gemini-3-flash	70%	85%	90%
qwen3.5-397b	85%	85%	90%
gpt-4.1-mini	50%	70%	55%

Hashline hurts Python noticeably, and seems roughly neutral on TypeScript and Rust.
The language-dependence is interesting — Python's significant whitespace might make line-anchored edits more error-prone.

Does Fuzzy Matching Help?

Apparently not.

I added trace collection to see if tau's fuzzy trim cascade ever fires during replace-mode runs. Across 114 successful edits and 20 failed edits (3 models × 3 languages), fuzzy matching triggered zero times.

Of the 20 failed edits:

1 had trailing whitespace (theoretically fixable)
~8 included line numbers in old_string (model bug)
~11 had completely hallucinated content

When models get old_string right, they get whitespace right too.
When they get it wrong, they get it very wrong — trim cascading doesn't help.

(Trace analysis details)

Takeaways

Hashline vs replace is not a clear winner either way. The effect is language-dependent and model-dependent. Python penalizes hashline; TypeScript is neutral; Rust is a toss-up.
Can's results are hard to generalize. The react-edit-benchmark is JavaScript-only and uses an LSP for validation feedback. Our setup (no LSP, multiple languages) shows a different picture. The LSP feedback loop in particular is a likely confound: giving the model type errors to retry against is a meaningful boost that interacts with edit format.
Fuzzy matching is a non-problem for current models. LLMs either reproduce source text exactly or hallucinate something completely different. The whitespace near-miss case that fuzzy matching targets basically doesn't happen in practice.
For current-gen models in contemporary harnesses, edit format is not the bottleneck. The gap between models (gemini-3-flash at 80-95% vs gpt-4.1-mini at 45-75% in replace mode) dwarfs the gap between edit formats. Invest in model selection and prompt engineering before worrying about edit format.

Obligatory disclaimer: small n, not statistically rigorous, treat accordingly.

All data: nwyin/edit-bench, issues #13 and #14.

Notes on Implementing Raft for the First Time

nwyin — Mon, 23 Mar 2026 09:22:45 +0000

I implemented the Raft consensus algorithm (the poster child of distributed algorithms) in Python.
It's a pretty bad implementation!
But also (somewhat) correct.

Here are some notes I'd share with anyone else who's interested in taking on a similar challenge.

In hindsight, these were the most useful resources for learning about Raft and implementing it correctly.

The Raft paper (read up to section 5 and reference figure 2 heavily)
Students' Guide to Raft
one of the most widely used Raft implementations
- clone the repo, skim raft.go and go back and forth with an LLM to understand the code base and design decisions
Eli Bendersky's blog series

I'd suggest spending an hour or so reading the paper first, then stubbing out some code for a UDP or TCP server that reads incoming bytes and adds them to an array.
I then followed along with Eli's implementation, adding features to my Raft implementation in the same order.

After getting something that looks like elections working, I started looking for bugs and errors in my understanding of the algorithm.
I'd go back and forth between the students' guide, Figure 2 in the Raft paper, and my implementation, thinking carefully about where my implementation was the same (or differed).
I also heavily used an LLM to review this code, adding material from the above resources into the context.

Repeat the above process for log replication, persistence, etc.

re: implementation

I made some simplifying design choices in my implementation.
In no particular order:

1. each node runs and processes messages on a single thread
2. use a "logical clock" to keep track of local "time" on the system (e.g. tick() increments a counter local to each node, vs using system time)
3. "muddy" the implementation by having everything in one file, e.g. network parsing, storage/persistence, the core Raft algorithm, and utilities/commands for controlling the node itself

2 seems like a sound and correct design choice (logical clocks are what's used in etcd's implementation).
3 is arguably better for learning/pedagogy.
It's nice to have everything in one file so you can see it all at once, and gives you a nice implementation you can rip up and see which abstractions fit the algorithm the best.

1 is a bit of an egregious choice to me.
It does make the implementation far simpler (you worry less about getting into deadlocks and atomic updates to the node's internal state), but you also end up with something that isn't quite Raft.
For a first implementation, this seems fine.
The algorithm is complex enough and I think you'd rather spend your time debugging logical errors in the core Raft algorithm vs fussing with mutexes.
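
To show what the logical clock buys you, here is a minimal sketch of a tick-driven election timer. It is not my actual implementation: the `Node` class, the timeout range, and the omission of vote requests and leader heartbeats are all simplifications for the example.

```python
import random


class Node:
    """Tick-driven election timer; the caller decides how often to call tick()."""

    def __init__(self, timeout_range=(10, 20), seed=None):
        self._rng = random.Random(seed)
        self._timeout_range = timeout_range
        self.role = "follower"
        self.term = 0
        self._reset_timer()

    def _reset_timer(self):
        # Randomized timeouts reduce the chance of split votes.
        self.elapsed = 0
        self.timeout = self._rng.randint(*self._timeout_range)

    def on_heartbeat(self):
        """A valid message from the leader resets the election timer."""
        self.role = "follower"
        self._reset_timer()

    def tick(self):
        """Advance logical time by one unit; start an election on timeout."""
        if self.role == "leader":
            return
        self.elapsed += 1
        if self.elapsed >= self.timeout:
            self.role = "candidate"
            self.term += 1
            self._reset_timer()


# With a fixed timeout of 3 ticks, the third tick triggers an election.
node = Node(timeout_range=(3, 3))
for _ in range(2):
    node.tick()
role_before = node.role
node.tick()
role_after = node.role
```

Because time only advances when `tick()` is called, tests can step a node through an election deterministically without sleeping.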

I'd consider implementing Raft this way as a ~30-hour project.
The initial reading of the Raft paper and reviewing related materials should take a few hours.
I did the bulk of the coding in ~3 days during the holidays, hacking for about 6-8 hours/day.
I still have some things to polish and improve (e.g. fix some subtle bugs) in the existing implementation, which might be another half a day of work.

All in all, not too bad for understanding one of the core algorithms that powers so much infrastructure.

Reverse-Engineering Claude Code Agent Teams: Architecture and Protocol

nwyin — Mon, 23 Mar 2026 09:22:01 +0000

Introduction

Claude Code (v2.1.47) ships with an experimental feature called Agent Teams: multiple Claude Code sessions coordinate on shared work through a lead-and-teammates topology. I've been building Hive, a multi-agent coding orchestrator with similar goals but a very different architecture, so I wanted to understand how Anthropic's approach works under the hood.

This post documents what I found through:

Reading the official documentation
Examining actual artifacts left on disk by previous team sessions
Letting Claude analyze the Claude Code binary (v2.1.47) for implementation details (hah!)

1. Architecture Overview
2. The Shared Task List
3. Inter-Agent Communication
4. Agent Spawning and Lifecycle
5. Quality Gates and Hooks
6. Token Economics
7. Architecture Summary
Sources

1. Architecture Overview

An agent team consists of four components:

Component	Role
Team lead	The main Claude Code session that creates the team, spawns teammates
Teammates	Separate Claude Code instances, each with its own context window
Task list	Shared work items stored as individual JSON files on disk
Mailbox	Per-agent inbox files for message delivery

The entire coordination layer is file-based. The filesystem at ~/.claude/ is the sole coordination substrate:

~/.claude/
├── teams/{team-name}/
│   ├── config.json                  # team membership registry
│   └── inboxes/{agent-name}.json    # per-agent mailbox
└── tasks/{team-name}/
    ├── .lock                        # flock() for concurrent task claiming
    ├── .highwatermark               # auto-increment counter
    ├── 1.json                       # individual task files
    ├── 2.json
    └── ...

This is a fundamentally decentralized design. The lead is just another Claude session with extra tools (TeamCreate, TeamDelete, SendMessage). There is no background process. Coordination emerges from shared file access.

In an active session, if you ask Claude to spin up a team to do some kind of task and then run the following in another window, you can observe the filesystem update in real time.

watch -n 0.5 'tree ~/.claude/teams/ 2>/dev/null; echo "---"; tree ~/.claude/tasks/ 2>/dev/null'

For example, with the following prompt:

can you spawn an agent team to examine this code base?
  - have one look for bugs
  - have one look for complexity
  - have one look for good things to call out and play devil's advocate against the other two agents

I observed this:

teams
└── code-review
    ├── config.json
    └── inboxes
        ├── bug-hunter.json
        ├── complexity-analyst.json
        ├── devils-advocate.json
        └── team-lead.json

Team Config

The team config at ~/.claude/teams/{team-name}/config.json contains a members array that teammates read to discover each other:

{
  "members": [
    { "name": "team-lead", "agentId": "abc-123", "agentType": "leader" },
    {
      "name": "researcher",
      "agentId": "def-456",
      "agentType": "general-purpose"
    }
  ]
}

Names are the primary addressing mechanism (UUIDs exist but aren't used for routing). All messaging and task assignment uses the name field.
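
Since names are the addressing key, a reader of this file would most naturally index members by name. A small sketch, run against a temporary file rather than the real ~/.claude directory; `load_members` is a helper invented for the example.

```python
import json
import tempfile
from pathlib import Path


def load_members(config_path: Path) -> dict[str, dict]:
    """Index the `members` array by name, the field used for addressing."""
    config = json.loads(config_path.read_text())
    return {member["name"]: member for member in config.get("members", [])}


# Demo against a temporary file that mirrors the config shown above.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"
    path.write_text(json.dumps({
        "members": [
            {"name": "team-lead", "agentId": "abc-123", "agentType": "leader"},
            {"name": "researcher", "agentId": "def-456", "agentType": "general-purpose"},
        ]
    }))
    members = load_members(path)
```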

2. The Shared Task List

File Format

Each task is stored as an individual JSON file in ~/.claude/tasks/{team-name}/. Here's a real example from a previous session:

{
  "id": "1",
  "subject": "Hunt for bugs across the codebase",
  "description": "...",
  "activeForm": "Hunting for bugs",
  "owner": "bug-hunter",
  "status": "completed",
  "blocks": [],
  "blockedBy": []
}

Task schema:

Field	Type	Description
`id`	string	Numeric ID, auto-incremented via `.highwatermark`
`subject`	string	Imperative-form title (e.g., "Run tests")
`description`	string	Detailed requirements and acceptance criteria
`activeForm`	string	Present-continuous form for spinner display ("Running tests")
`owner`	string	Name of the agent that owns the task (e.g., "bug-hunter")
`status`	string	`pending` → `in_progress` → `completed` (or `deleted`)
`blocks`	string[]	Task IDs that this task blocks
`blockedBy`	string[]	Task IDs that must complete before this task can start

Concurrency Control

Two special files provide coordination:

.lock: A 0-byte file used for filesystem-level mutual exclusion (flock()). Present in all 42 task directories observed on my machine.
.highwatermark: Contains a single integer (e.g., "3", "13") that tracks the next available task ID for auto-incrementing.

Task Claiming

Task claiming uses file locking to prevent race conditions. Teammates prefer lowest-ID-first ordering. A task with a non-empty blockedBy array cannot be claimed until all blocking tasks are in a terminal state.
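
The claiming rules above can be sketched with `fcntl.flock` (Unix only). This is my reconstruction of the behavior, not code recovered from the binary: `claim_next`, the set of terminal states, and the treatment of a missing blocker file as "still blocked" are all assumptions.

```python
import fcntl
import json
import tempfile
from pathlib import Path

TERMINAL = {"completed", "deleted"}  # assumed set of terminal states


def claim_next(task_dir: Path, agent: str):
    """Claim the lowest-ID pending task whose blockers are all terminal."""
    with open(task_dir / ".lock", "a") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # released when the file is closed
        tasks = {p.stem: json.loads(p.read_text()) for p in task_dir.glob("*.json")}
        for task_id in sorted(tasks, key=int):
            task = tasks[task_id]
            # A blocker with no task file is treated as still blocking.
            unblocked = all(
                tasks.get(dep, {}).get("status") in TERMINAL
                for dep in task.get("blockedBy", [])
            )
            if task["status"] == "pending" and not task.get("owner") and unblocked:
                task.update(owner=agent, status="in_progress")
                (task_dir / f"{task_id}.json").write_text(json.dumps(task, indent=2))
                return task_id
    return None


with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "1.json").write_text(json.dumps({"id": "1", "status": "pending", "blockedBy": []}))
    (d / "2.json").write_text(json.dumps({"id": "2", "status": "pending", "blockedBy": ["1"]}))
    first = claim_next(d, "bug-hunter")          # claims task 1
    second = claim_next(d, "complexity-analyst")  # task 2 is still blocked by 1
```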

Observation: Most Task Directories Are Empty

Of 42 task directories on my machine, only 5 contained actual task JSON files. The remaining 37 had only .lock and .highwatermark. This likely means tasks are cleaned up after completion, or these were sessions where Claude used the internal task list (available since the task list feature launch) without decomposing into subtask files.

3. Inter-Agent Communication

Mailbox Pattern

Each agent has a JSON array file at ~/.claude/teams/{team-name}/inboxes/{agent-name}.json. Here's a real inbox from a previous session where a team-lead dispatched work to a controlplane-agent:

[
  {
    "from": "team-lead",
    "text": "{\"type\":\"task_assignment\",\"taskId\":\"1\",\"subject\":\"Phase 2: Control-plane - remove participants/presence\",\"description\":\"Remove multiplayer code from the control-plane package...\",\"assignedBy\":\"team-lead\",\"timestamp\":\"2026-02-18T02:37:16.890Z\"}",
    "timestamp": "2026-02-18T02:37:16.890Z",
    "read": false
  }
]

Note the JSON-in-JSON encoding: the text field is a JSON string containing a serialized message object. The outer envelope has from, text, timestamp, and read fields.
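
In practice this means a reader has to parse twice. A minimal sketch, using a trimmed copy of the message above; the plain-text fallback is an assumption, since I only observed JSON payloads.

```python
import json

raw_inbox = json.dumps([
    {
        "from": "team-lead",
        "text": json.dumps({
            "type": "task_assignment",
            "taskId": "1",
            "subject": "Phase 2: Control-plane - remove participants/presence",
        }),
        "timestamp": "2026-02-18T02:37:16.890Z",
        "read": False,
    }
])


def decode_inbox(raw: str) -> list[dict]:
    """Parse the outer array, then parse each `text` field a second time."""
    messages = []
    for envelope in json.loads(raw):
        try:
            payload = json.loads(envelope["text"])
        except json.JSONDecodeError:
            # Assumption: some messages may carry plain text instead of JSON.
            payload = {"type": "message", "text": envelope["text"]}
        messages.append({**envelope, "payload": payload})
    return messages


decoded = decode_inbox(raw_inbox)
```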

Message Types

The type field inside the text payload supports:

Type	Direction	Purpose
`task_assignment`	lead → teammate	Assign a task with full details
`message`	any → any	Direct message to one recipient
`broadcast`	lead → all	Same message to every teammate
`shutdown_request`	lead → teammate	Request graceful shutdown
`shutdown_response`	teammate → lead	Approve or reject shutdown
`plan_approval_request`	teammate → lead	Submit plan for review
`plan_approval_response`	lead → teammate	Approve or reject with feedback
`idle_notification`	teammate → lead	Auto-sent when teammate's turn ends

Delivery Mechanism

Write path: The sender appends a new entry to the recipient's inbox JSON array file.

Read path: The recipient polls their own inbox file. New messages are injected as synthetic conversation turns (they appear as if a user sent them).

Broadcast: Literally writes the same message to every teammate's inbox file. Token cost scales linearly with team size.

Communication is just file append + file read. Latency between send and receive depends on the recipient's poll interval.
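
The whole mailbox can be approximated in a few lines. This is a sketch of the pattern, not Claude Code's implementation: `send` and `poll` are invented names, the payload fields are made up, and there is no locking around the read-modify-write, which a real implementation with concurrent writers would need.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def send(inbox: Path, sender: str, payload: dict) -> None:
    """Write path: append one envelope to the recipient's inbox array."""
    messages = json.loads(inbox.read_text()) if inbox.exists() else []
    messages.append({
        "from": sender,
        "text": json.dumps(payload),  # JSON-in-JSON, as in the real inbox files
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "read": False,
    })
    inbox.write_text(json.dumps(messages, indent=2))


def poll(inbox: Path) -> list[dict]:
    """Read path: return unread payloads and mark them as read."""
    if not inbox.exists():
        return []
    messages = json.loads(inbox.read_text())
    unread = [m for m in messages if not m["read"]]
    for message in unread:
        message["read"] = True
    inbox.write_text(json.dumps(messages, indent=2))
    return [json.loads(m["text"]) for m in unread]


with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp) / "researcher.json"
    send(inbox, "team-lead", {"type": "message", "content": "status?"})
    first_poll = poll(inbox)   # delivers the message
    second_poll = poll(inbox)  # nothing new
```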

Peer DM Visibility

When a teammate sends a DM to another teammate, a brief summary is included in the lead's idle notification. This gives the lead visibility into peer collaboration without the full message content.

4. Agent Spawning and Lifecycle

How Teammates Are Created

Each teammate is a separate claude CLI process. The lead spawns them via the Task tool with team_name and name parameters. Environment variables are set on the spawned process:

CLAUDE_CODE_TEAM_NAME: auto-set on spawned teammates
CLAUDE_CODE_PLAN_MODE_REQUIRED: set to true if plan approval is required

Context Initialization

Teammates load the same project context as any fresh session:

CLAUDE.md files from the working directory
MCP servers
Skills
The spawn prompt from the lead

The lead's conversation history does NOT carry over. Each teammate starts fresh with only the spawn prompt as context.

Internal Implementation

From binary analysis of Claude Code v2.1.47, the teammate context is managed via AsyncLocalStorage with these fields:

agentId, agentName, teamName
parentSessionId, color
planModeRequired

Key internal functions:

isTeammate() / isTeamLead(): role detection
waitForTeammatesToBecomeIdle(): synchronization primitive for the lead
getTeammateContext() / setDynamicTeamContext(): runtime context management

Idle Detection

After every LLM turn, a teammate automatically goes idle and sends an idle_notification to the lead. This is the normal resting state, rather than an error or staleness condition. Sending a message to an idle teammate wakes it (the next poll cycle picks up the inbox message).

Shutdown Protocol

Lead sends shutdown_request to a teammate
Teammate can approve (exits gracefully) or reject (continues working with an explanation)
Team cleanup via TeamDelete removes ~/.claude/teams/{team-name}/ and ~/.claude/tasks/{team-name}/
Cleanup fails if any teammates are still active; they must be shut down first

Permission Inheritance

Teammates inherit the lead's permission mode at spawn time. If the lead runs --dangerously-skip-permissions, all teammates do too. Individual modes can be changed post-spawn but not configured per-teammate at spawn time.

5. Quality Gates and Hooks

Agent Teams integrates with Claude Code's hook system for quality enforcement:

TeammateIdle Hook

Fires when a teammate is about to go idle. Exit code 2 sends stderr as feedback and prevents idle, keeping the teammate working.

{
  "hook_event_name": "TeammateIdle",
  "teammate_name": "researcher",
  "team_name": "my-project"
}

TaskCompleted Hook

Fires when a task is being marked complete. Exit code 2 prevents completion and feeds stderr back as feedback.

{
  "hook_event_name": "TaskCompleted",
  "task_id": "task-001",
  "task_subject": "Implement user authentication",
  "task_description": "Add login and signup endpoints",
  "teammate_name": "implementer",
  "team_name": "my-project"
}

This fires in two situations: (1) when any agent explicitly marks a task completed via TaskUpdate, or (2) when an agent team teammate finishes its turn with in-progress tasks.

Hook Handler Types

Type	Description
`command`	Shell script. JSON on stdin, exit codes for decisions.
`prompt`	Single-turn LLM evaluation. Returns `{ok, reason}`.
`agent`	Multi-turn subagent with read tools. Up to 50 turns.
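
As an illustration of the `command` handler contract (JSON on stdin, exit code 2 to block), here is a sketch of a handler written in Python instead of shell, assuming the hook command can invoke any executable. The policy itself is a toy: it rejects a TaskCompleted event whose description is empty.

```python
import json
import sys

ALLOW, BLOCK = 0, 2  # exit code 2 blocks the action and returns stderr as feedback


def decide(event: dict) -> tuple[int, str]:
    """Toy policy: a task cannot be completed with an empty description."""
    if event.get("hook_event_name") != "TaskCompleted":
        return ALLOW, ""
    if not event.get("task_description", "").strip():
        return BLOCK, f"Task {event.get('task_id')} has no description; add acceptance criteria."
    return ALLOW, ""


def main() -> int:
    """Entry point when run as a hook: read the event, report the decision."""
    code, feedback = decide(json.load(sys.stdin))
    if feedback:
        print(feedback, file=sys.stderr)
    return code


ok = decide({
    "hook_event_name": "TaskCompleted",
    "task_id": "task-001",
    "task_description": "Add login and signup endpoints",
})
blocked = decide({
    "hook_event_name": "TaskCompleted",
    "task_id": "task-002",
    "task_description": "",
})
```

Saved as a script, the file would end with `sys.exit(main())` so the exit code reaches Claude Code.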

6. Token Economics

Agent teams use approximately 7× more tokens than standard sessions when teammates run in plan mode. Each teammate maintains its own full context window as a separate Claude instance.

Baseline Reference

Average Claude Code usage: ~$6/developer/day
Agent teams: roughly proportional to team size on top of baseline

7. Architecture Summary

Dimension	Claude Code Agent Teams
Coordination substrate	Flat files (`~/.claude/tasks/`, `~/.claude/teams/`)
Task format	One JSON file per task + `.lock` for claiming
Messaging	JSON inbox files (append + poll)
Agent lifecycle	Self-managing CLI processes
Work isolation	Shared working directory
Merge strategy	None (agents edit files directly)
Retry/escalation	Manual (lead decides, or user intervenes)
Topology	Lead + flat peers, peer-to-peer messaging
Scheduling	Self-claim (teammates grab next task)
State durability	Files only; no in-process teammate resumption
Quality gates	Shell hooks (`TeammateIdle`, `TaskCompleted`)
Token tracking	Per-session only, no cross-agent aggregation
Stall detection	Manual (user notices teammate stopped)
Concurrency control	Implicit (team size = teammate count)
Dependency model	`blocks`/`blockedBy` on task files

Sources

Official Documentation

On-Disk Artifacts (Claude Code v2.1.47)

Observed at /Users/tau/.claude/:

Team directories with config.json and inboxes/{agent-name}.json files
Task directories with .lock, .highwatermark, and individual task JSON files
Sample task assignment message from team-lead to cp-agent, timestamped 2026-02-18T02:37:16.890Z

Binary Analysis

Claude Code binary v2.1.47. Internal functions identified via string analysis: getTeamName, getAgentName, getAgentId, isTeammate, isTeamLead, waitForTeammatesToBecomeIdle, getTeammateContext, setDynamicTeamContext, createTeammateContext. AsyncLocalStorage context fields: agentId, agentName, teamName, parentSessionId, color, planModeRequired.

Hive Codebase

Hive Technical Design Doc

Agent Use Patterns

nwyin — Mon, 23 Mar 2026 09:22:00 +0000

It's a tricky thing, managing so many agents.
Too many things can go wrong!

But it's also clear that there are a lot of ways things can go right.

work queue + orchestrator
polling agent/"cron job"
message queue + responder

what are the patterns that make using these things useful?
memory + context management

CLI TOOLS ARE A MUST

A CLI is the ultimate ergonomic interface for text-based beings; if you can wrap a CLI around your workflow, the agents become so much better