Last year I had 10 open tickets, a week-long deadline, and three AI coding agents installed on my machine. Claude Code, Codex, Gemini CLI. Each one individually capable of knocking out a task in minutes. Together? Absolute chaos.
Agent A edits auth.py. Agent B edits auth.py. Agent A's changes get silently overwritten. Meanwhile, Agent C decides to "refactor" the test suite and breaks everything. Nobody runs the linter. Nobody checks types. I spend more time mediating conflicts than I would have just writing the code myself.
So I built an orchestrator. And the single most important design decision I made was: the orchestrator is not an LLM.
## The insight that changed everything
My first attempt used an LLM to coordinate the other LLMs. A "manager" agent would read the backlog, decide what to assign where, check in on progress, re-plan when things failed.
It was slow. It was expensive. It hallucinated priorities. It forgot what it had already assigned. It spent 40% of total tokens on coordination overhead — not on actual coding.
Then I had a realization that felt almost too obvious: scheduling is a solved problem. Operating systems have been scheduling concurrent processes since the 1960s. We don't use neural networks for cron. Why was I using one for task assignment?
I ripped out the LLM scheduler and replaced it with deterministic Python. The result is Bernstein — an open-source multi-agent orchestrator that coordinates any CLI coding agent with zero LLM tokens on scheduling decisions.
## Architecture: how it actually works
The pipeline is four stages:
- Decompose — An LLM (this is the only place one is used) takes your goal and breaks it into a task graph with roles, owned files, dependencies, and completion signals.
- Spawn — Each task gets a fresh CLI agent in an isolated git worktree. Agents work in parallel. Main branch stays untouched.
- Verify — A janitor process checks concrete signals: tests pass, files exist, linter clean, types correct. No vibes. No "looks good to me."
- Merge — Verified work lands on main. Failed tasks get retried, routed to a different model, or decomposed further.
```
Goal → LLM Planner → Task Graph → Orchestrator → Agents (parallel)
                                                        ↓
                                                 Janitor (verify)
                                                        ↓
                                                Git merge → main
```
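The verify stage is worth dwelling on, because it's the part that keeps LLMs honest. Here's a toy sketch of a deterministic janitor; the check commands and names are my assumptions for illustration, not Bernstein's actual code:

```python
import subprocess
from dataclasses import dataclass, field

# Hypothetical signal commands; a real janitor's checks would be configurable.
CHECKS = {
    "tests": ["pytest", "-q"],
    "lint":  ["ruff", "check", "."],
    "types": ["mypy", "."],
}

@dataclass
class VerifyResult:
    passed: bool
    failures: list = field(default_factory=list)

def verify(worktree: str, run=subprocess.run) -> VerifyResult:
    """A check passes iff its command exits 0 in the agent's worktree.
    No LLM judgment anywhere in the loop."""
    failures = []
    for name, cmd in CHECKS.items():
        proc = run(cmd, cwd=worktree, capture_output=True)
        if proc.returncode != 0:
            failures.append(name)
    return VerifyResult(passed=not failures, failures=failures)
```

The `run` parameter is injected only so the sketch is testable; the point is that "verified" means concrete exit codes, never a model's opinion.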
The orchestrator is a Python event loop that polls a local task server, matches open tasks to available agents, and manages their lifecycle. It's deterministic, auditable, and reproducible. If you run it twice with the same inputs, you get the same scheduling decisions.
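That matching step needs no model at all. A minimal sketch of what a deterministic scheduler can look like (the names and the priority scheme are my assumptions, not Bernstein's internals):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    id: int
    priority: int            # lower = more urgent
    owned_files: frozenset   # files this task has declared ownership of

def schedule(open_tasks, idle_agents, in_flight_files):
    """Deterministic matching: sort tasks by (priority, id) so two runs with
    identical inputs produce identical assignments, and skip any task whose
    owned files overlap with work already in flight."""
    assignments = []
    busy = set(in_flight_files)
    agents = list(idle_agents)
    for task in sorted(open_tasks, key=lambda t: (t.priority, t.id)):
        if not agents:
            break
        if task.owned_files & busy:
            continue  # ownership conflict: defer until the other task lands
        assignments.append((agents.pop(0), task))
        busy |= task.owned_files
    return assignments
```

No tokens spent, no hallucinated priorities, and the tie-break on `id` makes the ordering total, which is what makes reruns reproducible.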
## The worktree trick
This is the part that made everything click. Instead of letting agents stomp on the same working tree, each agent gets its own git worktree on a disposable branch:
```bash
# Bernstein does this internally for each spawned agent:
git worktree add .sdd/worktrees/session-abc123 -b agent/session-abc123

# Agent works in complete isolation...

# On success, the janitor verifies, then (from the main working tree,
# where main is already checked out; a branch can only be checked out
# in one worktree at a time):
git merge agent/session-abc123 --no-ff

# Cleanup:
git worktree remove .sdd/worktrees/session-abc123
git branch -d agent/session-abc123
```
Each agent thinks it owns the entire repo. No merge conflicts during work. No file locks. No coordination protocol between agents. When the janitor passes the work, it merges cleanly because tasks have declared file ownership — the orchestrator won't assign overlapping files to concurrent agents.
The worktrees live under `.sdd/worktrees/` (Bernstein's state directory). Expensive directories like `node_modules` or `.venv` get symlinked from the main tree so you don't pay the setup cost per agent.
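The spawn-plus-symlink dance is small enough to sketch. This is my own illustrative version, not Bernstein's code; the `SHARED` list and function names are assumptions:

```python
import subprocess
from pathlib import Path

# Directories too expensive to rebuild per agent (assumed configurable).
SHARED = ["node_modules", ".venv"]

def spawn_worktree(repo: Path, session: str, run=subprocess.run) -> Path:
    """Create an isolated worktree on a disposable branch, then symlink
    heavyweight directories from the main tree instead of reinstalling."""
    wt = repo / ".sdd" / "worktrees" / session
    run(["git", "worktree", "add", str(wt), "-b", f"agent/{session}"],
        cwd=repo, check=True)
    for name in SHARED:
        src = repo / name
        dst = wt / name
        if src.is_dir() and not dst.exists():
            dst.parent.mkdir(parents=True, exist_ok=True)
            dst.symlink_to(src, target_is_directory=True)
    return wt
```

The symlinks are the difference between a worktree costing milliseconds and costing an `npm install`.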
## Model routing: contextual bandits, not vibes
Not every task needs Opus. Renaming a variable doesn't require a $15/million-token model. But deciding which model to use per task is genuinely hard to do with static rules.
Bernstein uses a LinUCB contextual bandit that learns from outcomes. The feature vector for each task includes:
- Complexity tier (low / medium / high)
- Scope (number of files)
- Role (backend, frontend, security, etc.)
- Estimated token budget
The reward signal is `quality_score * (1 - normalized_cost)` — it optimizes for the cheapest model that passes the janitor.
```python
# From bernstein/core/bandit_router.py (simplified)
from dataclasses import dataclass

@dataclass
class TaskContext:
    role: str
    complexity_tier: int   # 0=LOW, 1=MEDIUM, 2=HIGH
    scope_tier: int        # 0=SMALL, 1=MEDIUM, 2=LARGE
    priority_norm: float   # 0=critical, 1=nice-to-have
    file_count: int
    estimated_tokens: float

# LinUCB selects from available arms: ["haiku", "sonnet", "opus"]
# High-stakes roles (manager, architect, security) never start at haiku
```
During cold start (under 50 completions), it falls back to static cascade routing: haiku for simple stuff, sonnet for medium, opus for hard. After warm-up, the bandit takes over. The policy persists across runs in `.sdd/routing/policy.json`, so learning accumulates over time.
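For the curious, the core of LinUCB plus a cold-start cascade fits in a few lines. This is a generic textbook sketch under my own assumptions (feature dimension, `alpha`, class names), not Bernstein's implementation:

```python
import numpy as np

ARMS = ["haiku", "sonnet", "opus"]
COLD_START_MIN = 50  # below this many completions, use the static cascade

class LinUCBRouter:
    """Each arm keeps A = I + sum(x x^T) and b = sum(reward * x).
    Score = theta^T x + alpha * sqrt(x^T A^-1 x): exploitation plus an
    optimism bonus that shrinks as the arm accumulates data."""

    def __init__(self, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = {arm: np.eye(dim) for arm in ARMS}
        self.b = {arm: np.zeros(dim) for arm in ARMS}
        self.completions = 0

    def select(self, x: np.ndarray, complexity_tier: int) -> str:
        if self.completions < COLD_START_MIN:
            return ARMS[complexity_tier]  # cascade: low→haiku ... high→opus

        def ucb(arm):
            A_inv = np.linalg.inv(self.A[arm])
            theta = A_inv @ self.b[arm]
            return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

        return max(ARMS, key=ucb)

    def update(self, arm: str, x: np.ndarray, quality: float, norm_cost: float):
        reward = quality * (1.0 - norm_cost)  # cheap model that passes = high reward
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
        self.completions += 1
```

The policy is just the `A` and `b` matrices, which is why it can be serialized to a JSON file and reloaded across runs.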
In practice, this cuts costs by roughly 23% compared to using the same model for everything, because most tasks are boilerplate that cheap models handle fine.
## How it compares to CrewAI, AutoGen, and LangGraph
I keep getting asked this, so here's the honest breakdown:
|  | Bernstein | CrewAI | AutoGen | LangGraph |
|---|---|---|---|---|
| Scheduling | Deterministic code | LLM-driven | LLM-driven | Graph + LLM |
| Works with | Any CLI agent (18+) | Python SDK classes | Python agents | LangChain nodes |
| Git isolation | Worktrees per agent | No | No | No |
| Verification | Janitor + quality gates | No | No | Conditional edges |
| Agent lifetime | Short (spawn, work, exit) | Long-running | Long-running | Long-running |
| State model | File-based (.sdd/) | In-memory | In-memory | Checkpointer |
The core difference is philosophical. CrewAI, AutoGen, and LangGraph are frameworks — you write agents in their SDK, using their abstractions. Bernstein is infrastructure — it orchestrates CLI agents you already have installed. You don't write Bernstein agents. You point Bernstein at Claude Code or Codex or Gemini CLI (or all three in the same run) and it handles the rest.
The other frameworks also use LLMs for coordination, which means scheduling decisions are non-deterministic, expensive, and hard to debug. When Bernstein assigns task #47 to Sonnet, you can trace exactly why: the bandit policy selected it based on the task's feature vector, and you can read the policy file to verify. No prompt archaeology required.
The trade-off is real, though. Bernstein doesn't have agent-to-agent chat, built-in RAG, or a cloud-hosted option. It's a CLI tool for people who want their agents to write code and get out.
## Real results
Here's the part where I tell you Bernstein built itself.
During a 47-hour development marathon, I had 12 agents running on a single MacBook. The system consumed its own backlog: 737 tickets closed, 826 commits generated, averaging 15.7 tasks per hour. Bernstein's codebase — all 522,000+ lines of it — was largely written by agents that Bernstein orchestrated.
The `--evolve` flag takes this further. It analyzes its own metrics (which models failed, which tasks took too long, which prompts produced bad output) and proposes improvements to routing rules and prompt templates. Then agents implement those improvements. It's not AGI. It's a feedback loop.
Benchmarks against single-agent baseline: 1.78x faster on a 12-task test suite, 23% lower cost from model routing.
## What still sucks
I'm not going to pretend this is solved.
Agents still hallucinate file paths. They'll confidently import from modules that don't exist. The janitor catches this, but it means wasted cycles and retries.
Context windows fill up. On large codebases, agents run out of context and start forgetting earlier instructions. Short-lived agents help (fresh context per task), but it's still a fundamental constraint.
Cost adds up. 12 parallel agents burning Opus tokens is not cheap. The bandit router helps, budgets help, but if you're not paying attention you can blow through $50 in an afternoon.
Non-trivial setup. You need at least one CLI agent installed and authenticated. You need API keys. The `bernstein doctor` command catches most config issues, but it's not zero-friction yet.
Merge conflicts still happen. Despite file ownership declarations, agents occasionally touch files they weren't supposed to. The janitor catches regressions, but conflict resolution still needs work.
This is v1.4, not v10. Lots of rough edges.
## Getting started
```bash
pip install bernstein
cd your-project

bernstein init                    # creates .sdd/ workspace + config
bernstein -g "Add rate limiting"  # agents spawn, work, verify, merge
bernstein live                    # watch progress in the TUI
bernstein cost                    # see what you spent
```
For multi-stage projects, write a YAML plan:
```yaml
# plan.yaml
stages:
  - name: backend
    steps:
      - goal: "Add rate limiting middleware"
        role: backend
        complexity: medium
      - goal: "Write integration tests for rate limiter"
        role: qa
        complexity: low
  - name: docs
    depends_on: [backend]
    steps:
      - goal: "Document rate limiting API in OpenAPI spec"
        role: docs
        complexity: low
```
```bash
bernstein run plan.yaml            # deterministic execution, no LLM planning
bernstein run --dry-run plan.yaml  # preview tasks and estimated cost
```
Works with whatever CLI agents you have installed. Bernstein auto-discovers them. Mix Claude Code for architecture decisions, Gemini CLI for boilerplate, Aider with a local Ollama model for offline work — all in the same run.
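Auto-discovery can be as simple as checking `PATH`. A sketch under my own assumptions (the registry of CLI names here is hypothetical, not Bernstein's actual list):

```python
import shutil

# Hypothetical registry of known agent CLIs; the real list may differ.
KNOWN_AGENTS = ["claude", "codex", "gemini", "aider"]

def discover_agents(which=shutil.which) -> list:
    """An agent counts as installed if its CLI binary is on PATH."""
    return [name for name in KNOWN_AGENTS if which(name)]
```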
## Try it
GitHub repo — Apache 2.0, PRs welcome.
If you've been babysitting one agent at a time and wondering why your backlog isn't shrinking, maybe the answer isn't a better agent. Maybe it's a conductor.
"To achieve great things, two things are needed: a plan and not quite enough time." — Leonard Bernstein