Last year I had 10 open tickets, a week-long deadline, and three AI coding agents installed on my machine. Claude Code, Codex, Gemini CLI. Each one individually capable of knocking out a task in minutes. Together? Absolute chaos.
Agent A edits auth.py. Agent B edits auth.py. Agent A's changes get silently overwritten. Meanwhile, Agent C decides to "refactor" the test suite and breaks everything. Nobody runs the linter. Nobody checks types. I spend more time mediating conflicts than I would have just writing the code myself.
So I built an orchestrator. And the single most important design decision I made was: the orchestrator is not an LLM.
## The insight that changed everything
My first attempt used an LLM to coordinate the other LLMs. A "manager" agent would read the backlog, decide what to assign where, check in on progress, re-plan when things failed.
It was slow. It was expensive. It hallucinated priorities. It forgot what it had already assigned. It spent 40% of total tokens on coordination overhead — not on actual coding.
Then I had a realization that felt almost too obvious: scheduling is a solved problem. Operating systems have been scheduling concurrent processes since the 1960s. We don't use neural networks for cron. Why was I using one for task assignment?
I ripped out the LLM scheduler and replaced it with deterministic Python. The result is Bernstein — an open-source multi-agent orchestrator that coordinates any CLI coding agent with zero LLM tokens on scheduling decisions.
## Architecture: how it actually works
The pipeline is four stages:
- Decompose — An LLM (this is the only place one is used) takes your goal and breaks it into a task graph with roles, owned files, dependencies, and completion signals.
- Spawn — Each task gets a fresh CLI agent in an isolated git worktree. Agents work in parallel. Main branch stays untouched.
- Verify — A janitor process checks concrete signals: tests pass, files exist, linter clean, types correct. No vibes. No "looks good to me."
- Merge — Verified work lands on main. Failed tasks get retried, routed to a different model, or decomposed further.
```
Goal → LLM Planner → Task Graph → Orchestrator → Agents (parallel)
                                                        ↓
                                                 Janitor (verify)
                                                        ↓
                                                Git merge → main
```
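The verify stage is worth dwelling on, because it's the part that keeps LLMs honest. Here's a toy sketch of a deterministic janitor; the check commands and names are my assumptions for illustration, not Bernstein's actual code:

```python
import subprocess
from dataclasses import dataclass, field

# Hypothetical signal commands; a real janitor's checks would be configurable.
CHECKS = {
    "tests": ["pytest", "-q"],
    "lint":  ["ruff", "check", "."],
    "types": ["mypy", "."],
}

@dataclass
class VerifyResult:
    passed: bool
    failures: list = field(default_factory=list)

def verify(worktree: str, run=subprocess.run) -> VerifyResult:
    """A check passes iff its command exits 0 in the agent's worktree.
    No LLM judgment anywhere in the loop."""
    failures = []
    for name, cmd in CHECKS.items():
        proc = run(cmd, cwd=worktree, capture_output=True)
        if proc.returncode != 0:
            failures.append(name)
    return VerifyResult(passed=not failures, failures=failures)
```

The `run` parameter is injected only so the sketch is testable; the point is that "verified" means concrete exit codes, never a model's opinion.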
The orchestrator is a Python event loop that polls a local task server, matches open tasks to available agents, and manages their lifecycle. It's deterministic, auditable, and reproducible. If you run it twice with the same inputs, you get the same scheduling decisions.
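That matching step needs no model at all. A minimal sketch of what a deterministic scheduler can look like (the names and the priority scheme are my assumptions, not Bernstein's internals):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    id: int
    priority: int            # lower = more urgent
    owned_files: frozenset   # files this task has declared ownership of

def schedule(open_tasks, idle_agents, in_flight_files):
    """Deterministic matching: sort tasks by (priority, id) so two runs with
    identical inputs produce identical assignments, and skip any task whose
    owned files overlap with work already in flight."""
    assignments = []
    busy = set(in_flight_files)
    agents = list(idle_agents)
    for task in sorted(open_tasks, key=lambda t: (t.priority, t.id)):
        if not agents:
            break
        if task.owned_files & busy:
            continue  # ownership conflict: defer until the other task lands
        assignments.append((agents.pop(0), task))
        busy |= task.owned_files
    return assignments
```

No tokens spent, no hallucinated priorities, and the tie-break on `id` makes the ordering total, which is what makes reruns reproducible.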
## The worktree trick
This is the part that made everything click. Instead of letting agents stomp on the same working tree, each agent gets its own git worktree on a disposable branch:
```bash
# Bernstein does this internally for each spawned agent:
git worktree add .sdd/worktrees/session-abc123 -b agent/session-abc123

# Agent works in complete isolation...

# On success, the janitor verifies, then (from the main working tree,
# where main is already checked out; a branch can only be checked out
# in one worktree at a time):
git merge agent/session-abc123 --no-ff

# Cleanup:
git worktree remove .sdd/worktrees/session-abc123
git branch -d agent/session-abc123
```
Each agent thinks it owns the entire repo. No merge conflicts during work. No file locks. No coordination protocol between agents. When the janitor passes the work, it merges cleanly because tasks have declared file ownership — the orchestrator won't assign overlapping files to concurrent agents.
The worktrees live under `.sdd/worktrees/` (Bernstein's state directory). Expensive directories like `node_modules` or `.venv` get symlinked from the main tree so you don't pay the setup cost per agent.
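The spawn-plus-symlink dance is small enough to sketch. This is my own illustrative version, not Bernstein's code; the `SHARED` list and function names are assumptions:

```python
import subprocess
from pathlib import Path

# Directories too expensive to rebuild per agent (assumed configurable).
SHARED = ["node_modules", ".venv"]

def spawn_worktree(repo: Path, session: str, run=subprocess.run) -> Path:
    """Create an isolated worktree on a disposable branch, then symlink
    heavyweight directories from the main tree instead of reinstalling."""
    wt = repo / ".sdd" / "worktrees" / session
    run(["git", "worktree", "add", str(wt), "-b", f"agent/{session}"],
        cwd=repo, check=True)
    for name in SHARED:
        src = repo / name
        dst = wt / name
        if src.is_dir() and not dst.exists():
            dst.parent.mkdir(parents=True, exist_ok=True)
            dst.symlink_to(src, target_is_directory=True)
    return wt
```

The symlinks are the difference between a worktree costing milliseconds and costing an `npm install`.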
## Model routing: contextual bandits, not vibes
Not every task needs Opus. Renaming a variable doesn't require a $15/million-token model. But deciding which model to use per task is genuinely hard to do with static rules.
Bernstein uses a LinUCB contextual bandit that learns from outcomes. The feature vector for each task includes:
- Complexity tier (low / medium / high)
- Scope (number of files)
- Role (backend, frontend, security, etc.)
- Estimated token budget
The reward signal is `quality_score * (1 - normalized_cost)` — it optimizes for the cheapest model that passes the janitor.
```python
# From bernstein/core/bandit_router.py (simplified)
from dataclasses import dataclass

@dataclass
class TaskContext:
    role: str
    complexity_tier: int   # 0=LOW, 1=MEDIUM, 2=HIGH
    scope_tier: int        # 0=SMALL, 1=MEDIUM, 2=LARGE
    priority_norm: float   # 0=critical, 1=nice-to-have
    file_count: int
    estimated_tokens: float

# LinUCB selects from available arms: ["haiku", "sonnet", "opus"]
# High-stakes roles (manager, architect, security) never start at haiku
```
During cold start (under 50 completions), it falls back to static cascade routing: haiku for simple stuff, sonnet for medium, opus for hard. After warm-up, the bandit takes over. The policy persists across runs in `.sdd/routing/policy.json`, so learning accumulates over time.
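For the curious, the core of LinUCB plus a cold-start cascade fits in a few lines. This is a generic textbook sketch under my own assumptions (feature dimension, `alpha`, class names), not Bernstein's implementation:

```python
import numpy as np

ARMS = ["haiku", "sonnet", "opus"]
COLD_START_MIN = 50  # below this many completions, use the static cascade

class LinUCBRouter:
    """Each arm keeps A = I + sum(x x^T) and b = sum(reward * x).
    Score = theta^T x + alpha * sqrt(x^T A^-1 x): exploitation plus an
    optimism bonus that shrinks as the arm accumulates data."""

    def __init__(self, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = {arm: np.eye(dim) for arm in ARMS}
        self.b = {arm: np.zeros(dim) for arm in ARMS}
        self.completions = 0

    def select(self, x: np.ndarray, complexity_tier: int) -> str:
        if self.completions < COLD_START_MIN:
            return ARMS[complexity_tier]  # cascade: low→haiku ... high→opus

        def ucb(arm):
            A_inv = np.linalg.inv(self.A[arm])
            theta = A_inv @ self.b[arm]
            return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

        return max(ARMS, key=ucb)

    def update(self, arm: str, x: np.ndarray, quality: float, norm_cost: float):
        reward = quality * (1.0 - norm_cost)  # cheap model that passes = high reward
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
        self.completions += 1
```

The policy is just the `A` and `b` matrices, which is why it can be serialized to a JSON file and reloaded across runs.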
In practice, this cuts costs by roughly 23% compared to using the same model for everything, because most tasks are boilerplate that cheap models handle fine.
## How it compares to CrewAI, AutoGen, and LangGraph
I keep getting asked this, so here's the honest breakdown:
|  | Bernstein | CrewAI | AutoGen | LangGraph |
|---|---|---|---|---|
| Scheduling | Deterministic code | LLM-driven | LLM-driven | Graph + LLM |
| Works with | Any CLI agent (18+) | Python SDK classes | Python agents | LangChain nodes |
| Git isolation | Worktrees per agent | No | No | No |
| Verification | Janitor + quality gates | No | No | Conditional edges |
| Agent lifetime | Short (spawn, work, exit) | Long-running | Long-running | Long-running |
| State model | File-based (.sdd/) | In-memory | In-memory | Checkpointer |
The core difference is philosophical. CrewAI, AutoGen, and LangGraph are frameworks — you write agents in their SDK, using their abstractions. Bernstein is infrastructure — it orchestrates CLI agents you already have installed. You don't write Bernstein agents. You point Bernstein at Claude Code or Codex or Gemini CLI (or all three in the same run) and it handles the rest.
The other frameworks also use LLMs for coordination, which means scheduling decisions are non-deterministic, expensive, and hard to debug. When Bernstein assigns task #47 to Sonnet, you can trace exactly why: the bandit policy selected it based on the task's feature vector, and you can read the policy file to verify. No prompt archaeology required.
The trade-off is real, though. Bernstein doesn't have agent-to-agent chat, built-in RAG, or a cloud-hosted option. It's a CLI tool for people who want their agents to write code and get out.
## Real results
Here's the part where I tell you Bernstein built itself.
During a 47-hour development marathon, I had 12 agents running on a single MacBook. The system consumed its own backlog: 737 tickets closed, 826 commits generated, averaging 15.7 tasks per hour. Bernstein's codebase — all 522,000+ lines of it — was largely written by agents that Bernstein orchestrated.
The `--evolve` flag takes this further. It analyzes its own metrics (which models failed, which tasks took too long, which prompts produced bad output) and proposes improvements to routing rules and prompt templates. Then agents implement those improvements. It's not AGI. It's a feedback loop.
Benchmarks against single-agent baseline: 1.78x faster on a 12-task test suite, 23% lower cost from model routing.
## What still sucks
I'm not going to pretend this is solved.
Agents still hallucinate file paths. They'll confidently import from modules that don't exist. The janitor catches this, but it means wasted cycles and retries.
Context windows fill up. On large codebases, agents run out of context and start forgetting earlier instructions. Short-lived agents help (fresh context per task), but it's still a fundamental constraint.
Cost adds up. 12 parallel agents burning Opus tokens is not cheap. The bandit router helps, budgets help, but if you're not paying attention you can blow through $50 in an afternoon.
Non-trivial setup. You need at least one CLI agent installed and authenticated. You need API keys. The `bernstein doctor` command catches most config issues, but it's not zero-friction yet.
Merge conflicts still happen. Despite file ownership declarations, agents occasionally touch files they weren't supposed to. The janitor catches regressions, but conflict resolution still needs work.
This is v1.4, not v10. Lots of rough edges.
## Getting started
```bash
pip install bernstein
cd your-project

bernstein init                    # creates .sdd/ workspace + config
bernstein -g "Add rate limiting"  # agents spawn, work, verify, merge
bernstein live                    # watch progress in the TUI
bernstein cost                    # see what you spent
```
For multi-stage projects, write a YAML plan:
```yaml
# plan.yaml
stages:
  - name: backend
    steps:
      - goal: "Add rate limiting middleware"
        role: backend
        complexity: medium
      - goal: "Write integration tests for rate limiter"
        role: qa
        complexity: low
  - name: docs
    depends_on: [backend]
    steps:
      - goal: "Document rate limiting API in OpenAPI spec"
        role: docs
        complexity: low
```
```bash
bernstein run plan.yaml            # deterministic execution, no LLM planning
bernstein run --dry-run plan.yaml  # preview tasks and estimated cost
```
Works with whatever CLI agents you have installed. Bernstein auto-discovers them. Mix Claude Code for architecture decisions, Gemini CLI for boilerplate, Aider with a local Ollama model for offline work — all in the same run.
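Auto-discovery can be as simple as checking `PATH`. A sketch under my own assumptions (the registry of CLI names here is hypothetical, not Bernstein's actual list):

```python
import shutil

# Hypothetical registry of known agent CLIs; the real list may differ.
KNOWN_AGENTS = ["claude", "codex", "gemini", "aider"]

def discover_agents(which=shutil.which) -> list:
    """An agent counts as installed if its CLI binary is on PATH."""
    return [name for name in KNOWN_AGENTS if which(name)]
```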
## Try it
GitHub repo — Apache 2.0, PRs welcome.
If you've been babysitting one agent at a time and wondering why your backlog isn't shrinking, maybe the answer isn't a better agent. Maybe it's a conductor.
"To achieve great things, two things are needed: a plan and not quite enough time." — Leonard Bernstein