Stop using LLMs to schedule other LLMs

Alex Chernysh

Three AI coding agents on the same repo = three agents overwriting each other's work. Claude Code edits auth.py. Codex edits auth.py two seconds later. Claude's changes vanish. Meanwhile Gemini "refactors" the test suite and breaks six things.

Two weeks of this. Here's what fixed it: git worktrees per agent, a deterministic Python scheduler (not an LLM), and a janitor that verifies work before merge.

The wrong turn

My first orchestrator used an LLM to coordinate the other LLMs. A manager agent read the backlog, decided assignments, checked progress, re-planned on failure.

It was slow, expensive, and kept hallucinating priorities. ~40% of total tokens went to coordination overhead instead of code.

Then the obvious hit me: scheduling is a solved problem. Operating systems have been scheduling concurrent processes since the 1960s. Nobody uses neural networks for cron. Why use one for task assignment?

I ripped out the LLM scheduler. The result is Bernstein, an open-source orchestrator that coordinates any CLI coding agent with zero LLM tokens on scheduling.

The pipeline

Four stages:

  1. Decompose: one LLM call takes your goal and outputs a task graph with roles, owned files, and dependencies (sketched below).
  2. Spawn: each task gets a fresh CLI agent in an isolated git worktree. Parallel execution. Main branch untouched.
  3. Verify: a janitor checks concrete signals. Tests pass, files exist, linter clean, types correct. Binary outcomes, not opinions.
  4. Merge: verified work lands on main. Failed tasks retry on a different model or get decomposed further.

Goal → Planner (LLM) → Task Graph → Orchestrator (Python) → Agents ‖
                                         ↓
                                    Janitor → Merge
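
To make stage 1 concrete, here's roughly the shape the planner's output takes. The field names are illustrative (Bernstein's actual schema may differ), but each task carries a role, the files it owns, and its dependencies:

task_graph = [
    {
        "id": "t1",
        "goal": "Add rate limiting middleware",
        "role": "backend",
        "owns": ["app/middleware/rate_limit.py"],  # files this task may touch
        "depends_on": [],
    },
    {
        "id": "t2",
        "goal": "Integration tests for rate limiter",
        "role": "qa",
        "owns": ["tests/test_rate_limit.py"],
        "depends_on": ["t1"],  # can't start until t1 has merged
    },
]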

The orchestrator is a Python event loop that polls a local task server, matches open tasks to available agents, and manages lifecycle. Deterministic, auditable, reproducible. Same inputs produce the same decisions.
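
One polling pass reduces to a pure function. A minimal sketch, assuming hypothetical Task fields (status, depends_on) rather than Bernstein's real internals; the file-ownership guard is covered in the worktrees section below:

from dataclasses import dataclass

@dataclass
class Task:
    id: str
    role: str
    depends_on: tuple = ()
    status: str = "open"  # open | running | merged | failed

def next_open_tasks(tasks: dict, idle_roles: set) -> list:
    """Every open task whose dependencies have merged and whose role
    has an idle agent. No model call anywhere in the decision."""
    return [
        t for t in sorted(tasks.values(), key=lambda t: t.id)  # stable order
        if t.status == "open"
        and t.role in idle_roles
        and all(tasks[d].status == "merged" for d in t.depends_on)
    ]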

Worktrees: the part that unlocked it

Each agent gets its own git worktree on a disposable branch:

git worktree add .sdd/worktrees/session-abc123 -b agent/session-abc123
# agent works in isolation
# janitor verifies, then:
git checkout main
git merge agent/session-abc123 --no-ff
git worktree remove .sdd/worktrees/session-abc123

Each agent thinks it owns the repo. No file locks, no coordination protocol between agents, no conflicts during work. The task graph declares file ownership, so overlapping files never get assigned concurrently.
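
The guard itself is cheap. A hypothetical version of the check, using the owned-files sets from the task graph:

def can_assign(candidate_owns: set, in_flight_owns: list) -> bool:
    """A task may start only if its declared files are disjoint
    from every running task's files."""
    return all(candidate_owns.isdisjoint(owns) for owns in in_flight_owns)

assert not can_assign({"auth.py"}, [{"auth.py", "models.py"}])  # auth.py is held
assert can_assign({"docs/api.md"}, [{"auth.py", "models.py"}])  # no overlap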

Expensive directories (node_modules, .venv) get symlinked from the main tree so you don't pay setup cost per agent.
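
In Python that's a few lines; a sketch of the idea, not Bernstein's actual code:

import os

def link_heavy_dirs(main_tree, worktree, dirs=("node_modules", ".venv")):
    """Symlink dependency directories from the main checkout so each
    worktree skips the npm install / pip install cost."""
    for d in dirs:
        src, dst = os.path.join(main_tree, d), os.path.join(worktree, d)
        if os.path.isdir(src) and not os.path.lexists(dst):
            os.symlink(src, dst, target_is_directory=True)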

Model routing without vibes

Renaming a variable doesn't need Opus. But static rules for model selection go stale fast.

Bernstein uses a LinUCB contextual bandit that learns from outcomes. Features: complexity tier, file scope, role, estimated token budget. Reward: quality_score * (1 - normalized_cost). Cheapest model that passes the janitor wins.

For the first ~50 completions it uses a static cascade (haiku → sonnet → opus). After that warm-up the bandit takes over, and the policy persists across runs so learning accumulates.
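
For reference, the core of a textbook LinUCB arm looks like this. A generic sketch, not Bernstein's implementation; feature encoding and reward wiring are simplified:

import numpy as np

class LinUCBArm:
    """One arm (one candidate model) in a LinUCB contextual bandit."""
    def __init__(self, n_features, alpha=1.0):
        self.alpha = alpha             # exploration strength
        self.A = np.eye(n_features)    # ridge-regularized feature covariance
        self.b = np.zeros(n_features)  # reward-weighted feature sums

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b         # estimated reward weights
        # expected reward plus an uncertainty bonus for under-explored arms
        return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# x encodes (complexity tier, file scope, role, token budget). Score every
# arm, route to the argmax, then update that arm with
# reward = quality_score * (1 - normalized_cost) once the janitor rules.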

Net effect in my runs: ~23% cost reduction vs. running everything on one top-tier model.

New in v1.8: MCP server mode

Since the original post, Bernstein gained a Model Context Protocol server. Any MCP-aware client (Claude Desktop, Cursor, VS Code, Zed) can now call Bernstein as a tool:

bernstein mcp --transport stdio

Your IDE agent decomposes a goal, calls bernstein_run, and Bernstein fans out the work across 12 parallel CLI agents in worktrees. The IDE agent just waits for results. One cheap router model at the top, a swarm of cheap workers below, one expensive reviewer at the end — instead of one Opus chewing through 40 serialized tasks.
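
Registering it in a client is the usual MCP config entry; for Claude Desktop that means claude_desktop_config.json. The shape below is the standard MCP server entry as I understand it; check your client's docs:

{
  "mcpServers": {
    "bernstein": {
      "command": "bernstein",
      "args": ["mcp", "--transport", "stdio"]
    }
  }
}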

How it differs from CrewAI, AutoGen, LangGraph, Composio, emdash

| | Bernstein | CrewAI / AutoGen / LangGraph | Composio / emdash |
|---|---|---|---|
| Scheduling | Deterministic Python | LLM-driven | Hosted/UI-driven |
| Works with | 20+ CLI agents (Claude Code, Codex, Aider, etc.) | Their SDK classes | Their desktop app / web UI |
| Git isolation | Worktree per agent | None | Varies |
| Verification | Janitor + quality gates | Mostly absent | Mostly absent |
| Agent lifetime | Short: spawn, work, exit | Long-running | Long-running |
| State | File-based (inspect with cat) | In-memory / checkpointer | Cloud/hosted |
| Interface | CLI + MCP server | SDK | Desktop ADE |

Philosophical difference: CrewAI/AutoGen/LangGraph are frameworks — you write agents in their SDK. Composio and emdash are desktop ADEs — you use their UI. Bernstein is infrastructure — you point it at Claude Code, Codex, or Aider (or all three in one run) and it handles the rest.

The LLM-driven coordination in those frameworks is non-deterministic and hard to debug. When Bernstein assigns task #47 to Sonnet, you can read the policy file and trace the feature vector that selected it. No prompt archaeology.

Trade-off: no agent-to-agent chat, no built-in RAG, no hosted option. It's a CLI for people who want their agents to write code and get out.

What still sucks

  • Agents hallucinate file paths. The janitor catches it, but retries cost tokens.
  • Context windows fill up on large codebases. Short-lived agents help; it's still a real constraint.
  • 12 parallel Opus agents are not cheap. Budgets and the bandit help, but spend still needs watching.
  • Setup friction. At least one CLI agent must be installed and authenticated.
  • File ownership isn't bulletproof. Agents occasionally touch files outside their scope.

This is v1.8, not v10. But the core loop is stable and I've been running it against production code for months.

Getting started

pip install bernstein
cd your-project
bernstein init
bernstein -g "Add rate limiting middleware"
bernstein live    # TUI
bernstein cost    # spend so far

For multi-stage work, a YAML plan:

stages:
  - name: backend
    steps:
      - goal: "Add rate limiting middleware"
        role: backend
        complexity: medium
      - goal: "Integration tests for rate limiter"
        role: qa
        complexity: low
  - name: docs
    depends_on: [backend]
    steps:
      - goal: "Document rate limiting in OpenAPI spec"
        role: docs
        complexity: low

Then run it:
bernstein run plan.yaml              # deterministic execution
bernstein run --dry-run plan.yaml    # preview + cost estimate

Mix models in the same run. Claude Code for architecture, Gemini for boilerplate, Aider with a local Ollama model for offline tasks.

GitHub repo. Apache 2.0. Star if it saves you a merge conflict.

If you've been babysitting one agent at a time, try the worktree-per-agent pattern and tell me what breaks. I'm especially interested in failure modes I haven't hit yet.
