DEV Community: Alex Chernysh

bernstein 2.x recap: lineage, ten trackers, A2A capability cards, and a CI that started fixing itself

Alex Chernysh — Wed, 20 May 2026 18:05:02 +0000

Ten days since the 1.10 recap. Thirteen point releases later. Not a roadmap and not a refactor. The cumulative effect of fixing the things that started to hurt the moment we tried to run the orchestrator on a regulated codebase, against a non-GitHub backlog, alongside three editors that all wanted to host the agent themselves.

This is a single round-up so the trajectory between then and now is legible from one page. The point releases are not headline-shaped individually. Grouped by theme, they are: a per-artefact transparency log with signatures every auditor can verify on their own laptop; ten tracker adapters from Jira to Plane under one contract; A2A capability cards plus an MCP client that treats every upstream as untrusted; a web UI, a PWA, and a one-command registration for seven host editors; a Playwright sandbox for the UI agents; a secrets broker plus the supply-chain hardening around it; an auto-heal CI that finally grew teeth; cost guards backed by a Brier-scored forecast log; and a single-writer state model that makes a session reconnect across machines.

a transparency log per artefact

The audit chain was already there. What was missing: the part that lets two agents touch the same file without losing the trail.

Lineage now writes every agent edit as an Ed25519-signed entry against the agent's A2A card. Two writers on the same path surface as siblings; the Steward writes an explicit merge entry rather than letting one quietly overwrite the other. bernstein lineage gate is a required CI check; merges with unresolved parallel-edit forks fail. The same idea layers on tracker state moves — every transition the orchestrator drives (open, label, transfer, comment, close) is captured as a signed entry, so a ticket that loses or gains the wrong label reads back, line by line, who did it and what they had signed for.

The compliance side ships a verifier you don't need to trust the orchestrator to run. bernstein compliance pack --since … --org "…" --output … produces a signed ZIP with PDF, CSV, raw log, Agent Cards, and a SLSA-style manifest. bernstein-verify pack .zip is its own wheel: cryptography and click are the only dependencies. The verifier's own test asserts that import bernstein raises ModuleNotFoundError from inside its venv, because that is the property an external auditor wants from a verifier they will hand to their own team.

Honest framing: the surface ships with tests, runbooks, and the standalone verifier. It has not been bashed against an external regulatory audit yet. Treat it as code an evaluator can read and stand up themselves, not as an attestation. The three reference demos under examples/lineage/ (fintech, healthcare, EU manufacturer) are written so a compliance officer can pattern-match the EU AI Act Article 12 surface against their own paperwork.

ten trackers, one contract

The work people had been asking for since 1.0. Ten adapters land under one TrackerContract: Jira Cloud and Jira DC, GitLab Issues, Linear, Plane, Asana, ServiceNow, ClickUp, GitHub Projects v2, plus webhook ingestion. Third-party trackers plug in via the same pluggy hookspec the orchestrator uses internally; bernstein trackers surfaces them on the CLI. A multi-tracker federation layer sits above the adapters, so a single team running Jira for engineering and Linear for product can route to both from one orchestrator config.

Two design surprises earned their own slices. Tracker comments became the orchestration handoff bus: worker agents now coordinate over the same comment thread the operator reads, so a session resumes across CLI restarts and across operator machines without a synthetic state file in between. And the issue-to-PR pipeline walks a tracker issue through plan synthesis, plan-comment posting for human review, and PR creation in one path. Run-failure classification closes the loop on the other side: when a run dies, the orchestrator labels the ticket with what class of failure killed it.

The unhonest framing would be that you can wire this up in an afternoon. The honest one is that all ten adapters were lit in two weeks while one operator (me) ate the integration cost ten different ways, and every one of them is bound by the same conformance suite that has been keeping the CLI adapters honest. If your shop runs on Linear and you want the same orchestration semantics as a GitHub-resident one, the contract says you should.

interop, finally

The piece that kept blocking real cross-process work was the lack of a real handshake. Claude Desktop is one process, Claude Code is another, both can spawn agents, neither knew what the other had already decided.

A2A capability cards close that gap. One process mints a signed manifest of what it can do; the other consumes it, verifies the signature against a trusted-issuer set, and refuses to delegate when the advertised policies don't meet the operator's required policies. The lineage chain rides through the same envelope, so the audit trail does not break at the organisation boundary. The handshake builds on the A2A v1.0 contract: JCS body per RFC 8785, detached JWS per RFC 7515, Ed25519 per RFC 8037, JWKS at /.well-known/jwks.json, audience binding via RFC 8707.

The MCP client got the matching upgrade. Upstream servers will return malformed responses, hang mid-stream, demand re-auth, lie about their capability manifest. The client now treats every upstream as untrusted: capability-card validation before a tool call, retry-with-continuation on dropped streams, in-flight cancellation that preserves partial output, per-server cost metering, schema-violation containment that marks a misbehaving server degraded for the rest of the task. None of this is exotic; it is the brittle-real-world posture that the larger MCP ecosystem will end up needing.

The MCP server side got a prompt catalogue plus OAuth-2 PKCE discovery metadata so auto-discovering hosts that expect a real RFC 8414 / RFC 9728 surface stop skipping us. Full token issuance and OIDC federation are deferred to a follow-up; the discovery metadata is what unblocks the common case.

operator surfaces, plural

bernstein gui serve boots a FastAPI server with a React SPA mounted at /ui. No Node toolchain at install time; the Vite bundle is committed under src/bernstein/gui/static/. Default at http://127.0.0.1:8052/ui/. Tasks, Agents, Approvals, Audit, Costs, Fleet, Settings. Six functional panels and one placeholder. The per-task drawer has six tabs: Summary, Logs (SSE-streamed with ANSI, virtualised list, search), Diff (split or unified git diff ... with syntax highlight, copy, .patch download), Gates (quality-gate report with auto-expanded failures), Deps (upstream / downstream task graph), Trace (timeline from .sdd/traces/{task_id}.jsonl).

bernstein gui serve --tunnel publishes through the tunnel driver registry (cloudflared / ngrok / bore / tailscale, auto-select). The same command issues a URL-safe bearer token plus a 6-word diceware passphrase, persisted at ~/.bernstein/dashboard.passphrase with mode 0600, prints an ASCII QR, and ships an installable PWA: service worker with stale-while-revalidate for /api/projects and /api/cost, programmatic maskable icons, iOS Safari and Android Chrome install cleanly.

For operators who already live in another host: bernstein desktop-register --host writes the host-specific config entry for Claude Desktop, Claude Code, Cursor, Continue, Cline, Zed, and Aider. One command. bernstein doctor --substrate reports which hosts have us registered, which do not, and which have a stale registration. The orchestrator is a guest in the host's settings file; we ship the plugin, the host renders it.

Honest framing on the web UI: it is a minimal demo of the operator surface. No theme toggle, no mobile-responsive pass, the Settings screen is a placeholder, the Fleet screen is scaffolding with a real data plane behind it, no front-end test suite, no Playwright smoke test in CI. It exists because the core could support it and operators asked. Tracking issue #1262 is the contributor welcome mat; small PRs preferred. Each per-host adapter is small enough that a host-spec change is a one-day fix, not a re-architecture. That is the cost of being a guest.

a CI that started fixing itself

The auto-heal daemon shipped with twenty-six parameters and produced zero successful heals in its first three weeks. The post-mortem was dull in the best way. A fetch URL had moved. The classifier was missing the agents-md drift class, so doc-only commits looked like a new failure shape. Ruff was running before agents-md sync, so the sync's whitespace tweaks looked like lint regressions. And the heal-branch CI never started because push events from GITHUB_TOKEN don't fire downstream workflows by default; the daemon now dispatches explicitly.

The rest of the immune system landed in the same wave. Inline-pushing the regenerated lockfile to a PR head instead of opening a bot-PR for it removed the dominant bot-PR-class source. A weekly aggregated digest issue replaces N auto-release-skipped notifications. A hotfix R-counter detects when a hotfix begets another hotfix and blocks further auto-merge after two-in-a-row. A trunk-health Andon gate holds merges on red trunk. An idempotency self-check in the regen path so a non-deterministic regen halts itself instead of looping. CI concurrency split by branch so rapid-merge bursts drain the queue instead of cancelling each other. The macOS runner queue (20 to 70 minutes during burst-merge waves) got split off the per-PR default matrix into push-to-main, macos_sensitive-path-changed, or macos-needed-labelled gated jobs, with a nightly full matrix.

Every PR now also passes through a review-bot acknowledgement gate. CodeRabbit and Sourcery findings classified as must-address block merge until they are addressed in a fixup commit (with bot-ack: in the commit message) or acknowledged in the PR body with a structured marker (reason=... -->). A nightly sweeper and a reusable shepherd workflow template ship in the same wave, so the cadence stays predictable.

deterministic replay and one writer per session

Three small things compounded into something operationally useful. Session ids are bound deterministically so a replayed run reproduces its own event stream without colliding with a sibling. The supervisor enforces a bounded respawn budget and parks an agent when the budget is exhausted, instead of looping respawns indefinitely. On-disk state has a versioned migrations module so an older .sdd/ upgrades predictably. Plus the cosmetic-but-real win: runs surface a memorable deterministic name in user-facing output, so the operator can refer to "the brisk-sparrow run" instead of memorising a UUID.

The bigger structural piece is the single-writer RunActor. One per-session actor owns canonical state. Mutations flow as typed events through one async queue. A pure apply_event reducer applies them with monotonic seq numbers. ReplayBuffer is a bounded ring (default 1024) that emits an explicit Gap{up_to_seq} marker when a subscriber asks for an evicted range, so a reconnect-after-eviction is observable instead of silently corrupt.

bernstein simulate is the digital-twin runner that pairs with this. Feed it a plan plus a route and it executes the orchestration without the adapter network. Rehearse an expensive plan before paying for it.

cost guards, calibrated

The bandit router had been doing the right thing for a while. What was missing was a way to read the routing decisions back.

A per-task criterion profile plus TOPSIS multi-criteria ranking means a latency-sensitive task routes differently from a thorough one. A structured decision log covers every routing, retry, and gate verdict with its inputs. The calibration log got teeth: every forecast is scored with a Brier. Per-quota-envelope attribution shows where the spend actually went, not where the most expensive role declared it would. The preflight estimator stopped picking the first declared role and started picking the most expensive one; old behaviour underestimated by 40 to 60 percent on multi-role plans.

The hard cap is --max-cost-usd. Cross the threshold, the run aborts cleanly, partial results merged or rolled back the same way a normal cancel works. The per-ticket variant lives in bernstein.yaml so the cap survives a CLI restart and writes back to the tracker on termination. The same cap, posted via REST, now fails fast at the request boundary with 422 instead of bleeding into the task store as an unhandled 500.

supply chain, secrets, and a sandbox for the UI agents

The security workstream does not write up as a single feature. Half of it is the broker that hands a task a short-lived token scoped to what it declared in its plan; the other half is the dozen smaller things that surround the broker.

A secrets broker mints per-task tokens, scoped to the resources the task declared. Audit events dispatch outside the broker lock so a misbehaving sink can't stall the issuance path. Constant-time HMAC compare. Approval responses bound to a 16-byte server-minted single-use nonce; mismatches surface as 409, evicted replays as 410. Per-tool allowlist with fail-closed policy and a read-only profile.

Prompt-injection containment runs against three surfaces. Invisible Unicode Tag codepoints are stripped from injected skills before any prompt sees them. Promptware cross-agent C2 strings are detected in tool output. MCP tool-call inputs are JSON-Schema validated, deny-by-default. A security-pentest eval scenario exercises the lot end to end.

Supply-chain coverage on the workflow side: OSSF Scorecard, an SBOM emitted on every release, actions/dependency-review on PRs, trufflesecurity/trufflehog for secret scanning, Dependabot extended to the github-actions ecosystem, step-security/harden-runner on every workflow job (audit mode first, then block). The workflow security pass resolved 163 zizmor findings across unpinned-uses, artipacked, template-injection, bot-conditions, dangerous-triggers, ref-version-mismatch, cache-poisoning, excessive-permissions, and dependabot-cooldown. The three jobs that legitimately push back to git keep their credentials with an annotated rationale.

A self-testing layer drives a Playwright context against the dev server, captures screenshots, console messages, and network errors as a structured artefact, and hands the result back to an LLM judge for verdict. This is the slice that closes the loop on UI and web agent tasks the way the existing test harness closed it for backend tasks. The agent's diff plus the post-run screenshot plus the console log feed one judge prompt; the judge returns a structured pass-or-fail with a rationale that lands in the task transcript.

Honest mistake worth naming. The shipped wheel had errors.bernstein.run baked in as the GlitchTip DSN default and telemetry.bernstein.run as the telemetry endpoint default. Both backends soft-fail when their env vars are unset, so the package never actually reached out without consent. But the hostnames were sitting there as defaults, which is the kind of thing that turns into a real leak the day someone wires a config they did not read. Stripped, with a test that asserts zero operator-private host, IP, or DSN matches in src/ and fails the build if a future change reintroduces one. Telemetry is now portable behind one Sentry-compatible BERNSTEIN_TELEMETRY_DSN, so each operator runs against their own backend rather than mine.

observability under one umbrella

bernstein doctor observe runs each per-backend probe (Sonar, GlitchTip, Dependency-Track, GitHub Code Scanning) in order and renders one Rich table with metric, value, delta-since-last-check, threshold, and status. --json and --watch. Each backend soft-fails to SKIPPED when its env vars are unset, so a fresh checkout stays green. A per-PR sticky summary comment and a daily trends snapshot ride on the same JSON. Per-backend bernstein doctor sonar and bernstein doctor glitchtip ship behind the same umbrella for the operators who want one signal at a time.

the smaller things

A bucket of cuts that do not need a whole section but matter in a specific situation.

AI-BOM in three formats. bernstein bom emit and bernstein bom verify ship a Bernstein-native JSON encoder plus CycloneDX 1.5 with the AI/ML extension plus SPDX 2.3 with AI-specific annotations behind one dispatcher. Pure projection from existing lineage / cost / adapter state; determinism enforced by Hypothesis property tests.
Diary plus synthesizer. One structured entry per closed task (tried / worked / failed / rationale / tags) with redaction of OpenAI keys, GitHub tokens, AWS access keys, PEM banners, and high-entropy hex. The synthesizer clusters diaries by tag-overlap Jaccard and drafts a markdown report. HITL-gated; reports default to approved: false.
Consensus relay. HMAC-chained per-cycle handoff at .sdd/runtime/consensus/.json so an operator restarting a long evolution cycle can pull the prior cycle's decisions, blockers, and open questions without rediscovery.
Three-layer skill customisation. BASE / TEAM / USER under XDG paths with a deterministic merge spec: scalars override, tables deep-merge, keyed arrays replace by name, unkeyed arrays append, missing layers fall through cleanly.
Canonical stream-signal vocabulary. A small text-line vocabulary (COMPLETED, FAILED, QUESTION, PLAN_DRAFT, PLAN_READY, BLOCKED) parseable from any wrapped CLI stdout, so non-stream-json adapters surface lifecycle events through the same channel as native stream-json adapters.
Empirical-confidence ledger. Append-only SQLite store of per-decision outcomes; sample-size-gated; refuses to return below a documented threshold. Backs the model recommender with measured outcomes over the capability-tier heuristic.
bernstein export, bernstein analyze, bernstein adapters list, bernstein compare. The operator-side cuts that make the orchestrator legible from the CLI without spelunking the JSONL.
Adapter count is at 44. Devin for Terminal, JetBrains Junie (BYOK across the usual five providers plus the Copilot proxy), AWS Q Developer, DeepSeek V4-Flash and V4-Pro via an Ollama-compatible endpoint with an EU-residency guard.

two open questions for the community

Two RFCs are live where the design genuinely depends on what other operators think. Drive-by comments welcome; full proposals more welcome.

#1720 — Skills end-to-end. The skill subsystem already has discovery, layered merge (BASE / TEAM / USER under XDG, above), and an injector for Claude Code. The operator never touches it because there is no install, no sync, no publish, no lint, no test, no init, no watch. If you have an opinion on the verb surface, the manifest shape, or whether a community index belongs in scope at all, the RFC is where to leave it.

#1719 — Opt-in telemetry to a community-shared backend. The package already has a portable telemetry pipeline behind BERNSTEIN_TELEMETRY_DSN. The current state (no maintainer-side endpoint, package never reaches out by default) is fine. The question on the table is whether an explicitly opt-in maintainer-operated endpoint is worth adding so the rare class of bug that bites many operators looks different from the rare class that bites one. The consent and transparency contract is the live debate.

Both issues are tagged up-for-grabs; both have zero comments at the time of writing.

why these matter

If you read the 1.10 recap and asked which of the friction points you were actually going to feel, the answer ten days later is most of them.

Two agents writing the same file no longer race silently. A non-GitHub backlog is not a special case; ten adapters share the same conformance suite that has been keeping the CLI adapters honest. The web UI is one command and one port; the same command issues a tunnel, a QR, and an installable PWA. A CI break that the heuristic can fix does not need a human-dispatched hotfix. The compliance pack is a single ZIP an auditor can verify without installing the orchestrator. The MCP client treats every upstream as untrusted, which is the posture the larger ecosystem will end up needing. Cost decisions are read-back instead of inferred. Sessions reconnect across CLI restarts and across machines without rediscovery.

The one I noticed most was the removed-our-own-infrastructure cut. The kind of mistake that ships invisibly. The kind of fix that should be a test.

try it

pipx install --upgrade bernstein

# operator surface
bernstein gui serve                             # web UI at http://127.0.0.1:8052/ui/
bernstein gui serve --tunnel                    # public URL + QR + bearer + diceware
bernstein desktop-register --host cursor        # register as a plugin in another host
bernstein doctor --substrate                    # which hosts have us registered
bernstein doctor observe                        # one umbrella table over four backends

# routing and replay
bernstein simulate --plan plan.yaml             # digital-twin a routing decision
bernstein plan dag                              # render the declarative task DAG
bernstein run --max-cost-usd 5                  # per-run hard cap

# trackers and lineage
bernstein trackers                              # plugin index for tracker adapters
bernstein lineage gate                          # required check; fails on unresolved forks
bernstein compliance pack \
  --since 2026-04-01 --until 2026-05-19 \
  --org "Your Org" --output pack.zip
pipx install bernstein-verify
bernstein-verify pack pack.zip                  # zero-trust verification

# AI-BOM
bernstein bom emit --format cyclonedx-1.5 > sbom.json
bernstein bom verify sbom.json

Container: ghcr.io/sipyourdrink-ltd/bernstein:2.5.1.

The spec-quality gate and the empirical-confidence ledger are the two slices most likely to compound. The first refuses to advance a feature spec until a deterministic, library-only rule set passes; the second backs the model recommender with a measured-outcomes store rather than a heuristic. Both are in early shape; both get bigger only if the operator surface stays restrained.

If you hit something rough across the 2.0 to 2.5 surface, open an issue. The next batch is shaped by what blocks real work.

v1.10.x recap: the one this picks up from.
v2.0 release notes: when the web UI landed.
orchestrator on someone else's box: the on-prem deployment story this release strengthens.
Bernstein

Originally published at https://bernstein.run/blog/v2-x-recap?utm_source=devto&utm_medium=crosspost&utm_campaign=v2-x-recap&utm_content=canonical

Forecasting Without Prophecy: a plain-text discipline

Alex Chernysh — Mon, 11 May 2026 18:02:07 +0000

Back to notes

I am a fire Aries ruled by Mars, and even I will not pretend the future is a thing you can read off a chart. Calibrated uncertainty does the work prediction promises. The difference shows up six months later, when you can still grade what you wrote. Same question whether the deadline is a deploy, a hiring call, a relocation, or a difficult conversation with a peer.

The minimum viable forecast loopFive steps that turn a question into something you can actually grade six months later.

The illusion of point estimates

The most common forecasting mistake I see is not bias. It is false specificity in the answer.

A senior engineer says "I'm 70% confident this ships by end of Q2." The number sounds disciplined. There is no scoring history attached. The same engineer said "70%" last quarter, and the quarter before, and three out of four ended up landing in different buckets. The "70%" is a feeling reformatted as a probability.

The same trap shows up well outside a deploy window. A friend is "pretty sure" the new sleep regimen holds through the work-week. A cousin is "fairly confident" the visa will clear in time. A founder is "70% sure" the round closes in six weeks. None have a forecast log behind them.

Point probabilities without a forecast log are theatre. They wear the costume of rigour (the decimal, the percentage sign) and the calibration that would make them rigorous is missing.

The same trap shows up further down the AI stack. A retrieval system reports score: 0.83 and the team treats it as ground truth. A model reports confidence: 0.91 and the team builds an approval flow on top of it. Neither number is calibrated against actual outcomes. They are surface forms of a habit that does not exist yet.

The fix is not "stop using numbers." The fix is ranges, not points, until you have a calibration log that earns the precision. Twenty-to-thirty-five percent is defensible. Twenty-seven percent without a log is a costume.

Reference classes before stories

The second most common mistake is starting from the inside view. The story of the project, the story of the relationship, the story of the deploy.

Reference-class forecasting is the corrective. The original framing comes from Kahneman and Lovallo, and was operationalised most aggressively by Bent Flyvbjerg on infrastructure megaprojects, where insiders consistently overstated success and external base rates told a quieter, more accurate story.

The procedure is short:

Name two to four classes the case belongs to. Not metaphorical classes. Observable ones, with countable outcomes. "Solo-founder consumer-SaaS launches with no paid acquisition." "First-time hires from an outside referral at a company under 30 people." "Friends who have gone quiet for ten days after a tense exchange." "Indie novel projects taken from outline to a finished draft within twelve months."
Estimate the prior odds of the target outcome from those base rates. Use ranges.
Adjust modestly (at most thirty or forty percentage points) only if your case-specific evidence is strong and distinctive.
If no relevant reference class exists, your confidence drops automatically.

The discipline lives in the order. Build the prior before you tell yourself the story. Once the story is in your head, every reference class will start to look "different in our case" and the outside view gets rationalised away. I have done this to myself on a relocation, a hiring call, on whether a parent would actually visit in spring. Writing the prior down before the narrative is the only thing that has ever stopped it.

This is the same discipline that makes eval suites useful in LLM systems: pick the reference set first, then look at the system, not the other way round.

Premortems and falsifiers

A premortem is the cheapest decision-quality intervention I have ever run. The technique is associated with Gary Klein's 2007 HBR piece. The underlying discipline is older. A deliberate inversion of the usual kickoff posture. Works on a relocation, a difficult conversation with a colleague, or the question of whether to stretch the emergency fund onto a new lease.

The procedure, in plain text:

1. Set the scene: it is six months from now, the project failed.
2. Each participant writes down, alone, the strongest specific reason it failed.
3. Read the answers out. Cluster them.
4. Each cluster becomes a falsifier or a mitigation in the live plan.

Copy

Two effects compound. First, asking "why did it fail" generates more honest hypotheses than "what could go wrong" because the failure is now an established fact in the imagined timeline. Nobody is debating whether it might happen, only how. Second, the failure modes that survive clustering become falsifiers: observations that, if they happen, mean the plan is broken. Falsifiers convert vague risk into a leading indicator you can actually watch for.

This pairs well with how I run feature flags and staged rollouts in agentic systems. The flag's "off" criteria are usually written casually. They should be written as falsifiers. "If the regression rate exceeds 4% over two weekly cohorts, this rollout has failed and we revert." That sentence is forecastable. "We'll keep an eye on regressions" is not. The same shape works outside a codebase. "If the antibiotic course produces nausea on day three, I switch back to the GP" is a falsifier. "I'll see how I feel" is not.

Abstention as a feature

The third common mistake is answering when the honest answer is "I don't know yet, and here is what would change that."

Abstention is treated as failure in most organisations and in most personal conversations. In disciplined forecasting it is a feature. Two reasons.

Calibration. A forecaster who abstains on cases that are genuinely underdetermined posts better Brier scores than one who answers everything with a 50% confidence shrug.

Decision quality. The ask "what evidence would resolve this?" reframes the situation from "what do I think?" to "what do I need to look at next?" That is the question that actually moves projects forward, and the question that quietly de-escalates most family arguments about hypothetical futures.

The technical analogue worth knowing is conformal prediction, surveyed accessibly in Angelopoulos and Bates' 2021 tutorial. The output is a set of labels guaranteed to contain the truth at least (1 − α) of the time, rather than a single label with a confidence. When the set has one element, you have a confident prediction. When the set has six, the model is honestly saying "I cannot distinguish among these without more evidence". The set size is the abstention signal.

You don't need conformal infrastructure to apply the principle. The principle: make the size of your answer track the size of your uncertainty. A short single-line forecast for a confident case. A two- or three-branch forecast for a moderately known case. An explicit "I abstain because X, Y, Z would resolve it" for the underdetermined case. This sits next to my preference for product safety without theatre. Refusing a question is sometimes the strongest answer the system has.

Calibration is a habit, not an event

A forecast is incomplete until it is graded.

The grading metric I keep coming back to is the Brier score, summarised on the Wikipedia page. Lower is better. Zero is perfect. The convenient property is that the score decomposes into calibration and resolution. You can be wrong because your probabilities do not match observed frequencies, or because your forecasts do not separate likely from unlikely cases. Two different fixes.

In practice, you do not need fancy infrastructure to track calibration. A four-column markdown table is enough:

| Date       | Question                                | Forecast    | Outcome | Notes                |
|------------|-----------------------------------------|-------------|---------|----------------------|
| 2026-03-01 | Will candidate X accept by 03-15?       | 35-50%      | yes     | accepted on 03-09    |
| 2026-03-04 | Will deploy be clean on 03-08?          | 60-75%      | no      | DB pool exhausted    |
| 2026-03-09 | Will my friend reply within 48 hours?   | 40-55%      | no      | replied on day 5     |
| 2026-03-12 | Will the landlord renew on same terms?  | 55-70%      | yes     | small CPI bump only  |

Copy

Two months of entries and you start to see the systematic biases. Overconfidence on a topic you "know". Underconfidence on a topic you are afraid of. Point estimates that hide a wide range. Ranges that hide a missing reference class. The biases that show up against deploys also show up against landlords, friends, and gigs that need to break even on the door.

I keep the log as a live file. New entries take less than a minute to write. The discipline lives in the reading, on a slow Sunday once a month, with last month's predictions next to last month's outcomes.

If that sounds tedious, consider the alternative is a version of you who never learns whether your forecasts are right. Public superforecasters, profiled in the Good Judgment Project, do score above average on fluid intelligence and active open-mindedness. The strongest single predictor of breaking into the top 2% was perpetual updating, roughly three times more predictive than IQ. They keep score.

Scenario discipline in plain text

The single-story narrative is the most expensive default in informal forecasting. "I think they're going to ghost us." "I think the round will close in six weeks." "I think the strike will be over by Friday." A single hidden-motive story replaces the work of generating competing hypotheses.

The fix is a scenario table, generated once, with three to five branches that meaningfully compete:

ScenarioProbability rangeStrongest evidence forStrongest evidence againstLeading indicatorStatus quo continues30–45%track record of inactionrecent change in incentivesno decision in next 14 daysCautious improvement25–35%small visible gestures last weekhistory of regressionsone substantive ask answeredEscalation or rupture10–20%pattern of ultimatumscalmer recent toneunilateral action by the other sideStrategic distance10–20%resources are clearly limiteddependency on this threadreduced engagement, not reduced contactExternal shock5–10%three competitors movingsector quiet otherwisea third party makes the question moot

That same shape covers a regulator's calendar, a job search, a health regimen, or whether a quiet family thread reopens on its own. You change the rows; the columns stay.

Two rules make the table earn its keep.

Probability ranges, not single numbers, unless you have a forecast log to back the precision. Range midpoints should land somewhere near 100%. They will never sum to exactly 100 (they are ranges) but if the column sums to 50% the scenario set is incomplete, not the arithmetic.

One falsifier per row. A scenario without a falsifier is a wish or a fear, not a scenario. The leading-indicator column does the work: it tells you what observation, if you saw it next week, would shift the probability of that branch up or down.

A nice consequence of plain text is that you can paste it into a thread, hand it to a colleague, or feed it to a model for a second opinion without an export step. The same plain-text discipline that makes spec-driven development survive context switches makes scenario tables survive them too.

Try it on something small

This entire discipline collapses if you only ever apply it to high-stakes, low-frequency questions. You don't get calibration data. You don't develop the muscle. You don't learn which biases are yours.

Start with something small enough to grade within two weeks. A Q3 review, an offer letter, a connecting flight, a Saturday gig that needs to break even on the door. Anything where the outcome lands before you forget you forecasted it.

If two of the three are wildly off after two weeks, the lesson is in the gap, not in the embarrassment. Reread the original entries. Which step did you skip? Did you start with a story instead of a reference class? Did you give a point estimate instead of a range? Did you forget the falsifier?

The future stays unpredictable. The job is to build a calmer, slightly more honest interface to a messy world, and to leave behind enough trail-of-evidence that next year's version of you can grade this year's forecasts and learn something.

RightLayout: Shipping a Mac AI Tool, Then Letting Go

Alex Chernysh — Mon, 11 May 2026 18:02:05 +0000

Back to notes

I built RightLayout because every keyboard-layout corrector I tried for macOS broke on names, code, and typos. It was a small bet: train a CoreML model from scratch, three layouts, on-device. It worked. Then the maintenance bill came due, and I open-sourced it.

1. The problem with dictionary punto-switchers

If you type in two or three languages on a Mac, you have lived this. You start a sentence in English, the layout is still on Russian, and the screen fills with Cyrillic noise. The fix exists in theory. There are tools that watch your input and flip the layout when the word "looks wrong".

The classical version of that tool is dictionary-based. It checks each word against a frozen vocabulary and corrects when the word does not appear. That works for the easy cases. It also fails the moment a real human starts typing real text.

Names break it. Code breaks it. Acronyms break it. URLs break it. The word kubectl is not in any Russian dictionary, but it is also not a wrong-layout English word. A typo like helo is missing from the dictionary, so the tool helpfully turns it into руды. And in mixed-language paragraphs the dictionary does not even know which language to anchor against.

A dictionary check has no idea what you are typing. It can be solid and polished and still get the same class of false positives, because the underlying signal is the wrong one. You need something that reads context, not vocabulary.

2. The bet

The bet was small enough to attempt over a few weekends. Train a tiny character-level model that takes a short window of recent input and predicts which of nine classes it belongs to. The nine classes are the three native layouts (EN, RU, HE) plus the six cross-layout misfires: en_from_ru, ru_from_en, en_from_he, he_from_en, ru_from_he, he_from_ru. That class set is the whole trick. Once the model says "this looks like Russian typed on an English layout", a deterministic mapper handles the actual character substitution.

The training pipeline is in the repo and is unromantic. Wikipedia and subtitle corpora for the three languages, generation of clean and cross-layout-mistyped pairs, character-level tokenization, augmentation for typos and case noise, mixup, label smoothing. The model itself is an ensemble of a small multi-scale CNN and a four-layer character Transformer, both pooled into a single linear head. It runs over a fixed 20-character window. The export goes through PyTorch into CoreML.

# from Tools/CoreMLTrainer/train.py
CLASSES = [
    'ru', 'en', 'he',
    'ru_from_en', 'he_from_en',
    'en_from_ru', 'en_from_he',
    'he_from_ru', 'ru_from_he'
]

Copy

The CoreML model that ships inside the .pkg is around 14 MB. It runs entirely on-device. Inference per token window is fast enough that the correction logic stays well under the 50 ms budget I set for the whole pipeline (event tap to replacement). It is small enough to fit in the bundle and ship with no cloud dependency.

The first time it correctly turned ghbdtn into "привет" in the middle of a sentence with a code snippet in it, I knew it was going to work. A dictionary-based corrector would have eaten the snippet.

3. Where the model actually wins

Three places it beats a dictionary cleanly.

It handles typos. A short word with one missing or duplicated character is still recognizable as the right language to a character-level model. Dictionary tools either silently miss the word or, worse, "correct" it into nonsense.

It handles names and code. The model has seen enough mixed-language and mixed-script text in training that an English snippet inside a Russian sentence does not trigger a flip. The dictionary approach to this is a hand-maintained whitelist that grows forever.

It handles Hebrew, which is the genuinely hard one. RTL text plus a character set with no overlap with Latin or Cyrillic plus a layout that maps Hebrew letters onto English keys means the dictionary approach has to maintain three pairwise tables and a context heuristic on top. The model just learned that akuo is "שלום" typed on an English layout and moves on.

For three or four months I used my own tool every day. It was the first time the corrector was invisible enough to forget about.

4. Why I open-sourced it instead of scaling it

Then the maintenance bill arrived.

A free macOS utility with a learned model has a long tail of unglamorous work. The accessibility-API event tap needs to keep working across macOS versions. Apple loves to silently change permission semantics between releases. The CoreML runtime drifts. The model has no test infrastructure for "real users typing real text" because that is, by definition, not in the training set. The undo-ratio learning loop, where the tool watches users undo a correction and adapts, is hard to make safe and harder to validate without telemetry I refuse to collect.

For a funded product, those costs are absorbable. For a free tool maintained by one person with a day job, they compound. Every macOS major release became a week of evening debugging. Every CoreML version bump was a small risk. Every issue in GitHub was a fork in the road: do I become a Mac systems engineer in my spare time, or do I let the project rot quietly while pretending it is still maintained?

I picked a third option. I marked the project community-maintained, wrote an honest banner on the homepage and the README, kept the model in the bundle so installs still work, and moved my attention to Bernstein. The repo is public. The training pipeline is public. Pull requests are reviewed. There are no gatekeepers. If you ship good PRs, you get commit access.

That is a more honest position than "v2 coming soon, watch this space".

5. What you can take

If you want the tool, the .pkg is on the releases page. macOS 13 or newer, Accessibility permission, free. The model is inside the bundle.

If you want the code, the repo is small, the architecture doc is in .sdd/, and the training pipeline is in Tools/CoreMLTrainer/. Adding a fourth language is a few-hour exercise: extend the class enum, add the layout map, retrain, ship.

If you want the lesson, I think it is short. A focused weekend project can ship faster than the coordination cost of a team. Maintenance is a different shape entirely, and there is no shortcut around it. Choose accordingly.

I am genuinely glad it is in the wild. Take it, fix it, ship it.

Resources

Repositories and downloads

Shipping the orchestrator onto someone else's box

Alex Chernysh — Mon, 11 May 2026 13:14:37 +0000

This is the on-prem / regulated-deployment notes for 1.10 — mTLS cluster mode, signed lineage records, air-gapped install, a capability gate against the lethal-trifecta exfiltration class. If you're looking for "how to install it on my laptop," that's the curl-pipe install post instead. A laptop tool and an on-prem deployment have almost nothing in common: the first answers to the developer who started it, the second answers to a security architect, a compliance officer, a network team that hates outbound traffic, and a procurement reviewer with a checklist. The batch below is what it took to stop pretending those were the same thing.

the questions

Customers who want the orchestrator inside their perimeter ask a fairly predictable sequence. How do nodes authenticate when the network isn't yours. How do we prove, six months later, which agent wrote which line. How do we install when the box has no outbound internet. What stops a clever prompt from chaining a database read into a public webhook.

We had partial answers to all four. None were the answer you'd hand a security review without flinching. The five PRs that landed today close that gap by replacing "we sort of do that" with concrete, demonstrable behaviour. None of these are revolutionary on their own. The cumulative effect is a 1.10 build we're willing to drop into a regulated customer's box and walk through with their auditor.

cluster mode without ambient trust

Cluster mode used to assume the network was safe. Acceptable for a developer machine talking to itself, unacceptable everywhere else. Worker–central traffic now runs over mTLS by default, with cert issuance done locally and pinned to the cluster's own CA.

bernstein cluster bootstrap-ca --out .sdd/cluster/ca/
bernstein cluster issue-cert --role worker --node-id worker-01 \
  --out .sdd/cluster/worker-01/
bernstein cluster start --tls .sdd/cluster/worker-01/

The CLI generates a private CA, issues short-lived node certs with role and node-id baked in, refuses connections whose chain doesn't terminate at the cluster CA (PR #1019). Rotation is a re-issue, not a rebuild. The central trusts any cert with a valid chain and a non-expired NotAfter. There is no shared symmetric secret to leak.

Underneath, we replaced the in-process happy-path tests with a real two-process harness: central and worker as separate subprocess.Popen invocations, walked through six chaos scenarios — worker SIGKILL mid-task, central restart with in-flight claims, network partition, token expiry across a claim boundary, two workers racing for the same task, certificate revocation (PR #1020). Bugs the in-process harness silently swallowed — claim re-entry, token clock skew, an off-by-one in the partition healer — surface in seconds. CI runs the matrix on every push.

Operators get five new Prometheus counters and gauges (bernstein_cluster_claims_total, bernstein_cluster_token_rejections_total, bernstein_cluster_partition_seconds, bernstein_cluster_central_restarts_total, bernstein_cluster_workers_active) and six audit event types covering token issuance, claim transitions, certificate rotation, disconnects. Grafana dashboard JSON in observability/dashboards/cluster.json (PR #1021). Plain dashboard, accurate fields.

For deployments that can't expose a port to the worker fleet, there's a tested pattern using Cloudflare Tunnel for the central edge plus Tailscale for the worker mesh. Documented end-to-end. A nightly CI job stands up the topology in ephemeral containers and runs a smoke task through it (PR #1024). The pattern works without modifying firewall rules at the customer site, which is usually the difference between "scheduled for next quarter" and "deploy this week."

What's still missing: MESH and HIERARCHICAL coordinator topologies are stubs. Multi-tenant isolation inside a single cluster — separate quotas, audit chains, model budgets per tenant — is deferred to 1.11. If you need either today, run one cluster per tenant. We're naming the limit out loud because pretending it isn't there is how trust evaporates on the second deployment.

lineage that survives a regulator

Auditing a code change six months after the fact is a problem of information loss. By the time anyone asks "which agent wrote this," the prompt is gone, the cost ledger has rolled, the producing model version has been replaced. The new lineage subsystem keeps the answer.

Every agent write emits a signed lineage record linking the output (file path, byte range, sha256) to its inputs (the files the agent read), producer (agent id, role, model, effort), prompt SHA, cost, token count, wall-clock timestamp. Records chain via HMAC the same way the existing audit log does, so tampering with one record invalidates the chain past it.

bernstein lineage src/auth/middleware.py:74
# wrote by:    backend / claude-sonnet / effort=high
# prompt sha:  3f9a…b421  (template: roles/backend.md@v17)
# inputs:      src/auth/__init__.py  src/auth/jwt.py  tests/test_auth.py
# producer:    session 7c4f1a3b9d22, task #412
# cost:        $0.0214   tokens: 11,983
# signed:      ed25519 (cluster-key)  customer-key: aporia-prod-2026-05

Schema v2 adds two fields: a regulatory_class tag (pii, phi, dora, nis2, none) inferred from the file's policy zone, plus a customer-key signature attached after the cluster-key signature so customers can revoke trust without re-issuing the cluster CA. Combination is what makes a DORA or NIS2 evidence package mechanical to assemble: filter by regulatory_class, walk the chain, hand the bundle to the auditor.

The janitor verifies the chain on every gate run. A tamper hit forwards to a configurable SIEM webhook with the broken record and the surrounding window. Default surface is loud. We'd rather wake an operator than miss a forged signature.

What's still missing: the customer-key signing path uses ed25519 in software. FIPS-140 hardware keys are on the 1.11 roadmap. If FIPS-140 is a hard procurement requirement, you cannot ship today.

distribution without outbound internet

The first sovereign deployment we did, the customer's box could not reach pypi.org, github.com, or any registry we'd ever heard of. Install procedure was an engineer carrying a USB drive through a security checkpoint. Not a problem we wanted to solve twice.

bernstein wheelhouse build produces a self-contained tarball: pinned wheels for the orchestrator and every transitive dependency, embedded model weights for offline classifiers, a manifest with sha256 per file, a detached GPG signature over the manifest. bernstein wheelhouse verify checks both the signature and every per-file hash before any installer logic runs.

# on a connected build host
bernstein wheelhouse build --out bernstein-1.10.0-airgap.tar.gz \
  --sign-with [[email protected]](/cdn-cgi/l/email-protection)

# on the customer box
bernstein wheelhouse verify bernstein-1.10.0-airgap.tar.gz
bernstein wheelhouse install bernstein-1.10.0-airgap.tar.gz \
  --target /opt/bernstein

bernstein --profile airgap doctor airgap

--profile airgap flips the orchestrator into explicit egress-deny: any code path that opens a socket to a non-loopback, non-cluster address fails closed with an error naming the offender. doctor airgap runs ten self-checks — DNS lookup for a poisoned hostname, plaintext HTTP attempt, MCP reachability, model-weight integrity — and returns a single pass/fail line that procurement can paste into a runbook.

This was the piece we expected to be smallest and that ate the most time. Deciding what counts as "egress" inside a complex Python process is a research project. Deciding what to do when a transitive dependency tries to phone home for telemetry is a policy question. We landed on "fail closed, name the caller, document the override" because every other choice creates a quiet failure mode.

a capability gate against the lethal trifecta

The lethal trifecta — private data, untrusted input, external communication — is the prompt-injection escape hatch every multi-agent system eventually trips over. An agent that reads a customer's database, ingests a webhook body crafted by an attacker, and is allowed to call out to a public URL has, by construction, an exfil path. The mitigation in the literature is to refuse the chain, not the individual capabilities.

Tools and adapters now declare capability tags in their manifest:

# src/bernstein/adapters/postgres_query.py
CAPABILITIES = frozenset({"PRIVATE_DATA"})

# src/bernstein/adapters/webhook_ingest.py
CAPABILITIES = frozenset({"UNTRUSTED_INPUT"})

# src/bernstein/adapters/http_post.py
CAPABILITIES = frozenset({"EXTERNAL_COMM"})

At spawn, the orchestrator unions the tags of every tool the prospective agent has access to. If the union contains all three of PRIVATE_DATA, UNTRUSTED_INPUT, and EXTERNAL_COMM, the spawn fails with a refusal naming the offending capability set and the tools that contributed to each tag. Operators can override per-task with --allow-trifecta, which is logged to the audit chain and surfaced in the lineage record.

The gate catches the architectural mistake — assembling a tool belt that shouldn't exist together — before any prompt runs. It cannot prevent a capable insider from manually wiring around it. The default failure mode is now refusal, which is the right default for a tool you don't fully control.

what this batch isn't

A few things are honestly not done. MESH/HIERARCHICAL cluster topologies are stubs. FIPS-140 hardware keys for the lineage signer are on the 1.11 roadmap, not this release. Multi-tenant isolation inside a single cluster is deferred. The capability matrix covers the trifecta and nothing else; finer-grained gates like "no PII into models without a BAA" are next quarter's work. None of these are blockers for the deployments we have lined up. All of them will be eventually.

This isn't a "version 2.0." It's the unglamorous list a procurement reviewer asks about before the technical evaluation has even started. Cluster auth, signed audit, offline install, capability isolation. Table stakes for getting the orchestrator dropped onto a regulated customer's box instead of staying a thing that runs on someone's laptop.

We picked the second option for two years because the first option is mostly paperwork. The batch shipped today is the paperwork.

If you got here from the README and want the codebase view, GitHub is the canonical place. If the regulated-deployment angle is what you came for, open an issue describing the air-gap or compliance gap you're hitting; the next batch is shaped by what blocks real deployments.

Bernstein

Originally published at https://bernstein.run/blog/orchestrator-on-someone-elses-box?utm_source=devto&utm_medium=crosspost&utm_campaign=orchestrator-on-someone-elses-box&utm_content=canonical

1.10.1 through 1.10.6: the shipped things

Alex Chernysh — Mon, 11 May 2026 13:14:35 +0000

The v1.10.0 post covered the regulated-deployment work. The five point releases since are not headline-shaped. They are the things people kept asking for in issues and the things that were quietly broken once we tried to use the orchestrator on a real polyglot codebase. Worth a single round-up so the trajectory is legible from one page.

a single AGENTS.md the rest of the agents can read

If you run more than one CLI agent on the same repo you already know the problem. Cursor wants .cursor/rules/*.mdc. Claude Code wants CLAUDE.md. Aider wants CONVENTIONS.md plus a tiny .aider.conf.yml line so it actually loads on every session. Goose wants .goosehints. Each of those files holds the same handful of facts about your codebase, said five different ways, which means a real repo carries four drifting copies of the same instructions.

bernstein agents-md reads the repo's roles, hooks, skills, capability matrix, and install snippets, emits one AAIF AGENTS.md as the single source of truth, then rewrites that into the four vendor shapes above. Five subcommands. generate previews the canonical IR to stdout. write produces one target. sync produces the canonical plus all four CLI-specific files in one pass. verify is a CI gate that fails on drift. diff shows what is stale.

The IR is intentionally schema-free, because the AAIF spec doesn't impose one and locking ourselves in would have meant fighting upstream every quarter. The CI gate is the part that compounds. After three months of drift you can re-run agents-md sync and watch four files all snap back to the same content without anyone hand-merging. The orchestrator runs agents-md verify against its own tree on every PR; the pattern is the same one anyone with more than one agent ends up wanting.

cost legibility you don't have to grep for

Two patches, both small, both the kind of thing that should have shipped in 1.0.

The first is a per-turn budget banner. bernstein run now prints a one-line countdown each turn: dollars and tokens remaining against the task budget. The Anthropic prompt-caching beta header is lit by default, so cache hits actually land. Operators stop pattern-matching for a wallet limit in their head while the agent is mid-thought. CI runs with a cost ceiling get a real per-turn signal instead of finding out post-mortem.

The second is --max-cost-usd. A hard cap on a run's cumulative routed-model spend. Crosses the threshold, the run aborts cleanly, with the partial results merged or rolled back the same way a normal cancel works. Pair it with the run summary's "estimated savings vs. single-shot through the most expensive routed model" line that 1.10.1 added and the wallet picture is finally visible without a jq on .sdd/runtime/costs.jsonl. The bandit router has been doing the right thing for a while; the operator surface has not.

A2A v1.0 with a verifier you can actually run

If you connect Bernstein to other agents over the A2A protocol, every Bernstein agent now publishes a signed agent card at /.well-known/agent.json and the public verification keys at /.well-known/jwks.json. JWS detached signature over JCS canonical bytes with Ed25519, audience binding via RFC 8707 resource indicators, persistent keystore with O_EXCL plus 0o600 semantics, and a 24-hour rotation grace window so a peer that fetched JWKS five minutes ago can still verify the previous key after a rotation without races.

The compliance side ships a verifier you don't have to trust us to run. tools/verify_audit_dsse.py depends only on the Python standard library and cryptography. Its own test asserts that import bernstein raises ModuleNotFoundError from inside the verifier's venv, because that is the property an external auditor wants from a verifier they will hand to their own team. The audit log itself is HMAC-SHA256 chained, JCS-canonicalised per RFC 8785, timestamp-anchored against an external TSA via RFC 3161 chain validation, and exported as a DSSE plus in-toto v1 envelope. Multi-tenant slicing via bernstein audit slice exports a deterministic subset for an evaluator without breaking the chain on either side.

Honest framing: the compliance surface ships with tests, runbooks, and the standalone verifier above, but it has not been bashed against an external regulatory audit yet. Treat it as code an evaluator can read and stand up themselves, not as a SOC 2 attestation.

four new adapters

Adapter count went from 31 to 44 over the five releases. The four worth calling out by name:

Devin for Terminal (Cognition). First-class adapter for the enterprise coding agent. 558 lines of contract tests verify the spawn surface mirrors the long-running adapter pattern. Drop-in via cli_agent: devin_terminal.
JetBrains Junie. BYOK across Anthropic, OpenAI, Google, xAI, OpenRouter, and the Copilot proxy. Bring whichever key the org already has procurement for.
AWS Q Developer. Wraps q chat --no-interactive --trust-all-tools so AWS-resident teams can route the steps where their security model wants the AWS-trusted lane.
DeepSeek V4-Flash and V4-Pro. Self-hosted via an Ollama-compatible endpoint. Ships an EU-residency guard that pins the endpoint host and rejects DNS rebinding via a loopback test. The Hypothesis bug-hunt suite caught a 10.example.com rebinding bypass while the adapter was still in development, which is roughly the point of running the Hypothesis suite.

The Cursor adapter also got a real rewrite. The previous code shelled a non-existent cursor agent binary with fictional flags. New version targets the real cursor-agent CLI surface (-p, --workspace, --output-format stream-json, --trust, --approve-mcps, --force) with 242 lines of new contract tests so it can't regress to vapor again.

the smaller things

A few that don't need a whole section but matter in a specific situation.

bernstein run learned a pending_approval state. Tasks pause there until an operator approves or rejects through the API or a panel, with the decision logged to the audit chain. The fresh-context retry mode (agent_restart_between_retries, opt-in) restarts the agent cold instead of inheriting the failed run's context bloat, which is the right default once you have watched a 200k-token context retry and somehow get worse.

bernstein scaffold is a first slice for going from one sentence to a working repo skeleton. bernstein wiki build generates a per-repo wiki from the canonical AGENTS.md IR. The A/B runner primitive lets you compare two adapter configurations on the same task set without writing a custom harness. None of these are finished; they ship as the smallest viable slice so the spec, the test, and the runtime artefact all exist while the operational surface stays thin enough not to lock in a bad shape.

There is also an opt-in LLM watcher that reads the deterministic loop's events and annotates them with a natural-language summary. Off by default, runs on Haiku, useful when you are explaining a failed run to a human reviewer who is not going to read the JSONL by hand. The orchestrator stays deterministic. The watcher is a side-channel.

why these matter

Most of the friction in running a multi-agent setup is not the agents. It is the four config files that disagree, the run that quietly burned through a budget at 3am, the A2A peer that won't verify your card because your keystore lost a race condition, the EU-residency requirement that bites the second a transitive dependency tries to phone home. None of those are interesting to write up as a feature. All of them are the thing that decides whether someone runs the orchestrator twice.

try it

pipx install --upgrade bernstein
bernstein agents-md sync          # one canonical, four vendor shapes
bernstein run --max-cost-usd 5    # hard cap; per-turn countdown shows above

Container: ghcr.io/sipyourdrink-ltd/bernstein:1.10.6.

The KF-1 through KF-9 slices each shipped as smallest-viable. The next release fills the operational surface for the ones people actually use; the others stay slices until somebody asks. Hypothesis property-test coverage gets extended into the orchestrator runtime path, which is the surface most likely to leak invariants nobody wrote down. If you hit something rough in 1.10.x, open an issue; the next batch is shaped by what blocks real work.

Bernstein

Originally published at https://bernstein.run/blog/v1-10-x-recap?utm_source=devto&utm_medium=crosspost&utm_campaign=v1-10-x-recap&utm_content=canonical

Orchestration primitive or desktop ADE? Choosing your multi-agent coding layer in 2026

Alex Chernysh — Tue, 21 Apr 2026 14:11:04 +0000

The multi-agent coding tool category went from a handful of projects in late 2024 to thirty-plus by mid-2026. Along the way it split into two shapes that solve adjacent-but-different problems. Here's when to reach for each, and why you might end up using both.

The two shapes

Desktop ADEs. A downloadable desktop application. You install it like any other app, open a window, configure credentials, and see your repo, your agents, and your diffs in a unified UI. Examples in the open-source corner: emdash (Electron app, 23 CLI providers supported, YC W26-funded), Conductor, Cline's desktop mode. Closed-source you'd put in the same category: Claude Code's VS Code extension, Cursor's "run in background" mode.

Orchestration primitives. A library or CLI you import into your own workflow. You don't see a window; you see a process you can pipe into other things. Examples: Bernstein (the project this blog belongs to — 18 CLI adapters, Python-importable), Workz, certain configurations of Plandex. LangGraph and CrewAI are adjacent but different — they orchestrate LLM calls, not CLI coding agents.

The distinction is not about which is better. It's about what layer of the problem you're solving.

What a desktop ADE does well

A desktop ADE gives you:

A visual workspace. Diffs, PR status, CI checks, agent logs all in one window.
Zero-config launch. You open the app, it picks up your repo, agents just work.
Identity handled. Credentials in the OS keychain, not in a .env file that leaks.
Distribution pattern. Electron installers for macOS, Windows, Linux. Your non-terminal colleague can use it.

This shape is the right answer when:

You're the kind of developer who keeps an IDE open all day and wants agents integrated into that workflow, not hidden in a tmux pane.
You're onboarding teammates who don't live in the terminal.
You want one tool that covers edit, review, merge, and CI visibility end-to-end.

What it trades off:

Not programmable from the outside. You can't import emdash or write a CI job that kicks off a parallel agent run via emdash's API. It's a UI, not a library.
Ships with opinionated conventions. Agents live in app-managed worktrees; audit logs live in app databases. Extracting them into another system is possible but not first-class.
Cross-machine coordination is an extra feature (SSH mode, remote runtime) rather than the default shape.

What an orchestration primitive does well

A primitive gives you:

A process you can script. bernstein run --goal "..." | jq . works. So does invoking it from a GitHub Actions workflow, or importing bernstein.core in your Python code.
Deterministic coordination. The scheduler is a regular event loop. Every run is replay-able from the audit trail.
MCP server mode. Your agent-of-choice can talk to the orchestrator through the same Model Context Protocol Anthropic publishes for Claude Code.
Composition. A primitive is one step in a larger pipeline: linter → primitive multi-agent pass → janitor → merge queue → deploy.

This shape is the right answer when:

You want to embed multi-agent coding into a system you already run: CI, internal dev-platform, evaluation harness.
You care about reproducibility. HMAC-chained audit trails give you "did the agent really do exactly that?" answers days later.
You're already in a scripting-first workflow and don't want a new app to keep open.

What it trades off:

No visual diff/merge UI out of the box. You git diff the worktree, or plug it into your existing tools.
Setup needs a terminal. pipx install bernstein && bernstein init, not a double-click installer.
It's one layer of a larger stack. You'll likely pair it with a separate review tool, CI system, and notification channel.

Decision shortcuts

Building a product on top of multi-agent coding? Reach for a primitive. Libraries compose; apps don't.
Onboarding a team that wants a single download? Reach for a desktop ADE. Developer ergonomics of an opinionated installable app is hard to beat for non-power-users.
Running agents as part of CI / evaluation / internal platform? Primitive, nearly always.
Running agents on your own laptop during normal dev work? Either works; it's a preference question. Try both for a week.
Need to prove to compliance or security "here's exactly what happened"? HMAC audit trails live in the primitive layer. ADE output logs are usually app-scoped.

They often co-exist

Nothing prevents running both. A pattern we've seen in Bernstein's early users:

Bernstein in CI for the "every PR gets a lint-plus-refactor agent pass" step.
Desktop ADE for interactive "I'm pairing with Claude Code on this refactor" flow.
Bernstein's MCP server mode exposed to the ADE so both see the same audit trail.

If you're already using a desktop ADE and it covers what you need, keep it. If you hit the "but I want to run this from a shell script / from CI / inside another service" wall, that's the signal to look at a primitive, regardless of which specific one.

Bernstein's specific positioning

Bernstein is the primitive-shape tool. What we optimize for:

Deterministic coordinator written in plain Python — no LLM in the scheduling loop, so runs are reproducible.
HMAC-chained audit trail — every agent action is replay-able bit-for-bit days later.
MCP server mode — expose Bernstein to any MCP-capable client (Claude Code, Cursor, or your own agent).
18 CLI adapters including Claude Code, Codex, Cursor, Aider, Gemini CLI, OpenAI Agents SDK, Amp, Cody, Ollama, and more.
Apache 2.0, BYOK, pipx install bernstein.

What we don't build: a desktop UI. If you need one, emdash and Conductor both do that well and are worth trying.

The category is large enough to have multiple right answers. The question is which layer of your stack you're optimizing for. A primitive and an ADE are not competing with each other. They're competing with the "write a bunch of glue code to make two agents work on the same repo without destroying it" option — which nearly everyone used until twelve months ago, and which neither shape is going back to.

From 4,000 Lines to 200: Decomposing Bernstein's Core

Alex Chernysh — Tue, 21 Apr 2026 14:10:28 +0000

Bernstein's orchestrator.py hit 4,198 lines. We used 11 parallel agents, orchestrated by Bernstein itself, to decompose it into 15 sub-packages in the first pass, each under 400 lines. Subsequent refactors extended this to 22 sub-packages. Here's how that worked and what we learned.

How a file gets to 4,000 lines

It happens gradually. The orchestrator started as a clean 300-line module that managed a tick loop: check for tasks, spawn agents, collect results. Then it grew. Cost tracking logic. Quality gates. Token monitoring. Git worktree management. Heartbeat detection. Idle agent recycling. Shutdown coordination.

Each addition was small and reasonable. But after two months of active development, orchestrator.py was a 4,198-line monolith that imported 47 modules and had 23 public methods. The test file was 2,800 lines. IDE navigation was painful. Merge conflicts were constant because every feature touched the same file.

The rule we now follow: if a module crosses 600 lines, it's time to decompose.

The plan

We defined 15 target sub-packages, each responsible for one concern:

Sub-package	Responsibility	Lines (after)
`orchestration/`	Lifecycle, tick pipeline	~350
`agents/`	Spawner, discovery, heartbeat	~380
`tasks/`	Task store, retry, scheduling	~340
`quality/`	Quality gates, CI monitor	~290
`cost/`	Cost tracking, budgets	~310
`tokens/`	Token monitoring, intervention	~250
`security/`	Audit logs, policy engine	~270
`git/`	Worktree management, merge queue	~280
`persistence/`	WAL, checkpointing	~220
`planning/`	Plan loading, dependencies	~200
`routing/`	Model selection, bandit	~320
`communication/`	Bulletin board, messaging	~180
`server/`	Task server, API	~260
`config/`	Configuration, defaults	~190
`observability/`	Metrics, tracing	~240

The decomposition needed to be backward-compatible. Existing code importing from bernstein.core.orchestrator import Orchestrator had to keep working.

11 agents, 15 packages

Here's the recursive part: we used Bernstein to execute the decomposition. A YAML plan defined 15 extraction stages with dependency edges (e.g., tasks/ had to be extracted before agents/ because the spawner depends on the task store).

11 agents ran in parallel across independent sub-packages. Each agent:

Extracted the relevant functions and classes from orchestrator.py
Created the new sub-package with proper __init__.py exports
Updated all internal imports
Ran the sub-package's tests to verify nothing broke

The whole decomposition took about 3 hours of wall time. A human doing this manually — carefully moving code, fixing imports, running tests after each change — would spend 2-3 days.

The re-export shim pattern

Backward compatibility was the hardest constraint. We solved it with re-export shims. The original orchestrator.py became a thin file that imports from sub-packages and re-exports:

# src/bernstein/core/orchestrator.py (after — ~200 lines, down from 4,198)
"""Orchestrator shim — re-exports from sub-packages for backward compat."""

from bernstein.core.orchestration.lifecycle import Orchestrator
from bernstein.core.orchestration.tick import TickPipeline
from bernstein.core.orchestration.manager import OrchestratorManager
from bernstein.core.orchestration.shutdown import ShutdownCoordinator

__all__ = ["Orchestrator", "TickPipeline", "OrchestratorManager", "ShutdownCoordinator"]

Every existing import path works unchanged. New code imports from the specific sub-package. Over time, the shims can be deprecated.

What we learned

Dependency graphs matter more than you think. The extraction order was critical. Extracting git/ before tasks/ would have created circular imports because the merge queue references task completion callbacks. We had to map the dependency graph before writing the plan.

Tests are the safety net. Each extraction step ran the full test suite. We caught 14 import errors, 3 circular dependencies, and 1 subtle bug where a function relied on module-level state that moved to a different file. Without tests, at least half of those would have shipped broken.

600 lines is a good limit. After the decomposition, the largest sub-package is agents/ at ~380 lines. Every module is small enough to read in one sitting, grep effectively, and test in isolation. When a new file starts approaching 600 lines, we split it proactively.

Orchestrators can orchestrate themselves. There's something satisfying about using your own tool to refactor itself. The decomposition was one of our most complex multi-agent runs, and it validated that the parallel execution model works for real refactoring tasks, not just greenfield code generation.

The result

Before: 1 file, 4,198 lines, 47 imports, constant merge conflicts.
After: 15 sub-packages in the first pass (extended to 22 in later refactors), ~280 lines average, clean dependency boundaries, agents can work on different packages without conflicts.

The full source is on GitHub. The re-export shims are in the top-level files like orchestrator.py, spawner.py, and task_lifecycle.py.

Getting Started: Your First Multi-Agent Run in 5 Minutes

Alex Chernysh — Tue, 21 Apr 2026 14:09:52 +0000

This guide gets you from zero to a working multi-agent session in under 5 minutes. You'll install Bernstein, configure Claude Code as your agent, run a goal, and understand the output.

Step 1: Install Bernstein

Bernstein requires Python 3.12+. Install it with pip or uv:

pip install bernstein

Or if you use uv:

uv pip install bernstein

Verify the installation:

bernstein --version
# bernstein 1.8.8

Step 2: Configure your agent

Bernstein needs at least one CLI coding agent installed. The fastest setup uses Claude Code, but 18 agents are supported including Codex, Gemini CLI, the OpenAI Agents SDK, Aider, and more.

Make sure Claude Code is installed and your API key is set:

# Install Claude Code if you haven't
npm install -g @anthropic-ai/claude-code

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...

Bernstein auto-detects installed agents. Verify it finds yours:

bernstein agents
# Available agents:
#   claude (Claude Code) ✓

Step 3: Run your first goal

cd into any git repository and run a goal:

cd your-project
bernstein run --goal "Add type hints to all functions in src/utils.py"

Bernstein will:

Decompose the goal into concrete tasks
Assign each task a role, priority, and model
Spawn agents in isolated git worktrees
Monitor progress via heartbeats and output parsing
Merge completed work back to your branch

Step 4: Read the TUI

The terminal UI shows live progress:

┌─ Bernstein v1.8.8 ─────────────────────────────────┐
│ Goal: Add type hints to all functions in src/utils  │
│ Tasks: 3 total │ 1 running │ 1 done │ 1 pending    │
│ Agents: 2 active │ Cost: $0.12                      │
├─────────────────────────────────────────────────────┤
│ ✓ task-001  Analyze existing type usage    00:42    │
│ ► task-002  Add type hints to helpers      01:15    │
│ ○ task-003  Add type hints to validators   pending  │
└─────────────────────────────────────────────────────┘

✓ = completed and merged
► = currently running
○ = pending (waiting for dependencies or an available agent)

Press q to stop gracefully (agents finish their current task) or Ctrl+C to force stop.

Step 5: Check the results

When all tasks complete, check what changed:

git log --oneline -5
# a1b2c3d Add type hints to validator functions
# d4e5f6g Add type hints to helper functions
# h7i8j9k Analyze existing type usage in src/utils.py

Each agent's work is a separate commit, merged through Bernstein's merge queue. If any task failed, its changes are rolled back and the failure is logged in .sdd/dead_letter.json.

What to try next

Run a YAML plan for structured, multi-stage projects:

bernstein run plans/my-project.yaml

Plans let you define stages, dependencies, roles, and complexity per task. See the plan file docs for the full schema.

Use multiple agent types by installing additional adapters:

# Bernstein will route tasks to the best available agent
pip install codex-cli  # or install any supported agent
bernstein agents       # see all detected agents

Monitor costs across sessions:

bernstein cost
# Session total: $0.47
# By model: haiku=$0.03, sonnet=$0.28, opus=$0.16

Check the API for programmatic access:

# Task server runs on port 8052 during sessions
curl http://127.0.0.1:8052/status

How Bernstein Routes Tasks to the Right Model

Alex Chernysh — Tue, 21 Apr 2026 14:09:16 +0000

Not every coding task needs Opus. Bernstein's contextual bandit router learns which model handles each task type best, then routes accordingly. In our own runs, the bandit router cut spend roughly in half compared to uniform model selection. Measure yours with bernstein cost.

The uniform selection problem

Most multi-agent setups use the same model for everything. Every task — whether it's renaming a variable or designing an authentication system — gets routed to the same model at the same effort level. This is wasteful. A docs task that writes a docstring doesn't need the same model as a security task that implements credential scoping.

The cost difference is real. At current API pricing, routing a simple task to Haiku instead of Opus costs roughly 30x less. Over a session with 40-60 tasks, that adds up fast.

How the router works

Bernstein's routing pipeline has three layers:

Layer 1: Heuristic classification. Every task has a complexity field (low, medium, high) and a role (backend, frontend, qa, security, etc.). The router uses a rule-based classifier to make an initial model/effort assignment. Low-complexity tasks default to Haiku or Sonnet with standard effort. High-complexity tasks get Opus with max effort.

Layer 2: Epsilon-greedy bandit. This is where it gets interesting. The bandit maintains per-role reward estimates for each model. When a task arrives, it exploits the best-known model 80% of the time and explores alternatives 20% of the time. Rewards come from task outcomes: did the agent complete the task? Did tests pass? How many retries were needed?

# Simplified selection logic
candidates = ["sonnet", "opus"] if task.complexity == "high" else CASCADE
selected = bandit.select(role=task.role, candidate_models=candidates)

The CASCADE list includes all available models from cheapest to most capable. For high-complexity tasks, the bandit only considers Sonnet and Opus — sending a hard architecture task to Haiku would waste the agent's time even if it's cheap.

Layer 3: Effectiveness seeding. The bandit warms up using historical effectiveness data from the .sdd/metrics/ directory. If a previous run showed that backend tasks succeed 95% of the time with Sonnet but only 70% with Haiku, the bandit starts with that prior. No cold-start problem after the first session.

What the router learns

After a few sessions, clear patterns emerge:

Task type	Typical model	Why
Docs, docstrings	Haiku	Templated output, low reasoning
Test writing	Sonnet	Needs code understanding, not creativity
Bug fixes	Sonnet	Pattern matching on error traces
Refactoring	Sonnet/Opus	Depends on scope
Architecture, security	Opus	Requires deep reasoning

These aren't hardcoded rules — they're learned from outcomes. If your codebase has unusually complex tests, the bandit will learn to route test tasks to a stronger model.

Configuration

The bandit is enabled by default when a metrics directory exists. You can tune exploration rate and model cascade in your config:

# .sdd/config.yaml
routing:
  bandit_epsilon: 0.2          # 20% exploration
  cascade: [haiku, sonnet, opus]
  min_samples_per_arm: 5       # explore each option at least 5 times

To disable bandit routing and use pure heuristics:

routing:
  bandit_enabled: false

The numbers

Across our internal runs (self-development sessions where Bernstein improves its own codebase), the bandit router cut per-session spend roughly in half compared to the baseline of Sonnet-for-everything. Task completion rates stayed within a couple of percentage points, so cheaper models handle their assigned tasks fine. Measure your own runs with bernstein cost.

The savings compound. A 10-agent session running 50 tasks might cost $15-20 with uniform Sonnet. With bandit routing, the same session runs $7-10. Over weeks of iterative development, that's the difference between a side project budget and a real expense.

Community Spotlight: April 2026

Alex Chernysh — Tue, 21 Apr 2026 14:08:40 +0000

Every month we spotlight the people who make Bernstein better. Here are April's highlights from the first month of public development.

What happened in April

Bernstein went from v1.0.0 to v1.8.8 in a few weeks. The pace was intense, and community contributions made a real difference:

Architecture decomposition: 52 oversized modules broken into 22 focused sub-packages, each under 600 lines. The orchestrator monolith (4,198 lines) is now navigable, testable, and merge-conflict-free. Read the full story.
18 agent adapters: We started with 7 adapters and now support 18: Claude Code, Codex, Gemini CLI, OpenAI Agents SDK, Cursor, Aider, Amp, Kiro, Kilo, Qwen, Goose, Ollama, Cody, Continue, OpenCode, Cloudflare Agents, IaC, and a generic wrapper. Each adapter is a focused Python class under 200 lines.
Cost-aware routing: The contextual bandit router learns which model handles each task type best. In our own runs, the bandit cut spend roughly in half compared to sending everything to the same model.
Cloudflare cloud execution: Agents can now run on Cloudflare Workers with Durable Workflows, R2 artifact storage, and D1 state.
Windows support: Full cross-platform compatibility contributed by @oldschoola: environment passthrough, Unicode safety, process management, and terminal handling.

Contributors

Thanks to everyone who contributed PRs, reported bugs, and tested edge cases this month:

@oldschoola: Windows compatibility (3 merged PRs), codex config, task filtering, auto-PR
@Ai-chan-0411: community spotlight template
@alexanderxfgl-bit: spotlight generator script
@forfreedomforrich-eng: --dry-run flag, trigger URL fix
@TheCodingDragon0: bernstein config diff, glossary
@internet-dot: HOL workflow
@Beledarian: config path validation

All contributors are listed in CONTRIBUTORS.md.

How to get involved

Bernstein is Apache 2.0 and welcomes contributions of all sizes:

Good first issues: curated tasks for newcomers
Write a blog post: get published on bernstein.run
Adopt an adapter: become the maintainer for your favorite agent
Submit benchmarks: share your orchestration metrics

Running AI Agents on Cloudflare: Workers, Workflows, and Durable Objects

Alex Chernysh — Tue, 21 Apr 2026 14:08:39 +0000

Bernstein v1.8.4 ships with Cloudflare cloud execution. Agents can now run on Workers, multi-step tasks use Durable Workflows, artifacts go to R2, and state persists in D1. Here's the architecture and how to deploy.

Why local-only limits adoption

Running agents locally works for individual developers, but it has real constraints. Your laptop is the bottleneck: CPU, memory, and network all compete with your actual work. Long-running sessions drain battery. If you close your laptop, the session dies. And scaling beyond 4-5 concurrent agents on a MacBook starts hitting resource limits.

Cloud execution solves this. Agents run on remote infrastructure while you monitor progress from a dashboard or TUI. Sessions survive disconnects. You can scale to 20+ concurrent agents without melting your machine.

The Cloudflare stack

Cloudflare recently became OpenAI's infrastructure partner for agent cloud computing — the same infrastructure Bernstein agents can now run on. We chose Cloudflare's stack because it maps cleanly to orchestration primitives:

Workers handle lightweight, stateless agent execution. Each agent task runs in an isolated Worker with its own environment. Workers cold-start in under 50ms, so spinning up a new agent is nearly instant.

Durable Workflows orchestrate multi-step tasks. When an agent needs to clone a repo, run code, execute tests, and report results, the workflow ensures each step completes before the next begins — with automatic retries on failure. If a Worker crashes mid-task, the workflow resumes from the last completed step, not from scratch.

R2 stores artifacts. Agent outputs — diffs, test results, generated files — persist in R2 buckets. The orchestrator reads results from R2 when merging completed work back to the main branch.

D1 holds orchestration state. Task queues, agent assignments, cost metrics, and audit logs all live in D1. This replaces the local .sdd/ file-based state with a durable database that survives restarts and supports concurrent access from multiple Workers.

Architecture overview

Architecture diagram omitted in this cross-post. See the original post on bernstein.run for the rendered version.

The orchestrator itself runs as a Worker with a Durable Object for maintaining tick state. Agent Workers are spawned per-task and communicate results back through R2 and D1.

Deploying

Prerequisites: a Cloudflare account with Workers, R2, and D1 enabled, and wrangler installed.

# Authenticate with Cloudflare
wrangler login

# Deploy the Bernstein cloud stack
bernstein cloud deploy --project my-project

# This creates:
#   - Orchestrator Worker + Durable Object
#   - R2 bucket: bernstein-my-project-artifacts
#   - D1 database: bernstein-my-project-state
#   - Workflow definitions for multi-step tasks

Once deployed, run tasks against the cloud backend:

# Run a goal on cloud infrastructure
bernstein run --goal "Refactor auth module" --cloud

# Monitor from your terminal
bernstein cloud status

# Or check the dashboard
bernstein dashboard --cloud

Agent API keys (Anthropic, OpenAI, etc.) are stored as Worker secrets via wrangler secret put. They never leave the Cloudflare network.

Cost considerations

Cloudflare Workers pricing is request-based, not instance-based. You pay for the compute your agents actually use, not for idle VMs. For a typical 50-task session:

Workers compute: ~$0.50-2.00
R2 storage: pennies (artifacts are small)
D1 reads/writes: pennies (state operations are lightweight)

The cloud infrastructure cost is a small fraction of the LLM API costs that agents incur. The real savings come from not needing to keep your machine running and from being able to scale to more concurrent agents.

What's next

We're working on scheduled runs (trigger a session from a cron or GitHub webhook), multi-region execution (run agents closer to the repos they're working on), and a hosted dashboard for monitoring cloud sessions without a local CLI.

Try it: pip install bernstein and check the getting started guide.
Source: github.com/chernistry/bernstein

Stop using LLMs to schedule other LLMs

Alex Chernysh — Wed, 08 Apr 2026 12:56:54 +0000

Three AI coding agents on the same repo = three agents overwriting each other's work. Claude Code edits auth.py. Codex edits auth.py two seconds later. Claude's changes vanish. Meanwhile Gemini "refactors" the test suite and breaks six things.

Two weeks of this. Here's what fixed it: git worktrees per agent, a deterministic Python scheduler (not an LLM), and a janitor that verifies work before merge.

The wrong turn

My first orchestrator used an LLM to coordinate the other LLMs. A manager agent read the backlog, decided assignments, checked progress, re-planned on failure.

It was slow, expensive, and kept hallucinating priorities. ~40% of total tokens went to coordination overhead instead of code.

Then the obvious hit: scheduling is a solved problem. Operating systems have done concurrent process scheduling since the 1960s. Nobody uses neural networks for cron. Why use one for task assignment?

I ripped out the LLM scheduler. The result is Bernstein, an open-source orchestrator that coordinates any CLI coding agent with zero LLM tokens on scheduling.

The pipeline

Four stages:

Decompose: one LLM call takes your goal, outputs a task graph with roles, owned files, and dependencies.
Spawn: each task gets a fresh CLI agent in an isolated git worktree. Parallel execution. Main branch untouched.
Verify: a janitor checks concrete signals. Tests pass, files exist, linter clean, types correct. Binary outcomes, not opinions.
Merge: verified work lands on main. Failed tasks retry on a different model or get decomposed further.

Goal → Planner (LLM) → Task Graph → Orchestrator (Python) → Agents ‖
                                         ↓
                                    Janitor → Merge

The orchestrator is a Python event loop that polls a local task server, matches open tasks to available agents, and manages lifecycle. Deterministic, auditable, reproducible. Same inputs produce the same decisions.

Worktrees: the part that unlocked it

Each agent gets its own git worktree on a disposable branch:

git worktree add .sdd/worktrees/session-abc123 -b agent/session-abc123
# agent works in isolation
# janitor verifies, then:
git checkout main
git merge agent/session-abc123 --no-ff
git worktree remove .sdd/worktrees/session-abc123

Each agent thinks it owns the repo. No file locks, no coordination protocol between agents, no conflicts during work. The task graph declares file ownership, so overlapping files never get assigned concurrently.

Expensive directories (node_modules, .venv) get symlinked from the main tree so you don't pay setup cost per agent.

Model routing without vibes

Renaming a variable doesn't need Opus. But static rules for model selection go stale fast.

Bernstein uses a LinUCB contextual bandit that learns from outcomes. Features: complexity tier, file scope, role, estimated token budget. Reward: quality_score * (1 - normalized_cost). Cheapest model that passes the janitor wins.

Under ~50 completions it falls back to static cascade (haiku → sonnet → opus). After warm-up the bandit takes over. Policy persists across runs so learning accumulates.

Net effect in my runs: ~23% cost reduction vs. running everything on one top-tier model.

New in v1.8: MCP server mode

Since the original post, Bernstein gained a Model Context Protocol server. Any MCP-aware client (Claude Desktop, Cursor, VS Code, Zed) can now call Bernstein as a tool:

bernstein mcp --transport stdio

Your IDE agent decomposes a goal, calls bernstein_run, and Bernstein fans out the work across 12 parallel CLI agents in worktrees. The IDE agent just waits for results. One cheap router model at the top, a swarm of cheap workers below, one expensive reviewer at the end — instead of one Opus chewing through 40 serialized tasks.

How it differs from CrewAI, AutoGen, LangGraph, Composio, emdash

	Bernstein	CrewAI / AutoGen / LangGraph	Composio / emdash
Scheduling	Deterministic Python	LLM-driven	Hosted/UI-driven
Works with	20+ CLI agents (Claude Code, Codex, Aider, etc.)	Their SDK classes	Their desktop app / web UI
Git isolation	Worktree per agent	None	Varies
Verification	Janitor + quality gates	Mostly absent	Mostly absent
Agent lifetime	Short: spawn, work, exit	Long-running	Long-running
State	File-based (inspect with `cat`)	In-memory / checkpointer	Cloud/hosted
Interface	CLI + MCP server	SDK	Desktop ADE

Philosophical difference: CrewAI/AutoGen/LangGraph are frameworks — you write agents in their SDK. Composio and emdash are desktop ADEs — you use their UI. Bernstein is infrastructure — you point it at Claude Code, Codex, or Aider (or all three in one run) and it handles the rest.

The LLM-driven coordination in those frameworks is non-deterministic and hard to debug. When Bernstein assigns task #47 to Sonnet, you can read the policy file and trace the feature vector that selected it. No prompt archaeology.

Trade-off: no agent-to-agent chat, no built-in RAG, no hosted option. It's a CLI for people who want their agents to write code and get out.

What still sucks

Agents hallucinate file paths. The janitor catches it, but retries cost tokens.
Context windows fill up on large codebases. Short-lived agents help; it's still a real constraint.
12 parallel Opus agents is not cheap. Budgets and the bandit help. Not attention-free.
Setup friction. At least one CLI agent must be installed and authenticated.
File ownership isn't bulletproof. Agents occasionally touch files outside their scope.

This is v1.8, not v10. But the core loop is stable and I've been running it against production code for months.

Getting started

pip install bernstein
cd your-project
bernstein init
bernstein -g "Add rate limiting middleware"
bernstein live    # TUI
bernstein cost    # spend so far

For multi-stage work, a YAML plan:

stages:
  - name: backend
    steps:
      - goal: "Add rate limiting middleware"
        role: backend
        complexity: medium
      - goal: "Integration tests for rate limiter"
        role: qa
        complexity: low
  - name: docs
    depends_on: [backend]
    steps:
      - goal: "Document rate limiting in OpenAPI spec"
        role: docs
        complexity: low

bernstein run plan.yaml              # deterministic execution
bernstein run --dry-run plan.yaml    # preview + cost estimate

Mix models in the same run. Claude Code for architecture, Gemini for boilerplate, Aider with a local Ollama model for offline tasks.

GitHub repo. Apache 2.0. Star if it saves you a merge conflict.

If you've been babysitting one agent at a time, try the worktree-per-agent pattern and tell me what breaks. I'm especially interested in failure modes I haven't hit yet.

DEV Community: Alex Chernysh

bernstein 2.x recap: lineage, ten trackers, A2A capability cards, and a CI that started fixing itself

a transparency log per artefact

ten trackers, one contract

interop, finally

operator surfaces, plural

a CI that started fixing itself

deterministic replay and one writer per session

cost guards, calibrated

supply chain, secrets, and a sandbox for the UI agents

observability under one umbrella

the smaller things

two open questions for the community

why these matter

try it

next

related

Forecasting Without Prophecy: a plain-text discipline

The illusion of point estimates

Reference classes before stories

Premortems and falsifiers

Abstention as a feature

Calibration is a habit, not an event

Scenario discipline in plain text

Try it on something small

Related reading

Further reading

RightLayout: Shipping a Mac AI Tool, Then Letting Go

1. The problem with dictionary punto-switchers

2. The bet

3. Where the model actually wins

4. Why I open-sourced it instead of scaling it

5. What you can take

Repositories and downloads

Related reading

Shipping the orchestrator onto someone else's box

the questions

cluster mode without ambient trust

lineage that survives a regulator

distribution without outbound internet

a capability gate against the lethal trifecta

what this batch isn't

1.10.1 through 1.10.6: the shipped things

a single AGENTS.md the rest of the agents can read

cost legibility you don't have to grep for

A2A v1.0 with a verifier you can actually run

four new adapters

the smaller things

why these matter

try it

next

Orchestration primitive or desktop ADE? Choosing your multi-agent coding layer in 2026

The two shapes

What a desktop ADE does well

What an orchestration primitive does well

Decision shortcuts

They often co-exist

Bernstein's specific positioning

From 4,000 Lines to 200: Decomposing Bernstein's Core

How a file gets to 4,000 lines

The plan

11 agents, 15 packages

The re-export shim pattern

What we learned

The result

Further reading

Getting Started: Your First Multi-Agent Run in 5 Minutes

Step 1: Install Bernstein

Step 2: Configure your agent

Step 3: Run your first goal

Step 4: Read the TUI

Step 5: Check the results

What to try next

Further reading

How Bernstein Routes Tasks to the Right Model

The uniform selection problem

How the router works

What the router learns

Configuration

The numbers