<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Chernysh</title>
    <description>The latest articles on DEV Community by Alex Chernysh (@alex_chernysh).</description>
    <link>https://dev.to/alex_chernysh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3867467%2F289c1248-2b35-42b9-a79d-2ee8ce4b0a93.jpeg</url>
      <title>DEV Community: Alex Chernysh</title>
      <link>https://dev.to/alex_chernysh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alex_chernysh"/>
    <language>en</language>
    <item>
      <title>bernstein 2.x recap: lineage, ten trackers, A2A capability cards, and a CI that started fixing itself</title>
      <dc:creator>Alex Chernysh</dc:creator>
      <pubDate>Wed, 20 May 2026 18:05:02 +0000</pubDate>
      <link>https://dev.to/alex_chernysh/bernstein-2x-recap-lineage-ten-trackers-a2a-capability-cards-and-a-ci-that-started-fixing-31g2</link>
      <guid>https://dev.to/alex_chernysh/bernstein-2x-recap-lineage-ten-trackers-a2a-capability-cards-and-a-ci-that-started-fixing-31g2</guid>
      <description>&lt;p&gt;Ten days since the 1.10 recap. Thirteen point releases later. Not a roadmap and not a refactor. The cumulative effect of fixing the things that started to hurt the moment we tried to run the orchestrator on a regulated codebase, against a non-GitHub backlog, alongside three editors that all wanted to host the agent themselves.&lt;/p&gt;

&lt;p&gt;This is a single round-up so the trajectory between then and now is legible from one page. The point releases are not headline-shaped individually. Grouped by theme, they are: a per-artefact transparency log with signatures every auditor can verify on their own laptop; ten tracker adapters from Jira to Plane under one contract; A2A capability cards plus an MCP client that treats every upstream as untrusted; a web UI, a PWA, and a one-command registration for seven host editors; a Playwright sandbox for the UI agents; a secrets broker plus the supply-chain hardening around it; an auto-heal CI that finally grew teeth; cost guards backed by a Brier-scored forecast log; and a single-writer state model that makes a session reconnect across machines.&lt;/p&gt;

&lt;h2&gt;
  
  
  a transparency log per artefact
&lt;/h2&gt;

&lt;p&gt;The audit chain was already there. What was missing: the part that lets two agents touch the same file without losing the trail.&lt;/p&gt;

&lt;p&gt;Lineage now writes every agent edit as an Ed25519-signed entry against the agent's A2A card. Two writers on the same path surface as siblings; the Steward writes an explicit merge entry rather than letting one quietly overwrite the other. &lt;code&gt;bernstein lineage gate&lt;/code&gt; is a required CI check; merges with unresolved parallel-edit forks fail. The same idea layers on tracker state moves — every transition the orchestrator drives (open, label, transfer, comment, close) is captured as a signed entry, so a ticket that loses or gains the wrong label reads back, line by line, who did it and what they had signed for.&lt;/p&gt;
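
&lt;p&gt;A minimal sketch of the fork rule a gate like this could enforce, not Bernstein's implementation. The &lt;code&gt;Entry&lt;/code&gt; shape and the &lt;code&gt;unresolved_forks&lt;/code&gt; helper are hypothetical, and signatures are left out so the example stays stdlib-only. Two entries that build on the same parent for the same path count as siblings, and they stay unresolved until a merge entry names them as parents.&lt;/p&gt;

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical lineage entry; field names are illustrative only."""
    entry_id: str
    path: str
    parents: tuple  # ids of the entries this one builds on


def unresolved_forks(entries):
    """Return sorted (path, parent_id) pairs whose sibling edits were never merged."""
    children = defaultdict(list)
    for entry in entries:
        for parent in entry.parents:
            children[(entry.path, parent)].append(entry)

    # A merge entry names two or more parents and resolves each one it lists.
    merged = set()
    for entry in entries:
        if len(entry.parents) >= 2:
            merged.update(entry.parents)

    forks = []
    for (path, parent), kids in children.items():
        still_open = [kid for kid in kids if kid.entry_id not in merged]
        if len(still_open) >= 2:
            forks.append((path, parent))
    return sorted(forks)
```

&lt;p&gt;A CI check built on this would exit non-zero whenever the returned list is non-empty.&lt;/p&gt;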

&lt;p&gt;The compliance side ships a verifier you don't need to trust the orchestrator to run. &lt;code&gt;bernstein compliance pack --since … --org "…" --output …&lt;/code&gt; produces a signed ZIP with PDF, CSV, raw log, Agent Cards, and a SLSA-style manifest. &lt;code&gt;bernstein-verify pack .zip&lt;/code&gt; is its own wheel: &lt;code&gt;cryptography&lt;/code&gt; and &lt;code&gt;click&lt;/code&gt; are the only dependencies. The verifier's own test asserts that &lt;code&gt;import bernstein&lt;/code&gt; raises &lt;code&gt;ModuleNotFoundError&lt;/code&gt; from inside its venv, because that is the property an external auditor wants from a verifier they will hand to their own team.&lt;/p&gt;

&lt;p&gt;Honest framing: the surface ships with tests, runbooks, and the standalone verifier. It has not been bashed against an external regulatory audit yet. Treat it as code an evaluator can read and stand up themselves, not as an attestation. The three reference demos under &lt;code&gt;examples/lineage/&lt;/code&gt; (fintech, healthcare, EU manufacturer) are written so a compliance officer can pattern-match the EU AI Act Article 12 surface against their own paperwork.&lt;/p&gt;

&lt;h2&gt;
  
  
  ten trackers, one contract
&lt;/h2&gt;

&lt;p&gt;The work people had been asking for since 1.0. Ten adapters land under one &lt;code&gt;TrackerContract&lt;/code&gt;: Jira Cloud and Jira DC, GitLab Issues, Linear, Plane, Asana, ServiceNow, ClickUp, GitHub Projects v2, plus webhook ingestion. Third-party trackers plug in via the same pluggy hookspec the orchestrator uses internally; &lt;code&gt;bernstein trackers&lt;/code&gt; surfaces them on the CLI. A multi-tracker federation layer sits above the adapters, so a single team running Jira for engineering and Linear for product can route to both from one orchestrator config.&lt;/p&gt;

&lt;p&gt;Two design surprises earned their own slices. Tracker comments became the orchestration handoff bus: worker agents now coordinate over the same comment thread the operator reads, so a session resumes across CLI restarts and across operator machines without a synthetic state file in between. And the issue-to-PR pipeline walks a tracker issue through plan synthesis, plan-comment posting for human review, and PR creation in one path. Run-failure classification closes the loop on the other side: when a run dies, the orchestrator labels the ticket with what class of failure killed it.&lt;/p&gt;

&lt;p&gt;The dishonest framing would be that you can wire this up in an afternoon. The honest one is that all ten adapters were lit in two weeks while one operator (me) ate the integration cost ten different ways, and every one of them is bound by the same conformance suite that has been keeping the CLI adapters honest. If your shop runs on Linear and you want the same orchestration semantics as a GitHub-resident one, the contract says you should get them.&lt;/p&gt;

&lt;h2&gt;
  
  
  interop, finally
&lt;/h2&gt;

&lt;p&gt;The piece that kept blocking real cross-process work was the lack of a real handshake. Claude Desktop is one process, Claude Code is another, both can spawn agents, neither knew what the other had already decided.&lt;/p&gt;

&lt;p&gt;A2A capability cards close that gap. One process mints a signed manifest of what it can do; the other consumes it, verifies the signature against a trusted-issuer set, and refuses to delegate when the advertised policies don't meet the operator's required policies. The lineage chain rides through the same envelope, so the audit trail does not break at the organisation boundary. The handshake builds on the A2A v1.0 contract: JCS body per RFC 8785, detached JWS per RFC 7515, Ed25519 per RFC 8037, JWKS at &lt;code&gt;/.well-known/jwks.json&lt;/code&gt;, audience binding via RFC 8707.&lt;/p&gt;
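
&lt;p&gt;The refusal rule itself is small. A sketch under stated assumptions: the card is a plain dict with hypothetical &lt;code&gt;issuer&lt;/code&gt; and &lt;code&gt;policies&lt;/code&gt; keys, and the JWS signature has already been verified elsewhere. The real card schema may differ.&lt;/p&gt;

```python
def may_delegate(card, trusted_issuers, required_policies):
    """Decide whether to delegate to the peer that presented `card`.

    Assumes the signature on the card was verified before this call;
    only the issuer and policy rules are applied here.
    """
    if card.get("issuer") not in trusted_issuers:
        return False
    advertised = set(card.get("policies", ()))
    missing = set(required_policies) - advertised
    return not missing
```

&lt;p&gt;Delegation fails closed: an unknown issuer or a single missing policy is enough to refuse.&lt;/p&gt;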

&lt;p&gt;The MCP client got the matching upgrade. Upstream servers will return malformed responses, hang mid-stream, demand re-auth, lie about their capability manifest. The client now treats every upstream as untrusted: capability-card validation before a tool call, retry-with-continuation on dropped streams, in-flight cancellation that preserves partial output, per-server cost metering, schema-violation containment that marks a misbehaving server degraded for the rest of the task. None of this is exotic; it is the brittle-real-world posture that the larger MCP ecosystem will end up needing.&lt;/p&gt;

&lt;p&gt;The MCP server side got a prompt catalogue plus OAuth-2 PKCE discovery metadata so auto-discovering hosts that expect a real RFC 8414 / RFC 9728 surface stop skipping us. Full token issuance and OIDC federation are deferred to a follow-up; the discovery metadata is what unblocks the common case.&lt;/p&gt;

&lt;h2&gt;
  
  
  operator surfaces, plural
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;bernstein gui serve&lt;/code&gt; boots a FastAPI server with a React SPA mounted at &lt;code&gt;/ui&lt;/code&gt;. No Node toolchain at install time; the Vite bundle is committed under &lt;code&gt;src/bernstein/gui/static/&lt;/code&gt;. Default at &lt;code&gt;http://127.0.0.1:8052/ui/&lt;/code&gt;. Tasks, Agents, Approvals, Audit, Costs, Fleet, Settings. Six functional panels and one placeholder. The per-task drawer has six tabs: Summary, Logs (SSE-streamed with ANSI, virtualised list, search), Diff (split or unified &lt;code&gt;git diff ...&lt;/code&gt; with syntax highlight, copy, &lt;code&gt;.patch&lt;/code&gt; download), Gates (quality-gate report with auto-expanded failures), Deps (upstream / downstream task graph), Trace (timeline from &lt;code&gt;.sdd/traces/{task_id}.jsonl&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;bernstein gui serve --tunnel&lt;/code&gt; publishes through the tunnel driver registry (cloudflared / ngrok / bore / tailscale, auto-select). The same command issues a URL-safe bearer token plus a 6-word diceware passphrase, persisted at &lt;code&gt;~/.bernstein/dashboard.passphrase&lt;/code&gt; with mode 0600, prints an ASCII QR, and ships an installable PWA: service worker with stale-while-revalidate for &lt;code&gt;/api/projects&lt;/code&gt; and &lt;code&gt;/api/cost&lt;/code&gt;, programmatic maskable icons, iOS Safari and Android Chrome install cleanly.&lt;/p&gt;

&lt;p&gt;For operators who already live in another host: &lt;code&gt;bernstein desktop-register --host&lt;/code&gt; writes the host-specific config entry for Claude Desktop, Claude Code, Cursor, Continue, Cline, Zed, and Aider. One command. &lt;code&gt;bernstein doctor --substrate&lt;/code&gt; reports which hosts have us registered, which do not, and which have a stale registration. The orchestrator is a guest in the host's settings file; we ship the plugin, the host renders it.&lt;/p&gt;

&lt;p&gt;Honest framing on the web UI: it is a minimal demo of the operator surface. No theme toggle, no mobile-responsive pass, the Settings screen is a placeholder, the Fleet screen is scaffolding with a real data plane behind it, no front-end test suite, no Playwright smoke test in CI. It exists because the core could support it and operators asked. Tracking issue #1262 is the contributor welcome mat; small PRs preferred. Each per-host adapter is small enough that a host-spec change is a one-day fix, not a re-architecture. That is the cost of being a guest.&lt;/p&gt;

&lt;h2&gt;
  
  
  a CI that started fixing itself
&lt;/h2&gt;

&lt;p&gt;The auto-heal daemon shipped with twenty-six parameters and produced zero successful heals in its first three weeks. The post-mortem was dull in the best way. A fetch URL had moved. The classifier was missing the &lt;code&gt;agents-md drift&lt;/code&gt; class, so doc-only commits looked like a new failure shape. Ruff was running before &lt;code&gt;agents-md sync&lt;/code&gt;, so the sync's whitespace tweaks looked like lint regressions. And the heal-branch CI never started because push events from &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; don't fire downstream workflows by default; the daemon now dispatches explicitly.&lt;/p&gt;

&lt;p&gt;The rest of the immune system landed in the same wave. Inline-pushing the regenerated lockfile to a PR head instead of opening a bot-PR for it removed the dominant source of bot PRs. A weekly aggregated digest issue replaces N auto-release-skipped notifications. A hotfix R-counter detects when a hotfix begets another hotfix and blocks further auto-merge after two in a row. A trunk-health Andon gate holds merges on red trunk. An idempotency self-check in the regen path makes a non-deterministic regen halt itself instead of looping. CI concurrency is split by branch so rapid-merge bursts drain the queue instead of cancelling each other. The macOS jobs, whose runner queue ran 20 to 70 minutes during burst-merge waves, were split off the per-PR default matrix into gated jobs (push-to-main, &lt;code&gt;macos_sensitive&lt;/code&gt;-path-changed, or &lt;code&gt;macos-needed&lt;/code&gt;-labelled), with a nightly full matrix.&lt;/p&gt;

&lt;p&gt;Every PR now also passes through a review-bot acknowledgement gate. CodeRabbit and Sourcery findings classified as must-address block merge until they are addressed in a fixup commit (with &lt;code&gt;bot-ack:&lt;/code&gt; in the commit message) or acknowledged in the PR body with a structured marker (&lt;code&gt;reason=... --&amp;gt;&lt;/code&gt;). A nightly sweeper and a reusable shepherd workflow template ship in the same wave, so the cadence stays predictable.&lt;/p&gt;

&lt;h2&gt;
  
  
  deterministic replay and one writer per session
&lt;/h2&gt;

&lt;p&gt;Three small things compounded into something operationally useful. Session ids are bound deterministically so a replayed run reproduces its own event stream without colliding with a sibling. The supervisor enforces a bounded respawn budget and parks an agent when the budget is exhausted, instead of looping respawns indefinitely. On-disk state has a versioned migrations module so an older &lt;code&gt;.sdd/&lt;/code&gt; upgrades predictably. Plus the cosmetic-but-real win: runs surface a memorable deterministic name in user-facing output, so the operator can refer to "the brisk-sparrow run" instead of memorising a UUID.&lt;/p&gt;

&lt;p&gt;The bigger structural piece is the single-writer RunActor. One per-session actor owns canonical state. Mutations flow as typed events through one async queue. A pure &lt;code&gt;apply_event&lt;/code&gt; reducer applies them with monotonic seq numbers. &lt;code&gt;ReplayBuffer&lt;/code&gt; is a bounded ring (default 1024) that emits an explicit &lt;code&gt;Gap{up_to_seq}&lt;/code&gt; marker when a subscriber asks for an evicted range, so a reconnect-after-eviction is observable instead of silently corrupt.&lt;/p&gt;
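
&lt;p&gt;To make the eviction behaviour concrete, here is a minimal sketch of a bounded replay ring with an explicit gap marker. The class and field names echo the ones above, but the method names and internals are assumptions for illustration, not the shipped code.&lt;/p&gt;

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    seq: int
    payload: str


@dataclass(frozen=True)
class Gap:
    up_to_seq: int  # highest sequence number that is no longer replayable


class ReplayBuffer:
    """Bounded ring of events; the oldest entries are evicted first."""

    def __init__(self, capacity=1024):
        self._ring = deque(maxlen=capacity)
        self._next_seq = 1

    def append(self, payload):
        event = Event(self._next_seq, payload)
        self._ring.append(event)
        self._next_seq += 1
        return event

    def replay(self, after_seq):
        """Return events newer than `after_seq`, led by a Gap if some were evicted."""
        out = []
        if self._ring:
            oldest = self._ring[0].seq
            if oldest > after_seq + 1:
                out.append(Gap(up_to_seq=oldest - 1))
        out.extend(event for event in self._ring if event.seq > after_seq)
        return out
```

&lt;p&gt;A subscriber that sees a &lt;code&gt;Gap&lt;/code&gt; first knows its local state is stale and can ask for a fresh snapshot instead of applying the remaining events on top of a hole.&lt;/p&gt;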

&lt;p&gt;&lt;code&gt;bernstein simulate&lt;/code&gt; is the digital-twin runner that pairs with this. Feed it a plan plus a route and it executes the orchestration without the adapter network. Rehearse an expensive plan before paying for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  cost guards, calibrated
&lt;/h2&gt;

&lt;p&gt;The bandit router had been doing the right thing for a while. What was missing was a way to read the routing decisions back.&lt;/p&gt;

&lt;p&gt;A per-task criterion profile plus TOPSIS multi-criteria ranking means a latency-sensitive task routes differently from a thorough one. A structured decision log covers every routing, retry, and gate verdict with its inputs. The calibration log got teeth: every forecast is scored with a Brier score. Per-quota-envelope attribution shows where the spend actually went, not where the most expensive role declared it would. The preflight estimator stopped picking the first declared role and started picking the most expensive one; old behaviour underestimated by 40 to 60 percent on multi-role plans.&lt;/p&gt;
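
&lt;p&gt;For readers who have not met it: a Brier score is the mean squared difference between a forecast probability and the 0/1 outcome. Lower is better, and always answering 0.5 scores 0.25. A minimal sketch, assuming a hypothetical log of &lt;code&gt;(probability, happened)&lt;/code&gt; pairs rather than the orchestrator's actual log format:&lt;/p&gt;

```python
def brier_score(forecasts):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not forecasts:
        raise ValueError("need at least one resolved forecast")
    total = sum((prob - float(happened)) ** 2 for prob, happened in forecasts)
    return total / len(forecasts)
```

&lt;p&gt;Scoring only works on resolved forecasts, which is why the log has to record the outcome next to the original probability.&lt;/p&gt;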

&lt;p&gt;The hard cap is &lt;code&gt;--max-cost-usd&lt;/code&gt;. Cross the threshold, the run aborts cleanly, partial results merged or rolled back the same way a normal cancel works. The per-ticket variant lives in &lt;code&gt;bernstein.yaml&lt;/code&gt; so the cap survives a CLI restart and writes back to the tracker on termination. The same cap, posted via REST, now fails fast at the request boundary with 422 instead of bleeding into the task store as an unhandled 500.&lt;/p&gt;

&lt;h2&gt;
  
  
  supply chain, secrets, and a sandbox for the UI agents
&lt;/h2&gt;

&lt;p&gt;The security workstream does not write up as a single feature. Half of it is the broker that hands a task a short-lived token scoped to what it declared in its plan; the other half is the dozen smaller things that surround the broker.&lt;/p&gt;

&lt;p&gt;A secrets broker mints per-task tokens, scoped to the resources the task declared. Audit events dispatch outside the broker lock so a misbehaving sink can't stall the issuance path. Constant-time HMAC compare. Approval responses bound to a 16-byte server-minted single-use nonce; mismatches surface as 409, evicted replays as 410. Per-tool allowlist with fail-closed policy and a read-only profile.&lt;/p&gt;
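
&lt;p&gt;The nonce handling is the part most worth seeing in code. A sketch, not the broker's implementation: the &lt;code&gt;ApprovalNonces&lt;/code&gt; class is hypothetical, the status codes are returned as plain integers, and mapping an already-consumed nonce to 410 is my reading of the 409/410 split above.&lt;/p&gt;

```python
import hmac
import secrets


class ApprovalNonces:
    """Single-use approval nonces with a constant-time comparison."""

    def __init__(self):
        self._pending = {}   # approval_id -> 16-byte nonce
        self._spent = set()  # approval ids whose nonce was already consumed

    def issue(self, approval_id):
        nonce = secrets.token_bytes(16)
        self._pending[approval_id] = nonce
        self._spent.discard(approval_id)
        return nonce

    def redeem(self, approval_id, presented):
        if approval_id in self._spent:
            return 410  # replay of a nonce that is gone
        expected = self._pending.get(approval_id)
        if expected is None or not hmac.compare_digest(expected, presented):
            return 409  # unknown approval or mismatched nonce
        del self._pending[approval_id]
        self._spent.add(approval_id)
        return 200
```

&lt;p&gt;&lt;code&gt;hmac.compare_digest&lt;/code&gt; is the stdlib constant-time comparison; a plain equality check would leak how many leading bytes matched.&lt;/p&gt;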

&lt;p&gt;Prompt-injection containment runs against three surfaces. Invisible Unicode Tag codepoints are stripped from injected skills before any prompt sees them. Promptware cross-agent C2 strings are detected in tool output. MCP tool-call inputs are JSON-Schema validated, deny-by-default. A &lt;code&gt;security-pentest&lt;/code&gt; eval scenario exercises the lot end to end.&lt;/p&gt;
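
&lt;p&gt;The Unicode Tags block is U+E0000 through U+E007F; its characters render as nothing in most UIs but survive copy-paste, which is what makes them useful for smuggling instructions. Stripping them is a one-liner. The function name below is illustrative, not the shipped one.&lt;/p&gt;

```python
TAG_BLOCK = range(0xE0000, 0xE0080)  # Unicode Tags block, U+E0000..U+E007F


def strip_tag_codepoints(text):
    """Drop every codepoint in the Unicode Tags block, keep everything else."""
    return "".join(ch for ch in text if ord(ch) not in TAG_BLOCK)
```

&lt;p&gt;This covers only the first of the three surfaces; other invisible characters (zero-width joiners, bidi controls) would need their own handling.&lt;/p&gt;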

&lt;p&gt;Supply-chain coverage on the workflow side: OSSF Scorecard, an SBOM emitted on every release, &lt;code&gt;actions/dependency-review&lt;/code&gt; on PRs, trufflesecurity/trufflehog for secret scanning, Dependabot extended to the &lt;code&gt;github-actions&lt;/code&gt; ecosystem, &lt;code&gt;step-security/harden-runner&lt;/code&gt; on every workflow job (audit mode first, then block). The workflow security pass resolved 163 zizmor findings across &lt;code&gt;unpinned-uses&lt;/code&gt;, &lt;code&gt;artipacked&lt;/code&gt;, &lt;code&gt;template-injection&lt;/code&gt;, &lt;code&gt;bot-conditions&lt;/code&gt;, &lt;code&gt;dangerous-triggers&lt;/code&gt;, &lt;code&gt;ref-version-mismatch&lt;/code&gt;, &lt;code&gt;cache-poisoning&lt;/code&gt;, &lt;code&gt;excessive-permissions&lt;/code&gt;, and &lt;code&gt;dependabot-cooldown&lt;/code&gt;. The three jobs that legitimately push back to git keep their credentials with an annotated rationale.&lt;/p&gt;

&lt;p&gt;A self-testing layer drives a Playwright context against the dev server, captures screenshots, console messages, and network errors as a structured artefact, and hands the result back to an LLM judge for verdict. This is the slice that closes the loop on UI and web agent tasks the way the existing test harness closed it for backend tasks. The agent's diff plus the post-run screenshot plus the console log feed one judge prompt; the judge returns a structured pass-or-fail with a rationale that lands in the task transcript.&lt;/p&gt;

&lt;p&gt;Honest mistake worth naming. The shipped wheel had &lt;code&gt;errors.bernstein.run&lt;/code&gt; baked in as the GlitchTip DSN default and &lt;code&gt;telemetry.bernstein.run&lt;/code&gt; as the telemetry endpoint default. Both backends soft-fail when their env vars are unset, so the package never actually reached out without consent. But the hostnames were sitting there as defaults, which is the kind of thing that turns into a real leak the day someone wires a config they did not read. Stripped, with a test that asserts zero operator-private host, IP, or DSN matches in &lt;code&gt;src/&lt;/code&gt; and fails the build if a future change reintroduces one. Telemetry is now portable behind one Sentry-compatible &lt;code&gt;BERNSTEIN_TELEMETRY_DSN&lt;/code&gt;, so each operator runs against their own backend rather than mine.&lt;/p&gt;

&lt;h2&gt;
  
  
  observability under one umbrella
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;bernstein doctor observe&lt;/code&gt; runs each per-backend probe (Sonar, GlitchTip, Dependency-Track, GitHub Code Scanning) in order and renders one Rich table with metric, value, delta-since-last-check, threshold, and status. &lt;code&gt;--json&lt;/code&gt; and &lt;code&gt;--watch&lt;/code&gt;. Each backend soft-fails to SKIPPED when its env vars are unset, so a fresh checkout stays green. A per-PR sticky summary comment and a daily trends snapshot ride on the same JSON. Per-backend &lt;code&gt;bernstein doctor sonar&lt;/code&gt; and &lt;code&gt;bernstein doctor glitchtip&lt;/code&gt; ship behind the same umbrella for the operators who want one signal at a time.&lt;/p&gt;

&lt;h2&gt;
  
  
  the smaller things
&lt;/h2&gt;

&lt;p&gt;A bucket of cuts that do not need a whole section but matter in a specific situation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI-BOM in three formats.&lt;/strong&gt; &lt;code&gt;bernstein bom emit&lt;/code&gt; and &lt;code&gt;bernstein bom verify&lt;/code&gt; ship a Bernstein-native JSON encoder plus CycloneDX 1.5 with the AI/ML extension plus SPDX 2.3 with AI-specific annotations behind one dispatcher. Pure projection from existing lineage / cost / adapter state; determinism enforced by Hypothesis property tests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Diary plus synthesizer.&lt;/strong&gt; One structured entry per closed task (tried / worked / failed / rationale / tags) with redaction of OpenAI keys, GitHub tokens, AWS access keys, PEM banners, and high-entropy hex. The synthesizer clusters diaries by tag-overlap Jaccard and drafts a markdown report. HITL-gated; reports default to &lt;code&gt;approved: false&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consensus relay.&lt;/strong&gt; HMAC-chained per-cycle handoff at &lt;code&gt;.sdd/runtime/consensus/.json&lt;/code&gt; so an operator restarting a long evolution cycle can pull the prior cycle's decisions, blockers, and open questions without rediscovery.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Three-layer skill customisation.&lt;/strong&gt; BASE / TEAM / USER under XDG paths with a deterministic merge spec: scalars override, tables deep-merge, keyed arrays replace by name, unkeyed arrays append, missing layers fall through cleanly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Canonical stream-signal vocabulary.&lt;/strong&gt; A small text-line vocabulary (COMPLETED, FAILED, QUESTION, PLAN_DRAFT, PLAN_READY, BLOCKED) parseable from any wrapped CLI stdout, so non-stream-json adapters surface lifecycle events through the same channel as native stream-json adapters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Empirical-confidence ledger.&lt;/strong&gt; Append-only SQLite store of per-decision outcomes; sample-size-gated; refuses to return below a documented threshold. Backs the model recommender with measured outcomes over the capability-tier heuristic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;bernstein export&lt;/code&gt;, &lt;code&gt;bernstein analyze&lt;/code&gt;, &lt;code&gt;bernstein adapters list&lt;/code&gt;, &lt;code&gt;bernstein compare&lt;/code&gt;.&lt;/strong&gt; The operator-side cuts that make the orchestrator legible from the CLI without spelunking the JSONL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Adapter count is at 44.&lt;/strong&gt; Devin for Terminal, JetBrains Junie (BYOK across the usual five providers plus the Copilot proxy), AWS Q Developer, DeepSeek V4-Flash and V4-Pro via an Ollama-compatible endpoint with an EU-residency guard.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
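
&lt;p&gt;The three-layer merge spec in the list above is compact enough to sketch. Assumptions, all mine: layers are already parsed into dicts, a keyed array is a list of tables that each carry a &lt;code&gt;name&lt;/code&gt;, and the function names are hypothetical.&lt;/p&gt;

```python
from functools import reduce


def _is_keyed(items):
    """A keyed array is a non-empty list of tables that all carry a name."""
    return bool(items) and all(isinstance(i, dict) and "name" in i for i in items)


def merge_value(base, over):
    # Tables deep-merge key by key.
    if isinstance(base, dict) and isinstance(over, dict):
        out = dict(base)
        for key, value in over.items():
            out[key] = merge_value(base[key], value) if key in base else value
        return out
    if isinstance(base, list) and isinstance(over, list):
        # Keyed arrays replace entries by name; new names are appended.
        if _is_keyed(base) and _is_keyed(over):
            by_name = {item["name"]: item for item in base}
            by_name.update((item["name"], item) for item in over)
            return list(by_name.values())
        # Unkeyed arrays append.
        return base + over
    # Scalars (and mismatched types) override.
    return over


def merge_layers(base, team=None, user=None):
    """Merge BASE, TEAM, USER in order; a missing layer falls through."""
    layers = [layer for layer in (base, team, user) if layer is not None]
    return reduce(merge_value, layers, {})
```

&lt;p&gt;Because each rule is a pure function of its two inputs, the same three files always produce the same merged result, which is the determinism the spec is after.&lt;/p&gt;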

&lt;h2&gt;
  
  
  two open questions for the community
&lt;/h2&gt;

&lt;p&gt;Two RFCs are live where the design genuinely depends on what other operators think. Drive-by comments welcome; full proposals more welcome.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sipyourdrink-ltd/bernstein/issues/1720" rel="noopener noreferrer"&gt;#1720 — Skills end-to-end&lt;/a&gt;. The skill subsystem already has discovery, layered merge (BASE / TEAM / USER under XDG, above), and an injector for Claude Code. The operator never touches it because there is no &lt;code&gt;install&lt;/code&gt;, no &lt;code&gt;sync&lt;/code&gt;, no &lt;code&gt;publish&lt;/code&gt;, no &lt;code&gt;lint&lt;/code&gt;, no &lt;code&gt;test&lt;/code&gt;, no &lt;code&gt;init&lt;/code&gt;, no &lt;code&gt;watch&lt;/code&gt;. If you have an opinion on the verb surface, the manifest shape, or whether a community index belongs in scope at all, the RFC is where to leave it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sipyourdrink-ltd/bernstein/issues/1719" rel="noopener noreferrer"&gt;#1719 — Opt-in telemetry to a community-shared backend&lt;/a&gt;. The package already has a portable telemetry pipeline behind &lt;code&gt;BERNSTEIN_TELEMETRY_DSN&lt;/code&gt;. The current state (no maintainer-side endpoint, package never reaches out by default) is fine. The question on the table is whether an explicitly opt-in maintainer-operated endpoint is worth adding so the rare class of bug that bites many operators looks different from the rare class that bites one. The consent and transparency contract is the live debate.&lt;/p&gt;

&lt;p&gt;Both issues are tagged &lt;code&gt;up-for-grabs&lt;/code&gt;; both have zero comments at the time of writing.&lt;/p&gt;

&lt;h2&gt;
  
  
  why these matter
&lt;/h2&gt;

&lt;p&gt;If you read the 1.10 recap and asked which of the friction points you were actually going to feel, the answer ten days later is most of them.&lt;/p&gt;

&lt;p&gt;Two agents writing the same file no longer race silently. A non-GitHub backlog is not a special case; ten adapters share the same conformance suite that has been keeping the CLI adapters honest. The web UI is one command and one port; the same command issues a tunnel, a QR, and an installable PWA. A CI break that the heuristic can fix does not need a human-dispatched hotfix. The compliance pack is a single ZIP an auditor can verify without installing the orchestrator. The MCP client treats every upstream as untrusted, which is the posture the larger ecosystem will end up needing. Cost decisions are read-back instead of inferred. Sessions reconnect across CLI restarts and across machines without rediscovery.&lt;/p&gt;

&lt;p&gt;The one I noticed most was the removed-our-own-infrastructure cut. The kind of mistake that ships invisibly. The kind of fix that should be a test.&lt;/p&gt;

&lt;h2&gt;
  
  
  try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipx &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; bernstein

&lt;span class="c"&gt;# operator surface&lt;/span&gt;
bernstein gui serve                             &lt;span class="c"&gt;# web UI at http://127.0.0.1:8052/ui/&lt;/span&gt;
bernstein gui serve &lt;span class="nt"&gt;--tunnel&lt;/span&gt;                    &lt;span class="c"&gt;# public URL + QR + bearer + diceware&lt;/span&gt;
bernstein desktop-register &lt;span class="nt"&gt;--host&lt;/span&gt; cursor        &lt;span class="c"&gt;# register as a plugin in another host&lt;/span&gt;
bernstein doctor &lt;span class="nt"&gt;--substrate&lt;/span&gt;                    &lt;span class="c"&gt;# which hosts have us registered&lt;/span&gt;
bernstein doctor observe                        &lt;span class="c"&gt;# one umbrella table over four backends&lt;/span&gt;

&lt;span class="c"&gt;# routing and replay&lt;/span&gt;
bernstein simulate &lt;span class="nt"&gt;--plan&lt;/span&gt; plan.yaml             &lt;span class="c"&gt;# digital-twin a routing decision&lt;/span&gt;
bernstein plan dag                              &lt;span class="c"&gt;# render the declarative task DAG&lt;/span&gt;
bernstein run &lt;span class="nt"&gt;--max-cost-usd&lt;/span&gt; 5                  &lt;span class="c"&gt;# per-run hard cap&lt;/span&gt;

&lt;span class="c"&gt;# trackers and lineage&lt;/span&gt;
bernstein trackers                              &lt;span class="c"&gt;# plugin index for tracker adapters&lt;/span&gt;
bernstein lineage gate                          &lt;span class="c"&gt;# required check; fails on unresolved forks&lt;/span&gt;
bernstein compliance pack &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--since&lt;/span&gt; 2026-04-01 &lt;span class="nt"&gt;--until&lt;/span&gt; 2026-05-19 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--org&lt;/span&gt; &lt;span class="s2"&gt;"Your Org"&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; pack.zip
pipx &lt;span class="nb"&gt;install &lt;/span&gt;bernstein-verify
bernstein-verify pack pack.zip                  &lt;span class="c"&gt;# zero-trust verification&lt;/span&gt;

&lt;span class="c"&gt;# AI-BOM&lt;/span&gt;
bernstein bom emit &lt;span class="nt"&gt;--format&lt;/span&gt; cyclonedx-1.5 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; sbom.json
bernstein bom verify sbom.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Container: &lt;code&gt;ghcr.io/sipyourdrink-ltd/bernstein:2.5.1&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  next
&lt;/h2&gt;

&lt;p&gt;The spec-quality gate and the empirical-confidence ledger are the two slices most likely to compound. The first refuses to advance a feature spec until a deterministic, library-only rule set passes; the second backs the model recommender with a measured-outcomes store rather than a heuristic. Both are in early shape; both get bigger only if the operator surface stays restrained.&lt;/p&gt;

&lt;p&gt;If you hit something rough across the 2.0 to 2.5 surface, &lt;a href="https://github.com/sipyourdrink-ltd/bernstein/issues" rel="noopener noreferrer"&gt;open an issue&lt;/a&gt;. The next batch is shaped by what blocks real work.&lt;/p&gt;

&lt;h2&gt;
  
  
  related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/v1-10-x-recap"&gt;v1.10.x recap&lt;/a&gt;: the one this picks up from.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/v2-0-release"&gt;v2.0 release notes&lt;/a&gt;: when the web UI landed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/orchestrator-on-someone-elses-box"&gt;orchestrator on someone else's box&lt;/a&gt;: the on-prem deployment story this release strengthens.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://bernstein.run/blog/v2-x-recap?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=v2-x-recap&amp;amp;utm_content=canonical" rel="noopener noreferrer"&gt;https://bernstein.run/blog/v2-x-recap?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=v2-x-recap&amp;amp;utm_content=canonical&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>multiagentorchestration</category>
      <category>release</category>
      <category>lineage</category>
      <category>auditlog</category>
    </item>
    <item>
      <title>Forecasting Without Prophecy: a plain-text discipline</title>
      <dc:creator>Alex Chernysh</dc:creator>
      <pubDate>Mon, 11 May 2026 18:02:07 +0000</pubDate>
      <link>https://dev.to/alex_chernysh/forecasting-without-prophecy-a-plain-text-discipline-5h9c</link>
      <guid>https://dev.to/alex_chernysh/forecasting-without-prophecy-a-plain-text-discipline-5h9c</guid>
      <description>

&lt;p&gt;I am a fire Aries ruled by Mars, and even I will not pretend the future is a thing you can read off a chart. Calibrated uncertainty does the work prediction promises. The difference shows up six months later, when you can still grade what you wrote. Same question whether the deadline is a deploy, a hiring call, a relocation, or a difficult conversation with a peer.&lt;/p&gt;

&lt;p&gt;The minimum viable forecast loop: five steps that turn a question into something you can actually grade six months later.&lt;/p&gt;

&lt;h2&gt;
  
  
  The illusion of point estimates
&lt;/h2&gt;

&lt;p&gt;The most common forecasting mistake I see is not bias. It is false specificity in the answer.&lt;/p&gt;

&lt;p&gt;A senior engineer says "I'm 70% confident this ships by end of Q2." The number sounds disciplined. There is no scoring history attached. The same engineer said "70%" last quarter, and the quarter before, and three out of four ended up landing in different buckets. The "70%" is a feeling reformatted as a probability.&lt;/p&gt;

&lt;p&gt;The same trap shows up well outside a deploy window. A friend is "pretty sure" the new sleep regimen holds through the work-week. A cousin is "fairly confident" the visa will clear in time. A founder is "70% sure" the round closes in six weeks. None have a forecast log behind them.&lt;/p&gt;

&lt;p&gt;Point probabilities without a forecast log are theatre. They wear the costume of rigour (the decimal, the percentage sign) and the calibration that would make them rigorous is missing.&lt;/p&gt;

&lt;p&gt;The same trap shows up further down the AI stack. A retrieval system reports &lt;code&gt;score: 0.83&lt;/code&gt; and the team treats it as ground truth. A model reports &lt;code&gt;confidence: 0.91&lt;/code&gt; and the team builds an approval flow on top of it. Neither number is calibrated against actual outcomes. They are surface forms of a habit that does not exist yet.&lt;/p&gt;

&lt;p&gt;The fix is not "stop using numbers." The fix is &lt;strong&gt;ranges, not points&lt;/strong&gt;, until you have a calibration log that earns the precision. Twenty-to-thirty-five percent is defensible. Twenty-seven percent without a log is a costume.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference classes before stories
&lt;/h2&gt;

&lt;p&gt;The second most common mistake is starting from the inside view. The story of the project, the story of the relationship, the story of the deploy.&lt;/p&gt;

&lt;p&gt;Reference-class forecasting is the corrective. The original framing comes from Kahneman and Lovallo, and was operationalised most aggressively by &lt;a href="https://en.wikipedia.org/wiki/Reference_class_forecasting" rel="noopener noreferrer"&gt;Bent Flyvbjerg&lt;/a&gt; on infrastructure megaprojects, where insiders consistently overstated success and external base rates told a quieter, more accurate story.&lt;/p&gt;

&lt;p&gt;The procedure is short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Name two to four classes the case belongs to. Not metaphorical classes. Observable ones, with countable outcomes. "Solo-founder consumer-SaaS launches with no paid acquisition." "First-time hires from an outside referral at a company under 30 people." "Friends who have gone quiet for ten days after a tense exchange." "Indie novel projects taken from outline to a finished draft within twelve months."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Estimate the prior odds of the target outcome from those base rates. Use ranges.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adjust modestly (at most thirty or forty percentage points) only if your case-specific evidence is strong &lt;strong&gt;and&lt;/strong&gt; distinctive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If no relevant reference class exists, your confidence drops automatically.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The discipline lives in the order. Build the prior before you tell yourself the story. Once the story is in your head, every reference class will start to look "different in our case" and the outside view gets rationalised away. I have done this to myself on a relocation, a hiring call, on whether a parent would actually visit in spring. Writing the prior down before the narrative is the only thing that has ever stopped it.&lt;/p&gt;

&lt;p&gt;This is the same discipline that makes &lt;a href="https://dev.to/blog/llm-evals-in-production"&gt;eval suites useful in LLM systems&lt;/a&gt;: pick the reference set first, then look at the system, not the other way round.&lt;/p&gt;

&lt;h2&gt;
  
  
  Premortems and falsifiers
&lt;/h2&gt;

&lt;p&gt;A premortem is the cheapest decision-quality intervention I have ever run. The technique is associated with &lt;a href="https://hbr.org/2007/09/performing-a-project-premortem" rel="noopener noreferrer"&gt;Gary Klein's 2007 HBR piece&lt;/a&gt;. The underlying discipline is older. A deliberate inversion of the usual kickoff posture. Works on a relocation, a difficult conversation with a colleague, or the question of whether to stretch the emergency fund onto a new lease.&lt;/p&gt;

&lt;p&gt;The procedure, in plain text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Set the scene: it is six months from now, the project failed.
2. Each participant writes down, alone, the strongest specific reason it failed.
3. Read the answers out. Cluster them.
4. Each cluster becomes a falsifier or a mitigation in the live plan.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Two effects compound. First, asking "why did it fail" generates more honest hypotheses than "what could go wrong" because the failure is now an established fact in the imagined timeline. Nobody is debating whether it might happen, only how. Second, the failure modes that survive clustering become &lt;strong&gt;falsifiers&lt;/strong&gt;: observations that, if they happen, mean the plan is broken. Falsifiers convert vague risk into a leading indicator you can actually watch for.&lt;/p&gt;

&lt;p&gt;This pairs well with how I run feature flags and staged rollouts in &lt;a href="https://dev.to/blog/agentic-systems-best-practices"&gt;agentic systems&lt;/a&gt;. The flag's "off" criteria are usually written casually. They should be written as falsifiers. "If the regression rate exceeds 4% over two weekly cohorts, this rollout has failed and we revert." That sentence is forecastable. "We'll keep an eye on regressions" is not. The same shape works outside a codebase. "If the antibiotic course produces nausea on day three, I switch back to the GP" is a falsifier. "I'll see how I feel" is not.&lt;/p&gt;
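
&lt;p&gt;The rollout falsifier above can be sketched as a check that runs against real cohort data. This is a minimal illustration, not code from any particular flagging system: the &lt;code&gt;Falsifier&lt;/code&gt; class and &lt;code&gt;is_falsified&lt;/code&gt; function are hypothetical names, and reading "over two weekly cohorts" as "the two most recent cohorts both breach" is an assumption.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Falsifier:
    """A rollout 'off' criterion written as a checkable statement (hypothetical type)."""
    threshold: float          # e.g. 0.04 for a 4% regression rate
    consecutive_cohorts: int  # how many cohorts in a row must breach


def is_falsified(rule: Falsifier, cohort_rates: list[float]) -> bool:
    """True when the most recent N cohorts all exceed the threshold."""
    recent = cohort_rates[-rule.consecutive_cohorts:]
    if len(recent) != rule.consecutive_cohorts:
        return False  # not enough cohorts yet: keep watching
    return all(rate > rule.threshold for rate in recent)


rule = Falsifier(threshold=0.04, consecutive_cohorts=2)
print(is_falsified(rule, [0.02, 0.05, 0.06]))  # last two cohorts both breach
print(is_falsified(rule, [0.05, 0.03]))        # latest cohort is back under 4%
```

&lt;p&gt;The point of the shape is that the criterion is written before the rollout starts, so the revert decision is a lookup and not a debate.&lt;/p&gt;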

&lt;h2&gt;
  
  
  Abstention as a feature
&lt;/h2&gt;

&lt;p&gt;The third common mistake is answering when the honest answer is "I don't know yet, and here is what would change that."&lt;/p&gt;

&lt;p&gt;Abstention is treated as failure in most organisations and in most personal conversations. In disciplined forecasting it is a feature. Two reasons.&lt;/p&gt;

&lt;p&gt;Calibration. A forecaster who abstains on cases that are genuinely underdetermined posts better Brier scores than one who answers everything with a 50% confidence shrug.&lt;/p&gt;

&lt;p&gt;Decision quality. The ask "what evidence would resolve this?" reframes the situation from "what do I think?" to "what do I need to look at next?" That is the question that actually moves projects forward, and the question that quietly de-escalates most family arguments about hypothetical futures.&lt;/p&gt;

&lt;p&gt;The technical analogue worth knowing is &lt;strong&gt;conformal prediction&lt;/strong&gt;, surveyed accessibly in &lt;a href="https://arxiv.org/abs/2107.07511" rel="noopener noreferrer"&gt;Angelopoulos and Bates' 2021 tutorial&lt;/a&gt;. The output is a &lt;strong&gt;set&lt;/strong&gt; of labels guaranteed to contain the truth at least &lt;em&gt;(1 − α)&lt;/em&gt; of the time, rather than a single label with a confidence. When the set has one element, you have a confident prediction. When the set has six, the model is honestly saying "I cannot distinguish among these without more evidence". The set size is the abstention signal.&lt;/p&gt;
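
&lt;p&gt;For readers who want to see the mechanics, here is a minimal sketch of split conformal prediction for classification, assuming you already have held-out calibration examples scored as &lt;code&gt;1 - p(true label)&lt;/code&gt;. The function names, the labels, and the numbers are made up for illustration; the tutorial linked above covers the real guarantees and their conditions.&lt;/p&gt;

```python
import math


def conformal_threshold(calibration_scores: list[float], alpha: float) -> float:
    """Cut-off on nonconformity scores, taken from held-out calibration data.

    Each calibration score is 1 - p(true label) for one held-out example.
    """
    n = len(calibration_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    # When that rank exceeds n the honest answer is "every label";
    # this sketch caps it at n for brevity.
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(calibration_scores)[rank - 1]


def prediction_set(probs: dict[str, float], threshold: float) -> set[str]:
    """Every label whose nonconformity score 1 - p stays within the threshold."""
    return {label for label, p in probs.items() if threshold >= 1 - p}


# Illustrative calibration scores, not real data.
scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.55, 0.70]
cutoff = conformal_threshold(scores, alpha=0.25)

confident = prediction_set({"ship": 0.90, "slip": 0.07, "cancel": 0.03}, cutoff)
unsure = prediction_set({"ship": 0.50, "slip": 0.46, "cancel": 0.04}, cutoff)
print(confident, unsure)
```

&lt;p&gt;The first set has one member and the second has two. The model's probabilities did not become more trustworthy; the set simply grew where the evidence was thinner.&lt;/p&gt;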

&lt;p&gt;You don't need conformal infrastructure to apply the principle. The principle: &lt;strong&gt;make the size of your answer track the size of your uncertainty.&lt;/strong&gt; A short single-line forecast for a confident case. A two- or three-branch forecast for a moderately known case. An explicit "I abstain because X, Y, Z would resolve it" for the underdetermined case. This sits next to my preference for &lt;a href="https://dev.to/blog/llm-product-safety-without-theater"&gt;product safety without theatre&lt;/a&gt;. Refusing a question is sometimes the strongest answer the system has.&lt;/p&gt;

&lt;h2&gt;
  
  
  Calibration is a habit, not an event
&lt;/h2&gt;

&lt;p&gt;A forecast is incomplete until it is graded.&lt;/p&gt;

&lt;p&gt;The grading metric I keep coming back to is the Brier score, summarised on &lt;a href="https://en.wikipedia.org/wiki/Brier_score" rel="noopener noreferrer"&gt;the Wikipedia page&lt;/a&gt;. Lower is better. Zero is perfect. The convenient property is that the score decomposes into calibration and resolution, plus an uncertainty term fixed by the questions themselves. You can be wrong because your probabilities do not match observed frequencies, or because your forecasts do not separate likely from unlikely cases. Two different fixes.&lt;/p&gt;

&lt;p&gt;In practice, you do not need fancy infrastructure to track calibration. A five-column markdown table is enough:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;| Date       | Question                                | Forecast    | Outcome | Notes                |
|------------|-----------------------------------------|-------------|---------|----------------------|
| 2026-03-01 | Will candidate X accept by 03-15?       | 35-50%      | yes     | accepted on 03-09    |
| 2026-03-04 | Will deploy be clean on 03-08?          | 60-75%      | no      | DB pool exhausted    |
| 2026-03-09 | Will my friend reply within 48 hours?   | 40-55%      | no      | replied on day 5     |
| 2026-03-12 | Will the landlord renew on same terms?  | 55-70%      | yes     | small CPI bump only  |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Two months of entries and you start to see the systematic biases. Overconfidence on a topic you "know". Underconfidence on a topic you are afraid of. Point estimates that hide a wide range. Ranges that hide a missing reference class. The biases that show up against deploys also show up against landlords, friends, and gigs that need to break even on the door.&lt;/p&gt;
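
&lt;p&gt;Grading a log like the one above takes a few lines. This sketch scores the four sample rows with the Brier score. Collapsing each range to its midpoint is an assumption made here to get a single number; it is one of several reasonable ways to score a range forecast.&lt;/p&gt;

```python
def midpoint(low_pct: float, high_pct: float) -> float:
    """Collapse a percentage range such as 35-50% to one probability in [0, 1]."""
    return (low_pct + high_pct) / 200.0


def brier_score(forecasts: list[tuple[float, int]]) -> float:
    """Mean squared gap between forecast probability and outcome (1 = yes, 0 = no)."""
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)


# The four rows of the sample log: (forecast midpoint, outcome).
log = [
    (midpoint(35, 50), 1),  # candidate accepted
    (midpoint(60, 75), 0),  # deploy was not clean
    (midpoint(40, 55), 0),  # friend did not reply within 48 hours
    (midpoint(55, 70), 1),  # landlord renewed
]
print(round(brier_score(log), 3))
```

&lt;p&gt;On these four rows the score comes out a little above 0.25, which is what answering 50% to everything would have earned. Four entries prove nothing, but that comparison is the one worth rerunning every month.&lt;/p&gt;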

&lt;p&gt;I keep the log as a live file. New entries take less than a minute to write. The discipline lives in the &lt;strong&gt;reading&lt;/strong&gt;, on a slow Sunday once a month, with last month's predictions next to last month's outcomes.&lt;/p&gt;

&lt;p&gt;If that sounds tedious, consider that the alternative is a version of you who never learns whether your forecasts are right. Public superforecasters, profiled in &lt;a href="https://goodjudgment.com/superforecasting/" rel="noopener noreferrer"&gt;the Good Judgment Project&lt;/a&gt;, do score above average on fluid intelligence and active open-mindedness. But the strongest single predictor of breaking into the top 2% was perpetual updating, roughly three times more predictive than IQ. They keep score.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scenario discipline in plain text
&lt;/h2&gt;

&lt;p&gt;The single-story narrative is the most expensive default in informal forecasting. "I think they're going to ghost us." "I think the round will close in six weeks." "I think the strike will be over by Friday." A single hidden-motive story replaces the work of generating competing hypotheses.&lt;/p&gt;

&lt;p&gt;The fix is a scenario table, generated once, with three to five branches that meaningfully compete:&lt;/p&gt;

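&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;Probability range&lt;/th&gt;&lt;th&gt;Strongest evidence for&lt;/th&gt;&lt;th&gt;Strongest evidence against&lt;/th&gt;&lt;th&gt;Leading indicator&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Status quo continues&lt;/td&gt;&lt;td&gt;30–45%&lt;/td&gt;&lt;td&gt;track record of inaction&lt;/td&gt;&lt;td&gt;recent change in incentives&lt;/td&gt;&lt;td&gt;no decision in next 14 days&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Cautious improvement&lt;/td&gt;&lt;td&gt;25–35%&lt;/td&gt;&lt;td&gt;small visible gestures last week&lt;/td&gt;&lt;td&gt;history of regressions&lt;/td&gt;&lt;td&gt;one substantive ask answered&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Escalation or rupture&lt;/td&gt;&lt;td&gt;10–20%&lt;/td&gt;&lt;td&gt;pattern of ultimatums&lt;/td&gt;&lt;td&gt;calmer recent tone&lt;/td&gt;&lt;td&gt;unilateral action by the other side&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Strategic distance&lt;/td&gt;&lt;td&gt;10–20%&lt;/td&gt;&lt;td&gt;resources are clearly limited&lt;/td&gt;&lt;td&gt;dependency on this thread&lt;/td&gt;&lt;td&gt;reduced engagement, not reduced contact&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;External shock&lt;/td&gt;&lt;td&gt;5–10%&lt;/td&gt;&lt;td&gt;three competitors moving&lt;/td&gt;&lt;td&gt;sector quiet otherwise&lt;/td&gt;&lt;td&gt;a third party makes the question moot&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;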

&lt;p&gt;That same shape covers a regulator's calendar, a job search, a health regimen, or whether a quiet family thread reopens on its own. You change the rows; the columns stay.&lt;/p&gt;

&lt;p&gt;Two rules make the table earn its keep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Probability ranges, not single numbers&lt;/strong&gt;, unless you have a forecast log to back the precision. The range midpoints should sum to somewhere near 100%. They will never sum to exactly 100 (they are ranges), but if the column sums to 50%, the problem is an incomplete scenario set, not the arithmetic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One falsifier per row.&lt;/strong&gt; A scenario without a falsifier is a wish or a fear, not a scenario. The leading-indicator column does the work: it tells you what observation, if you saw it next week, would shift the probability of that branch up or down.&lt;/p&gt;
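
&lt;p&gt;Both rules are mechanical enough to lint. The sketch below checks the sample scenario table: midpoints should sum to roughly 100%, and every row needs a leading indicator. The &lt;code&gt;Scenario&lt;/code&gt; class, the &lt;code&gt;check_table&lt;/code&gt; function, and the 15-point tolerance are illustrative assumptions, not a standard.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    low: int                 # lower bound of the range, in percent
    high: int                # upper bound of the range, in percent
    leading_indicator: str   # the observation that would move this branch


def check_table(rows: list[Scenario], tolerance: float = 15.0) -> list[str]:
    """Return a list of problems; an empty list means both rules hold."""
    problems = []
    midpoint_sum = sum((r.low + r.high) / 2 for r in rows)
    if abs(midpoint_sum - 100) > tolerance:
        problems.append(f"midpoints sum to {midpoint_sum:g}%: scenario set looks incomplete")
    for r in rows:
        if not r.leading_indicator.strip():
            problems.append(f"'{r.name}' has no falsifier")
    return problems


table = [
    Scenario("Status quo continues", 30, 45, "no decision in next 14 days"),
    Scenario("Cautious improvement", 25, 35, "one substantive ask answered"),
    Scenario("Escalation or rupture", 10, 20, "unilateral action by the other side"),
    Scenario("Strategic distance", 10, 20, "reduced engagement, not reduced contact"),
    Scenario("External shock", 5, 10, "a third party makes the question moot"),
]
print(check_table(table))       # the full table passes both rules
print(check_table(table[:2]))   # two branches alone leave too much probability unexplained
```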

&lt;p&gt;A nice consequence of plain text is that you can paste it into a thread, hand it to a colleague, or feed it to a model for a second opinion without an export step. The same plain-text discipline that makes &lt;a href="https://dev.to/blog/sdd-spec-driven-development"&gt;spec-driven development&lt;/a&gt; survive context switches makes scenario tables survive them too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it on something small
&lt;/h2&gt;

&lt;p&gt;This entire discipline collapses if you only ever apply it to high-stakes, low-frequency questions. You don't get calibration data. You don't develop the muscle. You don't learn which biases are yours.&lt;/p&gt;

&lt;p&gt;Start with something small enough to grade within two weeks. A Q3 review, an offer letter, a connecting flight, a Saturday gig that needs to break even on the door. Anything where the outcome lands before you forget you forecasted it.&lt;/p&gt;

&lt;p&gt;If two out of every three forecasts are wildly off after two weeks, the lesson is in the gap, not in the embarrassment. Reread the original entries. Which step did you skip? Did you start with a story instead of a reference class? Did you give a point estimate instead of a range? Did you forget the falsifier?&lt;/p&gt;

&lt;p&gt;The future stays unpredictable. The job is to build a calmer, slightly more honest interface to a messy world, and to leave behind enough trail-of-evidence that next year's version of you can grade this year's forecasts and learn something.&lt;/p&gt;


&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/agentic-systems-best-practices"&gt;Building agentic AI systems that hold up&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/llm-evals-in-production"&gt;LLM evals in production&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/llm-hallucination-prevention"&gt;Hallucination prevention in LLM products&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/llm-product-safety-without-theater"&gt;Product safety without theatre&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/sdd-spec-driven-development"&gt;Spec-driven development&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/ai-assisted-development-green-state"&gt;AI-assisted development from a green state&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/legal-answering-systems"&gt;Building legal answering systems&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://hbr.org/2007/09/performing-a-project-premortem" rel="noopener noreferrer"&gt;Gary Klein, "Performing a Project Premortem" (HBR, 2007)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Reference_class_forecasting" rel="noopener noreferrer"&gt;Reference-class forecasting (Kahneman, Lovallo, Flyvbjerg)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://arxiv.org/abs/2107.07511" rel="noopener noreferrer"&gt;Angelopoulos &amp;amp; Bates, "A Gentle Introduction to Conformal Prediction" (arXiv 2021)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Brier_score" rel="noopener noreferrer"&gt;Brier score - calibration metric primer&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://journals.ametsoc.org/view/journals/mwre/78/1/1520-0493_1950_078_0001_vofeit_2_0_co_2.xml" rel="noopener noreferrer"&gt;Brier (1950) - verification of forecasts expressed in terms of probability&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://goodjudgment.com/superforecasting/" rel="noopener noreferrer"&gt;The Good Judgment Project - superforecaster research&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;





&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://alexchernysh.com/blog/forecasting-without-prophecy?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=forecasting-without-prophecy&amp;amp;utm_content=canonical" rel="noopener noreferrer"&gt;https://alexchernysh.com/blog/forecasting-without-prophecy?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=forecasting-without-prophecy&amp;amp;utm_content=canonical&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>forecasting</category>
      <category>decisionmaking</category>
      <category>calibration</category>
      <category>aiengineering</category>
    </item>
    <item>
      <title>RightLayout: Shipping a Mac AI Tool, Then Letting Go</title>
      <dc:creator>Alex Chernysh</dc:creator>
      <pubDate>Mon, 11 May 2026 18:02:05 +0000</pubDate>
      <link>https://dev.to/alex_chernysh/rightlayout-shipping-a-mac-ai-tool-then-letting-go-3imh</link>
      <guid>https://dev.to/alex_chernysh/rightlayout-shipping-a-mac-ai-tool-then-letting-go-3imh</guid>
      <description>

&lt;p&gt;I built &lt;a href="https://dev.to/rightlayout"&gt;RightLayout&lt;/a&gt; because every keyboard-layout corrector I tried for macOS broke on names, code, and typos. It was a small bet: train a CoreML model from scratch, three layouts, on-device. It worked. Then the maintenance bill came due, and I open-sourced it.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The problem with dictionary punto-switchers
&lt;/h2&gt;

&lt;p&gt;If you type in two or three languages on a Mac, you have lived this. You start a sentence in English, the layout is still on Russian, and the screen fills with Cyrillic noise. The fix exists in theory. There are tools that watch your input and flip the layout when the word "looks wrong".&lt;/p&gt;

&lt;p&gt;The classical version of that tool is dictionary-based. It checks each word against a frozen vocabulary and corrects when the word does not appear. That works for the easy cases. It also fails the moment a real human starts typing real text.&lt;/p&gt;

&lt;p&gt;Names break it. Code breaks it. Acronyms break it. URLs break it. The word &lt;code&gt;kubectl&lt;/code&gt; is not in any Russian dictionary, but it is also not a wrong-layout English word. A typo like &lt;code&gt;helo&lt;/code&gt; is missing from the dictionary, so the tool helpfully turns it into &lt;code&gt;рудщ&lt;/code&gt;. And in mixed-language paragraphs the dictionary does not even know which language to anchor against.&lt;/p&gt;

&lt;p&gt;A dictionary check has no idea what you are typing. It can be solid and polished and still get the same class of false positives, because the underlying signal is the wrong one. You need something that reads context, not vocabulary.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The bet
&lt;/h2&gt;

&lt;p&gt;The bet was small enough to attempt over a few weekends. Train a tiny character-level model that takes a short window of recent input and predicts which of nine classes it belongs to. The nine classes are the three native layouts (EN, RU, HE) plus the six cross-layout misfires: &lt;code&gt;en_from_ru&lt;/code&gt;, &lt;code&gt;ru_from_en&lt;/code&gt;, &lt;code&gt;en_from_he&lt;/code&gt;, &lt;code&gt;he_from_en&lt;/code&gt;, &lt;code&gt;ru_from_he&lt;/code&gt;, &lt;code&gt;he_from_ru&lt;/code&gt;. That class set is the whole trick. Once the model says "this looks like Russian typed on an English layout", a deterministic mapper handles the actual character substitution.&lt;/p&gt;

&lt;p&gt;The training pipeline is in the repo and is unromantic. Wikipedia and subtitle corpora for the three languages, generation of clean and cross-layout-mistyped pairs, character-level tokenization, augmentation for typos and case noise, mixup, label smoothing. The model itself is an ensemble of a small multi-scale CNN and a four-layer character Transformer, both pooled into a single linear head. It runs over a fixed 20-character window. The export goes through PyTorch into CoreML.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# from Tools/CoreMLTrainer/train.py
&lt;/span&gt;&lt;span class="n"&gt;CLASSES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ru&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;he&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ru_from_en&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;he_from_en&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;en_from_ru&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;en_from_he&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;he_from_ru&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ru_from_he&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The CoreML model that ships inside the .pkg is around 14 MB. It runs entirely on-device. Inference per 20-character window is fast enough that the correction logic stays well under the 50 ms budget I set for the whole pipeline (event tap to replacement). It is small enough to fit in the bundle and ship with no cloud dependency.&lt;/p&gt;

&lt;p&gt;The first time it correctly turned &lt;code&gt;ghbdtn&lt;/code&gt; into "привет" in the middle of a sentence with a code snippet in it, I knew it was going to work. A dictionary-based corrector would have eaten the snippet.&lt;/p&gt;
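
&lt;p&gt;The deterministic half of that correction is easy to picture. Below is a minimal sketch of a layout mapper for the EN/RU pair, assuming the standard QWERTY and ЙЦУКЕН key positions and lowercase input only. The &lt;code&gt;remap&lt;/code&gt; function is a hypothetical illustration, not code from the RightLayout repo, and it assumes &lt;code&gt;ru_from_en&lt;/code&gt; means "Russian intended, typed on the English layout".&lt;/p&gt;

```python
# Same physical keys, read left to right, top row to bottom row.
EN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU_KEYS = "йцукенгшщзхъфывапролджэячсмитьбю"

EN_TO_RU = str.maketrans(EN_KEYS, RU_KEYS)
RU_TO_EN = str.maketrans(RU_KEYS, EN_KEYS)


def remap(text: str, predicted_class: str) -> str:
    """Apply the character substitution that matches the classifier's verdict."""
    if predicted_class == "ru_from_en":
        return text.translate(EN_TO_RU)
    if predicted_class == "en_from_ru":
        return text.translate(RU_TO_EN)
    return text  # native-layout classes need no correction


print(remap("ghbdtn", "ru_from_en"))
print(remap("kubectl", "en"))
```

&lt;p&gt;All of the judgement lives in the classifier. Once it has picked a class, the substitution is a table lookup, which is why a wrong class is the only way to eat a code snippet.&lt;/p&gt;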

&lt;h2&gt;
  
  
  3. Where the model actually wins
&lt;/h2&gt;

&lt;p&gt;Three places it beats a dictionary cleanly.&lt;/p&gt;

&lt;p&gt;It handles typos. A short word with one missing or duplicated character is still recognizable as the right language to a character-level model. Dictionary tools either silently miss the word or, worse, "correct" it into nonsense.&lt;/p&gt;

&lt;p&gt;It handles names and code. The model has seen enough mixed-language and mixed-script text in training that an English snippet inside a Russian sentence does not trigger a flip. The dictionary approach to this is a hand-maintained whitelist that grows forever.&lt;/p&gt;

&lt;p&gt;It handles Hebrew, which is the genuinely hard one. RTL text plus a character set with no overlap with Latin or Cyrillic plus a layout that maps Hebrew letters onto English keys means the dictionary approach has to maintain three pairwise tables and a context heuristic on top. The model just learned that &lt;code&gt;akuo&lt;/code&gt; is "שלום" typed on an English layout and moves on.&lt;/p&gt;

&lt;p&gt;For three or four months I used my own tool every day. It was the first time the corrector was invisible enough to forget about.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Why I open-sourced it instead of scaling it
&lt;/h2&gt;

&lt;p&gt;Then the maintenance bill arrived.&lt;/p&gt;

&lt;p&gt;A free macOS utility with a learned model has a long tail of unglamorous work. The accessibility-API event tap needs to keep working across macOS versions. Apple loves to silently change permission semantics between releases. The CoreML runtime drifts. The model has no test infrastructure for "real users typing real text" because that is, by definition, not in the training set. The undo-ratio learning loop, where the tool watches users undo a correction and adapts, is hard to make safe and harder to validate without telemetry I refuse to collect.&lt;/p&gt;

&lt;p&gt;For a funded product, those costs are absorbable. For a free tool maintained by one person with a day job, they compound. Every macOS major release became a week of evening debugging. Every CoreML version bump was a small risk. Every issue in GitHub was a fork in the road: do I become a Mac systems engineer in my spare time, or do I let the project rot quietly while pretending it is still maintained?&lt;/p&gt;

&lt;p&gt;I picked a third option. I marked the project community-maintained, wrote an honest banner on the homepage and the README, kept the model in the bundle so installs still work, and moved my attention to &lt;a href="https://bernstein.run" rel="noopener noreferrer"&gt;Bernstein&lt;/a&gt;. The repo is public. The training pipeline is public. Pull requests are reviewed. There are no gatekeepers. If you ship good PRs, you get commit access.&lt;/p&gt;

&lt;p&gt;That is a more honest position than "v2 coming soon, watch this space".&lt;/p&gt;

&lt;h2&gt;
  
  
  5. What you can take
&lt;/h2&gt;

&lt;p&gt;If you want the tool, the .pkg is on the &lt;a href="https://github.com/chernistry/RightLayout/releases/latest" rel="noopener noreferrer"&gt;releases page&lt;/a&gt;. macOS 13 or newer, Accessibility permission, free. The model is inside the bundle.&lt;/p&gt;

&lt;p&gt;If you want the code, the &lt;a href="https://github.com/chernistry/RightLayout" rel="noopener noreferrer"&gt;repo&lt;/a&gt; is small, the architecture doc is in &lt;code&gt;.sdd/&lt;/code&gt;, and the training pipeline is in &lt;code&gt;Tools/CoreMLTrainer/&lt;/code&gt;. Adding a fourth language is a few-hour exercise: extend the class enum, add the layout map, retrain, ship.&lt;/p&gt;

&lt;p&gt;If you want the lesson, I think it is short. A focused weekend project can ship faster than the coordination cost of a team. Maintenance is a different shape entirely, and there is no shortcut around it. Choose accordingly.&lt;/p&gt;

&lt;p&gt;I am genuinely glad it is in the wild. Take it, fix it, ship it.&lt;/p&gt;


&lt;h2&gt;
  
  
  Repositories and downloads
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/chernistry/RightLayout" rel="noopener noreferrer"&gt;RightLayout on GitHub&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/chernistry/RightLayout/releases/latest" rel="noopener noreferrer"&gt;Latest .pkg release&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/rightlayout"&gt;Product page&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://bernstein.run" rel="noopener noreferrer"&gt;Bernstein, what I work on now&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/interface-design-serious-products"&gt;Interface design for serious products&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/hireex-autonomous-job-discovery"&gt;Need a job? Sip your drink. We'll look for you.&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/blog/bernstein-multi-agent-orchestration"&gt;Bernstein: multi-agent orchestration that holds up&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;





&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://alexchernysh.com/blog/rightlayout-shipping-then-letting-go?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=rightlayout-shipping-then-letting-go&amp;amp;utm_content=canonical" rel="noopener noreferrer"&gt;https://alexchernysh.com/blog/rightlayout-shipping-then-letting-go?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=rightlayout-shipping-then-letting-go&amp;amp;utm_content=canonical&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>macos</category>
      <category>opensource</category>
      <category>swift</category>
    </item>
    <item>
      <title>Shipping the orchestrator onto someone else's box</title>
      <dc:creator>Alex Chernysh</dc:creator>
      <pubDate>Mon, 11 May 2026 13:14:37 +0000</pubDate>
      <link>https://dev.to/alex_chernysh/shipping-the-orchestrator-onto-someone-elses-box-220i</link>
      <guid>https://dev.to/alex_chernysh/shipping-the-orchestrator-onto-someone-elses-box-220i</guid>
      <description>&lt;p&gt;These are the on-prem / regulated-deployment notes for 1.10: mTLS cluster mode, signed lineage records, air-gapped install, a capability gate against the lethal-trifecta exfiltration class. If you're looking for "how to install it on my laptop," that's &lt;a href="https://dev.to/blog/frictionless-install"&gt;the curl-pipe install post&lt;/a&gt; instead. A laptop tool and an on-prem deployment have almost nothing in common: the first answers to the developer who started it, the second answers to a security architect, a compliance officer, a network team that hates outbound traffic, and a procurement reviewer with a checklist. The batch below is what it took to stop pretending those were the same thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  the questions
&lt;/h2&gt;

&lt;p&gt;Customers who want the orchestrator inside their perimeter ask a fairly predictable sequence. How do nodes authenticate when the network isn't yours. How do we prove, six months later, which agent wrote which line. How do we install when the box has no outbound internet. What stops a clever prompt from chaining a database read into a public webhook.&lt;/p&gt;

&lt;p&gt;We had partial answers to all four. None were the answer you'd hand a security review without flinching. The five PRs that landed today close that gap by replacing "we sort of do that" with concrete, demonstrable behaviour. None of these are revolutionary on their own. The cumulative effect is a 1.10 build we're willing to drop into a regulated customer's box and walk through with their auditor.&lt;/p&gt;

&lt;h2&gt;
  
  
  cluster mode without ambient trust
&lt;/h2&gt;

&lt;p&gt;Cluster mode used to assume the network was safe. Acceptable for a developer machine talking to itself, unacceptable everywhere else. Worker–central traffic now runs over mTLS by default, with cert issuance done locally and pinned to the cluster's own CA.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bernstein cluster bootstrap-ca &lt;span class="nt"&gt;--out&lt;/span&gt; .sdd/cluster/ca/
bernstein cluster issue-cert &lt;span class="nt"&gt;--role&lt;/span&gt; worker &lt;span class="nt"&gt;--node-id&lt;/span&gt; worker-01 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--out&lt;/span&gt; .sdd/cluster/worker-01/
bernstein cluster start &lt;span class="nt"&gt;--tls&lt;/span&gt; .sdd/cluster/worker-01/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI generates a private CA, issues short-lived node certs with role and node-id baked in, refuses connections whose chain doesn't terminate at the cluster CA (PR #1019). Rotation is a re-issue, not a rebuild. The central trusts any cert with a valid chain and a non-expired NotAfter. There is no shared symmetric secret to leak.&lt;/p&gt;
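
&lt;p&gt;For readers who have not wired up mutual TLS before, the trust rule described above looks roughly like this with Python's standard &lt;code&gt;ssl&lt;/code&gt; module. This is a generic sketch of the pattern, not Bernstein's implementation: the function name is hypothetical and the file arguments stand in for whatever the CLI writes under &lt;code&gt;.sdd/cluster/&lt;/code&gt;.&lt;/p&gt;

```python
import ssl


def central_tls_context(ca_file=None, cert_file=None, key_file=None):
    """Server-side context that only accepts peers signed by one private CA.

    All three paths are placeholders; pass the real files to use it.
    """
    # A bare SSLContext loads no system CAs, so nothing is trusted by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Demand a client certificate and reject the handshake without one.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        # The cluster CA becomes the only trust anchor for worker certs.
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        # The central's own identity, presented to connecting workers.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


ctx = central_tls_context()
print(ctx.verify_mode)
```

&lt;p&gt;Chain validation and expiry checks happen inside the TLS handshake, which is what makes rotation a re-issue: a new cert signed by the same CA is accepted without touching the central.&lt;/p&gt;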

&lt;p&gt;Underneath, we replaced the in-process happy-path tests with a real two-process harness: central and worker as separate &lt;code&gt;subprocess.Popen&lt;/code&gt; invocations, walked through six chaos scenarios — worker SIGKILL mid-task, central restart with in-flight claims, network partition, token expiry across a claim boundary, two workers racing for the same task, certificate revocation (PR #1020). Bugs the in-process harness silently swallowed — claim re-entry, token clock skew, an off-by-one in the partition healer — surface in seconds. CI runs the matrix on every push.&lt;/p&gt;
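
&lt;p&gt;The smallest version of one of those scenarios, a worker killed mid-task, can be sketched with nothing but the standard library. This is an illustration of the two-process idea under stated assumptions, not the harness from the PR: the inline worker script is a stand-in for a real worker, and the signal handling is POSIX-only.&lt;/p&gt;

```python
import signal
import subprocess
import sys

# Stand-in worker: announces a claim, then "works" for a long time.
WORKER = "import time; print('claimed', flush=True); time.sleep(60)"


def kill_worker_mid_task() -> int:
    """Start a worker as a real process, SIGKILL it after its claim, return the exit code."""
    with subprocess.Popen(
        [sys.executable, "-c", WORKER],
        stdout=subprocess.PIPE,
        text=True,
    ) as proc:
        proc.stdout.readline()  # block until the worker reports its claim
        proc.kill()             # SIGKILL on POSIX: no cleanup handlers run
        return proc.wait(timeout=10)


exit_code = kill_worker_mid_task()
# On POSIX a process killed by a signal reports the negated signal number.
print(exit_code == -signal.SIGKILL)
```

&lt;p&gt;A real scenario would then assert on the central's side: the claim is released, the task is re-queued, and the audit log records the disconnect. An in-process test cannot produce this failure at all, because there is no second process to kill.&lt;/p&gt;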

&lt;p&gt;Operators get five new Prometheus counters and gauges (&lt;code&gt;bernstein_cluster_claims_total&lt;/code&gt;, &lt;code&gt;bernstein_cluster_token_rejections_total&lt;/code&gt;, &lt;code&gt;bernstein_cluster_partition_seconds&lt;/code&gt;, &lt;code&gt;bernstein_cluster_central_restarts_total&lt;/code&gt;, &lt;code&gt;bernstein_cluster_workers_active&lt;/code&gt;) and six audit event types covering token issuance, claim transitions, certificate rotation, and disconnects. A Grafana dashboard ships as JSON in &lt;code&gt;observability/dashboards/cluster.json&lt;/code&gt; (PR #1021). Plain dashboard, accurate fields.&lt;/p&gt;

&lt;p&gt;For deployments that can't expose a port to the worker fleet, there's a tested pattern using Cloudflare Tunnel for the central edge plus Tailscale for the worker mesh. Documented end-to-end. A nightly CI job stands up the topology in ephemeral containers and runs a smoke task through it (PR #1024). The pattern works without modifying firewall rules at the customer site, which is usually the difference between "scheduled for next quarter" and "deploy this week."&lt;/p&gt;

&lt;p&gt;What's still missing: MESH and HIERARCHICAL coordinator topologies are stubs. Multi-tenant isolation inside a single cluster — separate quotas, audit chains, model budgets per tenant — is deferred to 1.11. If you need either today, run one cluster per tenant. We're naming the limit out loud because pretending it isn't there is how trust evaporates on the second deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  lineage that survives a regulator
&lt;/h2&gt;

&lt;p&gt;Auditing a code change six months after the fact is a problem of information loss. By the time anyone asks "which agent wrote this," the prompt is gone, the cost ledger has rolled, the producing model version has been replaced. The new lineage subsystem keeps the answer.&lt;/p&gt;

&lt;p&gt;Every agent write emits a signed lineage record linking the output (file path, byte range, sha256) to its inputs (the files the agent read), producer (agent id, role, model, effort), prompt SHA, cost, token count, wall-clock timestamp. Records chain via HMAC the same way the existing audit log does, so tampering with one record invalidates the chain past it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;bernstein lineage src/auth/middleware.py:74&lt;/span&gt;
&lt;span class="c1"&gt;# wrote by:    backend / claude-sonnet / effort=high&lt;/span&gt;
&lt;span class="c1"&gt;# prompt sha:  3f9a…b421  (template: roles/backend.md@v17)&lt;/span&gt;
&lt;span class="c1"&gt;# inputs:      src/auth/__init__.py  src/auth/jwt.py  tests/test_auth.py&lt;/span&gt;
&lt;span class="c1"&gt;# producer:    session 7c4f1a3b9d22, task #412&lt;/span&gt;
&lt;span class="c1"&gt;# cost:        $0.0214   tokens: 11,983&lt;/span&gt;
&lt;span class="c1"&gt;# signed:      ed25519 (cluster-key)  customer-key: aporia-prod-2026-05&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Schema v2 adds two fields: a &lt;code&gt;regulatory_class&lt;/code&gt; tag (&lt;code&gt;pii&lt;/code&gt;, &lt;code&gt;phi&lt;/code&gt;, &lt;code&gt;dora&lt;/code&gt;, &lt;code&gt;nis2&lt;/code&gt;, &lt;code&gt;none&lt;/code&gt;) inferred from the file's policy zone, plus a customer-key signature attached after the cluster-key signature so customers can revoke trust without re-issuing the cluster CA. The combination is what makes a DORA or NIS2 evidence package mechanical to assemble: filter by &lt;code&gt;regulatory_class&lt;/code&gt;, walk the chain, hand the bundle to the auditor.&lt;/p&gt;

&lt;p&gt;The janitor verifies the chain on every gate run. A tamper hit forwards to a configurable SIEM webhook with the broken record and the surrounding window. Default surface is loud. We'd rather wake an operator than miss a forged signature.&lt;/p&gt;
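
&lt;p&gt;A minimal sketch of what chaining and verifying records via HMAC can look like. This is not Bernstein's implementation: the key, the record fields, and the function names are all hypothetical, and real lineage records carry far more than a path and a hash. It only shows why a verifier walking the chain stops at the first tampered record.&lt;/p&gt;

```python
import hashlib
import hmac
import json

# Hypothetical key and record fields, for illustration only.
CHAIN_KEY = b"demo-chain-key"
GENESIS = "0" * 64


def record_mac(prev_mac: str, payload: dict) -> str:
    """MAC over the previous MAC plus a canonical encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(CHAIN_KEY, prev_mac.encode() + body, hashlib.sha256).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Append a record whose MAC binds it to the record before it."""
    prev_mac = chain[-1]["mac"] if chain else GENESIS
    chain.append({"payload": payload, "mac": record_mac(prev_mac, payload)})


def first_broken_index(chain: list) -> int:
    """Return the index of the first record that fails verification, or -1."""
    prev_mac = GENESIS
    for index, record in enumerate(chain):
        expected = record_mac(prev_mac, record["payload"])
        if not hmac.compare_digest(expected, record["mac"]):
            return index
        prev_mac = record["mac"]
    return -1


chain = []
append(chain, {"path": "src/auth/__init__.py", "sha256": "aa11", "agent": "backend"})
append(chain, {"path": "src/auth/jwt.py", "sha256": "bb22", "agent": "backend"})
append(chain, {"path": "tests/test_auth.py", "sha256": "cc33", "agent": "qa"})

print(first_broken_index(chain))        # intact chain
chain[1]["payload"]["sha256"] = "ffff"  # tamper with the middle record
print(first_broken_index(chain))        # verification stops at the tampered record
```

&lt;p&gt;Because each MAC covers the previous MAC, an attacker without the key cannot repair a modified record, and rewriting its stored MAC would only move the failure to the next record.&lt;/p&gt;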

&lt;p&gt;What's still missing: the customer-key signing path uses ed25519 in software. FIPS-140 hardware keys are on the 1.11 roadmap. If FIPS-140 is a hard procurement requirement, you cannot ship today.&lt;/p&gt;

&lt;h2&gt;
  
  
  distribution without outbound internet
&lt;/h2&gt;

&lt;p&gt;On the first sovereign deployment we did, the customer's box could not reach &lt;code&gt;pypi.org&lt;/code&gt;, &lt;code&gt;github.com&lt;/code&gt;, or any registry we'd ever heard of. The install procedure was an engineer carrying a USB drive through a security checkpoint. Not a problem we wanted to solve twice.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;bernstein wheelhouse build&lt;/code&gt; produces a self-contained tarball: pinned wheels for the orchestrator and every transitive dependency, embedded model weights for offline classifiers, a manifest with sha256 per file, a detached GPG signature over the manifest. &lt;code&gt;bernstein wheelhouse verify&lt;/code&gt; checks both the signature and every per-file hash before any installer logic runs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# on a connected build host&lt;/span&gt;
bernstein wheelhouse build &lt;span class="nt"&gt;--out&lt;/span&gt; bernstein-1.10.0-airgap.tar.gz &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--sign-with&lt;/span&gt; [email protected]

&lt;span class="c"&gt;# on the customer box&lt;/span&gt;
bernstein wheelhouse verify bernstein-1.10.0-airgap.tar.gz
bernstein wheelhouse &lt;span class="nb"&gt;install &lt;/span&gt;bernstein-1.10.0-airgap.tar.gz &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; /opt/bernstein

bernstein &lt;span class="nt"&gt;--profile&lt;/span&gt; airgap doctor airgap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
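
&lt;p&gt;The per-file hash check that a verify step performs can be sketched roughly as follows. The manifest shape (a mapping from relative path to hex digest) and the function names are assumptions made for this example, not the actual wheelhouse format, and the detached GPG signature check over the manifest is deliberately left out.&lt;/p&gt;

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large weights do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: dict) -> list:
    """Return the relative paths that are missing or whose hash differs."""
    failures = []
    for relative_path, expected in sorted(manifest.items()):
        candidate = root / relative_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            failures.append(relative_path)
    return failures


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    wheel = root / "wheels" / "demo-1.0-py3-none-any.whl"  # hypothetical file names
    weights = root / "weights.bin"
    wheel.parent.mkdir()
    wheel.write_bytes(b"wheel bytes")
    weights.write_bytes(b"model weights")

    manifest = {
        "wheels/demo-1.0-py3-none-any.whl": sha256_of(wheel),
        "weights.bin": sha256_of(weights),
    }
    clean = verify_manifest(root, manifest)

    weights.write_bytes(b"swapped weights")  # simulate tampering in transit
    tampered = verify_manifest(root, manifest)

print(clean, tampered)
```

&lt;p&gt;The ordering matters in a real verifier: the signature over the manifest has to be checked first, since per-file hashes only mean something once the manifest itself is trusted.&lt;/p&gt;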



&lt;p&gt;&lt;code&gt;--profile airgap&lt;/code&gt; flips the orchestrator into explicit egress-deny: any code path that opens a socket to a non-loopback, non-cluster address fails closed with an error naming the offender. &lt;code&gt;doctor airgap&lt;/code&gt; runs ten self-checks — DNS lookup for a poisoned hostname, plaintext HTTP attempt, MCP reachability, model-weight integrity — and returns a single pass/fail line that procurement can paste into a runbook.&lt;/p&gt;

&lt;p&gt;This was the piece we expected to be smallest and that ate the most time. Deciding what counts as "egress" inside a complex Python process is a research project. Deciding what to do when a transitive dependency tries to phone home for telemetry is a policy question. We landed on "fail closed, name the caller, document the override" because every other choice creates a quiet failure mode.&lt;/p&gt;

&lt;h2&gt;
  
  
  a capability gate against the lethal trifecta
&lt;/h2&gt;

&lt;p&gt;The lethal trifecta — private data, untrusted input, external communication — is the prompt-injection escape hatch every multi-agent system eventually trips over. An agent that reads a customer's database, ingests a webhook body crafted by an attacker, and is allowed to call out to a public URL has, by construction, an exfil path. The mitigation in the literature is to refuse the chain, not the individual capabilities.&lt;/p&gt;

&lt;p&gt;Tools and adapters now declare capability tags in their manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# src/bernstein/adapters/postgres_query.py
&lt;/span&gt;&lt;span class="n"&gt;CAPABILITIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PRIVATE_DATA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# src/bernstein/adapters/webhook_ingest.py
&lt;/span&gt;&lt;span class="n"&gt;CAPABILITIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UNTRUSTED_INPUT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# src/bernstein/adapters/http_post.py
&lt;/span&gt;&lt;span class="n"&gt;CAPABILITIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EXTERNAL_COMM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At spawn, the orchestrator unions the tags of every tool the prospective agent has access to. If the union contains all three of &lt;code&gt;PRIVATE_DATA&lt;/code&gt;, &lt;code&gt;UNTRUSTED_INPUT&lt;/code&gt;, and &lt;code&gt;EXTERNAL_COMM&lt;/code&gt;, the spawn fails with a refusal naming the offending capability set and the tools that contributed to each tag. Operators can override per-task with &lt;code&gt;--allow-trifecta&lt;/code&gt;, which is logged to the audit chain and surfaced in the lineage record.&lt;/p&gt;
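
&lt;p&gt;The spawn-time check described above amounts to a set union followed by a subset test. A minimal sketch, assuming tool tags arrive as a plain mapping; &lt;code&gt;check_spawn&lt;/code&gt;, &lt;code&gt;TrifectaRefused&lt;/code&gt;, and the &lt;code&gt;allow_trifecta&lt;/code&gt; parameter are hypothetical names for illustration, and audit logging of the override is not modelled:&lt;/p&gt;

```python
TRIFECTA = frozenset({"PRIVATE_DATA", "UNTRUSTED_INPUT", "EXTERNAL_COMM"})


class TrifectaRefused(Exception):
    """Raised when a prospective agent's tool belt completes the trifecta."""


def check_spawn(tools: dict, allow_trifecta: bool = False) -> dict:
    """Union tool tags; refuse when all three trifecta tags are present.

    Returns a mapping of tag -> sorted tool names that contributed it.
    """
    contributors = {}
    for tool_name, tags in tools.items():
        for tag in tags:
            contributors.setdefault(tag, []).append(tool_name)
    union = frozenset(contributors)
    if TRIFECTA.issubset(union) and not allow_trifecta:
        detail = "; ".join(
            f"{tag}: {', '.join(sorted(contributors[tag]))}" for tag in sorted(TRIFECTA)
        )
        raise TrifectaRefused(f"refusing spawn, lethal trifecta present ({detail})")
    return {tag: sorted(names) for tag, names in contributors.items()}


tools = {
    "postgres_query": frozenset({"PRIVATE_DATA"}),
    "webhook_ingest": frozenset({"UNTRUSTED_INPUT"}),
    "http_post": frozenset({"EXTERNAL_COMM"}),
}
try:
    check_spawn(tools)
except TrifectaRefused as refusal:
    print(refusal)

# Dropping any one leg of the trifecta lets the spawn through.
two_tools = {name: tags for name, tags in tools.items() if name != "http_post"}
print(check_spawn(two_tools))
```

&lt;p&gt;Tracking which tool contributed each tag is what lets the refusal name the offending combination instead of just saying no.&lt;/p&gt;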

&lt;p&gt;The gate catches the architectural mistake — assembling a tool belt that shouldn't exist together — before any prompt runs. It cannot prevent a capable insider from manually wiring around it. The default failure mode is now refusal, which is the right default for a tool you don't fully control.&lt;/p&gt;

&lt;h2&gt;
  
  
  what this batch isn't
&lt;/h2&gt;

&lt;p&gt;A few things are honestly not done. MESH/HIERARCHICAL cluster topologies are stubs. FIPS-140 hardware keys for the lineage signer are on the 1.11 roadmap, not this release. Multi-tenant isolation inside a single cluster is deferred. The capability matrix covers the trifecta and nothing else; finer-grained gates like "no PII into models without a BAA" are next quarter's work. None of these are blockers for the deployments we have lined up. All of them will be eventually.&lt;/p&gt;

&lt;p&gt;This isn't a "version 2.0." It's the unglamorous list a procurement reviewer asks about before the technical evaluation has even started. Cluster auth, signed audit, offline install, capability isolation. Table stakes for getting the orchestrator dropped onto a regulated customer's box instead of staying a thing that runs on someone's laptop.&lt;/p&gt;

&lt;p&gt;We picked the second option for two years because the first option is mostly paperwork. The batch shipped today is the paperwork.&lt;/p&gt;




&lt;p&gt;If you got here from the README and want the codebase view, &lt;a href="https://github.com/sipyourdrink-ltd/bernstein" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; is the canonical place. If the regulated-deployment angle is what you came for, &lt;a href="https://github.com/sipyourdrink-ltd/bernstein/issues" rel="noopener noreferrer"&gt;open an issue&lt;/a&gt; describing the air-gap or compliance gap you're hitting; the next batch is shaped by what blocks real deployments.&lt;/p&gt;

&lt;p&gt;Bernstein&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://bernstein.run/blog/orchestrator-on-someone-elses-box?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=orchestrator-on-someone-elses-box&amp;amp;utm_content=canonical" rel="noopener noreferrer"&gt;https://bernstein.run/blog/orchestrator-on-someone-elses-box?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=orchestrator-on-someone-elses-box&amp;amp;utm_content=canonical&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>multiagentorchestration</category>
      <category>cluster</category>
      <category>enterprise</category>
      <category>audittrail</category>
    </item>
    <item>
      <title>1.10.1 through 1.10.6: the shipped things</title>
      <dc:creator>Alex Chernysh</dc:creator>
      <pubDate>Mon, 11 May 2026 13:14:35 +0000</pubDate>
      <link>https://dev.to/alex_chernysh/1101-through-1106-the-shipped-things-3p2</link>
      <guid>https://dev.to/alex_chernysh/1101-through-1106-the-shipped-things-3p2</guid>
      <description>&lt;p&gt;The v1.10.0 post covered the regulated-deployment work. The five point releases since are not headline-shaped. They are the things people kept asking for in issues and the things that were quietly broken once we tried to use the orchestrator on a real polyglot codebase. Worth a single round-up so the trajectory is legible from one page.&lt;/p&gt;

&lt;h2&gt;
  
  
  a single AGENTS.md the rest of the agents can read
&lt;/h2&gt;

&lt;p&gt;If you run more than one CLI agent on the same repo you already know the problem. Cursor wants &lt;code&gt;.cursor/rules/*.mdc&lt;/code&gt;. Claude Code wants &lt;code&gt;CLAUDE.md&lt;/code&gt;. Aider wants &lt;code&gt;CONVENTIONS.md&lt;/code&gt; plus a tiny &lt;code&gt;.aider.conf.yml&lt;/code&gt; line so it actually loads on every session. Goose wants &lt;code&gt;.goosehints&lt;/code&gt;. Each of those files holds the same handful of facts about your codebase, said four different ways, which means a real repo carries four drifting copies of the same instructions.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;bernstein agents-md&lt;/code&gt; reads the repo's roles, hooks, skills, capability matrix, and install snippets, emits one &lt;a href="https://agents.md" rel="noopener noreferrer"&gt;AAIF AGENTS.md&lt;/a&gt; as the single source of truth, then rewrites that into the four vendor shapes above. Five subcommands. &lt;code&gt;generate&lt;/code&gt; previews the canonical IR to stdout. &lt;code&gt;write&lt;/code&gt; produces one target. &lt;code&gt;sync&lt;/code&gt; produces the canonical plus all four CLI-specific files in one pass. &lt;code&gt;verify&lt;/code&gt; is a CI gate that fails on drift. &lt;code&gt;diff&lt;/code&gt; shows what is stale.&lt;/p&gt;

&lt;p&gt;The IR is intentionally schema-free, because the AAIF spec doesn't impose one and locking ourselves in would have meant fighting upstream every quarter. The CI gate is the part that compounds. After three months of drift you can re-run &lt;code&gt;agents-md sync&lt;/code&gt; and watch four files all snap back to the same content without anyone hand-merging. The orchestrator runs &lt;code&gt;agents-md verify&lt;/code&gt; against its own tree on every PR; the pattern is the same one anyone with more than one agent ends up wanting.&lt;/p&gt;

&lt;h2&gt;
  
  
  cost legibility you don't have to grep for
&lt;/h2&gt;

&lt;p&gt;Two patches, both small, both the kind of thing that should have shipped in 1.0.&lt;/p&gt;

&lt;p&gt;The first is a per-turn budget banner. &lt;code&gt;bernstein run&lt;/code&gt; now prints a one-line countdown each turn: dollars and tokens remaining against the task budget. The Anthropic prompt-caching beta header is enabled by default, so cache hits actually land. Operators no longer have to track the wallet limit in their heads while the agent is mid-thought. CI runs with a cost ceiling get a real per-turn signal instead of finding out post-mortem.&lt;/p&gt;

&lt;p&gt;The second is &lt;code&gt;--max-cost-usd&lt;/code&gt;, a hard cap on a run's cumulative routed-model spend. Once spend crosses the threshold, the run aborts cleanly, with the partial results merged or rolled back the same way a normal cancel works. Pair it with the run summary's "estimated savings vs. single-shot through the most expensive routed model" line that 1.10.1 added and the wallet picture is finally visible without a &lt;code&gt;jq&lt;/code&gt; on &lt;code&gt;.sdd/runtime/costs.jsonl&lt;/code&gt;. The bandit router has been doing the right thing for a while; the operator surface has not.&lt;/p&gt;
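
&lt;p&gt;The two patches reduce to one small piece of bookkeeping: add each turn's cost to a running total, print what is left, and abort once the total passes the cap. A rough sketch follows. &lt;code&gt;CostGuard&lt;/code&gt;, &lt;code&gt;BudgetExceeded&lt;/code&gt;, and the banner text are invented for this example and do not reflect Bernstein's internals or its actual output format.&lt;/p&gt;

```python
class BudgetExceeded(Exception):
    """Raised when cumulative spend crosses the configured cap."""


class CostGuard:
    """Tracks cumulative spend and aborts once the cap is crossed."""

    def __init__(self, max_cost_usd: float) -> None:
        self.max_cost_usd = max_cost_usd
        self.spent_usd = 0.0

    def record_turn(self, cost_usd: float) -> str:
        """Add one turn's cost; return a banner line or raise on overspend."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.max_cost_usd:.2f}"
            )
        remaining = self.max_cost_usd - self.spent_usd
        return f"budget: ${remaining:.2f} remaining"


guard = CostGuard(max_cost_usd=5.0)
try:
    for turn_cost in (1.5, 2.0, 2.5):  # hypothetical per-turn costs
        print(guard.record_turn(turn_cost))
except BudgetExceeded as stop:
    print("aborting run:", stop)
```

&lt;p&gt;A production version would likely use &lt;code&gt;decimal.Decimal&lt;/code&gt; rather than floats for money, and would hand control to the normal cancel path instead of raising straight out of the loop.&lt;/p&gt;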

&lt;h2&gt;
  
  
  A2A v1.0 with a verifier you can actually run
&lt;/h2&gt;

&lt;p&gt;If you connect Bernstein to other agents over the A2A protocol, every Bernstein agent now publishes a signed agent card at &lt;code&gt;/.well-known/agent.json&lt;/code&gt; and the public verification keys at &lt;code&gt;/.well-known/jwks.json&lt;/code&gt;. JWS detached signature over JCS canonical bytes with Ed25519, audience binding via RFC 8707 resource indicators, persistent keystore with &lt;code&gt;O_EXCL&lt;/code&gt; plus &lt;code&gt;0o600&lt;/code&gt; semantics, and a 24-hour rotation grace window so a peer that fetched JWKS five minutes ago can still verify the previous key after a rotation without races.&lt;/p&gt;
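
&lt;p&gt;The rotation grace window can be pictured as a filter over the published key set: a key stays acceptable while it is current, or for 24 hours after it was retired. The sketch below assumes a simple list of dicts with a &lt;code&gt;retired_at&lt;/code&gt; field; that shape and the function name are hypothetical, and real JWKS entries and signature verification are not modelled here.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(hours=24)


def acceptable_key_ids(keys: list, now: datetime) -> set:
    """Key ids a verifier should still accept at time `now`.

    Each key is a dict with "kid" and "retired_at" (None while current).
    """
    accepted = set()
    for key in keys:
        retired_at = key["retired_at"]
        if retired_at is None or retired_at + GRACE >= now:
            accepted.add(key["kid"])
    return accepted


now = datetime(2026, 5, 11, 12, 0, tzinfo=timezone.utc)
keys = [
    {"kid": "key-3", "retired_at": None},                        # current key
    {"kid": "key-2", "retired_at": now - timedelta(minutes=5)},  # just rotated out
    {"kid": "key-1", "retired_at": now - timedelta(days=3)},     # past the window
]
print(sorted(acceptable_key_ids(keys, now)))
```

&lt;p&gt;A peer that cached the key set shortly before a rotation still verifies cards signed with the previous key, while anything older than the window is rejected.&lt;/p&gt;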

&lt;p&gt;The compliance side ships a verifier you don't have to trust us to run. &lt;code&gt;tools/verify_audit_dsse.py&lt;/code&gt; depends only on the Python standard library and &lt;code&gt;cryptography&lt;/code&gt;. Its own test asserts that &lt;code&gt;import bernstein&lt;/code&gt; raises &lt;code&gt;ModuleNotFoundError&lt;/code&gt; from inside the verifier's venv, because that is the property an external auditor wants from a verifier they will hand to their own team. The audit log itself is HMAC-SHA256 chained, JCS-canonicalised per RFC 8785, timestamp-anchored against an external TSA via RFC 3161 chain validation, and exported as a DSSE plus in-toto v1 envelope. Multi-tenant slicing via &lt;code&gt;bernstein audit slice&lt;/code&gt; exports a deterministic subset for an evaluator without breaking the chain on either side.&lt;/p&gt;

&lt;p&gt;Honest framing: the compliance surface ships with tests, runbooks, and the standalone verifier above, but it has not been bashed against an external regulatory audit yet. Treat it as code an evaluator can read and stand up themselves, not as a SOC 2 attestation.&lt;/p&gt;

&lt;h2&gt;
  
  
  four new adapters
&lt;/h2&gt;

&lt;p&gt;Adapter count went from 31 to 44 over the five releases. The four worth calling out by name:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Devin for Terminal (Cognition).&lt;/strong&gt; First-class adapter for the enterprise coding agent. 558 lines of contract tests verify the spawn surface mirrors the long-running adapter pattern. Drop-in via &lt;code&gt;cli_agent: devin_terminal&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JetBrains Junie.&lt;/strong&gt; BYOK across Anthropic, OpenAI, Google, xAI, OpenRouter, and the Copilot proxy. Bring whichever key the org already has procurement for.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AWS Q Developer.&lt;/strong&gt; Wraps &lt;code&gt;q chat --no-interactive --trust-all-tools&lt;/code&gt; so AWS-resident teams can route the steps where their security model wants the AWS-trusted lane.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DeepSeek V4-Flash and V4-Pro.&lt;/strong&gt; Self-hosted via an Ollama-compatible endpoint. Ships an EU-residency guard that pins the endpoint host and rejects DNS rebinding via a loopback test. The Hypothesis bug-hunt suite caught a &lt;code&gt;10.example.com&lt;/code&gt; rebinding bypass while the adapter was still in development, which is roughly the point of running the Hypothesis suite.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
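
&lt;p&gt;The post does not say what the &lt;code&gt;10.example.com&lt;/code&gt; bypass looked like. One plausible class of bug, shown here purely as an assumption, is classifying hosts by string prefix, which lets an ordinary DNS name pass as a private address. Parsing the host as an IP literal with the standard &lt;code&gt;ipaddress&lt;/code&gt; module avoids that. Both function names are made up for the example.&lt;/p&gt;

```python
import ipaddress


def naive_is_private(host: str) -> bool:
    # Buggy on purpose: string prefixes also match ordinary DNS names.
    return host.startswith(("10.", "127.", "192.168."))


def is_private_literal(host: str) -> bool:
    """True only when host is an IP literal in a private or loopback range."""
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name, not an IP literal; it would need resolving and pinning.
        return False
    return address.is_private or address.is_loopback


for host in ("10.0.0.5", "127.0.0.1", "10.example.com", "8.8.8.8"):
    print(host, naive_is_private(host), is_private_literal(host))
```

&lt;p&gt;A property-based test that generates hostnames beginning with digits and dots is the kind of input that separates the two functions quickly.&lt;/p&gt;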

&lt;p&gt;The Cursor adapter also got a real rewrite. The previous code shelled a non-existent &lt;code&gt;cursor agent&lt;/code&gt; binary with fictional flags. New version targets the real &lt;code&gt;cursor-agent&lt;/code&gt; CLI surface (&lt;code&gt;-p&lt;/code&gt;, &lt;code&gt;--workspace&lt;/code&gt;, &lt;code&gt;--output-format stream-json&lt;/code&gt;, &lt;code&gt;--trust&lt;/code&gt;, &lt;code&gt;--approve-mcps&lt;/code&gt;, &lt;code&gt;--force&lt;/code&gt;) with 242 lines of new contract tests so it can't regress to vapor again.&lt;/p&gt;

&lt;h2&gt;
  
  
  the smaller things
&lt;/h2&gt;

&lt;p&gt;A few that don't need a whole section but matter in a specific situation.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;bernstein run&lt;/code&gt; learned a &lt;code&gt;pending_approval&lt;/code&gt; state. Tasks pause there until an operator approves or rejects through the API or a panel, with the decision logged to the audit chain. The fresh-context retry mode (&lt;code&gt;agent_restart_between_retries&lt;/code&gt;, opt-in) restarts the agent cold instead of inheriting the failed run's context bloat, which is the right choice once you have watched a 200k-token context retry and somehow get worse.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;bernstein scaffold&lt;/code&gt; is a first slice for going from one sentence to a working repo skeleton. &lt;code&gt;bernstein wiki build&lt;/code&gt; generates a per-repo wiki from the canonical AGENTS.md IR. The A/B runner primitive lets you compare two adapter configurations on the same task set without writing a custom harness. None of these are finished; they ship as the smallest viable slice so the spec, the test, and the runtime artefact all exist while the operational surface stays thin enough not to lock in a bad shape.&lt;/p&gt;

&lt;p&gt;There is also an opt-in LLM watcher that reads the deterministic loop's events and annotates them with a natural-language summary. Off by default, runs on Haiku, useful when you are explaining a failed run to a human reviewer who is not going to read the JSONL by hand. The orchestrator stays deterministic. The watcher is a side-channel.&lt;/p&gt;

&lt;h2&gt;
  
  
  why these matter
&lt;/h2&gt;

&lt;p&gt;Most of the friction in running a multi-agent setup is not the agents. It is the four config files that disagree, the run that quietly burned through a budget at 3am, the A2A peer that won't verify your card because your keystore lost a race, the EU-residency requirement that bites the second a transitive dependency tries to phone home. None of those are interesting to write up as a feature. All of them are the thing that decides whether someone runs the orchestrator twice.&lt;/p&gt;

&lt;h2&gt;
  
  
  try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipx &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; bernstein
bernstein agents-md &lt;span class="nb"&gt;sync&lt;/span&gt;          &lt;span class="c"&gt;# one canonical, four vendor shapes&lt;/span&gt;
bernstein run &lt;span class="nt"&gt;--max-cost-usd&lt;/span&gt; 5    &lt;span class="c"&gt;# hard cap; per-turn countdown shows above&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Container: &lt;code&gt;ghcr.io/sipyourdrink-ltd/bernstein:1.10.6&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  next
&lt;/h2&gt;

&lt;p&gt;The KF-1 through KF-9 slices each shipped as smallest-viable. The next release fills the operational surface for the ones people actually use; the others stay slices until somebody asks. Hypothesis property-test coverage gets extended into the orchestrator runtime path, which is the surface most likely to leak invariants nobody wrote down. If you hit something rough in 1.10.x, &lt;a href="https://github.com/sipyourdrink-ltd/bernstein/issues" rel="noopener noreferrer"&gt;open an issue&lt;/a&gt;; the next batch is shaped by what blocks real work.&lt;/p&gt;

&lt;p&gt;Bernstein&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://bernstein.run/blog/v1-10-x-recap?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=v1-10-x-recap&amp;amp;utm_content=canonical" rel="noopener noreferrer"&gt;https://bernstein.run/blog/v1-10-x-recap?utm_source=devto&amp;amp;utm_medium=crosspost&amp;amp;utm_campaign=v1-10-x-recap&amp;amp;utm_content=canonical&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>multiagentorchestration</category>
      <category>release</category>
      <category>agentsmd</category>
      <category>a2a</category>
    </item>
    <item>
      <title>Orchestration primitive or desktop ADE? Choosing your multi-agent coding layer in 2026</title>
      <dc:creator>Alex Chernysh</dc:creator>
      <pubDate>Tue, 21 Apr 2026 14:11:04 +0000</pubDate>
      <link>https://dev.to/alex_chernysh/orchestration-primitive-or-desktop-ade-choosing-your-multi-agent-coding-layer-in-2026-3nnd</link>
      <guid>https://dev.to/alex_chernysh/orchestration-primitive-or-desktop-ade-choosing-your-multi-agent-coding-layer-in-2026-3nnd</guid>
      <description>&lt;p&gt;The multi-agent coding tool category went from a handful of projects in late 2024 to thirty-plus by mid-2026. Along the way it split into two shapes that solve adjacent-but-different problems. Here's when to reach for each, and why you might end up using both.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two shapes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Desktop ADEs.&lt;/strong&gt; A downloadable desktop application. You install it like any other app, open a window, configure credentials, and see your repo, your agents, and your diffs in a unified UI. Examples in the open-source corner: &lt;a href="https://github.com/generalaction/emdash" rel="noopener noreferrer"&gt;emdash&lt;/a&gt; (Electron app, 23 CLI providers supported, YC W26-funded), &lt;a href="https://conductor.build" rel="noopener noreferrer"&gt;Conductor&lt;/a&gt;, Cline's desktop mode. Closed-source you'd put in the same category: Claude Code's VS Code extension, Cursor's "run in background" mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orchestration primitives.&lt;/strong&gt; A library or CLI you import into your own workflow. You don't see a window; you see a process you can pipe into other things. Examples: &lt;a href="https://bernstein.run" rel="noopener noreferrer"&gt;Bernstein&lt;/a&gt; (the project this blog belongs to — 18 CLI adapters, Python-importable), &lt;a href="https://github.com/skeet70/workz" rel="noopener noreferrer"&gt;Workz&lt;/a&gt;, certain configurations of Plandex. LangGraph and CrewAI are adjacent but different — they orchestrate LLM calls, not CLI coding agents.&lt;/p&gt;

&lt;p&gt;The distinction is not about which is better. It's about what layer of the problem you're solving.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a desktop ADE does well
&lt;/h2&gt;

&lt;p&gt;A desktop ADE gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A visual workspace. Diffs, PR status, CI checks, agent logs all in one window.&lt;/li&gt;
&lt;li&gt;Zero-config launch. You open the app, it picks up your repo, agents just work.&lt;/li&gt;
&lt;li&gt;Identity handled. Credentials in the OS keychain, not in a &lt;code&gt;.env&lt;/code&gt; file that leaks.&lt;/li&gt;
&lt;li&gt;Distribution pattern. Electron installers for macOS, Windows, Linux. Your non-terminal colleague can use it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This shape is the right answer when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're the kind of developer who keeps an IDE open all day and wants agents integrated into that workflow, not hidden in a &lt;code&gt;tmux&lt;/code&gt; pane.&lt;/li&gt;
&lt;li&gt;You're onboarding teammates who don't live in the terminal.&lt;/li&gt;
&lt;li&gt;You want one tool that covers edit, review, merge, and CI visibility end-to-end.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it trades off:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not programmable from the outside. You can't &lt;code&gt;import emdash&lt;/code&gt; or write a CI job that kicks off a parallel agent run via emdash's API. It's a UI, not a library.&lt;/li&gt;
&lt;li&gt;Ships with opinionated conventions. Agents live in app-managed worktrees; audit logs live in app databases. Extracting them into another system is possible but not first-class.&lt;/li&gt;
&lt;li&gt;Cross-machine coordination is an extra feature (SSH mode, remote runtime) rather than the default shape.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What an orchestration primitive does well
&lt;/h2&gt;

&lt;p&gt;A primitive gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A process you can script. &lt;code&gt;bernstein run --goal "..." | jq .&lt;/code&gt; works. So does invoking it from a GitHub Actions workflow, or importing &lt;code&gt;bernstein.core&lt;/code&gt; in your Python code.&lt;/li&gt;
&lt;li&gt;Deterministic coordination. The scheduler is a regular event loop. Every run is replay-able from the audit trail.&lt;/li&gt;
&lt;li&gt;MCP server mode. Your agent-of-choice can talk to the orchestrator through the same Model Context Protocol Anthropic publishes for Claude Code.&lt;/li&gt;
&lt;li&gt;Composition. A primitive is one step in a larger pipeline: linter → primitive multi-agent pass → janitor → merge queue → deploy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This shape is the right answer when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want to embed multi-agent coding into a system you already run: CI, internal dev-platform, evaluation harness.&lt;/li&gt;
&lt;li&gt;You care about reproducibility. HMAC-chained audit trails give you "did the agent really do exactly that?" answers days later.&lt;/li&gt;
&lt;li&gt;You're already in a scripting-first workflow and don't want a new app to keep open.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it trades off:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No visual diff/merge UI out of the box. You &lt;code&gt;git diff&lt;/code&gt; the worktree, or plug it into your existing tools.&lt;/li&gt;
&lt;li&gt;Setup needs a terminal. &lt;code&gt;pipx install bernstein &amp;amp;&amp;amp; bernstein init&lt;/code&gt;, not a double-click installer.&lt;/li&gt;
&lt;li&gt;It's one layer of a larger stack. You'll likely pair it with a separate review tool, CI system, and notification channel.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Decision shortcuts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Building a product on top of multi-agent coding?&lt;/strong&gt; Reach for a primitive. Libraries compose; apps don't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Onboarding a team that wants a single download?&lt;/strong&gt; Reach for a desktop ADE. The ergonomics of an opinionated installable app are hard to beat for non-power-users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Running agents as part of CI / evaluation / internal platform?&lt;/strong&gt; Primitive, nearly always.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Running agents on your own laptop during normal dev work?&lt;/strong&gt; Either works; it's a preference question. Try both for a week.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need to prove to compliance or security "here's exactly what happened"?&lt;/strong&gt; HMAC audit trails live in the primitive layer. ADE output logs are usually app-scoped.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  They often co-exist
&lt;/h2&gt;

&lt;p&gt;Nothing prevents running both. A pattern we've seen in Bernstein's early users:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bernstein in CI for the "every PR gets a lint-plus-refactor agent pass" step.&lt;/li&gt;
&lt;li&gt;Desktop ADE for interactive "I'm pairing with Claude Code on this refactor" flow.&lt;/li&gt;
&lt;li&gt;Bernstein's MCP server mode exposed to the ADE so both see the same audit trail.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're already using a desktop ADE and it covers what you need, keep it. If you hit the "but I want to run this from a shell script / from CI / inside another service" wall, that's the signal to look at a primitive, regardless of which specific one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bernstein's specific positioning
&lt;/h2&gt;

&lt;p&gt;Bernstein is the primitive-shape tool. What we optimize for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deterministic coordinator written in plain Python — no LLM in the scheduling loop, so runs are reproducible.&lt;/li&gt;
&lt;li&gt;HMAC-chained audit trail — every agent action is replay-able bit-for-bit days later.&lt;/li&gt;
&lt;li&gt;MCP server mode — expose Bernstein to any MCP-capable client (Claude Code, Cursor, or your own agent).&lt;/li&gt;
&lt;li&gt;18 CLI adapters including Claude Code, Codex, Cursor, Aider, Gemini CLI, OpenAI Agents SDK, Amp, Cody, Ollama, and more.&lt;/li&gt;
&lt;li&gt;Apache 2.0, BYOK, &lt;code&gt;pipx install bernstein&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What we don't build: a desktop UI. If you need one, emdash and Conductor both do that well and are worth trying.&lt;/p&gt;

&lt;p&gt;The category is large enough to have multiple right answers. The question is which layer of your stack you're optimizing for. A primitive and an ADE are not competing with each other. They're competing with the "write a bunch of glue code to make two agents work on the same repo without destroying it" option — which nearly everyone used until twelve months ago, and which neither shape is going back to.&lt;/p&gt;

</description>
      <category>multiagentcoding</category>
      <category>agentorchestration</category>
      <category>aicodingagents</category>
      <category>developertools</category>
    </item>
    <item>
      <title>From 4,000 Lines to 200: Decomposing Bernstein's Core</title>
      <dc:creator>Alex Chernysh</dc:creator>
      <pubDate>Tue, 21 Apr 2026 14:10:28 +0000</pubDate>
      <link>https://dev.to/alex_chernysh/from-4000-lines-to-200-decomposing-bernsteins-core-2n8h</link>
      <guid>https://dev.to/alex_chernysh/from-4000-lines-to-200-decomposing-bernsteins-core-2n8h</guid>
      <description>&lt;p&gt;Bernstein's orchestrator.py hit 4,198 lines. We used 11 parallel agents, orchestrated by Bernstein itself, to decompose it into 15 sub-packages in the first pass, each under 400 lines. Subsequent refactors extended this to 22 sub-packages. Here's how that worked and what we learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  How a file gets to 4,000 lines
&lt;/h2&gt;

&lt;p&gt;It happens gradually. The orchestrator started as a clean 300-line module that managed a tick loop: check for tasks, spawn agents, collect results. Then it grew. Cost tracking logic. Quality gates. Token monitoring. Git worktree management. Heartbeat detection. Idle agent recycling. Shutdown coordination.&lt;/p&gt;

&lt;p&gt;Each addition was small and reasonable. But after two months of active development, &lt;code&gt;orchestrator.py&lt;/code&gt; was a 4,198-line monolith that imported 47 modules and had 23 public methods. The test file was 2,800 lines. IDE navigation was painful. Merge conflicts were constant because every feature touched the same file.&lt;/p&gt;

&lt;p&gt;The rule we now follow: if a module crosses 600 lines, it's time to decompose.&lt;/p&gt;
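
&lt;p&gt;A rule like this is easy to check mechanically. The sketch below is one possible way to list offenders, not a Bernstein command: &lt;code&gt;oversized_modules&lt;/code&gt; and &lt;code&gt;LINE_LIMIT&lt;/code&gt; are hypothetical names, and it counts raw lines, including blanks and comments.&lt;/p&gt;

```python
from pathlib import Path

LINE_LIMIT = 600  # the threshold from the rule above


def oversized_modules(root, limit=LINE_LIMIT):
    """Return (path, line_count) pairs for .py files longer than `limit`, largest first."""
    results = []
    for path in Path(root).rglob("*.py"):
        with path.open(encoding="utf-8", errors="replace") as handle:
            count = sum(1 for _ in handle)
        if count > limit:
            results.append((str(path), count))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

&lt;p&gt;Calling &lt;code&gt;oversized_modules("src")&lt;/code&gt; in a pre-commit hook or a CI step would surface candidates before they reach monolith size.&lt;/p&gt;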

&lt;h2&gt;
  
  
  The plan
&lt;/h2&gt;

&lt;p&gt;We defined 15 target sub-packages, each responsible for one concern:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sub-package&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;th&gt;Lines (after)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;orchestration/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lifecycle, tick pipeline&lt;/td&gt;
&lt;td&gt;~350&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;agents/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Spawner, discovery, heartbeat&lt;/td&gt;
&lt;td&gt;~380&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tasks/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Task store, retry, scheduling&lt;/td&gt;
&lt;td&gt;~340&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;quality/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Quality gates, CI monitor&lt;/td&gt;
&lt;td&gt;~290&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cost/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cost tracking, budgets&lt;/td&gt;
&lt;td&gt;~310&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tokens/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Token monitoring, intervention&lt;/td&gt;
&lt;td&gt;~250&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;security/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Audit logs, policy engine&lt;/td&gt;
&lt;td&gt;~270&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;git/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Worktree management, merge queue&lt;/td&gt;
&lt;td&gt;~280&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;persistence/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;WAL, checkpointing&lt;/td&gt;
&lt;td&gt;~220&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;planning/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Plan loading, dependencies&lt;/td&gt;
&lt;td&gt;~200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;routing/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Model selection, bandit&lt;/td&gt;
&lt;td&gt;~320&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;communication/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Bulletin board, messaging&lt;/td&gt;
&lt;td&gt;~180&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;server/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Task server, API&lt;/td&gt;
&lt;td&gt;~260&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;config/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Configuration, defaults&lt;/td&gt;
&lt;td&gt;~190&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;observability/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Metrics, tracing&lt;/td&gt;
&lt;td&gt;~240&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The decomposition needed to be backward-compatible. Existing code importing &lt;code&gt;from bernstein.core.orchestrator import Orchestrator&lt;/code&gt; had to keep working.&lt;/p&gt;

&lt;h2&gt;
  
  
  11 agents, 15 packages
&lt;/h2&gt;

&lt;p&gt;Here's the recursive part: we used Bernstein to execute the decomposition. A YAML plan defined 15 extraction stages with dependency edges (e.g., &lt;code&gt;tasks/&lt;/code&gt; had to be extracted before &lt;code&gt;agents/&lt;/code&gt; because the spawner depends on the task store).&lt;/p&gt;

&lt;p&gt;11 agents ran in parallel across independent sub-packages. Each agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extracted the relevant functions and classes from &lt;code&gt;orchestrator.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Created the new sub-package with proper &lt;code&gt;__init__.py&lt;/code&gt; exports&lt;/li&gt;
&lt;li&gt;Updated all internal imports&lt;/li&gt;
&lt;li&gt;Ran the sub-package's tests to verify nothing broke&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The whole decomposition took about 3 hours of wall time. A human doing this manually — carefully moving code, fixing imports, running tests after each change — would spend 2-3 days.&lt;/p&gt;

&lt;h2&gt;
  
  
  The re-export shim pattern
&lt;/h2&gt;

&lt;p&gt;Backward compatibility was the hardest constraint. We solved it with re-export shims. The original &lt;code&gt;orchestrator.py&lt;/code&gt; became a thin file that imports from sub-packages and re-exports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# src/bernstein/core/orchestrator.py (after — ~200 lines, down from 4,198)
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Orchestrator shim — re-exports from sub-packages for backward compat.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bernstein.core.orchestration.lifecycle&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Orchestrator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bernstein.core.orchestration.tick&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TickPipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bernstein.core.orchestration.manager&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OrchestratorManager&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bernstein.core.orchestration.shutdown&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ShutdownCoordinator&lt;/span&gt;

&lt;span class="n"&gt;__all__&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Orchestrator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TickPipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OrchestratorManager&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ShutdownCoordinator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every existing import path works unchanged. New code imports from the specific sub-package. Over time, the shims can be deprecated.&lt;/p&gt;
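
&lt;p&gt;One way to phase the shims out later is a module-level &lt;code&gt;__getattr__&lt;/code&gt; (PEP 562) that resolves the old names lazily and emits a &lt;code&gt;DeprecationWarning&lt;/code&gt;. The sketch below is an assumption about how that could look, not what the repository does today; &lt;code&gt;make_deprecating_getattr&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import importlib
import warnings


def make_deprecating_getattr(shim_name, moved):
    """Build a module-level __getattr__ (PEP 562) that resolves moved names lazily.

    `moved` maps a legacy attribute name to the module that now owns it.
    """

    def __getattr__(name):
        target = moved.get(name)
        if target is None:
            raise AttributeError(f"module {shim_name!r} has no attribute {name!r}")
        warnings.warn(
            f"{shim_name}.{name} is deprecated; import it from {target}",
            DeprecationWarning,
            stacklevel=2,
        )
        return getattr(importlib.import_module(target), name)

    return __getattr__


# Inside the shim module this would replace the eager imports, for example:
# __getattr__ = make_deprecating_getattr(
#     __name__, {"Orchestrator": "bernstein.core.orchestration.lifecycle"}
# )
```

&lt;p&gt;Python only calls the hook when normal attribute lookup on the module fails, so names defined directly in the shim keep working silently while moved names warn on each access.&lt;/p&gt;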

&lt;h2&gt;
  
  
  What we learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Dependency graphs matter more than you think.&lt;/strong&gt; The extraction order was critical. Extracting &lt;code&gt;git/&lt;/code&gt; before &lt;code&gt;tasks/&lt;/code&gt; would have created circular imports because the merge queue references task completion callbacks. We had to map the dependency graph before writing the plan.&lt;/p&gt;
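
&lt;p&gt;Mapping the graph up front can be done with the standard library. The sketch below uses &lt;code&gt;graphlib.TopologicalSorter&lt;/code&gt; on a small subset of the packages, using only the two dependencies mentioned in this post; the &lt;code&gt;DEPENDS_ON&lt;/code&gt; mapping and &lt;code&gt;extraction_waves&lt;/code&gt; are illustrative names, not part of Bernstein's plan format.&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Hypothetical subset of the plan: each key lists the packages it depends on.
DEPENDS_ON = {
    "tasks": set(),
    "agents": {"tasks"},  # the spawner depends on the task store
    "git": {"tasks"},     # the merge queue references task completion callbacks
}


def extraction_waves(depends_on):
    """Group packages into waves; everything in one wave can be extracted in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

&lt;p&gt;For the mapping above, &lt;code&gt;extraction_waves(DEPENDS_ON)&lt;/code&gt; yields &lt;code&gt;tasks&lt;/code&gt; first, then &lt;code&gt;agents&lt;/code&gt; and &lt;code&gt;git&lt;/code&gt; together. A circular dependency is reported before any extraction starts rather than as an import error halfway through.&lt;/p&gt;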

&lt;p&gt;&lt;strong&gt;Tests are the safety net.&lt;/strong&gt; Each extraction step ran the full test suite. We caught 14 import errors, 3 circular dependencies, and 1 subtle bug where a function relied on module-level state that moved to a different file. Without tests, at least half of those would have shipped broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;600 lines is a good limit.&lt;/strong&gt; After the decomposition, the largest sub-package is &lt;code&gt;agents/&lt;/code&gt; at ~380 lines. Every module is small enough to read in one sitting, grep effectively, and test in isolation. When a new file starts approaching 600 lines, we split it proactively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orchestrators can orchestrate themselves.&lt;/strong&gt; There's something satisfying about using your own tool to refactor itself. The decomposition was one of our most complex multi-agent runs, and it validated that the parallel execution model works for real refactoring tasks, not just greenfield code generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The result
&lt;/h2&gt;

&lt;p&gt;Before: 1 file, 4,198 lines, 47 imports, constant merge conflicts.&lt;br&gt;
After: 15 sub-packages in the first pass (extended to 22 in later refactors), ~280 lines average, clean dependency boundaries, agents can work on different packages without conflicts.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/chernistry/bernstein/tree/main/src/bernstein/core" rel="noopener noreferrer"&gt;full source&lt;/a&gt; is on GitHub. The re-export shims are in the top-level files like &lt;code&gt;orchestrator.py&lt;/code&gt;, &lt;code&gt;spawner.py&lt;/code&gt;, and &lt;code&gt;task_lifecycle.py&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/cost-aware-routing"&gt;How Bernstein routes tasks to the right model&lt;/a&gt; — the routing sub-package in action&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/cloudflare-cloud-execution"&gt;Running agents on Cloudflare&lt;/a&gt; — cloud execution built on the decomposed architecture&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/getting-started"&gt;Getting started&lt;/a&gt; — try a multi-agent session yourself&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>pythonrefactoring</category>
      <category>codedecomposition</category>
      <category>multiagentorchestration</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Getting Started: Your First Multi-Agent Run in 5 Minutes</title>
      <dc:creator>Alex Chernysh</dc:creator>
      <pubDate>Tue, 21 Apr 2026 14:09:52 +0000</pubDate>
      <link>https://dev.to/alex_chernysh/getting-started-your-first-multi-agent-run-in-5-minutes-57fj</link>
      <guid>https://dev.to/alex_chernysh/getting-started-your-first-multi-agent-run-in-5-minutes-57fj</guid>
      <description>&lt;p&gt;This guide gets you from zero to a working multi-agent session in under 5 minutes. You'll install Bernstein, configure Claude Code as your agent, run a goal, and understand the output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Install Bernstein
&lt;/h2&gt;

&lt;p&gt;Bernstein requires Python 3.12+. Install it with pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;bernstein
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or if you use &lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;uv&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv pip &lt;span class="nb"&gt;install &lt;/span&gt;bernstein
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify the installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bernstein &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="c"&gt;# bernstein 1.8.8&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Configure your agent
&lt;/h2&gt;

&lt;p&gt;Bernstein needs at least one CLI coding agent installed. The fastest setup uses Claude Code, but &lt;a href="https://bernstein.readthedocs.io/en/latest/ADAPTER_GUIDE/" rel="noopener noreferrer"&gt;18 agents are supported&lt;/a&gt; including Codex, Gemini CLI, the OpenAI Agents SDK, Aider, and more.&lt;/p&gt;

&lt;p&gt;Make sure Claude Code is installed and your API key is set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Claude Code if you haven't&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @anthropic-ai/claude-code

&lt;span class="c"&gt;# Set your API key&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bernstein auto-detects installed agents. Verify it finds yours:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bernstein agents
&lt;span class="c"&gt;# Available agents:&lt;/span&gt;
&lt;span class="c"&gt;#   claude (Claude Code) ✓&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Run your first goal
&lt;/h2&gt;

&lt;p&gt;Change into any Git repository and run a goal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
bernstein run &lt;span class="nt"&gt;--goal&lt;/span&gt; &lt;span class="s2"&gt;"Add type hints to all functions in src/utils.py"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bernstein will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Decompose&lt;/strong&gt; the goal into concrete tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assign&lt;/strong&gt; each task a role, priority, and model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spawn&lt;/strong&gt; agents in isolated git worktrees&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor&lt;/strong&gt; progress via heartbeats and output parsing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge&lt;/strong&gt; completed work back to your branch&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 4: Read the TUI
&lt;/h2&gt;

&lt;p&gt;The terminal UI shows live progress:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─ Bernstein v1.8.8 ─────────────────────────────────┐
│ Goal: Add type hints to all functions in src/utils  │
│ Tasks: 3 total │ 1 running │ 1 done │ 1 pending    │
│ Agents: 2 active │ Cost: $0.12                      │
├─────────────────────────────────────────────────────┤
│ ✓ task-001  Analyze existing type usage    00:42    │
│ ► task-002  Add type hints to helpers      01:15    │
│ ○ task-003  Add type hints to validators   pending  │
└─────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;✓&lt;/strong&gt; = completed and merged&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;►&lt;/strong&gt; = currently running&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;○&lt;/strong&gt; = pending (waiting for dependencies or an available agent)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Press &lt;code&gt;q&lt;/code&gt; to stop gracefully (agents finish their current task) or &lt;code&gt;Ctrl+C&lt;/code&gt; to force stop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Check the results
&lt;/h2&gt;

&lt;p&gt;When all tasks complete, check what changed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git log &lt;span class="nt"&gt;--oneline&lt;/span&gt; &lt;span class="nt"&gt;-5&lt;/span&gt;
&lt;span class="c"&gt;# a1b2c3d Add type hints to validator functions&lt;/span&gt;
&lt;span class="c"&gt;# d4e5f6a Add type hints to helper functions&lt;/span&gt;
&lt;span class="c"&gt;# b7c8d9e Analyze existing type usage in src/utils.py&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each agent's work is a separate commit, merged through Bernstein's merge queue. If any task failed, its changes are rolled back and the failure is logged in &lt;code&gt;.sdd/dead_letter.json&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to try next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Run a YAML plan&lt;/strong&gt; for structured, multi-stage projects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bernstein run plans/my-project.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plans let you define stages, dependencies, roles, and complexity per task. See the &lt;a href="https://bernstein.readthedocs.io/en/latest/GETTING_STARTED/" rel="noopener noreferrer"&gt;plan file docs&lt;/a&gt; for the full schema.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use multiple agent types&lt;/strong&gt; by installing additional adapters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Bernstein will route tasks to the best available agent&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @openai/codex  &lt;span class="c"&gt;# or install any supported agent&lt;/span&gt;
bernstein agents       &lt;span class="c"&gt;# see all detected agents&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Monitor costs&lt;/strong&gt; across sessions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bernstein cost
&lt;span class="c"&gt;# Session total: $0.47&lt;/span&gt;
&lt;span class="c"&gt;# By model: haiku=$0.03, sonnet=$0.28, opus=$0.16&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check the API&lt;/strong&gt; for programmatic access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Task server runs on port 8052 during sessions&lt;/span&gt;
curl http://127.0.0.1:8052/status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/cost-aware-routing"&gt;How Bernstein routes tasks to the right model&lt;/a&gt;: bandit router cuts spend roughly in half in our own runs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/cloudflare-cloud-execution"&gt;Running agents on Cloudflare&lt;/a&gt; — scale beyond your laptop&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/chernistry/bernstein" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; — source code, issues, and discussions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/bernstein/" rel="noopener noreferrer"&gt;PyPI package&lt;/a&gt; — release history and downloads&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aicodingagentssetup</category>
      <category>claudecodetutorial</category>
      <category>gettingstarted</category>
      <category>pythonclitools</category>
    </item>
    <item>
      <title>How Bernstein Routes Tasks to the Right Model</title>
      <dc:creator>Alex Chernysh</dc:creator>
      <pubDate>Tue, 21 Apr 2026 14:09:16 +0000</pubDate>
      <link>https://dev.to/alex_chernysh/how-bernstein-routes-tasks-to-the-right-model-379j</link>
      <guid>https://dev.to/alex_chernysh/how-bernstein-routes-tasks-to-the-right-model-379j</guid>
      <description>&lt;p&gt;Not every coding task needs Opus. Bernstein's contextual bandit router learns which model handles each task type best, then routes accordingly. In our own runs, the bandit router cut spend roughly in half compared to uniform model selection. Measure yours with bernstein cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  The uniform selection problem
&lt;/h2&gt;

&lt;p&gt;Most multi-agent setups use the same model for everything. Every task — whether it's renaming a variable or designing an authentication system — gets routed to the same model at the same effort level. This is wasteful. A &lt;code&gt;docs&lt;/code&gt; task that writes a docstring doesn't need the same model as a &lt;code&gt;security&lt;/code&gt; task that implements credential scoping.&lt;/p&gt;

&lt;p&gt;The cost difference is real. At current API pricing, routing a simple task to Haiku instead of Opus costs roughly 30x less. Over a session with 40-60 tasks, that adds up fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the router works
&lt;/h2&gt;

&lt;p&gt;Bernstein's routing pipeline has three layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Heuristic classification.&lt;/strong&gt; Every task has a &lt;code&gt;complexity&lt;/code&gt; field (low, medium, high) and a &lt;code&gt;role&lt;/code&gt; (backend, frontend, qa, security, etc.). The router uses a rule-based classifier to make an initial model/effort assignment. Low-complexity tasks default to Haiku or Sonnet with standard effort. High-complexity tasks get Opus with max effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Epsilon-greedy bandit.&lt;/strong&gt; This is where it gets interesting. The bandit maintains per-role reward estimates for each model. When a task arrives, it exploits the best-known model 80% of the time and explores alternatives 20% of the time. Rewards come from task outcomes: did the agent complete the task? Did tests pass? How many retries were needed?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified selection logic
&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;opus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;CASCADE&lt;/span&gt;
&lt;span class="n"&gt;selected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bandit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidate_models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;CASCADE&lt;/code&gt; list includes all available models from cheapest to most capable. For high-complexity tasks, the bandit only considers Sonnet and Opus — sending a hard architecture task to Haiku would waste the agent's time even if it's cheap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Effectiveness seeding.&lt;/strong&gt; The bandit warms up using historical effectiveness data from the &lt;code&gt;.sdd/metrics/&lt;/code&gt; directory. If a previous run showed that &lt;code&gt;backend&lt;/code&gt; tasks succeed 95% of the time with Sonnet but only 70% with Haiku, the bandit starts with that prior. No cold-start problem after the first session.&lt;/p&gt;
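
&lt;p&gt;Layers 2 and 3 together can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the router in the repository: the class name, the &lt;code&gt;seed&lt;/code&gt; and &lt;code&gt;update&lt;/code&gt; methods, and the single scalar reward are all hypothetical. Only the &lt;code&gt;select(role=..., candidate_models=...)&lt;/code&gt; call shape mirrors the snippet above.&lt;/p&gt;

```python
import random
from collections import defaultdict


class EpsilonGreedyRouter:
    """Per-role epsilon-greedy selection over candidate models (illustrative only)."""

    def __init__(self, epsilon=0.2, rng=None):
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        # (role, model) maps to [number of observations, mean reward]
        self.stats = defaultdict(lambda: [0, 0.0])

    def seed(self, role, model, success_rate, weight=1):
        """Warm-start an arm from historical data, counted as `weight` observations."""
        self.stats[(role, model)] = [weight, success_rate]

    def select(self, role, candidate_models):
        if self.rng.random() >= self.epsilon:
            # Exploit: highest mean reward; ties go to the earlier (cheaper) candidate.
            return max(candidate_models, key=lambda m: self.stats[(role, m)][1])
        return self.rng.choice(candidate_models)  # explore uniformly

    def update(self, role, model, reward):
        """Fold one outcome (for example 1.0 = tests passed, 0.0 = failed) into the mean."""
        arm = self.stats[(role, model)]
        arm[0] += 1
        arm[1] += (reward - arm[1]) / arm[0]
```

&lt;p&gt;With &lt;code&gt;epsilon=0.2&lt;/code&gt; this gives the 80/20 exploit/explore split described above. Seeding &lt;code&gt;backend&lt;/code&gt; with 0.95 for Sonnet and 0.70 for Haiku makes the first exploit step pick Sonnet, and later outcomes shift the estimates through &lt;code&gt;update&lt;/code&gt;.&lt;/p&gt;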

&lt;h2&gt;
  
  
  What the router learns
&lt;/h2&gt;

&lt;p&gt;After a few sessions, clear patterns emerge:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task type&lt;/th&gt;
&lt;th&gt;Typical model&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Docs, docstrings&lt;/td&gt;
&lt;td&gt;Haiku&lt;/td&gt;
&lt;td&gt;Templated output, low reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test writing&lt;/td&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;td&gt;Needs code understanding, not creativity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bug fixes&lt;/td&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;td&gt;Pattern matching on error traces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Refactoring&lt;/td&gt;
&lt;td&gt;Sonnet/Opus&lt;/td&gt;
&lt;td&gt;Depends on scope&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture, security&lt;/td&gt;
&lt;td&gt;Opus&lt;/td&gt;
&lt;td&gt;Requires deep reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These aren't hardcoded rules — they're learned from outcomes. If your codebase has unusually complex tests, the bandit will learn to route test tasks to a stronger model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;The bandit is enabled by default when a metrics directory exists. You can tune the exploration rate and the model cascade in your config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .sdd/config.yaml&lt;/span&gt;
&lt;span class="na"&gt;routing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;bandit_epsilon&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.2&lt;/span&gt;          &lt;span class="c1"&gt;# 20% exploration&lt;/span&gt;
  &lt;span class="na"&gt;cascade&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;haiku&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;sonnet&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;opus&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;min_samples_per_arm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;       &lt;span class="c1"&gt;# explore each option at least 5 times&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To disable bandit routing and use pure heuristics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;routing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;bandit_enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;Across our internal runs (self-development sessions where Bernstein improves its own codebase), the bandit router cut per-session spend roughly in half compared to the baseline of Sonnet-for-everything. Task completion rates stayed within a couple of percentage points, so cheaper models handle their assigned tasks fine. Measure your own runs with &lt;code&gt;bernstein cost&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The savings compound. A 10-agent session running 50 tasks might cost $15-20 with uniform Sonnet. With bandit routing, the same session runs $7-10. Over weeks of iterative development, that's the difference between a side project budget and a real expense.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://bernstein.readthedocs.io/en/latest/ARCHITECTURE/" rel="noopener noreferrer"&gt;Architecture overview&lt;/a&gt; for how routing fits into the orchestration pipeline&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/getting-started"&gt;Getting started&lt;/a&gt; to try it yourself&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/chernistry/bernstein/tree/main/src/bernstein/core/routing" rel="noopener noreferrer"&gt;Source code&lt;/a&gt; for the full router implementation&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aicostoptimization</category>
      <category>modelrouting</category>
      <category>contextualbandit</category>
      <category>multiagentorchestration</category>
    </item>
    <item>
      <title>Community Spotlight: April 2026</title>
      <dc:creator>Alex Chernysh</dc:creator>
      <pubDate>Tue, 21 Apr 2026 14:08:40 +0000</pubDate>
      <link>https://dev.to/alex_chernysh/community-spotlight-april-2026-50o5</link>
      <guid>https://dev.to/alex_chernysh/community-spotlight-april-2026-50o5</guid>
      <description>&lt;p&gt;Every month we spotlight the people who make Bernstein better. Here are April's highlights from the first month of public development.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened in April
&lt;/h2&gt;

&lt;p&gt;Bernstein went from v1.0.0 to v1.8.8 in a few weeks. The pace was intense, and community contributions made a real difference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Architecture decomposition&lt;/strong&gt;: 52 oversized modules broken into 22 focused sub-packages, each under 600 lines. The orchestrator monolith (4,198 lines) is now navigable, testable, and merge-conflict-free. &lt;a href="https://dev.to/blog/module-decomposition"&gt;Read the full story&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;18 agent adapters&lt;/strong&gt;: We started with 7 adapters and now support 18: Claude Code, Codex, Gemini CLI, OpenAI Agents SDK, Cursor, Aider, Amp, Kiro, Kilo, Qwen, Goose, Ollama, Cody, Continue, OpenCode, Cloudflare Agents, IaC, and a generic wrapper. Each adapter is a focused Python class under 200 lines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-aware routing&lt;/strong&gt;: The &lt;a href="https://dev.to/blog/cost-aware-routing"&gt;contextual bandit router&lt;/a&gt; learns which model handles each task type best. In our own runs, the bandit cut spend roughly in half compared to sending everything to the same model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare cloud execution&lt;/strong&gt;: Agents can now &lt;a href="https://dev.to/blog/cloudflare-cloud-execution"&gt;run on Cloudflare Workers&lt;/a&gt; with Durable Workflows, R2 artifact storage, and D1 state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows support&lt;/strong&gt;: Full cross-platform compatibility contributed by &lt;a href="https://github.com/oldschoola" rel="noopener noreferrer"&gt;@oldschoola&lt;/a&gt;: environment passthrough, Unicode safety, process management, and terminal handling.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Contributors
&lt;/h2&gt;

&lt;p&gt;Thanks to everyone who contributed PRs, reported bugs, and tested edge cases this month:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/oldschoola" rel="noopener noreferrer"&gt;@oldschoola&lt;/a&gt;: Windows compatibility (3 merged PRs), codex config, task filtering, auto-PR&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Ai-chan-0411" rel="noopener noreferrer"&gt;@Ai-chan-0411&lt;/a&gt;: community spotlight template&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/alexanderxfgl-bit" rel="noopener noreferrer"&gt;@alexanderxfgl-bit&lt;/a&gt;: spotlight generator script&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/forfreedomforrich-eng" rel="noopener noreferrer"&gt;@forfreedomforrich-eng&lt;/a&gt;: &lt;code&gt;--dry-run&lt;/code&gt; flag, trigger URL fix&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/TheCodingDragon0" rel="noopener noreferrer"&gt;@TheCodingDragon0&lt;/a&gt;: &lt;code&gt;bernstein config diff&lt;/code&gt;, glossary&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/internet-dot" rel="noopener noreferrer"&gt;@internet-dot&lt;/a&gt;: HOL workflow&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Beledarian" rel="noopener noreferrer"&gt;@Beledarian&lt;/a&gt;: config path validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All contributors are listed in &lt;a href="https://github.com/chernistry/bernstein/blob/main/CONTRIBUTORS.md" rel="noopener noreferrer"&gt;CONTRIBUTORS.md&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to get involved
&lt;/h2&gt;

&lt;p&gt;Bernstein is Apache 2.0 and welcomes contributions of all sizes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/chernistry/bernstein/labels/good%20first%20issue" rel="noopener noreferrer"&gt;Good first issues&lt;/a&gt;: curated tasks for newcomers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/chernistry/bernstein/issues/786" rel="noopener noreferrer"&gt;Write a blog post&lt;/a&gt;: get published on bernstein.run&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/chernistry/bernstein/issues/775" rel="noopener noreferrer"&gt;Adopt an adapter&lt;/a&gt;: become the maintainer for your favorite agent&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/chernistry/bernstein/issues/787" rel="noopener noreferrer"&gt;Submit benchmarks&lt;/a&gt;: share your orchestration metrics&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>community</category>
      <category>opensource</category>
      <category>contributors</category>
      <category>multiagentorchestration</category>
    </item>
    <item>
      <title>Running AI Agents on Cloudflare: Workers, Workflows, and Durable Objects</title>
      <dc:creator>Alex Chernysh</dc:creator>
      <pubDate>Tue, 21 Apr 2026 14:08:39 +0000</pubDate>
      <link>https://dev.to/alex_chernysh/running-ai-agents-on-cloudflare-workers-workflows-and-durable-objects-fl</link>
      <guid>https://dev.to/alex_chernysh/running-ai-agents-on-cloudflare-workers-workflows-and-durable-objects-fl</guid>
      <description>&lt;p&gt;Bernstein v1.8.4 ships with Cloudflare cloud execution. Agents can now run on Workers, multi-step tasks use Durable Workflows, artifacts go to R2, and state persists in D1. Here's the architecture and how to deploy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why local-only limits adoption
&lt;/h2&gt;

&lt;p&gt;Running agents locally works for individual developers, but it has real constraints. Your laptop is the bottleneck: CPU, memory, and network all compete with your actual work. Long-running sessions drain battery. If you close your laptop, the session dies. And scaling beyond 4-5 concurrent agents on a MacBook starts hitting resource limits.&lt;/p&gt;

&lt;p&gt;Cloud execution solves this. Agents run on remote infrastructure while you monitor progress from a dashboard or TUI. Sessions survive disconnects. You can scale to 20+ concurrent agents without melting your machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cloudflare stack
&lt;/h2&gt;

&lt;p&gt;Cloudflare recently became &lt;a href="https://openai.com/index/cloudflare-openai-agent-cloud/" rel="noopener noreferrer"&gt;OpenAI's infrastructure partner for agent cloud computing&lt;/a&gt; — the same infrastructure Bernstein agents can now run on. We chose Cloudflare's stack because it maps cleanly to orchestration primitives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workers&lt;/strong&gt; handle lightweight, stateless agent execution. Each agent task runs in an isolated Worker with its own environment. Workers cold-start in under 50ms, so spinning up a new agent is nearly instant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Durable Workflows&lt;/strong&gt; orchestrate multi-step tasks. When an agent needs to clone a repo, run code, execute tests, and report results, the workflow ensures each step completes before the next begins — with automatic retries on failure. If a Worker crashes mid-task, the workflow resumes from the last completed step, not from scratch.&lt;/p&gt;
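&lt;p&gt;A minimal Python sketch of that resume-from-the-last-completed-step behaviour. It is not the Cloudflare Workflows API: the &lt;code&gt;run_workflow&lt;/code&gt; helper, the plain dict used as a checkpoint store, and the step names are all hypothetical stand-ins for the durable storage the platform provides.&lt;/p&gt;

```python
from typing import Callable

# Hypothetical step type: receives the checkpoints so far, returns a result.
Step = Callable[[dict], object]


def run_workflow(steps: list[tuple[str, Step]], checkpoints: dict,
                 max_attempts: int = 3) -> dict:
    """Run steps in order, skipping any whose result is already checkpointed."""
    for name, step in steps:
        if name in checkpoints:
            continue  # completed in an earlier run; do not redo it
        last_error = None
        for _attempt in range(max_attempts):
            try:
                checkpoints[name] = step(checkpoints)
                last_error = None
                break
            except Exception as exc:  # retry on any step failure
                last_error = exc
        if last_error is not None:
            raise RuntimeError(f"step {name!r} failed") from last_error
    return checkpoints


calls: list[str] = []
state = {"test_runs": 0}


def clone(ctx: dict) -> str:
    calls.append("clone")
    return "repo"


def run_tests(ctx: dict) -> str:
    calls.append("test")
    state["test_runs"] += 1
    if state["test_runs"] == 1:
        raise OSError("simulated Worker crash")
    return "passed"


steps = [("clone", clone), ("test", run_tests)]
store: dict = {}  # stands in for durable storage

try:
    run_workflow(steps, store, max_attempts=1)  # crashes in the test step
except RuntimeError:
    pass

run_workflow(steps, store)  # resumes: clone is skipped, only test re-runs
```

&lt;p&gt;After the second call, &lt;code&gt;clone&lt;/code&gt; has run once and &lt;code&gt;run_tests&lt;/code&gt; twice, which is the property that matters: work finished before the crash is never repeated.&lt;/p&gt;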

&lt;p&gt;&lt;strong&gt;R2&lt;/strong&gt; stores artifacts. Agent outputs — diffs, test results, generated files — persist in R2 buckets. The orchestrator reads results from R2 when merging completed work back to the main branch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;D1&lt;/strong&gt; holds orchestration state. Task queues, agent assignments, cost metrics, and audit logs all live in D1. This replaces the local &lt;code&gt;.sdd/&lt;/code&gt; file-based state with a durable database that survives restarts and supports concurrent access from multiple Workers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture overview
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Architecture diagram omitted in this cross-post. See the original post on bernstein.run for the rendered version.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The orchestrator itself runs as a Worker with a Durable Object for maintaining tick state. Agent Workers are spawned per-task and communicate results back through R2 and D1.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying
&lt;/h2&gt;

&lt;p&gt;Prerequisites: a Cloudflare account with Workers, R2, and D1 enabled, and &lt;code&gt;wrangler&lt;/code&gt; installed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Authenticate with Cloudflare&lt;/span&gt;
wrangler login

&lt;span class="c"&gt;# Deploy the Bernstein cloud stack&lt;/span&gt;
bernstein cloud deploy &lt;span class="nt"&gt;--project&lt;/span&gt; my-project

&lt;span class="c"&gt;# This creates:&lt;/span&gt;
&lt;span class="c"&gt;#   - Orchestrator Worker + Durable Object&lt;/span&gt;
&lt;span class="c"&gt;#   - R2 bucket: bernstein-my-project-artifacts&lt;/span&gt;
&lt;span class="c"&gt;#   - D1 database: bernstein-my-project-state&lt;/span&gt;
&lt;span class="c"&gt;#   - Workflow definitions for multi-step tasks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once deployed, run tasks against the cloud backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run a goal on cloud infrastructure&lt;/span&gt;
bernstein run &lt;span class="nt"&gt;--goal&lt;/span&gt; &lt;span class="s2"&gt;"Refactor auth module"&lt;/span&gt; &lt;span class="nt"&gt;--cloud&lt;/span&gt;

&lt;span class="c"&gt;# Monitor from your terminal&lt;/span&gt;
bernstein cloud status

&lt;span class="c"&gt;# Or check the dashboard&lt;/span&gt;
bernstein dashboard &lt;span class="nt"&gt;--cloud&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agent API keys (Anthropic, OpenAI, etc.) are stored as Worker secrets via &lt;code&gt;wrangler secret put&lt;/code&gt;. Cloudflare stores them encrypted, and only the deployed Workers read them at runtime to call the provider APIs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost considerations
&lt;/h2&gt;

&lt;p&gt;Cloudflare Workers pricing is request-based, not instance-based. You pay for the compute your agents actually use, not for idle VMs. For a typical 50-task session:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Workers compute: ~$0.50 to $2.00&lt;/li&gt;
&lt;li&gt;R2 storage: pennies (artifacts are small)&lt;/li&gt;
&lt;li&gt;D1 reads/writes: pennies (state operations are lightweight)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cloud infrastructure cost is a small fraction of the LLM API costs that agents incur. The real savings come from not needing to keep your machine running and from being able to scale to more concurrent agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;We're working on scheduled runs (trigger a session from a cron or GitHub webhook), multi-region execution (run agents closer to the repos they're working on), and a hosted dashboard for monitoring cloud sessions without a local CLI.&lt;/p&gt;

&lt;p&gt;Try it: &lt;code&gt;pip install bernstein&lt;/code&gt; and check the &lt;a href="https://dev.to/blog/getting-started"&gt;getting started guide&lt;/a&gt;.&lt;br&gt;
Source: &lt;a href="https://github.com/chernistry/bernstein" rel="noopener noreferrer"&gt;github.com/chernistry/bernstein&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cloudflareworkers</category>
      <category>cloudaiagents</category>
      <category>serverlessorchestration</category>
      <category>multiagentorchestration</category>
    </item>
    <item>
      <title>Stop using LLMs to schedule other LLMs</title>
      <dc:creator>Alex Chernysh</dc:creator>
      <pubDate>Wed, 08 Apr 2026 12:56:54 +0000</pubDate>
      <link>https://dev.to/alex_chernysh/why-i-stopped-using-llms-to-schedule-llms-4176</link>
      <guid>https://dev.to/alex_chernysh/why-i-stopped-using-llms-to-schedule-llms-4176</guid>
      <description>&lt;p&gt;Three AI coding agents on the same repo = three agents overwriting each other's work. Claude Code edits &lt;code&gt;auth.py&lt;/code&gt;. Codex edits &lt;code&gt;auth.py&lt;/code&gt; two seconds later. Claude's changes vanish. Meanwhile Gemini "refactors" the test suite and breaks six things.&lt;/p&gt;

&lt;p&gt;Two weeks of this. Here's what fixed it: git worktrees per agent, a deterministic Python scheduler (not an LLM), and a janitor that verifies work before merge.&lt;/p&gt;

&lt;h2&gt;
  
  
  The wrong turn
&lt;/h2&gt;

&lt;p&gt;My first orchestrator used an LLM to coordinate the other LLMs. A manager agent read the backlog, decided assignments, checked progress, re-planned on failure.&lt;/p&gt;

&lt;p&gt;It was slow, expensive, and kept hallucinating priorities. ~40% of total tokens went to coordination overhead instead of code.&lt;/p&gt;

&lt;p&gt;Then the obvious hit me: scheduling is a solved problem. Operating systems have done concurrent process scheduling since the 1960s. Nobody uses neural networks for &lt;code&gt;cron&lt;/code&gt;. Why use one for task assignment?&lt;/p&gt;

&lt;p&gt;I ripped out the LLM scheduler. The result is &lt;a href="https://github.com/chernistry/bernstein" rel="noopener noreferrer"&gt;Bernstein&lt;/a&gt;, an open-source orchestrator that coordinates any CLI coding agent with &lt;strong&gt;zero LLM tokens on scheduling&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipeline
&lt;/h2&gt;

&lt;p&gt;Four stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Decompose&lt;/strong&gt;: one LLM call takes your goal, outputs a task graph with roles, owned files, and dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spawn&lt;/strong&gt;: each task gets a fresh CLI agent in an isolated git worktree. Parallel execution. Main branch untouched.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify&lt;/strong&gt;: a janitor checks concrete signals. Tests pass, files exist, linter clean, types correct. Binary outcomes, not opinions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge&lt;/strong&gt;: verified work lands on main. Failed tasks retry on a different model or get decomposed further.
&lt;/li&gt;
&lt;/ol&gt;
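&lt;p&gt;To make the Verify stage concrete, here is a rough sketch of binary checks in Python. It is not Bernstein's janitor: &lt;code&gt;files_exist&lt;/code&gt;, &lt;code&gt;command_passes&lt;/code&gt;, and &lt;code&gt;verify&lt;/code&gt; are hypothetical names, and a byte-compile stands in for a real test, lint, or type-check command.&lt;/p&gt;

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def files_exist(root: Path, expected: list[str]) -> bool:
    """Every file the task promised is actually present."""
    return all((root / rel).is_file() for rel in expected)


def command_passes(cmd: list[str], cwd: Path) -> bool:
    """A tool (tests, linter, type checker) exits with code 0."""
    return subprocess.run(cmd, cwd=cwd, capture_output=True).returncode == 0


def verify(root: Path, expected: list[str], commands: list[list[str]]) -> bool:
    """Binary verdict: True only if every check passes."""
    return files_exist(root, expected) and all(
        command_passes(cmd, root) for cmd in commands
    )


with tempfile.TemporaryDirectory() as tmp:
    worktree = Path(tmp)
    (worktree / "auth.py").write_text("def login():\n    return True\n")
    # Hypothetical stand-in for a real check command: byte-compile the file.
    compile_check = [sys.executable, "-m", "py_compile", "auth.py"]
    ok = verify(worktree, ["auth.py"], [compile_check])
    missing = verify(worktree, ["auth.py", "tests/test_auth.py"], [compile_check])
```

&lt;p&gt;Here &lt;code&gt;ok&lt;/code&gt; is true and &lt;code&gt;missing&lt;/code&gt; is false because the promised test file was never written. There is no score to interpret, only pass or fail.&lt;/p&gt;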

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Goal → Planner (LLM) → Task Graph → Orchestrator (Python) → Agents ‖
                                         ↓
                                    Janitor → Merge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The orchestrator is a Python event loop that polls a local task server, matches open tasks to available agents, and manages lifecycle. Deterministic, auditable, reproducible. Same inputs produce the same decisions.&lt;/p&gt;
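&lt;p&gt;One pass of such a loop might look like the sketch below. The &lt;code&gt;Task&lt;/code&gt; fields and the &lt;code&gt;tick&lt;/code&gt; function are assumptions made for illustration, not Bernstein's actual data model. The point is that assignment is a pure function of the inputs: sort, filter on dependencies and file ownership, assign.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # Hypothetical shape: id, files the task owns, ids it depends on.
    id: str
    files: frozenset[str]
    deps: frozenset[str] = field(default_factory=frozenset)


def tick(open_tasks: list[Task], done: set[str], running: list[Task],
         free_agents: list[str]) -> list[tuple[str, str]]:
    """One scheduling pass: no LLM call, no randomness."""
    locked = set().union(*(t.files for t in running))
    agents = sorted(free_agents)
    assignments = []
    for task in sorted(open_tasks, key=lambda t: t.id):
        if not agents:
            break
        if not task.deps.issubset(done):
            continue  # dependencies not finished yet
        if not task.files.isdisjoint(locked):
            continue  # a running or just-assigned task owns one of these files
        assignments.append((task.id, agents.pop(0)))
        locked.update(task.files)
    return assignments


tasks = [
    Task("t1", frozenset({"auth.py"})),
    Task("t2", frozenset({"auth.py", "db.py"})),
    Task("t3", frozenset({"docs.md"}), deps=frozenset({"t1"})),
    Task("t4", frozenset({"api.py"})),
]
agents = ["gemini", "claude", "codex"]

assignments = tick(tasks, done=set(), running=[], free_agents=agents)
# Same inputs in a different order give the same decisions.
again = tick(list(reversed(tasks)), set(), [], list(reversed(agents)))
```

&lt;p&gt;In this run &lt;code&gt;t1&lt;/code&gt; and &lt;code&gt;t4&lt;/code&gt; are assigned, &lt;code&gt;t2&lt;/code&gt; waits because it shares &lt;code&gt;auth.py&lt;/code&gt; with &lt;code&gt;t1&lt;/code&gt;, and &lt;code&gt;t3&lt;/code&gt; waits on its dependency.&lt;/p&gt;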

&lt;h2&gt;
  
  
  Worktrees: the part that unlocked it
&lt;/h2&gt;

&lt;p&gt;Each agent gets its own &lt;a href="https://git-scm.com/docs/git-worktree" rel="noopener noreferrer"&gt;git worktree&lt;/a&gt; on a disposable branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git worktree add .sdd/worktrees/session-abc123 &lt;span class="nt"&gt;-b&lt;/span&gt; agent/session-abc123
&lt;span class="c"&gt;# agent works in isolation&lt;/span&gt;
&lt;span class="c"&gt;# janitor verifies, then:&lt;/span&gt;
git checkout main
git merge agent/session-abc123 &lt;span class="nt"&gt;--no-ff&lt;/span&gt;
git worktree remove .sdd/worktrees/session-abc123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each agent thinks it owns the repo. No file locks, no coordination protocol between agents, no conflicts during work. The task graph declares file ownership, so overlapping files never get assigned concurrently.&lt;/p&gt;

&lt;p&gt;Expensive directories (&lt;code&gt;node_modules&lt;/code&gt;, &lt;code&gt;.venv&lt;/code&gt;) get symlinked from the main tree so you don't pay setup cost per agent.&lt;/p&gt;
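&lt;p&gt;A small sketch of that symlink step, assuming a helper named &lt;code&gt;link_shared_dirs&lt;/code&gt; (hypothetical, not Bernstein's code). On Windows, creating symlinks may need extra privileges.&lt;/p&gt;

```python
import os
import tempfile
from pathlib import Path


def link_shared_dirs(main_tree: Path, worktree: Path,
                     names: tuple[str, ...] = ("node_modules", ".venv")) -> list[str]:
    """Symlink heavy directories from the main tree into a worktree."""
    linked = []
    for name in names:
        source = main_tree / name
        target = worktree / name
        # Only link what exists in the main tree and is not already in place.
        if source.is_dir() and not os.path.lexists(target):
            os.symlink(source, target, target_is_directory=True)
            linked.append(name)
    return linked


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout that mirrors the worktree path used above.
    main_tree = Path(tmp) / "repo"
    worktree = Path(tmp) / "worktrees" / "session-abc123"
    (main_tree / ".venv").mkdir(parents=True)
    worktree.mkdir(parents=True)
    linked = link_shared_dirs(main_tree, worktree)
    venv_is_link = (worktree / ".venv").is_symlink()
```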

&lt;h2&gt;
  
  
  Model routing without vibes
&lt;/h2&gt;

&lt;p&gt;Renaming a variable doesn't need Opus. But static rules for model selection go stale fast.&lt;/p&gt;

&lt;p&gt;Bernstein uses a &lt;a href="https://en.wikipedia.org/wiki/Contextual_bandit" rel="noopener noreferrer"&gt;LinUCB contextual bandit&lt;/a&gt; that learns from outcomes. Features: complexity tier, file scope, role, estimated token budget. Reward: &lt;code&gt;quality_score * (1 - normalized_cost)&lt;/code&gt;. Cheapest model that passes the janitor wins.&lt;/p&gt;

&lt;p&gt;Under ~50 completions it falls back to a static cascade (haiku → sonnet → opus). After warm-up the bandit takes over. Policy persists across runs so learning accumulates.&lt;/p&gt;

&lt;p&gt;Net effect in my runs: ~23% cost reduction vs. running everything on one top-tier model.&lt;/p&gt;
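&lt;p&gt;For readers who have not met LinUCB, here is a compact pure-Python sketch of the idea, using the reward formula above. It is an illustration, not Bernstein's router: the class and function names, the two-element feature vector, and the reading of the cascade as "step up one model per failed attempt" are all assumptions.&lt;/p&gt;

```python
import math


class LinUCBArm:
    """One arm (model) of a disjoint LinUCB bandit."""

    def __init__(self, dim: int) -> None:
        # A starts as the identity, so its inverse is the identity too.
        self.a_inv = [[float(i == j) for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_x(self, x: list[float]) -> list[float]:
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.a_inv]

    def ucb(self, x: list[float], alpha: float) -> float:
        ax = self._a_inv_x(x)
        # theta . x equals b . (A^-1 x) because A^-1 is symmetric.
        mean = sum(bi * axi for bi, axi in zip(self.b, ax))
        bonus = alpha * math.sqrt(sum(xi * axi for xi, axi in zip(x, ax)))
        return mean + bonus

    def update(self, x: list[float], reward: float) -> None:
        ax = self._a_inv_x(x)
        denom = 1.0 + sum(xi * axi for xi, axi in zip(x, ax))
        # Sherman-Morrison rank-one update of A^-1 after A += x x^T.
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def reward(quality_score: float, normalized_cost: float) -> float:
    """Reward from the post: quality_score * (1 - normalized_cost)."""
    return quality_score * (1.0 - normalized_cost)


def choose(arms: dict[str, LinUCBArm], x: list[float], alpha: float = 1.0) -> str:
    # Sort names so ties break the same way on every run.
    return max(sorted(arms), key=lambda name: arms[name].ucb(x, alpha))


CASCADE = ["haiku", "sonnet", "opus"]


def pick_model(arms: dict[str, LinUCBArm], x: list[float], completions: int,
               attempt: int = 0, warmup: int = 50) -> str:
    if completions >= warmup:
        return choose(arms, x)
    # Cold start (assumed behaviour): one step up the cascade per failed attempt.
    return CASCADE[min(attempt, len(CASCADE) - 1)]


# Hypothetical features: [bias, complexity tier scaled to 0..1].
x = [1.0, 0.2]
arms = {name: LinUCBArm(dim=2) for name in CASCADE}
# Made-up outcomes: all three pass the janitor, at different normalized costs.
costs = {"haiku": 0.1, "sonnet": 0.4, "opus": 0.9}
for _ in range(20):
    for name, arm in arms.items():
        arm.update(x, reward(1.0, costs[name]))

cold = pick_model(arms, x, completions=10, attempt=1)
warm = pick_model(arms, x, completions=60)
```

&lt;p&gt;With these made-up outcomes the warmed-up bandit picks &lt;code&gt;haiku&lt;/code&gt; for the low-complexity context, while the cold-start call on a second attempt returns &lt;code&gt;sonnet&lt;/code&gt; from the cascade.&lt;/p&gt;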

&lt;h2&gt;
  
  
  New in v1.8: MCP server mode
&lt;/h2&gt;

&lt;p&gt;Since the original post, Bernstein gained a &lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; server. Any MCP-aware client (Claude Desktop, Cursor, VS Code, Zed) can now call Bernstein as a tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bernstein mcp &lt;span class="nt"&gt;--transport&lt;/span&gt; stdio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your IDE agent decomposes a goal, calls &lt;code&gt;bernstein_run&lt;/code&gt;, and Bernstein fans out the work across 12 parallel CLI agents in worktrees. The IDE agent just waits for results. One cheap router model at the top, a swarm of cheap workers below, one expensive reviewer at the end — instead of one Opus chewing through 40 serialized tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it differs from CrewAI, AutoGen, LangGraph, Composio, emdash
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Bernstein&lt;/th&gt;
&lt;th&gt;CrewAI / AutoGen / LangGraph&lt;/th&gt;
&lt;th&gt;Composio / emdash&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scheduling&lt;/td&gt;
&lt;td&gt;Deterministic Python&lt;/td&gt;
&lt;td&gt;LLM-driven&lt;/td&gt;
&lt;td&gt;Hosted/UI-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works with&lt;/td&gt;
&lt;td&gt;20+ CLI agents (Claude Code, Codex, Aider, etc.)&lt;/td&gt;
&lt;td&gt;Their SDK classes&lt;/td&gt;
&lt;td&gt;Their desktop app / web UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Git isolation&lt;/td&gt;
&lt;td&gt;Worktree per agent&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verification&lt;/td&gt;
&lt;td&gt;Janitor + quality gates&lt;/td&gt;
&lt;td&gt;Mostly absent&lt;/td&gt;
&lt;td&gt;Mostly absent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent lifetime&lt;/td&gt;
&lt;td&gt;Short: spawn, work, exit&lt;/td&gt;
&lt;td&gt;Long-running&lt;/td&gt;
&lt;td&gt;Long-running&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State&lt;/td&gt;
&lt;td&gt;File-based (inspect with &lt;code&gt;cat&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;In-memory / checkpointer&lt;/td&gt;
&lt;td&gt;Cloud/hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interface&lt;/td&gt;
&lt;td&gt;CLI + MCP server&lt;/td&gt;
&lt;td&gt;SDK&lt;/td&gt;
&lt;td&gt;Desktop ADE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Philosophical difference: CrewAI, AutoGen, and LangGraph are frameworks, so you write agents in their SDK. Composio and emdash are desktop ADEs (agentic development environments), so you use their UI. Bernstein is infrastructure: you point it at Claude Code, Codex, or Aider (or all three in one run) and it handles the rest.&lt;/p&gt;

&lt;p&gt;The LLM-driven coordination in those frameworks is non-deterministic and hard to debug. When Bernstein assigns task #47 to Sonnet, you can read the policy file and trace the feature vector that selected it. No prompt archaeology.&lt;/p&gt;

&lt;p&gt;Trade-off: no agent-to-agent chat, no built-in RAG, no hosted option. It's a CLI for people who want their agents to write code and get out.&lt;/p&gt;

&lt;h2&gt;
  
  
  What still sucks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Agents hallucinate file paths. The janitor catches it, but retries cost tokens.&lt;/li&gt;
&lt;li&gt;Context windows fill up on large codebases. Short-lived agents help; it's still a real constraint.&lt;/li&gt;
&lt;li&gt;12 parallel Opus agents are not cheap. Budgets and the bandit help, but spend still needs watching.&lt;/li&gt;
&lt;li&gt;Setup friction. At least one CLI agent must be installed and authenticated.&lt;/li&gt;
&lt;li&gt;File ownership isn't bulletproof. Agents occasionally touch files outside their scope.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is v1.8, not v10. But the core loop is stable and I've been running it against production code for months.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;bernstein
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
bernstein init
bernstein &lt;span class="nt"&gt;-g&lt;/span&gt; &lt;span class="s2"&gt;"Add rate limiting middleware"&lt;/span&gt;
bernstein live    &lt;span class="c"&gt;# TUI&lt;/span&gt;
bernstein cost    &lt;span class="c"&gt;# spend so far&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For multi-stage work, a YAML plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;goal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;limiting&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;middleware"&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
        &lt;span class="na"&gt;complexity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;medium&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;goal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Integration&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tests&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;limiter"&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qa&lt;/span&gt;
        &lt;span class="na"&gt;complexity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;low&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docs&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;goal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Document&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;limiting&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OpenAPI&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;spec"&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docs&lt;/span&gt;
        &lt;span class="na"&gt;complexity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;low&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bernstein run plan.yaml              &lt;span class="c"&gt;# deterministic execution&lt;/span&gt;
bernstein run &lt;span class="nt"&gt;--dry-run&lt;/span&gt; plan.yaml    &lt;span class="c"&gt;# preview + cost estimate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
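&lt;p&gt;The &lt;code&gt;depends_on&lt;/code&gt; field is what makes the stage order computable without a model in the loop. A sketch of that ordering with the standard library, assuming the plan has already been loaded into a dict (the &lt;code&gt;stage_order&lt;/code&gt; helper is hypothetical, not part of Bernstein):&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# The YAML plan above, written as the dict a YAML loader would produce.
plan = {
    "stages": [
        {"name": "backend", "steps": [
            {"goal": "Add rate limiting middleware",
             "role": "backend", "complexity": "medium"},
            {"goal": "Integration tests for rate limiter",
             "role": "qa", "complexity": "low"},
        ]},
        {"name": "docs", "depends_on": ["backend"], "steps": [
            {"goal": "Document rate limiting in OpenAPI spec",
             "role": "docs", "complexity": "low"},
        ]},
    ]
}


def stage_order(plan: dict) -> list[str]:
    """Return stage names so that every stage follows its depends_on list."""
    graph = {s["name"]: set(s.get("depends_on", [])) for s in plan["stages"]}
    # static_order() raises graphlib.CycleError if stages depend on each other.
    return list(TopologicalSorter(graph).static_order())


order = stage_order(plan)
```

&lt;p&gt;For this plan the order is &lt;code&gt;backend&lt;/code&gt; then &lt;code&gt;docs&lt;/code&gt;. A circular &lt;code&gt;depends_on&lt;/code&gt; fails loudly instead of stalling a run.&lt;/p&gt;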



&lt;p&gt;Mix models in the same run. Claude Code for architecture, Gemini for boilerplate, Aider with a local Ollama model for offline tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/chernistry/bernstein" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;. Apache 2.0. Star if it saves you a merge conflict.&lt;/p&gt;

&lt;p&gt;If you've been babysitting one agent at a time, try the worktree-per-agent pattern and tell me what breaks. I'm especially interested in failure modes I haven't hit yet.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>python</category>
      <category>ai</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
