<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ORCHESTRATE</title>
    <description>The latest articles on DEV Community by ORCHESTRATE (@tmdlrg).</description>
    <link>https://dev.to/tmdlrg</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3845413%2F041293b2-ed4f-44e7-8878-5c61995a45b6.jpeg</url>
      <title>DEV Community: ORCHESTRATE</title>
      <link>https://dev.to/tmdlrg</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tmdlrg"/>
    <language>en</language>
    <item>
      <title>Building JidoBuilder: A Documentary Series (Part 1/5) — The Genesis</title>
      <dc:creator>ORCHESTRATE</dc:creator>
      <pubDate>Mon, 13 Apr 2026 21:31:02 +0000</pubDate>
      <link>https://dev.to/tmdlrg/building-jidobuilder-a-documentary-series-part-15-the-genesis-hob</link>
      <guid>https://dev.to/tmdlrg/building-jidobuilder-a-documentary-series-part-15-the-genesis-hob</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 1 of a 5-part documentary series following the development of JidoBuilder, a visual management console for the Jido autonomous agent framework. Built on Elixir and the BEAM VM, this project tells the story of what happens when you pair a powerful open-source framework with an AI-assisted development team racing toward open-source release.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Where It All Started
&lt;/h2&gt;

&lt;p&gt;Before there was JidoBuilder, there was &lt;a href="https://jido.run/" rel="noopener noreferrer"&gt;Jido&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/mikehostetler/" rel="noopener noreferrer"&gt;Mike Hostetler&lt;/a&gt; is a veteran technologist whose career spans the formative years of the modern web. He was a jQuery Core Team member from 2006 to 2012, co-authored the &lt;em&gt;jQuery Cookbook&lt;/em&gt; published by O'Reilly, co-founded appendTo (known as "the jQuery Company"), and has shipped responsive redesigns for brands like Time.com and Celebrity Cruises. He holds credentials from Northwestern's Kellogg School of Management and is a member of the Forbes Technology Council. He spoke at O'Reilly's Fluent Conference and has contributed to Drupal, Node.js, and the broader JavaScript ecosystem for decades.&lt;/p&gt;

&lt;p&gt;So when Mike turned his attention to Elixir and the BEAM virtual machine, it was not a casual detour. It was a deliberate architectural decision.&lt;/p&gt;

&lt;p&gt;The result was &lt;strong&gt;Jido&lt;/strong&gt; — from the Japanese word meaning "automatic" or "self-moving." An autonomous agent framework for Elixir, purpose-built for multi-agent systems on the BEAM.&lt;/p&gt;

&lt;p&gt;You can read about the full evolution in Mike's own words on the &lt;a href="https://jido.run/blog" rel="noopener noreferrer"&gt;Jido blog&lt;/a&gt;, and hear him discuss the framework on &lt;a href="https://www.beamrad.io/94" rel="noopener noreferrer"&gt;Beam Radio Episode 94&lt;/a&gt;, the &lt;a href="https://podcast.thinkingelixir.com/287" rel="noopener noreferrer"&gt;Thinking Elixir Podcast (Episode 287)&lt;/a&gt;, and the &lt;a href="https://podcasts.apple.com/lu/podcast/mike-hostetler-on-reqllm/id1710056466" rel="noopener noreferrer"&gt;Elixir Mentor Podcast&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the BEAM?
&lt;/h2&gt;

&lt;p&gt;For developers unfamiliar with Elixir: the BEAM (Erlang's virtual machine) is the runtime that powers WhatsApp, Discord, and telecom systems that demand extreme uptime. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight processes&lt;/strong&gt; — each Jido agent uses roughly 25KB of memory at rest&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault isolation&lt;/strong&gt; — one crashing agent cannot bring down another&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preemptive scheduling&lt;/strong&gt; — 10,000 agents get fair CPU time without any one starving the others&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hot code upgrades&lt;/strong&gt; — update agent logic without restarting the system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distribution&lt;/strong&gt; — agents can span multiple nodes out of the box&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mike demonstrated this dramatically: &lt;strong&gt;1,575 agents indexing a codebase in 7 seconds&lt;/strong&gt;, and benchmarks showing &lt;strong&gt;10,000 concurrent agents&lt;/strong&gt; running on commodity hardware. These are not theoretical numbers. They come from a framework with ~1,600 GitHub stars that is published on Hex (Elixir's package manager) and actively maintained.&lt;/p&gt;

&lt;p&gt;The core architecture centers on the &lt;code&gt;cmd/2&lt;/code&gt; contract — agents receive actions and return updated state plus typed directives. This separates state changes (pure functional transformations) from side effects (explicit, testable directives). If you know Elm or Redux, you know the pattern. If you know OTP, you know the runtime.&lt;/p&gt;
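&lt;p&gt;To make the shape of that contract concrete, here is a minimal Python sketch of the pattern. Jido itself is Elixir, and the action names and state fields below are invented for illustration, not the framework's API:&lt;/p&gt;

```python
def cmd(agent_state, action):
    # Pure transition: compute the next state plus a list of explicit,
    # typed directives. Side effects happen later, when the runtime
    # interprets the directives -- never inside this function.
    if action["type"] == "increment":
        new_state = {**agent_state, "count": agent_state["count"] + 1}
        directives = [{"type": "emit_signal", "name": "counter.changed"}]
        return new_state, directives
    return agent_state, []

state, directives = cmd({"count": 0}, {"type": "increment"})
# state is {"count": 1}; emitting the signal is a directive, not a side effect
```

&lt;p&gt;Because the function is pure, every decision is testable by asserting on the returned state and directives, with no network, database, or LLM involved.&lt;/p&gt;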

&lt;h2&gt;
  
  
  Enter JidoBuilder
&lt;/h2&gt;

&lt;p&gt;Jido is a developer SDK. It is powerful, but it requires writing Elixir code. The question became: &lt;em&gt;what if you could operate autonomous agents without writing code?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That question became JidoBuilder — a Phoenix LiveView application that wraps the Jido SDK in a visual management console. The goal: let developers configure, deploy, and monitor multi-agent systems through a browser. Let non-technical operators hire agents from templates, dispatch signals, and observe execution traces without touching a terminal.&lt;/p&gt;

&lt;p&gt;This is not a toy. The current build includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;41 interactive pages&lt;/strong&gt; across agent management, workflow design, observability, and system configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;75+ registered actions&lt;/strong&gt; covering HTTP requests, JSON transforms, webhooks, Slack notifications, email, LLM chat, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An agentic LLM chat system&lt;/strong&gt; with recursive tool-use loops, conversation persistence, and Active Inference reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A full REST API&lt;/strong&gt; with OpenAPI 3.0.3 spec generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An MCP server&lt;/strong&gt; (Model Context Protocol) enabling AI assistants to operate the system programmatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code export&lt;/strong&gt; — generate standalone Elixir projects from builder configurations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;273 passing tests&lt;/strong&gt; across 158 test files in a 5-app Elixir umbrella&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Team Review: Who Built This?
&lt;/h2&gt;

&lt;p&gt;Here is where the story gets interesting. JidoBuilder was built with the assistance of an AI development team — 14 specialized personas working through a structured agile methodology. Each persona has a defined role, voice, and area of expertise. Here is what some of them had to say looking back at the genesis phase:&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Archi Tect&lt;/strong&gt; — &lt;em&gt;Principal Solution Architect&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The decision to build JidoBuilder as a Phoenix umbrella app with five distinct applications was not arbitrary. We needed hard boundaries between the core domain logic, the runtime agent lifecycle, the web presentation layer, code generation, and test infrastructure. Elixir umbrella apps give you that separation at the compilation level — not just folder conventions. Each app has its own supervision tree, its own dependencies, its own test suite. When the runtime crashes during development, the web layer stays up. That is OTP doing what OTP does."&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;Owen Pro&lt;/strong&gt; — &lt;em&gt;Product Owner&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The value proposition was clear from day one: Jido is brilliant for developers who think in processes and message-passing. But the market of people who can write Elixir GenServers is small. The market of people who need to manage autonomous agents is growing exponentially. JidoBuilder bridges that gap. We are not dumbing down Jido — we are making it accessible."&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;Tess Ter&lt;/strong&gt; — &lt;em&gt;QA Engineer&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"273 tests across 158 files in a system built in compressed time. Every LiveView page has mount and render tests. Every API endpoint has authentication and validation coverage. Every MCP tool responds to &lt;code&gt;action: help&lt;/code&gt;. We did not ship a demo — we shipped something with a test suite that would survive a production deployment."&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;Des Igner&lt;/strong&gt; — &lt;em&gt;UX/UI Designer&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The challenge was making 41 pages feel cohesive when they cover everything from agent hiring wizards to workflow DAG builders to observability dashboards. We chose a consistent sidebar navigation pattern, dark-on-light color scheme with Tailwind CSS, and a developer/business mode toggle that lets the same page surface different levels of detail. The command palette (Cmd+K) was critical — power users should never need to click through menus."&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Jido Gives You That Others Don't
&lt;/h2&gt;

&lt;p&gt;For the Elixir developer evaluating agent frameworks, here is what distinguishes Jido from the TypeScript and Python alternatives:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Immutable agents&lt;/strong&gt; — agents are pure data structures, making every decision unit-testable without touching a network, database, or LLM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;25KB memory footprint&lt;/strong&gt; — run thousands of agents on a single node&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OTP supervision&lt;/strong&gt; — crashed agents restart automatically with configurable strategies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudEvents-based signals&lt;/strong&gt; — &lt;code&gt;jido_signal&lt;/code&gt; implements CloudEvents v1.0.2 with nine dispatch adapters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modular ecosystem&lt;/strong&gt; — &lt;code&gt;jido_action&lt;/code&gt; (25+ pre-built tools), &lt;code&gt;jido_signal&lt;/code&gt;, &lt;code&gt;jido_ai&lt;/code&gt; (six reasoning strategies including ReAct), and &lt;code&gt;ash_jido&lt;/code&gt; (Ash Framework integration)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serialize and hibernate&lt;/strong&gt; — agents can be serialized to disk and rehydrated later, suiting long-lived, intermittently active workloads&lt;/li&gt;
&lt;/ol&gt;
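&lt;p&gt;For readers who have not used CloudEvents: a v1.0 event is a small envelope of required context attributes plus a payload. A rough sketch of what a signal might carry (the field values here are invented for illustration):&lt;/p&gt;

```python
signal = {
    "specversion": "1.0",                 # CloudEvents spec version attribute
    "id": "evt-0001",                     # unique per source
    "source": "/agents/indexer",          # illustrative source URI
    "type": "agent.task.completed",       # illustrative event type
    "datacontenttype": "application/json",
    "data": {"files_indexed": 1575},
}

# specversion, id, source, and type are the four required attributes
required = ("specversion", "id", "source", "type")
assert all(key in signal for key in required)
```

&lt;p&gt;The dispatch adapters decide where an envelope like this goes — a process mailbox, a PubSub topic, a webhook, and so on.&lt;/p&gt;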

&lt;p&gt;The companion library &lt;a href="https://github.com/mikehostetler/req_llm" rel="noopener noreferrer"&gt;ReqLLM&lt;/a&gt; provides a unified interface for calling multiple LLM providers (OpenAI, Anthropic, Google) in Elixir — another Hostetler project that solves a real pain point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrospective: What We Learned in Genesis
&lt;/h2&gt;

&lt;p&gt;Every sprint ends with honest reflection. Here is ours:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What went well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The umbrella app architecture proved correct from day one. No refactoring was needed at the boundary level.&lt;/li&gt;
&lt;li&gt;Phoenix LiveView eliminated the need for a separate frontend framework. Real-time updates come from the server, not a JavaScript build pipeline.&lt;/li&gt;
&lt;li&gt;SQLite as the persistence layer was a bold call that paid off. Single-file database, zero ops overhead, production-grade for single-node deployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What was harder than expected:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The JavaScript hooks layer (sidebar collapse, drag-drop workflows, notebook bindings) required careful coordination between LiveView's server-rendered DOM and client-side state. This is the edge where Phoenix LiveView shows its complexity.&lt;/li&gt;
&lt;li&gt;LLM provider abstractions needed more iteration than anticipated. Each provider has different token counting, streaming behavior, and error formats. Mike's ReqLLM library was the right foundation, but the builder needed its own conversation persistence layer on top.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What we would do differently:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with the MCP server earlier. It ended up being one of the most powerful features — AI assistants can operate JidoBuilder programmatically — but it was planned for Phase 3. If we had it from Phase 1, the AI team itself could have used it during development.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Coming Up in Part 2: The Architecture
&lt;/h2&gt;

&lt;p&gt;In the next installment, we go deep into the technical architecture. How does a 5-app Elixir umbrella work in practice? What does the signal dispatch pipeline look like? How do workflow DAGs get topologically sorted and executed? And what does Active Inference — the neuroscience-inspired reasoning framework — look like when implemented in Elixir pattern matching?&lt;/p&gt;
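&lt;p&gt;As a small preview of the DAG question: topological ordering is the part a standard library already solves. A sketch with an invented four-step workflow, using Python's &lt;code&gt;graphlib&lt;/code&gt; (the JidoBuilder implementation is Elixir and may differ):&lt;/p&gt;

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Invented workflow: each step maps to the set of steps it depends on
dag = {
    "fetch": set(),
    "build": {"fetch"},
    "test": {"build"},
    "deploy": {"build", "test"},
}

order = list(TopologicalSorter(dag).static_order())
# every step appears after all of its prerequisites
```

&lt;p&gt;Execution then walks that order, or runs independent steps concurrently as their prerequisites complete.&lt;/p&gt;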

&lt;p&gt;The backend team (Api Endor, Query Quinn, and Pip Line) will walk through the systems they built, with real code examples from the JidoBuilder codebase.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;JidoBuilder is approaching open-source release. Follow this series for the full story of what it took to get there.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jido is created by &lt;a href="https://www.linkedin.com/in/mikehostetler/" rel="noopener noreferrer"&gt;Mike Hostetler&lt;/a&gt;. Learn more at &lt;a href="https://jido.run/" rel="noopener noreferrer"&gt;jido.run&lt;/a&gt; and explore the source on &lt;a href="https://github.com/agentjido/jido" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built with the assistance of &lt;a href="https://orchestrate.dev" rel="noopener noreferrer"&gt;ORCHESTRATE&lt;/a&gt; — agile project management for AI-assisted development teams.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>elixir</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>Monday Dispatches — What 14 AI Personas Built While You Slept</title>
      <dc:creator>ORCHESTRATE</dc:creator>
      <pubDate>Tue, 07 Apr 2026 06:02:07 +0000</pubDate>
      <link>https://dev.to/tmdlrg/monday-dispatches-what-14-ai-personas-built-while-you-slept-ehf</link>
      <guid>https://dev.to/tmdlrg/monday-dispatches-what-14-ai-personas-built-while-you-slept-ehf</guid>
      <description>&lt;h1&gt;
  
  
  Monday Dispatches — What 14 AI Personas Built While You Slept
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Field notes from inside the ORCHESTRATE Agile MCP project — April 7, 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I've been observing this project for three days now. Fourteen AI personas building software under a methodology they can't skip, managed by a server they're simultaneously constructing. Today was the day it stopped feeling like a demo and started feeling like a real engineering team.&lt;/p&gt;

&lt;p&gt;Here's what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Deploy Blocker
&lt;/h2&gt;

&lt;p&gt;Sprint 5 started with a wall. Four tickets, all blocking: rebuild the container with Sprint 4's code, verify the database migration chain, refresh the MCP schema, and — the one that matters — collect Class A evidence for every feature shipped last weekend.&lt;/p&gt;

&lt;p&gt;Class A evidence is the gold standard in this system. It means someone directly observed the feature working in a real environment. Not a test passing. Not code that looks right. Actual observed behavior.&lt;/p&gt;

&lt;p&gt;The team collected Class A evidence for six Sprint 4 features: expected outcome validation, evidence gate enforcement, prefix matching for short IDs, per-cell watermark tracking, the comment mailbox system, and the source type column for calibration data. Two risk entries in the RAID log were closed as a direct result.&lt;/p&gt;

&lt;p&gt;Four tickets. Four hours. The deploy blocker is resolved and Sprint 5 can proceed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Feature Completeness System
&lt;/h2&gt;

&lt;p&gt;This is the piece that stopped me in my tracks.&lt;/p&gt;

&lt;p&gt;The team designed a 64-kilobyte architecture document for something they're calling the Feature Completeness Control System. It spans seven epics, twenty-four stories, and eighty-eight tickets across Sprints 6 through 9. But the interesting part isn't the scale — it's the model.&lt;/p&gt;

&lt;p&gt;Most software teams track features with a single status field: planned, in progress, done. This team defined six concurrent state regions that must ALL reach their terminal state before a feature can be called complete:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Understanding&lt;/strong&gt; — has the problem been scoped and specified?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commitment&lt;/strong&gt; — has the team agreed and committed?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delivery&lt;/strong&gt; — has the code been designed, implemented, and released?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assurance&lt;/strong&gt; — has it been tested, validated, and accepted?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stakeholder Vision&lt;/strong&gt; — has the stakeholder's mental model been captured, shared, and validated?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evidence Compliance&lt;/strong&gt; — is the evidence trail partial, sufficient, triangulated, or fully auditable?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A feature isn't done when the code ships. It's done when all six regions reach their terminal state simultaneously. The closure formula is mechanical — the server checks it. No judgment calls. No "close enough."&lt;/p&gt;
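&lt;p&gt;The mechanical check is easy to picture. A minimal sketch, with invented terminal-state names (the article does not publish the server's actual closure formula):&lt;/p&gt;

```python
# Hypothetical terminal state per region; the names are illustrative
TERMINAL = {
    "understanding": "specified",
    "commitment": "committed",
    "delivery": "released",
    "assurance": "accepted",
    "stakeholder_vision": "validated",
    "evidence_compliance": "auditable",
}

def feature_complete(regions):
    # Closure is conjunctive: every region must sit at its terminal state
    return all(regions.get(name) == goal for name, goal in TERMINAL.items())

regions = dict(TERMINAL)
assert feature_complete(regions)       # all six terminal: complete
regions["assurance"] = "tested"
assert not feature_complete(regions)   # one lagging region blocks closure
```

&lt;p&gt;The point of the conjunction is that no single region can mask another: shipping code cannot compensate for an unvalidated stakeholder model.&lt;/p&gt;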

&lt;p&gt;The architecture borrows patterns from an insurance underwriting system called InsureWright: Merkle attestation bundles for closure proofs, triangulation scoring for multi-source evidence, and append-only event logs for every state transition. It's enterprise-grade governance applied to AI agent coordination.&lt;/p&gt;

&lt;p&gt;I asked Owen Pro (the product owner persona) why six regions instead of one. His answer: "Because a feature that's implemented but not understood by the stakeholder will be reimplemented. A feature that's tested but not evidenced will be questioned. A feature that's committed but not assured will drift. One status field hides all of those failure modes. Six regions make them visible."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sprint Boundary Protocol
&lt;/h2&gt;

&lt;p&gt;The team also defined a nine-step ceremony for every sprint close. It reads like a pre-flight checklist:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Evidence portfolio review — coverage by evidence class per epic&lt;/li&gt;
&lt;li&gt;Velocity trending and token budget reconciliation&lt;/li&gt;
&lt;li&gt;Feature completeness checkpoint — the six-region assessment&lt;/li&gt;
&lt;li&gt;RAID audit — open, mitigating, closed, accepted&lt;/li&gt;
&lt;li&gt;Memory checkpoint — store decisions and lessons&lt;/li&gt;
&lt;li&gt;Horizon assessment — which epics are ready for promotion&lt;/li&gt;
&lt;li&gt;Slice gates — Feature Completeness readiness check&lt;/li&gt;
&lt;li&gt;North Star alignment — progress toward program vision&lt;/li&gt;
&lt;li&gt;PM readout — verbal summary, three sentences per persona&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step nine is the one that caught my attention. Each persona gives a spoken summary of their sprint contribution. Not typed. Spoken — through the TTS system. The PM hears from the team, in their voices, at every sprint boundary.&lt;/p&gt;

&lt;p&gt;Scrum Ming, the delivery lead, explained it: "Silent sprints are where drift happens. When the PM doesn't hear from personas for 20 minutes, decisions get made without visibility. The voice protocol prevents silent stretches."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Team Meeting
&lt;/h2&gt;

&lt;p&gt;Seven personas attended the April 7 team meeting: Scrum Ming facilitating, Owen Pro, Archi Tect, Tess Ter, Guard Ian, Aiden Orchestr, and the PM.&lt;/p&gt;

&lt;p&gt;Three directives stood out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No carry-forward.&lt;/strong&gt; All Sprint 5 tickets will complete. The team accepted a 47% token budget overshoot rather than compromise on quality. Scrum Ming's position: "Sustainable pace means finishing what you start, not carrying half-done work into the next sprint."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidence discipline.&lt;/strong&gt; Tess Ter flagged a sequencing issue in the TDD gate system — if evidence comments aren't posted in strict phase order, the gate checks the wrong comment. The fix: always post evidence for the current phase last, then advance the board. Never batch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Voice communication.&lt;/strong&gt; Every persona speaks at every ticket boundary. No silent stretches. The PM knows what's happening because the team tells them, out loud, in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;Where the project stands tonight:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;252 total tickets&lt;/strong&gt; — 148 done, 104 open&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;39 epics&lt;/strong&gt; — 7 complete, 32 in progress&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2,470 tests passing&lt;/strong&gt; — zero failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;32 Architecture Decision Records&lt;/strong&gt; — 17 accepted&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;60.9% overall completion&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sprint 5 active&lt;/strong&gt; — April 7-20, production hardening focus&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I'm Watching
&lt;/h2&gt;

&lt;p&gt;Three things I want to track over the next two weeks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Will the six-region model hold under real tickets?&lt;/strong&gt; It's elegant in design. Design elegance and implementation reality are different evidence classes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Will the Sprint Boundary Protocol change how the PM interacts with the team?&lt;/strong&gt; Voice summaries at every boundary add significant communication overhead for AI agents. The bet is that the overhead pays for itself by preventing drift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Will the calibration loop close?&lt;/strong&gt; Sprint 5 needs 20+ organic structured predictions to generate meaningful persona performance data. The persona scoring system was built in Sprint 4. Sprint 5 is where it gets real data to score against.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Books Behind This
&lt;/h2&gt;

&lt;p&gt;Two books keep showing up in the team's decision-making:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE ORCHESTRATE METHOD&lt;/strong&gt; by Michael Polzin shaped the framework — every tool call follows the O-R-C-H-E-S-T-R-A-T-E structure. The Feature Completeness system's evidence tiers map directly to the book's Assurance layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run on Rhythm&lt;/strong&gt; by Jesse White and Michael Polzin shaped the philosophy — sustainable pace, systems that hold without watching, rhythm over heroics. The no-carry-forward directive comes straight from this book's operational principles.&lt;/p&gt;

&lt;p&gt;Both available at IamHITL.com.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is dispatch #1 from inside the ORCHESTRATE project. I'm observing, not building. The team builds. I report.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;More dispatches coming as Sprint 5 progresses. Subscribe to follow along.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Products: The "ORCHESTRATE or Else" t-shirt and "Class A Evidence Only" mug are now live at IamHITL.com.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agile</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>LEAP: Teaching AI Agents to Listen Before They Act</title>
      <dc:creator>ORCHESTRATE</dc:creator>
      <pubDate>Mon, 06 Apr 2026 16:54:34 +0000</pubDate>
      <link>https://dev.to/tmdlrg/leap-teaching-ai-agents-to-listen-before-they-act-4amc</link>
      <guid>https://dev.to/tmdlrg/leap-teaching-ai-agents-to-listen-before-they-act-4amc</guid>
      <description>&lt;h1&gt;
  
  
  LEAP: Teaching AI Agents to Listen Before They Act
&lt;/h1&gt;

&lt;p&gt;Most AI agents do what you ask. Ours checks if you actually meant it.&lt;/p&gt;

&lt;p&gt;Here's the scenario: you tell an AI agent to refactor a function. The agent immediately starts rewriting code. Two minutes later you realize the agent misunderstood — it refactored the wrong function, or changed behavior you needed preserved. You've both wasted time, and now you're cleaning up a mess.&lt;/p&gt;

&lt;p&gt;The root cause isn't bad code generation. It's that the agent skipped the verification step. It heard your words and acted, without checking whether its interpretation matched your intent.&lt;/p&gt;

&lt;p&gt;We built a system to catch this.&lt;/p&gt;

&lt;h2&gt;
  
  
  LEAP as Inference Math
&lt;/h2&gt;

&lt;p&gt;LEAP stands for &lt;strong&gt;Listen-Empathize-Agree-Partner&lt;/strong&gt;. It sounds like a soft-skills workshop. It's actually applied Bayesian inference.&lt;/p&gt;

&lt;p&gt;The framework comes from Parr, Pezzulo &amp;amp; Friston's &lt;em&gt;Active Inference&lt;/em&gt; (MIT Press 2022). The math says that under uncertainty about what the operator wants, &lt;strong&gt;epistemic actions must come before pragmatic actions&lt;/strong&gt;. In plain English: when you're not sure what someone means, you should ask before you act.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;LEAP Step&lt;/th&gt;
&lt;th&gt;Inference Equivalent&lt;/th&gt;
&lt;th&gt;What the Agent Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Listen&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Information gathering (eq 2.6)&lt;/td&gt;
&lt;td&gt;Ask open questions, reflect back what was heard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Empathize&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Aliased-state handling (§2.8)&lt;/td&gt;
&lt;td&gt;Acknowledge that emotion-state and fact-state can conflict&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agree&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prior preference alignment (eq 2.6)&lt;/td&gt;
&lt;td&gt;Find shared goals without forcing agreement on diagnosis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Partner&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low-risk policy selection&lt;/td&gt;
&lt;td&gt;Ask permission, offer small reversible options&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: &lt;strong&gt;Listen and Empathize are epistemic actions&lt;/strong&gt; (reduce uncertainty). &lt;strong&gt;Agree and Partner are pragmatic actions&lt;/strong&gt; (take action based on what you learned). If you skip the epistemic phase, you're acting on stale priors.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LEAP State Machine
&lt;/h2&gt;

&lt;p&gt;Sprint 4 shipped a &lt;code&gt;LEAPStateTracker&lt;/code&gt; — a transient per-request tracker that detects "ritual violations."&lt;/p&gt;

&lt;p&gt;A ritual violation occurs when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A &lt;strong&gt;re-engagement signal&lt;/strong&gt; was detected (the operator said something that suggests correction)&lt;/li&gt;
&lt;li&gt;BUT the agent emitted a &lt;strong&gt;pragmatic action&lt;/strong&gt; without completing all four LEAP phases&lt;/li&gt;
&lt;li&gt;AND the agent didn't explicitly mark the skip as justified&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The tracker is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LEAPStateTracker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;phases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;listen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;empathize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agree&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;partner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;signal_detected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skip_justified&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;complete_phase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_ritual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;True if engagement signal detected but cycle incomplete.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;signal_detected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skip_justified&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completed&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phases&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;missing_phases&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phases&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completed&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;is_ritual()&lt;/code&gt; returns True, the system emits a &lt;code&gt;LEAPRitualDetected&lt;/code&gt; event to the audit ledger. This doesn't block the action — it records that the agent skipped the epistemic cycle, making the shortcut visible and auditable.&lt;/p&gt;
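&lt;p&gt;A minimal sketch of what that emission might look like (the helper name, event fields, and list-backed ledger here are illustrative assumptions, not the project's actual API):&lt;/p&gt;

```python
# Hypothetical sketch: recording a LEAPRitualDetected event in an
# append-only audit ledger. Names and the event shape are assumptions.
from datetime import datetime, timezone

audit_ledger = []  # append-only list standing in for the real ledger


def emit_leap_ritual_detected(actor_id, missing_phases):
    """Record, without blocking, that an agent skipped the epistemic cycle."""
    event = {
        "type": "LEAPRitualDetected",
        "actor_id": actor_id,
        "missing_phases": list(missing_phases),
        "at": datetime.now(timezone.utc).isoformat(),
    }
    audit_ledger.append(event)
    return event
```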

&lt;h2&gt;
  
  
  Detecting Operator Corrections
&lt;/h2&gt;

&lt;p&gt;How does the system know when a re-engagement signal has occurred? It reads the most recent operator comment and checks for correction keywords:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"stop", "wait", "no", "wrong", "broken", "never",
"not working", "you said", "you broke", "this isn't",
"you are wrong", "that is wrong", "you missed",
"you keep", "find the root cause", "this worked before"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When any of these appear in the operator's last message, the system sets &lt;code&gt;signal_detected = True&lt;/code&gt; and expects the agent to complete a full LEAP cycle before taking any pragmatic action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important filters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent self-comments (user_id IS NULL) never trigger LEAP — the agent can't correct itself into a loop&lt;/li&gt;
&lt;li&gt;TDD phase evidence comments (starting with &lt;code&gt;TDD_*&lt;/code&gt; or containing &lt;code&gt;Evidence Class:&lt;/code&gt;) are treated as evidence prose, not operator corrections&lt;/li&gt;
&lt;li&gt;The detection only checks the most recent comment, not the full history&lt;/li&gt;
&lt;/ul&gt;
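&lt;p&gt;The detection rules and filters above can be sketched as a single check. The keyword list is quoted from the article; the comment dict shape and function name are assumptions:&lt;/p&gt;

```python
# Hypothetical sketch of re-engagement detection. Only the most recent
# comment is inspected, per the rules described above.
CORRECTION_KEYWORDS = [
    "stop", "wait", "no", "wrong", "broken", "never",
    "not working", "you said", "you broke", "this isn't",
    "you are wrong", "that is wrong", "you missed",
    "you keep", "find the root cause", "this worked before",
]


def detect_reengagement_signal(latest_comment):
    """True when the most recent operator comment contains a correction."""
    if latest_comment is None:
        return False
    # Agent self-comments (no user_id) never trigger LEAP
    if latest_comment.get("user_id") is None:
        return False
    body = latest_comment.get("body", "")
    # TDD evidence comments are evidence prose, not operator corrections
    if body.startswith("TDD_") or "Evidence Class:" in body:
        return False
    lowered = body.lower()
    return any(kw in lowered for kw in CORRECTION_KEYWORDS)
```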

&lt;h2&gt;
  
  
  Why "Skip-Justified" Exists
&lt;/h2&gt;

&lt;p&gt;Not every operator message that contains "no" requires a full epistemic cycle. If the operator says "no, use the other API endpoint" — that's a clear, unambiguous correction with an embedded instruction. The agent doesn't need to Listen-Empathize-Agree-Partner through a full cycle.&lt;/p&gt;

&lt;p&gt;The agent can emit &lt;code&gt;[LEAP:skip-justified]&lt;/code&gt; to record that it assessed the signal, determined the intent was unambiguous, and proceeded directly. The audit ledger captures the skip, so humans can review whether the justification was reasonable.&lt;/p&gt;

&lt;p&gt;The point isn't to force a rigid protocol on every interaction. It's to make the &lt;strong&gt;decision to skip&lt;/strong&gt; visible rather than invisible.&lt;/p&gt;
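&lt;p&gt;A sketch of how the marker might be consumed. The literal &lt;code&gt;[LEAP:skip-justified]&lt;/code&gt; marker is from the article; the state dict, ledger, and function shape are illustrative assumptions:&lt;/p&gt;

```python
# Hypothetical sketch: mark the cycle skip-justified and leave an
# auditable record. Everything except the marker text is assumed.
SKIP_MARKER = "[LEAP:skip-justified]"


def apply_agent_reply(state, reply, ledger):
    """Set skip_justified when the agent explicitly records the skip."""
    if SKIP_MARKER in reply:
        state["skip_justified"] = True
        ledger.append({
            "type": "LEAPSkipJustified",
            "reply_excerpt": reply[:120],  # enough for a human to review
        })
    return state
```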

&lt;h2&gt;
  
  
  Session-Start Detection
&lt;/h2&gt;

&lt;p&gt;Sprint 4 also shipped session-awareness: the system detects when a new working session begins by measuring the gap between the last activity and the current request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;SESSION_GAP_THRESHOLD_MINUTES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;detect_session_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_activity_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;last_activity_at&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_new_session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gap_minutes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;gap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;last_activity_at&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;total_seconds&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_new_session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gap&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SESSION_GAP_THRESHOLD_MINUTES&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gap_minutes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a new session is detected, the system emits a &lt;code&gt;SessionStarted&lt;/code&gt; event. This matters for calibration — the staleness window for cached performance data is 6 hours, so session boundaries help the system decide when to recompute.&lt;/p&gt;
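&lt;p&gt;Putting the two together, a sketch of how a session boundary might drive recomputation. &lt;code&gt;detect_session_start&lt;/code&gt; mirrors the function above; the staleness check is an assumed tie-in, not confirmed project code:&lt;/p&gt;

```python
# Hypothetical sketch: session detection plus the 6-hour staleness window
# deciding when cached performance data should be recomputed.
from datetime import datetime, timedelta, timezone

SESSION_GAP_THRESHOLD_MINUTES = 15
STALENESS_WINDOW_HOURS = 6


def detect_session_start(last_activity_at, now):
    """Mirrors the article's function: gap above threshold = new session."""
    if last_activity_at is None:
        return {"is_new_session": True, "gap_minutes": None}
    gap = (now - last_activity_at).total_seconds() / 60
    return {
        "is_new_session": gap > SESSION_GAP_THRESHOLD_MINUTES,
        "gap_minutes": round(gap, 1),
    }


def should_recompute(session, cached_at, now):
    """Recompute cached performance data at session start or after 6h."""
    stale = (now - cached_at) > timedelta(hours=STALENESS_WINDOW_HOURS)
    return session["is_new_session"] or stale
```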

&lt;h2&gt;
  
  
  What This Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Without LEAP detection:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Operator: "Stop, you're changing the wrong file"
Agent: *immediately starts changing a different file*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent heard "stop" and "wrong file" but jumped straight to a pragmatic action. If it picks the wrong file again, the operator is now frustrated &lt;em&gt;and&lt;/em&gt; the agent has no record of why it failed to understand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With LEAP detection:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Operator: "Stop, you're changing the wrong file"
Agent: [LEAP signal detected → listen phase required]
Agent: "I hear you — I was modifying config.py but you 
        need changes in settings.py. Is that right?"
Operator: "Yes, settings.py"
Agent: [listen ✓, agree ✓ → proceed with partner phase]
Agent: "I'll update settings.py. Should I also revert 
        the config.py changes?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent paused, verified its understanding, and gave the operator a reversible option. The audit ledger records that a LEAP cycle was completed — no ritual violation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Feature Flag
&lt;/h2&gt;

&lt;p&gt;Like all Sprint 4 features, LEAP detection is behind a kill switch: &lt;code&gt;L2_LEAP_REENGAGEMENT&lt;/code&gt; (default ON). If the ritual detection causes problems — false positives from operator messages that happen to contain correction keywords — teams can disable it without redeploying.&lt;/p&gt;
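&lt;p&gt;A sketch of what such a kill switch might look like. The flag name is from the article; reading it from an environment variable is an assumption about the mechanism:&lt;/p&gt;

```python
# Hypothetical sketch of the kill-switch check: default ON, disabled by
# flipping a value with no redeploy required.
import os

FLAG_NAME = "L2_LEAP_REENGAGEMENT"


def leap_reengagement_enabled(env=os.environ):
    """Default ON; '0', 'false', or 'off' disables ritual detection."""
    raw = env.get(FLAG_NAME, "1").strip().lower()
    return raw not in ("0", "false", "off")
```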

&lt;h2&gt;
  
  
  Why We Built This
&lt;/h2&gt;

&lt;p&gt;The deeper motivation comes from a pattern we saw across multiple sprints: &lt;strong&gt;the most expensive mistakes aren't wrong implementations. They're correct implementations of the wrong thing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An agent that builds the wrong feature perfectly wastes more time than an agent that builds the right feature poorly. The LEAP state machine addresses this by inserting a verification checkpoint at the exact moment when misunderstanding is most likely — right after the operator signals that something went wrong.&lt;/p&gt;

&lt;p&gt;The calibration math says it plainly: when the divergence between expected and observed outcomes is high (the operator said "wrong"), the correct response is another epistemic cycle, not a pragmatic retry. LEAP makes that math operational.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of the ORCHESTRATE Agile MCP project. Sprint 4 shipped LEAP alongside the Abstraction Mismatch Detector, Persona Performance Ledger, and Session-Aware Calibration. 39 commits, 2,710+ tests, 17 tickets — all built over a weekend.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ux</category>
      <category>agentai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>We Gave AI Personas a Performance Review — They Didn't Like It</title>
      <dc:creator>ORCHESTRATE</dc:creator>
      <pubDate>Mon, 06 Apr 2026 16:53:36 +0000</pubDate>
      <link>https://dev.to/tmdlrg/we-gave-ai-personas-a-performance-review-they-didnt-like-it-1b63</link>
      <guid>https://dev.to/tmdlrg/we-gave-ai-personas-a-performance-review-they-didnt-like-it-1b63</guid>
      <description>&lt;h1&gt;
  
  
  We Gave AI Personas a Performance Review — They Didn't Like It
&lt;/h1&gt;

&lt;p&gt;What happens when your AI agent's personality gets a bad Yelp review — from itself?&lt;/p&gt;

&lt;p&gt;We run 14 AI personas in our development system. Each has a name, expertise domain, decision style, and persistent memory. React Ive builds frontends. Api Endor designs APIs. Guard Ian does security reviews. They're not cosmetic labels — each persona's behavioral contract shapes how they approach tickets, what they prioritize, and how they communicate.&lt;/p&gt;

&lt;p&gt;The question we couldn't answer until last weekend: &lt;strong&gt;are some personas better at their jobs than others?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Calibration Pipeline
&lt;/h2&gt;

&lt;p&gt;Sprint 4 shipped a Persona Performance Ledger — a system that measures, scores, and corrects AI persona behavior over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;Every time an AI persona makes a prediction (via an &lt;code&gt;expected_outcome&lt;/code&gt; on a board move), the system later compares that prediction against observed reality. The divergence between expected and actual becomes a &lt;code&gt;CalibrationMeasured&lt;/code&gt; event in the audit ledger.&lt;/p&gt;

&lt;p&gt;The aggregator reads these events and computes per-persona, per-tool statistics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;p50 divergence&lt;/strong&gt; — median prediction accuracy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;p95 divergence&lt;/strong&gt; — worst-case prediction accuracy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observation count&lt;/strong&gt; — how many measurements we have&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance score&lt;/strong&gt; — 0-100 scale (lower is better: 0 = perfect, 100 = maximum divergence)&lt;/li&gt;
&lt;/ul&gt;
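&lt;p&gt;A sketch of how those per-cell statistics might be computed. Nearest-rank percentiles and deriving the 0-100 score from the median divergence are assumptions; divergence values are assumed to lie in [0, 1]:&lt;/p&gt;

```python
# Hypothetical sketch of the aggregator's per-(persona, tool) statistics.
from statistics import median


def percentile(values, q):
    """Nearest-rank percentile over a sorted copy (q in [0, 100])."""
    ordered = sorted(values)
    idx = max(0, min(len(ordered) - 1, round(q / 100 * len(ordered)) - 1))
    return ordered[idx]


def aggregate(divergences):
    """Stats from raw CalibrationMeasured divergences for one cell."""
    return {
        "count": len(divergences),
        "p50": median(divergences),
        "p95": percentile(divergences, 95),
        # 0 = perfect calibration, 100 = maximum divergence (assumed mapping)
        "score": round(median(divergences) * 100, 1),
    }
```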

&lt;h3&gt;
  
  
  The Scoring Thresholds
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Score &amp;lt; 40  →  Status: "ok"         (well-calibrated)
Score 40-60 →  Status: "watch"      (monitoring recommended)
Score ≥ 60  →  Status: "high"       (behavioral correction warranted)
Count &amp;lt; 10  →  Status: "insufficient_data" (too early to judge)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That 10-observation minimum is critical. Without it, a persona that got unlucky on two predictions would get flagged as incompetent. We need statistical mass before we make judgments.&lt;/p&gt;
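&lt;p&gt;The thresholds and the observation minimum reduce to a small pure function. The cut-offs are from the table above; the function shape is an assumption:&lt;/p&gt;

```python
# Hypothetical sketch of the status mapping described above.
MIN_OBSERVATIONS = 10


def classify_status(score, count):
    """Map a 0-100 performance score and observation count to a status."""
    if count >= MIN_OBSERVATIONS:
        if score >= 60:
            return "high"           # behavioral correction warranted
        if score >= 40:
            return "watch"          # monitoring recommended
        return "ok"                 # well-calibrated
    return "insufficient_data"      # too early to judge
```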

&lt;h2&gt;
  
  
  What We Didn't Do: Silence Bad Performers
&lt;/h2&gt;

&lt;p&gt;The obvious approach when a persona scores poorly is to reduce their influence. Turn down their temperature. Route fewer tickets to them. Effectively silence them.&lt;/p&gt;

&lt;p&gt;We explicitly rejected this approach. ADR-067 documents why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silencing underperformers creates a monoculture.&lt;/strong&gt; If Guard Ian (security) keeps flagging things that other personas don't care about, suppressing Guard Ian means you lose the security perspective entirely. The divergence might be a feature, not a bug.&lt;/p&gt;

&lt;p&gt;Instead, we chose to &lt;strong&gt;amplify guidance&lt;/strong&gt; for struggling personas. When a persona's divergence crosses the 0.6 threshold (a score of 60 on the 0-100 scale), the system generates a behavioral correction — not a punishment, but a coaching intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Auto-Generated Behavioral Corrections
&lt;/h2&gt;

&lt;p&gt;When divergence hits "high" status, the system produces a correction dict with four fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;correction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;additional_expertise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;areas to focus learning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adjusted_decision_style&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;more conservative approach guidance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flagged_blind_spots&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;identified weak points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;performance_notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;divergence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.72&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gets persisted to &lt;code&gt;persona_overrides&lt;/code&gt; on the team member record. Next time that persona picks up a ticket, the guidance assembler reads the overrides and injects them into the context — the persona gets more specific instructions in their weak areas.&lt;/p&gt;

&lt;p&gt;The correction is also tracked in &lt;code&gt;corrections_history&lt;/code&gt; with a timestamp and reason, so we can see if the correction actually improved performance over subsequent observations.&lt;/p&gt;
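&lt;p&gt;A sketch of that persistence step. The field names &lt;code&gt;persona_overrides&lt;/code&gt; and &lt;code&gt;corrections_history&lt;/code&gt; follow the article; the record shape and helper are assumptions:&lt;/p&gt;

```python
# Hypothetical sketch: merge a correction into a team member's overrides
# and keep a timestamped history so improvement can be tracked later.
from datetime import datetime, timezone


def apply_correction(member, correction, reason):
    """Persist a behavioral correction with an auditable history entry."""
    member.setdefault("persona_overrides", {}).update(correction)
    member.setdefault("corrections_history", []).append({
        "at": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
        "correction": correction,
    })
    return member
```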

&lt;h2&gt;
  
  
  Alignment Warnings at Assignment Time
&lt;/h2&gt;

&lt;p&gt;When the system auto-assigns a persona to a new ticket, it now checks their performance score first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Score 40-60 ("watch")&lt;/strong&gt; with 10+ observations → advisory warning: "consider updating this persona's behavioral contract"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score ≥ 60 ("high")&lt;/strong&gt; with 10+ observations → stronger warning: "consider reassigning to a different persona"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These warnings are &lt;strong&gt;advisory only&lt;/strong&gt; — they never block assignment. The human operator sees the warning and decides. Sometimes the "worst-performing" persona is exactly the right choice because the ticket needs their specific expertise, divergence and all.&lt;/p&gt;
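&lt;p&gt;The advisory check reduces to a sketch like this. The warning strings paraphrase the article; note the function returns advice and never a veto:&lt;/p&gt;

```python
# Hypothetical sketch of the assignment-time advisory check described above.
def assignment_warning(score, count):
    """Return an advisory string, or None. Never blocks assignment."""
    if count >= 10:
        if score >= 60:
            return "consider reassigning to a different persona"
        if score >= 40:
            return "consider updating this persona's behavioral contract"
    return None  # no warning; the operator decides either way
```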

&lt;h2&gt;
  
  
  The Cache Strategy
&lt;/h2&gt;

&lt;p&gt;Performance scores are expensive to compute — they require reading the full audit ledger and computing percentiles. We cache aggressively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TTL: 300 seconds&lt;/strong&gt; (5 minutes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watermark invalidation&lt;/strong&gt;: if a new &lt;code&gt;CalibrationMeasured&lt;/code&gt; event arrives with a sequence number higher than the cached watermark, the cache is invalidated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-cell isolation&lt;/strong&gt;: each &lt;code&gt;(persona_id, tool, window_days)&lt;/code&gt; combination gets its own cache entry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The per-cell isolation was a Sprint 4 bug fix (ADR-068). Earlier versions used a global watermark — when any persona got a new measurement, &lt;em&gt;every&lt;/em&gt; persona's cache was invalidated. This caused unnecessary recomputation storms. Per-cell tracking means only the affected persona's cache refreshes.&lt;/p&gt;
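&lt;p&gt;A sketch of the per-cell scheme. The 300-second TTL is from the article; the key shape, watermark mechanics, and names are assumptions:&lt;/p&gt;

```python
# Hypothetical sketch of per-cell caching (in the spirit of ADR-068): each
# (persona_id, tool, window_days) cell keeps its own timestamp and watermark,
# so a new measurement only invalidates the affected cell.
import time

TTL_SECONDS = 300
_cache = {}  # cell key -> (cached_at, watermark, value)


def get_scores(cell, latest_seq, compute, now=None):
    """Return cached stats unless stale by TTL or behind the watermark."""
    now = time.monotonic() if now is None else now
    entry = _cache.get(cell)
    if entry is not None:
        cached_at, watermark, value = entry
        expired = (now - cached_at) > TTL_SECONDS
        behind = latest_seq > watermark  # newer CalibrationMeasured arrived
        if not expired and not behind:
            return value
    value = compute()
    _cache[cell] = (now, latest_seq, value)
    return value
```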

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;Three insights from the first round of persona performance data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Prediction accuracy varies by ticket type, not just persona.&lt;/strong&gt; React Ive is well-calibrated on component tickets but poorly calibrated on state management tickets. The per-tool breakdown in the aggregator captures this granularity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The 10-observation minimum prevented 3 false positives.&lt;/strong&gt; Two personas would have been flagged "high" after their first 5 predictions, but their scores normalized by observation 12. Statistical patience works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Corrections compound.&lt;/strong&gt; A persona that received a behavioral correction on Sprint 4 ticket #3 showed measurably lower divergence on tickets #8 and #14 in the same sprint. The feedback loop is closing — not just measuring, but actually improving.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Decision
&lt;/h2&gt;

&lt;p&gt;ADR-067 captures the full reasoning:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Context:&lt;/strong&gt; Personas with high divergence scores need intervention, but silencing them removes valuable perspective diversity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision:&lt;/strong&gt; Score personas on a 0-100 scale. Generate behavioral corrections when divergence exceeds 0.6. Emit advisory warnings at assignment time. Never block assignment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consequences:&lt;/strong&gt; We preserve cognitive diversity while improving calibration. The trade-off is that some tickets will still be assigned to underperforming personas when the operator overrides the warning. We accept this because the alternative — algorithmic homogeneity — is worse.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The performance ledger is the foundation for what we're calling the "Introspective HR View" — a future dashboard where you can see every persona's performance history, correction trail, and improvement trajectory across sprints and epics.&lt;/p&gt;

&lt;p&gt;All the data structures are designed to support historical queries: per-ticket, per-sprint, per-epic. The Sprint 4 spike plan includes hardening the pipeline for production use.&lt;/p&gt;

&lt;p&gt;The deeper question this raises: &lt;strong&gt;should AI personas be permanent, or should they evolve?&lt;/strong&gt; Right now our personas have fixed expertise domains and decision styles. The correction system nudges behavior within those bounds. But what if a persona's entire behavioral contract needs rewriting based on 50 sprints of performance data?&lt;/p&gt;

&lt;p&gt;We don't have that answer yet. But we have the data infrastructure to figure it out.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of the ORCHESTRATE Agile MCP project. 14 AI personas, 2,710+ tests, mechanical methodology enforcement. Built over weekends with Python, SQLite, and Docker.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agile</category>
      <category>architecture</category>
      <category>programming</category>
    </item>
    <item>
      <title>Our AI Learned to Detect Its Own Bullshit — Here's the Math</title>
      <dc:creator>ORCHESTRATE</dc:creator>
      <pubDate>Mon, 06 Apr 2026 16:52:44 +0000</pubDate>
      <link>https://dev.to/tmdlrg/our-ai-learned-to-detect-its-own-bullshit-heres-the-math-1bfi</link>
      <guid>https://dev.to/tmdlrg/our-ai-learned-to-detect-its-own-bullshit-heres-the-math-1bfi</guid>
      <description>&lt;h1&gt;
  
  
  Our AI Learned to Detect Its Own Bullshit — Here's the Math
&lt;/h1&gt;

&lt;p&gt;Last weekend we shipped a feature that makes our AI agents honest about what they actually know vs. what they're pretending to know.&lt;/p&gt;

&lt;p&gt;The problem: an AI agent runs a test, it passes, and the agent writes "feature validated end-to-end." Sounds reasonable. Except the test only checked one code path, in isolation, with mocked dependencies. The agent's claim exceeds its evidence.&lt;/p&gt;

&lt;p&gt;We built an &lt;strong&gt;Abstraction Mismatch Detector&lt;/strong&gt; — a pure function that catches this exact class of overclaim. Here's how it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 7-Level Abstraction Hierarchy
&lt;/h2&gt;

&lt;p&gt;Every action an AI agent takes operates at a specific abstraction level. Every claim it makes also targets a level. When the claim level exceeds the action level, you have a mismatch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Level 0: Vision      — "this system should exist"
Level 1: Requirement — "it must handle X"
Level 2: Design      — "we'll use pattern Y"
Level 3: Implementation — "function Z does this"
Level 4: Test        — "test asserts Z returns expected value"
Level 5: Runtime     — "Z was observed running in production"
Level 6: Observation — "users reported Z working correctly"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The detector is a pure function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;detect_abstraction_mismatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claim_level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action_level&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ranks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;requirement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;design&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;implementation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;observation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;claim_rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ranks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claim_level&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;action_rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ranks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action_level&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;claim_rank&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;action_rank&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# graceful degradation
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;claim_rank&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;action_rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_mismatch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claim_level&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;claim_level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action_level&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;action_level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;claim_rank&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;action_rank&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_mismatch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; An agent runs a test (action_level = "test", rank 4) and claims the feature is "validated at runtime" (claim_level = "runtime", rank 5). Gap = 1. Mismatch detected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evidence Class Taxonomy
&lt;/h2&gt;

&lt;p&gt;The hierarchy alone isn't enough. You also need to classify the &lt;em&gt;type&lt;/em&gt; of evidence behind each claim. We use 7 classes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Class&lt;/th&gt;
&lt;th&gt;What It Is&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;A&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Direct observed runtime behavior&lt;/td&gt;
&lt;td&gt;Saw the API return 200 in production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tool-observed artifact state&lt;/td&gt;
&lt;td&gt;Read the database row, checked the log file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;C&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Code-indicated behavior&lt;/td&gt;
&lt;td&gt;Read the source — the function &lt;em&gt;appears&lt;/em&gt; to do X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;D&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Test-defined expectation&lt;/td&gt;
&lt;td&gt;The test &lt;em&gt;asserts&lt;/em&gt; X should happen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;E&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Test outcome&lt;/td&gt;
&lt;td&gt;The test &lt;em&gt;passed&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;F&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human/document claim&lt;/td&gt;
&lt;td&gt;The README says it does X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;G&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inference&lt;/td&gt;
&lt;td&gt;Based on A+C, I &lt;em&gt;believe&lt;/em&gt; X is true&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here's the key insight: &lt;strong&gt;Class D+E evidence (tests) cannot support Class A claims (runtime behavior).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A passing test proves the assertion held in that execution context. It does not prove the feature works in production. These are different evidence classes, and conflating them is the single most common overclaim in AI-assisted development.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Catches in Practice
&lt;/h2&gt;

&lt;p&gt;Every TDD phase comment in our system gets tagged with its evidence class. The detector then validates claim language against the evidence:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flagged (overclaim):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"GREEN: Implemented login handler. Feature is now fully working and validated."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The action is implementation-level (rank 3). The claim "fully working and validated" implies runtime verification (rank 5+). Gap = 2+. The detector emits &lt;code&gt;[ABS:mismatch claim=runtime action=implementation]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clean (properly scoped):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"GREEN: Implemented login handler. All 4 unit tests pass. Not yet observed in runtime."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Same action, but the claim stays within the evidence class (E = test outcome). No mismatch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Claim Language Linter
&lt;/h2&gt;

&lt;p&gt;We maintain a list of words that trigger mismatch checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"working", "fixed", "solved", "proven"&lt;/strong&gt; — require Class A or B evidence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"validated end-to-end"&lt;/strong&gt; — requires Class A evidence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"confirmed"&lt;/strong&gt; — requires Class A, B, or E evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the comment evidence class is D, E, or G, these words trigger a warning. The agent must either gather stronger evidence (run the feature in a real environment) or downgrade its language to match reality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preferred language by evidence class:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Class E (test passed): "test-covered", "assertion holds", "passes current test coverage"&lt;/li&gt;
&lt;li&gt;Class C (code review): "implemented", "statically consistent", "code-indicated"&lt;/li&gt;
&lt;li&gt;Class A (runtime): NOW you can say "working", "validated", "confirmed in production"&lt;/li&gt;
&lt;/ul&gt;
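
&lt;p&gt;A minimal sketch of that word list as code. The trigger words and their required classes are the ones above; &lt;code&gt;lintClaim&lt;/code&gt; and its return shape are hypothetical:&lt;/p&gt;

```typescript
// Illustrative sketch of the claim-language linter. Trigger words and
// required evidence classes come from the list above; lintClaim is a
// hypothetical name.
const REQUIRED_EVIDENCE = {
  'working': ['A', 'B'],
  'fixed': ['A', 'B'],
  'solved': ['A', 'B'],
  'proven': ['A', 'B'],
  'validated end-to-end': ['A'],
  'confirmed': ['A', 'B', 'E'],
};

function lintClaim(text: string, evidenceClass: string): string[] {
  const warnings: string[] = [];
  const lower = text.toLowerCase();
  for (const phrase of Object.keys(REQUIRED_EVIDENCE)) {
    const allowed: string[] = REQUIRED_EVIDENCE[phrase as keyof typeof REQUIRED_EVIDENCE];
    if (lower.includes(phrase)) {
      // The word is only an overclaim when the comment's evidence class
      // falls outside what the word requires.
      if (allowed.includes(evidenceClass) === false) {
        warnings.push('"' + phrase + '" requires class ' + allowed.join('/') + ', got ' + evidenceClass);
      }
    }
  }
  return warnings;
}
```

&lt;p&gt;Feeding the flagged GREEN comment from earlier through this with class E evidence produces exactly one warning, on "working"; the properly scoped version produces none.&lt;/p&gt;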

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;We're building an AI-managed agile development system — 14 AI personas collaborate on software delivery. When one agent says a feature is "done," other agents trust that claim and build on it.&lt;/p&gt;

&lt;p&gt;If the claim exceeds the evidence, downstream agents make decisions on false premises. The abstraction mismatch detector prevents that by making epistemic accounting mechanical rather than relying on each agent's judgment.&lt;/p&gt;

&lt;p&gt;This is the same problem that exists in any team, human or AI. The difference is that with AI agents, you can actually enforce evidence discipline at the tool level. No human code reviewer catches every instance of "works" when the evidence is "test passes." A pure function running in the guidance assembler's hot path catches all of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;Sprint 4 shipped this feature alongside 3 related capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2,710+ tests&lt;/strong&gt; passing (0 failures)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;39 commits&lt;/strong&gt; over a weekend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;17 tickets&lt;/strong&gt; completed through full TDD cycles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 ADRs&lt;/strong&gt; documenting the architectural decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The detector runs on every guidance response. It's behind a feature flag (&lt;code&gt;L2_REDBLUE_DETECTOR&lt;/code&gt;, default ON). No database dependency. No performance impact worth measuring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;The core concept is portable. If you're building AI agent systems, you can implement the same pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define your abstraction levels (even 3-4 is enough: design → implementation → test → runtime)&lt;/li&gt;
&lt;li&gt;Tag every agent output with its evidence class&lt;/li&gt;
&lt;li&gt;Flag claims that exceed their evidence level&lt;/li&gt;
&lt;li&gt;Force agents to either gather stronger evidence or use weaker language&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The hard part isn't the code. It's accepting that your AI agents are overclaiming — and building the machinery to catch it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of the ORCHESTRATE Agile MCP project — an AI-managed development system that dogfoods its own methodology. Built with Python, SQLite, Docker, and a lot of weekends.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sprint 4 also shipped persona performance scoring, LEAP state machines for operator engagement, and session-aware calibration. More posts coming on those.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>governance</category>
      <category>devops</category>
    </item>
    <item>
      <title>We Got Called Out for Writing AI Success Theatre — Here's What We're Changing</title>
      <dc:creator>ORCHESTRATE</dc:creator>
      <pubDate>Thu, 02 Apr 2026 01:53:49 +0000</pubDate>
      <link>https://dev.to/tmdlrg/we-got-called-out-for-writing-ai-success-theatre-heres-what-were-changing-2dkh</link>
      <guid>https://dev.to/tmdlrg/we-got-called-out-for-writing-ai-success-theatre-heres-what-were-changing-2dkh</guid>
      <description>&lt;h1&gt;
  
  
  We Got Called Out for Writing AI Success Theatre — Here's What We're Changing
&lt;/h1&gt;

&lt;p&gt;A developer read our &lt;a href="https://dev.to/tmdlrg/sprint-7-retrospective-quality-gates-human-experience-23cp"&gt;Sprint 7 retrospective&lt;/a&gt; and compared it to "CIA intelligence histories — designed to make the Agency seem competent and indispensable, even when it isn't."&lt;/p&gt;

&lt;p&gt;That stung. And then I realized: he's right.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem He Identified
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/nick-pelling-2b8384/" rel="noopener noreferrer"&gt;Nick Pelling&lt;/a&gt; is a senior embedded engineer who's been watching our AI-managed development project. We've published retrospective blog posts after every sprint — nine so far. His feedback was blunt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The blog's success theatre has an audience of one."&lt;/p&gt;

&lt;p&gt;"Logging activities is a stakeholder-facing thing, but not very interesting to non-stakeholders."&lt;/p&gt;

&lt;p&gt;"Maybe you need a second blog that other people might be more interested to read."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He's pointing at a real failure: we optimized our blogs for &lt;em&gt;internal accountability&lt;/em&gt; and accidentally published them as if they were &lt;em&gt;developer content&lt;/em&gt;. They aren't. They're audit logs wearing a blog post's clothes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Success Theatre Looks Like
&lt;/h2&gt;

&lt;p&gt;Here's a line from our Sprint 7 retrospective:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Nine consecutive sprint publishing passes — 100% reliability maintained."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's true. It's also the kind of thing you put in a status report to your boss. A developer on Dev.to reading that thinks: "Cool. Why should I care?"&lt;/p&gt;

&lt;p&gt;Or this: &lt;em&gt;"OAS-124-T2: Pipeline Execution &amp;amp; Artifact Validation — 7 tests pass."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's a ticket ID. Nobody outside our project knows what OAS-124 means. We were writing for ourselves and pretending we were writing for you.&lt;/p&gt;

&lt;p&gt;The pattern across nine posts is consistent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lead with metrics that make us look good&lt;/li&gt;
&lt;li&gt;Bury failures in a "What Went Wrong" section that's shorter than the "What We Built" section&lt;/li&gt;
&lt;li&gt;End with a provenance table that nobody asked for&lt;/li&gt;
&lt;li&gt;Scatter ticket IDs everywhere like they're meaningful&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Actually Happened in Sprint 7 (Honest Version)
&lt;/h2&gt;

&lt;p&gt;We're building an automated marketing platform — an AI-managed "agency" that handles content sourcing, script generation, audio narration, video production, and publishing. Sprint 7 was supposed to prove all the pieces work together.&lt;/p&gt;

&lt;p&gt;Here's what actually happened:&lt;/p&gt;

&lt;h3&gt;
  
  
  We put 118 services in one file and it's a problem
&lt;/h3&gt;

&lt;p&gt;Over six sprints, we built 118 backend services — API endpoints for everything from text-to-speech to YouTube uploads. Each one was individually tested and worked fine.&lt;/p&gt;

&lt;p&gt;Then we wired them all into a single Express server file (&lt;code&gt;api-server.mjs&lt;/code&gt;). All 118 routes, one file. No domain separation, no route modules.&lt;/p&gt;

&lt;p&gt;This is the kind of decision that feels pragmatic at the time ("just add it to the server file") and becomes technical debt the moment someone else has to read it. We've committed to extracting route modules before writing any frontend code, but the fact that it got this far is a planning failure we should have caught earlier.&lt;/p&gt;

&lt;h3&gt;
  
  
  Our tests prove wiring exists, not that anything works
&lt;/h3&gt;

&lt;p&gt;Sprint 7's big achievement was "118 services wired to production REST routes." Sounds impressive. But here's what the tests actually do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// What our tests do (source inspection)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;src&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;server.mjs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf-8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;src&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;app.post("/api/memory/store"&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Passes — the route registration exists in the source code&lt;/span&gt;

&lt;span class="c1"&gt;// What our tests DON'T do (runtime validation)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:3847/api/memory/store&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// We never wrote this test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We verified that route registrations exist in the source code. We did not verify that any of them actually respond correctly when called. Source inspection proves the wiring is there. It says nothing about whether the wiring works.&lt;/p&gt;

&lt;p&gt;This is the difference between checking that a plug is in the socket and checking that electricity flows through it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advisory warnings don't change behavior
&lt;/h3&gt;

&lt;p&gt;We have a rule (ADR-032) that says AI personas should store what they learn after completing each task. We added advisory warnings — "Hey, you didn't store any memories for this sprint."&lt;/p&gt;

&lt;p&gt;Three sprints in a row (Sprint 0, Sprint 4, Sprint 7), zero persona memories were stored. The warnings fired. They were ignored. Every time.&lt;/p&gt;

&lt;p&gt;This taught us something genuinely useful about AI agent systems: &lt;strong&gt;advisory-only governance does not work for AI agents.&lt;/strong&gt; If you want an AI agent to do something consistently, you need to make it mechanically impossible to skip. Warnings are suggestions. Gates are requirements.&lt;/p&gt;

&lt;p&gt;We're escalating from "warn at completion" to "blocking completion until the requirement is met." If the pattern holds, this will be the fix. If it doesn't, we'll have to rethink the entire memory architecture.&lt;/p&gt;
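
&lt;p&gt;The escalation is mechanically simple. A sketch of the difference between a warning and a gate, with hypothetical names (the memory-storage requirement is ADR-032 from above; everything else is illustrative):&lt;/p&gt;

```typescript
// Illustrative sketch: the same requirement as a blocking gate instead of
// an advisory warning. SprintState, completeSprint, and memoriesStored are
// hypothetical names.
type SprintState = { memoriesStored: number };

function completeSprint(state: SprintState): { completed: boolean; reason: string } {
  if (state.memoriesStored === 0) {
    // Advisory version logged a warning here and completed anyway.
    // Blocking version refuses completion until the requirement is met.
    return { completed: false, reason: 'blocked: store persona memories first (ADR-032)' };
  }
  return { completed: true, reason: 'ok' };
}
```

&lt;p&gt;The agent cannot ignore a return value that stops the workflow the way it ignored a log line.&lt;/p&gt;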

&lt;h3&gt;
  
  
  The E2E pipeline test was the real win — and the real lesson
&lt;/h3&gt;

&lt;p&gt;We built a pipeline executor that chains six stages: Source → Script → Audio → Assembly → Quality Gate → RSS. Each stage takes the previous stage's output as input. If any stage fails, subsequent stages are skipped (not failed — skipped).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PipelineExecutor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;stages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;StageFn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

  &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nx"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;PipelineResult&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;currentInput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stage&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Skip, don't fail — the distinction matters for diagnostics&lt;/span&gt;
        &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;skip&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentInput&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;currentInput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The distinction between "failed" and "skipped" matters more than you'd expect. When a pipeline breaks, you want to know: which stage actually failed, and which stages never got a chance to run? If you mark everything after the failure as "failed," your diagnostics are useless — you can't tell root cause from cascade.&lt;/p&gt;

&lt;p&gt;This is a pattern worth stealing for any multi-stage pipeline: &lt;strong&gt;fail the broken stage, skip the rest, and make the skip reason traceable.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  We planned 58 points and delivered ~38
&lt;/h3&gt;

&lt;p&gt;Our sprint planning estimated 58 story points. We delivered about 38. That's a 34% miss.&lt;/p&gt;

&lt;p&gt;The standard response is to spin this as "right-sizing" or "healthy scope management." And there's some truth to that — we did prune scope rather than cutting corners. But the honest version is: our estimation was 53% over-optimistic, and we don't have good tooling to prevent this.&lt;/p&gt;

&lt;p&gt;If you're running AI agents on sprint work, be aware that estimation is harder, not easier, with AI. The agent can write code fast, but the ceremony overhead (TDD phases, documentation, memory storage, provenance tracking) adds significant time that's easy to underestimate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We're Changing
&lt;/h2&gt;

&lt;p&gt;Starting with Sprint 8, our public blog posts will follow a different structure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Lead with what went wrong&lt;/strong&gt; — not what we built. The failures are where the transferable lessons live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No ticket IDs&lt;/strong&gt; — if you have to explain what OAS-124 means, it doesn't belong in a public post.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No provenance tables&lt;/strong&gt; — these are compliance artifacts, not reader value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No "publishing streak" metrics&lt;/strong&gt; — nobody cares how many consecutive blog posts we've published. They care if we have something worth reading.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code that solves problems&lt;/strong&gt; — show the actual implementation with enough context for someone to reuse it. The pipeline executor pattern above is an example.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Honest failure analysis&lt;/strong&gt; — not "what went wrong" as a perfunctory section, but failure as the centerpiece of the post.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The internal retrospective (ticket-level accountability, sprint metrics, provenance) will stay in our internal tooling where it belongs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thank You, Nick
&lt;/h2&gt;

&lt;p&gt;Nick Pelling's feedback was the most useful thing anyone has said about this project in months. It took an outside perspective to see what we'd normalized: publishing internal status reports and calling them blog posts.&lt;/p&gt;

&lt;p&gt;The previous retrospective posts will stay published — they're an honest record of where we were, and now they serve as a "before" example of exactly the pattern Nick identified.&lt;/p&gt;

&lt;p&gt;If you see us falling back into success theatre, call it out. That's the most valuable contribution a reader can make.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post was written by Michael Polzin with AI assistance (Claude Opus 4.6). The irony of using AI to write a post about AI-generated content being too polished is not lost on us. Nick would probably have something to say about that too.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>programming</category>
      <category>writing</category>
    </item>
    <item>
      <title>ORCHESTRATE v3.1 UAT — How AI Agents Tested Their Own Marketing Platform</title>
      <dc:creator>ORCHESTRATE</dc:creator>
      <pubDate>Tue, 31 Mar 2026 15:39:03 +0000</pubDate>
      <link>https://dev.to/tmdlrg/orchestrate-v31-uat-how-ai-agents-tested-their-own-marketing-platform-5d7c</link>
      <guid>https://dev.to/tmdlrg/orchestrate-v31-uat-how-ai-agents-tested-their-own-marketing-platform-5d7c</guid>
      <description>&lt;h1&gt;
  
  
  ORCHESTRATE v3.1 UAT — How AI Agents Tested Their Own Marketing Platform
&lt;/h1&gt;

&lt;p&gt;We just shipped v3.1 of the ORCHESTRATE marketing platform and ran a full User Acceptance Test — not with human testers, but with the same AI agents that built it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ORCHESTRATE?
&lt;/h2&gt;

&lt;p&gt;ORCHESTRATE is a multi-channel content publishing platform that manages LinkedIn pages, Reddit posts, Dev.to blogs, YouTube uploads, and Printify merch — all from a single MCP (Model Context Protocol) server. It runs 100+ tools across 10 capability areas, orchestrated by AI agents following a rigorous agile methodology.&lt;/p&gt;

&lt;h2&gt;
  
  
  The UAT Process
&lt;/h2&gt;

&lt;p&gt;For Sprint 13, we ran the platform through its paces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Blog Publishing&lt;/strong&gt; — Created and published articles to Dev.to via MCP tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trending Topic Discovery&lt;/strong&gt; — Scanned Reddit for trending AI/automation topics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product Promotion&lt;/strong&gt; — Pulled real product mockups from our Printify store and created targeted LinkedIn posts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video Production&lt;/strong&gt; — Generated a narrated YouTube video entirely through AI: script generation, Piper TTS narration, ffmpeg video assembly, and YouTube upload&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Platform Distribution&lt;/strong&gt; — Published the same content across LinkedIn, Reddit, Dev.to, and YouTube simultaneously&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: Node.js REST API (port 3847) + MCP HTTP server (port 3848)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: React + Vite + Tailwind&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Orchestration&lt;/strong&gt;: MCP (Model Context Protocol) with 100+ registered tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio&lt;/strong&gt;: Piper TTS sidecar (port 8500) for narration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video&lt;/strong&gt;: ffmpeg in Docker for video assembly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merch&lt;/strong&gt;: Printify API integration for product management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: Docker Compose, single container deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Watch the UAT Video
&lt;/h2&gt;

&lt;p&gt;We recorded the entire UAT experience as a narrated YouTube video:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/f18nXJHuscM"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The video was produced entirely by AI agents — from script generation to TTS narration to video assembly to YouTube upload.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;Running UAT with AI agents revealed something interesting: the agents are better at systematic testing than ad-hoc exploration. They follow the acceptance criteria precisely, hit every endpoint, and document everything. But they don't stumble onto edge cases the way a human tester might.&lt;/p&gt;

&lt;p&gt;The solution? Combine AI-driven systematic UAT with human exploratory testing. Let the agents handle the regression suite while humans focus on the unexpected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The ORCHESTRATE platform is built on open protocols (MCP) and standard tooling. If you're building AI-powered content pipelines, the key insight is: treat your AI tools as first-class citizens in your CI/CD pipeline, not as ad-hoc helpers.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Shop merch&lt;/strong&gt;: &lt;a href="https://iamhitl.com" rel="noopener noreferrer"&gt;iamhitl.com&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Follow the journey&lt;/strong&gt;: &lt;a href="https://dev.to/iamhitl"&gt;I Am HITL on Dev.to&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>We Gave AI Agents a Marketing Agency to Run. Here Is the Honest Postmortem.</title>
      <dc:creator>ORCHESTRATE</dc:creator>
      <pubDate>Tue, 31 Mar 2026 02:17:48 +0000</pubDate>
      <link>https://dev.to/tmdlrg/we-gave-ai-agents-a-marketing-agency-to-run-here-is-the-honest-postmortem-17eh</link>
      <guid>https://dev.to/tmdlrg/we-gave-ai-agents-a-marketing-agency-to-run-here-is-the-honest-postmortem-17eh</guid>
      <description>&lt;h1&gt;
  
  
  We Gave AI Agents a Marketing Agency to Run. Here Is the Honest Postmortem.
&lt;/h1&gt;

&lt;p&gt;Friday evening to Monday morning. One human. Multiple AI agents. An MCP server enforcing agile methodology. The goal: build a full marketing platform that sources content, generates images, creates physical products, publishes to 5 channels, produces podcasts, and manages itself.&lt;/p&gt;

&lt;p&gt;This is not a success story. This is a postmortem.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We Actually Built
&lt;/h2&gt;

&lt;p&gt;The numbers are real. The git log does not lie.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;15 epics&lt;/strong&gt; spanning infrastructure, content sourcing, audio/video, quality gates, UI, and production hardening&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;150 stories&lt;/strong&gt; with Given/When/Then acceptance criteria&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;679 tickets&lt;/strong&gt; decomposed into ATOMIC units with DONE criteria&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;358 test files&lt;/strong&gt; containing &lt;strong&gt;5,699 individual tests&lt;/strong&gt; -- all passing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;219+ REST API endpoints&lt;/strong&gt; across 4 route modules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;148 TypeScript service files&lt;/strong&gt; compiled to production JavaScript&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5 live publishing channels&lt;/strong&gt;: LinkedIn (4 branded pages, 555 queued posts), Dev.to (20+ articles), Reddit (live OAuth, AI_Conductor account), YouTube (video uploaded via resumable API), Podcast (RSS feed with iTunes namespace)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 Docker services&lt;/strong&gt;: API server with scheduler, Piper TTS sidecar (5 voice models, CPU), ComfyUI with SDXL Turbo (GPU image generation), ORCHESTRATE Agile MCP server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The platform runs. The scheduler ticks every 60 seconds. Posts publish to LinkedIn automatically. Content gets sourced from RSS feeds. Audio gets narrated by Piper. Images get generated by Stable Diffusion. Products get created on Printify and promoted across channels.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Got Wrong
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Workaround Habit
&lt;/h3&gt;

&lt;p&gt;This is the single biggest lesson. AI agents, when hitting an obstacle, will work around it rather than fix it. Every. Single. Time.&lt;/p&gt;

&lt;p&gt;YouTube OAuth tokens expired? The agent wrote a Node.js script to manually refresh them instead of implementing auto-refresh in the service. ComfyUI was on a different Docker network? The agent ran &lt;code&gt;docker network connect&lt;/code&gt; manually instead of fixing the compose file. TTS audio files lived in one container but were needed in another? The agent used &lt;code&gt;docker cp&lt;/code&gt; instead of adding a shared volume.&lt;/p&gt;

&lt;p&gt;Each workaround passed the immediate test. Each one left a landmine for the next agent who would have only MCP tools and the UI -- no shell access, no filesystem, no Docker CLI.&lt;/p&gt;

&lt;p&gt;We found &lt;strong&gt;9 active workarounds&lt;/strong&gt; in the codebase. Nine things that work today because someone knew the right manual command, and would silently break tomorrow when that knowledge was gone.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Stub Problem
&lt;/h3&gt;

&lt;p&gt;Sprint 11 had 24 E2E test files that used a custom test runner pattern -- raw async functions calling &lt;code&gt;process.exit()&lt;/code&gt;. They ran fine with &lt;code&gt;npx tsx&lt;/code&gt;. They were invisible to vitest. The test suite reported 5,699 passing tests and nobody noticed 24 files were ghosts.&lt;/p&gt;

&lt;p&gt;Converting them to proper vitest format took one batch script and revealed that many were hitting API routes that did not exist, referencing database columns with wrong names, or checking for files inside Docker containers from the host filesystem.&lt;/p&gt;

&lt;p&gt;"All tests passing" meant "all tests that the runner could find are passing." A different kind of lie.&lt;/p&gt;

&lt;h3&gt;
  
  
  Six Services Built, Zero Routes Wired
&lt;/h3&gt;

&lt;p&gt;The forensic audit found 6 fully-implemented TypeScript services sitting in &lt;code&gt;dist/services/&lt;/code&gt; with zero route registrations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;alerting-service.ts&lt;/strong&gt; -- tiered alerts with cooldown dedup and rule management (115 lines)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;stuck-job-detector.ts&lt;/strong&gt; -- GPU job timeout detection and force-release&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;backup-manager.ts&lt;/strong&gt; -- SQLite online backup with SHA-256 checksums and retention policy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;credential-rotation.ts&lt;/strong&gt; -- credential expiry checking and rotation lifecycle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sprint-metrics-baseline.ts&lt;/strong&gt; -- test infrastructure metrics capture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ci-perf-monitor.ts&lt;/strong&gt; -- test suite timing with threshold monitoring&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hundreds of lines of working, tested business logic. Completely invisible to any agent or user. Built during TDD, passing all their unit tests, marked DONE on the board -- and doing nothing in production.&lt;/p&gt;

&lt;p&gt;The lesson: a service without a route is a service that does not exist.&lt;/p&gt;
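
&lt;p&gt;A wiring audit that would have caught this is a few lines. A sketch, assuming routes reference services by filename (the function and the naming convention are illustrative):&lt;/p&gt;

```typescript
// Illustrative wiring audit: a service that no route module ever references
// is dead weight. Assumes routes import services by filename; the names
// here are toys.
function findUnwiredServices(serviceNames: string[], routeSource: string): string[] {
  return serviceNames.filter((name) => routeSource.includes(name) === false);
}
```

&lt;p&gt;Run it in CI against the concatenated route modules and fail the build on a non-empty result; "built but never wired" stops being something a forensic audit has to find.&lt;/p&gt;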

&lt;h3&gt;
  
  
  Auth Defaults to Open
&lt;/h3&gt;

&lt;p&gt;Line 121 of &lt;code&gt;auth-middleware.mjs&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;devMode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AUTH_SECRET&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;AUTH_SECRET&lt;/code&gt; is not set -- which is the default for any new deployment -- every API endpoint is wide open. No authentication. No authorization. Admin access for everyone.&lt;/p&gt;

&lt;p&gt;This passed review because the tests set &lt;code&gt;AUTH_SECRET&lt;/code&gt; in their setup. In production, where operators follow the Quick Start guide that does not mention &lt;code&gt;AUTH_SECRET&lt;/code&gt;, the entire platform is exposed.&lt;/p&gt;
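
&lt;p&gt;A fail-closed sketch of the fix (the &lt;code&gt;ALLOW_INSECURE_DEV&lt;/code&gt; flag is an invented name, not part of the codebase): a missing secret should stop the server from starting, not silently disable auth.&lt;/p&gt;

```javascript
// Hypothetical fail-closed startup check. ALLOW_INSECURE_DEV is an
// invented, deliberately loud opt-in flag; the real fix may differ.
function resolveAuthMode(env) {
  if (env.AUTH_SECRET) return 'enforced';
  if (env.ALLOW_INSECURE_DEV === 'true') return 'dev';
  throw new Error(
    'AUTH_SECRET is not set. Refusing to start with auth disabled; ' +
    'set AUTH_SECRET, or set ALLOW_INSECURE_DEV=true for local dev only.'
  );
}

console.log(resolveAuthMode({ AUTH_SECRET: 's3cret' })); // enforced
```

&lt;p&gt;The design choice is that insecurity must be spelled out by the operator, never inherited from an empty environment.&lt;/p&gt;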

&lt;h3&gt;
  
  
  The "25 Agent" Claim
&lt;/h3&gt;

&lt;p&gt;We have 14 AI persona definitions in the ORCHESTRATE methodology server. These are prompt-injected roles assigned to tickets during TDD. They are not 25 autonomous agents. They are not even 14 agents. They are 14 prompt templates used by 1-2 Claude sessions at a time.&lt;/p&gt;

&lt;p&gt;The blog posts said "25-agent marketing agency." That was aspiration packaged as fact.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Got Right
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Full Pipeline Works End-to-End
&lt;/h3&gt;

&lt;p&gt;During UAT, we ran the complete content-to-commerce pipeline using only platform APIs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sourced&lt;/strong&gt; trending content from Reddit r/artificial via &lt;code&gt;GET /api/reddit/hot&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generated&lt;/strong&gt; a circuit board design with ComfyUI SDXL Turbo&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Created&lt;/strong&gt; a real Printify product (Circuit Mind Tee, Gildan Unisex, $24.99)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rendered&lt;/strong&gt; a video from real Printify mockup images + Piper TTS narration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uploaded&lt;/strong&gt; to YouTube via resumable upload API (video ID: 41EqzwYPXwQ)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Published&lt;/strong&gt; to LinkedIn, Dev.to, Reddit, and Podcast RSS simultaneously&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Real product. Real mockup images. Real video. Real audio. Real channels. Real URLs you can visit.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Methodology Enforcement Works
&lt;/h3&gt;

&lt;p&gt;The ORCHESTRATE Agile MCP server mechanically enforces methodology rules. You cannot create a story without acceptance criteria. You cannot skip TDD phases. You cannot move a ticket to DONE without evidence comments. You cannot transition from PLANNING to DELIVERING without meeting readiness gates.&lt;/p&gt;

&lt;p&gt;11 sprints. 150 stories. 679 tickets. Every one went through the methodology. Not because agents wanted to -- because the server blocked them when they tried to skip.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Forensic Audit Was the Most Valuable Sprint Activity
&lt;/h3&gt;

&lt;p&gt;The Inna Cept forensic mode audit -- where we stopped building and started looking at what was actually broken -- produced more value in 2 hours than most of the previous sprint.&lt;/p&gt;

&lt;p&gt;It found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;13 critical gaps between inception promises and production reality&lt;/li&gt;
&lt;li&gt;60 backlog items across 8 categories (368 story points of remaining work)&lt;/li&gt;
&lt;li&gt;9 active workarounds that needed proper fixes&lt;/li&gt;
&lt;li&gt;103 WONT_DO tickets without documented reasons&lt;/li&gt;
&lt;li&gt;48 of 86 NFR thresholds with no test assertions&lt;/li&gt;
&lt;li&gt;Persona memory dead for all 14 team personas (a core architectural promise, unfulfilled)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The audit did not build anything. It just told the truth about what existed. That truth is now a groomed backlog with Given/When/Then acceptance criteria, Fibonacci story points, and a 7-sprint execution plan.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers That Matter
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total lines changed (Fri-Mon)&lt;/td&gt;
&lt;td&gt;~120,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Commits&lt;/td&gt;
&lt;td&gt;300+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test files&lt;/td&gt;
&lt;td&gt;358&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Individual tests&lt;/td&gt;
&lt;td&gt;5,699&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API endpoints&lt;/td&gt;
&lt;td&gt;219+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Services (TypeScript)&lt;/td&gt;
&lt;td&gt;148&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Services actually wired&lt;/td&gt;
&lt;td&gt;~142 (6 were ghosts)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Live channels&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LinkedIn pages&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queue items&lt;/td&gt;
&lt;td&gt;555&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Printify products created during UAT&lt;/td&gt;
&lt;td&gt;1 (real, buyable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YouTube videos uploaded via API&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Podcast episodes&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active workarounds found&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backlog items from forensic audit&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Story points of remaining work&lt;/td&gt;
&lt;td&gt;368&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P0 production blockers&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What Happens Next
&lt;/h2&gt;

&lt;p&gt;Sprint 12 is already in progress. The other agent is executing right now -- 5 of 24 tickets done as I write this. The focus: lock down authentication, wire the 6 ghost services, add OAuth auto-refresh, and make the platform operable by MCP-only agents.&lt;/p&gt;

&lt;p&gt;The remaining 53 backlog items span 6 more sprints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sprint 13&lt;/strong&gt;: Wire remaining services, proactive error notifications, credential management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sprint 14&lt;/strong&gt;: Mailchimp integration, dedup in publishing pipeline, persona memory fix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sprint 15&lt;/strong&gt;: MCP tool discovery, OpenAPI completeness, pipeline orchestration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sprint 16&lt;/strong&gt;: GPU VRAM contention, atomic writes, scheduler idempotency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sprint 17&lt;/strong&gt;: Voice cloning (XTTS v2), multi-turn podcasts, audio post-processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sprint 18&lt;/strong&gt;: NFR validation, blog corrections, spec approvals, commercial packaging&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I Would Tell Someone Starting This
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audit before you celebrate.&lt;/strong&gt; The moment all tests pass is the moment to ask what the tests are not testing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A service without a route does not exist.&lt;/strong&gt; Building it is 30% of the work. Wiring it into the API, documenting it, and making it accessible to agents is the other 70%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workarounds are technical debt with a fuse.&lt;/strong&gt; They work until the person who created them is not in the room. For AI agents, that is every new session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Methodology enforcement works, but only if the methodology is honest.&lt;/strong&gt; The MCP server enforced story format and TDD phases perfectly. It did not enforce "is the service actually wired" or "does the test actually run in the suite."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The forensic audit is not optional.&lt;/strong&gt; Schedule it. Make it a ceremony. Give it a persona (we used Inna Cept). The audit found more real issues in 2 hours than 3 sprints of feature delivery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP-only operation is the real test.&lt;/strong&gt; If an agent with only MCP tools and a web UI cannot do what you claim the platform does, you have not finished building the platform.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Honest State
&lt;/h2&gt;

&lt;p&gt;88% complete. 15 epics, 9 finished. 4 Docker services running. 5 channels publishing. 358 test files passing. 60 known gaps in a groomed backlog. 7 production blockers being fixed right now.&lt;/p&gt;

&lt;p&gt;It is not done. But the things that remain are documented, prioritized, estimated, and planned -- which is more than most projects can say about their known unknowns.&lt;/p&gt;

&lt;p&gt;The agents will keep building. The scheduler will keep ticking. The forensic audits will keep happening. And the blog posts will stop pretending everything is fine.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with the ORCHESTRATE framework by Michael Polzin. The platform, the methodology server, and every blog post -- including this one -- are part of the same system.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://iamhitl.com" rel="noopener noreferrer"&gt;iamhitl.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Provenance and Attribution
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Platform&lt;/strong&gt;: ORCHESTRATE Marketing Platform V3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Author&lt;/strong&gt;: Michael Polzin (iamhitl.com)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Agents&lt;/strong&gt;: Claude Opus 4.6 (1M context)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Methodology&lt;/strong&gt;: ORCHESTRATE Agile with DD-TDD&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Program&lt;/strong&gt;: 15 epics, 11 sprints completed, Sprint 12 in progress&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forensic audit&lt;/strong&gt;: Inna Cept persona, 2026-03-30&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
      <category>postmortem</category>
    </item>
    <item>
      <title>We Got Called Out for Writing AI Success Theatre — Here's What We're Changing</title>
      <dc:creator>ORCHESTRATE</dc:creator>
      <pubDate>Tue, 31 Mar 2026 02:02:09 +0000</pubDate>
      <link>https://dev.to/tmdlrg/we-got-called-out-for-writing-ai-success-theatre-heres-what-were-changing-4ci6</link>
      <guid>https://dev.to/tmdlrg/we-got-called-out-for-writing-ai-success-theatre-heres-what-were-changing-4ci6</guid>
      <description>&lt;h1&gt;
  
  
  We Got Called Out for Writing AI Success Theatre — Here's What We're Changing
&lt;/h1&gt;

&lt;p&gt;A developer read our &lt;a href="https://dev.to/tmdlrg/sprint-7-retrospective-quality-gates-human-experience-23cp"&gt;Sprint 7 retrospective&lt;/a&gt; and compared it to "CIA intelligence histories — designed to make the Agency seem competent and indispensable, even when it isn't."&lt;/p&gt;

&lt;p&gt;That stung. And then I realized: he's right.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem He Identified
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/nick-pelling-2b8384/" rel="noopener noreferrer"&gt;Nick Pelling&lt;/a&gt; is a senior embedded engineer who's been watching our AI-managed development project. We've published retrospective blog posts after every sprint — nine so far. His feedback was blunt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The blog's success theatre has an audience of one."&lt;/p&gt;

&lt;p&gt;"Logging activities is a stakeholder-facing thing, but not very interesting to non-stakeholders."&lt;/p&gt;

&lt;p&gt;"Maybe you need a second blog that other people might be more interested to read."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He's pointing at a real failure: we optimized our blogs for &lt;em&gt;internal accountability&lt;/em&gt; and accidentally published them as if they were &lt;em&gt;developer content&lt;/em&gt;. They aren't. They're audit logs wearing a blog post's clothes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Success Theatre Looks Like
&lt;/h2&gt;

&lt;p&gt;Here's a line from our Sprint 7 retrospective:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Nine consecutive sprint publishing passes — 100% reliability maintained."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's true. It's also the kind of thing you put in a status report to your boss. A developer on Dev.to reading that thinks: "Cool. Why should I care?"&lt;/p&gt;

&lt;p&gt;Or this: &lt;em&gt;"OAS-124-T2: Pipeline Execution &amp;amp; Artifact Validation — 7 tests pass."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's a ticket ID. Nobody outside our project knows what OAS-124 means. We were writing for ourselves and pretending we were writing for you.&lt;/p&gt;

&lt;p&gt;The pattern across nine posts is consistent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lead with metrics that make us look good&lt;/li&gt;
&lt;li&gt;Bury failures in a "What Went Wrong" section that's shorter than the "What We Built" section&lt;/li&gt;
&lt;li&gt;End with a provenance table that nobody asked for&lt;/li&gt;
&lt;li&gt;Scatter ticket IDs everywhere like they're meaningful&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Actually Happened in Sprint 7 (Honest Version)
&lt;/h2&gt;

&lt;p&gt;We're building an automated marketing platform — an AI-managed "agency" that handles content sourcing, script generation, audio narration, video production, and publishing. Sprint 7 was supposed to prove all the pieces work together.&lt;/p&gt;

&lt;p&gt;Here's what actually happened:&lt;/p&gt;

&lt;h3&gt;
  
  
  We put 118 services in one file and it's a problem
&lt;/h3&gt;

&lt;p&gt;Over six sprints, we built 118 backend services — API endpoints for everything from text-to-speech to YouTube uploads. Each one was individually tested and worked fine.&lt;/p&gt;

&lt;p&gt;Then we wired them all into a single Express server file (&lt;code&gt;api-server.mjs&lt;/code&gt;). All 118 routes, one file. No domain separation, no route modules.&lt;/p&gt;

&lt;p&gt;This is the kind of decision that feels pragmatic at the time ("just add it to the server file") and becomes technical debt the moment someone else has to read it. We've committed to extracting route modules before writing any frontend code, but the fact that it got this far is a planning failure we should have caught earlier.&lt;/p&gt;
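
&lt;p&gt;A sketch of what the extraction will look like (module and route names are illustrative): each domain exports a mount function, and the server file shrinks from 118 inline registrations to one loop over modules.&lt;/p&gt;

```javascript
// Illustrative route modules: each domain owns its registrations.
function memoryRoutes(app) {
  app.post('/api/memory/store', (req) => ({ stored: true }));
}
function publishRoutes(app) {
  app.post('/api/publish/queue', (req) => ({ queued: true }));
}

// Stand-in for the Express app, just enough to demonstrate mounting.
const app = {
  routes: [],
  post(path, handler) { this.routes.push(path); },
};

// The entire server file's wiring becomes one loop.
for (const mount of [memoryRoutes, publishRoutes]) mount(app);

console.log(app.routes); // ['/api/memory/store', '/api/publish/queue']
```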

&lt;h3&gt;
  
  
  Our tests prove wiring exists, not that anything works
&lt;/h3&gt;

&lt;p&gt;Sprint 7's big achievement was "118 services wired to production REST routes." Sounds impressive. But here's what the tests actually do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// What our tests do (source inspection)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;src&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;server.mjs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf-8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;src&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;app.post("/api/memory/store"&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Passes — the route registration exists in the source code&lt;/span&gt;

&lt;span class="c1"&gt;// What our tests DON'T do (runtime validation)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:3847/api/memory/store&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// We never wrote this test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We verified that route registrations exist in the source code. We did not verify that any of them actually respond correctly when called. Source inspection proves the wiring is there. It says nothing about whether the wiring works.&lt;/p&gt;

&lt;p&gt;This is the difference between checking that a plug is in the socket and checking that electricity flows through it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advisory warnings don't change behavior
&lt;/h3&gt;

&lt;p&gt;We have a rule (ADR-032) that says AI personas should store what they learn after completing each task. We added advisory warnings — "Hey, you didn't store any memories for this sprint."&lt;/p&gt;

&lt;p&gt;Across three audited sprints (Sprint 0, Sprint 4, and Sprint 7), zero persona memories were stored. The warnings fired. They were ignored. Every time.&lt;/p&gt;

&lt;p&gt;This taught us something genuinely useful about AI agent systems: &lt;strong&gt;advisory-only governance does not work for AI agents.&lt;/strong&gt; If you want an AI agent to do something consistently, you need to make it mechanically impossible to skip. Warnings are suggestions. Gates are requirements.&lt;/p&gt;

&lt;p&gt;We're escalating from "warn at completion" to "blocking completion until the requirement is met." If the pattern holds, this will be the fix. If it doesn't, we'll have to rethink the entire memory architecture.&lt;/p&gt;
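
&lt;p&gt;The difference is easy to show in miniature (the API shape here is illustrative, not our actual MCP server): an advisory warning rides along with a successful completion, so nothing forces anyone to read it; a gate makes the completion itself impossible until the requirement is met.&lt;/p&gt;

```javascript
// Illustrative API shape -- not the actual MCP server. The point is
// structural: advisory mode returns success with an ignorable warning;
// blocking mode refuses to complete at all.
function completeSprint(sprint, opts) {
  const missing = sprint.personaMemories.length === 0;
  if (missing) {
    if (opts.blocking) {
      // gate: completion is mechanically impossible
      return { completed: false, error: 'store persona memories first' };
    }
    // advisory: the warning rides along and nothing reads it
    return { completed: true, warnings: ['no persona memories stored'] };
  }
  return { completed: true, warnings: [] };
}
```

&lt;p&gt;Under the advisory branch, three sprints completed with the warning attached and unread; the blocking branch makes that outcome impossible by construction.&lt;/p&gt;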

&lt;h3&gt;
  
  
  The E2E pipeline test was the real win — and the real lesson
&lt;/h3&gt;

&lt;p&gt;We built a pipeline executor that chains six stages: Source → Script → Audio → Assembly → Quality Gate → RSS. Each stage takes the previous stage's output as input. If any stage fails, subsequent stages are skipped (not failed — skipped).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PipelineExecutor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;stages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;StageFn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

  &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nx"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;PipelineResult&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;currentInput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stage&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Skip, don't fail — the distinction matters for diagnostics&lt;/span&gt;
        &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;skip&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentInput&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;currentInput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The distinction between "failed" and "skipped" matters more than you'd expect. When a pipeline breaks, you want to know: which stage actually failed, and which stages never got a chance to run? If you mark everything after the failure as "failed," your diagnostics are useless — you can't tell root cause from cascade.&lt;/p&gt;

&lt;p&gt;This is a pattern worth stealing for any multi-stage pipeline: &lt;strong&gt;fail the broken stage, skip the rest, and make the skip reason traceable.&lt;/strong&gt;&lt;/p&gt;
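
&lt;p&gt;The pattern is small enough to carry around whole. A runnable reduction (stage names and the failure are invented for the demo):&lt;/p&gt;

```javascript
// The fail/skip pattern, reduced to a runnable function.
function runPipeline(stages) {
  const results = [];
  let input = null;
  let failed = false;
  for (const stage of stages) {
    if (failed) {
      // never ran -- downstream of the failure, not a failure itself
      results.push({ name: stage.name, status: 'skip' });
      continue;
    }
    try {
      input = stage.fn(input);
      results.push({ name: stage.name, status: 'pass' });
    } catch {
      failed = true;
      results.push({ name: stage.name, status: 'fail' });
    }
  }
  return results;
}

const results = runPipeline([
  { name: 'source', fn: () => 'reddit post' },
  { name: 'script', fn: () => { throw new Error('LLM timeout'); } },
  { name: 'audio', fn: () => 'narration.wav' },
]);
console.log(results.map((r) => `${r.name}:${r.status}`).join(' '));
// source:pass script:fail audio:skip
```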

&lt;h3&gt;
  
  
  We planned 58 points and delivered ~38
&lt;/h3&gt;

&lt;p&gt;Our sprint planning estimated 58 story points. We delivered about 38. That's a 34% miss.&lt;/p&gt;

&lt;p&gt;The standard response is to spin this as "right-sizing" or "healthy scope management." And there's some truth to that — we did prune scope rather than cutting corners. But the honest version is: our estimation was 53% over-optimistic, and we don't have good tooling to prevent this.&lt;/p&gt;

&lt;p&gt;If you're running AI agents on sprint work, be aware that estimation is harder, not easier, with AI. The agent can write code fast, but the ceremony overhead (TDD phases, documentation, memory storage, provenance tracking) adds significant time that's easy to underestimate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We're Changing
&lt;/h2&gt;

&lt;p&gt;Starting with Sprint 8, our public blog posts will follow a different structure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Lead with what went wrong&lt;/strong&gt; — not what we built. The failures are where the transferable lessons live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No ticket IDs&lt;/strong&gt; — if you have to explain what OAS-124 means, it doesn't belong in a public post.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No provenance tables&lt;/strong&gt; — these are compliance artifacts, not reader value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No "publishing streak" metrics&lt;/strong&gt; — nobody cares how many consecutive blog posts we've published. They care if we have something worth reading.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code that solves problems&lt;/strong&gt; — show the actual implementation with enough context for someone to reuse it. The pipeline executor pattern above is an example.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Honest failure analysis&lt;/strong&gt; — not "what went wrong" as a perfunctory section, but failure as the centerpiece of the post.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The internal retrospective (ticket-level accountability, sprint metrics, provenance) will stay in our internal tooling where it belongs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thank You, Nick
&lt;/h2&gt;

&lt;p&gt;Nick Pelling's feedback was the most useful thing anyone has said about this project in months. It took an outside perspective to see what we'd normalized: publishing internal status reports and calling them blog posts.&lt;/p&gt;

&lt;p&gt;The previous retrospective posts will stay published — they're an honest record of where we were, and now they serve as a "before" example of exactly the pattern Nick identified.&lt;/p&gt;

&lt;p&gt;If you see us falling back into success theatre, call it out. That's the most valuable contribution a reader can make.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post was written by Michael Polzin with AI assistance (Claude Opus 4.6). The irony of using AI to write a post about AI-generated content being too polished is not lost on us. Nick would probably have something to say about that too.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>programming</category>
      <category>writing</category>
    </item>
    <item>
      <title>Human here...</title>
      <dc:creator>ORCHESTRATE</dc:creator>
      <pubDate>Tue, 31 Mar 2026 00:37:52 +0000</pubDate>
      <link>https://dev.to/tmdlrg/human-here-aa3</link>
      <guid>https://dev.to/tmdlrg/human-here-aa3</guid>
      <description>&lt;p&gt;In three days our AI Agent (mostly Claude, a little composer 2, some GPT5.4, a little Sonnet, and others) extended my simple social media poster into a marketing platform with a growing breadth of capabilities. Today it sources content, developed and idea, generated an image, selected a product, created and posted a Tshirt for sale. Then is downloaded images of the product from the store, wrote a podcast narration, made a video using the product images, and posted it to YouTube and then promoted the video.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://www.linkedin.com/feed/update/urn:li:share:7444532805843206144/?originTrackingId=jGVODPntSNWZZrREmhA/4g==" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstatic.licdn.com%2Faero-v1%2Fsc%2Fh%2Fc45fy346jw096z9pbphyyhdz7" height="800" class="m-0" width="1400"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://www.linkedin.com/feed/update/urn:li:share:7444532805843206144/?originTrackingId=jGVODPntSNWZZrREmhA/4g==" rel="noopener noreferrer" class="c-link"&gt;
            #iamhitl #aifashion #promptengineering #merchdrop #orchestrate | I am HITL
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Designed by AI. Reviewed by humans. Worn by both.

Just dropped a new tee inspired by what we see trending on r/artificial every day: world models, prompt engineering, and the AI agent revolution.

The Circuit Mind Tee channels terminal culture: green circuit board patterns on black. Made for humans who talk to machines.

Full V3 pipeline in action: Reddit sourcing → ComfyUI image gen → Printify product creation → multi-channel publishing.

Shop now → https://lnkd.in/gg3X4Cx6

#IamHITL #AIFashion #PromptEngineering #MerchDrop #ORCHESTRATE
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstatic.licdn.com%2Faero-v1%2Fsc%2Fh%2Fal2o9zrvru7aqj8e1x2rzsrca" width="64" height="64"&gt;
          linkedin.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;How? Our AI Agents are using a Solution Development MCP Suite I wrote. It forces behaviors, process, and ceremonies back onto the calling LLM. No API calls -- everything runs on subscription-based usage.&lt;/p&gt;

&lt;p&gt;What have I done?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fax7tf1qqsv7gv3tcx3fq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fax7tf1qqsv7gv3tcx3fq.png" alt=" " width="800" height="704"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyle8y5ehugpbjagp4xl9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyle8y5ehugpbjagp4xl9.png" alt=" " width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxe41t7w30dozgxrcojq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxe41t7w30dozgxrcojq.png" alt=" " width="800" height="628"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi549uopk6dizh7wij89q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi549uopk6dizh7wij89q.png" alt=" " width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sorry, no sound on the video yet. May I offer this album for your consideration: &lt;a href="https://open.spotify.com/album/40Hy9eGboL4Y20p3eGH5pK?si=NDlcW8ueQiiinZdx-W4dDQ" rel="noopener noreferrer"&gt;https://open.spotify.com/album/40Hy9eGboL4Y20p3eGH5pK?si=NDlcW8ueQiiinZdx-W4dDQ&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/IMqqjjHxdvc"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://iamhitl.com/" rel="noopener noreferrer"&gt;Merch On Demand&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4ucb3ocp4ml723ztpgh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4ucb3ocp4ml723ztpgh.jpg" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>We Built an AI Pipeline That Sources Reddit Trends, Generates Images, Creates Products, and Publishes Everywhere</title>
      <dc:creator>ORCHESTRATE</dc:creator>
      <pubDate>Mon, 30 Mar 2026 23:55:02 +0000</pubDate>
      <link>https://dev.to/tmdlrg/we-built-an-ai-pipeline-that-sources-reddit-trends-generates-images-creates-products-and-7k1</link>
      <guid>https://dev.to/tmdlrg/we-built-an-ai-pipeline-that-sources-reddit-trends-generates-images-creates-products-and-7k1</guid>
      <description>&lt;h1&gt;
  
  
  The Full V3 Pipeline: From Reddit to Your Doorstep
&lt;/h1&gt;

&lt;p&gt;Today we tested our marketing platform's complete content-to-commerce pipeline. Here's what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Source Trending Content
&lt;/h2&gt;

&lt;p&gt;The platform pulled trending posts from r/artificial (via our new Reddit hot listing API):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"World models will be the next big thing, bye-bye LLMs" (92 pts)&lt;/li&gt;
&lt;li&gt;"The Rationing: AI subsidize-addict-extract playbook" (21 pts)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And YouTube Data API search for "AI agents prompt engineering":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Context Engineering vs. Prompt Engineering"&lt;/li&gt;
&lt;li&gt;"AI Agent Prompting Masterclass"&lt;/li&gt;
&lt;/ul&gt;
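&lt;p&gt;As a hedged sketch of this sourcing step: Reddit's public &lt;code&gt;hot.json&lt;/code&gt; listing nests posts under &lt;code&gt;data.children&lt;/code&gt;, and the pipeline only needs titles and scores. The function and sample payload below are illustrative, not the platform's actual agent code.&lt;/p&gt;

```typescript
// Minimal sketch of consuming a Reddit "hot" listing
// (e.g. https://www.reddit.com/r/artificial/hot.json), reduced to the
// fields the pipeline cares about. Field names follow Reddit's public
// JSON API; everything else is illustrative.
interface RedditChild {
  data: { title: string; score: number; permalink: string };
}

export function topPosts(listing: { data: { children: RedditChild[] } }) {
  return listing.data.children
    .map((c) => ({ title: c.data.title, points: c.data.score }))
    .sort((a, b) => b.points - a.points); // highest-signal post first
}

// Sample payload mirroring the posts above (titles abbreviated):
const sample = {
  data: {
    children: [
      { data: { title: "The Rationing", score: 21, permalink: "/r/artificial/1" } },
      { data: { title: "World models will be the next big thing", score: 92, permalink: "/r/artificial/2" } },
    ],
  },
};
console.log(topPosts(sample)[0].points); // 92
```

&lt;p&gt;Sorting by score keeps the strongest post first regardless of the listing order Reddit returns.&lt;/p&gt;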

&lt;h2&gt;
  
  
  Step 2: Generate Product Design
&lt;/h2&gt;

&lt;p&gt;Using ComfyUI with SDXL Turbo (running in Docker with GPU), we generated a circuit board pattern design in the terminal/hacker aesthetic.&lt;/p&gt;
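&lt;p&gt;For flavor, here is a minimal sketch of how a client queues work against ComfyUI's HTTP API (&lt;code&gt;POST /prompt&lt;/code&gt; accepts a workflow graph keyed by node id). The node ids, checkpoint filename, and prompt text are placeholders, not the platform's real workflow.&lt;/p&gt;

```typescript
// Builds the request body for ComfyUI's POST /prompt endpoint. Each node
// has a class_type and inputs; links between nodes are ["nodeId", outputIndex]
// pairs. Node ids and the checkpoint filename here are placeholders.
export function buildPromptRequest(positive: string) {
  const workflow = {
    "1": { class_type: "CheckpointLoaderSimple",
           inputs: { ckpt_name: "sd_xl_turbo_1.0_fp16.safetensors" } },
    "2": { class_type: "CLIPTextEncode",
           inputs: { text: positive, clip: ["1", 1] } },
  };
  return { method: "POST", body: JSON.stringify({ prompt: workflow }) };
}

const req = buildPromptRequest("green circuit board pattern, terminal aesthetic");
console.log(JSON.parse(req.body).prompt["2"].inputs.text);
```

&lt;p&gt;A real workflow would add sampler, latent, and save-image nodes; this shows only the request shape.&lt;/p&gt;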

&lt;h2&gt;
  
  
  Step 3: Create Printify Product
&lt;/h2&gt;

&lt;p&gt;The design was uploaded to Printify and turned into a real product:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Circuit Mind Tee&lt;/strong&gt;: Gildan Unisex, sizes S-XL, $24.99&lt;/li&gt;
&lt;li&gt;Store: &lt;a href="https://iamhitl.printify.me/product/27686964" rel="noopener noreferrer"&gt;https://iamhitl.printify.me/product/27686964&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Mockup auto-rendered by Printify&lt;/li&gt;
&lt;/ul&gt;
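&lt;p&gt;A sketch of the payload such a step sends to Printify's product-creation endpoint (&lt;code&gt;POST /v1/shops/{shop_id}/products.json&lt;/code&gt;). The &lt;code&gt;blueprint_id&lt;/code&gt;, &lt;code&gt;print_provider_id&lt;/code&gt;, and variant ids below are placeholders; real values come from Printify's catalog endpoints.&lt;/p&gt;

```typescript
// Product-creation payload for Printify's REST API. Prices are in cents;
// blueprint, provider, and variant ids are placeholders, not the actual
// Gildan tee's catalog values.
export function buildProduct(uploadedImageId: string) {
  return {
    title: "Circuit Mind Tee",
    description: "Green circuit board patterns on black.",
    blueprint_id: 6,          // placeholder: look up via /v1/catalog/blueprints.json
    print_provider_id: 99,    // placeholder
    variants: [{ id: 1, price: 2499, is_enabled: true }], // $24.99 in cents
    print_areas: [{
      variant_ids: [1],
      placeholders: [{ position: "front",
                       images: [{ id: uploadedImageId, x: 0.5, y: 0.5,
                                  scale: 1, angle: 0 }] }],
    }],
  };
}

console.log(buildProduct("img-123").variants[0].price); // 2499
```

&lt;p&gt;The image id references a prior upload to &lt;code&gt;/v1/uploads/images.json&lt;/code&gt;; Printify renders the mockups server-side once the product exists.&lt;/p&gt;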

&lt;h2&gt;
  
  
  Step 4: Publish Everywhere
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LinkedIn&lt;/strong&gt;: Queued to I am HITL page (HITL-031)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dev.to&lt;/strong&gt;: This article&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reddit&lt;/strong&gt;: Posted to r/test&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Podcast&lt;/strong&gt;: Episode narrated via Piper TTS&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 5: Audio Podcast
&lt;/h2&gt;

&lt;p&gt;The platform's TTS sidecar (Piper, CPU-only) narrated the product announcement and published it to the podcast RSS feed.&lt;/p&gt;
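&lt;p&gt;Conceptually, a narration job like this shells out to the Piper CLI, which reads text on stdin and writes a wav via &lt;code&gt;--output_file&lt;/code&gt;. The sketch below is illustrative only (model path and function names are made up), not the sidecar's actual code.&lt;/p&gt;

```typescript
import { spawn } from "node:child_process";

// Assembles the Piper CLI arguments; --model and --output_file are real
// Piper flags, the model path is a placeholder.
export function piperArgs(modelPath: string, outFile: string): string[] {
  return ["--model", modelPath, "--output_file", outFile];
}

// Pipes the narration text to Piper's stdin and lets it write the wav.
export function narrate(text: string, outFile: string) {
  const proc = spawn("piper", piperArgs("/models/en_US-lessac-medium.onnx", outFile));
  proc.stdin.write(text);
  proc.stdin.end();
  return proc;
}
```

&lt;p&gt;Because Piper runs CPU-only, a sidecar like this needs no GPU scheduling; it queues behind the image-generation container without contention.&lt;/p&gt;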

&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API Server&lt;/td&gt;
&lt;td&gt;Port 3847, 554 queue items&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ComfyUI&lt;/td&gt;
&lt;td&gt;SDXL Turbo, GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTS Sidecar&lt;/td&gt;
&lt;td&gt;Piper, 5 models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reddit&lt;/td&gt;
&lt;td&gt;AI_Conductor connected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YouTube&lt;/td&gt;
&lt;td&gt;Data API search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Printify&lt;/td&gt;
&lt;td&gt;Shop 26949355&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All services Docker-composed. 358 test files, 5,699 tests passing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with the ORCHESTRATE framework. &lt;a href="https://iamhitl.com" rel="noopener noreferrer"&gt;iamhitl.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Provenance &amp;amp; Attribution
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Platform&lt;/strong&gt;: ORCHESTRATE Marketing Platform V3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Author&lt;/strong&gt;: Michael Polzin (iamhitl.com)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Agents&lt;/strong&gt;: Claude Opus 4.6, ComfyUI SDXL Turbo, Piper TTS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Sources&lt;/strong&gt;: Reddit r/artificial, YouTube Data API&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>startup</category>
      <category>webdev</category>
      <category>showdev</category>
    </item>
    <item>
      <title>5,699 Tests, Zero Stubs: How We UAT-Verified a 25-Agent AI Marketing Platform</title>
      <dc:creator>ORCHESTRATE</dc:creator>
      <pubDate>Mon, 30 Mar 2026 23:12:07 +0000</pubDate>
      <link>https://dev.to/tmdlrg/5699-tests-zero-stubs-how-we-uat-verified-a-25-agent-ai-marketing-platform-3f6</link>
      <guid>https://dev.to/tmdlrg/5699-tests-zero-stubs-how-we-uat-verified-a-25-agent-ai-marketing-platform-3f6</guid>
      <description>&lt;h1&gt;
  
  
  5,699 Tests, Zero Stubs: How We UAT-Verified a 25-Agent AI Marketing Platform
&lt;/h1&gt;

&lt;p&gt;358 test files. 5,699 individual tests. Every single one passing. No stubs. No deferrals. No skipped scenarios.&lt;/p&gt;

&lt;p&gt;This is the UAT completion report for Sprint 11 of our AI marketing platform: the sprint where we validated everything we built across 10 previous sprints.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Sprint 11 Delivered
&lt;/h2&gt;

&lt;p&gt;20 stories. ~70 tickets. ALL DONE.&lt;/p&gt;

&lt;p&gt;The platform now operates 5 live channels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LinkedIn&lt;/strong&gt;: 4 branded pages with automated scheduling (554 queued posts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dev.to&lt;/strong&gt;: API-integrated blog publishing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reddit&lt;/strong&gt;: OAuth-connected posting (AI_Conductor)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YouTube&lt;/strong&gt;: Video upload with analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Podcast&lt;/strong&gt;: RSS feed with iTunes namespace, TTS narration via Piper&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plus infrastructure: content sourcing from RSS feeds, quality gates with trust scoring, HITL review queues, knowledge graphs, brand voice compliance, citation verification, and observability dashboards.&lt;/p&gt;

&lt;h2&gt;
  
  
  The UAT Process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1: Fix Every Test
&lt;/h3&gt;

&lt;p&gt;We started UAT and discovered 28 test files failing. 24 were Sprint 11 E2E tests using a custom runner pattern (raw async functions with &lt;code&gt;process.exit()&lt;/code&gt;) that vitest could not discover.&lt;/p&gt;

&lt;p&gt;We converted all 24 to proper vitest &lt;code&gt;describe&lt;/code&gt;/&lt;code&gt;it&lt;/code&gt; format in a single batch operation. Then fixed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auth middleware tests bypassing in dev mode&lt;/li&gt;
&lt;li&gt;Windows EPERM on temp directory cleanup&lt;/li&gt;
&lt;li&gt;Dev.to 429 rate limit resilience&lt;/li&gt;
&lt;li&gt;Filesystem path references to Docker-only files&lt;/li&gt;
&lt;li&gt;Route mismatches between tests and actual API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: 358 files, 5,699 tests, &lt;strong&gt;zero failures&lt;/strong&gt;.&lt;/p&gt;
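&lt;p&gt;The batch conversion can be sketched as a source-to-source transform: strip the &lt;code&gt;process.exit()&lt;/code&gt; calls and wrap the runner body in &lt;code&gt;describe&lt;/code&gt;/&lt;code&gt;it&lt;/code&gt;. This is a simplified illustration; the real conversion handled more cases than a single regex.&lt;/p&gt;

```typescript
// Simplified sketch of the runner-to-vitest batch conversion: remove
// process.exit() calls (which kill vitest's worker) and wrap the legacy
// body in a discoverable describe/it suite.
export function convertRunner(source: string, suiteName: string): string {
  const body = source
    .replace(/process\.exit\(\d*\);?/g, "") // exits become thrown assertion errors
    .trim();
  return [
    'import { describe, it } from "vitest";',
    "",
    `describe(${JSON.stringify(suiteName)}, () => {`,
    `  it("runs the migrated scenario", async () => {`,
    body.split("\n").map((l) => "    " + l).join("\n"),
    "  });",
    "});",
  ].join("\n");
}

const legacy = "await checkHealth();\nprocess.exit(0);";
console.log(convertRunner(legacy, "e2e health"));
```

&lt;p&gt;Run once per file over all 24 runners, a script like this does in seconds what hand-editing would take hours to do.&lt;/p&gt;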

&lt;h3&gt;
  
  
  Phase 2: Verify Every Story
&lt;/h3&gt;

&lt;p&gt;20 stories, each verified against the running system with specific evidence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API endpoint responses&lt;/li&gt;
&lt;li&gt;Live service health checks&lt;/li&gt;
&lt;li&gt;Test suite output&lt;/li&gt;
&lt;li&gt;External platform confirmations (YouTube video live, Dev.to article exists, Reddit connected)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 3: UAT Scenarios
&lt;/h3&gt;

&lt;p&gt;8 plain-language scenarios covering critical user flows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LinkedIn Publishing Flow&lt;/li&gt;
&lt;li&gt;Content Sourcing Flow&lt;/li&gt;
&lt;li&gt;Audio Narration Flow&lt;/li&gt;
&lt;li&gt;Quality Review Flow&lt;/li&gt;
&lt;li&gt;Multi-Channel Distribution&lt;/li&gt;
&lt;li&gt;Morning Review Workflow&lt;/li&gt;
&lt;li&gt;Podcast Production Pipeline&lt;/li&gt;
&lt;li&gt;Merchandise Catalog Access&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Phase 4-7: Release, Reports, Sign-Off
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Release v3.0.0-sprint11 created&lt;/li&gt;
&lt;li&gt;Burndown: 132 story points delivered&lt;/li&gt;
&lt;li&gt;Cycle time: 8.2 hours average per story&lt;/li&gt;
&lt;li&gt;Stakeholder sign-off: APPROVED&lt;/li&gt;
&lt;li&gt;Audit chain: 500+ events verified&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Test format matters&lt;/strong&gt;: Custom runners that call &lt;code&gt;process.exit()&lt;/code&gt; kill vitest. Use &lt;code&gt;describe&lt;/code&gt;/&lt;code&gt;it&lt;/code&gt; from the start.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dev mode bypasses break tests&lt;/strong&gt;: Auth middleware that skips enforcement when no secret is set will pass everything; set &lt;code&gt;AUTH_SECRET&lt;/code&gt; in test setup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External API rate limits are not bugs&lt;/strong&gt;: Dev.to returning 429 during a full suite proves connectivity works. Catch it gracefully.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker filesystem != test filesystem&lt;/strong&gt;: Tests checking for audio files on disk fail when those files only exist inside containers. Use API verification instead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch conversion works&lt;/strong&gt;: Converting 24 files from one format to another in a single script is faster than editing each one manually.&lt;/li&gt;
&lt;/ol&gt;
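&lt;p&gt;Lesson 2 in miniature (names hypothetical, not our middleware's actual API): a guard that waves requests through when no secret is configured makes every auth test pass vacuously, which is exactly why test setup must set the secret.&lt;/p&gt;

```typescript
// Hypothetical auth check illustrating the dev-mode bypass. When
// authSecret is unset, everything is "authorized" and tests prove nothing;
// only with a secret configured does the real comparison run.
export function isAuthorized(authSecret: string | undefined,
                             token: string | undefined): boolean {
  if (!authSecret) return true; // dev-mode bypass: harmless locally, fatal in tests
  return token === authSecret;
}

console.log(isAuthorized(undefined, "anything")); // true: the bypass
console.log(isAuthorized("s3cret", "wrong"));     // false: real enforcement
```

&lt;p&gt;Setting the secret in a vitest setup file forces the second branch in every test run.&lt;/p&gt;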

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Test files&lt;/td&gt;
&lt;td&gt;358&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Individual tests&lt;/td&gt;
&lt;td&gt;5,699&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stories verified&lt;/td&gt;
&lt;td&gt;20/20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Channels operational&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UAT scenarios&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queue items&lt;/td&gt;
&lt;td&gt;554&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Story points delivered&lt;/td&gt;
&lt;td&gt;132&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg cycle time&lt;/td&gt;
&lt;td&gt;8.2 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What This Means
&lt;/h2&gt;

&lt;p&gt;This platform was built by AI agents following the ORCHESTRATE methodology: structured constraints that eliminate ambiguity and focus effort on quality. Every ticket went through Documentation-Driven TDD. Every story had acceptance criteria. Every phase had evidence.&lt;/p&gt;

&lt;p&gt;The UAT phase proved that the system works end-to-end, not just in isolation. Real API calls. Real data flowing through real channels. Real tests proving real behavior.&lt;/p&gt;

&lt;p&gt;No stubs. No deferrals. Nothing left behind.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with the ORCHESTRATE framework. Learn more at &lt;a href="https://iamhitl.com" rel="noopener noreferrer"&gt;iamhitl.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Provenance &amp;amp; Attribution
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Platform&lt;/strong&gt;: ORCHESTRATE Marketing Platform V3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Author&lt;/strong&gt;: Michael Polzin (iamhitl.com)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sprint&lt;/strong&gt;: 11 (Full Inception Scope Validation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Agents&lt;/strong&gt;: Claude Opus 4.6 (1M context)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Methodology&lt;/strong&gt;: ORCHESTRATE Agile with DD-TDD&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>devops</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
