DEV Community: Bobai Kato

Why ok: true Is Not Enough for AI Agent Execution

Bobai Kato — Tue, 21 Jul 2026 12:26:17 +0000

A successful command is not broad authorization

ok: true answers one useful question:

Did the selected Ota execution complete successfully?

It does not answer:

Did this lane prove every dependency mattered?
Did it cover every workflow in the repository?
Did it exercise a live external integration?
Is an agent now authorized to publish, deploy, or migrate?

Humans often carry those distinctions in context. Automated systems usually do not. If the only
machine signal is a Boolean, a narrow green result will eventually be consumed as a broader claim.

Ota 1.6.24 keeps execution success, proof breadth, and remaining uncertainty separate.

Read the proof carrier, not one field

Run the declared workflow through Ota's proof boundary:

ota proof runtime --workflow app --json --archive

A successful but bounded result can look like this:

{
  "ok": true,
  "phase": "readiness",
  "proof_verdict": "passed_with_unproven_boundaries",
  "proof_scope": {
    "kind": "runtime_path",
    "proof_class": "slice_proof",
    "workflow": "app",
    "task": "serve"
  },
  "not_proved": [
    {
      "kind": "broader_repo_completion_not_proved",
      "relative_to": "runtime_path"
    }
  ]
}

These fields are intentionally different:

ok reports selected execution success;
phase says where that result was established;
proof_verdict reports the derived proof posture;
proof_scope names the lane the artifact can speak for;
not_proved[] carries material boundaries that remain outside the claim.

An agent that reads only ok is not consuming the Ota proof contract correctly.

Why a qualified pass is not a failure

Suppose Ota starts an API and verifies its declared HTTP listener. That is useful runtime evidence.
It may prove the selected service became ready on the declared path.

It does not automatically prove:

a neighbouring release workflow;
a live payment or model API;
production data behavior;
a database's complete output-shaping role;
repository-wide completion.

Returning ok: false would discard the real success. Returning an unqualified pass would overstate
it. passed_with_unproven_boundaries preserves both truths.

The boundary is part of the result, not an apology after it.

What an agent should do next

An agent or CI integration should evaluate the result in this order:

Require ok: true for the selected execution.
Confirm proof_scope matches the task or workflow it intended to prove.
Read proof_verdict; do not translate every green execution into an unqualified proof pass.
Inspect not_proved[] for a boundary relevant to the next action.
Apply policy before crossing into a heavier lane such as deployment, migration, publishing, or live external effects.

That last step matters. Proof and authorization are related, but they are not the same object. A
runtime receipt can establish what happened. It should not silently grant permission for a more
dangerous task.
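The evaluation order above can be sketched as a small gate function. This is a minimal sketch under stated assumptions, not Ota's implementation: `evaluate_proof`, `BLOCKING_BOUNDARIES`, `HEAVY_ACTIONS`, and the policy rules are hypothetical names chosen for illustration. Only the result field names come from the example earlier in this post.

```python
from __future__ import annotations

from typing import Any

# Hypothetical policy table: boundary kinds that block a given next action.
BLOCKING_BOUNDARIES = {
    "deploy": {"broader_repo_completion_not_proved"},
    "publish": {"broader_repo_completion_not_proved"},
}
# Hypothetical set of heavier lanes that always need explicit authorization.
HEAVY_ACTIONS = {"deploy", "migrate", "publish", "live_external_effect"}


def evaluate_proof(result: dict[str, Any], workflow: str, task: str,
                   next_action: str,
                   authorized: frozenset | set = frozenset()) -> tuple[bool, str]:
    """Return (may_proceed, reason) for next_action given one proof result."""
    # 1. The selected execution must have succeeded.
    if result.get("ok") is not True:
        return False, "selected execution did not succeed"

    # 2. The artifact must speak for the lane we intended to prove.
    scope = result.get("proof_scope") or {}
    if scope.get("workflow") != workflow or scope.get("task") != task:
        return False, "proof_scope does not cover the intended lane"

    # 3. Read the derived verdict instead of inferring it from ok.
    verdict = result.get("proof_verdict")
    if not verdict:
        return False, "proof_verdict is missing"

    # 4. Look for an unproven boundary that is relevant to the next action.
    kinds = {b.get("kind") for b in result.get("not_proved", [])}
    relevant = kinds & BLOCKING_BOUNDARIES.get(next_action, set())
    if relevant:
        return False, "unproven boundary blocks action: " + ", ".join(sorted(relevant))

    # 5. Proof is not authorization: heavier lanes need explicit policy approval.
    if next_action in HEAVY_ACTIONS and next_action not in authorized:
        return False, "policy has not authorized " + next_action

    return True, f"proceed within {workflow}/{task} ({verdict})"
```

With the bounded result shown earlier, this sketch allows a follow-up that stays inside the `app` / `serve` lane, and refuses `deploy` even when policy lists it, because the `broader_repo_completion_not_proved` boundary is still open.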

The verdict is derived, not narrated

Ota derives the terminal proof verdict at the decision boundary. A materially bounded proof cannot
serialize as an unqualified pass merely because an output formatter prefers a simpler status.
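One way to picture a derived verdict is a function that computes the status from structured state and hands it to the formatter as a finished value. This is a hedged sketch, not Ota's code: `derive_verdict` and `serialize` are hypothetical, and the verdict names `passed` and `failed` are assumptions. Only `passed_with_unproven_boundaries` appears in this post.

```python
from __future__ import annotations


def derive_verdict(ok: bool, not_proved: list[dict]) -> str:
    """Derive the terminal verdict from execution status and open boundaries."""
    if not ok:
        return "failed"  # assumed name, not taken from Ota
    if not_proved:
        # A materially bounded proof can never collapse to an unqualified pass.
        return "passed_with_unproven_boundaries"
    return "passed"  # assumed name, not taken from Ota


def serialize(ok: bool, not_proved: list[dict]) -> dict:
    """Formatter step: it receives the verdict and never simplifies it."""
    return {
        "ok": ok,
        "proof_verdict": derive_verdict(ok, not_proved),
        "not_proved": list(not_proved),
    }
```

The design point is that the formatter has no branch in which it can choose a simpler status. The only input that removes the qualification is an empty `not_proved` list.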

The same principle applies to dependency evidence. Caller-side activity, dependency readiness,
transaction-bound exercise, and negative-control causality retain different evidence levels. A
consumer does not get to upgrade them by ignoring the accompanying boundary.

That is the practical difference between execution logs and governance evidence: the artifact
carries enough structure for another system to refuse an over-broad interpretation.

What this does not solve

Ota cannot stop a general-purpose agent from bypassing Ota and invoking an arbitrary shell directly.
The repository contract is the specification; enforcement must also exist at the callable runner,
sandbox, CI, or merge-gate chokepoint.

Within Ota, agent mode restricts execution to the declared safe closure. Outside Ota, a harness must
expose Ota's governed lanes instead of handing the agent an unrestricted shell and hoping it behaves.

That boundary is important because ok: true is evidence, not a universal capability token.

The operational rule

For humans, agents, and CI:

Treat ok as execution status. Treat proof_verdict, proof_scope, and not_proved as the
authority on what that success established.

This makes a green result more useful, not less. It can safely drive automation because its limits
are available to the same machine that reads its success.

References

Originally posted here: https://ota.run/blog/why-ok-true-is-not-enough-for-ai-agent-execution

Pressure-testing Ota on Open WebUI: proof cleanup ownership, bootstrap truth, and native vs Compose runtime boundaries

Bobai Kato — Mon, 20 Jul 2026 18:43:02 +0000

Overview

Open WebUI exposed a real Ota lifecycle boundary.

This repo did not mainly pressure parsing or contract shape. The contract was already strong enough to model:

source-checkout verification
packaged native runtime through uv run open-webui serve
frontend development runtime
default Docker Compose runtime

What the repo exposed was operational truth after proof:

a successful native proof still left a host workload alive
the first cleanup fix then widened too far and treated a Compose-owned runtime as the same class of host workload

That made Open WebUI a valuable pressure repo.

It forced Ota to get more precise about cleanup ownership instead of treating all successful runtime proof as one generic teardown problem.

The current pressure contract pins released Ota v1.6.24. Its latest green matrix run proves
the release surface at the exact contract and workflow revision linked below.

What Open WebUI exposed in Ota

This repo exposed four meaningful weaknesses.

1. proof success was weaker than it looked

The first issue was not that runtime proof failed.

It was that runtime proof succeeded and still left the native workload alive afterward.

In this repo, the packaged native workflow launches:

serve:native:
  launch:
    kind: command
    exe: uv
    args:
      - run
      - open-webui
      - serve
      - --host
      - 0.0.0.0
      - --port
      - "8080"

Ota proved that workflow, but the launched process tree was still alive after proof completed.

In GitHub Actions, that surfaced through setup-uv post-job cleanup, which blocked while the uv cache was still in use.

That was an Ota gap.

If proof succeeds but leaves behind repo-owned runtime state that later breaks CI cleanup, the proof surface is still incomplete.

2. native service cleanup widened past its real ownership boundary

The first core fix made Ota clean selected native service workloads after successful proof.

That was directionally correct, but Open WebUI immediately exposed the next boundary.

The Docker workflow uses a native task shape to launch Compose:

compose:up:
  launch:
    kind: command
    exe: docker
    args:
      - compose
      - up

That is a native command surface, but it is not the same kind of ownership as a host-managed runtime like uv run open-webui serve.

The first cleanup widening treated it as if Ota should reclaim the resulting listener directly like a host workload. That was too broad.

This repo made the distinction unavoidable:

host-owned native process workload
adapter-owned Compose runtime

Those are not the same cleanup family.

3. contract-owned CI bootstrap truth still matters

The workflow was already using ota-run/setup with contract mode, which is the right direction.

The contract bootstrap truth also had to stay aligned so the setup surface consumed the same install story the repo intended to publish.

That was not the original runtime bug, but it was still real governance drift.

A pressure repo should not say one thing in ota.yaml and effectively rely on another in CI.

4. mixed-runtime repos punish vague lifecycle ownership

Open WebUI combines several legitimate paths:

Python ownership through uv
Node ownership for frontend/build work
a packaged native runtime
a frontend-only runtime
a Docker Compose runtime

That combination matters because it punishes sloppy lifecycle assumptions quickly.

If Ota collapses all of that into “run a process and then stop something later,” the product boundary is still too weak.

What had to be fixed in Ota core

Open WebUI led to three real Ota fixes.

proof success-path cleanup for host-owned native workloads

Ota's proof teardown was tightened so it no longer stops only the outer ota up process.

It now also cleans selected native host service workloads that Ota actually owns for the chosen workflow.

That closes the gap where proof succeeded but left the packaged native runtime process tree alive after completion.

cleanup narrowing for Compose-owned runtime lanes

The first cleanup fix exposed a second Ota boundary, and that also had to be corrected.

Ota now skips that host-workload cleanup path for Compose-owned native service tasks such as docker compose up.

That matters because Compose runtime cleanup belongs to adapter-owned repo cleanup surfaces, not to host-process workload reclamation.

Without that distinction, Ota was overreaching past the lifecycle it actually owned.
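The ownership split can be sketched as a classifier over the declared launch shape. This is an illustration under assumptions, not Ota's cleanup logic: `cleanup_owner` and the labels `host_process` and `compose_adapter` are hypothetical. The launch dictionaries mirror the task snippets shown earlier in this post.

```python
from __future__ import annotations


def cleanup_owner(launch: dict) -> str:
    """Classify which cleanup family a launched service task belongs to."""
    exe = launch.get("exe", "")
    args = list(launch.get("args", []))

    # A typed Compose launch is adapter-owned by definition.
    if launch.get("kind") == "compose":
        return "compose_adapter"

    # `docker compose ...` is a native command surface, but the runtime it
    # starts is owned by Compose, not by the host process tree.
    if exe == "docker" and args[:1] == ["compose"]:
        return "compose_adapter"

    # Everything else launched as a command is treated as a host workload.
    return "host_process"


def should_reclaim_host_workload(launch: dict) -> bool:
    """Host-workload reclamation applies only to host-owned launches."""
    return cleanup_owner(launch) == "host_process"
```

Under this sketch, `uv run open-webui serve` is reclaimed as a host workload after proof, while `docker compose up` is skipped and left to the adapter-owned cleanup surface.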

proof cleanup now respects workflow-owned active execution scope

The last failure was subtler.

Docker proof itself was healthy, but teardown still failed because cleanup hit Ota's active execution registry and treated the proved compose:up service task as an unrelated blocker.

That was also an Ota gap.

The selected workflow had launched that runtime, proved it, and was now trying to clean up its own owned execution surface.

Ota now scopes workflow cleanup to the active execution ownership actually selected by the proof lane.

That keeps ordinary ota clean strict while allowing proof teardown to reclaim the workflow-owned execution it just created.
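The scoping rule can be sketched as a split over an active execution registry. This is a simplified model, not Ota's registry: the `ActiveExecution` record and `plan_cleanup` are hypothetical names, and a real registry would carry more identity than a workflow name.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ActiveExecution:
    execution_id: str
    workflow: str
    task: str


def plan_cleanup(registry: list[ActiveExecution],
                 owner_workflow: str | None = None):
    """Split active executions into (reclaimable, blockers).

    With no owner, every active execution is a blocker (strict clean).
    With an owner, only executions that workflow launched are reclaimable.
    """
    reclaimable, blockers = [], []
    for record in registry:
        if owner_workflow is not None and record.workflow == owner_workflow:
            reclaimable.append(record)
        else:
            blockers.append(record)
    return reclaimable, blockers
```

Calling `plan_cleanup(registry)` models the strict path, where any active execution blocks cleanup. Calling `plan_cleanup(registry, "docker")` models proof teardown, where the proved `compose:up` execution is reclaimable and unrelated executions still block.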

What changed in the contract

The final contract is stronger because it uses cleaner contract-owned truth in the same places where the product itself became stronger.

The packaged native workflow stays explicit:

native:
  intent: local_development
  prepare:
    task: setup:env
  setup:
    task: setup
  run:
    task: serve:native
  readiness:
    surfaces:
      - native-web
  exposes:
    - surface: native-web

The Docker workflow stays separate instead of pretending it is the same runtime family:

docker:
  intent: packaged_runtime
  run:
    task: compose:up
  readiness:
    surfaces:
      - docker-web
  exposes:
    - surface: docker-web

That separation matters because the repo is not publishing one runtime claim.

It is publishing at least two different runtime claims with different lifecycle owners.

The contract bootstrap truth was also aligned. The branch carries a 1.6.24 contract floor and a
structured bootstrap source, so CI, humans, and agents all read the same Ota install intent from
the repo contract and use the same released Ota surface.

What the matrix now proves

The matrix does more than parse the contract, but each OS lane has an explicit scope.

All three lanes run ota validate, capture ota doctor, show task and workflow usage, inspect
execution topology, and dry-run the verify task and native workflow. They also dry-run the
finite backend-format and frontend-test tasks.

Ubuntu is the proof lane. It additionally runs the full frontend verification aggregate, the
finite backend-format and frontend-test tasks, then proves both the packaged native runtime and
the Docker Compose runtime. It also dry-runs the Docker workflow before executing its proof lane.

macOS and Windows are deliberately audit-only: they do not claim task execution, native runtime
proof, Compose proof, or Docker-workflow coverage. That is an explicit matrix boundary, not an
implicit gap hidden behind a green OS label.

The contract also models the frontend development runtime and the Playwright-enabled Compose
overlay. Neither is runtime-proved by this matrix. GPU and API-exposure overlays remain outside the
declared pressure scope. Those are modeled paths, not evidence that every Open WebUI runtime family
has been proved.

The green v1.6.24 matrix run proved the tightened lifecycle boundaries on Ubuntu:

the native proof succeeded and cleaned up its host-owned runtime
the Docker Compose proof succeeded and tore down without tripping over its own active execution record
the contract-owned bootstrap, real verification tasks, and runtime-proof lanes agreed

Why this repo mattered

Open WebUI mattered because it is exactly the kind of repo that exposes whether a readiness system really understands lifecycle ownership.

It is not enough to support:

Python
Node
a native service task
a Compose runtime task

The product also has to know which cleanup responsibility belongs to which path.

This repo forced Ota to answer that cleanly:

proof success should not leave host-owned native workloads behind
Compose-owned runtimes should not be reclaimed as if they were the same class of host process

That is not cosmetic polish.

That is operational trust.

If Ota says a runtime path was proved, it should also leave the repo and CI environment in a truthful post-proof state.

Remaining boundary

The pressure pass also exposed a separate dependency-hydration limitation. Open WebUI's committed
uv.lock was stale, and a normal project uv sync can rewrite it before an unrelated verification
task inspects the working tree. The contract avoids that mutation for the upstream Ruff formatting
gate by using an isolated uvx tool bootstrap.

That is a truthful contract choice, not a complete Ota solution. Ota still needs a typed uv
frozen-lockfile posture for dependency hydration so a contract can require uv to refuse stale lock
state before execution rather than relying on task-specific avoidance. This remains an Ota product
gap; it was not hidden to make the matrix green.

Pressure-testing Ota on Kylrix: Next.js runtime projection and dual-mode contributor proof

Bobai Kato — Sat, 18 Jul 2026 15:36:41 +0000

Overview

Kylrix is a large Next.js workspace with two materially different local stories:

a contributor path backed by SQLite
a self-hosted Compose topology with Appwrite and supporting services

That distinction made it a useful Ota pressure target. A contract that treated both paths as one
generic “run the app” command would either hide the contributor truth or overclaim the self-hosted
one.

The useful outcome was not more shell automation. It was a narrower contributor contract and an
honest production-topology boundary.

The contributor path needed one source of listener truth

The contributor runtime is a Next.js service on port 3005. Before this pressure pass, a contract
author had to duplicate that bind truth in two places:

Next.js launch flags
the listener used for readiness and projected URLs

That duplication drifts easily. The stronger Ota shape is to keep the listener canonical and project
known server flags from it:

surfaces:
  contributor:web:
    kind: http
    port: 3005
    path: /

tasks:
  dev:
    launch:
      kind: command
      exe: pnpm
      args: [exec, next, dev, --turbopack]
      runtime_projection:
        listener: contributor:web
        adapter: nextjs

Ota derives the supported Next.js bind flags from the declared listener. Readiness and the rendered
external URL now describe the same runtime truth as the process command.
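The projection idea can be sketched as a lookup from adapter to bind flags. This is not Ota's projection code: `BIND_FLAGS`, `project_bind_args`, `build_command`, and the listener dictionary shape are assumptions. The flag spellings themselves are the public `next dev` options (`--hostname`, `--port`) and `uvicorn` options (`--host`, `--port`).

```python
from __future__ import annotations

# Bind flags accepted by each known server CLI.
BIND_FLAGS = {
    "nextjs": {"host": "--hostname", "port": "--port"},
    "uvicorn": {"host": "--host", "port": "--port"},
}


def project_bind_args(adapter: str, listener: dict) -> list[str]:
    """Derive bind flags from the canonical listener declaration."""
    flags = BIND_FLAGS[adapter]  # raises KeyError for an unsupported adapter
    projected: list[str] = []
    if "host" in listener:
        projected += [flags["host"], str(listener["host"])]
    if "port" in listener:
        projected += [flags["port"], str(listener["port"])]
    return projected


def build_command(exe: str, args: list[str], adapter: str, listener: dict) -> list[str]:
    """Compose the final argv, refusing bind truth duplicated in args."""
    known = set(BIND_FLAGS[adapter].values())
    duplicated = [arg for arg in args if arg in known]
    if duplicated:
        raise ValueError(f"bind flags must come from the listener: {duplicated}")
    return [exe, *args, *project_bind_args(adapter, listener)]
```

For the contributor task above, `build_command("pnpm", ["exec", "next", "dev", "--turbopack"], "nextjs", {"port": 3005})` appends `--port 3005` to the declared args. Rejecting hand-written bind flags is one possible way to keep the listener as the single source of truth.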

Native and container verification should not fight over state

Kylrix also needed a real dual-mode contributor path. The finite verification workflow runs the
same SQLite-backed test, lint, and build tasks on the host or in Ota's pinned Node container.

The important detail is attachment isolation:

execution:
  backends:
    container:
      image: node@sha256:a25c9934ff6382cd4f08b6bc26c82bf4ea69b1e6f8dabfb2ead457374127c365
  contexts:
    contributor:container:
      backend: container
      lifecycle: persistent
      attachments:
        isolated_paths: [node_modules, .next, .pnpm-store]

That makes the mode boundary operational instead of cosmetic. Container verification does not reuse
or corrupt native dependency and build state, while both lanes still execute the same declared
verification closure.

The self-hosted topology is governed, but bounded

Kylrix's full self-hosted stack is not a second contributor runtime. It uses Compose to build and
launch Kylrix, Appwrite, MariaDB, Redis, and Caddy. Ota owns the interpolation env file, Compose
file selection, image build, lifecycle, and readiness surface:

selfhost:up:
  launch:
    kind: compose
    action: up
    detach: true
  runtime:
    kind: service
    surfaces: [selfhost:web]

The contract does not claim that it has provisioned Appwrite credentials, completed the interactive
setup wizard, or proved application-level self-hosted behavior. Those steps require operator
credentials and are explicitly outside this workflow.

That boundary is as important as the Compose ownership. A green lifecycle proof should not be read
as a green product deployment.

What the matrix proves

The Kylrix Ota matrix proves the narrow contributor surface across native and container lanes:

contract validation, doctor, and task discovery
safe-agent task and workflow previews
real SQLite-backed test, lint, and build verification
native and container workflow receipts
native and container runtime proof for the Next.js contributor service

The self-hosted Compose workflow is modeled and dry-run covered, but not matrix-executed. It remains
deliberately outside the contributor proof claim because its Appwrite provisioning is operator-owned.

The released v1.6.24 pressure run is #29572073845.

Why this repo mattered

Kylrix reinforced two Ota design rules:

listener truth should be declared once and projected into supported launch adapters
native, container, and self-hosted paths must be governed as distinct execution realities, not flattened into one optimistic “works locally” claim

The result is a contract that helps contributors start with the finite SQLite path, gives Ota a
truthful container equivalent, and keeps the heavier Appwrite topology visible without pretending it
has been fully proved.

Pressure-testing Ota on lead-quorum: native Python truth, repo-local fulfillment, and runtime bind projection

Bobai Kato — Fri, 17 Jul 2026 12:51:03 +0000

Overview

lead-quorum was a strong pilot repo because it was small enough to reason about and real enough to fail honestly.

It has:

repo-local Python environment ownership
pinned dependency installation
env bootstrap from example truth
a deterministic local test surface
live external verification
a local web runtime
a distributed demo path
a Docker build lane

That is exactly the kind of repo where a contract can look clean while still hiding real setup and execution drift.

Why this repo mattered

The useful pressure here was not “can Ota run one Python command.”

The useful pressure was whether Ota could stay truthful when the repo itself owns:

the .venv
the dependency install lane
the local executable path
the runtime listener truth

If Ota probes or fulfills those in the wrong order, the contract is not trustworthy even if the repo itself is valid.

That is what made lead-quorum valuable.

What the contract now models

The final contract is explicit about the repo’s real setup split.

Setup is not one opaque shell step. It is three different ownership surfaces:

copy .env from .env.example only if missing
create the repo-local virtual environment
hydrate dependencies through typed uv requirements-file installation

That looks like this in the contract:

setup:
  aggregate:
    tasks:
      - setup:env
      - setup:venv
      - setup:deps

setup:env:
  action:
    kind: copy_if_missing
    from: .env.example
    to: .env

setup:venv:
  action:
    kind: ensure_virtualenv
    path: .venv
    python: "3.12"

setup:deps:
  prepare:
    kind: dependency_hydration
    medium: package_dependencies
    source:
      kind: uv
      cwd: .
      mode: pip_requirements
      requirements_file: requirements.txt

The contract also keeps verification and external-runtime claims separate:

verify for deterministic local validation
live for Gemini-backed end-to-end testing
app for the local web service
distributed for the A2A demo path

That matters because a working local scoring test and a live distributed runtime are not the same readiness claim.

What lead-quorum exposed in Ota

This repo exposed three real Ota gaps.

1. native repo-local fulfillment and probing were ordered incorrectly

Older Ota could still probe repo-local Python executables too early.

That is the wrong trust order for a repo that creates its own .venv as part of setup. If the repo-local interpreter path is declared as:

exe: .venv/bin/python

then Ota has to materialize the dependency/setup closure before treating that path as a fulfilled runtime command.

Otherwise a valid contract can fail just because Ota asked the question too early.

That is exactly what this repo exposed.

2. native Python candidate selection was too weak

lead-quorum also exposed a narrower but important gap in typed Python hydration.

When Ota selected a local Python candidate for setup and hydration, it could still choose the wrong host interpreter path instead of the repo’s intended environment. In practice that meant a dependency like cryptography could start building against the wrong target environment instead of the repo-owned Python lane.

That is not a repo bug. That is a readiness engine selecting the wrong execution truth.

3. runtime bind truth was duplicated between launch args and runtime listeners

The local web service made a third weakness obvious.

Before widening, the contract still had to repeat bind truth in two places:

command-line runtime args such as --host and --port
the declared runtime listener surface

That duplication is fragile.

The stronger product shape is for the runtime listener to stay canonical and for Ota to project the supported bind flags for known servers. lead-quorum became the first real repo to pressure that widening cleanly.

What changed in Ota

lead-quorum drove platform fixes, not repo-local workarounds.

native repo-local fulfillment now respects setup materialization order

Ota now runs the selected setup closure before probing repo-local backend/runtime commands that depend on that materialized state.

That closes the gap where a repo could declare a truthful .venv/bin/python lane and still fail because Ota evaluated it before the repo-local environment existed.
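The ordering rule is small enough to sketch directly. This is an illustration, not Ota's fulfillment engine: `probe` and `fulfill_then_probe` are hypothetical helpers, and the setup steps are modeled as plain callables that receive the repository path.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable, Iterable


def probe(repo: Path, exe: str) -> bool:
    """Check whether a repo-local executable path currently exists."""
    return (repo / exe).is_file()


def fulfill_then_probe(repo: Path,
                       setup_steps: Iterable[Callable[[Path], None]],
                       exe: str) -> bool:
    """Materialize the setup closure first, then ask the probe question."""
    for step in setup_steps:
        step(repo)
    # Probing only after setup avoids failing a valid contract too early.
    return probe(repo, exe)
```

Probing `.venv/bin/python` on a fresh checkout returns `False` even though the contract is valid. Running the setup closure first lets the same probe answer truthfully.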

Python candidate selection is now stricter and more truthful

Ota also now prefers version-matching Python candidates before falling back to weaker generic host candidates.

That makes typed Python hydration much less likely to drift onto the wrong interpreter family when the contract has already declared the intended Python lane.
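A minimal sketch of version-first candidate selection, assuming candidates arrive as `(path, version)` pairs in discovery order. `select_python` is a hypothetical helper, not Ota's selector, and the paths in the usage note are invented for illustration.

```python
from __future__ import annotations


def select_python(candidates: list[tuple[str, str]],
                  wanted: str) -> tuple[str, str] | None:
    """Prefer a candidate whose major.minor matches the declared version.

    Falls back to the first generic candidate only when nothing matches.
    """
    wanted_key = wanted.split(".")[:2]
    matching = [c for c in candidates if c[1].split(".")[:2] == wanted_key]
    if matching:
        return matching[0]
    return candidates[0] if candidates else None
```

Given a host interpreter at version `3.10.12` listed first and a `3.12.4` interpreter listed second, asking for `"3.12"` returns the second candidate rather than whichever the host happened to expose first.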

runtime listener truth can now project bind args

This repo also helped widen Ota’s runtime-to-launch projection.

The local web service can now declare listener truth once and let Ota project supported bind args for a known adapter:

launch:
  kind: command
  exe: .venv/bin/uvicorn
  args:
    - web.app:app
  runtime_projection:
    listener: web
    adapter: uvicorn

That is a cleaner long-term shape than duplicating --host and --port in every Python service contract.

What the matrix now proves

The released v1.6.24 pressure matrix on Ota's fork is green across Ubuntu, macOS, Windows, and
a Dockerfile-owned Ubuntu container lane. Ubuntu and macOS execute the native deterministic and
runtime-proof lanes. Windows is intentionally audit-only in this matrix: it validates the contract
and discovers the task surface, but does not execute native tasks or runtime proof.

That matters because this repo’s truth is not only “tests pass on one machine.” The matrix proves:

ota validate
ota doctor
ota tasks --use
ota tasks --safe --use
task dry-run coverage for setup, deterministic test, live test, distributed demo, and Docker build lanes
workflow dry-run coverage for verify, app, live, and distributed
real setup
real deterministic test
real verify workflow proof
real local app workflow proof
real docker:build
real verify:container workflow against the repository Dockerfile image
contract-modeled and dry-run-covered live external and distributed workflows; their real matrix execution is conditional on GOOGLE_API_KEY and was skipped in this released-version run

That is a much stronger outcome than “the contract parses.”

Why this repo was a good pilot

lead-quorum did exactly what a first pilot should do.

It did not mainly expose repo noise. It exposed trust gaps in Ota itself:

setup before probe
correct interpreter selection
one canonical runtime bind truth

Those are real product boundaries.

That is why this repo was worth pressure-testing.

It also reinforced an important standard for future pilots:

a receipt or contract is only useful if it stays aligned with the actual decision and execution path, not just with what the repo intended in prose.

lead-quorum is created and maintained by Vinicius Pereira. The
Ota integration is proposed upstream; the forked matrix below is pre-merge pressure evidence, not
evidence from the canonical repository.

Pressure-testing Ota on Bedrock: query identity as replay evidence

Bobai Kato — Fri, 17 Jul 2026 12:48:25 +0000

Overview

Bedrock, created and maintained by Vinicius Pereira, is not a
normal application test repo.

It is a natural-language-to-SQL stability harness. It records generated SQL across repeated runs,
replays the committed fixture against a frozen SQLite store, compares the results with a defended
answer key, and blocks a candidate when reliability regresses.

That made it a strong pressure target for a more specific Ota question:

when a verification lane replays prior model behavior, how should Ota distinguish the inputs that
defined the replay from observations that explain what happened?

The answer cannot be one undifferentiated receipt field.

Why Bedrock mattered

Bedrock already had the useful artifacts in its repository:

data/fixture.jsonl: one recorded SQL query per question and run
data/store.db: the frozen SQLite state
data/baseline.json: the defended regression baseline
docs/scorecard.md: the rendered stability result

Those artifacts answer different questions.

A committed fixture, store, and baseline are replay inputs. They define the deterministic lane.
A repeated SQL trace is observed behavior. It can show that one question produced different query
shapes, but it is not an input Ota used to decide whether the current command could start. In this
repo, fixture.jsonl has a valid dual role: the deterministic replay consumes it and Ota parses it
as historical query evidence. That does not make it independent corroboration; it is one captured
artifact with two explicitly separate meanings.

That distinction became the pressure boundary.

What the contract models

The deterministic gate now declares the replay inputs explicitly:

gate:
  replay_inputs:
    - id: recorded_sql
      kind: static_file
      path: data/fixture.jsonl
    - id: frozen_store
      kind: static_file
      path: data/store.db
    - id: defended_baseline
      kind: static_file
      path: data/baseline.json

Those files are captured as declared, static replay truth.

The recorded query trace is declared separately:

gate:
  witnessed_observations:
    query_traces:
      - id: recorded_sql
        path: data/fixture.jsonl

That is the important Ota boundary.

replay_inputs[] says what the deterministic gate consumed
witnessed_observations.query_traces[] says what behavior the fixture contains

The second must not be promoted into the first. Otherwise a receipt can make historical model
output look like a pre-execution decision input, which weakens replay and evidence semantics.

What the replay can honestly prove

Bedrock's offline path is deliberately narrow and useful:

replay the committed SQL fixture
execute it against the committed SQLite store
compare the result set with the defended answer key
compare the candidate score with the committed baseline

The contract makes that boundary visible:

gate:
  command:
    exe: python
    args:
      - -B
      - main.py
      - gate
  safe_for_agent: true

That proves the declared fixture, store, and baseline path. It does not prove that a live model
will generate the same SQL tomorrow, or that a live database has not changed.

Those claims remain separate in the contract's record:live lane:

record:live:
  effects:
    writes:
      - data/fixture.jsonl
    network: true
    network_kind: integration_test
    external_state:
      - anthropic_api

That lane reaches a live model and rewrites the recorded fixture. It is intentionally outside the
deterministic, agent-safe replay workflow.

What Bedrock exposed in Ota

Bedrock helped close an Ota evidence-model gap.

The initial temptation was to place per-record query identity under receipt
evaluated_inputs[]. That would have been wrong. A query trace is execution evidence, not a
current-run input.

Ota now keeps the split explicit:

declared static files stay under receipt evaluated_inputs[]
historical query identities and divergence summaries live under receipt witnessed_observations.query_traces[]

That gives later comparison and correlation a cleaner foundation:

inputs can remain unchanged
observed query behavior can still diverge

Neither fact is allowed to impersonate the other.
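The split can be sketched as a receipt fragment that keeps the two sections apart. This is a hedged illustration, not Ota's receipt schema: the fixture record fields `question`, `run`, and `sql`, the entry shapes inside each list, and both function names are assumptions. Only `evaluated_inputs` and `witnessed_observations.query_traces` are names from this post.

```python
from __future__ import annotations

import hashlib
import json
from collections import defaultdict


def summarize_query_traces(jsonl_text: str) -> list[dict]:
    """Group recorded SQL by question and flag questions whose SQL diverged."""
    by_question: dict[str, set[str]] = defaultdict(set)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Identify a query by the digest of its text (assumed identity rule).
        query_id = hashlib.sha256(record["sql"].encode("utf-8")).hexdigest()
        by_question[record["question"]].add(query_id)
    return [
        {"question": q, "distinct_queries": len(ids), "diverged": len(ids) > 1}
        for q, ids in sorted(by_question.items())
    ]


def receipt_fragment(input_digests: dict[str, str], jsonl_text: str) -> dict:
    """Keep declared inputs and observed behavior in separate sections."""
    return {
        "evaluated_inputs": [
            {"path": path, "sha256": digest}
            for path, digest in sorted(input_digests.items())
        ],
        "witnessed_observations": {
            "query_traces": summarize_query_traces(jsonl_text),
        },
    }
```

In this shape, an unchanged input digest and a diverged query trace can both be true in the same receipt, and neither list has a slot where the other kind of fact could be stored.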

Bedrock also made one remaining boundary clear: its live recording path depends on a generic,
unpinned pip requirements lane. Ota does not yet own that as typed dependency hydration. The
contract names it as outside the deterministic proof instead of hiding it in a setup shell step.

What the matrix proves

The pressure matrix is intentionally offline and cross-platform:

ota validate
ota doctor
ota tasks --use
ota tasks --safe --use
agent-mode task dry-run for the stability gate
agent-mode workflow dry-run and preparation for verify
deterministic test execution
scorecard replay against the committed SQL fixture
baseline gate execution
archived receipt for the declared workflow
native coverage on Ubuntu, macOS, and Windows
the same replay workflow in Ota's pinned Python container context

This is not a claim that Bedrock's live model lane is hermetic. It is proof that the deterministic
replay lane is explicit about its inputs, output evidence, and boundary. The contract's file checks
assert artifact presence; Ota then captures SHA-256 source identities before execution and compares
them during replay. Those observed identities are not yet contract-declared expected digests, so
they attest what the lane used rather than pinning a separate immutable contract assertion.
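Capturing and comparing source identities can be sketched with the standard library. This is an illustration of the idea, not Ota's capture code: `capture_identities` and `changed_inputs` are hypothetical helpers.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def capture_identities(paths) -> dict[str, str]:
    """Record a SHA-256 digest for each declared input file."""
    return {
        str(path): hashlib.sha256(Path(path).read_bytes()).hexdigest()
        for path in paths
    }


def changed_inputs(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """List inputs whose observed digest differs from the captured one."""
    return sorted(path for path in before if after.get(path) != before[path])
```

Note what this does and does not assert. The digests are observed at capture time, so they attest which bytes the lane used. They become a pinned expectation only if the contract itself declares the expected digest.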

The released v1.6.24 pressure matrix passed across native Ubuntu, macOS, Windows, and the
container replay lane on Ota's fork: run #29572073749.

Why this repo mattered

Bedrock clarified a product rule that applies beyond AI evaluation:

do not collapse observed behavior into declared execution inputs.

For this repository, a clean replay input can narrow or acquit the input class it actually names.
A divergent query trace can explain a flap. Neither artifact should overclaim the other role.

That is the right foundation for Ota's replay and receipt model: declared truth, execution truth,
and witnessed evidence remain linked, but separate.

Ota v1.6.24 Now Available: Runtime Proof, Replay Trust, and Typed Execution

Bobai Kato — Fri, 17 Jul 2026 12:44:38 +0000

Idea

v1.6.24 is an execution-evidence release.

The pressure behind it was a trust problem: a green command is useful, but it does not automatically prove that a declared dependency shaped the result, that the same material inputs can be replayed, or that a new run still represents the same operational truth.

Ota now draws those boundaries more sharply.

This release strengthens three connected questions:

did the selected runtime start and become ready?
did the proof actually exercise the declared dependency seam?
can a later run identify and replay the material inputs that shaped the original execution?

It also moves more setup and runtime behavior from shell convention into typed contract ownership, so the evidence has stronger declared truth beneath it.

Feature

v1.6.24 strengthens six connected parts of Ota's execution-governance model.

1. Runtime proof now distinguishes readiness, exercise, and causality

Runtime proof should not overstate what a green lane established.

v1.6.24 adds a stronger evidence ladder for declared dependency seams:

reachable means the selected path reached the dependency boundary
exercised requires transaction-bound evidence that the dependency participated in the current run
fault_tested requires a separate, same-obligation negative control that fails for the declared missing-effect reason

For marker-bound seams, Ota now creates an opaque transaction marker for the producer path and requires the observer to recover it from the dependency. A clean observer exit is not enough. The evidence has to match the current transaction and declared obligation.

Negative controls are strict too. An unrelated timeout, setup failure, crash, or generic non-zero exit cannot promote a seam to fault_tested. The control must be valid, bound to the same obligation, and classified as the expected obligation failure or missing effect.
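
The marker and control rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Ota's implementation: the function names, the observed_payload argument, the control record fields (valid, obligation_id, failure_class), and the two failure-class strings are all hypothetical.

```python
import secrets

# Hypothetical failure classes that may count as a valid negative control.
EXPECTED_CONTROL_FAILURES = {"obligation_failure", "missing_effect"}


def new_marker() -> str:
    """Create an opaque, per-transaction marker for the producer path."""
    return secrets.token_hex(16)


def observer_recovered_marker(marker: str, observed_payload: str, exit_code: int) -> bool:
    """A clean exit is not enough: the observer must return the current marker."""
    return exit_code == 0 and marker in observed_payload


def control_promotes_to_fault_tested(control: dict, obligation_id: str) -> bool:
    """Only a valid, same-obligation control with an expected failure class counts."""
    return (
        control.get("valid") is True
        and control.get("obligation_id") == obligation_id
        and control.get("failure_class") in EXPECTED_CONTROL_FAILURES
    )
```

In this sketch an unrelated timeout or crash carries a failure class outside the expected set, so it can never promote the seam.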

The resulting proof can remain honestly qualified:

{
  "ok": true,
  "proof_verdict": "passed_with_unproven_boundaries",
  "dependency_evidence": [
    {
      "dependency_id": "service:postgres",
      "level": "fault_tested"
    }
  ],
  "not_proved": [
    {
      "kind": "dependency_output_shaping_not_proved"
    }
  ]
}

That distinction matters. Ota can prove that a declared PostgreSQL seam was causally exercised without pretending that PostgreSQL shaped every application output or that one narrow proof covered the whole repository.
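
A consumer could read that carrier roughly as follows. This is a minimal sketch, assuming the three levels form a strict ladder in which fault_tested implies exercised and exercised implies reachable; the helper names are hypothetical, and only the JSON field names come from the example above.

```python
import json

# Assumed ordering of the evidence ladder: reachable < exercised < fault_tested.
LADDER = ["reachable", "exercised", "fault_tested"]


def dependency_meets(proof: dict, dependency_id: str, required: str) -> bool:
    """Return True if the named dependency reached at least the required level."""
    needed = LADDER.index(required)
    for entry in proof.get("dependency_evidence", []):
        if entry.get("dependency_id") == dependency_id:
            level = entry.get("level")
            return level in LADDER and LADDER.index(level) >= needed
    return False


def unproven_kinds(proof: dict) -> list[str]:
    """List the boundary kinds the proof explicitly does not claim."""
    return [item["kind"] for item in proof.get("not_proved", []) if "kind" in item]


proof = json.loads("""
{
  "ok": true,
  "proof_verdict": "passed_with_unproven_boundaries",
  "dependency_evidence": [{"dependency_id": "service:postgres", "level": "fault_tested"}],
  "not_proved": [{"kind": "dependency_output_shaping_not_proved"}]
}
""")
print(dependency_meets(proof, "service:postgres", "exercised"))
print(unproven_kinds(proof))
```

The point of keeping the two helpers separate is that a strong dependency level never erases an entry in not_proved.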

2. Replay became execution-authored instead of reconstructed later

v1.6.24 adds first-class replay against archived execution truth:

ota up --workflow verify --replay-baseline latest --json

The replay surface reports one of three explicit outcomes:

replay_verified
replay_failed
replay_unavailable

Ota compares the selected workflow, backend, provider, remote target, and lifecycle against the archived baseline before treating it as comparable execution. It also classifies last-known-good evidence as verified, stale, or unavailable rather than silently treating any old green receipt as current proof.
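
The comparability check can be pictured as a dimension-by-dimension comparison. The sketch below is hypothetical: the dictionary keys are illustrative names for the five dimensions listed above, not Ota's receipt schema.

```python
# Hypothetical key names for the five dimensions named in the release notes.
COMPARABLE_DIMENSIONS = ("workflow", "backend", "provider", "remote_target", "lifecycle")


def mismatched_dimensions(baseline: dict, current: dict) -> list[str]:
    """Return the dimensions on which the run differs from the archived baseline."""
    return [name for name in COMPARABLE_DIMENSIONS if baseline.get(name) != current.get(name)]
```

An empty result means the two executions are comparable in this simplified model; any entry names the reason they are not.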

Receipts now retain more of the material input identity needed for honest replay and diagnosis, including clean Git state, policy rulesets, environment-source identity, lockfiles, Node runtime identity, immutable Compose image identity, and dependency-hydration provenance where available.

The evidence model also keeps inputs separate from witnessed output. Declared replay inputs belong under replay_inputs; observed query behavior belongs under witnessed_observations.query_traces. That lets later comparisons distinguish changed inputs from changed runtime behavior instead of mixing both into one evidence list.
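
One way a later comparison could use that separation is sketched below. Only the replay_inputs and witnessed_observations.query_traces locations come from the text; the inner record shapes, the function names, and the returned labels are assumptions for illustration.

```python
import hashlib
import json


def fingerprint(value) -> str:
    """Hash a JSON-serializable value in a key-order-independent way."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classify_drift(baseline: dict, current: dict) -> str:
    """Say whether inputs, witnessed behavior, both, or neither changed."""

    def inputs(receipt: dict):
        return receipt.get("replay_inputs")

    def traces(receipt: dict):
        return receipt.get("witnessed_observations", {}).get("query_traces")

    inputs_changed = fingerprint(inputs(baseline)) != fingerprint(inputs(current))
    behavior_changed = fingerprint(traces(baseline)) != fingerprint(traces(current))
    if inputs_changed and behavior_changed:
        return "inputs_and_behavior_changed"
    if inputs_changed:
        return "inputs_changed"
    if behavior_changed:
        return "behavior_changed"
    return "no_change"
```

Because the two lists are hashed separately, a divergent query trace with identical inputs is reported as a behavior change rather than as input drift.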

3. Dependency provenance became more explicit

Dependency hydration is not fully described by the package-manager verb alone.

This release widens typed hydration with source posture:

uv hydration can declare a default index, ordered additional indexes, and offline intent
.NET restore records runner-resolved NuGet feed identity when it can establish it
pnpm workspace hydration can declare a first-class filter

The trust rule remains conservative. If Ota cannot recover authoritative feed identity, the evidence narrows the diagnosis; it does not become a false hermeticity claim.

This gives receipts and replay a better answer to a common drift question: not only which package manager ran, but which declared or resolved dependency source shaped the run.

4. More execution truth moved into typed contract ownership

v1.6.24 removes more reasons to hide important repo behavior in opaque shell commands.

The release adds or widens:

action.kind: build_container_image for typed image file, context, tag, and provider ownership
generated artifact lineage through declared producers, outputs, inputs, and consumer requirements
task-level only_on platform constraints
launch.runtime_projection adapters for Uvicorn and Next.js
closure-aware execution modes for tasks and aggregates
aggregate dry-run governance derived from the actual dependency closure

Execution-mode admission is stricter as well. If a task does not advertise a requested mode, Ota refuses it instead of quietly selecting a different execution path.
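
The admission rule can be sketched as a refusal rather than a fallback. This is a hypothetical model, not Ota's code: the task dictionary shape, the mode strings, and the exception name are assumptions, with use.modes borrowed from the machine-readable discovery field.

```python
class ModeNotAdvertised(Exception):
    """Raised when a task does not advertise the requested execution mode."""


def admit_mode(task: dict, requested_mode: str) -> str:
    """Return the requested mode only if the task advertises it; never substitute."""
    advertised = task.get("use", {}).get("modes", [])
    if requested_mode not in advertised:
        raise ModeNotAdvertised(
            f"task {task.get('name')!r} does not advertise mode {requested_mode!r}; "
            f"advertised: {advertised}"
        )
    return requested_mode
```

The design choice worth noting is the absence of a default branch: there is no code path that picks a different mode on the caller's behalf.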

Together, these changes strengthen the relationship between declared contract truth, the selected runtime path, and the evidence Ota emits afterward.

5. Task discovery and execution now agree more closely

Humans and agents need to see the same callable surface that the runner will enforce.

Task discovery now reports closure-aware run guidance, including:

human execution commands
agent execution commands
agent policy posture
machine-readable use.modes[]

Ota also tightened CI verification recovery and governance reconciliation so merge-gate output can explain the required verification lanes, the cited inputs behind the decision, and the decision basis instead of publishing a verdict with weak provenance.

The result is a cleaner path from discovery to execution: task listings, dry-run governance, runtime admission, and receipts are less likely to describe competing truths.

6. Runtime cleanup and interruption became safer

Some of the most important trust work in this release is deliberately unglamorous.

v1.6.24 improves runtime correctness around interruption and cleanup:

Ctrl+C is classified consistently as an interrupted execution with exit code 130
signal forwarding targets the selected child process tree without killing the invoking shell
interrupt state is scoped to the active execution epoch
native command services publish clearer readiness and endpoint truth
runtime proof starts and stops only proof-owned Compose dependencies
services that were already running before proof are preserved
container proof cleanup follows the selected effective execution mode

This matters because evidence is only trustworthy when the runner also owns failure, interruption, and cleanup boundaries correctly.
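
The exit code 130 follows the common shell convention of 128 plus the signal number, and SIGINT is signal 2. A wrapper that wants to classify interruption consistently might look like this sketch; the function name and the three labels are hypothetical.

```python
import signal

# Shell convention: 128 + signal number. SIGINT is 2, so this is 130.
INTERRUPTED_EXIT_CODE = 128 + signal.SIGINT


def classify_exit(returncode: int) -> str:
    """Map a child's return code to 'ok', 'interrupted', or 'failed'."""
    if returncode == 0:
        return "ok"
    # Python's subprocess reports death-by-signal as a negative signal number
    # on POSIX, while shells report the same event as 128 + signal number.
    if returncode in (-signal.SIGINT, INTERRUPTED_EXIT_CODE):
        return "interrupted"
    return "failed"
```

Treating both encodings as the same outcome keeps an interrupted run from being recorded as an ordinary failure.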

Pressure-tested on real repositories

The release was shaped through real repository pressure rather than fixtures alone.

Athena API forced marker-bound PostgreSQL seam proof, same-obligation controls, and typed production image ownership.
Bedrock pressured replay inputs, presentation and comparator profiles, and witnessed query traces.
Lead Quorum pressured uv source posture, runtime projection, and image-build ownership.
Kylrix pressured Next.js runtime projection and truthful native/container mode admission.

Those repos exposed the places where a structurally valid contract could still produce evidence that was too broad, too weakly attributed, or too dependent on shell convention. The fixes in v1.6.24 are the product response to that pressure.

Docs

Use the live references for the shipped surface:

Get started: Install Ota
Proof semantics: Runtime Proof Evidence
Execution evidence: Execution Receipt
Contract fields: Contract Reference
Command behavior: Command Reference
Machine output: JSON Output Reference
Governance architecture: Execution Governance Loop

The published schemas and public examples have also been updated for replay inputs, witnessed query traces, dependency evidence, seam observations, negative controls, and explicit not_proved boundaries.

Release

v1.6.24 is live here: https://ota.run/releases/v1.6.24

Upgrade and verify the selected repo path:

ota upgrade
ota --version
ota validate
ota doctor --json
ota tasks --safe --use
ota run <task> --dry-run --json
ota proof runtime --workflow <workflow> --json

v1.6.24 moves Ota further from “the command passed” toward the stronger question: what did this execution actually prove, which inputs shaped it, and what remains unproved?

Originally posted here: https://ota.run/blog/ota-v1-6-24-release-essay

Why AI Agent PRs Get Rejected And How Repo Contracts Help

Bobai Kato — Thu, 16 Jul 2026 21:18:31 +0000

Overview

A recent study of agent-generated pull-request fixes reported
that 46.41% of fixes proposed by Copilot, Devin, Cursor, and Claude were rejected.

That number matters, but the more useful question is why.

Some rejected PRs are simple model failures: the implementation is wrong, incomplete, or low
priority.

Ota addresses a separate, avoidable class of failure:

the agent ran the wrong verification path
the repo needed a service or env var that was never declared clearly
the change passed one local command but failed the real CI lane
the repo never made safe boundaries explicit
the agent stopped at "the code compiles" instead of "the repo's declared acceptance path passed"

That is not only an intelligence problem.

It is also a repo-governance problem.

No contract can make an incorrect or low-priority implementation worth merging. It can remove the
avoidable execution uncertainty around it: whether the repo was ready, the right lane ran, the
required services existed, and the completion claim matched the repo's declared acceptance path.

The Hidden Failure Is Usually Not The Diff

When an agent opens a PR, maintainers are not only reviewing the code diff.

They are also reviewing whether the agent understood the repo well enough to:

prepare the repo correctly
choose the right workflow
run the right checks
avoid unsafe changes
prove that the change is actually complete

That is where a lot of agent PRs fall apart.

The repo may have the truth, but the truth is scattered:

setup instructions in one README
real verification logic in CI
service assumptions in Docker files
extra post-change steps in shell scripts
path sensitivity in maintainer memory

An agent sees all of that and still has to decide:

What does done mean here?

If the repo does not answer that clearly, the PR is already risky before the code review starts.

Why Rejected Agent PRs Should Be Framed As Repo Governance

The useful framing is not:

"Agents need better prompts."

The useful framing is:

"Repos need better execution truth."

A maintainer should be able to declare:

how the repo becomes ready
which tasks are canonical
which workflow should be used after a change
which services and env are required
which tasks are safe for an agent
which paths are protected
which verification lane proves completion

Without that, every agent run is partly reconstruction work.

That reconstruction cost shows up later as:

CI failures
incomplete implementations
wrong runtime assumptions
reviewer fatigue
rejected PRs that were never fully grounded in repo truth

What Ota Changes

Ota is not trying to make agents magically smarter.

Ota gives the repo an execution contract so the agent has less to guess.

That contract can declare:

toolchains and runtime requirements
dependency hydration and setup
services and readiness
tasks and workflows
safe task boundaries
protected and writable paths
verification paths after changes

Instead of asking an agent to infer "probably run tests," a repo can say what the acceptance path
actually is.

For example:

agent:
  entrypoint: verify
  default_task: verify
  safe_tasks:
    - lint
    - test
  verify_after_changes:
    - verify
  protected_paths:
    - .github/workflows
    - production/**
    - secrets/**

And:

workflows:
  default: app
  app:
    intent: local_development
    prepare:
      task: setup

And:

tasks:
  verify:
    aggregate:
      tasks:
        - lint
        - test

Now the repo is saying something operationally useful:

start from this task
these are the safe tasks
this is the post-change verification lane
these paths are not for autonomous editing

That does not guarantee the agent will write the right code.

But it does remove a large class of avoidable failure.
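
To make the protected-path idea concrete, here is a minimal sketch of how a harness could check a proposed edit against the protected_paths list above. The matching semantics are an assumption, not Ota's documented behavior: entries with wildcards are matched with Python's fnmatch, and plain entries are treated as a file or directory prefix.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns copied from the agent block in the example contract.
PROTECTED_PATHS = [".github/workflows", "production/**", "secrets/**"]


def is_protected(path: str, patterns=PROTECTED_PATHS) -> bool:
    """Return True if a repo-relative path falls under any protected pattern."""
    normalized = PurePosixPath(path).as_posix()
    for pattern in patterns:
        if any(ch in pattern for ch in "*?["):
            # fnmatch's '*' also matches '/', so 'production/**' covers nested files.
            if fnmatchcase(normalized, pattern):
                return True
        elif normalized == pattern or normalized.startswith(pattern.rstrip("/") + "/"):
            # A plain entry is treated as a file or a directory prefix.
            return True
    return False
```

A harness could call is_protected on every path in a proposed diff and stop or escalate on the first match.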

How This Reduces Rejected PRs

There are four concrete ways this helps.

1. The agent runs the repo's real verification path

The agent no longer has to guess whether pytest, npm test, go test, or one CI script is the
real acceptance lane.

The repo can declare the exact task or workflow that must pass.

That reduces PRs that fail because the agent validated the wrong thing.

2. The agent prepares the repo correctly before editing

A lot of wasted agent work starts before the code change:

dependencies were not hydrated correctly
a service was missing
env files were never prepared
the runtime was not actually ready

If readiness and setup are declared structurally, the agent has a better chance of operating on a
real working repo instead of a half-prepared one.

3. The agent stays inside explicit safety boundaries

Some changes should not happen autonomously.

That may include:

workflow files
deployment config
secrets surfaces
destructive tasks
data-reset lanes

If those boundaries are explicit, the agent can stop, escalate, or stay on the safe path instead
of wandering into a high-review or high-risk change set. A consuming runner or CI gate can then
enforce the same declared boundary where that enforcement is configured.

4. Reviewers get evidence, not reconstruction work

A reviewer should not have to reverse-engineer:

what the agent should have run
whether it used the right workflow
whether a failure came from code, setup drift, or missing services
whether the repo even exposed the right operational truth

Ota moves that toward explicit evidence:

contract validation
doctor output
task dry-run
workflow proof
execution receipts

That makes rejected PRs easier to understand and good PRs easier to trust.

The Real Wedge Is Not "AI Coding"

The real wedge is execution governance.

If AI agents are going to work across unfamiliar repos, those repos need a way to say:

what is required
what is safe
what should run
what success looks like

That is why this matters beyond agents.

The same contract truth helps:

new contributors
CI systems
remote sandboxes
internal automation
future maintainers

The agent use case just makes the pain impossible to ignore.

What A Better Agent PR Flow Looks Like

A stronger flow is simple:

ota doctor
ota validate .
ota tasks --use
ota up --workflow app
ota run verify

That is materially different from:

"Look around, pick some commands, and hope the repo agrees."

The first flow is governed.
The second is guesswork.
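
An agent harness could drive the governed flow as an ordered sequence that stops at the first failure. The sketch below is illustrative: the command strings are the ones listed above, while run_flow, run_command, and the result dictionary are hypothetical names. The runner is injectable so the flow can be exercised without Ota installed.

```python
import shlex
import subprocess

# The governed flow from the article, in order.
AGENT_FLOW = [
    "ota doctor",
    "ota validate .",
    "ota tasks --use",
    "ota up --workflow app",
    "ota run verify",
]


def run_command(command: str) -> int:
    """Run one command without a shell and return its exit code."""
    return subprocess.run(shlex.split(command), check=False).returncode


def run_flow(commands=AGENT_FLOW, runner=run_command) -> dict:
    """Run commands in order and stop at the first non-zero exit."""
    completed = []
    for command in commands:
        code = runner(command)
        if code != 0:
            return {"ok": False, "failed_command": command, "exit_code": code, "completed": completed}
        completed.append(command)
    return {"ok": True, "failed_command": None, "exit_code": 0, "completed": completed}
```

Stopping early matters here: running verify after a failed readiness step would produce a result that says little about the change itself.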

Final Point

If nearly half of agent-generated PR fixes are being rejected, the response should not only be to
measure model quality harder.

We should also ask whether repos are giving agents a trustworthy path to completion.

The stronger question is not:

"Can the agent write code?"

It is:

"Can the repo tell the agent what a correct, safe, complete change looks like?"

That is the layer Ota is built for.

Get Started With Ota

Install Ota:

curl -fsSL https://dist.ota.run/install.sh | sh

Windows:

irm https://dist.ota.run/install.ps1 | iex

Then start with:

ota doctor
ota init

Originally posted here: https://ota.run/blog/why-ai-agent-prs-get-rejected-and-how-repo-contracts-help-4h2m

AI Coding's Real Bottleneck Is Repository Execution Trust

Bobai Kato — Mon, 13 Jul 2026 20:20:02 +0000

Overview

For a while, the central question in AI coding felt obvious:

Can the model generate good code?

That is still important, but it is no longer the main bottleneck in many real workflows.

The bigger problem now is repository execution trust.

Can the agent trust the repository enough to act correctly?

Can the repository tell the agent what setup is required, what task is canonical, what workflow proves readiness, and what evidence should count when something fails?

If the answer is no, better generation alone does not solve the problem.

It just produces higher-quality guesses inside an ungoverned environment.

Generation Improved Faster Than Execution Trust

Agents are already reasonably good at many code-local tasks:

writing small features
fixing obvious bugs
updating tests
refactoring narrow modules
tracing stack-level failures

That is not where most teams feel the sharpest friction anymore.

The friction appears when the agent has to move from code generation into repository operation.

That is where questions like these start to matter:

what has to run first
which package manager is authoritative
which services need to be running
what environment values are real requirements
which task is safe to execute
which verification path is canonical
whether a failure came from code, setup, or contract drift

Those are not model-completion problems.

They are execution-trust problems.

The Hard Part Is No Longer Writing The Patch

In many repos, the hard part is no longer producing a plausible patch.

The hard part is knowing whether the patch was exercised against the right path.

An agent can write a technically good change and still fail the actual job if it:

ran the wrong test command
skipped setup that CI assumes
missed a required service
followed a stale README path
treated an environment problem as a code defect
passed a narrow local check while the real repo gate stayed unproven

When that happens, the failure is often blamed on the agent.

But the repo is usually part of the problem too.

The repo did not expose enough trustworthy execution truth for the agent to operate with confidence.

What Executable Trust Actually Means

Executable trust is the condition where a repository can answer operational questions clearly enough that humans, CI, and agents can take the same path and understand the result.

That means the repo can declare:

what it needs
how it becomes ready
what is safe to run
what should verify a change
what runtime path is primary
what evidence should be preserved
what failures mean

Without that, an agent is still reconstructing the repo from incomplete signals:

README prose
shell scripts
CI workflow fragments
package manifests
.env.example
local conventions
tribal knowledge

That reconstruction can look intelligent while still being fragile.

Bigger Context Windows Do Not Eliminate This

A larger context window helps an agent read more of the repo before deciding.

That is useful.

But it does not create authority.

If five different files imply five different setup paths, more context just lets the agent inspect more disagreement.

If CI, local scripts, and contributor docs have drifted apart, a larger window helps the agent see the drift. It does not tell the agent which path the repo actually considers correct.

This is why the bottleneck has shifted.

The question is no longer only:

Can the agent generate a good answer?

It is increasingly:

Can the repository expose a trustworthy execution path?

What Repository Execution Trust Looks Like In Practice

A repository with executable trust should make some things explicit instead of implied.

For example:

tasks:
  setup:
    prepare:
      kind: dependency_hydration
      medium: package_dependencies
      source:
        kind: node_package_manager
        manager: pnpm
        mode: install

  verify:
    aggregate:
      tasks:
        - lint
        - typecheck
        - test

workflows:
  default: verify
  verify:
    setup:
      task: setup
    run:
      task: verify

That does not just give the agent commands.

It gives the repo a declared setup path, a declared verification path, and a shared operational story for humans and automation.

And it gives Ota an executable path instead of a prose hint:

ota up --workflow verify
ota run verify
ota receipt --json --archive

That means:

ota up --workflow verify can take the declared setup path instead of guessing what must run first
ota run verify can execute the canonical verification lane instead of picking between README, CI, or shell drift
ota receipt --json --archive can preserve the execution and readiness evidence instead of leaving the result as unstructured terminal output

Now compare that with a weaker setup where the agent has to guess between:

npm test from the README
pnpm lint && pnpm test:ci from CI
a make check target that may or may not still be current
a hidden service dependency no one wrote down

That second repo does not have a generation problem first.

It has a trust problem.

Failures Need To Become Evidence

Once agents begin operating repos instead of just suggesting edits, output quality is no longer enough.

The system also needs evidence.

When setup fails, the useful artifact is not only stderr.

It is something closer to:

what contract or workflow was selected
what task actually ran
what readiness was expected
what service or env assumption was missing
whether the failure came from declared truth, hidden dependency, or drift

That is the level where repositories become governable instead of just runnable.

It is also the level where agents become more trustworthy, because their actions are bounded by declared paths and their failures are easier to interpret honestly.

This Is Why Ota Repo Contracts Matter

At Ota, this is the problem space we care about.

The value of a repo contract is not that it gives an agent one more config file to read.

The value is that it makes execution truth explicit and runnable:

setup
tasks
workflows
services
readiness
boundaries
receipts

That gives the agent something stronger than raw context.

It gives it an operating contract.

The Next Phase Of AI Coding

The next phase of AI coding is not only about larger models, longer context windows, or better patch generation.

It is about whether repositories can support trustworthy operation.

That means moving from:

"the agent can write code"

to:

"the agent can act inside a repo with clear execution truth"

Those are different maturity levels.

The first is impressive.

The second is what teams actually need if they want agents to work reliably beyond small edits.

Bottom Line

AI coding's bottleneck is no longer only generation quality.

In many real repos, the larger constraint is whether the repository can expose a trustworthy path from change to verified execution.

Until that trust layer exists, better generation will still run into the same wall:

good patches inside unclear repos.

That is why the next real improvement is not just more model capability.

It is a repo that can declare, execute, and preserve the same trustworthy path for developers, CI, and AI agents.

That is the operational layer Ota is building.

Originally posted here: https://ota.run/blog/ai-codings-real-bottleneck-is-repository-execution-trust

Why Agent Safety Needs Enforced Boundaries, Not Just Declared Ones

Bobai Kato — Sat, 11 Jul 2026 15:38:05 +0000

Overview

Agent safety does not become real the moment a repo declares:

safe tasks
protected paths
review-required lanes
external-effect lanes

That is only the first half.

The second half is whether anything fails when those declarations are wrong, stale, or bypassed.

That is the real line between repo governance and well-formatted advice.

Declared Boundaries Are Necessary

A repo should absolutely declare:

which tasks are safe for routine agent use
which workflows are review-required
which files are writable
which files are protected
which commands can mutate external state
which verification paths are finite and meaningful

Without that, the agent has to infer boundaries from README prose, CI jobs, helper scripts, and old maintainer habits.

That is not a safety model.

It is guesswork with better intentions.

But Declared Boundaries Are Not Enough

If the repo contract says one thing and the actual execution path does another, the declaration degrades quickly.

It becomes one more soft signal beside:

AGENTS.md
.env.example
a stale contributor guide
a CI workflow nobody meant to be canonical

That is why a repo contract only stays true if it lives inside the execution loop.

The local runner has to consume it.

CI has to consume it.

Receipts have to record the path that actually ran.

Otherwise the repo has declared boundaries, but no enforced boundaries.

A Simple Failure Shape

Imagine a repo declares:

test is safe for agents
publish is not
migrations/ is protected
production-facing tasks require review

That sounds fine on paper.

But now imagine:

the runner still lets an agent invoke publish
CI never checks whether the declared verification lane was the one that passed
the agent edits a protected file and nothing complains
the receipt only says "task completed" without recording the contract path or refusal state

At that point the repo has not actually governed anything.

It has just documented preferred behavior.

The dangerous part is that this can still look disciplined. The contract exists. The docs look good. The review story sounds serious.

But the boundary has no teeth.

Execution Is What Keeps Safety Truth Honest

A boundary becomes real when crossing it changes behavior.

That means:

an unsafe lane is refused by the runner in agent mode
a protected path boundary is enforced or surfaced by the consuming runner or harness instead of staying as passive metadata
a required verification lane is enforced at merge
a runtime capability boundary only exposes the callable surface the contract allows
the receipt shows whether execution was allowed, refused, blocked, or only advisory

This is the important shift:

agent safety is not only about what the repo can declare.

It is about what the repo can enforce.

Execution is the preservative for safety truth.

The settings the engine actually reads stay accurate far longer than the settings a team merely writes down.

Why Local Runner Enforcement Matters

The local runner is the fastest place to stop bad routine behavior.

If an agent asks for a task outside the declared safe surface, the runner should be able to refuse before the task starts.

That matters because:

the agent gets a clear stop signal early
the repo does not have to rely on prompt obedience
refusal becomes evidence, not folklore

This is where declared safe tasks stop being metadata and start becoming a real execution boundary.
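
A refusal that doubles as evidence can be sketched as a pure decision function. This is a hypothetical model rather than Ota's runner: the record fields, the reason string, and the "agent" actor mode value are assumptions; only the allowed and refused states come from the text.

```python
from datetime import datetime, timezone


def admit_task(task: str, actor_mode: str, safe_tasks: set[str]) -> dict:
    """Decide before execution and return a receipt-like record of the decision."""
    refused = actor_mode == "agent" and task not in safe_tasks
    return {
        "task": task,
        "actor_mode": actor_mode,
        "state": "refused" if refused else "allowed",
        "reason": "task_not_in_safe_tasks" if refused else None,
        "decided_at": datetime.now(timezone.utc).isoformat(),
    }
```

Because the decision is returned as data before anything runs, the same record can stop the agent and be archived for later review.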

Why CI Enforcement Matters

Local runner enforcement is not enough on its own.

Teams can use different agents.
People can route around local conventions.
Tools can drift.

The merge gate is where the repo gets one mandatory chokepoint.

That is why CI has to enforce the same contract truth too:

which verification lanes are required
which proof must exist
which workflow is canonical
whether contract and CI wiring have drifted apart

If the contract lies, the build should break visibly.

That is much stronger than hoping the next human or agent notices the mismatch by reading more context.
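
A merge gate that enforces required lanes can be reduced to a small check. The sketch below assumes a simplified input, a list of required lane names and a mapping from lane name to a status string, neither of which is Ota's actual CI output format.

```python
def merge_gate(required_lanes: list[str], lane_results: dict[str, str]) -> dict:
    """Fail the gate unless every required lane has a recorded pass."""
    missing = [lane for lane in required_lanes if lane not in lane_results]
    failed = [
        lane for lane in required_lanes
        if lane in lane_results and lane_results[lane] != "passed"
    ]
    return {"ok": not missing and not failed, "missing": missing, "failed": failed}
```

The useful property is that a green result on an unrelated lane cannot satisfy the gate: a required lane that never ran is reported as missing, not as passed.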

Why Receipts Matter

Safety also needs evidence.

If a boundary was crossed, refused, or bypassed, no one should have to reconstruct that later from chat logs or terminal fragments.

Receipts are the durable layer that can say:

which contract path ran
which actor mode was used
whether the lane was allowed or refused
what evidence was collected
whether execution stayed inside the declared boundary

That is what makes enforcement auditable instead of anecdotal.

For the public receipt surface, see:

Execution Receipt

This Is Why Ota Is Not Just A Better Instruction File

The weaker model is:

write better guidance
tell the agent to be careful
document what should happen

The stronger model is:

declare the boundary once
enforce it locally
enforce it at merge
emit evidence when the boundary is crossed, blocked, or refused

That is the difference between declared agent safety and enforced agent safety.

Ota is trying to live on the stronger side of that line.

Bottom Line

Agent safety needs declared boundaries.

But it does not stop there.

If nothing consumes those boundaries, nothing fails when they drift, and no evidence is emitted when they are crossed, the repo has not built a safety system.

It has built documentation.

That is why agent safety needs enforced boundaries, not just declared ones.

Originally posted here: https://ota.run/blog/why-agent-safety-needs-enforced-boundaries-not-just-declared-ones-4m7q

From Execution Logs to Governance Verdicts in Ota

Bobai Kato — Thu, 09 Jul 2026 12:10:16 +0000

Overview

Most execution tooling still makes operators do too much reconstruction work.

You get:

command output
step logs
maybe a receipt
maybe a CI summary

But the hardest governance questions are usually left implicit:

was this lane allowed or only runnable
should this path have been refused
was review required
was proof expected
did the result satisfy the declared governance bar

That is the gap Ota is closing.

The point is not to replace logs.

The point is to stop treating logs as the primary governance artifact.

Logs are still useful for detail. But governance should be emitted as structured truth:

what Ota decided before execution
what basis that decision used
what evidence was expected after execution
what evidence was actually present

That is a better operator surface, a better CI surface, and a much better agent surface.

What changed

Ota’s governance output is now moving from flat status reporting toward explicit verdict records.

For ota up --json, the shipped governance surface is:

governance.preflight
governance.post_execution

For selected-task preview and discovery surfaces such as ota run <task> --dry-run --json, the
same phase split is exposed under:

governance.evaluation.preflight
governance.evaluation.post_execution

That split matters.

Preflight answers:

is this lane allowed
is it refused
is it blocked
is review required
is a receipt or proof expected

Post-execution answers:

did execution happen
was it refused before execution
was the evidence bar satisfied
is proof missing
is there a receipt-linked crossing record

That is stronger than asking an operator, CI job, or agent to infer all of that from prose.

The important product shift

The important shift is not just “more JSON.”

It is that Ota is turning governance into a first-class output surface instead of an accidental by-product of command execution.

That means a machine consumer can now read stable fields such as:

governance.preflight.state on ota up --json
governance.preflight.decision_basis[] on ota up --json
governance.post_execution.state on ota up --json
governance.post_execution.decision_basis[] on ota up --json

And on selected-task preview surfaces:

governance.evaluation.preflight.state
governance.evaluation.preflight.decision_basis[]
governance.evaluation.post_execution.state
governance.evaluation.post_execution.decision_basis[]

And, in the newer trust-refinement layer:

governance.preflight.evidence_classes
governance.post_execution.evidence_classes
governance.evaluation.preflight.evidence_classes
governance.evaluation.post_execution.evidence_classes

Those fields let downstream consumers distinguish:

caller-asserted intent
Ota-derived governance truth
boundary-attested evidence such as receipt attachment state

That distinction matters because a structured field is only valuable if it is honest about what kind of truth it carries.

What this looks like

Here is the shipped shape from ota up --json when Ota can already distinguish preflight posture
from post-execution evidence posture:

{
  "governance": {
    "preflight": {
      "state": "allowed",
      "crossing_required": false,
      "crossing_classification": "routine",
      "evidence_classes": {
        "state": "derived",
        "crossing_required": "derived",
        "crossing_classification": "derived",
        "receipt_expected": "derived",
        "proof_expected": "derived"
      },
      "receipt_expected": true,
      "proof_expected": true
    },
    "post_execution": {
      "state": "evidence_missing",
      "execution_attempted": true,
      "refusal_occurred": false,
      "decision_basis": [
        {
          "id": "evidence:receipt_present",
          "family": "evidence_gate",
          "evidence_class": "attested"
        },
        {
          "id": "receipt_status:ready",
          "family": "receipt_evidence",
          "evidence_class": "attested"
        },
        {
          "id": "evidence:proof_missing",
          "family": "evidence_gate",
          "evidence_class": "derived"
        },
        {
          "id": "crossing_record:not_required",
          "family": "crossing_evidence",
          "evidence_class": "derived"
        }
      ],
      "evidence_classes": {
        "state": "derived",
        "execution_attempted": "derived",
        "refusal_occurred": "derived",
        "crossing_record_state": "derived",
        "receipt_present": "attested",
        "proof_present": "derived"
      },
      "receipt_present": true,
      "proof_present": false
    }
  }
}
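
One way a consumer might use the provenance on each cited basis entry is to group the entries by evidence_class before acting on them. The sketch below is illustrative only: the helper name basis_by_evidence_class is hypothetical, and the input is a trimmed copy of the post_execution block shown above rather than a call into Ota.

```python
from collections import defaultdict


def basis_by_evidence_class(phase):
    """Group decision_basis ids by their evidence_class (hypothetical helper)."""
    grouped = defaultdict(list)
    for entry in phase.get("decision_basis", []):
        # Entries without a class are kept visible instead of being dropped.
        grouped[entry.get("evidence_class", "unknown")].append(entry["id"])
    return dict(grouped)


# Trimmed copy of the post_execution block from the example above.
post_execution = {
    "state": "evidence_missing",
    "decision_basis": [
        {"id": "evidence:receipt_present", "family": "evidence_gate", "evidence_class": "attested"},
        {"id": "receipt_status:ready", "family": "receipt_evidence", "evidence_class": "attested"},
        {"id": "evidence:proof_missing", "family": "evidence_gate", "evidence_class": "derived"},
        {"id": "crossing_record:not_required", "family": "crossing_evidence", "evidence_class": "derived"},
    ],
}

grouped = basis_by_evidence_class(post_execution)
print(grouped["attested"])  # ['evidence:receipt_present', 'receipt_status:ready']
print(grouped["derived"])   # ['evidence:proof_missing', 'crossing_record:not_required']
```

A consumer that trusts attested entries more than derived ones can then apply different handling to each group without parsing any prose.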

The exact field set still varies by command surface. ota up --json uses governance.preflight /
governance.post_execution, while selected-task preview surfaces keep the same split under
governance.evaluation.*. What is now shipped consistently is:

preflight verdict
cited basis
post-execution evidence verdict
cited evidence basis
provenance on authoritative fields
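
Because the split lives under different parents depending on the command surface, a consumer may want one small accessor that handles both. This is a minimal sketch, assuming the only difference is the optional governance.evaluation wrapper described above; the function name governance_phases is hypothetical.

```python
def governance_phases(payload):
    """Return (preflight, post_execution) from either documented layout.

    Assumption: preview surfaces nest the same split under
    governance.evaluation, while ota up --json keeps it directly
    under governance.
    """
    governance = payload.get("governance", {})
    container = governance.get("evaluation", governance)
    return container.get("preflight"), container.get("post_execution")


up_payload = {
    "governance": {
        "preflight": {"state": "allowed"},
        "post_execution": {"state": "evidence_missing"},
    }
}
preview_payload = {
    "governance": {
        "evaluation": {
            "preflight": {"state": "allowed"},
            "post_execution": {"state": "evidence_missing"},
        }
    }
}

print(governance_phases(up_payload) == governance_phases(preview_payload))  # True
```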

Why this is better than logs alone

Logs are still where you go for:

full process output
stack traces
backend-specific details
raw command detail

But logs are weak as the primary governance surface because they force reconstruction.

A governance verdict should let an operator, CI system, or agent answer questions like:

was this refusal expected
which policy or boundary actually blocked it
was proof required here
did the final state satisfy the declared bar
is the result authoritative or only advisory

Those are not log-parsing questions.

They are governance questions.

That is why Ota should emit them directly.

Why this matters for AI agents

This matters even more for agents than for humans.

A human can read logs and infer intent.

An agent needs a stable contract.

If an agent only sees:

command succeeded
command failed
here is some text

then it still has to guess:

whether a lane was safe
whether it crossed a boundary
whether it should stop
whether the missing artifact is proof, setup, or review

That is exactly the kind of ambiguity that produces unsafe automation and noisy PRs.

A structured governance verdict is stronger because it gives the agent a machine-readable stop
sign or green light.
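
As a rough illustration of that stop sign, an agent harness could map the ota up --json governance block to a proceed or stop decision. The sketch below is not Ota's own logic. It assumes the field names from the example earlier in this note, it only knows the state value "allowed" that appears there, and it fails closed on anything else. Treating crossing_required as a stop condition is also an assumption about agent policy, and the function name agent_next_step is hypothetical.

```python
def agent_next_step(payload):
    """Return (action, reason) for an agent; fail closed on anything unexpected."""
    governance = payload.get("governance", {})
    preflight = governance.get("preflight", {})
    post = governance.get("post_execution", {})

    if preflight.get("state") != "allowed":
        return ("stop", "preflight state is %r" % preflight.get("state"))
    if preflight.get("crossing_required"):
        # Assumed policy: an agent does not cross a boundary on its own.
        return ("stop", "a boundary crossing is required")
    if preflight.get("receipt_expected") and not post.get("receipt_present"):
        return ("stop", "receipt expected but not present")
    if preflight.get("proof_expected") and not post.get("proof_present"):
        return ("stop", "proof expected but not present")
    return ("proceed", "preflight allowed and expected evidence present")


# Values copied from the ota up --json example earlier in this note.
example = {
    "governance": {
        "preflight": {
            "state": "allowed",
            "crossing_required": False,
            "receipt_expected": True,
            "proof_expected": True,
        },
        "post_execution": {
            "state": "evidence_missing",
            "receipt_present": True,
            "proof_present": False,
        },
    }
}

print(agent_next_step(example))  # ('stop', 'proof expected but not present')
```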

Why this matters for CI

CI also gets better when governance is structured.

Without a real verdict surface, CI usually ends up proving only:

some workflow ran
some job passed

That is weaker than:

the repo’s declared lane was allowed
the required lane actually ran
the expected evidence bar was satisfied
the repo did not silently drift away from its declared governance truth

That is the direction Ota is taking:

repo contract as the source of truth
local runner as fast feedback
machine-readable governance verdict as the integration surface

Then CI, agents, and later harnesses can all consume the same truth.

What still matters

This does not mean Ota is done the moment a JSON block exists.

A governance verdict only earns trust if:

it is emitted where the decision is made
it cites the real decision basis
it distinguishes derived truth from asserted truth
it stays reconciled with the actual execution path

That is why later trust work matters too.

The bar is not just machine-readable governance.

The bar is machine-readable governance that is authoritative enough that a human or agent does not
have to reopen the logs to answer the first question.

The broader point

Execution logs tell you what happened.

Governance verdicts should tell you what Ota allowed, refused, required, and verified.

That is the difference between command output and execution governance.

And that is where Ota gets stronger:

not by hiding logs
not by narrating more prose
but by emitting the governance truth directly

Originally posted here: https://ota.run/blog/from-execution-logs-to-governance-verdicts-in-ota

Pressure-testing Ota on OrchardCore: first-class dotnet restore and honest narrow .NET proof

Bobai Kato — Wed, 08 Jul 2026 20:26:51 +0000

Overview

OrchardCore mattered because it is a real .NET repo, not a toy starter.

Even a narrow slice of the repo carries several different truths at once:

dotnet as the real toolchain owner
restore as a real dependency-hydration lane
build and test as finite CLI surfaces, not shell glue
native and container execution both advertised by the contract
a much larger repo outside the selected slice that the contract should not pretend to own

That made OrchardCore a useful pressure repo for a simple question:

can Ota represent a serious .NET contributor path cleanly without collapsing back into raw shell
or overclaiming the whole repository?

Why this repo mattered

OrchardCore is a broad ASP.NET Core codebase with:

many projects
broader CI workflows
functional and browser-heavy test surfaces
asset and documentation paths outside the selected unit-project slice

That is exactly why the repo is useful.

A weak readiness contract would try to flatten all of that into one vague “build and test”
surface. A stronger contract narrows intentionally and says what it really owns.

The selected OrchardCore slice is honest:

restore one unit-test project
build that same project
test that same project
prove the slice on host and container paths where the contract advertises them

That is the right pressure bar for this repo.

What OrchardCore proved

1. Ota's .NET story is stronger when restore is first-class

The mature setup lane is not:

setup:
  run: dotnet restore test/OrchardCore.Abstractions.Tests/OrchardCore.Abstractions.Tests.csproj

The stronger contract shape is:

setup:
  prepare:
    kind: dependency_hydration
    medium: package_dependencies
    source:
      kind: dotnet_restore
      cwd: test/OrchardCore.Abstractions.Tests

That matters because the contract now owns:

the hydration lane itself
the fact that it is package dependency preparation
the requirement on the dotnet toolchain
the network semantics of the lane

The alternative leaves all of that implicit inside one shell string.

2. Finite .NET task bodies should stay structured

OrchardCore also proved that plain dotnet build and dotnet test lanes do not need to stay as
raw run bodies.

The stronger shape is:

build:
  command:
    exe: dotnet
    args:
      - build
      - --no-restore
    cwd: test/OrchardCore.Abstractions.Tests

and:

test:
  command:
    exe: dotnet
    args:
      - test
      - --no-restore
      - --verbosity
      - minimal
    cwd: test/OrchardCore.Abstractions.Tests

That gives Ota a better execution boundary:

executable identity is explicit
arguments are explicit
working directory is explicit
mode branches only need to vary context, not duplicate command text

That is more mature governance than copying the same shell body across native and container modes.
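
To make the boundary concrete, here is a minimal sketch of how a runner could turn a structured command into an argv list and a working directory without any shell parsing. It is illustrative only and does not describe how Ota executes tasks internally; the helper names to_argv and run_structured and the repo_root parameter are hypothetical.

```python
import os
import subprocess


def to_argv(command):
    """Flatten a structured command into an argv list; no shell string is built."""
    return [command["exe"], *command.get("args", [])]


def run_structured(command, repo_root="."):
    """Run the command without a shell, resolving cwd against repo_root."""
    cwd = os.path.join(repo_root, command.get("cwd", "."))
    return subprocess.run(to_argv(command), cwd=cwd, shell=False, check=False)


# Same fields as the build task shown above.
build = {
    "exe": "dotnet",
    "args": ["build", "--no-restore"],
    "cwd": "test/OrchardCore.Abstractions.Tests",
}

print(to_argv(build))  # ['dotnet', 'build', '--no-restore']
```

Because the executable, arguments, and working directory are separate values, a native mode and a container mode can reuse the same command and change only the surrounding context.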

3. Honest narrowing is more valuable than fake full-repo coverage

OrchardCore did not surface a dramatic new Ota core bug.

What it proved instead is equally useful:

Ota can already carry a real .NET repo slice cleanly when the contract is disciplined about what
it is and is not claiming.

This contract does not pretend to own:

the entire OrchardCore solution
every functional test lane
every database-backed or browser-backed path
every asset or documentation workflow

It owns one clear contributor-readiness slice and proves that slice well.

That is better than a broader but less trustworthy contract.

What changed in the contract

The important changes were not dramatic. They were governance upgrades.

Toolchain ownership is explicit:

toolchains:
  dotnet:
    version: "10.0"
    fulfillment:
      source: dotnet
      mode: run

The workflow now owns the setup boundary directly:

workflows:
  verify:
    prepare:
      task: setup
    run:
      task: verify

And verification stays aggregate-owned instead of shell-chained:

verify:
  aggregate:
    tasks:
      - build
      - test

That final shape is small, but it is honest and machine-readable.
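
One benefit of that shape is that a consumer can compute the execution plan from data alone. The sketch below assumes a parsed contract where task definitions sit under a tasks key and where prepare runs before run and aggregate members run in listed order. Those layout and ordering details are assumptions for illustration, not a statement of Ota's scheduler, and the function names are hypothetical.

```python
def expand(task_name, tasks):
    """Expand an aggregate task into its leaf tasks, in listed order."""
    aggregate = tasks[task_name].get("aggregate")
    if aggregate is None:
        return [task_name]
    ordered = []
    for member in aggregate["tasks"]:
        ordered.extend(expand(member, tasks))
    return ordered


def workflow_plan(workflow_name, contract):
    """Return leaf tasks for a workflow: prepare first, then run."""
    workflow = contract["workflows"][workflow_name]
    plan = []
    if "prepare" in workflow:
        plan.extend(expand(workflow["prepare"]["task"], contract["tasks"]))
    plan.extend(expand(workflow["run"]["task"], contract["tasks"]))
    return plan


# Hand-built stand-in for the parsed contract; task bodies are elided.
contract = {
    "workflows": {
        "verify": {"prepare": {"task": "setup"}, "run": {"task": "verify"}},
    },
    "tasks": {
        "setup": {},
        "build": {},
        "test": {},
        "verify": {"aggregate": {"tasks": ["build", "test"]}},
    },
}

print(workflow_plan("verify", contract))  # ['setup', 'build', 'test']
```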

What the matrix proves

The current green OrchardCore matrix run for this narrowed slice is #28971743168.

That run is enough to support the note because it proved the slice the contract actually claims:

ota validate
ota doctor
ota tasks --use
ota tasks --safe --use
dry-run task coverage
dry-run workflow coverage
native execution on the selected unit-project slice
container planning and execution for the same declared task surfaces
matrix coverage across Ubuntu, macOS, and Windows for the contract branch at that time

That matters more than pretending the branch should already prove every broader OrchardCore lane.

The linked pressure branch is now pinned to released Ota v1.6.23 for stable reference. This
green matrix run is the proof artifact for the narrowed slice this note describes.

Why this repo mattered for Ota

OrchardCore helped confirm that Ota's current .NET surfaces are no longer theoretical.

They are strong enough to model a real ASP.NET Core repo slice with:

first-class restore hydration
structured finite dotnet commands
workflow-owned setup
explicit host/container mode truth

without falling back to raw shell or overclaiming repo coverage.

That is the real value of this pressure repo.

It did not need to expose a dramatic bug to matter. It proved that Ota's newer .NET contract
story is mature enough to use on a serious repository, provided the contract narrows honestly.

Making Ota Governance Output Machine-Readable

Bobai Kato — Tue, 07 Jul 2026 14:02:27 +0000

Overview

Most engineering teams already have governance.

They just cannot execute it reliably.

The rules are split across README sections, CI snippets, old shell scripts, and internal memory. Humans can sometimes stitch that together. Agents usually cannot. New teammates definitely should not have to.

Ota takes a different position:

governance should be declared
execution should be selected from declared truth
outcomes should be emitted as machine-readable proof

That is what makes governance operational instead of aspirational.

The real problem with "documented" governance

In many repos, governance sounds clear until a real run starts:

setup is "documented," but there is no canonical execution surface
verification exists, but no one can say which path is safe, default, or complete
CI is green, but local and agent workflows drift from the CI lane
failures appear, but output does not explain which governance rule failed

This is where delivery slows down and trust drops.

The issue is not missing effort. The issue is missing machine-readable ownership.

What machine-readable governance output means in Ota

Ota separates governance into explicit operational layers:

contract truth in ota.yaml
executable task and workflow truth through ota tasks, ota run, and ota up
proof truth through receipts and JSON output surfaces

That gives one declared system for both humans and automation.

ota doctor
ota validate
ota tasks --use
ota run verify
ota receipt --json

When this is modeled correctly, operators and agents stop guessing which command is "the real one."

What actually shipped in v1.6.23

This is not just a philosophy note.

v1.6.23 widened the real machine-readable governance surface in concrete ways:

ota doctor --json now publishes governance.merge_gate
ota doctor --json now publishes governance.required_verification_lanes
projected verification lanes now carry stable metadata.governance.merge_check_id
ota tasks --json and ota workflows --json now publish capability_profile
execution and proof artifacts now carry clearer stage_family / phase truth
agent-mode governance output now distinguishes non-execution states through fields such as not_run_reason and crossing_record_state

That matters because downstream systems no longer need to scrape prose and guess:

which verification lane is merge-relevant
which lanes are callable or refused for an agent
which governance phase they are looking at
which follow-up artifact or proof path is canonical

This is the difference between “the repo has governance” and “other systems can consume the repo’s governance directly.”

Why this is a major shift for AI agents

Agents fail when repos force them to infer policy from prose.

Ota gives agents bounded, declared surfaces:

what they are allowed to run
what each path requires
what side effects are expected
what happened after execution

That changes the quality of automation from "best effort shell guessing" to governed execution.

It also reduces risk for maintainers because agent behavior can be reviewed against contract and receipt truth, not just prompt intent.

Opinionated by design: one truthful path beats five clever ones

Ota is intentionally opinionated here.

If governance is duplicated in scripts, docs, and CI YAML with no single owner, governance is already drifting.

The stronger pattern is:

one contract surface for readiness and execution truth
one set of canonical task/workflow entrypoints
one machine-readable output model for diagnostics and receipts

That is less "flexible" on paper and far more reliable in production engineering practice.

A concrete before-and-after

Before Ota, teams typically debug governance with archaeology:

read docs
inspect scripts
compare CI behavior
rerun commands until something works

With Ota, governance becomes inspectable output:

{
  "governance": {
    "required_verification_lanes": [
      {
        "merge_check_id": "ota.verify.verify",
        "lane_task": "verify",
        "lane_kind": "aggregate"
      }
    ],
    "merge_gate": {
      "state": "projected",
      "lanes": [
        {
          "merge_check_id": "ota.verify.verify",
          "lane_task": "verify"
        }
      ]
    }
  },
  "capability_profile": {
    "actor_mode": "agent"
  }
}

The exact fields vary by command, but the important point is that this is the real Ota output shape, not a second narrated explanation layered on top afterward.

The value is deterministic, parseable operational truth that CI, UIs, automation, and operators can all consume consistently.
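
As a sketch of what consuming that output could look like, a merge consumer might read the required lanes and compare them with the checks that have already passed. The payload below is a trimmed copy of the example above. The helper names and the passed set are hypothetical, and how a real CI system records passed checks is outside what this note describes.

```python
def required_merge_checks(doctor_payload):
    """List merge_check_id values from governance.required_verification_lanes."""
    governance = doctor_payload.get("governance", {})
    lanes = governance.get("required_verification_lanes", [])
    return [lane["merge_check_id"] for lane in lanes]


def missing_checks(doctor_payload, passed):
    """Return required checks that are not in the passed set."""
    return [c for c in required_merge_checks(doctor_payload) if c not in passed]


# Trimmed copy of the example output above.
doctor_payload = {
    "governance": {
        "required_verification_lanes": [
            {
                "merge_check_id": "ota.verify.verify",
                "lane_task": "verify",
                "lane_kind": "aggregate",
            }
        ]
    }
}

print(missing_checks(doctor_payload, passed=set()))                  # ['ota.verify.verify']
print(missing_checks(doctor_payload, passed={"ota.verify.verify"}))  # []
```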

Why this is an engineering note and not just a product opinion

The engineering issue is simple:

prose can describe governance
scripts can enact governance
CI can partially enforce governance
but if none of that is emitted as stable machine-readable truth, every consumer rebuilds the model differently

That is how drift starts.

The practical bar for Ota is higher:

governance should be declared once
execution should select from that declared truth
machine-readable output should preserve what was selected, what was enforced, and what evidence exists afterward

That is what lets one repo surface drive:

local operator flows
CI and merge consumers
agent harnesses
later auditing and receipt review

Why this matters now

As AI agents become part of daily engineering execution, human-only governance is no longer enough.

If your repo cannot emit machine-readable readiness and proof, your automation stack will stay fragile no matter how good your prompts are.

Ota's approach is practical:

declare the operational truth once
execute declared truth explicitly
emit evidence that can be consumed by people and machines

That is how governance scales without losing trust.
