Bobai Kato for Ota

Posted on Jun 30 • Originally published at ota.run

AGENTS.md Is Not Enough for Safe AI Agent Execution

#agentsmd #agentsafety #reporeadiness #executiongovernance

Overview

AGENTS.md is useful.

It gives AI coding agents a place to find repo-specific guidance:

how to behave
what conventions matter
what areas need extra caution
what kinds of changes should trigger review

That is a meaningful improvement over sending an agent into a repo with no instructions at all.

But AGENTS.md is not enough.

It can tell an agent to be careful.

It cannot, by itself, make execution safe, verification trustworthy, or review inspectable.

For that, a repository needs more than instructions.

It needs:

declared safe commands
a canonical verification path
receipts that show what actually ran

That is the difference between agent guidance and execution governance.

Instructions Help. They Do Not Govern Execution.

An instruction file is still prose.

That means it can express intent, but it does not automatically create operational truth.

For example, AGENTS.md can say:

run the right checks before handoff
avoid destructive commands
do not edit generated files
ask before touching infrastructure

Those are good rules.

But notice what they leave unresolved:

which checks are the right ones
which commands are actually safe
which paths are protected structurally versus only suggested
what should count as evidence that verification happened
how to tell whether a failure came from code, setup, or drift

That is where many agent workflows still break down.

The agent may follow the spirit of the instructions and still take the wrong execution path.

Safe Commands Need To Be Explicit

One of the biggest gaps in agent-oriented repos is that they often declare guidance without declaring a safe command surface.

The repo may tell the agent:

Run tests before you finish.

But that still leaves a dangerous amount of interpretation.

Which task is safe?

Is it:

npm test
pnpm test
make check
docker compose run test
a narrower unit-test path
the CI workflow itself

And if several exist, which one is canonical for a routine code change?

The repo should not force the agent to infer that from scattered hints.

It should declare safe commands explicitly.

That means giving the repo a machine-readable answer to questions like:

what tasks exist
which ones are agent-safe
what they depend on
what runtime mode they use
what they are expected to verify

That is much stronger than asking an agent to "be careful" around shell commands it still has to interpret.
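
As a rough illustration of what a machine-readable answer could look like, here is a minimal Python sketch of a task registry. The field names (`agent_safe`, `mode`, `verifies`), the task names, and the commands are assumptions made up for this example, not the schema of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    command: tuple[str, ...]          # executable plus arguments, never a shell string
    agent_safe: bool = False          # hypothetical flag: may an agent run this unprompted?
    depends_on: tuple[str, ...] = ()  # tasks that must run first
    mode: str = "native"              # hypothetical runtime mode label
    verifies: str = ""                # what a passing run is expected to prove


# Hypothetical registry; a real one would be loaded from the repo contract.
TASKS = {
    "setup": Task("setup", ("pnpm", "install"), agent_safe=True,
                  verifies="dependencies installed"),
    "test": Task("test", ("pnpm", "test"), agent_safe=True,
                 depends_on=("setup",), verifies="unit tests pass"),
    "deploy": Task("deploy", ("make", "deploy"), verifies="release published"),
}


def agent_safe_tasks(tasks: dict[str, Task]) -> list[str]:
    """Return the names an agent may run, sorted for stable output."""
    return sorted(name for name, task in tasks.items() if task.agent_safe)


print(agent_safe_tasks(TASKS))
```

The point of the sketch is that "which tasks are agent-safe" becomes a lookup rather than an interpretation of prose.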

Verification Needs To Be A Path, Not A Suggestion

The second gap is verification.

Many repos still treat verification like a recommendation rather than a declared path.

An instruction file might say:

Make sure everything still works before handoff.

That sounds fine, but it is too loose for reliable agent execution.

A trustworthy repo should be able to say something more concrete:

this is the setup path
this is the finite verification workflow
these tasks are safe to run for routine work
this heavier path exists, but it is not the default

That is the difference between advice and governance.

Without a declared verification path, the agent may:

pass a narrow local check and miss the real gate
run a destructive path unnecessarily
skip a required service-backed test lane
choose the wrong runtime mode
report success without proving repo readiness
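
One way to picture a declared verification path is as an ordered plan derived from task dependencies. The sketch below is an assumption-laden illustration: the dependency map and the `verification_plan` helper are hypothetical names, and the ordering is a plain depth-first traversal.

```python
def verification_plan(depends_on: dict[str, list[str]], target: str) -> list[str]:
    """Return target and everything it depends on, in the order they must run."""
    order: list[str] = []
    visiting: set[str] = set()

    def visit(name: str) -> None:
        if name in order:
            return  # already scheduled
        if name in visiting:
            raise ValueError(f"dependency cycle at {name!r}")
        if name not in depends_on:
            raise KeyError(f"undeclared task {name!r}")
        visiting.add(name)
        for dep in depends_on[name]:
            visit(dep)
        visiting.discard(name)
        order.append(name)

    visit(target)
    return order


# Hypothetical dependency map: task name -> tasks it depends on.
DEPENDS_ON = {
    "setup": [],
    "lint": ["setup"],
    "test": ["setup"],
    "verify": ["lint", "test"],
}

print(verification_plan(DEPENDS_ON, "verify"))
```

Because the plan is computed from declarations, an undeclared task is an error rather than something the agent improvises.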

Receipts Are The Missing Trust Layer

Even explicit commands and verification paths are still weaker than they should be if nothing records what actually ran.

This is where receipts matter.

A verification receipt is the difference between:

"the agent says it ran the checks"

and:

"the repo can show which task ran, under which contract, in which mode, with what outcome"

That is the trust boundary most agent workflows still lack.

Receipts help answer questions like:

what contract or workflow was selected
what task actually executed
what backend or runtime mode was used
whether setup ran first
whether readiness was reached
what evidence existed when the run failed

Without receipts, review still depends too heavily on:

agent narration
terminal screenshots
CI guesswork
someone remembering what the command probably was

With receipts, verification becomes inspectable.
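
To make the idea concrete, here is a minimal sketch of a harness that runs a command and builds the receipt from what it observed. The receipt fields and the `run_with_receipt` name are assumptions for illustration, not the format of any real tool.

```python
import hashlib
import json
import subprocess
import sys
import time


def run_with_receipt(task: str, argv: list[str], mode: str = "native") -> dict:
    """Run argv and return a receipt built by the harness from what it observed."""
    started = time.monotonic()
    proc = subprocess.run(argv, capture_output=True, text=True)
    receipt = {
        "task": task,
        "argv": list(argv),
        "mode": mode,  # hypothetical runtime mode label
        "exit_code": proc.returncode,
        "duration_s": round(time.monotonic() - started, 3),
        "stdout_sha256": hashlib.sha256(proc.stdout.encode("utf-8")).hexdigest(),
    }
    # Digest over the canonical JSON of the fields above. This only catches
    # accidental or naive edits; a real harness would sign the receipt or store
    # it somewhere the agent cannot write.
    body = json.dumps(receipt, sort_keys=True).encode("utf-8")
    receipt["digest"] = hashlib.sha256(body).hexdigest()
    return receipt


example = run_with_receipt("hello", [sys.executable, "-c", "print('ok')"])
print(json.dumps(example, indent=2))
```

The exit code and output hash come from the process itself, so the record does not depend on what the agent reports afterwards.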

What A Better Repo Looks Like

A stronger repository keeps these layers distinct:

AGENTS.md for human-written behavioral guidance
a contract surface for tasks, workflows, safe commands, and boundaries
receipts for execution evidence

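For example, AGENTS.md carries the behavioral guidance:

- Prefer small diffs.
- Do not edit generated files manually.
- Escalate before changing deployment or billing flows.
- Use the declared verification path before handoff.

while a contract file such as ota.yaml declares the tasks, safe commands, and workflow:
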
agent:
  safe_tasks:
    - lint
    - typecheck
    - test
  verify_after_changes:
    - test

tasks:
  test:
    command:
      exe: pnpm
      args: [test]
    depends_on:
      - setup

workflows:
  verify:
    setup:
      task: setup
    run:
      task: test

And then the execution layer should be able to produce evidence rather than only output:

ota run test --json
ota receipt --json --archive

The exact tool does not matter as much as the structure:

instructions
safe commands
verification path
receipt

That is the minimum shape of trustworthy agent execution.

Why This Matters More Now

This separation was already useful when agents mostly suggested edits.

It becomes much more important when agents are expected to:

choose commands
prepare environments
run checks
interpret failures
decide whether work is complete

At that point, the problem is no longer just "does the agent have instructions?"

The problem is whether the repo can expose:

a safe execution surface
a deterministic verification path
evidence that the declared path actually ran

That is a higher bar than AGENTS.md alone can satisfy.

This Is The Stronger Split

If you only need the boundary between instructions and contracts, read:

This post is narrower.

Its claim is not just that AGENTS.md and ota.yaml do different jobs.

Its claim is that even a good instruction file is still not enough unless the repo also declares:

which commands are safe
which verification path is canonical
what receipt counts as evidence

Bottom Line

AGENTS.md is a good start.

But repo instructions alone do not make agent execution safe, reviewable, or trustworthy.

To get there, repositories also need:

explicit safe commands
declared verification workflows
receipts that preserve execution evidence

That is how you move from:

"the agent had guidance"

to:

"the repo had governed execution"

Original Post: https://ota.run/blog/agents-md-is-not-enough-for-safe-ai-agent-execution

Top comments (19)

Vinicius Pereira • Jun 30

Strong agree on the core, and imo it's the same reason prompt-based guardrails fail: an instruction the agent can choose to ignore isn't a control, it's a suggestion w/ good intentions. Prose can't constrain execution, only the harness can. The model proposes, the code disposes.

Two places I'd push it further though. Declared safe commands only bite if the runner actually enforces the allowlist, i.e. the agent can run the declared tasks and literally nothing else. If it can still shell out to anything and you're just hoping it prefers the safe list, ota.yaml is AGENTS.md w/ nicer syntax, the teeth are in the runner refusing whatever isn't declared, not in the declaration itself. Same logic for receipts: a receipt is only worth something if the harness emits it, not the agent. If the agent writes its own receipt you're back to narration w/ extra steps, since the thing you're verifying is also the thing producing the evidence. Trust comes from the evidence being generated outside the agent's control. Nail those two and the instructions / contract / receipt split is genuinely the right shape.
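
A minimal sketch of a runner that refuses anything outside the declared set, under the assumption that tasks map to fixed argument lists. `resolve_command`, `DECLARED`, and `SAFE` are hypothetical names for illustration and not Ota's API.

```python
class UndeclaredTaskError(PermissionError):
    """Raised when a caller asks for something outside the allowed surface."""


# Hypothetical declarations: task name -> argv, plus the agent-safe subset.
DECLARED = {
    "lint": ["pnpm", "lint"],
    "test": ["pnpm", "test"],
    "deploy": ["make", "deploy"],
}
SAFE = {"lint", "test"}


def resolve_command(task_name: str, declared: dict[str, list[str]], safe: set[str]) -> list[str]:
    """Map a task name to its argv, refusing anything not declared and agent-safe."""
    if task_name not in declared:
        raise UndeclaredTaskError(f"{task_name!r} is not a declared task")
    if task_name not in safe:
        raise UndeclaredTaskError(f"{task_name!r} is declared but not agent-safe")
    # Return a copy so callers cannot mutate the declared command.
    return list(declared[task_name])


print(resolve_command("test", DECLARED, SAFE))
```

The agent passes a task name, never a shell string, so there is nothing to interpret and nothing to route around inside this function.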

Bobai Kato Ota • Jul 1 • Edited

Yes, fair push.

Ota is stronger than AGENTS.md because execution is contract-based and receipts are harness-authored, not agent-authored. But you’re right that safe_tasks is not yet a full runner-enforced allowlist.

Today, Ota can restrict execution to declared contract tasks, and it uses safe_for_agent / agent.safe_tasks for visibility, governance, and review. What is still missing is an agent-scoped execution mode where the runner refuses declared tasks outside the safe set.

So this is a real product gap, and your framing is the right bar. We’ll prioritize it.

Curious how you enforce that today. Are you using a custom harness, a sandboxed runner, or a tool we should look at?

Vinicius Pereira • Jul 1

Honestly it's less a product and more a discipline, I'm not running some slick custom sandbox. The rule I hold is that the agent never sits in the enforcement path, so two things.

First, a capability boundary instead of a preference. The agent only ever gets a fixed set of typed tools, so an undeclared action isn't deprioritized, it just isn't callable, there's no shell to reach for in the first place. Where I do use an agent runner, I lean on the harness being able to hard-deny a call before it executes rather than the model choosing to behave, e.g. Claude Code's permission modes + pre-tool hooks do exactly this, a blocked call is the runner refusing, not the model deciding. That's the closest off-the-shelf thing to what you're describing that I'd point you at.

Second, the receipt has to live one layer down from the thing being judged. For code work that's just CI as the runner, the agent proposes a diff but the gate outside it decides pass/fail and emits the evidence, and it can't merge past a red gate no matter what it thinks it did. I've got a small project (bedrock) built around this, the reliability eval runs in CI against a known-good key and fails the build, so the agent can't author its own green check, the harness does. The second the agent can write the receipt or reach past its tool set you're back to narration w/ extra steps, so the whole game is keeping the thing that verifies separate from the thing that acts. safe_tasks going runner-enforced is basically that, and imo it's the right thing to prioritize.

Bobai Kato Ota • Jul 1

That makes sense, but it also shows why discipline alone feels transitional to me.

If the boundary lives in operator habit, local harness setup, or “we know not to give the agent a shell,” it can work for one careful team, but it gets harder to scale cleanly across repos, teams, and orgs. What do you think?

The thing I keep coming back to is: whose discipline is the control once this spreads? The individual operator, the team, platform engineering, or security? And how do you keep that boundary consistent when different teams use different agents and runners?

That is the part we want Ota to handle better: declare the allowed execution surface in the repo contract, let the runner enforce it, and keep the receipt and evidence outside the agent.

Curious how you think about that scaling problem today.

Vinicius Pereira • Jul 1

Yeah, I'll concede the framing, discipline as the control doesn't scale, and honestly it's the same critique I'd make of my own answer. "We know not to give the agent a shell" is just a prompt guardrail wearing an ops hat. So declaring the surface in the repo and having a runner enforce it is the right direction, no argument there.

Where I'd push is on where the teeth actually live once it spreads, because I don't think a repo contract plus a runner fully closes it either. The runner is per-dev and opt-in, so the moment one team points a different agent at the repo, or just runs the thing outside Ota, the contract is back to being a suggestion for them. You've moved the boundary from operator habit to runner adoption, which is genuinely better, but it's still a boundary someone can route around by not using the runner. So to me the durable version is the contract compiling down to enforcement at the points that are already mandatory and agent-agnostic across an org: CI required checks and branch protection on the merge side, and the sandbox plus capability and egress boundary on the execution side. Those don't care which agent or runner you used, they sit at a chokepoint nobody opts out of because it's the same gate human PRs already go through.

So I'd frame Ota's contract as the spec, the single declared source of what's allowed, and then the scaling answer is less "everyone adopts the same runner" and more "that spec is enforced at the org chokepoints that already exist and are already non-optional." The local runner enforcing it is great for fast feedback, but imo that's the best-effort layer, not the teeth. Whose discipline is the control? Ideally nobody's, it's the merge gate and the sandbox, same as it already is for the humans.

Bobai Kato Ota • Jul 2

That’s a strong point, and I think it pushes Ota in the right direction. I appreciate the feedback there.

I agree the local runner is not the final teeth if using it is still optional. That makes it the fast-feedback layer, not the durable control.

The stronger shape is:

ota.yaml as the declared spec
local runner enforcement for immediate feedback
CI required checks and branch protection as the merge gate
sandbox, capability, and egress controls as the execution boundary

That feels like the real scaling path, because it does not depend on everyone using the same agent or even the same runner. The contract stays the source of truth, but the enforcement lives at the chokepoints teams already cannot opt out of.

Curious where you’d start first if you were building that out: compile the contract into CI required checks, or into sandbox/capability policy for the runner?

Vinicius Pereira • Jul 2

I'd start with CI required checks, mostly for the reason you already named: it's the chokepoint nobody can opt out of, and it's one you inherit for free. Branch protection is already sitting in every repo, so you get a non-optional gate without building a new enforcement surface, and it holds even when people are on different runners or agents.

But I'd be clear about what that actually buys you. CI can only enforce the declared intent, whether the change conforms to ota.yaml at merge time. It can't see what the thing does once it runs. So it's the right first gate, but a green check means "this matches the contract," not "this can't misbehave." The sandbox and capability layer is what closes that gap, because it enforces at execution, where the unpredictable behavior actually shows up.

So the order I'd go is CI first, then compile the same ota.yaml into the sandbox and capability policy, so the runtime boundary is derived from the same source of truth instead of a second policy you hand-maintain. The failure I'd want to avoid is the two drifting apart, where CI says compliant and the runner still allows something the spec quietly stopped describing. One contract, two enforcement points, same file.

Bobai Kato Ota • Jul 2

That makes sense.

That’s the direction I'll be pushing Ota toward now: one contract, multiple enforcement points.

The part I’m most interested in is the runtime side, because CI is clearer as a first enforcement point. On the sandbox/capability side, what are you actually trusting today: filesystem boundaries, tool allowlists, network/egress controls, or something else? Even a rough example from a real repo flow would be useful.

Vinicius Pereira • Jul 2

honestly, of the three, network egress is the one i trust most and instrument first, because it's where the irreversible stuff lives. filesystem and tool allowlists protect the repo; egress protects the outside world (prod writes, exfil, paid-api calls), and it's the hardest to fake compliance on. so my default is deny-all egress with an allowlist of the exact hosts a flow legitimately needs.

concrete example from a data pipeline i run: it pulls from one upstream source and writes to a google sheet, an airtable base, and a slack webhook, nothing else. so the capability boundary is literally those hosts on the egress allowlist, plus the repo and config mounted read-only and a single writable outputs/cache dir. if the process reaches any other host or tries to write outside that dir, it's already outside the contract and it dies there. i don't need it to declare intent, the boundary just doesn't exist for it. that maps onto your model cleanly: ota.yaml declares the allowed hosts and the writable paths, CI checks they're declared, and the runtime compiles that same block into default-deny egress plus a read-only mount.
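
a rough sketch of that boundary as data plus two checks, for illustration only. the host list, the `/work/outputs` path, and the function names are assumptions, and in a real setup the enforcement would sit in the network and mount configuration rather than in application code like this.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumed egress allowlist for the flow described above (hypothetical values).
ALLOWED_HOSTS = frozenset({
    "sheets.googleapis.com",
    "api.airtable.com",
    "hooks.slack.com",
})
# Assumed single writable directory; everything else is treated as read-only.
WRITABLE_ROOT = PurePosixPath("/work/outputs")


def egress_allowed(url: str) -> bool:
    """Default deny: only HTTPS requests to an exactly matching host pass."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS


def write_allowed(path: str) -> bool:
    """Allow writes only at or under the writable root, rejecting '..' segments."""
    p = PurePosixPath(path)
    if not p.is_absolute() or ".." in p.parts:
        return False
    return p == WRITABLE_ROOT or WRITABLE_ROOT in p.parents


print(egress_allowed("https://api.airtable.com/v0/some-base"))
print(write_allowed("/work/outputs/report.csv"))
```

the host match is exact on purpose, so a lookalike such as `api.airtable.com.evil.example` does not pass.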

the honest part: i trust tool/command allowlists the least on their own, because one broad tool reopens everything. a shell node, or an llm step that can emit arbitrary code, and the neat allowlist is decorative. so i don't lean on the tool list as the boundary, i keep the tool surface narrow and let egress and filesystem be the real backstop, on the assumption that any tool might do more than it says. it's cheaper to constrain what a step can reach and touch than to enumerate everything it might try to do.

Bobai Kato Ota • Jul 2

This is very useful. The key point is that the real boundary is reach and touch, not just command choice. Tool allowlists can narrow the surface, but egress and filesystem are the harder backstops because they constrain what execution can actually affect.
That maps cleanly to the direction I want Ota to go: contract-declared hosts, writable paths, and runtime compilation into default-deny egress plus explicit writable boundaries from the same ota.yaml.

One thing I’m curious about from your example: is host-level allowlisting usually enough, or do you often want it narrower than that?

Vinicius Pereira • Jul 2

host-level is usually where i start, but it's only actually enough when the host is single-purpose and i own the whole surface. the slack webhook is the clean case: that host only accepts a post to one webhook path, so the host basically IS the capability and nothing narrower buys me anything.

where it stops being enough is the multi-capability hosts. sheets.googleapis.com and api.airtable.com are the same host for every sheet and every base on the planet. allow the host and a compromised step can still read or write any sheet or base the token can reach, not just the one output. so on those i don't try to write a finer network rule, i narrow the credential instead: the service account is shared onto that one sheet, the airtable PAT is scoped to that one base. the egress rule says which host, the token scope says what on that host, and a wrong destination is useless even when the step reaches it.

the reason i don't push the network rule below host is that TLS hides the path, so sub-host filtering means running a proxy that can see the request, which is a lot of moving parts for what credential scoping already gives me for free. the one place that logic breaks is shared object storage. an s3 or gcs bucket sits behind a host thousands of tenants share, so host-allow there still lets you exfil to your own bucket on the same name. that's the case where i'll actually reach for a per-bucket hostname or a private endpoint, because the credential trick doesn't save me.

so the way i'd put it: host-level is the right granularity for the network layer, but it's rarely the whole boundary on its own. for shared apis the tighter boundary is a token that can only address one thing, and i'd trust that over any finer network rule because it holds even when a step reaches the host it shouldn't.

Bobai Kato Ota • Jul 3

Makes sense. It sounds like the main special case is shared object storage, where host-level allowlisting is not enough and you need a bucket or private-endpoint style boundary instead, yeah?

That gives me a cleaner runtime shape to work toward: host-level egress as the default network layer, writable path boundaries alongside it, and explicit treatment of shared-host exceptions instead of pretending one rule fits every integration.

Have you mostly seen that shared-host exception with object storage, or are there other integrations where host-level allowlisting also breaks down for you?

Vinicius Pereira • Jul 3

object storage is the most common one, but it's really one of three shapes, and they fail for different reasons, so i've started sorting them that way.

first is the multi-tenant host, which is where object storage sits. one host, everyone's data behind it: s3, gcs, but also the shared CDNs and package registries, raw.githubusercontent.com serves any repo, registry.npmjs.org serves any package. host-allow reaches all of it. credential scoping handles the api version of this (the token can only see one base), but for the plain-storage and CDN version the token isn't in the path, so there it's a per-bucket hostname or a proxy that can see the request.

second is the relay host, the ones whose whole job is to fetch something on your behalf: url unfurlers, image proxies that take a source url, screenshot and import-from-url services. allow one of those and a compromised step doesn't reach the target directly, it asks the relay to. the host is allowed, but the effective destination is whatever you hand it. that one host-level and credential scoping both miss, because the reach is the feature.

third is the send host, and this is the one i've stopped trying to fix at the network or token layer at all: email, sms, a generic webhook post. the host's entire purpose is to deliver caller-supplied data to a caller-supplied destination. a send-only sendgrid key is correctly scoped and still lets a compromised step mail your data to anyone, because the recipient is the payload, not the permission. exfil rides out through the legitimate function. the only thing that closes it is an allowlist of destinations (recipients, webhook urls) at the policy layer, above both egress and credentials.

so the way i'd frame it for the runtime: host-level is the default and it holds for single-purpose hosts you own. multi-tenant hosts need a tighter address (credential scope for apis, per-bucket name for storage). relay and send hosts need destination constraints, because the danger there isn't which host you reached, it's where the host goes next. object storage just happens to be the one everyone hits first.
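
the same framing as a small lookup, as a sketch only. the category and control names are labels invented for this example, not a standard taxonomy.

```python
from enum import Enum


class HostKind(Enum):
    SINGLE_PURPOSE = "single_purpose"
    MULTI_TENANT_API = "multi_tenant_api"
    MULTI_TENANT_STORAGE = "multi_tenant_storage"
    RELAY = "relay"
    SEND = "send"


# Hypothetical control names, following the breakdown above.
REQUIRED_CONTROLS = {
    HostKind.SINGLE_PURPOSE: ("host_allowlist",),
    HostKind.MULTI_TENANT_API: ("host_allowlist", "scoped_credential"),
    HostKind.MULTI_TENANT_STORAGE: ("host_allowlist", "per_bucket_hostname_or_proxy"),
    HostKind.RELAY: ("host_allowlist", "destination_allowlist"),
    HostKind.SEND: ("host_allowlist", "destination_allowlist"),
}


def missing_controls(kind: HostKind, in_place: set[str]) -> list[str]:
    """List the controls this host kind needs that are not yet in place."""
    return [c for c in REQUIRED_CONTROLS[kind] if c not in in_place]


print(missing_controls(HostKind.SEND, {"host_allowlist", "scoped_credential"}))
```

a send host with an allowlisted host and a scoped key still comes back with `destination_allowlist` missing, which is the gap described above.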

Bobai Kato Ota • Jul 3

Very interesting and useful. I like the way you’re separating where the real boundary actually lives: network boundary, credential or tenant boundary, effective destination boundary.

That makes the direction for Ota much clearer:
host allowlist as the baseline, then tighter handling for multi-tenant, relay, and send hosts instead of pretending they are all the same.
Very strong input. Thank you.

In practice, which do you see more often causing problems: relay hosts or send hosts? And is the usual failure over-broad allowlists, or just missing destination constraints entirely?

Vinicius Pereira • Jul 3

send hosts cause the incidents people notice, relay hosts cause the ones they don't. almost every stack has mail or a webhook somewhere, so when something goes wrong loudly that's usually the exit door, and it's what shows up in the postmortem. relays are sneakier because nobody remembers allowing them: the image proxy or the unfurler got allowlisted as a side effect of a legit feature months earlier, and it never reads like an egress path when you audit the list. so by incident count i'd say send hosts, by time-to-detection relays win by a lot.

on the second question, in my experience it's almost never that the allowlist is too broad, it's that destination constraints don't exist at all, and i'd argue it's not negligence. the vendors don't offer the knob. a send-only sendgrid key is the most scoped thing their console gives you, there is no "only these recipients" option, and it's the same story for most webhook and sms providers. so teams do the responsible thing at the only layer the tooling exposes, the host lands on the allowlist because the app genuinely needs it, and everyone reads that as covered. the recipient allowlist has to be built in your own code path, checked before the call goes out, and since nothing prompts you to build it, it mostly doesn't exist until an incident makes the case for it.

the practical version that's worked for me: pin the destination in config as data (the one webhook url, the short list of recipient domains) and have the sending function refuse anything else, no matter what the model or an upstream step asks for. it's boring and it's twenty lines, but it moves the decision from the credential, which can't express the constraint, to code that can.
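
a minimal sketch of that pattern, assuming one pinned webhook url and a short list of recipient domains. the url, the domain, and the function names are placeholders for illustration.

```python
# Pinned destinations, kept as plain data (hypothetical values).
PINNED_WEBHOOK = "https://hooks.example.com/services/alerts"
PINNED_RECIPIENT_DOMAINS = frozenset({"example.com"})


class DestinationRefused(Exception):
    """Raised when a send targets anything other than a pinned destination."""


def check_webhook(url: str) -> str:
    """Return the url only if it is exactly the pinned webhook."""
    if url != PINNED_WEBHOOK:
        raise DestinationRefused(f"webhook not pinned: {url!r}")
    return url


def check_recipient(address: str) -> str:
    """Return the address only if its domain is on the pinned list."""
    # rpartition takes the text after the last '@', so 'a@example.com@evil.example'
    # is judged by 'evil.example', not by the earlier domain.
    local, sep, domain = address.rpartition("@")
    if not sep or not local or domain.lower() not in PINNED_RECIPIENT_DOMAINS:
        raise DestinationRefused(f"recipient not pinned: {address!r}")
    return address


print(check_recipient("ops@example.com"))
```

the sending function would call these checks before the network call, so a request from the model or an upstream step cannot widen the destination set.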

Bobai Kato Ota • Jul 3

That’s solid input. What also stands out is your point that the real gap often is not “bad allowlist hygiene,” it’s that destination constraints do not exist at all because the vendor surface stops at host or credential scope. So the only honest place to enforce it is in the app path itself.

Out of curiosity, when you’ve put those destination checks in code, have you mostly treated them as repo-local policy, or did you ever want them declared centrally and reused across services?

Vinicius Pereira • Jul 4

repo-local first, every time. the twenty lines live next to the send call, the review happens in the same pr that changes the behavior, and there is no runtime dependency on anything. for one or two services that is the whole answer, and building more than that is ceremony.

the itch for central shows up the third time the same destination appears in another repo. copy-paste allowlists drift: one gets updated, the others silently do not, and the stalest copy is the one that bites you. what i landed on is central declaration, local enforcement. the destinations live in one versioned place as plain data, a config repo or a tiny shared package: this webhook for alerts, these recipient domains for receipts. each service pulls that pin at build or deploy time, and the refusal still happens in its own send path. one source of truth kills the drift, and there is still no policy service in the hot path to go down, or to get a broad unblock-everyone rule slapped on it during an incident, which is the wildcard problem again one layer up.

the version i avoid is enforcement in a central proxy, unless you already run one for other reasons. on paper it is the perfect chokepoint, in practice it inherits everyone's uptime and every send failure becomes its incident. data central, decision local, fail closed on a stale pin. boring, which is the point.

Bobai Kato Ota • Jul 4

This is very useful signal.

The split you described feels right: repo-local first, and only centralize the destination data when reuse makes drift the real problem.

When that shared destination truth drifted, was the main pain usually review visibility, stale pins, or rollout coordination across repos?

Vinicius Pereira • Jul 4

stale pins, and it was not close. rollout coordination is the pain you feel: you change the data and then chase every consumer to redeploy, but it is loud and bounded, you always know who is behind. review visibility mostly solved itself once the destination data lived in one repo, a change was a normal pr with a normal diff. stale pins are the quiet one. because the pin is pulled at build or deploy time, a service that has not shipped in six weeks is enforcing six-week-old truth, and nothing about it looks wrong. the drift is not a diff anyone reviews, it is an age everyone forgets.

two things took the teeth out of it. first, the pin version rides in the service startup log and in the send receipts, so "which truth is this service on" becomes a query instead of an investigation. second, an age alarm instead of a hard fail: warn when the pin is older than n days or a few versions behind, block only past a hard limit, because fail-closed on stale means a service that cannot deploy also cannot send, and that trade is rarely worth it. bonus points if a change to the central data fans out consumer rebuilds through ci, which turns the coordination chase into a pipeline.
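
a small sketch of the age alarm, with thresholds picked arbitrarily for the example. `WARN_AFTER`, `BLOCK_AFTER`, and `pin_status` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds: warn early, block only past a hard limit.
WARN_AFTER = timedelta(days=14)
BLOCK_AFTER = timedelta(days=60)


def pin_status(pinned_at: datetime, now: datetime) -> str:
    """Classify a destination pin by age as 'ok', 'warn', or 'block'."""
    age = now - pinned_at
    if age > BLOCK_AFTER:
        return "block"
    if age > WARN_AFTER:
        return "warn"
    return "ok"


now = datetime(2025, 7, 4, tzinfo=timezone.utc)
six_weeks_old = now - timedelta(weeks=6)
print(pin_status(six_weeks_old, now))
```

with these numbers a service that has not shipped in six weeks warns but can still send, and only a pin older than the hard limit blocks.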

and since you are clearly circling a runtime design here, happy to go deeper on how this maps to ota specifically, my dm is open.
