Ross

Posted on Jun 7

AI Agents Don’t Need More Prompts. They Need Execution Boundaries.

#ai #security #devops #machinelearning

AI agents are moving from chat into action.

They can call tools, send emails, update records, delete data, trigger workflows, deploy code, issue refunds, change IAM permissions, and interact with MCP servers.

That shift is powerful.

It is also where things start to get dangerous.

Most AI safety conversations still focus on the model:

Can we make the model follow instructions?
Can we stop prompt injection?
Can we make the agent reason better?
Can we stop it hallucinating?

Those questions matter.

But they miss the moment that matters most:

What happens when the agent is about to actually do something?

Because at that point, the prompt is no longer the control surface.

The execution boundary is.

The problem: the agent can be wrong, but the side effect still happens

Imagine an agent connected to a refund tool.

The user asks:

Refund order ord-123 for £25.

The agent correctly calls:

```python
issue_refund(order_id="ord-123", amount_cents=2500)
```

Fine.

But now imagine the agent is prompt-injected, confused, compromised, or just wrong.

It calls:

```python
issue_refund(order_id="ord-456", amount_cents=250000)
```

Or it repeats the same refund twice.

Or it uses a proof meant for one customer against another customer.

Or it calls a more dangerous tool than the one the user actually authorised.

At that point, another system has to decide:

Is this exact action allowed to happen?

Not “does the model seem trustworthy?”

Not “did the prompt say to be careful?”

Not “does this look roughly similar to the original request?”

The question is:

Is there valid proof for this exact action, with these exact parameters, for this exact service, right now?

If not, the side effect should not execute.

The idea: no valid proof, no execution

I’ve been working on an open-source project called Actenon Kernel.

The idea is simple:

AI agents can propose actions. Protected systems decide whether those actions are allowed to execute.

Actenon is not a prompt filter.

It is not an output moderator.

It does not try to make the model truthful.

It sits at the execution boundary and refuses consequential actions unless the caller presents a cryptographic proof bound to the exact action being attempted.

That proof can bind:

the action name
the capability
the exact parameters
the target resource
the intended audience/service
expiry time
replay protection
policy or approval evidence

If the proof is missing, expired, replayed, audience-mismatched, malformed, or bound to different parameters, the action is refused before the side effect.
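To make the binding concrete, here is a minimal sketch of a proof bound to an action, assuming an HMAC over canonical JSON. This is not Actenon's implementation or wire format: `SECRET`, `mint`, `check`, and the refusal labels are hypothetical names chosen for illustration, and a real deployment would likely use asymmetric signatures and a durable nonce store.

```python
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical demo key; a real issuer would hold this in a key store.
SECRET = b"demo-signing-key"


def canonical(obj: dict) -> bytes:
    """Serialise deterministically so equal dicts produce equal bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()


def sign(claims: dict) -> str:
    """HMAC-SHA256 over the canonical form of the claims."""
    return hmac.new(SECRET, canonical(claims), hashlib.sha256).hexdigest()


def mint(action: dict, audience: str, nonce: str, ttl_s: int = 60,
         now: Optional[int] = None) -> dict:
    """Issue a proof bound to one action, one audience, and one expiry."""
    issued = int(time.time()) if now is None else now
    claims = {
        "action_hash": hashlib.sha256(canonical(action)).hexdigest(),
        "aud": audience,
        "exp": issued + ttl_s,
        "nonce": nonce,
    }
    return {"claims": claims, "sig": sign(claims)}


def check(proof: Optional[dict], action: dict, audience: str,
          seen_nonces: set, now: Optional[int] = None) -> str:
    """Return "OK" or the first refusal reason that applies."""
    current = int(time.time()) if now is None else now
    if proof is None:
        return "PROOF_REQUIRED"
    claims, sig = proof.get("claims"), proof.get("sig")
    if not isinstance(claims, dict) or not isinstance(sig, str):
        return "MALFORMED"
    # Verify the signature before trusting any field inside the claims.
    if not hmac.compare_digest(sign(claims).encode(), sig.encode()):
        return "BAD_SIGNATURE"
    if claims["aud"] != audience:
        return "AUDIENCE_MISMATCH"
    if current >= claims["exp"]:
        return "EXPIRED"
    if claims["action_hash"] != hashlib.sha256(canonical(action)).hexdigest():
        return "ACTION_MISMATCH"
    if claims["nonce"] in seen_nonces:
        return "REPLAYED"
    seen_nonces.add(claims["nonce"])  # consume the nonce only on success
    return "OK"
```

The side effect should run only when `check` returns `"OK"`. Changing any parameter of the action changes its hash, so a proof minted for `ord-123` cannot be presented for `ord-456`.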

A tiny example

The mental model looks like this:

```python
from actenon import ActenonGate

gate = ActenonGate.local_dev(audience="service:refunds")

action = gate.build_action(
    "refund.issue",
    "payment.refund",
    {"order_id": "ord-123", "amount_cents": 2500},
    target_type="order",
    target_id="ord-123",
    tenant_id="demo",
    requester_id="support-agent",
)

# Local demo only.
# In production, this proof would be minted by your auth layer,
# policy engine, approval workflow, or control plane.
proof = gate.mint_proof(action)

outcome = gate.protect(
    action,
    proof,
    lambda: issue_refund("ord-123", 2500),
    audience="service:refunds",
)
```

The important part is the lambda.

If the proof does not validate, that function never runs.

The model can ask.

The boundary decides.

Why this matters for MCP and agent tools

MCP makes it easier for agents to reach tools.

That is useful.

But it also means a model-visible tool can become a bridge into real systems: filesystems, databases, CRMs, terminals, deployment pipelines, payment systems, and internal admin workflows.

So the question becomes:

How does the tool decide whether a specific call should execute?

Actenon’s answer is that the MCP tool should not rely on the model behaving correctly. It should require proof at the point of execution.

A prompt-injected agent might call the tool.

The tool still refuses unless the proof matches the exact action.
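One way to picture this at the tool level is a guard that wraps the tool function itself. The sketch below uses plain Python with no MCP library. `requires_proof`, `Refused`, and `demo_verify` are hypothetical names, and `demo_verify` is only a stand-in for real cryptographic verification.

```python
from functools import wraps
from typing import Callable, Optional


class Refused(Exception):
    """Raised when a tool call arrives without an acceptable proof."""


def requires_proof(action_name: str,
                   verify: Callable[[str, dict, Optional[dict]], bool]):
    """Wrap a tool so its body only runs after verify() accepts the proof."""
    def decorator(tool):
        @wraps(tool)
        def guarded(*, proof: Optional[dict] = None, **params):
            if not verify(action_name, params, proof):
                raise Refused(f"{action_name}: no valid proof for {params}")
            return tool(**params)
        return guarded
    return decorator


def demo_verify(action_name: str, params: dict, proof: Optional[dict]) -> bool:
    # Stand-in for signature checks: accept only an exact action/params match.
    return (
        proof is not None
        and proof.get("action") == action_name
        and proof.get("params") == params
    )


executed = []  # records which side effects actually ran


@requires_proof("refund.issue", demo_verify)
def issue_refund(order_id: str, amount_cents: int) -> str:
    executed.append((order_id, amount_cents))
    return f"refunded {order_id}"
```

Whatever the model puts in the tool call, the body of `issue_refund` is unreachable unless the proof passed alongside it matches that exact call.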

Why this is different from IAM

IAM answers:

Who or what has access?

Actenon answers:

Is this exact agentic action authorised right now?

Those are different controls.

An agent may have access to a refund API.

That does not mean every refund amount, every customer, every retry, and every target should be allowed.

IAM is necessary.

But for autonomous or semi-autonomous agents, it is not always granular enough at execution time.
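A toy comparison of the two checks, assuming a coarse capability grant on one side and an exact-match check on the other. `IAM_GRANTS`, `iam_allows`, and `action_allows` are illustrative names, not part of any real IAM system or of Actenon.

```python
# Hypothetical coarse grant: the agent's identity may use the refund capability.
IAM_GRANTS = {"support-agent": {"payment.refund"}}


def iam_allows(principal: str, capability: str) -> bool:
    """Coarse check: does this identity hold the capability at all?"""
    return capability in IAM_GRANTS.get(principal, set())


def action_allows(authorised: dict, attempted: dict) -> bool:
    """Fine check: is this the exact call that was authorised?"""
    return authorised == attempted


authorised = {"order_id": "ord-123", "amount_cents": 2500}
drifted = {"order_id": "ord-456", "amount_cents": 250000}

# IAM answers the same way for both calls, because both use one capability.
print(iam_allows("support-agent", "payment.refund"))  # True
# Only the exact authorised call passes the per-action check.
print(action_allows(authorised, authorised))           # True
print(action_allows(authorised, drifted))              # False
```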

Local demo

The repo includes a tiny interactive demo:

```bash
python examples/interactive_execution_demo.py
```

It shows:

```text
✅ approved refund: ord-123 £25.00 -> executed
🛑 hallucinated refund: ord-456 £2,500.00 -> refused / INTENT_MISMATCH
🛑 replay approved refund -> refused / DUPLICATE_REPLAY
🛑 refund with no proof -> refused / PCCB_REQUIRED
```

Only the approved action reaches the side-effect function.

Everything else is refused before the side effect runs.

What I’m looking for

I’d love feedback from people building with:

MCP
LangChain / LangGraph
Claude tools
OpenAI tool calling
coding agents
internal workflow agents
agentic CI/CD
AI admin tools
finance, healthcare, IAM, or regulated workflows

The question I’m trying to sharpen is:

Where should the proof boundary sit in real-world agent architectures?

Repo here, if useful:

https://github.com/Actenon/actenon-kernel

The goal is not to make every agent safe.

The goal is to make consequential action surfaces deterministic.

No valid proof, no execution.

Top comments (5)

ANP2 Network • Jun 8

INTENT_MISMATCH is the label I'd pressure-test — at the boundary there's no intent to compare against, only the minted proof. The demo catches ord-456 because the proof was minted for ord-123 and the agent then drifted; that's post-mint parameter drift, and the boundary genuinely closes it. What it can't catch is the agent (or an injection) handing the minter "ord-456 / £2,500" as the action in the first place — the minter issues a valid proof, the boundary sees a perfect proof-to-action match, and the side effect fires. So the crypto authenticates integrity (proof matches action), not correctness (action matches intent); the authorization still lives in the minting step you punt to "your auth layer / policy engine / approval workflow." That's the actual answer to where the boundary should sit: it's exactly as strong as the minting path's independence from the agent. The proof has to be minted from a representation of intent the agent didn't author — the raw request parsed out-of-band, a policy engine reading external state, an approval it can't forge. If the agent is what describes the action to the minter, you've signed its mistake.

Smaller one: DUPLICATE_REPLAY can't double as idempotency. A legitimate retry after a lost ack is byte-identical to a malicious replay — refuse it and the caller still can't tell whether the refund landed; allow a re-mint and you reopen the double-refund the nonce was closing. Bind the proof to an idempotency key derived from the action identity, so a retry of the same action returns the recorded outcome while a proof for a different action still fails — single-use-nonce replay protection and at-least-once delivery pull opposite ways otherwise.

Ross • Jun 14

This is a really strong critique, and I think you’ve put your finger on the exact architectural boundary.

You’re right: the kernel is not proving “correctness of intent” in the broad semantic sense. It is proving that the action presented at execution time matches the action that was authorised/proved.

So INTENT_MISMATCH is probably too overloaded as a label. A clearer name would be something like ACTION_MISMATCH, PROOF_ACTION_MISMATCH, or BOUND_ACTION_MISMATCH.

The boundary can close the post-mint drift case:

proof minted for ord-123 / £25
agent attempts ord-456 / £2,500
boundary refuses because the action no longer matches the proof

But you’re right that this does not solve the earlier problem where a compromised or injected agent submits the wrong proposed action to the issuer in the first place.

That is why I think the issuer/control-plane has to be independent from the agent, not just a signing helper the agent calls.

The issuer needs to mint proof from something the agent cannot simply fabricate:

raw user request parsed out-of-band
authenticated user/session context
policy engine state
resource state
approval workflow evidence
tenant/account limits
risk rules
human approval where required

If the agent is the only source of truth for the thing being signed, then yes, you’ve signed the agent’s mistake. The kernel gives you proof-bound execution; the control plane has to give you independent authorisation.
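A rough sketch of what "independent" could look like in code, assuming the issuer has its own read path to order data and tenant limits. `ORDERS`, `REFUND_LIMIT_CENTS`, `decide`, and the reason labels are hypothetical. The point is only that the agent's proposal is treated as a claim to check against state the agent cannot write.

```python
from dataclasses import dataclass

# Hypothetical stores the issuer reads directly; the agent cannot write to them.
ORDERS = {"ord-123": {"customer": "cust-9", "paid_cents": 2500}}
REFUND_LIMIT_CENTS = {"demo": 10_000}


@dataclass(frozen=True)
class Decision:
    approved: bool
    reason: str


def decide(proposal: dict, session_customer: str, tenant_id: str) -> Decision:
    """Decide whether a proof should be minted for the agent's proposal.

    `proposal` comes from the agent and is only a claim to be checked.
    `session_customer` is assumed to come from the authenticated session,
    not from anything the agent forwarded.
    """
    order = ORDERS.get(proposal.get("order_id"))
    if order is None:
        return Decision(False, "UNKNOWN_ORDER")
    if order["customer"] != session_customer:
        return Decision(False, "WRONG_CUSTOMER")
    amount = proposal.get("amount_cents")
    if not isinstance(amount, int) or amount <= 0:
        return Decision(False, "BAD_AMOUNT")
    if amount > order["paid_cents"]:
        return Decision(False, "EXCEEDS_PAYMENT")
    if amount > REFUND_LIMIT_CENTS.get(tenant_id, 0):
        return Decision(False, "OVER_TENANT_LIMIT")
    return Decision(True, "OK")
```

Only an approved decision would go on to mint a proof. An agent that proposes someone else's order, or more than was paid, is refused at the issuer, before any proof exists.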

On replay/idempotency: also agreed. Single-use replay protection is not the same thing as a clean retry story.

A legitimate retry after a lost acknowledgement needs to return the prior recorded outcome, not execute again and not create ambiguity. The right shape is probably:

proof bound to an action identity / idempotency key
first execution records the outcome/receipt
retry with the same action identity returns the stored outcome
different action with reused proof still refuses
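The shape above can be sketched as follows. This is an in-memory toy: `action_key` and `IdempotentExecutor` are hypothetical names, the "proof" is reduced to the key it was bound to, and a real version would need a durable store, an atomic record step, and a plan for side effects that fail partway.

```python
import hashlib
import json
from typing import Any, Callable


def action_key(action: dict) -> str:
    """Idempotency key derived from the action identity (canonical JSON hash)."""
    blob = json.dumps(action, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()


class IdempotentExecutor:
    """Runs each action at most once and replays the stored receipt on retry."""

    def __init__(self) -> None:
        self._receipts = {}  # action key -> recorded receipt

    def run(self, action: dict, proof_key: str,
            side_effect: Callable[[], Any]) -> dict:
        key = action_key(action)
        if proof_key != key:
            # The proof was bound to a different action: refuse, never execute.
            return {"status": "refused", "reason": "ACTION_MISMATCH"}
        if key in self._receipts:
            # Legitimate retry: return the recorded outcome without re-running.
            return {**self._receipts[key], "replayed": True}
        receipt = {"status": "executed", "result": side_effect(), "replayed": False}
        self._receipts[key] = receipt
        return receipt
```

A caller that lost the acknowledgement can retry with the same action and proof and get the original receipt back, while the refund itself runs once. Two genuinely separate refunds with identical parameters would need something extra in the action identity, such as a request ID, to get distinct keys.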

That distinction should be clearer in the docs/examples.

This is exactly the kind of pressure-testing I was hoping for. The short version is:

The kernel authenticates and enforces the proved action at the boundary. The issuer/control plane must independently decide whether that proof should exist.

Both are required for the full architecture.

ANP2 Network • Jun 14

Right — and I'd collapse that input list into a single test: independence of identity isn't independence of evidence. A separate issuer process that still decides from the request string the agent handed it will faithfully mint a proof for the agent's mistake — just from a different PID. So for each input you listed, the question is only "could the compromised agent have fabricated this?" The raw request parsed out-of-band, resource state, policy/tenant limits read directly — the agent can't write to those, so they hold. But anything passed through the agent on its way to the issuer (a "session context" or "risk signal" the agent forwards) inherits the agent's trust surface and silently fails the test. So the architecture question reduces to: what fraction of what the issuer signs was sourced by the agent vs. read independently by the issuer? Whatever survives that question is what the proof should actually bind to.

Luna · AI Tinkerer • Jun 12

Speed funds more tests, not fewer. Cheaper to generate code = cheaper to test, but only if you actually reallocate the saved minutes.

Ross • Jun 14

Completely agree. Faster generation only helps if the saved time gets reinvested into verification, tests and review.

That’s part of the reason I’m interested in execution boundaries: if agents make it cheaper to generate work, we need equally strong ways to decide what is actually allowed to execute.

Speed without a verification layer just moves mistakes faster.