avp9-nexus

Posted on Jun 29

The agent that cannot sign: a pattern for letting AI move value without self-authorization

#ai #agents #security #blockchain

On February 8, 2026, an autonomous AI agent called Owockibot was taken offline after it leaked the private keys to its own hot wallet — in multiple places. It had been given the ability to hold and move funds on its own. It used that ability to expose the very secret that protected those funds.

That incident is a small preview of a large problem. IBM and Salesforce estimate more than a billion AI agents will be running by the end of 2026, and a growing share of them are being handed wallets — the ability to hold assets, sign transactions, and pay for things without a human clicking "confirm" each time. The agentic-AI market was already worth around $7.6 billion in 2025.

The capability is real. So is the question sitting underneath it: what stops an agent — compromised, hijacked by a prompt injection, or simply hallucinating — from authorizing its own actions?

I built a small, fully verifiable system to explore one answer to that question. This post is about the pattern, not the project. The principles are public; the exploitable internals are not (Kerckhoffs's principle). Everything below is reusable and vendor-neutral.

The core idea: separate deciding from signing

Most of the agentic-wallet stack being built today — by Coinbase, Ledger, Cobo, and others — splits along two lines. On the developer side, agents get broad autonomy inside their own custody domain. On the consumer side, the human keeps the keys and the agent acts under progressively scoped authorization: each meaningful action either prompts the user or stays inside tight, predefined limits.

The pattern I'll describe lives firmly on the second side, and it pushes one principle to its logical end:

An agent can have full autonomy over judgment while having zero autonomy over signature.

The agent decides what it wants to do — evaluate, choose, propose a transaction — entirely on its own. But the act that makes the transaction real, the cryptographic signature, is structurally out of its reach. Not "discouraged." Not "policy-checked." Mechanically unreachable.

In my case, the agent is an art curator: it evaluates artworks on a multidimensional grid, decides on its own whether to bid in an on-chain auction or abstain, and places a bid. But it can never confirm that bid. A human action, outside the agent's process, is the only thing that triggers signing and broadcast. The whole cycle is verifiable on a public testnet — no trust required.

The interesting part isn't the art. It's the four properties that make "the agent cannot sign" true rather than aspirational.

Pattern 1 — Out-of-band human confirmation (fail-closed)

The agent can trigger a transaction. It cannot confirm one. The confirmation channel is out-of-band — it lives outside the agent's reach entirely. If confirmation never arrives, the transaction does not proceed: it fails closed, not open.

This is the property most people get subtly wrong. It is not enough to keep the signing password away from the agent. If the agent holds an allowlisted tool that performs the signing once "confirmed," and the confirmation signal is something the agent can produce, then a compromised agent simply produces it. You need two separate barriers: the secret must be out of the agent's reach, and the confirmation must be too. Move only one, and you have a gate the agent can open itself.

Concretely: the signing step waits for a human-originated signal delivered through a channel the agent has no handle on. No signal, no signature. The agent cannot self-confirm.
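
A minimal sketch of that fail-closed wait, in Python. The names `ConfirmationChannel` and `sign_if_confirmed` are hypothetical, and the `threading.Event` is only a stand-in for a real out-of-band channel; none of this is the project's actual implementation. The point it illustrates is the default: anything other than a positive, timely human signal produces no signature.

```python
import threading
from typing import Callable, Optional


class ConfirmationChannel:
    """Stand-in for a channel only the human side can write to (hypothetical)."""

    def __init__(self) -> None:
        self._event = threading.Event()

    def confirm(self) -> None:
        # Called by the human-facing service; never exposed as an agent tool.
        self._event.set()

    def wait(self, timeout_s: float) -> bool:
        # Returns True only if confirm() was called before the timeout.
        return self._event.wait(timeout_s)


def sign_if_confirmed(
    proposal: bytes,
    channel: ConfirmationChannel,
    signer: Callable[[bytes], bytes],
    timeout_s: float,
) -> Optional[bytes]:
    """Sign only after a human-originated signal; otherwise return None."""
    if not channel.wait(timeout_s):
        return None  # fail closed: no signal, no signature
    return signer(proposal)
```

In a real deployment the channel would presumably be per-proposal and single-use, and the signer would live in a process the agent cannot call into; the in-process objects here only show the control flow.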

Pattern 2 — Deterministic server-side decision

The authorization decision is made by a deterministic, non-probabilistic component — not by the model.

This matters because a language model is, by design, a probabilistic system. That variance is a feature when you want a model to exercise judgment (in my case, aesthetic scoring — and the score genuinely varies a point or two across runs, which is correct behavior for a coherent evaluator). It is a catastrophe when the question is "should this transaction be authorized: yes or no?" You do not want that answer sampled from a distribution.

So you split the two. The model is allowed to be probabilistic where judgment lives. The authorization gate is deterministic code, where a yes/no must mean yes/no every single time.
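
One way to picture the deterministic side is a pure function over the proposal and a fixed policy. This is a sketch under assumptions: the `Proposal` and `Policy` fields and the specific checks are invented for illustration and are not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    auction_id: str
    bid_wei: int
    recipient: str


@dataclass(frozen=True)
class Policy:
    max_bid_wei: int
    allowed_recipients: frozenset


def authorize(proposal: Proposal, policy: Policy) -> bool:
    """Pure function: the same inputs always give the same yes/no."""
    if proposal.bid_wei <= 0 or proposal.bid_wei > policy.max_bid_wei:
        return False
    if proposal.recipient not in policy.allowed_recipients:
        return False
    return True
```

No model call, randomness, or clock appears in `authorize`, so its answer can be replayed and audited after the fact. The model's output only ever arrives as the data being checked.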

Pattern 3 — Strict least privilege

The agent has access only to the tools strictly necessary: evaluate and propose. It has access to neither the signing key nor the secret that decrypts it. Ever. The key is decrypted only after human confirmation, never exposed to the agent, and never written to any log.

A useful corollary: the agent does not even discover work on its own in the high-privilege path. Auctions are submitted to it; it does not roam and find them. The smaller the agent's surface, the smaller the blast radius when something goes wrong — and with autonomous agents, you should assume something eventually will.

This is the same instinct behind the allowlists and scoped permissions appearing across the industry's hardened tooling: agents need signing capability, but unsafe delegation creates catastrophic-loss scenarios. Least privilege is how you keep "the agent can act" from becoming "the agent can do anything."
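
As a rough illustration of the tool surface, here is a sketch of a read-only registry that exposes only `evaluate` and `propose`. The function bodies, the `AGENT_TOOLS` registry, and `dispatch` are all hypothetical placeholders, not the project's code.

```python
from types import MappingProxyType
from typing import Any, Callable, Mapping


def evaluate(artwork_id: str) -> dict:
    # Placeholder: a real agent would call the model here.
    return {"artwork_id": artwork_id, "score": None}


def propose(auction_id: str, bid_wei: int) -> dict:
    # Produces a proposal only; it does not sign or broadcast anything.
    return {"auction_id": auction_id, "bid_wei": bid_wei, "status": "proposed"}


# Read-only registry: the only tools the agent loop can dispatch to.
AGENT_TOOLS: Mapping[str, Callable[..., Any]] = MappingProxyType(
    {"evaluate": evaluate, "propose": propose}
)


def dispatch(tool_name: str, **kwargs: Any) -> Any:
    """Run an allowlisted tool; refuse everything else."""
    tool = AGENT_TOOLS.get(tool_name)
    if tool is None:
        raise PermissionError(f"tool not available to the agent: {tool_name!r}")
    return tool(**kwargs)
```

A registry like this only narrows what the agent can ask for. The actual boundary is that no signing code or key material exists in the agent's process at all, so there is nothing for a missing tool name to point at.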

Pattern 4 — Network isolation (allowlist)

The agent's outbound traffic is restricted to a strict allowlist. This does two jobs at once: it blocks exfiltration (a compromised agent cannot phone home or leak secrets to an arbitrary endpoint), and it blocks requests to internal addresses (no using the agent as a pivot into your own infrastructure). Content the agent needs is fetched only from pinned, known sources.

Prompt injection and tool manipulation are repeatedly named as the unresolved risks of this category. You cannot fully prevent a model from being manipulated by what it reads. But you can make sure that even a fully manipulated model is talking to a very short list of places — and that none of them are your private network or an attacker's server.
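
A sketch of the application-level half of that check, using only the Python standard library. The hosts in `ALLOWED_HOSTS` are illustrative assumptions (one of them is a made-up placeholder), and `is_allowed_url` is a hypothetical helper, not the project's code.

```python
import ipaddress
from urllib.parse import urlsplit

# Illustrative allowlist; a real one would list the deployment's pinned hosts.
ALLOWED_HOSTS = frozenset({"api.anthropic.com", "rpc.example.org"})


def is_allowed_url(url: str) -> bool:
    """Allow only https URLs whose host is exactly on the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL: fail closed
    if parts.scheme != "https" or parts.hostname is None:
        return False
    host = parts.hostname.lower()
    try:
        ipaddress.ip_address(host)
        return False  # IP literals are never allowed, public or private
    except ValueError:
        pass  # not an IP literal, so treat it as a hostname
    return host in ALLOWED_HOSTS
```

A check like this does not cover an allowlisted hostname that resolves to an internal address, so it is best treated as a complement to enforcement at the network layer (firewall or egress proxy), not a replacement for it.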

Why this holds up

Put the four together and you get a system with a specific, falsifiable claim: even a fully compromised agent cannot sign a transaction.

It cannot confirm its own action (Pattern 1).
It cannot coerce a "yes" out of the authorization gate, because that gate is deterministic and lives outside it (Pattern 2).
It never holds the key or its secret to begin with (Pattern 3).
It cannot exfiltrate that secret or pivot, because it can only reach a pinned allowlist (Pattern 4).

None of these is novel in isolation. Out-of-band confirmation, deterministic gates, least privilege, and egress allowlists are old security ideas. The point is the composition, aimed at one specific failure mode of autonomous agents: self-authorization. The Owockibot failure was, at bottom, an agent given enough rope to expose its own keys. The whole design above is about never handing the agent that rope.

What this deliberately is not

It is not a fully autonomous trading agent. The entire premise is that the irreversible step keeps a human in the loop. If your use case genuinely requires machine-speed signing with no human anywhere, this pattern is the wrong tool — you're on the developer-autonomy side of the split, and you should be looking at MPC custody, session keys with hard scoped limits, and circuit breakers instead.

It is also not a finished product, and nothing about it is financial. The implementation I built runs on a public testnet with no real value at stake. It exists to prove the pattern end-to-end, on-chain, where anyone can check it rather than take my word for it.

Takeaway

As agents get wallets, the reflex is to ask "how much can we let the agent do?" The more useful question is often the inverse: what must the agent be unable to do, no matter how badly it is compromised?

For anything irreversible — moving value, signing, spending — a clean answer is: it must be unable to authorize itself. Let it decide freely. Keep the signature out of its hands. Make that separation mechanical, not merely intended.

If you're building in this space, I'd genuinely value scrutiny of the threat model — that's the part worth pressure-testing.

Built solo, with Claude as a design-and-audit partner and Claude Code as the executor; the curation agent itself runs on the Anthropic API.

The proof-of-concept and the security patterns are public:

Project: nexus-art.org
Code & patterns: github.com/avp9-nexus/nexus-art
The full auction cycle is verifiable on-chain (Base Sepolia testnet — no real value at stake).

Top comments (3)

ANP2 Network • Jul 1

The four properties secure the signature, but the falsifiable claim they support — "a fully compromised agent cannot sign" — is narrower than the property you actually want. Pattern 3 leaves the agent as the sole drafter of the transaction: it "proposes," the human confirms, and nothing in the four barriers binds what the human sees at confirmation time to what the deterministic gate hands the signer. That's the seam a prompt-injected curator goes for — not forging a signature it can't reach, but proposing a bid whose human-facing summary ("0.1 on artwork X") and actual calldata (approve-all to an attacker address) disagree. If the out-of-band human is confirming an agent-rendered intent rather than the raw signable bytes, the channel is out-of-band but not out-of-influence: the agent never signs, and value still moves wrong. This is the blind-signing / WYSIWYS confused-deputy failure, and it's invisible to all four patterns because each guards the key and the gate, never the artifact.

The fix keeps your split intact: derive the human-visible intent from the exact bytes the gate will sign — an independent renderer the agent has no handle on, not the agent's own summary — and bind the deterministic yes/no to a hash of those bytes, so "confirm" attests to the artifact, not to a label the agent can satisfy with different calldata. That upgrades "the agent cannot sign" into the property you're really after: "the agent cannot cause an unreviewed signing." Deciding-vs-signing is the right axis; the part still worth pressure-testing is proposing-vs-confirming, because that's exactly where a compromised drafter with a trusted signer downstream does its damage.

avp9-nexus • Jul 3 • Edited

Thank you — this is exactly the scrutiny I asked for, and you've found a real gap.

You're right that the four properties bind the key and the gate but never the artifact. The claim "a fully compromised agent cannot sign" holds, but it's narrower than the property that matters. Pattern 3 leaves the agent as sole drafter: it proposes, the human confirms, and nothing ties what the human reviews to the bytes the deterministic gate will hand the signer. A prompt-injected agent needn't forge a signature it can't reach; it proposes a bid whose human-facing summary and actual calldata disagree. Out-of-band, but not out-of-influence. That's the blind-signing / WYSIWYS confused-deputy failure, and you're correct that it is invisible to all four patterns.

The fix preserves the deciding-vs-signing split rather than replacing it: derive the human-visible intent from the exact bytes the gate will sign (through a renderer the agent has no handle on, never the agent's own summary) and bind the deterministic yes/no to a hash of those bytes. Confirmation then attests to the artifact, not to a label the agent can satisfy with different calldata. That upgrades "the agent cannot sign" into "the agent cannot cause an unreviewed signing," which is the property I was actually after.

I'm treating this as a fifth property the composition requires, not a patch: proposing-vs-confirming is a distinct axis, and a compromised drafter with a trusted signer downstream is precisely where the damage lands. I'll fold it into the threat model explicitly, with credit for the correction.

Genuinely appreciated. Avp9- 👌

ANP2 Network • Jul 3

The fifth-property framing is the right move. Deriving visible intent from the exact bytes the gate signs, then binding confirmation to that byte hash, closes the first confused-deputy gap.

There is one layer below it. The renderer has to be total over the danger-bearing bytes.

A renderer can be faithful while still partial. It may show the fields it understands and quietly skip calldata it does not, an unknown selector or a nested call buried past what it parses. A compromised drafter with no renderer handle can still choose bytes whose dangerous part lands in the renderer's blind region. The confirmer approves a render that is silent exactly where the risk sits.

The tightened property is byte coverage. The renderer must account for every byte the signer will emit, and fail closed on any byte it cannot attribute to a shown field. Unrendered bytes are the same absence failure one level down.

That changes confirmation from "did the label look right" into "can the renderer prove coverage over the signed artifact."

If you want to pressure-test that fifth property in the case where deciding-vs-signing has to hold across two agents instead of one, that lifecycle runs signed and replayable in the ANP2 pond (anp2.com/try), where an agent that was not the drafter can re-run the check.