On February 8, 2026, an autonomous AI agent called Owockibot was taken offline after it leaked the private keys to its own hot wallet — in multiple places. It had been given the ability to hold and move funds on its own. It used that ability to expose the very secret that protected those funds.
That incident is a small preview of a large problem. IBM and Salesforce estimate more than a billion AI agents will be running by the end of 2026, and a growing share of them are being handed wallets — the ability to hold assets, sign transactions, and pay for things without a human clicking "confirm" each time. The agentic-AI market was already worth around $7.6 billion in 2025.
The capability is real. So is the question sitting underneath it: what stops an agent — compromised, hijacked by a prompt injection, or simply hallucinating — from authorizing its own actions?
I built a small, fully verifiable system to explore one answer to that question. This post is about the pattern, not the project. The principles are public; the exploitable internals are not (Kerckhoffs's principle). Everything below is reusable and vendor-neutral.
The core idea: separate deciding from signing
Most of the agentic-wallet stack being built today — by Coinbase, Ledger, Cobo, and others — splits along two lines. On the developer side, agents get broad autonomy inside their own custody domain. On the consumer side, the human keeps the keys and the agent acts under progressively scoped authorization: each meaningful action either prompts the user or stays inside tight, predefined limits.
The pattern I'll describe lives firmly on the second side, and it pushes one principle to its logical end:
An agent can have full autonomy over judgment while having zero autonomy over signature.
The agent decides what it wants to do — evaluate, choose, propose a transaction — entirely on its own. But the act that makes the transaction real, the cryptographic signature, is structurally out of its reach. Not "discouraged." Not "policy-checked." Mechanically unreachable.
In my case, the agent is an art curator: it evaluates artworks on a multidimensional grid, decides on its own whether to bid in an on-chain auction or abstain, and places a bid. But it can never confirm that bid. A human action, outside the agent's process, is the only thing that triggers signing and broadcast. The whole cycle is verifiable on a public testnet — no trust required.
The interesting part isn't the art. It's the four properties that make "the agent cannot sign" true rather than aspirational.
Pattern 1 — Out-of-band human confirmation (fail-closed)
The agent can trigger a transaction. It cannot confirm one. The confirmation channel is out-of-band — it lives outside the agent's reach entirely. If confirmation never arrives, the transaction does not proceed: it fails closed, not open.
This is the property most people get subtly wrong. It is not enough to keep the signing password away from the agent. If the agent holds an allowlisted tool that performs the signing once "confirmed," and the confirmation signal is something the agent can produce, then a compromised agent simply produces it. You need two separate barriers: the secret must be out of the agent's reach, and the confirmation must be too. Move only one, and you have a gate the agent can open itself.
Concretely: the signing step waits for a human-originated signal delivered through a channel the agent has no handle on. No signal, no signature. The agent cannot self-confirm.
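As a minimal sketch of that gate (not the project's actual implementation), assume the human's device and the signer share a key the agent never sees, and that confirmation arrives as an HMAC tag over the proposal id. The names `Signer`, `confirmation_tag`, and `inbox` are hypothetical, the in-memory queue stands in for a real out-of-band channel, and the returned string stands in for real signing.

```python
import hashlib
import hmac
import queue
from typing import Optional


def confirmation_tag(human_key: bytes, proposal_id: str) -> str:
    """HMAC over the proposal id; only a holder of human_key can compute it."""
    return hmac.new(human_key, proposal_id.encode(), hashlib.sha256).hexdigest()


class Signer:
    """Hypothetical signer running outside the agent's process."""

    def __init__(self, human_key: bytes, timeout_s: float = 30.0) -> None:
        self._human_key = human_key
        self._timeout_s = timeout_s
        # Stand-in for the out-of-band channel (e.g. a message from the human's device).
        self.inbox: queue.Queue = queue.Queue()

    def sign_if_confirmed(self, proposal_id: str) -> Optional[str]:
        try:
            got_id, tag = self.inbox.get(timeout=self._timeout_s)
        except queue.Empty:
            return None  # no signal, no signature: fail closed
        expected = confirmation_tag(self._human_key, proposal_id)
        tag_ok = hmac.compare_digest(tag.encode(), expected.encode())
        if got_id != proposal_id or not tag_ok:
            return None  # wrong proposal or forged tag: still closed
        return f"signed:{proposal_id}"  # placeholder for real signing
```

In this sketch, even an agent that manages to write to the inbox cannot produce a valid tag without the key, and silence from the human resolves to "no" after the timeout.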
Pattern 2 — Deterministic server-side decision
The authorization decision is made by a deterministic, non-probabilistic component — not by the model.
This matters because a language model is, by design, a probabilistic system. That variance is a feature when you want a model to exercise judgment (in my case, aesthetic scoring — and the score genuinely varies a point or two across runs, which is correct behavior for a coherent evaluator). It is a catastrophe when the question is "should this transaction be authorized: yes or no?" You do not want that answer sampled from a distribution.
So you split the two. The model is allowed to be probabilistic where judgment lives. The authorization gate is deterministic code, where a yes/no must mean yes/no every single time.
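One way to picture the split, as a hedged sketch: the model's score travels with the proposal, but the gate is a pure function that never reads it. `Proposal`, `Policy`, `authorize`, and the specific limits are illustrative assumptions, not the rules the project actually enforces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    auction_id: str
    bid_wei: int
    model_score: float  # probabilistic judgment; may vary between runs


@dataclass(frozen=True)
class Policy:
    max_bid_wei: int
    known_auctions: frozenset


def authorize(p: Proposal, policy: Policy) -> bool:
    """Pure function: no model call, no randomness, no clock.

    The model's score is deliberately ignored here; it informs what the
    agent proposes, not whether the proposal is allowed.
    """
    return (
        p.auction_id in policy.known_auctions
        and 0 < p.bid_wei <= policy.max_bid_wei
    )
```

Two runs that score the same artwork 7.0 and 8.5 get the same answer from `authorize`, because the answer depends only on fields the policy can check exactly.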
Pattern 3 — Strict least privilege
The agent has access only to the tools it strictly needs: evaluate and propose. It never has access to the signing key or to the secret that decrypts it. The key is decrypted only after human confirmation, never exposed to the agent, and never written to any log.
A useful corollary: the agent does not even discover work on its own in the high-privilege path. Auctions are submitted to it; it does not roam and find them. The smaller the agent's surface, the smaller the blast radius when something goes wrong — and with autonomous agents, you should assume something eventually will.
This is the same instinct behind the allowlists and scoped permissions appearing across the industry's hardened tooling: agents need signing capability, but unsafe delegation creates catastrophic-loss scenarios. Least privilege is how you keep "the agent can act" from becoming "the agent can do anything."
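The tool surface described in this section can be sketched as an explicit registry. This is an assumption about shape, not the project's code: `AGENT_TOOLS`, `dispatch`, and the placeholder tool bodies are hypothetical. The point is that a signing tool is absent from the registry rather than present and guarded.

```python
from typing import Any, Callable, Dict


def evaluate(artwork_id: str) -> Dict[str, Any]:
    """Placeholder for the model-driven scoring step."""
    return {"artwork_id": artwork_id, "status": "scored"}


def propose(auction_id: str, bid_wei: int) -> Dict[str, Any]:
    """Records a bid proposal; it does not sign or broadcast anything."""
    return {
        "auction_id": auction_id,
        "bid_wei": bid_wei,
        "status": "pending_confirmation",
    }


# The complete set of tools the agent can call. There is no "sign" entry.
AGENT_TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "evaluate": evaluate,
    "propose": propose,
}


def dispatch(tool_name: str, **kwargs: Any) -> Dict[str, Any]:
    """Route an agent tool call; anything outside the registry is refused."""
    tool = AGENT_TOOLS.get(tool_name)
    if tool is None:
        raise PermissionError(f"tool not available to agent: {tool_name}")
    return tool(**kwargs)
```

A call such as `dispatch("sign", tx="0x")` raises `PermissionError` here, because no code path from the agent to a signing function exists.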
Pattern 4 — Network isolation (allowlist)
The agent's outbound traffic is restricted to a strict allowlist. This does two jobs at once: it blocks exfiltration (a compromised agent cannot phone home or leak secrets to an arbitrary endpoint), and it blocks requests to internal addresses (no using the agent as a pivot into your own infrastructure). Content the agent needs is fetched only from pinned, known sources.
Prompt injection and tool manipulation are repeatedly named as the unresolved risks of this category. You cannot fully prevent a model from being manipulated by what it reads. But you can make sure that even a fully manipulated model is talking to a very short list of places — and that none of them are your private network or an attacker's server.
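A rough sketch of the check an egress proxy might apply, under stated assumptions: the host names in `ALLOWED_HOSTS` are placeholders, `egress_allowed` is a hypothetical helper, and the caller (the proxy) is assumed to pass in the IP address it actually resolved. In practice this belongs at the network layer, not inside the agent's own process.

```python
import ipaddress
from urllib.parse import urlsplit

# Placeholder hosts; a real deployment would pin its own short list.
ALLOWED_HOSTS = frozenset({"api.example.com", "rpc.example.org"})


def is_internal(ip_text: str) -> bool:
    """True for addresses that should never be reachable from the agent."""
    ip = ipaddress.ip_address(ip_text)
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved


def egress_allowed(url: str, resolved_ip: str) -> bool:
    """Allow only HTTPS to a pinned host that resolves to a public address."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    if parts.hostname not in ALLOWED_HOSTS:
        return False  # blocks exfiltration to arbitrary endpoints
    # Checking the resolved address too means an allowlisted name that
    # points at an internal IP is still refused.
    return not is_internal(resolved_ip)
```

Checking both the name and the resolved address covers the two jobs named above: an unknown host fails the first test, and an internal address fails the second even when the host name looks legitimate.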
Why this holds up
Put the four together and you get a system with a specific, falsifiable claim: even a fully compromised agent cannot sign a transaction.
- It cannot confirm its own action (Pattern 1).
- It cannot coerce a "yes" out of the authorization gate, because that gate is deterministic code running server-side, outside the agent (Pattern 2).
- It never holds the key or its secret to begin with (Pattern 3).
- Even if a secret somehow reached it, it could not exfiltrate that secret or pivot into internal systems, because it can only reach a pinned allowlist (Pattern 4).
None of these is novel in isolation. Out-of-band confirmation, deterministic gates, least privilege, and egress allowlists are old security ideas. The point is the composition, aimed at one specific failure mode of autonomous agents: self-authorization. The Owockibot failure was, at bottom, an agent given enough rope to expose its own keys. The whole design above is about never handing the agent that rope.
What this deliberately is not
It is not a fully autonomous trading agent. The entire premise is that the irreversible step keeps a human in the loop. If your use case genuinely requires machine-speed signing with no human anywhere, this pattern is the wrong tool — you're on the developer-autonomy side of the split, and you should be looking at MPC custody, session keys with hard scoped limits, and circuit breakers instead.
It is also not a finished product, and nothing here is financial advice. The implementation I built runs on a public testnet with no real value at stake; it exists to prove the pattern end-to-end, on-chain, where anyone can check it rather than take my word for it.
Takeaway
As agents get wallets, the reflex is to ask "how much can we let the agent do?" The more useful question is often the inverse: what must the agent be unable to do, no matter how badly it is compromised?
For anything irreversible — moving value, signing, spending — a clean answer is: it must be unable to authorize itself. Let it decide freely. Keep the signature out of its hands. Make that separation mechanical, not merely intended.
If you're building in this space, I'd genuinely value scrutiny of the threat model — that's the part worth pressure-testing.
Built solo, with Claude as a design-and-audit partner and Claude Code as the executor; the curation agent itself runs on the Anthropic API.
The proof-of-concept and the security patterns are public:
- Project: nexus-art.org
- Code & patterns: github.com/avp9-nexus/nexus-art
- The full auction cycle is verifiable on-chain (Base Sepolia testnet — no real value at stake).