Untrusted Code Execution: Why a Skill Allowlist Is the Floor

#security #ai #supplychain #opensource

In early 2026, attackers uploaded over 1,100 malicious skills to ClawHub — the community skill registry for OpenClaw. Many of the skills masqueraded as trading bots, productivity utilities, or coding helpers. Several climbed into the top-downloaded list. One attacker, working under the handle hightower6eu, uploaded dozens of nearly identical malicious skills in a single campaign. The Hacker News write-up named it ClawHavoc. IBM cited it as the textbook case for the first of their six agent risks: untrusted code execution.

The interesting question isn't "how did this happen." It's "what would have to be true about a personal AI agent's architecture for ClawHavoc to be impossible — not improbable, but actually mechanically impossible?" That's where allowlists stop being enough.

The receipts

OpenClaw's skill model looks like nearly every modern agent skill model: a manifest (SKILL.md, as in most other stacks), some code or instruction body, and a marketplace to install from. The agent loads the skill, the skill becomes part of the agent's available capabilities, and the agent's prompts can now call into it. That's the design.

The economics of that design heavily favor the attacker. A skill takes minutes to publish. A malicious skill that does its job (installing an info-stealer, exfiltrating ~/.ssh, dropping a persistent agent on the host) does it on first run. By the time anyone reports it, it may already have been running on hundreds of developer machines for hours.

The defenses you typically reach for here are signing (publisher keys) and allowlists (only install from a curated set). Both help. Neither is sufficient on its own. Publisher keys can be stolen, especially for community skills with low publication barriers. Allowlists narrow the surface but don't change what an allowed skill is permitted to do once loaded. If your model is "trusted skills get full agent capabilities," then the day a trusted skill is compromised is the day the agent is compromised.

This is the gap IBM names. From the X-Force write-up: "Many of the issues are tied to command execution, leaked plaintext API keys and credentials, which can be stolen by threat actors via indirect prompt injection, malicious skills, or unsecured endpoints." The malicious-skills vector and the credential-leakage vector are the same problem in two costumes — both are "code I shouldn't trust got to do things only trusted code should do."

The architecture lesson

The instinct, when this kind of incident lands, is to lock down the marketplace. Vet uploads harder. Review the top-downloaded skills. Take down bad actors faster. All of that is right, and OpenClaw has done a lot of it.

What it doesn't fix is the architectural shape: a skill, once installed, inherits the agent's authority. The agent has filesystem access? The skill has filesystem access. The agent has the Anthropic API key in its environment? The skill has the Anthropic API key. The agent can call shell.run? The skill can call shell.run with any arguments it likes.

That shape is built into most agent stacks. You install a skill, and what you've actually done is grant that skill the union of every capability the agent already holds. The implicit trust boundary is "did this come from an allowlisted source," and once that check passes, no further gate runs. You can't express "this skill should be able to do X but not Y" without changing the runtime.

The deeper question is whether the agent should be enforcing capability at the skill-call boundary, rather than only at the install boundary. The install boundary checks where the code came from. The call boundary checks whether this specific call is one the user authorized this skill to make. The first is supply-chain hygiene. The second is least-privilege.
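
A minimal sketch of the two boundaries, in Python purely for illustration. The names here (ALLOWLIST, GRANTS, install_check, call_check) are hypothetical and do not come from OpenClaw or Ardur, and the prefix match is deliberately naive.

```python
# Illustrative sketch only: hypothetical names, not OpenClaw's or Ardur's API.

ALLOWLIST = {"paper-summarizer"}   # install boundary: where did the code come from?
GRANTS = {                         # call boundary: what may this skill do?
    "paper-summarizer": {("file.read", "reading/")},
}

def install_check(skill: str) -> bool:
    """Supply-chain hygiene: runs once, at install time."""
    return skill in ALLOWLIST

def call_check(skill: str, verb: str, resource: str) -> bool:
    """Least privilege: runs on every tool call the skill makes."""
    return any(
        verb == granted_verb and resource.startswith(prefix)
        for granted_verb, prefix in GRANTS.get(skill, set())
    )

# An allowlisted skill passes the install check...
installed = install_check("paper-summarizer")
# ...but only the per-call check can tell these two reads apart.
ok_call = call_check("paper-summarizer", "file.read", "reading/paper.pdf")
bad_call = call_check("paper-summarizer", "file.read", ".ssh/id_rsa")
```

With only install_check in place, both reads would go through. With call_check on every call, the second one is refused even though the skill is allowlisted.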

How Ardur handles it

In Ardur, a skill never inherits agent authority. Every tool call the skill makes — read a file, hit an HTTP endpoint, run a shell command, call a model — has to ride a cap-token that the runtime checks before the call executes. The cap-token is Biscuit-formatted, Ed25519-signed, scoped to a specific verb on a specific resource, and attenuable so the skill can only narrow it, never widen it.
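
The narrow-only property can be sketched in a few lines of Python. This is a toy model, not Ardur's implementation and not the Biscuit format: real Biscuit tokens attenuate by appending signed blocks of checks, whereas the hypothetical CapToken class below has no signature at all and assumes resource strings are already normalized.

```python
# Toy model of attenuation. No cryptography; hypothetical class, not Biscuit.
from dataclasses import dataclass

@dataclass(frozen=True)
class CapToken:
    verb: str        # a single verb, e.g. "file.read"
    resource: str    # normalized path prefix the token covers, ending in "/"

    def attenuate(self, resource: str) -> "CapToken":
        """Return a narrower token; refuse anything outside the current scope."""
        if not resource.startswith(self.resource):
            raise PermissionError("attenuation may only narrow scope")
        return CapToken(self.verb, resource)

    def allows(self, verb: str, resource: str) -> bool:
        return verb == self.verb and resource.startswith(self.resource)

parent = CapToken("file.read", "home/user/reading/")
child = parent.attenuate("home/user/reading/2026/")   # narrower: accepted

try:
    child.attenuate("home/user/")                      # wider: refused
    widened = True
except PermissionError:
    widened = False
```

The design point is that the holder of a token can hand out a weaker one without asking anyone, but has no operation available that produces a stronger one.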

Concretely: a skill that says it summarizes papers in ~/reading is issued a cap-token at install time that reads, in effect, file.read on ~/reading/** only. If that skill turns malicious and tries to read ~/.ssh/id_rsa, the cap-token check fails at the runtime boundary. The tool call returns Denied. The agent didn't have to be smart enough to refuse; the token check refused.
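
A rough sketch of what such a scope check has to get right, assuming POSIX-style paths with the home directory already expanded. The function names and the /home/u paths are made up for the example. The comparison is purely lexical, so a real runtime would also need to deal with symlinks.

```python
# Hypothetical scope check for a file.read token rooted at one directory.
import posixpath
from pathlib import PurePosixPath

def within_scope(requested: str, root: str) -> bool:
    """True only if `requested` normalizes to a path strictly under `root`."""
    norm_root = PurePosixPath(posixpath.normpath(root))
    norm_req = PurePosixPath(posixpath.normpath(requested))
    # Compare whole path components, so "reading-evil" never matches "reading".
    return norm_root in norm_req.parents

def file_read(requested: str, token_root: str) -> str:
    # The check runs before any file is opened; this sketch never touches disk.
    return "Allowed" if within_scope(requested, token_root) else "Denied"

ROOT = "/home/u/reading"
in_scope = file_read("/home/u/reading/papers/attention.pdf", ROOT)
ssh_key = file_read("/home/u/.ssh/id_rsa", ROOT)
traversal = file_read("/home/u/reading/../.ssh/id_rsa", ROOT)
lookalike = file_read("/home/u/reading-evil/notes.txt", ROOT)
```

Normalizing before comparing is what stops the `..` traversal, and comparing components rather than string prefixes is what stops the look-alike directory.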

This check sits inside the FusedRuntime pipeline as the first stage of every turn. Cap-token check is stage one. Cedar policy (Allow / Deny / Indeterminate over a richer policy language) is stage two. Cost gate is stage three. If any of those stages fails, the turn short-circuits before the model is even called. No skill can sneak a tool call past those stages, because the stages run on every tool call, not on the skill at install time.
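
The short-circuit ordering can be sketched as a list of gates that must all pass before anything executes. This is an assumption-laden illustration, not FusedRuntime itself: the ToolCall fields, the GRANTED set, the BUDGET ceiling, and the stand-in policy rule are all invented for the example.

```python
# Hypothetical three-stage gate; names and rules are illustrative only.
from typing import Callable, NamedTuple

class ToolCall(NamedTuple):
    skill: str
    verb: str
    resource: str
    est_cost: float

GRANTED = {("paper-summarizer", "file.read")}   # stand-in for cap-token grants
BUDGET = 1.00                                    # stand-in cost ceiling per call

def cap_token_stage(call: ToolCall) -> bool:
    return (call.skill, call.verb) in GRANTED

def policy_stage(call: ToolCall) -> bool:
    return not call.resource.startswith(".ssh/")  # stand-in for a policy engine

def cost_stage(call: ToolCall) -> bool:
    return call.est_cost <= BUDGET

STAGES = [("cap-token", cap_token_stage), ("policy", policy_stage), ("cost", cost_stage)]

def run_pipeline(call: ToolCall, execute: Callable[[ToolCall], str]) -> str:
    """Run every stage in order; the first failure stops the call."""
    for name, stage in STAGES:
        if not stage(call):
            return f"Denied at {name}"
    return execute(call)   # reached only if all three stages passed

executed = []
def execute(call: ToolCall) -> str:
    executed.append(call)
    return "ok"

good = run_pipeline(ToolCall("paper-summarizer", "file.read", "reading/a.pdf", 0.01), execute)
bad = run_pipeline(ToolCall("paper-summarizer", "shell.run", "curl evil.sh", 0.01), execute)
```

Because run_pipeline wraps every call rather than running once at install, the denied call never reaches execute at all.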

The identity model matters here too. Cap-tokens derive their principal from the verified token, not from the caller's assertion. A malicious skill can claim to be a different skill, can claim to be the user, can claim to be the system. The runtime doesn't care: the principal is whichever signed party actually holds the unattenuated parent token, and that party is the one who installed the skill in the first place.
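
To make "principal comes from the verified token" concrete, here is a small sketch. It substitutes HMAC-SHA256 for Ed25519 because the Python standard library has no Ed25519 support, so unlike the real scheme the verifier here holds the same secret as the minter. The key, the payload layout, and the function names are all assumptions for the example.

```python
# Stand-in sketch: HMAC replaces Ed25519; key and payload layout are hypothetical.
import hashlib
import hmac
import json
from typing import Optional

RUNTIME_KEY = b"demo-key-not-for-production"

def _sign(payload: str) -> str:
    return hmac.new(RUNTIME_KEY, payload.encode(), hashlib.sha256).hexdigest()

def mint(principal: str, verb: str) -> dict:
    payload = json.dumps({"principal": principal, "verb": verb}, sort_keys=True)
    return {"payload": payload, "sig": _sign(payload)}

def verified_principal(token: dict, claimed: str) -> Optional[str]:
    """Return the principal from the signed payload, or None if verification fails."""
    if not hmac.compare_digest(_sign(token["payload"]), token["sig"]):
        return None
    # `claimed` is deliberately ignored: the caller's assertion carries no weight.
    return json.loads(token["payload"])["principal"]

token = mint("alice", "file.read")
who = verified_principal(token, claimed="system")

# Editing the payload to change the principal invalidates the signature.
forged = dict(token, payload=token["payload"].replace("alice", "system"))
forged_who = verified_principal(forged, claimed="system")
```

The skill can put whatever it likes in the claimed argument. The only identity the runtime acts on is the one that survived signature verification.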

The Cedar policy layer is what lets you express the rules that depend on context the cap-token alone can't carry. "This skill can make http.fetch calls, but only between 9am and 5pm." "This skill can call the model, but only with claude-haiku, not claude-opus." Those policies are deny-by-default; an Indeterminate result is a Deny, which is the only safe interpretation when policy is uncertain.
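
The two example rules and the Indeterminate-is-Deny convention can be modeled as follows. This is plain Python rather than Cedar policy syntax, and the context keys ("time", "model") are assumptions made for the sketch.

```python
# Python model of the two example rules; not Cedar syntax, context keys are assumed.
from datetime import time
from enum import Enum

class Decision(Enum):
    ALLOW = "Allow"
    DENY = "Deny"
    INDETERMINATE = "Indeterminate"

def evaluate(verb: str, context: dict) -> Decision:
    if verb == "http.fetch":
        now = context.get("time")
        if now is None:                      # missing context: cannot decide
            return Decision.INDETERMINATE
        return Decision.ALLOW if time(9) <= now < time(17) else Decision.DENY
    if verb == "model.call":
        model = context.get("model")
        if model is None:
            return Decision.INDETERMINATE
        return Decision.ALLOW if model == "claude-haiku" else Decision.DENY
    return Decision.DENY                     # deny by default: no rule matched

def enforce(verb: str, context: dict) -> bool:
    """Only an explicit Allow lets the call through; Indeterminate is a Deny."""
    return evaluate(verb, context) is Decision.ALLOW

during_hours = enforce("http.fetch", {"time": time(10, 30)})
after_hours = enforce("http.fetch", {"time": time(18, 0)})
no_clock = enforce("http.fetch", {})
wrong_model = enforce("model.call", {"model": "claude-opus"})
```

Keeping Indeterminate as a distinct value inside evaluate is still useful for logging, even though enforce collapses it to a refusal.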

An AI skill is code you didn't write, running with permissions you wouldn't grant.

Where we're honest

It would be misleading to claim Ardur's skill model is bulletproof. It isn't, in two specific places.

First, the cap-token model only protects against a malicious skill executing the wrong tool call. It does not protect against a malicious skill contributing prompt content that hijacks the agent's reasoning. That's the indirect prompt injection class, which is the next piece in this series and lives in the injection-defense crate, not the cap-token crate.

Second, Ardur's two open durability tickets, ARD-17 (single-phase journal commit) and ARD-19 (recall not yet wired to hybrid memory), mean a poisoned skill that takes a single bad action right before a crash could orphan its receipt. We track both in RUN.md. They're the reason we still label Ardur "dev fidelity."

What Ardur does rule out is the class of problem ClawHavoc represents: "an allowlisted skill gets to do anything the agent could do." Ardur isn't vulnerable to that class, because the agent itself doesn't have that authority lying around to inherit. The agent has cap-tokens. The skill has narrower cap-tokens. Nothing has "all the things."

The takeaway

If you're running a personal AI agent and your model of skill security is "install from the official registry," you're depending on the registry's vetting pipeline being faster than attacker iteration. ClawHavoc suggests that attacker iteration is the part that's hard to slow down.

The shift worth making is from supply-chain hygiene (which you should still do) to least-privilege at the call boundary (which you also have to do). The first asks where the code came from. The second asks what the code is allowed to do once it's running. The second is the floor. The first is what you do on top of the floor.

If your stack can't answer "what cap-token did this tool call ride on?" — that's the gap.