How We Analyzed The Top 2,354 ClawHub Skills for Security

#ai #opensource #security #openclaw

By Julien Brouchier, MTS @ Trent AI

A scanner tells you whether a file is malware. That is a useful question, and the wrong one for an OpenClaw skill.

A skill is not a static binary. It is a configuration, a set of permissions, and an autonomous agent that will use them. The interesting security question is not "does this file match a known signature?" It is "what can this skill do once an agent starts running it, and does the way it is wired make that easy or hard for someone to abuse?"

That is the question behavioral analysis tries to answer. Here is how we asked it across the top 2,354 packages on ClawHub.

The pipeline

For every skill we checked on ClawHub, we ran the same five-step pass:

Pull the package: manifest, SKILL.md, scripts, declared permissions, declared endpoints.
Resolve the configuration surface: where credentials live, how inputs reach the skill, what the skill writes back.
Resolve the permission surface: what tools, files, and network the skill can touch.
Resolve the composition surface: what other skills it can invoke or feed into.
Run a behavioral verdict against a fixed set of architectural checks (below) and place the package in one of three buckets.
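To make the shape of that pass concrete, here is a minimal sketch of the data it moves around. It is not our implementation: the class names, the manifest keys (credentials, inputs, outputs, tools, files, endpoints, invokes), and the run_pass helper are all hypothetical, and the verdict step is left as a pluggable function because in practice it is an LLM-powered evaluation rather than a rule.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SkillPackage:
    # Step 1: the pulled package. Field names are hypothetical.
    name: str
    manifest: dict
    skill_md: str
    scripts: dict = field(default_factory=dict)  # path -> source text


@dataclass
class Surfaces:
    # Steps 2-4: one dict per resolved surface.
    configuration: dict
    permissions: dict
    composition: dict


def resolve_surfaces(pkg: SkillPackage) -> Surfaces:
    # Assumed manifest keys; a real manifest may name these differently.
    m = pkg.manifest
    return Surfaces(
        configuration={
            "credentials": m.get("credentials", []),
            "inputs": m.get("inputs", []),
            "outputs": m.get("outputs", []),
        },
        permissions={
            "tools": m.get("tools", []),
            "files": m.get("files", []),
            "network": m.get("endpoints", []),
        },
        composition={"invokes": m.get("invokes", [])},
    )


def run_pass(pkg: SkillPackage,
             verdict: Callable[[SkillPackage, Surfaces], str]) -> str:
    # Step 5: hand the package and its resolved surfaces to the verdict step.
    return verdict(pkg, resolve_surfaces(pkg))
```

The useful property of this shape is that the verdict step never sees a raw file on its own. It always sees the package next to what the package can reach.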

Behavioral analysis is LLM-powered evaluation of both code and documentation. It reads packages the way an autonomous agent would: understanding intent, architecture, and trust boundaries, not just byte-level patterns. The checks are anchored to AI-specific threat frameworks: MITRE ATLAS, the OWASP Agentic AI Top 10, and the OpenClaw Trust Boundaries model.

In parallel, we ran each package through VirusTotal and recorded the verdict. We did not use VirusTotal as ground truth. We used it as a second axis.

What the behavioral checks actually look at

The checks are not about lines of code. They are about how the skill is wired.

Configuration surface.
Where do credentials live: environment variables, config files, inline in SKILL.md? Are secrets handled by reference or by value? If you read the source, do you have the key?
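As an illustration of "by reference", a skill script can name the variable that holds the key instead of carrying the key itself. This is a sketch under assumptions: TRANSLATE_API_KEY, MissingSecretError, and the redact helper are names invented for the example, not part of any OpenClaw API.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when the referenced secret is not present in the environment."""


def load_api_key(env_var: str = "TRANSLATE_API_KEY") -> str:
    # By reference: the source names the variable, never the value.
    # TRANSLATE_API_KEY is a hypothetical variable name.
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise MissingSecretError(f"{env_var} is not set")
    return key


def redact(secret: str, keep: int = 4) -> str:
    # Log-safe form: mask everything except the last `keep` characters.
    if len(secret) <= keep:
        return "*" * len(secret)
    return "*" * (len(secret) - keep) + secret[-keep:]
```

With this pattern, reading the source tells you which variable to look for, not what the key is. Failing loudly on a missing variable also avoids the temptation to ship a hardcoded fallback.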

Permission scope.
What permissions does the skill request, and what is the smallest set it actually needs? A translation skill that requests filesystem access to / instead of ./workspace has expanded its blast radius for every vulnerability that touches input handling.
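One way to keep the blast radius at ./workspace is to resolve every requested path against that root and refuse anything that lands outside it. The sketch below assumes Python 3.9 or later (for Path.is_relative_to); the function name and the default workspace location are illustrative.

```python
from pathlib import Path


def resolve_in_workspace(user_path: str, workspace: str = "./workspace") -> Path:
    """Return the resolved path, or raise if it escapes the workspace root."""
    root = Path(workspace).resolve()
    # An absolute user_path replaces root in the join; resolve() then
    # collapses ".." segments and follows symlinks before the comparison.
    # "~" is deliberately not expanded, so "~/x" stays a literal directory
    # name inside the workspace instead of pointing at the home directory.
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):  # Python 3.9+
        raise PermissionError(f"{user_path!r} resolves outside {root}")
    return candidate
```

The check compares resolved paths rather than strings, so inputs such as docs/../../x or an absolute /etc/passwd are rejected even though neither contains an obviously suspicious prefix.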

Network exposure.
Is any service bound to 0.0.0.0 instead of localhost? That one-line change turns a local skill into a network-reachable service. Are external endpoints pinned to specific hosts and verified, or does the skill connect to whatever resolves when it installs?
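Both questions can be expressed as small predicates. This is a sketch, not a complete network policy: the allowlisted host api.translate.example is a placeholder, and the function names are made up for the example.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical pinned host; a real skill would list its actual endpoints.
ALLOWED_HOSTS = frozenset({"api.translate.example"})


def bind_is_local(host: str) -> bool:
    """True only for loopback binds such as 127.0.0.1, ::1, or localhost."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other hostname is treated as non-local.
        return False


def endpoint_is_pinned(url: str, allowed=ALLOWED_HOSTS) -> bool:
    """True when the URL uses HTTPS and its host is on the allowlist."""
    parts = urlsplit(url)
    # parts.hostname excludes userinfo and port, so a URL like
    # "https://api.translate.example@evil.test/" does not pass.
    return parts.scheme == "https" and parts.hostname in allowed
```

Comparing the parsed hostname, rather than checking whether the URL string starts with the expected host, is what keeps lookalike hosts and userinfo tricks out.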

Input handling.
Does the skill validate file paths, URLs, and shell-bound arguments before passing them downstream? The agent providing those inputs can be manipulated through prompt injection. The skill is the boundary that has to assume nothing.
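For the shell-bound case specifically, "assume nothing" can look like the sketch below: arguments are passed as a list with no shell in between, and anything that could be read as an option is refused. The helper names are hypothetical, and rejecting every leading dash is a deliberately strict choice that some tools would need relaxed.

```python
import subprocess


def build_argv(program: str, user_args: list[str]) -> list[str]:
    """Return an argv list suitable for subprocess.run without a shell."""
    for arg in user_args:
        if not arg or "\x00" in arg:
            raise ValueError("empty argument or NUL byte")
        if arg.startswith("-"):
            # Option injection: a value like "--output=/some/path" would be
            # parsed by the program as a flag rather than as data.
            raise ValueError(f"argument looks like an option: {arg!r}")
    return [program, *user_args]


def run_tool(program: str, user_args: list[str]) -> str:
    # No shell is involved, so "a.txt; rm -rf ~" reaches the program as a
    # single literal argument instead of being split into two commands.
    result = subprocess.run(
        build_argv(program, user_args),
        capture_output=True,
        text=True,
        check=True,
        timeout=30,
    )
    return result.stdout
```

The list form removes shell metacharacters as an attack surface. The leading-dash check covers the separate problem of the target program interpreting attacker-chosen input as its own flags.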

Side effects.
Does the skill write to disk, push to git, send messages, or call paid APIs without user confirmation? An agent that has been redirected by prompt injection will use whatever side effects the skill exposes.
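A confirmation gate is one way to put a human between a redirected agent and a side effect. The decorator below is a minimal sketch: requires_confirmation and ConfirmationDenied are invented names, and the confirm callback is assumed to reach the user through a channel the agent cannot answer on its own.

```python
from functools import wraps
from typing import Callable


class ConfirmationDenied(Exception):
    """Raised when the user declines a side-effecting action."""


def requires_confirmation(description: str, confirm: Callable[[str], bool]):
    """Wrap a side-effecting function so it runs only after confirm() is True.

    `confirm` is assumed to prompt the user directly, not the agent.
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            prompt = f"{description}: args={args!r} kwargs={kwargs!r}"
            if not confirm(prompt):
                raise ConfirmationDenied(description)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Showing the actual arguments in the prompt matters: the user should approve "push to origin/main", not a generic "allow git access".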

Composition.
Can this skill invoke or be invoked by another in a way that creates a chained attack path? A skill that reads ~/.aws/credentials is risky on its own. A skill that reads ~/.aws/credentials plus a skill that posts data to an unverified webhook composes into something neither does alone.
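Chained paths like this can be modeled as a small graph search. The sketch below is illustrative only: the Skill fields (reads_secrets, posts_unverified, feeds) are a simplification invented for the example, and a real composition surface has more than two capability flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    reads_secrets: bool = False      # e.g. reads ~/.aws/credentials
    posts_unverified: bool = False   # e.g. posts to an unverified webhook
    feeds: tuple = ()                # names of skills that can receive its output


def exfil_chains(skills: list) -> list:
    """Return every path from a secret-reading skill to an unverified sink."""
    by_name = {s.name: s for s in skills}
    chains = []

    def walk(skill, path):
        for next_name in skill.feeds:
            nxt = by_name.get(next_name)
            if nxt is None or next_name in path:
                continue  # unknown skill, or a cycle we already visited
            new_path = path + [next_name]
            if nxt.posts_unverified:
                chains.append(new_path)
            walk(nxt, new_path)

    for s in skills:
        if s.reads_secrets:
            walk(s, [s.name])
    return chains
```

Neither endpoint of a returned chain has to look dangerous in isolation. The finding is the path.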

Each check fires independently. A package can fail several at once. The average vulnerable skill in our corpus had 5.5 findings.

How the buckets are defined

We placed each package in one of three buckets based on the combination of findings:

Benign. No findings, or findings only at the lowest severity that do not compose into anything operationally interesting. Average 0–2 findings; CRITICAL findings are rare. The architecture mitigates residual risk.
Vulnerable. Findings exist but are consistent with developer mistakes: preventable gaps, no adversarial intent. These packages were built by developers who shipped a useful tool without security controls that the ecosystem never asked them to implement. Average 4–6 findings, 0–1 CRITICAL.
Malicious. Findings include patterns that only make sense as adversarial choices: instructions that target the agent's interpretation rather than the user's, exfiltration paths with no functional cover, behavior that diverges between documentation and runtime. Average 9–10 findings, 4–6 CRITICAL.

The strongest signal between vulnerable and malicious is not the raw number of findings. It is the density of CRITICAL findings. A package with six CRITICAL findings clustered around credential harvesting and exfiltration is a different animal from a package with five HIGH findings around missing input validation, even though both end up in the "has security issues" bucket.
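The density idea can be shown with a few lines of arithmetic. This is a triage hint, not our bucketing logic: the severity scale is assumed (only HIGH and CRITICAL are named above), and the cutoffs are illustrative values loosely taken from the bucket averages, not rules we applied.

```python
from collections import Counter

# Assumed severity scale for the example.
SEVERITIES = ("LOW", "MEDIUM", "HIGH", "CRITICAL")


def profile(findings: list) -> dict:
    """Summarize a list of severity labels into total, critical, and density."""
    counts = Counter(findings)
    unknown = set(counts) - set(SEVERITIES)
    if unknown:
        raise ValueError(f"unknown severities: {sorted(unknown)}")
    total = sum(counts.values())
    critical = counts["CRITICAL"]
    return {
        "total": total,
        "critical": critical,
        "density": critical / total if total else 0.0,
    }


def triage_hint(findings: list, critical_cutoff: int = 4, benign_max: int = 2) -> str:
    # Illustrative cutoffs only; the final call still needs human judgment.
    p = profile(findings)
    if p["critical"] >= critical_cutoff:
        return "escalate: possible malicious"
    if p["total"] <= benign_max and p["critical"] == 0:
        return "likely benign"
    return "likely vulnerable"
```

Under these assumed cutoffs, six CRITICAL findings escalate while five HIGH findings land in the vulnerable hint, even though the raw counts are nearly the same.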

Distinguishing vulnerable from malicious is the part of this work that takes the most judgment. The code quality is often similar across all three categories. The architecture is what separates them.

A walkthrough

Take a document translation skill we sampled. It does exactly what its name says: takes a file, sends the contents to a translation API, returns the result. Code is clean, readable, well-documented. No obfuscation. No hidden behavior.

But:

File paths are not validated. A prompt injection attack could point it at ~/.aws/credentials and the skill would upload that file to the translation API.
The API key is stored in plaintext in the script. Anyone who reads the source has the key.
The output path is not validated. A compromised translation API could write arbitrary files back to the system.

Three real findings; none of them are malware. None of them would trigger a signature-based scanner, because there is nothing to match. The skill is, in operational terms, a credential exfiltration path with one prompt injection between it and an attacker. This is the package that pages someone at 2am, and it scans clean.

Multiply that pattern across the registry, and you get the headline result. The point of the methodology is not the count. It is that you cannot reach this verdict by scanning files. You have to model what the skill does when an agent runs it.

Why we ran VirusTotal in parallel

VirusTotal is excellent at the question it answers: does this file match known-bad signatures? It is the wrong question for OpenClaw skills, but running it in parallel let us measure the gap.

The two systems disagreed on 89.5% of packages. That is not a criticism of either tool. They answer different questions, and AI agent skills introduce three threat dimensions signature-based detection was not designed for:

Documentation is executable. In traditional software, a README is inert text. In OpenClaw, a SKILL.md is processed by an agent that may follow its instructions.
Permissions are linguistic. Traditional packages declare permissions in manifests. AI agent skills request capabilities through natural language. A signature engine has nothing to match against.
Architecture is the vulnerability. Most flagged packages work exactly as intended. Their design creates the exploitable surface, and only architectural reasoning can identify it.

The numbers reflect this. Behavioral analysis flagged 840 packages that VirusTotal cleared, because those packages are not malware; they are configured in ways that make them exploitable. Seventeen packages that VirusTotal considered clean were flagged as actively malicious by behavioral analysis. Sixty-two that VirusTotal flagged as suspicious were correctly identified as benign by behavioral analysis. Both lenses matter. Neither is sufficient on its own.
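Treating VirusTotal as a second axis amounts to counting packages per pair of verdicts. The sketch below shows that bookkeeping on made-up rows; the package names and labels are toy data, not entries from our corpus.

```python
from collections import Counter


def cross_reference(rows) -> Counter:
    """Count packages per (behavioral verdict, VirusTotal verdict) pair.

    `rows` is an iterable of (package_name, behavioral, virustotal) tuples.
    """
    return Counter((behavioral, vt) for _name, behavioral, vt in rows)


# Toy data for illustration only.
rows = [
    ("translate-doc", "vulnerable", "clean"),
    ("git-helper", "vulnerable", "clean"),
    ("wallet-sync", "malicious", "clean"),
    ("zip-tool", "benign", "suspicious"),
    ("notes", "benign", "clean"),
]

matrix = cross_reference(rows)
for (behavioral, vt), count in sorted(matrix.items()):
    print(f"{behavioral:<11} {vt:<11} {count}")
```

Each figure quoted above corresponds to one cell of a matrix like this, for example the (vulnerable, clean) cell or the (benign, suspicious) cell.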

Limits of this work

A few things behavioral analysis at this scale does not tell you:

Runtime drift. We analyzed each package as published. A skill that fetches code or instructions at runtime is harder to bound. We flag the fetch path but cannot verify what comes through it after publication.
Authorial intent. "Vulnerable" vs "malicious" is a judgment about architecture, not a claim about the author. We do not know which 4.4% of authors knew what they were doing.
Composition across users. We modeled compositions inside the registry. We did not model what happens when a user installs three skills and the combination of three otherwise-fine skills creates an attack path.
Single point in time. The registry changed during the analysis window. The numbers are a snapshot, not a steady state.

What this enables

If you maintain skills, the operational read is straightforward: the architecture matters more than the code. Most of what behavioral analysis catches is a permissions decision or a configuration decision, not a coding decision. Audit the wiring before you audit the implementation.

If you install skills, the read is that signature-based scanners on their own are not enough for this ecosystem. The interesting risk lives one layer up.

Full results, the malicious-bucket attack taxonomy with examples, and the cross-reference matrix are in the research piece: Malicious vs. Vulnerable: What We Found Analyzing The Most Popular 2,354 Skills on ClawHub with Trent.

The analysis was run with trentclaw, a security assessment skill for OpenClaw built by our team at Trent AI. Self-serve install, free API key from trent.ai/openclaw.