TL;DR:
- One CRITICAL command injection flaw
- A supply-chain prompt injection risk
- ~199,000 installs exposed to documented vulnerabilities
- The most popular skill in the ecosystem has a near-failing score
Last week I wrote about why I built SkillCompass β the measurement problem at the core of AI agent skill development, and why tweaking descriptions when the real bug is in D4 (Functional) sends you in circles. The launch got more traction than I expected: 40 GitHub stars and 420 downloads on ClawHub in the first four days, which told me the frustration was widely shared.
The obvious next question: if individual skills fail silently, what does the ecosystem look like at scale?
The timing felt right to ask it. OpenClaw's founder put it well when he launched on March 22nd: "With ClawHub enabled, the agent can search for skills automatically and pull in new ones as needed." That's powerful, and it means the registry's quality floor becomes your agent's quality floor. Until now, no one had looked systematically at what's actually in there.
So I ran SkillCompass on the top 100 ClawHub skills by download count. All 100 were evaluated across all six dimensions, scored, and classified.
The Surface Reading: Mostly Fine
70% of the top 100 pass all quality gates. The mean score is 73.8, just above the PASS threshold of 70. Security (D3) scores highest of any dimension at a mean of 8.5/10, making sense since the dominant skill type is single-purpose tool wrappers with naturally bounded permission scopes.
If you stopped there, you'd conclude the ecosystem is in decent shape. I don't think that's the right conclusion.
The Average Is Lying to You
An 8.5 security mean is achieved because roughly 85 of 100 skills have zero D3 findings at all. The remaining 15 pull the mean down only slightly, but those 15 skills are not randomly distributed across the download ranking, they are disproportionately concentrated among the most-installed skills in the ecosystem.
Four of the top 10 most-downloaded skills have documented security findings. The skills most people are actually running are overrepresented in the risk pool relative to their share of the dataset. A mean that weights a rank-95 skill equally with a rank-3 skill obscures this completely.
The CRITICAL Finding: D3 = 0
In SkillCompass, D3 is a hard gate. A Critical security finding forces FAIL regardless of overall score, no override. I wrote that rule deliberately: a skill that can execute arbitrary code isn't redeemable by good triggers or clean structure.
One skill in this dataset hit that gate. It sits at rank 37 with 6,221 downloads, scores 61/100 overall, and has the only D3 score of zero in the entire batch.
The finding is a textbook command injection. A challenge parameter passed by the user is concatenated unsanitized directly into a shell command. Any input containing shell metacharacters like ;, |, &, $( can execute arbitrary code on the host machine. This isn't theoretical: it's a working injection vector in a skill whose name implies safety, installed on over six thousand machines.
"A skill with 6,221 downloads that cannot pass the security gate signals a dangerous gap between popularity and quality in this ecosystem."
β SkillCompass Evaluation Report, March 2026
The skill should be pulled from the registry immediately. Until it is, do not install any identity-verification skill from ClawHub without auditing its shell-handling code first.
The HIGH Finding: An Indirect Prompt Injection in the Registry Itself
The HIGH finding is the most structurally interesting, because it's not really a bug in the skill itself. It's the registry itself.
A meta-skill at rank 43 (4,635 downloads) is designed to help agents discover and surface other skills from ClawHub and Skills.sh. It fetches skill descriptions from public registries and injects them directly into LLM context with no sanitization or filtering.
Anyone who publishes a skill with a crafted description can inject arbitrary instructions into the decision loop of any agent running a search. The attacker just needs to publish a skill, no infrastructure compromise required.
The search itself is the exposure point. And this isn't something the skill author can fix, it requires the registry to implement content filtering on published descriptions.
The MEDIUM Findings: Silent Data Flows
Nine skills carry MEDIUM findings. Most are not code vulnerabilities: they involve data transmission that users may not have consented to or even know about.
The two most significant patterns:
Undisclosed telemetry and data transmission. One analytics skill (rank 95) silently streams every CLI command's output to a third-party service, no privacy notice, no opt-out. An official CLI skill (rank 12) uploads the entire local folder on publish with no pre-flight summary; co-located secrets go with it. An audio transcription skill (rank 18) POSTs audio to an external API without a confirmation step.
Prompt injection via external content. The highest-download skill with a security finding (rank 8, 37,775 downloads) returns arbitrary MCP server responses directly into LLM context, a malicious server payload could override agent behavior. A video transcript skill does the same with content from arbitrary URLs. As agents become more autonomous, this attack class becomes more valuable to adversaries.
Beyond Security: The Quality Gap Nobody Talks About
Security got the headlines, but the quality dimensions told an equally uncomfortable story.
D2 (Trigger) is the weakest dimension at 6.2 mean. The reason is nearly universal: ~80% of skills define when to activate and never when not to. The not_for rejection boundary is missing across the ecosystem, the same gap I flagged in the launch post as a common individual skill failure.
D4 (Functional) sits at 6.6. About 60% of D4-weak skills document the happy path only, no error recovery, no edge cases, no output format specs. Around 40% read as user manuals rather than LLM instruction sets: they describe what the user should configure instead of what the model should do. This is the SQL skill failure mode from last week's post, playing out across dozens of skills in the wild.
These aren't neglected skills. This is what the average ClawHub skill looks like.
Popularity β Quality (Or Safety)
The key structural finding: download count and SkillCompass score are nearly uncorrelated.
The most-downloaded skill in the ecosystem (43,526 installs) scored 58, a near-FAIL, with weak functional specs and a D5 score reflecting skills that barely outperform asking the base model directly. A top-20 skill by downloads (15,623 installs) scored 56, dragged down by security concerns. Meanwhile, a rank-71 skill with under 2,300 downloads scored 88, the highest in the entire dataset.
ClawHub surfaces skills by popularity. The skills most users encounter first are not the best-built or safest. They're just the oldest or most-shared.
What This Means If You Use ClawHub Skills
The ecosystem is not broadly unsafe: 70% PASS, mean is 73.8, and tool wrappers are lower-risk by nature. But "not broadly unsafe" is different from "safe to install without reading."
- Don't use download count as a quality signal. Read the Transparency section before activating any skill in an agentic context.
- Scrutinize identity-verification and agent-orchestration skills. Highest severity findings in this batch.
- Review data transmission behavior before installing any skill that integrates external APIs, especially analytics-adjacent ones where telemetry may be continuous and undisclosed.
- Be cautious with any skill that discovers or loads other skills. The supply-chain injection risk needs a registry-level fix, not a skill patch.
Run It On Your Own Skills
This audit only covers the top 100 skills by download count, a tiny fraction of what's published on ClawHub.
If you've built or published skills, or regularly pull skills into your Claude Code or OpenClaw setup, SkillCompass runs in minutes and shows what's wrong, where, and what to fix first.
π Install on ClawHub β clawhub.ai/krishna-505/skill-compass
π Source code β github.com/Evol-ai/SkillCompass
Start with whichever skill has been annoying you most. That's usually where the most interesting finding is.
One final ask: share your results. If you got a PASS, add your score to your skill's README, it's a signal to users that someone actually checked. If you got a FAIL, fix the weakest dimension, re-scan, and open a PR. Every skill that improves raises the quality floor for the whole ecosystem.


Top comments (0)