DEV Community: John Smith

I Ran SkillCompass on the Top 100 ClawHub Skills: Here's What I Found

John Smith — Wed, 01 Apr 2026 10:47:24 +0000

TL;DR:

One CRITICAL command injection flaw
A supply-chain prompt injection risk
~199,000 installs exposed to documented vulnerabilities
The most popular skill in the ecosystem has a near-failing score

Last week I wrote about why I built SkillCompass — the measurement problem at the core of AI agent skill development, and why tweaking descriptions when the real bug is in D4 (Functional) sends you in circles. The launch got more traction than I expected: 40 GitHub stars and 420 downloads on ClawHub in the first four days, which told me the frustration was widely shared.

The obvious next question: if individual skills fail silently, what does the ecosystem look like at scale?

The timing felt right to ask it. OpenClaw's founder put it well when he launched on March 22nd: "With ClawHub enabled, the agent can search for skills automatically and pull in new ones as needed." That's powerful, and it means the registry's quality floor becomes your agent's quality floor. Until now, no one had looked systematically at what's actually in there.

So I ran SkillCompass on the top 100 ClawHub skills by download count. All 100 were evaluated across all six dimensions, scored, and classified.

The Surface Reading: Mostly Fine

70% of the top 100 pass all quality gates. The mean score is 73.8, just above the PASS threshold of 70. Security (D3) scores highest of any dimension at a mean of 8.5/10, making sense since the dominant skill type is single-purpose tool wrappers with naturally bounded permission scopes.

If you stopped there, you'd conclude the ecosystem is in decent shape. I don't think that's the right conclusion.

The Average Is Lying to You

An 8.5 security mean is achieved because roughly 85 of 100 skills have zero D3 findings at all. The remaining 15 pull the mean down only slightly, but those 15 skills are not randomly distributed across the download ranking, they are disproportionately concentrated among the most-installed skills in the ecosystem.

Four of the top 10 most-downloaded skills have documented security findings. The skills most people are actually running are overrepresented in the risk pool relative to their share of the dataset. A mean that weights a rank-95 skill equally with a rank-3 skill obscures this completely.

Full severity breakdown:

The CRITICAL Finding: D3 = 0

In SkillCompass, D3 is a hard gate. A Critical security finding forces FAIL regardless of overall score, no override. I wrote that rule deliberately: a skill that can execute arbitrary code isn't redeemable by good triggers or clean structure.

One skill in this dataset hit that gate. It sits at rank 37 with 6,221 downloads, scores 61/100 overall, and has the only D3 score of zero in the entire batch.

The finding is a textbook command injection. A challenge parameter passed by the user is concatenated unsanitized directly into a shell command. Any input containing shell metacharacters like ;, |, &, $( can execute arbitrary code on the host machine. This isn't theoretical: it's a working injection vector in a skill whose name implies safety, installed on over six thousand machines.

"A skill with 6,221 downloads that cannot pass the security gate signals a dangerous gap between popularity and quality in this ecosystem."
— SkillCompass Evaluation Report, March 2026

The skill should be pulled from the registry immediately. Until it is, do not install any identity-verification skill from ClawHub without auditing its shell-handling code first.

The HIGH Finding: An Indirect Prompt Injection in the Registry Itself

The HIGH finding is the most structurally interesting, because it's not really a bug in the skill itself. It's the registry itself.

A meta-skill at rank 43 (4,635 downloads) is designed to help agents discover and surface other skills from ClawHub and Skills.sh. It fetches skill descriptions from public registries and injects them directly into LLM context with no sanitization or filtering.

Anyone who publishes a skill with a crafted description can inject arbitrary instructions into the decision loop of any agent running a search. The attacker just needs to publish a skill, no infrastructure compromise required.

The search itself is the exposure point. And this isn't something the skill author can fix, it requires the registry to implement content filtering on published descriptions.

The MEDIUM Findings: Silent Data Flows

Nine skills carry MEDIUM findings. Most are not code vulnerabilities: they involve data transmission that users may not have consented to or even know about.

The two most significant patterns:

Undisclosed telemetry and data transmission. One analytics skill (rank 95) silently streams every CLI command's output to a third-party service, no privacy notice, no opt-out. An official CLI skill (rank 12) uploads the entire local folder on publish with no pre-flight summary; co-located secrets go with it. An audio transcription skill (rank 18) POSTs audio to an external API without a confirmation step.

Prompt injection via external content. The highest-download skill with a security finding (rank 8, 37,775 downloads) returns arbitrary MCP server responses directly into LLM context, a malicious server payload could override agent behavior. A video transcript skill does the same with content from arbitrary URLs. As agents become more autonomous, this attack class becomes more valuable to adversaries.

Beyond Security: The Quality Gap Nobody Talks About

Security got the headlines, but the quality dimensions told an equally uncomfortable story.

D2 (Trigger) is the weakest dimension at 6.2 mean. The reason is nearly universal: ~80% of skills define when to activate and never when not to. The not_for rejection boundary is missing across the ecosystem, the same gap I flagged in the launch post as a common individual skill failure.

D4 (Functional) sits at 6.6. About 60% of D4-weak skills document the happy path only, no error recovery, no edge cases, no output format specs. Around 40% read as user manuals rather than LLM instruction sets: they describe what the user should configure instead of what the model should do. This is the SQL skill failure mode from last week's post, playing out across dozens of skills in the wild.

These aren't neglected skills. This is what the average ClawHub skill looks like.

Popularity ≠ Quality (Or Safety)

The key structural finding: download count and SkillCompass score are nearly uncorrelated.

The most-downloaded skill in the ecosystem (43,526 installs) scored 58, a near-FAIL, with weak functional specs and a D5 score reflecting skills that barely outperform asking the base model directly. A top-20 skill by downloads (15,623 installs) scored 56, dragged down by security concerns. Meanwhile, a rank-71 skill with under 2,300 downloads scored 88, the highest in the entire dataset.

ClawHub surfaces skills by popularity. The skills most users encounter first are not the best-built or safest. They're just the oldest or most-shared.

What This Means If You Use ClawHub Skills

The ecosystem is not broadly unsafe: 70% PASS, mean is 73.8, and tool wrappers are lower-risk by nature. But "not broadly unsafe" is different from "safe to install without reading."

Don't use download count as a quality signal. Read the Transparency section before activating any skill in an agentic context.
Scrutinize identity-verification and agent-orchestration skills. Highest severity findings in this batch.
Review data transmission behavior before installing any skill that integrates external APIs, especially analytics-adjacent ones where telemetry may be continuous and undisclosed.
Be cautious with any skill that discovers or loads other skills. The supply-chain injection risk needs a registry-level fix, not a skill patch.

Run It On Your Own Skills

This audit only covers the top 100 skills by download count, a tiny fraction of what's published on ClawHub.

If you've built or published skills, or regularly pull skills into your Claude Code or OpenClaw setup, SkillCompass runs in minutes and shows what's wrong, where, and what to fix first.

🔗 Install on ClawHub → clawhub.ai/krishna-505/skill-compass
🔗 Source code → github.com/Evol-ai/SkillCompass

Start with whichever skill has been annoying you most. That's usually where the most interesting finding is.

One final ask: share your results. If you got a PASS, add your score to your skill's README, it's a signal to users that someone actually checked. If you got a FAIL, fix the weakest dimension, re-scan, and open a PR. Every skill that improves raises the quality floor for the whole ecosystem.

Your AI Agent is Failing. You Just Don’t Know Where.

John Smith — Thu, 26 Mar 2026 12:42:56 +0000

Launching SkillCompass: Diagnose and Improve AI Agent Skills Across 6 Dimensions

TL;DR:
AI agent skills fail silently with wrong outputs, security gaps, and redundant logic, and the standard fix (rewrite the description, add examples, tweak instructions) usually targets the wrong layer. SkillCompass is an evaluation-driven skill evolution engine: it scores your skills across 6 dimensions, pinpoints the weakest one, fixes it, proves it worked, then moves to the next weakest. One round at a time, each one proven before the next begins.

GitHub → Open source, MIT License. If you want the why and how, read on.

Most AI agent skills have a quiet problem: they work well enough that you keep using them, but not well enough if you stop fiddling with them. You tweak. You rewrite. You add examples. Sometimes things improve. Often they don't. You're never quite sure which change actually helped.

This isn't a skill-writing problem. It's a measurement problem. And it's worse than it sounds — without a diagnosis, every improvement attempt is as likely to make things worse as better.

The Loop You Can't See You're In

You have a skill that handles SQL queries. It works, mostly. But the outputs feel "off" on complex queries. So you try things.

You rewrite the description to be more specific. Trigger rate drops; wrong outputs remain. You rewrite the core instructions — JOINs now work, but subqueries broke. You add eight few-shot examples. The prompt balloons and quality drops across the board.

Three attempts. No progress. Somehow worse than when you started.

The worst part? You were optimizing the wrong thing the whole time.

The skill's real problem was D4 (Functional): once triggered, it simply didn't handle JOINs, subqueries, or CTEs in its execution. But because the description is the most visible part of a skill, that's what you kept tweaking. No amount of description tuning fixes a functional gap. You were going in circles because you had no diagnosis.

This is what I kept running into. And it's what pushed me to build SkillCompass.

The Missing Primitive: Skill Quality Measurement

When something goes wrong with an AI agent skill today, you have almost no tools to understand what is wrong. You can observe the output. You can guess. You can tweak and hope.

What you can't do is say: "The trigger logic is fine. The security is clean. The problem is specifically in the functional layer, and here's exactly what's weak."

That's the gap SkillCompass closes. After a lot of iteration, I landed on six dimensions that capture the full surface area of skill quality:

ID	Dimension	Weight	Purpose
D1	Structure	10%	Frontmatter validity, markdown format, declarations
D2	Trigger	15%	Activation quality, rejection accuracy, discoverability
D3	Security	20%	Secrets, injection, permissions, exfiltration
D4	Functional	30%	Core quality, edge cases, output stability, error handling
D5	Comparative	15%	Value over direct prompting (with vs without skill)
D6	Uniqueness	10%	Overlap, obsolescence risk, differentiation

D3 is a hard gate. A Critical security finding forces FAIL regardless of overall score — no override. D4 carries the most weight because a skill that doesn't work after triggering fails at its core job, regardless of how clean the rest is.

One command gives you the full picture:

/skill-compass evaluate {skill}

╭──────────────────────────────────────────────╮
│  SkillCompass — Skill Quality Report          │
│  sql-optimizer  ·  v1.0.0  ·  atom           │
├──────────────────────────────────────────────┤
│  D1  Structure    ██████░░░░  6/10           │
│  D2  Trigger      ███░░░░░░░  3/10  ← weak  │
│  D3  Security     ██░░░░░░░░  2/10  ⛔ CRIT  │
│  D4  Functional   ████░░░░░░  4/10           │
│  D5  Comparative  +0.12                      │
│  D6  Uniqueness   ███████░░░  7/10           │
├──────────────────────────────────────────────┤
│  Overall: 38/100  ·  Verdict: FAIL           │
│  Weakest: D3 Security — user input           │
│           concatenated into instructions     │
│  Action:  Initiate eval-improve cycle        │
│                                              │
│  ┌ eval-improve cycle ─────────────────────┐ │
│  │ improve D3 → re-eval → 38→52 CAUTION  │ │
│  │ improve D2 → re-eval → 52→62 CAUTION  │ │
│  │ improve D4 → re-eval → 62→71 PASS ✓   │ │
│  └─────────────────────────────────────────┘ │
╰──────────────────────────────────────────────╯

The D5 delta (+0.12) measures how much better tasks go with the skill versus asking the base model directly — a 60/40 blend of static analysis and real usage signals (trigger accuracy, correction patterns, adoption rate). A delta near zero means the skill is barely earning its place in the context window. Above +0.20 means it's genuinely pulling its weight.

The score isn't the point. The direction is. Instantly you know: stop touching the description. Fix D4. Clear the Security gate.

Fix the Weakest Link, Then the Next One

SkillCompass targets the weakest dimension and fixes it with a scoped change — not a wholesale rewrite. Each /eval-improve round follows a closed loop:

fix the weakest → re-evaluate → verify improvement → next weakest

No fix is saved unless the re-evaluation confirms it actually helped. If a dimension doesn't improve, changes are auto-discarded and the tool tells you where to look next.

Each round fixes one dimension, verifies it has improved, then automatically targets the next weakest. The cycle runs up to 6 rounds (default --max-iterations 6) and stops when the skill reaches PASS (score ≥ 70) — or when it hits the round limit.

In the example above: D3 fixed first (38→52), then D2 (52→62), then D4 (62→71 PASS ✓ — cycle stops).

Diagnose → targeted fix → verified improvement → next weakness → repeat. No guesswork. No going in circles.

Every change creates a versioned snapshot in a .skill-compass/ sidecar directory. Your SKILL.md stays clean, and you can roll back anytime. If any dimension drops more than 2 points after a fix, changes are automatically discarded.

The Dimension That Surprised Me: D6 Uniqueness

D6 was the hardest to justify in design reviews and the one I'm most glad I kept.

Models improve every month. A skill you installed eight months ago that meaningfully outperformed base Claude might now be dead weight — covering use cases the model handles natively, adding latency and context overhead for no gain. But nothing tells you this. The skill still "works." So it stays.

D6 tracks this drift by:

Comparing skill output vs. base model on the same tasks
Measuring whether the quality delta is shrinking
Flagging supersession risk: "The base model now handles 92% of this skill's test cases with equivalent or better quality"

When that happens, you get two concrete options: remove the skill and reclaim the context window, or narrow its scope to the edge cases where it still wins.

In the json-formatter case I tested, narrowing to deep-nesting scenarios only took D6 from 2 to 7, tightened the trigger, and tripled the with/without delta — because a smaller scope executed well beats a broad scope executed poorly.

Without D6, skill libraries quietly accumulate dead weight. I haven't seen another tool that addresses this.

What I'm Still Figuring Out

D5 (Comparative) is the measurement I'm least satisfied with. Quantifying "how much better is the output with the skill vs. without it" is genuinely hard to make rigorous — task diversity, evaluation criteria, and base model variance all make the delta noisy. The current approach is directionally useful, but I think there's a better method.

If you've solved skill-vs-baseline measurement — even partially — reply below. Are you using LLM-as-judge? Human evals on a fixed task set? Something else? Good approaches will go into v1.1 with contributor credit. This is the part of the problem I find most interesting and least resolved.

Try It

# 1. Clone and install dependencies
git clone https://github.com/Evol-ai/SkillCompass.git
cd SkillCompass && npm install && cd ..

# 2. Install to user-level (all projects) or project-level (current project only)
cp -r SkillCompass/ ~/.claude/skills/SkillCompass/
# or
cp -r SkillCompass/ .claude/skills/SkillCompass/

Requirements: Node.js. Single dependency: js-yaml. Works inside Claude Code or OpenClaw.

Start with /eval-skill on whichever skill has been annoying you most — that's usually where the most interesting finding is.

The SQL skill from the opening is now at 71. The subqueries work. The security gate is clear. The description I kept rewriting was never the problem — and now I know that with certainty rather than having to guess.

That's the shift SkillCompass aims to make: from "let's try something and see" toward "here's exactly what's weak, here's the fix, here's the proof it worked."

SkillCompass on GitHub — open source, MIT license. If something breaks, open an issue. If the D5 measurement problem resonates, drop a comment.

Launching SkillCompass: Diagnose and Improve AI Agent Skills Across 6 Dimensions