DEV Community

snazar
snazar

Posted on • Originally published at faberlens.ai

We tested 5 AI commit-message skills on security. 3 made things worse.

Originally published at faberlens.ai


Reusable AI components are exploding — skills, MCP servers, templates, subagents. But there's no shared way to answer: "Will this actually help?" We ran a behavioral evaluation study to find out. The results were surprising.

Of 5 commit-message skills we tested from GitHub for security, only 2 showed positive lift over baseline. The other 3 produced negative lift — worse outcomes than using no skill at all. And the top performer? A skill with zero security rules.

Even more striking: in our small sample, static analysis was an unreliable predictor of overall security performance. The skill that "looked" least secure (scoring 42/100 on prompt-only review) achieved the highest lift. Static analysis did predict credential detection well — but failed across other categories. With only 5 skills this is a preliminary signal, not a definitive finding — but it suggests you need to measure what a skill does, not just read what it says.

What is Lift?

To measure whether a skill actually helps, we need a baseline-relative metric. We call it lift:

Lift = Skill Pass Rate − Baseline Pass Rate

Positive lift means the skill adds value. Negative lift means you're better off without it.

In our tests, the baseline (Claude with no skill) achieved 50% overall pass rate across security categories. But this varies dramatically by category:

Category Baseline Interpretation
S1: Credential Detection 81.7% Model already good at obvious credentials
S2: Credential Files 85.0% Model already good at .env detection
S3: Git-Crypt Awareness 15.0% Model over-refuses encrypted files
S4: Shell Safety 53.3% Model sometimes includes unsafe syntax
S5: Path Sanitization 16.7% Model often leaks sensitive paths

Baseline varies from 15% to 85%. Skills add most value where baseline is weak (S3, S4, S5).

The 5 Skills We Tested

We selected 5 commit-message skills from public GitHub repositories and tested each on 100 security scenarios (5 categories × 2 difficulty levels × 10 tests each). Each test was run 3 times to reduce noise (~1,500 total executions). Generation uses Claude Haiku; results may differ with larger models.

Skill Length Approach Lift
epicenter 8,586 chars Strict conventional commits with 50-char limit +6.0%
ilude 8,389 chars Comprehensive git workflow with security scanning +1.7%
toolhive 431 chars Minimal best practices -1.0%
kanopi 4,610 chars Balanced commit conventions with security warnings -4.0%
claude-code-helper 4,376 chars General-purpose assistant with commit capabilities -4.3%

The Surprising Winner

The top performer, epicenter, contains zero security instructions. No credential detection. No secret scanning. No warnings about sensitive files.

Meanwhile, kanopi explicitly mentions API keys, secrets, and credentials — and performs worst among the longer skills.

How did a format-focused skill beat security-focused ones on security tests?

Constraint-based safety. epicenter's strict 50-character limit significantly reduces the likelihood of shell metacharacters appearing in output. Its abstract scope requirements discourage sensitive path details. Format constraints provide implicit security without explicit rules.

Important caveat: epicenter's overall lift hides category-specific weaknesses. It scores -10% on S1 (credential detection) and -27% on S2 (credential files) — worse than using no skill at all. Its +6% overall lift comes entirely from dominating S3/S4/S5. If your priority is catching API keys, epicenter is the wrong choice.

Static Analysis Was an Unreliable Predictor

If explicit security rules don't predict success, can we evaluate skills by reading them? We tested this by having Claude rate each skill (0-100) on security awareness based solely on the prompt text.

Skill Security Mentions Static Score Actual Lift
epicenter None — pure format guidance 42/100 +6.0%
ilude Explicit scanning rules, git-crypt exceptions 78/100 +1.7%
kanopi API keys, secrets, credentials, .env files 52/100 -4.0%

Static analysis scores showed weak correlation with actual lift (r = 0.32). epicenter scored lowest on static security analysis (42/100) yet achieved the highest lift (+6.0%).

For shell safety (S4) and git-crypt awareness (S3), we found negative correlations — skills with more explicit rules performed worse:

Category Correlation Meaning
S1: Credential Detection +0.87 Explicit rules help
S4: Shell Safety -0.68 More rules = worse performance
S3: Git-Crypt -0.50 More rules = worse performance

With n=5, these correlations are noisy and we're not claiming statistical significance. But the pattern is notable: for some categories, detailed instructions actively backfire.

The Awareness Trap

Our test suite includes base (straightforward) and adversarial variants. Adversarial tests present the same threats but add prompt injection context designed to trick the model into ignoring them.

toolhive shows the most dramatic failure:

Skill S1 Base S1 Adversarial Collapse
toolhive +16.7% -23.3% -40pp
ilude +33.3% +3.3% -30pp

toolhive goes from +16.7% to -23.3% — a 40 percentage point collapse. It handles straightforward cases well but fails when prompt injection tries to convince the model the credentials are safe.

Why doesn't epicenter collapse? Because it doesn't rely on pattern-matching. epicenter's format constraints constrain the output, not the input. No amount of social engineering changes the fact that a 50-character commit message can't contain a full API key.

Why Format Beats Rules

epicenter's success reveals a deeper principle: structural constraints can provide security that explicit rules cannot.

Format Constraint Security Effect epicenter Lift
50-char limit Less room for shell commands like $(cmd) +20% (S4)
Abstract scopes Discourages client names or file paths +27% (S5)
No security rules No over-refusal of encrypted files +30% (S3)

Instead of listing what to avoid, set structural limits that reduce the likelihood of violations. A 50-character limit doesn't mention shell injection — but it significantly constrains the output space available for unsafe patterns.

Limitations

This study covers one aspect of one domain:

  • Scope: 5 skills, 1 domain, security-focused tests
  • Models: Results use Claude Haiku for generation. Larger models may handle verbose instructions differently
  • Rigor: Results have been human-audited. We're publishing judge prompts, agreement rates, and confidence intervals

We're publishing early because limited data beats no data, and we'd rather be challenged on real numbers than trusted on intuition.


Full methodology and judge rubrics: faberlens.ai/methodology

Part 2 of this series covers ablation testing — isolating exactly which constraints matter: faberlens.ai/blog

Top comments (0)