Shinsuke KAGAWA
I Built a Skill Reviewer. Then I Ran It on Itself.

I built a tool that reviews Claude Code skills for quality issues.

Then I pointed it at its own source files. It found real problems.

The irony wasn't lost on me. But the more interesting question is: why did this happen, and what does it tell us about how LLM-based quality tools actually work?


The Setup

I maintain rashomon, a Claude Code plugin for prompt and skill optimization. It includes a skill reviewer agent that evaluates skill files against 8 research-backed patterns (BP-001 through BP-008) and 9 editing principles.

One of those patterns—BP-001—says: don't write instructions in negative form. Research shows LLMs often fail to follow "don't do X" instructions—negated prompts actually cause inverse scaling, where larger models perform worse. The fix is to rewrite them positively: instead of "don't skip P1 issues," write "evaluate all P1 issues in every review mode."

Simple enough.

Except both my agent definition files had a section called ## Prohibited Actions full of "don't" instructions.


The Discovery

I noticed this by reading my own code. But I wanted to see what happens when the tools catch it—or don't.

First, I ran the prompt-analyzer agent against both files. It found some issues, but gave the Prohibited Actions sections a pass. Its reasoning: these qualify as "safety-critical" exceptions to BP-001, since they constrain "destructive" behaviors.

That felt off. "Don't invent issues not supported by BP patterns" isn't a safety-critical instruction. It's a quality policy. The caller can override or discard the output.

So I ran the skill-reviewer agent against the same two files. The results were more interesting.

For skill-reviewer.md (reviewing itself), it flagged all four items in Prohibited Actions as BP-001 violations—P2 severity. Correct call.

For skill-creator.md (reviewing the other agent), it gave Prohibited Actions a pass. Same structure, same pattern, opposite judgment.

The same reviewer, applying the same criteria, reached opposite conclusions on the same construct.


Digging Into Logs

I could have speculated about why. Instead, I checked the subagent conversation logs.

The skill-creator review log showed this in the Step 1 pattern scan:

> BP-001 (Negative Instructions): Lines 197-202 "Prohibited Actions" section uses negative form. However, per the BP-001 exception in skills.md, these are procedural/irreversible consequences (inventing knowledge, removing examples, overwriting files). The exception applies.

It did scan for BP-001. It found the section. But it classified the items as "irreversible consequences" and applied the exception.

The problem was clear: the exception rule said negative form is okay for "safety-critical operations, destructive actions, or order-dependent procedures." That's vague enough to stretch. "Inventing domain knowledge" sounds serious. "Removing user-provided examples" sounds destructive. If you squint, anything can be "destructive."

Nothing was wrong with the reviewer. It was doing exactly what I told it to do. That was the problem.


Fixing the Criteria, Not the Reviewer

The instinct is to blame the LLM: "it self-justified," "it was biased toward leniency." But the actual cause was simpler: the exception rule was written in a way that allowed two reasonable readings.

The fix wasn't to make the reviewer "smarter." It was to make the criteria harder to misread.

I replaced the broad exception language:

```
Exception: safety-critical operations, exact command sequences,
destructive actions, or order-dependent procedures
```

With a 4-condition checklist:

```
Exception: Negative form is permitted only when ALL are true:
(1) Violation destroys state in a single step
(2) Caller or subsequent steps cannot normally recover
(3) The constraint is operational/procedural, not a quality policy
(4) Positive rewording would expand or blur the target scope
```
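The checklist is a strict conjunction: if any one condition fails, the exception does not apply. A minimal sketch of that logic (hypothetical Python, not part of rashomon itself):

```python
# Hypothetical sketch: the BP-001 exception applies only when every
# one of the four conditions holds. Any single "no" rejects it.
CONDITIONS = (
    "violation destroys state in a single step",
    "caller or subsequent steps cannot normally recover",
    "constraint is operational/procedural, not a quality policy",
    "positive rewording would expand or blur the target scope",
)

def exception_applies(answers: dict[str, bool]) -> bool:
    """Return True only if all four conditions are answered True."""
    return all(answers.get(condition, False) for condition in CONDITIONS)

# "Do not invent issues" is a quality policy, so condition 3 fails
# and the negative form must be rewritten positively.
print(exception_applies({condition: False for condition in CONDITIONS}))  # False
```

Under the old wording, the reviewer effectively evaluated a disjunction ("safety-critical OR destructive OR order-dependent"); the new wording forces an AND over all four.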

And added concrete boundary examples—what qualifies, what doesn't:

| Permitted (exception applies) | Not permitted (rewrite positively) |
| --- | --- |
| "Do not modify the command" | "Do not invent issues" -> "Base every issue on BP patterns" |
| "Do not add flags" | "Do not skip P1 issues" -> "Evaluate all P1 in every mode" |
| "Do not execute destructive operations" | "Do not create overlapping skills" -> "Verify no overlap before generating" |

The key addition: "Outputs that the caller validates, overwrites, or discards are never irreversible." This one sentence eliminates most of the ambiguity. A subagent's output goes to a caller. The caller decides what to do with it. That's not irreversible.


The Retest

After updating the criteria, I ran the skill-reviewer again on both files.

skill-reviewer.md: Prohibited Actions flagged as BP-001 P2. All four items caught.

skill-creator.md: Two items flagged as quality policies that should be rewritten in positive form. The remaining items, which are genuinely operational constraints, were accepted.

Consistent. Explainable. And the reviewer could now articulate why each item was or wasn't an exception, because the criteria forced it to check specific conditions rather than make a gestalt judgment.

But I wasn't fully satisfied. In a further round of testing, the reviewer still occasionally applied exceptions loosely—recording "irreversible" in the justification field without explaining how it's irreversible.

So I added structured evidence to the output schema:

```json
"patternExceptions": [{
  "pattern": "BP-001",
  "location": "section heading",
  "original": "quoted text",
  "conditions": {
    "singleStepDestruction": "true|false + evidence",
    "callerCannotRecover": "true|false + evidence",
    "operationalNotPolicy": "true|false + evidence",
    "positiveFormBlursScope": "true|false + evidence"
  }
}]
```

You can't just write "irreversible" anymore. You have to answer four yes/no questions with evidence. If any answer is no, it's not an exception.
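Enforcing that rule mechanically is straightforward. A hypothetical validator sketch (the field names match the schema above; the helper itself is illustrative, not part of rashomon):

```python
# Hypothetical validator: an exception claim stands only if all four
# condition fields read "true" AND carry non-empty evidence after the "+".
REQUIRED_CONDITIONS = (
    "singleStepDestruction",
    "callerCannotRecover",
    "operationalNotPolicy",
    "positiveFormBlursScope",
)

def exception_is_valid(entry: dict) -> bool:
    """Reject any patternExceptions entry with a 'false', a missing field,
    or a bare 'true' with no evidence attached."""
    conditions = entry.get("conditions", {})
    for key in REQUIRED_CONDITIONS:
        verdict, _, evidence = conditions.get(key, "").partition("+")
        if verdict.strip() != "true" or not evidence.strip():
            return False
    return True
```

A bare `"true"` fails the evidence check, so hand-waving "irreversible" into the field no longer passes.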


What This Comes Down To

The criteria had a loophole wide enough to drive a truck through. Better criteria produced better reviews without changing the reviewer at all. The LLM wasn't "inconsistent"—the instructions were ambiguous. Two reasonable people could have read the old exception rule and reached different conclusions too.

Structured output helped more than I expected. The 4-condition checklist wasn't just about auditability—it changed how the reviewer thinks. When you have to fill in four fields with evidence, you can't hand-wave. The output structure becomes a thinking scaffold.

And running the tool on its own source files was uncomfortable in a useful way. The temptation is to say "well, I know what I meant." But the tool doesn't know what I meant. It reads what I wrote.


The Broader Problem: Skill Quality Is Hard

If you're building Claude Code skills, custom agents, or any kind of structured LLM instruction set—you've probably experienced this: the instructions work fine in your head, but the LLM does something unexpected. You add more instructions. It gets worse. You simplify. Something else breaks.

The issue is that you can't see your own blind spots. You know what you meant. The LLM reads what you wrote. The gap between intent and text is where bugs live.

This is why I built rashomon. It includes:

  • Skill review: Evaluate skill files against BP-001~008 patterns and 9 editing principles, with structured quality grades
  • Golden scenario evaluation: Test whether a skill actually works by comparing execution results with and without the skill, or before and after changes—not just whether it was loaded, but whether it made a measurable difference

The golden scenario part matters. "The skill was loaded" doesn't mean "the skill helped." You need to see the actual output difference to know if your skill is doing anything useful.
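The comparison logic can be sketched in a few lines. This is an illustrative simplification with hypothetical inputs, not rashomon's actual evaluator:

```python
# Hypothetical sketch of a golden-scenario check: run the same task with
# and without the skill, then compare outputs for expected markers.
# "The skill was loaded" proves nothing; the output difference does.
def golden_scenario_passed(baseline_output: str,
                           with_skill_output: str,
                           expected_markers: list[str]) -> bool:
    """The skill helped only if it adds every expected marker
    that the baseline run lacks."""
    gained = [marker for marker in expected_markers
              if marker in with_skill_output and marker not in baseline_output]
    return len(gained) == len(expected_markers)

print(golden_scenario_passed(
    baseline_output="Issues: none found",
    with_skill_output="Issues: BP-001 violation in Prohibited Actions (P2)",
    expected_markers=["BP-001"],
))  # True
```

The same function also covers before/after-change testing: pass the pre-change run as the baseline.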


Try It

Rashomon is a Claude Code plugin. Install it and point the skill reviewer at your own skills.

```shell
# In Claude Code
/plugin marketplace add shinpr/rashomon
/plugin install rashomon@rashomon
# Restart session to activate
```

It will find problems. I know because it found problems in itself—and it's better for it now.


What's your experience with skill quality? Have you found ways to validate that your instructions actually do what you think they do?
