Takeshi
Review-First Skill Development — Building Complex AI Skills One Rule at a Time

TL;DR

I've been struggling with AI-generated code for months. Getting AI to handle complex tasks is hard, and writing the prompts or skills to make that happen is harder. So start with a review skill instead: define "what's wrong" one rule at a time, and you can work your way up to complex skills incrementally.


The Problem: "Look-But-Don't-Touch" Code

AI coding tools keep producing code with the same kinds of problems: inconsistent naming, unstructured error handling, misaligned test strategies. All are symptoms of missing non-functional requirements.

The natural response is to define quality standards in an LLM instruction set (this article uses Claude Skills as an example, but the same thinking applies to Copilot Instructions, Cursor Rules, or any similar mechanism). But quality criteria span many dimensions, and writing a perfect generation prompt from scratch isn't realistic.

The Core Concept: The First Skill You Build Should Be a Review Skill

Defining "generate good code" all at once is hard. But defining "this is wrong" is something you can do one rule at a time.

  • Easier to articulate — "Good" is elusive, but "Bad" is actionable. It's easy to say a variable named data is too vague, or an empty catch block is a problem.
  • Naturally incremental — Start with 3 rules. When you notice something else in production, add it. That's it.
  • Easier to measure — Generation quality is hard to evaluate objectively. Review accuracy is straightforward: did it catch what it should have caught?
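To make "one rule at a time" concrete, here is a minimal sketch of what two such rules can look like as an executable check. The rule IDs and messages are made up for illustration; the two rules mirror the examples above (a variable named data, and an empty catch block, here an empty Python `except`):

```python
import ast

# Illustrative rule catalog: IDs and wording are hypothetical.
RULES = {
    "G1": "variable name 'data' is too vague",
    "G2": "empty except block swallows errors",
}

def review(source: str) -> list[tuple[str, int, str]]:
    """Return (rule_id, line, message) findings for the two example rules."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # G1: an assignment target named exactly 'data'
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store) \
                and node.id == "data":
            findings.append(("G1", node.lineno, RULES["G1"]))
        # G2: an except handler whose body is nothing but `pass`
        if isinstance(node, ast.ExceptHandler) \
                and all(isinstance(s, ast.Pass) for s in node.body):
            findings.append(("G2", node.lineno, RULES["G2"]))
    return findings

bad = """
data = fetch()
try:
    save(data)
except Exception:
    pass
"""
for rule_id, line, msg in review(bad):
    print(f"{rule_id} line {line}: {msg}")
# Prints:
# G1 line 2: variable name 'data' is too vague
# G2 line 5: empty except block swallows errors
```

A linter-style script is not the same thing as an LLM review skill, but the shape is the point: each rule is small, independently stated, and trivially extendable when the next production incident surfaces a new pattern.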

From Review to Generation

A review skill isn't the end goal — it's the starting point. Extract the review criteria into a shared/ directory, and generation skills can reference the same definitions.

.claude/skills/
├── my-review-structure/        ← Code structure review
│   ├── SKILL.md
│   └── references/
│       ├── criteria.md         ← Review criteria (G1–G3)
│       └── output-format.md
├── my-make-testcase/           ← Test case generation
│   └── SKILL.md
├── my-review-testcase/         ← Test case review
│   ├── SKILL.md
│   └── references/
│       ├── criteria.md         ← Review criteria (T1–T2)
│       └── output-format.md
└── shared/                     ← Shared definitions (no SKILL.md = not a skill)
    ├── test-definitions.md     ← Test level definitions
    └── test-structure.md       ← Test structure rules

my-make-testcase and my-review-testcase reference the same definitions in shared/. If a generated output fails review, that itself is a signal to improve the definitions.

This creates an automatic sandwich of generation → review. The generation skill's output must pass the review skill. If it doesn't, something is wrong with the shared definitions — and that becomes the improvement signal.
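As a rough illustration, the generation skill's instructions can point at the shared files explicitly. The SKILL.md layout below is an assumption modeled on the tree above (adapt the frontmatter and paths to your tool's actual format):

```markdown
---
name: my-make-testcase
description: Generate test case tables for a given module.
---

# Test case generation

Read the shared definitions before generating anything:

- shared/test-definitions.md (what unit / integration / E2E mean here)
- shared/test-structure.md (required columns of the case table)

Generate cases that conform to BOTH files. The output will be checked
by my-review-testcase against the same definitions, so any deviation
will fail review.
```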

What I Learned in Practice

The first skill I built was a review skill for separation of concerns in code structure. Not "a skill that generates clean architecture" — a skill that flags when responsibilities are mixed.

Running it against a real project revealed over-detection. (To be honest, it was annoying at first.) The skill flagged "DB manager class requires DB connection at initialization" as a problem. But for a DB manager, requiring a DB connection is its core responsibility. I refined the rule to "flag only when a class requires dependencies unrelated to its primary responsibility." This kind of tuning only surfaces when you run against real code.
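A hypothetical criteria.md revision for that tuning step might read like this (the rule ID and the class names are illustrative, not from the real project):

```markdown
## G2: Constructor dependencies (revised)

- Before: "Flag classes that require external resources at initialization."
- After: "Flag classes that require dependencies *unrelated to their
  primary responsibility* at initialization. A DBManager needing a DB
  connection is fine; a ReportFormatter needing one is not."
```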

Next, I built a test case review skill. Again, starting from review. This forced me to define test levels (unit / integration / E2E) and a case table format, which naturally separated into shared definitions. Once those definitions existed, building a test case generation skill suddenly became easy: it just referenced the same shared files.
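For example, a shared/test-definitions.md can be as small as this (the level boundaries here are illustrative, not a standard):

```markdown
# Test level definitions

| Level       | Scope                          | Doubles allowed?        |
|-------------|--------------------------------|-------------------------|
| unit        | one class or function          | yes, for all deps       |
| integration | 2+ components, real wiring     | only at system boundary |
| E2E         | full system via public entry   | no                      |

Both my-make-testcase and my-review-testcase treat this table as the
single source of truth for classifying a case.
```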

From this experience, I came to see review → extract shared definitions → generation as a repeatable pattern for skill development.

Takeaway

Review-First turns "I can't write this skill" into "I don't have to write it yet." You don't need a perfect generation prompt. You need your first NG (known-bad) pattern.

This isn't a silver bullet — but if you're struggling with inconsistent AI output, it gives you a concrete place to start. One practical note: as rules grow, keep them separated by concern (the shared/ pattern above) rather than packing everything into a single skill. Context windows are finite, and a focused skill outperforms an overloaded one.
