supreet singh
Teaching an AI Skill to Learn from Its Own Mistakes

Follow-up to Beyond the Prompt: Building Production-Grade AI Skills, which covers the original architecture and reasoning behind this Cypress E2E testing skill.


I have a Claude skill that generates Cypress E2E tests from Jira Acceptance Criteria. It:

  • reads the ticket
  • discovers the relevant components
  • instruments them
  • generates the tests
  • validates that everything compiles

The original version did all of that well, but what it didn't do was actually run the tests to see if they passed. Token budgets were tighter at the time, and adding a verification loop meant longer sessions, more context, more cost. So I didn't add it to the initial version.
What I didn't anticipate was what that decision would cost me: on the first real run, running the tests myself and fixing whatever broke ate three to four days of manual debugging. The tokens I saved by skipping verification were nowhere near worth it.

Since then, Opus 4.6 and Sonnet 4.6 have moved to 1,000,000-token context windows, and Claude Code's prompt caching keeps the cost of longer sessions reasonable. With that constraint gone, I went back and made three changes that have made a massive difference.

1. Human Review Gate on the Test Matrix

The skill already had human review gates in a few places, including a review of the instrumentation before it gets applied. The new addition is a gate on the test matrix, which prompts you to:

  • Review the test matrix plan
  • Verify the acceptance criteria mapping
  • Approve before the tests are generated

I added this after noticing that on complex tickets with 15+ acceptance criteria, the matrix occasionally miscategorized tests or missed edge cases.
Part of that comes from a limitation that still exists: the skill doesn't yet extract acceptance criteria from screenshots attached to the ticket.
If an AC only exists in an image, it gets missed. That's a future improvement; for now, the human review gate catches those gaps before they turn into missing tests.

The gate is a configuration parameter. If you don't want the back-and-forth of reviewing every matrix, you can turn it off and the skill will proceed straight through to instrumentation and test creation without involving a human in the loop.
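As a rough sketch of what that toggle might look like (the field names and shape here are my illustration, not the skill's actual configuration schema):

```typescript
// Hypothetical shape of the skill's review-gate configuration.
interface ReviewGates {
  instrumentation: boolean; // pause for human review before instrumenting components
  testMatrix: boolean;      // pause for human review of the matrix / AC mapping
}

interface SkillConfig {
  reviewGates: ReviewGates;
  maxFixRetries: number; // retry cap for the verify-and-fix loop
}

const defaults: SkillConfig = {
  reviewGates: { instrumentation: true, testMatrix: true },
  maxFixRetries: 3,
};

// Turning only the matrix gate off lets the skill run straight through
// to instrumentation and test creation without a human in the loop.
const unattended: SkillConfig = {
  ...defaults,
  reviewGates: { ...defaults.reviewGates, testMatrix: false },
};
```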

2. Verify-and-Fix Loop

This was the biggest change. The skill now actually runs each test before considering it done, and if a test fails, it classifies why, because not every failure is something the skill should be fixing:

  • Logic Error: An actual bug in the application code. The skill stops and tells you about it. Fixing business logic is not what this skill is for.
  • Infrastructure Error: Something environmental is wrong, like the dev server not running or a dependency missing. The skill flags it and waits for you to sort it out.
  • Instrumentation Error: Something wrong with how the test was written or how the component was set up for testing. That's the skill's territory and it will fix it, re-run, and loop until it passes or hits a retry limit.
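The three-way split can be sketched as a small function over the raw Cypress error text; the matching rules below are my illustrative assumptions, not the skill's actual heuristics:

```typescript
// Sketch of the failure-classification step, assuming the skill
// works from the error text Cypress emits.
type FailureClass = "logic" | "infrastructure" | "instrumentation";

function classifyFailure(errorText: string): FailureClass {
  const msg = errorText.toLowerCase();

  // Environment problems: nothing to do with the test itself.
  if (msg.includes("econnrefused") || msg.includes("cannot find module")) {
    return "infrastructure"; // flag and wait for a human
  }

  // Test/setup problems the skill is allowed to fix itself, e.g.
  // Cypress reporting that another element covers the click target.
  if (
    msg.includes("is being covered by another element") ||
    msg.includes("expected to find element")
  ) {
    return "instrumentation"; // fix, re-run, loop until pass or retry limit
  }

  // Anything else is treated as a real application bug: stop and report.
  return "logic";
}
```

In practice an LLM-based skill can classify far more flexibly than string matching; the function just makes the three-way decision concrete.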

A real example that came up on my first run: a component had the correct data-testid on it, but there was a transparent overlay element in the DOM sitting on top of it.
Cypress could find the element but couldn't click it because the overlay was intercepting the interaction.

  • The skill identified what was blocking it
  • adjusted the test to handle the overlay
  • and it passed on retry

That's the kind of issue I would have spent a long time staring at Cypress output trying to figure out, and the old generate-only workflow had no way of catching it at all.

3. Pattern Persistence Across Runs

That transparent overlay wasn't unique to one component. The same pattern existed across a dozen components in the app, which meant every test touching those components would hit the same failure for the same reason.
In the old version each run was completely stateless, so the skill would discover the overlay problem, fix it for that one test, and then have no memory of it on the next run.
Same discovery, same fix, same wasted tokens, every single time.

Now every time the skill discovers a failure pattern during the verify-and-fix loop, it persists it to a patterns file with metadata:

  • The error message
  • The root cause
  • How it was resolved
  • Which component it came from

This file gets read at the start of every new run before any tests are generated, so the skill is consulting its own history of failures before writing a single test.
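A minimal sketch of that persistence, assuming a JSON file on disk (the file name and record shape are my guesses at the metadata, not the skill's actual format):

```typescript
import * as fs from "fs";

// One persisted failure pattern, mirroring the four metadata fields above.
interface FailurePattern {
  errorMessage: string;
  rootCause: string;
  resolution: string;
  component: string;
}

const PATTERNS_FILE = "failure-patterns.json";

// Read the accumulated history; an empty array on the first ever run.
function loadPatterns(): FailurePattern[] {
  if (!fs.existsSync(PATTERNS_FILE)) return [];
  return JSON.parse(fs.readFileSync(PATTERNS_FILE, "utf8"));
}

// Append a newly discovered pattern during the verify-and-fix loop.
function recordPattern(p: FailurePattern): void {
  const patterns = loadPatterns();
  patterns.push(p);
  fs.writeFileSync(PATTERNS_FILE, JSON.stringify(patterns, null, 2));
}

// At the start of a run, the skill would call loadPatterns() before
// generating any tests, so known issues are handled proactively.
recordPattern({
  errorMessage: "element is being covered by another element",
  rootCause: "transparent overlay intercepts clicks",
  resolution: "adjusted the test to account for the overlay",
  component: "DatePicker", // hypothetical component name
});
```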

The effect compounds over time. The first run discovers patterns reactively and fixes them as they come up. The second run already knows those patterns and generates tests that account for them from the start. Each iteration produces fewer first-attempt failures and uses fewer tokens because less of the session is spent rediscovering things it already learned.


The original 8-phase lifecycle and three-pillar architecture are documented in Beyond the Prompt: Building Production-Grade AI Skills.

Follow the build on dev.to and Substack.
