In Part 1, we examined the theory of moving from linear prompting to Skill-based orchestration—a workflow abstraction layer that uses progressive disclosure to keep the model focused and efficient.
Every sprint, our QA team was catching the same category of bug: a developer had shipped a feature that passed code review but missed a subtle Acceptance Criterion buried in the Jira ticket. By the time the miss was found, the developer had already moved on. I built a Skill to close that gap—a system that automatically bridges Jira ACs to executable Cypress E2E tests. This article is a technical walkthrough of how it works.
The Problem: Requirement Disconnect and the Structural Gap
In my current company, we manage multiple product lines simultaneously, ranging from fragile legacy systems to modern React builds. We have a small, highly capable QA team, but they are consistently outnumbered by the volume of features being shipped across these tracks.
A typical failure looked like this: a Product Owner writes an AC that says "the modal must close when the user clicks outside it." The developer builds the modal, ensures the internal logic works, and ships it. However, they neglect the "click outside" requirement—a small but critical UX detail. QA catches the miss four days later. The developer has to context-switch back into code they've mentally closed. That single miss costs an hour of rework and context recovery, multiplied across a dozen stories per sprint.
This created a structural gap in our process:
- Missed Requirements: Developers, naturally focused on technical complexity, occasionally overlook smaller, subtle ACs buried in a ticket description.
- Context Latency: The cost of "re-learning" a feature to fix a basic AC failure after days of drift is a massive drain on velocity.
- Manual Testing Bottleneck: Lacking automated E2E coverage for new features forces QA to burn limited bandwidth on repetitive manual verification instead of hunting for deep edge cases.
I built this Skill to serve as a checkpoint. It acts as a forcing function for requirement alignment, allowing developers to verify their work against the source of truth before shipping.
Pillar 1: Layered Context Management (The Discovery Layer)
The primary challenge I wanted to avoid was Context Rot. If I had loaded 5,000 lines of documentation into a single prompt, the model's reasoning performance would have dropped significantly. To solve this, I implemented a three-tier hierarchy to ensure only the context strictly necessary for the task is loaded at each stage.
Understanding the Discovery Hierarchy
Front Matter (The Gateway): This YAML block acts as the entry point. It tells Claude when this skill is relevant. By using specific trigger terms like "Cypress" or "Jira ID," we ensure the heavy logic of this skill is only active when needed, preserving the global context window.
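A minimal example of what such a gateway block can look like, following the YAML front matter convention for Skills (the name and description text here are illustrative, not my production values):

```yaml
---
name: jira-ac-to-cypress
description: >
  Bridges Jira Acceptance Criteria to Cypress E2E tests. Use when the user
  mentions a Jira ID, Acceptance Criteria, or asks to generate Cypress tests.
---
```

The description doubles as the trigger condition: until those terms appear, the rest of the skill's content stays out of the context window.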
The Main Body (The Router): This acts as the "map" for the 8-phase workflow. It manages the sequence and the gates between them. It knows that Phase 3 follows Phase 2, but it doesn't need to know the specific syntax for a Cypress intercept yet.
Reference Files (The Specialists): These hold the deep, domain-specific logic. We separate these files so that Claude only "reads" the Jira API documentation while fetching tickets. Once it moves to writing tests, it drops that context and loads the test-generation.md context. This selective loading keeps the "cognitive scope" narrow.
Pillar 2: Persistent State and Hashing (The Continuity Layer)
Standard AI interactions are stateless, which doesn't work for multi-stage engineering tasks. I built a JSON-based memory system so the Skill can track its progress and detect requirement changes over time.
Memory Management and Session Resumption
The Skill utilizes a dedicated JSON memory file that acts as its persistent state. This file is updated at the conclusion of every phase, ensuring the system never "forgets" the context of its work.
The memory file tracks three things: processed tickets, code instrumentation maps, and the Integrity Hash, an MD5 digest of the raw Acceptance Criteria text field from the Jira API response.
By comparing the current AC hash against the stored hash in memory.json, the Skill detects "staleness." In practice, this means a developer can re-run the Skill on the same ticket after a code review without re-generating 40 test files that haven't changed. Only the deltas are ever touched, saving significant token costs and preventing unnecessary file writes.
This architecture also enables Session Resumption. If the process is interrupted, the Skill identifies the incomplete session from the memory file and prompts the user to resume from the last successful phase.
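A minimal TypeScript sketch of the staleness check, assuming a simplified memory schema (field names like acHash and lastCompletedPhase are illustrative, not the Skill's actual format):

```typescript
import { createHash } from "node:crypto";

// Simplified stand-in for one ticket's entry in the memory file.
interface TicketMemory {
  ticketId: string;
  acHash: string;             // MD5 of the raw AC text at the last run
  lastCompletedPhase: number; // where to resume if the session was interrupted
}

// Integrity hash: MD5 over the raw Acceptance Criteria text field.
function hashAc(rawAc: string): string {
  return createHash("md5").update(rawAc, "utf8").digest("hex");
}

// A ticket is stale when the live AC no longer matches what the
// tests were generated against.
function isStale(memory: TicketMemory, currentAc: string): boolean {
  return memory.acHash !== hashAc(currentAc);
}
```

On a re-run, unchanged tickets short-circuit here; only stale tickets re-enter the pipeline, and an interrupted session resumes from the phase after lastCompletedPhase.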
Pillar 3: Review Gates and Governance (The Control Layer)
Even with robust discovery and memory, technical reliability isn't enough. When a system can modify production code, it needs human accountability. I wanted to avoid a "black box" scenario where component instrumentation happened in unexpected places.
The Configuration Layer
I implemented a Configuration Layer that allows the user to toggle "Human Review Gates." This flexibility ensures the architecture adapts to the developer's specific need for control—whether they are shipping a critical billing flow or iterating quickly on a new UI component.
Strategic Review Gates
When these gates are enabled, the workflow pauses at critical decision points:
- Component Discovery Gate (Phase 2): The Skill presents a list of exactly which React components it intends to instrument. This prevents unexpected changes in sensitive areas of the codebase.
- Test Matrix Gate (Phase 5): Before generating Cypress files, the Skill displays a comprehensive plan categorizing tests into Happy Path, Negative, Edge Case, and Error scenarios.
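To make the Test Matrix gate concrete, here is a hedged sketch of the plan as data (the type and field names are illustrative, not the Skill's internal schema):

```typescript
// Illustrative shape for the test plan shown at the Test Matrix gate.
type TestCategory = "Happy Path" | "Negative" | "Edge Case" | "Error";

interface PlannedTest {
  ac: string;             // which Acceptance Criterion this test covers
  category: TestCategory;
  title: string;
}

// Rendering the gate is just formatting the plan and pausing for approval;
// nothing is written to disk until the developer signs off.
function renderGate(plan: PlannedTest[]): string {
  return plan.map((t) => `[${t.category}] ${t.ac}: ${t.title}`).join("\n");
}
```

The point of the gate is not the formatting but the pause: the developer sees every planned test, grouped by category, before a single spec file exists.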
The Agentic Workflow
When triggered, the Skill follows a rigorous 8-phase lifecycle, where each phase serves as a technical checkpoint.
Breaking Down the Lifecycle
| Phase | Name | Description |
|---|---|---|
| 0 | Preflight | Verifies Jira connectivity via the Atlassian MCP and ensures the workspace root is correct. |
| 1 | Context & Hashing | Fetches the Jira AC and compares the MD5 hash against the memory file. |
| 2 | Discovery | Scans the codebase to map Jira stories to specific React components. |
| 3 | Dependency Check | Verifies that required testing libraries (Cypress, custom commands, the TestId enum file) are present and importable. This prevents instrumenting components with identifiers that don't yet exist. |
| 4 | Instrumentation | Performs the "surgery" on production code, injecting unique test identifiers. |
| 5 | Test Matrix Approval | The Skill presents the full test plan to the developer before writing a single file. This is the last human checkpoint before code is generated. |
| 6 | Test Generation | Generates the Page Objects and the actual Spec files based on the approved matrix. |
| 7 | Validation | Executes a validation script that compiles the new code and checks for lint errors. |
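Phase 3's dependency check looks for a shared TestId enum before Phase 4 injects anything. A representative (entirely illustrative) version of that file:

```typescript
// Shared identifier registry: instrumentation (Phase 4) and the generated
// specs (Phase 6) both import from here, so a selector can never drift
// out of sync between the component and its test.
export enum TestId {
  SEARCH_INPUT = "search-input",
  SUBMIT_BUTTON = "submit-button",
  RESULTS_TABLE = "results-table",
}
```

Failing fast here is the whole point: it prevents instrumenting components with identifiers that the spec files cannot import.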
Technical Implementation Details
Self-Healing Validation
If the TypeScript compiler (tsc) or linter finds an issue, the Skill enters an iterative "Self-Healing" loop (up to three times). It reads the error log, identifies the file and line number, and attempts a fix. If it fails after three attempts, it exits the loop, preserves the state in memory.json, and surfaces the raw compiler output. The developer can patch it manually and resume from that phase rather than starting over.
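Stripped of the actual tsc/eslint plumbing, the control flow is a bounded retry loop. In this sketch the compile and fix steps are injected stand-ins (assumptions, not the Skill's real commands):

```typescript
interface CompileResult {
  ok: boolean;
  errors: string[]; // raw compiler/linter output lines
}

// Bounded self-healing: attempt a fix at most maxAttempts times, then
// give up and hand the raw errors back to the developer.
function selfHeal(
  compile: () => CompileResult,
  attemptFix: (errors: string[]) => void,
  maxAttempts = 3,
): CompileResult {
  let result = compile();
  for (let attempt = 0; !result.ok && attempt < maxAttempts; attempt++) {
    attemptFix(result.errors); // read the log, locate file:line, patch
    result = compile();
  }
  return result; // if still not ok, the caller persists state and surfaces errors
}
```

On final failure the real Skill writes its state to memory.json before exiting, which is what makes manual patching and resumption possible.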
MUI-Aware Instrumentation
AI coding tools often struggle with Material UI (MUI) components, which hide the actual HTML input inside a wrapper element. A naive implementation would instrument an MUI TextField like this:
```jsx
<TextField data-testid={TestId.SEARCH_INPUT} />
```
Cypress can't interact with that because the testid lands on the wrapper div. The Skill knows to use the inputProps attribute:
```jsx
<TextField inputProps={{ 'data-testid': TestId.SEARCH_INPUT }} />
```
Encoding this library-specific knowledge into a reference file ensures any developer on the team gets it right the first time.
Sharing the Vision, Not the Source
Because this Skill was built for production use at my company, the source code stays internal—but the architectural patterns here (progressive disclosure, persistent state, human-in-the-loop governance) are entirely portable. I'm currently formalizing these same patterns into an open-source "Second Brain" project, and I look forward to sharing its structure with the community soon.
Closing Thoughts: Engineering with Confidence
What this system ultimately gave me is confidence at the moment of merge. Before, shipping a feature meant hoping that QA would catch what I'd missed. Now, before a PR is opened, I have programmatic proof that every Acceptance Criterion has a corresponding test.
The regression library builds itself sprint by sprint, and the QA team's attention is freed for the complex, judgment-heavy testing that automation can't replace. That shift—from reactive bug-catching to proactive verification—is what I set out to build when I started exploring Skills. It turns out the architecture was always capable of it; we just needed a system to unlock it.