DEV Community

Demi Jiang


AI-Powered Test Coverage Gap Analysis: How I Use Claude Code + gstack to Generate Test Cases

Every QA engineer knows the feeling: you're staring at a test suite that covers the happy path, maybe a few edge cases, and you have a nagging suspicion there's a whole category of scenarios nobody's thought to test. Writing those missing tests from scratch is slow, tedious, and mentally expensive. You're essentially doing product archaeology — reverse-engineering what the app actually does so you can describe it in test form.

I found a way to automate that archaeology. In a single session, I used Claude Code and a tool called gstack to navigate our live staging app, compare what it actually does against our existing Notion test cases, and generate 24 new BDD-formatted test cases — all exported directly back into Notion. Here's the exact workflow, including the prompts I used and the lessons I learned the hard way.


1. The Problem: Test Coverage Gaps Are Hard to Find Manually

Manual gap analysis is a two-step cognitive problem. First you have to deeply understand what the application does — every mode, every edge case, every permission flow. Then you have to hold that in your head while scanning a test case database and noticing what's missing. Neither step is easy. Both together are exhausting.

For any non-trivial feature, you'll have test cases for the happy path and maybe a few known edge cases. But what about different input types? State transitions that only happen under specific conditions? Browser-specific behaviors? Permission flows? You often don't know what's missing until something breaks in production.

The approach I'd been using — read the test suite, open the app, click around, write notes — doesn't scale. What I needed was a way to have the analysis done for me, with the application as the source of truth rather than my memory of it.


2. The Tools: Claude Code, Notion MCP, and gstack

Before diving into the workflow, here's what each tool actually does.

Claude Code is Anthropic's CLI for Claude. You run it from your terminal or VS Code and interact with it conversationally. It can execute bash commands, read and write files, call external APIs, and — crucially for this workflow — use MCP servers to connect to external tools.

Notion MCP is a Model Context Protocol server that lets Claude read and write Notion pages directly. Once configured, you can tell Claude to fetch a Notion page, read its content, and write new pages back — all from a single conversation.

gstack is an open-source tool that gives Claude a headless browser. It exposes three skills:

| Skill | What it does | Fixes bugs? |
| --- | --- | --- |
| `/browse` | Navigate a URL, interact with the UI, take screenshots, verify specific flows | No — exploration only |
| `/qa-only` | Systematic QA sweep of the whole app — structured report, health score, repro steps, screenshots | No — report only |
| `/qa` | Same as `/qa-only`, plus iteratively patches bugs in source code, commits each fix, re-verifies | Yes — fixes and commits |
For this workflow I used /browse — I wanted exploration and screenshots, not code changes.


3. Setup: Getting Everything Connected

Install Claude Code from the Anthropic CLI docs. You can use it from the terminal or the VS Code extension. I used both — VS Code for reviewing output, terminal for running prompts.

Configure Notion MCP by editing ~/.claude.json:

```json
{
  "mcpServers": {
    "notion": {
      "type": "http",
      "url": "https://mcp.notion.com/mcp"
    }
  }
}
```

You'll also need to authorize the Notion integration from your Notion workspace settings and give it access to the relevant pages. Claude will automatically pick up the MCP config on next launch.

Install gstack following the instructions in its repo. Once installed, the /browse, /qa-only, and /qa skills become available inside Claude Code sessions.

⚠️ Set your permission mode. By default, Claude Code asks for approval before running commands or making changes. For this kind of exploratory session, constant approval prompts break your flow. Set the permission mode to acceptEdits so Claude can run freely. Be aware of what this means — you're giving it latitude to make changes, so use it in a sandboxed or read-only context where possible.
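If you want that mode to persist rather than setting it per session, Claude Code reads a settings file; a minimal sketch, assuming the documented `settings.json` schema (verify the key names against your installed version's docs):

```json
{
  "permissions": {
    "defaultMode": "acceptEdits"
  }
}
```

This lives at `~/.claude/settings.json` for user-wide defaults, or `.claude/settings.json` inside a project to scope it to that repo.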

Why this matters for QA: The setup cost here is low — maybe 20 minutes including Notion authorization. The payoff is a reusable pipeline. Once it's configured, every future gap analysis session starts from step one with no additional setup.


4. The Workflow: Six Prompts, One Session

Here's the complete workflow:

```text
┌─────────────────────────────────────────────────────────────┐
│                    GAP ANALYSIS WORKFLOW                    │
└─────────────────────────────────────────────────────────────┘

  [Notion DB]          [Live App]           [Notion DB]
      │                    │                     │
      ▼                    ▼                     │
  ┌────────┐         ┌──────────┐                │
  │ Step 1 │         │  Step 2  │                │
  │  Read  │         │ Explore  │                │
  │existing│         │ app via  │                │
  │  TCs   │         │  gstack  │                │
  └────┬───┘         └────┬─────┘                │
       │                  │                      │
       └────────┬─────────┘                      │
                ▼                                │
           ┌────────┐                            │
           │ Step 3 │                            │
           │Compare │                            │
           │& find  │                            │
           │  gaps  │                            │
           └────┬───┘                            │
                ▼                                │
           ┌────────┐                            │
           │ Step 4 │                            │
           │ Draft  │                            │
           │  new   │                            │
           │  TCs   │                            │
           └────┬───┘                            │
                ▼                                │
           ┌────────┐                            │
           │ Step 5 │                            │
           │Refine  │                            │
           │to BDD  │                            │
           │format  │                            │
           └────┬───┘                            │
                ▼                                ▼
           ┌─────────┐                      ┌────────┐
           │ Step 6  │─────────────────────▶│ New TC │
           │ Export  │                      │ pages  │
           │to Notion│                      │ in DB  │
           └─────────┘                      └────────┘
```

Step 1 — Read Existing Test Cases from Notion

```text
Fetch this Notion page and list all existing test cases with their names
and a one-line summary of what each one covers:
[your Notion test case database URL]
```

Claude fetches the Notion database, reads each page, and produces a structured list: test case name, what it covers. This becomes the baseline for the gap analysis.

💡 Include the full URL in your prompt every time. Don't say "the Notion page from earlier" or "the test database we discussed." Across tool calls and session boundaries, Claude needs explicit references. Paste the full URL in every prompt that references a Notion page.

Step 2 — Explore the App and Understand What It Does

```text
Browse [your staging app URL]
Login with username [test-account] password [password]
Put the entire login and exploration in one bash script so the browser
session stays alive.
Take screenshots of each part of [the feature] and summarise how it works.
```

This is where gstack does the heavy lifting. Claude uses the /browse skill to launch a headless browser, log in, navigate through every state of the feature, take screenshots, and come back with a written summary of how it all works.

⚠️ Put login and exploration in a single bash script. This is the most important gotcha in the whole workflow. The gstack browser server restarts between separate bash calls, which kills all browser state — including your login session. If you run login in one call and exploration in the next, Claude will be looking at a logged-out app. Combine everything into one script.
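The failure mode is easy to reproduce outside gstack. In this minimal Python sketch, each `bash -c` invocation stands in for a restart of the browser server: it is a fresh process, so state set in the first call simply does not exist in the second.

```python
import subprocess

def run(cmd: str) -> str:
    """Run a command in a fresh bash process and return its stdout."""
    result = subprocess.run(["bash", "-c", cmd], capture_output=True, text=True)
    return result.stdout.strip()

# Call 1: "log in" by setting state inside the process.
first = run('QA_DEMO_SESSION="logged-in"; echo "$QA_DEMO_SESSION"')

# Call 2: a brand-new process; the state from call 1 is gone.
second = run('echo "${QA_DEMO_SESSION:-logged-out}"')

print(first)   # logged-in
print(second)  # logged-out
```

Putting login and exploration in one script is the equivalent of doing everything inside a single `bash -c`: the session state lives exactly as long as the process does.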

What you get back is a detailed summary of every state the feature can be in: what controls are visible, what actions are available, what happens when you submit or cancel, and screenshots of each screen. After two minutes of headless browsing, Claude understands the feature better than a paragraph of description could convey.

Why this matters for QA: The app is the source of truth, not documentation or memory. When Claude explores the live app, it sees what users see — including states that might not be documented anywhere.

Step 3 — Compare Against Existing Tests and Find Gaps

```text
Compare the feature you just explored against the existing test cases listed earlier.
Identify gaps — features or scenarios with no test coverage.
Group by area (e.g. different input types, error states, permissions,
edge cases, browser-specific behaviour).
```

Claude now has both sides: what the app does (from exploration) and what's already tested (from Notion). It produces a gap analysis grouped by area, surfacing scenarios that hadn't been explicitly tested — different input variations, specific error and timeout states, permission-related flows, and behavior under degraded conditions.

This took about 30 seconds.
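Conceptually, this step is a set difference over scenario descriptions. A toy sketch (the scenario names are invented for illustration; in practice the matching is semantic rather than exact-string, which is exactly why an LLM is doing it):

```python
# Scenarios observed while exploring the live app (illustrative names).
explored = {
    "upload: valid file",
    "upload: empty file",
    "upload: oversized file",
    "permissions: viewer cannot upload",
}

# Scenarios already covered by the existing Notion test cases.
covered = {
    "upload: valid file",
    "upload: oversized file",
}

# The gap analysis: observed behaviour with no matching test case.
gaps = sorted(explored - covered)
print(gaps)  # ['permissions: viewer cannot upload', 'upload: empty file']
```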

Step 4 — Draft New Test Cases (Without Writing to Notion Yet)

```text
Please create new test case entries for each gap you identified.
Do NOT write directly to Notion yet — show me the drafts first.
```

⚠️ Always review before writing to Notion. Notion changes cannot be reverted through Claude. If you let it write directly and the output is wrong — wrong format, wrong numbering, duplicate entries — you're cleaning up manually. The "show me the drafts first" step is non-negotiable.

Claude generates a draft for each gap: a title, a brief description, and rough test steps. At this point the format isn't quite right yet, but the content is there.

Step 5 — Refine to Match Your BDD Format

```text
Can you follow the same format I have here:
[URL of an existing well-formatted test case as a reference]

Rewrite all the draft test cases using that exact format:
Feature block with user story, Background, Scenario with Given/When/Then steps,
Execution Steps checklist, and Notes/Bug Link section.
Number them starting from [next available number].
Still do NOT write to Notion yet.
```

I pointed Claude at an existing test case as the template and asked it to rewrite all drafts to match — Feature block, Background, Scenario, Given/When/Then, Execution Steps checklist, Notes/Bug Link. I also specified the starting test case number so the new ones numbered sequentially from where the existing ones left off.

This step is worth taking seriously. A test case that's technically correct but formatted wrong creates work for whoever has to use it. Getting the format right before export means the output is immediately usable.
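To make the target format concrete, here is a sketch of a renderer for the layout described above. The field names and layout details are my reading of the article's format, not an exact reproduction of the author's template:

```python
def render_test_case(number, feature, story, background, scenario,
                     given, when, then):
    """Render one test case in the BDD layout described above.
    Layout details are illustrative and should match your own template."""
    return "\n".join([
        f"TC-{number:02d}: {scenario}",
        f"Feature: {feature}",
        f"  {story}",
        "  Background:",
        f"    Given {background}",
        f"  Scenario: {scenario}",
        f"    Given {given}",
        f"    When {when}",
        f"    Then {then}",
    ])

case = render_test_case(
    25, "File upload", "As a user, I want to upload files to my workspace",
    "I am logged in as a standard user", "Uploading an empty file",
    "I am on the upload page", "I submit an empty file",
    "I see a validation error",
)
print(case)
```

The point of pinning the format down this precisely, whether in a prompt or in code, is the same: the output should drop into the database without anyone reflowing it by hand.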

Step 6 — Export to Notion

```text
Write all the new test cases to Notion.
Create each one as a new page inside [your database name]
using the same format as the existing entries.
```

Claude uses the Notion MCP to create each test case as a new page in the database, including the full BDD content block and page properties: Case Type, Priority, Status.

Why this matters for QA: The output lands directly in the tool your team already uses. No copy-pasting, no reformatting, no "I'll add this to Notion later." It's there.
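Under the hood, creating a database page goes through Notion's public API. A hedged sketch of the payload shape the MCP server ultimately produces, based on the documented `pages.create` format; the property names (`Name`, `Priority`, `Status`) come from the article's database and must match your own schema:

```python
def page_payload(database_id: str, title: str, priority: str, status: str) -> dict:
    """Build a Notion pages.create payload (shape per Notion's public API;
    property names are examples and may differ in your database)."""
    return {
        "parent": {"database_id": database_id},
        "properties": {
            "Name": {"title": [{"text": {"content": title}}]},
            "Priority": {"select": {"name": priority}},
            "Status": {"select": {"name": status}},
        },
    }

payload = page_payload("your-database-id",
                       "TC-25: Uploading an empty file", "High", "Draft")
print(payload["parent"]["database_id"])
```

Knowing the shape helps when debugging a failed export: if a property name in the prompt doesn't match the database schema, the write fails at exactly this layer.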


5. The Prompts as a Reusable Template

Here's the complete sequence you can adapt for your own app and test database:

```text
# Step 1 — Read existing test cases
Fetch this Notion page and list all existing test cases with their names
and a one-line summary of what each one covers:
[your Notion test case database URL]

# Step 2 — Explore the app
Browse [your staging app URL]
Login with username [test-account] password [password]
Put the entire login and exploration in one bash script so the browser
session stays alive.
Take screenshots of each part of [the feature] and summarise how it works.

# Step 3 — Gap analysis
Compare the feature you just explored against the existing test cases listed earlier.
Identify gaps — features or scenarios with no test coverage.
Group by area.

# Step 4 — Draft
Please create new test case entries for each gap you identified.
Do NOT write directly to Notion yet — show me the drafts first.

# Step 5 — Format
Can you follow the same format I have here:
[URL of an existing well-formatted test case]
Rewrite all the draft test cases using that exact format.
Number them starting from [TC-XX].
Still do NOT write to Notion yet.

# Step 6 — Export
Write all the new test cases to Notion.
Create each one as a new page inside [your database]
using the same format as the existing entries.
```

6. Gotchas and Lessons Learned

These aren't theoretical — each one cost me time before I figured it out.

1. One bash script for login + exploration. The gstack browser server restarts between separate bash invocations. Combine login and exploration into a single script.

2. Always use explicit URLs. Vague references like "the page from before" break across tool calls and context boundaries. Include the full URL in every prompt that references a Notion page.

3. Review drafts before writing to Notion. Notion write operations through Claude are not reversible via Claude. The "show me first" step is cheap insurance.

4. Set acceptEdits permission mode for exploration sessions. Constant approval prompts fragment the session. Set it for exploration, but be aware of what you're enabling.

5. Save reusable prompts as custom skills. Claude Code supports custom skills — markdown files in ~/.claude/skills/. If you run gap analyses regularly, turn the prompt sequence into a skill so you invoke it with one command instead of retyping a paragraph.

6. Use a dedicated test account. Your credentials go into a prompt that Claude executes. Don't use your personal account.
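To make point 5 concrete: a custom skill is just a markdown file with some frontmatter. A sketch, assuming the documented skill layout (the exact path and frontmatter fields may vary by Claude Code version, so check the docs for yours):

```markdown
---
name: gap-analysis
description: Compare a live feature against Notion test cases and draft missing ones
---

1. Fetch the Notion test case database the user links and list existing cases.
2. Browse the staging URL the user provides (login + exploration in one bash script).
3. Identify coverage gaps grouped by area.
4. Draft new test cases in the team's BDD format; show drafts before writing.
5. On approval, export the new cases to Notion as new pages.
```

Saved at something like `~/.claude/skills/gap-analysis/SKILL.md`, the whole six-prompt sequence becomes one invocation.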


7. Results

One session. Here's what came out of it:

  • 24 new test cases generated in a single session
  • All formatted correctly: Feature block, Background, Scenario, Given/When/Then, Execution Steps checklist, Notes section
  • All written as new pages in the Notion database with correct properties (Case Type, Priority, Status)
  • Coverage gaps closed across multiple areas that hadn't been explicitly tested before

Before this session, gap analysis for a feature this size would have taken me half a day. The session itself took about 45 minutes, most of which was reviewing the drafts at steps 4 and 5. The test cases needed minor tweaks — a few Given steps needed more context, one When step was slightly off — but the heavy lifting was done. I was editing, not authoring from scratch.


8. What Else You Can Do With This Approach

The six-step workflow is one combination. The underlying capability is more flexible.

Requirements-first: Instead of exploring the app, feed Claude your requirements doc or spec. "Here are the acceptance criteria. Here are the existing test cases. What scenarios aren't covered?" This works well for features that aren't built yet.

Code-first: Point Claude at the codebase and ask it to surface untested paths. "Here's the source code for this feature. Here are the existing test cases. What code paths have no test coverage?" This gets you into edge cases that are invisible from the UI.

All three combined: The most complete analysis uses all three inputs simultaneously — what the spec says the app should do, what the app actually does, and what the code does under the hood.

Scheduled gap analysis: Once the workflow is stable, run it on a cadence — every sprint, every release. A fresh gap analysis against a growing test suite catches regression in coverage: features that expanded but whose tests didn't.
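Once the prompts live in a skill, scheduling is plain automation. A sketch assuming Claude Code's non-interactive print mode (`claude -p`) and an ordinary crontab entry; verify the flag and skill invocation against your installed version before relying on it:

```text
# Run a weekly gap analysis every Monday at 06:00 (crontab syntax).
0 6 * * 1  cd /path/to/project && claude -p "Run the gap-analysis skill against https://staging.example.com" >> gap-analysis.log 2>&1
```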


Conclusion

Test coverage gaps exist because comparing "what the app does" against "what we've tested" is cognitively expensive. AI is good at exactly that kind of comparison when you give it the right inputs.

The workflow I described gives it those inputs systematically: read the existing tests, explore the live app, find the delta, draft the missing coverage, format it correctly, write it back. Each step is mechanical. The judgment calls — are these test cases accurate? are the priorities right? — still belong to you. But the archaeology is automated.

24 test cases in one session. That's the headline. The more important number is how many more sessions like this I can run without burning out on the manual version.

