<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Demi Jiang</title>
    <description>The latest articles on DEV Community by Demi Jiang (@demi_jiang_3bfb65a7d28774).</description>
    <link>https://dev.to/demi_jiang_3bfb65a7d28774</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3895305%2F1fdb9a3a-a00b-48f4-9123-7cbb0f3c8727.png</url>
      <title>DEV Community: Demi Jiang</title>
      <link>https://dev.to/demi_jiang_3bfb65a7d28774</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/demi_jiang_3bfb65a7d28774"/>
    <language>en</language>
    <item>
      <title>A Tiered Playwright E2E Strategy: From PR Smoke to Production Validation</title>
      <dc:creator>Demi Jiang</dc:creator>
      <pubDate>Tue, 23 Jun 2026 01:19:49 +0000</pubDate>
      <link>https://dev.to/demi_jiang_3bfb65a7d28774/a-tiered-playwright-e2e-strategy-from-pr-smoke-to-production-validation-4o01</link>
      <guid>https://dev.to/demi_jiang_3bfb65a7d28774/a-tiered-playwright-e2e-strategy-from-pr-smoke-to-production-validation-4o01</guid>
      <description>&lt;p&gt;&lt;em&gt;A field write-up on a domain/feature-driven Playwright setup — the framework&lt;br&gt;
configuration, the tag strategy that ties tests to a test-management system, and the&lt;br&gt;
tiered run model (smoke on every PR → nightly regression → post-release production&lt;br&gt;
validation). Tooling and infrastructure specifics are generalized so the&lt;br&gt;
patterns are reusable anywhere.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Context&lt;/li&gt;
&lt;li&gt;Framework configuration: a layered, project-partitioned setup&lt;/li&gt;
&lt;li&gt;Tag strategy: two independent axes&lt;/li&gt;
&lt;li&gt;The smoke tier&lt;/li&gt;
&lt;li&gt;Worker tuning&lt;/li&gt;
&lt;li&gt;Production validation&lt;/li&gt;
&lt;li&gt;The tiered run model, end to end&lt;/li&gt;
&lt;li&gt;What I'd tell another team starting this&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  At a glance — the run tiers
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier (tag)&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Workers&lt;/th&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Smoke&lt;/strong&gt; (&lt;code&gt;@smoke&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Every PR&lt;/td&gt;
&lt;td&gt;Curated subset, per-domain matrix&lt;/td&gt;
&lt;td&gt;1 / job&lt;/td&gt;
&lt;td&gt;Fast merge-gate feedback (~5 min P95)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Regression&lt;/strong&gt; (no tier tag)&lt;/td&gt;
&lt;td&gt;Nightly&lt;/td&gt;
&lt;td&gt;Everything&lt;/td&gt;
&lt;td&gt;Many&lt;/td&gt;
&lt;td&gt;Broad coverage overnight&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Production validation&lt;/strong&gt; (&lt;code&gt;@production-validation&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;After each production release (so more often as release cadence increases)&lt;/td&gt;
&lt;td&gt;Small, stable critical-path set, per region&lt;/td&gt;
&lt;td&gt;Tuned&lt;/td&gt;
&lt;td&gt;Catch prod-only issues early, across regions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Endurance&lt;/strong&gt; (&lt;code&gt;@endurance&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Dedicated schedule&lt;/td&gt;
&lt;td&gt;One long-running spec, isolated project&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Cover tens-of-minutes flows off the critical path&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;Picture a product large enough that its end-to-end suite spans several independent&lt;br&gt;
feature domains (think: onboarding, checkout, search, messaging, billing, integrations)&lt;br&gt;
and has to run against everything from a local dev build to production in multiple&lt;br&gt;
geographic regions.&lt;/p&gt;

&lt;p&gt;The hard part of E2E at this scale isn't writing tests — it's keeping them fast enough&lt;br&gt;
to gate PRs, trustworthy enough that red means red, and traceable enough that a&lt;br&gt;
failure maps to a known test case. Almost every decision below is in service of one of&lt;br&gt;
those three.&lt;/p&gt;

&lt;p&gt;Throughout, I'll use generic domain names (&lt;code&gt;checkout&lt;/code&gt;, &lt;code&gt;search&lt;/code&gt;, &lt;code&gt;onboarding&lt;/code&gt;, …) as&lt;br&gt;
stand-ins for whatever your product's feature areas happen to be.&lt;/p&gt;


&lt;h2&gt;
  
  
  1. Framework configuration: a layered, project-partitioned setup
&lt;/h2&gt;
&lt;h3&gt;
  
  
  A shared base config, thin per-app overrides
&lt;/h3&gt;

&lt;p&gt;There's one &lt;strong&gt;base config&lt;/strong&gt; that every app/package inherits (&lt;code&gt;playwright.base.config&lt;/code&gt;),&lt;br&gt;
and a thin top-level &lt;code&gt;playwright.config.ts&lt;/code&gt; that spreads it and adds what's local. This&lt;br&gt;
keeps cross-cutting settings (reporters, trace/screenshot-on-failure, timeouts) in one&lt;br&gt;
place and lets each consumer override only what it needs.&lt;/p&gt;
&lt;h3&gt;
  
  
  Projects = domain/feature partitions
&lt;/h3&gt;

&lt;p&gt;Rather than one giant test pool, the suite is split into &lt;strong&gt;Playwright projects by feature&lt;br&gt;
domain&lt;/strong&gt; — &lt;code&gt;checkout&lt;/code&gt;, &lt;code&gt;search&lt;/code&gt;, &lt;code&gt;onboarding&lt;/code&gt;, &lt;code&gt;messaging&lt;/code&gt;, &lt;code&gt;billing&lt;/code&gt;, and so on. Each&lt;br&gt;
project points at its own &lt;code&gt;testDir&lt;/code&gt;. This buys two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;CI parallelism&lt;/strong&gt; — each domain runs as its own CI job, in parallel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ownership routing&lt;/strong&gt; — when a domain's job goes red, it routes to the team that owns
it, not to a shared "the E2E suite is broken" alert.&lt;/li&gt;
&lt;/ol&gt;
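
&lt;p&gt;A minimal sketch of the partitioning, assuming one folder per domain under a hypothetical &lt;code&gt;./tests&lt;/code&gt; root. &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;testDir&lt;/code&gt; are real Playwright project options; the &lt;code&gt;DOMAINS&lt;/code&gt; list and the &lt;code&gt;buildProjects&lt;/code&gt; helper are illustrative names, not part of the original setup.&lt;/p&gt;

```typescript
// Hypothetical domain list; substitute your own feature areas.
const DOMAINS = ["checkout", "search", "onboarding", "messaging", "billing"];

interface DomainProject {
  name: string;    // Playwright project name, also usable as a CI matrix key
  testDir: string; // folder holding that domain's specs
}

// Build one project entry per domain (assumed layout: one folder per domain).
function buildProjects(domains: string[], root = "./tests"): DomainProject[] {
  return domains.map((domain) => ({
    name: domain,
    testDir: `${root}/${domain}`,
  }));
}

const projects = buildProjects(DOMAINS);
```

&lt;p&gt;The resulting array could be passed as &lt;code&gt;projects&lt;/code&gt; in the config, and each CI job could then select its partition with &lt;code&gt;playwright test --project=checkout&lt;/code&gt;.&lt;/p&gt;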
&lt;h3&gt;
  
  
  Splitting a heavy domain by wall-clock, not by name
&lt;/h3&gt;

&lt;p&gt;One domain will inevitably become the timeout risk — usually the one that owns slow,&lt;br&gt;
media- or generation-heavy flows (capture → processing → artifact generation). When a&lt;br&gt;
single domain dominates the run, &lt;strong&gt;split it into multiple projects backed by the same&lt;br&gt;
folder&lt;/strong&gt;, partitioned by spec file. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;domain&amp;gt;-core&lt;/code&gt; — the fast UI specs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;domain&amp;gt;-heavy&lt;/code&gt; — the slow media/processing/generation specs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;domain&amp;gt;-endurance&lt;/code&gt; — an isolated long-running spec (more on this below).&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The key lesson: &lt;strong&gt;balance the split by measured per-spec duration, not by what the names&lt;br&gt;
suggest.&lt;/strong&gt; The goal is jobs that finish in roughly equal wall-clock. Re-check the split&lt;br&gt;
against a recent HTML report's per-spec timings and rebalance — grouping "by feeling"&lt;br&gt;
leaves one job idle while the other is the bottleneck.&lt;/p&gt;
&lt;/blockquote&gt;
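
&lt;p&gt;One way to rebalance from measured timings is a greedy "longest first" split. This is a sketch under assumptions: the timing records are hypothetical and would come from whatever your report exposes, and the heuristic is a common stand-in rather than the exact method used here.&lt;/p&gt;

```typescript
interface SpecTiming {
  file: string;
  ms: number; // measured duration taken from a recent report
}

interface Bucket {
  specs: string[];
  totalMs: number;
}

// Greedy split: sort specs by duration (longest first), then always add the
// next spec to whichever bucket currently has the smallest total.
function balanceSpecs(timings: SpecTiming[], bucketCount: number): Bucket[] {
  if (!(bucketCount >= 1)) throw new Error("bucketCount must be at least 1");
  const buckets: Bucket[] = Array.from(
    { length: bucketCount },
    (): Bucket => ({ specs: [], totalMs: 0 }),
  );
  const longestFirst = [...timings].sort((a, b) => b.ms - a.ms);
  for (const spec of longestFirst) {
    const lightest = buckets.reduce((best, cur) =>
      best.totalMs > cur.totalMs ? cur : best,
    );
    lightest.specs.push(spec.file);
    lightest.totalMs += spec.ms;
  }
  return buckets;
}
```

&lt;p&gt;Each bucket's &lt;code&gt;specs&lt;/code&gt; list can then back one project. Comparing the &lt;code&gt;totalMs&lt;/code&gt; values shows at a glance whether the jobs are likely to finish in roughly equal wall-clock time.&lt;/p&gt;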

&lt;p&gt;Launch flags that only some domains need live on the projects that actually need them, not globally, so unrelated domains&lt;br&gt;
aren't launched with flags they don't use.&lt;/p&gt;
&lt;h3&gt;
  
  
  One subtle but important config decision: don't let &lt;code&gt;.env&lt;/code&gt; clobber the CLI
&lt;/h3&gt;

&lt;p&gt;Load &lt;code&gt;.env&lt;/code&gt; with &lt;code&gt;override: false&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;dotenv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;../.env&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;override&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reasoning is worth internalizing because the failure mode is silent: with&lt;br&gt;
&lt;code&gt;override: true&lt;/code&gt;, a value you pass on the command line&lt;br&gt;
(&lt;code&gt;APP_ENVIRONMENT=production pnpm exec playwright test …&lt;/code&gt;) gets &lt;strong&gt;reverted to the &lt;code&gt;.env&lt;/code&gt;&lt;br&gt;
default before setup runs&lt;/strong&gt;, so your "production" run quietly executes against staging.&lt;br&gt;
&lt;code&gt;override: false&lt;/code&gt; makes &lt;strong&gt;CLI-passed env vars win&lt;/strong&gt;, with &lt;code&gt;.env&lt;/code&gt; only supplying defaults&lt;br&gt;
for what the caller didn't set. Caller intent should always beat ambient config.&lt;/p&gt;
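
&lt;p&gt;The two precedence modes can be modelled without touching the real process environment. This is a simplified model of the semantics, not dotenv's implementation; &lt;code&gt;mergeEnv&lt;/code&gt; and the &lt;code&gt;BASE_URL&lt;/code&gt; value are hypothetical.&lt;/p&gt;

```typescript
type Env = { [key: string]: string | undefined };

// `current` stands in for the environment the caller launched with;
// `fileValues` stands in for the parsed .env file.
function mergeEnv(current: Env, fileValues: Env, override: boolean): Env {
  const merged: Env = { ...current };
  for (const key of Object.keys(fileValues)) {
    const alreadySet = merged[key] !== undefined;
    // With override=false, the file only fills in keys the caller left unset.
    if (override || !alreadySet) {
      merged[key] = fileValues[key];
    }
  }
  return merged;
}

const cli: Env = { APP_ENVIRONMENT: "production" };
const dotenvFile: Env = {
  APP_ENVIRONMENT: "staging",
  BASE_URL: "https://staging.example.test", // hypothetical default
};

const safe = mergeEnv(cli, dotenvFile, false);     // caller's value is kept
const clobbered = mergeEnv(cli, dotenvFile, true); // file value replaces it
```

&lt;p&gt;In the &lt;code&gt;safe&lt;/code&gt; result the caller's &lt;code&gt;APP_ENVIRONMENT&lt;/code&gt; survives and only the missing key is filled in; in &lt;code&gt;clobbered&lt;/code&gt; the run would silently target staging.&lt;/p&gt;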


&lt;h2&gt;
  
  
  2. Tag strategy: two independent axes
&lt;/h2&gt;

&lt;p&gt;This is the part most teams under-invest in, and it's what makes the suite legible at&lt;br&gt;
scale. Use &lt;strong&gt;two orthogonal tagging axes&lt;/strong&gt;, both via Playwright's runtime &lt;code&gt;tag&lt;/code&gt; attribute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                 Axis 1 — Traceability            Axis 2 — Run tier
                 (WHICH test case?)               (WHEN does it run?)
                 ┌──────────────────┐             ┌────────────────────────┐
  one test ─────►│ @TC042           │  ── plus ──►│ @smoke                 │
                 │ (stable join key │             │ @production-validation │
                 │  to test-mgmt DB)│             │ @endurance / (none)    │
                 └──────────────────┘             └────────────────────────┘
            file renames don't break it       decides the pipeline it lands in
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Axis 1 — Traceability: every test carries a stable test-case ID
&lt;/h3&gt;

&lt;p&gt;Every &lt;code&gt;test&lt;/code&gt;/&lt;code&gt;describe&lt;/code&gt; carries a &lt;code&gt;@TCxxx&lt;/code&gt; tag that matches a row in a test-management&lt;br&gt;
system. This tag is the &lt;strong&gt;stable join key&lt;/strong&gt; between the spec and the test-case record —&lt;br&gt;
file renames and refactors don't break it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Complete checkout with saved card&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@TC042&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why a runtime &lt;code&gt;tag&lt;/code&gt; and not a string in the test title or a JSDoc comment?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reporter output&lt;/strong&gt; — Playwright's JSON reporter emits &lt;code&gt;tags: [...]&lt;/code&gt; per test, so an
automated reconciliation job can sync results back to the test-management system. JSDoc
never reaches the reporter; title prefixes have to be parsed out of strings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLI filtering&lt;/strong&gt; — &lt;code&gt;--grep @TC042&lt;/code&gt; runs exactly one case; &lt;code&gt;--grep @smoke&lt;/code&gt; runs a tier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tooling standard&lt;/strong&gt; — TestRail / Xray / Zephyr / Qase reporters all consume the
runtime &lt;code&gt;tag&lt;/code&gt; attribute, so you're aligned with the ecosystem.&lt;/li&gt;
&lt;/ol&gt;
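
&lt;p&gt;A reconciliation job might turn reported tags into per-test-case results along these lines. The record shape below is a hypothetical, flattened simplification (the real JSON report is nested), and the pattern accepts tags with or without a leading &lt;code&gt;@&lt;/code&gt; because no particular report format is assumed.&lt;/p&gt;

```typescript
// Hypothetical, flattened record; a real job would first flatten the report.
interface ReportedTest {
  title: string;
  tags: string[];
  passed: boolean;
}

interface CaseResult {
  caseId: string;
  passed: boolean;
}

// Matches test-case tags such as "@TC042" or "TC042", but not "@smoke".
const TC_TAG = /^@?TC\d+$/;

// Emit one result row per test-case tag found on each test.
function toCaseResults(tests: ReportedTest[]): CaseResult[] {
  const results: CaseResult[] = [];
  for (const t of tests) {
    for (const tag of t.tags) {
      if (TC_TAG.test(tag)) {
        results.push({ caseId: tag.replace(/^@/, ""), passed: t.passed });
      }
    }
  }
  return results;
}
```

&lt;p&gt;A multi-TC test yields one row per test-case ID, while tier tags such as &lt;code&gt;@smoke&lt;/code&gt; are ignored by the pattern.&lt;/p&gt;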

&lt;p&gt;&lt;strong&gt;Multi-TC tagging — only for sequential journeys.&lt;/strong&gt; When several test cases are steps in&lt;br&gt;
&lt;em&gt;one&lt;/em&gt; journey that shares auth/setup/state (e.g. a third-party integration flow:&lt;br&gt;
connect → fetch data → perform action → push result), tag the single test with all of&lt;br&gt;
them and use &lt;code&gt;test.step('TCxxx: …')&lt;/code&gt; so the report still attributes the failure to the&lt;br&gt;
right step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
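&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;connect, fetch, and push to the external system&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@TC101&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@TC102&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@TC103&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;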
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;TC101: connect the integration&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;TC102: view connected details&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;TC103: push a record&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The rule of thumb: &lt;em&gt;can these scenarios run independently in any order against fresh&lt;br&gt;
state?&lt;/em&gt; &lt;strong&gt;Yes → one test each. No, each depends on the previous step → one multi-TC&lt;br&gt;
test.&lt;/strong&gt; Splitting a dependent journey would mean paying for the auth flow, any remote&lt;br&gt;
connection, and fixture setup once per step instead of once total.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Axis 2 — Run tier: which pipeline a test belongs to
&lt;/h3&gt;

&lt;p&gt;Independent of its TC ID, each test opts into a run tier:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tag&lt;/th&gt;
&lt;th&gt;When it runs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@smoke&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Every PR (a curated subset)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@production-validation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;After each production release (so more often as release cadence increases), fanned out per region&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;(no tier tag)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;Full regression, nightly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@endurance&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Its own dedicated scheduled workflow only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
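
&lt;p&gt;As a rough sketch, a pipeline could derive its tag filter from the tier name. &lt;code&gt;--grep&lt;/code&gt; is a real Playwright CLI flag; the &lt;code&gt;grepArgsForTier&lt;/code&gt; helper is hypothetical, and it assumes regression passes no tag filter at all.&lt;/p&gt;

```typescript
type Tier = "smoke" | "production-validation" | "endurance" | "regression";

// CLI filter arguments per tier. Regression passes no tag filter in this
// sketch; every other tier filters on its own tag.
function grepArgsForTier(tier: Tier): string[] {
  return tier === "regression" ? [] : ["--grep", `@${tier}`];
}
```

&lt;p&gt;For example, &lt;code&gt;grepArgsForTier("smoke")&lt;/code&gt; yields the arguments for &lt;code&gt;playwright test --grep @smoke&lt;/code&gt;.&lt;/p&gt;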




&lt;h2&gt;
  
  
  3. The smoke tier — fast, curated, every PR
&lt;/h2&gt;

&lt;p&gt;Smoke is the always-on PR gate, and its design is deliberate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Curated by QA, not by engineers.&lt;/strong&gt; A spec is in smoke if and only if its row in the
test-management system has the smoke box checked. To add/remove a spec, you flip the box
first, then sync the &lt;code&gt;@smoke&lt;/code&gt; tag. This keeps one team accountable for the smoke surface
instead of it growing ad hoc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-domain matrix.&lt;/strong&gt; Smoke runs as a parallel matrix across feature domains; each job
provisions &lt;strong&gt;only the account cohorts that domain needs&lt;/strong&gt;, with &lt;strong&gt;1 worker&lt;/strong&gt; per job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A wall-clock budget.&lt;/strong&gt; Set a target (e.g. keep PR smoke under ~5 minutes P95). Because
the domain jobs run in parallel, the budget is per-job, not the sum. The budget is the
forcing function that keeps anyone from quietly adding a multi-minute spec to smoke.&lt;/li&gt;
&lt;/ul&gt;
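
&lt;p&gt;The per-job budget can be checked mechanically. The sketch below uses a nearest-rank percentile over recent durations for one domain job; the function names and the choice of percentile method are assumptions, not part of the original setup.&lt;/p&gt;

```typescript
// Nearest-rank percentile over a list of durations (seconds).
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// True when the job's P95 duration fits the per-job budget.
function withinBudget(durationsSec: number[], budgetSec: number): boolean {
  return budgetSec >= percentile(durationsSec, 95);
}
```

&lt;p&gt;Run per domain job rather than over the summed durations, since the jobs execute in parallel and the budget applies to each one separately.&lt;/p&gt;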

&lt;p&gt;The endurance spec (a long-running, real-time flow that can take tens of minutes) is the&lt;br&gt;
explicit counter-example: it &lt;strong&gt;cannot&lt;/strong&gt; live in smoke or even nightly regression. It sits&lt;br&gt;
in its own Playwright project, which no general pipeline includes in its domain&lt;br&gt;
allowlist, and it is run only by a dedicated low-frequency scheduled workflow.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The lesson: &lt;em&gt;give genuinely outlier tests their own isolated lane so their slowness can&lt;br&gt;
never block the merge queue.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  4. Worker tuning — match the constraint, not the core count
&lt;/h2&gt;

&lt;p&gt;Sensible defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// CI = 3 workers, local = 4 (safe ceiling for sequential IdP logins).&lt;/span&gt;
&lt;span class="c1"&gt;// Override with PLAYWRIGHT_WORKERS.&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;NUM_WORKERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PLAYWRIGHT_WORKERS&lt;/span&gt;
  &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nf"&gt;parseInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PLAYWRIGHT_WORKERS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CI&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The non-obvious lesson: &lt;strong&gt;worker count is bounded by the weakest shared dependency, not&lt;br&gt;
by your CI runner's CPUs.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Common binding constraints are (a) the identity provider's&lt;br&gt;
tolerance for near-simultaneous logins and (b) the capacity of the shared environment&lt;br&gt;
under test. Cranking workers higher can produce &lt;em&gt;more&lt;/em&gt; failures, not faster runs —&lt;br&gt;
failures that masquerade as test flake but are really the backend or IdP saturating. Make&lt;br&gt;
the worker count an env-driven dial (&lt;code&gt;PLAYWRIGHT_WORKERS&lt;/code&gt;) so you can tune per environment&lt;br&gt;
without code changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Production validation — multi-region, release-triggered, intentionally small
&lt;/h2&gt;

&lt;p&gt;After each production release (and therefore more often as release cadence increases,&lt;br&gt;
which catches issues sooner), run a &lt;strong&gt;small, stable, curated set&lt;/strong&gt; of critical flows against&lt;br&gt;
production, &lt;strong&gt;fanned out across every geographic region&lt;/strong&gt;, via a manually dispatched&lt;br&gt;
workflow. Design principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Region = a region-pinned login.&lt;/strong&gt; A user's region claim drives backend routing, so
"run this spec against region X" is implemented as "log in with an X-region account."
The workflow passes the correct per-region API base URL through to the runner so any
admin/setup calls hit the right backend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static accounts, no provisioning in prod.&lt;/strong&gt; Unlike lower environments (which
dynamically provision throwaway accounts), production validation uses a fixed set of
pre-created accounts stored as a secret, region-keyed. Dynamic provisioning is &lt;em&gt;disabled&lt;/em&gt;
in prod, and there's a defence-in-depth guard that refuses to call internal admin APIs
against production. &lt;strong&gt;You do not let an E2E suite create or mutate data in production by
accident.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Region-specific skips are explicit and gated.&lt;/strong&gt; Where one region renders a different
UI or has a known backend issue, the skip is gated on a region env var (inert
everywhere except prod) with a comment pointing at the follow-up to remove it. Skips are
visible and temporary, never silent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QA owns the list.&lt;/strong&gt; The prod-validation set is intentionally tiny and stable;
engineers don't add to it without QA sign-off. &lt;code&gt;playwright test --grep
@production-validation --list&lt;/code&gt; is the source of truth for what's in it.&lt;/li&gt;
&lt;/ul&gt;
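
&lt;p&gt;The guard, the region-keyed accounts, and the gated skip could look roughly like this. &lt;code&gt;APP_ENVIRONMENT&lt;/code&gt; appears earlier in this write-up; every other name here (&lt;code&gt;TEST_REGION&lt;/code&gt;, the secret's JSON shape, the function names, the &lt;code&gt;eu&lt;/code&gt; region) is a hypothetical stand-in.&lt;/p&gt;

```typescript
type Env = { [key: string]: string | undefined };

// Defence-in-depth: refuse internal admin calls when targeting production.
function assertAdminCallAllowed(env: Env): void {
  if (env["APP_ENVIRONMENT"] === "production") {
    throw new Error("Admin API calls are blocked against production");
  }
}

interface StaticAccount {
  username: string;
}

// Pick the pre-created account for a region from a region-keyed secret.
// Assumed secret shape: a JSON object mapping region name to account.
function accountForRegion(secretJson: string, region: string): StaticAccount {
  const accounts = JSON.parse(secretJson) as { [region: string]: StaticAccount };
  const account = accounts[region];
  if (!account) {
    throw new Error(`No static account configured for region ${region}`);
  }
  return account;
}

// Region-gated skip: true only when the run targets the affected region,
// so the skip stays inert wherever the region variable is unset.
function shouldSkipForRegion(env: Env, affectedRegion: string): boolean {
  return env["TEST_REGION"] === affectedRegion;
}
```

&lt;p&gt;Inside a spec, the last helper would typically feed Playwright's conditional skip, for example &lt;code&gt;test.skip(shouldSkipForRegion(process.env, 'eu'), 'known issue, see follow-up')&lt;/code&gt;, which keeps the skip visible in the report.&lt;/p&gt;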




&lt;h2&gt;
  
  
  6. The tiered run model, end to end
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PR opened ─────────────► @smoke         (per-domain matrix, 1 worker, &amp;lt;5 min budget)
                              │
nightly ───────────────► full regression (everything without a tier tag)
                              │
after each release ────► @production-validation (multi-region fan-out, static accounts)

(separate lane) ───────► @endurance     (dedicated scheduled workflow, isolated project)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each tier trades coverage for speed deliberately. PRs get fast, narrow feedback;&lt;br&gt;
regression gets breadth overnight; production gets a small, high-confidence&lt;br&gt;
critical-path check across regions.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd tell another team starting this
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Invest in the tag taxonomy before the suite is big.&lt;/strong&gt; Two axes — a stable test-case
ID for traceability, a run-tier tag for pipeline routing — pay for themselves the day
you have more than ~50 tests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tune workers to the weakest shared dependency&lt;/strong&gt;, and make it an env dial. The runner's
core count is rarely the real ceiling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Give outlier tests their own lane.&lt;/strong&gt; One tens-of-minutes endurance test does not
belong in any pipeline that gates a merge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat the smoke list as a governed asset&lt;/strong&gt; with a wall-clock budget and a single
owner — otherwise it bloats until it's no longer "smoke."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Never let E2E mutate production by accident&lt;/strong&gt; — disable provisioning, pin accounts,
and add a guard that refuses admin calls against prod.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make caller intent beat ambient config&lt;/strong&gt; (&lt;code&gt;dotenv override: false&lt;/code&gt;). The silent
"ran against the wrong environment" bug is brutal to debug.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skips must be explicit, gated, and commented&lt;/strong&gt; with a path to removal — a silent skip
is just lost coverage wearing a green check.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;These are generic, reusable patterns for a large multi-domain E2E suite. Adapt the tier&lt;br&gt;
names, region model, domain partitioning, and tooling to your own stack.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>automation</category>
      <category>cicd</category>
      <category>devops</category>
      <category>testing</category>
    </item>
    <item>
      <title>AI-Powered Test Coverage Gap Analysis: How I Use Claude Code + gstack to Generate Test Cases</title>
      <dc:creator>Demi Jiang</dc:creator>
      <pubDate>Fri, 24 Apr 2026 05:44:31 +0000</pubDate>
      <link>https://dev.to/demi_jiang_3bfb65a7d28774/ai-powered-test-coverage-gap-analysis-how-i-use-claude-code-gstack-to-generate-test-cases-264a</link>
      <guid>https://dev.to/demi_jiang_3bfb65a7d28774/ai-powered-test-coverage-gap-analysis-how-i-use-claude-code-gstack-to-generate-test-cases-264a</guid>
      <description>&lt;p&gt;Every QA engineer knows the feeling: you're staring at a test suite that covers the happy path, maybe a few edge cases, and you have a nagging suspicion there's a whole category of scenarios nobody's thought to test. Writing those missing tests from scratch is slow, tedious, and mentally expensive. You're essentially doing product archaeology — reverse-engineering what the app actually does so you can describe it in test form.&lt;/p&gt;

&lt;p&gt;I found a way to automate that archaeology. In a single session, I used Claude Code and a tool called gstack to navigate our live staging app, compare what it actually does against our existing Notion test cases, and generate 24 new BDD-formatted test cases — all exported directly back into Notion. Here's the exact workflow, including the prompts I used and the lessons I learned the hard way.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Problem: Test Coverage Gaps Are Hard to Find Manually
&lt;/h2&gt;

&lt;p&gt;Manual gap analysis is a two-step cognitive problem. First you have to deeply understand what the application does — every mode, every edge case, every permission flow. Then you have to hold that in your head while scanning a test case database and noticing what's missing. Neither step is easy. Both together are exhausting.&lt;/p&gt;

&lt;p&gt;For any non-trivial feature, you'll have test cases for the happy path and maybe a few known edge cases. But what about different input types? State transitions that only happen under specific conditions? Browser-specific behaviors? Permission flows? You often don't know what's missing until something breaks in production.&lt;/p&gt;

&lt;p&gt;The approach I'd been using — read the test suite, open the app, click around, write notes — doesn't scale. What I needed was a way to have the analysis done for me, with the application as the source of truth rather than my memory of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Tools: Claude Code, Notion MCP, and gstack
&lt;/h2&gt;

&lt;p&gt;Before diving into the workflow, here's what each tool actually does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; is Anthropic's CLI for Claude. You run it from your terminal or VS Code and interact with it conversationally. It can execute bash commands, read and write files, call external APIs, and — crucially for this workflow — use MCP servers to connect to external tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notion MCP&lt;/strong&gt; is a Model Context Protocol server that lets Claude read and write Notion pages directly. Once configured, you can tell Claude to fetch a Notion page, read its content, and write new pages back — all from a single conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;gstack&lt;/strong&gt; is an open-source tool that gives Claude a headless browser. It exposes three skills:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Fixes bugs?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/browse&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Navigate a URL, interact with the UI, take screenshots, verify specific flows&lt;/td&gt;
&lt;td&gt;No — exploration only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/qa-only&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Systematic QA sweep of the whole app — structured report, health score, repro steps, screenshots&lt;/td&gt;
&lt;td&gt;No — report only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/qa&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same as &lt;code&gt;/qa-only&lt;/code&gt;, plus iteratively patches bugs in source code, commits each fix, re-verifies&lt;/td&gt;
&lt;td&gt;Yes — fixes and commits&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For this workflow I used &lt;code&gt;/browse&lt;/code&gt; — I wanted exploration and screenshots, not code changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Setup: Getting Everything Connected
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install Claude Code&lt;/strong&gt; by following Anthropic's installation docs. You can use it from the terminal or the VS Code extension. I used both: VS Code for reviewing output, the terminal for running prompts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configure Notion MCP&lt;/strong&gt; by editing &lt;code&gt;~/.claude.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"notion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://mcp.notion.com/mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll also need to authorize the Notion integration from your Notion workspace settings and give it access to the relevant pages. Claude Code picks up the MCP config automatically the next time it launches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install gstack&lt;/strong&gt; following the instructions in its repo. Once installed, the &lt;code&gt;/browse&lt;/code&gt;, &lt;code&gt;/qa-only&lt;/code&gt;, and &lt;code&gt;/qa&lt;/code&gt; skills become available inside Claude Code sessions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Set your permission mode.&lt;/strong&gt; By default, Claude Code asks for approval before running commands or making changes. For this kind of exploratory session, constant approval prompts break your flow. Setting the permission mode to &lt;code&gt;acceptEdits&lt;/code&gt; auto-approves file edits and cuts down on those prompts; depending on your configuration, shell commands may still ask for approval unless you allow them separately. Be aware of what this means: you're giving Claude latitude to make changes, so use it in a sandboxed or read-only context where possible.&lt;/p&gt;
&lt;/blockquote&gt;
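
&lt;p&gt;One way to make the permission mode stick is a Claude Code settings file. The snippet below is a minimal sketch, assuming a project-level &lt;code&gt;.claude/settings.json&lt;/code&gt; and the &lt;code&gt;permissions.defaultMode&lt;/code&gt; key; the file location is an assumption about your setup, so check the settings documentation for the version you run.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "permissions": {
    "defaultMode": "acceptEdits"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you would rather not persist it, launching with &lt;code&gt;claude --permission-mode acceptEdits&lt;/code&gt; should scope the looser mode to a single session, which fits a one-off exploration better than a permanent setting.&lt;/p&gt;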

&lt;p&gt;&lt;strong&gt;Why this matters for QA:&lt;/strong&gt; The setup cost here is low — maybe 20 minutes including Notion authorization. The payoff is a reusable pipeline. Once it's configured, every future gap analysis session starts from step one with no additional setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. The Workflow: Six Prompts, One Session
&lt;/h2&gt;

&lt;p&gt;Here's the complete workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                    GAP ANALYSIS WORKFLOW                     │
└─────────────────────────────────────────────────────────────┘

  [Notion DB]          [Live App]           [Notion DB]
      │                    │                     │
      ▼                    ▼                     │
  ┌────────┐         ┌──────────┐                │
  │ Step 1 │         │ Step 2   │                │
  │  Read  │         │ Explore  │                │
  │existing│         │  app via │                │
  │  TCs   │         │  gstack  │                │
  └────┬───┘         └────┬─────┘                │
       │                  │                      │
       └────────┬──────────┘                     │
                ▼                                │
           ┌────────┐                            │
           │ Step 3 │                            │
           │Compare │                            │
           │&amp;amp; find  │                            │
           │  gaps  │                            │
           └────┬───┘                            │
                ▼                                │
           ┌────────┐                            │
           │ Step 4 │                            │
           │ Draft  │                            │
           │  new   │                            │
           │  TCs   │                            │
           └────┬───┘                            │
                ▼                                │
           ┌────────┐                            │
           │ Step 5 │                            │
           │Refine  │                            │
           │to BDD  │                            │
           │format  │                            │
           └────┬───┘                            │
                ▼                                ▼
           ┌────────┐                       ┌────────┐
           │ Step 6 │──────────────────────▶│ New TC │
           │ Export │                       │ pages  │
           │→ Notion│                       │in DB   │
           └────────┘                       └────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 1 — Read Existing Test Cases from Notion
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fetch this Notion page and list all existing test cases with their names
and a one-line summary of what each one covers:
[your Notion test case database URL]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude fetches the Notion database, reads each page, and produces a structured list of each test case's name and what it covers. This becomes the baseline for the gap analysis.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Include the full URL in your prompt every time.&lt;/strong&gt; Don't say "the Notion page from earlier" or "the test database we discussed." Across tool calls and session boundaries, Claude needs explicit references, so paste the full URL into every prompt that refers to a Notion page.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 2 — Explore the App and Understand What It Does
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browse [your staging app URL]
Login with username [test-account] password [password]
Put the entire login and exploration in one bash script so the browser
session stays alive.
Take screenshots of each part of [the feature] and summarise how it works.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where gstack does the heavy lifting. Claude uses the &lt;code&gt;/browse&lt;/code&gt; skill to launch a headless browser, log in, navigate through every state of the feature, take screenshots, and come back with a written summary of how it all works.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Put login and exploration in a single bash script.&lt;/strong&gt; This is the most important gotcha in the whole workflow. The gstack browser server restarts between separate bash calls, which kills all browser state — including your login session. If you run login in one call and exploration in the next, Claude will be looking at a logged-out app. Combine everything into one script.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What you get back is a detailed summary of every state the feature can be in: what controls are visible, what actions are available, what happens when you submit or cancel, and screenshots of each screen. Claude understands the feature better after two minutes of headless browsing than you could communicate with a paragraph of description.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters for QA:&lt;/strong&gt; The app is the source of truth, not documentation or memory. When Claude explores the live app, it sees what users see — including states that might not be documented anywhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3 — Compare Against Existing Tests and Find Gaps
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Compare the feature you just explored against the existing test cases listed earlier.
Identify gaps — features or scenarios with no test coverage.
Group by area (e.g. different input types, error states, permissions,
edge cases, browser-specific behaviour).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude now has both sides: what the app does (from exploration) and what's already tested (from Notion). It produces a gap analysis grouped by area, surfacing scenarios that hadn't been explicitly tested — different input variations, specific error and timeout states, permission-related flows, and behavior under degraded conditions.&lt;/p&gt;

&lt;p&gt;This took about 30 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4 — Draft New Test Cases (Without Writing to Notion Yet)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Please create new test case entries for each gap you identified.
Do NOT write directly to Notion yet — show me the drafts first.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Always review before writing to Notion.&lt;/strong&gt; Notion changes cannot be reverted through Claude. If you let it write directly and the output is wrong — wrong format, wrong numbering, duplicate entries — you're cleaning up manually. The "show me the drafts first" step is non-negotiable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude generates a draft for each gap: a title, a brief description, and rough test steps. At this point the format isn't quite right yet, but the content is there.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5 — Refine to Match Your BDD Format
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Can you follow the same format I have here:
[URL of an existing well-formatted test case as a reference]

Rewrite all the draft test cases using that exact format:
Feature block with user story, Background, Scenario with Given/When/Then steps,
Execution Steps checklist, and Notes/Bug Link section.
Number them starting from [next available number].
Still do NOT write to Notion yet.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I pointed Claude at an existing test case as the template and asked it to rewrite all drafts to match: Feature block, Background, Scenario, Given/When/Then, Execution Steps checklist, Notes/Bug Link. I also specified the starting test case number so the new ones were numbered sequentially from where the existing ones left off.&lt;/p&gt;

&lt;p&gt;This step is worth taking seriously. A test case that's technically correct but formatted wrong creates work for whoever has to use it. Getting the format right before export means the output is immediately usable.&lt;/p&gt;
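
&lt;p&gt;For orientation, here is a rough skeleton of what a test case in that shape might look like. It is a hypothetical layout assembled only from the section names listed above; the bracketed placeholders, the ordering, and the checklist style are assumptions, and your own reference page remains the real template.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TC-XX: [short title of the gap]

Feature: [feature name]
  As a [type of user]
  I want [capability]
  So that [benefit]

  Background:
    Given I am logged in with the test account
    And I am on [the feature page]

  Scenario: [one gap from the analysis]
    Given [starting state]
    When [user action]
    Then [expected outcome]

Execution Steps
- [ ] [step 1]
- [ ] [step 2]

Notes / Bug Link
- [note or link]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping one scenario per gap makes the drafts easier to review one at a time before anything is written to Notion.&lt;/p&gt;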

&lt;h3&gt;
  
  
  Step 6 — Export to Notion
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write all the new test cases to Notion.
Create each one as a new page inside [your database name]
using the same format as the existing entries.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude uses the Notion MCP to create each test case as a new page in the database, including the full BDD content block and page properties: Case Type, Priority, Status.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters for QA:&lt;/strong&gt; The output lands directly in the tool your team already uses. No copy-pasting, no reformatting, no "I'll add this to Notion later." It's there.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. The Prompts as a Reusable Template
&lt;/h2&gt;

&lt;p&gt;Here's the complete sequence you can adapt for your own app and test database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Step 1 — Read existing test cases&lt;/span&gt;
Fetch this Notion page and list all existing test cases with their names
and a one-line summary of what each one covers:
[your Notion test case database URL]

&lt;span class="gh"&gt;# Step 2 — Explore the app&lt;/span&gt;
Browse [your staging app URL]
Login with username [test-account] password [password]
Put the entire login and exploration in one bash script so the browser
session stays alive.
Take screenshots of each part of [the feature] and summarise how it works.

&lt;span class="gh"&gt;# Step 3 — Gap analysis&lt;/span&gt;
Compare the feature you just explored against the existing test cases listed earlier.
Identify gaps — features or scenarios with no test coverage.
Group by area.

&lt;span class="gh"&gt;# Step 4 — Draft&lt;/span&gt;
Please create new test case entries for each gap you identified.
Do NOT write directly to Notion yet — show me the drafts first.

&lt;span class="gh"&gt;# Step 5 — Format&lt;/span&gt;
Can you follow the same format I have here:
[URL of an existing well-formatted test case]
Rewrite all the draft test cases using that exact format.
Number them starting from [TC-XX].
Still do NOT write to Notion yet.

&lt;span class="gh"&gt;# Step 6 — Export&lt;/span&gt;
Write all the new test cases to Notion.
Create each one as a new page inside [your database]
using the same format as the existing entries.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  6. Gotchas and Lessons Learned
&lt;/h2&gt;

&lt;p&gt;These aren't theoretical — each one cost me time before I figured it out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. One bash script for login + exploration.&lt;/strong&gt; The gstack browser server restarts between separate bash invocations. Combine login and exploration into a single script.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Always use explicit URLs.&lt;/strong&gt; Vague references like "the page from before" break across tool calls and context boundaries. Include the full URL in every prompt that references a Notion page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Review drafts before writing to Notion.&lt;/strong&gt; Anything Claude writes to Notion can't be undone through Claude. The "show me first" step is cheap insurance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Set &lt;code&gt;acceptEdits&lt;/code&gt; permission mode for exploration sessions.&lt;/strong&gt; Constant approval prompts fragment the session. Set it for exploration, but be aware of what you're enabling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Save reusable prompts as custom skills.&lt;/strong&gt; Claude Code supports custom skills: each skill is a folder under &lt;code&gt;~/.claude/skills/&lt;/code&gt; containing a &lt;code&gt;SKILL.md&lt;/code&gt; file. If you run gap analyses regularly, turn the prompt sequence into a skill so you invoke it with one command instead of retyping a paragraph.&lt;/p&gt;
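
&lt;p&gt;As a rough sketch of what such a skill could look like, the file below condenses the six prompts into one set of instructions. The skill name &lt;code&gt;gap-analysis&lt;/code&gt;, the path &lt;code&gt;~/.claude/skills/gap-analysis/SKILL.md&lt;/code&gt;, and the wording of the steps are all hypothetical; adapt them to your own database and app, and check the Claude Code skills documentation for the frontmatter fields your version supports.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;---
name: gap-analysis
description: Compare a live feature against existing Notion test cases and draft BDD test cases for the gaps. Use when asked to run a test coverage gap analysis.
---

# Gap analysis

1. Fetch the Notion test case database URL the user provides and list each
   test case with a one-line summary.
2. Browse the staging app URL the user provides. Keep login and exploration
   in one bash script so the browser session stays alive.
3. Compare the explored feature against the listed test cases and group the
   gaps by area.
4. Draft a test case for each gap. Do NOT write to Notion yet.
5. Rewrite the drafts to match the reference test case the user provides.
6. Write to Notion only after the user has approved the drafts.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Leaving the URLs and credentials out of the skill file and passing them in at invocation time keeps the skill reusable across features and avoids storing a password on disk.&lt;/p&gt;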

&lt;p&gt;&lt;strong&gt;6. Use a dedicated test account.&lt;/strong&gt; Your credentials go into a prompt that Claude executes. Don't use your personal account.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Results
&lt;/h2&gt;

&lt;p&gt;One session. Here's what came out of it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;24 new test cases generated&lt;/strong&gt; in a single session&lt;/li&gt;
&lt;li&gt;All formatted correctly: Feature block, Background, Scenario, Given/When/Then, Execution Steps checklist, Notes section&lt;/li&gt;
&lt;li&gt;All written as new pages in the Notion database with correct properties (Case Type, Priority, Status)&lt;/li&gt;
&lt;li&gt;Coverage gaps closed across multiple areas that hadn't been explicitly tested before&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before this session, gap analysis for a feature this size would have taken me half a day. The session itself took about 45 minutes, most of which was reviewing the drafts at steps 4 and 5. The test cases needed minor tweaks — a few Given steps needed more context, one When step was slightly off — but the heavy lifting was done. I was editing, not authoring from scratch.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. What Else You Can Do With This Approach
&lt;/h2&gt;

&lt;p&gt;The six-step workflow is one combination. The underlying capability is more flexible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requirements-first:&lt;/strong&gt; Instead of exploring the app, feed Claude your requirements doc or spec. "Here are the acceptance criteria. Here are the existing test cases. What scenarios aren't covered?" This works well for features that aren't built yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code-first:&lt;/strong&gt; Point Claude at the codebase and ask it to surface untested paths. "Here's the source code for this feature. Here are the existing test cases. What code paths have no test coverage?" This gets you into edge cases that are invisible from the UI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All three combined:&lt;/strong&gt; The most complete analysis uses all three inputs simultaneously — what the spec says the app should do, what the app actually does, and what the code does under the hood.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scheduled gap analysis:&lt;/strong&gt; Once the workflow is stable, run it on a cadence, such as every sprint or every release. A fresh gap analysis against a growing test suite catches coverage regressions: features that expanded while their tests didn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Test coverage gaps exist because comparing "what the app does" against "what we've tested" is cognitively expensive. AI is good at exactly that kind of comparison when you give it the right inputs.&lt;/p&gt;

&lt;p&gt;The workflow I described gives it those inputs systematically: read the existing tests, explore the live app, find the delta, draft the missing coverage, format it correctly, write it back. Each step is mechanical. The judgment calls — are these test cases accurate? are the priorities right? — still belong to you. But the archaeology is automated.&lt;/p&gt;

&lt;p&gt;24 test cases in one session. That's the headline. The more important number is how many more sessions like this I can run without burning out on the manual version.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/en/docs/claude-code" rel="noopener noreferrer"&gt;Claude Code documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.notion.so/help/notion-ai-mcp" rel="noopener noreferrer"&gt;Notion MCP server documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/garrytan/gstack" rel="noopener noreferrer"&gt;gstack on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cucumber.io/docs/gherkin/reference/" rel="noopener noreferrer"&gt;Gherkin / BDD reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>testing</category>
      <category>ai</category>
      <category>qa</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
