DEV Community: Demi Jiang

A Tiered Playwright E2E Strategy: From PR Smoke to Production Validation

Demi Jiang — Tue, 23 Jun 2026 01:19:49 +0000

A field write-up on a domain/feature-driven Playwright setup — the framework
configuration, the tag strategy that ties tests to a test-management system, and the
tiered run model (smoke on every PR → nightly regression → post-release production
validation). Tooling and infrastructure specifics are generalized so the
patterns are reusable anywhere.

Context
Framework configuration: a layered, project-partitioned setup
Tag strategy: two independent axes
The smoke tier
Worker tuning
Production validation
The tiered run model, end to end
What I'd tell another team starting this

At a glance — the run tiers

| Tier (tag) | Trigger | Scope | Workers | Goal |
| --- | --- | --- | --- | --- |
| Smoke (`@smoke`) | Every PR | Curated subset, per-domain matrix | 1 / job | Fast merge-gate feedback (~5 min P95) |
| Regression (no tier tag) | Nightly | Everything | Many | Broad coverage overnight |
| Production validation (`@production-validation`) | After each production release (more frequently as release cadence increases) | Small, stable critical-path set, per region | Tuned | Catch prod-only issues early, across regions |
| Endurance (`@endurance`) | Dedicated schedule | One long-running spec, isolated project | 1 | Cover tens-of-minutes flows off the critical path |

Context

Picture a product large enough that its end-to-end suite spans several independent
feature domains (think: onboarding, checkout, search, messaging, billing, integrations)
and has to run against everything from a local dev build to production in multiple
geographic regions.

The hard part of E2E at this scale isn't writing tests — it's keeping them fast enough
to gate PRs, trustworthy enough that red means red, and traceable enough that a
failure maps to a known test case. Almost every decision below is in service of one of
those three.

Throughout, I'll use generic domain names (checkout, search, onboarding, …) as
stand-ins for whatever your product's feature areas happen to be.

1. Framework configuration: a layered, project-partitioned setup

A shared base config, thin per-app overrides

There's one base config that every app/package inherits (playwright.base.config),
and a thin top-level playwright.config.ts that spreads it and adds what's local. This
keeps cross-cutting settings (reporters, trace/screenshot-on-failure, timeouts) in one
place and lets each consumer override only what it needs.
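
A minimal sketch of the layering, assuming a simplified config shape. The `E2EConfig` interface, the timeout and retry values, and the staging URL are illustrative stand-ins, not the real settings; an actual setup would use the types exported by `@playwright/test`.

```typescript
// Hypothetical stand-in for the shape of a Playwright config.
interface E2EConfig {
  timeout: number;
  retries: number;
  reporter: string[];
  use: { trace: string; screenshot: string; baseURL?: string };
  testDir?: string;
}

// playwright.base.config: cross-cutting settings shared by every consumer.
const baseConfig: E2EConfig = {
  timeout: 60_000,
  retries: 1,
  reporter: ['html', 'json'],
  use: { trace: 'retain-on-failure', screenshot: 'only-on-failure' },
};

// playwright.config.ts: spread the base, then add only what is local.
// Object spread is shallow, so the nested `use` block is spread as well;
// otherwise the override would replace the shared trace/screenshot settings.
const appConfig: E2EConfig = {
  ...baseConfig,
  testDir: './tests',
  use: { ...baseConfig.use, baseURL: 'https://staging.example.test' },
};

console.log(appConfig.use.trace); // inherited from the base config
```

The one thing worth noticing is the nested spread of `use`: a plain top-level spread would silently drop the shared failure-artifact settings.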

Projects = domain/feature partitions

Rather than one giant test pool, the suite is split into Playwright projects by feature
domain — checkout, search, onboarding, messaging, billing, and so on. Each
project points at its own testDir. This buys two things:

CI parallelism — each domain runs as its own CI job, in parallel.
Ownership routing — when a domain's job goes red, it routes to the team that owns it, not to a shared "the E2E suite is broken" alert.
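
As a rough illustration of both points, the sketch below generates one project entry per domain and maps a failing project back to an owner. The team names, the `./tests` root, and the `qa-triage` fallback are hypothetical; `DomainProject` only mirrors the `name` and `testDir` fields of a Playwright project.

```typescript
// Hypothetical domain-to-owner map; substitute your own feature areas.
const DOMAIN_OWNERS: Record<string, string> = {
  checkout: 'team-payments',
  search: 'team-discovery',
  onboarding: 'team-growth',
};

// Mirrors the `name` and `testDir` fields of a Playwright project entry.
interface DomainProject {
  name: string;
  testDir: string;
}

// One project per domain, each pointing at its own folder.
function buildProjects(domains: string[], root = './tests'): DomainProject[] {
  return domains.map((domain) => ({ name: domain, testDir: `${root}/${domain}` }));
}

// Route a red job to the team that owns the domain, with a shared fallback.
function ownerOf(projectName: string): string {
  return DOMAIN_OWNERS[projectName] ?? 'qa-triage';
}

const projects = buildProjects(Object.keys(DOMAIN_OWNERS));
console.log(projects.map((p) => `${p.name} -> ${p.testDir} (${ownerOf(p.name)})`));
```

Each CI job can then select its own partition with Playwright's `--project=<name>` flag.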

Splitting a heavy domain by wall-clock, not by name

One domain will inevitably become the timeout risk — usually the one that owns slow,
media- or generation-heavy flows (capture → processing → artifact generation). When a
single domain dominates the run, split it into multiple projects backed by the same
folder, partitioned by spec file. For example:

<domain>-core — the fast UI specs.
<domain>-heavy — the slow media/processing/generation specs.
<domain>-endurance — an isolated long-running spec (more on this below).

The key lesson: balance the split by measured per-spec duration, not by what the names
suggest. The goal is jobs that finish in roughly equal wall-clock. Re-check the split
against a recent HTML report's per-spec timings and rebalance — grouping "by feeling"
leaves one job idle while the other is the bottleneck.
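
One way to do the rebalancing mechanically is a greedy "longest spec first" assignment. This is a sketch of that heuristic, not necessarily how the split described here was produced; the spec names and timings are made up.

```typescript
interface Bucket {
  specs: string[];
  totalSeconds: number;
}

// Sort specs by measured duration (descending), then always place the next
// spec into the bucket that currently has the smallest total.
function balanceByDuration(durations: Record<string, number>, bucketCount: number): Bucket[] {
  if (bucketCount < 1) throw new Error('bucketCount must be at least 1');
  const buckets: Bucket[] = Array.from({ length: bucketCount }, () => ({
    specs: [],
    totalSeconds: 0,
  }));
  const longestFirst = Object.entries(durations).sort((a, b) => b[1] - a[1]);
  for (const [spec, seconds] of longestFirst) {
    const lightest = buckets.reduce((min, b) => (b.totalSeconds < min.totalSeconds ? b : min));
    lightest.specs.push(spec);
    lightest.totalSeconds += seconds;
  }
  return buckets;
}

// Hypothetical per-spec timings (seconds) read off a recent report.
const timings = {
  'capture.spec.ts': 240,
  'processing.spec.ts': 180,
  'artifact.spec.ts': 150,
  'settings.spec.ts': 40,
  'list-view.spec.ts': 30,
  'filters.spec.ts': 20,
};

const split = balanceByDuration(timings, 2);
console.log(split.map((b) => `${b.totalSeconds}s: ${b.specs.join(', ')}`));
```

With these numbers both buckets land on 330 seconds, even though one holds four specs and the other two. That is the point: the file count and the names say nothing about wall-clock.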

Browser launch flags, and similar options that only the heavy projects need, live on
the projects that actually need them, not globally, so unrelated domains aren't
launched with flags they don't use.

One subtle but important config decision: don't let `.env` clobber the CLI

Load .env with override: false:

import dotenv from 'dotenv';
dotenv.config({ path: '../.env', override: false });

The reasoning is worth internalizing because the failure mode is silent: with
override: true, a value you pass on the command line
(APP_ENVIRONMENT=production pnpm exec playwright test …) gets reverted to the .env
default before setup runs, so your "production" run quietly executes against staging.
override: false makes CLI-passed env vars win, with .env only supplying defaults
for what the caller didn't set. Caller intent should always beat ambient config.
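
To make the precedence concrete, here is a small emulation of the rule. It is not dotenv's implementation, only the merge behaviour described above; `applyEnvFile` and the `LOG_LEVEL` key are hypothetical.

```typescript
type Env = Record<string, string | undefined>;

// Emulates the precedence rule only: with override=false, a key the caller
// already set is left alone and the file just fills in the gaps.
function applyEnvFile(current: Env, fileValues: Record<string, string>, override: boolean): Env {
  const merged: Env = { ...current };
  for (const [key, value] of Object.entries(fileValues)) {
    if (override || merged[key] === undefined) {
      merged[key] = value;
    }
  }
  return merged;
}

// What the caller passed on the command line vs. what .env contains.
const fromCli: Env = { APP_ENVIRONMENT: 'production' };
const dotEnvFile = { APP_ENVIRONMENT: 'staging', LOG_LEVEL: 'info' };

const safe = applyEnvFile(fromCli, dotEnvFile, false);
const clobbered = applyEnvFile(fromCli, dotEnvFile, true);

console.log(safe.APP_ENVIRONMENT);      // production
console.log(clobbered.APP_ENVIRONMENT); // staging
```

The second result is the silent failure: nothing errors, and the run simply targets the wrong environment.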

2. Tag strategy: two independent axes

This is the part most teams under-invest in, and it's what makes the suite legible at
scale. Use two orthogonal tagging axes, both via Playwright's runtime tag attribute.

                 Axis 1 — Traceability            Axis 2 — Run tier
                 (WHICH test case?)               (WHEN does it run?)
                 ┌──────────────────┐             ┌────────────────────────┐
  one test ─────►│ @TC042           │  ── plus ──►│ @smoke                 │
                 │ (stable join key │             │ @production-validation │
                 │  to test-mgmt DB)│             │ @endurance / (none)    │
                 └──────────────────┘             └────────────────────────┘
            file renames don't break it       decides the pipeline it lands in

Axis 1 — Traceability: every test carries a stable test-case ID

Every test/describe carries a @TCxxx tag that matches a row in a test-management
system. This tag is the stable join key between the spec and the test-case record —
file renames and refactors don't break it.

test.describe('Complete checkout with saved card', { tag: ['@TC042'] }, () => { ... });

Why a runtime tag and not a string in the test title or a JSDoc comment?

Reporter output — Playwright's JSON reporter emits tags: [...] per test, so an automated reconciliation job can sync results back to the test-management system. JSDoc never reaches the reporter; title prefixes have to be parsed out of strings.
CLI filtering — --grep @TC042 runs exactly one case; --grep @smoke runs a tier.
Tooling standard — TestRail / Xray / Zephyr / Qase reporters all consume the runtime tag attribute, so you're aligned with the ecosystem.
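
A reconciliation job along those lines might look like the sketch below. The `ReportSuite` and `ReportSpec` interfaces are a simplified, assumed slice of the JSON report (nested suites, specs with `ok` and `tags`), and the regex accepts tags with or without the leading `@` since that detail can differ by reporter version.

```typescript
// Simplified, assumed slice of a JSON report.
interface ReportSpec {
  title: string;
  ok: boolean;
  tags: string[];
}
interface ReportSuite {
  specs?: ReportSpec[];
  suites?: ReportSuite[];
}

const TEST_CASE_ID = /^@?(TC\d+)$/;

// Walk the suite tree and record pass/fail per test-case ID.
function collectResults(
  suites: ReportSuite[],
  out: Map<string, boolean> = new Map(),
): Map<string, boolean> {
  for (const suite of suites) {
    for (const spec of suite.specs ?? []) {
      for (const tag of spec.tags) {
        const match = TEST_CASE_ID.exec(tag);
        if (!match) continue; // tier tags such as @smoke are not case IDs
        const id = match[1];
        // A case is green only if every spec carrying its ID passed.
        out.set(id, (out.get(id) ?? true) && spec.ok);
      }
    }
    collectResults(suite.suites ?? [], out);
  }
  return out;
}

// Hypothetical report fragment.
const report: ReportSuite[] = [
  {
    suites: [
      {
        specs: [
          { title: 'Complete checkout with saved card', ok: true, tags: ['TC042', 'smoke'] },
          { title: 'connect, fetch, and push', ok: false, tags: ['@TC101', '@TC102'] },
        ],
      },
    ],
  },
];

const results = collectResults(report);
console.log([...results.entries()]);
```

The resulting map is what a sync job would push to the test-management system, keyed by the same ID that appears on the spec.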

Multi-TC tagging — only for sequential journeys. When several test cases are steps in
one journey that shares auth/setup/state (e.g. a third-party integration flow:
connect → fetch data → perform action → push result), tag the single test with all of
them and use test.step('TCxxx: …') so the report still attributes the failure to the
right step:

test(
  'connect, fetch, and push to the external system',
  { tag: ['@TC101', '@TC102', '@TC103'] },
  async ({ page }) => {
    await test.step('TC101: connect the integration', async () => { ... });
    await test.step('TC102: view connected details',  async () => { ... });
    await test.step('TC103: push a record',           async () => { ... });
  },
);

The rule of thumb: can these scenarios run independently in any order against fresh
state? Yes → one test each. No, each depends on the previous step → one multi-TC
test. Splitting a dependent journey would mean paying for the auth flow, any remote
connection, and fixture setup once per step instead of once total.

Axis 2 — Run tier: which pipeline a test belongs to

Independent of its TC ID, each test opts into a run tier:

| Tag | When it runs |
| --- | --- |
| `@smoke` | Every PR (a curated subset) |
| `@production-validation` | After each production release (more frequently as release cadence increases), fanned out per region |
| (no tier tag) | Full regression, nightly |
| `@endurance` | Its own dedicated scheduled workflow only |
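
Because the two axes are independent, a tag list can be split into them mechanically. The helper below is a hypothetical sketch; it assumes case IDs look like `@TC` followed by digits and that the three tier tags are the ones in the table.

```typescript
type Tier = 'smoke' | 'production-validation' | 'endurance';

const TIER_BY_TAG: Record<string, Tier | undefined> = {
  '@smoke': 'smoke',
  '@production-validation': 'production-validation',
  '@endurance': 'endurance',
};

interface TagAxes {
  caseIds: string[]; // axis 1: which test cases this test covers
  tiers: Tier[];     // axis 2: empty means "no tier tag" (nightly regression)
}

// Sort each tag onto its axis; anything unrecognised is ignored.
function splitAxes(tags: string[]): TagAxes {
  const axes: TagAxes = { caseIds: [], tiers: [] };
  for (const tag of tags) {
    const tier = TIER_BY_TAG[tag];
    if (tier) {
      axes.tiers.push(tier);
    } else if (/^@TC\d+$/.test(tag)) {
      axes.caseIds.push(tag);
    }
  }
  return axes;
}

console.log(splitAxes(['@TC042', '@smoke']));
console.log(splitAxes(['@TC101', '@TC102', '@TC103']));
```

A lint step built on something like this can flag tests with no case ID at all, which is the usual way traceability erodes.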

3. The smoke tier — fast, curated, every PR

Smoke is the always-on PR gate, and its design is deliberate:

Curated by QA, not by engineers. A spec is in smoke iff its row in the test-management system has the smoke box checked. To add/remove a spec, you flip the box first, then sync the @smoke tag. This keeps one team accountable for the smoke surface instead of it growing ad hoc.
Per-domain matrix. Smoke runs as a parallel matrix across feature domains; each job provisions only the account cohorts that domain needs, with 1 worker per job.
A wall-clock budget. Set a target (e.g. keep PR smoke under ~5 minutes P95). Because the domain jobs run in parallel, the budget is per-job, not the sum. The budget is the forcing function that keeps anyone from quietly adding a multi-minute spec to smoke.
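
The per-job budget is easy to check automatically. This sketch uses the nearest-rank definition of P95 and invented job durations; the 5-minute figure comes from the example target above, and how you collect the durations is up to your CI.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of all
// samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

const BUDGET_SECONDS = 5 * 60;

// Hypothetical recent wall-clock durations (seconds) per smoke job.
const recentRuns: Record<string, number[]> = {
  checkout: [210, 225, 240, 232, 250, 228, 245, 238, 260, 255],
  search: [180, 190, 320, 185, 200, 195, 310, 188, 192, 205],
};

// The budget applies to each parallel job on its own, not to the sum.
function jobsOverBudget(runs: Record<string, number[]>, budgetSeconds: number): string[] {
  return Object.entries(runs)
    .filter(([, durations]) => percentile(durations, 95) > budgetSeconds)
    .map(([job]) => job);
}

console.log(jobsOverBudget(recentRuns, BUDGET_SECONDS)); // [ 'search' ]
```

Here `search` has a lower typical duration than `checkout` but two slow outliers, so it is the one that breaks a P95 budget.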

The endurance spec (a long-running, real-time flow that can take tens of minutes) is the
explicit counter-example: it cannot live in smoke or even nightly regression. It sits
in its own Playwright project that no general pipeline's domain
allowlist includes, run only by a dedicated low-frequency scheduled workflow.

The lesson: give genuinely outlier tests their own isolated lane so their slowness can
never block the merge queue.

4. Worker tuning — match the constraint, not the core count

Sensible defaults:

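// Defaults: 3 workers in CI, 4 locally (a safe ceiling for sequential IdP logins).
// Override with PLAYWRIGHT_WORKERS; a missing or invalid value falls back to the default.
const requestedWorkers = Number.parseInt(process.env.PLAYWRIGHT_WORKERS ?? '', 10);
export const NUM_WORKERS =
  Number.isInteger(requestedWorkers) && requestedWorkers > 0
    ? requestedWorkers
    : process.env.CI ? 3 : 4;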

The non-obvious lesson: worker count is bounded by the weakest shared dependency, not
by your CI runner's CPUs.

Common binding constraints are (a) the identity provider's
tolerance for near-simultaneous logins and (b) the capacity of the shared environment
under test. Cranking workers higher can produce more failures, not faster runs —
failures that masquerade as test flake but are really the backend or IdP saturating. Make
the worker count an env-driven dial (PLAYWRIGHT_WORKERS) so you can tune per environment
without code changes.

5. Production validation — multi-region, release-triggered, intentionally small

After each production release, run a small, stable, curated set of critical flows
against production, fanned out across every geographic region, via a manually
dispatched workflow. As release cadence increases, the same run happens
correspondingly more often, which catches issues sooner. Design principles:

Region = a region-pinned login. A user's region claim drives backend routing, so "run this spec against region X" is implemented as "log in with an X-region account." The workflow passes the correct per-region API base URL through to the runner so any admin/setup calls hit the right backend.
Static accounts, no provisioning in prod. Unlike lower environments (which dynamically provision throwaway accounts), production validation uses a fixed set of pre-created accounts stored as a secret, region-keyed. Dynamic provisioning is disabled in prod, and there's a defence-in-depth guard that refuses to call internal admin APIs against production. You do not let an E2E suite create or mutate data in production by accident.
Region-specific skips are explicit and gated. Where one region renders a different UI or has a known backend issue, the skip is gated on a region env var (inert everywhere except prod) with a comment pointing at the follow-up to remove it. Skips are visible and temporary, never silent.
QA owns the list. The prod-validation set is intentionally tiny and stable; engineers don't add to it without QA sign-off. playwright test --grep @production-validation --list is the source of truth for what's in it.
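
The guard, the static accounts, and the gated skip can each be a few lines. The sketch below is illustrative only: the function names, the `'production'` environment string, and the JSON layout of the secret are assumptions, not the actual implementation.

```typescript
interface Account {
  username: string;
  password: string;
}

// Defence in depth: refuse internal admin/provisioning calls against prod,
// even if a caller forgot to disable provisioning.
function assertAdminCallAllowed(environment: string, operation: string): void {
  if (environment === 'production') {
    throw new Error(`Refusing admin operation "${operation}" against production`);
  }
}

// Static accounts: parse a region-keyed JSON secret and pick one entry.
function accountForRegion(secretJson: string, region: string): Account {
  const accounts = JSON.parse(secretJson) as Record<string, Account | undefined>;
  const account = accounts[region];
  if (!account) throw new Error(`No pre-created account for region "${region}"`);
  return account;
}

// Region-gated skip: inert unless the region variable is set, which in this
// sketch only happens in production validation runs.
function shouldSkipInRegion(currentRegion: string | undefined, affected: string[]): boolean {
  return currentRegion !== undefined && affected.includes(currentRegion);
}

// Hypothetical secret contents and usage.
const secret = '{"eu":{"username":"eu-validation","password":"placeholder"}}';
console.log(accountForRegion(secret, 'eu').username);
console.log(shouldSkipInRegion(undefined, ['eu'])); // false outside prod
```

In a spec, the last helper would feed Playwright's conditional `test.skip(condition, description)`, with the description pointing at the follow-up that removes the skip.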

6. The tiered run model, end to end

PR opened ─────────────► @smoke         (per-domain matrix, 1 worker, <5 min budget)
                              │
nightly ───────────────► full regression (everything without a tier tag)
                              │
after each release ────► @production-validation (multi-region fan-out, static accounts)

(separate lane) ───────► @endurance     (dedicated scheduled workflow, isolated project)

Each tier trades coverage for speed deliberately. PRs get fast, narrow feedback;
regression gets breadth overnight; production gets a small, high-confidence
critical-path check across regions.

What I'd tell another team starting this

Invest in the tag taxonomy before the suite is big. Two axes — a stable test-case ID for traceability, a run-tier tag for pipeline routing — pay for themselves the day you have more than ~50 tests.
Tune workers to the weakest shared dependency, and make it an env dial. The runner's core count is rarely the real ceiling.
Give outlier tests their own lane. One tens-of-minutes endurance test does not belong in any pipeline that gates a merge.
Treat the smoke list as a governed asset with a wall-clock budget and a single owner — otherwise it bloats until it's no longer "smoke."
Never let E2E mutate production by accident — disable provisioning, pin accounts, and add a guard that refuses admin calls against prod.
Make caller intent beat ambient config (dotenv override: false). The silent "ran against the wrong environment" bug is brutal to debug.
Skips must be explicit, gated, and commented with a path to removal — a silent skip is just lost coverage wearing a green check.

These are generic, reusable patterns for a large multi-domain E2E suite. Adapt the tier
names, region model, domain partitioning, and tooling to your own stack.

AI-Powered Test Coverage Gap Analysis: How I Use Claude Code + gstack to Generate Test Cases

Demi Jiang — Fri, 24 Apr 2026 05:44:31 +0000

Every QA engineer knows the feeling: you're staring at a test suite that covers the happy path, maybe a few edge cases, and you have a nagging suspicion there's a whole category of scenarios nobody's thought to test. Writing those missing tests from scratch is slow, tedious, and mentally expensive. You're essentially doing product archaeology — reverse-engineering what the app actually does so you can describe it in test form.

I found a way to automate that archaeology. In a single session, I used Claude Code and a tool called gstack to navigate our live staging app, compare what it actually does against our existing Notion test cases, and generate 24 new BDD-formatted test cases — all exported directly back into Notion. Here's the exact workflow, including the prompts I used and the lessons I learned the hard way.

1. The Problem: Test Coverage Gaps Are Hard to Find Manually

Manual gap analysis is a two-step cognitive problem. First you have to deeply understand what the application does — every mode, every edge case, every permission flow. Then you have to hold that in your head while scanning a test case database and noticing what's missing. Neither step is easy. Both together are exhausting.

For any non-trivial feature, you'll have test cases for the happy path and maybe a few known edge cases. But what about different input types? State transitions that only happen under specific conditions? Browser-specific behaviors? Permission flows? You often don't know what's missing until something breaks in production.

The approach I'd been using — read the test suite, open the app, click around, write notes — doesn't scale. What I needed was a way to have the analysis done for me, with the application as the source of truth rather than my memory of it.

2. The Tools: Claude Code, Notion MCP, and gstack

Before diving into the workflow, here's what each tool actually does.

Claude Code is Anthropic's CLI for Claude. You run it from your terminal or VS Code and interact with it conversationally. It can execute bash commands, read and write files, call external APIs, and — crucially for this workflow — use MCP servers to connect to external tools.

Notion MCP is a Model Context Protocol server that lets Claude read and write Notion pages directly. Once configured, you can tell Claude to fetch a Notion page, read its content, and write new pages back — all from a single conversation.

gstack is an open-source tool that, among other things, gives Claude a headless browser. Three of its skills matter for this workflow:

| Skill | What it does | Fixes bugs? |
| --- | --- | --- |
| `/browse` | Navigate a URL, interact with the UI, take screenshots, verify specific flows | No (exploration only) |
| `/qa-only` | Systematic QA sweep of the whole app: structured report, health score, repro steps, screenshots | No (report only) |
| `/qa` | Same as `/qa-only`, plus iteratively patches bugs in source code, commits each fix, re-verifies | Yes (fixes and commits) |

For this workflow I used /browse — I wanted exploration and screenshots, not code changes.

3. Setup: Getting Everything Connected

Install Claude Code by following Anthropic's documentation. You can use it from the terminal or the VS Code extension. I used both: VS Code for reviewing output, terminal for running prompts.

Configure Notion MCP by editing ~/.claude.json:

{
  "mcpServers": {
    "notion": {
      "type": "http",
      "url": "https://mcp.notion.com/mcp"
    }
  }
}

You'll also need to authorize the Notion integration from your Notion workspace settings and give it access to the relevant pages. Claude will automatically pick up the MCP config on next launch.

Install gstack following the instructions in its repo. Once installed, the /browse, /qa-only, and /qa skills become available inside Claude Code sessions.

⚠️ Set your permission mode. By default, Claude Code asks for approval before running commands or making changes. For this kind of exploratory session, constant approval prompts break your flow. Setting the permission mode to acceptEdits cuts them down by auto-approving file edits; depending on your version and settings, shell commands may still prompt unless you allow them. Be aware of what this means: you're giving Claude latitude to make changes, so use it in a sandboxed or read-only context where possible.

Why this matters for QA: The setup cost here is low — maybe 20 minutes including Notion authorization. The payoff is a reusable pipeline. Once it's configured, every future gap analysis session starts from step one with no additional setup.

4. The Workflow: Six Prompts, One Session

Here's the complete workflow:

┌─────────────────────────────────────────────────────────────┐
│                    GAP ANALYSIS WORKFLOW                     │
└─────────────────────────────────────────────────────────────┘

  [Notion DB]          [Live App]           [Notion DB]
      │                    │                     │
      ▼                    ▼                     │
  ┌────────┐         ┌──────────┐                │
  │ Step 1 │         │ Step 2   │                │
  │  Read  │         │ Explore  │                │
  │existing│         │  app via │                │
  │  TCs   │         │  gstack  │                │
  └────┬───┘         └────┬─────┘                │
       │                  │                      │
       └────────┬──────────┘                     │
                ▼                                │
           ┌────────┐                            │
           │ Step 3 │                            │
           │Compare │                            │
           │& find  │                            │
           │  gaps  │                            │
           └────┬───┘                            │
                ▼                                │
           ┌────────┐                            │
           │ Step 4 │                            │
           │ Draft  │                            │
           │  new   │                            │
           │  TCs   │                            │
           └────┬───┘                            │
                ▼                                │
           ┌────────┐                            │
           │ Step 5 │                            │
           │Refine  │                            │
           │to BDD  │                            │
           │format  │                            │
           └────┬───┘                            │
                ▼                                ▼
           ┌────────┐                       ┌────────┐
           │ Step 6 │──────────────────────▶│ New TC │
           │ Export │                       │ pages  │
           │to Notion│                      │in DB   │
           └────────┘                       └────────┘

Step 1 — Read Existing Test Cases from Notion

Fetch this Notion page and list all existing test cases with their names
and a one-line summary of what each one covers:
[your Notion test case database URL]

Claude fetches the Notion database, reads each page, and produces a structured list: test case name, what it covers. This becomes the baseline for the gap analysis.

💡 Include the full URL in your prompt every time. Don't say "the Notion page from earlier" or "the test database we discussed." Across tool calls and session boundaries, Claude needs explicit references. Paste the full URL in every prompt that references a Notion page.

Step 2 — Explore the App and Understand What It Does

Browse [your staging app URL]
Login with username [test-account] password [password]
Put the entire login and exploration in one bash script so the browser
session stays alive.
Take screenshots of each part of [the feature] and summarise how it works.

This is where gstack does the heavy lifting. Claude uses the /browse skill to launch a headless browser, log in, navigate through every state of the feature, take screenshots, and come back with a written summary of how it all works.

⚠️ Put login and exploration in a single bash script. This is the most important gotcha in the whole workflow. The gstack browser server restarts between separate bash calls, which kills all browser state — including your login session. If you run login in one call and exploration in the next, Claude will be looking at a logged-out app. Combine everything into one script.

What you get back is a detailed summary of every state the feature can be in: what controls are visible, what actions are available, what happens when you submit or cancel, and screenshots of each screen. Claude understands the feature better after two minutes of headless browsing than you could communicate with a paragraph of description.

Why this matters for QA: The app is the source of truth, not documentation or memory. When Claude explores the live app, it sees what users see — including states that might not be documented anywhere.

Step 3 — Compare Against Existing Tests and Find Gaps

Compare the feature you just explored against the existing test cases listed earlier.
Identify gaps — features or scenarios with no test coverage.
Group by area (e.g. different input types, error states, permissions,
edge cases, browser-specific behaviour).

Claude now has both sides: what the app does (from exploration) and what's already tested (from Notion). It produces a gap analysis grouped by area, surfacing scenarios that hadn't been explicitly tested — different input variations, specific error and timeout states, permission-related flows, and behavior under degraded conditions.

This took about 30 seconds.

Step 4 — Draft New Test Cases (Without Writing to Notion Yet)

Please create new test case entries for each gap you identified.
Do NOT write directly to Notion yet — show me the drafts first.

⚠️ Always review before writing to Notion. Notion changes cannot be reverted through Claude. If you let it write directly and the output is wrong — wrong format, wrong numbering, duplicate entries — you're cleaning up manually. The "show me the drafts first" step is non-negotiable.

Claude generates a draft for each gap: a title, a brief description, and rough test steps. At this point the format isn't quite right yet, but the content is there.

Step 5 — Refine to Match Your BDD Format

Can you follow the same format I have here:
[URL of an existing well-formatted test case as a reference]

Rewrite all the draft test cases using that exact format:
Feature block with user story, Background, Scenario with Given/When/Then steps,
Execution Steps checklist, and Notes/Bug Link section.
Number them starting from [next available number].
Still do NOT write to Notion yet.

I pointed Claude at an existing test case as the template and asked it to rewrite all drafts to match — Feature block, Background, Scenario, Given/When/Then, Execution Steps checklist, Notes/Bug Link. I also specified the starting test case number so the new ones numbered sequentially from where the existing ones left off.

This step is worth taking seriously. A test case that's technically correct but formatted wrong creates work for whoever has to use it. Getting the format right before export means the output is immediately usable.
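
If you save the drafts to text before exporting, a quick mechanical check can catch format drift ahead of the manual review. This is a hypothetical helper, not part of the workflow above; the marker strings are assumptions about how the sections are labelled in your template.

```typescript
// Assumed marker text for each required section; adjust to your template.
const REQUIRED_MARKERS = [
  'Feature:',
  'Background:',
  'Scenario:',
  'Given ',
  'When ',
  'Then ',
  'Execution Steps',
  'Notes',
];

// Return the markers that a draft test case is missing.
function missingSections(draft: string): string[] {
  return REQUIRED_MARKERS.filter((marker) => !draft.includes(marker)).map((m) => m.trim());
}

// Hypothetical draft with the checklist and notes sections left out.
const draft = [
  'Feature: Saved card checkout',
  'Background: A signed-in user with one saved card',
  'Scenario: Pay with the saved card',
  'Given the cart contains one item',
  'When the user confirms payment',
  'Then an order confirmation is shown',
].join('\n');

console.log(missingSections(draft)); // [ 'Execution Steps', 'Notes' ]
```

A check like this does not replace reading the drafts; it only makes sure every one has the same skeleton before you spend time on the content.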

Step 6 — Export to Notion

Write all the new test cases to Notion.
Create each one as a new page inside [your database name]
using the same format as the existing entries.

Claude uses the Notion MCP to create each test case as a new page in the database, including the full BDD content block and page properties: Case Type, Priority, Status.

Why this matters for QA: The output lands directly in the tool your team already uses. No copy-pasting, no reformatting, no "I'll add this to Notion later." It's there.

5. The Prompts as a Reusable Template

Here's the complete sequence you can adapt for your own app and test database:

# Step 1 — Read existing test cases
Fetch this Notion page and list all existing test cases with their names
and a one-line summary of what each one covers:
[your Notion test case database URL]

# Step 2 — Explore the app
Browse [your staging app URL]
Login with username [test-account] password [password]
Put the entire login and exploration in one bash script so the browser
session stays alive.
Take screenshots of each part of [the feature] and summarise how it works.

# Step 3 — Gap analysis
Compare the feature you just explored against the existing test cases listed earlier.
Identify gaps — features or scenarios with no test coverage.
Group by area.

# Step 4 — Draft
Please create new test case entries for each gap you identified.
Do NOT write directly to Notion yet — show me the drafts first.

# Step 5 — Format
Can you follow the same format I have here:
[URL of an existing well-formatted test case]
Rewrite all the draft test cases using that exact format.
Number them starting from [TC-XX].
Still do NOT write to Notion yet.

# Step 6 — Export
Write all the new test cases to Notion.
Create each one as a new page inside [your database]
using the same format as the existing entries.

6. Gotchas and Lessons Learned

These aren't theoretical — each one cost me time before I figured it out.

1. One bash script for login + exploration. The gstack browser server restarts between separate bash invocations. Combine login and exploration into a single script.

2. Always use explicit URLs. Vague references like "the page from before" break across tool calls and context boundaries. Include the full URL in every prompt that references a Notion page.

3. Review drafts before writing to Notion. Notion write operations through Claude are not reversible via Claude. The "show me first" step is cheap insurance.

4. Set acceptEdits permission mode for exploration sessions. Constant approval prompts fragment the session. Set it for exploration, but be aware of what you're enabling.

5. Save reusable prompts as custom skills. Claude Code supports custom skills, each defined by a SKILL.md markdown file in its own folder under ~/.claude/skills/. If you run gap analyses regularly, turn the prompt sequence into a skill so you invoke it with one command instead of retyping a paragraph.

6. Use a dedicated test account. Your credentials go into a prompt that Claude executes. Don't use your personal account.

7. Results

One session. Here's what came out of it:

24 new test cases generated in a single session
All formatted correctly: Feature block, Background, Scenario, Given/When/Then, Execution Steps checklist, Notes section
All written as new pages in the Notion database with correct properties (Case Type, Priority, Status)
Coverage gaps closed across multiple areas that hadn't been explicitly tested before

Before this session, gap analysis for a feature this size would have taken me half a day. The session itself took about 45 minutes, most of which was reviewing the drafts at steps 4 and 5. The test cases needed minor tweaks — a few Given steps needed more context, one When step was slightly off — but the heavy lifting was done. I was editing, not authoring from scratch.

8. What Else You Can Do With This Approach

The six-step workflow is one combination. The underlying capability is more flexible.

Requirements-first: Instead of exploring the app, feed Claude your requirements doc or spec. "Here are the acceptance criteria. Here are the existing test cases. What scenarios aren't covered?" This works well for features that aren't built yet.

Code-first: Point Claude at the codebase and ask it to surface untested paths. "Here's the source code for this feature. Here are the existing test cases. What code paths have no test coverage?" This gets you into edge cases that are invisible from the UI.

All three combined: The most complete analysis uses all three inputs simultaneously — what the spec says the app should do, what the app actually does, and what the code does under the hood.

Scheduled gap analysis: Once the workflow is stable, run it on a cadence — every sprint, every release. A fresh gap analysis against a growing test suite catches regression in coverage: features that expanded but whose tests didn't.

Conclusion

Test coverage gaps exist because comparing "what the app does" against "what we've tested" is cognitively expensive. AI is good at exactly that kind of comparison when you give it the right inputs.

The workflow I described gives it those inputs systematically: read the existing tests, explore the live app, find the delta, draft the missing coverage, format it correctly, write it back. Each step is mechanical. The judgment calls — are these test cases accurate? are the priorities right? — still belong to you. But the archaeology is automated.

24 test cases in one session. That's the headline. The more important number is how many more sessions like this I can run without burning out on the manual version.
