Demi Jiang

Posted on Jun 23

A Tiered Playwright E2E Strategy: From PR Smoke to Production Validation

#testing #devops #automation #cicd

A field write-up on a domain/feature-driven Playwright setup — the framework
configuration, the tag strategy that ties tests to a test-management system, and the
tiered run model (smoke on every PR → nightly regression → post-release production
validation). Tooling and infrastructure specifics are generalized so the
patterns are reusable anywhere.

Context
Framework configuration: a layered, project-partitioned setup
Tag strategy: two independent axes
The smoke tier
Worker tuning
Production validation
The tiered run model, end to end
What I'd tell another team starting this

At a glance — the run tiers

Tier (tag)	Trigger	Scope	Workers	Goal
Smoke (`@smoke`)	Every PR	Curated subset, per-domain matrix	1 / job	Fast merge-gate feedback (~5 min P95)
Regression (no tier tag)	Nightly	Everything except the endurance spec	Many	Broad coverage overnight
Production validation (`@production-validation`)	After each production release (more often as release cadence increases)	Small, stable critical-path set, per region	Tuned	Catch prod-only issues early, across regions
Endurance (`@endurance`)	Dedicated schedule	One long-running spec, isolated project	1	Cover tens-of-minutes flows off the critical path

Context

Picture a product large enough that its end-to-end suite spans several independent
feature domains (think: onboarding, checkout, search, messaging, billing, integrations)
and has to run against everything from a local dev build to production in multiple
geographic regions.

The hard part of E2E at this scale isn't writing tests — it's keeping them fast enough
to gate PRs, trustworthy enough that red means red, and traceable enough that a
failure maps to a known test case. Almost every decision below is in service of one of
those three.

Throughout, I'll use generic domain names (checkout, search, onboarding, …) as
stand-ins for whatever your product's feature areas happen to be.

1. Framework configuration: a layered, project-partitioned setup

A shared base config, thin per-app overrides

There's one base config that every app/package inherits (playwright.base.config),
and a thin top-level playwright.config.ts that spreads it and adds what's local. This
keeps cross-cutting settings (reporters, trace/screenshot-on-failure, timeouts) in one
place and lets each consumer override only what it needs.
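As a rough single-file sketch of that layering (in a real repo the base object would live in its own playwright.base.config file and be imported), assuming illustrative values for the timeout, reporters, test directory, and a hypothetical BASE_URL variable:

```typescript
import { defineConfig, type PlaywrightTestConfig } from '@playwright/test';

// Cross-cutting settings shared by every app/package (values are illustrative).
const baseConfig: PlaywrightTestConfig = {
  timeout: 60_000,
  reporter: [['html', { open: 'never' }], ['json', { outputFile: 'results.json' }]],
  use: {
    trace: 'retain-on-failure',
    screenshot: 'only-on-failure',
  },
};

// The thin top-level config: spread the base, then add only what is local.
export default defineConfig({
  ...baseConfig,
  testDir: './e2e', // hypothetical folder
  use: {
    ...baseConfig.use, // nested objects need their own spread, or the base values are lost
    baseURL: process.env.BASE_URL ?? 'http://localhost:3000', // hypothetical variable
  },
});
```

The one trap worth noting is the nested `use` object: a plain top-level spread replaces it wholesale, so the override has to spread the base `use` as well.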

Projects = domain/feature partitions

Rather than one giant test pool, the suite is split into Playwright projects by feature
domain — checkout, search, onboarding, messaging, billing, and so on. Each
project points at its own testDir. This buys two things:

CI parallelism — each domain runs as its own CI job, in parallel.
Ownership routing — when a domain's job goes red, it routes to the team that owns it, not to a shared "the E2E suite is broken" alert.
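A minimal sketch of the partitioning, assuming a hypothetical tests/<domain> folder layout and a local type standing in for Playwright's own project shape:

```typescript
// Minimal local shape; a real config would use Playwright's own project type.
interface DomainProject {
  name: string;
  testDir: string;
}

// Hypothetical domain list; substitute your product's feature areas.
const DOMAINS = ['checkout', 'search', 'onboarding', 'messaging', 'billing'] as const;

// One project per domain, each pointing at its own folder.
function buildProjects(domains: readonly string[], root = './tests'): DomainProject[] {
  return domains.map((domain) => ({ name: domain, testDir: `${root}/${domain}` }));
}

const projects = buildProjects(DOMAINS);
```

The resulting array would be passed as the config's `projects`, and a CI matrix job for one domain would then run something like `playwright test --project=checkout`.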

Splitting a heavy domain by wall-clock, not by name

One domain will inevitably become the timeout risk — usually the one that owns slow,
media- or generation-heavy flows (capture → processing → artifact generation). When a
single domain dominates the run, split it into multiple projects backed by the same
folder, partitioned by spec file. For example:

<domain>-core — the fast UI specs.
<domain>-heavy — the slow media/processing/generation specs.
<domain>-endurance — an isolated long-running spec (more on this below).

The key lesson: balance the split by measured per-spec duration, not by what the names
suggest. The goal is jobs that finish in roughly equal wall-clock. Re-check the split
against a recent HTML report's per-spec timings and rebalance — grouping "by feeling"
leaves one job idle while the other is the bottleneck.

Any special launch flags these specs need are set on the projects that actually need them, not globally, so unrelated domains
aren't launched with flags they don't use.
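To rebalance a heavy domain from measured per-spec timings rather than by name, one option is a greedy "longest first" assignment. This is a sketch, not the author's tooling: the spec names and durations are made up, and the function names are hypothetical.

```typescript
interface SpecTiming {
  file: string;
  seconds: number;
}

interface Bucket {
  files: string[];
  totalSeconds: number;
}

// Greedy "longest first" balancing: sort specs by duration, then always
// add the next spec to whichever bucket is currently lightest.
function balanceSpecs(timings: SpecTiming[], bucketCount: number): Bucket[] {
  if (bucketCount < 1) throw new Error('bucketCount must be at least 1');
  const buckets: Bucket[] = Array.from(
    { length: bucketCount },
    (): Bucket => ({ files: [], totalSeconds: 0 }),
  );
  const sorted = [...timings].sort((a, b) => b.seconds - a.seconds);
  for (const spec of sorted) {
    const lightest = buckets.reduce((min, b) => (b.totalSeconds < min.totalSeconds ? b : min));
    lightest.files.push(spec.file);
    lightest.totalSeconds += spec.seconds;
  }
  return buckets;
}

// Made-up timings, standing in for numbers read off a recent report.
const timings: SpecTiming[] = [
  { file: 'capture.spec.ts', seconds: 240 },
  { file: 'processing.spec.ts', seconds: 180 },
  { file: 'artifact.spec.ts', seconds: 120 },
  { file: 'settings.spec.ts', seconds: 40 },
  { file: 'list-view.spec.ts', seconds: 20 },
];

const [jobA, jobB] = balanceSpecs(timings, 2);
```

With these sample numbers both jobs land at 300 seconds. Each bucket's file list could then become the `testMatch` of one project backed by the same folder.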

One subtle but important config decision: don't let `.env` clobber the CLI

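Load .env with override: false. That is dotenv's default, but setting it explicitly documents the intent and makes it harder for anyone to flip it later:

import dotenv from 'dotenv';

dotenv.config({ path: '../.env', override: false });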

The reasoning is worth internalizing because the failure mode is silent: with
override: true, a value you pass on the command line
(APP_ENVIRONMENT=production pnpm exec playwright test …) gets reverted to the .env
default before setup runs, so your "production" run quietly executes against staging.
override: false makes CLI-passed env vars win, with .env only supplying defaults
for what the caller didn't set. Caller intent should always beat ambient config.
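The precedence rule is easy to model without any dependency. The sketch below is not dotenv's actual code, only a small stand-in for its behavior; the function name and the API_TIMEOUT_MS key are hypothetical.

```typescript
type Env = Record<string, string | undefined>;

// Values from the file are applied to the environment, and `override`
// decides who wins when a key is already set by the caller.
function applyDotenv(env: Env, fileValues: Record<string, string>, override: boolean): Env {
  const result: Env = { ...env };
  for (const [key, value] of Object.entries(fileValues)) {
    if (override || result[key] === undefined) {
      result[key] = value;
    }
  }
  return result;
}

// Hypothetical .env contents and a caller-supplied variable.
const fromFile = { APP_ENVIRONMENT: 'staging', API_TIMEOUT_MS: '5000' };
const fromCli: Env = { APP_ENVIRONMENT: 'production' };

const safe = applyDotenv(fromCli, fromFile, false);     // caller's value survives
const clobbered = applyDotenv(fromCli, fromFile, true); // file value replaces it
```

With `override: false` the caller's `production` survives and the file only fills in the missing key; with `override: true` the same invocation ends up pointed at `staging`.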

2. Tag strategy: two independent axes

This is the part most teams under-invest in, and it's what makes the suite legible at
scale. Use two orthogonal tagging axes, both via Playwright's runtime tag attribute.

                 Axis 1 — Traceability            Axis 2 — Run tier
                 (WHICH test case?)               (WHEN does it run?)
                 ┌──────────────────┐             ┌────────────────────────┐
  one test ─────►│ @TC042           │  ── plus ──►│ @smoke                 │
                 │ (stable join key │             │ @production-validation │
                 │  to test-mgmt DB)│             │ @endurance / (none)    │
                 └──────────────────┘             └────────────────────────┘
            file renames don't break it       decides the pipeline it lands in

Axis 1 — Traceability: every test carries a stable test-case ID

Every test/describe carries a @TCxxx tag that matches a row in a test-management
system. This tag is the stable join key between the spec and the test-case record —
file renames and refactors don't break it.

test.describe('Complete checkout with saved card', { tag: ['@TC042'] }, () => { ... });

Why a runtime tag and not a string in the test title or a JSDoc comment?

Reporter output — Playwright's JSON reporter emits tags: [...] per test, so an automated reconciliation job can sync results back to the test-management system. JSDoc never reaches the reporter; title prefixes have to be parsed out of strings.
CLI filtering — --grep @TC042 runs exactly one case; --grep @smoke runs a tier.
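Tooling standard — test-management integrations for tools such as TestRail, Xray, Zephyr, and Qase generally key results on an ID carried in the test's metadata, so a runtime tag keeps you aligned with the ecosystem. Check your reporter's documentation for the exact field it reads.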
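A reconciliation job along these lines might walk the JSON report and collect a pass/fail result per test-case ID. The report shape below is an assumed subset (nested suites, specs carrying `tags` and `ok`), and whether tags keep their leading "@" can differ between Playwright versions, so the sketch tolerates both. All names here are hypothetical.

```typescript
// Assumed subset of the JSON reporter's output. Check this against the
// report your Playwright version actually emits before relying on it.
interface ReportSpec {
  title: string;
  ok: boolean;
  tags: string[];
}

interface ReportSuite {
  specs?: ReportSpec[];
  suites?: ReportSuite[];
}

// Walk the suite tree and record pass/fail per test-case ID.
// A test case counts as passed only if every spec tagged with it passed.
function resultsByTestCase(
  suites: ReportSuite[],
  results = new Map<string, boolean>(),
): Map<string, boolean> {
  for (const suite of suites) {
    for (const spec of suite.specs ?? []) {
      for (const rawTag of spec.tags) {
        const tag = rawTag.replace(/^@/, ''); // tolerate tags with or without the leading "@"
        if (!/^TC\d+$/.test(tag)) continue;   // skip tier tags such as "smoke"
        results.set(tag, (results.get(tag) ?? true) && spec.ok);
      }
    }
    resultsByTestCase(suite.suites ?? [], results);
  }
  return results;
}

// Hand-written sample standing in for a parsed report file.
const sample: ReportSuite[] = [
  {
    suites: [
      { specs: [{ title: 'Complete checkout with saved card', ok: true, tags: ['TC042', 'smoke'] }] },
      { specs: [{ title: 'connect, fetch, and push', ok: false, tags: ['@TC101', '@TC102'] }] },
    ],
  },
];

const outcome = resultsByTestCase(sample);
```

The resulting map is what a sync step would push back to the test-management system, keyed by the same ID the spec carries.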

Multi-TC tagging — only for sequential journeys. When several test cases are steps in
one journey that shares auth/setup/state (e.g. a third-party integration flow:
connect → fetch data → perform action → push result), tag the single test with all of
them and use test.step('TCxxx: …') so the report still attributes the failure to the
right step:

test('connect, fetch, and push to the external system', { tag: ['@TC101', '@TC102', '@TC103'] }, async ({ page }) => {
  await test.step('TC101: connect the integration', async () => { ... });
  await test.step('TC102: view connected details',  async () => { ... });
  await test.step('TC103: push a record',           async () => { ... });
});

The rule of thumb: can these scenarios run independently in any order against fresh
state? Yes → one test each. No, each depends on the previous step → one multi-TC
test. Splitting a dependent journey would mean paying for the auth flow, any remote
connection, and fixture setup once per step instead of once total.

Axis 2 — Run tier: which pipeline a test belongs to

Independent of its TC ID, each test opts into a run tier:

Tag	When it runs
`@smoke`	Every PR (a curated subset)
`@production-validation`	After each production release (more often as release cadence increases), fanned out per region
(no tier tag)	Full regression, nightly
`@endurance`	Its own dedicated scheduled workflow only
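One way to wire the tier tags into pipelines is a small mapping from tier to Playwright CLI filter arguments. This is a sketch under one stated assumption: nightly regression is taken to mean "everything except the endurance spec", expressed with `--grep-invert`. The type and function names are hypothetical.

```typescript
type Tier = 'smoke' | 'regression' | 'production-validation' | 'endurance';

// Map a tier to Playwright CLI filter arguments. Assumption: nightly
// regression runs everything except the endurance spec.
function filterArgs(tier: Tier): string[] {
  switch (tier) {
    case 'smoke':
      return ['--grep', '@smoke'];
    case 'production-validation':
      return ['--grep', '@production-validation'];
    case 'endurance':
      return ['--grep', '@endurance'];
    case 'regression':
      return ['--grep-invert', '@endurance'];
  }
}

const command = ['playwright', 'test', ...filterArgs('regression')].join(' ');
```

Keeping the mapping in one place means each workflow only names its tier, and the tag strings are never retyped across pipeline files.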

3. The smoke tier — fast, curated, every PR

Smoke is the always-on PR gate, and its design is deliberate:

Curated by QA, not by engineers. A spec is in smoke iff its row in the test-management system has the smoke box checked. To add/remove a spec, you flip the box first, then sync the @smoke tag. This keeps one team accountable for the smoke surface instead of it growing ad hoc.
Per-domain matrix. Smoke runs as a parallel matrix across feature domains; each job provisions only the account cohorts that domain needs, with 1 worker per job.
A wall-clock budget. Set a target (e.g. keep PR smoke under ~5 minutes P95). Because the domain jobs run in parallel, the budget is per-job, not the sum. The budget is the forcing function that keeps anyone from quietly adding a multi-minute spec to smoke.
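The budget only works as a forcing function if something checks it. A minimal sketch of a per-job P95 check, using the nearest-rank method and made-up durations (the function names and the sample data are hypothetical):

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// the samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function withinBudget(durationsSec: number[], budgetSec: number): boolean {
  return percentile(durationsSec, 95) <= budgetSec;
}

// Made-up durations (seconds) for one domain's smoke job over recent PRs.
const checkoutSmoke = [212, 198, 240, 225, 205, 231, 219, 260, 208, 215];
const BUDGET_SEC = 5 * 60;

const p95 = percentile(checkoutSmoke, 95);
const underBudget = withinBudget(checkoutSmoke, BUDGET_SEC);
```

Because the domain jobs run in parallel, the check is applied to each job's own history rather than to a sum across jobs.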

The endurance spec (a long-running, real-time flow that can take tens of minutes) is the
explicit counter-example: it cannot live in smoke or even nightly regression. It sits
in its own Playwright project that no general pipeline's domain
allowlist includes, run only by a dedicated low-frequency scheduled workflow.

The lesson: give genuinely outlier tests their own isolated lane so their slowness can
never block the merge queue.
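The isolation can be made mechanical with per-pipeline project allowlists. In this sketch the project names (with "media" standing in for the heavy domain) and the pipeline names are hypothetical:

```typescript
// Hypothetical project names for illustration.
const ALL_PROJECTS = ['checkout', 'search', 'media-core', 'media-heavy', 'media-endurance'];

// Each general pipeline names the projects it may run. The endurance
// project appears in none of them, only in its own workflow's list.
const ALLOWLISTS: Record<string, string[]> = {
  smoke: ['checkout', 'search', 'media-core'],
  regression: ['checkout', 'search', 'media-core', 'media-heavy'],
  endurance: ['media-endurance'],
};

// Turn a pipeline's allowlist into CLI arguments, rejecting unknown names
// so a typo cannot silently drop a project from the run.
function projectArgs(pipeline: string): string[] {
  const allowed = ALLOWLISTS[pipeline];
  if (!allowed) throw new Error(`unknown pipeline: ${pipeline}`);
  for (const project of allowed) {
    if (!ALL_PROJECTS.includes(project)) throw new Error(`unknown project: ${project}`);
  }
  return allowed.map((project) => `--project=${project}`);
}
```

Since a general pipeline only ever passes its own `--project` arguments, the endurance spec cannot leak into a merge-gating run even if someone tags it carelessly.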

4. Worker tuning — match the constraint, not the core count

Sensible defaults:

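// CI = 3 workers, local = 4 (safe ceiling for sequential IdP logins).
// Override with PLAYWRIGHT_WORKERS; a missing, non-numeric, or non-positive value falls back to the default.
const DEFAULT_WORKERS = process.env.CI ? 3 : 4;
const requestedWorkers = Number.parseInt(process.env.PLAYWRIGHT_WORKERS ?? '', 10);
export const NUM_WORKERS = requestedWorkers > 0 ? requestedWorkers : DEFAULT_WORKERS;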

The non-obvious lesson: worker count is bounded by the weakest shared dependency, not
by your CI runner's CPUs.

Common binding constraints are (a) the identity provider's
tolerance for near-simultaneous logins and (b) the capacity of the shared environment
under test. Cranking workers higher can produce more failures, not faster runs —
failures that masquerade as test flake but are really the backend or IdP saturating. Make
the worker count an env-driven dial (PLAYWRIGHT_WORKERS) so you can tune per environment
without code changes.

5. Production validation — multi-region, release-triggered, intentionally small

After each production release (and more often as release cadence increases, to catch
issues sooner), run a small, stable, curated set of critical flows against
production, fanned out across every geographic region, via a manually dispatched
workflow. Design principles:

Region = a region-pinned login. A user's region claim drives backend routing, so "run this spec against region X" is implemented as "log in with an X-region account." The workflow passes the correct per-region API base URL through to the runner so any admin/setup calls hit the right backend.
Static accounts, no provisioning in prod. Unlike lower environments (which dynamically provision throwaway accounts), production validation uses a fixed set of pre-created accounts stored as a secret, region-keyed. Dynamic provisioning is disabled in prod, and there's a defence-in-depth guard that refuses to call internal admin APIs against production. You do not let an E2E suite create or mutate data in production by accident.
Region-specific skips are explicit and gated. Where one region renders a different UI or has a known backend issue, the skip is gated on a region env var (inert everywhere except prod) with a comment pointing at the follow-up to remove it. Skips are visible and temporary, never silent.
QA owns the list. The prod-validation set is intentionally tiny and stable; engineers don't add to it without QA sign-off. playwright test --grep @production-validation --list is the source of truth for what's in it.
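The production guard and the region-keyed account lookup can both be small, boring functions. A sketch, assuming the secret is a JSON object keyed by region; the function names, account shape, and placeholder credentials are all hypothetical:

```typescript
interface Account {
  username: string;
  password: string;
}

// Guard for anything that provisions or mutates data: refuse outright
// when the target environment is production.
function assertNotProduction(environment: string, action: string): void {
  if (environment.trim().toLowerCase() === 'production') {
    throw new Error(`Refusing to ${action} against production`);
  }
}

// Look up the pre-created account for a region from a secret that holds
// a JSON object keyed by region (the shape is an assumption).
function accountForRegion(secretJson: string, region: string): Account {
  const accounts = JSON.parse(secretJson) as Record<string, Account | undefined>;
  const account = accounts[region];
  if (!account) throw new Error(`No static account configured for region "${region}"`);
  return account;
}

// Placeholder secret; real credentials would come from the CI secret store.
const secret = JSON.stringify({
  eu: { username: 'e2e-eu@example.test', password: 'placeholder' },
  us: { username: 'e2e-us@example.test', password: 'placeholder' },
});

const euAccount = accountForRegion(secret, 'eu');
```

Calling the guard at the top of every admin or provisioning helper, rather than once in setup, is what makes it defence in depth: a helper reused in a new context still refuses to run against production.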

6. The tiered run model, end to end

PR opened ─────────────► @smoke         (per-domain matrix, 1 worker, <5 min budget)
                              │
nightly ───────────────► full regression (everything without a tier tag)
                              │
after each release ────► @production-validation (multi-region fan-out, static accounts)

(separate lane) ───────► @endurance     (dedicated scheduled workflow, isolated project)

Each tier trades coverage for speed deliberately. PRs get fast, narrow feedback;
regression gets breadth overnight; production gets a small, high-confidence
critical-path check across regions.

What I'd tell another team starting this

Invest in the tag taxonomy before the suite is big. Two axes — a stable test-case ID for traceability, a run-tier tag for pipeline routing — pay for themselves the day you have more than ~50 tests.
Tune workers to the weakest shared dependency, and make it an env dial. The runner's core count is rarely the real ceiling.
Give outlier tests their own lane. One tens-of-minutes endurance test does not belong in any pipeline that gates a merge.
Treat the smoke list as a governed asset with a wall-clock budget and a single owner — otherwise it bloats until it's no longer "smoke."
Never let E2E mutate production by accident — disable provisioning, pin accounts, and add a guard that refuses admin calls against prod.
Make caller intent beat ambient config (dotenv override: false). The silent "ran against the wrong environment" bug is brutal to debug.
Skips must be explicit, gated, and commented with a path to removal — a silent skip is just lost coverage wearing a green check.

These are generic, reusable patterns for a large multi-domain E2E suite. Adapt the tier
names, region model, domain partitioning, and tooling to your own stack.
