
Demi Jiang

A Tiered Playwright E2E Strategy: From PR Smoke to Production Validation

A field write-up on a domain/feature-driven Playwright setup — the framework
configuration, the tag strategy that ties tests to a test-management system, and the
tiered run model (smoke on every PR → nightly regression → post-release production
validation). Tooling and infrastructure specifics are generalized so the
patterns are reusable anywhere.


Contents

  Context
  1. Framework configuration: a layered, project-partitioned setup
  2. Tag strategy: two independent axes
  3. The smoke tier
  4. Worker tuning
  5. Production validation
  6. The tiered run model, end to end
  What I'd tell another team starting this

At a glance — the run tiers

| Tier (tag) | Trigger | Scope | Workers | Goal |
|---|---|---|---|---|
| Smoke (@smoke) | Every PR | Curated subset, per-domain matrix | 1 / job | Fast merge-gate feedback (~5 min P95) |
| Regression (no tier tag) | Nightly | Everything | Many | Broad coverage overnight |
| Production validation (@production-validation) | After each production release (or more frequent releases) | Small, stable critical-path set, per region | Tuned | Catch prod-only issues early, across regions |
| Endurance (@endurance) | Dedicated schedule | One long-running spec, isolated project | 1 | Cover tens-of-minutes flows off the critical path |

Context

Picture a product large enough that its end-to-end suite spans several independent
feature domains (think: onboarding, checkout, search, messaging, billing, integrations)
and has to run against everything from a local dev build to production in multiple
geographic regions.

The hard part of E2E at this scale isn't writing tests — it's keeping them fast enough
to gate PRs, trustworthy enough that red means red, and traceable enough that a
failure maps to a known test case. Almost every decision below is in service of one of
those three.

Throughout, I'll use generic domain names (checkout, search, onboarding, …) as
stand-ins for whatever your product's feature areas happen to be.


1. Framework configuration: a layered, project-partitioned setup

A shared base config, thin per-app overrides

There's one base config that every app/package inherits (playwright.base.config),
and a thin top-level playwright.config.ts that spreads it and adds what's local. This
keeps cross-cutting settings (reporters, trace/screenshot-on-failure, timeouts) in one
place and lets each consumer override only what it needs.
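A minimal sketch of that layering, under stated assumptions: `SketchConfig` is a stand-in type so the snippet runs on its own (a real repo would use `defineConfig` from `@playwright/test`), and the timeout, `testDir`, and `baseURL` values are invented for illustration. The point it shows is that object spread is shallow, so nested blocks such as `use` need their own spread.

```typescript
// Stand-in for the few Playwright config fields used here (assumption: a real
// setup would get these types from '@playwright/test').
type SketchConfig = {
  testDir?: string;
  timeout?: number;
  reporter?: string;
  use?: { trace?: string; screenshot?: string; baseURL?: string };
};

// playwright.base.config: cross-cutting settings shared by every app.
const baseConfig: SketchConfig = {
  timeout: 60_000, // illustrative value
  reporter: 'html',
  use: { trace: 'retain-on-failure', screenshot: 'only-on-failure' },
};

// playwright.config.ts: spread the base, then add what is local.
// `use` is spread again because object spread is shallow: a bare
// `use: { baseURL }` would silently drop the shared trace/screenshot settings.
const appConfig: SketchConfig = {
  ...baseConfig,
  testDir: './e2e', // illustrative path
  use: { ...baseConfig.use, baseURL: 'http://localhost:3000' },
};
```

If a consumer forgets the inner spread, nothing fails loudly; the app just stops collecting traces on failure, which is usually noticed only when someone needs one.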

Projects = domain/feature partitions

Rather than one giant test pool, the suite is split into Playwright projects by feature
domain: checkout, search, onboarding, messaging, billing, and so on. Each project points
at its own testDir. This buys two things:

  1. CI parallelism — each domain runs as its own CI job, in parallel.
  2. Ownership routing — when a domain's job goes red, it routes to the team that owns it, not to a shared "the E2E suite is broken" alert.
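One way to keep the project list and the CI matrix in step is to derive both from a single domain list. The sketch below assumes a `./tests/<domain>` folder layout and a helper named `domainProjects`; both are hypothetical, not something the setup above prescribes.

```typescript
// Shape of the entries that would go into the `projects` array of the config.
type DomainProject = { name: string; testDir: string };

// Single source of truth for feature domains (stand-in names).
const DOMAINS = ['checkout', 'search', 'onboarding', 'messaging', 'billing'] as const;

// Hypothetical helper: one project per domain, each with its own testDir.
function domainProjects(domains: readonly string[], root = './tests'): DomainProject[] {
  return domains.map((name) => ({ name, testDir: `${root}/${name}` }));
}

const projects = domainProjects(DOMAINS);
```

Each CI job can then select its own slice with `playwright test --project=<name>`, so a red job already names the owning domain.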

Splitting a heavy domain by wall-clock, not by name

One domain will inevitably become the timeout risk — usually the one that owns slow,
media- or generation-heavy flows (capture → processing → artifact generation). When a
single domain dominates the run, split it into multiple projects backed by the same
folder, partitioned by spec file. For example:

  • <domain>-core — the fast UI specs.
  • <domain>-heavy — the slow media/processing/generation specs.
  • <domain>-endurance — an isolated long-running spec (more on this below).

The key lesson: balance the split by measured per-spec duration, not by what the names
suggest. The goal is jobs that finish in roughly equal wall-clock time. Re-check the split
against a recent HTML report's per-spec timings and rebalance; grouping "by feeling"
leaves one job idle while the other is the bottleneck.
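Rebalancing by measured duration can be automated with a simple greedy pass. This is a sketch using the longest-first heuristic (assign each spec to whichever job is currently lightest); it is one reasonable approach rather than the method used in the setup described here, and the spec names and timings are invented.

```typescript
type SpecTiming = { file: string; seconds: number };
type Bucket = { specs: string[]; totalSeconds: number };

// Greedy longest-first partitioning: sort specs by duration, then always hand
// the next spec to the bucket with the smallest running total.
function balanceByDuration(specs: SpecTiming[], bucketCount: number): Bucket[] {
  if (bucketCount < 1) throw new Error('bucketCount must be at least 1');
  const buckets: Bucket[] = Array.from(
    { length: bucketCount },
    (): Bucket => ({ specs: [], totalSeconds: 0 }),
  );
  const longestFirst = [...specs].sort((a, b) => b.seconds - a.seconds);
  for (const spec of longestFirst) {
    const lightest = buckets.reduce((min, b) => (b.totalSeconds < min.totalSeconds ? b : min));
    lightest.specs.push(spec.file);
    lightest.totalSeconds += spec.seconds;
  }
  return buckets;
}

// Hypothetical per-spec timings copied from a recent report.
const timings: SpecTiming[] = [
  { file: 'capture.spec.ts', seconds: 240 },
  { file: 'generate.spec.ts', seconds: 180 },
  { file: 'upload.spec.ts', seconds: 120 },
  { file: 'settings.spec.ts', seconds: 60 },
  { file: 'list.spec.ts', seconds: 45 },
];
const split = balanceByDuration(timings, 2);
// split[0].totalSeconds is 345, split[1].totalSeconds is 300
```

The heuristic is not optimal in general, but it tends to land close enough that the remaining gap is smaller than normal run-to-run noise.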

Browser launch options where the product needs them

Specs that exercise device capture need a fake media stream so Chromium can "see/hear"
mocked input without real hardware:

launchOptions: {
  args: [
    '--use-fake-device-for-media-stream',
    '--auto-accept-camera-and-microphone-capture',
    // For a streamed media mock that must autoplay without a user gesture:
    '--autoplay-policy=no-user-gesture-required',
  ],
}

These live on the projects that actually need them — not globally — so unrelated domains
aren't launched with flags they don't use.

One subtle but important config decision: don't let .env clobber the CLI

Load .env with override: false:

dotenv.config({ path: '../.env', override: false });

The reasoning is worth internalizing because the failure mode is silent: with
override: true, a value you pass on the command line
(APP_ENVIRONMENT=production pnpm exec playwright test …) gets reverted to the .env
default before setup runs, so your "production" run quietly executes against staging.
override: false makes CLI-passed env vars win, with .env only supplying defaults
for what the caller didn't set. Caller intent should always beat ambient config.
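The difference between the two modes is easy to see in a toy model. The function below is a simplified imitation of the merge a .env loader performs, not dotenv's actual source; `mergeEnv` and `API_TIMEOUT_MS` are hypothetical names used only for this illustration.

```typescript
type Env = Record<string, string | undefined>;

// Simplified model: copy file values into the environment, either always
// (override = true) or only where the caller has not set the key already.
function mergeEnv(processEnv: Env, fileValues: Record<string, string>, override: boolean): Env {
  const merged: Env = { ...processEnv };
  for (const [key, value] of Object.entries(fileValues)) {
    if (override || merged[key] === undefined) merged[key] = value;
  }
  return merged;
}

const fromCli: Env = { APP_ENVIRONMENT: 'production' }; // what the caller typed
const fromDotEnv = { APP_ENVIRONMENT: 'staging', API_TIMEOUT_MS: '5000' }; // .env defaults

const safe = mergeEnv(fromCli, fromDotEnv, false);
// safe.APP_ENVIRONMENT is 'production'; API_TIMEOUT_MS is filled in from .env

const clobbered = mergeEnv(fromCli, fromDotEnv, true);
// clobbered.APP_ENVIRONMENT is 'staging': the CLI value was silently replaced
```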


2. Tag strategy: two independent axes

This is the part most teams under-invest in, and it's what makes the suite legible at
scale. Use two orthogonal tagging axes, both via Playwright's runtime tag attribute.

                 Axis 1 — Traceability            Axis 2 — Run tier
                 (WHICH test case?)               (WHEN does it run?)
                 ┌──────────────────┐             ┌────────────────────────┐
  one test ─────►│ @TC042           │  ── plus ──►│ @smoke                 │
                 │ (stable join key │             │ @production-validation │
                 │  to test-mgmt DB)│             │ @endurance / (none)    │
                 └──────────────────┘             └────────────────────────┘
            file renames don't break it       decides the pipeline it lands in

Axis 1 — Traceability: every test carries a stable test-case ID

Every test/describe carries a @TCxxx tag that matches a row in a test-management
system. This tag is the stable join key between the spec and the test-case record —
file renames and refactors don't break it.

test.describe('Complete checkout with saved card', { tag: ['@TC042'] }, () => { ... });

Why a runtime tag and not a string in the test title or a JSDoc comment?

  1. Reporter output: Playwright's JSON reporter emits tags: [...] per test, so an automated reconciliation job can sync results back to the test-management system. JSDoc never reaches the reporter; title prefixes have to be parsed out of strings.
  2. CLI filtering: --grep @TC042 runs exactly one case; --grep @smoke runs a tier.
  3. Tooling standard: TestRail / Xray / Zephyr / Qase reporters all consume the runtime tag attribute, so you're aligned with the ecosystem.
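As a rough illustration of the reconciliation step, the sketch below maps test-case IDs to a status. It assumes the report has already been flattened into a list of `ReportedTest` records (the real JSON report nests tests under suites and specs), and it accepts tags with or without the leading `@`; the type and function names are hypothetical.

```typescript
// Worst status wins when several tests carry the same test-case ID.
const SEVERITY = { skipped: 0, passed: 1, failed: 2 } as const;
type Status = keyof typeof SEVERITY;

// Assumed, simplified record: only the fields this sketch reads.
type ReportedTest = { title: string; tags: string[]; status: Status };

function resultsByTestCase(tests: ReportedTest[]): Map<string, Status> {
  const byCase = new Map<string, Status>();
  for (const test of tests) {
    for (const rawTag of test.tags) {
      const tag = rawTag.replace(/^@/, '');
      if (!/^TC\d+$/.test(tag)) continue; // run-tier tags are not case IDs
      const previous = byCase.get(tag);
      if (previous === undefined || SEVERITY[test.status] > SEVERITY[previous]) {
        byCase.set(tag, test.status);
      }
    }
  }
  return byCase;
}

const caseResults = resultsByTestCase([
  { title: 'Complete checkout with saved card', tags: ['@TC042', '@smoke'], status: 'passed' },
  { title: 'connect, fetch, and push', tags: ['TC101', 'TC102'], status: 'failed' },
]);
```

A sync job would then push each entry of `caseResults` to the matching row in the test-management system, keyed by the ID alone.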

Multi-TC tagging — only for sequential journeys. When several test cases are steps in
one journey that shares auth/setup/state (e.g. a third-party integration flow:
connect → fetch data → perform action → push result), tag the single test with all of
them and use test.step('TCxxx: …') so the report still attributes the failure to the
right step:

test(
  'connect, fetch, and push to the external system',
  { tag: ['@TC101', '@TC102', '@TC103'] }, // one tag per test case this journey covers
  async ({ page }) => {
    await test.step('TC101: connect the integration', async () => { ... });
    await test.step('TC102: view connected details',  async () => { ... });
    await test.step('TC103: push a record',           async () => { ... });
  },
);

The rule of thumb: can these scenarios run independently, in any order, against fresh
state? Yes → one test each. No, each depends on the previous step → one multi-TC test.
Splitting a dependent journey would mean paying for the auth flow, any remote
connection, and fixture setup once per step instead of once total.

Axis 2 — Run tier: which pipeline a test belongs to

Independent of its TC ID, each test opts into a run tier:

| Tag | When it runs |
|---|---|
| @smoke | Every PR (a curated subset) |
| @production-validation | After each production release (or more frequent releases), fanned out per region |
| (no tier tag) | Full regression, nightly |
| @endurance | Its own dedicated scheduled workflow only |
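The tier table translates fairly directly into CLI filters. The sketch below builds the filter arguments per tier, assuming regression means "every test that carries none of the tier tags"; `grepArgs` is a hypothetical helper, while `--grep` and `--grep-invert` are the Playwright CLI options it relies on.

```typescript
type Tier = 'smoke' | 'regression' | 'production-validation' | 'endurance';

// Every tag that opts a test into a dedicated tier.
const TIER_TAGS = ['@smoke', '@production-validation', '@endurance'];

// Hypothetical helper: returns the filter arguments to append to `playwright test`.
function grepArgs(tier: Tier): string[] {
  if (tier === 'regression') {
    // Assumption: regression is defined by the absence of any tier tag.
    return ['--grep-invert', TIER_TAGS.join('|')];
  }
  return ['--grep', `@${tier}`];
}
```

Keeping the tag list in one constant means a newly introduced tier is excluded from regression as soon as it is added there, rather than whenever someone remembers to update a workflow file.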

3. The smoke tier — fast, curated, every PR

Smoke is the always-on PR gate, and its design is deliberate:

  • Curated by QA, not by engineers. A spec is in smoke iff its row in the test-management system has the smoke box checked. To add/remove a spec, you flip the box first, then sync the @smoke tag. This keeps one team accountable for the smoke surface instead of it growing ad hoc.
  • Per-domain matrix. Smoke runs as a parallel matrix across feature domains; each job provisions only the account cohorts that domain needs, with 1 worker per job.
  • A wall-clock budget. Set a target (e.g. keep PR smoke under ~5 minutes P95). Because the domain jobs run in parallel, the budget is per-job, not the sum. The budget is the forcing function that keeps anyone from quietly adding a multi-minute spec to smoke.
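The per-job budget can be checked mechanically. The sketch below computes a nearest-rank P95 over recent durations for each domain job and lists the jobs that exceed the budget; the sample durations and the 300-second budget are made up, and how durations are collected is left out.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// The budget applies to each parallel job on its own, not to the sum of jobs.
function overBudget(durationsByJob: Record<string, number[]>, budgetSeconds: number): string[] {
  return Object.entries(durationsByJob)
    .filter(([, samples]) => percentile(samples, 95) > budgetSeconds)
    .map(([job]) => job);
}

// Hypothetical job durations in seconds from the last ten PR runs.
const recentRuns = {
  checkout: [200, 210, 220, 230, 240, 250, 260, 270, 280, 290],
  search: [180, 190, 200, 210, 220, 230, 240, 250, 260, 420],
};
const offenders = overBudget(recentRuns, 300);
// offenders is ['search']
```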

The endurance spec (a long-running, real-time flow that can take tens of minutes) is the
explicit counter-example: it cannot live in smoke or even nightly regression. It sits
in its own Playwright project that no general pipeline's domain
allowlist includes, run only by a dedicated low-frequency scheduled workflow.

The lesson: give genuinely outlier tests their own isolated lane so their slowness can
never block the merge queue.


4. Worker tuning — match the constraint, not the core count

Sensible defaults:

// CI = 3 workers, local = 4 (safe ceiling for sequential IdP logins).
// Override with PLAYWRIGHT_WORKERS.
export const NUM_WORKERS = process.env.PLAYWRIGHT_WORKERS
  ? parseInt(process.env.PLAYWRIGHT_WORKERS, 10)
  : process.env.CI ? 3 : 4;

The non-obvious lesson: worker count is bounded by the weakest shared dependency, not
by your CI runner's CPUs.

Common binding constraints are (a) the identity provider's
tolerance for near-simultaneous logins and (b) the capacity of the shared environment
under test. Cranking workers higher can produce more failures, not faster runs —
failures that masquerade as test flake but are really the backend or IdP saturating. Make
the worker count an env-driven dial (PLAYWRIGHT_WORKERS) so you can tune per environment
without code changes.
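Since the dial is set by hand per environment, it may be worth validating it. This is a hedged variant of the default shown earlier, wrapped in a hypothetical `resolveWorkers` function that falls back to the defaults when PLAYWRIGHT_WORKERS is missing, non-numeric, or below 1.

```typescript
// Hypothetical helper: same defaults as above (CI = 3, local = 4), plus input validation.
function resolveWorkers(env: Record<string, string | undefined>): number {
  const fallback = env.CI ? 3 : 4;
  const requested = Number.parseInt(env.PLAYWRIGHT_WORKERS ?? '', 10);
  // parseInt yields NaN for '' or 'abc'; treat that and values below 1 as "not set".
  if (!Number.isInteger(requested) || requested < 1) return fallback;
  return requested;
}
```

In a config file this would be called as `resolveWorkers(process.env)`. Without the check, a typo in a workflow variable turns into NaN and the resulting failure points nowhere near the real cause.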


5. Production validation — multi-region, release-triggered, intentionally small

After each production release — and, as cadence increases, on more frequent releases to
catch issues sooner — run a small, stable, curated set of critical flows against
production, fanned out across every geographic region, via a manually dispatched
workflow. Design principles:

  • Region = a region-pinned login. A user's region claim drives backend routing, so "run this spec against region X" is implemented as "log in with an X-region account." The workflow passes the correct per-region API base URL through to the runner so any admin/setup calls hit the right backend.
  • Static accounts, no provisioning in prod. Unlike lower environments (which dynamically provision throwaway accounts), production validation uses a fixed set of pre-created accounts stored as a secret, region-keyed. Dynamic provisioning is disabled in prod, and there's a defence-in-depth guard that refuses to call internal admin APIs against production. You do not let an E2E suite create or mutate data in production by accident.
  • Region-specific skips are explicit and gated. Where one region renders a different UI or has a known backend issue, the skip is gated on a region env var (inert everywhere except prod) with a comment pointing at the follow-up to remove it. Skips are visible and temporary, never silent.
  • QA owns the list. The prod-validation set is intentionally tiny and stable; engineers don't add to it without QA sign-off. playwright test --grep @production-validation --list is the source of truth for what's in it.
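Two of the principles above, the admin-call guard and the region-keyed static accounts, are small enough to sketch. Everything named here is an assumption made for illustration: the helper names, the JSON shape of the secret, and the idea that the environment name arrives as a plain string such as the APP_ENVIRONMENT value.

```typescript
type Account = { username: string; password: string };

// Defence-in-depth guard: call it at the top of any helper that provisions
// accounts or hits internal admin APIs.
function assertNotProduction(environment: string, action: string): void {
  if (environment.trim().toLowerCase() === 'production') {
    throw new Error(`Refusing to ${action} against production`);
  }
}

// Assumed secret shape: a JSON object keyed by region, e.g. {"eu": {...}, "us": {...}}.
function accountForRegion(secretJson: string, region: string): Account {
  const accounts = JSON.parse(secretJson) as Record<string, Account>;
  const account = accounts[region];
  if (!account) throw new Error(`No static account configured for region "${region}"`);
  return account;
}
```

Failing loudly on an unknown region matters as much as the guard: a silent fallback to some default account would run the spec against the wrong region and still report green.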

6. The tiered run model, end to end

PR opened ─────────────► @smoke         (per-domain matrix, 1 worker, <5 min budget)
                              │
nightly ───────────────► full regression (everything without a tier tag)
                              │
after each release ────► @production-validation (multi-region fan-out, static accounts)

(separate lane) ───────► @endurance     (dedicated scheduled workflow, isolated project)

Each tier trades coverage for speed deliberately. PRs get fast, narrow feedback;
regression gets breadth overnight; production gets a small, high-confidence
critical-path check across regions.


What I'd tell another team starting this

  1. Invest in the tag taxonomy before the suite is big. Two axes — a stable test-case ID for traceability, a run-tier tag for pipeline routing — pay for themselves the day you have more than ~50 tests.
  2. Tune workers to the weakest shared dependency, and make it an env dial. The runner's core count is rarely the real ceiling.
  3. Give outlier tests their own lane. One tens-of-minutes endurance test does not belong in any pipeline that gates a merge.
  4. Treat the smoke list as a governed asset with a wall-clock budget and a single owner — otherwise it bloats until it's no longer "smoke."
  5. Never let E2E mutate production by accident — disable provisioning, pin accounts, and add a guard that refuses admin calls against prod.
  6. Make caller intent beat ambient config (dotenv override: false). The silent "ran against the wrong environment" bug is brutal to debug.
  7. Skips must be explicit, gated, and commented with a path to removal — a silent skip is just lost coverage wearing a green check.

These are generic, reusable patterns for a large multi-domain E2E suite. Adapt the tier
names, region model, domain partitioning, and tooling to your own stack.
