A field write-up on a domain/feature-driven Playwright setup — the framework
configuration, the tag strategy that ties tests to a test-management system, and the
tiered run model (smoke on every PR → nightly regression → post-release production
validation). Tooling and infrastructure specifics are generalized so the
patterns are reusable anywhere.
Contents
- Context
- Framework configuration: a layered, project-partitioned setup
- Tag strategy: two independent axes
- The smoke tier
- Worker tuning
- Production validation
- The tiered run model, end to end
- What I'd tell another team starting this
At a glance — the run tiers
| Tier (tag) | Trigger | Scope | Workers | Goal |
|---|---|---|---|---|
| Smoke (@smoke) | Every PR | Curated subset, per-domain matrix | 1 / job | Fast merge-gate feedback (~5 min P95) |
| Regression (no tier tag) | Nightly | Everything | Many | Broad coverage overnight |
| Production validation (@production-validation) | After each production release (or more frequent releases) | Small, stable critical-path set, per region | Tuned | Catch prod-only issues early, across regions |
| Endurance (@endurance) | Dedicated schedule | One long-running spec, isolated project | 1 | Cover tens-of-minutes flows off the critical path |
Context
Picture a product large enough that its end-to-end suite spans several independent
feature domains (think: onboarding, checkout, search, messaging, billing, integrations)
and has to run against everything from a local dev build to production in multiple
geographic regions.
The hard part of E2E at this scale isn't writing tests — it's keeping them fast enough
to gate PRs, trustworthy enough that red means red, and traceable enough that a
failure maps to a known test case. Almost every decision below is in service of one of
those three.
Throughout, I'll use generic domain names (checkout, search, onboarding, …) as
stand-ins for whatever your product's feature areas happen to be.
1. Framework configuration: a layered, project-partitioned setup
A shared base config, thin per-app overrides
There's one base config that every app/package inherits (playwright.base.config),
and a thin top-level playwright.config.ts that spreads it and adds what's local. This
keeps cross-cutting settings (reporters, trace/screenshot-on-failure, timeouts) in one
place and lets each consumer override only what it needs.
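A minimal sketch of the base-plus-override layering, using plain objects in place of the real config files. The option names (timeout, reporter, use.trace, use.screenshot, testDir) are real Playwright options; the SketchConfig type, the values, and the baseURL are assumptions for illustration. In a real repo each object would normally be wrapped in defineConfig from @playwright/test.

```typescript
// Hypothetical stand-in for the config shape; the real type comes from
// @playwright/test.
type SketchConfig = {
  timeout: number;
  reporter: Array<[string] | [string, Record<string, unknown>]>;
  use: { trace: string; screenshot: string; baseURL?: string };
  testDir?: string;
};

// playwright.base.config: cross-cutting settings live here, once.
const baseConfig: SketchConfig = {
  timeout: 60_000, // illustrative value
  reporter: [['html'], ['json', { outputFile: 'results.json' }]],
  use: { trace: 'retain-on-failure', screenshot: 'only-on-failure' },
};

// playwright.config.ts: spread the base, then add only what is local.
// Object spread is shallow, so the nested `use` needs its own spread or the
// base trace/screenshot settings would be silently dropped.
const appConfig: SketchConfig = {
  ...baseConfig,
  testDir: './tests',
  use: { ...baseConfig.use, baseURL: 'http://localhost:3000' },
};

console.log(appConfig.use.trace); // the base setting survives the override
```

The nested spread is the one detail worth copying: overriding use wholesale is an easy way to lose trace-on-failure in a single app without noticing.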
Projects = domain/feature partitions
Rather than one giant test pool, the suite is split into Playwright projects by feature
domain — checkout, search, onboarding, messaging, billing, and so on. Each
project points at its own testDir. This buys two things:
- CI parallelism — each domain runs as its own CI job, in parallel.
- Ownership routing — when a domain's job goes red, it routes to the team that owns it, not to a shared "the E2E suite is broken" alert.
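As a rough sketch of the partitioning, the project list can be generated from a list of domains. The domain names and the ./tests/<domain> folder layout below are assumptions; name and testDir are real per-project options in Playwright.

```typescript
// Hypothetical domain list; substitute your own feature areas.
const DOMAINS = ['checkout', 'search', 'onboarding', 'messaging', 'billing'] as const;

type DomainProject = { name: string; testDir: string };

// One project per domain, each pointing at its own folder.
function buildDomainProjects(domains: readonly string[]): DomainProject[] {
  return domains.map((name) => ({ name, testDir: `./tests/${name}` }));
}

const domainProjects = buildDomainProjects(DOMAINS);
console.log(domainProjects.map((p) => p.name).join(', '));
```

A CI matrix job can then run a single partition with playwright test --project=checkout, which is also what makes per-team alert routing straightforward.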
Splitting a heavy domain by wall-clock, not by name
One domain will inevitably become the timeout risk — usually the one that owns slow,
media- or generation-heavy flows (capture → processing → artifact generation). When a
single domain dominates the run, split it into multiple projects backed by the same
folder, partitioned by spec file. For example:
- <domain>-core: the fast UI specs.
- <domain>-heavy: the slow media/processing/generation specs.
- <domain>-endurance: an isolated long-running spec (more on this below).
The key lesson: balance the split by measured per-spec duration, not by what the names
suggest. The goal is jobs that finish in roughly equal wall-clock. Re-check the split
against a recent HTML report's per-spec timings and rebalance — grouping "by feeling"
leaves one job idle while the other is the bottleneck.
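One way to do that rebalancing is a greedy "longest first" pass over measured timings. This is a sketch of a simple heuristic, not an optimal packing, and the spec names and durations are made up for illustration.

```typescript
type SpecTiming = { file: string; seconds: number };

// Sort specs by measured duration, then always hand the next spec to the
// bucket that is currently lightest.
function balanceSpecs(specs: SpecTiming[], bucketCount: number): SpecTiming[][] {
  if (bucketCount < 1) throw new RangeError('bucketCount must be at least 1');
  const buckets: SpecTiming[][] = [];
  const totals: number[] = [];
  for (let i = 0; i < bucketCount; i++) {
    buckets.push([]);
    totals.push(0);
  }
  const sorted = [...specs].sort((a, b) => b.seconds - a.seconds);
  for (const spec of sorted) {
    const lightest = totals.indexOf(Math.min(...totals));
    buckets[lightest].push(spec);
    totals[lightest] += spec.seconds;
  }
  return buckets;
}

// Hypothetical timings, as if read from a recent HTML report.
const timings: SpecTiming[] = [
  { file: 'capture.spec.ts', seconds: 240 },
  { file: 'render.spec.ts', seconds: 180 },
  { file: 'export.spec.ts', seconds: 150 },
  { file: 'settings.spec.ts', seconds: 60 },
  { file: 'list.spec.ts', seconds: 30 },
];

const [jobA, jobB] = balanceSpecs(timings, 2);
const totalSeconds = (job: SpecTiming[]) => job.reduce((t, s) => t + s.seconds, 0);
console.log(totalSeconds(jobA), totalSeconds(jobB)); // 330 330
```

Each bucket's file list can then feed a project's testMatch, so the projects share one folder but run different spec files.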
Browser launch options where the product needs them
Specs that exercise device capture need a fake media stream so Chromium can "see/hear"
mocked input without real hardware:
launchOptions: {
  args: [
    '--use-fake-device-for-media-stream',
    '--auto-accept-camera-and-microphone-capture',
    // For a streamed media mock that must autoplay without a user gesture:
    '--autoplay-policy=no-user-gesture-required',
  ],
}
These live on the projects that actually need them — not globally — so unrelated domains
aren't launched with flags they don't use.
One subtle but important config decision: don't let .env clobber the CLI
Load .env with override: false:
dotenv.config({ path: '../.env', override: false });
The reasoning is worth internalizing because the failure mode is silent: with
override: true, a value you pass on the command line
(APP_ENVIRONMENT=production pnpm exec playwright test …) gets reverted to the .env
default before setup runs, so your "production" run quietly executes against staging.
override: false makes CLI-passed env vars win, with .env only supplying defaults
for what the caller didn't set. Caller intent should always beat ambient config.
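The precedence rule can be illustrated without dotenv itself. The function below only mimics the override behaviour described above and is not dotenv's implementation; the API_TIMEOUT_MS key is a made-up example.

```typescript
type EnvMap = Record<string, string | undefined>;

// `fileValues` stands in for the parsed contents of .env.
function applyDotenvDefaults(
  env: EnvMap,
  fileValues: Record<string, string>,
  override: boolean,
): EnvMap {
  const result: EnvMap = { ...env };
  for (const key of Object.keys(fileValues)) {
    // With override=false, a variable the caller already set is left alone.
    if (override || result[key] === undefined) {
      result[key] = fileValues[key];
    }
  }
  return result;
}

const cliEnv: EnvMap = { APP_ENVIRONMENT: 'production' }; // set on the command line
const dotenvFile = { APP_ENVIRONMENT: 'staging', API_TIMEOUT_MS: '5000' }; // hypothetical .env

console.log(applyDotenvDefaults(cliEnv, dotenvFile, false).APP_ENVIRONMENT); // production
console.log(applyDotenvDefaults(cliEnv, dotenvFile, true).APP_ENVIRONMENT); // staging
```

In both modes the .env file still supplies API_TIMEOUT_MS, because the caller never set it; only the conflict case differs.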
2. Tag strategy: two independent axes
This is the part most teams under-invest in, and it's what makes the suite legible at
scale. Use two orthogonal tagging axes, both via Playwright's runtime tag attribute.
               Axis 1: Traceability            Axis 2: Run tier
               (WHICH test case?)              (WHEN does it run?)
               ┌──────────────────┐            ┌────────────────────────┐
one test ─────►│ @TC042           │ ── plus ──►│ @smoke                 │
               │ (stable join key │            │ @production-validation │
               │  to test-mgmt DB)│            │ @endurance / (none)    │
               └──────────────────┘            └────────────────────────┘
               file renames don't break it     decides the pipeline it lands in
Axis 1 — Traceability: every test carries a stable test-case ID
Every test/describe carries a @TCxxx tag that matches a row in a test-management
system. This tag is the stable join key between the spec and the test-case record —
file renames and refactors don't break it.
test.describe('Complete checkout with saved card', { tag: ['@TC042'] }, () => { ... });
Why a runtime tag and not a string in the test title or a JSDoc comment?
- Reporter output: Playwright's JSON reporter emits tags: [...] per test, so an automated reconciliation job can sync results back to the test-management system. JSDoc never reaches the reporter; title prefixes have to be parsed out of strings.
- CLI filtering: --grep @TC042 runs exactly one case; --grep @smoke runs a tier.
- Tooling standard: TestRail / Xray / Zephyr / Qase reporters all consume the runtime tag attribute, so you're aligned with the ecosystem.
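A reconciliation job along those lines could start from something like the sketch below. The report shape is a simplified assumption (nested suites whose specs carry tags and an ok flag), so check it against the JSON your Playwright version actually writes; the sample data is invented.

```typescript
// Assumed, simplified slice of the JSON report.
type ReportSpec = { title: string; ok: boolean; tags: string[] };
type ReportSuite = { title: string; specs?: ReportSpec[]; suites?: ReportSuite[] };

type CaseResult = { testCaseId: string; passed: boolean };

// Walk the suite tree and emit one result per TC tag found on a spec.
function collectCaseResults(suites: ReportSuite[]): CaseResult[] {
  const results: CaseResult[] = [];
  for (const suite of suites) {
    for (const spec of suite.specs ?? []) {
      for (const tag of spec.tags) {
        const id = tag.replace(/^@/, ''); // tolerate tags with or without "@"
        if (/^TC\d+$/.test(id)) {
          results.push({ testCaseId: id, passed: spec.ok });
        }
      }
    }
    results.push(...collectCaseResults(suite.suites ?? []));
  }
  return results;
}

// Hypothetical report fragment.
const sampleSuites: ReportSuite[] = [
  {
    title: 'checkout.spec.ts',
    specs: [{ title: 'guest checkout', ok: false, tags: ['@TC043'] }],
    suites: [
      {
        title: 'Complete checkout with saved card',
        specs: [{ title: 'pays with saved card', ok: true, tags: ['TC042', 'smoke'] }],
      },
    ],
  },
];

console.log(collectCaseResults(sampleSuites));
```

Tier tags such as smoke are ignored here on purpose: only the traceability axis is synced back, and the output is keyed by test-case ID rather than by file path or title.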
Multi-TC tagging — only for sequential journeys. When several test cases are steps in
one journey that shares auth/setup/state (e.g. a third-party integration flow:
connect → fetch data → perform action → push result), tag the single test with all of
them and use test.step('TCxxx: …') so the report still attributes the failure to the
right step:
test('connect, fetch, and push to the external system',
  { tag: ['@TC101', '@TC102', '@TC103'] }, // one tag per test case in the journey
  async ({ page }) => {
    await test.step('TC101: connect the integration', async () => { ... });
    await test.step('TC102: view connected details', async () => { ... });
    await test.step('TC103: push a record', async () => { ... });
  });
The rule of thumb: can these scenarios run independently in any order against fresh
state? Yes → one test each. No, each depends on the previous step → one multi-TC
test. Splitting a dependent journey would mean paying for the auth flow, any remote
connection, and fixture setup once per step instead of once total.
Axis 2 — Run tier: which pipeline a test belongs to
Independent of its TC ID, each test opts into a run tier:
| Tag | When it runs |
|---|---|
| @smoke | Every PR (a curated subset) |
| @production-validation | After each production release (or more frequent releases), fanned out per region |
| (no tier tag) | Full regression, nightly |
| @endurance | Its own dedicated scheduled workflow only |
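To keep the two axes honest as the suite grows, a small lint over each test's tags can help. This is a sketch under assumptions: the @TC followed by digits format is inferred from the examples above, and treating any other tag as unknown is a policy choice you may not want.

```typescript
const TIER_TAGS = ['@smoke', '@production-validation', '@endurance'];

type TagCheck = { testCaseIds: string[]; tiers: string[]; problems: string[] };

// Split a test's tags into the two axes and report anything that fits neither.
// An empty `tiers` list means the test runs in nightly regression only.
function checkTags(tags: string[]): TagCheck {
  const testCaseIds = tags.filter((t) => /^@TC\d+$/.test(t));
  const tiers = tags.filter((t) => TIER_TAGS.indexOf(t) !== -1);
  const problems: string[] = [];
  if (testCaseIds.length === 0) problems.push('missing @TCxxx tag');
  for (const t of tags) {
    if (testCaseIds.indexOf(t) === -1 && tiers.indexOf(t) === -1) {
      problems.push(`unknown tag ${t}`);
    }
  }
  return { testCaseIds, tiers, problems };
}

console.log(checkTags(['@TC042', '@smoke']).problems); // []
console.log(checkTags(['@smok']).problems); // missing TC tag, plus the typo
```

Catching a typo like @smok matters because a misspelled tier tag fails silently: the test simply never runs in the pipeline it was meant for.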
3. The smoke tier — fast, curated, every PR
Smoke is the always-on PR gate, and its design is deliberate:
- Curated by QA, not by engineers. A spec is in smoke iff its row in the test-management system has the smoke box checked. To add/remove a spec, you flip the box first, then sync the @smoke tag. This keeps one team accountable for the smoke surface instead of it growing ad hoc.
- Per-domain matrix. Smoke runs as a parallel matrix across feature domains; each job provisions only the account cohorts that domain needs, with 1 worker per job.
- A wall-clock budget. Set a target (e.g. keep PR smoke under ~5 minutes P95). Because the domain jobs run in parallel, the budget is per-job, not the sum. The budget is the forcing function that keeps anyone from quietly adding a multi-minute spec to smoke.
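The "flip the box first, then sync the tag" rule is easy to check mechanically. A minimal sketch, assuming you can obtain two lists of test-case IDs: one exported from the test-management system and one derived from the code (for example from a JSON report or a --grep @smoke --list run). The IDs below are invented.

```typescript
type SmokeDrift = { missingTag: string[]; extraTag: string[] };

// Compare the QA-owned smoke list with what the specs are actually tagged.
function diffSmoke(checkedInTestMgmt: string[], taggedInCode: string[]): SmokeDrift {
  return {
    // Box is checked, but no spec carries @smoke for this case yet.
    missingTag: checkedInTestMgmt.filter((id) => taggedInCode.indexOf(id) === -1).sort(),
    // Spec carries @smoke, but QA never checked the box.
    extraTag: taggedInCode.filter((id) => checkedInTestMgmt.indexOf(id) === -1).sort(),
  };
}

const drift = diffSmoke(['TC042', 'TC007', 'TC101'], ['TC042', 'TC101', 'TC250']);
console.log(drift.missingTag, drift.extraTag);
```

Failing the build on a non-empty extraTag list is one way to stop the smoke surface from growing without QA's say-so.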
The endurance spec (a long-running, real-time flow that can take tens of minutes) is the
explicit counter-example: it cannot live in smoke or even nightly regression. It sits
in its own Playwright project that no general pipeline's domain
allowlist includes, run only by a dedicated low-frequency scheduled workflow.
The lesson: give genuinely outlier tests their own isolated lane so their slowness can
never block the merge queue.
4. Worker tuning — match the constraint, not the core count
Sensible defaults:
// CI = 3 workers, local = 4 (safe ceiling for sequential IdP logins).
// Override with PLAYWRIGHT_WORKERS.
export const NUM_WORKERS = process.env.PLAYWRIGHT_WORKERS
  ? parseInt(process.env.PLAYWRIGHT_WORKERS, 10)
  : process.env.CI ? 3 : 4;
The non-obvious lesson: worker count is bounded by the weakest shared dependency, not
by your CI runner's CPUs.
Common binding constraints are (a) the identity provider's
tolerance for near-simultaneous logins and (b) the capacity of the shared environment
under test. Cranking workers higher can produce more failures, not faster runs —
failures that masquerade as test flake but are really the backend or IdP saturating. Make
the worker count an env-driven dial (PLAYWRIGHT_WORKERS) so you can tune per environment
without code changes.
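If the dial is going to be turned by hand in CI settings, it may be worth validating it. The sketch below is a stricter variant of the same idea, not the original code: it takes the environment as a parameter so it can be tested, and it rejects values that would otherwise parse to NaN.

```typescript
type EnvVars = Record<string, string | undefined>;

const CI_DEFAULT = 3;
const LOCAL_DEFAULT = 4;

// Resolve the worker count from an env-like object. Falls back to the
// defaults when PLAYWRIGHT_WORKERS is unset or empty, and throws on values
// that are not a positive integer.
function resolveWorkers(env: EnvVars): number {
  const raw = env.PLAYWRIGHT_WORKERS;
  if (raw === undefined || raw.trim() === '') {
    return env.CI ? CI_DEFAULT : LOCAL_DEFAULT;
  }
  const parsed = parseInt(raw, 10);
  if (isNaN(parsed) || parsed < 1) {
    throw new Error(`PLAYWRIGHT_WORKERS must be a positive integer, got "${raw}"`);
  }
  return parsed;
}

console.log(resolveWorkers({ CI: 'true' })); // 3
console.log(resolveWorkers({})); // 4
console.log(resolveWorkers({ CI: 'true', PLAYWRIGHT_WORKERS: '1' })); // 1
```

In a config file this would be called as resolveWorkers(process.env), keeping the per-environment tuning out of the code.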
5. Production validation — multi-region, release-triggered, intentionally small
After each production release (and more often as release cadence increases, to catch
issues sooner), run a small, stable, curated set of critical flows against
production, fanned out across every geographic region, via a manually dispatched
workflow. Design principles:
- Region = a region-pinned login. A user's region claim drives backend routing, so "run this spec against region X" is implemented as "log in with an X-region account." The workflow passes the correct per-region API base URL through to the runner so any admin/setup calls hit the right backend.
- Static accounts, no provisioning in prod. Unlike lower environments (which dynamically provision throwaway accounts), production validation uses a fixed set of pre-created accounts stored as a secret, region-keyed. Dynamic provisioning is disabled in prod, and there's a defence-in-depth guard that refuses to call internal admin APIs against production. You do not let an E2E suite create or mutate data in production by accident.
- Region-specific skips are explicit and gated. Where one region renders a different UI or has a known backend issue, the skip is gated on a region env var (inert everywhere except prod) with a comment pointing at the follow-up to remove it. Skips are visible and temporary, never silent.
- QA owns the list. The prod-validation set is intentionally tiny and stable; engineers don't add to it without QA sign-off. playwright test --grep @production-validation --list is the source of truth for what's in it.
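The guard, the region-keyed accounts, and the gated skip could look roughly like this. Everything named here is an assumption for illustration: the "production" value of APP_ENVIRONMENT, the PROD_ACCOUNTS_JSON secret and its layout, and the TEST_REGION variable.

```typescript
type RunEnv = Record<string, string | undefined>;
type Account = { username: string; password: string };

// Defence-in-depth guard: call this at the top of every helper that talks
// to internal admin APIs.
function assertAdminCallsAllowed(env: RunEnv): void {
  if (env.APP_ENVIRONMENT === 'production') {
    throw new Error('Refusing to call internal admin APIs against production');
  }
}

// Look up the pre-created account for a region from a region-keyed secret.
function accountForRegion(env: RunEnv, region: string): Account {
  const accounts: Record<string, Account> = JSON.parse(env.PROD_ACCOUNTS_JSON ?? '{}');
  const account = accounts[region];
  if (!account) {
    throw new Error(`No static account configured for region "${region}"`);
  }
  return account;
}

// Region-gated skip condition: false whenever TEST_REGION is unset, so it is
// inert outside the production workflow that sets it.
function shouldSkipInRegion(env: RunEnv, affectedRegion: string): boolean {
  return env.TEST_REGION === affectedRegion;
}
```

In a spec, the skip would typically be wired up with Playwright's test.skip(condition, description), with the description naming the follow-up ticket so the skip stays visible in the report.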
6. The tiered run model, end to end
PR opened ─────────────► @smoke (per-domain matrix, 1 worker, <5 min budget)
│
nightly ───────────────► full regression (everything without a tier tag)
│
after each release ────► @production-validation (multi-region fan-out, static accounts)
(separate lane) ───────► @endurance (dedicated scheduled workflow, isolated project)
Each tier trades coverage for speed deliberately. PRs get fast, narrow feedback;
regression gets breadth overnight; production gets a small, high-confidence
critical-path check across regions.
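One way to keep the tier-to-pipeline mapping in a single place is a small function that builds the arguments passed to playwright test. The flags (--grep, --grep-invert, --project, --workers) are real CLI options; the media-endurance project name is hypothetical, and the regression case reads "no tier tag" literally. If your nightly run should also cover smoke-tagged specs, drop @smoke from the inverted pattern.

```typescript
type Tier = 'smoke' | 'regression' | 'production-validation' | 'endurance';

const TIER_TAG_PATTERN = '@smoke|@production-validation|@endurance';

// Build the arguments that follow `playwright test` for a given tier.
function cliArgsForTier(tier: Tier): string[] {
  switch (tier) {
    case 'smoke':
      // Each per-domain matrix job would also add its own --project.
      return ['--grep', '@smoke', '--workers', '1'];
    case 'regression':
      // Everything that carries no tier tag.
      return ['--grep-invert', TIER_TAG_PATTERN];
    case 'production-validation':
      return ['--grep', '@production-validation'];
    case 'endurance':
      // Isolated project that no general pipeline includes.
      return ['--project', 'media-endurance', '--grep', '@endurance'];
  }
}

console.log(['playwright', 'test', ...cliArgsForTier('smoke')].join(' '));
```

Keeping this mapping in code rather than scattered across workflow files makes it harder for two pipelines to disagree about what a tier means.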
What I'd tell another team starting this
- Invest in the tag taxonomy before the suite is big. Two axes — a stable test-case ID for traceability, a run-tier tag for pipeline routing — pay for themselves the day you have more than ~50 tests.
- Tune workers to the weakest shared dependency, and make it an env dial. The runner's core count is rarely the real ceiling.
- Give outlier tests their own lane. One tens-of-minutes endurance test does not belong in any pipeline that gates a merge.
- Treat the smoke list as a governed asset with a wall-clock budget and a single owner — otherwise it bloats until it's no longer "smoke."
- Never let E2E mutate production by accident — disable provisioning, pin accounts, and add a guard that refuses admin calls against prod.
- Make caller intent beat ambient config (dotenv override: false). The silent "ran against the wrong environment" bug is brutal to debug.
- Skips must be explicit, gated, and commented with a path to removal; a silent skip is just lost coverage wearing a green check.
These are generic, reusable patterns for a large multi-domain E2E suite. Adapt the tier
names, region model, domain partitioning, and tooling to your own stack.