A flaky test is a test that passes sometimes and fails sometimes — on the same code, with no changes. They're the most corrosive problem in a test suite because they turn your CI from a quality signal into noise.
Teams respond to flaky tests in predictable ways: first they rerun them, then they add retries, then they quarantine them, then they just stop looking at red CI. By the time a real regression ships, no one trusts the tests enough to catch it.
This guide covers the 7 root causes of flaky E2E tests and how to fix each one permanently — not with retries that hide the problem, but with changes that make the test reliable.
Quick Reference: 7 Causes of Flaky Tests
| # | Root Cause | Symptom | Fix |
|---|---|---|---|
| 1 | Timing / race conditions | "element not found" on CI | Replace `waitForTimeout` with condition-based waits |
| 2 | Brittle selectors | Breaks on CSS rename | Use `getByRole`, `getByTestId`, `getByLabel` |
| 3 | Shared test state | Fails in parallel, passes solo | Isolate data per test, reset state in `afterEach` |
| 4 | Environment instability | CI fails, local passes | Health checks, mock external APIs, raise timeouts |
| 5 | Animation interference | Random assertion failures | `reducedMotion: 'reduce'` in Playwright config |
| 6 | Parallelism conflicts | Fails with `--workers > 1` | Scope data to `workerIndex` |
| 7 | UI changes / locator drift | Breaks after refactors | Semantic selectors or intent-based self-healing |
Why Flaky Tests Are Worse Than No Tests
A test suite with 20% flakiness is worse than a smaller, reliable suite:
- False positives: CI fails on green code — developers learn to ignore it
- Investigation overhead: every failure requires triage
- Trust erosion: once trust breaks, it doesn't come back without deliberate effort
- Coverage rot: flaky tests get disabled, leaving real gaps behind
The Google Testing Blog has documented that even 1% flakiness in a large suite creates enough noise to meaningfully slow down development. At 10%+, teams functionally stop relying on CI.
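The compounding effect is easy to quantify. As a back-of-envelope sketch (the suite size and per-test flake rate below are illustrative, not measured figures): assuming independent failures, the chance a CI run goes red is 1 minus the chance that every test passes.

```typescript
// Probability that at least one test flakes in a single CI run,
// assuming independent failures across the suite.
function probRedRun(flakeRate: number, testCount: number): number {
  return 1 - Math.pow(1 - flakeRate, testCount);
}

// A 200-test suite where each test is only 1% flaky:
console.log(probRedRun(0.01, 200).toFixed(2)); // ≈ 0.87 — most runs fail spuriously
```

Even "only 1% flaky" per test means the suite as a whole cries wolf on the vast majority of runs.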
1. Timing and Race Conditions
Symptom: Test fails with "element not found" or "timeout" — sometimes. Usually on CI, rarely locally.
What not to do:
```ts
// Arbitrary sleeps are fragile and slow
await page.waitForTimeout(2000);
await page.click('#submit-btn');
```
Fix: Use explicit waits that respond to actual application state:
```ts
// Wait for the element to be visible and enabled
await page.waitForSelector('#submit-btn', { state: 'visible' });
await page.click('#submit-btn');

// Wait for network to settle
await page.click('#submit-btn');
await page.waitForLoadState('networkidle');

// Wait for a specific response
const [response] = await Promise.all([
  page.waitForResponse(r => r.url().includes('/api/submit') && r.status() === 200),
  page.click('#submit-btn'),
]);

// Wait for navigation
await Promise.all([
  page.waitForURL('**/dashboard'),
  page.click('#login-btn'),
]);
```
Set explicit timeouts in your Playwright config:
```ts
// playwright.config.ts
export default {
  timeout: 30000,
  expect: { timeout: 10000 },
  use: { actionTimeout: 10000 },
};
```
2. Brittle Selectors
Symptom: Test breaks after a UI change that didn't change behavior.
Fragile selectors:
```ts
// ❌ Breaks when class name changes
await page.click('.btn-primary-v2-active');

// ❌ Breaks when DOM restructures
await page.click('div > div:nth-child(3) > button');
```
Resilient selectors:
```ts
// ✅ ARIA role + name
await page.getByRole('button', { name: 'Sign In' }).click();

// ✅ Test ID — explicit contract
await page.getByTestId('submit-button').click();

// ✅ Label association
await page.getByLabel('Email address').fill('user@example.com');

// ✅ Visible text
await page.click('button:has-text("Sign In")');
```
Add data-testid attributes to key interactive elements as a team convention. This creates an explicit contract between tests and developers.
3. Shared or Leaked Test State
Symptom: Tests pass in isolation but fail when run together.
Fix: Make every test self-contained:
```ts
test.beforeEach(async ({ page }) => {
  const user = await createTestUser({ role: 'admin' });
  await loginAs(page, user);
});

test.afterEach(async () => {
  await cleanupTestUsers();
});
```
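Helpers like `createTestUser` and `cleanupTestUsers` above are assumed project utilities. A complementary tactic is to make test data unique by construction, so even a missed cleanup cannot cause a collision between tests. A minimal sketch (the helper name, prefix, and address format are illustrative conventions):

```typescript
import { randomUUID } from 'crypto';

// Generate collision-free test data so parallel tests never
// fight over the same user, even against a shared database.
function uniqueEmail(prefix = 'e2e'): string {
  return `${prefix}-${randomUUID()}@example.test`;
}

const first = uniqueEmail();
const second = uniqueEmail();
console.log(first !== second); // every call yields a distinct address
```

Unique-by-construction data turns "who cleaned up first?" races into a non-issue.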
For browser state:
```ts
// playwright.config.ts
export default {
  use: {
    storageState: undefined, // fresh context per test
  },
};
```
4. Environment and Network Instability
Symptom: Tests fail on CI but not locally.
Health check before tests:
```ts
// global-setup.ts
async function globalSetup() {
  const maxRetries = 10;
  for (let i = 0; i < maxRetries; i++) {
    try {
      const res = await fetch(process.env.BASE_URL + '/health');
      if (res.ok) return; // app is up
    } catch {
      // app not reachable yet — fall through and retry
    }
    if (i === maxRetries - 1) throw new Error('App did not start');
    await new Promise(r => setTimeout(r, 2000));
  }
}

export default globalSetup;
```
Mock external services that are rate-limited in CI:
```ts
await page.route('**/api.stripe.com/**', route =>
  route.fulfill({ status: 200, body: JSON.stringify({ status: 'succeeded' }) })
);
```
5. Animation and Transition Interference
Symptom: Test clicks an element that's animating in/out and gets wrong behavior.
Fix:
```ts
// playwright.config.ts
export default {
  use: {
    reducedMotion: 'reduce', // disable CSS animations
  },
};
```
Or inject a CSS override:
```ts
test.beforeEach(async ({ page }) => {
  await page.addStyleTag({
    content: `*, *::before, *::after {
      animation-duration: 0ms !important;
      transition-duration: 0ms !important;
    }`,
  });
});
```
6. Test Runner Parallelism Conflicts
Symptom: Tests pass with --workers=1 but fail with parallel execution.
Fix: Use unique data per worker:
```ts
test('create item', async ({ page }, testInfo) => {
  const userId = `test-user-${testInfo.workerIndex}`;
  // Each worker uses its own user, no conflicts
});
```
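The same pattern generalizes to any shared resource, not just user IDs. A sketch of deriving all per-worker names from the worker index (the specific naming scheme here is an illustrative convention, not a Playwright API):

```typescript
// Derive per-worker resource names deterministically, so each
// parallel worker gets its own user, database schema, and bucket.
function workerResources(workerIndex: number) {
  return {
    user: `test-user-${workerIndex}`,
    schema: `e2e_w${workerIndex}`,
    bucket: `uploads-w${workerIndex}`,
  };
}

console.log(workerResources(0).user);   // "test-user-0"
console.log(workerResources(3).schema); // "e2e_w3"
```

In a Playwright test, `testInfo.workerIndex` (or `testInfo.parallelIndex`) would feed this function, so scaling workers up or down never creates a naming collision.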
7. UI Changes Breaking Locators
This is the single largest driver of "tests as a maintenance burden."
Root cause: Tests are coupled to implementation details. Every locator-based test is a bet that the DOM won't change. That bet loses constantly in teams shipping fast.
Short-term fix: Migrate to semantic selectors (see #2). Add data-testid to critical elements.
Systematic fix: Shiplight's intent-cache-heal pattern eliminates this entire class of flakiness. Instead of CSS selectors, Shiplight stores the semantic intent of each step:
```yaml
# Survives CSS renames, refactors, layout changes
goal: Verify checkout flow
statements:
  - intent: Add item to cart
  - intent: Proceed to checkout
  - intent: Fill in shipping details
  - VERIFY: order confirmation is displayed
```
When a locator breaks, Shiplight's AI resolves the correct element from the live DOM. A developer renaming a CSS class doesn't break a single test.
How to Triage Flaky Tests at Scale
Step 1: Quarantine, don't delete
```ts
test.skip('checkout flow — flaky, tracked in TICKET-123', async ({ page }) => {
  // ...
});
```
Step 2: Add retries temporarily
```ts
export default {
  retries: process.env.CI ? 2 : 0,
};
```
Retries are symptom management, not a fix. Remove them once the root cause is fixed.
Step 3: Measure flakiness per test
```bash
npx playwright test --reporter=json > results.json
```
Filter the output for `"status": "flaky"` to get a ranked list.
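That filtering can be scripted. The sketch below walks the report JSON generically and collects the titles of entries marked flaky; it assumes the reporter emits a `status` field with the value `'flaky'` on tests that failed and then passed on retry, and the exact report shape may vary across Playwright versions:

```typescript
// Recursively walk a Playwright JSON report and collect the title
// of every entry marked flaky, so the worst offenders can be ranked.
function findFlaky(node: unknown, title = '', out: string[] = []): string[] {
  if (Array.isArray(node)) {
    for (const n of node) findFlaky(n, title, out);
  } else if (node && typeof node === 'object') {
    const o = node as Record<string, unknown>;
    // Carry the nearest enclosing title down to status-bearing entries
    const here = typeof o.title === 'string' && o.title ? o.title : title;
    if (o.status === 'flaky') out.push(here);
    for (const v of Object.values(o)) findFlaky(v, here, out);
  }
  return out;
}

// Usage (reading the file written by the command above):
// const report = JSON.parse(fs.readFileSync('results.json', 'utf8'));
// console.log(findFlaky(report));
```

Piping this into a counter over a week of CI runs gives the frequency ranking Step 4 needs.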
Step 4: Fix in order of frequency — the 20% of tests causing 80% of flakiness. Common culprits: auth flows, tests hitting external APIs, tests with waitForTimeout.
Preventing Flakiness in New Tests
- Never use `waitForTimeout` — always wait for a condition
- Always use semantic selectors — role, label, testid; never CSS classes or nth-child
- Create isolated test data per test, clean up after
- Test one thing per test — easier to debug
- Run tests locally with `--headed` before committing
Key Takeaways
- Retries hide flakiness, they don't fix it — treat them as temporary and track the root cause
- Timing issues are the #1 cause — replace `waitForTimeout` with condition-based waits
- Selectors should reflect user intent — role, label, testid; never CSS class or DOM position
- Test isolation is non-negotiable — shared state is a reliability time bomb
- UI changes cause chronic flakiness — Shiplight's self-healing resolves elements by intent, eliminating this entire category
References: Playwright documentation · Google Testing Blog · Shiplight self-healing tests