Shiplight

Posted on Apr 8 • Originally published at shiplight.ai

How to Fix Flaky Tests: Root Causes and Permanent Fixes

#playwright #testing #automation #cicd

A flaky test is one that sometimes passes and sometimes fails on the same code, with no changes. Flaky tests are the most corrosive problem in a test suite because they turn CI from a quality signal into noise.

Teams respond to flaky tests in predictable ways: first they rerun them, then they add retries, then they quarantine them, then they just stop looking at red CI. By the time a real regression ships, no one trusts the tests enough to catch it.

This guide covers the 7 root causes of flaky E2E tests and how to fix each one permanently — not with retries that hide the problem, but with changes that make the test reliable.

Quick Reference: 7 Causes of Flaky Tests

#	Root Cause	Symptom	Fix
1	Timing / race conditions	"element not found" on CI	Replace `waitForTimeout` with condition-based waits
2	Brittle selectors	Breaks on CSS rename	Use `getByRole`, `getByTestId`, `getByLabel`
3	Shared test state	Fails in parallel, passes solo	Isolate data per test, reset state in `afterEach`
4	Environment instability	CI fails, local passes	Health checks, mock external APIs, raise timeouts
5	Animation interference	Random assertion failures	`reducedMotion: 'reduce'` in Playwright config
6	Parallelism conflicts	Fails with `--workers > 1`	Scope data to `workerIndex`
7	UI changes / locator drift	Breaks after refactors	Semantic selectors or intent-based self-healing

Why Flaky Tests Are Worse Than No Tests

A test suite with 20% flakiness is worse than a smaller, reliable suite:

False positives: CI fails on green code — developers learn to ignore it
Investigation overhead: every failure requires triage
Trust erosion: once trust breaks, it doesn't come back without deliberate effort
Coverage rot: flaky tests get disabled, leaving real gaps behind

The Google Testing Blog has documented that even 1% flakiness in a large suite creates enough noise to meaningfully slow down development. At 10%+, teams functionally stop relying on CI.
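
A back-of-the-envelope calculation shows why a small per-test flake rate matters at scale. The sketch below assumes every test fails spuriously with the same probability and independently of the others, which real suites only approximate. `falseRedProbability` is an illustrative name, not part of any tool.

```typescript
// Probability that at least one of `testCount` tests fails spuriously,
// assuming every test flakes independently at the same `flakeRate`.
function falseRedProbability(testCount: number, flakeRate: number): number {
  return 1 - Math.pow(1 - flakeRate, testCount);
}

for (const n of [10, 100, 500]) {
  const pct = (falseRedProbability(n, 0.01) * 100).toFixed(1);
  console.log(`${n} tests at a 1% flake rate: ${pct}% of runs go red on good code`);
}
```

Under these assumptions, a 100-test suite with a 1% per-test flake rate goes red on roughly 63% of runs, even though each individual test looks almost perfectly reliable.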

1. Timing and Race Conditions

Symptom: Test fails with "element not found" or "timeout" — sometimes. Usually on CI, rarely locally.

What not to do:

// Arbitrary sleeps are fragile and slow
await page.waitForTimeout(2000);
await page.click('#submit-btn');

Fix: Use explicit waits that respond to actual application state:

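// Wait for the element to be visible
// (click() also waits on its own for the element to be visible and enabled)
await page.waitForSelector('#submit-btn', { state: 'visible' });
await page.click('#submit-btn');

// Wait for network to settle
// (Playwright's docs discourage 'networkidle' for tests; prefer a specific response or UI state)
await page.click('#submit-btn');
await page.waitForLoadState('networkidle');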

// Wait for a specific response
const [response] = await Promise.all([
  page.waitForResponse(r => r.url().includes('/api/submit') && r.status() === 200),
  page.click('#submit-btn'),
]);

// Wait for navigation
await Promise.all([
  page.waitForURL('**/dashboard'),
  page.click('#login-btn'),
]);

Set explicit timeouts in your Playwright config:

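// playwright.config.ts
export default {
  timeout: 30000,                 // budget for a whole test, in ms
  expect: { timeout: 10000 },     // budget for each expect() assertion, in ms
  use: { actionTimeout: 10000 },  // budget for each action such as click or fill, in ms
};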
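
Playwright's locators and `expect` assertions already wait on conditions, and `expect.poll` covers custom checks. For conditions that live outside the page (a queue draining, a file appearing), the same principle can be sketched by hand. The helper below is a minimal illustration, not a Playwright API; `pollUntil` and `demo` are hypothetical names.

```typescript
// Repeatedly evaluates `condition` until it returns true or `timeoutMs` elapses.
// Resolves as soon as the condition holds, so fast environments are not slowed down.
async function pollUntil(
  condition: () => boolean | Promise<boolean>,
  timeoutMs = 5000,
  intervalMs = 50,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await condition()) return;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs}ms`);
    }
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs));
  }
}

// Usage: simulate an app that becomes ready after about 120ms.
async function demo(): Promise<number> {
  let ready = false;
  setTimeout(() => { ready = true; }, 120);
  const start = Date.now();
  await pollUntil(() => ready, 2000);
  return Date.now() - start; // elapsed time tracks readiness, not the 2000ms ceiling
}

demo().then(ms => console.log(`condition met after ${ms}ms`));
```

Unlike a fixed sleep, the timeout here is only an upper bound: the wait ends as soon as the condition is true, and a slow CI machine gets the full budget before the test fails.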

2. Brittle Selectors

Symptom: Test breaks after a UI change that didn't change behavior.

Fragile selectors:

// ❌ Breaks when class name changes
await page.click('.btn-primary-v2-active');

// ❌ Breaks when DOM restructures
await page.click('div > div:nth-child(3) > button');

Resilient selectors:

// ✅ ARIA role + name
await page.getByRole('button', { name: 'Sign In' }).click();

// ✅ Test ID — explicit contract
await page.getByTestId('submit-button').click();

// ✅ Label association
await page.getByLabel('Email address').fill('user@example.com');

// ✅ Visible text
await page.click('button:has-text("Sign In")');

Add data-testid attributes to key interactive elements as a team convention. This creates an explicit contract between tests and developers.

3. Shared or Leaked Test State

Symptom: Tests pass in isolation but fail when run together.

Fix: Make every test self-contained:

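// createTestUser, loginAs, and cleanupTestUsers are your own helpers, not Playwright APIs
test.beforeEach(async ({ page }) => {
  const user = await createTestUser({ role: 'admin' });
  await loginAs(page, user);
});

test.afterEach(async () => {
  // Delete only what this test created, so tests running in parallel keep their data
  await cleanupTestUsers();
});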
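
One way to keep cleanup from touching another test's data is to track what each test creates and delete only that. The sketch below is an illustration under assumptions: `TestDataScope` is a hypothetical helper, and the in-memory `Map` stands in for whatever backend or seeding API your app really uses.

```typescript
interface TestUser { id: string; email: string; role: string; }

// Tracks the records one test creates so cleanup removes only those,
// leaving data owned by tests running in parallel untouched.
class TestDataScope {
  private created: TestUser[] = [];

  // `store` is a stand-in for a real backend or seeding API (an assumption).
  constructor(private store: Map<string, TestUser>) {}

  createUser(role: string): TestUser {
    // Timestamp plus random suffix keeps ids unique across tests.
    const id = `u-${Date.now()}-${Math.random().toString(36).slice(2, 10)}`;
    const user: TestUser = { id, email: `${id}@example.test`, role };
    this.store.set(id, user);
    this.created.push(user);
    return user;
  }

  cleanup(): void {
    for (const user of this.created) this.store.delete(user.id);
    this.created = [];
  }
}

// Usage: two scopes share one store, as two parallel tests would.
const store = new Map<string, TestUser>();
const scopeA = new TestDataScope(store);
const scopeB = new TestDataScope(store);
scopeA.createUser('admin');
const survivor = scopeB.createUser('viewer');
scopeA.cleanup(); // removes only scopeA's user
console.log(store.size, store.has(survivor.id));
```

In a real suite, one scope would be created in `beforeEach` and its `cleanup()` called in `afterEach`.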

For browser state:

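// playwright.config.ts
export default {
  use: {
    // Playwright already creates a fresh browser context for every test.
    // Leaving storageState unset ensures no saved cookies or localStorage are preloaded.
    storageState: undefined,
  },
};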

4. Environment and Network Instability

Symptom: Tests fail on CI but not locally.

Health check before tests:

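// global-setup.ts: poll the health endpoint until the app responds
async function globalSetup() {
  const maxRetries = 10;
  for (let i = 0; i < maxRetries; i++) {
    try {
      const res = await fetch(process.env.BASE_URL + '/health');
      if (res.ok) return; // app is up
    } catch {
      // Connection refused: the app is still starting
    }
    // Back off after every failed attempt, including non-2xx responses
    await new Promise(r => setTimeout(r, 2000));
  }
  throw new Error('App did not start');
}

export default globalSetup;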

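Mock external services that are rate-limited in CI. Note that page.route only intercepts requests made by the browser; calls your backend makes to the same service have to be stubbed on the server side:

await page.route('**/api.stripe.com/**', route =>
  route.fulfill({
    status: 200,
    contentType: 'application/json',
    body: JSON.stringify({ status: 'succeeded' }),
  })
);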

5. Animation and Transition Interference

Symptom: Test clicks an element while it is animating in or out, so the click lands on the wrong target or is lost.

Fix:

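// playwright.config.ts
export default {
  use: {
    // Emulates the prefers-reduced-motion media feature. Animations stop only
    // if the app's CSS honors that media query.
    contextOptions: { reducedMotion: 'reduce' },
  },
};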

Or inject a CSS override:

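test.beforeEach(async ({ page }) => {
  // Navigate first: a style tag added before page.goto() is lost on navigation
  await page.goto('/'); // assumes baseURL is set in the Playwright config
  await page.addStyleTag({
    content: `*, *::before, *::after {
      animation-duration: 0ms !important;
      transition-duration: 0ms !important;
    }`,
  });
});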

6. Test Runner Parallelism Conflicts

Symptom: Tests pass with --workers=1 but fail with parallel execution.

Fix: Use unique data per worker:

test('create item', async ({ page }, testInfo) => {
  const userId = `test-user-${testInfo.workerIndex}`;
  // Each worker uses its own user, no conflicts
});
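
One detail worth knowing when scoping data this way: `testInfo.workerIndex` gets a new value each time a worker process is restarted (for example after a failure), while `testInfo.parallelIndex` stays between 0 and `workers - 1`. The sketch below shows how the two might be used; `accountForWorker`, `workerScopedName`, and the seeded account names are hypothetical.

```typescript
// The subset of Playwright's TestInfo this sketch reads.
interface WorkerInfo { workerIndex: number; parallelIndex: number; }

// Picks one pre-seeded account per parallel slot.
// parallelIndex stays within 0..workers-1 even after a worker restarts,
// so a fixed pool of accounts is never exceeded.
function accountForWorker(accounts: string[], info: WorkerInfo): string {
  const account = accounts[info.parallelIndex];
  if (account === undefined) {
    throw new Error(`Need at least ${info.parallelIndex + 1} seeded accounts`);
  }
  return account;
}

// Builds a name unique to one worker process, for data created on the fly.
function workerScopedName(prefix: string, info: WorkerInfo): string {
  return `${prefix}-w${info.workerIndex}`;
}

const seeded = ['qa-user-0', 'qa-user-1'];
// A worker restarted after a crash: new workerIndex, same parallel slot.
const restarted: WorkerInfo = { workerIndex: 5, parallelIndex: 1 };
console.log(accountForWorker(seeded, restarted)); // qa-user-1
console.log(workerScopedName('cart', restarted)); // cart-w5
```

If `workerIndex` were used to index the seeded pool here, the restarted worker would look for a sixth account that does not exist.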

7. UI Changes Breaking Locators

This is the single largest driver of "tests as a maintenance burden."

Root cause: Tests are coupled to implementation details. Every locator-based test is a bet that the DOM won't change. That bet loses constantly in teams shipping fast.

Short-term fix: Migrate to semantic selectors (see #2). Add data-testid to critical elements.

Systematic fix: Shiplight's intent-cache-heal pattern eliminates this entire class of flakiness. Instead of CSS selectors, Shiplight stores the semantic intent of each step:

# Survives CSS renames, refactors, layout changes
goal: Verify checkout flow
statements:
  - intent: Add item to cart
  - intent: Proceed to checkout
  - intent: Fill in shipping details
  - VERIFY: order confirmation is displayed

When a locator breaks, Shiplight's AI resolves the correct element from the live DOM. A developer renaming a CSS class doesn't break a single test.

How to Triage Flaky Tests at Scale

Step 1: Quarantine, don't delete

test.skip('checkout flow — flaky, tracked in TICKET-123', async ({ page }) => {
  // ...
});

Step 2: Add retries temporarily

export default {
  retries: process.env.CI ? 2 : 0,
};

Retries are symptom management, not a fix. Remove them once the root cause is fixed.

Step 3: Measure flakiness per test

npx playwright test --reporter=json > results.json

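Playwright reports a test as "status": "flaky" when it fails and then passes on a retry, so this only works with retries enabled. Filter for that status and count occurrences across several CI runs to get a ranked list.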
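
The counting step can be sketched as a small script. The interfaces below describe only the slice of the JSON report this sketch relies on (nested `suites`, each with `specs`, each with `tests` carrying a `status`). Treat that shape as an assumption and check it against a real results.json before using it. `countFlaky` is a hypothetical name.

```typescript
// Minimal slice of the report shape this sketch relies on (an assumption).
interface ReportTest { status: string; }
interface ReportSpec { title: string; file: string; tests: ReportTest[]; }
interface ReportSuite { specs?: ReportSpec[]; suites?: ReportSuite[]; }
interface Report { suites: ReportSuite[]; }

// Counts, per spec, how many runs reported it as flaky, most frequent first.
function countFlaky(reports: Report[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  const visit = (suite: ReportSuite): void => {
    for (const spec of suite.specs ?? []) {
      if (spec.tests.some(t => t.status === 'flaky')) {
        const key = `${spec.file} > ${spec.title}`;
        counts.set(key, (counts.get(key) ?? 0) + 1);
      }
    }
    for (const child of suite.suites ?? []) visit(child);
  };
  for (const report of reports) {
    for (const suite of report.suites) visit(suite);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}

// Usage with two hand-written reports standing in for two CI runs.
const runs: Report[] = [
  { suites: [{ specs: [
    { title: 'checkout', file: 'shop.spec.ts', tests: [{ status: 'flaky' }] },
    { title: 'login', file: 'auth.spec.ts', tests: [{ status: 'expected' }] },
  ] }] },
  { suites: [{ suites: [{ specs: [
    { title: 'checkout', file: 'shop.spec.ts', tests: [{ status: 'flaky' }] },
    { title: 'login', file: 'auth.spec.ts', tests: [{ status: 'flaky' }] },
  ] }] }] },
];
console.log(countFlaky(runs));
```

In practice the `runs` array would be filled by parsing the results.json files saved from recent CI runs.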

Step 4: Fix in order of frequency — the 20% of tests causing 80% of flakiness. Common culprits: auth flows, tests hitting external APIs, tests with waitForTimeout.

Preventing Flakiness in New Tests

Never use waitForTimeout — always wait for a condition
Always use semantic selectors — role, label, testid; never CSS classes or nth-child
Create isolated test data per test, clean up after
Test one thing per test — easier to debug
Run tests locally with --headed before committing

Key Takeaways

Retries hide flakiness, they don't fix it — treat as temporary, track root cause
Timing issues are the #1 cause — replace waitForTimeout with condition-based waits
Selectors should reflect user intent — role, label, testid; never CSS class or DOM position
Test isolation is non-negotiable — shared state is a reliability time bomb
UI changes cause chronic flakiness — Shiplight's self-healing resolves elements by intent, eliminating this entire category