Shiplight

Posted on • Originally published at shiplight.ai

How to Fix Flaky Tests: Root Causes and Permanent Fixes

A flaky test passes sometimes and fails sometimes on the same code, with no changes. Flaky tests are the most corrosive problem in a test suite because they turn CI from a quality signal into noise.

Teams respond to flaky tests in predictable ways: first they rerun them, then they add retries, then they quarantine them, then they just stop looking at red CI. By the time a real regression ships, no one trusts the tests enough to catch it.

This guide covers the 7 root causes of flaky E2E tests and how to fix each one permanently — not with retries that hide the problem, but with changes that make the test reliable.

Quick Reference: 7 Causes of Flaky Tests

# | Root Cause | Symptom | Fix
1 | Timing / race conditions | "element not found" on CI | Replace waitForTimeout with condition-based waits
2 | Brittle selectors | Breaks on CSS rename | Use getByRole, getByTestId, getByLabel
3 | Shared test state | Fails in parallel, passes solo | Isolate data per test, reset state in afterEach
4 | Environment instability | CI fails, local passes | Health checks, mock external APIs, raise timeouts
5 | Animation interference | Random assertion failures | reducedMotion: 'reduce' in Playwright config
6 | Parallelism conflicts | Fails with --workers > 1 | Scope data to workerIndex
7 | UI changes / locator drift | Breaks after refactors | Semantic selectors or intent-based self-healing

Why Flaky Tests Are Worse Than No Tests

A test suite with 20% flakiness is worse than a smaller, reliable suite:

  • False positives: CI fails on green code — developers learn to ignore it
  • Investigation overhead: every failure requires triage
  • Trust erosion: once trust breaks, it doesn't come back without deliberate effort
  • Coverage rot: flaky tests get disabled, leaving real gaps behind

The Google Testing Blog has documented that even 1% flakiness in a large suite creates enough noise to meaningfully slow down development. At 10%+, teams functionally stop relying on CI.
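
A back-of-the-envelope calculation shows why a small per-test flake rate matters at suite scale. The sketch below assumes every test flakes independently at the same rate, which is a simplification and not something the cited sources state; `falseFailureRate` is a hypothetical helper name.

```typescript
// Probability that a suite run goes red purely because of flakiness,
// assuming (simplification) every test flakes independently at the same rate.
function falseFailureRate(perTestFlakeRate: number, testCount: number): number {
  return 1 - Math.pow(1 - perTestFlakeRate, testCount);
}

const smallSuite = falseFailureRate(0.01, 10);  // roughly 0.10
const largeSuite = falseFailureRate(0.01, 100); // roughly 0.63
```

Under these assumptions, a 100-test suite where each test flakes 1% of the time fails spuriously in well over half of its runs.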

1. Timing and Race Conditions

Symptom: Test fails with "element not found" or "timeout" — sometimes. Usually on CI, rarely locally.

What not to do:

// Arbitrary sleeps are fragile and slow
await page.waitForTimeout(2000);
await page.click('#submit-btn');

Fix: Use explicit waits that respond to actual application state:

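// Wait for the element to be visible (click also auto-waits for actionability)
await page.waitForSelector('#submit-btn', { state: 'visible' });
await page.click('#submit-btn');

// Wait for network to settle (least reliable option: Playwright's docs
// discourage 'networkidle', and apps that poll may never go idle)
await page.click('#submit-btn');
await page.waitForLoadState('networkidle');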

// Wait for a specific response
const [response] = await Promise.all([
  page.waitForResponse(r => r.url().includes('/api/submit') && r.status() === 200),
  page.click('#submit-btn'),
]);

// Wait for navigation
await Promise.all([
  page.waitForURL('**/dashboard'),
  page.click('#login-btn'),
]);
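
The waits above share one idea: check a condition repeatedly and continue the moment it holds, instead of sleeping for a fixed time. A minimal sketch of that idea follows. `pollUntil` and its options are hypothetical names, not a Playwright API. In real tests, Playwright's built-in auto-waiting and assertions are usually the better choice.

```typescript
interface PollOptions {
  timeoutMs: number;  // give up after this long
  intervalMs: number; // pause between checks
}

// Resolves as soon as check() returns true; rejects once the deadline passes.
async function pollUntil(
  check: () => boolean | Promise<boolean>,
  { timeoutMs, intervalMs }: PollOptions,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await check()) return; // condition met: stop immediately
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Unlike a fixed 2000 ms sleep, this returns as soon as the application is ready on a fast machine and keeps waiting, up to the timeout, on a slow CI runner.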

Set explicit timeouts in your Playwright config:

export default {
  timeout: 30000,
  expect: { timeout: 10000 },
  use: { actionTimeout: 10000 },
};

2. Brittle Selectors

Symptom: Test breaks after a UI change that didn't change behavior.

Fragile selectors:

// ❌ Breaks when class name changes
await page.click('.btn-primary-v2-active');

// ❌ Breaks when DOM restructures
await page.click('div > div:nth-child(3) > button');

Resilient selectors:

// ✅ ARIA role + name
await page.getByRole('button', { name: 'Sign In' }).click();

// ✅ Test ID — explicit contract
await page.getByTestId('submit-button').click();

// ✅ Label association
await page.getByLabel('Email address').fill('user@example.com');

// ✅ Visible text
await page.click('button:has-text("Sign In")');

Add data-testid attributes to key interactive elements as a team convention. This creates an explicit contract between tests and developers.

3. Shared or Leaked Test State

Symptom: Tests pass in isolation but fail when run together.

Fix: Make every test self-contained:

test.beforeEach(async ({ page }) => {
  const user = await createTestUser({ role: 'admin' });
  await loginAs(page, user);
});

test.afterEach(async () => {
  await cleanupTestUsers();
});

For browser state:

// playwright.config.ts
export default {
  use: {
    storageState: undefined, // no saved cookies or localStorage (each test already gets its own context)
  },
};
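
One caveat with a blanket cleanup such as `cleanupTestUsers()`: depending on how it is implemented, it may delete records that a test running in parallel still needs. A possible alternative is to have each test remember what it created and delete only that. The sketch below uses a hypothetical `TestDataTracker` class and an in-memory `Set` standing in for the real backend.

```typescript
// Hypothetical helper: remembers what one test created so cleanup touches nothing else.
class TestDataTracker {
  private readonly createdIds: string[] = [];

  // Record an id and hand it back so calls can be chained.
  track(id: string): string {
    this.createdIds.push(id);
    return id;
  }

  // Deletes tracked records newest-first and returns how many were removed.
  cleanup(deleteRecord: (id: string) => void): number {
    let removed = 0;
    while (this.createdIds.length > 0) {
      const id = this.createdIds.pop() as string;
      deleteRecord(id);
      removed++;
    }
    return removed;
  }
}

// Usage with an in-memory stand-in for the backend.
const fakeDb = new Set<string>(["user-from-another-test"]);
const tracker = new TestDataTracker();
fakeDb.add(tracker.track("user-a"));
fakeDb.add(tracker.track("user-b"));
const removedCount = tracker.cleanup((id) => fakeDb.delete(id));
```

In a real suite, the tracker would be created in `beforeEach` and `cleanup` called in `afterEach` with whatever delete call the application's API offers.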

4. Environment and Network Instability

Symptom: Tests fail on CI but not locally.

Health check before tests:

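// global-setup.ts
async function globalSetup() {
  const maxRetries = 10;
  for (let i = 0; i < maxRetries; i++) {
    try {
      const res = await fetch(process.env.BASE_URL + '/health');
      if (res.ok) return; // app is up
    } catch {
      // connection refused: the app is not listening yet
    }
    // Wait before retrying, whether the request threw or returned a non-2xx status
    await new Promise(r => setTimeout(r, 2000));
  }
  throw new Error('App did not start');
}

export default globalSetup;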

Mock external services that are rate-limited in CI:

await page.route('**/api.stripe.com/**', route =>
  route.fulfill({ status: 200, body: JSON.stringify({ status: 'succeeded' }) })
);

5. Animation and Transition Interference

Symptom: Test clicks an element that's animating in/out and gets wrong behavior.

Fix:

// playwright.config.ts
export default {
  use: {
    // Emulates prefers-reduced-motion; animations stop only if the app's CSS honors it
    contextOptions: { reducedMotion: 'reduce' },
  },
};

Or inject a CSS override:

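test.beforeEach(async ({ page }) => {
  // addStyleTag only affects the currently loaded document, so call it
  // after page.goto() and again after any full page navigation
  await page.addStyleTag({
    content: `*, *::before, *::after { 
      animation-duration: 0ms !important; 
      transition-duration: 0ms !important; 
    }`,
  });
});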

6. Test Runner Parallelism Conflicts

Symptom: Tests pass with --workers=1 but fail with parallel execution.

Fix: Use unique data per worker:

test('create item', async ({ page }, testInfo) => {
  const userId = `test-user-${testInfo.workerIndex}`;
  // Each worker uses its own user, no conflicts
});
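
If several tests on the same worker create data, the worker index alone may not be unique enough. One option is to fold the test title into the identifier as well. `scopedEmail` below is a hypothetical helper, and the `example.com` address format is only an illustration.

```typescript
// Hypothetical helper: derive an identifier that is unique per worker and per test.
function scopedEmail(workerIndex: number, testTitle: string): string {
  const slug = testTitle
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric into a dash
    .replace(/^-+|-+$/g, "");    // trim leading and trailing dashes
  return `test-w${workerIndex}-${slug}@example.com`;
}

const email = scopedEmail(2, "Create item!"); // "test-w2-create-item@example.com"
```

Inside a test this could be called as `scopedEmail(testInfo.workerIndex, testInfo.title)`. Note that `workerIndex` changes when Playwright restarts a worker after a failure, while `testInfo.parallelIndex` stays within the range 0 to workers - 1, so the latter fits better when the data is a fixed pool of pre-seeded accounts.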

7. UI Changes Breaking Locators

This is the single largest driver of "tests as a maintenance burden."

Root cause: Tests are coupled to implementation details. Every locator-based test is a bet that the DOM won't change. That bet loses constantly in teams shipping fast.

Short-term fix: Migrate to semantic selectors (see #2). Add data-testid to critical elements.

Systematic fix: Shiplight's intent-cache-heal pattern eliminates this entire class of flakiness. Instead of CSS selectors, Shiplight stores the semantic intent of each step:

# Survives CSS renames, refactors, layout changes
goal: Verify checkout flow
statements:
  - intent: Add item to cart
  - intent: Proceed to checkout
  - intent: Fill in shipping details
  - VERIFY: order confirmation is displayed

When a locator breaks, Shiplight's AI resolves the correct element from the live DOM. A developer renaming a CSS class doesn't break a single test.

How to Triage Flaky Tests at Scale

Step 1: Quarantine, don't delete

test.skip('checkout flow — flaky, tracked in TICKET-123', async ({ page }) => {
  // ...
});

Step 2: Add retries temporarily

export default {
  retries: process.env.CI ? 2 : 0,
};

Retries are symptom management, not a fix. Remove them once the root cause is fixed.
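
Some quick arithmetic shows how effectively retries mask a flaky test. The sketch assumes each attempt fails independently with the same probability, which real flakiness does not always do; `passRateWithRetries` is a hypothetical helper name.

```typescript
// Chance that a test is reported as passing when it may be attempted
// (retries + 1) times, assuming independent attempts (a simplification).
function passRateWithRetries(failRatePerAttempt: number, retries: number): number {
  return 1 - Math.pow(failRatePerAttempt, retries + 1);
}

const withoutRetries = passRateWithRetries(0.2, 0); // 0.8
const withTwoRetries = passRateWithRetries(0.2, 2); // about 0.992
```

Under these assumptions, a test that fails one attempt in five turns CI red in fewer than 1% of runs once two retries are enabled. The defect is still there, only much harder to notice.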

Step 3: Measure flakiness per test

npx playwright test --reporter=json > results.json

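Filter for "status": "flaky" to find tests that failed and then passed on retry; Playwright reports this status only when retries are enabled. A single run shows which tests flaked, so aggregate several runs to get a ranked list.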
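
Ranking flaky tests across several saved reports could look roughly like the sketch below. The interfaces describe only the subset of the JSON report this code reads (nested `suites`, each with `specs`, each with `tests` carrying a `status`) and should be treated as an assumed shape to verify against your Playwright version. `collectSpecs` and `rankFlaky` are hypothetical names.

```typescript
// Assumed minimal subset of the JSON reporter's output.
interface ReportTest { status: string; }
interface ReportSpec { title: string; file: string; tests: ReportTest[]; }
interface ReportSuite { title: string; specs?: ReportSpec[]; suites?: ReportSuite[]; }
interface Report { suites: ReportSuite[]; }

// Walk nested suites and collect every spec.
function collectSpecs(suites: ReportSuite[]): ReportSpec[] {
  const specs: ReportSpec[] = [];
  for (const suite of suites) {
    specs.push(...(suite.specs ?? []));
    specs.push(...collectSpecs(suite.suites ?? []));
  }
  return specs;
}

// Count flaky outcomes per test across reports, most flaky first.
function rankFlaky(reports: Report[]): Array<{ test: string; flakyRuns: number }> {
  const counts = new Map<string, number>();
  for (const report of reports) {
    for (const spec of collectSpecs(report.suites)) {
      const flaky = spec.tests.filter((t) => t.status === "flaky").length;
      if (flaky > 0) {
        const key = `${spec.file} > ${spec.title}`;
        counts.set(key, (counts.get(key) ?? 0) + flaky);
      }
    }
  }
  return Array.from(counts, ([test, flakyRuns]) => ({ test, flakyRuns }))
    .sort((a, b) => b.flakyRuns - a.flakyRuns);
}
```

Each report would come from parsing one run's `results.json`. Feeding in the last few dozen CI runs gives the frequency ordering that Step 4 relies on.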

Step 4: Fix in order of frequency — the 20% of tests causing 80% of flakiness. Common culprits: auth flows, tests hitting external APIs, tests with waitForTimeout.

Preventing Flakiness in New Tests

  • Never use waitForTimeout — always wait for a condition
  • Always use semantic selectors — role, label, testid; never CSS classes or nth-child
  • Create isolated test data per test, clean up after
  • Test one thing per test — easier to debug
  • Run tests locally with --headed before committing
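
The first rule in the list above is easy to enforce mechanically. The sketch below is a deliberately naive source scan that reports the line numbers where `waitForTimeout(` is called; `findFixedSleeps` is a hypothetical helper, and its comment handling is simplistic. A dedicated lint rule, such as `no-wait-for-timeout` from eslint-plugin-playwright, is likely the sturdier option.

```typescript
// Returns 1-based line numbers that call waitForTimeout(...).
function findFixedSleeps(source: string): number[] {
  const hits: number[] = [];
  source.split("\n").forEach((line, index) => {
    const code = line.split("//")[0]; // naive: ignore everything after a line comment
    if (/\bwaitForTimeout\s*\(/.test(code)) hits.push(index + 1);
  });
  return hits;
}

const sample = [
  "await page.waitForTimeout(2000);",
  "// page.waitForTimeout(1) is banned",
  "await page.getByRole('button', { name: 'Sign In' }).click();",
].join("\n");
const offendingLines = findFixedSleeps(sample); // [1]
```

Run over every spec file in CI, a check like this fails the build before a new fixed sleep reaches the main branch.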

Key Takeaways

  • Retries hide flakiness, they don't fix it — treat as temporary, track root cause
  • Timing issues are the #1 cause — replace waitForTimeout with condition-based waits
  • Selectors should reflect user intent — role, label, testid; never CSS class or DOM position
  • Test isolation is non-negotiable — shared state is a reliability time bomb
  • UI changes cause chronic flakiness — Shiplight's self-healing resolves elements by intent, eliminating this entire category

References: Playwright documentation · Google Testing Blog · Shiplight self-healing tests
