DEV Community

Rizwan Saleem

How to Build a Flaky Test Detox Pipeline in CI

Flaky tests are one of the fastest ways to destroy trust in a test suite, because they can pass and fail without any code changes. A practical way to fix that is to build a small “detox pipeline” that detects flaky behavior, isolates the cause, and prevents the same problem from returning.

Why flakiness matters

A flaky test produces different results across runs even when the code and the test stay unchanged. Common causes include async timing, shared state, concurrency, test-order dependencies, and unstated assumptions about the environment.

The problem is not just inconvenience. Flaky tests slow down delivery, create false alarms in CI, and train developers to ignore red builds.

What this tutorial covers

You will learn how to:

  • Detect flaky tests with repeat runs.
  • Classify failures by cause.
  • Stabilize the test by removing nondeterminism.
  • Add safeguards so the same issue does not come back.

The examples use a JavaScript-style testing stack, but the workflow applies to most test runners. The important part is the method, not the framework.

Step 1: Reproduce the flake

Before fixing anything, prove the test is truly flaky. Run it many times under the same conditions and record whether it passes or fails, because nondeterminism is what defines a flaky test.

A simple local retry loop can expose instability:

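# Run the spec 50 times and count failures instead of stopping at the first one.
fails=0
for i in {1..50}; do
  echo "Run $i"
  npm test -- --runInBand flaky.spec.js || { echo "Run $i failed"; fails=$((fails + 1)); }
done
echo "$fails of 50 runs failed"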

If the test fails on run 17 and passes again on run 18 with no code changes, you have a real flake, not a one-off failure.
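
The same repeat-run idea can be sketched in Node when you want the tally as data rather than console output. This is a minimal sketch, not a drop-in tool: `repeatRun` is a hypothetical helper name, and the Jest command in the usage comment assumes the same `flaky.spec.js` file as above.

```javascript
const { spawnSync } = require('node:child_process');

// Hypothetical helper: run a command `times` times and tally exit codes.
// A non-zero (or missing) exit status counts as a failed run.
function repeatRun(command, args, times) {
  const tally = { passed: 0, failed: 0, failedRuns: [] };
  for (let run = 1; run <= times; run++) {
    const result = spawnSync(command, args, { stdio: 'ignore' });
    if (result.status === 0) {
      tally.passed += 1;
    } else {
      tally.failed += 1;
      tally.failedRuns.push(run);
    }
  }
  return tally;
}

// Example usage (assumes Jest is installed in the project):
// const tally = repeatRun('npx', ['jest', 'flaky.spec.js', '--runInBand'], 50);
// console.log(`${tally.failed} of 50 runs failed`, tally.failedRuns);
```

A mix of passed and failed runs under identical conditions is the signal you are looking for; all passes or all failures point to something other than flakiness.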

Step 2: Identify the cause

Once you can reproduce the flake, categorize it. Common categories include timing, shared state, order dependency, concurrency, and external dependencies.

Use this checklist:

  • Timing: Is the test using fixed sleeps instead of waiting for a real condition?
  • Shared state: Does it reuse a database row, file, or global object?
  • Ordering: Does it pass only if another test runs first?
  • Concurrency: Does it assume async work finishes in a specific order?
  • External systems: Does it depend on a live API or network state?

This classification matters because each root cause has a different fix.
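
A first pass over the checklist can sometimes be automated by matching failure messages against known patterns. The sketch below is only a rough triage aid: `triageFailure`, the `RULES` table, and every pattern in it are illustrative assumptions, not an authoritative mapping.

```javascript
// Hypothetical triage rules: each maps a message pattern to a checklist category.
// Network error codes are checked first so they are not mistaken for timing issues.
const RULES = [
  { category: 'external', pattern: /ECONNREFUSED|ECONNRESET|ENOTFOUND|ETIMEDOUT/ },
  { category: 'shared-state', pattern: /duplicate key|unique constraint|already exists/i },
  { category: 'timing', pattern: /timeout|timed out/i },
];

// Return the first matching category, or 'unclassified' when nothing matches.
function triageFailure(message) {
  const rule = RULES.find((r) => r.pattern.test(message));
  return rule ? rule.category : 'unclassified';
}

console.log(triageFailure('connect ECONNREFUSED 127.0.0.1:5432'));
```

Ordering and concurrency problems rarely announce themselves in an error message, so treat an unclassified result as a prompt to investigate by hand.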

Step 3: Replace sleeps with waits

A classic source of flakiness is waiting an arbitrary amount of time and hoping the app is ready. That approach is brittle because real systems vary in speed.

Bad pattern:

await page.click('button#save');
await page.waitForTimeout(2000);
expect(await page.textContent('.status')).toBe('Saved');

Better pattern:

await page.click('button#save');
await expect(page.locator('.status')).toHaveText('Saved');

The second version waits for the condition that actually matters, which makes the test far more deterministic.
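
Outside a browser-automation framework, the same wait-for-a-condition idea can be approximated with a small polling helper. This is a hand-written sketch under assumptions: `waitFor`, its `timeout` and `interval` options, and the `jobIsDone` call in the usage comment are hypothetical names, not part of any library used above.

```javascript
// Poll `condition` until it returns a truthy value or the timeout elapses.
// `condition` may be synchronous or return a promise.
async function waitFor(condition, { timeout = 5000, interval = 50 } = {}) {
  const deadline = Date.now() + timeout;
  while (true) {
    const value = await condition();
    if (value) return value;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeout} ms`);
    }
    // Short pause between polls; the test proceeds as soon as the condition holds.
    await new Promise((resolve) => setTimeout(resolve, interval));
  }
}

// Example usage inside an async test (jobIsDone is a hypothetical check):
// await waitFor(() => jobIsDone(jobId), { timeout: 2000 });
```

The timeout here is an upper bound rather than a fixed delay, so a fast system finishes quickly and a slow one still gets the time it needs.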

Step 4: Remove shared state

Flaky tests often appear when one test leaves behind data that another test accidentally consumes. The fix is to make every test create and clean up its own environment.

Here is a simple pattern:

beforeEach(async () => {
  await db.reset();
  await seedUser({ email: 'test@example.com' });
});

afterEach(async () => {
  await db.reset();
});

A test suite should be able to run in any order and still behave the same way. If a test only passes after another test has already “prepared” the world, it is hiding a dependency.
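
One way to probe for hidden order dependencies is to run the same tests in a shuffled but reproducible order. The sketch below is an illustration under assumptions: `seededRandom` and `shuffleWithSeed` are hypothetical helpers, the file names are made up, and how the shuffled list reaches your runner depends on the runner.

```javascript
// Small deterministic pseudo-random generator (not cryptographically secure).
function seededRandom(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the seeded generator; returns a new array.
function shuffleWithSeed(items, seed) {
  const random = seededRandom(seed);
  const result = [...items];
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}

// The same seed always yields the same order, so a failing order can be replayed.
const files = ['login.spec.js', 'profile.spec.js', 'checkout.spec.js'];
console.log(shuffleWithSeed(files, 1234));
```

Logging the seed with each CI run is the important design choice: a failure that only appears in one order is reproducible as long as the seed is known.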

Step 5: Freeze time and randomness

Tests that depend on the current time, random IDs, or unstable ordering can fail for reasons unrelated to the feature being tested. Freezing these inputs makes the test repeatable.

Example with time:

jest.useFakeTimers();
jest.setSystemTime(new Date('2026-05-31T12:00:00Z'));

expect(formatExpiryDate()).toBe('Expires on 31 May 2026');

Example with randomness:

jest.spyOn(Math, 'random').mockReturnValue(0.42);

If your code generates IDs, timestamps, or sort order, make those values injectable so tests can control them. That small design choice pays off quickly in reliability.
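
Making those values injectable might look like the following sketch. `createOrder` and its `now` and `newId` options are hypothetical names chosen for illustration; the point is only that the clock and the ID source are parameters with production defaults.

```javascript
const { randomUUID } = require('node:crypto');

// Hypothetical function under test: the clock and ID generator are
// parameters with production defaults, so tests can substitute fixed values.
function createOrder(items, { now = () => new Date(), newId = () => randomUUID() } = {}) {
  return {
    id: newId(),
    createdAt: now().toISOString(),
    items,
  };
}

// In a test, both sources of nondeterminism are pinned to known values.
const order = createOrder(['book'], {
  now: () => new Date('2026-05-31T12:00:00Z'),
  newId: () => 'order-1',
});
console.log(order.id, order.createdAt);
```

Production callers pass nothing and get real time and real IDs, while tests get exact values to assert against without any global mocking.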

Step 6: Build a flake detector job

A good CI setup does more than report failures. It actively looks for instability by repeating suspicious tests several times and flagging inconsistent results.

A practical CI job might:

  1. Run changed tests once in the normal pipeline.
  2. Re-run any failing test three to five times.
  3. Mark a test as flaky if it sometimes passes and sometimes fails.
  4. Quarantine the test until it is fixed.

Example GitHub Actions-style step:

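- name: Retry suspicious tests
  run: |
    # If the first run fails, run once more. A pass on the rerun means the
    # test is flaky, so the step exits non-zero either way.
    npx jest flaky.spec.js --runInBand || \
    (npx jest flaky.spec.js --runInBand && echo "Flaky: passed on rerun" && exit 1)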

This is intentionally simple. In a real team, you would likely use a dedicated retry mechanism and a report of known flaky tests.
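
The rerun-and-classify steps listed above can be sketched as a small function. This is an illustration under assumptions: `classifyWithReruns` is a hypothetical name, and `runTest` stands in for whatever actually executes one test and reports pass or fail as a boolean.

```javascript
// Run a test once; on failure, rerun it `reruns` times and classify the result.
// `runTest` is a hypothetical synchronous callback that returns true on pass.
function classifyWithReruns(runTest, reruns = 3) {
  if (runTest()) {
    return { status: 'passed', attempts: 1, rerunPasses: 0 };
  }
  let rerunPasses = 0;
  for (let i = 0; i < reruns; i++) {
    if (runTest()) rerunPasses += 1;
  }
  return {
    // Any pass after an initial failure means the outcome is inconsistent.
    status: rerunPasses > 0 ? 'flaky' : 'failed',
    attempts: 1 + reruns,
    rerunPasses,
  };
}

// Simulated test that fails twice and then passes.
let calls = 0;
console.log(classifyWithReruns(() => ++calls > 2, 3));
```

Keeping "flaky" distinct from "failed" in the result is what lets the pipeline quarantine one and block on the other.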

Step 7: Quarantine carefully

Quarantining means temporarily separating a flaky test from the main gate so it does not keep blocking everyone else. That can be useful, but it should never become a permanent excuse to ignore the problem.

Use quarantine only when:

  • The flake is understood and logged.
  • The team has a ticket to fix it.
  • The test is still visible in reports.
  • There is a clear plan to restore it to the main suite.

A quarantined test should be treated like a broken smoke alarm: muted for the moment, but never forgotten.
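
One way to keep quarantine temporary is to store the list as data and audit it in CI. The sketch below assumes a hypothetical entry shape of `{ test, ticket, expires }`; the `auditQuarantine` name, the ticket IDs, and the dates are all made up for illustration.

```javascript
// Return a list of problems: entries without a ticket, without an expiry
// date, or whose expiry date has already passed.
function auditQuarantine(entries, today = new Date()) {
  const problems = [];
  for (const { test, ticket, expires } of entries) {
    if (!ticket) problems.push(`${test}: no ticket linked`);
    if (!expires) {
      problems.push(`${test}: no expiry date`);
    } else if (new Date(expires) < today) {
      problems.push(`${test}: quarantine expired on ${expires}`);
    }
  }
  return problems;
}

// Hypothetical quarantine list, checked against a fixed date.
const quarantined = [
  { test: 'checkout total', ticket: 'QA-101', expires: '2026-07-01' },
  { test: 'profile save', ticket: '', expires: '2026-07-01' },
  { test: 'search results', ticket: 'QA-102', expires: '2026-06-01' },
];
console.log(auditQuarantine(quarantined, new Date('2026-06-15T00:00:00Z')));
```

Failing the build when this list is non-empty turns "never forgotten" from a good intention into an enforced rule.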

Step 8: Add a non-flaky replacement

Sometimes the fastest path is to rewrite the test around a stronger invariant. Instead of testing implementation details, test the observable behavior that actually matters.

For example, if a UI test is flaky because it watches a loading spinner, replace it with a test that verifies the final data appears:

test('shows saved profile data', async ({ page }) => {
  await page.goto('/profile');
  await page.fill('#name', 'Amina');
  await page.click('button#save');
  await expect(page.locator('[data-testid="profile-name"]')).toHaveText('Amina');
});

This version avoids depending on a transient loading state and focuses on the user-visible outcome.

Practical workflow

A reliable detox workflow looks like this:

  1. Detect the flake with repeated runs.
  2. Categorize the cause.
  3. Fix timing, state, or ordering issues.
  4. Freeze time and randomness where needed.
  5. Add a CI retry or flaky-test report.
  6. Quarantine only as a temporary measure.
  7. Replace the brittle test with a stronger one.

This workflow turns flakiness from a vague annoyance into a trackable engineering problem.

Team habits that prevent flakes

Prevention is easier than cleanup. Teams that keep test suites stable usually do a few things consistently:

  • Avoid sleep-style waits unless absolutely necessary.
  • Reset test data before every test.
  • Keep tests independent and order-agnostic.
  • Mock or stub unstable external services.
  • Review new tests for determinism before merging.

The main idea is simple: a test should fail because the product regressed, not because the environment hiccupped.
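
Some of these habits can be backed by a lightweight pre-merge check. As one example, the sketch below flags lines that look like fixed sleeps; `findFixedSleeps` and the pattern list are hypothetical and would need tuning to match the helpers your own tests use.

```javascript
// Hypothetical pre-merge check: flag lines that look like fixed sleeps.
// The pattern list is an assumption; extend it for your own helpers.
const SLEEP_PATTERNS = [/\bwaitForTimeout\s*\(/, /\bsleep\s*\(/];

// Return the 1-based line number and trimmed text of each suspicious line.
function findFixedSleeps(source) {
  return source
    .split('\n')
    .map((text, index) => ({ line: index + 1, text: text.trim() }))
    .filter(({ text }) => SLEEP_PATTERNS.some((pattern) => pattern.test(text)));
}

const sample = [
  "await page.click('button#save');",
  'await page.waitForTimeout(2000);',
  "await expect(page.locator('.status')).toHaveText('Saved');",
].join('\n');
console.log(findFixedSleeps(sample));
```

A text match like this will produce false positives, so it works better as a review prompt than as a hard gate.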

Final example

Suppose this test fails sometimes:

test('user sees order total', async ({ page }) => {
  await page.goto('/checkout');
  await page.click('button#place-order');
  await page.waitForTimeout(3000);
  expect(await page.textContent('#order-total')).toBe('$42.00');
});

A sturdier version would be:

test('user sees order total', async ({ page }) => {
  await page.goto('/checkout');
  await page.click('button#place-order');
  await expect(page.locator('#order-total')).toHaveText('$42.00');
});

If the result still flakes, the next move is not more retries. It is to inspect whether the page depends on unstable network timing, shared state, or hidden ordering assumptions.

The best flaky-test detox pipelines do one thing extremely well: they make nondeterminism visible early enough that it never becomes normal.

-

Rizwan Saleem | https://rizwansaleem.co
