DEV Community

Aston Cook

The Flaky Test Epidemic: A Practical Guide to Tests You Can Actually Trust

Last month, I watched a senior engineer on my team disable a test that had been failing intermittently for three weeks. His exact words: "I don't have time to babysit this thing." That test was covering a critical auth flow. Two weeks later, a bug shipped to production in that exact flow.

Flaky tests are not just annoying. They are actively dangerous. And based on conversations I have in interviews and across QA communities, this problem is getting worse, not better.

The Real Cost of Flaky Tests

Here is a number that should scare you: teams with high flaky test rates spend up to 30% of their engineering time investigating false failures. I have seen this firsthand across multiple organizations.

But the bigger cost is not time. It is trust.

When your CI pipeline cries wolf enough times, people stop listening. They start clicking "re-run" without reading the failure. They start merging with failing tests. They start skipping the pipeline entirely on "small changes." And suddenly your test suite is decoration, not protection.

I see this pattern constantly when conducting mock interviews on AssertHired. I will ask a candidate how they handle flaky tests, and the most common answer is "we just retry them." That is not a strategy. That is a coping mechanism.

The Five Usual Suspects

After debugging hundreds of flaky tests across different codebases, I have found that nearly all of them fall into five categories. Knowing which category you are dealing with cuts your debugging time in half.

1. Timing and Race Conditions

This is the biggest one, accounting for roughly 40% of flaky tests I have encountered. Your test assumes something will happen in a specific order, but the application does not guarantee that order.

The classic example: clicking a button and immediately asserting that a modal appeared. Sometimes the modal takes 50ms. Sometimes it takes 500ms. Your test passes locally but fails in CI where resources are more constrained.

The fix: Stop using arbitrary waits. In Playwright, use await expect(locator).toBeVisible() instead of await page.waitForTimeout(2000). The difference is that the first approach polls intelligently until the condition is true (or times out), while the second just hopes two seconds is enough.

A more subtle version of this problem: your test creates data via an API call and immediately navigates to a page expecting that data to be rendered. If there is any async processing, caching, or eventual consistency involved, you have a race condition.

The fix: Wait for the actual signal, not an arbitrary delay. Poll the API until the data is confirmed, or wait for a specific DOM element that only appears once the data has loaded.
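The "wait for the signal" idea can be sketched framework-agnostically. The `pollUntil` helper below is hypothetical (not a Playwright API), but it mirrors what Playwright's web-first assertions do internally: retry a check until it succeeds or a deadline passes.

```typescript
// Minimal polling helper: retries an async check until it returns a value
// or the timeout elapses. Illustrative, not a library function.
async function pollUntil<T>(
  check: () => Promise<T | undefined>,
  { timeoutMs = 5000, intervalMs = 100 } = {}
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const result = await check();
    if (result !== undefined) return result;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs}ms`);
}

// Simulated eventually-consistent API: the record only shows up after
// a few polls, like data that is still being indexed server-side.
let calls = 0;
const fetchRecord = async () => (++calls >= 3 ? { id: "abc" } : undefined);

const record = await pollUntil(fetchRecord, { timeoutMs: 2000, intervalMs: 10 });
```

The key property: the happy path finishes as soon as the data exists, and the timeout only bites when something is genuinely wrong. A fixed sleep gives you the worst of both.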

2. Shared State Between Tests

This one is sneaky. Test A creates a user named "testuser@example.com." Test B also creates a user with that same email. When they run in sequence, everything is fine. When they run in parallel or in a different order, one of them explodes with a unique constraint violation.

I once spent two full days debugging a flaky test that only failed on Tuesdays. Turns out, it shared a database record with another test that only ran as part of the Tuesday scheduled suite. Two days of my life I will never get back.

The fix: Every test should create its own isolated data with unique identifiers. I like using a pattern like test-${Date.now()}-${randomSuffix} for any test data. Yes, it means more data cleanup, but it means zero cross-test contamination.
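That pattern is a one-liner. The helper name below is mine, purely illustrative:

```typescript
// Generate collision-free identifiers so parallel tests never fight
// over the same database rows. `uniqueTestId` is an illustrative name.
function uniqueTestId(prefix = "test"): string {
  const randomSuffix = Math.random().toString(36).slice(2, 8);
  return `${prefix}-${Date.now()}-${randomSuffix}`;
}

// Each test derives its own user from a fresh id:
const email = `${uniqueTestId()}@example.com`;
```

The timestamp also doubles as forensics: when stale test data does show up in a shared environment, you can tell at a glance when (and roughly which run) created it.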

3. Environment Dependencies

Your test works perfectly on your MacBook. It fails in CI. Why? Because your CI runner has 2 CPU cores and 4GB of RAM instead of your 16-core M3 with 32GB.

Other environment culprits: different timezone settings, different locale settings, different screen resolutions for visual tests, network latency to external services, and DNS resolution timing.

The fix: Make your test environment as deterministic as possible. Pin your timezone in CI. Use fixed viewports for visual tests. And for the love of all things good, do not let your tests hit real external services. Mock them.
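In Playwright, most of this pinning lives in one place. The option names below are real Playwright settings; the values are just examples:

```typescript
// playwright.config.ts — pin the environment so CI and laptops agree.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  use: {
    timezoneId: "UTC",          // no more "fails only on the US-East runner"
    locale: "en-US",            // deterministic number and date formatting
    viewport: { width: 1280, height: 720 }, // stable visual snapshots
  },
});
```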

4. Order-Dependent Assertions

Your test asserts that a list contains items in a specific order, but the API does not guarantee ordering. Or your test checks element.textContent === "3 items" but the count depends on data that other tests may have created.

The fix: If ordering does not matter for the feature, do not assert on ordering. Use expect(items).toContain(expected) instead of expect(items[0]).toBe(expected). If you need to verify a count, make sure you are counting within an isolated scope.
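When you do need to check that two lists have the same members regardless of order, compare them as sets. Plain assertions here stand in for expect matchers:

```typescript
// Compare as a set when the feature does not promise ordering.
const received = ["banana", "apple", "cherry"]; // order from an unordered API
const expected = ["apple", "banana", "cherry"];

const a = [...received].sort();
const b = [...expected].sort();
const sameMembers =
  a.length === b.length && a.every((item, i) => item === b[i]);
```

Sorting copies (not the originals) keeps the assertion side-effect free, which matters if later assertions in the same test care about the received order.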

5. Resource Cleanup Failures

A test opens a database connection, creates a WebSocket, or spawns a subprocess. When the test passes, cleanup runs. When it fails, cleanup gets skipped. Now the next test starts with leaked resources and behaves unpredictably.

The fix: Use beforeEach/afterEach hooks for setup and teardown, not inline code. In Playwright, leverage the built-in fixtures system. The framework guarantees teardown runs regardless of test outcome. This is exactly the kind of thing that separates solid automation from fragile scripts.
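The contract those hooks and fixtures give you is "teardown runs no matter what," which at bottom is try/finally. A framework-agnostic sketch, with `FakeConnection` standing in for any leak-prone resource:

```typescript
// Guaranteed cleanup via try/finally — the same contract afterEach
// hooks and Playwright fixtures provide. All names here are illustrative.
class FakeConnection {
  open = true;
  close() { this.open = false; }
}

async function withConnection<T>(
  body: (conn: FakeConnection) => Promise<T>
): Promise<T> {
  const conn = new FakeConnection();
  try {
    return await body(conn);
  } finally {
    conn.close(); // runs whether the test body returned or threw
  }
}

// Even a failing "test" leaves no open connection behind:
let leaked: FakeConnection | undefined;
try {
  await withConnection(async (conn) => {
    leaked = conn;
    throw new Error("assertion failed");
  });
} catch {
  // test failure observed, but cleanup already ran
}
```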

My Debugging Workflow

When I encounter a flaky test, I follow a specific sequence before I touch any code:

Step 1: Reproduce it. Run the test 50 times in a loop. In Playwright: npx playwright test my-test.spec.ts --repeat-each=50. If it does not fail at least once in 50 runs, run it in CI instead, since the environment difference might be the trigger.

Step 2: Check the category. Look at the failure message and stack trace. Is it a timeout? Probably timing. Is it a data conflict? Probably shared state. Is it consistent in one environment but not another? Probably environment.

Step 3: Isolate it. Run the flaky test by itself. If it passes consistently in isolation but fails when run with others, you have a shared state or resource cleanup problem. If it fails even in isolation, you have a timing or environment issue.

Step 4: Add logging, not retries. Before you add a retry, add a console.log at every async boundary in the test. Capture timestamps. You want to see exactly where the timing gap is happening. Retries hide bugs. Logging reveals them.
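A minimal version of that instrumentation, with the two timeouts standing in for real async work:

```typescript
// Timestamped trace points around each async boundary make the timing
// gap visible instead of hiding it behind a retry. Labels are examples.
const trace: string[] = [];
const mark = (label: string) => {
  trace.push(`${new Date().toISOString()} ${label}`);
};

mark("before: create fixture via API");
await new Promise((r) => setTimeout(r, 20)); // stands in for the API call
mark("after: create fixture via API");

mark("before: navigate to page");
await new Promise((r) => setTimeout(r, 20)); // stands in for page load
mark("after: navigate to page");

// On failure, dump the trace and diff the gaps between consecutive marks:
// console.log(trace.join("\n"));
```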

Building a Flake-Resistant Culture

Fixing individual flaky tests is important, but the real win comes from building habits that prevent them in the first place.

Quarantine immediately. When a test starts flaking, move it to a separate "quarantine" test suite that runs but does not block the pipeline. This keeps your main pipeline trustworthy while giving you time to fix the flake properly.
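One way to wire this up in Playwright (the `@quarantine` tag convention is mine; `grepInvert` is a real config option):

```typescript
// playwright.config.ts — the blocking pipeline skips anything whose
// title contains @quarantine, so only trusted tests can fail the build.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  grepInvert: /@quarantine/,
});
```

A separate non-blocking CI job can then run only the quarantined tests with `npx playwright test --grep @quarantine`, so they keep producing signal while you fix them.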

Track your flake rate. Measure the percentage of CI runs that fail due to flaky tests (not real bugs). If that number is above 5%, you have a problem that deserves dedicated sprint time. Most mature teams I have seen target under 2%.
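The metric itself is simple; the discipline is in labeling failures honestly as "flake" versus "real bug." The numbers below are illustrative:

```typescript
// Flake rate: percentage of CI runs that failed only because of
// flaky tests, not real defects. Inputs here are made-up examples.
function flakeRate(flakyFailures: number, totalRuns: number): number {
  return totalRuns === 0 ? 0 : (flakyFailures / totalRuns) * 100;
}

const rate = flakeRate(12, 200); // 12 flaky failures across 200 runs
```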

Make flake fixes count as real work. This is a culture problem as much as a technical one. If fixing flaky tests is seen as janitorial work that does not "count," nobody will do it. Flake fixes are quality engineering. Treat them that way.

Stop Retrying, Start Fixing

The QA community's biggest struggle right now is not a lack of tools or frameworks. It is trust erosion. Every flaky test that gets retried instead of fixed is a small withdrawal from your team's confidence in the test suite.

The next time a test flakes on you, resist the urge to click "re-run." Open the failure. Categorize it. Fix it. Your future self (and your team) will thank you.


Aston Cook is a Senior QA Automation Engineer and founder of AssertHired, an AI-powered mock interview platform for QA professionals. He has conducted 50+ automation engineer interviews and writes about QA career development. Find him on LinkedIn (16K+ followers) or at asserthired.com.
