What are flaky tests?
Flaky tests produce inconsistent results from run to run without any change to the code. You can rerun the same CI job on the same commit 10 times and see different outcomes. These are the tests that fail and block your PRs when you fix a typo or bump a dependency, even though your changes couldn’t possibly have broken anything. They’re the tests that make you habitually hit the rerun button when a job fails.
Why do flaky tests occur?
Flaky tests are most common in integration tests and e2e (end-to-end) tests. These tests tend to be complex and can be flaky for a multitude of reasons.
- Flaky test code (bad)
- Flaky application code (really bad)
- Flaky infrastructure (pain 🙃)
This flakiness is inevitable to some degree. You write complex e2e and integration tests because they exercise your complex system as a whole and give you a higher degree of confidence in it, which is why the best teams in the world write them. It’s a conscious tradeoff between complexity, flakiness, and confidence: you accept more complex and potentially flakier tests in exchange for confidence that the complete system works.
Test Type | Complexity | Confidence | Flakiness |
---|---|---|---|
Unit tests | Low | Low | Low |
Integration tests | Medium | Medium | Medium |
End-to-End tests | High | High | High |
We can break down the causes of flaky tests into a few common categories. Let me give you some examples.
Machine learning and data science
Python is widely used for training and deploying machine learning models, and applications of this kind are probabilistic by nature. If you're testing training code or a regression against a data set, the convergence of a model given fixed parameters is non-deterministic unless you're willing to let the model train for an impractical number of epochs.
What this means is that flakiness in machine learning and some data science applications is almost inevitable. The best you can do in these cases is to tweak the assertion criteria so the tests are less flaky. We'll also discuss how you can do this in a later section.
Concurrency and Test Order Dependencies
When tests are run concurrently and in random order, tests that are otherwise stable can become flaky due to isolation problems. Consider a scenario where test A updates a row in a table and test B deletes that row. Executed alone, these tests produce deterministic results. Executed concurrently, they can become flaky if test B deletes the row before test A’s assertion completes.
Test order matters for the same reason: if test A and test B are executed in a different order, they can affect each other’s results. If test B deletes the row before test A runs, test A fails. The order and timing in which these tests execute determine their results, as the sketch below illustrates.
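As an illustration (the db helper and the shared orders row are hypothetical), two tests like these are fine in isolation but racy when run concurrently or in a different order:
from example import db  # hypothetical database client

ROW_ID = 42  # both tests touch this same row, which is the root of the problem

def test_a_updates_row():
    db.update("orders", ROW_ID, {"status": "shipped"})
    # Flaky when run concurrently: fails if test_b has already deleted the row
    assert db.get("orders", ROW_ID)["status"] == "shipped"

def test_b_deletes_row():
    db.delete("orders", ROW_ID)
    assert db.get("orders", ROW_ID) is None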
External resources
To be truly confident in your test results, you need tests that cover your entire system. That sometimes means depending on external resources that aren’t set up and torn down with your test run.
This commonly means running tests against a dev or staging environment, against external databases and APIs, or interacting with a third-party service. These systems’ states are inconsistent from run to run. Whether it’s a QA engineer who unknowingly updates a row that your e2e tests depend on or it’s an external API timing out because of slow network performance, these resources are unreliable. Unreliable external resources create flaky tests.
Lack of isolation
Your tests should ideally be isolated: independent of each other and independent from run to run. Good tests read, create, update, and delete artifacts such as database entries, uploaded files, created users, and scheduled background jobs. This is great for writing realistic tests that inspire confidence, but those artifacts need to be cleaned up properly so they don’t affect the results of future runs.
To have truly independent tests, you’d need to run each one on a fresh setup. But you can’t reasonably create an entire VM or cluster of VMs for every test, which is what perfect test-to-test isolation would require. You often can’t even run each CI job against a fresh setup; if you tried, you’d either pay buckets for CI resources or wait hours for resources on a busy day.
The alternative is to run setup and teardown between tests, test suites, or CI jobs. The problem is that these are difficult to get perfect. You might forget some teardown logic here and there, bits of orphaned code might sprinkle random artifacts around, and jobs that crash or are cancelled before completion will also leave behind a mess. The build-up of these unexpected artifacts causes flakiness.
Inconsistent runtimes
Each CI machine performs differently due to differences in CPU performance, disk performance, available RAM, and network load. This can cause some seriously annoying problems to debug, usually out-of-memory errors or tests timing out because a particular CI machine is slower.
These are extra annoying to fix because they’re impossible to reproduce locally. After all, your development machine is probably several times more powerful than any CI machine. These problems are also more common if your tests involve physical devices, such as phones and watches for native development or GPUs for AI and simulations.
Buggy production code
Sometimes, the root of flakiness isn’t in your test code but in your production code. This means you have actual bugs to deal with. You might have queries or APIs that occasionally fail in production. You might have discovered unhandled edge cases. You may have stumbled upon problems with your infrastructure if you test on a development or staging environment.
This is probably the most concerning type of flaky test, but also the hardest to fix. You want to fix these problems since this inconsistent behavior could affect paying customers. Fixing these problems is often difficult because they might result from multiple moving parts in a larger system. We'll discuss how to best approach these difficult fixes in a later section.
Why should I be concerned?
There’s good reason to be concerned about flaky tests. Flaky tests are rarely the result of bad test code alone. They could indicate flakiness in your infrastructure and production code. For many teams, this is the single biggest reason for investigating and fixing every flaky test.
You might also notice the impact of flaky tests on the reliability of your CI. You rely on CI as a guardrail so you can ship quickly without worrying about regressions or breaking existing features. Flaky tests both erode your trust in CI results and seriously slow down your team. If your CI jobs fail 20% of the time for spurious reasons, you’re forced to choose between rerunning jobs, which slows your team down, and ignoring failures, which poisons your test culture.
“Flaky tests are poison. You can have the best tests in the world, the most sophisticated testing that finds the craziest bugs no one would think of, but if developers can’t trust the results 99% of the time, it doesn’t matter.”
Roy Williams, Meta
Dealing with Flaky Tests
Flaky tests are inevitable if you write e2e tests at scale. Your software is complex, so you need complex tests to validate your changes with confidence, and complex tests will have some flakiness. It’s a matter of economics: you can make your tests both complex and reliable, but that takes engineering hours you may not be able to justify. Instead, focus on reducing the impact of flaky tests and tackling the high-impact ones efficiently.
To effectively reduce the impact of flaky tests, you should do the following:
- Avoid flaky tests by learning common anti-patterns
- Detect flaky tests with automated tools and communicate them with your team
- Quarantine known flaky tests to mitigate their impact on the team
- Fix high-impact flaky tests first for the best bang for your buck
Let’s walk through each of these steps in more detail.
Avoiding Flaky Tests in Pytest
Avoid overly strict assertions
One of the quirks of Python floating point arithmetic is that numbers that logically should be equal sometimes aren't. Take the following example:
>>> 0.1 + 0.2 == 0.3
False
>>> 0.2 + 0.2 == 0.4
True
This is due to the way floating point numbers are represented in memory: values like 0.1 and 0.2 can't be stored exactly, so the sum 0.1 + 0.2 evaluates to a number very close to 0.3 (0.30000000000000004) but not exactly equal to it. This problem also exists in other languages like JavaScript and C++, but it's much more likely to cause trouble in common Python applications like machine learning and data science.
Pytest provides a great way to combat floating point errors with approx(). You can use it in floating point assertions and even with numpy arrays:
>>> from pytest import approx
>>> import numpy as np
>>> 0.1 + 0.2 == approx(0.3)
True
>>> np.array([0.1, 0.2]) + np.array([0.2, 0.4]) == approx(np.array([0.3, 0.6]))
True
You can also assign relative and absolute tolerances for these approximations, which you can read more about in the Pytest docs on approx().
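For example, a quick sketch of passing tolerances to approx() (the tolerance values here are arbitrary):
>>> 0.1 + 0.2 == approx(0.3, rel=1e-6)
True
>>> 1001.0 == approx(1000.0, abs=2.0)
True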
Another place to avoid overly strict assertions is machine learning applications. The convergence of some algorithms, even over a consistent data set, is non-deterministic. You will have to tweak these assertions into a good approximation of "good enough" to avoid excessive flakiness, as in the sketch below.
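A minimal sketch of that idea, assuming a hypothetical train_and_evaluate() helper that trains a small model and returns its validation accuracy:
import random
import numpy as np
from example import train_and_evaluate  # hypothetical training helper

def test_model_is_good_enough():
    # Pin seeds to reduce (but not eliminate) run-to-run variation
    random.seed(42)
    np.random.seed(42)
    accuracy = train_and_evaluate(epochs=5)
    # Assert a tolerant lower bound instead of an exact score
    assert accuracy >= 0.9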
Avoid hard awaits
Asynchronous programming in Python is complicated by the limitations of the Python interpreter, but that hasn’t stopped Python developers from trying. The problem with the many libraries that introduce multi-threading, multi-processing, and event-loop based asyncio is that they’re more awkward to test in Python. We’ll use asyncio as the example here, but the same principles apply to other implementations of async or multi-threaded programming in Python.
In Pytest, if you naively try to test an async function written with asyncio, the test won’t actually run: Pytest doesn’t natively support async test functions, so the coroutine is never awaited and the test is skipped with a warning.
from example import async_method

async def test():
    res = await async_method(0)
    # This assertion never runs because Pytest skips the un-awaited coroutine
    assert res == 42
The best solution is to use a library like pytest-asyncio to properly support async operations:
import pytest
from example import async_method

@pytest.mark.asyncio
async def test():
    res = await async_method(0)
    assert res == 42
Properly awaiting for async operations will usually result in a lower chance of flakiness in Python.
What often causes flakiness when dealing with async operations is naively waiting for some time. This approach is intuitive but will be flaky because async operations don’t return in a deterministic amount of time. You can try stretching your sleep time longer and longer, which might become more reliable, but your test suites will take ages to run.
Naively waiting will almost always cause some degree of flakiness, so if you see code snippets like this in your test suite, fix them immediately:
import asyncio
import pytest
from example import async_method

@pytest.mark.asyncio
async def test():
    # Kick off the operation without awaiting its completion
    task = asyncio.create_task(async_method(0))
    # Naive wait: hope 2 seconds is always enough for the task to finish
    await asyncio.sleep(2)
    # Flaky: fails whenever the operation takes longer than 2 seconds
    assert task.result() == 42
You should also avoid mocking the async code for testing purposes. Mocking async introduces unnecessary complexity and reduces your confidence level in your test by making it less realistic.
Control your test environment
If your test environment changes from run to run, your tests are much more likely to be flaky. Your Pytest e2e and integration tests will likely depend on your entire system's state to produce consistent results. If you test on a persistent test environment or a shared dev/staging environment, that environment might differ each time the test is run.
This is particularly problematic if you're using a library like pytest-xdist to distribute your tests across multiple CPUs. Testing in parallel and in random order introduces opportunities for both inconsistent system states to cause flakiness and race conditions where concurrently running tests interfere with each other.
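For reference, pytest-xdist is typically invoked with the -n flag, where auto spawns one worker per available CPU:
pytest -n auto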
If you're using a persistent testing environment, you must ensure that artifacts created by your tests, such as new database tables and rows, files created in storage, and new background jobs started, are properly cleaned up. Leftover data from run to run might affect the results of future test runs.
Similarly, you should be careful when testing against a development or staging environment. Other developers create and destroy data in these environments as part of their normal work. This constantly changing environment can cause flakiness if someone accidentally updates or deletes tables or files used during testing, creates resources whose IDs collide with those a test expects to be free, or uses up the environment's compute resources and causes tests to time out.
The easiest way to avoid these problems is to use a fresh environment whenever possible when you run your tests. A transient environment created for each test run and destroyed at the end helps ensure that every run starts from a consistent state and leaves no artifacts behind. If that's impossible, pay close attention to how you design your tests' setup and teardown so that tests don't depend on leftover state. Each test should set up and tear down its own resources, as in the fixture sketch below.
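A minimal sketch of per-test setup and teardown using a yield fixture (the orders table and db helper are hypothetical); the code after the yield runs even when the test fails:
import pytest
from example import db  # hypothetical database client

@pytest.fixture
def order_row():
    # Setup: each test creates its own row instead of sharing one
    row_id = db.insert("orders", {"status": "pending"})
    yield row_id
    # Teardown: runs even if the test fails, so no artifacts leak into later runs
    db.delete("orders", row_id)

def test_order_can_be_cancelled(order_row):
    db.update("orders", order_row, {"status": "cancelled"})
    assert db.get("orders", order_row)["status"] == "cancelled"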
Limit External Resources
While it's important to test all parts of your app fully, you should limit the number of tests that involve external resources, like external APIs and third-party integrations. These resources may go down, they might change, and you may hit rate limits; they're all potential sources of flaky failures. Aim to cover each external resource in a few dedicated tests rather than involving it in your tests excessively.
Mocking is a decent strategy for testing specific behaviors, but it should not be used as a replacement for end-to-end testing. You can't entirely avoid external resources, but you should be mindful of which test suites involve these resources. Have some dedicated tests that involve external resources and mock them for other tests.
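As a sketch of that split (example.fetch_exchange_rate and example.price_in_eur are hypothetical wrappers around a third-party currency API):
from unittest.mock import patch
import example  # hypothetical module wrapping a third-party currency API

def test_price_in_eur_with_mocked_api():
    # Patch the external call so this test can't flake on network errors or rate limits
    with patch("example.fetch_exchange_rate", return_value=2.0):
        assert example.price_in_eur(100) == 200
A few dedicated e2e tests can still exercise the real API so you keep coverage of the integration itself.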
Automatic Reruns
If you only have a few flaky tests, rerunning tests on failure to clear flakiness can be effective. Libraries like pytest-rerunfailures make reruns easy to configure, retrying each failed test to see whether it passes on another attempt.
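For example, with pytest-rerunfailures installed (the retry counts below are arbitrary):
pytest --reruns 3 --reruns-delay 1
You can also mark individual tests with @pytest.mark.flaky(reruns=3) if only a few of them need retries.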
This approach does have its own tradeoffs. It costs more CI resources and CI time, which, depending on the size of your test suite, can be a negligible or a profound problem. If you have hundreds of thousands of tests and a few thousand of them are flaky, this gets unbelievably expensive. Flaky tests will also still slip through the cracks occasionally by failing every rerun purely by chance.
A much bigger problem is that reruns just cover up the problem. When you introduce new flaky tests, you may not even notice. This is bad because some flaky tests indicate real problems with your code, and the silent build-up of flaky tests can cause scaling problems within your CI system.
Detecting Flaky Tests in Pytest
If you only have a few flaky tests, a great place to start is by having a central place to report them, such as a simple spreadsheet. To identify flaky tests, you can use commit SHAs and test retries. A commit SHA is a unique identifier for a specific commit in your version control system, like Git. If a test passes for one commit SHA but fails for the same SHA later, it's likely flaky since nothing has changed.
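A minimal sketch of that bookkeeping, assuming you've exported CI results as (commit_sha, test_name, outcome) records:
from collections import defaultdict

def find_flaky_tests(results):
    # results: iterable of (commit_sha, test_name, outcome) tuples,
    # where outcome is "passed" or "failed"
    outcomes = defaultdict(set)
    for sha, test_name, outcome in results:
        outcomes[(sha, test_name)].add(outcome)
    # A test that both passed and failed on the same commit is likely flaky
    return {test for (sha, test), seen in outcomes.items() if {"passed", "failed"} <= seen}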
Dropbox open-sourced a solution that helps you find flaky tests called pytest-flakefinder. This allows you to automatically rerun each test multiple times to find flaky tests. By default, the tool will run each test 50 times on the same commit and validate if the results are consistent. On the same commit, the code should be identical from run to run. If a test is not producing consistent results, it is considered flaky.
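The typical invocation looks like this (flag names as documented in the plugin's README; double-check against your installed version):
pytest --flake-finder --flake-runs=50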
Running tests multiple times to find flaky tests is a great way to audit your existing repo or new suites of tests. The problem with using a tool like this is that it relies on a "periodic cleanup" type workflow, instead of monitoring continuously. Relying on finding time to periodically clean up tech debt rarely works. Other higher-priority tasks, typically anything but testing, will displace this workflow and you'll end up with a pileup of flaky tests again.
Continuously monitoring flaky tests also gives the benefit of catching flaky tests due to changes in external resources, tool rot, and other sources of flakiness that happen due to drift and not code change. These problems are often easier to fix when caught early. Continuous monitoring also tells you if your fixes actually worked or not, since most flaky tests aren't fixed completely on the first attempt but will see reduced flakiness.
Some form of continuous monitoring is especially important for Python applications that test ML or data science-related code. As we mentioned before, many ML and data science applications are inherently non-deterministic, and the best fix for these flaky tests is to just tweak the assertion parameters so they're acceptably flaky.
Collecting test results in CI and comparing results on the same commit is a great starting point for detecting flaky tests continuously and automatically. There are some nuances to consider for the different branches on which the tests are run. You should assume that tests should not normally fail on the main branch (or any other stable branch) but can be expected to fail more frequently on your PR and feature branches. How much weight you give to each branch's flakiness signal will depend on your team's circumstances.
If you're looking for a tool for this, Trunk Flaky Tests automatically detects and tracks flaky tests for you. Trunk also aggregates and displays the detected flaky tests, their past results, relevant stack traces, and flaky failure summaries on a single dashboard.
For example, you'll get a GitHub comment on each PR, calling out if a flaky test caused the CI jobs to fail.
Quarantining Flaky Tests in Pytest
What do you do with flaky tests after you detect them? In an ideal world, you'd fix them immediately. In reality, they're usually sent to the back of your backlog. You have project deadlines to meet and products to deliver, all of which deliver more business value than fixing a flaky test. What's most likely is that you'll always have some known but unfixed flaky tests in your repo, so the goal is to reduce their impact before you can fix them.
We've written in a past blog that flaky tests are harmful for most teams because they block PRs and reduce trust in tests. So once you know a test to be flaky, it's important to stop it from producing noise in CI and blocking PRs. We recommend you do this by quarantining these tests.
Quarantining is the process of continuing to run a flaky test in CI without allowing its failures to block PR merges. We recommend this over disabling or deleting tests because disabled tests are usually forgotten and swept under the rug. We want our tests to produce less noise, not zero noise, because zero noise usually means the signal is gone too.
It's important to note that studies have shown that initial attempted fixes for flaky tests usually don't succeed. You need to have a historical record to know if the fix reduced the flake rate or completely fixed the flaky test. This is another reason why we believe you should keep running tests quarantined.
To quarantine tests in Pytest, you can mark them as quarantined and run them in a separate CI job that doesn't block your PRs.
import random
import pytest

@pytest.mark.quarantined
def test_flaky():
    # Generates a random float between 0 and 1
    random_value = random.random()
    assert random_value <= 0.6
You can then run them with pytest -v -m "quarantined" and exclude them with pytest -v -m "not quarantined".
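To keep Pytest from warning about an unknown mark, register the custom marker in your configuration, for example in pytest.ini:
[pytest]
markers =
    quarantined: known flaky tests that should not block PRs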
Some limitations here:
- Quarantining requires a code change instead of happening dynamically at runtime.
- It doesn't integrate with automated systems that detect flaky tests and quarantine them.
- Quarantined tests produce very little noise and are easily forgotten.
A better way to approach this is to quarantine at runtime, which means quarantining failures without updating the code. As tests are labeled flaky or return to a healthy status, they should be quarantined and unquarantined automatically. This is especially important for large repos with many contributors. One possible approach is to host a record of known flaky tests, run all tests, and then check whether all failures are due to known flaky tests. If they are, override the exit code of the CI job so it passes.
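A rough sketch of that approach, assuming the CI job writes a JUnit XML report and you keep a known_flaky.txt list of flaky test names (both file names are hypothetical):
import subprocess
import sys
import xml.etree.ElementTree as ET

KNOWN_FLAKY = set(open("known_flaky.txt").read().split())

# Run the suite and capture its exit code instead of failing immediately
exit_code = subprocess.call(["pytest", "--junitxml=report.xml"])

# Collect the names of tests that failed or errored
failed = set()
for case in ET.parse("report.xml").getroot().iter("testcase"):
    if case.find("failure") is not None or case.find("error") is not None:
        failed.add(case.get("name"))

# Pass the job only when every failure is a known flaky test
sys.exit(0 if failed and failed <= KNOWN_FLAKY else exit_code)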
Again, this is really useful for ML and data science applications, where models become more or less flaky when updated. You might just want to quarantine certain tests that are flaky because the models don't converge in a short enough amount of time. Since they still run, you can track their failure rate to make sure they're not degrading and inspect the failure logs to catch actual errors.
If you're looking for a way to quarantine flaky tests at runtime, Trunk can help you here. Trunk will check if failed tests are known to be flaky and unblock your PRs if all failures can be quarantined.
Learn more about Trunk's Quarantining.
Fixing Flaky Tests
We've covered some common anti-patterns earlier to help you avoid flaky tests, but if your flaky test is due to a more complex cause, how you approach a fix will vary heavily. We can't show you how to fix every way your tests flake; it can be very complex. Instead, let's cover prioritizing which tests to fix and reproducing flaky tests.
When deciding which tests to fix first, we need a way to rank them by impact. A good measure we discovered while working with our partners is the number of PRs a flaky test blocks: fix the tests blocking the most PRs first. Ultimately, we want to eliminate flaky tests because they block PRs from merging when they fail in CI. You can track the number of times a known flaky test fails in CI on PR branches, either manually for smaller projects or automatically with a tool, as in the sketch below.
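A small sketch of that ranking, assuming you've already collected flaky failures on PR branches as (test_name, pr_number) pairs:
from collections import Counter

def rank_by_blocked_prs(flaky_failures):
    # flaky_failures: iterable of (test_name, pr_number) pairs from CI runs on PR branches
    blocked = Counter()
    for test_name, pr_number in set(flaky_failures):  # count each PR at most once per test
        blocked[test_name] += 1
    # Fix the tests that block the most PRs first
    return blocked.most_common()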
This also helps you justify the engineering effort put toward fixing flaky tests, because with tech debt, justifying the time invested is often a bigger blocker than the fix itself. When you reduce the number of blocked PRs, you save expensive engineering hours. You can further extrapolate the hours saved by factoring in context-switching costs, which some studies put at roughly 23 minutes per context switch for knowledge workers.
This approach is especially relevant for flakiness in ML and data science apps, where a complete fix is often impractical. For these applications, a fix usually means finding the high-impact flaky tests and relaxing how strict their assertions are.
After finding the high-impact tests to fix first, you can use a library like pytest-replay to help reproduce failures locally. Difficult-to-reproduce tests often reveal real problems with your infrastructure or code. Pytest-replay lets you replay failures from run records saved in CI, which can be a real time-saver.
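A typical workflow looks roughly like this (flag names per the pytest-replay docs; verify them against your installed version): record the run in CI with pytest --replay-record-dir=build/replay, then replay the same tests in the same order locally with pytest --replay=&lt;recorded file&gt;.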
If you're looking for a straightforward way to report flaky tests for any language or framework, see their impact on your team, find the highest impact tests to fix, and track past failures and stack traces, you can try Trunk Flaky Tests.
Learn more about Trunk Flaky Tests dashboards.
Need help?
Eliminating flaky tests takes a combination of well-written tests, a test framework whose capabilities you actually use, and good tooling. If you write e2e tests with Python and Pytest and face flaky tests, know that you're not alone. You don't need to invent your own tools. Trunk can give you the tools needed to tackle flaky tests:
- Autodetect the flaky tests in your build system
- See them in a dashboard across all your repos
- Quarantine tests manually or automatically
- Get detailed stats to target the root cause of the problem
- Get reports weekly, nightly, or instantly sent right to email and Slack
- Intelligently file tickets to the right engineer