
Advice #5: How to Mitigate Flaky Tests

Test flakiness poses a significant challenge in automated testing. In this post, we delve into a more precise definition of flaky tests and propose a straightforward yet effective way to mitigate them. Additionally, we present a real-world example to illustrate the impact of our approach.

Test flakiness is considered one of the main challenges of automated testing [1]. According to Google, flaky tests constituted 16% of all test failures within their system [2]. These flaky tests took 1.5 times longer to fix compared to non-flaky ones.

The traditional definition of flaky tests says that 'a flaky test is a test that generates inconsistent results, failing or passing unpredictably, without any modifications to the code under test.' Unfortunately, this definition is not precise, so let's make a better one.

Usually, a test is called flaky when it sometimes passes and sometimes fails. However, there are other cases as well. A test may always fail, but at different execution points; if it were not flaky, it would always fail at the same point and would detect a bug. Another case is when the results are inconsistent not because of the automated test case, but because the system itself is inconsistent. This means that even if the test is executed manually, the system sometimes works correctly and sometimes doesn't. That is a different problem (a flaky system), not a flaky test. Now we can define flaky tests as test cases whose execution leads to different outcomes, including failures at different execution points, assuming that the equivalent test case yields the same result whenever it is executed manually.
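To make the definition concrete, here is a minimal, self-contained sketch of a flaky test. All names are illustrative; the random call stands in for a real race condition or timeout, and the seed is fixed only so the illustration is reproducible.

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

def system_under_test():
    # Deterministic: a manual tester always observes the same result.
    return 42

def flaky_test():
    # The check itself is correct, but the test depends on an unreliable
    # condition (simulated here by random), so its outcome varies per run.
    if random.random() < 0.3:  # simulated race condition / timeout
        raise AssertionError("UI element not ready in time")
    assert system_under_test() == 42

results = []
for _ in range(20):
    try:
        flaky_test()
        results.append("pass")
    except AssertionError:
        results.append("fail")

print(sorted(set(results)))  # same test code, different outcomes
```

The system under test is deterministic, so a manual tester would always get the same result; only the automated test is flaky.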

Here we consider only flaky tests, not flaky (inconsistent) systems. There is a significant difference between failed and flaky tests: a failed test always fails at the same execution point, whereas a flaky test may pass. If a flaky test passes, then the (part of the) requirement covered by the test is correct. If so, why spend a lot of time and money fixing flaky tests? It is enough to re-execute them until they pass. Besides, a flaky test may pass on its first execution, in which case we don't even know that it is flaky.

Thus, instead of fixing flaky tests, execute them additional times; if they pass, they can be considered good tests demonstrating the correctness of the related requirement. In Harmony you can set the number of re-executions:

Here, the failed tests are re-executed at most twice. Following our earlier advice of keeping all test cases passing, only new faults lead to failed tests, plus the flaky ones. This means that only a small percentage of the tests need to be re-executed.
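The re-execution policy itself is simple. The sketch below is a language-agnostic illustration of the idea, not Harmony's actual implementation; the function and parameter names (`run_with_retries`, `max_retries`) are our own.

```python
# Sketch of a "re-execute on failure" policy, assuming a test is a
# callable that raises AssertionError on failure.

def run_with_retries(test, max_retries=2):
    """Run `test`; on failure, re-execute it up to `max_retries` times.

    Returns (passed, attempts). A flaky test that eventually passes is
    reported as passed; a genuinely failing test fails on every attempt.
    """
    for attempt in range(1, max_retries + 2):  # first run + retries
        try:
            test()
            return True, attempt
        except AssertionError:
            if attempt == max_retries + 1:
                return False, attempt

# A test that detects a real, stable bug exhausts all attempts.
def always_fails():
    assert False, "real defect"

passed, attempts = run_with_retries(always_fails, max_retries=2)
print(passed, attempts)  # False 3
```

With `max_retries=2`, a flaky test gets up to three chances to pass, while a test failing on a real defect still fails after all three attempts.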

A frequent cause of flakiness is a shared environment for parallel test execution. Even if all the test cases are independent and can be executed in any order, executing them in parallel may cause unexpected internal states, resulting in failed tests. Therefore, the failed test cases should be re-executed sequentially rather than in parallel.

For a test case that is both flaky and detects a bug, re-execution is not enough, as the test will fail every time. In this case, we should keep all the failure reports and select the one from the latest execution point. This is most likely the defect that the identical manual test would detect.
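Selecting the latest failure is a one-liner once the reports are kept. The report structure below (a step index plus a message) is an assumption for illustration, not Harmony's report format.

```python
# Sketch: for a test that fails on every attempt (flaky *and* buggy),
# keep every failure report and pick the one that got furthest through
# the test, since the deepest failure most likely matches what a manual
# execution would find.

def latest_failure(reports):
    """Each report is (failed_step_index, message); return the deepest."""
    return max(reports, key=lambda r: r[0])

reports = [
    (3, "timeout while loading page"),    # flaky early failure
    (12, "expected total 100, got 90"),   # real defect at the latest step
    (3, "timeout while loading page"),
]
print(latest_failure(reports))  # (12, 'expected total 100, got 90')
```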

Of course, we should analyze the tests that still fail after the final re-execution, and some flaky tests may remain. These can be fixed or replaced with better test cases. However, the number of remaining flaky tests is much smaller (or zero), so we can significantly reduce the cost of fixing flaky tests and shorten the bug-fix time.

We test Harmony with Harmony. Omitting test cases that are included in other test cases, we have 70 e2e tests. On the first execution five of them failed, and after re-execution, all test cases passed. Thus the system had to re-execute only five tests, i.e. 7%. This is not a huge number, and the test execution time didn't increase significantly. Below are some of the test cases for a feature:

Two test cases passed only on the second attempt (the first retry), i.e. they are flaky tests.


