How do flaky tests fail?

definition

Flaky test, flake, noun

A test that passes and fails with the exact same code.

How is that even possible if we haven't changed a single line of code?

You will see flaky tests described as non-deterministic. But what does that even mean?

Flaky test failures seem random. But if you pay close attention, flakiness depends on two things:

  • interactions between components
  • execution conditions

Interactions between components

Chances are you are familiar with the Test Pyramid:

[Figure: the Test Pyramid]

Components interacting

Here are the components that are usually involved for each test type:

  Component           Unit test   Integration test   Acceptance test
  Test code               ✓               ✓                 ✓
  Code under test         ✓               ✓
  Dependencies code                       ✓                 ✓
  Webdriver                                                 ✓
  Browser                                                   ✓
  Frontend code                                             ✓
  Backend code                                              ✓
The flakiness pyramid

The further up the test pyramid you go, the more components interact, and the more likely the test is to be flaky.

In a Unit Test, the only interaction is between the test code and the module under test. It cannot get any simpler.

In an Integration Test, your module is interacting with its dependencies. The more dependencies, the more complex the test will be.

In an Acceptance Test (or System Test, or End-to-End Test), the whole application is involved, from the frontend to the database. Your test code might trigger a click on a frontend button through a browser driver (like Selenium WebDriver), which fires an asynchronous HTTP request to your test server, which runs your backend code, which talks to the database.
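
To make that chain concrete, here is a minimal sketch using the Selenium WebDriver Ruby bindings, against a hypothetical app running on localhost:3000 (the URL and element names are made up for illustration):

  require "selenium-webdriver"

  driver = Selenium::WebDriver.for :chrome
  driver.navigate.to "http://localhost:3000/articles/new"
  driver.find_element(name: "title").send_keys("Hello")

  # One click sets off the whole chain: browser, asynchronous HTTP
  # request, backend code, database, and finally a page update.
  driver.find_element(css: "button[type='submit']").click

  driver.quit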

As you can see, an Acceptance Test is much more complex than an Integration Test, which is more complex than a Unit Test.

Asynchronous interactions

When you move up the test pyramid, chances are that some of these components will be interacting asynchronously.

For example:

  • an Ajax request from the client
  • an asynchronous indexing operation in Elasticsearch
  • a background job enqueued
  • a timeout

If you don't wait properly for these things to finish, flakiness will bite you.
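
For example, with Capybara (assuming a feature spec against a hypothetical "Save" button), the difference between a non-waiting and a waiting assertion is a single method call:

  require "capybara/rspec"

  RSpec.feature "Saving an article" do
    scenario "shows a confirmation" do
      visit "/articles/new"
      click_button "Save"

      # Flaky: reads the page once, possibly before the Ajax response
      # has updated the DOM.
      # expect(page.text).to include("Saved")

      # Robust: have_content retries until the text appears, up to
      # Capybara.default_max_wait_time.
      expect(page).to have_content("Saved")
    end
  end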

Asynchronous flakiness

The more asynchronous interactions there are, the more likely the test will be flaky.

Execution conditions

There are external conditions that can influence the way a test executes.

A series of events

We can see a test execution as a series of events. When these events happen in a specific order, the test passes. But external conditions can alter the order in which these events take place and result in a test failure.

Environment

Your laptop, your coworker's desktop computer and a CI cloud instance can have differences, such as:

  • CPU
  • Memory
  • Disk I/O
  • Networking I/O
  • Operating System
  • Dependency versions (runtime, databases, etc.)

Each of these can change the order of events, because one environment will execute some tasks faster than another.

For example, a test might run faster on your local environment than on the CI, because your machine is much more powerful.

Resources

The test and all the components involved use and compete for available resources, such as:

  • CPU
  • Memory
  • Networking I/O
  • Disk I/O

The availability of these resources will never be exactly the same between two different test executions.

Your computer might have other programs running, or a browser tab hogging your CPU.

The CI service might have "noisy neighbors" if some resources are shared with different clients or even with your coworkers.

Tests that ran previously might also have left the system in a different resource usage state.

Test execution order

note

Test execution order changes over time.

You change the execution order when you:

  • execute your tests in a random order (and you should — see the configuration sketch after this list)
  • add or remove a test
  • split the test suite into different parallel jobs to speed up the build time
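
Randomizing the order is a one-line configuration in RSpec; printing the seed lets you replay a failing order. This is the stock spec_helper.rb setup:

  # spec/spec_helper.rb
  RSpec.configure do |config|
    # Run examples in a random order to surface order dependencies.
    config.order = :random

    # Seed global randomness with RSpec's seed, so a failing order can
    # be reproduced later with: rspec --seed <seed>
    Kernel.srand config.seed
  end
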
Order flakiness

Tests that are dependent on execution order are more likely to be flaky.

A test can depend on execution order if:

  • a previous test leaves the system in a bad state (database records, global or class variables, cache entries, cookies, files)
  • it expects the system to be in a specific state, but does not set the state by itself

Examples of smells:

  • using before(:all) to create records in an RSpec test to speed up test execution (see the sketch after this list)
  • using class variables in the code under test
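
Here is a minimal sketch of the before(:all) smell and one way out, assuming a Rails app with transactional tests and a hypothetical User model:

  # Smell: before(:all) runs outside the per-example transaction, so
  # this record survives and leaks into the tests that run afterwards.
  RSpec.describe User do
    before(:all) do
      @user = User.create!(name: "Alice")
    end
  end

  # Safer: build the state independently for each example; the record
  # is rolled back with the example's transaction.
  RSpec.describe User do
    let(:user) { User.create!(name: "Alice") }
  end
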
Best practice

As a rule of thumb:

  • prepare the state for each test independently
  • revert to a clean state after each test execution
  • avoid using shared state between tests

Time and date

Each and every test execution happens at a different date/time. The code under test and the test code itself can sometimes depend on them.

caution

Tests that depend on date and/or time are more likely to be flaky.

Examples of smells:

  • asserting date/time without freezing it
  • mixing methods that respect timezones with methods that don't (e.g. Time.now vs. Time.zone.now in Rails)
Best practice

Always freeze time when testing time-dependent code.
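
As a sketch, using ActiveSupport's time helpers (Timecop offers a similar API) and a hypothetical Subscription model with an expired? predicate:

  require "active_support/testing/time_helpers"

  RSpec.describe Subscription do
    include ActiveSupport::Testing::TimeHelpers

    it "expires after 30 days" do
      subscription = nil

      # Freeze the clock at a known date while creating the record.
      travel_to Time.zone.local(2024, 1, 1) do
        subscription = Subscription.create!(expires_at: 30.days.from_now)
      end

      # Jump past the expiry date and assert; the result no longer
      # depends on when the test actually runs.
      travel_to Time.zone.local(2024, 2, 2) do
        expect(subscription).to be_expired
      end
    end
  end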

Others

Other things can lead to flakiness.

Unordered collections

caution

Tests that assert collections that have an unspecified order are more likely to be flaky.

This typically happens when you test the result of a query that has no ORDER BY clause.
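
A sketch with ActiveRecord and RSpec, assuming a hypothetical User model:

  # Flaky: with no ORDER BY, the database is free to return rows in
  # any order it likes.
  expect(User.where(active: true).pluck(:name)).to eq(["Alice", "Bob"])

  # Robust: assert order-independently...
  expect(User.where(active: true).pluck(:name)).to match_array(["Alice", "Bob"])

  # ...or make the order explicit in the query.
  expect(User.where(active: true).order(:name).pluck(:name)).to eq(["Alice", "Bob"])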

Random data

caution

Tests that use random data are more likely to be flaky.

You may generate random data to prepare your database state. A typical example is using Faker in your factories. When you do that, the random data might sometimes satisfy your test conditions, but sometimes trigger unexpected behavior.
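
As a sketch with FactoryBot (the factory names are made up for illustration):

  require "factory_bot"
  require "faker"

  FactoryBot.define do
    # Risky: Faker can occasionally produce a value that collides with
    # another record or trips an edge case (apostrophes, length, ...).
    factory :user do
      name { Faker::Name.name }
    end

    # More predictable: a fixed pattern, with a sequence where
    # uniqueness matters.
    factory :predictable_user, class: "User" do
      sequence(:name) { |n| "User #{n}" }
    end
  end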