What are flaky tests?

definition

Flaky test, flake, noun

A test that sometimes passes and sometimes fails with the exact same code.

It is reasonable to say that most developers who have worked on a significant software project have encountered flaky tests at one point or another.

Picture this:

I have finished working on a feature. The pull request is ready and has been reviewed, the tests are written, the CI build is green on my branch, and I finally squash and merge it back into master. It will soon land in production thanks to our Continuous Deployment system.

But then it happens... again. The master branch is red, the deployment is blocked and the failing test has nothing to do with what I have changed. I cannot leave master red, so I click "rebuild" in our CI interface and cross my fingers. Will there be yet another surprise?

This is just one of the many flaky test stories we can all relate to.

They sabotage productivity

The main problem with flaky tests is that they kill your team's productivity and morale over time.

tip

A fast feedback loop is at the core of high-performing software teams.

Flow is a key factor in a developer's productivity and happiness, and the waiting time after a code push kills it.

The longer the feedback takes, the longer you are prevented from moving on to your next task. And if you have already moved on, the context switch caused by a red build will interrupt your flow.

In a Continuous Deployment context, flaky tests will also delay deployments.

Overall, flaky tests add a lot of friction and prevent your team from reaching its full potential.

They erode confidence

tip

A stable test suite is what enables software engineers to be confident that what they develop works and hasn't broken anything.

If you lose control over flaky tests, the team will:

  • retry or even ignore red builds
  • lose time and get frustrated trying to debug flaky tests
  • delete tests they cannot fix
  • think that writing tests is not worth it
  • invest less time in writing them

caution

Eventually bugs will start to creep into production, quality will drop and your users and your business will pay the price.

Another consequence is that engineers will lose trust in your delivery pipeline over time. They will think that your company is doing it wrong and wish they worked on a high-performing team.

They are hard to fix

They really are one of the hardest things a software engineer has to deal with (after cache invalidation and naming things, of course).

note

To fix a bug efficiently, it usually takes:

  • artifacts about what is breaking (exception, stacktrace, logs)
  • the ability to reproduce it locally
  • efficient tools (tests, debugger)
  • a fast feedback loop when fixing it

warning

With flaky tests, you have (almost) none of that.

You have very few artifacts to analyze what is happening. At most, you usually have a failing assertion in your test, maybe an exception. But the error you are seeing is usually far removed from the root cause.

To reproduce the flaky test, you can run it a hundred times locally and hope to see the error. Even if you get lucky, it does not really help, because what you actually need is a way to reproduce it consistently.
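If you still want to try brute-force reproduction, here is a minimal sketch in Python. It assumes a pytest suite and uses a hypothetical test id; adjust TEST_CMD to your own test runner. At best, it gives you one failing run and its output as an artifact to analyze:

import subprocess
import sys

# Hypothetical example: re-run a single test until it fails or we give up,
# and keep the output of the failing run as an artifact.
TEST_CMD = ["pytest", "tests/test_checkout.py::test_payment", "-q"]  # adjust to your suite
MAX_RUNS = 100

for attempt in range(1, MAX_RUNS + 1):
    result = subprocess.run(TEST_CMD, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Flaked on run #{attempt}")
        # Save the output so you have at least one artifact to analyze.
        with open("flaky_failure.log", "w") as log:
            log.write(result.stdout)
            log.write(result.stderr)
        sys.exit(1)

print(f"No failure after {MAX_RUNS} runs")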

If you think you know the root cause and push a fix, the feedback loop is extremely slow. It could take days before the flaky test fails again.

They are nobody’s responsibility

note

While flaky tests should be everybody's responsibility, it is very hard for a team to collectively own the problem and handle it in the long run.

Usually, one motivated developer who really cares will spend some time debugging and fixing a few flaky tests. But they also have other priorities and the pressure to deliver.

But fixing flaky tests is a long-term effort that you need to sustain. The "hero" developer will get tired of the team relying on them to handle it alone.

What you need is a strategy, tools and collective responsibility to succeed.

They are inevitable

It is easy to think that your team is alone with this problem and that you must be doing something wrong.

But it turns out that every team has to deal with flaky tests, even the best ones.

A dedicated team at Google

Google has a team fully dedicated to providing accurate and timely information about test flakiness, to help developers know whether they are being harmed by it.

Flaky Tests at Google and How We Mitigate Them

John Micco's presentation, The State of Continuous Integration Testing @Google, provides some dizzying numbers:

Almost 16% of our 4.2M tests have some level of flakiness.

We spend between 2 and 16% of our compute resources re-running flaky tests.

85% of green to red transitions are due to flakes.

Flakes insertion rate is equal to fix rate.

note

It is important to realize that even the most talented engineers with loads of resources are having a hard time managing flakiness.

Unfortunately, most companies, and especially startups, cannot afford to dedicate resources to managing them.

Yet, not taking care of them undermines your productivity, confidence, and even happiness at work.

Who wants to wait yet another CI build before moving on? Who wants to deploy with the fear of breaking something?

They are a huge problem

tip

Since flaky tests are ubiquitous and inevitable, engineering teams must learn how to deal with a certain level of flakiness, and most importantly learn how to minimize the cost to developers.

The cost of flaky tests is far from negligible, quite the opposite.

Big players like Google know that all too well and they take it very seriously.

The scale at which they operate makes it even worse. They are literally losing millions of dollars in developer time and compute resources.

The problem is that smaller teams usually don't realize that they are also losing a lot of money and energy because of them. Of course, you are not losing as much as Google, but the loss is proportional to the size of your apps and your team, so it is still significant for your budget.

Here is a simple calculation that can help you estimate the cost of flakiness for your team:

10 min CI build time
* 1 re-run/day because of a flaky test
* 6 developers
* $100/hour developer cost / 60 min
* 20 working days in a month
= $2,000/month

Can you imagine? $2,000/month just for one flaky test failing every day?

And this is a very conservative estimate. It does not even take into account:

  • the execution cost on the CI
  • the fixing cost
  • the frustration cost
  • the context switch cost
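
If you want to plug in your own numbers, here is the same back-of-the-envelope estimate as a small Python sketch. The figures below are the assumptions from the calculation above (and they only cover the basic waiting cost, not the extra costs listed just before); adjust them to your team:

# Back-of-the-envelope monthly cost of one flaky test failing once a day.
ci_build_minutes = 10          # duration of one CI build
reruns_per_day = 1             # builds re-run because of the flaky test
developers = 6                 # developers waiting on those builds
developer_cost_per_hour = 100  # fully loaded cost, in dollars
working_days_per_month = 20

monthly_cost = (
    ci_build_minutes
    * reruns_per_day
    * developers
    * (developer_cost_per_hour / 60)  # convert hourly cost to per-minute
    * working_days_per_month
)

print(f"${monthly_cost:,.0f}/month")  # => $2,000/month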

What now?

This documentation will help you:

  • define a strategy to tackle flaky tests
  • understand the most common root causes and their resolution
  • learn how to collect artifacts that will be useful for the diagnosis

If you have a flaky test problem and need help, contact me.