On false negatives and false positives

In a recent post, I wrote about how trust in your test automation is needed to create confidence in your system under test. In this follow up post (of sorts), I’d like to take a closer look at two phenomena that can seriously undermine this trust: false positives and false negatives.

False positives
Simply put, false positives are test instances that fail without there being a defect in the application under test, i.e., the test itself is the reason for the failure. False positives can occur for a multitude of reasons, including:

  • No appropriate waiting is implemented for an object before your test (for example written using Selenium WebDriver) is interacting with it.
  • You specified incorrect test data, for example a customer or an account number that is (with reason) not present in the application under test.

False positives can be really annoying. It takes time to analyze their root cause, which wouldn’t be so bad if the root cause was in the application under test, but that would be an actual defect, not a false positive. The minutes (hours) spent getting to the root cause of tests that fail because they’ve been poorly written would almost always have been better spent otherwise, on writing stable and better performing tests in the first place, for example.

If they’re part of an automated build and deployment process, you can find yourself in even bigger trouble with false positives. They stall your build process unnecessarily, thereby delaying deployments that your customers or other teams are waiting for.

There’s also another risk associated with false positives: when they’re not taken care of as soon as possible, people will start taking them for granted. Tests that regularly or consistently cause false positives will be disregarded and, ultimately, left alone to die. This is a real waste of the time it took to create that test in the first place, I’d say. I’m all for removing tests from your test base if they no longer serve a purpose, but removing them just because they fail either intermittently or every time is NOT a good reason to discard them.

And talking about tests failing intermittently, those (also known as flaky tests) are the worst, simply because the root cause for their failure often cannot be easily determined, which in turn makes it hard to fix them. As an example: I’ve seen tests that ran overnight and failed sometimes and passed on other occasions. It took weeks before I found out what caused them to fail: on some test runs this particular test (we’re talking about an end to end test that took a couple of minutes to run here) was started just before midnight, causing funny behavior in subsequent steps that were completed after midnight (when a new day started). On other test runs, either the test started after midnight or was completed before midnight, resulting in a ‘pass’. Good luck debugging that during office hours!

False negatives
While false positives can be really annoying, the true risk with regards to trust in test automation is at the other end of the unreliability spectrum: enter false negatives. These are tests that pass but shouldn’t, because there IS an actual defect in the application under test, it’s just not picked up by the test(s) responsible for covering the area of the application where the defect occurs.

False negatives are far more dangerous than false positives, since they instill a false sense of confidence in the quality of your application. You think you’ve got everything covered with your test set, and all lights are green, but there’s still a defect (or two, or ten, or …) that goes unnoticed. And guess who’s going to find them? Your users. Which is exactly what you thought you were preventing by writing your automated regression tests.

Detecting false negatives is hard, too, since they don’t let you know that they’re there. They simply take up their space in your test set, running without trouble, never actually catching a defect. Sometimes, these false negatives are introduced at the time of writing the test, simply because the person responsible for creating the tests isn’t paying attention. Often, though, false negatives spring into life over time. Consider the following HTML snippet, representing a web page showing an error message:

		<div class="error">Here's a random error message</div>

One of the checks you’re performing is that no error messages are displayed in a happy path test case:

@Test(description="Check there is no error message when login is successful")
public void testSuccessfulLogin() {
	LoginPage lp = new LoginPage();
	HomePage ep = lp.correctLogin("username", "password");

public bool hasNoErrorText() {

	return driver.findElements(By.className("error")).size() == 0;

Now consider the event that at some point in time, your developer decides to mix up class annotations (not a highly unlikely event!), which results in a new way of displaying error messages:

		<div class="failure">Here's a random error message</div>

The aforementioned test will still run without trouble after this change. The only problem is that its defect finding ability is gone, since it wouldn’t notice in case an error message IS displayed! Granted, this might be an oversimplified example, and a decent test set would have additional tests that would fail after the change (because they expect an error message that is no longer there, for example), but I hope you can see where I’m going with this: false negatives can be introduced over time, without anyone knowing.

So, how to reduce (or even eliminate) the risk of false negatives? If you’re dealing with unit tests, I highly recommend experimenting with mutation testing to assess the quality of your unit test set and its defect finding ability. For other types of automated checks, I recommend the practice of checking your checks regularly. Not just at the time of creating your tests, though! Periodically, set some time aside to review your tests and see if they still make sense and still possess their original defect finding power. This will reduce the risk of false negatives to a minimum, keeping your tests fresh and preserving trust in your test set and, by association, in your application under test. And isn’t that what testing is all about?

Trust automation

By now, most people will be aware that the primary goal of test automation is NOT ‘finding bugs’. Sure, it’s nice if your automated tests catch a bug or two that would otherwise find its way into your production environment, but until artificial intelligence in test automation takes off (by which I mean REALLY takes off), testers will be way more adept at finding bugs than even the most sophisticated of automated testing solutions.

No, the added value of test automation is in something different: confidence. From the Oxford dictionary:

Confidence: Firm belief in the reliability, truth, or ability of someone or something.

A set of good automated tests will instill in stakeholders (including but definitely not limited to testers) the confidence that a system under test produces output or performs certain functions as previously specified (whether that’s also the intended way a system should work is a whole different discussion). Even after possibly rounds and rounds of changes, fixes, patches, refactorings and other updates.

This confidence is established by trust:

Trust: The feeling or belief that one can have faith in or rely on someone or something.

Until recently, I wasn’t aware of the subtle difference between trust and confidence. It doesn’t really help either that both terms translate to the same word in Dutch (‘vertrouwen’). Then I saw this explanation by Michalis Pavlidis:

Both concepts refer to expectations which may lapse into disappointments. However, trust is the means by which someone achieves confidence in something. Trust establishes confidence. The other way to achieve confidence is through control. So, you will feel confident in your friend that he won’t betray you if you trust him or/and if you control him.

Taking this back to software development, and to test automation in particular: confidence in the quality of a piece of software is achieved by control over its quality or by trust therein. As we’re talking about test automation here, more specifically the automated execution of predefined checks, I think it’s safe to say that we can forget about test automation being able to actively control the quality of the end product. That leaves the instilling of trust (with confidence as a result) in the system under test as a prime goal for test automation. It’s generally a bad idea to make automated checks the sole responsible entity for creating and maintaining this trust, by the way, but when applied correctly, they can certainly contribute a lot.

However, in order to enable your automated checks to build this trust, you need to trust the automated checks themselves as well. Simply (?) put: I wouldn’t be confident in the quality of a system if that quality was (even partially) assessed by automated checks I don’t trust.

Luckily, there are various ways to increase the trust in your automated checks. Here are three of them:

  • Run your automated checks early and often to see if they are fast, stable and don’t turn out to be high-maintenance drama queens.
  • Check your checks to see if they still have the trust-creation power you intended them to have when you first created them. Investigate whether applying techniques such as mutation testing can be helpful when doing so.
  • Have your automated checks communicate both their intent (what is it that is being checked?) and their result (what was the result? what went wrong?) clearly and unambiguously to the outside world. This is something I’ll definitely investigate and write about in future posts, but it should be clear that the value of automated checks that are unclear as to their intent and their result are not providing the value they should.

Test automation is great, yet it renders itself virtually worthless if it can’t be trusted. Make sure that you trust your automation before you use it to build confidence in your system under test.