On supporting Continuous Testing with FITR test automation

Last week I had the chance to participate as a contributor to my very first webinar. The people at SeaLights, an Israeli company that offers a management platform for Continuous Testing, asked me to come on the webinar and share my views on test automation and continuous testing. In this post, I’ll share some of the thoughts and opinions I talked about there.

Test automation is everywhere, however…
Test automation is everywhere. That’s probably nothing new.

A lot of organizations are adopting Continuous Integration and Continuous Delivery. Also nothing new.

To be able to ‘do’ CI/CD, a lot of organizations are relying on their automated tests to help safeguard quality thresholds while increasing release speed. Again, no breaking news here.

However, to safeguard quality in CI and CD you’ll need to be able to do Continuous Testing (CT). Here’s my definition of CT, which I used in the webinar:

Continuous Testing is a process that allows you to gauge the quality of your software on demand. No matter if you’re building and deploying once a month or once a minute, CT allows you to get insight into application quality at all times.

It won’t come as a surprise to you that automated tests often form a big part of an organization’s CT strategy. However, just having automated tests is not enough to be able to support CT. Your automated test approach should address not just application functionality and coverage, but also:

  • A solid test data management strategy
  • Effective test environment management
  • Informative reporting, targeted towards all relevant audiences

And probably many more things that I’m forgetting here.

In order to be able to leverage your automated tests successfully for supporting CT, I’ve come up with a model based on four pillars that need to be in place: your tests should be Focused, Informative, Trustworthy and Repeatable (FITR).

From AT to CT with FITR tests

Let’s take a quick look at each of these FITR pillars and how they are necessary in CT.

Automated tests need to be focused to effectively support CT. ‘Focused’ has two dimensions here.

First of all, your tests should be targeted at the right application component and/or layer. It does not make sense to use a user interface-driven test to test application logic that’s exposed through an API (and subsequently presented through the user interface), for example. Similarly, it does not make sense to write API-level tests that validate the inner workings of a calculation algorithm if unit tests can provide the same level of coverage. By now, most of you will be familiar with the test automation pyramid. While I think it’s not to be used as a guideline, I find that the pyramid does provide a good starting point for discussion when it comes to focusing your tests at the right level. Use the model to your advantage.

The second aspect of focused automated tests is that your tests should test what they can do effectively. This boils down to sticking to what your test solution and tools in it do best, and leaving the rest either to other tools or to testers, depending on what’s there to be tested. Don’t try and force your tool to do things it isn’t supposed to (here’s an example).

If your tests are unfocused, they are far more likely to be slow to run, to have high maintenance costs and to provide inaccurate or shallow feedback on application quality.

Touching upon shallow or inaccurate feedback, automated tests also need to be informative to effectively support CT. ‘Informative’ also has two separate dimensions.

Most importantly, the results produced and the feedback provided by your automated tests should allow you, or the system that’s doing the interpretation for you (such as an automated build tool), to make important decisions based on that feedback. Make sure that the test results and reporting provided contain clear results, information and error messages, targeted towards the intended audience. Keep in mind that every audience has its own requirements when it comes to this information. Developers likely want to see stack traces, whereas managers don’t. Find out what the target audience for your reporting and test results is, what their requirements are, and then cater to them as best as you can. This might mean creating more than one report (or source of information in general) for a single test run. That’s OK.
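Here’s a minimal sketch of what producing two reports from a single test run could look like. The `TestOutcome` and `AudienceReports` classes are hypothetical names I made up for illustration; they are not part of any test framework or reporting library.

```java
import java.util.List;

// Hypothetical result model; all names here are illustrative assumptions.
final class TestOutcome {
    final String name;
    final boolean passed;
    final String stackTrace; // null when the test passed

    TestOutcome(String name, boolean passed, String stackTrace) {
        this.name = name;
        this.passed = passed;
        this.stackTrace = stackTrace;
    }
}

final class AudienceReports {
    // Short summary for managers: counts only, no technical detail.
    static String managerSummary(List<TestOutcome> outcomes) {
        long passed = outcomes.stream().filter(o -> o.passed).count();
        return passed + " of " + outcomes.size() + " tests passed";
    }

    // Detailed report for developers: every failed test with its stack trace.
    static String developerReport(List<TestOutcome> outcomes) {
        StringBuilder sb = new StringBuilder();
        for (TestOutcome o : outcomes) {
            if (!o.passed) {
                sb.append("FAILED ").append(o.name).append('\n')
                  .append(o.stackTrace).append('\n');
            }
        }
        return sb.toString();
    }
}
```

Both methods read the same list of outcomes, so the two audiences get different views of one test run rather than two separate runs.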

Another important aspect of informative automated tests is that it should be clear what they do (and what they don’t do). You can make your tests themselves be more informative through various means, including (but not limited to) using naming conventions, using a BDD tool such as Cucumber or SpecFlow to create living documentation for your tests (blasphemy? maybe.. but if it works, it works), and following good programming practices to make your code more readable and maintainable.

When automated test solutions and the results they produce are not informative, valuable time is wasted analyzing shallow feedback, or gathering missing information, which evidently breaks the ‘continuous’ part of CT.

When you’re relying on your automated tests to make important decisions in your CT activities, you’d better make sure they’re trustworthy. As I described in more detail in previous posts, automated tests that cannot be trusted are essentially worthless. Make sure to eliminate false positives (tests that fail when they shouldn’t), but also false negatives (tests that pass when they shouldn’t).

The essential idea behind CT (referring to the definition I gave at the beginning of this blog post) is that you’re able to determine application quality on demand. Which means you should be able to run your automated tests on demand. Especially when you’re including API-level and end-to-end tests, this is often not as easy as it sounds. There are two main factors that can hinder the repeatability of your tests:

  • Test data. This is in my opinion one of the hardest ones to get right, especially when talking end-to-end tests. Lots of applications I see and work with have (overly) complex data models or share test data with other systems. And if you’re especially lucky, you’ll get both. A solid test data strategy should be put in place to do CT, meaning that you’ll either have to create fresh test data at the start of every test run or have the ability to restore test data before every test run. Unfortunately, both options can be quite time consuming (if at all attainable and manageable), drawing you further away from the ‘C’ in CT instead of bringing you closer.
  • Test environments. If your application communicates with other components, applications or systems (and pretty much all of them do nowadays), you’ll need suitable test environments for each of these dependencies. This is also easier said than done. One possible way to deal with this is by using a form of simulation, such as mocking or service virtualization. Mocks or virtual assets are under your full control, allowing you to speed up your testing efforts, or even enable them in the first place. Use simulation carefully, though, since it’s yet another thing to be managed and maintained, and make sure to test against the real thing periodically for optimal results.
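To make the simulation idea a bit more concrete, here’s a minimal sketch of a hand-written stub. Everything in it (`AccountService`, `StubAccountService`, `OverdraftChecker`) is a hypothetical name invented for this example; in a real project you would more likely use a mocking or service virtualization tool. Because the stub is created inside the test, it also gives you fresh test data on every run.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical dependency of the application under test.
interface AccountService {
    double balanceFor(String accountId);
}

// Hand-written stub: fully under the test's control, no real environment needed.
final class StubAccountService implements AccountService {
    private final Map<String, Double> balances = new HashMap<>();

    // Lets each test define exactly the data it needs.
    StubAccountService withBalance(String accountId, double balance) {
        balances.put(accountId, balance);
        return this;
    }

    @Override
    public double balanceFor(String accountId) {
        Double balance = balances.get(accountId);
        if (balance == null) {
            throw new IllegalArgumentException("Unknown account: " + accountId);
        }
        return balance;
    }
}

// Hypothetical code under test that talks to the dependency.
final class OverdraftChecker {
    private final AccountService accounts;

    OverdraftChecker(AccountService accounts) {
        this.accounts = accounts;
    }

    boolean canWithdraw(String accountId, double amount) {
        return accounts.balanceFor(accountId) >= amount;
    }
}
```

A test would then construct `new OverdraftChecker(new StubAccountService().withBalance("NL01", 100.0))` and assert on `canWithdraw`, without touching a shared environment. The stub only proves your code handles the responses you scripted, which is why testing against the real thing periodically still matters.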

Having the above four pillars in place does not guarantee that you’ll be able to perform your testing as continuously as your CI/CD process requires, but it will likely give it a solid push in the right direction.

Also, the FITR model I described here is far from finished. If there’s anything I forgot or got wrong, feel free to comment or contact me through email. I’d love to get feedback.

Finally, if you’re interested in the webinar I talked about earlier, but haven’t seen it, it’s freely available on YouTube.

Stop sweeping your failing tests under the RUG

Hello and welcome to this week’s rant on bad practices in test automation! Today’s serving of automation bitterness is inspired by a question I saw (and could not NOT reply to) on LinkedIn. It went something like this:

My tests are sometimes failing for no apparent reason. How can I implement a mechanism that automatically retries running the failing tests to get them to pass?

It’s questions like this that make the hairs on my neck stand on end. Instead of sweeping your test results under the RUG (Retry Until Green), how about diving into your tests and fixing the root cause of the failure?

First of all, there is ALWAYS a reason your test fails. It might be the application under test (your test did its work, congratulations!), but it might just as well be your test itself that’s causing the failure. The fact that the reason for the failure is not apparent does not mean you can simply ignore it and try running your test a second time to see if it passes then. No, it means there’s work for you to do. It might not be fun work: dealing with and catching all kinds of exceptions that can be thrown by a Selenium test can be very tricky. The task also might not be suitable for you: maybe you’re inexperienced and therefore think ‘forget debugging, I’ll just retry the test, that’s way easier’. That’s OK, we’ve all been inexperienced at some point in our career. In a lot of ways, most of us still are. And I myself have not exactly been innocent of this behavior in the past either.

But at some point, it’s time to stop complaining about flaky tests and start doing something about them. That means diving deep into your tests and how they interact with your application under test, getting to the root cause of the error or exception being thrown and fixing it, once and for all. Here’s a real world example from a project I’m currently working on.

In my tests, I need to fill out a form to create a new savings account. Because the application needs to be sure that all information entered is valid, there’s a lot of front-end input validation going on (zip code needs to exist, email address should be formatted correctly, etc.). Whenever the application is busy validating or processing input, a modal appears that indicates to the end user that the site is busy processing input, and that therefore you should wait a little before proceeding. Sounds like a good idea, right? However, when you want your tests to fill in these forms automatically, you’ll sometimes run into the issue that you’re trying to click a button or complete a text field while it is being blocked by the modal. Cue WebDriverException (“other object would receive the click”) and failing test.

Now, there are two ways to deal with this:

  1. Sweep the test result under the RUG and retry until that pesky modal does not block your script from completing, or
  2. Catch the WebDriverException, wait until the modal is no longer there and do your click or sendKeys again. Writing wrappers around the Selenium API calls is a great way of achieving this, by the way.

Option 1. is the easy way. Option 2. is the right way. You choose. Just know that every failing test is trying to tell you something. Most of the time, it’s telling you to write a better test.
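Option 2 could look something like the sketch below. To keep it self-contained I’ve left Selenium out: the action is a plain `Runnable` and the modal check is a `BooleanSupplier`, both of which are assumptions for illustration. In a real suite the action would be your `click()` or `sendKeys()` call, the caught exception would be `WebDriverException`, and the blocker check would look for the modal. `BlockedActionWrapper` and `performWhenUnblocked` are hypothetical names.

```java
import java.util.function.BooleanSupplier;

// Sketch of a wrapper around a UI action that can be blocked by a modal.
final class BlockedActionWrapper {

    static void performWhenUnblocked(Runnable action,
                                     BooleanSupplier blockerPresent,
                                     long timeoutMillis,
                                     long pollMillis) {
        try {
            action.run();
            return; // first attempt succeeded, nothing else to do
        } catch (RuntimeException firstFailure) {
            // Wait until the blocker is gone, but never longer than the timeout.
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (blockerPresent.getAsBoolean()) {
                if (System.currentTimeMillis() > deadline) {
                    throw new IllegalStateException(
                        "Blocker still present after " + timeoutMillis + " ms", firstFailure);
                }
                try {
                    Thread.sleep(pollMillis);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting", interrupted);
                }
            }
        }
        // Exactly one more attempt; if this fails, the test fails and tells you why.
        action.run();
    }
}
```

The difference with Retry Until Green is that this retries a single action for a single, understood reason, and gives up with a clear message when the modal never goes away. Anything else still fails the test.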

One more argument in favour of NOT sweeping your flaky tests under the RUG, but preventing them from happening in the future: some day, your organization might start, you know, actually relying on these test results. For example as part of a go / no go decision for deployment into production. If I were to call the shots, I’d make sure that all my tests that I rely on for making that decision were:

Really, it’s time to quit tolerating flaky tests. Repair them or throw them away, because what’s the added value of an unreliable test? Just don’t sweep your failing tests under the RUG.

On false negatives and false positives

In a recent post, I wrote about how trust in your test automation is needed to create confidence in your system under test. In this follow up post (of sorts), I’d like to take a closer look at two phenomena that can seriously undermine this trust: false positives and false negatives.

False positives
Simply put, false positives are test instances that fail without there being a defect in the application under test, i.e., the test itself is the reason for the failure. False positives can occur for a multitude of reasons, including:

  • No appropriate waiting is implemented for an object before your test (for example written using Selenium WebDriver) is interacting with it.
  • You specified incorrect test data, for example a customer or an account number that is (with reason) not present in the application under test.

False positives can be really annoying. It takes time to analyze their root cause, which wouldn’t be so bad if the root cause was in the application under test, but that would be an actual defect, not a false positive. The minutes (hours) spent getting to the root cause of tests that fail because they’ve been poorly written would almost always have been better spent elsewhere, for example on writing stable and better performing tests in the first place.

If they’re part of an automated build and deployment process, you can find yourself in even bigger trouble with false positives. They stall your build process unnecessarily, thereby delaying deployments that your customers or other teams are waiting for.

There’s also another risk associated with false positives: when they’re not taken care of as soon as possible, people will start taking them for granted. Tests that regularly or consistently cause false positives will be disregarded and, ultimately, left alone to die. This is a real waste of the time it took to create that test in the first place, I’d say. I’m all for removing tests from your test base if they no longer serve a purpose, but the fact that they fail, either intermittently or every time, is NOT a good reason to discard them.

And talking about tests failing intermittently, those (also known as flaky tests) are the worst, simply because the root cause for their failure often cannot be easily determined, which in turn makes it hard to fix them. As an example: I’ve seen tests that ran overnight and failed sometimes and passed on other occasions. It took weeks before I found out what caused them to fail: on some test runs this particular test (we’re talking about an end to end test that took a couple of minutes to run here) was started just before midnight, causing funny behavior in subsequent steps that were completed after midnight (when a new day started). On other test runs, either the test started after midnight or was completed before midnight, resulting in a ‘pass’. Good luck debugging that during office hours!
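One way to make a date-dependent failure like this reproducible during office hours is to control the clock, assuming the code involved lets you inject one (which is often not the case for a full end-to-end flow). The sketch below uses `java.time.Clock`; `StatementService` and `MidnightDemo` are hypothetical names standing in for whatever logic depends on ‘today’.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical piece of application logic that depends on the current date.
final class StatementService {
    private final Clock clock;

    StatementService(Clock clock) {
        this.clock = clock;
    }

    LocalDate bookingDate() {
        return LocalDate.now(clock);
    }
}

final class MidnightDemo {
    // Returns true when two steps of one flow see the same calendar day (UTC).
    static boolean sameDay(Instant stepOne, Instant stepTwo) {
        LocalDate first =
            new StatementService(Clock.fixed(stepOne, ZoneOffset.UTC)).bookingDate();
        LocalDate second =
            new StatementService(Clock.fixed(stepTwo, ZoneOffset.UTC)).bookingDate();
        return first.equals(second);
    }
}
```

With a fixed clock you can place the first step at 23:59 and the second at 00:02 and see the two steps disagree about the date every single time, instead of once every few nights.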

False negatives
While false positives can be really annoying, the true risk with regard to trust in test automation is at the other end of the unreliability spectrum: enter false negatives. These are tests that pass but shouldn’t, because there IS an actual defect in the application under test; it’s just not picked up by the test(s) responsible for covering the area of the application where the defect occurs.

False negatives are far more dangerous than false positives, since they instill a false sense of confidence in the quality of your application. You think you’ve got everything covered with your test set, and all lights are green, but there’s still a defect (or two, or ten, or …) that goes unnoticed. And guess who’s going to find them? Your users. Which is exactly what you thought you were preventing by writing your automated regression tests.

Detecting false negatives is hard, too, since they don’t let you know that they’re there. They simply take up their space in your test set, running without trouble, never actually catching a defect. Sometimes, these false negatives are introduced at the time of writing the test, simply because the person responsible for creating the tests isn’t paying attention. Often, though, false negatives spring into life over time. Consider the following HTML snippet, representing a web page showing an error message:

		<div class="error">Here's a random error message</div>

One of the checks you’re performing is that no error messages are displayed in a happy path test case:

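@Test(description="Check there is no error message when login is successful")
public void testSuccessfulLogin() {
	LoginPage lp = new LoginPage();
	HomePage ep = lp.correctLogin("username", "password");
	// Assumed assertion: the happy path should show no error message
	Assert.assertTrue(ep.hasNoErrorText());
}

// Assumed to be a method of the HomePage page object
public boolean hasNoErrorText() {
	// Passes when no element with class "error" is present on the page
	return driver.findElements(By.className("error")).size() == 0;
}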

Now consider the event that at some point in time, your developer decides to rename the CSS classes (not a highly unlikely event!), which results in a new way of displaying error messages:

		<div class="failure">Here's a random error message</div>

The aforementioned test will still run without trouble after this change. The only problem is that its defect finding ability is gone, since it wouldn’t notice in case an error message IS displayed! Granted, this might be an oversimplified example, and a decent test set would have additional tests that would fail after the change (because they expect an error message that is no longer there, for example), but I hope you can see where I’m going with this: false negatives can be introduced over time, without anyone knowing.
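To show the effect without needing a browser, here’s a small stand-in sketch. The `countElementsWithClass` method is a hypothetical, deliberately naive replacement for `driver.findElements(By.className(...)).size()` that just counts `class="..."` occurrences in a raw HTML string; it is not how you would parse HTML in real code.

```java
// Illustrative stand-in for the Selenium-based check, operating on raw HTML.
final class ErrorCheckDemo {

    // Counts occurrences of class="<className>" in the given HTML string.
    static int countElementsWithClass(String html, String className) {
        String needle = "class=\"" + className + "\"";
        int count = 0;
        int index = html.indexOf(needle);
        while (index >= 0) {
            count++;
            index = html.indexOf(needle, index + needle.length());
        }
        return count;
    }

    // Same logic as the page object method: true when no "error" element exists.
    static boolean hasNoErrorText(String html) {
        return countElementsWithClass(html, "error") == 0;
    }
}
```

Fed the original markup, `hasNoErrorText` returns false, so the check catches the error message. Fed the markup with `class="failure"`, it returns true even though the same message is on the page: the check still runs, it just can’t fail anymore.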

So, how to reduce (or even eliminate) the risk of false negatives? If you’re dealing with unit tests, I highly recommend experimenting with mutation testing to assess the quality of your unit test set and its defect finding ability. For other types of automated checks, I recommend the practice of checking your checks regularly. Not just at the time of creating your tests, though! Periodically, set some time aside to review your tests and see if they still make sense and still possess their original defect finding power. This will reduce the risk of false negatives to a minimum, keeping your tests fresh and preserving trust in your test set and, by association, in your application under test. And isn’t that what testing is all about?
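If you haven’t seen mutation testing before, the idea can be sketched by hand in a few lines. The example below is an assumption-laden toy, not output of a real mutation testing tool: `ORIGINAL` is a made-up piece of logic, `MUTANT` is the same logic with one operator changed, and the two ‘suites’ are plain methods standing in for real unit tests.

```java
import java.util.function.IntPredicate;

final class MutationSketch {
    // Original implementation: adults are 18 or older.
    static final IntPredicate ORIGINAL = age -> age >= 18;

    // Mutant: the boundary operator >= has been changed to >.
    static final IntPredicate MUTANT = age -> age > 18;

    // Weak test set: never exercises the boundary value 18.
    static boolean weakSuitePasses(IntPredicate isAdult) {
        return isAdult.test(30) && !isAdult.test(5);
    }

    // Stronger test set: adds checks around the boundary.
    static boolean strongSuitePasses(IntPredicate isAdult) {
        return weakSuitePasses(isAdult) && isAdult.test(18) && !isAdult.test(17);
    }
}
```

The weak suite passes for both the original and the mutant, which means the mutant survives and the suite would not notice this defect. The strong suite passes for the original and fails for the mutant, so the mutant is killed. A mutation testing tool generates and runs such mutants for you; surviving mutants point at checks that have lost (or never had) their defect finding power.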