More troubles with test data

If managing test data in complex end-to-end automated test scenarios is an art, I’m certainly not an artist (yet). If it is a science, I’m still looking for the solution to the problem. At this moment, I’m not even sure which of the two it is, really..

The project
Some time ago, I wrote a blog post on different strategies to manage test data in end-to-end test automation. A couple of months down the road, and we’re still struggling. We are faced with the task of writing automated user interface-driven tests for a complex application. The user interface in itself isn’t that complex, and our tool of choice handles it decently. So far, so good.

As with all test automation projects I work on, I’d like to keep the end goal in mind. For now, running the automated end-to-end tests once every fortnight (at the end of a sprint) is good enough. I know, don’t ask, but the client is satisfied with that at the moment. Still, I’d like to create a test automation solution that can be run on demand. If that’s once every two weeks, all right. It should also be possible to run the test suite ten times per day, though. Shorten the feedback loops and all that.

The test data challenge
The real challenge with this project, as with a number of other projects I’ve worked on in the past, is in ensuring that the test data required to successfully run the tests is present and in the right state at all times. There are a number of complicating factors that we need to deal (or live) with:

  • The data model is fairly complex, with a large number of data entities and relations between them. What makes it really tough is that there is nobody available that completely understands it. I don’t want to mess around assuming that the data model looks a certain way.
  • As of now, there is no on demand back-up restoration procedure. Database back-ups are made daily in the test environment, but restoring them is a manual task at the moment, blocking us from recreating a known test data state whenever we want to.
  • There is no API that makes it easy for us to inject and remove specific data entities. All we have is the user interface, which results in long setup times during test execution, and direct database access, which isn’t of real use since we don’t know the data model details.

Our current solution
Since we haven’t figured out a proper way to manage test data for this project yet, we’re dealing with it the easiest way available: by simply creating the test data we need for a given test at the start of that test. I’ve mentioned the downsides of this approach in my previous post on managing test data (again, here it is), but it’s all we can do for now. We’re still in the early stages of automation, so it’s not something that’s holding us back to much, but all parties involved realize that this is not a sustainable solution for the longer term.

The way forward
What we’re looking at now is an approach that looks roughly like this:

  1. A database backup that contains all test data required is created with every new release.
  2. We are given permission to restore that database backup on demand, a process that takes a couple of minutes and currently is not yet automated.
  3. We are given access to a job that installs the latest data model configuration (this changes often, sometimes multiple times per day) to ensure that everything is up to date.
  4. We recreate the test data database manually before each regression test run.

This looks like the best possible solution at the moment, given the available knowledge and resources. There are still some things I’d like to improve in the long run, though:

  • I’d like database recreation and configuration to be a fully automated process, so it can more easily be integrated into the testing and deployment process.
  • There’s still the part where we need to make sure that the test data set is up to date. As the application evolves, so do our test cases, and somebody needs to make sure that the backup we use for testing contains all the required test data.

As you can see, we’re making progress, but it is slow. It makes me realize that managing test data for these complex automation projects is possibly the hardest problem I’ve encountered so far in my career. There’s no one-stop solution for it, either. So much depends on the availability of technical hooks, domain knowledge and resources at the client side.

On the up side, last week I met with a couple of fellow engineers from a testing services and solutions provider, just to pick their brain on this test data issue. They said they have encountered the same problem with their clients as well, and were working on what could be a solution to this problem. They too realize that it’ll never be a 100% solution to all test data issues for all organizations, but they’re confident that they can provide them (and consultants like myself) with a big step forwards. I haven’t heard too many details, but I know they know what they’re talking about, so there might be some light at the end of the tunnel! We’re going to look into a way to collaborate on this solution, which I am pretty excited about, since I’d love to have something in my tool belt that helps my clients tackle their test data issues. To be continued!

Managing test data in end-to-end test automation

One of the biggest challenges I’m facing in projects I’m contributing to is the proper handling of test data in automated tests, and especially in end-to-end test automation. For unit and integration testing, it is often a good idea to resort to mocking or stubbing the data layer to remain in control over the test data that is used in and required for the tests to be executed. When doing end-to-end tests, however, keeping all required test data in check in an automated manner is no easy task. I say ‘in an automated manner’ here, because once you start to rely on manual intervention for the preparation or cleaning up of test data, then you’re moving away from the ability to test on demand, which is generally not a good thing. If you want your tests to truly run on demand, having to rely on someone (or a third party process) to manage the test data can be a serious bottleneck. Even more so with distributed applications, where teams often do not have enough control over the dependencies they require in order to be able to do end-to-end (or even integration) testing.

In this post, I’d like to consider a number of possible strategies for dealing with test data in end-to-end tests. I’ll take a look at their benefits and their drawbacks to see if there’s one strategy that trumps all others (spoiler alert: probably not…).

Creating test data during test execution
One approach is to start every test, suite or run with a set-up phase where the test data required for that specific test, suite or run is created. This can be done by any technical means available: be it through direct INSERT statements in a database, a series of API calls that create new users, orders or any other type of test data object, or (if there really is no alternative) through the user interface. The main benefit of this approach is that there is a strong coupling between the created test data and the actual test, meaning that the right test data is always available. There’s some rather big drawbacks as well, though:

  • Setting up test data takes additional time, especially when doing through the user interface.
  • Setting up test data requires additional code, which increases the maintenance burden of your automated tests.
  • If an error occurs during the test data setup fase of your test, your actual test result will be unpredictable and therefore cannot be trusted. That is, if your test isn’t simply aborted before the actual test steps are executed at all…
  • This approach potentially requires tests to depend on one another in terms of the sequence in which they’re executed, which is a definite anti-pattern of test automation.

I’ve used this approach several times in my projects, with mixed results. Sometimes it works just fine, sometimes a little less so. The latter is most often the case when the data model is really complex and there’s no other way than mimicking user interaction by means of tools such as Selenium to get the data prepared.

Query test data prior to test execution
The second approach to dealing with test data around automated tests is to query the data before the actual test runs. This can be done either directly on the database, or possibly through an API (or even a user interface) that allows you to retrieve customers, articles or whatever type of data object you need for your test. The main benefit of this approach is that you’re not losing time creating test data when all you really care about is the test results, and that this approach results in less test automation code to maintain, especially when you can query the database directly. Here too, there are a couple of drawbacks that render this approach less than ideal as well:

  • There’s no guarantee that the exact data you require for a test case (especially with edge cases) is actually present in the database. For example, how many customers that have lived in Nowhereville for 17,5 years, together with their wife and their blue parrot, are actually in your database?
  • Sometimes getting the query right so that you’re 100% sure that you get the right test data is a daunting task, requiring very specific knowledge of the system. This might make this approach less than ideal for some teams.
  • Also, even when you get the query exactly right, does that really guarantee that you get results that you’re 100% sure will be a perfect fit for your test case?

Reset the test data state before or after a test run
I think this is potentially the best approach when having to deal with test data in end-to-end tests: either setting the test data database to the exact state by restoring a database backup or cleaning up the test database afterwards. This guarantees that test data-wise, you’re always in the exact same state before / after a test run, which massively improves predictability and repeatability of your tests. The main drawback is that often, this is not an option either due to nobody knowing enough about the data model to allow this, or by access to the database being restricted for reasons good or less than good. Also, when you’re dealing with large databases, doing a database reset or rollback might be a matter of hours, which slows down your feedback loop significantly, rendering it next to useless when your tests are part of a CD pipeline.

Virtualizing the data layer
Nowadays, there are several solutions on the market that allow you to effectively virtualize your data layer for testing purposes. A prime example of such a solution is Delphix, but there are several other tools on the market as well. I haven’t experimented with any of these for long enough to actually have formed an educated opinion, but one thing I don’t really like about this approach is that virtualizing the data layer (however efficient it may be) voids the concept of executing a true end-to-end test, since there’s no actual data layer involved anymore. Then again, for other types of testing, it may actually be a very good concept, just like service virtualization is for simulating the behavior of critical yet hard-to-access dependencies in test environments.

So, what’s your take on this?
In short, I haven’t found the ideal solution yet. I’d love to read about the approaches other people and teams are taking when it comes to managing test data in end-to-end automated tests, so feel free to send me an email, or even better, leave a comment to this post. Looking forward to seeing your replies!