THIS ARTICLE IS A CAUTIONARY TALE
The world is awash with data. We make decisions based on it every day, from the trivial (“I think I will buy this book based on the 4-star average rating on Amazon”) to the possibly catastrophic (“I think I will invest my life savings based on the projections in the Enron prospectus”). In software testing, critical decisions on risk and product shipping can be, and often are, made based on data. Typically, this data takes the form of test case and defect metrics. When we try to answer our customers’ questions about the product we are testing, it may be tempting to let the data speak for itself. However, I believe it's important to recognise that metrics are nothing without analysis, and that this analysis is subject to biases, misrepresentation, misunderstanding and manipulation. I’d like you to humour me. Go to Google Images, type "Deming Data God" in the text box and click Search.
Back with me now? Good.
If your search was anything like mine, you will no doubt have seen a lot of images of slides referencing W. Edwards Deming's famous quote “In God we trust; all others must bring data”. In my experience, this quote is often used as a substitute for an actual argument. “Hey, we should introduce toolset X. I have data that it will reduce our cycle times by 58.7%. Deming said that having data is good. Therefore I win”.
However, there are two problems with this quote. The first problem is that there is no actual data that Deming ever said this (which I find deliciously ironic). The second, and far more important problem, is that it actually contradicts something that Deming did say, namely “Experience by itself teaches nothing”. The crucial point here is that data by itself tells us nothing - without an understanding of the interactions of the systems and people the data relates to. The metrics generated by testing should not be used without understanding the context in which they were produced.
Having been a student of history, I tend to use the following example to illustrate this point.
Knowing that the Sudetenland was annexed in 1938, that Hitler ordered the invasion of Poland in 1939, and that, two days later, Great Britain and France declared war on Germany tells you nothing about why the Second World War occurred. They are just three points of data. You can, obviously, string them together, but that provides nothing more than a shallow causation. This level of inspection would be acceptable for a child at primary school, but you'd expect a deeper level of reasoning from any higher student of the topic.
I decided to write about this subject after recalling a conversation I had with an Iteration Manager (IM) on the value (or not!) of the vast amount of test case pass/fail data they were receiving from the Build Server each day. One particular comment has always stuck with me - the IM felt these results gave value and helped them sleep at night. In a way I can relate to this; tucking into a great work of fiction before bed also helps me drift off at night. Is the data provided by the Build Server really sufficient to help someone wanting to deliver working software sleep soundly?
I'd argue that it's not, for three reasons. The first is that, unless they were incredibly well briefed, this person had little or no idea what the substance of these tests was. Did they cover business critical functions? Were they simply validating the correct font size displayed on each screen? Without this additional context the pass/fail data is largely meaningless - simply a collection of ticks and crosses.
Secondly, there is a false equivalence between “all our test cases are passing” and “our product contains no serious bugs that will impact our customers or our ability to do business”. If the history of software development has taught us anything it's that, typically, we release with unknown defects present. I'm sure I'm not alone in having worked on projects where, once the release is live, bugs are found by customers in areas where all the tests had passed. And with software creeping into every facet of life, this could become much more common. An illustration of this point can be found a few kilometres away from where I'm sitting right now at the recently built Fiona Stanley Hospital in Perth.
Part of the technology implemented at the hospital was to allow the patients to order food via electronic devices at each bed. However, despite the system working functionally, many older and disabled patients were unable to order their food because they didn't know how to use the devices the software ran on. Just think about that for a moment; a hospital is supposed to care for the most needy and vulnerable, not to put them through hardship and distress to simply request a meal. That is the difference between a system being functionally correct and a system being able to be used by the people who rely on it. The difference between “our test cases have passed” and “our product contains no serious bugs that will impact our customers”.
Thirdly, even if we accept that today a series of passed test cases means our product is of sufficient quality to release, there is no guarantee that this will hold true tomorrow. In The Black Swan, Professor Nassim Nicholas Taleb uses the following example to illustrate this (Taleb's inspiration here is Bertrand Russell, who uses a chicken in his writing):
A turkey will, over a period of many months, accumulate a vast amount of data showing that it will be fed at a particular time each day and then spend the rest of the day merrily doing whatever it is that turkeys do. This model holds true day after day, therefore the turkey assumes that it will continue to hold true. Which it does...until Christmas Eve.
Just as the turkey's model falls apart when an unexpected event with large consequences occurs, so too can our model that passing test cases equal a successful release.
This is not to say that we cannot learn from past experiences. Taleb makes this point as follows: "What can a turkey learn about what is in store for it tomorrow from the events of yesterday? A lot, perhaps, but certainly a little less than it thinks, and it is that “little less” that may make all the difference". And, for us as software testers, that little less could be the difference between the same successful release we had last week and appearing on the front page of the newspaper tomorrow.
Should we then abandon the capture and analysis of such data? Of course not, because it still provides some value. At a minimum, it can provide a useful framework we can use as a basis for further questions and exploration. Michael Bolton, in his blog post on Meaningful Metrics, refers to these as inquiry metrics - metrics that are used “to prompt questions about their assumptions, rather than to provide answers or drive their work”. The data can provide indicators of areas where more exploration may be required, or of potential trends. However, as we (and the turkey) found out earlier, the predictive value of these trends needs to be carefully weighed. Just as all theories in science are considered to be provisional, so should any conclusions we draw with regard to quality based on the metrics we generate and gather as part of the testing process.
Metrics can provide useful assistance for testers when we communicate about the product under test, but we need to understand the basis of the metrics and their predictive limitations. For example, to really derive value from test case related data, you have to understand:
- The basis of the data you are reporting on (what the tests are actually doing), and
- How the test outcomes translate into something of value for your customers.
Consider the following hypothetical statement: “Five percent of our test cases have failed but they are limited to one specific area of the business”.
Is this good news or bad? If you can't answer that question as a tester then it's unlikely that your customers would be able to either. With a deeper understanding of the data, it is possible to give a significantly more meaningful response such as “In invoice processing, we have seen a number of test case failures relating to batch processing. The team have a manual workaround, but the increased processing time will cost in the region of one FTE effort per day until this is resolved should it go into production in this state”.
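To make that a little more concrete, here is a minimal sketch in Python of the difference between a bare failure percentage and a breakdown that carries some business context. Everything in it is invented for illustration: the `CaseResult` record, the area labels, the impact notes and the test names are my own assumptions, not the output of any real build server or test tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One test outcome, tagged with hypothetical business context."""
    name: str
    area: str         # business area the test exercises (invented labels)
    passed: bool
    impact: str = ""  # what a failure would mean for the business, if known


def overall_failure_rate(results):
    """The bare metric: the fraction of test cases that failed."""
    return sum(not r.passed for r in results) / len(results)


def summarise_by_area(results):
    """Group outcomes by business area and collect the impact of each failure."""
    summary = defaultdict(lambda: {"total": 0, "failed": 0, "impacts": []})
    for r in results:
        entry = summary[r.area]
        entry["total"] += 1
        if not r.passed:
            entry["failed"] += 1
            if r.impact and r.impact not in entry["impacts"]:
                entry["impacts"].append(r.impact)
    return dict(summary)


# Invented data: 20 test cases, one failure (five percent).
results = [CaseResult(f"font_size_{i}", "screen layout", True) for i in range(12)]
results += [CaseResult(f"invoice_{i}", "invoice processing", True) for i in range(7)]
results.append(
    CaseResult(
        "invoice_batch_run",
        "invoice processing",
        False,
        impact="batch processing needs a manual workaround",
    )
)

print(f"Overall failure rate: {overall_failure_rate(results):.0%}")
for area, entry in summarise_by_area(results).items():
    print(f"{area}: {entry['failed']} of {entry['total']} failed")
    for note in entry["impacts"]:
        print(f"  impact: {note}")
```

The first line of output is the kind of number that ends up on a dashboard. The lines after it are only the starting point for a conversation, and the impact notes are only as good as the tester's understanding of the business, which no script can supply.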
To my mind, the same applies to bug-related data. What do the statistics mean in terms of the business? Can you translate that raw data into usable information that relates to, for example, business risk, reputational damage, lost income, increased processing times or impact on front line support? To return to the history analogy I used earlier, knowing how many casualties occurred in a battle tells you nothing about the historical context or importance of said battle. It provides no information, furthers no understanding and, in a worst case scenario, can lead you to derive a false conclusion (not all battles are won by the side with the fewest casualties, for example).
Testers, yes, by all means bring your data. But understand how it was derived and what it means for your customers.
Because “In God we trust, but in all other matters we must be able to provide reasoned insights, leveraging the data gathered by processes and grounded in what we know, what we think we know and what we do not know about the product under test”.
It’s not as snappy, I'll grant you, and I don’t expect to see it on PowerPoint slides anytime soon, but I find it far better suited to dealing with the hugely complex relationship between the software and the humans who designed it, built it, tested it and, ultimately, use it.