Sunday, December 3, 2017

Compare Testing

If you believe that testing is inherently about information then you might enjoy Edward Tufte's take on that term:
Information consists of differences that make a difference.
We identify differences by comparison, something that as a working tester you'll be familiar with. I bet you ask a classic testing question of someone, including yourself, on a regular basis:
  • Our competitor's software is fast. Fast ... compared to what?
  • We must export to a good range of image formats. Good ... compared to what?
  • The layout must be clean. Clean ... compared to what?
But while comparison as a tool to get clarification by conversation is important, for me, it feels like testing is more fundamentally about comparisons.

James Bach has said "all tests must include an oracle of some kind or else you would call it just a tour rather than a test." An oracle is a tool that can help to determine whether something is a problem. And how is the value extracted from an oracle? By comparison with observation!

But we've learned to be wary of treating an oracle as an all-knowing arbiter of rightness. Having something to compare with should not lure you into this appealing trap:
I see X, the oracle says Y. Ha ha! Expect a bug report, developer!
Comparison is a two-way street and driving in the other direction can take you to interesting places:
I see X, the oracle says Y. Ho hum. I wonder whether this is a reasonable oracle for this situation?
Cem Kaner has written sceptically about the idea that the engine of testing is comparison to an oracle:
As far as I know, there is no empirical research to support the claim that testers in fact always rely on comparisons to expectations ... That assertion does not match my subjective impression of what happens in my head when I test. It seems to me that misbehaviors often strike me as obvious without any reference to an alternative expectation. One could counter this by saying that the comparison is implicit (unconscious) and maybe it is. But there is no empirical evidence of this, and until there is, I get to group the assertion with Santa Claus and the Tooth Fairy. Interesting, useful, but not necessarily true.
While I don't have any research to point to either, and Kaner's position is a reasonable one, my intuition here doesn't match his. (Though I do enjoy how Kaner tests the claim that testing is about comparisons by comparing it to his own experience.) Where we're perhaps closer is in the perspective that not all comparisons in testing are between the system under test and an oracle with a view to determine whether the system behaviour is acceptable.

Comparing oracles to each other might be one example. And why might we do that? As Elaine Weyuker suggests in On Testing Non-testable Programs, partial oracles (oracles that are known to be incomplete or unreliable in some way) are common. To compare oracles we might gather data from each of them; inspect it; look for ways in which each has utility (such as which has more predictive power in scenarios of interest).

And there we are again! The "more" in "which has more predictive power" is relative, it's telling us that we are comparing and, in fact, here we're using comparisons to make a decision about which comparisons might be useful in our testing. I find that testing is frequently non-linear like that.

Another way in which comparison is at the very heart of testing is during exploration. Making changes (e.g. to product, data, environment, ...) and seeing what happens as a result is a comparison task. Comparing two states separated by a (believed) known set of actions irrespective of whether you have an idea about what to expect is one way of building up knowledge and intuition about the system under test, and of helping to decide what to try next, what to save for later, what looks uninteresting (for now).

Again this throws up meta tasks: how to know which aspects of a system's state to compare? How to know which variables it is even possible to compare? How to access the state of those at the right frequency and granularity to make them usable? And again there's a potential cycle: gather data on what it might be possible to compare; inspect those possibilities; find ways in which they might have utility.

I started here with a Tufte quote about information being differences that make a difference, and said that identifying the differences is an act of comparison. I didn't say at that point but identifying the ones that make a difference is also a comparison task. And the same skills and tools that can be used for one can be used for both: testing skills and tools.

1 comment:

  1. I got used in a previous role in assessing bugs for severity and impact. So I would often produce bug reports where the bug was reported as being severe but not very impactful (and sometimes I would even warn the dev about it and say something like "You will be marking that as Will Not Fix, I assume. I'm cool with that but don't tell anyone I said so.")

    Or I might mark a bug as not very severe at all but highly impactful, such as a typo that was actually a mis-spelling of the CEO's name (in which case, my private message to the dev would be "Fix that NOW or we will all be sacked!")

    Traditionally, the thing about the ancient Greek oracles was that their utterances were highly gnomic and subject to interpretation. And that interpretation was down to the interpreter's knowledge of what things were like in the Real World at that moment. On that basis, then, testing oracles are no different. I was comparing the technical nature of a bug and the ways it manifested itself with the likely impact in the Real World. And your likelihood of getting sacked because of an error in the code isn't something that can be easily quantified and definitely can't be automated! :-)