Daring to Trust in Your Luck while Testing Code 🎲

#testing #programming #coding #probability

Figuring out whether changes to a codebase will introduce defects can be tricky. Consider the following function, written in TypeScript and used for modifying the data of a resource in a database:


type ResourceAttributes = {
  a0?: string;
  a1?: string;
  a2?: string;
  ...
  a9?: string;
  b0?: string;
  b1?: string;
  ...
  z9?: string;
};

type Resource = {
  id: string;
  modifiedAttributes: ResourceAttributes
};

type UpdateResourceProps = Resource;

const updateResource = (props: UpdateResourceProps) => {
...
}

Because of the sheer amount of input combinations this function accepts, writing a comprehensive test suite might take a long time. For example, what if modifying certain attributes on the database level at the same time causes an exception which needs to be handled, even though most other attributes do not have this interdependency?
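To get a feel for how quickly the input space grows, here is a minimal sketch that only counts which optional attributes are present and ignores their values entirely. The helper name `countPresenceCombinations` is hypothetical and not part of the function above.

```typescript
// Hypothetical helper: each optional attribute is either present or absent,
// so n optional attributes give 2^n presence combinations (values ignored).
const countPresenceCombinations = (optionalAttributeCount: number): number =>
  Math.pow(2, optionalAttributeCount);

// The three attributes spelled out at the top of the type: a0, a1, a2.
console.log(countPresenceCombinations(3)); // 8

// Every additional optional attribute doubles the count.
console.log(countPresenceCombinations(4)); // 16
```

Since every extra attribute doubles the count, covering each combination by hand stops being realistic very quickly.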

Does this mean I have to carefully look out for these rare situations before I am able to say I have some amount of confidence in my tests?

An approach I take when I find myself in these kinds of situations is so-called monkey testing, a form of black-box testing in which the function is tested with a randomly generated set of inputs.

But this introduces a new set of problems when the input can be a combination of many variables. Testing all the possible combinations suddenly becomes costly in terms of development time and other valuable resources.

There has to be a middle ground between low confidence in the test suite and the burden of having too many similar tests. Let me explore a way of approaching this problem using the cornerstone of statistics.

The Central Limit Theorem

Most will be familiar with the idea of the normal distribution, better known as the bell curve. It describes why, when we measure some characteristic in a group, most of our measurements fall around the middle, and the remaining measurements trail off in both directions from the center.

The overlooked part about bell curves is what they tell us about the nature of the characteristic we are measuring, and about the way we make the measurements.

The Recipe for a Bell Curve

Creating a bell curve from scratch is actually pretty simple.

The first step is to pick one (or more) randomly varying, quantifiable measure(s). If I pick multiple measures, they must be independent of each other.

In this context, randomness is defined as the unpredictability of the outcome of the measurement, and quantifiability means being able to convert the outcome to a number. Independence means the measures should not influence each other's outcomes.

A great candidate for this is rolling dice, as it fulfils all three of these criteria. 🎲

Let's say I chose rolling two dice as the measures I want to create a bell curve out of. I quantify each die roll by the number I see on top when the roll ends.

The second step is to simply repeat the sampling of the measure.

Sampling in this case refers to the action of rolling the dice, because each roll produces an output that can be quantified.

For example, I might have rolled the following pairs:


[1, 3]
[1, 4]
[3, 6]
...

The last step is to add up the numbers within each sample, so that each sample is reduced to a single sum.

For my case, I would get the following sums:


1 + 3 = 4
1 + 4 = 5
3 + 6 = 9
...

I then plot the sums on a graph. The horizontal axis shows the sums I got, and the vertical axis shows the number of occurrences of each sum.

As I keep repeating the sampling procedure, the plot settles into a shape that peaks in the middle and tapers off evenly on both sides. The central limit theorem posits that the more independent dice I add to each roll, the more this distribution of sums looks like a bell curve; with only two dice the shape is still closer to a triangle.
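The recipe above can be sketched in a few lines of TypeScript. This is a rough simulation rather than a proof, and the names `createRandom`, `rollDie` and `countSums` are my own. The small seeded generator is an assumption made so that the run is repeatable; `Math.random()` would serve the same purpose.

```typescript
// Small seeded linear congruential generator so the run is repeatable.
// (An assumption for this sketch; Math.random() would work just as well.)
const createRandom = (seed: number): (() => number) => {
  let state = seed % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // value in [0, 1)
  };
};

// One die roll: an integer from 1 to 6.
const rollDie = (random: () => number): number => Math.floor(random() * 6) + 1;

// Roll two dice `rollCount` times and count how often each sum occurs.
const countSums = (rollCount: number, random: () => number): Record<number, number> => {
  const counts: Record<number, number> = {};
  for (let sum = 2; sum <= 12; sum++) counts[sum] = 0;
  for (let i = 0; i < rollCount; i++) {
    counts[rollDie(random) + rollDie(random)] += 1;
  }
  return counts;
};

const sumCounts = countSums(36000, createRandom(42));

// Print a rough text histogram: one '#' per 200 occurrences.
for (let sum = 2; sum <= 12; sum++) {
  const label = (sum < 10 ? " " : "") + sum;
  console.log(label + " " + "#".repeat(Math.round(sumCounts[sum] / 200)));
}
```

The bars should be shortest at 2 and 12 and longest around 7, which is the peaked shape described above.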

Sweet!

At this point, I have no reasonable doubt about the fairness or legitimacy of the dice rolls.

In fact, the shape that emerges in this distribution provides assurance that the dice I used were capable of being unpredictable. Any major skew in this distribution would call the legitimacy of the dice into question.

It's hard to believe in coincidence, but it's even harder to believe in anything else. – John Green

When you think about it, it would be fishy for the dice rolls to often sum up to 12 or 2, when each of these sums can be produced by only one combination, but rarely sum up to 6, 7, or 8, when these sums should be produced frequently, right?

Another assurance I get from the bell curve is that the chance of having drawn only exceptional samples gets lower and lower as the repeated sampling continues, and so does the chance of never having drawn a particular sample.
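To make the counting concrete, here is a small sketch that enumerates all 36 ordered outcomes of two dice and tallies how many of them produce each sum. The variable name `combinationsPerSum` is my own.

```typescript
// Enumerate all 36 ordered outcomes of two dice and count how many
// of them produce each sum.
const combinationsPerSum: Record<number, number> = {};
for (let first = 1; first <= 6; first++) {
  for (let second = 1; second <= 6; second++) {
    const sum = first + second;
    combinationsPerSum[sum] = (combinationsPerSum[sum] ?? 0) + 1;
  }
}

console.log(combinationsPerSum[2]);  // 1 -> only [1, 1]
console.log(combinationsPerSum[7]);  // 6 -> [1, 6], [2, 5], ..., [6, 1]
console.log(combinationsPerSum[12]); // 1 -> only [6, 6]
```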

Anything that can go wrong will go wrong. – Murphy's Law

Using the dice example, this would be similar to a situation where I suspiciously never roll a specific combination of numbers, say [6, 6], after a very large number of rolls, say a thousand.
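How suspicious would that be? A back-of-the-envelope sketch, assuming two fair dice and independent rolls (the helper name `chanceOfNeverRolling` is my own):

```typescript
// Chance that a single roll of two fair dice is NOT [6, 6].
const missOnce = 35 / 36;

// Chance of missing [6, 6] on every one of `rollCount` independent rolls.
const chanceOfNeverRolling = (rollCount: number): number =>
  Math.pow(missOnce, rollCount);

console.log(chanceOfNeverRolling(36));   // roughly 0.36
console.log(chanceOfNeverRolling(1000)); // on the order of 1e-13
```

After 36 rolls, missing [6, 6] is still quite plausible. After a thousand rolls it is so unlikely that I would rather doubt the dice.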

Applying Central Limit Theorem in Testing

When I relate these inferences to testing, I see these parallels:

The first step of picking a random, quantifiable variable can be equated to declaring and initialising a set of variables whose values come from a random number generation algorithm. This algorithm is a function that lets me produce a reasonable set of random inputs to supply to the function I want to test. My go-to open source utilities for this are chance.js and factory.ts.
Repeated sampling can be equated to repeatedly executing the tests with a particular combination of the program's inputs, such that the combination is randomly determined for each repetition. Testing frameworks usually come with a method for this, such as Jest's test.each function.
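Both parallels can be sketched together. To keep the example runnable on its own, it uses `Math.random` instead of chance.js and a plain loop instead of a test framework. The reduced types, the stub body of `updateResource`, and the names `randomResource` and `runRandomisedCheck` are all assumptions made for illustration, since the real function body is not shown.

```typescript
// Reduced, hypothetical version of the types from the beginning of the post:
// only three of the attributes are kept so the sketch stays short.
type ResourceAttributes = { a0?: string; a1?: string; a2?: string };
type Resource = { id: string; modifiedAttributes: ResourceAttributes };

// Stand-in for the real function, whose body is not shown in the post.
// Here it simply reports how many attributes it was asked to modify.
const updateResource = (props: Resource): number =>
  Object.keys(props.modifiedAttributes).length;

const attributeNames: (keyof ResourceAttributes)[] = ["a0", "a1", "a2"];

// Build one random input: each attribute is included with a 50% chance
// and gets a random string value.
const randomResource = (random: () => number = Math.random): Resource => {
  const modifiedAttributes: ResourceAttributes = {};
  for (const name of attributeNames) {
    if (random() < 0.5) {
      modifiedAttributes[name] = random().toString(36).slice(2);
    }
  }
  return { id: random().toString(36).slice(2), modifiedAttributes };
};

// Repeated sampling: run the same check against many random inputs
// and return how many of them failed.
const runRandomisedCheck = (repetitions: number): number => {
  let failures = 0;
  for (let i = 0; i < repetitions; i++) {
    const modifiedCount = updateResource(randomResource());
    // Property being checked: the stub never reports more attributes than exist.
    if (modifiedCount < 0 || modifiedCount > attributeNames.length) failures++;
  }
  return failures;
};

console.log(runRandomisedCheck(1000)); // 0 for this stub
```

In a real suite, the property inside the loop would be replaced by whatever the function is actually expected to guarantee, and the loop by the test framework's own repetition mechanism.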

With these in mind, I am in a position to argue that a set of randomly generated input samples, given that they meet the aforementioned conditions, may be sufficient for testing a function and concluding that it works as intended if the tests pass, just as I may conclude that the dice are fair – as long as the tester can reasonably ignore the interdependencies between the input variables.

Pitfalls of relying on chance

What if I suspect there is an interdependency among inputs that the tester cannot ignore? In fact, this is the only case where using random testing would objectively be a bad idea. In this case, it's a better idea for the tester to thoughtfully cover all these cases with manually constructed tests, because the assumption of independence between input variables does not hold.

Using the dice example: if I could predict the second die roll from the first one, would there be a reason for rolling the second die ever again? In terms of testing, this would be equivalent to the tester generating tests where the input provided to the tests is not representative of the input that the function is supposed to receive.

It is also not a good idea to include these tests among those that are meant to be used to check for regressions in the codebase as a whole, for example in the deployment pipeline. Since the sampling of input is random, it would not be pleasant to have tests fail just because of random sampling (at that stage of development).

Conclusion

In conclusion, inferences from the central limit theorem could be used as a compromise to cut down the exploration phase of writing tests, as long as the tester is aware of the pitfalls, and knows when not to use it.