A demonstration of Mutation Testing

#testing #javascript #jest

Test coverage is one of the simplest metrics for gauging the quality of testing, which makes it one that is often turned into a target (“don’t commit any code with less than 80% coverage”) and potentially gamed. A lot of people dismiss it entirely for those reasons. While there is a good defence to be made for it, that’s not the purpose of this post. Instead, I want to provide a simple example of how test coverage can be misleading and introduce mutation testing as a way to address those shortcomings.

Fizzbuzz: A high coverage and buggy example

First, the example code. There’s a simple little game that comes up in coding interviews called fizzbuzz. The rules are:

Take turns counting, starting from 1;
If a number is a multiple of 3, say “fizz” instead;
If a number is a multiple of 5, say “buzz” instead;
If a number is a multiple of both 3 and 5, say “fizzbuzz”.

I’ve implemented that algorithm in a JavaScript function, ready to ship out to our clients. The code for this example is on GitHub, if you’d like to play along. I’ve run all the tests, they all pass, and I even have 100% coverage. So we’re good to ship, right?

Well, actually, no. Of course not. Almost immediately, my client comes back to me saying almost everything in their app is broken. The fizzbuzz game doesn’t work. Their customers are furious.

This is no doubt a caricature of a situation we’re all familiar with: a bug gets out to production despite our best efforts testing before release. 100% test coverage didn’t serve as the guarantee we might have thought it did.

Let’s take a look at the code we shipped in this example:

function fizzbuzz(number) {
    var result = '';
    if (number % 3 === 0) {
        result += 'fooz';
    }
    if (number % 5 === 0) {
        result += 'buzz';
    }
    return result;
}

That’s… pretty terrible. I’m sure you can guess that the tests must be equally terrible to run without raising any alarms. Take a minute to think about what kinds of things go wrong with unit tests that might make this happen. Bad specs? Bad assertions? Remember, we know that the code did, at least, run. Sure enough:

describe("Fizzbuzz", function() {
    it("gets fizzbuzz", function() {
        fizzbuzz(15);
    });

    it("not fizzbuzz", function() {
        fizzbuzz(8);
    });
});

Turns out these tests don’t actually assert against anything. Fizzbuzz of 15 should return a string “fizzbuzz”, but we never check the results of calling fizzbuzz(15). At least we know we didn’t throw an error, but that’s about it.
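
To see why the runner stays green, here is a minimal sketch in plain JavaScript with no test framework involved. The runTest helper is a hypothetical stand-in for what a runner does with an it block: the test passes unless its body throws.

```javascript
// The function as shipped, typo included.
function fizzbuzz(number) {
    var result = '';
    if (number % 3 === 0) {
        result += 'fooz';
    }
    if (number % 5 === 0) {
        result += 'buzz';
    }
    return result;
}

// Hypothetical stand-in for a test runner:
// a test passes unless its body throws.
function runTest(body) {
    try {
        body();
        return 'pass';
    } catch (err) {
        return 'fail';
    }
}

// Same shape as the two tests above: the return value is discarded.
var getsFizzbuzz = runTest(function () {
    fizzbuzz(15);
});
var notFizzbuzz = runTest(function () {
    fizzbuzz(8);
});

console.log(getsFizzbuzz); // pass
console.log(notFizzbuzz);  // pass
console.log(fizzbuzz(15)); // foozbuzz
```

Every line of the function ran, so it counts as covered, but nothing ever looked at what came back.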

Introducing mutation testing

This is where mutation testing comes in. The concept is this: given some code with passing tests, we’ll deliberately introduce bugs into that code and run the tests again. If the tests fail, that means they caught the bug, and we call that a success. We want the tests to fail! If the tests pass, that means that they’re not capable of catching the bug.

Whereas regular coverage just tells you that your code ran, mutation coverage tells you whether your tests can fail.
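
The idea can be sketched by hand before bringing in any tooling. The snippet below is only an illustration: the mutants are written manually, and the names original, mutants, suites, and survivors are made up for this example. A real tool generates the mutants automatically and runs your actual test suite against them.

```javascript
// Code under test: is n a multiple of 3?
const original = (n) => n % 3 === 0;

// Hand-written mutants, each one a small deliberate bug.
const mutants = {
    'flip === to !==': (n) => n % 3 !== 0,
    'always true': () => true,
    'always false': () => false,
};

// Tiny assertion helper: throwing is how a test fails.
function check(condition, message) {
    if (!condition) {
        throw new Error(message);
    }
}

// Three candidate suites of increasing strength.
const suites = {
    'calls only': (fn) => {
        fn(9);
        fn(8);
    },
    'asserts true case': (fn) => {
        check(fn(9) === true, '9 is a multiple of 3');
    },
    'asserts both cases': (fn) => {
        check(fn(9) === true, '9 is a multiple of 3');
        check(fn(8) === false, '8 is not a multiple of 3');
    },
};

// Runs a suite against one implementation; true means every test passed.
function passes(suite, fn) {
    try {
        suite(fn);
        return true;
    } catch (err) {
        return false;
    }
}

// Returns the names of the mutants the suite fails to detect.
function survivors(suite) {
    // Mutation results only mean something if the unmutated code passes.
    if (!passes(suite, original)) {
        throw new Error('suite fails on the unmutated code');
    }
    return Object.keys(mutants).filter((name) => passes(suite, mutants[name]));
}

for (const name of Object.keys(suites)) {
    console.log(name, '->', survivors(suites[name]));
}
```

The suite that only calls the function lets all three mutants survive. Asserting on the true case kills two of them, and only the suite that also covers the false case kills the "always true" mutant.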

For JavaScript, I use Stryker, a tool named for a character in the X-Men movies known for killing mutants. He’s a bad guy in the movies, but he’s on our side now. It supports React, Angular, Vue, and TypeScript. And of course there are similar tools in other languages, though I haven’t used them. The setup is very easy, since it just hooks into your existing test suite to run tests you’ve already written.
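
As a rough idea of what that setup involves, here is a minimal configuration sketch for a Jest project. It assumes the @stryker-mutator/core and @stryker-mutator/jest-runner packages are installed and that the code under test lives in a file named fizzbuzz.js. The file names and the choice of reporters are illustrative assumptions, not taken from the example repository, and option names have changed between Stryker versions, so check them against the version you install.

```javascript
// stryker.conf.js (hypothetical file for this example)
const config = {
    // Reuse the existing Jest suite rather than writing new tests.
    testRunner: 'jest',
    // Files to mutate; this path is an assumption.
    mutate: ['fizzbuzz.js'],
    // Print results to the terminal and also write an HTML report.
    reporters: ['clear-text', 'progress', 'html'],
};

module.exports = config;
```

With a file like this in place, npx stryker run starts a mutation run.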

Let’s run Stryker on our example code:

Stryker generates 14 mutants from our function, and shows that our tests manage to kill none of them. This is a much more helpful number than coverage was. And much like coverage, it reports for us exactly which mutants survived and, while it doesn’t tell us exactly what tests we need, it does point us in the right direction. If no test fails when we force an if condition to always be true, that means we don’t have any tests where it’s false.

In mutant #7, for instance, the string “fooz” in the code—a typo that we didn’t catch—was replaced with an empty string. Because no test failed, the mutant is counted as a survivor. This is telling us explicitly that this string is never checked in the tests. Let’s fix that.

Fixing fizzbuzz

The easiest thing we can do is just add an assertion to one of the existing tests:

    it("gets fizzbuzz", function() {
        expect(fizzbuzz(15)).toEqual("fizzbuzz");
    });

As always, we want to make sure this test actually fails, and it does:

Next, we can fix the code. If we tried to run our mutation tests right away, we’d be in trouble. Stryker wouldn’t be able to tell us whether a failure is because our test successfully found a mutant or just because the code is broken in the first place. Luckily, the fix here is easy. We just have to correct the typo:

    if (number % 3 === 0) {
        result += 'fizz';     // not "fooz"
    }

Now that tests are passing—note that the coverage results are still happily and unhelpfully at 100%—running the mutation tests again shows us that we were able to catch all but two mutants:

I’ll leave it as an exercise for the reader to figure out which two mutants remain and how to catch them too. One last time, here’s a link to the code to get you started.

Mutation testing in real life

This toy example is obviously contrived to show an extreme case, but this works on real code too. I have a number of examples of production code that had full test coverage but still had bugs in areas where mutation testing shone a big red spotlight. As was the case here, it was still up to me to add the tests necessary to assert against the code in question and figure out what the bug was, but it did help tell me where to look.

Mutation testing isn’t a perfect replacement for test coverage, of course. It is only able to catch certain classes of bugs, usually around flow control, booleans, and assignments. It won’t catch faulty logic or a lack of fitness for purpose, though you may find that being unable to test something is a sign that something is wrong. In fact, if you work through the example above, you will find that it is possible to kill 100% of mutants and still not have a good implementation of fizzbuzz. Even if you add additional mutations with Stryker’s plugin API, like any tool it will never catch everything.
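
A quick way to see the gap is to play a few turns with the corrected function. This is a small sketch; the turns variable is just for illustration.

```javascript
// The function after the typo fix.
function fizzbuzz(number) {
    var result = '';
    if (number % 3 === 0) {
        result += 'fizz';
    }
    if (number % 5 === 0) {
        result += 'buzz';
    }
    return result;
}

// One call per turn, counting from 1 to 5.
var turns = [1, 2, 3, 4, 5].map(function (n) {
    return fizzbuzz(n);
});

console.log(turns); // [ '', '', 'fizz', '', 'buzz' ]
```

By the rules at the top of the post, a player counting along would say 1, 2, fizz, 4, buzz. The function returns an empty string on the other turns. Mutants are made by changing code that already exists, so behaviour that was never written gives the tool nothing to mutate and cannot show up as a survivor.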

It also takes quite a while to run, since it has to run tests for every mutant it generates. With Jest, Stryker is smart enough to run only the tests that cover the mutated file, but it is still more resource-intensive. In this small example, Jest finishes in 1 second while Stryker takes 6. Because of that, it’s not something that I include as part of a regular build pipeline, though it is certainly possible.

I can also give you a bit of a shortcut. In my experience, the types of tests that are required for mutation testing tend to be the same types of tests required for branch coverage. This is just an anecdotal correlation based on the handful of products I’ve used this on, so don’t take my word for it. However, if you’re set on using coverage as a test quality gauge, at least upgrade to making sure all your branches are covered.
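
If you want to try that with Jest, a branch threshold can be set in the Jest configuration. The sketch below uses Jest’s collectCoverage and coverageThreshold options; the 80 is only an illustrative number, and the file name is an assumption about how your project is laid out.

```javascript
// jest.config.js (sketch; the threshold value is illustrative)
const config = {
    // Collect coverage on every run so the threshold is enforced.
    collectCoverage: true,
    coverageThreshold: {
        global: {
            // Fail the run if fewer than 80% of branches are covered.
            branches: 80,
        },
    },
};

module.exports = config;
```

With this in place, Jest fails the run when branch coverage drops below the threshold, rather than only reporting the number.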

These days, I treat mutation testing as a tool for occasionally reviewing unit tests, especially when there are large changes. Tests are code, after all, and all code can have bugs in it. Even if you don’t consider unit tests part of a tester’s responsibility, they are the foundation of a solid test strategy, so we do well to make sure that they’re doing what we think they are.