This post is based on a GopherCon talk by Daniela Petruzalek: Who tests the tests?
A little bit of history
In the beginning, we checked our code manually, running the application and trying different inputs: we called that manual testing. Then we discovered that we could write code that exercises the application code and checks that it's correct: we called that automatic testing (unit, integration, functional, etc.).
Now we are in the AI era, where we write fewer tests and even less code by hand. We need a way to swiftly check that the tests are correct and that they cover the cases we expect in our application.
How can we know if code written by AI is correct?
You can use automatic tests in the same way we used to check code written by humans.
One option is to let the AI write the application code while you write the tests. You can even use TDD: you, the human, write the tests first, and the AI writes the implementation afterwards.
Another option is to let the AI write both the application code and the tests. But then, how can you be sure those tests are testing the things you want or need? Reading and understanding the tests yourself would be best, but can we do that automatically?
Enter: mutation tests
Mutation testing is a way to check that the tests you have are testing the code the way you want by making small changes to the application code and checking that the tests fail as expected.
It's based on a concept known as a mutant: a version of your application with a small change.
How does it work?
The mutation testing cycle is this:
- Create a mutant: change the application code by applying just one change
- Run your test suite
- Check how many mutants were killed
After running your tests, the ones that failed are said to have killed the mutant. Those tests are checking something related to the code you changed and are, from the perspective of that change, good tests.
After many changes, if a test never failed, it means it didn't kill a single mutant and is a weak test. You should probably delete it or replace it with a better one.
If a mutant is never killed, that code is either not being tested at all (no coverage) or being tested poorly. That's an opportunity to write a test for that piece of code if needed.
Do I have to make these changes manually?
You can do it manually, but there are some tools to aid you:
- C#, TypeScript and Scala: Stryker
- Go: go-gremlins
- Java: pitest
- Python: mutatest
- Rust: mutants.rs
These tools provide a way to make those changes automatically and some of them run the tests for you.
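In miniature, such a tool is just a loop around the mutation cycle. Below is a minimal sketch in Python (my own illustration, not one of the tools above), assuming the code under test lives in app.py and the suite in tests.py, both hypothetical names. It only swaps a few arithmetic operators; real tools apply many more kinds of mutations:

import ast
import subprocess
import sys

# Swap a few arithmetic operators; real tools mutate much more than this.
MUTATIONS = {ast.Div: ast.Mult, ast.Mult: ast.Div, ast.Add: ast.Sub, ast.Sub: ast.Add}

def make_mutants(source):
    """Yield one mutated source string per swappable binary operator."""
    swappable = sum(isinstance(n, ast.BinOp) and type(n.op) in MUTATIONS
                    for n in ast.walk(ast.parse(source)))
    for target in range(swappable):
        tree = ast.parse(source)
        seen = 0
        for node in ast.walk(tree):
            if isinstance(node, ast.BinOp) and type(node.op) in MUTATIONS:
                if seen == target:
                    node.op = MUTATIONS[type(node.op)]()  # apply exactly one change
                seen += 1
        yield ast.unparse(tree)

def suite_fails():
    """Run the suite; a non-zero exit code means the mutant was killed."""
    return subprocess.run([sys.executable, "tests.py"],
                          capture_output=True).returncode != 0

original = open("app.py").read()
killed = total = 0
try:
    for mutant in make_mutants(original):
        open("app.py", "w").write(mutant)  # install the mutant
        if suite_fails():
            killed += 1
        total += 1
finally:
    open("app.py", "w").write(original)  # always restore the original code
print(f"killed {killed} of {total} mutants")

Each pass installs exactly one change, runs the whole suite in a subprocess, and restores the original file at the end.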
Why do we need mutation testing?
Let's see a very simple example in Python:
def divide(a, b):
    if b == 0:
        raise ValueError("can't divide by 0")
    return a/b
A simple test suite we can have is:
import unittest
from divide import divide

class TestDivide(unittest.TestCase):
    def test_divide_error(self):
        with self.assertRaises(ValueError):
            divide(1, 0)

    def test_divide_success(self):
        self.assertEqual(1, divide(1, 1))

if __name__ == '__main__':
    unittest.main()
Run the tests and everything is fine:
$ python tests.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.000s
OK
Here we are testing both flows: when an error is raised and when we complete a successful operation. Coverage is at 100%, but those tests are not great, and here's why. Let's change the implementation of divide from a/b to a*b:
def divide(a, b):
    if b == 0:
        raise ValueError("can't divide by 0")
    return a*b
The tests should fail now, right? They don't, because 1*1 is still 1:
$ python tests.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.000s
OK
This is a problematic test:
def test_divide_success(self):
    self.assertEqual(1, divide(1, 1))
Even when you have 100% test coverage, that doesn't mean you have 100% test case coverage; some cases may be missing or not being tested correctly.
What we did here was a mutation: from a/b to a*b. That mutation gives us information about our tests that we didn't have before.
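One way to kill this particular mutant is to pick inputs where division and multiplication disagree. A minimal fix, with numbers of my own choosing:

def test_divide_success(self):
    self.assertEqual(2, divide(4, 2))  # the a*b mutant returns 8, so this fails

Note that a single assertion still won't catch every mutant: one that changes a/b to a//b would survive this test, because 4 // 2 is also 2.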
How to read the results of mutation testing?
When you run your test suite against mutated code, each test either kills the mutant (the test fails, so it detected the change) or lets it survive (the test still passes, so it missed the change). After many mutations, you can summarize how often each test killed mutants, for example:
After 200 Mutations:
TestA: 140 Kills, 60 Survived
TestB: 200 Kills
TestC: 30 Kills, 170 Survived
How to interpret this:
- TestA is a good test: it killed most of the mutants.
- TestB is your strongest test: it detected every mutation you ran.
- TestC is the weak one: it usually doesn't detect mutations.
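You can turn these counts into a per-test mutation score: the fraction of mutants the test killed. A quick sketch using the made-up numbers above:

# kills per test, out of 200 mutants (the numbers from the table above)
results = {"TestA": 140, "TestB": 200, "TestC": 30}
TOTAL_MUTANTS = 200

for test, kills in results.items():
    print(f"{test}: {kills / TOTAL_MUTANTS:.0%} mutation score")
# TestA: 70% mutation score
# TestB: 100% mutation score
# TestC: 15% mutation score

Anything close to 100% is a strong test; anything close to 0% is a candidate for deletion or rewriting.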
When not to use mutation testing?
In some projects or languages, running mutation tests can take hours: for each mutant, the tool has to compile the code (if needed) and run the full test suite. If your test suite takes 10 minutes, each mutant costs at least that long, so 100 mutants add up to more than 16 hours. Keep this in mind.
Use this approach on small projects, or on ones where the "edit, compile, test" cycle is short.
Closing
So who tests the tests? In practice, mutation testing does: it checks that your tests react when the code is deliberately broken.