Mutation testing – clearing up a basic misconception
It’s been quite instructive over the summer to read some existing blog posts on mutation testing. They vary in quality, so I thought I’d go through a few of them and try to clear up some points.
One post that turns up high in a list of search results for me is Jeremy Jarrell’s post from 2010.
This is a nice self-contained example of applying one simple mutation, well explained with code examples at both the C# source and the IL level. However, it falls down when Jeremy says:
After running our test suite against our mutated assembly we see that we now have a mixture of both passing and failing tests. Unlike traditional unit testing, we actually strive for 100% of our tests failing when doing mutation testing. Why? Because if our tests failed after the code was mutated then we know they’re serving us well as regression tests by detecting changes in logic.
This is wrong, I’m afraid. What we strive for with mutation testing is for the test suite as a whole to fail for each mutation applied. We are testing the adequacy of our test suite, so only one test actually needs to fail for each mutation. Jeremy’s approach would require every unit test we write to have 100% code coverage, and meaningful code coverage at that. In fact, being able to count a mutation as detected at the first unit test failure, without running the remaining tests, is part of what enables us to get NinjaTurtles running at an acceptable speed.
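To make the point concrete, here is a minimal sketch in Python. It is purely illustrative and is not how NinjaTurtles is implemented; the function under test, the mutant and the checks are all names I have made up for the example. One mutation is applied, only one of the three checks notices it, and the run stops at that first failure.

```python
from typing import Callable, List, Tuple

Impl = Callable[[int], bool]
Check = Callable[[Impl], bool]  # a check returns True when it passes


def is_adult(age: int) -> bool:
    """Original code under test (hypothetical example)."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """Mutant: the >= operator has been replaced by >."""
    return age > 18


# Three hypothetical unit tests, each parameterised on the implementation.
def check_child(impl: Impl) -> bool:
    return impl(10) is False


def check_boundary(impl: Impl) -> bool:
    return impl(18) is True


def check_pensioner(impl: Impl) -> bool:
    return impl(70) is True


CHECKS: List[Check] = [check_child, check_boundary, check_pensioner]


def suite_detects(checks: List[Check], impl: Impl) -> Tuple[bool, int]:
    """Run checks in order and stop at the first failure.

    Returns (detected, number_of_checks_run).
    """
    checks_run = 0
    for check in checks:
        checks_run += 1
        if not check(impl):
            return True, checks_run  # one red test is enough
    return False, checks_run


if __name__ == "__main__":
    print(suite_detects(CHECKS, is_adult))         # (False, 3): suite is green
    print(suite_detects(CHECKS, is_adult_mutant))  # (True, 2): mutant detected
```

Against the mutant, check_child and check_pensioner would both stay green, and that is fine. The boundary check fails, so the suite fails, and the third check never needs to run.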
Alexander Beletsky has read Jeremy’s post (he says so), and he elaborates on Jeremy’s incorrect statement, saying:
In ideal case all green tests have to be turned red, if some of tests are still green that means that testing code is not good enough to react on mutation, so actual test code must be reviewed and corrected.
Again, this is wrong. In an ideal case, the test suite turns red. It doesn’t matter at all how many tests fail as long as at least one does. I still like to think of this in terms of test-drivenness: in an ideal world, nothing in your code can get there except by your first writing a test to dictate that functionality. Thus everything in your code should be meaningfully tested somewhere in your test suite.
James McCaffrey says it well, as you might expect – the emphasis below is mine, not his:
Mutation Testing is a technique that software engineers can use to measure the effectiveness of their overall testing effort. Suppose a team of testers has created many individual tests; let’s call the collection of tests the test suite. In mutation testing, the original system/program under test is mutated to create a faulty version called a mutant. The mutant program is tested using the suite of individual test cases. The test suite should produce new test case failures; if no new failures appear then that probably means the test suite does not exercise the code path containing the mutated code, which means the test suite does not fully test the system/program. You can then create new test cases that do exercise the mutant code which may reveal bugs. Notice that the term “mutation testing” is somewhat misleading — in mutation testing you do not actually test the system under test, you measure the effectiveness of a collection of tests.
The highlighted statement, with its reference to new failures, implies that there is value in running mutation testing over a test suite which is not all green at the outset. Personally, I think that if you’ve got red tests, your time is better spent fixing them than trying to apply mutation testing. If you like, the red tests are to my mind a higher-order failure. But that aside, this description is spot on.
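As a rough illustration of both points, the Python sketch below refuses to start if the suite is red against the original code, and otherwise reports the mutants that no check notices. Again, every name in it is hypothetical and it is a sketch of the idea, not of any real tool.

```python
from typing import Callable, Dict, List

Impl = Callable[[int], bool]
Check = Callable[[Impl], bool]  # a check returns True when it passes


def is_adult(age: int) -> bool:
    """Original code under test (hypothetical example)."""
    return age >= 18


# Hypothetical mutants, keyed by a description of the change made.
MUTANTS: Dict[str, Impl] = {
    ">= replaced by >": lambda age: age > 18,
    ">= replaced by <": lambda age: age < 18,
}


def check_child(impl: Impl) -> bool:
    return impl(10) is False


def check_pensioner(impl: Impl) -> bool:
    return impl(70) is True


def check_boundary(impl: Impl) -> bool:
    return impl(18) is True


def surviving_mutants(
    checks: List[Check], original: Impl, mutants: Dict[str, Impl]
) -> List[str]:
    """Return the mutants for which every check still passes."""
    # Red tests on the unmutated code are the higher-order failure.
    if not all(check(original) for check in checks):
        raise RuntimeError("fix the red tests before mutation testing")
    # all() short-circuits, so each mutant stops at its first failing check.
    return [
        name
        for name, mutant in mutants.items()
        if all(check(mutant) for check in checks)
    ]


if __name__ == "__main__":
    weak_suite = [check_child, check_pensioner]
    print(surviving_mutants(weak_suite, is_adult, MUTANTS))
    print(surviving_mutants(weak_suite + [check_boundary], is_adult, MUTANTS))
```

With only check_child and check_pensioner, the mutant that swaps >= for > survives, which says the suite never exercises the boundary. Adding check_boundary leaves no survivors.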
