
Why code coverage is not a reliable metric

conectionist ・ 2 min read

I often hear statements like "Let's increase code coverage" or "What's our code coverage?" (mostly from managers). And I think to myself "Oh, boy...".

Not because I don't like unit tests or think they're useless. On the contrary. I think unit tests are very important and useful.

What I'm against is using code coverage as a metric. Because it can mean nothing. And it usually doesn't.

Let me explain why.

Consider the following class that validates an email address:

using System.Text.RegularExpressions;

class EmailValidator
{
    public static bool ValidateEmail(string email)
    {
        // just for demo purposes
        // I know it's not a very good email regex :P
        return new Regex(@"^[a-zA-Z0-9_]+@[a-z]+\.com$").IsMatch(email);
    }
}


And a class used for unit testing the email validator:

class EmailValidatorTest
{
    public void TestValidEmail()
    {
        Assert.IsTrue(EmailValidator.ValidateEmail("someone@email.com"));
    }
}

What's my code coverage?

Well, I only have one class with one method. And my unit test goes through that code. That means my entire "codebase" is covered. Yay! 100% coverage!
That means my code is bullet-proof, right? WRONG!

Why? Because my regex, although not very complex, covers a lot of scenarios.
I've only written a unit test for lowercase letters.
I haven't written unit tests for inputs such as:

  • empty string
  • caps
  • numbers
  • underscore
  • alphanumeric characters
  • combinations of the above (and there are plenty)
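Translated to Python (the article's snippets are C#-style; here the `re` module with `fullmatch` stands in for the anchored regex), this is roughly what those untested inputs do against the same pattern:

```python
import re

# the article's demo pattern; fullmatch makes it match the whole input
EMAIL = re.compile(r"[a-zA-Z0-9_]+@[a-z]+\.com")

def validate_email(email: str) -> bool:
    return EMAIL.fullmatch(email) is not None

# the single case the article's unit test exercises
print(validate_email("someone@email.com"))     # True
# inputs that 100% line coverage never touched
print(validate_email(""))                      # False (empty string)
print(validate_email("SOMEONE@email.com"))     # True  (caps)
print(validate_email("user123@email.com"))     # True  (numbers)
print(validate_email("first_last@email.com"))  # True  (underscore)
print(validate_email("user@EMAIL.com"))        # False (caps in the domain)
```

Every one of these calls runs the exact same line of code, so a coverage tool reports 100% no matter how many of them you test.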

So, what's my REAL coverage? I would say less than 15%.

What about real life?

I'm not saying code coverage is always unreliable. But it only works as expected for simple cases. Like the following:

bool IsEven(int n)
{
    if (n % 2 == 0)
        return true;
    else
        return false;
}


Two unit tests for this code will give 100% coverage and that's actually all the unit tests you need for this.
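In Python terms, the two tests in question are one per branch, and here that genuinely is the whole behavior:

```python
def is_even(n: int) -> bool:
    if n % 2 == 0:
        return True
    else:
        return False

# one test per branch gives 100% coverage AND full confidence,
# because the function has no behavior beyond these two branches
assert is_even(2) is True   # the "if" branch
assert is_even(3) is False  # the "else" branch
```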

But you don't have much code like this in real life. From my experience, less than 5% of any codebase is dead simple like this.

Most of the time, you use third party libraries that have classes/functions like the regex I used above. Even though I only wrote one line, there are a bunch of if/else statements hiding behind that.

Conclusion

What I disagree with isn't code coverage per se. It actually helps developers visualize the areas in which they haven't written any unit tests at all.

The problem is most people don't understand that
high code coverage ≠ code reliability.

You can have >95% code coverage and your application can be unacceptably buggy (I've actually witnessed this myself).

Discussion

You don't have 100% code coverage until you have tests for each possible input value. That's the point. Your first example is more than 99% away from 100% coverage :-) That's what most people get wrong.

 

I couldn't have said it better myself: "That's what most people get wrong".
It's the subtle difference between NAIVE code coverage (that tools show you) and REAL code coverage (that you can only deduce by using intuition... or very strong AI).

And that's why code coverage is not a reliable metric.

 

I fully agree with your observations about code coverage. Though the testing approach behind it is a bit more precise and useful when applied correctly.

What you mean by "naive" code coverage is called "statement coverage". It means counting how many code statements (lines of code) you have covered.

There are more precise metrics available, like "decision coverage", which checks whether you cover all possible outcomes (true or false) of boolean decisions in statements like "if" and "while". Or "condition coverage", which even looks into the boolean subexpressions. Etc.
The problem is that most coverage tools are only able to check for statement coverage.
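To make the distinction concrete, here is a minimal Python sketch (the `discount` function is hypothetical, invented just to show the three coverage levels):

```python
def discount(price: float, is_member: bool) -> float:
    """Hypothetical: members get 10% off on orders over 100."""
    total = price
    if is_member and price > 100:
        total = price * 0.9
    return total

# statement coverage: a single test executes every line
assert discount(200, True) == 180.0

# decision coverage additionally requires the False outcome of the "if"
assert discount(50, True) == 50

# condition coverage examines each boolean subexpression on its own
assert discount(200, False) == 200  # is_member is False
assert discount(50, False) == 50    # price > 100 is False
```

The first assertion alone yields 100% statement coverage, yet it never observes what happens when the decision, or either of its subconditions, is false.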

And, furthermore, for your email regex all this would not help, since you do not use any boolean decisions there. With that example you are simply facing the basic problem of test case selection. This has nothing to do with unit testing specifically; you have that problem in all kinds of tests.

"Real" coverage would mean testing all string (and non-string) inputs to your method. But testing all cases is usually not feasible, so you need to make a selection.
Here, too, there are techniques available that at least help you make a good selection of cases to test, like "equivalence partitioning", "boundary value analysis", "decision table testing", etc.
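A quick Python illustration of the first two techniques (the `can_vote` check is a made-up example, not from the article):

```python
def can_vote(age: int) -> bool:
    """Hypothetical rule: voting is allowed from age 18."""
    return age >= 18

# equivalence partitioning: one representative value per partition
assert can_vote(30) is True    # partition: clearly old enough
assert can_vote(10) is False   # partition: clearly too young

# boundary value analysis: probe the edges, where off-by-one bugs hide
assert can_vote(18) is True    # the boundary itself
assert can_vote(17) is False   # just below the boundary
```

Four deliberate cases instead of testing every possible integer, and the two boundary cases are exactly the ones a naive "one test for coverage" suite tends to skip.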

So you do not need to rely only on intuition to have good unit tests. And you are right, just writing some tests to achieve high statement coverage is definitely a very weak approach.

I agree with everything you said.
Especially the last sentence which is essentially the point I was trying to make when I wrote this article.

Thanks for sharing your thoughts.

P.S. "Intuition" was probably not the best choice of words. What I was trying to say is that it requires some degree of non-trivial thinking and is, therefore, not easy to write tools that help you in these situations.

 

Your first example is more than 99% away from 100% coverage :-)

That's wrong. The code is 100% covered with tests, yet it hasn't been tested for all possible cases.

 

Simple question: if not all cases have been tested, how can that be 100% coverage? ;) Please see frantzen's reply to my first comment, where the many different meanings of "coverage" are explained.

 

Sounds to me like we're saying the same thing but in different ways :)

 

The return type of ValidateEmail is void, yet you return a boolean value in your method. Also, the fact that your test doesn't contain any assertions makes it hard to understand, since you don't explicitly state your expectation.

You should either return a boolean flag to indicate success/failure or throw an exception in case the email is invalid. If you throw an exception, you have to catch it in your test and mark the test as failing, so everyone who reads it can understand what happened when it fails.

Thanks for sharing your thoughts.

 

Good catch!
I made the suggested changes.
Thanks for the code review. ;)

 

I totally agree with what you've written in conclusion. Code coverage "is" a reliable metric, but it does not show the quality of tests, it only shows that code was covered with "some" tests.

We are working with legacy code, with really bad test coverage and bad tests overall. What we did was actually remove some of the old tests and present the coverage drop to our stakeholders with the comment that "this amount of tests did nothing to verify the correctness of the implementation".

 

Indeed, it doesn't show the quality.
Code coverage tools only tell you there are unit tests that cover SOME scenarios of a particular piece of code. But one unit test is rarely enough for a given piece of code.
That's precisely the problem. Code coverage tools cannot tell you how "good" a unit test is. For that you need human logic or very strong artificial intelligence.
But tools that could do that would probably cost more than companies would be willing to invest in.

 

Yup, that's true. I was actually planning on developing a tool that would indicate how "good" the tests are. It's not a silver bullet, but it should at least tell you things like:

  • the stub you've declared isn't used anywhere
  • the result you are receiving is not verified
  • there have been changes on some object without proper assertions, etc.

...and the list goes on, but I found it technically too challenging to do as just a pet project :D.

That sounds cool.
Please consider sharing if you ever develop it.

 

Hi, nice article.

Do you think using PACT will help? We do not want separate backend testing and rely on our unit and integration tests. We think that is more than sufficient. No idea how good they are. We are relying on code coverage (at the line, method and class level), and with SonarQube we are at about 90%.

Ours is a new application with few public APIs. And we do have automated systems test for the front end.

The application is growing at a rapid rate, and I am not very confident about the backend while keeping testing at bay. An experienced developer on our team has shared this link, suggesting that we are doing it right in terms of keeping the quality of the code optimal -
martinfowler.com/articles/microser...

Can you suggest to me, are we doing it the best way? There are no constraints on budget or resources to limit us from doing anything that is needed.

 

Hello,

I'm not familiar with PACT. I just did some quick research on it a few minutes ago.
If you say you're working on an API and using PACT to test it makes your life easier, I say go for it. :)

"We do not want separate Backend testing and relay on our Unit and Integration tests. We think that is more than sufficient"
Not sure if I understood correctly, but are you saying that you don't have any kind of tests on your backend? (unit, integration, system, acceptance, etc.)
If you don't, it's probably a good idea to have some that test at least basic functionality.
To see whether writing tests for your backend would be worth it, just look at your list of resolved bugs and make a rough estimate of how many could have been avoided if you had tests (of any kind).
But if you feel that things are OK this way and few bugs have been found in your backend by QA after rigorous testing, then it's probably not worth investing the time to write any tests for it.

"We are relying on Code coverage (at the line, method and Class level) and with Sonar cube thing we are like 90%"
Yeah, well the point I was trying to make in this article is that code coverage is not a very reliable metric. :))
That doesn't mean that your unit tests are useless or that you should ignore your coverage.
It simply means that having 90% coverage does not necessarily make your application almost bullet proof.

"An experienced developer in our team has shared this linked suggesting that we are doing it right in terms of keeping the quality of the code to optimal"
I've looked at it, and it looks OK in theory. However, from my experience, the best solutions are tailored ones.
I think you should try the suggestions in that article and see if it works for you (or what part of it works for you).

"Can you suggest to me, are we doing it the best way?"
Well, as I mentioned above, the best solutions are the ones that are adapted to every team's needs (or project's requirements).
Without being involved in your project/team, it's difficult for me to say if it's the best, because "the best" depends on so many factors.
But, as I mentioned earlier, the link your colleague shared sounds good to me. The only way to know for sure is to try it.
If you say that there are no constraints on the budget or resources, you can probably afford to experiment and see what works for you best. ;)

Let me know if this helps! ;)

 

Hi, Thanks for replying-

Regarding - "Not sure if I understood correctly, but are you saying that you don't have any kind of tests on your backend?"
...yes, we do have unit and integration tests, but it's just that. We are not sure if those tests are sufficient. And we are not getting any strong opinion about doing or ignoring backend testing. How can we find out whether it is necessary? I read what you said about resolved bugs - "just look at your list of resolved bugs and make a rough estimate of how many could have been avoided if you had tests (of any kind)." We do not like to log many; we fix them as soon as we find them, or we add them as a story in our backlog for tracking.

About - "But if you feel that things are OK this way and few bugs have been found in your backend by QA after rigorous testing, then it's probably not worth investing the time to write any tests for this." - we are not doing manual testing at all, and the backend has never been exposed to any sort of testing except unit and integration (mostly using mocks). This is the first time I am observing this way of programming without much testing. I am ready to learn from mistakes, but not at the cost of leaving open ends, corner cases, vulnerabilities, etc. So I want to discuss this as much as possible in forums like this to gain perspective :)

I would advise doing at least some manual testing once in a while.
From my experience, simulations and real-life situations (i.e. mocks vs. actual manual testing) are not always equal.
When mocking, you're basically making assumptions (which are usually favourable to your expectations) that might not hold in real-life scenarios.

 

Nice example. I included a link to this article, and copied a couple of lines from it, in a presentation I did on database API testing last Friday in Dublin, where I went on to discuss domain partitioning:
aprogrammerwrites.eu/?p=2318

 

Cool. Glad you found it useful. :)

 

You have to learn about mutation testing.
It tests your tests to make sure they're reliable.

 

AFAIK, mutation testing involves tools that automatically modify (mutate) your code and check how that affects/breaks existing functionality... Or something along those lines :D

However, it's only (fairly) easy to do for simple code. Realistic (i.e. complex) code requires a fair amount of AI for it to actually be useful.
That's the reason why it's not so popular.

But it would most certainly be a lot more useful than simple code coverage.
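A hand-rolled Python sketch of the idea (real tools such as mutmut or PIT generate the mutants automatically; the `in_range` function and the mutation are invented for illustration):

```python
def in_range(x: int) -> bool:
    """Original code under test."""
    return 0 <= x <= 10

def in_range_mutant(x: int) -> bool:
    """A mutant the tool might generate: upper bound changed to 11."""
    return 0 <= x <= 11

# this tiny suite achieves 100% coverage of in_range...
suite = [(5, True), (-1, False)]
for value, expected in suite:
    assert in_range(value) is expected
    # ...yet the mutant passes it too (it "survives"),
    # proving the suite never probes the upper boundary
    assert in_range_mutant(value) is expected

# the missing test that would kill the mutant:
assert in_range(11) is False
assert in_range_mutant(11) is True  # mutant behaves differently -> detected
```

A surviving mutant is exactly the signal a coverage report can't give you: the line ran, but no assertion constrained its behavior.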

 

You don't have to code anything; you just pass the test directory as an argument and the mutation testing tool tests everything for you.

You have nothing to do! It's not popular because tests aren't that widespread and the process is long. Really long. It can take up to 20 minutes on the first run.

That sounds neat.

I'm going to research about this subject a bit more.

 

Your validator checks whether emails are valid or not - it doesn't make any sense to check only one valid email address. Code coverage is an indicator of how much of your code is covered by any test, not a grade of how good your tests are. (Unfortunately, most code coverage measures don't check how often a test ran over code to strengthen the result...)

 

This is exactly the point I'm trying to make!
Most people seem to think that if one or more unit tests have covered an area (i.e. it has 100% coverage), then that area is bullet proof.
But the sad reality is that it may very well not be.

 

Add Sonar metrics to that list 😖

 

You mean adding Sonar metrics helps, or it makes things worse?!

 

That's a story for another day :)