Erik Dietrich

Posted on Apr 9, 2020 • Originally published at blog.ndepend.com

Unit Tests Correlate With Desirable Codebase Properties

#architecture #dotnet #csharp #unittest

I originally posted this on the NDepend blog about two years ago. NDepend is a static analysis tool that plugs right into Visual Studio, and it was what I used to gather and record this data.

Today, I give you the third post in a series about how unit tests affect codebases.

The first one wound up getting a lot of attention, which was fun. In it, I presented some analysis I'd done of about 100 codebases. I had formed hypotheses about how I thought unit tests would affect codebases, and then I tested those hypotheses.

In the second post, I incorporated a lot of the feedback that I had requested in the first post. Specifically, I partnered with someone to do more rigorous statistical analysis on the raw data that I'd found. The result was much more clarity about not only the correlations among code properties but also how much confidence we could have in those relationships. Some had strong relationships while others were likely spurious.

In this post, though, I'm incorporating the single biggest piece of feedback. I'm analyzing more codebases.

Analysis of 500 (ish) C# Codebases

Performing static analysis on and recording information about 500 codebases isn't especially easy. To facilitate this, I've done significant work automating ingestion of codebases:

Enabling autonomous batch operation
Logging which codebases fail and why
Building in redundancy against accidentally analyzing the same codebase twice
Executing not just builds but also NuGet package restores and other build steps
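As a rough illustration of the first three items (this is not the tooling used for the study), a batch loop could de-duplicate repositories and log failures along the lines below. The names `normalize`, `ingest`, and `fake_analyze`, and the example URLs, are all hypothetical; the real restore, build, and analysis steps would sit behind the `analyze` callback.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")


def normalize(repo_url):
    """Reduce a repository URL to a canonical key so duplicates are detected."""
    key = repo_url.strip().lower().rstrip("/")
    if key.endswith(".git"):
        key = key[: -len(".git")]
    return key


def ingest(repo_urls, analyze):
    """Run `analyze` once per repository; return (results, failures)."""
    seen = set()
    results = {}
    failures = {}
    for url in repo_urls:
        key = normalize(url)
        if key in seen:
            log.info("skipping duplicate: %s", url)
            continue
        seen.add(key)
        try:
            # Hypothetical callback: package restore, build, static analysis.
            results[key] = analyze(url)
        except Exception as exc:
            # Keep the batch running and record why this codebase failed.
            failures[key] = f"{type(exc).__name__}: {exc}"
            log.warning("failed: %s (%s)", url, failures[key])
    return results, failures


def fake_analyze(url):
    """Stand-in analyzer used only for this demonstration."""
    if "broken" in url:
        raise RuntimeError("build failed")
    return {"lloc": 1000}


results, failures = ingest(
    [
        "https://example.com/a/b.git",
        "https://example.com/A/B/",  # same repository, different spelling
        "https://example.com/broken/repo",
    ],
    fake_analyze,
)
```

Catching every exception per codebase is a deliberate choice here: one failed build should not stop an unattended run, and the recorded reason is what makes the failure reviewable afterwards.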

That's been a big help, but there's still the matter of finding these codebases. To do that, I mined a handful of "awesome codebase" lists, like this one. I pointed the analysis tool at something like 750 codebases, and it naturally filters out any that don't compile or otherwise have trouble in the automated process.

This left me with 503 valid codebases. That number came down to 495 once adjusted for codebases that, for whatever reason, didn't have any (non-third party) methods or types or that were otherwise somehow trivial.

So the results here are the results of using NDepend for static analysis on 495 C# codebases.

Stats About the Codebases

Alright. So what happened with the analysis? I'll start with some stats that interested me and hopefully interest you. I'm looking here to offer some perspective.

I analyzed a total of 6,273,547 logical lines of code (LLOC). That's a lot of code! (Note: codebases have fewer LLOC than editor LOC. The latter is probably how you're used to reasoning about lines of code.)
The mean codebase size was 12,674 LLOC and the median was 3,652 LLOC, so a handful of relative monsters dragged the mean pretty high.
The maximum codebase size was 451,706 LLOC.
The percent of codebases with 50% or more test methods was 3.4%, which is pretty consistent with the 100-codebase analysis.
The percent of codebases with 40% or more test methods, however, was 9.1%, which was up from about 7% in the last analysis. So, it appears that we're getting a little more test-heavy codebase representation here.
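The arithmetic behind these figures is easy to sketch. The total LLOC and codebase count below are the ones quoted above; the `sizes` and `pcts` lists are made-up values, included only to show how one large codebase skews the mean and how a "percent of codebases at or above a threshold" figure is computed. The helper name `share_at_or_above` is hypothetical.

```python
from statistics import mean, median

# Figures quoted in the post.
TOTAL_LLOC = 6_273_547
CODEBASES = 495
reported_mean = round(TOTAL_LLOC / CODEBASES)
print(reported_mean)  # 12674, matching the reported mean

# Hypothetical sizes in LLOC (NOT data from the study): one large
# codebase pulls the mean far above the median.
sizes = [900, 2_100, 3_600, 4_800, 250_000]
print(mean(sizes), median(sizes))


def share_at_or_above(test_method_pcts, threshold):
    """Percent of codebases whose test-method percentage is >= threshold."""
    hits = sum(1 for pct in test_method_pcts if pct >= threshold)
    return 100 * hits / len(test_method_pcts)


# Hypothetical test-method percentages for ten codebases.
pcts = [0, 0, 5, 12, 18, 25, 33, 41, 47, 55]
print(share_at_or_above(pcts, 50), share_at_or_above(pcts, 40))
```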

Findings From Last Time

Here's a quick recap of some of the findings from last time around.

Average method cyclomatic complexity seemed unrelated to prevalence of unit tests.
Average method nesting depth also seemed unrelated to the prevalence of unit tests.
More lines of code per method probably correlated with more unit tests, surprisingly.
Lines of code per constructor decreased with an increase in unit tests, but the p-value was a little iffy.
Parameters per method decreased as unit tests increased, and with a pretty definitive p-value.
Number of overloads per method possibly had a negative relationship with unit test prevalence.
More inheritance correlated with fewer unit tests, fairly decisively.
Type cohesion correlated strongly with an increase in the unit test percentage.

I've omitted a few of the things I studied in the previous posts, both for the sake of brevity and in order to focus on what I think of as properties of clean codebases. Generally speaking, you want code with fewer lines, less complexity, fewer parameters, fewer overloads, and less nesting per method. In terms of types, you want a flat inheritance hierarchy and more cohesion.

What a Difference 400 Codebases Make

So, let's take a look at what happens now that we substantially increased sample size. I'll summarize here and add a couple of screenshots below that.

Average method cyclomatic complexity has a strong negative correlation with prevalence of unit tests!
Average method nesting depth now correlates negatively with more unit tests, though p-value isn't perfect.
More lines of code per method flattened and saw a p-value spike, changing this to "probably no relationship."
Lines of code per constructor decreased even more with more unit tests, and p-value became bulletproof.
Parameters per method, like lines of code per constructor, became an even more bulletproof negative correlation.
Number of overloads per method became flatter and p-value worse, so I'm going to say a relationship here isn't overly likely.
More inheritance still correlated with fewer unit tests, but p-value is now non-trivial and the relationship flattened.
Type cohesion's relationship didn't change very much at all.
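For readers who want to see what "a negative correlation with a good p-value" means mechanically, here is a minimal sketch. The post does not say which statistical test produced its p-values, so this uses a permutation test as a stand-in that needs only the standard library. The data points are invented for illustration and are not from the study; `pearson_r` and `permutation_p_value` are hypothetical helper names.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def permutation_p_value(xs, ys, trials=10_000, seed=0):
    """Two-sided p-value: how often shuffled data gives |r| at least as large."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (trials + 1)


# Hypothetical data, NOT from the study: percent of test methods and
# average cyclomatic complexity per method for ten codebases.
test_pct = [0, 2, 5, 10, 15, 22, 30, 38, 45, 55]
complexity = [3.1, 2.9, 3.0, 2.6, 2.7, 2.3, 2.2, 2.0, 1.9, 1.7]

r = pearson_r(test_pct, complexity)
p = permutation_p_value(test_pct, complexity)
print(f"r = {r:.2f}, p = {p:.4f}")
```

A negative `r` means complexity tends to fall as the share of test methods rises, and a small `p` means a relationship that strong rarely appears when the pairing between the two columns is shuffled at random.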

[Screenshots: Average Cyclomatic Complexity Per Method; Average Method Nesting Depth; Lines of Code Per Method; Number of Overloads Per Method]

Unit Tests and Clean Code

If I circle back to my original hypotheses, it seems I'm doing better as I add more codebases to the study.

With 500 codebases in the mix, the results have improved considerably, though I'm not entirely sure why. Perhaps some outliers skewed the original study a bit more, or perhaps this resulted from the codebase corpus on the whole becoming more "unit-test heavy." But whatever the reason, five times the sample size is starting to show some pretty definitive results.

The properties that we associate with clean code (cohesion, minimal complexity, and overall thematic simplicity) seem to show up more as unit tests show up more.

The only exception that truly surprises me was, and remains, lines of code per method. I wonder if this might be the result of a higher prevalence of properties in non-test-heavy codebases, or of some other confounding relationship. In any case, though, it's interesting.
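To make the property hypothesis concrete, here is a tiny worked example with entirely hypothetical numbers. It assumes that one-line property accessors are counted as methods when the average is computed, which the post does not state.

```python
def average_lines_per_method(method_lengths):
    """Mean length, in lines, over every method in a codebase."""
    return sum(method_lengths) / len(method_lengths)


# Hypothetical codebase: 40 ordinary methods of 8 lines each.
ordinary = [8] * 40

# The same code plus 60 one-line property accessors counted as methods.
with_accessors = ordinary + [1] * 60

print(average_lines_per_method(ordinary))        # 8.0
print(average_lines_per_method(with_accessors))  # 3.8
```

Under that assumption, a codebase full of simple properties would report a much lower average without its ordinary methods being any shorter, which could mask a real difference between test-heavy and non-test-heavy codebases.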

But with 500 codebases analyzed automatically and the results synthesized with statistical modeling software, I feel pretty good about where this study is. And while it doesn't paint a "unit tests make everything rainbows and unicorns" picture, this study now demonstrates, pretty definitively, that codebases with unit tests also have other desirable properties.

What's Next?

I'm going to keep working, in conjunction with the person doing the statistical models, to study more properties of codebases. And I think, for now, I'm going to wrap up this unit test study and move on to other things, satisfied that we've given it a pretty good treatment.

One thing that occurs to me is the somewhat important differences between 100 codebases and 500. Maybe I should grow the corpus to 1,000 or even 2,500 to make sure I don't see a similar reversal. But the thing is, that's a lot of codebases, and I've already nearly exhausted the "awesome lists," so I'm worried about diminishing returns. Rest assured, though: I'll keep slurping down codebases, and if I find myself with significantly more at some point, we'll redo the analysis.

So what's next? I've had a few ideas and am brainstorming more of them.

See what effect having CQRS seems to have on a codebase.
Skinny or fat domain objects? Which is better?
How does functional style programming impact codebases?
How does a lot of async logic affect codebases?
Or what about lots of static methods?

These are just some ideas. Weigh in below in the comments with your own, if you'd like. Hopefully you all find this stuff as interesting as I do!

If you want to take a Freakonomics-style look at your codebase by turning code into data, you can take NDepend for a 14-day spin.

Top comments (2)

KW Stannard • Apr 9 '20

Looking back at the old articles, as poorly and needlessly aggressively as Johan put it in his comment on your first article, I would love to see this analysis done for public methods only instead of all methods. I personally find that making methods public in order to test them leads to worse designs than only testing the original public methods.

Erik Dietrich • Apr 14 '20

For better or worse, life has blown me in a different direction from my old consulting practice and this line of research. I no longer have access to the statistical help I was getting, nor the time to conduct additional studies :/