I've had a longtime theory that TODO comments in code don't get fixed any time soon. To get some numbers on this, I set out to analyze a set of GitHub repositories. You can read how I researched the lifetime of TODO comments in my previous post.
In this post, I'll look at the numbers. I'll also talk about some gaps in my research and how I could address them in a next round.
As previously mentioned, I used a sample of GitHub repositories categorized as containing .NET code.
Here are some numbers:
- 23561 repositories
- 94928 files
- 120611 TODO comments
- All C# files, no VB.NET or F# in the dataset
There were two records that had a negative value for the age of the TODO comment. This is due either to an incorrect result from GitHub's GraphQL API or to me using the wrong query. If you know how to fix it, please answer my Stack Overflow question.
There were also a lot of TODO comments with an age of 0 days. This is because I compare the date the comment was introduced with the date of the latest commit, and many repositories are put on GitHub as a sort of archive: the entire codebase is committed and pushed to GitHub once and then never touched again.
Both kinds of records are things that we should remove or take into account.
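Here is a minimal Python sketch of what such a cleanup step could look like. It's an illustration only, not the code from the project: the `TodoRecord` type, its fields, and the sample data are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class TodoRecord:
    # Hypothetical record shape, not the project's actual schema.
    repository: str
    path: str
    age_days: int  # days between the comment's introduction and the latest commit


def split_records(records):
    """Separate usable records from negative-age and zero-age ones."""
    usable, invalid, zero_age = [], [], []
    for record in records:
        if record.age_days < 0:
            invalid.append(record)    # impossible value, likely a query problem
        elif record.age_days == 0:
            zero_age.append(record)   # typical for "commit once" archive repositories
        else:
            usable.append(record)
    return usable, invalid, zero_age


# Made-up sample data.
sample = [
    TodoRecord("alice/archive", "Program.cs", 0),
    TodoRecord("bob/service", "Startup.cs", -3),
    TodoRecord("carol/library", "Parser.cs", 412),
]
usable, invalid, zero_age = split_records(sample)
```

Keeping the rejected records in separate lists, rather than dropping them silently, makes it easy to report how many were excluded and why.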
For the next parts, keep in mind that we're looking at TODO comments that are still present. I haven't yet researched TODO comments that have been fixed.
On average, the current set of TODO comments was introduced 528 days before the last commit. That's almost a year and a half. In all that time, apparently nobody found it crucial to complete the task.
If we look at the median, the number looks better. Half of the TODO comments have been around for less than 246 days. Still a long time.
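A mean that is much higher than the median usually points to a long tail: a relatively small number of very old comments pulls the average up. The small Python sketch below shows the effect with made-up ages; the `summarize` helper is hypothetical and not part of the project.

```python
from statistics import mean, median


def summarize(ages):
    """Return the mean and median of a list of ages in days."""
    return {"mean": mean(ages), "median": median(ages)}


# Made-up ages in days: most comments are young, one is very old.
ages = [10, 20, 30, 40, 900]
summary = summarize(ages)
```

For this sample the mean is 200 days while the median is only 30 days, because the single 900-day comment dominates the average.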
I created a graph of all the numbers and how they're spread out:
It's notable that most TODO comments have been around for less than 50 days. But of course, these could also belong to younger repositories (i.e. those repositories that are put on GitHub for archival purposes only).
Filtering out those "young" repositories gives this graph:
As you can see, many TODO comments are still fairly young. 30% of the TODO comments are at most 280 days old (the 30th percentile), which is well under a year. So these still have a chance of being fixed relatively "fast."
The term "fast" is up for debate here. I'd personally say that even 107 days (the 10th percentile) is too long for a TODO comment to stick around. But if 30% are 280 days old or younger, that also means 70% of the TODO comments are older than 280 days.
In fact, I checked the percentile rank of 365 days. It comes out at 0.377, so 62.3% of current TODO comments are older than one year. These probably have little chance of ever being fixed.
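To make the two calculations concrete, here is a minimal Python sketch: a percentile (which age do p% of the comments stay at or below?) and the reverse lookup, a percentile rank (which share of the comments is at or below a given age?). It uses the simple nearest-rank definition, which is an assumption on my part; spreadsheet functions that interpolate between values can give slightly different results. The function names and the data are made up.

```python
import math


def percentile(ages, p):
    """Nearest-rank percentile: the smallest age with at least p% of values at or below it."""
    ordered = sorted(ages)
    if not ordered:
        raise ValueError("ages must not be empty")
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def percentile_rank(ages, threshold):
    """Fraction of ages that are at or below the threshold."""
    return sum(1 for age in ages if age <= threshold) / len(ages)


# Made-up ages in days.
ages = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]

p30 = percentile(ages, 30)             # age that 30% of the comments stay at or below
rank_365 = percentile_rank(ages, 365)  # share of comments that are at most a year old
older_than_a_year = 1 - rank_365       # share of comments older than a year
```

With the real numbers from this post, the same subtraction is what turns a percentile rank of 0.377 into 62.3% of comments being older than one year.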
You can find the data files and the source code of this project in the GitHub repository.
Obviously, it would be interesting to look at more languages. It could even be interesting to compare them.
Another thing I'd like to research is fixed TODO comments. As mentioned previously, I only looked at current TODO comments. But to look at the average lifetime of all TODO comments, I would have to clone multiple repositories and go through the entire log.
This way, I could see when comments are introduced and when they're fixed. This would require an entirely different process. It would also take much longer, or I would have to use a smaller dataset.
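As a rough idea of what that process could look like, here is a Python sketch that walks through a commit history and records when each TODO line appears and disappears. It assumes the added and removed lines per commit have already been extracted (for example from the patches that `git log -p` prints); that extraction step is not shown, and the `todo_lifetimes` function and its input shape are hypothetical.

```python
from datetime import date


def todo_lifetimes(commits):
    """Track TODO lines across commits, given oldest first.

    Each commit is a tuple (commit_date, added_lines, removed_lines).
    Returns (lifetimes_in_days, still_open), where still_open maps the
    comment text to the date it was introduced.
    """
    still_open = {}
    lifetimes = []
    for commit_date, added, removed in commits:
        added_todos = {line.strip() for line in added if "TODO" in line}
        removed_todos = {line.strip() for line in removed if "TODO" in line}
        # Same text on both sides of a commit: treat it as moved, not fixed.
        moved = added_todos & removed_todos
        for text in sorted(removed_todos - moved):
            if text in still_open:
                introduced = still_open.pop(text)
                lifetimes.append((commit_date - introduced).days)
        for text in sorted(added_todos - moved):
            still_open.setdefault(text, commit_date)
    return lifetimes, still_open


# Made-up history of a single repository.
history = [
    (date(2020, 1, 1), ["// TODO: handle nulls"], []),
    (date(2020, 1, 11), ["// TODO: add logging"], ["// TODO: handle nulls"]),
]
lifetimes, still_open = todo_lifetimes(history)
```

In this example the first comment lived for 10 days and the second one is still open. The sketch matches comments by their text only, so identical TODO lines in different files would be mixed up, and a reworded comment counts as fixed; a real implementation would need to handle both cases.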
Another next step will be to filter out obsolete repositories. For example, I could run the analysis on the repositories where the last commit has occurred in the past 3 months (or another arbitrary number).
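Such a filter is simple to express in code. The Python sketch below approximates "3 months" as 90 days, which is an arbitrary choice, and uses made-up repository names and dates; `is_active` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def is_active(last_commit, now, max_idle=timedelta(days=90)):
    """True if the last commit is no older than max_idle, measured from now."""
    return now - last_commit <= max_idle


# Fixed reference date so the example is reproducible.
now = datetime(2021, 6, 1, tzinfo=timezone.utc)

# Made-up repositories and last-commit dates.
last_commits = {
    "alice/archive": datetime(2019, 3, 14, tzinfo=timezone.utc),
    "carol/library": datetime(2021, 5, 2, tzinfo=timezone.utc),
}
active = [name for name, pushed in last_commits.items() if is_active(pushed, now)]
```

Only `carol/library` passes the filter here, since its last commit is 30 days before the reference date.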
Finally, there's more that I can automate. For example, I currently use Excel to calculate everything, but because of the size of the data, Excel slows down a lot.
But if TODO comments litter our code without ever providing much value, what should we then do with them? What if we can't do something right now, but don't want to forget? Tune in next week for a separate post on that.