
Introducing Git Hammer: Statistics for Git Repositories

Jaakko Kangasharju

I've been feeling that every time I'm in a longer project, I end up putting together some simple shell scripts to gather statistics from the project git repository. These were always ad-hoc, needed to be tailored to each specific project, and the only way to follow the progress of the numbers was to run them and save the output.

Recently I had some free time, so I decided to finally write a proper program to do this stuff for me. It can now do everything that my ad-hoc scripts previously did (actually somewhat more), it stores whatever it computes so it takes very little time to keep up to date, and I'm happy enough with the functionality and code that I'm comfortable releasing it. So, meet Git Hammer.

What Does It Do?

The main thing that my scripts always had was the count of lines per person. Essentially, the script would run git blame on every source file and add all the counts together. This is, in a way, the core of Git Hammer also. Another thing is to count all the tests and group those too by person. But Git Hammer knows about all the commits, so it can do many kinds of statistics based only on the commits and not their contents. At least in theory; I haven't yet implemented much.
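The blame-and-sum idea can be sketched in a few lines. This is a minimal illustration and not Git Hammer's own code (which builds on GitPython): it shells out to the git command line, assumes git is on the PATH, and uses hypothetical function names. The --line-porcelain format repeats an "author" header for every blamed line, so counting those headers gives the line count per person.

```python
import subprocess
from collections import Counter


def count_blame_authors(porcelain: str) -> Counter:
    """Count lines per author in `git blame --line-porcelain` output."""
    counts = Counter()
    for line in porcelain.splitlines():
        # Header lines look like "author Jane Doe". File content lines
        # always start with a tab, so they can never match this prefix,
        # and "author-mail" / "author-time" lack the trailing space.
        if line.startswith("author "):
            counts[line[len("author "):]] += 1
    return counts


def blame_file(repo_dir: str, path: str) -> Counter:
    """Run git blame on one tracked file and count its lines per author."""
    result = subprocess.run(
        ["git", "blame", "--line-porcelain", "--", path],
        cwd=repo_dir, capture_output=True, check=True,
        encoding="utf-8", errors="replace",
    )
    return count_blame_authors(result.stdout)


def blame_repository(repo_dir: str) -> Counter:
    """Sum the per-author line counts over every tracked file."""
    listing = subprocess.run(
        ["git", "ls-files", "-z"],
        cwd=repo_dir, capture_output=True, check=True,
        encoding="utf-8", errors="replace",
    ).stdout
    total = Counter()
    for path in filter(None, listing.split("\0")):
        total += blame_file(repo_dir, path)
    return total
```

Calling blame_repository(".") inside a checkout returns a Counter mapping author names to line counts. A real tool would also filter for source files first, which this sketch skips.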

One nice feature is support for multi-repository projects. In the project where I was working when I first began to plan Git Hammer, we had the main app, but also several support libraries in other repositories. These libraries were being developed by the same team, but they were separated to allow other projects to use them too. So it makes sense to combine all these repositories under one set of statistics.

Graphs

Let's take a look at some graphs that Git Hammer can draw. I'm using the dev.to repository for these. First, let's look at line counts per author:

Graph of line counts per author over time

Well, that certainly displays the case where existing code was imported into a new repository. It's also not a very good graph: The legend with the author names is covering part of the data, and far from all authors are displayed. Running this kind of program on a repository with a great many contributors can definitely uncover problems.

How many tests are getting written? Let's look at just the raw test counts this time.

Graph of test counts over time

Looks like a nice development. New tests are being written alongside new code.

We can also look at when the commits are happening. There is one graph for days of the week, and another for hours of the day.

Histogram of commits per day of week

Histogram of commits per hour of day

Looks primarily like a day job: The majority of commits happen Monday to Friday, roughly during business hours.

By the way, this last graph uncovered a bug. I had been very happy with my graphs, but when I first saw the hour-of-day graph for dev.to, it was showing most activity happening in the night. Of course, this was a time zone issue: At some point in the processing, the commit times got converted to my local time zone (Berlin). Since most of the commits happen in New York, this pushed the times 6 hours ahead. So I followed what seems to be the most common advice: I store the time zone associated with the commit explicitly in the database, and then use that when reading for display.
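As a rough illustration of that fix, here is a sketch using only the Python standard library. The function names and the choice of storing the offset in minutes are assumptions made for this example, not Git Hammer's actual schema.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def to_stored(commit_time: datetime) -> tuple[int, int]:
    """Split an aware datetime into (UTC epoch seconds, UTC offset in minutes)."""
    offset = commit_time.utcoffset()
    if offset is None:
        raise ValueError("commit time must be timezone-aware")
    return int(commit_time.timestamp()), int(offset.total_seconds() // 60)


def from_stored(epoch_seconds: int, offset_minutes: int) -> datetime:
    """Rebuild the commit time in the committer's own time zone."""
    tz = timezone(timedelta(minutes=offset_minutes))
    return datetime.fromtimestamp(epoch_seconds, tz)


# A commit made at 10:30 in New York during summer time (UTC-4).
new_york = timezone(timedelta(hours=-4))
stored = to_stored(datetime(2019, 8, 14, 10, 30, tzinfo=new_york))

# Reading it back with the stored offset keeps the committer's hour,
# whatever time zone the machine drawing the graph is in.
print(from_stored(*stored).hour)
```

The same instant read in Berlin summer time (UTC+2) would land at hour 16, which is exactly the 6-hour shift described above.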

How Long Does It Take to Run?

Running git blame on every file in the repository probably sounds like it takes a long time. And it can. A single full blame pass over the main repository of my old project takes about 4 minutes. Of course, Git Hammer doesn't run this from scratch for every commit. Rather, it uses diffs provided by git to adjust its counts only where they might have changed. Processing the dev.to repository (about 1300 commits, 70000 lines of code in the latest version) took only 6 minutes on my MacBook Pro.
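The incremental idea can be sketched as follows. This is an assumed data layout for illustration only: per-file author counts are kept in a dictionary, and the blame argument is a hypothetical callback that returns fresh counts for one path (or None if the commit deleted the file). Only the files named in the commit's diff are touched.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional


def apply_commit(
    total: Counter,
    per_file: Dict[str, Counter],
    changed_paths: Iterable[str],
    blame: Callable[[str], Optional[Counter]],
) -> None:
    """Update the running totals in place for one commit."""
    for path in changed_paths:
        # Remove whatever this file contributed before the commit.
        total.subtract(per_file.pop(path, Counter()))
        fresh = blame(path)
        if fresh is not None:
            # Add the file's contribution as of this commit.
            per_file[path] = fresh
            total.update(fresh)


# Two files before the commit; the commit rewrites a.py and deletes b.py.
per_file = {"a.py": Counter(Alice=3), "b.py": Counter(Bob=2)}
total = Counter(Alice=3, Bob=2)
after = {"a.py": Counter(Alice=2, Carol=2), "b.py": None}
apply_commit(total, per_file, ["a.py", "b.py"], after.__getitem__)

# Unary plus drops authors whose count has fallen to zero.
print(+total)
```

Files that the commit did not touch cost nothing here, which is why the per-commit work stays small even when the repository is large.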

Larger repositories are a different case. My old project has over 33000 commits, maybe 250000 lines of code, and it takes over 12 hours to go through. Luckily, the process was using only about 20% of CPU and even towards the end well under 2 GB of memory, so I could keep working while it was running. Still, it may be that the time needed grows faster than the size of the repository, so trying a really massive repository is probably not a good idea.

Future Plans

Git Hammer is already almost a usable library. That will likely be the next step: Fix things that don't make sense in a library, maybe add some configuration points if needed, and upload to PyPI. I also have a long-term hope to make a Web service that uses Git Hammer to display project statistics on the Web.

Any contributions are welcome, starting from just ideas for features. The code base is also not very large, since a lot of the heavy lifting is handled by GitPython and SQLAlchemy. So it is probably comprehensible to many Python developers.

Caution

Deriving statistics from code is for entertainment purposes only. They have very little meaning, and none outside the specific project team, and should not be used as a basis for any decisions.

Posted on Aug 14 '19 by:


Jaakko Kangasharju

@vorahsa

I'm a generalist developer, preferring to have some skills in a variety of areas to being really good at only a few. I need to see how a technology solves real problems to really understand it.

Discussion

 

Hi Jaakko! Seems like a possibly neat tool, if you make it faster. 4 minutes is already quite a bit for a command line tool if you think about it. If I hadn't read this post before trying it, I would have thought something was wrong with it and pressed Ctrl-C.

Have you tried to measure/benchmark your tool just to get a better insight in what's going on with the speed?

Is it the git interface? Could compacting the repo increase the speed of git blame? Just asking, this is a wild guess :D

Is it walking all the files?

Can anything be done concurrently? After all if you run git blame on each file you find and gather the info, there's no data to synchronize between each git request so that part is a good candidate for concurrency. If not (I've never tried to run 10 or 100 git blame concurrently on the same repo), can the processing of the git blame data be done concurrently?

Is it the building of the charts and plotting the data?

PostgreSQL is a complicated requirement for a client side tool, have you considered using an embedded DB like SQLite?

Good job so far :-) Don't get discouraged by my inquiries :D

 

Hi and thanks for your comments! Those are very good insights, and I'm not discouraged at all, I'm happy that you're interested enough to comment.

The 4 minute time for processing a single commit doesn't actually happen. That was for running git blame on the repository as it is currently, so there will be a lot of processing per file as git tries to find the origin of each line. But the actual implementation starts at the beginning where every line in every file is from the first commit, so the blame is much faster. As I recall, it was about 20 seconds with the repository having maybe 100000 lines of code.

But you do make good points about that. The initial blame run that needs to go through every file could be optimized. In fact, now that I think of it, there is no need to invoke git blame there at all: Since it's the first commit in the repository, every line in every source file comes from that commit, so it's enough to just count the lines in source files. That should be very fast :-) I think the reason why it's running blame is because I first started from HEAD, which does require running blame, and I just never rethought that.

Also, you're right that there is room for concurrency. As I mentioned in the post, the CPU usage didn't go above 20% at any point, so it would be possible to process at least a few files concurrently. The number of files processed concurrently would have to be limited, of course, but something like 5-10 feels like a feasible amount.
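A bounded worker pool is one way to sketch this. The example below is an assumption-laden illustration, not Git Hammer code: the blame argument stands for any hypothetical function that takes a path and returns per-author counts, and threads are used on the assumption that each call mostly waits on an external git process.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def blame_concurrently(
    paths: Iterable[str],
    blame: Callable[[str], Counter],
    max_workers: int = 8,
) -> Counter:
    """Blame several files at once, never more than max_workers in flight."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # The pool caps concurrency; results are merged in this thread
        # only, so the shared Counter needs no lock.
        for counts in pool.map(blame, paths):
            total += counts
    return total


# Stand-in blame results so the sketch runs without a repository.
fake = {"a.py": Counter(Alice=2), "b.py": Counter(Bob=1), "c.py": Counter(Alice=1)}
print(blame_concurrently(fake, fake.__getitem__, max_workers=2))
```

The default of 8 workers is simply a value inside the 5-10 range mentioned above and would need measuring on a real repository.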

Most of the time with large repositories goes into processing the rest of the commits. Individual commit processing time is quite small, since unchanged files are not looked at, but that does build up when there are many commits. I think there is a possibility to improve that. Now the implementation runs blame on all the changed files, but it should be possible to use the diff information to run blame only on the changed parts of each file.

I didn't do too much benchmarking but I did experiment with running without gathering the line counts, that is, without running git blame. Such a run is so much faster that it's clear almost all of the processing time is in the blame. I don't know how much compacting the repo would help there.

It would be nice to allow use of other databases. The Postgres requirement is because some columns use the ARRAY type that, at least with SQLAlchemy, is only available with Postgres. So changing that would require rethinking how to represent the information that is now in those arrays.

The graph building and plotting does not take very much time at all. The most time-consuming part is processing the repository the first time, but then all that data is in the database, and further operations just read it from there instead of touching the repository.

 

> The 4 minute time for processing a single commit doesn't actually happen. That was for running git blame on the repository as it is currently, so there will be a lot of processing per file as git tries to find the origin of each line.

What if by default you go back in time only a default amount? Usually people don't need all the statistics since the beginning of time. Maybe your tool could go back 30 days of commits (or 2 months, or 3, don't know, an arbitrary default) and if the user specifies a different range then you go back till the dawn of the ages. This might save startup time as well (which is already a little slow with Python).

> Now the implementation runs blame on all the changed files, but it should be possible to use the diff information to run blame only on the changed parts of each file.

Good idea! According to the doc git blame can take a range of lines, so you might be able to use that:

➜  devto git:(master) git blame -L 10,20 README.md
243c44e2 (Mac Siri       2018-08-08 10:36:32 -0400 10) </div>
243c44e2 (Mac Siri       2018-08-08 10:36:32 -0400 11) <br/>
^301c608 (Mac Siri       2018-02-28 16:11:08 -0500 12) <p align="center">
^301c608 (Mac Siri       2018-02-28 16:11:08 -0500 13)   <a href="https://www.ruby-lang.org/en/">
0725b85e (Vinicius Stock 2019-01-09 18:59:38 -0200 14)     <img src="https://img.shields.io/badge/Ruby-v2.6.0-green.svg" alt="ruby version"/>
^301c608 (Mac Siri       2018-02-28 16:11:08 -0500 15)   </a>
^301c608 (Mac Siri       2018-02-28 16:11:08 -0500 16)   <a href="http://rubyonrails.org/">
14551ea8 (Mac Siri       2018-07-12 13:19:13 -0400 17)     <img src="https://img.shields.io/badge/Rails-v5.1.6-brightgreen.svg" alt="rails version"/>
^301c608 (Mac Siri       2018-02-28 16:11:08 -0500 18)   </a>
65110550 (Mac Siri       2018-08-08 12:07:00 -0400 19)   <a href="https://travis-ci.com/thepracticaldev/dev.to">
65110550 (Mac Siri       2018-08-08 12:07:00 -0400 20)     <img src="https://travis-ci.com/thepracticaldev/dev.to.svg?branch=master" alt="Travis Status for thepracticaldev/dev.to"/>
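To connect the -L option shown above with diff information, the hunk headers of a unified diff can be turned into line ranges. The sketch below is only an illustration with hypothetical function names. It assumes the diff was produced without context lines (for example with git diff -U0), so each range covers only lines the commit added or changed. It also ignores deleted lines, which would still have to be subtracted from the counts using information from before the commit.

```python
from __future__ import annotations

import re

# Matches "@@ -old_start[,old_count] +new_start[,new_count] @@".
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")


def changed_ranges(diff_text: str) -> list[tuple[int, int]]:
    """Return (first, last) line ranges on the new side of each hunk."""
    ranges = []
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match is None:
            continue
        start = int(match.group(1))
        # An omitted count means the hunk covers exactly one line.
        count = int(match.group(2)) if match.group(2) is not None else 1
        if count > 0:  # a count of 0 is a pure deletion: nothing to blame
            ranges.append((start, start + count - 1))
    return ranges


def blame_range_args(ranges: list[tuple[int, int]]) -> list[str]:
    """Build repeated -L arguments for a single git blame call."""
    args: list[str] = []
    for first, last in ranges:
        args += ["-L", f"{first},{last}"]
    return args
```

The resulting argument list can be placed between "git blame" and the file name, since git blame accepts the -L option more than once.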

> It would be nice to allow use of other databases.

I was saying that mainly because it would be so much easier to ship a self contained command line tool, instead of asking people to install PostgreSQL to use it. SQLite is present in many standalone apps for that very reason.

> The Postgres requirement is because some columns use the ARRAY type

Got it, a possible alternative is to use the new JSON type in SQLite, you might be able to get something out of it.

A drawback of using SQLite and writing to the DB is that it doesn't behave very well with lots of concurrency.

> What if by default you go back in time only a default amount? Usually people don't need all the statistics since the beginning of time.

That's a very good idea!

> Got it, a possible alternative is to use the new JSON type in SQLite, you might be able to get something out of it.

Thanks, a JSON array could work as an ARRAY replacement. I'll have a look.
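A minimal sketch of that replacement, using only the sqlite3 and json modules from the Python standard library. The table and column names are hypothetical and do not reflect Git Hammer's schema. SQLite stores JSON as ordinary text, so the list is serialized on the way in and parsed on the way out.

```python
import json
import sqlite3
from typing import List


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table: one row per commit with a list-valued column.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS commits ("
        "hexsha TEXT PRIMARY KEY, parent_hexshas TEXT NOT NULL)"
    )


def save_commit(conn: sqlite3.Connection, hexsha: str, parents: List[str]) -> None:
    # The list becomes a JSON string where Postgres would use ARRAY.
    conn.execute(
        "INSERT OR REPLACE INTO commits (hexsha, parent_hexshas) VALUES (?, ?)",
        (hexsha, json.dumps(parents)),
    )


def load_parents(conn: sqlite3.Connection, hexsha: str) -> List[str]:
    row = conn.execute(
        "SELECT parent_hexshas FROM commits WHERE hexsha = ?", (hexsha,)
    ).fetchone()
    return json.loads(row[0]) if row is not None else []


conn = sqlite3.connect(":memory:")
create_schema(conn)
save_commit(conn, "c3", ["c1", "c2"])
print(load_parents(conn, "c3"))
```

The trade-off is that the database can no longer index or query individual array elements directly, which matters only if the existing ARRAY columns are searched by element.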

 

Neat tool, Jaakko! Reminds me a lot of the book "Your Code as a Crime Scene" - we could definitely benefit from more tooling in this area!

By the way, I noticed your authors chart has a number of duplicate/near-duplicate entries (ex. "Anna Buianova" and "Anna Buyanova", "Arun Kumar" appears twice) - are you familiar with git's mailmap feature? It's really useful in situations like this for canonicalizing author names!

 

Oops, it seems I forgot to recommend having a mailmap file. :-) So thanks for mentioning it.

For statistics like this, I have always written a mailmap file and kept it up to date, since as you say, it's very useful to have canonical author names and to have git always use the same name. For this post, though, I picked the dev.to repository since it's code that is common to everyone here, but it doesn't have a mailmap file, and I didn't want to do too much polishing just for these sample graphs.
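When a repository has no mailmap file, the same canonicalization can be approximated after the fact by merging counts. The sketch below is not git's mailmap mechanism, just the same idea applied to already-computed numbers. The alias table is hypothetical, and which spelling counts as canonical is an arbitrary choice made for the example.

```python
from collections import Counter
from typing import Dict, Mapping


def canonicalize(counts: Mapping[str, int], aliases: Dict[str, str]) -> Counter:
    """Merge per-author counts so every alias is credited to one name."""
    merged = Counter()
    for name, lines in counts.items():
        # Names without an alias entry are kept as they are.
        merged[aliases.get(name, name)] += lines
    return merged


# Hypothetical alias table: variant spelling -> chosen canonical name.
ALIASES = {"Anna Buyanova": "Anna Buianova"}
```

Calling canonicalize(line_counts, ALIASES) before drawing the authors graph would collapse the near-duplicate legend entries into one.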