Introducing Git Hammer: Statistics for Git Repositories

Jaakko Kangasharju on January 21, 2019

I've been feeling that every time I'm in a longer project, I end up putting together some simple shell scripts to gather statistics from the proj...
 

Hi Jaakko! Seems like a possibly neat tool, if you make it faster. 4 minutes is already quite a bit for a command-line tool if you think about it. If I hadn't read this post before trying it, I would have thought something was wrong and pressed Ctrl-C, even though nothing was.

Have you tried to measure/benchmark your tool just to get a better insight in what's going on with the speed?

Is it the git interface? Could compacting the repo increase the speed of git blame? Just asking, this is a wild guess :D

Is it walking all the files?

Can anything be done concurrently? After all if you run git blame on each file you find and gather the info, there's no data to synchronize between each git request so that part is a good candidate for concurrency. If not (I've never tried to run 10 or 100 git blame concurrently on the same repo), can the processing of the git blame data be done concurrently?
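Since each git blame call shares no state with the others, a bounded thread pool could be sketched roughly like this (a hypothetical Python sketch, not Git Hammer's actual code; `repo_path` and the file list are placeholders):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def blame_command(repo_path, commit, path, line_range=None):
    # Build the git blame invocation; -L restricts blame to a line range.
    cmd = ["git", "-C", repo_path, "blame", "--line-porcelain"]
    if line_range is not None:
        start, end = line_range
        cmd += ["-L", f"{start},{end}"]
    return cmd + [commit, "--", path]

def blame_file(repo_path, commit, path):
    out = subprocess.run(blame_command(repo_path, commit, path),
                         capture_output=True, text=True, check=True)
    return path, out.stdout

def blame_all(repo_path, commit, paths, workers=8):
    # The blame calls are independent, so a bounded pool is enough; the
    # limit keeps us from spawning one git process per file all at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda p: blame_file(repo_path, commit, p), paths))
```

Whether 10 or 100 concurrent blames actually helps would need measuring, since they all hit the same object store.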

Is it the building of the charts and plotting the data?

PostgreSQL is a complicated requirement for a client-side tool. Have you considered using an embedded DB like SQLite?

Good job so far :-) Don't get discouraged by my inquiries :D

 

Hi and thanks for your comments! Those are very good insights, and I'm not discouraged at all, I'm happy that you're interested enough to comment.

The 4 minute time for processing a single commit doesn't actually happen. That was for running git blame on the repository as it is currently, so there will be a lot of processing per file as git tries to find the origin of each line. But the actual implementation starts at the beginning where every line in every file is from the first commit, so the blame is much faster. As I recall, it was about 20 seconds with the repository having maybe 100000 lines of code.

But you do make good points about that. The initial blame run that needs to go through every file could be optimized. In fact, now that I think of it, there is no need to invoke git blame there at all: since it's the first commit in the repository, every line in every source file comes from that commit, so it's enough to just count the lines in source files. That should be very fast :-) I think the reason it's running blame is that I first started from HEAD, which does require running blame, and I just never rethought that.
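That first-commit shortcut could look something like this (a hypothetical sketch with placeholder names, not the tool's actual code):

```python
import subprocess

def count_lines(blob_text):
    # At the first commit, every line originates from that commit, so a
    # plain line count replaces git blame. Count a trailing partial line too.
    if not blob_text:
        return 0
    return blob_text.count("\n") + (0 if blob_text.endswith("\n") else 1)

def first_commit_counts(repo_path, first_commit, source_files):
    # Read each blob as it existed at the first commit and count its lines.
    counts = {}
    for path in source_files:
        blob = subprocess.run(
            ["git", "-C", repo_path, "show", f"{first_commit}:{path}"],
            capture_output=True, text=True, check=True).stdout
        counts[path] = count_lines(blob)
    return counts
```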

Also, you're right that there is room for concurrency. As I mentioned in the post, the CPU usage didn't go above 20% at any point, so it would be possible to process at least a few files concurrently. The number of files processed concurrently would have to be limited, of course, but something like 5-10 feels like a feasible amount.

Most of the time with large repositories goes into processing the rest of the commits. Individual commit processing time is quite small, since unchanged files are not looked at, but that does build up when there are many commits. I think there is a possibility to improve that. Now the implementation runs blame on all the changed files, but it should be possible to use the diff information to run blame only on the changed parts of each file.
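The changed parts can be read straight off the hunk headers of a zero-context diff. A sketch of the parsing step (assuming `git diff -U0` output as input; not the tool's actual code):

```python
import re

# Unified diff hunk header: "@@ -old_start,old_count +new_start,new_count @@"
HUNK = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")

def changed_ranges(diff_text):
    # Collect (start, end) line ranges on the new side of each hunk;
    # blame can then be limited to each range with "git blame -L start,end".
    ranges = []
    for line in diff_text.splitlines():
        m = HUNK.match(line)
        if m:
            start = int(m.group(1))
            count = int(m.group(2)) if m.group(2) else 1
            if count:  # a count of 0 means a pure deletion, nothing to blame
                ranges.append((start, start + count - 1))
    return ranges
```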

I didn't do too much benchmarking but I did experiment with running without gathering the line counts, that is, without running git blame. Such a run is so much faster that it's clear almost all of the processing time is in the blame. I don't know how much compacting the repo would help there.

It would be nice to allow use of other databases. The Postgres requirement is because some columns use the ARRAY type that, at least with SQLAlchemy, is only available with Postgres. So changing that would require rethinking how to represent the information that is now in those arrays.

The graph building and plotting does not take very much time at all. The most time-consuming part is processing the repository the first time, but then all that data is in the database, and further operations just read it from there instead of touching the repository.

 

The 4 minute time for processing a single commit doesn't actually happen. That was for running git blame on the repository as it is currently, so there will be a lot of processing per file as git tries to find the origin of each line.

What if by default you go back in time only a default amount? Usually people don't need all the statistics since the beginning of time. Maybe your tool could go back 30 days of commits (or 2 months, or 3, don't know, an arbitrary default) and if the user specifies a different range then you go back to the dawn of the ages. This might save startup time as well (which is already a little slow with Python).
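git log already supports this with --since, so the default window could be bolted on when building the command (a hypothetical sketch; the function name is made up):

```python
def log_command(repo_path, since_days=None):
    # Default to a bounded window of history; pass since_days=None
    # explicitly when the user asks for everything since the beginning.
    cmd = ["git", "-C", repo_path, "log", "--pretty=%H"]
    if since_days is not None:
        cmd.append(f"--since={since_days} days ago")
    return cmd
```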

Now the implementation runs blame on all the changed files, but it should be possible to use the diff information to run blame only on the changed parts of each file.

Good idea! According to the docs, git blame can take a range of lines, so you might be able to use that:

➜  devto git:(master) git blame -L 10,20 README.md
243c44e2 (Mac Siri       2018-08-08 10:36:32 -0400 10) </div>
243c44e2 (Mac Siri       2018-08-08 10:36:32 -0400 11) <br/>
^301c608 (Mac Siri       2018-02-28 16:11:08 -0500 12) <p align="center">
^301c608 (Mac Siri       2018-02-28 16:11:08 -0500 13)   <a href="https://www.ruby-lang.org/en/">
0725b85e (Vinicius Stock 2019-01-09 18:59:38 -0200 14)     <img src="https://img.shields.io/badge/Ruby-v2.6.0-green.svg" alt="ruby version"/>
^301c608 (Mac Siri       2018-02-28 16:11:08 -0500 15)   </a>
^301c608 (Mac Siri       2018-02-28 16:11:08 -0500 16)   <a href="http://rubyonrails.org/">
14551ea8 (Mac Siri       2018-07-12 13:19:13 -0400 17)     <img src="https://img.shields.io/badge/Rails-v5.1.6-brightgreen.svg" alt="rails version"/>
^301c608 (Mac Siri       2018-02-28 16:11:08 -0500 18)   </a>
65110550 (Mac Siri       2018-08-08 12:07:00 -0400 19)   <a href="https://travis-ci.com/thepracticaldev/dev.to">
65110550 (Mac Siri       2018-08-08 12:07:00 -0400 20)     <img src="https://travis-ci.com/thepracticaldev/dev.to.svg?branch=master" alt="Travis Status for thepracticaldev/dev.to"/>

It would be nice to allow use of other databases.

I was saying that mainly because it would be so much easier to ship a self-contained command-line tool, instead of asking people to install PostgreSQL to use it. SQLite is present in many standalone apps for that very reason.

The Postgres requirement is because some columns use the ARRAY type

Got it, a possible alternative is to use the new JSON type in SQLite, you might be able to get something out of it.

A drawback of using SQLite and writing to the DB is that it doesn't behave very well with lots of concurrency.

What if by default you go back in time only a default amount? Usually people don't need all the statistics since the beginning of time.

That's a very good idea!

Got it, a possible alternative is to use the new JSON type in SQLite, you might be able to get something out of it.

Thanks, a JSON array could work as an ARRAY replacement. I'll have a look.
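Just to sketch the idea with the standard library (the table and column names here are made up for illustration, not Git Hammer's actual schema):

```python
import json
import sqlite3

# Store what Postgres kept in an ARRAY column as JSON-encoded TEXT in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (hexsha TEXT PRIMARY KEY, parent_ids TEXT)")
conn.execute("INSERT INTO commits VALUES (?, ?)",
             ("abc123", json.dumps(["def456", "789aaa"])))

row = conn.execute("SELECT parent_ids FROM commits WHERE hexsha = ?",
                   ("abc123",)).fetchone()
parents = json.loads(row[0])  # back to a Python list
```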

 

Neat tool, Jaakko! Reminds me a lot of the book "Your Code as a Crime Scene" - we could definitely benefit from more tooling in this area!

By the way, I noticed your authors chart has a number of duplicate/near-duplicate entries (ex. "Anna Buianova" and "Anna Buyanova", "Arun Kumar" appears twice) - are you familiar with git's mailmap feature? It's really useful in situations like this for canonicalizing author names!

 

Oops, it seems I forgot to recommend having a mailmap file. :-) So thanks for mentioning it.

For statistics like this, I have always written a mailmap file and kept it up to date, since, as you say, it's very useful to have canonical author names and to have git always use the same name. For this post, though, I picked the dev.to repository since it's code that is common to everyone here, but it doesn't have a mailmap file, and I didn't want to do too much polishing just for these sample graphs.
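For anyone reading along, a .mailmap file is one mapping per line: the canonical name (and optionally email) on the left, the variant as it appears in commits on the right. The emails below are placeholders, not the real contributors' addresses:

```
# .mailmap — map commit identities (right) to canonical ones (left)
Anna Buianova <anna@example.com> Anna Buyanova <anna.b@example.com>
Arun Kumar <arun@example.com> Arun Kumar <arun@other.example.com>
```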
