Learn about your project from git history

Marcin Piczkowski ・3 min read

When I join a new project, apart from learning how to build and run it, I like to find out what I should pay attention to.
Things like:

  • is the code clean and well structured?
  • are the developers who wrote the code still around?
  • which parts are the source of the most headaches?

You can get some of this information from static code analysis tools (e.g. for Java: FindBugs, PMD, Checkstyle, or the IntelliJ IDEA plugin SonarLint).

These tools all analyse the current state of the source code, but you can also get a lot of valuable information from the history of the project.

From the history you can learn things like:

  • how intensive development is, e.g. how many people work on the code and how many lines they produce in a given period of time
  • how much of the code written relates to bug fixes vs. features (provided that commit messages tag bug fixes and features differently)
  • which files are updated most often as part of bug fixes
  • which files are usually updated together
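
The last item, files that are usually updated together, can be approximated by counting how often each pair of paths shows up in the same commit. Here is a minimal sketch in plain Java, assuming the set of changed paths per commit has already been extracted from the history. The class name `CoChange` and the sample paths are made up for illustration and are not part of the application described in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of the original application.
class CoChange {

    // Counts, for every unordered pair of paths, the number of commits that touched both.
    static Map<String, Integer> countPairs(List<Set<String>> commits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> changed : commits) {
            // Sort the paths so that (a, b) and (b, a) end up under the same key.
            List<String> paths = new ArrayList<>(new TreeSet<>(changed));
            for (int i = 0; i < paths.size(); i++) {
                for (int j = i + 1; j < paths.size(); j++) {
                    counts.merge(paths.get(i) + " + " + paths.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data: each set holds the files changed by one commit.
        List<Set<String>> commits = List.of(
                Set.of("build.gradle", "README.md"),
                Set.of("build.gradle", "README.md", "src/App.java"),
                Set.of("src/App.java"));
        countPairs(commits).forEach((pair, n) -> System.out.println(pair + " - " + n));
    }
}
```

The pair loop is quadratic in the number of files per commit, so for a real repository you would probably want to skip very large commits such as mass reformatting or renames.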

There are already tools that can do this (see the references section below), but it is relatively simple to do such an analysis on your own.

Here is a simple application written in Java which, given a git repository path, prints a list of the top 10 most frequently committed files together with the number of commits.

I used the JGit library, which is an implementation of Git in Java.

The way it works is that I get a list of commits for a repository.

    // git.log().call() walks the history starting at HEAD and returns an Iterable<RevCommit>
    public static Stream<RevCommit> getCommits(Git git) throws GitAPIException {
        return StreamSupport.stream(git.log().call().spliterator(), false);
    }

Then for each commit I reference the revision tree which holds information about all the files in this revision.

    // Maps each commit to the id of its root tree (the snapshot of all files)
    public static Stream<ObjectId> getRevTrees(Stream<RevCommit> commitsStream) {
        return commitsStream
                .map(rev -> rev.getTree().getId());
    }

For each such tree I compare it with the tree from the previous commit and get a diff.

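    public static List<DiffEntry> diff(Git git, ObjectId newCommit, ObjectId oldCommit) throws IOException {
        DiffFormatter df = new DiffFormatter(new ByteArrayOutputStream());
        // scan() needs a repository to read from and takes the old side first, then the new side
        df.setRepository(git.getRepository());
        return df.scan(oldCommit, newCommit);
    }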
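
The snippets above do not show how each tree gets matched with its predecessor. One possible way, sketched here in plain Java without JGit, is to walk the list and pair every element with the one that follows it. This assumes the list is ordered newest first, so the first element of each pair is the newer one. The `Pairs` class is a hypothetical name, and plain strings stand in for the `ObjectId` values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original application.
class Pairs {

    // Pairs each element with its successor: [a, b, c] -> [(a, b), (b, c)].
    // For a newest-first history the key is the newer item and the value the older one.
    static <T> List<Map.Entry<T, T>> consecutive(List<T> items) {
        List<Map.Entry<T, T>> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < items.size(); i++) {
            pairs.add(Map.entry(items.get(i), items.get(i + 1)));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Placeholder names standing in for tree ids, newest first.
        List<String> trees = List.of("tree3", "tree2", "tree1");
        for (Map.Entry<String, String> p : consecutive(trees)) {
            System.out.println("diff " + p.getValue() + " -> " + p.getKey());
        }
    }
}
```

Note that this linear pairing is a simplification: the oldest commit has no predecessor and is skipped, and in a history with branches and merges two neighbouring log entries are not necessarily parent and child.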

I collect the list of all changed files across all the commits, group them by file path and count the occurrences.
Finally, I sort the paths by how often they occur, most frequent first.
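
The grouping, counting and sorting step can be sketched with the standard stream collectors. This is a minimal illustration that assumes the changed paths have already been flattened into one list of strings; the `TopFiles` class and the sample data are hypothetical and may differ from the actual application code.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original application.
class TopFiles {

    // Returns the n most frequently occurring paths with their counts, most frequent first.
    static List<Map.Entry<String, Long>> top(List<String> changedPaths, int n) {
        Map<String, Long> counts = changedPaths.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
                // Order by count descending, then by path so that ties are deterministic.
                .sorted((a, b) -> {
                    int byCount = Long.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample: one entry per (commit, changed file) occurrence.
        List<String> changed = List.of(
                "build.gradle", "src/App.java", "build.gradle",
                "README.md", "src/App.java", "build.gradle");
        top(changed, 2).forEach(e -> System.out.println(e.getKey() + " - " + e.getValue()));
    }
}
```

With the sample data, `main` prints `build.gradle - 3` followed by `src/App.java - 2`, in the same `path - count` format as the results shown below.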

E.g. when you run the application on the master branch of the popular spring-framework repository, you get results like these:


build.gradle - 1374
src/asciidoc/index.adoc - 444
spring-core/src/main/java/org/springframework/core/annotation/AnnotationUtils.java - 375
spring-beans/src/main/java/org/springframework/beans/factory/support/DefaultListableBeanFactory.java - 374
spring-context/src/main/java/org/springframework/context/annotation/ConfigurationClassParser.java - 362
spring-webmvc/src/main/java/org/springframework/web/servlet/config/annotation/WebMvcConfigurationSupport.java - 332
spring-webmvc/src/main/java/org/springframework/web/servlet/mvc/method/annotation/RequestMappingHandlerAdapter.java - 323
spring-web/src/main/java/org/springframework/http/HttpHeaders.java - 323
spring-web/src/main/java/org/springframework/web/client/RestTemplate.java - 320
spring-beans/src/main/java/org/springframework/beans/factory/support/AbstractAutowireCapableBeanFactory.java - 320

Based on that, you can tell which files to look at first. Then you would probably want to check where they are used to drill down deeper into the project.

