Matt Eland

Posted on • Originally published at accessibleai.dev

Extracting git repository data with PyDriller

In early 2022 I debuted a talk called "Visualizing Code" that used data visualization to explore patterns in open source projects. This article is the first in a new series that walks you through how to discover and analyze patterns found in your own repositories.

Specifically, this article will cover how to take any public repository on GitHub and extract a CSV file full of commit history information. Once you have a data file available, the process for analyzing this dataset is fairly flexible and can be done in a variety of ways including Python code, Tableau, Power BI, or even Excel.

The code presented in this article assumes that you have pydriller and pandas installed. PyDriller is required for everything this article covers, while Pandas simply helps preview the loaded data and export it to a CSV file. See installing PyDriller for more information on getting started, but typically pip install pydriller is all you need.

What is PyDriller?

PyDriller is an open-source Python library that allows you to "drill into" git repositories.

According to its GitHub repository, "PyDriller is a Python framework that helps developers in analyzing Git repositories. With PyDriller you can easily extract information about commits, developers, modified files, diffs, and source code."

Using PyDriller, we will be able to extract information from any public GitHub repository, including:

  • Individual commits
  • Commit authors
  • Commit dates, times, and time zones
  • Files modified by each commit
  • The number of lines added and removed
  • Related commits
  • Code complexity metrics

Let's take a look at how it works.

Connecting to the Repository

To grab information from a repository, we must first create a Repository object from a given GitHub URL.

The code for this is fairly simple:

# We need PyDriller to pull git repository information
from pydriller import Repository

# Replace this path with your own repository of interest
path = 'https://github.com/dotnet/machinelearning'
repo = Repository(path)

This code doesn't actually analyze the repository, but it gets us to a state where we can traverse the commits that are part of the git repository.

We actually inspect these commits by calling traverse_commits() on our Repository object and looping over the results.

Important Note: looping over repository commits takes a long time for large repositories. It took 52 minutes to analyze the ML.NET repository this code example refers to, which had 2,681 commits at the time of analysis on February 25th, 2023.
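Because the traversal can run for the better part of an hour, it helps to see progress as it happens. As a minimal sketch (assuming you also pip install tqdm; nothing else in this article requires it), you can wrap the commit iterator in a progress display:

# Optional: wrap the commit iterator in a tqdm progress display
# NOTE: tqdm is a third-party package and an assumption on my part, not part of PyDriller
from tqdm import tqdm
from pydriller import Repository

repo = Repository('https://github.com/dotnet/machinelearning')

# No total count is available up front, so tqdm shows a running count and rate
for commit in tqdm(repo.traverse_commits()):
    pass  # per-commit analysis goes here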

The code below will loop over all commits and for each commit:

  • Build a list of files that are modified by that commit
  • Extract basic commit information
  • Calculate code metrics using PyDriller's Open Source Delta Maintainability Model (OS-DMM)

As each commit is read, it is added to a list of commits that serves as the final byproduct of the loading process.

The code listing follows:

# Loop over each PyDriller commit to transform it to a commit usable for analysis later
# NOTE: This can take a LONG time if there are many commits

commits = []
for commit in repo.traverse_commits():

    hash = commit.hash

    # Gather a list of files modified in the commit
    files = []
    try:
        for f in commit.modified_files:
            if f.new_path is not None:
                files.append(f.new_path) 
    except Exception:
        print('Could not read files for commit ' + hash)
        continue

    # Capture information about the commit in object format so I can reference it later
    record = {
        'hash': hash,
        'message': commit.msg,
        'author_name': commit.author.name,
        'author_email': commit.author.email,
        'author_date': commit.author_date,
        'author_tz': commit.author_timezone,
        'committer_name': commit.committer.name,
        'committer_email': commit.committer.email,
        'committer_date': commit.committer_date,
        'committer_tz': commit.committer_timezone,
        'in_main': commit.in_main_branch,
        'is_merge': commit.merge,
        'num_deletes': commit.deletions,
        'num_inserts': commit.insertions,
        'net_lines': commit.insertions - commit.deletions,
        'num_files': commit.files,
        'branches': ', '.join(commit.branches), # Comma separated list of branches the commit is found in
        'files': ', '.join(files), # Comma separated list of files the commit modifies
        'parents': ', '.join(commit.parents), # Comma separated list of parents
        # PyDriller Open Source Delta Maintainability Model (OS-DMM) stat. See https://pydriller.readthedocs.io/en/latest/deltamaintainability.html for metric definitions
        'dmm_unit_size': commit.dmm_unit_size,
        'dmm_unit_complexity': commit.dmm_unit_complexity,
        'dmm_unit_interfacing': commit.dmm_unit_interfacing,
    }
    # Omitted: modified_files (list), project_path, project_name
    commits.append(record)

You'll note that the file-gathering portion of the code above is wrapped in a try / except block. This is because GitHub responded unexpectedly to some requests PyDriller made for commit details. Knowing this to be a valid repository, I felt the best strategy was to log that the error occurred, along with the commit hash, and exclude those commits from the final result set.
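If you would rather keep a record of the failing commits than simply print them, a small variation on that error handling (a sketch, not part of the original listing; it reuses the repo object created earlier) collects the hashes for later review:

# Sketch: the same traversal, but remember which commits failed and why
failed = []
for commit in repo.traverse_commits():
    try:
        files = [f.new_path for f in commit.modified_files if f.new_path is not None]
    except Exception as ex:
        # Store the hash and error message so problem commits can be re-checked later
        failed.append((commit.hash, str(ex)))
        continue
    # ... build and append the record exactly as in the main listing ...

print(str(len(failed)) + ' commits could not be read')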

Validating the Load Process

Once the data is loaded (which could take some time), it's time to ensure it appears valid.

I chose to do this by using the popular Pandas library for tabular data analysis tasks.

While Pandas is typically used to analyze, sift, clean, and otherwise manipulate tabular data sources, our use in this phase of the project is fairly basic: load data into a tabular DataFrame, display a small preview of it, and then save it to disk.

The code to load and preview the dataset is as follows:

import pandas as pd

# Translate this list of commits to a Pandas data frame
df_commits = pd.DataFrame(commits)

# Display the first 5 rows of the DataFrame
df_commits.head()

The final line's df_commits.head() call will display something like the following result if run in a Jupyter Notebook:

[Image: the first five rows of the commits DataFrame as displayed in a Jupyter notebook]

Important Note: displaying the first 5 rows of the DataFrame via the .head() call will only work if this code is executed as part of a Jupyter notebook and that line is the last line in a code cell. This step is optional, however, as its only purpose is to allow you a peek at the output.
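If you are running this code as a plain Python script rather than in a notebook, you can get a similar text-only preview by printing the result explicitly:

# Outside a notebook, print the preview instead of relying on cell output
print(df_commits.head())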

Exporting the Data to a CSV File

Finally, you can save the contents of the data frame in a CSV file with the following code:

df_commits.to_csv('Commits.csv')

This will save the file to disk in the current directory under the name Commits.csv.
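By default, Pandas also writes the DataFrame's integer index as an unnamed first column. If you would rather omit it (an optional tweak, not in the original listing), pass index=False:

# Optional: skip the numeric index column in the exported file
df_commits.to_csv('Commits.csv', index=False)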

Once the file has been written to disk, you can import it into Excel, Tableau, Power BI, or another data analysis tool.

Alternatively, you could load it up again with Pandas and visualize it with Python code as we'll explore in a future article.
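As a quick sketch of that reload step (the parse_dates argument is a convenience on my part; the timezone offsets in those columns may still need extra handling):

import pandas as pd

# Reload the exported commits and ask Pandas to parse the date columns
df = pd.read_csv('Commits.csv', parse_dates=['author_date', 'committer_date'])
print(df.dtypes)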

Limitations and Next Steps

The code I've provided here is useful for generating a CSV file of commits from a public repository on GitHub.

This will be helpful if you want to visualize trends in your commit history by doing further manual data analysis or plugging the data into a data visualization tool.

However, this process does have a few limitations:

First, as written it only works on public repositories on GitHub. PyDriller itself can also analyze a local clone if you pass a filesystem path to Repository instead of a URL, which is one way around this.

Second, I observed that 10 of the 2,681 commits I attempted to interpret hit a persistent error when retrieving their data from GitHub via PyDriller. I've only seen this issue on one repository, but you may encounter it in your own repositories.

Third, this process takes a significant amount of time to process repository history. In my experiments, processing ran at roughly one commit every 800 milliseconds, which means that most repositories will take a non-trivial amount of time to process.
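One way to shorten the wait is to restrict the traversal window. PyDriller's Repository accepts since and to arguments, so a sketch like the following (the dates are arbitrary examples) analyzes only a slice of the history:

from datetime import datetime
from pydriller import Repository

# Only traverse commits from a specific window of time (example dates)
repo = Repository('https://github.com/dotnet/machinelearning',
                  since=datetime(2022, 1, 1),
                  to=datetime(2022, 12, 31))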

Finally, this process only tracks high-level information about git changes; it records which files each commit modified, but not the individual lines or code changes they contained.
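If you do need line-level detail, PyDriller exposes it on each modified file. Here is a minimal sketch (printing rather than storing, and reusing the repo object from earlier):

# Sketch: inspect line-level detail for each file in a commit
for commit in repo.traverse_commits():
    for f in commit.modified_files:
        # added_lines / deleted_lines are per-file counts; f.diff holds the raw patch text
        print(f.new_path, f.added_lines, f.deleted_lines)
    break  # stop after the first commit for this demonstration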


All told, PyDriller is an excellent utility to easily get git repository data prepared and ready for further analysis.

Stay tuned for future articles showing how to work with the data this captures from your git repository.

Top comments (2)

Chris Greening

Ohhhh this is a really excellent demo, thanks for sharing Matt!

I haven't worked with pydriller before but this is definitely on my shortlist now for future data sci hobby projects/exercises, I always love sourcing/extracting raw data for projects and this seems like a fantastic tool for that 😎

Matt Eland

It's also possible to pull data out of files using just git commands, but it's not going to be as polished as this. I hope to write a separate article about that process someday. Thanks for your kind words.