DEV Community

Jake Carpenter

Shrinking your git repository with BFG Repo-Cleaner

I recently decided to find a way to reduce the size of a git repository for a project. Previous engineers had committed some relatively large files and it took too long to clone the repository. We deleted the files months ago, but they are buried in history. I found a tool called BFG Repo-Cleaner that makes this incredibly easy. Using it, I was able to decrease the size of our project from around 750MB to under 10MB without losing any valuable history.

In addition to solving my use case, this tool can also remove secrets or credentials that someone mistakenly committed to the repository, which makes knowing how to use it a life-saver!

Prerequisites

  • Java runtime 8+
  • Some instructions below require the use of Bash, but the tool can be used in Windows with any command prompt
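
Before going further, it can help to confirm that the required tools are on the PATH. The sketch below is a minimal check; the helper name `check_tool` is hypothetical and is not part of BFG or git.

```shell
#!/bin/bash
# Hypothetical helper: report whether a command is available on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# Check the two tools this walkthrough relies on. `|| true` keeps the
# script going so that every missing tool is reported, not just the first.
check_tool java || true
check_tool git || true
```

If Java is found, `java -version` prints the installed version, which should be 8 or newer.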

Mirror repository

For the best results, a full mirror of the repository is needed. It will be easier if all feature/bug branches are deleted before attempting this. A mirror clone pulls every ref in the repository but is bare: it contains no editable working files.

git clone --mirror git://your.server.com/your-big-repo.git
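
To quantify the savings later, it is worth recording the mirror's size before any cleaning. A minimal sketch follows; `dir_size_kb` is a hypothetical helper name, and `your-big-repo.git` is assumed to be the directory created by the clone above.

```shell
#!/bin/bash
# Hypothetical helper: print the on-disk size of a directory in kilobytes.
dir_size_kb() {
    du -sk "$1" | cut -f1
}

repo="your-big-repo.git"
if [ -d "$repo" ]; then
    echo "Size before cleaning: $(dir_size_kb "$repo") kB"
    # A mirror clone is a bare repository, so this prints "true".
    git -C "$repo" rev-parse --is-bare-repository
fi
```

Running the same helper again after the cleanup steps gives a direct before/after comparison.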

Optional - Identify large files in history

Depending on what needs to be removed from the repository history, knowing which files are the largest can be helpful. I used a script written by Antony Stubbs that lists them.

Create a file called git-large-files and make it executable.

touch git-large-files
chmod +x git-large-files

Paste in the following Bash script that is slightly modified from Antony's original:

#!/bin/bash

# Shows you the largest objects in your repo's pack file.
# Written for osx.
#
# @see https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs

# set the internal field separator to line break, so that we can iterate easily over the verify-pack output
IFS=$'\n';

# number of objects to print
count=25

# list all objects including their size, sort by size
objects=`git verify-pack -v ./objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head -n ${count}`

echo "All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file."

output="size,pack,SHA,location"
for y in $objects
do
    # extract the size in bytes
    size=$((`echo $y | cut -f 5 -d ' '`/1024))
    # extract the compressed size in bytes
    compressedSize=$((`echo $y | cut -f 6 -d ' '`/1024))
    # extract the SHA
    sha=`echo $y | cut -f 1 -d ' '`
    # find the object's location in the repository tree
    other=`git rev-list --all --objects | grep $sha`
    #lineBreak=`echo -e "\n"`
    output="${output}\n${size},${compressedSize},${other}"
done

echo -e $output | column -t -s ', '

Now run this script from inside the mirrored repository. It lists the 25 largest objects in your entire history. These files can be specifically targeted in a later step.

cd your-big-repo.git
../git-large-files
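
As an alternative to parsing pack files, similar information can be gathered with git plumbing commands alone. The sketch below is not part of Antony's script; `largest_blobs` is a hypothetical function name, and it assumes a git version whose `git cat-file --batch-check` accepts a format string.

```shell
#!/bin/bash
# Hypothetical helper: list the N largest blobs reachable from any ref as
# "<size in bytes> <path>", largest first. Run it inside the repository.
largest_blobs() {
    local count="${1:-25}"
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob" { sub(/^blob /, ""); print }' |
        sort -k1,1nr |
        head -n "$count"
}

# Example:
#   cd your-big-repo.git
#   largest_blobs 10
```

Because the sizes are in bytes, piping the output through `awk '$1 > 20 * 1024 * 1024'` gives a rough preview of which blobs exceed 20 MiB.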

Approach 1 - Deleting files larger than specific size

One approach is to let the tool find and clean files larger than a specific size. To strip files over 20MB, for example, execute the following:

java -jar /path/to/bfg.jar --strip-blobs-bigger-than 20M your-big-repo.git

The tool will output a report as it executes, which includes a list of the deleted files. Always review this list. Next, the garbage collector needs to run to actually delete those files. Do this before attempting to run the tool again with other parameters:

cd your-big-repo.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
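
The first command expires every reflog entry so that the old commits are no longer referenced, and the second repacks the repository while pruning unreachable objects. To see the effect, git's own size statistics can be printed before and after. A minimal sketch, where `gc_and_report` is a hypothetical wrapper and not a BFG or git command:

```shell
#!/bin/bash
# Hypothetical wrapper: run the cleanup inside the given repository and
# print git's size statistics before and after.
gc_and_report() {
    local repo="${1:-.}"
    echo "Before:"
    git -C "$repo" count-objects -vH
    git -C "$repo" reflog expire --expire=now --all
    git -C "$repo" gc --prune=now --aggressive --quiet
    echo "After:"
    git -C "$repo" count-objects -vH
}

# Example (from the directory that contains the mirror):
#   gc_and_report your-big-repo.git
```

The `size-pack` line in the output is the figure to compare between the two reports.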

Approach 2 - Deleting matching files

Another common usage is to delete specific filename(s). This is especially useful when following a previous step that identified the largest files in your repository.

# Delete a single file
java -jar /path/to/bfg.jar --delete-files 'some-image-that-was-not-needed.png' your-big-repo.git

# Delete many matching files
java -jar /path/to/bfg.jar --delete-files '{*.apk,*.app,yarn.lock}' your-big-repo.git

# Delete a folder
java -jar /path/to/bfg.jar --delete-folders 'build' your-big-repo.git
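
After a run, one way to confirm that a file or folder name is really gone is to search the paths of every object reachable from any ref. The sketch below assumes it is run inside the repository; `name_in_history` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: print every path in history whose base name equals
# the argument, and succeed only if at least one path matched.
name_in_history() {
    git rev-list --objects --all |
        awk -v name="$1" '
            NF > 1 {
                path = substr($0, index($0, " ") + 1)   # text after the SHA
                n = split(path, parts, "/")
                if (parts[n] == name) { print path; found = 1 }
            }
            END { exit !found }
        '
}

# Example:
#   name_in_history 'some-image-that-was-not-needed.png' || echo "not found in history"
```

Keep in mind that files still present in the current commit are left alone by default, so they will continue to show up in this check.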

As with the first approach, the tool will output a report as it executes, which includes a list of the deleted files. Always review this list. Next, the garbage collector needs to run to actually delete those files. Do this before attempting to run the tool again with other parameters:

cd your-big-repo.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive

Optional - Include the latest commit when cleaning (NOT RECOMMENDED)

The previous examples rely on the tool's default behavior of protecting every file in your current (HEAD) commit. While it is safer to delete any such files manually and then run the tool, you can opt to include the current commit.

java -jar /path/to/bfg.jar --no-blob-protection --delete-files 'file-still-in-HEAD.png' repo.git
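
The safer route mentioned above can be sketched as follows: remove the file in a normal working clone, commit, and push, so that HEAD no longer contains it before BFG runs against a fresh mirror. `drop_from_head` is a hypothetical helper name used only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: delete a tracked file from the current branch and
# commit the removal. Run it in a normal working clone, not the mirror.
drop_from_head() {
    git rm --quiet -- "$1" &&
        git commit --quiet -m "Remove $1"
}

# Example:
#   drop_from_head 'file-still-in-HEAD.png'
#   git push
# Then take a fresh mirror clone and run BFG without --no-blob-protection.
```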

Override the remote repository

At this point the repository should be significantly smaller than it was before this tool was used. To make this change permanent, though, the changes to history need to override what exists on the server. A few notes:

  • Do not allow team members to push additional changes from their existing local repositories, because those clones still contain the old history. The entire team will need to re-clone the repository.
  • Consider pushing these changes to another remote repository first as a smoke test. Start the application, run tests, and do whatever else is needed to feel confident the code is still in working condition.
  • Before overriding the remote repository, make another backup of the repository somewhere safe using git clone --mirror ... just in case.
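
For the backup mentioned in the last note, archiving a separate, uncleaned mirror clone is one simple way to keep it somewhere safe. A minimal sketch; `backup_dir`, the archive naming scheme, and the directory name in the example are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper: archive a directory as <name>-<timestamp>.tar.gz in
# the current directory and print the archive's file name.
backup_dir() {
    local src="${1%/}"
    local archive
    archive="$(basename "$src")-$(date +%Y%m%d-%H%M%S).tar.gz"
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" &&
        echo "$archive"
}

# Example, using a separate uncleaned mirror clone (directory name assumed):
#   backup_dir original-backup.git
```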

Once ready to override the remote repository, force push these changes:

git push --force

Finally, change to another directory and re-clone the repository using the standard approach.

Top comments (3)

Thomas Broyer

Fwiw, do you know about git filter-repo? It's the officially recommended replacement for git filter-branch, and does what you needed to do and much more.

Your commands map to it as:
Approach 1

git filter-repo --strip-blobs-bigger-than 20M

Approach 2

# Delete a single file
git filter-repo --invert-paths --use-base-name --path 'some-image-that-was-not-needed.png'

# Delete many matching files
git filter-repo --invert-paths --use-base-name --path-glob '*.apk' --path-glob '*.app' --path yarn.lock

# Delete a folder
git filter-repo --invert-paths --use-base-name --path build

(--invert-paths is needed here as by default filter-repo will keep what matches, so you need to tell it to keep everything but what you want to remove)

Christophe Avonture

Just in time for me: I have such a repo with binary files in it. Will try this week. Do you know if there is, for instance, a ready-to-use Docker image to make the BFG tool even simpler?

(found this one hub.docker.com/r/tagplus5/git-bfg/...)

Jake Carpenter

Nice find! I'll give it a try later to make sure it works with the optional bash script I included.