Yuichi Tanaka

Git: rewriting entire history

What if you discover that sensitive data has been committed to Git? You need to remove that file. What if your repository has grown so large that it takes over an hour to clone? You need to remove large files to reduce its size.

However, deleting those files and committing the change is not enough. The sensitive data or large files still exist in the Git history.
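
To see why a plain deletion is not enough, here is a minimal sketch that uses only stock Git in a throwaway repository. The file contents and commit messages are made up for illustration.

```shell
#!/bin/sh
set -eu

# Throwaway repository; every name and value below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

echo "readme" > README.md
git add README.md
git commit -q -m "Add README"

echo "api_key=EXAMPLE" > sensitive.md
git add sensitive.md
git commit -q -m "Add sensitive.md by mistake"

git rm -q sensitive.md
git commit -q -m "Remove sensitive.md"

# The file is gone from the working tree...
[ ! -e sensitive.md ] && echo "working tree: sensitive.md is gone"

# ...but the previous commit still holds its full content.
recovered=$(git show HEAD~1:sensitive.md)
echo "history still contains: $recovered"
```

Anyone who can clone the repository can run the same `git show` and read the deleted content.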

Therefore, you should remove sensitive data or large files from the entire repository history.

How to do that? Use git-filter-repo.

git-filter-repo

git-filter-repo is a tool to rewrite entire repository history. It's fast and safe.

Removing a single file

If you want to remove a file called sensitive.md:

    $ git filter-repo --path sensitive.md --invert-paths
    Parsed 104 commits
    New history written in 0.16 seconds; now repacking/cleaning...
    Repacking your repo and cleaning out old unneeded objects
    HEAD is now at 58387b2 Modify README
    Enumerating objects: 6, done.
    Counting objects: 100% (6/6), done.
    Delta compression using up to 4 threads
    Compressing objects: 100% (3/3), done.
    Writing objects: 100% (6/6), done.
    Total 6 (delta 0), reused 4 (delta 0)
    Completely finished after 0.38 seconds.

The --path option specifies which paths to keep in the new history. With the --invert-paths option, --path instead specifies which paths to exclude from the new history.

After this, sensitive.md is completely removed from the entire history, so it looks as if sensitive.md had never existed, not even in the initial commit.
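
One way to check the result is to count the commits that still touch the path. The sketch below uses only stock Git; `commits_touching` is a hypothetical helper name, and the demo repository is built on the spot and has not been rewritten, so it shows the "before" state next to a path that was never committed.

```shell
#!/bin/sh
set -eu

# Count the commits reachable from any ref that touch the given path.
commits_touching() {
    git rev-list --all --count -- "$1"
}

# Throwaway demo repository (hypothetical content).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

echo "readme" > README.md
git add README.md
git commit -q -m "Add README"
echo "secret" > sensitive.md
git add sensitive.md
git commit -q -m "Add sensitive.md"
git rm -q sensitive.md
git commit -q -m "Remove sensitive.md"

# Not rewritten yet: the add commit and the delete commit both touch the path.
touched=$(commits_touching sensitive.md)
echo "commits touching sensitive.md: $touched"

# A path that was never committed gives the count you want after a rewrite.
clean=$(commits_touching never-existed.md)
echo "commits touching never-existed.md: $clean"
```

In a repository where the rewrite succeeded, running the helper on sensitive.md should report 0.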

Removing all files bigger than a certain size

If you want to remove all files larger than 100 KB, you can use the --strip-blobs-bigger-than option as follows:

    $ git filter-repo --strip-blobs-bigger-than 100K
    Processed 318 blob sizes
    Parsed 106 commits
    New history written in 0.10 seconds; now repacking/cleaning...
    Repacking your repo and cleaning out old unneeded objects
    HEAD is now at 0fc502c Modify README
    Enumerating objects: 312, done.
    Counting objects: 100% (312/312), done.
    Delta compression using up to 4 threads
    Compressing objects: 100% (209/209), done.
    Writing objects: 100% (312/312), done.
    Total 312 (delta 98), reused 312 (delta 98)
    Computing commit graph generation numbers: 100% (104/104), done.
    Completely finished after 0.31 seconds.

There are many other examples in the git-filter-repo man page.
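
Before stripping by size, it can help to list which blobs exceed the threshold. The following is a rough sketch with stock Git and awk, not part of git-filter-repo. `blobs_bigger_than` is a hypothetical helper, the demo files are made up, and the byte limit assumes that 100K means 100 × 1024 bytes.

```shell
#!/bin/sh
set -eu

# Print "size path" for every blob in history above a byte limit, largest first.
blobs_bigger_than() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk -v limit="$1" '
            $1 == "blob" && $2 + 0 > limit + 0 {
                size = $2
                sub(/^[^ ]+ [^ ]+ /, "")   # keep only the path, spaces included
                print size, $0
            }' |
        sort -rn
}

# Throwaway demo repository (hypothetical content).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

echo "readme" > README.md
dd if=/dev/zero of=big.bin bs=1024 count=200 2>/dev/null   # 204800 bytes
git add README.md big.bin
git commit -q -m "Add README and a large file"
git rm -q big.bin
git commit -q -m "Delete the large file"

# big.bin is gone from the working tree but still listed from history.
large=$(blobs_bigger_than 102400)
echo "$large"
```

The list shows what a size-based strip would affect, which makes it easier to pick a threshold.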

Why not git filter-branch?

The git filter-branch command used to be an official way to rewrite history. However, you'll see a warning like the one below when you run git filter-branch on Git 2.24.0 or later.

    WARNING: git-filter-branch has a glut of gotchas generating mangled history
             rewrites.  Hit Ctrl-C before proceeding to abort, then use an
             alternative filtering tool such as 'git filter-repo'
             (https://github.com/newren/git-filter-repo/) instead.  See the
             filter-branch manual page for more details; to squelch this warning,
             set FILTER_BRANCH_SQUELCH_WARNING=1.

Compared to git filter-branch, git-filter-repo has several advantages:

  • Simple
  • Fast
  • Safe

For example, when removing a file that has been modified 100 times, git filter-branch takes about 17 times longer than git filter-repo. The repository I used for that test is public here, so you can try it yourself. Removing sensitive.md from this repo took 0.84 seconds with git filter-repo and 14.49 seconds with git filter-branch.

For more details about the issues with git filter-branch, see the git filter-branch man page.

Install

git filter-repo is not bundled with Git, so you need to install it yourself. If you use a package manager such as Homebrew, you can install it with that.
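
As a rough sketch, the check below reports whether Git can find the filter-repo subcommand. The install commands in the comments are common routes at the time of writing, not the only ones, and the package names should be confirmed against the installation documentation.

```shell
#!/bin/sh

# Possible install routes (assumptions; pick whichever fits your setup):
#   brew install git-filter-repo
#   python3 -m pip install git-filter-repo

# "git filter-repo --version" exits non-zero when the subcommand is not found.
if git filter-repo --version >/dev/null 2>&1; then
    status="installed"
else
    status="missing"
fi
echo "git-filter-repo: $status"
```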

For more details, see the official installation documentation.
