Jeff Edmondson

Originally published at jeffedmondson.dev

Removing Sensitive Data From Git History

If you pushed sensitive data to a public repository, assume that it has been compromised & change it!
I would like to think that we have all been there: accidentally pushing a secret (access token, password, connection string, etc.) to a remote git server and immediately starting to freak out. Or maybe it's just me 🤷🏽. Anyhow, luckily for you, I have done it before and found a great tool for rewriting history! BFG is a tool that scrubs your git repositories clean, removing any sensitive data that you might have accidentally committed. Luckily, it is extremely simple to use and extremely fast 🚀. Go ahead and download the jar file & make sure that you have Java installed on your machine in order to run it.

Removing Sensitive Data (Data has feelings too)

  1. Remove the secrets from your files & commit to your master branch. We have to do this because BFG treats the HEAD commit as protected and will not touch its contents.
git commit -m "Oops"
  2. Get a fresh clone of your repository. You will want to use the --mirror flag here, since we only want the git history (a bare repository with every branch and tag) & not a checked-out copy of the files in your repository.
git clone --mirror your_repo_url
  3. Save a copy of this new clone just in case anything bad happens.
  4. Create a new .txt file (secrets.txt in the commands below) and add the sensitive information that you want to remove. Each entry should be on a separate line. BFG will search throughout your entire git history for any occurrences of these keys.
superawesomeapikey1
password123
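If you would rather create the file from the terminal, here is a minimal sketch. It assumes the file is called secrets.txt and uses the two placeholder values above, so swap in whatever you actually leaked.

```shell
# Write one literal secret per line to secrets.txt.
# The two values are only the placeholders from this article.
printf '%s\n' 'superawesomeapikey1' 'password123' > secrets.txt
```

Keep this file outside of the repository you are cleaning so that it never gets committed itself.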
  5. Run the command to replace all occurrences of your secrets with ***REMOVED***.
java -jar bfg.jar --replace-text secrets.txt your_repo.git

Note: BFG has many other options that you can use instead of replacing text. Check out their documentation for the full list.

  6. Change into the mirrored repo directory and run the following command to strip out all of the "dirty" data.
cd your_repo.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive

This is actually two commands chained together with the && operator, which runs the second command only if the first one succeeds. It looks a bit complicated, but let's break it down:

git reflog expire --expire=now --all

First, reflog here is short for "reference logs". Reference logs keep track of when the tips of your branches and other references were updated in your local repository, so they can still point at the old commits that contain your secrets. expire --expire=now tells git to prune every reflog entry older than now, which is all of them. Finally, we add the --all flag to tell git to run this operation on the reflogs of all references.

git gc --prune=now --aggressive

Secondly, we run the git gc command, which cleans up your repository. --prune=now tells git to delete all orphaned or unreachable objects right away instead of waiting out the default grace period. --aggressive tells git to sacrifice speed in order to optimize the repository as thoroughly as it can.
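To see what these two commands do in isolation, here is a small throwaway experiment. It is only a sketch and not part of the BFG workflow: it assumes git is installed, and the temp directory, file name, and identity are made up for the demo. It commits a fake secret, amends it away, and then checks whether the old commit object still exists before and after the cleanup.

```shell
#!/bin/sh
set -eu

# Throwaway repository in a temp directory (hypothetical demo setup).
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo User"

# Commit a fake secret and remember that commit's hash.
echo "password123" > config.txt
git add config.txt
git commit -q -m "Add config"
leaked_commit=$(git rev-parse HEAD)

# Replace the commit; the old one is now unreachable from any branch.
echo "placeholder" > config.txt
git commit -q -a --amend -m "Add config"

# Prints "present" if the object is still in the object database.
object_state() {
    if git cat-file -e "$1" 2>/dev/null; then
        echo "present"
    else
        echo "gone"
    fi
}

before=$(object_state "$leaked_commit")
git reflog expire --expire=now --all
git gc -q --prune=now --aggressive
after=$(object_state "$leaked_commit")

echo "before cleanup: $before, after cleanup: $after"
```

After the amend, only the reflog still refers to the old commit, so expiring the reflog first is what lets git gc actually delete it.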

  7. Once everything looks good in your local repository, it is time to push it back up to your remote repository with a simple:
git push
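One way to check that "everything looks good" before you run that push is to search the rewritten history for each entry in your secrets file. The helper below is a hypothetical sketch (find_leaked_secrets is my own name, not something BFG ships) and assumes git 1.8.5 or newer for the -C option. It uses git log -S, which lists commits that add or remove a given string.

```shell
#!/bin/sh
# Hypothetical helper: print every line of a secrets file that still
# shows up somewhere in the history of the given repository.
# Usage: find_leaked_secrets <secrets-file> <repo-dir>
find_leaked_secrets() {
    secrets_file=$1
    repo_dir=$2
    while IFS= read -r secret || [ -n "$secret" ]; do
        # Skip blank lines in the secrets file.
        [ -z "$secret" ] && continue
        # -S matches commits that change how often the string occurs;
        # --all searches every branch and tag, not just HEAD.
        hits=$(git -C "$repo_dir" log --all --oneline -S"$secret")
        if [ -n "$hits" ]; then
            printf 'still in history: %s\n' "$secret"
        fi
    done < "$secrets_file"
}
```

Running find_leaked_secrets secrets.txt your_repo.git should print nothing once the history is clean. Treat it as a sanity check rather than proof, since it only looks for the exact strings in your file.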

Now all of the secrets that you specified in the secrets.txt file will be replaced with ***REMOVED*** and you will have a squeaky clean git history! 🎉 One important thing to note here: if you are working on a team with other people, make sure that they each do a new git clone, since their existing clones still contain the old history. Thanks for reading! And make sure to give BFG some love by starring their repo on GitHub!

Top comments (11)

Tim Downey • Edited

+1

Once a secret is leaked publicly, consider it compromised and rotate it!

Bots are constantly watching GitHub for leaked public credentials and will find them before you can remove them.

If it wasn't pushed publicly, though, this seems like a good way of cleaning up.

Basti Ortiz

I'm surprised by how short this was. I'm not an expert in Git, but I thought the redaction process was much more tedious. Apparently not.

How does this method fare when "dirty" commits have already been pushed remotely (i.e. GitHub), though?

Also, are there any other ways to "clean up" the workstations of other maintainers without having to manually intervene (by cloning a fresh repository as you noted in the end)? For large teams, I would imagine how much of a hassle this would be; even more so for major open-source projects.

Jeff Edmondson

This also updates all dirty commits if you push the references up to GitHub. However, I am not sure about your second question. I bet that there would be some sort of method to do this but I am not currently aware of it. But hopefully code reviews would catch these keys from being merged into the main code line.

Basti Ortiz

Ah, that's unfortunate. All the more incentive to prioritize "prevention" over the "cure".

Patrick Nelson • Edited

I think this needs a few big disclaimers:

Secrets: This doesn't really purge secrets (if they've ever been pushed); once they're out, they're out for good. It may look "squeaky clean" now, but once you've pushed to git, consider it compromised forever, particularly since anyone could have pulled it in between your first push and when you fixed the problem. Always reset/rotate secrets once you've made this mistake! GitHub's own help section elaborates on this further: help.github.com/en/enterprise/2.17...

History rewrites: When working with other people, rewriting the history of a branch that others may be using is generally considered a big "no no". This is because once it's been pushed, it could get pulled by someone else, and once that happens, their next pull will end up forcing a merge commit because the other user's commit history will still be untouched. This creates a mess, so all other users who may have pulled must then also perform a local reset/rebase against the latest from the remote that they pulled from (which can be difficult once they've made changes since then).

See: mirrors.edge.kernel.org/pub/softwa...

For true distributed development that supports proper merging, published branches should never be rewritten.

TL;DR: Don't push credentials; if you do, consider them compromised and reset everything. Avoid rewriting pushed/public branches (especially master); if you have to, notify everyone who may have pulled the rewritten branch so they can reset/rebase their changes (or simply don't rewrite).

Osman TASKIRAN

I tried changing a test with BFG, and it ran perfectly.
But it did not change the develop branch's files and commits.

What is your recommendation?

Should I rebase the develop branch after applying BFG to master?

Pacharapol Withayasakpunt

I use git-filter-repo -- dev.to/patarapolw/i-turned-my-webs...

Nicolas Bonnici

The real question is: why store sensitive data in a code versioning tool? I know the answer, because I used to do those kinds of stupid things.

Rodrigo Graça 👨🏻‍💻🛠 • Edited

Isn't git commit --amend and a force push enough?
I'm pretty sure it is...
Maybe the old commit stays somewhere "parentless" and git gc should fix that...?

Chitsutha

Thanks for the article. I was looking for this. :)