It's no secret that Git doesn't handle large files well:
Indeed. The git architecture simply sucks for big objects. It was discussed somewhat during the early stages, but a lot of it really is pretty fundamental. (Linus Torvalds)
In this short post I'd like to:
- See what tools are available to handle large files with Git
- Try one of those - DVC
Have you ever committed a file of a few hundred megabytes, only to realize that it's now part of the repo and that carving it out and fixing the repo would take quite an effort?
Git clone takes hours, regular operations might take minutes instead of seconds - not the best idea indeed. And still, there are a lot of cases where we want to have a large file versioned in our repo - from game development to data science where we want to handle large datasets, videos, etc.
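To see why carving a file out takes effort, here is a minimal sketch in a throwaway repository. The file name big.bin, the 5 MB size, and the demo identity are assumptions made for illustration only. It commits the file, deletes it in a follow-up commit, and then checks that the blob is still reachable from history:

```shell
#!/bin/sh
# Sketch: a large file stays in Git history even after it is deleted.
set -e

repo=$(mktemp -d)                          # throwaway repo, safe to delete afterwards
cd "$repo"
git init -q
git config user.email "demo@example.com"   # placeholder identity for the demo commits
git config user.name "Demo"

# Commit ~5 MB of incompressible random data, then delete it in a second commit.
head -c 5000000 /dev/urandom > big.bin
git add big.bin
git commit -q -m "add big file"
git rm -q big.bin
git commit -q -m "remove big file"

# The working tree no longer contains the file...
ls -A

# ...but the blob is still reachable from history, so every clone still carries it.
git rev-list --objects --all | grep big.bin
git count-objects -vH
```

Deleting the file only adds a new commit; getting rid of the blob itself means rewriting history, which is exactly the effort we'd like to avoid.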
So, let's see what open-source, Git-compatible options we have to deal with this:
Git-annex - a powerful and sophisticated tool, but to my mind that also makes it hard to learn and manage
You can read a (somewhat outdated) overview of LFS and annex tools here, but this time I want to show you what the workflow looks like with DVC (yes! I'm one of the maintainers).
After DVC is installed, all we need to do is run dvc add and set up the storage where you'd like to keep your large files.
Let's try it right here and now. First, we need a dummy repo:
$ mkdir example
$ cd example
$ git init
$ dvc init
$ git commit -m "initialize"
Second, generate a large file:
$ head -c1000000 /dev/urandom > large-file
# Windows: fsutil file createnew large-file 1048576
The workflow is similar to Git, but instead of git add and git push we run dvc add and dvc push when we want to save a large file:
$ dvc add large-file
Now, let's save it somewhere (we use Google Drive here, but it can be AWS S3, Google Cloud, local directory, and many other storage options):
$ dvc remote add -d mystorage gdrive://root/Storage
$ dvc push
You'd need to create the Storage directory in your Google Drive UI first, and dvc push will ask you to give it access to your storage. It is absolutely safe: the credentials are saved on your local machine in .dvc/tmp/gdrive-user-credentials.json, and no access is given to anyone else.
Now we can run git commit to save the DVC files instead of the large file itself (you can run git status to see that large-file is no longer tracked by or visible to Git):
$ git add .
$ git status
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   .dvc/config
        new file:   .gitignore
        new file:   large-file.dvc
$ git commit -a -m "add large file"
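If you'd rather not connect Google Drive while experimenting, the same walkthrough can be replayed against a local directory, which is one of the storage options mentioned above. The sketch below is an illustration under a few assumptions: the scratch paths come from mktemp, the remote name localstorage and the demo Git identity are made up, and the script simply skips itself when dvc is not installed.

```shell
#!/bin/sh
# Sketch: the walkthrough above, with a local directory as DVC storage.
set -e

# Skip quietly when DVC is not installed.
command -v dvc >/dev/null 2>&1 || { echo "dvc not found, skipping"; exit 0; }

work=$(mktemp -d)                          # scratch location (assumption)
mkdir "$work/example" "$work/storage"
cd "$work/example"

git init -q
git config user.email "demo@example.com"   # placeholder identity for the demo commits
git config user.name "Demo"
dvc init
git commit -q -m "initialize"

# Generate the file and hand it over to DVC instead of Git.
head -c 1000000 /dev/urandom > large-file
dvc add large-file

# Use a plain directory as the default remote, then upload the data to it.
dvc remote add -d localstorage "$work/storage"
dvc push

git add .
git commit -q -m "add large file"

# Git tracks the small .dvc file; the data file itself is ignored.
git ls-files
git check-ignore large-file
```

The last two commands are a quick sanity check: large-file.dvc should appear in the list of tracked files, while large-file should be reported as ignored.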
That's it for today. Next time we'll see how it worked, what large-file.dvc means, why DVC creates .gitignore, and how we can get our file back!