It's no secret that Git doesn't handle large files well:
> Indeed. The git architecture simply sucks for big objects. It was discussed somewhat during the early stages, but a lot of it really is pretty fundamental. (Linus Torvalds)
In this short post I'd like to:
- See what tools are available to handle large files with Git
- Try one of them - DVC
Have you ever committed a few-hundred-MB file, only to realize later that it's now part of the repo history and it would take quite an effort to carve it out and fix the repo?
`git clone` takes hours, regular operations can take minutes instead of seconds - not the best experience. And still, there are plenty of cases where we want large files versioned in our repo - from game development to data science, where we need to handle large datasets, videos, etc.
So, let's see what open-source, Git-compatible options we have to deal with this:
- Git-LFS - GitHub and GitLab both support it and can store large files on their servers for you, with some limits
- git-annex - a pretty powerful and sophisticated tool, but that also makes it hard to learn and manage, in my opinion
- DVC - Git for Data, or Data Version Control - a tool built for ML and data projects, but at its core it helps version large files
You can read a (somewhat outdated) overview of the LFS and annex tools here, but this time I want to show you what the workflow looks like with DVC (yes, I'm one of the maintainers!).
After DVC is installed, all we need to do is run `dvc add` and set up the storage we'd like to use for the large files.
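If you don't have DVC installed yet, a typical way to get it is via pip (the `[gdrive]` extra below is only needed because we'll use a Google Drive remote later - it's optional, and there are extras for other storage backends too):

```
# install DVC together with the optional Google Drive dependencies
$ pip install "dvc[gdrive]"
```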
Let's try it right here and now. First, we need a dummy repo:
```
$ mkdir example
$ cd example
$ git init
$ dvc init
$ git commit -m "initialize"
```
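As a side note, `dvc init` creates a small `.dvc/` directory with DVC's own configuration and stages it for you - that's what the `git commit` above records. You can peek inside (the exact contents vary between DVC versions):

```
# DVC keeps its internal configuration here; exact files vary by version
$ ls -A .dvc
.gitignore  config
```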
Second, generate a large file:
```
$ head -c1000000 /dev/urandom > large-file

# on Windows: fsutil file createnew large-file 1048576
```
The workflow is similar to Git's, but instead of `git add` and `git push` we run `dvc add` and `dvc push` when we want to save a large file:
```
$ dvc add large-file
```
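Out of curiosity, you can already peek at what `dvc add` produced: a tiny `large-file.dvc` placeholder that Git will track instead of the data itself. The exact fields depend on the DVC version, and the checksum below is just an illustration:

```
# the md5 value below is illustrative - yours will differ
$ cat large-file.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 1000000
  path: large-file
```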
Now, let's save it somewhere (we use Google Drive here, but it can be AWS S3, Google Cloud Storage, a local directory, and many other storage options - a couple of alternative examples follow below):
```
$ dvc remote add -d mystorage gdrive://root/Storage
$ dvc push
```
You'd need to create the `Storage` directory in your Google Drive UI first, and `dvc push` will ask you to give it access to your storage. It is absolutely safe - credentials are saved on your local machine in `.dvc/tmp/gdrive-user-credentials.json`, and no access is given to anyone else.
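For completeness, here is roughly what pointing DVC at some of the other storage types mentioned above would look like (the bucket name and local path are made up for illustration):

```
# AWS S3 (hypothetical bucket)
$ dvc remote add -d mystorage s3://my-bucket/dvc-storage

# or a plain local/shared directory
$ dvc remote add -d mystorage /tmp/dvc-storage

$ dvc push
```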
Now, we can do `git commit` to save the DVC files instead of the large file itself (you can run `git status` to see that `large-file` is not visible to Git anymore):
```
$ git add .
$ git status
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   .dvc/config
        new file:   .gitignore
        new file:   large-file.dvc

$ git commit -a -m "add large file"
```
That's it for today! Next time we'll see how it works, what `large-file.dvc` means, why DVC creates a `.gitignore`, and how we can get our file back!
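For the impatient, getting the data back on another machine looks roughly like this - the repo URL below is a placeholder, and we'll cover the details next time:

```
# clone the Git repo - it's small, it only contains the .dvc placeholders
$ git clone https://github.com/<user>/example.git
$ cd example

# download the actual large files from the DVC remote
$ dvc pull
```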
Top comments (7)
I need to check out DVC. It's been making too much noise lately to let it slide by.
Would be great to hear your feedback, Waylon!
Where do you suggest getting started? Is there a good hello world? How well does it play with S3?
It plays really well with S3! That's one of the biggest differences from LFS, for example. I think a good starting point is dvc.org/doc/start, and please don't hesitate to reach out to me or our team if something is not clear.
Whilst it would still initially be slow, if the large files rarely change you could put them in their own repository and include it as a Git submodule. That way you get the performance you would expect from Git in your main repo, whilst still being able to version the large files.
Good point, I haven't tried this before. It might work for smaller datasets, but it will start breaking at multi-GB data because of Git servers' limits on overall repo size.
Submodules are an interesting beast of their own. But yes, multi-gigabyte repos will still cause issues.
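For readers who haven't used the approach discussed in this thread, a minimal submodule setup would look something like this (the URLs and path are hypothetical):

```
# inside the main repo: pull the large-files repo in as a submodule
$ git submodule add https://github.com/<user>/large-assets.git assets
$ git commit -m "add large assets as a submodule"

# later, on a fresh machine, fetch the main repo together with the submodule
$ git clone --recurse-submodules https://github.com/<user>/main-repo.git
```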