Ivan Shcheklein

How to version large files with Git

It's no secret that Git doesn't handle large files well:

Indeed. The git architecture simply sucks for big objects. It was discussed somewhat during the early stages, but a lot of it really is pretty fundamental. (Linus Torvalds)

In this short post I'd like to:

  • See what tools are available to handle large files with Git
  • Try one of them - DVC

Have you ever committed a file of a few hundred megabytes, only to realize it's now part of the repo history and it would take quite an effort to carve it out and fix the repo:

Large file in Git

Git clone takes hours and regular operations take minutes instead of seconds - not the best idea indeed. And still, there are a lot of cases where we want to version large files in our repo - from game development to data science, where we deal with large datasets, videos, etc.
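
To get a sense of the cleanup pain, here's a common shell snippet (a sketch, not part of the DVC workflow shown below) that lists the largest blobs buried in a repo's history - the ones you'd have to rewrite history to remove:

$ git rev-list --objects --all \
    | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
    | awk '/^blob/ {print $3, $4}' \
    | sort -rn | head -10   # 10 largest blobs, size in bytes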

So, let's see what open-source, Git-compatible options we have to deal with this:

  • Git-LFS - GitHub and GitLab both support it and can store large files on their servers for you, with some limits (see the short sketch after this list)

  • Git-annex - a pretty powerful and sophisticated tool, but that also makes it hard to learn and manage, in my opinion

  • DVC - Git for Data, or Data Version Control - a tool made for ML and data projects, but at its core it helps version large files
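
For comparison, here's roughly what the Git-LFS flow looks like (a minimal sketch; the *.bin pattern and file name are just examples):

$ git lfs install                       # one-time setup per machine
$ git lfs track "*.bin"                 # tell LFS which files to intercept
$ git add .gitattributes large-file.bin
$ git commit -m "track a large file with LFS"
$ git push                              # the blob goes to the LFS store, not the regular Git objects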


You can read a (somewhat outdated) overview of the LFS and annex tools here, but this time I want to show you how the workflow looks with DVC (yes, I'm one of the maintainers!).

After DVC is installed, all we need to do is run dvc add and configure the storage you'd like to use for your large files.
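
If you don't have DVC yet, it can be installed with pip, among other options (the [gdrive] extra below is only needed for the Google Drive remote we'll use later):

$ pip install "dvc[gdrive]"   # or dvc[s3], dvc[all], etc., depending on your storage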

Let's try it right here. First, we need a dummy repo:

$ mkdir example
$ cd example
$ git init
$ dvc init
$ git commit -m "initialize"

Second, generate a large file:

$ head -c1000000 /dev/urandom > large-file
# Windows: fsutil file createnew large-file 1048576

The workflow is similar to Git's: instead of git add and git push, we run dvc add and dvc push when we want to save a large file:

$ dvc add large-file

Now, let's save it somewhere (we use Google Drive here, but it could be AWS S3, Google Cloud Storage, a local directory, or one of many other storage options):

$ dvc remote add -d mystorage gdrive://root/Storage
$ dvc push
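
If you'd rather use a different backend, only the remote URL changes - for example (bucket names and paths here are hypothetical):

$ dvc remote add -d mystorage s3://mybucket/dvc-storage    # AWS S3
$ dvc remote add -d mystorage gs://mybucket/dvc-storage    # Google Cloud Storage
$ dvc remote add -d mystorage /mnt/shared/dvc-storage      # local or mounted directory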

You'll need to create the Storage directory in the Google Drive UI first, and dvc push will ask you to grant it access to your storage. It's absolutely safe - the credentials are saved on your local machine in .dvc/tmp/gdrive-user-credentials.json, and no access is given to anyone else.
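
By the way, dvc remote add simply records the remote in .dvc/config - that's why the file shows up as modified in git status below. It looks roughly like this (exact formatting may vary between DVC versions):

$ cat .dvc/config
[core]
    remote = mystorage
['remote "mystorage"']
    url = gdrive://root/Storage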

Now, we can git commit to save the DVC files instead of the large file itself (you can run git status to see that large-file is no longer visible to Git):

$ git add .
$ git status

On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    modified:   .dvc/config
    new file:   .gitignore
    new file:   large-file.dvc

$ git commit -a -m "add large file"

That's it for today. Next time we'll see how it worked, what large-file.dvc means, why DVC creates a .gitignore, and how we can get our file back!

Latest comments (7)

Waylon Walker

I need to check out DVC. It's been making too much noise lately to let it slide by.

Ivan Shcheklein

Would be great to hear your feedback, Waylon!

Waylon Walker

Where do you suggest getting started? Is there a good hello world? How well does it play with S3?

Ivan Shcheklein

It plays really well with S3! That's one of the biggest differences from LFS, for example. I think a good starting point is dvc.org/doc/start, and please don't hesitate to reach out to me or our team if something is not clear.

Gary Bell

Whilst it would still initially be slow, if the large files rarely change you could put them in their own repository and include them as a Git submodule. That way you get the performance you would expect from Git in your main repo, whilst still being able to version large files.
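
For the record, that setup might look roughly like this (the repository URL and path are hypothetical):

$ git submodule add https://github.com/example/large-assets.git assets
$ git commit -m "add large assets as a submodule"
# later, in a fresh clone:
$ git submodule update --init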

Ivan Shcheklein

Good point, I haven't tried this before. It might work for smaller datasets, but it will start breaking with multi-GB data because of Git servers' limits on overall repo size.

Gary Bell

Submodules are an interesting beast of their own. But yes, multi-gigabyte repos will still cause issues.