DEV Community: Martin Daniel

Why Git is not enough for data science

Martin Daniel — Wed, 05 May 2021 08:06:08 +0000

TL;DR Git is used in almost every software development project to track code and file changes. Because it can track every change, Git's adoption in data science projects has also grown tremendously. In this post we discuss:

Benefits of Git for data science
The gaps and limitations of Git
Best practices for using Git for data science projects

If you are already familiar with Git, jump to the section “Is Git important to learn for data science?”

What is Git and how does it work?

“Git is a free and open-source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.” - Git

As the description states, Git is a version control system. It helps record, track, and save any change made to source code and to quickly and easily recover any previous state.
Git uses a distributed version control model. This means that there can be many copies (or forks/remotes in the GitHub world) of the repository. When working locally, Git is the program that you will use to keep track of changes to your repository.

GitHub.com is a location on the internet that acts as a remote location for your repository. GitHub provides a backup of your work that can be retrieved if your local copy is lost (e.g., if your computer falls off a pier). GitHub also allows you to share your work and collaborate with others on a project.

Tools similar to GitHub include GitLab and Bitbucket.

Source: Pro Git by Scott Chacon and Ben Straub.

How Git and GitHub, GitLab, or Bitbucket help you work better

There are several practical ways Git helps a development project.

Keep track of changes to your code locally using Git.
Synchronize code between different versions (i.e. your own versions or others’ versions).
Test changes to code without losing the original.
Revert to an older version of code, if needed.
Back up your files in the cloud (GitHub/GitLab/Bitbucket).
Share your files on GitHub/GitLab/Bitbucket and collaborate with others.

Up to this point, it's clear why Git is a powerful tool that helps you record, track, and store any change made to almost any file in a project. We have also seen how Git helps you work well as a team, which is one of the main reasons it is so widely used in software development projects.

These benefits are also relevant for data science projects, where Git manages the code that supports the work, but they do not translate one-to-one.

Is Git important to learn for data science?

If you want to join a data science project as a collaborator, you will have to face some challenges, for example:

Review all ongoing research and repositories on a specific topic and pick the most promising one. (If you are working on an open source project)
You will need to understand the current state of that project and how it has evolved over time.
Identify which directions are promising and still worth exploring. In this step, reviewing ideas and approaches that were tried and abandoned is also important, since you don’t want to unnecessarily repeat work someone tried unsuccessfully. Usually, these failed approaches are not documented and forgotten, which is a huge challenge.
You will need to collect all the pieces of the project (data, code, etc.) which might be spread out over multiple platforms, and sometimes not completely accessible.
Last but not least, once you’ve made some improvement or explored a new direction, there is no easy way to contribute your results back to the project.

To summarize, today's tools do not handle several of these challenges gracefully.

Now let’s see how Git can help us fill in some of the missing pieces.

With Git’s ability to track every change we make to our files, we can show all the directions taken, when they were taken, and by whom! It's possible to read the entire Git history as an actual story, understand what was done at every step (commit), and get some documentation along the way (commit messages). You can also share it with other collaborators using one of the previously mentioned platforms.

By using a more traditional software development workflow, you begin to treat your models more like an application and less as a script, which makes it easier to manage and leads to higher quality outcomes.

Still, Git has limitations

Although there are significant benefits from using version control tools like Git, they come with a high overhead cost.
The overhead comes from the need to ensure every change goes through the "commit" process, which most often means using the command line. Since the terminal is unfamiliar to many analysts (and even data scientists), you don't just need to learn Git, you also need to learn the terminal! This is not quick, and watching your efficiency suffer while you struggle to remember which command to type is a huge turn-off. If this applies to you, check out the blog post Effective Linux & Bash for Data Scientist.

Also, Git can't do all the heavy lifting on its own. What do I mean by this?

Git can't support experiment tracking. Here is a nice post comparing some existing tools: ML experiment tracking tools that fit your data science workflow
Git can't efficiently track big files (datasets and models). You can find more information in the post Comparing Data Version Control tools

Best practices for structuring a Data Science project using Git

With all that said, I propose a way to bring the Git mindset into your DS project. It has three components: experiment tracking, version control, and treating data as source code.

Experiment tracking

You can implement experiment tracking in one of two ways: with a dedicated external tool or with Git itself. You can find more information in ML experiment tracking tools that fit your data science workflow

External tracking

With this approach, you log all your experiment information to an external system.

This approach has some advantages:

A lot of excellent tools have been developed
It’s an intuitive way to work. There is no need to stop and create a new commit in a Git project before taking a new direction

But, with advantages come the disadvantages:

There is no clear connection between the code and the experiment results
It’s hard to review
Reproducibility. It’s not easy to reproduce what led to an experiment result

Version Control with Git Tracking

You treat each experiment as a Git commit. Any change to the project creates a new version, since code, data, and parameters are all part of the source code.
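One way to make a commit capture a whole experiment is to write its parameters and metrics to small text files that live next to the code. The sketch below is only an illustration: the file names params.json and metrics.json, the helper save_experiment, and the example values are assumptions, not something prescribed by Git or by this article.

```python
import json
import tempfile
from pathlib import Path


def save_experiment(project_dir, params, metrics):
    """Write parameters and metrics as sorted, indented JSON so Git diffs stay readable."""
    project_dir = Path(project_dir)
    project_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, payload in (("params.json", params), ("metrics.json", metrics)):
        path = project_dir / name
        # sort_keys gives a stable key order, so only real changes show up in a diff
        path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n")
        written.append(path)
    return written


if __name__ == "__main__":
    # Made-up values, written to a throwaway directory for demonstration
    with tempfile.TemporaryDirectory() as tmp:
        for written_file in save_experiment(tmp, {"lr": 0.01, "epochs": 5}, {"accuracy": 0.93}):
            print(written_file.name)
```

After a run, committing these files together with the code change ties the result to the exact version that produced it.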

Some advantages are:

Reproducing is easy: just do a git checkout and you have the code, parameters, data, and models.
You get all the context related to an experiment
Collaboration. As mentioned throughout this article Git in combination with some of the other platforms gives you the possibility to parallelize work
If combined with data versioning tools you can also accept data contributions
GitOps, CI/CD - Makes it easier to integrate with the existing git ecosystem for CI/CD, PRs

It also has some disadvantages:

It can get messy, since a lot of experiments means a lot of commits
It requires a change of mindset: you have to start treating every new direction as a commit

Of course, you can do a mix of both ideas.

Data as source code

As I mentioned before, Git was developed to track changes in text files, not large binary files, so tracking a project's data set directly in Git is not practical. Instead, I recommend one of two options:

For a non-changing dataset, you can upload it to a server and access it through a URL
If your data set might change, you should consider versioning it with a dedicated data versioning tool. You can find a great comparison in Comparing Data Version Control tools.
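For the first option, it can help to pin a checksum of the dataset in the repository so that everyone can confirm they downloaded the same bytes. A rough Python sketch, assuming a SHA-256 digest is stored alongside the code; the function names are illustrative and the download step itself is left out:

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(path, expected_sha256):
    """Return True when the local copy matches the digest pinned in the repository."""
    return sha256_of_file(path) == expected_sha256.lower()


if __name__ == "__main__":
    # Demonstration on a tiny throwaway file instead of a real dataset
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.csv")
        with open(sample, "wb") as handle:
            handle.write(b"a,b\n1,2\n")
        pinned = sha256_of_file(sample)
        print(verify_dataset(sample, pinned))
```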

You can also find more information about why it is a good idea to version your data set in the blog post Datasets should behave like Git repositories

Conclusion

Implementing these suggested practices for Git offers several benefits:

Consolidate all your project files, data and models in one place
Review tools which make it easier to contribute to an ongoing project and easier to check these contributions.
Easier to reproduce and reuse work from previous projects
CI/CD: if you are happy with the contributions that were made, you can automatically merge them, test the code and the data, and ship them to production.

Day by day, Git is being used in more Data Science projects. I hope that after reading this article you have a better understanding of its limitations and strengths, and of how you can use it with your colleagues. Good luck!

DAGsHub Storage: configure a DVC remote without a DevOps degree

Martin Daniel — Sun, 17 Jan 2021 14:01:58 +0000

DVC is a great tool; it lets you track and share your data, models, and experiments. It also supports pipelines to version control the steps in a typical ML workflow. To share your data and models, you will need to configure a DVC remote (such as S3, GCloud, GDrive, etc.), but doing so can be a hassle and take a tremendous amount of time.

In this post, I'll show you that this configuration doesn't have to be so difficult; it should be smooth and easy. To solve this issue, we created DAGsHub Storage, a DVC remote that is super easy to configure: no credit cards, no complex permissions to grant, no cloud setup. Just five commands and you are ready to go!

To start, you will need to have a project on DAGsHub. There are two ways to do this: either create one from scratch or connect an existing project from another platform (we support GitHub, GitLab, and Bitbucket).

If you need, we have a tutorial on how to start a new project on our platform.

In order to continue with this tutorial you will need to install DVC first.

After DVC is installed, in a Git project, initialize it by running

dvc init

This command will create .dvc/.gitignore, .dvc/config, .dvc/plots, and .dvcignore. These entries can be committed with

git commit -m "Initialize DVC"

For the purpose of this tutorial I've created a new project with the following structure

data
├── processed
│   ├── test_text.txt
│   └── train_text.txt
└── raw
    └── test_full.txt

To start tracking our data, either a file or a directory, we use dvc add

dvc add data

Here is where DVC does its magic. It stores metadata about the added entry in a .dvc file: a small text file that describes how to access the original entry but does not contain the entry itself. The command also adds the entry to the .gitignore file, so we won't commit it by accident.

In our case DVC created a file called data.dvc, which will look like this

outs:
- md5: 61b3e1a6439d6770be4d210b758f6cbd.dir
  size: 0
  nfiles: 3
  path: data

This is the file that will be versioned by Git
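To build intuition for how a small pointer can stand in for the real data, here is a simplified Python sketch of a content-addressed cache. It is not DVC's implementation: store and restore are hypothetical helpers, the two-character folder layout is an assumption made for illustration, and DVC computes the .dir hash of a directory differently from the single-file hash shown here.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def store(path, cache_dir):
    """Copy a file into a content-addressed cache and return a small pointer to it."""
    path, cache_dir = Path(path), Path(cache_dir)
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    # Assumed layout: first two hex characters as a folder, the rest as the file name
    target = cache_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        shutil.copyfile(path, target)
    return {"md5": digest, "size": path.stat().st_size, "path": path.name}


def restore(pointer, cache_dir, dest_dir):
    """Recreate the original file from the cache using only the pointer."""
    digest = pointer["md5"]
    source = Path(cache_dir) / digest[:2] / digest[2:]
    dest = Path(dest_dir) / pointer["path"]
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, dest)
    return dest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "train_text.txt").write_text("some training text\n")
        pointer = store(root / "train_text.txt", root / "cache")
        print(pointer["path"], pointer["size"])
```

The pointer is only a few bytes no matter how large the file is, which is why it is cheap to version with Git while the data itself lives elsewhere.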

Following this step we are ready to commit the .dvc file as we would do with any source code.

git add data.dvc .gitignore
git commit -m "Add data"

Storing the data remotely

Excellent! We are now tracking the versions of our data, and now we have to figure out where to store the data itself.

As I mentioned before, I will show you how to effortlessly configure a DVC remote. With five simple commands, you will be pushing your data and models alongside your code. For comparison, I'll also show you the traditional way to set up remotes, so you can see how much time DAGsHub Storage saves.

How to do it without a DevOps degree

At DAGsHub, we automatically create a DVC remote with every project on the platform, so you can push your data and models to it just as you receive a Git remote to push your code. This is where the simplicity starts showing! To push or pull data from this URL, you use your existing DAGsHub credentials (via HTTPS basic authentication). This means you don't need to configure any IAM, provide access tokens for a bucket, or do anything else related to a cloud provider.

Public repositories will have publicly readable data, same as the code. If you want to share or receive data from a collaborator, add them as a project collaborator. If your repository is private, only maintainers will be able to pull or push data to it.

Basically, if you can clone the code, you can pull the data!

Let's get our hands dirty!

We need to add DAGsHub as our DVC remote

dvc remote add origin --local https://dagshub.com/<username>/<repo_name>.dvc

Next we need to tell DVC how to ask for our credentials

dvc remote modify origin --local auth basic
dvc remote modify origin --local user <username>
dvc remote modify origin --local ask_password true

And finally, push the data to the new remote

# Make sure you are using DVC 1.10 or greater for the next command
dvc push -r origin
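If you set up several projects, the same five commands can be generated from a small script. The Python sketch below only builds the argument lists shown above; dagshub_remote_commands is a hypothetical helper, and "jane" and "my-project" are made-up placeholders.

```python
import shlex


def dagshub_remote_commands(username, repo_name, remote="origin"):
    """Return the five DVC commands from this section as argument lists."""
    url = f"https://dagshub.com/{username}/{repo_name}.dvc"
    return [
        ["dvc", "remote", "add", remote, "--local", url],
        ["dvc", "remote", "modify", remote, "--local", "auth", "basic"],
        ["dvc", "remote", "modify", remote, "--local", "user", username],
        ["dvc", "remote", "modify", remote, "--local", "ask_password", "true"],
        ["dvc", "push", "-r", remote],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is changed by accident
    for argv in dagshub_remote_commands("jane", "my-project"):
        print(shlex.join(argv))
```

To actually run them, each list could be passed to subprocess.run(argv, check=True) from inside the Git project.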

And that's it! With just 5 commands you configured your DVC remote effortlessly. We never opened a cloud provider webpage, handled complicated IAM, or provided credit card information.

Easy peasy lemon squeezy

If you need more information about DAGsHub Storage, you can read our Feature Reference

How to do it WITH a DevOps degree – A Comparison

Before we dig into this section, note that DAGsHub currently supports AWS S3 and GCS in addition to DAGsHub Storage.

For the sake of this comparison, let's see how to do it for Amazon S3.

Sign up with AWS as your cloud provider. This involves taking out your credit card (if you already have an account, you can skip this step)
Set up a bucket to store your data
Install the AWS CLI tool
Log in to AWS using the CLI tool
If the user who is going to use the bucket is not an admin, create an IAM user

Assign it the correct permissions to use the bucket, replacing <IAM-user-ARN> with the user's ARN (e.g. "arn:aws:iam::7777777:user/dags-lover")

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "<IAM-user-ARN>"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::/*",
                "arn:aws:s3:::"
            ]
        }
    ]
}
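Hand-edited JSON is easy to break with a stray or missing comma. One option is to generate the policy from a small script. The Python sketch below assumes a hypothetical bucket name (my-dvc-bucket) and reuses the example user ARN from this section; bucket_policy is an illustrative helper, not part of any AWS tool.

```python
import json


def bucket_policy(bucket_name, iam_user_arn):
    """Build a policy that lets one IAM user list, read, and write a single bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": iam_user_arn},
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",  # objects inside the bucket
                    f"arn:aws:s3:::{bucket_name}",  # the bucket itself, used by ListBucket
                ],
            }
        ],
    }


if __name__ == "__main__":
    # The bucket name is a placeholder; substitute your own values
    policy = bucket_policy("my-dvc-bucket", "arn:aws:iam::7777777:user/dags-lover")
    print(json.dumps(policy, indent=4))
```

Because json.dumps produces the text, the output is always syntactically valid JSON, whatever values are passed in.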

A lot of things, right? All these steps are error-prone even for the most experienced users, so if you are doing this for the first time, expect to miss something.

It doesn't end there. If you want to integrate DAGsHub, you will need to add a Storage Key to your project settings so we will be able to list, show, and diff your files on our file viewer.

You will find this settings page on https://dagshub.com///settings/storage/keys

Once you enter your bucket URL, you will receive all the instructions to add the storage key.

Keep up! We haven't finished yet! Now you will need to install the S3 package for DVC

pip install "dvc[s3]"
# Or if you are using poetry
poetry add dvc --extras "s3"

Following this, we will need to add the bucket as our remote

dvc remote add s3-remote s3://your-bucket/storage

And finally, we push our data

dvc push -r s3-remote

Learn more

I hope this helped you understand how to set up a DVC remote (an easy way and a hard way). For more information about DAGsHub, check out our website, documentation, or join our Discord community.