Introduction to Data Version Control

#datascience #dataversioncontrol #datamanagement #machinelearning

Data Version Control (DVC) is a free, open-source system for managing data, machine learning experiments, and machine learning automation. It makes work easier because data scientists no longer have to keep track of which model uses which dataset, or which steps produced a given result. It also lets them manage large datasets with ease, which improves collaboration.

Data Version Control was first released in 2017 as a simple command-line tool. It builds on existing engineering tools and practices such as Git and continuous integration (CI). It tracks the changing versions of data alongside the commits made to any file, so DVC is like Git for machine learning projects.

A .dvc file is lightweight, so it is stored with the code on GitHub and downloaded together with it. The large datasets and model files themselves are always stored in the DVC remote storage, while the .dvc files that point to those data files are stored on GitHub.
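The split between a small pointer file and the large data it refers to can be sketched in a few lines of Python. This is only an illustration of the idea, not DVC's implementation: the `track` helper, the two-character cache directory layout, and the simplified pointer fields are assumptions made for this example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a pointer file.

    Hypothetical helper: the pointer layout is simplified for illustration
    and is not DVC's exact .dvc schema.
    """
    checksum = file_md5(data_file)
    # Assumed layout: the first two hex characters name a sub-directory.
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".dvc")
    lines = [
        "outs:",
        f"- md5: {checksum}",
        f"  size: {data_file.stat().st_size}",
        f"  path: {data_file.name}",
    ]
    pointer.write_text("\n".join(lines) + "\n")
    return pointer


def pointer_demo() -> dict:
    """Track a generated CSV file and compare data size with pointer size."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        data = workspace / "train.csv"
        data.write_bytes(b"id,label\n" + b"1,0\n" * 50_000)
        pointer = track(data, workspace / "cache")
        return {
            "pointer_name": pointer.name,
            "data_bytes": data.stat().st_size,
            "pointer_bytes": pointer.stat().st_size,
            "checksum_matches": file_md5(data) in pointer.read_text(),
        }


if __name__ == "__main__":
    print(pointer_demo())
```

Only the small pointer file would need to go into Git. The cached copy, addressed by its checksum, is what a remote storage location would hold.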

DVC design principles

1. Codification: Define project aspects such as data and model versions or machine learning experiments in human-readable metafiles.

2. Versioning: Commit DVC metafiles to Git, which enables versioning and sharing of the entire project (datasets, source code, configuration, parameters and metrics) using Git.

3. Secure Collaboration: Control the access and permissions to the project.

Characteristics of DVC

Data Version Control takes advantage of existing technologies to bring the best software engineering practices to the field of data science. Some of the characteristics of DVC include:

Easy to use and install:
DVC doesn't require special infrastructure or knowledge. Furthermore, it does not depend on any external services. DVC can be easily integrated with existing tools like Git.
Can work on top of a Git repo:
DVC sticks to the Git workflow: commits, branches, pull requests, push, pull, clone and so on. It can also work on its own, though without the versioning capabilities.
DVC doesn't depend on the platform:
It runs on all major operating systems. It is independent of programming languages and machine learning libraries.

How to install Data Version Control on Windows
DVC can also be installed on Linux and macOS. However, this article covers only the Windows installation.
To use DVC as a Python library, you can install it with conda or with pip.

Installation with choco
To install from the command line with Chocolatey, use the choco command:
$ choco install dvc

Installation with conda:
Requires a Miniconda or Anaconda distribution. Run conda from the Anaconda Prompt.

$ conda install -c conda-forge mamba

$ mamba install -c conda-forge dvc

Installation using pip:
Creating a virtual environment, or using pipx, is recommended to keep the installation isolated from your local environment. Python 3.8+ is needed to get the latest version of DVC.

$ pip install dvc

Windows Installer:
Go to the https://dvc.org/ homepage and get the self-contained, executable installer, which is available from the Download button. You can also get it from the release page on GitHub.
To update DVC, download and run the installer again. Use the Windows uninstaller in case you want to remove the program from your machine.
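Whichever method you choose, it can be handy to confirm that the installation is visible. The sketch below uses only the Python standard library. The function name `dvc_install_status` is made up for this example, and the check assumes the executable is called `dvc` and the Python package is distributed under the name `dvc`.

```python
import shutil
from importlib import metadata


def dvc_install_status() -> dict:
    """Report whether the dvc executable and the dvc Python package are visible."""
    try:
        # Only succeeds if dvc is installed into *this* Python environment.
        package_version = metadata.version("dvc")
    except metadata.PackageNotFoundError:
        package_version = None
    return {
        # Full path of the dvc executable on PATH, or None if it is not found.
        "executable": shutil.which("dvc"),
        "package_version": package_version,
    }


if __name__ == "__main__":
    status = dvc_install_status()
    print("dvc executable:", status["executable"] or "not found on PATH")
    print("dvc Python package:", status["package_version"] or "not installed here")
```

With the Chocolatey or Windows installer route, the executable may be on the PATH even though the package is not importable from your current Python environment, so the two values can legitimately differ.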

Advantages of Data Version Control

Organized Machine learning data-
DVC uses the concept of data pipelines to version data with Git. Because the pipelines are lightweight, they keep workflows organized and reproducible.
Share Models via Cloud Storage-
With centralized data storage, data scientists find it easy to run experiments on a single shared machine, which leads to better resource utilization.
Reproducibility-
DVC repositories store the history and details such as what changes were made and when. It can also use no-code pulls to update requests with just one commit. The easy-to-use command-line interface allows scientists to reproduce and organize feature stores with the dvc get and dvc import commands.
Track & Visualize ML Models-
Versioning is achieved using Git workflows such as pushes and pull requests. DVC's built-in cache stores all the machine learning artifacts, which are then synchronized with remote cloud storage. DVC therefore allows data and models to be tracked for further versioning.
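The pipeline and reproducibility points above rest on one idea: a stage only needs to run again when its inputs have changed. The Python sketch below illustrates that idea with checksums recorded in a lock file. It is a simplified stand-in, not how DVC itself is implemented; `run_stage` and the `pipeline.lock.json` file name are hypothetical names chosen for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a (small) file."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def run_stage(name: str, deps: List[Path], action: Callable[[], None],
              lock_file: Path) -> bool:
    """Run action only if the dependency checksums differ from the recorded ones.

    Returns True if the stage ran, False if it was skipped.
    """
    lock = json.loads(lock_file.read_text()) if lock_file.exists() else {}
    current = {str(dep): md5_of(dep) for dep in deps}
    if lock.get(name) == current:
        return False  # inputs unchanged, so the previous output is still valid
    action()
    lock[name] = current
    lock_file.write_text(json.dumps(lock, indent=2))
    return True


def pipeline_demo() -> List[bool]:
    """Run one stage three times: fresh, unchanged input, changed input."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        raw = root / "raw.txt"
        cleaned = root / "clean.txt"
        lock_file = root / "pipeline.lock.json"
        raw.write_text("a\n\nb\n")

        def clean() -> None:
            # Toy stage: drop empty lines from the raw file.
            lines = [line for line in raw.read_text().splitlines() if line]
            cleaned.write_text("\n".join(lines) + "\n")

        runs = [run_stage("clean", [raw], clean, lock_file)]      # first run
        runs.append(run_stage("clean", [raw], clean, lock_file))  # nothing changed
        raw.write_text("a\nb\nc\n")
        runs.append(run_stage("clean", [raw], clean, lock_file))  # input changed
        return runs


if __name__ == "__main__":
    print(pipeline_demo())
```

In DVC the equivalent bookkeeping is done through its own metafiles and the `dvc repro` command rather than by hand. Committing those metafiles to Git is what ties a given code version to the data that produced its results.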

Disadvantages of DVC

a) Poor Performance in Sloppy Architecture
Data Version Control works alongside Git, so team members cannot enjoy the full benefits of the version control system if some information about the datasets for a given project is missing. Teams may also have to manually develop extra features on top of DVC to meet certain ML demands.
b) Redundancy
DVC includes its own pipeline management, so using a separate pipeline tool alongside it leads to redundancy.
c) Incorrect Configuration Risk
If the team forgets to add an output file, there is a risk that the pipeline is configured incorrectly. Furthermore, a version of a project produced with DVC last year may not behave the same way under today's circumstances.
