TildAlice
Posted on • Originally published at tildalice.io

DVC Basics: Track Your First ML Dataset in 3 Commands

The Git Push That Cost Me 45 Minutes

I once watched a colleague accidentally push 2 GB of training images to GitHub. The push ran for 45 minutes, and then GitHub rejected it anyway because of the 100 MB file size limit. We had to use git filter-branch to rewrite history, which broke everyone's local clones.

The cleanup took an afternoon. The lasting solution was DVC.

DVC (Data Version Control) handles large files and datasets the same way Git handles code: it tracks changes, enables versioning, and lets you switch between dataset versions with a single command. But instead of storing the actual data in your Git repo, DVC stores lightweight pointer files (.dvc files) and pushes the real data to remote storage like S3, GCS, or even a local folder.
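To make the pointer-file idea concrete, here is a minimal Python sketch of the concept. It is not DVC's implementation or API: the `track` and `restore` helpers, the `.ptr.json` suffix, and the local `store` folder are all hypothetical names invented for illustration, and real .dvc files use a different format. The only point is that a tiny text file holding a content hash is enough to find the real data again.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, store: Path) -> Path:
    """Copy data_file into a content-addressed store and write a small pointer file.

    Hypothetical helper: illustrates the idea only, not how DVC does it internally.
    """
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    # The stored copy is named after its hash, so each version gets its own entry.
    shutil.copy2(data_file, store / digest)
    pointer = data_file.with_name(data_file.name + ".ptr.json")
    meta = {"md5": digest, "size": data_file.stat().st_size, "path": data_file.name}
    pointer.write_text(json.dumps(meta, indent=2))
    return pointer


def restore(pointer: Path, store: Path) -> Path:
    """Rebuild the data file that a pointer file refers to."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(store / meta["md5"], target)
    return target


def demo() -> str:
    """Track two versions of a file, then roll back to the first one."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "train.csv"
        store = root / "store"  # stands in for remote storage

        data.write_text("label,value\ncat,1\n")
        pointer = track(data, store)
        old_pointer_text = pointer.read_text()  # this small file is what Git would version

        data.write_text("label,value\ncat,1\ndog,2\n")
        track(data, store)

        # Putting the old pointer back (as a Git checkout would) and restoring
        # brings back the first version of the data.
        pointer.write_text(old_pointer_text)
        restore(pointer, store)
        return data.read_text()


if __name__ == "__main__":
    print(demo())
```

The pointer file is a few hundred bytes no matter how large the dataset is, which is why it can live in Git while the data itself sits elsewhere.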

Here's the thing most tutorials miss: you can get meaningful value from DVC with exactly three commands. No need to learn the entire ecosystem on day one.

Why Git Alone Fails for ML Data


Continue reading the full article on TildAlice
