The Git Push That Cost Me 45 Minutes
I once watched a colleague accidentally push 2 GB of training images to GitHub. The push took 45 minutes to complete, and then GitHub rejected it anyway because of the 100 MB file size limit. We had to use git filter-branch to rewrite history, which broke everyone's local clones.
The fix took an afternoon. The solution was DVC.
DVC (Data Version Control) handles large files and datasets the same way Git handles code: it tracks changes, enables versioning, and lets you switch between dataset versions with a single command. But instead of storing the actual data in your Git repo, DVC stores lightweight pointer files (.dvc files) and pushes the real data to remote storage like S3, GCS, or even a local folder.
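To make the pointer-file idea concrete, here is a toy sketch in Python. It is not DVC's implementation: the `track` and `checkout` helpers, the `.pointer.json` suffix, and the flat `store` folder are all made up for illustration. It only shows the general technique of keeping the heavy bytes in a content-addressed store while a tiny text file (the part you would commit to Git) records the hash.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_path: Path, store: Path) -> Path:
    """Copy a file into a content-addressed store and write a small pointer file.

    Hypothetical helper for illustration only; real DVC does much more.
    """
    # Toy version: reads the whole file into memory to hash it.
    digest = hashlib.md5(data_path.read_bytes()).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data_path, store / digest)  # the heavy data lives here, outside Git

    pointer = data_path.with_name(data_path.name + ".pointer.json")
    meta = {"md5": digest, "size": data_path.stat().st_size, "path": data_path.name}
    pointer.write_text(json.dumps(meta))  # this tiny file is what Git would track
    return pointer


def checkout(pointer: Path, store: Path) -> Path:
    """Restore the data file that a pointer file refers to."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(store / meta["md5"], target)
    return target


if __name__ == "__main__":
    workdir = Path(tempfile.mkdtemp())
    store = workdir / "store"  # stands in for remote storage such as S3 or GCS
    data = workdir / "images.bin"

    data.write_bytes(b"version one")
    pointer = track(data, store)
    v1_pointer = pointer.read_text()  # what Git history would remember

    data.write_bytes(b"version two, with more images")
    track(data, store)

    # Switching dataset versions = restoring the old pointer, then checking out.
    pointer.write_text(v1_pointer)
    checkout(pointer, store)
    print(data.read_bytes())
```

The design point is that the pointer stays a few dozen bytes no matter how large the dataset grows, so Git history stays small and switching versions only means swapping which pointer is checked out.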
Here's the thing most tutorials miss: you can get meaningful value from DVC with exactly three commands. No need to learn the entire ecosystem on day one.
Why Git Alone Fails for ML Data
Continue reading the full article on TildAlice
