In the last article we talked about why ML is harder than regular software: code, data and environment all move at the same time. Today we're tackling the second one, data, with a tool called DVC (Data Version Control).
Why not just use Git?
Git is amazing for code, but it was designed for small text files. The moment you commit a 2 GB CSV or a folder of 50,000 images, things get unpleasant fast: the repo balloons, git clone becomes a coffee break, and GitHub starts politely asking you to leave.
DVC solves this by being "Git for data": it stores tiny pointer files in your repo and pushes the actual heavy data to a separate storage backend (S3, GCS, an SSH server, even a local folder). You get versioning, branching and reproducibility, without bloating Git.
Step 1: Install DVC
pip install dvc
If you want S3 support, install the extra:
pip install "dvc[s3]"
Other backends like gs, azure, ssh work the same way — just swap the extra.
Step 2: Initialize DVC in your project
Let's start a tiny project:
mkdir mlops-demo && cd mlops-demo
git init
dvc init
git commit -m "Initialize DVC"
dvc init creates a .dvc/ folder (a bit like .git/) and a .dvcignore file. From now on, DVC and Git work side by side.
Step 3: Track your first dataset
Let's grab a small dataset to play with. We'll use the classic Iris CSV:
mkdir data
curl -L https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv -o data/iris.csv
Note (PowerShell users): in PowerShell, curl is an alias for Invoke-WebRequest, which doesn't accept the -L flag and will error with "A parameter cannot be found that matches parameter name 'L'". Use one of these instead:

# Option 1: call the real curl binary (ships with Windows 10+)
curl.exe -L https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv -o data/iris.csv

# Option 2: native PowerShell
Invoke-WebRequest -Uri https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv -OutFile data/iris.csv

curl.exe follows redirects by default, so -L is optional there.
Now tell DVC to track it:
dvc add data/iris.csv
DVC will:
- Move data/iris.csv into its cache (.dvc/cache/)
- Create a small pointer file, data/iris.csv.dvc
- Add iris.csv to data/.gitignore automatically, so Git ignores the raw file
Commit the pointer, not the data, to Git:
git add data/iris.csv.dvc data/.gitignore
git commit -m "Track iris dataset with DVC"
If you peek inside data/iris.csv.dvc, you'll see something like:
outs:
- md5: 1f8e3c...
size: 3858
hash: md5
path: iris.csv
That hash is the version of your data. Change one byte in the CSV, and the hash changes.
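A quick, self-contained sketch of the idea in plain Python (not DVC's actual code): hashing the file contents gives you an identifier that changes the moment the data does, which is exactly how the pointer file tracks versions.

```python
import hashlib

def md5_of(data: bytes) -> str:
    # DVC identifies each file version by a content hash (md5 by default)
    return hashlib.md5(data).hexdigest()

original = b"sepal_length,sepal_width\n5.1,3.5\n"
modified = b"sepal_length,sepal_width\n5.1,3.6\n"  # one byte changed

print(md5_of(original) == md5_of(modified))  # False: new content, new version
```

Two byte-identical files always hash to the same value, so DVC can also deduplicate: the same content is stored in the cache only once, no matter how many commits reference it.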
Step 4: Set up a remote storage
Right now, the data only lives on your machine. Let's push it somewhere others (or future-you on another laptop) can pull it from.
For a quick local test, you can use a folder as a "remote":
mkdir -p /tmp/dvc-storage
dvc remote add -d localremote /tmp/dvc-storage
git add .dvc/config
git commit -m "Configure DVC remote"
For real projects, swap that with S3 or similar:
dvc remote add -d s3remote s3://my-bucket/dvc-storage
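If the bucket isn't already covered by your default AWS credentials, DVC can store credentials per remote; the --local flag writes them to .dvc/config.local, which DVC keeps out of Git so secrets never land in the repo. The key values below are placeholders:

```shell
# Placeholders — substitute your own credentials
dvc remote modify --local s3remote access_key_id 'AKIA...'
dvc remote modify --local s3remote secret_access_key 'your-secret-key'
```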
Then push the data:
dvc push
Step 5: The reproducibility test
This is the moment that makes DVC click. Let's pretend you're a teammate cloning the repo for the first time:
cd /tmp
git clone /path/to/mlops-demo fresh-clone
cd fresh-clone
ls data/
# Only iris.csv.dvc — the actual CSV is missing!
dvc pull
ls data/
# iris.csv is back, byte-for-byte identical
You just versioned a dataset alongside your code, without committing it to Git. 🎉
Step 6: Updating the dataset
Real data changes. Let's simulate that:
echo "6.0,3.0,4.5,1.5,versicolor" >> data/iris.csv
dvc add data/iris.csv
git add data/iris.csv.dvc
git commit -m "Add new sample to iris dataset"
dvc push
The pointer file's hash has changed. If you ever need the old version of the data, just git checkout an older commit and run dvc pull: DVC fetches exactly the dataset that matched that commit. Time travel for data.
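As a concrete sketch, restoring just the previous version of the dataset could look like this (assuming the change was the most recent commit):

```shell
# Grab the old pointer file from the previous commit
git checkout HEAD~1 -- data/iris.csv.dvc

# Sync the workspace data to match the pointer (served from the local cache)
dvc checkout data/iris.csv.dvc

# If that version isn't in the local cache, fetch it from the remote instead
dvc pull data/iris.csv.dvc

# Return to the latest version when you're done
git checkout HEAD -- data/iris.csv.dvc && dvc checkout data/iris.csv.dvc
```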
Why this matters
With this in place, you can finally answer the question "which data produced that model?" with a Git commit hash. That's a huge upgrade.
In the next article, we'll add the second piece of the puzzle: experiment tracking with MLflow, so we never again lose track of which hyperparameters and which data produced which metric.
Stay tuned and have fun! 🥰