DEV Community

Mohamed Arbi

1minMLOps #2: Versioning your data with DVC

In the last article, we talked about why ML is harder than regular software: code, data and environment all move at the same time. Today we're tackling the second one, data, with a tool called DVC (Data Version Control).

Why not just use Git?

Git is amazing for code, but it was designed for small text files. The moment you commit a 2 GB CSV or a folder of 50,000 images, things get unpleasant fast: the repo balloons, git clone becomes a coffee break, and GitHub, which blocks files larger than 100 MB, starts politely asking you to leave.

DVC solves this by being "Git for data": it stores tiny pointer files in your repo and pushes the actual heavy data to a separate storage backend (S3, GCS, an SSH server, even a local folder). You get versioning, branching and reproducibility, without bloating Git.

Step 1: Install DVC

pip install dvc

If you want S3 support, install the extra:

pip install "dvc[s3]"

Other backends like gs, azure, ssh work the same way — just swap the extra.

Step 2: Initialize DVC in your project

Let's start a tiny project:

mkdir mlops-demo && cd mlops-demo
git init
dvc init
git commit -m "Initialize DVC"

dvc init creates a .dvc/ folder (a bit like .git/) and a .dvcignore file, and stages them in Git for you, which is why the commit above needs no git add. From now on, DVC and Git work side by side.

Step 3: Track your first dataset

Let's grab a small dataset to play with. We'll use the classic Iris CSV:

mkdir data
curl -L https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv -o data/iris.csv

Note (PowerShell users): in PowerShell, curl is an alias for Invoke-WebRequest, which doesn't accept the -L flag and will error with A parameter cannot be found that matches parameter name 'L'. Use one of these instead:

# Option 1: call the real curl binary (ships with Windows 10+)
curl.exe -L https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv -o data/iris.csv

# Option 2: native PowerShell
Invoke-WebRequest -Uri https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv -OutFile data/iris.csv

Invoke-WebRequest follows redirects by default, so it needs no equivalent of -L. curl.exe does not, so keep the -L there.

Now tell DVC to track it:

dvc add data/iris.csv

DVC will:

  1. Move data/iris.csv into its cache (.dvc/cache/) and link or copy it back, so the file still appears in data/
  2. Create a small pointer file data/iris.csv.dvc
  3. Add iris.csv to data/.gitignore automatically
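
The three steps above can be imitated with ordinary shell tools, which may make them feel less magical. This is only a rough sketch of the idea, not what DVC actually runs. It assumes GNU md5sum is available and works in a throwaway temp directory. The files/md5/<first two hex characters>/<rest> cache layout is how DVC 3.x appears to organize its cache, so treat that as an assumption, and the sample CSV contents are made up.

```shell
set -eu

# Work in a throwaway directory so nothing real is touched.
workdir="$(mktemp -d)"
cd "$workdir"
mkdir -p data .dvc/cache
printf 'sepal_length,species\n5.1,setosa\n' > data/iris.csv   # made-up sample data

# 1. Hash the file and store a copy in a content-addressed cache.
hash="$(md5sum data/iris.csv | cut -d' ' -f1)"
prefix="$(printf '%s' "$hash" | cut -c1-2)"
rest="$(printf '%s' "$hash" | cut -c3-)"
cache_path=".dvc/cache/files/md5/$prefix/$rest"   # assumed layout, see note above
mkdir -p "$(dirname "$cache_path")"
cp data/iris.csv "$cache_path"

# 2. Write a small pointer file that records the hash, size and path.
size="$(wc -c < data/iris.csv | tr -d ' ')"
printf 'outs:\n- md5: %s\n  size: %s\n  hash: md5\n  path: iris.csv\n' \
  "$hash" "$size" > data/iris.csv.dvc

# 3. Keep the real data out of Git.
echo '/iris.csv' >> data/.gitignore

cat data/iris.csv.dvc
```

Because the cache path is derived from the content hash, two identical files would land in the same cache entry, and any edit to the file produces a new entry instead of overwriting the old one.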

Commit the pointer, not the data, to Git:

git add data/iris.csv.dvc data/.gitignore
git commit -m "Track iris dataset with DVC"

If you peek inside data/iris.csv.dvc, you'll see something like:

outs:
- md5: 1f8e3c...
  size: 3858
  hash: md5
  path: iris.csv

That hash is the version of your data. Change one byte in the CSV, and the hash changes.
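
To see that sensitivity for yourself without touching DVC, you can hash a file before and after a one-byte edit. This is a small sketch, assuming GNU md5sum is on your PATH (on macOS the equivalent command is md5). The file contents are made up for illustration.

```shell
set -eu

tmpfile="$(mktemp)"

# Original row.
printf '5.1,3.5,1.4,0.2,setosa\n' > "$tmpfile"
before="$(md5sum "$tmpfile" | cut -d' ' -f1)"

# Change a single byte: 5.1 becomes 5.2.
printf '5.2,3.5,1.4,0.2,setosa\n' > "$tmpfile"
after="$(md5sum "$tmpfile" | cut -d' ' -f1)"

echo "before: $before"
echo "after:  $after"
```

The two printed hashes differ, which is exactly the signal DVC uses to decide that a dataset has a new version.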

Step 4: Set up remote storage

Right now, the data only lives on your machine. Let's push it somewhere others (or future-you on another laptop) can pull it from.

For a quick local test, you can use a folder as a "remote":

mkdir -p /tmp/dvc-storage
dvc remote add -d localremote /tmp/dvc-storage
git add .dvc/config
git commit -m "Configure DVC remote"

For real projects, swap that with S3 or similar:

dvc remote add -d s3remote s3://my-bucket/dvc-storage

Then push the data:

dvc push

Step 5: The reproducibility test

This is the moment that makes DVC click. Let's pretend you're a teammate cloning the repo for the first time:

cd /tmp
git clone /path/to/mlops-demo fresh-clone
cd fresh-clone
ls data/
# Only iris.csv.dvc — the actual CSV is missing!

dvc pull
ls data/
# iris.csv is back, byte-for-byte identical

You just versioned a dataset alongside your code, without committing it to Git. 🎉
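
If you'd rather check the "byte-for-byte identical" claim than take it on faith, one rough approach is to compare the file's MD5 with the one recorded in the pointer. The helper below is hypothetical (matches_pointer is not a DVC command). It assumes a single-file pointer in the format shown earlier, that hash: md5 means a plain MD5 of the file contents, and that GNU md5sum is installed.

```shell
# Hypothetical helper: succeed if the data file's MD5 equals the md5 in its .dvc pointer.
matches_pointer() {
  data_file="$1"
  pointer_file="$2"
  # Pull the value out of the "- md5: <hash>" line of the pointer file.
  recorded="$(sed -n 's/^- md5: *//p' "$pointer_file" | head -n 1)"
  # Hash the actual data file.
  actual="$(md5sum "$data_file" | cut -d' ' -f1)"
  [ -n "$recorded" ] && [ "$recorded" = "$actual" ]
}

# Usage, from the root of the fresh clone after `dvc pull`:
#   matches_pointer data/iris.csv data/iris.csv.dvc && echo "data matches pointer"
```

The function returns a non-zero status when the hashes differ or the pointer has no md5 line, so it can sit in a CI script as a cheap sanity check.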

Step 6: Updating the dataset

Real data changes. Let's simulate that:

echo "6.0,3.0,4.5,1.5,versicolor" >> data/iris.csv
dvc add data/iris.csv
git add data/iris.csv.dvc
git commit -m "Add new sample to iris dataset"
dvc push

The pointer file's hash has been updated. If you ever need the old version of the data, just git checkout an older commit and run dvc pull: DVC fetches the dataset that matched that commit. Time travel for data.
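
Spelled out, that time travel might look like the sketch below. restore_data_version is a hypothetical helper name, not part of DVC or Git. It restores only the pointer file from an older revision and leaves the rest of your working tree alone, which is a slightly narrower variant of checking out the whole commit. It assumes you run it from the project root with the remote from Step 4 configured.

```shell
# Hypothetical helper: bring back the dataset as it was at a given Git revision.
restore_data_version() {
  rev="$1"   # e.g. HEAD~1 or a commit hash
  # Restore only the pointer file from that revision...
  git checkout "$rev" -- data/iris.csv.dvc &&
  # ...then let DVC fetch the data matching that pointer into the workspace.
  dvc pull data/iris.csv.dvc
}

# Usage, from the project root:
#   restore_data_version HEAD~1
```

Afterwards, git status shows the pointer file as changed. Commit it if you want to stay on the old data, or run restore_data_version HEAD to return to the latest version.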

Why this matters

With this in place, you can finally answer the question "which data produced that model?" with a Git commit hash. That's a huge upgrade.

In the next article, we'll add the second piece of the puzzle: experiment tracking with MLflow, so we never again lose track of which hyperparameters and which data produced which metric.

Stay tuned and have fun! 🥰
