1minMLOps #2: Versioning your data with DVC

Mohamed Arbi — Fri, 08 May 2026 12:38:00 +0000

In the last article, we talked about why ML is harder than regular software: code, data and environment all move at the same time. Today we're tackling the second one, data, with a tool called DVC (Data Version Control).

Why not just use Git?

Git is amazing for code, but it was designed for small text files. The moment you commit a 2 GB CSV or a folder of 50,000 images, things get unpleasant fast: the repo balloons, git clone becomes a coffee break, and GitHub starts politely asking you to leave.

DVC solves this by being "Git for data": it stores tiny pointer files in your repo and pushes the actual heavy data to a separate storage backend (S3, GCS, an SSH server, even a local folder). You get versioning, branching and reproducibility, without bloating Git.

Step 1: Install DVC

pip install dvc

If you want S3 support, install the extra:

pip install "dvc[s3]"

Other backends like gs, azure, ssh work the same way — just swap the extra.

Step 2: Initialize DVC in your project

Let's start a tiny project:

mkdir mlops-demo && cd mlops-demo
git init
dvc init
git commit -m "Initialize DVC"

dvc init creates a .dvc/ folder (a bit like .git/) and a .dvcignore file. From now on, DVC and Git work side by side.

Step 3: Track your first dataset

Let's grab a small dataset to play with. We'll use the classic Iris CSV:

mkdir data
curl -L https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv -o data/iris.csv

Note (PowerShell users): in Windows PowerShell 5.1, curl is an alias for Invoke-WebRequest, which doesn't accept the -L flag and fails with "A parameter cannot be found that matches parameter name 'L'". (PowerShell 7+ no longer ships that alias.) Use one of these instead:

# Option 1: call the real curl binary (ships with Windows 10+)
curl.exe -L https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv -o data/iris.csv

# Option 2: native PowerShell
Invoke-WebRequest -Uri https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv -OutFile data/iris.csv
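
Invoke-WebRequest follows redirects by default, so Option 2 needs no equivalent of -L. curl.exe does not follow redirects unless you pass -L, so keep the flag in Option 1.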

Now tell DVC to track it:

dvc add data/iris.csv

DVC will:

Store the contents of data/iris.csv in its cache (.dvc/cache/) and link the file back into your workspace
Create a small pointer file data/iris.csv.dvc
Add iris.csv to data/.gitignore automatically, so Git ignores the real file
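
If you like seeing ideas as code, here is a minimal Python sketch of the first two steps. It is a toy model, not DVC's actual implementation: the helper name toy_add, the cache layout and the way the pointer is written are all assumptions made for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def toy_add(data_file: Path, cache_dir: Path) -> Path:
    """Toy model of `dvc add` (hypothetical helper, not DVC's real code)."""
    # 1. Hash the file contents; the hash becomes the object's address.
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()

    # 2. Copy the contents into the cache under a hash-derived path.
    #    (Illustrative layout: first two hex characters as a sub-folder.)
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copyfile(data_file, cached)

    # 3. Write a small pointer file next to the data.
    pointer = data_file.with_name(data_file.name + ".dvc")
    pointer.write_text(
        "outs:\n"
        f"- md5: {digest}\n"
        f"  size: {data_file.stat().st_size}\n"
        "  hash: md5\n"
        f"  path: {data_file.name}\n"
    )
    return pointer


# Try it on a throwaway file so nothing in your project is touched.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "iris.csv"
    sample.write_bytes(b"5.1,3.5,1.4,0.2,setosa\n")
    print(toy_add(sample, Path(tmp) / "cache").read_text())
```

The pointer is tiny no matter how big the data is, which is why it is safe to commit to Git.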

Commit the pointer, not the data, to Git:

git add data/iris.csv.dvc data/.gitignore
git commit -m "Track iris dataset with DVC"

If you peek inside data/iris.csv.dvc, you'll see something like:

outs:
- md5: 1f8e3c...
  size: 3858
  hash: md5
  path: iris.csv

That hash is the version of your data. Change one byte in the CSV, and the hash changes.
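
You can convince yourself of this with a few lines of Python. This sketch assumes the md5 field is a plain MD5 of the raw file bytes (which is my understanding of how recent DVC versions hash single files), and the two CSV rows are made-up sample content.

```python
import hashlib


def md5_of(content: bytes) -> str:
    """Return the hex MD5 digest of some bytes."""
    return hashlib.md5(content).hexdigest()


original = b"5.1,3.5,1.4,0.2,setosa\n"
edited = b"5.1,3.5,1.4,0.2,Setosa\n"  # one byte changed: s -> S

print(md5_of(original))
print(md5_of(edited))
print(md5_of(original) == md5_of(edited))  # False: different content, different version
```

Because the hash depends only on the content, two people with the same file always get the same "version", whatever the file is called.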

Step 4: Set up a remote storage

Right now, the data only lives on your machine. Let's push it somewhere others (or future-you on another laptop) can pull it from.

For a quick local test, you can use a folder as a "remote":

mkdir -p /tmp/dvc-storage
dvc remote add -d localremote /tmp/dvc-storage
git add .dvc/config
git commit -m "Configure DVC remote"

For real projects, swap that with S3 or similar:

dvc remote add -d s3remote s3://my-bucket/dvc-storage

Then push the data:

dvc push
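
Conceptually, a push copies the cached objects that the remote doesn't have yet. Here is a toy Python sketch of that idea using two local folders. The function toy_push is hypothetical: the real dvc push only considers data referenced by your project and knows how to talk to S3, SSH and friends.

```python
import shutil
import tempfile
from pathlib import Path


def toy_push(cache_dir: Path, remote_dir: Path) -> list[str]:
    """Copy cached objects the remote is missing; return what was sent."""
    sent = []
    for obj in sorted(cache_dir.rglob("*")):
        if not obj.is_file():
            continue
        relative = obj.relative_to(cache_dir)
        target = remote_dir / relative
        if target.exists():
            continue  # same hash-derived path means same content: skip it
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(obj, target)
        sent.append(relative.as_posix())
    return sent


# Demo on throwaway folders with one fake cached object.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    remote = Path(tmp) / "remote"
    (cache / "1f").mkdir(parents=True)
    (cache / "1f" / "8e3c").write_bytes(b"pretend this is a dataset")
    print(toy_push(cache, remote))  # first push sends the object
    print(toy_push(cache, remote))  # second push has nothing left to send
```

Since objects are addressed by hash, pushing the same data twice costs nothing the second time.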

Step 5: The reproducibility test

This is the moment that makes DVC click. Let's pretend you're a teammate cloning the repo for the first time:

cd /tmp
git clone /path/to/mlops-demo fresh-clone
cd fresh-clone
ls data/
# Only iris.csv.dvc — the actual CSV is missing!

dvc pull
ls data/
# iris.csv is back, byte-for-byte identical

You just versioned a dataset alongside your code, without committing it to Git. 🎉
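
If you want to check the "byte-for-byte identical" claim yourself, here is a small Python sketch that compares a file's MD5 with the one recorded in its pointer. The helpers read_pointer_md5 and matches_pointer are hypothetical names, and the hand-rolled parsing assumes the simple single-output pointer shown in Step 3 (a real YAML parser would be more robust).

```python
import hashlib
from pathlib import Path


def read_pointer_md5(pointer_file: Path) -> str:
    """Pull the md5 value out of a simple single-output .dvc file."""
    for line in pointer_file.read_text().splitlines():
        # "- md5: abc" and "  md5: abc" both become key="md5", value=" abc"
        key, _, value = line.strip().lstrip("- ").partition(":")
        if key == "md5":
            return value.strip()
    raise ValueError(f"no md5 entry found in {pointer_file}")


def matches_pointer(data_file: Path, pointer_file: Path) -> bool:
    """True if the file's MD5 equals the hash recorded in its pointer."""
    actual = hashlib.md5(data_file.read_bytes()).hexdigest()
    return actual == read_pointer_md5(pointer_file)


if __name__ == "__main__":
    # Run from the root of the cloned project, after `dvc pull`.
    data = Path("data/iris.csv")
    if data.exists():
        print(matches_pointer(data, Path("data/iris.csv.dvc")))
```

In day-to-day work you would let dvc status do this kind of comparison for you; the sketch is only here to show there is no magic involved.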

Step 6: Updating the dataset

Real data changes. Let's simulate that:

echo "6.0,3.0,4.5,1.5,versicolor" >> data/iris.csv
dvc add data/iris.csv
git add data/iris.csv.dvc
git commit -m "Add new sample to iris dataset"
dvc push

The hash in the pointer file has changed. If you ever need the old version of the data, just git checkout an older commit and run dvc pull: DVC fetches the dataset that matched that commit. Time travel for data.
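
Why does the old version survive? Because each version is stored under its own hash, so adding a new one never overwrites the previous one. Here is a toy in-memory model in Python: the two dictionaries stand in for the DVC storage and the Git history, and every name in it (including the commit labels) is made up for illustration.

```python
import hashlib

cache: dict[str, bytes] = {}   # stands in for the DVC cache / remote
commits: dict[str, str] = {}   # stands in for Git: commit label -> data hash


def commit_data(commit_label: str, content: bytes) -> None:
    """Store content under its hash and record which hash this commit uses."""
    digest = hashlib.md5(content).hexdigest()
    cache[digest] = content        # a new version never overwrites an old one
    commits[commit_label] = digest


def checkout_data(commit_label: str) -> bytes:
    """Return the data that belonged to a given commit."""
    return cache[commits[commit_label]]


v1 = b"5.1,3.5,1.4,0.2,setosa\n"
v2 = v1 + b"6.0,3.0,4.5,1.5,versicolor\n"

commit_data("track-iris", v1)
commit_data("add-sample", v2)

print(checkout_data("track-iris") == v1)  # True: the old version is still there
print(len(cache))                         # 2: both versions live side by side
```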

Why this matters

With this in place, you can finally answer the question "which data produced that model?" with a Git commit hash. That's a huge upgrade.

In the next article, we'll add the second piece of the puzzle: experiment tracking with MLflow, so we never again lose track of which hyperparameters and which data produced which metric.

Stay tuned and have fun! 🥰

1minMLOps #1: What is MLOps and why should you care?

Mohamed Arbi — Thu, 07 May 2026 15:08:56 +0000

If you've ever trained a beautiful model in a Jupyter notebook, watched the metrics shine, and then realized you have no idea how to actually put it in front of users, congratulations: you've just discovered why MLOps exists.

In this series, we are going to walk together from a notebook to a fully deployed, monitored and self-retraining ML system, one tiny step at a time. But before we write any code, let's get the foundations straight.

So, what is MLOps?

MLOps (short for Machine Learning Operations) is the set of practices, tools and culture that lets you ship machine learning models to production reliably and repeatedly. Think of it as DevOps' younger sibling: same spirit (automation, reproducibility, monitoring), but adapted to the weirdness of ML, where your code is not the only thing that changes. Your data changes, your model changes, and the world your model lives in changes too.

A useful way to picture it is the ML lifecycle:

1. Data collection & versioning: where does the data come from, and which version did we train on?
2. Experimentation: which features, which model, which hyperparameters?
3. Training & evaluation: does it actually work, and is it better than what we had?
4. Packaging: wrap the model in something deployable
5. Deployment: serve predictions to real users (batch or real-time)
6. Monitoring: is it still working? Did the data drift?
7. Retraining: close the loop and start again

Traditional software has steps 4–6. ML has all seven, and steps 1–3 keep coming back to haunt you.

Why "it works on my machine" is worse in ML

In classical software, if your code runs locally, it has a decent chance of running in production. In ML, that's a trap, because the model's behavior depends on three moving things, not one:

Code: the training script, the preprocessing, the inference logic
Data: the exact dataset (and its version) you trained on
Environment: Python version, library versions, CUDA versions, OS

Change any of these three and your "great model from Tuesday" becomes "mysterious garbage on Friday". This is why ML teams need stricter versioning, tracking and packaging discipline than most web teams.
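
To make the "three moving things" tangible, here is a small Python sketch that records all of them next to a training run, using only the standard library. The function name run_fingerprint, the placeholder commit hash and the package names are assumptions for illustration; the tools introduced later in the series do this far more thoroughly.

```python
import hashlib
import platform
import sys
from importlib import metadata


def run_fingerprint(data: bytes, code_version: str, packages: list[str]) -> dict:
    """Collect the three things a model depends on: code, data, environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "code": code_version,  # e.g. a Git commit hash, passed in by the caller
        "data_md5": hashlib.md5(data).hexdigest(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


fingerprint = run_fingerprint(
    data=b"5.1,3.5,1.4,0.2,setosa\n",   # made-up sample content
    code_version="abc1234",              # placeholder commit hash
    packages=["pip", "a-package-that-does-not-exist"],
)
print(fingerprint)
```

If any value in that dictionary differs between Tuesday and Friday, you know where to start looking.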

What problems does MLOps actually solve?

Concrete pains you'll feel without MLOps, and that we'll fix in this series:

"Which dataset gave us that 0.94 F1 score? Nobody remembers."
"The model works locally but crashes in the Docker container."
"We retrained the model and accuracy dropped, but we can't roll back."
"Production is silently degrading and we noticed two weeks later."
"Every deploy is a hand-crafted artisanal disaster."

Each of these has a tool and a workflow that solves it, and we are going to meet them (almost) one by one.

The MLOps stack we'll build

Here's a sneak peek of the tools we'll touch in the next articles:

DVC for data versioning
MLflow for experiment tracking and the model registry
FastAPI for serving
Docker for packaging (we'll lean a bit on Clelia's 1minDocker series here)
GitHub Actions for CI/CD
Evidently for monitoring data and model drift (we can use Prometheus and Grafana too)
A cloud provider (we'll pick one later) for actually deploying it all

Don't worry if some of these names sound intimidating: we'll introduce them gently, one per article, and always with a working example.

What you need to follow along

Nothing fancy:

Python 3.10+
git installed
A GitHub account
Docker installed (I highly recommend following this series: https://dev.to/astrabert/1mindocker-1-what-is-docker-3baa)
A laptop and ~1 minute per article 😉

In the next article, we'll get our hands dirty: we'll take a small dataset, version it with DVC, and finally answer the question "which data did we train on?" without crying.

Stay tuned and have fun!

DEV Community: Mohamed Arbi