
🎯 ML Done Right: Versioning Datasets and Models with DVC & MLflow

Introduction

Data versioning is a crucial aspect of Machine Learning (ML) workflows. It ensures that datasets are reproducible, traceable, and manageable throughout the ML lifecycle. Unlike traditional software development, where Git efficiently tracks code changes, ML workflows require specialized tools to version datasets, models, and metadata.

Two powerful tools for data versioning in ML pipelines are:

  • DVC (Data Version Control): Designed to manage large datasets efficiently while integrating seamlessly with Git.
  • MLflow: Focused on experiment tracking, model versioning, and lifecycle management.

In this article, we will explore how to set up DVC and MLflow for data versioning in a machine learning workflow using Python.

Why Do We Need Data Versioning?

Before diving into the implementation, let’s first understand why data versioning is essential:

  1. Reproducibility → Ensures that ML models can be recreated using the exact dataset used during training.
  2. Collaboration → Enables teams to share and work on different versions of datasets.
  3. Traceability → Keeps track of dataset changes, helping to identify issues in models.
  4. Rollback & Experimentation → Facilitates easy rollback to previous dataset versions for comparison.
  5. Storage Efficiency → Avoids redundancy by tracking only changes in datasets instead of duplicating entire files.

1️⃣ Setting Up DVC for Data Versioning

DVC is an open-source tool designed to handle large datasets efficiently while integrating seamlessly with Git.

Installation

To install DVC, run:

pip install dvc

If you’re using cloud storage (e.g., AWS S3, Google Drive, or Azure), install the appropriate extension:

pip install 'dvc[s3]'
pip install 'dvc[gdrive]'

Initialize DVC in a Project

Inside a Git-tracked ML project, initialize DVC:

git init
dvc init
git commit -m "Initialize DVC"

Adding and Versioning Data

Assume we have a dataset stored in data/:

dvc add data/

This command:

  • Creates data.dvc → a metadata file that tracks the dataset version.
  • Updates .gitignore so that the large data files are not committed to Git.

Now, commit these changes to Git:

git add data.dvc .gitignore
git commit -m "Track dataset with DVC"

Remote Storage Configuration

To store the dataset in cloud storage:

dvc remote add -d myremote s3://mybucket/path/
dvc push

The -d flag makes myremote the default remote, so dvc push uploads the dataset to S3. Other backends such as Google Drive, Azure, and SSH are also supported.

Restoring Previous Versions

To retrieve a previous dataset version:

git checkout <commit_id>
dvc pull

This ensures that data and models remain synchronized with the desired version.
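
If you prefer to stay inside Python, DVC also exposes a small API for reading data at a specific revision. Here is a minimal sketch, assuming a tracked file at data/train.csv and a Git tag v1.0 (both illustrative):

import dvc.api

# Open a tracked file as it existed at a given Git revision (commit, tag, or branch).
# "data/train.csv" and "v1.0" are placeholders for your own tracked file and revision.
with dvc.api.open("data/train.csv", rev="v1.0") as f:
    contents = f.read()

print(f"Loaded {len(contents)} characters from the v1.0 dataset")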

2️⃣ Using MLflow for Data Tracking

MLflow helps track datasets, experiments, and models during development.

Installation

pip install mlflow

Initializing MLflow Tracking

Start the MLflow tracking server:

mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlruns

This creates a local database (mlflow.db) to store ML experiments.
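
Training scripts then need to point at this server before logging anything. A small sketch, assuming the server is running locally on the default port (the experiment name is illustrative):

import mlflow

# Send runs to the local tracking server started above.
mlflow.set_tracking_uri("http://127.0.0.1:5000")

# Group related runs under a named experiment.
mlflow.set_experiment("data-versioning-demo")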

Logging Dataset Versions with MLflow

Modify your Python script to track dataset versions:

import mlflow

# Log a copy of the dataset directory as an artifact of a new MLflow run
def log_dataset_version(dataset_path):
    with mlflow.start_run():
        mlflow.log_artifact(dataset_path, artifact_path="datasets")
        print("Dataset version logged in MLflow")

log_dataset_version("data/")
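
For large datasets, copying the raw files into MLflow on every run can be wasteful. A lighter alternative is to log only the hash that DVC recorded for the dataset, so the exact version can later be restored with git checkout and dvc pull. A minimal sketch, assuming the dataset is tracked by data.dvc (field names can vary slightly between DVC versions):

import mlflow
import yaml

# Read the content hash that DVC stored for the tracked dataset.
with open("data.dvc") as f:
    dvc_meta = yaml.safe_load(f)
dataset_hash = dvc_meta["outs"][0]["md5"]

with mlflow.start_run():
    # Tag the run so it can be traced back to the exact dataset version.
    mlflow.set_tag("dvc_dataset_hash", dataset_hash)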

Tracking Experiments

When training an ML model, log key parameters and metrics:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Evaluate model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Log model and metrics
with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "random_forest_model")

    print(f"Model logged with accuracy: {accuracy}")
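
Once logged, the model can be reloaded from the tracking server for inference or comparison. A short sketch, where <run_id> is a placeholder for the run ID shown in the MLflow UI:

import mlflow.sklearn

# Load the model back from the artifact path used above.
loaded_model = mlflow.sklearn.load_model("runs:/<run_id>/random_forest_model")
predictions = loaded_model.predict(X_test)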

Retrieving Dataset Versions

To list previous dataset versions in MLflow:

mlflow artifacts list --run-id <run_id>
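
The same listing is available from Python through the tracking client, which is handy in notebooks and scripts. A minimal sketch, with <run_id> again a placeholder:

from mlflow.tracking import MlflowClient

client = MlflowClient()

# List the artifacts logged under the "datasets" path for a given run.
for artifact in client.list_artifacts("<run_id>", path="datasets"):
    print(artifact.path, artifact.file_size)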

3️⃣ Combining DVC and MLflow for a Complete Workflow

A best practice is to integrate DVC for dataset versioning and MLflow for experiment tracking.

🔄 Workflow Summary

  1. Store dataset using DVC → dvc add data/
  2. Push dataset to remote storage → dvc push
  3. Track dataset version in MLflow → mlflow.log_artifact("data/")
  4. Train ML models and log parameters in MLflow
  5. Retrieve dataset versions via DVC and experiments via MLflow.

πŸ› οΈ Example: End-to-End Pipeline

import dvc.api
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Resolve where DVC stores the tracked dataset (remote URL)
dataset_path = "data/"
dataset_url = dvc.api.get_url(dataset_path)

# Load dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Evaluate model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Log dataset version, parameters, metrics, and model in a single MLflow run
with mlflow.start_run():
    mlflow.log_artifact(dataset_path, artifact_path="datasets")
    mlflow.set_tag("dvc_dataset_url", dataset_url)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "random_forest_model")

    print(f"Model logged with accuracy: {accuracy}")

Conclusion

Data versioning is critical for:

✅ Reproducibility → Ensure consistent ML results.
✅ Collaboration → Manage dataset changes efficiently.
✅ Traceability → Keep track of dataset & model versions.

By integrating DVC and MLflow, you can create a scalable, reproducible, and traceable ML pipeline.

You can connect with me via email, LinkedIn, or Medium.
