
🎯 ML Done Right: Versioning Datasets and Models with DVC & MLflow

Introduction

Data versioning is a crucial aspect of Machine Learning (ML) workflows. It ensures that datasets are reproducible, traceable, and manageable throughout the ML lifecycle. Unlike traditional software development, where Git efficiently tracks code changes, ML workflows require specialized tools to version datasets, models, and metadata.

Two powerful tools for data versioning in ML pipelines are:

  • DVC (Data Version Control): Designed to manage large datasets efficiently while integrating seamlessly with Git.
  • MLflow: Focused on experiment tracking, model versioning, and lifecycle management.

In this article, we will explore how to set up DVC and MLflow for data versioning in a machine learning workflow using Python.

Why Do We Need Data Versioning?

Before diving into the implementation, let’s first understand why data versioning is essential:

  1. Reproducibility → Ensures that ML models can be recreated using the exact dataset used during training.
  2. Collaboration → Enables teams to share and work on different versions of datasets.
  3. Traceability → Keeps track of dataset changes, helping to identify issues in models.
  4. Rollback & Experimentation → Facilitates easy rollback to previous dataset versions for comparison.
  5. Storage Efficiency → Avoids redundancy by tracking only changes in datasets instead of duplicating entire files.

1️⃣ Setting Up DVC for Data Versioning

DVC is an open-source tool designed to handle large datasets efficiently while integrating seamlessly with Git.

Installation

To install DVC, run:

pip install dvc

If you’re using cloud storage (e.g., AWS S3, Google Drive, or Azure), install the appropriate extension:

pip install 'dvc[s3]'
pip install 'dvc[gdrive]'

Initialize DVC in a Project

Inside a Git-tracked ML project, initialize DVC:

git init
dvc init
git commit -m "Initialize DVC"

Adding and Versioning Data

Assume we have a dataset stored in data/:

dvc add data/

This command:

  • Creates data.dvc → a metadata file that tracks the dataset version.
  • Updates .gitignore so that the large data files are not committed to Git.

Now, commit these changes to Git:

git add data.dvc .gitignore
git commit -m "Track dataset with DVC"

Remote Storage Configuration

To store the dataset in cloud storage:

dvc remote add -d myremote s3://mybucket/path/
dvc push

The -d flag makes myremote the default remote, so dvc push uploads the dataset to S3. Other backends such as Google Drive, Azure, and SSH are also supported.

Restoring Previous Versions

To retrieve a previous dataset version:

git checkout <commit_id>
dvc pull

This ensures that data and models remain synchronized with the desired version.
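
If you prefer to stay inside Python, DVC also exposes a small API for reading data at a specific revision. Here is a minimal sketch, assuming a tracked file at data/train.csv and a Git tag v1.0 (both illustrative):

import dvc.api

# Open a tracked file as it existed at a given Git revision (commit, tag, or branch).
# "data/train.csv" and "v1.0" are placeholders for your own tracked file and revision.
with dvc.api.open("data/train.csv", rev="v1.0") as f:
    contents = f.read()

print(f"Loaded {len(contents)} characters from the v1.0 dataset")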

2️⃣ Using MLflow for Data Tracking

MLflow helps track datasets, experiments, and models during development.

Installation

pip install mlflow

Initializing MLflow Tracking

Start the MLflow tracking server:

mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlruns

This creates a local database (mlflow.db) to store ML experiments.
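
Training scripts then need to point at this server before logging anything. A small sketch, assuming the server is running locally on the default port (the experiment name is illustrative):

import mlflow

# Send runs to the local tracking server started above.
mlflow.set_tracking_uri("http://127.0.0.1:5000")

# Group related runs under a named experiment.
mlflow.set_experiment("data-versioning-demo")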

Logging Dataset Versions with MLflow

Modify your Python script to track dataset versions:

import mlflow

# Log a copy of the dataset directory as an artifact of a new MLflow run
def log_dataset_version(dataset_path):
    with mlflow.start_run():
        mlflow.log_artifact(dataset_path, artifact_path="datasets")
        print("Dataset version logged in MLflow")

log_dataset_version("data/")
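
For large datasets, copying the raw files into MLflow on every run can be wasteful. A lighter alternative is to log only the hash that DVC recorded for the dataset, so the exact version can later be restored with git checkout and dvc pull. A minimal sketch, assuming the dataset is tracked by data.dvc (field names can vary slightly between DVC versions):

import mlflow
import yaml

# Read the content hash that DVC stored for the tracked dataset.
with open("data.dvc") as f:
    dvc_meta = yaml.safe_load(f)
dataset_hash = dvc_meta["outs"][0]["md5"]

with mlflow.start_run():
    # Tag the run so it can be traced back to the exact dataset version.
    mlflow.set_tag("dvc_dataset_hash", dataset_hash)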

Tracking Experiments

When training an ML model, log key parameters and metrics:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Evaluate model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Log model and metrics
with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "random_forest_model")

    print(f"Model logged with accuracy: {accuracy}")
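
Once logged, the model can be reloaded from the tracking server for inference or comparison. A short sketch, where <run_id> is a placeholder for the run ID shown in the MLflow UI:

import mlflow.sklearn

# Load the model back from the artifact path used above.
loaded_model = mlflow.sklearn.load_model("runs:/<run_id>/random_forest_model")
predictions = loaded_model.predict(X_test)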

Retrieving Dataset Versions

To list previous dataset versions in MLflow:

mlflow artifacts list --run-id <run_id>
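
The same listing is available from Python through the tracking client, which is handy in notebooks and scripts. A minimal sketch, with <run_id> again a placeholder:

from mlflow.tracking import MlflowClient

client = MlflowClient()

# List the artifacts logged under the "datasets" path for a given run.
for artifact in client.list_artifacts("<run_id>", path="datasets"):
    print(artifact.path, artifact.file_size)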

3️⃣ Combining DVC and MLflow for a Complete Workflow

A best practice is to integrate DVC for dataset versioning and MLflow for experiment tracking.

🔄 Workflow Summary

  1. Store dataset using DVC → dvc add data/
  2. Push dataset to remote storage → dvc push
  3. Track dataset version in MLflow → mlflow.log_artifact("data/")
  4. Train ML models and log parameters in MLflow
  5. Retrieve dataset versions via DVC and experiments via MLflow.

πŸ› οΈ Example: End-to-End Pipeline

import dvc.api
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Resolve where DVC stores the tracked dataset (remote URL)
dataset_path = "data/"
dataset_url = dvc.api.get_url(dataset_path)

# Load dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Evaluate model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Log dataset version, parameters, metrics, and model in a single MLflow run
with mlflow.start_run():
    mlflow.log_artifact(dataset_path, artifact_path="datasets")
    mlflow.set_tag("dvc_dataset_url", dataset_url)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "random_forest_model")

    print(f"Model logged with accuracy: {accuracy}")

Conclusion

Data versioning is critical for:

✅ Reproducibility → Ensure consistent ML results.
✅ Collaboration → Manage dataset changes efficiently.
✅ Traceability → Keep track of dataset & model versions.

By integrating DVC and MLflow, you can create a scalable, reproducible, and traceable ML pipeline.

You can connect with me via email, LinkedIn, or Medium.
