In this AI-driven world, managing experiments and data versions is just as important as the model architecture itself. This guide covers the essentials of Data Version Control (DVC) and MLflow, the industry-standard tools for reproducible machine learning.
Index
- Introduction to DVC and MLflow
- DVC Installation and Initialization
- Versioning Data with DVC
- DVC Maintenance: Status and Cleanup
- DVC Pipelines and Automation
- Integrating DVC with AWS S3
- MLflow Installation and Tracking UI
- Logging Experiments and Artifacts
- MLflow Model Registry
- Remote Tracking with DagsHub and AWS
- Unified Workflow: Putting it All Together
Introduction to DVC and MLflow
MLOps is the practice of applying DevOps principles to Machine Learning. Two critical pillars of MLOps are:
- Data Versioning (DVC): Git is great for code but fails with large datasets. DVC allows you to version data, models, and pipelines just like you version code.
- Experiment Tracking (MLflow): Keeping track of hyperparameters, metrics, and model versions across hundreds of runs.
DVC Installation and Initialization
DVC is built on top of Git and works seamlessly with it.
Installation:
pip install dvc
Initialization:
Run this in the root of your git repository:
dvc init
This creates a .dvc/ directory to store internal configurations. You should commit the changes made by dvc init to your git repository.
Versioning Data with DVC
DVC does not store the data itself in Git. Instead, it creates small .dvc files that act as placeholders/pointers to the actual data.
1. Adding Data:
dvc add data/raw_data.csv
This moves the actual file to DVC's cache and creates a data/raw_data.csv.dvc file.
2. Tracking changes in Git:
git add data/raw_data.csv.dvc .gitignore
git commit -m "Add raw data tracking"
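The pointer file committed to Git is tiny; the exact fields vary by DVC version, but it looks roughly like this (hash and size below are illustrative, not real values):

```yaml
# data/raw_data.csv.dvc -- committed to Git in place of the real file
outs:
  - md5: a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6   # content hash of the tracked file
    size: 1048576                            # file size in bytes
    path: raw_data.csv                       # actual data lives in the DVC cache
```

Because Git only sees this small YAML file, the repository stays light no matter how large the dataset grows.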
3. Pushing to Remote:
Similar to git push, you can push your data to a remote storage (S3, GCS, Azure, or local).
dvc push
DVC Maintenance: Status and Cleanup
As your project grows, you need to manage your local data workspace and the DVC cache.
1. Checking Status:
To see if your data has changed or if you need to push/pull:
dvc status
2. Cleaning the Cache:
DVC stores versions of your data in a hidden .dvc/cache folder. If you run out of disk space, you can remove old, unused versions:
dvc gc --workspace # Garbage collect: removes cached data not referenced by the current workspace
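Under the hood, `dvc status` decides whether data changed by comparing content hashes, not timestamps. This is a minimal stdlib sketch of that idea, not DVC's actual implementation (which uses its own cache layout):

```python
import hashlib
from pathlib import Path

def file_md5(path: Path) -> str:
    """Hash a file's contents in chunks, the way DVC fingerprints tracked data."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def has_changed(path: Path, recorded_hash: str) -> bool:
    """Compare the current content hash against the one recorded earlier."""
    return file_md5(path) != recorded_hash

# Usage: record a hash, modify the file, and detect the change.
data = Path("raw_data.csv")
data.write_text("id,value\n1,10\n")
recorded = file_md5(data)
data.write_text("id,value\n1,10\n2,20\n")
print(has_changed(data, recorded))  # True: content differs from the recorded hash
```

Note that touching a file without changing its bytes produces the same hash, which is why DVC can safely skip unchanged data.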
DVC Pipelines and Automation
DVC allows you to define "stages" of your ML pipeline in a dvc.yaml file. This ensures that if you change a script or a dataset, only the affected parts of the pipeline are rerun.
Adding a Stage:
dvc stage add -n data_ingestion \
-d src/data_ingestion.py -d data/raw \
-o data/processed \
python src/data_ingestion.py
How does DVC 'know' what is produced?
DVC does not magically know what your script does. You define the Contract using the -o (outputs) flag.
- You tell DVC: "If I run `python train.py`, it will produce `model.pkl`."
- DVC stores this "contract" in `dvc.yaml`.
- After the script runs, DVC checks that `model.pkl` exists. If it does, DVC calculates its hash and starts versioning it.
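The `dvc stage add` command shown above writes this contract into `dvc.yaml`. For the `data_ingestion` stage it generates roughly:

```yaml
stages:
  data_ingestion:
    cmd: python src/data_ingestion.py
    deps:
      - src/data_ingestion.py
      - data/raw
    outs:
      - data/processed
```

If any `deps` entry changes, `dvc repro` reruns the stage and re-hashes the `outs`; otherwise the stage is skipped.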
dvc repro vs. python script.py
| Feature | `python src/train.py` | `dvc repro` |
|---|---|---|
| Execution | Runs the script immediately. | Runs only if dependencies have changed. |
| Tracking | MLflow logs results, but DVC doesn't track files. | MLflow logs results AND DVC tracks data/models. |
| Efficiency | Always runs (wastes time if data is unchanged). | Skips work if results are already cached. |
| Automation | Manual. | Can run an entire 10-step pipeline with one command. |
Running the Pipeline:
dvc repro
This command checks the dependencies and only runs the stages that have changed.
Integrating DVC with AWS S3
For professional projects, data is usually stored in the cloud.
1. Setup AWS Credentials:
pip install dvc[s3] awscli
aws configure
2. Add S3 Remote:
dvc remote add -d myremote s3://my-mlops-bucket/dvcstore
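The `dvc remote add -d` command records the remote in `.dvc/config` (an INI-style file), which you commit so teammates share the same remote:

```ini
[core]
    remote = myremote
['remote "myremote"']
    url = s3://my-mlops-bucket/dvcstore
```

Credentials themselves stay out of this file; DVC picks them up from the standard AWS credential chain configured via `aws configure`.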
3. Push and Pull:
dvc push # Uploads data to S3
dvc pull # Downloads data from S3 (on a new machine)
MLflow Installation and Tracking UI
MLflow is used to track the "metadata" of your experiments.
Installation:
pip install mlflow
Starting the UI:
mlflow ui
By default, this opens at http://127.0.0.1:5000.
Tip: The default file-based store has limitations (for example, it cannot power the Model Registry), so back the UI with a SQLite database:
mlflow ui --backend-store-uri sqlite:///mlflow.db
Logging Experiments and Artifacts
Within your Python code, you can log everything from parameters to the final model pickle file.
Example Code:
import mlflow

# Set the experiment name
mlflow.set_experiment("Heart_Disease_Classification")

with mlflow.start_run():
    # Log Parameters (Hyperparameters)
    mlflow.log_param("n_estimators", 100)

    # Log Metrics (Results)
    mlflow.log_metric("accuracy", 0.95)

    # Log Artifacts (Plots, JSONs, code)
    mlflow.log_artifact("plots/confusion_matrix.png")

    # Log the Model itself (my_model is your trained estimator)
    mlflow.sklearn.log_model(my_model, "random_forest_model")
Autologging:
For many libraries (Scikit-Learn, TensorFlow, XGBoost), you can log everything automatically with one line:
mlflow.sklearn.autolog()
MLflow Model Registry
The Model Registry is a centralized model store where you can manage the lifecycle of your models.
Note for MLflow 2.9+: MLflow is moving away from fixed Stages (Staging, Production) in favor of Aliases and Tags. This provides more flexibility for complex deployment workflows.
1. Registering a Model:
In your code, you can register a model after logging it:
mlflow.sklearn.log_model(my_model, "artifact_path", registered_model_name="MyBestModel")
2. Managing with Aliases (The New Way):
Instead of transitioning to "Production", you now assign an Alias like @champion or @prod to a specific version.
from mlflow import MlflowClient
client = MlflowClient()
# Assign the 'prod' alias to version 1
client.set_registered_model_alias("MyBestModel", "prod", "1")
This allows your deployment scripts to fetch the model via models:/MyBestModel@prod without worrying about the version number.
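The value of aliases is indirection: deployment code asks for a name, and the registry resolves it to whichever version currently holds the alias. Here is a toy Python sketch of that idea (a plain dict standing in for the real MLflow registry; the names and paths are made up for illustration):

```python
# Toy model registry: versions are immutable, aliases are movable pointers.
registry = {
    "MyBestModel": {
        "versions": {"1": "s3://bucket/models/v1", "2": "s3://bucket/models/v2"},
        "aliases": {},
    }
}

def set_alias(name: str, alias: str, version: str) -> None:
    """Point an alias at a specific model version (like set_registered_model_alias)."""
    registry[name]["aliases"][alias] = version

def resolve(uri: str) -> str:
    """Resolve a 'models:/Name@alias' style URI to a storage path."""
    name, alias = uri.removeprefix("models:/").split("@")
    version = registry[name]["aliases"][alias]
    return registry[name]["versions"][version]

set_alias("MyBestModel", "prod", "1")
print(resolve("models:/MyBestModel@prod"))  # s3://bucket/models/v1

# Promote version 2 without touching any deployment code:
set_alias("MyBestModel", "prod", "2")
print(resolve("models:/MyBestModel@prod"))  # s3://bucket/models/v2
```

Swapping which version an alias points at is exactly how a rollback or promotion works: the deployment script keeps asking for `@prod` and never needs to change.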
3. Using Tags:
Tags are key-value pairs used to store metadata about a model version (e.g., validation_status: "passed").
Remote Tracking with [DagsHub](https://dagshub.com) and AWS
Running MLflow locally is fine for individual work, but teams need a remote tracking server.
Using DagsHub:
DagsHub provides a hosted MLflow server for every repository.
- Connect your GitHub repo to DagsHub.
- Install the library: `pip install dagshub`.
- Add this to your script:
import dagshub
import mlflow
dagshub.init(repo_owner='username', repo_name='my-repo', mlflow=True)
Using AWS (EC2 + S3):
- EC2: Hosts the MLflow tracking server (metadata).
- S3: Acts as the "artifact store" for models and plots.
- Command to start server on EC2:
mlflow server --host 0.0.0.0 --backend-store-uri sqlite:///mlflow.db --default-artifact-root s3://my-bucket/artifacts
Unified Workflow: Putting it All Together
In a professional MLOps project, you use both tools in tandem:
- DVC: Fetch the specific data version (`dvc pull`).
- Code: Run the training script with MLflow logging parameters and metrics.
- MLflow UI: Compare runs and find the best hyperparameters.
- DVC: Version the resulting `best_model.pkl` (`dvc add best_model.pkl`).
- Git: Commit the `.dvc` files and training code (`git commit`).
- MLflow Registry: Assign the `@prod` or `@champion` alias to the best model version.
- CI/CD: Deploy the model by referencing its alias (e.g., `models:/MyBestModel@prod`).