Preyum Kumar
Mastering DVC and MLflow for MLOps: A Practical Guide

In this AI-driven world, managing experiments and data versions is just as important as the model architecture itself. This guide covers the essentials of Data Version Control (DVC) and MLflow, the industry-standard tools for reproducible machine learning.

Index

  1. Introduction to DVC and MLflow
  2. DVC Installation and Initialization
  3. Versioning Data with DVC
  4. DVC Maintenance: Status and Cleanup
  5. DVC Pipelines and Automation
  6. Integrating DVC with AWS S3
  7. MLflow Installation and Tracking UI
  8. Logging Experiments and Artifacts
  9. MLflow Model Registry
  10. Remote Tracking with DagsHub and AWS
  11. Unified Workflow: Putting it All Together

Introduction to DVC and MLflow

MLOps is the practice of applying DevOps principles to Machine Learning. Two critical pillars of MLOps are:

  • Data Versioning (DVC): Git excels at code but struggles with large binary files. DVC lets you version data, models, and pipelines just like you version code.
  • Experiment Tracking (MLflow): MLflow records hyperparameters, metrics, artifacts, and model versions across hundreds of runs so you can compare and reproduce them.

DVC Installation and Initialization

DVC works alongside Git and integrates with it seamlessly.

Installation:

pip install dvc

Initialization:
Run this in the root of your git repository:

dvc init

This creates a .dvc/ directory to store internal configurations. You should commit the changes made by dvc init to your git repository.

Versioning Data with DVC

DVC does not store the data itself in Git. Instead, it creates small .dvc files that act as placeholders/pointers to the actual data.

1. Adding Data:

dvc add data/raw_data.csv

This moves the actual file into DVC's cache (leaving a link or copy in your workspace) and creates a small data/raw_data.csv.dvc pointer file.
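Under the hood, the pointer file identifies a data version by its content hash. A minimal sketch of that hashing step (an illustrative helper, not DVC's actual code):

```python
import hashlib

def content_md5(path: str) -> str:
    """MD5 content hash, like the 'md5' field inside a .dvc pointer file."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

The cache stores a file's content under a path derived from this hash, so two identical files are stored only once.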

2. Tracking changes in Git:

git add data/raw_data.csv.dvc .gitignore
git commit -m "Add raw data tracking"

3. Pushing to Remote:
Similar to git push, you can push your data to a remote storage (S3, GCS, Azure, or local).

dvc push

DVC Maintenance: Status and Cleanup

As your project grows, you need to manage your local data workspace and the DVC cache.

1. Checking Status:
To see if your data has changed or if you need to push/pull:

dvc status

Image showing 'dvc status' output in the terminal

2. Cleaning the Cache:
DVC stores versions of your data in a hidden .dvc/cache folder. If you run out of disk space, you can remove old, unused versions:

dvc gc --workspace  # Garbage collect: removes cached data not referenced in the current workspace

DVC Pipelines and Automation

DVC allows you to define "stages" of your ML pipeline in a dvc.yaml file. This ensures that if you change a script or a dataset, only the affected parts of the pipeline are rerun.

Adding a Stage:

dvc stage add -n data_ingestion \
                -d src/data_ingestion.py -d data/raw \
                -o data/processed \
                python src/data_ingestion.py
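This command writes the stage definition into dvc.yaml; for the example above it looks roughly like:

```yaml
stages:
  data_ingestion:
    cmd: python src/data_ingestion.py
    deps:
      - src/data_ingestion.py
      - data/raw
    outs:
      - data/processed
```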

How does DVC 'know' what is produced?

DVC does not magically know what your script does. You define the contract yourself using the -o (outputs) flag.

  1. You tell DVC: "If I run python train.py, it will produce model.pkl."
  2. DVC stores this "contract" in dvc.yaml.
  3. After the script runs, DVC checks if model.pkl exists. If it does, DVC calculates its hash and starts versioning it.

Image of a DVC DAG (dvc dag) showing the visual pipeline flow

dvc repro vs. python script.py

| Feature | `python src/train.py` | `dvc repro` |
| --- | --- | --- |
| Execution | Runs the script immediately. | Runs only if dependencies have changed. |
| Tracking | MLflow can log results, but outputs are not tracked by DVC. | MLflow logs results AND DVC tracks data/models. |
| Efficiency | Always runs (wastes time if data is the same). | Skips work if results are already cached. |
| Automation | Manual, one script at a time. | Runs an entire 10-step pipeline with one command. |

Running the Pipeline:

dvc repro

This command checks the dependencies and only runs the stages that have changed.
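The skip/rerun decision is conceptually just hash comparison against what dvc.lock recorded. A simplified illustration (not DVC's real implementation):

```python
import hashlib

def file_hash(path: str) -> str:
    """MD5 of a file's contents, like the hashes DVC records in dvc.lock."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def stage_is_up_to_date(deps, recorded) -> bool:
    """A stage is skipped when every dependency hash matches the recorded one."""
    return all(file_hash(d) == recorded.get(d) for d in deps)
```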

Integrating DVC with AWS S3

For professional projects, data is usually stored in the cloud.

1. Setup AWS Credentials:

pip install dvc[s3] awscli
aws configure

2. Add S3 Remote:

dvc remote add -d myremote s3://my-mlops-bucket/dvcstore
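After adding the remote, .dvc/config (which you should commit) contains roughly:

```ini
[core]
    remote = myremote
['remote "myremote"']
    url = s3://my-mlops-bucket/dvcstore
```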

3. Push and Pull:

dvc push  # Uploads data to S3
dvc pull  # Downloads data from S3 (on a new machine)

MLflow Installation and Tracking UI

MLflow is used to track the "metadata" of your experiments.

Installation:

pip install mlflow

Starting the UI:

mlflow ui

By default, the UI is served at http://127.0.0.1:5000.

MLflow Tracking Dashboard

Tip: The default file-based backend does not support the Model Registry; use a SQLite backend instead:

mlflow ui --backend-store-uri sqlite:///mlflow.db

Logging Experiments and Artifacts

Within your Python code, you can log everything from parameters to the final model pickle file.

Example Code:

import mlflow

# Set the experiment name
mlflow.set_experiment("Heart_Disease_Classification")

with mlflow.start_run():
    # Log Parameters (Hyperparameters)
    mlflow.log_param("n_estimators", 100)

    # Log Metrics (Results)
    mlflow.log_metric("accuracy", 0.95)

    # Log Artifacts (Plots, JSONs, code)
    mlflow.log_artifact("plots/confusion_matrix.png")

    # Log the Model itself (use a flavor-specific API such as mlflow.sklearn)
    mlflow.sklearn.log_model(my_model, "random_forest_model")

Runs Comparison

Autologging:
For many libraries (Scikit-Learn, TensorFlow, XGBoost), you can log everything automatically with one line:

mlflow.sklearn.autolog()

MLflow Model Registry

The Model Registry is a centralized model store where you can manage the lifecycle of your models.

Note for MLflow 2.9+: MLflow is moving away from fixed Stages (Staging, Production) in favor of Aliases and Tags. This provides more flexibility for complex deployment workflows.

MLflow Stages (Deprecated)

MLflow Aliases and Tags (The New Way)

1. Registering a Model:
In your code, you can register a model after logging it:

mlflow.sklearn.log_model(my_model, "artifact_path", registered_model_name="MyBestModel")

2. Managing with Aliases (The New Way):
Instead of transitioning to "Production", you now assign an Alias like @champion or @prod to a specific version.

from mlflow import MlflowClient
client = MlflowClient()

# Assign the 'prod' alias to version 1
client.set_registered_model_alias("MyBestModel", "prod", "1")

This allows your deployment scripts to fetch the model via models:/MyBestModel@prod without worrying about the version number.

3. Using Tags:
Tags are key-value pairs used to store metadata about a model version (e.g., validation_status: "passed").

Remote Tracking with DagsHub and AWS

Running MLflow locally is fine for individual work, but teams need a remote tracking server.

Using DagsHub:
DagsHub (https://dagshub.com) provides a hosted MLflow tracking server for every repository.

  1. Connect your GitHub repo to DagsHub.
  2. Install the library: pip install dagshub.
  3. Add this to your script:
import dagshub
import mlflow

dagshub.init(repo_owner='username', repo_name='my-repo', mlflow=True)

DagsHub MLflow Server UI

Using AWS (EC2 + S3):

  • EC2: Hosts the MLflow tracking server (metadata).
  • S3: Acts as the "artifact store" for models and plots.
  • Command to start server on EC2:
mlflow server --host 0.0.0.0 --default-artifact-root s3://my-bucket/artifacts

Unified Workflow: Putting it All Together

In a professional MLOps project, you use both tools in tandem:

  1. DVC: Fetch the specific data version (dvc pull).
  2. Code: Run training script with MLflow logging parameters and metrics.
  3. MLflow UI: Compare runs and find the best hyperparameters.
  4. DVC: Version the resulting best_model.pkl (dvc add best_model.pkl).
  5. Git: Commit the .dvc files and training code (git commit).
  6. MLflow Registry: Assign the @prod or @champion alias to the best model version.
  7. CI/CD: Deploy the model by referencing its alias (e.g., models:/MyBestModel@prod).
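The DVC half of this loop (steps 1, 4, and 5) can be rehearsed end to end in a sandbox, using a plain local directory as the "remote" instead of S3 (all paths and names below are illustrative):

```shell
# Self-contained sketch: version a model with DVC and push it to a local remote
command -v dvc >/dev/null 2>&1 || exit 0   # skip gracefully if dvc isn't installed
set -e
workdir=$(mktemp -d); remote=$(mktemp -d)
cd "$workdir"
git init -q
dvc init -q
dvc remote add -d localremote "$remote"    # a local folder standing in for S3
echo "fake model bytes" > best_model.pkl   # stand-in for the trained model
dvc add best_model.pkl                     # step 4: version the model
git add best_model.pkl.dvc .gitignore .dvc .dvcignore
git -c user.email=demo@example.com -c user.name=demo \
    commit -qm "Version best model"        # step 5: commit pointer + config
dvc push                                   # model content now lives in $remote
```

Swapping the local directory for an s3:// URL turns this sandbox into the real workflow.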

The MLOps Ecosystem Flowchart
