In this AI-driven world, managing experiments and data versions is just as important as the model architecture itself. This guide covers the essentials of Data Version Control (DVC) and MLflow, the industry-standard tools for reproducible machine learning.
Index
- Introduction to DVC and MLflow
- DVC Installation and Initialization
- Versioning Data with DVC
- DVC Maintenance: Status and Cleanup
- DVC Pipelines and Automation
- Integrating DVC with AWS S3
- MLflow Installation and Tracking UI
- Logging Experiments and Artifacts
- MLflow Model Registry
- Remote Tracking with DagsHub and AWS
- Unified Workflow: Putting it All Together
Introduction to DVC and MLflow
MLOps is the practice of applying DevOps principles to Machine Learning. Two critical pillars of MLOps are:
- Data Versioning (DVC): Git is great for code but fails with large datasets. DVC allows you to version data, models, and pipelines just like you version code.
- Experiment Tracking (MLflow): Keeping track of hyperparameters, metrics, and model versions across hundreds of runs.
DVC Installation and Initialization
DVC is built on top of Git and works seamlessly with it.
Installation:
pip install dvc
Initialization:
Run this in the root of your git repository:
dvc init
This creates a .dvc/ directory to store internal configurations. You should commit the changes made by dvc init to your git repository.
Versioning Data with DVC
DVC does not store the data itself in Git. Instead, it creates small .dvc files that act as placeholders/pointers to the actual data.
1. Adding Data:
dvc add data/raw_data.csv
This moves the actual file to DVC's cache and creates a data/raw_data.csv.dvc file.
2. Tracking changes in Git:
git add data/raw_data.csv.dvc .gitignore
git commit -m "Add raw data tracking"
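The pointer file committed to Git is tiny; the exact fields vary by DVC version, but it looks roughly like this (hash and size below are illustrative, not real values):

```yaml
# data/raw_data.csv.dvc -- committed to Git in place of the real file
outs:
  - md5: a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6   # content hash of the tracked file
    size: 1048576                            # file size in bytes
    path: raw_data.csv                       # actual data lives in the DVC cache
```

Because Git only sees this small YAML file, the repository stays light no matter how large the dataset grows.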
3. Pushing to Remote:
Similar to git push, you can push your data to a remote storage (S3, GCS, Azure, or local).
dvc push
DVC Maintenance: Status and Cleanup
As your project grows, you need to manage your local data workspace and the DVC cache.
1. Checking Status:
To see if your data has changed or if you need to push/pull:
dvc status
2. Cleaning the Cache:
DVC stores versions of your data in a hidden .dvc/cache folder. If you run out of disk space, you can remove old, unused versions:
dvc gc --workspace # Garbage collect: removes cached data not referenced by the current workspace
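Under the hood, `dvc status` decides whether data changed by comparing content hashes, not timestamps. This is a minimal stdlib sketch of that idea, not DVC's actual implementation (which uses its own cache layout):

```python
import hashlib
from pathlib import Path

def file_md5(path: Path) -> str:
    """Hash a file's contents in chunks, the way DVC fingerprints tracked data."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def has_changed(path: Path, recorded_hash: str) -> bool:
    """Compare the current content hash against the one recorded earlier."""
    return file_md5(path) != recorded_hash

# Usage: record a hash, modify the file, and detect the change.
data = Path("raw_data.csv")
data.write_text("id,value\n1,10\n")
recorded = file_md5(data)
data.write_text("id,value\n1,10\n2,20\n")
print(has_changed(data, recorded))  # True: content differs from the recorded hash
```

Note that touching a file without changing its bytes produces the same hash, which is why DVC can safely skip unchanged data.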
DVC Pipelines and Automation
DVC allows you to define "stages" of your ML pipeline in a dvc.yaml file. This ensures that if you change a script or a dataset, only the affected parts of the pipeline are rerun.
Adding a Stage:
dvc stage add -n data_ingestion \
-d src/data_ingestion.py -d data/raw \
-o data/processed \
python src/data_ingestion.py
How does DVC 'know' what is produced?
DVC does not magically know what your script does. You define the Contract using the -o (outputs) flag.
- You tell DVC: "If I run `python train.py`, it will produce `model.pkl`."
- DVC stores this "contract" in `dvc.yaml`.
- After the script runs, DVC checks that `model.pkl` exists. If it does, DVC calculates its hash and starts versioning it.
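The `dvc stage add` command shown above writes this contract into `dvc.yaml`. For the `data_ingestion` stage it generates roughly:

```yaml
stages:
  data_ingestion:
    cmd: python src/data_ingestion.py
    deps:
      - src/data_ingestion.py
      - data/raw
    outs:
      - data/processed
```

If any `deps` entry changes, `dvc repro` reruns the stage and re-hashes the `outs`; otherwise the stage is skipped.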
dvc repro vs. python script.py
| Feature | `python src/train.py` | `dvc repro` |
|---|---|---|
| Execution | Runs the script immediately. | Runs only if dependencies have changed. |
| Tracking | MLflow logs results, but DVC doesn't track files. | MLflow logs results AND DVC tracks data/models. |
| Efficiency | Always runs (wastes time if data is unchanged). | Skips work if results are already cached. |
| Automation | Manual. | Can run an entire 10-step pipeline with one command. |
Running the Pipeline:
dvc repro
This command checks the dependencies and only runs the stages that have changed.
Integrating DVC with AWS S3
For professional projects, data is usually stored in the cloud.
1. Setup AWS Credentials:
pip install dvc[s3] awscli
aws configure
2. Add S3 Remote:
dvc remote add -d myremote s3://my-mlops-bucket/dvcstore
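The `dvc remote add -d` command records the remote in `.dvc/config` (an INI-style file), which you commit so teammates share the same remote:

```ini
[core]
    remote = myremote
['remote "myremote"']
    url = s3://my-mlops-bucket/dvcstore
```

Credentials themselves stay out of this file; DVC picks them up from the standard AWS credential chain configured via `aws configure`.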
3. Push and Pull:
dvc push # Uploads data to S3
dvc pull # Downloads data from S3 (on a new machine)
MLflow Installation and Tracking UI
MLflow is used to track the "metadata" of your experiments.
Installation:
pip install mlflow
Starting the UI:
mlflow ui
By default, this opens at http://127.0.0.1:5000.
Tip: The default file-based store has limitations (for example, it cannot power the Model Registry), so back the UI with a SQLite database:
mlflow ui --backend-store-uri sqlite:///mlflow.db
Logging Experiments and Artifacts
Within your Python code, you can log everything from parameters to the final model pickle file.
Example Code:
import mlflow

# Set the experiment name
mlflow.set_experiment("Heart_Disease_Classification")

with mlflow.start_run():
    # Log Parameters (Hyperparameters)
    mlflow.log_param("n_estimators", 100)

    # Log Metrics (Results)
    mlflow.log_metric("accuracy", 0.95)

    # Log Artifacts (Plots, JSONs, code)
    mlflow.log_artifact("plots/confusion_matrix.png")

    # Log the Model itself (my_model is your trained estimator)
    mlflow.sklearn.log_model(my_model, "random_forest_model")
Autologging:
For many libraries (Scikit-Learn, TensorFlow, XGBoost), you can log everything automatically with one line:
mlflow.sklearn.autolog()
MLflow Model Registry
The Model Registry is a centralized model store where you can manage the lifecycle of your models.
Note for MLflow 2.9+: MLflow is moving away from fixed Stages (Staging, Production) in favor of Aliases and Tags. This provides more flexibility for complex deployment workflows.
1. Registering a Model:
In your code, you can register a model after logging it:
mlflow.sklearn.log_model(my_model, "artifact_path", registered_model_name="MyBestModel")
2. Managing with Aliases (The New Way):
Instead of transitioning to "Production", you now assign an Alias like @champion or @prod to a specific version.
from mlflow import MlflowClient
client = MlflowClient()
# Assign the 'prod' alias to version 1
client.set_registered_model_alias("MyBestModel", "prod", "1")
This allows your deployment scripts to fetch the model via models:/MyBestModel@prod without worrying about the version number.
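The value of aliases is indirection: deployment code asks for a name, and the registry resolves it to whichever version currently holds the alias. Here is a toy Python sketch of that idea (a plain dict standing in for the real MLflow registry; the names and paths are made up for illustration):

```python
# Toy model registry: versions are immutable, aliases are movable pointers.
registry = {
    "MyBestModel": {
        "versions": {"1": "s3://bucket/models/v1", "2": "s3://bucket/models/v2"},
        "aliases": {},
    }
}

def set_alias(name: str, alias: str, version: str) -> None:
    """Point an alias at a specific model version (like set_registered_model_alias)."""
    registry[name]["aliases"][alias] = version

def resolve(uri: str) -> str:
    """Resolve a 'models:/Name@alias' style URI to a storage path."""
    name, alias = uri.removeprefix("models:/").split("@")
    version = registry[name]["aliases"][alias]
    return registry[name]["versions"][version]

set_alias("MyBestModel", "prod", "1")
print(resolve("models:/MyBestModel@prod"))  # s3://bucket/models/v1

# Promote version 2 without touching any deployment code:
set_alias("MyBestModel", "prod", "2")
print(resolve("models:/MyBestModel@prod"))  # s3://bucket/models/v2
```

Swapping which version an alias points at is exactly how a rollback or promotion works: the deployment script keeps asking for `@prod` and never needs to change.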
3. Using Tags:
Tags are key-value pairs used to store metadata about a model version (e.g., validation_status: "passed").
Remote Tracking with [DagsHub](https://dagshub.com) and AWS
Running MLflow locally is fine for individual work, but teams need a remote tracking server.
Using DagsHub:
DagsHub provides a hosted MLflow server for every repository.
- Connect your GitHub repo to DagsHub.
- Install the library: `pip install dagshub`.
- Add this to your script:
import dagshub
import mlflow
dagshub.init(repo_owner='username', repo_name='my-repo', mlflow=True)
Using AWS (EC2 + S3):
- EC2: Hosts the MLflow tracking server (metadata).
- S3: Acts as the "artifact store" for models and plots.
- Command to start server on EC2:
mlflow server --host 0.0.0.0 --backend-store-uri sqlite:///mlflow.db --default-artifact-root s3://my-bucket/artifacts
Unified Workflow: Putting it All Together
In a professional MLOps project, you use both tools in tandem:
- DVC: Fetch the specific data version (`dvc pull`).
- Code: Run the training script with MLflow logging parameters and metrics.
- MLflow UI: Compare runs and find the best hyperparameters.
- DVC: Version the resulting `best_model.pkl` (`dvc add best_model.pkl`).
- Git: Commit the `.dvc` files and training code (`git commit`).
- MLflow Registry: Assign the `@prod` or `@champion` alias to the best model version.
- CI/CD: Deploy the model by referencing its alias (e.g., `models:/MyBestModel@prod`).