The MLOps Compass: A Beginner's Guide to Building a Reproducible ML Pipeline
As a data scientist or machine learning engineer, you know that building a great model is only half the battle. The real challenge is getting that model into production and making sure the entire process is reliable and repeatable. This is the world of MLOps, where you connect the dots between model development and operational deployment.
In this post, I'll walk you through a hands-on project to build your very own reproducible MLOps pipeline right on your laptop. We'll use three fantastic open-source tools—Git, DVC, and Docker—to manage our code, data, and application environment. By the time we're done, you'll have a containerized model that's ready to go.
Step 1: Laying the Foundation with Git and Python
Every solid project starts with a good foundation. We'll use Git for version control and a clean Python virtual environment to keep our dependencies organized.
First, let's create a new project folder and initialize a Git repository.
mkdir mlops-local-project
cd mlops-local-project
git init
Next, create a virtual environment. This is a crucial step that keeps your project's dependencies separate from everything else on your computer, so you'll never run into weird conflicts.
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
Finally, define your project's Python libraries in a requirements.txt file and install them. We'll need a few common ones like scikit-learn, pandas, joblib, and flask (plus gunicorn, which the Docker container will use to serve the app later). Don't forget to create a .gitignore file to prevent things like your virtual environment from being committed to Git.
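As a rough sketch (library versions left unpinned for brevity, though pinning them helps reproducibility), the two files might look like this. I've included dvc and gunicorn as assumptions, since the commands and Dockerfile later in this post use them:
# requirements.txt
scikit-learn
pandas
joblib
flask
gunicorn
dvc
# .gitignore
venv/
__pycache__/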
Step 2: Versioning Your Data with DVC
Git is great for code, but it just doesn't work for large data files or models. That's where DVC (Data Version Control) comes in. DVC versions your data and models by creating small, lightweight metadata files that Git can track. The actual large files are stored in a local cache, so your Git repo stays lean and clean.
To get started, initialize DVC in your project:
dvc init
For our example, we'll use the simple Iris dataset from scikit-learn. We'll write a Python script to save the data as a CSV (a minimal sketch is below), and then use DVC to start tracking it.
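Here's a minimal sketch of what src/prepare_data.py might contain. It just loads the Iris dataset and writes it out with a target column, which is what the training script later in this post expects:
# src/prepare_data.py - a minimal sketch: load the Iris dataset and save it as a CSV
import os
import pandas as pd
from sklearn.datasets import load_iris

def prepare_data():
    # Load the Iris dataset as a DataFrame (feature columns plus a 'target' column)
    iris = load_iris(as_frame=True)
    df = iris.frame
    # Make sure the data directory exists before writing
    os.makedirs('data', exist_ok=True)
    df.to_csv('data/iris.csv', index=False)

if __name__ == "__main__":
    prepare_data()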
# Prepare the data using a Python script
python src/prepare_data.py
# Add the data to DVC and commit the metadata file to Git
dvc add data/iris.csv
git add data/iris.csv.dvc
git commit -m "Add iris dataset with DVC"
Now, every time you make a change to your data, you can simply run dvc add again. Git will track the new version without a massive, slow commit.
Step 3: The Pipeline - Training a Model
Reproducibility is a core part of MLOps. To make sure your model training can be repeated exactly, we'll define a DVC pipeline. This pipeline tracks all the dependencies between your data, code, and the final model, guaranteeing a consistent outcome.
First, write your training script (src/train.py), which will load the DVC-tracked data and train a simple RandomForestClassifier.
# A simple script to train and save a model.
import os
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import joblib

def train_model():
    # Load the DVC-tracked dataset
    df = pd.read_csv('data/iris.csv')
    X = df.drop('target', axis=1)
    y = df['target']
    # Train a basic random forest on all features
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)
    # Make sure the output directory exists, then persist the model
    os.makedirs('models', exist_ok=True)
    joblib.dump(model, 'models/model.joblib')

if __name__ == "__main__":
    train_model()
Next, create a dvc.yaml file to define your pipeline. This file tells DVC that the train stage depends on data/iris.csv and src/train.py and produces models/model.joblib as output.
# Simplified dvc.yaml for our pipeline
stages:
  train:
    cmd: python src/train.py
    deps:
      - data/iris.csv
      - src/train.py
    outs:
      - models/model.joblib
Now, whenever you want to re-run your training process, you simply type dvc repro. DVC will automatically check for changes in the dependencies and execute only the necessary steps, which saves a ton of time.
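In practice, the cycle looks something like this. The dvc.lock file that DVC writes records the exact hashes of every dependency and output, so committing it keeps the run reproducible:
# First run executes the train stage and writes dvc.lock
dvc repro
# Commit the pipeline definition and lock file
git add dvc.yaml dvc.lock
git commit -m "Add training pipeline"
# With nothing changed, dvc repro is a no-op; edit the data or src/train.py and the stage re-runs
dvc repro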
Step 4: Containerization with Docker
The final piece of the puzzle is to package your application and all its dependencies into a single, portable unit. Docker is the industry standard for this. It ensures your application will run exactly the same way, no matter what computer you run it on.
First, create a simple Flask API (app.py) that loads your trained model and exposes a /predict endpoint.
# A Flask app to serve predictions
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('models/model.joblib')

@app.route('/predict', methods=['POST'])
def predict():
    # One possible implementation (an assumption): expect JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    data = request.get_json()
    prediction = model.predict(data['features']).tolist()
    return jsonify({'prediction': prediction})
Finally, create a Dockerfile that defines the environment for your application. This file specifies the Python version, installs your dependencies, and copies your code into the container.
# Dockerfile to containerize the app
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "app:app"]
With these files in place, you can build and run your container with just two commands:
docker build -t mlops-service .
docker run -p 5000:5000 mlops-service
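With the container running, you can smoke-test the endpoint with curl. The payload shape here matches the sketch of the predict function above (a list of four Iris feature values), so adjust it if your prediction logic differs:
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [[5.1, 3.5, 1.4, 0.2]]}'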
This project gives you a solid foundation in MLOps. Once you're comfortable with this local workflow, you'll be well-equipped to add more advanced steps like continuous integration and continuous deployment (CI/CD) and cloud provisioning.
Download link - PyDVC-Docker-MLOps-Sandbox