Argha Sarkar

The MLOps Compass: A Local Guide to Building Your First Reproducible ML Pipeline

As a data scientist or machine learning engineer, you know that building a great model is only half the battle. The real challenge is getting that model into production and making sure the entire process is reliable and repeatable. This is the world of MLOps, where you connect the dots between model development and operational deployment.

In this post, I'll walk you through a hands-on project to build your very own reproducible MLOps pipeline right on your laptop. We'll use three fantastic open-source tools—Git, DVC, and Docker—to manage our code, data, and application environment. By the time we're done, you'll have a containerized model that's ready to go.

Step 1: Laying the Foundation with Git and Python

Every solid project starts with a good foundation. We'll use Git for version control and a clean Python virtual environment to keep our dependencies organized.

First, let's create a new project folder and initialize a Git repository.

mkdir mlops-local-project
cd mlops-local-project
git init

Next, create a virtual environment. This is a crucial step that keeps your project's dependencies separate from everything else on your computer, so you'll never run into weird conflicts.

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Finally, define your project's Python libraries in a requirements.txt file and install them. We'll need a few common ones like scikit-learn, pandas, joblib, and flask, plus gunicorn, which we'll use to serve the app inside Docker later. Don't forget to create a .gitignore file to prevent things like your virtual environment from being committed to Git.
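
Here's roughly what those two files could look like (exact version pins are up to you, and you can install DVC separately with pip install dvc so it doesn't end up in the Docker image later):

# requirements.txt
scikit-learn
pandas
joblib
flask
gunicorn

# .gitignore
venv/
__pycache__/

Install everything with pip install -r requirements.txt. Don't worry about ignoring the data and model files yet; DVC will add its own .gitignore entries for anything it tracks.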

Step 2: Versioning Your Data with DVC

Git is great for code, but it just doesn't work for large data files or models. That's where DVC (Data Version Control) comes in. DVC versions your data and models by creating small, lightweight metadata files that Git can track. The actual large files are stored in a local cache, so your Git repo stays lean and clean.

To get started, initialize DVC in your project:

dvc init

For our example, we'll use a simple dataset from scikit-learn. We'll write a Python script to save the data as a CSV, and then use DVC to start tracking it.
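
The prepare script itself isn't shown in the commands below, so here's a minimal sketch of what src/prepare_data.py could look like, assuming we use scikit-learn's built-in Iris dataset:

# src/prepare_data.py -- save the Iris dataset as a CSV that DVC can track
import os
import pandas as pd
from sklearn.datasets import load_iris

def prepare_data():
    iris = load_iris(as_frame=True)
    df = iris.frame  # features plus a 'target' column
    os.makedirs('data', exist_ok=True)
    df.to_csv('data/iris.csv', index=False)

if __name__ == "__main__":
    prepare_data()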

# Prepare the data using a Python script
python src/prepare_data.py

# Add the data to DVC and commit the metadata file to Git
dvc add data/iris.csv
git add data/iris.csv.dvc data/.gitignore
git commit -m "Add iris dataset with DVC"

Now, every time your data changes, you simply run dvc add again. Git only has to track the tiny updated .dvc metadata file, so there's no massive, slow commit.
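
For example, after modifying the dataset:

# Re-version the updated data
dvc add data/iris.csv
git add data/iris.csv.dvc
git commit -m "Update iris dataset"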

Step 3: The Pipeline - Training a Model

Reproducibility is a core part of MLOps. To make sure your model training can be repeated exactly, we'll define a DVC pipeline. This pipeline tracks all the dependencies between your data, code, and the final model, guaranteeing a consistent outcome.

First, write your training script (src/train.py), which will load the DVC-tracked data and train a simple RandomForestClassifier.

# A simple script to train and save a model.
import os

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import joblib

def train_model():
    # Load the DVC-tracked dataset
    df = pd.read_csv('data/iris.csv')
    X = df.drop('target', axis=1)
    y = df['target']

    # Fix the random seed so repeated runs produce the same model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)

    # Make sure the output directory exists before saving
    os.makedirs('models', exist_ok=True)
    joblib.dump(model, 'models/model.joblib')

if __name__ == "__main__":
    train_model()

Next, create a dvc.yaml file to define your pipeline. This file tells DVC that the train stage depends on data/iris.csv and src/train.py and produces models/model.joblib as output.

# Simplified dvc.yaml for our pipeline
stages:
  train:
    cmd: python src/train.py
    deps:
    - data/iris.csv
    - src/train.py
    outs:
    - models/model.joblib
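
If you'd rather not write the YAML by hand, recent DVC versions can generate the same stage from the command line (a sketch; check dvc stage add --help for your version):

dvc stage add -n train \
  -d data/iris.csv -d src/train.py \
  -o models/model.joblib \
  python src/train.py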

Now, whenever you want to re-run your training process, you simply type dvc repro. DVC will automatically check for changes in the dependencies and execute only the necessary steps, which saves a ton of time.
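
Running the pipeline and committing the resulting lock file looks like this:

dvc repro
git add dvc.yaml dvc.lock
git commit -m "Add training pipeline"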

Step 4: Containerization with Docker

The final piece of the puzzle is to package your application and all its dependencies into a single, portable unit. Docker is the industry standard for this. It ensures your application will run exactly the same way, no matter what computer you run it on.

First, create a simple Flask API (app.py) that loads your trained model and exposes a /predict endpoint.

# A Flask app to serve predictions
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('models/model.joblib')

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON body like {"features": [5.1, 3.5, 1.4, 0.2]}
    features = request.get_json()['features']
    # Wrap the single sample in a list so predict() gets a 2D input
    prediction = int(model.predict([features])[0])
    return jsonify({'prediction': prediction})
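
Before building the image, you can sanity-check the API from your virtual environment (assuming gunicorn is installed from requirements.txt):

gunicorn --bind 127.0.0.1:5000 app:app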

Finally, create a Dockerfile that defines the environment for your application. This file specifies the Python version, installs your dependencies, and copies your code into the container.

# Dockerfile to containerize the app
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "app:app"]

With these files in place, you can build and run your container with just two commands:

docker build -t mlops-service .
docker run -p 5000:5000 mlops-service
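
Once the container is running, you can test the endpoint with curl (the "features" payload matches the JSON format assumed in app.py above):

curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'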

This project gives you a solid foundation in MLOps. Once you're comfortable with this local workflow, you'll be well-equipped to add more advanced steps like continuous integration and continuous deployment (CI/CD) and cloud provisioning.

Download link - PyDVC-Docker-MLOps-Sandbox
