DEV Community

Pejman Rezaei

Integrating MLflow with KubeFlow (Revised Edition)

MLflow is a robust open-source platform that simplifies the management of the machine learning lifecycle, including experimentation, reproducibility, and deployment. By integrating MLflow into Kubeflow, users can leverage MLflow's intuitive UI and comprehensive model registry to enhance their machine learning workflows.

In the modern enterprise landscape, the demand for streamlined and scalable Machine Learning Operations (MLOps) frameworks has never been greater. With increasing complexity in model development, tracking, deployment, and monitoring, organizations need tools that integrate seamlessly to ensure efficiency and reliability. MLflow and Kubeflow are two such tools that, when integrated, provide a robust end-to-end solution for managing machine learning workflows.

MLflow excels in tracking experiments, managing the model lifecycle, and maintaining a centralized model registry. Kubeflow, on the other hand, offers scalable pipelines, distributed training capabilities, hyperparameter optimization, and production-grade model serving on Kubernetes. Together, these tools form a comprehensive framework for MLOps that supports continuous integration and deployment (CI/CD), enabling enterprises to automate workflows, improve collaboration between data science and engineering teams, and ensure models are delivered to production faster and with fewer errors.

This tutorial will guide you through the detailed process of integrating MLflow and Kubeflow into an enterprise-level MLOps framework, focusing on scalability, reproducibility, and automation.

This framework ensures:

  1. Scalability for high-demand ML workflows.
  2. Automation of CI/CD pipelines.
  3. Centralized tracking and monitoring.

Part 1

The first step is setting up a database: to use MLflow's tracking functionality with a relational database backend, you need a PostgreSQL (or other supported database) instance. Here’s a breakdown of why and how to set it up:

Why Use PostgreSQL with MLflow?

  • Experiment Tracking: MLflow uses a backend store to log experiments, runs, parameters, metrics, and artifacts. A relational database like PostgreSQL is a robust option for this purpose.
  • Scalability: Using a database allows you to efficiently manage and query large amounts of experiment data.
  • Persistence: A database ensures that your experiment data is stored persistently, even if the MLflow server is restarted.

Setting Up PostgreSQL for MLflow

Step 1: Deploy PostgreSQL in Your Kubernetes Cluster

You can deploy PostgreSQL using a Helm chart or a custom YAML configuration. Here’s a basic example using a Helm chart:

  1. Create the MLflow namespace:
    kubectl create namespace mlflow
  2. Encode the Postgres password in base64:
    echo -n 'MyPostgresPass.!QAZ' | base64
  3. Create a YAML file (postgresql-deployment.yaml):
apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
  namespace: mlflow
data:
  postgresql-password: TUxQbGF0Zm9ybTEyMzQuIVFBWg==
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
  namespace: mlflow
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-postgres
  namespace: mlflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow-postgres
  template:
    metadata:
      labels:
        app: mlflow-postgres
    spec:
      containers:
      - name: postgres
        image: postgres:16
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: postgresql-password
        - name: POSTGRES_DB
          value: mlflow
        ports:
        - containerPort: 5432
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        volumeMounts:
        - name: postgres-storage
          mountPath: /var/lib/postgresql/data
          subPath: pgdata
      volumes:
      - name: postgres-storage
        persistentVolumeClaim:
          claimName: postgres-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-postgres
  namespace: mlflow
spec:
  type: ClusterIP
  ports:
  - port: 5432
    targetPort: 5432
  selector:
    app: mlflow-postgres

  4. Apply it:
    kubectl apply -f postgresql-deployment.yaml
  5. Create a user and database for MLflow on Postgres:

To set up a PostgreSQL database for MLflow, you'll need to create a user, set a password, create a database, and grant the necessary permissions. Here’s how you can do it step by step in the PostgreSQL shell (psql):

Step-by-Step Commands

  1. Log into PostgreSQL:
    First, log into your PostgreSQL server as a superuser (e.g., postgres):

    psql -U postgres
    
  2. Create a User:
    Replace mlflow and your_password with your desired username and password.

    CREATE USER mlflow WITH PASSWORD 'your_password';
    
    
  3. Create a Database:
    Replace mlflow_db with your desired database name.

    CREATE DATABASE mlflow_db;
    
    
  4. Grant Permissions:
    Grant the necessary permissions to the user for the database:

    GRANT ALL PRIVILEGES ON DATABASE mlflow_db TO mlflow;
    
    
  5. Exit the PostgreSQL Shell:
    After executing the commands, you can exit the psql shell:

    \q
    
    

Summary of Commands

Putting it all together, here are the commands you would run in the PostgreSQL shell:

CREATE USER mlflow WITH PASSWORD 'your_password';
CREATE DATABASE mlflow_db;
GRANT ALL PRIVILEGES ON DATABASE mlflow_db TO mlflow;


Additional Considerations

  • Password Security: Make sure to use a strong password for your database user.
  • Database Connection: When configuring MLflow, use the following connection string format:

    postgresql://mlflow:your_password@<host>:<port>/mlflow_db
    
    

Replace <host> and <port> with your PostgreSQL server's address and port (default is 5432).

With these steps, you should have a PostgreSQL user and database set up for MLflow, ready for use!
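One detail worth checking before you use that connection string: passwords containing special characters (like the `!` in the example password above) must be percent-encoded inside the URI. A minimal Python sketch of building a safe backend-store URI (host and port values here are illustrative):

```python
from urllib.parse import quote_plus

def backend_store_uri(user: str, password: str, host: str, port: int, db: str) -> str:
    # Special characters in the password (e.g. '!', '@', ':') must be
    # percent-encoded, or the URI parser will misread the host portion.
    return f"postgresql://{user}:{quote_plus(password)}@{host}:{port}/{db}"

print(backend_store_uri("mlflow", "MyPostgresPass.!QAZ", "mlflow-postgres", 5432, "mlflow_db"))
```

The resulting string can be passed to MLflow as the `--backend-store-uri` value.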


Storage backend

When considering security for your MLflow setup, both Ceph and MinIO can be configured to be secure, but they have different security features and considerations. Here’s a comparison to help you decide which might be more appropriate for your use case:

Using Ceph

Pros:

  1. Robust Security Features: Ceph supports various security mechanisms, including:
    • Authentication: Ceph can use CephX for authentication, ensuring that only authorized clients can access the storage.
    • Encryption: Data can be encrypted both in transit (using TLS) and at rest.
    • Access Control: You can set fine-grained access control policies to restrict who can access specific buckets or objects.
  2. Scalability: Ceph is designed for scalability, making it suitable for large datasets and high availability.

Cons:

  1. Complexity: Setting up and managing Ceph can be more complex compared to simpler object storage solutions.
  2. Configuration Overhead: You may need to invest time in properly configuring security settings to ensure that your Ceph deployment is secure.

Using MinIO

Pros:

  1. S3 Compatibility: MinIO is compatible with the S3 API, making it easy to integrate with applications designed for S3 storage.
  2. Simplicity: MinIO is easier to set up and manage compared to Ceph, especially for smaller deployments.
  3. Built-in Security Features: MinIO provides:
    • Server-Side Encryption: You can enable server-side encryption for data at rest.
    • TLS Support: MinIO supports TLS for secure data transmission.
    • Access Policies: You can define bucket policies and user access controls.

Cons:

  1. Less Feature-Rich: While MinIO is secure and robust, it may not have the same level of advanced features and scalability as Ceph for very large deployments.

Security Recommendations

For Ceph:

  • Enable CephX Authentication: Ensure that you are using CephX for authentication.
  • Use TLS: Configure TLS for secure data transmission.
  • Regular Audits: Regularly audit your Ceph configuration and access logs to detect any unauthorized access.

For MinIO:

  • Enable TLS: Always use TLS to encrypt data in transit.
  • Use Strong Access Keys: Generate strong access and secret keys for your MinIO instance.
  • Set Bucket Policies: Define strict bucket policies to control access to your data.

Conclusion

Both Ceph and MinIO can be configured to be secure, but your choice may depend on your specific needs:

  • Choose Ceph if you need a highly scalable, feature-rich solution and are willing to manage its complexity.
  • Choose MinIO if you prefer a simpler, S3-compatible solution that is easy to set up and manage while still providing solid security features.

For this configuration, we prefer MinIO over Ceph due to its simplicity and efficient resource allocation.

MinIO Deployment Options

In this scenario, we will utilize MinIO as the storage backend for MLflow to manage and store artifacts. When considering MinIO, we have two options:

  • Using Standalone MinIO
  • Using MinIO which comes with Kubeflow installation

Using Standalone MinIO (skip this section if you plan to use the MinIO bundled with Kubeflow)

  • Pros:

    Isolation: Keeps MLflow and its storage independent, simplifying management.
    Customization: Allows for tailored configurations specific to MLflow needs.
    Version Control: Easier to manage updates and changes without affecting other components.

  • Cons:

    Resource Duplication: Requires additional resources and management overhead.
    Complexity: May complicate the deployment if not properly managed.

To install standalone MinIO, follow the steps below:

Step 1: Deploy MinIO in Your MLflow Namespace

Base64 Encode Your Keys:
The values for MINIO_ACCESS_KEY and MINIO_SECRET_KEY need to be base64 encoded. You can use the following command in your terminal:

echo -n 'myaccesskey' | base64
echo -n 'mysecretkey' | base64
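The same encoding can be done in Python if you prefer; this small sketch mirrors what the `echo -n ... | base64` commands do (the sample key is just a placeholder):

```python
import base64

def to_k8s_secret_value(plaintext: str) -> str:
    # Kubernetes Secret `data:` values are base64-encoded strings;
    # encode to UTF-8 bytes first, then base64, then back to ASCII text.
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")

encoded = to_k8s_secret_value("myaccesskey")
print(encoded)

# Round-trip check: decoding must recover the original key
assert base64.b64decode(encoded) == b"myaccesskey"
```

Note that the `-n` flag in the shell version matters for the same reason: a trailing newline would be encoded into the secret.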

Next, insert your Base64-encoded strings into the Secret's data section as the values for the MINIO_ACCESS_KEY and MINIO_SECRET_KEY entries in the minio-deploy.yaml file. This file includes a deployment, service, persistent volume claim (PVC), and secret. The image uses the latest version of MinIO, which reads MINIO_ROOT_USER and MINIO_ROOT_PASSWORD as environment variables for the admin user and password of the MinIO installation. Additionally, a separate port is configured for the console UI in this file, allowing access to the dashboard independently of the API port.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  namespace: mlflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
      - name: minio
        image: minio/minio
        args:
          - server
          - /data
          - --console-address # set console ui a dedicated port
          - ":9001"
        ports:
        - containerPort: 9000
        - containerPort: 9001
        env:
        - name: MINIO_ROOT_USER
          valueFrom:
            secretKeyRef:
              name: minio-credentials
              key: MINIO_ACCESS_KEY
        - name: MINIO_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: minio-credentials
              key: MINIO_SECRET_KEY
        - name: MINIO_CONSOLE_PORT
          value: "9001"
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        volumeMounts:
        - name: minio-storage
          mountPath: /data # make minio storage persistent
      volumes:
      - name: minio-storage
        persistentVolumeClaim:
          claimName: minio-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: minio
  namespace: mlflow
spec:
  type: NodePort
  ports:
  - name: api
    port: 9000
    targetPort: 9000
  - name: ui
    port: 9001
    targetPort: 9001
  selector:
    app: minio
---
apiVersion: v1
kind: Secret
metadata:
  name: minio-credentials
  namespace: mlflow
type: Opaque
data:
  MINIO_ACCESS_KEY: EyMzQuIV
  MINIO_SECRET_KEY: TUxQbGF0Zm9yb
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-pvc
  namespace: mlflow
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

Step 2: Access MinIO

Get the MinIO Service URL:

You can access MinIO using the service name within the Kubernetes cluster. If you are using port-forwarding for local access, you can do:

kubectl port-forward svc/minio -n mlflow 9001:9001

Step 3: Create a Bucket in MinIO

  1. Using the MinIO Console:

After logging in, you can create a bucket via the web interface.

  2. Using mc (MinIO Client):

If you prefer to use the command line, install mc and port-forward the MinIO API port:

    ```
    kubectl port-forward svc/minio -n mlflow 9000:9000
    ```

then create a bucket:

    ```
    mc alias set mlflow-minio http://localhost:9000 <myaccesskey> <mysecretkey>
    mc mb mlflow-minio/mlflow-bucket
    ```

Step 4: Create a User and Policy, Then Assign the Policy to the User

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::mlflow-bucket"
    },
    {
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::mlflow-bucket/*"
    }
  ]
}


This policy enables basic read and write operations on the S3 bucket named mlflow-bucket and its contents.

  • The first statement allows the actions s3:GetBucketLocation and s3:ListBucket on the bucket itself, enabling the user to retrieve the bucket's location and list its contents.

  • The second statement permits the actions s3:PutObject, s3:GetObject, and s3:DeleteObject on all objects within the mlflow-bucket. This allows the user to upload, download, and delete objects stored in the bucket.
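As a sanity check, the policy document can also be generated programmatically. This Python sketch simply rebuilds the JSON above for an arbitrary bucket name (the function name is our own):

```python
import json

def mlflow_bucket_policy(bucket: str) -> dict:
    """Build the two-statement read/write policy shown above for a bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # bucket-level actions: locate the bucket and list its contents
                "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
                "Effect": "Allow",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {   # object-level actions: upload, download, and delete objects
                "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                "Effect": "Allow",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }

print(json.dumps(mlflow_bucket_policy("mlflow-bucket"), indent=2))
```

You could write this JSON to a file and, in recent `mc` versions, attach it with `mc admin policy create <alias> <policy-name> policy.json` followed by `mc admin policy attach` (exact subcommands vary by `mc` version, so check your client's help output).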


Step 5: Configure Istio

When using Istio in your Kubernetes cluster, you may need to consider Istio configurations for MinIO and MLflow to ensure proper traffic management, security, and observability.

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: minio-gateway
  namespace: mlflow
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 9000  # For MinIO API
      name: minio-api
      protocol: HTTP
    hosts:
    - "*"
  - port:
      number: 9001  # For MinIO Web UI
      name: minio-ui
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: minio
  namespace: mlflow
spec:
  hosts:
  - "*"
  gateways:
  - minio-gateway
  http:
  - match:
    - port: 9000  # Match for API requests
      uri:
        prefix: /
    route:
    - destination:
        host: minio
        port:
          number: 9000
  - match:
    - port: 9001  # Match for UI requests
      uri:
        prefix: /
    route:
    - destination:
        host: minio
        port:
          number: 9001
Apply the configuration:

kubectl apply -f minio/minio-istio.yaml

Using MinIO which comes with Kubeflow installation

Step 1: Configure the Network Policy

If you decide to use Kubeflow's MinIO as your MLflow storage backend, you need to point your MLflow configuration at the minio-service in the kubeflow namespace.

Also, there is a NetworkPolicy in the kubeflow namespace which only allows traffic to MinIO from two namespaces:

kubectl describe networkpolicy -n kubeflow minio
Name:         minio
Namespace:    kubeflow
Created on:   2025-04-28 14:20:07 +0330 +0330
Labels:       <none>
Annotations:  <none>
Spec:
  PodSelector:     app in (minio)
  Allowing ingress traffic:
    To Port: <any> (traffic allowed to all ports)
    From:
      NamespaceSelector: app.kubernetes.io/part-of in (kubeflow-profile)
    From:
      NamespaceSelector: kubernetes.io/metadata.name in (istio-system)
    From:
      PodSelector: <none>
  Not affecting egress traffic
  Policy Types: Ingress

Because we will deploy MLflow in the mlflow namespace in this scenario, it doesn't match any of those From: sources, so its TCP connections to MinIO are dropped. We need to modify this network policy to allow traffic between MLflow and MinIO.
We can apply the change using YAML or a patch:

Option 1:

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: minio
  namespace: kubeflow
spec:
  podSelector:
    matchLabels:
      app: minio
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              app.kubernetes.io/part-of: kubeflow-profile
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system
        - namespaceSelector:      # NEW: allow mlflow namespace
            matchLabels:
              kubernetes.io/metadata.name: mlflow
      ports:
        - protocol: TCP
          port: 9000           # adjust if your MinIO listens on a different port
EOF

Option 2: Patch command

kubectl patch networkpolicy minio -n kubeflow --type='json' -p='[
  {
    "op": "add",
    "path": "/spec/ingress/0/from/-",
    "value": {
      "namespaceSelector": {
        "matchLabels": {
          "kubernetes.io/metadata.name": "mlflow"
        }
      }
    }
  }
]'
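For reference, a JSON Patch "add" op whose path ends in /- appends to a list. This small Python sketch mimics what the patch does to the ingress rule, using a plain dict as a stand-in for the live NetworkPolicy object:

```python
import copy

# Simplified stand-in for the existing NetworkPolicy spec (two "from" sources)
netpol = {"spec": {"ingress": [{"from": [
    {"namespaceSelector": {"matchLabels": {"app.kubernetes.io/part-of": "kubeflow-profile"}}},
    {"namespaceSelector": {"matchLabels": {"kubernetes.io/metadata.name": "istio-system"}}},
]}]}}

patched = copy.deepcopy(netpol)
# JSON Patch {"op": "add", "path": "/spec/ingress/0/from/-"} appends here:
patched["spec"]["ingress"][0]["from"].append(
    {"namespaceSelector": {"matchLabels": {"kubernetes.io/metadata.name": "mlflow"}}}
)

# The mlflow namespace is now a third allowed ingress source
assert len(patched["spec"]["ingress"][0]["from"]) == 3
```

Unlike Option 1, the patch leaves the two existing sources untouched and only appends the mlflow namespace selector.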

Step 2: Create a Bucket in MinIO

Using mc (MinIO Client):

[Install](https://min.io/docs/minio/linux/reference/minio-mc.html#install-mc) `mc` and port-forward the MinIO service:

    ```
    kubectl port-forward svc/minio-service -n kubeflow 9000:9000
    ```

then create a bucket:

    ```
    mc alias set minio-kf http://localhost:9000 <myaccesskey> <mysecretkey>
    mc mb minio-kf/mlflow-bucket
    ```

MLflow

Does MLflow Need MinIO?

MLflow does not strictly require MinIO; however, it does need a storage backend to store artifacts and models. Here are some options:

  1. Local File Storage: You can use local paths to store artifacts, but this is not recommended for production environments due to scalability and persistence issues.
  2. Object Storage:
    • MinIO: If you prefer using an S3-compatible object storage service, MinIO is a popular choice for Kubernetes environments. It’s lightweight and easy to deploy.
    • Amazon S3: If you have access to AWS, you can use S3 directly.
    • Ceph Object Storage: Since you have a Ceph cluster, you can use it as an object storage backend. Ceph provides an S3-compatible interface, allowing you to use it similarly to MinIO or AWS S3.
  3. Database Storage: MLflow can also log to a relational database (e.g., PostgreSQL, MySQL) for tracking experiments.

Setting Up MLflow

We will start by creating a Dockerfile. This step is essential because the default MLflow image lacks the boto3 and psycopg2-binary packages, which are necessary for connecting MLflow to MinIO and PostgreSQL:

FROM ghcr.io/mlflow/mlflow:latest

RUN pip install psycopg2-binary boto3

CMD ["mlflow", "server"]

Then build:

docker build -t prezaei/mlflow-custom:v1.0 .

And deploy MLflow on Kubernetes by creating your own deployment YAML files.

Note that Kubernetes only expands $(VAR) references inside value: fields when VAR is declared earlier in the same container's env list; any other reference is kept as-is. So here $(POSTGRES_PASSWORD) would literally be interpreted as the string "$(POSTGRES_PASSWORD)", not the actual password, and we cannot use an env value like this:


name: BACKEND_STORE_URI
value: "postgresql+psycopg2://mlflow:$(POSTGRES_PASSWORD)@mlflow-postgres:5432/mlflow_db"


To fix it, construct the full URI inside the container using environment variables. Change your command: and args: to build the URI at startup, like this:

command: ["sh", "-c"]
args:
  - |
    mlflow server \
      --host=0.0.0.0 \
      --port=5000 \
      --backend-store-uri=postgresql+psycopg2://mlflow:${POSTGRES_PASSWORD}@mlflow-postgres:5432/mlflow_db \
      --default-artifact-root=s3://mlflow-bucket


Here’s a basic example using a deployment:

apiVersion: v1
kind: Service
metadata:
  name: mlflow-service
  namespace: mlflow
spec:
  selector:
    app: mlflow
  ports:
    - protocol: TCP
      port: 5000
      targetPort: 5000
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mlflow-sa
  namespace: mlflow
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
  namespace: mlflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      serviceAccountName: mlflow-sa
      containers:
      - name: mlflow
        image: prezaei/mlflow-custom:v1.0
        ports:
          - containerPort: 5000
        env:
          - name: BACKEND_STORE_URI
            value: "postgresql+psycopg2://mlflow@mlflow-postgres:5432/mlflow_db"
          - name: POSTGRES_PASSWORD
            valueFrom:
              secretKeyRef:
                name: mlflow-secret
                key: POSTGRES_MLFLOW_PASS
          - name: MLFLOW_S3_ENDPOINT_URL
            value: "http://minio.mlflow.svc.cluster.local:9000"
          - name: AWS_S3_ADDRESSING_STYLE
            value: "path"
          - name: AWS_ACCESS_KEY_ID
            valueFrom:
              secretKeyRef:
                name: mlflow-secret
                key: AWS_ACCESS_KEY_ID
          - name: AWS_SECRET_ACCESS_KEY
            valueFrom:
              secretKeyRef:
                name: mlflow-secret
                key: AWS_SECRET_ACCESS_KEY
        command: ["sh", "-c"]
        args:
          - |
            mlflow server \
              --host=0.0.0.0 \
              --port=5000 \
              --backend-store-uri=postgresql+psycopg2://mlflow:${POSTGRES_PASSWORD}@mlflow-postgres:5432/mlflow_db \
              --default-artifact-root=s3://mlflow-bucket \
              --artifacts-destination s3://mlflow-bucket
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "4Gi"
            cpu: "2"


And a secret:

apiVersion: v1
kind: Secret
metadata:
  name: mlflow-secret
  namespace: mlflow
type: Opaque
data:
  AWS_ACCESS_KEY_ID: bWxmbG93
  AWS_SECRET_ACCESS_KEY: VGsvUEFJa1I5fkxZbVp
  POSTGRES_MLFLOW_PASS: QXliRmoxVFdhMW

Istio

When using Istio in your Kubernetes cluster, you may need to consider Istio configurations for MinIO and MLflow to ensure proper traffic management, security, and observability. Here’s a breakdown of what you might need:

Configure MLflow with Istio

If you are also exposing MLflow outside the cluster or want to manage traffic to it, you should similarly set up an Istio Virtual Service for MLflow.

Example Configuration for MLflow

  1. Create a Virtual Service for MLflow:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mlflow
  namespace: mlflow
spec:
  gateways:
    - kubeflow/kubeflow-gateway
  hosts:
    - '*'
  http:
    - match:
        - uri:
            prefix: /mlflow/ # match any request with a URI that starts with /mlflow/
      rewrite:
        uri: / #requests matching /mlflow/ are rewritten to /, routing them to the root of the mlflow service
      route:
        - destination:
            host: mlflow-service.mlflow.svc.cluster.local
            port:
              number: 5000
    - match:
        - uri:
            prefix: /graphql
      rewrite:
        uri: /graphql
      route:
        - destination:
            host: mlflow-service.mlflow.svc.cluster.local
            port:
              number: 5000


We configured these settings to allow access to the MLflow UI at kubeflow.mydomain.com/mlflow/. However, without the second match rule, selecting run details in the MLflow UI returns an HTTP 404: the /graphql prefix handles the backend GraphQL API requests that the MLflow UI uses to fetch run data.

  2. Apply the Configurations:
kubectl apply -f mlflow-virtualservice.yaml

Next, we need to integrate an MLflow tab into the central dashboard of Kubeflow. So we will modify the ConfigMap for Kubeflow's dashboard to make MLflow visible:

kubectl edit cm centraldashboard-config -n kubeflow

and add this entry to the menuLinks section:

            { 
                "type": "item",
                "link": "/mlflow/",
                "text": "MlFlow",
                "icon": "icons:cached"
            },

Restarting the central dashboard deployment will result in the tab being added.

kubectl rollout restart deploy centraldashboard -n kubeflow

Part 2

Nice work getting MLflow into Kubeflow! Now let’s walk through a detailed guide on how to test the integration. The goal is to verify that MLflow is working smoothly within the Kubeflow environment—logging experiments, models, parameters, and metrics. Here's how you can do it step by step:


✅ 1. Decide Where to Run the Code

To best test the integration, you should run the MLflow code inside Kubeflow Notebooks (e.g., a Jupyter Notebook in a Kubeflow workspace). This ensures that:

  • You're using the same Kubernetes network.
  • MLflow client talks directly to the MLflow tracking server you integrated.
  • Any paths (e.g., artifact store, model registry) resolve correctly within the cluster.

💡 Running from your laptop is okay only if you expose MLflow’s tracking server externally, which is not recommended for early testing due to security/config complexity.


✅ 2. Prepare the Kubeflow Notebook Environment

  1. Launch a notebook server in Kubeflow:
    • Go to the Kubeflow Dashboard → “Notebooks”.
    • Create a new notebook server (choose a Python-based image that supports pip).
  2. Install MLflow in the notebook:
pip install mlflow boto3 scikit-learn pandas

You may also install any dependencies your test script needs.


✅ 3. Configure MLflow Client in the Notebook

Set up the MLflow client to point to your MLflow Tracking Server. Usually, this is something like:

import mlflow
import os

# Point to your MLflow tracking server
mlflow.set_tracking_uri("http://mlflow-service.<namespace>.svc.cluster.local:5000")
print("Tracking URI:", mlflow.get_tracking_uri())

Replace <namespace> with the actual Kubernetes namespace where MLflow is deployed.


✅ 4. Set MinIO Credentials

os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio.mlflow.svc.cluster.local:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "mlflow"
os.environ["AWS_SECRET_ACCESS_KEY"] = "***********"

✅ 5. Run a Simple MLflow Test Script

Here’s a minimal working example:

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import pandas as pd

# Data
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y)

with mlflow.start_run(run_name="kubeflow-test-run") as run:
    model = RandomForestRegressor(n_estimators=100, max_depth=5)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("mse", mean_squared_error(y_test, predictions))

    # Log the model
    mlflow.sklearn.log_model(model, "model")

    print("🏃 View run at:", f"{mlflow.get_tracking_uri()}/#/experiments/0/runs/{run.info.run_id}")

✅ 6. Verify Results in Kubeflow Dashboard

  • Navigate to your MLflow dashboard integrated into Kubeflow.
  • Check if the experiment, run, parameters, metrics, and model are logged.
  • Try registering a model and promoting it to a stage if the registry is enabled.

In the Experiments section of MLflow, you can view a list of runs and access detailed information for each run by selecting them:


In MLflow, the run details provide a comprehensive overview of a specific experiment run. Here’s the kind of information you can typically find in the run details:

1. Basic Run Information

  • Run ID: A unique identifier for the run.
  • Experiment ID: The ID of the experiment to which the run belongs.
  • Start Time: The timestamp when the run started.
  • End Time: The timestamp when the run finished.
  • Duration: The total time taken for the run.

2. Parameters

  • Parameters: Key-value pairs representing the hyperparameters or configurations used during the run.

3. Metrics

  • Metrics: Key-value pairs of numerical values that represent the performance of the model (e.g., accuracy, loss) at various stages of the run.
  • Logging: Metrics can be logged at different intervals throughout the run.

4. Artifacts

  • Artifacts: Files or outputs generated during the run, such as:
    • Model files
    • Plots and figures
    • Data files
    • Logs

5. Tags

  • Tags: Key-value pairs used to categorize and add metadata to the run (e.g., version of the code, experiment type).

6. Source Information

  • Source: Information about the source of the run, including:
    • The script or notebook used to run the experiment
    • The entry point of the run (if applicable)

7. Status

  • Status: The current state of the run (e.g., RUNNING, FINISHED, FAILED, or KILLED).

8. User Information

  • User: Information about the user who initiated the run (if applicable).



Understanding MLflow UI Components

The MLflow UI is an integral part of the MLflow platform, providing a visual interface for monitoring and comparing machine learning experiments. Here's an in-depth look at its components:
Experiments and Runs

Experiments: Group related runs for easy comparison and analysis.
Runs: Individual executions of a machine learning model, each with its own set of parameters, metrics, and artifacts.


Detailed Run Information

Access detailed information for each run, including parameters, metrics, and artifacts.
View the history of a metric by selecting its name under the Metrics section.


Views and Comparisons

Table View: Lists runs with sortable columns for names, creation times, and other key data.
Chart View: Visualize and compare runs using various charts, such as parallel coordinates.


Artifacts

Store and retrieve run outputs such as models and visualizations.


Metric History

Track the performance of metrics over time, such as Mean Average Precision.


Integration and Extensibility

The MLflow UI can display runs tracked from various sources, including local file stores and remote tracking servers.


In MLflow, logging and registering a model serve different purposes in the machine learning lifecycle. Here's a breakdown of the differences:

Logging a Model

  • Definition: Logging a model refers to the process of saving model artifacts (like the model itself, parameters, metrics, and artifacts) during an experiment.
  • Purpose: It allows you to keep track of different versions of models and their performance metrics during experimentation.
  • Usage: Typically done during training or evaluation, using functions like mlflow.log_model().
  • Scope: Logged models are associated with a specific run in the MLflow tracking server.

Registering a Model

  • Definition: Registering a model involves adding a model to the MLflow Model Registry, which is a centralized repository for managing and versioning models.
  • Purpose: It allows you to organize, manage, and deploy models in a more structured way. You can also promote models through stages (e.g., Staging, Production).
  • Usage: Done after logging a model, using functions like mlflow.register_model().
  • Scope: Registered models can be accessed and used independently of specific runs, facilitating model sharing and deployment.
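The scope difference is visible in the URI schemes the two operations use. A small illustration (the run ID and names below are placeholders):

```python
run_id = "0a1b2c3d"                 # placeholder run ID
model_name = "DiabetesRandomForest"

# A logged model is addressed through the run that produced it:
logged_uri = f"runs:/{run_id}/model"

# A registered model is addressed independently of any run,
# by name plus version, stage, or alias:
by_version = f"models:/{model_name}/1"
by_stage = f"models:/{model_name}/Staging"
by_alias = f"models:/{model_name}@prod"

print(logged_uri)
print(by_alias)
```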

Now let’s take it further and go through the Model Registry, versioning, staging, and visualizations. Here's a full guide with examples that you can use in your workflows.


📘 MLflow Model Registry – End-to-End Example

✅ Prerequisites

  • MLflow Tracking Server with PostgreSQL backend and MinIO set up ✅
  • Models logged to tracking server ✅
  • MLflow client access from notebook ✅

1. 🔖 Registering a Model

Once a model is logged (as you've done with mlflow.sklearn.log_model(model, "model")), you can register it.

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register the model under a name
model_uri = f"runs:/{run.info.run_id}/model"
model_name = "DiabetesRandomForest"

result = mlflow.register_model(model_uri=model_uri, name=model_name)
print(f"🎯 Registered model version: {result.version}")


2. 📌 Add a Description

Adding descriptions to registered models and versions helps with collaboration.

client.update_registered_model(
    name=model_name,
    description="A RandomForestRegressor trained on the diabetes dataset."
)

client.update_model_version(
    name=model_name,
    version=result.version,
    description="Version 1: 100 estimators, max depth 5."
)




🆕 3. Add a New Model Version

To add a new version, you can log a different model (e.g., new params or retrained model) and then register it under the same name.

# Train a new model with different parameters inside a fresh run
with mlflow.start_run() as run_v2:
    model_v2 = RandomForestRegressor(n_estimators=200, max_depth=8)
    model_v2.fit(X_train, y_train)
    mlflow.sklearn.log_model(model_v2, "model")
    new_model_uri = f"runs:/{run_v2.info.run_id}/model"

# Register it as a new version under the same name
registered_v2 = mlflow.register_model(model_uri=new_model_uri, name=model_name)

print(f"📦 Registered new model version: {registered_v2.version}")

🏷️ 4. Add Tags to a Model Version

Tags are useful for categorization or additional metadata like author, dataset, accuracy, etc.

client.set_model_version_tag(
    name=model_name,
    version=registered_v2.version,
    key="model_type",
    value="random_forest"
)

client.set_model_version_tag(
    name=model_name,
    version=registered_v2.version,
    key="dataset_version",
    value="v1.1"
)


🔗 5. Use Aliases for Model Versions (MLflow ≥ 2.3)

Aliases allow you to define human-readable names for model versions like @latest, @staging, etc.

# Point the "prod" alias at the newly registered version
client.set_registered_model_alias(
    name=model_name,
    alias="prod",
    version=registered_v2.version
)

# You can now refer to this model like:
model = mlflow.sklearn.load_model(f"models:/{model_name}@prod")


You can also update or delete an alias:

# Reassign the alias to a different version (an alias points to exactly one version)
client.set_registered_model_alias(name=model_name, alias="prod", version="1")

# Delete an alias
client.delete_registered_model_alias(name=model_name, alias="prod")




6. 🔄 List Model Versions

for mv in client.search_model_versions(f"name='{model_name}'"):
    print(f"🔢 Version {mv.version} - Stage: {mv.current_stage}")


7. 🚦 Transition Between Stages

MLflow supports these stages: None, Staging, Production, Archived. (Stages are deprecated as of MLflow 2.9 in favor of aliases and tags, but remain functional.)

client.transition_model_version_stage(
    name=model_name,
    version=result.version,
    stage="Staging",  # or "Production", "Archived"
    archive_existing_versions=True
)

In MLflow, stages refer to the different phases that a model can be in within the Model Registry. These stages help manage the lifecycle of machine learning models, allowing teams to organize, promote, and deploy models systematically. Here are the main stages in MLflow:

1. Staging

  • Definition: A model in the Staging stage is considered to be ready for testing and validation.
  • Purpose: This stage allows users to evaluate the model in a controlled environment before it is promoted to production.
  • Usage: Typically used for models that have been recently logged and need to be tested for performance.

2. Production

  • Definition: A model in the Production stage is actively being used in a live environment.
  • Purpose: This indicates that the model has passed all necessary tests and is deemed reliable for making predictions on real data.
  • Usage: Models in this stage are often monitored for performance and may be updated or replaced as new models are developed.

3. Archived

  • Definition: A model in the Archived stage is no longer in active use.
  • Purpose: This stage is used to keep the model in the registry for historical reference while indicating that it should not be used for new predictions.
  • Usage: Models may be archived for various reasons, such as being replaced by newer versions or being deemed obsolete.

Summary of Stages

  • Staging: For testing and validation.
  • Production: For live use and active predictions.
  • Archived: For historical reference, not in active use.

Transitioning Between Stages

Models can transition between these stages based on their performance, testing results, and the needs of the organization. This structured approach to model management helps ensure that only the best-performing models are deployed in production, while also maintaining a clear history of model versions and their statuses.
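This transition logic can be automated, for example as a CI gate that only promotes a version whose validation metric clears a threshold. A sketch under assumptions: the metric, threshold, and scores below are hypothetical, and the actual transition call needs a tracking server with the model registered, so it is left behind a flag:

```python
def pick_promotable(versions, threshold=60.0):
    """Given (version, validation_rmse) pairs, return the best version
    that beats the threshold, or None if no version qualifies."""
    passing = [(v, score) for v, score in versions if score < threshold]
    if not passing:
        return None
    return min(passing, key=lambda vs: vs[1])[0]

# Hypothetical validation scores for three registered versions
best = pick_promotable([("1", 62.1), ("2", 54.3), ("3", 57.8)])
print(best)  # the version with the lowest passing RMSE

PROMOTE = False  # enable when a tracking server and registered model exist
if PROMOTE and best is not None:
    from mlflow.tracking import MlflowClient
    client = MlflowClient()
    client.transition_model_version_stage(
        name="DiabetesRandomForest",
        version=best,
        stage="Production",
        archive_existing_versions=True,  # demote the previous Production version
    )
```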


8. 📊 View in UI

  • Visit: http://<your-mlflow-host>/#/models
  • Click on DiabetesRandomForest
  • You’ll see all versions, stages, parameters, metrics, and artifacts.

9. 🎯 Load Model by Stage (e.g., for serving)

model = mlflow.sklearn.load_model(model_uri=f"models:/{model_name}/Staging")
predictions = model.predict(X_test)


10. 📉 Visualizations: Auto-Generated in UI

In the MLflow UI (under the experiment or model), MLflow provides charts like:

  • Line chart of metrics per run
  • Parallel coordinates for comparing multiple runs
  • Run comparison and filtering

If you want to create custom charts:

import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x=predictions, y=y_test)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Prediction vs Actual")
plt.show()



✅ Summary

  • Log Model: mlflow.sklearn.log_model(...)
  • Register: mlflow.register_model(...)
  • Describe: client.update_registered_model(...)
  • Transition: client.transition_model_version_stage(...)
  • Load by Stage: mlflow.sklearn.load_model("models:/ModelName/Stage")
