Sanskriti Harmukh for Vultr

Posted on Jul 2 with Aashish Chaurasiya • Originally published at docs.vultr.com

Deploying Kubeflow as an AWS SageMaker Alternative

#mlops #kubernetes #ai #devops

Kubeflow is the open-source MLOps platform for Kubernetes, a self-hosted alternative to AWS SageMaker that bundles JupyterLab notebooks, KFP pipelines, the Trainer v2 API for distributed training, KServe for model serving, and Katib for hyperparameter optimisation. This guide deploys Kubeflow on a multi-node Kubernetes cluster, creates a user profile, runs a sample pipeline, executes a TrainJob, deploys an InferenceService, and launches a Katib experiment. By the end, you'll have a working Kubeflow platform covering the full ML lifecycle on your own cluster.

Prerequisites: A Kubernetes cluster (v1.31+) with at least 3 nodes, each with at least 4 CPUs and 16 GB RAM; kubectl and kustomize (v5.4.3+) on your workstation; and a default StorageClass for PVC provisioning.

SageMaker → Kubeflow Mapping

SageMaker	Kubeflow
SageMaker Studio	Kubeflow Notebooks (JupyterLab / VS Code / RStudio)
Training Jobs	Trainer v2 (PyTorch, DeepSpeed, MLX, JAX, XGBoost)
Pipelines	KFP — Kubeflow Pipelines
Model Registry	Kubeflow Model Registry
Endpoints	KServe (serverless serving + autoscaling)
Experiments	Katib (HPO + AutoML)

Install Kubeflow

1. Verify cluster connectivity and version:

$ kubectl cluster-info
$ kubectl version

Server Version must be 1.31 or later.
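The version check can also be scripted. The sketch below assumes the JSON shape printed by `kubectl version -o json` (a `serverVersion` object with string `major` and `minor` fields). The helper name `server_meets_minimum` and the sample payload are hypothetical. Some managed clusters report a minor such as `31+`, so only the leading digits are compared.

```python
import json
import re


def server_meets_minimum(version_json: str, major: int = 1, minor: int = 31) -> bool:
    """Return True if serverVersion in `kubectl version -o json` output is >= major.minor."""
    server = json.loads(version_json)["serverVersion"]
    # Values such as "31+" appear on some managed clusters; keep the leading digits only.
    found_major = int(re.match(r"\d+", server["major"]).group())
    found_minor = int(re.match(r"\d+", server["minor"]).group())
    return (found_major, found_minor) >= (major, minor)


if __name__ == "__main__":
    # Hypothetical sample; on a real cluster, feed in the output of `kubectl version -o json`.
    sample = '{"serverVersion": {"major": "1", "minor": "31+", "gitVersion": "v1.31.2"}}'
    print(server_meets_minimum(sample))
```

Comparing `(major, minor)` as a tuple avoids the string-comparison trap where "9" sorts after "31".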

2. Clone and check out a Kubeflow release:

$ git clone https://github.com/kubeflow/manifests.git
$ cd manifests
$ git checkout 26.03

3. Apply the example overlay with a retry loop. The first pass can fail because some custom resources are submitted before their CRDs are registered; a later pass succeeds once the CRDs exist:

$ for i in 1 2 3 4 5; do \
    kustomize build example | kubectl apply --server-side --force-conflicts -f - && break \
    || { echo "Attempt $i failed, retrying in 30s..."; sleep 30; }; \
  done
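For readers who prefer to script the install, the same retry pattern could look roughly like this in Python. This is a sketch, not part of the Kubeflow tooling: `retry` and `apply_example_overlay` are hypothetical helper names, and the only commands and flags used are the ones from the shell loop above.

```python
import subprocess
import time
from typing import Callable


def retry(action: Callable[[], bool], attempts: int = 5, delay_seconds: float = 30.0) -> bool:
    """Call `action` until it returns True or `attempts` runs out."""
    for attempt in range(1, attempts + 1):
        if action():
            return True  # plays the role of `&& break` in the shell loop
        print(f"Attempt {attempt} failed, retrying in {delay_seconds}s...")
        time.sleep(delay_seconds)
    return False


def apply_example_overlay() -> bool:
    """Pipe `kustomize build example` into `kubectl apply`, as in the shell loop."""
    build = subprocess.run(["kustomize", "build", "example"], capture_output=True)
    if build.returncode != 0:
        return False
    apply = subprocess.run(
        ["kubectl", "apply", "--server-side", "--force-conflicts", "-f", "-"],
        input=build.stdout,
    )
    return apply.returncode == 0
```

Calling `retry(apply_example_overlay)` from the manifests directory mirrors the five-attempt, 30-second loop.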

Warning: The default install ships demo credentials user@example.com / 12341234. Rotate them before exposing the cluster.

4. Verify the install:

$ kubectl get pods -n kubeflow --field-selector=status.phase!=Succeeded
$ kubectl get svc istio-ingressgateway -n istio-system
$ kubectl get crd | grep -E "kubeflow|kserve|katib|istio|knative|trainer" | wc -l
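When the pod list is long, it can help to filter it programmatically. The sketch below assumes the standard Pod list JSON from `kubectl get pods -n kubeflow -o json` (`items[].status.phase` and `items[].status.containerStatuses[].ready`). The function name and the sample pod names are hypothetical.

```python
import json


def pods_not_ready(pod_list_json: str) -> list[str]:
    """Return names of pods that are neither Succeeded nor running with all containers ready."""
    not_ready = []
    for pod in json.loads(pod_list_json)["items"]:
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            continue  # completed Jobs are expected, matching the field selector above
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if status.get("phase") != "Running" or not all_ready:
            not_ready.append(pod["metadata"]["name"])
    return not_ready


if __name__ == "__main__":
    # Hypothetical pod list for illustration only.
    sample = json.dumps({"items": [
        {"metadata": {"name": "pod-a"},
         "status": {"phase": "Running", "containerStatuses": [{"ready": True}]}},
        {"metadata": {"name": "pod-b"},
         "status": {"phase": "Pending"}},
    ]})
    print(pods_not_ready(sample))
```

An empty list suggests every remaining pod is up; anything else is a starting point for `kubectl describe pod`.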

Create a User Profile

1. Save the Profile manifest:

$ nano user-profile.yaml

apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: kubeflow-user-example-com
spec:
  owner:
    kind: User
    name: user@example.com

2. Apply it and confirm the namespace and service account:

$ kubectl apply -f user-profile.yaml
$ kubectl get namespace kubeflow-user-example-com
$ kubectl get serviceaccount default-editor -n kubeflow-user-example-com

3. Confirm there's a default StorageClass:

$ kubectl get storageclass

If none is marked default, patch one:

$ kubectl patch storageclass STORAGE_CLASS_NAME \
    -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
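The default-class check can be automated as well. This sketch assumes the list JSON from `kubectl get storageclass -o json` and looks for the same `storageclass.kubernetes.io/is-default-class` annotation that the patch command sets. The function name and the class names in the sample are hypothetical.

```python
import json

DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_storage_classes(storageclass_list_json: str) -> list[str]:
    """Return the names of StorageClasses annotated as default."""
    names = []
    for sc in json.loads(storageclass_list_json)["items"]:
        annotations = sc["metadata"].get("annotations") or {}
        if annotations.get(DEFAULT_ANNOTATION) == "true":
            names.append(sc["metadata"]["name"])
    return names


if __name__ == "__main__":
    # Hypothetical StorageClass list for illustration only.
    sample = json.dumps({"items": [
        {"metadata": {"name": "fast-ssd", "annotations": {DEFAULT_ANNOTATION: "true"}}},
        {"metadata": {"name": "standard"}},
    ]})
    print(default_storage_classes(sample))
```

An empty result means a class still needs to be patched; more than one entry means several classes claim to be the default, which is worth resolving before creating PVCs.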

Open the Dashboard

1. Port-forward the Istio ingress gateway:

$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

2. Open http://localhost:8080 and sign in with user@example.com / 12341234.

3. Switch namespace to kubeflow-user-example-com from the top-left selector.

Create a Notebook Server

1. Go to Notebooks → New Notebook and name the server ml-workspace.
2. Pick an image (JupyterLab, VS Code, or RStudio).
3. Set the minimum CPU to 0.5, memory to 1Gi, and the workspace volume to 5Gi.
4. Click Launch, then open the notebook once it is ready.

Run a quick sanity check inside the notebook:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.4f}")

Build and Run a Sample Pipeline

1. Save the pipeline:

$ nano sample_pipeline.py

from kfp import dsl, compiler

@dsl.component(base_image="python:3.11-slim")
def preprocess() -> str:
    import json
    return json.dumps({"samples": 1000, "features": 10, "status": "preprocessed"})

@dsl.component(base_image="python:3.11-slim")
def train(input_data: str) -> str:
    import json
    return json.dumps({"model": "random_forest", "accuracy": 0.95, "input": json.loads(input_data)})

@dsl.component(base_image="python:3.11-slim")
def evaluate(input_data: str):
    import json
    r = json.loads(input_data)
    print(f"Model: {r['model']}, Accuracy: {r['accuracy']}")

@dsl.pipeline(name="sample-ml-pipeline")
def ml_pipeline():
    p = preprocess()
    t = train(input_data=p.output)
    evaluate(input_data=t.output)

compiler.Compiler().compile(ml_pipeline, "pipeline.yaml")

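2. Compile it. The script imports kfp, so install the package first (pip install kfp) if your environment lacks it: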

$ python3 sample_pipeline.py

3. Allow notebook traffic to reach the pipeline service:

$ nano allow-pipeline-access.yaml

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-notebook-to-pipeline
  namespace: kubeflow
spec:
  selector:
    matchLabels:
      app: ml-pipeline
  rules:
    - from:
        - source:
            namespaces: ["kubeflow-user-example-com"]

$ kubectl apply -f allow-pipeline-access.yaml

4. Upload and run the pipeline (from a notebook terminal):

$ curl -s -F "uploadfile=@pipeline.yaml" \
    -H "kubeflow-userid: user@example.com" \
    http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v2beta1/pipelines/upload

$ curl -s -X POST -H "Content-Type: application/json" \
    -H "kubeflow-userid: user@example.com" \
    http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v2beta1/experiments \
    -d '{"display_name":"sample-experiment","namespace":"kubeflow-user-example-com"}'

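Each call returns a JSON object. Use the pipeline and experiment IDs in those responses to locate the uploaded pipeline and start a run from the Pipelines UI.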
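To see what the three components in sample_pipeline.py pass to one another, the same logic can be run locally as plain functions. This is only a sketch with the `@dsl.component` decorators removed. In the real pipeline each step runs in its own container and KFP carries the string outputs between them; here `evaluate` returns its message instead of printing it so the result is easy to inspect.

```python
import json


def preprocess() -> str:
    # Same payload the pipeline component emits.
    return json.dumps({"samples": 1000, "features": 10, "status": "preprocessed"})


def train(input_data: str) -> str:
    # Wraps the preprocess output inside the (hard-coded) training result.
    return json.dumps({"model": "random_forest", "accuracy": 0.95, "input": json.loads(input_data)})


def evaluate(input_data: str) -> str:
    r = json.loads(input_data)
    return f"Model: {r['model']}, Accuracy: {r['accuracy']}"


if __name__ == "__main__":
    # preprocess -> train -> evaluate, mirroring p.output and t.output in the pipeline.
    print(evaluate(train(preprocess())))
```

Because every value crossing a step boundary is a JSON string, each component has to call `json.loads` on its input before using it.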

Submit a Distributed Training Job

$ nano trainjob.yaml

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-training
  namespace: kubeflow-user-example-com
spec:
  runtimeRef:
    name: torch-distributed
  trainer:
    image: ghcr.io/kubeflow/katib/pytorch-mnist-cpu:v0.19.0
    numNodes: 2
    resourcesPerNode:
      requests: {cpu: "500m", memory: "1Gi"}
      limits:   {cpu: "1",    memory: "2Gi"}

$ kubectl apply -f trainjob.yaml
$ kubectl get trainjob -n kubeflow-user-example-com

The STATE column flips to Complete once training finishes.
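Before submitting, it is worth checking that the cluster can fit the job. The sketch below multiplies the per-node values from trainjob.yaml by numNodes. It covers only the trainer containers and parses only the quantity forms used in this guide (`m` for CPU; `Ki`, `Mi`, `Gi` for memory), not the full Kubernetes quantity grammar. All helper names are hypothetical.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "500m" or "1" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


_MEMORY_FACTORS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "1Gi" or "256Mi" to bytes."""
    for suffix, factor in _MEMORY_FACTORS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already a byte count


def job_total(num_nodes: int, per_node: dict) -> dict:
    """Total CPU and memory across all training nodes."""
    return {
        "cpu_millicores": num_nodes * cpu_to_millicores(per_node["cpu"]),
        "memory_bytes": num_nodes * memory_to_bytes(per_node["memory"]),
    }


if __name__ == "__main__":
    # Values copied from trainjob.yaml: numNodes 2, requests and limits per node.
    print("requests:", job_total(2, {"cpu": "500m", "memory": "1Gi"}))
    print("limits:  ", job_total(2, {"cpu": "1", "memory": "2Gi"}))
```

For this manifest the two nodes together request 1 CPU and 2 GiB, and may burst to 2 CPUs and 4 GiB.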

Serve a Model with KServe

$ nano sklearn-iris.yaml

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: kubeflow-user-example-com
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      resources:
        requests: {cpu: 100m, memory: 256Mi}
        limits:   {cpu: "1",  memory: 1Gi}

$ kubectl apply -f sklearn-iris.yaml
$ kubectl get inferenceservice sklearn-iris -n kubeflow-user-example-com -w

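When READY reads True, send a prediction request from inside the cluster (for example, from a notebook terminal), because the URL below is cluster-local: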

$ curl -s --max-time 30 -H "Content-Type: application/json" \
    http://sklearn-iris-predictor-00001-private.kubeflow-user-example-com.svc.cluster.local/v1/models/sklearn-iris:predict \
    -d '{"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}'
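The request and response follow the V1 inference protocol: the body carries an `instances` array and the reply carries a `predictions` array. The following sketch builds and parses those payloads without calling the endpoint. The helper names are hypothetical, and the response string in the demo is illustrative rather than captured from a live service.

```python
import json

IRIS_FEATURE_COUNT = 4  # sepal length, sepal width, petal length, petal width


def build_predict_body(instances: list[list[float]]) -> str:
    """Serialize rows into a V1-protocol request body, rejecting malformed rows."""
    for row in instances:
        if len(row) != IRIS_FEATURE_COUNT:
            raise ValueError(f"expected {IRIS_FEATURE_COUNT} features, got {len(row)}")
    return json.dumps({"instances": instances})


def parse_predictions(response_body: str) -> list:
    """Extract the predictions array from a V1-protocol response body."""
    return json.loads(response_body)["predictions"]


if __name__ == "__main__":
    body = build_predict_body([[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]])
    print(body)
    # Illustrative response body only; not taken from a running InferenceService.
    print(parse_predictions('{"predictions": [1, 1]}'))
```

Validating the row length on the client side catches shape mistakes before they surface as a server-side error from the model.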

Tune Hyperparameters with Katib

Inside a notebook:

import kubeflow.katib as katib

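def objective(parameters):
    import time
    time.sleep(5)
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    # Katib's metrics collector reads "<metric-name>=<value>" lines from stdout,
    # so print the metric instead of returning it.
    print(f"result={result}")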

parameters = {
    "a": katib.search.int(min=10, max=20),
    "b": katib.search.double(min=0.1, max=0.2),
}

client = katib.KatibClient(namespace="kubeflow-user-example-com")
name = "tune-experiment"
client.tune(
    name=name,
    objective=objective,
    parameters=parameters,
    objective_metric_name="result",
    objective_type="maximize",
    algorithm_name="random",
    max_trial_count=4,
    parallel_trial_count=2,
    resources_per_trial={"cpu": "1", "memory": "1Gi"},
)

client.wait_for_experiment_condition(name=name)
print(client.get_optimal_hyperparameters(name))
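To get a feel for what the experiment should find, the objective can be evaluated locally. Over the search space above (a from 10 to 20, b from 0.1 to 0.2), 4a − b² is largest at a = 20 and b = 0.1, so good trials should land near 79.99. The sketch below is a local stand-in using Python's `random` module, not Katib's implementation of the random algorithm; `objective_value` and `random_search` are hypothetical names.

```python
import random


def objective_value(a: int, b: float) -> float:
    """The same formula the Katib objective computes."""
    return 4 * a - b ** 2


def random_search(trials: int, seed: int = 0) -> tuple[float, int, float]:
    """Sample the search space uniformly and return (best value, a, b)."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        a = rng.randint(10, 20)      # inclusive bounds, like the int search space
        b = rng.uniform(0.1, 0.2)
        value = objective_value(a, b)
        if best is None or value > best[0]:
            best = (value, a, b)
    return best


if __name__ == "__main__":
    print("upper bound:", objective_value(20, 0.1))
    print("best of 4 random trials:", random_search(trials=4))
```

With only four trials, as in `max_trial_count=4`, the best sample can sit well below the upper bound; raising the trial count narrows the gap.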

Rotate the Demo Password

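Generate a bcrypt hash of the new password, store it in the dex-passwords secret, and restart Dex so it picks up the change:

$ pip install bcrypt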
$ python3 -c "import bcrypt; print(bcrypt.hashpw(b'YOUR_PASSWORD', bcrypt.gensalt()).decode())"
$ kubectl create secret generic dex-passwords -n auth \
    --from-literal=DEX_USER_PASSWORD='BCRYPT_HASH_HERE' \
    --dry-run=client -o yaml | kubectl apply -f -
$ kubectl rollout restart deployment dex -n auth

Next Steps

Kubeflow is running with notebooks, pipelines, training, serving, and HPO. From here you can:

Add GPU workers and assign them to TrainJobs/InferenceServices via node selectors
Front the Istio gateway with cert-manager for production HTTPS
Wire S3-compatible storage for KFP artifacts and KServe model URIs

For the full guide with additional tips, visit the original article on Vultr Docs.

Top comments (1)

Aldo • Jul 8

It's always tempting to look at self-hosted MLOps platforms like Kubeflow, especially when comparing raw compute costs against a managed service like SageMaker. From a SaaS engineering perspective, though, the decision often boils down to a deeper total cost of ownership that extends well beyond infrastructure bills. We've seen firsthand how quickly the initial promise of control can turn into a substantial operational burden. Running Kubeflow reliably in production means you're essentially building and maintaining your own distributed system, complete with all the Kubernetes-level complexities like managing Istio for networking, dealing with various storage provisioners, and keeping up with the constant stream of component upgrades and patches.

This shift means your platform team isn't just enabling data scientists; they're becoming MLOps experts, deep-diving into operator logs and troubleshooting obscure Kubernetes resource issues when a training job stalls. While the flexibility to customize is undeniable, that flexibility comes at the cost of engineering velocity for your core product features. For us, the calculus often leans towards offloading that undifferentiated heavy lifting to a managed service, freeing up our valuable engineering talent to focus on what truly differentiates our SaaS offering, unless there's an extremely compelling reason around specific compliance, hyper-optimization at scale, or proprietary IP that absolutely cannot reside on a managed platform.