Sanskriti Harmukh for Vultr

Posted with Aashish Chaurasiya • Originally published at docs.vultr.com

Deploying Kubeflow as an AWS SageMaker Alternative

Kubeflow is the open-source MLOps platform for Kubernetes, a self-hosted alternative to AWS SageMaker that bundles JupyterLab notebooks, KFP pipelines, the Trainer v2 API for distributed training, KServe for model serving, and Katib for hyperparameter optimisation. This guide deploys Kubeflow on a multi-node Kubernetes cluster, creates a user profile, runs a sample pipeline, executes a TrainJob, deploys an InferenceService, and launches a Katib experiment. By the end, you'll have a working Kubeflow platform covering the full ML lifecycle on your own cluster.

Prerequisites: a Kubernetes cluster (v1.31+) with 3+ nodes and at least 4 CPU / 16 GB RAM per node; kubectl and kustomize (v5.4.3+) on your workstation; and a default StorageClass for PVC provisioning.


SageMaker → Kubeflow Mapping

| SageMaker        | Kubeflow                                            |
|------------------|-----------------------------------------------------|
| SageMaker Studio | Kubeflow Notebooks (JupyterLab / VS Code / RStudio) |
| Training Jobs    | Trainer v2 (PyTorch, DeepSpeed, MLX, JAX, XGBoost)  |
| Pipelines        | KFP (Kubeflow Pipelines)                            |
| Model Registry   | Kubeflow Model Registry                             |
| Endpoints        | KServe (serverless serving + autoscaling)           |
| Experiments      | Katib (HPO + AutoML)                                |

Install Kubeflow

1. Verify cluster connectivity and version:

$ kubectl cluster-info
$ kubectl version

Server Version must be 1.31 or later.

2. Clone and check out a Kubeflow release:

$ git clone https://github.com/kubeflow/manifests.git
$ cd manifests
$ git checkout 26.03

3. Apply the example overlay with a retry loop (the first pass can fail when custom resources are applied before their CRDs are established):

$ for i in 1 2 3 4 5; do \
    kustomize build example | kubectl apply --server-side --force-conflicts -f - && break \
    || { echo "Attempt $i failed, retrying in 30s..."; sleep 30; }; \
  done

Warning: The default install ships demo credentials user@example.com / 12341234. Rotate them before exposing the cluster.

4. Verify the install:

$ kubectl get pods -n kubeflow --field-selector=status.phase!=Succeeded
$ kubectl get svc istio-ingressgateway -n istio-system
$ kubectl get crd | grep -E "kubeflow|kserve|katib|istio|knative|trainer" | wc -l
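Steps 3 and 4 can also be scripted. The following is a minimal sketch, not part of the Kubeflow manifests: `retry` and `unready_pods` are hypothetical helper names, and `unready_pods` assumes it is given the output of `kubectl get pods -n kubeflow -o json`.

```python
import json
import time
from typing import Callable


def retry(action: Callable[[], bool], attempts: int = 5, delay: float = 30.0) -> bool:
    """Call `action` until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if action():
            return True
        if attempt < attempts:
            print(f"Attempt {attempt} failed, retrying in {delay:g}s...")
            time.sleep(delay)
    return False


def unready_pods(pod_list_json: str) -> list[str]:
    """Return names of pods that are neither Succeeded nor Ready."""
    unready = []
    for pod in json.loads(pod_list_json)["items"]:
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            continue  # completed Jobs are skipped, like the field selector in step 4
        conditions = status.get("conditions", [])
        ready = any(c["type"] == "Ready" and c["status"] == "True" for c in conditions)
        if not ready:
            unready.append(pod["metadata"]["name"])
    return unready
```

Here `action` would wrap the `kustomize build example | kubectl apply ...` pipeline, for example through `subprocess.run`, and return whether it exited with status 0. `retry` returns False when every attempt fails, so a calling script can stop instead of carrying on with a half-applied install.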

Create a User Profile

1. Save the Profile manifest:

$ nano user-profile.yaml

Add the following manifest:

apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: kubeflow-user-example-com
spec:
  owner:
    kind: User
    name: user@example.com

2. Apply it and confirm the namespace and service account:

$ kubectl apply -f user-profile.yaml
$ kubectl get namespace kubeflow-user-example-com
$ kubectl get serviceaccount default-editor -n kubeflow-user-example-com

3. Confirm there's a default StorageClass:

$ kubectl get storageclass

If none is marked default, patch one:

$ kubectl patch storageclass STORAGE_CLASS_NAME \
    -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
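The default-StorageClass check in step 3 can be automated by looking for the same annotation that the patch command sets. This is a small sketch under one assumption: the input is the output of `kubectl get storageclass -o json`. The helper name `default_storage_classes` is made up for illustration.

```python
import json

# Annotation that marks a StorageClass as the cluster default.
DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_storage_classes(sc_list_json: str) -> list[str]:
    """Return the names of all StorageClasses annotated as default."""
    names = []
    for sc in json.loads(sc_list_json)["items"]:
        annotations = sc["metadata"].get("annotations") or {}
        if annotations.get(DEFAULT_ANNOTATION) == "true":
            names.append(sc["metadata"]["name"])
    return names
```

An empty result means one class still needs to be patched; more than one name means several classes claim to be the default, which is worth cleaning up before creating PVCs.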

Open the Dashboard

1. Port-forward the Istio ingress gateway:

$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

2. Open http://localhost:8080 and sign in with user@example.com / 12341234.

3. Switch namespace to kubeflow-user-example-com from the top-left selector.


Create a Notebook Server

  1. Notebooks → New Notebook, name ml-workspace.
  2. Pick an image (JupyterLab, VS Code, RStudio).
  3. Minimum CPU 0.5, memory 1Gi, workspace volume 5Gi.
  4. Click Launch and open the notebook when it's ready.

Run a quick sanity check inside the notebook:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.4f}")

Build and Run a Sample Pipeline

1. Save the pipeline:

$ nano sample_pipeline.py

Add the following code:

from kfp import dsl, compiler

@dsl.component(base_image="python:3.11-slim")
def preprocess() -> str:
    import json
    return json.dumps({"samples": 1000, "features": 10, "status": "preprocessed"})

@dsl.component(base_image="python:3.11-slim")
def train(input_data: str) -> str:
    import json
    return json.dumps({"model": "random_forest", "accuracy": 0.95, "input": json.loads(input_data)})

@dsl.component(base_image="python:3.11-slim")
def evaluate(input_data: str):
    import json
    r = json.loads(input_data)
    print(f"Model: {r['model']}, Accuracy: {r['accuracy']}")

@dsl.pipeline(name="sample-ml-pipeline")
def ml_pipeline():
    p = preprocess()
    t = train(input_data=p.output)
    evaluate(input_data=t.output)

compiler.Compiler().compile(ml_pipeline, "pipeline.yaml")

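2. Compile the pipeline to pipeline.yaml (this needs the kfp package, for example installed with pip install kfp):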

$ python3 sample_pipeline.py

3. Allow notebook traffic to reach the pipeline service:

$ nano allow-pipeline-access.yaml

Add the following policy:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-notebook-to-pipeline
  namespace: kubeflow
spec:
  selector:
    matchLabels:
      app: ml-pipeline
  rules:
    - from:
        - source:
            namespaces: ["kubeflow-user-example-com"]

Apply the policy:

$ kubectl apply -f allow-pipeline-access.yaml

4. Upload the pipeline and create an experiment (from a notebook terminal):

$ curl -s -F "uploadfile=@pipeline.yaml" \
    -H "kubeflow-userid: user@example.com" \
    http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v2beta1/pipelines/upload

$ curl -s -X POST -H "Content-Type: application/json" \
    -H "kubeflow-userid: user@example.com" \
    http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v2beta1/experiments \
    -d '{"display_name":"sample-experiment","namespace":"kubeflow-user-example-com"}'

Use the returned IDs to start a run from the Pipelines UI.
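If a run fails, it can help to check the component logic outside the cluster first. The sketch below mirrors the three components of sample_pipeline.py as plain Python functions, without the kfp decorators, to show how each step hands a JSON string to the next. `run_locally` is a hypothetical helper, and `evaluate` returns its summary here instead of printing it so the result is easy to inspect.

```python
import json


def preprocess() -> str:
    """Stand-in for the preprocess component: emit dataset metadata as JSON."""
    return json.dumps({"samples": 1000, "features": 10, "status": "preprocessed"})


def train(input_data: str) -> str:
    """Stand-in for the train component: wrap the upstream JSON in a result."""
    return json.dumps(
        {"model": "random_forest", "accuracy": 0.95, "input": json.loads(input_data)}
    )


def evaluate(input_data: str) -> str:
    """Stand-in for the evaluate component: summarise the training result."""
    r = json.loads(input_data)
    return f"Model: {r['model']}, Accuracy: {r['accuracy']}"


def run_locally() -> str:
    """Chain the steps in the same order as ml_pipeline()."""
    return evaluate(train(preprocess()))
```

In the compiled pipeline the same hand-off happens through `p.output` and `t.output`, with each component running in its own python:3.11-slim container.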


Submit a Distributed Training Job

$ nano trainjob.yaml

Add the following manifest:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-training
  namespace: kubeflow-user-example-com
spec:
  runtimeRef:
    name: torch-distributed
  trainer:
    image: ghcr.io/kubeflow/katib/pytorch-mnist-cpu:v0.19.0
    numNodes: 2
    resourcesPerNode:
      requests: {cpu: "500m", memory: "1Gi"}
      limits:   {cpu: "1",    memory: "2Gi"}

Apply the manifest and check the job status:

$ kubectl apply -f trainjob.yaml
$ kubectl get trainjob -n kubeflow-user-example-com

The STATE column flips to Complete once training finishes.
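For scripted checks, the same state can be read from the job's JSON instead of the table output. This is a rough sketch that assumes the TrainJob reports its terminal state through `status.conditions` entries of type `Complete` or `Failed`; verify that against the Trainer version you installed. `trainjob_state` and the `InProgress` label are invented for this example.

```python
import json


def trainjob_state(trainjob_json: str) -> str:
    """Map `kubectl get trainjob NAME -o json` output to a simple state string."""
    status = json.loads(trainjob_json).get("status", {})
    for cond in status.get("conditions", []):
        # Only a condition whose status is "True" is currently in effect.
        if cond.get("status") == "True" and cond.get("type") in ("Complete", "Failed"):
            return cond["type"]
    return "InProgress"  # hypothetical label for "no terminal condition yet"
```

A polling loop could call this every few seconds with the output of `kubectl get trainjob pytorch-training -n kubeflow-user-example-com -o json` and stop on the first result other than `InProgress`.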


Serve a Model with KServe

$ nano sklearn-iris.yaml

Add the following manifest:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: kubeflow-user-example-com
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      resources:
        requests: {cpu: 100m, memory: 256Mi}
        limits:   {cpu: "1",  memory: 1Gi}

Apply the manifest and watch the service until it is ready:

$ kubectl apply -f sklearn-iris.yaml
$ kubectl get inferenceservice sklearn-iris -n kubeflow-user-example-com -w

When READY reads True, fire a prediction:

$ curl -s --max-time 30 -H "Content-Type: application/json" \
    http://sklearn-iris-predictor-00001-private.kubeflow-user-example-com.svc.cluster.local/v1/models/sklearn-iris:predict \
    -d '{"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}'
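The same prediction can be sent from a notebook in Python. The sketch below uses only the standard library and reuses the in-cluster URL from the curl command; the function names are illustrative, and it assumes the service answers with the `{"predictions": [...]}` body of the KServe V1 inference protocol.

```python
import json
import urllib.request

# In-cluster URL taken from the curl command above.
PREDICT_URL = (
    "http://sklearn-iris-predictor-00001-private."
    "kubeflow-user-example-com.svc.cluster.local"
    "/v1/models/sklearn-iris:predict"
)


def build_predict_request(
    instances: list[list[float]], url: str = PREDICT_URL
) -> urllib.request.Request:
    """Wrap feature rows in the {"instances": [...]} request body."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def parse_predictions(response_body: str) -> list:
    """Extract the "predictions" list from the response body."""
    return json.loads(response_body)["predictions"]


def predict(instances: list[list[float]], timeout: float = 30.0) -> list:
    """Send the request; this only resolves from inside the cluster."""
    request = build_predict_request(instances)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return parse_predictions(resp.read().decode("utf-8"))
```

Calling `predict([[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]])` from a notebook in the cluster sends the same two rows as the curl example and returns one prediction per row.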

Tune Hyperparameters with Katib

Inside a notebook:

import kubeflow.katib as katib

def objective(parameters):
    import time
    time.sleep(5)
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    # Katib collects metrics from stdout in <metric-name>=<value> form.
    print(f"result={result}")

parameters = {
    "a": katib.search.int(min=10, max=20),
    "b": katib.search.double(min=0.1, max=0.2),
}

client = katib.KatibClient(namespace="kubeflow-user-example-com")
name = "tune-experiment"
client.tune(
    name=name,
    objective=objective,
    parameters=parameters,
    objective_metric_name="result",
    objective_type="maximize",
    algorithm_name="random",
    max_trial_count=4,
    parallel_trial_count=2,
    resources_per_trial={"cpu": "1", "memory": "1Gi"},
)

client.wait_for_experiment_condition(name=name)
print(client.get_optimal_hyperparameters(name))
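To get a feel for what the experiment should report, the objective can be explored locally first. Because 4a - b² grows with a and shrinks with b, the best point in the search space is a = 20, b = 0.1, with a value of about 79.99. The sketch below imitates the random algorithm with Python's `random` module; it is a stand-in for illustration, not Katib's own implementation, and `random_search` is a hypothetical helper.

```python
import random


def objective_value(a: int, b: float) -> float:
    """Same formula as the Katib objective above."""
    return 4 * a - b ** 2


def random_search(trials: int = 4, seed: int = 0) -> tuple[float, dict]:
    """Sample the search space at random and keep the best-scoring trial."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), {}
    for _ in range(trials):
        params = {"a": rng.randint(10, 20), "b": rng.uniform(0.1, 0.2)}
        score = objective_value(params["a"], params["b"])
        if score > best_score:  # objective_type="maximize"
            best_score, best_params = score, params
    return best_score, best_params
```

With only four trials, as in the experiment above, random search usually lands near the optimum without reaching it, so a Katib result somewhat below 79.99 is expected.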

Rotate the Demo Password

$ pip install bcrypt
$ python3 -c "import bcrypt; print(bcrypt.hashpw(b'YOUR_PASSWORD', bcrypt.gensalt()).decode())"
$ kubectl create secret generic dex-passwords -n auth \
    --from-literal=DEX_USER_PASSWORD='BCRYPT_HASH_HERE' \
    --dry-run=client -o yaml | kubectl apply -f -
$ kubectl rollout restart deployment dex -n auth
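Pasting the placeholder or the plain-text password into the secret instead of the hash is an easy mistake to make. A quick format check before running kubectl can catch it; the following sketch uses only the standard library and tests the shape of the string (prefix, two-digit cost, 53 trailing characters), not whether it matches any particular password. `looks_like_bcrypt` is a made-up helper name.

```python
import re

# Modular crypt format: $2a$, $2b$ or $2y$, a two-digit cost, then 53 characters
# (22 for the salt, 31 for the digest) from the bcrypt base64 alphabet.
BCRYPT_PATTERN = re.compile(r"\$2[aby]\$\d{2}\$[./A-Za-z0-9]{53}")


def looks_like_bcrypt(value: str) -> bool:
    """Return True if `value` has the shape of a bcrypt hash."""
    return BCRYPT_PATTERN.fullmatch(value) is not None
```

For example, `looks_like_bcrypt("BCRYPT_HASH_HERE")` returns False, which flags a secret that still contains the placeholder.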

Next Steps

Kubeflow is running with notebooks, pipelines, training, serving, and HPO. From here you can:

  • Add GPU workers and assign them to TrainJobs/InferenceServices via node selectors
  • Front the Istio gateway with cert-manager for production HTTPS
  • Wire S3-compatible storage for KFP artifacts and KServe model URIs

For the full guide with additional tips, visit the original article on Vultr Docs.
