Kubeflow is the open-source MLOps platform for Kubernetes, a self-hosted alternative to AWS SageMaker that bundles JupyterLab notebooks, KFP pipelines, the Trainer v2 API for distributed training, KServe for model serving, and Katib for hyperparameter optimisation. This guide deploys Kubeflow on a multi-node Kubernetes cluster, creates a user profile, runs a sample pipeline, executes a TrainJob, deploys an InferenceService, and launches a Katib experiment. By the end, you'll have a working Kubeflow platform covering the full ML lifecycle on your own cluster.
Prerequisites
- A Kubernetes cluster (v1.31+) with 3+ nodes and at least 4 CPUs / 16 GB RAM per node.
- kubectl and kustomize (v5.4.3+) on your workstation.
- A default StorageClass for PVC provisioning.
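You can check the version and node-count requirements before installing anything. Here is a minimal sketch, assuming you pass it the parsed output of `kubectl version -o json` and `kubectl get nodes -o json`. The function name `check_prereqs` and the sample payloads are made up for illustration and are not part of Kubeflow.

```python
import re


def _leading_int(text):
    """Extract the leading integer from values such as '31' or '31+'."""
    match = re.match(r"\d+", str(text))
    return int(match.group()) if match else None


def check_prereqs(version_json: dict, nodes_json: dict,
                  min_minor: int = 31, min_nodes: int = 3) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    server = version_json.get("serverVersion", {})
    major = _leading_int(server.get("major"))
    minor = _leading_int(server.get("minor"))
    if major is None or minor is None or (major, minor) < (1, min_minor):
        git_version = server.get("gitVersion", "unknown")
        problems.append(f"server version {git_version} is below 1.{min_minor}")
    node_count = len(nodes_json.get("items", []))
    if node_count < min_nodes:
        problems.append(f"only {node_count} node(s), need {min_nodes}+")
    return problems


# Hypothetical sample data: a 1.31 cluster with only two nodes.
sample_version = {"serverVersion": {"major": "1", "minor": "31+", "gitVersion": "v1.31.2"}}
sample_nodes = {"items": [{}, {}]}
print(check_prereqs(sample_version, sample_nodes))
```

The sketch does not check per-node CPU or memory, because the reported memory capacity of a "16 GB" node varies by provider.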
SageMaker → Kubeflow Mapping
| SageMaker | Kubeflow |
|---|---|
| SageMaker Studio | Kubeflow Notebooks (JupyterLab / VS Code / RStudio) |
| Training Jobs | Trainer v2 (PyTorch, DeepSpeed, MLX, JAX, XGBoost) |
| Pipelines | KFP — Kubeflow Pipelines |
| Model Registry | Kubeflow Model Registry |
| Endpoints | KServe (serverless serving + autoscaling) |
| Experiments | Katib (HPO + AutoML) |
Install Kubeflow
1. Verify cluster connectivity and version:
$ kubectl cluster-info
$ kubectl version
Server Version must be 1.31 or later.
2. Clone and check out a Kubeflow release:
$ git clone https://github.com/kubeflow/manifests.git
$ cd manifests
$ git checkout 26.03
3. Apply the example overlay with a retry loop (kustomize occasionally races CRDs on the first pass):
$ for i in 1 2 3 4 5; do \
    kustomize build example | kubectl apply --server-side --force-conflicts -f - && break \
      || { echo "Attempt $i failed, retrying in 30s..."; sleep 30; }; \
  done
Warning: The default install ships demo credentials (user@example.com / 12341234). Rotate them before exposing the cluster.
4. Verify the install:
$ kubectl get pods -n kubeflow --field-selector=status.phase!=Succeeded
$ kubectl get svc istio-ingressgateway -n istio-system
$ kubectl get crd | grep -E "kubeflow|kserve|katib|istio|knative|trainer" | wc -l
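The pod listing from step 4 can be long. As a rough sketch, the helper below narrows the parsed output of `kubectl get pods -n kubeflow -o json` down to pods that are neither completed nor fully ready. `unready_pods` and the sample pod names are hypothetical.

```python
def unready_pods(pod_list: dict) -> list[str]:
    """Names of pods that are not Succeeded and not fully ready."""
    pending = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue  # completed Jobs are expected and not a problem
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            pending.append(pod["metadata"]["name"])
    return pending


# Hypothetical sample data shaped like `kubectl get pods -o json` output.
sample_pods = {"items": [
    {"metadata": {"name": "ml-pipeline-0"},
     "status": {"phase": "Running", "containerStatuses": [{"ready": True}]}},
    {"metadata": {"name": "katib-controller-0"},
     "status": {"phase": "Running", "containerStatuses": [{"ready": False}]}},
    {"metadata": {"name": "setup-job-0"}, "status": {"phase": "Succeeded"}},
    {"metadata": {"name": "kserve-controller-0"}, "status": {"phase": "Pending"}},
]}
print(unready_pods(sample_pods))
```

An empty list suggests the install has settled. Anything left in the list is worth a `kubectl describe pod`.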
Create a User Profile
1. Save the Profile manifest:
$ nano user-profile.yaml
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: kubeflow-user-example-com
spec:
  owner:
    kind: User
    name: user@example.com
2. Apply it and confirm the namespace and service account:
$ kubectl apply -f user-profile.yaml
$ kubectl get namespace kubeflow-user-example-com
$ kubectl get serviceaccount default-editor -n kubeflow-user-example-com
3. Confirm there's a default StorageClass:
$ kubectl get storageclass
If none is marked default, patch one:
$ kubectl patch storageclass STORAGE_CLASS_NAME \
    -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
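Whether a StorageClass is the default comes down to one annotation. The sketch below, assuming input parsed from `kubectl get storageclass -o json`, lists the classes that carry it. The helper name and the sample class names are illustrative only.

```python
DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_storage_classes(sc_list: dict) -> list[str]:
    """Names of StorageClasses annotated as the cluster default."""
    names = []
    for sc in sc_list.get("items", []):
        annotations = sc.get("metadata", {}).get("annotations") or {}
        if annotations.get(DEFAULT_ANNOTATION) == "true":
            names.append(sc["metadata"]["name"])
    return names


# Hypothetical sample data: two classes, one marked as default.
sample_classes = {"items": [
    {"metadata": {"name": "fast-ssd", "annotations": {DEFAULT_ANNOTATION: "true"}}},
    {"metadata": {"name": "bulk-hdd"}},
]}
print(default_storage_classes(sample_classes))
```

An empty result means you need the patch command above. More than one name is worth cleaning up so PVC provisioning stays predictable.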
Open the Dashboard
1. Port-forward the Istio ingress gateway:
$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
2. Open http://localhost:8080 and sign in with user@example.com / 12341234.
3. Switch namespace to kubeflow-user-example-com from the top-left selector.
Create a Notebook Server
- Go to Notebooks → New Notebook and name it ml-workspace.
- Pick an image (JupyterLab, VS Code, or RStudio).
- Set minimum CPU to 0.5, memory to 1Gi, and the workspace volume to 5Gi.
- Click Launch and open the notebook when it's ready.
Run a quick sanity check inside the notebook:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X = np.random.randn(1000, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.4f}")
Build and Run a Sample Pipeline
1. Save the pipeline:
$ nano sample_pipeline.py
from kfp import dsl, compiler


@dsl.component(base_image="python:3.11-slim")
def preprocess() -> str:
    import json
    return json.dumps({"samples": 1000, "features": 10, "status": "preprocessed"})


@dsl.component(base_image="python:3.11-slim")
def train(input_data: str) -> str:
    import json
    return json.dumps({"model": "random_forest", "accuracy": 0.95, "input": json.loads(input_data)})


@dsl.component(base_image="python:3.11-slim")
def evaluate(input_data: str):
    import json
    r = json.loads(input_data)
    print(f"Model: {r['model']}, Accuracy: {r['accuracy']}")


@dsl.pipeline(name="sample-ml-pipeline")
def ml_pipeline():
    # Each step consumes the previous step's output.
    p = preprocess()
    t = train(input_data=p.output)
    evaluate(input_data=t.output)


compiler.Compiler().compile(ml_pipeline, "pipeline.yaml")
2. Compile:
$ python3 sample_pipeline.py
3. Allow notebook traffic to reach the pipeline service:
$ nano allow-pipeline-access.yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-notebook-to-pipeline
  namespace: kubeflow
spec:
  selector:
    matchLabels:
      app: ml-pipeline
  rules:
  - from:
    - source:
        namespaces: ["kubeflow-user-example-com"]
$ kubectl apply -f allow-pipeline-access.yaml
4. Upload and run the pipeline (from a notebook terminal):
$ curl -s -F "uploadfile=@pipeline.yaml" \
    -H "kubeflow-userid: user@example.com" \
    http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v2beta1/pipelines/upload
$ curl -s -X POST -H "Content-Type: application/json" \
    -H "kubeflow-userid: user@example.com" \
    http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v2beta1/experiments \
    -d '{"display_name":"sample-experiment","namespace":"kubeflow-user-example-com"}'
Use the returned IDs to start a run from the Pipelines UI.
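To see what data moves between the three components without a cluster, you can run their bodies as plain Python functions. This is only a local sketch: it drops the `@dsl.component` decorators, `run_locally` is a made-up helper, and `evaluate` returns its message instead of printing it so the result is easy to inspect.

```python
import json


def preprocess() -> str:
    return json.dumps({"samples": 1000, "features": 10, "status": "preprocessed"})


def train(input_data: str) -> str:
    return json.dumps({"model": "random_forest", "accuracy": 0.95,
                       "input": json.loads(input_data)})


def evaluate(input_data: str) -> str:
    r = json.loads(input_data)
    return f"Model: {r['model']}, Accuracy: {r['accuracy']}"


def run_locally() -> str:
    """Chain the steps in the same order as the pipeline definition."""
    p = preprocess()
    t = train(input_data=p)
    return evaluate(input_data=t)


print(run_locally())  # Model: random_forest, Accuracy: 0.95
```

In the real pipeline each step runs in its own container, so values are passed as serialized strings. That is why every function here encodes and decodes JSON rather than passing dictionaries.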
Submit a Distributed Training Job
$ nano trainjob.yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-training
  namespace: kubeflow-user-example-com
spec:
  runtimeRef:
    name: torch-distributed
  trainer:
    image: ghcr.io/kubeflow/katib/pytorch-mnist-cpu:v0.19.0
    numNodes: 2
    resourcesPerNode:
      requests: {cpu: "500m", memory: "1Gi"}
      limits: {cpu: "1", memory: "2Gi"}
$ kubectl apply -f trainjob.yaml
$ kubectl get trainjob -n kubeflow-user-example-com
The STATE column flips to Complete once training finishes.
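It helps to know how much capacity the job asks the scheduler for in total. The sketch below multiplies resourcesPerNode by numNodes, assuming each training pod receives exactly those values and ignoring any platform overhead. The parsers are simplified helpers written for this example and handle only the quantity forms used in the manifest (millicores, whole cores, binary memory suffixes).

```python
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity to cores, e.g. '500m' to 0.5 and '1' to 1.0."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity with a binary suffix (or plain bytes) to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-len(suffix)]) * factor)
    return int(quantity)


def total_resources(num_nodes: int, per_node: dict) -> dict:
    """Aggregate per-node resources across all training nodes."""
    return {
        "cpu_cores": num_nodes * parse_cpu(per_node["cpu"]),
        "memory_gib": num_nodes * parse_memory(per_node["memory"]) / 2**30,
    }


requests = total_resources(2, {"cpu": "500m", "memory": "1Gi"})
limits = total_resources(2, {"cpu": "1", "memory": "2Gi"})
print("requests:", requests)
print("limits:", limits)
```

For the manifest above this comes to 1 CPU core and 2 GiB requested, with limits of 2 cores and 4 GiB, which fits comfortably within the cluster size from the prerequisites.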
Serve a Model with KServe
$ nano sklearn-iris.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: kubeflow-user-example-com
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      resources:
        requests: {cpu: 100m, memory: 256Mi}
        limits: {cpu: "1", memory: 1Gi}
$ kubectl apply -f sklearn-iris.yaml
$ kubectl get inferenceservice sklearn-iris -n kubeflow-user-example-com -w
When READY reads True, fire a prediction:
$ curl -s --max-time 30 -H "Content-Type: application/json" \
    http://sklearn-iris-predictor-00001-private.kubeflow-user-example-com.svc.cluster.local/v1/models/sklearn-iris:predict \
    -d '{"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}'
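The same call can be made from Python. The sketch below builds the V1 predict request with the standard library and maps the numeric predictions to class names. It assumes the model follows scikit-learn's usual iris label order; the helper names, the `example.invalid` host, and the sample response are placeholders for illustration, not output captured from a cluster.

```python
import json
import urllib.request

# Assumption: class indices follow scikit-learn's iris target order.
IRIS_CLASSES = ["setosa", "versicolor", "virginica"]


def build_predict_request(base_url: str, model: str, instances: list) -> urllib.request.Request:
    """Build (but do not send) a V1 protocol predict request."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/models/{model}:predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def label_predictions(response_body: str) -> list[str]:
    """Translate a {"predictions": [...]} response into class names."""
    predictions = json.loads(response_body)["predictions"]
    return [IRIS_CLASSES[int(i)] for i in predictions]


request = build_predict_request(
    "http://example.invalid",  # placeholder; use your predictor's URL
    "sklearn-iris",
    [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]],
)
print(request.get_method(), request.full_url)

# Hypothetical response body, used only to show the mapping step.
print(label_predictions('{"predictions": [1, 1]}'))
```

To send the request for real, pass it to `urllib.request.urlopen(request, timeout=30)` from a pod that can reach the predictor service.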
Tune Hyperparameters with Katib
Inside a notebook:
import kubeflow.katib as katib


def objective(parameters):
    import time
    time.sleep(5)
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    # Katib's metrics collector reads stdout lines of the form <metric-name>=<value>,
    # so the metric must be printed rather than returned.
    print(f"result={result}")


parameters = {
    "a": katib.search.int(min=10, max=20),
    "b": katib.search.double(min=0.1, max=0.2),
}

client = katib.KatibClient(namespace="kubeflow-user-example-com")
name = "tune-experiment"
client.tune(
    name=name,
    objective=objective,
    parameters=parameters,
    objective_metric_name="result",
    objective_type="maximize",
    algorithm_name="random",
    max_trial_count=4,
    parallel_trial_count=2,
    resources_per_trial={"cpu": "1", "memory": "1Gi"},
)
client.wait_for_experiment_condition(name=name)
print(client.get_optimal_hyperparameters(name))
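To get a feel for what the experiment should report, you can simulate random search over the same search space locally. This is a plain-Python sketch rather than Katib's implementation: `random_search` is a made-up helper, the bounds are assumed to be inclusive, and the seed exists only to make the run repeatable.

```python
import random


def objective(a: int, b: float) -> float:
    """Same formula as the Katib objective: 4a - b^2."""
    return 4 * a - b ** 2


def random_search(trials: int, seed: int = 0):
    """Sample the search space uniformly and keep the best-scoring trial."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        a = rng.randint(10, 20)    # integer parameter, bounds inclusive
        b = rng.uniform(0.1, 0.2)  # floating-point parameter
        score = objective(a, b)
        if best is None or score > best[0]:
            best = (score, {"a": a, "b": b})
    return best


score, params = random_search(trials=4)
print(f"best result={score:.4f} with {params}")
print(f"upper bound: {objective(20, 0.1):.2f}")  # upper bound: 79.99
```

The objective grows with a and shrinks with b, so the best possible value is 79.99 at a=20, b=0.1. With only four trials, expect the experiment to land somewhat below that.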
Rotate the Demo Password
$ pip install bcrypt
$ python3 -c "import bcrypt; print(bcrypt.hashpw(b'YOUR_PASSWORD', bcrypt.gensalt()).decode())"
$ kubectl create secret generic dex-passwords -n auth \
    --from-literal=DEX_USER_PASSWORD='BCRYPT_HASH_HERE' \
    --dry-run=client -o yaml | kubectl apply -f -
$ kubectl rollout restart deployment dex -n auth
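If you need a value for YOUR_PASSWORD, one option is to generate it with Python's standard `secrets` module. The sketch below is a suggestion, not part of the Kubeflow tooling. It sticks to letters and digits so the result can be pasted into the commands above without shell-quoting problems, and the default length of 24 is an arbitrary choice.

```python
import secrets
import string


def generate_password(length: int = 24) -> str:
    """Return a random alphanumeric password from a cryptographic RNG."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


print(generate_password())
```

Store the generated value in a password manager before hashing it, because only the bcrypt hash ends up in the cluster.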
Next Steps
Kubeflow is running with notebooks, pipelines, training, serving, and HPO. From here you can:
- Add GPU workers and assign them to TrainJobs/InferenceServices via node selectors
- Front the Istio gateway with cert-manager for production HTTPS
- Wire S3-compatible storage for KFP artifacts and KServe model URIs
For the full guide with additional tips, visit the original article on Vultr Docs.