Bursting to SageMaker Training from OpenShift Pipelines
Table of Contents
- Overview
- Architecture
- Prerequisites
- Phase 1: ROSA Cluster Setup
- Phase 2: OpenShift Pipelines Installation
- Phase 3: AWS Controllers for Kubernetes (ACK)
- Phase 4: Amazon SageMaker Integration
- Phase 5: Model Storage with S3
- Phase 6: KServe Model Serving
- Phase 7: End-to-End Pipeline
- Testing and Validation
- Resource Cleanup
- Troubleshooting
Overview
Project Purpose
This platform delivers a hybrid MLOps solution that keeps costs down by combining the strengths of both environments: OpenShift for orchestration, management, and serving, and AWS SageMaker for intensive GPU training workloads. Instead of maintaining expensive GPU instances 24/7, this architecture enables dynamic "bursting" to AWS for training while keeping cost-effective inference on OpenShift.
Key Value Propositions
- Cost Optimization: Pay for GPU instances only during training, not continuously
- Elastic Scalability: Burst to powerful AWS instances (ml.p4d.24xlarge) on-demand
- Hybrid Flexibility: Orchestrate from OpenShift while leveraging AWS managed services
- Automated Workflows: End-to-end MLOps pipelines with minimal manual intervention
- Production-Ready Serving: Low-latency inference on cost-effective OpenShift nodes
Solution Components
| Component | Purpose | Layer |
|---|---|---|
| ROSA | Managed OpenShift cluster on AWS | Infrastructure |
| OpenShift Pipelines | Tekton-based CI/CD orchestration | Orchestration |
| ACK (AWS Controllers for Kubernetes) | Manage AWS services from Kubernetes | Integration |
| Amazon SageMaker | Managed ML training with GPU instances | Training |
| Amazon S3 | Model artifacts and dataset storage | Data Lake |
| KServe | Model serving on OpenShift | Inference |
| Amazon ECR | Container registry for custom images | Container Registry |
Architecture
High-Level Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────┐
│ AWS Cloud │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ ROSA Cluster (VPC) │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ OpenShift Pipelines (Tekton) │ │ │
│ │ │ ┌────────────────┐ ┌──────────────────────────────┐ │ │ │
│ │ │ │ Pipeline │ │ AWS Controllers for │ │ │ │
│ │ │ │ Controller │─────►│ Kubernetes (ACK) │ │ │ │
│ │ │ └────────────────┘ └──────────┬───────────────────┘ │ │ │
│ │ │ │ │ │ │
│ │ └─────────────────────────────────────┼───────────────────────┘ │ │
│ │ │ │ │
│ │ ┌─────────────────────────────────────▼───────────────────────┐ │ │
│ │ │ KServe Model Serving │ │ │
│ │ │ ┌────────────────┐ ┌──────────────────────────────┐ │ │ │
│ │ │ │ InferenceService│◄─────│ Model from S3 │ │ │ │
│ │ │ │ (CPU Nodes) │ │ (Automatic Pull) │ │ │ │
│ │ │ └────────────────┘ └──────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ACK Triggers │ (Ephemeral) │
│ │ │
│ ┌─────────────────────────────────────▼───────────────────────────┐ │
│ │ Amazon SageMaker Training │ │
│ │ ┌────────────────────────────────────────────────────────────┐ │ │
│ │ │ Training Job (ml.p4d.24xlarge - 8x A100 GPUs) │ │ │
│ │ │ - Launched on-demand via ACK │ │ │
│ │ │ - Trains model using data from S3 │ │ │
│ │ │ - Saves model artifacts to S3 │ │ │
│ │ │ - Auto-terminates after completion │ │ │
│ │ └────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────▼───────────────────────────┐ │
│ │ Amazon S3 │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │
│ │ │ Training │ │ Model │ │ Inference │ │ │
│ │ │ Datasets │ │ Artifacts │ │ Results │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Amazon ECR │ │
│ │ - Custom training container images │ │
│ │ - Model serving images │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Workflow
- Data Preparation: Training datasets uploaded to S3
- Pipeline Trigger: Developer triggers the OpenShift Pipeline (see the example command after this list)
- Training Initiation: ACK creates SageMaker Training Job
- GPU Provisioning: SageMaker spins up ml.p4d.24xlarge instances
- Model Training: Training executes on high-performance GPUs
- Artifact Storage: Trained model saved to S3
- Instance Termination: GPU instances automatically shut down
- Model Deployment: KServe pulls model from S3 to OpenShift
- Inference Serving: Model serves predictions on cost-effective CPU nodes
- Monitoring: Pipeline tracks status and logs throughout
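For reference, once the pipeline, service account, and workspace PVC from Phase 7 exist, step 2 can be triggered from the CLI instead of creating a PipelineRun manifest by hand. A minimal sketch, with parameter values mirroring the PipelineRun in Step 7.4 (the environment variables are set in Phases 4 and 5):
# Trigger the MLOps pipeline from the CLI (equivalent to creating a PipelineRun)
tkn pipeline start mlops-pipeline \
  -n mlops-pipelines \
  --serviceaccount pipeline-sa \
  --param model-name=classifier-model \
  --param sagemaker-role-arn=$SAGEMAKER_ROLE_ARN \
  --param training-image-uri=$ECR_TRAINING_URI:latest \
  --param data-bucket=$DATA_BUCKET \
  --param model-bucket=$ML_BUCKET \
  --param instance-type=ml.m5.xlarge \
  --workspace name=shared-workspace,claimName=mlops-workspace \
  --showlog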
Cost Analysis
Traditional Approach (GPU instances running 24/7):
- ml.p4d.24xlarge: ~$32/hour
- Monthly cost: ~$23,040 (continuous operation)
Hybrid Approach (burst for training only):
- Training: 4 hours/week × $32/hour = $128/week = $512/month
- ROSA inference nodes: ~$1,500/month (m5.2xlarge instances)
- Total: ~$2,012/month
- Savings: ~91% compared to traditional approach
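These figures are rough, on-demand estimates that vary by region and pricing model, but the arithmetic is easy to sanity-check:
# Back-of-the-envelope check of the estimates above (illustrative prices)
python3 - <<'PYTHON'
gpu_hourly = 32.0                    # approx. ml.p4d.24xlarge on-demand rate
always_on = gpu_hourly * 24 * 30     # GPU instance running 24/7 for a month
burst = gpu_hourly * 4 * 4           # 4 training hours/week, ~4 weeks/month
hybrid = burst + 1500                # plus ~$1,500/month for ROSA inference nodes
print(f"Always-on GPU:  ${always_on:,.0f}/month")
print(f"Hybrid (burst): ${hybrid:,.0f}/month")
print(f"Savings:        {100 * (1 - hybrid / always_on):.0f}%")
PYTHON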
Prerequisites
Required Accounts and Subscriptions
- [ ] AWS Account with administrative access
- [ ] Red Hat Account with OpenShift subscription
- [ ] ROSA Enabled in your AWS account
- [ ] Amazon SageMaker Access in your target region
- [ ] AWS Service Quotas for ml.p4d instances (request if needed)
Required Tools
Install the following CLI tools on your workstation:
# AWS CLI (v2)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# ROSA CLI
wget https://mirror.openshift.com/pub/openshift-v4/clients/rosa/latest/rosa-linux.tar.gz
tar -xvf rosa-linux.tar.gz
sudo mv rosa /usr/local/bin/rosa
rosa version
# OpenShift CLI (oc)
wget https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz
tar -xvf openshift-client-linux.tar.gz
sudo mv oc kubectl /usr/local/bin/
oc version
# Tekton CLI
curl -LO https://github.com/tektoncd/cli/releases/download/v0.33.0/tkn_0.33.0_Linux_x86_64.tar.gz
tar xvzf tkn_0.33.0_Linux_x86_64.tar.gz
sudo mv tkn /usr/local/bin/
tkn version
# Helm (v3)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version
AWS Prerequisites
Service Quotas
# Check SageMaker quotas
aws service-quotas get-service-quota \
--service-code sagemaker \
--quota-code L-2E8D9C5E \
--region us-east-1
# Check EC2 quotas for ROSA
aws service-quotas get-service-quota \
--service-code ec2 \
--quota-code L-1216C47A \
--region us-east-1
IAM Permissions
Your AWS IAM user/role needs permissions for the following services (a quick spot-check is shown after the list):
- EC2 (VPC, subnets, security groups)
- IAM (roles, policies)
- S3 (buckets, objects)
- SageMaker (training jobs, models)
- ECR (repositories, images)
- CloudWatch (logs, metrics)
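One way to spot-check a few of these before provisioning anything is the IAM policy simulator; the principal ARN below is a placeholder for your own user or role:
# Dry-run a few representative actions against your IAM principal
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::<ACCOUNT_ID>:user/<YOUR_USER> \
  --action-names sagemaker:CreateTrainingJob s3:CreateBucket ecr:CreateRepository iam:CreateRole \
  --query 'EvaluationResults[].{Action: EvalActionName, Decision: EvalDecision}' \
  --output table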
Knowledge Prerequisites
You should be familiar with:
- Machine Learning concepts (training, inference, model artifacts)
- AWS fundamentals (VPC, IAM, S3)
- Kubernetes basics (pods, deployments, services)
- CI/CD pipeline concepts
- Python and ML frameworks (TensorFlow, PyTorch, scikit-learn)
Phase 1: ROSA Cluster Setup
Step 1.1: Configure AWS CLI
# Configure AWS credentials
aws configure
# Verify configuration
aws sts get-caller-identity
Step 1.2: Initialize ROSA
# Log in to Red Hat
rosa login
# Verify ROSA prerequisites
rosa verify quota
rosa verify permissions
# Initialize ROSA in your AWS account
rosa init
Step 1.3: Create ROSA Cluster
Create a ROSA cluster optimized for MLOps workloads:
# Set environment variables
export CLUSTER_NAME="mlops-platform"
export AWS_REGION="us-east-1"
export MACHINE_TYPE="m5.2xlarge"
export COMPUTE_NODES=3
# Create account-wide STS roles and policies (an STS cluster is needed for the
# OIDC-based IAM integration used in Phases 3 and 6)
rosa create account-roles --mode auto --yes
# Create ROSA cluster with STS (takes ~40 minutes)
rosa create cluster \
  --cluster-name $CLUSTER_NAME \
  --sts \
  --mode auto \
  --region $AWS_REGION \
  --multi-az \
  --compute-machine-type $MACHINE_TYPE \
  --compute-nodes $COMPUTE_NODES \
  --machine-cidr 10.0.0.0/16 \
  --service-cidr 172.30.0.0/16 \
  --pod-cidr 10.128.0.0/14 \
  --host-prefix 23 \
  --yes
Configuration Rationale:
- m5.2xlarge: 8 vCPUs, 32 GB RAM - suitable for ML inference and pipeline orchestration
- 3 nodes: High availability for production workloads
- Multi-AZ: Ensures resilience for serving layer
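If the serving layer later needs more capacity, worker nodes can be added without recreating the cluster. A hedged sketch using an additional machine pool (the pool name and replica count are illustrative):
# Optional: add a dedicated machine pool for inference workloads once the cluster is ready
rosa create machinepool \
  --cluster=$CLUSTER_NAME \
  --name=inference-pool \
  --instance-type=m5.2xlarge \
  --replicas=2
# List machine pools to confirm
rosa list machinepools --cluster=$CLUSTER_NAME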
Step 1.4: Monitor Cluster Creation
# Watch cluster installation progress
rosa logs install --cluster=$CLUSTER_NAME --watch
# Check cluster status
rosa describe cluster --cluster=$CLUSTER_NAME
Step 1.5: Create Admin User
# Create cluster admin user
rosa create admin --cluster=$CLUSTER_NAME
# Save the login command (will be displayed in output)
Step 1.6: Connect to Cluster
# Use the login command from previous step
oc login https://api.mlops-platform.xxxx.p1.openshiftapps.com:6443 \
--username cluster-admin \
--password <your-password>
# Verify cluster access
oc cluster-info
oc get nodes
Step 1.7: Create Project Namespaces
# Create namespace for pipelines
oc new-project mlops-pipelines
# Create namespace for model serving
oc new-project mlops-serving
# Create namespace for ACK controllers
oc new-project ack-system
Phase 2: OpenShift Pipelines Installation
Step 2.1: Install OpenShift Pipelines Operator
# Create operator subscription
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
name: openshift-pipelines
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: openshift-pipelines-operator
namespace: openshift-operators
spec:
channel: latest
name: openshift-pipelines-operator-rh
source: redhat-operators
sourceNamespace: openshift-marketplace
installPlanApproval: Automatic
EOF
Step 2.2: Verify Operator Installation
# Wait for operator to be ready (takes 2-3 minutes)
oc get csv -n openshift-operators | grep pipelines
# Verify Tekton components are running
oc get pods -n openshift-pipelines
# Check Tekton version
tkn version
Step 2.3: Configure Pipeline Service Account
# Create a service account for pipelines and grant it edit rights in both the
# pipeline and the serving namespaces (the deploy task in Phase 7 creates
# InferenceServices in mlops-serving)
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pipeline-sa
  namespace: mlops-pipelines
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pipeline-sa-edit
  namespace: mlops-pipelines
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- kind: ServiceAccount
  name: pipeline-sa
  namespace: mlops-pipelines
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pipeline-sa-edit
  namespace: mlops-serving
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- kind: ServiceAccount
  name: pipeline-sa
  namespace: mlops-pipelines
EOF
Note: the built-in edit role covers core resources; depending on how the KServe and ACK CRDs aggregate their RBAC, you may also need to bind pipeline-sa to a Role that explicitly allows inferenceservices.serving.kserve.io and trainingjobs.sagemaker.services.k8s.aws.
Phase 3: AWS Controllers for Kubernetes (ACK)
ACK enables managing AWS services directly from Kubernetes using custom resources.
Step 3.1: Install ACK SageMaker Controller
# Set variables
export ACK_K8S_NAMESPACE=ack-system
export AWS_REGION=us-east-1
# Download the latest ACK SageMaker controller release manifest
export SERVICE=sagemaker
export RELEASE_VERSION=$(curl -sL https://api.github.com/repos/aws-controllers-k8s/${SERVICE}-controller/releases/latest | grep '"tag_name":' | cut -d'"' -f4)
wget https://github.com/aws-controllers-k8s/${SERVICE}-controller/releases/download/${RELEASE_VERSION}/install.yaml
# Apply ACK controller
kubectl apply -f install.yaml
# Verify installation
kubectl get pods -n ack-system
kubectl get crd | grep sagemaker
Step 3.2: Create IAM Role for ACK
# Create IAM policy for SageMaker access
cat > ack-sagemaker-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sagemaker:CreateTrainingJob",
"sagemaker:DescribeTrainingJob",
"sagemaker:StopTrainingJob",
"sagemaker:CreateModel",
"sagemaker:DeleteModel",
"sagemaker:DescribeModel",
"sagemaker:CreateEndpointConfig",
"sagemaker:DeleteEndpointConfig",
"sagemaker:DescribeEndpointConfig"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::mlops-*",
"arn:aws:s3:::mlops-*/*"
]
},
{
"Effect": "Allow",
"Action": [
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"iam:PassRole"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"iam:PassedToService": "sagemaker.amazonaws.com"
}
}
}
]
}
EOF
# Create policy
aws iam create-policy \
--policy-name ACKSageMakerPolicy \
--policy-document file://ack-sagemaker-policy.json
# Get OIDC provider for ROSA
export OIDC_PROVIDER=$(rosa describe cluster -c $CLUSTER_NAME -o json | jq -r .aws.sts.oidc_endpoint_url | sed 's|https://||')
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
# Create trust policy
cat > ack-trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"${OIDC_PROVIDER}:sub": "system:serviceaccount:ack-system:ack-sagemaker-controller"
}
}
}
]
}
EOF
# Create IAM role
export ACK_ROLE_ARN=$(aws iam create-role \
--role-name ACKSageMakerControllerRole \
--assume-role-policy-document file://ack-trust-policy.json \
--query 'Role.Arn' \
--output text)
# Attach policy to role
aws iam attach-role-policy \
--role-name ACKSageMakerControllerRole \
--policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/ACKSageMakerPolicy
echo "ACK IAM Role ARN: $ACK_ROLE_ARN"
Step 3.3: Configure ACK Controller with IAM Role
# Annotate service account
kubectl annotate serviceaccount -n ack-system ack-sagemaker-controller \
eks.amazonaws.com/role-arn=$ACK_ROLE_ARN
# Restart ACK controller to pick up annotation
kubectl rollout restart deployment -n ack-system ack-sagemaker-controller
# Verify controller is running
kubectl get pods -n ack-system
kubectl logs -n ack-system deployment/ack-sagemaker-controller
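As a quick sanity check that web-identity credentials are actually being injected (this assumes the cluster's pod identity webhook reacts to the eks.amazonaws.com/role-arn annotation, as ROSA STS clusters do):
# The injected web-identity env vars should appear in the controller pod spec
kubectl -n ack-system get pods -o yaml | grep -E 'AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE' || \
  echo "Web-identity env vars not found - re-check the service account annotation"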
Phase 4: Amazon SageMaker Integration
Step 4.1: Create SageMaker Execution Role
# Create trust policy for SageMaker
cat > sagemaker-trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "sagemaker.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
EOF
# Create SageMaker execution role
export SAGEMAKER_ROLE_ARN=$(aws iam create-role \
--role-name SageMakerMLOpsExecutionRole \
--assume-role-policy-document file://sagemaker-trust-policy.json \
--query 'Role.Arn' \
--output text)
# Attach AWS managed policy
aws iam attach-role-policy \
--role-name SageMakerMLOpsExecutionRole \
--policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
# Create custom S3 access policy
cat > sagemaker-s3-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::mlops-*",
"arn:aws:s3:::mlops-*/*"
]
}
]
}
EOF
aws iam put-role-policy \
--role-name SageMakerMLOpsExecutionRole \
--policy-name S3Access \
--policy-document file://sagemaker-s3-policy.json
echo "SageMaker Execution Role ARN: $SAGEMAKER_ROLE_ARN"
Step 4.2: Create S3 Buckets
# Create S3 buckets for ML artifacts
export ML_BUCKET="mlops-artifacts-${ACCOUNT_ID}"
export DATA_BUCKET="mlops-datasets-${ACCOUNT_ID}"
aws s3 mb s3://$ML_BUCKET --region $AWS_REGION
aws s3 mb s3://$DATA_BUCKET --region $AWS_REGION
# Enable versioning
aws s3api put-bucket-versioning \
--bucket $ML_BUCKET \
--versioning-configuration Status=Enabled
aws s3api put-bucket-versioning \
--bucket $DATA_BUCKET \
--versioning-configuration Status=Enabled
# Create folder structure
aws s3api put-object --bucket $ML_BUCKET --key models/
aws s3api put-object --bucket $ML_BUCKET --key checkpoints/
aws s3api put-object --bucket $DATA_BUCKET --key training/
aws s3api put-object --bucket $DATA_BUCKET --key validation/
echo "S3 Buckets created:"
echo " Models: s3://$ML_BUCKET"
echo " Data: s3://$DATA_BUCKET"
Step 4.3: Create ECR Repository
# Create ECR repository for custom training images
aws ecr create-repository \
--repository-name mlops/training \
--region $AWS_REGION
# Get ECR login command
aws ecr get-login-password --region $AWS_REGION | \
docker login --username AWS --password-stdin \
${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
export ECR_TRAINING_URI="${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/mlops/training"
echo "ECR Repository: $ECR_TRAINING_URI"
Step 4.4: Build Custom Training Container
# Create directory for training container
mkdir -p sagemaker-training
cd sagemaker-training
# Create training script
cat > train.py <<'PYTHON'
import argparse
import os
import json
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import boto3
def load_data_from_s3(data_dir):
"""Load training and validation data"""
print(f"Loading data from {data_dir}")
# Load training data
X_train = np.load(os.path.join(data_dir, 'train', 'X_train.npy'))
y_train = np.load(os.path.join(data_dir, 'train', 'y_train.npy'))
# Load validation data
X_val = np.load(os.path.join(data_dir, 'validation', 'X_val.npy'))
y_val = np.load(os.path.join(data_dir, 'validation', 'y_val.npy'))
return X_train, y_train, X_val, y_val
def train_model(X_train, y_train, hyperparameters):
"""Train Random Forest model"""
print("Training model with hyperparameters:", hyperparameters)
model = RandomForestClassifier(
n_estimators=hyperparameters['n_estimators'],
max_depth=hyperparameters['max_depth'],
random_state=42,
n_jobs=-1
)
model.fit(X_train, y_train)
return model
def evaluate_model(model, X_val, y_val):
"""Evaluate model on validation set"""
y_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
report = classification_report(y_val, y_pred, output_dict=True)
print(f"Validation Accuracy: {accuracy:.4f}")
print(classification_report(y_val, y_pred))
return accuracy, report
def save_model(model, model_dir, metrics):
"""Save model and metrics"""
os.makedirs(model_dir, exist_ok=True)
# Save model
model_path = os.path.join(model_dir, 'model.joblib')
joblib.dump(model, model_path)
print(f"Model saved to {model_path}")
# Save metrics
metrics_path = os.path.join(model_dir, 'metrics.json')
with open(metrics_path, 'w') as f:
json.dump(metrics, f, indent=2)
print(f"Metrics saved to {metrics_path}")
if __name__ == '__main__':
parser = argparse.ArgumentParser()
# Hyperparameters
parser.add_argument('--n_estimators', type=int, default=100)
parser.add_argument('--max_depth', type=int, default=10)
# SageMaker specific arguments
parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train'))
parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION', '/opt/ml/input/data/validation'))
args = parser.parse_args()
# Load data
data_dir = os.path.dirname(args.train)
X_train, y_train, X_val, y_val = load_data_from_s3(data_dir)
# Train model
hyperparameters = {
'n_estimators': args.n_estimators,
'max_depth': args.max_depth
}
model = train_model(X_train, y_train, hyperparameters)
# Evaluate model
accuracy, report = evaluate_model(model, X_val, y_val)
# Save model and metrics
metrics = {
'accuracy': accuracy,
'classification_report': report,
'hyperparameters': hyperparameters
}
save_model(model, args.model_dir, metrics)
print("Training completed successfully!")
PYTHON
# Create Dockerfile
cat > Dockerfile <<'DOCKERFILE'
FROM python:3.10-slim
# Install dependencies
RUN pip install --no-cache-dir \
numpy==1.24.3 \
scikit-learn==1.3.0 \
joblib==1.3.2 \
boto3==1.28.25
# Copy training script
COPY train.py /opt/ml/code/train.py
# Set working directory
WORKDIR /opt/ml/code
# Set entry point
ENV SAGEMAKER_PROGRAM train.py
ENTRYPOINT ["python", "train.py"]
DOCKERFILE
# Build and push image
docker build -t mlops-training:latest .
docker tag mlops-training:latest $ECR_TRAINING_URI:latest
docker push $ECR_TRAINING_URI:latest
cd ..
echo "Training container image pushed to ECR"
Phase 5: Model Storage with S3
Step 5.1: Upload Sample Training Data
# Create sample dataset
mkdir -p sample-data
cd sample-data
python3 <<PYTHON
import numpy as np
# Generate synthetic classification dataset
np.random.seed(42)
# Training data
X_train = np.random.randn(1000, 20)
y_train = np.random.randint(0, 2, 1000)
# Validation data
X_val = np.random.randn(200, 20)
y_val = np.random.randint(0, 2, 200)
# Save to files
np.save('X_train.npy', X_train)
np.save('y_train.npy', y_train)
np.save('X_val.npy', X_val)
np.save('y_val.npy', y_val)
print("Sample dataset created")
PYTHON
# Upload to S3
aws s3 cp X_train.npy s3://$DATA_BUCKET/training/
aws s3 cp y_train.npy s3://$DATA_BUCKET/training/
aws s3 cp X_val.npy s3://$DATA_BUCKET/validation/
aws s3 cp y_val.npy s3://$DATA_BUCKET/validation/
cd ..
echo "Sample data uploaded to S3"
Step 5.2: Create ConfigMap for S3 Configuration
# Store S3 bucket names in ConfigMap
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
name: mlops-config
namespace: mlops-pipelines
data:
ML_BUCKET: "$ML_BUCKET"
DATA_BUCKET: "$DATA_BUCKET"
AWS_REGION: "$AWS_REGION"
SAGEMAKER_ROLE_ARN: "$SAGEMAKER_ROLE_ARN"
ECR_TRAINING_URI: "$ECR_TRAINING_URI"
EOF
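The pipeline in Phase 7 receives these values as parameters. If you prefer to read them directly inside a Task, Tekton steps accept standard container fields such as envFrom; a minimal sketch (the print-config Task is illustrative only):
cat <<'EOF' | oc apply -f -
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: print-config
  namespace: mlops-pipelines
spec:
  steps:
    - name: show
      image: registry.access.redhat.com/ubi9/ubi-minimal:latest
      envFrom:
        - configMapRef:
            name: mlops-config
      script: |
        #!/bin/sh
        echo "Data bucket:  $DATA_BUCKET"
        echo "Model bucket: $ML_BUCKET"
        echo "Region:       $AWS_REGION"
EOF
Run it with tkn task start print-config -n mlops-pipelines --showlog to confirm the values resolve.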
Phase 6: KServe Model Serving
Step 6.1: Install KServe
# Install Serverless Operator (prerequisite for KServe)
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: serverless-operator
namespace: openshift-operators
spec:
channel: stable
name: serverless-operator
source: redhat-operators
sourceNamespace: openshift-marketplace
installPlanApproval: Automatic
EOF
# Wait for operator to be ready
sleep 30
oc get csv -n openshift-operators | grep serverless
# Install KServe via Red Hat OpenShift AI or manually
# For this guide, we'll install KServe components manually
# Install Knative Serving
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
name: knative-serving
---
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
name: knative-serving
namespace: knative-serving
spec:
ingress:
istio:
enabled: false
config:
domain:
svc.cluster.local: ""
EOF
# Install cert-manager first (KServe's webhooks rely on it for certificate provisioning)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.1/cert-manager.yaml
kubectl wait --for=condition=Available deployment --all -n cert-manager --timeout=300s
# Install KServe
export KSERVE_VERSION=v0.11.0
kubectl apply -f https://github.com/kserve/kserve/releases/download/${KSERVE_VERSION}/kserve.yaml
# Wait for KServe to be ready
kubectl wait --for=condition=Ready pods --all -n kserve --timeout=300s
Step 6.2: Create Custom ServingRuntime for scikit-learn
# Create scikit-learn serving runtime
cat <<EOF | oc apply -f -
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: sklearn-runtime
  namespace: mlops-serving
spec:
  supportedModelFormats:
    - name: sklearn
      version: "1"
      autoSelect: true
  containers:
    - name: kserve-container
      image: kserve/sklearnserver:v0.11.0
      args:
        - --model_name={{.Name}}
        - --model_dir=/mnt/models
        - --http_port=8080
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
EOF
Step 6.3: Create Service Account for Model Access
# Create IAM role for KServe to access S3
cat > kserve-trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"${OIDC_PROVIDER}:sub": "system:serviceaccount:mlops-serving:kserve-sa"
}
}
}
]
}
EOF
# Create role
export KSERVE_ROLE_ARN=$(aws iam create-role \
--role-name KServeS3AccessRole \
--assume-role-policy-document file://kserve-trust-policy.json \
--query 'Role.Arn' \
--output text)
# Create S3 read policy
cat > kserve-s3-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::${ML_BUCKET}",
"arn:aws:s3:::${ML_BUCKET}/*"
]
}
]
}
EOF
aws iam put-role-policy \
--role-name KServeS3AccessRole \
--policy-name S3ReadAccess \
--policy-document file://kserve-s3-policy.json
# Create service account
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
name: kserve-sa
namespace: mlops-serving
annotations:
eks.amazonaws.com/role-arn: $KSERVE_ROLE_ARN
EOF
Phase 7: End-to-End Pipeline
Step 7.1: Create Pipeline Tasks
# Create Task for SageMaker training
cat <<EOF | oc apply -f -
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
name: sagemaker-training
namespace: mlops-pipelines
spec:
params:
- name: job-name
type: string
description: SageMaker training job name
- name: role-arn
type: string
description: SageMaker execution role ARN
- name: image-uri
type: string
description: Training container image URI
- name: instance-type
type: string
default: ml.p4d.24xlarge
- name: instance-count
type: string
default: "1"
- name: volume-size
type: string
default: "50"
- name: max-runtime
type: string
default: "3600"
- name: data-bucket
type: string
- name: model-bucket
type: string
steps:
- name: create-training-job
    image: quay.io/openshift/origin-cli:latest
script: |
#!/bin/bash
set -e
# Create SageMaker training job manifest
cat > training-job.yaml <<YAML
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: TrainingJob
metadata:
name: \$(params.job-name)
namespace: mlops-pipelines
spec:
trainingJobName: \$(params.job-name)
roleARN: \$(params.role-arn)
algorithmSpecification:
trainingImage: \$(params.image-uri)
trainingInputMode: File
resourceConfig:
instanceType: \$(params.instance-type)
instanceCount: \$(params.instance-count)
volumeSizeInGB: \$(params.volume-size)
inputDataConfig:
- channelName: train
dataSource:
s3DataSource:
s3DataType: S3Prefix
s3URI: s3://\$(params.data-bucket)/training/
s3DataDistributionType: FullyReplicated
- channelName: validation
dataSource:
s3DataSource:
s3DataType: S3Prefix
s3URI: s3://\$(params.data-bucket)/validation/
s3DataDistributionType: FullyReplicated
outputDataConfig:
s3OutputPath: s3://\$(params.model-bucket)/models/
stoppingCondition:
maxRuntimeInSeconds: \$(params.max-runtime)
YAML
# Apply the training job
kubectl apply -f training-job.yaml
echo "SageMaker training job created: \$(params.job-name)"
- name: wait-for-completion
    image: quay.io/openshift/origin-cli:latest
script: |
#!/bin/bash
set -e
echo "Waiting for training job to complete..."
while true; do
STATUS=\$(kubectl get trainingjob \$(params.job-name) -n mlops-pipelines -o jsonpath='{.status.trainingJobStatus}')
echo "Current status: \$STATUS"
if [ "\$STATUS" == "Completed" ]; then
echo "Training job completed successfully!"
break
elif [ "\$STATUS" == "Failed" ] || [ "\$STATUS" == "Stopped" ]; then
echo "Training job failed or was stopped"
exit 1
fi
sleep 30
done
# Get model artifact location
MODEL_URI=\$(kubectl get trainingjob \$(params.job-name) -n mlops-pipelines -o jsonpath='{.status.modelArtifacts.s3ModelArtifacts}')
echo "Model artifacts saved to: \$MODEL_URI"
echo -n "\$MODEL_URI" > /workspace/model-uri.txt
workspaces:
- name: output
description: Workspace to store output
EOF
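To exercise this Task in isolation before wiring up the full pipeline, it can be started directly with tkn. This assumes the mlops-workspace PVC from Step 7.4 already exists, and it launches a real SageMaker job, so normal training charges apply:
# Run the training Task on its own (defaults apply for count, volume, and runtime;
# the instance type is overridden to avoid the expensive ml.p4d default)
tkn task start sagemaker-training \
  -n mlops-pipelines \
  --serviceaccount pipeline-sa \
  --param job-name=standalone-test-$(date +%s) \
  --param role-arn=$SAGEMAKER_ROLE_ARN \
  --param image-uri=$ECR_TRAINING_URI:latest \
  --param data-bucket=$DATA_BUCKET \
  --param model-bucket=$ML_BUCKET \
  --param instance-type=ml.m5.xlarge \
  --use-param-defaults \
  --workspace name=output,claimName=mlops-workspace \
  --showlog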
Step 7.2: Create Task for Model Deployment
# Create Task for deploying model to KServe
cat <<EOF | oc apply -f -
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
name: deploy-model
namespace: mlops-pipelines
spec:
params:
- name: model-name
type: string
description: Name for the deployed model
- name: model-uri
type: string
description: S3 URI of the model artifacts
- name: model-format
type: string
default: sklearn
steps:
- name: create-inference-service
image: quay.io/openshift/origin-cli:latest
script: |
#!/bin/bash
set -e
# Create InferenceService
cat > inference-service.yaml <<YAML
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: \$(params.model-name)
namespace: mlops-serving
spec:
predictor:
serviceAccountName: kserve-sa
model:
modelFormat:
name: \$(params.model-format)
storageUri: \$(params.model-uri)
resources:
requests:
cpu: "1"
memory: "2Gi"
limits:
cpu: "2"
memory: "4Gi"
YAML
kubectl apply -f inference-service.yaml
echo "InferenceService created: \$(params.model-name)"
# Wait for InferenceService to be ready
kubectl wait --for=condition=Ready \
inferenceservice/\$(params.model-name) \
-n mlops-serving \
--timeout=300s
echo "Model deployment completed successfully!"
# Get inference endpoint
ENDPOINT=\$(kubectl get inferenceservice \$(params.model-name) -n mlops-serving -o jsonpath='{.status.url}')
echo "Inference endpoint: \$ENDPOINT"
EOF
Step 7.3: Create Complete MLOps Pipeline
# Create the full pipeline
cat <<EOF | oc apply -f -
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
name: mlops-pipeline
namespace: mlops-pipelines
spec:
params:
- name: model-name
type: string
description: Name for the model
default: ml-model
- name: sagemaker-role-arn
type: string
description: SageMaker execution role ARN
- name: training-image-uri
type: string
description: ECR URI for training container
- name: data-bucket
type: string
description: S3 bucket with training data
- name: model-bucket
type: string
description: S3 bucket for model artifacts
- name: instance-type
type: string
description: SageMaker instance type
default: ml.m5.xlarge
workspaces:
- name: shared-workspace
tasks:
- name: train-model
taskRef:
name: sagemaker-training
params:
- name: job-name
value: "\$(params.model-name)-\$(context.pipelineRun.uid)"
- name: role-arn
value: "\$(params.sagemaker-role-arn)"
- name: image-uri
value: "\$(params.training-image-uri)"
- name: instance-type
value: "\$(params.instance-type)"
- name: data-bucket
value: "\$(params.data-bucket)"
- name: model-bucket
value: "\$(params.model-bucket)"
workspaces:
- name: output
workspace: shared-workspace
- name: deploy-model
runAfter:
- train-model
taskRef:
name: deploy-model
params:
- name: model-name
value: "\$(params.model-name)"
- name: model-uri
value: "s3://\$(params.model-bucket)/models/\$(params.model-name)-\$(context.pipelineRun.uid)/output/model.tar.gz"
EOF
Step 7.4: Create PipelineRun
# Create workspace PVC
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mlops-workspace
namespace: mlops-pipelines
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
EOF
# Create PipelineRun to execute the pipeline
cat <<EOF | oc create -f -
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
generateName: mlops-pipeline-run-
namespace: mlops-pipelines
spec:
pipelineRef:
name: mlops-pipeline
params:
- name: model-name
value: "classifier-model"
- name: sagemaker-role-arn
value: "$SAGEMAKER_ROLE_ARN"
- name: training-image-uri
value: "$ECR_TRAINING_URI:latest"
- name: data-bucket
value: "$DATA_BUCKET"
- name: model-bucket
value: "$ML_BUCKET"
- name: instance-type
value: "ml.m5.xlarge"
workspaces:
- name: shared-workspace
persistentVolumeClaim:
claimName: mlops-workspace
serviceAccountName: pipeline-sa
EOF
Testing and Validation
Test 1: Monitor Pipeline Execution
# List pipeline runs
tkn pipelinerun list -n mlops-pipelines
# Get latest pipeline run
export PIPELINE_RUN=$(tkn pipelinerun list -n mlops-pipelines -o jsonpath='{.items[0].metadata.name}')
# Watch pipeline execution
tkn pipelinerun logs $PIPELINE_RUN -f -n mlops-pipelines
# Check pipeline status
tkn pipelinerun describe $PIPELINE_RUN -n mlops-pipelines
Test 2: Verify SageMaker Training Job
# List SageMaker training jobs via ACK
kubectl get trainingjobs -n mlops-pipelines
# Get training job details
kubectl describe trainingjob -n mlops-pipelines
# Cross-check training jobs with the AWS CLI
aws sagemaker list-training-jobs --region $AWS_REGION
# View training job logs
export TRAINING_JOB_NAME=$(kubectl get trainingjobs -n mlops-pipelines -o jsonpath='{.items[0].metadata.name}')
aws logs tail /aws/sagemaker/TrainingJobs --follow --log-stream-name-prefix $TRAINING_JOB_NAME
Test 3: Verify Model Deployment
# Check InferenceService status
kubectl get inferenceservice -n mlops-serving
# Get inference endpoint
export INFERENCE_URL=$(kubectl get inferenceservice classifier-model -n mlops-serving -o jsonpath='{.status.url}')
echo "Inference URL: $INFERENCE_URL"
# Test inference with sample data
curl -X POST $INFERENCE_URL/v1/models/classifier-model:predict \
-H "Content-Type: application/json" \
-d '{
"instances": [
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
]
}'
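With the cluster-local Knative domain configured in Phase 6, the InferenceService URL may only resolve inside the cluster; either expose it through an OpenShift Route or send the request from a pod. A hedged in-cluster variant (curlimages/curl is just a convenient public client image):
# Send the same request from a temporary pod inside the cluster
oc run curl-test --rm -it --restart=Never -n mlops-serving \
  --image=curlimages/curl --command -- \
  curl -s -X POST $INFERENCE_URL/v1/models/classifier-model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]]}'
A successful response follows the KServe v1 protocol, e.g. {"predictions": [0]}; the predicted class depends on the trained model.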
Test 4: Load Testing
# Create load test script
cat > load-test.sh <<'BASH'
#!/bin/bash
INFERENCE_URL=$1
REQUESTS=$2
echo "Running $REQUESTS inference requests to $INFERENCE_URL"
for i in $(seq 1 $REQUESTS); do
curl -s -X POST $INFERENCE_URL/v1/models/classifier-model:predict \
-H "Content-Type: application/json" \
-d '{
"instances": [
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
]
}' > /dev/null &
if [ $((i % 10)) -eq 0 ]; then
echo "Sent $i requests"
fi
done
wait
echo "Load test completed"
BASH
chmod +x load-test.sh
# Run load test
./load-test.sh $INFERENCE_URL 100
Resource Cleanup
To avoid ongoing AWS charges, follow these steps to clean up all resources.
Step 1: Delete InferenceServices
# Delete all InferenceServices
kubectl delete inferenceservice --all -n mlops-serving
# Verify deletion
kubectl get inferenceservice -n mlops-serving
Step 2: Delete Pipelines and Runs
# Delete all pipeline runs
kubectl delete pipelinerun --all -n mlops-pipelines
# Delete pipelines
kubectl delete pipeline mlops-pipeline -n mlops-pipelines
# Delete tasks
kubectl delete task --all -n mlops-pipelines
# Delete PVC
kubectl delete pvc mlops-workspace -n mlops-pipelines
Step 3: Delete SageMaker Training Jobs
# Delete ACK SageMaker resources
kubectl delete trainingjobs --all -n mlops-pipelines
# Verify in AWS Console or CLI
aws sagemaker list-training-jobs --region $AWS_REGION
Step 4: Delete S3 Buckets
# Empty the buckets, including object versions and delete markers
# (versioning was enabled in Step 4.2, so a plain "aws s3 rm" is not enough)
for BUCKET in $ML_BUCKET $DATA_BUCKET; do
  aws s3api delete-objects --bucket $BUCKET --delete \
    "$(aws s3api list-object-versions --bucket $BUCKET --output json \
        --query '{Objects: Versions[].{Key: Key, VersionId: VersionId}}')" 2>/dev/null || true
  aws s3api delete-objects --bucket $BUCKET --delete \
    "$(aws s3api list-object-versions --bucket $BUCKET --output json \
        --query '{Objects: DeleteMarkers[].{Key: Key, VersionId: VersionId}}')" 2>/dev/null || true
done
# Delete buckets
aws s3 rb s3://$ML_BUCKET --region $AWS_REGION
aws s3 rb s3://$DATA_BUCKET --region $AWS_REGION
echo "S3 buckets deleted"
Step 5: Delete ECR Repository
# Delete ECR repository
aws ecr delete-repository \
--repository-name mlops/training \
--force \
--region $AWS_REGION
echo "ECR repository deleted"
Step 6: Delete ACK Controllers
# Delete ACK SageMaker controller
kubectl delete -f install.yaml
# Delete ACK namespace
kubectl delete namespace ack-system
Step 7: Delete ROSA Cluster
# Delete ROSA cluster (takes ~10-15 minutes)
rosa delete cluster --cluster=$CLUSTER_NAME --yes
# Wait for cluster deletion
rosa logs uninstall --cluster=$CLUSTER_NAME --watch
# Verify deletion
rosa list clusters
Step 8: Delete IAM Resources
# Detach policies and delete ACK role
aws iam detach-role-policy \
--role-name ACKSageMakerControllerRole \
--policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/ACKSageMakerPolicy
aws iam delete-role --role-name ACKSageMakerControllerRole
aws iam delete-policy \
--policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/ACKSageMakerPolicy
# Delete SageMaker execution role
aws iam delete-role-policy \
--role-name SageMakerMLOpsExecutionRole \
--policy-name S3Access
aws iam detach-role-policy \
--role-name SageMakerMLOpsExecutionRole \
--policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
aws iam delete-role --role-name SageMakerMLOpsExecutionRole
# Delete KServe role
aws iam delete-role-policy \
--role-name KServeS3AccessRole \
--policy-name S3ReadAccess
aws iam delete-role --role-name KServeS3AccessRole
echo "IAM resources deleted"
Step 9: Clean Up Local Files
# Remove temporary files
rm -f ack-sagemaker-policy.json
rm -f ack-trust-policy.json
rm -f sagemaker-trust-policy.json
rm -f sagemaker-s3-policy.json
rm -f kserve-trust-policy.json
rm -f kserve-s3-policy.json
rm -f install.yaml
rm -rf sagemaker-training/
rm -rf sample-data/
rm -f load-test.sh
echo "Local files cleaned up"
Verification
# Verify ROSA cluster is deleted
rosa list clusters
# Verify S3 buckets are deleted
aws s3 ls | grep mlops
# Verify ECR repositories are deleted
aws ecr describe-repositories --region $AWS_REGION | grep mlops
# Verify IAM roles are deleted
aws iam list-roles | grep -E "ACKSageMaker|SageMakerMLOps|KServeS3"
echo "Cleanup verification complete"
Troubleshooting
Issue: ACK Controller Cannot Create SageMaker Jobs
Symptoms: TrainingJob CR is created but SageMaker job doesn't start
Solutions:
- Verify ACK controller has correct IAM role
- Check service account annotation
- Verify SageMaker execution role exists and has permissions
- Check CloudWatch logs for ACK controller
# Check ACK controller logs
kubectl logs -n ack-system deployment/ack-sagemaker-controller
# Verify service account annotation
kubectl get sa -n ack-system ack-sagemaker-controller -o yaml
# Test IAM role assumption
aws sts assume-role-with-web-identity \
--role-arn $ACK_ROLE_ARN \
--role-session-name test \
--web-identity-token $(kubectl create token ack-sagemaker-controller -n ack-system)
Issue: KServe Cannot Pull Model from S3
Symptoms: InferenceService stuck in "Downloading" state
Solutions:
- Verify KServe service account has correct IAM role
- Check S3 bucket permissions
- Verify model URI is correct
- Check storage-initializer logs
# Check InferenceService status
kubectl describe inferenceservice -n mlops-serving
# Check storage-initializer logs
kubectl logs -n mlops-serving -l serving.kserve.io/inferenceservice=classifier-model -c storage-initializer
# Verify S3 access from a pod running as the kserve-sa service account
kubectl run aws-cli --rm -it --restart=Never --image=amazon/aws-cli -n mlops-serving \
  --overrides='{"spec":{"serviceAccountName":"kserve-sa"}}' -- \
  s3 ls s3://$ML_BUCKET/models/
Issue: Pipeline Run Fails
Symptoms: PipelineRun shows failed status
Solutions:
- Check pipeline run logs
- Verify all parameters are correct
- Check task pod logs
- Verify service account permissions
# View pipeline run logs
tkn pipelinerun logs $PIPELINE_RUN -n mlops-pipelines
# Check failed task
kubectl get pods -n mlops-pipelines | grep $PIPELINE_RUN
# View task pod logs
kubectl logs -n mlops-pipelines <pod-name>
# Check events
kubectl get events -n mlops-pipelines --sort-by='.lastTimestamp'
Issue: SageMaker Training Job Fails
Symptoms: TrainingJob CR shows "Failed" status
Solutions:
- Check training container logs in CloudWatch
- Verify training data exists in S3
- Check SageMaker execution role permissions
- Verify container image is accessible
# Get training job name
export TRAINING_JOB_NAME=$(kubectl get trainingjob -n mlops-pipelines -o jsonpath='{.items[0].metadata.name}')
# Check CloudWatch logs
aws logs tail /aws/sagemaker/TrainingJobs --follow --log-stream-name-prefix $TRAINING_JOB_NAME
# Describe the training job in SageMaker
aws sagemaker describe-training-job --training-job-name $TRAINING_JOB_NAME
Issue: High Inference Latency
Symptoms: Model serving responses are slow
Solutions:
- Scale InferenceService replicas
- Adjust resource requests/limits
- Enable autoscaling
- Check network latency
# Scale the InferenceService by raising its minimum replica count
kubectl patch inferenceservice classifier-model -n mlops-serving --type='json' \
  -p='[{"op": "add", "path": "/spec/predictor/minReplicas", "value": 3}]'
# Tune the autoscaling target (concurrent requests per replica)
kubectl patch inferenceservice classifier-model -n mlops-serving --type='json' \
  -p='[{"op": "add", "path": "/spec/predictor/scaleTarget", "value": 10}]'
# Check pod resource usage
kubectl top pods -n mlops-serving
Debug Commands
# View all resources in namespace
kubectl get all -n mlops-pipelines
kubectl get all -n mlops-serving
# Describe resources
kubectl describe trainingjob -n mlops-pipelines
kubectl describe inferenceservice -n mlops-serving
# Check logs
kubectl logs -n ack-system deployment/ack-sagemaker-controller
kubectl logs -n kserve deployment/kserve-controller-manager
# View events
kubectl get events -n mlops-pipelines --sort-by='.lastTimestamp'
kubectl get events -n mlops-serving --sort-by='.lastTimestamp'