Marco Gonzalez

Hybrid MLOps Pipeline: Implementation Guide

Bursting to SageMaker Training from OpenShift Pipelines

Table of Contents

  1. Overview
  2. Architecture
  3. Prerequisites
  4. Phase 1: ROSA Cluster Setup
  5. Phase 2: OpenShift Pipelines Installation
  6. Phase 3: AWS Controllers for Kubernetes (ACK)
  7. Phase 4: Amazon SageMaker Integration
  8. Phase 5: Model Storage with S3
  9. Phase 6: KServe Model Serving
  10. Phase 7: End-to-End Pipeline
  11. Testing and Validation
  12. Resource Cleanup
  13. Troubleshooting

Overview

Project Purpose

This platform delivers a hybrid MLOps solution that cuts cost by combining OpenShift for orchestration and management with Amazon SageMaker for GPU-intensive training. Instead of keeping expensive GPU instances running 24/7, the architecture "bursts" to AWS on demand for training while serving inference cost-effectively on OpenShift.

Key Value Propositions

  • Cost Optimization: Pay for GPU instances only during training, not continuously
  • Elastic Scalability: Burst to powerful AWS instances (ml.p4d.24xlarge) on-demand
  • Hybrid Flexibility: Orchestrate from OpenShift while leveraging AWS managed services
  • Automated Workflows: End-to-end MLOps pipelines with minimal manual intervention
  • Production-Ready Serving: Low-latency inference on cost-effective OpenShift nodes

Solution Components

Component | Purpose | Layer
--- | --- | ---
ROSA | Managed OpenShift cluster on AWS | Infrastructure
OpenShift Pipelines | Tekton-based CI/CD orchestration | Orchestration
ACK (AWS Controllers for Kubernetes) | Manage AWS services from Kubernetes | Integration
Amazon SageMaker | Managed ML training with GPU instances | Training
Amazon S3 | Model artifacts and dataset storage | Data Lake
KServe | Model serving on OpenShift | Inference
Amazon ECR | Container registry for custom images | Container Registry

Architecture

High-Level Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                          AWS Cloud                                       │
│  ┌───────────────────────────────────────────────────────────────────┐ │
│  │                    ROSA Cluster (VPC)                              │ │
│  │  ┌─────────────────────────────────────────────────────────────┐  │ │
│  │  │           OpenShift Pipelines (Tekton)                       │  │ │
│  │  │  ┌────────────────┐      ┌──────────────────────────────┐  │  │ │
│  │  │  │ Pipeline       │      │  AWS Controllers for         │  │  │ │
│  │  │  │ Controller     │─────►│  Kubernetes (ACK)            │  │  │ │
│  │  │  └────────────────┘      └──────────┬───────────────────┘  │  │ │
│  │  │                                     │                       │  │ │
│  │  └─────────────────────────────────────┼───────────────────────┘  │ │
│  │                                        │                          │ │
│  │  ┌─────────────────────────────────────▼───────────────────────┐ │ │
│  │  │           KServe Model Serving                               │ │ │
│  │  │  ┌────────────────┐      ┌──────────────────────────────┐  │ │ │
│  │  │  │ InferenceService│◄─────│  Model from S3               │  │ │ │
│  │  │  │ (CPU Nodes)     │      │  (Automatic Pull)            │  │ │ │
│  │  │  └────────────────┘      └──────────────────────────────┘  │ │ │
│  │  └──────────────────────────────────────────────────────────────┘ │ │
│  └───────────────────────────────────────────────────────────────────┘ │
│                                        │                                │
│                          ACK Triggers  │ (Ephemeral)                    │
│                                        │                                │
│  ┌─────────────────────────────────────▼───────────────────────────┐  │
│  │           Amazon SageMaker Training                              │  │
│  │  ┌────────────────────────────────────────────────────────────┐ │  │
│  │  │  Training Job (ml.p4d.24xlarge - 8x A100 GPUs)             │ │  │
│  │  │  - Launched on-demand via ACK                              │ │  │
│  │  │  - Trains model using data from S3                         │ │  │
│  │  │  - Saves model artifacts to S3                             │ │  │
│  │  │  - Auto-terminates after completion                        │ │  │
│  │  └────────────────────────────────────────────────────────────┘ │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                        │                                │
│  ┌─────────────────────────────────────▼───────────────────────────┐  │
│  │                    Amazon S3                                     │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │  │
│  │  │  Training    │  │  Model       │  │  Inference           │  │  │
│  │  │  Datasets    │  │  Artifacts   │  │  Results             │  │  │
│  │  └──────────────┘  └──────────────┘  └──────────────────────┘  │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                    Amazon ECR                                     │  │
│  │  - Custom training container images                              │  │
│  │  - Model serving images                                          │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘

Workflow

  1. Data Preparation: Training datasets uploaded to S3
  2. Pipeline Trigger: Developer triggers OpenShift Pipeline
  3. Training Initiation: ACK creates SageMaker Training Job
  4. GPU Provisioning: SageMaker spins up ml.p4d.24xlarge instances
  5. Model Training: Training executes on high-performance GPUs
  6. Artifact Storage: Trained model saved to S3
  7. Instance Termination: GPU instances automatically shut down
  8. Model Deployment: KServe pulls model from S3 to OpenShift
  9. Inference Serving: Model serves predictions on cost-effective CPU nodes
  10. Monitoring: Pipeline tracks status and logs throughout

Cost Analysis

Traditional Approach (GPU instances running 24/7):

  • ml.p4d.24xlarge: ~$32/hour
  • Monthly cost: ~$23,040 (continuous operation)

Hybrid Approach (burst for training only):

  • Training: 4 hours/week × $32/hour = $128/week = $512/month
  • ROSA inference nodes: ~$1,500/month (m5.2xlarge instances)
  • Total: ~$2,012/month
  • Savings: ~91% compared to traditional approach
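
The savings figure follows directly from the rates above; a quick back-of-the-envelope check (the hourly rates are the approximate on-demand prices used in this comparison, not a quote):

# Rough cost comparison (approximate on-demand rates, 30-day month, 4 training weeks)
GPU_HOURLY=32                                    # ml.p4d.24xlarge, ~$/hour
ALWAYS_ON=$((GPU_HOURLY * 24 * 30))              # 24/7 GPU: 23040
BURST=$((GPU_HOURLY * 4 * 4 + 1500))             # 4 h/week training + ROSA nodes: 2012
echo "Always-on: \$${ALWAYS_ON}/month, burst: \$${BURST}/month"
echo "Savings: $(( (ALWAYS_ON - BURST) * 100 / ALWAYS_ON ))%"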

Prerequisites

Required Accounts and Subscriptions

  • [ ] AWS Account with administrative access
  • [ ] Red Hat Account with OpenShift subscription
  • [ ] ROSA Enabled in your AWS account
  • [ ] Amazon SageMaker Access in your target region
  • [ ] AWS Service Quotas for ml.p4d instances (request if needed)

Required Tools

Install the following CLI tools on your workstation:

# AWS CLI (v2)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# ROSA CLI
wget https://mirror.openshift.com/pub/openshift-v4/clients/rosa/latest/rosa-linux.tar.gz
tar -xvf rosa-linux.tar.gz
sudo mv rosa /usr/local/bin/rosa
rosa version

# OpenShift CLI (oc)
wget https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz
tar -xvf openshift-client-linux.tar.gz
sudo mv oc kubectl /usr/local/bin/
oc version

# Tekton CLI
curl -LO https://github.com/tektoncd/cli/releases/download/v0.33.0/tkn_0.33.0_Linux_x86_64.tar.gz
tar xvzf tkn_0.33.0_Linux_x86_64.tar.gz
sudo mv tkn /usr/local/bin/
tkn version

# Helm (v3)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version

AWS Prerequisites

Service Quotas

# Check SageMaker quotas
aws service-quotas get-service-quota \
  --service-code sagemaker \
  --quota-code L-2E8D9C5E \
  --region us-east-1

# Check EC2 quotas for ROSA
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --region us-east-1
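
If the returned quota value is too low (often 0 for ml.p4d by default), request an increase before running real training jobs. A sketch using the same quota code as above (confirm in the Service Quotas console that it matches the instance type you actually plan to use):

# Request a quota increase (adjust quota code and desired value to your instance type)
aws service-quotas request-service-quota-increase \
  --service-code sagemaker \
  --quota-code L-2E8D9C5E \
  --desired-value 1 \
  --region us-east-1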

IAM Permissions

Your AWS IAM user/role needs permissions for:

  • EC2 (VPC, subnets, security groups)
  • IAM (roles, policies)
  • S3 (buckets, objects)
  • SageMaker (training jobs, models)
  • ECR (repositories, images)
  • CloudWatch (logs, metrics)

Knowledge Prerequisites

You should be familiar with:

  • Machine Learning concepts (training, inference, model artifacts)
  • AWS fundamentals (VPC, IAM, S3)
  • Kubernetes basics (pods, deployments, services)
  • CI/CD pipeline concepts
  • Python and ML frameworks (TensorFlow, PyTorch, scikit-learn)

Phase 1: ROSA Cluster Setup

Step 1.1: Configure AWS CLI

# Configure AWS credentials
aws configure

# Verify configuration
aws sts get-caller-identity

Step 1.2: Initialize ROSA

# Log in to Red Hat
rosa login

# Verify ROSA prerequisites
rosa verify quota
rosa verify permissions

# Create the account-wide STS roles and policies (the STS cluster's OIDC provider
# is required later for the IAM roles used by ACK and KServe)
rosa create account-roles --mode auto --yes

Step 1.3: Create ROSA Cluster

Create a ROSA cluster optimized for MLOps workloads:

# Set environment variables
export CLUSTER_NAME="mlops-platform"
export AWS_REGION="us-east-1"
export MACHINE_TYPE="m5.2xlarge"
export COMPUTE_NODES=3

# Create ROSA cluster with STS (takes ~40 minutes)
rosa create cluster \
  --cluster-name $CLUSTER_NAME \
  --sts \
  --mode auto \
  --region $AWS_REGION \
  --multi-az \
  --compute-machine-type $MACHINE_TYPE \
  --compute-nodes $COMPUTE_NODES \
  --machine-cidr 10.0.0.0/16 \
  --service-cidr 172.30.0.0/16 \
  --pod-cidr 10.128.0.0/14 \
  --host-prefix 23 \
  --yes

Configuration Rationale:

  • m5.2xlarge: 8 vCPUs, 32 GB RAM - suitable for ML inference and pipeline orchestration
  • 3 nodes: High availability for production workloads
  • Multi-AZ: Ensures resilience for serving layer

Step 1.4: Monitor Cluster Creation

# Watch cluster installation progress
rosa logs install --cluster=$CLUSTER_NAME --watch

# Check cluster status
rosa describe cluster --cluster=$CLUSTER_NAME

Step 1.5: Create Admin User

# Create cluster admin user
rosa create admin --cluster=$CLUSTER_NAME

# Save the login command (will be displayed in output)

Step 1.6: Connect to Cluster

# Use the login command from previous step
oc login https://api.mlops-platform.xxxx.p1.openshiftapps.com:6443 \
  --username cluster-admin \
  --password <your-password>

# Verify cluster access
oc cluster-info
oc get nodes

Step 1.7: Create Project Namespaces

# Create namespace for pipelines
oc new-project mlops-pipelines

# Create namespace for model serving
oc new-project mlops-serving

# Create namespace for ACK controllers
oc new-project ack-system

Phase 2: OpenShift Pipelines Installation

Step 2.1: Install OpenShift Pipelines Operator

# Create operator subscription
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-pipelines
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-pipelines-operator
  namespace: openshift-operators
spec:
  channel: latest
  name: openshift-pipelines-operator-rh
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic
EOF

Step 2.2: Verify Operator Installation

# Wait for operator to be ready (takes 2-3 minutes)
oc get csv -n openshift-operators | grep pipelines

# Verify Tekton components are running
oc get pods -n openshift-pipelines

# Check Tekton version
tkn version

Step 2.3: Configure Pipeline Service Account

# Create service account for pipelines
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pipeline-sa
  namespace: mlops-pipelines
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pipeline-sa-edit
  namespace: mlops-pipelines
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- kind: ServiceAccount
  name: pipeline-sa
  namespace: mlops-pipelines
EOF

Phase 3: AWS Controllers for Kubernetes (ACK)

ACK enables managing AWS services directly from Kubernetes using custom resources.
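
Once the controller from Step 3.1 is installed, SageMaker concepts such as training jobs become ordinary Kubernetes objects. A quick way to explore them (a sketch; the exact set of CRDs depends on the controller version you install):

# Inspect the SageMaker custom resources exposed by ACK
kubectl api-resources --api-group=sagemaker.services.k8s.aws
kubectl explain trainingjobs --api-version=sagemaker.services.k8s.aws/v1alpha1
kubectl get trainingjobs -n mlops-pipelines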

Step 3.1: Install ACK SageMaker Controller

# Set variables
export ACK_K8S_NAMESPACE=ack-system
export AWS_REGION=us-east-1

# Download the latest ACK SageMaker controller release manifest
export SERVICE=sagemaker
export RELEASE_VERSION=$(curl -sL https://api.github.com/repos/aws-controllers-k8s/${SERVICE}-controller/releases/latest | grep '"tag_name":' | cut -d'"' -f4)

wget https://github.com/aws-controllers-k8s/${SERVICE}-controller/releases/download/${RELEASE_VERSION}/install.yaml

# Apply ACK controller
kubectl apply -f install.yaml

# Verify installation
kubectl get pods -n ack-system
kubectl get crd | grep sagemaker

Step 3.2: Create IAM Role for ACK

# Create IAM policy for SageMaker access
cat > ack-sagemaker-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateTrainingJob",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:StopTrainingJob",
        "sagemaker:CreateModel",
        "sagemaker:DeleteModel",
        "sagemaker:DescribeModel",
        "sagemaker:CreateEndpointConfig",
        "sagemaker:DeleteEndpointConfig",
        "sagemaker:DescribeEndpointConfig"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::mlops-*",
        "arn:aws:s3:::mlops-*/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "iam:PassRole"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "iam:PassedToService": "sagemaker.amazonaws.com"
        }
      }
    }
  ]
}
EOF

# Create policy
aws iam create-policy \
  --policy-name ACKSageMakerPolicy \
  --policy-document file://ack-sagemaker-policy.json

# Get OIDC provider for ROSA
export OIDC_PROVIDER=$(rosa describe cluster -c $CLUSTER_NAME -o json | jq -r .aws.sts.oidc_endpoint_url | sed 's|https://||')
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

# Create trust policy
cat > ack-trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:ack-system:ack-sagemaker-controller"
        }
      }
    }
  ]
}
EOF

# Create IAM role
export ACK_ROLE_ARN=$(aws iam create-role \
  --role-name ACKSageMakerControllerRole \
  --assume-role-policy-document file://ack-trust-policy.json \
  --query 'Role.Arn' \
  --output text)

# Attach policy to role
aws iam attach-role-policy \
  --role-name ACKSageMakerControllerRole \
  --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/ACKSageMakerPolicy

echo "ACK IAM Role ARN: $ACK_ROLE_ARN"

Step 3.3: Configure ACK Controller with IAM Role

# Annotate service account
kubectl annotate serviceaccount -n ack-system ack-sagemaker-controller \
  eks.amazonaws.com/role-arn=$ACK_ROLE_ARN

# Restart ACK controller to pick up annotation
kubectl rollout restart deployment -n ack-system ack-sagemaker-controller

# Verify controller is running
kubectl get pods -n ack-system
kubectl logs -n ack-system deployment/ack-sagemaker-controller
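
If the annotation was picked up correctly, the cluster's pod identity webhook should inject web-identity credentials into the controller pod. A quick sanity check (grepping the pod name for "sagemaker" is an assumption based on the deployment name used above):

# Confirm the web-identity environment variables were injected
POD=$(kubectl get pods -n ack-system -o name | grep sagemaker | head -1)
kubectl exec -n ack-system $POD -- env | grep -E 'AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE'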

Phase 4: Amazon SageMaker Integration

Step 4.1: Create SageMaker Execution Role

# Create trust policy for SageMaker
cat > sagemaker-trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create SageMaker execution role
export SAGEMAKER_ROLE_ARN=$(aws iam create-role \
  --role-name SageMakerMLOpsExecutionRole \
  --assume-role-policy-document file://sagemaker-trust-policy.json \
  --query 'Role.Arn' \
  --output text)

# Attach AWS managed policy
aws iam attach-role-policy \
  --role-name SageMakerMLOpsExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

# Create custom S3 access policy
cat > sagemaker-s3-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::mlops-*",
        "arn:aws:s3:::mlops-*/*"
      ]
    }
  ]
}
EOF

aws iam put-role-policy \
  --role-name SageMakerMLOpsExecutionRole \
  --policy-name S3Access \
  --policy-document file://sagemaker-s3-policy.json

echo "SageMaker Execution Role ARN: $SAGEMAKER_ROLE_ARN"

Step 4.2: Create S3 Buckets

# Create S3 buckets for ML artifacts
export ML_BUCKET="mlops-artifacts-${ACCOUNT_ID}"
export DATA_BUCKET="mlops-datasets-${ACCOUNT_ID}"

aws s3 mb s3://$ML_BUCKET --region $AWS_REGION
aws s3 mb s3://$DATA_BUCKET --region $AWS_REGION

# Enable versioning
aws s3api put-bucket-versioning \
  --bucket $ML_BUCKET \
  --versioning-configuration Status=Enabled

aws s3api put-bucket-versioning \
  --bucket $DATA_BUCKET \
  --versioning-configuration Status=Enabled

# Create folder structure
aws s3api put-object --bucket $ML_BUCKET --key models/
aws s3api put-object --bucket $ML_BUCKET --key checkpoints/
aws s3api put-object --bucket $DATA_BUCKET --key training/
aws s3api put-object --bucket $DATA_BUCKET --key validation/

echo "S3 Buckets created:"
echo "  Models: s3://$ML_BUCKET"
echo "  Data: s3://$DATA_BUCKET"

Step 4.3: Create ECR Repository

# Create ECR repository for custom training images
aws ecr create-repository \
  --repository-name mlops/training \
  --region $AWS_REGION

# Get ECR login command
aws ecr get-login-password --region $AWS_REGION | \
  docker login --username AWS --password-stdin \
  ${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com

export ECR_TRAINING_URI="${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/mlops/training"
echo "ECR Repository: $ECR_TRAINING_URI"
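
Optionally, a lifecycle policy keeps the repository from accumulating untagged layers as you rebuild the training image. A sketch (the 14-day window is an arbitrary choice):

# Optional: expire untagged images to limit ECR storage costs
cat > ecr-lifecycle.json <<'EOF'
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire untagged images after 14 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 14
      },
      "action": { "type": "expire" }
    }
  ]
}
EOF

aws ecr put-lifecycle-policy \
  --repository-name mlops/training \
  --lifecycle-policy-text file://ecr-lifecycle.json \
  --region $AWS_REGION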

Step 4.4: Build Custom Training Container

# Create directory for training container
mkdir -p sagemaker-training
cd sagemaker-training

# Create training script
cat > train.py <<'PYTHON'
import argparse
import os
import json
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import boto3

def load_data_from_s3(data_dir):
    """Load training and validation data"""
    print(f"Loading data from {data_dir}")

    # Load training data
    X_train = np.load(os.path.join(data_dir, 'train', 'X_train.npy'))
    y_train = np.load(os.path.join(data_dir, 'train', 'y_train.npy'))

    # Load validation data
    X_val = np.load(os.path.join(data_dir, 'validation', 'X_val.npy'))
    y_val = np.load(os.path.join(data_dir, 'validation', 'y_val.npy'))

    return X_train, y_train, X_val, y_val

def train_model(X_train, y_train, hyperparameters):
    """Train Random Forest model"""
    print("Training model with hyperparameters:", hyperparameters)

    model = RandomForestClassifier(
        n_estimators=hyperparameters['n_estimators'],
        max_depth=hyperparameters['max_depth'],
        random_state=42,
        n_jobs=-1
    )

    model.fit(X_train, y_train)
    return model

def evaluate_model(model, X_val, y_val):
    """Evaluate model on validation set"""
    y_pred = model.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    report = classification_report(y_val, y_pred, output_dict=True)

    print(f"Validation Accuracy: {accuracy:.4f}")
    print(classification_report(y_val, y_pred))

    return accuracy, report

def save_model(model, model_dir, metrics):
    """Save model and metrics"""
    os.makedirs(model_dir, exist_ok=True)

    # Save model
    model_path = os.path.join(model_dir, 'model.joblib')
    joblib.dump(model, model_path)
    print(f"Model saved to {model_path}")

    # Save metrics
    metrics_path = os.path.join(model_dir, 'metrics.json')
    with open(metrics_path, 'w') as f:
        json.dump(metrics, f, indent=2)
    print(f"Metrics saved to {metrics_path}")

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters
    parser.add_argument('--n_estimators', type=int, default=100)
    parser.add_argument('--max_depth', type=int, default=10)

    # SageMaker specific arguments
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train'))
    parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION', '/opt/ml/input/data/validation'))

    args = parser.parse_args()

    # Load data
    data_dir = os.path.dirname(args.train)
    X_train, y_train, X_val, y_val = load_data_from_s3(data_dir)

    # Train model
    hyperparameters = {
        'n_estimators': args.n_estimators,
        'max_depth': args.max_depth
    }
    model = train_model(X_train, y_train, hyperparameters)

    # Evaluate model
    accuracy, report = evaluate_model(model, X_val, y_val)

    # Save model and metrics
    metrics = {
        'accuracy': accuracy,
        'classification_report': report,
        'hyperparameters': hyperparameters
    }
    save_model(model, args.model_dir, metrics)

    print("Training completed successfully!")
PYTHON

# Create Dockerfile
cat > Dockerfile <<'DOCKERFILE'
FROM python:3.10-slim

# Install dependencies
RUN pip install --no-cache-dir \
    numpy==1.24.3 \
    scikit-learn==1.3.0 \
    joblib==1.3.2 \
    boto3==1.28.25

# Copy training script
COPY train.py /opt/ml/code/train.py

# Set working directory
WORKDIR /opt/ml/code

# Set entry point
ENV SAGEMAKER_PROGRAM train.py

ENTRYPOINT ["python", "train.py"]
DOCKERFILE

# Build and push image
docker build -t mlops-training:latest .
docker tag mlops-training:latest $ECR_TRAINING_URI:latest
docker push $ECR_TRAINING_URI:latest

cd ..
echo "Training container image pushed to ECR"

Phase 5: Model Storage with S3

Step 5.1: Upload Sample Training Data

# Create sample dataset
mkdir -p sample-data
cd sample-data

python3 <<PYTHON
import numpy as np

# Generate synthetic classification dataset
np.random.seed(42)

# Training data
X_train = np.random.randn(1000, 20)
y_train = np.random.randint(0, 2, 1000)

# Validation data
X_val = np.random.randn(200, 20)
y_val = np.random.randint(0, 2, 200)

# Save to files
np.save('X_train.npy', X_train)
np.save('y_train.npy', y_train)
np.save('X_val.npy', X_val)
np.save('y_val.npy', y_val)

print("Sample dataset created")
PYTHON

# Upload to S3
aws s3 cp X_train.npy s3://$DATA_BUCKET/training/
aws s3 cp y_train.npy s3://$DATA_BUCKET/training/
aws s3 cp X_val.npy s3://$DATA_BUCKET/validation/
aws s3 cp y_val.npy s3://$DATA_BUCKET/validation/

cd ..
echo "Sample data uploaded to S3"
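
Before spending GPU hours, it can be worth smoke-testing the training image locally against this sample data, mimicking the /opt/ml layout SageMaker provides. A sketch, assuming the train.py defaults from Phase 4 and small hyperparameters to keep the run fast:

# Optional local smoke test of the training container
mkdir -p /tmp/ml-test/input/data/train /tmp/ml-test/input/data/validation /tmp/ml-test/model
cp sample-data/X_train.npy sample-data/y_train.npy /tmp/ml-test/input/data/train/
cp sample-data/X_val.npy sample-data/y_val.npy /tmp/ml-test/input/data/validation/

docker run --rm \
  -v /tmp/ml-test/input/data:/opt/ml/input/data \
  -v /tmp/ml-test/model:/opt/ml/model \
  mlops-training:latest --n_estimators 10 --max_depth 5

# Expect model.joblib and metrics.json
ls /tmp/ml-test/model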

Step 5.2: Create ConfigMap for S3 Configuration

# Store S3 bucket names in ConfigMap
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: mlops-config
  namespace: mlops-pipelines
data:
  ML_BUCKET: "$ML_BUCKET"
  DATA_BUCKET: "$DATA_BUCKET"
  AWS_REGION: "$AWS_REGION"
  SAGEMAKER_ROLE_ARN: "$SAGEMAKER_ROLE_ARN"
  ECR_TRAINING_URI: "$ECR_TRAINING_URI"
EOF
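
The pipeline in Phase 7 passes these values as explicit parameters; if you prefer to read them from the ConfigMap instead, a Tekton step can mount it with envFrom. A minimal sketch (the print-mlops-config Task and the ubi-minimal image are illustrative choices, not part of the pipeline built later):

# Illustrative Task that reads the ConfigMap via envFrom
cat <<EOF | oc apply -f -
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: print-mlops-config
  namespace: mlops-pipelines
spec:
  steps:
    - name: print-config
      image: registry.access.redhat.com/ubi9/ubi-minimal:latest
      envFrom:
        - configMapRef:
            name: mlops-config
      script: |
        #!/bin/sh
        echo "Data bucket:  \$DATA_BUCKET"
        echo "Model bucket: \$ML_BUCKET"
        echo "Region:       \$AWS_REGION"
EOF

# Run it to verify the values
tkn task start print-mlops-config -n mlops-pipelines --showlog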

Phase 6: KServe Model Serving

Step 6.1: Install KServe

# Install Serverless Operator (prerequisite for KServe)
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: serverless-operator
  namespace: openshift-operators
spec:
  channel: stable
  name: serverless-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic
EOF

# Wait for operator to be ready
sleep 30
oc get csv -n openshift-operators | grep serverless

# Install KServe via Red Hat OpenShift AI or manually
# For this guide, we'll install KServe components manually

# Install Knative Serving
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: knative-serving
---
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  ingress:
    istio:
      enabled: false
  config:
    domain:
      svc.cluster.local: ""
EOF

# Install KServe
export KSERVE_VERSION=v0.11.0
kubectl apply -f https://github.com/kserve/kserve/releases/download/${KSERVE_VERSION}/kserve.yaml

# Wait for KServe to be ready
kubectl wait --for=condition=Ready pods --all -n kserve --timeout=300s

Step 6.2: Create Custom ServingRuntime for scikit-learn

# Create scikit-learn serving runtime
cat <<EOF | oc apply -f -
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: sklearn-runtime
  namespace: mlops-serving
spec:
  supportedModelFormats:
    - name: sklearn
      version: "1"
      autoSelect: true
  containers:
    - name: kserve-container
      image: kserve/sklearnserver:v0.11.0
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
EOF

Step 6.3: Create Service Account for Model Access

# Create IAM role for KServe to access S3
cat > kserve-trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:mlops-serving:kserve-sa"
        }
      }
    }
  ]
}
EOF

# Create role
export KSERVE_ROLE_ARN=$(aws iam create-role \
  --role-name KServeS3AccessRole \
  --assume-role-policy-document file://kserve-trust-policy.json \
  --query 'Role.Arn' \
  --output text)

# Create S3 read policy
cat > kserve-s3-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::${ML_BUCKET}",
        "arn:aws:s3:::${ML_BUCKET}/*"
      ]
    }
  ]
}
EOF

aws iam put-role-policy \
  --role-name KServeS3AccessRole \
  --policy-name S3ReadAccess \
  --policy-document file://kserve-s3-policy.json

# Create service account
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kserve-sa
  namespace: mlops-serving
  annotations:
    eks.amazonaws.com/role-arn: $KSERVE_ROLE_ARN
EOF

Phase 7: End-to-End Pipeline

Step 7.1: Create Pipeline Tasks

# Create Task for SageMaker training
cat <<EOF | oc apply -f -
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: sagemaker-training
  namespace: mlops-pipelines
spec:
  params:
    - name: job-name
      type: string
      description: SageMaker training job name
    - name: role-arn
      type: string
      description: SageMaker execution role ARN
    - name: image-uri
      type: string
      description: Training container image URI
    - name: instance-type
      type: string
      default: ml.p4d.24xlarge
    - name: instance-count
      type: string
      default: "1"
    - name: volume-size
      type: string
      default: "50"
    - name: max-runtime
      type: string
      default: "3600"
    - name: data-bucket
      type: string
    - name: model-bucket
      type: string
  steps:
    - name: create-training-job
      image: quay.io/openshift/origin-cli:latest
      script: |
        #!/bin/bash
        set -e

        # Create SageMaker training job manifest
        cat > training-job.yaml <<YAML
        apiVersion: sagemaker.services.k8s.aws/v1alpha1
        kind: TrainingJob
        metadata:
          name: \$(params.job-name)
          namespace: mlops-pipelines
        spec:
          trainingJobName: \$(params.job-name)
          roleARN: \$(params.role-arn)
          algorithmSpecification:
            trainingImage: \$(params.image-uri)
            trainingInputMode: File
          resourceConfig:
            instanceType: \$(params.instance-type)
            instanceCount: \$(params.instance-count)
            volumeSizeInGB: \$(params.volume-size)
          inputDataConfig:
            - channelName: train
              dataSource:
                s3DataSource:
                  s3DataType: S3Prefix
                  s3URI: s3://\$(params.data-bucket)/training/
                  s3DataDistributionType: FullyReplicated
            - channelName: validation
              dataSource:
                s3DataSource:
                  s3DataType: S3Prefix
                  s3URI: s3://\$(params.data-bucket)/validation/
                  s3DataDistributionType: FullyReplicated
          outputDataConfig:
            s3OutputPath: s3://\$(params.model-bucket)/models/
          stoppingCondition:
            maxRuntimeInSeconds: \$(params.max-runtime)
        YAML

        # Apply the training job
        kubectl apply -f training-job.yaml

        echo "SageMaker training job created: \$(params.job-name)"

    - name: wait-for-completion
      image: quay.io/openshift/origin-cli:latest
      script: |
        #!/bin/bash
        set -e

        echo "Waiting for training job to complete..."

        while true; do
          STATUS=\$(kubectl get trainingjob \$(params.job-name) -n mlops-pipelines -o jsonpath='{.status.trainingJobStatus}')

          echo "Current status: \$STATUS"

          if [ "\$STATUS" == "Completed" ]; then
            echo "Training job completed successfully!"
            break
          elif [ "\$STATUS" == "Failed" ] || [ "\$STATUS" == "Stopped" ]; then
            echo "Training job failed or was stopped"
            exit 1
          fi

          sleep 30
        done

        # Get model artifact location
        MODEL_URI=\$(kubectl get trainingjob \$(params.job-name) -n mlops-pipelines -o jsonpath='{.status.modelArtifacts.s3ModelArtifacts}')
        echo "Model artifacts saved to: \$MODEL_URI"
        echo -n "\$MODEL_URI" > /workspace/output/model-uri.txt
  workspaces:
    - name: output
      description: Workspace to store output
EOF

Step 7.2: Create Task for Model Deployment

# Create Task for deploying model to KServe
cat <<EOF | oc apply -f -
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: deploy-model
  namespace: mlops-pipelines
spec:
  params:
    - name: model-name
      type: string
      description: Name for the deployed model
    - name: model-uri
      type: string
      description: S3 URI of the model artifacts
    - name: model-format
      type: string
      default: sklearn
  steps:
    - name: create-inference-service
      image: quay.io/openshift/origin-cli:latest
      script: |
        #!/bin/bash
        set -e

        # Create InferenceService
        cat > inference-service.yaml <<YAML
        apiVersion: serving.kserve.io/v1beta1
        kind: InferenceService
        metadata:
          name: \$(params.model-name)
          namespace: mlops-serving
        spec:
          predictor:
            serviceAccountName: kserve-sa
            model:
              modelFormat:
                name: \$(params.model-format)
              storageUri: \$(params.model-uri)
              resources:
                requests:
                  cpu: "1"
                  memory: "2Gi"
                limits:
                  cpu: "2"
                  memory: "4Gi"
        YAML

        kubectl apply -f inference-service.yaml

        echo "InferenceService created: \$(params.model-name)"

        # Wait for InferenceService to be ready
        kubectl wait --for=condition=Ready \
          inferenceservice/\$(params.model-name) \
          -n mlops-serving \
          --timeout=300s

        echo "Model deployment completed successfully!"

        # Get inference endpoint
        ENDPOINT=\$(kubectl get inferenceservice \$(params.model-name) -n mlops-serving -o jsonpath='{.status.url}')
        echo "Inference endpoint: \$ENDPOINT"
EOF

Step 7.3: Create Complete MLOps Pipeline

# Create the full pipeline
cat <<EOF | oc apply -f -
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: mlops-pipeline
  namespace: mlops-pipelines
spec:
  params:
    - name: model-name
      type: string
      description: Name for the model
      default: ml-model
    - name: sagemaker-role-arn
      type: string
      description: SageMaker execution role ARN
    - name: training-image-uri
      type: string
      description: ECR URI for training container
    - name: data-bucket
      type: string
      description: S3 bucket with training data
    - name: model-bucket
      type: string
      description: S3 bucket for model artifacts
    - name: instance-type
      type: string
      description: SageMaker instance type
      default: ml.m5.xlarge
  workspaces:
    - name: shared-workspace
  tasks:
    - name: train-model
      taskRef:
        name: sagemaker-training
      params:
        - name: job-name
          value: "\$(params.model-name)-\$(context.pipelineRun.uid)"
        - name: role-arn
          value: "\$(params.sagemaker-role-arn)"
        - name: image-uri
          value: "\$(params.training-image-uri)"
        - name: instance-type
          value: "\$(params.instance-type)"
        - name: data-bucket
          value: "\$(params.data-bucket)"
        - name: model-bucket
          value: "\$(params.model-bucket)"
      workspaces:
        - name: output
          workspace: shared-workspace

    - name: deploy-model
      runAfter:
        - train-model
      taskRef:
        name: deploy-model
      params:
        - name: model-name
          value: "\$(params.model-name)"
        - name: model-uri
          value: "s3://\$(params.model-bucket)/models/\$(params.model-name)-\$(context.pipelineRun.uid)/output/model.tar.gz"
EOF

Step 7.4: Create PipelineRun

# Create workspace PVC
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mlops-workspace
  namespace: mlops-pipelines
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF

# Create PipelineRun to execute the pipeline
cat <<EOF | oc apply -f -
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: mlops-pipeline-run-
  namespace: mlops-pipelines
spec:
  pipelineRef:
    name: mlops-pipeline
  params:
    - name: model-name
      value: "classifier-model"
    - name: sagemaker-role-arn
      value: "$SAGEMAKER_ROLE_ARN"
    - name: training-image-uri
      value: "$ECR_TRAINING_URI:latest"
    - name: data-bucket
      value: "$DATA_BUCKET"
    - name: model-bucket
      value: "$ML_BUCKET"
    - name: instance-type
      value: "ml.m5.xlarge"
  workspaces:
    - name: shared-workspace
      persistentVolumeClaim:
        claimName: mlops-workspace
  serviceAccountName: pipeline-sa
EOF
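
Equivalently, later runs can be started from the CLI instead of applying a PipelineRun manifest. For example (switch instance-type to ml.p4d.24xlarge when you are ready for a real GPU training run, quota permitting):

# Alternative: start the same pipeline with the Tekton CLI
tkn pipeline start mlops-pipeline -n mlops-pipelines \
  --serviceaccount pipeline-sa \
  --param model-name=classifier-model \
  --param sagemaker-role-arn=$SAGEMAKER_ROLE_ARN \
  --param training-image-uri=$ECR_TRAINING_URI:latest \
  --param data-bucket=$DATA_BUCKET \
  --param model-bucket=$ML_BUCKET \
  --param instance-type=ml.m5.xlarge \
  --workspace name=shared-workspace,claimName=mlops-workspace \
  --showlog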

Testing and Validation

Test 1: Monitor Pipeline Execution

# List pipeline runs
tkn pipelinerun list -n mlops-pipelines

# Get latest pipeline run
export PIPELINE_RUN=$(tkn pipelinerun list -n mlops-pipelines -o jsonpath='{.items[0].metadata.name}')

# Watch pipeline execution
tkn pipelinerun logs $PIPELINE_RUN -f -n mlops-pipelines

# Check pipeline status
tkn pipelinerun describe $PIPELINE_RUN -n mlops-pipelines

Test 2: Verify SageMaker Training Job

# List SageMaker training jobs via ACK
kubectl get trainingjobs -n mlops-pipelines

# Get training job details
kubectl describe trainingjob -n mlops-pipelines

# Check training job in AWS Console
aws sagemaker list-training-jobs --region $AWS_REGION

# View training job logs
export TRAINING_JOB_NAME=$(kubectl get trainingjobs -n mlops-pipelines -o jsonpath='{.items[0].metadata.name}')
aws logs tail /aws/sagemaker/TrainingJobs --follow --log-stream-name-prefix $TRAINING_JOB_NAME

Test 3: Verify Model Deployment

# Check InferenceService status
kubectl get inferenceservice -n mlops-serving

# Get inference endpoint
export INFERENCE_URL=$(kubectl get inferenceservice classifier-model -n mlops-serving -o jsonpath='{.status.url}')
echo "Inference URL: $INFERENCE_URL"

# Test inference with sample data
curl -X POST $INFERENCE_URL/v1/models/classifier-model:predict \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [
      [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
       1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
    ]
  }'

Test 4: Load Testing

# Create load test script
cat > load-test.sh <<'BASH'
#!/bin/bash
INFERENCE_URL=$1
REQUESTS=$2

echo "Running $REQUESTS inference requests to $INFERENCE_URL"

for i in $(seq 1 $REQUESTS); do
  curl -s -X POST $INFERENCE_URL/v1/models/classifier-model:predict \
    -H "Content-Type: application/json" \
    -d '{
      "instances": [
        [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
         1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
      ]
    }' > /dev/null &

  if [ $((i % 10)) -eq 0 ]; then
    echo "Sent $i requests"
  fi
done

wait
echo "Load test completed"
BASH

chmod +x load-test.sh

# Run load test
./load-test.sh $INFERENCE_URL 100

Resource Cleanup

To avoid ongoing AWS charges, follow these steps to clean up all resources.

Step 1: Delete InferenceServices

# Delete all InferenceServices
kubectl delete inferenceservice --all -n mlops-serving

# Verify deletion
kubectl get inferenceservice -n mlops-serving

Step 2: Delete Pipelines and Runs

# Delete all pipeline runs
kubectl delete pipelinerun --all -n mlops-pipelines

# Delete pipelines
kubectl delete pipeline mlops-pipeline -n mlops-pipelines

# Delete tasks
kubectl delete task --all -n mlops-pipelines

# Delete PVC
kubectl delete pvc mlops-workspace -n mlops-pipelines

Step 3: Delete SageMaker Training Jobs

# Delete ACK SageMaker resources
kubectl delete trainingjobs --all -n mlops-pipelines

# Verify in AWS Console or CLI
aws sagemaker list-training-jobs --region $AWS_REGION

Step 4: Delete S3 Buckets

# Delete all objects in buckets
aws s3 rm s3://$ML_BUCKET --recursive --region $AWS_REGION
aws s3 rm s3://$DATA_BUCKET --recursive --region $AWS_REGION

# Delete buckets
aws s3 rb s3://$ML_BUCKET --region $AWS_REGION
aws s3 rb s3://$DATA_BUCKET --region $AWS_REGION

echo "S3 buckets deleted"

Step 5: Delete ECR Repository

# Delete ECR repository
aws ecr delete-repository \
  --repository-name mlops/training \
  --force \
  --region $AWS_REGION

echo "ECR repository deleted"

Step 6: Delete ACK Controllers

# Delete ACK SageMaker controller
kubectl delete -f install.yaml

# Delete ACK namespace
kubectl delete namespace ack-system

Step 7: Delete ROSA Cluster

# Delete ROSA cluster (takes ~10-15 minutes)
rosa delete cluster --cluster=$CLUSTER_NAME --yes

# Wait for cluster deletion
rosa logs uninstall --cluster=$CLUSTER_NAME --watch

# Verify deletion
rosa list clusters

Step 8: Delete IAM Resources

# Detach policies and delete ACK role
aws iam detach-role-policy \
  --role-name ACKSageMakerControllerRole \
  --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/ACKSageMakerPolicy

aws iam delete-role --role-name ACKSageMakerControllerRole

aws iam delete-policy \
  --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/ACKSageMakerPolicy

# Delete SageMaker execution role
aws iam delete-role-policy \
  --role-name SageMakerMLOpsExecutionRole \
  --policy-name S3Access

aws iam detach-role-policy \
  --role-name SageMakerMLOpsExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

aws iam delete-role --role-name SageMakerMLOpsExecutionRole

# Delete KServe role
aws iam delete-role-policy \
  --role-name KServeS3AccessRole \
  --policy-name S3ReadAccess

aws iam delete-role --role-name KServeS3AccessRole

echo "IAM resources deleted"

Step 9: Clean Up Local Files

# Remove temporary files
rm -f ack-sagemaker-policy.json
rm -f ack-trust-policy.json
rm -f sagemaker-trust-policy.json
rm -f sagemaker-s3-policy.json
rm -f kserve-trust-policy.json
rm -f kserve-s3-policy.json
rm -f install.yaml
rm -rf sagemaker-training/
rm -rf sample-data/
rm -f load-test.sh

echo "Local files cleaned up"

Verification

# Verify ROSA cluster is deleted
rosa list clusters

# Verify S3 buckets are deleted
aws s3 ls | grep mlops

# Verify ECR repositories are deleted
aws ecr describe-repositories --region $AWS_REGION | grep mlops

# Verify IAM roles are deleted
aws iam list-roles | grep -E "ACKSageMaker|SageMakerMLOps|KServeS3"

echo "Cleanup verification complete"

Troubleshooting

Issue: ACK Controller Cannot Create SageMaker Jobs

Symptoms: TrainingJob CR is created but SageMaker job doesn't start

Solutions:

  1. Verify ACK controller has correct IAM role
  2. Check service account annotation
  3. Verify SageMaker execution role exists and has permissions
  4. Check CloudWatch logs for ACK controller
# Check ACK controller logs
kubectl logs -n ack-system deployment/ack-sagemaker-controller

# Verify service account annotation
kubectl get sa -n ack-system ack-sagemaker-controller -o yaml

# Test IAM role assumption
aws sts assume-role-with-web-identity \
  --role-arn $ACK_ROLE_ARN \
  --role-session-name test \
  --web-identity-token $(kubectl create token ack-sagemaker-controller -n ack-system)

Issue: KServe Cannot Pull Model from S3

Symptoms: InferenceService stuck in "Downloading" state

Solutions:

  1. Verify KServe service account has correct IAM role
  2. Check S3 bucket permissions
  3. Verify model URI is correct
  4. Check storage-initializer logs
# Check InferenceService status
kubectl describe inferenceservice -n mlops-serving

# Check storage-initializer logs
kubectl logs -n mlops-serving -l serving.kserve.io/inferenceservice=classifier-model -c storage-initializer

# Verify S3 access
kubectl run aws-cli --rm -it --image=amazon/aws-cli -n mlops-serving \
  --overrides='{"spec":{"serviceAccountName":"kserve-sa"}}' -- \
  s3 ls s3://$ML_BUCKET/models/

Issue: Pipeline Run Fails

Symptoms: PipelineRun shows failed status

Solutions:

  1. Check pipeline run logs
  2. Verify all parameters are correct
  3. Check task pod logs
  4. Verify service account permissions
# View pipeline run logs
tkn pipelinerun logs $PIPELINE_RUN -n mlops-pipelines

# Check failed task
kubectl get pods -n mlops-pipelines | grep $PIPELINE_RUN

# View task pod logs
kubectl logs -n mlops-pipelines <pod-name>

# Check events
kubectl get events -n mlops-pipelines --sort-by='.lastTimestamp'

Issue: SageMaker Training Job Fails

Symptoms: TrainingJob CR shows "Failed" status

Solutions:

  1. Check training container logs in CloudWatch
  2. Verify training data exists in S3
  3. Check SageMaker execution role permissions
  4. Verify container image is accessible
# Get training job name
kubectl get trainingjob -n mlops-pipelines -o jsonpath='{.items[0].metadata.name}'

# Check CloudWatch logs
aws logs tail /aws/sagemaker/TrainingJobs --follow --log-stream-name-prefix $TRAINING_JOB_NAME

# List training jobs
aws sagemaker describe-training-job --training-job-name $TRAINING_JOB_NAME

Issue: High Inference Latency

Symptoms: Model serving responses are slow

Solutions:

  1. Scale InferenceService replicas
  2. Adjust resource requests/limits
  3. Enable autoscaling
  4. Check network latency
# Raise the minimum number of predictor replicas
kubectl patch inferenceservice classifier-model -n mlops-serving --type='json' \
  -p='[{"op": "add", "path": "/spec/predictor/minReplicas", "value": 3}]'

# Tune the autoscaling target (concurrent requests per replica)
kubectl patch inferenceservice classifier-model -n mlops-serving --type='json' \
  -p='[{"op": "add", "path": "/spec/predictor/scaleTarget", "value": 10}]'

# Check pod resource usage
kubectl top pods -n mlops-serving

Debug Commands

# View all resources in namespace
kubectl get all -n mlops-pipelines
kubectl get all -n mlops-serving

# Describe resources
kubectl describe trainingjob -n mlops-pipelines
kubectl describe inferenceservice -n mlops-serving

# Check logs
kubectl logs -n ack-system deployment/ack-sagemaker-controller
kubectl logs -n kserve deployment/kserve-controller-manager

# View events
kubectl get events -n mlops-pipelines --sort-by='.lastTimestamp'
kubectl get events -n mlops-serving --sort-by='.lastTimestamp'
