Bursting to SageMaker Training from OpenShift Pipelines
Table of Contents
- Overview
- Architecture
- Prerequisites
- Phase 1: ROSA Cluster Setup
- Phase 2: OpenShift Pipelines Installation
- Phase 3: AWS Controllers for Kubernetes (ACK)
- Phase 4: Amazon SageMaker Integration
- Phase 5: Model Storage with S3
- Phase 6: KServe Model Serving
- Phase 7: End-to-End Pipeline
- Testing and Validation
- Resource Cleanup
- Troubleshooting
Overview
Project Purpose
This platform delivers a hybrid MLOps solution that keeps costs down by combining the strengths of both environments: OpenShift for orchestration, management, and serving, and AWS SageMaker for intensive GPU training workloads. Instead of maintaining expensive GPU instances 24/7, this architecture enables dynamic "bursting" to AWS for training while keeping cost-effective inference on OpenShift.
Key Value Propositions
- Cost Optimization: Pay for GPU instances only during training, not continuously
- Elastic Scalability: Burst to powerful AWS instances (ml.p4d.24xlarge) on-demand
- Hybrid Flexibility: Orchestrate from OpenShift while leveraging AWS managed services
- Automated Workflows: End-to-end MLOps pipelines with minimal manual intervention
- Production-Ready Serving: Low-latency inference on cost-effective OpenShift nodes
Solution Components
| Component | Purpose | Layer |
|---|---|---|
| ROSA | Managed OpenShift cluster on AWS | Infrastructure |
| OpenShift Pipelines | Tekton-based CI/CD orchestration | Orchestration |
| ACK (AWS Controllers for Kubernetes) | Manage AWS services from Kubernetes | Integration |
| Amazon SageMaker | Managed ML training with GPU instances | Training |
| Amazon S3 | Model artifacts and dataset storage | Data Lake |
| KServe | Model serving on OpenShift | Inference |
| Amazon ECR | Container registry for custom images | Container Registry |
Architecture
High-Level Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────┐
│ AWS Cloud │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ ROSA Cluster (VPC) │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ OpenShift Pipelines (Tekton) │ │ │
│ │ │ ┌────────────────┐ ┌──────────────────────────────┐ │ │ │
│ │ │ │ Pipeline │ │ AWS Controllers for │ │ │ │
│ │ │ │ Controller │─────►│ Kubernetes (ACK) │ │ │ │
│ │ │ └────────────────┘ └──────────┬───────────────────┘ │ │ │
│ │ │ │ │ │ │
│ │ └─────────────────────────────────────┼───────────────────────┘ │ │
│ │ │ │ │
│ │ ┌─────────────────────────────────────▼───────────────────────┐ │ │
│ │ │ KServe Model Serving │ │ │
│ │ │ ┌────────────────┐ ┌──────────────────────────────┐ │ │ │
│ │ │ │ InferenceService│◄─────│ Model from S3 │ │ │ │
│ │ │ │ (CPU Nodes) │ │ (Automatic Pull) │ │ │ │
│ │ │ └────────────────┘ └──────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ACK Triggers │ (Ephemeral) │
│ │ │
│ ┌─────────────────────────────────────▼───────────────────────────┐ │
│ │ Amazon SageMaker Training │ │
│ │ ┌────────────────────────────────────────────────────────────┐ │ │
│ │ │ Training Job (ml.p4d.24xlarge - 8x A100 GPUs) │ │ │
│ │ │ - Launched on-demand via ACK │ │ │
│ │ │ - Trains model using data from S3 │ │ │
│ │ │ - Saves model artifacts to S3 │ │ │
│ │ │ - Auto-terminates after completion │ │ │
│ │ └────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────▼───────────────────────────┐ │
│ │ Amazon S3 │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │
│ │ │ Training │ │ Model │ │ Inference │ │ │
│ │ │ Datasets │ │ Artifacts │ │ Results │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Amazon ECR │ │
│ │ - Custom training container images │ │
│ │ - Model serving images │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Workflow
- Data Preparation: Training datasets uploaded to S3
- Pipeline Trigger: Developer triggers the OpenShift Pipeline (see the example command after this list)
- Training Initiation: ACK creates SageMaker Training Job
- GPU Provisioning: SageMaker spins up ml.p4d.24xlarge instances
- Model Training: Training executes on high-performance GPUs
- Artifact Storage: Trained model saved to S3
- Instance Termination: GPU instances automatically shut down
- Model Deployment: KServe pulls model from S3 to OpenShift
- Inference Serving: Model serves predictions on cost-effective CPU nodes
- Monitoring: Pipeline tracks status and logs throughout
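For reference, once the pipeline, service account, and workspace PVC from Phase 7 exist, step 2 can be triggered from the CLI instead of creating a PipelineRun manifest by hand. A minimal sketch, with parameter values mirroring the PipelineRun in Step 7.4 (the environment variables are set in Phases 4 and 5):
# Trigger the MLOps pipeline from the CLI (equivalent to creating a PipelineRun)
tkn pipeline start mlops-pipeline \
  -n mlops-pipelines \
  --serviceaccount pipeline-sa \
  --param model-name=classifier-model \
  --param sagemaker-role-arn=$SAGEMAKER_ROLE_ARN \
  --param training-image-uri=$ECR_TRAINING_URI:latest \
  --param data-bucket=$DATA_BUCKET \
  --param model-bucket=$ML_BUCKET \
  --param instance-type=ml.m5.xlarge \
  --workspace name=shared-workspace,claimName=mlops-workspace \
  --showlog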
Cost Analysis
Traditional Approach (GPU instances running 24/7):
- ml.p4d.24xlarge: ~$32/hour
- Monthly cost: ~$23,040 (continuous operation)
Hybrid Approach (burst for training only):
- Training: 4 hours/week × $32/hour = $128/week = $512/month
- ROSA inference nodes: ~$1,500/month (m5.2xlarge instances)
- Total: ~$2,012/month
- Savings: ~91% compared to traditional approach
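These figures are rough, on-demand estimates that vary by region and pricing model, but the arithmetic is easy to sanity-check:
# Back-of-the-envelope check of the estimates above (illustrative prices)
python3 - <<'PYTHON'
gpu_hourly = 32.0                    # approx. ml.p4d.24xlarge on-demand rate
always_on = gpu_hourly * 24 * 30     # GPU instance running 24/7 for a month
burst = gpu_hourly * 4 * 4           # 4 training hours/week, ~4 weeks/month
hybrid = burst + 1500                # plus ~$1,500/month for ROSA inference nodes
print(f"Always-on GPU:  ${always_on:,.0f}/month")
print(f"Hybrid (burst): ${hybrid:,.0f}/month")
print(f"Savings:        {100 * (1 - hybrid / always_on):.0f}%")
PYTHON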
Prerequisites
Required Accounts and Subscriptions
- [ ] AWS Account with administrative access
- [ ] Red Hat Account with OpenShift subscription
- [ ] ROSA Enabled in your AWS account
- [ ] Amazon SageMaker Access in your target region
- [ ] AWS Service Quotas for ml.p4d instances (request if needed)
Required Tools
Install the following CLI tools on your workstation:
# AWS CLI (v2)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# ROSA CLI
wget https://mirror.openshift.com/pub/openshift-v4/clients/rosa/latest/rosa-linux.tar.gz
tar -xvf rosa-linux.tar.gz
sudo mv rosa /usr/local/bin/rosa
rosa version
# OpenShift CLI (oc)
wget https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz
tar -xvf openshift-client-linux.tar.gz
sudo mv oc kubectl /usr/local/bin/
oc version
# Tekton CLI
curl -LO https://github.com/tektoncd/cli/releases/download/v0.33.0/tkn_0.33.0_Linux_x86_64.tar.gz
tar xvzf tkn_0.33.0_Linux_x86_64.tar.gz
sudo mv tkn /usr/local/bin/
tkn version
# Helm (v3)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version
AWS Prerequisites
Service Quotas
# Check SageMaker quotas
aws service-quotas get-service-quota \
--service-code sagemaker \
--quota-code L-2E8D9C5E \
--region us-east-1
# Check EC2 quotas for ROSA
aws service-quotas get-service-quota \
--service-code ec2 \
--quota-code L-1216C47A \
--region us-east-1
IAM Permissions
Your AWS IAM user/role needs permissions for the following services (a quick spot-check is shown after the list):
- EC2 (VPC, subnets, security groups)
- IAM (roles, policies)
- S3 (buckets, objects)
- SageMaker (training jobs, models)
- ECR (repositories, images)
- CloudWatch (logs, metrics)
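One way to spot-check a few of these before provisioning anything is the IAM policy simulator; the principal ARN below is a placeholder for your own user or role:
# Dry-run a few representative actions against your IAM principal
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::<ACCOUNT_ID>:user/<YOUR_USER> \
  --action-names sagemaker:CreateTrainingJob s3:CreateBucket ecr:CreateRepository iam:CreateRole \
  --query 'EvaluationResults[].{Action: EvalActionName, Decision: EvalDecision}' \
  --output table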
Knowledge Prerequisites
You should be familiar with:
- Machine Learning concepts (training, inference, model artifacts)
- AWS fundamentals (VPC, IAM, S3)
- Kubernetes basics (pods, deployments, services)
- CI/CD pipeline concepts
- Python and ML frameworks (TensorFlow, PyTorch, scikit-learn)
Phase 1: ROSA Cluster Setup
Step 1.1: Configure AWS CLI
# Configure AWS credentials
aws configure
# Verify configuration
aws sts get-caller-identity
Step 1.2: Initialize ROSA
# Log in to Red Hat
rosa login
# Verify ROSA prerequisites
rosa verify quota
rosa verify permissions
# Initialize ROSA in your AWS account
rosa init
Step 1.3: Create ROSA Cluster
Create a ROSA cluster optimized for MLOps workloads:
# Set environment variables
export CLUSTER_NAME="mlops-platform"
export AWS_REGION="us-east-1"
export MACHINE_TYPE="m5.2xlarge"
export COMPUTE_NODES=3
# Create account-wide STS roles and policies (an STS cluster is needed for the
# OIDC-based IAM integration used in Phases 3 and 6)
rosa create account-roles --mode auto --yes
# Create ROSA cluster with STS (takes ~40 minutes)
rosa create cluster \
  --cluster-name $CLUSTER_NAME \
  --sts \
  --mode auto \
  --region $AWS_REGION \
  --multi-az \
  --compute-machine-type $MACHINE_TYPE \
  --compute-nodes $COMPUTE_NODES \
  --machine-cidr 10.0.0.0/16 \
  --service-cidr 172.30.0.0/16 \
  --pod-cidr 10.128.0.0/14 \
  --host-prefix 23 \
  --yes
Configuration Rationale:
- m5.2xlarge: 8 vCPUs, 32 GB RAM - suitable for ML inference and pipeline orchestration
- 3 nodes: High availability for production workloads
- Multi-AZ: Ensures resilience for serving layer
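If the serving layer later needs more capacity, worker nodes can be added without recreating the cluster. A hedged sketch using an additional machine pool (the pool name and replica count are illustrative):
# Optional: add a dedicated machine pool for inference workloads once the cluster is ready
rosa create machinepool \
  --cluster=$CLUSTER_NAME \
  --name=inference-pool \
  --instance-type=m5.2xlarge \
  --replicas=2
# List machine pools to confirm
rosa list machinepools --cluster=$CLUSTER_NAME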
Step 1.4: Monitor Cluster Creation
# Watch cluster installation progress
rosa logs install --cluster=$CLUSTER_NAME --watch
# Check cluster status
rosa describe cluster --cluster=$CLUSTER_NAME
Step 1.5: Create Admin User
# Create cluster admin user
rosa create admin --cluster=$CLUSTER_NAME
# Save the login command (will be displayed in output)
Step 1.6: Connect to Cluster
# Use the login command from previous step
oc login https://api.mlops-platform.xxxx.p1.openshiftapps.com:6443 \
--username cluster-admin \
--password <your-password>
# Verify cluster access
oc cluster-info
oc get nodes
Step 1.7: Create Project Namespaces
# Create namespace for pipelines
oc new-project mlops-pipelines
# Create namespace for model serving
oc new-project mlops-serving
# Create namespace for ACK controllers
oc new-project ack-system
Phase 2: OpenShift Pipelines Installation
Step 2.1: Install OpenShift Pipelines Operator
# Create operator subscription
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
name: openshift-pipelines
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: openshift-pipelines-operator
namespace: openshift-operators
spec:
channel: latest
name: openshift-pipelines-operator-rh
source: redhat-operators
sourceNamespace: openshift-marketplace
installPlanApproval: Automatic
EOF
Step 2.2: Verify Operator Installation
# Wait for operator to be ready (takes 2-3 minutes)
oc get csv -n openshift-operators | grep pipelines
# Verify Tekton components are running
oc get pods -n openshift-pipelines
# Check Tekton version
tkn version
Step 2.3: Configure Pipeline Service Account
# Create a service account for pipelines and grant it edit rights in both the
# pipeline and the serving namespaces (the deploy task in Phase 7 creates
# InferenceServices in mlops-serving)
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pipeline-sa
  namespace: mlops-pipelines
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pipeline-sa-edit
  namespace: mlops-pipelines
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- kind: ServiceAccount
  name: pipeline-sa
  namespace: mlops-pipelines
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pipeline-sa-edit
  namespace: mlops-serving
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- kind: ServiceAccount
  name: pipeline-sa
  namespace: mlops-pipelines
EOF
Note: the built-in edit role covers core resources; depending on how the KServe and ACK CRDs aggregate their RBAC, you may also need to bind pipeline-sa to a Role that explicitly allows inferenceservices.serving.kserve.io and trainingjobs.sagemaker.services.k8s.aws.
Phase 3: AWS Controllers for Kubernetes (ACK)
ACK enables managing AWS services directly from Kubernetes using custom resources.
Step 3.1: Install ACK SageMaker Controller
# Set variables
export ACK_K8S_NAMESPACE=ack-system
export AWS_REGION=us-east-1
# Download the latest ACK SageMaker controller release manifest
export SERVICE=sagemaker
export RELEASE_VERSION=$(curl -sL https://api.github.com/repos/aws-controllers-k8s/${SERVICE}-controller/releases/latest | grep '"tag_name":' | cut -d'"' -f4)
wget https://github.com/aws-controllers-k8s/${SERVICE}-controller/releases/download/${RELEASE_VERSION}/install.yaml
# Apply ACK controller
kubectl apply -f install.yaml
# Verify installation
kubectl get pods -n ack-system
kubectl get crd | grep sagemaker
Step 3.2: Create IAM Role for ACK
# Create IAM policy for SageMaker access
cat > ack-sagemaker-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sagemaker:CreateTrainingJob",
"sagemaker:DescribeTrainingJob",
"sagemaker:StopTrainingJob",
"sagemaker:CreateModel",
"sagemaker:DeleteModel",
"sagemaker:DescribeModel",
"sagemaker:CreateEndpointConfig",
"sagemaker:DeleteEndpointConfig",
"sagemaker:DescribeEndpointConfig"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::mlops-*",
"arn:aws:s3:::mlops-*/*"
]
},
{
"Effect": "Allow",
"Action": [
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"iam:PassRole"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"iam:PassedToService": "sagemaker.amazonaws.com"
}
}
}
]
}
EOF
# Create policy
aws iam create-policy \
--policy-name ACKSageMakerPolicy \
--policy-document file://ack-sagemaker-policy.json
# Get OIDC provider for ROSA
export OIDC_PROVIDER=$(rosa describe cluster -c $CLUSTER_NAME -o json | jq -r .aws.sts.oidc_endpoint_url | sed 's|https://||')
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
# Create trust policy
cat > ack-trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"${OIDC_PROVIDER}:sub": "system:serviceaccount:ack-system:ack-sagemaker-controller"
}
}
}
]
}
EOF
# Create IAM role
export ACK_ROLE_ARN=$(aws iam create-role \
--role-name ACKSageMakerControllerRole \
--assume-role-policy-document file://ack-trust-policy.json \
--query 'Role.Arn' \
--output text)
# Attach policy to role
aws iam attach-role-policy \
--role-name ACKSageMakerControllerRole \
--policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/ACKSageMakerPolicy
echo "ACK IAM Role ARN: $ACK_ROLE_ARN"
Step 3.3: Configure ACK Controller with IAM Role
# Annotate service account
kubectl annotate serviceaccount -n ack-system ack-sagemaker-controller \
eks.amazonaws.com/role-arn=$ACK_ROLE_ARN
# Restart ACK controller to pick up annotation
kubectl rollout restart deployment -n ack-system ack-sagemaker-controller
# Verify controller is running
kubectl get pods -n ack-system
kubectl logs -n ack-system deployment/ack-sagemaker-controller
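As a quick sanity check that web-identity credentials are actually being injected (this assumes the cluster's pod identity webhook reacts to the eks.amazonaws.com/role-arn annotation, as ROSA STS clusters do):
# The injected web-identity env vars should appear in the controller pod spec
kubectl -n ack-system get pods -o yaml | grep -E 'AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE' || \
  echo "Web-identity env vars not found - re-check the service account annotation"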
Phase 4: Amazon SageMaker Integration
Step 4.1: Create SageMaker Execution Role
# Create trust policy for SageMaker
cat > sagemaker-trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "sagemaker.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
EOF
# Create SageMaker execution role
export SAGEMAKER_ROLE_ARN=$(aws iam create-role \
--role-name SageMakerMLOpsExecutionRole \
--assume-role-policy-document file://sagemaker-trust-policy.json \
--query 'Role.Arn' \
--output text)
# Attach AWS managed policy
aws iam attach-role-policy \
--role-name SageMakerMLOpsExecutionRole \
--policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
# Create custom S3 access policy
cat > sagemaker-s3-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::mlops-*",
"arn:aws:s3:::mlops-*/*"
]
}
]
}
EOF
aws iam put-role-policy \
--role-name SageMakerMLOpsExecutionRole \
--policy-name S3Access \
--policy-document file://sagemaker-s3-policy.json
echo "SageMaker Execution Role ARN: $SAGEMAKER_ROLE_ARN"
Step 4.2: Create S3 Buckets
# Create S3 buckets for ML artifacts
export ML_BUCKET="mlops-artifacts-${ACCOUNT_ID}"
export DATA_BUCKET="mlops-datasets-${ACCOUNT_ID}"
aws s3 mb s3://$ML_BUCKET --region $AWS_REGION
aws s3 mb s3://$DATA_BUCKET --region $AWS_REGION
# Enable versioning
aws s3api put-bucket-versioning \
--bucket $ML_BUCKET \
--versioning-configuration Status=Enabled
aws s3api put-bucket-versioning \
--bucket $DATA_BUCKET \
--versioning-configuration Status=Enabled
# Create folder structure
aws s3api put-object --bucket $ML_BUCKET --key models/
aws s3api put-object --bucket $ML_BUCKET --key checkpoints/
aws s3api put-object --bucket $DATA_BUCKET --key training/
aws s3api put-object --bucket $DATA_BUCKET --key validation/
echo "S3 Buckets created:"
echo " Models: s3://$ML_BUCKET"
echo " Data: s3://$DATA_BUCKET"
Step 4.3: Create ECR Repository
# Create ECR repository for custom training images
aws ecr create-repository \
--repository-name mlops/training \
--region $AWS_REGION
# Get ECR login command
aws ecr get-login-password --region $AWS_REGION | \
docker login --username AWS --password-stdin \
${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
export ECR_TRAINING_URI="${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/mlops/training"
echo "ECR Repository: $ECR_TRAINING_URI"
Step 4.4: Build Custom Training Container
# Create directory for training container
mkdir -p sagemaker-training
cd sagemaker-training
# Create training script
cat > train.py <<'PYTHON'
import argparse
import os
import json
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import boto3
def load_data_from_s3(data_dir):
"""Load training and validation data"""
print(f"Loading data from {data_dir}")
# Load training data
X_train = np.load(os.path.join(data_dir, 'train', 'X_train.npy'))
y_train = np.load(os.path.join(data_dir, 'train', 'y_train.npy'))
# Load validation data
X_val = np.load(os.path.join(data_dir, 'validation', 'X_val.npy'))
y_val = np.load(os.path.join(data_dir, 'validation', 'y_val.npy'))
return X_train, y_train, X_val, y_val
def train_model(X_train, y_train, hyperparameters):
"""Train Random Forest model"""
print("Training model with hyperparameters:", hyperparameters)
model = RandomForestClassifier(
n_estimators=hyperparameters['n_estimators'],
max_depth=hyperparameters['max_depth'],
random_state=42,
n_jobs=-1
)
model.fit(X_train, y_train)
return model
def evaluate_model(model, X_val, y_val):
"""Evaluate model on validation set"""
y_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
report = classification_report(y_val, y_pred, output_dict=True)
print(f"Validation Accuracy: {accuracy:.4f}")
print(classification_report(y_val, y_pred))
return accuracy, report
def save_model(model, model_dir, metrics):
"""Save model and metrics"""
os.makedirs(model_dir, exist_ok=True)
# Save model
model_path = os.path.join(model_dir, 'model.joblib')
joblib.dump(model, model_path)
print(f"Model saved to {model_path}")
# Save metrics
metrics_path = os.path.join(model_dir, 'metrics.json')
with open(metrics_path, 'w') as f:
json.dump(metrics, f, indent=2)
print(f"Metrics saved to {metrics_path}")
if __name__ == '__main__':
parser = argparse.ArgumentParser()
# Hyperparameters
parser.add_argument('--n_estimators', type=int, default=100)
parser.add_argument('--max_depth', type=int, default=10)
# SageMaker specific arguments
parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train'))
parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION', '/opt/ml/input/data/validation'))
args = parser.parse_args()
# Load data
data_dir = os.path.dirname(args.train)
X_train, y_train, X_val, y_val = load_data_from_s3(data_dir)
# Train model
hyperparameters = {
'n_estimators': args.n_estimators,
'max_depth': args.max_depth
}
model = train_model(X_train, y_train, hyperparameters)
# Evaluate model
accuracy, report = evaluate_model(model, X_val, y_val)
# Save model and metrics
metrics = {
'accuracy': accuracy,
'classification_report': report,
'hyperparameters': hyperparameters
}
save_model(model, args.model_dir, metrics)
print("Training completed successfully!")
PYTHON
# Create Dockerfile
cat > Dockerfile <<'DOCKERFILE'
FROM python:3.10-slim
# Install dependencies
RUN pip install --no-cache-dir \
numpy==1.24.3 \
scikit-learn==1.3.0 \
joblib==1.3.2 \
boto3==1.28.25
# Copy training script
COPY train.py /opt/ml/code/train.py
# Set working directory
WORKDIR /opt/ml/code
# Set entry point
ENV SAGEMAKER_PROGRAM train.py
ENTRYPOINT ["python", "train.py"]
DOCKERFILE
# Build and push image
docker build -t mlops-training:latest .
docker tag mlops-training:latest $ECR_TRAINING_URI:latest
docker push $ECR_TRAINING_URI:latest
cd ..
echo "Training container image pushed to ECR"
Phase 5: Model Storage with S3
Step 5.1: Upload Sample Training Data
# Create sample dataset
mkdir -p sample-data
cd sample-data
python3 <<PYTHON
import numpy as np
# Generate synthetic classification dataset
np.random.seed(42)
# Training data
X_train = np.random.randn(1000, 20)
y_train = np.random.randint(0, 2, 1000)
# Validation data
X_val = np.random.randn(200, 20)
y_val = np.random.randint(0, 2, 200)
# Save to files
np.save('X_train.npy', X_train)
np.save('y_train.npy', y_train)
np.save('X_val.npy', X_val)
np.save('y_val.npy', y_val)
print("Sample dataset created")
PYTHON
# Upload to S3
aws s3 cp X_train.npy s3://$DATA_BUCKET/training/
aws s3 cp y_train.npy s3://$DATA_BUCKET/training/
aws s3 cp X_val.npy s3://$DATA_BUCKET/validation/
aws s3 cp y_val.npy s3://$DATA_BUCKET/validation/
cd ..
echo "Sample data uploaded to S3"
Step 5.2: Create ConfigMap for S3 Configuration
# Store S3 bucket names in ConfigMap
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
name: mlops-config
namespace: mlops-pipelines
data:
ML_BUCKET: "$ML_BUCKET"
DATA_BUCKET: "$DATA_BUCKET"
AWS_REGION: "$AWS_REGION"
SAGEMAKER_ROLE_ARN: "$SAGEMAKER_ROLE_ARN"
ECR_TRAINING_URI: "$ECR_TRAINING_URI"
EOF
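The pipeline in Phase 7 receives these values as parameters. If you prefer to read them directly inside a Task, Tekton steps accept standard container fields such as envFrom; a minimal sketch (the print-config Task is illustrative only):
cat <<'EOF' | oc apply -f -
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: print-config
  namespace: mlops-pipelines
spec:
  steps:
    - name: show
      image: registry.access.redhat.com/ubi9/ubi-minimal:latest
      envFrom:
        - configMapRef:
            name: mlops-config
      script: |
        #!/bin/sh
        echo "Data bucket:  $DATA_BUCKET"
        echo "Model bucket: $ML_BUCKET"
        echo "Region:       $AWS_REGION"
EOF
Run it with tkn task start print-config -n mlops-pipelines --showlog to confirm the values resolve.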
Phase 6: KServe Model Serving
Step 6.1: Install KServe
# Install Serverless Operator (prerequisite for KServe)
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: serverless-operator
namespace: openshift-operators
spec:
channel: stable
name: serverless-operator
source: redhat-operators
sourceNamespace: openshift-marketplace
installPlanApproval: Automatic
EOF
# Wait for operator to be ready
sleep 30
oc get csv -n openshift-operators | grep serverless
# Install KServe via Red Hat OpenShift AI or manually
# For this guide, we'll install KServe components manually
# Install Knative Serving
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
name: knative-serving
---
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
name: knative-serving
namespace: knative-serving
spec:
ingress:
istio:
enabled: false
config:
domain:
svc.cluster.local: ""
EOF
# Install cert-manager first (KServe's webhooks rely on it for certificate provisioning)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.1/cert-manager.yaml
kubectl wait --for=condition=Available deployment --all -n cert-manager --timeout=300s
# Install KServe
export KSERVE_VERSION=v0.11.0
kubectl apply -f https://github.com/kserve/kserve/releases/download/${KSERVE_VERSION}/kserve.yaml
# Wait for KServe to be ready
kubectl wait --for=condition=Ready pods --all -n kserve --timeout=300s
Step 6.2: Create Custom ServingRuntime for scikit-learn
# Create scikit-learn serving runtime
cat <<EOF | oc apply -f -
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: sklearn-runtime
  namespace: mlops-serving
spec:
  supportedModelFormats:
    - name: sklearn
      version: "1"
      autoSelect: true
  containers:
    - name: kserve-container
      image: kserve/sklearnserver:v0.11.0
      args:
        - --model_name={{.Name}}
        - --model_dir=/mnt/models
        - --http_port=8080
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
EOF
Step 6.3: Create Service Account for Model Access
# Create IAM role for KServe to access S3
cat > kserve-trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"${OIDC_PROVIDER}:sub": "system:serviceaccount:mlops-serving:kserve-sa"
}
}
}
]
}
EOF
# Create role
export KSERVE_ROLE_ARN=$(aws iam create-role \
--role-name KServeS3AccessRole \
--assume-role-policy-document file://kserve-trust-policy.json \
--query 'Role.Arn' \
--output text)
# Create S3 read policy
cat > kserve-s3-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::${ML_BUCKET}",
"arn:aws:s3:::${ML_BUCKET}/*"
]
}
]
}
EOF
aws iam put-role-policy \
--role-name KServeS3AccessRole \
--policy-name S3ReadAccess \
--policy-document file://kserve-s3-policy.json
# Create service account
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
name: kserve-sa
namespace: mlops-serving
annotations:
eks.amazonaws.com/role-arn: $KSERVE_ROLE_ARN
EOF
Phase 7: End-to-End Pipeline
Step 7.1: Create Pipeline Tasks
# Create Task for SageMaker training
cat <<EOF | oc apply -f -
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
name: sagemaker-training
namespace: mlops-pipelines
spec:
params:
- name: job-name
type: string
description: SageMaker training job name
- name: role-arn
type: string
description: SageMaker execution role ARN
- name: image-uri
type: string
description: Training container image URI
- name: instance-type
type: string
default: ml.p4d.24xlarge
- name: instance-count
type: string
default: "1"
- name: volume-size
type: string
default: "50"
- name: max-runtime
type: string
default: "3600"
- name: data-bucket
type: string
- name: model-bucket
type: string
steps:
- name: create-training-job
    image: quay.io/openshift/origin-cli:latest
script: |
#!/bin/bash
set -e
# Create SageMaker training job manifest
cat > training-job.yaml <<YAML
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: TrainingJob
metadata:
name: \$(params.job-name)
namespace: mlops-pipelines
spec:
trainingJobName: \$(params.job-name)
roleARN: \$(params.role-arn)
algorithmSpecification:
trainingImage: \$(params.image-uri)
trainingInputMode: File
resourceConfig:
instanceType: \$(params.instance-type)
instanceCount: \$(params.instance-count)
volumeSizeInGB: \$(params.volume-size)
inputDataConfig:
- channelName: train
dataSource:
s3DataSource:
s3DataType: S3Prefix
s3URI: s3://\$(params.data-bucket)/training/
s3DataDistributionType: FullyReplicated
- channelName: validation
dataSource:
s3DataSource:
s3DataType: S3Prefix
s3URI: s3://\$(params.data-bucket)/validation/
s3DataDistributionType: FullyReplicated
outputDataConfig:
s3OutputPath: s3://\$(params.model-bucket)/models/
stoppingCondition:
maxRuntimeInSeconds: \$(params.max-runtime)
YAML
# Apply the training job
kubectl apply -f training-job.yaml
echo "SageMaker training job created: \$(params.job-name)"
- name: wait-for-completion
    image: quay.io/openshift/origin-cli:latest
script: |
#!/bin/bash
set -e
echo "Waiting for training job to complete..."
while true; do
STATUS=\$(kubectl get trainingjob \$(params.job-name) -n mlops-pipelines -o jsonpath='{.status.trainingJobStatus}')
echo "Current status: \$STATUS"
if [ "\$STATUS" == "Completed" ]; then
echo "Training job completed successfully!"
break
elif [ "\$STATUS" == "Failed" ] || [ "\$STATUS" == "Stopped" ]; then
echo "Training job failed or was stopped"
exit 1
fi
sleep 30
done
# Get model artifact location
MODEL_URI=\$(kubectl get trainingjob \$(params.job-name) -n mlops-pipelines -o jsonpath='{.status.modelArtifacts.s3ModelArtifacts}')
echo "Model artifacts saved to: \$MODEL_URI"
echo -n "\$MODEL_URI" > /workspace/model-uri.txt
workspaces:
- name: output
description: Workspace to store output
EOF
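To exercise this Task in isolation before wiring up the full pipeline, it can be started directly with tkn. This assumes the mlops-workspace PVC from Step 7.4 already exists, and it launches a real SageMaker job, so normal training charges apply:
# Run the training Task on its own (defaults apply for count, volume, and runtime;
# the instance type is overridden to avoid the expensive ml.p4d default)
tkn task start sagemaker-training \
  -n mlops-pipelines \
  --serviceaccount pipeline-sa \
  --param job-name=standalone-test-$(date +%s) \
  --param role-arn=$SAGEMAKER_ROLE_ARN \
  --param image-uri=$ECR_TRAINING_URI:latest \
  --param data-bucket=$DATA_BUCKET \
  --param model-bucket=$ML_BUCKET \
  --param instance-type=ml.m5.xlarge \
  --use-param-defaults \
  --workspace name=output,claimName=mlops-workspace \
  --showlog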
Step 7.2: Create Task for Model Deployment
# Create Task for deploying model to KServe
cat <<EOF | oc apply -f -
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
name: deploy-model
namespace: mlops-pipelines
spec:
params:
- name: model-name
type: string
description: Name for the deployed model
- name: model-uri
type: string
description: S3 URI of the model artifacts
- name: model-format
type: string
default: sklearn
steps:
- name: create-inference-service
image: quay.io/openshift/origin-cli:latest
script: |
#!/bin/bash
set -e
# Create InferenceService
cat > inference-service.yaml <<YAML
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: \$(params.model-name)
namespace: mlops-serving
spec:
predictor:
serviceAccountName: kserve-sa
model:
modelFormat:
name: \$(params.model-format)
storageUri: \$(params.model-uri)
resources:
requests:
cpu: "1"
memory: "2Gi"
limits:
cpu: "2"
memory: "4Gi"
YAML
kubectl apply -f inference-service.yaml
echo "InferenceService created: \$(params.model-name)"
# Wait for InferenceService to be ready
kubectl wait --for=condition=Ready \
inferenceservice/\$(params.model-name) \
-n mlops-serving \
--timeout=300s
echo "Model deployment completed successfully!"
# Get inference endpoint
ENDPOINT=\$(kubectl get inferenceservice \$(params.model-name) -n mlops-serving -o jsonpath='{.status.url}')
echo "Inference endpoint: \$ENDPOINT"
EOF
Step 7.3: Create Complete MLOps Pipeline
# Create the full pipeline
cat <<EOF | oc apply -f -
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
name: mlops-pipeline
namespace: mlops-pipelines
spec:
params:
- name: model-name
type: string
description: Name for the model
default: ml-model
- name: sagemaker-role-arn
type: string
description: SageMaker execution role ARN
- name: training-image-uri
type: string
description: ECR URI for training container
- name: data-bucket
type: string
description: S3 bucket with training data
- name: model-bucket
type: string
description: S3 bucket for model artifacts
- name: instance-type
type: string
description: SageMaker instance type
default: ml.m5.xlarge
workspaces:
- name: shared-workspace
tasks:
- name: train-model
taskRef:
name: sagemaker-training
params:
- name: job-name
value: "\$(params.model-name)-\$(context.pipelineRun.uid)"
- name: role-arn
value: "\$(params.sagemaker-role-arn)"
- name: image-uri
value: "\$(params.training-image-uri)"
- name: instance-type
value: "\$(params.instance-type)"
- name: data-bucket
value: "\$(params.data-bucket)"
- name: model-bucket
value: "\$(params.model-bucket)"
workspaces:
- name: output
workspace: shared-workspace
- name: deploy-model
runAfter:
- train-model
taskRef:
name: deploy-model
params:
- name: model-name
value: "\$(params.model-name)"
- name: model-uri
value: "s3://\$(params.model-bucket)/models/\$(params.model-name)-\$(context.pipelineRun.uid)/output/model.tar.gz"
EOF
Step 7.4: Create PipelineRun
# Create workspace PVC
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mlops-workspace
namespace: mlops-pipelines
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
EOF
# Create PipelineRun to execute the pipeline
cat <<EOF | oc create -f -
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
generateName: mlops-pipeline-run-
namespace: mlops-pipelines
spec:
pipelineRef:
name: mlops-pipeline
params:
- name: model-name
value: "classifier-model"
- name: sagemaker-role-arn
value: "$SAGEMAKER_ROLE_ARN"
- name: training-image-uri
value: "$ECR_TRAINING_URI:latest"
- name: data-bucket
value: "$DATA_BUCKET"
- name: model-bucket
value: "$ML_BUCKET"
- name: instance-type
value: "ml.m5.xlarge"
workspaces:
- name: shared-workspace
persistentVolumeClaim:
claimName: mlops-workspace
serviceAccountName: pipeline-sa
EOF
Testing and Validation
Test 1: Monitor Pipeline Execution
# List pipeline runs
tkn pipelinerun list -n mlops-pipelines
# Get latest pipeline run
export PIPELINE_RUN=$(tkn pipelinerun list -n mlops-pipelines -o jsonpath='{.items[0].metadata.name}')
# Watch pipeline execution
tkn pipelinerun logs $PIPELINE_RUN -f -n mlops-pipelines
# Check pipeline status
tkn pipelinerun describe $PIPELINE_RUN -n mlops-pipelines
Test 2: Verify SageMaker Training Job
# List SageMaker training jobs via ACK
kubectl get trainingjobs -n mlops-pipelines
# Get training job details
kubectl describe trainingjob -n mlops-pipelines
# Cross-check training jobs with the AWS CLI
aws sagemaker list-training-jobs --region $AWS_REGION
# View training job logs
export TRAINING_JOB_NAME=$(kubectl get trainingjobs -n mlops-pipelines -o jsonpath='{.items[0].metadata.name}')
aws logs tail /aws/sagemaker/TrainingJobs --follow --log-stream-name-prefix $TRAINING_JOB_NAME
Test 3: Verify Model Deployment
# Check InferenceService status
kubectl get inferenceservice -n mlops-serving
# Get inference endpoint
export INFERENCE_URL=$(kubectl get inferenceservice classifier-model -n mlops-serving -o jsonpath='{.status.url}')
echo "Inference URL: $INFERENCE_URL"
# Test inference with sample data
curl -X POST $INFERENCE_URL/v1/models/classifier-model:predict \
-H "Content-Type: application/json" \
-d '{
"instances": [
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
]
}'
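With the cluster-local Knative domain configured in Phase 6, the InferenceService URL may only resolve inside the cluster; either expose it through an OpenShift Route or send the request from a pod. A hedged in-cluster variant (curlimages/curl is just a convenient public client image):
# Send the same request from a temporary pod inside the cluster
oc run curl-test --rm -it --restart=Never -n mlops-serving \
  --image=curlimages/curl --command -- \
  curl -s -X POST $INFERENCE_URL/v1/models/classifier-model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]]}'
A successful response follows the KServe v1 protocol, e.g. {"predictions": [0]}; the predicted class depends on the trained model.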
Test 4: Load Testing
# Create load test script
cat > load-test.sh <<'BASH'
#!/bin/bash
INFERENCE_URL=$1
REQUESTS=$2
echo "Running $REQUESTS inference requests to $INFERENCE_URL"
for i in $(seq 1 $REQUESTS); do
curl -s -X POST $INFERENCE_URL/v1/models/classifier-model:predict \
-H "Content-Type: application/json" \
-d '{
"instances": [
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
]
}' > /dev/null &
if [ $((i % 10)) -eq 0 ]; then
echo "Sent $i requests"
fi
done
wait
echo "Load test completed"
BASH
chmod +x load-test.sh
# Run load test
./load-test.sh $INFERENCE_URL 100
Resource Cleanup
To avoid ongoing AWS charges, follow these steps to clean up all resources.
Step 1: Delete InferenceServices
# Delete all InferenceServices
kubectl delete inferenceservice --all -n mlops-serving
# Verify deletion
kubectl get inferenceservice -n mlops-serving
Step 2: Delete Pipelines and Runs
# Delete all pipeline runs
kubectl delete pipelinerun --all -n mlops-pipelines
# Delete pipelines
kubectl delete pipeline mlops-pipeline -n mlops-pipelines
# Delete tasks
kubectl delete task --all -n mlops-pipelines
# Delete PVC
kubectl delete pvc mlops-workspace -n mlops-pipelines
Step 3: Delete SageMaker Training Jobs
# Delete ACK SageMaker resources
kubectl delete trainingjobs --all -n mlops-pipelines
# Verify in AWS Console or CLI
aws sagemaker list-training-jobs --region $AWS_REGION
Step 4: Delete S3 Buckets
# Empty the buckets, including object versions and delete markers
# (versioning was enabled in Step 4.2, so a plain "aws s3 rm" is not enough)
for BUCKET in $ML_BUCKET $DATA_BUCKET; do
  aws s3api delete-objects --bucket $BUCKET --delete \
    "$(aws s3api list-object-versions --bucket $BUCKET --output json \
        --query '{Objects: Versions[].{Key: Key, VersionId: VersionId}}')" 2>/dev/null || true
  aws s3api delete-objects --bucket $BUCKET --delete \
    "$(aws s3api list-object-versions --bucket $BUCKET --output json \
        --query '{Objects: DeleteMarkers[].{Key: Key, VersionId: VersionId}}')" 2>/dev/null || true
done
# Delete buckets
aws s3 rb s3://$ML_BUCKET --region $AWS_REGION
aws s3 rb s3://$DATA_BUCKET --region $AWS_REGION
echo "S3 buckets deleted"
Step 5: Delete ECR Repository
# Delete ECR repository
aws ecr delete-repository \
--repository-name mlops/training \
--force \
--region $AWS_REGION
echo "ECR repository deleted"
Step 6: Delete ACK Controllers
# Delete ACK SageMaker controller
kubectl delete -f install.yaml
# Delete ACK namespace
kubectl delete namespace ack-system
Step 7: Delete ROSA Cluster
# Delete ROSA cluster (takes ~10-15 minutes)
rosa delete cluster --cluster=$CLUSTER_NAME --yes
# Wait for cluster deletion
rosa logs uninstall --cluster=$CLUSTER_NAME --watch
# Verify deletion
rosa list clusters
Step 8: Delete IAM Resources
# Detach policies and delete ACK role
aws iam detach-role-policy \
--role-name ACKSageMakerControllerRole \
--policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/ACKSageMakerPolicy
aws iam delete-role --role-name ACKSageMakerControllerRole
aws iam delete-policy \
--policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/ACKSageMakerPolicy
# Delete SageMaker execution role
aws iam delete-role-policy \
--role-name SageMakerMLOpsExecutionRole \
--policy-name S3Access
aws iam detach-role-policy \
--role-name SageMakerMLOpsExecutionRole \
--policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
aws iam delete-role --role-name SageMakerMLOpsExecutionRole
# Delete KServe role
aws iam delete-role-policy \
--role-name KServeS3AccessRole \
--policy-name S3ReadAccess
aws iam delete-role --role-name KServeS3AccessRole
echo "IAM resources deleted"
Step 9: Clean Up Local Files
# Remove temporary files
rm -f ack-sagemaker-policy.json
rm -f ack-trust-policy.json
rm -f sagemaker-trust-policy.json
rm -f sagemaker-s3-policy.json
rm -f kserve-trust-policy.json
rm -f kserve-s3-policy.json
rm -f install.yaml
rm -rf sagemaker-training/
rm -rf sample-data/
rm -f load-test.sh
echo "Local files cleaned up"
Verification
# Verify ROSA cluster is deleted
rosa list clusters
# Verify S3 buckets are deleted
aws s3 ls | grep mlops
# Verify ECR repositories are deleted
aws ecr describe-repositories --region $AWS_REGION | grep mlops
# Verify IAM roles are deleted
aws iam list-roles | grep -E "ACKSageMaker|SageMakerMLOps|KServeS3"
echo "Cleanup verification complete"
Troubleshooting
Issue: ACK Controller Cannot Create SageMaker Jobs
Symptoms: TrainingJob CR is created but SageMaker job doesn't start
Solutions:
- Verify ACK controller has correct IAM role
- Check service account annotation
- Verify SageMaker execution role exists and has permissions
- Check CloudWatch logs for ACK controller
# Check ACK controller logs
kubectl logs -n ack-system deployment/ack-sagemaker-controller
# Verify service account annotation
kubectl get sa -n ack-system ack-sagemaker-controller -o yaml
# Test IAM role assumption
aws sts assume-role-with-web-identity \
--role-arn $ACK_ROLE_ARN \
--role-session-name test \
--web-identity-token $(kubectl create token ack-sagemaker-controller -n ack-system)
Issue: KServe Cannot Pull Model from S3
Symptoms: InferenceService stuck in "Downloading" state
Solutions:
- Verify KServe service account has correct IAM role
- Check S3 bucket permissions
- Verify model URI is correct
- Check storage-initializer logs
# Check InferenceService status
kubectl describe inferenceservice -n mlops-serving
# Check storage-initializer logs
kubectl logs -n mlops-serving -l serving.kserve.io/inferenceservice=classifier-model -c storage-initializer
# Verify S3 access from a pod running as the kserve-sa service account
kubectl run aws-cli --rm -it --restart=Never --image=amazon/aws-cli -n mlops-serving \
  --overrides='{"spec":{"serviceAccountName":"kserve-sa"}}' -- \
  s3 ls s3://$ML_BUCKET/models/
Issue: Pipeline Run Fails
Symptoms: PipelineRun shows failed status
Solutions:
- Check pipeline run logs
- Verify all parameters are correct
- Check task pod logs
- Verify service account permissions
# View pipeline run logs
tkn pipelinerun logs $PIPELINE_RUN -n mlops-pipelines
# Check failed task
kubectl get pods -n mlops-pipelines | grep $PIPELINE_RUN
# View task pod logs
kubectl logs -n mlops-pipelines <pod-name>
# Check events
kubectl get events -n mlops-pipelines --sort-by='.lastTimestamp'
Issue: SageMaker Training Job Fails
Symptoms: TrainingJob CR shows "Failed" status
Solutions:
- Check training container logs in CloudWatch
- Verify training data exists in S3
- Check SageMaker execution role permissions
- Verify container image is accessible
# Get training job name
export TRAINING_JOB_NAME=$(kubectl get trainingjob -n mlops-pipelines -o jsonpath='{.items[0].metadata.name}')
# Check CloudWatch logs
aws logs tail /aws/sagemaker/TrainingJobs --follow --log-stream-name-prefix $TRAINING_JOB_NAME
# Describe the training job in SageMaker
aws sagemaker describe-training-job --training-job-name $TRAINING_JOB_NAME
Issue: High Inference Latency
Symptoms: Model serving responses are slow
Solutions:
- Scale InferenceService replicas
- Adjust resource requests/limits
- Enable autoscaling
- Check network latency
# Scale the InferenceService by raising its minimum replica count
kubectl patch inferenceservice classifier-model -n mlops-serving --type='json' \
  -p='[{"op": "add", "path": "/spec/predictor/minReplicas", "value": 3}]'
# Tune the autoscaling target (concurrent requests per replica)
kubectl patch inferenceservice classifier-model -n mlops-serving --type='json' \
  -p='[{"op": "add", "path": "/spec/predictor/scaleTarget", "value": 10}]'
# Check pod resource usage
kubectl top pods -n mlops-serving
Debug Commands
# View all resources in namespace
kubectl get all -n mlops-pipelines
kubectl get all -n mlops-serving
# Describe resources
kubectl describe trainingjob -n mlops-pipelines
kubectl describe inferenceservice -n mlops-serving
# Check logs
kubectl logs -n ack-system deployment/ack-sagemaker-controller
kubectl logs -n kserve deployment/kserve-controller-manager
# View events
kubectl get events -n mlops-pipelines --sort-by='.lastTimestamp'
kubectl get events -n mlops-serving --sort-by='.lastTimestamp'