Marco Gonzalez

RAG Integration: DeepSeek’s New BFF in the AI World

In this tutorial, I'll show you how to build a backend application using Azure OpenAI's large language model (LLM) and introduce you to what's new with DeepSeek's LLM. It's simpler than it might sound!

Important Note:

The main difference between OpenAI and DeepSeek lies not in the setup but in the performance, so feel free to substitute "DeepSeek" wherever you see "OpenAI" in this blog entry.
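
To make the swap concrete, here's a minimal sketch of the same chat-completion request against either provider. The key variables are placeholders, the Azure deployment name "gpt-4" matches the deployment created later in this guide, and the DeepSeek endpoint and model name reflect its OpenAI-compatible API:

# Azure OpenAI chat completion (deployment name from this guide)
curl -s "https://YOUR-RESOURCE.openai.azure.com/openai/deployments/gpt-4/chat/completions?api-version=2023-05-15" \
  -H "api-key: $AZURE_OPENAI_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'

# DeepSeek: same request shape via its OpenAI-compatible endpoint
curl -s "https://api.deepseek.com/chat/completions" \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-chat", "messages": [{"role": "user", "content": "Hello"}]}'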

Table of Contents

  1. Introduction
  2. Platform Overview
  3. Cloud Platform Decision Matrix
  4. Prerequisites
  5. Project 1: Enterprise-Grade RAG Platform
  6. Project 2: Hybrid MLOps Pipeline
  7. Project 3: Unified Data Fabric (Data Lakehouse)
  8. Multi-Cloud Integration Patterns
  9. Total Cost of Ownership Analysis
  10. Migration Strategies
  11. Resource Cleanup
  12. Troubleshooting

Introduction

Modern enterprises face a critical decision when building cloud-native AI and data platforms: AWS or Azure? This comprehensive guide demonstrates how to build three production-grade platforms on both cloud providers, providing side-by-side comparisons to help you make informed decisions.

What You'll Learn

This guide shows you how to implement identical architectures on both AWS and Azure:

Project 1: Enterprise RAG Platform

  • AWS: Amazon Bedrock + AWS Glue + Milvus on ROSA
  • Azure: Azure OpenAI + Azure Data Factory + Milvus on ARO
  • Privacy-first Retrieval-Augmented Generation
  • Vector database integration
  • Secure private connectivity

Project 2: Hybrid MLOps Pipeline

  • AWS: SageMaker + OpenShift Pipelines + KServe on ROSA
  • Azure: Azure ML + Azure DevOps + KServe on ARO
  • Cost-optimized GPU training
  • Kubernetes-native serving
  • End-to-end automation

Project 3: Unified Data Fabric

  • AWS: Apache Spark + AWS Glue Catalog + S3 + Iceberg
  • Azure: Apache Spark + Azure Purview + ADLS Gen2 + Delta Lake
  • Stateless compute architecture
  • Medallion data organization
  • ACID transactions

Why This Comparison Matters

Choosing the right cloud platform impacts:

  • Total Cost: 20-40% difference in monthly spending
  • Developer Productivity: Ecosystem integration and tooling
  • Vendor Lock-in: Portability and migration flexibility
  • Enterprise Integration: Existing infrastructure and contracts

Platform Overview

Unified Multi-Cloud Architecture

Both implementations follow the same architectural patterns while leveraging platform-specific managed services:

┌─────────────────────────────────────────────────────────────────────┐
│                     Enterprise Organization                          │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │     Red Hat OpenShift (ROSA on AWS / ARO on Azure)            │ │
│  │              - Unified Control Plane                           │ │
│  │              - Application Orchestration                       │ │
│  │              - Developer Platform                              │ │
│  └───────────────────────────┬───────────────────────────────────┘ │
│                              │                                      │
│              ┌───────────────┼───────────────┐                     │
│              │               │               │                     │
│  ┌───────────▼─────┐ ┌──────▼──────┐ ┌─────▼──────────┐          │
│  │   RAG Project   │ │MLOps Project│ │ Data Lakehouse │          │
│  │                 │ │             │ │                │          │
│  │ AWS:            │ │ AWS:        │ │ AWS:           │          │
│  │ - Bedrock       │ │ - SageMaker │ │ - Glue Catalog │          │
│  │ - Glue ETL      │ │ - ACK       │ │ - S3 + Iceberg │          │
│  │                 │ │             │ │                │          │
│  │ Azure:          │ │ Azure:      │ │ Azure:         │          │
│  │ - OpenAI        │ │ - Azure ML  │ │ - Purview      │          │
│  │ - Data Factory  │ │ - ASO       │ │ - ADLS + Delta │          │
│  └─────────────────┘ └─────────────┘ └────────────────┘          │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐ │
│  │              Cloud Services Layer                             │ │
│  │  AWS: IAM + S3 + PrivateLink + CloudWatch                    │ │
│  │  Azure: AAD + Blob + Private Link + Monitor                  │ │
│  └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

Technology Stack: AWS vs Azure

| Component | AWS Solution | Azure Solution | Notes |
|---|---|---|---|
| Kubernetes | ROSA (Red Hat OpenShift on AWS) | ARO (Azure Red Hat OpenShift) | Both use Red Hat OpenShift |
| LLM Platform | Amazon Bedrock (Claude 3.5) | Azure OpenAI Service (GPT-4) | Same API patterns |
| ML Training | Amazon SageMaker | Azure Machine Learning | Both burst from OpenShift |
| Data Catalog | AWS Glue Data Catalog | Azure Purview / Unity Catalog | Unified metadata layer |
| Object Storage | Amazon S3 | Azure Data Lake Storage Gen2 | S3-compatible APIs |
| Table Format | Apache Iceberg | Delta Lake | Open-source options |
| Vector DB | Milvus (self-hosted) | Milvus / Cosmos DB | Same deployment |
| ETL Service | AWS Glue (serverless) | Azure Data Factory (serverless) | Similar orchestration |
| CI/CD | OpenShift Pipelines (Tekton) | Azure DevOps / Tekton | Kubernetes-native |
| K8s Integration | AWS Controllers for Kubernetes (ACK) | Azure Service Operator (ASO) | Custom resources |
| Private Network | AWS PrivateLink | Azure Private Link | VPC/VNet integration |
| Authentication | IRSA (IAM Roles for Service Accounts) | Workload Identity | Pod-level identity |

Cloud Platform Decision Matrix

When to Choose AWS

Best For:

  1. AI/ML Innovation: Amazon Bedrock offers broader model selection (Claude, Llama 2, Stable Diffusion)
  2. Serverless-First: AWS Glue, Lambda, and Bedrock have no minimum fees
  3. Startup/Scale-up: Pay-as-you-go pricing favors variable workloads
  4. Data Engineering: S3 + Glue + Athena is industry standard
  5. Multi-Region: Better global infrastructure coverage

AWS Advantages:

  • Superior AI model marketplace (Anthropic, Cohere, AI21, Meta)
  • True serverless data catalog (Glue) with no base costs
  • More mature spot instance ecosystem for cost savings
  • Better S3 ecosystem and tooling integration
  • Stronger open-source community adoption

When to Choose Azure

Best For:

  1. Microsoft Ecosystem: Tight integration with Office 365, Teams, Power Platform
  2. Enterprise Windows: Native Windows container support
  3. Hybrid Cloud: Azure Arc and on-premises integration
  4. Enterprise Agreements: Existing Microsoft licensing discounts
  5. Regulated Industries: Better compliance certifications in some regions

Azure Advantages:

  • Seamless Microsoft 365 and Active Directory integration
  • Superior Windows and .NET container support
  • Better hybrid cloud story with Azure Arc
  • Integrated Azure Synapse for unified analytics
  • Potentially lower costs with existing EA agreements

Decision Criteria Scorecard

| Criteria | AWS Score | Azure Score | Weight | Notes |
|---|---|---|---|---|
| AI Model Selection | 9/10 | 7/10 | High | AWS Bedrock has more models |
| ML Training Cost | 8/10 | 8/10 | High | Equivalent spot pricing |
| Data Lake Maturity | 10/10 | 8/10 | High | S3 is industry standard |
| Serverless Pricing | 9/10 | 7/10 | Medium | AWS Glue has no minimums |
| Enterprise Integration | 7/10 | 10/10 | High | Azure wins for Microsoft shops |
| Hybrid Cloud | 7/10 | 9/10 | Medium | Azure Arc is superior |
| Developer Ecosystem | 9/10 | 7/10 | Medium | Larger open-source community |
| Compliance Certifications | 9/10 | 9/10 | High | Equivalent for most use cases |
| Global Infrastructure | 10/10 | 8/10 | Low | AWS has more regions |
| Pricing Transparency | 8/10 | 7/10 | Medium | AWS pricing is clearer |

Total Weighted Score: AWS: 8.5/10 | Azure: 8.1/10

Verdict: Choose based on your organization's existing ecosystem. Both platforms are capable; the difference is in integration, not capability.

Prerequisites

Common Prerequisites (Both Platforms)

Required Accounts:

  • Cloud platform account with administrative access
  • Red Hat Account with OpenShift subscription
  • Credit card for cloud charges

Required Tools (install on your workstation):

# Common tools for both platforms
# OpenShift CLI (oc)
wget https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz
tar -xvf openshift-client-linux.tar.gz
sudo mv oc kubectl /usr/local/bin/
oc version

# Helm (v3)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version

# Tekton CLI
curl -LO https://github.com/tektoncd/cli/releases/download/v0.33.0/tkn_0.33.0_Linux_x86_64.tar.gz
tar xvzf tkn_0.33.0_Linux_x86_64.tar.gz
sudo mv tkn /usr/local/bin/
tkn version

# Python 3.11+
python3 --version

# Container tools (Docker or Podman)
podman --version

AWS-Specific Prerequisites

# AWS CLI (v2)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws --version

# ROSA CLI
wget https://mirror.openshift.com/pub/openshift-v4/clients/rosa/latest/rosa-linux.tar.gz
tar -xvf rosa-linux.tar.gz
sudo mv rosa /usr/local/bin/rosa
rosa version

# Configure AWS
aws configure
aws sts get-caller-identity

# Initialize ROSA
rosa login
rosa verify quota
rosa verify permissions
rosa init

Azure-Specific Prerequisites

# Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az --version

# ARO extension
az extension add --name aro --index https://az.aroapp.io/stable

# Azure CLI login
az login
az account show

# Register required providers
az provider register --namespace Microsoft.RedHatOpenShift --wait
az provider register --namespace Microsoft.Compute --wait
az provider register --namespace Microsoft.Storage --wait
az provider register --namespace Microsoft.Network --wait

Service Quotas Verification

AWS:

# EC2 vCPU quota
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --region us-east-1

# SageMaker training instances
aws service-quotas get-service-quota \
  --service-code sagemaker \
  --quota-code L-2E8D9C5E \
  --region us-east-1

Azure:

# Check compute quota
az vm list-usage --location eastus --output table

# Check ML compute quota
az ml compute list-usage --location eastus

Project 1: Enterprise-Grade RAG Platform

RAG Platform Overview

This project implements a privacy-first Retrieval-Augmented Generation (RAG) system. Both AWS and Azure implementations achieve the same functionality but use platform-specific managed services.

Architecture Comparison

AWS Architecture:

ROSA → AWS PrivateLink → Amazon Bedrock (Claude 3.5)
  ↓
Milvus Vector DB (on ROSA)
  ↓
AWS Glue ETL → S3

Azure Architecture:

ARO → Azure Private Link → Azure OpenAI (GPT-4)
  ↓
Milvus Vector DB (on ARO)
  ↓
Azure Data Factory → Blob Storage

Side-by-Side Service Mapping

| Function | AWS Service | Azure Service | Implementation Difference |
|---|---|---|---|
| LLM API | Amazon Bedrock | Azure OpenAI Service | Different model families |
| Private Network | AWS PrivateLink | Azure Private Link | Similar configuration |
| ETL Pipeline | AWS Glue (serverless) | Azure Data Factory | Different pricing models |
| Metadata | AWS Glue Data Catalog | Azure Purview | Different scopes |
| Storage | Amazon S3 | Azure Blob Storage / ADLS Gen2 | S3 API vs Blob API |
| Vector DB | Milvus on ROSA | Milvus on ARO / Cosmos DB | Self-hosted vs managed option |
| Auth | IRSA (IAM Roles) | Workload Identity | Similar pod-level identity |
| Embedding | Titan Embeddings | OpenAI Embeddings | Different dimensions |

AWS Implementation (RAG)

AWS Phase 1: ROSA Cluster Setup

# Set environment variables
export CLUSTER_NAME="rag-platform-aws"
export AWS_REGION="us-east-1"
export MACHINE_TYPE="m5.2xlarge"
export COMPUTE_NODES=3

# Create ROSA cluster (takes ~40 minutes)
rosa create cluster \
  --cluster-name $CLUSTER_NAME \
  --region $AWS_REGION \
  --multi-az \
  --compute-machine-type $MACHINE_TYPE \
  --compute-nodes $COMPUTE_NODES \
  --machine-cidr 10.0.0.0/16 \
  --service-cidr 172.30.0.0/16 \
  --pod-cidr 10.128.0.0/14 \
  --host-prefix 23 \
  --yes

# Monitor installation
rosa logs install --cluster=$CLUSTER_NAME --watch

# Create admin and connect
rosa create admin --cluster=$CLUSTER_NAME
oc login <api-url> --username cluster-admin --password <password>

# Create namespaces
oc new-project redhat-ods-applications
oc new-project rag-application
oc new-project milvus

AWS Phase 2: Amazon Bedrock via PrivateLink

# Get ROSA VPC details
export ROSA_VPC_ID=$(aws ec2 describe-vpcs \
  --filters "Name=tag:Name,Values=*${CLUSTER_NAME}*" \
  --query 'Vpcs[0].VpcId' \
  --output text \
  --region $AWS_REGION)

export PRIVATE_SUBNET_IDS=$(aws ec2 describe-subnets \
  --filters "Name=vpc-id,Values=$ROSA_VPC_ID" "Name=tag:Name,Values=*private*" \
  --query 'Subnets[*].SubnetId' \
  --output text \
  --region $AWS_REGION)

# Create VPC Endpoint Security Group
export VPC_ENDPOINT_SG=$(aws ec2 create-security-group \
  --group-name bedrock-vpc-endpoint-sg \
  --description "Security group for Bedrock VPC endpoint" \
  --vpc-id $ROSA_VPC_ID \
  --region $AWS_REGION \
  --output text \
  --query 'GroupId')

# Allow HTTPS from ROSA nodes
aws ec2 authorize-security-group-ingress \
  --group-id $VPC_ENDPOINT_SG \
  --protocol tcp \
  --port 443 \
  --cidr 10.0.0.0/16 \
  --region $AWS_REGION

# Create Bedrock VPC Endpoint
export BEDROCK_VPC_ENDPOINT=$(aws ec2 create-vpc-endpoint \
  --vpc-id $ROSA_VPC_ID \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.${AWS_REGION}.bedrock-runtime \
  --subnet-ids $PRIVATE_SUBNET_IDS \
  --security-group-ids $VPC_ENDPOINT_SG \
  --private-dns-enabled \
  --region $AWS_REGION \
  --output text \
  --query 'VpcEndpoint.VpcEndpointId')

# Wait for availability
aws ec2 wait vpc-endpoint-available \
  --vpc-endpoint-ids $BEDROCK_VPC_ENDPOINT \
  --region $AWS_REGION

# Create IAM role for Bedrock access (IRSA pattern)
export OIDC_PROVIDER=$(rosa describe cluster -c $CLUSTER_NAME -o json | jq -r .aws.sts.oidc_endpoint_url | sed 's|https://||')
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

cat > bedrock-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "arn:aws:bedrock:${AWS_REGION}::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0"
    }
  ]
}
EOF

aws iam create-policy \
  --policy-name BedrockInvokePolicy \
  --policy-document file://bedrock-policy.json

cat > trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:rag-application:bedrock-sa"
        }
      }
    }
  ]
}
EOF

export BEDROCK_ROLE_ARN=$(aws iam create-role \
  --role-name rosa-bedrock-access \
  --assume-role-policy-document file://trust-policy.json \
  --query 'Role.Arn' \
  --output text)

aws iam attach-role-policy \
  --role-name rosa-bedrock-access \
  --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/BedrockInvokePolicy

# Create Kubernetes service account
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: bedrock-sa
  namespace: rag-application
  annotations:
    eks.amazonaws.com/role-arn: $BEDROCK_ROLE_ARN
EOF
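
Before building the data pipeline, it's worth confirming that the service account actually assumes the role. Here's a minimal smoke test, assuming the cluster's pod identity webhook injects the web-identity token (as it does on STS-enabled ROSA clusters); the pod name is illustrative, and sts:GetCallerIdentity needs no extra IAM permissions:

# Run a one-off pod under bedrock-sa; it should report the rosa-bedrock-access role
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: irsa-smoke-test
  namespace: rag-application
spec:
  serviceAccountName: bedrock-sa
  restartPolicy: Never
  containers:
  - name: awscli
    image: amazon/aws-cli:latest
    command: ["aws", "sts", "get-caller-identity"]
EOF

oc logs -f irsa-smoke-test -n rag-application
oc delete pod irsa-smoke-test -n rag-application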

AWS Phase 3: AWS Glue Data Pipeline

# Create S3 bucket
export BUCKET_NAME="rag-documents-${ACCOUNT_ID}"
aws s3 mb s3://$BUCKET_NAME --region $AWS_REGION

# Enable versioning
aws s3api put-bucket-versioning \
  --bucket $BUCKET_NAME \
  --versioning-configuration Status=Enabled \
  --region $AWS_REGION

# Create folder structure
aws s3api put-object --bucket $BUCKET_NAME --key raw-documents/
aws s3api put-object --bucket $BUCKET_NAME --key processed-documents/
aws s3api put-object --bucket $BUCKET_NAME --key embeddings/

# Create Glue IAM role
cat > glue-trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "glue.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role \
  --role-name AWSGlueServiceRole-RAG \
  --assume-role-policy-document file://glue-trust-policy.json

aws iam attach-role-policy \
  --role-name AWSGlueServiceRole-RAG \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

# Create S3 access policy
cat > glue-s3-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::${BUCKET_NAME}/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::${BUCKET_NAME}"
    }
  ]
}
EOF

aws iam put-role-policy \
  --role-name AWSGlueServiceRole-RAG \
  --policy-name S3Access \
  --policy-document file://glue-s3-policy.json

# Create Glue database
aws glue create-database \
  --database-input '{
    "Name": "rag_documents_db",
    "Description": "RAG document metadata"
  }' \
  --region $AWS_REGION

# Create Glue crawler
aws glue create-crawler \
  --name rag-document-crawler \
  --role arn:aws:iam::${ACCOUNT_ID}:role/AWSGlueServiceRole-RAG \
  --database-name rag_documents_db \
  --targets '{
    "S3Targets": [{"Path": "s3://'$BUCKET_NAME'/raw-documents/"}]
  }' \
  --region $AWS_REGION
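
With the crawler defined, you can stage a document and run it on demand; sample.pdf is just a placeholder, and the crawler state can be polled until it returns to READY:

# Upload a document, start the crawler, and watch its state
aws s3 cp sample.pdf s3://$BUCKET_NAME/raw-documents/ --region $AWS_REGION
aws glue start-crawler --name rag-document-crawler --region $AWS_REGION
aws glue get-crawler --name rag-document-crawler \
  --query 'Crawler.State' --output text --region $AWS_REGION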

AWS Phase 4: Milvus Vector Database

# Install Milvus using Helm
helm repo add milvus https://milvus-io.github.io/milvus-helm/
helm repo update

helm install milvus-operator milvus/milvus-operator \
  --namespace milvus \
  --create-namespace

# Create PVCs
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: milvus-etcd-pvc
  namespace: milvus
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
  storageClassName: gp3-csi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: milvus-minio-pvc
  namespace: milvus
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi
  storageClassName: gp3-csi
EOF

# Deploy Milvus
cat > milvus-values.yaml <<EOF
cluster:
  enabled: true
service:
  type: ClusterIP
  port: 19530
standalone:
  replicas: 1
  resources:
    limits:
      cpu: "4"
      memory: 8Gi
    requests:
      cpu: "2"
      memory: 4Gi
etcd:
  persistence:
    enabled: true
    existingClaim: milvus-etcd-pvc
minio:
  persistence:
    enabled: true
    existingClaim: milvus-minio-pvc
EOF

helm install milvus milvus/milvus \
  --namespace milvus \
  --values milvus-values.yaml \
  --wait

# Get Milvus endpoint
export MILVUS_HOST=$(oc get svc milvus -n milvus -o jsonpath='{.spec.clusterIP}')
export MILVUS_PORT=19530
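
The RAG API below searches a collection named rag_documents, which the Helm chart does not create. Here's a minimal sketch for creating it with pymilvus, with the schema assumed from the application code (a text field plus a 1024-dimensional embedding field to match Titan v2); adjust it to your documents:

python3 - <<'PY'
# Create the collection the RAG API expects (schema assumed from the app code)
import os
from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
                      connections, utility)

connections.connect(host=os.getenv("MILVUS_HOST", "127.0.0.1"), port=19530)

if not utility.has_collection("rag_documents"):
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema("text", DataType.VARCHAR, max_length=65535),
        FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=1024),
    ])
    coll = Collection("rag_documents", schema)
    coll.create_index("embedding", {"index_type": "IVF_FLAT",
                                    "metric_type": "L2",
                                    "params": {"nlist": 128}})
    coll.load()
    print("Created and loaded rag_documents")
PY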

AWS Phase 5: RAG Application Deployment

# Create application code
mkdir -p rag-app-aws/src

cat > rag-app-aws/requirements.txt <<EOF
fastapi==0.104.1
uvicorn[standard]==0.24.0
pydantic==2.5.0
pymilvus==2.3.3
boto3==1.29.7
python-dotenv==1.0.0
EOF

# Create FastAPI application (abbreviated for space)
cat > rag-app-aws/src/main.py <<'PYTHON'
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import os, json, boto3
from pymilvus import connections, Collection

app = FastAPI(title="Enterprise RAG API - AWS")

MILVUS_HOST = os.getenv("MILVUS_HOST")
AWS_REGION = os.getenv("AWS_REGION", "us-east-1")
BEDROCK_MODEL = "anthropic.claude-3-5-sonnet-20241022-v2:0"

bedrock = boto3.client('bedrock-runtime', region_name=AWS_REGION)

@app.on_event("startup")
async def startup():
    connections.connect(host=MILVUS_HOST, port=19530)

class QueryRequest(BaseModel):
    query: str
    top_k: int = 5
    max_tokens: int = 1000

@app.post("/query")
async def query_rag(req: QueryRequest):
    # Generate embedding with Bedrock Titan
    embed_resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": req.query, "dimensions": 1024})
    )
    embedding = json.loads(embed_resp['body'].read())['embedding']

    # Search Milvus
    coll = Collection("rag_documents")
    results = coll.search([embedding], "embedding", {"metric_type": "L2"},
                          limit=req.top_k, output_fields=["text"])

    # Build context
    context = "\n\n".join([hit.entity.get("text") for hit in results[0]])

    # Call Bedrock Claude
    prompt = f"Context:\n{context}\n\nQuestion: {req.query}\n\nAnswer:"
    response = bedrock.invoke_model(
        modelId=BEDROCK_MODEL,
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": req.max_tokens,
            "messages": [{"role": "user", "content": prompt}]
        })
    )

    answer = json.loads(response['body'].read())['content'][0]['text']
    return {"answer": answer, "sources": [{"chunk": hit.entity.get("text")} for hit in results[0]]}

@app.get("/health")
async def health():
    return {"status": "healthy", "platform": "AWS", "model": "Claude 3.5 Sonnet"}
PYTHON

# Create Dockerfile
cat > rag-app-aws/Dockerfile <<EOF
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ ./src/
EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
EOF

# Build and deploy
cd rag-app-aws
podman build -t rag-app-aws:v1.0 .
oc create imagestream rag-app-aws -n rag-application
podman tag rag-app-aws:v1.0 image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-aws:v1.0
podman push image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-aws:v1.0 --tls-verify=false
cd ..

# Deploy to OpenShift
cat <<EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-app-aws
  namespace: rag-application
spec:
  replicas: 2
  selector:
    matchLabels:
      app: rag-app-aws
  template:
    metadata:
      labels:
        app: rag-app-aws
    spec:
      serviceAccountName: bedrock-sa
      containers:
      - name: app
        image: image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-aws:v1.0
        ports:
        - containerPort: 8000
        env:
        - name: MILVUS_HOST
          value: "$MILVUS_HOST"
        - name: AWS_REGION
          value: "$AWS_REGION"
---
apiVersion: v1
kind: Service
metadata:
  name: rag-app-aws
  namespace: rag-application
spec:
  selector:
    app: rag-app-aws
  ports:
  - port: 80
    targetPort: 8000
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: rag-app-aws
  namespace: rag-application
spec:
  to:
    kind: Service
    name: rag-app-aws
  tls:
    termination: edge
EOF

# Get URL and test
export RAG_URL_AWS=$(oc get route rag-app-aws -n rag-application -o jsonpath='{.spec.host}')
curl https://$RAG_URL_AWS/health
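
Once documents have been ingested, a query looks like this (the question itself is illustrative):

curl -s -X POST https://$RAG_URL_AWS/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What does the onboarding guide say about VPN access?", "top_k": 3}'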

Azure Implementation (RAG)

Azure Phase 1: ARO Cluster Setup

# Set environment variables
export CLUSTER_NAME="rag-platform-azure"
export LOCATION="eastus"
export RESOURCE_GROUP="rag-platform-rg"

# Create resource group
az group create \
  --name $RESOURCE_GROUP \
  --location $LOCATION

# Create virtual network
az network vnet create \
  --resource-group $RESOURCE_GROUP \
  --name aro-vnet \
  --address-prefixes 10.0.0.0/22

# Create master subnet
az network vnet subnet create \
  --resource-group $RESOURCE_GROUP \
  --vnet-name aro-vnet \
  --name master-subnet \
  --address-prefixes 10.0.0.0/23 \
  --service-endpoints Microsoft.ContainerRegistry

# Create worker subnet
az network vnet subnet create \
  --resource-group $RESOURCE_GROUP \
  --vnet-name aro-vnet \
  --name worker-subnet \
  --address-prefixes 10.0.2.0/23 \
  --service-endpoints Microsoft.ContainerRegistry

# Disable subnet private endpoint policies
az network vnet subnet update \
  --name master-subnet \
  --resource-group $RESOURCE_GROUP \
  --vnet-name aro-vnet \
  --disable-private-link-service-network-policies true

# Create ARO cluster (takes ~35 minutes)
az aro create \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --vnet aro-vnet \
  --master-subnet master-subnet \
  --worker-subnet worker-subnet \
  --worker-count 3 \
  --worker-vm-size Standard_D8s_v3

# Get credentials
export ARO_API_URL=$(az aro show \
  --name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --query apiserverProfile.url -o tsv)

export ARO_PASSWORD=$(az aro list-credentials \
  --name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --query kubeadminPassword -o tsv)

# Log in against the API server (the console URL will not work with oc login)
oc login $ARO_API_URL -u kubeadmin -p $ARO_PASSWORD

# Create namespaces
oc new-project rag-application
oc new-project milvus

Azure Phase 2: Azure OpenAI via Private Link

# Create Azure OpenAI resource
export OPENAI_NAME="rag-openai-${RANDOM}"

az cognitiveservices account create \
  --name $OPENAI_NAME \
  --resource-group $RESOURCE_GROUP \
  --kind OpenAI \
  --sku S0 \
  --location $LOCATION \
  --custom-domain $OPENAI_NAME \
  --public-network-access Disabled

# Deploy GPT-4 model
az cognitiveservices account deployment create \
  --name $OPENAI_NAME \
  --resource-group $RESOURCE_GROUP \
  --deployment-name gpt-4 \
  --model-name gpt-4 \
  --model-version "0613" \
  --model-format OpenAI \
  --sku-capacity 10 \
  --sku-name "Standard"

# Deploy text-embedding model
az cognitiveservices account deployment create \
  --name $OPENAI_NAME \
  --resource-group $RESOURCE_GROUP \
  --deployment-name text-embedding-ada-002 \
  --model-name text-embedding-ada-002 \
  --model-version "2" \
  --model-format OpenAI \
  --sku-capacity 10 \
  --sku-name "Standard"

# Create Private Endpoint
export VNET_ID=$(az network vnet show \
  --resource-group $RESOURCE_GROUP \
  --name aro-vnet \
  --query id -o tsv)

export SUBNET_ID=$(az network vnet subnet show \
  --resource-group $RESOURCE_GROUP \
  --vnet-name aro-vnet \
  --name worker-subnet \
  --query id -o tsv)

export OPENAI_ID=$(az cognitiveservices account show \
  --name $OPENAI_NAME \
  --resource-group $RESOURCE_GROUP \
  --query id -o tsv)

az network private-endpoint create \
  --name openai-private-endpoint \
  --resource-group $RESOURCE_GROUP \
  --vnet-name aro-vnet \
  --subnet worker-subnet \
  --private-connection-resource-id $OPENAI_ID \
  --group-id account \
  --connection-name openai-connection

# Create Private DNS Zone
az network private-dns zone create \
  --resource-group $RESOURCE_GROUP \
  --name privatelink.openai.azure.com

az network private-dns link vnet create \
  --resource-group $RESOURCE_GROUP \
  --zone-name privatelink.openai.azure.com \
  --name openai-dns-link \
  --virtual-network aro-vnet \
  --registration-enabled false

# Create DNS record
export ENDPOINT_IP=$(az network private-endpoint show \
  --name openai-private-endpoint \
  --resource-group $RESOURCE_GROUP \
  --query 'customDnsConfigs[0].ipAddresses[0]' -o tsv)

az network private-dns record-set a create \
  --name $OPENAI_NAME \
  --zone-name privatelink.openai.azure.com \
  --resource-group $RESOURCE_GROUP

az network private-dns record-set a add-record \
  --record-set-name $OPENAI_NAME \
  --zone-name privatelink.openai.azure.com \
  --resource-group $RESOURCE_GROUP \
  --ipv4-address $ENDPOINT_IP

# Configure Workload Identity
export ARO_OIDC_ISSUER=$(az aro show \
  --name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --query 'serviceIdentity.url' -o tsv)

# Create managed identity
az identity create \
  --name rag-app-identity \
  --resource-group $RESOURCE_GROUP

export IDENTITY_CLIENT_ID=$(az identity show \
  --name rag-app-identity \
  --resource-group $RESOURCE_GROUP \
  --query clientId -o tsv)

export IDENTITY_PRINCIPAL_ID=$(az identity show \
  --name rag-app-identity \
  --resource-group $RESOURCE_GROUP \
  --query principalId -o tsv)

# Grant OpenAI access
az role assignment create \
  --assignee $IDENTITY_PRINCIPAL_ID \
  --role "Cognitive Services OpenAI User" \
  --scope $OPENAI_ID

# Create federated credential
az identity federated-credential create \
  --name rag-app-federated \
  --identity-name rag-app-identity \
  --resource-group $RESOURCE_GROUP \
  --issuer $ARO_OIDC_ISSUER \
  --subject "system:serviceaccount:rag-application:openai-sa"

# Create Kubernetes service account
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: openai-sa
  namespace: rag-application
  annotations:
    azure.workload.identity/client-id: $IDENTITY_CLIENT_ID
EOF

# Get OpenAI endpoint and key
export OPENAI_ENDPOINT=$(az cognitiveservices account show \
  --name $OPENAI_NAME \
  --resource-group $RESOURCE_GROUP \
  --query properties.endpoint -o tsv)

export OPENAI_KEY=$(az cognitiveservices account keys list \
  --name $OPENAI_NAME \
  --resource-group $RESOURCE_GROUP \
  --query key1 -o tsv)

# Create secret
oc create secret generic openai-credentials \
  --from-literal=endpoint=$OPENAI_ENDPOINT \
  --from-literal=key=$OPENAI_KEY \
  -n rag-application

Azure Phase 3: Azure Data Factory Pipeline

# Create Data Factory
export ADF_NAME="rag-adf-${RANDOM}"

az datafactory create \
  --resource-group $RESOURCE_GROUP \
  --factory-name $ADF_NAME \
  --location $LOCATION

# Create Storage Account
export STORAGE_ACCOUNT="ragdocs${RANDOM}"

az storage account create \
  --name $STORAGE_ACCOUNT \
  --resource-group $RESOURCE_GROUP \
  --location $LOCATION \
  --sku Standard_LRS \
  --kind StorageV2 \
  --enable-hierarchical-namespace true

# Get storage key
export STORAGE_KEY=$(az storage account keys list \
  --account-name $STORAGE_ACCOUNT \
  --resource-group $RESOURCE_GROUP \
  --query '[0].value' -o tsv)

# Create containers
az storage container create \
  --name raw-documents \
  --account-name $STORAGE_ACCOUNT \
  --account-key $STORAGE_KEY

az storage container create \
  --name processed-documents \
  --account-name $STORAGE_ACCOUNT \
  --account-key $STORAGE_KEY

# Create linked service for storage
cat > adf-storage-linked-service.json <<EOF
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "connectionString": "DefaultEndpointsProtocol=https;AccountName=$STORAGE_ACCOUNT;AccountKey=$STORAGE_KEY;EndpointSuffix=core.windows.net"
  }
}
EOF

az datafactory linked-service create \
  --resource-group $RESOURCE_GROUP \
  --factory-name $ADF_NAME \
  --name StorageLinkedService \
  --properties @adf-storage-linked-service.json
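
To give the pipeline something to process, stage a document in the raw container; sample.txt is a placeholder:

az storage blob upload \
  --account-name $STORAGE_ACCOUNT \
  --account-key $STORAGE_KEY \
  --container-name raw-documents \
  --name sample.txt \
  --file ./sample.txt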

Azure Phase 4: Milvus Deployment (Same as AWS)

The Milvus deployment on ARO is identical to ROSA since both use OpenShift:

# Same Helm commands as AWS implementation
helm repo add milvus https://milvus-io.github.io/milvus-helm/
helm install milvus-operator milvus/milvus-operator --namespace milvus --create-namespace

# Create PVCs using Azure Disk
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: milvus-etcd-pvc
  namespace: milvus
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
  storageClassName: managed-premium
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: milvus-minio-pvc
  namespace: milvus
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi
  storageClassName: managed-premium
EOF

# Deploy Milvus (same values file as AWS)
helm install milvus milvus/milvus --namespace milvus --values milvus-values.yaml --wait

Azure Phase 5: RAG Application Deployment

# Create Azure-specific application
mkdir -p rag-app-azure/src

cat > rag-app-azure/requirements.txt <<EOF
fastapi==0.104.1
uvicorn[standard]==0.24.0
pydantic==2.5.0
pymilvus==2.3.3
openai==1.3.5
azure-identity==1.14.0
python-dotenv==1.0.0
EOF

cat > rag-app-azure/src/main.py <<'PYTHON'
from fastapi import FastAPI
from pydantic import BaseModel
import os
from openai import AzureOpenAI
from pymilvus import connections, Collection

app = FastAPI(title="Enterprise RAG API - Azure")

client = AzureOpenAI(
    api_key=os.getenv("OPENAI_KEY"),
    api_version="2023-05-15",
    azure_endpoint=os.getenv("OPENAI_ENDPOINT")
)

@app.on_event("startup")
async def startup():
    connections.connect(host=os.getenv("MILVUS_HOST"), port=19530)

class QueryRequest(BaseModel):
    query: str
    top_k: int = 5
    max_tokens: int = 1000

@app.post("/query")
async def query_rag(req: QueryRequest):
    # Generate embedding with Azure OpenAI
    embed_resp = client.embeddings.create(
        input=req.query,
        model="text-embedding-ada-002"
    )
    embedding = embed_resp.data[0].embedding

    # Search Milvus
    coll = Collection("rag_documents")
    results = coll.search([embedding], "embedding", {"metric_type": "L2"},
                          limit=req.top_k, output_fields=["text"])

    # Build context
    context = "\n\n".join([hit.entity.get("text") for hit in results[0]])

    # Call Azure OpenAI GPT-4
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {req.query}"}
        ],
        max_tokens=req.max_tokens
    )

    answer = response.choices[0].message.content
    return {"answer": answer, "sources": [{"chunk": hit.entity.get("text")} for hit in results[0]]}

@app.get("/health")
async def health():
    return {"status": "healthy", "platform": "Azure", "model": "GPT-4"}
PYTHON

# Build and deploy (similar to AWS)
cd rag-app-azure
podman build -t rag-app-azure:v1.0 .
oc create imagestream rag-app-azure -n rag-application
podman tag rag-app-azure:v1.0 image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-azure:v1.0
podman push image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-azure:v1.0 --tls-verify=false
cd ..

# Deploy with Azure credentials
cat <<EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-app-azure
  namespace: rag-application
spec:
  replicas: 2
  selector:
    matchLabels:
      app: rag-app-azure
  template:
    metadata:
      labels:
        app: rag-app-azure
    spec:
      serviceAccountName: openai-sa
      containers:
      - name: app
        image: image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-azure:v1.0
        ports:
        - containerPort: 8000
        env:
        - name: MILVUS_HOST
          value: "milvus.milvus.svc.cluster.local"
        - name: OPENAI_ENDPOINT
          valueFrom:
            secretKeyRef:
              name: openai-credentials
              key: endpoint
        - name: OPENAI_KEY
          valueFrom:
            secretKeyRef:
              name: openai-credentials
              key: key
---
apiVersion: v1
kind: Service
metadata:
  name: rag-app-azure
  namespace: rag-application
spec:
  selector:
    app: rag-app-azure
  ports:
  - port: 80
    targetPort: 8000
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: rag-app-azure
  namespace: rag-application
spec:
  to:
    kind: Service
    name: rag-app-azure
  tls:
    termination: edge
EOF

# Get URL and test
export RAG_URL_AZURE=$(oc get route rag-app-azure -n rag-application -o jsonpath='{.spec.host}')
curl https://$RAG_URL_AZURE/health

Cost Comparison (RAG)

Monthly Cost Breakdown

| Component | AWS Cost | Azure Cost | Notes |
|---|---|---|---|
| Kubernetes: 3x worker nodes | $1,460 (m5.2xlarge) | $1,380 (D8s_v3) | Similar specs |
| Kubernetes: control plane | $0 (managed by ROSA) | $0 (managed by ARO) | Both included |
| LLM: 1M input tokens | $3 (Claude 3.5) | $30 (GPT-4) | AWS 10x cheaper |
| LLM: 1M output tokens | $15 (Claude 3.5) | $60 (GPT-4) | AWS 4x cheaper |
| Embeddings: 1M tokens | $0.10 (Titan) | $0.10 (Ada-002) | Equivalent |
| Data pipeline: ETL service | $10 (Glue, serverless) | $15 (Data Factory) | AWS slightly cheaper |
| Data pipeline: metadata catalog | $1 (Glue Catalog) | $20 (Purview minimum) | Azure has minimum fee |
| Object storage: 100 GB | $2.30 (S3) | $2.05 (Blob) | Equivalent |
| Object storage: requests (100k) | $0.05 (S3) | $0.04 (Blob) | Equivalent |
| Vector DB: self-hosted Milvus | $0 (on cluster) | $0 (on cluster) | Same |
| Networking: Private Link | $7.20 (PrivateLink) | $7.20 (Private Link) | Same pricing |
| Networking: data transfer (1 TB out) | $5 | $5 | Equivalent |
| TOTAL/MONTH | $1,503.65 | $1,519.39 | AWS ~1% cheaper |

Key Cost Insights:

  1. LLM API costs favor AWS by a significant margin (Claude is cheaper than GPT-4)
  2. Azure Purview has a minimum monthly fee vs Glue's pay-per-use
  3. Compute costs are similar between ROSA and ARO
  4. Winner: AWS by ~$16/month (1%)

Cost Optimization Strategies

AWS:

  • Use Claude Instant for non-critical queries (6x cheaper)
  • Leverage Glue serverless (no base cost)
  • Use S3 Intelligent-Tiering for old documents

Azure:

  • Use GPT-3.5-Turbo instead of GPT-4 (20x cheaper)
  • Negotiate EA pricing for Azure OpenAI
  • Use cool/archive tiers for old data

Project 2: Hybrid MLOps Pipeline

MLOps Platform Overview

This project demonstrates cost-optimized machine learning operations by bursting GPU training workloads to managed services while keeping inference on Kubernetes.

Architecture Comparison

AWS Architecture:

OpenShift Pipelines → ACK → SageMaker (ml.p4d.24xlarge)
                            ↓
                        S3 Model Storage
                            ↓
                    KServe on ROSA (CPU)

Azure Architecture:

Azure DevOps / Tekton → ASO → Azure ML (NC96ads_A100_v4)
                               ↓
                           Blob Model Storage
                               ↓
                       KServe on ARO (CPU)

Service Mapping

| Function | AWS Service | Azure Service | Key Difference |
|---|---|---|---|
| ML Platform | Amazon SageMaker | Azure Machine Learning | Similar capabilities |
| GPU Training | ml.p4d.24xlarge (8x A100) | NC96ads_A100_v4 (8x A100) | Same hardware |
| Spot Training | Managed Spot Training | Low-Priority VMs | Different reservation models |
| Model Registry | S3 + SageMaker Registry | Blob + ML Model Registry | Different metadata approaches |
| K8s Operator | ACK (AWS Controllers) | ASO (Azure Service Operator) | Different CRD structures |
| Pipelines | OpenShift Pipelines (Tekton) | Azure DevOps / Tekton | Both support Tekton |
| Inference | KServe on ROSA | KServe on ARO | Identical |

AWS Implementation (MLOps)

AWS MLOps Phase 1: OpenShift Pipelines Setup

# Install OpenShift Pipelines Operator
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-pipelines-operator
  namespace: openshift-operators
spec:
  channel: latest
  name: openshift-pipelines-operator-rh
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

# Create namespace
oc new-project mlops-pipelines

# Create service account
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pipeline-sa
  namespace: mlops-pipelines
EOF
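
To wire the pipeline to SageMaker, a pipeline step only needs to apply the ACK TrainingJob manifest created in Phase 3 below. Here's a minimal sketch of such a Tekton Task; the CLI image and the manifest path inside the workspace are illustrative:

cat <<EOF | oc apply -f -
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: submit-training-job
  namespace: mlops-pipelines
spec:
  workspaces:
  - name: source
  steps:
  - name: apply-trainingjob
    image: quay.io/openshift/origin-cli:latest
    script: |
      #!/bin/sh
      # Apply the ACK TrainingJob CR checked into the workspace
      oc apply -f \$(workspaces.source.path)/trainingjob.yaml
EOF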

AWS MLOps Phase 2: ACK SageMaker Controller

# Install ACK SageMaker controller
export SERVICE=sagemaker
export RELEASE_VERSION=$(curl -sL https://api.github.com/repos/aws-controllers-k8s/${SERVICE}-controller/releases/latest | grep '"tag_name":' | cut -d'"' -f4)

wget https://github.com/aws-controllers-k8s/${SERVICE}-controller/releases/download/${RELEASE_VERSION}/install.yaml
kubectl apply -f install.yaml

# Create IAM role for ACK
cat > ack-sagemaker-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateTrainingJob",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:StopTrainingJob"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:*"],
      "Resource": "arn:aws:s3:::mlops-*"
    },
    {
      "Effect": "Allow",
      "Action": ["iam:PassRole"],
      "Resource": "*",
      "Condition": {
        "StringEquals": {"iam:PassedToService": "sagemaker.amazonaws.com"}
      }
    }
  ]
}
EOF

aws iam create-policy --policy-name ACKSageMakerPolicy --policy-document file://ack-sagemaker-policy.json

# Create trust policy and role (similar to RAG project)
# ... (abbreviated for space)

AWS MLOps Phase 3: Training Job Example

# Create S3 buckets
export ML_BUCKET="mlops-artifacts-${ACCOUNT_ID}"
export DATA_BUCKET="mlops-datasets-${ACCOUNT_ID}"

aws s3 mb s3://$ML_BUCKET
aws s3 mb s3://$DATA_BUCKET

# Upload training script
cat > train.py <<'PYTHON'
import argparse, joblib
from sklearn.ensemble import RandomForestClassifier
import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument('--n_estimators', type=int, default=100)
args = parser.parse_args()

# Training code
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, 1000)

model = RandomForestClassifier(n_estimators=args.n_estimators)
model.fit(X, y)

joblib.dump(model, '/opt/ml/model/model.joblib')
print(f"Training completed with {args.n_estimators} estimators")
PYTHON

# Create Dockerfile
cat > Dockerfile <<EOF
FROM python:3.10-slim
RUN pip install scikit-learn joblib numpy
COPY train.py /opt/ml/code/
ENTRYPOINT ["python", "/opt/ml/code/train.py"]
EOF

# Build and push to ECR
aws ecr create-repository --repository-name mlops/training
export ECR_URI="${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/mlops/training"
aws ecr get-login-password --region $AWS_REGION | \
  docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
docker build -t mlops-training .
docker tag mlops-training:latest $ECR_URI:latest
docker push $ECR_URI:latest

# Create SageMaker training job via ACK
cat <<EOF | oc apply -f -
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: TrainingJob
metadata:
  name: rf-training-job
  namespace: mlops-pipelines
spec:
  trainingJobName: rf-training-$(date +%s)
  roleARN: $SAGEMAKER_ROLE_ARN
  algorithmSpecification:
    trainingImage: $ECR_URI:latest
    trainingInputMode: File
  resourceConfig:
    instanceType: ml.m5.xlarge
    instanceCount: 1
    volumeSizeInGB: 50
  outputDataConfig:
    s3OutputPath: s3://$ML_BUCKET/models/
  stoppingCondition:
    maxRuntimeInSeconds: 3600
EOF
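
Once the CR is applied, the ACK controller mirrors SageMaker's state back into the resource, so you can watch the job from either side:

# Watch from Kubernetes...
oc describe trainingjob rf-training-job -n mlops-pipelines

# ...or from AWS
aws sagemaker list-training-jobs \
  --sort-by CreationTime --sort-order Descending \
  --max-results 5 --region $AWS_REGION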

Azure Implementation (MLOps)

Azure MLOps Phase 1: Azure ML Workspace

# Create ML workspace
export ML_WORKSPACE="mlops-workspace-${RANDOM}"

az ml workspace create \
  --name $ML_WORKSPACE \
  --resource-group $RESOURCE_GROUP \
  --location $LOCATION

# Create compute cluster (spot instances)
az ml compute create \
  --name gpu-cluster \
  --type amlcompute \
  --min-instances 0 \
  --max-instances 4 \
  --size Standard_NC6s_v3 \
  --tier LowPriority \
  --workspace-name $ML_WORKSPACE \
  --resource-group $RESOURCE_GROUP
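
Submitting a run with the v2 CLI is then a matter of pointing a command job at the cluster. A hedged sketch; the curated sklearn environment name is an assumption, so substitute one available in your workspace:

cat > train-job.yaml <<'EOF'
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: python train.py --n_estimators 200
code: ./src
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:gpu-cluster
EOF

az ml job create --file train-job.yaml \
  --workspace-name $ML_WORKSPACE \
  --resource-group $RESOURCE_GROUP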

Azure MLOps Phase 2: Azure Service Operator

# Install ASO
helm repo add aso2 https://raw.githubusercontent.com/Azure/azure-service-operator/main/v2/charts
helm install aso2 aso2/azure-service-operator \
  --create-namespace \
  --namespace azureserviceoperator-system \
  --set azureSubscriptionID=$SUBSCRIPTION_ID \
  --set azureTenantID=$TENANT_ID \
  --set azureClientID=$CLIENT_ID \
  --set azureClientSecret=$CLIENT_SECRET

# Create ML job via ASO
cat <<EOF | oc apply -f -
apiVersion: machinelearningservices.azure.com/v1alpha1
kind: Job
metadata:
  name: rf-training-job
  namespace: mlops-pipelines
spec:
  owner:
    name: $ML_WORKSPACE
  compute:
    target: gpu-cluster
    instanceCount: 1
  environment:
    image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
  codeConfiguration:
    codeArtifactId: azureml://code/train-script
    scoringScript: train.py
EOF

Cost Comparison (MLOps)

| Component | AWS Monthly | Azure Monthly | Notes |
|---|---|---|---|
| Training: 4 hrs/week spot GPU | $157 (ml.p4d.24xlarge) | $153 (NC96ads_A100_v4) | Azure slightly cheaper |
| Storage: model artifacts (50 GB) | $1.15 (S3) | $1.00 (Blob) | Similar |
| ML platform service | $0 (pay-per-use) | $0 (pay-per-use) | Same |
| Inference: shared ROSA/ARO cluster | $0 (shared) | $0 (shared) | Same |
| TOTAL/MONTH | ~$158 | ~$154 | Azure ~2.5% cheaper |

Winner: Azure by $4/month (negligible difference)

Project 3: Unified Data Fabric (Data Lakehouse)

Lakehouse Platform Overview

This project implements a stateless data lakehouse where compute (Spark) can be destroyed without data loss.

Architecture Comparison

AWS Architecture:

Spark on ROSA → AWS Glue Catalog → S3 + Iceberg

Azure Architecture:

Spark on ARO → Azure Purview / Unity Catalog → ADLS Gen2 + Delta Lake

Service Mapping

| Function | AWS Service | Azure Service | Key Difference |
|---|---|---|---|
| Catalog | AWS Glue Data Catalog | Azure Purview / Unity Catalog | Glue is serverless |
| Table Format | Apache Iceberg | Delta Lake | Iceberg is cloud-agnostic |
| Storage | Amazon S3 | ADLS Gen2 | ADLS has hierarchical namespace |
| Compute | Spark on ROSA | Spark on ARO / Databricks | ARO or managed Databricks |
| Query Engine | Amazon Athena | Azure Synapse Serverless SQL | Similar serverless query |

AWS Implementation (Lakehouse)

(Due to length constraints, showing key differences only)

# Install Spark Operator (add the Kubeflow chart repo first)
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace \
  --set sparkJobNamespace=spark-jobs

# Create Glue databases
aws glue create-database --database-input '{"Name": "bronze"}'
aws glue create-database --database-input '{"Name": "silver"}'
aws glue create-database --database-input '{"Name": "gold"}'

# Build custom Spark image with Iceberg
cat > Dockerfile <<EOF
FROM gcr.io/spark-operator/spark:v3.5.0
USER root
RUN curl -L https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.4.2/iceberg-spark-runtime-3.5_2.12-1.4.2.jar \
    -o /opt/spark/jars/iceberg-spark-runtime.jar
RUN curl -L https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar \
    -o /opt/spark/jars/hadoop-aws.jar
USER 185
EOF

# Deploy SparkApplication with Glue integration
cat <<EOF | oc apply -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: lakehouse-etl
spec:
  type: Python
  sparkVersion: "3.5.0"
  mainApplicationFile: s3://bucket/scripts/etl.py
  sparkConf:
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog"
    "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog"
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
EOF
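
The ETL script itself is not shown above; here's a minimal sketch of a bronze-layer write through the Glue/Iceberg catalog, where the bucket and table names are illustrative:

cat > etl.py <<'PYTHON'
from pyspark.sql import SparkSession

# Catalog classes come from the SparkApplication's sparkConf; the warehouse
# path and source bucket here are placeholders.
spark = (SparkSession.builder
         .appName("lakehouse-etl")
         .config("spark.sql.catalog.glue_catalog.warehouse",
                 "s3://YOUR-LAKEHOUSE-BUCKET/warehouse/")
         .getOrCreate())

raw = spark.read.json("s3a://YOUR-LAKEHOUSE-BUCKET/raw/events/")
raw.writeTo("glue_catalog.bronze.events").createOrReplace()
PYTHON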

Azure Implementation (Lakehouse)

# Option 1: Use Azure Databricks (managed)
az databricks workspace create \
  --name databricks-lakehouse \
  --resource-group $RESOURCE_GROUP \
  --location $LOCATION \
  --sku premium

# Option 2: Deploy Spark on ARO with Delta Lake
cat > Dockerfile <<EOF
FROM gcr.io/spark-operator/spark:v3.5.0
USER root
# Delta Lake 3.x matches Spark 3.5 (delta-core 2.x only supports up to Spark 3.4)
RUN curl -L https://repo1.maven.org/maven2/io/delta/delta-spark_2.12/3.0.0/delta-spark_2.12-3.0.0.jar \
    -o /opt/spark/jars/delta-spark.jar
RUN curl -L https://repo1.maven.org/maven2/io/delta/delta-storage/3.0.0/delta-storage-3.0.0.jar \
    -o /opt/spark/jars/delta-storage.jar
USER 185
EOF

# Create ADLS Gen2 storage
az storage account create \
  --name datalake${RANDOM} \
  --resource-group $RESOURCE_GROUP \
  --location $LOCATION \
  --kind StorageV2 \
  --enable-hierarchical-namespace true

# Deploy SparkApplication with Delta Lake
cat <<EOF | oc apply -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: lakehouse-etl
spec:
  type: Python
  sparkVersion: "3.5.0"
  mainApplicationFile: abfss://container@storage.dfs.core.windows.net/scripts/etl.py
  sparkConf:
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension"
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
EOF

Cost Comparison (Lakehouse)

| Component | AWS Monthly | Azure Monthly | Notes |
|---|---|---|---|
| Compute: Spark cluster (3x m5.4xlarge / D16s_v3) | $1,500 | $1,450 | Similar |
| Metadata catalog | $10 (Glue, 1M requests) | $20 (Purview minimum) | AWS cheaper |
| Storage: data lake (1 TB) | $23 (S3) | $18 (ADLS Gen2 hot) | Azure cheaper |
| Query engine: serverless queries (1 TB) | $5 (Athena) | $5 (Synapse serverless) | Same |
| TOTAL/MONTH | $1,538 | $1,493 | Azure ~3% cheaper |

Winner: Azure by $45/month (3%)

Total Cost of Ownership Analysis

Combined Monthly Costs

| Project | AWS Total | Azure Total | Difference |
|---|---|---|---|
| RAG Platform | $1,504 | $1,519 | AWS -$15 (-1%) |
| MLOps Pipeline | $158 | $154 | Azure -$4 (-2.5%) |
| Data Lakehouse | $1,538 | $1,493 | Azure -$45 (-3%) |
| TOTAL | $3,200/month | $3,166/month | Azure -$34/month (-1%) |

Annual Projection

  • AWS: $3,200 × 12 = $38,400/year
  • Azure: $3,166 × 12 = $37,992/year
  • Savings with Azure: $408/year (1%)

Cost Sensitivity Analysis

Scenario 1: High LLM Usage (10M tokens/month)

  • AWS: +$180 (Claude cheaper)
  • Azure: +$900 (GPT-4 more expensive)
  • AWS wins by $720/month

Scenario 2: Heavy ML Training (20 hrs/week GPU)

  • AWS: +$785
  • Azure: +$765
  • Azure wins by $20/month

Scenario 3: Large Data Lake (10 TB storage)

  • AWS: +$230
  • Azure: +$180
  • Azure wins by $50/month

Conclusion: AWS is better for AI-heavy workloads due to cheaper LLM pricing. Azure is better for data-heavy workloads due to cheaper storage.

Multi-Cloud Integration Patterns

Unified RBAC Strategy

Both platforms support similar pod-level identity:

AWS (IRSA):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/AppRole

Azure (Workload Identity):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
  annotations:
    azure.workload.identity/client-id: CLIENT_ID

Multi-Cloud Disaster Recovery

Deploy identical workloads on both platforms for DR:

# Primary: AWS
# Standby: Azure
# Failover time: < 5 minutes with DNS switch

# Shared components:
# - OpenShift APIs (same)
# - Application code (same)
# - Milvus deployment (same)

# Platform-specific:
# - Cloud credentials
# - Storage endpoints

Migration Strategies

AWS to Azure Migration

Phase 1: Data Migration

# Use AzCopy for S3 → Blob migration
azcopy copy \
  "https://s3.amazonaws.com/bucket/*" \
  "https://storageaccount.blob.core.windows.net/container" \
  --recursive
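
One detail that's easy to miss: AzCopy reads the S3 side's credentials from environment variables, and the Blob destination typically needs a SAS token appended to the URL:

export AWS_ACCESS_KEY_ID="<access-key-id>"
export AWS_SECRET_ACCESS_KEY="<secret-access-key>"
# Destination form: https://storageaccount.blob.core.windows.net/container?<sas-token>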

Phase 2: Metadata Migration

  • Export Glue Catalog to JSON
  • Import to Azure Purview via API

Phase 3: Application Migration

  • Update environment variables
  • Switch cloud credentials
  • Deploy to ARO

Azure to AWS Migration

Similar process in reverse:

# Use AWS DataSync for Blob → S3
aws datasync create-task \
  --source-location-arn arn:aws:datasync:...:location/azure-blob \
  --destination-location-arn arn:aws:datasync:...:location/s3-bucket

Resource Cleanup

AWS Complete Cleanup

#!/bin/bash
# Complete AWS resource cleanup

# RAG Platform
rosa delete cluster --cluster=rag-platform-aws --yes
aws s3 rm s3://rag-documents-${ACCOUNT_ID} --recursive
aws s3 rb s3://rag-documents-${ACCOUNT_ID}
aws glue delete-crawler --name rag-document-crawler
aws glue delete-database --name rag_documents_db
aws ec2 delete-vpc-endpoints --vpc-endpoint-ids $BEDROCK_VPC_ENDPOINT
aws iam delete-role --role-name rosa-bedrock-access
aws iam delete-policy --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/BedrockInvokePolicy

# MLOps Platform
aws s3 rm s3://mlops-artifacts-${ACCOUNT_ID} --recursive
aws s3 rm s3://mlops-datasets-${ACCOUNT_ID} --recursive
aws s3 rb s3://mlops-artifacts-${ACCOUNT_ID}
aws s3 rb s3://mlops-datasets-${ACCOUNT_ID}
aws ecr delete-repository --repository-name mlops/training --force
aws iam delete-role --role-name ACKSageMakerControllerRole

# Data Lakehouse
aws s3 rm s3://lakehouse-data-${ACCOUNT_ID} --recursive
aws s3 rb s3://lakehouse-data-${ACCOUNT_ID}
for db in bronze silver gold; do
  aws glue delete-database --name $db
done
aws iam delete-role --role-name SparkGlueCatalogRole

echo "AWS cleanup complete"

Azure Complete Cleanup

#!/bin/bash
# Complete Azure resource cleanup

# Delete all resources in resource group
az group delete --name rag-platform-rg --yes --no-wait

# This deletes:
# - ARO cluster
# - Azure OpenAI service
# - Storage accounts
# - Data Factory
# - Azure ML workspace
# - All networking components

echo "Azure cleanup complete (deleting in background)"

Troubleshooting

Common Multi-Cloud Issues

Issue: Cross-Cloud Latency

Symptoms: Slow API responses when accessing cloud services

AWS Solution:

# Verify VPC endpoint is in correct AZ
aws ec2 describe-vpc-endpoints --vpc-endpoint-ids $ENDPOINT_ID

# Check PrivateLink latency
oc run test --rm -it --image=curlimages/curl -- \
  curl -w "@curl-format.txt" https://bedrock-runtime.us-east-1.amazonaws.com

Azure Solution:

# Verify Private Link in same region as ARO
az network private-endpoint show --name openai-private-endpoint

# Test latency
oc run test --rm -it --image=curlimages/curl -- \
  curl -w "@curl-format.txt" https://${OPENAI_NAME}.openai.azure.com

Issue: Authentication Failures

AWS IRSA Troubleshooting:

# Verify OIDC provider
rosa describe cluster -c $CLUSTER_NAME -o json | jq .aws.sts.oidc_endpoint_url

# Test token
kubectl create token bedrock-sa -n rag-application

# Verify IAM trust policy
aws iam get-role --role-name rosa-bedrock-access

Azure Workload Identity Troubleshooting:

# Verify federated credential
az identity federated-credential show \
  --name rag-app-federated \
  --identity-name rag-app-identity \
  --resource-group $RESOURCE_GROUP

# Test managed identity
az account get-access-token --resource https://cognitiveservices.azure.com

Conclusion

Platform Selection Recommendations

Choose AWS if you:

  • Prioritize AI/ML model diversity (Bedrock marketplace)
  • Have variable, unpredictable workloads (serverless pricing)
  • Value open-source ecosystem compatibility
  • Need global multi-region deployments
  • Want lower LLM API costs

Choose Azure if you:

  • Have existing Microsoft enterprise agreements
  • Need Windows container support
  • Require hybrid cloud with on-premises
  • Have Microsoft 365 / Teams integration requirements
  • Want slightly lower infrastructure costs

Choose Multi-Cloud if you:

  • Need disaster recovery across providers
  • Want to avoid vendor lock-in
  • Have regulatory requirements for redundancy
  • Can manage operational complexity

Final Cost Summary

For the three projects combined:

  • AWS Total: $3,200/month ($38,400/year)
  • Azure Total: $3,166/month ($37,992/year)
  • Difference: 1% ($408/year favoring Azure)

Verdict: Costs are effectively equivalent. Choose based on ecosystem fit, not cost.

Key Technical Takeaways

  1. OpenShift provides platform portability - same APIs on both clouds
  2. Cloud-specific services (Bedrock, Azure OpenAI) require different code
  3. Storage abstractions (S3 vs Blob) are the main migration challenge
  4. IAM patterns (IRSA vs Workload Identity) are conceptually similar

Next Steps

To Expand This Implementation:

  1. Add GitOps with ArgoCD for both platforms
  2. Implement cross-cloud disaster recovery
  3. Add comprehensive monitoring with Grafana
  4. Automate deployments with Terraform/Bicep
  5. Implement cost governance and FinOps

Thank you for reading this comprehensive multi-cloud implementation guide!
