In this tutorial, I'll show you how to build a backend application using Azure OpenAI's large language models (LLMs) and introduce what's new with DeepSeek's LLM. It's simpler than it might sound!
Important Notes:
The main difference between OpenAI and DeepSeek lies not in the setup but in the performance, so feel free to substitute "DeepSeek" wherever you see "OpenAI" in this blog entry.
Table of Contents
- Introduction
- Platform Overview
- Cloud Platform Decision Matrix
- Prerequisites
- Project 1: Enterprise-Grade RAG Platform
- Project 2: Hybrid MLOps Pipeline
- Project 3: Unified Data Fabric (Data Lakehouse)
- Multi-Cloud Integration Patterns
- Total Cost of Ownership Analysis
- Migration Strategies
- Resource Cleanup
- Troubleshooting
Introduction
Modern enterprises face a critical decision when building cloud-native AI and data platforms: AWS or Azure? This comprehensive guide demonstrates how to build three production-grade platforms on both cloud providers, providing side-by-side comparisons to help you make informed decisions.
What You'll Learn
This guide shows you how to implement identical architectures on both AWS and Azure:
Project 1: Enterprise RAG Platform
- AWS: Amazon Bedrock + AWS Glue + Milvus on ROSA
- Azure: Azure OpenAI + Azure Data Factory + Milvus on ARO
- Privacy-first Retrieval-Augmented Generation
- Vector database integration
- Secure private connectivity
Project 2: Hybrid MLOps Pipeline
- AWS: SageMaker + OpenShift Pipelines + KServe on ROSA
- Azure: Azure ML + Azure DevOps + KServe on ARO
- Cost-optimized GPU training
- Kubernetes-native serving
- End-to-end automation
Project 3: Unified Data Fabric
- AWS: Apache Spark + AWS Glue Catalog + S3 + Iceberg
- Azure: Apache Spark + Azure Purview + ADLS Gen2 + Delta Lake
- Stateless compute architecture
- Medallion data organization
- ACID transactions
Why This Comparison Matters
Choosing the right cloud platform impacts:
- Total Cost: 20-40% difference in monthly spending
- Developer Productivity: Ecosystem integration and tooling
- Vendor Lock-in: Portability and migration flexibility
- Enterprise Integration: Existing infrastructure and contracts
Platform Overview
Unified Multi-Cloud Architecture
Both implementations follow the same architectural patterns while leveraging platform-specific managed services:
┌─────────────────────────────────────────────────────────────────────┐
│ Enterprise Organization │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Red Hat OpenShift (ROSA on AWS / ARO on Azure) │ │
│ │ - Unified Control Plane │ │
│ │ - Application Orchestration │ │
│ │ - Developer Platform │ │
│ └───────────────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ │ │ │ │
│ ┌───────────▼─────┐ ┌──────▼──────┐ ┌─────▼──────────┐ │
│ │ RAG Project │ │MLOps Project│ │ Data Lakehouse │ │
│ │ │ │ │ │ │ │
│ │ AWS: │ │ AWS: │ │ AWS: │ │
│ │ - Bedrock │ │ - SageMaker │ │ - Glue Catalog │ │
│ │ - Glue ETL │ │ - ACK │ │ - S3 + Iceberg │ │
│ │ │ │ │ │ │ │
│ │ Azure: │ │ Azure: │ │ Azure: │ │
│ │ - OpenAI │ │ - Azure ML │ │ - Purview │ │
│ │ - Data Factory │ │ - ASO │ │ - ADLS + Delta │ │
│ └─────────────────┘ └─────────────┘ └────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Cloud Services Layer │ │
│ │ AWS: IAM + S3 + PrivateLink + CloudWatch │ │
│ │ Azure: AAD + Blob + Private Link + Monitor │ │
│ └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Technology Stack: AWS vs Azure
| Component | AWS Solution | Azure Solution | OpenShift Platform |
|---|---|---|---|
| Kubernetes | ROSA (Red Hat OpenShift on AWS) | ARO (Azure Red Hat OpenShift) | Both use Red Hat OpenShift |
| LLM Platform | Amazon Bedrock (Claude 3.5) | Azure OpenAI Service (GPT-4) | Same API patterns |
| ML Training | Amazon SageMaker | Azure Machine Learning | Both burst from OpenShift |
| Data Catalog | AWS Glue Data Catalog | Azure Purview / Unity Catalog | Unified metadata layer |
| Object Storage | Amazon S3 | Azure Data Lake Storage Gen2 | S3-compatible APIs |
| Table Format | Apache Iceberg | Delta Lake | Open source options |
| Vector DB | Milvus (self-hosted) | Milvus / Cosmos DB | Same deployment |
| ETL Service | AWS Glue (serverless) | Azure Data Factory (serverless) | Similar orchestration |
| CI/CD | OpenShift Pipelines (Tekton) | Azure DevOps / Tekton | Kubernetes-native |
| K8s Integration | AWS Controllers (ACK) | Azure Service Operator (ASO) | Custom resources |
| Private Network | AWS PrivateLink | Azure Private Link | VPC/VNet integration |
| Authentication | IRSA (IAM for Service Accounts) | Workload Identity | Pod-level identity |
Cloud Platform Decision Matrix
When to Choose AWS
Best For:
- AI/ML Innovation: Amazon Bedrock offers broader model selection (Claude, Llama 2, Stable Diffusion)
- Serverless-First: AWS Glue, Lambda, and Bedrock have no minimum fees
- Startup/Scale-up: Pay-as-you-go pricing favors variable workloads
- Data Engineering: S3 + Glue + Athena is industry standard
- Multi-Region: Better global infrastructure coverage
AWS Advantages:
- Superior AI model marketplace (Anthropic, Cohere, AI21, Meta)
- True serverless data catalog (Glue) with no base costs
- More mature spot instance ecosystem for cost savings
- Better S3 ecosystem and tooling integration
- Stronger open-source community adoption
When to Choose Azure
Best For:
- Microsoft Ecosystem: Tight integration with Office 365, Teams, Power Platform
- Enterprise Windows: Native Windows container support
- Hybrid Cloud: Azure Arc and on-premises integration
- Enterprise Agreements: Existing Microsoft licensing discounts
- Regulated Industries: Better compliance certifications in some regions
Azure Advantages:
- Seamless Microsoft 365 and Active Directory integration
- Superior Windows and .NET container support
- Better hybrid cloud story with Azure Arc
- Integrated Azure Synapse for unified analytics
- Potentially lower costs with existing EA agreements
Decision Criteria Scorecard
| Criteria | AWS Score | Azure Score | Weight | Notes |
|---|---|---|---|---|
| AI Model Selection | 9/10 | 7/10 | High | AWS Bedrock has more models |
| ML Training Cost | 8/10 | 8/10 | High | Equivalent spot pricing |
| Data Lake Maturity | 10/10 | 8/10 | High | S3 is industry standard |
| Serverless Pricing | 9/10 | 7/10 | Medium | AWS Glue has no minimums |
| Enterprise Integration | 7/10 | 10/10 | High | Azure wins for Microsoft shops |
| Hybrid Cloud | 7/10 | 9/10 | Medium | Azure Arc is superior |
| Developer Ecosystem | 9/10 | 7/10 | Medium | Larger open-source community |
| Compliance Certifications | 9/10 | 9/10 | High | Equivalent for most use cases |
| Global Infrastructure | 10/10 | 8/10 | Low | AWS has more regions |
| Pricing Transparency | 8/10 | 7/10 | Medium | AWS pricing is clearer |
Total Weighted Score: AWS: 8.5/10 | Azure: 8.1/10
Verdict: Choose based on your organization's existing ecosystem. Both platforms are capable; the difference is in integration, not capability.
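If your priorities differ, you can re-weight the scorecard yourself. Here's a small Python sketch; the numeric weights (High=3, Medium=2, Low=1) are my own assumption, though they happen to reproduce the totals above:
# Weighted scorecard helper: adjust scores and weights for your own organization.
# The High=3 / Medium=2 / Low=1 mapping is an assumed convention for illustration.
WEIGHTS = {"High": 3, "Medium": 2, "Low": 1}

criteria = [
    # (name, aws_score, azure_score, weight)
    ("AI Model Selection",          9,  7, "High"),
    ("ML Training Cost",            8,  8, "High"),
    ("Data Lake Maturity",         10,  8, "High"),
    ("Serverless Pricing",          9,  7, "Medium"),
    ("Enterprise Integration",      7, 10, "High"),
    ("Hybrid Cloud",                7,  9, "Medium"),
    ("Developer Ecosystem",         9,  7, "Medium"),
    ("Compliance Certifications",   9,  9, "High"),
    ("Global Infrastructure",      10,  8, "Low"),
    ("Pricing Transparency",        8,  7, "Medium"),
]

def weighted_total(col: int) -> float:
    """col=1 for the AWS column, col=2 for the Azure column."""
    total_weight = sum(WEIGHTS[row[3]] for row in criteria)
    return sum(row[col] * WEIGHTS[row[3]] for row in criteria) / total_weight

print(f"AWS:   {weighted_total(1):.1f}/10")
print(f"Azure: {weighted_total(2):.1f}/10")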
Prerequisites
Common Prerequisites (Both Platforms)
Required Accounts:
- Cloud platform account with administrative access
- Red Hat Account with OpenShift subscription
- Credit card for cloud charges
Required Tools (install on your workstation):
# Common tools for both platforms
# OpenShift CLI (oc)
wget https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz
tar -xvf openshift-client-linux.tar.gz
sudo mv oc kubectl /usr/local/bin/
oc version
# Helm (v3)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version
# Tekton CLI
curl -LO https://github.com/tektoncd/cli/releases/download/v0.33.0/tkn_0.33.0_Linux_x86_64.tar.gz
tar xvzf tkn_0.33.0_Linux_x86_64.tar.gz
sudo mv tkn /usr/local/bin/
tkn version
# Python 3.11+
python3 --version
# Container tools (Docker or Podman)
podman --version
AWS-Specific Prerequisites
# AWS CLI (v2)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws --version
# ROSA CLI
wget https://mirror.openshift.com/pub/openshift-v4/clients/rosa/latest/rosa-linux.tar.gz
tar -xvf rosa-linux.tar.gz
sudo mv rosa /usr/local/bin/rosa
rosa version
# Configure AWS
aws configure
aws sts get-caller-identity
# Initialize ROSA
rosa login
rosa verify quota
rosa verify permissions
rosa init
Azure-Specific Prerequisites
# Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az --version
# ARO extension
az extension add --name aro --index https://az.aroapp.io/stable
# Azure CLI login
az login
az account show
# Register required providers
az provider register --namespace Microsoft.RedHatOpenShift --wait
az provider register --namespace Microsoft.Compute --wait
az provider register --namespace Microsoft.Storage --wait
az provider register --namespace Microsoft.Network --wait
Service Quotas Verification
AWS:
# EC2 vCPU quota
aws service-quotas get-service-quota \
--service-code ec2 \
--quota-code L-1216C47A \
--region us-east-1
# SageMaker training instances
aws service-quotas get-service-quota \
--service-code sagemaker \
--quota-code L-2E8D9C5E \
--region us-east-1
Azure:
# Check compute quota
az vm list-usage --location eastus --output table
# Check ML compute quota
az ml compute list-usage --location eastus
Project 1: Enterprise-Grade RAG Platform
RAG Platform Overview
This project implements a privacy-first Retrieval-Augmented Generation (RAG) system. Both AWS and Azure implementations achieve the same functionality but use platform-specific managed services.
Architecture Comparison
AWS Architecture:
ROSA → AWS PrivateLink → Amazon Bedrock (Claude 3.5)
↓
Milvus Vector DB (on ROSA)
↓
AWS Glue ETL → S3
Azure Architecture:
ARO → Azure Private Link → Azure OpenAI (GPT-4)
↓
Milvus Vector DB (on ARO)
↓
Azure Data Factory → Blob Storage
Side-by-Side Service Mapping
| Function | AWS Service | Azure Service | Implementation Difference |
|---|---|---|---|
| LLM API | Amazon Bedrock | Azure OpenAI Service | Different model families |
| Private Network | AWS PrivateLink | Azure Private Link | Similar configuration |
| ETL Pipeline | AWS Glue (Serverless) | Azure Data Factory | Different pricing models |
| Metadata | AWS Glue Data Catalog | Azure Purview | Different scopes |
| Storage | Amazon S3 | Azure Blob Storage / ADLS Gen2 | S3 API vs Blob API |
| Vector DB | Milvus on ROSA | Milvus on ARO / Cosmos DB | Self-hosted vs managed option |
| Auth | IRSA (IAM Roles) | Workload Identity | Similar pod-level identity |
| Embedding | Titan Embeddings | OpenAI Embeddings | Different dimensions |
AWS Implementation (RAG)
AWS Phase 1: ROSA Cluster Setup
# Set environment variables
export CLUSTER_NAME="rag-platform-aws"
export AWS_REGION="us-east-1"
export MACHINE_TYPE="m5.2xlarge"
export COMPUTE_NODES=3
# Create ROSA cluster (takes ~40 minutes)
rosa create cluster \
--cluster-name $CLUSTER_NAME \
--region $AWS_REGION \
--multi-az \
--compute-machine-type $MACHINE_TYPE \
--compute-nodes $COMPUTE_NODES \
--machine-cidr 10.0.0.0/16 \
--service-cidr 172.30.0.0/16 \
--pod-cidr 10.128.0.0/14 \
--host-prefix 23 \
--yes
# Monitor installation
rosa logs install --cluster=$CLUSTER_NAME --watch
# Create admin and connect
rosa create admin --cluster=$CLUSTER_NAME
oc login <api-url> --username cluster-admin --password <password>
# Create namespaces
oc new-project redhat-ods-applications
oc new-project rag-application
oc new-project milvus
AWS Phase 2: Amazon Bedrock via PrivateLink
# Get ROSA VPC details
export ROSA_VPC_ID=$(aws ec2 describe-vpcs \
--filters "Name=tag:Name,Values=*${CLUSTER_NAME}*" \
--query 'Vpcs[0].VpcId' \
--output text \
--region $AWS_REGION)
export PRIVATE_SUBNET_IDS=$(aws ec2 describe-subnets \
--filters "Name=vpc-id,Values=$ROSA_VPC_ID" "Name=tag:Name,Values=*private*" \
--query 'Subnets[*].SubnetId' \
--output text \
--region $AWS_REGION)
# Create VPC Endpoint Security Group
export VPC_ENDPOINT_SG=$(aws ec2 create-security-group \
--group-name bedrock-vpc-endpoint-sg \
--description "Security group for Bedrock VPC endpoint" \
--vpc-id $ROSA_VPC_ID \
--region $AWS_REGION \
--output text \
--query 'GroupId')
# Allow HTTPS from ROSA nodes
aws ec2 authorize-security-group-ingress \
--group-id $VPC_ENDPOINT_SG \
--protocol tcp \
--port 443 \
--cidr 10.0.0.0/16 \
--region $AWS_REGION
# Create Bedrock VPC Endpoint
export BEDROCK_VPC_ENDPOINT=$(aws ec2 create-vpc-endpoint \
--vpc-id $ROSA_VPC_ID \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.${AWS_REGION}.bedrock-runtime \
--subnet-ids $PRIVATE_SUBNET_IDS \
--security-group-ids $VPC_ENDPOINT_SG \
--private-dns-enabled \
--region $AWS_REGION \
--output text \
--query 'VpcEndpoint.VpcEndpointId')
# Wait for availability
aws ec2 wait vpc-endpoint-available \
--vpc-endpoint-ids $BEDROCK_VPC_ENDPOINT \
--region $AWS_REGION
# Create IAM role for Bedrock access (IRSA pattern)
export OIDC_PROVIDER=$(rosa describe cluster -c $CLUSTER_NAME -o json | jq -r .aws.sts.oidc_endpoint_url | sed 's|https://||')
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
cat > bedrock-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel",
"bedrock:InvokeModelWithResponseStream"
],
"Resource": "arn:aws:bedrock:${AWS_REGION}::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0"
}
]
}
EOF
aws iam create-policy \
--policy-name BedrockInvokePolicy \
--policy-document file://bedrock-policy.json
cat > trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"${OIDC_PROVIDER}:sub": "system:serviceaccount:rag-application:bedrock-sa"
}
}
}
]
}
EOF
export BEDROCK_ROLE_ARN=$(aws iam create-role \
--role-name rosa-bedrock-access \
--assume-role-policy-document file://trust-policy.json \
--query 'Role.Arn' \
--output text)
aws iam attach-role-policy \
--role-name rosa-bedrock-access \
--policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/BedrockInvokePolicy
# Create Kubernetes service account
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
name: bedrock-sa
namespace: rag-application
annotations:
eks.amazonaws.com/role-arn: $BEDROCK_ROLE_ARN
EOF
AWS Phase 3: AWS Glue Data Pipeline
# Create S3 bucket
export BUCKET_NAME="rag-documents-${ACCOUNT_ID}"
aws s3 mb s3://$BUCKET_NAME --region $AWS_REGION
# Enable versioning
aws s3api put-bucket-versioning \
--bucket $BUCKET_NAME \
--versioning-configuration Status=Enabled \
--region $AWS_REGION
# Create folder structure
aws s3api put-object --bucket $BUCKET_NAME --key raw-documents/
aws s3api put-object --bucket $BUCKET_NAME --key processed-documents/
aws s3api put-object --bucket $BUCKET_NAME --key embeddings/
# Create Glue IAM role
cat > glue-trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {"Service": "glue.amazonaws.com"},
"Action": "sts:AssumeRole"
}
]
}
EOF
aws iam create-role \
--role-name AWSGlueServiceRole-RAG \
--assume-role-policy-document file://glue-trust-policy.json
aws iam attach-role-policy \
--role-name AWSGlueServiceRole-RAG \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
# Create S3 access policy
cat > glue-s3-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
"Resource": "arn:aws:s3:::${BUCKET_NAME}/*"
},
{
"Effect": "Allow",
"Action": ["s3:ListBucket"],
"Resource": "arn:aws:s3:::${BUCKET_NAME}"
}
]
}
EOF
aws iam put-role-policy \
--role-name AWSGlueServiceRole-RAG \
--policy-name S3Access \
--policy-document file://glue-s3-policy.json
# Create Glue database
aws glue create-database \
--database-input '{
"Name": "rag_documents_db",
"Description": "RAG document metadata"
}' \
--region $AWS_REGION
# Create Glue crawler
aws glue create-crawler \
--name rag-document-crawler \
--role arn:aws:iam::${ACCOUNT_ID}:role/AWSGlueServiceRole-RAG \
--database-name rag_documents_db \
--targets '{
"S3Targets": [{"Path": "s3://'$BUCKET_NAME'/raw-documents/"}]
}' \
--region $AWS_REGION
AWS Phase 4: Milvus Vector Database
# Install Milvus using Helm
helm repo add milvus https://milvus-io.github.io/milvus-helm/
helm repo update
helm install milvus-operator milvus/milvus-operator \
--namespace milvus \
--create-namespace
# Create PVCs
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: milvus-etcd-pvc
namespace: milvus
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 10Gi
storageClassName: gp3-csi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: milvus-minio-pvc
namespace: milvus
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 50Gi
storageClassName: gp3-csi
EOF
# Deploy Milvus
cat > milvus-values.yaml <<EOF
cluster:
enabled: true
service:
type: ClusterIP
port: 19530
standalone:
replicas: 1
resources:
limits:
cpu: "4"
memory: 8Gi
requests:
cpu: "2"
memory: 4Gi
etcd:
persistence:
enabled: true
existingClaim: milvus-etcd-pvc
minio:
persistence:
enabled: true
existingClaim: milvus-minio-pvc
EOF
helm install milvus milvus/milvus \
--namespace milvus \
--values milvus-values.yaml \
--wait
# Get Milvus endpoint
export MILVUS_HOST=$(oc get svc milvus -n milvus -o jsonpath='{.spec.clusterIP}')
export MILVUS_PORT=19530
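The application in the next phase queries a rag_documents collection, but this guide doesn't show the ingestion step. Here's a minimal, hypothetical sketch of creating that collection and loading Titan embeddings with pymilvus and boto3 (field names, chunk texts, and index parameters are illustrative assumptions; run it somewhere that can reach both Milvus and Bedrock):
# Hypothetical ingestion helper: creates the "rag_documents" collection the API expects
# and inserts Titan embeddings for a few text chunks. Schema and index params only need
# to match what the query path below uses (1024-dim vectors, L2 metric, "text" field).
import json, os
import boto3
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType, utility

MILVUS_HOST = os.getenv("MILVUS_HOST", "milvus.milvus.svc.cluster.local")
bedrock = boto3.client("bedrock-runtime", region_name=os.getenv("AWS_REGION", "us-east-1"))

def embed(text: str) -> list[float]:
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text, "dimensions": 1024}),
    )
    return json.loads(resp["body"].read())["embedding"]

connections.connect(host=MILVUS_HOST, port=19530)

if not utility.has_collection("rag_documents"):
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema("text", DataType.VARCHAR, max_length=4096),
        FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=1024),
    ])
    coll = Collection("rag_documents", schema)
    coll.create_index("embedding", {"index_type": "IVF_FLAT",
                                    "metric_type": "L2",
                                    "params": {"nlist": 128}})
else:
    coll = Collection("rag_documents")

chunks = ["Example document chunk one.", "Example document chunk two."]
coll.insert([chunks, [embed(c) for c in chunks]])
coll.load()  # make the collection searchable before the API queries it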
AWS Phase 5: RAG Application Deployment
# Create application code
mkdir -p rag-app-aws/src
cat > rag-app-aws/requirements.txt <<EOF
fastapi==0.104.1
uvicorn[standard]==0.24.0
pydantic==2.5.0
pymilvus==2.3.3
boto3==1.29.7
python-dotenv==1.0.0
EOF
# Create FastAPI application (abbreviated for space)
cat > rag-app-aws/src/main.py <<'PYTHON'
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import os, json, boto3
from pymilvus import connections, Collection
app = FastAPI(title="Enterprise RAG API - AWS")
MILVUS_HOST = os.getenv("MILVUS_HOST")
AWS_REGION = os.getenv("AWS_REGION", "us-east-1")
BEDROCK_MODEL = "anthropic.claude-3-5-sonnet-20241022-v2:0"
bedrock = boto3.client('bedrock-runtime', region_name=AWS_REGION)
@app.on_event("startup")
async def startup():
connections.connect(host=MILVUS_HOST, port=19530)
class QueryRequest(BaseModel):
query: str
top_k: int = 5
max_tokens: int = 1000
@app.post("/query")
async def query_rag(req: QueryRequest):
# Generate embedding with Bedrock Titan
embed_resp = bedrock.invoke_model(
modelId="amazon.titan-embed-text-v2:0",
body=json.dumps({"inputText": req.query, "dimensions": 1024})
)
embedding = json.loads(embed_resp['body'].read())['embedding']
# Search Milvus
coll = Collection("rag_documents")
    results = coll.search([embedding], "embedding", {"metric_type": "L2"}, limit=req.top_k, output_fields=["text"])  # output_fields so hits include the chunk text
# Build context
context = "\n\n".join([hit.entity.get("text") for hit in results[0]])
# Call Bedrock Claude
prompt = f"Context:\n{context}\n\nQuestion: {req.query}\n\nAnswer:"
response = bedrock.invoke_model(
modelId=BEDROCK_MODEL,
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": req.max_tokens,
"messages": [{"role": "user", "content": prompt}]
})
)
answer = json.loads(response['body'].read())['content'][0]['text']
return {"answer": answer, "sources": [{"chunk": hit.entity.get("text")} for hit in results[0]]}
@app.get("/health")
async def health():
return {"status": "healthy", "platform": "AWS", "model": "Claude 3.5 Sonnet"}
PYTHON
# Create Dockerfile
cat > rag-app-aws/Dockerfile <<EOF
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ ./src/
EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
EOF
# Build and deploy
cd rag-app-aws
podman build -t rag-app-aws:v1.0 .
oc create imagestream rag-app-aws -n rag-application
podman tag rag-app-aws:v1.0 image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-aws:v1.0
podman push image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-aws:v1.0 --tls-verify=false
cd ..
# Deploy to OpenShift
cat <<EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: rag-app-aws
namespace: rag-application
spec:
replicas: 2
selector:
matchLabels:
app: rag-app-aws
template:
metadata:
labels:
app: rag-app-aws
spec:
serviceAccountName: bedrock-sa
containers:
- name: app
image: image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-aws:v1.0
ports:
- containerPort: 8000
env:
- name: MILVUS_HOST
value: "$MILVUS_HOST"
- name: AWS_REGION
value: "$AWS_REGION"
---
apiVersion: v1
kind: Service
metadata:
name: rag-app-aws
namespace: rag-application
spec:
selector:
app: rag-app-aws
ports:
- port: 80
targetPort: 8000
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
name: rag-app-aws
namespace: rag-application
spec:
to:
kind: Service
name: rag-app-aws
tls:
termination: edge
EOF
# Get URL and test
export RAG_URL_AWS=$(oc get route rag-app-aws -n rag-application -o jsonpath='{.spec.host}')
curl https://$RAG_URL_AWS/health
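Beyond the health check, you can smoke-test the /query endpoint from your workstation. A quick sketch using the requests library (the example question is obviously a placeholder):
# Hypothetical smoke test for the RAG endpoint deployed above.
import os
import requests

rag_url = os.environ["RAG_URL_AWS"]  # exported in the previous step
resp = requests.post(
    f"https://{rag_url}/query",
    json={"query": "What does our onboarding policy say about laptops?", "top_k": 3},
    timeout=60,
)
resp.raise_for_status()
body = resp.json()
print(body["answer"])
print(f"{len(body['sources'])} source chunks returned")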
Azure Implementation (RAG)
Azure Phase 1: ARO Cluster Setup
# Set environment variables
export CLUSTER_NAME="rag-platform-azure"
export LOCATION="eastus"
export RESOURCE_GROUP="rag-platform-rg"
# Create resource group
az group create \
--name $RESOURCE_GROUP \
--location $LOCATION
# Create virtual network
az network vnet create \
--resource-group $RESOURCE_GROUP \
--name aro-vnet \
--address-prefixes 10.0.0.0/22
# Create master subnet
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--vnet-name aro-vnet \
--name master-subnet \
--address-prefixes 10.0.0.0/23 \
--service-endpoints Microsoft.ContainerRegistry
# Create worker subnet
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--vnet-name aro-vnet \
--name worker-subnet \
--address-prefixes 10.0.2.0/23 \
--service-endpoints Microsoft.ContainerRegistry
# Disable subnet private endpoint policies
az network vnet subnet update \
--name master-subnet \
--resource-group $RESOURCE_GROUP \
--vnet-name aro-vnet \
--disable-private-link-service-network-policies true
# Create ARO cluster (takes ~35 minutes)
az aro create \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--vnet aro-vnet \
--master-subnet master-subnet \
--worker-subnet worker-subnet \
--worker-count 3 \
--worker-vm-size Standard_D8s_v3
# Get credentials
export ARO_URL=$(az aro show \
--name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP \
--query consoleUrl -o tsv)
export ARO_PASSWORD=$(az aro list-credentials \
--name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP \
--query kubeadminPassword -o tsv)
# Login
oc login $ARO_URL -u kubeadmin -p $ARO_PASSWORD
# Create namespaces
oc new-project rag-application
oc new-project milvus
Azure Phase 2: Azure OpenAI via Private Link
# Create Azure OpenAI resource
export OPENAI_NAME="rag-openai-${RANDOM}"
az cognitiveservices account create \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--kind OpenAI \
--sku S0 \
--location $LOCATION \
--custom-domain $OPENAI_NAME \
--public-network-access Disabled
# Deploy GPT-4 model
az cognitiveservices account deployment create \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--deployment-name gpt-4 \
--model-name gpt-4 \
--model-version "0613" \
--model-format OpenAI \
--sku-capacity 10 \
--sku-name "Standard"
# Deploy text-embedding model
az cognitiveservices account deployment create \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--deployment-name text-embedding-ada-002 \
--model-name text-embedding-ada-002 \
--model-version "2" \
--model-format OpenAI \
--sku-capacity 10 \
--sku-name "Standard"
# Create Private Endpoint
export VNET_ID=$(az network vnet show \
--resource-group $RESOURCE_GROUP \
--name aro-vnet \
--query id -o tsv)
export SUBNET_ID=$(az network vnet subnet show \
--resource-group $RESOURCE_GROUP \
--vnet-name aro-vnet \
--name worker-subnet \
--query id -o tsv)
export OPENAI_ID=$(az cognitiveservices account show \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--query id -o tsv)
az network private-endpoint create \
--name openai-private-endpoint \
--resource-group $RESOURCE_GROUP \
--vnet-name aro-vnet \
--subnet worker-subnet \
--private-connection-resource-id $OPENAI_ID \
--group-id account \
--connection-name openai-connection
# Create Private DNS Zone
az network private-dns zone create \
--resource-group $RESOURCE_GROUP \
--name privatelink.openai.azure.com
az network private-dns link vnet create \
--resource-group $RESOURCE_GROUP \
--zone-name privatelink.openai.azure.com \
--name openai-dns-link \
--virtual-network aro-vnet \
--registration-enabled false
# Create DNS record
export ENDPOINT_IP=$(az network private-endpoint show \
--name openai-private-endpoint \
--resource-group $RESOURCE_GROUP \
--query 'customDnsConfigs[0].ipAddresses[0]' -o tsv)
az network private-dns record-set a create \
--name $OPENAI_NAME \
--zone-name privatelink.openai.azure.com \
--resource-group $RESOURCE_GROUP
az network private-dns record-set a add-record \
--record-set-name $OPENAI_NAME \
--zone-name privatelink.openai.azure.com \
--resource-group $RESOURCE_GROUP \
--ipv4-address $ENDPOINT_IP
# Configure Workload Identity
export ARO_OIDC_ISSUER=$(az aro show \
--name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP \
--query 'serviceIdentity.url' -o tsv)
# Create managed identity
az identity create \
--name rag-app-identity \
--resource-group $RESOURCE_GROUP
export IDENTITY_CLIENT_ID=$(az identity show \
--name rag-app-identity \
--resource-group $RESOURCE_GROUP \
--query clientId -o tsv)
export IDENTITY_PRINCIPAL_ID=$(az identity show \
--name rag-app-identity \
--resource-group $RESOURCE_GROUP \
--query principalId -o tsv)
# Grant OpenAI access
az role assignment create \
--assignee $IDENTITY_PRINCIPAL_ID \
--role "Cognitive Services OpenAI User" \
--scope $OPENAI_ID
# Create federated credential
az identity federated-credential create \
--name rag-app-federated \
--identity-name rag-app-identity \
--resource-group $RESOURCE_GROUP \
--issuer $ARO_OIDC_ISSUER \
--subject "system:serviceaccount:rag-application:openai-sa"
# Create Kubernetes service account
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
name: openai-sa
namespace: rag-application
annotations:
azure.workload.identity/client-id: $IDENTITY_CLIENT_ID
EOF
# Get OpenAI endpoint and key
export OPENAI_ENDPOINT=$(az cognitiveservices account show \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--query properties.endpoint -o tsv)
export OPENAI_KEY=$(az cognitiveservices account keys list \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--query key1 -o tsv)
# Create secret
oc create secret generic openai-credentials \
--from-literal=endpoint=$OPENAI_ENDPOINT \
--from-literal=key=$OPENAI_KEY \
-n rag-application
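Note that this phase configures Workload Identity (the openai-sa service account and federated credential), yet the application in Phase 5 authenticates with the API key from this secret. If you want to go keyless, here's a hedged sketch of swapping the key for an Entra ID token via azure-identity, assuming the pod's workload identity is wired up correctly:
# Alternative to the key-based client in src/main.py: authenticate to Azure OpenAI
# with an Entra ID token obtained through the pod's identity instead of OPENAI_KEY.
import os
from azure.identity import DefaultAzureCredential
from openai import AzureOpenAI

credential = DefaultAzureCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default")

client = AzureOpenAI(
    api_version="2023-05-15",
    azure_endpoint=os.environ["OPENAI_ENDPOINT"],
    azure_ad_token=token.token,  # short-lived; refresh it (or use azure_ad_token_provider in newer SDKs)
)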
Azure Phase 3: Azure Data Factory Pipeline
# Create Data Factory
export ADF_NAME="rag-adf-${RANDOM}"
az datafactory create \
--resource-group $RESOURCE_GROUP \
--factory-name $ADF_NAME \
--location $LOCATION
# Create Storage Account
export STORAGE_ACCOUNT="ragdocs${RANDOM}"
az storage account create \
--name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--sku Standard_LRS \
--kind StorageV2 \
--hierarchical-namespace true
# Get storage key
export STORAGE_KEY=$(az storage account keys list \
--account-name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--query '[0].value' -o tsv)
# Create containers
az storage container create \
--name raw-documents \
--account-name $STORAGE_ACCOUNT \
--account-key $STORAGE_KEY
az storage container create \
--name processed-documents \
--account-name $STORAGE_ACCOUNT \
--account-key $STORAGE_KEY
# Create linked service for storage
cat > adf-storage-linked-service.json <<EOF
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=$STORAGE_ACCOUNT;AccountKey=$STORAGE_KEY;EndpointSuffix=core.windows.net"
}
}
}
EOF
az datafactory linked-service create \
--resource-group $RESOURCE_GROUP \
--factory-name $ADF_NAME \
--name StorageLinkedService \
--properties @adf-storage-linked-service.json
Azure Phase 4: Milvus Deployment (Same as AWS)
The Milvus deployment on ARO is identical to the one on ROSA, since both run OpenShift:
# Same Helm commands as AWS implementation
helm repo add milvus https://milvus-io.github.io/milvus-helm/
helm install milvus-operator milvus/milvus-operator --namespace milvus --create-namespace
# Create PVCs using Azure Disk
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: milvus-etcd-pvc
namespace: milvus
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 10Gi
storageClassName: managed-premium
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: milvus-minio-pvc
namespace: milvus
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 50Gi
storageClassName: managed-premium
EOF
# Deploy Milvus (same values file as AWS)
helm install milvus milvus/milvus --namespace milvus --values milvus-values.yaml --wait
Azure Phase 5: RAG Application Deployment
# Create Azure-specific application
mkdir -p rag-app-azure/src
cat > rag-app-azure/requirements.txt <<EOF
fastapi==0.104.1
uvicorn[standard]==0.24.0
pydantic==2.5.0
pymilvus==2.3.3
openai==1.3.5
azure-identity==1.14.0
python-dotenv==1.0.0
EOF
cat > rag-app-azure/src/main.py <<'PYTHON'
from fastapi import FastAPI
from pydantic import BaseModel
import os
from openai import AzureOpenAI
from pymilvus import connections, Collection
app = FastAPI(title="Enterprise RAG API - Azure")
client = AzureOpenAI(
api_key=os.getenv("OPENAI_KEY"),
api_version="2023-05-15",
azure_endpoint=os.getenv("OPENAI_ENDPOINT")
)
@app.on_event("startup")
async def startup():
connections.connect(host=os.getenv("MILVUS_HOST"), port=19530)
class QueryRequest(BaseModel):
query: str
top_k: int = 5
max_tokens: int = 1000
@app.post("/query")
async def query_rag(req: QueryRequest):
# Generate embedding with Azure OpenAI
embed_resp = client.embeddings.create(
input=req.query,
model="text-embedding-ada-002"
)
embedding = embed_resp.data[0].embedding
# Search Milvus
coll = Collection("rag_documents")
    results = coll.search([embedding], "embedding", {"metric_type": "L2"}, limit=req.top_k, output_fields=["text"])  # output_fields so hits include the chunk text
# Build context
context = "\n\n".join([hit.entity.get("text") for hit in results[0]])
# Call Azure OpenAI GPT-4
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {req.query}"}
],
max_tokens=req.max_tokens
)
answer = response.choices[0].message.content
return {"answer": answer, "sources": [{"chunk": hit.entity.get("text")} for hit in results[0]]}
@app.get("/health")
async def health():
return {"status": "healthy", "platform": "Azure", "model": "GPT-4"}
PYTHON
# Build and deploy (similar to AWS)
cd rag-app-azure
podman build -t rag-app-azure:v1.0 .
oc create imagestream rag-app-azure -n rag-application
podman tag rag-app-azure:v1.0 image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-azure:v1.0
podman push image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-azure:v1.0 --tls-verify=false
cd ..
# Deploy with Azure credentials
cat <<EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: rag-app-azure
namespace: rag-application
spec:
replicas: 2
selector:
matchLabels:
app: rag-app-azure
template:
metadata:
labels:
app: rag-app-azure
spec:
serviceAccountName: openai-sa
containers:
- name: app
image: image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-azure:v1.0
ports:
- containerPort: 8000
env:
- name: MILVUS_HOST
value: "milvus.milvus.svc.cluster.local"
- name: OPENAI_ENDPOINT
valueFrom:
secretKeyRef:
name: openai-credentials
key: endpoint
- name: OPENAI_KEY
valueFrom:
secretKeyRef:
name: openai-credentials
key: key
---
apiVersion: v1
kind: Service
metadata:
name: rag-app-azure
namespace: rag-application
spec:
selector:
app: rag-app-azure
ports:
- port: 80
targetPort: 8000
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
name: rag-app-azure
namespace: rag-application
spec:
to:
kind: Service
name: rag-app-azure
tls:
termination: edge
EOF
# Get URL and test
export RAG_URL_AZURE=$(oc get route rag-app-azure -n rag-application -o jsonpath='{.spec.host}')
curl https://$RAG_URL_AZURE/health
Cost Comparison (RAG)
Monthly Cost Breakdown
| Component | AWS Cost | Azure Cost | Notes |
|---|---|---|---|
| Kubernetes Cluster | |||
| - 3x worker nodes | $1,460 (m5.2xlarge) | $1,380 (D8s_v3) | Similar specs |
| - Control plane | $0 (managed by ROSA) | $0 (managed by ARO) | Both included |
| LLM API Calls | |||
| - 1M input tokens | $3 (Claude 3.5) | $30 (GPT-4) | AWS 10x cheaper |
| - 1M output tokens | $15 (Claude 3.5) | $60 (GPT-4) | AWS 4x cheaper |
| Embeddings | |||
| - 1M tokens | $0.10 (Titan) | $0.10 (Ada-002) | Equivalent |
| Data Pipeline | |||
| - ETL service | $10 (Glue, serverless) | $15 (Data Factory) | AWS slightly cheaper |
| - Metadata catalog | $1 (Glue Catalog) | $20 (Purview min) | Azure has minimum fee |
| Object Storage | |||
| - 100 GB storage | $2.30 (S3) | $2.05 (Blob) | Equivalent |
| - Requests (100k) | $0.05 (S3) | $0.04 (Blob) | Equivalent |
| Vector Database | |||
| - Self-hosted Milvus | $0 (on cluster) | $0 (on cluster) | Same |
| Networking | |||
| - Private Link | $7.20 (PrivateLink) | $7.20 (Private Link) | Same pricing |
| - Data transfer | $5 (1 TB out) | $5 (1 TB out) | Equivalent |
| TOTAL/MONTH | $1,503.65 | $1,519.39 | AWS 1% cheaper |
Key Cost Insights:
- LLM API costs favor AWS by a significant margin (Claude is cheaper than GPT-4)
- Azure Purview has a minimum monthly fee vs Glue's pay-per-use
- Compute costs are similar between ROSA and ARO
- Winner: AWS by ~$16/month (1%)
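Because token pricing dominates this comparison, it's worth plugging in your own volumes. A rough calculator using the per-million-token rates from the table above (treat the rates as illustrative snapshots, not current pricing):
# Back-of-the-envelope LLM cost comparison using the per-1M-token rates from the table.
RATES = {  # USD per 1M tokens: (input, output)
    "aws_claude_3_5": (3.00, 15.00),
    "azure_gpt_4":    (30.00, 60.00),
}

def monthly_llm_cost(platform: str, input_millions: float, output_millions: float) -> float:
    in_rate, out_rate = RATES[platform]
    return input_millions * in_rate + output_millions * out_rate

for platform in RATES:
    cost = monthly_llm_cost(platform, input_millions=1, output_millions=1)
    print(f"{platform}: ${cost:,.2f}/month at 1M input + 1M output tokens")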
Cost Optimization Strategies
AWS:
- Use Claude Instant for non-critical queries (6x cheaper)
- Leverage Glue serverless (no base cost)
- Use S3 Intelligent-Tiering for old documents
Azure:
- Use GPT-3.5-Turbo instead of GPT-4 (20x cheaper)
- Negotiate EA pricing for Azure OpenAI
- Use cool/archive tiers for old data
Project 2: Hybrid MLOps Pipeline
MLOps Platform Overview
This project demonstrates cost-optimized machine learning operations by bursting GPU training workloads to managed services while keeping inference on Kubernetes.
Architecture Comparison
AWS Architecture:
OpenShift Pipelines → ACK → SageMaker (ml.p4d.24xlarge)
↓
S3 Model Storage
↓
KServe on ROSA (CPU)
Azure Architecture:
Azure DevOps / Tekton → ASO → Azure ML (NC96ads_A100_v4)
↓
Blob Model Storage
↓
KServe on ARO (CPU)
Service Mapping
| Function | AWS Service | Azure Service | Key Difference |
|---|---|---|---|
| ML Platform | Amazon SageMaker | Azure Machine Learning | Similar capabilities |
| GPU Training | ml.p4d.24xlarge (8x A100) | NC96ads_A100_v4 (8x A100) | Same hardware |
| Spot Training | Managed Spot Training | Low Priority VMs | Different reservation models |
| Model Registry | S3 + SageMaker Registry | Blob + ML Model Registry | Different metadata approaches |
| K8s Operator | ACK (AWS Controllers) | ASO (Azure Service Operator) | Different CRD structures |
| Pipelines | OpenShift Pipelines (Tekton) | Azure DevOps / Tekton | Both support Tekton |
| Inference | KServe on ROSA | KServe on ARO | Identical |
AWS Implementation (MLOps)
AWS MLOps Phase 1: OpenShift Pipelines Setup
# Install OpenShift Pipelines Operator
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: openshift-pipelines-operator
namespace: openshift-operators
spec:
channel: latest
name: openshift-pipelines-operator-rh
source: redhat-operators
sourceNamespace: openshift-marketplace
EOF
# Create namespace
oc new-project mlops-pipelines
# Create service account
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
name: pipeline-sa
namespace: mlops-pipelines
EOF
AWS MLOps Phase 2: ACK SageMaker Controller
# Install ACK SageMaker controller
export SERVICE=sagemaker
export RELEASE_VERSION=$(curl -sL https://api.github.com/repos/aws-controllers-k8s/${SERVICE}-controller/releases/latest | grep '"tag_name":' | cut -d'"' -f4)
wget https://github.com/aws-controllers-k8s/${SERVICE}-controller/releases/download/${RELEASE_VERSION}/install.yaml
kubectl apply -f install.yaml
# Create IAM role for ACK
cat > ack-sagemaker-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sagemaker:CreateTrainingJob",
"sagemaker:DescribeTrainingJob",
"sagemaker:StopTrainingJob"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": ["s3:*"],
"Resource": "arn:aws:s3:::mlops-*"
},
{
"Effect": "Allow",
"Action": ["iam:PassRole"],
"Resource": "*",
"Condition": {
"StringEquals": {"iam:PassedToService": "sagemaker.amazonaws.com"}
}
}
]
}
EOF
aws iam create-policy --policy-name ACKSageMakerPolicy --policy-document file://ack-sagemaker-policy.json
# Create trust policy and role (similar to RAG project)
# ... (abbreviated for space)
AWS MLOps Phase 3: Training Job Example
# Create S3 buckets
export ML_BUCKET="mlops-artifacts-${ACCOUNT_ID}"
export DATA_BUCKET="mlops-datasets-${ACCOUNT_ID}"
aws s3 mb s3://$ML_BUCKET
aws s3 mb s3://$DATA_BUCKET
# Upload training script
cat > train.py <<'PYTHON'
import argparse, joblib
from sklearn.ensemble import RandomForestClassifier
import numpy as np
parser = argparse.ArgumentParser()
parser.add_argument('--n_estimators', type=int, default=100)
args = parser.parse_args()
# Training code
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, 1000)
model = RandomForestClassifier(n_estimators=args.n_estimators)
model.fit(X, y)
joblib.dump(model, '/opt/ml/model/model.joblib')
print(f"Training completed with {args.n_estimators} estimators")
PYTHON
# Create Dockerfile
cat > Dockerfile <<EOF
FROM python:3.10-slim
RUN pip install scikit-learn joblib numpy
COPY train.py /opt/ml/code/
ENTRYPOINT ["python", "/opt/ml/code/train.py"]
EOF
# Build and push to ECR
aws ecr create-repository --repository-name mlops/training
export ECR_URI="${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/mlops/training"
aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_URI
docker build -t mlops-training .
docker tag mlops-training:latest $ECR_URI:latest
docker push $ECR_URI:latest
# Create SageMaker training job via ACK
cat <<EOF | oc apply -f -
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: TrainingJob
metadata:
name: rf-training-job
namespace: mlops-pipelines
spec:
trainingJobName: rf-training-$(date +%s)
roleARN: $SAGEMAKER_ROLE_ARN
algorithmSpecification:
trainingImage: $ECR_URI:latest
trainingInputMode: File
resourceConfig:
instanceType: ml.m5.xlarge
instanceCount: 1
volumeSizeInGB: 50
outputDataConfig:
s3OutputPath: s3://$ML_BUCKET/models/
stoppingCondition:
maxRuntimeInSeconds: 3600
EOF
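The ACK controller surfaces status on the TrainingJob custom resource (oc get trainingjob -n mlops-pipelines), but you can also watch the job from the AWS side. A small boto3 sketch (the job name is whatever trainingJobName resolved to when the manifest was applied):
# Poll a SageMaker training job launched through the ACK TrainingJob resource.
import sys
import time
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")
job_name = sys.argv[1]  # e.g. the generated "rf-training-<timestamp>" name

while True:
    desc = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = desc["TrainingJobStatus"]
    print(f"{job_name}: {status}")
    if status in ("Completed", "Failed", "Stopped"):
        if status != "Completed":
            print(desc.get("FailureReason", "no failure reason reported"))
        break
    time.sleep(30)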
Azure Implementation (MLOps)
Azure MLOps Phase 1: Azure ML Workspace
# Create ML workspace
export ML_WORKSPACE="mlops-workspace-${RANDOM}"
az ml workspace create \
--name $ML_WORKSPACE \
--resource-group $RESOURCE_GROUP \
--location $LOCATION
# Create compute cluster (spot instances)
az ml compute create \
--name gpu-cluster \
--type amlcompute \
--min-instances 0 \
--max-instances 4 \
--size Standard_NC6s_v3 \
--tier LowPriority \
--workspace-name $ML_WORKSPACE \
--resource-group $RESOURCE_GROUP
Azure MLOps Phase 2: Azure Service Operator
# Install ASO
helm repo add aso2 https://raw.githubusercontent.com/Azure/azure-service-operator/main/v2/charts
helm install aso2 aso2/azure-service-operator \
--create-namespace \
--namespace azureserviceoperator-system \
--set azureSubscriptionID=$SUBSCRIPTION_ID \
--set azureTenantID=$TENANT_ID \
--set azureClientID=$CLIENT_ID \
--set azureClientSecret=$CLIENT_SECRET
# Create ML job via ASO
cat <<EOF | oc apply -f -
apiVersion: machinelearningservices.azure.com/v1alpha1
kind: Job
metadata:
name: rf-training-job
namespace: mlops-pipelines
spec:
owner:
name: $ML_WORKSPACE
compute:
target: gpu-cluster
instanceCount: 1
environment:
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
codeConfiguration:
codeArtifactId: azureml://code/train-script
scoringScript: train.py
EOF
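As on AWS, you can track the submitted job from the Azure ML control plane. A short sketch with the azure-ai-ml SDK v2, reusing the environment variables from earlier steps:
# Check the status of the training job submitted through ASO using the Azure ML SDK v2.
import os
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id=os.environ["SUBSCRIPTION_ID"],
    resource_group_name=os.environ["RESOURCE_GROUP"],
    workspace_name=os.environ["ML_WORKSPACE"],
)

job = ml_client.jobs.get("rf-training-job")
print(job.name, job.status)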
Cost Comparison (MLOps)
| Component | AWS Monthly | Azure Monthly | Notes |
|---|---|---|---|
| Training | |||
| - 4 hrs/week spot GPU | $157 (ml.p4d.24xlarge) | $153 (NC96ads_A100_v4) | Azure slightly cheaper |
| Storage | |||
| - Model artifacts (50 GB) | $1.15 (S3) | $1.00 (Blob) | Similar |
| ML Platform | |||
| - ML service | $0 (pay-per-use) | $0 (pay-per-use) | Same |
| Inference (on OpenShift) | |||
| - Shared ROSA/ARO cluster | $0 (shared) | $0 (shared) | Same |
| TOTAL/MONTH | ~$158 | ~$154 | Azure 2.5% cheaper |
Winner: Azure by $4/month (negligible difference)
Project 3: Unified Data Fabric (Data Lakehouse)
Lakehouse Platform Overview
This project implements a stateless data lakehouse where compute (Spark) can be destroyed without data loss.
Architecture Comparison
AWS Architecture:
Spark on ROSA → AWS Glue Catalog → S3 + Iceberg
Azure Architecture:
Spark on ARO → Azure Purview / Unity Catalog → ADLS Gen2 + Delta Lake
Service Mapping
| Function | AWS Service | Azure Service | Key Difference |
|---|---|---|---|
| Catalog | AWS Glue Data Catalog | Azure Purview / Unity Catalog | Glue is serverless |
| Table Format | Apache Iceberg | Delta Lake | Iceberg is cloud-agnostic |
| Storage | Amazon S3 | ADLS Gen2 | ADLS has hierarchical namespace |
| Compute | Spark on ROSA | Spark on ARO / Databricks | ARO or managed Databricks |
| Query Engine | Amazon Athena | Azure Synapse Serverless SQL | Similar serverless query |
AWS Implementation (Lakehouse)
(Due to length constraints, showing key differences only)
# Install Spark Operator (add its Helm chart repo first: helm repo add spark-operator <chart-repo-url>)
helm install spark-operator spark-operator/spark-operator \
--namespace spark-operator \
--set sparkJobNamespace=spark-jobs
# Create Glue databases
aws glue create-database --database-input '{"Name": "bronze"}'
aws glue create-database --database-input '{"Name": "silver"}'
aws glue create-database --database-input '{"Name": "gold"}'
# Build custom Spark image with Iceberg
cat > Dockerfile <<EOF
FROM gcr.io/spark-operator/spark:v3.5.0
USER root
RUN curl -L https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.4.2/iceberg-spark-runtime-3.5_2.12-1.4.2.jar \
-o /opt/spark/jars/iceberg-spark-runtime.jar
RUN curl -L https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar \
-o /opt/spark/jars/hadoop-aws.jar
USER 185
EOF
# Deploy SparkApplication with Glue integration
cat <<EOF | oc apply -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
name: lakehouse-etl
spec:
type: Python
sparkVersion: "3.5.0"
mainApplicationFile: s3://bucket/scripts/etl.py
sparkConf:
"spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog"
"spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog"
"spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
EOF
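The SparkApplication above points at s3://bucket/scripts/etl.py, which isn't included here. As a rough illustration of the medallion flow, here's a hypothetical version of that script using the glue_catalog Iceberg catalog configured in sparkConf (database, table, and path names are placeholders):
# Hypothetical etl.py: land raw events in a bronze Iceberg table, then curate into silver.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lakehouse-etl").getOrCreate()

# Bronze: ingest raw JSON as-is (append-only, schema on read)
raw = spark.read.json("s3a://bucket/raw/events/")
raw.writeTo("glue_catalog.bronze.events_raw").using("iceberg").createOrReplace()

# Silver: deduplicate and type the data; Iceberg provides ACID semantics on S3
silver = (
    raw.dropDuplicates(["event_id"])
       .withColumn("event_ts", F.to_timestamp("event_ts"))
       .filter(F.col("event_id").isNotNull())
)
silver.writeTo("glue_catalog.silver.events").using("iceberg").createOrReplace()

spark.stop()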
Azure Implementation (Lakehouse)
# Option 1: Use Azure Databricks (managed)
az databricks workspace create \
--name databricks-lakehouse \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--sku premium
# Option 2: Deploy Spark on ARO with Delta Lake
cat > Dockerfile <<EOF
FROM gcr.io/spark-operator/spark:v3.5.0
USER root
RUN curl -L https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.4.0/delta-core_2.12-2.4.0.jar \
-o /opt/spark/jars/delta-core.jar
USER 185
EOF
# Create ADLS Gen2 storage
az storage account create \
--name datalake${RANDOM} \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--kind StorageV2 \
--hierarchical-namespace true
# Deploy SparkApplication with Delta Lake
cat <<EOF | oc apply -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
name: lakehouse-etl
spec:
type: Python
sparkVersion: "3.5.0"
mainApplicationFile: abfss://container@storage.dfs.core.windows.net/scripts/etl.py
sparkConf:
"spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension"
"spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
EOF
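The Delta Lake counterpart of that ETL script differs mainly in the write API and the storage URI. A short hypothetical sketch (container and account names are placeholders):
# Hypothetical Delta Lake equivalent: same bronze/silver flow, writing Delta tables to ADLS Gen2.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lakehouse-etl").getOrCreate()

base = "abfss://container@storage.dfs.core.windows.net/lakehouse"

raw = spark.read.json(f"{base}/raw/events/")
raw.write.format("delta").mode("overwrite").save(f"{base}/bronze/events_raw")

silver = (
    raw.dropDuplicates(["event_id"])
       .withColumn("event_ts", F.to_timestamp("event_ts"))
)
silver.write.format("delta").mode("overwrite").save(f"{base}/silver/events")

spark.stop()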
Cost Comparison (Lakehouse)
| Component | AWS Monthly | Azure Monthly | Notes |
|---|---|---|---|
| Compute | |||
| - Spark cluster (3x m5.4xlarge) | $1,500 | $1,450 (D16s_v3) | Similar |
| Metadata Catalog | |||
| - Catalog service | $10 (Glue, 1M requests) | $20 (Purview minimum) | AWS cheaper |
| Storage | |||
| - Data lake (1 TB) | $23 (S3) | $18 (ADLS Gen2 hot) | Azure cheaper |
| Query Engine | |||
| - Serverless queries (1 TB) | $5 (Athena) | $5 (Synapse serverless) | Same |
| TOTAL/MONTH | $1,538 | $1,493 | Azure 3% cheaper |
Winner: Azure by $45/month (3%)
Total Cost of Ownership Analysis
Combined Monthly Costs
| Project | AWS Total | Azure Total | Difference |
|---|---|---|---|
| RAG Platform | $1,504 | $1,519 | AWS -$15 (-1%) |
| MLOps Pipeline | $158 | $154 | Azure -$4 (-2.5%) |
| Data Lakehouse | $1,538 | $1,493 | Azure -$45 (-3%) |
| TOTAL | $3,200/month | $3,166/month | Azure -$34/month (-1%) |
Annual Projection
- AWS: $3,200 × 12 = $38,400/year
- Azure: $3,166 × 12 = $37,992/year
- Savings with Azure: $408/year (1%)
Cost Sensitivity Analysis
Scenario 1: High LLM Usage (10M tokens/month)
- AWS: +$180 (Claude cheaper)
- Azure: +$900 (GPT-4 more expensive)
- AWS wins by $720/month
Scenario 2: Heavy ML Training (20 hrs/week GPU)
- AWS: +$785
- Azure: +$765
- Azure wins by $20/month
Scenario 3: Large Data Lake (10 TB storage)
- AWS: +$230
- Azure: +$180
- Azure wins by $50/month
Conclusion: AWS is better for AI-heavy workloads due to cheaper LLM pricing. Azure is better for data-heavy workloads due to cheaper storage.
Multi-Cloud Integration Patterns
Unified RBAC Strategy
Both platforms support similar pod-level identity:
AWS (IRSA):
apiVersion: v1
kind: ServiceAccount
metadata:
name: app-sa
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/AppRole
Azure (Workload Identity):
apiVersion: v1
kind: ServiceAccount
metadata:
name: app-sa
annotations:
azure.workload.identity/client-id: CLIENT_ID
Multi-Cloud Disaster Recovery
Deploy identical workloads on both platforms for DR:
# Primary: AWS
# Standby: Azure
# Failover time: < 5 minutes with DNS switch
# Shared components:
# - OpenShift APIs (same)
# - Application code (same)
# - Milvus deployment (same)
# Platform-specific:
# - Cloud credentials
# - Storage endpoints
Migration Strategies
AWS to Azure Migration
Phase 1: Data Migration
# Use AzCopy for S3 → Blob migration
azcopy copy \
"https://s3.amazonaws.com/bucket/*" \
"https://storageaccount.blob.core.windows.net/container" \
--recursive
Phase 2: Metadata Migration
- Export Glue Catalog to JSON
- Import to Azure Purview via API
Phase 3: Application Migration
- Update environment variables
- Switch cloud credentials
- Deploy to ARO
Azure to AWS Migration
Similar process in reverse:
# Use AWS DataSync for Blob → S3
aws datasync create-task \
--source-location-arn arn:aws:datasync:...:location/azure-blob \
--destination-location-arn arn:aws:datasync:...:location/s3-bucket
Resource Cleanup
AWS Complete Cleanup
#!/bin/bash
# Complete AWS resource cleanup
# RAG Platform
rosa delete cluster --cluster=rag-platform-aws --yes
aws s3 rm s3://rag-documents-${ACCOUNT_ID} --recursive
aws s3 rb s3://rag-documents-${ACCOUNT_ID}
aws glue delete-crawler --name rag-document-crawler
aws glue delete-database --name rag_documents_db
aws ec2 delete-vpc-endpoints --vpc-endpoint-ids $BEDROCK_VPC_ENDPOINT
aws iam detach-role-policy --role-name rosa-bedrock-access --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/BedrockInvokePolicy
aws iam delete-role --role-name rosa-bedrock-access
aws iam delete-policy --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/BedrockInvokePolicy
# MLOps Platform
aws s3 rm s3://mlops-artifacts-${ACCOUNT_ID} --recursive
aws s3 rm s3://mlops-datasets-${ACCOUNT_ID} --recursive
aws s3 rb s3://mlops-artifacts-${ACCOUNT_ID}
aws s3 rb s3://mlops-datasets-${ACCOUNT_ID}
aws ecr delete-repository --repository-name mlops/training --force
aws iam delete-role --role-name ACKSageMakerControllerRole
# Data Lakehouse
aws s3 rm s3://lakehouse-data-${ACCOUNT_ID} --recursive
aws s3 rb s3://lakehouse-data-${ACCOUNT_ID}
for db in bronze silver gold; do
aws glue delete-database --name $db
done
aws iam delete-role --role-name SparkGlueCatalogRole
echo "AWS cleanup complete"
Azure Complete Cleanup
#!/bin/bash
# Complete Azure resource cleanup
# Delete all resources in resource group
az group delete --name rag-platform-rg --yes --no-wait
# This deletes:
# - ARO cluster
# - Azure OpenAI service
# - Storage accounts
# - Data Factory
# - Azure ML workspace
# - All networking components
echo "Azure cleanup complete (deleting in background)"
Troubleshooting
Common Multi-Cloud Issues
Issue: Cross-Cloud Latency
Symptoms: Slow API responses when accessing cloud services
AWS Solution:
# Verify VPC endpoint is in correct AZ
aws ec2 describe-vpc-endpoints --vpc-endpoint-ids $ENDPOINT_ID
# Check PrivateLink latency
oc run test --rm -it --image=curlimages/curl -- \
curl -w "@curl-format.txt" https://bedrock-runtime.us-east-1.amazonaws.com
Azure Solution:
# Verify Private Link in same region as ARO
az network private-endpoint show --name openai-private-endpoint
# Test latency
oc run test --rm -it --image=curlimages/curl -- \
curl -w "@curl-format.txt" https://${OPENAI_NAME}.openai.azure.com
Issue: Authentication Failures
AWS IRSA Troubleshooting:
# Verify OIDC provider
rosa describe cluster -c $CLUSTER_NAME -o json | jq .aws.sts.oidc_endpoint_url
# Test token
kubectl create token bedrock-sa -n rag-application
# Verify IAM trust policy
aws iam get-role --role-name rosa-bedrock-access
Azure Workload Identity Troubleshooting:
# Verify federated credential
az identity federated-credential show \
--name rag-app-federated \
--identity-name rag-app-identity \
--resource-group $RESOURCE_GROUP
# Test managed identity
az account get-access-token --resource https://cognitiveservices.azure.com
Conclusion
Platform Selection Recommendations
Choose AWS if you:
- Prioritize AI/ML model diversity (Bedrock marketplace)
- Have variable, unpredictable workloads (serverless pricing)
- Value open-source ecosystem compatibility
- Need global multi-region deployments
- Want lower LLM API costs
Choose Azure if you:
- Have existing Microsoft enterprise agreements
- Need Windows container support
- Require hybrid cloud with on-premises
- Have Microsoft 365 / Teams integration requirements
- Want slightly lower infrastructure costs
Choose Multi-Cloud if you:
- Need disaster recovery across providers
- Want to avoid vendor lock-in
- Have regulatory requirements for redundancy
- Can manage operational complexity
Final Cost Summary
For the three projects combined:
- AWS Total: $3,200/month ($38,400/year)
- Azure Total: $3,166/month ($37,992/year)
- Difference: 1% ($408/year favoring Azure)
Verdict: Costs are effectively equivalent. Choose based on ecosystem fit, not cost.
Key Technical Takeaways
- OpenShift provides platform portability - same APIs on both clouds
- Cloud-specific services (Bedrock, Azure OpenAI) require different code
- Storage abstractions (S3 vs Blob) are the main migration challenge (see the sketch below)
- IAM patterns (IRSA vs Workload Identity) are conceptually similar
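On that storage point, one way to keep application code portable is a thin adapter over both SDKs. A minimal sketch using boto3 and azure-storage-blob (bucket, container, and credential wiring are assumptions):
# Minimal storage abstraction sketch: the same put/get interface over S3 and Azure Blob.
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class S3Store(ObjectStore):
    def __init__(self, bucket: str):
        import boto3
        self.bucket, self.s3 = bucket, boto3.client("s3")
    def put(self, key: str, data: bytes) -> None:
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=data)
    def get(self, key: str) -> bytes:
        return self.s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()

class BlobStore(ObjectStore):
    def __init__(self, account_url: str, container: str):
        from azure.identity import DefaultAzureCredential
        from azure.storage.blob import BlobServiceClient
        svc = BlobServiceClient(account_url, credential=DefaultAzureCredential())
        self.container = svc.get_container_client(container)
    def put(self, key: str, data: bytes) -> None:
        self.container.upload_blob(key, data, overwrite=True)
    def get(self, key: str) -> bytes:
        return self.container.download_blob(key).readall()

# Application code depends only on ObjectStore, so switching clouds becomes a config change.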
Next Steps
To Expand This Implementation:
- Add GitOps with ArgoCD for both platforms
- Implement cross-cloud disaster recovery
- Add comprehensive monitoring with Grafana
- Automate deployments with Terraform/Bicep
- Implement cost governance and FinOps
Thank you for reading this comprehensive multi-cloud implementation guide!