Table of Contents
- Introduction
- Platform Overview
- Cloud Platform Decision Matrix
- Prerequisites
- Project 1: Enterprise-Grade RAG Platform
- Project 2: Hybrid MLOps Pipeline
- Project 3: Unified Data Fabric (Data Lakehouse)
- Multi-Cloud Integration Patterns
- Total Cost of Ownership Analysis
- Migration Strategies
- Resource Cleanup
- Troubleshooting
Introduction
Modern enterprises face a critical decision when building cloud-native AI and data platforms: AWS or Azure? This comprehensive guide demonstrates how to build three production-grade platforms on both cloud providers, providing side-by-side comparisons to help you make informed decisions.
What You'll Learn
This guide shows you how to implement identical architectures on both AWS and Azure:
Project 1: Enterprise RAG Platform
- AWS: Amazon Bedrock + AWS Glue + Milvus on ROSA
- Azure: Azure OpenAI + Azure Data Factory + Milvus on ARO
- Privacy-first Retrieval-Augmented Generation
- Vector database integration
- Secure private connectivity
Project 2: Hybrid MLOps Pipeline
- AWS: SageMaker + OpenShift Pipelines + KServe on ROSA
- Azure: Azure ML + Azure DevOps + KServe on ARO
- Cost-optimized GPU training
- Kubernetes-native serving
- End-to-end automation
Project 3: Unified Data Fabric
- AWS: Apache Spark + AWS Glue Catalog + S3 + Iceberg
- Azure: Apache Spark + Azure Purview + ADLS Gen2 + Delta Lake
- Stateless compute architecture
- Medallion data organization
- ACID transactions
Why This Comparison Matters
Choosing the right cloud platform impacts:
- Total Cost: monthly spend can differ by 20-40% depending on workload mix (the builds below land within about 1% of each other overall)
- Developer Productivity: Ecosystem integration and tooling
- Vendor Lock-in: Portability and migration flexibility
- Enterprise Integration: Existing infrastructure and contracts
Platform Overview
Unified Multi-Cloud Architecture
Both implementations follow the same architectural patterns while leveraging platform-specific managed services:
┌─────────────────────────────────────────────────────────────────────┐
│ Enterprise Organization │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Red Hat OpenShift (ROSA on AWS / ARO on Azure) │ │
│ │ - Unified Control Plane │ │
│ │ - Application Orchestration │ │
│ │ - Developer Platform │ │
│ └───────────────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ │ │ │ │
│ ┌───────────▼─────┐ ┌──────▼──────┐ ┌─────▼──────────┐ │
│ │ RAG Project │ │MLOps Project│ │ Data Lakehouse │ │
│ │ │ │ │ │ │ │
│ │ AWS: │ │ AWS: │ │ AWS: │ │
│ │ - Bedrock │ │ - SageMaker │ │ - Glue Catalog │ │
│ │ - Glue ETL │ │ - ACK │ │ - S3 + Iceberg │ │
│ │ │ │ │ │ │ │
│ │ Azure: │ │ Azure: │ │ Azure: │ │
│ │ - OpenAI │ │ - Azure ML │ │ - Purview │ │
│ │ - Data Factory │ │ - ASO │ │ - ADLS + Delta │ │
│ └─────────────────┘ └─────────────┘ └────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Cloud Services Layer │ │
│ │ AWS: IAM + S3 + PrivateLink + CloudWatch │ │
│ │ Azure: AAD + Blob + Private Link + Monitor │ │
│ └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Technology Stack: AWS vs Azure
| Component | AWS Solution | Azure Solution | OpenShift Platform |
|---|---|---|---|
| Kubernetes | ROSA (Red Hat OpenShift on AWS) | ARO (Azure Red Hat OpenShift) | Both use Red Hat OpenShift |
| LLM Platform | Amazon Bedrock (Claude 3.5) | Azure OpenAI Service (GPT-4) | Same API patterns |
| ML Training | Amazon SageMaker | Azure Machine Learning | Both burst from OpenShift |
| Data Catalog | AWS Glue Data Catalog | Azure Purview / Unity Catalog | Unified metadata layer |
| Object Storage | Amazon S3 | Azure Data Lake Storage Gen2 | S3-compatible APIs |
| Table Format | Apache Iceberg | Delta Lake | Open source options |
| Vector DB | Milvus (self-hosted) | Milvus / Cosmos DB | Same deployment |
| ETL Service | AWS Glue (serverless) | Azure Data Factory (serverless) | Similar orchestration |
| CI/CD | OpenShift Pipelines (Tekton) | Azure DevOps / Tekton | Kubernetes-native |
| K8s Integration | AWS Controllers (ACK) | Azure Service Operator (ASO) | Custom resources |
| Private Network | AWS PrivateLink | Azure Private Link | VPC/VNet integration |
| Authentication | IRSA (IAM for Service Accounts) | Workload Identity | Pod-level identity |
Cloud Platform Decision Matrix
When to Choose AWS
Best For:
- AI/ML Innovation: Amazon Bedrock offers broader model selection (Claude, Llama 2, Stable Diffusion)
- Serverless-First: AWS Glue, Lambda, and Bedrock have no minimum fees
- Startup/Scale-up: Pay-as-you-go pricing favors variable workloads
- Data Engineering: S3 + Glue + Athena is industry standard
- Multi-Region: Better global infrastructure coverage
AWS Advantages:
- Superior AI model marketplace (Anthropic, Cohere, AI21, Meta)
- True serverless data catalog (Glue) with no base costs
- More mature spot instance ecosystem for cost savings
- Better S3 ecosystem and tooling integration
- Stronger open-source community adoption
When to Choose Azure
Best For:
- Microsoft Ecosystem: Tight integration with Office 365, Teams, Power Platform
- Enterprise Windows: Native Windows container support
- Hybrid Cloud: Azure Arc and on-premises integration
- Enterprise Agreements: Existing Microsoft licensing discounts
- Regulated Industries: Better compliance certifications in some regions
Azure Advantages:
- Seamless Microsoft 365 and Active Directory integration
- Superior Windows and .NET container support
- Better hybrid cloud story with Azure Arc
- Integrated Azure Synapse for unified analytics
- Potentially lower costs with existing EA agreements
Decision Criteria Scorecard
| Criteria | AWS Score | Azure Score | Weight | Notes |
|---|---|---|---|---|
| AI Model Selection | 9/10 | 7/10 | High | AWS Bedrock has more models |
| ML Training Cost | 8/10 | 8/10 | High | Equivalent spot pricing |
| Data Lake Maturity | 10/10 | 8/10 | High | S3 is industry standard |
| Serverless Pricing | 9/10 | 7/10 | Medium | AWS Glue has no minimums |
| Enterprise Integration | 7/10 | 10/10 | High | Azure wins for Microsoft shops |
| Hybrid Cloud | 7/10 | 9/10 | Medium | Azure Arc is superior |
| Developer Ecosystem | 9/10 | 7/10 | Medium | Larger open-source community |
| Compliance Certifications | 9/10 | 9/10 | High | Equivalent for most use cases |
| Global Infrastructure | 10/10 | 8/10 | Low | AWS has more regions |
| Pricing Transparency | 8/10 | 7/10 | Medium | AWS pricing is clearer |
Total Weighted Score: AWS: 8.5/10 | Azure: 8.1/10
Verdict: Choose based on your organization's existing ecosystem. Both platforms are capable; the difference is in integration, not capability.
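The guide does not state the numeric weights behind the total, but one mapping that reproduces the published figures is High=3, Medium=2, Low=1. A minimal sketch under that assumption:
# Weighted average of the scorecard; the High=3 / Medium=2 / Low=1 mapping is an assumption
weights = {"High": 3, "Medium": 2, "Low": 1}
rows = [  # (criteria, aws_score, azure_score, weight)
    ("AI Model Selection", 9, 7, "High"),
    ("ML Training Cost", 8, 8, "High"),
    ("Data Lake Maturity", 10, 8, "High"),
    ("Serverless Pricing", 9, 7, "Medium"),
    ("Enterprise Integration", 7, 10, "High"),
    ("Hybrid Cloud", 7, 9, "Medium"),
    ("Developer Ecosystem", 9, 7, "Medium"),
    ("Compliance Certifications", 9, 9, "High"),
    ("Global Infrastructure", 10, 8, "Low"),
    ("Pricing Transparency", 8, 7, "Medium"),
]
total_weight = sum(weights[w] for _, _, _, w in rows)
aws = sum(a * weights[w] for _, a, _, w in rows) / total_weight
azure = sum(z * weights[w] for _, _, z, w in rows) / total_weight
print(f"AWS {aws:.1f}/10 | Azure {azure:.1f}/10")  # -> AWS 8.5/10 | Azure 8.1/10
Swapping in your own scores and weights is the quickest way to adapt the verdict to your organization's priorities.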
Prerequisites
Common Prerequisites (Both Platforms)
Required Accounts:
- Cloud platform account with administrative access
- Red Hat Account with OpenShift subscription
- Credit card for cloud charges
Required Tools (install on your workstation):
# Common tools for both platforms
# OpenShift CLI (oc)
wget https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz
tar -xvf openshift-client-linux.tar.gz
sudo mv oc kubectl /usr/local/bin/
oc version
# Helm (v3)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version
# Tekton CLI
curl -LO https://github.com/tektoncd/cli/releases/download/v0.33.0/tkn_0.33.0_Linux_x86_64.tar.gz
tar xvzf tkn_0.33.0_Linux_x86_64.tar.gz
sudo mv tkn /usr/local/bin/
tkn version
# Python 3.11+
python3 --version
# Container tools (Docker or Podman)
podman --version
AWS-Specific Prerequisites
# AWS CLI (v2)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws --version
# ROSA CLI
wget https://mirror.openshift.com/pub/openshift-v4/clients/rosa/latest/rosa-linux.tar.gz
tar -xvf rosa-linux.tar.gz
sudo mv rosa /usr/local/bin/rosa
rosa version
# Configure AWS
aws configure
aws sts get-caller-identity
# Initialize ROSA
rosa login
rosa verify quota
rosa verify permissions
rosa init
Azure-Specific Prerequisites
# Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az --version
# ARO extension
az extension add --name aro --index https://az.aroapp.io/stable
# Azure CLI login
az login
az account show
# Register required providers
az provider register --namespace Microsoft.RedHatOpenShift --wait
az provider register --namespace Microsoft.Compute --wait
az provider register --namespace Microsoft.Storage --wait
az provider register --namespace Microsoft.Network --wait
Service Quotas Verification
AWS:
# EC2 vCPU quota
aws service-quotas get-service-quota \
--service-code ec2 \
--quota-code L-1216C47A \
--region us-east-1
# SageMaker training instances
aws service-quotas get-service-quota \
--service-code sagemaker \
--quota-code L-2E8D9C5E \
--region us-east-1
Azure:
# Check compute quota
az vm list-usage --location eastus --output table
# Check ML compute quota
az ml compute list-usage --location eastus
Project 1: Enterprise-Grade RAG Platform
RAG Platform Overview
This project implements a privacy-first Retrieval-Augmented Generation (RAG) system. Both AWS and Azure implementations achieve the same functionality but use platform-specific managed services.
Architecture Comparison
AWS Architecture:
ROSA → AWS PrivateLink → Amazon Bedrock (Claude 3.5)
↓
Milvus Vector DB (on ROSA)
↓
AWS Glue ETL → S3
Azure Architecture:
ARO → Azure Private Link → Azure OpenAI (GPT-4)
↓
Milvus Vector DB (on ARO)
↓
Azure Data Factory → Blob Storage
Side-by-Side Service Mapping
| Function | AWS Service | Azure Service | Implementation Difference |
|---|---|---|---|
| LLM API | Amazon Bedrock | Azure OpenAI Service | Different model families |
| Private Network | AWS PrivateLink | Azure Private Link | Similar configuration |
| ETL Pipeline | AWS Glue (Serverless) | Azure Data Factory | Different pricing models |
| Metadata | AWS Glue Data Catalog | Azure Purview | Different scopes |
| Storage | Amazon S3 | Azure Blob Storage / ADLS Gen2 | S3 API vs Blob API |
| Vector DB | Milvus on ROSA | Milvus on ARO / Cosmos DB | Self-hosted vs managed option |
| Auth | IRSA (IAM Roles) | Workload Identity | Similar pod-level identity |
| Embedding | Titan Embeddings | OpenAI Embeddings | Different dimensions |
AWS Implementation (RAG)
AWS Phase 1: ROSA Cluster Setup
# Set environment variables
export CLUSTER_NAME="rag-platform-aws"
export AWS_REGION="us-east-1"
export MACHINE_TYPE="m5.2xlarge"
export COMPUTE_NODES=3
# Create ROSA cluster (takes ~40 minutes)
rosa create cluster \
--cluster-name $CLUSTER_NAME \
--region $AWS_REGION \
--multi-az \
--compute-machine-type $MACHINE_TYPE \
--compute-nodes $COMPUTE_NODES \
--machine-cidr 10.0.0.0/16 \
--service-cidr 172.30.0.0/16 \
--pod-cidr 10.128.0.0/14 \
--host-prefix 23 \
--yes
# Monitor installation
rosa logs install --cluster=$CLUSTER_NAME --watch
# Create admin and connect
rosa create admin --cluster=$CLUSTER_NAME
oc login <api-url> --username cluster-admin --password <password>
# Create namespaces
oc new-project redhat-ods-applications
oc new-project rag-application
oc new-project milvus
AWS Phase 2: Amazon Bedrock via PrivateLink
# Get ROSA VPC details
export ROSA_VPC_ID=$(aws ec2 describe-vpcs \
--filters "Name=tag:Name,Values=*${CLUSTER_NAME}*" \
--query 'Vpcs[0].VpcId' \
--output text \
--region $AWS_REGION)
export PRIVATE_SUBNET_IDS=$(aws ec2 describe-subnets \
--filters "Name=vpc-id,Values=$ROSA_VPC_ID" "Name=tag:Name,Values=*private*" \
--query 'Subnets[*].SubnetId' \
--output text \
--region $AWS_REGION)
# Create VPC Endpoint Security Group
export VPC_ENDPOINT_SG=$(aws ec2 create-security-group \
--group-name bedrock-vpc-endpoint-sg \
--description "Security group for Bedrock VPC endpoint" \
--vpc-id $ROSA_VPC_ID \
--region $AWS_REGION \
--output text \
--query 'GroupId')
# Allow HTTPS from ROSA nodes
aws ec2 authorize-security-group-ingress \
--group-id $VPC_ENDPOINT_SG \
--protocol tcp \
--port 443 \
--cidr 10.0.0.0/16 \
--region $AWS_REGION
# Create Bedrock VPC Endpoint
export BEDROCK_VPC_ENDPOINT=$(aws ec2 create-vpc-endpoint \
--vpc-id $ROSA_VPC_ID \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.${AWS_REGION}.bedrock-runtime \
--subnet-ids $PRIVATE_SUBNET_IDS \
--security-group-ids $VPC_ENDPOINT_SG \
--private-dns-enabled \
--region $AWS_REGION \
--output text \
--query 'VpcEndpoint.VpcEndpointId')
# Wait for availability
aws ec2 wait vpc-endpoint-available \
--vpc-endpoint-ids $BEDROCK_VPC_ENDPOINT \
--region $AWS_REGION
# Create IAM role for Bedrock access (IRSA pattern)
export OIDC_PROVIDER=$(rosa describe cluster -c $CLUSTER_NAME -o json | jq -r .aws.sts.oidc_endpoint_url | sed 's|https://||')
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
cat > bedrock-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel",
"bedrock:InvokeModelWithResponseStream"
],
"Resource": "arn:aws:bedrock:${AWS_REGION}::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0"
}
]
}
EOF
aws iam create-policy \
--policy-name BedrockInvokePolicy \
--policy-document file://bedrock-policy.json
cat > trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"${OIDC_PROVIDER}:sub": "system:serviceaccount:rag-application:bedrock-sa"
}
}
}
]
}
EOF
export BEDROCK_ROLE_ARN=$(aws iam create-role \
--role-name rosa-bedrock-access \
--assume-role-policy-document file://trust-policy.json \
--query 'Role.Arn' \
--output text)
aws iam attach-role-policy \
--role-name rosa-bedrock-access \
--policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/BedrockInvokePolicy
# Create Kubernetes service account
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
name: bedrock-sa
namespace: rag-application
annotations:
eks.amazonaws.com/role-arn: $BEDROCK_ROLE_ARN
EOF
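Before wiring up the application, it is worth confirming that pods using this service account actually assume the role. A quick check you can run from a pod that uses bedrock-sa (a sketch, assuming boto3 is available in the image):
# The ROSA pod identity webhook injects AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE,
# and boto3 picks them up automatically.
import boto3

identity = boto3.client("sts", region_name="us-east-1").get_caller_identity()
print(identity["Arn"])  # should show an assumed-role ARN for rosa-bedrock-access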
AWS Phase 3: AWS Glue Data Pipeline
# Create S3 bucket
export BUCKET_NAME="rag-documents-${ACCOUNT_ID}"
aws s3 mb s3://$BUCKET_NAME --region $AWS_REGION
# Enable versioning
aws s3api put-bucket-versioning \
--bucket $BUCKET_NAME \
--versioning-configuration Status=Enabled \
--region $AWS_REGION
# Create folder structure
aws s3api put-object --bucket $BUCKET_NAME --key raw-documents/
aws s3api put-object --bucket $BUCKET_NAME --key processed-documents/
aws s3api put-object --bucket $BUCKET_NAME --key embeddings/
# Create Glue IAM role
cat > glue-trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {"Service": "glue.amazonaws.com"},
"Action": "sts:AssumeRole"
}
]
}
EOF
aws iam create-role \
--role-name AWSGlueServiceRole-RAG \
--assume-role-policy-document file://glue-trust-policy.json
aws iam attach-role-policy \
--role-name AWSGlueServiceRole-RAG \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
# Create S3 access policy
cat > glue-s3-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
"Resource": "arn:aws:s3:::${BUCKET_NAME}/*"
},
{
"Effect": "Allow",
"Action": ["s3:ListBucket"],
"Resource": "arn:aws:s3:::${BUCKET_NAME}"
}
]
}
EOF
aws iam put-role-policy \
--role-name AWSGlueServiceRole-RAG \
--policy-name S3Access \
--policy-document file://glue-s3-policy.json
# Create Glue database
aws glue create-database \
--database-input '{
"Name": "rag_documents_db",
"Description": "RAG document metadata"
}' \
--region $AWS_REGION
# Create Glue crawler
aws glue create-crawler \
--name rag-document-crawler \
--role arn:aws:iam::${ACCOUNT_ID}:role/AWSGlueServiceRole-RAG \
--database-name rag_documents_db \
--targets '{
"S3Targets": [{"Path": "s3://'$BUCKET_NAME'/raw-documents/"}]
}' \
--region $AWS_REGION
AWS Phase 4: Milvus Vector Database
# Install Milvus using Helm
helm repo add milvus https://milvus-io.github.io/milvus-helm/
helm repo update
helm install milvus-operator milvus/milvus-operator \
--namespace milvus \
--create-namespace
# Create PVCs
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: milvus-etcd-pvc
namespace: milvus
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 10Gi
storageClassName: gp3-csi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: milvus-minio-pvc
namespace: milvus
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 50Gi
storageClassName: gp3-csi
EOF
# Deploy Milvus
cat > milvus-values.yaml <<EOF
cluster:
enabled: true
service:
type: ClusterIP
port: 19530
standalone:
replicas: 1
resources:
limits:
cpu: "4"
memory: 8Gi
requests:
cpu: "2"
memory: 4Gi
etcd:
persistence:
enabled: true
existingClaim: milvus-etcd-pvc
minio:
persistence:
enabled: true
existingClaim: milvus-minio-pvc
EOF
helm install milvus milvus/milvus \
--namespace milvus \
--values milvus-values.yaml \
--wait
# Get Milvus endpoint
export MILVUS_HOST=$(oc get svc milvus -n milvus -o jsonpath='{.spec.clusterIP}')
export MILVUS_PORT=19530
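The query API further down expects a rag_documents collection, but the ingestion step that creates it is not shown in this guide. A minimal one-time setup sketch, run from anywhere that can reach the Milvus service; the schema (auto-ID primary key, a text chunk field, and a 1024-dimension vector matching Titan Embeddings v2) is an assumption derived from the application code:
import os
from pymilvus import connections, utility, Collection, CollectionSchema, FieldSchema, DataType

# Connect to the in-cluster Milvus service captured above
connections.connect(host=os.getenv("MILVUS_HOST"), port=19530)

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    # 1024 matches Titan Embeddings v2; use 1536 for text-embedding-ada-002 on Azure
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1024),
]
schema = CollectionSchema(fields, description="RAG document chunks")

if not utility.has_collection("rag_documents"):
    coll = Collection("rag_documents", schema)
    coll.create_index(
        "embedding",
        {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
    )
Document chunks and their embeddings are then inserted with coll.insert() before the API below can return useful answers.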
AWS Phase 5: RAG Application Deployment
# Create application code
mkdir -p rag-app-aws/src
cat > rag-app-aws/requirements.txt <<EOF
fastapi==0.104.1
uvicorn[standard]==0.24.0
pydantic==2.5.0
pymilvus==2.3.3
boto3==1.29.7
python-dotenv==1.0.0
EOF
# Create FastAPI application (abbreviated for space)
cat > rag-app-aws/src/main.py <<'PYTHON'
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import os, json, boto3
from pymilvus import connections, Collection
app = FastAPI(title="Enterprise RAG API - AWS")
MILVUS_HOST = os.getenv("MILVUS_HOST")
AWS_REGION = os.getenv("AWS_REGION", "us-east-1")
BEDROCK_MODEL = "anthropic.claude-3-5-sonnet-20241022-v2:0"
bedrock = boto3.client('bedrock-runtime', region_name=AWS_REGION)
@app.on_event("startup")
async def startup():
connections.connect(host=MILVUS_HOST, port=19530)
class QueryRequest(BaseModel):
query: str
top_k: int = 5
max_tokens: int = 1000
@app.post("/query")
async def query_rag(req: QueryRequest):
# Generate embedding with Bedrock Titan
embed_resp = bedrock.invoke_model(
modelId="amazon.titan-embed-text-v2:0",
body=json.dumps({"inputText": req.query, "dimensions": 1024})
)
embedding = json.loads(embed_resp['body'].read())['embedding']
    # Search Milvus (load the collection and ask for the stored text field back with each hit)
    coll = Collection("rag_documents")
    coll.load()
    results = coll.search(
        [embedding], "embedding", {"metric_type": "L2"},
        limit=req.top_k, output_fields=["text"],
    )
    # Build context
    context = "\n\n".join([hit.entity.get("text") for hit in results[0]])
# Call Bedrock Claude
prompt = f"Context:\n{context}\n\nQuestion: {req.query}\n\nAnswer:"
response = bedrock.invoke_model(
modelId=BEDROCK_MODEL,
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": req.max_tokens,
"messages": [{"role": "user", "content": prompt}]
})
)
answer = json.loads(response['body'].read())['content'][0]['text']
return {"answer": answer, "sources": [{"chunk": hit.entity.get("text")} for hit in results[0]]}
@app.get("/health")
async def health():
return {"status": "healthy", "platform": "AWS", "model": "Claude 3.5 Sonnet"}
PYTHON
# Create Dockerfile
cat > rag-app-aws/Dockerfile <<EOF
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ ./src/
EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
EOF
# Build and deploy
cd rag-app-aws
podman build -t rag-app-aws:v1.0 .
oc create imagestream rag-app-aws -n rag-application
podman tag rag-app-aws:v1.0 image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-aws:v1.0
podman push image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-aws:v1.0 --tls-verify=false
cd ..
# Deploy to OpenShift
cat <<EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: rag-app-aws
namespace: rag-application
spec:
replicas: 2
selector:
matchLabels:
app: rag-app-aws
template:
metadata:
labels:
app: rag-app-aws
spec:
serviceAccountName: bedrock-sa
containers:
- name: app
image: image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-aws:v1.0
ports:
- containerPort: 8000
env:
- name: MILVUS_HOST
value: "$MILVUS_HOST"
- name: AWS_REGION
value: "$AWS_REGION"
---
apiVersion: v1
kind: Service
metadata:
name: rag-app-aws
namespace: rag-application
spec:
selector:
app: rag-app-aws
ports:
- port: 80
targetPort: 8000
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
name: rag-app-aws
namespace: rag-application
spec:
to:
kind: Service
name: rag-app-aws
tls:
termination: edge
EOF
# Get URL and test
export RAG_URL_AWS=$(oc get route rag-app-aws -n rag-application -o jsonpath='{.spec.host}')
curl https://$RAG_URL_AWS/health
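Once documents have been ingested into Milvus, a quick end-to-end smoke test of the query endpoint might look like this (a sketch using the requests library, run in the same shell where RAG_URL_AWS was exported; the question is just an example):
import os
import requests

url = f"https://{os.environ['RAG_URL_AWS']}/query"
payload = {"query": "What is our data retention policy?", "top_k": 3}

resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
body = resp.json()

print(body["answer"])
print(f"{len(body['sources'])} source chunks returned")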
Azure Implementation (RAG)
Azure Phase 1: ARO Cluster Setup
# Set environment variables
export CLUSTER_NAME="rag-platform-azure"
export LOCATION="eastus"
export RESOURCE_GROUP="rag-platform-rg"
# Create resource group
az group create \
--name $RESOURCE_GROUP \
--location $LOCATION
# Create virtual network
az network vnet create \
--resource-group $RESOURCE_GROUP \
--name aro-vnet \
--address-prefixes 10.0.0.0/22
# Create master subnet
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--vnet-name aro-vnet \
--name master-subnet \
--address-prefixes 10.0.0.0/23 \
--service-endpoints Microsoft.ContainerRegistry
# Create worker subnet
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--vnet-name aro-vnet \
--name worker-subnet \
--address-prefixes 10.0.2.0/23 \
--service-endpoints Microsoft.ContainerRegistry
# Disable subnet private endpoint policies
az network vnet subnet update \
--name master-subnet \
--resource-group $RESOURCE_GROUP \
--vnet-name aro-vnet \
--disable-private-link-service-network-policies true
# Create ARO cluster (takes ~35 minutes)
az aro create \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--vnet aro-vnet \
--master-subnet master-subnet \
--worker-subnet worker-subnet \
--worker-count 3 \
--worker-vm-size Standard_D8s_v3
# Get credentials
export ARO_URL=$(az aro show \
--name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP \
--query consoleUrl -o tsv)
export ARO_PASSWORD=$(az aro list-credentials \
--name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP \
--query kubeadminPassword -o tsv)
# Login
oc login $ARO_URL -u kubeadmin -p $ARO_PASSWORD
# Create namespaces
oc new-project rag-application
oc new-project milvus
Azure Phase 2: Azure OpenAI via Private Link
# Create Azure OpenAI resource
export OPENAI_NAME="rag-openai-${RANDOM}"
az cognitiveservices account create \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--kind OpenAI \
--sku S0 \
--location $LOCATION \
--custom-domain $OPENAI_NAME \
--public-network-access Disabled
# Deploy GPT-4 model
az cognitiveservices account deployment create \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--deployment-name gpt-4 \
--model-name gpt-4 \
--model-version "0613" \
--model-format OpenAI \
--sku-capacity 10 \
--sku-name "Standard"
# Deploy text-embedding model
az cognitiveservices account deployment create \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--deployment-name text-embedding-ada-002 \
--model-name text-embedding-ada-002 \
--model-version "2" \
--model-format OpenAI \
--sku-capacity 10 \
--sku-name "Standard"
# Create Private Endpoint
export VNET_ID=$(az network vnet show \
--resource-group $RESOURCE_GROUP \
--name aro-vnet \
--query id -o tsv)
export SUBNET_ID=$(az network vnet subnet show \
--resource-group $RESOURCE_GROUP \
--vnet-name aro-vnet \
--name worker-subnet \
--query id -o tsv)
export OPENAI_ID=$(az cognitiveservices account show \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--query id -o tsv)
az network private-endpoint create \
--name openai-private-endpoint \
--resource-group $RESOURCE_GROUP \
--vnet-name aro-vnet \
--subnet worker-subnet \
--private-connection-resource-id $OPENAI_ID \
--group-id account \
--connection-name openai-connection
# Create Private DNS Zone
az network private-dns zone create \
--resource-group $RESOURCE_GROUP \
--name privatelink.openai.azure.com
az network private-dns link vnet create \
--resource-group $RESOURCE_GROUP \
--zone-name privatelink.openai.azure.com \
--name openai-dns-link \
--virtual-network aro-vnet \
--registration-enabled false
# Create DNS record
export ENDPOINT_IP=$(az network private-endpoint show \
--name openai-private-endpoint \
--resource-group $RESOURCE_GROUP \
--query 'customDnsConfigs[0].ipAddresses[0]' -o tsv)
az network private-dns record-set a create \
--name $OPENAI_NAME \
--zone-name privatelink.openai.azure.com \
--resource-group $RESOURCE_GROUP
az network private-dns record-set a add-record \
--record-set-name $OPENAI_NAME \
--zone-name privatelink.openai.azure.com \
--resource-group $RESOURCE_GROUP \
--ipv4-address $ENDPOINT_IP
# Configure Workload Identity
export ARO_OIDC_ISSUER=$(az aro show \
--name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP \
--query 'serviceIdentity.url' -o tsv)
# Create managed identity
az identity create \
--name rag-app-identity \
--resource-group $RESOURCE_GROUP
export IDENTITY_CLIENT_ID=$(az identity show \
--name rag-app-identity \
--resource-group $RESOURCE_GROUP \
--query clientId -o tsv)
export IDENTITY_PRINCIPAL_ID=$(az identity show \
--name rag-app-identity \
--resource-group $RESOURCE_GROUP \
--query principalId -o tsv)
# Grant OpenAI access
az role assignment create \
--assignee $IDENTITY_PRINCIPAL_ID \
--role "Cognitive Services OpenAI User" \
--scope $OPENAI_ID
# Create federated credential
az identity federated-credential create \
--name rag-app-federated \
--identity-name rag-app-identity \
--resource-group $RESOURCE_GROUP \
--issuer $ARO_OIDC_ISSUER \
--subject "system:serviceaccount:rag-application:openai-sa"
# Create Kubernetes service account
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
name: openai-sa
namespace: rag-application
annotations:
azure.workload.identity/client-id: $IDENTITY_CLIENT_ID
EOF
# Get OpenAI endpoint and key
export OPENAI_ENDPOINT=$(az cognitiveservices account show \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--query properties.endpoint -o tsv)
export OPENAI_KEY=$(az cognitiveservices account keys list \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--query key1 -o tsv)
# Create secret
oc create secret generic openai-credentials \
--from-literal=endpoint=$OPENAI_ENDPOINT \
--from-literal=key=$OPENAI_KEY \
-n rag-application
Azure Phase 3: Azure Data Factory Pipeline
# Create Data Factory
export ADF_NAME="rag-adf-${RANDOM}"
az datafactory create \
--resource-group $RESOURCE_GROUP \
--factory-name $ADF_NAME \
--location $LOCATION
# Create Storage Account
export STORAGE_ACCOUNT="ragdocs${RANDOM}"
az storage account create \
--name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--sku Standard_LRS \
--kind StorageV2 \
--hierarchical-namespace true
# Get storage key
export STORAGE_KEY=$(az storage account keys list \
--account-name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--query '[0].value' -o tsv)
# Create containers
az storage container create \
--name raw-documents \
--account-name $STORAGE_ACCOUNT \
--account-key $STORAGE_KEY
az storage container create \
--name processed-documents \
--account-name $STORAGE_ACCOUNT \
--account-key $STORAGE_KEY
# Create linked service for storage
cat > adf-storage-linked-service.json <<EOF
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "connectionString": "DefaultEndpointsProtocol=https;AccountName=$STORAGE_ACCOUNT;AccountKey=$STORAGE_KEY;EndpointSuffix=core.windows.net"
  }
}
EOF
az datafactory linked-service create \
--resource-group $RESOURCE_GROUP \
--factory-name $ADF_NAME \
--name StorageLinkedService \
--properties @adf-storage-linked-service.json
Azure Phase 4: Milvus Deployment (Same as AWS)
The Milvus deployment on ARO is identical to ROSA since both use OpenShift:
# Same Helm commands as AWS implementation
helm repo add milvus https://milvus-io.github.io/milvus-helm/
helm install milvus-operator milvus/milvus-operator --namespace milvus --create-namespace
# Create PVCs using Azure Disk
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: milvus-etcd-pvc
namespace: milvus
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 10Gi
storageClassName: managed-premium
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: milvus-minio-pvc
namespace: milvus
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 50Gi
storageClassName: managed-premium
EOF
# Deploy Milvus (same values file as AWS)
helm install milvus milvus/milvus --namespace milvus --values milvus-values.yaml --wait
Azure Phase 5: RAG Application Deployment
# Create Azure-specific application
mkdir -p rag-app-azure/src
cat > rag-app-azure/requirements.txt <<EOF
fastapi==0.104.1
uvicorn[standard]==0.24.0
pydantic==2.5.0
pymilvus==2.3.3
openai==1.3.5
azure-identity==1.14.0
python-dotenv==1.0.0
EOF
cat > rag-app-azure/src/main.py <<'PYTHON'
from fastapi import FastAPI
from pydantic import BaseModel
import os
from openai import AzureOpenAI
from pymilvus import connections, Collection
app = FastAPI(title="Enterprise RAG API - Azure")
client = AzureOpenAI(
api_key=os.getenv("OPENAI_KEY"),
api_version="2023-05-15",
azure_endpoint=os.getenv("OPENAI_ENDPOINT")
)
@app.on_event("startup")
async def startup():
connections.connect(host=os.getenv("MILVUS_HOST"), port=19530)
class QueryRequest(BaseModel):
query: str
top_k: int = 5
max_tokens: int = 1000
@app.post("/query")
async def query_rag(req: QueryRequest):
# Generate embedding with Azure OpenAI
embed_resp = client.embeddings.create(
input=req.query,
model="text-embedding-ada-002"
)
embedding = embed_resp.data[0].embedding
    # Search Milvus (load the collection and ask for the stored text field back with each hit)
    coll = Collection("rag_documents")
    coll.load()
    results = coll.search(
        [embedding], "embedding", {"metric_type": "L2"},
        limit=req.top_k, output_fields=["text"],
    )
    # Build context
    context = "\n\n".join([hit.entity.get("text") for hit in results[0]])
# Call Azure OpenAI GPT-4
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {req.query}"}
],
max_tokens=req.max_tokens
)
answer = response.choices[0].message.content
return {"answer": answer, "sources": [{"chunk": hit.entity.get("text")} for hit in results[0]]}
@app.get("/health")
async def health():
return {"status": "healthy", "platform": "Azure", "model": "GPT-4"}
PYTHON
# Build and deploy (similar to AWS)
cd rag-app-azure
podman build -t rag-app-azure:v1.0 .
oc create imagestream rag-app-azure -n rag-application
podman tag rag-app-azure:v1.0 image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-azure:v1.0
podman push image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-azure:v1.0 --tls-verify=false
cd ..
# Deploy with Azure credentials
cat <<EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: rag-app-azure
namespace: rag-application
spec:
replicas: 2
selector:
matchLabels:
app: rag-app-azure
template:
metadata:
labels:
app: rag-app-azure
spec:
serviceAccountName: openai-sa
containers:
- name: app
image: image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-azure:v1.0
ports:
- containerPort: 8000
env:
- name: MILVUS_HOST
value: "milvus.milvus.svc.cluster.local"
- name: OPENAI_ENDPOINT
valueFrom:
secretKeyRef:
name: openai-credentials
key: endpoint
- name: OPENAI_KEY
valueFrom:
secretKeyRef:
name: openai-credentials
key: key
---
apiVersion: v1
kind: Service
metadata:
name: rag-app-azure
namespace: rag-application
spec:
selector:
app: rag-app-azure
ports:
- port: 80
targetPort: 8000
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
name: rag-app-azure
namespace: rag-application
spec:
to:
kind: Service
name: rag-app-azure
tls:
termination: edge
EOF
# Get URL and test
export RAG_URL_AZURE=$(oc get route rag-app-azure -n rag-application -o jsonpath='{.spec.host}')
curl https://$RAG_URL_AZURE/health
Cost Comparison (RAG)
Monthly Cost Breakdown
| Component | AWS Cost | Azure Cost | Notes |
|---|---|---|---|
| Kubernetes Cluster | |||
| - 3x worker nodes | $1,460 (m5.2xlarge) | $1,380 (D8s_v3) | Similar specs |
| - Control plane | $0 (managed by ROSA) | $0 (managed by ARO) | Both included |
| LLM API Calls | |||
| - 1M input tokens | $3 (Claude 3.5) | $30 (GPT-4) | AWS 10x cheaper |
| - 1M output tokens | $15 (Claude 3.5) | $60 (GPT-4) | AWS 4x cheaper |
| Embeddings | |||
| - 1M tokens | $0.10 (Titan) | $0.10 (Ada-002) | Equivalent |
| Data Pipeline | |||
| - ETL service | $10 (Glue, serverless) | $15 (Data Factory) | AWS slightly cheaper |
| - Metadata catalog | $1 (Glue Catalog) | $20 (Purview min) | Azure has minimum fee |
| Object Storage | |||
| - 100 GB storage | $2.30 (S3) | $2.05 (Blob) | Equivalent |
| - Requests (100k) | $0.05 (S3) | $0.04 (Blob) | Equivalent |
| Vector Database | |||
| - Self-hosted Milvus | $0 (on cluster) | $0 (on cluster) | Same |
| Networking | |||
| - Private Link | $7.20 (PrivateLink) | $7.20 (Private Link) | Same pricing |
| - Data transfer | $5 (1 TB out) | $5 (1 TB out) | Equivalent |
| TOTAL/MONTH | $1,503.65 | $1,519.39 | AWS 1% cheaper |
Key Cost Insights:
- LLM API costs favor AWS by a significant margin (Claude is cheaper than GPT-4; see the quick calculation below)
- Azure Purview has a minimum monthly fee vs Glue's pay-per-use
- Compute costs are similar between ROSA and ARO
- Winner: AWS by ~$16/month (1%)
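To make the LLM line items concrete, here is a back-of-the-envelope calculation using the per-million-token prices from the table above:
# Price per 1M tokens, taken from the cost table above: (input, output)
PRICES = {"aws": (3, 15), "azure": (30, 60)}  # Claude 3.5 Sonnet vs GPT-4

def monthly_llm_cost(platform, input_millions, output_millions):
    in_price, out_price = PRICES[platform]
    return input_millions * in_price + output_millions * out_price

# Baseline from the table: 1M tokens each way
print(monthly_llm_cost("aws", 1, 1), monthly_llm_cost("azure", 1, 1))      # 18 vs 90

# Scenario 1 from the sensitivity analysis: 10M tokens each way
print(monthly_llm_cost("aws", 10, 10), monthly_llm_cost("azure", 10, 10))  # 180 vs 900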
Cost Optimization Strategies
AWS:
- Use Claude Instant for non-critical queries (6x cheaper)
- Leverage Glue serverless (no base cost)
- Use S3 Intelligent-Tiering for old documents
Azure:
- Use GPT-3.5-Turbo instead of GPT-4 (20x cheaper)
- Negotiate EA pricing for Azure OpenAI
- Use cool/archive tiers for old data
Project 2: Hybrid MLOps Pipeline
MLOps Platform Overview
This project demonstrates cost-optimized machine learning operations by bursting GPU training workloads to managed services while keeping inference on Kubernetes.
Architecture Comparison
AWS Architecture:
OpenShift Pipelines → ACK → SageMaker (ml.p4d.24xlarge)
↓
S3 Model Storage
↓
KServe on ROSA (CPU)
Azure Architecture:
Azure DevOps / Tekton → ASO → Azure ML (NC96ads_A100_v4)
↓
Blob Model Storage
↓
KServe on ARO (CPU)
Service Mapping
| Function | AWS Service | Azure Service | Key Difference |
|---|---|---|---|
| ML Platform | Amazon SageMaker | Azure Machine Learning | Similar capabilities |
| GPU Training | ml.p4d.24xlarge (8x A100) | NC96ads_A100_v4 (8x A100) | Same hardware |
| Spot Training | Managed Spot Training | Low Priority VMs | Different reservation models |
| Model Registry | S3 + SageMaker Registry | Blob + ML Model Registry | Different metadata approaches |
| K8s Operator | ACK (AWS Controllers) | ASO (Azure Service Operator) | Different CRD structures |
| Pipelines | OpenShift Pipelines (Tekton) | Azure DevOps / Tekton | Both support Tekton |
| Inference | KServe on ROSA | KServe on ARO | Identical |
AWS Implementation (MLOps)
AWS MLOps Phase 1: OpenShift Pipelines Setup
# Install OpenShift Pipelines Operator
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: openshift-pipelines-operator
namespace: openshift-operators
spec:
channel: latest
name: openshift-pipelines-operator-rh
source: redhat-operators
sourceNamespace: openshift-marketplace
EOF
# Create namespace
oc new-project mlops-pipelines
# Create service account
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
name: pipeline-sa
namespace: mlops-pipelines
EOF
AWS MLOps Phase 2: ACK SageMaker Controller
# Install ACK SageMaker controller
export SERVICE=sagemaker
export RELEASE_VERSION=$(curl -sL https://api.github.com/repos/aws-controllers-k8s/${SERVICE}-controller/releases/latest | grep '"tag_name":' | cut -d'"' -f4)
wget https://github.com/aws-controllers-k8s/${SERVICE}-controller/releases/download/${RELEASE_VERSION}/install.yaml
kubectl apply -f install.yaml
# Create IAM role for ACK
cat > ack-sagemaker-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sagemaker:CreateTrainingJob",
"sagemaker:DescribeTrainingJob",
"sagemaker:StopTrainingJob"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": ["s3:*"],
"Resource": "arn:aws:s3:::mlops-*"
},
{
"Effect": "Allow",
"Action": ["iam:PassRole"],
"Resource": "*",
"Condition": {
"StringEquals": {"iam:PassedToService": "sagemaker.amazonaws.com"}
}
}
]
}
EOF
aws iam create-policy --policy-name ACKSageMakerPolicy --policy-document file://ack-sagemaker-policy.json
# Create trust policy and role (similar to RAG project)
# ... (abbreviated for space)
AWS MLOps Phase 3: Training Job Example
# Create S3 buckets
export ML_BUCKET="mlops-artifacts-${ACCOUNT_ID}"
export DATA_BUCKET="mlops-datasets-${ACCOUNT_ID}"
aws s3 mb s3://$ML_BUCKET
aws s3 mb s3://$DATA_BUCKET
# Upload training script
cat > train.py <<'PYTHON'
import argparse, joblib
from sklearn.ensemble import RandomForestClassifier
import numpy as np
parser = argparse.ArgumentParser()
parser.add_argument('--n_estimators', type=int, default=100)
args = parser.parse_args()
# Training code
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, 1000)
model = RandomForestClassifier(n_estimators=args.n_estimators)
model.fit(X, y)
joblib.dump(model, '/opt/ml/model/model.joblib')
print(f"Training completed with {args.n_estimators} estimators")
PYTHON
# Create Dockerfile
cat > Dockerfile <<EOF
FROM python:3.10-slim
RUN pip install scikit-learn joblib numpy
COPY train.py /opt/ml/code/
ENTRYPOINT ["python", "/opt/ml/code/train.py"]
EOF
# Build and push to ECR
aws ecr create-repository --repository-name mlops/training
export ECR_URI="${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/mlops/training"
aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_URI
docker build -t mlops-training .
docker tag mlops-training:latest $ECR_URI:latest
docker push $ECR_URI:latest
# Create SageMaker training job via ACK
cat <<EOF | oc apply -f -
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: TrainingJob
metadata:
name: rf-training-job
namespace: mlops-pipelines
spec:
trainingJobName: rf-training-$(date +%s)
roleARN: $SAGEMAKER_ROLE_ARN
algorithmSpecification:
trainingImage: $ECR_URI:latest
trainingInputMode: File
resourceConfig:
instanceType: ml.m5.xlarge
instanceCount: 1
volumeSizeInGB: 50
outputDataConfig:
s3OutputPath: s3://$ML_BUCKET/models/
stoppingCondition:
maxRuntimeInSeconds: 3600
EOF
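The ACK controller reconciles this TrainingJob resource against SageMaker, so progress can be watched with oc get trainingjob or programmatically. A sketch using the official kubernetes Python client; the status field names mirror SageMaker's DescribeTrainingJob response and are an assumption here:
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
api = client.CustomObjectsApi()

# Fetch the custom resource created above (group/version match its apiVersion)
job = api.get_namespaced_custom_object(
    group="sagemaker.services.k8s.aws",
    version="v1alpha1",
    namespace="mlops-pipelines",
    plural="trainingjobs",
    name="rf-training-job",
)

status = job.get("status", {})
print(status.get("trainingJobStatus"), status.get("secondaryStatus"))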
Azure Implementation (MLOps)
Azure MLOps Phase 1: Azure ML Workspace
# Create ML workspace
export ML_WORKSPACE="mlops-workspace-${RANDOM}"
az ml workspace create \
--name $ML_WORKSPACE \
--resource-group $RESOURCE_GROUP \
--location $LOCATION
# Create compute cluster (spot instances)
az ml compute create \
--name gpu-cluster \
--type amlcompute \
--min-instances 0 \
--max-instances 4 \
--size Standard_NC6s_v3 \
--tier LowPriority \
--workspace-name $ML_WORKSPACE \
--resource-group $RESOURCE_GROUP
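Before wiring jobs through the operator in the next phase, it can help to confirm the workspace and compute cluster accept work via the Azure ML Python SDK (azure-ai-ml). A minimal sketch; the curated sklearn environment name is an assumption, and the training code is the same train.py used on AWS:
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, command

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<ML_WORKSPACE>",
)

job = command(
    code="./",  # directory containing train.py
    command="python train.py --n_estimators 100",
    environment="azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",  # assumed curated environment
    compute="gpu-cluster",
    display_name="rf-training-job",
)

submitted = ml_client.jobs.create_or_update(job)
print(submitted.name, submitted.studio_url)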
Azure MLOps Phase 2: Azure Service Operator
# Install ASO (it needs service principal credentials; gather and export them first)
export SUBSCRIPTION_ID=$(az account show --query id -o tsv)
export TENANT_ID=$(az account show --query tenantId -o tsv)
# CLIENT_ID and CLIENT_SECRET come from the appId/password of: az ad sp create-for-rbac --role Contributor --scopes /subscriptions/$SUBSCRIPTION_ID
helm repo add aso2 https://raw.githubusercontent.com/Azure/azure-service-operator/main/v2/charts
helm install aso2 aso2/azure-service-operator \
--create-namespace \
--namespace azureserviceoperator-system \
--set azureSubscriptionID=$SUBSCRIPTION_ID \
--set azureTenantID=$TENANT_ID \
--set azureClientID=$CLIENT_ID \
--set azureClientSecret=$CLIENT_SECRET
# Create ML job via ASO
cat <<EOF | oc apply -f -
apiVersion: machinelearningservices.azure.com/v1alpha1
kind: Job
metadata:
name: rf-training-job
namespace: mlops-pipelines
spec:
owner:
name: $ML_WORKSPACE
compute:
target: gpu-cluster
instanceCount: 1
environment:
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
codeConfiguration:
codeArtifactId: azureml://code/train-script
scoringScript: train.py
EOF
Cost Comparison (MLOps)
| Component | AWS Monthly | Azure Monthly | Notes |
|---|---|---|---|
| Training | |||
| - 4 hrs/week spot GPU | $157 (ml.p4d.24xlarge) | $153 (NC96ads_A100_v4) | Azure slightly cheaper |
| Storage | |||
| - Model artifacts (50 GB) | $1.15 (S3) | $1.00 (Blob) | Similar |
| ML Platform | |||
| - ML service | $0 (pay-per-use) | $0 (pay-per-use) | Same |
| Inference (on OpenShift) | |||
| - Shared ROSA/ARO cluster | $0 (shared) | $0 (shared) | Same |
| TOTAL/MONTH | ~$158 | ~$154 | Azure 2.5% cheaper |
Winner: Azure by $4/month (negligible difference)
Project 3: Unified Data Fabric (Data Lakehouse)
Lakehouse Platform Overview
This project implements a stateless data lakehouse where compute (Spark) can be destroyed without data loss.
Architecture Comparison
AWS Architecture:
Spark on ROSA → AWS Glue Catalog → S3 + Iceberg
Azure Architecture:
Spark on ARO → Azure Purview / Unity Catalog → ADLS Gen2 + Delta Lake
Service Mapping
| Function | AWS Service | Azure Service | Key Difference |
|---|---|---|---|
| Catalog | AWS Glue Data Catalog | Azure Purview / Unity Catalog | Glue is serverless |
| Table Format | Apache Iceberg | Delta Lake | Iceberg is cloud-agnostic |
| Storage | Amazon S3 | ADLS Gen2 | ADLS has hierarchical namespace |
| Compute | Spark on ROSA | Spark on ARO / Databricks | ARO or managed Databricks |
| Query Engine | Amazon Athena | Azure Synapse Serverless SQL | Similar serverless query |
AWS Implementation (Lakehouse)
(Due to length constraints, showing key differences only)
# Install Spark Operator (add its Helm repo first; the kubeflow/spark-operator repo URL is assumed here)
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace \
  --set sparkJobNamespace=spark-jobs
# Create Glue databases
aws glue create-database --database-input '{"Name": "bronze"}'
aws glue create-database --database-input '{"Name": "silver"}'
aws glue create-database --database-input '{"Name": "gold"}'
# Build custom Spark image with Iceberg
cat > Dockerfile <<EOF
FROM gcr.io/spark-operator/spark:v3.5.0
USER root
RUN curl -L https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.4.2/iceberg-spark-runtime-3.5_2.12-1.4.2.jar \
-o /opt/spark/jars/iceberg-spark-runtime.jar
RUN curl -L https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar \
-o /opt/spark/jars/hadoop-aws.jar
USER 185
EOF
# Deploy SparkApplication with Glue integration
cat <<EOF | oc apply -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
name: lakehouse-etl
spec:
type: Python
sparkVersion: "3.5.0"
mainApplicationFile: s3://bucket/scripts/etl.py
sparkConf:
"spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog"
"spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog"
"spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
EOF
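The etl.py referenced above is not shown; a minimal sketch of a bronze-to-silver step with Iceberg tables registered in the Glue catalog is below. It assumes the catalog is also configured with a warehouse path, and the bucket, landing path, and event_id column are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-etl").getOrCreate()

# Bronze: land raw files as an Iceberg table in the Glue database created above
raw = spark.read.json("s3a://<data-bucket>/landing/events/")
raw.writeTo("glue_catalog.bronze.events").createOrReplace()

# Silver: clean and deduplicate, then replace the table atomically (Iceberg provides the ACID guarantees)
cleaned = (
    spark.table("glue_catalog.bronze.events")
    .filter("event_id IS NOT NULL")
    .dropDuplicates(["event_id"])
)
cleaned.writeTo("glue_catalog.silver.events").createOrReplace()
Because the tables live in Glue and S3, the Spark pods can be deleted and recreated at any time without losing state.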
Azure Implementation (Lakehouse)
# Option 1: Use Azure Databricks (managed)
az databricks workspace create \
--name databricks-lakehouse \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--sku premium
# Option 2: Deploy Spark on ARO with Delta Lake
cat > Dockerfile <<EOF
FROM gcr.io/spark-operator/spark:v3.5.0
USER root
RUN curl -L https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.4.0/delta-core_2.12-2.4.0.jar \
-o /opt/spark/jars/delta-core.jar
USER 185
EOF
# Create ADLS Gen2 storage
az storage account create \
--name datalake${RANDOM} \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--kind StorageV2 \
--hierarchical-namespace true
# Deploy SparkApplication with Delta Lake
cat <<EOF | oc apply -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
name: lakehouse-etl
spec:
type: Python
sparkVersion: "3.5.0"
mainApplicationFile: abfss://container@storage.dfs.core.windows.net/scripts/etl.py
sparkConf:
"spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension"
"spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
EOF
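The equivalent etl.py on Azure writes Delta tables straight to ADLS Gen2. A minimal sketch; the storage account, container, and event_id column are placeholders, and ABFS credentials are assumed to be configured in the Spark image or pod identity:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-etl").getOrCreate()
base = "abfss://container@storage.dfs.core.windows.net"  # placeholder account/container

# Bronze: append raw files as a Delta table
raw = spark.read.json(f"{base}/landing/events/")
raw.write.format("delta").mode("append").save(f"{base}/bronze/events")

# Silver: deduplicate and overwrite atomically (Delta provides the ACID guarantees)
cleaned = (
    spark.read.format("delta").load(f"{base}/bronze/events")
    .filter("event_id IS NOT NULL")
    .dropDuplicates(["event_id"])
)
cleaned.write.format("delta").mode("overwrite").save(f"{base}/silver/events")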
Cost Comparison (Lakehouse)
| Component | AWS Monthly | Azure Monthly | Notes |
|---|---|---|---|
| Compute | |||
| - Spark cluster (3x m5.4xlarge) | $1,500 | $1,450 (D16s_v3) | Similar |
| Metadata Catalog | |||
| - Catalog service | $10 (Glue, 1M requests) | $20 (Purview minimum) | AWS cheaper |
| Storage | |||
| - Data lake (1 TB) | $23 (S3) | $18 (ADLS Gen2 hot) | Azure cheaper |
| Query Engine | |||
| - Serverless queries (1 TB) | $5 (Athena) | $5 (Synapse serverless) | Same |
| TOTAL/MONTH | $1,538 | $1,493 | Azure 3% cheaper |
Winner: Azure by $45/month (3%)
Total Cost of Ownership Analysis
Combined Monthly Costs
| Project | AWS Total | Azure Total | Difference |
|---|---|---|---|
| RAG Platform | $1,504 | $1,519 | AWS -$15 (-1%) |
| MLOps Pipeline | $158 | $154 | Azure -$4 (-2.5%) |
| Data Lakehouse | $1,538 | $1,493 | Azure -$45 (-3%) |
| TOTAL | $3,200/month | $3,166/month | Azure -$34/month (-1%) |
Annual Projection
- AWS: $3,200 × 12 = $38,400/year
- Azure: $3,166 × 12 = $37,992/year
- Savings with Azure: $408/year (1%)
Cost Sensitivity Analysis
Scenario 1: High LLM Usage (10M tokens/month)
- AWS: +$180 (Claude cheaper)
- Azure: +$900 (GPT-4 more expensive)
- AWS wins by $720/month
Scenario 2: Heavy ML Training (20 hrs/week GPU)
- AWS: +$785
- Azure: +$765
- Azure wins by $20/month
Scenario 3: Large Data Lake (10 TB storage)
- AWS: +$230
- Azure: +$180
- Azure wins by $50/month
Conclusion: AWS is better for AI-heavy workloads due to cheaper LLM pricing. Azure is better for data-heavy workloads due to cheaper storage.
Multi-Cloud Integration Patterns
Unified RBAC Strategy
Both platforms support similar pod-level identity:
AWS (IRSA):
apiVersion: v1
kind: ServiceAccount
metadata:
name: app-sa
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/AppRole
Azure (Workload Identity):
apiVersion: v1
kind: ServiceAccount
metadata:
name: app-sa
annotations:
azure.workload.identity/client-id: CLIENT_ID
Multi-Cloud Disaster Recovery
Deploy identical workloads on both platforms for DR:
# Primary: AWS
# Standby: Azure
# Failover time: < 5 minutes with DNS switch
# Shared components:
# - OpenShift APIs (same)
# - Application code (same)
# - Milvus deployment (same)
# Platform-specific:
# - Cloud credentials
# - Storage endpoints
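A DNS-based failover can be as simple as a health probe plus a record update. A sketch using boto3 against Route 53; the hosted zone ID, record name, and route hostnames are placeholders, and Azure DNS could be swapped in if that is where the zone lives:
import boto3
import requests

PRIMARY = "https://rag-app-aws.apps.<rosa-domain>"   # placeholder route hostnames
STANDBY = "rag-app-azure.apps.<aro-domain>"

def healthy(url: str) -> bool:
    try:
        return requests.get(f"{url}/health", timeout=3).status_code == 200
    except requests.RequestException:
        return False

def fail_over_to(target: str) -> None:
    # Point the public CNAME at the standby cluster's route
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId="<HOSTED_ZONE_ID>",
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "rag.example.com",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": target}],
            },
        }]},
    )

if not healthy(PRIMARY):
    fail_over_to(STANDBY)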
Migration Strategies
AWS to Azure Migration
Phase 1: Data Migration
# Use AzCopy for S3 → Blob migration
azcopy copy \
"https://s3.amazonaws.com/bucket/*" \
"https://storageaccount.blob.core.windows.net/container" \
--recursive
Phase 2: Metadata Migration
- Export Glue Catalog to JSON (see the export sketch below)
- Import to Azure Purview via API
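A minimal export sketch using boto3; the database list matches the ones created in this guide, and the Purview import side is left to its REST API:
import json
import boto3

glue = boto3.client("glue", region_name="us-east-1")
export = {}

for database in ["rag_documents_db", "bronze", "silver", "gold"]:
    tables = []
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        tables.extend(page["TableList"])
    export[database] = tables

# Timestamps in the Glue response aren't JSON-serializable, hence default=str
with open("glue-catalog-export.json", "w") as f:
    json.dump(export, f, indent=2, default=str)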
Phase 3: Application Migration
- Update environment variables
- Switch cloud credentials
- Deploy to ARO
Azure to AWS Migration
Similar process in reverse:
# Use AWS DataSync for Blob → S3
aws datasync create-task \
--source-location-arn arn:aws:datasync:...:location/azure-blob \
--destination-location-arn arn:aws:datasync:...:location/s3-bucket
Resource Cleanup
AWS Complete Cleanup
#!/bin/bash
# Complete AWS resource cleanup
# RAG Platform
rosa delete cluster --cluster=rag-platform-aws --yes
aws s3 rm s3://rag-documents-${ACCOUNT_ID} --recursive
aws s3 rb s3://rag-documents-${ACCOUNT_ID}
aws glue delete-crawler --name rag-document-crawler
aws glue delete-database --name rag_documents_db
aws ec2 delete-vpc-endpoints --vpc-endpoint-ids $BEDROCK_VPC_ENDPOINT
aws iam delete-role --role-name rosa-bedrock-access
aws iam delete-policy --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/BedrockInvokePolicy
# MLOps Platform
aws s3 rm s3://mlops-artifacts-${ACCOUNT_ID} --recursive
aws s3 rm s3://mlops-datasets-${ACCOUNT_ID} --recursive
aws s3 rb s3://mlops-artifacts-${ACCOUNT_ID}
aws s3 rb s3://mlops-datasets-${ACCOUNT_ID}
aws ecr delete-repository --repository-name mlops/training --force
aws iam delete-role --role-name ACKSageMakerControllerRole
# Data Lakehouse
aws s3 rm s3://lakehouse-data-${ACCOUNT_ID} --recursive
aws s3 rb s3://lakehouse-data-${ACCOUNT_ID}
for db in bronze silver gold; do
aws glue delete-database --name $db
done
aws iam delete-role --role-name SparkGlueCatalogRole
echo "AWS cleanup complete"
Azure Complete Cleanup
#!/bin/bash
# Complete Azure resource cleanup
# Delete all resources in resource group
az group delete --name rag-platform-rg --yes --no-wait
# This deletes:
# - ARO cluster
# - Azure OpenAI service
# - Storage accounts
# - Data Factory
# - Azure ML workspace
# - All networking components
echo "Azure cleanup complete (deleting in background)"
Troubleshooting
Common Multi-Cloud Issues
Issue: Cross-Cloud Latency
Symptoms: Slow API responses when accessing cloud services
AWS Solution:
# Verify VPC endpoint is in correct AZ
aws ec2 describe-vpc-endpoints --vpc-endpoint-ids $BEDROCK_VPC_ENDPOINT
# Check PrivateLink latency
oc run test --rm -it --image=curlimages/curl -- \
  curl -s -o /dev/null -w "total: %{time_total}s\n" https://bedrock-runtime.us-east-1.amazonaws.com
Azure Solution:
# Verify Private Link in same region as ARO
az network private-endpoint show --name openai-private-endpoint --resource-group $RESOURCE_GROUP
# Test latency
oc run test --rm -it --image=curlimages/curl -- \
  curl -s -o /dev/null -w "total: %{time_total}s\n" https://${OPENAI_NAME}.openai.azure.com
Issue: Authentication Failures
AWS IRSA Troubleshooting:
# Verify OIDC provider
rosa describe cluster -c $CLUSTER_NAME -o json | jq .aws.sts.oidc_endpoint_url
# Test token
kubectl create token bedrock-sa -n rag-application
# Verify IAM trust policy
aws iam get-role --role-name rosa-bedrock-access
Azure Workload Identity Troubleshooting:
# Verify federated credential
az identity federated-credential show \
--name rag-app-federated \
--identity-name rag-app-identity \
--resource-group $RESOURCE_GROUP
# Test managed identity
az account get-access-token --resource https://cognitiveservices.azure.com
Conclusion
Platform Selection Recommendations
Choose AWS if you:
- Prioritize AI/ML model diversity (Bedrock marketplace)
- Have variable, unpredictable workloads (serverless pricing)
- Value open-source ecosystem compatibility
- Need global multi-region deployments
- Want lower LLM API costs
Choose Azure if you:
- Have existing Microsoft enterprise agreements
- Need Windows container support
- Require hybrid cloud with on-premises
- Have Microsoft 365 / Teams integration requirements
- Want slightly lower infrastructure costs
Choose Multi-Cloud if you:
- Need disaster recovery across providers
- Want to avoid vendor lock-in
- Have regulatory requirements for redundancy
- Can manage operational complexity
Final Cost Summary
For the three projects combined:
- AWS Total: $3,200/month ($38,400/year)
- Azure Total: $3,166/month ($37,992/year)
- Difference: 1% ($408/year favoring Azure)
Verdict: Costs are effectively equivalent. Choose based on ecosystem fit, not cost.
Key Technical Takeaways
- OpenShift provides platform portability - same APIs on both clouds
- Cloud-specific services (Bedrock, Azure OpenAI) require different code
- Storage abstractions (S3 vs Blob) are the main migration challenge
- IAM patterns (IRSA vs Workload Identity) are conceptually similar
Next Steps
To Expand This Implementation:
- Add GitOps with ArgoCD for both platforms
- Implement cross-cloud disaster recovery
- Add comprehensive monitoring with Grafana
- Automate deployments with Terraform/Bicep
- Implement cost governance and FinOps
Thank you for reading this comprehensive multi-cloud implementation guide!