In this tutorial, I'll show you how to build a backend application using Azure OpenAI's large language models (LLMs) and introduce what's new with DeepSeek's LLM. It's simpler than it might sound!
Important Notes:
The main difference between OpenAI and DeepSeek lies not in the setup but in the performance, so feel free to substitute "DeepSeek" wherever you see "OpenAI" in this blog entry.
Table of Contents
- Introduction
- Platform Overview
- Cloud Platform Decision Matrix
- Prerequisites
- Project 1: Enterprise-Grade RAG Platform
- Project 2: Hybrid MLOps Pipeline
- Project 3: Unified Data Fabric (Data Lakehouse)
- Multi-Cloud Integration Patterns
- Total Cost of Ownership Analysis
- Migration Strategies
- Resource Cleanup
- Troubleshooting
Introduction
Modern enterprises face a critical decision when building cloud-native AI and data platforms: AWS or Azure? This comprehensive guide demonstrates how to build three production-grade platforms on both cloud providers, providing side-by-side comparisons to help you make informed decisions.
What You'll Learn
This guide shows you how to implement identical architectures on both AWS and Azure:
Project 1: Enterprise RAG Platform
- AWS: Amazon Bedrock + AWS Glue + Milvus on ROSA
- Azure: Azure OpenAI + Azure Data Factory + Milvus on ARO
- Privacy-first Retrieval-Augmented Generation
- Vector database integration
- Secure private connectivity
Project 2: Hybrid MLOps Pipeline
- AWS: SageMaker + OpenShift Pipelines + KServe on ROSA
- Azure: Azure ML + Azure DevOps + KServe on ARO
- Cost-optimized GPU training
- Kubernetes-native serving
- End-to-end automation
Project 3: Unified Data Fabric
- AWS: Apache Spark + AWS Glue Catalog + S3 + Iceberg
- Azure: Apache Spark + Azure Purview + ADLS Gen2 + Delta Lake
- Stateless compute architecture
- Medallion data organization
- ACID transactions
Why This Comparison Matters
Choosing the right cloud platform impacts:
- Total Cost: 20-40% difference in monthly spending
- Developer Productivity: Ecosystem integration and tooling
- Vendor Lock-in: Portability and migration flexibility
- Enterprise Integration: Existing infrastructure and contracts
Platform Overview
Unified Multi-Cloud Architecture
Both implementations follow the same architectural patterns while leveraging platform-specific managed services:
┌─────────────────────────────────────────────────────────────────────┐
│ Enterprise Organization │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Red Hat OpenShift (ROSA on AWS / ARO on Azure) │ │
│ │ - Unified Control Plane │ │
│ │ - Application Orchestration │ │
│ │ - Developer Platform │ │
│ └───────────────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ │ │ │ │
│ ┌───────────▼─────┐ ┌──────▼──────┐ ┌─────▼──────────┐ │
│ │ RAG Project │ │MLOps Project│ │ Data Lakehouse │ │
│ │ │ │ │ │ │ │
│ │ AWS: │ │ AWS: │ │ AWS: │ │
│ │ - Bedrock │ │ - SageMaker │ │ - Glue Catalog │ │
│ │ - Glue ETL │ │ - ACK │ │ - S3 + Iceberg │ │
│ │ │ │ │ │ │ │
│ │ Azure: │ │ Azure: │ │ Azure: │ │
│ │ - OpenAI │ │ - Azure ML │ │ - Purview │ │
│ │ - Data Factory │ │ - ASO │ │ - ADLS + Delta │ │
│ └─────────────────┘ └─────────────┘ └────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Cloud Services Layer │ │
│ │ AWS: IAM + S3 + PrivateLink + CloudWatch │ │
│ │ Azure: AAD + Blob + Private Link + Monitor │ │
│ └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Technology Stack: AWS vs Azure
| Component | AWS Solution | Azure Solution | OpenShift Platform |
|---|---|---|---|
| Kubernetes | ROSA (Red Hat OpenShift on AWS) | ARO (Azure Red Hat OpenShift) | Both use Red Hat OpenShift |
| LLM Platform | Amazon Bedrock (Claude 3.5) | Azure OpenAI Service (GPT-4) | Same API patterns |
| ML Training | Amazon SageMaker | Azure Machine Learning | Both burst from OpenShift |
| Data Catalog | AWS Glue Data Catalog | Azure Purview / Unity Catalog | Unified metadata layer |
| Object Storage | Amazon S3 | Azure Data Lake Storage Gen2 | S3-compatible APIs |
| Table Format | Apache Iceberg | Delta Lake | Open source options |
| Vector DB | Milvus (self-hosted) | Milvus / Cosmos DB | Same deployment |
| ETL Service | AWS Glue (serverless) | Azure Data Factory (serverless) | Similar orchestration |
| CI/CD | OpenShift Pipelines (Tekton) | Azure DevOps / Tekton | Kubernetes-native |
| K8s Integration | AWS Controllers (ACK) | Azure Service Operator (ASO) | Custom resources |
| Private Network | AWS PrivateLink | Azure Private Link | VPC/VNet integration |
| Authentication | IRSA (IAM for Service Accounts) | Workload Identity | Pod-level identity |
Cloud Platform Decision Matrix
When to Choose AWS
Best For:
- AI/ML Innovation: Amazon Bedrock offers broader model selection (Claude, Llama 2, Stable Diffusion)
- Serverless-First: AWS Glue, Lambda, and Bedrock have no minimum fees
- Startup/Scale-up: Pay-as-you-go pricing favors variable workloads
- Data Engineering: S3 + Glue + Athena is industry standard
- Multi-Region: Better global infrastructure coverage
AWS Advantages:
- Superior AI model marketplace (Anthropic, Cohere, AI21, Meta)
- True serverless data catalog (Glue) with no base costs
- More mature spot instance ecosystem for cost savings
- Better S3 ecosystem and tooling integration
- Stronger open-source community adoption
When to Choose Azure
Best For:
- Microsoft Ecosystem: Tight integration with Office 365, Teams, Power Platform
- Enterprise Windows: Native Windows container support
- Hybrid Cloud: Azure Arc and on-premises integration
- Enterprise Agreements: Existing Microsoft licensing discounts
- Regulated Industries: Better compliance certifications in some regions
Azure Advantages:
- Seamless Microsoft 365 and Active Directory integration
- Superior Windows and .NET container support
- Better hybrid cloud story with Azure Arc
- Integrated Azure Synapse for unified analytics
- Potentially lower costs with existing EA agreements
Decision Criteria Scorecard
| Criteria | AWS Score | Azure Score | Weight | Notes |
|---|---|---|---|---|
| AI Model Selection | 9/10 | 7/10 | High | AWS Bedrock has more models |
| ML Training Cost | 8/10 | 8/10 | High | Equivalent spot pricing |
| Data Lake Maturity | 10/10 | 8/10 | High | S3 is industry standard |
| Serverless Pricing | 9/10 | 7/10 | Medium | AWS Glue has no minimums |
| Enterprise Integration | 7/10 | 10/10 | High | Azure wins for Microsoft shops |
| Hybrid Cloud | 7/10 | 9/10 | Medium | Azure Arc is superior |
| Developer Ecosystem | 9/10 | 7/10 | Medium | Larger open-source community |
| Compliance Certifications | 9/10 | 9/10 | High | Equivalent for most use cases |
| Global Infrastructure | 10/10 | 8/10 | Low | AWS has more regions |
| Pricing Transparency | 8/10 | 7/10 | Medium | AWS pricing is clearer |
Total Weighted Score: AWS: 8.5/10 | Azure: 8.1/10
Verdict: Choose based on your organization's existing ecosystem. Both platforms are capable; the difference is in integration, not capability.
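If your priorities differ, you can re-weight the scorecard yourself. Here's a small Python sketch; the numeric weights (High=3, Medium=2, Low=1) are my own assumption, though they happen to reproduce the totals above:
# Weighted scorecard helper: adjust scores and weights for your own organization.
# The High=3 / Medium=2 / Low=1 mapping is an assumed convention for illustration.
WEIGHTS = {"High": 3, "Medium": 2, "Low": 1}

criteria = [
    # (name, aws_score, azure_score, weight)
    ("AI Model Selection",          9,  7, "High"),
    ("ML Training Cost",            8,  8, "High"),
    ("Data Lake Maturity",         10,  8, "High"),
    ("Serverless Pricing",          9,  7, "Medium"),
    ("Enterprise Integration",      7, 10, "High"),
    ("Hybrid Cloud",                7,  9, "Medium"),
    ("Developer Ecosystem",         9,  7, "Medium"),
    ("Compliance Certifications",   9,  9, "High"),
    ("Global Infrastructure",      10,  8, "Low"),
    ("Pricing Transparency",        8,  7, "Medium"),
]

def weighted_total(col: int) -> float:
    """col=1 for the AWS column, col=2 for the Azure column."""
    total_weight = sum(WEIGHTS[row[3]] for row in criteria)
    return sum(row[col] * WEIGHTS[row[3]] for row in criteria) / total_weight

print(f"AWS:   {weighted_total(1):.1f}/10")
print(f"Azure: {weighted_total(2):.1f}/10")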
Prerequisites
Common Prerequisites (Both Platforms)
Required Accounts:
- Cloud platform account with administrative access
- Red Hat Account with OpenShift subscription
- Credit card for cloud charges
Required Tools (install on your workstation):
# Common tools for both platforms
# OpenShift CLI (oc)
wget https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz
tar -xvf openshift-client-linux.tar.gz
sudo mv oc kubectl /usr/local/bin/
oc version
# Helm (v3)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version
# Tekton CLI
curl -LO https://github.com/tektoncd/cli/releases/download/v0.33.0/tkn_0.33.0_Linux_x86_64.tar.gz
tar xvzf tkn_0.33.0_Linux_x86_64.tar.gz
sudo mv tkn /usr/local/bin/
tkn version
# Python 3.11+
python3 --version
# Container tools (Docker or Podman)
podman --version
AWS-Specific Prerequisites
# AWS CLI (v2)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws --version
# ROSA CLI
wget https://mirror.openshift.com/pub/openshift-v4/clients/rosa/latest/rosa-linux.tar.gz
tar -xvf rosa-linux.tar.gz
sudo mv rosa /usr/local/bin/rosa
rosa version
# Configure AWS
aws configure
aws sts get-caller-identity
# Initialize ROSA
rosa login
rosa verify quota
rosa verify permissions
rosa init
Azure-Specific Prerequisites
# Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az --version
# ARO extension
az extension add --name aro --index https://az.aroapp.io/stable
# Azure CLI login
az login
az account show
# Register required providers
az provider register --namespace Microsoft.RedHatOpenShift --wait
az provider register --namespace Microsoft.Compute --wait
az provider register --namespace Microsoft.Storage --wait
az provider register --namespace Microsoft.Network --wait
Service Quotas Verification
AWS:
# EC2 vCPU quota
aws service-quotas get-service-quota \
--service-code ec2 \
--quota-code L-1216C47A \
--region us-east-1
# SageMaker training instances
aws service-quotas get-service-quota \
--service-code sagemaker \
--quota-code L-2E8D9C5E \
--region us-east-1
Azure:
# Check compute quota
az vm list-usage --location eastus --output table
# Check ML compute quota
az ml compute list-usage --location eastus
Project 1: Enterprise-Grade RAG Platform
RAG Platform Overview
This project implements a privacy-first Retrieval-Augmented Generation (RAG) system. Both AWS and Azure implementations achieve the same functionality but use platform-specific managed services.
Architecture Comparison
AWS Architecture:
ROSA → AWS PrivateLink → Amazon Bedrock (Claude 3.5)
↓
Milvus Vector DB (on ROSA)
↓
AWS Glue ETL → S3
Azure Architecture:
ARO → Azure Private Link → Azure OpenAI (GPT-4)
↓
Milvus Vector DB (on ARO)
↓
Azure Data Factory → Blob Storage
Side-by-Side Service Mapping
| Function | AWS Service | Azure Service | Implementation Difference |
|---|---|---|---|
| LLM API | Amazon Bedrock | Azure OpenAI Service | Different model families |
| Private Network | AWS PrivateLink | Azure Private Link | Similar configuration |
| ETL Pipeline | AWS Glue (Serverless) | Azure Data Factory | Different pricing models |
| Metadata | AWS Glue Data Catalog | Azure Purview | Different scopes |
| Storage | Amazon S3 | Azure Blob Storage / ADLS Gen2 | S3 API vs Blob API |
| Vector DB | Milvus on ROSA | Milvus on ARO / Cosmos DB | Self-hosted vs managed option |
| Auth | IRSA (IAM Roles) | Workload Identity | Similar pod-level identity |
| Embedding | Titan Embeddings | OpenAI Embeddings | Different dimensions |
AWS Implementation (RAG)
AWS Phase 1: ROSA Cluster Setup
# Set environment variables
export CLUSTER_NAME="rag-platform-aws"
export AWS_REGION="us-east-1"
export MACHINE_TYPE="m5.2xlarge"
export COMPUTE_NODES=3
# Create ROSA cluster (takes ~40 minutes)
rosa create cluster \
--cluster-name $CLUSTER_NAME \
--region $AWS_REGION \
--multi-az \
--compute-machine-type $MACHINE_TYPE \
--compute-nodes $COMPUTE_NODES \
--machine-cidr 10.0.0.0/16 \
--service-cidr 172.30.0.0/16 \
--pod-cidr 10.128.0.0/14 \
--host-prefix 23 \
--yes
# Monitor installation
rosa logs install --cluster=$CLUSTER_NAME --watch
# Create admin and connect
rosa create admin --cluster=$CLUSTER_NAME
oc login <api-url> --username cluster-admin --password <password>
# Create namespaces
oc new-project redhat-ods-applications
oc new-project rag-application
oc new-project milvus
AWS Phase 2: Amazon Bedrock via PrivateLink
# Get ROSA VPC details
export ROSA_VPC_ID=$(aws ec2 describe-vpcs \
--filters "Name=tag:Name,Values=*${CLUSTER_NAME}*" \
--query 'Vpcs[0].VpcId' \
--output text \
--region $AWS_REGION)
export PRIVATE_SUBNET_IDS=$(aws ec2 describe-subnets \
--filters "Name=vpc-id,Values=$ROSA_VPC_ID" "Name=tag:Name,Values=*private*" \
--query 'Subnets[*].SubnetId' \
--output text \
--region $AWS_REGION)
# Create VPC Endpoint Security Group
export VPC_ENDPOINT_SG=$(aws ec2 create-security-group \
--group-name bedrock-vpc-endpoint-sg \
--description "Security group for Bedrock VPC endpoint" \
--vpc-id $ROSA_VPC_ID \
--region $AWS_REGION \
--output text \
--query 'GroupId')
# Allow HTTPS from ROSA nodes
aws ec2 authorize-security-group-ingress \
--group-id $VPC_ENDPOINT_SG \
--protocol tcp \
--port 443 \
--cidr 10.0.0.0/16 \
--region $AWS_REGION
# Create Bedrock VPC Endpoint
export BEDROCK_VPC_ENDPOINT=$(aws ec2 create-vpc-endpoint \
--vpc-id $ROSA_VPC_ID \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.${AWS_REGION}.bedrock-runtime \
--subnet-ids $PRIVATE_SUBNET_IDS \
--security-group-ids $VPC_ENDPOINT_SG \
--private-dns-enabled \
--region $AWS_REGION \
--output text \
--query 'VpcEndpoint.VpcEndpointId')
# Wait for availability
aws ec2 wait vpc-endpoint-available \
--vpc-endpoint-ids $BEDROCK_VPC_ENDPOINT \
--region $AWS_REGION
# Create IAM role for Bedrock access (IRSA pattern)
export OIDC_PROVIDER=$(rosa describe cluster -c $CLUSTER_NAME -o json | jq -r .aws.sts.oidc_endpoint_url | sed 's|https://||')
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
cat > bedrock-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel",
"bedrock:InvokeModelWithResponseStream"
],
"Resource": "arn:aws:bedrock:${AWS_REGION}::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0"
}
]
}
EOF
aws iam create-policy \
--policy-name BedrockInvokePolicy \
--policy-document file://bedrock-policy.json
cat > trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"${OIDC_PROVIDER}:sub": "system:serviceaccount:rag-application:bedrock-sa"
}
}
}
]
}
EOF
export BEDROCK_ROLE_ARN=$(aws iam create-role \
--role-name rosa-bedrock-access \
--assume-role-policy-document file://trust-policy.json \
--query 'Role.Arn' \
--output text)
aws iam attach-role-policy \
--role-name rosa-bedrock-access \
--policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/BedrockInvokePolicy
# Create Kubernetes service account
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
name: bedrock-sa
namespace: rag-application
annotations:
eks.amazonaws.com/role-arn: $BEDROCK_ROLE_ARN
EOF
AWS Phase 3: AWS Glue Data Pipeline
# Create S3 bucket
export BUCKET_NAME="rag-documents-${ACCOUNT_ID}"
aws s3 mb s3://$BUCKET_NAME --region $AWS_REGION
# Enable versioning
aws s3api put-bucket-versioning \
--bucket $BUCKET_NAME \
--versioning-configuration Status=Enabled \
--region $AWS_REGION
# Create folder structure
aws s3api put-object --bucket $BUCKET_NAME --key raw-documents/
aws s3api put-object --bucket $BUCKET_NAME --key processed-documents/
aws s3api put-object --bucket $BUCKET_NAME --key embeddings/
# Create Glue IAM role
cat > glue-trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {"Service": "glue.amazonaws.com"},
"Action": "sts:AssumeRole"
}
]
}
EOF
aws iam create-role \
--role-name AWSGlueServiceRole-RAG \
--assume-role-policy-document file://glue-trust-policy.json
aws iam attach-role-policy \
--role-name AWSGlueServiceRole-RAG \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
# Create S3 access policy
cat > glue-s3-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
"Resource": "arn:aws:s3:::${BUCKET_NAME}/*"
},
{
"Effect": "Allow",
"Action": ["s3:ListBucket"],
"Resource": "arn:aws:s3:::${BUCKET_NAME}"
}
]
}
EOF
aws iam put-role-policy \
--role-name AWSGlueServiceRole-RAG \
--policy-name S3Access \
--policy-document file://glue-s3-policy.json
# Create Glue database
aws glue create-database \
--database-input '{
"Name": "rag_documents_db",
"Description": "RAG document metadata"
}' \
--region $AWS_REGION
# Create Glue crawler
aws glue create-crawler \
--name rag-document-crawler \
--role arn:aws:iam::${ACCOUNT_ID}:role/AWSGlueServiceRole-RAG \
--database-name rag_documents_db \
--targets '{
"S3Targets": [{"Path": "s3://'$BUCKET_NAME'/raw-documents/"}]
}' \
--region $AWS_REGION
AWS Phase 4: Milvus Vector Database
# Install Milvus using Helm
helm repo add milvus https://milvus-io.github.io/milvus-helm/
helm repo update
helm install milvus-operator milvus/milvus-operator \
--namespace milvus \
--create-namespace
# Create PVCs
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: milvus-etcd-pvc
namespace: milvus
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 10Gi
storageClassName: gp3-csi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: milvus-minio-pvc
namespace: milvus
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 50Gi
storageClassName: gp3-csi
EOF
# Deploy Milvus
cat > milvus-values.yaml <<EOF
cluster:
enabled: true
service:
type: ClusterIP
port: 19530
standalone:
replicas: 1
resources:
limits:
cpu: "4"
memory: 8Gi
requests:
cpu: "2"
memory: 4Gi
etcd:
persistence:
enabled: true
existingClaim: milvus-etcd-pvc
minio:
persistence:
enabled: true
existingClaim: milvus-minio-pvc
EOF
helm install milvus milvus/milvus \
--namespace milvus \
--values milvus-values.yaml \
--wait
# Get Milvus endpoint
export MILVUS_HOST=$(oc get svc milvus -n milvus -o jsonpath='{.spec.clusterIP}')
export MILVUS_PORT=19530
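The application in the next phase queries a rag_documents collection, but this guide doesn't show the ingestion step. Here's a minimal, hypothetical sketch of creating that collection and loading Titan embeddings with pymilvus and boto3 (field names, chunk texts, and index parameters are illustrative assumptions; run it somewhere that can reach both Milvus and Bedrock):
# Hypothetical ingestion helper: creates the "rag_documents" collection the API expects
# and inserts Titan embeddings for a few text chunks. Schema and index params only need
# to match what the query path below uses (1024-dim vectors, L2 metric, "text" field).
import json, os
import boto3
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType, utility

MILVUS_HOST = os.getenv("MILVUS_HOST", "milvus.milvus.svc.cluster.local")
bedrock = boto3.client("bedrock-runtime", region_name=os.getenv("AWS_REGION", "us-east-1"))

def embed(text: str) -> list[float]:
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text, "dimensions": 1024}),
    )
    return json.loads(resp["body"].read())["embedding"]

connections.connect(host=MILVUS_HOST, port=19530)

if not utility.has_collection("rag_documents"):
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema("text", DataType.VARCHAR, max_length=4096),
        FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=1024),
    ])
    coll = Collection("rag_documents", schema)
    coll.create_index("embedding", {"index_type": "IVF_FLAT",
                                    "metric_type": "L2",
                                    "params": {"nlist": 128}})
else:
    coll = Collection("rag_documents")

chunks = ["Example document chunk one.", "Example document chunk two."]
coll.insert([chunks, [embed(c) for c in chunks]])
coll.load()  # make the collection searchable before the API queries it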
AWS Phase 5: RAG Application Deployment
# Create application code
mkdir -p rag-app-aws/src
cat > rag-app-aws/requirements.txt <<EOF
fastapi==0.104.1
uvicorn[standard]==0.24.0
pydantic==2.5.0
pymilvus==2.3.3
boto3==1.29.7
python-dotenv==1.0.0
EOF
# Create FastAPI application (abbreviated for space)
cat > rag-app-aws/src/main.py <<'PYTHON'
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import os, json, boto3
from pymilvus import connections, Collection
app = FastAPI(title="Enterprise RAG API - AWS")
MILVUS_HOST = os.getenv("MILVUS_HOST")
AWS_REGION = os.getenv("AWS_REGION", "us-east-1")
BEDROCK_MODEL = "anthropic.claude-3-5-sonnet-20241022-v2:0"
bedrock = boto3.client('bedrock-runtime', region_name=AWS_REGION)
@app.on_event("startup")
async def startup():
connections.connect(host=MILVUS_HOST, port=19530)
class QueryRequest(BaseModel):
query: str
top_k: int = 5
max_tokens: int = 1000
@app.post("/query")
async def query_rag(req: QueryRequest):
# Generate embedding with Bedrock Titan
embed_resp = bedrock.invoke_model(
modelId="amazon.titan-embed-text-v2:0",
body=json.dumps({"inputText": req.query, "dimensions": 1024})
)
embedding = json.loads(embed_resp['body'].read())['embedding']
# Search Milvus
coll = Collection("rag_documents")
    results = coll.search([embedding], "embedding", {"metric_type": "L2"}, limit=req.top_k, output_fields=["text"])  # output_fields so hits include the chunk text
# Build context
context = "\n\n".join([hit.entity.get("text") for hit in results[0]])
# Call Bedrock Claude
prompt = f"Context:\n{context}\n\nQuestion: {req.query}\n\nAnswer:"
response = bedrock.invoke_model(
modelId=BEDROCK_MODEL,
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": req.max_tokens,
"messages": [{"role": "user", "content": prompt}]
})
)
answer = json.loads(response['body'].read())['content'][0]['text']
return {"answer": answer, "sources": [{"chunk": hit.entity.get("text")} for hit in results[0]]}
@app.get("/health")
async def health():
return {"status": "healthy", "platform": "AWS", "model": "Claude 3.5 Sonnet"}
PYTHON
# Create Dockerfile
cat > rag-app-aws/Dockerfile <<EOF
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ ./src/
EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
EOF
# Build and deploy
cd rag-app-aws
podman build -t rag-app-aws:v1.0 .
oc create imagestream rag-app-aws -n rag-application
podman tag rag-app-aws:v1.0 image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-aws:v1.0
podman push image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-aws:v1.0 --tls-verify=false
cd ..
# Deploy to OpenShift
cat <<EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: rag-app-aws
namespace: rag-application
spec:
replicas: 2
selector:
matchLabels:
app: rag-app-aws
template:
metadata:
labels:
app: rag-app-aws
spec:
serviceAccountName: bedrock-sa
containers:
- name: app
image: image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-aws:v1.0
ports:
- containerPort: 8000
env:
- name: MILVUS_HOST
value: "$MILVUS_HOST"
- name: AWS_REGION
value: "$AWS_REGION"
---
apiVersion: v1
kind: Service
metadata:
name: rag-app-aws
namespace: rag-application
spec:
selector:
app: rag-app-aws
ports:
- port: 80
targetPort: 8000
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
name: rag-app-aws
namespace: rag-application
spec:
to:
kind: Service
name: rag-app-aws
tls:
termination: edge
EOF
# Get URL and test
export RAG_URL_AWS=$(oc get route rag-app-aws -n rag-application -o jsonpath='{.spec.host}')
curl https://$RAG_URL_AWS/health
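Beyond the health check, you can smoke-test the /query endpoint from your workstation. A quick sketch using the requests library (the example question is obviously a placeholder):
# Hypothetical smoke test for the RAG endpoint deployed above.
import os
import requests

rag_url = os.environ["RAG_URL_AWS"]  # exported in the previous step
resp = requests.post(
    f"https://{rag_url}/query",
    json={"query": "What does our onboarding policy say about laptops?", "top_k": 3},
    timeout=60,
)
resp.raise_for_status()
body = resp.json()
print(body["answer"])
print(f"{len(body['sources'])} source chunks returned")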
Azure Implementation (RAG)
Azure Phase 1: ARO Cluster Setup
# Set environment variables
export CLUSTER_NAME="rag-platform-azure"
export LOCATION="eastus"
export RESOURCE_GROUP="rag-platform-rg"
# Create resource group
az group create \
--name $RESOURCE_GROUP \
--location $LOCATION
# Create virtual network
az network vnet create \
--resource-group $RESOURCE_GROUP \
--name aro-vnet \
--address-prefixes 10.0.0.0/22
# Create master subnet
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--vnet-name aro-vnet \
--name master-subnet \
--address-prefixes 10.0.0.0/23 \
--service-endpoints Microsoft.ContainerRegistry
# Create worker subnet
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--vnet-name aro-vnet \
--name worker-subnet \
--address-prefixes 10.0.2.0/23 \
--service-endpoints Microsoft.ContainerRegistry
# Disable subnet private endpoint policies
az network vnet subnet update \
--name master-subnet \
--resource-group $RESOURCE_GROUP \
--vnet-name aro-vnet \
--disable-private-link-service-network-policies true
# Create ARO cluster (takes ~35 minutes)
az aro create \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--vnet aro-vnet \
--master-subnet master-subnet \
--worker-subnet worker-subnet \
--worker-count 3 \
--worker-vm-size Standard_D8s_v3
# Get credentials
export ARO_URL=$(az aro show \
--name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP \
--query consoleUrl -o tsv)
export ARO_PASSWORD=$(az aro list-credentials \
--name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP \
--query kubeadminPassword -o tsv)
# Login
oc login $ARO_URL -u kubeadmin -p $ARO_PASSWORD
# Create namespaces
oc new-project rag-application
oc new-project milvus
Azure Phase 2: Azure OpenAI via Private Link
# Create Azure OpenAI resource
export OPENAI_NAME="rag-openai-${RANDOM}"
az cognitiveservices account create \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--kind OpenAI \
--sku S0 \
--location $LOCATION \
--custom-domain $OPENAI_NAME \
--public-network-access Disabled
# Deploy GPT-4 model
az cognitiveservices account deployment create \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--deployment-name gpt-4 \
--model-name gpt-4 \
--model-version "0613" \
--model-format OpenAI \
--sku-capacity 10 \
--sku-name "Standard"
# Deploy text-embedding model
az cognitiveservices account deployment create \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--deployment-name text-embedding-ada-002 \
--model-name text-embedding-ada-002 \
--model-version "2" \
--model-format OpenAI \
--sku-capacity 10 \
--sku-name "Standard"
# Create Private Endpoint
export VNET_ID=$(az network vnet show \
--resource-group $RESOURCE_GROUP \
--name aro-vnet \
--query id -o tsv)
export SUBNET_ID=$(az network vnet subnet show \
--resource-group $RESOURCE_GROUP \
--vnet-name aro-vnet \
--name worker-subnet \
--query id -o tsv)
export OPENAI_ID=$(az cognitiveservices account show \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--query id -o tsv)
az network private-endpoint create \
--name openai-private-endpoint \
--resource-group $RESOURCE_GROUP \
--vnet-name aro-vnet \
--subnet worker-subnet \
--private-connection-resource-id $OPENAI_ID \
--group-id account \
--connection-name openai-connection
# Create Private DNS Zone
az network private-dns zone create \
--resource-group $RESOURCE_GROUP \
--name privatelink.openai.azure.com
az network private-dns link vnet create \
--resource-group $RESOURCE_GROUP \
--zone-name privatelink.openai.azure.com \
--name openai-dns-link \
--virtual-network aro-vnet \
--registration-enabled false
# Create DNS record
export ENDPOINT_IP=$(az network private-endpoint show \
--name openai-private-endpoint \
--resource-group $RESOURCE_GROUP \
--query 'customDnsConfigs[0].ipAddresses[0]' -o tsv)
az network private-dns record-set a create \
--name $OPENAI_NAME \
--zone-name privatelink.openai.azure.com \
--resource-group $RESOURCE_GROUP
az network private-dns record-set a add-record \
--record-set-name $OPENAI_NAME \
--zone-name privatelink.openai.azure.com \
--resource-group $RESOURCE_GROUP \
--ipv4-address $ENDPOINT_IP
# Configure Workload Identity
export ARO_OIDC_ISSUER=$(az aro show \
--name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP \
--query 'serviceIdentity.url' -o tsv)
# Create managed identity
az identity create \
--name rag-app-identity \
--resource-group $RESOURCE_GROUP
export IDENTITY_CLIENT_ID=$(az identity show \
--name rag-app-identity \
--resource-group $RESOURCE_GROUP \
--query clientId -o tsv)
export IDENTITY_PRINCIPAL_ID=$(az identity show \
--name rag-app-identity \
--resource-group $RESOURCE_GROUP \
--query principalId -o tsv)
# Grant OpenAI access
az role assignment create \
--assignee $IDENTITY_PRINCIPAL_ID \
--role "Cognitive Services OpenAI User" \
--scope $OPENAI_ID
# Create federated credential
az identity federated-credential create \
--name rag-app-federated \
--identity-name rag-app-identity \
--resource-group $RESOURCE_GROUP \
--issuer $ARO_OIDC_ISSUER \
--subject "system:serviceaccount:rag-application:openai-sa"
# Create Kubernetes service account
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
name: openai-sa
namespace: rag-application
annotations:
azure.workload.identity/client-id: $IDENTITY_CLIENT_ID
EOF
# Get OpenAI endpoint and key
export OPENAI_ENDPOINT=$(az cognitiveservices account show \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--query properties.endpoint -o tsv)
export OPENAI_KEY=$(az cognitiveservices account keys list \
--name $OPENAI_NAME \
--resource-group $RESOURCE_GROUP \
--query key1 -o tsv)
# Create secret
oc create secret generic openai-credentials \
--from-literal=endpoint=$OPENAI_ENDPOINT \
--from-literal=key=$OPENAI_KEY \
-n rag-application
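Note that this phase configures Workload Identity (the openai-sa service account and federated credential), yet the application in Phase 5 authenticates with the API key from this secret. If you want to go keyless, here's a hedged sketch of swapping the key for an Entra ID token via azure-identity, assuming the pod's workload identity is wired up correctly:
# Alternative to the key-based client in src/main.py: authenticate to Azure OpenAI
# with an Entra ID token obtained through the pod's identity instead of OPENAI_KEY.
import os
from azure.identity import DefaultAzureCredential
from openai import AzureOpenAI

credential = DefaultAzureCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default")

client = AzureOpenAI(
    api_version="2023-05-15",
    azure_endpoint=os.environ["OPENAI_ENDPOINT"],
    azure_ad_token=token.token,  # short-lived; refresh it (or use azure_ad_token_provider in newer SDKs)
)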
Azure Phase 3: Azure Data Factory Pipeline
# Create Data Factory
export ADF_NAME="rag-adf-${RANDOM}"
az datafactory create \
--resource-group $RESOURCE_GROUP \
--factory-name $ADF_NAME \
--location $LOCATION
# Create Storage Account
export STORAGE_ACCOUNT="ragdocs${RANDOM}"
az storage account create \
--name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--sku Standard_LRS \
--kind StorageV2 \
--hierarchical-namespace true
# Get storage key
export STORAGE_KEY=$(az storage account keys list \
--account-name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--query '[0].value' -o tsv)
# Create containers
az storage container create \
--name raw-documents \
--account-name $STORAGE_ACCOUNT \
--account-key $STORAGE_KEY
az storage container create \
--name processed-documents \
--account-name $STORAGE_ACCOUNT \
--account-key $STORAGE_KEY
# Create linked service for storage
cat > adf-storage-linked-service.json <<EOF
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=$STORAGE_ACCOUNT;AccountKey=$STORAGE_KEY;EndpointSuffix=core.windows.net"
}
}
}
EOF
az datafactory linked-service create \
--resource-group $RESOURCE_GROUP \
--factory-name $ADF_NAME \
--name StorageLinkedService \
--properties @adf-storage-linked-service.json
Azure Phase 4: Milvus Deployment (Same as AWS)
The Milvus deployment on ARO is identical to the one on ROSA, since both run OpenShift:
# Same Helm commands as AWS implementation
helm repo add milvus https://milvus-io.github.io/milvus-helm/
helm install milvus-operator milvus/milvus-operator --namespace milvus --create-namespace
# Create PVCs using Azure Disk
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: milvus-etcd-pvc
namespace: milvus
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 10Gi
storageClassName: managed-premium
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: milvus-minio-pvc
namespace: milvus
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 50Gi
storageClassName: managed-premium
EOF
# Deploy Milvus (same values file as AWS)
helm install milvus milvus/milvus --namespace milvus --values milvus-values.yaml --wait
Azure Phase 5: RAG Application Deployment
# Create Azure-specific application
mkdir -p rag-app-azure/src
cat > rag-app-azure/requirements.txt <<EOF
fastapi==0.104.1
uvicorn[standard]==0.24.0
pydantic==2.5.0
pymilvus==2.3.3
openai==1.3.5
azure-identity==1.14.0
python-dotenv==1.0.0
EOF
cat > rag-app-azure/src/main.py <<'PYTHON'
from fastapi import FastAPI
from pydantic import BaseModel
import os
from openai import AzureOpenAI
from pymilvus import connections, Collection
app = FastAPI(title="Enterprise RAG API - Azure")
client = AzureOpenAI(
api_key=os.getenv("OPENAI_KEY"),
api_version="2023-05-15",
azure_endpoint=os.getenv("OPENAI_ENDPOINT")
)
@app.on_event("startup")
async def startup():
connections.connect(host=os.getenv("MILVUS_HOST"), port=19530)
class QueryRequest(BaseModel):
query: str
top_k: int = 5
max_tokens: int = 1000
@app.post("/query")
async def query_rag(req: QueryRequest):
# Generate embedding with Azure OpenAI
embed_resp = client.embeddings.create(
input=req.query,
model="text-embedding-ada-002"
)
embedding = embed_resp.data[0].embedding
# Search Milvus
coll = Collection("rag_documents")
    results = coll.search([embedding], "embedding", {"metric_type": "L2"}, limit=req.top_k, output_fields=["text"])  # output_fields so hits include the chunk text
# Build context
context = "\n\n".join([hit.entity.get("text") for hit in results[0]])
# Call Azure OpenAI GPT-4
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {req.query}"}
],
max_tokens=req.max_tokens
)
answer = response.choices[0].message.content
return {"answer": answer, "sources": [{"chunk": hit.entity.get("text")} for hit in results[0]]}
@app.get("/health")
async def health():
return {"status": "healthy", "platform": "Azure", "model": "GPT-4"}
PYTHON
# Build and deploy (similar to AWS)
cd rag-app-azure
podman build -t rag-app-azure:v1.0 .
oc create imagestream rag-app-azure -n rag-application
podman tag rag-app-azure:v1.0 image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-azure:v1.0
podman push image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-azure:v1.0 --tls-verify=false
cd ..
# Deploy with Azure credentials
cat <<EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: rag-app-azure
namespace: rag-application
spec:
replicas: 2
selector:
matchLabels:
app: rag-app-azure
template:
metadata:
labels:
app: rag-app-azure
spec:
serviceAccountName: openai-sa
containers:
- name: app
image: image-registry.openshift-image-registry.svc:5000/rag-application/rag-app-azure:v1.0
ports:
- containerPort: 8000
env:
- name: MILVUS_HOST
value: "milvus.milvus.svc.cluster.local"
- name: OPENAI_ENDPOINT
valueFrom:
secretKeyRef:
name: openai-credentials
key: endpoint
- name: OPENAI_KEY
valueFrom:
secretKeyRef:
name: openai-credentials
key: key
---
apiVersion: v1
kind: Service
metadata:
name: rag-app-azure
namespace: rag-application
spec:
selector:
app: rag-app-azure
ports:
- port: 80
targetPort: 8000
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
name: rag-app-azure
namespace: rag-application
spec:
to:
kind: Service
name: rag-app-azure
tls:
termination: edge
EOF
# Get URL and test
export RAG_URL_AZURE=$(oc get route rag-app-azure -n rag-application -o jsonpath='{.spec.host}')
curl https://$RAG_URL_AZURE/health
Cost Comparison (RAG)
Monthly Cost Breakdown
| Component | AWS Cost | Azure Cost | Notes |
|---|---|---|---|
| Kubernetes Cluster | |||
| - 3x worker nodes | $1,460 (m5.2xlarge) | $1,380 (D8s_v3) | Similar specs |
| - Control plane | $0 (managed by ROSA) | $0 (managed by ARO) | Both included |
| LLM API Calls | |||
| - 1M input tokens | $3 (Claude 3.5) | $30 (GPT-4) | AWS 10x cheaper |
| - 1M output tokens | $15 (Claude 3.5) | $60 (GPT-4) | AWS 4x cheaper |
| Embeddings | |||
| - 1M tokens | $0.10 (Titan) | $0.10 (Ada-002) | Equivalent |
| Data Pipeline | |||
| - ETL service | $10 (Glue, serverless) | $15 (Data Factory) | AWS slightly cheaper |
| - Metadata catalog | $1 (Glue Catalog) | $20 (Purview min) | Azure has minimum fee |
| Object Storage | |||
| - 100 GB storage | $2.30 (S3) | $2.05 (Blob) | Equivalent |
| - Requests (100k) | $0.05 (S3) | $0.04 (Blob) | Equivalent |
| Vector Database | |||
| - Self-hosted Milvus | $0 (on cluster) | $0 (on cluster) | Same |
| Networking | |||
| - Private Link | $7.20 (PrivateLink) | $7.20 (Private Link) | Same pricing |
| - Data transfer | $5 (1 TB out) | $5 (1 TB out) | Equivalent |
| TOTAL/MONTH | $1,503.65 | $1,519.39 | AWS 1% cheaper |
Key Cost Insights:
- LLM API costs favor AWS by a significant margin (Claude is cheaper than GPT-4)
- Azure Purview has a minimum monthly fee vs Glue's pay-per-use
- Compute costs are similar between ROSA and ARO
- Winner: AWS by ~$16/month (1%)
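Because token pricing dominates this comparison, it's worth plugging in your own volumes. A rough calculator using the per-million-token rates from the table above (treat the rates as illustrative snapshots, not current pricing):
# Back-of-the-envelope LLM cost comparison using the per-1M-token rates from the table.
RATES = {  # USD per 1M tokens: (input, output)
    "aws_claude_3_5": (3.00, 15.00),
    "azure_gpt_4":    (30.00, 60.00),
}

def monthly_llm_cost(platform: str, input_millions: float, output_millions: float) -> float:
    in_rate, out_rate = RATES[platform]
    return input_millions * in_rate + output_millions * out_rate

for platform in RATES:
    cost = monthly_llm_cost(platform, input_millions=1, output_millions=1)
    print(f"{platform}: ${cost:,.2f}/month at 1M input + 1M output tokens")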
Cost Optimization Strategies
AWS:
- Use Claude Instant for non-critical queries (6x cheaper)
- Leverage Glue serverless (no base cost)
- Use S3 Intelligent-Tiering for old documents
Azure:
- Use GPT-3.5-Turbo instead of GPT-4 (20x cheaper)
- Negotiate EA pricing for Azure OpenAI
- Use cool/archive tiers for old data
Project 2: Hybrid MLOps Pipeline
MLOps Platform Overview
This project demonstrates cost-optimized machine learning operations by bursting GPU training workloads to managed services while keeping inference on Kubernetes.
Architecture Comparison
AWS Architecture:
OpenShift Pipelines → ACK → SageMaker (ml.p4d.24xlarge)
↓
S3 Model Storage
↓
KServe on ROSA (CPU)
Azure Architecture:
Azure DevOps / Tekton → ASO → Azure ML (NC96ads_A100_v4)
↓
Blob Model Storage
↓
KServe on ARO (CPU)
Service Mapping
| Function | AWS Service | Azure Service | Key Difference |
|---|---|---|---|
| ML Platform | Amazon SageMaker | Azure Machine Learning | Similar capabilities |
| GPU Training | ml.p4d.24xlarge (8x A100) | NC96ads_A100_v4 (8x A100) | Same hardware |
| Spot Training | Managed Spot Training | Low Priority VMs | Different reservation models |
| Model Registry | S3 + SageMaker Registry | Blob + ML Model Registry | Different metadata approaches |
| K8s Operator | ACK (AWS Controllers) | ASO (Azure Service Operator) | Different CRD structures |
| Pipelines | OpenShift Pipelines (Tekton) | Azure DevOps / Tekton | Both support Tekton |
| Inference | KServe on ROSA | KServe on ARO | Identical |
AWS Implementation (MLOps)
AWS MLOps Phase 1: OpenShift Pipelines Setup
# Install OpenShift Pipelines Operator
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: openshift-pipelines-operator
namespace: openshift-operators
spec:
channel: latest
name: openshift-pipelines-operator-rh
source: redhat-operators
sourceNamespace: openshift-marketplace
EOF
# Create namespace
oc new-project mlops-pipelines
# Create service account
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
name: pipeline-sa
namespace: mlops-pipelines
EOF
AWS MLOps Phase 2: ACK SageMaker Controller
# Install ACK SageMaker controller
export SERVICE=sagemaker
export RELEASE_VERSION=$(curl -sL https://api.github.com/repos/aws-controllers-k8s/${SERVICE}-controller/releases/latest | grep '"tag_name":' | cut -d'"' -f4)
wget https://github.com/aws-controllers-k8s/${SERVICE}-controller/releases/download/${RELEASE_VERSION}/install.yaml
kubectl apply -f install.yaml
# Create IAM role for ACK
cat > ack-sagemaker-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sagemaker:CreateTrainingJob",
"sagemaker:DescribeTrainingJob",
"sagemaker:StopTrainingJob"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": ["s3:*"],
"Resource": "arn:aws:s3:::mlops-*"
},
{
"Effect": "Allow",
"Action": ["iam:PassRole"],
"Resource": "*",
"Condition": {
"StringEquals": {"iam:PassedToService": "sagemaker.amazonaws.com"}
}
}
]
}
EOF
aws iam create-policy --policy-name ACKSageMakerPolicy --policy-document file://ack-sagemaker-policy.json
# Create trust policy and role (similar to RAG project)
# ... (abbreviated for space)
AWS MLOps Phase 3: Training Job Example
# Create S3 buckets
export ML_BUCKET="mlops-artifacts-${ACCOUNT_ID}"
export DATA_BUCKET="mlops-datasets-${ACCOUNT_ID}"
aws s3 mb s3://$ML_BUCKET
aws s3 mb s3://$DATA_BUCKET
# Upload training script
cat > train.py <<'PYTHON'
import argparse, joblib
from sklearn.ensemble import RandomForestClassifier
import numpy as np
parser = argparse.ArgumentParser()
parser.add_argument('--n_estimators', type=int, default=100)
args = parser.parse_args()
# Training code
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, 1000)
model = RandomForestClassifier(n_estimators=args.n_estimators)
model.fit(X, y)
joblib.dump(model, '/opt/ml/model/model.joblib')
print(f"Training completed with {args.n_estimators} estimators")
PYTHON
# Create Dockerfile
cat > Dockerfile <<EOF
FROM python:3.10-slim
RUN pip install scikit-learn joblib numpy
COPY train.py /opt/ml/code/
ENTRYPOINT ["python", "/opt/ml/code/train.py"]
EOF
# Build and push to ECR
aws ecr create-repository --repository-name mlops/training
export ECR_URI="${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/mlops/training"
aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_URI
docker build -t mlops-training .
docker tag mlops-training:latest $ECR_URI:latest
docker push $ECR_URI:latest
# Create SageMaker training job via ACK
cat <<EOF | oc apply -f -
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: TrainingJob
metadata:
name: rf-training-job
namespace: mlops-pipelines
spec:
trainingJobName: rf-training-$(date +%s)
roleARN: $SAGEMAKER_ROLE_ARN
algorithmSpecification:
trainingImage: $ECR_URI:latest
trainingInputMode: File
resourceConfig:
instanceType: ml.m5.xlarge
instanceCount: 1
volumeSizeInGB: 50
outputDataConfig:
s3OutputPath: s3://$ML_BUCKET/models/
stoppingCondition:
maxRuntimeInSeconds: 3600
EOF
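The ACK controller surfaces status on the TrainingJob custom resource (oc get trainingjob -n mlops-pipelines), but you can also watch the job from the AWS side. A small boto3 sketch (the job name is whatever trainingJobName resolved to when the manifest was applied):
# Poll a SageMaker training job launched through the ACK TrainingJob resource.
import sys
import time
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")
job_name = sys.argv[1]  # e.g. the generated "rf-training-<timestamp>" name

while True:
    desc = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = desc["TrainingJobStatus"]
    print(f"{job_name}: {status}")
    if status in ("Completed", "Failed", "Stopped"):
        if status != "Completed":
            print(desc.get("FailureReason", "no failure reason reported"))
        break
    time.sleep(30)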
Azure Implementation (MLOps)
Azure MLOps Phase 1: Azure ML Workspace
# Create ML workspace
export ML_WORKSPACE="mlops-workspace-${RANDOM}"
az ml workspace create \
--name $ML_WORKSPACE \
--resource-group $RESOURCE_GROUP \
--location $LOCATION
# Create compute cluster (spot instances)
az ml compute create \
--name gpu-cluster \
--type amlcompute \
--min-instances 0 \
--max-instances 4 \
--size Standard_NC6s_v3 \
--tier LowPriority \
--workspace-name $ML_WORKSPACE \
--resource-group $RESOURCE_GROUP
Azure MLOps Phase 2: Azure Service Operator
# Install ASO
helm repo add aso2 https://raw.githubusercontent.com/Azure/azure-service-operator/main/v2/charts
helm install aso2 aso2/azure-service-operator \
--create-namespace \
--namespace azureserviceoperator-system \
--set azureSubscriptionID=$SUBSCRIPTION_ID \
--set azureTenantID=$TENANT_ID \
--set azureClientID=$CLIENT_ID \
--set azureClientSecret=$CLIENT_SECRET
# Create ML job via ASO
cat <<EOF | oc apply -f -
apiVersion: machinelearningservices.azure.com/v1alpha1
kind: Job
metadata:
name: rf-training-job
namespace: mlops-pipelines
spec:
owner:
name: $ML_WORKSPACE
compute:
target: gpu-cluster
instanceCount: 1
environment:
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
codeConfiguration:
codeArtifactId: azureml://code/train-script
scoringScript: train.py
EOF
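As on AWS, you can track the submitted job from the Azure ML control plane. A short sketch with the azure-ai-ml SDK v2, reusing the environment variables from earlier steps:
# Check the status of the training job submitted through ASO using the Azure ML SDK v2.
import os
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id=os.environ["SUBSCRIPTION_ID"],
    resource_group_name=os.environ["RESOURCE_GROUP"],
    workspace_name=os.environ["ML_WORKSPACE"],
)

job = ml_client.jobs.get("rf-training-job")
print(job.name, job.status)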
Cost Comparison (MLOps)
| Component | AWS Monthly | Azure Monthly | Notes |
|---|---|---|---|
| Training | |||
| - 4 hrs/week spot GPU | $157 (ml.p4d.24xlarge) | $153 (NC96ads_A100_v4) | Azure slightly cheaper |
| Storage | |||
| - Model artifacts (50 GB) | $1.15 (S3) | $1.00 (Blob) | Similar |
| ML Platform | |||
| - ML service | $0 (pay-per-use) | $0 (pay-per-use) | Same |
| Inference (on OpenShift) | |||
| - Shared ROSA/ARO cluster | $0 (shared) | $0 (shared) | Same |
| TOTAL/MONTH | ~$158 | ~$154 | Azure 2.5% cheaper |
Winner: Azure by $4/month (negligible difference)
Project 3: Unified Data Fabric (Data Lakehouse)
Lakehouse Platform Overview
This project implements a stateless data lakehouse where compute (Spark) can be destroyed without data loss.
Architecture Comparison
AWS Architecture:
Spark on ROSA → AWS Glue Catalog → S3 + Iceberg
Azure Architecture:
Spark on ARO → Azure Purview / Unity Catalog → ADLS Gen2 + Delta Lake
Service Mapping
| Function | AWS Service | Azure Service | Key Difference |
|---|---|---|---|
| Catalog | AWS Glue Data Catalog | Azure Purview / Unity Catalog | Glue is serverless |
| Table Format | Apache Iceberg | Delta Lake | Iceberg is cloud-agnostic |
| Storage | Amazon S3 | ADLS Gen2 | ADLS has hierarchical namespace |
| Compute | Spark on ROSA | Spark on ARO / Databricks | ARO or managed Databricks |
| Query Engine | Amazon Athena | Azure Synapse Serverless SQL | Similar serverless query |
AWS Implementation (Lakehouse)
(Due to length constraints, showing key differences only)
# Install Spark Operator (add its Helm chart repo first: helm repo add spark-operator <chart-repo-url>)
helm install spark-operator spark-operator/spark-operator \
--namespace spark-operator \
--set sparkJobNamespace=spark-jobs
# Create Glue databases
aws glue create-database --database-input '{"Name": "bronze"}'
aws glue create-database --database-input '{"Name": "silver"}'
aws glue create-database --database-input '{"Name": "gold"}'
# Build custom Spark image with Iceberg
cat > Dockerfile <<EOF
FROM gcr.io/spark-operator/spark:v3.5.0
USER root
RUN curl -L https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.4.2/iceberg-spark-runtime-3.5_2.12-1.4.2.jar \
-o /opt/spark/jars/iceberg-spark-runtime.jar
RUN curl -L https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar \
-o /opt/spark/jars/hadoop-aws.jar
USER 185
EOF
# Deploy SparkApplication with Glue integration
cat <<EOF | oc apply -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
name: lakehouse-etl
spec:
type: Python
sparkVersion: "3.5.0"
mainApplicationFile: s3://bucket/scripts/etl.py
sparkConf:
"spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog"
"spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog"
"spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
EOF
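The SparkApplication above points at s3://bucket/scripts/etl.py, which isn't included here. As a rough illustration of the medallion flow, here's a hypothetical version of that script using the glue_catalog Iceberg catalog configured in sparkConf (database, table, and path names are placeholders):
# Hypothetical etl.py: land raw events in a bronze Iceberg table, then curate into silver.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lakehouse-etl").getOrCreate()

# Bronze: ingest raw JSON as-is (append-only, schema on read)
raw = spark.read.json("s3a://bucket/raw/events/")
raw.writeTo("glue_catalog.bronze.events_raw").using("iceberg").createOrReplace()

# Silver: deduplicate and type the data; Iceberg provides ACID semantics on S3
silver = (
    raw.dropDuplicates(["event_id"])
       .withColumn("event_ts", F.to_timestamp("event_ts"))
       .filter(F.col("event_id").isNotNull())
)
silver.writeTo("glue_catalog.silver.events").using("iceberg").createOrReplace()

spark.stop()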
Azure Implementation (Lakehouse)
# Option 1: Use Azure Databricks (managed)
az databricks workspace create \
--name databricks-lakehouse \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--sku premium
# Option 2: Deploy Spark on ARO with Delta Lake
cat > Dockerfile <<EOF
FROM gcr.io/spark-operator/spark:v3.5.0
USER root
RUN curl -L https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.4.0/delta-core_2.12-2.4.0.jar \
-o /opt/spark/jars/delta-core.jar
USER 185
EOF
# Create ADLS Gen2 storage
az storage account create \
--name datalake${RANDOM} \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--kind StorageV2 \
--hierarchical-namespace true
# Deploy SparkApplication with Delta Lake
cat <<EOF | oc apply -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
name: lakehouse-etl
spec:
type: Python
sparkVersion: "3.5.0"
mainApplicationFile: abfss://container@storage.dfs.core.windows.net/scripts/etl.py
sparkConf:
"spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension"
"spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
EOF
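The Delta Lake counterpart of that ETL script differs mainly in the write API and the storage URI. A short hypothetical sketch (container and account names are placeholders):
# Hypothetical Delta Lake equivalent: same bronze/silver flow, writing Delta tables to ADLS Gen2.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lakehouse-etl").getOrCreate()

base = "abfss://container@storage.dfs.core.windows.net/lakehouse"

raw = spark.read.json(f"{base}/raw/events/")
raw.write.format("delta").mode("overwrite").save(f"{base}/bronze/events_raw")

silver = (
    raw.dropDuplicates(["event_id"])
       .withColumn("event_ts", F.to_timestamp("event_ts"))
)
silver.write.format("delta").mode("overwrite").save(f"{base}/silver/events")

spark.stop()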
Cost Comparison (Lakehouse)
| Component | AWS Monthly | Azure Monthly | Notes |
|---|---|---|---|
| Compute | |||
| - Spark cluster (3x m5.4xlarge) | $1,500 | $1,450 (D16s_v3) | Similar |
| Metadata Catalog | |||
| - Catalog service | $10 (Glue, 1M requests) | $20 (Purview minimum) | AWS cheaper |
| Storage | |||
| - Data lake (1 TB) | $23 (S3) | $18 (ADLS Gen2 hot) | Azure cheaper |
| Query Engine | |||
| - Serverless queries (1 TB) | $5 (Athena) | $5 (Synapse serverless) | Same |
| TOTAL/MONTH | $1,538 | $1,493 | Azure 3% cheaper |
Winner: Azure by $45/month (3%)
Total Cost of Ownership Analysis
Combined Monthly Costs
| Project | AWS Total | Azure Total | Difference |
|---|---|---|---|
| RAG Platform | $1,504 | $1,519 | AWS -$15 (-1%) |
| MLOps Pipeline | $158 | $154 | Azure -$4 (-2.5%) |
| Data Lakehouse | $1,538 | $1,493 | Azure -$45 (-3%) |
| TOTAL | $3,200/month | $3,166/month | Azure -$34/month (-1%) |
Annual Projection
- AWS: $3,200 × 12 = $38,400/year
- Azure: $3,166 × 12 = $37,992/year
- Savings with Azure: $408/year (1%)
Cost Sensitivity Analysis
Scenario 1: High LLM Usage (10M tokens/month)
- AWS: +$180 (Claude cheaper)
- Azure: +$900 (GPT-4 more expensive)
- AWS wins by $720/month
Scenario 2: Heavy ML Training (20 hrs/week GPU)
- AWS: +$785
- Azure: +$765
- Azure wins by $20/month
Scenario 3: Large Data Lake (10 TB storage)
- AWS: +$230
- Azure: +$180
- Azure wins by $50/month
Conclusion: AWS is better for AI-heavy workloads due to cheaper LLM pricing. Azure is better for data-heavy workloads due to cheaper storage.
Multi-Cloud Integration Patterns
Unified RBAC Strategy
Both platforms support similar pod-level identity:
AWS (IRSA):
apiVersion: v1
kind: ServiceAccount
metadata:
name: app-sa
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/AppRole
Azure (Workload Identity):
apiVersion: v1
kind: ServiceAccount
metadata:
name: app-sa
annotations:
azure.workload.identity/client-id: CLIENT_ID
Multi-Cloud Disaster Recovery
Deploy identical workloads on both platforms for DR:
# Primary: AWS
# Standby: Azure
# Failover time: < 5 minutes with DNS switch
# Shared components:
# - OpenShift APIs (same)
# - Application code (same)
# - Milvus deployment (same)
# Platform-specific:
# - Cloud credentials
# - Storage endpoints
Migration Strategies
AWS to Azure Migration
Phase 1: Data Migration
# Use AzCopy for S3 → Blob migration
azcopy copy \
"https://s3.amazonaws.com/bucket/*" \
"https://storageaccount.blob.core.windows.net/container" \
--recursive
Phase 2: Metadata Migration
- Export Glue Catalog to JSON
- Import to Azure Purview via API
Phase 3: Application Migration
- Update environment variables
- Switch cloud credentials
- Deploy to ARO
Azure to AWS Migration
Similar process in reverse:
# Use AWS DataSync for Blob → S3
aws datasync create-task \
--source-location-arn arn:aws:datasync:...:location/azure-blob \
--destination-location-arn arn:aws:datasync:...:location/s3-bucket
Resource Cleanup
AWS Complete Cleanup
#!/bin/bash
# Complete AWS resource cleanup
# RAG Platform
rosa delete cluster --cluster=rag-platform-aws --yes
aws s3 rm s3://rag-documents-${ACCOUNT_ID} --recursive
aws s3 rb s3://rag-documents-${ACCOUNT_ID}
aws glue delete-crawler --name rag-document-crawler
aws glue delete-database --name rag_documents_db
aws ec2 delete-vpc-endpoints --vpc-endpoint-ids $BEDROCK_VPC_ENDPOINT
aws iam detach-role-policy --role-name rosa-bedrock-access --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/BedrockInvokePolicy
aws iam delete-role --role-name rosa-bedrock-access
aws iam delete-policy --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/BedrockInvokePolicy
# MLOps Platform
aws s3 rm s3://mlops-artifacts-${ACCOUNT_ID} --recursive
aws s3 rm s3://mlops-datasets-${ACCOUNT_ID} --recursive
aws s3 rb s3://mlops-artifacts-${ACCOUNT_ID}
aws s3 rb s3://mlops-datasets-${ACCOUNT_ID}
aws ecr delete-repository --repository-name mlops/training --force
aws iam delete-role --role-name ACKSageMakerControllerRole
# Data Lakehouse
aws s3 rm s3://lakehouse-data-${ACCOUNT_ID} --recursive
aws s3 rb s3://lakehouse-data-${ACCOUNT_ID}
for db in bronze silver gold; do
aws glue delete-database --name $db
done
aws iam delete-role --role-name SparkGlueCatalogRole
echo "AWS cleanup complete"
Azure Complete Cleanup
#!/bin/bash
# Complete Azure resource cleanup
# Delete all resources in resource group
az group delete --name rag-platform-rg --yes --no-wait
# This deletes:
# - ARO cluster
# - Azure OpenAI service
# - Storage accounts
# - Data Factory
# - Azure ML workspace
# - All networking components
echo "Azure cleanup complete (deleting in background)"
Troubleshooting
Common Multi-Cloud Issues
Issue: Cross-Cloud Latency
Symptoms: Slow API responses when accessing cloud services
AWS Solution:
# Verify VPC endpoint is in correct AZ
aws ec2 describe-vpc-endpoints --vpc-endpoint-ids $ENDPOINT_ID
# Check PrivateLink latency
oc run test --rm -it --image=curlimages/curl -- \
curl -w "@curl-format.txt" https://bedrock-runtime.us-east-1.amazonaws.com
Azure Solution:
# Verify Private Link in same region as ARO
az network private-endpoint show --name openai-private-endpoint
# Test latency
oc run test --rm -it --image=curlimages/curl -- \
curl -w "@curl-format.txt" https://${OPENAI_NAME}.openai.azure.com
Issue: Authentication Failures
AWS IRSA Troubleshooting:
# Verify OIDC provider
rosa describe cluster -c $CLUSTER_NAME -o json | jq .aws.sts.oidc_endpoint_url
# Test token
kubectl create token bedrock-sa -n rag-application
# Verify IAM trust policy
aws iam get-role --role-name rosa-bedrock-access
Azure Workload Identity Troubleshooting:
# Verify federated credential
az identity federated-credential show \
--name rag-app-federated \
--identity-name rag-app-identity \
--resource-group $RESOURCE_GROUP
# Test managed identity
az account get-access-token --resource https://cognitiveservices.azure.com
Conclusion
Platform Selection Recommendations
Choose AWS if you:
- Prioritize AI/ML model diversity (Bedrock marketplace)
- Have variable, unpredictable workloads (serverless pricing)
- Value open-source ecosystem compatibility
- Need global multi-region deployments
- Want lower LLM API costs
Choose Azure if you:
- Have existing Microsoft enterprise agreements
- Need Windows container support
- Require hybrid cloud with on-premises
- Have Microsoft 365 / Teams integration requirements
- Want slightly lower infrastructure costs
Choose Multi-Cloud if you:
- Need disaster recovery across providers
- Want to avoid vendor lock-in
- Have regulatory requirements for redundancy
- Can manage operational complexity
Final Cost Summary
For the three projects combined:
- AWS Total: $3,200/month ($38,400/year)
- Azure Total: $3,166/month ($37,992/year)
- Difference: 1% ($408/year favoring Azure)
Verdict: Costs are effectively equivalent. Choose based on ecosystem fit, not cost.
Key Technical Takeaways
- OpenShift provides platform portability - same APIs on both clouds
- Cloud-specific services (Bedrock, Azure OpenAI) require different code
- Storage abstractions (S3 vs Blob) are the main migration challenge (see the sketch below)
- IAM patterns (IRSA vs Workload Identity) are conceptually similar
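On that storage point, one way to keep application code portable is a thin adapter over both SDKs. A minimal sketch using boto3 and azure-storage-blob (bucket, container, and credential wiring are assumptions):
# Minimal storage abstraction sketch: the same put/get interface over S3 and Azure Blob.
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class S3Store(ObjectStore):
    def __init__(self, bucket: str):
        import boto3
        self.bucket, self.s3 = bucket, boto3.client("s3")
    def put(self, key: str, data: bytes) -> None:
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=data)
    def get(self, key: str) -> bytes:
        return self.s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()

class BlobStore(ObjectStore):
    def __init__(self, account_url: str, container: str):
        from azure.identity import DefaultAzureCredential
        from azure.storage.blob import BlobServiceClient
        svc = BlobServiceClient(account_url, credential=DefaultAzureCredential())
        self.container = svc.get_container_client(container)
    def put(self, key: str, data: bytes) -> None:
        self.container.upload_blob(key, data, overwrite=True)
    def get(self, key: str) -> bytes:
        return self.container.download_blob(key).readall()

# Application code depends only on ObjectStore, so switching clouds becomes a config change.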
Next Steps
To Expand This Implementation:
- Add GitOps with ArgoCD for both platforms
- Implement cross-cloud disaster recovery
- Add comprehensive monitoring with Grafana
- Automate deployments with Terraform/Bicep
- Implement cost governance and FinOps
Thank you for reading this comprehensive multi-cloud implementation guide!