Pavan Madduri

Docker + OKE: Running GPU Inference Containers on Oracle Cloud

I wanted to deploy an LLM inference API without spending $1,200/month on AWS GPU instances. OCI turned out to be significantly cheaper, and the Docker workflow was identical. Here's what I set up.


Why I Looked at OCI for GPU Workloads

I've been building GPU infrastructure tools for a while now (keda-gpu-scaler, otel-gpu-receiver, GPU NUMA scheduling for Volcano), and most of my testing was on AWS. The g5.xlarge instances with A10G GPUs run about $1.01/hr, plus $73/month for the EKS control plane. It adds up fast when you're iterating.

Someone on the Volcano Slack mentioned OCI's GPU pricing and I was skeptical. But when I looked it up, the numbers were real: comparable A10 hardware, no charge for the OKE control plane at all, and preemptible capacity at roughly half what I was paying AWS on-demand. So I tried moving a vLLM inference workload over.

OCI GPU Pricing

Here's what OCI actually charges for GPU instances. I had to double-check these because they seemed too low:

| Shape | GPU | GPU Memory | OCPUs | RAM | Price/hr (on-demand) |
|---|---|---|---|---|---|
| VM.GPU.A10.1 | 1x A10 | 24 GB | 15 | 240 GB | ~$1.65 |
| VM.GPU.A10.2 | 2x A10 | 48 GB | 30 | 480 GB | ~$3.30 |
| BM.GPU.A100-v2.8 | 8x A100 | 640 GB | 128 | 2 TB | ~$25.00 |
| BM.GPU.H100.8 | 8x H100 | 640 GB | 112 | 2 TB | ~$38.00 |
| VM.GPU.A10.1 (preemptible) | 1x A10 | 24 GB | 15 | 240 GB | ~$0.50 |

That preemptible A10 price made me do a double-take. $0.50/hr for an A10 GPU comes to about $365/month, always-on. I was paying more than double that on AWS for the same class of hardware.

Building the Inference Image

I used vLLM because it's what I was already running on AWS. The Dockerfile doesn't change at all between clouds, which is the whole reason I'm using containers in the first place.

# Dockerfile.inference
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir \
    vllm==0.6.0 \
    fastapi \
    uvicorn

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--model", "microsoft/Phi-3-mini-4k-instruct", \
     "--max-model-len", "4096", \
     "--gpu-memory-utilization", "0.9"]

Build and test locally (you'll need an NVIDIA GPU and the NVIDIA Container Toolkit installed):

# Build
docker build -f Dockerfile.inference -t gpu-inference:v1 .

# Run with GPU access
docker run --gpus all -p 8000:8000 gpu-inference:v1

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3-mini-4k-instruct",
    "prompt": "Explain Kubernetes in one sentence:",
    "max_tokens": 50
  }'

--gpus all is the magic flag. It tells Docker to use the NVIDIA Container Toolkit, which injects the GPU device files and driver libraries into the container at runtime. Your image only needs the CUDA runtime libraries, not the full driver stack.

If You Don't Have a Local GPU

I do most of my development on a Mac, which obviously doesn't have an NVIDIA GPU. Docker Model Runner is what I use to test the LLM interaction pattern locally:

docker model pull ai/phi3-mini
docker model run ai/phi3-mini "Explain Kubernetes in one sentence"

The API is OpenAI-compatible, so the client code I write against Model Runner works unchanged against vLLM in production. I've been using this for prompt-template iteration, and it cut my feedback loop from 20+ minutes (push to registry, wait for the K8s pull, test) to about 15 seconds.
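That portability can be sketched with a tiny client helper. Everything here is illustrative: the local Model Runner URL, the in-cluster Service DNS name, and the `INFER_ENV` variable are assumptions from my setup, not fixed values.

```python
import os

# Assumed endpoints: Docker Model Runner locally, the vLLM Service DNS
# name in-cluster. Adjust both to your environment.
LOCAL_BASE = "http://localhost:12434/engines/phi3-mini/v1"
PROD_BASE = "http://vllm-inference.inference.svc.cluster.local/v1"


def base_url() -> str:
    """Pick the endpoint from an env var so client code never changes."""
    return PROD_BASE if os.environ.get("INFER_ENV") == "prod" else LOCAL_BASE


def completion_payload(prompt: str, max_tokens: int = 50) -> dict:
    """OpenAI-style /v1/completions body; the same shape works against
    both Model Runner and vLLM."""
    return {
        "model": "microsoft/Phi-3-mini-4k-instruct",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }


# POST completion_payload(...) to f"{base_url()}/completions" with any
# HTTP client; the response format is identical in both environments.
```

The point is that nothing model-facing changes between laptop and cluster; only the base URL does.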

Pushing to OCIR

# Login to OCIR
docker login iad.ocir.io -u '<tenancy-namespace>/oracleidentitycloudservice/<email>'

# Tag
docker tag gpu-inference:v1 iad.ocir.io/<tenancy>/gpu-inference/vllm:v1

# Scan before push
docker scout cves gpu-inference:v1 --only-severity critical,high

# Push
docker push iad.ocir.io/<tenancy>/gpu-inference/vllm:v1

Fair warning: GPU images are big. Mine was about 8GB. The first push took a while, but after that Docker's layer caching means only changed layers get uploaded. Most rebuilds push in under a minute.

Setting Up OKE with GPU Nodes

# Create cluster (control plane is free)
oci ce cluster create \
  --compartment-id $COMPARTMENT_ID \
  --kubernetes-version v1.30.1 \
  --name gpu-inference-cluster \
  --vcn-id $VCN_ID \
  --endpoint-subnet-id $API_SUBNET_ID \
  --service-lb-subnet-ids '["'$LB_SUBNET_ID'"]'

# Create GPU node pool
oci ce node-pool create \
  --cluster-id $CLUSTER_ID \
  --compartment-id $COMPARTMENT_ID \
  --kubernetes-version v1.30.1 \
  --name gpu-a10-pool \
  --node-shape VM.GPU.A10.1 \
  --node-config-details '{
    "size": 2,
    "placementConfigs": [{
      "availabilityDomain": "Uocm:US-ASHBURN-AD-1",
      "subnetId": "'$WORKER_SUBNET_ID'"
    }]
  }' \
  --node-source-details '{
    "imageId": "'$GPU_IMAGE_ID'",
    "sourceType": "IMAGE"
  }' \
  --initial-node-labels '[{
    "key": "nvidia.com/gpu",
    "value": "present"
  }]'

One thing I liked about OKE — the GPU node pools come with the NVIDIA device plugin already installed. On EKS I had to install the device plugin myself via a DaemonSet. Here it just works, and nvidia.com/gpu shows up as a schedulable resource immediately.
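Since the node pool above attaches the nvidia.com/gpu=present label, you can also steer pods onto those nodes explicitly with a nodeSelector. A minimal pod-spec fragment (the label key and value mirror the node pool config; the GPU resource limit in the deployment below remains the real scheduling gate):

```yaml
# Pod spec fragment (sketch): pin to the labeled GPU nodes.
spec:
  nodeSelector:
    nvidia.com/gpu: "present"
```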

Deploying the Inference Service

# inference-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
        - name: vllm
          image: iad.ocir.io/<tenancy>/gpu-inference/vllm:v1
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              cpu: "4"
              memory: "16Gi"
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 120
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 60
            periodSeconds: 10
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
      imagePullSecrets:
        - name: ocir-secret
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference
  namespace: inference
  annotations:
    oci.oraclecloud.com/load-balancer-type: "lb"
    service.beta.kubernetes.io/oci-load-balancer-shape: "flexible"
    service.beta.kubernetes.io/oci-load-balancer-shape-flex-min: "10"
    service.beta.kubernetes.io/oci-load-balancer-shape-flex-max: "100"
spec:
  type: LoadBalancer
  selector:
    app: vllm-inference
  ports:
    - port: 80
      targetPort: http
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: inference
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: oci-bv
  resources:
    requests:
      storage: 50Gi

A few things I learned the hard way while setting this up:

The nvidia.com/gpu: 1 in resource limits is how Kubernetes knows to schedule this on a GPU node. Forget it and your pod lands on a CPU node and crashes.

The PVC for model cache is important. Without it, the model downloads from HuggingFace every time the pod restarts. Phi-3-mini is a few GB — that's 5-10 minutes of startup time you don't want to repeat.

The initialDelaySeconds: 120 on the liveness probe took me a restart loop to figure out. Model loading is slow. If your liveness probe fires before the model is loaded, Kubernetes kills the pod, it restarts, starts loading again, gets killed again... you get the idea. Give it at least 2 minutes.
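If you'd rather not guess at initialDelaySeconds, a startupProbe (stable since Kubernetes 1.20) handles slow model loads more gracefully: liveness checks don't start until the startup probe succeeds. A container-spec fragment with illustrative numbers:

```yaml
# Container spec fragment (sketch): tolerate up to 30 x 20s = 10 minutes
# of model loading before the regular liveness probe takes over.
startupProbe:
  httpGet:
    path: /health
    port: http
  failureThreshold: 30
  periodSeconds: 20
```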

The OCI Load Balancer annotations tell OKE to automatically provision a load balancer. No separate Terraform resource needed.

Deploy:

kubectl create namespace inference

# Create OCIR pull secret
kubectl create secret docker-registry ocir-secret \
  --namespace inference \
  --docker-server=iad.ocir.io \
  --docker-username='<tenancy>/<user>' \
  --docker-password='<auth-token>'

# Create HuggingFace token secret (from OCI Vault ideally)
kubectl create secret generic hf-token \
  --namespace inference \
  --from-literal=token=$HF_TOKEN

kubectl apply -f inference-deployment.yaml

After a few minutes (mostly model download time), the service is up and accessible via the load balancer:

LB_IP=$(kubectl get svc vllm-inference -n inference -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

curl http://$LB_IP/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3-mini-4k-instruct",
    "prompt": "What is Oracle Cloud Infrastructure?",
    "max_tokens": 100
  }'

Monitoring GPU Utilization

Once the inference service was running, I wanted to see actual GPU utilization. Without this you're flying blind — you have no idea if the GPU is sitting at 10% or 95%. DCGM Exporter gives you Prometheus metrics:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring --create-namespace \
  --set serviceMonitor.enabled=true

This gives you DCGM_FI_DEV_GPU_UTIL (utilization), DCGM_FI_DEV_MEM_COPY_UTIL (memory), temperature, power draw, etc. I have a Grafana dashboard that shows all of these and it's been useful for right-sizing.
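For right-sizing, the queries I run most are simple aggregations. A PromQL sketch (the metric names are standard DCGM Exporter fields, but label names vary by exporter version, so check yours):

```
# Average GPU utilization over the last 15 minutes, per GPU
avg_over_time(DCGM_FI_DEV_GPU_UTIL[15m])

# Peak memory-copy utilization, cluster-wide
max(DCGM_FI_DEV_MEM_COPY_UTIL)
```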

I also built otel-gpu-receiver which does something similar but for OpenTelemetry. If you're already running an OTel collector, it might be a better fit than DCGM Exporter.

What I'm Actually Paying

Here's the monthly bill comparison for running Phi-3-mini on a single A10, always-on:

| Platform | Setup | Monthly Cost |
|---|---|---|
| OCI OKE + VM.GPU.A10.1 | Managed K8s + GPU node | ~$1,210 |
| OCI OKE + preemptible A10 | Same, but preemptible | ~$365 |
| AWS EKS + g5.xlarge | Managed K8s + GPU node | ~$1,100 + $73 (control plane) |
| GCP GKE + g2-standard-4 | Managed K8s + GPU node | ~$1,300 + $73 (control plane) |
| Azure AKS + NC4as_T4_v3 | Managed K8s + T4 GPU | ~$550 (less powerful GPU) |

The free control plane saves $73/mo by itself compared to EKS or GKE. And for my dev/test workloads I switched to preemptible instances, which dropped the GPU cost to $365/mo. The pods get evicted occasionally but for development that's fine.
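The monthly math is just the hourly rate times roughly 730 hours. A quick sanity check in Python, using the OCI rates from the pricing table above (rounded list prices, so the results land close to, not exactly on, the table's figures):

```python
HOURS_PER_MONTH = 730  # 24 * 365 / 12, the usual cloud-billing month


def monthly(hourly_rate: float, control_plane: float = 0.0) -> float:
    """On-demand monthly cost for a single always-on node."""
    return hourly_rate * HOURS_PER_MONTH + control_plane


print(monthly(1.65))  # OCI VM.GPU.A10.1: close to the ~$1,210 above
print(monthly(0.50))  # preemptible A10: 365.0
```

The `control_plane` argument is there for clouds that bill it separately; for OKE it stays at zero.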

Local Dev with Docker Model Runner

I keep coming back to this because it changed how I work. Before Model Runner, testing a prompt change meant: edit prompt, rebuild image, push to OCIR, wait for OKE to pull it, test, realize it's wrong, repeat. Twenty minutes per iteration.

Now I just run the model locally:

# Pull a model
docker model pull ai/phi3-mini

# Run inference
docker model run ai/phi3-mini "Summarize: Oracle Cloud Infrastructure provides..."

# Or use the API endpoint
curl http://localhost:12434/engines/phi3-mini/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is OKE?", "max_tokens": 50}'

Same API, same prompt format. When the prompt works locally, I rebuild the production image and push. The container is what makes this portable.

Was It Worth Switching?

Honestly, yes. The Docker workflow didn't change at all — same Dockerfile, same docker build, same docker push. I just changed the registry URL and the Kubernetes annotations. The inference service runs the same. The GPU utilization is the same. The API responses are the same.

What changed is the bill. And the fact that I don't pay $73/month for a Kubernetes control plane anymore. If you're running GPU workloads on AWS or GCP and haven't priced out OCI, it's worth 30 minutes of your time.


Pavan Madduri — Oracle ACE Associate, CNCF Golden Kubestronaut. I build GPU infrastructure tools and write about Kubernetes. GitHub | LinkedIn
