The first time I deployed vLLM on OKE, the pod took 12 minutes to become ready. The model download from HuggingFace was 7.5GB. Then the pod crashed (liveness probe, classic mistake), restarted, and downloaded the model again. Another 12 minutes. I burned nearly half an hour watching progress bars.
I added a PVC, which helped — the model persisted across restarts on the same node. But if the pod got rescheduled to a different GPU node? Fresh download. Back to square one.
The fix was obvious in hindsight: pre-stage models in OCI Object Storage and download from there. Same region, private network, and roughly 8x faster than HuggingFace in my tests (about 90 seconds instead of 12 minutes).
The Problem With HuggingFace Downloads in Production
HuggingFace is great for browsing and testing. It's not great as a production model source:
- Slow: downloads go over the public internet, through a CDN, and are subject to bandwidth limits
- Unreliable: I've had downloads fail partway through during peak hours
- No access control of your own: production pods need an HF token and outbound internet access
- Rate limits: download the same model across 5 pods and you might get throttled
For development? Fine. For production pods that might restart at 3am? I want the model sitting in my own storage, same region as my cluster, accessible over the internal network.
Uploading Models to OCI Object Storage
First, download the model once and upload it to a bucket:
# Download model locally
pip install huggingface_hub
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download('microsoft/Phi-3-mini-4k-instruct', local_dir='./phi3-mini')
"

# Create OCI bucket
oci os bucket create \
  --compartment-id "$COMPARTMENT_ID" \
  --name ai-model-cache \
  --storage-tier Standard

# Upload model files
oci os object bulk-upload \
  --bucket-name ai-model-cache \
  --src-dir ./phi3-mini \
  --prefix "models/phi3-mini/" \
  --content-type application/octet-stream
The upload takes a few minutes over the internet. After that, every download from OKE nodes is over OCI's internal network — much faster.
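Before uploading, it can help to record what the snapshot contains, so that a later download can be checked against it. A minimal sketch, assuming the ./phi3-mini directory from the commands above; `build_manifest` is a hypothetical helper written for illustration, not part of huggingface_hub or the OCI CLI:

```python
import hashlib
from pathlib import Path


def build_manifest(model_dir):
    """Map each file's relative path to (size in bytes, SHA-256 hex digest)."""
    root = Path(model_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so multi-GB weight files never sit fully in memory
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        relative_name = path.relative_to(root).as_posix()
        manifest[relative_name] = (path.stat().st_size, digest.hexdigest())
    return manifest
```

Calling `build_manifest("./phi3-mini")` before the upload and again on a downloaded copy gives two dictionaries that should compare equal if nothing was lost or truncated along the way.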
The Init Container Approach
I use a Kubernetes init container to download the model from Object Storage before the inference container starts. The init container uses instance principal auth (no credentials needed — the OKE node's identity is enough).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      initContainers:
        - name: model-loader
          image: ghcr.io/oracle/oci-cli:latest
          command:
            - /bin/bash
            - -c
            - |
              # Abort on any error so a failed download is never marked complete
              set -euo pipefail
              # Skip if model already cached on PVC
              if [ -f /models/phi3-mini/.complete ]; then
                echo "Model already cached, skipping download"
                exit 0
              fi
              echo "Downloading model from OCI Object Storage..."
              # To verify: bulk-download may recreate the full object name
              # (prefix included) under --download-dir, so confirm that
              # config.json lands directly in /models/phi3-mini
              oci os object bulk-download \
                --bucket-name ai-model-cache \
                --prefix "models/phi3-mini/" \
                --download-dir /models/phi3-mini \
                --auth instance_principal
              # Mark as complete so we don't re-download
              touch /models/phi3-mini/.complete
              echo "Model download complete"
          volumeMounts:
            - name: model-cache
              mountPath: /models
      containers:
        - name: vllm
          image: iad.ocir.io/mytenancy/vllm:v1
          args:
            - "--model"
            - "/models/phi3-mini"
            - "--max-model-len"
            - "4096"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
          volumeMounts:
            - name: model-cache
              mountPath: /models
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: oci-bv
  resources:
    requests:
      storage: 50Gi
The .complete marker file is a simple trick. If the PVC already has the model (because the pod restarted on the same node), the init container exits immediately. No re-download. If it's a fresh node, it pulls from Object Storage — which takes about 90 seconds over the internal network compared to 12 minutes from HuggingFace.
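The order of operations matters here: the marker should only be written after the download has succeeded, otherwise a half-finished download looks like a cache hit on the next start. A minimal Python sketch of the same check, for illustration only; `ensure_model` and the `download` callback are hypothetical names standing in for the shell script and the `oci os object bulk-download` call:

```python
from pathlib import Path


def ensure_model(model_dir, download):
    """Run download(model_dir) unless the .complete marker is already present.

    Returns True if a download happened, False on a cache hit.
    """
    target = Path(model_dir)
    marker = target / ".complete"
    if marker.exists():
        return False  # cache hit: the volume already holds a finished download
    target.mkdir(parents=True, exist_ok=True)
    download(target)  # if this raises, the marker below is never written
    marker.touch()
    return True
```

In the shell version, the equivalent guarantee comes from stopping the script on the first failing command, so that `touch` never runs after a failed download.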
Instance Principal Auth — No Credentials Needed
The nice thing about OCI Object Storage with OKE is that you can use instance principal authentication. The OKE worker node already has an identity in OCI. You need two things: a dynamic group that matches the GPU worker nodes (gpu-node-pool in the policy below), and a policy that allows that group to read from the bucket:
# Terraform: allow GPU nodes to read model bucket
resource "oci_identity_policy" "model_access" {
  compartment_id = var.tenancy_id
  name           = "gpu-nodes-model-access"
  description    = "Allow GPU node pool to read model cache bucket"
  statements = [
    "Allow dynamic-group gpu-node-pool to read objects in compartment ${var.compartment_name} where target.bucket.name='ai-model-cache'"
  ]
}
No API keys, no secrets, no tokens to rotate. The node proves its identity to OCI automatically. This is cleaner than storing HuggingFace tokens in Kubernetes secrets.
Startup Time Comparison
I measured this across 10 pod restarts:
| Model Source | Avg Download Time | Reliability |
|---|---|---|
| HuggingFace (internet) | 11-14 min | 8/10 succeeded first try |
| OCI Object Storage (same region) | 1-2 min | 10/10 succeeded |
| PVC cache hit (no download) | 0 sec | 10/10 |
The PVC cache is the fast path. Object Storage is the fallback for new nodes. HuggingFace is what I never want to depend on in production.
Multi-Model Setup
For serving multiple models, I just add more prefixes in the bucket:
oci os object bulk-upload --bucket-name ai-model-cache \
  --src-dir ./llama3-8b --prefix "models/llama3-8b/"
oci os object bulk-upload --bucket-name ai-model-cache \
  --src-dir ./mistral-7b --prefix "models/mistral-7b/"
And the init container takes a model name as an env var:
env:
  - name: MODEL_NAME
    value: "llama3-8b"
command:
  - /bin/bash
  - -c
  - |
    # Abort on any error so a failed download is never marked complete
    set -euo pipefail
    if [ -f "/models/$MODEL_NAME/.complete" ]; then exit 0; fi
    oci os object bulk-download \
      --bucket-name ai-model-cache \
      --prefix "models/$MODEL_NAME/" \
      --download-dir "/models/$MODEL_NAME" \
      --auth instance_principal
    touch "/models/$MODEL_NAME/.complete"
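Since the model name now ends up inside both an object prefix and a filesystem path, it is worth validating it in one place. A small sketch of that mapping, assuming the models/ bucket prefix and the /models mount path used above; `model_paths` and the two constants are hypothetical names introduced for illustration:

```python
import re

BUCKET_PREFIX = "models"  # top-level prefix used in the bucket above
CACHE_ROOT = "/models"    # mount path of the PVC in the pod


def model_paths(model_name):
    """Return (object prefix, local directory, marker file) for a model name."""
    # Reject empty names and anything containing a slash or a leading dot,
    # which could point outside the cache directory
    if not re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9._-]*", model_name):
        raise ValueError(f"invalid model name: {model_name!r}")
    prefix = f"{BUCKET_PREFIX}/{model_name}/"
    local_dir = f"{CACHE_ROOT}/{model_name}"
    return prefix, local_dir, f"{local_dir}/.complete"
```

For `llama3-8b` this yields the same three values the shell snippet builds by string substitution, while a name such as `../etc` is rejected before it can reach a path.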
Cost
OCI Object Storage Standard tier is $0.0255/GB/month. A 10GB model costs $0.26/month to store. That's basically free. And you're saving 10+ minutes on every cold start, which matters when you're paying $1.50/hr for a GPU that's sitting idle while a model downloads.
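The arithmetic behind those two numbers can be written out as a short sketch. It uses only the prices quoted in this section ($0.0255/GB/month and $1.50/hr); the function names are made up for illustration, and real pricing may differ by region and over time:

```python
from decimal import Decimal, ROUND_HALF_UP

STORAGE_PER_GB_MONTH = Decimal("0.0255")  # Standard tier price quoted above
GPU_PER_HOUR = Decimal("1.50")            # GPU hourly rate quoted above


def to_cents(amount):
    """Round a dollar amount to two decimal places."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def storage_cost_per_month(size_gb):
    """Monthly storage cost in dollars for a model of the given size."""
    return to_cents(STORAGE_PER_GB_MONTH * Decimal(size_gb))


def idle_gpu_cost(minutes):
    """Cost in dollars of a GPU sitting idle for the given number of minutes."""
    return to_cents(GPU_PER_HOUR * Decimal(minutes) / Decimal(60))


print(storage_cost_per_month(10))  # 0.26
print(idle_gpu_cost(10))           # 0.25
```

At these rates, a single avoided 10-minute cold start saves about as much GPU spend as a month of storing the model costs.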
Pavan Madduri, Oracle ACE Associate, CNCF Golden Kubestronaut.