DEV Community

Pavan Madduri

Stop Downloading 8GB Models on Every Pod Restart - Use OCI Object Storage as a Model Cache

The first time I deployed vLLM on OKE, the pod took 12 minutes to become ready. The model download from HuggingFace was 7.5GB. Then the pod crashed (liveness probe, classic mistake), restarted, and downloaded the model again. Another 12 minutes. I burned nearly half an hour watching progress bars.

I added a PVC, which helped — the model persisted across restarts on the same node. But if the pod got rescheduled to a different GPU node? Fresh download. Back to square one.

The fix was obvious in hindsight: pre-stage models in OCI Object Storage and download from there. Same region, private network, and in my tests roughly 8x faster than HuggingFace (about 90 seconds versus 12 minutes).

The Problem With HuggingFace Downloads in Production

HuggingFace is great for browsing and testing. It's not great as a production model source:

  • Slow: downloads go over the public internet and through a CDN, and are subject to bandwidth limits
  • Unreliable: I've had downloads fail partway through during peak hours
  • Awkward access control: your production pods need an HF token and outbound internet access
  • Rate limits: download the same model across 5 pods and you might get throttled

For development? Fine. For production pods that might restart at 3am? I want the model sitting in my own storage, same region as my cluster, accessible over the internal network.

Uploading Models to OCI Object Storage

First, download the model once and upload it to a bucket:

# Download model locally
pip install huggingface_hub
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download('microsoft/Phi-3-mini-4k-instruct', local_dir='./phi3-mini')
"

# Create OCI bucket
oci os bucket create \
  --compartment-id "$COMPARTMENT_ID" \
  --name ai-model-cache \
  --storage-tier Standard

# Upload model files
oci os object bulk-upload \
  --bucket-name ai-model-cache \
  --src-dir ./phi3-mini \
  --prefix "models/phi3-mini/" \
  --content-type application/octet-stream

The upload takes a few minutes over the internet. After that, every download from OKE nodes is over OCI's internal network — much faster.
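
Before uploading, it can help to record what the local snapshot contains so that a later download can be checked against it. This is a minimal sketch and not part of the workflow above: `build_manifest`, `write_manifest`, and the `manifest.json` file name are hypothetical, and only the Python standard library is used.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(model_dir: str, skip: tuple = ("manifest.json",)) -> dict:
    """Map each file's relative path to its size and SHA-256 digest."""
    root = Path(model_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.name in skip:
            continue
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            # Read in 1 MiB chunks so multi-GB weight files never sit in memory
            for chunk in iter(lambda: fh.read(1024 * 1024), b""):
                digest.update(chunk)
        manifest[path.relative_to(root).as_posix()] = {
            "bytes": path.stat().st_size,
            "sha256": digest.hexdigest(),
        }
    return manifest


def write_manifest(model_dir: str, name: str = "manifest.json") -> Path:
    """Write the manifest next to the model files (file name is hypothetical)."""
    out = Path(model_dir) / name
    out.write_text(json.dumps(build_manifest(model_dir, skip=(name,)), indent=2))
    return out
```

Calling `write_manifest("./phi3-mini")` before the bulk upload would place the manifest under the same prefix as the weights, so whatever downloads the model could recompute the digests and compare them.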

The Init Container Approach

I use a Kubernetes init container to download the model from Object Storage before the inference container starts. The init container uses instance principal auth (no credentials needed — the OKE node's identity is enough).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      initContainers:
        - name: model-loader
          image: ghcr.io/oracle/oci-cli:latest
          command:
            - /bin/bash
            - -c
            - |
              # Fail fast so a failed download never writes the .complete marker
              set -euo pipefail

              # Skip if model already cached on PVC
              if [ -f /models/phi3-mini/.complete ]; then
                echo "Model already cached, skipping download"
                exit 0
              fi

              echo "Downloading model from OCI Object Storage..."
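              # bulk-download keeps the full object name, prefix included, so
              # targeting / places the files in /models/phi3-mini
              oci os object bulk-download \
                --bucket-name ai-model-cache \
                --prefix "models/phi3-mini/" \
                --download-dir / \
                --overwrite \
                --auth instance_principal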

              # Mark as complete so we don't re-download
              touch /models/phi3-mini/.complete
              echo "Model download complete"
          volumeMounts:
            - name: model-cache
              mountPath: /models

      containers:
        - name: vllm
          image: iad.ocir.io/mytenancy/vllm:v1
          args:
            - "--model"
            - "/models/phi3-mini"
            - "--max-model-len"
            - "4096"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
          volumeMounts:
            - name: model-cache
              mountPath: /models

      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: oci-bv
  resources:
    requests:
      storage: 50Gi

The .complete marker file is a simple trick. If the PVC already has the model (because the pod restarted on the same node), the init container exits immediately. No re-download. If it's a fresh node, it pulls from Object Storage — which takes about 90 seconds over the internal network compared to 12 minutes from HuggingFace.
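
The marker pattern is independent of the shell script. The sketch below restates it in Python under stated assumptions: `ensure_model` and its `download` callback are hypothetical names, and the callback stands in for whatever actually fetches the files. The point it illustrates is ordering: the marker is written only after the download returns without raising.

```python
from pathlib import Path
from typing import Callable

MARKER = ".complete"


def ensure_model(model_dir: str, download: Callable[[Path], None]) -> bool:
    """Download into model_dir unless a completed copy is already there.

    Returns True if a download ran, False on a cache hit.
    """
    target = Path(model_dir)
    marker = target / MARKER
    if marker.exists():
        return False  # cache hit: a previous run finished and left the marker

    target.mkdir(parents=True, exist_ok=True)
    # If download() raises, the marker is never written, so the next
    # start retries instead of trusting a partial directory.
    download(target)
    marker.touch()
    return True
```

A partially downloaded directory without a marker is treated as a miss, which is why the marker has to be the last thing written.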

Instance Principal Auth — No Credentials Needed

The nice thing about OCI Object Storage with OKE is that you can use instance principal authentication. The OKE worker node already has an identity in OCI. You need a dynamic group that matches the GPU worker nodes (gpu-node-pool in the policy below) and a policy that allows that group to read from the bucket:

# Terraform — allow GPU nodes to read model bucket
resource "oci_identity_policy" "model_access" {
  compartment_id = var.tenancy_id
  name           = "gpu-nodes-model-access"
  description    = "Allow GPU node pool to read model cache bucket"

  statements = [
    "Allow dynamic-group gpu-node-pool to read objects in compartment ${var.compartment_name} where target.bucket.name='ai-model-cache'"
  ]
}

No API keys, no secrets, no tokens to rotate. The node proves its identity to OCI automatically. This is cleaner than storing HuggingFace tokens in Kubernetes secrets.

Startup Time Comparison

I measured this across 10 pod restarts:

Model Source                     | Avg Download Time | Reliability
HuggingFace (internet)           | 11-14 min         | 8/10 succeeded first try
OCI Object Storage (same region) | 1-2 min           | 10/10 succeeded
PVC cache hit (no download)      | 0 sec             | 10/10

The PVC cache is the fast path. Object Storage is the fallback for new nodes. HuggingFace is what I never want to depend on in production.
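
To put numbers on the fast path and the fallback, here is a small worked calculation. The 12 minute and 90 second figures come from this post; the PVC hit rate is a made-up input for illustration, since I did not measure one, and the function names are hypothetical.

```python
def speedup(baseline_seconds: float, faster_seconds: float) -> float:
    """How many times faster the second source is than the first."""
    return baseline_seconds / faster_seconds


def expected_download_seconds(pvc_hit_rate: float, fallback_seconds: float) -> float:
    """Average download time per start when cache hits cost nothing."""
    return (1.0 - pvc_hit_rate) * fallback_seconds


if __name__ == "__main__":
    # 12 minutes from HuggingFace versus about 90 seconds from Object Storage
    print(f"speedup: {speedup(12 * 60, 90):.1f}x")
    # Hypothetical: 80% of restarts land on a PVC that already holds the model
    print(f"expected download: {expected_download_seconds(0.8, 90):.0f} s")
```

With those inputs the speedup is 720 / 90 = 8, and the higher the hit rate, the closer the average download time gets to zero.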

Multi-Model Setup

For serving multiple models, I just add more prefixes in the bucket:

oci os object bulk-upload --bucket-name ai-model-cache \
  --src-dir ./llama3-8b --prefix "models/llama3-8b/"

oci os object bulk-upload --bucket-name ai-model-cache \
  --src-dir ./mistral-7b --prefix "models/mistral-7b/"

And the init container takes a model name as an env var:

env:
  - name: MODEL_NAME
    value: "llama3-8b"
command:
  - /bin/bash
  - -c
  - |
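    set -euo pipefail  # stop before writing the marker if the download fails
    if [ -f "/models/$MODEL_NAME/.complete" ]; then exit 0; fi
    # bulk-download keeps the full object name, prefix included, so
    # targeting / places the files in /models/$MODEL_NAME
    oci os object bulk-download \
      --bucket-name ai-model-cache \
      --prefix "models/$MODEL_NAME/" \
      --download-dir / \
      --overwrite \
      --auth instance_principal
    touch "/models/$MODEL_NAME/.complete"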
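
Because the model name ends up inside both a bucket prefix and a filesystem path, it is worth validating it before use. The sketch below shows one way to derive the three paths from a name; `model_paths`, `ModelPaths`, and the allowed-character rule are assumptions for illustration, while the `models/` prefix and `/models` mount path match the examples in this post.

```python
import re
from dataclasses import dataclass

BUCKET_PREFIX = "models"  # top-level prefix used in the bucket
CACHE_ROOT = "/models"    # PVC mount path from the Deployment


@dataclass(frozen=True)
class ModelPaths:
    prefix: str      # object name prefix in the bucket
    local_dir: str   # directory the inference container reads from
    marker: str      # file that signals a finished download


def model_paths(model_name: str) -> ModelPaths:
    # Reject slashes and leading dots so a name cannot escape the
    # models/ prefix or the cache root (hypothetical rule)
    if not re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9._-]*", model_name):
        raise ValueError(f"unsafe model name: {model_name!r}")
    local_dir = f"{CACHE_ROOT}/{model_name}"
    return ModelPaths(
        prefix=f"{BUCKET_PREFIX}/{model_name}/",
        local_dir=local_dir,
        marker=f"{local_dir}/.complete",
    )
```

For example, `model_paths("llama3-8b")` gives the prefix `models/llama3-8b/` and the local directory `/models/llama3-8b`, while a name such as `../etc` raises `ValueError`.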

Cost

OCI Object Storage Standard tier is $0.0255/GB/month. A 10GB model costs $0.26/month to store. That's basically free. And you're saving 10+ minutes on every cold start, which matters when you're paying $1.50/hr for a GPU that's sitting idle while a model downloads.
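
The arithmetic behind that comparison, as a short sketch. Both prices are the ones quoted above and may not match current OCI pricing; the function names are hypothetical.

```python
STORAGE_USD_PER_GB_MONTH = 0.0255  # Standard tier price quoted above
GPU_USD_PER_HOUR = 1.50            # GPU price quoted above


def monthly_storage_cost(model_gb: float) -> float:
    """Cost of keeping a model of this size in the bucket for a month."""
    return model_gb * STORAGE_USD_PER_GB_MONTH


def idle_gpu_cost(wait_minutes: float) -> float:
    """Cost of a GPU that sits idle while the model downloads."""
    return GPU_USD_PER_HOUR * wait_minutes / 60


if __name__ == "__main__":
    print(f"storage: ${monthly_storage_cost(10):.3f}/month")
    print(f"idle GPU during a 10 min download: ${idle_gpu_cost(10):.2f}")
```

Ten idle minutes at $1.50/hr cost about $0.25, so a single avoided cold start roughly pays for a month of storing a 10GB model.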


Pavan Madduri, Oracle ACE Associate, CNCF Golden Kubestronaut.