Pavan Madduri

Posted on Jun 30

How I Run GPU Workloads for 70% Less on OKE Using Preemptible Instances

#oci #kubernetes #gpu #infrastructure

I was spending ~$3,300/month on three A10 GPU instances for a mix of staging inference, batch processing, and experimentation. All on-demand. Then I switched two of the three to preemptible instances and my GPU bill dropped to about $1,450/month.

The trade-off is that OCI can reclaim preemptible instances with 30 seconds' notice. For customer-facing production inference, that's a non-starter. For everything else I was running, it was fine.

What Preemptible Instances Actually Mean

OCI preemptible instances are spare capacity sold at a steep discount. An A10 GPU instance goes from $1.52/hr (on-demand) to ~$0.46/hr (preemptible), a discount of roughly 70%.

The catch: OCI can terminate the instance when it needs the capacity back. You get a 30-second warning via instance metadata. Your workload needs to handle this gracefully.

In practice, I've been running preemptible GPU nodes on OKE for two months and I've had maybe 4-5 evictions total. Some weeks zero. It depends on demand in your region and availability domain.

OKE Node Pool Setup

I run two GPU node pools, one on-demand for production and one preemptible for everything else:

# Production GPU pool: always available
oci ce node-pool create \
  --name gpu-production \
  --node-shape VM.GPU.A10.1 \
  --node-config-details '{
    "size": 1,
    "placementConfigs": [{
      "availabilityDomain": "Uocm:US-ASHBURN-AD-1",
      "subnetId": "'$SUBNET_ID'",
      "preemptibleNodeConfig": null
    }]
  }' \
  ...

# Preemptible GPU pool: cheap, may get evicted
oci ce node-pool create \
  --name gpu-preemptible \
  --node-shape VM.GPU.A10.1 \
  --node-config-details '{
    "size": 2,
    "placementConfigs": [{
      "availabilityDomain": "Uocm:US-ASHBURN-AD-1",
      "subnetId": "'$SUBNET_ID'",
      "preemptibleNodeConfig": {
        "preemptionAction": {
          "type": "TERMINATE",
          "isPreserveBootVolume": false
        }
      }
    }]
  }' \
  --node-metadata '{"user_data": "..."}' \
  --initial-node-labels '[
    {"key": "node-type", "value": "preemptible"},
    {"key": "nvidia.com/gpu", "value": "present"}
  ]'

The label node-type=preemptible is how I control which workloads land on preemptible nodes.

Directing Workloads to the Right Pool

Production inference uses node affinity to avoid preemptible nodes:

# Production: on-demand only
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-production
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: NotIn
                    values: ["preemptible"]
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: 1

Staging and batch workloads prefer preemptible (but tolerate on-demand if preemptible nodes are unavailable):

# Staging: prefer preemptible, accept on-demand as fallback
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-staging
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 90
              preference:
                matchExpressions:
                  - key: node-type
                    operator: In
                    values: ["preemptible"]
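      # Only has an effect if the preemptible nodes carry a matching
      # "preemptible" NoSchedule taint; the node pool command above does not show one.
      tolerations:
        - key: "preemptible"
          operator: "Exists"
          effect: "NoSchedule"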
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: 1
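
To make the difference between the two affinity types concrete, here is a toy Python model. It is not the real kube-scheduler, which combines node affinity with many other scoring plugins; it only shows that a required rule filters nodes out while a preferred rule adds to a node's score. The node names and the `schedule` helper are hypothetical.

```python
from typing import Optional

# Hypothetical nodes: name -> labels. On-demand nodes carry no node-type label.
NODES = {
    "ondemand-1": {},
    "preempt-1": {"node-type": "preemptible"},
    "preempt-2": {"node-type": "preemptible"},
}


def is_preemptible(labels: dict) -> bool:
    return labels.get("node-type") == "preemptible"


def schedule(nodes: dict, require_not_preemptible: bool = False,
             prefer_preemptible_weight: int = 0) -> Optional[str]:
    """Pick a node: the required rule filters, the preferred rule only scores."""
    # Required affinity (NotIn ["preemptible"]): non-matching nodes are removed.
    candidates = {
        name: labels for name, labels in nodes.items()
        if not (require_not_preemptible and is_preemptible(labels))
    }
    if not candidates:
        return None  # nothing fits, so the pod would stay Pending

    # Preferred affinity: matching nodes earn the weight, the rest score 0.
    def score(name: str) -> int:
        return prefer_preemptible_weight if is_preemptible(candidates[name]) else 0

    # Highest score wins; sorting first makes ties deterministic in this toy.
    return max(sorted(candidates), key=score)


print(schedule(NODES, require_not_preemptible=True))    # production pod
print(schedule(NODES, prefer_preemptible_weight=90))    # staging pod
print(schedule({"ondemand-1": {}}, prefer_preemptible_weight=90))  # fallback
```

The last call is the fallback case: with no preemptible node available, the preferred rule has nothing to score, so the staging pod lands on the on-demand node instead of staying Pending.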

Handling Eviction Gracefully

When OCI reclaims a preemptible instance, the node drains and pods get terminated. For inference services, this means requests in flight get dropped. Here's how I handle it:

1. Pod Disruption Budget

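Keeps a node drain from evicting every replica at once (only matters if you have >1 replica). A PDB limits voluntary evictions during a drain; it cannot stop OCI from terminating the instance itself: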

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-staging-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-staging   # must match the labels on the staging pod template
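
Why the replica count matters can be shown with the arithmetic a `minAvailable` budget implies. This is a simplified sketch (the function name is mine, and the real controller also handles percentages and `maxUnavailable`): the number of voluntary evictions allowed is the healthy pod count minus `minAvailable`, never below zero.

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions a minAvailable budget permits right now."""
    return max(0, healthy_pods - min_available)


for replicas in (1, 2, 3):
    allowed = disruptions_allowed(replicas, min_available=1)
    print(f"{replicas} replica(s) -> {allowed} eviction(s) allowed")
```

With a single replica the budget allows zero evictions, so it cannot keep the service up; it only makes the drain wait.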

2. Graceful Shutdown

vLLM handles SIGTERM and finishes in-flight requests before shutting down. I set terminationGracePeriodSeconds to 25 (less than the 30-second eviction notice):

spec:
  terminationGracePeriodSeconds: 25
  containers:
    - name: vllm
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]

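The 5-second preStop sleep gives the load balancer time to stop sending new requests before the container starts shutting down. The sleep counts against the grace period, so vLLM has about 20 seconds after SIGTERM to finish in-flight requests.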
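
vLLM does this for you, but if you run your own Python service on preemptible nodes, the same pattern looks roughly like the sketch below. The `drain` and `install_handler` helpers and the way in-flight requests are represented are assumptions for illustration; only the `signal` calls are standard library. The budget is the grace period minus the preStop sleep.

```python
import signal
import threading
import time

GRACE_PERIOD_S = 25    # terminationGracePeriodSeconds from the spec above
PRE_STOP_SLEEP_S = 5   # the preStop sleep uses up part of the grace period
DRAIN_BUDGET_S = GRACE_PERIOD_S - PRE_STOP_SLEEP_S  # time left after SIGTERM

shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    # Only set a flag here; the serving loop does the actual draining.
    shutting_down.set()


def install_handler():
    """Call once from the main thread at startup."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def drain(in_flight, finish_one, budget_s=DRAIN_BUDGET_S, clock=time.monotonic):
    """Finish in-flight requests until none are left or the budget runs out.

    Returns the requests that could not be finished in time.
    """
    deadline = clock() + budget_s
    pending = list(in_flight)
    while pending and clock() < deadline:
        finish_one(pending.pop(0))
    return pending
```

A serving loop would stop accepting new requests once `shutting_down.is_set()` and then call `drain(...)`. Anything `drain` returns is work the client has to retry.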

3. Model Cache on PVC

When a pod gets evicted and rescheduled to a new preemptible node, I don't want to re-download the model from scratch. I use the OCI Object Storage init container approach (from my earlier post) so the model loads in ~90 seconds instead of 12 minutes.

The Actual Numbers

Two months of data:

Metric	Value
Eviction events	9 total (avg ~1/week)
Avg time to recover	~2 minutes (pod reschedule + model load)
Longest outage	4 minutes (node provisioning + model load)
Monthly cost (3x on-demand)	$3,282
Monthly cost (1x on-demand + 2x preemptible)	~$1,450
Savings	$1,832/month (56%)
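
The 70% and 56% figures measure different things, which a few lines of arithmetic make explicit. This uses only the numbers quoted in this post: 70% is the per-instance hourly discount, while 56% is the saving on the whole bill, which is lower because one of the three instances stays on-demand.

```python
def percent_saved(before: float, after: float) -> float:
    """Relative saving, as a percentage of the original cost."""
    return (before - after) / before * 100


per_instance = percent_saved(1.52, 0.46)   # hourly rate, on-demand vs preemptible
whole_bill = percent_saved(3282, 1450)     # monthly bill, before vs after

print(f"per-instance discount: {per_instance:.0f}%")  # prints 70%
print(f"whole-bill saving:     {whole_bill:.0f}%")    # prints 56%
```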

The evictions cluster — I had three in one day during what I assume was a capacity crunch in us-ashburn-1, then nothing for two weeks. Unpredictable, but the recovery is fast enough that nobody on the team complained.

Batch Jobs on Preemptible: Even Better

For batch inference (processing a dataset, not serving live traffic), preemptible is almost a no-brainer. I use Kubernetes Jobs with checkpointing:

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-inference
spec:
  backoffLimit: 5    # retry up to 5 times if evicted
  template:
    spec:
      nodeSelector:
        node-type: preemptible
      restartPolicy: OnFailure
      containers:
        - name: inference
          image: iad.ocir.io/mytenancy/batch-inference:v1
          env:
            - name: CHECKPOINT_BUCKET
              value: "inference-checkpoints"
          resources:
            limits:
              nvidia.com/gpu: 1

The batch job saves progress to OCI Object Storage every N records. If it gets evicted, Kubernetes restarts it and it picks up from the last checkpoint. I've had batch jobs complete across 3 evictions without losing any work.
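
The checkpoint-and-resume loop inside the container can be sketched as follows. This is a minimal illustration, not my production code: a local JSON file stands in for the Object Storage bucket, and `run_batch`, `load_checkpoint`, `save_checkpoint` and the `every_n` parameter are hypothetical names.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> int:
    """Return the index of the next record to process (0 if no checkpoint)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_index: int) -> None:
    """Write to a temp file, then rename, so a kill mid-write can't corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def run_batch(records, process, checkpoint_path, every_n=100):
    """Process records starting at the last checkpoint, saving every N records."""
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(records)):
        process(records[i])
        if (i + 1) % every_n == 0:
            save_checkpoint(checkpoint_path, i + 1)
    save_checkpoint(checkpoint_path, len(records))
```

Records processed after the last checkpoint but before an eviction are processed again on restart, so `process` should be idempotent. A smaller `every_n` means less repeated work but more checkpoint writes.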

When to Use Preemptible GPUs

Staging/dev environments: latency spikes from eviction are fine
Batch inference: checkpoint and retry
Training runs: if your framework supports checkpointing (most do)
Experimentation: exploring models, testing prompts

When Not To

Customer-facing inference: use on-demand, the cost is worth the reliability
Short-deadline batch: if the job must finish by a specific time, eviction adds unpredictability
Single-replica production: no fallback when the one instance gets evicted

Pavan Madduri: Oracle ACE Associate, CNCF Golden Kubestronaut.
