I was spending ~$3,300/month on three A10 GPU instances for a mix of staging inference, batch processing, and experimentation. All on-demand. Then I switched two of the three to preemptible instances and my GPU bill dropped to about $1,450/month.
The trade-off is that OCI can reclaim preemptible instances with 30 seconds' notice. For customer-facing production inference, that's a non-starter. For everything else I was running, it was fine.
What Preemptible Instances Actually Mean
OCI preemptible instances are spare capacity sold at a steep discount. An A10 GPU instance goes from $1.52/hr on-demand to ~$0.46/hr preemptible. That's roughly a 70% discount.
The catch: OCI can terminate the instance when it needs the capacity back. You get a 30-second warning via instance metadata. Your workload needs to handle this gracefully.
In practice, I've been running preemptible GPU nodes on OKE for two months and I've had 9 evictions total, about one a week on average. Some weeks zero. It depends on demand in your region and availability domain.
OKE Node Pool Setup
I run two GPU node pools, one on-demand for production and one preemptible for everything else:
```bash
# Production GPU pool: always available
oci ce node-pool create \
  --name gpu-production \
  --node-shape VM.GPU.A10.1 \
  --node-config-details '{
    "size": 1,
    "placementConfigs": [{
      "availabilityDomain": "Uocm:US-ASHBURN-AD-1",
      "subnetId": "'$SUBNET_ID'",
      "preemptibleNodeConfig": null
    }]
  }' \
  ...

# Preemptible GPU pool: cheap, may get evicted
oci ce node-pool create \
  --name gpu-preemptible \
  --node-shape VM.GPU.A10.1 \
  --node-config-details '{
    "size": 2,
    "placementConfigs": [{
      "availabilityDomain": "Uocm:US-ASHBURN-AD-1",
      "subnetId": "'$SUBNET_ID'",
      "preemptibleNodeConfig": {
        "preemptionAction": {
          "type": "TERMINATE",
          "isPreserveBootVolume": false
        }
      }
    }]
  }' \
  --node-metadata '{"user_data": "..."}' \
  --initial-node-labels '[
    {"key": "node-type", "value": "preemptible"},
    {"key": "nvidia.com/gpu", "value": "present"}
  ]'
```
The label node-type=preemptible is how I control which workloads land on preemptible nodes.
Directing Workloads to the Right Pool
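Production inference uses a required node affinity to stay off preemptible nodes. NotIn also matches nodes that have no node-type label at all, which is why the on-demand pool doesn't need a label of its own: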
```yaml
# Production: on-demand only
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-production
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: NotIn
                    values: ["preemptible"]
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: 1
```
Staging and batch workloads prefer preemptible (but tolerate on-demand if preemptible nodes are unavailable):
```yaml
# Staging: prefer preemptible, accept on-demand as fallback
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-staging
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 90
              preference:
                matchExpressions:
                  - key: node-type
                    operator: In
                    values: ["preemptible"]
      # Only takes effect if the preemptible nodes carry a matching taint;
      # the node pool above only sets a label.
      tolerations:
        - key: "preemptible"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: 1
```
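To make the difference between the two rules concrete, here's a toy model of how a required rule and a preferred rule play out. This is a simplified sketch in Python, not the real scheduler: `matches` and `pick_node` are hypothetical helpers, the node names are made up, and the real scheduler also weighs resources, taints, and plenty of other signals.

```python
def matches(labels, key, operator, values):
    """Evaluate one matchExpression (In / NotIn only) against a node's labels."""
    if operator == "In":
        return labels.get(key) in values
    if operator == "NotIn":
        # NotIn also matches nodes that do not carry the label at all.
        return labels.get(key) not in values
    raise ValueError(f"unsupported operator: {operator}")


def pick_node(nodes, required=None, preferred=None):
    """Return the chosen node name, or None if no node passes the hard filter.

    required:  (key, operator, values), a hard filter
    preferred: (weight, (key, operator, values)), a soft preference
    """
    eligible = [
        n for n in nodes
        if required is None or matches(n["labels"], *required)
    ]
    if not eligible:
        return None  # the pod would stay Pending

    def score(node):
        if preferred is None:
            return 0
        weight, expr = preferred
        return weight if matches(node["labels"], *expr) else 0

    return max(eligible, key=score)["name"]


# Hypothetical nodes: the on-demand node has no node-type label.
on_demand = {"name": "od-1", "labels": {}}
preemptible = {"name": "pre-1", "labels": {"node-type": "preemptible"}}

production = ("node-type", "NotIn", ["preemptible"])
staging = (90, ("node-type", "In", ["preemptible"]))

print(pick_node([on_demand, preemptible], required=production))  # od-1
print(pick_node([on_demand, preemptible], preferred=staging))    # pre-1
print(pick_node([on_demand], preferred=staging))                 # od-1 (fallback)
print(pick_node([preemptible], required=production))             # None
```

The last two calls are the important ones: staging falls back to on-demand when no preemptible node exists, while production would rather stay unscheduled than land on a preemptible node.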
Handling Eviction Gracefully
When OCI reclaims a preemptible instance, the node drains and pods get terminated. For inference services, this means requests in flight get dropped. Here's how I handle it:
1. Pod Disruption Budget
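Keeps a voluntary eviction, such as a node drain, from taking down every replica at once. It only matters if you have more than one replica, and it cannot protect pods when the instance itself is terminated: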
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-staging-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-staging
```
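The arithmetic behind `minAvailable` is simple enough to sketch. The function below is a hypothetical helper, not a Kubernetes API, and it only models the integer `minAvailable` case: the number of pods the eviction API may remove is the healthy pod count minus `minAvailable`, floored at zero.

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Pods the eviction API may remove right now under an integer minAvailable."""
    return max(0, healthy_pods - min_available)


for replicas in (1, 2, 3):
    allowed = disruptions_allowed(replicas, min_available=1)
    print(f"{replicas} replica(s): {allowed} evictable")
```

With a single replica the budget allows zero voluntary evictions, so a drain has to wait. That budget applies only to evictions requested through the API; it does nothing once the instance is gone.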
2. Graceful Shutdown
vLLM handles SIGTERM and finishes in-flight requests before shutting down. I set terminationGracePeriodSeconds to 25 (less than the 30-second eviction notice):
```yaml
spec:
  terminationGracePeriodSeconds: 25
  containers:
    - name: vllm
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]
```
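The 5-second preStop sleep gives the load balancer time to stop sending new requests before the container starts shutting down. The sleep counts against the grace period, so vLLM has about 20 seconds after SIGTERM to finish in-flight requests.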
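vLLM takes care of SIGTERM on its own, but if you run a custom inference service on preemptible nodes you need the same behaviour. Here's a minimal sketch of the pattern in Python, assuming the 25-second grace period and 5-second preStop sleep from the manifest. Kubernetes counts the preStop hook against the grace period, so the drain budget is the difference between the two. `handle_sigterm`, `drain`, and the `in_flight` callback are hypothetical names for illustration.

```python
import signal
import threading
import time

GRACE_PERIOD_S = 25    # terminationGracePeriodSeconds
PRE_STOP_SLEEP_S = 5   # preStop sleep, which counts against the grace period
DRAIN_BUDGET_S = GRACE_PERIOD_S - PRE_STOP_SLEEP_S  # time left after SIGTERM

# Request handlers check this flag and reject new work once it is set.
shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: flip the flag, do the slow work elsewhere."""
    shutting_down.set()


def drain(in_flight, budget_s=DRAIN_BUDGET_S, poll_s=0.05):
    """Wait until in_flight() reports zero requests or the budget runs out.

    Returns True if everything finished in time, False otherwise.
    """
    deadline = time.monotonic() + budget_s
    while in_flight() > 0:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True
```

In a real service you would register the handler from the main thread with `signal.signal(signal.SIGTERM, handle_sigterm)`, wait on `shutting_down`, and then call `drain` with a function that returns the current number of in-flight requests before exiting.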
3. Model Cache on PVC
When a pod gets evicted and rescheduled to a new preemptible node, I don't want to re-download the model from scratch. I use the OCI Object Storage init container approach (from my earlier post) so the model loads in ~90 seconds instead of 12 minutes.
The Actual Numbers
Two months of data:
| Metric | Value |
|---|---|
| Eviction events | 9 total (avg ~1/week) |
| Avg time to recover | ~2 minutes (pod reschedule + model load) |
| Longest outage | 4 minutes (node provisioning + model load) |
| Monthly cost (3x on-demand) | $3,282 |
| Monthly cost (1x on-demand + 2x preemptible) | ~$1,450 |
| Savings | $1,832/month (56%) |
The evictions cluster: I had three in one day during what I assume was a capacity crunch in us-ashburn-1, then nothing for two weeks. Unpredictable, but the recovery is fast enough that nobody on the team complained.
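If you want to sanity-check the percentages, the arithmetic is short. This snippet only uses the hourly rates and monthly totals quoted in this post; `percent_saved` is a throwaway helper.

```python
def percent_saved(before: float, after: float) -> float:
    """Relative saving, as a percentage of the original cost."""
    return (before - after) / before * 100


hourly = percent_saved(1.52, 0.46)    # per-instance A10 rate, on-demand vs preemptible
monthly = percent_saved(3282, 1450)   # whole GPU bill, before vs after

print(f"hourly discount: {hourly:.1f}%")                       # 69.7%
print(f"monthly savings: ${3282 - 1450:,} ({monthly:.1f}%)")   # $1,832 (55.8%)
```

The bill drops by less than the per-instance discount because one of the three instances stays on-demand.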
Batch Jobs on Preemptible: Even Better
For batch inference (processing a dataset, not serving live traffic), preemptible is almost a no-brainer. I use Kubernetes Jobs with checkpointing:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-inference
spec:
  backoffLimit: 5  # retry up to 5 times if evicted
  template:
    spec:
      nodeSelector:
        node-type: preemptible
      restartPolicy: OnFailure
      containers:
        - name: inference
          image: iad.ocir.io/mytenancy/batch-inference:v1
          env:
            - name: CHECKPOINT_BUCKET
              value: "inference-checkpoints"
          resources:
            limits:
              nvidia.com/gpu: 1
```
The batch job saves progress to OCI Object Storage every N records. If it gets evicted, Kubernetes restarts it and it picks up from the last checkpoint. I've had batch jobs complete across 3 evictions without losing any work.
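The checkpoint-and-resume loop looks roughly like this. It's a sketch, not my production job: to keep it self-contained it writes the checkpoint to a local JSON file, where the real job would read and write an object in the checkpoint bucket. `load_checkpoint`, `save_checkpoint`, and `run_batch` are hypothetical names.

```python
import json
import os


def load_checkpoint(path):
    """Return the index of the next unprocessed record (0 if no checkpoint yet)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Write to a temp file, then rename, so an eviction mid-write can't corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def run_batch(records, process, path, every=100):
    """Process records from the last checkpoint on, checkpointing every `every` records."""
    start = load_checkpoint(path)
    for i in range(start, len(records)):
        process(records[i])
        if (i + 1) % every == 0:
            save_checkpoint(path, i + 1)
    save_checkpoint(path, len(records))  # mark the run as complete
```

After an eviction, the restarted pod calls `run_batch` again with the same checkpoint path and continues from the last saved index. Records processed after the last checkpoint get processed a second time, so `process` should be idempotent, and a smaller `every` trades extra writes for less repeated work.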
When to Use Preemptible GPUs
- Staging/dev environments: latency spikes from eviction are fine
- Batch inference: checkpoint and retry
- Training runs: if your framework supports checkpointing (most do)
- Experimentation: exploring models, testing prompts
When Not To
- Customer-facing inference: use on-demand, the cost is worth the reliability
- Short-deadline batch: if the job must finish by a specific time, eviction adds unpredictability
- Single-replica production: no fallback when the one instance gets evicted
Pavan Madduri: Oracle ACE Associate, CNCF Golden Kubestronaut.