I had three small models I wanted to serve: Phi-3-mini for general chat, CodeLlama-7B for code suggestions, and a fine-tuned Mistral for document summarization. Each one fits in about 5-6GB of VRAM. An A10 GPU has 24GB. Three models, one GPU, plenty of headroom.
Running three separate vLLM deployments, each requesting a full GPU, would cost 3x and waste 18GB of VRAM. So I figured out how to serve all three from one container.
The Naive Approach (and Why It Didn't Work)
My first idea was simple: run three vLLM processes in one pod, each binding to a different port.
# Don't do this
vllm serve microsoft/Phi-3-mini-4k-instruct --port 8001 &
vllm serve codellama/CodeLlama-7b-Instruct-hf --port 8002 &
vllm serve my-org/mistral-summarizer --port 8003 &
This doesn't work because each vLLM process tries to claim the entire GPU. The second process crashes with a CUDA out-of-memory error because the first one already allocated all the VRAM.
You can set --gpu-memory-utilization 0.30 on each to split the memory, but vLLM's performance drops significantly when memory is constrained continuous batching can't work efficiently, and you lose the KV cache space that makes vLLM fast.
What Actually Works: vLLM with LoRA Adapters
If your models are fine-tuned versions of the same base model (or you can restructure them that way), vLLM supports serving multiple LoRA adapters on a single base model. One base model in GPU memory, multiple lightweight adapters loaded on demand.
docker run --gpus all -p 8000:8000 \
-v /models:/models \
vllm/vllm-openai:latest \
--model /models/mistral-7b-base \
--enable-lora \
--lora-modules \
"chat=/models/lora-chat" \
"code=/models/lora-code" \
"summary=/models/lora-summary" \
--max-loras 3 \
--max-model-len 4096
Clients specify which adapter to use in the model field:
# Chat model
curl http://localhost:8000/v1/chat/completions \
-d '{"model": "chat", "messages": [...]}'
# Code model
curl http://localhost:8000/v1/chat/completions \
-d '{"model": "code", "messages": [...]}'
# Summary model
curl http://localhost:8000/v1/chat/completions \
-d '{"model": "summary", "messages": [...]}'
The base model (Mistral 7B) uses ~14GB VRAM. Each LoRA adapter adds only 50-200MB. All three adapters fit easily on a 24GB A10.
When You Have Different Base Models
If your models aren't LoRA variants of the same base (mine weren't originally), you have two options:
Option A: Ollama with Multiple Models
Ollama handles model loading/unloading automatically. When you request a model, it loads it into GPU memory. When memory fills up, it evicts the least recently used model.
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama-multi
spec:
replicas: 1
template:
spec:
containers:
- name: ollama
image: ollama/ollama:latest
ports:
- containerPort: 11434
env:
- name: OLLAMA_HOST
value: "0.0.0.0"
- name: OLLAMA_NUM_PARALLEL
value: "4"
resources:
limits:
nvidia.com/gpu: 1
memory: 32Gi
volumeMounts:
- name: models
mountPath: /root/.ollama
volumes:
- name: models
persistentVolumeClaim:
claimName: ollama-models
Load models after deployment:
OLLAMA_IP=$(kubectl get svc ollama-multi -o jsonpath='{.spec.clusterIP}')
curl http://$OLLAMA_IP:11434/api/pull -d '{"name": "phi3:mini"}'
curl http://$OLLAMA_IP:11434/api/pull -d '{"name": "codellama:7b"}'
curl http://$OLLAMA_IP:11434/api/pull -d '{"name": "mistral:7b"}'
The downside: model swapping takes 5-15 seconds when a cold model needs to load. For a team that mostly uses one model at a time, this is fine. For concurrent usage of all three, there's latency on the first request to each model.
Option B: Triton Inference Server
NVIDIA Triton can serve multiple models on one GPU with explicit memory allocation. It's more complex to set up but gives you fine-grained control:
# model_repository/
# ├── phi3/
# │ ├── config.pbtxt
# │ └── 1/
# │ └── model.onnx
# ├── codellama/
# │ ├── config.pbtxt
# │ └── 1/
# │ └── model.onnx
# └── summarizer/
# ├── config.pbtxt
# └── 1/
# └── model.onnx
FROM nvcr.io/nvidia/tritonserver:24.01-py3
COPY model_repository /models
CMD ["tritonserver", "--model-repository=/models", "--model-control-mode=explicit"]
I tried Triton and it works well for ONNX/TensorRT models. For plain HuggingFace transformer models, the conversion step adds friction. I ended up going with the LoRA approach for my use case.
My OKE Deployment
I went with vLLM + LoRA because two of my three models were fine-tuned Mistral variants anyway. I retrained the third (the code model) as a LoRA on the same Mistral base.
apiVersion: apps/v1
kind: Deployment
metadata:
name: multi-model-inference
namespace: inference
spec:
replicas: 1
template:
spec:
initContainers:
- name: model-loader
image: ghcr.io/oracle/oci-cli:latest
command: ["/bin/bash", "-c"]
args:
- |
for model in mistral-base lora-chat lora-code lora-summary; do
if [ ! -f /models/$model/.complete ]; then
oci os object bulk-download --bucket-name ai-models \
--prefix "models/$model/" \
--download-dir /models/$model \
--auth instance_principal
touch /models/$model/.complete
fi
done
volumeMounts:
- name: models
mountPath: /models
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "/models/mistral-base"
- "--enable-lora"
- "--lora-modules"
- "chat=/models/lora-chat"
- "code=/models/lora-code"
- "summary=/models/lora-summary"
- "--max-loras"
- "3"
- "--max-model-len"
- "4096"
- "--gpu-memory-utilization"
- "0.9"
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: 1
memory: 32Gi
volumeMounts:
- name: models
mountPath: /models
volumes:
- name: models
persistentVolumeClaim:
claimName: model-cache
Cost Impact
| Setup | GPUs Needed | Monthly Cost (OCI A10) |
|---|---|---|
| 3 separate vLLM deployments | 3 | $3,282 |
| 1 vLLM with 3 LoRA adapters | 1 | $1,094 |
| Savings | $2,188/month |
Same three models, same inference quality (LoRA adds negligible overhead), one-third the cost. The trade-off is slightly more complex deployment config and the requirement that all models share a base.
For teams exploring multi-model setups, start with Ollama (simplest), graduate to vLLM + LoRA if your models share a base, and use Triton if you need maximum control over GPU memory allocation.
Pavan Madduri - Oracle ACE Associate, CNCF Golden Kubestronaut. GitHub | LinkedIn | Website | Google Scholar | ResearchGate
Top comments (1)
Nice writeup — the LoRA-adapter route is the one people underuse. One operational thing worth flagging for anyone copying this: once all three adapters share a single vLLM instance, they also share one continuous-batching pool and KV cache, so you've traded VRAM savings for a noisy-neighbor problem. A burst of long-context "summary" requests can starve the latency-sensitive "chat" path, and there's no per-adapter QoS knob out of the box — we ended up putting a small admission-control layer in front to cap concurrent tokens per adapter. The Ollama option sidesteps that but pays cold-start latency on every model swap: fine for low QPS, rough once traffic interleaves. Worth noting the hardware alternative too — MIG gives you hard partitioning, but the A10 doesn't support it (only A100/H100/A30), which I'm guessing is exactly why you went software-side here. Did you measure p99 latency on the chat adapter under mixed load, or mostly steady-state throughput? The steady-state numbers usually look great; it's the interleaved-traffic tail that decides whether one-GPU-three-models survives production.