Pavan Madduri

Posted on Jul 2

Serving 3 LLMs on 1 GPU - Multi-Model Inference with Docker on OKE

#ai #docker #kubernetes #oci

I had three small models I wanted to serve: Phi-3-mini for general chat, CodeLlama-7B for code suggestions, and a fine-tuned Mistral for document summarization. Each one fits in about 5-6GB of VRAM. An A10 GPU has 24GB. Three models, one GPU, plenty of headroom.

Running three separate vLLM deployments, each requesting a full GPU, would cost 3x and waste 18GB of VRAM. So I figured out how to serve all three from one container.

The Naive Approach (and Why It Didn't Work)

My first idea was simple: run three vLLM processes in one pod, each binding to a different port.

# Don't do this
vllm serve microsoft/Phi-3-mini-4k-instruct --port 8001 &
vllm serve codellama/CodeLlama-7b-Instruct-hf --port 8002 &
vllm serve my-org/mistral-summarizer --port 8003 &

This doesn't work because each vLLM process tries to claim the entire GPU. The second process crashes with a CUDA out-of-memory error because the first one already allocated all the VRAM.

You can set --gpu-memory-utilization 0.30 on each to split the memory, but vLLM's performance drops significantly when memory is constrained continuous batching can't work efficiently, and you lose the KV cache space that makes vLLM fast.

What Actually Works: vLLM with LoRA Adapters

If your models are fine-tuned versions of the same base model (or you can restructure them that way), vLLM supports serving multiple LoRA adapters on a single base model. One base model in GPU memory, multiple lightweight adapters loaded on demand.

docker run --gpus all -p 8000:8000 \
  -v /models:/models \
  vllm/vllm-openai:latest \
  --model /models/mistral-7b-base \
  --enable-lora \
  --lora-modules \
    "chat=/models/lora-chat" \
    "code=/models/lora-code" \
    "summary=/models/lora-summary" \
  --max-loras 3 \
  --max-model-len 4096

Clients specify which adapter to use in the model field:

# Chat model
curl http://localhost:8000/v1/chat/completions \
  -d '{"model": "chat", "messages": [...]}'

# Code model
curl http://localhost:8000/v1/chat/completions \
  -d '{"model": "code", "messages": [...]}'

# Summary model
curl http://localhost:8000/v1/chat/completions \
  -d '{"model": "summary", "messages": [...]}'

The base model (Mistral 7B) uses ~14GB VRAM. Each LoRA adapter adds only 50-200MB. All three adapters fit easily on a 24GB A10.

When You Have Different Base Models

If your models aren't LoRA variants of the same base (mine weren't originally), you have two options:

Option A: Ollama with Multiple Models

Ollama handles model loading/unloading automatically. When you request a model, it loads it into GPU memory. When memory fills up, it evicts the least recently used model.

# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-multi
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"
            - name: OLLAMA_NUM_PARALLEL
              value: "4"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models

Load models after deployment:

OLLAMA_IP=$(kubectl get svc ollama-multi -o jsonpath='{.spec.clusterIP}')

curl http://$OLLAMA_IP:11434/api/pull -d '{"name": "phi3:mini"}'
curl http://$OLLAMA_IP:11434/api/pull -d '{"name": "codellama:7b"}'
curl http://$OLLAMA_IP:11434/api/pull -d '{"name": "mistral:7b"}'

The downside: model swapping takes 5-15 seconds when a cold model needs to load. For a team that mostly uses one model at a time, this is fine. For concurrent usage of all three, there's latency on the first request to each model.

Option B: Triton Inference Server

NVIDIA Triton can serve multiple models on one GPU with explicit memory allocation. It's more complex to set up but gives you fine-grained control:

# model_repository/
# ├── phi3/
# │   ├── config.pbtxt
# │   └── 1/
# │       └── model.onnx
# ├── codellama/
# │   ├── config.pbtxt
# │   └── 1/
# │       └── model.onnx
# └── summarizer/
#     ├── config.pbtxt
#     └── 1/
#         └── model.onnx

FROM nvcr.io/nvidia/tritonserver:24.01-py3
COPY model_repository /models
CMD ["tritonserver", "--model-repository=/models", "--model-control-mode=explicit"]

I tried Triton and it works well for ONNX/TensorRT models. For plain HuggingFace transformer models, the conversion step adds friction. I ended up going with the LoRA approach for my use case.

My OKE Deployment

I went with vLLM + LoRA because two of my three models were fine-tuned Mistral variants anyway. I retrained the third (the code model) as a LoRA on the same Mistral base.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-model-inference
  namespace: inference
spec:
  replicas: 1
  template:
    spec:
      initContainers:
        - name: model-loader
          image: ghcr.io/oracle/oci-cli:latest
          command: ["/bin/bash", "-c"]
          args:
            - |
              for model in mistral-base lora-chat lora-code lora-summary; do
                if [ ! -f /models/$model/.complete ]; then
                  oci os object bulk-download --bucket-name ai-models \
                    --prefix "models/$model/" \
                    --download-dir /models/$model \
                    --auth instance_principal
                  touch /models/$model/.complete
                fi
              done
          volumeMounts:
            - name: models
              mountPath: /models
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "/models/mistral-base"
            - "--enable-lora"
            - "--lora-modules"
            - "chat=/models/lora-chat"
            - "code=/models/lora-code"
            - "summary=/models/lora-summary"
            - "--max-loras"
            - "3"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.9"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
          volumeMounts:
            - name: models
              mountPath: /models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-cache

Cost Impact

Setup	GPUs Needed	Monthly Cost (OCI A10)
3 separate vLLM deployments	3	$3,282
1 vLLM with 3 LoRA adapters	1	$1,094
Savings		$2,188/month

Same three models, same inference quality (LoRA adds negligible overhead), one-third the cost. The trade-off is slightly more complex deployment config and the requirement that all models share a base.

For teams exploring multi-model setups, start with Ollama (simplest), graduate to vLLM + LoRA if your models share a base, and use Triton if you need maximum control over GPU memory allocation.

Pavan Madduri - Oracle ACE Associate, CNCF Golden Kubestronaut. GitHub | LinkedIn | Website | Google Scholar | ResearchGate

Top comments (1)

Max Quimby • Jul 2

Nice writeup — the LoRA-adapter route is the one people underuse. One operational thing worth flagging for anyone copying this: once all three adapters share a single vLLM instance, they also share one continuous-batching pool and KV cache, so you've traded VRAM savings for a noisy-neighbor problem. A burst of long-context "summary" requests can starve the latency-sensitive "chat" path, and there's no per-adapter QoS knob out of the box — we ended up putting a small admission-control layer in front to cap concurrent tokens per adapter. The Ollama option sidesteps that but pays cold-start latency on every model swap: fine for low QPS, rough once traffic interleaves. Worth noting the hardware alternative too — MIG gives you hard partitioning, but the A10 doesn't support it (only A100/H100/A30), which I'm guessing is exactly why you went software-side here. Did you measure p99 latency on the chat adapter under mixed load, or mostly steady-state throughput? The steady-state numbers usually look great; it's the interleaved-traffic tail that decides whether one-GPU-three-models survives production.