Serving 40 LoRA adapters on one base model: the throughput we got

#machinelearning #llm #pytorch #mlops

TL;DR: We fine-tune one LoRA adapter per enterprise customer on top of a single Llama 3.1 8B base. Running them as 40 separate deployments would have cost roughly $24k/month in mostly-idle GPU. Multi-LoRA serving in vLLM put all 40 on two A100s. Numbers and the parts that broke below.

At Nexus Labs we run the fine-tuning and eval team for agent automation. Each enterprise customer gets its own adapter because each has a different tool schema and a different house style for responses. Right now that's 40 customers in production. Rank-16 LoRA, about 42MB per adapter on disk, trained with PEFT and TRL on their own trace data.

The obvious setup is one model server per customer. That's 40 copies of an 8B base. In bf16 the base is around 16GB of weights before KV cache. Forty of those does not fit on anything we can afford, and most customers send fewer than 5 requests a minute. So you're paying for a GPU to sit at 3% utilization. We priced it at about $24k/month across the fleet on reserved A100s. No.

Multi-LoRA: one base, many adapters

vLLM (we're on 0.6.3) loads the base weights once and applies adapter deltas at request time. You turn it on with --enable-lora and register adapters by name. The base sits in GPU memory once. Each adapter is a few MB, so dozens fit in the same box.

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 8 \
  --max-lora-rank 16 \
  --max-cpu-loras 64 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

A request picks its adapter through the model field:

curl http://localhost:8000/v1/chat/completions \
  -d '{"model": "customer_acme_v3", "messages": [...]}'

--max-loras 8 is the number of distinct adapters that can be active in a single batch on the GPU. --max-cpu-loras 64 is the CPU-side pool that adapters get swapped in from. When a 9th distinct adapter shows up in a batch, vLLM evicts the least-recently-used one back to CPU. That swap costs us 30 to 50ms measured at p50. Swapping from disk instead of the CPU pool is much worse, so size the CPU pool to your real customer count.

The numbers

Two A100 80GB, base loaded once per box, adapters shared. Load tested at 600 req/min across the 40 adapters with a Poisson arrival mix weighted by real customer traffic.

Metric	40 separate deployments	Multi-LoRA, 2x A100
GPUs needed	~40 (or heavy quant + packing)	2
Base weights in memory	40 copies	2 copies
Adapter memory	n/a	~1.7GB total resident
Idle cost / month	~$24k	~$1.2k
p50 latency (256 tok)	410ms	470ms
Cold adapter swap (CPU pool)	n/a	30-50ms
Aggregate throughput	bounded by idle waste	~3,100 tok/s/box

The latency tax is real but small. About 60ms at p50 from the grouped GEMM the multi-LoRA kernel runs when a batch contains several different adapters. For our agent workloads, where a single tool-call turn is 100 to 400 output tokens, that's noise next to the network round trip.

Eval gating, because outputs are not identical

I do not roll out a serving change without an eval gate. Multi-LoRA does not produce bit-identical output to a standalone fine-tuned model. The batched LoRA kernel accumulates differently than the single-adapter path. Greedy decode matched on our set. Sampled decode diverged within tolerance, which is expected, but I wanted it measured, not assumed.

So before cutover we ran each customer's adversarial eval set, 200 tool-call prompts apiece, scoring exact match on tool name plus a JSON-normalized arg comparison. Gate: no regression above 0.5% versus the standalone deployment. Two adapters tripped it. Both turned out to be rank mismatches in how they were exported, not a serving bug. Fixed the export, re-ran, shipped.

In front of the vLLM box we run Bifrost (https://github.com/maximhq/bifrost) as the gateway. It gives us one OpenAI-compatible endpoint, and if the self-hosted box saturates or drops, it falls back to a hosted provider running the generic adapter so a customer gets a degraded answer instead of a 503. It's one gateway option among several; we picked it for the failover behavior.

Trade-offs and Limitations

Eviction thrash. --max-loras 8 means bursty traffic across more than 8 distinct customers in the same window causes constant swapping. If your concurrency exceeds your active-adapter slots, you pay the 30-50ms swap on a chunk of requests. Watch your eviction rate, not just latency.
Uniform rank. Mixing rank 8 and rank 64 adapters wastes the padded buffer, which is sized to the max. We standardized on rank 16 across all customers. If one needs more capacity, it doesn't belong in this pool.
Throughput per adapter drops when many distinct adapters land in one batch, because the kernel does a grouped GEMM instead of one dense matmul. Few adapters per batch, near-dense speed. Many, you lose some.
One base, one tokenizer. Every adapter has to share the same base model and tokenizer. A customer who needs a different base (say a 70B) gets its own deployment. No way around it.
Numerical drift means you own an eval set. If you don't have per-customer regression tests, you can't safely make this swap. The infra savings assume you can prove output parity.

The model was the easy part here. Two A100s instead of forty came down to knowing how many adapters are actually hot at once and sizing the slots to that, then proving the outputs didn't move.