DEV Community

Arsen Apostolov

Fitting WhisperX large-v3 + a 24B LLM on one 3090: a reproducible context-capping recipe

This is the technical, reproducible version of a fix I shipped on my own homelab. If you want the narrative version, that's on Medium. This one is the recipe: the measurements, the math, the Modelfile, and the exact prompt I gave Claude Code to generate it. Copy-paste friendly.

Repo for the dashboard used throughout: https://github.com/SikamikanikoBG/homelab-monitor

TL;DR

  • One 24GB RTX 3090, two GPU services: WhisperX large-v3 (STT, 7.7GB peak) and a Devstral Small 24B email-triage LLM (Q4_K_M, ~18.3GB).
  • 18.3 + 7.7 = 26GB → CUDA OOM whenever they overlapped.
  • The LLM was loaded with a 40k context window but the triage job never needed more than ~5–8k tokens.
  • Capped num_ctx to 8192 → KV cache drops from ~6.1GB to ~1.25GB → model footprint ~18.3GB → ~14.2GB.
  • 14.2 + 7.7 = 21.9GB → both resident, zero OOM, no quality loss.

The setup

Host:    openSUSE, Xeon (56 threads), 125GB RAM, 1x RTX 3090 (24GB)
GPU svc: WhisperX large-v3  (speech-to-text)
GPU svc: Ollama -> devstral-small-2 (24B, Q4_K_M) for background email triage

Both services run all the time. The OOM only happened when I dictated to my assistant (WhisperX) while the triage loop was active.

Step 1 — Make the contention measurable

nvidia-smi shows instantaneous VRAM. It can't show you which service spiked or when two of them overlapped — and an intermittent OOM is a timing problem. You need per-service VRAM history.
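To make "per-service VRAM history" concrete, here is a minimal sketch that samples `nvidia-smi --query-compute-apps` on an interval and appends timestamped rows to a CSV. It is not taken from homelab-monitor; the function names, the `vram_history.csv` path, and the 5-second interval are made up for illustration. Inside containers, process names may be hidden or unhelpful, so treat this as a starting point only.

```python
import csv
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]

def parse_compute_apps(text):
    """Parse nvidia-smi CSV rows into (pid, process_name, used_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Skip blank lines and rows where memory is reported as "[N/A]".
        if len(parts) != 3 or not parts[0].isdigit() or not parts[2].isdigit():
            continue
        rows.append((int(parts[0]), parts[1], int(parts[2])))
    return rows

def sample_forever(log_path="vram_history.csv", interval_s=5):
    """Append one timestamped row per GPU process every interval_s seconds."""
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            out = subprocess.run(
                QUERY, capture_output=True, text=True, check=True
            ).stdout
            now = int(time.time())
            for pid, name, used_mib in parse_compute_apps(out):
                writer.writerow([now, pid, name, used_mib])
            f.flush()
            time.sleep(interval_s)
```

Calling `sample_forever()` in a background process and plotting the CSV grouped by process name gives a rough view of which service spiked and when two of them overlapped.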

I use my own dashboard (homelab-monitor) for this. The relevant view is "AI Models", which attributes VRAM per model server and per loaded model, over a time range, with OOM markers and a capacity ceiling line.

[Figure: VRAM by service over time, spiking into the 24GB ceiling with an OOM marker]

What the history showed at the overlap window:

Service                   Peak VRAM
Devstral 24B (triage)     ~18.3 GB
WhisperX large-v3         7.7 GB
Total                     ~26 GB on a 24 GB card
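The overlap arithmetic can be written as a small budget check. This is only a sketch: the function name and the service labels are illustrative, and the peaks are the measured figures quoted in this post.

```python
def vram_headroom_gb(capacity_gb, peaks_gb):
    """Return capacity minus the sum of per-service peaks (negative means OOM risk)."""
    return capacity_gb - sum(peaks_gb.values())

# Peak VRAM per service at the overlap window, in GB.
before = {"devstral-24b-triage": 18.3, "whisperx-large-v3": 7.7}
print(round(vram_headroom_gb(24.0, before), 1))  # -2.0
```

A negative result means the two peaks cannot coexist on the card, no matter how rarely they overlap.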

[Figure: per-model attribution, before and after on one frame: old model 18.3GB vs capped triage variant 14.2GB, WhisperX 7.7GB]

If you want to reproduce the measurement, the dashboard runs as a single container:

git clone https://github.com/SikamikanikoBG/homelab-monitor
cd homelab-monitor
docker compose up -d --build
# open http://<host>:9800  -> AI Models / GPU views

(NVIDIA Container Toolkit required for GPU metrics. Remote hosts are monitored over SSH, no agent.)

Step 2 — Measure what the job actually needs

Weights are a fixed cost (~15GB for Devstral 24B at Q4_K_M). The variable cost is the KV cache, which scales linearly with num_ctx. So the question is: how much context does background email triage actually use?

I pulled the request traces from Langfuse. The triage pipeline:

  • truncates each email body to 300–500 chars,
  • batches ~10 emails per call,
  • caps generation around 2k tokens.

Real prompts never exceeded ~5–8k tokens. The model was loaded with a 40k window — ~32k tokens of reserved KV cache doing nothing.
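Once the per-request token counts are out of the tracing tool, the check itself is tiny. The sketch below assumes you have already exported total tokens per request as a plain list of integers; it does not call the Langfuse API, and the sample numbers are hypothetical, not my real traces.

```python
def summarize_context_use(token_counts, candidate_ctx):
    """Return (max tokens seen, number of requests that would not fit candidate_ctx)."""
    if not token_counts:
        raise ValueError("no traces to summarize")
    worst = max(token_counts)
    overflow = sum(1 for n in token_counts if n > candidate_ctx)
    return worst, overflow

# Hypothetical per-request totals (prompt + completion tokens), not real trace data.
sample = [3100, 4800, 5200, 6900, 7400]
print(summarize_context_use(sample, 8192))  # (7400, 0)
print(summarize_context_use(sample, 4096))  # (7400, 4)
```

Any non-zero overflow count for a candidate cap means some requests would be truncated at that size.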

Step 3 — Do the KV-cache math

Devstral Small uses the mistral3 architecture. Pull the architecture values straight from Ollama:

curl -s http://localhost:11434/api/show -d '{"name":"devstral-small-2:latest"}' \
  | python -c "import sys,json;mi=json.load(sys.stdin)['model_info'];\
print({k:v for k,v in mi.items() if 'head_count' in k or 'block_count' in k or 'length' in k})"

Relevant values:

block_count (layers)      = 40
attention.head_count_kv   = 8
attention.key_length      = 128
attention.value_length    = 128
context_length (native)   = 8192   # rope-extended to 393216

KV cache per token (f16) = 2 (K+V) × layers × kv_heads × head_dim × 2 bytes:

2 × 40 × 8 × 128 × 2  =  163,840 bytes  ≈  0.156 MB / token

So:

num_ctx    KV cache (f16)
40,960     ~6.1 GB
16,384     ~2.5 GB
 8,192     ~1.25 GB
 4,096     ~0.6 GB
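The same per-token formula as a small function, so you can plug in another model's values from `/api/show`. The defaults are the Devstral figures listed above; the function name is mine, and the result is the theoretical f16 KV size in binary gigabytes (GiB). Exact outputs can differ slightly from the rounded figures in the table, and what Ollama really allocates can differ too (compute buffers, a quantized KV cache), so treat this as an estimate.

```python
def kv_cache_bytes(num_ctx, layers=40, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """f16 KV cache size: K and V tensors for every layer, KV head, and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return num_ctx * per_token

GIB = 1024 ** 3
for ctx in (40960, 16384, 8192, 4096):
    print(f"{ctx:>6} tokens -> {kv_cache_bytes(ctx) / GIB:.2f} GiB")
```

Because the cost is linear in `num_ctx`, halving the window halves the KV cache; the weights stay the same.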

8192 is the sweet spot: it's above the real worst-case prompt (~5–8k) and it's the model's native context length, so there's no rope extrapolation quality hit. I rejected 4096 — a 10-email batch with 2k generation can brush up against it.

Step 4 — Generate the capped model

Ollama lets you inherit existing weights and override parameters in a Modelfile, so this costs no extra disk and no re-download.

Modelfile.triage:

FROM devstral-small-2:latest

# Native 8K window: covers every triage prompt (10-email batches + 2K generation)
# while keeping the KV cache ~1.25GB so the model + WhisperX fit on one 24GB GPU.
PARAMETER num_ctx 8192
PARAMETER temperature 0
PARAMETER num_predict 2048

SYSTEM """You are a background email-triage engine. Follow the exact output
format in each request. Output only the requested label(s) or field(s). Never
add explanations, preamble, or commentary. When uncertain, pick the closest
valid option. Be terse and deterministic."""

Build it:

ollama create devstral-small-2:triage -f Modelfile.triage

The optional SYSTEM block is a small bonus: triage prompts want terse, structured output, and pinning that behaviour cuts stray preamble (fewer reparse/retry calls = less GPU time).
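The capped variant only helps if the triage job actually requests the new tag. A minimal client sketch using only the standard library follows; the helper names are hypothetical and this is not my triage code. Ollama's `/api/generate` also accepts a per-request `options.num_ctx`, shown here as an optional override, though mixing different values between callers may cause the model to be reloaded.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_triage_request(prompt, model="devstral-small-2:triage", num_ctx=None):
    """Build the JSON body for a non-streaming /api/generate call."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if num_ctx is not None:
        # Per-request override; otherwise the Modelfile's num_ctx applies.
        body["options"] = {"num_ctx": num_ctx}
    return body

def run_triage(prompt, **kwargs):
    """POST the request to Ollama and return the generated text."""
    data = json.dumps(build_triage_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

Keeping the cap in the Modelfile rather than in each request means every caller of the `:triage` tag gets the same window by default.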

The Claude Code prompt I used

I let Claude Code do the measuring and the Modelfile generation. The prompt, roughly:

Analyze my background email triage. Pull the Langfuse traces to find the real prompt/context sizes the triage job uses, decide a safe num_ctx cap that won't truncate worst-case batches, confirm the KV-cache savings against the model's actual architecture, and generate an Ollama Modelfile for a context-capped :triage variant. Then tell me the expected VRAM footprint.

It came back with: traces show ≤8k tokens, cap at 8192 (native window), ~5GB KV saved, expected footprint ~14–16GB. That matched what the dashboard measured after I deployed it.

Step 5 — Verify on the GPU

# load it
curl -s http://localhost:11434/api/generate \
  -d '{"model":"devstral-small-2:triage","prompt":"ping","stream":false}' >/dev/null

# check resident VRAM + context
curl -s http://localhost:11434/api/ps \
  | python -c "import sys,json;[print(m['name'],round(m['size_vram']/1e9,1),'GB ctx',m['context_length']) for m in json.load(sys.stdin)['models']]"

Result: the triage model holds ~14GB resident at ctx=8192, down from ~18GB.
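If you want this check to fail loudly (for example in a deploy script), the `/api/ps` response can be tested against a VRAM budget. A sketch, assuming the JSON has already been parsed into a dict; the function name and the budget value are illustrative, and `context_length` is read defensively in case an older Ollama build does not report it.

```python
def check_resident(ps_response, model, max_vram_gb):
    """Return (vram_gb, context_length or None) for model; raise if over budget."""
    for m in ps_response.get("models", []):
        if m.get("name") == model:
            vram_gb = m["size_vram"] / 1e9
            if vram_gb > max_vram_gb:
                raise RuntimeError(
                    f"{model} uses {vram_gb:.1f} GB, budget is {max_vram_gb} GB"
                )
            return round(vram_gb, 1), m.get("context_length")
    raise LookupError(f"{model} is not loaded")

# Hypothetical response shaped like /api/ps output, not a captured one.
example = {"models": [{"name": "devstral-small-2:triage",
                       "size_vram": 14_200_000_000,
                       "context_length": 8192}]}
print(check_resident(example, "devstral-small-2:triage", max_vram_gb=16.3))
```

The 16.3 GB budget in the example is simply the 24 GB card minus WhisperX's 7.7 GB peak.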

[Figure: after the fix, both services coresident on the GPU, no pressure]

Result

                     Before          After
Triage LLM           ~18.3 GB        ~14.2 GB
WhisperX large-v3    7.7 GB          7.7 GB
Combined             ~26 GB → OOM    ~21.9 GB → fits

Both services now sit on the card together. Full STT quality, email triage in parallel, ~2GB headroom. No quant change, no CPU offload, no smaller Whisper.

Takeaways

  1. A shared-GPU OOM is a timing problem. Point-in-time nvidia-smi can't diagnose it — get per-service VRAM history.
  2. Match num_ctx to the real workload. Reserved context is pure VRAM cost. Background jobs almost always over-reserve.
  3. Prefer the model's native context length as your cap when you can — no rope-extrapolation quality hit.
  4. Measure twice (traces + GPU history), cap once. The fix was three lines; knowing it was the right three lines took the data.

Dashboard used for the per-service VRAM history: https://github.com/SikamikanikoBG/homelab-monitor — it's open source, runs in one container, and exists because I needed exactly this view and nvidia-smi wouldn't give it to me.
