My homelab is one Linux box with a single RTX 3090. 24GB of VRAM, and
three GPU-hungry services that all want it: ComfyUI for image generation,
WhisperX for transcription, Ollama for local LLMs. On one card, that's
already a negotiation.
Last week the negotiation broke. My own monitoring dashboard caught the
culprit at a glance, so this is the short version: what it was, how I saw
it, and the 30-line container that fixed it for good.
(If you want the prequel, the time the Ollama triage model reserved a
40,000-token context to do 8,000 tokens of work, that's
"Two LLMs, One 3090, Zero OOM". Same box. Same lesson.)
The symptom
I opened the GPU tab of my homelab dashboard for something unrelated and
saw the card sitting at 71% full while nothing was running.
16.3GB held by comfyui. GPU utilisation: 0%. The model was loaded
and doing absolutely nothing. The per-service history made it
unambiguous: ComfyUI peaked at 16.3GB and held it for 100% of the window.
This is the whole reason I built the dashboard. nvidia-smi tells you
VRAM is at 17/24GB. It does not tell you which service, which model,
or since when. The GPU tab maps every VRAM-using PID back to its
container automatically, so "who is holding my GPU" is a glance, not five
minutes of ps -o cgroup archaeology.
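For anyone doing that mapping by hand, here is a minimal sketch of the PID-to-container step. It assumes Docker, where the 64-character container ID shows up in the process's cgroup path as seen from the host. The function names are made up for illustration, and this is not necessarily how the dashboard does it.

```shell
#!/bin/sh
# Sketch: map a host PID to the Docker container that owns it.
# Assumption: Docker's cgroup naming, where the 64-hex container ID is part of the path.

# Read cgroup lines on stdin and print the first 64-character hex ID found.
container_id_from_cgroup() {
    grep -oE '[0-9a-f]{64}' | head -n 1
}

# Print the container name for a host PID, or "host" if no container ID is found.
container_for_pid() {
    cid=$(cat "/proc/$1/cgroup" 2>/dev/null | container_id_from_cgroup)
    if [ -n "$cid" ]; then
        # docker inspect prints the name with a leading slash, e.g. /comfyui
        docker inspect --format '{{.Name}}' "$cid"
    else
        echo host
    fi
}

# List every compute process on the GPU with its VRAM and owning container.
gpu_owners() {
    nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader |
    while IFS=', ' read -r pid mem unit; do
        echo "$(container_for_pid "$pid"): $mem $unit"
    done
}
```

Calling gpu_owners on the host prints one line per GPU process. Anything that is not in a Docker container falls through to "host".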
The diagnosis
nvidia-smi on the host confirmed it:
$ nvidia-smi --query-compute-apps=pid,used_memory,process_name --format=csv,noheader
111465, 16666 MiB, python3 # <- ComfyUI, idle
109583, 588 MiB, /app/.venv/bin/python
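To turn that output into a single "how full is the card" number, a rough sketch: sum the per-process figures and divide by the card's capacity. The 24576 MiB default is an assumption for a 24GB card, and vram_summary is a hypothetical helper name. Per-process sums leave out driver overhead, so expect a slightly lower figure than a dashboard that reads the card's total used memory.

```shell
#!/bin/sh
# Sketch: sum the used_memory column (MiB) from
#   nvidia-smi --query-compute-apps=pid,used_memory,process_name --format=csv,noheader
# and print the total plus its share of the card.
vram_summary() {
    total_mib=${1:-24576}   # assumed card capacity in MiB; pass another value as $1
    awk -F', ' -v total="$total_mib" '
        { used += $2 }      # "16666 MiB" converts numerically to 16666
        END { printf "%d MiB used, %d%% of %d MiB\n", used, used * 100 / total, total }
    '
}
```

Usage: pipe the nvidia-smi command above into vram_summary. For the two processes listed, the sum is 17254 MiB.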
ComfyUI keeps the checkpoint resident after a generation so the next
request is fast. Sensible on a dedicated image-gen box. On a shared 24GB
card it is hostile: the FLUX fp8 checkpoint is ~16GB, and ComfyUI 0.22
has no idle timeout to give it back. Once you've generated one image,
that 16GB is gone until you restart the container.
Good news: ComfyUI has an API for exactly this. POST /free with
unload_models drops the model out of VRAM.
$ curl -X POST http://localhost:8188/free \
    -H 'Content-Type: application/json' \
    -d '{"unload_models": true, "free_memory": true}'
One call took ComfyUI from 16666 MiB to 378 MiB. The model reloads
automatically on the next /prompt — about 20–30s added to that one
request, which for an image I generate a few times a day is free.
That reload cost is why I don't want to call /free after every job: it
would kill warm-cache speed during bursts. I want to call it after ComfyUI
has been idle for a while. ComfyUI won't do that itself, so I bolted it on
from outside.
The fix: an idle-unload sidecar
No ComfyUI fork, no custom node. A tiny container that watches the queue
and evicts the model after a few minutes of inactivity.
#!/bin/sh
# Unload ComfyUI models from VRAM after a period of queue inactivity.
INTERVAL=${INTERVAL:-30}
IDLE_SECONDS=${IDLE_SECONDS:-300}
URL=${COMFY_URL:-http://localhost:8188}
idle=0
while true; do
    sleep "$INTERVAL"
    q=$(curl -s -m 10 "$URL/queue" 2>/dev/null) || continue
    [ -z "$q" ] && continue
    # idle == both queue_running and queue_pending are empty arrays
    if [ "${q#*\"queue_running\": []}" != "$q" ] && \
       [ "${q#*\"queue_pending\": []}" != "$q" ]; then
        idle=$((idle + INTERVAL))
        if [ "$idle" -ge "$IDLE_SECONDS" ]; then
            curl -s -m 30 -X POST "$URL/free" -H 'Content-Type: application/json' \
                -d '{"unload_models":true,"free_memory":true}' >/dev/null 2>&1
            idle=0
        fi
    else
        idle=0  # a job ran, reset the idle clock
    fi
done
It polls /queue every 30s. If both queue_running and queue_pending
are empty, it adds to an idle counter. After 300s of continuous idle it
POSTs /free and resets. Any job resets the counter, so a burst of
generations keeps the model warm — eviction only happens once you've
genuinely stopped.
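One fragile spot: the string match in the script depends on ComfyUI serializing the empty arrays with exactly one space after the colon. Here is a hedged sketch of a whitespace-tolerant check, still plain sh with no jq. queue_is_idle is a hypothetical helper name, and it assumes the /queue response keeps the queue_running and queue_pending keys the script already relies on.

```shell
#!/bin/sh
# Sketch: return 0 if the /queue JSON passed as $1 shows both queues empty, 1 otherwise.
# Whitespace is stripped first, so "queue_running": [] and "queue_running":[] both match.
queue_is_idle() {
    compact=$(printf '%s' "$1" | tr -d ' \t\r\n')
    case $compact in
        *'"queue_running":[]'*) ;;   # nothing running, go on to check pending
        *) return 1 ;;
    esac
    case $compact in
        *'"queue_pending":[]'*) return 0 ;;
        *) return 1 ;;
    esac
}
```

In the loop, the two-line test would then become: if queue_is_idle "$q"; then. The rest of the script stays the same.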
No new image to build — curlimages/curl already has sh and curl:
docker run -d --name comfyui-idle-unloader --restart unless-stopped \
    --network host -e IDLE_SECONDS=300 -e INTERVAL=30 \
    -v /opt/comfyui-idle-unloader/unload-idle.sh:/unload-idle.sh:ro \
    --entrypoint sh curlimages/curl:latest /unload-idle.sh
--network host so it can reach ComfyUI on localhost, --restart unless-stopped so it survives reboots. That's the whole deployment.
Before / after
Watching it on the same dashboard, the story is one cliff:
| ComfyUI, idle | VRAM held | GPU "full" | Free for WhisperX + Ollama |
|---|---|---|---|
| Before | 16.3 GB | 71% | ~7 GB |
| After | 0.37 GB | 5% | ~23 GB |
~16GB back, for the cost of one slightly slower image every once in a
while. WhisperX and Ollama stopped fighting over the leftovers.
No fork, no patch, no upstream PR to wait on. Thirty lines of sh and a
container that does one thing. If ComfyUI ships an idle TTL tomorrow, I
delete it and lose nothing.
Why a sidecar and not a patch
- Decoupled. It knows nothing about ComfyUI's internals, just the public /queue and /free endpoints. ComfyUI can update under it.
- Nothing to maintain. It rides ComfyUI's stable HTTP API; an update to ComfyUI doesn't touch it.
- Same pattern works elsewhere. Anything with an "unload model" endpoint (A1111, vLLM with sleep mode, TGI) can be evicted the same way.
The meta-point, and the reason I keep building on the dashboard: I didn't
find this by reading logs. I found it because a tool attributed 16GB to a
named, idle service on one screen. You can't reclaim VRAM you can't see.
The monitor is one container, MIT-licensed, docker compose up -d --build
to try. NVIDIA-only on the GPU panel for now, single-host by design:
github.com/SikamikanikoBG/homelab-monitor
How does everyone else handle idle model eviction on a shared GPU: a
sidecar like this, a TTL in the model server, or do you just docker restart and move on? Genuinely curious which approach holds up.



