Arsen Apostolov

Reclaiming 16GB of idle VRAM: a 30-line sidecar that evicts ComfyUI when it stops working

My homelab is one Linux box with a single RTX 3090. 24GB of VRAM, and
three GPU-hungry services that all want it: ComfyUI for image generation,
WhisperX for transcription, Ollama for local LLMs. On one card, that's
already a negotiation.

Last week the negotiation broke. My own monitoring dashboard caught the
culprit at a glance, so this is the short version: what it was, how I saw
it, and the 30-line container that fixed it for good.

(If you want the prequel, the time the Ollama triage model reserved a
40,000-token context to do 8,000 tokens of work, that's
"Two LLMs, One 3090, Zero OOM". Same box. Same lesson.)

The symptom

I opened the GPU tab of my homelab dashboard for something unrelated and
saw the card sitting at 71% full while nothing was running.

GPU right now: 17GB used, comfyui holding 16.3GB, 0% utilisation

16.3GB held by comfyui. GPU utilisation: 0%. The model was loaded
and doing absolutely nothing. The per-service history made it
unambiguous — ComfyUI peaked at 16.3GB and held it 100% of the window:

Services on the GPU: comfyui peak 16.3GB, 100% of the time

This is the whole reason I built the dashboard. nvidia-smi tells you
VRAM is at 17/24GB and which PIDs hold it. It does not tell you which
service, which model, or since when. The GPU tab maps every VRAM-using
PID back to its container automatically, so "who is holding my GPU" is a
glance, not five minutes of ps -o cgroup archaeology.
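
For anyone without a dashboard, the manual version of that PID-to-container mapping can be sketched in a few lines of sh. This is not how the monitor does it; it is a minimal sketch that assumes Docker's usual cgroup layout, where the 64-character container ID appears in /proc/<pid>/cgroup. The function names are made up for illustration.

```sh
#!/bin/sh
# Sketch: attribute GPU processes to Docker containers by hand.
# Assumption: container IDs appear in /proc/<pid>/cgroup as 64 hex characters
# (Docker's usual layout); other runtimes may name cgroups differently.

# Read cgroup text on stdin, print the first 64-hex container ID found (if any).
container_id_from_cgroup() {
  grep -oE '[0-9a-f]{64}' | head -n 1
}

# Print "pid,used_memory,container_id" for every compute process on the GPU.
# Processes with no container ID in their cgroup are labelled "host".
gpu_pids_to_containers() {
  nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader |
  while IFS=, read -r pid mem; do
    id=$(cat "/proc/$pid/cgroup" 2>/dev/null | container_id_from_cgroup)
    echo "$pid,$mem,${id:-host}"
  done
}
```

Run gpu_pids_to_containers on the host (as a user allowed to read other processes' /proc entries), then match the IDs against docker ps --no-trunc --format '{{.ID}} {{.Names}}' to get names. It answers "which container", but still not "which model" or "since when".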

The diagnosis

nvidia-smi on the host confirmed it:

$ nvidia-smi --query-compute-apps=pid,used_memory,process_name --format=csv,noheader
111465, 16666 MiB, python3      # <- ComfyUI, idle
109583, 588 MiB, /app/.venv/bin/python

ComfyUI keeps the checkpoint resident after a generation so the next
request is fast. Sensible on a dedicated image-gen box. On a shared 24GB
card it is hostile: the FLUX fp8 checkpoint is ~16GB, and ComfyUI 0.22
has no idle timeout to give it back. Once you've generated one image,
that 16GB is gone until you restart the container.

Good news: ComfyUI has an API for exactly this. POST /free with
unload_models drops the model out of VRAM.

$ curl -X POST http://localhost:8188/free \
    -H 'Content-Type: application/json' \
    -d '{"unload_models": true, "free_memory": true}'

One call took ComfyUI from 16666 MiB to 378 MiB. The model reloads
automatically on the next /prompt, adding about 20–30s to that one
request. For images I generate a few times a day, that cost is negligible.
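
To put a number on a /free call yourself, take one nvidia-smi reading before and one after, then subtract. A minimal sketch, assuming the same CSV layout as the nvidia-smi query shown earlier; the helper names are hypothetical.

```sh
#!/bin/sh
# Print the MiB used by one PID. Reads nvidia-smi CSV on stdin, in the
# "pid, used_memory, process_name" layout from --format=csv,noheader.
vram_mib_for_pid() {
  awk -F', *' -v pid="$1" '$1 == pid { print $2 + 0 }'
}

# Print how many MiB were released between two readings (before, after).
freed_mib() {
  echo $(( $1 - $2 ))
}
```

Pipe the query output into vram_mib_for_pid with ComfyUI's PID before and after the POST. With the readings from this post, freed_mib 16666 378 prints 16288, which is the "~16GB back" figure.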

But I don't want to call /free after every job: that would kill
warm-cache speed during bursts. I want to call it only after ComfyUI has
been idle for a while. ComfyUI won't do that itself, so I bolted it on
from outside.

The fix: an idle-unload sidecar

No ComfyUI fork, no custom node. A tiny container that watches the queue
and evicts the model after a few minutes of inactivity.

#!/bin/sh
# Unload ComfyUI models from VRAM after a period of queue inactivity.
INTERVAL=${INTERVAL:-30}
IDLE_SECONDS=${IDLE_SECONDS:-300}
URL=${COMFY_URL:-http://localhost:8188}
idle=0
while true; do
  sleep "$INTERVAL"
  q=$(curl -s -m 10 "$URL/queue" 2>/dev/null) || continue
  [ -z "$q" ] && continue
  # idle == both queue_running and queue_pending are empty arrays.
  # The match is textual, so it relies on the "key": [] spacing in the reply.
  if [ "${q#*\"queue_running\": []}" != "$q" ] && \
     [ "${q#*\"queue_pending\": []}" != "$q" ]; then
    idle=$((idle + INTERVAL))
    if [ "$idle" -ge "$IDLE_SECONDS" ]; then
      curl -s -m 30 -X POST "$URL/free" -H 'Content-Type: application/json' \
        -d '{"unload_models":true,"free_memory":true}' >/dev/null 2>&1
      idle=0
    fi
  else
    idle=0          # a job ran — reset the idle clock
  fi
done

It polls /queue every 30s. If both queue_running and queue_pending
are empty, it adds to an idle counter. After 300s of continuous idle it
POSTs /free and resets. Any job resets the counter, so a burst of
generations keeps the model warm — eviction only happens once you've
genuinely stopped.
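
The idle logic can be pulled out into two small functions and checked without a running ComfyUI. This is a sketch, not the deployed script: queue_is_idle and idle_tick are names invented here, and the whitespace stripping is an added assumption so the match does not depend on how the JSON is spaced.

```sh
#!/bin/sh
# Return 0 if the /queue JSON in $1 shows both lists empty, whatever the spacing.
queue_is_idle() {
  compact=$(printf '%s' "$1" | tr -d ' \t\r\n')
  case $compact in
    *'"queue_running":[]'*) ;;
    *) return 1 ;;
  esac
  case $compact in
    *'"queue_pending":[]'*) return 0 ;;
    *) return 1 ;;
  esac
}

# One poll of the idle clock.
# Args: current_idle_seconds interval threshold queue_json
# Prints the new counter, or "evict" once the threshold is reached.
idle_tick() {
  if queue_is_idle "$4"; then
    next=$(( $1 + $2 ))
    if [ "$next" -ge "$3" ]; then echo evict; else echo "$next"; fi
  else
    echo 0   # something is queued or running: reset the clock
  fi
}
```

With the defaults (30s interval, 300s threshold), the tenth consecutive idle poll is the one that evicts: idle_tick 270 30 300 with an empty queue prints evict, while any poll that sees a queued or running job prints 0.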

No new image to build — curlimages/curl already has sh and curl:

docker run -d --name comfyui-idle-unloader --restart unless-stopped \
  --network host -e IDLE_SECONDS=300 -e INTERVAL=30 \
  -v /opt/comfyui-idle-unloader/unload-idle.sh:/unload-idle.sh:ro \
  --entrypoint sh curlimages/curl:latest /unload-idle.sh

--network host so it can reach ComfyUI on localhost,
--restart unless-stopped so it survives reboots. That's the whole
deployment.

Before / after

Watching it on the same dashboard, the story is one cliff:

VRAM by service over time: long plateau at capacity, then a cliff down to ~1GB

| | ComfyUI, idle VRAM held | GPU "full" | Free for WhisperX + Ollama |
|---|---|---|---|
| Before | 16.3 GB | 71% | ~7 GB |
| After | 0.37 GB | 5% | ~23 GB |

After: comfyui evicted to 378MB, 16GB handed back

~16GB back, for the cost of one slightly slower image every once in a
while. WhisperX and Ollama stopped fighting over the leftovers.

No fork, no patch, no upstream PR to wait on. Thirty lines of sh and a
container that does one thing. If ComfyUI ships an idle TTL tomorrow, I
delete it and lose nothing.

Why a sidecar and not a patch

  • Decoupled. It knows nothing about ComfyUI's internals, just the public /queue and /free endpoints.
  • Nothing to maintain. There is no fork to rebase and no custom node to keep compatible; as long as those two endpoints keep their shape, ComfyUI can update under it.
  • Same pattern works elsewhere. Anything with an "unload model" endpoint (A1111, vLLM with sleep mode, TGI) can be evicted the same way.

The meta-point, and the reason I keep building on the dashboard: I didn't
find this by reading logs. I found it because a tool attributed 16GB to a
named, idle service on one screen. You can't reclaim VRAM you can't see.

The monitor is one container, MIT-licensed, docker compose up -d --build
to try. NVIDIA-only on the GPU panel for now, single-host by design:

github.com/SikamikanikoBG/homelab-monitor

How does everyone else handle idle model eviction on a shared GPU: a
sidecar like this, a TTL in the model server, or do you just
docker restart and move on? Genuinely curious which approach holds up.
