The Problem: vLLM Hogs Your GPU 24/7
If you run a local LLM with vLLM, you know the pain. The moment you start the server, it claims ~90% of your VRAM and never lets go — even when nobody's asking it anything.
On a dedicated inference server, that's fine. But on a single consumer GPU (RTX 5090 in my case), I also need VRAM for:
- Shogi engine (DL-based, needs ~4GB VRAM)
- Whisper transcription (large-v3, GPU-accelerated)
- Training runs, experiments, occasional gaming
Running vLLM permanently means everything else fights for scraps. Killing and restarting vLLM manually every time is not a workflow — it's a chore.
The Solution: A Gateway That Manages vLLM's Lifecycle
I wrote a single-file FastAPI gateway (`vllm_gateway.py`, ~390 lines) that:
- Listens on port 8000 with near-zero VRAM usage
- Auto-starts vLLM on an internal port (8100) when a request arrives
- Auto-stops vLLM after 10 minutes of idle, fully freeing VRAM
- Rewrites tool calls from Nemotron's `<TOOLCALL>` format to OpenAI-compatible `tool_calls`
From the client's perspective, it's just a normal OpenAI-compatible API on port 8000. The lifecycle management is completely invisible.
```
Client → :8000 (Gateway, always running, ~0 VRAM)
             ↓ proxy
         :8100 (vLLM, started on-demand, stopped when idle)
```
Architecture
Startup Flow
```
Request arrives at :8000
  → Gateway checks: is vLLM running?
     → No:  spawn vLLM process on :8100
            poll /health every 2s (up to 3 min timeout)
            once healthy → proxy the request
     → Yes: proxy immediately
```
Shutdown Flow
```
Idle watchdog runs every 30s
  → Last request was >10 min ago?
     → SIGTERM to vLLM process group
     → Wait 15s, SIGKILL if needed
     → VRAM fully released
```
Key Design Decisions
- Process group kill (`os.killpg`): vLLM spawns child processes. Killing just the parent leaves zombies holding VRAM.
- Internal port separation: the gateway owns :8000, vLLM gets :8100. No port conflicts during restart.
- Health check polling: don't proxy until vLLM is actually ready. Model loading takes 30-90s depending on size.
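The first point is easy to see in a toy example. Here the shell forks a background child of its own, standing in for vLLM's workers; killing the process group takes both down, where a plain `proc.terminate()` would only reach the shell. This is a POSIX-only demonstration, not the gateway's actual code:

```python
import os
import signal
import subprocess
import time

# The shell spawns a background child, mimicking vLLM's worker processes.
proc = subprocess.Popen(
    ["sh", "-c", "sleep 60 & wait"],
    start_new_session=True,  # same effect as preexec_fn=os.setsid
)
time.sleep(0.2)              # give the shell time to fork its child
pgid = os.getpgid(proc.pid)

# Kill the whole group: the shell AND its background sleep both die,
# so nothing is left behind holding resources (VRAM, in vLLM's case).
os.killpg(pgid, signal.SIGTERM)
proc.wait(timeout=5)
```

`start_new_session=True` is the modern spelling of `preexec_fn=os.setsid`: the child becomes leader of a fresh process group, so the `killpg` cannot accidentally hit the gateway itself.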
Core Implementation
Here's the stripped-down version of the essential parts:
```python
import asyncio
import os
import signal
import subprocess
import time

import httpx

VLLM_INTERNAL_PORT = 8100
GATEWAY_PORT = 8000
IDLE_TIMEOUT_SECONDS = 10 * 60

VLLM_CMD = [
    ".venv/bin/vllm", "serve", "nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese",
    "--trust-remote-code",
    "--port", str(VLLM_INTERNAL_PORT),
    "--enable-auto-tool-choice",
    "--tool-call-parser", "nemotron_json",
]

vllm_process = None
vllm_ready = False
last_request_time = time.time()


async def start_vllm() -> bool:
    global vllm_process, vllm_ready
    vllm_process = subprocess.Popen(
        VLLM_CMD, preexec_fn=os.setsid,  # new process group → killpg works
        stdout=subprocess.DEVNULL, stderr=subprocess.PIPE,
    )
    # Poll the health endpoint until the model is loaded and serving
    deadline = time.time() + 180
    async with httpx.AsyncClient() as client:
        while time.time() < deadline:
            try:
                resp = await client.get(
                    f"http://localhost:{VLLM_INTERNAL_PORT}/health", timeout=5
                )
                if resp.status_code == 200:
                    vllm_ready = True
                    return True
            except (httpx.ConnectError, httpx.ReadTimeout):
                pass
            await asyncio.sleep(2)
    return False


async def stop_vllm():
    global vllm_process, vllm_ready
    if vllm_process and vllm_process.poll() is None:
        # SIGTERM the whole group first; escalate to SIGKILL if it hangs
        os.killpg(os.getpgid(vllm_process.pid), signal.SIGTERM)
        try:
            vllm_process.wait(timeout=15)
        except subprocess.TimeoutExpired:
            os.killpg(os.getpgid(vllm_process.pid), signal.SIGKILL)
    vllm_process = None
    vllm_ready = False


async def idle_watchdog():
    while True:
        await asyncio.sleep(30)
        if vllm_ready and time.time() - last_request_time > IDLE_TIMEOUT_SECONDS:
            await stop_vllm()
```
Tool Call Rewriting (Bonus)
Nemotron outputs tool calls as raw text:
```
<TOOLCALL>[{"name": "ddg_search", "arguments": {"query": "NVIDIA stock"}}]</TOOLCALL>
```
The gateway intercepts this and rewrites it to the OpenAI format:
```json
{
  "tool_calls": [{
    "id": "call_abc123",
    "type": "function",
    "function": {
      "name": "ddg_search",
      "arguments": "{\"query\": \"NVIDIA stock\"}"
    }
  }],
  "finish_reason": "tool_calls"
}
```
This works for both streaming and non-streaming responses. Clients never see the raw `<TOOLCALL>` tags.
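A sketch of the non-streaming rewrite step. The function name and the assumption of a single `<TOOLCALL>` span per message are mine; the real gateway also handles the chunked streaming case:

```python
import json
import re
import uuid

TOOLCALL_RE = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)

def rewrite_tool_calls(message: dict) -> dict:
    """Rewrite a Nemotron <TOOLCALL> span into OpenAI tool_calls."""
    match = TOOLCALL_RE.search(message.get("content") or "")
    if not match:
        return message  # ordinary text response, pass through untouched
    calls = json.loads(match.group(1))
    message = dict(message)
    message["content"] = None  # OpenAI clients expect no raw tag text
    message["tool_calls"] = [
        {
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {
                "name": call["name"],
                # OpenAI format wants arguments as a JSON *string*
                "arguments": json.dumps(call["arguments"]),
            },
        }
        for call in calls
    ]
    return message
```

The enclosing choice also gets `"finish_reason": "tool_calls"`, as shown in the JSON above, so agent frameworks branch into their tool-execution path.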
Status Endpoint
The gateway exposes a simple status API:
```shell
$ curl localhost:8000/gateway/status
{
  "vllm_running": false,
  "vllm_ready": false,
  "idle_seconds": 7145,
  "pid": null
}
```
And manual controls:
```shell
curl -X POST localhost:8000/gateway/start   # Force start
curl -X POST localhost:8000/gateway/stop    # Force stop
```
Real-World Numbers (RTX 5090, Nemotron 9B)
| State | VRAM Usage |
|---|---|
| Gateway only (vLLM stopped) | ~200 MB |
| vLLM running (Nemotron 9B FP16) | ~22 GB |

Cold start time (idle to first response): ~60s.
The 60s cold start is the tradeoff. For interactive chat, the first message after idle has a delay. For batch/API workloads, it's negligible.
When You'd Want This
- Single GPU, multiple workloads: Share your GPU between LLM inference and other tasks
- Development machine: Run vLLM only when you're actively using it
- Cost/power savings: No point heating your GPU for an idle model
- Home server: The RTX card can serve LLM requests AND run other GPU tasks
When You Wouldn't
- Dedicated inference server: Just run vLLM directly
- Low-latency requirements: 60s cold start is unacceptable for some use cases
- Multi-user serving: Frequent requests mean vLLM stays up anyway
Full Source
The complete `vllm_gateway.py` is ~390 lines including streaming support and `<TOOLCALL>` rewriting. The approach works with any vLLM model — just change `VLLM_CMD`.
Dependencies: `fastapi`, `httpx`, `uvicorn` (all likely already installed if you use vLLM).
Running Nemotron Nano 9B v2 Japanese on RTX 5090 + WSL2. The gateway pattern turned "I can't use my GPU for anything else" into "vLLM is there when I need it and gone when I don't."