The Problem: vLLM Hogs Your GPU 24/7
If you run a local LLM with vLLM, you know the pain. The moment you start the server, it claims ~90% of your VRAM and never lets go — even when nobody's asking it anything.
On a dedicated inference server, that's fine. But on a single consumer GPU (RTX 5090 in my case), I also need VRAM for:
- Shogi engine (DL-based, needs ~4GB VRAM)
- Whisper transcription (large-v3, GPU-accelerated)
- Training runs, experiments, occasional gaming
Running vLLM permanently means everything else fights for scraps. Killing and restarting vLLM manually every time is not a workflow — it's a chore.
The Solution: A Gateway That Manages vLLM's Lifecycle
I wrote a single-file FastAPI gateway (`vllm_gateway.py`, ~390 lines) that:
- Listens on port 8000 with near-zero VRAM usage
- Auto-starts vLLM on an internal port (8100) when a request arrives
- Auto-stops vLLM after 10 minutes of idle, fully freeing VRAM
- Rewrites tool calls from Nemotron's `<TOOLCALL>` format to OpenAI-compatible `tool_calls`
From the client's perspective, it's just a normal OpenAI-compatible API on port 8000. The lifecycle management is completely invisible.
```
Client → :8000 (Gateway, always running, ~0 VRAM)
             ↓ proxy
         :8100 (vLLM, started on-demand, stopped when idle)
```
Architecture
Startup Flow
```
Request arrives at :8000
  → Gateway checks: is vLLM running?
     → No:  spawn vLLM process on :8100
            poll /health every 2s (up to 3 min timeout)
            once healthy → proxy the request
     → Yes: proxy immediately
```
Shutdown Flow
```
Idle watchdog runs every 30s
  → Last request was >10 min ago?
     → SIGTERM to vLLM process group
     → Wait 15s, SIGKILL if needed
     → VRAM fully released
```
Key Design Decisions
- Process group kill (`os.killpg`): vLLM spawns child processes. Killing just the parent leaves zombies holding VRAM.
- Internal port separation: the gateway owns :8000, vLLM gets :8100. No port conflicts during restart.
- Health check polling: don't proxy until vLLM is actually ready. Model loading takes 30-90s depending on size.
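The first point is easy to see in a toy example. Here the shell forks a background child of its own, standing in for vLLM's workers; killing the process group takes both down, where a plain `proc.terminate()` would only reach the shell. This is a POSIX-only demonstration, not the gateway's actual code:

```python
import os
import signal
import subprocess
import time

# The shell spawns a background child, mimicking vLLM's worker processes.
proc = subprocess.Popen(
    ["sh", "-c", "sleep 60 & wait"],
    start_new_session=True,  # same effect as preexec_fn=os.setsid
)
time.sleep(0.2)              # give the shell time to fork its child
pgid = os.getpgid(proc.pid)

# Kill the whole group: the shell AND its background sleep both die,
# so nothing is left behind holding resources (VRAM, in vLLM's case).
os.killpg(pgid, signal.SIGTERM)
proc.wait(timeout=5)
```

`start_new_session=True` is the modern spelling of `preexec_fn=os.setsid`: the child becomes leader of a fresh process group, so the `killpg` cannot accidentally hit the gateway itself.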
Core Implementation
Here's the stripped-down version of the essential parts:
```python
import asyncio
import os
import signal
import subprocess
import time

import httpx

VLLM_INTERNAL_PORT = 8100
GATEWAY_PORT = 8000
IDLE_TIMEOUT_SECONDS = 10 * 60

VLLM_CMD = [
    ".venv/bin/vllm", "serve", "nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese",
    "--trust-remote-code",
    "--port", str(VLLM_INTERNAL_PORT),
    "--enable-auto-tool-choice",
    "--tool-call-parser", "nemotron_json",
]

vllm_process = None
vllm_ready = False
last_request_time = time.time()


async def start_vllm() -> bool:
    global vllm_process, vllm_ready
    vllm_process = subprocess.Popen(
        VLLM_CMD, preexec_fn=os.setsid,  # new process group → killpg works
        stdout=subprocess.DEVNULL, stderr=subprocess.PIPE,
    )
    # Poll the health endpoint until the model is loaded and serving
    deadline = time.time() + 180
    async with httpx.AsyncClient() as client:
        while time.time() < deadline:
            try:
                resp = await client.get(
                    f"http://localhost:{VLLM_INTERNAL_PORT}/health", timeout=5
                )
                if resp.status_code == 200:
                    vllm_ready = True
                    return True
            except (httpx.ConnectError, httpx.ReadTimeout):
                pass
            await asyncio.sleep(2)
    return False


async def stop_vllm():
    global vllm_process, vllm_ready
    if vllm_process and vllm_process.poll() is None:
        # SIGTERM the whole group first; escalate to SIGKILL if it hangs
        os.killpg(os.getpgid(vllm_process.pid), signal.SIGTERM)
        try:
            vllm_process.wait(timeout=15)
        except subprocess.TimeoutExpired:
            os.killpg(os.getpgid(vllm_process.pid), signal.SIGKILL)
    vllm_process = None
    vllm_ready = False


async def idle_watchdog():
    while True:
        await asyncio.sleep(30)
        if vllm_ready and time.time() - last_request_time > IDLE_TIMEOUT_SECONDS:
            await stop_vllm()
```
Tool Call Rewriting (Bonus)
Nemotron outputs tool calls as raw text:
```
<TOOLCALL>[{"name": "ddg_search", "arguments": {"query": "NVIDIA stock"}}]</TOOLCALL>
```
The gateway intercepts this and rewrites it to the OpenAI format:
```json
{
  "tool_calls": [{
    "id": "call_abc123",
    "type": "function",
    "function": {
      "name": "ddg_search",
      "arguments": "{\"query\": \"NVIDIA stock\"}"
    }
  }],
  "finish_reason": "tool_calls"
}
```
This works for both streaming and non-streaming responses. Clients never see the raw `<TOOLCALL>` tags.
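A sketch of the non-streaming rewrite step. The function name and the assumption of a single `<TOOLCALL>` span per message are mine; the real gateway also handles the chunked streaming case:

```python
import json
import re
import uuid

TOOLCALL_RE = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)

def rewrite_tool_calls(message: dict) -> dict:
    """Rewrite a Nemotron <TOOLCALL> span into OpenAI tool_calls."""
    match = TOOLCALL_RE.search(message.get("content") or "")
    if not match:
        return message  # ordinary text response, pass through untouched
    calls = json.loads(match.group(1))
    message = dict(message)
    message["content"] = None  # OpenAI clients expect no raw tag text
    message["tool_calls"] = [
        {
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {
                "name": call["name"],
                # OpenAI format wants arguments as a JSON *string*
                "arguments": json.dumps(call["arguments"]),
            },
        }
        for call in calls
    ]
    return message
```

The enclosing choice also gets `"finish_reason": "tool_calls"`, as shown in the JSON above, so agent frameworks branch into their tool-execution path.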
Status Endpoint
The gateway exposes a simple status API:
```shell
$ curl localhost:8000/gateway/status
{
  "vllm_running": false,
  "vllm_ready": false,
  "idle_seconds": 7145,
  "pid": null
}
```
And manual controls:
```shell
curl -X POST localhost:8000/gateway/start   # Force start
curl -X POST localhost:8000/gateway/stop    # Force stop
```
Real-World Numbers (RTX 5090, Nemotron 9B)
| State | VRAM Usage |
|---|---|
| Gateway only (vLLM stopped) | ~200 MB |
| vLLM running (Nemotron 9B FP16) | ~22 GB |

Cold start time (idle to first response): ~60s.
The 60s cold start is the tradeoff. For interactive chat, the first message after idle has a delay. For batch/API workloads, it's negligible.
When You'd Want This
- Single GPU, multiple workloads: Share your GPU between LLM inference and other tasks
- Development machine: Run vLLM only when you're actively using it
- Cost/power savings: No point heating your GPU for an idle model
- Home server: The RTX card can serve LLM requests AND run other GPU tasks
When You Wouldn't
- Dedicated inference server: Just run vLLM directly
- Low-latency requirements: 60s cold start is unacceptable for some use cases
- Multi-user serving: Frequent requests mean vLLM stays up anyway
Full Source
The complete `vllm_gateway.py` is ~390 lines including streaming support and `<TOOLCALL>` rewriting. The approach works with any vLLM model — just change `VLLM_CMD`.
Dependencies: `fastapi`, `httpx`, `uvicorn` (all likely already installed if you use vLLM).
Running Nemotron Nano 9B v2 Japanese on RTX 5090 + WSL2. The gateway pattern turned "I can't use my GPU for anything else" into "vLLM is there when I need it and gone when I don't."