Ingero Team

Posted on • Originally published at ingero.io
11-Second Time to First Token on a Healthy vLLM Server

TL;DR

A vLLM health endpoint says "ok." nvidia-smi says 95% utilization. But a user just waited 11 seconds for their first token. We reproduced a real vLLM issue on an RTX 4090 and traced every CUDA API call and Linux kernel event to find the root cause: head-of-line blocking during prefix caching, driven by kernel-level scheduling contention on the CPU side. This is invisible to standard monitoring. The trace databases are available in the Ingero repo for independent investigation.

The Problem Nobody Can See

vLLM's continuous batching is one of the best things to happen to LLM serving. It lets the engine process multiple requests simultaneously, filling GPU capacity that would otherwise sit idle between sequential requests.

But continuous batching has a dark side: when requests compete for GPU resources inside the same batch, one expensive request can silently starve all others. No error. No health check failure. No metric spike. Just users waiting 10x-250x longer than expected for their first token.
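A toy model makes the failure mode concrete. This is a sketch, not vLLM's actual scheduler: it assumes a single-threaded submission loop and a prefill cost proportional to prompt length, with illustrative timings.

```python
# Toy model of head-of-line blocking in a single-threaded submission loop.
# Each request's prefill cost is proportional to its prompt length; the loop
# submits prefills in arrival (FIFO) order, so light requests queued behind
# heavy ones inherit the heavy requests' latency.

def ttft_fifo(prompt_lens, us_per_token=10):
    """Return per-request time-to-first-token (us) under FIFO submission."""
    clock, ttfts = 0, []
    for n in prompt_lens:
        clock += n * us_per_token   # a heavy prefill blocks everything behind it
        ttfts.append(clock)
    return ttfts

batch = [2048, 2048, 2048, 2048, 128, 128]  # 4 heavy, 2 light (arrival order)
fifo = ttft_fifo(batch)
sjf = ttft_fifo(sorted(batch))              # shortest-job-first, for contrast
print(f"light-request TTFT, FIFO: {fifo[-1]} us, SJF: {sjf[1]} us")
```

Under FIFO the last light request waits for every heavy prefill ahead of it; reordered shortest-job-first, the same request finishes two orders of magnitude sooner. Real continuous batching is far more sophisticated, but the queueing arithmetic is the same.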

We investigated a real vLLM issue reported in the last week (#37308) to understand what happens at the kernel level during these silent latency spikes.

Setup

The investigation used the same server configuration as the reported issue:

python -m vllm.entrypoints.openai.api_server \
 --model Qwen/Qwen2.5-0.5B-Instruct \
 --port 8000 \
 --gpu-memory-utilization 0.95 \
 --max-model-len 32768 \
 --enable-prefix-caching

Hardware: RTX 4090 (24GB), 4 vCPUs, Ubuntu 22.04, vLLM 0.17.1.

We ran Ingero alongside each test to trace CUDA Runtime/Driver API calls and host kernel events (scheduler context switches, memory allocations) simultaneously.

Prefix Caching Head-of-Line Blocking

Issue: vllm-project/vllm#37308

What happens

6 concurrent requests arrive within 40ms. 4 are heavy (2048-token prompts, 128-512 output tokens) and 2 are light (128-token prompts, 32-64 output tokens). All share a 32-token prefix so the prefix cache groups them together.
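For reference, the request mix can be sketched as OpenAI-compatible completion bodies sharing a 32-token prefix. The prompt text and helper names here are illustrative, not the issue's exact payloads:

```python
import itertools

# Sketch of the workload: 6 request bodies sharing a 32-token prefix so the
# prefix cache groups them into one batch. Token counts follow the issue's
# description; one word approximates one token for illustration.
SHARED_PREFIX = " ".join(f"ctx{i}" for i in range(32))  # ~32-token prefix

def make_request(prompt_tokens, max_tokens):
    filler = " ".join(itertools.repeat("pad", prompt_tokens - 32))
    return {
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": f"{SHARED_PREFIX} {filler}",
        "max_tokens": max_tokens,
        "stream": True,  # stream so TTFT = time until the first chunk arrives
    }

heavy = [make_request(2048, 512) for _ in range(4)]
light = [make_request(128, 64) for _ in range(2)]
batch = heavy + light
assert all(r["prompt"].startswith(SHARED_PREFIX) for r in batch)
```

Fire all six at the server within a few tens of milliseconds (e.g. with an async HTTP client) and record the wall-clock time until each stream's first chunk.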

The light requests should complete in under 100ms. Instead:

Run   r08 (128 tok)   r05 (128 tok)   r07 (2048 tok)   r02 (2048 tok)
1     1,131ms         1,406ms         1,654ms          1,851ms
2     54ms            129ms           258ms            234ms
3     66ms            177ms           175ms            156ms

Run 1 is catastrophic: the light requests are 11-14x over threshold. Subsequent runs settle to 2-4x because the prefix cache warms up. But that first cold-cache batch is brutal.
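As a sanity check, the run-1 multiples follow directly from the table and the 100 ms expectation:

```python
# Run-1 (cold cache) TTFT for the light requests, from the table above,
# relative to the ~100 ms expectation for a 128-token prompt.
expected_ms = 100
run1_ttft_ms = {"r08": 1131, "r05": 1406}
factors = {req: t / expected_ms for req, t in run1_ttft_ms.items()}
print(factors)  # roughly 11x and 14x over threshold
```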

What the tracer shows

3 causal chains detected. The most revealing one:

[MEDIUM] cudaLaunchKernel p99=444us (6.4x p50) - 371 sched_switch events
 Timeline:
 [HOST ] 371 context switches (5.9s off-CPU)
 [CUDA ] p99=444us (6.4x p50=70us)

The per-process breakdown tells the full story:

VLLM::EngineCore (the GPU scheduling loop):

  • 24,347 context switches, max stall 2.5 seconds

  • 40,632 cuLaunchKernel calls, avg 29us but max 34ms

  • 34,087 cudaLaunchKernel calls, avg 96us but max 356ms

The engine core process – the single-threaded loop that decides which requests get GPU time – was descheduled for 2.5 seconds in the worst case. During that stall, the GPU kernel queue drained and the light requests had nothing submitted on their behalf.

The 356ms cudaLaunchKernel spike (3,700x the average) is the smoking gun. That's not the GPU being slow. That's the CPU failing to submit work to the GPU because the scheduling loop was preempted.
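You can spot this pattern in any series of launch latencies: a launch that takes orders of magnitude longer than the typical one means the submitting CPU thread stalled, not the GPU. A minimal detector, with synthetic numbers and an illustrative threshold:

```python
import statistics

def launch_spikes(durations_us, factor=50):
    """Flag kernel-launch durations that dwarf the typical (median) cost.

    A launch taking 100x-1000x the median almost always means the CPU
    thread was descheduled mid-submission, not that the GPU was slow.
    """
    p50 = statistics.median(durations_us)
    return [d for d in durations_us if d > factor * p50]

# Synthetic distribution: mostly ~30 us launches, plus one 356 ms outlier
# like the one the EngineCore trace showed.
durations = [30] * 1000 + [356_000]
spikes = launch_spikes(durations)
print(f"p50={statistics.median(durations)}us, spikes={spikes}")
```

The same idea generalizes: correlate each spike's timestamp with sched_switch events for the submitting PID, and a long off-CPU gap overlapping the spike confirms preemption as the cause.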

Why nvidia-smi misses this

nvidia-smi shows high utilization because the GPU IS working – on the heavy requests' prefills. The light requests are starving, but from the GPU's perspective there's always a kernel to run. The starvation is in the CPU-side scheduling loop, not on the GPU.

What Standard Tools Show vs What Kernel Tracing Shows

Signal                     nvidia-smi    vLLM /health   vLLM metrics            Kernel tracing
GPU utilization            95%+          -              -                       95%+ (but wrong work)
Server health              -             "ok"           requests_running=5      -
TTFT regression            -             -              Visible in histograms   Visible + root cause
Engine stall (2.5s)        Not visible   Not visible    Not visible             24,347 sched_switch events
Kernel launch drop (80%)   Not visible   Not visible    Not visible             1,051 -> 208 ops/s
Memory pressure            Not visible   Not visible    Not visible             43,606 mm_page_alloc
Which process is blocked   Not visible   Not visible    Not visible             VLLM::EngineCore PID 2438

The key insight: GPU utilization was high because the GPU was doing work. It was just doing the wrong work – processing the heavy requests' prefills while the light requests starved. No GPU-side metric can distinguish "GPU is busy computing my request" from "GPU is busy computing someone else's request while mine waits."

Implications for Production vLLM

If you're running vLLM in production with mixed workloads (different prompt sizes, some requests with multi-sample parameters like n), you're likely experiencing these silent regressions:

  1. Monitor TTFT per-request, not just aggregate throughput. Aggregate metrics hide the tail – your p99 might be 100x worse than p50 during batch contention.

  2. Be careful with parallel sampling. A single request with n=8 and aggressive decoding parameters can block your entire server for 11+ seconds on a cold cache. Consider routing these to dedicated instances.

  3. First-request-after-idle is the worst case. This issue showed the most extreme regression on Run 1 (cold prefix cache). If your traffic is bursty, the first batch after a quiet period will hit hardest.

  4. GPU utilization is not a proxy for request health. Your dashboards might show 95% utilization while individual users experience 256x TTFT regression.
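Point 1 deserves emphasis: an aggregate mean hides exactly this tail. A small sketch with synthetic TTFTs (the `percentile` helper is ours, nearest-rank, not from any monitoring library):

```python
import statistics

def percentile(xs, q):
    """Nearest-rank percentile; good enough for monitoring sketches."""
    xs = sorted(xs)
    idx = min(len(xs) - 1, int(q / 100 * len(xs)))
    return xs[idx]

# Synthetic TTFTs (ms): 99 fast requests and one 11-second straggler.
ttfts = [80] * 99 + [11_000]
mean = statistics.mean(ttfts)
p50, p99 = percentile(ttfts, 50), percentile(ttfts, 99)
print(f"mean={mean:.0f}ms p50={p50}ms p99={p99}ms")  # the mean hides the tail
```

A dashboard alert on p99 (or on the p99/p50 ratio) catches the straggler; an alert on the mean very likely does not.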

Investigate It Yourself

The trace databases from this investigation are in the Ingero repo:

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build

# View the causal chains
./bin/ingero explain --db investigations/vllm---amplification.db --since 5m

# Per-process breakdown
./bin/ingero explain --db investigations/vllm---amplification.db --per-process --since 5m

# Connect your AI assistant for interactive investigation
./bin/ingero mcp --db investigations/vllm---amplification.db

Investigate with AI (recommended)

You can point any MCP-compatible AI client at the trace database and ask questions directly. No code required.

First, create the MCP config file at /tmp/ingero-mcp-vllm.json:

{
  "mcpServers": {
    "ingero": {
      "command": "./bin/ingero",
      "args": ["mcp", "--db", "investigations/vllm-37308-hol-blocking.db"]
    }
  }
}

With Ollama (local & free: no data sent outside):

# Install ollmcp (MCP client for Ollama)
pip install ollmcp

# Investigate with a local model (no data leaves your machine)
ollmcp -m qwen3.5:27b -j /tmp/ingero-mcp-vllm.json

With Claude Code (with data sent to remote models / Anthropic):

claude --mcp-config /tmp/ingero-mcp-vllm.json

Then type /investigate and let the model explore. Follow up with questions like "what was the root cause?" or "which kernel calls had the highest latency spikes?"

Ask your AI assistant: "What caused the 80% throughput drop?" or "Which process had the most context switches?" The trace data has the full story.

The investigation database from this post is available for download. Investigations were performed on a TensorDock RTX 4090 (24GB), Ubuntu 22.04, vLLM 0.17.1, with Qwen/Qwen2.5-0.5B-Instruct and prefix caching enabled.


GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.

If you are seeing vLLM issues in your own workloads, we'd love to take a look. Drop an issue on GitHub and we will gladly dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.
