Observing LLM Latency: Monitoring Time-To-First-Token in vLLM
Large language model APIs often feel fast even when generating long responses.
The key reason is token streaming: the first tokens arrive quickly while the rest of the answer is still being generated.
The most important metric for this experience is Time-To-First-Token (TTFT).
This tutorial explains:
- what TTFT is
- how vLLM exposes TTFT metrics
- how Prometheus calculates percentiles
- how to build a Grafana dashboard
What is Time-To-First-Token (TTFT)?
TTFT measures how long it takes from request arrival to the first generated token.
LLM inference pipeline:

```
Request arrives
      ↓
Prompt tokenization
      ↓
Prefill (model processes prompt)
      ↓
FIRST TOKEN GENERATED   ← TTFT measured here
      ↓
Token streaming
      ↓
Response complete
```
Users perceive the system as fast if TTFT is small.
Typical targets:
| System | Good TTFT |
|---|---|
| Chat UI | <2s |
| GPU inference | <4s |
| Heavy batching | <8s |
vLLM TTFT Metric
vLLM exports TTFT as a Prometheus histogram named `vllm:time_to_first_token_seconds`; its cumulative bucket series is:

```
vllm:time_to_first_token_seconds_bucket
```
Example metrics:

```
vllm:time_to_first_token_seconds_bucket{le="1"} 1
vllm:time_to_first_token_seconds_bucket{le="2.5"} 73
vllm:time_to_first_token_seconds_bucket{le="10"} 103
vllm:time_to_first_token_seconds_bucket{le="20"} 104
```
These represent histogram buckets.
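For a quick look outside of Prometheus, these bucket lines can be pulled straight out of a `/metrics` dump. A minimal sketch; `parse_ttft_buckets` is a hypothetical helper, and the regex is simplified to match the bare example above (real vLLM series usually carry extra labels such as `model_name`):

```python
import re

def parse_ttft_buckets(metrics_text):
    # Extract (upper_bound, cumulative_count) pairs for the TTFT histogram
    # from Prometheus exposition-format text. Simplified: assumes `le` is
    # the only label on the series, as in the example above.
    pattern = re.compile(
        r'vllm:time_to_first_token_seconds_bucket\{le="([^"]+)"\}\s+(\d+)'
    )
    buckets = []
    for le, count in pattern.findall(metrics_text):
        bound = float("inf") if le == "+Inf" else float(le)
        buckets.append((bound, int(count)))
    return sorted(buckets)

sample = """\
vllm:time_to_first_token_seconds_bucket{le="1"} 1
vllm:time_to_first_token_seconds_bucket{le="2.5"} 73
vllm:time_to_first_token_seconds_bucket{le="10"} 103
vllm:time_to_first_token_seconds_bucket{le="20"} 104
"""
buckets = parse_ttft_buckets(sample)
```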
Example interpretation:
| TTFT | Requests |
|---|---|
| ≤1s | 1 |
| ≤2.5s | 73 |
| ≤10s | 103 |
| ≤20s | 104 |
Buckets are cumulative counters: each bucket's count includes every request that also fell into the lower buckets, which is why the values only grow as `le` increases.
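From these cumulative counts, Prometheus estimates a percentile by finding the bucket that contains the target rank and interpolating linearly inside it. A simplified Python sketch of that idea (the function name and the handling of the final bucket are my own simplifications of what `histogram_quantile()` does):

```python
def histogram_quantile(q, buckets):
    # buckets: sorted (upper_bound, cumulative_count) pairs
    total = buckets[-1][1]
    rank = q * total                      # target cumulative count
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:       # empty bucket: nothing to interpolate
                return bound
            # linear interpolation inside the bucket containing the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# The example data from the table above:
buckets = [(1.0, 1), (2.5, 73), (10.0, 103), (20.0, 104)]
p95 = histogram_quantile(0.95, buckets)   # rank 98.8 falls in the 2.5s-10s bucket
```

In production the same calculation runs inside Prometheus itself, e.g. `histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket[5m]))`.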
Common Causes of High TTFT
If TTFT increases, check:

- GPU saturation: too many requests are queued waiting for GPU time.
- Large prompts: the prefill phase takes longer for long prompts.
- Batch scheduling delay: large batch sizes increase the time a request waits to be scheduled.
- KV cache pressure: when the KV cache fills up, new requests must wait (or running ones are preempted), delaying prefill.
Recommended vLLM Observability Metrics
Monitor these together:
| Metric | Meaning |
|---|---|
| TTFT | time until response starts |
| tokens/sec | generation speed |
| request latency | full response time |
| active requests | queue pressure |
| GPU utilization | hardware bottleneck |
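Since TTFT is just the delay before the first streamed chunk, it can also be measured client-side as a sanity check against the server-side histogram. A minimal sketch; `ttft_from_stream` is a hypothetical helper, and `fake_stream` stands in for a real streaming response (e.g. an OpenAI-compatible streaming client pointed at vLLM):

```python
import time

def ttft_from_stream(chunks):
    # chunks: any iterable yielding generated text chunks.
    # Returns (seconds until the first chunk arrived, full joined text).
    start = time.monotonic()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.monotonic() - start
        parts.append(chunk)
    return ttft, "".join(parts)

def fake_stream():
    # Simulated stream: first token after ~50 ms, the rest immediately.
    time.sleep(0.05)
    yield "Hello"
    yield ", world"

ttft, text = ttft_from_stream(fake_stream())
```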
Summary
Monitoring TTFT helps maintain a responsive LLM system.
Pipeline:

```
vLLM metrics
      ↓
Prometheus histogram
      ↓
PromQL percentile query
      ↓
Grafana visualization
```
With proper monitoring, you can detect:
- GPU saturation
- batching issues
- prompt bottlenecks
before users notice degraded performance.