DEV Community

iapilgrim

vLLM Request Lifecycle (Where TTFT is measured)

Observing LLM Latency: Monitoring Time-To-First-Token in vLLM

Large language model APIs often feel fast even when they are generating long responses.
The key reason is token streaming: the first part of the response arrives quickly while the rest of the answer is still being generated.

The most important metric for this experience is Time-To-First-Token (TTFT).

This tutorial explains:

  • what TTFT is
  • how vLLM exposes TTFT metrics
  • how Prometheus calculates percentiles
  • how to visualize the request lifecycle in a Grafana dashboard

What is Time-To-First-Token (TTFT)?

TTFT measures how long it takes from request arrival to the first generated token.

LLM inference pipeline:

Request arrives
      ↓
Prompt tokenization
      ↓
Prefill (model processes prompt)
      ↓
FIRST TOKEN GENERATED  ← TTFT measured here
      ↓
Token streaming
      ↓
Response complete

Users perceive the system as fast if TTFT is small.
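TTFT can also be sanity-checked from the client side by timing how long a streaming response takes to yield its first token. A minimal sketch; `fake_stream` is a stand-in for a real vLLM streaming response:

```python
import time

def measure_ttft(token_stream):
    """Return (ttft_seconds, tokens) for an iterable that yields tokens."""
    start = time.monotonic()
    ttft = None
    tokens = []
    for token in token_stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
        tokens.append(token)
    return ttft, tokens

def fake_stream():
    # Stand-in for a streaming LLM response: prefill delay, then tokens.
    time.sleep(0.05)
    for t in ["Hello", ",", " world"]:
        yield t

ttft, tokens = measure_ttft(fake_stream())
print(f"TTFT: {ttft:.3f}s, {len(tokens)} tokens")
```

Client-side numbers include network latency on top of the server-side TTFT that vLLM reports, so expect them to run slightly higher.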

Typical targets:

System           Good TTFT
Chat UI          < 2 s
GPU inference    < 4 s
Heavy batching   < 8 s

vLLM TTFT Metric

vLLM exports a Prometheus histogram metric:

vllm:time_to_first_token_seconds_bucket

Example metrics:

vllm:time_to_first_token_seconds_bucket{le="1"} 1
vllm:time_to_first_token_seconds_bucket{le="2.5"} 73
vllm:time_to_first_token_seconds_bucket{le="10"} 103
vllm:time_to_first_token_seconds_bucket{le="20"} 104

These represent histogram buckets.

Example interpretation:

TTFT      Requests
≤ 1 s     1
≤ 2.5 s   73
≤ 10 s    103
≤ 20 s    104

Buckets are cumulative counters.
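This cumulative layout is what Prometheus's `histogram_quantile()` works from: find the first bucket whose cumulative count reaches the target rank, then interpolate linearly inside that bucket. A simplified Python re-implementation using the bucket counts above (in production, Prometheus applies this to `rate()`-d buckets rather than raw counters):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets,
    mirroring Prometheus's linear interpolation within a bucket."""
    total = buckets[-1][1]
    rank = q * total                     # target cumulative count
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            # Interpolate linearly between the bucket's bounds.
            return prev_le + (rank - prev_count) / (count - prev_count) * (le - prev_le)
        prev_le, prev_count = le, count
    return buckets[-1][0]

# Buckets from the example above: (upper bound `le`, cumulative count)
buckets = [(1, 1), (2.5, 73), (10, 103), (20, 104)]
print(round(histogram_quantile(0.95, buckets), 2))  # → 8.95
```

For these buckets, p95 is about 8.95 s: the 98.8th request falls inside the wide (2.5 s, 10 s] bucket, which also shows why coarse bucket boundaries blunt percentile accuracy.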

Common Causes of High TTFT

If TTFT increases, check:

GPU saturation

Too many requests in queue.

Large prompts

Prefill phase takes longer.

Batch scheduling delay

Large batch sizes increase wait time.

KV cache limits

When the KV cache fills up, vLLM can preempt and recompute requests, and prefix-cache misses force a full prefill; both delay the first token.
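Queue pressure can be spot-checked straight from vLLM's /metrics endpoint. Below is a small parser sketch for the Prometheus text format; the gauge name `vllm:num_requests_waiting` matches recent vLLM versions, but confirm it against your own /metrics output:

```python
def first_sample(exposition: str, name: str):
    """Return the first sample value for metric `name` in Prometheus text format."""
    for line in exposition.splitlines():
        if line.startswith(name):
            try:
                return float(line.rsplit(None, 1)[1])
            except (IndexError, ValueError):
                continue
    return None

# Example scrape output (abridged):
scrape = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="llama"} 12.0
"""
print(first_sample(scrape, "vllm:num_requests_waiting"))  # → 12.0
```

A persistently non-zero waiting count while TTFT climbs points at batch-scheduling delay or GPU saturation rather than slow prefill.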


Recommended vLLM Observability Metrics

Monitor these together:

Metric            Meaning
TTFT              time until the response starts
tokens/sec        generation speed
request latency   full response time
active requests   queue pressure
GPU utilization   hardware bottleneck

Summary

Monitoring TTFT helps maintain a responsive LLM system.

Pipeline:

vLLM metrics
     ↓
Prometheus histogram
     ↓
PromQL percentile query
     ↓
Grafana visualization
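The PromQL stage of this pipeline boils down to one `histogram_quantile()` over the bucket metric. A query sketch to paste into a Grafana panel, kept here as a Python string; the 5-minute rate window is an assumption to tune for your scrape interval:

```python
# p95 TTFT across all scraped instances over a 5-minute window.
P95_TTFT = (
    "histogram_quantile(0.95, "
    "sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))"
)
print(P95_TTFT)
```

Swap 0.95 for 0.50 to plot the median alongside it, and set the Grafana panel unit to seconds.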

With proper monitoring, you can detect:

  • GPU saturation
  • batching issues
  • prompt bottlenecks

before users notice degraded performance.
