Observing LLM Latency: Monitoring Time-To-First-Token in vLLM
Large language model APIs often feel fast even when generating long responses.
The key reason is token streaming: the first tokens arrive quickly while the rest of the answer is still being generated.
The most important metric for this experience is Time-To-First-Token (TTFT).
This tutorial explains:
- what TTFT is
- how vLLM exposes TTFT metrics
- how Prometheus calculates percentiles
- how to build a Grafana dashboard
What is Time-To-First-Token (TTFT)?
TTFT measures how long it takes from request arrival to the first generated token.
LLM inference pipeline:

```
Request arrives
      ↓
Prompt tokenization
      ↓
Prefill (model processes prompt)
      ↓
FIRST TOKEN GENERATED   ← TTFT measured here
      ↓
Token streaming
      ↓
Response complete
```
Users perceive the system as fast if TTFT is small.
Typical targets:
| System | Good TTFT |
|---|---|
| Chat UI | <2s |
| GPU inference | <4s |
| Heavy batching | <8s |
vLLM TTFT Metric
vLLM exports TTFT as a Prometheus histogram named `vllm:time_to_first_token_seconds`; its cumulative bucket series is:

```
vllm:time_to_first_token_seconds_bucket
```
Example metrics:

```
vllm:time_to_first_token_seconds_bucket{le="1"} 1
vllm:time_to_first_token_seconds_bucket{le="2.5"} 73
vllm:time_to_first_token_seconds_bucket{le="10"} 103
vllm:time_to_first_token_seconds_bucket{le="20"} 104
```
These represent histogram buckets.
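For a quick look outside of Prometheus, these bucket lines can be pulled straight out of a `/metrics` dump. A minimal sketch; `parse_ttft_buckets` is a hypothetical helper, and the regex is simplified to match the bare example above (real vLLM series usually carry extra labels such as `model_name`):

```python
import re

def parse_ttft_buckets(metrics_text):
    # Extract (upper_bound, cumulative_count) pairs for the TTFT histogram
    # from Prometheus exposition-format text. Simplified: assumes `le` is
    # the only label on the series, as in the example above.
    pattern = re.compile(
        r'vllm:time_to_first_token_seconds_bucket\{le="([^"]+)"\}\s+(\d+)'
    )
    buckets = []
    for le, count in pattern.findall(metrics_text):
        bound = float("inf") if le == "+Inf" else float(le)
        buckets.append((bound, int(count)))
    return sorted(buckets)

sample = """\
vllm:time_to_first_token_seconds_bucket{le="1"} 1
vllm:time_to_first_token_seconds_bucket{le="2.5"} 73
vllm:time_to_first_token_seconds_bucket{le="10"} 103
vllm:time_to_first_token_seconds_bucket{le="20"} 104
"""
buckets = parse_ttft_buckets(sample)
```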
Example interpretation:
| TTFT | Requests |
|---|---|
| ≤1s | 1 |
| ≤2.5s | 73 |
| ≤10s | 103 |
| ≤20s | 104 |
Buckets are cumulative counters: each bucket's count includes every request that also fell into the lower buckets, which is why the values only grow as `le` increases.
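From these cumulative counts, Prometheus estimates a percentile by finding the bucket that contains the target rank and interpolating linearly inside it. A simplified Python sketch of that idea (the function name and the handling of the final bucket are my own simplifications of what `histogram_quantile()` does):

```python
def histogram_quantile(q, buckets):
    # buckets: sorted (upper_bound, cumulative_count) pairs
    total = buckets[-1][1]
    rank = q * total                      # target cumulative count
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:       # empty bucket: nothing to interpolate
                return bound
            # linear interpolation inside the bucket containing the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# The example data from the table above:
buckets = [(1.0, 1), (2.5, 73), (10.0, 103), (20.0, 104)]
p95 = histogram_quantile(0.95, buckets)   # rank 98.8 falls in the 2.5s-10s bucket
```

In production the same calculation runs inside Prometheus itself, e.g. `histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket[5m]))`.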
Common Causes of High TTFT
If TTFT increases, check:

- GPU saturation: too many requests are queued waiting for GPU time.
- Large prompts: the prefill phase takes longer for long prompts.
- Batch scheduling delay: large batch sizes increase the time a request waits to be scheduled.
- KV cache pressure: when the KV cache fills up, new requests must wait (or running ones are preempted), delaying prefill.
Recommended vLLM Observability Metrics
Monitor these together:
| Metric | Meaning |
|---|---|
| TTFT | time until response starts |
| tokens/sec | generation speed |
| request latency | full response time |
| active requests | queue pressure |
| GPU utilization | hardware bottleneck |
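Since TTFT is just the delay before the first streamed chunk, it can also be measured client-side as a sanity check against the server-side histogram. A minimal sketch; `ttft_from_stream` is a hypothetical helper, and `fake_stream` stands in for a real streaming response (e.g. an OpenAI-compatible streaming client pointed at vLLM):

```python
import time

def ttft_from_stream(chunks):
    # chunks: any iterable yielding generated text chunks.
    # Returns (seconds until the first chunk arrived, full joined text).
    start = time.monotonic()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.monotonic() - start
        parts.append(chunk)
    return ttft, "".join(parts)

def fake_stream():
    # Simulated stream: first token after ~50 ms, the rest immediately.
    time.sleep(0.05)
    yield "Hello"
    yield ", world"

ttft, text = ttft_from_stream(fake_stream())
```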
Summary
Monitoring TTFT helps maintain a responsive LLM system.
Pipeline:

```
vLLM metrics
      ↓
Prometheus histogram
      ↓
PromQL percentile query
      ↓
Grafana visualization
```
With proper monitoring, you can detect:
- GPU saturation
- batching issues
- prompt bottlenecks
before users notice degraded performance.